Regular expressions (re ou regex)
(Expressões regulares)
Daniel Moser
Feb 15th, 2017
2nd IAG Python Boot Camp
Preliminaries
Pythonista raiz |
Pythonista nutella |
|---|---|
Segue PEP8 |
Estilo é o do momento |
Usa docstrings |
Comenta com # |
Usa editor de texto |
Usa PyCharm |
Testa código no ipython |
Usa Jupyter/Notebook |
Codifica em OOP |
Codifica em procedural |
Usa regex |
Manipula strings como str |
Publica código no PyPi |
Publica código no github |
Basics
A regular expression is a sequence of characters that define a search pattern. In other words, is a specific textual syntax for representing patterns that a matching text need to conform to.
Must read: Paulo Penteado's talk Processing strings (PDF)
- And.. Read the Docs!
regex recognizes special characters with "\" (example: \n, \t).
. any character, but new line. If the DOTALL flag has been specified, this matches any character including a newline.
^ beginning of a string. In Python MULTILINE mode, the beginning of a line
$ end of a string or before the end of a line
* many occurrences
+ one or more occurrences
? 0 or 1 occurrence
{m} the exact pattern m-times
{m,} the exact pattern m or more times
{m,n} the exact pattern between m and n times
{,n} the exact pattern n or less times
\w Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
\W Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
() Defines a group
(?P<id>) name the group with id
... many, many more.
Examples
From Wikipedia:
Text to be working over:
"at", "bat", "cat", "hat", "[rat]", "dog"; "at", "ccat", "chat", "hcat", "hhat", "s", "saw", "seed".
Regex:
.at matches any three-character string ending with "at", including "hat", "cat", and "bat".
[hc]at matches "hat" and "cat".
[^b]at matches all strings matched by .at except "bat".
[^hc]at matches all strings matched by .at other than "hat" and "cat".
^[hc]at matches "hat" and "cat", but only at the beginning of the string or line.
[hc]at$ matches "hat" and "cat", but only at the end of the string or line.
\[.\] matches any single character surrounded by "[" and "]" since the brackets are escaped, for example: "[a]" and "[b]".
s.* matches s followed by zero or more characters, for example: "s" and "saw" and "seed".
[hc]+at matches "hat", "cat", "hhat", "chat", "hcat", "cchchat", and so on, but not "at".
[hc]?at matches "hat", "cat", and "at".
[hc]*at matches "hat", "cat", "hhat", "chat", "hcat", "cchchat", "at", and so on.
cat|dog matches "cat" or "dog".
Others
[^\s]+ returns a word until the first space/empty character.
Online testers
Choose one (or several)!!
Python
re: built-in regex module
regex: third-part regex module (a bit more features)
regex Python tip
The . (dot) doesn't have the original regex meaning with the default re in Python.
So, we need to enable it using the flag re.DOTALL. Example:
outgroups = re.findall(rule, string, flags=re.DOTALL)
The re.DOTALL flag tells python to make the '.'' (dot) special character match all characters, including newline characters. This is very important when working with multi-line strings.
Python examples
import re
"""Rapid `regex` test. Output: True/False"""
if re.search("regex pattern", subject):
print('Pattern found!')
else:
print('Pattern not found!')
# To use the regular expression multiple times:
re_obj = re.compile("regex pattern")
if re_obj.search(subject):
print('Pattern found!')
else:
print('Pattern not found!')
"""Split example"""
regex = re.compile(r'\W+')
out = regex.split('This is a test, short and sweet, of split().')
print(out)
"""Substitution example"""
def start_case_words(s):
""" Function to put a string in Start Case.
It can by vectorized by numpy: ``vecstart = np.vectorize(start_case_words) """
return re.sub(r'\w+', lambda m:m.group(0).capitalize(), s)
out = start_case_words('This is a test, short and sweet, of split().')
print(out)
"""Retrieving the matched text"""
match_obj = re.search("regex pattern", subject)
if match_obj:
result = match_obj.group()
else:
result = "" # or None
# To use the regular expression multiple times:
re_obj = re.compile("regex pattern")
match_obj = re_obj.search(subject)
if match_obj:
result = match_obj.group()
else:
result = "" # or None
"""All matches examples"""
rule = r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)'
regex = re.compile(rule, re.MULTILINE)
matches0 = []
for m in regex.finditer(text):
matches0.append(m.groups())
# for m in matches0:
# print 'Name: %s\nSequence:%s' % (m[0], m[1])
# Other way
regex = re.compile(rule, re.MULTILINE)
matches1 = [m.groups() for m in regex.finditer(text)]
# Another:
matches3 = re.compile(rule, re.MULTILINE).findall(text)
# Other way (MUCH better):
matches2 = re.findall(rule, text)
Good references
Exercise
From the text below:
Retrieve all lines that contains the word "better".
Count the length of each sentence (in words).
Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Flat is better than nested. Sparse is better than dense. Readability counts. Special cases aren't special enough to break the rules. Although practicality beats purity. Errors should never pass silently. Unless explicitly silenced. In the face of ambiguity, refuse the temptation to guess. There should be one-- and preferably only one --obvious way to do it. Although that way may not be obvious at first unless you're Dutch. Now is better than never. Although never is often better than *right* now. If the implementation is hard to explain, it's a bad idea. If the implementation is easy to explain, it may be a good idea. Namespaces are one honking great idea -- let's do more of those!
"""Solution from J. Trevisan """
import re
t = """..."""
lineslen = [len(re.findall("[^\s+]", line)) for line in t.split("\n")]
print(lineslen)
Create a dictionary in which the keys are the acronyms of the USP institutes and the values the complete name. You must use regex!
Escola de Artes, Ciências e Humanidades (EACH) Escola de Comunicações e Artes (ECA) Escola de Educação Física e Esporte (EEFE) Escola de Enfermagem (EE) Escola Politécnica (Poli) Faculdade de Arquitetura e Urbanismo (FAU)
"""Solution from J. Trevisan """
z = """..."""
d = dict([reversed(x) for x in re.findall("(.+) \((.+)\)", z)])
print(d)