Regular expressions (re or regex)

Daniel Moser

Feb 15th, 2017

2nd IAG Python Boot Camp

Table of contents

Preliminaries
Basics
Examples
- From Wikipedia:
- Others
Online testers
Python
- regex Python tip
- Python examples
Good references
Exercise

Preliminaries

Old-school Pythonista	Trendy Pythonista
Follows PEP8	Style is whatever is in fashion
Uses `docstrings`	Comments with #
Uses a text editor	Uses PyCharm
Tests code in `ipython`	Uses Jupyter/Notebook
Codes in OOP	Codes procedurally
Uses `regex`	Manipulates strings as `str`
Publishes code on `PyPI`	Publishes code on GitHub

Basics

A regular expression is a sequence of characters that defines a search pattern. In other words, it is a specific textual syntax for representing patterns that a matching text needs to conform to.

Must read: Paulo Penteado's talk Processing strings (PDF)

And... read the docs!

regex treats the backslash "\" as an escape character that introduces special sequences (examples: \n, \t).
. any character except a newline. If the DOTALL flag has been specified, this matches any character, including a newline.
^ beginning of the string. In Python MULTILINE mode, also the beginning of each line.
$ end of the string (or just before a trailing newline). In Python MULTILINE mode, also the end of each line.
* zero or more occurrences
+ one or more occurrences
? 0 or 1 occurrence
{m} exactly m repetitions of the preceding pattern
{m,} m or more repetitions of the preceding pattern
{m,n} between m and n repetitions of the preceding pattern
{,n} n or fewer repetitions of the preceding pattern
\w matches any word character (alphanumeric or underscore); with the ASCII flag, this is equivalent to the class [a-zA-Z0-9_].
\W matches any non-word character; with the ASCII flag, this is equivalent to the class [^a-zA-Z0-9_].
() defines a group
(?P<id>...) names the group id
... and many, many more.
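
The symbols listed above can be tried directly with Python's built-in re module. The following is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

# Quantifiers apply to the item that precedes them.
exactly_three = re.fullmatch(r"a{3}", "aaa") is not None
two_or_more = re.fullmatch(r"a{2,}", "aaaa") is not None
two_to_three = re.fullmatch(r"a{2,3}", "aaaa") is not None  # four is too many
up_to_two = re.fullmatch(r"a{,2}", "") is not None          # zero is allowed

# ^ matches at the start of every line only in MULTILINE mode.
lines = "first\nsecond"  # made-up two-line input
starts_default = re.findall(r"^\w+", lines)
starts_multiline = re.findall(r"^\w+", lines, flags=re.MULTILINE)

# Named groups are retrieved by name instead of by position.
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "Boot camp: 2017-02")
year, month = m.group("year"), m.group("month")

print(exactly_three, two_or_more, two_to_three, up_to_two)
print(starts_default, starts_multiline)
print(year, month)
```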

Examples

From Wikipedia:

Text to work on:

"at", "bat", "cat", "hat", "[rat]", "dog";
"at", "ccat", "chat", "hcat", "hhat", "s", "saw", "seed".

Regex:

.at matches any three-character string ending with "at", including "hat", "cat", and "bat".
[hc]at matches "hat" and "cat".
[^b]at matches all strings matched by .at except "bat".
[^hc]at matches all strings matched by .at other than "hat" and "cat".
^[hc]at matches "hat" and "cat", but only at the beginning of the string or line.
[hc]at$ matches "hat" and "cat", but only at the end of the string or line.
\[.\] matches any single character surrounded by "[" and "]" since the brackets are escaped, for example: "[a]" and "[b]".
s.* matches s followed by zero or more characters, for example: "s" and "saw" and "seed".
[hc]+at matches "hat", "cat", "hhat", "chat", "hcat", "cchchat", and so on, but not "at".
[hc]?at matches "hat", "cat", and "at".
[hc]*at matches "hat", "cat", "hhat", "chat", "hcat", "cchchat", "at", and so on.
cat|dog matches "cat" or "dog".
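
The claims above can be checked against the word list with re.fullmatch, which requires the whole word to match. This is a small sketch; the helper name full_matches is hypothetical, and the duplicated "at" from the text is listed only once.

```python
import re

words = ["at", "bat", "cat", "hat", "[rat]", "dog",
         "ccat", "chat", "hcat", "hhat", "s", "saw", "seed"]

def full_matches(pattern):
    """Return the words that the pattern matches in their entirety."""
    return [w for w in words if re.fullmatch(pattern, w)]

dot_at = full_matches(r".at")
hc_at = full_matches(r"[hc]at")
not_b_at = full_matches(r"[^b]at")
hc_plus = full_matches(r"[hc]+at")
hc_opt = full_matches(r"[hc]?at")
hc_star = full_matches(r"[hc]*at")
s_any = full_matches(r"s.*")
cat_or_dog = full_matches(r"cat|dog")

for name, result in [(".at", dot_at), ("[hc]+at", hc_plus), ("[hc]*at", hc_star)]:
    print(name, "->", result)
```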

Others

[^\s]+ matches a run of non-whitespace characters, i.e. a word up to the next whitespace character (space, tab, newline). It is equivalent to \S+.
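
A short sketch of this pattern in Python, using a made-up sample sentence: re.search returns the first word, while re.findall returns all of them.

```python
import re

# Made-up input containing both a tab and spaces as separators.
sentence = "Readability counts.\tSpecial cases aren't special enough"

first_word = re.search(r"[^\s]+", sentence).group()
all_words = re.findall(r"[^\s]+", sentence)
same_with_S = re.findall(r"\S+", sentence)  # \S is shorthand for [^\s]

print(first_word)
print(all_words)
```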

Online testers

Choose one (or several)!!

Python

re: built-in regex module
regex: third-party regex module (a few more features)

regex Python tip

By default, the . (dot) in Python's re module does not match newline characters.

To make it match them as well, enable the flag re.DOTALL. Example:

outgroups = re.findall(rule, string, flags=re.DOTALL)

The re.DOTALL flag tells Python to make the '.' (dot) special character match all characters, including newline characters. This is very important when working with multi-line strings.
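
The difference is easy to see with a small sketch. The input string and the rule below are made up for illustration.

```python
import re

string = "<start>line one\nline two<end>"  # made-up multi-line input
rule = r"<start>(.*)<end>"

# Without the flag, the dot stops at the newline, so nothing matches.
without_flag = re.findall(rule, string)

# With re.DOTALL, the dot also consumes the newline.
with_flag = re.findall(rule, string, flags=re.DOTALL)

print(without_flag)
print(with_flag)
```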

Python examples

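import re

# Placeholder input; replace it with the text you want to search.
subject = "some text containing the regex pattern"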

"""Rapid `regex` test. Output: True/False"""

if re.search("regex pattern", subject):
    print('Pattern found!')
else:
    print('Pattern not found!')

# To use the regular expression multiple times:
re_obj = re.compile("regex pattern")
if re_obj.search(subject):
    print('Pattern found!')
else:
    print('Pattern not found!')

"""Split example"""

regex = re.compile(r'\W+')
out = regex.split('This is a test, short and sweet, of split().')
print(out)

"""Substitution example"""

def start_case_words(s):
    """Put a string in Start Case.

    It can be vectorized with numpy: ``vecstart = np.vectorize(start_case_words)``
    """
    return re.sub(r'\w+', lambda m: m.group(0).capitalize(), s)

out = start_case_words('This is a test, short and sweet, of split().')
print(out)

"""Retrieving the matched text"""

match_obj = re.search("regex pattern", subject)
if match_obj:
    result = match_obj.group()
else:
    result = ""  # or None

# To use the regular expression multiple times:
re_obj = re.compile("regex pattern")
match_obj = re_obj.search(subject)
if match_obj:
    result = match_obj.group()
else:
    result = ""  # or None

"""All matches examples"""

rule = r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)'

regex = re.compile(rule, re.MULTILINE)
matches0 = []
for m in regex.finditer(text):
    matches0.append(m.groups())

# for m in matches0:
#     print('Name: %s\nSequence:%s' % (m[0], m[1]))

# Another way (list comprehension):
regex = re.compile(rule, re.MULTILINE)
matches1 = [m.groups() for m in regex.finditer(text)]

# Another:
matches3 = re.compile(rule, re.MULTILINE).findall(text)

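# Another way (MUCH better). The MULTILINE flag is still needed so that
# ^ matches at the beginning of every line, not only at the start of text:
matches2 = re.findall(rule, text, flags=re.MULTILINE)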
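
The "all matches" examples assume a variable text that is not shown. The sketch below fills it with a made-up FASTA-like input (an assumption, not the original data) to show what the rule captures and why the MULTILINE flag matters.

```python
import re

# Made-up FASTA-like input: a header line starting with ">" followed by
# one or more lines of uppercase sequence letters.
text = """>seq1 first example
ACGT
TTGA
>seq2 second example
GGCC
"""

rule = r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)'

# With MULTILINE, ^ matches at every line start, so both records are found.
with_multiline = re.findall(rule, text, flags=re.MULTILINE)

# Without it, ^ matches only at the very start of the text.
without_multiline = re.findall(rule, text)

for name, sequence in with_multiline:
    # Remove the line breaks captured inside the sequence group.
    print(name, re.sub(r'[\n\r]', '', sequence))
```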

Good references

Exercise

From the text below:

1. Retrieve all lines that contain the word "better".
2. Count the length of each sentence (in words).

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

"""Solution from J. Trevisan """

import re

t = """..."""

# Item 2: count the words (runs of non-whitespace characters) in each line.
lineslen = [len(re.findall(r"[^\s]+", line)) for line in t.split("\n")]
print(lineslen)
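
A minimal sketch covering both items of the exercise, using only a short excerpt of the text to keep it small. The variable names and the use of \b (word boundary) are choices made for this illustration, not part of the original solution.

```python
import re

# Short excerpt of the exercise text.
t = """Beautiful is better than ugly.
Readability counts.
Now is better than never.
Although never is often better than *right* now."""

# Item 1: whole lines containing the word "better".
# MULTILINE makes ^ and $ anchor at each line; \b marks word boundaries.
better_lines = re.findall(r"^.*\bbetter\b.*$", t, flags=re.MULTILINE)

# Item 2: number of words (runs of non-whitespace characters) per line.
words_per_line = [len(re.findall(r"\S+", line)) for line in t.split("\n")]

print(better_lines)
print(words_per_line)
```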

Create a dictionary in which the keys are the acronyms of the USP institutes and the values are their complete names. You must use regex!

Escola de Artes, Ciências e Humanidades (EACH)
Escola de Comunicações e Artes (ECA)
Escola de Educação Física e Esporte (EEFE)
Escola de Enfermagem (EE)
Escola Politécnica (Poli)
Faculdade de Arquitetura e Urbanismo (FAU)

"""Solution from J. Trevisan """

z = """..."""

# findall returns (name, acronym) tuples; swap them to map acronym -> name.
d = {acronym: name for name, acronym in re.findall(r"(.+) \((.+)\)", z)}
print(d)
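
As an alternative sketch, the same dictionary can be built with named groups and finditer, which makes it explicit which part is the name and which is the acronym. Only three of the institutes are used here, and the group names are chosen for this illustration.

```python
import re

z = """Escola de Artes, Ciências e Humanidades (EACH)
Escola de Comunicações e Artes (ECA)
Escola Politécnica (Poli)"""

# One record per line: the full name, a space, then the acronym in parentheses.
pattern = re.compile(r"^(?P<name>.+) \((?P<acronym>[^()]+)\)$", re.MULTILINE)

d = {m.group("acronym"): m.group("name") for m in pattern.finditer(z)}
print(d)
```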