Regular expressions (re or regex)

(Portuguese: Expressões regulares)

Daniel Moser

Feb 15th, 2017

2nd IAG Python Boot Camp

Preliminaries

[Figure: ../figs/regex-raiznutella.png]
Old-school ("raiz") Pythonista   | Trendy ("nutella") Pythonista
Follows PEP8                     | Follows whatever style is in fashion
Uses docstrings                  | Comments with #
Uses a text editor               | Uses PyCharm
Tests code in IPython            | Uses Jupyter/Notebook
Codes in OOP                     | Codes in procedural style
Uses regex                       | Manipulates strings as str
Publishes code on PyPI           | Publishes code on GitHub

Basics

A regular expression is a sequence of characters that defines a search pattern. In other words, it is a specific textual syntax for representing patterns that a matching text needs to conform to.

Must read: Paulo Penteado's talk Processing strings (PDF)

And... Read the Docs!

Examples

From Wikipedia:

Text to search:

"at", "bat", "cat", "hat", "[rat]", "dog";
"at", "ccat", "chat", "hcat", "hhat", "s", "saw", "seed".

Regex:

  • .at matches any three-character string ending with "at", including "hat", "cat", and "bat".
  • [hc]at matches "hat" and "cat".
  • [^b]at matches all strings matched by .at except "bat".
  • [^hc]at matches all strings matched by .at other than "hat" and "cat".
  • ^[hc]at matches "hat" and "cat", but only at the beginning of the string or line.
  • [hc]at$ matches "hat" and "cat", but only at the end of the string or line.
  • \[.\] matches any single character surrounded by "[" and "]" since the brackets are escaped, for example: "[a]" and "[b]".
  • s.* matches s followed by zero or more characters, for example: "s" and "saw" and "seed".
  • [hc]+at matches "hat", "cat", "hhat", "chat", "hcat", "cchchat", and so on, but not "at".
  • [hc]?at matches "hat", "cat", and "at".
  • [hc]*at matches "hat", "cat", "hhat", "chat", "hcat", "cchchat", "at", and so on.
  • cat|dog matches "cat" or "dog".
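
The patterns above can be checked directly in Python. The following is a minimal sketch, assuming the word lists from the text; the helper name `full_matches` is hypothetical and uses `re.fullmatch`, so a pattern must cover the entire word to count as a match.

```python
import re

# Words taken from the "text to search" lists above.
words = ["at", "bat", "cat", "hat", "[rat]", "dog",
         "ccat", "chat", "hcat", "hhat", "s", "saw", "seed"]


def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    regex = re.compile(pattern)
    return [w for w in candidates if regex.fullmatch(w)]


print(full_matches(r".at", words))      # ['bat', 'cat', 'hat']
print(full_matches(r"[hc]at", words))   # ['cat', 'hat']
print(full_matches(r"[^b]at", words))   # ['cat', 'hat']
print(full_matches(r"[hc]+at", words))  # ['cat', 'hat', 'ccat', 'chat', 'hcat', 'hhat']
print(full_matches(r"[hc]?at", words))  # ['at', 'cat', 'hat']
print(full_matches(r"s.*", words))      # ['s', 'saw', 'seed']
print(full_matches(r"cat|dog", words))  # ['cat', 'dog']
```

Note that `re.search` looks for the pattern anywhere inside the string, so `re.search(r".at", "chat")` would also succeed (it finds "hat").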

Others

  • [^\s]+ matches a run of one or more non-whitespace characters, i.e. a word up to the next space, tab, or newline (equivalent to \S+).
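
As a small illustration of this pattern (the sample sentence is an assumption made for this sketch), `re.match` returns only the first word, while `re.findall` returns every whitespace-separated word:

```python
import re

sentence = "short and sweet"  # sample text, assumed for illustration

# re.match anchors at the start of the string: only the first word is returned.
first_word = re.match(r"[^\s]+", sentence).group()

# re.findall collects every non-overlapping run of non-whitespace characters.
all_words = re.findall(r"[^\s]+", sentence)

print(first_word)  # short
print(all_words)   # ['short', 'and', 'sweet']
```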

Online testers

Choose one (or several)!!

Python

regex Python tip

By default, the . (dot) in Python's re module matches any character except a newline.

To make it match newlines as well, pass the flag re.DOTALL. Example:

outgroups = re.findall(rule, string, flags=re.DOTALL)

The re.DOTALL flag tells Python to make the '.' (dot) special character match every character, including newline characters. This is very important when working with multi-line strings.
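
The effect of the flag can be seen in a short sketch. The values of `rule` and `string` below are assumptions chosen for illustration; only the `re.findall` call mirrors the line above.

```python
import re

# Assumed sample input: a tag whose content spans two lines.
string = "<p>first line\nsecond line</p>"
rule = r"<p>(.*)</p>"

# Without the flag, '.' stops at the newline, so the closing tag is never reached.
print(re.findall(rule, string))                   # []

# With re.DOTALL, '.' also matches '\n' and the whole content is captured.
print(re.findall(rule, string, flags=re.DOTALL))  # ['first line\nsecond line']
```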

Python examples

import re
"""Quick `regex` test: is the pattern present in the text?"""

subject = "some text containing the regex pattern"  # placeholder text to search

if re.search("regex pattern", subject):
    print('Pattern found!')
else:
    print('Pattern not found!')

# To use the regular expression multiple times:
re_obj = re.compile("regex pattern")
if re_obj.search(subject):
    print('Pattern found!')
else:
    print('Pattern not found!')
"""Split example"""

regex = re.compile(r'\W+')
out = regex.split('This is a test, short and sweet, of split().')
print(out)
"""Substitution example"""

def start_case_words(s):
    """Put a string in Start Case (capitalize the first letter of every word).

    It can be vectorized with numpy: ``vecstart = np.vectorize(start_case_words)``
    """
    return re.sub(r'\w+', lambda m: m.group(0).capitalize(), s)

out = start_case_words('This is a test, short and sweet, of split().')
print(out)
"""Retrieving the matched text"""

match_obj = re.search("regex pattern", subject)
if match_obj:
    result = match_obj.group()
else:
    result = ""  # or None

# To use the regular expression multiple times:
re_obj = re.compile("regex pattern")
match_obj = re_obj.search(subject)
if match_obj:
    result = match_obj.group()
else:
    result = ""  # or None
"""All matches examples"""

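# `text` is assumed to hold records made of a ">name" line followed by sequence lines
rule = r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)'

# re.MULTILINE makes ^ match at the start of every line, not only at the start of the string
regex = re.compile(rule, re.MULTILINE)
matches0 = []
for m in regex.finditer(text):
    matches0.append(m.groups())

# for m in matches0:
#     print('Name: %s\nSequence: %s' % (m[0], m[1]))

# Another way: list comprehension
regex = re.compile(rule, re.MULTILINE)
matches1 = [m.groups() for m in regex.finditer(text)]

# Another: compile and call findall in a single expression
matches3 = re.compile(rule, re.MULTILINE).findall(text)

# Shortest way (the re.MULTILINE flag is still required, otherwise only the first record matches):
matches2 = re.findall(rule, text, flags=re.MULTILINE)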
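
To see what these calls return, here is a minimal runnable sketch. The sample `text` is a hypothetical two-record input invented for illustration (the original slides do not show the data), and the `records` dictionary is an extra step that is not part of the examples above.

```python
import re

# Hypothetical sample: each record is a ">name" line followed by sequence lines.
text = ">seq1 first record\nACGT\nTTGA\n>seq2 second record\nGGCC\n"

rule = r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)'

# With re.MULTILINE, ^ matches at the start of every line, so both records are found.
with_flag = re.findall(rule, text, flags=re.MULTILINE)

# Without the flag, ^ matches only at the very start of the string.
without_flag = re.findall(rule, text)

# Join the sequence lines of each record by dropping the line breaks.
records = {name: "".join(seq.split()) for name, seq in with_flag}

print(len(with_flag), len(without_flag))  # 2 1
print(records)  # {'seq1 first record': 'ACGTTTGA', 'seq2 second record': 'GGCC'}
```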

Good references

Exercise

  1. From the text below, count the non-whitespace characters in each line:
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
"""Solution from J. Trevisan """

import re

t = """..."""

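# Count the non-whitespace characters in each line (a "+" inside [...] would be a literal plus sign)
lineslen = [len(re.findall(r"[^\s]", line)) for line in t.split("\n")]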
print(lineslen)
  2. Create a dictionary in which the keys are the acronyms of the USP institutes and the values are their full names. You must use regex!
Escola de Artes, Ciências e Humanidades (EACH)
Escola de Comunicações e Artes (ECA)
Escola de Educação Física e Esporte (EEFE)
Escola de Enfermagem (EE)
Escola Politécnica (Poli)
Faculdade de Arquitetura e Urbanismo (FAU)
"""Solution from J. Trevisan """

z = """..."""

# findall returns (name, acronym) tuples; reversed() turns each into an (acronym, name) pair
d = dict([reversed(x) for x in re.findall(r"(.+) \((.+)\)", z)])
print(d)
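
For reference, a self-contained variant of this solution is sketched below, using the institute list from the exercise. The anchored pattern, the `re.MULTILINE` flag, and the names `pairs` and `institutes` are choices made for this sketch, not part of the original solution.

```python
import re

z = """Escola de Artes, Ciências e Humanidades (EACH)
Escola de Comunicações e Artes (ECA)
Escola de Educação Física e Esporte (EEFE)
Escola de Enfermagem (EE)
Escola Politécnica (Poli)
Faculdade de Arquitetura e Urbanismo (FAU)"""

# One match per line: group 1 is the full name, group 2 the acronym in parentheses.
pairs = re.findall(r"^(.+) \((\w+)\)$", z, flags=re.MULTILINE)

# Swap the order so that the acronym becomes the key.
institutes = {acronym: name for name, acronym in pairs}

print(len(institutes))     # 6
print(institutes["Poli"])  # Escola Politécnica
```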