(Regular expressions)
Daniel Moser
Feb 15th, 2017
2nd IAG Python Boot Camp
Old-school Pythonista | Nutella Pythonista |
---|---|
Follows PEP8 | Style is whatever is trendy |
Uses docstrings | Comments with # |
Uses a text editor | Uses PyCharm |
Tests code in IPython | Uses Jupyter/Notebook |
Codes in OOP | Codes procedurally |
Uses regex | Manipulates strings as str |
Publishes code on PyPI | Publishes code on GitHub |
A regular expression is a sequence of characters that defines a search pattern. In other words, it is a specific textual syntax for representing patterns that matching text needs to conform to.
Must read: Paulo Penteado's talk Processing strings (PDF)
Text to work on:
"at", "bat", "cat", "hat", "[rat]", "dog"; "at", "ccat", "chat", "hcat", "hhat", "s", "saw", "seed".
Regex:
Choose one (or several)!!
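The pattern list for this slide is not reproduced here, so the patterns below are illustrative assumptions rather than the slide's own. A minimal sketch of trying several patterns against the word list, using `re.fullmatch` so that a word counts only when the pattern covers it entirely:

```python
import re

# Word list from the slide above.
words = ["at", "bat", "cat", "hat", "[rat]", "dog",
         "at", "ccat", "chat", "hcat", "hhat", "s", "saw", "seed"]

# Illustrative patterns (assumptions, not the original slide's list).
patterns = [r".at", r"[hc]at", r"[^b]at", r"\[.at\]", r"[hc]+at", r"s.*"]


def matching_words(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [w for w in candidates if re.fullmatch(pattern, w)]


for p in patterns:
    print(f"{p!r:12} -> {matching_words(p, words)}")
```

Swapping `re.fullmatch` for `re.search` would also accept words that merely contain a match, which is usually the first surprise when experimenting with these patterns.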
By default, the . (dot) in Python's re module matches any character except a newline.
To make it match newlines as well, pass the flag re.DOTALL. Example:
outgroups = re.findall(rule, string, flags=re.DOTALL)
The re.DOTALL flag tells Python to make the '.' (dot) special character match all characters, including newline characters. This is very important when working with multi-line strings.
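A small sketch of the difference the flag makes. The sample string and variable names are made up for illustration:

```python
import re

# Hypothetical multi-line sample string.
text = "start\nmiddle\nend"

# Default behaviour: '.' stops at newlines, so the pattern cannot span lines.
default_match = re.search(r"start.*end", text)

# With re.DOTALL: '.' also matches '\n', so the match can cross line breaks.
dotall_match = re.search(r"start.*end", text, flags=re.DOTALL)

print(default_match)
print(dotall_match.group() if dotall_match else None)
```

Without the flag the search fails because `.*` cannot get past the first line break. With it, the match covers the whole string.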
import re
"""Rapid `regex` test. Output: True/False""" if re.search("regex pattern", subject): print('Pattern found!') else: print('Pattern not found!') # To use the regular expression multiple times: re_obj = re.compile("regex pattern") if re_obj.search(subject): print('Pattern found!') else: print('Pattern not found!')
"""Split example""" regex = re.compile(r'\W+') out = regex.split('This is a test, short and sweet, of split().') print(out)
"""Substitution example""" def start_case_words(s): """ Function to put a string in Start Case. It can by vectorized by numpy: ``vecstart = np.vectorize(start_case_words) """ return re.sub(r'\w+', lambda m:m.group(0).capitalize(), s) out = start_case_words('This is a test, short and sweet, of split().') print(out)
"""Retrieving the matched text""" match_obj = re.search("regex pattern", subject) if match_obj: result = match_obj.group() else: result = "" # or None # To use the regular expression multiple times: re_obj = re.compile("regex pattern") match_obj = re_obj.search(subject) if match_obj: result = match_obj.group() else: result = "" # or None
"""All matches examples""" rule = r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)' regex = re.compile(rule, re.MULTILINE) matches0 = [] for m in regex.finditer(text): matches0.append(m.groups()) # for m in matches0: # print 'Name: %s\nSequence:%s' % (m[0], m[1]) # Other way regex = re.compile(rule, re.MULTILINE) matches1 = [m.groups() for m in regex.finditer(text)] # Another: matches3 = re.compile(rule, re.MULTILINE).findall(text) # Other way (MUCH better): matches2 = re.findall(rule, text)
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
"""Solution from J. Trevisan """ import re t = """...""" lineslen = [len(re.findall("[^\s+]", line)) for line in t.split("\n")] print(lineslen)
Escola de Artes, Ciências e Humanidades (EACH)
Escola de Comunicações e Artes (ECA)
Escola de Educação Física e Esporte (EEFE)
Escola de Enfermagem (EE)
Escola Politécnica (Poli)
Faculdade de Arquitetura e Urbanismo (FAU)
"""Solution from J. Trevisan """ z = """...""" d = dict([reversed(x) for x in re.findall("(.+) \((.+)\)", z)]) print(d)