Programming Python (188 page)

Authors: Mark Lutz

More Pattern Examples

For more context, the next few examples present short test files that
match simple but representative pattern forms. Comments in Example 19-3
describe the operations exercised; check Table 19-1 to see which operators
are used in these patterns. If they are still confusing, try running these
tests interactively, and call group(0) instead of start() to see which
strings are being matched by the patterns.

Example 19-3. PP4E\Lang\re-basics.py

"""
literals, sets, ranges, alternatives, and escapes
all tests here print 2: offset where pattern found
"""
import re # the one to use today
pattern, string = "A.C.", "xxABCDxx"          # nonspecial chars match themselves
matchobj = re.search(pattern, string)         # '.' means any one char
if matchobj:                                  # search returns match object or None
    print(matchobj.start())                   # start is index where matched

pattobj  = re.compile("A.*C.*")               # 'R*' means zero or more Rs
matchobj = pattobj.search("xxABCDxx")         # compile returns pattern obj
if matchobj:                                  # patt.search returns match obj
    print(matchobj.start())
# selection sets
print(re.search(" *A.C[DE][D-F][^G-ZE]G\t+ ?", "..ABCDEFG\t..").start())
# alternatives: R1|R2 means R1 or R2
print(re.search("(A|X)(B|Y)(C|Z)D", "..AYCD..").start()) # test each char
print(re.search("(?:A|X)(?:B|Y)(?:C|Z)D", "..AYCD..").start()) # same, not saved
print(re.search("A|XB|YC|ZD", "..AYCD..").start()) # matches just A!
print(re.search("(A|XB|YC|ZD)YCD", "..AYCD..").start()) # just first char
# word boundaries
print(re.search(r"\bABCD", "..ABCD ").start()) # \b means word boundary
print(re.search(r"ABCD\b", "..ABCD ").start()) # use r'...' to escape '\'

Notice again that there are different ways to kick off a match with re:
by calling module search functions and by making compiled pattern objects.
In either event, you can hang on to the resulting match object or not. All
the print call statements in this script show a result of 2, the offset
where the pattern was found in the string. In the first test, for example,
A.C. matches the ABCD at offset 2 in the search string (i.e., after the
first xx):

C:\...\PP4E\Lang> python re-basics.py
2
...8 more 2s omitted...
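To follow the earlier suggestion of calling group(0) instead of start(), here is a minimal sketch that reruns a few of the patterns from Example 19-3 and records both values side by side. The tests list and the results name are just illustrative choices, not part of the book's script.

```python
import re

# (pattern, subject) pairs borrowed from the tests in Example 19-3
tests = [
    ("A.C.",       "xxABCDxx"),
    ("A.*C.*",     "xxABCDxx"),
    ("A|XB|YC|ZD", "..AYCD.."),
    (r"\bABCD",    "..ABCD "),
]

results = []
for pattern, subject in tests:
    matchobj = re.search(pattern, subject)
    # start() is the offset of the match; group(0) is the text matched
    results.append((matchobj.start(), matchobj.group(0)))
    print(matchobj.start(), repr(matchobj.group(0)))
```

Every test still reports offset 2, but the matched text differs: the greedy A.*C.* runs to the end of the string, and the ungrouped alternatives match only the A.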

Next, in Example 19-4, parts of the pattern strings enclosed in
parentheses delimit groups; the parts of the string they matched are
available after the match.

Example 19-4. PP4E\Lang\re-groups.py

"""
groups: extract substrings matched by REs in '()' parts
groups are denoted by position, but (?P<name>R) can also name them
"""
import re
patt = re.compile("A(.)B(.)C(.)") # saves 3 substrings
mobj = patt.match("A0B1C2") # each '()' is a group, 1..n
print(mobj.group(1), mobj.group(2), mobj.group(3)) # group() gives substring
patt = re.compile("A(.*)B(.*)C(.*)") # saves 3 substrings
mobj = patt.match("A000B111C222") # groups() gives all groups
print(mobj.groups())
print(re.search("(A|X)(B|Y)(C|Z)D", "..AYCD..").groups())
print(re.search("(?P<a>A|X)(?P<b>B|Y)(?P<c>C|Z)D", "..AYCD..").groupdict())
patt = re.compile(r"[\t ]*#\s*define\s*([a-z0-9_]*)\s*(.*)")
mobj = patt.search(" # define spam 1 + 2 + 3") # parts of C #define
print(mobj.groups()) # \s is whitespace

In the first test here, for instance, the three (.) groups each match a
single character, but they retain the character matched; calling group
pulls out the character matched. The second test’s (.*) groups match and
retain any number of characters. The third and fourth tests show how
alternatives can be grouped by both position and name, and the last test
matches C #define lines (more on this pattern in a moment):

C:\...\PP4E\Lang> python re-groups.py
0 1 2
('000', '111', '222')
('A', 'Y', 'C')
{'a': 'A', 'c': 'C', 'b': 'Y'}
('spam', '1 + 2 + 3')
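As a small sketch of what grouping by both position and name means in practice, the following reuses the named-group pattern from the test above and fetches the same substring three ways. The variable names are only illustrative.

```python
import re

patt = re.compile(r"(?P<a>A|X)(?P<b>B|Y)(?P<c>C|Z)D")
mobj = patt.search("..AYCD..")

by_position = mobj.group(2)        # groups are numbered 1..n, left to right
by_name     = mobj.group("b")      # a named group is also reachable by name
as_dict     = mobj.groupdict()     # name -> substring for all named groups
span_of_b   = mobj.span("b")       # start/stop offsets of one group

print(by_position, by_name, as_dict, span_of_b)
```

Named groups remain numbered, so code written against positions keeps working if names are added to a pattern later.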

Finally, besides matches and substring extraction, re also includes
tools for string replacement or substitution (see Example 19-5).

Example 19-5. PP4E\Lang\re-subst.py

"substitutions: replace occurrences of pattern in string"
import re
print(re.sub('[ABC]', '*', 'XAXAXBXBXCXC'))
print(re.sub('[ABC]_', '*', 'XA-XA_XB-XB_XC-XC_')) # alternatives char + _
print(re.sub('(.) spam', 'spam\\1', 'x spam, y spam')) # group back ref (or r'')
def mapper(matchobj):
    return 'spam' + matchobj.group(1)
print(re.sub('(.) spam', mapper, 'x spam, y spam')) # mapping function

In the first test, all characters in the set are replaced; in the
second, they must be followed by an underscore. The last two tests
illustrate more advanced group back-references and mapping functions in
the replacement. Note the \\1 required to escape \1 for Python’s string
rules; r'spam\1' would work just as well. See also the earlier interactive
tests in the section for additional substitution and splitting examples:

C:\...\PP4E\Lang> python re-subst.py
X*X*X*X*X*X*
XA-X*XB-X*XC-X*
spamx, spamy
spamx, spamy
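To check the claim that the raw-string form works just as well, this short sketch compares the spellings of the same back-reference. The \g<1> form and the count argument are standard re.sub features that the example script itself does not use; they are shown here only for comparison.

```python
import re

subject = 'x spam, y spam'

escaped   = re.sub('(.) spam', 'spam\\1', subject)     # doubled backslash
rawstring = re.sub('(.) spam', r'spam\1', subject)     # raw string, same effect
explicit  = re.sub('(.) spam', r'spam\g<1>', subject)  # unambiguous group syntax
limited   = re.sub('(.) spam', r'spam\1', subject, count=1)  # first match only

print(escaped, rawstring, explicit, limited, sep='\n')
```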

Scanning C Header Files for Patterns

To wrap up, let’s turn to a more realistic example: the script in
Example 19-6 puts these pattern operators to more practical use. It uses
regular expressions to find #define and #include lines in C header files
and extract their components. The generality of the patterns makes them
detect a variety of line formats; pattern groups (the parts in
parentheses) are used to extract matched substrings from a line after a
match.

Example 19-6. PP4E\Lang\cheader.py

"Scan C header files to extract parts of #define and #include lines"
import sys, re
pattDefine = re.compile(                               # compile to pattobj
    r'^#[\t ]*define[\t ]+(\w+)[\t ]*(.*)')            # "# define xxx yyy..."
                                                       # \w like [a-zA-Z0-9_]
pattInclude = re.compile(
    r'^#[\t ]*include[\t ]+[<"]([\w\./]+)')            # "# include <xxx>..."

def scan(fileobj):
    count = 0
    for line in fileobj:                               # scan by lines: iterator
        count += 1
        matchobj = pattDefine.match(line)              # None if match fails
        if matchobj:
            name = matchobj.group(1)                   # substrings for (...) parts
            body = matchobj.group(2)
            print(count, 'defined', name, '=', body.strip())
            continue
        matchobj = pattInclude.match(line)
        if matchobj:
            start, stop = matchobj.span(1)             # start/stop indexes of (...)
            filename = line[start:stop]                # slice out of line
            print(count, 'include', filename)          # same as matchobj.group(1)

if len(sys.argv) == 1:
    scan(sys.stdin)                                    # no args: read stdin
else:
    scan(open(sys.argv[1], 'r'))                       # arg: input filename

To test, let’s run this script on the text file in Example 19-7.

Example 19-7. PP4E\Lang\test.h

#ifndef TEST_H
#define TEST_H

#include <stdio.h>
#include <lib/spam.h>
# include "Python.h"

#define DEBUG
#define HELLO 'hello regex world'
# define SPAM 1234

#define EGGS sunny + side + up
#define ADDER(arg) 123 + arg
#endif

Notice the spaces after # in some of these lines; regular expressions
are flexible enough to account for such departures from the norm. Here is
the script at work, picking out #include and #define lines and their
parts. For each matched line, it prints the line number, the line type,
and any matched substrings:

C:\...\PP4E\Lang> python cheader.py test.h
2 defined TEST_H =
4 include stdio.h
5 include lib/spam.h
6 include Python.h
8 defined DEBUG =
9 defined HELLO = 'hello regex world'
10 defined SPAM = 1234
12 defined EGGS = sunny + side + up
13 defined ADDER = (arg) 123 + arg
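If the matches are needed as data rather than printed text, the same two patterns can drive a function that returns records. The following is only a sketch of that variation: scan_lines, the record layout, and the three-line sample are assumptions made for illustration, not part of cheader.py.

```python
import re

patt_define  = re.compile(r'^#[\t ]*define[\t ]+(\w+)[\t ]*(.*)')
patt_include = re.compile(r'^#[\t ]*include[\t ]+[<"]([\w\./]+)')

def scan_lines(lines):
    "Return (line number, kind, name, body) tuples instead of printing"
    found = []
    for count, line in enumerate(lines, start=1):      # count from line 1
        matchobj = patt_define.match(line)
        if matchobj:
            name, body = matchobj.group(1), matchobj.group(2)
            found.append((count, 'defined', name, body.strip()))
            continue
        matchobj = patt_include.match(line)
        if matchobj:
            found.append((count, 'include', matchobj.group(1), ''))
    return found

sample = '''#define DEBUG
# include "Python.h"
#define ADDER(arg) 123 + arg
'''
records = scan_lines(sample.splitlines())
for record in records:
    print(record)
```

Because scan_lines accepts any iterable of lines, it works equally on an open file, sys.stdin, or a list built in a test.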

For an additional example of regular expressions at work, see the file
pygrep1.py in the book examples package; it implements a simple
pattern-based “grep” file search utility, but was cut here for space. As
we’ll see, we can also sometimes use regular expressions to parse
information from XML and HTML text, the topics of the next section.

XML and HTML Parsing

Beyond string objects and regular expressions, Python ships with support
for parsing some specific and commonly used types of formatted text. In
particular, it provides precoded parsers for XML and HTML that we can
deploy and customize for our text processing goals.

In the XML department, Python includes
parsing support in its standard library and plays host to a
prolific XML special-interest group. XML (for eXtensible Markup Language)
is a tag-based markup language for describing many kinds of structured
data. Among other things, it has been adopted in roles such as a standard
database and Internet content representation in many contexts. As an
object-oriented scripting language, Python mixes remarkably well with
XML’s core notion of structured document interchange.

XML is based upon a tag syntax familiar to web page writers, used to
describe and package data. The xml module package in Python’s standard
library includes tools for parsing this data from XML text, with both the
SAX and the DOM standard parsing models, as well as the Python-specific
ElementTree package. Although regular expressions can sometimes extract
information from XML documents, too, they can be easily misled by
unexpected text, and don’t directly support the notion of arbitrarily
nested XML constructs (more on this limitation later when we explore
languages in general).

In short, with SAX we provide a subclass whose methods are called during
the parsing operation, and with DOM we are given access to an object tree
representing the (usually) already parsed document. SAX parsers are
essentially state machines and must record (and possibly stack) page
details as the parse progresses; DOM parsers walk object trees using
loops, attributes, and methods defined by the DOM standard. ElementTree is
roughly a Python-specific analog of DOM, and as such can often yield
simpler code; it can also be used to generate XML text from its
object-based representation.
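The generating side of ElementTree gets no example of its own here, so the following is a minimal sketch of building XML text from element objects. The catalog root name and the single-book content are illustrative assumptions; Element, SubElement, tostring, and fromstring are standard xml.etree.ElementTree calls.

```python
import xml.etree.ElementTree as ET

# Build a tiny tree of element objects (names here are illustrative)
catalog = ET.Element('catalog')
book    = ET.SubElement(catalog, 'book', isbn='0-596-00128-2')
title   = ET.SubElement(book, 'title')
title.text = 'Python & XML'                       # plain text, not escaped by us

xmltext = ET.tostring(catalog, encoding='unicode')  # str rather than bytes
print(xmltext)

# Parsing the generated text recovers the original title
roundtrip = ET.fromstring(xmltext).find('book/title').text
```

The generator escapes the ampersand on output and the parser un-escapes it on input, so the round trip preserves the title.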

Beyond these parsing tools, Python also ships with an xmlrpc package to
support the client and server sides of the XML-RPC protocol (remote
procedure calls that transmit objects encoded as XML over HTTP), as well
as a standard HTML parser, html.parser, that works on similar principles
and is presented later in this chapter. The third-party domain has even
more XML-related tools; most of these are maintained separately from
Python to allow for more flexible release schedules. Beginning with Python
2.3, the Expat parser is also included as the underlying engine that
drives the parsing process.
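To get a feel for what “objects encoded as XML” means without setting up a server, the xmlrpc.client module can encode and decode a call in memory. This is only a sketch: the method name repeat and its arguments are made up for illustration, and no network traffic is involved.

```python
import xmlrpc.client

# Encode a call to a hypothetical remote method named 'repeat'
request = xmlrpc.client.dumps((2, 'spam'), methodname='repeat')
print(request)                                   # the XML a client would send

# Decode it again, as a server would on receipt
params, method = xmlrpc.client.loads(request)
print(method, params)
```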

XML Parsing in Action

XML processing is a large, evolving topic, and it is mostly beyond the
scope of this book. For an example of a simple XML parsing task, though,
consider the XML file in Example 19-8. This file defines a handful of
O’Reilly Python books: ISBN numbers as attributes, and titles, publication
dates, and authors as nested tags (with apologies to Python books not
listed in this completely random sample; there are many!).

Example 19-8. PP4E\Lang\Xml\books.xml



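<catalog>
    <book isbn="0-596-00128-2">
        <title>Python &amp; XML</title>
        <date>December 2001</date>
        <author>Jones, Drake</author>
    </book>
    <book isbn="0-596-15810-6">
        <title>Programming Python, 4th Edition</title>
        <date>October 2010</date>
        <author>Lutz</author>
    </book>
    <book isbn="0-596-15806-8">
        <title>Learning Python, 4th Edition</title>
        <date>September 2009</date>
        <author>Lutz</author>
    </book>
    <book isbn="0-596-15808-4">
        <title>Python Pocket Reference, 4th Edition</title>
        <date>October 2009</date>
        <author>Lutz</author>
    </book>
    <book isbn="0-596-00797-3">
        <title>Python Cookbook, 2nd Edition</title>
        <date>March 2005</date>
        <author>Martelli, Ravenscroft, Ascher</author>
    </book>
    <book isbn="0-596-10046-9">
        <title>Python in a Nutshell, 2nd Edition</title>
        <date>July 2006</date>
        <author>Martelli</author>
    </book>
</catalog>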



Let’s quickly explore ways to extract this file’s book ISBN
numbers and corresponding titles by example, using each of the four
primary Python tools at our disposal—patterns, SAX, DOM, and
ElementTree.

Regular expression parsing

In some contexts, the regular expressions we met earlier can be used to
parse information from XML files. They are not complete parsers, and are
not very robust or accurate in the presence of arbitrary text (text in tag
attributes can especially throw them off). Where applicable, though, they
offer a simple option. Example 19-9 shows how we might go about parsing
the XML file in Example 19-8 with the prior section’s re module. Like all
four examples in this section, it scans the XML file looking at ISBN
numbers and associated titles, and stores the two as keys and values in a
Python dictionary.

Example 19-9. PP4E\Lang\Xml\rebook.py

"""
XML parsing: regular expressions (not robust or general)
"""
import re, pprint
text    = open('books.xml').read()                       # str if str pattern
pattern = '(?s)isbn="(.*?)".*?<title>(.*?)</title>'      # *?=nongreedy
found   = re.findall(pattern, text)                      # (?s)=dot matches \n
mapping = {isbn: title for (isbn, title) in found} # dict from tuple list
pprint.pprint(mapping)

When run, the re.findall function locates all the nested tags we’re
interested in, extracts their content, and returns a list of tuples
representing the two parenthesized groups in the pattern. Python’s pprint
module displays the dictionary created by the comprehension nicely. The
extract works, but only as long as the text doesn’t deviate from the
expected pattern in ways that would invalidate our script. Moreover, the
XML entity for “&” in the first book’s title is not un-escaped
automatically:

C:\...\PP4E\Lang\Xml> python rebook.py
{'0-596-00128-2': 'Python &amp; XML',
'0-596-00797-3': 'Python Cookbook, 2nd Edition',
'0-596-10046-9': 'Python in a Nutshell, 2nd Edition',
'0-596-15806-8': 'Learning Python, 4th Edition',
'0-596-15808-4': 'Python Pocket Reference, 4th Edition',
'0-596-15810-6': 'Programming Python, 4th Edition'}
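If we do stay with patterns, the entity can be un-escaped in a separate step. The sketch below assumes a one-book stand-in for the catalog file and uses the standard library’s xml.sax.saxutils.unescape, which the book’s script does not call; it is one possible patch rather than a substitute for a real parser.

```python
import re
from xml.sax.saxutils import unescape

# A one-book stand-in for books.xml (illustrative content only)
text = '''<catalog>
    <book isbn="0-596-00128-2">
        <title>Python &amp; XML</title>
    </book>
</catalog>'''

pattern = r'(?s)isbn="(.*?)".*?<title>(.*?)</title>'
raw   = dict(re.findall(pattern, text))                 # entity left as-is
clean = {isbn: unescape(title) for isbn, title in raw.items()}

print(raw)
print(clean)
```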

SAX parsing

To do better, Python’s full-blown XML parsing tools let us perform this
data extraction in a more accurate and robust way. Example 19-10, for
instance, defines a SAX-based parsing procedure: its class implements
callback methods that will be called during the parse, and its top-level
code creates and runs a parser.

Example 19-10. PP4E\Lang\Xml\saxbook.py

"""
XML parsing: SAX is a callback-based API for intercepting parser events
"""
import xml.sax, xml.sax.handler, pprint
class BookHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.inTitle = False                      # handle XML parser events
        self.mapping = {}                         # a state machine model

    def startElement(self, name, attributes):
        if name == "book":                        # on start book tag
            self.buffer = ""                      # save ISBN for dict key
            self.isbn = attributes["isbn"]
        elif name == "title":                     # on start title tag
            self.inTitle = True                   # save title text to follow

    def characters(self, data):
        if self.inTitle:                          # on text within tag
            self.buffer += data                   # save text if in title

    def endElement(self, name):
        if name == "title":
            self.inTitle = False                  # on end title tag
            self.mapping[self.isbn] = self.buffer # store title text in dict

parser = xml.sax.make_parser()
handler = BookHandler()
parser.setContentHandler(handler)
parser.parse('books.xml')
pprint.pprint(handler.mapping)

The SAX model is efficient, but it is potentially confusing at
first glance, because the class must keep track of where the parse
currently is using state information. For example, when the title tag
is first detected, we set a state flag and initialize a buffer; as
each character within the title tag is parsed, we append it to the
buffer until the ending portion of the title tag is encountered. The
net effect saves the title tag’s content as a string. This model is
simple, but can be complex to manage; in cases of potentially
arbitrary nesting, for instance, state information may need to be
stacked as the class receives callbacks for nested tags.
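To make the idea of stacked state concrete, here is a minimal sketch of a handler that pushes each tag name as it opens and pops it as it closes, so text can be filed under its full nesting path. The PathHandler class, its attribute names, and the inline one-book document are assumptions for illustration only.

```python
import xml.sax, xml.sax.handler

class PathHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self.stack = []                           # names of currently open tags
        self.texts = {}                           # 'a/b/c' path -> text found

    def startElement(self, name, attributes):
        self.stack.append(name)                   # entering a nested tag

    def characters(self, data):
        path = '/'.join(self.stack)               # text may arrive in pieces
        self.texts[path] = self.texts.get(path, '') + data

    def endElement(self, name):
        self.stack.pop()                          # leaving the tag

xmltext = b'<catalog><book><title>Python &amp; XML</title></book></catalog>'
handler = PathHandler()
xml.sax.parseString(xmltext, handler)

# Drop whitespace-only entries, if any were collected between tags
titles = {path: text for path, text in handler.texts.items() if text.strip()}
print(titles)
```

A single flag like inTitle is enough for the flat book catalog; the stack becomes useful once the same tag name can appear at different depths.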

To kick off the parse, we make a parser object, set its handler to an
instance of our class, and start the parse; as Python scans the XML file,
our class’s methods are called automatically as components are
encountered. When the parse is complete, we use the Python pprint module
to display the result again: the mapping dictionary object attached to our
handler. The result is mostly the same this time, but notice that the
“&amp;” escape sequence is properly un-escaped now, because SAX performs
XML parsing, not text matching:

C:\...\PP4E\Lang\Xml> python saxbook.py
{'0-596-00128-2': 'Python & XML',
'0-596-00797-3': 'Python Cookbook, 2nd Edition',
'0-596-10046-9': 'Python in a Nutshell, 2nd Edition',
'0-596-15806-8': 'Learning Python, 4th Edition',
'0-596-15808-4': 'Python Pocket Reference, 4th Edition',
'0-596-15810-6': 'Programming Python, 4th Edition'}

DOM parsing

The DOM parsing model for XML is perhaps simpler to understand (we
simply traverse a tree of objects after the parse), but it might be less
efficient for large documents, if the document is parsed all at once ahead
of time and stored in memory. DOM also supports random access to document
parts via tree fetches, nested loops for known structures, and recursive
traversals for arbitrary nesting; in SAX, we are limited to a single
linear parse. Example 19-11 is a DOM-based equivalent to the SAX parser of
the preceding section.

Example 19-11. PP4E\Lang\Xml\dombook.py

"""
XML parsing: DOM gives whole document to the application as a traversable object
"""
import pprint
import xml.dom.minidom
from xml.dom.minidom import Node

doc = xml.dom.minidom.parse("books.xml")          # load doc into object
                                                  # usually parsed up front
mapping = {}
for node in doc.getElementsByTagName("book"):     # traverse DOM object
    isbn = node.getAttribute("isbn")              # via DOM object API
    L = node.getElementsByTagName("title")
    for node2 in L:
        title = ""
        for node3 in node2.childNodes:
            if node3.nodeType == Node.TEXT_NODE:
                title += node3.data
        mapping[isbn] = title

# mapping now has the same value as in the SAX example
pprint.pprint(mapping)

The output of this script is the same as what we generated for the SAX
parser; here, though, it is built up by walking the document object tree
after the parse has finished using method calls and attributes defined by
the cross-language DOM standard specification. This is both a strength and
potential weakness of DOM: its API is language neutral, but it may seem a
bit nonintuitive and verbose to some Python programmers accustomed to
simpler models:

C:\...\PP4E\Lang\Xml> python dombook.py
{'0-596-00128-2': 'Python & XML',
'0-596-00797-3': 'Python Cookbook, 2nd Edition',
'0-596-10046-9': 'Python in a Nutshell, 2nd Edition',
'0-596-15806-8': 'Learning Python, 4th Edition',
'0-596-15808-4': 'Python Pocket Reference, 4th Edition',
'0-596-15810-6': 'Programming Python, 4th Edition'}
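The nested loops above suit a known structure; for arbitrary nesting, a recursive traversal is the usual alternative. The following is a small sketch of that idea. The collect_text function and the extra i tag inside the title are hypothetical, added only to give the recursion something nested to walk.

```python
import xml.dom.minidom
from xml.dom.minidom import Node

def collect_text(node):
    "Recursively gather the text under a node, at any nesting depth"
    if node.nodeType == Node.TEXT_NODE:
        return node.data
    return ''.join(collect_text(child) for child in node.childNodes)

doc = xml.dom.minidom.parseString(
    '<book isbn="0-596-00128-2">'
    '<title>Python &amp; <i>XML</i></title>'
    '</book>')

isbn  = doc.documentElement.getAttribute('isbn')
title = collect_text(doc.getElementsByTagName('title')[0])
print(isbn, title)
```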

ElementTree parsing

As a fourth option, the popular ElementTree package is a standard
library tool for both parsing and generating XML. As a parser, it’s
essentially a more Pythonic type of DOM: it parses documents into a tree
of objects again, but the API for navigating the tree is more lightweight,
because it’s Python-specific.

ElementTree provides easy-to-use tools for parsing, changing, and
generating XML documents. For both parsing and generating, it represents
documents as a tree of Python “element” objects. Each element in the tree
has a tag name, attribute dictionary, text value, and sequence of child
elements. The element object produced by a parse can be navigated with
normal Python loops for known structures, and with recursion where
arbitrary nesting is possible.
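As a rough sketch of these four element properties, the function below visits every element recursively and records its depth, tag, attribute dictionary, and text. The walk function and the inline one-book document are illustrative assumptions, not code from the book.

```python
import xml.etree.ElementTree as ET

def walk(element, depth=0, found=None):
    "Visit an element and all its descendants, recording what each holds"
    found = [] if found is None else found
    text = (element.text or '').strip()           # text may be None
    found.append((depth, element.tag, element.attrib, text))
    for child in element:                         # an element iterates its children
        walk(child, depth + 1, found)
    return found

root = ET.fromstring(
    '<catalog><book isbn="0-596-10046-9">'
    '<title>Python in a Nutshell, 2nd Edition</title>'
    '</book></catalog>')

visited = walk(root)
for depth, tag, attrib, text in visited:
    print('  ' * depth + tag, attrib, repr(text))
```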

The ElementTree system began its life as a third-party extension, but
it was largely incorporated into Python’s standard library as the package
xml.etree. Example 19-12 shows how to use it to parse our book catalog
file one last time.

Example 19-12. PP4E\Lang\Xml\etreebook.py

"""
XML parsing: ElementTree (etree) provides a Python-based API for parsing/generating
"""
import pprint
from xml.etree.ElementTree import parse

mapping = {}
tree = parse('books.xml')
for B in tree.findall('book'):
    isbn = B.attrib['isbn']
    for T in B.findall('title'):
        mapping[isbn] = T.text
pprint.pprint(mapping)

When run we get the exact same results as for SAX and DOM again,
but the code required to extract the file’s details seems noticeably
simpler this time around:

C:\...\PP4E\Lang\Xml> python etreebook.py
{'0-596-00128-2': 'Python & XML',
'0-596-00797-3': 'Python Cookbook, 2nd Edition',
'0-596-10046-9': 'Python in a Nutshell, 2nd Edition',
'0-596-15806-8': 'Learning Python, 4th Edition',
'0-596-15808-4': 'Python Pocket Reference, 4th Edition',
'0-596-15810-6': 'Programming Python, 4th Edition'}
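The extraction can be shortened further with a dictionary comprehension, at some cost in explicitness. This is only a sketch on an inline two-book stand-in for the catalog file; get, findtext, and iter are standard element methods that the book’s script does not use.

```python
from xml.etree.ElementTree import fromstring

# A two-book stand-in for books.xml (illustrative content only)
xmltext = '''<catalog>
    <book isbn="0-596-15806-8"><title>Learning Python, 4th Edition</title></book>
    <book isbn="0-596-15808-4"><title>Python Pocket Reference, 4th Edition</title></book>
</catalog>'''

root = fromstring(xmltext)
mapping = {book.get('isbn'): book.findtext('title')   # first title's text
           for book in root.iter('book')}             # book tags at any depth
print(mapping)
```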

Other XML topics

Naturally, there is much more to Python’s XML support than these simple
examples imply. In deference to space, though, here are pointers to XML
resources in lieu of additional examples:

Standard library

First, be sure to consult the Python library manual for more on the
standard library’s XML support tools. See the entries for re, xml.sax,
xml.dom, and xml.etree for more on this section’s examples.

PyXML SIG tools

You can also find Python XML tools and documentation at the XML Special
Interest Group (SIG) web page at http://www.python.org. This SIG is
dedicated to wedding XML technologies with Python, and it publishes free
XML tools independent of Python itself. Much of the standard library’s XML
support originated with this group’s work.

Third-party tools

You can also find free, third-party Python support tools for XML on the
Web by following links at the XML SIG’s web page. Of special interest, the
4Suite open source package provides integrated tools for XML processing,
including open technologies such as DOM, SAX, RDF, XSLT, XInclude,
XPointer, XLink, and XPath.

Documentation

A variety of books have been published that specifically address XML
and text processing in Python. O’Reilly offers a book dedicated to the
subject of XML processing in Python, Python & XML, written by Christopher
A. Jones and Fred L. Drake, Jr.

As usual, be sure to also see your favorite web search engine for more
recent developments on this front.