Programming Python (24 page)

Authors: Mark Lutz

File Scanners

Before we leave our file tools survey, it’s time for something that performs a more tangible task and illustrates some of what we’ve learned so far. Unlike some shell-tool languages, Python doesn’t have an implicit file-scanning loop procedure, but it’s simple to write a general one that we can reuse for all time. The module in Example 4-1 defines a general file-scanning routine, which simply applies a passed-in Python function to each line in an external file.

Example 4-1. PP4E\System\Filetools\scanfile.py

def scanner(name, function):
    file = open(name, 'r')               # create a file object
    while True:
        line = file.readline()           # call file methods
        if not line: break               # until end-of-file
        function(line)                   # call a function object
    file.close()

The scanner function doesn’t care what line-processing function is passed in, and that accounts for most of its generality: it is happy to apply any single-argument function that exists now or in the future to all of the lines in a text file. If we code this module and put it in a directory on the module search path, we can use it any time we need to step through a file line by line. Example 4-2 is a client script that does simple line translations.

Example 4-2. PP4E\System\Filetools\commands.py

#!/usr/local/bin/python
from sys import argv
from scanfile import scanner
class UnknownCommand(Exception): pass

def processLine(line):                       # define a function
    if line[0] == '*':                       # applied to each line
        print("Ms.", line[1:-1])
    elif line[0] == '+':
        print("Mr.", line[1:-1])             # strip first and last char: \n
    else:
        raise UnknownCommand(line)           # raise an exception

filename = 'data.txt'
if len(argv) == 2: filename = argv[1]        # allow filename cmd arg
scanner(filename, processLine)               # start the scanner

The text file hillbillies.txt contains the following lines:

*Granny
+Jethro
*Elly May
+"Uncle Jed"

and our commands script could be run as follows:

C:\...\PP4E\System\Filetools> python commands.py hillbillies.txt
Ms. Granny
Mr. Jethro
Ms. Elly May
Mr. "Uncle Jed"

This works, but there are a variety of coding alternatives for both files, some of which may be better than those listed above. For instance, we could also code the command processor of Example 4-2 in the following way; especially if the number of command options starts to become large, such a data-driven approach may be more concise and easier to maintain than a large if statement with essentially redundant actions (if you ever have to change the way output lines print, you’ll have to change it in only one place with this form):

commands = {'*': 'Ms.', '+': 'Mr.'}     # data is easier to expand than code?

def processLine(line):
    try:
        print(commands[line[0]], line[1:-1])
    except KeyError:
        raise UnknownCommand(line)
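
To see the data-driven form in isolation, here is a minimal self-contained sketch. It is not the book's code: the format_line name is hypothetical, it returns the text instead of printing it so the result is easy to check, and the '!' row is a made-up entry added only to show that extending the tool is a one-line change to the table.

```python
class UnknownCommand(Exception):
    pass

# prefix character -> title; adding a command means adding a row here
commands = {'*': 'Ms.', '+': 'Mr.', '!': 'Dr.'}     # '!' is a made-up entry

def format_line(line):
    """Translate one command line (hypothetical variant that returns text)."""
    try:
        title = commands[line[0]]
    except (KeyError, IndexError):          # unknown prefix, or an empty string
        raise UnknownCommand(line)
    return title + ' ' + line[1:].rstrip('\n')

for line in ['*Granny\n', '+Jethro\n', '!Who\n']:
    print(format_line(line))
```

Unlike the line[1:-1] slice in Example 4-2, rstrip('\n') leaves the last character alone when a final line has no trailing newline; that is a design choice of this sketch rather than something the original requires.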

The scanner could similarly be improved. As a rule of thumb, we can also usually speed things up by shifting processing from Python code to built-in tools. For instance, if we’re concerned with speed, we can probably make our file scanner faster by using the file’s line iterator to step through the file instead of the manual readline loop in Example 4-1 (though you’d have to time this with your Python to be sure):

def scanner(name, function):
    for line in open(name, 'r'):         # scan line by line
        function(line)                   # call a function object
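
Because the relative speed has to be measured rather than assumed, the following is one possible way to time the two codings with the standard timeit module. The function names, the 20,000-line test file, and the repeat count are arbitrary choices made for this sketch, and the iterator version here also uses a with statement to close the file, which the version above does not.

```python
import os, tempfile, timeit

def scanner_readline(name, function):       # manual loop, as in Example 4-1
    file = open(name, 'r')
    while True:
        line = file.readline()
        if not line: break
        function(line)
    file.close()

def scanner_iterator(name, function):       # file line iterator
    with open(name, 'r') as file:
        for line in file:
            function(line)

# build a throwaway test file (size chosen arbitrarily)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.writelines('line %d\n' % i for i in range(20000))
    path = tmp.name

# first confirm that both codings visit exactly the same lines
seen_a, seen_b = [], []
scanner_readline(path, seen_a.append)
scanner_iterator(path, seen_b.append)

for func in (scanner_readline, scanner_iterator):
    secs = timeit.timeit(lambda: func(path, len), number=20)
    print('%-18s %.4f seconds' % (func.__name__, secs))

os.remove(path)
```

The printed times vary by machine and Python release, so no particular numbers should be expected.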

And we can work more magic in Example 4-1 with the iteration tools like the map built-in function, the list comprehension expression, and the generator expression. Here are three minimalist’s versions; the for loop is replaced by map or a comprehension, and we let Python close the file for us when it is garbage collected or the script exits (these all build a temporary list of results along the way to run through their iterations, but this overhead is likely trivial for all but the largest of files):

def scanner(name, function):
    list(map(function, open(name, 'r')))

def scanner(name, function):
    [function(line) for line in open(name, 'r')]

def scanner(name, function):
    list(function(line) for line in open(name, 'r'))

File filters

The preceding works as planned, but what if we also want to change a file while scanning it? Example 4-3 shows two approaches: one uses explicit files, and the other uses the standard input/output streams to allow for redirection on the command line.

Example 4-3. PP4E\System\Filetools\filters.py

import sys

def filter_files(name, function):            # filter file through function
    input = open(name, 'r')                  # create file objects
    output = open(name + '.out', 'w')        # explicit output file too
    for line in input:
        output.write(function(line))         # write the modified line
    input.close()
    output.close()                           # output has a '.out' suffix

def filter_stream(function):                 # no explicit files
    while True:                              # use standard streams
        line = sys.stdin.readline()          # or: input()
        if not line: break
        print(function(line), end='')        # or: sys.stdout.write()

if __name__ == '__main__':
    filter_stream(lambda line: line)         # copy stdin to stdout if run

Notice that the newer context managers feature discussed earlier could save us a few lines here in the file-based filter of Example 4-3, and also guarantee immediate file closures if the processing function fails with an exception:

def filter_files(name, function):
    with open(name, 'r') as input, open(name + '.out', 'w') as output:
        for line in input:
            output.write(function(line))     # write the modified line

And again, file object line iterators could simplify the stream-based filter’s code in this example as well:

def filter_stream(function):
    for line in sys.stdin:                   # read by lines automatically
        print(function(line), end='')
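
A stream-based filter is awkward to try out without command-line redirection. One possible variation, sketched here, accepts the streams as optional arguments so that in-memory io.StringIO objects can stand in for stdin and stdout; the instream and outstream parameter names are assumptions of this sketch, not part of Example 4-3.

```python
import io, sys

def filter_stream(function, instream=None, outstream=None):
    """Hypothetical variant: default to stdin/stdout, but allow substitutes."""
    instream = sys.stdin if instream is None else instream
    outstream = sys.stdout if outstream is None else outstream
    for line in instream:                    # any iterable of lines will do
        outstream.write(function(line))

source = io.StringIO('*Granny\n+Jethro\n')   # plays the role of stdin
result = io.StringIO()                       # plays the role of stdout
filter_stream(str.upper, source, result)
print(result.getvalue(), end='')
```

Called with just a function, this behaves like the version above and reads the real standard streams.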

Since the standard streams are preopened for us, they’re often easier to use. When run standalone, the filters.py script simply parrots stdin to stdout:

C:\...\PP4E\System\Filetools> filters.py < hillbillies.txt
*Granny
+Jethro
*Elly May
+"Uncle Jed"

But this module is also useful when imported as a library
(clients provide the line-processing function):

>>> from filters import filter_files
>>> filter_files('hillbillies.txt', str.upper)
>>> print(open('hillbillies.txt.out').read())
*GRANNY
+JETHRO
*ELLY MAY
+"UNCLE JED"

We’ll see files in action often in the remainder of this book, especially in the more complete and functional system examples of Chapter 6. First though, we turn to tools for processing our files’ home.

[9] For instance, to process pipes, described in Chapter 5. The Python os.pipe call returns two file descriptors, which can be processed with os module file tools or wrapped in a file object with os.fdopen. When used with descriptor-based file tools in os, pipes deal in byte strings, not text. Some device files may require lower-level control as well.

[10] For related tools, see also the shutil module in Python’s standard library; it has higher-level tools for copying and removing files and more. We’ll also write directory compare, copy, and search tools of our own in Chapter 6, after we’ve had a chance to study the directory tools presented later in this chapter.

Directory Tools

One of the more common tasks in the shell utilities domain is applying an operation to a set of files in a directory (a “folder” in Windows-speak). By running a script on a batch of files, we can automate (that is, script) tasks we might have to otherwise run repeatedly by hand.

For instance, suppose you need to search all of your Python files in a development directory for a global variable name (perhaps you’ve forgotten where it is used). There are many platform-specific ways to do this (e.g., the find and grep commands in Unix), but Python scripts that accomplish such tasks will work on every platform where Python works: Windows, Unix, Linux, Macintosh, and just about any other platform commonly used today. If you simply copy your script to any machine you wish to use it on, it will work regardless of which other tools are available there; all you need is Python. Moreover, coding such tasks in Python also allows you to perform arbitrary actions along the way: replacements, deletions, and whatever else you can code in the Python language.

Walking One Directory

The most common way to go about writing such tools is to first grab a list of the names of the files you wish to process, and then step through that list with a Python for loop or other iteration tool, processing each file in turn. The trick we need to learn here, then, is how to get such a directory list within our scripts. For scanning directories there are at least three options: running shell listing commands with os.popen, matching filename patterns with glob.glob, and getting directory listings with os.listdir. They vary in interface, result format, and portability.

Running shell listing commands with os.popen

How did you go
about getting directory file listings before you heard
of Python? If you’re new to shell tools programming, the answer may be
“Well, I started a Windows file explorer and clicked on things,” but
I’m thinking here in terms of less GUI-oriented command-line
mechanisms.

On Unix, directory listings are usually obtained by typing ls in a shell; on Windows, they can be generated with a dir command typed in an MS-DOS console box. Because Python scripts may use os.popen to run any command line that we can type in a shell, it is the most general way to grab a directory listing inside a Python program. We met os.popen in the prior chapters; it runs a shell command string and gives us a file object from which we can read the command’s output. To illustrate, let’s first assume the following directory structures (I have both the usual dir and a Unix-like ls command from Cygwin on my Windows laptop):

c:\temp> dir /B
parts
PP3E
random.bin
spam.txt
temp.bin
temp.txt

c:\temp> c:\cygwin\bin\ls
PP3E parts random.bin spam.txt temp.bin temp.txt

c:\temp> c:\cygwin\bin\ls parts
part0001 part0002 part0003 part0004

The parts and PP3E names are nested subdirectories in C:\temp here (the latter is a copy of the prior edition’s examples tree, which I used occasionally in this text). Now, as we’ve seen, scripts can grab a listing of file and directory names at this level by simply spawning the appropriate platform-specific command line and reading its output (the text normally thrown up on the console window):

C:\temp> python
>>> import os
>>> os.popen('dir /B').readlines()
['parts\n', 'PP3E\n', 'random.bin\n', 'spam.txt\n', 'temp.bin\n', 'temp.txt\n']

Lines read from a shell command come back with a trailing end-of-line character, but it’s easy enough to slice it off; the os.popen result also gives us a line iterator just like normal files:

>>> for line in os.popen('dir /B'):
...     print(line[:-1])
...
parts
PP3E
random.bin
spam.txt
temp.bin
temp.txt
>>> lines = [line[:-1] for line in os.popen('dir /B')]
>>> lines
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']

For pipe objects, the effect of iterators may be even more useful than simply avoiding loading the entire result into memory all at once: readlines will always block the caller until the spawned program is completely finished, whereas the iterator might not.

The dir and ls commands let us be specific about filename patterns to be matched and directory names to be listed by using name patterns; again, we’re just running shell commands here, so anything you can type at a shell prompt goes:

>>> os.popen('dir *.bin /B').readlines()
['random.bin\n', 'temp.bin\n']
>>> os.popen(r'c:\cygwin\bin\ls *.bin').readlines()
['random.bin\n', 'temp.bin\n']
>>> list(os.popen(r'dir parts /B'))
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']
>>> [fname for fname in os.popen(r'c:\cygwin\bin\ls parts')]
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']

These calls use general tools and work as advertised. As I noted earlier, though, the downsides of os.popen are that it requires using a platform-specific shell command and it incurs a performance hit to start up an independent program. In fact, different listing tools may sometimes produce different results:

>>> list(os.popen(r'dir parts\part* /B'))
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']
>>>
>>> list(os.popen(r'c:\cygwin\bin\ls parts/part*'))
['parts/part0001\n', 'parts/part0002\n', 'parts/part0003\n', 'parts/part0004\n']

The next two alternative techniques do better on both
counts.
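
To make the portability cost of the os.popen approach concrete, here is a rough sketch of what a script must do to list a directory this way on more than one platform. The shell_listing name is hypothetical, and the sketch assumes only that dir /B is available on Windows and ls elsewhere; the double quotes guard against spaces in the directory name but not against every unusual character.

```python
import os, sys

def shell_listing(dirname='.'):
    """Hypothetical helper: list dirname by spawning the platform's lister."""
    if sys.platform.startswith('win'):
        command = 'dir /B "%s"' % dirname       # Windows console command
    else:
        command = 'ls "%s"' % dirname           # Unix-like shells
    with os.popen(command) as pipe:             # pipe is closed on exit
        return [line.rstrip('\n') for line in pipe]

print(shell_listing('.'))
```

Even this small helper needs a platform test and a subprocess launch, and its results can still differ between tools (ls, for example, omits names that begin with a dot unless told otherwise).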

The glob module

The term globbing comes from the * wildcard character in filename patterns; per computing folklore, a * matches a “glob” of characters. In less poetic terms, globbing simply means collecting the names of all entries in a directory (files and subdirectories) whose names match a given filename pattern. In Unix shells, globbing expands filename patterns within a command line into all matching filenames before the command is ever run. In Python, we can do something similar by calling glob.glob, a standard library function that accepts a filename pattern to expand and returns a list (not a generator) of matching filenames:

>>> import glob
>>> glob.glob('*')
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
>>> glob.glob('*.bin')
['random.bin', 'temp.bin']
>>> glob.glob('parts')
['parts']
>>> glob.glob('parts/*')
['parts\\part0001', 'parts\\part0002', 'parts\\part0003', 'parts\\part0004']
>>> glob.glob(r'parts\part*')
['parts\\part0001', 'parts\\part0002', 'parts\\part0003', 'parts\\part0004']

The glob call accepts the usual filename pattern syntax used in shells: ? means any one character, * means any number of characters, and [] is a character selection set.[11] The pattern should include a directory path if you wish to glob in something other than the current working directory, and the module accepts either Unix or DOS-style directory separators (/ or \). This call is implemented without spawning a shell command (it uses os.listdir, described in the next section) and so is likely to be faster and more portable and uniform across all Python platforms than the os.popen schemes shown earlier.

Technically speaking, glob is a bit more powerful than described so far. In fact, using it to list files in one directory is just one use of its pattern-matching skills. For instance, it can also be used to collect matching names across multiple directories, simply because each level in a passed-in directory path can be a pattern too:

>>> for path in glob.glob(r'PP3E\Examples\PP3E\*\s*.py'): print(path)
...
PP3E\Examples\PP3E\Lang\summer-alt.py
PP3E\Examples\PP3E\Lang\summer.py
PP3E\Examples\PP3E\PyTools\search_all.py

Here, we get back filenames from two different directories that match the s*.py pattern; because the directory name preceding this is a * wildcard, Python collects all possible ways to reach the base filenames. Using os.popen to spawn shell commands achieves the same effect, but only if the underlying shell or listing command does, too, and with possibly different result formats across tools and platforms.
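
The pattern forms described in this section can be tried without the book's examples tree. The sketch below builds a small throwaway directory tree first; the directory and file names are invented for the demonstration, and the names helper is a hypothetical convenience that strips the temporary root from each result.

```python
import glob, os, tempfile

root = tempfile.mkdtemp()                       # throwaway tree for the demo
for sub, name in [('Lang', 'summer.py'), ('Lang', 'notes.txt'),
                  ('PyTools', 'search_all.py'), ('PyTools', 'visit.py')]:
    os.makedirs(os.path.join(root, sub), exist_ok=True)
    open(os.path.join(root, sub, name), 'w').close()

def names(pattern):
    """Match pattern under root; return sorted paths relative to root."""
    found = glob.glob(os.path.join(root, pattern))
    return sorted(os.path.relpath(path, root) for path in found)

print(names(os.path.join('*', 's*.py')))        # wildcard at the directory level
print(names(os.path.join('Lang', '??????.*')))  # ? matches exactly one character
print(names(os.path.join('*', '[sv]*.py')))     # [] matches one character of a set
```

The results are sorted explicitly because glob.glob makes no promise about the order of its matches.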

The os.listdir call

The os module’s listdir call provides yet another way to collect filenames in a Python list. It takes a simple directory name string, not a filename pattern, and returns a list containing the names of all entries in that directory (both simple files and nested directories) for use in the calling script:

>>> import os
>>> os.listdir('.')
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
>>>
>>> os.listdir(os.curdir)
['parts', 'PP3E', 'random.bin', 'spam.txt', 'temp.bin', 'temp.txt']
>>>
>>> os.listdir('parts')
['part0001', 'part0002', 'part0003', 'part0004']

This, too, is done without resorting to shell commands and so is both fast and portable to all major Python platforms. The result is not in any particular order across platforms (but can be sorted with the list sort method or the sorted built-in function); it contains base filenames without their directory path prefixes; it does not include the names “.” or “..” even if they are present in the directory; and it includes names of both files and directories at the listed level.
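
Since listdir results are unordered and mix files with subdirectories, scripts commonly sort and classify them before use. The following is a minimal sketch of that step; the list_entries name and the demo directory contents are invented for illustration.

```python
import os, tempfile

def list_entries(dirname):
    """Return (files, subdirs) in dirname, each sorted by name."""
    files, subdirs = [], []
    for name in sorted(os.listdir(dirname)):        # impose a stable order
        path = os.path.join(dirname, name)          # listdir gives base names only
        (subdirs if os.path.isdir(path) else files).append(name)
    return files, subdirs

demo = tempfile.mkdtemp()                           # throwaway directory
os.mkdir(os.path.join(demo, 'parts'))
for name in ('temp.txt', 'spam.txt'):
    open(os.path.join(demo, name), 'w').close()

print(list_entries(demo))
```

Note that the base name must be joined with the directory name before os.path.isdir can test it, unless the directory being listed is the current working directory.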

To compare all three listing techniques, let’s run them here side by side on an explicit directory. They differ in some ways but are mostly just variations on a theme for this task: os.popen returns end-of-lines and may sort filenames on some platforms, glob.glob accepts a pattern and returns filenames with directory prefixes, and os.listdir takes a simple directory name and returns names without directory prefixes:

>>> os.popen('dir /b parts').readlines()
['part0001\n', 'part0002\n', 'part0003\n', 'part0004\n']
>>> glob.glob(r'parts\*')
['parts\\part0001', 'parts\\part0002', 'parts\\part0003', 'parts\\part0004']
>>> os.listdir('parts')
['part0001', 'part0002', 'part0003', 'part0004']

Of these three, glob and listdir are generally better options if you care about script portability and result uniformity, and listdir seems fastest in recent Python releases (but gauge its performance yourself, since implementations may change over time).

Splitting and joining listing results

In the last example, I pointed out that glob returns names with directory paths, whereas listdir gives raw base filenames. For convenient processing, scripts often need to split glob results into base files or expand listdir results into full paths. Such translations are easy if we let the os.path module do all the work for us. For example, a script that intends to copy all files elsewhere will typically need to first split off the base filenames from glob results so that it can add different directory names on the front:

>>> dirname = r'C:\temp\parts'
>>>
>>> import glob
>>> for file in glob.glob(dirname + '/*'):
...     head, tail = os.path.split(file)
...     print(head, tail, '=>', ('C:\\Other\\' + tail))
...
C:\temp\parts part0001 => C:\Other\part0001
C:\temp\parts part0002 => C:\Other\part0002
C:\temp\parts part0003 => C:\Other\part0003
C:\temp\parts part0004 => C:\Other\part0004

Here, the names after the => represent names that files might be moved to. Conversely, a script that means to process all files in a different directory than the one it runs in will probably need to prepend listdir results with the target directory name before passing filenames on to other tools:

>>> import os
>>> for file in os.listdir(dirname):
...     print(dirname, file, '=>', os.path.join(dirname, file))
...
C:\temp\parts part0001 => C:\temp\parts\part0001
C:\temp\parts part0002 => C:\temp\parts\part0002
C:\temp\parts part0003 => C:\temp\parts\part0003
C:\temp\parts part0004 => C:\temp\parts\part0004

When you begin writing realistic directory processing tools of the sort we’ll develop in Chapter 6, you’ll find these calls to be almost habit.
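
As a rough illustration of how the split and join calls combine in practice, the sketch below pairs each file matched by a glob pattern with the path it would have in another directory. Nothing is copied; the plan_copies name, the temporary source directory, and the 'Other' target name are all assumptions made for the demonstration.

```python
import glob, os, tempfile

def plan_copies(pattern, target_dir):
    """Pair each path matching pattern with its would-be path in target_dir."""
    pairs = []
    for source in sorted(glob.glob(pattern)):
        head, tail = os.path.split(source)          # drop the source directory
        pairs.append((source, os.path.join(target_dir, tail)))
    return pairs

srcdir = tempfile.mkdtemp()                         # throwaway source directory
for name in ('part0001', 'part0002'):
    open(os.path.join(srcdir, name), 'w').close()

for source, target in plan_copies(os.path.join(srcdir, '*'), 'Other'):
    print(source, '=>', target)
```

Using os.path.join instead of string concatenation means the script picks the right directory separator for whatever platform it runs on.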
