The find module of the prior section isn’t quite the general string searcher
we’re after, but it’s an important first step—it collects files that we
can then search in an automated script. In fact, the act of collecting
matching files in a tree is enough by itself to support a wide variety
of day-to-day system tasks.
For example, one of the other common tasks I perform on a regular
basis is removing all the bytecode files in a tree. Because these are
not always portable across major Python releases, it’s usually a good
idea to ship programs without them and let Python create new ones on
first imports. Now that we’re expert os.walk users, we could cut out the
middleman and use it directly. Example 6-14 codes a portable and
general command-line tool, with support for arguments, exception
processing, tracing, and list-only mode.
Example 6-14. PP4E\Tools\cleanpyc.py
"""
delete all .pyc bytecode files in a directory tree: use the
command line arg as root if given, else current working dir
"""

import os, sys
findonly = False
rootdir = os.getcwd() if len(sys.argv) == 1 else sys.argv[1]

found = removed = 0
for (thisDirLevel, subsHere, filesHere) in os.walk(rootdir):
    for filename in filesHere:
        if filename.endswith('.pyc'):
            fullname = os.path.join(thisDirLevel, filename)
            print('=>', fullname)
            if not findonly:
                try:
                    os.remove(fullname)
                    removed += 1
                except:
                    type, inst = sys.exc_info()[:2]
                    print('*'*4, 'Failed:', filename, type, inst)
            found += 1

print('Found', found, 'files, removed', removed)
When run, this script walks a directory tree (the CWD by default,
or else one passed in on the command line), deleting any and all
bytecode files along the way:
C:\...\Examples\PP4E>Tools\cleanpyc.py
=> C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\__init__.pyc
=> C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\Preview\initdata.pyc
=> C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\Preview\make_db_file.pyc
=> C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\Preview\manager.pyc
=> C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\Preview\person.pyc
...more lines here...
Found 24 files, removed 24
C:\...\PP4E\Tools>cleanpyc.py .
=> .\find.pyc
=> .\visitor.pyc
=> .\__init__.pyc
Found 3 files, removed 3
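Python 3.2 and later store bytecode in __pycache__ subdirectories, under names such as module.cpython-32.pyc, so the endswith('.pyc') test still catches those files. As a rough sketch only, the same logic can be packaged as a function with a list-only switch so that it can be imported and tested. The name clean_pyc and its return value are assumptions made here for illustration; they are not part of the book's examples package.

```python
import os

def clean_pyc(rootdir=os.curdir, findonly=False):
    """Delete .pyc files at and below rootdir; return (found, removed)."""
    found = removed = 0
    for (thisdir, subshere, fileshere) in os.walk(rootdir):
        for filename in fileshere:
            if filename.endswith('.pyc'):
                found += 1
                fullname = os.path.join(thisdir, filename)
                if not findonly:                    # list-only mode skips deletes
                    try:
                        os.remove(fullname)
                        removed += 1
                    except OSError as exc:          # e.g., permission denied
                        print('Failed:', fullname, exc)
    return found, removed
```

Calling clean_pyc(root, findonly=True) first gives a count of what would be deleted before anything is removed.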
This script works, but it’s a bit more manual and code-y than it
needs to be. In fact, now that we also know about find operations,
writing scripts based upon them is almost trivial when we just need to
match filenames.
Example 6-15, for instance, falls
back on spawning shell find commands if you have them.
Example 6-15. PP4E\Tools\cleanpyc-find-shell.py
"""
find and delete all "*.pyc" bytecode files at and below the directory
named on the command-line; assumes a nonportable Unix-like find command
"""

import os, sys

rundir = sys.argv[1]
if sys.platform[:3] == 'win':
    findcmd = r'c:\cygwin\bin\find %s -name "*.pyc" -print' % rundir
else:
    findcmd = 'find %s -name "*.pyc" -print' % rundir
print(findcmd)

count = 0
for fileline in os.popen(findcmd):              # for all result lines
    count += 1                                  # have \n at the end
    print(fileline, end='')
    os.remove(fileline.rstrip())

print('Removed %d .pyc files' % count)
When run, files returned by the shell command are removed:
C:\...\PP4E\Tools>cleanpyc-find-shell.py .
c:\cygwin\bin\find . -name "*.pyc" -print
./find.pyc
./visitor.pyc
./__init__.pyc
Removed 3 .pyc files
This script uses os.popen to collect the output of a Cygwin find
program installed on one of my Windows computers, or else the standard
find tool on the Linux side. It’s also completely nonportable to
Windows machines that don’t have the Unix-like find
program installed, and that includes other computers of my own (not to
mention those throughout most of the world at large). As we’ve seen,
spawning shell commands also incurs performance penalties for starting a
new program.
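One way to soften the portability problem is to check for the command before spawning it, and to fall back on os.walk when it is missing. The following is only a sketch under stated assumptions: the function name list_pyc is hypothetical, and on Windows it skips the lookup altogether, because the find normally on the system path there is an unrelated text-search command. Passing an argument list to subprocess.run also sidesteps shell quoting problems when a directory name contains spaces.

```python
import os, shutil, subprocess, sys

def list_pyc(rundir):
    """Return paths of .pyc files under rundir, using a shell find if one is usable."""
    # Assumption: only trust a 'find' on the path when not running on Windows
    findexe = None if sys.platform.startswith('win') else shutil.which('find')
    if findexe:
        result = subprocess.run(
            [findexe, rundir, '-name', '*.pyc', '-print'],
            capture_output=True, text=True, check=True)
        return result.stdout.splitlines()           # one path per output line
    # portable fallback: walk the tree in Python instead
    return [os.path.join(thisdir, fname)
            for (thisdir, subshere, fileshere) in os.walk(rundir)
            for fname in fileshere if fname.endswith('.pyc')]
```

The order of results may differ between the two branches, so callers that care should sort the list.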
We can do much better on the portability and performance fronts
and still retain code simplicity, by applying the find tool we wrote in
Python in the prior section. The new script is shown in Example 6-16.
Example 6-16. PP4E\Tools\cleanpyc-find-py.py
"""
find and delete all "*.pyc" bytecode files at and below the directory
named on the command-line; this uses a Python-coded find utility, and
so is portable; run this to delete .pyc's from an old Python release;
"""

import os, sys, find                    # here, gets Tools.find

count = 0
for filename in find.find('*.pyc', sys.argv[1]):
    count += 1
    print(filename)
    os.remove(filename)

print('Removed %d .pyc files' % count)
When run, all bytecode files in the tree rooted at the passed-in
directory name are removed as before; this time, though, our script
works just about everywhere Python does:
C:\...\PP4E\Tools>cleanpyc-find-py.py .
.\find.pyc
.\visitor.pyc
.\__init__.pyc
Removed 3 .pyc files
This works portably, and it avoids external program startup costs.
But find is really just half the story: it collects files matching a
name pattern but doesn’t search their
content. Although extra code can add such searching to a find’s result,
a more manual approach can allow us to tap into the search process more
directly. The next section shows
how.
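To make the "extra code" idea concrete, here is a minimal sketch of searching the content of each find result. The find generator below is a stand-in written for this illustration, assuming the prior section's tool is built on os.walk and fnmatch; it is not that module's actual code, and find_with_text is a hypothetical name.

```python
import fnmatch, os

def find(pattern, startdir=os.curdir):
    """Stand-in find: yield paths whose base name matches pattern."""
    for (thisdir, subshere, fileshere) in os.walk(startdir):
        for name in subshere + fileshere:
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(thisdir, name)

def find_with_text(pattern, searchkey, startdir=os.curdir):
    """Return matching file paths whose text also contains searchkey."""
    matches = []
    for fpath in find(pattern, startdir):
        if not os.path.isfile(fpath):               # skip matching directories
            continue
        try:
            with open(fpath, errors='ignore') as f: # drop undecodable bytes
                if searchkey in f.read():
                    matches.append(fpath)
        except OSError:                             # unreadable: move on
            pass
    return matches
```

This works, but the name pattern and the content test run as two separate steps, which is the limitation a single custom traversal avoids.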
After experimenting with greps and globs and finds, in the end, to
help ease the task of performing global searches on all platforms I
might ever use, I wound up coding a task-specific Python script to do
most of the work for me.
Example 6-17 employs the following standard Python tools that we met in
the preceding chapters: os.walk to visit files in a directory,
os.path.splitext to skip over files with binary-type extensions, and
os.path.join to portably combine a directory path and filename.
Because it’s pure Python code, it can be run the same way on both
Linux and Windows. In fact, it should work on any computer where Python
has been installed. Moreover, because it uses direct system calls, it
will likely be faster than approaches that rely on underlying shell
commands.
Example 6-17. PP4E\Tools\search_all.py
"""
################################################################################
Use: "python ...\Tools\search_all.py dir string".
Search all files at and below a named directory for a string; uses the
os.walk interface, rather than doing a find.find to collect names first;
similar to calling visitfile for each find.find result for "*" pattern;
################################################################################
"""

import os, sys
listonly = False
textexts = ['.py', '.pyw', '.txt', '.c', '.h']            # ignore binary files

def searcher(startdir, searchkey):
    global fcount, vcount
    fcount = vcount = 0
    for (thisDir, dirsHere, filesHere) in os.walk(startdir):
        for fname in filesHere:                           # do non-dir files here
            fpath = os.path.join(thisDir, fname)          # fnames have no dirpath
            visitfile(fpath, searchkey)

def visitfile(fpath, searchkey):                          # for each non-dir file
    global fcount, vcount                                 # search for string
    print(vcount+1, '=>', fpath)                          # skip protected files
    try:
        if not listonly:
            if os.path.splitext(fpath)[1] not in textexts:
                print('Skipping', fpath)
            elif searchkey in open(fpath).read():
                input('%s has %s' % (fpath, searchkey))
                fcount += 1
    except:
        print('Failed:', fpath, sys.exc_info()[0])
    vcount += 1

if __name__ == '__main__':
    searcher(sys.argv[1], sys.argv[2])
    print('Found in %d files, visited %d' % (fcount, vcount))
Operationally, this script works roughly the same as calling its
visitfile function for every result generated by our find.find tool with
a pattern of “*”; but because this version is specific to searching
content it can be better tailored for its goal. Really, this equivalence
holds only because a “*” pattern invokes an exhaustive traversal in
find.find, and that’s all that this new script’s searcher function
does. The finder is good at selecting specific file types, but this
script benefits from a more custom single traversal.
When run standalone, the search key is passed on the command line;
when imported, clients call this module’s searcher
function directly. For example, to
search (that is, grep) for all appearances of a string in the book
examples tree, I run a command line like this in a DOS or Unix
shell:
C:\\PP4E>Tools\search_all.py . mimetypes
1 => .\LaunchBrowser.py
2 => .\Launcher.py
3 => .\Launch_PyDemos.pyw
4 => .\Launch_PyGadgets_bar.pyw
5 => .\__init__.py
6 => .\__init__.pyc
Skipping .\__init__.pyc
7 => .\Preview\attachgui.py
8 => .\Preview\bob.pkl
Skipping .\Preview\bob.pkl
...more lines omitted: pauses for Enter key press at matches...
Found in 2 files, visited 184
The script lists each file it checks as it goes, tells you which
files it is skipping (names that end in extensions not listed in the
variable textexts that imply binary data), and pauses for an Enter key
press each time it announces a file containing the search string. The
search_all script works the same way when it is imported rather than
run, but there is no final statistics output line (fcount and vcount
live in the module and so would have to be imported to be inspected
here):
C:\...\PP4E\dev\Examples\PP4E>python
>>> from Tools import search_all
>>> search_all.searcher(r'C:\temp\PP3E\Examples', 'mimetypes')
...more lines omitted: 8 pauses for Enter key press along the way...
>>> search_all.fcount, search_all.vcount      # matches, files
(8, 1429)
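Because the counters are module globals, an importing client has to reach into the module to read them. A hedged alternative, sketched here with the hypothetical name search_tree, returns its results instead and neither prints nor pauses; it illustrates the design choice and is not a replacement for Example 6-17.

```python
import os

TEXTEXTS = ['.py', '.pyw', '.txt', '.c', '.h']      # same list as Example 6-17

def search_tree(startdir, searchkey, textexts=TEXTEXTS):
    """Return (matches, visited): matching paths and number of files visited."""
    matches, visited = [], 0
    for (thisdir, dirshere, fileshere) in os.walk(startdir):
        for fname in fileshere:
            visited += 1
            fpath = os.path.join(thisdir, fname)
            if os.path.splitext(fpath)[1] not in textexts:
                continue                            # assume binary: skip
            try:
                with open(fpath) as f:
                    if searchkey in f.read():
                        matches.append(fpath)
            except (OSError, UnicodeDecodeError):   # protected or undecodable
                pass
    return matches, visited
```

A caller can then write matches, visited = search_tree('.', 'mimetypes') and decide for itself what to print.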
However launched, this script tracks down all references to a
string in an entire directory tree: a name of a changed book examples
file, object, or directory, for instance. It’s exactly what I was
looking for—or at least I thought so, until further deliberation drove
me to seek more complete and better structured solutions, the topic of
the next
section.
Be sure to also see the coverage of regular expressions in
Chapter 19. The search_all script here searches for a simple
string in each file with the in string membership expression, but it
would be trivial to extend it to search for a regular expression
pattern match instead (roughly, just replace in with a call to a
regular expression object’s search method). Of course, such a mutation
will be much more trivial after we’ve learned how.
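As a small preview of that extension, the sketch below swaps the in test for a compiled pattern's search method. The helper name make_matcher is an assumption for illustration; re.compile and the pattern object's search method are the standard library calls.

```python
import re

def make_matcher(pattern):
    """Compile pattern once; return a function that tests a text for a match."""
    regex = re.compile(pattern)
    def matcher(text):
        return regex.search(text) is not None       # match anywhere in text
    return matcher

# usage: stands in for the "searchkey in text" test
has_mime = make_matcher(r'\bmime(types|tools)\b')
print(has_mime('import mimetypes'))
```

Compiling once outside the file loop avoids repeating that work for every file visited.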
Also notice the textexts list in Example 6-17, which attempts to list
the extensions of text files worth searching; files with any other
extension are assumed to be binary and are skipped. It would be more
general and robust to use the mimetypes logic we will meet near the
end of this chapter in order to guess file content type from its name,
but the extensions list provides more control and sufficed for the
trees I used this script against.
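A rough sketch of that mimetypes alternative follows. The function name looks_like_text is hypothetical; mimetypes.guess_type is the standard library call, and it judges by filename only, so a misnamed file will still be misclassified.

```python
import mimetypes

def looks_like_text(fname):
    """Guess from the filename alone whether a file holds plain text."""
    mimetype, encoding = mimetypes.guess_type(fname)
    if mimetype is None or encoding is not None:    # unknown type, or compressed
        return False
    return mimetype.split('/')[0] == 'text'         # e.g., text/plain, text/html
```

Files with no extension or an unrecognized one are treated as non-text here, which errs on the side of skipping.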
Finally note that for simplicity many of the directory searches
in this chapter assume that text is encoded per the underlying
platform’s Unicode default. They could open text in binary mode to
avoid decoding errors, but searches might then be inaccurate because
of encoding scheme differences in the raw encoded bytes. To see how to
do better, watch for the “grep” utility in Chapter 11’s PyEdit GUI,
which will apply an
encoding name to all the files in a searched tree and ignore those
text or binary files that fail to decode.
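The general idea of applying one encoding name and ignoring files that fail to decode can be sketched as follows. This is not PyEdit's code; read_text is a hypothetical helper, and the utf-8 default is an assumption.

```python
def read_text(fpath, encoding='utf-8'):
    """Return the file's text decoded per encoding, or None if that fails."""
    try:
        with open(fpath, encoding=encoding) as f:
            return f.read()
    except (UnicodeError, OSError):                 # undecodable or unreadable
        return None
```

A searcher would then skip any file for which read_text returns None instead of aborting the walk.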
[22] In fact, the act of searching files often goes by the
colloquial name “grepping” among developers who have spent any
substantial time in the Unix ghetto.
Laziness is the mother of many a framework. Armed with the portable
search_all script from Example 6-17, I was able to better pinpoint
files to be edited every time I changed the book examples tree content
or structure. At least initially, in one window I ran search_all to
pick out suspicious files and edited each along the way by hand in
another window.
Pretty soon, though, this became tedious, too. Manually typing
filenames into editor commands is no fun, especially when the number of
files to edit is large. Since I occasionally have better things to do than
manually start dozens of text editor sessions, I started looking for a way
to automatically run an editor on each suspicious file.
Unfortunately, search_all simply prints results to the screen. Although
that text could be intercepted with os.popen and parsed by another program,
a more direct approach that spawns edit sessions during the search may be
simpler. That would require major changes to the tree search script as
currently coded, though, and make it useful for just one specific purpose.
At this point, three thoughts came to mind:
After writing a few directory walking utilities, it became
clear that I was rewriting the same sort of code over and over
again. Traversals could be even further simplified by wrapping
common details for reuse. Although the os.walk tool avoids having to write
recursive functions, its model tends to foster redundant operations
and code (e.g., directory name joins, tracing prints).
Past experience informed me that it would be better in the
long run to add features to a general directory searcher as external
components, rather than changing the original script itself. Because
editing files was just one possible extension (what about automating
text replacements, too?), a more general, customizable, and reusable
approach seemed the way to go. Although os.walk is straightforward to use, its
nested loop-based structure doesn’t quite lend itself to
customization the way a class can.
Based on past experience, I also knew that it’s a generally
good idea to insulate programs from implementation details as much
as possible. While os.walk hides
the details of recursive traversal, it still imposes a very specific
interface on its clients, which is prone to change over time. Indeed
it has—as I’ll explain further at the end of this section, one of
Python’s tree walkers was removed altogether in 3.X, instantly
breaking code that relied upon it. It would be better to hide such
dependencies behind a more neutral interface, so that clients won’t
break as our needs change.
Of course, if you’ve studied Python in any depth, you know that all
these goals point to using an object-oriented framework for traversals
and searching. Example 6-18 is a concrete realization of these goals. It
exports a general FileVisitor class that mostly just wraps os.walk for
easier use and extension, as well as a generic SearchVisitor class that
generalizes the notion of directory searches.
By itself, SearchVisitor simply does what search_all did, but it also
opens up the search process to customization: bits of its behavior can be
modified by overloading its methods in subclasses. Moreover, its core
search logic can be reused everywhere we need to search. Simply define a
subclass that adds extensions for a specific task. The same goes for
FileVisitor: by redefining its methods and using its attributes, we can
tap into tree search using OOP coding techniques. As is usual in
programming, once you repeat tactical tasks often enough, they tend to
inspire this kind of strategic thinking.
Example 6-18. PP4E\Tools\visitor.py
"""
####################################################################################
Test: "python ...\Tools\visitor.py testmask dir [string]". Uses classes and
subclasses to wrap some of the details of os.walk call usage to walk and search;
testmask is an integer bitmask with 1 bit per available self-test; see also:
visitor_*.py subclasses use cases; frameworks should generally use __X pseudo
private names, but all names here are exported for use in subclasses and clients;
redefine reset to support multiple independent walks that require subclass updates;
####################################################################################
"""

import os, sys

class FileVisitor:
    """
    Visits all nondirectory files below startDir (default '.');
    override visit* methods to provide custom file/dir handlers;
    context arg/attribute is optional subclass-specific state;
    trace switch: 0 is silent, 1 is directories, 2 adds files
    """
    def __init__(self, context=None, trace=2):
        self.fcount  = 0
        self.dcount  = 0
        self.context = context
        self.trace   = trace

    def run(self, startDir=os.curdir, reset=True):
        if reset: self.reset()
        for (thisDir, dirsHere, filesHere) in os.walk(startDir):
            self.visitdir(thisDir)
            for fname in filesHere:                          # for non-dir files
                fpath = os.path.join(thisDir, fname)         # fnames have no path
                self.visitfile(fpath)

    def reset(self):                                         # to reuse walker
        self.fcount = self.dcount = 0                        # for independent walks

    def visitdir(self, dirpath):                             # called for each dir
        self.dcount += 1                                     # override or extend me
        if self.trace > 0: print(dirpath, '...')

    def visitfile(self, filepath):                           # called for each file
        self.fcount += 1                                     # override or extend me
        if self.trace > 1: print(self.fcount, '=>', filepath)

class SearchVisitor(FileVisitor):
    """
    Search files at and below startDir for a string;
    subclass: redefine visitmatch, extension lists, candidate as needed;
    subclasses can use testexts to specify file types to search (but can
    also redefine candidate to use mimetypes for text content: see ahead)
    """

    skipexts = []
    testexts = ['.txt', '.py', '.pyw', '.html', '.c', '.h']  # search these exts
   #skipexts = ['.gif', '.jpg', '.pyc', '.o', '.a', '.exe']  # or skip these exts

    def __init__(self, searchkey, trace=2):
        FileVisitor.__init__(self, searchkey, trace)
        self.scount = 0

    def reset(self):                                         # on independent walks
        FileVisitor.reset(self)                              # clear file/dir counts too
        self.scount = 0

    def candidate(self, fname):                              # redef for mimetypes
        ext = os.path.splitext(fname)[1]
        if self.testexts:
            return ext in self.testexts                      # in test list
        else:                                                # or not in skip list
            return ext not in self.skipexts

    def visitfile(self, fname):                              # test for a match
        FileVisitor.visitfile(self, fname)
        if not self.candidate(fname):
            if self.trace > 0: print('Skipping', fname)
        else:
            text = open(fname).read()                        # 'rb' if undecodable
            if self.context in text:                         # or text.find() != -1
                self.visitmatch(fname, text)
                self.scount += 1

    def visitmatch(self, fname, text):                       # process a match
        print('%s has %s' % (fname, self.context))           # override me lower

if __name__ == '__main__':
    # self-test logic
    dolist   = 1
    dosearch = 2    # 3=do list and search
    donext   = 4    # when next test added

    def selftest(testmask):
        if testmask & dolist:
            visitor = FileVisitor(trace=2)
            visitor.run(sys.argv[2])
            print('Visited %d files and %d dirs' % (visitor.fcount, visitor.dcount))

        if testmask & dosearch:
            visitor = SearchVisitor(sys.argv[3], trace=0)
            visitor.run(sys.argv[2])
            print('Found in %d files, visited %d' % (visitor.scount, visitor.fcount))

    selftest(int(sys.argv[1]))    # e.g., 3 = dolist | dosearch
This module primarily serves to export classes for external use, but
it does something useful when run standalone, too. If you invoke it as a
script with a test mask of 1 and a root directory name, it makes and
runs a FileVisitor object and prints an exhaustive
listing of every file and directory at and below the root:
C:\...\PP4E\Tools>visitor.py 1 C:\temp\PP3E\Examples
C:\temp\PP3E\Examples ...
1 => C:\temp\PP3E\Examples\README-root.txt
C:\temp\PP3E\Examples\PP3E ...
2 => C:\temp\PP3E\Examples\PP3E\echoEnvironment.pyw
3 => C:\temp\PP3E\Examples\PP3E\LaunchBrowser.pyw
4 => C:\temp\PP3E\Examples\PP3E\Launcher.py
5 => C:\temp\PP3E\Examples\PP3E\Launcher.pyc
...more output omitted (pipe into more or a file)...
1424 => C:\temp\PP3E\Examples\PP3E\System\Threads\thread-count.py
1425 => C:\temp\PP3E\Examples\PP3E\System\Threads\thread1.py
C:\temp\PP3E\Examples\PP3E\TempParts ...
1426 => C:\temp\PP3E\Examples\PP3E\TempParts\109_0237.JPG
1427 => C:\temp\PP3E\Examples\PP3E\TempParts\lawnlake1-jan-03.jpg
1428 => C:\temp\PP3E\Examples\PP3E\TempParts\part-001.txt
1429 => C:\temp\PP3E\Examples\PP3E\TempParts\part-002.html
Visited 1429 files and 186 dirs
If you instead invoke this script with a 2 as its first command-line
argument, it makes and runs a SearchVisitor object using the third
argument as the search key. This form is similar to running the
search_all.py script we met earlier, but it simply
reports each matching file without pausing:
C:\...\PP4E\Tools>visitor.py 2 C:\temp\PP3E\Examples mimetypes
C:\temp\PP3E\Examples\PP3E\extras\LosAlamosAdvancedClass\day1-system\data.txt ha
s mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailParser.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailSender.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat_modular.py has mimet
ypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\ftptools.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\uploadflat.py has mimetypes
C:\temp\PP3E\Examples\PP3E\System\Media\playfile.py has mimetypes
Found in 8 files, visited 1429
Technically, passing this script a first argument of 3 runs both a
FileVisitor and a SearchVisitor (two separate traversals are
performed). The first argument is really used as a bit mask to select one
or more supported self-tests; if a test’s bit is on in the binary value of
the argument, the test will be run. Because 3 is 011 in binary, it selects
both a search (010) and a listing (001). In a more user-friendly system,
we might want to be more symbolic about that (e.g., check for -search
and -list arguments), but bit masks work just as
well for this script’s scope.
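The bit tests and the more symbolic alternative can both be sketched in a few lines. The bit values match the self-test code in Example 6-18; the function names and the exact -list and -search spellings are assumptions for illustration.

```python
DOLIST, DOSEARCH = 1, 2                 # one bit per self-test, as in Example 6-18

def selected_tests(testmask):
    """Return the names of the self-tests whose bits are on in testmask."""
    names = []
    if testmask & DOLIST:   names.append('list')
    if testmask & DOSEARCH: names.append('search')
    return names

def mask_from_flags(flags):
    """Build the same mask from symbolic command-line flags."""
    mask = 0
    if '-list' in flags:   mask |= DOLIST
    if '-search' in flags: mask |= DOSEARCH
    return mask

print(selected_tests(3))                # 3 is 011 in binary: both bits on
```

With this in place, a script could accept either form and convert flags to a mask before dispatching.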
As usual, this module can also be used interactively. The following
is one way to determine how many files and directories you have in
specific directories; the last command walks over your entire drive (after
a generally noticeable delay!). See also the “biggest file” example at the
start of this chapter for issues such as potential repeat visits not
handled by this walker:
C:\...\PP4E\Tools>python
>>> from visitor import FileVisitor
>>> V = FileVisitor(trace=0)
>>> V.run(r'C:\temp\PP3E\Examples')
>>> V.dcount, V.fcount
(186, 1429)
>>> V.run('..')                          # independent walk (reset counts)
>>> V.dcount, V.fcount
(19, 181)
>>> V.run('..', reset=False)             # accumulative walk (keep counts)
>>> V.dcount, V.fcount
(38, 362)
>>> V = FileVisitor(trace=0)             # new independent walker (own counts)
>>> V.run(r'C:\\')                       # entire drive: try '/' on Unix-en
>>> V.dcount, V.fcount
(24992, 198585)
Although the visitor module is useful by itself for listing and
searching trees, it was really designed to be extended. In the rest of
this section, let’s quickly step through a handful of visitor clients
which add more specific tree operations, using normal OO customization
techniques.
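To show the general shape of such a client, here is a self-contained sketch. The FileVisitor below is a trimmed stand-in so the block runs on its own; with the real module, "from visitor import FileVisitor" would replace it. LineCounter is a hypothetical subclass invented for this illustration.

```python
import os

class FileVisitor:                                  # trimmed stand-in for Example 6-18
    def __init__(self, context=None):
        self.context = context
        self.fcount = 0
    def run(self, startdir=os.curdir):
        for (thisdir, dirshere, fileshere) in os.walk(startdir):
            for fname in fileshere:
                self.visitfile(os.path.join(thisdir, fname))
    def visitfile(self, filepath):                  # override or extend me
        self.fcount += 1

class LineCounter(FileVisitor):
    """Hypothetical client: total the lines in files with a given extension."""
    def __init__(self, ext='.py'):
        FileVisitor.__init__(self, context=ext)     # extension kept as context
        self.lcount = 0
    def visitfile(self, filepath):
        FileVisitor.visitfile(self, filepath)       # keep the file count going
        if filepath.endswith(self.context):
            with open(filepath, errors='ignore') as f:
                self.lcount += sum(1 for line in f)
```

The subclass supplies only the per-file action; the traversal itself is inherited unchanged.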
After genericizing tree traversals and searches, it’s easy to add
automatic file editing in a brand-new, separate component. Example 6-19
defines a new EditVisitor class that simply customizes the visitmatch
method of the SearchVisitor class to
open a text editor on the matched file. Yes, this is the complete
program—it needs to do something special only when visiting matched
files, and so it needs to provide only that behavior. The rest of the
traversal and search logic is unchanged and inherited.
Example 6-19. PP4E\Tools\visitor_edit.py
"""
Use: "python ...\Tools\visitor_edit.py string rootdir?".
Add auto-editor startup to SearchVisitor in an external subclass component;
Automatically pops up an editor on each file containing string as it traverses;
can also use editor='edit' or 'notepad' on Windows; to use texteditor from
later in the book, try r'python Gui\TextEditor\textEditor.py'; could also
send a search command to go to the first match on start in some editors;
"""

import os, sys
from visitor import SearchVisitor

class EditVisitor(SearchVisitor):
    """
    edit files at and below startDir having string
    """
    editor = r'C:\cygwin\bin\vim-nox.exe'  # ymmv!

    def visitmatch(self, fpathname, text):
        os.system('%s %s' % (self.editor, fpathname))

if __name__  == '__main__':
    visitor = EditVisitor(sys.argv[1])
    visitor.run('.' if len(sys.argv) < 3 else sys.argv[2])
    print('Edited %d files, visited %d' % (visitor.scount, visitor.fcount))
When we make and run an EditVisitor, a text editor is started with
the os.system command-line spawn call,
which usually blocks its caller until the spawned program finishes. As
coded, when run on my machines, each time this script finds a matched
file during the traversal, it starts up the vi text editor within the
console window where the script was started; exiting the editor resumes
the tree walk.
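Because os.system passes one string through the shell, a matched path that contains spaces would be split into several arguments. A hedged alternative is to pass an argument list to subprocess.call, which also blocks until the editor exits. The helper names here are hypothetical, and falling back on the EDITOR environment variable and then vi is an assumption borrowed from Unix convention rather than anything Example 6-19 does.

```python
import os, subprocess

def editor_command(fpath, editor=None):
    """Build an argument list for starting an editor on fpath."""
    editor = editor or os.environ.get('EDITOR', 'vi')   # assumed fallback order
    return [editor, fpath]                              # no shell, no quoting issues

def edit_file(fpath, editor=None):
    """Start the editor, block until it exits, and return its exit status."""
    return subprocess.call(editor_command(fpath, editor))
```

A visitmatch override could call edit_file(fpathname, self.editor) in place of the os.system line.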
Let’s find and edit some files. When run as a script, we pass this
program the search string as a command argument (here, the string
mimetypes is the search key). The root directory passed to the run
method is either the second argument or “.” (the current run directory)
by default. Traversal status messages show up in the console, but each
matched file now automatically pops up in a text editor along the way.
In the following, the editor is started eight times—try this with an
editor and tree of your own to get a better feel for how it
works:
C:\...\PP4E\Tools> visitor_edit.py mimetypes C:\temp\PP3E\Examples
C:\temp\PP3E\Examples ...
1 => C:\temp\PP3E\Examples\README-root.txt
C:\temp\PP3E\Examples\PP3E ...
2 => C:\temp\PP3E\Examples\PP3E\echoEnvironment.pyw
3 => C:\temp\PP3E\Examples\PP3E\LaunchBrowser.pyw
4 => C:\temp\PP3E\Examples\PP3E\Launcher.py
5 => C:\temp\PP3E\Examples\PP3E\Launcher.pyc
Skipping C:\temp\PP3E\Examples\PP3E\Launcher.pyc
...more output omitted...
1427 => C:\temp\PP3E\Examples\PP3E\TempParts\lawnlake1-jan-03.jpg
Skipping C:\temp\PP3E\Examples\PP3E\TempParts\lawnlake1-jan-03.jpg
1428 => C:\temp\PP3E\Examples\PP3E\TempParts\part-001.txt
1429 => C:\temp\PP3E\Examples\PP3E\TempParts\part-002.html
Edited 8 files, visited 1429
This, finally, is the exact tool I was looking for to simplify
global book examples tree maintenance. After major changes to things
such as shared modules and file and directory names, I run this script
on the examples root directory with an appropriate search string and
edit any files it pops up as needed. I still need to change files by
hand in the editor, but that’s often safer than blind global
replacements.