Programming Python (21 page)

Authors: Mark Lutz

[8] Notice that input raises an exception to signal end-of-file, but
file read methods simply return an empty string for this condition.
Because input also strips the end-of-line character at the end of lines,
an empty string result means an empty line, so an exception is necessary
to specify the end-of-file condition. File read methods retain the
end-of-line character and denote an empty line as "\n" instead of "".
This is one way in which reading sys.stdin directly differs from input.
The latter also accepts a prompt string that is automatically printed
before input is accepted.

Chapter 4. File and Directory Tools
“Erase Your Hard Drive in Five Easy Steps!”

This chapter continues our look at system interfaces in Python by
focusing on file and directory-related tools. As you’ll see, it’s easy to
process files and directory trees with Python’s built-in and standard
library support. Because files are part of the core Python language, some
of this chapter’s material is a review of file basics covered in books
like Learning Python, Fourth Edition, and we’ll defer to such
resources for more background details on some file-related concepts. For
example, iteration, context managers, and the file object’s support for
Unicode encodings are demonstrated along the way, but these topics are not
repeated in full here. This chapter’s goal is to tell enough of the file
story to get you started writing useful scripts.

File Tools

External files are at the heart of much of what we do with system
utilities. For instance, a testing system may read its inputs from one
file, store program results in another file, and check expected results by
loading yet another file. Even user interface and Internet-oriented
programs may load binary images and audio clips from files on the
underlying computer. It’s a core programming concept.

In Python, the built-in open function is the primary tool scripts use to
access the files on the underlying computer system. Since this function
is an inherent part of the Python language, you may already be familiar
with its basic workings. When called, the open function returns a new
file object that is connected to the external file; the file object has
methods that transfer data to and from the file and perform a variety of
file-related operations. The open function also provides a portable
interface to the underlying filesystem: it works the same way on every
platform on which Python runs.

Other file-related modules built into Python allow us to do things such
as manipulate lower-level descriptor-based files (os); copy, remove, and
move files and collections of files (os and shutil); store data and
objects in files by key (dbm and shelve); and access SQL databases
(sqlite3 and third-party add-ons). The last two of these categories are
related to database topics, addressed in Chapter 17.

In this section, we’ll take a brief tutorial look at the built-in
file object and explore a handful of more advanced file-related topics. As
usual, you should consult either Python’s library manual or reference
books such as Python Pocket Reference for further details and methods we
don’t have space to cover here. Remember, for quick interactive help, you
can also run dir(file) on an open file object to see an attributes list
that includes methods; help(file) for general help; and help(file.read)
for help on a specific method such as read, though the file object
implementation in 3.1 provides less information for help than the library
manual and other resources.

The File Object Model in Python 3.X

Just like the string types we noted in Chapter 2, file support in
Python 3.X is a bit richer than it was in the past. As we noted earlier,
in Python 3.X str strings always represent Unicode text (ASCII or wider),
and bytes and bytearray strings represent raw binary data. Python 3.X
draws a similar and related distinction between files containing text and
binary data:

  • Text files contain Unicode text. In your script, text file content
    is always a str string: a sequence of characters (technically,
    Unicode “code points”). Text files perform the automatic line-end
    translations described in this chapter by default and automatically
    apply Unicode encodings to file content: they encode to and decode
    from raw binary bytes on transfers to and from the file, according
    to a provided or default encoding name. Encoding is trivial for
    ASCII text, but may be sophisticated in other cases.

  • Binary files contain raw 8-bit bytes. In your script, binary file
    content is always a byte string, usually a bytes object: a sequence
    of small integers, which supports most str operations and displays
    as ASCII characters whenever possible. Binary files perform no
    translations of data when it is transferred to and from files: no
    line-end translations or Unicode encodings are performed.

In practice, text files are used for all truly text-related data,
and binary files store items like packed binary data, images, audio
files, executables, and so on. As a programmer you distinguish between
the two file types in the mode string argument you pass to open: adding
a “b” (e.g., 'rb', 'wb') means the file contains binary data. For coding
new file content, use normal strings for text (e.g., 'spam' or the
result of bytes.decode()) and byte strings for binary (e.g., b'spam' or
the result of str.encode()).
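The text/binary split can be seen by writing one non-ASCII string and
reading it back both ways. This is only a minimal sketch: the temporary
file path, the UTF-8 encoding, and the sample string are assumptions
chosen for illustration, not details taken from the text above.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'textbin.txt')

# Text mode: we pass a str; Python encodes it to bytes on the way out.
# newline='\n' turns off line-end translation so the bytes are the same
# on every platform.
with open(path, 'w', encoding='utf-8', newline='\n') as f:
    f.write('sp\xe4m\n')

# Text mode read: the bytes are decoded back into a str of characters.
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()

# Binary mode read: raw bytes, with no decoding and no line-end changes.
with open(path, 'rb') as f:
    raw = f.read()

print(repr(text), len(text))   # 5 characters
print(repr(raw), len(raw))     # 6 bytes: the non-ASCII character takes 2
```

The str and bytes results hold the same data, but only the bytes form
shows how the text is actually stored in the file.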

Unless your file scope is limited to ASCII text, the 3.X
text/binary distinction can sometimes impact your code. Text files
create and require str strings, and binary files use byte strings;
because you cannot freely mix the two string types in expressions, you
must choose file mode carefully. Many built-in tools we’ll use in this
book make the choice for us; the struct and pickle modules, for instance,
deal in byte strings in 3.X, and the xml package deals in Unicode str.
You must even be aware of the 3.X text/binary distinction when using
system tools like pipe descriptors and sockets, because they transfer
data as byte strings today (though their content can be decoded and
encoded as Unicode text if needed).

Moreover, because text-mode files require that content be
decodable per a Unicode encoding scheme, you must read undecodable file
content in binary mode, as byte strings (or catch Unicode exceptions in
try statements and skip the file altogether). This may include both truly
binary files as well as text files that use encodings that are nondefault
and unknown. As we’ll see later in this chapter, because str strings are
always Unicode in 3.X, it’s sometimes also necessary to select byte
string mode for the names of files in directory tools such as os.listdir,
glob.glob, and os.walk if they cannot be decoded (passing in byte strings
essentially suppresses decoding).
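One way to handle undecodable content is sketched below: try text mode
first, and fall back to binary mode if decoding fails. The helper name
load, the UTF-8 default, and the sample bytes are all hypothetical;
skipping the file instead of falling back is an equally valid policy.

```python
import os
import tempfile

# Hypothetical scratch file holding bytes that are not valid UTF-8.
path = os.path.join(tempfile.mkdtemp(), 'blob.dat')
with open(path, 'wb') as f:
    f.write(b'\xff\xfe\x00spam')

def load(path, encoding='utf-8'):
    """Return file content as str if decodable, else as raw bytes."""
    try:
        with open(path, 'r', encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError:
        # Content is not text in this encoding: read it undecoded.
        with open(path, 'rb') as f:
            return f.read()

data = load(path)
print(type(data).__name__, repr(data))
```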

In fact, we’ll see examples where the Python 3.X distinction
between str text and bytes binary pops up in tools beyond basic files
throughout this book: in Chapters 5 and 12 when we explore sockets; in
Chapters 6 and 11 when we’ll need to ignore Unicode errors in file and
directory searches; in Chapter 12, where we’ll see how client-side
Internet protocol modules such as FTP and email, which run atop sockets,
imply file modes and encoding requirements; and more.

But just as for string types, although we will see some of these
concepts in action in this chapter, we’re going to take much of this
story as a given here. File and string objects are core language
material and are prerequisite to this text. As mentioned earlier,
because they are addressed by a 45-page chapter in the book Learning
Python, Fourth Edition, I won’t repeat their coverage in full in this
book. If you find yourself confused by the Unicode and binary file and
string concepts in the following sections, I encourage you to refer to
that text or other resources for more background information in this
domain.

Using Built-in File Objects

Despite the text/binary dichotomy in Python 3.X, files are still very
straightforward to use. For most purposes, in fact, the open built-in
function and its file objects are all you need to remember to process
files in your scripts. The file object returned by open has methods for
reading data (read, readline, readlines); writing data (write,
writelines); freeing system resources (close); moving to arbitrary
positions in the file (seek); forcing data in output buffers to be
transferred to disk (flush); fetching the underlying file handle
(fileno); and more. Since the built-in file object is so easy to use,
let’s jump right into a few interactive examples.

Output files

To make a new file, call open with two arguments: the external name of
the file to be created and a mode string w (short for write). To store
data on the file, call the file object’s write method with a string
containing the data to store, and then call the close method to close
the file. File write calls return the number of characters or bytes
written (which we’ll sometimes omit in this book to save space), and as
we’ll see, close calls are often optional, unless you need to open and
read the file again during the same program or session:

C:\temp> python
>>> file = open('data.txt', 'w')         # open output file object: creates
>>> file.write('Hello file world!\n')    # writes strings verbatim
18
>>> file.write('Bye   file world.\n')    # returns number chars/bytes written
18
>>> file.close()                         # closed on gc and exit too

And that’s it—you’ve just generated a brand-new text file on
your computer, regardless of the computer on which you type this
code:

C:\temp> dir data.txt /B
data.txt
C:\temp> type data.txt
Hello file world!
Bye   file world.

There is nothing unusual about the new file; here, I use the DOS dir
and type commands to list and display the new file, but it shows up in a
file explorer GUI, too.

Opening

In the open function call shown in the preceding example, the first
argument can optionally specify a complete directory path as part of
the filename string. If we pass just a simple filename without a
path, the file will appear in Python’s current working directory.
That is, it shows up in the place where the code is run. Here, the
directory C:\temp on my machine is implied by the bare filename
data.txt, so this actually creates a file at C:\temp\data.txt. More
accurately, the filename is relative to the current working
directory if it does not include a complete absolute directory path.
See Current Working Directory (Chapter 3), for a refresher on this
topic.

Also note that when opening in w mode, Python either creates the
external file if it does not yet exist or erases the file’s current
contents if it is already present on your machine (so be careful out
there; you’ll delete whatever was in the file before).
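Both points about opening can be checked with a short sketch. The
temporary directory and file contents here are assumptions for
illustration; the script only computes where a bare filename would
resolve and never creates a file in the current directory.

```python
import os
import tempfile

# A bare filename is resolved against the current working directory.
resolved = os.path.abspath('data.txt')
expected = os.path.join(os.getcwd(), 'data.txt')
print(resolved)

# Hypothetical scratch file used to show that 'w' mode erases old content.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')

with open(path, 'w') as f:        # creates the file
    f.write('first version\n')

with open(path, 'w') as f:        # reopening in 'w' mode truncates it
    f.write('second\n')

with open(path) as f:
    content = f.read()

print(repr(content))              # only the second write survives
```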

Writing

Notice that we added an explicit \n end-of-line character to lines
written to the file; unlike the print built-in function, file object
write methods write exactly what they are passed without adding any
extra formatting. The string passed to write shows up character for
character on the external file. In text files, data written may
undergo line-end or Unicode translations which we’ll describe ahead,
but these are undone when the data is later read back.

Output files also sport a writelines method, which simply writes all of
the strings in a list one at a time without adding any extra formatting.
For example, here is a writelines equivalent to the two write calls
shown earlier:

file.writelines(['Hello file world!\n', 'Bye   file world.\n'])

This call isn’t as commonly used (and can be emulated with a simple
for loop or other iteration tool), but it is convenient in scripts that
save output in a list to be written later.
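The equivalence between writelines and a simple loop can be sketched as
follows. To keep the example self-contained, it writes to in-memory
io.StringIO objects rather than real files; that substitution is an
assumption made here, though both support the same write methods.

```python
import io

lines = ['Hello file world!\n', 'Bye   file world.\n']

# One call: writes each string in the list, adding nothing.
out1 = io.StringIO()
out1.writelines(lines)

# The same effect with an explicit loop over write calls.
out2 = io.StringIO()
for line in lines:
    out2.write(line)

print(out1.getvalue() == out2.getvalue())
```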

Closing

The file close method used earlier finalizes file contents and frees up
system resources. For instance, closing forces buffered output data
to be flushed out to disk. Normally, files are automatically closed
when the file object is garbage collected by the interpreter (that
is, when it is no longer referenced). This includes all remaining
open files when the Python session or program exits. Because of
that, close calls are often optional. In fact, it’s common to see
file-processing code in Python in this idiom:

open('somefile.txt', 'w').write("G'day Bruce\n")    # write to temporary object
open('somefile.txt', 'r').read()                    # read from temporary object

Since both these expressions make a temporary file object, use
it immediately, and do not save a reference to it, the file object
is reclaimed right after data is transferred, and is automatically
closed in the process. There is usually no need for such code to
call the close method explicitly.

In some contexts, though, you may wish to explicitly close
anyhow:

  • For one, because the Jython implementation relies on
    Java’s garbage collector, you can’t always be as sure about when
    files will be reclaimed as you can in standard Python. If you
    run your Python code with Jython, you may need to close manually
    if many files are created in a short amount of time (e.g., in a
    loop), in order to avoid running out of file resources on
    operating systems where this matters.

  • For another, some IDEs, such as Python’s standard IDLE
    GUI, may hold on to your file objects longer than you expect (in
    stack tracebacks of prior errors, for instance), and thus
    prevent them from being garbage collected as soon as you might
    expect. If you write to an output file in IDLE, be sure to
    explicitly close (or flush) your file if you need to reliably
    read it back during the same IDLE session. Otherwise, output
    buffers might not be flushed to disk and your file may be
    incomplete when read.

  • And while it seems very unlikely today, it’s not
    impossible that this auto-close on reclaim file feature could
    change in the future. This is technically a feature of the file
    object’s implementation, which may or may not be considered part
    of the language definition over time.

For these reasons, manual close calls are not a bad idea in
nontrivial programs, even if they are technically not required.
Closing is a generally harmless but robust habit to
form.

Ensuring file closure: Exception handlers and context managers

Manual file close method calls are easy in straight-line code, but
how do you ensure file closure when exceptions might kick your program
beyond the point where the close call is coded? First of all, make
sure you must: files close themselves when they are collected, and this
will happen eventually, even when exceptions occur.

If closure is required, though, there are two basic alternatives: the
try statement’s finally clause is the most general, since it allows you
to provide general exit actions for any type of exceptions:

myfile = open(filename, 'w')
try:
    ...process myfile...
finally:
    myfile.close()

In recent Python releases, though, the with statement provides a more
concise alternative for some specific objects and exit actions,
including closing files:

with open(filename, 'w') as myfile:
    ...process myfile, auto-closed on statement exit...

This statement relies on the file object’s context manager: code
automatically run both on statement entry and on statement exit
regardless of exception behavior. Because the file object’s exit code
closes the file automatically, this guarantees file closure whether an
exception occurs during the statement or not.
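As a minimal sketch of that guarantee, the following raises an exception
inside a with block and then checks that the file was closed anyway. The
temporary path and the simulated RuntimeError are assumptions made for
illustration only.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')

try:
    with open(path, 'w') as myfile:
        myfile.write('partial\n')
        raise RuntimeError('simulated failure')   # leaves the block early
except RuntimeError:
    pass

# The name is still bound after the block, but the file has been closed
# by the context manager's exit code, which also flushed the buffer.
closed_after = myfile.closed

with open(path) as f:
    saved = f.read()

print(closed_after, repr(saved))
```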

The with statement is notably shorter (by 3 lines) than the try/finally
alternative, but it’s also less general: with applies only to objects
that support the context manager protocol, whereas try/finally allows
arbitrary exit actions for arbitrary exception contexts. While some
other object types have context managers, too (e.g., thread locks), with
is limited in scope. In fact, if you want to remember just one exit
actions option, try/finally is the most inclusive. Still, with yields
less code for files that must be closed and can serve well in such
specific roles. It can even save a line of code when no exceptions are
expected (albeit at the expense of further nesting and indenting file
processing logic):

myfile = open(filename, 'w')               # traditional form
...process myfile...
myfile.close()

with open(filename) as myfile:             # context manager form
    ...process myfile...

In Python 3.1 and later, this statement can also specify multiple
(a.k.a. nested) context managers: any number of context manager items
may be separated by commas, and multiple items work the same as nested
with statements. In general terms, the 3.1 and later code:

with A() as a, B() as b:
    ...statements...

Runs the same as the following, which works in 3.1, 3.0, and
2.6:

with A() as a:
    with B() as b:
        ...statements...

For example, when the with statement block exits in the following, both
files’ exit actions are automatically run to close the files, regardless
of exception outcomes:

with open('data') as fin, open('results', 'w') as fout:
    for line in fin:
        fout.write(transform(line))

Context manager–dependent code like this seems to have become
more common in recent years, but this is likely at least in part
because newcomers are accustomed to languages that require manual
close calls in all cases. In most contexts there is no need to wrap
all your Python file-processing code in with statements: the file
object’s auto-close-on-collection behavior often suffices, and manual
close calls are enough for many other scripts. You should use the with
or try options outlined here only if you must close, and only in the
presence of potential exceptions. Since standard C Python automatically
closes files on collection, though, neither option is required in many
(and perhaps most) scripts.

Input files

Reading data from external files is just as easy as writing, but
there are more methods that let us load data in a variety of modes.
Input text files are opened with either a mode flag of r (for “read”) or
no mode flag at all: it defaults to r if omitted, and it commonly is.
Once opened, we can read the lines of a text file with the readlines
method:

C:\temp> python
>>> file = open('data.txt')              # open input file object: 'r' default
>>> lines = file.readlines()             # read into line string list
>>> for line in lines:                   # BUT use file line iterator! (ahead)
...     print(line, end='')              # lines have a '\n' at end
...
Hello file world!
Bye   file world.

The readlines method loads the entire contents of the file into memory
and gives it to our scripts as a list of line strings that we can step
through in a loop. In fact, there are many ways to read an input file:

file.read()

Returns a string containing all the characters (or bytes) stored in
the file

file.read(N)

Returns a string containing the next N characters (or bytes) from the
file

file.readline()

Reads through the next \n and returns a line string

file.readlines()

Reads the entire file and returns a list of line strings

Let’s run these method calls to read files, lines, and characters from
a text file. The seek(0) call is used here before each test to rewind
the file to its beginning (more on this call in a moment):

>>> file.seek(0)                         # go back to the front of file
>>> file.read()                          # read entire file into string
'Hello file world!\nBye   file world.\n'
>>> file.seek(0)                         # read entire file into lines list
>>> file.readlines()
['Hello file world!\n', 'Bye   file world.\n']
>>> file.seek(0)
>>> file.readline()                      # read one line at a time
'Hello file world!\n'
>>> file.readline()
'Bye   file world.\n'
>>> file.readline()                      # empty string at end-of-file
''
>>> file.seek(0)                         # read N (or remaining) chars/bytes
>>> file.read(1), file.read(8)           # empty string at end-of-file
('H', 'ello fil')

All of these input methods let us be specific about how much to
fetch. Here are a few rules of thumb about which to choose:

  • read() and readlines() load the entire file into memory all at
    once. That makes them handy for grabbing a file’s contents with as
    little code as possible. It also makes them generally fast, but
    costly in terms of memory for huge files: loading a multigigabyte
    file into memory is not generally a good thing to do (and might not
    be possible at all on a given computer).

  • On the other hand, because the readline() and read(N) calls fetch
    just part of the file (the next line or N-character-or-byte block),
    they are safer for potentially big files but a bit less convenient
    and sometimes slower. Both return an empty string when they reach
    end-of-file. If speed matters and your files aren’t huge, read or
    readlines may be a generally better choice.

  • See also the discussion of the newer file iterators in the next
    section. As we’ll see, iterators combine the convenience of
    readlines() with the space efficiency of readline() and are the
    preferred way to read text files by lines today.
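The partial-read options in these rules of thumb can be sketched as
follows: a file line iterator, and a read(N) loop that stops at the
empty string marking end-of-file. The helper name read_blocks, the block
size of 8, and the temporary file are all hypothetical choices made for
illustration.

```python
import os
import tempfile

# Hypothetical scratch file with the same two lines used in this section.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
with open(path, 'w') as f:
    f.write('Hello file world!\nBye   file world.\n')

# Line iterator: fetches one line at a time instead of the whole file.
with open(path) as f:
    lines = [line.rstrip('\n') for line in f]

def read_blocks(fileobj, size):
    """Yield blocks of up to `size` characters until end-of-file."""
    while True:
        block = fileobj.read(size)
        if not block:              # empty string signals end-of-file
            break
        yield block

with open(path) as f:
    blocks = list(read_blocks(f, 8))

print(lines)
print(blocks)
```

Neither loop holds more than one line or block of the file at a time
while reading; the lists are built here only to display the results.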

The seek(0) call used repeatedly here means “go back to the start of
the file.” In our example, it is an alternative to reopening the file
each time. In files, all read and write operations take place at the
current position; files normally start at offset 0 when opened and
advance as data is transferred. The seek call simply lets us move to a
new position for the next transfer operation. More on this method later
when we explore random access files.
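The effect of the current position can be sketched with a few reads and
seeks. Binary mode is used here so that offsets are plain byte counts;
that choice, the temporary path, and the offset 6 are assumptions for
illustration.

```python
import os
import tempfile

# Hypothetical scratch file holding one line of ASCII bytes.
path = os.path.join(tempfile.mkdtemp(), 'data.bin')
with open(path, 'wb') as f:
    f.write(b'Hello file world!\n')

with open(path, 'rb') as f:
    first = f.read(5)      # reads from offset 0, leaving position at 5
    f.seek(0)              # rewind to the start of the file
    again = f.read(5)      # same bytes as the first read
    f.seek(6)              # jump to an arbitrary offset
    word = f.read(4)       # reads the 4 bytes starting at offset 6

print(first, again, word)
```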
