Programming Python (26 page)

Authors: Mark Lutz

[11] In fact, glob just uses the standard fnmatch module to match name
patterns; see the fnmatch description in Chapter 6’s find module example
for more details.

Chapter 5. Parallel System Tools
“Telling the Monkeys What to Do”

Most computers
spend a lot of time doing nothing. If you start a system
monitor tool and watch the CPU utilization, you’ll see what I mean—it’s
rare to see one hit 100 percent, even when you are running multiple
programs.[12] There are just too many delays built into software: disk
accesses, network traffic, database queries, waiting for users to click a
button, and so on. In fact, the majority of a modern CPU’s capacity is
often spent in an idle state; faster chips help speed up performance
demand peaks, but much of their power can go largely unused.

Early on in computing, programmers realized that they could tap into
such unused processing power by running more than one program at the same
time. By dividing the CPU’s attention among a set of tasks, its capacity
need not go to waste while any given task is waiting for an external event
to occur. The technique is usually called parallel processing (and
sometimes “multiprocessing” or even “multitasking”) because many tasks
seem to be
performed at once, overlapping and parallel in time. It’s at the heart of
modern operating systems, and it gave rise to the notion of
multiple-active-window computer interfaces we’ve all come to take for
granted. Even within a single program, dividing processing into tasks that
run in parallel can make the overall system faster, at least as measured
by the clock on your wall.

Just as important is that modern software systems are expected to be
responsive to users regardless of the amount of work they must perform
behind the scenes. It’s usually unacceptable for a program to stall while
busy carrying out a request. Consider an email-browser user interface, for
example; when asked to fetch email from a server, the program must
download text from a server over a network. If you have enough email or a
slow enough Internet link, that step alone can take minutes to finish. But
while the download task proceeds, the program as a whole shouldn’t
stall—it still must respond to screen redraws, mouse clicks, and so
on.

Parallel processing comes to the rescue here, too. By performing
such long-running tasks in parallel with the rest of the program, the
system at large can remain responsive no matter how busy some of its parts
may be. Moreover, the parallel processing model is a natural fit for
structuring such programs and others; some tasks are more easily
conceptualized and coded as components running as independent, parallel
entities.

There are two fundamental ways to get tasks running at the same time
in Python: process forks and spawned threads. Functionally, both rely on
underlying operating system
services to run bits of Python code in parallel. Procedurally, they are
very different in terms of interface, portability, and communication. For
instance, at this writing direct process forks are not supported on
Windows under standard Python (though they are under Cygwin Python on
Windows).

By contrast, Python’s thread support works on all major platforms.
Moreover, the os.spawn family of calls provides additional ways to launch
programs in a platform-neutral way that is similar to forks, and the
os.popen and os.system calls and subprocess module we studied in
Chapters 2 and 3 can be used to portably spawn programs with shell
commands. The newer multiprocessing module offers additional ways to
run processes portably in many contexts.

In this chapter, which is a continuation of our look at system
interfaces available to Python programmers, we explore Python’s built-in
tools for starting tasks in parallel, as well as communicating with those
tasks. In some sense, we’ve already begun doing so: os.system, os.popen,
and subprocess, which we learned and applied over
the last three chapters, are a fairly portable way to spawn and speak with
command-line programs, too. We won’t repeat full coverage of those tools
here.

Instead, our emphasis in this chapter is on introducing more direct
techniques—forks, threads, pipes, signals, sockets, and other launching
techniques—and on using Python’s built-in tools that support them, such as
the os.fork call and the threading, queue, and multiprocessing
modules. In the next chapter
(and in the remainder of this book), we use these techniques in more
realistic programs, so be sure you understand the basics here before
flipping ahead.

One note up front: although the process, thread, and IPC mechanisms
we will explore in this chapter are the primary parallel processing tools
in Python scripts, the third-party domain offers additional options which
may serve more advanced or specialized roles. As just one example, the MPI
for Python system allows Python scripts to also employ the Message Passing
Interface (MPI) standard, allowing Python programs to exploit multiple
processors in various ways (see the Web for details). While such specific
extensions are
beyond our scope in this book, the fundamentals of multiprocessing that we
will explore here should apply to more advanced techniques you may
encounter in your parallel futures.

[12] To watch on Windows, click the Start button, select All Programs
→ Accessories → System Tools → Resource Monitor, and monitor
CPU/Processor usage (Task Manager’s Performance tab may give similar
results). The graph rarely climbed above single-digit percentages on
my laptop machine while writing this footnote (at least until I typed
while True: pass in a Python interactive session window…).

Forking Processes

Forked processes are a traditional way to structure parallel tasks,
and they are a fundamental part of the Unix tool set. Forking is a
straightforward way to start an independent program, whether it is
different from the calling program or not. Forking is based on the notion
of copying programs: when a program calls the fork
routine, the operating system makes a new copy of that program and its
process in memory and starts running that copy in parallel with the
original. Some systems don’t really copy the original program (it’s an
expensive operation), but the new copy works as if it were a literal
copy.

After a fork operation, the original copy of the program is called
the parent process, and the copy created by os.fork is called the child
process. In general, parents can
make any number of children, and children can create child processes of
their own; all forked processes run independently and in parallel under
the operating system’s control, and children may continue to run after
their parent exits.

This is probably simpler in practice than in theory, though. The
Python script in Example 5-1 forks new child processes until you type
the letter q at the console.

Example 5-1. PP4E\System\Processes\fork1.py

"forks child processes until you type 'q'"
import os

def child():
    print('Hello from child', os.getpid())
    os._exit(0)                                 # else goes back to parent loop

def parent():
    while True:
        newpid = os.fork()
        if newpid == 0:
            child()
        else:
            print('Hello from parent', os.getpid(), newpid)
        if input() == 'q': break

parent()

Python’s process forking tools, available in the os module, are simply
thin wrappers over standard forking calls in the system library also used
by C language programs. To start a new, parallel process, call the
os.fork built-in function. Because this function generates a copy of the
calling program, it returns a different value in each copy: zero in the
child process and the process ID of the new child in the parent.
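
To make the two return values concrete, here is a minimal sketch (not one of the book’s examples) that forks once and has the parent wait for the child. The fork_and_wait name, the exit code of 7, and the use of os.waitpid and os.WEXITSTATUS to collect the child’s status are assumptions added for illustration; like fork itself, it runs only on Unix-like platforms:

```python
import os

def fork_and_wait(exit_code=7):
    """Fork once; return (pid seen by parent, pid reaped, child's exit code)."""
    pid = os.fork()                          # returns twice: once per process
    if pid == 0:
        # Child branch: fork returned zero here.
        os._exit(exit_code)                  # leave at once, skipping parent-only code
    # Parent branch: fork returned the new child's process ID.
    reaped_pid, status = os.waitpid(pid, 0)  # block until that child exits
    return pid, reaped_pid, os.WEXITSTATUS(status)

if __name__ == '__main__':
    child_pid, reaped_pid, code = fork_and_wait()
    print('parent', os.getpid(), 'forked', child_pid, 'which exited with', code)
```

The parent never sees a zero from os.fork, and the child never sees its own ID from it; each copy simply follows the branch that matches the value it was handed.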

Programs generally test this result to begin different processing in
the child only; this script, for instance, runs the child function in
child processes only.[13]

Because forking is ingrained in the Unix programming model, this script
works well on Unix, Linux, and modern Macs. Unfortunately, this script
won’t work on the standard version of Python for Windows today, because
fork is too much at odds with the Windows model. Python scripts can
always spawn threads on Windows, and the multiprocessing module described
later in this chapter provides an alternative for running processes
portably, which can obviate the need for process forks on Windows in
contexts that conform to its constraints (albeit at some potential cost
in low-level control).

The script in Example 5-1 does work on Windows, however, if you use the
Python shipped with the Cygwin system (or build one of your own from
source code with Cygwin’s libraries). Cygwin is a free, open source
system that provides full Unix-like functionality for Windows (and is
described further in More on Cygwin Python for Windows). You can fork
with Python on Windows under Cygwin, even though its behavior is not
exactly the same as true Unix forks. Because it’s close enough for this
book’s examples, though, let’s use it to run our script live:

[C:\...\PP4E\System\Processes]$ python fork1.py
Hello from parent 7296 7920
Hello from child 7920
Hello from parent 7296 3988
Hello from child 3988
Hello from parent 7296 6796
Hello from child 6796
q

These messages represent three forked child processes; the unique
identifiers of all the processes involved are fetched and displayed with
the os.getpid call. A subtle point: the child process function is also
careful to exit explicitly with an os._exit call. We’ll discuss this call
in more detail later in this chapter, but if it’s not made, the child
process would live on after the child function returns (remember, it’s
just a copy of the original process). The net effect is that the child
would go back to the loop in parent and start forking children of its own
(i.e., the parent would have grandchildren). If you delete the exit call
and rerun, you’ll likely have to type more than one q to stop, because
multiple processes are running in the parent function.

In Example 5-1, each process exits very soon after it starts, so there’s
little overlap in time. Let’s do something slightly more sophisticated to
better illustrate multiple forked processes running in parallel.
Example 5-2 starts up 5 copies of itself, each copy counting up to 5 with
a one-second delay between iterations. The time.sleep standard library
call simply pauses the calling process for a number of seconds (you can
pass a floating-point value to pause for fractions of seconds).

Example 5-2. PP4E\System\Processes\fork-count.py

"""
fork basics: start 5 copies of this program running in parallel with
the original; each copy counts up to 5 on the same stdout stream--forks
copy process memory, including file descriptors; fork doesn't currently
work on Windows without Cygwin: use os.spawnv or multiprocessing on
Windows instead; spawnv is roughly like a fork+exec combination;
"""
import os, time

def counter(count):                                # run in new process
    for i in range(count):
        time.sleep(1)                              # simulate real work
        print('[%s] => %s' % (os.getpid(), i))

for i in range(5):
    pid = os.fork()
    if pid != 0:
        print('Process %d spawned' % pid)          # in parent: continue
    else:
        counter(5)                                 # else in child/new process
        os._exit(0)                                # run function and exit

print('Main process exiting.')                     # parent need not wait

When run, this script starts 5 processes immediately and exits. All
5 forked processes check in with their first count display one second
later and every second thereafter. Notice that child processes continue to
run, even if the parent process that created them terminates:

[C:\...\PP4E\System\Processes]$ python fork-count.py
Process 4556 spawned
Process 3724 spawned
Process 6360 spawned
Process 6476 spawned
Process 6684 spawned
Main process exiting.
[4556] => 0
[3724] => 0
[6360] => 0
[6476] => 0
[6684] => 0
[4556] => 1
[3724] => 1
[6360] => 1
[6476] => 1
[6684] => 1
[4556] => 2
[3724] => 2
[6360] => 2
[6476] => 2
[6684] => 2
...more output omitted...

The output of all of these processes shows up on the same screen,
because all of them share the standard output stream (and a system prompt
may show up along the way, too). Technically, a forked process gets a copy
of the original process’s global memory, including open file descriptors.
Because of that, global objects like files start out with the same values
in a child process, so all the processes here are tied to the same single
stream. But it’s important to remember that global memory is copied, not
shared; if a child process changes a global object, it changes only its
own copy. (As we’ll see, this works differently in threads, the topic of
the next section.)
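
A quick way to convince yourself that forked memory is copied rather than shared is to change a global in a child and then look at it again in the parent. The following is only a sketch under stated assumptions: the counter global and the child_changes_global function are hypothetical names, and the child passes its value back through its exit status (read with os.waitpid and os.WEXITSTATUS), which works only for small integers from 0 to 255:

```python
import os

counter = 0                                  # global: each forked child gets a copy

def child_changes_global():
    """Return (value the child ended with, value still held by the parent)."""
    global counter
    counter = 0
    pid = os.fork()
    if pid == 0:
        counter += 42                        # changes the child's copy only
        os._exit(counter)                    # report the child's value as exit status
    _, status = os.waitpid(pid, 0)           # parent: wait for the child to finish
    return os.WEXITSTATUS(status), counter   # parent's counter was never touched

if __name__ == '__main__':
    child_value, parent_value = child_changes_global()
    print('child saw', child_value, 'but parent still has', parent_value)
```

If the two processes shared one counter, both numbers would match; because each has its own copy, the parent’s value stays at zero.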

The fork/exec Combination

In Examples 5-1 and 5-2, child processes simply ran a function within
the Python program and then exited. On Unix-like platforms, forks are
often the basis of starting independently running programs that are
completely different from the program that performed the fork call. For
instance, Example 5-3 forks new processes until we type q again, but
child processes run a brand-new program instead of calling a function in
the same file.

Example 5-3. PP4E\System\Processes\fork-exec.py

"starts programs until you type 'q'"
import os

parm = 0
while True:
    parm += 1
    pid = os.fork()
    if pid == 0:                                              # copy process
        os.execlp('python', 'python', 'child.py', str(parm))  # overlay program
        assert False, 'error starting program'                # shouldn't return
    else:
        print('Child is', pid)
        if input() == 'q': break

If you’ve done much Unix development, the fork/exec combination will
probably look familiar. The main thing to notice is the os.execlp call in
this code. In a nutshell, this call replaces (overlays) the program
running in the current process with a brand new program. Because of that,
the combination of os.fork and os.execlp means start a new process and
run a new program in that process; in other words, launch a new program
in parallel with the original program.

os.exec call formats

The arguments to os.execlp specify the program to be run by giving
command-line arguments used to start the program (i.e., what Python
scripts know as sys.argv). If successful, the new program begins running
and the call to os.execlp itself never returns (since the original
program has been replaced, there’s really nothing to return to). If the
call does return, an error has occurred, so we code an assert after it
that will always raise an exception if reached.
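
One detail worth adding here: in Python, a failed os.exec call normally reports the problem by raising an OSError (for instance, when the program cannot be found) rather than by returning. The sketch below, which is not one of the book’s examples, catches that exception in the child. The run_program name is hypothetical, and the exit code of 127 for “could not start” is just a convention borrowed from Unix shells:

```python
import os, sys

def run_program(program, *args):
    """Fork, exec program in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        try:
            # "v" form: the command line is passed as one list; "p": search PATH
            os.execvp(program, [program, *args])
        except OSError:
            pass                             # exec failed: nothing was overlaid
        os._exit(127)                        # assumed "could not start" exit code
    _, status = os.waitpid(pid, 0)           # parent: wait for the child
    return os.WEXITSTATUS(status)

if __name__ == '__main__':
    print(run_program(sys.executable, '-c', 'import sys; sys.exit(3)'))
    print(run_program('no-such-program-pp4e-demo'))
```

On success the child’s code after os.execvp is never run, because the new program has replaced it; only a failed launch reaches the os._exit call.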

There are a handful of os.exec variants in the Python standard
library; some allow us to configure environment variables for the new
program, pass command-line arguments in different forms, and so on.
All are available on both Unix and Windows, and they replace the
calling program (i.e., the Python interpreter). exec comes in eight
flavors, which can be a bit confusing unless you generalize:

os.execv(program, commandlinesequence)

The basic “v” exec form is passed an executable program’s name, along
with a list or tuple of command-line argument strings used to run the
executable (that is, the words you would normally type in a shell to
start a program).

os.execl(program, cmdarg1, cmdarg2,... cmdargN)

The basic “l” exec form is passed an executable’s name, followed by one
or more command-line arguments passed as individual function arguments.
This is the same as os.execv(program, (cmdarg1, cmdarg2,...)).

os.execlp
os.execvp

Adding the letter p to the execv and execl names means that Python will
locate the executable’s directory using your system search-path setting
(i.e., PATH).

os.execle
os.execve

Adding a letter e to the execv and execl names means an extra, last
argument is a dictionary containing shell environment variables to send
to the program.

os.execvpe
os.execlpe

Adding the letters p and e to the basic exec names means to use the
search path and to accept a shell environment settings dictionary.

So when the script in Example 5-3 calls os.execlp, individually passed
parameters specify a command line for the program to be run on, and the
word python maps to an executable file according to the underlying system
search-path setting environment variable (PATH). It’s as if we were
running a command of the form python child.py 1 in a shell, but with a
different command-line argument on the end each time.
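
As a rough illustration of how the letters combine, the sketch below uses the “v” and “e” parts together: os.execve takes the command line as one sequence and an environment dictionary as its last argument. The run_with_env function, the PP4E_DEMO_CODE variable, and the idea of echoing the value back through the child’s exit status are all made up for this example, and sys.executable stands in for the python name that os.execlp would look up on PATH:

```python
import os, sys

def run_with_env(value):
    """Run a child Python whose exit code is read from an environment variable."""
    env = dict(os.environ, PP4E_DEMO_CODE=str(value))    # hypothetical variable
    script = "import os, sys; sys.exit(int(os.environ['PP4E_DEMO_CODE']))"
    pid = os.fork()
    if pid == 0:
        try:
            # "v": arguments as a list; "e": environment dictionary comes last.
            # The "l" spelling of the same call would be:
            #   os.execle(sys.executable, sys.executable, '-c', script, env)
            os.execve(sys.executable, [sys.executable, '-c', script], env)
        except OSError:
            pass
        os._exit(127)                        # assumed code for a failed launch
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)

if __name__ == '__main__':
    print(run_with_env(5))
```

Because this form has no p in its name, the program must be given as a full path, which is why the sketch passes sys.executable rather than a bare python.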

Spawned child program

Just as when typed at a shell, the string of arguments passed to
os.execlp by the fork-exec script in Example 5-3 starts another Python
program file, as shown in Example 5-4.

Example 5-4. PP4E\System\Processes\child.py

import os, sys
print('Hello from child', os.getpid(), sys.argv[1])

Here is this code in action on Linux. It doesn’t look much different
from the original fork1.py, but it’s really running a new program in each
forked process. More observant readers may notice that the child process
ID displayed is the same in the parent program and the launched child.py
program; os.execlp simply overlays a program in the same process:

[C:\...\PP4E\System\Processes]$ python fork-exec.py
Child is 4556
Hello from child 4556 1
Child is 5920
Hello from child 5920 2
Child is 316
Hello from child 316 3
q

There are other ways to start up programs in Python besides the
fork/exec combination. For example, the os.system and os.popen calls and
subprocess module which we explored in Chapters 2 and 3 allow us to spawn
shell commands. And the os.spawnv call and multiprocessing module, which
we’ll meet later in this chapter, allow us to start independent programs
and processes more portably. In fact, we’ll see later that
multiprocessing’s process spawning model can be used as a sort of
portable replacement for os.fork in some contexts (albeit a less
efficient one) and used in conjunction with the os.exec* calls shown here
to achieve a similar effect in standard Windows Python.
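
For comparison, here is roughly what a launch like fork-exec.py’s looks like with the subprocess module, which runs on both Unix and standard Windows Python. This is a sketch rather than one of the book’s examples: the run_child name is hypothetical, and the child’s code is passed inline with -c instead of living in child.py so that the snippet is self-contained:

```python
import subprocess, sys

def run_child(parm):
    """Start a new Python process; return (exit code, its captured output)."""
    code = "import sys; print('Hello from child', sys.argv[1])"
    result = subprocess.run(
        [sys.executable, '-c', code, str(parm)],   # command line as a list
        capture_output=True, text=True)            # collect stdout as a string
    return result.returncode, result.stdout.strip()

if __name__ == '__main__':
    for parm in range(1, 4):
        print(run_child(parm))
```

Unlike the fork/exec version, this call waits for each child to finish before returning, so the children here run one after another instead of in parallel.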

We’ll see more process fork examples later in this chapter,
especially in the program exits and process communication sections, so
we’ll forgo additional examples here. We’ll also discuss additional
process topics in later chapters of this book. For instance, forks are
revisited in Chapter 12 to deal with servers and their zombies: dead
processes lurking in system tables after their demise. For now, let’s
move on to threads, a subject which at least some programmers find to be
substantially less frightening…

More on Cygwin Python for Windows

As mentioned, the os.fork call is present in the Cygwin version of
Python on Windows. Even though this
call is missing in the standard version of Python for Windows, you
can fork processes on Windows with Python if you install and use
Cygwin. However, the Cygwin fork call is not as efficient and does
not work exactly the same as a fork on true Unix systems.

Cygwin is a free, open source package which includes a library
that attempts to provide a Unix-like API for use on Windows
machines, along with a set of command-line tools that implement a
Unix-like environment. It makes it easier to apply Unix skills and
code on Windows computers.

According to its FAQ documentation, though, “Cygwin fork()
essentially works like a noncopy on write version of fork() (like
old Unix versions used to do). Because of this it can be a little
slow. In most cases, you are better off using the spawn family of
calls if possible.” Since this book’s fork examples don’t need to
care about performance, Cygwin’s fork suffices.

In addition to the fork call, Cygwin provides other Unix tools
that would otherwise not be available on all flavors of Windows,
including os.mkfifo (discussed later in this chapter). It also comes
with a gcc compiler environment for building C
extensions for Python on Windows that will be familiar to Unix
developers. As long as you’re willing to use Cygwin libraries to
build your application and power your Python, it’s very close to
Unix on Windows.

Like all third-party libraries, though, Cygwin adds an extra
dependency to your systems. Perhaps more critically, Cygwin
currently uses the GNU GPL license, which adds distribution
requirements beyond those of standard Python. Unlike using Python
itself, shipping a program that uses Cygwin libraries may require
that your program’s source code be made freely available (though
RedHat offers a “buy-out” option which can relieve you of this
requirement). Note that this is a complex legal issue, and you
should study Cygwin’s license on your own if this may impact your
programs. Its license does, however, impose more constraints than
Python’s (Python uses a “BSD”-style license, not the GPL).

Despite licensing issues, Cygwin can still be a great way to
get Unix-like functionality on Windows without installing a
completely different operating system such as Linux, a more complete
but generally more complex option. For more details, see
http://cygwin.com or run a search for Cygwin on the Web.

See also the standard library’s multiprocessing module and os.spawn
family of calls, covered later in this chapter, for alternative ways to
start parallel tasks and programs on Unix and Windows that do not require
fork and exec calls. To run a simple function call in parallel on Windows
(rather than an external program), also see the section on standard
library threads later in this chapter. Threads, multiprocessing, and
os.spawn calls work on Windows in standard Python.

Fourth Edition Update: As I was updating this chapter in February 2010,
Cygwin’s official Python was still Python 2.5.2. To get Python 3.1
under Cygwin, it had to be built from its source code. If this is
still required when you read this, make sure you have gcc and make
installed on your Cygwin, then fetch
Python’s source code package from python.org, unpack it, and build
Python with the usual commands:

./configure
make
make test
sudo make install

This will install Python as python3. The same procedure works on all
Unix-like platforms; on OS X and Cygwin, the executable is called
python.exe; elsewhere it’s named python. You can generally skip
the last two of the above steps if you’re willing to run Python 3.1
out of your own build directory. Be sure to also check if Python 3.X
is a standard Cygwin package by the time you read this; when
building from source you may have to tweak a few files (I had to
comment out a #define in Modules/main.c), but these are too
specific and temporal to get into here.
