Programming Python (34 page)

Read Programming Python Online

Authors: Mark Lutz

Tags: #COMPUTERS / Programming Languages / Python

BOOK: Programming Python
12.01Mb size Format: txt, pdf, ePub

[
17
]
We will clarify the notions of “client” and “server” in the
Internet programming part of this book. There, we’ll communicate
with sockets (which we’ll see later in this chapter are roughly
like bidirectional pipes for programs running both across networks
and on the same machine), but the overall conversation model is
similar. Named pipes (fifos), described ahead, are also a better
match to the client/server model because they can be accessed by
arbitrary, unrelated processes (no forks are required). But as
we’ll see, the socket port model is generally used by most
Internet scripting protocols—email, for instance, is mostly just
formatted strings shipped over sockets between programs on
standard port numbers reserved for the email protocol.

The multiprocessing Module

Now that you
know about IPC alternatives and have had a chance to explore
processes, threads, and both process nonportability and thread GIL
limitations, it turns out that there is another alternative, which aims to
provide just the best of both worlds. As mentioned earlier, Python’s
standard library
multiprocessing
module
package allows scripts to spawn processes using an API very
similar to the
threading
module.

This relatively new package works on both Unix and Windows, unlike
low-level process forks. It supports a process spawning model which is
largely platform-neutral, and provides tools for related goals, such as
IPC, including locks, pipes, and queues. In addition, because it uses
processes instead of threads to run code in parallel, it effectively works
around the limitations of the thread GIL. Hence,
multiprocessing
allows the programmer to
leverage the capacity of multiple processors for parallel tasks, while
retaining much of the simplicity and portability of the threading
model.

Why multiprocessing?

So why learn yet another parallel processing paradigm and toolkit,
when we already have the threads, processes, and IPC tools like sockets,
pipes, and thread queues that we’ve already studied? Before we get into
the details, I want to begin with a few words about why you may (or may
not) care about this package. In more specific terms, although this
module’s performance may not compete with that of pure threads or
process forks for some applications, this module offers a compelling
solution for many:

  • Compared to raw process forks, you gain cross-platform
    portability and powerful IPC tools.

  • Compared to threads, you essentially trade some potential and
    platform
    -
    dependent
    extra task start-up time for
    the ability to run tasks in truly parallel fashion on multi-core or
    multi-CPU machines.

On the other hand, this module imposes some constraints and
tradeoffs that threads do not:

  • Since objects are copied across process boundaries, shared
    mutable state does not work as it does for threads—changes in one
    process are not generally noticed in the other. Really, freely
    shared state may be the most compelling reason to use threads; its
    absence in this module may prove limiting in some threading
    contexts.

  • Because this module requires pickleability for both its
    processes on Windows, as well as some of its IPC tools in general,
    some coding paradigms are difficult or nonportable—especially if
    they use bound methods or pass unpickleable objects such as sockets
    to spawned processes.

For instance, common coding patterns with lambda that work for the
threading
module cannot be used as
process target callables in this module on Windows, because they cannot
be pickled. Similarly, because bound object methods are also not
pickleable, a threaded program may require a more indirect design if it
either runs bound methods in its threads or implements thread exit
actions by posting arbitrary callables (possibly including bound
methods) on shared queues. The in-process model of threads supports such
direct lambda and bound method use, but the separate processes of
multiprocessing
do not.

In fact we’ll write a thread manager for GUIs in
Chapter 10
that relies on queueing
in-process
callables this way to implement
thread exit actions—the callables are queued by worker threads, and
fetched and dispatched by the main thread. Because the
threaded
PyMailGUI program we’ll code in
Chapter 14
both uses this manager to queue
bound methods for thread exit actions and runs bound methods as the main
action of a thread itself, it could not be directly translated to the
separate process model implied by
multiprocessing
.

Without getting into too many details here, to use
multiprocessing
, PyMailGUI’s actions might
have to be coded as simple functions or complete process subclasses for
pickleability. Worse, they may have to be implemented as simpler action
identifiers dispatched in the main process, if they update either the
GUI itself or object state in general —pickling results in an object
copy in the receiving process, not a reference to the original, and
forks on Unix essentially copy an entire process. Updating the state of
a mutable message cache copied by pickling it to pass to a new process,
for example, has no effect on the original.

The pickleability constraints for process arguments on Windows can
limit
multi
processing
’s scope in other contexts as
well. For instance, in
Chapter 12
, we’ll find
that this module doesn’t directly solve the lack of portability for the
os.fork
call for traditionally coded
socket servers
on Windows, because connected
sockets are not pickled correctly when passed into a new process created
by this module to converse with a client. In this context, threads
provide a more portable and likely more efficient solution.

Applications that pass simpler types of messages, of course, may
fare better. Message constraints are easier to accommodate when they are
part of an initial process-based design. Moreover, other tools in this
module, such as its managers and shared memory API, while narrowly
focused and not as general as shared thread state, offer additional
mutable state options for some programs.

Fundamentally, though, because
multiprocessing
is based on separate
processes, it may be best geared for tasks which are relatively
independent, do not share mutable object state freely, and can make do
with the message passing and shared memory tools provided by this
module. This includes many applications, but this module is not
necessarily a direct replacement for every threaded program, and it is
not an alternative to process forks in all contexts.

To truly understand both this module package’s benefits, as well
as its tradeoffs, let’s turn to a first example and explore this
package’s implementation along the
way.

The Basics: Processes and Locks

We don’t have
space to do full justice to this sophisticated module in
this book; see its coverage in the Python library manual for the full
story. But as a brief introduction, by design most of this module’s
interfaces mirror the
threading
and
queue
modules we’ve already met, so
they should already seem familiar. For example, the
multiprocessing
module’s
Process
class is
intended to mimic the
threading
module’s
Thread
class we met
earlier—it allows us to launch a function call in parallel with the
calling script; with this module, though, the function runs in a process
instead of a thread.
Example 5-29
illustrates these basics in action:

Example 5-29. PP4E\System\Processes\multi1.py

"""
multiprocess basics: Process works like threading.Thread, but
runs function call in parallel in a process instead of a thread;
locks can be used to synchronize, e.g. prints on some platforms;
starts new interpreter on windows, forks a new process on unix;
"""
import os
from multiprocessing import Process, Lock
def whoami(label, lock):
msg = '%s: name:%s, pid:%s'
with lock:
print(msg % (label, __name__, os.getpid()))
if __name__ == '__main__':
lock = Lock()
whoami('function call', lock)
p = Process(target=whoami, args=('spawned child', lock))
p.start()
p.join()
for i in range(5):
Process(target=whoami, args=(('run process %s' % i), lock)).start()
with lock:
print('Main process exit.')

When run, this script first calls a function directly and
in-process; then launches a call to that function in a new process and
waits for it to exit; and finally spawns five function call processes in
parallel in a loop—all using an API identical to that of the
threading.
Thread
model we studied earlier in this
chapter. Here’s this script’s output on Windows; notice how the five
child processes spawned at the end of this script outlive their parent,
as is the usual case for processes:

C:\...\PP4E\System\Processes>
multi1.py
function call: name:__main__, pid:8752
spawned child: name:__main__, pid:9268
Main process exit.
run process 3: name:__main__, pid:9296
run process 1: name:__main__, pid:8792
run process 4: name:__main__, pid:2224
run process 2: name:__main__, pid:8716
run process 0: name:__main__, pid:6936

Just like the
threading.Thread
class we met earlier, the
multiprocessing.Process
object can either be
passed a
target
with arguments (as
done here) or subclassed to redefine its
run
action method. Its
start
method invokes its
run
method in a new process, and the default
run
simply calls the passed-in
target. Also like
threading
, a
join
method waits for child process
exit, and a
Lock
object is provided
as one of a handful of process synchronization tools; it’s used here to
ensure that prints don’t overlap among processes on platforms where this
might matter (it may not on Windows).

Implementation and usage rules

Technically, to achieve
its portability, this module currently works by
selecting from platform-specific alternatives:

  • On Unix, it forks a new child process and invokes the
    Process
    object’s
    run
    method in the new child.

  • On Windows, it spawns a new interpreter by using
    Windows-specific process creation tools, passing the pickled
    Process
    object in to the new
    process over a pipe, and starting a “python -c” command line in
    the new process, which runs a special Python-coded function in
    this package that reads and unpickles the
    Process
    and invokes its
    run
    method.

We met pickling briefly in
Chapter 1
,
and we will study it further later in this book. The implementation is
a bit more complex than this, and is prone to change over time, of
course, but it’s really quite an amazing trick. While the portable API
generally hides these details from your code, its basic structure can
still have subtle impacts on the way you’re allowed to use it. For
instance:

  • On Windows, the main process’s logic should generally be
    nested under a
    __name__ ==
    __main__
    test as done here when using this module, so it
    can be imported freely by a new interpreter without side effects.
    As we’ll learn in more detail in
    Chapter 17
    , unpickling classes and
    functions requires an import of their enclosing module, and this
    is the root of this requirement.

  • Moreover, when globals are accessed in child processes on
    Windows, their values may not be the same as that in the parent at
    start
    time, because their
    module will be imported into a new process.

  • Also on Windows, all arguments to
    Process
    must be pickleable. Because this
    includes
    target
    , targets should
    be simple functions so they can be pickled; they cannot be bound
    or unbound object
    methods
    and cannot be
    functions created with a
    lambda
    . See
    pickle
    in Python’s library manual for
    more on pickleability rules; nearly every object type works, but
    callables like functions and classes must be importable—they are
    pickled by name only, and later imported to recreate bytecode. On
    Windows, objects with system state, such as connected sockets,
    won’t generally work as arguments to a process target either,
    because they are not
    pickleable
    .

  • Similarly, instances of custom
    Process
    subclasses must be pickleable on
    Windows as well. This includes all their attribute values. Objects
    available in this package (e.g.,
    Lock
    in
    Example 5-29
    ) are pickleable, and
    so may be used as both
    Process
    constructor arguments and subclass attributes.

  • IPC objects in this package that appear in later examples
    like
    Pipe
    and
    Queue
    accept only pickleable objects,
    because of their implementation (more on this in the next
    section).

  • On Unix, although a child process can make use of a shared
    global item created in the parent, it’s better to pass the object
    as an argument to the child process’s constructor, both for
    portability to Windows and to avoid potential problems if such
    objects were garbage collected in the parent.

There are additional rules documented in the library manual. In
general, though, if you stick to passing in shared objects to
processes and using the synchronization and
communication
tools provided by this
package, your code will usually be portable and correct. Let’s look
next at a few of those tools in
action.

IPC Tools: Pipes, Shared Memory, and Queues

While the processes
created by this package can always communicate using
general system-wide tools like the sockets and fifo files we met
earlier, the
multiprocessing
module
also provides portable message passing tools specifically geared to this
purpose for the processes it spawns:

  • Its
    Pipe
    object
    provides an anonymous pipe, which serves as a
    connection between two processes. When called,
    Pipe
    returns
    two
    Connection
    objects that represent the ends of the pipe. Pipes are bidirectional
    by default, and allow arbitrary pickleable Python objects to be sent
    and received. On Unix they are implemented internally today with
    either a connected socket pair or the
    os.pipe
    call we met earlier, and on
    Windows with named pipes specific to that platform. Much like the
    Process
    object described earlier,
    though, the
    Pipe
    object’s
    portable API spares callers from such things.

  • Its
    Value
    and
    Array
    objects
    implement shared process/thread-safe memory for
    communication between processes. These calls return scalar and array
    objects based in the
    ctypes
    module and
    created in shared memory, with access synchronized by
    default.

  • Its
    Queue
    object
    serves as a FIFO list of Python objects, which allows
    multiple producers and consumers. A queue is essentially a pipe with
    extra locking mechanisms to coordinate more arbitrary accesses, and
    inherits the pickleability constraints of
    Pipe
    .

Because these devices are safe to use across multiple processes,
they can often serve to synchronize points of communication and obviate
lower-level tools like locks, much the same as the thread queues we met
earlier. As usual, a pipe (or a pair of them) may be used to implement a
request/reply model. Queues support more flexible models; in fact, a GUI
that wishes to avoid the limitations of the GIL might use the
multi
processing
module’s
Process
and
Queue
to spawn long-running tasks that post
results, rather than threads. As mentioned, although this may incur
extra start-up overhead on some platforms, unlike threads today, tasks
coded this way can be as truly parallel as the underlying platform
allows.

One constraint worth noting here: this package’s pipes (and by
proxy, queues)
pickle
the objects passed through
them, so that they can be reconstructed in the receiving process (as
we’ve seen, on Windows the receiver process may be a fully independent
Python interpreter). Because of that, they do not support unpickleable
objects; as suggested earlier, this includes some callables like bound
methods and lambda functions (see file
multi-badq.py
in the book examples package
for a demonstration of code that violates this constraint). Objects with
system state, such as sockets, may fail as well. Most other Python
object types, including classes and simple functions, work fine on pipes
and queues.

Also keep in mind that because they are pickled, objects
transferred this way are effectively
copied
in the
receiving process; direct in-place changes to mutable objects’ state
won’t be noticed in the sender. This makes sense if you remember that
this package runs independent processes with their own memory spaces;
state cannot be as freely shared as in threading, regardless of which
IPC tools you use.

multiprocessing pipes

To demonstrate
the IPC tools listed above, the next three examples
implement three flavors of communication between parent and child
processes.
Example 5-30
uses a
simple shared pipe object to send and receive data between parent and
child processes.

Example 5-30. PP4E\System\Processes\multi2.py

"""
Use multiprocess anonymous pipes to communicate. Returns 2 connection
object representing ends of the pipe: objects are sent on one end and
received on the other, though pipes are bidirectional by default
"""
import os
from multiprocessing import Process, Pipe
def sender(pipe):
"""
send object to parent on anonymous pipe
"""
pipe.send(['spam'] + [42, 'eggs'])
pipe.close()
def talker(pipe):
"""
send and receive objects on a pipe
"""
pipe.send(dict(name='Bob', spam=42))
reply = pipe.recv()
print('talker got:', reply)
if __name__ == '__main__':
(parentEnd, childEnd) = Pipe()
Process(target=sender, args=(childEnd,)).start() # spawn child with pipe
print('parent got:', parentEnd.recv()) # receive from child
parentEnd.close() # or auto-closed on gc
(parentEnd, childEnd) = Pipe()
child = Process(target=talker, args=(childEnd,))
child.start()
print('parent got:', parentEnd.recv()) # receieve from child
parentEnd.send({x * 2 for x in 'spam'}) # send to child
child.join() # wait for child exit
print('parent exit')

When run on Windows, here’s this script’s output—one child
passes an object to the parent, and the other both sends and receives
on the same pipe:

C:\...\PP4E\System\Processes>
multi2.py
parent got: ['spam', 42, 'eggs']
parent got: {'name': 'Bob', 'spam': 42}
talker got: {'ss', 'aa', 'pp', 'mm'}
parent exit

This module’s pipe objects make communication between two
processes portable (and nearly trivial).

Shared memory and globals

Example 5-31
uses shared
memory to serve as both inputs and outputs of spawned
processes. To make this work portably, we must create objects defined
by the package and pass them to
Process
constructors. The last test in this
demo (“loop4”) probably represents the most common use case for shared
memory—that of distributing computation work to multiple parallel
processes.

Example 5-31. PP4E\System\Processes\multi3.py

"""
Use multiprocess shared memory objects to communicate.
Passed objects are shared, but globals are not on Windows.
Last test here reflects common use case: distributing work.
"""
import os
from multiprocessing import Process, Value, Array
procs = 3
count = 0 # per-process globals, not shared
def showdata(label, val, arr):
"""
print data values in this process
"""
msg = '%-12s: pid:%4s, global:%s, value:%s, array:%s'
print(msg % (label, os.getpid(), count, val.value, list(arr)))
def updater(val, arr):
"""
communicate via shared memory
"""
global count
count += 1 # global count not shared
val.value += 1 # passed in objects are
for i in range(3): arr[i] += 1
if __name__ == '__main__':
scalar = Value('i', 0) # shared memory: process/thread safe
vector = Array('d', procs) # type codes from ctypes: int, double
# show start value in parent process
showdata('parent start', scalar, vector)
# spawn child, pass in shared memory
p = Process(target=showdata, args=('child ', scalar, vector))
p.start(); p.join()
# pass in shared memory updated in parent, wait for each to finish
# each child sees updates in parent so far for args (but not global)
print('\nloop1 (updates in parent, serial children)...')
for i in range(procs):
count += 1
scalar.value += 1
vector[i] += 1
p = Process(target=showdata, args=(('process %s' % i), scalar, vector))
p.start(); p.join()
# same as prior, but allow children to run in parallel
# all see the last iteration's result because all share objects
print('\nloop2 (updates in parent, parallel children)...')
ps = []
for i in range(procs):
count += 1
scalar.value += 1
vector[i] += 1
p = Process(target=showdata, args=(('process %s' % i), scalar, vector))
p.start()
ps.append(p)
for p in ps: p.join()
# shared memory updated in spawned children, wait for each
print('\nloop3 (updates in serial children)...')
for i in range(procs):
p = Process(target=updater, args=(scalar, vector))
p.start()
p.join()
showdata('parent temp', scalar, vector)
# same, but allow children to update in parallel
ps = []
print('\nloop4 (updates in parallel children)...')
for i in range(procs):
p = Process(target=updater, args=(scalar, vector))
p.start()
ps.append(p)
for p in ps: p.join()
# global count=6 in parent only
# show final results here # scalar=12: +6 parent, +6 in 6 children
showdata('parent end', scalar, vector) # array[i]=8: +2 parent, +6 in 6 children

The following is this script’s output on Windows. Trace through
this and the code to see how it runs; notice how the changed value of
the global variable is not shared by the spawned processes on Windows,
but passed-in
Value
and
Array
objects are. The final output line
reflects changes made to shared memory in both the parent and spawned
children—the array’s final values are all 8.0, because they were
incremented twice in the parent, and once in each of six spawned
children; the scalar value similarly reflects changes made by both
parent and child; but unlike for threads, the global is per-process
data on Windows:

C:\...\PP4E\System\Processes>
multi3.py
parent start: pid:6204, global:0, value:0, array:[0.0, 0.0, 0.0]
child : pid:9660, global:0, value:0, array:[0.0, 0.0, 0.0]
loop1 (updates in parent, serial children)...
process 0 : pid:3900, global:0, value:1, array:[1.0, 0.0, 0.0]
process 1 : pid:5072, global:0, value:2, array:[1.0, 1.0, 0.0]
process 2 : pid:9472, global:0, value:3, array:[1.0, 1.0, 1.0]
loop2 (updates in parent, parallel children)...
process 1 : pid:9468, global:0, value:6, array:[2.0, 2.0, 2.0]
process 2 : pid:9036, global:0, value:6, array:[2.0, 2.0, 2.0]
process 0 : pid:9548, global:0, value:6, array:[2.0, 2.0, 2.0]
loop3 (updates in serial children)...
parent temp : pid:6204, global:6, value:9, array:[5.0, 5.0, 5.0]
loop4 (updates in parallel children)...
parent end : pid:6204, global:6, value:12, array:[8.0, 8.0, 8.0]

If you imagine the last test here run with a much larger array
and many more parallel children, you might begin to sense some of the
power of this package for
distributing work.

Queues and subclassing

Finally, besides
basic spawning and IPC tools, the
multiprocessing
module also:

  • Allows its
    Process
    class
    to be subclassed to provide structure and state retention (much
    like
    threading.Thread
    , but for
    processes).

  • Implements a process-safe
    Queue
    object which may be shared by any
    number of processes for more general communication needs (much
    like
    queue.Queue
    , but for
    processes).

Queues support a more flexible multiple client/server model.
Example 5-32
, for instance,
spawns three producer threads to post to a shared queue and repeatedly
polls for results to appear—in much the same fashion that a GUI might
collect results in parallel with the display itself, though here the
concurrency is achieved with processes instead of threads.

Example 5-32. PP4E\System\Processes\multi4.py

"""
Process class can also be subclassed just like threading.Thread;
Queue works like queue.Queue but for cross-process, not cross-thread
"""
import os, time, queue
from multiprocessing import Process, Queue # process-safe shared queue
# queue is a pipe + locks/semas
class Counter(Process):
label = ' @'
def __init__(self, start, queue): # retain state for use in run
self.state = start
self.post = queue
Process.__init__(self)
def run(self): # run in newprocess on start()
for i in range(3):
time.sleep(1)
self.state += 1
print(self.label ,self.pid, self.state) # self.pid is this child's pid
self.post.put([self.pid, self.state]) # stdout file is shared by all
print(self.label, self.pid, '-')
if __name__ == '__main__':
print('start', os.getpid())
expected = 9
post = Queue()
p = Counter(0, post) # start 3 processes sharing queue
q = Counter(100, post) # children are producers
r = Counter(1000, post)
p.start(); q.start(); r.start()
while expected: # parent consumes data on queue
time.sleep(0.5) # this is essentially like a GUI,
try: # though GUIs often use threads
data = post.get(block=False)
except queue.Empty:
print('no data...')
else:
print('posted:', data)
expected -= 1
p.join(); q.join(); r.join() # must get before join putter
print('finish', os.getpid(), r.exitcode) # exitcode is child exit status

Notice in this code how:

  • The
    time.sleep
    calls in
    this code’s producer simulate long-running tasks.

  • All four processes share the same output stream;
    print
    calls go the same place and don’t
    overlap badly on Windows (as we saw earlier, the
    multiprocessing
    module also has a
    shareable
    Lock
    object to
    synchronize access if required).

  • The exit status of child process is available after they
    finish in their
    exitcode
    attribute
    .

When run, the output of the main consumer process traces its
queue fetches, and the (indented) output of spawned child producer
processes gives process IDs and state.

C:\...\PP4E\System\Processes>
multi4.py
start 6296
no data...
no data...
@ 8008 101
posted: [8008, 101]
@ 6068 1
@ 3760 1001
posted: [6068, 1]
@ 8008 102
posted: [3760, 1001]
@ 6068 2
@ 3760 1002
posted: [8008, 102]
@ 8008 103
@ 8008 -
posted: [6068, 2]
@ 6068 3
@ 6068 -
@ 3760 1003
@ 3760 -
posted: [3760, 1002]
posted: [8008, 103]
posted: [6068, 3]
posted: [3760, 1003]
finish 6296 0

If you imagine the “@” lines here as results of long-running
operations and the others as a main GUI thread, the wide relevance of
this package may become more
apparent.

Other books

Soul Kissed by Erin Kellison
The Exiled by Christopher Charles
A Bat in the Belfry by Sarah Graves
The Blue Door by Christa J. Kinde
A Sport of Nature by Nadine Gordimer
The Hearts We Mend by Kathryn Springer
One Man's War by Lindsay McKenna