Probably the
biggest limitation of DBM keyed files is in what they can
store: data stored under a key must be a simple string. If you want to
store Python objects in a DBM file, you can sometimes manually convert
them to and from strings on writes and reads (e.g., with str and eval
calls), but this takes you only so far. For arbitrarily complex Python
objects such as class instances and nested data structures, you need
something more. Class instance objects, for example, cannot usually be
later re-created from their standard string representations. Moreover,
custom to-string conversions and from-string parsers are error prone and
not general.
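To make the limitation concrete, here is a minimal sketch of the manual str/eval approach. The Rec class and the variable names are illustrative only, and eval should be run only on strings from a trusted source:

```python
# A minimal sketch: manual to-string and from-string conversion.
# "Rec" and all variable names here are hypothetical, for illustration.

class Rec:
    def __init__(self, hours):
        self.hours = hours

# Simple built-in structures survive a str/eval round trip...
simple = {'a': [1, 2, 3], 'b': ('spam', 'eggs')}
text = str(simple)               # convert to a string for storage
restored = eval(text)            # parse it back (trusted input only!)
print(restored == simple)

# ...but the default string form of a class instance is not valid
# Python syntax, so eval cannot rebuild the object from it.
rec_text = str(Rec(40))          # looks like '<...Rec object at 0x...>'
try:
    eval(rec_text)
    rebuilt = True
except SyntaxError:
    rebuilt = False
print(rebuilt)
```

The first round trip succeeds, but the instance's default display cannot be parsed back, which is the gap that pickling is meant to fill.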
The Python pickle module, a standard part of the Python system, provides
the conversion step needed. It’s a sort of super-general data formatting
and de-formatting tool: pickle converts nearly arbitrary Python
in-memory objects to and from a single linear string format, suitable for
storing in flat files, shipping across network sockets between trusted
sources, and so on. This conversion from object to string is often called
serialization: arbitrary data structures in memory are mapped to a serial
string form.
The string representation used for objects is also sometimes
referred to as a byte stream, due to its linear format. It retains all the
content and references structure of the original in-memory object. When
the object is later re-created from its byte string, it will be a new
in-memory object identical in structure and value to the original, though
located at a different memory address.
The net effect is that the re-created object is effectively a copy of the
original; in Python-speak, the two will be == but not is. Since the
re-creation typically happens in an entirely new process, this difference
is often irrelevant (though as we saw in Chapter 5, this generally precludes
using pickled objects directly as cross-process shared state).
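As a small sketch of what "content and reference structure" means in practice, the following pickles an object that refers to the same nested list twice. The names used are illustrative, and the dumps/loads calls it relies on are described shortly:

```python
import pickle

# One nested list, referenced from two places in the outer object
shared = ['spam']
original = {'first': shared, 'second': shared}

data = pickle.dumps(original)        # object -> serialized byte string
copy = pickle.loads(data)            # byte string -> new object

print(copy == original)              # same value...
print(copy is original)              # ...but a different object
print(copy['first'] is copy['second'])   # sharing inside the copy is kept
print(copy['first'] is shared)           # yet it is not the original list
```

The copy has the same value as the original and preserves the internal sharing, but none of its parts are the original objects.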
Pickling works on almost any Python datatype—numbers, lists,
dictionaries, class instances, nested structures,
and more—and so is a general way to store data. Because
pickles contain native Python objects, there is almost no database API to
be found; the objects stored with pickling are processed with normal
Python syntax when they are later retrieved.
Pickling may
sound complicated the first time you encounter it, but the
good news is that Python hides all the complexity of object-to-string
conversion. In fact, the pickle module’s interfaces are incredibly
simple to use. For example, to pickle an object into a serialized
string, we can either make a pickler and call its methods or use
convenience functions in the module to achieve the same effect:
P = pickle.Pickler(file)
    Make a new pickler for pickling to an open output file object file.

P.dump(object)
    Write an object onto the pickler’s file/stream.

pickle.dump(object, file)
    Same as the last two calls combined: pickle an object onto an open
    file.

string = pickle.dumps(object)
    Return the pickled representation of object as a string (a bytes
    string in Python 3.X).
Unpickling from a serialized string back to the original object is
similar—both object and convenience function interfaces are
available:
U = pickle.Unpickler(file)
    Make an unpickler for unpickling from an open input file object file.

object = U.load()
    Read an object from the unpickler’s file/stream.

object = pickle.load(file)
    Same as the last two calls combined: unpickle an object from an open
    file.

object = pickle.loads(string)
    Read an object from a string (a bytes string in Python 3.X) rather
    than a file.
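The following sketch exercises each of the calls just listed, in both their object and convenience-function forms. The temporary file path and the sample objects are assumptions made for the demonstration:

```python
import os
import pickle
import tempfile

record = {'name': 'bob', 'jobs': ['dev', 'mgr']}
extra = ('spam', 42)

# Hypothetical scratch file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'demo.pkl')

with open(path, 'wb') as file:           # binary mode for pickled data
    P = pickle.Pickler(file)             # pickler object interface
    P.dump(record)
    pickle.dump(extra, file)             # convenience function, same file

with open(path, 'rb') as file:
    U = pickle.Unpickler(file)           # unpickler object interface
    first = U.load()
    second = pickle.load(file)           # convenience function

string = pickle.dumps(record)            # pickle to an in-memory string
third = pickle.loads(string)             # and back again

print(first, second, third)
```

Objects come back in the order in which they were written, and the string-based calls skip the file altogether.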
Pickler and Unpickler are exported classes. In all of the preceding cases,
file is either an open file object or any object that implements the same
attributes as file objects:

Pickler calls the file’s write method with a string argument (a bytes
string in Python 3.X).

Unpickler calls the file’s read method with a byte count, and readline
without arguments.
Any object that provides these attributes can be passed in to the file
parameters. In particular, file can be an instance of a Python class that
provides the read/write methods (i.e., the expected file-like interface).
This lets you map pickled streams to in-memory objects with classes, for
arbitrary use. For instance, the io.BytesIO class in the standard library
discussed in Chapter 3 provides an interface that maps file calls to and
from in-memory byte strings and is an alternative to the pickler’s
dumps/loads string calls.
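As a rough sketch of this hook, the two classes below supply just the write, read, and readline methods named above. ByteSink and ByteSource are hypothetical names invented for this illustration, and the final lines show io.BytesIO playing the same role without any custom code:

```python
import io
import pickle

class ByteSink:
    "Hypothetical output object: collects whatever the pickler writes"
    def __init__(self):
        self.chunks = []
    def write(self, data):
        self.chunks.append(bytes(data))      # pickler passes bytes-like data
        return len(data)
    def getvalue(self):
        return b''.join(self.chunks)

class ByteSource:
    "Hypothetical input object: serves bytes back to the unpickler"
    def __init__(self, data):
        self.data = data
        self.pos = 0
    def read(self, size=-1):
        if size is None or size < 0:
            size = len(self.data) - self.pos
        chunk = self.data[self.pos:self.pos + size]
        self.pos += len(chunk)
        return chunk
    def readline(self):
        end = self.data.find(b'\n', self.pos)
        end = len(self.data) if end == -1 else end + 1
        chunk = self.data[self.pos:end]
        self.pos = end
        return chunk

table = {'a': [1, 2, 3], 'b': ['spam', 'eggs']}

sink = ByteSink()
pickle.dump(table, sink)                     # calls sink.write
restored = pickle.load(ByteSource(sink.getvalue()))   # calls read/readline

buffer = io.BytesIO()                        # standard library alternative
pickle.dump(table, buffer)
buffer.seek(0)                               # rewind before reading
from_buffer = pickle.load(buffer)

print(restored == table, from_buffer == table)
```

Because the pickler only cares about the methods it calls, the same pattern could route pickled bytes to any destination a class can reach.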
This hook also lets you ship Python objects across a network, by
providing sockets wrapped to look like files in pickle calls at the
sender, and unpickle calls at the receiver (see Making Sockets Look Like
Files and Streams for more details). In fact, for some, pickling Python
objects across a trusted network serves as a simpler alternative to
network transport protocols such as SOAP and XML-RPC, provided that Python
is on both ends of the communication (pickled objects are represented with
a Python-specific format, not with XML text).
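The idea can be sketched within a single process by using a connected pair of sockets in place of a real network link. The use of socket.socketpair and the sample objects are assumptions of this demonstration; a real program would normally connect separate client and server processes, and should unpickle only data from a trusted peer:

```python
import pickle
import socket

# A connected pair of sockets stands in for a network connection here
sender, receiver = socket.socketpair()

out_file = sender.makefile('wb')         # wrap sockets to look like files
in_file = receiver.makefile('rb')

pickle.dump({'name': 'bob', 'hours': 40}, out_file)
pickle.dump(['spam', 'eggs'], out_file)
out_file.close()                         # flush buffered bytes to the socket
sender.close()                           # done sending

first = pickle.load(in_file)             # objects arrive in the order sent
second = pickle.load(in_file)
in_file.close()
receiver.close()

print(first, second)
```

The pickle calls are unchanged from the flat-file case; only the file-like object passed in differs.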
Recent changes: In Python 3.X, pickled objects are always represented as
bytes, not str, regardless of the protocol level which you request (even
the oldest ASCII protocol yields bytes). Because of this, files used to
store pickled Python objects should always be opened in binary mode.
Moreover, in 3.X an optimized _pickle implementation module is also
selected and used automatically if present. More on both topics later.
Although pickled objects can be shipped in exotic ways, in more
typical use, to pickle an object to a flat file, we just open the file
in write mode and call the dump function:
C:\...\PP4E\Dbase> python
>>> table = {'a': [1, 2, 3],
             'b': ['spam', 'eggs'],
             'c': {'name':'bob'}}
>>>
>>> import pickle
>>> mydb = open('dbase', 'wb')
>>> pickle.dump(table, mydb)
Notice the nesting in the object pickled here: the pickler handles
arbitrary structures. Also note that we’re using binary mode files here;
in Python 3.X, we really must, because the pickled representation of an
object is always a bytes object in all cases. To unpickle later in another
session or program run, simply reopen the file and call load:
C:\...\PP4E\Dbase> python
>>> import pickle
>>> mydb = open('dbase', 'rb')
>>> table = pickle.load(mydb)
>>> table
{'a': [1, 2, 3], 'c': {'name': 'bob'}, 'b': ['spam', 'eggs']}
The object you get back from unpickling has the same value and
reference structure as the original, but it is located at a different
address in memory. This is true whether the object is unpickled in the
same or a future process. Again, the unpickled object is == but is not is:
C:\...\PP4E\Dbase> python
>>> import pickle
>>> f = open('temp', 'wb')
>>> x = ['Hello', ('pickle', 'world')]          # list with nested tuple
>>> pickle.dump(x, f)
>>> f.close()                                   # close to flush changes
>>>
>>> f = open('temp', 'rb')
>>> y = pickle.load(f)
>>> y
['Hello', ('pickle', 'world')]
>>>
>>> x == y, x is y                              # same value, diff objects
(True, False)
To make this process simpler still, the module in Example 17-1 wraps
pickling and unpickling calls in functions that also open the files where
the serialized form of the object is stored.
Example 17-1. PP4E\Dbase\filepickle.py
"Pickle to/from flat file utilities"
import pickle

def saveDbase(filename, object):
    "save object to file"
    file = open(filename, 'wb')
    pickle.dump(object, file)        # pickle to binary file
    file.close()                     # any file-like object will do

def loadDbase(filename):
    "load object from file"
    file = open(filename, 'rb')
    object = pickle.load(file)       # unpickle from binary file
    file.close()                     # re-creates object in memory
    return object
To store and fetch now, simply call these module functions; here they are
in action managing a fairly complex structure with multiple references to
the same nested object (the nested list called L at first is stored only
once in the file):
C:\...\PP4E\Dbase> python
>>> from filepickle import *
>>> L = [0]
>>> D = {'x':0, 'y':L}
>>> table = {'A':L, 'B':D}                # L appears twice
>>> saveDbase('myfile', table)            # serialize to file

C:\...\PP4E\Dbase> python
>>> from filepickle import *
>>> table = loadDbase('myfile')           # reload/unpickle
>>> table
{'A': [0], 'B': {'y': [0], 'x': 0}}
>>> table['A'][0] = 1                     # change shared object
>>> saveDbase('myfile', table)            # rewrite to the file

C:\...\PP4E\Dbase> python
>>> from filepickle import *
>>> print(loadDbase('myfile'))            # both L's updated as expected
{'A': [1], 'B': {'y': [1], 'x': 0}}
Besides built-in types like the lists, tuples, and dictionaries of the
examples so far, class instances may also be pickled to file-like objects.
This provides a natural way to associate behavior with stored data (class
methods process instance attributes) and provides a simple migration path
(class changes made in module files are automatically picked up by stored
instances). Here’s a brief interactive demonstration:
>>> class Rec:
        def __init__(self, hours):
            self.hours = hours
        def pay(self, rate=50):
            return self.hours * rate

>>> bob = Rec(40)
>>> import pickle
>>> pickle.dump(bob, open('bobrec', 'wb'))
>>>
>>> rec = pickle.load(open('bobrec', 'rb'))
>>> rec.hours
40
>>> rec.pay()
2000
We’ll explore how this works in more detail in conjunction with shelves
later in this chapter. As we’ll see, although the pickle module can be used
directly this way, it is also the underlying translation engine in both
shelves and ZODB databases.
In general, Python can pickle just about anything, except for:

Compiled code objects: functions and classes record just their names and
those of their modules in pickles, to allow for later reimport and
automatic acquisition of changes made in module files.

Instances of classes that do not follow class importability rules: in
short, the class must be importable on object loads (more on this at the
end of the section Shelve Files).

Instances of some built-in and user-defined types that are coded in C or
depend upon transient operating system states (e.g., open file objects
cannot be pickled).
A PicklingError is raised if an object cannot be pickled (in practice, some
unpicklable objects, such as open files, trigger a TypeError instead).
Again, we’ll revisit the pickler’s constraints on pickleable objects and
classes when we study shelves.
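The constraints above can be probed with a small helper like the one sketched here. The can_pickle function is a hypothetical name, not part of the pickle module, and it catches several exception types because the one raised can vary with the object and the Python release:

```python
import os
import pickle

def can_pickle(obj):
    "Hypothetical helper: True if pickle.dumps accepts obj"
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True

nested_ok = can_pickle({'a': [1, 2, 3], 'b': ('spam', None)})

# A named function is recorded by name only, so it is re-fetched on load
named_ok = can_pickle(len)
same_function = pickle.loads(pickle.dumps(len)) is len

# A lambda has no importable name, and an open file wraps OS state
lambda_ok = can_pickle(lambda x: x)
with open(os.devnull, 'rb') as file:
    file_ok = can_pickle(file)

print(nested_ok, named_ok, same_function, lambda_ok, file_ok)
```

Nested built-in data and importable names pass, while the lambda and the open file are both rejected.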
In later Python releases, the pickler introduced the notion of protocols:
storage formats for pickled data. Specify the desired protocol by passing
an extra parameter to the pickling calls (but not to unpickling calls: the
protocol is automatically determined from the pickled data):

pickle.dump(object, file, protocol)      # or protocol=N keyword argument
Pickled data may be created in either text or binary protocols; the binary
protocols’ format is more efficient, but it cannot be readily understood if
inspected. By default, the storage protocol in Python 3.X is a 3.X-only
binary bytes format (protocol 3 in the release used for the examples here;
newer 3.X releases default to higher-numbered binary protocols). In text
mode (protocol 0), the pickled data is printable ASCII text, which can be
read by humans (it’s essentially instructions for a stack machine), but it
is still a bytes object in Python 3.X. The alternative protocols (protocols
1 and 2) create the pickled data in binary format as well.
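To compare protocols on your own installation, a sketch along the following lines may help. It assumes nothing beyond the standard pickle and pickletools modules, and the sizes and default protocol it prints will vary with the Python release in use:

```python
import io
import pickle
import pickletools

data = [1, 2, 3]
sizes = {}

# Try every protocol this Python supports; each one yields bytes
for proto in range(pickle.HIGHEST_PROTOCOL + 1):
    pickled = pickle.dumps(data, protocol=proto)
    assert isinstance(pickled, bytes)
    assert pickle.loads(pickled) == data     # protocol is auto-detected
    sizes[proto] = len(pickled)

print('default protocol:', pickle.DEFAULT_PROTOCOL)
print('bytes per protocol:', sizes)

# Disassemble the ASCII (protocol 0) form to view its stack-machine steps
listing = io.StringIO()
pickletools.dis(pickle.dumps(data, protocol=0), out=listing)
print(listing.getvalue())
```

The disassembly lists one opcode per line, ending with the STOP instruction that marks the end of the pickle.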
For all protocols, pickled data is a bytes object in 3.X, not a str, and
therefore implies binary-mode reads and writes when stored in flat files
(see Chapter 4 if you’ve forgotten why). Similarly, we must use a
bytes-oriented object when forging the file object’s interface:
>>> import io, pickle
>>> pickle.dumps([1, 2, 3])                          # default=binary protocol
b'\x80\x03]q\x00(K\x01K\x02K\x03e.'
>>> pickle.dumps([1, 2, 3], protocol=0)              # ASCII format protocol
b'(lp0\nL1L\naL2L\naL3L\na.'

>>> pickle.dump([1, 2, 3], open('temp','wb'))        # same if protocol=0, ASCII
>>> pickle.dump([1, 2, 3], open('temp','w'))         # must use 'rb' to read too
TypeError: must be str, not bytes
>>> pickle.dump([1, 2, 3], open('temp','w'), protocol=0)
TypeError: must be str, not bytes

>>> B = io.BytesIO()                                 # use bytes streams/buffers
>>> pickle.dump([1, 2, 3], B)
>>> B.getvalue()
b'\x80\x03]q\x00(K\x01K\x02K\x03e.'

>>> B = io.BytesIO()                                 # also bytes for ASCII
>>> pickle.dump([1, 2, 3], B, protocol=0)
>>> B.getvalue()
b'(lp0\nL1L\naL2L\naL3L\na.'

>>> S = io.StringIO()                                # it's not a str anymore
>>> pickle.dump([1, 2, 3], S)                        # same if protocol=0, ASCII
TypeError: string argument expected, got 'bytes'
>>> pickle.dump([1, 2, 3], S, protocol=0)
TypeError: string argument expected, got 'bytes'
Refer to Python’s library manual for more information on the pickler; it
supports additional interfaces that classes may use to customize its
behavior, which we’ll bypass here in the interest of space. Also check out
marshal, a module that serializes an object too, but can handle only simple
object types. pickle is more general than marshal and is normally preferred.
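A quick sketch of the difference follows. The sample objects are arbitrary, and a standard library date is used as an example of a type beyond marshal's reach:

```python
import datetime
import marshal
import pickle

# Both modules handle simple built-in structures
simple = {'a': [1, 2, 3], 'b': ('spam', 1.5)}
marshal_simple = marshal.loads(marshal.dumps(simple))
pickle_simple = pickle.loads(pickle.dumps(simple))

# marshal rejects objects outside its small set of core types
day = datetime.date(2011, 1, 1)
try:
    marshal.dumps(day)
    marshal_ok = True
except ValueError:                   # "unmarshallable object"
    marshal_ok = False

pickle_ok = pickle.loads(pickle.dumps(day)) == day

print(marshal_simple == simple, pickle_simple == simple)
print(marshal_ok, pickle_ok)
```

Both modules round-trip the simple dictionary, but only pickle accepts the date object.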
An additional related module, _pickle, is a C-coded optimization of pickle,
and is automatically used by pickle internally if available; it need not be
selected or used directly. The shelve module inherits this optimization
automatically by proxy. I haven’t explained shelve yet, but I will now.