Programming Python

Authors: Mark Lutz

Unicode and the Text widget

The application of all this to tkinter Text displays is
straightforward: if we open in binary mode to read bytes, we don’t
need to be concerned about encodings in our own code, because tkinter
interprets the data as expected, at least for these two encodings:

>>> from tkinter import Text
>>> t = Text()
>>> t.insert('1.0', open('ldata', 'rb').read())
>>> t.pack()                                # string appears in GUI OK
>>> t.get('1.0', 'end')
'AÄBäC\n'
>>>
>>> t = Text()
>>> t.insert('1.0', open('udata', 'rb').read())
>>> t.pack()                                # string appears in GUI OK
>>> t.get('1.0', 'end')
'AÄBäC\n'

It works the same if we pass a str fetched in text mode, but we then
need to know the encoding type on the Python side of the fence; reads
will fail if the encoding type doesn’t match the stored data:

>>> t = Text()
>>> t.insert('1.0', open('ldata', 'r', encoding='latin-1').read())
>>> t.pack()
>>> t.get('1.0', 'end')
'AÄBäC\n'
>>>
>>> t = Text()
>>> t.insert('1.0', open('udata', 'r', encoding='utf-8').read())
>>> t.pack()
>>> t.get('1.0', 'end')
'AÄBäC\n'

Either way, though, the fetched content is always a Unicode str, so
binary mode really only addresses loads: we still need to know an
encoding to store, whether we write in text mode directly or write in
binary mode after manual encoding:

>>> c = t.get('1.0', 'end')
>>> c                                                  # content is str
'AÄBäC\n'
>>> open('cdata', 'wb').write(c)                       # binary mode needs bytes
TypeError: must be bytes or buffer, not str
>>> open('cdata', 'w', encoding='latin-1').write(c)    # each write returns 6
>>> open('cdata', 'rb').read()
b'A\xc4B\xe4C\r\n'
>>> open('cdata', 'w', encoding='utf-8').write(c)      # different bytes on files
>>> open('cdata', 'rb').read()
b'A\xc3\x84B\xc3\xa4C\r\n'
>>> open('cdata', 'w', encoding='utf-16').write(c)
>>> open('cdata', 'rb').read()
b'\xff\xfeA\x00\xc4\x00B\x00\xe4\x00C\x00\r\x00\n\x00'
>>> open('cdata', 'wb').write( c.encode('latin-1') )   # manual encoding first
>>> open('cdata', 'rb').read()                         # same but no \r on Win
b'A\xc4B\xe4C\n'
>>> open('cdata', 'w', encoding='ascii').write(c)      # still must be compatible
UnicodeEncodeError: 'ascii' codec can't encode character '\xc4' in position 1: o

Notice the last test here: like manual encoding, file writes can
still fail if the data cannot be encoded in the target scheme. Because
of that, programs may need to recover from exceptions or try
alternative schemes; this is especially true on platforms where ASCII
may be the default platform encoding.
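
One way to recover from such failures is to try a list of candidate
encodings in turn. The following is a minimal sketch rather than code
from this book's examples; the function name save_with_fallback and
its default candidate list are assumptions made for illustration:

```python
def save_with_fallback(path, text, encodings=('ascii', 'latin-1', 'utf-8')):
    """Write text to path using the first encoding that can represent it.

    Hypothetical helper: the name and default candidates are assumptions.
    Returns the name of the encoding that was used.
    """
    for encoding in encodings:
        try:
            data = text.encode(encoding)      # may raise UnicodeEncodeError
        except UnicodeEncodeError:
            continue                          # try the next candidate
        with open(path, 'wb') as file:        # binary mode: bytes stored as is
            file.write(data)
        return encoding
    raise ValueError('no candidate encoding can represent this text')
```

Encoding before the file is opened means a failed attempt never leaves
a truncated file behind. Because the write is in binary mode, end-lines
are stored exactly as they appear in the string.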

The problem with treating text as bytes

The prior sections’ rules may seem complex, but they boil down to the
following:

  • Unless strings always use the platform default, we need to
    know encoding types to read or write in text mode and to manually
    decode or encode for binary mode.

  • We can use almost any encoding to write new files as long as
    it can handle the string’s characters, but must provide one that
    is compatible with the existing data’s binary format on
    reads.

  • We don’t need to know the encoding type to read text as bytes
    in binary mode for display, but the str content returned by the
    Text widget still requires us to encode to write on saves.
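
The first two rules can be seen without a GUI. This sketch assumes a
scratch file in a temporary directory (the file name is arbitrary); it
writes the same string under two encodings and reads each copy back
with both a matching and a mismatched encoding name:

```python
import os
import tempfile

text = 'AÄBäC\n'
folder = tempfile.mkdtemp()                   # scratch directory (assumption)
path = os.path.join(folder, 'demo.txt')       # arbitrary file name

results = {}
for written_as in ('latin-1', 'utf-8'):
    # Writing: any encoding that can represent the characters works
    with open(path, 'w', encoding=written_as) as file:
        file.write(text)
    for read_as in ('latin-1', 'utf-8'):
        # Reading: the name must be compatible with the bytes on the file
        try:
            with open(path, 'r', encoding=read_as) as file:
                results[written_as, read_as] = file.read()
        except UnicodeDecodeError:
            results[written_as, read_as] = None    # decoding failed

for key, value in sorted(results.items()):
    print(key, repr(value))
```

A mismatch does not always raise an exception. Latin-1 maps every byte
to some character, so reading the UTF-8 file as Latin-1 succeeds here
but yields the wrong characters, while reading the Latin-1 file as
UTF-8 fails outright.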

So why not always load text files in binary mode to display them in a
tkinter Text widget? While binary mode input files seem to side-step
encoding issues for display, passing text to tkinter as bytes instead
of str really just delegates the encoding issue to the Tk library,
which imposes constraints of its own.

More specifically, opening input files in binary mode to read
bytes may seem to support viewing arbitrary types of text, but it has
two potential downsides:

  • It shifts the burden of deciding encoding type from our
    script to the Tk GUI library. The library must still determine how
    to render those bytes and may not support all encodings
    possible.

  • It allows opening and viewing data that is not text in
    nature, thereby defeating some of the purpose of the validity
    checks performed by text decoding.

The first point is probably the most crucial here. In experiments I’ve
run on Windows, Tk seems to correctly handle raw bytes strings encoded
in ASCII, UTF-8 and Latin-1 format, but not UTF-16 or others such as
CP500. By contrast, these all render correctly if decoded in Python to
str before being passed on to Tk. In programs intended for the world
at large, this wider support is crucial today. If you’re able to know
or ask for encodings, you’re better off using str both for display and
saves.
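
As a minimal sketch of that recommendation, the following decodes file
bytes to str in Python so that Tk only ever receives Unicode text. The
function name load_for_display is hypothetical, and the GUI call is
shown only as a comment because it requires a display:

```python
def load_for_display(path, encoding):
    """Return a file's text as str, decoded in Python rather than by Tk.

    Hypothetical helper. Raises UnicodeDecodeError if the encoding name
    does not match the file's bytes, and LookupError for an unknown name.
    """
    with open(path, 'r', encoding=encoding) as file:
        return file.read()

# In a GUI, the result would be passed on to the widget, for example:
#     text_widget.insert('1.0', load_for_display('udata', 'utf-16'))
```

The same call works for any encoding name Python knows, including
UTF-16 and CP500, as long as the name matches the data on the file.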

To some degree, regardless of whether you pass in str or bytes,
tkinter GUIs are subject to the constraints imposed by the underlying
Tk library and the Tcl language it uses internally, as well as any
imposed by the techniques Python’s tkinter uses to interface with Tk.
For example:

  • Tcl, the internal implementation language of the Tk library,
    stores strings internally in UTF-8 format, and decrees that
    strings passed in to and returned from its C API be in this
    format.

  • Tcl attempts to convert byte strings to its internal UTF-8
    format, and generally supports translation using the platform and
    locale encodings in the local operating system with Latin-1 as a
    fallback.

  • Python’s tkinter passes bytes strings to Tcl directly, but
    copies Python str Unicode strings to and from Tcl Unicode string
    objects.

  • Tk inherits all of Tcl’s Unicode policies, but adds
    additional font selection policies for display.

In other words, GUIs that display text in tkinter are somewhat at the
mercy of multiple layers of software, above and beyond the Python
language itself. In general, though, Unicode is broadly supported by
Tk’s Text widget for Python str, but not for Python bytes. As you can
probably tell, though, this story quickly becomes very low-level and
detailed, so we won’t explore it further in this book; see the Web and
other resources for more on tkinter, Tk, and Tcl, and the interfaces
between them.

Other binary mode considerations

Even in contexts where it’s sufficient, using binary mode files to
finesse encodings for display is more complicated than you might
think. We always need to be careful to write output in binary mode,
too, so what we read is what we later write: if we read in binary
mode, content end-lines will be \r\n on Windows, and we don’t want
text-mode files to expand this to \r\r\n. Moreover, there’s another
difference in tkinter for str and bytes. A str read from a text-mode
file appears in the GUI as you expect, and end-lines are mapped on
Windows as usual:

C:\...\PP4E\Gui\Tour> python
>>> from tkinter import *
>>> T = Text()                                    # str from text-mode file
>>> T.insert('1.0', open('jack.txt').read())      # platform default encoding
>>> T.pack()                                      # appears in GUI normally
>>> T.get('1.0', 'end')[:75]
'000) All work and no play makes Jack a dull boy.\n001) All work and no pla'

If you pass in a bytes obtained from a binary-mode file, however, it’s
odd in the GUI on Windows: there’s an extra space at the end of each
line, which reflects the \r that is not stripped by binary mode files:

C:\...\PP4E\Gui\Tour> python
>>> from tkinter import *
>>> T = Text()                                        # bytes from binary-mode
>>> T.insert('1.0', open('jack.txt', 'rb').read())    # no decoding occurs
>>> T.pack()                                          # lines have space at end!
>>> T.get('1.0', 'end')[:75]
'000) All work and no play makes Jack a dull boy.\r\n001) All work and no pl'

To use bytes to allow for arbitrary text but make the text appear as
expected by users, we also have to strip the \r characters at line end
manually. This assumes that a \r\n combination doesn’t mean something
special in the text’s encoding scheme, though data in which this
sequence does not mean end-of-line will likely have other issues when
displayed. The following avoids the extra end-of-line spaces; we open
for input in binary mode for undecoded bytes, but drop \r:

C:\...\PP4E\Gui\Tour> python
>>> from tkinter import *                         # use bytes, strip \r if any
>>> T = Text()
>>> data = open('jack.txt', 'rb').read()
>>> data = data.replace(b'\r\n', b'\n')
>>> T.insert('1.0', data)
>>> T.pack()
>>> T.get('1.0', 'end')[:75]
'000) All work and no play makes Jack a dull boy.\n001) All work and no pla'
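
The end-line concerns in this section can be reproduced on any
platform without a GUI. In this sketch, the newline='\r\n' argument to
open is an assumption used to imitate Windows text-mode output
elsewhere; on Windows itself, the default behaves this way. The file
name and sample bytes are arbitrary:

```python
import os
import tempfile

folder = tempfile.mkdtemp()                       # scratch directory
path = os.path.join(folder, 'lines.txt')          # arbitrary file name

raw = b'line one\r\nline two\r\n'                 # bytes as stored on Windows

# Pitfall: text still containing \r, written through Windows-style text mode
with open(path, 'w', encoding='latin-1', newline='\r\n') as file:
    file.write(raw.decode('latin-1'))             # each \n becomes \r\n
with open(path, 'rb') as file:
    doubled = file.read()                         # end-lines are now \r\r\n

# Fix: drop the \r after loading, so a later text-mode save restores it
stripped = raw.replace(b'\r\n', b'\n')
with open(path, 'w', encoding='latin-1', newline='\r\n') as file:
    file.write(stripped.decode('latin-1'))
with open(path, 'rb') as file:
    restored = file.read()                        # same bytes as the original

print(doubled)
print(restored == raw)
```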

To save content later, we can either add the \r characters back on
Windows only, manually encode to bytes, and save in binary mode; or we
can open in text mode to make the file object restore the \r if needed
and encode for us, and write the str content string directly. The
second of these is probably simpler, as we don’t need to care about
platform differences.
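
The two save options just described might be coded as follows. This is
only a sketch: both function names are hypothetical, and os.linesep is
used here as one way to find the platform's end-line sequence:

```python
import os

def save_binary(path, content, encoding):
    """Option 1: restore end-lines and encode manually, then write bytes."""
    if os.linesep != '\n':                        # '\r\n' on Windows
        content = content.replace('\n', os.linesep)
    with open(path, 'wb') as file:
        file.write(content.encode(encoding))      # may raise UnicodeEncodeError

def save_text(path, content, encoding):
    """Option 2: let the text-mode file map end-lines and encode for us."""
    with open(path, 'w', encoding=encoding) as file:
        file.write(content)                       # may raise UnicodeEncodeError
```

Both functions should leave the same bytes on the file for the same
arguments; the second simply hands the platform check and the encoding
call over to the file object.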

Either way, though, we still face an encoding step: we can either rely
on the platform default encoding or obtain an encoding name from user
interfaces. In the following, for example, the text-mode file converts
end-lines and encodes to bytes internally using the platform default.
If we care about supporting arbitrary Unicode types or run on a
platform whose default does not accommodate characters displayed, we
would need to pass in an explicit encoding argument (the Python slice
operation here has the same effect as fetching through Tk’s “end-1c”
position specification):

...continuing prior listing...
>>> content = T.get('1.0', 'end')[:-1]            # drop added \n at end
>>> open('copyjack.txt', 'w').write(content)      # use platform default
12500                                             # text mode adds \r on Win
>>> ^Z
C:\...\PP4E\Gui\Tour> fc jack.txt copyjack.txt
Comparing files jack.txt and COPYJACK.TXT
FC: no differences encountered

Supporting Unicode in PyEdit (ahead)

We’ll see a use case of accommodating the Text widget’s Unicode
behavior in the larger PyEdit example of Chapter 11. Really,
supporting Unicode just means supporting arbitrary Unicode encodings
in text files on opens and saves; once in memory, text processing can
always be performed in terms of str, since that’s how tkinter returns
content. To support Unicode, PyEdit will open both input and output
files in text mode with explicit encodings whenever possible, and fall
back on opening input files in binary mode only as a last resort. This
avoids relying on the limited Unicode support Tk provides for display
of raw byte strings.

To make this policy work, PyEdit will accept encoding names from a
wide variety of sources and allow the user to configure which to
attempt. Encodings may be obtained from user dialog inputs,
configuration file settings, the platform default, the prior open’s
encoding on saves, and even internal program values (parsed from email
headers, for instance). These sources are attempted until the first
that succeeds, though it may also be desirable to limit encoding
attempts to just one such source in some contexts.
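
The general shape of such a policy can be sketched as follows. This is
not PyEdit's actual code, which appears later in the book; the function
name, its arguments, and the order of the candidate sources are all
assumptions made for illustration:

```python
import locale

def open_text(path, candidates):
    """Decode a file with the first candidate encoding that works.

    Hypothetical helper: candidates is an iterable of encoding names,
    such as a user's dialog reply, a configuration setting, and the
    platform default; empty entries (sources with no value) are skipped.
    Returns (content, encoding); encoding is None if the binary-mode
    last resort was used, in which case content is bytes, not str.
    """
    for encoding in candidates:
        if not encoding:
            continue
        try:
            with open(path, 'r', encoding=encoding) as file:
                return file.read(), encoding
        except (UnicodeError, LookupError):       # bad data or unknown name
            continue
    with open(path, 'rb') as file:                # last resort: raw bytes
        return file.read(), None

# Example candidate list: user reply (none here), a setting, the default
sources = [None, 'utf-8', locale.getpreferredencoding(False)]
```

Returning the encoding that succeeded lets a caller reuse it as the
first candidate on a later save, in the spirit of the "prior open's
encoding" source listed above.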

Watch for this code in Chapter 14. Frankly, PyEdit in this edition
originally read and wrote files in text mode with platform default
encodings. I didn’t consider the implications of Unicode on PyEdit
until the PyMailGUI example’s Internet world raised the specter of
arbitrary text encodings. If it seems that strings are a lot more
complicated than they used to be, it’s probably only because your
scope has been too narrow.
