Programming Python

Authors: Mark Lutz

Unicode and the Text Widget

I told you earlier that text content in the Text widget is always a
string. Technically, though, there are two string types in Python 3.X:
str for Unicode text, and bytes for byte strings. Moreover, text can be
represented in a variety of Unicode encodings when stored in files. It
turns out that both these factors can impact programs that wish to use
Text well in Python 3.X.

In short, tkinter’s Text and other text-related widgets such as Entry
support display of international character sets for both str and
bytes, but we must pass decoded Unicode str to support the broadest
range of character types. In this section, we decompose the text story
in tkinter in general to show why.

String types in the Text widget

You may or may not have noticed, but all our examples so far have been
representing content as str strings, either hardcoded in scripts or
fetched and saved using simple text-mode files, which assume the
platform default encoding. Technically, though, the Text widget allows
us to insert both str and bytes:

>>> from tkinter import Text
>>> T = Text()
>>> T.insert('1.0', 'spam')                # insert a str
>>> T.insert('end', b'eggs')               # insert a bytes
>>> T.pack()                               # "spameggs" appears in text widget now
>>> T.get('1.0', 'end')                    # fetch content
'spameggs\n'

Inserting text as bytes might be useful for viewing arbitrary kinds of
Unicode text, especially if the encoding name is unknown. For example,
text fetched over the Internet (e.g., attached to an email or fetched
by FTP) could be in any Unicode encoding; storing it in binary-mode
files and displaying it as bytes in a Text widget may at least seem to
side-step the encoding in our scripts.

Unfortunately, though, the Text widget returns its content as str
strings, regardless of whether it was inserted as str or bytes; we get
back already-decoded Unicode text strings either way:

>>> T = Text()
>>> T.insert('1.0', 'Textfileline1\n')
>>> T.insert('end', 'Textfileline2\n')         # content is str for str
>>> T.get('1.0', 'end')                        # pack() is irrelevant to get()
'Textfileline1\nTextfileline2\n\n'
>>> T = Text()
>>> T.insert('1.0', b'Bytesfileline1\r\n')     # content is str for bytes too!
>>> T.insert('end', b'Bytesfileline2\r\n')     # and \r displays as a space
>>> T.get('1.0', 'end')
'Bytesfileline1\r\nBytesfileline2\r\n\n'

In fact, we get back str for content even if we insert both str and
bytes, with a single \n added at the end for good measure, as the first
example in this section shows; here’s a more comprehensive
illustration:

>>> T = Text()
>>> T.insert('1.0', 'Textfileline1\n')
>>> T.insert('end', 'Textfileline2\n')         # content is str for both
>>> T.insert('1.0', b'Bytesfileline1\r\n')     # one \n added for either type
>>> T.insert('end', b'Bytesfileline2\r\n')     # pack() displays as 4 lines
>>> T.get('1.0', 'end')
'Bytesfileline1\r\nTextfileline1\nTextfileline2\nBytesfileline2\r\n\n'
>>>
>>> print(T.get('1.0', 'end'))
Bytesfileline1
Textfileline1
Textfileline2
Bytesfileline2

This makes it easy to perform text processing on content after it is
fetched: we may conduct it in terms of str, regardless of which type of
string was inserted. However, this also makes it difficult to treat
text data generically from a Unicode perspective: we cannot save the
returned str content to a binary-mode file as is, because binary-mode
files expect bytes. We must either encode to bytes manually first or
open the file in text mode and rely on it to encode the str. In either
case we must know the Unicode encoding name to apply, assume the
platform default suffices, fall back on guesses and hope one works, or
ask the user.
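The two save options just described can be sketched as follows. This is
only an illustration under stated assumptions, not code from the book:
the helper names save_as_bytes and save_as_text, the temporary file
names, and the choice of UTF-8 are all hypothetical, and content stands
in for a string fetched from a Text widget.

```python
import os
import tempfile

def save_as_bytes(content, path, encoding='utf-8'):
    """Option 1: encode the str manually, then write in binary mode."""
    with open(path, 'wb') as f:
        f.write(content.encode(encoding))

def save_as_text(content, path, encoding='utf-8'):
    """Option 2: open in text mode and let the file object encode."""
    # newline='' turns off newline translation, so both helpers
    # write identical bytes on every platform
    with open(path, 'w', encoding=encoding, newline='') as f:
        f.write(content)

content = 'AÄBäC\n'                  # stands in for T.get('1.0', 'end')
tmpdir = tempfile.mkdtemp()
bpath = os.path.join(tmpdir, 'bytes.txt')
tpath = os.path.join(tmpdir, 'text.txt')
save_as_bytes(content, bpath)
save_as_text(content, tpath)

with open(bpath, 'rb') as f1, open(tpath, 'rb') as f2:
    same = f1.read() == f2.read()
print(same)                          # both routes store the same bytes
```

Either way, an encoding name has to come from somewhere; here it is
simply a default argument that a caller could override.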

In other words, although tkinter allows us to insert and view some
text of unknown encoding as bytes, the fact that it’s returned as str
strings means we generally need to know how to encode it anyhow on
saves, to satisfy Python 3.X file interfaces. Moreover, because bytes
inserted into Text widgets must also be decodable according to the
limited Unicode policies of the underlying Tk library, we’re generally
better off decoding text to str ourselves if we wish to support Unicode
broadly. To truly understand why that’s true, we need to take a brief
excursion through the Land of Unicode.
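One way to decode text ourselves before handing it to a widget is to
try a short list of candidate encodings in order. The sketch below is
an assumption-laden illustration rather than the book's own method: the
function name decode_for_display and the candidate list are
hypothetical. Note that Latin-1 can decode any byte sequence, so it
only makes sense as the last candidate, and a successful decode does
not prove the guess was right.

```python
def decode_for_display(data, candidates=('utf-8', 'latin-1')):
    """Return (text, encoding_used) for the first candidate that decodes."""
    if isinstance(data, str):
        return data, None                 # already decoded: nothing to do
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue                      # try the next candidate
    raise ValueError('no candidate encoding could decode the data')

text1, used1 = decode_for_display(b'A\xc3\x84B\xc3\xa4C')   # UTF-8 bytes
text2, used2 = decode_for_display(b'A\xc4B\xe4C')           # Latin-1 bytes
print(used1, used2)
print(text1 == text2)
```

The resulting str could then be passed to an insert call in place of
the raw bytes.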

Unicode text in strings

The reason for all this extra complexity, of course, is that in a
world with Unicode, we cannot really think of “text” anymore without
also asking “which kind.” Text in general can be encoded in a wide
variety of Unicode encoding schemes. In Python, this is always a factor
for str and pertains to bytes when it contains encoded text. Python’s
str Unicode strings are simply strings once they are created, but you
have to take encodings into consideration when transferring them to
and from files and when passing them to libraries that impose
constraints on text encodings.

We won’t cover Unicode encodings in depth here (see Learning Python for
background details, as well as the brief look at implications for files
in Chapter 4), but a quick review is in order to illustrate how this
relates to Text widgets. First of all, keep in mind that ASCII text
data normally just works in most contexts, because it is a subset of
most Unicode encoding schemes. Data outside the ASCII 7-bit range,
though, may be represented differently as bytes in different encoding
schemes.

For instance, the following must decode a Latin-1 bytes string using
the Latin-1 encoding; using the default (UTF-8 for bytes.decode in
Python 3.X) or an explicitly named encoding that doesn’t match the
bytes will fail:

>>> b = b'A\xc4B\xe4C'                     # these bytes are latin-1 format text
>>> b
b'A\xc4B\xe4C'
>>> s = b.decode('utf8')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid dat...
>>> s = b.decode()
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid dat...
>>> s = b.decode('latin1')
>>> s
'AÄBäC'
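In a script, a failed decode like the ones above surfaces as a
UnicodeDecodeError exception that can be caught. The following sketch
also uses the errors argument of bytes.decode, which the text above
does not cover; it is shown here only as one possible fallback, and it
loses information rather than recovering the original characters.

```python
raw = b'A\xc4B\xe4C'                  # Latin-1 encoded bytes

try:
    raw.decode('utf-8')               # strict decoding is the default
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print('strict decode failed:', exc.reason)

# Lossy fallbacks: substitute U+FFFD for bad bytes, or drop them
replaced = raw.decode('utf-8', errors='replace')
ignored = raw.decode('utf-8', errors='ignore')
print(ascii(replaced), ascii(ignored))

# Only the matching encoding restores the intended text
correct = raw.decode('latin-1')
print(correct)
```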

Once you’ve decoded to a Unicode string, you can “convert” it to
a variety of different encoding schemes. Really, this simply
translates it to alternative binary encoding formats, from which we can
decode again later; a decoded string has no encoding type per se, only
encoded binary data does:

>>> s.encode('latin-1')
b'A\xc4B\xe4C'
>>> s.encode('utf-8')
b'A\xc3\x84B\xc3\xa4C'
>>> s.encode('utf-16')
b'\xff\xfeA\x00\xc4\x00B\x00\xe4\x00C\x00'
>>> s.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\xc4' in position 1: o...

Notice the last test here: the string you encode must be compatible
with the scheme you choose, or you’ll get an exception; here, ASCII is
too narrow to represent characters decoded from Latin-1 bytes. Even
though you can convert a string to bytes in different (compatible)
representations, you must generally know what the encoded format is in
order to decode back to a string:

>>> s.encode('utf-16').decode('utf-16')
'AÄBäC'
>>> s.encode('latin-1').decode('latin-1')
'AÄBäC'
>>> s.encode('latin-1').decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid dat...
>>> s.encode('utf-8').decode('latin-1')
UnicodeEncodeError: 'charmap' codec can't encode character '\xc3' in position 2:...

Note the last test here again. Technically, encoding Unicode
code points (characters) to UTF-8 bytes and then decoding back again
per the Latin-1 format does not raise an error, but trying to print
the result in the console used here does: it’s scrambled garbage. To
maintain fidelity, you must generally know what format encoded bytes
are in:

>>> s
'AÄBäC'
>>> x = s.encode('utf-8').decode('utf-8')        # OK if encoding matches data
>>> x
'AÄBäC'
>>> x = s.encode('latin-1').decode('latin-1')    # any compatible encoding works
>>> x
'AÄBäC'
>>> x = s.encode('utf-8').decode('latin-1')      # decoding works, result is garbage
>>> x
UnicodeEncodeError: 'charmap' codec can't encode character '\xc3' in position 2:...
>>> len(s), len(x)                               # no longer the same string
(5, 7)
>>> s.encode('utf-8')                            # no longer same code points
b'A\xc3\x84B\xc3\xa4C'
>>> x.encode('utf-8')
b'A\xc3\x83\xc2\x84B\xc3\x83\xc2\xa4C'
>>> s.encode('latin-1')
b'A\xc4B\xe4C'
>>> x.encode('latin-1')
b'A\xc3\x84B\xc3\xa4C'

Curiously, the original string may still be there after a mismatch
like this: if we encode the scrambled string back to Latin-1 bytes
again (as 8-bit characters) and then decode properly, we might restore
the original (in some contexts this can constitute a sort of second
chance if data is decoded wrong initially):

>>> s
'AÄBäC'
>>> s.encode('utf-8').decode('latin-1')
UnicodeEncodeError: 'charmap' codec can't encode character '\xc3' in position 2:...
>>> s.encode('utf-8').decode('latin-1').encode('latin-1')
b'A\xc3\x84B\xc3\xa4C'
>>> s.encode('utf-8').decode('latin-1').encode('latin-1').decode('utf-8')
'AÄBäC'
>>> s.encode('utf-8').decode('latin-1').encode('latin-1').decode('utf-8') == s
True
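The second-chance recovery shown above can be wrapped in a small
helper. This is a sketch only: the name repair_mojibake and its default
arguments are hypothetical, and it assumes the text was UTF-8 data
mistakenly decoded as Latin-1, which a script cannot know in general.

```python
def repair_mojibake(garbled, wrong='latin-1', right='utf-8'):
    """Undo a decode made with the wrong codec, if that is possible."""
    try:
        # Re-encode with the codec that was wrongly applied to get the
        # original bytes back, then decode them with the right codec
        return garbled.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return garbled                # not repairable this way: leave as is

original = 'AÄBäC'
garbled = original.encode('utf-8').decode('latin-1')
print(len(original), len(garbled))    # 5 7
fixed = repair_mojibake(garbled)
print(fixed == original)              # True
```

Text that was decoded correctly in the first place usually fails the
re-decode step and is returned unchanged, as the except clause intends.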

On the other hand, we can use a different encoding name to
decode, as long as it’s compatible with the format of the data; ASCII,
UTF-8, and Latin-1, for instance, all format ASCII text the same
way:

>>> 'spam'.encode('utf8').decode('latin1')
'spam'
>>> 'spam'.encode('latin1').decode('ascii')
'spam'
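A quick way to see this compatibility is to compare the encoded bytes
directly. The snippet below is an illustrative check only; the variable
names are arbitrary.

```python
# ASCII text encodes to identical bytes under all three schemes
names = ('ascii', 'utf-8', 'latin-1')
ascii_forms = {'spam'.encode(name) for name in names}
print(ascii_forms)                    # a single bytes value

# A non-ASCII character does not: the encoded bytes differ
non_ascii_forms = {'Ä'.encode(name) for name in ('utf-8', 'latin-1')}
print(len(non_ascii_forms))           # 2

# UTF-16 is not byte-compatible with ASCII, even for ASCII text
utf16_differs = 'spam'.encode('utf-16') != b'spam'
print(utf16_differs)                  # True
```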

It’s important to remember that a string’s decoded value doesn’t
depend on the encoding it came from: once decoded, a string has no
notion of encoding and is simply a sequence of Unicode characters
(“code points”). Hence, we really only need to care about encodings at
the point of transfer to and from files:

>>> s
'AÄBäC'
>>> s.encode('utf-16').decode('utf-16') == s.encode('latin-1').decode('latin-1')
True
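The same point can be checked across several encodings at once: the
encoded sizes differ, but every round trip yields an equal string. This
is a small illustrative sketch; the list of encodings is an arbitrary
choice.

```python
s = 'AÄBäC'
names = ('latin-1', 'utf-8', 'utf-16', 'utf-32')

# Encoded byte counts vary by scheme (utf-16 and utf-32 include a BOM)
sizes = {name: len(s.encode(name)) for name in names}
print(sizes)

# Decoded results are all the same sequence of code points
roundtrips = [s.encode(name).decode(name) for name in names]
all_equal = all(r == s for r in roundtrips)
print(all_equal)                      # True
```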

Unicode text in files

Now, the same rules apply to text files, because Unicode strings are
stored in files as encoded bytes. When writing, we can encode in any
format that accommodates the string’s characters. When reading,
though, we generally must know what that encoding is or provide one
that formats characters the same way:

>>> open('ldata', 'w', encoding='latin-1').write(s)    # store in latin-1 format
5
>>> open('udata', 'w', encoding='utf-8').write(s)      # store in utf-8 format
5
>>> open('ldata', 'r', encoding='latin-1').read()      # OK if correct name given
'AÄBäC'
>>> open('udata', 'r', encoding='utf-8').read()
'AÄBäC'
>>> open('ldata', 'r').read()                          # else, may not work
'AÄBäC'
>>> open('udata', 'r').read()
UnicodeEncodeError: 'charmap' codec can't encode characters in position 2-3: cha...
>>> open('ldata', 'r', encoding='utf-8').read()
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid dat...
>>> open('udata', 'r', encoding='latin-1').read()
UnicodeEncodeError: 'charmap' codec can't encode character '\xc3' in position 2:...
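In script form, the same experiment might look like the sketch below.
It is an illustration under assumptions, not the book's code: it writes
to a temporary directory instead of the current one, the helper
read_text is hypothetical, and every open call names its encoding
explicitly so that results do not depend on the platform default.

```python
import os
import tempfile

s = 'AÄBäC'
tmpdir = tempfile.mkdtemp()
lpath = os.path.join(tmpdir, 'ldata')
upath = os.path.join(tmpdir, 'udata')

with open(lpath, 'w', encoding='latin-1') as f:
    f.write(s)                        # stored as 5 bytes
with open(upath, 'w', encoding='utf-8') as f:
    f.write(s)                        # stored as 7 bytes

def read_text(path, encoding):
    """Read a whole file in text mode with an explicit encoding."""
    with open(path, 'r', encoding=encoding) as f:
        return f.read()

ok_latin = read_text(lpath, 'latin-1') == s      # matching name: works
ok_utf8 = read_text(upath, 'utf-8') == s         # matching name: works

try:
    read_text(lpath, 'utf-8')                    # mismatch: decode error
    mismatch_raised = False
except UnicodeDecodeError:
    mismatch_raised = True

garbled = read_text(upath, 'latin-1')            # mismatch: silent garbage
print(ok_latin, ok_utf8, mismatch_raised, len(garbled))
```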

By contrast, binary-mode files don’t attempt to decode into a Unicode
string; they happily read whatever is present, whether the data was
written to the file in text mode with automatically encoded str
strings (as in the preceding interaction) or in binary mode with
manually encoded bytes strings:

>>> open('ldata', 'rb').read()
b'A\xc4B\xe4C'
>>> open('udata', 'rb').read()
b'A\xc3\x84B\xc3\xa4C'
>>> open('sdata', 'wb').write( s.encode('utf-16') )    # return value: 12
>>> open('sdata', 'rb').read()
b'\xff\xfeA\x00\xc4\x00B\x00\xe4\x00C\x00'
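Because binary mode hands back raw bytes, any decoding is left to the
script. The sketch below shows that manual step for the UTF-16 case; it
is an illustration only, and the temporary file location is an
assumption. The byte order mark at the front of the data depends on the
platform's byte order, so the code accepts either form.

```python
import codecs
import os
import tempfile

s = 'AÄBäC'
path = os.path.join(tempfile.mkdtemp(), 'sdata')

with open(path, 'wb') as f:
    count = f.write(s.encode('utf-16'))   # manual encode, then raw write

with open(path, 'rb') as f:
    raw = f.read()                        # no decoding happens here

# The utf-16 codec writes a 2-byte BOM before the character data
has_bom = raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
text = raw.decode('utf-16')               # manual decode; BOM is consumed
print(count, has_bom, text == s)
```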
