I told you earlier that text content in the Text widget is always a
string. Technically, though, there are two string types in Python 3.X:
str for Unicode text, and bytes for byte strings. Moreover, text can be
represented in a variety of Unicode encodings when stored on files. It
turns out that both these factors can impact programs that wish to use
Text well in Python 3.X.
In short, tkinter’s Text and other text-related widgets such as Entry
support display of international character sets for both str and bytes,
but we must pass decoded Unicode str to support the broadest range of
character types. In this section, we decompose the text story in
tkinter in general to show why.
You may or may not have noticed, but all our examples so far have been
representing content as str strings, either hardcoded in scripts or
fetched and saved using simple text-mode files that assume the platform
default encoding. Technically, though, the Text widget allows us to
insert both str and bytes:
>>> from tkinter import Text
>>> T = Text()
>>> T.insert('1.0', 'spam')                  # insert a str
>>> T.insert('end', b'eggs')                 # insert a bytes
>>> T.pack()                                 # "spameggs" appears in text widget now
>>> T.get('1.0', 'end')                      # fetch content
'spameggs\n'
Inserting text as bytes might be useful for viewing arbitrary kinds of
Unicode text, especially if the encoding name is unknown. For example,
text fetched over the Internet (e.g., attached to an email or fetched
by FTP) could be in any Unicode encoding; storing it in binary-mode
files and displaying it as bytes in a Text widget may at least seem to
side-step the encoding in our scripts.
Unfortunately, though, the Text widget returns its content as str
strings, regardless of whether it was inserted as str or bytes. We get
back already-decoded Unicode text strings either way:
>>> T = Text()
>>> T.insert('1.0', 'Textfileline1\n')
>>> T.insert('end', 'Textfileline2\n')       # content is str for str
>>> T.get('1.0', 'end')                      # pack() is irrelevant to get()
'Textfileline1\nTextfileline2\n\n'
>>> T = Text()
>>> T.insert('1.0', b'Bytesfileline1\r\n')   # content is str for bytes too!
>>> T.insert('end', b'Bytesfileline2\r\n')   # and \r displays as a space
>>> T.get('1.0', 'end')
'Bytesfileline1\r\nBytesfileline2\r\n\n'
In fact, we get back str for content even if we insert both str and
bytes, with a single \n added at the end for good measure, as the first
example in this section shows; here’s a more comprehensive
illustration:
>>> T = Text()
>>> T.insert('1.0', 'Textfileline1\n')
>>> T.insert('end', 'Textfileline2\n')       # content is str for both
>>> T.insert('1.0', b'Bytesfileline1\r\n')   # one \n added for either type
>>> T.insert('end', b'Bytesfileline2\r\n')   # pack() displays as 4 lines
>>> T.get('1.0', 'end')
'Bytesfileline1\r\nTextfileline1\nTextfileline2\nBytesfileline2\r\n\n'
>>>
>>> print(T.get('1.0', 'end'))
Bytesfileline1
Textfileline1
Textfileline2
Bytesfileline2
This makes it easy to perform text processing on content after it is
fetched: we may conduct it in terms of str, regardless of which type of
string was inserted. However, this also makes it difficult to treat
text data generically from a Unicode perspective: we cannot save the
returned str content to a binary mode file as is, because binary mode
files expect bytes. We must either encode to bytes manually first or
open the file in text mode and rely on it to encode the str. In either
case we must know the Unicode encoding name to apply, assume the
platform default suffices, fall back on guesses and hope one works, or
ask the user.
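The two saving options just described can be sketched as follows. This
is only a minimal illustration under stated assumptions: the string
below stands in for what T.get('1.0', 'end') would return, the file
names are hypothetical, and UTF-8 is simply the encoding we have chosen
to assume for the demonstration.

```python
import os
import tempfile

# Stand-in for content fetched from a Text widget: get() always returns str
content = 'A\xc4B\xe4C\n'

# Assumption for this sketch: we have decided to save as UTF-8
encoding = 'utf-8'

with tempfile.TemporaryDirectory() as folder:
    path1 = os.path.join(folder, 'save-binary.txt')   # hypothetical names
    path2 = os.path.join(folder, 'save-text.txt')

    # Option 1: encode manually, then write bytes to a binary-mode file
    with open(path1, 'wb') as f:
        f.write(content.encode(encoding))

    # Option 2: open in text mode and let the file object encode for us;
    # newline='' turns off line-end translation so both files match exactly
    with open(path2, 'w', encoding=encoding, newline='') as f:
        f.write(content)

    # Read both back in binary mode to compare what actually reached disk
    with open(path1, 'rb') as f:
        raw1 = f.read()
    with open(path2, 'rb') as f:
        raw2 = f.read()

print(raw1 == raw2)      # the two options store the same bytes
```

Either way, the encoding name has to come from somewhere; the code
cannot discover it from the str itself.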
In other words, although tkinter allows us to insert and view some text
of unknown encoding as bytes, the fact that it’s returned as str
strings means we generally need to know how to encode it anyhow on
saves, to satisfy Python 3.X file interfaces. Moreover, because bytes
inserted into Text widgets must also be decodable according to the
limited Unicode policies of the underlying Tk library, we’re generally
better off decoding text to str ourselves if we wish to support Unicode
broadly. To truly understand why that’s true, we need to take a brief
excursion through the Land of Unicode.
The reason for all this extra complexity, of course, is that in a world
with Unicode, we cannot really think of “text” anymore without also
asking “which kind.” Text in general can be encoded in a wide variety
of Unicode encoding schemes. In Python, this is always a factor for str
and pertains to bytes when it contains encoded text. Python’s str
Unicode strings are simply strings once they are created, but you have
to take encodings into consideration when transferring them to and from
files and when passing them to libraries that impose constraints on
text encodings.
We won’t cover Unicode encodings in depth here (see Learning Python for
background details, as well as the brief look at implications for files
in Chapter 4), but a quick review is in order to illustrate how this
relates to Text widgets. First of all, keep in mind that ASCII text
data normally just works in most contexts, because it is a subset of
most Unicode encoding schemes. Data outside the ASCII 7-bit range,
though, may be represented differently as bytes in different encoding
schemes.
For instance, the following must decode a Latin-1 bytes string using
the Latin-1 encoding; relying on the default (UTF-8 for the bytes
decode method in Python 3.X) or on an explicitly named encoding that
doesn’t match the bytes will fail:
>>> b = b'A\xc4B\xe4C'                       # these bytes are latin-1 format text
>>> b
b'A\xc4B\xe4C'
>>> s = b.decode('utf8')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid dat...
>>> s = b.decode()
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid dat...
>>> s = b.decode('latin1')
>>> s
'AÄBäC'
Once you’ve decoded to a Unicode string, you can “convert” it to
a variety of different encoding schemes. Really, this simply
translates to alternative binary encoding formats, from which we can
decode again later; a Unicode string has no Unicode type per se, only
encoded binary data does:
>>> s.encode('latin-1')
b'A\xc4B\xe4C'
>>> s.encode('utf-8')
b'A\xc3\x84B\xc3\xa4C'
>>> s.encode('utf-16')
b'\xff\xfeA\x00\xc4\x00B\x00\xe4\x00C\x00'
>>> s.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\xc4' in position 1: o...
Notice the last test here: the string you encode must be compatible
with the scheme you choose, or you’ll get an exception; here, ASCII is
too narrow to represent characters decoded from Latin-1 bytes. Even
though you can convert to the bytes of different (compatible)
representations, you must generally know what the encoded format is in
order to decode back to a string:
>>> s.encode('utf-16').decode('utf-16')
'AÄBäC'
>>> s.encode('latin-1').decode('latin-1')
'AÄBäC'
>>> s.encode('latin-1').decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid dat...
>>> s.encode('utf-8').decode('latin-1')
UnicodeEncodeError: 'charmap' codec can't encode character '\xc3' in position 2:...
Note the last test here again. Technically, encoding Unicode code
points (characters) to UTF-8 bytes and then decoding back again per the
Latin-1 format does not raise an error, but trying to print the result
in this session does: it’s scrambled garbage, and the error shown is an
encode failure raised while displaying it, not a decode failure. To
maintain fidelity, you must generally know what format encoded bytes
are in:
>>> s
'AÄBäC'
>>> x = s.encode('utf-8').decode('utf-8')        # OK if encoding matches data
>>> x
'AÄBäC'
>>> x = s.encode('latin-1').decode('latin-1')    # any compatible encoding works
>>> x
'AÄBäC'
>>> x = s.encode('utf-8').decode('latin-1')      # decoding works, result is garbage
>>> x
UnicodeEncodeError: 'charmap' codec can't encode character '\xc3' in position 2:...
>>> len(s), len(x)                               # no longer the same string
(5, 7)
>>> s.encode('utf-8')                            # no longer same code points
b'A\xc3\x84B\xc3\xa4C'
>>> x.encode('utf-8')
b'A\xc3\x83\xc2\x84B\xc3\x83\xc2\xa4C'
>>> s.encode('latin-1')
b'A\xc4B\xe4C'
>>> x.encode('latin-1')
b'A\xc3\x84B\xc3\xa4C'
Curiously, the original string may still be there after a mismatch like
this: if we encode the scrambled string back to Latin-1 bytes again (as
8-bit characters) and then decode properly, we might restore the
original (in some contexts this can constitute a sort of second chance
if data is decoded wrong initially):
>>> s
'AÄBäC'
>>> s.encode('utf-8').decode('latin-1')
UnicodeEncodeError: 'charmap' codec can't encode character '\xc3' in position 2:...
>>> s.encode('utf-8').decode('latin-1').encode('latin-1')
b'A\xc3\x84B\xc3\xa4C'
>>> s.encode('utf-8').decode('latin-1').encode('latin-1').decode('utf-8')
'AÄBäC'
>>> s.encode('utf-8').decode('latin-1').encode('latin-1').decode('utf-8') == s
True
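The second-chance trick above can be wrapped in a small helper. The
function name and its parameters are hypothetical, invented for this
sketch; it assumes we already know both the encoding that was wrongly
used to decode and the one that should have been used.

```python
def repair_misdecoded(text, wrong='latin-1', right='utf-8'):
    """
    Undo a decode made with the wrong encoding name: re-encode the
    scrambled str with that same wrong name to get the original bytes
    back, then decode those bytes with the right name.
    Hypothetical helper for illustration only.
    """
    return text.encode(wrong).decode(right)

original = 'A\xc4B\xe4C'
scrambled = original.encode('utf-8').decode('latin-1')    # decoded wrong
restored = repair_misdecoded(scrambled)

# Print only lengths and a comparison, so no console encoding is involved
print(len(original), len(scrambled), restored == original)
```

This works here because Latin-1 maps every byte value to a character,
so the wrong decode loses no information. Other wrong choices may not
be reversible this way, which is why the text says only that we might
restore the original.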
On the other hand, we can use a different encoding name to
decode, as long as it’s compatible with the format of the data; ASCII,
UTF-8, and Latin-1, for instance, all format ASCII text the same
way:
>>> 'spam'.encode('utf8').decode('latin1')
'spam'
>>> 'spam'.encode('latin1').decode('ascii')
'spam'
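The compatibility just shown applies only among encodings that format
ASCII characters as the same single bytes. The following sketch checks
a few such names and then contrasts them with UTF-16, which does not
belong to that group; the list of names is just a sample chosen for the
illustration.

```python
data = 'spam'.encode('utf-8')

# Decode the same bytes under several ASCII-compatible encoding names
names = ['ascii', 'utf-8', 'latin-1', 'cp1252']
results = {name: data.decode(name) for name in names}
all_same = len(set(results.values())) == 1

# UTF-16 uses two bytes per character here, plus a byte order mark,
# so its bytes for the same text differ from the others
utf16 = 'spam'.encode('utf-16')

print(all_same, len(data), len(utf16))
```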
It’s important to remember that a string’s decoded value doesn’t
depend on the encoding it came from—once decoded, a string has no
notion of encoding and is simply a sequence of Unicode characters
(“code points”). Hence, we really only need to care about encodings at
the point of transfer to and from
files:
>>> s
'AÄBäC'
>>> s.encode('utf-16').decode('utf-16') == s.encode('latin-1').decode('latin-1')
True
Now, the
same rules apply to text files, because Unicode strings
are stored in files as encoded bytes. When writing, we can encode in
any format that accommodates the string’s characters. When reading,
though, we generally must know what that encoding is or provide one
that formats characters the same way:
>>> open('ldata', 'w', encoding='latin-1').write(s)      # store in latin-1 format
5
>>> open('udata', 'w', encoding='utf-8').write(s)        # store in utf-8 format
5
>>> open('ldata', 'r', encoding='latin-1').read()        # OK if correct name given
'AÄBäC'
>>> open('udata', 'r', encoding='utf-8').read()
'AÄBäC'
>>> open('ldata', 'r').read()                            # else, may not work
'AÄBäC'
>>> open('udata', 'r').read()
UnicodeEncodeError: 'charmap' codec can't encode characters in position 2-3: cha...
>>> open('ldata', 'r', encoding='utf-8').read()
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid dat...
>>> open('udata', 'r', encoding='latin-1').read()
UnicodeEncodeError: 'charmap' codec can't encode character '\xc3' in position 2:...
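In a script, as opposed to the interactive prompt, a mismatched
encoding name on a text-mode read shows up as an exception we can
catch. The sketch below is one way to do so; the read_text function is
a hypothetical helper, and the file is created in a temporary folder so
the example is self-contained.

```python
import os
import tempfile

def read_text(path, encoding):
    """
    Return (text, None) if the file decodes per the given encoding
    name, or (None, message) if decoding fails. Hypothetical helper.
    """
    try:
        with open(path, 'r', encoding=encoding) as f:
            return f.read(), None
    except UnicodeDecodeError as exc:
        return None, str(exc)

text = 'A\xc4B\xe4C'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'ldata')
    with open(path, 'w', encoding='latin-1') as f:    # store in latin-1 format
        f.write(text)

    good, err1 = read_text(path, 'latin-1')           # matching name
    bad, err2 = read_text(path, 'utf-8')              # mismatched name

print(good == text, bad is None)
```

Keep in mind that a wrong name does not always fail: as shown above,
reading UTF-8 data as Latin-1 decodes without error and silently yields
the wrong characters, so catching exceptions covers only some
mismatches.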
By contrast, binary mode files don’t attempt to decode into a Unicode
string; they happily read whatever is present, whether the data was
written to the file in text mode with automatically encoded str strings
(as in the preceding interaction) or in binary mode with manually
encoded bytes strings:
>>> open('ldata', 'rb').read()
b'A\xc4B\xe4C'
>>> open('udata', 'rb').read()
b'A\xc3\x84B\xc3\xa4C'
>>> open('sdata', 'wb').write( s.encode('utf-16') )      # return value: 12
>>> open('sdata', 'rb').read()
b'\xff\xfeA\x00\xc4\x00B\x00\xe4\x00C\x00'
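Because binary mode hands us raw bytes, decoding becomes our job. One
possible approach, sketched here under assumptions, is to try a short
list of encoding names in turn. The decode_with_fallbacks function and
its default list are hypothetical choices for this illustration, not a
general-purpose encoding detector.

```python
import os
import tempfile

def decode_with_fallbacks(data, encodings=('ascii', 'utf-8', 'latin-1')):
    """
    Try each encoding name in order and return (text, name) for the
    first one that decodes the bytes without an error.
    Hypothetical helper: the name and default list are assumptions.
    """
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the encodings could decode the data')

text = 'A\xc4B\xe4C'
found = {}

with tempfile.TemporaryDirectory() as folder:
    for filename, encoding in [('ldata', 'latin-1'), ('udata', 'utf-8')]:
        path = os.path.join(folder, filename)
        with open(path, 'wb') as f:                  # write encoded bytes
            f.write(text.encode(encoding))
        with open(path, 'rb') as f:                  # read raw bytes back
            raw = f.read()
        found[filename] = decode_with_fallbacks(raw)

print(found['ldata'][1], found['udata'][1])
```

The order of the list matters. Latin-1 accepts any byte sequence, so it
must come last, and a successful decode under it is no proof that the
guess was right; it is the "hope one works" option, not a substitute
for knowing the encoding.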