The http.client module we just met provides low-level control for HTTP
clients. When dealing with items available on the Web, though, it’s often
easier to code downloads with Python’s standard urllib.request module,
introduced in the FTP section earlier in this chapter. Since this module is
another way to talk HTTP, let’s expand on its interfaces here.
Recall that given a URL, urllib.request either downloads the requested
object over the Net to a local file or gives us a file-like object from
which we can read the requested object’s contents. As a result, the script
in Example 13-30 does the same work as the http.client script we just
wrote but requires noticeably less code.
Example 13-30. PP4E\Internet\Other\http-getfile-urllib1.py
"""
fetch a file from an HTTP (web) server over sockets via urllib; urllib supports
HTTP, FTP, files, and HTTPS via URL address strings; for HTTP, the URL can name
a file or trigger a remote CGI script; see also the urllib example in the FTP
section, and the CGI script invocation in a later chapter; files can be fetched
over the net with Python in many ways that vary in code and server requirements:
over sockets, FTP, HTTP, urllib, and CGI outputs; caveat: should run filename
through urllib.parse.quote to escape properly unless hardcoded--see later chapters;
"""
import sys
from urllib.request import urlopen
showlines = 6
try:
    servername, filename = sys.argv[1:]              # cmdline args?
except ValueError:                                   # wrong arg count: use defaults
    servername, filename = 'learning-python.com', '/index.html'
remoteaddr = 'http://%s%s' % (servername, filename) # can name a CGI script too
print(remoteaddr)
remotefile = urlopen(remoteaddr) # returns input file object
remotedata = remotefile.readlines() # read data directly here
remotefile.close()
for line in remotedata[:showlines]: print(line) # bytes with embedded \n
Almost all HTTP transfer details are hidden behind the urllib.request
interface here. This version works in almost the same way as the
http.client version we wrote first, but it builds and submits an Internet
URL address to get its work done (the constructed URL is printed as the
script’s first output line). As we saw in the FTP section of this chapter,
the urllib.request function urlopen returns a file-like object from which
we can read the remote data. But because the constructed URLs begin with
“http://” here, the urllib.request module automatically employs the
lower-level HTTP interfaces to download the requested file instead of FTP:
C:\...\PP4E\Internet\Other>http-getfile-urllib1.py
http://learning-python.com/index.html
b'\n'
b' \n'
b'\n'
b"Mark Lutz's Python Training Services \n"
b'b'\n'
b'\n'
b'\n'
b'\n'
b'\n'
C:\...\PP4E\Internet\Other>http-getfile-urllib1.py www.rmi.net /~lutz
http://www.rmi.net/~lutz
b'\n'
b'\n'
b'\n'
b"Mark Lutz's Book Support Site \n"
b'\n'
b'\n'
C:\...\PP4E\Internet\Other>http-getfile-urllib1.py localhost /cgi-bin/languages.py?language=Java
http://localhost/cgi-bin/languages.py?language=Java
b'Languages \n'
b'Syntax\n'
b'Java\n'
b' System.out.println("Hello World"); \n'
b'\n'
b'\n'
As before, the filename argument can name a simple file or a program
invocation with optional parameters at the end, as in the last run here.
If you read this output carefully, you’ll notice that this script still
works if you leave the “index.html” off the end of a site’s root filename
(in the second command line); unlike the raw HTTP version of the preceding
section, the URL-based interface is smart enough to do the right
thing.
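Because urlopen hands back bytes, scripts that want text must decode it themselves. The following is a minimal sketch of that step, not part of the book's examples: the fetch_lines name, its maxlines parameter, and the UTF-8 default are assumptions made for illustration, and a real page may declare a different encoding.

```python
from urllib.request import urlopen

def fetch_lines(url, maxlines=6, encoding='utf-8'):
    """Return up to maxlines decoded text lines from the object at url.

    Hypothetical helper: the encoding default is an assumption, and
    undecodable bytes are replaced rather than raising an error.
    """
    lines = []
    with urlopen(url) as remote:             # closed automatically on exit
        for raw in remote:                   # file-like object yields bytes lines
            if len(lines) >= maxlines:
                break
            text = raw.decode(encoding, errors='replace')
            lines.append(text.rstrip('\r\n'))
    return lines
```

A call such as fetch_lines('http://learning-python.com/index.html') would return str lines without the b'...' wrappers and trailing \n seen in the runs above; the same function also accepts file: URLs, which makes it easy to try without a network connection.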
One last mutation: the following urllib.request downloader script uses
the slightly higher-level urlretrieve interface in that module to
automatically save the downloaded file or script output to a local file
on the client machine. This interface is handy if we really mean to store
the fetched data (e.g., to mimic the FTP protocol). If we plan on
processing the downloaded data immediately, though, this form may be less
convenient than the version we just met: we need to open and read the
saved file. Moreover, we need to provide an extra protocol for specifying
or extracting a local filename, as in Example 13-31.
Example 13-31. PP4E\Internet\Other\http-getfile-urllib2.py
"""
fetch a file from an HTTP (web) server over sockets via urllib; this version
uses an interface that saves the fetched data to a local binary-mode file; the
local filename is either passed in as a cmdline arg or stripped from the URL with
urllib.parse: the filename argument may have a directory path at the front and query
parameters at end, so os.path.split is not enough (only splits off directory path);
caveat: should urllib.parse.quote filename unless known ok--see later chapters;
"""
import sys, os, urllib.request, urllib.parse
showlines = 6
try:
    servername, filename = sys.argv[1:3]             # first 2 cmdline args?
except ValueError:                                   # fewer than 2 args: use defaults
    servername, filename = 'learning-python.com', '/index.html'
remoteaddr = 'http://%s%s' % (servername, filename)  # any address on the Net
if len(sys.argv) == 4:                               # get result filename
    localname = sys.argv[3]
else:
    (scheme, server, path, parms, query, frag) = urllib.parse.urlparse(remoteaddr)
    localname = os.path.split(path)[1]
print(remoteaddr, localname)
urllib.request.urlretrieve(remoteaddr, localname) # can be file or script
remotedata = open(localname, 'rb').readlines() # saved to local file
for line in remotedata[:showlines]: print(line) # file is bytes/binary
Let’s run this last variant from a command line. Its basic
operation is the same as the last two versions: like the prior one, it
builds a URL, and like both of the last two, it accepts an explicit
target server and file path on the command line:
C:\...\PP4E\Internet\Other>http-getfile-urllib2.py
http://learning-python.com/index.html index.html
b'\n'
b' \n'
b'\n'
b"Mark Lutz's Python Training Services \n"
b'b'\n'
b'\n'
b'\n'
b'\n'
b'\n'
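The save-then-read pattern of this script can be condensed into a few lines. This is a sketch only: download_and_preview is a hypothetical name, not something in the book's examples package, and it relies on urlretrieve returning a (filename, headers) pair.

```python
import urllib.request

def download_and_preview(url, localname, maxlines=6):
    """Save the object at url to localname, then return its first lines.

    Hypothetical helper; the lines come back as bytes because the saved
    file is reopened in binary mode, as in the script above.
    """
    savedname, headers = urllib.request.urlretrieve(url, localname)
    with open(savedname, 'rb') as saved:     # close the file promptly
        return saved.readlines()[:maxlines]
```

The extra open step is the cost mentioned earlier: urlretrieve is convenient when the file itself is the goal, while urlopen is more direct when the data is consumed right away.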
Because this version uses a urllib.request interface that automatically
saves the downloaded data in a local file, it’s similar to FTP downloads
in spirit. But this script must also somehow come up with a local
filename for storing the data. You can either let the script strip and
use the base filename from the constructed URL, or explicitly pass a
local filename as a last command-line argument. In the prior run, for
instance, the downloaded web page is stored in the local file index.html
in the current working directory, which is the base filename stripped
from the URL (the script prints the URL and local filename as its first
output line). In the next run, the local filename is passed explicitly
as py-index.html:
C:\...\PP4E\Internet\Other>http-getfile-urllib2.py www.python.org /index.html py-index.html
http://www.python.org/index.html py-index.html
b'\n'
b'\n'
b'\n'
C:\...\PP4E\Internet\Other>http-getfile-urllib2.py www.rmi.net /~lutz books.html
http://www.rmi.net/~lutz books.html
b'\n'
b'\n'
b'\n'
b"Mark Lutz's Book Support Site \n"
b'\n'
b'\n'
C:\...\PP4E\Internet\Other>http-getfile-urllib2.py www.rmi.net /~lutz/about-pp.html
http://www.rmi.net/~lutz/about-pp.html about-pp.html
b'\n'
b'\n'
b'\n'
b'About "Programming Python" \n'
b'\n'
b'\n'
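The name-stripping step behind these runs can be tried in isolation. The sketch below is an illustration rather than the script's own code: local_name_for and its default argument are hypothetical, and it swaps in posixpath.basename for os.path.split because URL paths always use forward slashes, whatever the client platform.

```python
import posixpath
from urllib.parse import urlparse

def local_name_for(url, default='index.html'):
    """Derive a local filename from the path part of a URL.

    Hypothetical helper: the default covers URLs whose path ends in a
    slash (or is empty), where there is no base filename to strip.
    """
    path = urlparse(url).path                # drops scheme, server, and query
    return posixpath.basename(path) or default
```

For instance, local_name_for('http://www.rmi.net/~lutz/about-pp.html') gives 'about-pp.html', and a URL with query parameters such as 'http://localhost/cgi-bin/languages.py?language=Scheme' gives 'languages.py', since urlparse separates the query before the directory path is removed.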
The next listing shows this script being used to trigger a
remote program. As before, if you don’t give the local filename
explicitly, the script strips the base filename out of the filename
argument. That’s not always easy or appropriate for program
invocations: the filename can contain both a remote directory path at
the front and query parameters at the end for a remote program
invocation.
Given a script invocation URL and no explicit output filename,
the script extracts the base filename in the middle by using first the
standard urllib.parse module to pull out the file path, and then
os.path.split to strip off the directory path. However, the resulting
filename is a remote script’s name, and it may or may not be an
appropriate place to store the data locally. In the first run that
follows, for example, the script’s output goes in a local file called
languages.py, the script name in the middle of the URL; in the second,
we instead name the output CxxSyntax.html explicitly to suppress
filename extraction:
C:\...\PP4E\Internet\Other>python http-getfile-urllib2.py localhost /cgi-bin/languages.py?language=Scheme
http://localhost/cgi-bin/languages.py?language=Scheme languages.py
b'Languages \n'
b'Syntax\n'
b'Scheme\n'
b' (display "Hello World") (newline) \n'
b'\n'
b'\n'
C:\...\PP4E\Internet\Other>python http-getfile-urllib2.py localhost /cgi-bin/languages.py?language=C++ CxxSyntax.html
http://localhost/cgi-bin/languages.py?language=C++ CxxSyntax.html
b'Languages \n'
b'Syntax\n'
b'C\n'
b"Sorry--I don't know that language\n"
b'\n'
b'\n'
The remote script returns a not-found message when passed “C++”
in the last command here. It turns out that “+” is a special character
in URL strings (meaning a space), and to be robust, both of the urllib
scripts we’ve just written should really run the filename string through
something called urllib.parse.quote, a tool that escapes special
characters for transmission. We will talk about this in depth in
Chapter 15, so consider this a preview for now. But to make this
invocation work, we need to use special sequences in the constructed
URL. Here’s how to do it by hand:
C:\...\PP4E\Internet\Other>python http-getfile-urllib2.py localhost /cgi-bin/languages.py?language=C%2b%2b CxxSyntax.html
http://localhost/cgi-bin/languages.py?language=C%2b%2b CxxSyntax.html
b'Languages \n'
b'Syntax\n'
b'C++\n'
b' cout << "Hello World" << endl; \n'
b'\n'
b'\n'
The odd %2b strings in this command line are not entirely magical: the
escaping required for URLs can be seen by running standard Python tools
manually. This is what these scripts should do automatically to be able
to handle all possible cases well; urllib.parse.unquote can undo these
escapes if needed:
C:\...\PP4E\Internet\Other>python
>>> import urllib.parse
>>> urllib.parse.quote('C++')
'C%2B%2B'
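One way the scripts could apply this escaping automatically is sketched here. The build_url helper and its keyword-argument interface are hypothetical, shown only to illustrate where quote fits; the book's own treatment comes in Chapter 15.

```python
from urllib.parse import quote, unquote, unquote_plus

def build_url(servername, path, **params):
    """Build an http URL, escaping the path and any query parameters.

    Hypothetical helper: '/' is left alone in the path, while every
    special character in query names and values is escaped (safe='').
    """
    url = 'http://%s%s' % (servername, quote(path))
    query = '&'.join(
        '%s=%s' % (quote(name, safe=''), quote(str(value), safe=''))
        for name, value in params.items())
    return url + '?' + query if query else url

url = build_url('localhost', '/cgi-bin/languages.py', language='C++')
# url is 'http://localhost/cgi-bin/languages.py?language=C%2B%2B'

original = unquote('C%2b%2b')        # 'C++': escapes are case-insensitive
as_spaces = unquote_plus('C++')      # 'C  ': an unescaped '+' means a space
```

The last line mirrors what happened on the server in the failed run: the unescaped plus signs were read as spaces, so the script looked up a language named “C” followed by blanks.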
Again, don’t work too hard at understanding these last few
commands; we will revisit URLs and URL escapes in Chapter 15, while
exploring server-side scripting in Python. I will also explain there why
the C++ result came back with other oddities like &lt;&lt;, the HTML
escapes for <<, generated by the tool cgi.escape in the script on the
server that produces the reply, and usually undone by HTML parsers,
including Python’s html.parser module we’ll meet in Chapter 19:
>>> import cgi
>>> cgi.escape('<<')
'&lt;&lt;'
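Note that cgi.escape was removed from the standard library in Python 3.8. On newer Pythons the same conversion is available as html.escape, with html.unescape to reverse it. The sketch below assumes such a Python; passing quote=False matches cgi.escape's default of leaving quote characters alone, and the sample string is taken from the C++ reply above.

```python
import html

reply_line = ' cout << "Hello World" << endl; '

# Escape text for embedding in HTML; quote=False leaves " and ' untouched
escaped = html.escape(reply_line, quote=False)

# Undo the escapes, much as an HTML parser or browser would on display
restored = html.unescape(escaped)
```

Here escaped holds the &lt;&lt; sequences that the server sends, and restored is equal to the original line again.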
Also in Chapter 15, we’ll meet urllib support for proxies, and its
support for client-side cookies. We’ll discuss the related HTTPS concept
in Chapter 16: HTTP transmissions over secure sockets, supported by
urllib.request on the client side if SSL support is compiled into your
Python. For now, it’s time to wrap up our look at the Web, and the
Internet at large, from the client side of the fence.