% mechanize -- Documentation

<span class="docwarning">This documentation is in need of
reorganisation!</span>

This page is the old ClientCookie documentation.  It deals with operation on
the level of `urllib2 Handler` objects, and also with adding headers,
debugging, and cookie handling.  See the [front page](./) for more typical use.


Examples
--------

~~~~{.python}
import mechanize
response = mechanize.urlopen("http://example.com/")
~~~~

This function behaves identically to `urllib2.urlopen()`, except that it deals
with cookies automatically.

Here is a more complicated example, involving `Request` objects (useful if you
want to pass `Request`s around, add headers to them, etc.):

~~~~{.python}
import mechanize
request = mechanize.Request("http://example.com/")
# note we're using the urlopen from mechanize, not urllib2
response = mechanize.urlopen(request)
# let's say this next request requires a cookie that was set
# in response
request2 = mechanize.Request("http://example.com/spam.html")
response2 = mechanize.urlopen(request2)

print response2.geturl()
print response2.info()  # headers
print response2.read()  # body (readline and readlines work too)
~~~~

In these examples, the workings are hidden inside the `mechanize.urlopen()`
function, which is an extension of `urllib2.urlopen()`.  Redirects, proxies and
cookies are handled automatically by this function (note that you may need a
bit of configuration to get your proxies correctly set up: see `urllib2`
documentation).

There is also a `urlretrieve()` function, which works like
`urllib.urlretrieve()`.

An example at a slightly lower level shows how the module processes cookies
more clearly:

~~~~{.python}
# Don't copy this blindly!  You probably want to follow the examples
# above, not this one.
import mechanize

# Build an opener that *doesn't* automatically call .add_cookie_header()
# and .extract_cookies(), so we can do it manually without interference.
class NullCookieProcessor(mechanize.HTTPCookieProcessor):
    def http_request(self, request): return request
    def http_response(self, request, response): return response
opener = mechanize.build_opener(NullCookieProcessor)

request = mechanize.Request("http://example.com/")
response = mechanize.urlopen(request)
cj = mechanize.CookieJar()
cj.extract_cookies(response, request)
# let's say this next request requires a cookie that was set in response
request2 = mechanize.Request("http://example.com/spam.html")
cj.add_cookie_header(request2)
response2 = mechanize.urlopen(request2)
~~~~

The `CookieJar` class does all the work.  There are essentially two operations:
`.extract_cookies()` extracts HTTP cookies from `Set-Cookie` (the original
[Netscape cookie standard](http://curl.haxx.se/rfc/cookie_spec.html)) and
`Set-Cookie2` ([RFC 2965](http://www.ietf.org/rfc/rfc2965.txt)) headers from a
response if and only if they should be set given the request, and
`.add_cookie_header()` adds `Cookie` headers if and only if they are
appropriate for a particular HTTP request.  Incoming cookies are checked for
acceptability based on the host name, etc.  Cookies are only set on outgoing
requests if they match the request's host name, path, etc.

**Note that if you're using `mechanize.urlopen()` (or if you're using
`mechanize.HTTPCookieProcessor` by some other means), you don't need to call
`.extract_cookies()` or `.add_cookie_header()` yourself**.  If, on the other
hand, you want to use mechanize to provide cookie handling for an HTTP client
other than mechanize itself, you will need to use this pair of methods.  You
can make your own `request` and `response` objects, which must support the
interfaces described in the docstrings of `.extract_cookies()` and
`.add_cookie_header()`.

There are also some `CookieJar` subclasses which can store cookies in files and
databases.  `FileCookieJar` is the abstract class for `CookieJar`s that can
store cookies in disk files.  `LWPCookieJar` saves cookies in a format
compatible with the libwww-perl library.  This class is convenient if you want
to store cookies in a human-readable file:

~~~~{.python}
import mechanize
cj = mechanize.LWPCookieJar()
cj.revert("cookie3.txt")
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
r = opener.open("http://foobar.com/")
cj.save("cookie3.txt")
~~~~

The `.revert()` method discards all existing cookies held by the `CookieJar`
(it won't lose any existing cookies if the load fails).  The `.load()` method,
on the other hand, adds the loaded cookies to existing cookies held in the
`CookieJar` (old cookies are kept unless overwritten by newly loaded ones).

`MozillaCookieJar` can load and save to the Mozilla/Netscape/lynx-compatible
`'cookies.txt'` format.  This format loses some information (unusual and
nonstandard cookie attributes such as comment, and also information specific to
RFC 2965 cookies).  The subclass `MSIECookieJar` can load (but not save) from
Microsoft Internet Explorer's cookie files on Windows.


Important note
--------------

Only use names you can import directly from the `mechanize` package, and that
don't start with a single underscore.  Everything else is subject to change or
disappearance without notice.


Cooperating with Browsers
-------------------------

**Firefox since version 3 persists cookies in an sqlite database, which is not
supported by MozillaCookieJar.**

The subclass `MozillaCookieJar` differs from `CookieJar` only in storing
cookies using a different, Firefox 2/Mozilla/Netscape-compatible, file format
known as "cookies.txt".  The lynx browser also uses this format.  This file
format can't store RFC 2965 cookies, so they are downgraded to Netscape cookies
on saving.  `LWPCookieJar` itself uses a libwww-perl specific format
(\`Set-Cookie3') -- see the example above.  Python and your browser should be
able to share a cookies file (note that the file location here will differ on
non-unix OSes):

**WARNING:** you may want to back up your browser's cookies file if you use
`MozillaCookieJar` to save cookies.  I *think* it works, but there have been
bugs in the past!

~~~~{.python}
import os, mechanize
cookies = mechanize.MozillaCookieJar()
cookies.load(os.path.join(os.environ["HOME"], "/.netscape/cookies.txt"))
# see also the save and revert methods
~~~~

Note that cookies saved while Mozilla is running will get clobbered by Mozilla
-- see `MozillaCookieJar.__doc__`.

`MSIECookieJar` does the same for Microsoft Internet Explorer (MSIE) 5.x and
6.x on Windows, but does not allow saving cookies in this format.  In future,
the Windows API calls might be used to load and save (though the index has to
be read directly, since there is no API for that, AFAIK; there's also an
unfinished `MSIEDBCookieJar`, which uses (reads and writes) the Windows MSIE
cookie database directly, rather than storing copies of cookies as
`MSIECookieJar` does).

~~~~{.python}
import mechanize
cj = mechanize.MSIECookieJar(delayload=True)
cj.load_from_registry()  # finds cookie index file from registry
~~~~

A true `delayload` argument speeds things up.

On Windows 9x (win 95, win 98, win ME), you need to supply a username to the
`.load_from_registry()` method:

~~~~{.python}
cj.load_from_registry(username="jbloggs")
~~~~

Konqueror/Safari and Opera use different file formats, which aren't yet
supported.


Saving cookies in a file
------------------------

If you have no need to co-operate with a browser, the most convenient way to
save cookies on disk between sessions in human-readable form is to use
`LWPCookieJar`.  This class uses a libwww-perl specific format
(\`Set-Cookie3').  Unlike `MozilliaCookieJar`, this file format doesn't lose
information.


Supplying a CookieJar
---------------------

You might want to do this to [use your browser's
cookies](#cooperating-with-browsers), to customize `CookieJar`'s behaviour by
passing constructor arguments, or to be able to get at the cookies it will hold
(for example, for saving cookies between sessions and for debugging).

If you're using the higher-level `urllib2`-like interface (`urlopen()`, etc),
you'll have to let it know what `CookieJar` it should use:

~~~~{.python}
import mechanize
cookies = mechanize.CookieJar()
# build_opener() adds standard handlers (such as HTTPHandler and
# HTTPCookieProcessor) by default.  The cookie processor we supply
# will replace the default one.
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))

r = opener.open("http://example.com/")  # GET
r = opener.open("http://example.com/", data)  # POST
~~~~

The `urlopen()` function uses a global `OpenerDirector` instance to do its
work, so if you want to use `urlopen()` with your own `CookieJar`, install the
`OpenerDirector` you built with `build_opener()` using the
`mechanize.install_opener()` function, then proceed as usual:

~~~~{.python}
mechanize.install_opener(opener)
r = mechanize.urlopen("http://example.com/")
~~~~

Of course, everyone using `urlopen` is using the same global `CookieJar`
instance!

<a name="policy" />

You can set a policy object (must satisfy the interface defined by
`mechanize.CookiePolicy`), which determines which cookies are allowed to be set
and returned.  Use the `policy` argument to the `CookieJar` constructor, or use
the `.set\_policy()` method.  The default implementation has some useful
switches:

~~~~{.python}
from mechanize import CookieJar, DefaultCookiePolicy as Policy
cookies = CookieJar()
# turn on RFC 2965 cookies, be more strict about domains when setting and
# returning Netscape cookies, and block some domains from setting cookies
# or having them returned (read the DefaultCookiePolicy docstring for the
# domain matching rules here)
policy = Policy(rfc2965=True, strict_ns_domain=Policy.DomainStrict,
                blocked_domains=["ads.net", ".ads.net"])
cookies.set_policy(policy)
~~~~


Additional Handlers
-------------------

The following handlers are provided in addition to those provided by `urllib2`:

`HTTPRobotRulesProcessor`
:   WWW Robots (also called wanderers or spiders) are programs that traverse many
    pages in the World Wide Web by recursively retrieving linked pages.  This
    kind of program can place significant loads on web servers, so there is a
    [standard](http://www.robotstxt.org/wc/norobots.html) for a `robots.txt`
    file by which web site operators can request robots to keep out of their
    site, or out of particular areas of it.  This handler uses the standard
    Python library's `robotparser` module.  It raises
    `mechanize.RobotExclusionError` (subclass of `mechanize.HTTPError`) if an
    attempt is made to open a URL prohibited by `robots.txt`.

`HTTPEquivProcessor`
:   The `<META HTTP-EQUIV>` tag is a way of including data in HTML to be treated
    as if it were part of the HTTP headers.  mechanize can automatically read
    these tags and add the `HTTP-EQUIV` headers to the response object's real
    HTTP headers.  The HTML is left unchanged.

`HTTPRefreshProcessor`
:   The `Refresh` HTTP header is a non-standard header which is widely used.  It
    requests that the user-agent follow a URL after a specified time delay.
    mechanize can treat these headers (which may have been set in `<META
    HTTP-EQUIV>` tags) as if they were 302 redirections.  Exactly when and how
    `Refresh` headers are handled is configurable using the constructor
    arguments.

`HTTPRefererProcessor`
:   The `Referer` HTTP header lets the server know which URL you've just visited.
    Some servers use this header as state information, and don't like it if
    this is not present.  It's a chore to add this header by hand every time
    you make a request.  This adds it automatically.  **NOTE**: this only makes
    sense if you use each handler for a single chain of HTTP requests (so, for
    example, if you use a single HTTPRefererProcessor to fetch a series of URLs
    extracted from a single page, **this will break**).
    [mechanize.Browser](../mechanize/) does this properly.

Example:

~~~~{.python}
import mechanize
cookies = mechanize.CookieJar()

opener = mechanize.build_opener(mechanize.HTTPRefererProcessor,
                                mechanize.HTTPEquivProcessor,
                                mechanize.HTTPRefreshProcessor,
                                )
opener.open("http://www.rhubarb.com/")
~~~~


Seekable responses
------------------

Response objects returned from (or raised as exceptions by)
`mechanize.SeekableResponseOpener`, `mechanize.UserAgent` (if
`.set_seekable_responses(True)` has been called) and `mechanize.Browser()` have
`.seek()`, `.get_data()` and `.set_data()` methods:

~~~~{.python}
import mechanize
opener = mechanize.OpenerFactory(mechanize.SeekableResponseOpener).build_opener()
response = opener.open("http://example.com/")
# same return value as .read(), but without affecting seek position
total_nr_bytes = len(response.get_data())
assert len(response.read()) == total_nr_bytes
assert len(response.read()) == 0  # we've already read the data
response.seek(0)
assert len(response.read()) == total_nr_bytes
response.set_data("blah\n")
assert response.get_data() == "blah\n"
...
~~~~

This caching behaviour can be avoided by using `mechanize.OpenerDirector`.  It
can also be avoided with `mechanize.UserAgent`.  Note that `HTTPEquivProcessor`
and `HTTPResponseDebugProcessor` require seekable responses and so are not
compatible with `mechanize.OpenerDirector` and `mechanize.UserAgent`.

~~~~{.python}
import mechanize
ua = mechanize.UserAgent()
ua.set_seekable_responses(False)
ua.set_handle_equiv(False)
ua.set_debug_responses(False)
~~~~

Note that if you turn on features that use seekable responses (currently:
HTTP-EQUIV handling and response body debug printing), returned responses *may*
be seekable as a side-effect of these features.  However, this is not
guaranteed (currently, in these cases, returned response objects are seekable,
but raised respose objects — `mechanize.HTTPError` instances — are not
seekable).  This applies regardless of whether you use `mechanize.UserAgent` or
`mechanize.OpenerDirector`.  If you explicitly request seekable responses by
calling `.set_seekable_responses(True)` on a `mechanize.UserAgent` instance, or
by using `mechanize.Browser` or `mechanize.SeekableResponseOpener`, which
always return seekable responses, then both returned and raised responses are
guaranteed to be seekable.

Handlers should call `response = mechanize.seek_wrapped_response(response)` if
they require the `.seek()`, `.get_data()` or `.set_data()` methods.


Request object lifetime
-----------------------

Note that handlers may create new `Request` instances (for example when
performing redirects) rather than adding headers to existing `Request` objects.


Adding headers
--------------

Adding headers is done like so:

~~~~{.python}
import mechanize
req = mechanize.Request("http://foobar.com/")
req.add_header("Referer", "http://wwwsearch.sourceforge.net/mechanize/")
r = mechanize.urlopen(req)
~~~~

You can also use the `headers` argument to the `mechanize.Request` constructor.

mechanize adds some headers to `Request` objects automatically -- see the next
section for details.


Automatically-added headers
---------------------------

`OpenerDirector` automatically adds a `User-Agent` header to every `Request`.

To change this and/or add similar headers, use your own `OpenerDirector`:

~~~~{.python}
import mechanize
cookies = mechanize.CookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))
opener.addheaders = [("User-agent", "Mozilla/5.0 (compatible; MyProgram/0.1)"),
                     ("From", "responsible.person@example.com")]
~~~~

Again, to use `urlopen()`, install your `OpenerDirector` globally:

~~~~{.python}
mechanize.install_opener(opener)
r = mechanize.urlopen("http://example.com/")
~~~~

Also, a few standard headers (`Content-Length`, `Content-Type` and `Host`) are
added when the `Request` is passed to `urlopen()` (or `OpenerDirector.open()`).
You shouldn't need to change these headers, but since this is done by
`AbstractHTTPHandler`, you can change the way it works by passing a subclass of
that handler to `build_opener()` (or, as always, by constructing an opener
yourself and calling `.add_handler()`).



Initiating unverifiable transactions
------------------------------------

This section is only of interest for correct handling of third-party HTTP
cookies.  See [below](#note-about-cookie-standards) for an explanation of
'third-party'.

First, some terminology.

An *unverifiable request* (defined fully by ([RFC
2965](http://www.ietf.org/rfc/rfc2965.txt)) is one whose URL the user did not
have the option to approve.  For example, a transaction is unverifiable if the
request is for an image in an HTML document, and the user had no option to
approve the fetching of the image from a particular URL.

The *request-host of the origin transaction* (defined fully by RFC 2965) is the
host name or IP address of the original request that was initiated by the user.
For example, if the request is for an image in an HTML document, this is the
request-host of the request for the page containing the image.

**mechanize knows that redirected transactions are unverifiable, and will
handle that on its own (ie. you don't need to think about the origin
request-host or verifiability yourself).**

If you want to initiate an unverifiable transaction yourself (which you should
if, for example, you're downloading the images from a page, and 'the user'
hasn't explicitly OKed those URLs):

~~~~{.python}
request = Request(origin_req_host="www.example.com", unverifiable=True)
~~~~


RFC 2965 support
----------------

Support for the RFC 2965 protocol is switched off by default, because few
browsers implement it, so the RFC 2965 protocol is essentially never seen on
the internet.  To switch it on, see [here](#policy).


Parsing HTTP dates
------------------

A function named `str2time` is provided by the package, which may be useful for
parsing dates in HTTP headers.  `str2time` is intended to be liberal, since
HTTP date/time formats are poorly standardised in practice.  There is no need
to use this function in normal operations: `CookieJar` instances keep track of
cookie lifetimes automatically.  This function will stay around in some form,
though the supported date/time formats may change.


Dealing with bad HTML
---------------------

XXX Intro

XXX Test me


Note about cookie standards
---------------------------

There are several standards relevant to HTTP cookies.

The Netscape protocol is the only standard supported by most web browsers
(including Internet Explorer and Firefox).  This is a *de facto* standard
defined by the behaviour of popular browsers, and neither the
[cookie\_spec.html](http://curl.haxx.se/rfc/cookie_spec.html) document that was
published by Netscape, nor the RFCs that were published later, describe the
Netscape protocol accurately or completely.  Netscape protocol cookies are also
known as V0 cookies, to distinguish them from RFC 2109 or RFC 2965 cookies,
which have a version cookie-attribute with a value of 1.

[RFC 2109](http://www.ietf.org/rfc/rfc2109.txt) was introduced to fix some
problems identified with the Netscape protocol, while still keeping the same
HTTP headers (`Cookie` and `Set-Cookie`).  The most prominent of these problems
is the 'third-party' cookie issue, which was an accidental feature of the
Netscape protocol.  Some features defined by RFC2109 (such as the port and
max-age cookie attributes) are now part of the de facto Netscape protocol, but
the RFC was never implemented fully by browsers, because of differences in
behaviour between the Netscape and Internet Explorer browsers of the time.

[RFC 2965](http://www.ietf.org/rfc/rfc2965.txt) attempted to fix the
compatibility problem by introducing two new headers, `Set-Cookie2` and
`Cookie2`.  Unlike the `Cookie` header, `Cookie2` does *not* carry cookies to
the server -- rather, it simply advertises to the server that RFC 2965 is
understood.  `Set-Cookie2` *does* carry cookies, from server to client: the new
header means that both IE and Netscape ignore these cookies.  This preserves
backwards compatibility, but popular browsers did not implement the RFC, so it
was never widely adopted.  One confusing point to note about RFC 2965 is that
it uses the same value (1) of the Version attribute in HTTP headers as does RFC
2109.  See also [RFC 2964](http://www.ietf.org/rfc/rfc2964.txt), which
discusses use of the protocol.

Because Netscape cookies are so poorly specified, the general philosophy of the
module's Netscape protocol implementation is to start with RFC 2965 and open
holes where required for Netscape protocol-compatibility.  RFC 2965 cookies are
*always* treated as RFC 2965 requires, of course.

There is more information about the history of HTTP cookies in [this paper by
David Kristol](http://arxiv.org/abs/cs.SE/0105018).

Recently (2011), [an IETF effort has
started](http://tools.ietf.org/html/draft-ietf-httpstate-cookie) to specify the
syntax and semantics of the `Cookie` and `Set-Cookie` headers as they are
actually used on the internet.


<!-- Local Variables: -->
<!-- fill-column:79 -->
<!-- End: -->
