Package lxml :: Package html
[frames] | no frames]

Package html

source code

The lxml.html tool set for HTML handling.
Submodules

Functions
 
document_fromstring(html, **kw) source code
 
fragments_fromstring(html, no_leading_text=False, base_url=None, **kw)
Parses several HTML elements, returning a list of elements.
source code
 
fragment_fromstring(html, create_parent=False, base_url=None, **kw)
Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element.
source code
 
fromstring(html, base_url=None, **kw)
Parse the html, returning a single element/document.
source code
 
submit_form(form, extra_values=None, open_http=None)
Helper function to submit a form.
source code
 
tostring(doc, pretty_print=False, include_meta_content_type=False, encoding=None, method='html')
Return an HTML string representation of the document.
source code
 
open_in_browser(doc)
Open the HTML document in a web browser (saving it to a temporary file to open it).
source code
 
Element(*args, **kw) source code
Variables
  find_rel_links = _MethodFunc('find_rel_links', copy= False)
  find_class = _MethodFunc('find_class', copy= False)
  make_links_absolute = _MethodFunc('make_links_absolute', copy=...
  resolve_base_href = _MethodFunc('resolve_base_href', copy= True)
  iterlinks = _MethodFunc('iterlinks', copy= False)
  rewrite_links = _MethodFunc('rewrite_links', copy= True)
Function Details

fragments_fromstring(html, no_leading_text=False, base_url=None, **kw)

source code 

Parses several HTML elements, returning a list of elements.

The first item in the list may be a string (though leading whitespace is removed). If no_leading_text is true, then it will be an error if there is leading text, and it will always be a list of only elements.

base_url will set the document's base_url attribute (and the tree's docinfo.URL)

fragment_fromstring(html, create_parent=False, base_url=None, **kw)

source code 

Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element.

If create_parent is true (or is a tag name) then a parent node will be created to encapsulate the HTML in a single element.

base_url will set the document's base_url attribute (and the tree's docinfo.URL)

fromstring(html, base_url=None, **kw)

source code 

Parse the html, returning a single element/document.

This tries to minimally parse the chunk of text, without knowing if it is a fragment or a document.

base_url will set the document's base_url attribute (and the tree's docinfo.URL)

submit_form(form, extra_values=None, open_http=None)

source code 

Helper function to submit a form. Returns a file-like object, as from urllib.urlopen(). This object also has a .geturl() function, which shows the URL if there were any redirects.

You can use this like:

form = doc.forms[0]
form.inputs['foo'].value = 'bar' # etc
response = form.submit()
doc = parse(response)
doc.make_links_absolute(response.geturl())

To change the HTTP requester, pass a function as open_http keyword argument that opens the URL for you. The function must have the following signature:

open_http(method, URL, values)

The action is one of 'GET' or 'POST', the URL is the target URL as a string, and the values are a sequence of (name, value) tuples with the form data.

tostring(doc, pretty_print=False, include_meta_content_type=False, encoding=None, method='html')

source code 

Return an HTML string representation of the document.

Note: if include_meta_content_type is true this will create a <meta http-equiv="Content-Type" ...> tag in the head; regardless of the value of include_meta_content_type any existing <meta http-equiv="Content-Type" ...> tag will be removed

The encoding argument controls the output encoding (defauts to ASCII, with &#...; character references for any characters outside of ASCII).

The method argument defines the output method. It defaults to 'html', but can also be 'xml' for xhtml output, or 'text' to serialise to plain text without markup. Note that you can pass the builtin unicode type as encoding argument to serialise to a unicode string.

Example:

>>> from lxml import html
>>> root = html.fragment_fromstring('<p>Hello<br>world!</p>')

>>> html.tostring(root)
'<p>Hello<br>world!</p>'
>>> html.tostring(root, method='html')
'<p>Hello<br>world!</p>'

>>> html.tostring(root, method='xml')
'<p>Hello<br/>world!</p>'

>>> html.tostring(root, method='text')
'Helloworld!'

>>> html.tostring(root, method='text', encoding=unicode)
u'Helloworld!'

Variables Details

make_links_absolute

Value:
_MethodFunc('make_links_absolute', copy= True)