lxml.html.html5parser module

An interface to html5lib that mimics the lxml.html interface.

class lxml.html.html5parser.HTMLParser(strict=False, **kwargs)[source]

Bases: HTMLParser

An html5lib HTML parser with lxml as tree.

_parse(stream, innerHTML=False, container='div', scripting=False, **kwargs)[source]

adjustForeignAttributes(token)[source]

adjustMathMLAttributes(token)[source]

adjustSVGAttributes(token)[source]

isHTMLIntegrationPoint(element)[source]

isMathMLTextIntegrationPoint(element)[source]

mainLoop()[source]

parse(stream, *args, **kwargs)[source]

Parse an HTML document into a well-formed tree.

Parameters:

stream –
a file-like object or string containing the HTML to be parsed

The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element).
scripting – treat noscript elements as if JavaScript was turned on

Returns:

parsed tree

Example:

>>> from html5lib.html5parser import HTMLParser
>>> parser = HTMLParser()
>>> parser.parse('<html><body><p>This is a doc</p></body></html>')
<Element u'{http://www.w3.org/1999/xhtml}html' at 0x7feac4909db0>
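The doctest above is inherited from html5lib's own parser. As a rough sketch of the same call made through this module's subclass (assuming html5lib is installed), parse() is expected to hand back an lxml ElementTree, so getroot() is needed to reach the html element. The sample markup and variable names are only illustrative.

```python
from lxml import etree
from lxml.html.html5parser import HTMLParser

# Illustrative input; any HTML string would do.
SAMPLE = "<html><body><p>This is a doc</p></body></html>"

parser = HTMLParser()
tree = parser.parse(SAMPLE)   # an ElementTree-like object, not an element
root = tree.getroot()

# html5lib places HTML elements in the XHTML namespace by default,
# so compare local names instead of raw tag strings.
root_name = etree.QName(root).localname
paragraphs = [el.text for el in root.iter("{*}p")]

print(root_name, paragraphs)
```

Comparing local names keeps the example independent of whether the parser was created with namespaced HTML elements.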

parseError(errorcode='XXX-undefined-error', datavars=None)[source]

parseFragment(stream, *args, **kwargs)[source]

Parse an HTML fragment into a well-formed tree fragment.

Parameters:

container – name of the element whose innerHTML property is being set; if None, defaults to ‘div’
stream –
a file-like object or string containing the HTML to be parsed

The optional encoding parameter must be a string that indicates the encoding. If specified, that encoding will be used, regardless of any BOM or later declaration (such as in a meta element).
scripting – treat noscript elements as if JavaScript was turned on

Returns:

parsed tree

Example:

>>> from html5lib.html5parser import HTMLParser
>>> parser = HTMLParser()
>>> parser.parseFragment('<b>this is a fragment</b>')
<Element u'DOCUMENT_FRAGMENT' at 0x7feac484b090>
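The container argument changes the context in which the fragment is parsed. The sketch below assumes html5lib is installed and uses this module's HTMLParser; with the lxml tree builder the fragment is expected to come back as a plain list of nodes rather than a DOCUMENT_FRAGMENT element. The markup is made up for illustration.

```python
from lxml import etree
from lxml.html.html5parser import HTMLParser

parser = HTMLParser()

# Parse list items as if they were assigned to the innerHTML of a <ul>.
items = parser.parseFragment("<li>first</li><li>second</li>", container="ul")

# Each entry is an element here because the input has no leading text.
names = [etree.QName(el).localname for el in items]
texts = [el.text for el in items]

print(names, texts)
```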

parseRCDataRawtext(token, contentType)[source]

reparseTokenNormal(token)[source]

reset()[source]

resetInsertionMode()[source]

property documentEncoding: Name of the character encoding that was used to decode the input stream, or None if that is not determined yet

lxml.html.html5parser._find_tag(tree, tag)[source]

lxml.html.html5parser._looks_like_url(str)[source]

lxml.html.html5parser.document_fromstring(html, guess_charset=None, parser=None)[source]

Parse a whole document from a string.

If guess_charset is true, or if the input is not Unicode but a byte string, the chardet library will perform charset guessing on the string.
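A minimal sketch of document_fromstring, assuming html5lib is installed. It passes a bare paragraph as a Unicode string (so no charset guessing is involved) and shows that the HTML5 parsing algorithm supplies the surrounding html, head and body elements. The input is illustrative.

```python
from lxml import etree
from lxml.html.html5parser import document_fromstring

# A bare paragraph; the parser fills in <html>, <head> and <body>.
root = document_fromstring("<p>Hello</p>")

# Elements are namespaced by default, so look at local names.
root_name = etree.QName(root).localname
child_names = [etree.QName(child).localname for child in root]

print(root_name, child_names)
```

The return value is the root element itself, not a tree, so no getroot() call is needed here.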

lxml.html.html5parser.fragment_fromstring(html, create_parent=False, guess_charset=None, parser=None)[source]

Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element.

If ‘create_parent’ is true (or is a tag name) then a parent node will be created to encapsulate the HTML in a single element. In this case, leading or trailing text is allowed.

If guess_charset is true, the chardet library will perform charset guessing on the string.
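The three behaviours described above (a single element, a created parent, and the multiple-element error) can be sketched as follows, assuming html5lib is installed. The inputs and the 'section' wrapper name are arbitrary choices for the example.

```python
from lxml import etree
from lxml.html.html5parser import fragment_fromstring

# Exactly one element: it is returned directly.
bold = fragment_fromstring("<b>bold</b>")
bold_name = etree.QName(bold).localname

# With create_parent, surrounding text is allowed; a tag name
# selects the wrapper element that is created.
wrapped = fragment_fromstring("intro <b>bold</b> outro", create_parent="section")
wrapped_name = etree.QName(wrapped).localname

# Two sibling elements without create_parent are rejected.
try:
    fragment_fromstring("<b>one</b><i>two</i>")
    multiple_rejected = False
except etree.ParserError:
    multiple_rejected = True

print(bold_name, wrapped_name, multiple_rejected)
```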

lxml.html.html5parser.fragments_fromstring(html, no_leading_text=False, guess_charset=None, parser=None)[source]

Parses several HTML elements, returning a list of elements.

The first item in the list may be a string. If no_leading_text is true, then it will be an error if there is leading text, and it will always be a list of only elements.

If guess_charset is true, the chardet library will perform charset guessing on the string.
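As an illustration of the mixed return value, the sketch below (assuming html5lib is installed) parses a fragment with leading text and then repeats the call with no_leading_text=True to show the error case. The sample strings are invented.

```python
from lxml import etree
from lxml.html.html5parser import fragments_fromstring

parts = fragments_fromstring("lead <b>x</b><i>y</i>")

# The first item may be a plain string holding the leading text.
leading = parts[0] if isinstance(parts[0], str) else None
element_names = [
    etree.QName(el).localname for el in parts if not isinstance(el, str)
]

# Non-whitespace leading text is an error when no_leading_text is true.
try:
    fragments_fromstring("lead <b>x</b>", no_leading_text=True)
    leading_rejected = False
except etree.ParserError:
    leading_rejected = True

print(leading, element_names, leading_rejected)
```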

lxml.html.html5parser.fromstring(html, guess_charset=None, parser=None)[source]

Parse the html, returning a single element/document.

This tries to minimally parse the chunk of text, without knowing if it is a fragment or a document.

‘base_url’ will set the document’s base_url attribute (and the tree’s docinfo.URL)

If guess_charset is true, or if the input is not Unicode but a byte string, the chardet library will perform charset guessing on the string.
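Because fromstring has to guess whether it was given a fragment or a document, the shape of the result depends on the input. The sketch below, assuming html5lib is installed, compares three illustrative inputs; local_name is a hypothetical helper defined only for this example, and the 'div' wrapper for multiple block-level elements reflects current lxml behaviour rather than a documented guarantee.

```python
from lxml import etree
from lxml.html.html5parser import fromstring


def local_name(html):
    """Hypothetical helper: local tag name of whatever fromstring returns."""
    return etree.QName(fromstring(html)).localname


# One element in, that element out.
single = local_name("<p>one</p>")

# Several block-level elements are wrapped in a container element.
several = local_name("<p>one</p><p>two</p>")

# Input starting with a doctype or <html> is treated as a full document.
document = local_name("<!DOCTYPE html><html><body><p>x</p></body></html>")

print(single, several, document)
```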

lxml.html.html5parser.parse(filename_url_or_file, guess_charset=None, parser=None)[source]

Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use parse(...).getroot() to get the document root.

If guess_charset is true, the useChardet option is passed into html5lib to enable character detection. This option is on by default when parsing from URLs, off by default when parsing from file(-like) objects (which tend to return Unicode more often than not), and on by default when parsing from a file path (which is read in binary mode).
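A short sketch of parse with a file-like object, assuming html5lib is installed. An in-memory io.StringIO stands in for a real file so the example needs no filesystem or network access; the markup is illustrative. It also shows the getroot() step mentioned above.

```python
import io

from lxml import etree
from lxml.html.html5parser import parse

# A text-mode file-like object; guess_charset is left at its default.
source = io.StringIO(
    "<html><head><title>Demo</title></head><body></body></html>"
)

tree = parse(source)    # a tree, not an element
root = tree.getroot()   # the <html> element

root_name = etree.QName(root).localname
title_text = next(root.iter("{*}title")).text

print(root_name, title_text)
```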