lxml.html.html5parser module
An interface to html5lib that mimics the lxml.html interface.
-
class lxml.html.html5parser.HTMLParser(strict=False, **kwargs)
Bases: html5lib.html5parser.HTMLParser
An html5lib HTML parser with lxml as tree.
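A minimal sketch of how the strict flag might be used, assuming html5lib is installed alongside lxml. The sample markup and variable names are illustrative, and the example relies on html5lib raising its ParseError exception in strict mode.

```python
from html5lib.html5parser import ParseError
from lxml import etree
from lxml.html import html5parser

markup = "<p>unclosed paragraph"  # no doctype, no closing tag

# Default (strict=False): html5lib recovers from the errors and builds a full tree.
lenient = html5parser.HTMLParser()
root = html5parser.document_fromstring(markup, parser=lenient)
# QName(...).localname drops any XHTML namespace prefix from the tag.
root_name = etree.QName(root.tag).localname

# strict=True: the first parse error is raised instead of being repaired.
strict = html5parser.HTMLParser(strict=True)
try:
    html5parser.document_fromstring(markup, parser=strict)
    strict_failed = False
except ParseError:
    strict_failed = True

print(root_name, strict_failed)
```

A parser instance built this way can be passed as the parser argument of the module-level functions documented below.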
-
lxml.html.html5parser.document_fromstring(html, guess_charset=None, parser=None)
Parse a whole document from a string.
If guess_charset is true, or if the input is not Unicode but a byte string, the chardet library will perform charset guessing on the string.
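As a rough illustration (assuming html5lib is installed), the sketch below parses an incomplete snippet and lists the elements of the resulting document. The input string is made up for the example, and etree.QName is used only so that tag names print the same whether or not the parser puts them in the XHTML namespace.

```python
from lxml import etree
from lxml.html import html5parser

# The input has no <html>, <head> or <body>; the HTML5 algorithm supplies them.
root = html5parser.document_fromstring("<title>Demo</title><p>Hello</p>")

# Walk the tree in document order, skipping non-element nodes such as comments.
names = [
    etree.QName(el.tag).localname
    for el in root.iter()
    if isinstance(el.tag, str)
]

print(names)
```

The return value is the root element of a complete document, even when the input was only a fragment of one.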
-
lxml.html.html5parser.fragment_fromstring(html, create_parent=False, guess_charset=None, parser=None)
Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element.
If ‘create_parent’ is true (or is a tag name) then a parent node will be created to encapsulate the HTML in a single element. In this case, leading or trailing text is allowed.
If guess_charset is true, the chardet library will perform charset guessing on the string.
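The three behaviours described above can be sketched as follows, assuming html5lib is installed. The markup and the "section" parent tag are arbitrary choices for the example.

```python
from lxml import etree
from lxml.html import html5parser

# Exactly one element: returned directly.
single = html5parser.fragment_fromstring("<p>one</p>")

# Two sibling elements without create_parent: rejected with a ParserError.
try:
    html5parser.fragment_fromstring("<p>one</p><p>two</p>")
    multiple_rejected = False
except etree.ParserError:
    multiple_rejected = True

# With create_parent given as a tag name, leading and trailing text is allowed
# and everything is wrapped in a new element with that tag.
wrapped = html5parser.fragment_fromstring(
    "lead <b>bold</b> tail", create_parent="section"
)

print(single.text, multiple_rejected, wrapped.tag)
print(repr(wrapped.text), repr(wrapped[0].tail))
```

Passing create_parent=True instead of a tag name wraps the content in a div.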
-
lxml.html.html5parser.fragments_fromstring(html, no_leading_text=False, guess_charset=None, parser=None)
Parses several HTML elements, returning a list of elements.
The first item in the list may be a string holding any leading text. If no_leading_text is true, leading text is an error, so the returned list contains only elements.
If guess_charset is true, the chardet library will perform charset guessing on the string.
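A short sketch of the return shape, assuming html5lib is installed. The input strings are illustrative only.

```python
from lxml import etree
from lxml.html import html5parser

# Leading text comes back as a plain string in front of the elements.
items = html5parser.fragments_fromstring("intro <em>a</em><em>b</em>")
has_leading_text = isinstance(items[0], str)
element_texts = [el.text for el in items[1:]]

# With no_leading_text=True the same kind of input is rejected.
try:
    html5parser.fragments_fromstring("intro <em>a</em>", no_leading_text=True)
    leading_text_rejected = False
except etree.ParserError:
    leading_text_rejected = True

print(has_leading_text, element_texts, leading_text_rejected)
```

Callers that accept leading text need to check the type of the first item before treating it as an element.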
-
lxml.html.html5parser.fromstring(html, guess_charset=None, parser=None)
Parse the html, returning a single element/document.
This tries to minimally parse the chunk of text, without knowing if it is a fragment or a document.
‘base_url’ will set the document’s base_url attribute (and the tree’s docinfo.URL)
If guess_charset is true, or if the input is not Unicode but a byte string, the chardet library will perform charset guessing on the string.
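The sketch below illustrates how the result of fromstring can vary with the input, assuming html5lib is installed. The local() helper is a hypothetical convenience for this example, and the span and div containers reflect the behaviour of the lxml implementation as I understand it, not something stated in the text above.

```python
from lxml import etree
from lxml.html import html5parser


def local(el):
    """Return the tag name without any namespace prefix (illustrative helper)."""
    return etree.QName(el.tag).localname


# Input that starts with a doctype is treated as a full document.
doc = html5parser.fromstring("<!DOCTYPE html><html><body><p>x</p></body></html>")

# A single element comes back as that element.
single = html5parser.fromstring("<p>only</p>")

# Several sibling elements are returned inside a container element.
inline_siblings = html5parser.fromstring("<b>a</b><i>b</i>")
block_siblings = html5parser.fromstring("<p>a</p><p>b</p>")

print(local(doc), local(single), local(inline_siblings), local(block_siblings))
```

Because the return type depends on the input, document_fromstring or fragment_fromstring may be a better fit when the kind of input is known in advance.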
-
lxml.html.html5parser.parse(filename_url_or_file, guess_charset=None, parser=None)
Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use parse(...).getroot() to get the document root.
If guess_charset is true, the useChardet option is passed into html5lib to enable character detection. This option is on by default when parsing from URLs, off by default when parsing from file(-like) objects (which tend to return Unicode more often than not), and on by default when parsing from a file path (which is read in binary mode).
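A minimal sketch of the tree-versus-element distinction, assuming html5lib is installed. It uses an in-memory io.StringIO as the file-like object so that no file or network access is needed; the markup is made up for the example.

```python
import io

from lxml import etree
from lxml.html import html5parser

# A file-like object that yields Unicode text; guess_charset is left at its default.
source = io.StringIO("<!DOCTYPE html><title>t</title><p>Hello</p>")

tree = html5parser.parse(source)  # a document tree, not an element
root = tree.getroot()             # the <html> element

root_name = etree.QName(root.tag).localname
paragraphs = [
    el.text
    for el in root.iter()
    if isinstance(el.tag, str) and etree.QName(el.tag).localname == "p"
]

print(root_name, paragraphs)
```

The same call accepts a file path or a URL string in place of the file-like object.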