Module html5parser

An interface to html5lib that mimics the lxml.html interface.
Classes
  HTMLParser
An html5lib HTML parser that builds an lxml tree.
  XHTMLParser
An html5lib XHTML parser that builds an lxml tree.
Functions
 
_find_tag(tree, tag)
 
document_fromstring(html, guess_charset=True, parser=None)
Parse a whole document from a string.
 
fragments_fromstring(html, no_leading_text=False, guess_charset=False, parser=None)
Parses several HTML elements, returning a list of elements.
 
fragment_fromstring(html, create_parent=False, guess_charset=False, parser=None)
Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element.
 
fromstring(html, guess_charset=True, parser=None)
Parse the html, returning a single element/document.
 
parse(filename_url_or_file, guess_charset=True, parser=None)
Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use parse(...).getroot() to get the document root.
 
_looks_like_url(str)
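
As a rough illustration of the two whole-document entry points listed above, the sketch below parses the same markup once from a string with document_fromstring() and once from a file-like object with parse(). It assumes html5lib is installed alongside lxml. The sample markup and the helper local_name() are made up for this example. Because html5lib may place elements in the XHTML namespace, the sketch compares local names rather than raw tags.

```python
import io

from lxml import etree
from lxml.html import html5parser

# Hypothetical sample input, not taken from the lxml documentation.
MARKUP = (
    '<!DOCTYPE html><html><head><meta charset="utf-8">'
    "<title>Demo</title></head><body><p>Hello</p></body></html>"
)


def local_name(element):
    """Return the tag name without any namespace part (illustrative helper)."""
    return etree.QName(element).localname


# document_fromstring() hands back the root element directly.
root = html5parser.document_fromstring(MARKUP)

# parse() hands back a tree, so getroot() is needed to reach the root element.
tree = html5parser.parse(io.BytesIO(MARKUP.encode("utf-8")))
tree_root = tree.getroot()

print(local_name(root), local_name(tree_root))
```

Both calls should end up at the same kind of root element; the only difference shown here is the extra getroot() step after parse().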
Variables
  xhtml_parser = XHTMLParser()
  html_parser = HTMLParser()
  __package__ = 'lxml.html'
Function Details

fragments_fromstring(html, no_leading_text=False, guess_charset=False, parser=None)

Parses several HTML elements, returning a list of elements.

The first item in the list may be a string holding any leading text. If no_leading_text is true, leading text is an error, and the result is always a list containing only elements.

If guess_charset is true and the input is a byte string rather than Unicode, the chardet library is used to guess the character set of the input.
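
The handling of leading text can be sketched as follows. This is a minimal example that assumes html5lib is installed; the sample snippet is invented for illustration, and the element names are read through etree.QName because html5lib may namespace the tags.

```python
from lxml import etree
from lxml.html import html5parser

# Hypothetical sample input with text before the first element.
SNIPPET = "Hello <b>world</b><i>!</i>"

items = html5parser.fragments_fromstring(SNIPPET)

# Leading text, if any, arrives as a plain string in first position.
leading = items[0] if isinstance(items[0], str) else None
elements = [item for item in items if not isinstance(item, str)]
names = [etree.QName(el).localname for el in elements]
print(repr(leading), names)

# With no_leading_text=True the same input is rejected instead.
try:
    html5parser.fragments_fromstring(SNIPPET, no_leading_text=True)
    rejected = False
except etree.ParserError:
    rejected = True
print(rejected)
```

Callers that only want elements can either pass no_leading_text=True and handle the error, or filter out the string item as shown.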

fragment_fromstring(html, create_parent=False, guess_charset=False, parser=None)

Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element.

If create_parent is true (or is a tag name) then a parent node will be created to encapsulate the HTML in a single element. In this case, leading or trailing text is allowed.
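
A short sketch of the three cases described above, assuming html5lib is installed. The input strings are invented for illustration. The use of a div as the wrapper when create_parent is simply True reflects the lxml source as understood here and is not stated in the text above, so treat it as an assumption.

```python
from lxml import etree
from lxml.html import html5parser

# Exactly one element: it is returned as-is.
single = html5parser.fragment_fromstring("<p>only one</p>")

# Two sibling elements without a parent are rejected.
try:
    html5parser.fragment_fromstring("<p>one</p><p>two</p>")
    multiple_accepted = True
except etree.ParserError:
    multiple_accepted = False

# create_parent=True wraps the content, leading text included, in a new
# parent element (assumed here to be a <div>).
wrapped = html5parser.fragment_fromstring("intro <b>bold</b>", create_parent=True)

# Passing a tag name chooses the wrapper element.
section = html5parser.fragment_fromstring(
    "<p>one</p><p>two</p>", create_parent="section"
)

print(single.text, multiple_accepted, wrapped.tag, section.tag, len(section))
```

Using create_parent is the simplest way to accept arbitrary snippets, since the result is always a single element regardless of how many top-level nodes the input contains.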

fromstring(html, guess_charset=True, parser=None)

Parse the html, returning a single element/document.

This tries to minimally parse the chunk of text, without knowing if it is a fragment or a document.

base_url will set the document's base_url attribute (and the tree's docinfo.URL).
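
To illustrate the "fragment or document" guess, the sketch below feeds fromstring() a lone element and a full document. It assumes html5lib is installed; the inputs and the helper local_name() are made up for this example, and the observed results reflect the lxml releases this sketch was written against rather than a documented guarantee.

```python
from lxml import etree
from lxml.html import html5parser


def local_name(element):
    """Return the tag name without any namespace part (illustrative helper)."""
    return etree.QName(element).localname


# A lone element is expected to come back as that element.
fragment = html5parser.fromstring("<p>Hello</p>")

# Input that starts with a doctype is expected to come back as the
# document root.
document = html5parser.fromstring(
    "<!DOCTYPE html><html><head><title>T</title></head>"
    "<body><p>Hi</p></body></html>"
)

print(local_name(fragment), local_name(document))
```

When the kind of input is known in advance, calling document_fromstring() or fragment_fromstring() directly avoids relying on this guess.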