lxml.html package¶

Submodules¶

Module contents¶

The lxml.html tool set for HTML handling.

class lxml.html.CheckboxGroup(iterable=(), /)[source]¶

Bases: list

Represents a group of checkboxes (<input type=checkbox>) that have the same name.

In addition to using this like a list, the .value attribute returns a set-like object that you can add to or remove from to check and uncheck checkboxes. You can also use .value_options to get the possible values.

property value¶: Return a set-like object that can be modified to check or uncheck individual checkboxes according to their value.

property value_options¶: Returns a list of all the possible values.
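As a minimal sketch of the set-like .value accessor, the following uses a made-up form (the field name and values are assumptions chosen for illustration):

```python
from lxml import html

# Hypothetical markup: two checkboxes that share the name "topping".
form = html.fromstring(
    '<form>'
    '<input type="checkbox" name="topping" value="ham" checked>'
    '<input type="checkbox" name="topping" value="egg">'
    '</form>'
)

group = form.inputs['topping']   # a CheckboxGroup, because the name repeats
group.value.add('egg')           # check the "egg" box
group.value.remove('ham')        # uncheck the "ham" box

checked = sorted(group.value)    # values of the boxes that are now checked
options = group.value_options    # every value the group can take
```

Adding a value that no checkbox in the group carries raises a KeyError.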

class lxml.html.CheckboxValues(group)[source]¶

Bases: lxml.html._setmixin.SetMixin

Represents the values of the checked checkboxes in a group of checkboxes with the same name.

add(value)[source]¶: Add an element.

remove(value)[source]¶: Remove an element. If not a member, raise a KeyError.

_abc_impl = <_abc_data object>¶

class lxml.html.Classes(attributes)[source]¶

Bases: collections.abc.MutableSet

Provides access to an element’s class attribute as a set-like collection. Usage:

>>> el = fromstring('<p class="hidden large">Text</p>')
>>> classes = el.classes  # or: classes = Classes(el.attrib)
>>> classes |= ['block', 'paragraph']
>>> el.get('class')
'hidden large block paragraph'
>>> classes.toggle('hidden')
False
>>> el.get('class')
'large block paragraph'
>>> classes -= ('some', 'classes', 'block')
>>> el.get('class')
'large paragraph'

add(value)[source]¶

Add a class.

This has no effect if the class is already present.

discard(value)[source]¶

Remove a class if it is currently present.

If the class is not present, do nothing.

remove(value)[source]¶

Remove a class; it must currently be present.

If the class is not present, raise a KeyError.

toggle(value)[source]¶

Add a class name if it isn’t there yet, or remove it if it exists.

Returns true if the class was added (and is now enabled) and false if it was removed (and is now disabled).

update(values)[source]¶: Add all names from ‘values’.

_abc_impl = <_abc_data object>¶

class lxml.html.FieldsDict(inputs)[source]¶

Bases: collections.abc.MutableMapping

keys() → a set-like object providing a view on D’s keys[source]¶

_abc_impl = <_abc_data object>¶

class lxml.html.FormElement[source]¶

Bases: lxml.html.HtmlElement

Represents a <form> element.

_name()[source]¶

form_values()[source]¶: Return a list of tuples of the field values for the form. This is suitable to be passed to urllib.parse.urlencode() (urllib.urlencode() in Python 2).

property action¶: Get/set the form’s action attribute.

property fields¶: Dictionary-like object that represents all the fields in this form. You can set values in this dictionary to affect the form.

property inputs¶

Returns an accessor for all the input elements in the form.

See InputGetter for more information about the object.

property method¶: Get/set the form’s method. Always returns an upper-case string, and defaults to 'GET'.
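The form accessors can be combined roughly as follows. This is a sketch on invented markup; the action URL and field names are assumptions:

```python
from urllib.parse import urlencode

from lxml import html

# Hypothetical search form, for illustration only.
form = html.fromstring(
    '<form action="/search" method="post">'
    '<input type="text" name="q" value="lxml">'
    '<input type="hidden" name="page" value="1">'
    '</form>'
)

form.fields['q'] = 'lxml html'           # set a value through the dict-like view
query = urlencode(form.form_values())    # (name, value) pairs, URL-encoded
method = form.method                     # upper-cased form method
```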

class lxml.html.HTMLParser(**kwargs)[source]¶

Bases: lxml.etree.HTMLParser

An HTML parser that is configured to return lxml.html Element objects.

class lxml.html.HtmlComment[source]¶: Bases: lxml.etree.CommentBase, lxml.html.HtmlMixin

class lxml.html.HtmlElement[source]¶

Bases: lxml.etree.ElementBase, lxml.html.HtmlMixin

cssselect(expr, translator='html')¶

Run the CSS expression on this element and its children, returning a list of the results.

Equivalent to lxml.cssselect.CSSSelector(expr, translator='html')(self). Note that pre-compiling the expression can provide a substantial speedup.

set(self, key, value=None)¶: Sets an element attribute. If no value is provided, or if the value is None, creates a ‘boolean’ attribute without value, e.g. “<form novalidate></form>” for form.set('novalidate').
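A short sketch of the boolean-attribute behaviour described above, using an empty form as a stand-in element:

```python
from lxml import html

form = html.fromstring('<form></form>')
form.set('novalidate')              # no value: creates a boolean attribute
form.set('class', 'compact')        # ordinary attribute with a value

has_flag = 'novalidate' in form.attrib
markup = html.tostring(form, encoding='unicode')
```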

class lxml.html.HtmlElementClassLookup(classes=None, mixins=None)[source]¶

Bases: lxml.etree.CustomElementClassLookup

A lookup scheme for HTML Element classes.

To create a lookup instance with different Element classes, pass a tag name mapping of Element classes in the classes keyword argument and/or a tag name mapping of Mixin classes in the mixins keyword argument. The special key ‘*’ denotes a Mixin class that should be mixed into all Element classes.

lookup(self, type, doc, namespace, name)[source]¶

_default_element_classes = {'form': <class 'lxml.html.FormElement'>, 'input': <class 'lxml.html.InputElement'>, 'label': <class 'lxml.html.LabelElement'>, 'select': <class 'lxml.html.SelectElement'>, 'textarea': <class 'lxml.html.TextareaElement'>}¶
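One way to plug a custom Element class into a parser might look like the sketch below. AnchorElement and its is_external property are hypothetical names invented for this example; only the classes keyword and the parser's set_element_class_lookup() call come from lxml itself.

```python
from lxml import html


class AnchorElement(html.HtmlElement):
    """Hypothetical subclass, for illustration only."""

    @property
    def is_external(self):
        href = self.get('href') or ''
        return href.startswith(('http://', 'https://'))


# Map the "a" tag to the custom class and install the lookup on a parser.
lookup = html.HtmlElementClassLookup(classes={'a': AnchorElement})
parser = html.HTMLParser()
parser.set_element_class_lookup(lookup)

link = html.fromstring('<a href="https://example.com/">x</a>', parser=parser)
external = link.is_external
```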

class lxml.html.HtmlEntity[source]¶: Bases: lxml.etree.EntityBase, lxml.html.HtmlMixin

class lxml.html.HtmlMixin[source]¶

Bases: object

cssselect(expr, translator='html')[source]¶

Run the CSS expression on this element and its children, returning a list of the results.

Equivalent to lxml.cssselect.CSSSelector(expr, translator='html')(self). Note that pre-compiling the expression can provide a substantial speedup.

drop_tag()[source]¶

Remove the tag, but not its children or text. The children and text are merged into the parent.

Example:

>>> h = fragment_fromstring('<div>Hello <b>World!</b></div>')
>>> h.find('.//b').drop_tag()
>>> print(tostring(h, encoding='unicode'))
<div>Hello World!</div>

drop_tree()[source]¶: Removes this element from the tree, including its children and text. The tail text is joined to the previous element or parent.

find_class(class_name)[source]¶: Find any elements with the given class name.

find_rel_links(rel)[source]¶: Find any links like <a rel="{rel}">...</a>; returns a list of elements.

get_element_by_id(id, *default)[source]¶

Get the first element in a document with the given id. If none is found, return the default argument if provided or raise KeyError otherwise.

Note that there can be more than one element with the same id, and this isn’t uncommon in HTML documents found in the wild. Browsers return only the first match, and this function does the same.
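A brief sketch of the lookup, including the duplicate-id case and the optional default (the ids used here are made up):

```python
from lxml import html

doc = html.fromstring(
    '<div><p id="intro">First</p><p id="intro">Second</p></div>'
)

first = doc.get_element_by_id('intro')         # first match wins
missing = doc.get_element_by_id('nope', None)  # default instead of KeyError

try:
    doc.get_element_by_id('nope')
    raised = False
except KeyError:
    raised = True
```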

iterlinks()[source]¶

Yield (element, attribute, link, pos), where attribute may be None (indicating the link is in the text). pos is the position where the link occurs; often 0, but sometimes something else in the case of links in stylesheets or style tags.

Note: <base href> is not taken into account in any way. The link you get is exactly the link in the document.

Note: multiple links inside of a single text string or attribute value are returned in reversed order. This makes it possible to replace or delete them from the text string value based on their reported text positions. Otherwise, a modification at one text position can change the positions of links reported later on.
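The tuples yielded by iterlinks() can be inspected like this (a sketch on invented markup with one link per attribute, so every position is 0):

```python
from lxml import html

doc = html.fromstring(
    '<div><a href="/docs">Docs</a><img src="logo.png"></div>'
)

# Collect (tag, attribute, link, pos) for every link in the fragment.
links = [(el.tag, attr, link, pos) for el, attr, link, pos in doc.iterlinks()]
# links == [('a', 'href', '/docs', 0), ('img', 'src', 'logo.png', 0)]
```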

make_links_absolute(base_url=None, resolve_base_href=True, handle_failures=None)[source]¶

Make all links in the document absolute, given the base_url for the document (the full URL where the document came from), or if no base_url is given, then the .base_url of the document.

If resolve_base_href is true, then any <base href> tags in the document are used and removed from the document. If it is false then any such tag is ignored.

If handle_failures is None (default), a failure to process a URL will abort the processing. If set to 'ignore', errors are ignored. If set to 'discard', failing URLs will be removed.
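As an illustration, with an assumed base URL of http://example.com/base/index.html, relative links resolve as follows:

```python
from lxml import html

doc = html.fromstring(
    '<div><a href="/docs">Docs</a> <a href="page.html">Page</a></div>'
)
doc.make_links_absolute('http://example.com/base/index.html')

hrefs = [a.get('href') for a in doc.iter('a')]
# hrefs == ['http://example.com/docs', 'http://example.com/base/page.html']
```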

resolve_base_href(handle_failures=None)[source]¶

Find any <base href> tag in the document, and apply its values to all links found in the document. Also remove the tag once it has been applied.

If handle_failures is None (default), a failure to process a URL will abort the processing. If set to 'ignore', errors are ignored. If set to 'discard', failing URLs will be removed.

rewrite_links(link_repl_func, resolve_base_href=True, base_href=None)[source]¶

Rewrite all the links in the document. For each link link_repl_func(link) will be called, and the return value will replace the old link.

Note that links may not be absolute (unless you first called make_links_absolute()), and may be internal (e.g., '#anchor'). They can also be values like 'mailto:email' or 'javascript:expr'.

If you give base_href then all links passed to link_repl_func() will take that into account.

If the link_repl_func returns None, the attribute or tag text will be removed completely.
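A possible link_repl_func is sketched below. The helper name to_https and the URLs are assumptions; the function upgrades http:// links, leaves internal anchors alone, and returns None for javascript: links so that they are removed:

```python
from lxml import html


def to_https(link):
    """Hypothetical rewrite rule, for illustration only."""
    if link.startswith('javascript:'):
        return None                      # drop the attribute entirely
    if link.startswith('http://'):
        return 'https://' + link[len('http://'):]
    return link                          # unchanged (e.g. '#top')


doc = html.fromstring(
    '<p><a href="http://example.com/">site</a>'
    '<a href="#top">top</a>'
    '<a href="javascript:void(0)">js</a></p>'
)
doc.rewrite_links(to_https)

hrefs = [a.get('href') for a in doc.iter('a')]
# hrefs == ['https://example.com/', '#top', None]
```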

set(self, key, value=None)[source]¶: Sets an element attribute. If no value is provided, or if the value is None, creates a ‘boolean’ attribute without value, e.g. “<form novalidate></form>” for form.set('novalidate').

text_content()[source]¶: Return the text content of the tag (and the text in any children).
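For instance, on a paragraph with nested inline markup (a minimal sketch):

```python
from lxml import html

el = html.fromstring('<p>Hello <b>brave</b> world</p>')
text = el.text_content()   # text of the element and all of its descendants
# text == 'Hello brave world'
```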

property base_url¶

Returns the base URL, given when the page was parsed.

Use with urllib.parse.urljoin(el.base_url, href) to get absolute URLs (urlparse.urljoin() in Python 2).

property body¶: Return the <body> element. Can be called from a child element to get the document’s body.

property classes¶: A set-like wrapper around the ‘class’ attribute.

property forms¶: Return a list of all the forms.

property head¶: Returns the <head> element. Can be called from a child element to get the document’s head.

property label¶: Get or set any <label> element associated with this element.

class lxml.html.HtmlProcessingInstruction[source]¶: Bases: lxml.etree.PIBase, lxml.html.HtmlMixin

class lxml.html.InputElement[source]¶

Bases: lxml.html.InputMixin, lxml.html.HtmlElement

Represents an <input> element.

You can get the type with .type (which is lower-cased and defaults to 'text').

You can also get and set the value with .value.

Checkboxes and radios have the attribute input.checkable == True (for all others it is false) and a boolean attribute .checked.

property checkable¶: Boolean: can this element be checked?

property checked¶

Boolean attribute to get/set the presence of the checked attribute.

You can only use this on checkable input types.

property type¶: Return the type of this element (using the type attribute).

property value¶

Get/set the value of this element, using the value attribute.

Also, if this is a checkbox and it has no value, this defaults to 'on'. If it is a checkbox or radio that is not checked, this returns None.
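The interaction between .type, .checkable, .checked and .value can be sketched as follows (field names are invented for the example):

```python
from lxml import html

form = html.fromstring(
    '<form>'
    '<input name="user" value="ann">'
    '<input type="checkbox" name="agree">'
    '</form>'
)
user = form.inputs['user']
agree = form.inputs['agree']

user_type = user.type          # no type attribute, so this is 'text'
before = agree.value           # unchecked checkbox: None
agree.checked = True
after = agree.value            # checked, no value attribute: 'on'
```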

class lxml.html.InputGetter(form)[source]¶

Bases: object

An accessor that represents all the input fields in a form.

You can get fields by name from this, with form.inputs['field_name']. If there is a set of checkboxes with the same name, they are returned as a list (a CheckboxGroup which also allows value setting). Radio inputs are handled similarly. Use .keys() and .items() to process all fields in this way.

You can also iterate over this to get all input elements. This won’t return the same thing as if you get all the names, as checkboxes and radio elements are returned individually.

items()[source]¶

Returns all fields with their names, similar to dict.items().

Returns: A list of (name, field) tuples.

keys()[source]¶

Returns all unique field names, in document order.

Returns: A list of all unique field names.
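A sketch of how name-based access groups radio inputs while plain iteration returns them individually (the form markup is an assumption):

```python
from lxml import html

form = html.fromstring(
    '<form>'
    '<input name="q">'
    '<input type="radio" name="size" value="s">'
    '<input type="radio" name="size" value="l">'
    '<textarea name="note"></textarea>'
    '</form>'
)

names = form.inputs.keys()
kinds = {name: type(field).__name__ for name, field in form.inputs.items()}
element_count = len(list(form.inputs))   # iteration yields every input element
```

Here the two radio inputs appear once under the name "size" in keys() and items(), but as two separate elements when iterating.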

class lxml.html.InputMixin[source]¶

Bases: object

Mix-in for all input elements (input, select, and textarea).

property name¶: Get/set the name of the element.

class lxml.html.LabelElement[source]¶

Bases: lxml.html.HtmlElement

Represents a <label> element.

Label elements are linked to other elements with their for attribute. You can access this element with label.for_element.

property for_element¶: Get/set the element this label points to. Return None if it can’t be found.
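The link between a label and its target can be followed in both directions, roughly like this (the id "email" is a made-up example):

```python
from lxml import html

form = html.fromstring(
    '<form>'
    '<label for="email">Email</label>'
    '<input id="email" name="email">'
    '</form>'
)

label = form.find('.//label')
target = label.for_element                      # element whose id matches "for"
back = form.get_element_by_id('email').label    # reverse lookup via .label
```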

class lxml.html.MultipleSelectOptions(select)[source]¶

Bases: lxml.html._setmixin.SetMixin

Represents all the selected options in a <select multiple> element.

You can add to this set-like object to select an option, or remove from it to unselect the option.

add(item)[source]¶: Add an element.

remove(item)[source]¶: Remove an element. If not a member, raise a KeyError.

_abc_impl = <_abc_data object>¶

property options¶: Iterator of all the <option> elements.

class lxml.html.RadioGroup(iterable=(), /)[source]¶

Bases: list

This object represents several <input type=radio> elements that have the same name.

You can use this like a list, but also use the property .value to check/uncheck inputs. Also you can use .value_options to get the possible values.

property value¶: Get/set the value, which checks the radio with that value (and unchecks any other value).

property value_options¶: Returns a list of all the possible values.
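A minimal sketch of switching the checked radio input through .value (markup invented for illustration):

```python
from lxml import html

form = html.fromstring(
    '<form>'
    '<input type="radio" name="size" value="s" checked>'
    '<input type="radio" name="size" value="l">'
    '</form>'
)

group = form.inputs['size']        # a RadioGroup
before = group.value               # 's'
group.value = 'l'                  # checks "l" and unchecks "s"
states = [el.checked for el in group]
```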

class lxml.html.SelectElement[source]¶

Bases: lxml.html.InputMixin, lxml.html.HtmlElement

<select> element. You can get the name with .name.

.value will be the value of the selected option, unless this is a multi-select element (<select multiple>), in which case it will be a set-like object. In either case .value_options gives the possible values.

The boolean attribute .multiple shows if this is a multi-select.

property multiple¶: Boolean attribute: is there a multiple attribute on this element.

property value¶

Get/set the value of this select (the selected option).

If this is a multi-select, this is a set-like object that represents all the selected options.

property value_options¶: All the possible values this select can have (the value attribute of all the <option> elements).
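The difference between single and multiple selects might be exercised as in this sketch (option values are assumptions):

```python
from lxml import html

form = html.fromstring(
    '<form>'
    '<select name="color">'
    '<option value="r">Red</option>'
    '<option value="g" selected>Green</option>'
    '</select>'
    '<select name="tags" multiple>'
    '<option value="a" selected>A</option>'
    '<option value="b">B</option>'
    '</select>'
    '</form>'
)

color = form.inputs['color']
initial = color.value              # 'g', the selected option
color.value = 'r'                  # select a different option

tags = form.inputs['tags']
tags.value.add('b')                # multi-select: .value is set-like
selected = sorted(tags.value)
```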

class lxml.html.TextareaElement[source]¶

Bases: lxml.html.InputMixin, lxml.html.HtmlElement

<textarea> element. You can get the name with .name and get/set the value with .value.

property value¶: Get/set the value (which is the contents of this element).

class lxml.html.XHTMLParser(**kwargs)[source]¶

Bases: lxml.etree.XMLParser

An XML parser that is configured to return lxml.html Element objects.

Note that this parser is not really XHTML aware unless you let it load a DTD that declares the HTML entities. To do this, make sure you have the XHTML DTDs installed in your catalogs, and create the parser like this:

>>> parser = XHTMLParser(load_dtd=True)

If you additionally want to validate the document, use this:

>>> parser = XHTMLParser(dtd_validation=True)

For catalog support, see http://www.xmlsoft.org/catalog.html.

class lxml.html._MethodFunc(name, copy=False, source_class=<class 'lxml.html.HtmlMixin'>)[source]¶

Bases: object

An object that represents a method on an element as a function; the function takes either an element or an HTML string. It returns whatever the function normally returns, or if the function works in-place (and so returns None) it returns a serialized form of the resulting document.

lxml.html.Element(*args, **kw)[source]¶

Create a new HTML Element.

This can also be used for XHTML documents.

lxml.html.__bytes_replace_meta_content_type(repl, string, count=0)¶: Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

lxml.html.__fix_docstring(s)[source]¶

lxml.html.__str_replace_meta_content_type(repl, string, count=0)¶: Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

lxml.html._contains_block_level_tag(el)[source]¶

lxml.html._element_name(el)[source]¶

lxml.html._iter_css_imports(string, pos=0, endpos=9223372036854775807)¶

Return an iterator over all non-overlapping matches for the RE pattern in string.

For each match, the iterator returns a match object.

lxml.html._iter_css_urls(string, pos=0, endpos=9223372036854775807)¶

Return an iterator over all non-overlapping matches for the RE pattern in string.

For each match, the iterator returns a match object.

lxml.html._looks_like_full_html_bytes(string, pos=0, endpos=9223372036854775807)¶: Matches zero or more characters at the beginning of the string.

lxml.html._looks_like_full_html_unicode(string, pos=0, endpos=9223372036854775807)¶: Matches zero or more characters at the beginning of the string.

lxml.html._nons(tag)[source]¶

lxml.html._parse_meta_refresh_url(string, pos=0, endpos=9223372036854775807)¶

Scan through string looking for a match, and return a corresponding match object instance.

Return None if no position in the string matches.

lxml.html._transform_result(typ, result)[source]¶: Convert the result back into the input type.

lxml.html._unquote_match(s, pos)[source]¶

lxml.html.document_fromstring(html, parser=None, ensure_head_body=False, **kw)[source]¶

lxml.html.fragment_fromstring(html, create_parent=False, base_url=None, parser=None, **kw)[source]¶

Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element.

If create_parent is true (or is a tag name) then a parent node will be created to encapsulate the HTML in a single element. In this case, leading or trailing text is also allowed, as are multiple elements as result of the parsing.

Passing a base_url will set the document’s base_url attribute (and the tree’s docinfo.URL).

lxml.html.fragments_fromstring(html, no_leading_text=False, base_url=None, parser=None, **kw)[source]¶

Parses several HTML elements, returning a list of elements.

The first item in the list may be a string. If no_leading_text is true, then it will be an error if there is leading text, and it will always be a list of only elements.

base_url will set the document’s base_url attribute (and the tree’s docinfo.URL).
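A short sketch of the returned list, with leading text followed by two elements:

```python
from lxml import html

parts = html.fragments_fromstring('lead <p>one</p><p>two</p>')

leading = parts[0]                     # leading text comes back as a string
tags = [el.tag for el in parts[1:]]    # the remaining items are elements
```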

lxml.html.fromstring(html, base_url=None, parser=None, **kw)[source]¶

Parse the html, returning a single element/document.

This tries to minimally parse the chunk of text, without knowing if it is a fragment or a document.

base_url will set the document’s base_url attribute (and the tree’s docinfo.URL).
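What fromstring() returns depends on the input. The sketch below shows the cases as observed with current lxml releases; the inputs are arbitrary examples:

```python
from lxml import html

full = html.fromstring('<html><body><p>x</p></body></html>')
single = html.fromstring('<p>x</p>')
blocks = html.fromstring('<p>a</p><p>b</p>')
inline = html.fromstring('<b>a</b> and <i>b</i>')

tags = [full.tag, single.tag, blocks.tag, inline.tag]
# A full document yields <html>, a single element yields itself, and
# multi-element fragments are wrapped in a <div> or <span>.
```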

lxml.html.html_to_xhtml(html)[source]¶: Convert all tags in an HTML tree to XHTML by moving them to the XHTML namespace.

lxml.html.open_http_urllib(method, url, values)[source]¶

lxml.html.open_in_browser(doc, encoding=None)[source]¶: Open the HTML document in a web browser, saving it to a temporary file to open it. Note that this does not delete the file after use. This is mainly meant for debugging.

lxml.html.parse(filename_or_url, parser=None, base_url=None, **kw)[source]¶

Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use parse(...).getroot() to get the document root.

You can override the base URL with the base_url keyword. This is most useful when parsing from a file-like object.
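Parsing from a file-like object with an explicit base URL could look like this sketch. The in-memory document and the URL are assumptions for the example:

```python
import io

from lxml import html

source = io.BytesIO(b'<html><body><a href="a.html">A</a></body></html>')
tree = html.parse(source, base_url='http://example.com/dir/')

root = tree.getroot()              # parse() returns a tree, not an element
root.make_links_absolute()         # falls back to the document's base_url
href = root.find('.//a').get('href')
```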

lxml.html.submit_form(form, extra_values=None, open_http=None)[source]¶

Helper function to submit a form. Returns a file-like object, as from urllib.request.urlopen() (urllib.urlopen() in Python 2). This object also has a .geturl() function, which shows the URL if there were any redirects.

You can use this like:

from lxml.html import parse, submit_form

form = doc.forms[0]
form.inputs['foo'].value = 'bar' # etc
response = submit_form(form)
doc = parse(response).getroot()
doc.make_links_absolute(response.geturl())

To change the HTTP requester, pass a function as open_http keyword argument that opens the URL for you. The function must have the following signature:

open_http(method, URL, values)

The method is one of 'GET' or 'POST', the URL is the target URL as a string, and the values are a sequence of (name, value) tuples with the form data.
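A custom open_http callable is also handy for testing, since it avoids any network access. In the sketch below, fake_open_http and the form markup are hypothetical; the function only records what it was asked to send:

```python
from lxml import html

captured = {}


def fake_open_http(method, url, values):
    """Hypothetical requester: record the call instead of opening the URL."""
    captured['request'] = (method, url, list(values))
    return None


form = html.fromstring(
    '<form action="http://example.com/search" method="get">'
    '<input name="q" value="lxml">'
    '</form>'
)
html.submit_form(form, extra_values={'page': '2'}, open_http=fake_open_http)
```

After the call, captured['request'] holds the method, the target URL, and the form data including the extra values.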

lxml.html.tostring(doc, pretty_print=False, include_meta_content_type=False, encoding=None, method='html', with_tail=True, doctype=None)[source]¶

Return an HTML string representation of the document.

Note: if include_meta_content_type is true, this will create a <meta http-equiv="Content-Type" ...> tag in the head; regardless of the value of include_meta_content_type, any existing <meta http-equiv="Content-Type" ...> tag will be removed.

The encoding argument controls the output encoding (defaults to ASCII, with &#…; character references for any characters outside of ASCII). Note that you can pass the name 'unicode' as encoding argument to serialise to a Unicode string.

The method argument defines the output method. It defaults to ‘html’, but can also be ‘xml’ for xhtml output, or ‘text’ to serialise to plain text without markup.

To leave out the tail text of the top-level element that is being serialised, pass with_tail=False.

The doctype option allows passing in a plain string that will be serialised before the XML tree. Note that passing in non well-formed content here will make the XML output non well-formed. Also, an existing doctype in the document tree will not be removed when serialising an ElementTree instance.

Example:

>>> from lxml import html
>>> root = html.fragment_fromstring('<p>Hello<br>world!</p>')

>>> html.tostring(root)
b'<p>Hello<br>world!</p>'
>>> html.tostring(root, method='html')
b'<p>Hello<br>world!</p>'

>>> html.tostring(root, method='xml')
b'<p>Hello<br/>world!</p>'

>>> html.tostring(root, method='text')
b'Helloworld!'

>>> html.tostring(root, method='text', encoding='unicode')
'Helloworld!'

>>> root = html.fragment_fromstring('<div><p>Hello<br>world!</p>TAIL</div>')
>>> html.tostring(root[0], method='text', encoding='unicode')
'Helloworld!TAIL'

>>> html.tostring(root[0], method='text', encoding='unicode', with_tail=False)
'Helloworld!'

>>> doc = html.document_fromstring('<p>Hello<br>world!</p>')
>>> html.tostring(doc, method='html', encoding='unicode')
'<html><body><p>Hello<br>world!</p></body></html>'

>>> print(html.tostring(doc, method='html', encoding='unicode',
...          doctype='<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"'
...                  ' "http://www.w3.org/TR/html4/strict.dtd">'))
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><body><p>Hello<br>world!</p></body></html>

lxml.html.xhtml_to_html(xhtml)[source]¶: Convert all tags in an XHTML tree to HTML by removing their XHTML namespace.
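The two namespace conversions, html_to_xhtml() and xhtml_to_html(), work in place and undo each other, as this small sketch suggests:

```python
from lxml import html

root = html.fromstring('<p>Hi <b>there</b></p>')

html.html_to_xhtml(root)           # move every tag into the XHTML namespace
xhtml_tags = [el.tag for el in root.iter()]

html.xhtml_to_html(root)           # strip the namespace again
plain_tags = [el.tag for el in root.iter()]
```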