lxml.html package
Submodules
Module contents
The lxml.html tool set for HTML handling.
-
class lxml.html.CheckboxGroup(iterable=(), /)
Bases: list
Represents a group of checkboxes (<input type=checkbox>) that have the same name.
In addition to using this like a list, the .value attribute returns a set-like object that you can add to or remove from to check and uncheck checkboxes. You can also use .value_options to get the possible values.
-
property value
Return a set-like object that can be modified to check or uncheck individual checkboxes according to their value.
-
property value_options
Returns a list of all the possible values.
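A minimal sketch of working with a checkbox group, assuming a small made-up form (the field name "topping" and its values are illustrative only). The group is obtained through form.inputs, and its .value set is modified to check and uncheck boxes:

```python
from lxml import html

# Sample markup invented for this illustration.
doc = html.document_fromstring("""
<html><body><form>
  <input type="checkbox" name="topping" value="cheese" checked>
  <input type="checkbox" name="topping" value="ham">
  <input type="checkbox" name="topping" value="olives">
</form></body></html>
""")

group = doc.forms[0].inputs["topping"]   # several same-named checkboxes -> CheckboxGroup
options = group.value_options            # every possible value, in document order

group.value.add("ham")                   # checks the "ham" box
group.value.remove("cheese")             # unchecks the "cheese" box
checked = sorted(group.value)
print(options, checked)
```

Adding a value that no checkbox in the group carries is expected to raise KeyError, so it is worth validating against .value_options first.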
-
-
class lxml.html.CheckboxValues(group)
Bases: lxml.html._setmixin.SetMixin
Represents the values of the checked checkboxes in a group of checkboxes with the same name.
-
-
class lxml.html.Classes(attributes)
Bases: collections.abc.MutableSet
Provides access to an element’s class attribute as a set-like collection. Usage:
>>> el = fromstring('<p class="hidden large">Text</p>')
>>> classes = el.classes  # or: classes = Classes(el.attrib)
>>> classes |= ['block', 'paragraph']
>>> el.get('class')
'hidden large block paragraph'
>>> classes.toggle('hidden')
False
>>> el.get('class')
'large block paragraph'
>>> classes -= ('some', 'classes', 'block')
>>> el.get('class')
'large paragraph'
-
discard(value)
Remove a class if it is currently present.
If the class is not present, do nothing.
-
remove(value)
Remove a class; it must currently be present.
If the class is not present, raise a KeyError.
-
toggle(value)
Add a class name if it isn’t there yet, or remove it if it exists.
Returns true if the class was added (and is now enabled) and false if it was removed (and is now disabled).
-
-
class lxml.html.FieldsDict(inputs)
Bases: collections.abc.MutableMapping
-
-
class lxml.html.FormElement
Bases: lxml.html.HtmlElement
Represents a <form> element.
-
form_values()
Return a list of tuples of the field values for the form. This is suitable to be passed to urllib.parse.urlencode() (urllib.urlencode() in Python 2).
-
property action
Get/set the form’s action attribute.
-
property fields
Dictionary-like object that represents all the fields in this form. You can set values in this dictionary to affect the form.
-
property inputs
Returns an accessor for all the input elements in the form.
See InputGetter for more information about the object.
-
property method
Get/set the form’s method. Always returns an upper-case string, and defaults to 'GET'.
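As a rough illustration of how these members fit together, the sketch below fills in a field through .fields and encodes the result of form_values(). The form markup and field names are invented for the example:

```python
from urllib.parse import urlencode

from lxml import html

# Sample markup invented for this illustration.
doc = html.document_fromstring("""
<html><body><form action="/search" method="post">
  <input type="text" name="q" value="lxml">
  <input type="checkbox" name="safe" checked>
  <input type="submit" name="go" value="Search">
</form></body></html>
""")
form = doc.forms[0]

form.fields["q"] = "python"      # same effect as form.inputs["q"].value = "python"
pairs = form.form_values()       # (name, value) tuples; the submit button is left out
query = urlencode(pairs)
print(form.method, form.action, query)
```

The checkbox has no value attribute, so it is reported with the default value 'on'.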
-
-
class lxml.html.HTMLParser(**kwargs)
Bases: lxml.etree.HTMLParser
An HTML parser that is configured to return lxml.html Element objects.
-
class lxml.html.HtmlElement
Bases: lxml.etree.ElementBase, lxml.html.HtmlMixin
-
cssselect(expr, translator='html')
Run the CSS expression on this element and its children, returning a list of the results.
Equivalent to lxml.cssselect.CSSSelector(expr, translator='html')(self). Note that pre-compiling the expression can provide a substantial speedup.
-
set(self, key, value=None)
Sets an element attribute. If no value is provided, or if the value is None, creates a ‘boolean’ attribute without value, e.g. “<form novalidate></form>” for form.set('novalidate').
-
-
class lxml.html.HtmlElementClassLookup(classes=None, mixins=None)
Bases: lxml.etree.CustomElementClassLookup
A lookup scheme for HTML Element classes.
To create a lookup instance with different Element classes, pass a tag name mapping of Element classes in the classes keyword argument and/or a tag name mapping of Mixin classes in the mixins keyword argument. The special key ‘*’ denotes a Mixin class that should be mixed into all Element classes.
-
_default_element_classes = {'form': <class 'lxml.html.FormElement'>, 'input': <class 'lxml.html.InputElement'>, 'label': <class 'lxml.html.LabelElement'>, 'select': <class 'lxml.html.SelectElement'>, 'textarea': <class 'lxml.html.TextareaElement'>}
-
-
class lxml.html.HtmlMixin
Bases: object
-
cssselect(expr, translator='html')
Run the CSS expression on this element and its children, returning a list of the results.
Equivalent to lxml.cssselect.CSSSelector(expr, translator='html')(self). Note that pre-compiling the expression can provide a substantial speedup.
-
drop_tag()
Remove the tag, but not its children or text. The children and text are merged into the parent.
Example:
>>> h = fragment_fromstring('<div>Hello <b>World!</b></div>')
>>> h.find('.//b').drop_tag()
>>> print(tostring(h, encoding='unicode'))
<div>Hello World!</div>
-
drop_tree()
Removes this element from the tree, including its children and text. The tail text is joined to the previous element or parent.
-
find_rel_links(rel)
Find any links like <a rel="{rel}">...</a>; returns a list of elements.
-
get_element_by_id(id, *default)
Get the first element in a document with the given id. If none is found, return the default argument if provided or raise KeyError otherwise.
Note that there can be more than one element with the same id, and this isn’t uncommon in HTML documents found in the wild. Browsers return only the first match, and this function does the same.
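The lookup rules described here can be sketched as follows, using an invented snippet that deliberately repeats an id:

```python
from lxml import html

# Sample markup invented for this illustration; note the duplicate id.
doc = html.fromstring(
    '<div><p id="intro">first</p><p id="intro">second</p></div>'
)

first = doc.get_element_by_id("intro")          # first match wins
fallback = doc.get_element_by_id("nope", None)  # default returned instead of raising

try:
    doc.get_element_by_id("nope")               # no default given
    raised = False
except KeyError:
    raised = True

print(first.text, fallback, raised)
```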
-
iterlinks()
Yield (element, attribute, link, pos), where attribute may be None (indicating the link is in the text). pos is the position where the link occurs; often 0, but sometimes something else in the case of links in stylesheets or style tags.
Note: <base href> is not taken into account in any way. The link you get is exactly the link in the document.
Note: multiple links inside of a single text string or attribute value are returned in reversed order. This makes it possible to replace or delete them from the text string value based on their reported text positions. Otherwise, a modification at one text position can change the positions of links reported later on.
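A short sketch of consuming the tuples that iterlinks() yields, on an invented fragment with one link attribute per element:

```python
from lxml import html

# Sample markup invented for this illustration.
doc = html.fromstring('<div><a href="/a">A</a><img src="pic.png"></div>')

found = [
    (element.tag, attribute, link, pos)
    for element, attribute, link, pos in doc.iterlinks()
]
print(found)
```

Links are reported exactly as written in the document, so relative URLs stay relative here.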
-
make_links_absolute(base_url=None, resolve_base_href=True, handle_failures=None)
Make all links in the document absolute, given the base_url for the document (the full URL where the document came from), or if no base_url is given, then the .base_url of the document.
If resolve_base_href is true, then any <base href> tags in the document are used and removed from the document. If it is false then any such tag is ignored.
If handle_failures is None (default), a failure to process a URL will abort the processing. If set to ‘ignore’, errors are ignored. If set to ‘discard’, failing URLs will be removed.
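A minimal sketch of resolving links against an explicit base URL. The markup and the example.com address are placeholders chosen for the illustration:

```python
from lxml import html

# Sample markup and base URL invented for this illustration.
doc = html.fromstring('<p><a href="/docs/">Docs</a> <a href="#top">Top</a></p>')
doc.make_links_absolute("http://example.com/guide/index.html")

links = [a.get("href") for a in doc.iter("a")]
print(links)
```

Fragment-only links such as '#top' are joined to the base URL as well, which may or may not be what a caller wants.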
-
resolve_base_href(handle_failures=None)
Find any <base href> tag in the document, and apply its values to all links found in the document. Also remove the tag once it has been applied.
If handle_failures is None (default), a failure to process a URL will abort the processing. If set to ‘ignore’, errors are ignored. If set to ‘discard’, failing URLs will be removed.
-
rewrite_links(link_repl_func, resolve_base_href=True, base_href=None)
Rewrite all the links in the document. For each link link_repl_func(link) will be called, and the return value will replace the old link.
Note that links may not be absolute (unless you first called make_links_absolute()), and may be internal (e.g., '#anchor'). They can also be values like 'mailto:email' or 'javascript:expr'.
If you give base_href then all links passed to link_repl_func() will take that into account.
If the link_repl_func returns None, the attribute or tag text will be removed completely.
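The replacement callback might look like the sketch below. The function name to_https and the sample markup are hypothetical; the callback upgrades plain HTTP links and drops javascript: links by returning None:

```python
from lxml import html


def to_https(link):
    """Hypothetical callback: upgrade http:// links, drop javascript: links."""
    if link.startswith("javascript:"):
        return None                      # attribute is removed
    if link.startswith("http://"):
        return "https://" + link[len("http://"):]
    return link                          # leave everything else untouched


# Sample markup invented for this illustration.
doc = html.fromstring(
    '<div><a href="http://example.com/">x</a>'
    '<a href="javascript:void(0)">y</a></div>'
)
doc.rewrite_links(to_https)

hrefs = [a.get("href") for a in doc.iter("a")]
print(hrefs)
```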
-
set(self, key, value=None)
Sets an element attribute. If no value is provided, or if the value is None, creates a ‘boolean’ attribute without value, e.g. “<form novalidate></form>” for form.set('novalidate').
-
property base_url
Returns the base URL, given when the page was parsed.
Use with urllib.parse.urljoin(el.base_url, href) (urlparse.urljoin() in Python 2) to get absolute URLs.
-
property body
Return the <body> element. Can be called from a child element to get the document’s body.
-
property classes
A set-like wrapper around the ‘class’ attribute.
-
property forms
Return a list of all the forms.
-
property head
Returns the <head> element. Can be called from a child element to get the document’s head.
-
property label
Get or set any <label> element associated with this element.
-
-
class lxml.html.HtmlProcessingInstruction
Bases: lxml.etree.PIBase, lxml.html.HtmlMixin
-
class lxml.html.InputElement
Bases: lxml.html.InputMixin, lxml.html.HtmlElement
Represents an <input> element.
You can get the type with .type (which is lower-cased and defaults to 'text').
Also you can get and set the value with .value.
Checkboxes and radios have the attribute input.checkable == True (for all others it is false) and a boolean attribute .checked.
-
property checkable
Boolean: can this element be checked?
-
property checked
Boolean attribute to get/set the presence of the checked attribute.
You can only use this on checkable input types.
-
property type
Return the type of this element (using the type attribute).
-
property value
Get/set the value of this element, using the value attribute.
Also, if this is a checkbox and it has no value, this defaults to 'on'. If it is a checkbox or radio that is not checked, this returns None.
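The interplay of .type, .checkable, .checked and .value can be sketched like this, on an invented form with one text field and one checkbox that has no value attribute:

```python
from lxml import html

# Sample markup invented for this illustration.
doc = html.document_fromstring("""
<html><body><form>
  <input name="q" value="x">
  <input type="checkbox" name="agree">
</form></body></html>
""")
inputs = doc.forms[0].inputs
text, box = inputs["q"], inputs["agree"]

text_type = text.type            # no type attribute, so the default applies
unchecked_value = box.value      # unchecked checkbox reports None

box.checked = True               # adds the checked attribute
checked_value = box.value        # no value attribute, so the default applies

print(text_type, text.checkable, box.checkable, unchecked_value, checked_value)
```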
-
-
class lxml.html.InputGetter(form)
Bases: object
An accessor that represents all the input fields in a form.
You can get fields by name from this, with form.inputs['field_name']. If there is a set of checkboxes with the same name, they are returned as a list (a CheckboxGroup which also allows value setting). Radio inputs are handled similarly. Use .keys() and .items() to process all fields in this way.
You can also iterate over this to get all input elements. This won’t return the same thing as if you get all the names, as checkboxes and radio elements are returned individually.
-
class lxml.html.InputMixin
Bases: object
Mix-in for all input elements (input, select, and textarea).
-
property name
Get/set the name of the element.
-
-
class lxml.html.LabelElement
Bases: lxml.html.HtmlElement
Represents a <label> element.
Label elements are linked to other elements with their for attribute. You can access this element with label.for_element.
-
property for_element
Get/set the element this label points to. Return None if it can’t be found.
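A small sketch of following a label to its field and back, assuming invented markup in which the label's for attribute matches the input's id:

```python
from lxml import html

# Sample markup invented for this illustration.
doc = html.document_fromstring("""
<html><body><form>
  <label for="email">Email</label>
  <input type="text" id="email" name="email">
</form></body></html>
""")

label = doc.xpath("//label")[0]
field = label.for_element        # the element whose id matches the for attribute
back = field.label               # HtmlMixin.label goes the other way

print(field.name, back.text)
```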
-
-
class lxml.html.MultipleSelectOptions(select)
Bases: lxml.html._setmixin.SetMixin
Represents all the selected options in a <select multiple> element.
You can add to this set-like object to select an option, or remove from it to unselect the option.
-
property options
Iterator of all the <option> elements.
-
-
class lxml.html.RadioGroup(iterable=(), /)
Bases: list
This object represents several <input type=radio> elements that have the same name.
You can use this like a list, but also use the property .value to check/uncheck inputs. Also you can use .value_options to get the possible values.
-
property value
Get/set the value, which checks the radio with that value (and unchecks any other value).
-
property value_options
Returns a list of all the possible values.
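A brief sketch of switching the selected radio button through .value, using an invented group named "size":

```python
from lxml import html

# Sample markup invented for this illustration.
doc = html.document_fromstring("""
<html><body><form>
  <input type="radio" name="size" value="s" checked>
  <input type="radio" name="size" value="m">
  <input type="radio" name="size" value="l">
</form></body></html>
""")

group = doc.forms[0].inputs["size"]   # several same-named radios -> RadioGroup
before = group.value

group.value = "l"                     # checks "l" and unchecks the others
now_checked = [el.get("value") for el in group if el.checked]

print(before, now_checked, group.value_options)
```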
-
-
class lxml.html.SelectElement
Bases: lxml.html.InputMixin, lxml.html.HtmlElement
<select> element. You can get the name with .name.
.value will be the value of the selected option, unless this is a multi-select element (<select multiple>), in which case it will be a set-like object. In either case .value_options gives the possible values.
The boolean attribute .multiple shows if this is a multi-select.
-
property multiple
Boolean attribute: is there a multiple attribute on this element.
-
property value
Get/set the value of this select (the selected option).
If this is a multi-select, this is a set-like object that represents all the selected options.
-
property value_options
All the possible values this select can have (the value attribute of all the <option> elements).
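The difference between single and multiple selects can be sketched as follows. The form, its field names and its option values are invented for the example:

```python
from lxml import html

# Sample markup invented for this illustration.
doc = html.document_fromstring("""
<html><body><form>
  <select name="size">
    <option value="s">Small</option>
    <option value="m" selected>Medium</option>
  </select>
  <select name="tags" multiple>
    <option value="a" selected>A</option>
    <option value="b">B</option>
  </select>
</form></body></html>
""")
inputs = doc.forms[0].inputs
size, tags = inputs["size"], inputs["tags"]

size.value = "s"                 # single select: plain string value
tags.value.add("b")              # multi-select: set-like object

selected_tags = sorted(tags.value)
print(size.value, size.value_options, selected_tags)
```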
-
-
class lxml.html.TextareaElement
Bases: lxml.html.InputMixin, lxml.html.HtmlElement
<textarea> element. You can get the name with .name and get/set the value with .value.
-
property value
Get/set the value (which is the contents of this element).
-
-
class lxml.html.XHTMLParser(**kwargs)
Bases: lxml.etree.XMLParser
An XML parser that is configured to return lxml.html Element objects.
Note that this parser is not really XHTML aware unless you let it load a DTD that declares the HTML entities. To do this, make sure you have the XHTML DTDs installed in your catalogs, and create the parser like this:
>>> parser = XHTMLParser(load_dtd=True)
If you additionally want to validate the document, use this:
>>> parser = XHTMLParser(dtd_validation=True)
For catalog support, see http://www.xmlsoft.org/catalog.html.
-
class lxml.html._MethodFunc(name, copy=False, source_class=<class 'lxml.html.HtmlMixin'>)
Bases: object
An object that represents a method on an element as a function; the function takes either an element or an HTML string. It returns whatever the function normally returns, or if the function works in-place (and so returns None) it returns a serialized form of the resulting document.
-
lxml.html.Element(*args, **kw)
Create a new HTML Element.
This can also be used for XHTML documents.
-
lxml.html.__bytes_replace_meta_content_type(repl, string, count=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
-
lxml.html.__str_replace_meta_content_type(repl, string, count=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
-
lxml.html._iter_css_imports(string, pos=0, endpos=9223372036854775807)
Return an iterator over all non-overlapping matches for the RE pattern in string.
For each match, the iterator returns a match object.
-
lxml.html._iter_css_urls(string, pos=0, endpos=9223372036854775807)
Return an iterator over all non-overlapping matches for the RE pattern in string.
For each match, the iterator returns a match object.
-
lxml.html._looks_like_full_html_bytes(string, pos=0, endpos=9223372036854775807)
Matches zero or more characters at the beginning of the string.
-
lxml.html._looks_like_full_html_unicode(string, pos=0, endpos=9223372036854775807)
Matches zero or more characters at the beginning of the string.
-
lxml.html._parse_meta_refresh_url(string, pos=0, endpos=9223372036854775807)
Scan through string looking for a match, and return a corresponding match object instance.
Return None if no position in the string matches.
-
lxml.html.fragment_fromstring(html, create_parent=False, base_url=None, parser=None, **kw)
Parses a single HTML element; it is an error if there is more than one element, or if anything but whitespace precedes or follows the element.
If create_parent is true (or is a tag name) then a parent node will be created to encapsulate the HTML in a single element. In this case, leading or trailing text is also allowed, as are multiple elements as result of the parsing.
Passing a base_url will set the document’s base_url attribute (and the tree’s docinfo.URL).
-
lxml.html.fragments_fromstring(html, no_leading_text=False, base_url=None, parser=None, **kw)
Parses several HTML elements, returning a list of elements.
The first item in the list may be a string. If no_leading_text is true, then it will be an error if there is leading text, and it will always be a list of only elements.
base_url will set the document’s base_url attribute (and the tree’s docinfo.URL).
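The two fragment parsers can be compared with a sketch like the one below. The input strings are invented, and the error branch relies on lxml.etree.ParserError being raised for multiple top-level elements:

```python
from lxml import etree, html

# fragments_fromstring: leading text (if any) comes first, then the elements.
parts = html.fragments_fromstring("lead <b>one</b><i>two</i>")
leading, elements = parts[0], parts[1:]

# fragment_fromstring with create_parent: text and elements are wrapped.
wrapped = html.fragment_fromstring("lead <b>one</b>", create_parent="div")

# Without create_parent, more than one element is an error.
error = None
try:
    html.fragment_fromstring("<p>a</p><p>b</p>")
except etree.ParserError as exc:
    error = exc

print(repr(leading), [el.tag for el in elements], wrapped.tag, error)
```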
-
lxml.html.fromstring(html, base_url=None, parser=None, **kw)
Parse the html, returning a single element/document.
This tries to minimally parse the chunk of text, without knowing if it is a fragment or a document.
base_url will set the document’s base_url attribute (and the tree’s docinfo.URL).
-
lxml.html.html_to_xhtml(html)
Convert all tags in an HTML tree to XHTML by moving them to the XHTML namespace.
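A minimal sketch of the in-place conversion, on an invented document. After the call every element tag carries the XHTML namespace:

```python
from lxml import html

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Sample document invented for this illustration.
doc = html.document_fromstring("<html><body><p>Hi</p></body></html>")
html.html_to_xhtml(doc)          # modifies the tree in place

tags = [el.tag for el in doc.iter()]
print(tags)
```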
-
lxml.html.open_in_browser(doc, encoding=None)
Open the HTML document in a web browser, saving it to a temporary file to open it. Note that this does not delete the file after use. This is mainly meant for debugging.
-
lxml.html.parse(filename_or_url, parser=None, base_url=None, **kw)
Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use parse(...).getroot() to get the document root.
You can override the base URL with the base_url keyword. This is most useful when parsing from a file-like object.
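As an illustration of the base_url keyword, the sketch below parses from an in-memory file-like object and then resolves a relative link against the supplied URL. The markup and the example.com address are placeholders:

```python
import io

from lxml import html

# In-memory stand-in for a file or HTTP response; content is invented.
source = io.BytesIO(b"<html><body><a href='page.html'>next</a></body></html>")

tree = html.parse(source, base_url="http://example.com/dir/")
root = tree.getroot()            # parse() returns a tree, not an element

root.make_links_absolute()       # no argument: falls back to root.base_url
link = root.xpath("//a")[0].get("href")
print(root.base_url, link)
```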
-
lxml.html.submit_form(form, extra_values=None, open_http=None)
Helper function to submit a form. Returns a file-like object, as from urllib.request.urlopen() (urllib.urlopen() in Python 2). This object also has a .geturl() function, which shows the URL if there were any redirects.
You can use this like:
form = doc.forms[0]
form.inputs['foo'].value = 'bar'  # etc
response = submit_form(form)
doc = parse(response)
doc.make_links_absolute(response.geturl())
To change the HTTP requester, pass a function as open_http keyword argument that opens the URL for you. The function must have the following signature: open_http(method, URL, values)
The method is one of ‘GET’ or ‘POST’, the URL is the target URL as a string, and the values are a sequence of (name, value) tuples with the form data.
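A custom requester can be sketched without touching the network. Here fake_open_http is a hypothetical stand-in that simply records its arguments; a real one would perform the request and return a file-like response. The form markup is invented:

```python
from lxml import html


def fake_open_http(method, url, values):
    """Hypothetical requester: records the call instead of sending it."""
    return {"method": method, "url": url, "values": list(values)}


# Sample markup invented for this illustration.
doc = html.document_fromstring("""
<html><body><form action="/login" method="post">
  <input type="text" name="user" value="alice">
</form></body></html>
""")

result = html.submit_form(
    doc.forms[0],
    extra_values={"remember": "1"},   # appended after the form's own values
    open_http=fake_open_http,
)
print(result)
```

Whatever the open_http callable returns is handed back unchanged, which makes this hook convenient for tests.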
-
lxml.html.tostring(doc, pretty_print=False, include_meta_content_type=False, encoding=None, method='html', with_tail=True, doctype=None)
Return an HTML string representation of the document.
Note: if include_meta_content_type is true this will create a <meta http-equiv="Content-Type" ...> tag in the head; regardless of the value of include_meta_content_type any existing <meta http-equiv="Content-Type" ...> tag will be removed.
The encoding argument controls the output encoding (defaults to ASCII, with &#…; character references for any characters outside of ASCII). Note that you can pass the name 'unicode' as encoding argument to serialise to a Unicode string.
The method argument defines the output method. It defaults to ‘html’, but can also be ‘xml’ for xhtml output, or ‘text’ to serialise to plain text without markup.
To leave out the tail text of the top-level element that is being serialised, pass with_tail=False.
The doctype option allows passing in a plain string that will be serialised before the XML tree. Note that passing in non well-formed content here will make the XML output non well-formed. Also, an existing doctype in the document tree will not be removed when serialising an ElementTree instance.
Example:
>>> from lxml import html
>>> root = html.fragment_fromstring('<p>Hello<br>world!</p>')
>>> html.tostring(root)
b'<p>Hello<br>world!</p>'
>>> html.tostring(root, method='html')
b'<p>Hello<br>world!</p>'
>>> html.tostring(root, method='xml')
b'<p>Hello<br/>world!</p>'
>>> html.tostring(root, method='text')
b'Helloworld!'
>>> html.tostring(root, method='text', encoding='unicode')
'Helloworld!'
>>> root = html.fragment_fromstring('<div><p>Hello<br>world!</p>TAIL</div>')
>>> html.tostring(root[0], method='text', encoding='unicode')
'Helloworld!TAIL'
>>> html.tostring(root[0], method='text', encoding='unicode', with_tail=False)
'Helloworld!'
>>> doc = html.document_fromstring('<p>Hello<br>world!</p>')
>>> html.tostring(doc, method='html', encoding='unicode')
'<html><body><p>Hello<br>world!</p></body></html>'
>>> print(html.tostring(doc, method='html', encoding='unicode',
...          doctype='<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"'
...                  ' "http://www.w3.org/TR/html4/strict.dtd">'))
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><body><p>Hello<br>world!</p></body></html>