Since version 2.0, lxml provides a dedicated package for dealing with HTML: lxml.html. It provides a special Element API for HTML elements, as well as a number of utilities for common tasks.
The main API is based on the lxml.etree API, and thus, on the ElementTree API.
HTML elements have all the methods that come with ElementTree, but also include some extra methods:
One of the interesting modules in the lxml.html package deals with doctests. It can be hard to compare two HTML pages for equality, as whitespace differences aren't meaningful and the structural formatting can differ. This is even more a problem in doctests, where output is tested for equality and small differences in whitespace or the order of attributes can let a test fail. And given the verbosity of tag-based languages, it may take more than a quick look to find the actual differences in the doctest output.
Luckily, lxml provides the lxml.doctestcompare module that supports relaxed comparison of XML and HTML pages and provides a readable diff in the output when a test fails. The HTML comparison is most easily used by importing the usedoctest module in a doctest:
>>> import lxml.html.usedoctest
Now, if you have a HTML document and want to compare it to an expected result document in a doctest, you can do the following:
>>> import lxml.html >>> html = lxml.html.HTML('''\ ... <html><body onload="" color="white"> ... <p>Hi !</p> ... </body></html> ... ''') >>> print lxml.html.tostring(html) <html><body onload="" color="white"><p>Hi !</p></body></html> >>> print lxml.html.tostring(html) <html> <body color="white" onload=""> <p>Hi !</p> </body> </html> >>> print lxml.html.tostring(html) <html> <body color="white" onload=""> <p>Hi !</p> </body> </html>
In documentation, you would likely prefer the pretty printed HTML output, as it is the most readable. However, the three documents are equivalent from the point of view of an HTML tool, so the doctest will silently accept any of the above. This allows you to concentrate on readability in your doctests, even if the real output is a straight ugly HTML one-liner.
Note that there is also an lxml.usedoctest module which you can import for XML comparisons. The HTML parser notably ignores namespaces and some other XMLisms.
lxml.html comes with a predefined HTML vocabulary for the E-factory, originally written by Fredrik Lundh. This allows you to quickly generate HTML pages and fragments:
>>> from lxml.html import builder as h >>> html = h.HTML( ... h.HEAD( ... h.LINK(rel="stylesheet", href="great.css", type="text/css"), ... h.TITLE("Best Page Ever") ... ), ... h.BODY( ... h.H1(h.CLASS("heading"), "Top News"), ... h.P("World News only on this page", style="font-size: 200%"), ... "Ah, and here's some more text, by the way.", ... lxml.html.HTML("<p>... and this is a parsed fragment ...</p>") ... ) ... ) >>> print lxml.html.tostring(html) <html> <head> <link href="great.css" rel="stylesheet" type="text/css"> <title>Best Page Ever</title> </head> <body> <h1 class="heading">Top News</h1> <p style="font-size: 200%">World News only on this page</p> Ah, and here's some more text, by the way. <p>... and this is a parsed fragment ...</p> </body> </html>
Note that you should use lxml.html.tostring and not lxml.tostring. lxml.tostring(doc) will return the XML representation of the document, which is not valid HTML. In particular, things like <script src="..."></script> will be serialized as <script src="..." />, which completely confuses browsers.
There are several methods on elements that allow you to see and modify the links in a document.
This yields (element, attribute, link, pos) for every link in the document. attribute may be None if the link is in the text (as will be the case with a <style> tag with @import).
This finds any link in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute. It also searches style attributes for url(link), and <style> tags for @import and url().
This function does not pay attention to <base href>.
This makes all links in the document absolute, assuming that base_href is the URL of the document. So if you pass base_href="http://localhost/foo/bar.html" and there is a link to baz.html that will be rewritten as http://localhost/foo/baz.html.
If resolve_base_href is true, then any <base href> tag will be taken into account (just calling self.resolve_base_href()).
This rewrites all the links in the document using your given link replacement function. If you give a base_href value, all links will be passed in after they are joined with this URL.
For each link link_repl_func(link) is called. That function then returns the new link, or None to remove the attribute or tag that contains the link. Note that all links will be passed in, including links like "#anchor" (which is purely internal), and things like "mailto:bob@example.com" (or javascript:...).
If you want access to the context of the link, you should use .iterlinks() instead.
In addition to these methods, there are corresponding functions:
These functions will parse html if it is a string, then return the new HTML as a string. If you pass in a document, the document will be copied (except for iterlinks()), the method performed, and the new document returned.
The module lxml.html.clean provides a Cleaner class for cleaning up HTML pages. It supports removing embedded or script content, special tags, CSS style annotations and much more.
Say, you have an evil web page from an untrusted source that contains lots of content that upsets browsers and tries to run evil code on the client side:
>>> html = '''\ ... <html> ... <head> ... <script type="text/javascript" src="evil-site"></script> ... <link rel="alternate" type="text/rss" src="evil-rss"> ... <style> ... body {background-image: url(javascript:do_evil)}; ... div {color: expression(evil)}; ... </style> ... </head> ... <body onload="evil_function()"> ... <!-- I am interpreted for EVIL! --> ... <a href="javascript:evil_function()">a link</a> ... <a href="#" onclick="evil_function()">another link</a> ... <p onclick="evil_function()">a paragraph</p> ... <div style="display: none">secret EVIL!</div> ... <object> of EVIL! </object> ... <iframe src="evil-site"></iframe> ... <form action="evil-site"> ... Password: <input type="password" name="password"> ... </form> ... <blink>annoying EVIL!</blink> ... <a href="evil-site">spam spam SPAM!</a> ... <image src="evil!"> ... </body> ... </html>'''
To remove the all suspicious content from this unparsed document, use the clean_html function:
>>> from lxml.html.clean import clean_html >>> print clean_html(html) <html> <body> <div> <style>/* deleted */</style> <a href="">a link</a> <a href="#">another link</a> <p>a paragraph</p> <div>secret EVIL!</div> of EVIL! Password: annoying EVIL! <a href="evil-site">spam spam SPAM!</a> <img src="evil!"> </div> </body> </html>
The Cleaner class supports several keyword arguments to control exactly which content is removed:
>>> from lxml.html.clean import Cleaner >>> cleaner = Cleaner(page_structure=False, links=False) >>> print cleaner.clean_html(html) <html> <head> <link rel="alternate" src="evil-rss" type="text/rss"> <style>/* deleted */</style> </head> <body> <a href="">a link</a> <a href="#">another link</a> <p>a paragraph</p> <div>secret EVIL!</div> of EVIL! Password: annoying EVIL! <a href="evil-site">spam spam SPAM!</a> <img src="evil!"> </body> </html> >>> cleaner = Cleaner(style=True, links=True, add_nofollow=True, ... page_structure=False, safe_attrs_only=False) >>> print cleaner.clean_html(html) <html> <head> </head> <body> <a href="">a link</a> <a href="#">another link</a> <p>a paragraph</p> <div>secret EVIL!</div> of EVIL! Password: annoying EVIL! <a href="evil-site" rel="nofollow">spam spam SPAM!</a> <img src="evil!"> </body> </html>
See the docstring of Cleaner for the details of what can be cleaned.
In addition to cleaning up malicious HTML, lxml.html.clean contains functions to do other things to your HTML. This includes autolinking:
autolink(doc, ...) autolink_html(html, ...)
This finds anything that looks like a link (e.g., http://example.com) in the text of an HTML document, and turns it into an anchor. It avoids making bad links.
Links in the elements <textarea>, <pre>, <code>, anything in the head of the document. You can pass in a list of elements to avoid in avoid_elements=['textarea', ...].
Links to some hosts can be avoided. By default links to localhost*, example.* and 127.0.0.1 are not autolinked. Pass in avoid_hosts=[list_of_regexes] to control this.
Elements with the nolink CSS class are not autolinked. Pass in avoid_classes=['code', ...] to control this.
The autolink_html() version of the function parses the HTML string first, and returns a string.
You can also wrap long words in your html:
word_break(doc, max_width=40, ...) word_break_html(html, ...)
This finds any long words in the text of the document and inserts ​ in the document (which is the Unicode zero-width space).
This avoids the elements <pre>, <textarea>, and <code>. You can control this with avoid_elements=['textarea', ...].
It also avoids elements with the CSS class nobreak. You can control this with avoid_classes=['code', ...].
Lastly you can control the character that is inserted with break_character=u'\u200b'. However, you cannot insert markup, only text.
word_break_html(html) parses the HTML document and returns a string.
This example parses the hCard microformat.
First we get the page:
>>> import urllib >>> from lxml.html import HTML >>> url = 'http://microformats.org/' >>> content = urllib.urlopen(url).read() >>> doc = HTML(content) >>> doc.make_links_absolute(url)
Then we create some objects to put the information in:
>>> class Card(object): ... def __init__(self, **kw): ... for name, value in kw: ... setattr(self, name, value) >>> class Phone(object): ... def __init__(self, phone, types=()): ... self.phone, self.types = phone, types
And some generally handy functions for microformats:
>>> def get_text(el, class_name): ... els = el.find_class(class_name) ... if els: ... return els[0].text_content() ... else: ... return '' >>> def get_value(el): ... return get_text(el, 'value') or el.text_content() >>> def get_all_texts(el, class_name): ... return [e.text_content() for e in els.find_class(class_name)] >>> def parse_addresses(el): ... # Ideally this would parse street, etc. ... return el.find_class('adr')
Then the parsing:
>>> for el in doc.find_class('hcard'): ... card = Card() ... card.el = el ... card.fn = get_text(el, 'fn') ... card.tels = [] ... for tel_el in card.find_class('tel'): ... card.tels.append(Phone(get_value(tel_el), ... get_all_texts(tel_el, 'type'))) ... card.addresses = parse_addresses(el)