
Module lxml.html.clean

Classes
Cleaner: Instances clean the document of each of the possible offending elements.

Function Summary
  _break_text(text, max_width, break_character)
  _insert_break(word, width, break_character)
  _link_text(text, link_regexes, avoid_hosts, factory)
  autolink(el, link_regexes, avoid_elements, avoid_hosts, avoid_classes)
Turn any URLs into links.
  autolink_html(html, *args, **kw)
Turn any URLs into links.
  word_break(el, max_width, avoid_elements, avoid_classes, break_character)
Breaks any long words found in the body of the text (not attributes).
  word_break_html(html, *args, **kw)

Function Details

autolink(el, link_regexes=[<_sre.SRE_Pattern object at 0x82ec270>, <_sre.SRE_Patter..., avoid_elements=['textarea', 'pre', 'code', 'head', 'select', 'a'], avoid_hosts=[<_sre.SRE_Pattern object at 0x40852448>, <_sre.SRE_Patte..., avoid_classes=['nolink'])

Turn any URLs into links.

It will search for links identified by the given regular expressions (by default mailto and http(s) links).

It won't link text in an element in avoid_elements, or an element with a class in avoid_classes. It won't link to anything with a host that matches one of the regular expressions in avoid_hosts (default localhost and 127.0.0.1).

If you pass in an element, the element's tail will not be substituted, only the contents of the element.
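The host filtering described above can be sketched with the standard library alone. This is a minimal illustration, not lxml's implementation: the default regular expressions are elided in the signature, so `LINK_RE`, `AVOID_HOSTS`, and the helper `find_linkable` below are hypothetical stand-ins.

```python
import re

# Hypothetical patterns for illustration only; lxml's real defaults are
# truncated in the signature above and may differ.
LINK_RE = re.compile(
    r"(?P<body>https?://(?P<host>[a-z0-9._-]+)(?:/[^\s<>]*)?)", re.I
)
AVOID_HOSTS = [re.compile(r"^localhost", re.I), re.compile(r"^127\.0\.0\.1$")]


def find_linkable(text, link_re=LINK_RE, avoid_hosts=AVOID_HOSTS):
    """Return the URLs in text whose host matches none of avoid_hosts."""
    found = []
    for match in link_re.finditer(text):
        host = match.group("host")
        if any(pattern.search(host) for pattern in avoid_hosts):
            continue  # host is on the avoid list, so the text stays unlinked
        found.append(match.group("body"))
    return found


urls = find_linkable(
    "See http://example.com/docs and http://localhost/admin for details."
)
```

Here `urls` contains only the `example.com` address; the `localhost` URL is skipped because its host matches an avoid pattern.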

autolink_html(html, *args, **kw)

Turn any URLs into links.

It will search for links identified by the given regular expressions (by default mailto and http(s) links).

It won't link text in an element in avoid_elements, or an element with a class in avoid_classes. It won't link to anything with a host that matches one of the regular expressions in avoid_hosts (default localhost and 127.0.0.1).

If you pass in an element, the element's tail will not be substituted, only the contents of the element.
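The distinction between an element's contents and its tail comes from the ElementTree data model, which lxml shares. The sketch below uses the standard library's `xml.etree.ElementTree` rather than lxml itself, purely to show where the tail text lives; the sample markup is made up for the example.

```python
import xml.etree.ElementTree as ET

# Sample markup invented for this illustration.
root = ET.fromstring(
    "<div><p>inside http://example.com</p> after http://example.org</div>"
)
p = root.find("p")

contents = p.text  # text inside <p>: part of the element's contents
tail = p.tail      # text after </p>: stored on <p>, but outside its contents
```

If `p` were the element passed in, only the text held in `contents` would be a candidate for linking. The URL in `tail` belongs to the parent's content flow and would be left alone.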

word_break(el, max_width=40, avoid_elements=['pre', 'textarea', 'code'], avoid_classes=['nobreak'], break_character=u'\u200b')

Breaks any long words found in the body of the text (not attributes).

Doesn't affect any of the tags in avoid_elements (by default ``<textarea>``, ``<pre>``, and ``<code>``), or elements with a class in avoid_classes (by default ``nobreak``).

Breaks words by inserting &#8203; (U+200B), the Unicode Zero Width Space character. It generally takes up no space when rendered, but it copies as a space, and in monospace contexts it usually takes up space.

See http://www.cs.tut.fi/~jkorpela/html/nobr.html for a discussion.
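The idea of breaking long words with a zero-width space can be sketched in plain Python. This is a simplified illustration that works on a string rather than an element tree and splits at fixed intervals; lxml's own helpers (`_break_text`, `_insert_break`) may choose break points differently. The function name `break_long_words` is hypothetical.

```python
import re

ZERO_WIDTH_SPACE = "\u200b"


def break_long_words(text, max_width=40, break_character=ZERO_WIDTH_SPACE):
    """Insert break_character into runs of non-space text longer than max_width."""

    def split_word(match):
        word = match.group(0)
        # Cut the word into chunks of at most max_width characters.
        pieces = [word[i:i + max_width] for i in range(0, len(word), max_width)]
        return break_character.join(pieces)

    # Only words strictly longer than max_width are touched.
    pattern = re.compile(r"\S{%d,}" % (max_width + 1))
    return pattern.sub(split_word, text)


broken = break_long_words("short " + "a" * 12, max_width=5)
```

With `max_width=5`, the five-letter word `short` is left as-is, while the twelve-character run is split into chunks of 5, 5, and 2 characters joined by U+200B. Removing the break characters recovers the original text.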

Generated by Epydoc 2.1 on Sat Aug 18 12:44:27 2007 http://epydoc.sf.net