Module clean

A cleanup tool for HTML.

Removes unwanted tags and content. See the Cleaner class for details.

Classes
Cleaner
Instances clean the document of each of the possible offending elements. Cleaning is controlled by attributes, which you can override in a subclass or set in the constructor.
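The configuration style described above (class-level attributes that a subclass or constructor keyword can override) can be sketched as follows. This is a minimal illustration of the pattern only: `SketchCleaner`, `KeepComments`, and the option names `scripts` and `comments` are hypothetical stand-ins, not the Cleaner class itself.

```python
class SketchCleaner:
    """Hypothetical stand-in that only mirrors the configuration style."""

    # Class-level defaults; a subclass may override them.
    scripts = True
    comments = True

    def __init__(self, **kw):
        for name, value in kw.items():
            # Reject keywords that do not correspond to an existing attribute.
            if not hasattr(self, name):
                raise TypeError("Unknown parameter: %s=%r" % (name, value))
            setattr(self, name, value)


class KeepComments(SketchCleaner):
    # Override a default by subclassing.
    comments = False


# Override a default for one instance through the constructor.
per_instance = SketchCleaner(scripts=False)
```

Subclassing suits a policy that is reused in many places, while constructor keywords suit one-off adjustments.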
Functions
_is_javascript_scheme(s)
clean_html(html)
autolink(el, link_regexes=[re.compile(r'(?i)(?P<body>https?://(?P<host>[a-z0-9\._-]+)(?:..., avoid_elements=['textarea', 'pre', 'code', 'head', 'select', 'a'], avoid_hosts=[re.compile(r'(?i)^localhost'), re.compile(r'(?i)\bexample\.(?..., avoid_classes=['nolink'])
Turn any URLs into links.
_link_text(text, link_regexes, avoid_hosts, factory)
autolink_html(html, *args, **kw)
Turn any URLs into links.
word_break(el, max_width=40, avoid_elements=['pre', 'textarea', 'code'], avoid_classes=['nobreak'], break_character=u'\u200b')
Breaks any long words found in the body of the text (not attributes).
word_break_html(html, *args, **kw)
_break_text(text, max_width, break_character)
_insert_break(word, width, break_character)
Variables
  basestring = str, bytes
  _css_javascript_re = re.compile(r'(?is)expression\s*\(.*?\)')
  _css_import_re = re.compile(r'(?i)@\s*import')
  _conditional_comment_re = re.compile(r'(?is)\[if[\s\n\r]+.*?\]...
  _find_styled_elements = descendant-or-self::*[@style]
  _find_external_links = descendant-or-self::a [normalize-space...
  clean = <lxml.html.clean.Cleaner object>
  _link_regexes = [re.compile(r'(?i)(?P<body>https?://(?P<host>[...
  _avoid_elements = ['textarea', 'pre', 'code', 'head', 'select'...
  _avoid_hosts = [re.compile(r'(?i)^localhost'), re.compile(r'(?...
  _avoid_classes = ['nolink']
  _avoid_word_break_elements = ['pre', 'textarea', 'code']
  _avoid_word_break_classes = ['nobreak']
  _break_prefer_re = re.compile(r'(?i)[^a-z]')
  __package__ = 'lxml.html'
Function Details

autolink(el, link_regexes=[re.compile(r'(?i)(?P<body>https?://(?P<host>[a-z0-9\._-]+)(?:..., avoid_elements=['textarea', 'pre', 'code', 'head', 'select', 'a'], avoid_hosts=[re.compile(r'(?i)^localhost'), re.compile(r'(?i)\bexample\.(?..., avoid_classes=['nolink'])

Turn any URLs into links.

It will search for links identified by the given regular expressions (by default mailto and http(s) links).

It won't link text in an element in avoid_elements, or an element with a class in avoid_classes. It won't link to anything with a host that matches one of the regular expressions in avoid_hosts (the defaults include localhost; see the avoid_hosts patterns in the signature).

If you pass in an element, the element's tail will not be substituted, only the contents of the element.
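The "tail" mentioned above is the text that follows an element's closing tag, in the ElementTree data model that lxml also follows. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in to show which text belongs to an element's contents and which is its tail. The sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<div><p>inside <b>bold</b> after bold</p> after p</div>')
p = root.find('p')
b = p.find('b')

# Text inside <p>, before its first child.
p_text = p.text
# Tail of the child <b>: still inside <p>, so part of <p>'s contents.
b_tail = b.tail
# Tail of <p> itself: follows </p>, so it lies outside <p>'s contents.
p_tail = p.tail
```

Going by the description above, passing `p` to autolink would cover `p_text` and `b_tail` but leave `p_tail` alone.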

autolink_html(html, *args, **kw)

Turn any URLs into links.

It will search for links identified by the given regular expressions (by default mailto and http(s) links).

It won't link text in an element in avoid_elements, or an element with a class in avoid_classes. It won't link to anything with a host that matches one of the regular expressions in avoid_hosts (the defaults include localhost; see the avoid_hosts patterns in the signature).

If you pass in an element, the element's tail will not be substituted, only the contents of the element.
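The host-filtering rule can be sketched on a plain string with the standard library alone. This is a simplified illustration, not the module's implementation: `link_text_sketch` is a hypothetical helper, the link pattern is a shortened stand-in for the truncated default shown in the signature, and only the `^localhost` avoid pattern is reproduced.

```python
import re
from html import escape

# Simplified stand-in patterns (assumptions, not the module's full defaults).
link_re = re.compile(
    r'(?i)(?P<body>https?://(?P<host>[a-z0-9._-]+)(?:/[^\s<>"]*)?)'
)
avoid_hosts = [re.compile(r'(?i)^localhost')]


def link_text_sketch(text):
    """Wrap URLs in <a> tags unless their host matches an avoid pattern."""
    parts = []
    pos = 0
    for match in link_re.finditer(text):
        # Copy the text between the previous match and this one.
        parts.append(escape(text[pos:match.start()]))
        url = match.group('body')
        if any(r.search(match.group('host')) for r in avoid_hosts):
            # Avoided host: keep the URL as plain text.
            parts.append(escape(url))
        else:
            parts.append(
                '<a href="%s">%s</a>' % (escape(url, quote=True), escape(url))
            )
        pos = match.end()
    parts.append(escape(text[pos:]))
    return ''.join(parts)


linked = link_text_sketch('see http://lxml.de/ and http://localhost/x')
```

The sketch does not trim trailing punctuation from a matched URL and handles no mailto links, so it is narrower than the behaviour described above.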

word_break(el, max_width=40, avoid_elements=['pre', 'textarea', 'code'], avoid_classes=['nobreak'], break_character=u'\u200b')

Breaks any long words found in the body of the text (not attributes).

Doesn't affect any of the tags in avoid_elements (by default <pre>, <textarea>, and <code>).

Breaks words by inserting &#8203;, the Unicode zero-width space (U+200B). It generally takes up no space when rendered, but it copies as a space, and in monospace contexts it usually takes up space.

See for a discussion
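The basic idea of breaking long words with a zero-width space can be sketched on a plain string. This is a simplified illustration under stated assumptions: `break_text_sketch` is a hypothetical helper that cuts at fixed widths, whereas the module's `_break_prefer_re` suggests the real code prefers certain break positions, which this sketch ignores.

```python
ZERO_WIDTH_SPACE = '\u200b'


def break_text_sketch(text, max_width=40, break_character=ZERO_WIDTH_SPACE):
    """Insert break_character into every space-separated word longer than max_width."""
    broken = []
    for word in text.split(' '):
        if len(word) > max_width:
            # Cut the word into chunks of at most max_width characters.
            chunks = [
                word[i:i + max_width] for i in range(0, len(word), max_width)
            ]
            word = break_character.join(chunks)
        broken.append(word)
    return ' '.join(broken)


# A visible break character makes the result easy to inspect.
demo = break_text_sketch('a' * 10 + ' ok', max_width=4, break_character='|')
```

With the default break character the output looks unchanged on screen, since U+200B is invisible, but the browser may then wrap the line at the inserted positions.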

Variables Details

_find_external_links

descendant-or-self::a[normalize-space(@href) and substring(normalize-space(@href),1,1) != '#'] | descendant-or-self::x:a[normalize-space(@href) and substring(normalize-space(@href),1,1) != '#']

_avoid_elements

['textarea', 'pre', 'code', 'head', 'select', 'a']
