Package lxml :: Package html :: Module clean

Module clean

A cleanup tool for HTML.

Removes unwanted tags and content. See the Cleaner class for details.

Classes

Cleaner
Instances cleans the document of each of the possible offending elements. The cleaning is controlled by attributes; you can override attributes in a subclass, or set them in the constructor.

Functions

[hide private]

_substitute_whitespace(...)
sub(repl, string[, count = 0]) --> newstring Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

source code

clean_html(html)

source code

autolink(el, link_regexes=[re.compile(r'(?i)(?P<body>https?://(?P<host>[a-z0-9\._-]+)(?:..., avoid_elements=['textarea', 'pre', 'code', 'head', 'select', 'a'], avoid_hosts=[re.compile(r'(?i)^localhost'), re.compile(r'(?i)\bexample\.(?..., avoid_classes=['nolink'])
Turn any URLs into links. source code

_link_text(text, link_regexes, avoid_hosts, factory)

source code

autolink_html(html, *args, **kw)
Turn any URLs into links.

source code

word_break(el, max_width=40, avoid_elements=['pre', 'textarea', 'code'], avoid_classes=['nobreak'], break_character=u'')
Breaks any long words found in the body of the text (not attributes). source code

word_break_html(html, *args, **kw)

source code

_break_text(text, max_width, break_character)

source code

_insert_break(word, width, break_character)

source code

Variables

[hide private]

_css_javascript_re = re.compile(r'(?is)expression\s*\(.*?\)')

_css_import_re = re.compile(r'(?i)@\s*import')

_javascript_scheme_re = re.compile(r'(?i)\s*(?:javascript|jscr...

_conditional_comment_re = re.compile(r'(?is)\[if[\s\n\r]+.*?\]...

_find_styled_elements = descendant-or-self::*[@style]

_find_external_links = descendant-or-self::a [normalize-space...

clean = Cleaner()

_link_regexes = [re.compile(r'(?i)(?P<body>https?://(?P<host>[...

_avoid_elements = ['textarea', 'pre', 'code', 'head', 'select'...

_avoid_hosts = [re.compile(r'(?i)^localhost'), re.compile(r'(?...

_avoid_classes = ['nolink']

_avoid_word_break_elements = ['pre', 'textarea', 'code']

_avoid_word_break_classes = ['nobreak']

_break_prefer_re = re.compile(r'(?i)[^a-z]')

__package__ = 'lxml.html'

Function Details

[hide private]

autolink(el, link_regexes=`[`re.compile(r'`(?i)(?P<body>`https`?`://`(?P<host>[`a`-`z0`-`9\._-`]+)(?:...`, avoid_elements=`['textarea',` `'pre',` `'code',` `'head',` `'select',` `'a']`, avoid_hosts=`[`re.compile(r'`(?i)`^localhost')`,` re.compile(r'`(?i)`\bexample\.`(?...`, avoid_classes=`['nolink']`)

source code

Turn any URLs into links.

It will search for links identified by the given regular expressions (by default mailto and http(s) links).

It won't link text in an element in avoid_elements, or an element with a class in avoid_classes. It won't link to anything with a host that matches one of the regular expressions in avoid_hosts (default localhost and 127.0.0.1).

If you pass in an element, the element's tail will not be substituted, only the contents of the element.

autolink_html(html, *args, **kw)

source code

Turn any URLs into links.

It will search for links identified by the given regular expressions (by default mailto and http(s) links).

If you pass in an element, the element's tail will not be substituted, only the contents of the element.

word_break(el, max_width=40, avoid_elements=`['pre',` `'textarea',` `'code']`, avoid_classes=`['nobreak']`, break_character=`u''`)

source code

Breaks any long words found in the body of the text (not attributes).

Doesn't effect any of the tags in avoid_elements, by default <textarea> and <pre>

Breaks words by inserting , which is a unicode character for Zero Width Space character. This generally takes up no space in rendering, but does copy as a space, and in monospace contexts usually takes up space.

See http://www.cs.tut.fi/~jkorpela/html/nobr.html for a discussion

Variables Details

[hide private]

_javascript_scheme_re

Value:

re.compile(r'(?i)\s*(?:javascript|jscript|livescript|vbscript|about|mo
cha):')

_conditional_comment_re

Value:

re.compile(r'(?is)\[if[\s\n\r]+.*?\][\s\n\r]*>')

_find_external_links

Value:

descendant-or-self::a  [normalize-space(@href) and substring(normalize
-space(@href),1,1) != '#'] |descendant-or-self::x:a[normalize-space(@h
ref) and substring(normalize-space(@href),1,1) != '#']

_link_regexes

Value:

[re.compile(r'(?i)(?P<body>https?://(?P<host>[a-z0-9\._-]+)(?:/[/-_\.,
a-z0-9%&\?;=~]*)?(?:\([/-_\.,a-z0-9%&\?;=~]*\))?)'),
 re.compile(r'(?i)mailto:(?P<body>[a-z0-9\._-]+@(?P<host>[a-z0-9_\._]+
[a-z]))')]

_avoid_elements

Value:

['textarea', 'pre', 'code', 'head', 'select', 'a']

_avoid_hosts

Value:

[re.compile(r'(?i)^localhost'),
 re.compile(r'(?i)\bexample\.(?:com|org|net)$'),
 re.compile(r'^127\.0\.0\.1$')]

Module clean

autolink(el, link_regexes=[re.compile(r'(?i)(?P<body>https?://(?P<host>[a-z0-9\._-]+)(?:..., avoid_elements=['textarea', 'pre', 'code', 'head', 'select', 'a'], avoid_hosts=[re.compile(r'(?i)^localhost'), re.compile(r'(?i)\bexample\.(?..., avoid_classes=['nolink'])

autolink_html(html, *args, **kw)

word_break(el, max_width=40, avoid_elements=['pre', 'textarea', 'code'], avoid_classes=['nobreak'], break_character=u'​')

_javascript_scheme_re

_conditional_comment_re

_find_external_links

_link_regexes

_avoid_elements

_avoid_hosts

word_break(el, max_width=40, avoid_elements=`['pre',` `'textarea',` `'code']`, avoid_classes=`['nobreak']`, break_character=`u''`)