Package lxml :: Package html :: Module clean :: Class Cleaner

Class Cleaner

Instances clean the document of each of the possible offending elements. The cleaning is controlled by attributes; you can override attributes in a subclass, or set them in the constructor.

scripts: Removes any <script> tags.
javascript: Removes any JavaScript, like an onclick attribute. Also removes stylesheets, as they could contain JavaScript.
comments: Removes any comments.
style: Removes any style tags or attributes.
links: Removes any <link> tags.
meta: Removes any <meta> tags.
page_structure: Removes structural parts of a page: <head>, <html>, <title>.
processing_instructions: Removes any processing instructions.
embedded: Removes any embedded objects (flash, iframes).
frames: Removes any frame-related tags.
forms: Removes any form tags.
annoying_tags: Removes tags that aren't wrong but are annoying: <blink> and <marquee>.
remove_tags: A list of tags to remove. Only the tags are removed; their content is pulled up into the parent tag.
kill_tags: A list of tags to kill. Killing also removes the tag's content, i.e. the whole subtree, not just the tag itself.
allow_tags: A list of tags to include (default: include all).
remove_unknown_tags: Removes any tags that aren't standard parts of HTML.
safe_attrs_only: If true, only include 'safe' attributes (specifically the list from the feedparser HTML sanitisation web site).
safe_attrs: A set of attribute names to override the default list of attributes considered 'safe' (when safe_attrs_only=True).
add_nofollow: If true, then any <a> tags will have rel="nofollow" added to them.

host_whitelist: A list or set of hosts that you can use for embedded content (for content like <object>, <link rel="stylesheet">, etc). You can also implement/override the method allow_embedded_url(el, url) or allow_element(el) to implement more complex rules for what can be embedded. Anything that passes this test will be shown, regardless of the value of (for instance) embedded.

Note that this parameter might not work as intended if you do not make the links absolute before doing the cleaning.

Note that you may also need to set whitelist_tags.

whitelist_tags: A set of tags that can be included with host_whitelist. The default is iframe and embed; you may wish to include other tags like script, or you may want to implement allow_embedded_url for more control. Set to None to include all tags.
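
The interplay of host_whitelist and whitelist_tags can be sketched with the standard library alone. The helper below is hypothetical and is not lxml's implementation; its name, its parameters, and the restriction to http/https URLs are assumptions made for illustration. It also shows why relative links should be made absolute first: a relative URL carries no host to compare.

```python
from urllib.parse import urlsplit


def is_embed_allowed(tag, url, host_whitelist=(), whitelist_tags=("iframe", "embed")):
    """Hypothetical check: may `tag` embed content from `url`?"""
    # whitelist_tags=None means any tag may be whitelisted by host.
    if whitelist_tags is not None and tag not in whitelist_tags:
        return False
    parts = urlsplit(url)
    # Assumption of this sketch: only absolute http/https URLs qualify.
    # A relative URL has an empty scheme and no host, so it is rejected.
    if parts.scheme not in ("http", "https"):
        return False
    # .hostname is already lowercased and has any port stripped.
    host = parts.hostname or ""
    return host in host_whitelist
```

With host_whitelist={"www.youtube.com"}, an iframe pointing at https://www.youtube.com/embed/x passes, while a script from the same host is rejected unless whitelist_tags is widened or set to None.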

This modifies the document in place.
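
A minimal usage sketch, assuming the cleaner is importable: depending on the lxml release it may live in lxml.html.clean or in the separately installed lxml_html_clean package, so the import is deferred and tried both ways. The wrapper name clean_untrusted and the chosen options are illustrative only.

```python
def clean_untrusted(html_text):
    """Hypothetical wrapper: clean a string of HTML with a configured Cleaner."""
    # Deferred import, since the module location depends on the lxml release.
    try:
        from lxml.html.clean import Cleaner
    except ImportError:
        from lxml_html_clean import Cleaner

    # Attributes are set through constructor keywords; the rest keep
    # their class-level defaults.
    cleaner = Cleaner(style=True, add_nofollow=True)
    # clean_html() works on a copy and returns cleaned markup, whereas
    # calling the cleaner on a parsed document modifies it in place.
    return cleaner.clean_html(html_text)
```

Given input containing a <script> element, an onclick attribute and an <a> tag, the result would be expected to drop the first two and add rel="nofollow" to the anchor.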

Instance Methods
__init__(self, **kw)
    x.__init__(...) initializes x; see help(type(x)) for signature
__call__(self, doc)
    Cleans the document.
allow_follow(self, anchor)
    Override to suppress rel="nofollow" on some anchors.
allow_element(self, el)
allow_embedded_url(self, el, url)
kill_conditional_comments(self, doc)
    IE conditional comments basically embed HTML that the parser doesn't normally see. We can't allow anything like that, so we'll kill any comments that could be conditional.
_kill_elements(self, doc, condition, iterate=None)
_remove_javascript_link(self, link)
sub(repl, string[, count = 0]) --> newstring Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
_has_sneaky_javascript(self, style)
    Depending on the browser, stuff like e x p r e s s i o n(...) can get interpreted, or expre/* stuff */ssion(...). This checks for attempts to do stuff like this.
clean_html(self, html)
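
To illustrate what kill_conditional_comments has to recognise, here is a rough sketch of a test for "could this comment be conditional". The pattern and the function name are assumptions for illustration, not the expression lxml uses.

```python
import re

# Assumed pattern: "[if <condition>]>" somewhere in the comment body, as in
# <!--[if IE]> ... <![endif]-->.
_CONDITIONAL_RE = re.compile(r"\[if\s[^\]]*\]\s*>", re.IGNORECASE)


def looks_conditional(comment_text):
    """Return True if a comment body (the text between <!-- and -->)
    resembles an IE conditional comment."""
    return _CONDITIONAL_RE.search(comment_text) is not None
```

A cleaner built along these lines would drop every comment for which the check returns True, including the HTML hidden inside it.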

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

Class Variables
  scripts = True
  javascript = True
  comments = True
  style = False
  links = True
  meta = True
  page_structure = True
  processing_instructions = True
  embedded = True
  frames = True
  forms = True
  annoying_tags = True
  remove_tags = None
  allow_tags = None
  kill_tags = None
  remove_unknown_tags = True
  safe_attrs_only = True
  safe_attrs = frozenset(['abbr', 'accept', 'accept-charset', 'a...
  add_nofollow = False
  host_whitelist = ()
  whitelist_tags = set(['embed', 'iframe'])
  _tag_link_attrs = {'a': 'href', 'applet': ['code', 'object'], ...
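
The class variables above act as defaults that a subclass or the constructor's keyword arguments can override. One way such a pattern can be wired up is sketched below with a hypothetical stand-in class; it is not lxml's source, and raising TypeError for unknown names is a choice made for this sketch.

```python
class OptionsSketch:
    """Hypothetical stand-in for a class configured the way Cleaner is."""

    # Class-level defaults, shared by all instances until overridden.
    scripts = True
    style = False
    add_nofollow = False

    def __init__(self, **kw):
        for name, value in kw.items():
            # Reject names that do not correspond to an existing attribute.
            if not hasattr(self, name):
                raise TypeError("Unknown parameter: %s=%r" % (name, value))
            # Shadow the class default on this instance only.
            setattr(self, name, value)


class StrictOptions(OptionsSketch):
    # Overriding a default in a subclass.
    style = True
```

Both routes leave the base class defaults untouched: OptionsSketch(style=True) changes one instance, StrictOptions changes every instance of the subclass.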
Properties

Inherited from object: __class__

Method Details

__init__(self, **kw)

x.__init__(...) initializes x; see help(type(x)) for signature
Overrides: object.__init__
(inherited documentation)

_has_sneaky_javascript(self, style)

Depending on the browser, stuff like e x p r e s s i o n(...) can get interpreted, or expre/* stuff */ssion(...). This checks for attempts to do stuff like this.

Typically the response will be to kill the entire style. If you have just a bit of JavaScript in the style, another rule will catch that and remove only the JavaScript from the style; this check catches sneakier attempts.
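
A check of this kind can be sketched as: normalise the style text so the obfuscation disappears, then look for the telltale substrings. The function below is a stand-alone illustration under that assumption, not lxml's code; which substrings to look for is also an assumption.

```python
import re

_CSS_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
_WHITESPACE = re.compile(r"\s+")


def has_sneaky_javascript(style):
    """Return True if the CSS text hides script-like constructs."""
    style = _CSS_COMMENT.sub("", style)         # expre/* stuff */ssion -> expression
    style = style.replace("\\", "")             # drop backslash escapes
    style = _WHITESPACE.sub("", style).lower()  # e x p r e s s i o n -> expression
    return "javascript:" in style or "expression(" in style
```

Because all whitespace is removed before matching, ordinary declarations such as "color: red" are unaffected, while spaced-out or comment-split keywords are caught.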

Class Variable Details

_tag_link_attrs

Value:
{'a': 'href',
 'applet': ['code', 'object'],
 'embed': 'src',
 'iframe': 'src',
 'layer': 'src',
 'link': 'href',
 'script': 'src'}
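
Each value in this mapping is either a single attribute name or a list of names. A small sketch of how such a table could be consulted follows; link_values is a hypothetical helper, and the dictionary is copied from the value shown above.

```python
TAG_LINK_ATTRS = {
    'a': 'href',
    'applet': ['code', 'object'],
    'embed': 'src',
    'iframe': 'src',
    'layer': 'src',
    'link': 'href',
    'script': 'src',
}


def link_values(tag, attributes):
    """Return the link-carrying attribute values present on an element.

    `attributes` is a plain dict of attribute name -> value.
    """
    names = TAG_LINK_ATTRS.get(tag, ())
    # Normalise the single-name form to a list so both shapes iterate alike.
    if isinstance(names, str):
        names = [names]
    return [attributes[name] for name in names if name in attributes]
```

Tags that do not appear in the table yield an empty list, so callers can treat "no link attributes" and "unknown tag" the same way.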