Package lxml :: Package html :: Module clean :: Class Cleaner

Class Cleaner

object --+
         |
        Cleaner


Instances clean the document of each of the possible offending
elements.  The cleaning is controlled by attributes; you can
override attributes in a subclass, or set them in the constructor.

``scripts``:
    Removes any ``<script>`` tags.

``javascript``:
    Removes any JavaScript, like an ``onclick`` attribute.

``comments``:
    Removes any comments.

``style``:
    Removes any style tags or attributes.

``links``:
    Removes any ``<link>`` tags.

``meta``:
    Removes any ``<meta>`` tags.

``page_structure``:
    Removes structural parts of a page: ``<head>``, ``<html>``, ``<title>``.

``processing_instructions``:
    Removes any processing instructions.

``embedded``:
    Removes any embedded objects (flash, iframes).

``frames``:
    Removes any frame-related tags.

``forms``:
    Removes any form tags.

``annoying_tags``:
    Tags that aren't *wrong*, but are annoying: ``<blink>`` and ``<marquee>``.

``remove_tags``:
    A list of tags to remove.

``allow_tags``:
    A list of tags to include (default include all).

``remove_unknown_tags``:
    Remove any tags that aren't standard parts of HTML.

``safe_attrs_only``:
    If true, only include 'safe' attributes (specifically the list
    from `feedparser
    <http://feedparser.org/docs/html-sanitization.html>`_).

``add_nofollow``:
    If true, then any ``<a>`` tags will have ``rel="nofollow"`` added to them.

``host_whitelist``:
    A list or set of hosts that you can use for embedded content
    (for content like ``<object>``, ``<link rel="stylesheet">``, etc).
    You can also implement/override the method
    ``allow_embedded_url(el, url)`` or ``allow_element(el)`` to
    implement more complex rules for what can be embedded.
    Anything that passes this test will be shown, regardless of
    the value of (for instance) ``embedded``.

    Note that this parameter might not work as intended if you do not
    make the links absolute before doing the cleaning.

``whitelist_tags``:
    A set of tags that can be included with ``host_whitelist``.
    The default is ``iframe`` and ``embed``; you may wish to
    include other tags like ``script``, or you may want to
    implement ``allow_embedded_url`` for more control.  Set to None to
    include all tags.

This modifies the document *in place*.
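
As a minimal usage sketch (not part of the original docstring), the following sets a few of the attributes listed above through the constructor and applies the cleaner to a parsed fragment. The sample markup and the ``clean_sample`` helper are illustrative assumptions. The import guard is also an assumption, present only so the snippet still runs where lxml or its cleaner module is not installed.

```python
try:
    import lxml.html
    from lxml.html.clean import Cleaner
except ImportError:
    # Assumption: lxml, or its cleaner module, may be unavailable in this
    # environment; the demonstration is skipped in that case.
    Cleaner = None

# Hypothetical input containing a script tag and an inline event handler.
SAMPLE = '<div><p onclick="evil()">Hello</p><script>alert(1)</script></div>'


def clean_sample(markup=SAMPLE):
    """Return the cleaned markup as text, or None if lxml is unavailable."""
    if Cleaner is None:
        return None
    # Attributes are set as keyword arguments to the constructor.
    cleaner = Cleaner(scripts=True, javascript=True, add_nofollow=False)
    doc = lxml.html.fromstring(markup)
    cleaner(doc)  # __call__ modifies ``doc`` in place
    return lxml.html.tostring(doc, encoding="unicode")


result = clean_sample()
```

Calling the instance mutates the parsed tree, so keep a copy of the tree first if the original is still needed.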



Instance Methods

__init__(self, **kw)
x.__init__(...) initializes x; see x.__class__.__doc__ for signature

__call__(self, doc)
Cleans the document.

allow_follow(self, anchor)
Override to suppress rel="nofollow" on some anchors.

allow_element(self, el)

allow_embedded_url(self, el, url)

kill_conditional_comments(self, doc)
IE conditional comments basically embed HTML that the parser doesn't normally see.

_kill_elements(self, doc, condition, iterate=None)

_remove_javascript_link(self, link)

_substitute_comments(...)
sub(repl, string[, count = 0]) --> newstring
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

_has_sneaky_javascript(self, style)
Depending on the browser, stuff like ``e x p r e s s i o n(...)`` can get interpreted, or ``expre/* stuff */ssion(...)``.

clean_html(self, html)

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

Class Variables
  scripts = True
  javascript = True
  comments = True
  style = True
  links = True
  meta = True
  page_structure = True
  processing_instructions = True
  embedded = True
  frames = True
  forms = True
  annoying_tags = True
  remove_tags = None
  allow_tags = None
  remove_unknown_tags = True
  safe_attrs_only = True
  add_nofollow = True
  host_whitelist = ()
  whitelist_tags = set(['embed', 'iframe'])
  _tag_link_attrs = {'a': 'href', 'applet': ['code', 'object'], ...

Properties

Inherited from object: __class__

Method Details

__init__(self, **kw)
(Constructor)

x.__init__(...) initializes x; see x.__class__.__doc__ for signature
Overrides: object.__init__
(inherited documentation)

kill_conditional_comments(self, doc)

IE conditional comments basically embed HTML that the parser doesn't normally see. We can't allow anything like that, so we'll kill any comments that could be conditional.
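
To make the concern concrete, here is a small sketch of what a conditional comment looks like and how one might be detected. It works on raw strings with a hypothetical regular expression, whereas the method documented here operates on the parsed document. The pattern and the ``strip_conditional_comments`` helper are assumptions for illustration, not lxml's code.

```python
import re

# Hypothetical pattern: a comment that opens with "[if ...]>" and closes
# with "<![endif]", which is the shape of an IE conditional comment.
_CONDITIONAL_RE = re.compile(
    r"<!--\s*\[if\b[^\]]*\]>.*?<!\[endif\]\s*-->",
    re.IGNORECASE | re.DOTALL,
)


def strip_conditional_comments(markup):
    """Remove comments that look like IE conditional comments."""
    return _CONDITIONAL_RE.sub("", markup)


hidden = "<p>a</p><!--[if IE]><script>x()</script><![endif]--><p>b</p>"
stripped = strip_conditional_comments(hidden)
plain = strip_conditional_comments("<p>a</p><!-- just a note -->")
```

The first input hides a ``<script>`` element inside a comment that an ordinary parser treats as opaque text, which is why such comments cannot be left in place.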

_has_sneaky_javascript(self, style)

Depending on the browser, stuff like ``e x p r e s s i o n(...)`` can get interpreted, or ``expre/* stuff */ssion(...)``. This checks for attempts to do stuff like this.

Typically the response will be to kill the entire style. If you have just a bit of JavaScript in the style, another rule will catch that and remove only the JavaScript from the style; this check catches sneakier attempts.
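
The idea can be sketched as follows: strip CSS comments and whitespace from the style text, then look for ``expression(`` in what remains. This is a simplified illustration under that assumption, not lxml's actual implementation; the function name and both patterns are hypothetical.

```python
import re

_CSS_COMMENT_RE = re.compile(r"/\*.*?\*/", re.DOTALL)
_WHITESPACE_RE = re.compile(r"\s+")


def has_sneaky_expression(style):
    """Return True if ``style`` hides ``expression(`` behind comments or spaces."""
    collapsed = _CSS_COMMENT_RE.sub("", style)       # drop /* ... */ comments
    collapsed = _WHITESPACE_RE.sub("", collapsed)    # drop all whitespace
    return "expression(" in collapsed.lower()


spaced = has_sneaky_expression("width: e x p r e s s i o n(alert(1))")
commented = has_sneaky_expression("width: expre/* stuff */ssion(alert(1))")
harmless = has_sneaky_expression("color: red; width: 10px")
```

Both obfuscated inputs collapse to text containing ``expression(``, while the ordinary declaration does not.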

Class Variable Details

_tag_link_attrs

Value:
{'a': 'href',
 'applet': ['code', 'object'],
 'embed': 'src',
 'iframe': 'src',
 'layer': 'src',
 'link': 'href',
 'script': 'src'}
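
A rough sketch of how a mapping like this could combine with ``host_whitelist`` and ``whitelist_tags``: look up the URL-bearing attribute(s) for a tag, then accept the element only if every URL is absolute and points at a whitelisted host. The helper names and the plain-dict representation of attributes are assumptions for illustration; this is not the library's code. It also shows why relative links never match, as noted under ``host_whitelist``.

```python
from urllib.parse import urlsplit

# Copied from the value shown above.
TAG_LINK_ATTRS = {
    "a": "href",
    "applet": ["code", "object"],
    "embed": "src",
    "iframe": "src",
    "layer": "src",
    "link": "href",
    "script": "src",
}


def host_allowed(url, host_whitelist):
    """Accept only absolute http(s) URLs whose host is whitelisted."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False  # relative URLs have no scheme, so they never pass
    return (parts.hostname or "") in host_whitelist


def element_allowed(tag, attrs, host_whitelist, whitelist_tags=("iframe", "embed")):
    """Decide whether an element (tag name plus attribute dict) may be kept."""
    if whitelist_tags is not None and tag not in whitelist_tags:
        return False
    names = TAG_LINK_ATTRS.get(tag)
    if names is None:
        return False
    if isinstance(names, str):
        names = [names]
    # Every URL-bearing attribute must be present and point at an allowed host.
    return all(
        bool(attrs.get(name)) and host_allowed(attrs[name], host_whitelist)
        for name in names
    )


hosts = {"example.com"}
absolute_ok = element_allowed("iframe", {"src": "http://example.com/v"}, hosts)
relative_ok = element_allowed("iframe", {"src": "/v"}, hosts)
script_ok = element_allowed("script", {"src": "http://example.com/x.js"}, hosts)
script_any_tag = element_allowed(
    "script", {"src": "http://example.com/x.js"}, hosts, whitelist_tags=None
)
```

Passing ``whitelist_tags=None`` mirrors the documented meaning of "include all tags", so the ``script`` element is accepted only in the last call.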