Package lxml :: Package html :: Module clean :: Class Cleaner

Class Cleaner

object --+
         |
        Cleaner


Instances clean the document of each of the possible offending
elements.  The cleaning is controlled by attributes; you can
override attributes in a subclass, or set them in the constructor.

``scripts``:
    Removes any ``<script>`` tags.

``javascript``:
    Removes any JavaScript, like an ``onclick`` attribute.

``comments``:
    Removes any comments.

``style``:
    Removes any style tags or attributes.

``links``:
    Removes any ``<link>`` tags.

``meta``:
    Removes any ``<meta>`` tags.

``page_structure``:
    Removes structural parts of a page: ``<head>``, ``<html>``, ``<title>``.

``processing_instructions``:
    Removes any processing instructions.

``embedded``:
    Removes any embedded objects (flash, iframes).

``frames``:
    Removes any frame-related tags.

``forms``:
    Removes any form tags.

``annoying_tags``:
    Removes tags that aren't *wrong*, but are annoying: ``<blink>`` and ``<marquee>``.

``remove_tags``:
    A list of tags to remove.

``allow_tags``:
    A list of tags to include (by default, all tags are included).

``remove_unknown_tags``:
    Remove any tags that aren't standard parts of HTML.

``safe_attrs_only``:
    If true, only include 'safe' attributes (specifically the list
    from `feedparser
    <http://feedparser.org/docs/html-sanitization.html>`_).

``add_nofollow``:
    If true, then any ``<a>`` tags will have ``rel="nofollow"`` added to them.

This modifies the document *in place*.
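
As a rough usage sketch (assuming lxml is installed and ``lxml.html.clean`` is importable; recent lxml releases ship this module separately as ``lxml_html_clean``), options can be set in the constructor, and the instance can then be called on a parsed document or used through ``clean_html``. The sample markup is made up for illustration.

```python
from lxml import html
from lxml.html.clean import Cleaner

# Override two of the class-level defaults through the constructor.
cleaner = Cleaner(style=True, add_nofollow=True)

# Made-up input containing a script, an event handler and inline style.
dirty = (
    '<div><script>alert(1)</script>'
    '<p onclick="evil()" style="color:red">Hello '
    '<a href="http://example.com/">link</a></p></div>'
)

# clean_html() on a string hands back a cleaned string.
cleaned = cleaner.clean_html(dirty)
print(cleaned)

# Calling the instance on a parsed document modifies that document in place.
doc = html.fromstring(dirty)
cleaner(doc)
in_place = html.tostring(doc, encoding="unicode")
print(in_place)
```

The same options could instead be set as class attributes on a subclass of ``Cleaner``, which may be more convenient when one configuration is reused in many places.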



Instance Methods
 
__init__(self, **kw)
x.__init__(...) initializes x; see x.__class__.__doc__ for signature
 
__call__(self, doc)
Cleans the document.
 
kill_conditional_comments(self, doc)
IE conditional comments basically embed HTML that the parser doesn't normally see.
 
_kill_elements(self, doc, condition, iterate=None)
 
_remove_javascript_link(self, link)
 
_has_sneaky_javascript(self, style)
Depending on the browser, stuff like ``e x p r e s s i o n(...)`` can get interpreted, or ``expre/* stuff */ssion(...)``.
 
clean_html(self, html)

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

Class Variables
  scripts = True
  javascript = True
  comments = True
  style = False
  links = True
  meta = True
  page_structure = True
  processing_instructions = True
  embedded = True
  frames = True
  forms = True
  annoying_tags = True
  remove_tags = None
  allow_tags = None
  remove_unknown_tags = True
  safe_attrs_only = True
  add_nofollow = False
  _decomment_re = re.compile(r'(?s)/\*.*?\*/')

Properties

Inherited from object: __class__

Method Details

__init__(self, **kw)
(Constructor)

x.__init__(...) initializes x; see x.__class__.__doc__ for signature
Overrides: object.__init__
(inherited documentation)
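
The option-handling pattern described in the class docstring (class-level defaults that a subclass or a constructor keyword can override) might look roughly like the sketch below. This is not lxml's source: ``SketchCleaner`` and ``StyleStripper`` are hypothetical names, and rejecting unknown option names is an assumption.

```python
class SketchCleaner:
    """Hypothetical stand-in for Cleaner; shows the option pattern only."""

    # Class-level defaults, mirroring a few of the documented class variables.
    scripts = True
    style = False
    add_nofollow = False

    def __init__(self, **kw):
        for name, value in kw.items():
            # Assumption: unknown option names are rejected, not ignored.
            if not hasattr(type(self), name):
                raise TypeError(f"Unknown parameter: {name}={value!r}")
            # The instance attribute shadows the class-level default.
            setattr(self, name, value)


class StyleStripper(SketchCleaner):
    # Overriding a default in a subclass instead of in the constructor.
    style = True


default = SketchCleaner()
custom = SketchCleaner(style=True, add_nofollow=True)
subclassed = StyleStripper()
```

Because the keyword values are stored on the instance, the class-level defaults stay untouched for every other instance.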

kill_conditional_comments(self, doc)

IE conditional comments basically embed HTML that the parser doesn't normally see. We can't allow anything like that, so we'll kill any comments that could be conditional.
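
A minimal sketch of how a comment could be flagged as possibly conditional, assuming the check is a regular expression run over the comment text (the part between ``<!--`` and ``-->``). The pattern and the helper name ``looks_conditional`` are hypothetical, not taken from lxml.

```python
import re

# Hypothetical pattern: the "[if ...]>" opener of an IE conditional comment.
CONDITIONAL_RE = re.compile(r"\[if\s+.*?\]\s*>", re.I | re.S)


def looks_conditional(comment_text):
    """Return True if the comment text could be an IE conditional comment."""
    return CONDITIONAL_RE.search(comment_text) is not None


conditional = looks_conditional("[if IE 6]><script>alert(1)</script><![endif]")
ordinary = looks_conditional(" just an ordinary comment ")
print(conditional, ordinary)
```

A comment that matches would then be removed entirely, since the markup hidden inside it never reaches the rest of the cleaning rules.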

_has_sneaky_javascript(self, style)

Depending on the browser, stuff like ``e x p r e s s i o n(...)`` can get interpreted, or ``expre/* stuff */ssion(...)``. This checks for attempts to do things like this.

Typically the response is to remove the entire style. If the style contains only a small amount of plain JavaScript, another rule catches that and removes just the JavaScript; this check catches the sneakier attempts.
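
The check described above can be sketched as a normalization step followed by a substring test. The comment-stripping pattern is the ``_decomment_re`` class variable listed earlier; everything else (the whitespace pattern, the backslash removal, the two substrings tested, and the function name) is an assumption made for illustration.

```python
import re

# Same pattern as the _decomment_re class variable: strips /* ... */ comments.
DECOMMENT_RE = re.compile(r'(?s)/\*.*?\*/')
# Assumption: all whitespace is removed before testing.
WHITESPACE_RE = re.compile(r'\s+')


def has_sneaky_javascript(style):
    """Return True if the style text hides an expression() or javascript: use."""
    style = DECOMMENT_RE.sub('', style)   # expre/* stuff */ssion -> expression
    style = style.replace('\\', '')       # assumption: drop backslash escapes
    style = WHITESPACE_RE.sub('', style)  # e x p r e s s i o n -> expression
    style = style.lower()
    return 'javascript:' in style or 'expression(' in style


spaced = has_sneaky_javascript("width: e x p r e s s i o n(alert(1))")
commented = has_sneaky_javascript("width: expre/* stuff */ssion(alert(1))")
harmless = has_sneaky_javascript("color: red; margin: 0")
print(spaced, commented, harmless)
```

Normalizing first keeps the final test simple: once comments, escapes and whitespace are gone, both obfuscated forms collapse to the same plain string.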