Package lxml :: Package html :: Module clean :: Class Cleaner

Type Cleaner

object --+
         |
        Cleaner


Instances clean the document of each of the possible offending
elements.  The cleaning is controlled by attributes; you can
override attributes in a subclass, or set them in the constructor.

``scripts``:
    Removes any ``<script>`` tags.

``javascript``:
    Removes any JavaScript, such as an ``onclick`` attribute.

``comments``:
    Removes any comments.

``style``:
    Removes any style tags or attributes.

``links``:
    Removes any ``<link>`` tags.

``meta``:
    Removes any ``<meta>`` tags.

``page_structure``:
    Removes structural parts of a page: ``<head>``, ``<html>``, ``<title>``.

``processing_instructions``:
    Removes any processing instructions.

``embedded``:
    Removes any embedded objects (Flash, iframes).

``frames``:
    Removes any frame-related tags.

``forms``:
    Removes any form tags.

``annoying_tags``:
    Removes tags that aren't *wrong*, but are annoying: ``<blink>`` and ``<marquee>``.

``remove_tags``:
    A list of tags to remove.

``allow_tags``:
    A list of tags to include (by default, all tags are included).

``remove_unknown_tags``:
    Remove any tags that aren't standard parts of HTML.

``safe_attrs_only``:
    If true, only include 'safe' attributes (specifically the list
    from `feedparser
    <http://feedparser.org/docs/html-sanitization.html>`_).

``add_nofollow``:
    If true, any ``<a>`` tags will have ``rel="nofollow"`` added to them.

This modifies the document *in place*.
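
A minimal usage sketch, assuming lxml is installed and the module is importable under the path named in this page's header (``lxml.html.clean``). The import is guarded because that may not hold in every installation. Options are passed to the constructor as keyword arguments, and ``clean_html`` is called on a string. The helper name ``clean_fragment`` and the sample markup are illustrative only.

```python
try:
    from lxml.html.clean import Cleaner
except ImportError:
    Cleaner = None  # the cleaning module is not available here


def clean_fragment(html):
    """Return cleaned markup for ``html``, or None if lxml is unavailable."""
    if Cleaner is None:
        return None
    # Override two class defaults: also strip styling,
    # and add rel="nofollow" to links.
    cleaner = Cleaner(style=True, add_nofollow=True)
    return cleaner.clean_html(html)


sample = (
    '<div><p onclick="evil()" style="color: red">Hi '
    '<a href="http://example.com/">link</a></p>'
    '<script>alert(1)</script></div>'
)
cleaned = clean_fragment(sample)
if cleaned is not None:
    print(cleaned)
```

The same settings could instead be fixed once in a subclass by defining ``style = True`` and ``add_nofollow = True`` as class attributes.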

Method Summary
  __init__(self, **kw)
  __call__(self, doc)
Cleans the document.
  _has_sneaky_javascript(self, style)
Depending on the browser, stuff like ``e x p r e s s i o n(...)`` can get interpreted, or ``expre/* stuff */ssion(...)``.
  _kill_elements(self, doc, condition, iterate)
  _remove_javascript_link(self, link)
  clean_html(self, html)
  kill_conditional_comments(self, doc)
IE conditional comments basically embed HTML that the parser doesn't normally see.
    Inherited from object
  __delattr__(...)
x.__delattr__('name') <==> del x.name
  __getattribute__(...)
x.__getattribute__('name') <==> x.name
  __hash__(x)
x.__hash__() <==> hash(x)
  __new__(T, S, ...)
T.__new__(S, ...) -> a new object with type S, a subtype of T
  __reduce__(...)
helper for pickle
  __reduce_ex__(...)
helper for pickle
  __repr__(x)
x.__repr__() <==> repr(x)
  __setattr__(...)
x.__setattr__('name', value) <==> x.name = value
  __str__(x)
x.__str__() <==> str(x)

Class Variable Summary
SRE_Pattern _decomment_re = /\*.*?\*/
bool add_nofollow = False
NoneType allow_tags = None
bool annoying_tags = True
bool comments = True
bool embedded = True
bool forms = True
bool frames = True
bool javascript = True
bool links = True
bool meta = True
bool page_structure = True
bool processing_instructions = True
NoneType remove_tags = None
bool remove_unknown_tags = True
bool safe_attrs_only = True
bool scripts = True
bool style = False
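
The defaults listed here are plain class attributes, which is what allows them to be overridden either in a subclass or through the constructor. The following sketches that pattern only, using a hypothetical ``MiniCleaner`` stand-in with two of the flags. It is not lxml's implementation, and rejecting unknown keywords is an assumption of this sketch.

```python
class MiniCleaner:
    """Hypothetical stand-in showing class-attribute defaults."""

    scripts = True
    style = False

    def __init__(self, **kw):
        for name, value in kw.items():
            # Assumption of this sketch: only known options are accepted.
            if not hasattr(type(self), name):
                raise TypeError("Unknown option: %r" % name)
            setattr(self, name, value)


class StyleStripper(MiniCleaner):
    # Route 1: override the default in a subclass.
    style = True


default = MiniCleaner()
per_instance = MiniCleaner(style=True)  # Route 2: set it in the constructor.
subclassed = StyleStripper()

print(default.style, per_instance.style, subclassed.style)  # False True True
```

Setting an option on one instance leaves the class default untouched, so other instances keep the documented values.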

Method Details

__call__(self, doc)
(Call operator)

Cleans the document.

_has_sneaky_javascript(self, style)

Depending on the browser, stuff like ``e x p r e s s i o n(...)`` can get interpreted, or ``expre/* stuff */ssion(...)``. This checks for attempts to do things like this.

Typically the response will be to kill the entire style. If the style contains just a bit of JavaScript, another rule will catch that and remove only the JavaScript; this check catches sneakier attempts.
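
The idea can be sketched with the standard ``re`` module: strip CSS comments (using the pattern listed for ``_decomment_re`` on this page), drop whitespace, and then look for ``expression(``. The function name and the exact normalisation steps are assumptions made for illustration; the real method may check more than this.

```python
import re

# Same pattern as the _decomment_re value listed on this page.
# re.S (so comments may span lines) is an assumption of this sketch.
decomment_re = re.compile(r"/\*.*?\*/", re.S)
whitespace_re = re.compile(r"\s+")


def looks_sneaky(style):
    """Return True if ``style`` contains a possibly obfuscated expression(...)."""
    style = decomment_re.sub("", style)   # "expre/* x */ssion" -> "expression"
    style = whitespace_re.sub("", style)  # "e x p r e s s i o n" -> "expression"
    return "expression(" in style.lower()


print(looks_sneaky("width: e x p r e s s i o n(alert(1))"))    # True
print(looks_sneaky("width: expre/* stuff */ssion(alert(1))"))  # True
print(looks_sneaky("color: red; /* harmless comment */"))      # False
```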

kill_conditional_comments(self, doc)

IE conditional comments basically embed HTML that the parser doesn't normally see. We can't allow anything like that, so we'll kill any comments that could be conditional.
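
To illustrate what "could be conditional" means, the sketch below tests a comment's text for the ``[if ...]>`` opener that IE treats specially. The pattern and the helper name are assumptions for illustration, not the regular expression the class itself uses.

```python
import re

# Hypothetical pattern: "[if <condition>]>" anywhere in the comment text.
conditional_re = re.compile(r"\[if\s+.*?\]\s*>", re.I | re.S)


def could_be_conditional(comment_text):
    """Return True if the comment text looks like an IE conditional comment."""
    return conditional_re.search(comment_text) is not None


# Text found between "<!--" and "-->" in two sample comments.
print(could_be_conditional("[if IE]><script>evil()</script><![endif]"))  # True
print(could_be_conditional(" just a note for maintainers "))             # False
```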

Class Variable Details

_decomment_re

Type:
SRE_Pattern
Value:
/\*.*?\*/                                                              

add_nofollow

Type:
bool
Value:
False                                                                  

allow_tags

Type:
NoneType
Value:
None                                                                  

annoying_tags

Type:
bool
Value:
True                                                                   

comments

Type:
bool
Value:
True                                                                   

embedded

Type:
bool
Value:
True                                                                   

forms

Type:
bool
Value:
True                                                                   

frames

Type:
bool
Value:
True                                                                   

javascript

Type:
bool
Value:
True                                                                   

links

Type:
bool
Value:
True                                                                   

meta

Type:
bool
Value:
True                                                                   

page_structure

Type:
bool
Value:
True                                                                   

processing_instructions

Type:
bool
Value:
True                                                                   

remove_tags

Type:
NoneType
Value:
None                                                                  

remove_unknown_tags

Type:
bool
Value:
True                                                                   

safe_attrs_only

Type:
bool
Value:
True                                                                   

scripts

Type:
bool
Value:
True                                                                   

style

Type:
bool
Value:
False                                                                  

Generated by Epydoc 2.1 on Sat Aug 18 12:44:28 2007 http://epydoc.sf.net