| Package lxml :: Package html :: Module clean :: Class Cleaner |
|
object --+
         |
        Cleaner
Instances clean the document of each of the possible offending
elements. The cleaning is controlled by attributes; you can
override attributes in a subclass, or set them in the constructor.
``scripts``:
Removes any ``<script>`` tags.
``javascript``:
Removes any JavaScript, like an ``onclick`` attribute.
``comments``:
Removes any comments.
``style``:
Removes any style tags or attributes.
``links``:
Removes any ``<link>`` tags.
``meta``:
Removes any ``<meta>`` tags.
``page_structure``:
Removes structural parts of a page: ``<head>``, ``<html>``, ``<title>``.
``processing_instructions``:
Removes any processing instructions.
``embedded``:
Removes any embedded objects (Flash, iframes).
``frames``:
Removes any frame-related tags.
``forms``:
Removes any form tags.
``annoying_tags``:
Removes tags that aren't *wrong*, but are annoying: ``<blink>`` and ``<marquee>``.
``remove_tags``:
A list of tags to remove.
``allow_tags``:
A list of tags to include (default include all).
``remove_unknown_tags``:
Remove any tags that aren't standard parts of HTML.
``safe_attrs_only``:
If true, only include 'safe' attributes (specifically the list
from `feedparser
<http://feedparser.org/docs/html-sanitization.html>`_).
``add_nofollow``:
If true, then any ``<a>`` tags will have ``rel="nofollow"`` added to them.
This modifies the document *in place*.
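
To illustrate what "in place" means for an option such as ``add_nofollow``, here is a minimal sketch that uses the standard library's ``xml.etree.ElementTree`` as a stand-in for an lxml tree. The helper name ``add_nofollow`` and the choice to preserve existing ``rel`` values are assumptions for illustration, not the Cleaner's actual implementation.

```python
import xml.etree.ElementTree as ET


def add_nofollow(root):
    """Add rel="nofollow" to every <a> element under root, mutating the tree."""
    for el in root.iter("a"):
        # Assumption: existing rel tokens are kept and "nofollow" is appended once.
        rel = el.get("rel", "").split()
        if "nofollow" not in rel:
            rel.append("nofollow")
        el.set("rel", " ".join(rel))


doc = ET.fromstring('<div><a href="http://example.com/">link</a><p>text</p></div>')
add_nofollow(doc)  # returns None; the caller's tree object is what changed
print(ET.tostring(doc, encoding="unicode"))
```

Because the tree is modified rather than copied, a caller who still needs the original document would have to copy it (for example with ``copy.deepcopy``) before cleaning.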
| Method Summary | |
|---|---|
| __init__(self, **kw) | |
| __call__(self, doc) | Cleans the document. |
| _has_sneaky_javascript(self, style) | Depending on the browser, stuff like ``e x p r e s s i o n(...)`` can get interpreted, or ``expre/* stuff */ssion(...)``. |
| _kill_elements(self, doc, condition, iterate) | |
| _remove_javascript_link(self, link) | |
| clean_html(self, html) | |
| kill_conditional_comments(self, doc) | IE conditional comments basically embed HTML that the parser doesn't normally see. |
| Inherited from object | |
| __delattr__(...) | x.__delattr__('name') <==> del x.name |
| __getattribute__(...) | x.__getattribute__('name') <==> x.name |
| __hash__(x) | x.__hash__() <==> hash(x) |
| __new__(T, S, ...) | T.__new__(S, ...) -> a new object with type S, a subtype of T |
| __reduce__(...) | helper for pickle |
| __reduce_ex__(...) | helper for pickle |
| __repr__(x) | x.__repr__() <==> repr(x) |
| __setattr__(...) | x.__setattr__('name', value) <==> x.name = value |
| __str__(x) | x.__str__() <==> str(x) |
| Class Variable Summary | |
|---|---|
| SRE_Pattern | _decomment_re = /\*.*?\*/ |
| bool | add_nofollow = False |
| NoneType | allow_tags = None |
| bool | annoying_tags = True |
| bool | comments = True |
| bool | embedded = True |
| bool | forms = True |
| bool | frames = True |
| bool | javascript = True |
| bool | links = True |
| bool | meta = True |
| bool | page_structure = True |
| bool | processing_instructions = True |
| NoneType | remove_tags = None |
| bool | remove_unknown_tags = True |
| bool | safe_attrs_only = True |
| bool | scripts = True |
| bool | style = False |
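
The class variables listed above act as defaults, and the class description says they can be overridden in a subclass or set in the constructor. The following is a minimal sketch of that pattern only; ``MiniCleaner`` and ``StyleStrippingCleaner`` are hypothetical names, and rejecting unknown keyword names is an assumption rather than documented behaviour of ``Cleaner.__init__``.

```python
class MiniCleaner:
    # Class-level defaults, mirroring a few of the class variables above.
    scripts = True
    style = False
    add_nofollow = False
    remove_tags = None

    def __init__(self, **kw):
        # Assumption: an unknown option name is an error instead of being stored.
        for name, value in kw.items():
            if not hasattr(type(self), name):
                raise TypeError("Unknown option: %s=%r" % (name, value))
            setattr(self, name, value)


class StyleStrippingCleaner(MiniCleaner):
    # Subclass override: changes the default for every instance of this class.
    style = True


per_instance = MiniCleaner(add_nofollow=True)  # constructor override, one instance
per_class = StyleStrippingCleaner()            # inherits the remaining defaults
```

A keyword argument changes only the instance it is passed to, while a subclass attribute changes the default for all instances of that subclass.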
| Method Details |
|---|
| __call__(self, doc): Cleans the document. |
| _has_sneaky_javascript(self, style): Depending on the browser, stuff like ``e x p r e s s i o n(...)`` can get interpreted, or ``expre/* stuff */ssion(...)``. This checks for attempts to do stuff like this. Typically the response will be to kill the entire style; if you have just a bit of JavaScript in the style, another rule will catch that and remove only the JavaScript from the style; this catches the more sneaky attempts. |
| kill_conditional_comments(self, doc): IE conditional comments basically embed HTML that the parser doesn't normally see. We can't allow anything like that, so we'll kill any comments that could be conditional. |
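
The two checks described for ``_has_sneaky_javascript`` and ``kill_conditional_comments`` can be sketched with plain regular expressions. This is an illustration under stated assumptions, not the module's code: the function names are hypothetical, the ``re.S`` flag and the ``javascript:`` test are additions, and the conditional-comment pattern is a guess at what "could be conditional" means. Only the comment-stripping pattern is taken from the ``_decomment_re`` class variable.

```python
import re

# Pattern from the _decomment_re class variable: a non-greedy CSS comment.
# Assumption: re.S is added here so comments spanning several lines match too.
_decomment_re = re.compile(r"/\*.*?\*/", re.S)

# Assumption: a comment "could be conditional" if its text contains "[if ...]>".
_conditional_re = re.compile(r"\[if\s.*?\]\s*>", re.S)


def has_sneaky_javascript(style):
    """Return True if a CSS string hides a script-like construct."""
    style = _decomment_re.sub("", style)    # expre/* stuff */ssion -> expression
    style = "".join(style.split()).lower()  # e x p r e s s i o n -> expression
    return "expression(" in style or "javascript:" in style


def is_conditional_comment(text):
    """Return True if comment text looks like an IE conditional comment."""
    return _conditional_re.search(text) is not None


spaced = has_sneaky_javascript("width: e x p r e s s i o n(alert(1))")
commented = has_sneaky_javascript("width: expre/* stuff */ssion(alert(1))")
harmless = has_sneaky_javascript("color: red /* note */")
conditional = is_conditional_comment("[if lt IE 7]><p>hidden</p><![endif]")
plain = is_conditional_comment(" just a note ")
```

Normalising the style text before testing it is what lets one substring check cover both the spaced-out and the comment-split spellings.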
| Generated by Epydoc 2.1 on Sat Aug 18 12:44:28 2007 | http://epydoc.sf.net |