Package lxml :: Package html :: Module clean :: Class Cleaner |
object --+
         |
        Cleaner
Instances clean the document of each of the possible offending elements. The cleaning is controlled by attributes; you can override attributes in a subclass, or set them in the constructor.

``scripts``:
    Removes any ``<script>`` tags.

``javascript``:
    Removes any Javascript, like an ``onclick`` attribute.

``comments``:
    Removes any comments.

``style``:
    Removes any style tags or attributes.

``links``:
    Removes any ``<link>`` tags.

``meta``:
    Removes any ``<meta>`` tags.

``page_structure``:
    Removes structural parts of a page: ``<head>``, ``<html>``, ``<title>``.

``processing_instructions``:
    Removes any processing instructions.

``embedded``:
    Removes any embedded objects (flash, iframes).

``frames``:
    Removes any frame-related tags.

``forms``:
    Removes any form tags.

``annoying_tags``:
    Removes tags that aren't *wrong*, but are annoying: ``<blink>`` and ``<marquee>``.

``remove_tags``:
    A list of tags to remove.

``allow_tags``:
    A list of tags to include (default include all).

``remove_unknown_tags``:
    Removes any tags that aren't standard parts of HTML.

``safe_attrs_only``:
    If true, only include 'safe' attributes (specifically the list from `feedparser <http://feedparser.org/docs/html-sanitization.html>`_).

``add_nofollow``:
    If true, then any ``<a>`` tags will have ``rel="nofollow"`` added to them.

This modifies the document *in place*.
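As a rough illustration of setting these attributes in the constructor, the sketch below builds a ``Cleaner`` with two non-default options and runs ``clean_html`` on a small fragment. It assumes lxml is installed; newer lxml releases ship the cleaner as the separate ``lxml_html_clean`` package, so the import falls back to that, and the exact output may differ between versions. ``SAMPLE`` and ``clean_sample`` are names made up for this example.

```python
# Assumption: lxml (or the separate lxml_html_clean package used by newer
# lxml releases) is installed. If neither is importable, the demo is skipped.
try:
    from lxml.html.clean import Cleaner
except ImportError:
    try:
        from lxml_html_clean import Cleaner
    except ImportError:
        Cleaner = None

# Hypothetical input fragment, for illustration only.
SAMPLE = (
    '<div onclick="track()">'
    '<script>alert(1)</script>'
    '<a href="http://example.com/">a link</a>'
    '</div>'
)


def clean_sample():
    """Clean SAMPLE with a few non-default options; return None without lxml."""
    if Cleaner is None:
        return None
    # Options are passed as keyword arguments and override the class defaults.
    cleaner = Cleaner(style=True, add_nofollow=True)
    # clean_html accepts an HTML string and returns the cleaned markup.
    return cleaner.clean_html(SAMPLE)


result = clean_sample()
if result is not None:
    print(result)
```

Per the description above, a subclass that sets an attribute such as ``comments = False`` at class level should have the same effect as passing ``comments=False`` to the constructor.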
Method Summary | |
---|---|
__init__(self, **kw) | |
__call__(self, doc) | Cleans the document. |
_has_sneaky_javascript(self, style) | Depending on the browser, stuff like ``e x p r e s s i o n(...)`` can get interpreted, or ``expre/* stuff */ssion(...)``. |
_kill_elements(self, doc, condition, iterate) | |
_remove_javascript_link(self, link) | |
clean_html(self, html) | |
kill_conditional_comments(self, doc) | IE conditional comments basically embed HTML that the parser doesn't normally see. |
Inherited from object | |
__delattr__(...) | x.__delattr__('name') <==> del x.name |
__getattribute__(...) | x.__getattribute__('name') <==> x.name |
__hash__(x) | x.__hash__() <==> hash(x) |
__new__(T, S, ...) | T.__new__(S, ...) -> a new object with type S, a subtype of T |
__reduce__(...) | helper for pickle |
__reduce_ex__(...) | helper for pickle |
__repr__(x) | x.__repr__() <==> repr(x) |
__setattr__(...) | x.__setattr__('name', value) <==> x.name = value |
__str__(x) | x.__str__() <==> str(x) |
Class Variable Summary | |
---|---|
SRE_Pattern | _decomment_re = /\*.*?\*/ |
bool | add_nofollow = False |
NoneType | allow_tags = None |
bool | annoying_tags = True |
bool | comments = True |
bool | embedded = True |
bool | forms = True |
bool | frames = True |
bool | javascript = True |
bool | links = True |
bool | meta = True |
bool | page_structure = True |
bool | processing_instructions = True |
NoneType | remove_tags = None |
bool | remove_unknown_tags = True |
bool | safe_attrs_only = True |
bool | scripts = True |
bool | style = False |
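The ``_decomment_re`` pattern above matches CSS comments. The following is a minimal sketch of how such a pattern might be used to spot the obfuscated ``expression(...)`` forms mentioned in the method summary. It is not the library's implementation: the ``re.S`` flag, the whitespace pattern, the backslash stripping, the ``javascript:`` check, and the function name ``has_sneaky_javascript`` are all assumptions made for this example.

```python
import re

# Same pattern text as the _decomment_re class variable; the re.S flag is an
# assumption so that comments spanning several lines are matched too.
decomment_re = re.compile(r"/\*.*?\*/", re.S)
# Hypothetical helper pattern: any run of whitespace.
whitespace_re = re.compile(r"\s+")


def has_sneaky_javascript(style):
    """Return True if a CSS string appears to hide script behind obfuscation."""
    # Drop CSS comments so "expre/* stuff */ssion(" collapses to "expression(".
    style = decomment_re.sub("", style)
    # Drop backslashes, which CSS allows as escapes inside identifiers.
    style = style.replace("\\", "")
    # Drop all whitespace so "e x p r e s s i o n(" collapses as well.
    style = whitespace_re.sub("", style)
    style = style.lower()
    return "javascript:" in style or "expression(" in style


print(has_sneaky_javascript("width: e x p r e s s i o n(alert(1))"))
print(has_sneaky_javascript("width: expre/* stuff */ssion(alert(1))"))
print(has_sneaky_javascript("color: red"))
```

Normalising the string before searching keeps the check to two substring tests, at the cost of flagging some harmless styles that merely contain those substrings once whitespace is removed.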
Method Details |
---|
__call__(self, doc)
Cleans the document.
|
_has_sneaky_javascript(self, style)
Depending on the browser, stuff like ``e x p r e s s i o n(...)`` can get interpreted, or ``expre/* stuff */ssion(...)``. This checks for attempts to do things like this. Typically the response is to kill the entire style. If there is just a bit of Javascript in the style, another rule will catch that and remove only the Javascript from the style; this check catches the sneakier attempts.
|
kill_conditional_comments(self, doc)
IE conditional comments basically embed HTML that the parser doesn't normally see. We can't allow anything like that, so we kill any comments that could be conditional.
|
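To make the conditional-comment problem concrete, here is a small string-level sketch that drops any comment whose text looks like an IE conditional opener (``[if ...]>``). It only illustrates the idea and is not how the class works on a parsed document; both regular expressions and the names ``is_conditional_comment`` and ``strip_conditional_comments`` are hypothetical.

```python
import re

# Hypothetical pattern for the "[if ...]>" opener of an IE conditional comment.
CONDITIONAL_RE = re.compile(r"\[if\s+.*?\]\s*>", re.I | re.S)
# Hypothetical pattern for an HTML comment; group 1 is the comment text.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.S)


def is_conditional_comment(comment_text):
    """Return True if the comment text could be an IE conditional comment."""
    return CONDITIONAL_RE.search(comment_text) is not None


def strip_conditional_comments(html):
    """Remove comments that could be conditional; keep ordinary comments."""
    def replace(match):
        if is_conditional_comment(match.group(1)):
            return ""
        return match.group(0)
    return COMMENT_RE.sub(replace, html)


sample = (
    "<p>a</p>"
    "<!--[if IE]><script>alert(1)</script><![endif]-->"
    "<!-- plain note -->"
    "<p>b</p>"
)
cleaned = strip_conditional_comments(sample)
print(cleaned)
```

The markup inside the first comment is invisible to an ordinary parser but would be rendered by a browser that honours conditional comments, which is why the whole comment is removed rather than inspected.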
Class Variable Details |
---|
_decomment_re |
add_nofollow |
allow_tags |
annoying_tags |
comments |
embedded |
forms |
frames |
javascript |
links |
meta |
page_structure |
processing_instructions |
remove_tags |
remove_unknown_tags |
safe_attrs_only |
scripts |
style |
Generated by Epydoc 2.1 on Sat Aug 18 12:44:28 2007 | http://epydoc.sf.net |