html_annotate(doclist, markup=default_markup)
    doclist should be ordered from oldest to newest (see the sketch
    below).
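
    A usage sketch, assuming each doclist entry is a (text, annotation)
    pair; the output in the comment is indicative:

        from lxml.html.diff import html_annotate

        result = html_annotate([('Hello World', 'version 1'),
                                ('Goodbye World', 'version 2')])
        # roughly: '<span title="version 2">Goodbye</span>
        #           <span title="version 1">World</span>'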

tokenize_annotated(doc, annotation)
    Tokenize a document and add an annotation attribute to each token.

html_annotate_merge_annotations(tokens_old, tokens_new)
    Merge the annotations from tokens_old into tokens_new, when the
    tokens in the new document already existed in the old document.

copy_annotations(src, dest)
    Copy annotations from the tokens listed in src to the tokens in
    dest.

compress_tokens(tokens)
    Combine adjacent tokens when there is no HTML between the tokens
    and they share an annotation (see the sketch below).
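
    A sketch of the merge behavior, using tokenize_annotated from this
    module; the token values described are indicative:

        from lxml.html.diff import tokenize_annotated, compress_tokens

        tokens = tokenize_annotated('<p>Hello world</p>', 'v1')
        merged = compress_tokens(tokens)
        # 'Hello' and 'world' share the annotation 'v1' and have no
        # markup between them, so they collapse into a single token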

compress_merge_back(tokens, tok)
    Merge tok into the last element of tokens (modifying the list of
    tokens in-place).

markup_serialize_tokens(tokens, markup_func)
    Serialize the list of tokens into a list of text chunks, calling
    markup_func around text to add annotations.

htmldiff(old_html, new_html)
    Do a diff of the old and new documents. The documents are HTML
    fragments (str/UTF8 or unicode), not complete documents (i.e., no
    <html> tag). See the sketch below.
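
    A usage sketch; the output in the comment is indicative (insertions
    are wrapped in <ins>, deletions in <del>):

        from lxml.html.diff import htmldiff

        result = htmldiff('<p>Hello World</p>', '<p>Goodbye World</p>')
        # roughly: '<p><ins>Goodbye</ins> <del>Hello</del> World</p>'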

htmldiff_tokens(html1_tokens, html2_tokens)
    Does a diff on the tokens themselves, returning a list of text
    chunks (not tokens).

expand_tokens(tokens, equal=False)
    Given a list of tokens, return a generator of the chunks of text
    for the data in the tokens.

merge_insert(ins_chunks, doc)
    doc is the already-handled document (as a list of text chunks);
    this appends <ins>ins_chunks</ins> to the end of it.

merge_delete(del_chunks, doc)
    Adds the text chunks in del_chunks to the document doc (another
    list of text chunks) with markers to show they are a delete;
    cleanup_delete later resolves these markers into <del> tags.

cleanup_delete(chunks)
    Cleans up any DEL_START/DEL_END markers in the document, replacing
    them with <del></del>. To do this while keeping the document
    valid, it may need to drop some tags (either start or end tags).
    See the sketch below.
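
    A sketch of the marker scheme; DEL_START and DEL_END are the
    module's sentinel markers, and the chunking shown is illustrative:

        from lxml.html.diff import DEL_START, DEL_END, cleanup_delete

        chunks = ['<p>', 'Hello ', DEL_START, 'old ', DEL_END, 'world</p>']
        result = cleanup_delete(chunks)
        # the markers are replaced with balanced '<del>'/'</del>' chunks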

split_delete(chunks)
    Returns (stuff_before_DEL_START, stuff_inside_DEL_START_END,
    stuff_after_DEL_END). Returns the first case found (there may be
    more DEL_STARTs in stuff_after_DEL_END). Raises NoDeletes if
    there's no DEL_START found.

locate_unbalanced_start(unbalanced_start, pre_delete, post_delete)
    pre_delete and post_delete implicitly point to a place in the
    document (where the two were split). This moves that point (by
    popping items from one and pushing them onto the other). It moves
    the point to try to find a place where unbalanced_start applies.

locate_unbalanced_end(unbalanced_end, pre_delete, post_delete)
    Like locate_unbalanced_start, except it handles end tags and
    possibly moves the point earlier in the document.

tokenize(html, include_hrefs=True)
    Parses the given HTML and returns token objects (words with
    attached tags). See the sketch below.
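
    A usage sketch; the attribute names reflect this module's token
    objects (str subclasses carrying tag context), and the values in
    the comments are indicative:

        from lxml.html.diff import tokenize

        tokens = tokenize('<p>Hello world</p>')
        # tokens are the words 'Hello' and 'world'; tokens[0].pre_tags
        # holds the markup seen before the first word (here ['<p>']),
        # and the last token's post_tags holds ['</p>']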

parse_html(html, cleanup=True)
    Parses an HTML fragment, returning an lxml element. Note that the
    HTML will be wrapped in a <div> tag that was not in the original
    document. See the sketch below.
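
    A usage sketch illustrating the implicit <div> wrapper:

        from lxml.html.diff import parse_html

        el = parse_html('<b>Hi</b> there')
        # el.tag == 'div': the wrapper is added by parse_html and was
        # not part of the input fragment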

cleanup_html(html)
    This 'cleans' the HTML, meaning that any page structure is removed
    (only the contents of <body> are used, if there is a <body>). Any
    <ins> and <del> tags are also removed. See the sketch below.
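
    A usage sketch; the output in the comment is indicative:

        from lxml.html.diff import cleanup_html

        page = '<html><body><p>Hi <ins>there</ins></p></body></html>'
        cleaned = cleanup_html(page)
        # roughly: '<p>Hi there</p>' -- the page structure and the
        # <ins> tag are stripped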

fixup_chunks(chunks)
    This function takes a list of chunks and produces a list of
    tokens.

flatten_el(el, include_hrefs, skip_tag=False)
    Takes an lxml element el and generates all the text chunks for
    that tag. Each start tag is a chunk, each word is a chunk, and
    each end tag is a chunk.

split_words(text)
    Splits some text into words. Includes trailing whitespace (one
    space) on each word when appropriate. See the sketch below.
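
    A usage sketch; the exact whitespace handling varies, so the value
    in the comment is indicative:

        from lxml.html.diff import split_words

        words = split_words('Hello   world')
        # roughly: ['Hello ', 'world'] -- each word keeps trailing
        # whitespace so the chunks can be concatenated back into text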

start_tag(el)
    The text representation of the start tag for a tag.

end_tag(el)
    The text representation of an end tag for a tag. Includes
    trailing whitespace when appropriate.

fixup_ins_del_tags(html)
    Given an HTML string, moves any <ins> or <del> tags inside of any
    block-level elements, e.g. transforms <ins><p>word</p></ins> into
    <p><ins>word</ins></p>. See the sketch below.
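
    A usage sketch, following the example in the description; the
    exact serialization is indicative:

        from lxml.html.diff import fixup_ins_del_tags

        fixed = fixup_ins_del_tags('<ins><p>word</p></ins>')
        # roughly: '<p><ins>word</ins></p>'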

_fixup_ins_del_tags(doc)
    Like fixup_ins_del_tags, but works on an lxml document in-place.

_contains_block_level_tag(el)
    True if the element contains any block-level elements, like <p>,
    <td>, etc.

_move_el_inside_block(el, tag)
    Helper for _fixup_ins_del_tags; actually takes the <ins> etc.
    tags and moves them inside any block-level tags.

_merge_element_contents(el)
    Removes an element, but merges its contents into its place; e.g.,
    given <p>Hi <i>there!</i></p>, if you remove the <i> element you
    get <p>Hi there!</p>.

_body_re = re.compile(r'(?is)<body.*?>')

_end_body_re = re.compile(r'(?is)</body.*?>')

_ins_del_re = re.compile(r'(?is)</?(ins|del).*?>')

end_whitespace_re = re.compile(r'[ \t\n\r]$')

empty_tags = ('param', 'img', 'area', 'br', 'basefont', 'input', ...

block_level_tags = ('address', 'blockquote', 'center', 'dir', ...

block_level_container_tags = ('dd', 'dt', 'frameset', 'li', 't...

start_whitespace_re = re.compile(r'^[ \t\n\r]')

__package__ = 'lxml.html'