html_annotate(doclist, markup=default_markup)
    doclist should be ordered from oldest to newest, like
    [(doc1, version1), (doc2, version2), ...].

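As a rough illustration of the annotation idea (not lxml's actual implementation), the oldest-to-newest merge can be sketched over plain word lists with difflib from the standard library. `annotate` and `default_markup` below are simplified stand-ins; lxml's real tokens also carry the surrounding HTML tags:

```python
import difflib

def default_markup(text, annotation):
    # Wrap a piece of text in a span naming the version it came from
    # (mirrors the shape of lxml's default markup function).
    return '<span title="%s">%s</span>' % (annotation, text)

def annotate(doclist, markup=default_markup):
    """Word-level analogue of html_annotate: doclist is ordered oldest
    to newest as (text, version) pairs; each word of the newest text is
    credited to the version that introduced it."""
    words, labels = [], []
    for text, version in doclist:
        new_words = text.split()
        matcher = difflib.SequenceMatcher(None, words, new_words)
        out_words, out_labels = [], []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == 'equal':
                # Unchanged words keep the annotation they already had.
                out_words.extend(words[i1:i2])
                out_labels.extend(labels[i1:i2])
            elif op in ('replace', 'insert'):
                # New or changed words are credited to this version.
                out_words.extend(new_words[j1:j2])
                out_labels.extend([version] * (j2 - j1))
            # 'delete': words dropped by this version simply disappear.
        words, labels = out_words, out_labels
    return ' '.join(markup(w, a) for w, a in zip(words, labels))
```

For example, `annotate([('Hello World', 'v1'), ('Goodbye World', 'v2')])` credits "Goodbye" to v2 while "World" stays attributed to v1.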
tokenize_annotated(doc, annotation)
    Tokenize a document and add an annotation attribute to each token.

html_annotate_merge_annotations(tokens_old, tokens_new)
    Merge the annotations from tokens_old into tokens_new, when the
    tokens in the new document already existed in the old document.

copy_annotations(src, dest)
    Copy annotations from the tokens listed in src to the tokens in dest.

compress_tokens(tokens)
    Combine adjacent tokens when there is no HTML between the tokens
    and they share an annotation.

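The compression step can be pictured with a pure-Python sketch over (word, annotation) pairs — hypothetical stand-ins for lxml's token objects, which additionally carry HTML tags that must be empty between merged tokens:

```python
def compress_annotated(tokens):
    # tokens: list of (word, annotation) pairs. Adjacent words that
    # share an annotation are merged into one chunk.
    out = []
    for word, ann in tokens:
        if out and out[-1][1] == ann:
            # Same annotation as the previous chunk: merge in place.
            out[-1] = (out[-1][0] + ' ' + word, ann)
        else:
            out.append((word, ann))
    return out
```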
compress_merge_back(tokens, tok)
    Merge tok into the last element of tokens (modifying the list of
    tokens in-place).

markup_serialize_tokens(tokens, markup_func)
    Serialize the list of tokens into a list of text chunks, calling
    markup_func around text to add annotations.

htmldiff_tokens(html1_tokens, html2_tokens)
    Does a diff on the tokens themselves, returning a list of text
    chunks (not tokens).

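The core of the token diff can be sketched with difflib.SequenceMatcher from the standard library, which is also what lxml builds on. This simplified `diff_words` only wraps runs in <ins>/<del>; the real htmldiff_tokens additionally rebalances tags around the markers, which is not shown here:

```python
import difflib

def diff_words(old_words, new_words):
    # Produce a list of text chunks: unchanged words pass through,
    # deleted runs are wrapped in <del>, inserted runs in <ins>.
    chunks = []
    sm = difflib.SequenceMatcher(None, old_words, new_words)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == 'equal':
            chunks.extend(new_words[j1:j2])
        if op in ('replace', 'delete'):
            chunks.append('<del>%s</del>' % ' '.join(old_words[i1:i2]))
        if op in ('replace', 'insert'):
            chunks.append('<ins>%s</ins>' % ' '.join(new_words[j1:j2]))
    return chunks
```

For example, `diff_words('a b c'.split(), 'a x c'.split())` yields `['a', '<del>b</del>', '<ins>x</ins>', 'c']`.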
expand_tokens(tokens, equal=False)
    Given a list of tokens, return a generator of the chunks of
    text for the data in the tokens.

merge_insert(ins_chunks, doc)
    doc is the already-handled document (as a list of text chunks);
    here we add <ins>ins_chunks</ins> to the end of that.

merge_delete(del_chunks, doc)
    Adds the text chunks in del_chunks to the document doc (another
    list of text chunks) with a marker to show it is a delete.

split_delete(chunks)
    Returns (stuff_before_DEL_START, stuff_inside_DEL_START_END,
    stuff_after_DEL_END).

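A minimal sketch of that splitting step, using plain sentinel objects in place of lxml's DEL_START/DEL_END markers (the real function also raises a dedicated exception when no delete marker is present; here `list.index` simply raises ValueError):

```python
# Hypothetical sentinels standing in for lxml's delete markers.
DEL_START, DEL_END = object(), object()

def split_delete(chunks):
    # Split around the first DEL_START ... DEL_END pair, returning
    # (before, inside, after) without the markers themselves.
    i = chunks.index(DEL_START)
    j = chunks.index(DEL_END, i)
    return chunks[:i], chunks[i + 1:j], chunks[j + 1:]
```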
locate_unbalanced_start(unbalanced_start, pre_delete, post_delete)
    pre_delete and post_delete implicitly point to a place in the
    document (where the two were split).

locate_unbalanced_end(unbalanced_end, pre_delete, post_delete)
    Like locate_unbalanced_start, except handling end tags and
    possibly moving the point earlier in the document.

tokenize(html, include_hrefs=True)
    Parses the given HTML and returns token objects (words with
    attached tags).

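The word-plus-whitespace shape of the tokens can be illustrated with a stdlib-only sketch. `simple_tokenize` below is a hypothetical reduction: lxml's real tokens are str subclasses that also record the HTML tags appearing before and after each word:

```python
import re

def simple_tokenize(text):
    # Split text into (word, trailing_whitespace) pairs, so the
    # original spacing can be reproduced when chunks are reassembled.
    return [(m.group(1), m.group(2))
            for m in re.finditer(r'(\S+)(\s*)', text)]
```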
cleanup_html(html)
    This 'cleans' the HTML, meaning that any page structure is removed
    (only the contents of <body> are used, if there is any <body>).

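A sketch of that cleaning strategy, reconstructed from the regular expressions listed at the bottom of this page (lxml's own code may differ in detail): slice out everything before the <body> start tag and after the </body> end tag, then strip any pre-existing <ins>/<del> tags so they do not collide with the diff markup:

```python
import re

_body_re = re.compile(r'(?is)<body.*?>')
_end_body_re = re.compile(r'(?is)</body.*?>')
_ins_del_re = re.compile(r'(?is)</?(ins|del).*?>')

def cleanup_html(html):
    # Keep only the contents of <body>, when present.
    match = _body_re.search(html)
    if match:
        html = html[match.end():]
    match = _end_body_re.search(html)
    if match:
        html = html[:match.start()]
    # Drop any existing ins/del tags.
    return _ins_del_re.sub('', html)
```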
fixup_chunks(chunks)
    This function takes a list of chunks and produces a list of tokens.

flatten_el(el, include_hrefs, skip_tag=False)
    Takes an lxml element el, and generates all the text chunks for
    that tag.

start_tag(el)
    The text representation of the start tag for a tag.

_fixup_ins_del_tags(doc)
    Version of fixup_ins_del_tags that works on an lxml document
    in-place.

_contains_block_level_tag(el)
    True if the element contains any block-level elements, like <p>,
    <td>, etc.

_move_el_inside_block(el, tag)
    Helper for _fixup_ins_del_tags; actually takes the <ins> etc. tags
    and moves them inside any block-level tags.

_merge_element_contents(el)
    Removes an element, but merges its contents into its place; e.g.,
    given <p>Hi <i>there!</i></p>, if you remove the <i> element you
    get <p>Hi there!</p>.

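The splice can be sketched with the standard library's xml.etree.ElementTree, whose elements share lxml's text/tail model. This version takes the parent explicitly (ElementTree elements cannot look up their own parent, unlike lxml's getparent()), so it is an approximation of the behavior described above rather than lxml's code:

```python
import xml.etree.ElementTree as ET

def merge_element_contents(parent, el):
    # Remove `el` from `parent` while keeping its text and children
    # in place.
    text = el.text or ''
    if el.tail:
        if len(el):
            # Attach el's tail to its last child's tail.
            last = el[-1]
            last.tail = (last.tail or '') + el.tail
        else:
            text += el.tail
    index = list(parent).index(el)
    if text:
        if index == 0:
            # No previous sibling: the text joins the parent's text.
            parent.text = (parent.text or '') + text
        else:
            prev = parent[index - 1]
            prev.tail = (prev.tail or '') + text
    # Replace el with its children in the parent's child list.
    parent[index:index + 1] = list(el)
```

For example, merging the <i> out of `<p>Hi <i>there!</i></p>` leaves `<p>Hi there!</p>`.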
_body_re = re.compile(r'(?is)<body.*?>')

_end_body_re = re.compile(r'(?is)</body.*?>')

_ins_del_re = re.compile(r'(?is)</?(ins|del).*?>')

end_whitespace_re = re.compile(r'[ \t\n\r]$')

empty_tags = ('param', 'img', 'area', 'br', 'basefont', 'input', ...

block_level_tags = ('address', 'blockquote', 'center', 'dir', ...

block_level_container_tags = ('dd', 'dt', 'frameset', 'li', 't ...

start_whitespace_re = re.compile(r'^[ \t\n\r]')
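The two whitespace regexes simply test for a single trailing or leading whitespace character; checks of this shape decide whether a separating space must be inserted between adjacent chunks. `needs_space_between` below is a hypothetical helper, not a function from the module:

```python
import re

end_whitespace_re = re.compile(r'[ \t\n\r]$')
start_whitespace_re = re.compile(r'^[ \t\n\r]')

def needs_space_between(chunk, next_chunk):
    # True when neither the end of `chunk` nor the start of
    # `next_chunk` already carries whitespace.
    return not (end_whitespace_re.search(chunk)
                or start_whitespace_re.search(next_chunk))
```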