Implementing XML languages with lxml

Dr. Stefan Behnel

/

lxml-dev@codespeak.net

tagpython.png

What is an »XML language«?

Popular mistakes to avoid (2)

"This is tree data, I'll take the DOM!"

=> write verbose, redundant, hard-to-maintain code

Popular mistakes to avoid (3)

"SAX is so fast and consumes no memory!"

=> write confusing state-machine code

=> debugging into existence

Working with XML

Getting XML work done

(instead of getting time wasted)

How can you work with XML?

What if you could simplify this?

What if you could simplify this ...

... then »lxml« is your friend!

Overview

What is lxml?

What do you get for your money?

Lesson 0: a quick overview

why »lxml takes all the pain out of XML«

(a quick overview of lxml features and ElementTree concepts)

Namespaces in ElementTree

Text content in ElementTree

Attributes in ElementTree

Tree iteration in lxml.etree (1)

>>> root = etree.fromstring(
...   "<root> <a><b/><b/></a> <c><d/><e><f/></e><g/></c> </root>")

>>> print([child.tag for child in root])   # children
['a', 'c']

>>> print([el.tag for el in root.iter()])  # self and descendants
['root', 'a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']

>>> print([el.tag for el in root.iterdescendants()])
['a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']


>>> def iter_breadth_first(root):
...     bfs_queue = collections.deque([root])
...     while bfs_queue:
...         el = bfs_queue.popleft()  # pop next element
...         bfs_queue.extend(el)      # append its children
...         yield el

>>> print([el.tag for el in iter_breadth_first(root)])
['root', 'a', 'c', 'b', 'b', 'd', 'e', 'g', 'f']

Tree iteration in lxml.etree (2)

>>> root = etree.fromstring(
...   "<root> <a><b/><b/></a> <c><d/><e><f/></e><g/></c> </root>")

>>> tree_walker = etree.iterwalk(root, events=('start', 'end'))

>>> for (event, element) in tree_walker:
...     print("%s (%s)" % (element.tag, event))
root (start)
a (start)
b (start)
b (end)
b (start)
b (end)
a (end)
c (start)
d (start)
d (end)
e (start)
f (start)
f (end)
e (end)
g (start)
g (end)
c (end)
root (end)

Path languages in lxml

<root>
  <speech class='dialog'><p>So be it!</p></speech>
  <p>stuff</p>
</root>

Summary of lesson 0

Lesson 1: parsing XML/HTML

The input side

(a quick overview)

Parsing XML and HTML from ...

(parsing from strings and filenames/URLs frees the GIL)

Example: parsing from a string

Parsing XML into ...

Summary of lesson 1

Lesson 2: generating XML

The output side

(and how to make it safe and simple)

The example language: Atom

The Atom XML format

Example: generate XML (1)

The ElementMaker (or E-factory)

>>> from lxml.builder import ElementMaker
>>> A = ElementMaker(namespace="http://www.w3.org/2005/Atom",
...                  nsmap={None : "http://www.w3.org/2005/Atom"})
>>> atom = A.feed(
...   A.author( A.name("Stefan Behnel") ),
...   A.entry(
...     A.title("News from lxml"),
...     A.link(href="/"),
...     A.summary("See what's <b>fun</b> about lxml...",
...               type="html"),
...   )
... )
>>> from lxml.etree import tostring
>>> print( tostring(atom, pretty_print=True) )

Example: generate XML (2)

>>> atom = A.feed(
...   A.author( A.name("Stefan Behnel") ),
...   A.entry(
...     A.title("News from lxml"),
...     A.link(href="/"),
...     A.summary("See what's <b>fun</b> about lxml...",
...               type="html"),
...   )
... )
<feed xmlns="http://www.w3.org/2005/Atom">
  <author>
    <name>Stefan Behnel</name>
  </author>
  <entry>
    <title>News from lxml</title>
    <link href="/"/>
    <summary type="html">See what's &lt;b&gt;fun&lt;/b&gt;
                         about lxml...</summary>
  </entry>
</feed>

Be careful what you type!

>>> atom = A.feed(
...   A.author( A.name("Stefan Behnel") ),
...   A.entry(
...     A.titel("News from lxml"),
...     A.link(href="/"),
...     A.summary("See what's <b>fun</b> about lxml...",
...               type="html"),
...   )
... )
<feed xmlns="http://www.w3.org/2005/Atom">
  <author>
    <name>Stefan Behnel</name>
  </author>
  <entry>
    <titel>News from lxml</titel>
    <link href="/"/>
    <summary type="html">See what's &lt;b&gt;fun&lt;/b&gt;
                         about lxml...</summary>
  </entry>
</feed>

Want more 'type safety'?

Write an XML generator module instead:

# atomgen.py

from lxml import etree
from lxml.builder import ElementMaker

ATOM_NAMESPACE = "http://www.w3.org/2005/Atom"

A = ElementMaker(namespace=ATOM_NAMESPACE,
                 nsmap={None : ATOM_NAMESPACE})

feed = A.feed
entry = A.entry
title = A.title
# ... and so on and so forth ...


# plus a little validation function: isvalid()
isvalid = etree.RelaxNG(file="atom.rng")

The Atom generator module

>>> import atomgen as A

>>> atom = A.feed(
...   A.author( A.name("Stefan Behnel") ),
...   A.entry(
...     A.link(href="/"),
...     A.title("News from lxml"),
...     A.summary("See what's <b>fun</b> about lxml...",
...               type="html"),
...   )
... )

>>> A.isvalid(atom) # ok, forgot the ID's => invalid XML ...
False

>>> title = A.titel("News from lxml")
Traceback (most recent call last):
  ...
AttributeError: 'module' object has no attribute 'titel'

Mixing languages (1)

Atom can embed serialised HTML

>>> import lxml.html.builder as h

>>> html_fragment = h.DIV(
...   "this is some\n",
...   h.A("HTML", href="http://w3.org/MarkUp/"),
...   "\ncontent")
>>> serialised_html = etree.tostring(html_fragment, method="html")

>>> summary = A.summary(serialised_html, type="html")
>>> print(etree.tostring(summary))
<summary xmlns="http://www.w3.org/2005/Atom" type="html">
   &lt;div&gt;this is some
   &lt;a href="http://w3.org/MarkUp/"&gt;HTML&lt;/a&gt;
   content&lt;/div&gt;
</summary>

Mixing languages (2)

Atom can also embed non-escaped XHTML

>>> from copy import deepcopy
>>> xhtml_fragment = deepcopy(html_fragment)

>>> from lxml.html import html_to_xhtml
>>> html_to_xhtml(xhtml_fragment)

>>> summary = A.summary(xhtml_fragment, type="xhtml")
>>> print(etree.tostring(summary, pretty_print=True))
<summary xmlns="http://www.w3.org/2005/Atom" type="xhtml">
  <html:div xmlns:html="http://www.w3.org/1999/xhtml">this is some
  <html:a href="http://w3.org/MarkUp/">HTML</html:a>
  content</html:div>
</summary>

Summary of lesson 2

... this is all you need for the output side of XML languages

Lesson 3: Designing XML APIs

The Element API

(and how to make it the way you want)

Trees in C and in Python

ep2008/proxies.png

Mapping Python classes to nodes

Example: a simple Element class (1)

Example: a simple Element class (2)

Mapping Python classes to nodes

Designing an Atom API

Consider lxml.objectify

>>> from lxml import objectify

>>> feed = objectify.parse("atom-example.xml")
>>> print(feed.title)
Example Feed

>>> print([entry.title for entry in feed.entry])
['Atom-Powered Robots Run Amok']

>>> print(feed.entry[0].title)
Atom-Powered Robots Run Amok

Still room for more convenience

from itertools import chain

class FeedElement(objectify.ObjectifiedElement):

    def addIDs(self):
        "initialise the IDs of feed and entries"

        for element in chain([self], self.entry):
            if element.find(_ATOM_NS + "id") is None:
                id = etree.SubElement(self, _ATOM_NS + "id")
                id.text = make_guid()

Incremental API design

Setting up the Element mapping

Atom has a namespace => leave the mapping to lxml

# ...
_atom_lookup = etree.ElementNamespaceClassLookup(
                  objectify.ObjectifyElementClassLookup())

# map the classes to tag names
ns = _atom_lookup.get_namespace(ATOM_NAMESPACE)
ns["feed"]  = FeedElement
ns["entry"] = EntryElement
# ... and so on
# or use ns.update(vars()) with appropriate class names

# create a parser that does some whitespace cleanup
atom_parser = etree.XMLParser(remove_blank_text=True)

# make it use our Atom classes
atom_parser.set_element_class_lookup(_atom_lookup)

# and help users in using our parser setup
def parse(input):
    return etree.parse(input, atom_parser)

Using your new Atom API

>>> import atom
>>> feed = atom.parse("ep2008/atom-example.xml").getroot()

>>> print(len(feed.entry))
1
>>> print([entry.title for entry in feed.entry])
['Atom-Powered Robots Run Amok']

>>> link_tag = "{%s}link" % atom.ATOM_NAMESPACE
>>> print([link.get("href") for link in feed.iter(link_tag)])
['http://example.org/', 'http://example.org/2003/12/13/atom03']

Summary of lesson 3

To implement an XML API ...

  1. start off with lxml's Element API
    • or take a look at the object API of lxml.objectify
  2. specialise it into a set of custom Element classes
  3. map them to XML tags using one of the lookup schemes
  4. improve the API incrementally while using it
    • discover inconveniences and beautify them
    • avoid putting work into things that work

Conclusion

lxml ...