lxml supports both XPath and XSLT through libxml2 and libxslt in a standards-compliant way.
The usual setup procedure:
>>> from lxml import etree
>>> from StringIO import StringIO
lxml.etree supports the simple path syntax of the find, findall and findtext methods on ElementTree and Element, as known from the original ElementTree library (ElementPath). As an lxml specific extension, these classes also provide an xpath() method that supports expressions in the complete XPath syntax, as well as custom extension functions.
There are also specialized XPath evaluator classes that are more efficient for frequent evaluation: XPath and XPathEvaluator. See the performance comparison to learn when to use which. Their semantics when used on Elements and ElementTrees are the same as for the xpath() method described here.
For ElementTree, the xpath method performs a global XPath query against the document (if absolute) or against the root node (if relative):
>>> f = StringIO('<foo><bar></bar></foo>')
>>> tree = etree.parse(f)
>>> r = tree.xpath('/foo/bar')
>>> len(r)
1
>>> r[0].tag
'bar'
>>> r = tree.xpath('bar')
>>> r[0].tag
'bar'
When xpath() is used on an Element, the XPath expression is evaluated against the element (if relative) or against the root tree (if absolute):
>>> root = tree.getroot()
>>> r = root.xpath('bar')
>>> r[0].tag
'bar'
>>> bar = root[0]
>>> r = bar.xpath('/foo/bar')
>>> r[0].tag
'bar'
>>> tree = bar.getroottree()
>>> r = tree.xpath('/foo/bar')
>>> r[0].tag
'bar'
The xpath() method has support for XPath variables:
>>> expr = "//*[local-name() = $name]"
>>> print root.xpath(expr, name = "foo")[0].tag
foo
>>> print root.xpath(expr, name = "bar")[0].tag
bar
>>> print root.xpath("$text", text = "Hello World!")
Hello World!
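One practical benefit of variables is that values never have to be quoted or escaped inside the expression text. The following is a minimal sketch written for Python 3 (unlike the Python 2 sessions in this document); the element and attribute names are made up for illustration.

```python
from lxml import etree

# Hypothetical sample data; one attribute value contains an apostrophe.
root = etree.XML(
    '<people>'
    '<person name="O\'Brien"/>'
    '<person name="Smith"/>'
    '</people>'
)

# The value is bound to $name at evaluation time, so the apostrophe
# does not need to be escaped inside the XPath expression itself.
expr = "//person[@name = $name]"
matches = root.xpath(expr, name="O'Brien")
names = [el.get("name") for el in matches]
print(names)
```

The same expression string can be reused with any other value for name.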
Optionally, you can provide a namespaces keyword argument, which should be a dictionary mapping the namespace prefixes used in the XPath expression to namespace URIs:
>>> f = StringIO('''\
... <a:foo xmlns:a="http://codespeak.net/ns/test1"
...        xmlns:b="http://codespeak.net/ns/test2">
...    <b:bar>Text</b:bar>
... </a:foo>
... ''')
>>> doc = etree.parse(f)
>>> r = doc.xpath('/t:foo/b:bar',
...               namespaces={'t': 'http://codespeak.net/ns/test1',
...                           'b': 'http://codespeak.net/ns/test2'})
>>> len(r)
1
>>> r[0].tag
'{http://codespeak.net/ns/test2}bar'
>>> r[0].text
'Text'
There is also an optional extensions argument which is used to define custom extension functions in Python that are local to this evaluation.
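The return values of XPath evaluations vary, depending on the XPath expression used: a boolean for a boolean result, a float for a numeric result, a 'smart' string for a string result, and a list of items for a node-set result.
XPath string results are 'smart' in that they provide a getparent() method that knows their origin. For attribute values, it returns the Element that carries the attribute. For results of the text() function, it returns the Element that contains the text or tail.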
You can distinguish between different text origins with the boolean properties is_text, is_tail and is_attribute.
Note that getparent() may not always return an Element. For example, the XPath functions string() and concat() will construct strings that do not have an origin. For them, getparent() will return None.
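The result types and the 'smart' string properties described above can be inspected directly. This is a small sketch for Python 3, using a made-up sample document; it is meant as an illustration, not as a complete list of possible results.

```python
from lxml import etree

root = etree.XML('<root><a id="x">TEXT</a>TAIL</root>')

# Boolean, numeric, string and node-set results map to different Python types.
as_bool = root.xpath("boolean(//a)")     # Python bool
as_number = root.xpath("count(//a)")     # Python float
as_string = root.xpath("string(//a)")    # constructed string, no origin
as_nodes = root.xpath("//a")             # list of Elements

# Strings selected from the tree remember where they came from.
text, tail = root.xpath("//text()")
attr = root.xpath("//a/@id")[0]

print(text.is_text, text.getparent().tag)
print(tail.is_tail, tail.getparent().tag)
print(attr.is_attribute, attr.getparent().tag)
print(as_string.getparent())
```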
ElementTree objects have a method getpath(element), which returns a structural, absolute XPath expression to find that element:
>>> a = etree.Element("a")
>>> b = etree.SubElement(a, "b")
>>> c = etree.SubElement(a, "c")
>>> d1 = etree.SubElement(c, "d")
>>> d2 = etree.SubElement(c, "d")
>>> tree = etree.ElementTree(c)
>>> print tree.getpath(d2)
/c/d[2]
>>> tree.xpath(tree.getpath(d2)) == [d2]
True
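The round trip between getpath() and xpath() can be checked for every element of a tree. The sketch below assumes Python 3 and a small made-up document that contains elements only.

```python
from lxml import etree

root = etree.XML("<a><b/><c><d/><d/></c></a>")
tree = etree.ElementTree(root)

# Map every element to its structural path, then resolve each path again.
paths = {tree.getpath(el): el for el in root.iter()}
for path, element in paths.items():
    # Each generated path selects exactly the element it was built from.
    assert tree.xpath(path) == [element]
    print(path)
```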
The XPath class compiles an XPath expression into a callable function:
>>> root = etree.XML("<root><a><b/></a><b/></root>")
>>> find = etree.XPath("//b")
>>> print find(root)[0].tag
b
The compilation takes as much time as in the xpath() method, but it is done only once per class instantiation. This makes it especially efficient for repeated evaluation of the same XPath expression.
Just like the xpath() method, the XPath class supports XPath variables:
>>> count_elements = etree.XPath("count(//*[local-name() = $name])")
>>> print count_elements(root, name = "a")
1.0
>>> print count_elements(root, name = "b")
2.0
This supports very efficient evaluation of modified versions of an XPath expression, as compilation is still only required once.
Prefix-to-namespace mappings can be passed as the second parameter:
>>> root = etree.XML("<root xmlns='NS'><a><b/></a><b/></root>")
>>> find = etree.XPath("//n:b", namespaces={'n':'NS'})
>>> print find(root)[0].tag
{NS}b
By default, XPath supports regular expressions in the EXSLT namespace:
>>> regexpNS = "http://exslt.org/regular-expressions"
>>> find = etree.XPath("//*[re:test(., '^abc$', 'i')]",
...                    namespaces={'re':regexpNS})
>>> root = etree.XML("<root><a>aB</a><b>aBc</b></root>")
>>> print find(root)[0].text
aBc
You can disable this with the boolean keyword argument regexp which defaults to True.
lxml.etree provides two other efficient XPath evaluators that work on ElementTrees or Elements respectively: XPathDocumentEvaluator and XPathElementEvaluator. They are automatically selected if you use the XPathEvaluator helper for instantiation:
>>> root = etree.XML("<root><a><b/></a><b/></root>")
>>> xpatheval = etree.XPathEvaluator(root)
>>> print isinstance(xpatheval, etree.XPathElementEvaluator)
True
>>> print xpatheval("//b")[0].tag
b
This class provides efficient support for evaluating different XPath expressions on the same Element or ElementTree.
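As a rough illustration of that use case, the following Python 3 sketch binds one evaluator to an Element and another to an ElementTree, then runs several expressions through each. The sample document is the one used above; the variable names are arbitrary.

```python
from lxml import etree

root = etree.XML("<root><a><b/></a><b/></root>")

# One evaluator bound to an Element, reused for several expressions.
evaluate = etree.XPathEvaluator(root)
b_count = evaluate("count(//b)")
a_tags = [el.tag for el in evaluate("a")]

# Passing an ElementTree selects the document evaluator instead.
doc_evaluate = etree.XPathEvaluator(etree.ElementTree(root))
is_doc_evaluator = isinstance(doc_evaluate, etree.XPathDocumentEvaluator)
root_tags = [el.tag for el in doc_evaluate("/root")]

print(b_count, a_tags, is_doc_evaluator, root_tags)
```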
ElementTree supports a language named ElementPath in its find*() methods. One of the main differences between XPath and ElementPath is that the XPath language requires an indirection through prefixes for namespace support, whereas ElementTree uses the Clark notation ({ns}name) to avoid prefixes completely. The other major difference regards the capabilities of both path languages. Where XPath supports various sophisticated ways of restricting the result set through functions and boolean expressions, ElementPath only supports pure path traversal without nesting or further conditions. So, while the ElementPath syntax is self-contained and therefore easier to write and handle, XPath is much more powerful and expressive.
lxml.etree bridges this gap through the class ETXPath, which accepts XPath expressions with namespaces in Clark notation. It is identical to the XPath class, except for the namespace notation. Normally, you would write:
>>> root = etree.XML("<root xmlns='ns'><a><b/></a><b/></root>")
>>> find = etree.XPath("//p:b", namespaces={'p' : 'ns'})
>>> print find(root)[0].tag
{ns}b
ETXPath allows you to change this to:
>>> find = etree.ETXPath("//{ns}b")
>>> print find(root)[0].tag
{ns}b
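To make the difference in expressiveness concrete, here is a short Python 3 sketch that runs an ElementPath query and an ETXPath query, both in Clark notation, against the same document. The predicate is only an example of something that goes beyond plain path traversal.

```python
from lxml import etree

root = etree.XML("<root xmlns='ns'><a><b/></a><b/></root>")

# ElementPath: Clark notation, plain path traversal.
via_findall = root.findall(".//{ns}b")

# ETXPath: the same Clark notation, but with full XPath available,
# here a predicate that keeps only <b> elements whose parent is <a>.
nested_b = etree.ETXPath("//{ns}b[parent::{ns}a]")
via_etxpath = nested_b(root)

print(len(via_findall), len(via_etxpath))
```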
lxml.etree raises exceptions when errors occur while parsing or evaluating an XPath expression:
>>> find = etree.XPath("\\")
Traceback (most recent call last):
  ...
XPathSyntaxError: Invalid expression
lxml will also try to give you a hint about what went wrong, so if you pass a more complex expression, you may get a somewhat more specific error:
>>> find = etree.XPath("//*[1.1.1]")
Traceback (most recent call last):
  ...
XPathSyntaxError: Invalid predicate
During evaluation, lxml will emit an XPathEvalError on errors:
>>> find = etree.XPath("//ns:a")
>>> find(root)
Traceback (most recent call last):
  ...
XPathEvalError: Undefined namespace prefix
This works for the XPath class. The other evaluators (including the xpath() method), however, are one-shot operations that parse and evaluate in one step. They therefore raise evaluation exceptions in all cases:
>>> root = etree.Element("test")
>>> find = root.xpath("//*[1.1.1]")
Traceback (most recent call last):
  ...
XPathEvalError: Invalid predicate
>>> find = root.xpath("//ns:a")
Traceback (most recent call last):
  ...
XPathEvalError: Undefined namespace prefix
>>> find = root.xpath("\\")
Traceback (most recent call last):
  ...
XPathEvalError: Invalid expression
Note that lxml versions before 1.3 always raised an XPathSyntaxError for all errors, including evaluation errors. The best way to support older versions is to catch the superclass XPathError.
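Handling the superclass might look like the following Python 3 sketch. The helper name try_xpath is hypothetical and not part of lxml; it only wraps the xpath() call in a single except clause.

```python
from lxml import etree

def try_xpath(element, expression):
    """Return (result, None) on success or (None, error message) on failure."""
    try:
        return element.xpath(expression), None
    except etree.XPathError as exc:
        # Covers XPathSyntaxError and XPathEvalError alike.
        return None, str(exc)

root = etree.Element("test")
ok_result, ok_error = try_xpath(root, "count(//*)")
bad_result, bad_error = try_xpath(root, "//ns:a")

print(ok_result, ok_error)
print(bad_result, bad_error)
```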
lxml.etree introduces a new class, lxml.etree.XSLT. The class can be given an ElementTree object to construct an XSLT transformer:
>>> f = StringIO('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:template match="/">
...         <foo><xsl:value-of select="/a/b/text()" /></foo>
...     </xsl:template>
... </xsl:stylesheet>''')
>>> xslt_doc = etree.parse(f)
>>> transform = etree.XSLT(xslt_doc)
You can then run the transformation on an ElementTree document by simply calling it, and this results in another ElementTree object:
>>> f = StringIO('<a><b>Text</b></a>')
>>> doc = etree.parse(f)
>>> result_tree = transform(doc)
By default, XSLT supports all extension functions from libxslt and libexslt as well as Python regular expressions through the EXSLT regexp functions. Also see the documentation on custom extension functions and document resolvers. There is a separate section on controlling access to external documents and resources.
The result of an XSL transformation can be accessed like a normal ElementTree document:
>>> f = StringIO('<a><b>Text</b></a>')
>>> doc = etree.parse(f)
>>> result = transform(doc)
>>> result.getroot().text
'Text'
but, as opposed to normal ElementTree objects, can also be turned into an (XML or text) string by applying the str() function:
>>> str(result)
'<?xml version="1.0"?>\n<foo>Text</foo>\n'
The result is always a plain string, encoded as requested by the xsl:output element in the stylesheet. If you want a Python unicode string instead, you should set this encoding to UTF-8 (unless the ASCII default is sufficient). This allows you to call the builtin unicode() function on the result:
>>> unicode(result)
u'<?xml version="1.0"?>\n<foo>Text</foo>\n'
You can use other encodings at the cost of multiple recoding. Encodings that are not supported by Python will result in an error:
>>> xslt_tree = etree.XML('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:output encoding="UCS4"/>
...     <xsl:template match="/">
...         <foo><xsl:value-of select="/a/b/text()" /></foo>
...     </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_tree)
>>> result = transform(doc)
>>> unicode(result)
Traceback (most recent call last):
  ...
LookupError: unknown encoding: UCS4
It is possible to pass parameters, in the form of XPath expressions, to the XSLT template:
>>> xslt_tree = etree.XML('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:template match="/">
...         <foo><xsl:value-of select="$a" /></foo>
...     </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_tree)
>>> f = StringIO('<a><b>Text</b></a>')
>>> doc = etree.parse(f)
The parameters are passed as keyword parameters to the transform call. First let's try passing in a simple string expression:
>>> result = transform(doc, a="'A'")
>>> str(result)
'<?xml version="1.0"?>\n<foo>A</foo>\n'
Let's try a non-string XPath expression now:
>>> result = transform(doc, a="/a/b/text()")
>>> str(result)
'<?xml version="1.0"?>\n<foo>Text</foo>\n'
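Because parameters are XPath expressions, plain string values have to be quoted, which gets awkward when the value itself contains quotes. Later lxml versions provide the XSLT.strparam() helper for this case. The sketch below assumes Python 3 and an lxml version that includes strparam(); the stylesheet is the one from above.

```python
from lxml import etree

xslt_tree = etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <foo><xsl:value-of select="$a" /></foo>
  </xsl:template>
</xsl:stylesheet>''')
transform = etree.XSLT(xslt_tree)
doc = etree.XML("<a><b>Text</b></a>")

# A value with both kinds of quotes cannot be written as a simple
# XPath string literal, so it is wrapped with strparam() instead.
value = 'it\'s "quoted"'
result = transform(doc, a=etree.XSLT.strparam(value))
print(result.getroot().text)
```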
Just like custom extension functions, lxml supports custom extension elements in XSLT. This means you can write XSLT code like this:
<xsl:template match="*">
    <my:python-extension>
        <some-content />
    </my:python-extension>
</xsl:template>
And then you can implement the element in Python like this:
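>>> class MyExtElement(etree.XSLTExtension):
...     def execute(self, context, self_node, input_node, output_parent):
...         print "Hello from XSLT!"
...         output_parent.text = "I did it!"
...         # just copy own content input to output
...         output_parent.extend( list(self_node) )
The arguments passed to this function are the opaque evaluation context (needed when calling back into the XSLT processor), self_node (a read-only Element that represents the extension element in the stylesheet), input_node (the current context Element in the input document, also read-only), and output_parent (the current insertion point in the output document).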
In XSLT, extension elements can be used like any other XSLT element, except that they must be declared as extensions using the standard XSLT extension-element-prefixes option:
>>> xslt_ext_tree = etree.XML('''
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
...     xmlns:my="testns"
...     extension-element-prefixes="my">
...     <xsl:template match="/">
...         <foo><my:ext><child>XYZ</child></my:ext></foo>
...     </xsl:template>
...     <xsl:template match="child">
...         <CHILD>--xyz--</CHILD>
...     </xsl:template>
... </xsl:stylesheet>''')
To register the extension, add its namespace and name to the extension mapping of the XSLT object:
>>> my_extension = MyExtElement()
>>> extensions = { ('testns', 'ext') : my_extension }
>>> transform = etree.XSLT(xslt_ext_tree, extensions = extensions)
Note how we pass an instance here, not the class of the extension. Now we can run the transformation and see how our extension is called:
>>> root = etree.XML('<dummy/>')
>>> result = transform(root)
Hello from XSLT!
>>> str(result)
'<?xml version="1.0"?>\n<foo>I did it!<child>XYZ</child></foo>\n'
XSLT extensions are a very powerful feature that allows you to interact directly with the XSLT processor. You have full read-only access to the input document and the stylesheet, and you can even call back into the XSLT processor to process templates. Here is an example that passes an Element into the .apply_templates() method of the XSLTExtension instance:
>>> class MyExtElement(etree.XSLTExtension):
...     def execute(self, context, self_node, input_node, output_parent):
...         child = self_node[0]
...         results = self.apply_templates(context, child)
...         output_parent.append(results[0])
>>> my_extension = MyExtElement()
>>> extensions = { ('testns', 'ext') : my_extension }
>>> transform = etree.XSLT(xslt_ext_tree, extensions = extensions)
>>> root = etree.XML('<dummy/>')
>>> result = transform(root)
>>> str(result)
'<?xml version="1.0"?>\n<foo><CHILD>--xyz--</CHILD></foo>\n'
Note how we applied the templates to a child of the extension element itself, i.e. to an element inside the stylesheet instead of an element of the input document.
There is one important thing to keep in mind: all Elements that the execute() method gets to deal with are read-only Elements, so you cannot modify them. They also will not easily work in the API. For example, you cannot pass them to the tostring() function or wrap them in an ElementTree.
What you can do, however, is to deepcopy them to make them normal Elements, and then modify them using the normal etree API. So this will work:
>>> from copy import deepcopy
>>> class MyExtElement(etree.XSLTExtension):
...     def execute(self, context, self_node, input_node, output_parent):
...         child = deepcopy(self_node[0])
...         child.text = "NEW TEXT"
...         output_parent.append(child)
>>> my_extension = MyExtElement()
>>> extensions = { ('testns', 'ext') : my_extension }
>>> transform = etree.XSLT(xslt_ext_tree, extensions = extensions)
>>> root = etree.XML('<dummy/>')
>>> result = transform(root)
>>> str(result)
'<?xml version="1.0"?>\n<foo><child>NEW TEXT</child></foo>\n'
There's also a convenience method on ElementTree objects for doing XSL transformations. This is less efficient if you want to apply the same XSL transformation to multiple documents, but is shorter to write for one-shot operations, as you do not have to instantiate a stylesheet yourself:
>>> result = doc.xslt(xslt_tree, a="'A'")
>>> str(result)
'<?xml version="1.0"?>\n<foo>A</foo>\n'
This is a shortcut for the following code:
>>> transform = etree.XSLT(xslt_tree)
>>> result = transform(doc, a="'A'")
>>> str(result)
'<?xml version="1.0"?>\n<foo>A</foo>\n'
Some applications require a larger set of rather diverse stylesheets. lxml.etree allows you to deal with this in a number of ways. Here are some ideas to try.
The simplest way to reduce the diversity is to use XSLT parameters that you pass at call time to configure the stylesheets. The partial() function in the functools module (available since Python 2.5) may come in handy here. It allows you to bind a set of keyword arguments (i.e. stylesheet parameters) to a reference of a callable stylesheet. The same works for instances of the XPath() evaluator.
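A minimal sketch of this idea, written for Python 3: functools.partial() binds a stylesheet parameter (or an XPath variable) once, and the resulting callable is then used like the original one. The names transform_literal, transform_lookup and count_b are chosen for illustration only.

```python
from functools import partial
from lxml import etree

xslt_tree = etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <foo><xsl:value-of select="$a" /></foo>
  </xsl:template>
</xsl:stylesheet>''')
transform = etree.XSLT(xslt_tree)
doc = etree.XML("<a><b>Text</b></a>")

# Two preconfigured variants of the same compiled stylesheet.
transform_literal = partial(transform, a="'A'")
transform_lookup = partial(transform, a="/a/b/text()")
literal_text = transform_literal(doc).getroot().text
lookup_text = transform_lookup(doc).getroot().text

# The same binding works for a compiled XPath expression with a variable.
count_named = etree.XPath("count(//*[local-name() = $name])")
count_b = partial(count_named, name="b")
b_total = count_b(doc)

print(literal_text, lookup_text, b_total)
```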
You may also consider creating stylesheets programmatically. Just create an XSL tree, e.g. from a parsed template, and then add or replace parts as you see fit. Passing an XSL tree into the XSLT() constructor multiple times will create independent stylesheets, so later modifications of the tree will not be reflected in the already created stylesheets. This makes stylesheet generation very straightforward.
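The independence of stylesheets built from the same tree can be sketched as follows, assuming Python 3. The template stylesheet and its out element are made up for this example.

```python
from lxml import etree

template = etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <out>first</out>
  </xsl:template>
</xsl:stylesheet>''')
doc = etree.XML("<dummy/>")

# Build one stylesheet from the unmodified template tree.
first = etree.XSLT(template)

# Change the tree afterwards and build a second stylesheet from it.
template.find(".//out").text = "second"
second = etree.XSLT(template)

# The first stylesheet still produces the original output.
first_text = first(doc).getroot().text
second_text = second(doc).getroot().text
print(first_text, second_text)
```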
A third thing to remember is the support for custom extension functions. Some things are much easier to do in XSLT than in Python, while for others it is the complete opposite. Finding the right mixture of Python code and XSL code can help a great deal in keeping applications well designed and maintainable.
If you want to know how your stylesheet performed, pass the profile_run keyword to the transform:
>>> result = transform(doc, a="/a/b/text()", profile_run=True)
>>> profile = result.xslt_profile
The value of the xslt_profile property is an ElementTree with profiling data about each template, similar to the following:
<profile>
  <template rank="1" match="/" name="" mode="" calls="1" time="1" average="1"/>
</profile>
Note that this is a read-only document. You must not move any of its elements to other documents. Please deep-copy the document if you need to modify it. If you want to free it from memory, just do:
>>> del result.xslt_profile