Frequently asked questions on lxml. See also the notes on compatibility with ElementTree.
Read the lxml.etree Tutorial. While this is still work in progress (just as any good documentation), it provides an overview of the most important concepts in lxml.etree. If you want to help out, the tutorial is a very good place to start.
There is also a tutorial for ElementTree which works for lxml.etree. The API documentation also contains many examples for lxml.etree. To learn how to use lxml.objectify, read the objectify documentation.
There is a lot of documentation, as lxml implements the well-known ElementTree API and tries to follow its documentation as closely as possible. There are a couple of issues where lxml cannot maintain compatibility. They are described in the compatibility documentation. The lxml-specific extensions to the API are described by individual files in the doc directory of the distribution and on the web page.
Compliance with XML standards depends on the support in libxml2 and libxslt. Here is a quote from http://xmlsoft.org/:
In most cases libxml2 tries to implement the specifications in a relatively strictly compliant way. As of release 2.4.16, libxml2 passed all 1800+ tests from the OASIS XML Tests Suite.
lxml currently supports libxml2 2.6.20 or later, which has even better support for various XML standards. Some of the more important ones are: HTML, XML namespaces, XPath, XInclude, XSLT, XML catalogs, canonical XML, RelaxNG, XML:ID. Support for XML Schema and Schematron is currently incomplete in libxml2, but is mostly usable and still being worked on. libxml2 also supports loading documents through HTTP and FTP.
The two modules provide different ways of handling XML. However, objectify builds on top of lxml.etree and therefore inherits most of its capabilities and a large portion of its API.
lxml.etree is a generic API for XML and HTML handling. It aims for ElementTree compatibility and supports the entire XML infoset. It is well suited for both mixed content and data-centric XML. Its generality makes it the best choice for most applications.
lxml.objectify is a specialized API for XML data handling in a Python object syntax. It provides a very natural way to deal with data fields stored in a structurally well-defined XML format. Data is automatically converted to Python data types and can be manipulated with normal Python operators. Look at the examples in the objectify documentation to see what it feels like to use it.
Objectify is not well suited for mixed content or HTML documents. As it is built on top of lxml.etree, however, it inherits the normal support for XPath, XSLT and validation.
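As a rough illustration of the data-binding style described above, the following sketch parses a small, made-up document with lxml.objectify. The element names (order, quantity, price) are hypothetical and chosen only for this example.

```python
from lxml import objectify

# Hypothetical document: the element names are made up for this sketch.
xml = b"<order><quantity>2</quantity><price>9.5</price></order>"
root = objectify.fromstring(xml)

# Child elements are reached as attributes; their text is converted
# to Python data types based on its content.
quantity = root.quantity.pyval        # the plain Python value of the field
total = root.price * root.quantity    # normal Python operators work on data fields
print(quantity, total)
```

The same tree is still an lxml.etree tree, so etree functions such as tostring() can be applied to it as well.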
lxml.etree is a very fast library for processing XML. There are, however, a few caveats involved in the mapping of the powerful libxml2 library to the simple and convenient ElementTree API. Not all operations are as fast as the simplicity of the API might suggest, while some use cases can heavily benefit from finding the right way of doing them. The benchmark page has a comparison to other ElementTree implementations and a number of tips for performance tweaking. As with any Python application, the rule of thumb is: the more of your processing runs in C, the faster your application gets. See also the section on threading.
The ElementTree tree model defines an Element as a container with a tag name, contained text, child Elements and a tail text. This means that whenever you serialise an Element, you will get all parts of that Element:
    >>> from lxml import etree
    >>> root = etree.XML("<root><tag>text<child/></tag>tail</root>")
    >>> print(etree.tostring(root[0], encoding="unicode"))
    <tag>text<child/></tag>tail
This is a huge simplification for the tree model, as it keeps text nodes out of the list of children and makes access to them quick and simple. This is a benefit in most applications and simplifies many, many XML tree algorithms.
However, in document-like XML (and especially HTML), the above result can be unexpected to new users and can sometimes require a bit more overhead. A good way to deal with this is to use helper functions that copy the Element without its tail. The lxml.html package also deals with this in a couple of places, as most HTML algorithms benefit from a tail-free behaviour.
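A helper of the kind mentioned above might look like the following minimal sketch. The function name copy_without_tail is an assumption made for this example, not part of the lxml API.

```python
import copy
from lxml import etree

def copy_without_tail(element):
    """Return a deep copy of element with its tail text removed.

    Hypothetical helper for illustration; the original element is not modified.
    """
    clone = copy.deepcopy(element)
    clone.tail = None
    return clone

root = etree.XML("<root><tag>text<child/></tag>tail</root>")
clone = copy_without_tail(root[0])
serialized = etree.tostring(clone, encoding="unicode")
print(serialized)
```

Because the helper works on a copy, the tail of the original Element stays in place in the source tree.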
It really depends on your application, but the rule of thumb is: more recent versions contain fewer bugs and provide more features.
Read the release notes of libxml2 and the release notes of libxslt to see when (or if) a specific bug has been fixed.
Short answer: If you want to contribute a binary build, we are happy to put it up on the Cheeseshop.
Long answer: Two of the bigger problems with the Windows system are the lack of a pre-installed standard compiler and the missing package management. Both make it non-trivial to build lxml on this platform. We are trying hard to make lxml as platform-independent as possible and it is regularly tested on Windows systems. However, we currently cannot provide Windows binary distributions ourselves.
From time to time, users of different environments kindly contribute binary builds of lxml, most frequently for Windows or Mac OS X. We put these on the Cheeseshop to make it as easy as possible for others to use lxml on their platform.
If there is currently no binary distribution of the most recent lxml release for your platform available from the Cheeseshop, please look through the older versions to see if they provide a binary build. To do so, append the version number to the Cheeseshop URL, e.g.:
http://cheeseshop.python.org/pypi/lxml/1.1.2
Most likely, you use a Python installation that was configured for internal use of UCS2 unicode, meaning 16-bit unicode. The lxml egg distributions are generally compiled on platforms that use UCS4, a 32-bit unicode encoding, as this is used on the majority of platforms. Sadly, the two are not compatible, so the eggs can only support the one they were compiled with.
This means that you have to compile lxml from sources for your system. Note that you do not need Pyrex for this; the lxml source distribution is directly compilable on both platform types. See the build instructions on how to do this.
lxml interfaces with two C libraries: libxml2 and libxslt. Accessing them at the C-level is required for performance reasons.
To avoid writing plain C-code and caring too much about the details of built-in types and reference counting, lxml is written in Pyrex, a Python-like language that is translated into C-code. Chances are that if you know Python, you can write code that Pyrex accepts. Again, the C-ish style used in the lxml code is just for performance optimisations. If you want to contribute, don't bother with the details; a Python implementation of your contribution is better than none. And keep in mind that lxml's flexible API often favours an implementation of features in pure Python, without bothering with C-code at all. For example, the lxml.html package is entirely written in Python.
Please contact the mailing list if you need any help.
Besides enhancing the code, there are a lot of places where you can help the project and its user base. You can
One of the goals of lxml is "no segfaults", so if there is no clear warning in the documentation that you were doing something potentially harmful, you have found a bug and we would like to hear about it. Please report this bug to the mailing list. See the next section on how to do that.
However, there are a few things to try first, to make sure the problem is really within lxml (or libxml2 or libxslt):
In any case, try to reproduce the problem with the latest versions of libxml2 and libxslt. From time to time, bugs and race conditions are found in these libraries, so a more recent version might already contain a fix for your problem.
First, you should look at the current developer changelog to see if this is a known problem that has already been fixed in the SVN trunk since the release you are using.
Also, the 'crash' section above has some good advice on what to try to see if the problem is really in lxml - and not in your setup. Believe it or not, that happens more often than you might think, especially when old libraries or even multiple library versions are installed.
You should always try to reproduce the problem with the latest versions of libxml2 and libxslt - and make sure they are used (lxml.etree can tell you what it runs with, see below).
Otherwise, we would really like to hear about it. Please report it to the mailing list so that we can fix it. It is very helpful in this case if you can come up with a short code snippet that demonstrates your problem. If others can reproduce and see the problem, it is much easier for them to fix it - and maybe even easier for you to describe it and get people convinced that it really is a problem to fix. Please also report the version of lxml, libxml2 and libxslt that you are using by calling this:
    from lxml import etree

    print("lxml.etree:       ", etree.LXML_VERSION)
    print("libxml used:      ", etree.LIBXML_VERSION)
    print("libxml compiled:  ", etree.LIBXML_COMPILED_VERSION)
    print("libxslt used:     ", etree.LIBXSLT_VERSION)
    print("libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION)
Yes, although not carelessly.
lxml frees the GIL (Python's global interpreter lock) internally when parsing from disk and memory, as long as you use either the default parser (which is replicated for each thread) or create a parser for each thread yourself. lxml also allows concurrency during validation (RelaxNG and XMLSchema) and XSL transformation. You can share RelaxNG, XMLSchema and XSLT objects between threads. While you can also share parsers between threads, this will serialize the access to each of them, so it is better to copy() parsers or to use the default parser. Note that access to the XML() and HTML() functions is always serialized. If you need to parse concurrently from strings, use parse() with StringIO.
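The advice to parse concurrently through parse() with one parser per thread could be sketched roughly as follows. The function name parse_root_tag and the sample documents are assumptions for this example, and BytesIO stands in for StringIO because the input here is a byte string.

```python
import threading
from io import BytesIO
from lxml import etree

def parse_root_tag(data, results, index):
    """Parse data in the calling thread and store only the root tag name."""
    parser = etree.XMLParser()                 # one parser per thread
    tree = etree.parse(BytesIO(data), parser)  # parse() instead of XML()
    results[index] = tree.getroot().tag        # hand back a plain string, not the tree

# Made-up sample documents for illustration.
documents = [("<doc%d/>" % i).encode("ascii") for i in range(4)]
results = [None] * len(documents)

threads = [
    threading.Thread(target=parse_root_tag, args=(data, results, i))
    for i, data in enumerate(documents)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(results)
```

Each thread keeps its tree to itself and only passes a string back, which sidesteps the question of sharing trees between threads.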
Due to the way libxslt handles threading, concurrent access to a stylesheet is currently only possible if it was parsed in the main thread. Parsing and applying a stylesheet inside one thread also works.
Warning: You should generally avoid modifying trees in threads other than the one they were generated in. Although this should work in many cases, there are certain scenarios where the termination of a thread that parsed a tree can crash the application if subtrees of this tree were moved to other documents. You should be on the safe side when passing trees between threads if you either
Depends. The best way to answer this is timing and profiling.
The global interpreter lock (GIL) in Python serializes access to the interpreter, so if the majority of your processing is done in Python code (walking trees, modifying elements, etc.), your gain will be close to 0. The more of your XML processing moves into lxml, however, the higher your gain. If your application is bound by XML parsing and serialisation, or by complex XSLTs, your speedup on multi-processor machines can be substantial.
See the question above to learn which operations free the GIL to support multi-threading.
Can be. You can see for yourself by compiling lxml entirely without threading support. Pass the --without-threading option to setup.py when building lxml from source.
lxml currently has the restriction that an XSLT object can only be used in a thread if it was created either in the thread itself or in the main thread. This is due to some interfering optimisations in libxslt and lxml.etree. To work around this, you can do a couple of things:
If your stylesheets are diverse and status specific, you can still prepare them in advance if you:
These environments can use threads in a way that may not make it obvious when threads are created and what happens in which thread. This makes it hard to ensure lxml's threading support is used in a reliable way. Sadly, if problems arise, they are as diverse as the applications, so it is difficult to provide any generally applicable solution. Also, these environments are so complex that problems become hard to debug and even harder to reproduce in a predictable way. If you encounter crashes in one of these systems, but your code runs perfectly when started by hand, the following gives you a few hints for possible approaches to solve your specific problem:
make sure you use recent versions of libxml2, libxslt and lxml. The libxml2 developers keep fixing bugs in each release, and lxml also tries to become more robust against possible pitfalls. So newer versions might already fix your problem in a reliable way.
make sure the library versions you installed are really used. Do not rely on what your operating system tells you! Print the version constants in lxml.etree from within your runtime environment to make sure this is the case. This is especially a problem under Mac OS X when newer library versions were installed in addition to the outdated system libraries.
if you use mod_python, try setting this option:
PythonInterpreter main_interpreter
There was a discussion on the mailing list about this problem:
compile lxml without threading support by running setup.py with the --without-threading option. While this might be slower in certain scenarios on multi-processor systems, it might also keep your application from crashing, which should be worth more to you than peak performance. Remember that lxml is fast anyway, so concurrency may not even be worth it.
avoid doing fancy XSLT stuff like foreign document access or passing in subtrees through XSLT variables. This might or might not work, depending on your specific usage.
try copying trees at suspicious places and working with those instead of a tree shared between threads. A good candidate might be the result of an XSLT or the stylesheet itself.
try keeping thread-local copies of XSLT stylesheets, i.e. one per thread, instead of sharing one. Also see the question above.
you can try to serialise suspicious parts of your code with explicit thread locks, thus disabling the concurrency of the runtime system.
report back on the mailing list to see if there are other ways to work around your specific problems. Do not forget to report the version numbers of lxml, libxml2 and libxslt you are using (see the question on reporting a bug).
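One of the hints above, keeping a thread-local copy of an XSLT stylesheet, could be sketched as follows. The stylesheet, the helper names get_transform and worker, and the sample data are all assumptions made for this example.

```python
import threading
from lxml import etree

# Hypothetical stylesheet that outputs the name attribute of the root element.
STYLESHEET = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/"><xsl:value-of select="/doc/@name"/></xsl:template>
</xsl:stylesheet>"""

_local = threading.local()

def get_transform():
    """Return the XSLT object of the current thread, creating it on first use."""
    transform = getattr(_local, "transform", None)
    if transform is None:
        # Parsed and compiled inside the thread that will apply it.
        transform = etree.XSLT(etree.XML(STYLESHEET))
        _local.transform = transform
    return transform

def worker(name, results, index):
    doc = etree.Element("doc", name=name)
    results[index] = str(get_transform()(doc))

names = ["alpha", "beta"]
results = [None] * len(names)
threads = [
    threading.Thread(target=worker, args=(name, results, i))
    for i, name in enumerate(names)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(results)
```

This trades some memory and start-up time per thread for not having to share a stylesheet object at all.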
Pretty printing (or formatting) an XML document means adding white space to the content. These modifications are harmless if they only impact elements in the document that do not carry (text) data. They corrupt your data if they impact elements that contain data. If lxml cannot distinguish between whitespace and data, it will not alter your data. Whitespace is therefore only added between nodes that do not contain data. This is always the case for trees constructed element-by-element, so no problems should be expected here. For parsed trees, a good way to assure that no conflicting whitespace is left in the tree is the remove_blank_text option:
    >>> parser = etree.XMLParser(remove_blank_text=True)
    >>> tree = etree.parse(file, parser)
This will allow the parser to drop blank text nodes when constructing the tree. If you now call a serialization function to pretty print this tree, lxml can add fresh whitespace to the XML tree to indent it.
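Putting both steps together, a small self-contained sketch might look like this. The input document is made up, and BytesIO is used only to avoid depending on a file on disk.

```python
from io import BytesIO
from lxml import etree

# Made-up input with irregular whitespace between the elements.
xml = b"<root>\n<a><b>data</b>\n</a>   <c/></root>"

# Drop blank text nodes while parsing ...
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(BytesIO(xml), parser)

# ... so that the serializer is free to add fresh indentation.
pretty = etree.tostring(tree, pretty_print=True, encoding="unicode")
print(pretty)
```

The text content of the b element is left untouched, since it is data rather than blank text.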
lxml can read Python unicode strings and even tries to support them if libxml2 does not. However, if the unicode string declares an XML encoding internally (<?xml encoding="..."?>), parsing is bound to fail, as this encoding is most likely not the real encoding used in Python unicode. The same is true for HTML unicode strings that contain charset meta tags, although the problems may be more subtle here. The libxml2 HTML parser may not be able to parse the meta tags in broken HTML and may end up ignoring them, so even if parsing succeeds, later handling may still fail with character encoding errors.
Note that Python uses different encodings for unicode on different platforms, so even specifying the real internal unicode encoding is not portable between Python interpreters. Don't do it.
Python unicode strings with XML data or HTML data that carry encoding information are broken. lxml will not parse them. You must provide parsable data in a valid encoding.
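One way to follow this advice is to hand lxml the encoded bytes rather than a decoded unicode string, so that the declared encoding matches the actual one. A minimal sketch, assuming the data really can be represented in the declared encoding (ISO-8859-1 is just an example here):

```python
from lxml import etree

# A unicode string that carries an encoding declaration (problematic as is).
text = u'<?xml version="1.0" encoding="ISO-8859-1"?><root>caf\xe9</root>'

# Encode it to the declared encoding first, then parse the bytes.
data = text.encode("iso-8859-1")
root = etree.XML(data)
print(root.text)
```

If the data arrives as bytes from a file or the network, it is usually simplest to pass it to the parser without decoding it at all.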
The str() implementation of the XSLTResultTree class (a subclass of the ElementTree class) knows about the output method chosen in the stylesheet (xsl:output); write() does not. If you call write(), the result will be a normal XML tree serialization in the requested encoding. Calling this method may also fail for XSLT results that are not XML trees (e.g. string results).
If you call str(), it will return the serialized result as specified by the XSL transform. This correctly serializes string results to encoded Python strings and honours xsl:output options like indent. This almost certainly does what you want, so you should only use write() if you are sure that the XSLT result is an XML tree and you want to override the encoding and indentation options requested by the stylesheet.
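As an illustration of the str() behaviour, here is a small sketch with a made-up stylesheet that produces plain text through xsl:output. The stylesheet and the input document are assumptions for this example.

```python
from lxml import etree

# Hypothetical stylesheet producing a text result instead of an XML tree.
stylesheet = etree.XML(b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">Hello <xsl:value-of select="/root/@name"/></xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(stylesheet)
result = transform(etree.XML(b'<root name="world"/>'))

# str() honours the output method requested by the stylesheet.
text = str(result)
print(text)
```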
The iterparse() implementation is based on the libxml2 parser. It requires the tree to be intact to finish parsing. If you delete or modify parents of the current node, chances are you modify the structure in a way that breaks the parser. Normally, this will result in a segfault. Please refer to the iterparse section of the lxml API documentation to find out what you can do and what you can't do.
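A commonly used pattern that leaves the parents alone is to clear only the element that has just been completed. The following is a sketch under that assumption, with a made-up document; it is not a complete statement of what iterparse() permits.

```python
from io import BytesIO
from lxml import etree

# Made-up input document for illustration.
xml = b"<root><item>1</item><item>2</item><item>3</item></root>"

values = []
for event, element in etree.iterparse(BytesIO(xml), events=("end",)):
    if element.tag == "item":
        values.append(int(element.text))
        # Drop the content of the finished element only;
        # its parent and the rest of the tree are not touched.
        element.clear()

print(values)
```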
findall() is part of the original ElementTree API. It supports a simple subset of the XPath language, without predicates, conditions and other advanced features. It is very handy for finding specific tags in a tree. Another important difference is namespace handling, which uses the {namespace}tagname notation. This is not supported by XPath. The findall, find and findtext methods are compatible with other ElementTree implementations and allow writing portable code that runs on ElementTree, cElementTree and lxml.etree.
xpath(), on the other hand, supports the complete power of the XPath language, including predicates, XPath functions and Python extension functions. The syntax is defined by the XPath specification. If you need the expressiveness and selectivity of XPath, the xpath() method, the XPath class and the XPathEvaluator are the best choice.
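The difference between the two can be seen in a short sketch. The namespace URI and element names are made up for this example.

```python
from lxml import etree

NS = "http://example.com/ns"  # hypothetical namespace URI
root = etree.XML(
    b'<root xmlns:n="http://example.com/ns">'
    b'<n:item id="1"/><n:item id="2"/><other/>'
    b'</root>'
)

# findall(): ElementTree path syntax with {namespace}tagname notation.
found = root.findall("{%s}item" % NS)

# xpath(): full XPath with a predicate and a prefix-to-URI mapping.
selected = root.xpath("n:item[@id='2']", namespaces={"n": NS})

print(len(found), [item.get("id") for item in selected])
```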
It was decided that it is more important to keep compatibility with ElementTree to simplify code migration between the libraries. The main difference compared to XPath is the {namespace}tagname notation used in findall(), which is not valid XPath.
ElementTree and lxml.etree use the same implementation, which ensures 100% compatibility. Note that findall() is so fast in lxml that a native implementation would not bring any performance benefits.
You can traverse the document (getiterator()) and collect the prefix attributes from all Elements into a set. However, it is unlikely that you really want to do that. You do not need these prefixes, honestly. You only need the namespace URIs. All namespace comparisons use these, so feel free to make up your own prefixes when you use XPath expressions or extension functions.
The only place where you might consider specifying prefixes is the serialization of Elements that were created through the API. Here, you can specify a prefix mapping through the nsmap argument when creating the root Element. Its children will then inherit this prefix for serialization.
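A minimal sketch of the nsmap argument, using a made-up namespace URI and the arbitrary prefix "ex":

```python
from lxml import etree

NS = "http://example.com/ns"  # hypothetical namespace URI

# The prefix mapping is given once, when creating the root Element.
root = etree.Element("{%s}root" % NS, nsmap={"ex": NS})

# Children in the same namespace reuse the prefix on serialization.
etree.SubElement(root, "{%s}child" % NS)

serialized = etree.tostring(root, encoding="unicode")
print(serialized)
```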
You can't. In XPath, there is no such thing as a default namespace. Just use an arbitrary prefix and let the namespace dictionary of the XPath evaluators map it to your namespace. See also the question above.
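For example, a document that uses a default namespace can be queried by mapping a prefix of your own choosing to its URI. The URI and the prefix "d" below are made up for this sketch.

```python
from lxml import etree

# The document declares a default namespace (no prefix).
root = etree.XML(
    b'<root xmlns="http://example.com/default">'
    b'<item>a</item><item>b</item>'
    b'</root>'
)

# An unprefixed name in XPath means "no namespace", so this finds nothing.
unprefixed = root.xpath("//item")

# Map an arbitrary prefix to the namespace URI instead.
items = root.xpath("//d:item/text()", namespaces={"d": "http://example.com/default"})

print(unprefixed, items)
```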