Frequently Asked Questions

See also the notes on compatibility with ElementTree.

  1. Is there a tutorial?

    There is a tutorial for ElementTree which also works for lxml.etree. The API documentation also contains many examples.

  2. Where can I find more documentation about lxml?

    There is a lot of documentation, as lxml implements the well-known ElementTree API and tries to follow its documentation as closely as possible. There are a few issues where lxml cannot maintain compatibility; they are described in the compatibility documentation. The lxml-specific extensions to the API are described by individual files in the doc directory of the distribution and on the web page.

  3. My application crashes! Why does lxml.etree do that?

    1. If you are using threads, make sure that you are not sharing non-thread-safe objects between threads. In particular, the default parser, XSLT() and the validators are not thread-safe, for performance reasons. You have to create a new one for each thread, use a thread-safe object pool or ensure thread-safe access to them yourself.
    2. One of the goals of lxml is "no segfaults", so if there is no clear warning in the documentation that you were doing something potentially harmful, you have found a bug and we would like to hear about it. Please report this bug to the mailing list. See the next section on how to do that.
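
One way to follow the "one object per thread" advice is to keep the parser in thread-local storage. The following is only a minimal sketch: the helper names (get_parser, root_tag, run_workers) are made up for illustration and are not part of lxml.

```python
import threading
from lxml import etree

# Each thread sees its own attributes on this object.
_thread_state = threading.local()


def get_parser():
    """Return a parser owned by the calling thread, creating it on first use."""
    parser = getattr(_thread_state, "parser", None)
    if parser is None:
        parser = etree.XMLParser()
        _thread_state.parser = parser
    return parser


def root_tag(xml_bytes):
    """Parse a document with the calling thread's own parser."""
    return etree.fromstring(xml_bytes, get_parser()).tag


def run_workers(documents):
    """Parse each document in its own thread and collect the root tags."""
    results = [None] * len(documents)

    def work(index, data):
        results[index] = root_tag(data)

    threads = [
        threading.Thread(target=work, args=(i, d))
        for i, d in enumerate(documents)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

The same pattern should carry over to XSLT() objects and validators, assuming they are created inside the thread that uses them.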

  4. I think I have found a bug in lxml. What should I do?

    1. First, you should look at the current developer changelog to see if this is a known problem that has already been fixed in the SVN trunk.
    2. Otherwise, we would really like to hear about it. Please report it to the mailing list so that we can fix it. It is very helpful in this case if you can come up with a short code snippet that demonstrates your problem. Please also report the version of lxml, libxml2 and libxslt that you are using (see the module attributes etree.LXML_VERSION etc.).
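
A small sketch for collecting the version information to include in a report. It assumes that LIBXML_VERSION and LIBXSLT_VERSION are the attributes meant by "etc." above and that all three are tuples of integers.

```python
from lxml import etree

# Version tuples of lxml and of the C libraries it is using.
versions = {
    "lxml": etree.LXML_VERSION,
    "libxml2": etree.LIBXML_VERSION,
    "libxslt": etree.LIBXSLT_VERSION,
}

# Build a one-line summary that can be pasted into a bug report.
report = ", ".join(
    "%s %s" % (name, ".".join(str(part) for part in version))
    for name, version in versions.items()
)
print(report)
```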

  5. Can I use threads to concurrently access the lxml API?

    Short answer: No.

    Long answer: lxml does not currently release the GIL (Python's global interpreter lock) internally, so you will not benefit from any performance improvements by using threads. It is also not trivial to free the GIL, as lxml calls back into Python in many places during XML processing: extension functions, Python resolvers, error reporting, etc.

  6. Why doesn't the pretty_print option reformat my XML output?

    Pretty printing (or formatting) an XML document means adding whitespace to the content. These modifications are harmless if they only affect elements that do not carry (text) data, but they corrupt your data if they affect elements that do. If lxml cannot distinguish between whitespace and data, it will not alter your data. Whitespace is therefore only added between nodes that do not contain data. This is always the case for trees constructed element by element, so no problems should be expected there. For parsed trees, a good way to ensure that no conflicting whitespace is left in the tree is the remove_blank_text option:

    >>> from lxml import etree
    >>> # "file" is a placeholder for a filename or file-like object
    >>> parser = etree.XMLParser(remove_blank_text=True)
    >>> tree = etree.parse(file, parser)

    This will allow the parser to drop blank text nodes when constructing the tree. If you now call a serialization function to pretty print this tree, lxml can add fresh whitespace to the XML tree to indent it.
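
As a self-contained illustration of the effect, the sketch below parses a small, unevenly indented document from memory and serializes it with pretty_print. The sample document is made up for this example.

```python
from io import BytesIO
from lxml import etree

# Hypothetical input with irregular whitespace between the elements.
xml = b"<root>\n  <a>  <b/>\n</a></root>"

# Drop whitespace-only text nodes while building the tree.
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(BytesIO(xml), parser)

# With the blank text gone, the serializer is free to indent the output.
pretty = etree.tostring(tree, pretty_print=True).decode()
print(pretty)
```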

  7. What are the findall() and xpath() methods on Element(Tree)?

    findall() is part of the original ElementTree API. It supports a simple subset of the XPath language, without predicates, conditions and other advanced features. It is very handy for finding specific tags in a tree. Another important difference is namespace handling: findall() uses the {namespace}tagname notation, which XPath does not support. The findall(), find() and findtext() methods are compatible with other ElementTree implementations and allow writing portable code that runs on ElementTree, cElementTree and lxml.etree.

    xpath(), on the other hand, supports the complete power of the XPath language, including predicates, XPath functions and Python extension functions. The syntax is defined by the XPath specification. If you need the expressiveness and selectivity of XPath, the xpath() method, the XPath class and the XPathEvaluator are the best choice.
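
The difference in namespace handling can be sketched as follows. The document and the namespace URI http://example.com/ns are invented for this example; the namespaces keyword of xpath() maps a prefix of your choice to that URI.

```python
from lxml import etree

NS = "http://example.com/ns"  # hypothetical namespace URI

root = etree.fromstring(
    '<root xmlns:x="%s">'
    "<x:item>1</x:item><x:item>2</x:item><other/>"
    "</root>" % NS
)

# findall(): ElementTree-style {namespace}tagname notation.
found = root.findall("{%s}item" % NS)

# xpath(): prefixes plus a prefix-to-URI mapping, here with a predicate.
selected = root.xpath("x:item[. = '2']", namespaces={"x": NS})

print(len(found), [el.text for el in selected])
```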

  8. Why doesn't findall() support full XPath expressions?

    It was decided that it is more important to keep compatibility with ElementTree to simplify code migration between the libraries. The main difference compared to XPath is the {namespace}tagname notation used in findall(), which is not valid XPath.

    ElementTree and lxml.etree use the same implementation, which ensures 100% compatibility. Note that findall() is so fast in lxml that a native implementation would not bring any performance benefits.

  9. What is the difference between str(xslt(doc)) and xslt(doc).write()?

    The str() implementation of the XSLTResultTree class (a subclass of ElementTree) knows about the output method chosen in the stylesheet (xsl:output), whereas write() does not. If you call write(), the result will be a normal XML tree serialization in the requested encoding. Calling this method may also fail for XSLT results that are not XML trees (e.g. string results).

    If you call str(), it will return the serialized result as specified by the XSL transform. This correctly serializes string results to encoded Python strings and honours xsl:output options like indent. This almost certainly does what you want, so you should only use write() if you are sure that the XSLT result is an XML tree and you want to override the encoding and indentation options requested by the stylesheet.
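
The following sketch shows str() on a result that is not an XML tree. The stylesheet and input document are made up for illustration; the stylesheet requests text output through xsl:output.

```python
from lxml import etree

# Hypothetical stylesheet that produces plain text instead of XML.
stylesheet = etree.XML("""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:text>Hello, </xsl:text>
    <xsl:value-of select="/greeting/@to"/>
  </xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(stylesheet)
doc = etree.XML('<greeting to="world"/>')

# str() serializes the result according to xsl:output, so no markup appears.
text = str(transform(doc))
print(text)
```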

  10. Why is my application so slow?

    lxml.etree is a very fast library for processing XML. There are, however, a few caveats involved in the mapping of the powerful libxml2 library to the simple and convenient ElementTree API. Not all operations are as fast as the simplicity of the API might suggest. The benchmark page has a comparison to other ElementTree implementations and a number of tips for performance tweaking.