Frequently asked questions about lxml. See also the notes on compatibility with ElementTree.
The code examples below use the `lxml.etree` module:
>>> from lxml import etree
Read the lxml.etree Tutorial. While it is still a work in progress (just as any good documentation), it provides an overview of the most important concepts in lxml.etree. If you want to help out, improving the tutorial is a very good place to start.
There is also a tutorial for ElementTree which works for lxml.etree. The documentation of the extended etree API also contains many examples for lxml.etree. Fredrik Lundh's element library contains a lot of nice recipes that show how to solve common tasks in ElementTree and lxml.etree. To learn how to use lxml.objectify, read the objectify documentation.
John Shipman has written another tutorial called Python XML processing with lxml that contains lots of examples. Liza Daly wrote a nice article about high-performance aspects when parsing large files with lxml.
There is a lot of documentation on the web and also in the Python standard library documentation, as lxml implements the well-known ElementTree API and tries to follow its documentation as closely as possible. The recipes in Fredrik Lundh's element library are generally worth taking a look at. There are a couple of issues where lxml cannot maintain compatibility. They are described in the compatibility documentation.
The lxml specific extensions to the API are described by individual files in the doc directory of the source distribution and on the web page.
The generated API documentation is a comprehensive API reference for the lxml package.
Compliance with XML standards depends on the support in libxml2 and libxslt. Here is a quote from http://xmlsoft.org/:
In most cases libxml2 tries to implement the specifications in a relatively strictly compliant way. As of release 2.4.16, libxml2 passed all 1800+ tests from the OASIS XML Tests Suite.
lxml currently supports libxml2 2.6.20 or later, which has even better support for various XML standards. The important ones are:
Support for XML Schema is currently not 100% complete in libxml2, but is definitely very close to compliance. Schematron is supported in two ways, the best being the original ISO Schematron reference implementation via XSLT. libxml2 also supports loading documents through HTTP and FTP.
For RelaxNG Compact Syntax support, there is a tool called rnc2rng, written by David Mertz, which you might be able to use from Python. Failing that, trang is the 'official' command line tool (written in Java) to do the conversion.
As an XML library, lxml is often used under the hood of in-house server applications, such as web servers or applications that facilitate some kind of content management. Many people who deploy Zope, Plone or Django use it together with lxml in the background, without speaking publicly about it. Therefore, it is hard to get an idea of who uses it, and the following list of 'users and projects we know of' is very far from a complete list of lxml's users.
Also note that compatibility with the ElementTree library does not require projects to set a hard dependency on lxml - as long as they do not take advantage of lxml's enhanced feature set.
Zope3 and some of its extensions have good support for lxml:
And don't miss the quotes by our generally happy users, and other sites that link to lxml. As Liza Daly puts it: "Many software products come with the pick-two caveat, meaning that you must choose only two: speed, flexibility, or readability. When used carefully, lxml can provide all three."
The two modules provide different ways of handling XML. However, objectify builds on top of lxml.etree and therefore inherits most of its capabilities and a large portion of its API.
lxml.etree is a generic API for XML and HTML handling. It aims for ElementTree compatibility and supports the entire XML infoset. It is well suited for both mixed content and data centric XML. Its generality makes it the best choice for most applications.
lxml.objectify is a specialized API for XML data handling in a Python object syntax. It provides a very natural way to deal with data fields stored in a structurally well defined XML format. Data is automatically converted to Python data types and can be manipulated with normal Python operators. Look at the examples in the objectify documentation to see what it feels like to use it.
Objectify is not well suited for mixed contents or HTML documents. As it is built on top of lxml.etree, however, it inherits the normal support for XPath, XSLT or validation.
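For illustration, here is a minimal sketch of what objectify access looks like (the XML snippet and field names are made up):

>>> from lxml import objectify
>>> root = objectify.fromstring(
...     "<order><item><name>widget</name><price>4.25</price></item></order>")
>>> print(root.item.name)        # child elements are accessed as attributes
widget
>>> print(root.item.price + 1)   # text content is converted to a Python float
5.25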
lxml.etree is a very fast library for processing XML. There are, however, a few caveats involved in the mapping of the powerful libxml2 library to the simple and convenient ElementTree API. Not all operations are as fast as the simplicity of the API might suggest, while some use cases can heavily benefit from finding the right way of doing them. The benchmark page has a comparison to other ElementTree implementations and a number of tips for performance tweaking. As with any Python application, the rule of thumb is: the more of your processing runs in C, the faster your application gets. See also the section on threading.
The ElementTree tree model defines an Element as a container with a tag name, contained text, child Elements and a tail text. This means that whenever you serialise an Element, you will get all parts of that Element:
>>> root = etree.XML("<root><tag>text<child/></tag>tail</root>") >>> print(etree.tostring(root[0])) <tag>text<child/></tag>tail
Here is an example that shows why not serialising the tail would be even more surprising from an object point of view:
>>> root = etree.Element("test") >>> root.text = "TEXT" >>> print(etree.tostring(root)) <test>TEXT</test> >>> root.tail = "TAIL" >>> print(etree.tostring(root)) <test>TEXT</test>TAIL >>> root.tail = None >>> print(etree.tostring(root)) <test>TEXT</test>
Just imagine a Python list where you append an item and it doesn't show up when you look at the list.
The .tail property is a huge simplification for the tree model as it avoids text nodes appearing in the list of children and makes access to them quick and simple. So this is a benefit in most applications and simplifies many, many XML tree algorithms.
However, in document-like XML (and especially HTML), the above result can be unexpected to new users and can sometimes require a bit more overhead. A good way to deal with this is to use helper functions that copy the Element without its tail. The lxml.html package also deals with this in a couple of places, as most HTML algorithms benefit from a tail-free behaviour.
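A minimal sketch of such a helper (the function name is made up; copy.deepcopy works on lxml Elements):

import copy

def copy_without_tail(element):
    # deep-copy the element, then drop the tail so it is not serialised
    element_copy = copy.deepcopy(element)
    element_copy.tail = None
    return element_copy

root = etree.XML("<root><tag>text<child/></tag>tail</root>")
print(etree.tostring(copy_without_tail(root[0])))   # no trailing 'tail' text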
>>> root = etree.XML("<?my PI?><root><!-- empty --></root>") >>> root.tag 'root' >>> root.getprevious().tag is etree.PI True >>> root[0].tag is etree.Comment True
I'm glad you asked.
def recursive_dict(element):
    return element.tag, \
           dict(map(recursive_dict, element)) or element.text
Note that this beautiful quick-and-dirty converter expects children to have unique tag names and will silently overwrite any data that was contained in preceding siblings with the same name. For any real-world application of XML-to-dict conversion, you would be better off writing your own, longer version of this.
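For example, applied to a small document:

>>> recursive_dict(etree.XML("<root><a>1</a><b>2</b></root>"))
('root', {'a': '1', 'b': '2'})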
In Python 2, lxml's API returns byte strings for plain ASCII text values, be it for tag names or text in Element content. This is the same behaviour as known from ElementTree. The reasoning is that ASCII encoded byte strings are compatible with Unicode strings in Python 2, but consume less memory (usually by a factor of 2 or 4) and are faster to create because they do not require decoding. Plain ASCII string values are very common in XML, so this optimisation is generally worth it.
In Python 3, lxml always returns Unicode strings for text and names, as does ElementTree. Since Python 3.3, Unicode strings containing only characters that can be encoded in ASCII or Latin-1 are generally as efficient as byte strings. In older versions of Python 3, the above mentioned drawbacks apply.
To avoid network access, external resources are first looked up in XML catalogues. Many systems have them installed by default, but some don't. On Linux systems, the default place to look is the index file /etc/xml/catalog, which most importantly provides a mapping from doctype IDs to locally installed DTD files.
See the libxml2 catalogue documentation for further information.
The same as in ElementTree. See the tutorial.
It really depends on your application, but the rule of thumb is: more recent versions contain fewer bugs and provide more features.
Read the release notes of libxml2 and the release notes of libxslt to see when (or if) a specific bug has been fixed.
We provide binaries for Linux (manylinux), macOS and MS Windows shortly after each source release.
Thanks to the help of Maximilian Hils and the AppVeyor build service, we try to serve the frequent requests for binary builds for Microsoft Windows in a timely fashion, since users of that platform usually fail to build lxml themselves. Two major design issues of this operating system make this non-trivial for its users: the lack of a pre-installed standard C compiler and the lack of a package manager.
We currently rely on the WinLibs project to provide library versions that are buildable on MS Windows. If the library that we use in lxml's Windows binary wheels is outdated, it is probably because they have not updated their repositories yet. Consider filing a ticket on their side and notifying us when a new version is available, so that we can integrate it.
You are using a Python installation that was configured for a different internal Unicode representation than the lxml package you are trying to install. CPython versions before 3.3 allowed switching between two types at build time: the 32 bit encoding UCS4 and the 16 bit encoding UCS2. Sadly, the two are not compatible, so eggs and other binary distributions can only support the one they were compiled with.
This means that you have to compile lxml from sources for your system. Note that you do not need Cython for this, the lxml source distribution is directly compilable on both platform types. See the build instructions on how to do this.
lxml consists of a relatively large amount of (Cython) generated C code in a single source module. Compiling this module requires a lot of free memory, usually more than half a GB, which can pose problems especially on shared/cloud build systems.
If your C compiler crashes while building lxml from sources, consider using one of the binary wheels that we provide. The "manylinux" binaries should generally work well on most build systems and install substantially faster than a source build.
It almost is.
lxml is not written in plain Python, because it interfaces with two C libraries: libxml2 and libxslt. Accessing them at the C-level is required for performance reasons.
However, to avoid writing plain C-code and caring too much about the details of built-in types and reference counting, lxml is written in Cython, a superset of the Python language that translates to C-code. Chances are that if you know Python, you can write code that Cython accepts. Again, the C-ish style used in the lxml code is just for performance optimisations. If you want to contribute, don't bother with the details, a Python implementation of your contribution is better than none. And keep in mind that lxml's flexible API often favours an implementation of features in pure Python, without bothering with C-code at all. For example, the lxml.html package is written entirely in Python.
Please contact the mailing list if you need any help.
If you find something that you would like lxml to do (or do better), then please tell us about it on the mailing list. Pull requests on github are always appreciated, especially when accompanied by unit tests and documentation (doctests would be great). See the tests subdirectories in the lxml source tree (below the src directory) and the ReST text files in the doc directory.
We also have a list of missing features that we would like to implement but didn't for lack of time. If you find the time, patches are very welcome.
Besides enhancing the code, there are a lot of places where you can help the project and its user base. You can
One of the goals of lxml is "no segfaults", so if there is no clear warning in the documentation that you were doing something potentially harmful, you have found a bug and we would like to hear about it. Please report this bug to the mailing list. See the section on bug reporting to learn how to do that.
If your application (or e.g. your web container) uses threads, please see the FAQ section on threading to check if you touch on one of the potential pitfalls.
In any case, try to reproduce the problem with the latest versions of libxml2 and libxslt. From time to time, bugs and race conditions are found in these libraries, so a more recent version might already contain a fix for your problem.
Remember: even if you see lxml appear in a crash stack trace, it is not necessarily lxml that caused the crash.
If you are using the ``xmlsec`` library together with lxml, you have to make sure that both use the same version of libxml2. The binary wheels of lxml statically include a (usually recent) version of libxml2, whereas xmlsec often depends on the systemwide installed libraries. If you get crashes or unexpected behaviour when using both, please make sure that both get to use the same libxml2 version. Anaconda/condaforge/etc. based installations will usually come with matching C libraries. If you use xmlsec with the system libraries, please build lxml from sources against those as well, e.g. by installing the development packages of libxml2 and libxslt and then installing lxml with
python -m pip install --no-binary lxml lxml
This was a common problem up to lxml 2.1.x. Since lxml 2.2, the only officially supported way to use it on this platform is through a static build against freshly downloaded versions of libxml2 and libxslt. See the build instructions for MacOS-X.
First, you should look at the current developer changelog to see if this is a known problem that has already been fixed in the master branch since the release you are using.
Also, the 'crash' section above has some good advice on what to try to see if the problem is really in lxml - and not in your setup. Believe it or not, that happens more often than you might think, especially when old libraries or even multiple library versions are installed.
You should always try to reproduce the problem with the latest versions of libxml2 and libxslt - and make sure they are used. lxml.etree can tell you what it runs with:
import sys
from lxml import etree

print("%-20s: %s" % ('Python',           sys.version_info))
print("%-20s: %s" % ('lxml.etree',       etree.LXML_VERSION))
print("%-20s: %s" % ('libxml used',      etree.LIBXML_VERSION))
print("%-20s: %s" % ('libxml compiled',  etree.LIBXML_COMPILED_VERSION))
print("%-20s: %s" % ('libxslt used',     etree.LIBXSLT_VERSION))
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
If you can figure out that the problem is not in lxml but in the underlying libxml2 or libxslt, you can ask right on the respective mailing lists, which may considerably reduce the time to find a fix or work-around. See the next question for some hints on how to do that.
Otherwise, we would really like to hear about it. Please report it to the bug tracker or to the mailing list so that we can fix it. It is very helpful in this case if you can come up with a short code snippet that demonstrates your problem. If others can reproduce and see the problem, it is much easier for them to fix it - and maybe even easier for you to describe it and get people convinced that it really is a problem to fix.
It is important that you always report the version of lxml, libxml2 and libxslt that you get from the code snippet above. If we do not know the library versions you are using, we will ask back, so it will take longer for you to get a helpful answer.
Since as a user of lxml you are likely a programmer, you might find this article on bug reports an interesting read.
A large part of lxml's functionality is implemented by libxml2 and libxslt, so problems that you encounter may be in one or the other. Knowing the right place to ask will reduce the time it takes to fix the problem, or to find a work-around.
Both libxml2 and libxslt come with their own command line frontends, namely xmllint and xsltproc. If you encounter problems with XSLT processing for specific stylesheets or with validation for specific schemas, try to run the XSLT with xsltproc or the validation with xmllint respectively to find out if it fails there as well. If it does, please report directly to the mailing lists of the respective project, namely:
On the other hand, everything that seems to be related to Python code, including custom resolvers, custom XPath functions, etc. is likely outside of the scope of libxml2/libxslt. If you encounter problems here or you are not sure where the problem may come from, please ask on the lxml mailing list first.
In any case, a good explanation of the problem including some simple test code and some input data will help us (or the libxml2 developers) see and understand the problem, which largely increases your chance of getting help. See the question above for a few hints on what is helpful here.
Short answer: yes, if you use lxml 2.2 and later.
Since version 1.1, lxml frees the GIL (Python's global interpreter lock) internally when parsing from disk and memory, as long as you use either the default parser (which is replicated for each thread) or create a parser for each thread yourself. lxml also allows concurrency during validation (RelaxNG and XMLSchema) and XSL transformation. You can share RelaxNG, XMLSchema and XSLT objects between threads.
While you can also share parsers between threads, this will serialize the access to each of them, so it is better to .copy() parsers or to just use the default parser if you do not need any special configuration. The same applies to the XPath evaluators, which use an internal lock to protect their prepared evaluation contexts. It is therefore best to use separate evaluator instances in threads.
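A sketch of the per-thread parser approach (the thread-local pattern and names below are just one possible way to do it):

import threading
from lxml import etree

_thread_data = threading.local()
_configured_parser = etree.XMLParser(remove_blank_text=True)   # example configuration

def get_thread_parser():
    # give each thread its own copy, so a parser is never shared across threads
    if not hasattr(_thread_data, "parser"):
        _thread_data.parser = _configured_parser.copy()
    return _thread_data.parser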
Warning: Before lxml 2.2, and especially before 2.1, there were various issues when moving subtrees between different threads, or when applying XSLT objects from one thread to trees parsed or modified in another. If you need code to run with older versions, you should generally avoid modifying trees in threads other than the one they were created in. Although this should work in many cases, there are certain scenarios where the termination of a thread that parsed a tree can crash the application if subtrees of this tree were moved to other documents. You should be on the safe side when passing trees between threads if you either
Since lxml 2.2, even multi-thread pipelines are supported. However, note that it is more efficient to do all tree work inside one thread than to let multiple threads work on a tree one after the other. This is because trees inherit state from the thread that created them, which must be maintained when the tree is modified inside another thread.
Depends. The best way to answer this is timing and profiling.
The global interpreter lock (GIL) in Python serializes access to the interpreter, so if the majority of your processing is done in Python code (walking trees, modifying elements, etc.), your gain will be close to zero. The more of your XML processing moves into lxml, however, the higher your gain. If your application is bound by XML parsing and serialisation, or by very selective XPath expressions and complex XSLTs, your speedup on multi-processor machines can be substantial.
See the question above to learn which operations free the GIL to support multi-threading.
Possibly, yes. You can see for yourself by compiling lxml entirely without threading support. Pass the --without-threading option to setup.py when building lxml from source. You can also build libxml2 without pthread support (--without-pthreads option), which may add another bit of performance. Note that this will leave internal data structures entirely without thread protection, so make sure you really do not use lxml outside of the main application thread in this case.
Since later lxml 2.0 versions, you can do this. There is some overhead involved as the result document needs an additional cleanup traversal when the input document and/or the stylesheet were created in other threads. However, on a multi-processor machine, the gain of freeing the GIL easily covers this drawback.
If you need even the last bit of performance, consider keeping (a copy of) the stylesheet in thread-local storage, and try creating the input document(s) in the same thread. And do not forget to benchmark your code to see if the increased code complexity is really worth it.
These environments can use threads in a way that may not make it obvious when threads are created and what happens in which thread. This makes it hard to ensure lxml's threading support is used in a reliable way. Sadly, if problems arise, they are as diverse as the applications, so it is difficult to provide any generally applicable solution. Also, these environments are so complex that problems become hard to debug and even harder to reproduce in a predictable way. If you encounter crashes in one of these systems, but your code runs perfectly when started by hand, the following gives you a few hints for possible approaches to solve your specific problem:
make sure you use recent versions of libxml2, libxslt and lxml. The libxml2 developers keep fixing bugs in each release, and lxml also tries to become more robust against possible pitfalls. So newer versions might already fix your problem in a reliable way. Version 2.2 of lxml contains many improvements.
make sure the library versions you installed are really used. Do not rely on what your operating system tells you! Print the version constants in lxml.etree from within your runtime environment to make sure it is the case. This is especially a problem under MacOS-X when newer library versions were installed in addition to the outdated system libraries. Please read the bugs section regarding MacOS-X in this FAQ.
if you use mod_python, try setting this option:
PythonInterpreter main_interpreter
There was a discussion on the mailing list about this problem:
in a threaded environment, try to initially import lxml.etree from the main application thread instead of doing first-time imports separately in each spawned worker thread. If you cannot control the thread spawning of your web/application server, an import of lxml.etree in sitecustomize.py or usercustomize.py may still do the trick.
compile lxml without threading support by running setup.py with the --without-threading option. While this might be slower in certain scenarios on multi-processor systems, it might also keep your application from crashing, which should be worth more to you than peak performance. Remember that lxml is fast anyway, so concurrency may not even be worth it.
look out for fancy XSLT stuff like foreign document access or passing in subtrees through XSLT variables. This might or might not work, depending on your specific usage. Again, later versions of lxml and libxslt provide safer support here.
try copying trees at suspicious places in your code and working with those instead of a tree shared between threads. Note that the copying must happen inside the target thread to be effective, not in the thread that created the tree. Serialising in one thread and parsing in another is also a simple (and fast) way of separating thread contexts.
try keeping thread-local copies of XSLT stylesheets, i.e. one per thread, instead of sharing one. Also see the question above.
you can try to serialise suspicious parts of your code with explicit thread locks, thus disabling the concurrency of the runtime system.
report back on the mailing list to see if there are other ways to work around your specific problems. Do not forget to report the version numbers of lxml, libxml2 and libxslt you are using (see the question on reporting a bug).
Note that most of these options will degrade performance and/or your code quality. If you are unsure what to do, please ask on the mailing list.
Pretty printing (or formatting) an XML document means adding white space to the content. These modifications are harmless if they only impact elements in the document that do not carry (text) data. They corrupt your data if they impact elements that contain data. If lxml cannot distinguish between whitespace and data, it will not alter your data. Whitespace is therefore only added between nodes that do not contain data. This is always the case for trees constructed element-by-element, so no problems should be expected here. For parsed trees, a good way to assure that no conflicting whitespace is left in the tree is the remove_blank_text option:
>>> parser = etree.XMLParser(remove_blank_text=True)
>>> tree = etree.parse(filename, parser)
This will allow the parser to drop blank text nodes when constructing the tree. If you now call a serialization function to pretty print this tree, lxml can add fresh whitespace to the XML tree to indent it.
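For example (using an in-memory document for illustration):

>>> xml = '<root><child><grandchild/></child></root>'
>>> tree = etree.fromstring(xml, etree.XMLParser(remove_blank_text=True))
>>> print(etree.tostring(tree, pretty_print=True).decode())
<root>
  <child>
    <grandchild/>
  </child>
</root>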
Note that the remove_blank_text option also uses a heuristic if it has no definite knowledge about the document's ignorable whitespace. It will keep blank text nodes that appear after non-blank text nodes at the same level. This is to prevent document-style XML from losing content.
The HTMLParser has this structural knowledge built-in, which means that most whitespace that appears between tags in HTML documents will not be removed by this option, except in places where it is truly ignorable, e.g. in the page header, between table structure tags, etc. Therefore, it is also safe to use this option with the HTMLParser, as it will keep content like the following intact (i.e. it will not remove the space that separates the two words):
<p><b>some</b> <em>text</em></p>
If you want to be sure all blank text is removed from an XML document (or just more blank text than the parser does by itself), you have to use either a DTD to tell the parser which whitespace it can safely ignore, or remove the ignorable whitespace manually after parsing, e.g. by setting all tail text to None:
for element in root.iter():
    element.tail = None
Fredrik Lundh also has a Python-level function for indenting XML by appending whitespace to tags. It can be found on his element library recipes page.
First of all, XML is explicitly defined as a stream of bytes. It's not Unicode text. Take a look at the XML specification, it's all about byte sequences and how to map them to text and structure. That leads to rule number one: do not decode your XML data yourself. That's a part of the work of an XML parser, and it does it very well. Just pass it your data as a plain byte stream, it will always do the right thing, by specification.
This also includes not opening XML files in text mode. Make sure you always use binary mode, or, even better, pass the file path into lxml's parse() function to let it do the file opening, reading and closing itself. This is the simplest and most efficient way to do it.
That being said, lxml can read Python unicode strings and even tries to support them if libxml2 does not. This is because there is one valid use case for parsing XML from text strings: literal XML fragments in source code.
However, if the unicode string declares an XML encoding internally (<?xml encoding="..."?>), parsing is bound to fail, as this encoding is almost certainly not the real encoding used in Python unicode. The same is true for HTML unicode strings that contain charset meta tags, although the problems may be more subtle here. The libxml2 HTML parser may not be able to parse the meta tags in broken HTML and may end up ignoring them, so even if parsing succeeds, later handling may still fail with character encoding errors. Therefore, parsing XML from unicode strings is a much saner thing to do than parsing HTML from unicode strings.
Note that Python uses different encodings for unicode on different platforms, so even specifying the real internal unicode encoding is not portable between Python interpreters. Don't do it.
Python unicode strings with XML data that carry encoding information are broken. lxml will not parse them. You must provide parsable data in a valid encoding.
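As a quick illustration, the first call below succeeds while the second one is rejected with an error:

from lxml import etree

etree.fromstring(u"<root>works</root>")       # plain literal fragment: fine

# a unicode string with an embedded encoding declaration is rejected:
etree.fromstring(u'<?xml version="1.0" encoding="UTF-8"?><root/>')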
Technically, yes. However, you likely do not want to do that, because it is extremely inefficient. The text encoding that libxml2 uses internally is UTF-8, so parsing from a Unicode file means that Python first reads a chunk of data from the file, then decodes it into a new buffer, and then copies it into a new unicode string object, just to let libxml2 make yet another copy while encoding it down into UTF-8 in order to parse it. It's clear that this involves a lot more recoding and copying than when parsing straight from the bytes that the file contains.
If you really know the encoding better than the parser (e.g. when parsing HTML that lacks a content declaration), then instead of passing an encoding parameter into the file object when opening it, create a new instance of an XMLParser or HTMLParser and pass the encoding into its constructor. Afterwards, use that parser for parsing, e.g. by passing it into the etree.parse(file, parser) function. Remember to open the file in binary mode (mode="rb"), or, if possible, prefer passing the file path directly into parse() instead of an opened Python file object.
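A short sketch of this, assuming an HTML file that you know is really Latin-1 encoded (the filename is made up):

from lxml import etree

parser = etree.HTMLParser(encoding="latin-1")
tree = etree.parse("page.html", parser)   # lxml opens and reads the file itself, in binary mode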
The str() implementation of the XSLTResultTree class (a subclass of the ElementTree class) knows about the output method chosen in the stylesheet (xsl:output), write() doesn't. If you call write(), the result will be a normal XML tree serialization in the requested encoding. Calling this method may also fail for XSLT results that are not XML trees (e.g. string results).
If you call str(), it will return the serialized result as specified by the XSL transform. This correctly serializes string results to encoded Python strings and honours xsl:output options like indent. This almost certainly does what you want, so you should only use write() if you are sure that the XSLT result is an XML tree and you want to override the encoding and indentation options requested by the stylesheet.
The iterparse() implementation is based on the libxml2 parser. It requires the tree to be intact to finish parsing. If you delete or modify parents of the current node, chances are you modify the structure in a way that breaks the parser. Normally, this will result in a segfault. Please refer to the iterparse section of the lxml API documentation to find out what you can do and what you can't do.
Don't. What you would produce is not well-formed XML. XML parsers will refuse to parse a document that contains null characters. The right way to embed binary data in XML is using a text encoding such as uuencode or base64.
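A minimal sketch using base64 (element and variable names are made up):

import base64
from lxml import etree

data = b"\x00\x01\x02 arbitrary bytes \x00"
blob = etree.Element("blob")
blob.text = base64.b64encode(data).decode("ascii")   # safe, text-only representation

assert base64.b64decode(blob.text) == data           # round-trips losslessly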
This has nothing to do with lxml itself, only with the parser of libxml2. Since libxml2 version 2.7, the parser imposes hard security limits on input documents to prevent DoS attacks with forged input data. Since lxml 2.2.1, you can disable these limits with the huge_tree parser option if you need to parse really large, trusted documents. All lxml versions will leave these restrictions enabled by default.
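For example, a sketch of enabling it for a document you fully trust (the filename is made up):

parser = etree.XMLParser(huge_tree=True)   # only for trusted input!
tree = etree.parse("very_large_trusted_file.xml", parser)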
Note that libxml2 versions of the 2.6 series do not restrict their parser and are therefore vulnerable to DoS attacks.
Note also that these "hard limits" may still be high enough to allow for excessive resource usage in a given use case. They are compile time modifiable, so building your own library versions will allow you to change the limits to your own needs. Also see the next question.
XML based web-service endpoints are generally subject to several types of attacks if they allow some kind of untrusted input. From the point of view of the underlying XML tool, the most obvious attacks try to send a relatively small amount of data that induces a comparatively large resource consumption on the receiver side.
First of all, make sure network access is not enabled for the XML parser that you use for parsing untrusted content and that it is not configured to load external DTDs. Otherwise, attackers can try to trick the parser into an attempt to load external resources that are overly slow or impossible to retrieve, thus wasting time and other valuable resources on your server such as socket connections. Note that you can register your own document loader in lxml, which allows for fine-grained control over any read access to resources.
Some of the most famous excessive content expansion attacks use XML entity references. Luckily, entity expansion is mostly useless for the data commonly sent through web services and can simply be disabled, which rules out several types of denial of service attacks at once. This also prevents an attack that reads local files from the server, as XML entities can be defined to expand into the content of external resources. Consequently, version 1.2 of the SOAP standard explicitly disallows entity references in the XML stream.
To disable entity expansion, use an XML parser that is configured with the option resolve_entities=False. Then, after (or while) parsing the document, use root.iter(etree.Entity) to recursively search for entity references. If it contains any, reject the entire input document with a suitable error response. In lxml 3.x, you can also use the new DTD introspection API to apply your own restrictions on input documents. Since version 5.x, lxml disables the expansion of external entities (XXE) by default. If you really want to allow loading external files into XML documents using this functionality, you have to explicitly set resolve_entities=True.
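A sketch of this approach (the function name is made up):

from lxml import etree

def parse_untrusted_xml(data):
    parser = etree.XMLParser(resolve_entities=False)
    root = etree.fromstring(data, parser)
    # reject the document if it contains any (unexpanded) entity references
    if any(True for _ in root.iter(etree.Entity)):
        raise ValueError("entity references are not allowed")
    return root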
Another attack to consider is compression bombs. If you allow compressed input into your web service, attackers can try to send carefully forged, highly repetitive and thus very well compressing input that unpacks into a very large XML document in your server's main memory, potentially a thousand times larger than the compressed input data.
As a countermeasure, either disable compressed input for your web server, at least for untrusted sources, or use incremental parsing with iterparse() instead of parsing the whole input document into memory in one shot. That allows you to enforce suitable limits on the input by applying semantic checks that detect and prevent an illegitimate use of your service. If possible, you can also use this to reduce the amount of data that you need to keep in memory while parsing the document, thus further reducing an attacker's ability to trick your system into excessive resource usage.
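A sketch of the incremental approach with iterparse() (the tag name and the record limit are made up):

from lxml import etree

def count_records(input_file, max_records=10000):
    count = 0
    for _, element in etree.iterparse(input_file, tag="record"):
        count += 1
        if count > max_records:
            raise ValueError("input exceeds the allowed number of records")
        element.clear()          # free content that is no longer needed
    return count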
Finally, please be aware that XPath suffers from the same vulnerability as SQL when it comes to content injection. The obvious fix is to not build any XPath expressions via string formatting or concatenation when the parameters may come from untrusted sources, and instead use XPath variables, which safely expose their values to the evaluation engine.
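For example, instead of formatting an untrusted value into the expression string, pass it in as an XPath variable (root and user_input are assumed to exist):

# unsafe: user_input becomes part of the expression itself
# results = root.xpath("//user[@name='%s']" % user_input)

# safer: the untrusted value is passed as an XPath variable
results = root.xpath("//user[@name=$name]", name=user_input)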
The defusedxml package comes with an example setup and a wrapper API for lxml that applies certain counter measures internally.
lxml preserves the order in which attributes were originally created. There is one case in which this is difficult: when attributes are passed in a dict or as keyword arguments to the Element() factory. Before Python 3.6, dicts had no predictable order. Since Python 3.6, however, dicts also preserve the creation order of their keys, and lxml makes use of that since release 4.4. In earlier versions, lxml tries to assure at least reproducible output by sorting the attributes from the dict before creating them. All sequential ways to set attributes keep their order and do not apply sorting. Also, OrderedDict instances are recognised and not sorted.
In cases where you cannot control the order in which attributes are created, you can still change it before serialisation. To sort them by name, for example, you can apply the following function:
def sort_attributes(root):
    for el in root.iter():
        attrib = el.attrib
        if len(attrib) > 1:
            attributes = sorted(attrib.items())
            attrib.clear()
            attrib.update(attributes)
findall() is part of the original ElementTree API. It supports a simple subset of the XPath language, without predicates, conditions and other advanced features. It is very handy for finding specific tags in a tree. Another important difference is namespace handling, which uses the {namespace}tagname notation. This is not supported by XPath. The findall, find and findtext methods are compatible with other ElementTree implementations and allow writing portable code that runs on ElementTree, cElementTree and lxml.etree.
xpath(), on the other hand, supports the complete power of the XPath language, including predicates, XPath functions and Python extension functions. The syntax is defined by the XPath specification. If you need the expressiveness and selectivity of XPath, the xpath() method, the XPath class and the XPathEvaluator are the best choice.
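A small comparison (element names are made up; both calls assume an already parsed root):

# findall(): simple, ElementTree-compatible path expressions
titles = root.findall("entry/title")

# xpath(): full XPath, including predicates and functions
titles = root.xpath("//entry[author/name='Jane Doe']/title")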
It was decided that it is more important to keep compatibility with ElementTree to simplify code migration between the libraries. The main difference compared to XPath is the {namespace}tagname notation used in findall(), which is not valid XPath.
ElementTree and lxml.etree use the same implementation, which assures 100% compatibility. Note that findall() is so fast in lxml that a native implementation would not bring any performance benefits.
You can traverse the document (root.iter()) and collect the prefix attributes from all Elements into a set. However, it is unlikely that you really want to do that. You do not need these prefixes, honestly. You only need the namespace URIs. All namespace comparisons use these, so feel free to make up your own prefixes when you use XPath expressions or extension functions.
The only place where you might consider specifying prefixes is the serialization of Elements that were created through the API. Here, you can specify a prefix mapping through the nsmap argument when creating the root Element. Its children will then inherit this prefix for serialization.
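A minimal sketch (the namespace URI and prefix are made up):

NS = "http://example.com/ns"
root = etree.Element("{%s}root" % NS, nsmap={"ex": NS})
etree.SubElement(root, "{%s}child" % NS)

print(etree.tostring(root).decode())
# <ex:root xmlns:ex="http://example.com/ns"><ex:child/></ex:root>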
You can't. In XPath 1.0, there is no such thing as a default namespace. Just use an arbitrary prefix and let the namespace dictionary of the XPath evaluators map it to your namespace. See also the question above.
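For example, with a document that uses a default namespace (the prefix "d" below is arbitrary and exists only in the expression, not in the document):

root = etree.XML('<root xmlns="http://example.com/ns"><item/></root>')
items = root.xpath("//d:item", namespaces={"d": "http://example.com/ns"})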
lxml's iterators need to hold on to an element in the tree in order to remember their current position. Therefore, tree modifications between two calls into the iterator can lead to surprising results if such an element is deleted or moved around, for example.
If your code risks modifying elements that the iterator might still need, and you know that the number of elements returned by the iterator is small, then just read them all into a list (or use .findall()), and iterate over that list.
If the number of elements can be larger and you really want to process the tree incrementally, you can often use a read-ahead generator to make the iterator advance beyond the critical point before touching the tree structure.
For example:
from itertools import islice
from collections import deque

def readahead(iterator, count=1):
    iterator = iter(iterator)  # allow iterables as well
    elements = deque(islice(iterator, 0, count))
    for element in iterator:
        elements.append(element)
        yield elements.popleft()
    yield from elements

for element in readahead(root.iterfind("path/to/children")):
    element.getparent().remove(element)