Validation with lxml

Apart from the built-in DTD support in parsers, lxml currently supports three schema languages: DTD, Relax NG and XML Schema. All three provide identical APIs in lxml, represented by the validator classes DTD, RelaxNG and XMLSchema.

The usual setup procedure:

>>> from lxml import etree
>>> from io import StringIO

DTD

There are two places in lxml where DTDs are supported: parsers and the DTD class. If you pass a keyword option to a parser that requires DTD loading, lxml will automatically include the DTD in the parsing process. If you pass the keyword for DTD validation, lxml (or rather libxml2) will use this DTD right inside the parser and report failure or success when parsing terminates.

The parser support for DTDs depends on internal or external subsets of the XML file. This means that the XML file itself must either contain a DTD or must reference a DTD to make this work. If you want to validate an XML document against a DTD that is not referenced by the document itself, you can use the DTD class.
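Parser-level validation as described above might look like the following minimal sketch. It assumes the document carries its DTD in an internal subset and that the parser is created with the dtd_validation keyword; the variable names are only illustrative.

```python
from lxml import etree

# Both documents carry their DTD in an internal subset.
valid_xml = b"<!DOCTYPE b [<!ELEMENT b EMPTY>]><b/>"
invalid_xml = b"<!DOCTYPE b [<!ELEMENT b EMPTY>]><b><a/></b>"

# dtd_validation=True asks the parser to validate while parsing.
parser = etree.XMLParser(dtd_validation=True)

# A valid document parses as usual.
root = etree.fromstring(valid_xml, parser)
parsed_tag = root.tag

# An invalid document makes the parser raise an exception.
try:
    etree.fromstring(invalid_xml, parser)
    rejected = False
except etree.XMLSyntaxError as exc:
    rejected = True
    print("parser rejected the document:", exc)
```

Since the failure surfaces as a parse error, this approach fits best when an invalid document should not be processed at all.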

To use the DTD class, you must first pass a filename or file-like object into the constructor to parse a DTD:

>>> f = StringIO("<!ELEMENT b EMPTY>")
>>> dtd = etree.DTD(f)

Now you can use it to validate documents:

>>> root = etree.XML("<b/>")
>>> print(dtd.validate(root))
True

>>> root = etree.XML("<b><a/></b>")
>>> print(dtd.validate(root))
False

The reason for the validation failure can be found in the error log:

>>> print(dtd.error_log.filter_from_errors()[0])
<string>:1:ERROR:VALID:DTD_NOT_EMPTY: Element b was declared EMPTY this one has content

RelaxNG

The RelaxNG class takes an ElementTree object to construct a Relax NG validator:

>>> f = StringIO('''\
... <element name="a" xmlns="http://relaxng.org/ns/structure/1.0">
...  <zeroOrMore>
...     <element name="b">
...       <text />
...     </element>
...  </zeroOrMore>
... </element>
... ''')
>>> relaxng_doc = etree.parse(f)
>>> relaxng = etree.RelaxNG(relaxng_doc)

Alternatively, pass a filename to the file keyword argument to parse from a file. This also enables correct handling of include files from within the RelaxNG parser.
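As a rough sketch of the file keyword argument, the example below writes a schema to a temporary file and builds the validator from its path. The temporary directory, the file name schema.rng and the variable names are assumptions made for illustration only.

```python
import os
import tempfile

from lxml import etree

RNG_SCHEMA = """\
<element name="a" xmlns="http://relaxng.org/ns/structure/1.0">
  <zeroOrMore>
    <element name="b">
      <text/>
    </element>
  </zeroOrMore>
</element>
"""

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical schema location, used only for this example.
    schema_path = os.path.join(tmpdir, "schema.rng")
    with open(schema_path, "w", encoding="utf-8") as fh:
        fh.write(RNG_SCHEMA)

    # Let the RelaxNG class read the schema from the file itself.
    relaxng_from_file = etree.RelaxNG(file=schema_path)

    valid_result = relaxng_from_file.validate(etree.XML("<a><b/></a>"))
    invalid_result = relaxng_from_file.validate(etree.XML("<a><c/></a>"))

print(valid_result, invalid_result)
```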

You can then validate some ElementTree document against the schema. You'll get back True if the document is valid against the Relax NG schema, and False if not:

>>> valid = StringIO('<a><b></b></a>')
>>> doc = etree.parse(valid)
>>> relaxng.validate(doc)
True

>>> invalid = StringIO('<a><c></c></a>')
>>> doc2 = etree.parse(invalid)
>>> relaxng.validate(doc2)
False

Calling the schema object has the same effect as calling its validate method. This is sometimes used in conditional statements:

>>> invalid = StringIO('<a><c></c></a>')
>>> doc2 = etree.parse(invalid)
>>> if not relaxng(doc2):
...     print("invalid!")
invalid!

If you prefer getting an exception when validating, you can use the assert_ or assertValid methods:

>>> relaxng.assertValid(doc2)
Traceback (most recent call last):
  [...]
DocumentInvalid: Document does not comply with schema

>>> relaxng.assert_(doc2)
Traceback (most recent call last):
  [...]
AssertionError: Document does not comply with schema
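In application code, the exception raised by assertValid would typically be caught rather than left to propagate. The following is a minimal sketch of that pattern; the helper name describe is hypothetical and not part of lxml.

```python
from io import StringIO

from lxml import etree

RNG_SCHEMA = """\
<element name="a" xmlns="http://relaxng.org/ns/structure/1.0">
  <zeroOrMore>
    <element name="b">
      <text/>
    </element>
  </zeroOrMore>
</element>
"""

relaxng = etree.RelaxNG(etree.parse(StringIO(RNG_SCHEMA)))


def describe(doc):
    """Return a short status string for a document (hypothetical helper)."""
    try:
        relaxng.assertValid(doc)
    except etree.DocumentInvalid as exc:
        # The exception message carries the reason reported by the validator.
        return "invalid: %s" % exc
    return "valid"


good = etree.parse(StringIO("<a><b></b></a>"))
bad = etree.parse(StringIO("<a><c></c></a>"))

print(describe(good))
print(describe(bad))
```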

Since version 0.9, lxml has a simple API to report the errors generated by libxml2. If you want to find out why the validation failed in the second case, you can look up the error log of the validation process and check it for relevant messages:

>>> log = relaxng.error_log
>>> print(log.last_error)
<string>:1:ERROR:RELAXNGV:ERR_LT_IN_ATTRIBUTE: Did not expect element c there

You can see that the error (ERROR) happened during RelaxNG validation (RELAXNGV). The message then tells you what went wrong. Note that this error log is local to the RelaxNG object. It will only contain log entries that appeared during the validation. The DocumentInvalid exception raised by the assertValid method above provides access to the global error log (like all other lxml exceptions).
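The parts of a log entry can also be read individually instead of printing the whole entry. The sketch below walks over the validator's local error log and collects a few fields per entry; the exact messages and type names depend on the libxml2 version in use, so none are shown here.

```python
from io import StringIO

from lxml import etree

RNG_SCHEMA = """\
<element name="a" xmlns="http://relaxng.org/ns/structure/1.0">
  <zeroOrMore>
    <element name="b">
      <text/>
    </element>
  </zeroOrMore>
</element>
"""

relaxng = etree.RelaxNG(etree.parse(StringIO(RNG_SCHEMA)))
doc2 = etree.parse(StringIO("<a><c></c></a>"))

# Validation fails and fills the error log of this validator object.
relaxng.validate(doc2)

# Collect line, level, domain, type and message of every log entry.
entries = [
    (entry.line, entry.level_name, entry.domain_name,
     entry.type_name, entry.message)
    for entry in relaxng.error_log
]

for line, level, domain, type_name, message in entries:
    print("line %d [%s/%s] %s: %s" % (line, level, domain, type_name, message))
```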

Similar to XSLT, there's also a less efficient but easier shortcut method to do one-shot RelaxNG validation:

>>> doc.relaxng(relaxng_doc)
True
>>> doc2.relaxng(relaxng_doc)
False

XMLSchema

lxml.etree also has XML Schema (XSD) support, using the class lxml.etree.XMLSchema. The API is very similar to the Relax NG and DTD classes. Pass an ElementTree object to construct an XMLSchema validator:

>>> f = StringIO('''\
... <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
... <xsd:element name="a" type="AType"/>
... <xsd:complexType name="AType">
...   <xsd:sequence>
...     <xsd:element name="b" type="xsd:string" />
...   </xsd:sequence>
... </xsd:complexType>
... </xsd:schema>
... ''')
>>> xmlschema_doc = etree.parse(f)
>>> xmlschema = etree.XMLSchema(xmlschema_doc)

You can then validate some ElementTree document with this. As with RelaxNG, you'll get back True if the document is valid against the XML schema, and False if not:

>>> valid = StringIO('<a><b></b></a>')
>>> doc = etree.parse(valid)
>>> xmlschema.validate(doc)
True

>>> invalid = StringIO('<a><c></c></a>')
>>> doc2 = etree.parse(invalid)
>>> xmlschema.validate(doc2)
False

Calling the schema object has the same effect as calling its validate method. This is sometimes used in conditional statements:

>>> invalid = StringIO('<a><c></c></a>')
>>> doc2 = etree.parse(invalid)
>>> if not xmlschema(doc2):
...     print("invalid!")
invalid!

If you prefer getting an exception when validating, you can use the assert_ or assertValid methods:

>>> xmlschema.assertValid(doc2)
Traceback (most recent call last):
  [...]
DocumentInvalid: Document does not comply with schema

>>> xmlschema.assert_(doc2)
Traceback (most recent call last):
  [...]
AssertionError: Document does not comply with schema

Error reporting works as for the RelaxNG class:

>>> log = xmlschema.error_log
>>> error = log.last_error
>>> print(error.domain_name)
SCHEMASV
>>> print(error.type_name)
SCHEMAV_ELEMENT_CONTENT

If you were to print this log entry, you would get something like the following. Note that the error message depends on the libxml2 version in use:

<string>:1:ERROR:SCHEMASV:SCHEMAV_ELEMENT_CONTENT: Element 'c': This element is not expected. Expected is ( b ).

Similar to XSLT and RelaxNG, there's also a less efficient but easier shortcut method to do XML Schema validation:

>>> doc.xmlschema(xmlschema_doc)
True
>>> doc2.xmlschema(xmlschema_doc)
False
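
Because the three validator classes share the same API, one helper can work with any of them. The following is a sketch under that assumption; the function name validation_errors and the small DTD are made up for this example and are not part of lxml.

```python
from io import StringIO

from lxml import etree

DTD_TEXT = "<!ELEMENT a (b*)>\n<!ELEMENT b (#PCDATA)>"

RNG_TEXT = """\
<element name="a" xmlns="http://relaxng.org/ns/structure/1.0">
  <zeroOrMore>
    <element name="b"><text/></element>
  </zeroOrMore>
</element>
"""

XSD_TEXT = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="a" type="AType"/>
<xsd:complexType name="AType">
  <xsd:sequence>
    <xsd:element name="b" type="xsd:string" />
  </xsd:sequence>
</xsd:complexType>
</xsd:schema>
"""

validators = {
    "DTD": etree.DTD(StringIO(DTD_TEXT)),
    "RelaxNG": etree.RelaxNG(etree.parse(StringIO(RNG_TEXT))),
    "XMLSchema": etree.XMLSchema(etree.parse(StringIO(XSD_TEXT))),
}


def validation_errors(validator, doc):
    """Return the error messages for doc, or an empty list if it is valid."""
    if validator.validate(doc):
        return []
    return [entry.message for entry in validator.error_log]


good = etree.parse(StringIO("<a><b></b></a>"))
bad = etree.parse(StringIO("<a><c></c></a>"))

for name, validator in validators.items():
    print(name, "good:", validation_errors(validator, good))
    print(name, "bad:", validation_errors(validator, bad))
```

The helper only relies on validate and error_log, which the sections above show for each class.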