DTDs
In this lesson, you will learn:
- The difference between well-formed and valid XML documents.
- The purpose of DTDs.
- How to create internal and external DTDs.
- How to validate an XML document according to a DTD.
- The limitations of DTDs.
Well-formed vs. Valid
A well-formed XML document is one that follows the syntax rules described in "XML Syntax Rules". A valid XML document is one that is well-formed and also conforms to a specified structure. To be validated, an XML document must be checked against a schema, which is a document that defines the structure for a class of XML documents. XML documents that are not intended to conform to a schema can be well-formed, but they cannot be valid.
The Purpose of DTDs
A Document Type Definition (DTD) is a type of schema. The purpose of DTDs is to provide a framework for validating XML documents. By defining a structure that XML documents must conform to, DTDs allow different organizations to create shareable data files.
Imagine, for example, a company that creates technical courseware and sells it to technical training companies. Those companies may want to display the outlines for that courseware on their websites, but they do not want to display them in the same way as every other company that buys the courseware. By providing the course outlines in a predefined XML format, the courseware vendor makes it possible for the training companies to write programs that read those XML files and transform them into HTML pages with their own formatting styles (perhaps using XSLT or CSS). If the XML files had no predefined structure, it would be very difficult to write such programs.
Creating DTDs
DTDs are simple text files that can be created with any basic text editor. Although they look a little cryptic at first, they are not terribly complicated once you get used to them.
A DTD outlines what elements can be in an XML document and the attributes and subelements that they can take. Let's start by taking a look at a complete DTD and then dissecting it.
Code Sample: DTDs/Demos/Beatles.dtd
<!ELEMENT beatles (beatle+)>
<!ELEMENT beatle (name)>
<!ATTLIST beatle
  link CDATA #IMPLIED
  real (yes|no) "yes">
<!ELEMENT name (firstname, lastname)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
The Document Element
When creating a DTD, the first step is to define the document element.
<!ELEMENT beatles (beatle+)>
The element declaration above states that the beatles element must contain one or more beatle elements.
Child Elements
When defining child elements in DTDs, you can specify how many times those elements can appear by adding a modifier after the element name. If no modifier is added, the element must appear once and only once. The other options are shown in the table below.
| Modifier | Description |
|---|---|
| ? | Zero or one times. |
| + | One or more times. |
| * | Zero or more times. |
DTDs have no modifier for specifying a range of times that an element may appear (e.g., 2-4 appearances).
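Although there is no range modifier, a small range can be approximated by spelling out each occurrence. The sketch below uses hypothetical playlist and song elements, which are not part of the Beatles DTD, to require between two and four song children.

```xml
<!-- Two required song elements followed by up to two optional ones (2 to 4 in total) -->
<!ELEMENT playlist (song, song, song?, song?)>
<!ELEMENT song (#PCDATA)>
```

This approach quickly becomes unwieldy for larger ranges, so it is only practical for small counts.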
Other Elements
The other elements are declared in the same way as the document element - with the <!ELEMENT> declaration. The Beatles DTD declares four additional elements.
Each beatle element must contain a child element name, which must appear once and only once.
<!ELEMENT beatle (name)>
Each name element must contain a firstname and lastname element, which each must appear once and only once and in that order.
<!ELEMENT name (firstname, lastname)>
Some elements contain only text. This is declared in a DTD as #PCDATA. PCDATA stands for parsed character data, meaning that the data will be parsed for XML tags and entities. The firstname and lastname elements contain only text.
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
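To connect these declarations back to the difference between well-formed and valid documents, here is a small illustrative fragment (not one of the course files). It is well-formed, but it would not be valid against the declarations above because the children of name appear in the wrong order.

```xml
<beatles>
  <beatle>
    <name>
      <!-- lastname comes before firstname, which (firstname, lastname) does not allow -->
      <lastname>Starr</lastname>
      <firstname>Ringo</firstname>
    </name>
  </beatle>
</beatles>
```

Swapping the two child elements back into the declared order would make the fragment valid.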
Choice of Elements
It is also possible to indicate that exactly one of several elements must appear as a child element. For example, the declaration below indicates that an img element must have either a child element name or a child element id, but not both.
<!ELEMENT img (name|id)>
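The fragments below sketch how this choice plays out in practice. They assume that name and id are declared elsewhere as text-only elements, and the values are made up for illustration.

```xml
<!-- Valid: exactly one name child -->
<img><name>logo</name></img>

<!-- Valid: exactly one id child -->
<img><id>img42</id></img>

<!-- Not valid: both children are present -->
<img><name>logo</name><id>img42</id></img>

<!-- Not valid: neither child is present -->
<img></img>
```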
Empty Elements
Empty elements are declared as follows.
<!ELEMENT img EMPTY>
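As a brief illustration, the fragments below show how an element declared EMPTY may and may not be written. An empty element can still carry attributes if they are declared for it.

```xml
<!-- Valid: either form is accepted for an element declared EMPTY -->
<img/>
<img></img>

<!-- Not valid: an EMPTY element may not contain text, white space or child elements -->
<img>logo</img>
```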
Mixed Content
Sometimes elements can have elements and text intermingled. For example, the following declaration is for a body element that may contain text in addition to any number of link and img elements.
<!ELEMENT body (#PCDATA | link | img)*>
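In a mixed-content declaration, #PCDATA must come first in the group and the group must end with the * modifier. A body element matching the declaration above might look like the sketch below, which assumes that link is declared as a text-only element and img as an empty element; the text itself is invented.

```xml
<body>
  See the <link>band history</link> page or this photo: <img/>
  Text, link elements and img elements may appear in any order and any number of times.
</body>
```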
Location of Modifier
The location of modifiers in a declaration is important. If the modifier is outside a set of parentheses, it applies to the whole group; if it immediately follows an element name, it applies only to that element. The following examples illustrate this.
In the example below, the body element can have any number of interspersed child link and img elements.
<!ELEMENT body (link | img)*>
In the example below, the body element can have any number of child link elements or any number of child img elements, but it cannot have both link and img elements.
<!ELEMENT body (link* | img*)>
In the example below, the body element can have any number of child link and img elements, but they must come in pairs, with the link element preceding the img element.
<!ELEMENT body (link, img)*>
In the example below, the body element can have any number of child link elements followed by any number of child img elements.
<!ELEMENT body (link*, img*)>
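To compare the four declarations side by side, the fragments below note which of them would accept each body element. For brevity, the sketch assumes that link and img are both declared as empty elements.

```xml
<!-- Accepted only by (link | img)*: the two kinds are freely interleaved -->
<body><img/><link/><link/><img/></body>

<!-- Accepted by (link, img)* and (link | img)*: strict link-then-img pairs -->
<body><link/><img/><link/><img/></body>

<!-- Accepted by (link*, img*) and (link | img)*: all links first, then all imgs -->
<body><link/><link/><img/></body>

<!-- Accepted by all four declarations: no children at all -->
<body></body>
```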
Using Parentheses for Complex Declarations
Element declarations can be more complex than the examples above. For example, you can specify that a person element contains either a single name element or a firstname element followed by a lastname element. To group elements, wrap them in parentheses as shown below.
<!ELEMENT person (name | (firstname,lastname))>
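The fragments below sketch which person elements this declaration would accept. They assume that name, firstname and lastname are all text-only elements here, which differs from the name element in the Beatles DTD.

```xml
<!-- Valid: a single name element -->
<person><name>Ringo Starr</name></person>

<!-- Valid: firstname followed by lastname -->
<person><firstname>Ringo</firstname><lastname>Starr</lastname></person>

<!-- Not valid: name combined with firstname and lastname -->
<person><name>Ringo Starr</name><firstname>Ringo</firstname><lastname>Starr</lastname></person>
```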
Declaring Attributes
Attributes are declared using the <!ATTLIST > declaration. The syntax is shown below.
<!ATTLIST ElementName
  AttributeName AttributeType State? DefaultValue?
  AttributeName AttributeType State? DefaultValue?>
- ElementName is the name of the element taking the attributes.
- AttributeName is the name of the attribute.
- AttributeType is the type of data that the attribute value may hold. Although there are many types, the most common are CDATA (character data, that is, ordinary text) and ID (a unique identifier). A parenthesized list of allowed values can also be given as the attribute type.
- State can be one of three values: #REQUIRED (the attribute must be present), #IMPLIED (the attribute is optional and has no default), and #FIXED (the attribute always has the given DefaultValue). State may be omitted when only a default value is supplied.
- DefaultValue is the value of the attribute if it is not included in the element. It must be given with #FIXED and cannot be given with #REQUIRED or #IMPLIED.
The beatle element has two possible attributes: link, which is optional and may contain any valid XML text, and real, which defaults to yes if it is not included.
<!ATTLIST beatle
  link CDATA #IMPLIED
  real (yes|no) "yes">
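The fragments below sketch how this attribute list applies to individual beatle elements, reusing names from the Beatles example. The value "maybe" is made up to show a rejected case.

```xml
<!-- Valid: neither attribute is given; link is absent and real takes its default "yes" -->
<beatle>
  <name><firstname>Paul</firstname><lastname>McCartney</lastname></name>
</beatle>

<!-- Valid: both attributes are given explicitly -->
<beatle link="http://www.webucator.com" real="no">
  <name><firstname>Nat</firstname><lastname>Dunn</lastname></name>
</beatle>

<!-- Not valid: "maybe" is not one of the listed values yes|no -->
<beatle real="maybe">
  <name><firstname>Nat</firstname><lastname>Dunn</lastname></name>
</beatle>
```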
Validating an XML Document with a DTD
The DOCTYPE declaration in an XML document specifies the DTD to which it should conform. In the code sample below, the DOCTYPE declaration indicates the file should be validated against Beatles.dtd in the same directory.
Code Sample: DTDs/Demos/Beatles.xml
<?xml version="1.0"?>
<!DOCTYPE beatles SYSTEM "Beatles.dtd">
<beatles>
  <beatle link="http://www.paulmccartney.com">
    <name>
      <firstname>Paul</firstname>
      <lastname>McCartney</lastname>
    </name>
  </beatle>
  <beatle link="http://www.johnlennon.com">
    <name>
      <firstname>John</firstname>
      <lastname>Lennon</lastname>
    </name>
  </beatle>
  <beatle link="http://www.georgeharrison.com">
    <name>
      <firstname>George</firstname>
      <lastname>Harrison</lastname>
    </name>
  </beatle>
  <beatle link="http://www.ringostarr.com">
    <name>
      <firstname>Ringo</firstname>
      <lastname>Starr</lastname>
    </name>
  </beatle>
  <beatle link="http://www.webucator.com" real="no">
    <name>
      <firstname>Nat</firstname>
      <lastname>Dunn</lastname>
    </name>
  </beatle>
</beatles>
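The sample above uses an external DTD. As a sketch of the internal alternative, the same declarations can be placed inside the DOCTYPE declaration itself, between square brackets, so that no separate .dtd file is needed. The shortened document below is an illustration and not one of the course's demo files.

```xml
<?xml version="1.0"?>
<!DOCTYPE beatles [
  <!ELEMENT beatles (beatle+)>
  <!ELEMENT beatle (name)>
  <!ATTLIST beatle
    link CDATA #IMPLIED
    real (yes|no) "yes">
  <!ELEMENT name (firstname, lastname)>
  <!ELEMENT firstname (#PCDATA)>
  <!ELEMENT lastname (#PCDATA)>
]>
<beatles>
  <beatle link="http://www.paulmccartney.com">
    <name>
      <firstname>Paul</firstname>
      <lastname>McCartney</lastname>
    </name>
  </beatle>
</beatles>
```

An internal DTD keeps the document self-contained, while an external DTD is the better fit when many documents share the same structure.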
DTDs Conclusion
In this lesson of the XML tutorial, you learned to create DTDs to validate XML documents.
