Brief XML Tutorial

What is XML?

XML (“eXtensible Markup Language”) is a markup language for documents that contain structured textual information.
The W3C published XML as a Recommendation in 1998. It is a simplified subset of SGML, a mature industry markup language, designed specifically for use on the Internet. For that reason every XML document is also an SGML document.

Data Model

It is important to understand that the XML data model is a tree built of elements, attributes and other types of nodes. The textual syntax is just a serialization of that model.

Elements and Tags

The desired document structure is built by combining elements, which the textual XML syntax represents as tags. The XML standard defines how elements are constructed, but it specifies neither a particular set of elements nor their meaning. Those are defined by other standards in the XML family (XHTML, XSLT etc.). XML can therefore be called a meta-language: a language for describing other languages.
Each tag starts with a < symbol and ends with a > symbol. Between them stand the element name and, optionally, additional information such as attributes or namespace declarations. Element names are case-sensitive. There are start tags and end tags; in an end tag, a / symbol precedes the element name. An element may contain text and child (inner) elements. For example:

<element>
  text 1
  <child_element>
    text 2
  </child_element>
</element>

An empty element may be written in a short form that omits the end tag: <empty_element/>. That is equivalent to <empty_element></empty_element>.
As you might already have guessed, element markup cannot overlap, i.e. structures such as <element_1> ... <element_2> ... </element_1> ... </element_2> are not allowed. They often occurred in HTML files.
There may be only one top-level (root) element in an XML document.
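
As a minimal sketch of the tree behind the tags, the nested example above can be loaded with Python's standard xml.etree.ElementTree module and walked element by element. The SAMPLE string and the describe helper are illustrative names chosen here, not part of any standard.

```python
import xml.etree.ElementTree as ET

# Illustrative copy of the nested example from this section.
SAMPLE = """<element>
  text 1
  <child_element>
    text 2
  </child_element>
</element>"""


def describe(node, depth=0):
    """Return one line per element, indented by its depth in the tree."""
    text = (node.text or "").strip()
    lines = ["  " * depth + f"{node.tag}: {text!r}"]
    for child in node:
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)
tree_lines = describe(root)
print("\n".join(tree_lines))
```

The indentation of the printed lines mirrors the parent/child relationships, which is the information the start and end tags encode in the textual form.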

Attributes

An XML element may have attributes. They are written in the form attribute_name="attribute_value" or attribute_name='attribute_value'. Attribute names are case-sensitive. An attribute value can only be textual (a string). Attributes are placed in the start tag of an element, after its name, for example: <student id="12345678"> ... </student>. An element may not have two attributes with the same name.
XML attributes do not have the short (minimized) form that was allowed in HTML. For example, instead of <option selected> ... </option> we need to write <option selected="selected"> ... </option>.
Some attributes (for example xml:lang, which defines the language of an element's content) are reserved by the XML standard itself. That is, they have the same meaning in all XML documents and may not be used by authors for their own purposes.
Attributes usually store additional information about the element, but sometimes they are used instead of child elements (especially when the value is a simple string). Compare:

<student>
  <name>John</name>
  <last_name>Doe</last_name>
  <address>
    <zip_code>90210</zip_code>
    ...
  </address>
  ...
</student>

and

<student name="John" last_name="Doe">
  <address>
    <zip code="90210"/>
    ...
  </address>
  ...
</student>
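
To illustrate how the choice affects code that reads the data, here is a small sketch, assuming Python's standard xml.etree.ElementTree module and shortened versions of the two student documents. The full_name helper is hypothetical and simply accepts either style.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative versions of the two documents compared above.
ELEMENT_STYLE = "<student><name>John</name><last_name>Doe</last_name></student>"
ATTRIBUTE_STYLE = '<student name="John" last_name="Doe"/>'


def full_name(student):
    """Read the name from attributes if present, otherwise from child elements."""
    first = student.get("name") or student.findtext("name")
    last = student.get("last_name") or student.findtext("last_name")
    return f"{first} {last}"


names = [full_name(ET.fromstring(doc)) for doc in (ELEMENT_STYLE, ATTRIBUTE_STYLE)]
print(names)
```

Both documents carry the same information; only the access path differs (get for an attribute, findtext for a child element).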

Element or Attribute?

A common question is whether an element or an attribute should be used to store a string value. There is probably no single answer. Attributes are only supporting constructs, but they are convenient and even required in many cases.
One rule of thumb is to imagine that an element is a box. Everything inside the box should be an element, while the labels on the box should be attributes.
Suppose we have a <car> element. In this case the inner (child) elements could be <chassis>, <engine> etc., while make_date and color could be attributes.
However, if we are creating a <car_information> element, we could use <make_date> and <color> as elements as well. In that case registration_date should be an attribute, since it holds information about the car's registration and not about the car itself (meta-information).

Namespaces

Namespaces are used to define which XML dialect a certain element or attribute belongs to.
XML allows you to use whatever element and attribute names you want. It would be perfectly legal to use <p> in your own custom document. However, if you tried to embed that document in an XHTML document, a name conflict would arise. A browser would assume that this <p> belongs to XHTML and would interpret it as a paragraph, which is probably not what the author intended. Namespaces resolve this kind of situation.
Elements that we want to place in a certain namespace get a special prefix in their names. Such an element looks like this: <prefix:element_name>.
The prefix must be declared in the document, on the element itself or on one of its ancestors, with an attribute of the form xmlns:prefix="URI". Here URI stands for Uniform Resource Identifier, which uniquely identifies the namespace.
The attribute xmlns="URI" may also be used without a prefix. In that case the namespace identified by this URI becomes the default one, i.e. all unprefixed elements at that level and below belong to this namespace.
With namespaces, the example from above would look like this:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:john="http://www.johndoe.com/xml">
  ...
  <body>
    <h1>Header ... </h1>
    <p>Paragraph ... </p>
    ...
    <john:p john:attribute="value">My element</john:p>
    ...
    <p>Paragraph ... </p>
  </body>
</html>

The same namespace may have several different prefixes.
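
The following sketch, assuming Python's standard xml.etree.ElementTree module, shows how a parser tells the two kinds of p elements apart in a document shaped like the XHTML example. The prefixes x and j used in the queries are arbitrary choices made here; they differ from the prefixes in the document on purpose, because only the URI identifies a namespace.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
JOHN_NS = "http://www.johndoe.com/xml"

# Shortened, illustrative version of the XHTML example.
DOC = f"""<html xmlns="{XHTML_NS}" xmlns:john="{JOHN_NS}">
  <body>
    <p>Paragraph</p>
    <john:p john:attribute="value">My element</john:p>
  </body>
</html>"""

root = ET.fromstring(DOC)

# Query-side prefixes are local to this mapping and need not match the document.
ns = {"x": XHTML_NS, "j": JOHN_NS}
xhtml_paragraphs = root.findall(".//x:p", ns)
john_paragraphs = root.findall(".//j:p", ns)

# ElementTree stores qualified names as "{URI}local_name".
print(xhtml_paragraphs[0].tag)
print(john_paragraphs[0].tag)
print(john_paragraphs[0].get(f"{{{JOHN_NS}}}attribute"))
```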

Entities

Entities are used to insert a special symbol (in more complex cases also text or a file) into an XML document. They are defined in an entity declaration and used in a document as &name; or &#code;. Here code stands for the numeric code of a character and name for an alias defined in a declaration.
For example, the entity &quot; inserts a quotation mark into a document and &amp; inserts an ampersand & (the character that starts an entity reference itself). XML predefines only five entities: &lt;, &gt;, &amp;, &quot; and &apos;. Others, such as &copy; for ©, are available only when the document type declares them, as the XHTML DTD does.
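
As a hedged illustration, this sketch uses Python's standard xml.etree.ElementTree module to show how a parser resolves predefined entities and numeric character references, and what happens with a named entity that no declaration defines. The element name t and the sample strings are made up for the example.

```python
import xml.etree.ElementTree as ET

# The predefined entities are always available.
predefined = ET.fromstring("<t>&quot;Tom &amp; Jerry&quot; &lt;1998&gt;</t>").text

# Numeric references work for any character, in decimal or hexadecimal form.
numeric = ET.fromstring("<t>&#169; &#xA9;</t>").text

# A named entity without a declaration is an error.
try:
    ET.fromstring("<t>&copy;</t>")
    copy_error = None
except ET.ParseError as exc:
    copy_error = str(exc)

# Declaring the entity in the document type makes the same reference legal.
declared = ET.fromstring(
    '<!DOCTYPE t [<!ENTITY copy "&#169;">]><t>&copy;</t>'
).text

print(predefined, numeric, copy_error, declared, sep="\n")
```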

Special Markup

XML defines some special markup beyond elements.

header
<?xml version="version" encoding="encoding"?>. Also called the XML declaration. Here version stands for the version of the XML standard used (1.0 is by far the most common; the latest is 1.1) and encoding names the character encoding of the file. UTF-8 encoding is recommended for use with XML.
processing instructions
For example: <?xml-stylesheet type="text/xsl" href="style.xsl"?> (defines a stylesheet for this XML document)
comments
<!-- comment -->
CDATA
<![CDATA[ content ]]>. Everything that should not be interpreted as markup — script code, CSS code etc. — should be wrapped in a CDATA block. For example:
<style type="text/css">
  <![CDATA[
    html { font-family: "Tahoma", "Verdana", sans-serif; }
    body > p { line-height: 1cm; } /* contains XML special character */
  ]]>
</style>
document type definition
Defines the valid structure of a document, i.e. which elements and attributes it may contain, how they may be arranged etc. A DTD may be stored in a separate file. XML Schema is a newer (and much more complex) standard for defining XML structures.
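entity declaration
<!ENTITY name "replacement text">. Defines an entity that can then be used in the document as &name;.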
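
To make the CDATA item from the list above concrete, here is a small sketch, assuming Python's standard xml.etree.ElementTree module. It parses the same script text once with escaped special characters and once wrapped in a CDATA section; the script content is invented for the example.

```python
import xml.etree.ElementTree as ET

# Special characters escaped with entities.
escaped = ET.fromstring(
    "<script>if (a &lt; b &amp;&amp; c) run();</script>"
).text

# The same text wrapped in a CDATA section, with no escaping needed.
wrapped = ET.fromstring(
    "<script><![CDATA[if (a < b && c) run();]]></script>"
).text

print(escaped)
print(wrapped)
```

Both forms yield the same text node, so a CDATA section is a convenience for the author rather than a different kind of data.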

XML Document

An XML document may start with a header. It may be followed by processing instructions and a document type declaration, which can contain entity declarations.
Then comes the main document content with elements.
An XML document can be well-formed, and a well-formed document can additionally be valid.
A well-formed document obeys the XML syntax rules, i.e. it has a single root element, its elements are properly nested etc.
A valid document also conforms to its DTD or schema (of course, it has to be well-formed in the first place).
Comments may appear anywhere in the document outside other markup, and CDATA sections may appear wherever text may appear inside elements.
The special XML symbols < and & that occur in the text (not in the markup) have to be replaced with entities unless they are inside a CDATA section.
When an XML parser comes across an error, it has to report the error and stop normal processing. That is different from HTML, where parsers were allowed to silently ignore errors.
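
The strict error handling can be observed with a short sketch, assuming Python's standard xml.etree.ElementTree module. Note that this module checks well-formedness only; it does not validate against a DTD or schema. The check helper and the sample documents are illustrative.

```python
import xml.etree.ElementTree as ET


def check(document):
    """Return None if the document is well-formed, else the parser's error message."""
    try:
        ET.fromstring(document)
    except ET.ParseError as exc:
        return str(exc)
    return None


results = {
    "nested": check("<a><b>text</b></a>"),
    "overlapping": check("<a><b>text</a></b>"),
    "two roots": check("<a/><b/>"),
    "bare attribute": check("<option selected>x</option>"),
}

for name, error in results.items():
    print(f"{name}: {'well-formed' if error is None else error}")
```

Only the first document parses; each of the others violates one of the rules described earlier (proper nesting, a single root element, attributes written with a value).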
XML document structure is sometimes classified as either data-oriented or document-oriented.

XML Usage

As mentioned before, XML makes it easy to represent structured textual information. This document may also be interpreted as XML.
Benefits of XML: it is a simple, extensible, open and free standard that is independent of operating system and programming language.
XML (with appropriate document types) can replace many proprietary binary and textual formats and protocols. XML data is easily manipulated by programming languages, stored in a database or even used as a database.
One of the most practical XML applications is transformation. By applying different stylesheets to the same data, one can get results in different forms. For example, WWW server logs stored in an XML file may be transformed into XHTML with one XSLT stylesheet, into an SVG graphic with another, and into WML (markup for mobile devices) with a third. Not surprisingly, XSLT stylesheets are also XML documents.
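
The idea of transforming the same data into another form can be sketched in ordinary code as well. The example below is not XSLT (Python's standard library does not include an XSLT processor); it is a stand-in that uses xml.etree.ElementTree to turn a made-up server log format into an XHTML-style table. The log, request, path and status names are assumptions invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical log format, invented for this illustration.
LOG = """<log>
  <request path="/index.html" status="200"/>
  <request path="/missing" status="404"/>
</log>"""


def log_to_xhtml(log_xml):
    """Build a table element with one row per request in the log."""
    log = ET.fromstring(log_xml)
    table = ET.Element("table")
    for request in log.findall("request"):
        row = ET.SubElement(table, "tr")
        for field in ("path", "status"):
            ET.SubElement(row, "td").text = request.get(field)
    return ET.tostring(table, encoding="unicode")


xhtml = log_to_xhtml(LOG)
print(xhtml)
```

An XSLT stylesheet would express the same mapping declaratively, so that a different stylesheet could produce SVG or WML from the unchanged log file.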