BEGINNING XML PART 4 – EMPTY ELEMENTS AND XML DECLARATION (Page 1)

Sometimes an element has no data. Recall our earlier example, where the middle element contained no name:

<name nickname='Shiny John'>
 <first>John</first>
 <!--John lost his middle name in a fire-->
 <middle></middle>
 <last>Doe</last>
</name>

In this case, you also have the option of writing this element using the special empty element syntax:

<middle/>

This is the one case where a start-tag doesn’t need a separate end-tag, because both are combined into a single tag. In all other cases, every start-tag needs a matching end-tag.

Recall from our discussion of element names that the only place we can have a space within the tag is before the closing “>”. This rule is slightly different when it comes to empty elements. The “/” and “>” characters always have to be together, so you can create an empty element like this:

<middle />

but not like these:

<middle/ >
<middle / >

Empty elements really don’t buy you anything, except that they take less typing, so you can use them, or not, at your discretion. Keep in mind, however, that as far as XML is concerned <middle></middle> is exactly the same as <middle/>; for this reason, XML parsers will sometimes change your XML from one form to the other. You should never count on your empty elements being in one form or the other, but since the two forms mean exactly the same thing, it doesn’t matter. (This is the reason that IE5 felt free to change our earlier <parody></parody> syntax to just <parody/>.)
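
If you want to check this equivalence yourself, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The choice of parser is an assumption of this example (the tutorial itself uses IE5), and the helper name is_well_formed is made up for illustration. It parses the different spellings of the empty element, then tries the two malformed variants:

```python
import xml.etree.ElementTree as ET

# The long form and both legal short forms parse to the same thing.
long_form = ET.fromstring("<middle></middle>")
short_form = ET.fromstring("<middle/>")
spaced_form = ET.fromstring("<middle />")

# The parser keeps no record of which spelling was used, so all three
# elements serialize to one and the same string.
serialized = {ET.tostring(e) for e in (long_form, short_form, spaced_form)}


def is_well_formed(text):
    """Hypothetical helper: True if the parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# A space between "/" and ">" is not allowed.
bad_forms = [is_well_formed("<middle/ >"), is_well_formed("<middle / >")]

print(len(serialized), bad_forms)
```

Whichever form this particular parser writes back out is its own choice, which is exactly why you shouldn't depend on one form or the other.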

Interestingly, nobody in the XML community seems to mind the empty element syntax, even though it doesn’t add anything to the language. This is especially interesting considering the passionate debates that have taken place on whether attributes are really necessary.

One place where empty elements are very often used is for elements that have no (or optional) PCDATA, but instead have all of their information stored in attributes. So if we rewrote our <name> example without child elements, instead of a start-tag and end-tag we would probably use an empty element, like this:

<name first="John" middle="Fitzgerald Johansen" last="Doe"/>
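
As a rough illustration of how a program would read such an element, the following sketch (again assuming Python's standard xml.etree.ElementTree module, which is not part of the original example) pulls the three attributes out of the empty element and confirms that it has no text and no children:

```python
import xml.etree.ElementTree as ET

name = ET.fromstring(
    '<name first="John" middle="Fitzgerald Johansen" last="Doe"/>'
)

# All of the information lives in attributes.
full_name = " ".join(name.get(part) for part in ("first", "middle", "last"))

print(full_name)
# The element itself has no character data and no child elements.
print(name.text, len(name))
```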

Another common example is the case where just the element name is enough; for example, the HTML <BR> tag might be converted to an XML empty element, such as the XHTML <br/> tag. (XHTML is the latest “XML-compliant” version of HTML.)

XML Declaration

It is often very handy to be able to identify a document as being a certain type. XML provides the XML declaration for us to label documents as being XML, along with giving the parsers a few other pieces of information. You don’t need to have an XML declaration, but you should include it anyway.

A typical XML declaration looks like this:

<?xml version='1.0' encoding='UTF-16' standalone='yes'?>
<name nickname='Shiny John'>
 <first>John</first>
 <!--John lost his middle name in a fire-->
 <middle/>
 <last>Doe</last>
</name>

Some things to note about the XML declaration:

  • The XML declaration starts with the characters <?xml, and ends with the characters ?>.
  • If you include it, you must include the version, but the encoding and standalone attributes are optional.
  • The version, encoding, and standalone attributes must be in that order.
  • Currently, the version should be 1.0. If you use a number other than 1.0, XML parsers that were written for the version 1.0 specification may reject the document. (XML 1.1 was later published as a separate W3C Recommendation, but it is rarely used in practice. The version number in the XML declaration signals which version of the specification your document claims to support.)
  • The XML declaration must be right at the beginning of the file. That is, the first character in the file should be that <; no line breaks or spaces. Some parsers are more forgiving about this than others.

So an XML declaration can be as full as the one above, or as simple as:

<?xml version='1.0'?> 
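
The rules listed above can be tried out with a short sketch. It assumes Python's standard xml.etree.ElementTree module, which is built on the expat parser; the helper name parses is made up for this example, and a more forgiving parser might accept some of the documents that this one rejects:

```python
import xml.etree.ElementTree as ET


def parses(document):
    """Hypothetical helper: True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


results = {
    # Legal: full declaration, minimal declaration, no declaration at all.
    "full declaration": parses(
        "<?xml version='1.0' encoding='UTF-8' standalone='yes'?><name/>"
    ),
    "version only": parses("<?xml version='1.0'?><name/>"),
    "no declaration": parses("<name/>"),
    # Not legal: version left out, attributes out of order,
    # or something (here a line break) in front of the declaration.
    "missing version": parses("<?xml encoding='UTF-8'?><name/>"),
    "wrong order": parses("<?xml encoding='UTF-8' version='1.0'?><name/>"),
    "leading line break": parses("\n<?xml version='1.0'?><name/>"),
}

for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```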

The next two sections will describe more fully the encoding and standalone attributes of the XML declaration.

Encoding

It should come as no surprise to us that text is stored in computers using numbers, since numbers are all that computers really understand.

A character code is a one-to-one mapping between a set of characters and the corresponding numbers to represent those characters.

A character encoding is the method used to represent the numbers in a character code digitally (in other words, how many bytes should be used for each number, and so on).

One character code/encoding that you might have come across is the American Standard Code for Information Interchange (ASCII). For example, in ASCII the character “a” is represented by the number 97, and the character “A” is represented by the number 65.
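
You can look these numbers up yourself. The sketch below uses Python's built-in ord() and chr() functions; strictly speaking they work with Unicode numbers, but for the characters covered by ASCII the two codes agree:

```python
# ord() gives the number for a character; chr() goes the other way.
codes = {ch: ord(ch) for ch in "aA"}

print(codes)
print(chr(97), chr(65))
```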

There are seven-bit and eight-bit ASCII encoding schemes. Strictly speaking, ASCII itself is a seven-bit code that defines 128 characters; the eight-bit schemes often called “extended ASCII” use one byte (8 bits) for each character, which can only store 256 different values, so that limits them to 256 characters. That’s enough to easily handle all of the characters needed for English, which is why ASCII was the predominant character encoding used on personal computers in the English-speaking world for many years. But there are far more than 256 characters in all of the world’s languages, so obviously ASCII can only handle a small subset of these. This is the reason that Unicode was invented.
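
The arithmetic behind these limits, and what happens when text doesn't fit, can be sketched as follows. This is only an illustration: it relies on Python's "ascii" codec, which is the seven-bit variety, and the helper name fits_in_ascii and the sample strings are made up for the example.

```python
# Number of distinct values that fit in 7 bits and in 8 bits.
seven_bit_values = 2 ** 7
eight_bit_values = 2 ** 8


def fits_in_ascii(text):
    """Hypothetical helper: True if every character is in 7-bit ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(seven_bit_values, eight_bit_values)
# Plain English text fits; accented or non-Latin characters do not.
print(fits_in_ascii("John Doe"), fits_in_ascii("José"), fits_in_ascii("日本語"))
```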

Unicode

Unicode is a character code designed from the ground up with internationalization in mind, aiming to have enough possible characters to cover all of the characters in any human language. There are two major character encodings for Unicode: UTF-16 and UTF-8. UTF-16 takes the easy way, and uses two bytes for most characters (two bytes = 16 bits = 65,536 possible values); the characters that fall outside that range are represented by a pair of two-byte units.

UTF-8 is more clever: it uses one byte for the characters covered by 7-bit ASCII, and then uses some tricks so that any other characters may be represented by two or more bytes. This means that ASCII text can actually be considered a subset of UTF-8, and processed as such. For text written in English, where most of the characters would fit into the ASCII character encoding, UTF-8 can result in smaller file sizes, but for text in some other languages, such as Chinese or Japanese, UTF-16 is usually smaller.
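
To get a feel for the size difference, here is a small sketch that encodes two sample strings both ways and counts the bytes. The sample strings are our own choice, and the "utf-16-le" codec is used so that the two-byte marker Python would otherwise put at the start of UTF-16 output is not included in the count:

```python
samples = {"English": "John Doe", "Japanese": "日本語"}

# For each sample: (bytes as UTF-8, bytes as UTF-16).
sizes = {
    label: (len(text.encode("utf-8")), len(text.encode("utf-16-le")))
    for label, text in samples.items()
}

# Pure ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_is_subset = "John Doe".encode("ascii") == "John Doe".encode("utf-8")

for label, (utf8_size, utf16_size) in sizes.items():
    print(f"{label}: UTF-8 = {utf8_size} bytes, UTF-16 = {utf16_size} bytes")
print(ascii_is_subset)
```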

Because of the work done with Unicode to make it international, the XML specification states that all XML processors must use Unicode internally. Unfortunately, very few of the documents in the world are encoded in Unicode. Most are encoded in ISO-8859-1, or windows-1252, or EBCDIC, or one of a large number of other character encodings. (Many of these encodings, such as ISO-8859-1 and windows-1252, are actually variants of ASCII. They are not, however, subsets of UTF-8 in the same way that “pure” ASCII is.)
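
The point about ISO-8859-1 not being a subset of UTF-8 can be seen in a short sketch. It assumes Python's standard xml.etree.ElementTree module (built on the expat parser); the sample name and the helper decodes_as_utf8 are made up for the example. The same ISO-8859-1 bytes are rejected when the parser is left to assume UTF-8, and accepted once the XML declaration names the encoding:

```python
import xml.etree.ElementTree as ET

# "é" is a single byte (0xE9) in ISO-8859-1.
latin1_bytes = "José".encode("iso-8859-1")


def decodes_as_utf8(data):
    """Hypothetical helper: True if the bytes are valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


declared = (
    b"<?xml version='1.0' encoding='ISO-8859-1'?><first>"
    + latin1_bytes
    + b"</first>"
)
undeclared = b"<first>" + latin1_bytes + b"</first>"

# With the encoding declared, the parser converts the text to Unicode.
first = ET.fromstring(declared).text

# Without it, this parser assumes UTF-8 and rejects the document.
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(decodes_as_utf8(latin1_bytes), first, undeclared_ok)
```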
