Specifying Character Encoding for XML
This is where the encoding attribute in our XML declaration comes in. It tells the XML parser which character encoding our text uses. The parser can then read the document in that encoding and translate it into Unicode internally. If no encoding is specified, the parser assumes UTF-8 or UTF-16 (every parser must support at least these two). If no encoding is specified and the document is in neither UTF-8 nor UTF-16, the result is an error.
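To see the effect of the encoding attribute, here is a small sketch using Python's standard xml.etree.ElementTree module (our choice for illustration; any conforming parser should behave similarly). The <name> element and its text are made up for the example. The same bytes parse cleanly when the declaration names their encoding, and fail when the parser is left to assume UTF-8.

```python
import xml.etree.ElementTree as ET

# "café" stored as ISO-8859-1 bytes: the é becomes the single byte 0xE9.
body = "<name>café</name>".encode("iso-8859-1")

# With a declaration, the parser knows how to decode the bytes
# and hands the text back as a Unicode string.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body
decoded_text = ET.fromstring(declared).text

# Without a declaration, the parser assumes UTF-8. The byte 0xE9
# followed by "<" is not a valid UTF-8 sequence, so parsing fails.
try:
    ET.fromstring(body)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(decoded_text, undeclared_ok)
```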
Sometimes an XML processor is allowed to ignore the encoding specified in the XML declaration. If the document is being sent via a network protocol such as HTTP, there may be protocol-specific headers that specify a different encoding from the one declared in the document. In such a case, the HTTP header takes precedence over the encoding in the XML declaration. However, if there is no external source for the encoding, and the declared encoding differs from the actual encoding of the document, the result is an error.
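As a rough illustration of external information taking precedence, Python's ElementTree lets the caller pass an encoding to the parser that overrides whatever the declaration says. The helper name parse_with_external_encoding and the sample document are hypothetical, and the charset value stands in for one taken from an HTTP header; extracting it from a real header is left out.

```python
import xml.etree.ElementTree as ET

def parse_with_external_encoding(data: bytes, charset: str) -> ET.Element:
    """Parse data, letting an externally supplied charset override
    the encoding named in the XML declaration (hypothetical helper)."""
    parser = ET.XMLParser(encoding=charset)
    return ET.fromstring(data, parser=parser)

# The declaration claims UTF-8, but the bytes are really ISO-8859-1.
data = b'<?xml version="1.0" encoding="UTF-8"?><name>caf\xe9</name>'

# With the external charset supplied, the document parses correctly.
root = parse_with_external_encoding(data, "ISO-8859-1")

# With no external source, the wrong declaration leads to an error.
try:
    ET.fromstring(data)
    mismatch_ok = True
except ET.ParseError:
    mismatch_ok = False

print(root.text, mismatch_ok)
```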
If you’re creating XML documents in Notepad on a machine running a Microsoft Windows operating system, the character encoding you are using by default is windows-1252. So the XML declarations in your documents should look like this:
<?xml version="1.0" encoding="windows-1252"?>
However, not all XML parsers understand the windows-1252 character set. If that’s the case, try substituting ISO-8859-1, which happens to be very similar. Or, if your document doesn’t contain any special characters (accented characters, for example), you could use ASCII instead (declared as US-ASCII), or leave the encoding attribute out and let the XML parser treat the document as UTF-8.
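The choice described above can be sketched as a small Python function. The name choose_declared_encoding is made up for this example, and the order of preference (ASCII first, then ISO-8859-1, then windows-1252, then UTF-8) is just one reasonable assumption, not a rule from the XML specification.

```python
def choose_declared_encoding(text: str) -> str:
    """Suggest an encoding name for the XML declaration (hypothetical helper)."""
    # Pure ASCII text is also valid UTF-8, so the encoding attribute
    # could equally be left out in this case.
    if text.isascii():
        return "US-ASCII"
    # Try the more widely understood ISO-8859-1 before windows-1252.
    for name in ("ISO-8859-1", "windows-1252"):
        try:
            text.encode(name)
            return name
        except UnicodeEncodeError:
            continue
    # Anything else needs a Unicode encoding.
    return "UTF-8"

print(choose_declared_encoding("Dare to be Stupid"))
print(choose_declared_encoding("café"))
print(choose_declared_encoding("€5"))
```

The euro sign is one of the characters that exists in windows-1252 but not in ISO-8859-1, which is why the two encodings are similar but not interchangeable.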
If you’re running Windows NT or Windows 2000, Notepad also gives you the option of saving your text files in Unicode, in which case you can leave out the encoding attribute in your XML declarations.
If the standalone attribute is included in the XML declaration, it must be either yes or no.
- yes specifies that this document exists entirely on its own, without depending on any other files
- no indicates that the document may depend on other files
This little attribute actually has its own name: the Standalone Document Declaration, or SDD. The XML specification doesn’t actually require a parser to do anything with the SDD. It is considered more of a hint to the parser than anything else.
This is only a partial description of the SDD. If it has whetted your appetite for more, you’ll have to be patient until Chapter 11, when all will be made clear.
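If you are curious how a parser reports the XML declaration, the following sketch uses the XmlDeclHandler callback of Python's xml.parsers.expat module. The function name read_declaration is our own invention. Expat reports standalone as 1 for yes, 0 for no, and -1 when the attribute is omitted, and it rejects any other value.

```python
from xml.parsers import expat

def read_declaration(data: bytes):
    """Return (version, encoding, standalone) from the XML declaration,
    or None if the document has no declaration (hypothetical helper)."""
    found = {}

    def on_declaration(version, encoding, standalone):
        found["decl"] = (version, encoding, standalone)

    parser = expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(data, True)
    return found.get("decl")

print(read_declaration(b"<?xml version='1.0' standalone='yes'?><CD/>"))
print(read_declaration(b"<?xml version='1.0' standalone='no'?><CD/>"))
print(read_declaration(b"<?xml version='1.0'?><CD/>"))

# Any value other than yes or no is an error.
try:
    read_declaration(b"<?xml version='1.0' standalone='maybe'?><CD/>")
    bad_value_ok = True
except expat.ExpatError:
    bad_value_ok = False
```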
It’s time to take a look at how the XML declaration works in practice.
Try It Out – Declaring Al’s CD to the World
Let’s declare our XML document, so that any parsers will be able to tell right away what it is. And, while we’re at it, let’s take care of that second <parody> element, which doesn’t have any content.
1. Open up the file cd3.xml, and make the following changes:
<?xml version='1.0' encoding='windows-1252' standalone='yes'?>
<CD serial='B6B41B' disc-length='36:55'>
  <artist>"Weird Al" Yankovic</artist>
  <title>Dare to be Stupid</title>
  <genre>parody</genre>
  <date-released>1990</date-released>
  <!--date-released is the date released to CD, not to record-->
  <song>
    <title>Like A Surgeon</title>
    <length>
      <minutes>3</minutes>
      <seconds>33</seconds>
    </length>
    <parody>
      <title>Like A Virgin</title>
      <artist>Madonna</artist>
    </parody>
  </song>
  <song>
    <title>Dare to be Stupid</title>
    <length>
      <minutes>3</minutes>
      <seconds>25</seconds>
    </length>
    <parody/>
  </song>
  <!--There are more songs on this CD, but I didn't have time to include them!-->
</CD>
2. Save the file as cd5.xml, and view it in IE5.
How It Works
With our new XML declaration, any XML parser can tell right away that it is indeed dealing with an XML document, and that document is claiming to conform to version 1.0 of the XML specification.
Furthermore, the document indicates that it is encoded using the windows-1252 character encoding. Again, many XML parsers don’t understand windows-1252, so you may have to play around with the encoding. Luckily, the parser used by Internet Explorer 5 does understand windows-1252, so if you’re viewing the examples in IE5 you can leave the XML declaration as it is here.
In addition, because the Standalone Document Declaration declares that this is a standalone document, the parser knows that this one file is all that it needs to fully process the information.
And finally, because “Dare to be Stupid” is not a parody of any particular song, the <parody> element has been changed to an empty element. That way we can visually emphasize the fact that there is no information there. Remember, though, that to the parser <parody/> is exactly the same as <parody></parody>, which is why this part of our document looks the same as it did in our earlier screenshots.
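You can confirm that the two forms are equivalent with a quick check. This sketch uses Python's xml.etree.ElementTree, chosen only for illustration: both spellings produce an element with the same name, no text, and no children.

```python
import xml.etree.ElementTree as ET

empty_tag = ET.fromstring("<parody/>")
tag_pair = ET.fromstring("<parody></parody>")

# Compare everything the parser reports about the two elements.
same = (
    empty_tag.tag == tag_pair.tag
    and empty_tag.text == tag_pair.text
    and len(empty_tag) == len(tag_pair)
)
print(same)
```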