Minibases: Transitionary Use of Schema-validated XML Data Islands

2003-03-25 :: Glenn Slayden

Q: How do I create, load, validate, and access XML data islands?

A: An XML data island is a fragment of well-formed (and perhaps valid) XML on an HTML page. It's very easy to create and access this data, although I found it quite difficult to accomplish validation against an XML schema. This was probably due more to the fact that I was learning about namespaces, schemas, and validation for the first time during this exercise than to any inherent difficulty with data islands. I can't be sure how difficult it would have been if I had been a schema definition expert.

For the purpose of this blurb, I'll be using W3C XML Schema 1.0 (May 2001). At the time of writing (2003), this is the most modern schema definition mechanism, as opposed to the two other commonly used systems: DTD and a stillborn Microsoft device. There are numerous articles on the web which compare these three systems, and there are also several other fringe systems, but without going into too much detail, suffice it to say that the W3C XML Schema definition language is the most comprehensive and capable. An official overview of it is available in the W3C's XML Schema Part 0: Primer.

In the example case, a website contains pages in which a subset of a large database is displayed, but individual items within the subset may be used on the page many times. It may be possible to reduce the size of the page downloads through normalization: sending only a single copy of each data item (a set which I'll call the minibase), and then using a client-side script language such as JavaScript to build the final page from the minibase. As a bonus, processing cycles are also distributed away from the server to the clients. Rather than spending time formatting HTML, the server now just analyzes the data dependencies for a particular page, removes duplicates from the list, and prepares the minibase, an XML data island containing just the data that the client will need to build the page.
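
To make the normalization idea concrete, here is a minimal JavaScript sketch. It is not the code used on the site: the function names buildMinibase and renderPage, the item fields, and the sample records are all hypothetical. The only point is that each item is sent once and then referenced by id as often as the page needs it.

```javascript
// Hypothetical sketch: function names, item shapes, and data are illustrative assumptions.

// Server side: collect each referenced item once, keyed by its id.
function buildMinibase(database, idsUsedOnPage) {
  var minibase = {};
  for (var i = 0; i < idsUsedOnPage.length; i++) {
    var id = idsUsedOnPage[i];
    if (!(id in minibase)) {
      minibase[id] = database[id]; // repeated ids are skipped
    }
  }
  return minibase;
}

// Client side: expand the list of references into the final markup.
function renderPage(minibase, idsUsedOnPage) {
  var html = [];
  for (var i = 0; i < idsUsedOnPage.length; i++) {
    var item = minibase[idsUsedOnPage[i]];
    html.push("<li>" + item.thai + " (" + item.xlit + ")</li>");
  }
  return "<ul>" + html.join("") + "</ul>";
}

// Sample data, made up for the example.
var database = {
  12345: { thai: "เก่า", xlit: "placeholder-1" },
  555333: { thai: "ใหม่", xlit: "placeholder-2" }
};
var pageIds = [12345, 555333, 12345, 12345];

var minibase = buildMinibase(database, pageIds);
console.log(Object.keys(minibase).length); // 2 items sent, although the page uses 4
console.log(renderPage(minibase, pageIds));
```

The saving grows with the number of repeated references, since the page carries short ids in place of repeated copies of the full item.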

Typically, the client-side script would not be subject to change as the minibase changes, so isolating that code as a <script> element with an external source would reduce download size even more.

At this point, XML mavens will point out that my JavaScript is doing the job of an XSLT transformation. Perhaps so, but as a transitionary measure, simply moving JavaScript from the ASP page to the client side as described here is a much easier task than a complete paradigm shift. In other words, complex server-side applications built on scads of procedural, imperative code can preserve that code while still moving into an XML environment.

And now for the code. The HTML file below contains an XML data island and JavaScript that selects and displays one of the items based on its attribute value:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head>
<BASE HREF="http://www.glennslayden.com/XML_data_islands.htm">
</head><body>

<xml id="xroot">
<tl:minibase xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
			xmlns:tl="urn:thai-language-schema"
			xsi:schemaLocation="urn:thai-language-schema http://www.thai-language.com/tl.xsd">
			 
	<te teid='12345'>
		<thai>เก่า</thai>
		<xlit>khaao<span class="tt">F</span></xlit>
		<xid>333333</xid>
	</te>
	<te teid='555333'>
		<thai>เก่า</thai>
		<xlit>xxxyxx<span class="tt">F</span></xlit>
		<xid>444444</xid>
	</te>
	<te teid='556777'>
		<thai>เก่า</thai>
		<xlit>xxxyxx<span class="tt">F</span></xlit>
		<xid>1</xid>
	</te>
</tl:minibase>
</xml>
<script language="javascript" type="text/javascript">

validateFile();

function validateFile()
{
	var dom_root;
	
	dom_root = new ActiveXObject("MSXML2.DOMDocument.4.0");
		
	dom_root.async = false;
	dom_root.validateOnParse = true;
	dom_root.resolveExternals = true;
	dom_root.setProperty("SelectionLanguage", "XPath");
	dom_root.setProperty("SelectionNamespaces", "xmlns:tl='urn:thai-language-schema'");

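	// IE exposes the parsed data island through the XMLDocument property.
	var this_xml_doc = document.all("xroot").XMLDocument;

	// Re-parse the island's markup so that validateOnParse takes effect.
	dom_root.loadXML(this_xml_doc.firstChild.xml);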
	
	if (dom_root.parseError.errorCode != 0){
		document.write("Validation failed.<br><hr>" +
			"<br>Reason: " + dom_root.parseError.reason +
			"<br>Source: " + dom_root.parseError.srcText +
			"<br>Line: " + dom_root.parseError.line + "<br>");
			return;
	}
	else
		document.write("Validation succeeded.<br><hr><br>" + dom_root.xml + "<br>");
		
	// Select the <te> entry whose teid attribute equals 555333.
	var node = dom_root.selectSingleNode("//tl:minibase/te[@teid=555333]");

	if (node != null){
		document.write(node.text);
	}
}

</script>
</body>
</html>
tl.xsd: the XML Schema definition. It must be retrievable from the location given in the xsi:schemaLocation attribute of the data island.
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
			xmlns:tl="urn:thai-language-schema"
			targetNamespace="urn:thai-language-schema">

	<xsd:element name="minibase">
		<xsd:complexType>
			<xsd:choice maxOccurs="unbounded">
	
				<xsd:element name="te">
				<xsd:complexType>
					<xsd:sequence>
						<xsd:element name="thai"/>
						<xsd:element name="xlit"/>
						<xsd:element name="xid"/>
					</xsd:sequence>
					<xsd:attribute name="teid" type="xsd:unsignedInt" use="required"/>
				</xsd:complexType>
				</xsd:element>
			
			</xsd:choice>
		</xsd:complexType>
	</xsd:element>
</xsd:schema>
I strongly suggest that you run your schema through a schema validator such as W3C's, and fix all the problems before attempting to use MSXML/XMLDOM to try to validate XML against it.

Notice that the attribute 'teid' is defined as an unsigned integer, one of the built-in datatypes of W3C XML Schema. In the XPath expression passed to selectSingleNode, we don't have to put quotes around the number we're looking for, because XPath 1.0 converts the attribute's value to a number when comparing it against a numeric literal. However, even though it's a numeric value, it must have quotes in the XML data island, since XML requires all attribute values to be quoted.

We can be sure that validation against the schema is actually occurring by changing one of those numeric values for teid in the data island to a non-numeric value, say by inserting an 'x' into the middle of the number. When you refresh the HTML page, you should get an error that the validation failed because of a type problem.
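
To see why the altered value is rejected, here is a rough JavaScript approximation of the check that the unsignedInt type implies. It is a simplified sketch, not what MSXML does internally: the function name looksLikeUnsignedInt is hypothetical, and it assumes plain decimal digits with no sign handling.

```javascript
// Simplified approximation of the xsd:unsignedInt check; not a full XML Schema implementation.
function looksLikeUnsignedInt(attributeValue) {
  // Surrounding whitespace is ignored before numeric types are checked.
  var text = String(attributeValue).replace(/^\s+|\s+$/g, "");
  // Only decimal digits are accepted here (assumption: no sign handling).
  if (!/^[0-9]+$/.test(text)) {
    return false;
  }
  // unsignedInt is limited to 0 .. 4294967295 (2^32 - 1).
  return Number(text) <= 4294967295;
}

console.log(looksLikeUnsignedInt("555333"));     // true
console.log(looksLikeUnsignedInt("555x333"));    // false
console.log(looksLikeUnsignedInt("4294967296")); // false
```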

One maddening aspect of developing this code was that validation against the schema appears to be finicky and fragile. If the XML processor doesn't like the slightest thing about your namespaces and the "hook-up" between the data and the schema, it will not perform the validation, and the parse error will report success. The only way I found to make sure that the validation is actually happening is to "break" it, as described in the previous paragraph, and see if the error is reported.
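
The "break it on purpose" check can be turned into a small routine. The sketch below is only an illustration: validationIsLive is a hypothetical helper, and the two stand-in validators exist so that the example runs on its own. On the page above, the validate argument would instead wrap the loadXML call and the parseError.errorCode test.

```javascript
// Hypothetical helper: `validate` is any function that takes an XML string and
// returns true when the document parses and validates without error.
function validationIsLive(validate, goodXml, brokenXml) {
  var goodPasses = validate(goodXml);
  var brokenFails = !validate(brokenXml);
  return goodPasses && brokenFails;
}

// Stand-in validators so the sketch runs on its own (assumptions, not MSXML).
// A working validator rejects a non-numeric teid...
function strictValidator(xml) {
  var match = /teid='([^']*)'/.exec(xml);
  return match !== null && /^[0-9]+$/.test(match[1]);
}
// ...while a validator whose schema was never hooked up accepts everything.
function silentValidator(xml) {
  return true;
}

var good = "<te teid='555333'/>";
var broken = "<te teid='555x333'/>";

console.log(validationIsLive(strictValidator, good, broken)); // true
console.log(validationIsLive(silentValidator, good, broken)); // false
```

A result of false for a document that should validate is the signal that the schema hook-up, rather than the data, needs attention.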