Glenn Slayden - Parsing Legacy HTML with XMLDOM

Parsing Legacy HTML with XMLDOM

2003-03-25 :: Glenn Slayden

Q: Is it possible to load run-of-the-mill HTML into "Msxml2.DOMDocument.4.0" (Microsoft.XMLDOM) or another XML parser?

A: XMLDOM refers to the Document Object Model of XML, a powerful tree structure for working with XML data. When a document is loaded into an XMLDOM tree, there are numerous properties and methods which enable you to quickly search and edit the tree. In particular, XPath expressions can be used to select part or parts of the tree for manipulation using the node member functions selectSingleNode or selectNodes.

Because of the power of this tool, it's appealing to try to use it on an existing base of legacy HTML code. Is this possible?

In general, no, because in practice HTML is not well-formed. It's important to understand the difference between "well-formed" and "valid" in the context of XML. XML parsers require their input to be well-formed, although not necessarily valid. Well-formedness refers to correct tag nesting and closure, and other constraints which are generally easy to impose within HTML but are not usually strictly adhered to, because HTML processors (i.e. browsers) do not enforce them. XML validity refers to compliance of the represented data with a DTD or schema.
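
To make the distinction concrete, here is a minimal sketch that asks the parser whether two fragments are well-formed. The helper name IsWellFormed and the sample strings are assumptions made up for illustration; validateOnParse is turned off, so only well-formedness is tested, not validity.

```vbscript
' Hypothetical helper, for illustration only: returns True when the text
' parses as well-formed XML. Nothing is validated against a DTD or schema.
Function IsWellFormed(xml_text)
	Dim wf_doc
	Set wf_doc = Server.CreateObject("Microsoft.XMLDOM")
	wf_doc.async = False
	wf_doc.validateOnParse = False
	wf_doc.resolveExternals = False
	IsWellFormed = wf_doc.loadXML(xml_text)
End Function

' Typical legacy HTML: <br> is never closed, and <b> and <i> overlap.
Response.Write IsWellFormed("<p><b><i>Hello</b></i><br></p>") & "<br>"

' The same content with proper nesting and closure.
Response.Write IsWellFormed("<p><b><i>Hello</i></b><br/></p>") & "<br>"
```

A browser would render both fragments the same way, but the XML parser should reject the first one and accept the second.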

Tools such as Dave Raggett's TIDY, available free of charge in several versions, can repair most HTML problems, providing well-formed output from arbitrary HTML input. One problem with this tool is that it is not Unicode-aware, and performs all its processing on 8-bit input and output files.

Once your HTML is well-formed it can be loaded into an XML parser. As a first step towards well-formedness, run the HTML through a validator such as validator.w3.org. You can choose one of three DTDs for your HTML document (strict, transitional, or frameset). Put the chosen declaration at the very beginning of your file.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">

Do not use the XML declaration <?xml version="1.0"?> in HTML files.

To load your HTML into MSXML, you might need some entity definitions also, since XML predefines only five named entities (&lt;, &gt;, &amp;, &apos; and &quot;) and does not recognize the others that HTML commonly uses. You'll find out exactly which ones when the XMLDOM complains with parse errors.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd" [
<!ENTITY nbsp "&#160;">
<!ENTITY raquo "»">
]>
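
As a rough sketch of why such a declaration matters, the following loads the same fragment twice, once bare and once with nbsp declared in an internal subset. The variable names and the sample markup are assumptions for illustration, and the DOCTYPE is shortened to the internal subset only.

```vbscript
' Illustrative sketch: the same markup with and without an entity declaration.
Dim ent_doc, plain_text, declared_text
Set ent_doc = Server.CreateObject("Microsoft.XMLDOM")
ent_doc.async = False
ent_doc.validateOnParse = False
ent_doc.resolveExternals = False

plain_text = "<p>a&nbsp;b</p>"

' Single quotes around the entity value avoid chr(34) juggling in VBScript.
declared_text = "<!DOCTYPE p [<!ENTITY nbsp '&#160;'>]>" & plain_text

' Without a declaration the parser has no definition for &nbsp;,
' so the load fails and parseError explains why.
If Not ent_doc.loadXML(plain_text) Then
	Response.Write Server.HTMLEncode(ent_doc.parseError.reason) & "<br>"
End If

' With the declaration in place the reference is expanded on load.
If ent_doc.loadXML(declared_text) Then
	Response.Write "text length: " & Len(ent_doc.documentElement.text) & "<br>"
End If
```

The same pattern extends to any other HTML entity the parser reports as undefined: add one ENTITY line per name.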

Also make sure that your opening and closing tags match exactly: XML is case-sensitive for tag names whereas HTML is not. Here's the VBScript code you can use to load some HTML into the XMLDOM on an ASP server:

	myhtml = "<html><head></head><body>Hello World!</body></html>"

	' Prepend a DOCTYPE whose internal subset declares the HTML entities in use
	es = "<!DOCTYPE HTML PUBLIC " & chr(34) & "-//W3C//DTD HTML 4.01 Transitional//EN" & chr(34) & " " & chr(34) & "http://www.w3.org/TR/html4/loose.dtd" & chr(34) & " [" & _
		"<!ENTITY nbsp " & chr(34) & "&#160;" & chr(34) & ">" & _
		"<!ENTITY raquo " & chr(34) & "»" & chr(34) & ">" & _
		"]>" & myhtml
        
	set dom_doc = Server.CreateObject("Microsoft.XMLDOM")	'or "Msxml2.DOMDocument.4.0"

	dom_doc.async = false
	dom_doc.resolveExternals = false
	dom_doc.validateOnParse = false

	parse_result = dom_doc.loadXML(es)

	if (parse_result) then
		Response.Write "parse successful<br>"

		call dom_doc.setProperty("SelectionLanguage", "XPath")
		
		Dim b_node
		set b_node = dom_doc.selectSingleNode("//body")
		
		Response.Write b_node.xml
	else		
		Dim perr
		set perr = dom_doc.parseError
	
		Response.Write perr.reason & " at line " & perr.line & " character " & perr.linepos & "<br><br><code>" & Server.HTMLEncode(perr.srcText) & "</code>"
	end if

In this example, the selectSingleNode method is shown just as a demonstration of how easy it is to select part of the loaded document. In the case of this trivial document, it's not such an interesting example, but in the case of complex HTML documents, such ability is invaluable.
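
For a slightly less trivial case, here is a sketch that uses selectNodes to collect every link in a page. The sample markup and variable names are made up for this example; a real page would first have to be made well-formed as described above.

```vbscript
' Illustrative sketch: list every link target in a small, well-formed page.
Dim link_doc, links, link
Set link_doc = Server.CreateObject("Microsoft.XMLDOM")
link_doc.async = False
link_doc.validateOnParse = False
link_doc.resolveExternals = False

If link_doc.loadXML("<html><body><p><a href='a.htm'>A</a></p><p><a href='b.htm'>B</a></p></body></html>") Then
	Call link_doc.setProperty("SelectionLanguage", "XPath")

	' selectNodes returns a node list; each item here is an <a> element
	' that carries an href attribute.
	Set links = link_doc.selectNodes("//a[@href]")
	For Each link In links
		Response.Write Server.HTMLEncode(link.getAttribute("href") & " : " & link.text) & "<br>"
	Next
End If
```

Doing the same with string searches or regular expressions over raw HTML takes considerably more code and is easy to get wrong.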

My own investigation of this technique came about when I wished to programmatically reprocess monolithic HTML pages which are dynamically generated by a 3rd-party message board application written in Perl/CGI. After first capturing the Perl stdout into my ASP page (using an ActiveX object I wrote called ExecCGI*), I needed to 1.) strip away the outermost tags which established the HTML as a standalone page, and 2.) break the remainder down into chunks which could be inserted into the proper section (head, body, etc.) of my ASP-generated output. In this way, I was hoping to be able to subjugate the standalone 3rd-party application and integrate it with the rest of my site.

In the end, this didn't work because in my specific case the 3rd-party app occasionally generates output that is not well-formed. But it was still a useful learning experience and the practice may be useful for those with simpler inputs.

Of course you can use the code shown above on the client side by replacing "Server.CreateObject" with "CreateObject". The JavaScript implementation is left as an exercise for the reader.

*Contact me for details if you're interested in ExecCGI. From VB, an ASP page, or another automation environment, this ActiveX object can:
1.) establish the standard CGI environment variables, or any other desired environment variables;
2.) synchronously execute any console process with or without command line arguments;
3.) capture and return separate feeds of stdout and stderr.