by Glenn Slayden. Please contact me with corrections or comments. 2003-03-25
Q: Is it possible to load run-of-the-mill HTML into "Msxml2.DOMDocument.4.0" (Microsoft.XMLDOM) or another XML parser?
A: DOM refers to the Document Object Model of XML, a powerful tree structure for working with
XML data. When a document is loaded into a DOM tree, there are numerous properties and methods which enable you to quickly
search and edit the tree. In particular, powerful XPath expressions can be used to select part or parts of the tree for
manipulation using the node member functions selectSingleNode or selectNodes.
Because of the enormous power of this tool, it's appealing to try to use it to manipulate the enormous base of legacy HTML code—is
this possible?
In general, no, because in practice HTML is not well-formed. It's important to understand the difference
between "well-formed" and "valid" in the context of XML. XML parsers require their input to be well-formed,
although not necessarily valid. Well-formedness refers to correct tag nesting and closure, and other constraints
which are generally easy to satisfy in HTML, but are not usually strictly adhered to because HTML processors (largely browsers)
do not enforce them. Validity refers to compliance of the document's structure with a DTD or
schema.
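To make the distinction concrete, here is a minimal sketch using Python's standard-library parser as a stand-in for MSXML (an assumption made only so the example runs anywhere; the helper name is_well_formed and the sample strings are hypothetical). No DTD or schema is involved, so only well-formedness is checked.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return (True, None) if markup parses as XML, else (False, error message)."""
    try:
        ET.fromstring(markup)
        return True, None
    except ET.ParseError as err:
        return False, str(err)

# Typical browser-tolerated HTML: <br> and <p> are never closed.
legacy_html = "<html><body><p>one<br>two</body></html>"

# The same content with every element closed and properly nested.
xhtml_style = "<html><body><p>one<br/>two</p></body></html>"

# Element names are case-sensitive in XML, so <B> does not match </b>.
mixed_case = "<html><body><B>bold</b></body></html>"

for sample in (legacy_html, xhtml_style, mixed_case):
    ok, reason = is_well_formed(sample)
    print("well-formed" if ok else "rejected: " + reason)
```

Only the second sample is accepted, even though a browser would display all three without complaint.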
Tools such as Dave Raggett's TIDY, available free-of-charge in several versions, can repair most HTML problems, producing
well-formed output from arbitrary HTML input. One serious problem with this tool is that it is not Unicode-aware, and
performs all its processing on 8-bit input and output files.
Once your HTML is well-formed, however, it can be loaded into the Microsoft XML parser. As a first step towards well-formedness,
run the HTML through a validator such as validator.w3.org. You can choose one of three supported DTDs for your HTML
document. Put the chosen DOCTYPE declaration at the very beginning of your file.
Do not use the XML declaration <?xml version="1.0"?> in HTML files.
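To load your HTML into MSXML, you might also need some entity definitions, since XML predefines only five named entities
(&lt;, &gt;, &amp;, &apos; and &quot;) and rejects the others commonly used in HTML unless they are declared. You'll find out
exactly which ones when the XMLDOM complains with parse errors. Declare them in an internal DTD subset ahead of the document,
for example (showing only &nbsp;):
<!DOCTYPE html [
<!ENTITY nbsp "&#160;">
]>
<html>
...
</html>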
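The effect of such entity definitions can be sketched as follows, again with Python's standard-library parser standing in for MSXML. The sample markup, the choice of &nbsp; and the helper try_parse are illustrative assumptions, not part of the original page.

```python
import xml.etree.ElementTree as ET

body = "<html><body>Hello&nbsp;World!</body></html>"

# Internal DTD subset declaring the one HTML entity this sample uses.
doctype = '<!DOCTYPE html [\n<!ENTITY nbsp "&#160;">\n]>\n'

def try_parse(markup):
    """Return the <body> text on success, or None if the parser rejects it."""
    try:
        return ET.fromstring(markup).findtext("body")
    except ET.ParseError:
        return None

without_dtd = try_parse(body)            # &nbsp; is undefined here
with_dtd = try_parse(doctype + body)     # &nbsp; expands to U+00A0

print(repr(without_dtd))
print(repr(with_dtd))
```

Without the declaration the parser refuses the document outright. With it, the entity reference is replaced by the no-break space character.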
Also make sure that your opening and closing tags match exactly, including case: XML element names are case-sensitive,
whereas HTML's are not. Here's the VBScript code you can use to do the load on an ASP page:
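' Sample document (a minimal stand-in; substitute your own well-formed HTML)
myhtml = "<html><body>Hello World!</body></html>"
' Prepend an internal DTD subset declaring the named entities your HTML uses.
' (nbsp is an assumed example; add one line per entity the parser reports.)
es = "<!DOCTYPE html [" & chr(13) & chr(10) & _
    "<!ENTITY nbsp ""&#160;"">" & chr(13) & chr(10) & _
    "]>" & chr(13) & chr(10) & myhtml
set dom_doc = Server.CreateObject("Microsoft.XMLDOM") 'or "Msxml2.DOMDocument.4.0"
dom_doc.async = false              ' load synchronously
dom_doc.resolveExternals = false   ' do not fetch external DTDs
dom_doc.validateOnParse = false    ' require well-formedness only, not validity
parse_result = dom_doc.loadXML(es)
if (parse_result) then
    Response.Write "parse successful<br>"
    call dom_doc.setProperty("SelectionLanguage", "XPath")
    Dim b_node
    set b_node = dom_doc.selectSingleNode("//body")
    Response.Write b_node.xml
else
    Dim perr
    set perr = dom_doc.parseError
    ' HTML-encode the offending source so the browser shows it instead of rendering it
    Response.Write perr.reason & " at line " & perr.line & " character " & perr.linepos & "<br>" & _
        "<pre>" & Server.HTMLEncode(perr.srcText) & "</pre>"
end if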
In this example, the selectSingleNode method is shown just as a demonstration of how easy it is to select part of the loaded document. In the
case of this trivial document, it's not such an interesting example, but in the case of complex HTML documents, such ability is invaluable.
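For a slightly less trivial document, the same kind of selection might look like the sketch below. It uses Python's xml.etree.ElementTree, which supports only a limited subset of XPath, rather than MSXML; the sample page and variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

page = (
    "<html><head><title>Demo</title></head>"
    "<body><p class='msg'>first</p><div><p>second</p></div></body></html>"
)
root = ET.fromstring(page)

# Counterpart of selectSingleNode("//body"): first match anywhere below the root.
body = root.find(".//body")

# Counterpart of selectNodes("//p"): every <p>, at any depth, in document order.
paragraphs = [p.text for p in root.findall(".//p")]

# A predicate narrows the match to elements carrying a given attribute value.
flagged = root.find(".//p[@class='msg']")

print(ET.tostring(body, encoding="unicode"))
print(paragraphs)
print(flagged.text)
```

The point is the same as with selectSingleNode and selectNodes: one short expression picks out the part of the tree you want, regardless of how deeply it is nested.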
My own investigation of this technique came about when I wished to programmatically reprocess monolithic HTML pages which are dynamically generated
by a 3rd-party message board application written in Perl/CGI. After first capturing the Perl stdout into my ASP page (using an ActiveX object I
wrote called ExecCGI*), I needed to 1.) strip away the outermost tags which established the HTML as a standalone page, and 2.) break the remainder
down into chunks which could be inserted into the proper section (head, body, etc.) of my ASP-generated output. In this way, I was hoping to be able to
subjugate the standalone 3rd-party application and integrate it with the rest of my site.
In the end, this didn't work because in my specific case the 3rd-party app occasionally generates output that is not well-formed XML. But it was still a useful learning experience
and the practice may be useful for those with simpler inputs.
Of course you can use the code shown above on the client side by replacing "Server.CreateObject" with "CreateObject". The JavaScript implementation
is left as an exercise for the reader.
*Contact me for details if you're interested in ExecCGI. From VB, an ASP page, or other automation environment, this ActiveX object can:
1.) establish the standard CGI environment variables, or any other desired environment variables;
2.) synchronously execute any console process with or without command line arguments;
3.) capture and return separate feeds of stdout and stderr.
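For readers without ExecCGI, the three capabilities can be approximated with Python's subprocess module. This is only a rough sketch, not ExecCGI's interface: the function run_cgi, the child command and the variable values are all hypothetical.

```python
import os
import subprocess
import sys

def run_cgi(command, extra_env):
    """Run command synchronously; return (exit code, stdout, stderr)."""
    env = dict(os.environ)
    env.update(extra_env)              # 1.) CGI-style environment variables
    done = subprocess.run(             # 2.) blocks until the process exits
        command,
        env=env,
        capture_output=True,           # 3.) stdout and stderr kept separate
        text=True,
    )
    return done.returncode, done.stdout, done.stderr

# Stand-in child process: echoes one CGI variable and writes a warning to stderr.
child = [
    sys.executable, "-c",
    "import os, sys; print(os.environ['QUERY_STRING']); sys.stderr.write('warn')",
]
code, out, err = run_cgi(child, {"REQUEST_METHOD": "GET", "QUERY_STRING": "board=1"})
print(code, out.strip(), err)
```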