Tip: Convert from HTML to XML with HTML Tidy
by Benoit Marchal, Marchal.com
Wednesday, 3rd August 2005
Preserve Legacy Web Sites With This Handy Utility
Level: Introductory
This tip demonstrates how to convert HTML documents to XML (or more specifically, XHTML) with a simple, open source tool, HTML Tidy. This conversion is useful for webmasters who are migrating to XML. It can also help XML converts who have to interface with legacy HTML tools.
One of the challenges that webmasters face when converting from pure HTML to XML/XSL is preserving their legacy Web sites. Because it would be too costly to dump the old site and start again from scratch, some sort of automated procedure that brings the HTML site to XML is required.
Even XML converts have to deal with HTML files: Many products have added an option for exporting HTML documents -- an option you might want to integrate into your Web site.
This tip discusses HTML Tidy, a powerful tool to help convert old HTML pages to newer standards, such as XML. Tidy is distributed as open source.
Tool Of The Trade
The basic tool you can use to upgrade a site from HTML to XML is HTML Tidy. Originally developed by Dave Raggett and distributed under an open source license through the W3C Web site, HTML Tidy is now maintained by a group of volunteers at SourceForge. A Java-language version (aptly called JTidy) is also available (see Resources). Last but not least, an API allows you to integrate HTML Tidy as a library in your applications.
HTML and XML are both markup languages derived from SGML, so they have a lot in common. Still, there are two major differences:
XML syntax is far more restrictive; most importantly, in XML you must remember to close the tags.
HTML coding often has been relatively careless, so the files are rarely trouble-free to start with.
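The first difference is easy to see with an XML parser. The following is a minimal sketch in Python (my choice for illustration, not something the tip itself uses); the two markup strings and the helper name is_well_formed are made up for this example. It feeds a typical legacy HTML fragment and its XHTML equivalent to the standard library's XML parser:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, invented for this illustration.
# Legacy HTML: <p> is never closed and <br> has no end tag.
LEGACY_HTML = "<html><body><p>First paragraph<p>Second paragraph<br></body></html>"

# The same content written as well-formed XHTML.
XHTML = "<html><body><p>First paragraph</p><p>Second paragraph<br /></p></body></html>"


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Unclosed or mismatched tags are fatal errors for an XML parser.
        return False


print(is_well_formed(LEGACY_HTML))
print(is_well_formed(XHTML))
```

A browser would render both fragments without complaint, but an XML parser rejects the first one outright. That gap is what a conversion tool has to close.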
Early Web browsers encouraged sloppiness among webmasters by being extraordinarily tolerant of errors. At the time, the goal of these browsers was to get as many people on board as possible and to encourage webmasters to publish documents. The strategy worked, and Web content grew exponentially.
Still, poor coding practices caused all kinds of incompatibilities, and HTML Tidy was originally designed to address this. It rewrites HTML pages so that they conform to the latest W3C standards. In the process, it fixes many common errors such as unclosed tags.
Although HTML Tidy primarily works with HTML pages, it also supports XHTML, an XML vocabulary.
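As a rough sketch of how such a conversion might be scripted, the Python snippet below wraps the tidy command-line tool. It assumes a tidy executable is installed and on the PATH; the function names and the file names in the usage line are hypothetical. It relies on Tidy's -asxhtml option (write the result as XHTML), -q (suppress nonessential output), and -o (name the output file).

```python
import shutil
import subprocess
from typing import List, Optional


def tidy_command(src: str, dst: str) -> List[str]:
    """Build the argument list for converting one HTML file to XHTML."""
    # -q: quiet, -asxhtml: emit XHTML, -o: write the result to dst
    return ["tidy", "-q", "-asxhtml", "-o", dst, src]


def convert_to_xhtml(src: str, dst: str) -> Optional[int]:
    """Run tidy on src and return its exit status, or None if tidy is missing."""
    if shutil.which("tidy") is None:
        # Assumption: tidy is expected on the PATH; report its absence
        # to the caller instead of raising.
        return None
    # Tidy exits with 0 on success, 1 if it issued warnings,
    # and 2 if it found errors it could not fix.
    return subprocess.run(tidy_command(src, dst)).returncode


# Hypothetical file names, for illustration only.
print(tidy_command("index.html", "index.xhtml"))
```

Keeping the command construction separate from the call that runs it makes it straightforward to loop over every page of a legacy site and to treat an exit status of 1 (warnings only) differently from 2 (errors).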
As an example, I will work with a photo gallery generated with Photoshop. You can use other HTML documents, but if you'd like to experiment with the same files I use, the gallery is also available for download in the Resources section. Listing 1 is an excerpt from the gallery -- as you can see, it's plain HTML code.
Benoit Marchal is a Belgian consultant. He is the author of XML by Example and other XML books. Benoit is available to help you with XML projects. You can contact him at bmarchal@pineapplesoft.com or through his personal site at marchal.com.
