Tip: Convert from HTML to XML with HTML Tidy

by Benoit Marchal
Marchal.com
Wednesday, 3rd August 2005

 Preserve Legacy Web Sites With This Handy Utility

Level: Introductory

This tip demonstrates how to convert HTML documents to XML (or more specifically, XHTML) with a simple, open source tool, HTML Tidy. This conversion is useful for webmasters who are migrating to XML. It can also help XML converts who have to interface with legacy HTML tools.

One of the challenges that webmasters face when converting from pure HTML to XML/XSL is the preservation of their legacy Web sites. Because it would be too costly to dump the old site and start again from scratch, some sort of automated procedure that brings the HTML site to XML is required.

Even XML converts have to deal with HTML files: Many products have added an option for exporting HTML documents -- an option you might want to integrate into your Web site.

This tip discusses HTML Tidy, a powerful tool to help convert old HTML pages to newer standards, such as XML. Tidy is distributed as open source.

Tool Of The Trade

The basic tool you can use to upgrade a site from HTML to XML is HTML Tidy. Originally developed by Dave Raggett and distributed under an open source license through the W3C Web site, HTML Tidy is now maintained by a group of volunteers at SourceForge. A Java-language version (aptly called JTidy) is also available (see Resources). Last but not least, an API allows you to integrate HTML Tidy as a library in your applications.

HTML and XML are both markup languages derived from SGML, so they have a lot in common. Still, there are two major differences:

XML syntax is far more restrictive; most importantly, in XML you must remember to close the tags.
HTML coding often has been relatively careless, so the files are rarely trouble-free to start with.

Early Web browsers encouraged sloppiness among webmasters by being extraordinarily tolerant of errors. At the time, the goal of these browsers was to get as many people on board as possible and to encourage webmasters to publish documents. The strategy worked, and Web content grew exponentially.
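
The gap between tolerant HTML handling and strict XML parsing can be illustrated with a short sketch. It uses only the Python standard library and is not part of HTML Tidy; the sample markup and the names TagCollector and is_well_formed_xml are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag the lenient HTML parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def is_well_formed_xml(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical legacy HTML: the <br> and <li> tags are never closed.
legacy = "<ul><li>First<br>line<li>Second</ul>"
# The same content with every tag closed, as XML requires.
strict = "<ul><li>First<br/>line</li><li>Second</li></ul>"

collector = TagCollector()
collector.feed(legacy)
collector.close()

print(collector.tags)              # the HTML parser reads the sloppy markup without complaint
print(is_well_formed_xml(legacy))  # False: an XML parser rejects the unclosed tags
print(is_well_formed_xml(strict))  # True
```

The lenient parser walks through the unclosed tags much as an early browser would, while the XML parser refuses the same input outright. Closing that gap by hand across a whole site is the tedious work a conversion tool is meant to automate.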

Still, poor coding practices caused all kinds of incompatibilities, and HTML Tidy was originally designed to address this. It rewrites HTML pages to conform to the latest W3C standards. In the process, it fixes many common errors such as unclosed tags.

Although HTML Tidy primarily works with HTML pages, it also supports XHTML, an XML vocabulary.
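
As a rough sketch of how the XHTML output might be requested from a script, the Python snippet below wraps the tidy command-line program. It assumes a tidy executable is installed and on the PATH, and that it accepts the -asxhtml, -quiet, and -output options; the file names and the helper names build_tidy_command and convert_to_xhtml are placeholders invented for this example.

```python
import shutil
import subprocess


def build_tidy_command(source, target):
    """Build the argument list for converting an HTML file to XHTML."""
    return [
        "tidy",
        "-asxhtml",         # ask for well-formed XHTML output
        "-quiet",           # suppress nonessential messages
        "-output", target,  # write the result to this file
        source,
    ]


def convert_to_xhtml(source, target):
    """Run tidy and return its exit status.

    Tidy conventionally exits with 0 when the input was clean,
    1 when it issued warnings, and 2 when it found errors.
    """
    if shutil.which("tidy") is None:
        raise FileNotFoundError("the tidy executable was not found on PATH")
    completed = subprocess.run(build_tidy_command(source, target), check=False)
    return completed.returncode


# Show the command that would be run for two placeholder file names.
print(" ".join(build_tidy_command("index.html", "index.xhtml")))
```

Calling convert_to_xhtml("index.html", "index.xhtml") would then run the conversion. An exit status of 1 is common for legacy pages, because Tidy reports each repair it makes as a warning.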

As an example, I will work with a photo gallery generated with Photoshop. You can use other HTML documents, but if you'd like to experiment with the same files I use, the gallery is also available for download in the Resources section. Listing 1 is an excerpt from the gallery -- as you can see, it's plain HTML code.



About The Author:

Benoit Marchal is a Belgian consultant. He is the author of XML by Example and other XML books. Benoit is available to help you with XML projects. You can contact him at bmarchal@pineapplesoft.com or through his personal site at marchal.com.
