Tip: Convert from HTML to XML with HTML Tidy
by Benoit Marchal, Marchal.com
Wednesday, 3rd August 2005
Listing 1. index.html (an excerpt)
<HTML>
<HEAD>
<TITLE>Journey to Windsor</TITLE>
</HEAD>
<BODY>
<TABLE>
<TR>
<TD width=15></TD>
<TD><FONT size="3"face="Helvetica">
Journey to Windsor<BR>
Benoît Marchal<BR>
July 2003<BR>
<BR>
<A href="mailto:bmarchal@pineapplesoft.com">
bmarchal@pineapplesoft.com</A>
</FONT></TD>
</TR>
</TABLE>
<CENTER><TABLE border=3>
<TR><TD>
<A href="pages/dscn0824.html">
<IMG src="thumbnails/dscn0824.jpg" border="0" alt="dscn0824">
</A><br>
<FONT size="3" face="Helvetica">
dscn0824.jpg<br>
A bright, red mailbox inside the castle. It seems oddly familiar
in an historic setting.<br>
Windsor Castle <br>
© 2003, Benoît Marchal
</FONT>
</TD></TR>
</TABLE></CENTER>
</BODY>
</HTML>
Tidying Up
Obviously, the first step is to download and install HTML Tidy. HTML Tidy is available on most platforms, including Windows, Linux, and MacOS. The default executable is a command-line tool, but GUI versions are available for Windows and MacOS.
To run HTML Tidy, open a terminal and issue the following command:
tidy -asxhtml -numeric < index.html > index.xml
That's it! HTML Tidy immediately converts index.html into index.xml. HTML Tidy will print messages that highlight issues with the original HTML document during the conversion. In most cases, you can safely ignore these messages.
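If you would rather drive the conversion from a script than from a terminal, the command above can be wrapped in a few lines of code. The following is a minimal sketch in Python (a language chosen here for illustration, not one the tip itself uses). The helper names tidy_command and html_to_xhtml are hypothetical, and the sketch assumes that a tidy executable is on the PATH. It relies on two documented behaviors of the command-line tool: diagnostic messages are written to standard error, and the exit status is 0 for a clean run, 1 when there were warnings, and 2 when there were errors.

```python
import shutil
import subprocess


def tidy_command():
    """Return the argument list for the command shown above (hypothetical helper)."""
    return ["tidy", "-asxhtml", "-numeric"]


def html_to_xhtml(html_bytes):
    """Pipe HTML through HTML Tidy and return (xhtml_bytes, messages)."""
    # Assumption: the executable is installed under the name "tidy".
    if shutil.which("tidy") is None:
        raise FileNotFoundError("the tidy executable was not found on PATH")
    result = subprocess.run(
        tidy_command(),
        input=html_bytes,        # plays the role of "< index.html"
        stdout=subprocess.PIPE,  # plays the role of "> index.xml"
        stderr=subprocess.PIPE,  # the messages about the original document
        check=False,             # exit status 1 only means "warnings"
    )
    messages = result.stderr.decode(errors="replace")
    # Exit status 2 (or higher) means Tidy reported errors.
    if result.returncode >= 2:
        raise RuntimeError(messages)
    return result.stdout, messages
```

To convert a file, read index.html in binary mode, pass its contents to html_to_xhtml, and write the first element of the returned pair to index.xml. The second element holds the messages, which you can log or discard.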
HTML Tidy runs as a filter: it reads the document from standard input and writes the result to standard output. The redirection operators (< and >) let you work with files. By default, HTML Tidy produces a clean HTML page, but two options make it output XML instead:
-asxhtml outputs XHTML documents instead of HTML.
-numeric outputs numeric character references instead of named HTML entities. For example, &icirc; is replaced with &#238;.
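The -numeric option matters because an XML parser knows only a handful of predefined entities, and the named HTML entities are not among them. The short Python sketch below illustrates the point with the standard library alone; it does not call HTML Tidy. The helper names to_numeric_reference and parses_as_xml are hypothetical, and the sample paragraphs are written for this illustration.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint


def to_numeric_reference(name):
    """Turn an HTML entity name such as 'icirc' into a numeric reference."""
    return "&#%d;" % name2codepoint[name]


def parses_as_xml(text):
    """Return True if the text is accepted by an XML parser."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The same paragraph, once with a named entity and once with a numeric one.
named = "<p>Beno&icirc;t Marchal</p>"
numeric = "<p>Beno" + to_numeric_reference("icirc") + "t Marchal</p>"

print(parses_as_xml(named))    # the named entity is undefined in plain XML
print(parses_as_xml(numeric))  # the numeric reference needs no declaration
```

The first paragraph is rejected because the parser has no definition for the named entity, while the second parses and yields the accented character directly.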
The difference between XHTML and HTML might sound trivial (it's only an extra "X", after all), but it is important. XHTML is a version of HTML 4.01 that has been adapted to XML syntax. The vocabulary is unchanged (XHTML uses the familiar <p>, <b>, and <a> tags, for example), but the syntax is XML, so it fits nicely into an XML workflow.
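The main differences between HTML and XHTML are:
Element and attribute names are written in lowercase, because XML is case-sensitive.
Every element must be closed. Empty elements such as <br> are written <br />.
Attribute values must always be quoted.
Elements must be properly nested.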
Listing 2 is the file that HTML Tidy produces when Listing 1 is provided as input. As you can see, it is a valid XML document, and it takes surprisingly little work to produce it.
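To see why the conversion is needed at all, you can feed a few fragments to an XML parser. The sketch below, in Python, takes two constructs from Listing 1 (an unquoted attribute value and an unclosed <BR>) and compares them with a hand-written XHTML equivalent. That equivalent is an assumption made for illustration, not a copy of HTML Tidy's actual output, and the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Two constructs taken from Listing 1.
html_unquoted = "<TD width=15></TD>"
html_unclosed = "<TD>Journey to Windsor<BR></TD>"

# A hand-written XHTML equivalent (assumed, not Tidy's literal output).
xhtml_fixed = '<td width="15">Journey to Windsor<br /></td>'

for fragment in (html_unquoted, html_unclosed, xhtml_fixed):
    print(is_well_formed(fragment), fragment)
```

Only the last fragment is accepted. The same kind of check, for instance ET.parse("index.xml"), should give a quick confirmation that the file produced by the conversion is well-formed.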
About The Author:
Benoit Marchal is a Belgian consultant. He is the author of XML by Example and other XML books. Benoit is available to help you with XML projects. You can contact him at bmarchal@pineapplesoft.com or through his personal site at marchal.com.
