February 18, 2004
What is all the hoo-ha about XML (eXtensible Markup Language)? Is it the one true markup language that will change the way we view and manipulate informational content forever? Or is it just a flash in the pan, with bells and whistles that no one wants or even cares about? Ronald Bourret, XML programmer and technical writer, is an XML proponent. He and technical writer Karin Gallagher addressed the February BAEF meeting on the basics of XML and the impact it is having on the writing business.
"XML is here to stay," Bourret said. "Its use is growing in publishing; it's absolutely ubiquitous in data industry."
According to Bourret, for the book and publishing industry, XML's main advantages are the ability to publish the same data to multiple formats and the ease of reusing content. He likens today's uncertainty about XML to yesterday's reluctance to go online.
XML is, at its heart, a simplified subset of SGML, the standard from which HTML is also derived. But unlike HTML, XML allows users to define their own tags; and while HTML focuses on the presentation of data, XML focuses on the manipulation and dissemination of data.
"XML is an attempt to go back to defining text, rather than presenting text," Bourret said. "With HTML the only thing you do is display in a browser; with XML, you have a lot more choices."
There are three basic structures in an XML document: elements, attributes, and character data.
An element consists of a start tag, an end tag, and whatever sits between them. In Bourret's example, he created a <book> element with several "child" elements: <title>, <author>, <price>, and <description>.
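A minimal sketch of such a document, read with Python's standard xml.etree.ElementTree module. Only the element names come from the talk; the sample values and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; only the element names come from the talk.
BOOK_XML = """
<book>
  <title>An Example Title</title>
  <author>A. Writer</author>
  <price>29.95</price>
  <description>A short description of the book.</description>
</book>
"""

book = ET.fromstring(BOOK_XML)

# Iterating over an element yields its child elements in document order.
child_tags = [child.tag for child in book]

# Each child can also be looked up by its tag name.
title = book.findtext("title")
price = float(book.findtext("price"))

print(child_tags)
print(title, price)
```

Because every piece of content is labeled, a program can pick out the title or the price without guessing where it sits in the text.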
Deciding whether and how to do child elements depends on how you want to use your data, according to Bourret. When creating his
"Look at names, for example," Bourret said. "You could have a single name tag that would be 'Ronald Bourret' or you could have the first name separate from the last. It depends on what degree of granularity you want to go to. Is it important to have that
Elements can have several different content types: element content, PCDATA (or text) content, empty content, or mixed content.
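The four content types can be illustrated with a small Python sketch. It classifies elements by inspecting one sample document, which is a simplification: in a DTD the content type is declared up front rather than inferred. The function name, the sample tags, and the choice to ignore whitespace-only text are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def content_type(element):
    """Classify one element as element, PCDATA, empty, or mixed content."""
    has_children = len(element) > 0
    # Text can sit directly inside the element or after any child (its "tail").
    texts = [element.text] + [child.tail for child in element]
    has_text = any(t and t.strip() for t in texts)
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "element"
    if has_text:
        return "PCDATA"
    return "empty"

# Hypothetical sample; the tag names are illustrative.
doc = ET.fromstring(
    "<book>"
    "<title>An Example Title</title>"
    "<description>A <em>short</em> description.</description>"
    "<cover/>"
    "</book>"
)

kinds = {el.tag: content_type(el) for el in doc.iter()}
print(kinds)
```

Here <book> holds only other elements, <title> holds only text, <description> mixes text with an <em> child, and <cover/> is empty.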
There are several rules for XML elements. In XML, all start tags must have end tags (generally denoted by a slash; e.g., is the end
Attributes are one way to represent data, and they are an alternative to creating child elements. Note, however, that attributes cannot contain further elements. In Bourret's "book" example, the <book> element carries an ISBN attribute, written <book ISBN="x-xxx-xxxxx-x">; he could have made ISBN a child element of book, but instead he chose to view the ISBN as an attribute of the book.
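The two choices can be put side by side in a short Python sketch. The ISBN placeholder is the one from the talk; the title text and the variable names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Same information, two representations. The title text is illustrative.
AS_ATTRIBUTE = '<book ISBN="x-xxx-xxxxx-x"><title>An Example Title</title></book>'
AS_ELEMENT = "<book><ISBN>x-xxx-xxxxx-x</ISBN><title>An Example Title</title></book>"

book_a = ET.fromstring(AS_ATTRIBUTE)
book_e = ET.fromstring(AS_ELEMENT)

# An attribute is read from the element that carries it...
isbn_from_attribute = book_a.get("ISBN")
# ...while a child element is looked up by its tag name.
isbn_from_element = book_e.findtext("ISBN")

print(isbn_from_attribute == isbn_from_element)
```

Either form yields the same value to the program reading it. The difference shows up later: the child element could be given structure of its own, while the attribute can only ever hold a plain string.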
"There is a long, on-running debate in the XML community about whether it's better to have attributes or child elements," Bourret said. "It's really personal preference. I write the code to process my own XML documents, so I tend to prefer attributes. But in the publishing world, you're almost always going to do elements."
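Character data is the text between the tags. One difference between XML and HTML is how they treat white space in character data: an XML parser passes it through to the application, while an HTML browser collapses runs of white space when it displays the page.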
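The white space difference can be sketched in a few lines of Python. The <para> tag and its text are invented for the example, and the second step only imitates the default behavior of an HTML browser; it is not a browser.

```python
import xml.etree.ElementTree as ET

# Two spaces and a line break inside the character data.
para = ET.fromstring("<para>two  spaces\nand a line break</para>")

# The XML parser hands the text over exactly as written.
raw_text = para.text

# Imitation of default HTML display: runs of white space become one space.
collapsed = " ".join(raw_text.split())

print(repr(raw_text))
print(repr(collapsed))
```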
"XML schemas are the way you define elements and attributes and how they fit together," he said. "Theoretically these schemas are optional, but in practice you want people to think about this."
There are two main types of schemas: DTDs (document type definitions) and XML Schemas (with a capital S). DTDs are more established. They come out of the SGML world; one of the best known DTDs for publishing is DocBook.
DocBook has been around for about ten or fifteen years and comes with a lot of free tools. It has the advantage of being very well documented. According to Bourret, the problem with DocBook is that it may not define the elements and attributes exactly the way you want. It is also big and complex to work with.
Other documentation schemas besides DocBook include XHTML, which is HTML reformulated as XML. According to Bourret, XHTML is widely supported and well known, but its real disadvantage is that it does not separate content from presentation, so if you want to repurpose that content, you may end up doing a lot of hand clean-up.
NewsML is a relatively new schema, designed specifically for the news industry. It is widely used by the media, including UPI, Reuters, The Wall Street Journal, and a lot of other large media organizations, to send news around the Web.
XML Schemas are newer and harder to read than DTDs. The advantage of writing your own Schema is that you get exactly what you want, deciding which elements and attributes you need for your project. But it's not a simple task -- especially since you'll have to write the surrounding tools, stylesheets, etc. In addition, your Schema will be proprietary and, therefore, not portable.
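To give a feel for what a schema decides, here is a toy stand-in written in Python. It is not a DTD or XML Schema validator; it only checks two things a real schema would also state, namely which child elements are allowed where and which attributes are required. The rule table, the function name, and the sample documents are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: for each element, the child tags it may contain
# and the attributes it must carry. A real DTD or XML Schema says much more.
RULES = {
    "book": {"children": {"title", "author", "price", "description"}, "required": {"ISBN"}},
    "title": {"children": set(), "required": set()},
    "author": {"children": set(), "required": set()},
    "price": {"children": set(), "required": set()},
    "description": {"children": set(), "required": set()},
}

def check(element, rules=RULES):
    """Return a list of problems; an empty list means the document fits the rules."""
    rule = rules.get(element.tag)
    if rule is None:
        return [f"undeclared element <{element.tag}>"]
    problems = []
    for name in sorted(rule["required"] - set(element.attrib)):
        problems.append(f"<{element.tag}> is missing attribute {name}")
    for child in element:
        if child.tag not in rule["children"]:
            problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            problems.extend(check(child, rules))
    return problems

good = ET.fromstring('<book ISBN="x-xxx-xxxxx-x"><title>T</title></book>')
bad = ET.fromstring("<book><title>T</title><publisher>P</publisher></book>")

print(check(good))
print(check(bad))
```

Writing the rule table is the easy part. As the article notes, the surrounding tools and stylesheets are where most of the work goes.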
Getting to the Data
Once you've done all the work, defined your tags, and put in the data, how do you view your data? Since XML is not a presentation language, it doesn't display data. Two options are:
And if you want to display data in a table, that's a presentation question separate from XML. According to Bourret, you can define table elements as part of your language. DocBook, for example, has table elements that you can use.
"With an XML document, what you do is your choice," Bourret said. "You write the document in XML, and somewhere in the conversion process, the production people define fonts, etc. Some information may never get displayed at all."
The advantage of having content in XML, according to Bourret, is that you can generate multiple views of the same data, depending on how you intend to use it. This content can be formatted for PDF, for a book, for the Web -- all using one base file.
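One base file feeding several views can be sketched in Python as follows. The book structure, the tag names, and both function names are assumptions made for the example, and the HTML produced is deliberately bare.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical book structure; the tag names are assumptions for illustration.
SOURCE = """
<book>
  <title>An Example Title</title>
  <chapter><title>Getting Started</title><para>First steps.</para></chapter>
  <chapter><title>Going Further</title><para>Next steps.</para></chapter>
</book>
"""

book = ET.fromstring(SOURCE)

def table_of_contents(book):
    """A view that is only a subset of the document: the chapter titles."""
    return [chapter.findtext("title") for chapter in book.findall("chapter")]

def as_html(book):
    """A second view of the same source; presentation is decided here, not in the XML."""
    parts = [f"<h1>{escape(book.findtext('title'))}</h1>"]
    for chapter in book.findall("chapter"):
        parts.append(f"<h2>{escape(chapter.findtext('title'))}</h2>")
        for para in chapter.findall("para"):
            parts.append(f"<p>{escape(para.text or '')}</p>")
    return "\n".join(parts)

toc = table_of_contents(book)
page = as_html(book)

print(toc)
print(page)
```

Neither function changes the source. A third view, for print or PDF, would be one more function reading the same file.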
"You can generate multiple different views of a document," Bourret said. "For a book written in XML, the view may be a subset of a document, or a rearrangement thereof. You could generate a view that is just the table of contents. Because all content is labeled through tags, you can pull any of it out and define how it is arranged."
XML has the additional advantages of being nonproprietary and portable across operating systems. Because XML documents are plain text encoded in Unicode, the files are readable on any system and can hold text in almost any language.
"XML is easy to recombine and reuse," Bourret said. "It's text, so it's easy to process. Similar to building dynamic Web pages today, you could take XML content and very easily build three different versions of an online magazine: free, standard, and premium."
Karin Gallagher, technical writer formerly at Borland, discussed the advantages and disadvantages of XML in the context of converting preexisting documentation to XML. Although Borland had considered and discarded the idea of converting to XML several years ago (mainly because of the lack of available resources and tools), the company ultimately decided to move to XML about a year and a half ago.
"We wanted to automate the process for the API reference (help files), we wanted to continue to single source, and we wanted to make the help system cross-platform," Gallagher said. "We ported the existing documentation over to XML -- about 45,000 topics, and one guy did it in three months."
They chose an XML editor by first writing up a list of their criteria, then picking from the shortlist of available editors (XMetaL and XML Spy) according to which one best satisfied their needs. Gallagher characterizes XML Spy as more for developers who are creating schemas and XMetaL as more content-based with a WYSIWYG interface. But it turned out that XMetaL wasn't quite WYSIWYG enough.
"The problem with XMetaL is that you have to apply the style sheet to see the final formatting, and you could only add elements in tag view, not in WYSIWYG," Gallagher said. "XMetaL is not as mature as Word or Framemaker. After a few years go by and more people use it, it will get better."
They also decided to create their own XML Schema to maintain flexibility, rather than using DocBook. The programmer created a base schema with elements that all of the other schemas would use. One problem, however, was that the purpose of some of the elements was unclear. In addition, the schema author used the schema to enforce "good writing" (e.g., that there could be no more than four paragraphs in an overview), and this ultimately caused problems.
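The four-paragraph rule is the kind of constraint a schema can express as a simple upper bound. The Python sketch below imitates that check outside any schema language; the element names <topic>, <overview>, and <para> are guesses for illustration, not Borland's actual markup.

```python
import xml.etree.ElementTree as ET

MAX_OVERVIEW_PARAGRAPHS = 4  # the limit Gallagher described

def overview_problems(topic):
    """List every <overview> in the topic that exceeds the paragraph limit."""
    problems = []
    for overview in topic.iter("overview"):
        count = len(overview.findall("para"))
        if count > MAX_OVERVIEW_PARAGRAPHS:
            problems.append(
                f"overview has {count} paragraphs, limit is {MAX_OVERVIEW_PARAGRAPHS}"
            )
    return problems

# Hypothetical topics; the element names are assumptions.
short_topic = ET.fromstring(
    "<topic><overview>" + "<para>text</para>" * 3 + "</overview></topic>"
)
long_topic = ET.fromstring(
    "<topic><overview>" + "<para>text</para>" * 5 + "</overview></topic>"
)

print(overview_problems(short_topic))
print(overview_problems(long_topic))
```

A rule like this is trivial to enforce mechanically, which is part of why it is tempting. The check has no way to know when a fifth paragraph is what the reader needs.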
"The documentation is now 100 percent XML," Gallagher said. "But the cons were that we had to rewrite many documents from scratch, and we were forced by the schema to write too simply for many of our users. In addition, because of the schema design, we're no longer single source and the API documentation is still not automated."
Despite the problems, Gallagher still finds advantages in XML. She said that the language is not complicated and that the schemas allow a lot of flexibility. Writers can concentrate on writing rather than formatting. While tagging is involved, it need only be done once.
"But you need to decide if it's for you," Gallagher said. "Plan, and decide, and really think ahead. Choose your editor, convert existing documents, and then convert any supporting tools. Everyone needs to be involved."