HTML and the World Wide Web are everywhere. As an example of their ubiquity, I'm going to Central America for Easter this year, and if I want to, I'll be able to surf the Web, read my e-mail, and even do online banking from Internet cafés in Antigua Guatemala and Belize City. (I don't intend to, however, since doing so would take time away from a date I have with a palm tree and a rum-filled coconut.)

And yet, despite the omnipresence and popularity of HTML, it is severely limited in what it can do. It's fine for disseminating informal documents, but HTML is now being used to do things it was never designed for. Trying to design heavy-duty, flexible, interoperable data systems from HTML is like trying to build an aircraft carrier with hacksaws and soldering irons: the tools (HTML and HTTP) just aren't up to the job.

The good news is that many of the limitations of HTML have been overcome in XML, the Extensible Markup Language. XML is easily comprehensible to anyone who understands HTML, but it is much more powerful. More than just a markup language, XML is a metalanguage -- a language used to define new markup languages. With XML, you can create a language crafted specifically for your application or domain.

XML will complement, rather than replace, HTML. Whereas HTML is used for formatting and displaying data, XML represents the contextual meaning of the data.

This article will present the history of markup languages and how XML came to be. We'll look at sample data in HTML and move gradually into XML, demonstrating why it provides a superior way to represent data. We'll explore the reasons you might need to invent a custom markup language, and I'll teach you how to do it. We'll cover the basics of XML notation, and how to display XML with two different sorts of style languages. Then, we'll dive into the Document Object Model, a powerful tool for manipulating documents as objects (or manipulating object structures as documents, depending upon how you look at it). We'll go over how to write Java programs that extract information from XML documents, with a pointer to a free program useful for experimenting with these new concepts. Finally, we'll take a look at an Internet company that's basing its core technology strategy on XML and Java.

Is XML for you?
Though this article is written for anyone interested in XML, it has a special relationship to the JavaWorld series on XML JavaBeans. (See Resources for links to related articles.) If you've been reading that series and aren't quite "getting it," this article should clarify how to use XML with beans. If you are getting it, this article serves as the perfect companion piece to the XML JavaBeans series, since it covers topics untouched therein. And, if you're one of the lucky few who still have the XML JavaBeans articles to look forward to, I recommend that you read the present article first as introductory material.

A note about Java
There's so much recent XML activity in the computer world that even an article of this length can only skim the surface. Still, the whole point of this article is to give you the context you need to use XML in your Java program designs. This article also covers how XML operates with existing Web technology, since many Java programmers work in such an environment.

XML opens the Internet and Java programming to portable, nonbrowser functionality. XML frees Internet content from the browser in much the same way Java frees program behavior from the platform. XML makes Internet content available to real applications.

Java is an excellent platform for using XML, and XML is an outstanding data representation for Java applications. I'll point out some of Java's strengths with XML as we go along.

HTML example
Take a look at the little chunk of HTML in Listing 1:


<!-- The original html recipe -->
<HTML>
<HEAD>
<TITLE>Lime Jello Marshmallow Cottage Cheese Surprise</TITLE>
</HEAD>
<BODY>
<H3>Lime Jello Marshmallow Cottage Cheese Surprise</H3>
My grandma's favorite (may she rest in peace).
<H4>Ingredients</H4>
<TABLE BORDER="1">
<TR BGCOLOR="#308030"><TH>Qty</TH><TH>Units</TH><TH>Item</TH></TR>
<TR><TD>1</TD><TD>box</TD><TD>lime gelatin</TD></TR>
<TR><TD>500</TD><TD>g</TD><TD>multicolored tiny marshmallows</TD></TR>
<TR><TD>500</TD><TD>ml</TD><TD>cottage cheese</TD></TR>
<TR><TD></TD><TD>dash</TD><TD>Tabasco sauce (optional)</TD></TR>
</TABLE>
<P>
<H4>Instructions</H4>
<OL>
<LI>Prepare lime gelatin according to package instructions...</LI>
<!-- and so on -->
</OL>
</BODY>
</HTML>

Listing 1. Some HTML

(A printable version of this listing can be found at example.html.)

Looking at the HTML code in Listing 1, it's probably clear to just about anyone that this is a recipe for something (something awful, but a recipe nonetheless). In a browser, our HTML produces something like this:

Lime Jello Marshmallow Cottage Cheese Surprise

My grandma's favorite (may she rest in peace).

Ingredients

Qty   Units   Item
1     box     lime gelatin
500   g       multicolored tiny marshmallows
500   ml      cottage cheese
      dash    Tabasco sauce (optional)

Instructions

  1. Prepare lime gelatin according to package instructions...

Listing 2. What the HTML in Listing 1 looks like in a browser


java com.javaworld.JavaBeans.XMLApr99.ParseDemo

for instructions. The XML for Java package includes excellent documentation, a programmer's guide, and several example programs to get you started.

The source code is also available in zip and tar.gz formats. As an exercise, try downloading one of the other vendors' XML parsers from the Resources section, and then overriding the method ParseDemo.createParser() in the sample code to create a parser from the new package.

Become a tree surgeon!
One final, somewhat more advanced topic, before we close. The SAX interface allows you to parse an XML file and execute particular actions whenever certain structures (like tags) appear in the input. That's great for a lot of applications. There are times, though, when you want to be able to cut and paste whole sections of XML documents, restructure them, or maybe even build from scratch an object structure like the one in Figure 3, and then save the whole structure as an XML file. For that, you need access to the DOM API.
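For contrast with the tree-based approach that follows, the SAX callback style described above can be sketched roughly as shown here. This is a minimal illustration that assumes the JAXP classes bundled with current JDKs (javax.xml.parsers) rather than the parser package used by the article's sample code; the class name SaxCallbackSketch and the sample tags are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name; not part of the article's sample code.
class SaxCallbackSketch {

    // Collect the name of every start tag, in document order.
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                names.add(qName); // the parser calls this once per start tag
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<recipe><title>Surprise</title><ingredient/></recipe>";
        System.out.println(elementNames(xml)); // [recipe, title, ingredient]
    }
}
```

The handler never sees the document as a whole, only a stream of events, which is why restructuring a document calls for a different API.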

The DOM API allows you to represent your XML document as a tree of nodes in your Java (or other language) program. While a SAX parser reads an XML file, doing callbacks to a user-defined class, a DOM parser reads an XML file and returns a representation of the file as a tree of objects, most of which are of type org.w3c.dom.Node. This gives you immense power in manipulating structured documents. Figure 4 is an example of what I'm talking about.


Figure 4. A DOM document transformation system

The Document Object Model, in the package org.w3c.dom, defines interfaces for document elements (that is, tags), DTD elements, text nodes (where the actual text inside the tags is kept), and quite a few other things we haven't even discussed. Figure 4 is a schematic of a general system that can transform one XML document to some other form programmatically. Your program uses a DOM parser to parse an XML file, and the parser returns a tree that is an exact representation of the XML in the file. Note that, at this point, you've read an input file, checked it for formatting and semantic validity, and built a complex hierarchical object structure, all in just a few lines of code. You can then traverse the document tree in software, doing whatever you like to the tree structure. Add nodes, delete them, update their values, read or set their attributes -- basically anything you like. When your tree has the new structure you desire, tell the top node to print itself to another XML file, and the new document is created.
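The parse, modify, and write-out cycle just described might look something like the following sketch. It assumes the JAXP classes that ship with current JDKs (DocumentBuilderFactory for parsing, Transformer for serializing) instead of a vendor-specific "print yourself" method, and the recipe and ingredient element names are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical class name; element names below are illustrative only.
class DomTransformSketch {

    // Parse XML text into a DOM tree, append one node, and serialize the result.
    static String addOptionalIngredient(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // Build <ingredient optional="yes">Tabasco sauce</ingredient>.
            Element extra = doc.createElement("ingredient");
            extra.setAttribute("optional", "yes");
            extra.setTextContent("Tabasco sauce");
            doc.getDocumentElement().appendChild(extra);

            // Serialize the modified tree back to XML text.
            Transformer serializer =
                TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not transform document", e);
        }
    }

    public static void main(String[] args) {
        String input = "<recipe><ingredient>lime gelatin</ingredient></recipe>";
        System.out.println(addOptionalIngredient(input));
    }
}
```

Writing to a file instead of a string would only change the StreamResult target.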

XML-Java synergy
One of the reasons Java and XML are so well-suited for one another is that Java and XML are both extensible: Java through its class loaders, XML through its DTD. Imagine a server, reading and writing XML, where the DTD for the system input can change. When a new element is added to the input language, a running server (written in Java) could automatically load new Java classes to handle the new tags. You would not only have an extensible application server -- you wouldn't even have to take the server down to add the extensions!
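One way to picture this idea is a registry that maps tag names to handler class names and loads each handler by reflection when the tag is first seen. The sketch below is only an illustration under that assumption; TagHandler, TitleHandler, and the registry itself are hypothetical names, and a real server would also need a way to learn about new mappings at run time.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: tag names are mapped to handler classes loaded on demand.
class TagHandlerSketch {

    // Contract every tag handler is assumed to implement.
    interface TagHandler {
        String handle(String text);
    }

    // Example handler; in a running server this could be a class added later.
    static class TitleHandler implements TagHandler {
        public String handle(String text) {
            return "TITLE: " + text;
        }
    }

    private final Map<String, String> handlerClassByTag = new HashMap<>();

    // Associate a tag name with the fully qualified name of its handler class.
    void register(String tag, String className) {
        handlerClassByTag.put(tag, className);
    }

    // Load the handler class by name and let it process the tag's text.
    String dispatch(String tag, String text) {
        String className = handlerClassByTag.get(tag);
        if (className == null) {
            return "no handler for <" + tag + ">";
        }
        try {
            TagHandler handler = (TagHandler) Class.forName(className)
                .getDeclaredConstructor().newInstance();
            return handler.handle(text);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot load " + className, e);
        }
    }

    public static void main(String[] args) {
        TagHandlerSketch server = new TagHandlerSketch();
        server.register("title", TitleHandler.class.getName());
        System.out.println(server.dispatch("title", "Lime Surprise"));
        System.out.println(server.dispatch("author", "Grandma"));
    }
}
```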

This one small idea hints at what XML and Java can do together. The next section is about a company whose core technology is a combination of XML and Java.

XML with Java in the real world
You now have a handle on XML technology, including how it's implemented in Java. You understand that a document can be viewed as a tree of objects and manipulated using SAX or DOM. Let's have a look at a real company that is using all of these technologies to provide solutions for its clients.

DOM interfaces exist not only for XML, but for HTML, as well. This means that the leftmost document in Figure 4 could be a Web page from which you wish to extract information for manipulation in Java.

In fact, Epicentric, an Internet startup in San Francisco, does just that. Epicentric uses Java and XML in its turnkey systems to allow creation of custom portal sites. Portal sites, like the front pages of Netscape Netcenter and Excite!, are integrated aggregations of information from various Internet sources. In a corporate Internet environment, a portal may contain information gleaned from external Web pages (for example, weather reports), alongside internal enterprise data. Portals are also often customizable by each user.

Epicentric's systems read HTML from the Internet as DOM documents, extract information from those documents, and store that information in a standard XML format. Other information sources are also converted into this same XML format and stored on Epicentric's server. The company then uses the XML with XSL and Java Server Pages to create custom portals for its clients.

"A lot of good work has been done on the basics ... like parsers and XSL processors," says Ed Anuff, CEO of Epicentric. One benefit of using XML is that it makes designers think through the system structure in a very structured way, Anuff says.

When asked about concerns with XML, Anuff states that many of the problems he runs into are architectural, such as which DTD to use, and designating the appropriate places in the system to use XML. Systems designers are still working out how to use this new technology most effectively in an enterprise environment.

Also, since the technology is so new, it's often hard to know what pieces of the system to build in-house. For example, quite a few companies built their own XML parsers but now have little return on investment because larger companies are developing superior XML technology and giving it away for free. "The biggest challenge today is figuring out when you're reinventing the wheel, and when you're adding value," says Anuff.

Despite these challenges, the future looks bright for Epicentric, which has several "pretty decent-sized customers" using the company's software in beta. With clients and advertisers that include the likes of Eastman Kodak Company, Sun Microsystems, Chase Bank, and LIFE Magazine, Epicentric is using XML to aggregate and redistribute information in novel ways.

Conclusion
XML is a powerful data representation technology for which Java is uniquely well-suited. You're going to be hearing a lot about XML in the coming months and years. Anyone working with information systems that communicate with other systems (and what systems don't, these days?) has a lot to gain by understanding XML technology and using it to its full advantage.

Using XML with XSL or CSS, you can manage your Web site's content and style, and change style in one place (the style sheet) instead of editing piles of HTML files or, worse, editing the scripts that produce HTML dynamically. Using SAX or DOM, you can treat Web documents as object structures and process them in a general and clean way. Or, you can leave browsers behind entirely and write pure-Java clients and servers that talk to each other -- and other systems -- in XML, the new lingua franca of the Internet. Sun Microsystems, the creator of Java, has perhaps best described the power of XML and Java together in its slogan: Portable Code -- Portable Data. Start experimenting with XML in Java, and you'll soon wonder how you ever lived without it.

Thanks to Dave Orchard for his comments on drafts of this article, and to the many helpful people I met in San Jose, CA.



About the author
Mark Johnson lives in Fort Collins, CO, and is a C++ programmer by day and Java columnist by night. Very late night.

Using the XML Data Source Object

You can either use the OBJECT element to refer to the XML data source object, or you can use data islands and the XML element.

The XML data source Microsoft® ActiveX® object can be inserted into an HTML page as follows.

<OBJECT width=0 height=0
    classid="clsid:550dda30-0541-11d2-9ca9-0060b0ec3d39"
    id="xmldso">
</OBJECT>

This can be used as an XML data provider in conjunction with the data binding features of Microsoft Internet Explorer 5.0 for binding XML data to HTML elements on the page.

To load XML into the data source object, use the Dynamic HTML (DHTML) XMLDocument property to get a Document Object Model (DOM), and then call the load method as follows.

<SCRIPT for=window event=onload>
    var doc = xmldso.XMLDocument;
    doc.load("books.xml");
    if (doc.documentElement == null)
    {
        HandleError(doc);
    }
</SCRIPT>

Inline XML

You can also provide the XML inline inside the OBJECT element, as shown in the following example.

<OBJECT width=0 height=0
    classid="clsid:550dda30-0541-11d2-9ca9-0060b0ec3d39"
    id="xmldso">
<favorites>
<favorite>
<name>Microsoft</name>
<url>http://www.microsoft.com</url>
</favorite>
</favorites>
</OBJECT>

You use script to load the data source object as follows.

<SCRIPT for=window event=onload>
    var doc = xmldso.XMLDocument;
    doc.loadXML(xmldso.altHtml);
    if (doc.documentElement == null)
    {
        HandleError(doc);
    }
</SCRIPT>

Events Used with the XML Data Source Object

The XML data source object triggers events when the underlying XML data changes. These events are common among the XML data source object and the other supplied data source objects. For more information about events and the data source object, see DHTML Event Model Support for Data Binding in the DHTML documentation.

Viewing and Navigating a Subset of the Data

If you want a table to display a small portion of your XML data set, use the DATAPAGESIZE attribute on your TABLE element. The DATAPAGESIZE attribute indicates how many records to display in the table.

To navigate the table, use the nextPage, previousPage, firstPage, and lastPage methods to display different pages of the data. Typically, you provide buttons that call these methods. For example, a button to view the next page can be written as follows.

<INPUT TYPE="button" VALUE="Next" ONCLICK="tbl.nextPage();">

To specify the table, you can use the following code.

<TABLE DATAPAGESIZE=1 ID=tbl DATASRC=#xmlData>
...Table...
</TABLE>

This example sets the table to display one record (DATAPAGESIZE=1), identifies itself as "tbl" (ID=tbl), and uses a data source called "xmlData".

To indicate which table the button refers to, use the ID attribute used with the TABLE element.

The ONCLICK attribute can also specify "previousPage", "firstPage", or "lastPage". For example, to create a button to display the first page, use the following.

<INPUT TYPE="button" VALUE="First Page" ONCLICK="tbl.firstPage();">

The $Text Data Field

When you bind data using the XML data source object, an automatic field called "$Text" is created. It contains the items in that record, concatenated. The following example demonstrates the $Text data field.

<HTML><HEAD><TITLE></TITLE></HEAD>
<BODY>
<XML ID="xmlParts">
<?xml version="1.0" ?>
<parts>
<part>
<partnumber>A1000</partnumber>
<description>Flat washer</description>
<quantity>1000</quantity>
</part>
<part>
<partnumber>S2300</partnumber>
<description>Machine screw</description>
<quantity>1000</quantity>
</part>
<part>
<partnumber>M2400</partnumber>
<description>Nail</description>
<quantity>500</quantity>
</part>
</parts>
</XML>
<table datasrc=#xmlParts>
<tr>
<td><div datafld="partnumber"></div></td>
<td><div datafld="$Text"></div></td>
</tr>
</table>
</BODY>
</HTML>

In this example, the table consists of a column of part numbers (where datafld is equal to "partnumber") and a column containing the part number, description, and quantity concatenated (where datafld is equal to "$Text"). For example, the second row of the partnumber column will contain "S2300", while the second row of the $Text column will contain "S2300 Machine screw 1000". Note that the $Text value includes the part number as well.
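To make the concatenation concrete, the following Java sketch emulates a $Text-style value with the standard org.w3c.dom API. It is not the data source object's implementation; the space separator is inferred from the example value quoted above, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical emulation of the $Text field; not the DSO's own code.
class TextFieldSketch {

    // Join the text of each child element of a record with single spaces.
    static String textField(Element record) {
        StringBuilder joined = new StringBuilder();
        NodeList children = record.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace between elements
            }
            if (joined.length() > 0) {
                joined.append(' ');
            }
            joined.append(child.getTextContent().trim());
        }
        return joined.toString();
    }

    // Emulated $Text value for the record at the given zero-based row.
    static String textFieldOfRow(String xml, String recordTag, int row) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element record =
                (Element) doc.getElementsByTagName(recordTag).item(row);
            return textField(record);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not read record", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<parts>"
            + "<part><partnumber>A1000</partnumber>"
            + "<description>Flat washer</description>"
            + "<quantity>1000</quantity></part>"
            + "<part><partnumber>S2300</partnumber>"
            + "<description>Machine screw</description>"
            + "<quantity>1000</quantity></part>"
            + "</parts>";
        System.out.println(textFieldOfRow(xml, "part", 1));
    }
}
```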

Rules for Assigning XML Elements and Attributes to Columns and Rows

The XML data source object follows a procedure for assigning elements and attributes to columns and rows in databound applications. XML is modeled as a tree with one tag containing the entire hierarchy. For example, an XML description of a book can contain chapter tags, figure tags, and section tags. A book tag can contain the subelements of chapter, figure, and section. When the XML data source object assigns rows and columns, the subelements, not the top level element, are converted.

The XML data source object uses this procedure for converting the subelements.

Each subelement and attribute corresponds to a column in some rowset in the hierarchy.

The name of the column is the same as the name of the subelement or attribute, unless the parent element has an attribute and a subelement with the same name, in which case a "!" is prepended to the subelement's column name.

Each column is either a simple column containing scalar values, usually strings, or a rowset column containing subrowsets.

Columns corresponding to attributes are always simple.

Columns corresponding to subelements are rowset columns if either the subelement has its own subelements and/or attributes, or the subelement's parent has more than one instance of the subelement as a child. Otherwise, the column is simple.

When there are multiple instances of a subelement (under different parents), its column is a rowset column if any of the instances imply a rowset column; its column is simple only if all instances imply a simple column.

All rowsets have an additional column named $Text.
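The simple-versus-rowset decision in the rules above can be sketched in Java against a parsed document. This is an illustrative reading of the rules, not the data source object's code: it does not special-case the top-level element, the class and method names are hypothetical, and the <note> element in the usage example is invented to show a repeated subelement.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch of the column rules; not the DSO's implementation.
class ColumnRuleSketch {

    static boolean hasChildElements(Element e) {
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                return true;
            }
        }
        return false;
    }

    // Count elements with the same tag name under the same parent (including e).
    static int instancesUnderParent(Element e) {
        int count = 0;
        for (Node n = e.getParentNode().getFirstChild(); n != null;
                n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && n.getNodeName().equals(e.getNodeName())) {
                count++;
            }
        }
        return count;
    }

    // One instance implies a rowset column if it has subelements or
    // attributes, or if it is repeated under its parent.
    static boolean impliesRowset(Element e) {
        return hasChildElements(e) || e.hasAttributes()
            || instancesUnderParent(e) > 1;
    }

    // The column is a rowset column if any instance implies one.
    static boolean isRowsetColumn(Document doc, String tag) {
        NodeList instances = doc.getElementsByTagName(tag);
        for (int i = 0; i < instances.getLength(); i++) {
            if (impliesRowset((Element) instances.item(i))) {
                return true;
            }
        }
        return false;
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<parts><part><partnumber>A1000</partnumber>"
            + "<note>x</note><note>y</note></part>"
            + "<part><partnumber>S2300</partnumber></part></parts>");
        System.out.println(isRowsetColumn(doc, "part"));       // true
        System.out.println(isRowsetColumn(doc, "partnumber")); // false
        System.out.println(isRowsetColumn(doc, "note"));       // true
    }
}
```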

A simpler conversion takes place if you have set the JavaDSOCompatible flag to True. The JavaDSOCompatible flag makes the Internet Explorer 5.0 XML data source object compatible with the Java data source object supplied with Internet Explorer 4.0. To set the JavaDSOCompatible flag, you can use the XML element as follows.

<xml id="xmldata" JavaDSOCompatible=true>
...XML data
</xml>

Or, you can use the following with the OBJECT element (using the XML element is recommended).

<OBJECT width=0 height=0
    classid="clsid:550dda30-0541-11d2-9ca9-0060b0ec3d39"
    id="xmldso">
<PARAM NAME="JavaDSOCompatible" value="true">
</OBJECT>

The following method is used for creating rows and columns when JavaDSOCompatible is True.

Any element that contains another element is automatically a rowset.

Elements that contain only text are columns.

Values stored in attributes are ignored.

For more information about the Java XML data source object, see XML Data Source in the DHTML documentation.

Using DTDs

If you use a document type definition (DTD) with your XML, the XML data source object uses the following method for converting elements and attributes to rows and columns.

Each subelement and attribute named by the DTD corresponds to a column in some rowset in the hierarchy.

The name of the column is the same as the name of the subelement or attribute, unless the parent element has an attribute and a subelement with the same name, in which case a "!" is prepended to the subelement's column name.

Each column is either a simple column containing scalar values, usually strings, or a rowset column containing subrowsets.

Columns corresponding to attributes are always simple.

Columns corresponding to subelements are rowset columns if either the DTD allows the subelement to have its own subelements and/or attributes, or the DTD allows the subelement's parent to have more than one instance of the subelement as a child. Otherwise the column is simple.

All rowsets have an additional column named $Text.

Content corresponding to the content model "ANY" is not included in the rowset hierarchy.

XML Data-Islands

XML data-islands are used to pass data to the HTML Components (HTCs) used on many of the ASP pages in Commerce Server Business Desk. An XML data-island is data, described in XML, that is accessed through an id attribute associated with one of two different elements (xml and script) that might be contained in the ASP page. The data itself often exists in the ASP pages themselves, but need not.

An XML data-island can exist in an ASP page, either hard-coded or generated programmatically using ASP script, using one of the following three tagging mechanisms:

  • Within an xml element:
    <xml id='DataIslandID'>
        data-island XML can go here
    </xml>
    
  • Within a script element that includes the type attribute with the value "text/xml":
    <script type='text/xml' id='DataIslandID'>
        data-island XML can go here
    </script>
    
  • Within a script element that includes the language attribute with the value "xml":
    <script language='xml' id='DataIslandID'>
        data-island XML can go here
    </script>
    

The XML data comprising the data-island does not need to be specified on the ASP page itself. Any of the above mechanisms for embedding an XML data-island on an ASP page can use the src attribute to specify the URL of another file from which the contents of the data-island will be retrieved. Any of the following three single lines of code, in conjunction with the data-island XML contained in the specified source file, specify legitimate XML data-islands:

<xml    id='DataIslandID' src='XMLDataFileURL' />

or

<script id='DataIslandID' src='XMLDataFileURL' language='xml' />

or

<script id='DataIslandID' src='XMLDataFileURL' type='text/xml' />
Copyright © 1996–2000 Microsoft Corporation.
All rights reserved.
Microsoft XML 3.0 - XML Tutorial

Lesson 1: Authoring XML Elements

What is an XML element?

XML is a meta-markup language, a set of rules for creating semantic tags used to describe data. An XML element is made up of a start tag, an end tag, and data in between. The start and end tags describe the data within the tags, which is considered the value of the element. For example, the following XML element is a <director> element with the value "Matthew Dunn."

<director>Matthew Dunn</director>

The element name "director" allows you to mark up the value "Matthew Dunn" semantically, so you can differentiate that particular bit of data from another, similar bit of data. For example, there might be another element with the value "Matthew Dunn."

<actor>Matthew Dunn</actor>

Because each element has a different tag name, you can easily tell that one element refers to Matthew Dunn, the director, while the other refers to Matthew Dunn, the actor. If there were no way to mark up the data semantically, having two elements with the same value might cause confusion.

In addition, XML tags are case-sensitive, so the following are each a different element.

<City> <CITY> <city>

Attributes

An element can optionally contain one or more attributes. An attribute is a name-value pair separated by an equal sign (=).

<CITY ZIP="01085">Westfield</CITY>

In this example, ZIP="01085" is an attribute of the <CITY> element. Attributes are used to attach additional, secondary information to an element, usually meta information. Attributes can also accept default values, while elements cannot. Each attribute of an element can be specified only once, but in any order.

Check the syntax

Because XML is a highly structured language, it is important that all XML be well-formed. That is, every element must have both a start tag and an end tag, and must be authored using the proper syntax.

Lesson 2: Authoring XML Documents

What is an XML document?

A basic XML document is simply an XML element that may or may not include nested XML elements.

For example, the following <books> element is a well-formed XML document:

<books>
  <book isbn="0345374827">
    <title>The Great Shark Hunt</title>
    <author>Hunter S. Thompson</author>
  </book>
</books>

Authoring guidelines

There are some things to remember when constructing a basic XML document.

  • All elements must have an end tag (an empty element can use the combined form, such as <br/>).
  • All elements must be cleanly nested (overlapping elements are not allowed).
  • All attribute values must be enclosed in quotation marks.
  • Each document must have a unique first element, the root node.
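These guidelines can be checked mechanically by handing the text to any conforming XML parser. The sketch below assumes the JAXP DocumentBuilder bundled with the JDK and treats a parse failure as "not well-formed"; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for checking well-formedness.
class WellFormedSketch {

    // Returns true if the text parses as a well-formed XML document.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("parser unavailable", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed(
            "<books><book isbn=\"0345374827\"><title>x</title></book></books>")); // true
        System.out.println(isWellFormed("<a><b></a></b>")); // false: overlapping
        System.out.println(isWellFormed("<a x=1></a>"));    // false: unquoted value
        System.out.println(isWellFormed("<a/><b/>"));       // false: two roots
    }
}
```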

Lesson 3: Authoring XML Data Islands

What is an XML data island?

A data island is an XML document that exists within an HTML page. It allows you to script against the XML document without having to load it through script or through the <OBJECT> tag. Almost anything that can be in a well-formed XML document can be inside a data island.

The <XML> element marks the beginning of the data island, and its ID attribute provides a name that you can use to reference the data island.

The XML for a data island can be either inline:

<XML ID="XMLID">
  <customer>
    <name>Mark Hanson</name>
    <custID>81422</custID>
  </customer>
</XML>

or referenced through a SRC attribute on the <XML> tag:

<XML ID="XMLID" SRC="customer.xml"></XML>

You can also use the <SCRIPT> tag to create a data island.

<SCRIPT LANGUAGE="xml" ID="XMLID">
  <customer>
    <name>Mark Hanson</name>
    <custID>81422</custID>
  </customer>
</SCRIPT>

Authoring guidelines

Simply author an XML document, place that XML document within an <XML> element, and give that <XML> element an ID attribute.

Lesson 4: Using the XML Object Model

What is the XML object model?

The XML object model is a collection of objects that you use to access and manipulate the data stored in an XML document. The XML document is modeled after a tree, in which each element in the tree is considered a node. Objects with various properties and methods represent the tree and its nodes. Each node contains the actual data in the document.

How do I access the nodes in the tree?

You access nodes in the tree by scripting against their objects. These objects are created by the XML parser when it loads and parses the XML document. You reference the tree, or document object, by its ID value. In the following example, MyXMLDocument is the document object's ID value. The document object's properties and methods give you access to the root and child node objects of the tree. The root, or document element, is the top-level node from which its child nodes branch out to form the XML tree. The root node can appear in the document only once.

In the following data island, the root node is <class>, and its child node is <student>, which has child nodes of <name> and <GPA>.

<XML ID="MyXMLDocument">
  <class>
    <student studentID="13429">
      <name>Jane Smith</name>
      <GPA>3.8</GPA>
    </student>
  </class>
</XML>

The following list is a sample of the properties and methods that you use to access nodes in an XML document.

Property/Method    Description
XMLDocument        Returns a reference to the XML Document Object Model (DOM) exposed by the object.
documentElement    Returns the document root of the XML document.
childNodes         Returns a node list containing the children of a node (if any).
item               Accesses individual nodes within the list through an index. Index values are zero-based, so item(0) returns the first child node.
text               Returns the text content of the node.

The following code shows an HTML page containing an XML data island. The data island is contained within the <XML> element.

<HTML>
  <HEAD>
    <TITLE>HTML with XML Data Island</TITLE>
  </HEAD>
  <BODY>
    <P>Within this document is an XML data island.</P>
    <XML ID="resortXML">
      <resorts>
        <resort>Calinda Cabo Baja</resort>
        <resort>Na Balam Resort</resort>
      </resorts>
    </XML>
  </BODY>
</HTML>

You access the data island through the ID value, "resortXML", which becomes the name of the document object. In the preceding example, the root node is <resorts>, and the child nodes are <resort>.

The following code accesses the second child node of <resorts> and returns its text, "Na Balam Resort."

resortXML.XMLDocument.documentElement.childNodes.item(1).text
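For readers working in Java rather than script, a rough equivalent of this lookup with the standard org.w3c.dom interfaces is sketched here. One assumption to note: a default JAXP parser keeps the whitespace between elements as text nodes, so the sketch counts only element children instead of calling item(1) directly. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical Java counterpart to the one-line script lookup.
class ChildNodeSketch {

    // Text of the root's Nth child element (zero-based), or null if absent.
    static String childElementText(String xml, int index) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        NodeList children = doc.getDocumentElement().getChildNodes();
        int seen = 0;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // ignore whitespace-only text nodes
            }
            if (seen == index) {
                return child.getTextContent();
            }
            seen++;
        }
        return null;
    }

    public static void main(String[] args) {
        String xml = "<resorts>\n"
            + "  <resort>Calinda Cabo Baja</resort>\n"
            + "  <resort>Na Balam Resort</resort>\n"
            + "</resorts>";
        System.out.println(childElementText(xml, 1)); // Na Balam Resort
    }
}
```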

How do I persist XML DOM tree information?

Several methods and interfaces are available for persisting DOM information.

If you are using a script language, the DOMDocument object exposes the load, loadXML, and save methods, and the xml property.

For Microsoft® Visual Basic® and C or C++ programmers, the IXMLDOMDocument interface exposes the same members as the DOMDocument object. IXMLDOMDocument also implements standard COM interfaces such as IPersistStreamInit, IPersistMoniker, and IStream.

Lesson 6: Authoring XML Schemas

What is an XML Schema?

An XML Schema is an XML-based syntax for defining how an XML document is marked up. XML Schema is a schema specification recommended by Microsoft, and it has many advantages over document type definition (DTD), the initial schema specification for defining an XML model. DTDs have many drawbacks, including the use of non-XML syntax, no support for datatyping, and non-extensibility. For example, DTDs do not allow you to define element content as anything other than another element or a string. For more information about DTDs, see the World Wide Web Consortium (W3C) XML Recommendation. XML Schema improves upon DTDs in several ways, including the use of XML syntax, and support for datatyping and namespaces. For example, an XML Schema allows you to specify an element as an integer, a float, a Boolean, a URL, and so on.

The Microsoft® XML Parser (MSXML) in Microsoft Internet Explorer 5.0 and later can validate an XML document with both a DTD and an XML Schema.

How can I create an XML Schema?

The following XML document refers to its schema through the default namespace declared on the root element.

  <class xmlns="x-schema:classSchema.xml">
    <student studentID="13429">
      <name>James Smith</name>
      <GPA>3.8</GPA>
    </student>
  </class>

You'll notice in the preceding document that the default namespace is "x-schema:classSchema.xml." This tells the parser to validate the entire document against the schema (x-schema) at the following URL ("classSchema.xml").

The following is the entire schema for the preceding document. The schema begins with the <Schema> element, which contains the declaration of the schema namespace and, in this case, the declaration of the "datatypes" namespace as well. The first declaration, xmlns="urn:schemas-microsoft-com:xml-data", indicates that this XML document is an XML Schema. The second, xmlns:dt="urn:schemas-microsoft-com:datatypes", allows you to type element and attribute content by using the dt prefix on the type attribute within their ElementType and AttributeType declarations.

<Schema xmlns="urn:schemas-microsoft-com:xml-data"
  xmlns:dt="urn:schemas-microsoft-com:datatypes">
  <AttributeType name='studentID' dt:type='string' required='yes'/>
  <ElementType name='name' content='textOnly'/>
  <ElementType name='GPA' content='textOnly' dt:type='float'/>
  <ElementType name='student' content='mixed'>
    <attribute type='studentID'/>
    <element type='name'/>
    <element type='GPA'/>
  </ElementType>
  <ElementType name='class' content='eltOnly'>
    <element type='student'/>
  </ElementType>
</Schema>

The declaration elements that you use to define elements and attributes are described as follows.

Element        Description
ElementType    Assigns a type and conditions to an element, and what, if any, child elements it can contain.
AttributeType  Assigns a type and conditions to an attribute.
attribute      Declares that a previously defined attribute type can appear within the scope of the named ElementType element.
element        Declares that a previously defined element type can appear within the scope of the named ElementType element.

The content of the schema begins with the AttributeType and ElementType declarations of the innermost elements.

<AttributeType name='studentID' dt:type='string' required='yes'/>
<ElementType name='name' content='textOnly'/>
<ElementType name='GPA' content='textOnly' dt:type='float'/>

The next ElementType declaration contains declarations for its attribute and child elements. When an element has attributes or child elements, they must be included in its ElementType declaration in this way. Each must also have been declared previously in its own AttributeType or ElementType declaration.

<ElementType name='student' content='mixed'>
  <attribute type='studentID'/>
  <element type='name'/>
  <element type='GPA'/>
</ElementType>

This process is continued throughout the rest of the schema until every element and attribute has been declared.

Unlike DTDs, XML Schemas support an open content model, which lets you do such things as type elements and apply default values without necessarily restricting content.

In the following schema, the <GPA> element is typed and has an attribute with a default value, but no other nodes are declared within the <student> element.

<Schema xmlns="urn:schemas-microsoft-com:xml-data"
  xmlns:dt="urn:schemas-microsoft-com:datatypes">
  <AttributeType name="scale" default="4.0"/>
  <ElementType name="GPA" content="textOnly" dt:type="float">
    <attribute type="scale"/>
  </ElementType>
  <AttributeType name="studentID"/>
  <ElementType name="student" content="eltOnly" model="open" order="many">
    <attribute type="studentID"/>
    <element type="GPA"/>
  </ElementType>
</Schema>

The preceding schema allows you to validate only the area with which you are concerned. This gives you more control over the level of validation for your document and allows you to use some of the features provided by the schema without having to employ strict validation.

Try it!

Try authoring a schema for the following XML document.

<order>
  <customer>
    <name>Fidelma McGinn</name>
    <phone_number>425-655-3393</phone_number>
  </customer>
  <item>
    <number>5523918</number>
    <description>shovel</description>
    <price>39.99</price>
  </item>
  <date_of_purchase>1998-10-23</date_of_purchase>
  <date_of_delivery>1998-11-03</date_of_delivery>
</order>

After you have completed the schema, run it through the XML Validator.

MSDN® Online Downloads provides a set of XML sample files, including an XML document with an accompanying schema. Download these samples to work with the XML document and the schema. To test the validity of your XML against a schema, you can load the document through the XML Validator or simply view the XML file in the MIMETYPE Viewer.

The following are some considerations.

  • ElementType and AttributeType declarations must precede attribute and element content declarations that refer to these types. For example, in the preceding schema, the ElementType declaration for the <GPA> element must precede the ElementType declaration for the <student> element.
  • The default value of the order attribute depends on the value of the content attribute. When content is set to "eltOnly", order defaults to "seq". When content is set to "mixed", order defaults to "many". For more information about these default values, see the XML Schema Reference.
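Both considerations can be expressed as small checks. The following is a rough sketch, not part of any parser: it models a schema as a plain array of declarations in document order (a hypothetical representation chosen for illustration), and the function names defaultOrder and findForwardReferences are likewise assumptions. In this simplified model, an element type that refers to itself is reported as a forward reference.

```javascript
// Default value of the order attribute, covering only the two cases
// stated in the text above.
function defaultOrder(content) {
  if (content === "eltOnly") { return "seq"; }
  if (content === "mixed") { return "many"; }
  return undefined; // other content values are not covered here
}

// declarations: array of { kind, name, refs } in document order, where kind is
// "ElementType" or "AttributeType" and refs lists the element and attribute
// references found inside an ElementType.
function findForwardReferences(declarations) {
  var seen = { ElementType: Object.create(null), AttributeType: Object.create(null) };
  var problems = [];
  declarations.forEach(function (decl) {
    (decl.refs || []).forEach(function (ref) {
      var needed = ref.kind === "element" ? "ElementType" : "AttributeType";
      if (!seen[needed][ref.type]) {
        problems.push(decl.name + " refers to " + ref.type +
          " before its " + needed + " declaration");
      }
    });
    seen[decl.kind][decl.name] = true;
  });
  return problems;
}

// The class schema shown earlier, modeled in document order.
var classSchema = [
  { kind: "AttributeType", name: "studentID" },
  { kind: "ElementType", name: "name" },
  { kind: "ElementType", name: "GPA" },
  { kind: "ElementType", name: "student", refs: [
    { kind: "attribute", type: "studentID" },
    { kind: "element", type: "name" },
    { kind: "element", type: "GPA" }
  ] },
  { kind: "ElementType", name: "class", refs: [{ kind: "element", type: "student" }] }
];

console.log(findForwardReferences(classSchema).length); // 0
```

Reversing the array makes the same function report the references that now precede their declarations.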

Reasons for Namespaces

The appeal of XML lies in the ability to invent tags that convey meaningful information. For example, XML allows you to represent information about a book in the following way.

<BOOK>
 <TITLE>XML Developer's Guide</TITLE>
 <PRICE currency="US Dollar">44.95</PRICE>
</BOOK>

Similarly, you can represent information about an author in the following way.

<AUTHOR>
 <TITLE>Ms</TITLE>
 <NAME>Ambercrombie Kim</NAME>
</AUTHOR>

Although the human reader can distinguish between the different interpretations of the TITLE element, a computer program does not have the context to tell them apart. Without additional information it cannot tell that the first TITLE element is intended to refer to a string representing the title of the book, and that the second element refers to an enumeration representing the title of the author: "Mr.," "Ms.," "Mrs.," and so on.

Namespaces solve this problem by associating a vocabulary (or namespace) with a tag name. For example, the titles can be written as follows:

<BookInfo:TITLE xmlns:BookInfo="books-namespace-URI">XML Developer's Guide</BookInfo:TITLE>
<AuthorInfo:TITLE xmlns:AuthorInfo="authors-namespace-URI">Ms.</AuthorInfo:TITLE>

The name preceding the colon, the prefix, maps to an XML namespace identified by a Uniform Resource Identifier (URI). The namespace ensures global uniqueness when merging XML sources, while the associated prefix (a short name that substitutes for the namespace's URI) must be unique only within the tightly scoped context of the document. With this scheme, no conflicts exist between tags and attributes: two tags are the same only if they are from the same namespace and have the same tag name. This allows a document to contain both book and author information without confusion about whether the TITLE element refers to the book or the author. If a computer program wanted to display the name of a book in a user interface, it would use the object model to look for the TITLE element of the "BookInfo" namespace.
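The comparison rule can be sketched in a few lines of script. This is an illustration only, not how any particular parser is implemented. The helper names expandName and sameName are hypothetical, and the prefix declarations are supplied as a plain object instead of being read from a parsed document.

```javascript
// Split a qualified name such as "BookInfo:TITLE" and look up its prefix
// in the declarations that are in scope.
function expandName(qname, declarations) {
  var colon = qname.indexOf(":");
  var prefix = colon === -1 ? "" : qname.slice(0, colon);
  var local = colon === -1 ? qname : qname.slice(colon + 1);
  var declared = Object.prototype.hasOwnProperty.call(declarations, prefix);
  return { uri: declared ? declarations[prefix] : null, local: local };
}

// Two names match only if both the namespace URI and the local name match.
function sameName(a, b) {
  return a.uri === b.uri && a.local === b.local;
}

var inScope = {
  BookInfo: "books-namespace-URI",
  AuthorInfo: "authors-namespace-URI"
};
var bookTitle = expandName("BookInfo:TITLE", inScope);
var authorTitle = expandName("AuthorInfo:TITLE", inScope);

// Same tag name, different namespaces.
console.log(sameName(bookTitle, authorTitle)); // false

// A different prefix bound to the same URI names the same element.
var otherDocument = { b: "books-namespace-URI" };
console.log(sameName(bookTitle, expandName("b:TITLE", otherDocument))); // true
```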

For more information about namespaces, see the World Wide Web Consortium (W3C) Namespaces in XML Recommendation.

XML Element | xml Object

Defines an XML data island on an HTML page.

Members Table

Attribute  Property           Description
           canHaveHTML        Retrieves the value indicating whether the object can contain rich HTML markup.
ID         id                 Retrieves the string identifying the object.
           isContentEditable  Retrieves the value indicating whether the user can edit the contents of the object.
           isDisabled         Retrieves the value indicating whether the user can interact with the object.
           isMultiLine        Retrieves the value indicating whether the content of the object contains one or more lines.
           parentElement      Retrieves the parent object in the object hierarchy.
           readyState         Retrieves the current state of the object.
           recordset          Sets or retrieves from a data source object a reference to the default record set.
           scopeName          Retrieves the namespace defined for the element.
SRC        src                Sets or retrieves a URL to be loaded by the object.
           tagUrn             Sets or retrieves the Uniform Resource Name (URN) specified in the namespace declaration.
           XMLDocument        Retrieves a reference to the XML Document Object Model (DOM) exposed by the object.

Behavior    Description
clientCaps  Provides information about features supported by Microsoft® Internet Explorer, as well as a way for installing browser components on demand.
download    Downloads a file and notifies a specified callback function when the download is complete.
homePage    Contains information about a user's homepage.

Collection    Description
behaviorUrns  Returns a collection of Uniform Resource Name (URN) strings identifying the behaviors attached to the element.

Event               Description
ondataavailable     Fires periodically as data arrives from data source objects that asynchronously transmit their data.
ondatasetchanged    Fires when the data set exposed by a data source object changes.
ondatasetcomplete   Fires to indicate that all data is available from the data source object.
onreadystatechange  Fires when the state of the object has changed.
onrowenter          Fires to indicate that the current row has changed in the data source and new data values are available on the object.
onrowexit           Fires just before the data source control changes the current row in the object.
onrowsdelete        Fires when rows are about to be deleted from the recordset.
onrowsinserted      Fires just after new rows are inserted in the current recordset.

Method               Description
addBehavior          Attaches a behavior to the element.
componentFromPoint   Returns the component located at the specified coordinates via certain events.
fireEvent            Fires a specified event on the object.
getAttributeNode     Retrieves an attribute object referenced by the attribute.name property.
namedRecordset       Retrieves the recordset object corresponding to the named data member from a data source object (DSO).
normalize            Merges adjacent TextNode objects to produce a normalized document object model.
removeAttributeNode  Removes an attribute object from the object.
removeBehavior       Detaches a behavior from the element.
setAttributeNode     Sets an attribute object node as part of the object.

Style Attribute          Style Property         Description
behavior                 behavior               Sets or retrieves the location of the DHTML Behaviors.
text-autospace           textAutospace          Sets or retrieves the autospacing and narrow space width adjustment of text.
text-underline-position  textUnderlinePosition  Sets or retrieves the position of the underline decoration that is set through the textDecoration property of the object.

Remarks

The readyState property of the XML element, available as a string value, corresponds to the readyState property of the XMLDOMDocument object, which is available as a long value. The string values correspond to the long values of the XML document object's property as shown in the examples section.
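The correspondence can be written as a small lookup table. This is a sketch under assumptions: only the pairing of "complete" with 4 is confirmed by the examples on this page. The other four names and their numeric positions are the values commonly documented for Internet Explorer and should be checked against the reference. The names READY_STATE_NAMES, readyStateToString, and readyStateToLong are hypothetical.

```javascript
// Index = long value reported by the XML document object,
// value = string reported by the XML element.
// Only index 4 ("complete") is confirmed by this page; the rest are assumed.
var READY_STATE_NAMES = ["uninitialized", "loading", "loaded", "interactive", "complete"];

// Convert the long value to the string form; undefined if out of range.
function readyStateToString(value) {
  return READY_STATE_NAMES[value];
}

// Convert the string form to the long value; -1 if the name is unknown.
function readyStateToLong(name) {
  return READY_STATE_NAMES.indexOf(name);
}

console.log(readyStateToLong("complete")); // 4
console.log(readyStateToString(4));        // complete
```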

The XMLDocument property is the default property.

This element is available in HTML and script as of Internet Explorer 5.

This element is not rendered.

This element requires a closing tag.

Examples

This example uses the XML element to define a simple XML data island that can be embedded directly in an HTML page.

<XML ID="oMetaData">
  <METADATA>
     <AUTHOR>John Smith</AUTHOR>
     <GENERATOR>Visual Notepad</GENERATOR>
     <PAGETYPE>Reference</PAGETYPE>
     <ABSTRACT>Specifies a data island</ABSTRACT>
  </METADATA>
</XML>

This example uses the readyState property of the xml object to determine whether the XML data island is completely downloaded.

  if (oMetaData.readyState == "complete")
      window.alert ("The XML document is ready.");

This example uses the readyState property of the XMLDOMDocument object to determine whether the XML data island is completely downloaded.

  if (oMetaData.XMLDocument.readyState == 4)
      window.alert ("The XML document is ready.");

This script example retrieves the text contained within the ABSTRACT field of the data island.

   var oNode = oMetaData.XMLDocument.selectSingleNode("METADATA/ABSTRACT");
   alert(oNode.text);

Standards Information

This object is a Microsoft extension to HTML.

Formatting XML Documents

While Microsoft® Internet Explorer 5.0 provides a default style sheet that is useful for exploring document structures, some applications must present the information stored in XML documents directly to the user, without relying on scripts or additional processing code. Internet Explorer 5.0 and later provides support for cascading style sheets (CSS), and the Microsoft XML Parser (MSXML) provides enhanced support for the Extensible Stylesheet Language (XSL).

XML and Cascading Style Sheets

Cascading style sheets allow developers to describe formatting that should be applied to document structures. Cascading style sheets let document designers make statements like "all paragraph elements should be formatted as separate blocks, with a 24-pixel indent on the first line" or "all price elements should be presented in bold green 12-point sans-serif type except price elements with a status attribute of sale, which should be presented in red."

Cascading style sheets is commonly described as an annotative style language, meaning that it adds formatting information to the document tree rather than changing the document itself. Its lists of formatting rules can be combined or overridden to provide multiple layers of formatting information appropriate to different document variations without requiring change to the document structure itself.

XML, XSL, and XSLT

XSL provides a set of tools for transforming documents from their original labeled document structure to a new structure that can be used for presentation. Typically, developers create transformations from an XML vocabulary to HTML, though a 'formatting object' vocabulary is under development at the World Wide Web Consortium (W3C).

XSL is commonly described as a transformative style language. Instead of adding information to the original document structure, it creates a new document structure based on rules applied to the content of the original. The transformation language used by XSL, XSL Transformations (XSLT), allows developers to create sophisticated templates detailing how the original document's information should be presented in the new document, known as the result document.

Declaring Namespaces

Many XML parsers, including the Microsoft® XML Parser (MSXML), provide support for associating elements and attributes with namespaces. You can use a default declaration or explicit declaration to declare namespaces. In both cases, the declaration associates a Uniform Resource Identifier (URI) with particular element and attribute names.

Choosing Namespace URIs

Developers creating new XML vocabularies may need to choose URIs for use as namespace identifiers. The following are general guidelines for creating new vocabularies that need namespace identifiers.

  • Use namespace URIs that you control.

    While the Namespaces in XML Recommendation does not prohibit borrowing, do not create a new namespace using a URI you do not control.

  • Use URIs that are persistent.

    Although generally maintained, domain names do expire. Other URI facilities, such as Uniform Resource Names (URNs) and Permanent URLs (PURLs), guarantee persistence beyond the domain name infrastructure.

  • Use URIs that consistently point to the same location.

    While the Namespaces in XML Recommendation does not prohibit the use of relative URI references in namespace identifiers, their use is largely undefined.

  • Identify and describe the namespace URI in the documentation for your vocabulary.

Default Declaration

The default declaration declares a namespace for the element where the declaration appears and for all unprefixed elements it contains. The following example declares the catalog element and all unprefixed elements within it to be members of the namespace "http://www.example.com/catalog/". (Under the Namespaces in XML Recommendation, a default declaration does not apply to unprefixed attributes such as id.) The attribute name xmlns is reserved and is understood by namespace-aware XML parsers, including MSXML, to be a namespace declaration:

<catalog xmlns="http://www.example.com/catalog/">
  <book id="bk101">
     ...
  </book>
  <book id="bk109">
     ...
  </book>
</catalog>

Default declarations are commonly used when a document contains only elements from a particular namespace or when one namespace effectively dominates all others.

Explicit Declaration

An explicit declaration defines a shorthand, or prefix, to substitute for the full name of a namespace. Use an explicit declaration to reference a node from a namespace separate from your default namespace.

If the catalog example had to represent its currency values using an element from a different namespace, it might include both a declaration for the catalog namespace and an explicit declaration for the namespace of the element describing the prices.

Explicit declarations of namespace prefixes use attribute names beginning with xmlns: followed by the prefix. The value of the attribute is the namespace URI.

The following example declares "cat" and "money" to be shorthand for the full names of their respective namespaces. All elements beginning with "cat:" or "money:" are considered to be from the namespace "http://www.example.com/catalog/" or "http://www.example.com/currency/", respectively.

<cat:catalog xmlns:cat="http://www.example.com/catalog/"
   xmlns:money="http://www.example.com/currency/">
  <cat:book id="bk101">
     <cat:author>&#71;ambardella, Matthew</cat:author>
     <cat:title>XML Developer's &#x47;uide</cat:title>
     <cat:genre>Computer</cat:genre>
     <money:price>44.95</money:price>
     <cat:publish_date>2000-10-01</cat:publish_date>
     <cat:description><![CDATA[An in-depth look at creating applications with
        XML, using <, >,]]> and &amp;.</cat:description>
  </cat:book>
  <cat:book id="bk109">
     <cat:author>Kress, Peter</cat:author>
     <cat:title>Paradox Lost</cat:title>
     <cat:genre>Science Fiction</cat:genre>
     <money:price>6.95</money:price>
     <cat:publish_date>2000-11-02</cat:publish_date>
     <cat:description>After an inadvertent trip through a Heisenberg Uncertainty
        Device, James Salway discovers the problems of being quantum.</cat:description>
  </cat:book>
</cat:catalog>

Explicit declarations are useful when a node contains elements from different namespaces. You can also mix default namespace declarations with explicit namespace declarations. The following example is identical to the previous example, except that all of the elements except price use the default namespace declaration, avoiding a lot of repeated cat: prefixes.

<catalog xmlns="http://www.example.com/catalog/"
   xmlns:money="http://www.example.com/currency/">
  <book id="bk101">
     <author>&#71;ambardella, Matthew</author>
     <title>XML Developer's &#x47;uide</title>
     <genre>Computer</genre>
     <money:price>44.95</money:price>
     <publish_date>2000-10-01</publish_date>
     <description><![CDATA[An in-depth look at creating applications with XML,
        using <, >,]]> and &amp;.</description>
  </book>
  <book id="bk109">
     <author>Kress, Peter</author>
     <title>Paradox Lost</title>
     <genre>Science Fiction</genre>
     <money:price>6.95</money:price>
     <publish_date>2000-11-02</publish_date>
     <description>After an inadvertent trip through a Heisenberg Uncertainty Device,
        James Salway discovers the problems of being quantum.</description>
  </book>
</catalog>

All of the elements except price are associated with the namespace URI "http://www.example.com/catalog/"; price is associated with "http://www.example.com/currency/".
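To see why the prefixed and the mixed forms of the catalog describe the same elements, it can help to resolve a few names by hand. The following sketch is illustrative only. The function resolveName and the "{uri}local" notation for the result are assumptions made for this example, and the declarations in scope are written out as plain objects, with the empty string standing for the default declaration. It also reflects the rule in the Namespaces in XML Recommendation that an unprefixed attribute does not take the default namespace.

```javascript
// Resolve an element or attribute name to "{namespace URI}localName".
function resolveName(qname, declarations, isAttribute) {
  var colon = qname.indexOf(":");
  if (colon !== -1) {
    var prefix = qname.slice(0, colon);
    if (!Object.prototype.hasOwnProperty.call(declarations, prefix)) {
      throw new Error("Undeclared prefix: " + prefix);
    }
    return "{" + declarations[prefix] + "}" + qname.slice(colon + 1);
  }
  // Unprefixed: elements take the default namespace (key ""), attributes take none.
  var uri = isAttribute ? "" : (declarations[""] || "");
  return "{" + uri + "}" + qname;
}

// Declarations in scope in the fully prefixed example.
var prefixedForm = {
  cat: "http://www.example.com/catalog/",
  money: "http://www.example.com/currency/"
};
// Declarations in scope in the example that mixes default and explicit forms.
var mixedForm = {
  "": "http://www.example.com/catalog/",
  money: "http://www.example.com/currency/"
};

// cat:book and the unprefixed book resolve to the same name.
console.log(resolveName("cat:book", prefixedForm, false) ===
            resolveName("book", mixedForm, false)); // true
console.log(resolveName("money:price", mixedForm, false)); // {http://www.example.com/currency/}price
console.log(resolveName("id", mixedForm, true)); // {}id
```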

Displaying XML Data Islands with JavaScript

Bind Data Fields to the Data Island

Binding data fields to the data island is also relatively simple. Binding is the term for creating an automatic connection between a data source and a data consumer. To bind the source to the consumer, you must specify the name of the source object and a column name. A bound data consumer's contents update automatically as you move through the source data. For this display-only application, span tags are appropriate. Just add two new attributes to the <span> element, for example:

   <span dataSrc="#data" dataFld="lastName"></span>

The dataSrc attribute indicates which data island contains the data for the <span> tag, and the dataFld attribute contains the name of the specific data field containing the <span> element data. Note that you set the dataSrc attribute value to the <xml> element's id attribute preceded by the pound (#) sign.

That's it! IE5 takes care of the internal plumbing required to synchronize the XML DSO and the <span> element. When the page loads, the bound <span> elements display the indicated data fields of the first record (node) in the XML data tree.

Here's the code that binds the fields of the data island in the Employee Directory (see Figure 1):

   <!DOCTYPE html PUBLIC
      "-//W3C//DTD XHTML 1.0 Strict//EN"
      "DTD/xhtml1-strict.dtd">
   <html
      xmlns="http://www.w3.org/1999/xhtml"
      xml:lang="en" lang="en">
   <head>
   <title>XML Data Binding Demo</title>
   <script src="xmlNav.js">
   </script>
   </head>
   <body>
   <xml id="data" src="data.xml"></xml>
   <h2 style="color: navy;
      font-family:
      Verdana,sans-serif">
      Employee Directory
   </h2>
   <table border="0"
      style="font: 10pt Verdana,sans-serif;
      color: navy; background-color:
      rgb(255,255,200)"
      cellpadding="5">
      <tr>
         <td>Last Name </td>
         <td>
            <span style="background: white;
            width:150; border: inset;
            border-width:1" dataSrc="#data"
            dataFld="lastName"></span>
         </td>
         <td>First Name </td>
         <td><span style="background: white;
            width:150; border: inset;
            border-width:1" dataSrc="#data"
            dataFld="firstName"></span>
         </td>
      </tr>
      <tr>
         <td>Title </td>
         <td><span style="background: white;
            width:150; border: inset;
            border-width:1" dataSrc="#data"
            dataFld="title"></span>
         </td>
         <td>Department </td>
         <td>
            <span style="background: white;
            width:150; border: inset;
            border-width:1" dataSrc="#data"
            dataFld="department"></span>
         </td>
      </tr>
      <tr>
         <td>Extension </td>
         <td>
            <span style="background: white;
            width:150; border: inset;
            border-width:1" dataSrc="#data"
            dataFld="extension"></span>
         </td>
         <td>
            <input type="button"
               value="|&lt;" onclick="first()" />
            <input type="button"
               value="&lt;" onclick="previous()" />
            <input type="button"
               value="&gt;" onclick="next()" />
            <input type="button"
               value="&gt;|" onclick="last()" />
         </td>
      </tr>
   </table>
   </body>
   </html>
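
The page above loads xmlNav.js and calls first(), previous(), next(), and last() from its buttons, but the script itself is not shown in this section. The following is a minimal sketch of what such a script might do, not the article's actual code. It assumes the data island's recordset property behaves like an ADO-style recordset with moveFirst, moveLast, moveNext, and movePrevious methods and BOF and EOF properties. The names makeNavigator and makeMockRecordset are hypothetical, and the mock exists only so the logic can be run outside Internet Explorer.

```javascript
// Build the four handlers around an ADO-style recordset. In the page above
// this would be the recordset property of the data island (data.recordset).
function makeNavigator(recordset) {
  return {
    first: function () { recordset.moveFirst(); },
    last: function () { recordset.moveLast(); },
    next: function () {
      recordset.moveNext();
      // Moving past the last record sets EOF; step back so the bound
      // fields never go blank.
      if (recordset.EOF) { recordset.moveLast(); }
    },
    previous: function () {
      recordset.movePrevious();
      // Moving before the first record sets BOF; step forward again.
      if (recordset.BOF) { recordset.moveFirst(); }
    }
  };
}

// Hypothetical in-memory stand-in for the recordset. currentRow() is a
// helper for this sketch and is not part of the real object.
function makeMockRecordset(rows) {
  var pos = 0;
  return {
    get BOF() { return pos < 0; },
    get EOF() { return pos >= rows.length; },
    moveFirst: function () { pos = 0; },
    moveLast: function () { pos = rows.length - 1; },
    moveNext: function () { pos += 1; },
    movePrevious: function () { pos -= 1; },
    currentRow: function () { return rows[pos]; }
  };
}

var demoRecordset = makeMockRecordset(["Adams", "Baker", "Clark"]);
var demoNavigator = makeNavigator(demoRecordset);
demoNavigator.next();
console.log(demoRecordset.currentRow()); // Baker
```

In the page itself, the global first(), previous(), next(), and last() functions that the buttons call could each delegate to a navigator created from data.recordset once the page has loaded.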