XML: Beyond basics

CIOL Bureau

Every XML document should start with a declaration that tells the browser
that the document is XML and which version is being used:
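
For a document written against XML 1.0, the declaration typically looks like the line below. The encoding attribute is optional and is included here only as an illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
```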

This is the XML declaration. While it is not required, its presence
explicitly identifies the document as an XML document and indicates the version
of XML for which it was written.

If you are familiar with languages like HTML or SGML, you should be able to
follow these sessions on XML fairly easily. Otherwise, I recommend you go
through an HTML tutorial first.

Unlike SGML, XML does not require a document type declaration. However, one
can be supplied, and some documents will require one in order to be understood
unambiguously. (A DTD, or document type definition, specifies a set of elements,
their relationships, and the tag set used to mark up the document. The markup
within the document thus follows a 'model' described in the DTD.) Every
SGML-based markup language, HTML included, is defined by a DTD, which defines
the language by the elements that are permitted. In HTML, for example, we
usually include a declaration that tells the browser which DTD to use:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

In an XML document, by contrast, the DTD can be user-defined, and the
document can carry its own DTD inline, like this:

<!DOCTYPE entertainment [
<< Elements can be declared here >>
]>

The DTD tells the browser which elements are allowed and how they may be
nested inside one another. It can also declare attributes and entities, and an
external DTD may contain conditional sections.

Now, let us see how to declare elements for the document. The <!ELEMENT>
declaration does the job. Element type declarations identify the names of
elements and the nature of their content. A typical element type declaration
looks like this:
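
A declaration matching the description in this section, using the element names that appear in the text, could be written as:

```xml
<!ELEMENT movies (Hindi+, English?)>
```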

This declaration identifies the element named "movies". Its content
model follows the element name. The content model defines what an element may
contain. In this case, "movies" must contain "Hindi" and may contain
"English". The commas between element names indicate that they must
occur in the order given. When declaring elements, you can constrain how often
each child may occur by using certain symbols. Let us look at them:

The "+" symbol

The plus after "Hindi" indicates that it must occur at least once and
may be repeated.

The "?" symbol

The question mark after "English" indicates that it is optional (it
may be absent, or it may occur exactly once). A name with no symbol after it
must occur exactly once.
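
To see "+" and "?" at work, here is a small sketch of a complete document with an inline DTD. The declarations for "Hindi" and "English" as plain-text elements, and the placeholder titles, are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE movies [
  <!-- "Hindi" must occur one or more times; "English" at most once -->
  <!ELEMENT movies (Hindi+, English?)>
  <!-- Assumption: both children hold plain text -->
  <!ELEMENT Hindi (#PCDATA)>
  <!ELEMENT English (#PCDATA)>
]>
<movies>
  <Hindi>First Hindi title</Hindi>
  <Hindi>Second Hindi title</Hindi>
  <English>An English title</English>
</movies>
```

Removing the "English" element would leave the document valid; removing both "Hindi" elements, or adding a second "English" element, would not.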

The "*" symbol

The asterisk indicates that the content is optional and repeatable (it may
occur zero or more times). If you want "movies" to contain zero or
more "Hindi" elements, you can write:
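
One way to write this, keeping the element names used so far, is:

```xml
<!ELEMENT movies (Hindi*)>
```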

Two other content models are possible: EMPTY indicates that the element has
no content (in XML it is written either as an empty-element tag or as a
start-tag immediately followed by an end-tag), and ANY indicates that
"any" content is allowed. For example:
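
A short sketch of both forms follows. The element names "poster" and "notes" are hypothetical and do not come from the article.

```xml
<!-- "poster" and "notes" are hypothetical element names -->
<!-- An EMPTY element carries no content -->
<!ELEMENT poster EMPTY>
<!-- An ANY element may hold text and any declared elements -->
<!ELEMENT notes ANY>
```

In a document, the empty element would then appear as `<poster/>` or as `<poster></poster>`.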

#PCDATA

In addition to element names, the special symbol #PCDATA is reserved to
indicate character data. The short form PCDATA stands for parsed character
data. Elements whose content model contains only element names are said to have
element content, and elements whose content model includes #PCDATA are said to
have "mixed content". Let us take an example in which "movies" is declared with
the content model (#PCDATA | Indian)*.

The vertical bar indicates an "or" relationship, so the names can
be mixed in any order. The asterisk indicates that the content is optional, as
defined above, and may occur zero or more times. By the above definition,
"movies" may contain any amount of character data and any number of
"Indian" elements, mixed in any order. When #PCDATA is used together with
element names in a content model, it must come first, all of the names must be
separated by vertical bars, and the entire group must be followed by an
asterisk.
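
Putting these rules together, a mixed-content declaration for "movies" and a document that satisfies it might look like the sketch below. The declaration of "Indian" as a text-only element and the sample text are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE movies [
  <!-- #PCDATA comes first, names are separated by "|", group ends with "*" -->
  <!ELEMENT movies (#PCDATA | Indian)*>
  <!-- Assumption: "Indian" itself holds only text -->
  <!ELEMENT Indian (#PCDATA)>
]>
<movies>
  Some introductory text,
  <Indian>a title</Indian>
  more text, then
  <Indian>another title</Indian>
</movies>
```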
