By Chuck Mosher
Introduction
Today's Internet-driven economy has accelerated users' expectations for unfettered access to information resources and transparent data exchange among applications. This exchange requires a deep knowledge of how each application, tool, or service structures and interprets the information it uses. The term metadata is used to describe both how this information is structured (its syntax) and what it means (its semantics).
Unfortunately, most applications define metadata differently: each uses slightly different programming structures, syntax, and semantics to model it. These incompatibilities make it challenging for one application to discover and interact with the data maintained by another. To overcome this hurdle, companies typically spend considerable developer time and money building hard-wired interfaces between applications so they can share data in sophisticated ways. This costly and ongoing effort is the result of a lack of common metadata -- or more specifically, a common model for creating metadata (a metamodel). This lack stifles the development of robust solutions to common business problems that require combining or using data from multiple, heterogeneous applications. Examples of such business problems include:
- Enterprise application integration
- Data warehousing
- Business intelligence
- Business-to-business exchanges
- Information portals
- Software development
Difficulties in managing metadata limit developers' ability to create integrated, interoperable applications in these and other domains. Standardizing on XML DTDs or XML Schemas, as many industries are attempting to do, is insufficient, because neither can model complex, semantically rich, hierarchical metadata. The problem becomes more critical as the number and complexity of applications, services, data sources, interchange protocols, and deployment platforms continue to increase in the e-economy.
To facilitate metadata interoperability, a host of vendors have teamed up through the Java Community Process (JCP) to provide a platform-independent specification for metadata. The result is the Java Metadata Interface (JMI) specification, developed as JSR-40 on the JCP website at jcp.org. The JMI specification defines a dynamic, platform-neutral infrastructure that enables the creation, storage, access, discovery, and exchange of metadata using Java interfaces. JMI is based on the Meta Object Facility (MOF) specification from the Object Management Group (OMG), an industry-endorsed standard for metadata management.
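To give a flavor of the approach, here is a hedged sketch of the kind of typed Java interfaces a JMI binding generates from a MOF model. The Table/Column model and all class names are hypothetical illustrations; real JMI-generated interfaces also extend the reflective base types in javax.jmi.reflect (RefObject and friends), and their implementations are supplied by a MOF repository rather than written by hand. Plain interfaces and trivial in-memory classes are used here so the sketch is self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical JMI-style interfaces for a toy MOF model of a
// relational schema. Real generated interfaces would also extend
// javax.jmi.reflect.RefObject; plain interfaces keep this runnable.
interface Column {
    String getName();
    String getType();
}

interface Table {
    String getName();
    List<Column> getColumn(); // MOF-style multivalued attribute
}

// Trivial in-memory implementations, standing in for the classes a
// MOF repository implementation would normally supply.
class SimpleColumn implements Column {
    private final String name;
    private final String type;
    SimpleColumn(String name, String type) { this.name = name; this.type = type; }
    public String getName() { return name; }
    public String getType() { return type; }
}

class SimpleTable implements Table {
    private final String name;
    private final List<Column> columns = new ArrayList<>();
    SimpleTable(String name) { this.name = name; }
    public String getName() { return name; }
    public List<Column> getColumn() { return columns; }
}

public class JmiSketch {
    public static void main(String[] args) {
        SimpleTable orders = new SimpleTable("ORDERS");
        orders.getColumn().add(new SimpleColumn("ORDER_ID", "INTEGER"));
        orders.getColumn().add(new SimpleColumn("TOTAL", "DECIMAL"));
        // A client discovers the schema metadata through typed accessors.
        for (Column c : orders.getColumn()) {
            System.out.println(orders.getName() + "." + c.getName() + " : " + c.getType());
        }
    }
}
```

The point of the generated interfaces is that any tool can navigate another tool's metadata through the same typed (or, via javax.jmi.reflect, fully dynamic) API, instead of parsing a proprietary format.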
This paper describes what metadata is, why its management is so important, what is required of a metadata management infrastructure, and how JMI satisfies these requirements.
What is Metadata?
Metadata can be defined as information about data, or simply data about data. In practice, metadata is what tools, databases, applications, and other information services use to define the structure and meaning of their objects, services, and other computing artifacts. Examples of metadata include:
- A database schema, which describes not only how data entries are laid out in a relational database, but what they mean.
- The meaning of object attributes and methods in a Java class file.
- An Enterprise JavaBean (EJB) deployment descriptor, which describes the metadata an application server needs in order to deploy and use the EJB.
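The class-file example can be seen directly with the JDK's reflection API, which reads back the structural side of a class's metadata (field names and types, method names) at run time. The Account class below is a made-up example for illustration:

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

// A small class whose compiled class file carries structural metadata:
// its field names and types, and its method names and signatures.
class Account {
    private String owner;
    private double balance;

    public double getBalance() { return balance; }
    public void deposit(double amount) { balance += amount; }
}

public class ClassMetadataDemo {
    public static void main(String[] args) {
        Class<?> c = Account.class;
        // The reflection API exposes the class file's metadata at run time.
        for (Field f : c.getDeclaredFields()) {
            System.out.println("field:  " + f.getType().getSimpleName() + " " + f.getName());
        }
        for (Method m : c.getDeclaredMethods()) {
            System.out.println("method: " + m.getName());
        }
    }
}
```

Note that reflection recovers only structure; the meaning of those attributes and methods still lives outside the class file, which is exactly the gap a metadata standard has to fill.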
The examples above refer to what is called technical metadata. There is also business metadata (also called process metadata), which captures semantic content about such things as business rules, nomenclature, and terminology; a glossary is an example of business metadata. Finally, there are systems for notating metadata: XML DTDs are one way to notate and interchange semantic information about structured documents, and the Unified Modeling Language (UML) defines a way to formally describe the function, structure, and behavior of a system. We will show later how UML can be used as part of a metadata management infrastructure.
Why is Metadata Important?
Without metadata, or semantics, one has no way of knowing what a data object represents. The integer value '39' encountered in a program, for example, could mean almost anything. Currently, metadata management is made extremely difficult by the fact that many, perhaps most, of these semantics tend to be embedded in the systems that use them. The lack of a common way to represent and share metadata thus leads to great difficulties in sharing even the simplest data, let alone in the integration of complex applications, components, and systems.
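The point about the bare integer can be made concrete in a few lines of Java. The TemperatureCelsius type here is a hypothetical illustration, not part of JMI; it simply shows that even a domain type is a minimal form of metadata, because the type name travels with the value and tells readers and tools what it means:

```java
public class BareValueDemo {
    // Wrapping the value in a domain type attaches minimal metadata:
    // the type itself records what the number represents.
    static final class TemperatureCelsius {
        final int degrees;
        TemperatureCelsius(int degrees) { this.degrees = degrees; }
    }

    public static void main(String[] args) {
        int raw = 39; // an age? a temperature? a column index?
        TemperatureCelsius t = new TemperatureCelsius(39); // meaning carried by the type
        System.out.println(raw + " (meaning unknown) vs " + t.degrees + " degrees Celsius");
    }
}
```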
People are excited about XML in part because at one level (using data encapsulated within descriptor tags) it appears to solve this problem. However, XML has significant limitations as a general solution to metadata representation and management. The current flurry of activity in various industries to standardize on a set of XML DTDs to represent the semantics of their particular artifacts, while a valuable and worthwhile first step, is really focused at the wrong level of abstraction for enabling true metadata interchange.
Where is metadata management required? Basically, wherever there are significant metadata artifacts to be found -- and that turns out to be just about everywhere. Application development tools and frameworks need to manage and understand models, record definitions, and database definitions. Component-based development environments must deal with a host of interfaces, classes, and components, possibly in different languages. Data warehouses and information portals have to manage data that is organized as tables, columns, schemas, cubes, or flat files in different formats.
Finally, comprehensive metadata management is a necessary prerequisite for the kind of services-based architecture that we are all moving towards. You cannot share or use a service unless you understand the semantics of the objects and features that it provides. For example, the higher-level artifacts of ebXML (a standard for enabling e-commerce) such as business objects, business processes, company profiles, and trading partner agreements all need to be described in a platform- and implementation-independent way. We point out ebXML because the people working on it are using the metadata technologies we will be describing to characterize their artifacts.
An Example: Challenges of Metadata Management in Data Warehousing
Figure: Data Warehousing example
Data warehousing applications, because of the variety of interoperability challenges they face, illustrate some of the problems caused by the lack of a common metadata standard.
Data warehousing applications typically deal with many different data sources. Each data source will of course have its own unique metadata (its schema). Not only do data warehousing applications have to integrate these different databases, but the databases themselves usually capture different aspects of the business (that is, not only are the schemas structurally different, they often refer to different things).
Further, these databases are often located on different systems spread across the company. Other sources can contribute as well -- for example, generated files that capture web site click-stream data -- and applications themselves can be sources of data, such as the output of ERP or CRM systems. All of these different kinds of data need to be Extracted, Transformed, and Loaded into a data warehouse.
This is what ETL stands for in the middle box of the figure above. The process is so complex that over 250 ETL vendors in the industry today make their living by writing separate interfaces for each different kind of data source on the input end and each data warehouse on the output end. And a company putting a data warehouse solution in place typically cannot use just one ETL vendor; it needs several to handle its own unique set of data source requirements.
The problem recurs on the output end, where one would like to take the data that has been distilled into the warehouse and analyze it with different reporting tools. Of course, these tools also expect their data to be in a certain (different and unique) format.