— BETH HUFFER, ANNE HUNT


Executive Summary

If ontology is going to provide the next generation of intelligent information management, more attention must be given to creating high-fidelity, truly semantic models. Such models must be capable of representing not only the electronic media used to store the information, music, graphics, etc. that people care about, but also the propositional content and the linguistic characteristics of those media. The abstract propositional or musical content of a document or recording is often quite independent of the medium on which it is encoded, or even the language used for the encoding.

Introduction

Truly semantic information management must recognize and leverage the independence of content from its medium and language. Ontology modeling is the ideal method for producing high-fidelity representations that attend to this distinction, but all too often ontologists take shortcuts, producing models that are not robust enough to support the sort of semantic content management and information retrieval systems that the semantic web has promised.

Information management is an area in which semantic technology promises to pay huge dividends. In industry after industry, people are looking to ontology-supported systems to organize, label, store, and retrieve documents and other digital assets. But as with any software component, the right design makes all the difference. In the rush to build the ontologies that will support the next generation of information management, it is tempting to take shortcuts in the ontology model. This temptation must be resisted if ontology is really going to be the basis for the next generation of information management. The domain to be represented in such systems is complex and fraught with conceptual difficulty, and careful planning must go into the modeling process if the system is to be as reusable, extensible, and flexible as advocates of applied ontology have promised.

Ontologies for information management systems are especially problematic because they must represent a very broad domain that includes propositional content, the information bearing objects that convey that content, and the world of states of affairs that the propositional content describes.  Indeed, ontologies for information management systems are models of models: they are models, expressed in a formal language, of models, expressed in one or more natural languages, of a domain of interest. 

The complexity of the domain of interest in an information management system is too often ignored in favor of a simplified representation. Consider, for example, a system that organizes and manages music. Suppose I am interested in finding all of the available versions of John Lennon’s song, Imagine. On a typical music site, say, Yahoo! Music, I enter a search string such as "Imagine". This search will return as results 86 albums, 14 videos, and 1000 songs, all with the title "Imagine". Many of them will be recordings of the John Lennon song in which I am interested, but many will be completely different songs that just happen to have the same name. To find out which are the recordings I’m looking for, I can listen to snippets from each of those results. But with 1000 results, this is not going to be a very satisfactory user experience. Alternatively, I can try to refine my results by adding to my search string. I might add, for instance, "John Lennon" to my search. In this case, I will get fewer results, but most of them are recordings by John Lennon of other songs, not the one I want. Moreover, the search not only returns many recordings that fail to meet my criteria, but also fails to return many that do, since a recording of Imagine listed under a different name will not be returned by any search engine that relies on term matching alone.

What I’m looking for, in fact, is a single song, perhaps performed by different artists, on various occasions, and recorded on various media such as compact discs and magnetic tapes. What I am after is really quite different from what a typical search engine will provide. Where a search engine returns a variety of different things that share the same name, what I want is a single thing that may appear in different places under very different names. The sort of intelligent search we believe will be the next generation of web search requires an information model that is robust and of very high fidelity, one that is able to represent accurately all three areas of the domain: the informational content, the information bearing objects (IBOs) that convey it, and the objects of the information, as well as the relations among these three areas.
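The difference between the two kinds of search can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names Song and Recording, the identifiers, and the tiny catalog (including the cover artists and the alternate title "Imagina") are all hypothetical and are not taken from any real music service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Song:
    """An abstract musical work, independent of any particular recording."""
    work_id: str
    composer: str


@dataclass(frozen=True)
class Recording:
    """An information bearing object: one performance stored on one medium."""
    title: str
    performer: str
    medium: str
    song: Song  # the abstract work this recording conveys


# Hypothetical works and catalog entries, invented for illustration.
IMAGINE = Song("song-001", "John Lennon")
NAMESAKE = Song("song-002", "Hypothetical Composer")
OTHER_LENNON = Song("song-003", "John Lennon")

CATALOG = [
    Recording("Imagine", "John Lennon", "compact disc", IMAGINE),
    Recording("Imagine (Live)", "Cover Artist A", "magnetic tape", IMAGINE),
    Recording("Imagina", "Cover Artist B", "compact disc", IMAGINE),
    Recording("Imagine", "Unrelated Band", "compact disc", NAMESAKE),
    Recording("Some Other Song", "John Lennon", "compact disc", OTHER_LENNON),
]


def search_by_term(term):
    """Term matching: return recordings whose title contains the term."""
    return [r for r in CATALOG if term.lower() in r.title.lower()]


def search_by_song(song):
    """Semantic search: return recordings that convey the given work."""
    return [r for r in CATALOG if r.song == song]


term_hits = search_by_term("Imagine")
song_hits = search_by_song(IMAGINE)
```

In this toy catalog the term search picks up the unrelated song that shares the title and misses the cover listed as "Imagina", while the search by work returns exactly the three recordings of the one song, whatever they are called.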

The issues that arise for music sites apply equally to sites that provide documentary content. Songs and stories and reports and pieces of information are abstract, propositional objects that may, among other things, be written down on a sheet of paper, uttered aloud (or played on an instrument), recorded on a magnetic tape, or carved into the wall of a cave. When information is disseminated in any of these ways, the result is an information bearing object ("IBO"). The same piece of information may be encoded on many different media, and even in many different languages. For example, the information that snow is white may be encoded like this: “Snow is white”, or like this: “La neige est blanche”, or like this: “La nieve es blanca”. Similarly, the same combination of musical notes may be played on different instruments, sung by different voices, at different times, etc. Or, the same piece of information, expressed in exactly the same language, or sung by exactly the same voice, may be encoded in many different media. One and the same English sentence, “Snow is white”, may occur as a type-written string in a document such as this one, or as a carving on the trunk of a tree, or as an audio recording on a magnetic tape. But in each instance, it is the very same piece of information, namely that snow is white, being conveyed. The very same piece of information may be expressed here with one term and there with a different, synonymous term, as is the case with the locutions “Dogs are mammals” and “Canines are mammals”. Sometimes, the precise term that is used to convey the information is an important property of the IBO, as can be the case with product brand names.

The most typical content management systems are designed primarily to handle documents, but not their propositional contents. And most information retrieval systems are designed to look for terms, and not the propositions those terms express. But the propositional contents of documents are independent of the language of the document or the instantiation of the document on paper or in electronic format. An ontology model that respects this distinction can support a system that allows information workers to find every version (in whatever language) of a particular document by using a very simple query that remains stable as new versions are created and added to the system.
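A minimal sketch of such a stable query follows, assuming a hypothetical schema in which every information bearing object carries the identifier of the proposition it conveys. The names InformationBearingObject, ContentIndex, and versions_of, and the identifier strings, are invented for this illustration; a production system would more likely hold these relations in an ontology store than in a Python dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InformationBearingObject:
    """One concrete encoding of a proposition (hypothetical schema)."""
    content_id: str  # identifier of the proposition being conveyed
    language: str
    medium: str
    text: str


class ContentIndex:
    """Groups information bearing objects by the proposition they convey."""

    def __init__(self):
        self._by_content = {}

    def add(self, ibo):
        self._by_content.setdefault(ibo.content_id, []).append(ibo)

    def versions_of(self, content_id):
        """Return every encoding of the proposition, in insertion order."""
        return list(self._by_content.get(content_id, []))


index = ContentIndex()
index.add(InformationBearingObject(
    "snow-is-white", "en", "typed document", "Snow is white"))
index.add(InformationBearingObject(
    "snow-is-white", "fr", "audio tape", "La neige est blanche"))
index.add(InformationBearingObject(
    "dogs-are-mammals", "en", "typed document", "Dogs are mammals"))
index.add(InformationBearingObject(
    "dogs-are-mammals", "en", "typed document", "Canines are mammals"))

# The same query is issued before and after a new translation is added.
before = index.versions_of("snow-is-white")
index.add(InformationBearingObject(
    "snow-is-white", "es", "tree carving", "La nieve es blanca"))
after = index.versions_of("snow-is-white")
```

The query text never changes; only its result set grows as new versions arrive. The two synonymous English sentences about dogs are likewise retrieved together, although they share no distinguishing term.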

Moreover, a system that can recognize propositional content independently of specific terms can find documents that a traditional term-matching system cannot. The individual instance of #PropositionalContent for a set of documents provides a useful, stable “hanger” for typical enterprise metadata. For example, in the case of user support, that instance of #PropositionalContent can serve as the reference for numerous support incidents, software code that represents the authoritative fix for a class of support issues, the originating author of the support content, and the sub-organization whose budget supports the production of all solution assets for that particular issue. Having a single, stable object upon which to hang this critical information, as opposed to storing the information on any number of document versions and translations, and then mapping all of them to every other one, is the conceptual equivalent of keeping one’s car keys on a hook by the door. In an enterprise setting, the time and energy savings that result from being easily able to find exactly the information sought are enormous.
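The “hanger” idea can be illustrated with a short sketch. It assumes, hypothetically, that each document version holds a reference to one shared PropositionalContent object and that enterprise metadata lives only on that object. The field names, file names, and incident identifiers below are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PropositionalContent:
    """The single, stable object on which enterprise metadata hangs."""
    content_id: str
    metadata: dict = field(default_factory=dict)


@dataclass
class DocumentVersion:
    """One version or translation of a document conveying the content."""
    filename: str
    language: str
    content: PropositionalContent

    def metadata(self):
        # Metadata is looked up on the shared content, never copied here.
        return self.content.metadata


# Hypothetical support article with two translations.
fix = PropositionalContent("support-fix-001")
english = DocumentVersion("fix_en.txt", "en", fix)
french = DocumentVersion("fix_fr.txt", "fr", fix)

# Recorded once, on the content itself.
fix.metadata["author"] = "Hypothetical Author"
fix.metadata["incidents"] = ["INC-1", "INC-2"]

# A later update is immediately visible from every version.
fix.metadata["incidents"].append("INC-3")
```

Because both versions resolve metadata through the same object, there is nothing to keep in sync when a new translation or incident is added.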

Many data models, whether ontological or otherwise, give short shrift to these issues. We believe this is a mistake. There is much to be gained from taking the time to create a robust, high-fidelity domain model right out of the gate.