In this column, “The art of Linked Data,” a few of us at Zepheira will try to bring observations, reflections and practical advice from various projects applying Linked Data, and thus Semantic Web principles, across diverse domains. At Zepheira we help organizations implement sophisticated Web solutions, with a specialty in combining the reasoning power of people with the mechanical processing of computers.
Imagine a situation where a scientific researcher is trying to organize a variety of material for a project or paper; that material might be coming from various sources, in various formats, and with shifting context throughout. There might be related research papers (citations and references), contact information for peer researchers, information about organizations who have sponsored the research with grants or research budget, the actual scientific data collected during the research, and more.
At Zepheira we recognize that there are things that computing can achieve to make the researcher’s job much more efficient, including analysis and conversion of underlying formats, basic indexing, and cataloging. We also recognize that once you’ve done these basic things, the limitations of computing become very apparent. You can guess that a particular phrase is a place name, or that another is a title of a related paper, but in our experience such guessing leads to very uneven results. The Linked Data community has taken entity extraction and the like to very sophisticated levels, but real value comes from being able to present the framework of guesses to the researcher so he or she can create annotations in a friendly user interface. With such annotations in place we can link data and apply data services with full confidence.
You might wonder whether Linked Data is essential to such an approach, but if you think about the chaotic and barely-structured nature of the materials in question, you realize that traditional schemata and the like are far too rigid, and what you need is something more like the Web, where links can be created and then refined incrementally with metadata over time. The power of Linked Data really shines when you are combining computing with the human capacity for common-sense but unreliably consistent connections.
Linked Data principles are therefore key to our success with such applications, and we hope we can share some of our insights in this column. For one real-world example, think of a librarian in a situation not unlike that of the researcher I mentioned before. A librarian often has to organize collections of books, periodicals, archived digital material, electronic subscriptions and more in a spiderweb of identifier schemes, classification systems, and diverse cataloging conventions. Furthermore, the librarian’s value comes from helping organizations organize such collections, and from helping patrons (including researchers) navigate them and find what they need, which means the librarian needs to be flexible with context. An organization might focus more on the usage profile of its collections, while a patron might be looking at a book or article out of personal interest, for teaching, for research, or for some other purpose.
Zepheira was approached by The Library of Congress to put together a system that helps librarians in such situations, with a particular focus on collating digital materials earmarked for preservation. The result of our effort is Recollection, which has become a central resource of The National Digital Information Infrastructure and Preservation Program (NDIIPP), led by The Library. Recollection has already been the subject of a couple of articles by Kathy MacDougall and Trevor Owens on semanticweb.com, and I suggest you read those articles to get an overall sense of the project. It is particularly interesting in how it brings together heterogeneous data sets from more than 200 participating organizations into one application.
(Recollection articles: Part I; Part II)
I’ll focus here on some of the practical lessons in Linked Data as embodied in the Recollection project. In past articles for semanticweb.com, I’ve centered Linked Data discussions around identifiers, relationships and services. These central concepts inform the architecture of Recollection as well.
In Recollection, data sets, views and even entities such as the users themselves have unique, clean URLs which are quite easy to work with from a simple “view source” perspective, or with layered tools. The URLs are emphasized throughout the workflow, and users are encouraged to share URLs for raw data sets and for constructed views, whether through Recollection’s built-in sharing mechanisms or by e-mailing, posting or tweeting URLs, putting them on Facebook, or using any other such generic service. The simple layering of URLs is also key to Recollection’s support for embedding views in other Web pages.
Not only do data sets and views have IDs, but individual items do as well, and these are used as the anchor for the standard item inspection view, which pulls in properties associated with the item. In RDF terms the inspect view renders an item in terms of its role as both subject and object of statements, but with some contextualization to ensure that the user is not presented with an incomprehensible dump of relationships.
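To make the subject-and-object idea concrete, here is a minimal sketch in Python. It is not Recollection's code: the identifiers, the sample statements and the `inspect_item` function are all hypothetical, and statements are modeled as plain tuples rather than with an RDF library. It only shows how an inspection view might gather every statement in which an item appears, grouped by the role the item plays.

```python
from collections import defaultdict

# Hypothetical sample statements as (subject, predicate, object) tuples.
TRIPLES = [
    ("item:42", "dcterms:title", "Field notes, 1911"),
    ("item:42", "dcterms:creator", "person:7"),
    ("collection:3", "dcterms:hasPart", "item:42"),
    ("person:7", "foaf:name", "A. Researcher"),
]


def inspect_item(item_id, triples):
    """Group the statements that mention item_id by the role it plays."""
    view = {"as_subject": defaultdict(list), "as_object": defaultdict(list)}
    for s, p, o in triples:
        if s == item_id:
            # The item is the subject: record the property and its value.
            view["as_subject"][p].append(o)
        if o == item_id:
            # The item is the object: record who points at it, and how.
            view["as_object"][p].append(s)
    return {role: dict(props) for role, props in view.items()}


view = inspect_item("item:42", TRIPLES)
```

A real inspection view would go further and filter or relabel these properties for the user, which is the contextualization mentioned above.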
In Recollection, relationships start with the classic, personal sort which is the core of social networks. They are also maintained throughout the data, and are used as the basis of workflow and policy. For example, a view has a formal relationship with its source data set, and visibility settings for the data set can affect who can see derived views. Even within the data sets, relationships are asserted from items to data profiles, which include typing and other such metadata.
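As an illustration of a relationship driving policy, the following Python sketch lets a view inherit visibility from its source data set. The class names, fields and the rule itself are assumptions made for the example; Recollection's actual policy model is not described here and may well differ.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    url: str
    owner: str
    public: bool = False
    shared_with: set = field(default_factory=set)


@dataclass
class View:
    url: str
    source: DataSet  # the formal "derived from" relationship


def can_see(user, view):
    """Assumed rule: a view is visible only to users who may see its source."""
    ds = view.source
    return ds.public or user == ds.owner or user in ds.shared_with


# Hypothetical URLs and user names.
private = DataSet("/data/1", "alice", shared_with={"bob"})
derived = View("/view/1", private)
```

Because the check follows the link from view to data set, changing the data set's visibility changes the answer for every derived view, with no per-view bookkeeping.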
The primary mandate from the Library in developing Recollection was to improve access to existing digital collections, allowing the user to browse material and make important discoveries by “following their nose.” That said, our experience at Zepheira directed us to make plug-in services a fundamental part of the architecture.
Recollection includes the concept of plug-in services relating to data import. It also supports “data augmentation” services, which automate enhancements to data based on user annotations, and it provides rudimentary hooks for services relating to how data sets can be used beyond the Recollection application boundary. These hooks are rudimentary because of limitations of the SIMILE Exhibit component which is the foundation for data views. Zepheira is working with contributors at MIT, Google and The Library on a separate project, “Exhibit 3,” to address such limitations.
The data augmentation services are much more comprehensively architected. If you take a real-world system comprising a few thousand data sets from a couple of hundred contributing organizations it’s pretty inevitable that you end up with thousands of data conventions, and plenty of quirks. These can be very hard to work out in any fully automated fashion, as proven by decades and billions of dollars of research and development across the industry.
The Recollection approach is to provide a library of services designed to harmonize and enhance data elements for sharing and viewing, while providing a user interface for annotating patterns that support interpretation of the data elements. For example, the user can select a map-plotting service and select a set of spreadsheet columns that express a location name, which can then be geocoded. The Web services design is key to supporting reuse of such services, which are engineered to work in context of the data set identifiers and relationships.
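The map-plotting example can be sketched roughly as follows, in Python. Everything here is hypothetical: the `GAZETTEER` lookup table stands in for a real geocoding service (its coordinates are approximate), and the function names, column names and `latlong` field are invented for illustration rather than taken from Recollection. The point is the division of labor: the user's annotation supplies the list of columns, and the service does the mechanical work.

```python
# Hypothetical stand-in for a real geocoding service: a tiny lookup table
# with approximate coordinates.
GAZETTEER = {
    "boulder, colorado": (40.015, -105.2705),
    "calabar, nigeria": (4.9757, 8.3417),
}


def geocode(place_name):
    """Return (lat, lon) for a known place name, else None."""
    return GAZETTEER.get(" ".join(place_name.lower().split()))


def augment_with_location(rows, columns, target="latlong"):
    """Return copies of rows, adding a target field where geocoding succeeds."""
    augmented = []
    for row in rows:
        new_row = dict(row)  # leave the source data untouched
        # Join the user-selected columns into one place name, skipping blanks.
        name = ", ".join(row[c].strip() for c in columns if row.get(c))
        coords = geocode(name) if name else None
        if coords is not None:
            new_row[target] = "%s,%s" % coords
        augmented.append(new_row)
    return augmented


rows = [
    {"title": "Letter", "city": "Boulder", "region": "Colorado"},
    {"title": "Map", "city": "Atlantis", "region": ""},
]
# The column selection plays the role of the user's annotation.
result = augment_with_location(rows, ["city", "region"])
```

Rows that cannot be geocoded pass through unchanged, so a person can review them later rather than have the service guess.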
Zepheira has a screencast on data augmentation.
We implemented the built-in Recollection services as modules of Akara, an open-source platform for RESTful Web services, but any chosen toolkit can be used to satisfy the required REST API.
From the earliest days of computer science we’ve recognized that humans are infuriatingly sloppy, and that computers are infuriatingly precise. The Web was the first information space to scale globally in part because its architecture struck a good balance between precision and sloppiness, but also because it supported user-friendly mechanisms for consuming its nodes (pictures, embedded audio and video, scripting, etc.). Linked Data helps codify such balanced architecture when you want a little bit more controlled context, and at Zepheira we feel that the right user-centered application basics are key to empowering users with such control of context. Recollection is an example of how such systems provide value in practice, drawn from the cultural heritage space but broadly applicable.
Uche Ogbuji is a founding partner and enterprise architect at Zepheira, LLC, which provides solutions that help organizations apply advanced Web technologies to manage, share and enhance information. Uche specializes in the integration of next-generation Web systems with established enterprise data technology. He has written over 300 articles on XML, RDF, Web services and related topics, having pioneered software development in such areas. He is lead on the open source Akara project, a framework written in Python supporting Web-based data services. Uche was born in Calabar, Nigeria and now lives and works with his family near Boulder, Colorado, USA. His personal Weblog is Copia.