This is Part 1 of a two-part series. Part 2 will be published later this month.
Individuals, communities and organizations increasingly require the ability to combine fragmented data sources into easily searched and navigated wholes. From combining family playlists to merging scientific databases and spreadsheets, the need for integrating data and metadata from multiple sources into a single, Web-based publishing framework is increasing. Allowing users to publish, explore and visualize data in useful ways is a powerful mechanism for identifying, organizing and sharing patterns inherent in this data. Web data publishing demands easier data integration and customizable ways to interact with the data such as faceted browsing, spatial or temporal-based visualizations, tag clouds, and full-text search.
Figure 1: An Exhibit at the U.S. Library of Congress providing faceted navigation and a map-based view of spreadsheet data.
The Exhibit tool has proved extremely popular, with tens of thousands of Web sites using it since its release five years ago. Exhibits have been published with every kind of data, from movie box office receipts to clinical trial results to specialized digital library collections. Web browser widgets, however, are inherently constrained in scale: they run inside a browser and hold all of their data in memory once it has been downloaded from a Web server. Collections of a few thousand items work well, but the available in-memory storage and processing methods prevent widgets from handling collections of even tens of thousands of items. To address this problem, the Simile team reconvened to design a new architecture for Exhibit that addresses scaling and other limitations identified by its users. With initial funding from the U.S. Library of Congress, we have begun to implement this new architecture with the related goal of long-term sustainability as open source software.
The project is still underway, but an initial release of the code has been made available to help demonstrate the capabilities of the platform. In Part 1 of this article we describe Exhibit, the goals of the project, and what has changed with the new design. In Part 2 we’ll discuss the community aspects of this effort (e.g. the open source software development process and the technical environment that Exhibit depends on) and the benefits we hope to see from making available a Web-based, open source software platform for publishing Linked Data.
Exhibit 3.0 Goals
The requirements process that shaped the new version of Exhibit (which we call Exhibit3 to reflect its relationship to the current Exhibit 2 software) began by identifying representative issues raised by the community and by large-scale production use of the current Exhibit system. The following high-level goals are a few of the main objectives of this project.
Robustness and scalability. The original Exhibit platform was designed to support collections under 1,000 items. As the popularity of Exhibit has increased, so has the demand to integrate and publish larger collections of data. For Exhibit3, our objective is to create a robust, scalable data publishing platform that supports very large collections of any type of data (and metadata). The new Exhibit3 platform has specific requirements to scale to 100,000 items, support 20,000 facet values and provide a ‘staged’ server-side option for publishing large collections. A change to a single line of HTML will allow the author of an exhibit to switch between small (in-memory) and large (staged) collections within the same Exhibit3 user interface.
Figure 2: A ‘staged’ server-side Exhibit illustrating publishing and navigation of a 40K-item dataset.
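To make the scaling requirement concrete, the following is a minimal sketch of what faceted browsing involves when a whole collection is held in memory: every change of selection filters the full item list and recounts the facet values. It is not Exhibit's implementation. The function names, property names and sample items are all hypothetical and serve only to illustrate the idea.

```python
from collections import Counter

# Hypothetical sample collection; a real exhibit would hold thousands of items.
items = [
    {"label": "A", "type": "Map", "place": "Virginia"},
    {"label": "B", "type": "Map", "place": "Boston"},
    {"label": "C", "type": "Photograph", "place": "Boston"},
]

def facet_counts(items, prop):
    """Count how many items carry each value of one property (one facet)."""
    counts = Counter()
    for item in items:
        value = item.get(prop)
        if value is not None:
            counts[value] += 1
    return counts

def apply_selection(items, selection):
    """Keep items that match every selected facet (AND across facets, OR within a facet)."""
    return [
        item
        for item in items
        if all(item.get(prop) in allowed for prop, allowed in selection.items())
    ]

# Selecting type = Map narrows the collection, and the remaining facets are recounted.
maps_only = apply_selection(items, {"type": {"Map"}})
place_counts = facet_counts(maps_only, "place")
```

Both functions scan every item on each interaction, which is comfortable for a few thousand items in a browser but is the kind of work a staged, server-side design can move off the client.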
Ease of data sharing. An enormous benefit of a Web-based publishing platform is the ease with which an author can deploy and share views of data with a wider audience. The ability to quickly navigate an Exhibit, selecting appropriate facets and visualizing the results (on maps, timelines, etc.), is a powerful means of exploring the patterns inherent in the data. As larger collections of data are made available in Exhibit3, the ability to capture state (i.e. facets selected, search terms used, specific views of data, etc.) becomes increasingly important for sharing insights with others. In Exhibit3 we have focused on improved permalink (bookmarking) capabilities so that end users can link to specific views along with their state.
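One way to picture state capture is as a round trip between a view's state and a URL. The sketch below is an assumption-laden illustration, not Exhibit3's permalink format: the parameter names (`facet.<name>`, `q`, `view`), the function names and the example URL are all hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def state_to_permalink(base_url, facets, search=None, view=None):
    """Encode selected facet values, a search term and a view name as a query string."""
    params = []
    for facet in sorted(facets):
        for value in sorted(facets[facet]):
            params.append(("facet." + facet, value))  # hypothetical parameter naming
    if search:
        params.append(("q", search))
    if view:
        params.append(("view", view))
    return base_url + "?" + urlencode(params)

def permalink_to_state(url):
    """Recover the facet selections, search term and view from a permalink."""
    query = parse_qs(urlsplit(url).query)
    facets = {
        key[len("facet."):]: sorted(values)
        for key, values in query.items()
        if key.startswith("facet.")
    }
    return {
        "facets": facets,
        "search": query.get("q", [None])[0],
        "view": query.get("view", [None])[0],
    }

link = state_to_permalink(
    "http://example.org/exhibit.html",  # placeholder URL
    {"place": ["Virginia", "Boston"], "type": ["Map"]},
    search="civil war",
    view="map",
)
state = permalink_to_state(link)
```

Sorting facet names and values before encoding means the same selections always produce the same link, which is convenient when links are shared or compared.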
Data indexability, portability and pipelining. Publishing data is often part of a larger data management, sharing or curation workflow. Making it easy to expose data via the Exhibit3 platform is one goal of this project. Further, the ability to expose this data in new ways (e.g. to search engines and/or Web-accessible devices) greatly increases the discoverability and utility of Linked Data. We also anticipate more sophisticated data input and output facilities to enhance Exhibit’s support for more complex data pipelines (e.g., from spreadsheets or XML documents, and into external databases or triplestores), including straightforward data export features to enable large-scale data editing in external applications (e.g. Google’s Refine).
Customization and extensibility. Improved support for developers wishing to extend the tool with new views, widgets and facets is also anticipated. With these improvements, for example, we hope to encourage other developer communities working on various view widgets (e.g. heat maps, bar charts, numerical visualization tools, etc.) to more easily integrate these views into Exhibit’s facet-based data publishing platform. To further support external systems, the Exhibit3 platform is expected to offer APIs for result set pagination and/or summarization, and links to particular Exhibit states. For more complex UI layout editing, extension hooks and reference implementations will be provided as examples. Finally, we anticipate additional API documentation for the staged communication interfaces to further enable Exhibit3 to be driven by different back-end content management systems.
Exhibit 3.0 Architecture
Exhibit provides a publishing platform for any data that can be modeled as an RDF graph and encoded in simple JSON (i.e., anything). The new architecture for Exhibit3, shown in Figure 3, maintains the proven design goals of the current Exhibit work while becoming more scalable and modular to meet the emerging requirements described above. To accomplish this objective, a parallel ‘staged’ client/server architecture has been introduced that builds on earlier work of the Simile project on a prototype called Backstage.
Figure 3: A diagram illustrating the underlying architecture for the in-memory and staged Exhibit3 data publishing platforms.
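As a rough illustration of data that can be both encoded in simple JSON and read as a graph, the sketch below builds a small item list and views it two ways. The top-level "items" array and the "label" and "type" fields follow the convention of Exhibit's JSON data files as we understand it. The other property names, the sample values and the helper functions are hypothetical.

```python
import json

# Hypothetical items; "year" and "place" are illustrative property names only.
items = [
    {"label": "Sample Map A", "type": "Map", "year": "1862", "place": "Virginia"},
    {"label": "Sample Map B", "type": "Map", "year": "1775", "place": "Boston"},
]

def to_exhibit_json(items):
    """Wrap item dictionaries in a top-level "items" array and serialize to JSON."""
    for item in items:
        if "label" not in item:
            raise ValueError("each item needs a 'label' to identify it")
    return json.dumps({"items": items}, indent=2)

def to_triples(items):
    """View the same items loosely as (subject, property, value) statements,
    which is the shape of data in an RDF graph."""
    return [
        (item["label"], prop, value)
        for item in items
        for prop, value in item.items()
        if prop != "label"
    ]

document = to_exhibit_json(items)
triples = to_triples(items)
```

Each item property becomes one statement about that item, so adding a new column to a spreadsheet simply adds one more kind of statement rather than requiring a schema change.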
The Exhibit3 project is designed to fix many shortcomings of the original Exhibit work, making it far more scalable, modular, and feature rich than the original tool, and the work is being done as broadly inclusive open source software development. Further details of the Exhibit3 project, including status, source code, demonstrations, and how to contribute, are available at http://simile-widgets.org/exhibit3.
Part 2: Exhibit and the Community
In the next installment of this series, we’ll discuss how community support plays into the development of this open source effort and the anticipated benefits of this software to publishers of Linked Data.
About MacKenzie Smith
MacKenzie Smith is Research Director at the MIT Libraries, overseeing various digital library research and development projects. Her research focuses on the Semantic Web for scholarly communication, and digital data curation in support of e-research. She was formerly the Libraries’ Associate Director for Technology, serving as the project director for DSpace, MIT’s collaboration with Hewlett-Packard Labs to develop an open source digital repository for scholarly research material in digital formats, and the Principal Investigator on a number of digital library research projects including Simile. MacKenzie is also currently a Science Fellow at the Creative Commons, and Special Consultant to the Association of Research Libraries’ E-Science Institute. She has worked at the Harvard University Library to manage its Library Digital Initiative, and at the library IT departments of Harvard and the University of Chicago.
About Eric Miller
Eric Miller is founder and president of Zepheira, which provides solutions to effectively integrate, navigate and manage information across boundaries of person, group and enterprise. Zepheira’s clients include government agencies, national libraries, scientific publishers, NGOs, health care institutions and life sciences organizations, many of which are using Exhibit to publish linked data. Previously, Eric led the Semantic Web Initiative for the World Wide Web Consortium (W3C) at MIT. Eric’s responsibilities for W3C included the architectural and technical leadership in the design and evolution of the Semantic Web and using the Web as an architecture for Linked Data. Eric served as a Research Scientist at MIT’s Computer Science and Artificial Intelligence Laboratory, where he was a Principal Investigator on the MIT SIMILE project. Prior to that, Eric was a Senior Research Scientist at OCLC, Inc. and the co-founder and Associate Director of the Dublin Core Metadata Initiative.