A Daunting Task

What if you were given a large new initiative to lead? Your resources are a team of 180 different groups distributed across the country. They work for different organizations, each with their own set of priorities and goals. Each group does things just a little differently to accomplish the task at hand. They use different work processes, different software and produce their work in a variety of data formats.

What if you needed to pull the work of all of these groups together and present it in a cohesive manner? Sound familiar? This is a task that presents itself to organizations on a frequent basis as workforces and partnerships expand globally and the amount of digital information to be managed grows exponentially.

Now add another factor to this picture: imagine that the preservation of our nation’s digital history was riding on your success. This is the situation the Library of Congress faced in 2008.

The National Digital Information Infrastructure and Preservation Program (NDIIPP) at the Library of Congress is an initiative to develop a national strategy to collect, archive and preserve the burgeoning amounts of digital content for current and future generations. It is based on an understanding that digital stewardship on a national scale depends on active cooperation between communities. The NDIIPP network of partners has collected a diverse array of digital content, including social science data sets, geospatial information, Web sites and blogs, e-journals, audiovisual materials, and digital government records.

In 2008, the NDIIPP partners shared content through a simple static web page constructed manually by Library staff. In order to explore more dynamic and useful tools and processes for sharing their collections of diverse content, the Library began a pilot project in 2009 with Zepheira. The result is a software platform called Recollection.

The focus of Recollection is:

• to allow NDIIPP partners to access, connect, and share their diverse digital collections.

• to provide a method for transforming data created in different formats into a common format shared within the Recollection platform.

• to enhance the ways in which collections can be explored by allowing partners to create custom web pages which highlight important aspects of collections such as timeframes and geographic origins.

• to enable partners to not only share these views within Recollection, but to also embed them directly into the web pages they use to showcase their collections.

Recollection is meant to provide the platform, tools and environment that enable the community of NDIIPP Partners to share their collections and data on an ongoing basis. The distributed network of NDIIPP partners can also take advantage of tools which allow them to relate collections, increasing their usefulness to Library patrons.

Empowering Curators of Information

Recollection achieves the following key goals for the Library and its NDIIPP Partners:

• Rapidly combine private and public information collections for easy sharing.

• Enable new insights into patterns and relationships inherent to data.

• Build a network of trust and participation around common goals.

• Tie together the many public collections that are curated by NDIIPP Partners in different systems and formats across the globe, so that curators can enhance their information resources by connecting them to other open data sources available on the Web.

Users can either point at or upload their collections to Recollection and describe the types of data within the collection via a web interface. They can enhance this data by using Recollection to generate latitude/longitude coordinates, normalize date formats, break lists into individual values, and perform other data manipulations useful for analysis. In future phases of Recollection, users will be able to merge multiple data files in order to build views that support analysis across the combined collections.
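
To make the kind of enhancement described above concrete, here is a minimal Python sketch of two of those manipulations: normalizing dates and breaking a delimited list into individual values. It is not Recollection's code. The function names, the record fields and the list of accepted date patterns are all assumptions chosen for illustration, and geocoding is left out because it depends on an external service.

```python
from datetime import datetime

# Hypothetical set of input patterns; a real collection may need others.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %b %Y")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known pattern matches."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def split_values(text, separator=";"):
    """Break a delimited string into a list of trimmed, non-empty values."""
    return [part.strip() for part in text.split(separator) if part.strip()]


def enhance_record(record):
    """Return a copy of the record with a normalized date and split subjects."""
    enhanced = dict(record)
    enhanced["date"] = normalize_date(record.get("date", ""))
    enhanced["subjects"] = split_values(record.get("subjects", ""))
    return enhanced


# Illustrative record, not taken from any NDIIPP collection.
sample = {"title": "Flood survey", "date": "03/04/1913", "subjects": "floods; Ohio ;maps"}
enhanced_sample = enhance_record(sample)
print(enhanced_sample)
```

Returning None for an unrecognized date, rather than guessing, keeps doubtful values visible so a curator can correct them by hand.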

Once they’ve described the data, users can quickly create a custom web interface to the data with visualizations that show map plots, timelines, number charts, and other interesting views. They can select facets and tag clouds associated with the data to provide the ability to filter information in ways that are interesting to themselves or targeted audiences. The user can then publish and share the finished page on the Web, including embedding this custom web interface into their own web pages by copying and pasting one simple line of code.
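
The facets and tag clouds mentioned above both rest on the same simple idea: count how often each value of a field occurs, then let the reader narrow the collection to one value. The Python sketch below illustrates that idea only. It does not reflect how Recollection or Exhibit implement facets, and the function names and sample records are hypothetical.

```python
from collections import Counter


def _values(record, field):
    """Return a field's values as a list, whether stored as a list or a scalar."""
    value = record.get(field)
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def facet_counts(records, field):
    """Count how many records carry each value of the given field."""
    counts = Counter()
    for record in records:
        counts.update(_values(record, field))
    return counts


def filter_by_facet(records, field, wanted):
    """Keep only the records whose field includes the wanted value."""
    return [record for record in records if wanted in _values(record, field)]


# Illustrative records, not taken from any NDIIPP collection.
records = [
    {"title": "A", "state": "Ohio", "subjects": ["floods", "maps"]},
    {"title": "B", "state": "Iowa", "subjects": ["maps"]},
    {"title": "C", "state": "Ohio", "subjects": ["census"]},
]

subject_counts = facet_counts(records, "subjects")
ohio_titles = [r["title"] for r in filter_by_facet(records, "state", "Ohio")]
print(subject_counts)
print(ohio_titles)
```

The counts can drive either a facet list or the relative word sizes in a tag cloud, while the filter produces the narrowed set of records that the maps and timelines would then display.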

By allowing users to easily share their custom views and data, Recollection promotes partnership and social interaction around the curation of data. The power of combining the data and expertise of many curators serves to increase the value of the collections and enables new collaborative collections to be generated based upon these interactions.

It is easiest to get a sense of Recollection by viewing it in action.

Linked Data: Accelerating Cooperation

In order to accelerate the goals of NDIIPP partners to share and discover content in a flexible and resilient way, Zepheira looked beyond traditional data management solutions to a solution based in linked data principles. Linked data technology is used in Recollection as a basic platform for librarians and curators exposing collections to the Web, and as a source of data to augment these collections. Potential users of the information can more easily discover and analyze this data in a variety of new ways as a result. Not only do consumers of the information have increased access, but collection curators can begin to connect information across collections and from the World Wide Web to enhance collection value with new resources. These connections create a powerful “Web of Data” for all of the resources curated under the auspices of the NDIIPP Program.

Recollection applies the following linked data principles to support the use case:

• Expose information resources via URIs or Identifiers: URIs are used to identify all information that is exposed to the Web as a new resource. Prior to Recollection, many collections of information were held in “dark archives” that were inaccessible to most.

• Use standard HTTP for ease of access: HTTP URIs are used to allow information to be located by the widest variety of tools and services possible. Data sets, data profiles and data views housed in Recollection are now exposed to all users of the World Wide Web and easily accessible.

• Provide data in common formats to maximize sharing: The data provided when people access Recollection URIs is available in many of the common formats used on the Web today, making it more widely useful. Data is made available in RDF/XML, HTML, Semantic wikitext, JSON and a variety of additional formats. Users of Recollection are free to take the information that is exposed and use it in new ways that are meaningful to them.
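
The third principle, offering one resource in several formats, can be sketched in Python as a single record, identified by a URI, that is serialized differently depending on the requested media type. This is an illustration under stated assumptions, not Recollection's implementation: the example.org URI, the function names and the choice of Dublin Core element names for the properties are all hypothetical.

```python
import json
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core, assumed for illustration


def to_json(uri, properties):
    """Serialize the resource as a JSON object keyed by its URI."""
    return json.dumps({"id": uri, **properties}, sort_keys=True)


def to_rdf_xml(uri, properties):
    """Serialize the resource as a minimal RDF/XML description."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": uri}
    )
    for name, value in properties.items():
        ET.SubElement(description, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Map each supported media type to the function that produces it.
SERIALIZERS = {
    "application/json": to_json,
    "application/rdf+xml": to_rdf_xml,
}


def represent(uri, properties, media_type):
    """Return the representation of a resource for the requested media type."""
    return SERIALIZERS[media_type](uri, properties)


# Hypothetical resource URI and properties.
uri = "http://example.org/collections/42"
properties = {"title": "Flood survey", "date": "1913-03-04"}

as_json = represent(uri, properties, "application/json")
as_rdf = represent(uri, properties, "application/rdf+xml")
print(as_json)
print(as_rdf)
```

In a running service the media type would typically come from the HTTP Accept header of the request, so the same URI can serve a browser, a script and an RDF-aware tool alike.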

The Recollection Platform leverages Web-based open standards and open source tools, many of which Zepheira is actively leading. For more details on the open source software used to create the Recollection Platform, see http://recollection.zepheira.com/about/community/.

The linked data community is not just about sharing data, but also about sharing the wherewithal to process data. In the process of developing Recollection, the stakeholders have also contributed many data-processing tools and best practices to the community, including enhancements to the open-source tools used in Recollection. Relevant domains of data include date/time formats, statistical formats such as SPSS, Web feeds (such as RSS and Atom), XML MODS, and more.

Key members of the Recollection team are also leaders of the MIT Simile project, focused on designing tools to facilitate interoperability among digital assets distributed across individual, community, and institutional stores. Not only has Recollection included basic enhancements to the Simile widget set (maps, pie charts, graphs, tag clouds and specialized pick lists), but it has included fundamental enhancements to the data conduits and views (e.g. OpenLayers support) that underpin Exhibit. Since the Simile project in general, and Exhibit specifically, are such popular components of linked data systems, this work reflects the give-and-take of participating in the linked open data community.

The Library of Congress recently funded the Exhibit 3 Project, a collaboration between the Library, MIT and Zepheira to increase scalability and enhance features. Among other things, Exhibit 3 will promote better HTTP access to the session state of users of collections (i.e. the state of an exhibit), make it easier for the open source community to add new visualization widgets, and enhance data pipelining capabilities.

In summary, linked data principles were an excellent fit to help meet the goals of the Recollection project. Because information resources are exposed as URIs, they are now accessible to all users of the World Wide Web and easy to share. NDIIPP partner data is published in a modular way, which leaves the flexibility to combine and recombine this information in the future in ways that do not need to be pre-defined today and creates opportunities to add value. The Recollection Platform has provided a foundation for the NDIIPP community to share the fruits of their curation efforts and build upon a set of trusted information resources.

In part 2 of this series, Trevor Owens, Digital Archivist and Recollection Project Lead at the Library of Congress, will describe in more detail the way in which Recollection supports the NDIIPP community and the future role Recollection will serve in the Library’s digital preservation strategy.

Read Part 2 of this series here.

Author Bios

Kathy MacDougall is a Partner at Zepheira, which provides solutions to effectively integrate, navigate and visualize data across personal, group and enterprise boundaries. Kathy has extensive experience leading enterprise-wide initiatives to help companies evaluate and leverage their corporate data to increase revenues and uncover new business intelligence. Successes during her 20-year tenure in this field include creating data-based and knowledge management solutions for companies ranging from $500M to $11B in size, including such names as General Electric and Sun Microsystems. At Sun Microsystems, Kathy and her team led the first known large-scale corporate implementation of Semantic Web technologies, which provides the foundation for dynamic delivery of product-related content from across the organization.

Trevor Owens is an information technology specialist with the National Digital Information Infrastructure and Preservation Program (NDIIPP) in the Office of Strategic Initiatives at the Library of Congress. Before coming to the Library of Congress, Trevor was the community lead for the Zotero project at the Center for History and New Media, and before that he worked for the Games, Learning, and Society Conference in Madison, Wisconsin. Trevor received a BA in the History of Science from the University of Wisconsin-Madison, and an MA in History, with a focus on American history and digital history, from George Mason University. He is currently completing a PhD in Research Methods and Instructional Technology in the Graduate School of Education at George Mason University. Photo Credit: Barry Wheeler