Working with Linked Data could be a little easier than it is, and a collaborative project between MediaEvent Services and the Freie Universität Berlin aims to make it so.

The open source Linked Data Integration Framework (LDIF), which will be a topic of discussion at the upcoming Semantic Technology & Business Conference in Berlin, seeks to address the pain that can occur when attempting to work with distinct data sets, perhaps each spanning several gigabytes, that are loaded into one triple store. The LDIF approach is to perform the transformations that unify the data outside of the triple store, to improve performance and scalability and to make it easier to stay up to date with various Linked Data sets. A Hadoop version of the LDIF framework has just launched, for processing a virtually unlimited amount of data on a cluster.

“One main issue is that everyone doing Linked Data is free to use his own vocabulary and naming schema,” says MediaEvent Services’ head of development, Christian Becker, who will be presenting at SemTech Berlin. “That’s part of the success of Linked Data, but if you want to work with two different data sets, or more importantly if you have internal databases to link to a Linked Data source to enrich your product or website, you need to find a way to unify these things. This is exactly what LDIF is doing.” The traditional way to address these issues, sending SPARQL queries to each Linked Data source using its particular vocabulary and naming schema, doesn’t scale well if you want to keep adding data sources or have the data stored locally, he says. Another issue is keeping current with Linked Data sets, whose publishers may all provide different access methods.
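To make the unification problem concrete, here is a minimal, hypothetical sketch in Python using the rdflib library (an illustration of the general idea only, not LDIF’s R2R mapping language): two sources describe the same entity’s name with different predicates, and a mapping step rewrites both into a single target vocabulary before the data is loaded locally.

    from rdflib import Graph, Namespace

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")
    SCHEMA = Namespace("https://schema.org/")
    TARGET = Namespace("http://example.org/vocab/")  # hypothetical target vocabulary

    # Predicate-level mappings from each source vocabulary to the target one.
    PREDICATE_MAP = {
        FOAF.name: TARGET.label,
        SCHEMA.name: TARGET.label,
    }

    def unify(source_graph: Graph) -> Graph:
        """Rewrite triples into the target vocabulary; drop what isn't mapped."""
        out = Graph()
        for s, p, o in source_graph:
            if p in PREDICATE_MAP:
                out.add((s, PREDICATE_MAP[p], o))
        return out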

“If you’re working against one data source, maybe you can follow its paradigms, but it gets really complex if there’s more than one target,” Becker says. “I think this really makes working with multiple Linked Data sources easier.”

Data travels through a four-stage pipeline in LDIF, which at each stage determines exactly which data is required and which is not, to speed the processing of the transformations. Among the technologies LDIF leverages are the R2R framework for its vocabulary mapping module and the Silk framework for identity resolution, which merges data about the same entities coming from different sources. Another piece of LDIF is the Named Graphs data model. “Once you integrate data from several data sources, you still want to keep track of where the data came from,” Becker says. The Named Graphs model attaches metadata to each triple as a fourth element, for provenance tracking. “So you can create queries saying only give me movies downloaded from DBpedia or only data updated within the last week, or you attach some other stuff to it, like ratings. It’s sort of like a meta layer on the data.”
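As a rough illustration of that meta layer, the following hypothetical Python snippet uses rdflib’s Dataset class (again, a sketch of the Named Graphs idea rather than LDIF’s own API): triples from different sources are kept in separate named graphs, and the graph name serves as the fourth element you can filter on to answer a question like “only give me data that came from DBpedia.”

    from rdflib import Dataset, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/")              # hypothetical namespace
    DBPEDIA = URIRef("http://dbpedia.org/")            # graph name recording one source
    INTERNAL = URIRef("http://example.org/internal")   # graph name for a local database

    ds = Dataset()

    # Statements about the same movie, stored in separate named graphs per source.
    ds.graph(DBPEDIA).add((EX.Movie1, EX.title, Literal("Metropolis")))
    ds.graph(INTERNAL).add((EX.Movie1, EX.rating, Literal(4.5)))

    # Filter on the fourth (graph) element: only data that came from DBpedia.
    for s, p, o, g in ds.quads((None, None, None, DBPEDIA)):
        print(s, p, o, "-- provenance:", g)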

One of the funding sources for the project is Paul Allen’s Vulcan Inc., which also has interests in semantic and AI efforts such as EVRI and Project Halo. Currently, LDIF is still mostly a bare-bones engine aimed at developers, but Becker hopes to have it out of prototype and into a 1.0 version by year’s end.

You can register for SemTech Berlin here. Early bird rates expire today.