Group Delves into Understanding the Origin of Data

Jennifer Zaino
SemanticWeb.com Contributor

The World Wide Web Consortium (W3C) in late September chartered the Provenance Incubator Group, charged with developing a road map for Semantic Web technologies. Chairing the group is Yolanda Gil, associate director for Research, Intelligent Systems Division, and research associate professor of computer science at the Information Sciences Institute at the University of Southern California.

That the W3C now sees a need to understand and verify the origin of information points to one thing: The semantic web has arrived.

“Provenance is a very practical matter, and if we didn’t have large data available on the semantic web and people interested in using it and reasoning with it — if we didn’t have scientists who have adopted semantic web technologies and want to solve a big problem they have for integrative science — if we didn’t have such a push, I don’t think we would be thinking about this so hard or starting this effort,” Gil says. “I think it attests to just the way the semantic web is spreading and getting adopted and finding its way to real use.”

Information provenance, she explains, is an issue for everyone, whether it’s a researcher sourcing data on the web, a consumer searching social webs for information on vacation destinations, or scientists trying to connect their data findings with the work of their peers.

“We all do analysis, but there’s nothing on the web that represents that,” she says of the ways we determine whether to trust certain information. A way to access the exact provenance of information, such as the original source of a news item, would make it possible to assess that data more accurately, with wide-ranging benefits. “Browsers could potentially exploit that but also semantic reasoners could use it,” she says, to help users determine which source is valid when information presented as fact is inconsistent across web sources.

The growth of linked data adds to the provenance challenge.

“Now that there’s all this data linked together, you start asking questions and then you have to take into account where the information came from, and that it’s not just a level field for every single piece of data in the network,” she says. And provenance really gets interesting as it relates to the scientific community, in areas such as life sciences. (Gil’s research activities in this area have resulted in papers including “Examining the Challenges of Scientific Workflows.”) As an example, she points to work around genomics involving population studies across a broad range of populations to determine linkages among genes and certain diseases. In such work, genotypes are likely collected at different sites, perhaps using different machines or even different settings. “Unless you have proof of how genotype information came to be and how it was created, you could mix apples and oranges in your analysis,” she says, potentially resulting in inaccurate leads.

“In science every day there is a bigger push to integrative science, to integrate information from diverse sources,” she adds, which increases the requirement to ensure scientists are integrating meaningful data for their computations. Scientists also are very concerned about reproducibility, a cornerstone of the scientific method that provenance recording can help preserve, she says.

“We have a long history of looking at what are the human abilities that are useful to us [in assessing trust in information], and can we try to come up with reasoning mechanisms we could use on the machines, on the semantic web,” she says. “It’s very early on, but imagine eventually having some kind of standard language to express that this web site has original content that I created, this is my signature, you can prove that it is me stating this content, and if you find this content reproduced elsewhere you know I created it and it’s authentic.”

You’d also know that Gil is a researcher at the University of Southern California, adding to her weight as a trustworthy source for information relating to computer science and AI, for instance. “You can use that information to make judgments,” Gil says.

Mechanisms for determining provenance also could have applicability for safeguarding intellectual property and copyrights online. “As people copy images from the web, provenance again might be useful here to figure out, with that information, that there is a restriction about using an image for a particular purpose,” she says. “A system might automatically warn you of those kinds of things.”

The incubator will get started by posting use case scenarios to illustrate the range of uses for provenance information, and then figure out from there what semantic web representation requirements might be needed to address those scenarios. As part of that, the group may liaise with other W3C groups on related issues that might be better handled by other parts of the web architecture. Take the example of two web sites, each providing different values for the height of the Empire State Building. The work of the provenance group could lead to mechanisms for some day identifying that one of the sites is repurposing data originating on a high school kid’s class project web site, while the other repurposes information from the Visitors’ Bureau of New York City. But the W3C Security Activity group may be the more appropriate party to police whether a site claiming to be the Visitors’ Bureau really is the Visitors’ Bureau.
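
To make the Empire State Building scenario concrete, here is a minimal sketch in Python of how a reasoner might use provenance to choose between conflicting values. It is an illustration only, not anything the incubator group has proposed: the `Claim` record, the `ORIGIN_TRUST` table, the site names, and the trust scores are all hypothetical, and the two heights are simply two commonly cited figures (roof height and height to the tip) assigned arbitrarily to the two sites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    site: str       # site publishing the value (hypothetical name)
    value_m: float  # claimed height in metres
    origin: str     # where the site says the value originally came from


# Hypothetical trust scores per origin. A real system would have to derive
# these somehow; here they are hard-coded purely for illustration.
ORIGIN_TRUST = {
    "visitors-bureau": 0.9,
    "school-project": 0.3,
}


def most_trusted(claims):
    """Return the claim whose stated origin has the highest trust score.

    Origins missing from ORIGIN_TRUST are treated as having zero trust.
    """
    return max(claims, key=lambda c: ORIGIN_TRUST.get(c.origin, 0.0))


claims = [
    Claim("site-a.example", 381.0, "school-project"),
    Claim("site-b.example", 443.0, "visitors-bureau"),
]

best = most_trusted(claims)
print(best.site, best.value_m)
```

The sketch takes each site's stated origin at face value. Checking that a source claiming to be the Visitors' Bureau really is the Visitors' Bureau is the separate question that the article suggests may belong to a security-focused group.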

So far the Provenance group has about 15 members, with more joining.

“The important thing,” says Gil, “is we have representation from a broad variety of areas, from database professions that have been looking at issues of provenance, from the argumentation tradition of how you represent and reconcile different points of view, from people in life sciences research like myself bringing their perspectives. To me that’s what’s really exciting.”
