Who has had the greatest impact on rock music? It’s a question that still isn’t answered, despite the efforts of Ronald P. Reck, principal at RRecktek LLC, and Kenneth B. Sall, principal systems engineer/XML data analyst at Ken Sall Consulting.

The team wanted to use semantic technology, along with the DBpedia and MusicBrainz data sources, to try to figure out the answer. Reck and Sall recently published a paper, “Determining the Impact of Eric Clapton on Music Using RDF Graphs: Selected Challenges of Semantics Across and Within Datasets,” based on their experiences. Their plan was to use RDF and SPARQL to query properties and relationships among musical artists to reveal their activity, impact and “six degrees of Eric Clapton” connections to other artists.

Reck and Sall initially saw this project as a door-opener to showing relationships between pieces of data, and drawing inferences and conclusions from them, for a more serious purpose: “We were interested in music, but the real application, especially in the government, is tying the clues together, for example, around terrorists,” says Sall. It turns out that musicians and terrorists have some things in common — they tend to have specific roles in their organizations, and may cross-partner with other groups in loose relationships.

While the work didn’t result in answering the original question posed, it did reveal, as Sall puts it, “what can go wrong in doing this kind of semantic analysis.” That’s in itself useful, as it presents an opportunity to find at least some solutions around those pitfalls.

Among the problems they identified:

Precision. “We couldn’t just accept facts as they were stated and really had to verify them,” says Sall. Issues here included the different degrees of data precision among contributions to Wikipedia, the primary but indirect source for the project’s RDF data, and varying interpretations of label meanings in RDF triples. For instance, some sources correlated the date a performer was born not with their literal birth-date, but with the date of their first performance. “We thought we made a mistake in the way we were doing the query, but we checked it out, and that’s what we saw,” Sall says.

Context. Sall explains this with an example: One statement said that Eric Clapton was associated with The Beatles’ Magical Mystery Tour album, which he was convinced was untrue. As it happens, he was both right and wrong. Clapton did backup singing on All You Need is Love on the U.K. version of the album, but not on the U.S. one. “So, depending on which version you look at it is either a true statement or false statement,” he says. “Now context becomes important. It may be a true fact, but in and of itself, without some other context applied to it, you can’t verify whether it is true or not. It’s ambiguous at any rate. So all these little subtleties made it difficult to come to the conclusions we were trying to get at.”
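One way to picture the context problem is to attach a context label to every statement, so that a claim is checked against a specific edition rather than in the abstract. The Python sketch below is only an illustration under stated assumptions: the artist, album and edition names are placeholders, and the `contexts_for` and `stated_in` helpers are hypothetical, not part of the authors' work or of any RDF library.

```python
# Each statement is a 4-tuple: (subject, predicate, object, context).
# All names below are placeholders invented for this illustration.
quads = {
    ("Artist X", "performedOn", "Album M", "edition_1"),
}

def contexts_for(store, s, p, o):
    """Return the set of contexts in which the statement (s, p, o) is asserted."""
    return {c for (s2, p2, o2, c) in store if (s2, p2, o2) == (s, p, o)}

def stated_in(store, s, p, o, context):
    """Check whether the statement is asserted in one specific context."""
    return context in contexts_for(store, s, p, o)

in_edition_1 = stated_in(quads, "Artist X", "performedOn", "Album M", "edition_1")
in_edition_2 = stated_in(quads, "Artist X", "performedOn", "Album M", "edition_2")
print(in_edition_1, in_edition_2)
```

The same subject, predicate and object give different answers depending on the edition asked about, which is the ambiguity Sall describes when the context is left out.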

Verifiability. While validating a fact is easy, invalidating one is hard, maybe even impossible, Reck points out. “RDF is sometimes called a format, but one limitation is that it’s ideal at stating that something is, not at stating that something isn’t,” he says. “It’s the open world problem.”
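The open world point can be made concrete with a toy set of triples. This is a minimal sketch, not a depiction of how any RDF store behaves internally: the triples and the `ask` helper are hypothetical and exist only for illustration. The idea it shows is that a missing statement means "unknown," not "false."

```python
# Toy triple store; every triple here is a made-up placeholder.
triples = {
    ("Artist X", "memberOf", "Band A"),
    ("Artist Y", "memberOf", "Band B"),
}

def ask(store, s, p, o):
    """Open-world lookup: a stated triple is True, an absent one is unknown (None)."""
    return True if (s, p, o) in store else None

known = ask(triples, "Artist X", "memberOf", "Band A")
unknown = ask(triples, "Artist X", "memberOf", "Band B")
print(known, unknown)
```

Nothing in the data lets the second question come back as a definite "no," which is why invalidating a fact is so much harder than validating one.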

How to deal with some of these issues? Well, the lowest-common-denominator data problem is a hard one to tussle with, unfortunately. “It doesn’t matter how carefully you craft a query if someone contributed data that is a bit sloppy,” Reck notes.

But there are some strategies that can come into play. For instance, to address the birth date problem mentioned above, Reck imagines one possible solution: obtain two more comparable data sets, for four in all, and create a business rule that says, if a birthday is identical in any three data sets and deviates in the fourth, categorically state that the three are correct and the fourth is wrong. “We could write a query like that and generate a list of wrong birthdays and data sets, and maybe process that to tell those who are custodians of those data sets that we have isolated things we think are errors,” he says.
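The rule Reck describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set names, artists and dates are invented, the function name `find_outlier_birthdays` is hypothetical, and the rule is taken literally (at least three data sets must agree before a deviating one is flagged).

```python
from collections import Counter

def find_outlier_birthdays(datasets, min_agree=3):
    """Return {artist: (consensus_date, {dataset_name: deviating_date})}.

    `datasets` maps a data set name to a dict of artist -> birth date string.
    An artist is reported only if at least `min_agree` data sets share one
    date and at least one other data set states a different date.
    """
    suspects = {}
    artists = set().union(*(d.keys() for d in datasets.values()))
    for artist in sorted(artists):
        # Collect what each data set says about this artist, if anything.
        stated = {name: d[artist] for name, d in datasets.items() if artist in d}
        consensus, votes = Counter(stated.values()).most_common(1)[0]
        if votes < min_agree:
            continue  # no sufficiently strong agreement, so flag nothing
        deviants = {n: v for n, v in stated.items() if v != consensus}
        if deviants:
            suspects[artist] = (consensus, deviants)
    return suspects

# Invented sample data: set_d disagrees with the other three on Artist X.
datasets = {
    "set_a": {"Artist X": "1940-01-01", "Artist Y": "1950-01-01"},
    "set_b": {"Artist X": "1940-01-01", "Artist Y": "1950-01-01"},
    "set_c": {"Artist X": "1940-01-01", "Artist Y": "1950-01-01"},
    "set_d": {"Artist X": "1962-06-15", "Artist Y": "1950-01-01"},
}
suspects = find_outlier_birthdays(datasets)
print(suspects)
```

The resulting mapping names both the suspect value and the data set it came from, which is the kind of list that could be passed back to the custodians of each data set. Note that a majority vote only flags likely errors; three sources that copied the same mistake would still outvote a correct fourth.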

Sall also mentions that there’s a lot of work underway among those supporting government efforts on ontologies and helping specify hierarchies and define implicit relationships. “When you talk about cleaning data before operations, that falls back to, are the terms that people use to express relationships really well-defined, and are those definitions shared so that there is no ambiguity whatever between what predicates and verbs mean,” he says. “That is crucial.”

Sall and Reck will present their findings at the upcoming SemTech DC.