The spotlight’s on DBpedia. Literally. A new open source tool that goes by the name of DBpedia Spotlight annotates mentions of DBpedia resources in text to link unstructured information sources to the Linked Open Data cloud. The idea behind it was to ‘go generic’ so that users could download, adapt and integrate it with their own stacks to meet their specific needs.
That idea started to play out in the community just a day or two after the tool’s release. The EuropeanaConnect Media Annotation Prototype is using DBpedia Spotlight to support semantic tagging and annotation of image, audio and video content, a use that one of DBpedia Spotlight’s creators, Pablo N. Mendes, hadn’t foreseen.
Mendes says he thinks what sets DBpedia Spotlight apart from services with narrower aims, such as enriching blog content, is that its named entity recognition and name resolution can support a variety of use cases, including faceted browsing, customized web feeds and delivering complementary content on web pages. “One advantage of having an open source solution is to let other developers adapt it to their specific needs,” says Mendes, a PhD student at the Kno.e.sis Center, Wright State University, and research associate at Freie Universität Berlin, who has been working on this project with DBpedia creator Chris Bizer and fellow F.U. research associate Max Jakob.
One way DBpedia Spotlight aims for flexibility is by letting users decide what degree of precision makes the most sense for the application to which they want to apply its semantic annotation. The current version of DBpedia Spotlight was built from a DBpedia 3.6 + Wikipedia dump from Oct. 2010, and users can configure the confidence value that governs which annotations of entities in the content are returned. Setting it higher may yield fewer annotations, but the ones returned are more likely to be correct; a lower value returns as many annotations as possible, but the likelihood of mistakes grows.
“The higher confidence would be more appropriate for, say, a library wishing to interconnect a lot of content and needs the enrichment function to be completely automated to cover that large base without compromising its credibility by having the incorrect DBpedia links returned,” Mendes says. Without oversight, such a library could be embarrassed if the DBpedia entry returned for the word bass is the fish when the context calls for bass, the low-frequency tone. “For bloggers or journalists writing their periodic column, users may prefer lower confidence settings that generate higher coverage of annotations, since eventual mistakes can be manually curated/corrected before publication.”
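The trade-off Mendes describes can be pictured with a small Python sketch. It is only an illustration: the Annotation record, its score field, the filter_by_confidence helper and the sample values are all assumptions made up for this example, not DBpedia Spotlight's actual data model or scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    surface_form: str  # text span that was annotated
    resource: str      # DBpedia resource the span was linked to
    score: float       # hypothetical disambiguation score in [0, 1]


def filter_by_confidence(annotations, confidence):
    """Keep only annotations whose score meets the confidence threshold."""
    return [a for a in annotations if a.score >= confidence]


# Made-up candidate annotations, for illustration only.
candidates = [
    Annotation("Berlin", "http://dbpedia.org/resource/Berlin", 0.93),
    Annotation("bass", "http://dbpedia.org/resource/Bass_(fish)", 0.35),
    Annotation("Wikipedia", "http://dbpedia.org/resource/Wikipedia", 0.78),
]

strict = filter_by_confidence(candidates, 0.7)   # fewer links, less risk
lenient = filter_by_confidence(candidates, 0.2)  # more links, more risk

print([a.surface_form for a in strict])
print([a.surface_form for a in lenient])
```

With the higher threshold, the shaky "bass" link is dropped along with the risk of pointing at the wrong resource; with the lower one it is kept, which suits a writer who will review the links before publishing.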
DBpedia Spotlight can help classify the topic of a document from its text. For instance, if many of the DBpedia entities returned are politicians, the conclusion can be that the text may relate to politics (entities can be used by topic detection algorithms, even if alone they may not be enough). That’s a starting point for faceted browsing, as is the fact that extracted entities can draw upon other DBpedia information, such as the offices those politicians held and their dates in office. “Connecting your text to DBpedia enables this use case of more semantic processing or browsing of your text,” says Mendes.
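As a rough sketch of the politicians example, the Python below guesses a topic from the most common entity type among the annotations. The dominant_type function, the min_share cutoff and the (entity, type) pairs are hypothetical and chosen for illustration; a real topic detector would likely combine this signal with others, as noted above.

```python
from collections import Counter


def dominant_type(annotations, min_share=0.5):
    """Return the most common entity type if it covers at least
    min_share of the annotations, otherwise None."""
    counts = Counter(entity_type for _, entity_type in annotations)
    if not counts:
        return None
    entity_type, count = counts.most_common(1)[0]
    return entity_type if count / len(annotations) >= min_share else None


# Illustrative (entity, type) pairs, as if returned for one document.
annotated = [
    ("Angela_Merkel", "Politician"),
    ("Barack_Obama", "Politician"),
    ("Nicolas_Sarkozy", "Politician"),
    ("Berlin", "City"),
]

print(dominant_type(annotated))
```

Here three of the four entities are politicians, so the sketch suggests a politics-related text; with no clear majority it returns None rather than guessing.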
The tool can also help link the entities in a piece of content by their relationships. This has applications even in the corporate world, where businesses may want to mine documents to discern work relationships between individuals, for instance. “Many such relationship extraction algorithms rely on entity identification beforehand, so it may be easier to find relationships between entities if you already have identified who those entities are in the text,” says Mendes.
DBpedia Spotlight could evolve in concert with work underway on DBpedia Live, which is an attempt to update DBpedia automatically as Wikipedia is updated. “If that is in place we also can connect to update our index,” he says. “This entity extraction model we have — we keep it simple in a way that we do not need to recompute everything new if we need to add one more example. It is an incremental model so if new articles or new paragraphs that mention an entity are added, we just add that to our index and start using that automatically.”
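The incremental idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not DBpedia Spotlight's implementation: the IncrementalContextIndex class, its word-count scoring and the entity identifiers are invented here purely to show how a new example can be added without recomputing what is already indexed.

```python
from collections import Counter, defaultdict


class IncrementalContextIndex:
    """Toy index: each entity maps to counts of words seen near its mentions."""

    def __init__(self):
        self._contexts = defaultdict(Counter)

    def add_example(self, entity, paragraph):
        # Only this entity's counts are updated; nothing else is recomputed.
        self._contexts[entity].update(paragraph.lower().split())

    def best_match(self, candidates, context):
        # Score each candidate by how often the context words
        # appeared in the examples stored for it.
        words = context.lower().split()

        def score(entity):
            seen = self._contexts.get(entity, Counter())
            return sum(seen[w] for w in words)

        return max(candidates, key=score)


index = IncrementalContextIndex()
index.add_example("Bass_(fish)", "anglers catch bass in the lake")
index.add_example("Bass_(sound)", "the speaker boosts bass and low frequency tones")

senses = ["Bass_(fish)", "Bass_(sound)"]
print(index.best_match(senses, "a low frequency tone"))

# A newly added paragraph is usable right away, with no rebuild step.
index.add_example("Bass_(fish)", "bass is a freshwater fish")
print(index.best_match(senses, "freshwater fish recipes"))
```

Because each entity's evidence is stored as simple counts, folding in a new paragraph is a single update, which is the property Mendes points to when he calls the model incremental.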
Right now DBpedia Spotlight works exclusively with the English version of DBpedia, but as other DBpedia language efforts mature, Mendes says it will be possible to adapt DBpedia Spotlight to add entities from other languages. “We already have been contacted by French and Brazilian developers that would like to collaborate for French and Brazilian Portuguese versions,” he says.