The BBC has been working on automatically tagging its audio archives with DBpedia identifiers. Yves Raimond explains how the broadcaster is handling the project. Raimond writes, “One dataset we are looking at within this project is the World Service archive. This archive is isolated from other programme data sources at the BBC, like BBC Programmes or the Genome Project, and the associated programme data within it is very sparse. It would therefore benefit a lot from being automatically interlinked with further data sources which makes it such a particularly interesting use-case. The archive is also very large: it covers many decades and consists of about two and a half years of high-quality continuous audio content.”

Raimond continues, “One way of dealing with such a large programme archive with patchy metadata but high-quality content is to use the content itself in order to find links with related data sources. For example if a programme mentions ‘London’, ‘Olympics’ and ‘1948’ a lot, then there is a high chance it is talking about the 1948 Summer Olympics. Using the structured data available in Wikipedia we can then draw a link between a recent programme on the 2012 London Olympics and that archive programme and use that link to provide further historical context. When developing such an algorithm we need to take into account a couple of desirable properties: it needs to be efficient enough to be applicable to a large archive and it needs to use an unbounded target vocabulary, as programmes within an archive can virtually be about anything.”
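To make the ‘London’, ‘Olympics’ and ‘1948’ example concrete, here is a toy Python sketch of the general idea: count how often terms occur in a programme transcript and rank candidate topics by how many of their associated terms are mentioned. This is not the BBC's algorithm. The `CANDIDATE_TOPICS` table, the `rank_topics` function and the sample transcript are all made up for illustration, and a fixed table like this plainly does not meet the "unbounded target vocabulary" requirement Raimond describes.

```python
import re
from collections import Counter

# Hypothetical candidate topics and their associated terms.
# Illustrative only: this is not real DBpedia data.
CANDIDATE_TOPICS = {
    "1948_Summer_Olympics": {"london", "olympics", "1948"},
    "2012_Summer_Olympics": {"london", "olympics", "2012"},
    "Great_Fire_of_London": {"london", "fire", "1666"},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_topics(transcript, topics=CANDIDATE_TOPICS):
    """Rank topics by the total number of mentions of their terms."""
    counts = Counter(tokenize(transcript))
    scores = {}
    for topic, terms in topics.items():
        score = sum(counts[term] for term in terms)
        if score > 0:
            scores[topic] = score
    # Highest score first; ties are broken alphabetically by topic name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up transcript snippet standing in for speech-to-text output.
sample = (
    "London hosted the Olympics in 1948. The 1948 Olympics were the first "
    "since the war, and London was still rebuilding."
)
ranking = rank_topics(sample)
print(ranking)
```

On this sample the 1948 Summer Olympics topic ranks first, because all three of its terms are mentioned repeatedly, while the 2012 topic only matches on ‘London’ and ‘Olympics’. A real system would need to draw its candidates from a much larger source such as DBpedia and cope with noisy transcripts.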

Read more here.

Image: Courtesy BBC