ClueWeb09 and ClueWeb12. They may sound like secret project names, but they’re in fact two datasets, both created by The Lemur Project to support research on information retrieval and related human-language technologies (ClueWeb12 is the successor to ClueWeb09). Today, Research at Google announced that it has produced Freebase annotations of the English-language web pages in the ClueWeb09 and ClueWeb12 corpora.

That adds up, it says, to nearly 800 million documents automatically annotated with over 11 billion references to Freebase entities.
