Last Friday saw the release of the Wikilinks Corpus from Research at Google, weighing in at 40 million entities in context.

As explained in a blog post by Dave Orr, Amar Subramanya, and Fernando Pereira at Google Research, the Big Data set “involves 40 million total disambiguated mentions within over 10 million web pages — over 100 times bigger than the next largest corpus.” The mentions, the post relates, are found by looking for links to Wikipedia pages where the anchor text of the link closely matches the title of the target Wikipedia page. If each Wikipedia page is thought of as an entity, it says, then the anchor text can be thought of as a mention of the corresponding entity.
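
To make that construction concrete, here is a minimal Python sketch of the matching step: scan a page for links into Wikipedia and keep the ones whose anchor text is close enough to the linked page’s title. The HTMLParser-based scanner and the 0.6 similarity threshold are illustrative assumptions, not the tooling Google actually used.

```python
# Illustrative sketch (not Google's pipeline): collect Wikipedia-anchored
# mentions whose anchor text closely matches the linked page's title.
from difflib import SequenceMatcher
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class MentionFinder(HTMLParser):
    """Collect (anchor_text, wikipedia_url) pairs from a page's HTML."""

    def __init__(self, min_similarity=0.6):
        super().__init__()
        self.min_similarity = min_similarity
        self.current_href = None
        self.current_text = []
        self.mentions = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if "wikipedia.org/wiki/" in href:
                self.current_href = href
                self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            anchor = "".join(self.current_text).strip()
            # Recover the page title from the URL path, e.g. ".../wiki/Apple_Inc."
            title = unquote(urlparse(self.current_href).path.split("/wiki/")[-1])
            title = title.replace("_", " ")
            score = SequenceMatcher(None, anchor.lower(), title.lower()).ratio()
            if anchor and score >= self.min_similarity:
                self.mentions.append((anchor, self.current_href))
            self.current_href = None


finder = MentionFinder()
finder.feed('<p>I bought an <a href="https://en.wikipedia.org/wiki/Apple_Inc.">Apple</a> laptop.</p>')
print(finder.mentions)  # [('Apple', 'https://en.wikipedia.org/wiki/Apple_Inc.')]
```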

The data can be found in Google’s Wikilinks Corpus; tools and data with extra context can be found at UMass Wiki-links. UMass Amherst’s Sameer Singh and Andrew McCallum are collaborators on the project. The team does note that users will need to do a little footwork to make use of the corpus, as they can’t distribute actual annotated web pages because of copyright issues – just an index of URLs and the tools to create the dataset, or whatever piece of it you want.
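
In practice that footwork amounts to walking the URL index, re-fetching each page, and re-locating the mention strings yourself. The sketch below assumes a simple tab-separated index of page URL, mention text, and Wikipedia URL; the real file layout is whatever the released data and the UMass tools define, so treat this purely as the shape of the workflow.

```python
# Hypothetical workflow: rebuild mention annotations from a URL index.
# The "wikilinks-index.tsv" name and its three-column layout are assumptions.
import urllib.request


def iter_index(path):
    """Yield (page_url, mention_text, wikipedia_url) rows from the index file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            page_url, mention, wiki_url = line.rstrip("\n").split("\t")
            yield page_url, mention, wiki_url


def annotate(page_url, mention, wiki_url):
    """Download the page and report where the mention string occurs."""
    with urllib.request.urlopen(page_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    offset = html.find(mention)
    if offset == -1:
        return None  # the page may have changed since the corpus was built
    return {"page": page_url, "mention": mention, "entity": wiki_url, "offset": offset}


for row in iter_index("wikilinks-index.tsv"):
    record = annotate(*row)
    if record:
        print(record)
```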

Freebase shared a post on Google+ about the announcement, to the effect that the huge dataset has “a lot of potential for folks building entity-aware apps. Essentially it lets your app tell the difference between ambiguous entity names like Apple (the fruit) and Apple (the company). Because they give Wikipedia URLs for each entity it’s easy to connect this to all of the Freebase APIs.” And, asks Freebase, “what would you do if you could scan any page on the web and know which Freebase entities it mentioned?”
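
A toy illustration of what “entity-aware” means here: the same surface string resolves to different entities because every mention in the corpus carries a Wikipedia URL. The small catalog below is hand-written for the example; in a real app those URLs would be the hook for looking entities up through the Freebase APIs, as the post suggests.

```python
# Toy disambiguation: the Wikipedia URL attached to each mention tells
# "Apple the company" apart from "apple the fruit". The catalog is illustrative.
ENTITY_CATALOG = {
    "https://en.wikipedia.org/wiki/Apple_Inc.": {"name": "Apple Inc.", "kind": "company"},
    "https://en.wikipedia.org/wiki/Apple": {"name": "Apple", "kind": "fruit"},
}

mentions = [
    ("Apple", "https://en.wikipedia.org/wiki/Apple_Inc."),
    ("apple", "https://en.wikipedia.org/wiki/Apple"),
]

for surface, wiki_url in mentions:
    entity = ENTITY_CATALOG[wiki_url]
    # Same surface form, different entities: the URL does the disambiguating.
    print(f"{surface!r} -> {entity['name']} ({entity['kind']})")
```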

The Google Research team blog gives some ideas of how it sees the data potentially being used:

  • Look into coreference (when different mentions refer to the same entity) or entity resolution (matching a mention to the underlying entity)
  • Work on the bigger problem of cross-document coreference, which is how to find out if different web pages are talking about the same person or other entity (see a paper by Google Research on Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models here)
  • Learn things about entities by aggregating information across all the documents they’re mentioned in
  • Do type tagging, which assigns types (broad ones like person or location, or specific ones like amusement park ride) to entities. To the extent that the Wikipedia pages contain the type information you’re interested in, it would be easy to construct a training set that annotates the Wikilinks entities with types from Wikipedia (a minimal sketch follows this list).
  • Work on any of the above, or more, on subsets of the data. With existing datasets, it wasn’t possible to work on just musicians or chefs or train stations, because the sample sizes would be too small. But with 10 million Web pages, you can find a decent sampling of almost anything.
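
As a small illustration of the last two bullets, the sketch below joins mentions (keyed by their Wikipedia URLs) against a page-to-type table to get labeled training examples, then filters to a subset. The tiny hand-made type table stands in for types you would actually mine from Wikipedia.

```python
# Illustrative only: build a type-tagging training set by joining mentions
# with a (hypothetical) Wikipedia-page-to-type table, then slice out a subset.
PAGE_TYPES = {
    "https://en.wikipedia.org/wiki/Apple_Inc.": "organization",
    "https://en.wikipedia.org/wiki/Paris": "location",
    "https://en.wikipedia.org/wiki/Space_Mountain": "amusement park ride",
}

mentions = [
    ("Apple", "https://en.wikipedia.org/wiki/Apple_Inc.", "doc-17"),
    ("Paris", "https://en.wikipedia.org/wiki/Paris", "doc-42"),
    ("Space Mountain", "https://en.wikipedia.org/wiki/Space_Mountain", "doc-99"),
]

# Labeled examples for a type tagger: (mention text, document id, type label).
training_set = [
    (text, doc, PAGE_TYPES[url])
    for text, url, doc in mentions
    if url in PAGE_TYPES
]

# Restricting to a subset (say, only locations) is just another filter.
locations = [example for example in training_set if example[2] == "location"]
print(training_set)
print(locations)
```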