Posts Tagged ‘annotated corpus’

Google Releases Linguistic Data based on NY Times Annotated Corpus

Photo of New York Times Building in New York City

Dan Gillick and Dave Orr recently wrote, “Language understanding systems are largely trained on freely available data, such as the Penn Treebank, perhaps the most widely used linguistic resource ever created. We have previously released lots of linguistic data ourselves, to contribute to the language understanding community as well as encourage further research into these areas. Now, we’re releasing a new dataset, based on another great resource: the New York Times Annotated Corpus, a set of 1.8 million articles spanning 20 years. 600,000 articles in the NYTimes Corpus have hand-written summaries, and more than 1.5 million of them are tagged with people, places, and organizations mentioned in the article. The Times encourages use of the metadata for all kinds of things, and has set up a forum to discuss related research.”

The post continues: “We recently used this corpus to study a topic called ‘entity salience’. To understand salience, consider: how do you know what a news article or a web page is about? Reading comes pretty easily to people: we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task? This problem is a key step towards being able to read and understand an article. One way to approach the problem is to look for words that appear more often than their ordinary rates.”
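To make that last idea concrete, here is a minimal Python sketch of comparing a word's rate in one document against its rate in a background collection. It is an illustration only, not the method Gillick and Orr describe; the function names, the add-one smoothing, and the toy texts are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical helper: lowercase the text and keep runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def overrepresented_words(document, background, top_n=5, smoothing=1.0):
    """Rank words by (rate in document) / (smoothed rate in background text).

    Assumption: 'ordinary rate' is approximated by relative frequency in a
    background text, with additive smoothing so unseen words get a small,
    non-zero background rate.
    """
    doc_counts = Counter(tokenize(document))
    bg_counts = Counter(tokenize(background))
    doc_total = sum(doc_counts.values())
    bg_total = sum(bg_counts.values())
    vocab_size = len(set(doc_counts) | set(bg_counts))

    scores = {}
    for word, count in doc_counts.items():
        doc_rate = count / doc_total
        bg_rate = (bg_counts[word] + smoothing) / (bg_total + smoothing * vocab_size)
        scores[word] = doc_rate / bg_rate

    # Highest ratio first: these words are used far more than usual.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Toy inputs, invented for illustration.
article = "The mayor met the council. The mayor spoke about the budget."
background = "the cat sat on the mat and the dog sat on the rug about the house"

for word, score in overrepresented_words(article, background):
    print(f"{word}: {score:.2f}")
```

In this toy example, a common word such as "the" scores low because it is frequent everywhere, while "mayor" scores high because it is frequent only in the article. A real system would need a much larger background corpus and would work with entities rather than raw words.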

Read more here.

Photo credit: Eric Franzon

KEYNOTE: Semantics at The New York Times – SemTech 2009 Video

The first semantic search system for The New York Times was released in 1913 and was available bound in either paper ($6) or cloth ($8). In the 96 years since the advent of The Historical Index to The New York Times, semantic technology has become central to The New York Times’ daily operations and the focus of much internal research and development. In this keynote, Rob Larson, VP of Digital Production, and Evan Sandhaus, Semantic Technologist, will review the long history of semantic technology at The New York Times; discuss the application of this technology in our operations; and review an innovative initiative to enlist the global community in solving some of our toughest challenges.

KEYNOTE: Semantics at The New York Times from Semantic Universe on Vimeo.

At the end of the keynote, Evan and Rob made a significant announcement regarding The New York Times and its contribution to the Linked Open Data movement.

View the announcement:

New York Times Announcement at SemTech 2009 from Semantic Universe on Vimeo.