Posts Tagged ‘New York Times’

Google Releases Linguistic Data Based on NY Times Annotated Corpus

Dan Gillick and Dave Orr recently wrote, “Language understanding systems are largely trained on freely available data, such as the Penn Treebank, perhaps the most widely used linguistic resource ever created. We have previously released lots of linguistic data ourselves, to contribute to the language understanding community as well as encourage further research into these areas. Now, we’re releasing a new dataset, based on another great resource: the New York Times Annotated Corpus, a set of 1.8 million articles spanning 20 years. 600,000 articles in the NYTimes Corpus have hand-written summaries, and more than 1.5 million of them are tagged with people, places, and organizations mentioned in the article. The Times encourages use of the metadata for all kinds of things, and has set up a forum to discuss related research.”

The blog continues with, “We recently used this corpus to study a topic called “entity salience”. To understand salience, consider: how do you know what a news article or a web page is about? Reading comes pretty easily to people — we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task? This problem is a key step towards being able to read and understand an article. One way to approach the problem is to look for words that appear more often than their ordinary rates.”
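To make the idea concrete, here is a minimal sketch in Python of that frequency-based approach; it is illustrative only, not Google’s actual model, and the background-rate table, default rate, and sample sentence are all invented for the example.

    from collections import Counter

    # Assumed background rates: the fraction of words in ordinary text
    # that are this term (illustrative values, not real corpus statistics).
    BACKGROUND_RATE = {"the": 0.07, "said": 0.005, "mayor": 0.0002,
                       "cleanup": 0.00005, "gowanus": 0.000001}

    def salience_scores(tokens, default_rate=0.001):
        """Rank terms by how much more often they appear in this document
        than their ordinary rate would predict."""
        counts = Counter(t.lower() for t in tokens)
        total = sum(counts.values())
        scores = {term: (count / total) / BACKGROUND_RATE.get(term, default_rate)
                  for term, count in counts.items()}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    article = "The mayor said the Gowanus cleanup would begin in the spring".split()
    print(salience_scores(article)[:3])  # e.g. gowanus, cleanup, mayor

A real system would estimate the background rates from a large reference corpus rather than a hand-built table.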

Read more here.

A Look Inside The New York Times’ TimesMachine

Adrienne Lafrance of The Atlantic reports, “One of the tasks the human brain best performs is identifying patterns. We’re so hardwired this way, researchers have found, that we sometimes invent repetitions and groupings that aren’t there as a way to feel in control. Pattern recognition is, of course, a skill computers have, too. And machines can group data at scales and with speeds unlike anything a human brain might attempt. It’s what makes computers so powerful and so useful. And seeing the structural framework for patterns across vast systems of categorization can be enormously revealing, too.” Read more

New York Times Turns to Machine Learning for Better Understanding of Readers

Matthew Ingram of GigaOM recently wrote, “You might not think an applied mathematician who does research in biology and has a PhD in theoretical physics would have much to offer a 163-year-old newspaper publisher, but Chris Wiggins, head of the data science team at the New York Times, told attendees at the Structure conference in San Francisco that machine learning can do much the same thing for media companies as it does for research biologists: namely, make sense of a whole pile of data.” Read more

Structured Data In The Spotlight At The New York Times

In the winter of 2012, The New York Times began implementing the schema.org-compatible version of rNews, a standard for embedding machine-readable publishing metadata into HTML documents, to improve the quality and appearance of its search results and to generate more traffic through algorithmically generated links. The semantic markup for news articles brought structured data properties to its web pages, defining a work’s author, the date it was created, its editor, its headline, and so on.
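For readers who have not seen such markup, here is a minimal sketch of a schema.org NewsArticle expressed as HTML microdata. It illustrates the general technique rather than the Times’ actual markup, and the headline, names, and date are placeholders.

    <article itemscope itemtype="http://schema.org/NewsArticle">
      <h1 itemprop="headline">City Approves Canal Cleanup Plan</h1>
      By <span itemprop="author">Jane Doe</span>;
      edited by <span itemprop="editor">John Roe</span>
      <meta itemprop="dateCreated" content="2012-01-15">
      <div itemprop="articleBody">The city said on Sunday that ...</div>
    </article>

The same data can also be expressed in RDFa, the syntax in which rNews was originally specified.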

But according to a leaked New York Times internal innovation report that appears here, there’s more work to be done in the structured data realm. That work is part of a grand plan to truly put digital first in the face of falling website and smartphone app readership, and of hotter competition from both old-guard and new-age newsrooms and social media properties that are transforming how journalism is delivered to an audience increasingly invested in mobile, social, and personalized technologies.

The report was put together with insights from parties including Evan Sandhaus, director for search, archives and semantics at The New York Times, who was instrumental in the rNews/schema.org effort as well as the relaunch of TimesMachine, a digital archive of 46,592 issues of The New York Times used, among other things, to surround current news stories with historical context. While the report notes that the Gray Lady has not been standing still in the face of its challenges, citing newsroom advances to grow its audience, such as using data to inform decisions, it argues that the paper needs to do more, and faster, to make it easy to get its content in front of digital readers.

Read more

Georgia Tech Embraces MOOC Model For MS In Computer Science

Now you can get a master’s degree in Computer Science from a prestigious university online. The New York Times has reported that the Georgia Institute of Technology is planning to offer the CS degree via the MOOC (massive open online course) model.

According to the Georgia Tech MS in Computer Science program of study website, students can choose specializations in topics such as computational perception and robotics, which includes courses in artificial intelligence, machine learning, and autonomous multi-robot systems; interactive intelligence, which includes courses in knowledge-based AI and natural language; and machine learning, which offers electives ranging from theory to machine learning for trading and finance, among other options.

Read more

A Look Back at the 2012 TimesOpen Events

Greg Bates of Programmable Web reports, “The Gray Lady is getting her code on. In Andre Behrens’s New York Times blog, Open, billed as ‘All the code that’s fit to print,’ he recounts events on coding and science held in 2012. Two of the notable events were the well-attended one on Big Data and Smarter Scaling and their Open Source Science Fair. Three speakers graced the Big Data event: Andrew Montalenti, the CTO of Parse.ly… James Boehmer, Manager of Search Technology at the New York Times; and Allan Beaufour, CTO of Chartbeat.” Read more

New York Times Working on a Linked Data Search Engine

Aaron Bradley of SEOSkeptic reports, “On Beet.TV, Andy Plesser recently featured a short but fascinating video of Michael Zimbalist, Vice President of Research and Development Operations at the New York Times, talking with Joanna O’Connell of Forrester about a prototype linked data search engine being developed by the Times. Zimbalist begins by talking about the great asset that is the New York Times Index, and the relationship between the Index’s metadata and linked data.” Read more

Dynamic Semantic Publishing for Beginners, Part 2

Even as semantic web concepts and tools underpin revolutionary changes in the way we discover and consume information, people with only a casual interest in the semantic web have difficulty understanding how and why this is happening. One of the most exciting application areas for semantic technologies is online publishing, yet for thousands of small-to-medium-sized publishers, the unfamiliarity of semantic concepts makes it hard to grasp the relevance of these technologies. This three-part series is part of my own journey to better understand how semantic technologies are changing the landscape for publishers of news and information. Read Part 1.

----

News and media organizations were well represented at the Semantic Technology and Business Conference in San Francisco this year. Among the organizations presenting were The New York Times, the Associated Press (AP), the British Broadcasting Corporation (BBC), Hearst Media Co., Agence France-Presse (AFP), and Getty Images.

It was interesting to note that, outside of The New York Times, which has been publishing a very detailed index since 1912, many of the news organizations presenting at the conference did not make extensive classification of content a priority until the last decade or so. That makes sense: in a newspaper publishing environment, creating a detailed index that guides every reader directly to a specific subject mentioned in the paper must not have seemed as critical as it does now, since readers were unlikely to keep the newspaper as future reference material. The work of indexing news content by subject was therefore left, for the most part, to librarians, well after an article was published.

In the early days of the internet, categorization of content (where it existed) was limited to simple taxonomies or to free tagging. News organizations made rudimentary attempts to identify the subjects covered by their content, but did not provide much information about the relationships between those subjects. Search functions simply matched the words in a query against the words in the content of an article or feature. Most websites still organize their content this way.

The drawback of this approach to online publishing is that it doesn’t make the most of the content “assets” publishers possess. Digital content has the potential to be either permanent or ephemeral: it can exist and be accessed by a viewer for as long as the publisher chooses to keep it, and many news organizations are beginning to realize the value of giving their material a longer shelf life by presenting it in different contexts. If you have just read an article about, say, Hillary Clinton, you might be interested in a related story about the State Department, or perhaps her daughter Chelsea, or her husband Bill. But how would any content management system be able to serve up a related story if no one had bothered to indicate somewhere what the story is about and how these people and concepts are related to one another?
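As a toy illustration of why that indication matters, here is a short Python sketch; every entity, relationship, and article title in it is invented. Once each story is tagged with the entities it is about, and those entities are explicitly linked to one another, serving up related stories becomes a simple lookup.

    # Toy knowledge graph: which entities are related to which (invented data).
    RELATED_ENTITIES = {
        "Hillary Clinton": ["State Department", "Chelsea Clinton", "Bill Clinton"],
    }

    # Which entities each article has been tagged as being about (invented data).
    ARTICLE_TAGS = {
        "Clinton Testifies Before Senate": ["Hillary Clinton"],
        "State Department Budget Shrinks": ["State Department"],
        "Chelsea Clinton Joins Foundation": ["Chelsea Clinton"],
    }

    def related_stories(title):
        """Return other articles tagged with entities linked to this one's entities."""
        targets = set()
        for entity in ARTICLE_TAGS.get(title, []):
            targets.update(RELATED_ENTITIES.get(entity, []))
        return [other for other, tags in ARTICLE_TAGS.items()
                if other != title and targets.intersection(tags)]

    print(related_stories("Clinton Testifies Before Senate"))
    # ['State Department Budget Shrinks', 'Chelsea Clinton Joins Foundation']

Without the tags and the entity links, no keyword match would connect a Clinton profile to a State Department budget story.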

Read more

Expert Schema.org Panel Finalized for #SemTechBiz San Francisco Program

Q: What do Google, Microsoft, Yahoo!, Yandex, the New York Times, and The Walt Disney Company have in common?

A: schema.org

On June 2, 2011, schema.org was launched with little fanfare, but it quickly received a lot of attention. Now, almost exactly one year later, we have assembled a panel of experts from the organizations listed above to discuss what has happened since and what we have to look forward to as the vocabulary continues to grow and evolve, including up-to-the-minute news and announcements. The panel will take place at the upcoming Semantic Technology and Business Conference in San Francisco.

Moderated by Ivan Herman, the Semantic Web Activity Lead for the World Wide Web Consortium, the panel includes representatives from each of the core search engines involved in schema.org and from two of the largest early implementers: The New York Times and Disney. Topics will include the value proposition of using schema.org markup, publishing techniques and syntaxes, vocabularies that have been mapped to schema.org, current tools and applications, existing implementations, and a look forward at what is planned and what is needed to encourage adoption and consumption.

Panelists:

Moderator: Ivan Herman, Semantic Web Activity Lead, World Wide Web Consortium
Dan Brickley, Contractor, schema.org at Google
John Giannandrea, Director of Engineering, Google
Peter Mika, Senior Researcher, Yahoo!
Alexander Shubin, Product Manager and Head of Strategic Direction, Yandex
Mike Van Snellenberg, Principal Program Manager, Microsoft/Bing
Evan Sandhaus, Semantic Technologist, New York Times Company
Jeffrey W. Preston, SEO Manager, Disney Interactive Media Group

These panelists, along with the rest of the more than 120 speakers at SemTechBiz, will be on hand to answer audience questions and discuss the latest work in semantic technologies. You can join the discussion by registering for SemTechBiz – San Francisco today and save $200 off the onsite price.

Smart Ad Sophistication Lacking in News Industry, To Its Peril

The Pew Research Center’s Project for Excellence in Journalism State of the News Media 2012 report was just published, and among the findings is that efforts by most top news sites to monetize the web in their own right are still limited. Few news companies, it reports, “have made much progress in some key new digital areas. Among the top news websites, there is little use of the digital advertising that is expected to grow most rapidly, so-called ‘smart,’ or targeted, advertising.”

Failing to make a lot more hay from digital ads is problematic for traditional news companies, given the decline in print circulation and in print ad revenue. The report says that in 2011, losses in print advertising dollars outpaced gains in digital revenue by a factor of roughly 10 to 1, an even worse ratio than in 2010.

Read more
