News organizations are a prime candidate for the implementation of semantic technologies, but they also pose some of the most interesting challenges to semantic web professionals. After all, the news is dynamic: it’s happening right now, all around the world, and the same events are being reported on by an ever-growing number of publishers in countless languages. The information contained within “the news” is an ever-expanding glob that can’t be contained, so how can any organization hope to reach even a semblance of organization?
For Gannett, the answer to that question is autotagging. With the help of Dan Segal, a Senior Taxonomist at Marcinko Enterprises, Gannett has been working to semantically categorize and tag their stockpile of information from their 82 US daily newspapers and 23 television stations, not to mention all of the breaking stories that add to that heap every day. Needless to say, tagging the news isn’t exactly an easy quest, but Gannett has several very good reasons to try.
What Tagging Can Do for the News
At the recent Semantic Technology and Business Conference in San Francisco, Segal and Kathleen Cottay, Senior Manager for Taxonomy and Semantic Analytics at Gannett, shared the ways they foresaw semantic tagging and linked data benefitting Gannett:
- Uniformity: Finding a way to tag all of their information in a standard, uniform way allows Gannett to lay the foundation for each of the following…
- Discovery: By implementing user-friendly semantic search tools, Gannett can better align their legacy content with breaking news and make all of their information more easily accessible, both internally and externally. Easier content discovery also creates a launch pad for new content features and data journalism modules.
- Analysis: You can’t use what you don’t know you have. Effectively tagging content at Gannett gives them the ability to assess what they have, figure out what they’re missing, and from that starting point, look for ways to recombine information for a better user experience.
- Monetization: The end result of all of the previous points is added value for the company. Semantic tagging allows for more effective targeted advertising to consumers as well as more valuable sponsorship opportunities for advertisers. It also creates a high-level point of view with automated metrics that can help executives determine which profit centers they should be focusing their attention on.
Easier Said Than Done
With the benefits clearly laid out, Gannett was ready to move forward, Segal and Cottay informed us, but more than a few challenges stood in their way. As a major news organization encompassing USA Today, international news outlets, and hyper-local newspapers, Gannett’s data covers virtually every topic known to man. They needed to address issues of disambiguation, how to curate news stories in real time, the various shelf lives of different types of content, as well as integration with their existing systems and products, to name just a few challenges. So they decided to start (relatively) small, beginning with the development of a strategic approach to semantic tagging at USA Today.
Similar to their counterparts over at the BBC, the team at Gannett decided to focus on simplicity. “We only had a year to get this going, so we needed autotagging that could be easily managed without manually curated rules,” Segal said. To that end, they created tools that were minimally rule-dependent. Instead of building from scratch, they decided to buy the AP Taxonomy and turned to TEMIS’s Luxid platform for their automated indexing software and ITM and Mondeca for their taxonomy and ontology management software, respectively.
Because their tools were off-the-shelf, the Gannett team had to do a lot of customization to make the system work for them. This involved locking down their top-level terms, consolidating redundant and overlapping topics, decoupling pre-coordinated terms, and removing poly-hierarchy. They broke down complex terms into their parts to minimize the number of terms and removed topics that were too highly specialized to be of much use, such as complex scientific terms. In this way, they managed to condense a taxonomy of 4,000 terms to 1,000 terms with four levels of hierarchy.
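To make the clean-up steps above a little more concrete, here is a minimal Python sketch of how a team might detect poly-hierarchy (a term with more than one parent), reduce it to a single parent, and check hierarchy depth. It is not Gannett's tooling: the terms in `EDGES`, the function names, and the `preferred` mapping of editorial choices are all hypothetical, and the sketch assumes the taxonomy contains no cycles.

```python
from collections import defaultdict

# Hypothetical (term, parent) pairs; None marks a top-level term.
EDGES = [
    ("Sports", None),
    ("Business", None),
    ("Baseball", "Sports"),
    ("Sports business", "Sports"),
    ("Sports business", "Business"),  # poly-hierarchy: two parents
    ("Minor league baseball", "Baseball"),
]


def parents_by_term(edges):
    """Map every term to the list of its parents (empty for top-level terms)."""
    parents = defaultdict(list)
    for term, parent in edges:
        if parent is None:
            parents.setdefault(term, [])
        else:
            parents[term].append(parent)
    return dict(parents)


def poly_hierarchy_terms(parents):
    """Return the terms that sit under more than one parent."""
    return sorted(t for t, ps in parents.items() if len(ps) > 1)


def to_mono_hierarchy(parents, preferred):
    """Keep one parent per term; `preferred` holds (assumed) editorial choices.

    Terms without an explicit choice keep their first listed parent.
    """
    return {
        t: (preferred.get(t, ps[0]) if ps else None)
        for t, ps in parents.items()
    }


def depth(term, parent_of):
    """Count levels from the term up to its top-level ancestor (assumes no cycles)."""
    level = 1
    while parent_of[term] is not None:
        term = parent_of[term]
        level += 1
    return level


parents = parents_by_term(EDGES)
print(poly_hierarchy_terms(parents))
parent_of = to_mono_hierarchy(parents, {"Sports business": "Business"})
print(max(depth(t, parent_of) for t in parent_of))
```

A check like the last line could be used to confirm that a condensed taxonomy stays within a depth limit such as the four levels mentioned above.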
They also had the challenge of optimizing their autotagging software. For this, they chose a hybrid approach: statistically based methods for high-level categories and lexically based methods for detailed categories. They also added “future-proof” placeholders so that when a new topic arises, as is prone to happen in the news business, they can easily go in and add it without retraining the entire model.
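The hybrid idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Luxid configuration Gannett used: the statistical step is a tiny hand-rolled Naive Bayes classifier, the lexical step is a phrase lookup, and the training sentences, category names, `LEXICON` entries, and function names are all invented for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesCategorizer:
    """Statistical step: multinomial Naive Bayes over top-level categories."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, labeled_docs):
        for text, category in labeled_docs:
            tokens = tokenize(text)
            self.word_counts[category].update(tokens)
            self.doc_counts[category] += 1
            self.vocab.update(tokens)

    def predict(self, text):
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)  # add-one smoothing
            if score > best_score:
                best, best_score = category, score
        return best


# Hypothetical detailed terms, keyed by top-level category: phrase -> tag.
LEXICON = {
    "Sports": {"world series": "Baseball", "touchdown": "Football"},
    "Politics": {"senate": "U.S. Senate", "ballot": "Elections"},
}


def lexical_tags(text, category, lexicon):
    """Lexical step: whole-phrase matches within the predicted category."""
    lowered = text.lower()
    return sorted({
        tag for phrase, tag in lexicon.get(category, {}).items()
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered)
    })


def autotag(text, model, lexicon):
    category = model.predict(text)
    return category, lexical_tags(text, category, lexicon)


# Invented training sentences, far too few for real use.
TRAINING = [
    ("the team won the game in the final inning", "Sports"),
    ("the quarterback threw a late touchdown pass", "Sports"),
    ("the senate passed the budget bill", "Politics"),
    ("voters cast their ballot in the election", "Politics"),
]

model = NaiveBayesCategorizer()
model.train(TRAINING)
print(autotag("The senate vote on the bill was delayed", model, LEXICON))
```

In this sketch, a new detailed topic is a new `LEXICON` entry and the classifier is left untouched, which is one way to read the "without retraining" goal. The placeholder mechanism Gannett actually built is not described in enough detail here to reproduce.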
This is just the beginning of the story for Gannett. Their team is still hard at work refining the system and expanding it to the other segments of the company. Over the long term, Gannett would like to incorporate linked data and dynamic semantic publishing. Cottay and Segal shared the sentiment of Gannett’s leadership that taxonomy and tagging are at the core of the organizing principle that will allow Gannett to bring its content to life. The work ahead is still substantial, to say the least, but so are the potential dividends.
Images: Courtesy Gannett