Calais’ Second Step

Jennifer Zaino
SemanticWeb.com Contributor

Thomson Reuters’ Calais service is hitting Version 2, with a number of improvements, including the Marmoset plug-in for Yahoo’s SearchMonkey service. The company is announcing the news today at the Semantic Technology Conference (SemTech).

“Marmoset takes unstructured data and feeds it to the monkey,” says Thomas Tague, Calais evangelist and project lead, Thomson Reuters. Any site owner can embed the code in their template files. When SearchMonkey comes by to crawl the site, Marmoset wakes up and sends the content on the page to the Calais web service, which generates metadata on the fly and returns it as microformats embedded in the page for SearchMonkey to harvest.
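
The flow Tague describes can be sketched roughly as follows. This is a minimal Python illustration, not Marmoset’s actual code: the crawler check on a "Slurp" user-agent token, the `extract_entities` stub standing in for the call to the Calais web service, and the hCard-style markup are all assumptions made for the example.

```python
import html

# Assumption: the crawler is recognized by a token in its user-agent string.
CRAWLER_TOKENS = ("Slurp",)


def is_crawler(user_agent, tokens=CRAWLER_TOKENS):
    """Return True if the user agent contains any known crawler token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in tokens)


def extract_entities(text):
    """Hypothetical stand-in for the remote metadata service.

    A real plug-in would send the page text to the web service; here a
    tiny hard-coded lookup keeps the sketch self-contained.
    """
    known = {"Thomson Reuters": "Company", "Thomas Tague": "Person"}
    return [(name, kind) for name, kind in known.items() if name in text]


def to_hcard(name, kind):
    """Render one entity as hCard-style microformat markup."""
    classes = "fn org" if kind == "Company" else "fn"
    return '<span class="vcard"><span class="%s">%s</span></span>' % (
        classes,
        html.escape(name),
    )


def render_page(body, user_agent):
    """Return the page body, adding generated metadata only for crawlers."""
    if not is_crawler(user_agent):
        return body
    cards = "".join(to_hcard(n, k) for n, k in extract_entities(body))
    return body + '<div class="generated-metadata">' + cards + "</div>"
```

In this sketch a human visitor gets the page unchanged; only a request whose user agent matches the crawler token receives the extra markup.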

“Our issue with SearchMonkey was its limitations,” says Tague. “The first is that there’s not a lot of semantic metadata out there to scrape, and the second is that, for unstructured content like news, there is even less where people took the time to create semantic metadata. So we created Marmoset.”

The company is also announcing an open source Drupal module for Calais, built by Phase2Technology. The module lets mid-tier and smaller newspaper publishers, which are increasingly adopting the Drupal open source content management system as their publishing platform, automatically attach semantic metadata to any of their content.

Tague calls this in some ways a more powerful version of the WordPress plug-in for bloggers that Thomson Reuters is also unveiling with this release. The WordPress plug-in addresses bloggers’ laments that rich tagging is a pain, and that finding images that are pertinent and copyright-acceptable is even more of a pain. The new plug-in suggests tags based on the text typed into a blog post and lets users choose which ones to apply.

Then, says Tague, “the cool stuff starts to happen.” Calais finds pictures on Flickr that match the tag or any combination of tags, ensures they are copyright-acceptable, and then lets bloggers size the image, write hyperlink text for it, and insert the final image into the blog post. “That’s not going to change the world, but our goal was to make blogging more fun. But more importantly, it’s about how you thread the needle between a pure folksonomy and a pure taxonomy,” says Tague.

Expanding core API capabilities

Tague puts the cost of developing the WordPress plug-in at about $50,000. That’s some $45,000 more than the bounty the company originally offered developers to create such a plug-in. He says the company received only about half a dozen entries, and none was the great application it had hoped for.

“Something went wrong and we didn’t get the attention of the major players there,” he says. “I did a full disclosure on the [Calais] blog that the right thing is not to accept any of these and instead hire a company to develop this for us that knows how to develop production-strength WordPress plug-ins.”

The company may get more of the attention it wants from the development community with a revision of its web site that adds more sharing and community-based capabilities. Some 3,000 developers have registered so far. The other big focus of the last quarter, Tague says, has been expanding Calais’ core API capabilities. That includes rolling out two new output formats, simple tags and microformats for tagging pages on the fly, to supplement RDF, which many of its users find to be too much overhead for their needs.

“The other side is more subtle but long-term more interesting,” he says, and that is the rollout of dozens of new entities, primarily in areas like pop culture.

“What’s important is we proved a model where we can use open data sources, like Freebase, combined with NLP (Natural Language Processing) to generate entities more rapidly. Over the coming year that will let us dramatically expand the knowledge domains Calais covers,” he says. “We’ve always been strong at business news, now we’re getting strong at entertainment news, sporting news, we’re looking into biotechnology. We’re getting requests for things we didn’t expect. But we want to be the semantic plumbing for everyone, so we must expand our domains rapidly.”
