Building the DocumentCloud With the Help of OpenCalais

Jennifer Zaino
SemanticWeb.com Contributor

Working toward a summer 2010 release, DocumentCloud is building the infrastructure behind its software and web site designed to facilitate the sharing of original source documents by reporters, watchdog groups, and researchers.

Having received a $719,500 two-year grant from the Knight News Challenge in the early summer, the effort – conceived by journalists from ProPublica and The New York Times – last week tapped Thomson Reuters’ OpenCalais service to perform entity extraction on documents that will be submitted to the site.

The semantic web service will help users take their stories or research to the next level by surfacing previously hidden connections with other source documents related to specific people, places, companies and events.

Today source documents – the results of Freedom of Information Act (FOIA) requests, for instance – live on what Scott Klein, ProPublica’s editor of online development, calls the “grey web.” Reporters may post these documents as PDF files alongside the published article as evidence supporting their stories, but search engines don’t index them – or at least, in many cases, not in ways particularly useful for reporters. And when the story gets old, the background research fades away with it – at least on the web.

Reporters across publications may share their background information with each other (once their work is published) if they have personal relationships, but that’s not scalable. The result for many news enterprises and other research or watchdog groups is a lot of duplicated effort – and the costs and time it eats up – searching for data and making FOIA requests. Reporters may also completely miss a key link because they were unaware of some other organization and the work it had done that could be relevant to advancing their own articles.

Trusted contributors

The work isn’t far enough along for Klein to provide complete specifics, but the plan right now is for trusted contributors – journalism organizations, watchdog groups and established research outfits – to be able to upload their documents to the site. They will be presented with an interface for providing some basic data (who they are, what FOIA request produced the document, the date and so on) and then get back a list of entities and metadata extracted by the semantic web service (perhaps in conjunction with natural language processing technologies) for the reporter to approve, if he or she chooses.

“Calais is extremely good at entity extraction,” he says, so that approval step probably will exist just as a gut-check mechanism. It will guard against mistakes such as connecting a document to Shirley Temple when the information is actually related to someone named Shirley walking to temple, says Klein.

DocumentCloud so far has some 25 contributors – in addition to The New York Times and ProPublica – signed on to help it, initially by uploading documents into the system and providing general feedback. That includes newspapers such as The Boston Globe, magazines such as Mother Jones and The Atlantic, and organizations such as The Center for Democracy and Technology. It will be adding more partners, but plans are to limit who actually can contribute to the initiative – though anyone will be able to search its data – for a number of reasons.

While serious bloggers, for instance, may someday become part of the fabric, Klein says it’s important that contributors understand copyright laws and liability issues and apply some editorial judgment to the documents they upload.


But even with a limited set of contributors, Klein suspects that it would be pretty easy to get to a “metric ton” of documents, when you consider where these source documents are coming from – news and public resource organizations with millions of pages of documents in their archives.

“So even with a small number of contributors taking part, the technology scaling is pretty significant even then,” he says. “In terms of Calais – Reuters obviously knows very well this challenge and how to deal with this level of scalability.”

Planning for large-scale input from the start also is evident in its first open source release, CloudCrowd, for building up parallel computing resources to do arbitrary jobs.

Why would reporters want to share the results of their hard work with the rest of the world? For one thing, the source documents wouldn’t be available until an article was published, to avoid competitive issues.

For another, those participating get back the benefits they enable – upload a document and the system will report back what entities were found inside that document and other documents related to it, perhaps source material they had never even heard of that could help them take their own published work to the next level with follow-on stories, for example. They could also initially publish source documents privately to get that feedback – and who knows but that won’t lead them to the last piece of evidence they need to get a piece published.

In addition to saving organizations the costs of duplicative research, Klein sees an opportunity to help them drive revenue – and that’s something few news publishers are likely to sniff at in these troubled times for journalism. Searches will send traffic to their web sites rather than to something like an FTP download form – and that can translate into “an actual saleable web click,” he says.

As for DocumentCloud itself, Klein says it’s committed to being a charitable non-profit, perhaps providing fee-based document hosting for smaller news organizations – but the main goal is to be a neutral broker for this information.

Open the data

“We are an open organization and we’re not intending to create a walled service where you have to come to our web site to find these documents,” he says. “We want to make these searchable by other search engines as well and open the data as a resource for Google, Microsoft and Yahoo to find the documents. What we get out of it is achieving the mission of the organization, which is to make documents finally available to a maximum number of people. What matters is that the documents are findable.”

The semantic web and its evolution through services like Calais, the Linked Data movement, and standards such as RDF were always on the minds of DocumentCloud’s founders when they originated the idea of the service. “Very much it was in our minds that these documents are as far as you can get from the semantic web or even almost from the web,” he says, for all the reasons named earlier. “Our elevator pitch was to make these documents part of the web.”

The rest of this year will be spent working mostly in private beta mode. Klein says the hope is to release this summer, along with the site and service, an API that should let users also dig into the raw data behind the documents. How they might exploit RDFa or RDF for that – and what that might mean in the way of skills users might have to develop to take advantage of the capability – is still in the works. But, says Klein, “we’re going to make it the most awesome way we could possibly imagine.”
