
Harvard’s Berkman Center Seeds the MediaCloud

Jennifer Zaino
SemanticWeb.com Contributor

Who covered news about Africa more in a particular month, the New York Times or the BBC?

MediaCloud, a project of the Berkman Center for Internet and Society at Harvard, aims to solve problems like that. It has the task of helping researchers, businesses, individuals, and publications themselves ask and answer quantitative questions about the media, what it’s covering, and how.

Semantic technologies, cloud computing, and RSS feeds are key enablers of the service, currently in beta mode. Back in 2003, Berkman Center fellow Ethan Zuckerman tried a similar experiment, Global Attention Profiles, scraping the search engine of the New York Times to discover what the Gray Lady was paying attention to.

“That was kind of fun, but not a great method,” he says. Five years later — following a year-long project for the MacArthur Foundation that the Berkman Center undertook around the future of mainstream and citizen media — it decided to look back at the question of statistically analyzing media coverage, with greater success, thanks to advances made in the interim period. With RSS feeds, there’s no need to scrape search engines every day — a simple subscription takes care of that.

The Calais web service was the second innovation. MediaCloud, which currently gets RSS feeds from 1,500 media outlets (blogs and mainstream press), uses those feeds to find story URLs, retrieves the entire HTML of each story, applies its algorithms to separate text from banners or navigation toolbars, and then pulls the text away from the formatting. It then sends the text to Calais and other entity extraction or term extraction engines to get back information on who is mentioned in a story, where it takes place, and its general subject or topic.
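The first steps of that pipeline (feed to story URLs, story HTML to plain text) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not MediaCloud's actual code: the function names `story_urls` and `extract_text` are hypothetical, the tag blacklist is a crude stand-in for the project's text-detection algorithms, and the call to Calais is left out entirely.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def story_urls(rss_xml: str) -> list[str]:
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [
        item.findtext("link", default="").strip()
        for item in root.iter("item")
        if item.findtext("link")
    ]


class _TextExtractor(HTMLParser):
    # Assumption: anything inside these tags is navigation or code, not story text.
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside every skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Strip markup and skipped regions, returning the remaining text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

In a real crawl, each URL from `story_urls` would be fetched and its HTML passed to `extract_text` before the result is handed to an entity extraction service.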

From there, the story text goes into a full-text search engine so that specific terms or phrases can be retrieved, gets dumped into a database, and becomes source material for the three simple tools currently on the site that let people start playing with the service. Being able to throw text against Calais and get a pretty high quantity of entities and terms out of it, Zuckerman says, was a “big step forward.”

The open source and open data project runs off the Amazon cloud. The Berkman Center tried it on its own server first, but with terabyte file systems and hundreds of gigabytes of relational databases, it couldn’t keep up. “It’s pretty exciting that by signing up with Amazon we were able to scale massively and very quickly,” Zuckerman says. The service hopes ultimately to scale to 15,000 RSS sources.

What’s currently live is meant as just a taste of what you can do with the data: the top ten most mentioned terms for up to three media sources at a time; the top ten most mentioned terms for each media source that occur in stories along with a term you specify; or a world map for each media source that indicates which countries get more coverage.
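The first two tools boil down to counting. The sketch below shows one way such counts could be computed for a single media source. It is an assumption-laden illustration, not the site's implementation: it counts plain words rather than Calais entities, and `tokenize`, `top_terms`, `cooccurring_terms`, and the tiny `STOPWORDS` set are all hypothetical names introduced here.

```python
import re
from collections import Counter

# Assumption: a tiny stopword list; a real system would use a much fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for"}


def tokenize(text):
    """Lowercase the text and return its words, minus stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def top_terms(stories, n=10):
    """Return the n most mentioned terms across one source's stories."""
    counts = Counter()
    for story in stories:
        counts.update(tokenize(story))
    return counts.most_common(n)


def cooccurring_terms(stories, term, n=10):
    """Return the n most mentioned terms in stories that also contain `term`."""
    term = term.lower()
    counts = Counter()
    for story in stories:
        tokens = tokenize(story)
        if term in tokens:
            counts.update(t for t in tokens if t != term)
    return counts.most_common(n)
```

Running `top_terms` once per media source and placing the results side by side gives the kind of comparison the first tool offers.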


“We’re really trying to build a research framework to let different scholars or individuals build experiments using this data,” says Zuckerman. One reason for using Calais, in fact, is that as an open source effort the Center uses only tools that other people can get access to. With Calais, “if you want to build your own media cloud, in a couple of weeks you can download our code, in GPL, and throw it at Calais because they make a certain number of queries available for free per day,” he says.

The Center is working with scholars on research proposals for the service. One lexicographer, for example, sees potential for using it to look for the coinage of new words in blogs vs. mainstream media, Zuckerman notes. One research question in the Harvard community that it could help with concerns the language around the stimulus package: who calls it a stimulus, when did it start getting tagged as a bailout, how have those terms battled over time, and who is amplifying the use of either term. Zuckerman himself is interested in better understanding the ecosystem of mainstream media vs. blogs, in particular which media are more parochial and which are more global.

“All this involves looking at histograms or term usage over time,” he says. “These are interesting open research questions,” and so far it really has only been possible to answer them anecdotally rather than statistically.
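A question like the stimulus vs. bailout one comes down to counting, month by month, how many stories mention each term. The following is a rough sketch of that tally under stated assumptions: stories arrive as `(date, text)` pairs, matching is a simple case-insensitive substring test, and the function name `term_timeline` is hypothetical rather than part of MediaCloud.

```python
from collections import Counter, defaultdict
from datetime import date


def term_timeline(stories, terms):
    """Count, per term and per month, the stories that mention the term.

    `stories` is an iterable of (published_date, text) pairs.
    Returns a mapping: term -> Counter of "YYYY-MM" -> story count.
    """
    timeline = defaultdict(Counter)
    for published, text in stories:
        lowered = text.lower()
        month = published.strftime("%Y-%m")
        for term in terms:
            # Crude substring match; it also hits words that merely contain the term.
            if term.lower() in lowered:
                timeline[term][month] += 1
    return timeline
```

Plotting each term's monthly counts as a histogram, split by media source, would show when one label overtakes the other and which outlets drive the shift.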

Perhaps traditional purveyors of journalism could even find a use for the service that could translate into more revenue. Zuckerman speculates that newspapers could potentially use the service to compare their coverage of local, national and foreign news with their leading competitors to see who’s doing more of what, and then decide if it makes more sense to follow or go in a different direction. “What’s hard right now is we run on really weak data,” he says, and it takes a lot of work to answer questions that might help news outlets better figure out their place in the world, including how they can exploit their expertise in certain niches to potentially drive new revenue, perhaps by syndicating that content to sources with less expertise in a particular area.

“We wanted to create a service that lets us ask the big questions about the topic focus, the geographic focus and to a certain extent the focus on language and the framing of stories across media,” says Zuckerman. “If we can just add some statistical rigor into the game, I feel like we’ll be doing a great service.”
