Huffington Post Invests in Slice of Semantics
Jennifer Zaino
SemanticWeb.com Contributor
The Huffington Post is not only adopting semantic technology — it’s investing in it.
The online publisher has a minority investment in a company called Adaptive Semantics, Huffington Post CTO Paul Berry recently told SemanticWeb.com. One of the problems content providers have to deal with is effectively moderating postings to their sites, which can sometimes cross the line in the kind of language used or in the expression of extremely offensive sentiments. In fact, the co-founders of Adaptive Semantics heard Huffington Post co-founder and editor in chief Arianna Huffington give an interview in which she asked the technical community if they couldn’t come up with a way of solving that problem. At the time they had been working on problems around movie review sites, automatically parsing what was being said about films.
But the company founders realized that the machine learning and natural language processing technology they were building their movie application on could easily be turned to The Huffington Post’s need. That’s when they got in touch with the online publisher, and six months later they signed a contract with their new investor.
Most often, content providers use keyword and spam filters to try to weed out the truly nasty comments.
“But that doesn’t catch all the content because people are more creative than keyword filters,” says Adaptive Semantics co-founder Elena Haliczer — it’s very easy, for example, to fool these systems by replacing the letters of some obscene words with exclamation marks or ampersands. “And the volume of comments that a Huffington Post gets becomes unmanageable for a human moderator.”
Editors also have to generally create a whole list of keywords they consider illegal on their sites, including some that are in and of themselves inoffensive. For instance, they may need to add the president’s last name to the list, because comments related to political personalities could be abusive and so require a human to take a look before they go live.
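To make both weaknesses concrete, here is a minimal Python sketch of the kind of keyword filter described above. It is an illustration only: the word list, the placeholder surname "smith" and the function name are hypothetical, not anything The Huffington Post or Adaptive Semantics is known to use.

```python
import re

# Hypothetical block list: one insult plus an inoffensive surname that an
# editor might add so that political comments get a human look.
BLOCKED_WORDS = {"idiot", "smith"}


def is_flagged(comment):
    """Return True if any whole word in the comment is on the block list."""
    words = re.findall(r"[a-z]+", comment.lower())
    return any(word in BLOCKED_WORDS for word in words)


print(is_flagged("You are an idiot"))            # exact match on the list
print(is_flagged("You are an id!ot"))            # one letter swapped for "!"
print(is_flagged("I agree with Smith on this"))  # harmless, but held anyway
```

In this sketch the second comment slips through because the exclamation mark splits the word into fragments that are not on the list, while the third is held for review even though nothing in it is abusive.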
The linguistic algorithms at the core of Adaptive Semantics’ technology get around these issues. Jeff Revesz, also a co-founder of the company, notes that what makes Adaptive Semantics work is that it learns by example. “When someone comes up with a creative way of getting around filters, the algorithm sees it and learns the vocabulary they use,” he says. “It is an arms race between the algorithm and the commenters, and the algorithm wins because it learns the commenters’ tactics.”
Online content providers must train the system, called JuLiA. The Huffington Post, for example, supplied historical data in the form of comments that were deleted from the site by editors and those that were published, acting on the assumption that both sets were trustworthy. Its staff went through the set and, according to the publisher’s own standards, tagged those that included objectionable content. Following a spot check to ensure the tagged ones are really the abusive ones, it feeds the information to the algorithm, which then creates a model, or classifier, that can use that data to look at new comments coming in and determine which are publishable and which are abusive. Of course, this can be tuned based on the publisher’s feedback.
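The train-then-classify loop described above can be sketched in a few lines of Python. The article does not say which learning algorithm JuLiA uses, so this sketch swaps in a simple naive Bayes classifier over single words purely as a stand-in. The function names and the four training comments are invented for illustration.

```python
import math
import re
from collections import Counter

LABELS = ("abusive", "clean")


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Build word counts per label from (comment, label) pairs.

    Assumes at least one example for each label.
    """
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["abusive"]) | set(word_counts["clean"])
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def abusive_score(model, text):
    """Return the naive Bayes probability (0 to 1) that a comment is abusive."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    log_probs = {}
    for label in LABELS:
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        log_prob = math.log(model["doc_counts"][label] / total_docs)
        for word in tokenize(text):
            if word in model["vocab"]:  # ignore words never seen in training
                # Add-one smoothing so an unseen word/label pair is not zero.
                log_prob += math.log((counts[word] + 1) / (total_words + vocab_size))
        log_probs[label] = log_prob
    # Convert the two log scores into a normalized probability.
    top = max(log_probs.values())
    weights = {label: math.exp(lp - top) for label, lp in log_probs.items()}
    return weights["abusive"] / (weights["abusive"] + weights["clean"])


# Invented stand-ins for the deleted and published comments a publisher supplies.
history = [
    ("you are a stupid idiot", "abusive"),
    ("what a stupid idiotic article", "abusive"),
    ("great article thanks", "clean"),
    ("i agree with this article", "clean"),
]
model = train(history)
print(abusive_score(model, "stupid idiot") > 0.5)
print(abusive_score(model, "great thanks") < 0.5)
```

Retraining on fresh editor decisions is just another call to `train` with a longer history, which is roughly what "tuned based on the publisher’s feedback" would mean for a toy model like this one.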
The models Adaptive Semantics uses usually have databases of about 250,000 features: single words and keyword combinations pulled from the text submitted to it. Publishers can choose all the trust thresholds. When JuLiA looks at a comment, it gives a percent-abusive score.
“At 80 percent abusive we trust JuLiA to autodelete and at 80 percent clean we trust her to publish,” says Haliczer. “Everything in between may go to a set of human moderators, which is where the editorial process keeps going. As JuLiA gets more trustworthy and as editors have more faith in the system, they can adjust trust levels so that at 50 percent abusiveness JuLiA auto-deletes a comment and at 40 percent automatically publishes it. So that lessens the number of comments human moderators have to go through and makes sure the ones that do go to them are the ones they can legitimately argue about what the real decision there should be.”
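The thresholds Haliczer describes amount to a three-way routing rule, sketched below in Python. The function and parameter names are hypothetical, and the sketch assumes that "80 percent clean" means an abusive score of 20 percent.

```python
def route_comment(abusive_score, delete_at=0.80, publish_at=0.20):
    """Decide what happens to a comment given its abusive score (0.0 to 1.0).

    delete_at:  scores at or above this are deleted automatically.
    publish_at: scores at or below this are published automatically.
    Anything in between goes to a human moderator.
    """
    if not 0.0 <= abusive_score <= 1.0:
        raise ValueError("abusive_score must be between 0 and 1")
    if publish_at >= delete_at:
        raise ValueError("publish_at must be lower than delete_at")
    if abusive_score >= delete_at:
        return "delete"
    if abusive_score <= publish_at:
        return "publish"
    return "human_review"


# Starting trust levels: only the extremes are handled automatically.
print(route_comment(0.45))
# Tighter levels from the quote: delete at 50 percent, publish at 40 percent.
print(route_comment(0.45, delete_at=0.50, publish_at=0.40))
print(route_comment(0.55, delete_at=0.50, publish_at=0.40))
```

Moving the two thresholds closer together shrinks the middle band, which is how the system would take on a larger share of moderation while leaving the genuinely borderline comments to editors.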
Currently the system is performing about 25 percent of the total moderation for The Huffington Post, she says, the equivalent of the work of three or four moderators. Haliczer expects that share to go up over time as the algorithm is continually retrained and refined for the publisher’s purposes. It has been in place there for just about two months, rising from about 10 or 12 percent of moderation at the start to where it is now.
And that’s just the start of where the co-founders expect to take JuLiA. Still within moderation, they plan to train the algorithm, assuming it has enough data, to identify the tenor of an entire thread of comments, for instance by giving a thread an argumentative or snarly score. The technology could also have applications beyond the world of content publishers. Revesz says they are working on ways to apply it to product reviews, to judge whether they are legitimate or not. Sites such as TripAdvisor have problems with ‘planted’ reviews by hospitality companies, for example.
“The semantics of a fake review vs. a real one are discernable,” he says.
He sees applications for combining JuLiA’s ability to mine general semantics from text with standards-based Semantic Web technologies such as OpenCalais, which The Huffington Post is also using. As an example, he says that if you combine Calais with JuLiA you could create a system to, say, identify a specific company and the relationship between it and another company, and then parse out what people think about those companies and their relationship. “That’s how we fit into the Semantic Web,” he says.
