Bringing structured data to users from unstructured web content – that’s what Webhose.io is offering in the way of a new API service that works with millions of blogs, forums, reviews, news sites and comment posts. The company is applying its technology roots as a message board search engine, Omgili, to the work of delivering posts’ clean text content, dates, authors, links, language and so on in JSON, XML and RSS formats rather than as unstructured content in an HTML file.
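To make the idea of structured output concrete, here is a minimal Python sketch of consuming such a record as JSON. The field names (`title`, `text`, `author`, `published`, `language`, `url`), the `posts` wrapper and the sample values are illustrative assumptions based on the fields listed above, not the documented Webhose.io response schema.

```python
import json

# Hypothetical payload: field names and values are illustrative assumptions,
# not the documented Webhose.io response format.
sample = """
{
  "posts": [
    {
      "title": "Example thread title",
      "text": "Clean text of the post, with markup removed.",
      "author": "example_user",
      "published": "2014-06-01T12:00:00Z",
      "language": "english",
      "url": "http://forum.example.com/thread/1"
    }
  ]
}
"""

def summarize(raw):
    """Return (author, published, title) tuples for each post in the payload."""
    data = json.loads(raw)
    return [(p["author"], p["published"], p["title"]) for p in data["posts"]]

summary = summarize(sample)
for author, published, title in summary:
    print(f"{published} {author}: {title}")
```

The point of the sketch is only that dates, authors and text arrive as named fields, so no HTML parsing is needed on the consumer's side.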
For Omgili to do its job in the message board space, a world untouched by microformats and other semantic web structures, the company had to create a crawler that used heuristic techniques to extract text, titles and other details from those posts, founder Ran Geva says. “Once we created that we were able to extract data from less complex sources like blogs, news sites and so on,” he says.
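As a rough illustration of what heuristic extraction can mean when a page carries no semantic markup, the Python sketch below takes the `<title>` element as the title and the longest block of text as the post body. This is a toy heuristic chosen for the example, not Omgili's actual crawler logic; the class name, tag lists and sample page are all hypothetical.

```python
from html.parser import HTMLParser

class HeuristicExtractor(HTMLParser):
    """Toy heuristic: <title> is the title, the longest text block is the body."""
    BLOCK_TAGS = {"p", "div", "td", "li", "blockquote"}  # assumed block boundaries
    SKIP_TAGS = {"script", "style"}                      # never treated as content

    def __init__(self):
        super().__init__()
        self.title = ""
        self.blocks = []       # text of each finished block
        self._buf = []         # text fragments of the block being read
        self._in_title = False
        self._skip_depth = 0

    def _flush(self):
        # Join fragments and collapse runs of whitespace into single spaces.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag in self.BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False
        elif tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title += data
        else:
            self._buf.append(data)

def extract(html):
    """Return the guessed title and main text of an HTML page."""
    parser = HeuristicExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # capture any trailing text outside a block tag
    body = max(parser.blocks, key=len, default="")
    return {"title": parser.title.strip(), "text": body}

page = (
    "<html><head><title>Thread: best hiking boots</title>"
    "<style>p{color:red}</style></head><body>"
    "<div>Login | Register</div>"
    "<p>I have tried three pairs this year and the leather ones "
    "held up best on wet trails.</p>"
    "<div>Posted by user42</div></body></html>"
)
result = extract(page)
print(result["title"])
print(result["text"])
```

A production crawler would need far more than this (author and date detection, per-post segmentation within a thread, boilerplate removal), but the sketch shows the general shape of guessing structure from layout cues rather than from markup.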
“The output is similar to what import.io does, but on a much larger scale,” says founder Ran Geva. (See The Semantic Web Blog post on that technology here.) The import.io service, he says, is great when users want specific data from one or two sites, but more is required for the heavy lifting of working with large volumes of data. “We save and download millions of posts per day,” he says. “When it comes to getting structured content out of complicated sources on a mass scale, that’s a unique technology challenge we conquered a while ago.”