
Growing Resource: WebDataCommons.org

Following this teaser last week, Dr. Christian Bizer has reported, “We are happy to announce WebDataCommons.org, a joint project of Freie Universität Berlin and the Karlsruhe Institute of Technology to extract all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public. WebDataCommons.org provides the extracted data for download in the form of RDF-quads. In addition, we produce basic statistics about the extracted data.”

The website states, “Web Data Commons enables you to use structured data originating from hundreds of millions of web pages within your applications without needing to crawl the Web yourself. Pages in the Common Crawl corpora are included based on their PageRank score, thereby making the crawls snapshots of the current popular part of the Web. We have extracted and published structured data from both the 2012 and the 2009/2010 Common Crawl corpus. For the future, we plan to update the extracted datasets on a regular basis as new Common Crawl corpora are becoming available.”
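For readers curious what working with RDF-quads might look like in practice, here is a minimal Python sketch. It assumes the data is serialized as N-Quads, one statement per line, with the fourth term naming the page the statement was extracted from. The function names, the sample lines, and the example.com URLs are hypothetical, and the regular expression covers only a simplified subset of the N-Quads grammar, so a full RDF parser would be the safer choice for real data.

```python
import re
from collections import Counter

# One RDF term: an IRI in angle brackets, a blank node, or a quoted literal
# with an optional language tag or datatype. Simplified subset of N-Quads.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'

# A quad line: subject, predicate, object, graph IRI, then a closing dot.
QUAD = re.compile(
    r'^\s*' + TERM + r'\s+' + TERM + r'\s+' + TERM + r'\s+(<[^>]*>)\s*\.\s*$'
)


def parse_quad(line):
    """Return (subject, predicate, object, graph), or None if the line does not match."""
    match = QUAD.match(line)
    return match.groups() if match else None


def count_predicates(lines):
    """Count how often each predicate appears across the quad lines."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is not None:
            counts[quad[1]] += 1
    return counts


# Hypothetical sample data, not taken from the actual datasets.
SAMPLE = [
    '_:n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/Product> <http://example.com/page1> .',
    '_:n1 <http://schema.org/name> "Widget"@en <http://example.com/page1> .',
    '_:n2 <http://schema.org/name> "Gadget" <http://example.com/page2> .',
    '# a comment line, skipped by the parser',
]

if __name__ == "__main__":
    for predicate, count in count_predicates(SAMPLE).most_common():
        print(count, predicate)
```

Processing one line at a time keeps memory use flat, which matters for files too large to load whole.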

Learn more here.

Image: Courtesy Flickr/sjcockell
