
Web Data Commons Project Releases New Dataset

Christian Bizer and Robert Meusel of the Web Data Commons project today announced the release of a new WebDataCommons dataset: “The dataset has been extracted from the latest version of the Common Crawl. This August 2012 version of the Common Crawl contains over 3 billion HTML pages which originate from over 40 million websites (pay-level-domains). Altogether we discovered structured data within 369 million HTML pages contained in the Common Crawl corpus (12.3%). The pages containing structured data originate from 2.29 million websites (5.65%). Approximately 519 thousand of these websites use RDFa, while 140 thousand websites use Microdata. Microformats are used on 1.7 million websites.”

Bizer and Meusel noted, “Basic statistics about the extracted dataset as well as the vocabularies that are used together with each encoding format are found at: http://www.webdatacommons.org/2012-08/stats/stats.html. Additional statistics that analyze top-level domain distribution and the popularity of the websites covered by the Common Crawl, as well as the topical domains of the embedded data are found at: http://www.webdatacommons.org/2012-08/stats/additional_stats.html. The overall size of the August 2012 WebDataCommons dataset is 7.3 billion quads. The dataset is split into 1,416 files each having a size of around 100 MB. In order to make it easier to find data from a specific website or top-level-domain, we provide indexes about the location of specific data within the files.”
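For a rough sense of scale, the quoted figures can be combined in a few lines of Python. This is only a back-of-the-envelope sketch: every input is a rounded number from the announcement, the variable names are illustrative, and the total assumes each file is exactly 100 MB (the announcement says "around 100 MB"), using 1 GB = 1,000 MB.

```python
# Rounded figures as quoted in the announcement (not exact counts).
PAGES_TOTAL = 3_000_000_000        # "over 3 billion HTML pages"
PAGES_WITH_DATA = 369_000_000      # pages containing structured data
TOTAL_QUADS = 7_300_000_000        # overall dataset size in quads
FILE_COUNT = 1_416                 # number of files in the release
APPROX_FILE_SIZE_MB = 100          # assumption: "around 100 MB" taken as 100


def share_percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Share of crawled pages that contain structured data.
page_share = share_percent(PAGES_WITH_DATA, PAGES_TOTAL)

# Approximate total download size, in decimal gigabytes.
total_size_gb = FILE_COUNT * APPROX_FILE_SIZE_MB / 1000

# Average number of quads per file.
quads_per_file = TOTAL_QUADS / FILE_COUNT

print(f"Pages with structured data: {page_share:.1f}%")
print(f"Approximate total size: {total_size_gb:.1f} GB")
print(f"Average quads per file: {quads_per_file / 1e6:.2f} million")
```

Under these assumptions the release comes to roughly 140 GB, with about 5 million quads per file. The website share cannot be reproduced the same way, because the announcement gives the total only as "over 40 million" websites.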

Learn more at Web Data Commons.

Image: Courtesy Web Data Commons
