Common Crawl Corpus Update Makes Web Crawl Data More Efficient, Approachable For Users To Explore
Common Crawl is now providing its 2012 corpus of web crawl data not just as .ARC files; it is also releasing metadata files (JSON-based metadata with all the links from every page crawled, metatags, headers and so on) as well as text output.
Semantic web projects that use its corpus include Web Data Commons, which last month produced a new analysis of vocabulary usage by pay-level domain in its microdata and RDFa dataset.
With the metadata files, users don’t have to extract the link graph from the raw crawl, which, says Common Crawl Chief Architect Ahad Rana, is “pretty significant for the community. They don’t have to expend all this CPU power to extract the links. And metadata files are a much smaller set of data than the raw corpus.” Similarly, the full text output that users can now analyze is significantly smaller than the raw content in the .ARC files.
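To give a rough sense of why pre-extracted links save work, here is a minimal sketch of building a link graph directly from JSON metadata records. The record layout (one JSON object per line with `url`, `links`, and `href` fields) and the function names are assumptions made for illustration; they are not the documented Common Crawl metadata schema.

```python
import json
from collections import defaultdict

# Hypothetical metadata records, one JSON object per line. The field names
# ("url", "links", "href") are illustrative assumptions, not the actual
# Common Crawl metadata schema.
SAMPLE_LINES = [
    '{"url": "http://a.example/", "links": [{"href": "http://b.example/"}, {"href": "http://c.example/"}]}',
    '{"url": "http://b.example/", "links": [{"href": "http://c.example/"}]}',
    '{"url": "http://c.example/", "links": []}',
]


def build_link_graph(lines):
    """Map each source URL to the set of URLs it links to."""
    graph = defaultdict(set)
    for line in lines:
        record = json.loads(line)
        source = record["url"]
        graph[source]  # make sure pages without outlinks still appear
        for link in record.get("links", []):
            graph[source].add(link["href"])
    return dict(graph)


def in_degree(graph):
    """Count how many distinct pages link to each target URL."""
    counts = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)


graph = build_link_graph(SAMPLE_LINES)
inbound = in_degree(graph)
```

No HTML parsing happens here: the links are read straight from the metadata, which is the CPU saving Rana describes. A real job would stream records from the metadata files rather than from an in-memory list.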


