Posts Tagged ‘Web crawl data’

Common Crawl Corpus Update Makes Web Crawl Data More Efficient, Approachable For Users To Explore

Common Crawl is now providing its 2012 corpus of web crawl data not just as .ARC files but also as metadata files (JSON-based metadata with all the links from every page crawled, metatags, headers and so on) and as text output.

Semantic web projects that use the corpus include Web Data Commons, which last month produced a new analysis of vocabulary usage by pay-level domain in its microdata and RDFa dataset.

With the metadata files, users don’t have to extract the link graph from the raw crawl, which, says Common Crawl Chief Architect Ahad Rana, is “pretty significant for the community. They don’t have to expend all this CPU power to extract the links. And metadata files are a much smaller set of data than the raw corpus.” Similarly, the full text output that users can now analyze is significantly smaller than the raw content in the .ARC files.


Common Crawl To Add New Data In Amazon Web Services Bucket

The Common Crawl Foundation is on the verge of adding to its Amazon Web Services (AWS) Public Data Set of openly and freely accessible web crawl data. Common Crawl announced the debut of its corpus on AWS back in January (see our story here). Now, a billion new web sites are in the bucket, according to Common Crawl director Lisa Green, adding to the 5 billion web pages already there.

“When are you going to have new data is one of the most frequent questions we get,” she says. The answer is that processing is underway now, and she hopes the new data will be ready to go this week.


Common Crawl Founder Gil Elbaz Speaks About New Relationship With Amazon, Semantic Web Projects Using Its Corpus, And Why Open Web Crawls Matter To Developing Big Data Expertise

The Common Crawl Foundation’s repository of openly and freely accessible web crawl data is about to go live as a Public Data Set on Amazon Web Services. The non-profit Common Crawl is the vision of Gil Elbaz, who founded Applied Semantics (whose AdSense technology led Google to acquire it) as well as the Factual open data aggregation platform. It counts Nova Spivack, who has been behind semantic services from Twine to Bottlenose, among its board of directors.

Elbaz’ goal in developing the repository: “You can’t access, let alone download, the Google or the Bing crawl data. So certainly we’re differentiated in being very open and transparent about what we’re crawling and actually making it available to developers,” he says.

“You might ask why is it going to be revolutionary to allow many more engineers and researchers and developers and students access to this data, whereas historically you have to work for one of the big search engines…. The question is, the world has the largest-ever corpus of knowledge out there on the web, and is there more that one can do with it than Google and Microsoft and a handful of other search engines are already doing? And the answer is unquestionably yes.”
