Posts Tagged ‘Common Crawl’

Blekko Data Donation Is A Big Benefit To Common Crawl

Common Crawl, the non-profit organization creating a repository of openly and freely accessible web crawl data, is getting a present from search engine provider blekko, which is donating its search engine ranking metadata for 140 million websites and 22 billion webpages to Common Crawl.

“The blekko data donation is a huge benefit to Common Crawl,” Common Crawl director Lisa Green told The Semantic Web Blog. “Knowing what the blekko team is crawling and how they rate those pages allows us to improve our crawler and enrich our corpus for high-value webpages.”

Read more

Web Data Commons Project Releases New Dataset

Christian Bizer and Robert Meusel of the Web Data Commons project today announced the release of a new WebDataCommons dataset: “The dataset has been extracted from the latest version of the Common Crawl. This August 2012 version of the Common Crawl contains over 3 billion HTML pages which originate from over 40 million websites (pay-level-domains). Altogether we discovered structured data within 369 million HTML pages contained in the Common Crawl corpus (12.3%). The pages containing structured data originate from 2.29 million websites (5.65%). Approximately 519 thousand of these websites use RDFa, while 140 thousand websites use Microdata. Microformats are used on 1.7 million websites.” Read more
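
For a rough sense of how those three syntaxes can be recognized in a page, here is a minimal Python sketch. Web Data Commons runs a full extraction pipeline over the corpus; the lightweight attribute checks below are only an illustration, not the project’s actual code.

    # Illustrative only: Web Data Commons uses a full extraction
    # pipeline, not these lightweight heuristics.
    import re

    MARKERS = {
        "RDFa": re.compile(r'\b(?:typeof|property|vocab)\s*=', re.I),
        "Microdata": re.compile(r'\b(?:itemscope|itemtype|itemprop)\b', re.I),
        "Microformats": re.compile(
            r'class\s*=\s*["\'][^"\']*\b(?:vcard|hcard|hreview|vevent)\b', re.I),
    }

    def detect_structured_data(html):
        """Return the structured-data syntaxes that appear in a page."""
        return {name for name, pattern in MARKERS.items() if pattern.search(html)}

    page = ('<div itemscope itemtype="http://schema.org/Product">'
            '<span itemprop="name">Widget</span></div>')
    print(detect_structured_data(page))  # {'Microdata'}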

Common Crawl Announces Winners of Code Contest

Common Crawl has announced the winners of their first ever Common Crawl Code Contest. According to the site, “We were thrilled by the response to the contest and the many great entries. Several people let us know that they were not able to complete their project in time to submit to the contest. We’re currently working with them to finish the projects outside of the contest and we’ll be showcasing some of those projects in the near future! All entries creatively showcased the immense potential of the Common Crawl data.” Read more

New Common Crawl Video and Contest Details

Common Crawl is back in the news after releasing a new video about the organization. According to Common Crawl, “After announcing the release of 2012 data and other enhancements, we are now excited to share with you this short video that explains why we here at Common Crawl are working hard to bring web crawl data to anyone who wants to use it. We hope it gets you excited about our work too. Please help us share this by posting, forwarding, and tweeting widely! We want our message to be broadcast loud and clear: openly accessible web crawl data is a powerful resource for education, research, and innovation of every kind.” Read more

Common Crawl Corpus Update Makes Web Crawl Data More Efficient And Approachable For Users To Explore

Common Crawl is now providing its 2012 corpus of web crawl data not only as .ARC files: it is also releasing metadata files (JSON-based metadata with all the links from every page crawled, metatags, headers, and so on) as well as text output.
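
To make the metadata format concrete, here is a small sketch of what one JSON record could look like and how it might be read in Python. The field names (url, meta_tags, links, and so on) are illustrative assumptions for this post, not Common Crawl’s published schema.

    # Field names here are assumptions for illustration, not
    # Common Crawl's actual 2012 metadata schema.
    import json

    line = json.dumps({
        "url": "http://example.com/",
        "headers": {"content-type": "text/html"},
        "meta_tags": {"description": "An example page"},
        "links": ["http://example.com/about", "http://other.example.org/"],
    })

    record = json.loads(line)
    print(record["url"], "->", len(record["links"]), "outlinks")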

Semantic web projects that use its corpus include the work of Web Data Commons, which last month created a new analysis on vocabulary usage by pay-level domain in its microdata and RDFa dataset.

With the metadata files, users don’t have to extract the link graph from the raw crawl, which, says Common Crawl Chief Architect Ahad Rana, is “pretty significant for the community. They don’t have to expend all this CPU power to extract the links. And metadata files are a much smaller set of data than the raw corpus.” Similarly, the full text output that users can now run analysis over is significantly smaller than the raw .ARC content.
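
Assuming the same illustrative record layout as the sketch above, extracting an outlink graph straight from the metadata files, rather than from the raw .ARC content, could look something like this:

    # Aggregate an outlink graph from JSON metadata records without
    # touching the raw .ARC files (record layout assumed as above).
    import json
    from collections import defaultdict

    def build_link_graph(metadata_lines):
        """Map each crawled URL to the set of URLs it links out to."""
        graph = defaultdict(set)
        for line in metadata_lines:
            record = json.loads(line)
            graph[record["url"]].update(record.get("links", []))
        return graph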

Read more

Common Crawl To Add New Data In Amazon Web Services Bucket

The Common Crawl Foundation is on the verge of adding to its Amazon Web Services (AWS) Public Data Set of openly and freely accessible web crawl data. It was back in January that Common Crawl announced the debut of its corpus on AWS (see our story here). Now, a billion new web pages are in the bucket, according to Common Crawl director Lisa Green, adding to the 5 billion web pages already there.

“When are you going to have new data is one of the most frequent questions we get,” she says. The answer is that processing is underway now, and she hopes they’ll be ready to go this week.
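
For readers who want to browse the bucket themselves, here is a minimal sketch of anonymous access using Python and boto3. The bucket name and prefix below are assumptions for illustration; check Common Crawl’s site for the Public Data Set’s current location and layout.

    # Anonymous read access to a public S3 bucket; the bucket name and
    # prefix are assumptions, so consult Common Crawl for the real layout.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    response = s3.list_objects_v2(Bucket="commoncrawl",
                                  Prefix="crawl-002/", MaxKeys=10)
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])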

Read more

Common Crawl Founder Gil Elbaz Speaks About New Relationship With Amazon, Semantic Web Projects Using Its Corpus, And Why Open Web Crawls Matter To Developing Big Data Expertise

The Common Crawl Foundation’s repository of openly and freely accessible web crawl data is about to go live as a Public Data Set on Amazon Web Services. The non-profit Common Crawl is the vision of Gil Elbaz, who founded Applied Semantics, creator of the AdSense technology for which Google acquired it, as well as the Factual open data aggregation platform, and it counts Nova Spivack, who’s been behind semantic services from Twine to Bottlenose, among its board of directors.

Elbaz’s goal in developing the repository: “You can’t access, let alone download, the Google or the Bing crawl data. So certainly we’re differentiated in being very open and transparent about what we’re crawling and actually making it available to developers,” he says.

“You might ask why is it going to be revolutionary to allow many more engineers and researchers and developers and students access to this data, whereas historically you have to work for one of the big search engines…. The question is, the world has the largest-ever corpus of knowledge out there on the web, and is there more that one can do with it than Google and Microsoft and a handful of other search engines are already doing? And the answer is unquestionably yes.”

Read more