Common Crawl To Add New Data In Amazon Web Services Bucket

The Common Crawl Foundation is on the verge of adding to its Amazon Web Services (AWS) Public Data Set of openly and freely accessible web crawl data. Common Crawl announced the debut of its corpus on AWS back in January (see our story here). Now, a billion new web sites are in the bucket, according to Common Crawl director Lisa Green, adding to the 5 billion web pages already there.
“‘When are you going to have new data?’ is one of the most frequent questions we get,” she says. The answer is that processing is underway now, and she hopes the new data will be ready to go this week.





Eric Franzon
VP Community
Jennifer Zaino
Contributor
Angela Guess Contributor