The Common Crawl Foundation is on the verge of adding to its Amazon Web Services (AWS) Public Data Set of openly and freely accessible web crawl data. It was back in January that Common Crawl announced the debut of its corpus on AWS (see our story here). Now, a billion new web sites are in the bucket, according to Common Crawl director Lisa Green, adding to the 5 billion web pages already there.
“When are you going to have new data is one of most frequent questions we get,” she says. The answer is that processing is underway now, and she hopes they’ll be ready to go this week.