Posts Tagged ‘microformat’

The Web Is 25 — And The Semantic Web Has Been An Important Part Of It

NOTE: This post was updated at 5:40pm ET.

Today the Web celebrates its 25th birthday, and we celebrate the Semantic Web’s role in that milestone. And what a milestone it is: As of this month, the Indexed Web contains at least 2.31 billion pages, according to WorldWideWebSize.  

The Semantic Web Blog reached out to the World Wide Web Consortium’s current and former semantic leads to get their perspective on the roads the Semantic Web has traveled and the value it has so far brought to the Web’s table: Phil Archer, W3C Data Activity Lead coordinating work on the Semantic Web and related technologies; Ivan Herman, who last year transitioned roles at the W3C from Semantic Activity Lead to Digital Publishing Activity Lead; and Eric Miller, co-founder and president of Zepheira and the leader of the Semantic Web Initiative at the W3C until 2007.

While the Semantic Web came to the attention of the wider public in 2001, with the publication in Scientific American of The Semantic Web by Tim Berners-Lee, James Hendler and Ora Lassila, Archer points out that “one could argue that the Semantic Web is 25 years old,” too. He cites Berners-Lee’s March 1989 paper, Information Management: A Proposal, which includes a diagram showing relationships that are immediately recognizable as triples. “That’s how Tim envisaged it from Day 1,” Archer says.

Read more

Common Crawl To Add New Data In Amazon Web Services Bucket

The Common Crawl Foundation is on the verge of adding to its Amazon Web Services (AWS) Public Data Set of openly and freely accessible web crawl data. It was back in January that Common Crawl announced the debut of its corpus on AWS (see our story here). Now, a billion new web sites are in the bucket, according to Common Crawl director Lisa Green, adding to the 5 billion web pages already there.

“‘When are you going to have new data?’ is one of the most frequent questions we get,” she says. The answer is that processing is underway now, and she hopes they’ll be ready to go this week.

Read more

Common Crawl Founder Gil Elbaz Speaks About New Relationship With Amazon, Semantic Web Projects Using Its Corpus, And Why Open Web Crawls Matter To Developing Big Data Expertise

The Common Crawl Foundation’s repository of openly and freely accessible web crawl data is about to go live as a Public Data Set on Amazon Web Services. The non-profit Common Crawl is the vision of Gil Elbaz, who founded Applied Semantics and the AdSense technology for which Google acquired it, as well as the Factual open data aggregation platform, and it counts Nova Spivack, who’s been behind semantic services from Twine to Bottlenose, among its board of directors.

Elbaz’ goal in developing the repository: “You can’t access, let alone download, the Google or the Bing crawl data. So certainly we’re differentiated in being very open and transparent about what we’re crawling and actually making it available to developers,” he says.

“You might ask why is it going to be revolutionary to allow many more engineers and researchers and developers and students access to this data, whereas historically you have to work for one of the big search engines…. The question is, the world has the largest-ever corpus of knowledge out there on the web, and is there more that one can do with it than Google and Microsoft and a handful of other search engines are already doing? And the answer is unquestionably yes.”

Read more

Is the Publishing Industry Ready to Embrace Change? Find Out At The Semantic Web Media Summit

The publishing industry is an interesting beast: Its front-end moves rapidly to get content out to readers, but its back-end processes to deliver that information are so tightly packed that there’s not a lot of drive to make sweeping changes in those technologies or processes.

“They have to be on this schedule, so traditionally they have been slow to change their operations,” says Rachel Lovinger, associate experience director at interactive agency Razorfish.  At next week’s Semantic Web Media Summit taking place in New York City, Lovinger will be among the speakers discussing topics such as whether the industry’s tolerance for change is growing, given its need to find more innovative ways both to reach audiences and be more profitable.

Read more

Infochimps Adds Geo APIs and Takes A Shine to Schema.Org, Too

On the way from Infochimps: Its Geo APIs, which bring developers data from open sources such as GeoNames and The National Climate Data Center, as well as licensed sources such as Locationary and Foursquare. Now for the twist: The data marketplace is semantifying the geo data with a Schema.org approach. This is just the first step in a broader plan to format, in a unified way, all the data to which its new and existing APIs provide access.

As CTO Flip Kromer explains it, Infochimps takes from Schema.org the collection of types it defines. “That was designed for microformat markup in web pages. And we say why not take this tastefully done collection of types and make it so that it can be used by mobile phone and web developers,” he says, so that they can easily and in a unified way build on the databases to which Infochimps provides access. “That if we map that back to a JSON HTTP API world, that’s a really good thing that unlocks a lot of power for developers.”
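To make the idea of mapping a shared type vocabulary onto a JSON API a little more concrete, here is a minimal sketch in Python. It is not Infochimps’ actual API or response format: the raw record’s field names, the `to_schema_org_place` helper and the `"type"` key are all assumptions made for illustration. Only the type and property names (`Place`, `GeoCoordinates`, `PostalAddress`, `geo`, `latitude`, `longitude`, `addressCountry`) come from the Schema.org vocabulary.

```python
import json

# Hypothetical raw record, loosely modeled on a gazetteer row. These field
# names are illustrative assumptions, not a real GeoNames or Infochimps schema.
raw_record = {
    "place_name": "Austin",
    "lat": 30.2672,
    "lng": -97.7431,
    "country_code": "US",
}


def to_schema_org_place(record):
    """Reshape a raw geo record using Schema.org type and property names."""
    return {
        "type": "Place",  # the "type" key itself is an assumed convention
        "name": record["place_name"],
        "geo": {
            "type": "GeoCoordinates",
            "latitude": record["lat"],
            "longitude": record["lng"],
        },
        "address": {
            "type": "PostalAddress",
            "addressCountry": record["country_code"],
        },
    }


place = to_schema_org_place(raw_record)
print(json.dumps(place, indent=2))
```

The appeal of this kind of mapping is that a developer who has learned the shape of one `Place` can read any other source exposed the same way, whatever the upstream column names happened to be.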

Read more

RDFa Fading Away in EPUB3 Standard

To follow up our story last week about the upcoming EPUB3 standard, for which International Digital Publishing Forum membership comments were due in by today, it appears that the updates “are moving the spec even further from any apparent support/use of RDFa.” That’s according to Eric Freese, solutions architect at digital publishing solutions vendor Aptara and member of the IDPF trade and standards organization.

One of the reasons RDFa support has been in question is that there were some concerns regarding what the future holds for RDFa 1.1. The timeline for RDFa 1.1 puts January 2012 as the date for the publication of all final documents. Issues such as its side-by-side evolution with microdata structured data markup schemes, such as Google, Bing and Yahoo’s Schema.org collaboration, are still being reviewed, for example.

Read more

Sindice Puts The Web of Data At Your Disposal

Ltd. launched as a startup company this week, complete with a publicly available beta SPARQL endpoint to its indexed and live-updated dataset of some 12 billion triples. Next week will see Sindice (which began as a joint academic research project among DERI, the Fondazione Bruno Kessler and OpenLink Software to collect, search, query and build applications on top of semantically marked up Web data) deliver formal support for

Sindice, of course, is agnostic when it comes to ingesting semantic markup formats. Supporting new formats is just a matter of syntax adaptation for the service. Whatever format a web site decides to employ — from RDF to RDFa to microformats to microdata — Sindice has coverage of the structured web data and keeps it fresh.

The service opens up vast possibilities for business: As long as a web site structures data in one of these formats, and uses standards like Sitemaps for publishing semantic content, it can become a part of Sindice’s continuously updated repository. And thus it becomes a data source for business use, one that also can join with other datasets.

Read more

Guided Tour of The Semantic Web At SemTech 2011

An informal raise-your-hand survey at the SemTech conference in San Francisco this week revealed that a good number of attendees were here for the first time. And one of the early morning tutorials Monday provided a perfect opportunity for many of them to explore the Semantic Web in greater depth: the Introduction to the Semantic Web session led by the W3C’s Semantic Web Activity Lead, Ivan Herman.

During the session, Herman explained the various components of the Semantic Web and how they fit together. He started by defining RDF (Resource Description Framework) as the basis for it all, serving as a general model for the triples (the subject-property-object sets) that form a directed, labeled graph, where labels are identified by URIs (Uniform Resource Identifiers). He worked from there all the way through to OWL and RIF (Rule Interchange Format).
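For readers new to the model, the triples-as-graph idea can be sketched in a few lines of plain Python. This is only an illustration, not material from Herman’s tutorial: the `http://example.org/` namespace, the resource names and the `build_graph` and `objects` helpers are all hypothetical, and the sketch ignores literals and blank nodes, which real RDF also allows.

```python
from collections import defaultdict

EX = "http://example.org/"  # hypothetical namespace, for illustration only

# Each triple is (subject, property, object); every label is a URI.
triples = [
    (EX + "alice", EX + "knows", EX + "bob"),
    (EX + "alice", EX + "worksFor", EX + "acme"),
    (EX + "bob", EX + "worksFor", EX + "acme"),
]


def build_graph(triples):
    """Index triples as a directed graph: subject -> list of (property, object)."""
    graph = defaultdict(list)
    for subject, prop, obj in triples:
        graph[subject].append((prop, obj))  # a labeled edge from subject to obj
    return graph


def objects(graph, subject, prop):
    """Follow every edge labeled `prop` that leaves `subject`."""
    return [obj for p, obj in graph.get(subject, []) if p == prop]


graph = build_graph(triples)
print(objects(graph, EX + "alice", EX + "worksFor"))
```

Because subjects, properties and objects are all named with URIs, two datasets that use the same URI for a resource can simply be concatenated into one larger graph.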

Some highlights of the journey follow:

Read more

Extra, Extra: rNews Seeks To Be Semantic Standard For Online News Publishers

News publishing outlets stand to benefit from adopting Semantic Web technologies, and now there’s a lightweight way for them to begin moving in that direction, too.

The International Press Telecommunications Council (IPTC) recently introduced rNews 0.1, a set of specifications and best practices for using RDFa to embed news-specific metadata (headlines, bylines, publication dates and so on) into HTML documents. It hopes rNews will become a standard in the industry for conveying through to browsers and into HTML documents the deep structure and explicitly modeled content that exists in publishers’ back-end data layers. The wider its adoption across news channels, the greater the chance of innovative apps cropping up that can help publishers increase engagement with their audiences, according to rNews’ developers.

Read more

RDFa is Sweeping the Web

A new article by Peter Mika looks at the growing reach of RDFa and microformats on the web. The article includes a chart with information on the deployment of RDFa and other microformats across the web “based on an analysis of 12 billion web pages indexed by Yahoo! Search.” The analysis, which spans the last three years, is quite enlightening.

According to the article, “The data shows that the usage of RDFa has increased 510% between March, 2009 and October, 2010, from 0.6% of webpages to 3.6% of webpages (or 430 million webpages in our sample of 12 billion). This is largely thanks to the efforts of the folks at Yahoo! (SearchMonkey), Google (Rich Snippets) and Facebook (Open Graph), all of whom recommend the usage of RDFa. The deployment of microformats has not advanced significantly in the same period, except for the hatom microformat.”
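The figures in that quote can be sanity-checked with a little arithmetic. The sketch below uses only the rounded numbers quoted above (0.6%, 3.6% and a 12 billion page sample); the variable names are ours. The rounded shares give a slightly different growth figure than the 510% in the article, presumably because the article worked from unrounded values, though that is an assumption on our part.

```python
from fractions import Fraction

sample_pages = 12_000_000_000      # pages in the Yahoo! Search sample
share_before = Fraction(6, 1000)   # 0.6% of pages in March 2009
share_after = Fraction(36, 1000)   # 3.6% of pages in October 2010

# Pages carrying RDFa at the end of the period.
rdfa_pages = sample_pages * share_after

# Relative growth in share, expressed as a percentage.
growth_percent = (share_after - share_before) / share_before * 100

print(int(rdfa_pages))      # number of RDFa pages implied by the rounded share
print(int(growth_percent))  # growth implied by the rounded shares
```

The page count comes out to 432 million, in line with the “430 million” in the quote, while the rounded shares imply growth of 500%.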

Read more