A call for comments is out on a proposal for a ‘Datasets’ addition to schema.org, issued via the W3C’s Web Schemas task force, the group the schema.org project uses to collaborate with the wider community.

The proposal, which extends schema.org for describing datasets and data catalogs, introduces three new types with associated properties: DataCatalog, Dataset, and DataDownload.
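To make the idea of describing a dataset with schema.org types more concrete, here is a minimal sketch in Python that builds a JSON-LD description. It is an illustration under stated assumptions, not text from the proposal: the property names (`includedInDataCatalog`, `distribution`, `contentUrl`, `encodingFormat`) follow the vocabulary as currently published on schema.org and may differ from the draft under review, and the helper function, dataset name, and URLs are hypothetical.

```python
import json


def build_dataset_description(name, description, catalog_name, download_url, media_type):
    """Return a dict that serializes to a JSON-LD description of one dataset.

    Hypothetical helper for illustration; property names are taken from the
    current schema.org vocabulary and are an assumption relative to the draft.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # The catalog that lists this dataset.
        "includedInDataCatalog": {
            "@type": "DataCatalog",
            "name": catalog_name,
        },
        # One downloadable form of the dataset.
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": download_url,
                "encodingFormat": media_type,
            }
        ],
    }


# Placeholder values, not a real dataset.
example = build_dataset_description(
    name="Example Air Quality Readings",
    description="Hourly readings from hypothetical monitoring stations.",
    catalog_name="Example Open Data Catalog",
    download_url="https://example.org/data/air-quality.csv",
    media_type="text/csv",
)

print(json.dumps(example, indent=2))
```

A publisher would typically embed output like this in a dataset's landing page so that search engines can read the high-level description without understanding the domain-specific format of the data itself.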

Writing at the Schema.org blog, Dan Brickley calls it a “small but useful vocabulary,” with particular relevance to open government and public sector data.

He also references this week’s post at Data.gov by Chris Musialek, the chief software architect for the site. Musialek writes that, following a review of the draft proposal, “we are comfortable with the current state of things,” and that any work left to do seems very resolvable.

“We’ve been watching the schema.org datasets schema space for a while now, as Data.gov is very interested in adding schema.org support for our listing of over 450,000 datasets. We think this will help the major search engines create better relevance rankings of Federal government data, where many searches begin,” Musialek says. And he notes later in the post that, “We’re really excited to see this schema move in the direction of official addition to schema.org. We really hope to see it be included in a schema.org release soon.”

The Tetherless World Constellation (TWC) at Rensselaer Polytechnic Institute, where Professor James A. Hendler is now the head of the Department of Computer Science, has a demo available. It contains automatically generated dataset descriptions based on TWC’s International Dataset Search and uses the schema.org extension for datasets and data catalogs. A few weeks back, at the Semantic Technology & Business Conference in San Francisco, Hendler told The Semantic Web Blog in an interview that, while a vocabulary for describing datasets and data catalogs was not yet part of schema.org, efforts were underway to make that happen.

In that interview Hendler also disclosed that the number of open government datasets on the web has hit the million mark. In his schema.org blog posting, Brickley says the proposal is exciting because of the “huge number of datasets that have been made public in recent years. While each dataset may ultimately be expressed in detailed, domain-specific form (e.g. using specific scientific or statistical schemas), the Datasets proposal focuses on the high level common characteristics that are shared across thousands of otherwise diverse datasets.”

The proposal includes a table mapping Datasets extension types and properties (including supporting schema.org vocabulary) to and from their approximate equivalents in Data Catalog Vocabulary (DCAT), Asset Description Metadata Schema (ADMS), and VoID. The next step for the proposal is to gather feedback from publishers of applicable datasets on whether the extension would be useful to them and whether it fits the metadata they have available.
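The kind of type-level mapping described above can be sketched as a simple lookup table. This is a rough illustration rather than a copy of the proposal's table: it covers only the three types and DCAT, it uses the commonly cited approximate correspondences, and the `schema:` and `dcat:` prefixes and function names are assumptions made for readability. Property-level rows, ADMS, and VoID are left out.

```python
# Approximate type-level correspondences between the Datasets extension
# and DCAT. Assumption: these pairings are the commonly cited ones, not
# rows copied from the proposal's own table.
SCHEMA_TO_DCAT = {
    "schema:DataCatalog": "dcat:Catalog",
    "schema:Dataset": "dcat:Dataset",
    "schema:DataDownload": "dcat:Distribution",
}

# The mapping is one-to-one at the type level, so it can be inverted
# to translate in the other direction.
DCAT_TO_SCHEMA = {dcat: schema for schema, dcat in SCHEMA_TO_DCAT.items()}


def to_dcat(schema_type):
    """Return the approximate DCAT equivalent, or None if unmapped."""
    return SCHEMA_TO_DCAT.get(schema_type)


def to_schema(dcat_type):
    """Return the approximate schema.org equivalent, or None if unmapped."""
    return DCAT_TO_SCHEMA.get(dcat_type)


print(to_dcat("schema:DataDownload"))
print(to_schema("dcat:Catalog"))
```

Because the equivalents are approximate, a real converter would also need to handle properties that have no counterpart on the other side.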