A million data sets. That’s how many government data sets are now out there on the web.

“The question is, when you have that many, how do you search for them, find them, coordinate activity between governments, bring in NGOs,” says James A. Hendler, Tetherless World Senior Constellation Professor in the Department of Computer Science and the Cognitive Science Department at Rensselaer Polytechnic Institute, where he is a principal investigator of its Linking Open Government Data project. Hendler also serves as Internet web expert for data.gov and is connected with many other governments’ open data projects. “Semantic web tools organize and link the metadata about these things, making them searchable, explorable and extensible.”

To be more specific, speaking at SemTech a couple of weeks ago, Hendler said there were 851,000 open government data sets across 153 catalogues from 30-something countries, with the three biggest contributors, in terms of numbers, being the U.S., the U.K., and France. Last week, the one-million threshold was crossed.

About 410,000 of these data sets are from the U.S. (federal, state, city, county, and tribal combined), including quite a large number of geo-data sets. The U.S. government’s goal is to put “lots and lots and lots of stuff out there” and let people figure out what they want to do with it, he notes.

And the momentum keeps building in the States, through efforts such as President Obama’s release in May of a new directive, “Digital Government: Building a 21st Century Platform to Better Serve the American People.” The strategy emphasizes open APIs that let developers create mobile apps leveraging government data, and the adoption of new standards to make applicable government information open and machine-readable by default.
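To make the open-API idea concrete, here is a minimal sketch of the kind of call a mobile app could make against a government catalog API. The endpoint and response fields follow the convention of CKAN (the open-source catalog software used by data.gov’s catalog), but treat both as assumptions for illustration rather than a documented contract.

```python
# Minimal sketch: querying a CKAN-style catalog API for datasets that
# match a keyword. The endpoint URL and JSON field names are assumptions
# based on CKAN's package_search action, not an official data.gov spec.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://catalog.data.gov/api/3/action/package_search"  # assumed endpoint

def search_datasets(keyword: str, rows: int = 5) -> list[dict]:
    """Return up to `rows` catalog entries whose metadata matches `keyword`."""
    url = f"{BASE}?{urlencode({'q': keyword, 'rows': rows})}"
    with urlopen(url) as resp:
        return json.load(resp)["result"]["results"]

if __name__ == "__main__":
    for dataset in search_datasets("health"):
        print(dataset["title"])
```

The point of the directive is that an app developer never scrapes a web page: the same machine-readable interface serves a mobile app, a mashup, or a research script.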

In contrast to the U.S. approach, the U.K. is trying to go sector by sector with the government data it releases, to help with decision-making. “So the U.K. may be the lead when it comes to the organized release of data and cataloging.” But beyond those leaders, it is happening everywhere, even if in smaller doses. Hendler notes that Russia’s new government includes an open government cabinet-level position, for instance.

“There’s the public good of transparency and the specific good of pushing innovation,” Hendler says. “Releasing government data can get things the government used to pay for done for free, or cause new things to happen that let people create companies around data they used to have to toil for.”

As the figures show, “when you talk about the web of data, there are a huge number of open government data sets.” And not only won’t keyword searches find you every one of them related, for example, to health, but there are also issues around things like multiple-language metadata.
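Here is a minimal sketch of how linked metadata can get past both problems. The vocabularies (the W3C’s DCAT and SKOS) are real and widely used for catalog metadata, but the endpoint URL and the exact modeling are assumptions for illustration: because datasets are tagged with shared concepts whose labels carry language tags, a single query can find health data whether a catalogue’s metadata says “health,” “santé,” or “salud.”

```python
# Sketch: searching linked dataset metadata across languages via SPARQL.
# The endpoint is a placeholder; DCAT/SKOS modeling is assumed here.
from SPARQLWrapper import SPARQLWrapper, JSON  # pip install SPARQLWrapper

ENDPOINT = "http://example.org/sparql"  # placeholder, not a real service

QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?dataset ?title WHERE {
  ?dataset a dcat:Dataset ;
           dct:title ?title ;
           dcat:theme ?theme .
  # Match the shared concept by any of its labels, in any language.
  ?theme skos:prefLabel ?label .
  FILTER (lcase(str(?label)) IN ("health", "santé", "salud"))
}
LIMIT 25
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["dataset"]["value"], "-", row["title"]["value"])
```

A keyword engine would need one index per language; here the cross-language linking lives in the metadata itself, so one concept does the work.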

“There are a lot of issues we have been thinking about for a long time,” says Hendler.

Work to be done in terms of making open government data more accessible to search engines includes, for example, expressing dataset descriptions as schema.org extensions. “Schema.org doesn’t have a database descriptor, but there’s a lot of interest in one,” Hendler says. “So large government sites are talking to search engine companies about embedded formats.”
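Since schema.org has no dataset type yet, any embedded format is necessarily speculative. The sketch below shows the general shape such markup could take: a JSON-LD block a catalog page embeds for crawlers, with a hypothetical “Dataset” type and illustrative property names and values.

```python
# Speculative sketch of an embedded schema.org-style dataset description.
# "Dataset" is a hypothetical extension type here, and the property names
# and values are illustrative only, not an adopted standard.
import json

description = {
    "@context": "http://schema.org",
    "@type": "Dataset",  # hypothetical extension type
    "name": "Example Agency Spending Records",       # illustrative
    "description": "Quarterly spending records released as open data.",
    "publisher": {
        "@type": "GovernmentOrganization",           # existing schema.org type
        "name": "Example Agency",                    # illustrative
    },
    "distribution": "http://example.gov/data/spending.csv",  # illustrative
}

# A catalog page could embed this block so search engine crawlers can
# index the dataset's metadata alongside the page itself.
print('<script type="application/ld+json">')
print(json.dumps(description, indent=2))
print("</script>")
```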

————————————————————————————————

In related news, it also was recently announced that computer and web scientists at the Tetherless World Constellation, led by Constellation Professor Deborah McGuinness, will be among the parties working on the Intelligence Advanced Research Projects Activity (IARPA) Foresight and Understanding from Scientific Exposition (FUSE) program, which aims to develop computer systems that help quickly identify emerging ideas and capabilities in technology. Rensselaer received $510,000 to fund the initial phase of its part in the larger collaborative research project, which also involves Brandeis University, New York University, and 1790 Analytics, and is led by BAE Systems.

As noted in the IARPA press release launching FUSE, the current process of scanning the horizon for new technologies is carried out primarily by humans. The end game is to develop computational programs that can quickly analyze millions, even billions, of pages of text across multiple languages for the emergence of new technological and scientific trends, so those trends can be better studied, developed, and invested in by human analysts.
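To give a feel for the underlying idea, here is a deliberately naive sketch, not the FUSE team’s approach, of one emergence signal: scoring terms by how fast their document frequency grows across yearly slices of a corpus, so fast-rising terms can be flagged for human analysts.

```python
# Naive emergence signal: terms that are rare in early years of a corpus
# but common in later years score highest. Illustrative only; FUSE's
# actual methods are not described in the source.
from collections import Counter

def emergence_scores(docs_by_year: dict[int, list[str]]) -> dict[str, float]:
    """Rank terms by growth in document frequency between the earliest
    and latest year in `docs_by_year` (year -> list of documents)."""
    years = sorted(docs_by_year)

    def doc_freq(year: int) -> Counter:
        counts = Counter()
        for doc in docs_by_year[year]:
            counts.update(set(doc.lower().split()))
        return counts

    early, late = doc_freq(years[0]), doc_freq(years[-1])
    # Smoothed ratio: terms absent early but frequent late score highest.
    return {term: late[term] / (early[term] + 1) for term in late}

docs = {
    2001: ["statistical machine translation results", "grid computing survey"],
    2011: ["deep learning for speech", "deep learning image features"],
}
for term, score in sorted(emergence_scores(docs).items(),
                          key=lambda kv: -kv[1])[:3]:
    print(term, round(score, 2))
```

Real systems would need far more than frequency growth (disambiguation, multilingual handling, citation structure), which is precisely the research challenge the program is funding.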