Semantic Web Community: I’m disappointed in us! Or at least in our group marketing prowess. We have been failing to capitalize on two major trends that everyone has been talking about and that are directly addressable by Semantic Web technologies! For shame.
I’m talking, of course, about Big Data and NoSQL. Since I’ve already given my take on how Semantic Web technology can help with the Big Data problem on SemanticWeb.com, this time around I’ll tackle NoSQL and the Semantic Web.
After all, we gave up SQL more than a decade ago. We should be part of the discussion. Heck, even the XQuery guys got in on the action early!
Check out this Google Trends diagram.
NoSQL came out of nowhere in 2009, and now dominates much of the database conversation on the web. Document stores like MongoDB and CouchDB, distributed key-value stores such as Riak and Cassandra, and other weird stores like Hadoop-as-database (never understood that usage myself) now lead the conversation as the alternative to traditional SQL databases.
Some sources will toss a nod to Semantic Web tools when talking about NoSQL, but not much more than that. Wikipedia’s NoSQL page does mention RDF Database as a category, but only lists a single RDF store! AllegroGraph and Virtuoso are also mentioned under the Graph database section, but none of the others. In the upcoming book from the Pragmatic Bookshelf, Seven Databases in Seven Weeks, a graph store makes it (Neo4J), but not a single RDF store does.
I’m going to argue that Semantic Web technologies should be a bigger part of this NoSQL conversation for a certain class of problems.
Why haven’t we been? I’ve been thinking about this a lot lately, and it makes sense to start by comparing the class of newer, NoSQL databases to what we’ve been working on to see similarities and differences.
At first blush, the technologies are similar in many respects:
- Non-SQL queries! (duh). We have SPARQL, while they have a variety. We both often use JSON for serialization.
- Our data model (RDF) is a graph instead of a set of tables. Some NoSQL stores use graphs, others are tuple stores, others are document stores. None uses tables.
- The graph can be represented as a tuple store, not dissimilar from the key-value stores of the NoSQL world.
- Our data model is design-last (alternatively, schema-at-read) and can be modified on the fly without screwing up every last thing that depends on it. That’s a huge advantage that NoSQL developers have discovered.
- Named Graphs and Documents (a la CouchDB or MongoDB) are very, very similar concepts.
- We’re both tackling web-related problems.
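To make the tuple-store and design-last points above concrete, here is a minimal sketch in Python. It is an illustration only, not how any real RDF store is implemented: `TinyTripleStore`, its `add` and `match` methods, and the `ex:` identifiers are all hypothetical names made up for this example. The graph is held as a plain set of (subject, predicate, object) tuples, and a brand-new predicate is added without any schema change.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class TinyTripleStore:
    """Hypothetical in-memory store: the graph is just a set of (s, p, o) tuples."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        """Add one edge of the graph as a tuple."""
        self._triples.add((s, p, o))

    def match(
        self,
        s: Optional[str] = None,
        p: Optional[str] = None,
        o: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield matching triples in sorted order; None acts as a wildcard."""
        for t in sorted(self._triples):
            if (
                (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)
            ):
                yield t


store = TinyTripleStore()
store.add("ex:alice", "ex:name", "Alice")
store.add("ex:alice", "ex:knows", "ex:bob")
store.add("ex:bob", "ex:name", "Bob")

# Design-last: a new predicate needs no migration and does not
# disturb the queries that already depend on the existing data.
store.add("ex:bob", "ex:twitter", "@bob")

names = [o for _, _, o in store.match(p="ex:name")]
about_bob = list(store.match(s="ex:bob"))
print(names)  # ['Alice', 'Bob']
```

The wildcard lookup in `match` is only a rough stand-in for a single SPARQL triple pattern. A real store would index the tuples rather than scan them, and would use full IRIs instead of these shorthand strings.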
However, the NoSQL guys and our own community tend to have very different goals:
- We seek to connect the world’s data and make systems interoperable. As such, standards are key. We don’t focus on any one system being uber-scalable (though we’re getting better at this) since SPARQL enables us to query across systems.
- NoSQL systems tend to focus on point problems involving scale of some kind, often in the form of millisecond response at high throughput and large data scale. Standards are not important, nor is interoperability. Just predictable speed.
Even if you look at a non-RDF graph database like Neo4J, the focus is clearly different. Neo4J does a fantastic job with classic mathematical graph and network calculations (shortest path, etc.), and does not at all focus on the same types of problems one might solve using SPARQL.
That is, the current NoSQL world was forced to abandon SQL stores for pragmatic reasons. They didn’t start with lofty goals of connecting the world’s data, instead focusing on solving problems staring them right in the face today. In search of blazing performance at scale demanded by modern problems and modern users, they have been willing to jettison many of the tenets of relational systems. For many situations this has been a good tradeoff to make. For others, as I’ve pointed out and illustrated in the Enterprise Semantics Blog, the tradeoffs on the table are too extreme.
I don’t think that RDF databases should be in scale and performance round-ups with Riak and others. I don’t think that’s where we shine.
I do, however, believe there is a class of problems being tackled by MongoDB and CouchDB that is not essentially about scale (though that’s part of it). These document stores offer a schema-last development experience that developers are finding really nice to work with. Unlike working with a relational database, rapidly changing your model on the fly is possible in Mongo.
These are situations where flexibility is important, and where a new generation of developers is getting used to tools like MongoDB, which is great but does not follow open standards.
I believe this is an easy area where we can make a bigger impact on the web, right now.
If thousands of new websites were being built on RDF stores instead of document stores from day one, we’d be much closer to the vision. Facebook Open Graph and Schema.org have made huge contributions to open metadata on the web, but this kind of activity could enable the long tail of sites to be RDF by default!
AllegroGraph, Virtuoso, and Systap can all scale, and can all shard like Mongo. We have more mature, feature-rich, and robust APIs via Sesame and others to interact with the data in these stores. So why aren’t we in the conversation? Is there something really obvious that I’m missing?
Let’s make it happen. For more than a decade our community has had a vision for how to build a better web. In the past, traditional tools and inertia have kept developers from trying new databases. Today, there are no rules. It’s high time we stepped it up. On the web we can compete with MongoDB directly on those use cases. In the enterprise we can combine the best of SQL and NoSQL for a new class of flexible, robust data management tools. The conversation should not continue to move so quickly without our voice.
Rob Gonzalez is Senior Product Manager at Cambridge Semantics. Rob will present “Combining Natural Language Processing with the Semantic Web for Competitive Intelligence” at the 2012 SemTechBiz – San Francisco conference.