Clark & Parsia: New Products to Save Data From Death In An ECM Repository, And To Tackle Smaller Data Sets of Big Strategic Value
Is your enterprise content management system the place where your information goes to die? It doesn’t have to be that way.
At the SemTech conference in June, Kendall Clark of Clark & Parsia will formally launch Spanner and Stardog, the former of which is already in use at NASA and the latter of which will be entering private beta in the next couple of weeks. Spanner brings semantics to ECM, Clark says, to make the pivot from unstructured to semi-structured and structured information management affordable, useful and valuable for enterprises.
Stardog is the company’s RDF database, aimed at the part of the market where data sets are smaller but of high value, and Spanner will be able to use it. Alternatively, organizations already committed to another RDF database can continue to use it in conjunction with Spanner.
Why semantic ECM with Spanner? Traditional ECMs haven’t always provided the return on investment organizations expected. They know there’s a document with the data they need, but finding it is a different story, with potentially devastating consequences.
“Some kinds of information shouldn’t be managed by leaving the information in documents and sticking it in an ECM,” Clark says. But that often is the result when new requirements aren’t addressed with an eye to the future, and investments in a structured database management solution aren’t part of the initial equation for reasons of cost or time. Then, after reams and reams of documents, which may reside in an ECM or perhaps even in users’ personal directories, have insinuated their way into mission-critical use, the light bulb goes on and the organization realizes that the data within them is too important to leave in unstructured document form.
That’s when Spanner can enter the picture, taking the directory of documents or connecting to an ECM to bring on the explicit data management that is desperately needed.
At NASA, the subject of the case study Clark will be discussing at SemTech, it was a matter of dealing with a waterfall of documents accumulated over the last decade around its space exploration management systems. Even had they been in an ECM rather than on a personal system, pulling information out of this load of data – commissions and studies for hundreds of scenarios, options, plans and research from across the world, academic and commercial – to answer questions and form plans would have been frustrating. Spanner essentially takes an arbitrary number of documents as input and then “spits out at the end a full stack semantic web application,” Clark says. “It does this fully automated or with some additional input.”
Behind Spanner are a full-service linked data platform, a semantic web application development system, analytics services that work on RDF data, and smart semantic search, querying and reasoning, all integrated together. “It does NLP and machine learning over documents to extract semantic and structured RDF and puts that into our semantic web platform for building RDF applications,” he says, “and as a result of analysis of documents we give back an application that lets an organization manage information extracted from documents in a structured manner.”
It opens the door to finding documents related to any other document in a corpus, but also to supporting additional input. At NASA, for example, names of individuals in the organization from its directory services were fed in with documents, and the quality of the named entity extraction service went up.
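To make the directory-services idea concrete, here is a minimal sketch of dictionary-based (gazetteer) name matching, one common way a list of known names can help named entity extraction. It is not Spanner's method, which the article does not describe; the function names, the sample names and the sample sentence are all hypothetical.

```python
import re

def build_gazetteer_pattern(names):
    """Compile one regex that matches any known name as a whole phrase."""
    # Longest names first, so "Ada Example" wins over the shorter "Ada".
    ordered = sorted(set(names), key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b")

def find_known_people(text, names):
    """Return (name, start_offset) pairs for each directory name found in text."""
    if not names:
        return []
    pattern = build_gazetteer_pattern(names)
    return [(m.group(0), m.start()) for m in pattern.finditer(text)]

# Hypothetical directory entries and document text, for illustration only.
directory = ["Ada Example", "Bo Sample", "Ada"]
sentence = "The study was commissioned by Ada Example and reviewed by Bo Sample."
hits = find_known_people(sentence, directory)
```

A real extraction service would combine a lookup like this with statistical models; the sketch only shows why feeding in a trusted list of names can raise recall for people the organization already knows about.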
It’s important to understand that Spanner won’t waste the investment that businesses have made in ECM systems. It may even use some of the metadata from them. “The goal here is to maximize the investment an organization has that exists in whatever tagging or categorization scheme it has,” Clark says. “All that maps easily to RDF in a simple and straightforward way, and enhances that by running the actual content of the data through a variety of NLP and machine learning technologies.” Once in RDF, a whole host of tools become immediately relevant to this data that shortly before was perhaps just a bunch of Word docs – the beauty of the semantic web, Clark says.
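As a rough illustration of how tags and categories can map to RDF, the sketch below turns a document's metadata fields into N-Triples lines, one triple per field value. It is an assumption-laden example rather than Spanner's actual mapping: the `http://example.org/` namespace, the field names and the document identifier are all hypothetical.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, not a real vocabulary

def escape_literal(value):
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

def metadata_to_ntriples(doc_id, metadata):
    """Map ECM-style metadata (field -> value or list of values) to N-Triples lines."""
    subject = f"<{BASE}doc/{quote(doc_id, safe='')}>"
    lines = []
    for field, values in sorted(metadata.items()):
        predicate = f"<{BASE}field/{quote(field, safe='')}>"
        if isinstance(values, str):
            values = [values]  # treat a single value as a one-item list
        for value in values:
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines

# Hypothetical document and tags, for illustration only.
triples = metadata_to_ntriples(
    "report-42.doc",
    {"category": "propulsion", "tags": ["study", "2009"]},
)
for line in triples:
    print(line)
```

Each tag or category becomes one subject-predicate-object statement about the document, which is what lets generic RDF tools query the metadata alongside whatever is extracted from the document text.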
Bring on the Dog
Stardog has now become mature enough to be used with Spanner, if appropriate for the organization, Clark says. The company sees its RDF database as applicable to requirements that don’t involve billions and billions of triples. “We see in the semantic world some very large high-value data sets but that’s just a slice of the market,” he says. “We also see lots and lots of opportunities where there are not billions and billions or trillions of triples and the value of the data is tied not to quantity but to the nature of the data being represented.”
Even very large organizations often have smaller data sets of immense strategic value, he says, where requirements are around very complex queries, and correct analysis over data is imperative because the results that come back are crucially important to the organization. As an example, a medical diagnostic system may not be billions of triples in size, but it certainly matters that the results of any analysis done over its data are correct.
A database like Stardog, which is lightweight, agile, pure Java, very fast, and easy to install and run anywhere, suits these opportunities to do reasoning and inferencing over such data and recognize patterns, he thinks. Says Clark, “We are building Stardog to plug into our semantic platform to consolidate the part of the market where the value of data isn’t in size but in the nature of data.”