SIREn (Semantic Information Retrieval ENgine), a schemaless structured document search system, has posted some impressive benchmarks from a demonstration of its prowess in searching complex nested documents. A blog here discusses the test, which indexed a collection of about 44,000 U.S. patent grant documents, with an average of 1,822 nested objects per document, and compared Lucene’s Blockjoin capability with SIREn.
The finding for the test dataset: “Blockjoin required 3,077MB to create facets over the three chosen fields and had a query time of 90.96ms. SIREn on the other hand required just 126 MB with a query time of 8.36ms. Blockjoin required 2442% more memory while being 10.88 times slower!”
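The ratios in that quote can be checked from the four figures it reports. The short sketch below only redoes the arithmetic; the variable names are illustrative and not taken from the benchmark write-up. It suggests that the “2442%” figure corresponds to Blockjoin using roughly 24.42 times as much memory as SIREn, which under a strict reading of “more” is about 2,342% more.

```python
# Figures as quoted from the benchmark write-up.
blockjoin_mb, siren_mb = 3077, 126      # memory needed for faceting, in MB
blockjoin_ms, siren_ms = 90.96, 8.36    # query time, in milliseconds

# How many times more memory and time Blockjoin needed.
memory_ratio = blockjoin_mb / siren_mb
time_ratio = blockjoin_ms / siren_ms

# "X% more" in the strict sense: the excess over SIREn's own usage.
extra_memory_pct = (memory_ratio - 1) * 100

print(f"memory: {memory_ratio:.2f}x as much ({extra_memory_pct:.0f}% more)")
print(f"query time: {time_ratio:.2f}x as long")
```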
SIREn, which was launched into its own website and community as part of SindiceTech’s relaunch (see our story here), attributes the results to its use of a fundamentally different conceptual model from the Blockjoin approach. In-depth tech details of the test are discussed here. The write-up also explains that, while its focus is Lucene/Solr, the results apply equally to Elasticsearch, which, under the hood, uses Lucene’s Blockjoin to support nested documents.
The Semantic Web Blog also checked in with SindiceTech CEO Giovanni Tummarello to get a further read on how SIREn has evolved since the relaunch to enable such results, and in other respects.
A new method of indexing in SIREn Version 1.2, released earlier this month, is at work in the benchmark tests, Tummarello said in an email response to The Semantic Web Blog. “Previously, SIREn only had an ‘advanced’ indexing mode, which allows you to search also on label names (which can then be tokenized, etc.),” he explained – useful when someone really has no idea about the schema. For example, is there a property that contains the term “part” – as in part number, part code, part description – that then contains a value “1234”? But this capability isn’t very often necessary, he said.
“We therefore decided to move the default mode of SIREn to operate in the same way [as] other engines (you have to know the name of your property), which in turns operates faster,” he said. “In other words, the benchmarks now are between comparable engine features, although SIREn naturally offers more advanced query operators.”
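The difference between the two modes Tummarello describes can be illustrated with a toy sketch in plain Python. This is not SIREn's API or its index structure; the function names, the sample document, and the label tokenization rule are all assumptions made for illustration. The first matcher requires the exact property name, as in the new default mode, while the second matches any property whose tokenized label contains a term, as in the “part” example above.

```python
import re
from typing import Any, Iterator, List, Tuple

Path = Tuple[str, ...]

def walk(node: Any, path: Path = ()) -> Iterator[Tuple[Path, Any]]:
    """Yield (path, leaf_value) pairs for every leaf in a nested dict/list."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, path + (key,))
    elif isinstance(node, list):
        for child in node:
            yield from walk(child, path)
    else:
        yield path, node

def tokenize_label(label: str) -> List[str]:
    """Split 'partNumber' or 'part_code' into lowercase tokens (hypothetical rule)."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", label)
    return [tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", spaced) if tok]

def match_exact_property(doc: Any, prop: str, value: str) -> List[Path]:
    """Default-style lookup: the caller must know the property name."""
    return [p for p, leaf in walk(doc)
            if p and p[-1] == prop and str(leaf) == value]

def match_label_term(doc: Any, term: str, value: str) -> List[Path]:
    """'Advanced'-style lookup: match any label containing the given term."""
    return [p for p, leaf in walk(doc)
            if p and term in tokenize_label(p[-1]) and str(leaf) == value]

# Hypothetical nested document, invented for this example.
doc = {"order": {"items": [
    {"partNumber": "1234", "qty": 2},
    {"part_code": "1234"},
    {"department": "1234"},   # label contains "part" only as a substring
]}}

exact_hits = match_exact_property(doc, "partNumber", "1234")
label_hits = match_label_term(doc, "part", "1234")
print(exact_hits)
print(label_hits)
```

In this sketch the label-term search has to tokenize and test every label it visits, which hints at why requiring a known property name can be the cheaper default.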
Tummarello said they had projected differences along the lines showcased by the results, “but had never measured them accurately, which we wanted to do after releasing 1.2. SIREn’s approach to structured data was thought from the ground up, which is quite different from what happens in the approach currently used in Lucene/Solr/Elasticsearch.”
Domains such as financials, life sciences, technical publishing, and legal share the need to preserve and leverage the structure in nested documents to add value and possibilities, he noted. Consider MongoDB, and how “people are finding it very exciting to use it to store all sorts of data, with more and more structure as it naturally is within an application or in documents,” he stated. The U.S. Patent Office collection used in the benchmark is an example of the trend, with SIREn’s in-depth technical details paper noting that the documents contain structured data about inventors and classifications, as well as textual content such as the abstract – all of this data described in an XML document with multiple levels of nesting.
“With SIREn, one can ask for ‘documents in which IBM and Microsoft are mentioned max 2 sentences away,’ given that all sentences are kept separated by structure (in the original U.S. Patent Office files),” Tummarello said. “Then again, SIREn is … very efficient at indexing structured data, which means you can enrich existing documents and enjoy the benefits.
“For example, referring to the same patent example, say that one annotated each company name with metadata – the year in which a company was funded, [or] its latest financial results – then one could ask queries [like] ‘find me patents in which IBM is mentioned in the same sentence or section in which a startup is mentioned (or a company with less than X in revenue)’.” You needn’t know the name of the company to find the docs, he said.
SIREn Version 1.3, which Tummarello said will be mostly about Elasticsearch compatibility, should be on the way soon. SIREn, he added, remains to the best of his knowledge the best way to search a knowledge graph. While at the moment the company doesn’t distribute code for indexing RDF or other graph knowledge directly, it has supporting software that it can assist users with. That said, out of the box, he added, “Siren is quite good at querying JSON-LD,” which you can read more about here.