Posts Tagged ‘Lucene’

SIREn Schemaless Structured Doc Search System Zips Through Complex Nested Document Search

Schemaless structured document search system SIREn (Semantic Information Retrieval ENgine) has posted some impressive benchmarks from a demonstration of its prowess in searching complex nested documents. A blog here discusses the test, which indexed a collection of about 44,000 U.S. patent grant documents, with an average of 1,822 nested objects per doc, comparing Lucene’s Blockjoin capability to SIREn.

The finding for the test dataset: “Blockjoin required 3,077MB to create facets over the three chosen fields and had a query time of 90.96ms. SIREn on the other hand required just 126 MB with a query time of 8.36ms. Blockjoin required 2442% more memory while being 10.88 times slower!”
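For readers who want to check the arithmetic behind the quoted figures, here is a minimal sketch that recomputes the ratios from the four numbers in the quote. The variable names are ours, chosen for illustration, and nothing here comes from the benchmark code itself. Worth noting: the 2442% figure matches Blockjoin's memory use expressed as a percentage of SIREn's (about 24.42 times as much), which, read strictly as "more", would work out to roughly 2342%.

```python
# Figures as quoted in the benchmark summary (memory in MB, query time in ms).
blockjoin_mem_mb = 3077.0
siren_mem_mb = 126.0
blockjoin_query_ms = 90.96
siren_query_ms = 8.36

# How many times more memory and time Blockjoin needed than SIREn.
mem_ratio = blockjoin_mem_mb / siren_mem_mb
time_ratio = blockjoin_query_ms / siren_query_ms

# Blockjoin memory as a percentage OF SIREn's, versus the percentage MORE.
mem_as_pct_of_siren = mem_ratio * 100
extra_mem_pct = (mem_ratio - 1) * 100

print(f"memory ratio: {mem_ratio:.2f}x")
print(f"memory as % of SIREn: {mem_as_pct_of_siren:.0f}%")
print(f"extra memory: {extra_mem_pct:.0f}%")
print(f"query time ratio: {time_ratio:.2f}x")
```

Either way the headline holds: on this dataset Blockjoin used roughly 24 times the memory and took nearly 11 times as long per query.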

SIREn, which was launched into its own website and community as part of SindiceTech’s relaunch (see our story here), attributes the results to its use of a fundamentally different conceptual model from the Blockjoin approach. In-depth tech details of the test are discussed here. That discussion also explains that, while its focus is Lucene/Solr, the results apply equally to ElasticSearch, which, under the hood, uses Lucene’s Blockjoin to support nested documents.

The Semantic Web Blog also checked in with SindiceTech CEO Giovanni Tummarello to get a further read on how SIREn has evolved since the relaunch to enable such results, and in other respects.


Search And Next-Gen Big Data Apps

Search is a fundamental, a system building block, and something that should be a critical part of enterprise architectures. That’s what Grant Ingersoll, co-founder and CTO at search, discovery and analytics vendor LucidWorks – which leverages the Apache Lucene/Solr open source search project – told an audience at last week’s GigaOM Structure Data event.

The company late last year launched LucidWorks Big Data for developing Big Data applications, which builds on top of its heritage developing the LucidWorks Search solution. “It’s a platform for organizations and developers to build out next-generation data applications,” Ingersoll said in a conversation with the Semantic Web Blog. Its focus is on tight integration of key Apache open source projects and layering with a REST API, to provide developers single-source access to the stack’s richness for creating applications that provide comprehensive search, discovery and analysis of an organization’s vast content and user interactions.

LucidWorks Big Data is made up of Apache Hadoop; the Apache Mahout machine-learning library; Hive, a data warehouse system for Hadoop that facilitates data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems; and Apache OpenNLP, a machine-learning based toolkit for the processing of natural language text that supports common NLP tasks.


Professional Investors: Get Your Quant In A Box

The talk of falling off the fiscal cliff that’s drowning out the holiday music could take its toll on what historically is a strong month for the stock market, according to Reuters. How that scenario will play out, not to mention a ton of other factors, is just the kind of thing to keep hedge fund managers, wealth advisors and advanced individual investors on their toes as they calculate investment strategies.

A new cloud-based artificial intelligence solution from Lucena, the first features of which are going live today, focuses on helping these users scientifically validate their investment plans, the idea being to find new market opportunities and reduce risk. The early stage company is headed up by serial entrepreneur and CEO Erez Katz, whose partner in the venture is CTO Tucker Balch, a professor of Interactive Computing at the Georgia Institute of Technology whose work focuses on machine learning and robotics.

QuantDesk is the result of five years of research Balch has done at the institution. It is, as Katz describes it, a “quant in a box”: it gives sophisticated investment professionals in small or mid-size firms, who lack the resources of the large investment houses to hire quantitative analysts to derive complex and sophisticated trading algorithms, access to a scientific approach to “validate or pivot the decision process.”


Semantic Image Search: Next Up For a Major Search Engine?

We’ve seen the big three players in the search engine space honing their semantic edge, and we may soon see one of them deploying semantic technology to sharpen image searches, too. nachofoto says it is in discussions with one of the giants (which it declines to name for now) that could result in a licensing deal to bring its ‘semantic, time-based vertical image search engine,’ currently in beta, to the big time.

CTO Anuj Agarwal and CEO Vineet Agarwal, the co-founding brothers behind nachofoto and its focus on delivering the most recent image results, decline to name which major search engine we’re talking about.

It would, of course, be pure speculation to draw any conclusions from the fact that the Agarwals both consider the best analogy for what they’re doing to be “Powerset for image search” (Microsoft acquired that semantic search engine in 2008 and it’s believed to be powering Bing Wikipedia).
