Posts Tagged ‘SindiceTech’

SIREn Schemaless Structured Doc Search System Zips Through Complex Nested Document Search

SIREn (Semantic Information Retrieval ENgine), a schemaless structured document search system, has posted some impressive benchmarks from a demonstration of its prowess in searching complex nested documents. A blog post here discusses the test, which indexed a collection of about 44,000 U.S. patent grant documents, with an average of 1,822 nested objects per document, and compared Lucene’s Blockjoin capability to SIREn.

The finding for the test dataset: “Blockjoin required 3,077MB to create facets over the three chosen fields and had a query time of 90.96ms. SIREn on the other hand required just 126 MB with a query time of 8.36ms. Blockjoin required 2442% more memory while being 10.88 times slower!”
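The ratios in that quote can be rechecked from the four raw figures it gives. The short Python sketch below is our own arithmetic, not part of the SIREn benchmark, and the variable names are purely illustrative. It suggests that the quoted "2442%" matches the plain memory ratio (about 24.42 times as much), which strictly speaking works out to roughly 2,342% more memory; the query-time ratio of 10.88 checks out as stated.

```python
# Figures as quoted from the benchmark summary; variable names are illustrative.
blockjoin_mem_mb = 3077
siren_mem_mb = 126
blockjoin_query_ms = 90.96
siren_query_ms = 8.36

# How many times as much memory Blockjoin used compared with SIREn.
memory_ratio = blockjoin_mem_mb / siren_mem_mb

# The same comparison expressed as "percent more" memory.
extra_memory_pct = (blockjoin_mem_mb - siren_mem_mb) / siren_mem_mb * 100

# How many times longer the Blockjoin query took.
slowdown = blockjoin_query_ms / siren_query_ms

print(f"memory ratio: {memory_ratio:.2f}x")
print(f"additional memory: {extra_memory_pct:.0f}%")
print(f"query-time ratio: {slowdown:.2f}x")
```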

SIREn, which was launched into its own website and community as part of SindiceTech’s relaunch (see our story here), attributes the results to its use of a fundamentally different conceptual model from the Blockjoin approach. In-depth technical details of the test are discussed here. That document also explains that, while its focus is Lucene/Solr, the results apply equally to ElasticSearch, which uses Lucene’s Blockjoin under the hood to support nested documents.

The Semantic Web Blog also checked in with SindiceTech CEO Giovanni Tummarello to get a further read on how SIREn has evolved since the relaunch to enable such results, and in other respects.

Read more

End of Support for the Sindice.com search engine: history, lessons learned, and legacy (Guest Post)

[Editor's Note: Since 2007, Sindice.com has served as a specialized search engine allowing Semantic Web practitioners and researchers to locate structured data on the Web. At the peak of its activity, Sindice.com had an index of over 700M pages and processed 20M pages per day. In a post last week, the founding team announced the end of support for Sindice.com to concentrate on delivering the technology developed for the engine to enterprise users. This week, SemanticWeb.com is proud to host a guest post by the founding team explaining the history, the challenges and the future of this technology.]

Photo of the Sindice Team, 2012


The word “Sindice” has been around for quite some time in research and practice on the “Semantic Web” or “let’s see how we can turn the web into a database”.

Since 2007, Sindice.com has served as a specialized search engine that would do a crazy thing: throw away the text and just concentrate on the “markup” of the web pages. Sindice would provide an advanced API to query RDF, RDFa, Microformats and Microdata found on web sites, together with a number of other services. Sindice proved useful, we guess, as approximately 1,100 scientific works in the last few years refer to it in one way or another.

Last week, we, the founding team, announced the end of our support of the original Sindice.com semantic search engine to concentrate on the technology that came from it.

With the launch in 2011 of Schema.org, Google and others have effectively embraced the vision of the “Semantic Web.” With the RDFa standard, and now even more with JSON-LD, richer markup is becoming more and more popular on websites. While there might not be public web data “search APIs”, large collections of crawled data (pages and RDF) exist today which are made available on cloud computing platforms for easy analysis with your favorite big data paradigm.

Even more interestingly, the technology of Sindice.com has been made available in several projects maintained either as open source (see below) or commercially supported by the Sindice.com team, which has now transitioned into the Sindice LTD company, also known as SindiceTech.

It has been quite a journey for us, and given there is no single summary anywhere we thought we’d take this occasion to write and share it.

This is both for “historical” reasons and as a way to glimpse at future directions of this field and these technologies.

Read more

Is A Knowledge Graph-Related Acquisition In Yahoo’s Future?

Is SindiceTech about to be acquired by Yahoo? Just last month The Semantic Web Blog reported on the formal relaunch of the company’s activities following the finalization of its separation from its university incubation setting at the former DERI institute in Ireland. Now, according to the Sunday Independent, Yahoo – which the article says had originally planned on buying the company late last year but saw negotiations collapse – may resume talks on the matter.

Yahoo, the article says, “refused to comment on the Sindice-Tech deal, calling it as ‘rumour and speculation.’” SindiceTech CEO Giovanni Tummarello also says that he cannot comment on this. He did note, however, that media, search and advertising are prime sectors for employing Knowledge Graphs. “In scenarios where there is much more (semi-structured) information than one knows how to leverage right away, Big Data graph-like knowledge management and moving from search to relational and entity search is a common theme these days,” he wrote in an email to The Semantic Web Blog.

Read more

SindiceTech Relaunch Features SIREn Search System, PivotBrowser Relational Faceted Browser

Last week news came from SindiceTech about the availability of its SindiceTech Freebase Distribution for the cloud (see our story here). SindiceTech has finalized its separation from the university setting in which it incubated, the former DERI institute, now a part of the Insight Center for Data Analytics, and now is re-launching its activities, with more new solutions and capabilities on the way.

“The first thing was to launch the Knowledge Graph distribution in the cloud,” says CEO Giovanni Tummarello. “The Freebase distribution showcases how it is possible to quickly have a really large Knowledge Graph in one’s own private cloud space.” The distribution comes instrumented with some of the tools SindiceTech has developed to help users both understand and make use of the data, he says, noting that “the idea of the Knowledge Graph is to have a data integration space that makes it very simple to add new information, but all that power is at risk of being lost without the tools to understand what is in the Knowledge Graph.”

Included in the first round of the distribution’s tools for composing queries and understanding the data as a whole are the Data Types Explorer (in both tabular and graph versions), and the Assisted SPARQL Query Editor. The next releases will increase the number of tools and provide updated data. “Among the tools expected is an advanced Knowledge Graph entity search system based on our newly released SIREn search system,” he says.

Read more

SindiceTech Announces Freebase Distribution in the Cloud (Video)


With the support of Google Developers, SindiceTech has announced the availability of its Freebase Distribution for the cloud. According to SindiceTech, “Freebase is an amazing data resource at the core of Google’s ‘Knowledge Graph’. Freebase data is available for full download but today, using it ‘as a whole’ is all but simple. The SindiceTech Freebase distribution solves that by providing all the Freebase knowledge preloaded in an RDF specific database (also called triplestore) and equipped with a set of tools that make it much easier to compose queries and understand the data as a whole.”

Your Own Private Freebase

Read more

Universities Put Cash Towards Helping HomeGrown Tech Startups Along

Photo courtesy of Flickr/401(K) 2012

Universities play an important role in advancing the technology ecosystem, semantic technology included. Look for starters at work done at The Tetherless World Constellation at Rensselaer Polytechnic Institute, Wright State University’s Kno.e.sis Ohio Center of Excellence in Knowledge-enabled Computing, MIT, and the Digital Enterprise Research Institute located at the National University of Ireland, Galway.

In addition to driving technology ever forward, institutions like these and others also provide a home for incubating good ideas that could become good businesses. Music discovery service Seevl and the enterprise-focused SindiceTech are two examples of semantic spin-outs from DERI, for instance, while MIT Media Lab gave birth to commercial properties with semantic underpinnings, including music intelligence platform The Echo Nest. The Kno.e.sis Center points its work in a commercial direction, too: Its LinkedIn profile description notes that its “work is predominantly multidisciplinary, and multi-institutional, often involving industry collaborations and significant systems developing, with an eye towards real-world impact, technology licensing, and commercialization.”

Given the projects with commercial prospects underway within their own houses, it would seem there’s opportunity for universities themselves to look for even more ways to contribute to that success. And that’s just what the University of Minnesota is doing: This week it said that it’s launching a $20 million seed fund over a ten-year timeframe to support the innovative ideas to which its campus plays host.

Read more

SindiceTech Helps Enterprises Build Private Linked Data Clouds

Last week The Semantic Web Blog covered the launch of the SindiceTech Assisted SPARQL Editor as an open source project, noting that SparQLed also is part of SindiceTech’s commercial suite for large enterprises building private linked data clouds. This week, we’ll dive a little deeper into SindiceTech and its progress since the founders of the Sindice web of data search engine turned their attention to focusing on the commercial application of its technology as a real-time semantic warehousing infrastructure, which leverages cloud computing for integrating and normalizing the massive amounts of data the enterprise must deal with.


As SindiceTech founder and CEO Giovanni Tummarello explains, companies actually approached his team to help them make a reality of their visions to use RDF and SPARQL, as the best knowledge representation and querying technologies available, by providing the missing scalability and stability. Sindice.com was evidence that the technology the team had developed could answer these enterprises’ needs; currently there are about 700 million semantically marked-up web pages indexed in the Sindice.com search engine, with a live updated index of some 80 billion triples daily. Its database is over 5 terabytes.

Read more

SindiceTech Releases SparQLed As Open Source Project To Simplify Writing SPARQL Queries

(Editor’s Note, June 29: The SparQLed project URL now is available here.)

SindiceTech today released SparQLed, the SindiceTech Assisted SPARQL Editor, as an open source project. SindiceTech, a spinoff company from the DERI Institute, commercializes large-scale, Big Data infrastructures for enterprises dealing with semantic data. It has roots in the semantic web index Sindice, which lets users collect, search, and query semantically marked-up web data (see our story here).

SparQLed also is one of the components of the commercial Sindice Suite for helping large enterprises build private linked data clouds. It is designed to give users all the help they need to write SPARQL queries to extract information from interconnected datasets.

“SPARQL is exciting but it’s difficult to develop and work with,” says Giovanni Tummarello, who led the efforts around the Sindice search and analysis engine and is founder and CEO of SindiceTech.

Read more

SindiceTech Offers Semantic Solutions for Data Management

Philip Connolly of the Daily Business Post recently profiled Galway-based semantic startup SindiceTech. According to Connolly, “While many people never look underneath the bonnet of the internet, web technology never stands still. Many people see the semantic web as the next step, a technology that allows machines to understand the meaning of information on the web. Most of us online will probably not notice semantic web technologies running in the background, the technologies could lead to an improvement in the relevance of the data returned through search engines for both individuals and enterprises using large amounts of data.”

Read more