Posts Tagged ‘Giovanni Tummarello’

SIREn Schemaless Structured Doc Search System Zips Through Complex Nested Document Search

SIREn (Semantic Information Retrieval ENgine), a schemaless structured document search system, has posted some impressive benchmarks from a demonstration of its prowess in searching complex nested documents. A blog post here discusses the test, which indexed a collection of about 44,000 U.S. patent grant documents, with an average of 1,822 nested objects per document, and compared Lucene’s Blockjoin capability to SIREn.

The finding for the test dataset: “Blockjoin required 3,077MB to create facets over the three chosen fields and had a query time of 90.96ms. SIREn on the other hand required just 126 MB with a query time of 8.36ms. Blockjoin required 2442% more memory while being 10.88 times slower!”
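
The quoted ratios can be reproduced from the four figures in that summary. The short sketch below does the arithmetic; the variable names are ours for illustration and do not come from the benchmark itself. Note that the quoted “2442%” matches the ratio of the two memory figures (about 24.42 times); read strictly as “more memory,” the increase over SIREn’s footprint would be about 2342%.

```python
# Figures quoted in the benchmark summary above.
blockjoin_mem_mb = 3077
siren_mem_mb = 126
blockjoin_query_ms = 90.96
siren_query_ms = 8.36

# How many times SIREn's memory footprint Blockjoin needed.
mem_ratio = blockjoin_mem_mb / siren_mem_mb
# The same ratio expressed as a percentage of SIREn's footprint.
mem_ratio_pct = mem_ratio * 100
# Strict "percent more" reading: the increase relative to SIREn's footprint.
mem_increase_pct = (mem_ratio - 1) * 100
# How many times longer the Blockjoin query took.
slowdown = blockjoin_query_ms / siren_query_ms

print(f"memory ratio: {mem_ratio:.2f}x ({mem_ratio_pct:.0f}% of SIREn's footprint)")
print(f"memory increase: {mem_increase_pct:.0f}% more than SIREn")
print(f"query time ratio: {slowdown:.2f}x")
```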

SIREn, which was launched into its own website and community as part of SindiceTech’s relaunch (see our story here), attributes the results to its use of a fundamentally different conceptual model from the Blockjoin approach. In-depth technical details of the test are discussed here. That write-up also explains that while its focus is Lucene/Solr, the results apply equally to ElasticSearch, which, under the hood, uses Lucene’s Blockjoin to support nested documents.

The Semantic Web Blog also checked in with SindiceTech CEO Giovanni Tummarello to get a further read on how SIREn has evolved since the relaunch to enable such results, and in other respects.

Read more

End of Support for the Sindice.com search engine: history, lessons learned, and legacy (Guest Post)

[Editor's Note: Since 2007, Sindice.com has served as a specialized search engine allowing Semantic Web practitioners and researchers to locate structured data on the Web. At the peak of its activity, Sindice.com had an index of over 700M pages and processed 20M pages per day. In a post last week, the founding team announced the end of support for Sindice.com to concentrate on delivering the technology developed for the engine to enterprise users. This week, SemanticWeb.com is proud to host a guest post by the founding team explaining the history, the challenges and the future of this technology.]

Photo of the Sindice Team, 2012

The word “Sindice” has been around for quite some time in research and practice on the “Semantic Web,” or “let’s see how we can turn the web into a database.”

Since 2007, Sindice.com has served as a specialized search engine that would do a crazy thing: throw away the text and just concentrate on the “markup” of the web pages. Sindice would provide an advanced API to query RDF, RDFa, Microformats and Microdata found on web sites, together with a number of other services. Sindice turned out to be useful, we guess, as approximately 1100 scientific works in the last few years refer to it in one way or another.

Last week, we, the founding team, announced the end of our support of the original Sindice.com semantic search engine to concentrate on the technology that came from it.

With the launch in 2011 of Schema.org, Google and others have effectively embraced the vision of the “Semantic Web.” With the RDFa standard, and now even more with JSON-LD, richer markup is becoming more and more popular on websites. While there might not be public web data “search APIs”, large collections of crawled data (pages and RDF) exist today which are made available on cloud computing platforms for easy analysis with your favorite big data paradigm.

Even more interestingly, the technology of Sindice.com has been made available in several projects that are either maintained as open source (see below) or commercially supported by the Sindice.com team, which has now transitioned into the company Sindice LTD, AKA SindiceTech.

It has been quite a journey for us, and given there is no single summary anywhere, we thought we’d take this occasion to write and share it.

This is both for “historical” reasons and as a way to glimpse at future directions of this field and these technologies.

Read more

SindiceTech Relaunch Features SIREn Search System, PivotBrowser Relational Faceted Browser

Last week news came from SindiceTech about the availability of its SindiceTech Freebase Distribution for the cloud (see our story here). SindiceTech has finalized its separation from the university setting in which it incubated, the former DERI institute, now a part of the Insight Center for Data Analytics, and now is re-launching its activities, with more new solutions and capabilities on the way.

“The first thing was to launch the Knowledge Graph distribution in the cloud,” says CEO Giovanni Tummarello. “The Freebase distribution showcases how it is possible to quickly have a really large Knowledge Graph in one’s own private cloud space.” The distribution comes instrumented with some of the tools SindiceTech has developed to help users both understand and make use of the data, he says, noting that “the idea of the Knowledge Graph is to have a data integration space that makes it very simple to add new information, but all that power is at risk of being lost without the tools to understand what is in the Knowledge Graph.”

Included in the first round of the distribution’s tools for composing queries and understanding the data as a whole are the Data Types Explorer (in both tabular and graph versions), and the Assisted SPARQL Query Editor. The next releases will increase the number of tools and provide updated data. “Among the tools expected is an advanced Knowledge Graph entity search system based on our newly released SIREn search system,” he says.

Read more

Big Data Is Big Focus At SemTechBiz (Part 2)

LOGO: Semantic Technology & Business Conference; June 2-5, 2013, San Francisco, California

Our discussion of Big Data at SemTechBiz, begun here, continues:

The Enterprise Linked Data Cloud Needs Semantics, And More

Another exploration of Big Data’s intersection with semantic technology will take place at this session, where Dr. Giovanni Tummarello, senior research fellow at DERI and CTO of SindiceTech, will talk about how Big Data can become an enabler for semantic technology to be really useful in enterprises. “A lot of people say it’s via Big Data that semantic technologies like RDF will see a coming of age and clear applications in certain industries,” he says. There’s value to adding data first and understanding it later, and to that end, “semantic technologies give you the most agile tool to deal with data you don’t know, where there’s a lot of diversity, and you don’t know what of it particularly will be useful.”

Read more

Dandelion Geo And Linked Data Marketplace Private Beta On The Way

This week Dandelion, which bills itself as the one-stop shop for smart, high-quality Geo and Linked Data from trusted sources, starts its private beta. The service, which promises end users quality, normalized, linked and enriched data for their apps and reports; developers a simple API for any kind of language on any kind of platform; and corporate and government entities a way to publish and profit from their data, comes from SpazioDati.

That company is the creation of four Italian entrepreneurs – CEO Michele Barbera, president Gabriele Antonelli, partnerships director Andrea Di Benedetto, and Luca Pieraccini – who lived first-hand the frustrating experience of trying to find and leverage useful data for the custom web and mobile apps they were developing while running and working in small IT consulting companies. In an attempt to reverse the ratio of finding and cleaning data to actually building apps, says Barbera, the founders began participating in several EU-funded research projects and in the Open Data movement in Europe and Italy, including founding the non-profit Linked Open Data Italy. They also started experimenting with Semantic Web technologies.

“Open Data helps us to find valuable data and to build value-added web and mobile apps,” says Barbera. “So, let’s say that we solved partly the first problem of finding data, but not the second one, normalizing and cleaning data, since it is still very difficult to merge different data sources to put data in context.” Read more

On What Shores Will Semantic Tech Be Better Commercialized?

Courtesy: Flickr/Images of Money

Where is semantic technology better poised to be commercialized – the U.S. or Europe? With the Semantic Technology & Business Conference heading across the ocean to the U.K. next month, it seems a good time to provide some perspective on the question.

At the last SemTech conference in San Francisco, 3RoundStones took first place in the Startup competition with Callimachus Enterprise. During a discussion of some of the product’s winning features, talk turned to some of the differences between how semantic technology has progressed in the States and overseas.

Read more

Unclogging the Data Pipeline

Giovanni Tummarello has written a new article for Sindice discussing how the company “ingested 100M semantic documents in a day.” He writes, “First: build an infrastructure to process millions of documents. Instead of just doing it home-brew… do your big data homework, no shortcuts. Second: unclog some long standing clogged pipe. The [resulting] feeling is that of ‘it all makes sense’ and it happened to us the other day when we started the dataset indexing pipeline with a queue of a dozen large datasets… After doing that, we just sat back and watched the Sindice infrastructure, [which] usually takes 1-2 million documents per day, reason and index 50-100 times as much in the same timeframe, no sweat.” Read more

SindiceTech Helps Enterprises Build Private Linked Data Clouds

Last week The Semantic Web Blog covered the launch of the SindiceTech Assisted SPARQL Editor as an open source project, noting that SparQLed also is part of SindiceTech’s commercial suite for large enterprises building private linked data clouds. This week, we’ll dive a little deeper into SindiceTech and its progress since the founders of the Sindice web of data search engine turned their attention to the commercial application of its technology as a real-time semantic warehousing infrastructure, which leverages cloud computing to integrate and normalize the massive amounts of data the enterprise must deal with.

As SindiceTech founder and CEO Giovanni Tummarello explains, companies actually approached his team to help them make a reality of their visions to use RDF and SPARQL, as the best knowledge representation and querying technologies available, by providing the missing scalability and stability. Sindice.com was evidence that the technology the team had developed could answer these enterprises’ needs; currently there are about 700 million semantically marked-up web pages indexed in the Sindice.com search engine, with a live updated index of some 80 billion triples daily. Its database is over 5 terabytes.

Read more

SindiceTech Releases SparQLed As Open Source Project To Simplify Writing SPARQL Queries

(Editor’s Note, June 29: The SparQLed project URL now is available here.)

SindiceTech today released SparQLed, the SindiceTech Assisted SPARQL Editor, as an open source project. SindiceTech, a spinoff company from the DERI Institute, commercializes large-scale, Big Data infrastructures for enterprises dealing with semantic data. It has roots in the semantic web index Sindice, which lets users collect, search, and query semantically marked-up web data (see our story here).

SparQLed also is one of the components of the commercial Sindice Suite for helping large enterprises build private linked data clouds. It is designed to give users all the help they need to write SPARQL queries to extract information from interconnected datasets.

“SPARQL is exciting but it’s difficult to develop and work with,” says Giovanni Tummarello, who led the efforts around the Sindice search and analysis engine and is founder and CEO of SindiceTech.

Read more

Bing Brings It On (RDFa, That Is)

The Twittersphere is buzzing about the Semantic Web at last grabbing onto the hearts and minds of the whole web community. It started off with a tweet from Juan Sequeda – a contributor to The Semantic Web Blog and a well-known figure in our area – that reads:

A follow-up message explains:

Follow that link and you’ll find yourself at a Bing webmaster help site that indicates Microsoft wants to play nice with whatever markup approach webmasters want to implement – microdata, microformats, or RDFa. The site mark-up overview on the page referenced says that Bing’s “crawlers do not prefer one specification over another. It’s entirely up to you to decide which of the supported specifications best fits your data.”

Read more
