Posts Tagged ‘Sindice’

End of Support for the Sindice.com search engine: history, lessons learned, and legacy (Guest Post)

[Editor’s Note: Since 2007, Sindice.com has served as a specialized search engine allowing Semantic Web practitioners and researchers to locate structured data on the Web. At the peak of its activity, Sindice.com had an index of over 700M pages and processed 20M pages per day. In a post last week, the founding team announced the end of support for Sindice.com to concentrate on delivering the technology developed for the engine to enterprise users. This week, SemanticWeb.com is proud to host a guest post by the founding team explaining the history, the challenges and the future of this technology.]

Photo of the Sindice Team, 2012

The word “Sindice” has been around for quite some time in research and practice on the “Semantic Web” or “let’s see how we can turn the web into a database”.

Since 2007, Sindice.com has served as a specialized search engine that would do a crazy thing: throw away the text and just concentrate on the “markup” of the web pages. Sindice would provide an advanced API to query RDF, RDFa, Microformats and Microdata found on web sites, together with a number of other services. Sindice turned out to be useful, we guess, as approximately 1100 scientific works in the last few years refer to it in one way or another.

Last week, we, the founding team, announced the end of our support of the original Sindice.com semantic search engine to concentrate on the technology that came from it.

With the launch in 2011 of Schema.org, Google and others have effectively embraced the vision of the “Semantic Web.” With the RDFa standard, and now even more with JSON-LD, richer markup is becoming more and more popular on websites. While there might not be public web data “search APIs”, large collections of crawled data (pages and RDF) exist today which are made available on cloud computing platforms for easy analysis with your favorite big data paradigm.

Even more interestingly, the technology of Sindice.com has been made available in several projects maintained either as open source (see below) or commercially supported by the Sindice.com team, which has now transitioned into the company Sindice Ltd., also known as SindiceTech.

It has been quite a journey for us, and given there is no single summary anywhere, we thought we’d take this occasion to write and share it.

This is both for “historical” reasons and as a way to glimpse at future directions of this field and these technologies.

Read more

Dandelion Geo And Linked Data Marketplace Private Beta On The Way

This week Dandelion, which bills itself as the one-stop shop for smart, high-quality Geo and Linked Data from trusted sources, starts its private beta. The service, which promises end users quality, normalized, linked and enriched data for their apps and reports; developers a simple API for any kind of language on any kind of platform; and corporate and government entities a way to publish and profit from their data, comes from SpazioDati.

That company is the creation of four Italian entrepreneurs – CEO Michele Barbera, president Gabriele Antonelli, partnerships director Andrea Di Benedetto, and Luca Pieraccini – who lived first-hand the frustrating experience of trying to find and leverage useful data for the custom web and mobile apps they were developing while running and working in small IT consulting companies. In an attempt to reverse the ratio of finding and cleaning data to actually building apps, says Barbera, the founders began participating in several EU-funded research projects and in the Open Data movement in Europe and Italy, including founding the non-profit Linked Open Data Italy. They also started experimenting with Semantic Web technologies.

“Open Data helps us to find valuable data and to build value-added web and mobile apps,” says Barbera. “So, let’s say that we solved partly the first problem of finding data, but not the second one, normalizing and cleaning data, since it is still very difficult to merge different data sources to put data in context.” Read more

On What Shores Will Semantic Tech Be Better Commercialized?

Courtesy: Flickr/Images of Money

Where is semantic technology better poised to be commercialized – the U.S. or Europe? With The Semantic Business and Technology conference heading across the ocean to the U.K. next month, it seems a good time to provide some perspective on the question.

At the last SemTech conference in San Francisco, 3RoundStones took first place in the Startup competition with Callimachus Enterprise. During a discussion of some of the product’s winning features, talk turned to some of the differences between how semantic technology has progressed in the States and overseas.

Read more

Unclogging the Data Pipeline

Giovanni Tummarello has written a new article for Sindice discussing how the company “ingested 100M semantic documents in a day.” He writes, “First: build an infrastructure to process millions of documents. Instead of just doing it home-brew… do your big data homework, no shortcuts. Second: unclog some long standing clogged pipe. The [resulting] feeling is that of ‘it all makes sense’ and it happened to us the other day when we started the dataset indexing pipeline with a queue of a dozen large datasets… After doing that, we just sat back and watched the Sindice infrastructure, [which] usually takes 1-2 million documents per day, reason and index 50-100 times as much in the same timeframe, no sweat.” Read more

SindiceTech Helps Enterprises Build Private Linked Data Clouds

Last week The Semantic Web Blog covered the launch of the SindiceTech Assisted SPARQL Editor as an open source project, noting that SparQLed also is part of SindiceTech’s commercial suite for large enterprises building private linked data clouds. This week, we’ll dive a little deeper into SindiceTech and its progress since the founders of the Sindice web of data search engine turned their attention to the commercial application of its technology as a real-time semantic warehousing infrastructure, which leverages cloud computing for integrating and normalizing the massive amounts of data the enterprise must deal with.

As SindiceTech founder and CEO Giovanni Tummarello explains, companies actually approached his team to help them make a reality of their visions to use RDF and SPARQL, as the best knowledge representation and querying technologies available, by providing the missing scalability and stability. Sindice.com was evidence that the technology the team had developed could answer these enterprises’ needs; currently there are about 700 million semantically marked-up web pages indexed in the Sindice.com search engine, with a live updated index of some 80 billion triples daily. Its database is over 5 terabytes.

Read more

SindiceTech Releases SparQLed As Open Source Project To Simplify Writing SPARQL Queries

(Editor’s Note, June 29: The SparQLed project URL now is available here.)

SindiceTech today released SparQLed, the SindiceTech Assisted SPARQL Editor, as an open source project. SindiceTech, a spinoff company from the DERI Institute, commercializes large-scale, Big Data infrastructures for enterprises dealing with semantic data. It has roots in the semantic web index Sindice, which lets users collect, search, and query semantically marked-up web data (see our story here).

SparQLed also is one of the components of the commercial Sindice Suite for helping large enterprises build private linked data clouds. It is designed to give users all the help they need to write SPARQL queries to extract information from interconnected datasets.

“SPARQL is exciting but it’s difficult to develop and work with,” says Giovanni Tummarello, who led the efforts around the Sindice search and analysis engine and is founder and CEO of SindiceTech.

Read more

Linked Data: Moving Towards Consumption

Earlier this month 16 out of 42 papers were accepted for the upcoming Linked Data on the Web (LDOW) 2012 Workshop in Lyon, France in April.

What might be discerned from the tenor of the submissions is something of a shift in focus in the Linked Data space, according to workshop chair Dr. Michael Hausenblas, Linked Data Research Centre, DERI, NUI Galway, Ireland. Other organizing committee members include Tim Berners-Lee, Christian Bizer and Tom Heath. “In 2008 to 2010 it was more like we were establishing the field, getting people to talk about what they do in terms of publishing and best practice around Linked Data, Open Linked Data and Linked Enterprise Data,” says Hausenblas. Now, with the web of Linked Data having grown to about 32 billion RDF triples last year, “we’re moving more towards the consumption – publishing is a necessary precondition but not an end in itself.”

Read more

What W3C’s R2RML and Direct Mapping Mean to Enterprise Data

I’m very happy to announce that the World Wide Web Consortium’s RDB2RDF Working Group, in which I participate as an Invited Expert, has published two Candidate Recommendations: R2RML: RDB to RDF Mapping Language and A Direct Mapping of Relational Data to RDF. This has been a long road and we still have some ways to go. The standardization process goes back to the W3C Workshop on RDF Access to Relational Databases, which took place in October 2007. The W3C RDB2RDF Incubator Group followed afterwards. After almost 5 years, we are on track to have a standard. However, what is this standard bringing to the table?

Read more

Bing Brings It On (RDFa, That Is)

The Twittersphere is buzzing about the Semantic Web at last grabbing onto the hearts and minds of the whole web community. It started off with a tweet from Juan Sequeda – a contributor to The Semantic Web Blog and a well-known figure in our area – that reads:

A follow-up message explains:

Follow that link and you’ll find yourself at a Bing webmaster help site that indicates Microsoft wants to play nice with whatever markup approach webmasters want to implement – microdata, microformats, or RDFa. The site mark-up overview on the page referenced says that Bing’s “crawlers do not prefer one specification over another. It’s entirely up to you to decide which of the supported specifications best fits your data.”

Read more

Sindice Puts The Web of Data At Your Disposal


Sindice Ltd. launched as a startup company this week, complete with a publicly available beta SPARQL endpoint to its indexed and live-updated dataset of some 12 billion triples. Next week will see Sindice (which began as a joint academic research project among DERI, the Fondazione Bruno Kessler and OpenLink Software to collect, search, query and build applications on top of semantically marked up Web data) deliver formal support for Schema.org.

Sindice, of course, is agnostic when it comes to ingesting semantic markup formats. Supporting new formats is just a matter of syntax adaptation for the service. Whatever format a web site decides to employ — from RDF to RDFa to microformats to microdata — Sindice has coverage of the structured web data and keeps it fresh.

The service opens up vast possibilities for business: As long as a web site structures data in one of these formats, and uses standards like Sitemaps for publishing semantic content, it can become a part of Sindice’s continuously updated repository. And thus it becomes a data source for business use, one that also can join with other datasets.

Read more