[Editor’s Note: Since 2007, Sindice.com has served as a specialized search engine allowing Semantic Web practitioners and researchers to locate structured data on the Web. At the peak of its activity, Sindice.com had an index of over 700M pages and processed 20M pages per day. In a post last week, the founding team announced the end of support for Sindice.com to concentrate on delivering the technology developed for the engine to enterprise users. This week, SemanticWeb.com is proud to host a guest post by the founding team explaining the history, the challenges and the future of this technology.]

Photo of the Sindice Team, 2012

The word “Sindice” has been around for quite some time in research and practice on the “Semantic Web” or “let’s see how we can turn the web into a database”.

Since 2007, Sindice.com has served as a specialized search engine that would do a crazy thing: throw away the text and just concentrate on the “markup” of the web pages. Sindice would provide an advanced API to query RDF, RDFa, Microformats and Microdata found on web sites, together with a number of other services. Sindice turned out to be useful, we guess, as approximately 1100 scientific works in the last few years refer to it in one way or another.

Last week, we, the founding team, announced the end of our support for the original Sindice.com semantic search engine to concentrate on the technology that came from it.

With the launch in 2011 of Schema.org, Google and others have effectively embraced the vision of the “Semantic Web.” With the RDFa standard, and now even more with JSON-LD, richer markup is becoming more and more popular on websites. While there might not be public web data “search APIs”, large collections of crawled data (pages and RDF) exist today which are made available on cloud computing platforms for easy analysis with your favorite big data paradigm.

Even more interestingly, the technology of Sindice.com has been made available in several projects maintained either as open source (see below) or commercially supported by the Sindice.com team, which has now transitioned into the company Sindice LTD, a.k.a. SindiceTech.

It has been quite a journey for us, and given there is no single summary anywhere we thought we’d take this occasion to write and share it.

This is both for “historical” reasons and as a way to glimpse at future directions of this field and these technologies.

The Sindice.com Founders
Dr. Giovanni Tummarello & Dr. Renaud Delbru

—————–

The Sindice Project, History and Legacy

Sindice was launched on the 6th of November 2007 with the goal of being a “reverse lookup index” for the “Semantic Web.” In our mind, the Web was going to be – one way or another – a huge canvas in which to write “statements,” but the problem was how to find these statements out there?

Via a Semantic “Indice” (index, in Italian) = Sindice :)

We designed Sindice from the start with some core ideas in mind: Simplicity, Fast Response Time, Real Time Update, and Web Scale.

A Very Simple API

Before Sindice, other engines like SWSE had explored the idea of putting “all the data together” and of offering a single API supporting full relational capabilities (e.g., offering complex queries involving chains of relations on the data).

While SWSE was technically amazing, it had certain shortcomings which in our view were pretty difficult, if not insurmountable, and in our opinion called for action in a different direction.  SWSE offered a fully relational API (for example, an early version of SPARQL). Understandably, the data would not fit on a single server, so a complex distributed graph infrastructure was needed. In practice this meant difficulty in scaling and inability to refresh the data as it came. Most importantly it was really not something that one could lightheartedly “open” to the general public: a fully relational API might have easily demanded arbitrary amounts of computational power with even simple requests by clients.

On the other hand, even SPARQL was not enough. The heterogeneity, noise, lack of shared identifiers and idiosyncrasies of data found “out there” on the web were extreme: take two datasets on the web, and it was practically impossible to get useful results by a simple “join,” for example as provided by the SPARQL language alone.

Observing this, we decided to go in a different direction and aim at scalability and real time updates. To take this “step forward” we had to take one “backward”: offer a much simpler API.

Fast, Real Time, Search-Oriented APIs for the Web of Data

Sindice was then going to have a very simple API: at the beginning just a simple reverse lookup for a URI. Ask for a URI and get the URLs where that URI was mentioned. We wanted, however, fast updates so as to foster experimentation between “agents” on the web that, via Sindice, could find data published by each other.

E.g., if a content management system somewhere on the web published something about http://dbpedia.org/page/Pablo_Picasso, we wanted this to be discoverable within a few minutes by an agent using Sindice. Think a sort of a global hashtag mechanism … but for the web of data, based on shared URIs.
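To make the idea concrete, here is a minimal sketch of such a reverse lookup, written in Python purely for illustration. The class name `ReverseLookupIndex`, its `ping` and `lookup` methods and the example.org URLs are assumptions made for this sketch; they are not the actual Sindice API or implementation.

```python
from collections import defaultdict


class ReverseLookupIndex:
    """Toy in-memory reverse index: URI -> set of document URLs mentioning it.

    Hypothetical illustration only; not the real Sindice data structure.
    """

    def __init__(self):
        self._postings = defaultdict(set)

    def ping(self, doc_url, mentioned_uris):
        """Register (or refresh) a document and the URIs it mentions."""
        for uri in mentioned_uris:
            self._postings[uri].add(doc_url)

    def lookup(self, uri):
        """Return the sorted list of document URLs that mention the URI."""
        return sorted(self._postings.get(uri, set()))


index = ReverseLookupIndex()
index.ping(
    "http://example.org/cms/post-1",
    ["http://dbpedia.org/page/Pablo_Picasso"],
)
index.ping(
    "http://example.org/blog/cubism",
    ["http://dbpedia.org/page/Pablo_Picasso", "http://dbpedia.org/page/Cubism"],
)

# Every page that "tagged" itself with the shared Picasso URI.
picasso_pages = index.lookup("http://dbpedia.org/page/Pablo_Picasso")
```

The shared URI plays the role of the hashtag: any agent that knows it can discover both pages without knowing either site in advance.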

Admittedly, we started quite naively. Our first thought was that as we wanted to be super fast and perform a very simple task, it was much better to build our own system than to start from any pre-existing one.

We therefore started by implementing a custom index based on a, back then, huge 1TB HD. Rarely however does it pay to start from scratch, and some of the smartest things in research happen when you map a problem to another one which is well understood already. This idea occurred first to Renaud Delbru: URIs in our RDF files were nothing else but “symbols”, and “statements” (subject – predicate – object “triples”) were in a sense “sentences”. So could it be possible to build an RDF search engine starting from a textual index system?

Apache Lucene offered a great way to experiment with the idea and in fact, out of the box, it was already pretty good at performing our basic task. The use of an inverted index allowed blazingly fast (and basically constant speed) index building, fast updates and there existed a large array of techniques for ranking, let alone techniques to manage the textual part of any query, e.g.,  when “literals” were involved. But what about the RDF data model? Could we have used Solr to provide structured search on RDF?

This started the work to push Solr to handle structure: the Ph.D. of Renaud Delbru, implemented in a project called “Semantic Information Retrieval Index” (SIREn) [1].

When SIREn entered production in 2008, it already allowed not only URI lookup but also searching for patterns, e.g., “http://g1o.net/Giovanni * http://sindice.com” or “http://g1o.net/Giovanni foaf:knows *”.
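The “triples as sentences” idea and the wildcard patterns above can be sketched with a tiny inverted index in which every term is tagged with its position in the triple. This is only an illustration under our own assumptions: the `TripleIndex` class, its methods and the sample data are hypothetical and say nothing about how SIREn is actually implemented on top of Lucene.

```python
from collections import defaultdict

WILDCARD = "*"


class TripleIndex:
    """Toy inverted index over triples; terms are keyed by (position, value)."""

    def __init__(self):
        self._triples = []                 # triple id -> (doc_url, s, p, o)
        self._postings = defaultdict(set)  # (position, term) -> triple ids

    def add(self, doc_url, s, p, o):
        triple_id = len(self._triples)
        self._triples.append((doc_url, s, p, o))
        for position, term in zip("spo", (s, p, o)):
            self._postings[(position, term)].add(triple_id)

    def search(self, s=WILDCARD, p=WILDCARD, o=WILDCARD):
        """Return triples matching the pattern; '*' matches anything."""
        constraints = [
            (position, term)
            for position, term in zip("spo", (s, p, o))
            if term != WILDCARD
        ]
        if not constraints:
            ids = set(range(len(self._triples)))
        else:
            # Intersect the posting lists of all constrained positions.
            ids = set.intersection(
                *(self._postings.get(c, set()) for c in constraints)
            )
        return [self._triples[i] for i in sorted(ids)]


index = TripleIndex()
index.add("http://g1o.net/foaf.rdf", "http://g1o.net/Giovanni",
          "foaf:knows", "http://example.org/Renaud")
index.add("http://g1o.net/foaf.rdf", "http://g1o.net/Giovanni",
          "foaf:workplaceHomepage", "http://sindice.com")
index.add("http://example.org/other.rdf", "http://example.org/Alice",
          "foaf:knows", "http://example.org/Bob")

# "http://g1o.net/Giovanni foaf:knows *"
knows = index.search(s="http://g1o.net/Giovanni", p="foaf:knows")
# "http://g1o.net/Giovanni * http://sindice.com"
linked = index.search(s="http://g1o.net/Giovanni", o="http://sindice.com")
```

Because each pattern is answered by intersecting posting lists, the cost depends on the size of those lists rather than on the size of the whole collection, which is what makes a text-style index attractive for this kind of query.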

SIREn was then released as a standalone component, a Solr plugin, in 2009. While we never publicized it much, we were very pleased with the attention and the feedback. In 2010, it was featured in the “Lucene in Action” book, and we found out others were also finding it useful in the RDF world.

Since then, SIREn has evolved in aspects such as ranking (winning the Yahoo entity search competition in 2011), specialized compression, support for full JSON/XML nested data and more, to the point of being arguably the most advanced engine available for richly structured documents.

Today, we continue to strongly support SIREn as a standalone product at www.sirendb.com. In its latest incarnation, SIREn 1.1 is available as a plugin for Solr, MongoDB and, soon to be released, ElasticSearch, enabling high performance semi-structured document/nested data capabilities.

Scaling RDF Processing and Analytics

Thanks to the scientific backing by Prof. Stefan Decker, principal investigator for the core grant that the DERI institute acquired in 2008, the “Data Intensive Infrastructure” unit at DERI institute was founded, led by Dr. Giovanni Tummarello.

The research program, crafted thanks to early results of the “few servers” version of Sindice.com,  assigned resources to expand Sindice.com operations on a 60 machine cluster: the “Webstar” cluster.

The goal: to explore what we could do in the RDF and Web of Data domain with the emerging “big data” techniques such as, at the time, Hadoop and HBase. This was relatively early.

By building our pipeline from crawling to reasoning to indexing on Hadoop, Sindice was capable of comfortably processing more than 20 million RDF documents/pages per day and had several million websites in its index and cache – also offered as API. On one day, we rebuilt the indexes and ingested 100M documents in a few hours on our 11 machine cluster.

Over time, Sindice reached 700M documents. This relatively low number for web documents is due to our approach to crawling:  Sindice never had an “open crawler” but instead focused much more on pings (real-time updates) and on correctly following sitemaps for sites which had a good sitemap and were marked by ourselves as “worth getting in full.”

It must also be said that, in general, crawling was never a Sindice.com forte; there is no denying that. When we started, Nutch was not in a great state and we went ahead, possibly too quickly, writing our own thing. In hindsight this was possibly not a great idea, as it diverted resources while at the same time the outcome often required manual intervention if one wanted to make sure all the data of a site was correctly ingested.

The pipeline was, however, performing very well. From end to end, from data retrieval on the web to results returned from the index, Sindice usually took 15 to 30 minutes. Again, this is not great nowadays but also not too bad considering we had no specific “real-time” index and that Hadoop is well known to introduce delays in a pipeline.

Hadoop did, on the other hand, provide us the power to analyze all the data we had, something we found interesting and useful both for Sindice.com and several other projects which were then to be “spun off”.

Understanding the Data: Hadoop Powered RDF Analytics

The ordinary data processing in the Hadoop pipeline involved detecting content in the markup and correcting errors when possible, extracting RDF, performing reasoning by fetching all the related ontologies, and finally pushing the data into SIREn and HBase.

On top of ordinary everyday data processing, however, the Sindice Hadoop cluster was also busy daily and weekly with overall analytics. Mostly with the work of Renaud Delbru and Stephane Campinas, we developed Hadoop powered RDF analytics to understand both the content of websites and the way they were interlinked (if they were) via RDF links.

The analytics job would run and produce an RDF “summary graph” that would include statistics about most used Classes, Properties and combinations of the two — either per each dataset or when used across datasets (e.g., linking from one to another).
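As a rough, single-machine illustration of the kind of statistics such a summary contains, the sketch below counts classes, properties and class/property combinations per domain. The function name `summarize`, the quad layout and the sample data are assumptions made for this example; the actual job ran on Hadoop and is described in [4,9].

```python
from collections import Counter
from urllib.parse import urlparse

RDF_TYPE = "rdf:type"


def summarize(quads):
    """Compute per-domain usage statistics.

    quads: iterable of (doc_url, subject, predicate, object) tuples.
    Returns three Counters keyed by (domain, class), (domain, property)
    and (domain, class, property).
    """
    quads = list(quads)

    # First pass: collect the classes asserted for each subject.
    types = {}
    for _, s, p, o in quads:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)

    class_counts = Counter()
    property_counts = Counter()
    class_property_counts = Counter()

    # Second pass: count usage, grouped by the domain of the source page.
    for doc_url, s, p, o in quads:
        domain = urlparse(doc_url).netloc
        if p == RDF_TYPE:
            class_counts[(domain, o)] += 1
            continue
        property_counts[(domain, p)] += 1
        for cls in types.get(s, ()):
            class_property_counts[(domain, cls, p)] += 1

    return class_counts, property_counts, class_property_counts


# Hypothetical sample data, for illustration only.
quads = [
    ("http://www.bbc.co.uk/music/a1", "bbc:a1", "rdf:type", "mo:MusicArtist"),
    ("http://www.bbc.co.uk/music/a1", "bbc:a1", "foaf:name", "Artist One"),
    ("http://www.bbc.co.uk/music/a1", "bbc:a1", "owl:sameAs", "dbpedia:Artist_One"),
    ("http://www.bbc.co.uk/music/a2", "bbc:a2", "rdf:type", "mo:MusicArtist"),
    ("http://www.bbc.co.uk/music/a2", "bbc:a2", "foaf:name", "Artist Two"),
]
class_counts, property_counts, class_property_counts = summarize(quads)
```

The two passes map naturally onto a grouping step followed by a counting step, which is the shape one would expect such a job to take on a MapReduce-style system.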

For this ability to show analytics across domains, the core user interface to browse this data was called the “Web Linkage Validator”. At the computational level, we can say that the Hadoop job, documented in the recently awarded papers [4,9], originally took about a week to execute over 20-30 billion triples but was later optimized to take a little over a day – running on the same cluster where the main pipeline was also running.

The Sindice data graph summary, about 1.5 billion triples’ worth, was then loaded into a triplestore and used via a SPARQL API by the “Web Linkage Validator” front end.

Sindice1-500

From this interface, once a domain was entered, a detail page would appear. On top we would have general statistics from the crawling system and other “site level” metadata. The rest of the interface dealt with the data in the domain, e.g., Top classes and Top Properties available, Top Property per Class, Top Classes connected to Other classes (via which property), and even more … This interface would allow us to see, for example that from the domain “bbc.co.uk” the most useful property to connect the class “musicartist” to the class “person” on dbpedia.org was “sameAs”.

It would also show us which other domains were pointing to which classes on bbc.co.uk and using which properties. In other words we could “see” the linked open data, or whatever we had harvested of it.

Sindice2-500

Sindice3-500

More Uses for the Data Graph Summary: The SPARQLed “Spinoff” Project

In the last few years, Sindice also had a SPARQL endpoint, courtesy of a cooperation with OpenLink Software, who made the Virtuoso Cluster edition available.

We had about 16 machines dedicated to this, with typically 8 used in read-only mode while the other 8 were loading the new data. For reasons of time and effort, the process was never fully automated and, similarly, we did not typically load all of the Sindice dataset into the SPARQL endpoint, but instead only had a subset based on a whitelist.

Despite not being complete, the content of the SPARQL endpoint already had daunting data diversity. We then had the idea to investigate how to support SPARQL query creation starting from the data graph summary described above.

This led to the creation of SPARQLed, our assisted SPARQL query editor, a “spinoff” project that analyzes the context of the SPARQL query as you type in order to produce suggestions and auto-completion features by leveraging the data graph summary.

In the example below, SPARQLed suggests a class (given a rdf:type property has been entered as predicate). Once a class has been selected, the system can then suggest in the next triple the most used properties for that class.

Sindice4-500

Sindice5-500
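The suggestion behaviour described above can be approximated with a very small sketch that ranks candidates by how often they appear in a summary. The counts, the class and property names and the two helper functions are all hypothetical, chosen only to illustrate the principle; they are not taken from SPARQLed itself.

```python
from collections import Counter

# Hypothetical extract of a data graph summary (illustrative numbers only).
CLASS_COUNTS = Counter({
    "foaf:Person": 120,
    "mo:MusicArtist": 45,
    "foaf:Document": 30,
})
CLASS_PROPERTY_COUNTS = {
    "foaf:Person": Counter({"foaf:name": 110, "foaf:knows": 60, "foaf:mbox": 15}),
    "mo:MusicArtist": Counter({"foaf:name": 40, "owl:sameAs": 25}),
}


def suggest_classes(prefix="", limit=3):
    """Suggest classes for the object of an rdf:type triple, most used first."""
    ranked = [c for c, _ in CLASS_COUNTS.most_common() if c.startswith(prefix)]
    return ranked[:limit]


def suggest_properties(cls, prefix="", limit=3):
    """Suggest the most used properties for instances of the given class."""
    counts = CLASS_PROPERTY_COUNTS.get(cls, Counter())
    ranked = [p for p, _ in counts.most_common() if p.startswith(prefix)]
    return ranked[:limit]


# The user typed "?s rdf:type foaf:" and asks for completions.
class_suggestions = suggest_classes("foaf:")
# The user picked foaf:Person; suggest predicates for the next triple.
property_suggestions = suggest_properties("foaf:Person", limit=2)
```

Ranking by observed usage means the editor proposes what is actually in the data rather than what an ontology says could be there.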

Unfortunately, SPARQLed never made it to the sindice.com SPARQL query editor because its data model, based on a domain-based “named graph” approach, differs from the extreme “named graph” approach of the sindice.com SPARQL endpoint – where each page was its own little graph.

SPARQLed is freely available at the moment under the AGPL. We are aware this licence is limiting for those who want to contribute to the project, and we are therefore requesting that it be changed to the Apache licence. We will be providing updates.

But for now, let’s talk about the great things that happened when we did an Apache licence before, e.g., our web metadata extraction library that went on to become a top level Apache project.

The Sindice Inspector and ANY23

Since early in the Sindice project we felt the need to support website creators in checking their markup. We also wanted other applications, besides Sindice.com, to be able to extract the RDF in the same way Sindice would. More applications consuming the Web of data, more data: the more, the merrier.

We therefore took the markup extraction code that was then embedded in the Hadoop pipeline and made it available both as:

  • a standalone project, the “anything to triples” library or Any23

  • an online tool, the web inspector

Any23 was released in early 2009. With great initial efforts by Richard Cyganiak and later by the folks at FBK, first of all Michele Mostarda, Any23 managed to collect a nice community around it and has since become a top level Apache project, currently heading toward its 1.0 release with full support for the latest RDFa and JSON-LD specs.

About the same time, we also released the Sindice Inspector, a web application based on Any23 to see webpage markup and other things as “seen” by Sindice.

In cooperation with the nice folks at the University of Lleida, the inspector offered several different ways to see RDF data, e.g., via nested “frames” and as a graph. Below is a screenshot showing the complex markup of a microwave oven product page on BestBuy.

Sindice6-500

The inspector would then not only show the simple output of Any23, but also allow one to see what “inference” Sindice was going to do on the triples before indexing them.

As mentioned, Sindice performed the inference at index time [3], a process which almost doubled the size of the dataset but that allowed inference results to be used in queries. E.g., looking for a class would also allow one to find a page where something was marked up only with a sub-class of that class.
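A minimal sketch of this kind of index-time materialization, limited to sub-class reasoning, could look as follows. The function names, the `subclass_of` mapping and the `ex:` class hierarchy are hypothetical; Sindice’s actual context-dependent reasoning is described in [3].

```python
def superclasses(cls, subclass_of):
    """Return all transitive superclasses of cls.

    subclass_of maps a class to the list of its direct superclasses.
    """
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def materialize_types(triples, subclass_of):
    """Add the rdf:type triples implied by the sub-class hierarchy."""
    inferred = set(triples)
    for s, p, o in triples:
        if p == "rdf:type":
            for parent in superclasses(o, subclass_of):
                inferred.add((s, "rdf:type", parent))
    return inferred


# Hypothetical ontology fragment and page content.
subclass_of = {
    "ex:Painter": ["ex:Artist"],
    "ex:Artist": ["foaf:Person"],
}
triples = [("ex:picasso", "rdf:type", "ex:Painter")]

indexed = materialize_types(triples, subclass_of)

# A query for the superclass now finds the page, at the cost of a larger index.
people = sorted(s for s, p, o in indexed if p == "rdf:type" and o == "foaf:Person")
```

Here one explicit triple becomes three indexed triples, which gives a feel for why materialization inflates the dataset while keeping query time simple.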

To understand which ontologies and rules to apply, Sindice followed, for each of the 700M docs, the URIs of the properties used in the description to find out whether it was possible to retrieve the property/class definition or the full ontology – something that some folks in the semantic web referred to as the “follow your nose” principle.

It did so recursively, so the chain of ontologies imported and used to perform the “inference closure” could potentially be quite non trivial [10]. The picture below, for example, shows some of the ontologies that Sindice imported in full when a “foaf” property was used. Understanding which ontologies are used allowed one to “debug” inferred statements (e.g., the 1833 “implicit triples” inferred in this test foaf file).

Sindice7-500
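The recursive “follow your nose” step can be sketched as a simple graph traversal. In this illustration the web is replaced by a hard-coded dictionary, and the example.org ontology URIs, `HYPOTHETICAL_WEB` and `fetch_references` are invented for the example (they do not reflect what FOAF or any real ontology imports); the traversal is written iteratively, although the effect is the same as a recursive one.

```python
def import_closure(start_uris, fetch_references):
    """Return every ontology reachable from start_uris.

    fetch_references(uri) must return the URIs of the ontologies that the
    document at `uri` imports or refers to.
    """
    closure = set()
    pending = list(start_uris)
    while pending:
        uri = pending.pop()
        if uri in closure:
            continue  # already fetched; this also protects against cycles
        closure.add(uri)
        pending.extend(fetch_references(uri))
    return closure


# Stand-in for dereferencing URIs on the web (hypothetical data).
HYPOTHETICAL_WEB = {
    "http://example.org/onto/people": [
        "http://example.org/onto/agents",
        "http://example.org/onto/geo",
    ],
    "http://example.org/onto/agents": ["http://example.org/onto/core"],
    "http://example.org/onto/geo": ["http://example.org/onto/core"],
    # A cycle back to the starting ontology.
    "http://example.org/onto/core": ["http://example.org/onto/people"],
}


def fetch_references(uri):
    return HYPOTHETICAL_WEB.get(uri, [])


closure = import_closure(["http://example.org/onto/people"], fetch_references)
```

Even in this tiny example, a single property pulls in four ontologies, which hints at how quickly the chain can grow on real data.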

Applications,  Applications!  Applications?

Sindice.com per se never had much of a wow factor for a layman: the front end was simple and provided relatively obscure results. As geeks, we were happy with Sindice: it pointed us at sources of data which might have contained the answer to some question, and that was exciting enough. “And applications will come” :). At some point, however, we pushed it further and went to write code that used Sindice.com itself. We were pretty glad we did.

The Semantic Information Mashup: Sig.ma

One occasion came when in the Okkam EU project, we wanted to not only find sources but actually create a nice aggregate of the information found out there. In other words, we wanted to show “what does the semantic web know about X”.

We named the project Semantic Information Mashup, or Si(g?).Ma [2]. The extension .ma happens to be a valid one (Morocco) so … http://Sig.ma

We recall that our expectations were not too high when looking at “how many steps could go wrong” in the overall process. Given a query, Sig.ma was to look Sindice up, get the RDF from the cache API, locate the central RDF node, look at the triples around them, consolidate on the fly both properties (e.g., multiple ways of saying the same things) and values (same things said multiple times, e.g., in different pages). The result was then to be shown in a pleasant, meaningful way.
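The consolidation step, merging different ways of naming the same property and the same value repeated across pages, can be sketched as follows. The alias table, the normalization rule and the sample sources are assumptions made for this illustration; they are not the heuristics actually used in Sig.ma [2].

```python
from collections import defaultdict

# Hypothetical mapping from vocabulary properties to a display label.
PROPERTY_ALIASES = {
    "foaf:name": "name",
    "rdfs:label": "name",
    "vcard:fn": "name",
    "foaf:homepage": "homepage",
}


def consolidate(descriptions):
    """Merge (source_url, property, value) statements into one profile.

    Properties with the same alias are grouped under one label, and values
    that differ only in case or whitespace are merged, keeping track of
    every source that stated them.
    """
    profile = defaultdict(dict)
    for source, prop, value in descriptions:
        label = PROPERTY_ALIASES.get(prop, prop)
        key = " ".join(value.split()).casefold()
        entry = profile[label].setdefault(
            key, {"value": value.strip(), "sources": []}
        )
        if source not in entry["sources"]:
            entry["sources"].append(source)
    return {label: list(values.values()) for label, values in profile.items()}


# Hypothetical statements found in two different sources.
descriptions = [
    ("http://example.org/dblp", "rdfs:label", "Giovanni Tummarello"),
    ("http://example.org/foaf.rdf", "foaf:name", "giovanni  tummarello "),
    ("http://example.org/foaf.rdf", "foaf:homepage", "http://g1o.net"),
]
profile = consolidate(descriptions)
```

Keeping the list of sources next to each value is what lets a user interface show where every piece of the profile came from.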

Our main concern was the data: we thought not much was going to be there (after all, we knew Sindice was big … but what was inside it anyway?) and that in any case the noise introduced by each of the steps above was in the end going to be more than the signal.

We still pleasantly recall the surprise, however, when Szymon Danielczyk and Richard Cyganiak, the main drivers behind Sig.ma, showed it in action the first time: we put in the name of a colleague and had quite a “wow!” moment. Sig.ma went and started to collect bits and pieces from a number of different sources. Among these were the conference papers in DBLP, Microformats from LinkedIn, people mentioning the person in their FOAF file, and RDF profiles from DERI. Looking up notable people, great data was also coming from DBpedia.

Sindice8-500

It was a big wow moment. We felt we could see, and explain to others, why it made sense to put those annotations in those pages! We were also pleased others liked it, too. We received nice feedback and were awarded at the Semantic Web Challenge.

Sure, there were reasons why the end result was exciting: we were looking up names of people and things which were somehow either on DBpedia or in the semantic web community; the semantic web community had been putting annotations here and there (annotations that likely nobody had ever consumed, but they were there and Sindice knew about them); not many people had started to “semantically spam”; and the result was interpreted by a human (and therefore tolerant to noise).

For the first time, arguably, we felt we were tasting the “real” Semantic Web: not a precooked mashup or static integration, instead an application that would automatically choose data sources from the open web (and dozens of sources at a time) and automatically extract the parts of information it thought relevant to do something useful (i.e., give an entity profile).

The application was furthermore interactive and felt extremely alive. If you have never seen this video, then … at this point you really should.

Sig.ma – Live views on the Web of Data from Sindice Team on Vimeo.

Commercial Value as Research Validation

Around the years 2009-2010 we started asking ourselves questions about the potential commercial value of Sindice and related technologies. Sindice Search and Sig.ma were exciting to us, but was it going to be something of commercial value?

Commercial value is very important in our opinion. It is our strong belief that the research community, and the IT research community in particular, truly owes it to the taxpayers to demonstrate the impact of what is being done. Commercial impact for us meant the ultimate research validation and fulfilment of the mandate we had been given by our stakeholders.

We were lucky at that time to meet Nova Spivack, who took us under his wing as a person who knew the VC scene in Silicon Valley and made us a round of very successful introductions to VCs. Almost everyone among the top VCs wanted to hear the story: what was a “Semantic Web search engine” going to be about?

The pitch was “The Web of Data is Coming, and we have great technology” and that did not fly, but we were treated with respect and learned a lot in the process. The questions of the VCs were extraordinarily smart, but also dismaying in their simplicity: great, now please describe “how are you going to earn your first dollar?”.

The answer is not too simple. Sindice was meant to cater to machine consumers (applications), and you cannot really sell advertisements to a machine. Paid or biased ranking did not look like an attractive model. More than that, applications as daring as “sig.ma” – which would go and automatically hunt for open web data sources – are not really something we can envision in the short term.

Sure, a developer could have had use for Sindice.com to find data sources. He would then be very happy with the service but then “what prevents the developer from using the newly found data source directly, why would he go via Sindice again?”

Good question, back to the labs.

Exploring Services Based on “Your Site Data”: Sindice Site Services

We then started to think what we could provide to website owners. What could we provide to websites that have page markups? The idea, which we called Sindice Site Services,  came to be the following:

  1. Website puts markup;

  2. Sindice picks it up, crawls it completely, keeps a “mirror image” of the dataset, constantly updated;

  3. Sindice combines the website data with others, and …

  4. Offers cool services that are useful to the website and much smarter than what the website likely was willing to custom build.

The homepage for this (the remains are still online, and so is a position paper):

Sindice9-500

The first service was a set of Widgets that one could have simply added to a website by inserting a line of javascript. The javascript would read the markup on the page, send it to the server where it would get useful content for the website, e.g., information panels (with data coming from anywhere else and known by Sindice) or recommendations to other pages on your own web site that would seem relevant based on the metadata.

Sindice10-500

In the example above, the website markup is simply “title:10,000 BC”. We have added two widgets. They read the markup to give “more info on this movie” (taken from DBpedia and RottenTomatoes markup) and recommendations to other internal pages on the same website.

We actually made quite a few of them; the idea was to create a “market” where people could also create their own, with a composer that was provided, powered by SPARQL and templates.

Similarly, we offered an embeddable faceted engine that would, to continue with the movie example, allow a web site visitor to sort by “RottenTomatoes score” or any data taken from DBPedia.

Sindice11-500

We were truly excited about this and we received great feedback. There clearly could have been a market in providing enhancements to web sites (several companies were and are in this space).

Someone else even pushed further and said this “was the missing bit for Linked Data to succeed”:

  1. Sindice could have shared its revenues (e.g., coming from advertising embedded in the widgets or from premium services) with those on the web whose data was being reused in the widgets.

  2. This was an incentive to create good data but also to provide disambiguating “sameAs” links, as this made widget creation (which was done by SPARQL) so much easier.

But this never went far in reality, for several reasons, some technical, some more conceptual.

On the one hand, Sindice crawling had never reached the maturity and stability that was going to be needed to offer a service. Also SPARQL was a very expensive way to create these widgets, and the flexibility (and performance!!) of the early widget system left a lot to be desired. Thick layers of caching and pre-caching were going to be needed anyway.

On the other hand, the use cases were never fully convincing. What interesting and money-producing use cases are possible with web data, e.g., as indexed by Sindice, vs. something one could have done with a few selected and polished datasets? What can Sindice technology provide in these cases?

Sindice Infrastructure for Enterprise: the Present and the Future

It was late 2012 when we were approached almost simultaneously by several people in enterprises interested in “Sindice Like” quantities of “Sindice Like” richly structured data. They also wanted “Sindice Like” real-time updates, quick query response and big data understanding and processing capabilities.

For companies in sectors such as scientific publishing, finance, life sciences, defense and more, it absolutely makes sense to be able to recombine and experiment with information in a faster and more nimble way than they were otherwise able to.

If RDF can be made reliable and to work at scale, and if the RDF cost penalty “tax” (as Orri Erling calls it) can be sufficiently mitigated, the idea is that the flexibility of the data model would bring high added value: much more nimble experimentation with data, leading to new products and services.

Highly structured documents, on the other hand, and documents that refer to entities which can then be expanded with relational properties are more and more common and call for the use of systems like SIREn to push from faceted browsing to “relational faceted browsing.”

As we write this, the Sindice Team has now almost entirely transitioned into its startup company SindiceTech, where it is working with exciting partners, delivering enterprise knowledge graphs and the next versions of SIREn.

Wrapping Up

With the Web of Data starting to become “mainstream”, the scope of the work is necessarily larger than what is normally possible in a research environment.

On the other hand, the applicability and interest we received in some of the Sindice derived technologies call for our full focus.

With respect to Sindice.com, please consider that the core team will be leaving the institution hosting the site in the time specified above. We are now in conversation with the management with respect to being able to further support it, but for now please consider August as end of service, at least as associated with the founders.

Similarly we hope to be able to update soon on the outcome of our request to open-source the last remaining project (SPARQLed).

Hoping you have enjoyed the “historical” ride as much as we have enjoyed recalling, we look forward to supporting and interacting with interested parties in our new capacity at SindiceTech.

Sincerely,

The Sindice founders
Giovanni Tummarello & Renaud Delbru

Appendixes

Reusable Software Components Summary

Sindice Legacy: Selected Publications by Technology

  • Data acquisition

    • Semantic Sitemaps [6]

    • Any23.org (Apache top level)

  • Enrichment/Web Reasoning  [3][10]

  • Big Data RDF processing/Analytics [4][9]

  • Semi-structured Information Retrieval

    • SIREn [1]

    • Ding Ranking [11]

    • BM25MF [5]

  • UI, Open Web of Data interaction paradigm

    • Sig.Ma [2]

    • Relational Faceted Browser (to be published)

[1] R. Delbru, S. Campinas, G. Tummarello. Searching Web Data: an Entity Retrieval and High-Performance Indexing Model. Journal of Web Semantics, 2011.

[2] G. Tummarello, R. Cyganiak, M. Catasta, S. Danielczyk, R. Delbru, S. Decker. Sig.ma : Live views on the Web of Data. In Journal of Web Semantics, 2010.

[3] R. Delbru, G. Tummarello, A. Polleres. Context-Dependent OWL Reasoning in Sindice – Experiences and Lessons Learnt. International Conference on Web Reasoning and Rule Systems (RR). 2011.

[4] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru and G. Tummarello. Introducing RDF Graph Summary With Application to Assisted SPARQL Formulation. (DEXA). Vienna, 2012.

[5] S. Campinas, R. Delbru, G. Tummarello. Effective Retrieval Model for Entity with Multi-Valued Attributes: BM25MF and Beyond. EKAW 2012.

[6] R. Cyganiak, H. Stenzhorn, R. Delbru, S. Decker and G. Tummarello. Semantic Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web. ESWC. 2008.

[7] G. Tummarello, R. Delbru, E. Oren. Sindice.com: Weaving the open linked data. The Semantic Web, Springer, 2007.

[8] P. Mika, G. Tummarello. Web Semantics in the Clouds. IEEE Intelligent Systems, 2008.

[9] S. Campinas, R. Delbru, G. Tummarello, “Efficiency and Precision Trade-Offs in Graph Summary Algorithms”, 17th International Database Engineering & Applications Symposium IDEAS 2013. Best Paper Award.

[10] Axel Polleres, Aidan Hogan, Renaud Delbru, Jürgen Umbrich: RDFS and OWL Reasoning for Linked Data. Reasoning Web 2013: 91-149

[11] R. Delbru, N. Toupikov, M. Catasta, G. Tummarello, S. Decker. Hierarchical Link Analysis for Ranking Web Data. In Proceedings of the 7th Extended Semantic Web Conference (ESWC). 2010.

During its time Sindice produced  2 Ph.Ds (Renaud Delbru, Stephane Campinas) and 9 Masters or equivalent (Szymon Danielczyk, Michele Catasta, Thomas Perry, Nickolai Toupikov, Stephane Campinas, Divanshu Gupta, Cedric Chartron, Pierre Bailly, Arthur Baudry).

Citations

As of today, Google Scholar reports 1100 publications that mention the terms “sindice” and “semantic” together. This does not include citations or any other use made of derived papers and derived components, e.g., something mentioning or using Sig.ma, Any23 or SIREn but making no mention of Sindice.

Awards

  • 2013 Best European Product Big Data Semantics – Awarded by well known analyst.

  • 2013 Best Paper Award IDEAS conference “Efficiency and Precision Trade-Offs in Graph Summary Algorithms”

  • 2011 Yahoo Semantic Search Challenge – Winning team Main challenge

  • 2010 Semantic Web Challenge, Siteservices – Shortlisted

  • 2009 Semantic Web Challenge, http://sig.ma (3rd Prize)

  • 2008 European Semantic Technology Conference, Best Business Idea Contest, 1st Prize

Acknowledgements

Sindice has been a long running project hosted primarily at the institute formerly known as DERI (now Insight Galway, Ireland) with the support of Prof. Stefan Decker, principal investigator of the overall “Lion” SFI research project, and with the active participation of the FBK institute (Trento, Italy) with the support of Dr. Paolo Traverso.

We acknowledge and are grateful to the funding agencies and projects:

  • 2012 IRCSET EMPOWER – Postdoctoral Fellowship

  • 2012 Project RETIS (Enterprise Ireland Commercialization Plus)

  • 2011 EuroSentiment  (EU, FP7)

  • 2010 Cloud4Soa – (EU, FP7) , LOD2 Project (EU, FP7) , EU Coordinate Action LATC

  • 2009 DERI LION II grant, Data Intensive Infrastructure Unit Program (DERI)

  • 2008 IMP  (EU, FP7)

  • 2007 OKKAM, (EU, FP7) , Romulus (EU, FP7)

Special thanks also go to OpenLink for the collaboration on the Sindice SPARQL endpoint.

Technical Team and Credits

People who have worked on Sindice, with approximate credits.

  • Dr. Giovanni Tummarello

    • Group and Project Lead

  • Dr. Renaud Delbru

    • Research and Engineering Lead, SIREn, Ranking, Analytics, PivotBrowser, Services

  • Stephane Campinas

    • SIREn, Ranking, Analytics, PivotBrowser

  • Szymon Danielczyk

    • Frontend, user interfaces, services

  • Michele Catasta, Eyal Oren

    • Early prototypes, big data scaling, services

  • Giulio Cesare Solaroli, Gabi Vulcu, Robert Fuller, Stephen Mulcahy, Naoise Dunne, Kevin Lyda, Franco Ravagli, Michele Mostarda, Davide Palmisano

    • Core engineering, crawling, processing

  • Marco Amadori, Marco Fossati, Stephen Mulcahy, Xi Bai, Tamas Benko, Paolo Capriotti, Oana Ureche, Adam Westerski, Davide Palmisano, Daniel Parming, Gabriele Renzi, Holger Stenzhorn, Nickolai Toupikov, Jürgen Umbrich, Divanshu Gupta,  Diego Ceccarelli, Cedric Chartron, Pierre Bailly, Soren Brunk, Arthur Baudry, Sean Policarpio, Thomas Perry

    • Additional R&D

  • Michael Hausenblas, Richard Cyganiak, Michele Mostarda

    • Standards, community, dissemination, ANY23

  • Jakub Kotowski, Josef Neidermeier, Harish Kumar

    • Enterprise readiness Sindice components, Enterprise offerings.

Pictures and Family album

Sindice12-500

The “Webstar” cluster, which hosted Sindice.com starting 2008.

Sindice13-500

A snapshot of sindice.com homepage, late 2013

Trento 2010, FBK and DERI cooperate on Sindice

Out of rooms, technical meeting held in the “Computer Museum”, DERI building 2012.