Bob DuCharme recently wrote, “The combination of microdata and schema.org seems to have hit a sweet spot that has helped both to get a lot of traction. I’ve been learning more about microdata recently, but even before I did, I found that the W3C’s Microdata to RDF Distiller written by Ivan Herman would convert microdata stored in web pages into RDF triples, making it possible to query this data with SPARQL. With major retailers such as Walmart and BestBuy making such data available on—as far as I can tell—every single product’s web page, this makes some interesting queries possible to compare prices and other information from the two vendors.” Read more
DBpedia, as described in the recent semanticweb.com article DBpedia 2014 Announced, is “a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web.” It currently has over 3 billion triples (that is, facts stored using the W3C standard RDF data model) available for use by applications, making it a cornerstone of the semantic web.
A surprising amount of this data is expressed using the SKOS vocabulary, the W3C standard model for taxonomies used by the Library of Congress, the New York Times, and many other organizations to publish their taxonomies and subject headers. (semanticweb.com has covered SKOS many times in the past.) DBpedia has data about over a million SKOS concepts, arranged hierarchically and ready for you to pull down with simple queries so that you can use them in your RDF applications to add value to your own content and other data.
Where is this taxonomy data in DBpedia?
Many people think of DBpedia as mostly storing the fielded “infobox” information that you see in the gray boxes on the right side of Wikipedia pages—for example, the names of the founders and the net income figures that you see on the right side of the Wikipedia page for IBM. If you scroll to the bottom of that page, you’ll also see the categories that have been assigned to IBM in Wikipedia such as “Companies listed on the New York Stock Exchange” and “Computer hardware companies.” The Wikipedia page for Computer hardware companies lists companies that fall into this category, as well as two other interesting sets of information: subcategories (or, in taxonomist parlance, narrower categories) such as “Computer storage companies” and “Fabless semiconductor companies,” and then, at the bottom of the page, categories that are broader than “Computer hardware companies” such as “Computer companies” and “Electronics companies.”
How does DBpedia store this categorization information? The DBpedia page for IBM shows that DBpedia includes triples saying that IBM has Dublin Core subject values such as
category:Computer_hardware_companies. The DBpedia page for the category
Computer_hardware_companies shows that is a SKOS concept with values for the two key properties of a SKOS concept: a preferred label and broader values. The
category:Computer_hardware_companies concept is itself the broader value of several other concepts such as
category:Fabless_semiconductor_companies. Because it’s the broader value of other concepts and has its own broader values, it can be both a parent node and a child node in a tree of taxonomic terms, so DBpedia has the data that lets you build a taxonomy hierarchy around any of its categories.
Dominik Schweiger, Zlatko Trajanoski and Stephan Pabinger recently wrote, “Semantic Web has established itself as a framework for using and sharing data across applications and database boundaries. Here, we present a web-based platform for querying biological Semantic Web databases in a graphical way. Results: SPARQLGraph offers an intuitive drag &drop query builder, which converts the visual graph into a query and executes it on a public endpoint. The tool integrates several publicly available Semantic Web databases, including the databases of the just recently released EBI RDF platform. Furthermore, it provides several predefined template queries for answering biological questions. Users can easily create and save new query graphs, which can also be shared with other researchers.” Read more
Solution demonstrates 10x+ the performance while running on 100x the data
San Diego – August 20, 2014 – SPARQL City, which introduced its scalable graph analytic engine to market earlier this year, today announced that it has successfully run the SP2 SPARQL benchmark on 100 times the data volume as other graph solution providers, while still delivering an order of magnitude better performance on average compared to published results.
SPARQL City ran the SP2 Benchmark against 2.5 billion triples/edges on a sixteen node cluster on Amazon EC2. Average query response time for the set of seventeen queries was about 6 seconds, with query 4, the most data intensive query involving the entire dataset taking approximately 34 seconds to run. By comparison, the best reported query 4 result by other graph solution providers has been around 15 seconds, but this is when running against 25 million triples/edges, or 1/100th of the data volume in SPARQL City’s benchmark test. This level of performance, combined with the ability to easily scale out the solution on a cluster when required, makes easy to use interactive graph analytics on very large datasets possible for the first time. Detailed benchmark results can be found on our website.
Is SPARQL the SQL for NoSQL? The question will be discussed at this month’s Semantic Technology & Business Conference in San Jose by Arthur Keen, vp of solution architecture of startup SPARQL City.
It’s not the first time that the industry has considered common database query languages for NoSQL (see this story at our sister site Dataversity.net for some perspective on that). But as Keen sees it, SPARQL has the legs for the job. “What I know about SPARQL is that for every database [SQL and NoSQL alike] out there, someone has tried to put SPARQL on it,” he says, whereas other common query language efforts may be limited in database support. A factor in SPARQL’s favor is query portability across NoSQL systems. Additionally, “you can achieve much higher performance using declarative query languages like SPARQL because they specify the ‘What’ and not the ‘How’ of the query, allowing optimizers to choose the best way to implement the query,” he explains.
Recent updates to YarcData’s software for its Urika analytics appliance reflect the fact that the enterprise is starting to understand the impact that semantic technology has on turning Big Data into actual insights.
The latest update includes integration with more enterprise data discovery tools, including the visualization and business intelligence tools Centrifuge Visual Network Analytics and TIBCO Spotfire, as well as those based on SPARQL and RDF, JDBC, JSON, and Apache Jena. The goal is to streamline the process of getting data in and then being able to provide connectivity to the tools analysts use every day.
As customers see the value of using the appliance to gain business insight, they want to be able to more tightly integrate this technology into wider enterprise workflows and infrastructures, says Ramesh Menon, YarcData vice president, solutions. “Not only do you want data from all different enterprise sources to flow into the appliance easily, but the value of results is enhanced tremendously if the insights and the ability to use those insights are more broadly distributed inside the enterprise,” he says. “Instead of having one analyst write queries on the appliance, 200 analysts can use the appliance without necessarily knowing a lot about the underlying, or semantic, technology. They are able to use the front end or discovery tools they use on daily basis, not have to leave that interface, and still get the benefit of the Ureka appliance.”
Last month at its MarkLogic World 2013 conference, the enterprise NoSQL database platform provider talked semantics as it related to its MarkLogic Server technology that ingests, manages and searches structured, semi-structured, and unstructured data (see our story here). The vendor late last week was scheduled to provide an early access release of MarkLogic 7, formally due by year’s end, to some dozens of initial users.
“People see a convergence of search and semantics,” Stephen Buxton, Director, Product Management, recently told The Semantic Web Blog. To that end, a lot of the vendor’s customers have deployed MarkLogic technology as well as specialized triple stores, but what they really want, he says, is an integrated approach, “a single database that does both individually and both together,” he says. “We see the future of search as semantics and the future of semantics as search, and they are very much converging.” At its recent conference, Buxton says the company demonstrated a MarkLogic app it built to function like Google’s Knowledge Graph to provide an idea of the kinds of things the enterprise might do with both search and semantics together.
Following up on the comments made by MarkLogic CEO Gary Bloom at his keynote address at the conference, Buxton explained that, “the function in MarkLogic we are working on in engineering is a way to store and manage triples in the MarkLogic database natively, right alongside structured and unstructured information – a specialized triples index so queries are very fast, and so you can do SPARQL queries in MarkLogic. So, with MarkLogic 7 we will have a world-class triple store and world-beating information store – no one else does documents, values and triples in combination the way MarkLogic 7 will.”
One in 50 American children have autism, according to the latest figures released by the Centers for Disease Control and Prevention in March. One of the winners of the YarcData Graph Analytics Challenge, announced in April, can make a difference in better understanding the causes of the disease.
Taking second place in the competition, the work of Adam Lugowski, Dr. John Gilbert, and Kevin Dewesse, of the University of California at Santa Barbara, leveraged a dataset created for the Mayo Clinic Smackdown project, that has the same structure and property types – and scale – as the medical organization’s actual Big Data sets around autism, but which uses publicly available data in place of the real thing. The team can’t use the real data because it includes private information about patients, diagnosis, prescriptions, and the like.
But the actual data deployed for the project doesn’t matter, says Lugowski . “The goal is to find relationships we have never thought of before, and this way it doesn’t prejudice the algorithm,” he says. Using YarcData’s uRIKA graph analytics appliance, the algorithm queries the Smackdown dataset – which in its smallest version has almost 40 million RDF triples and in its largest is about 100 times bigger, mirroring the size of all the Mayo Clinic’s actual autism data – to discover commonalities among the data, mimicking how the real data sets could be queried in search of common precursors among clusters of patients with the diagnosis.
The W3C has announced that eleven specifications of SPARQL 1.1 have been published as recommendations. SPARQL is the Semantic Web query language. We caught up with Lee Feigenbaum, VP Marketing & Technology at Cambridge Semantics Inc. to discuss the significance of this announcement. Feigenbaum is a SPARQL expert who currently serves as the Co-Chair of the W3C’s SPARQL Working Group, leading the design of SPARQL.
Feigenbaum says, “SPARQL 1.1 is a huge leap forward in providing a standard way to access and update Semantic Web data. By reaching W3C Recommendation status, Semantic Web developers, vendors, publishers and consumers have a stable, well-vetted, and interoperable set of standards they can rely on for the foreseeable future.”
NEXT PAGE >>