In my most recent post, I introduced RDF as a flexible and schema-less data model. Some of you may now think that working with RDF data is going to be a complete mess. In some cases, that may be true; and it’s fine! There are use cases in which all you want is messy data. But what if you want to do more interesting things with your RDF data, like inferring new knowledge? This is where ontologies come in.
What is an Ontology?
Let me scare you for a minute. The computer science definition of ontology is:
a formal and explicit specification of a shared conceptualization
Let’s break this down and get our hands dirty. Let’s make an ontology for a specific domain: universities. What are the concepts involved in a university? Students, professors, courses, departments. We all agree on this, right? This is the shared conceptualization. How are these concepts related? A student is enrolled in a course. A professor teaches a course. A course is offered by a department. These are all explicit specifications of the concepts that we are talking about. Now, we need to represent them in a way that a computer can understand. In other words, we need a formal computer language.
For example, OWL is a computer language used to write ontologies. OWL (Web Ontology Language) is the W3C standard for representing ontologies on the web. There you go, it wasn’t that complicated.
RDF, RDF Schema and OWL
Consider the statement: Juan is enrolled in CS101. With RDF, we have the subject: “Juan,” the predicate: “is enrolled in” and the object: “CS101.” The predicate is a named property that links the subject with the object. In this case, the predicate is considered to be part of the ontology that defines this statement (Student is enrolled in a Course). Additionally, consider the statement: Juan is a Student. The object of the triple would be Student, which is also part of the ontology.
RDF Schema was created as a basic way to describe an ontology. You can define classes, properties, the domain and range of a property, subclasses, and subproperties. For example, Student, Professor, Course, and Department are all ontological classes. “Enrolled” is a property with domain “Student” and range “Course.” Let’s assume the following RDF triple:
ex:Juan ex:enrolled ex:CS101
Given the RDF Schema ontology, I know that:
ex:enrolled rdfs:domain ex:Student
ex:enrolled rdfs:range ex:Course
I can now infer that ex:Juan is an ex:Student and ex:CS101 is an ex:Course. This is new knowledge that I was able to infer, which was not made explicit before.
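To make the inference step concrete, here is a minimal sketch in plain Python. It is not how a real RDFS reasoner is implemented; triples are modeled as simple tuples, the function name infer_types is hypothetical, and each property is assumed to have at most one domain and one range.

```python
# Triples are plain (subject, predicate, object) tuples; prefixed names are just strings.
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"
RDFS_RANGE = "rdfs:range"


def infer_types(data, schema):
    """Apply the RDFS domain and range rules once; return only the new rdf:type triples."""
    # Simplifying assumption: one domain and one range per property.
    domains = {s: o for s, p, o in schema if p == RDFS_DOMAIN}
    ranges = {s: o for s, p, o in schema if p == RDFS_RANGE}
    inferred = set()
    for s, p, o in data:
        if p in domains:
            inferred.add((s, RDF_TYPE, domains[p]))  # the subject belongs to the domain class
        if p in ranges:
            inferred.add((o, RDF_TYPE, ranges[p]))   # the object belongs to the range class
    return inferred - set(data)


schema = {
    ("ex:enrolled", RDFS_DOMAIN, "ex:Student"),
    ("ex:enrolled", RDFS_RANGE, "ex:Course"),
}
data = {("ex:Juan", "ex:enrolled", "ex:CS101")}

for triple in sorted(infer_types(data, schema)):
    print(triple)
```

Running this prints the two rdf:type triples described above: one for ex:Juan and one for ex:CS101.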
RDF Schema is very limited. What if I want to state that a property is symmetric (sibling) or transitive (ancestor)? This can’t be done in RDF Schema, and this is where OWL comes in. OWL is a much more expressive ontology language, which permits users to infer much more knowledge. Besides adding more expressiveness to properties, with OWL you can state that classes are disjoint (i.e., something can’t be both x and y), restrict cardinality (e.g., there’s exactly one x for each y), and much more. In addition to inferring new knowledge, with OWL you can check consistency.
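As a rough illustration of what symmetric and transitive properties buy you, the sketch below computes the extra triples they entail. In OWL these properties would be declared as owl:SymmetricProperty and owl:TransitiveProperty; here they are simply passed in as Python sets. The function name close_under and the individuals ex:Maria, ex:Rosa, and ex:Luis are made up for this example.

```python
def close_under(triples, symmetric=(), transitive=()):
    """Return the triples plus everything entailed by symmetric and transitive properties."""
    closed = set(triples)
    changed = True
    while changed:  # repeat until no new triple appears (a fixpoint)
        changed = False
        new = set()
        for s, p, o in closed:
            if p in symmetric:
                new.add((o, p, s))  # s p o  implies  o p s
            if p in transitive:
                for s2, p2, o2 in closed:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))  # s p o and o p o2  implies  s p o2
        if not new <= closed:
            closed |= new
            changed = True
    return closed


facts = {
    ("ex:Juan", "ex:sibling", "ex:Maria"),
    ("ex:Rosa", "ex:ancestor", "ex:Luis"),
    ("ex:Luis", "ex:ancestor", "ex:Juan"),
}
entailed = close_under(facts, symmetric={"ex:sibling"}, transitive={"ex:ancestor"})

for triple in sorted(entailed - facts):
    print(triple)
```

The two new triples say that ex:Maria is a sibling of ex:Juan and that ex:Rosa is an ancestor of ex:Juan, neither of which was stated explicitly.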
For example, imagine you are integrating two different datasets. You first create your ontology in OWL and then convert the datasets into RDF. Assume that we have two classes in our ontology: CoolCity and BoringCity and both of these classes are disjoint with each other (because a cool city can’t be a boring city).
ex:CoolCity owl:disjointWith ex:BoringCity
Now in the first dataset we extract the following RDF triple:
ex:Austin rdf:type ex:CoolCity
and in the second dataset we have the following:
ex:Austin rdf:type ex:BoringCity
If we simply combine these two RDF triples, nothing is inconsistent as far as RDF is concerned. However, if we reason about these two triples with our ontology, we will reach an inconsistency. It is not possible for Austin to be a CoolCity and a BoringCity at the same time when we stated in our ontology that CoolCity and BoringCity are disjoint. This is just a small example of what can be done with OWL. There are different flavors of OWL: OWL 1 and OWL 2, and each one has its sub-languages. If you are interested in learning more about OWL, check out the specs.
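The consistency check for this example can be sketched in a few lines of plain Python. A real OWL reasoner does far more than this; the function name find_disjoint_violations is hypothetical, and the check only looks at directly asserted rdf:type triples.

```python
def find_disjoint_violations(data, ontology):
    """Return (individual, class_a, class_b) for individuals typed with two disjoint classes."""
    # Collect the asserted classes of every individual.
    types = {}
    for s, p, o in data:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
    violations = []
    for a, p, b in ontology:
        if p == "owl:disjointWith":
            for individual, classes in types.items():
                if a in classes and b in classes:
                    violations.append((individual, a, b))
    return sorted(violations)


ontology = {("ex:CoolCity", "owl:disjointWith", "ex:BoringCity")}
dataset1 = {("ex:Austin", "rdf:type", "ex:CoolCity")}
dataset2 = {("ex:Austin", "rdf:type", "ex:BoringCity")}

print(find_disjoint_violations(dataset1, ontology))             # each dataset alone is fine
print(find_disjoint_violations(dataset1 | dataset2, ontology))  # the merged data is not
```

Each dataset is consistent on its own; the conflict only shows up once the two are merged and checked against the ontology.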
Database Schema vs Ontology
When I was going through the example of creating our university ontology, you may have thought that the ontology doesn’t look much different from a database schema. So what is the difference between a database schema and an ontology? There are actually several!
A database schema refers to the way the data is organized. In a relational database, the data is organized in tables. For example, the relationship “enrolled” is represented as a many-to-many table (many students can be enrolled in many courses). Additionally, a database schema specifies integrity constraints over the data in the database. Furthermore, relational databases follow the closed world assumption. This means that what is not currently known to be true is assumed to be false. Given a database that has the information Juan is enrolled in CS101, the query “Is Juan enrolled in CS202?” would return false. The database has no record of Juan being enrolled in CS202, therefore it assumes that Juan is not enrolled in CS202.
Ontologies are used to represent knowledge and reason about data in order to infer new knowledge and check consistency. OWL follows the open world assumption, which means that what is not currently known to be true is simply unknown. For the previous query, the answer would be “I don’t know!”. We currently know that Juan is enrolled in CS101, but we have no record of whether he is enrolled in CS202. Maybe he is, maybe he is not. We simply don’t know. Why does OWL follow the open world assumption? Because a missing statement is not treated as a false one, so new statements can always be added and new knowledge inferred from them.
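The difference between the two assumptions can be shown with a toy query function for each. This is only a sketch: the function names are hypothetical, and the known_false parameter is an invented stand-in for statements that have been explicitly asserted to be false.

```python
def ask_closed_world(facts, triple):
    """Closed world: anything not recorded as true is taken to be false."""
    return triple in facts


def ask_open_world(facts, triple, known_false=frozenset()):
    """Open world: a missing statement is unknown unless it is explicitly known to be false."""
    if triple in facts:
        return "yes"
    if triple in known_false:
        return "no"
    return "unknown"


facts = {("ex:Juan", "ex:enrolled", "ex:CS101")}
question = ("ex:Juan", "ex:enrolled", "ex:CS202")

print(ask_closed_world(facts, question))
print(ask_open_world(facts, question))
```

The closed-world query answers False, while the open-world query answers "unknown" for exactly the same data.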
One of the common misconceptions is to consider OWL as a schema language supporting integrity constraints (foreign key, not null, etc.). In a previous example, we were able to infer that ex:CS101 is an ex:Course. That is a correct inference. However, what if we accidentally state:
ex:Juan ex:enrolled ex:Adriana
We would then infer that ex:Adriana is an ex:Course. In the eyes of integrity constraints, that is wrong because ex:Adriana is a person who is actually my girlfriend. But in the eyes of the open world assumption, this is simply new knowledge that was inferred.
RDF by itself is flexible and schema-less. It is never tied to a single schema. You can use as many different properties and classes from different ontologies as you want. When you combine RDF with a more expressive ontology language such as OWL, you are able to infer new knowledge and check the consistency of your data. Even though OWL is not a schema language with integrity constraints, the guys at Clark & Parsia have been able to bridge this gap.
We have talked about RDF and now ontologies. So what’s next? We need to query our data! And that will take us to my next post: Introduction to SPARQL.
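The two readings of the same range statement can be put side by side in a small Python sketch. Both function names are hypothetical, and the class ex:Person is an assumption added only so that the constraint check has something to compare against.

```python
def check_range_constraint(data, known_types, prop, expected_class):
    """Constraint reading: flag triples whose object is not already known to be expected_class."""
    return [
        (s, p, o)
        for s, p, o in sorted(data)
        if p == prop and (o, "rdf:type", expected_class) not in known_types
    ]


def infer_from_range(data, prop, expected_class):
    """Inference reading: every object of prop is concluded to be an expected_class."""
    return {(o, "rdf:type", expected_class) for s, p, o in data if p == prop}


data = {
    ("ex:Juan", "ex:enrolled", "ex:CS101"),
    ("ex:Juan", "ex:enrolled", "ex:Adriana"),  # the accidental statement
}
known_types = {
    ("ex:CS101", "rdf:type", "ex:Course"),
    ("ex:Adriana", "rdf:type", "ex:Person"),  # ex:Person is a hypothetical class
}

print(check_range_constraint(data, known_types, "ex:enrolled", "ex:Course"))
print(sorted(infer_from_range(data, "ex:enrolled", "ex:Course")))
```

The constraint reading rejects the accidental triple, whereas the inference reading accepts it and concludes that ex:Adriana is a course.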