Building the Digital Aristotle
Jennifer Zaino
SemanticWeb.com Contributor
Semanticweb.com recently spoke with Dr. Mark Greaves, the director of knowledge systems research at Vulcan, Inc., about one of the private investment firm’s long-term research and development initiatives, called Project Halo.
Semanticweb.com: What is Project Halo?
Greaves: Project Halo is a long-term R&D effort by Vulcan to pursue a vision that we call the Digital Aristotle. The Digital Aristotle is a software application capable of answering novel questions and solving advanced problems in a broad range of scientific disciplines.
If you remember your history, Aristotle was one of the greatest teachers of antiquity (he was the tutor of Alexander the Great), and arguably the last person in the world who could plausibly claim to know the entirety of Western scientific knowledge. After Aristotle, the span of scientific knowledge continued to increase, and it became impossible for a single person to master all of it. It took until the age of computers to even dream about recreating an Aristotle-like level of expertise in science. The Digital Aristotle harkens back to the spirit of Aristotle. With recent advances in the core AI (artificial intelligence) technologies, it is now possible to imagine a way to create and curate an ever-growing digital scientific knowledge base, along with the question-answering technology to effectively use it.
To start on this vision, we are synthesizing some of the world’s best existing work in artificial intelligence (AI), including knowledge acquisition, natural language processing, and knowledge representation, as well as sponsoring targeted research to fill in certain gaps. We believe the time is right to start to create an AI system that has often been imagined but never built: a deep question-answering system for the natural sciences.
Web search engines are the predecessors to this; they are tremendously good at returning piles of web pages which contain the keywords you specify, but you generally cannot simply type a question into a search engine and just get the answer. You can’t easily ask a search engine, for example, “what happens if I heat powdered iron with concentrated sulfuric acid?” This is a question that any good high-school chemistry student should be able to answer, and in fact we ourselves use questions like this to figure out whether people know chemistry. Even though useful question answering is a central challenge of AI, there are remarkably few research projects in the world that focus on this. And almost none are trying to do question answering over a large domain, with the degree of complexity and sophistication that we are.
We would like to build a computer system that can answer questions at the level of difficulty of the US Advanced Placement examination, which measures competence in a field at the introductory college level. Our near-term goal is to build a computer that can take the AP test in either physics, biology, or chemistry, and score a 4 or a 5 (the two top scores).
Semanticweb.com: How?
Greaves: We divided the Digital Aristotle problem into two linked sub-problems. First, we want to create a system that has sufficiently robust question-answering algorithms to answer a substantial fraction of AP-style questions. There are a lot of important leading-edge AI techniques we are using to do this, and we are also funding original research in this area.
Second, though, we want the system’s scientific knowledge to be created and maintained by non-computer-scientists — ideally, by advanced students and their teachers who know that scientific domain. We believe that this is the only way we can scale to the sizes we require, and cost-effectively create the knowledge bases that we will need. Like Wikipedia, we want to build a system that allows anyone with minimal training to put knowledge into the system, so that the system can answer questions based on that knowledge. This is incredibly challenging — no one has ever built formal knowledge authoring tools of this complexity that were usable by a wide audience. If we are successful at either of these tasks, Project Halo will be a revolution.
These are clearly “grand challenge”-type problems, and we have a world-class team working on them. Project Halo is being built by some of the leading AI organizations in the world: prominent universities, high-end research institutes like SRI in Menlo Park, and leading-edge companies such as Germany’s Ontoprise GmbH. We explicitly set out to get some of the very best people in the world to work on one of the very hardest problems in the world.
Semanticweb.com: Where are you so far in the project, which began in 2004?
Greaves: One way to answer this is by the numbers. In our last major system evaluation, our system correctly answered around 30% of the questions we posed to it, drawn from a subset of the AP syllabus. We have a target of being able to answer 75% of the questions for our next evaluation, which will happen in the winter. We have touched this level of performance using experts in the lab, but achieving it using actual students as both knowledge authors and question posers will be very challenging.
The other way to answer this is by describing the state of the software. To acquire a large amount of basic scientific knowledge (what we call instance information and taxonomic information, along with basic ontology information), Project Halo has developed a semantic wiki. We want to leverage some of the crowd-sourcing and consensus aspects of wikis to also get agreement on instance-level vocabularies as well as the less logically complex information that is required to answer AP-level questions.
A key to acquiring a large amount of high-quality machine-readable data is to allow a similarly large group of people to collaborate on the data. So, our team built a powerful set of extensions to a well-known early semantic wiki called Semantic MediaWiki. These extensions are called the Halo Extensions, and make Semantic MediaWiki much more powerful and usable. All of the code is available as open source on SourceForge, and despite doing zero marketing, we have seen significant pick-up of our basic software.
Besides the Halo Extensions to Semantic MediaWiki, we are also creating a large AP-level question answering application. This application, called Aura, is the locus of a lot of our technical work in AI and large knowledge bases. Aura’s functions are quite complex, and rather than try and describe them here, I’ll point you to our published scientific papers. However, we think the combination of Semantic MediaWiki, Aura, and some very promising new work we are funding in scalable semantic rules (called SILK) can really jumpstart the ability of non-specialists to use web tools to jointly interact with data as well as with pages, and to fuse that data in user-specific ways. This ability is core to the Web 3.0 vision, and so we believe the techniques we are developing are fundamental to Web 3.0.
Semanticweb.com: So semantic web technology is a key piece of the project, but not sufficient by itself?
Greaves: That’s right. The semantic web, at least as it is evolving right now, is built on a language called OWL. OWL was created to be very good at representing certain kinds of information that are very common on the web. However, OWL also has very clear limits. In particular, OWL is not particularly good at representing the information associated with processes.
For example, consider a complex chemical reaction of several steps, where certain steps are optional or context-specific, and there are several different inputs and outputs, and suppose you want to reason about what would happen at stage 6 if stage 3 was suppressed. OWL is not generally very good at representing that kind of information. So, while we view OWL as an incredibly important part of building a system that can answer novel scientific questions, and we leverage a great deal of semantic web technology, it is not the entire answer. Project Halo’s main investments are targeted at breakthroughs that go substantially beyond what semantic web technologies can represent.
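To make the "stage 6 if stage 3 was suppressed" question concrete, here is a minimal Python sketch of that kind of counterfactual reasoning over a multi-step process. It is an illustration only, not how Aura, SILK, or OWL represents processes. The six stages, the substances A through G, and the names `Step`, `STEPS`, and `run` are all hypothetical, and substances are treated as simply available once produced rather than being used up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    needs: frozenset  # substances that must be available for the step to fire
    makes: frozenset  # substances the step adds when it fires


# Hypothetical six-stage process; stage 4 does not depend on stage 3.
STEPS = [
    Step("stage1", frozenset({"A"}), frozenset({"B"})),
    Step("stage2", frozenset({"B"}), frozenset({"C"})),
    Step("stage3", frozenset({"C"}), frozenset({"D"})),
    Step("stage4", frozenset({"C"}), frozenset({"E"})),
    Step("stage5", frozenset({"D"}), frozenset({"F"})),
    Step("stage6", frozenset({"E", "F"}), frozenset({"G"})),
]


def run(steps, start, suppressed=frozenset()):
    """Walk the steps in order and return (available substances, fired step names)."""
    available = set(start)
    fired = []
    for step in steps:
        if step.name in suppressed:
            continue  # the counterfactual: this step never happens
        if step.needs <= available:  # all required inputs are present
            available |= step.makes
            fired.append(step.name)
    return available, fired


if __name__ == "__main__":
    print(run(STEPS, {"A"})[1])
    print(run(STEPS, {"A"}, suppressed={"stage3"})[1])
```

In this toy model, suppressing stage 3 removes D, so stage 5 cannot produce F and stage 6 never fires, even though stage 4 still runs. Real process knowledge also involves optional and context-specific steps, quantities, and conditions, which is where the representational difficulty described above comes from.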
Semanticweb.com: Why is Vulcan interested in this?
Greaves: The short answer is that Vulcan’s owner (Microsoft co-founder Paul Allen) is very interested in this area, and has elected to invest in Project Halo. The longer answer, though, is that we believe that recent developments in AI and the Web indicate that game-changing breakthroughs in computer processing of semantics are now possible, where they weren’t possible just 10 years ago. Remember that Project Halo is interested in semantics writ large: not just the kind of semantics that are carried in the semantic web, but also the more complex semantics that are carried in our brains, which make it possible for hundreds of thousands of students to answer tough questions on the AP exams every year. We think that the results of Project Halo will substantially move the needle in AI, and provide a platform to increase the pace of innovation in this area.
Semanticweb.com: What’s next for Project Halo?
Greaves: In the winter, we are going to subject our entire system to a rigorous quantitative evaluation by an independent set of contractors. If the evaluation of the question-answering application goes well, then our next step is to spend 2009 using our system to build the knowledge to do an entire AP syllabus. Right now, when we are testing the system, we are doing it with just a knowledge fragment that is easily manageable — a representative 10 to 15 percent. We will publish the results of this evaluation.
Aside from the larger goals of Project Halo and the Digital Aristotle, we have been pleasantly surprised by the attention that our Semantic Wiki has attracted. Because we paid so much attention to usability and scalability, we now have created one of the premier semantic wiki environments in the world, and we are working hard on adding new features and supporting our growing community of non-Halo users. I think the notion of semantically-boosted knowledge collaboration using the consensus tools of wikis is unbelievably powerful, and can have a lot of traction in the world outside of AI researchers. I am really excited about this.
