Wikimeta, the semantic tagging and annotation architecture for incorporating semantic knowledge within documents, websites, content management systems, blogs and applications, is incorporating itself this month as a company called Wikimeta Technologies. Wikimeta, which traces its heritage to the NLGbAse project, was launched as its own web service last year.

Dr. Eric Charton, Ph.D., M.Sc., of École Polytechnique de Montréal, is the project leader and author of the Wikimeta code. The NLGbAse project was conducted by Charton at the University of Avignon as part of his Ph.D. thesis. The Semantic Web Blog recently hosted an email discussion with him to learn more about the Wikimeta architecture and its evolution.


The Semantic Web Blog: Tell us about the NLGbAse project and Wikimeta’s relationship to it.

Charton: NLGbAse is an ontology extracted from Wikipedia. It is used in Wikimeta as a resource for semantic disambiguation. For each Wikipedia document (i.e., semantic concept), NLGbAse provides the various surface forms under which it can be written (for example, “General Motors” can be written “GM Company”, “GM”, “General Motors Corp” and so on), which are used for detection.

If the same word sequence can refer to more than one semantic concept (“Paris”, for instance, can be a music album, one of many cities, or a person’s name), NLGbAse also provides the information needed to disambiguate that word sequence, according to its context, at annotation time.
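[To make that concrete, here is a minimal sketch of how such a surface-form resource could be modeled in code. The dictionary contents, structure and names are illustrative assumptions, not the actual NLGbAse format.]

```python
# Illustrative sketch only: not the actual NLGbAse data format.
# Each surface form maps to the Wikipedia concepts it may denote.
SURFACE_FORMS = {
    "GM": ["General_Motors"],
    "GM Company": ["General_Motors"],
    "General Motors Corp": ["General_Motors"],
    "Paris": ["Paris", "Paris_(album)", "Paris_Hilton"],
}

def candidate_concepts(mention):
    """Return every concept a surface form may refer to."""
    return SURFACE_FORMS.get(mention, [])

for mention in ("GM", "Paris"):
    candidates = candidate_concepts(mention)
    status = "ambiguous" if len(candidates) > 1 else "unambiguous"
    print(f"{mention!r} -> {candidates} ({status})")
```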

With NLGbAse content, we can train the various machine learning components involved in the Wikimeta API. [The semantic labeling web service API provides a single, REST-compliant access point for all text-mining and content analysis functionality.]
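[As a rough illustration of what calling such a REST access point might look like, here is a sketch using Python’s requests library. The endpoint URL, parameter names and credential below are placeholders assumed for the example, not Wikimeta’s documented interface.]

```python
import requests

# Placeholder values for the sketch; consult the Wikimeta documentation
# for the real endpoint and parameter names.
API_URL = "https://www.wikimeta.com/api"
API_KEY = "YOUR_API_KEY"

def annotate(text):
    """POST raw text to the labeling service and return the parsed JSON reply."""
    response = requests.post(
        API_URL,
        data={"apikey": API_KEY, "text": text, "format": "json"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    print(annotate("General Motors opened a new plant near Paris."))
```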

The Semantic Web Blog: And help us understand more about Wikimeta and its original development.

Charton: Wikimeta is a system based on multiple components:

  • A part of speech tagger
  • A named entity recognition module
  • A semantic annotation module
  • An ontological interface to establish a link between an annotation and the semantic web

The technology behind these modules was studied scientifically in various academic projects in which the current Wikimeta team has worked over the last four years. For example, in 2009, a named entity recognition prototype was built at the University of Avignon by a team I was part of, to participate in a scientific evaluation campaign named ESTER.

Professor Michel Gagnon and I have studied the semantic annotation problem at École Polytechnique de Montréal since December 2010, as part of research dedicated to the automatic generation of visual animations from text, where the goal is to identify the semantic concept related to a word in a sentence in order to find a graphic model.

That research was presented in various scientific papers and evaluation campaigns. Finally, all of these technologies were assembled as components of a semantic annotation and text-mining tool, integrated for demo purposes into the NLGbAse platform at the beginning of 2011.

This demonstration tool received good feedback, so in the middle of 2011 we decided to build a new project dedicated to semantic annotation and text mining, provided as a web service: this is Wikimeta. It is a completely new annotation and text-mining engine, using various technologies and incorporating the NLGbAse ontology (which is licensed under CC non-commercial). Wikimeta is an informal research group: self-funded, independent, with its own identity.

The Semantic Web Blog: How has Wikimeta evolved over the course of that time?  What are some of the most recent updates?

Charton: The main difficulty of the text-mining and semantic annotation process is that it does not produce perfect output; there are always some annotation errors in the results. In any technical process (industrial or scientific), there is a so-called, and usually accepted, “error tolerance margin” (sometimes very high in chemistry or biology, for example), and it is the same in text mining. But it is interesting to see that for many text-mining end users, the tolerance for error is very low, probably because language-related processes are immediately comparable to humans’ own faculties of speaking and understanding.

Accordingly, over the last four years our work has focused on making the text-mining and annotation processes more robust, and on making the remaining error margin as low as possible and more acceptable to users.

When we decided to launch an on-line tool and ask people to use it (including people who are not scientists or data specialists), we wanted it to be good enough to be accepted by anybody. This means the annotation error rate has to be very low (mean errors in Wikimeta are under 10 percent). We work mostly, and will continue to work, on the robustness of the text-mining and semantic annotation engine.

The most recent updates focused on improving the performance of the detection models and on making Wikimeta strictly compliant with web standards (REST, XML, JSON). We also worked with academics who compare tools under standard conditions, or who provide interfaces to do so; for example, with the Eurecom people, who provide a genuinely standardized comparator for text-mining tools.

We also worked on the user interface to make the error margin more acceptable: we recently launched an on-line editor which allows users to edit and correct the automatic annotation easily in a few seconds (see the screenshot of the user interface for managing semantic annotations below).

Finally, we continuously work on new text-mining and annotation functionalities, but we prefer to take the time to validate algorithms before launching a new functionality. For example, we have worked on a co-reference module (determining the entity a pronoun like “it” or “she” refers to) over the last two years; it was submitted as a prototype to important scientific evaluation conferences like the CoNLL 2011 Shared Task. A co-reference detection module will therefore be integrated into Wikimeta in the next 12 months.

The Semantic Web Blog:  How do you differentiate this work from other tools/solutions out there that seem to aim at the same annotation goals?

Charton: Here are some differences between Wikimeta and other tools:

Wikimeta is not only a semantic annotation tool (like Open Calais, Spotlight, Alchemy or Yahoo): it’s a complete text-mining solution. It provides:

  • A full Part-Of-Speech (POS) annotation layer. This gives the nature of each word: verbs (with their tense), adjectives, proper names, first names, numerical values (more than 100 POS tags are available). This feature allows API users to work on various high-level text-mining and analysis tasks, like opinion mining and sentiment analysis, where identification of adjectives or verbs is essential.
  • A real Named Entity annotation layer: this means that if a person or organisation name appears in a text stream and does not exist in any ontology (standard or not), it will still be tagged with a named entity class label. This allows third-party applications to work not only on known concepts but also on newly discovered ones.
  • An exhaustive Semantic Annotation layer, fully compliant with Wikipedia and the LinkedData semantic web network. (See the sketch after this list for one way these layers might be consumed.)
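[The sketch below illustrates one way a client might combine those three layers. The token record, field names and tag values are assumptions made for this example, not Wikimeta’s actual output schema.]

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    surface: str
    pos: str                       # part-of-speech tag (values assumed for the example)
    ne_class: Optional[str] = None  # named-entity class, e.g. "ORG", or None
    uri: Optional[str] = None       # LinkedData URI when the concept is known

tokens = [
    Token("GM", "PROPN", ne_class="ORG",
          uri="http://dbpedia.org/resource/General_Motors"),
    Token("builds", "VERB"),
    Token("excellent", "ADJ"),
    Token("engines", "NOUN"),
]

# Opinion-mining style use of the POS layer: keep only adjectives.
adjectives = [t.surface for t in tokens if t.pos == "ADJ"]

# Named entities are kept even when no ontology entry exists (uri is None).
entities = [(t.surface, t.ne_class, t.uri) for t in tokens if t.ne_class]

print("adjectives:", adjectives)
print("entities:", entities)
```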

Wikimeta outputs are strictly standards-compliant:

  • Wikimeta’s Named Entity annotation schema is compliant with standard NLP corpora, and its Semantic Annotation is compliant with DBpedia, Wikipedia and LinkedData (because we use our own disambiguation ontology, NLGbAse). DBpedia Spotlight, for example, provides only DBpedia annotation; DBpedia covers only a subset of Wikipedia concepts, mainly the English ones. OpenCalais uses its own ontology system, with many concepts not linked to DBpedia or Wikipedia. We consider that the next generation of rich-content, third-party applications using a text-mining API will need standardized, full access to the LinkedData network to retrieve extra semantic knowledge (a sketch of following an annotation into LinkedData appears below).
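[As one possible illustration of that last point, a client that receives a DBpedia URI in an annotation can query the public DBpedia SPARQL endpoint for additional facts. The query below uses the SPARQLWrapper library; the chosen resource and property are only examples.]

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Ask the public DBpedia endpoint for the English abstract of a concept
# returned by an annotation (General_Motors is used here as an example).
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?abstract WHERE {
        <http://dbpedia.org/resource/General_Motors> dbo:abstract ?abstract .
        FILTER (lang(?abstract) = "en")
    }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["abstract"]["value"])
```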

Wikimeta components are built from state-of-the-art algorithms, published in scientific papers and evaluated on standard corpora. This means, for example, that when we say our tool performs real multilingual disambiguation, rather than using a simple non-contextual rule-based method, you can verify the details in our published scientific papers. It also means that anybody can evaluate and compare our tool using reference metrics and estimate its performance accurately:

  • The only alternative tool that has been described in this way is DBpedia Spotlight (which is also built by academics). So when we say that Wikimeta produces over 94 percent correct semantic annotations within the three first-ranked suggestions, this is tested, evaluated, published, peer-reviewed and reproducible by third parties, not just a feeling after a simple copy-and-paste of text (a sketch of this top-three measure follows the list).
  • There is a strong need for real evaluation of text-mining tools like ours: a simple comparison of pasted text in a web form is not enough and does not allow potential users to form a real opinion of a tool’s performance.
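[For readers unfamiliar with the metric, the 94 percent figure corresponds to a top-k accuracy measured against a gold-standard corpus. The sketch below shows how such a number can be computed; the data is invented purely for illustration.]

```python
def top_k_accuracy(ranked_predictions, gold_concepts, k=3):
    """Fraction of mentions whose gold concept appears among the
    annotator's top-k ranked suggestions."""
    hits = sum(
        1 for ranked, truth in zip(ranked_predictions, gold_concepts)
        if truth in ranked[:k]
    )
    return hits / len(gold_concepts)

# Toy data for illustration only, not an actual evaluation corpus.
ranked_predictions = [
    ["Paris", "Paris_(album)", "Paris_Hilton"],
    ["General_Motors", "GM_Korea"],
    ["Normandy", "SS_Normandie"],
]
gold_concepts = ["Paris", "General_Motors", "SS_Normandie"]

print(f"top-3 accuracy: {top_k_accuracy(ranked_predictions, gold_concepts):.2f}")
```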


The Semantic Web Blog:   Explain a bit about its multilingual support.

Charton: The disambiguation process, because it uses contextual information from NLGbAse calculated for a specific language, is genuinely multilingual. It is the same process humans use to identify the semantic identity of a word sequence in a text: you observe a word and the words around it (its context), and estimate the probability of its exact sense. And to do that, you use contextual words expressed in the specific language of the sentence.

A simple example: when we, as humans, try to determine whether the word “Normandie” in a sentence refers to the ocean liner SS Normandie or the French region Normandie, we disambiguate with French contextual words in a French text (paquebot, Le Havre) and with English contextual words in an English text (ocean liner, boat). Wikimeta does the same.

This improves performance, and it will allow us to provide many language editions in the future with a consistent level of performance.
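[To convey the intuition behind this contextual approach (Wikimeta’s actual disambiguation relies on trained statistical models, not the toy heuristic below), here is a minimal sketch in which each candidate meaning of “Normandie” carries a small context profile, and the candidate sharing the most words with the sentence wins. The profiles and sentences are invented for the example.]

```python
# Intuition-level sketch of context-based disambiguation, not Wikimeta's model.
CONTEXT_PROFILES = {
    "SS_Normandie (ocean liner)": {"ocean", "liner", "boat", "ship", "transatlantic"},
    "Normandie (French region)":  {"region", "france", "coast", "calvados", "rouen"},
}

def disambiguate(sentence):
    """Pick the candidate whose context profile overlaps most with the sentence."""
    context = set(sentence.lower().split())
    return max(CONTEXT_PROFILES, key=lambda c: len(CONTEXT_PROFILES[c] & context))

print(disambiguate("The Normandie was the fastest ocean liner of its day."))
print(disambiguate("We toured the Normandie region and its coast near Rouen."))
```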

The Semantic Web Blog: What goes along with the launch of Wikimeta Technologies?

Charton: At the beginning of March, we will offer various levels of commercial plans for the Wikimeta API as a classic web service.

We also want to give software and application developers the opportunity to integrate text-mining and semantic annotation technology into their products. For those specific needs, we will provide Wikimeta as third-party API servers that will allow software developers or companies to manage their own scalable semantic labeling service and use it in their products as they wish.

Those servers will be offered as a free beta at the beginning of April 2012 for the first customers of the Wikimeta web service API.

The Semantic Web Blog:   What are the most intriguing/interesting solutions you know of that use Wikimeta?

Charton: Currently, most of our users are academics, and their use is dedicated to experiments (like NERD, the semantic engine comparison platform from Eurecom, or a prototype question answering system built by students at the Laboratoire Informatique d’Avignon). Some research has also been done on Twitter analysis.

At this time, the semantic annotation tool and its full compliance with Wikipedia have particularly interested the publishing industry, and we are working on prototypes related to this subject with two partners. We will probably make an announcement about this in the summer.