Using Semantic Web Standards for Improved Text Mining

Better text mining makes it possible to connect information in a variety of sources. The technology can connect information in CRM databases with consumer e-mails and help desk reports to provide a more complete view of the customer. Text mining can also be used in national security applications to better identify terrorists and security threats; it can assist in marketing to mine reviews for feedback on products such as movies, books and music. It can help in scientific research by providing a way to better connect scientific articles.

While there are many applications for text mining, there are also big challenges in its use. Text is written for people to read. Currently, no programs can "read" text and understand its full meaning, and none will be available for the foreseeable future. With this challenge in mind, can Semantic Web technologies help improve text mining? The answer is yes, but the reason may come as a surprise to many.

Broadly speaking, the algorithms used for text mining fall into two categories, linguistic analysis and statistical pattern matching:

  • Linguistic analysis examines the structure of sentences. This means that the software must be extended to support different languages. Among the many vendors in this space are Temis, Stratify and Attensity.
  • Statistical pattern matching relies on mathematical techniques that are language independent; no attempts are made to recognize what the text means – it is all symbols to the software. Bayesian belief networks are often used. The most notable vendor is Autonomy.

Linguistic Analysis Remains the Cornerstone of Text Mining

Today, most text mining software supports some combination of both approaches, but linguistic analysis remains the cornerstone of text mining – the most reliable way to get accurate results. Mathematical techniques can usefully supplement linguistic techniques. On their own, however, they do not work reliably and suffer from being a “black box.” The software needs to be trained to process particular content and, once trained, it behaves a certain way. If the training set is changed, the behavior may change as well. It is hard to pinpoint exactly why the software produces the results it does, or how to improve them when they are unsatisfactory.
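The dependence on training data can be seen in a toy sketch. The example below is a minimal naive Bayes classifier (a very simple Bayesian model), written only for illustration; it is not how Autonomy or any other vendor works, and the `train` and `classify` helpers and the sample reviews are assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, label). Returns per-label token counts and label counts."""
    token_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        # Tokens are just symbols to the model; no meaning is attached to them.
        token_counts[label].update(text.lower().split())
    return token_counts, label_counts

def classify(text, token_counts, label_counts):
    """Pick the label with the highest log-probability, using add-one smoothing."""
    vocab = {t for counts in token_counts.values() for t in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total_tokens = sum(token_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for token in text.lower().split():
            score += math.log((token_counts[label][token] + 1) / (total_tokens + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set of product reviews.
training = [
    ("great movie loved it", "positive"),
    ("loved the music", "positive"),
    ("terrible movie hated it", "negative"),
    ("hated the book", "negative"),
]
model = train(training)
label = classify("loved it", *model)
```

Swapping or relabeling a few training examples changes the token counts and can change the predicted label, which is the behavior described above: the output is a product of the training set rather than of any understanding of the text.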

Linguistic analysis does not require investment in training the software, works much more reliably and accurately, and is easy to understand, improve and extend. It does, however, require investment in developing so-called vocabulary cartridges. Cartridges are domain-specific vocabularies. For example:

  • A vendor like Calais, which specializes in recognizing names, places, companies and roles, developed vocabularies of geographic locations, organizations and names. The vocabulary records that United Kingdom and England are synonymous and that they are countries, and that Lockheed Martin, Lockheed and LMCO refer to the same thing.
  • Attensity invested time working with life sciences companies. It has vocabularies that include different terms for biological and chemical elements.
  • Stratify, after several years of positioning its software as a general solution, now specializes in analyzing legal documents.
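As a rough illustration of what a vocabulary cartridge provides, the sketch below maps surface forms to a canonical concept and a type. The `CARTRIDGE` dictionary and the `extract_concepts` helper are hypothetical and only mirror the examples in the list above; real products use far larger vocabularies and more careful matching.

```python
import re

# Hypothetical mini "cartridge": surface form -> (canonical concept, type).
# The entries only mirror the examples given in the text.
CARTRIDGE = {
    "lockheed martin": ("Lockheed Martin", "Organization"),
    "lockheed": ("Lockheed Martin", "Organization"),
    "lmco": ("Lockheed Martin", "Organization"),
    "united kingdom": ("United Kingdom", "Country"),
    "england": ("United Kingdom", "Country"),
}

def extract_concepts(text):
    """Return the distinct (concept, type) pairs found in text, trying longest terms first."""
    found = []
    remaining = text.lower()
    for term in sorted(CARTRIDGE, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, remaining):
            if CARTRIDGE[term] not in found:
                found.append(CARTRIDGE[term])
            # Blank out the match so "lockheed" is not matched again
            # inside an already recognized "lockheed martin".
            remaining = re.sub(pattern, " ", remaining)
    return found

concepts = extract_concepts("LMCO opened an office in England.")
```

Here both "LMCO" and "England" resolve to their canonical concepts, so documents that use different names for the same thing can be connected.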

Of course, vocabularies are not the only thing that makes text mining possible. For example, there are modules for identifying words with the same root, a technique called stemming, which deals with prefixes and suffixes. There are also modules for recognizing sentence structure, so that the software can detect when a word is used as a noun as opposed to a verb, adverb or adjective. Such modules are well developed, are commonly used in spell checking and are available as open source.
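To make the idea of stemming concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm or any vendor's module, and it ignores prefixes entirely; the suffix list and the `naive_stem` name are assumptions chosen for illustration.

```python
# Hypothetical, deliberately small suffix list; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip one common English suffix, keeping at least three letters of the root."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# All four forms reduce to the same root, so they can be matched as one term.
stems = {w: naive_stem(w) for w in ["connect", "connected", "connecting", "connects"]}
```

A rule set this small breaks quickly on irregular words, which is one reason the mature open source modules mentioned above are normally used instead.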

There have certainly been no breakthroughs in text mining approaches and algorithms in years. If you have a well-developed vocabulary or knowledge model of the domain, you get pretty good results; otherwise, results can be mediocre. This presents a challenge, because many would say that vocabulary creation is an awful lot of work!

The vocabulary/knowledge model approach to text mining has proven to work well and reliably, but it is often dismissed or used in a limited way because it is so labor intensive. Much effort has been expended on finding a more “scalable” approach – an approach based on some magical algorithm. So far, the magic has not been discovered and there have been no fundamentally new ideas on what it could be.

Text Mining using Semantic Web Standards: Crowd-sourcing Vocabulary Creation

RDF and OWL do not offer anything specific to text mining, except for one key feature – a standard way to describe vocabularies and distribute them over the web. What does this mean?

Suddenly, vocabulary creation can be crowd sourced and, thus, be very scalable. A single company does not need to develop and systematize all the knowledge and can leverage the work of others — vendors, research programs, end user organizations, standards bodies, individuals — anyone and everyone. Described in RDF, vocabularies developed by different parties can come together and can refer to each other. The work of one party can be extended, clarified, specialized and used by another party – in its entirety or selectively.
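The way independently published vocabularies can come together may be sketched as follows. This is a minimal illustration that uses plain Python tuples to stand in for RDF triples rather than a real RDF library. The `skos:prefLabel`, `skos:altLabel` and `owl:sameAs` properties are real, but every `example.org` URI, the two "parties" and the `labels_for` helper are hypothetical.

```python
SKOS_PREF = "http://www.w3.org/2004/02/skos/core#prefLabel"
SKOS_ALT = "http://www.w3.org/2004/02/skos/core#altLabel"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Vocabulary published by one (hypothetical) party.
party_a = {
    ("http://example.org/a/UK", SKOS_PREF, "United Kingdom"),
    ("http://example.org/a/UK", SKOS_ALT, "UK"),
}

# A second (hypothetical) party extends it by linking its own URI to party A's.
party_b = {
    ("http://example.org/b/GreatBritain", OWL_SAME_AS, "http://example.org/a/UK"),
    ("http://example.org/b/GreatBritain", SKOS_ALT, "Britain"),
}

def labels_for(uri, triples):
    """Collect labels for uri, following owl:sameAs links in both directions."""
    same = {uri}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            # Grow the set whenever exactly one end of a sameAs link is known.
            if p == OWL_SAME_AS and (s in same) != (o in same):
                same.update((s, o))
                changed = True
    return sorted(o for s, p, o in triples
                  if s in same and p in (SKOS_PREF, SKOS_ALT))

# With no blank nodes involved, merging two RDF graphs is a set union of triples.
merged = party_a | party_b
uk_labels = labels_for("http://example.org/a/UK", merged)
```

Neither party had to coordinate with the other beyond referring to a shared URI, which is what makes this kind of vocabulary building scale across organizations.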

Vocabularies in RDF have already started to appear on the web – from simple and generic to more complex and domain specific. On the simple side, TopQuadrant includes with TopBraid Composer (and makes available on the web) a vocabulary of country names with links to DBpedia: http://topbraid.org/countries. DBpedia already contains synonyms and language-specific names for each country, so TopQuadrant decided not to replicate this readily available knowledge. For an example of how to use it, go to http://www.topquadrant.com/docs/videos/DBpedia.wmv.

Domain-specific vocabularies available in RDF include AGROVOC from the United Nations’ Food and Agriculture Organization and the Economic Thesaurus from the German National Library of Economics.

The holy grail of text mining – accurate concept extraction from unstructured text – will soon arrive, although it will likely not stem from advances in mathematical algorithms or some super-smart artificial intelligence. Instead, it is coming as a result of advances in the web that enable us to make better use of human intelligence.
