Taking Search — And Meaning — Beyond English

Companies and government agencies, from global firms dealing with legal e-discovery issues across their offices worldwide to intelligence officials working on watch lists, need to be able to search across and extract meaning from unstructured text, and not just in English. Multilingual text analytics vendor Basis Technology Corp., which develops the Rosette linguistics platform, says such needs are driving a renewed focus on internationalization.

“By and large I think the market as a whole has been getting more sophisticated in both their understanding and use of text analytics,” says Steve Kearns, Rosette product manager. “So the forward movement in the semantic web and all the related fields really has given rise to a better customer for us, people who understand the issues, and we now see a big multilingual focus…. People are seeing they need to make search work in other languages, make it function as it should across different locations and languages.”


The company this week released Rosette 7, the latest version of its software, which is used in major web and enterprise search engines, from Google to Bing to Oracle software. The product supports 55 languages for language identification, and if you count different encodings that grows to more than 100 language-encoding pairs. For the base linguistics that search engines rely on, it supports about 20 languages, depending on how you count them. Its constantly expanding language set now includes extended coverage for Middle Eastern languages such as Pashto and Dari. “These hard languages are not well served in the market today in the search engine and entity extraction space, so we built on that,” Kearns says.

In addition to its list-based and rules-based entity detection algorithms, the update features a completely rewritten core statistical engine for entity extraction of organizations, people, places, URLs, and so on. The improved extractor needs less training data to build statistical models and makes it easier for users to train the engine in the field, Kearns says. “We use the statistical engine for persons, locations and organizations; these are hard to make rules about. How can you make a correct set of rules that this is a person name vs. an organization name, for instance?” he says.

Humans annotated hundreds of thousands of words of publicly available text in each of the 14 languages the engine supports. The model was then trained to read all the sentences and learn what makes up a person entity, an organization entity, and so on, an approach that “gives us a lot more traction on these hard entity types than you get by just rules.”

In some cases, though, particularly in government (which makes up about 40 to 50 percent of the company’s market), the text the software is trained on needs to be supplemented with the customer’s own kind of prose. “Depending on where the data comes from, it doesn’t always look like data we trained on. So when the actual genre of text is different enough, the performance of statistical entity extractors could suffer because it doesn’t recognize the sentence structure or new words.” Being able to quickly train for this in the field improves performance.
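To make the distinction between list-based and rules-based detection concrete, here is a minimal Python sketch. It is not Rosette code and does not reflect its APIs; the gazetteer entries, the URL pattern and the function name `extract_entities` are all assumptions made for illustration.

```python
import re

# Hypothetical gazetteer: a hand-maintained list of known names and their types.
GAZETTEER = {
    "Basis Technology": "ORGANIZATION",
    "Oracle": "ORGANIZATION",
    "San Francisco": "LOCATION",
}

# Rule for an entity type whose shape is easy to describe: URLs.
URL_RULE = re.compile(r"https?://[^\s,;]+")


def extract_entities(text):
    """Return (start, end, surface, type) tuples found by list and rule lookups."""
    found = []
    # List-based detection: whole-word matches against the gazetteer.
    for name, label in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((m.start(), m.end(), m.group(), label))
    # Rules-based detection: anything that matches the URL pattern.
    for m in URL_RULE.finditer(text):
        found.append((m.start(), m.end(), m.group(), "URL"))
    return sorted(found)


sample = ("Oracle staff met Basis Technology in San Francisco; "
          "see http://example.com/notes for details.")
for entity in extract_entities(sample):
    print(entity)
```

A sketch like this finds only names that are already on the list or that fit a pattern. A sentence such as "Jane Smith joined Acme" yields nothing, which is the kind of gap Kearns says the statistical engine is meant to cover for person, location and organization names.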

Another major feature in Rosette 7 is name matching and name translation, a problem the company has been working on for more than five years. This release is the first in which name translation and searching are integrated into the same core set of Rosette APIs. “Bringing those in lets us more easily set up a pipeline…. If Osama bin Laden is written in Arabic you can search for it in English, Japanese or Chinese documents. Being able to hook all these pieces up, the name matching and translation to entity extraction, gives radical new capabilities in terms of how customers might assemble their applications,” Kearns says. As an example, he points to e-discovery efforts that run entity extraction across all of an organization’s emails. “Now you have people, and can put them into a search engine. But what if not everyone spells ‘Joseph’ properly? Some say Joe, or whatever. Now you have the flexibility of a fuzzy name search engine to find that. Or what if it’s in a Chinese document from a Chinese customer? Now our software can directly find out that he is mentioned there, so that document will be relevant to the case and you will want someone to read this.”
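The "Joseph" versus "Joe" problem can be illustrated with a small Python sketch of fuzzy name search. This is not how Rosette matches names, which the article does not describe; it swaps in a simple stand-in built from a nickname table plus the standard library's `difflib.SequenceMatcher`. The table contents, the 0.8 threshold and the function names are assumptions, and the sketch handles only same-script variants, not cross-language matching such as Arabic to English.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; a real system would need far broader coverage.
NICKNAMES = {"joe": "joseph", "joey": "joseph", "bill": "william"}


def normalize(name):
    """Lowercase a given name and map known nicknames to a canonical form."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def names_match(a, b, threshold=0.8):
    """Return True if two given names are likely variants of the same name."""
    a_norm, b_norm = normalize(a), normalize(b)
    if a_norm == b_norm:
        return True
    # SequenceMatcher.ratio() returns a similarity score between 0 and 1.
    return SequenceMatcher(None, a_norm, b_norm).ratio() >= threshold


def search(query, candidates, threshold=0.8):
    """Return the candidates that fuzzily match the query name."""
    return [c for c in candidates if names_match(query, c, threshold)]


print(search("Joseph", ["Joe", "Jospeh", "Maria"]))
```

Under these assumptions, "Joe" matches through the nickname table and the transposed spelling "Jospeh" matches through the similarity score, while "Maria" is rejected. The threshold is a trade-off: lowering it catches more misspellings but also admits more false matches.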

The latest version also supports Lucene-based applications, so any organization using the open source search toolkit can get the same advanced linguistic processing used by high-end web and enterprise search engines.
