Since OpenText acquired Nstein a little over a year ago, work has been underway to build Nstein’s semantic technology for text mining, analytics, and search into OpenText’s enterprise content management platform. So far, that has resulted in adding Semantic Navigation, the on-premises or cloud website search and content discovery solution, to OpenText’s Web content management (WCM) products, such as OpenText Web Experience Management and Web Site Management.
This covers aspects such as content tagging and semantic faceting at the content and document levels. This year and next should bring further integration of Nstein technologies into the OpenText solution set, as well as some new offerings to support other use cases.
As an example, the company is working on a listening platform application. It draws on work Nstein did for the Canadian government’s public health agency, which used Nstein’s Text Mining Engine to identify potential threats to human health by scouring multiple sources, including news aggregators like Factiva, that were parsed for roughly 1,000 concepts such as “mysterious ailments” and “outbreak.” OpenText is building up a framework for ingesting different data sources to support this, says Charles-Olivier Simard, product manager for semantic technologies at OpenText.
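To make the idea of parsing several sources for a watch list of concepts concrete, here is a minimal sketch in Python. It is not Nstein’s Text Mining Engine, which the article does not describe in detail. It assumes plain phrase matching, and the two-item concept list, the `scan` function, and the source names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical watch list; the system described reportedly tracked roughly 1,000 concepts.
WATCH_CONCEPTS = ["mysterious ailment", "outbreak"]


def build_pattern(concepts):
    """Compile one case-insensitive pattern covering every concept phrase."""
    # Longest phrases first so multi-word concepts win over shorter overlaps.
    ordered = sorted(concepts, key=len, reverse=True)
    alternatives = "|".join(re.escape(c) for c in ordered)
    # The optional trailing "s" catches simple plurals such as "outbreaks".
    return re.compile(r"\b(" + alternatives + r")s?\b", re.IGNORECASE)


def scan(documents, concepts=WATCH_CONCEPTS):
    """Return {source: Counter of concept hits} for (source, text) pairs."""
    pattern = build_pattern(concepts)
    hits = {}
    for source, text in documents:
        counts = hits.setdefault(source, Counter())
        for match in pattern.finditer(text):
            counts[match.group(1).lower()] += 1
    return hits


if __name__ == "__main__":
    feed = [
        ("newswire", "Officials report an outbreak. Mysterious ailments spread."),
        ("blog", "Nothing unusual today."),
    ]
    for source, counts in scan(feed).items():
        print(source, dict(counts))
```

A production system would rely on linguistic models rather than literal phrases, but the shape is the same: many sources in, per-source concept counts out.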
Among the highlights going forward: applying entity extraction and normalization to collecting, analyzing, and finding business trends that emerge across an enterprise’s vast collections of documents, sources and repositories. Another is going beyond extracting and categorizing named entities and sentiment from text documents to apply semantics to other media, such as photos, videos and other unstructured information.
“We want to have a high accuracy on a huge amount of content, almost in real-time,” Simard says. “We [at Nstein] used to work for 10 years with content providers like news agencies, so we were doing business around a huge amount of content like millions of documents. But it was mostly of the same type and with well-formed, respected syntaxes. Now that we are part of OpenText it is about the same story, but for each organization it’s about finding everything in Intranets, emails, and so on. It is a big mass of data and documents.”
2011, then, has been and will be largely about semantics-driven content analytics for OpenText’s Content Server (formerly Livelink), the core of the vendor’s ECM Suite for managing records, emails, and so forth, with a solution expected this summer. “When you have to tag news articles it’s a little bit easier than most anything you can have in an Intranet – you need accuracy to extract good quality information and also meet business needs in different industries,” Simard says. It’s harder to derive good semantic analysis out of short pieces of content like emails, so more work goes into linguistic configuration and training the system to ensure high accuracy.
On the flip side, very large documents, which are also common in enterprise content, pose their own challenge: those 300-page Word docs and multi-slide PowerPoint decks can touch on so many topics that it takes work to tag them under appropriate categories. Part of getting all this right is building up industry-specific vocabularies.
In conjunction with this effort, the company has been redesigning the architecture of its semantic technology to better support different linguistic models. Those models are used to find similar documents, extract entities and key concepts, and extract the facts and relationships between them, as well as to account for industry-specific annotations and industry-specific documents.
Next year the focus will be on transactions. While specifics are yet to be determined, Simard says the team is exploring a few cases for how semantics fits into that. “There is no official initiative but we have some interesting prototypes and are deciding the first target,” he says. “There could be some low-level stuff to make sure business processes are fully automated.” For instance, appropriate business processes can be triggered through the identification of specific topics, according to what the technology finds in the body of a document.
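The idea of triggering a business process from topics found in a document body can be sketched roughly as follows. This is an illustration only, since Simard says there is no official initiative yet. It assumes the topics arrive from some upstream semantic analysis, and the topic names, handler functions, and `route` helper are hypothetical.

```python
from typing import Callable, Dict, List


# Hypothetical process starters; a real deployment would call a workflow engine.
def start_contract_review(doc_id: str) -> str:
    return f"contract-review started for {doc_id}"


def start_complaint_handling(doc_id: str) -> str:
    return f"complaint-handling started for {doc_id}"


# Hypothetical mapping from a detected topic to the process it should trigger.
PROCESS_BY_TOPIC: Dict[str, Callable[[str], str]] = {
    "contract": start_contract_review,
    "complaint": start_complaint_handling,
}


def route(doc_id: str, topics: List[str]) -> List[str]:
    """Start one process per recognized topic; unrecognized topics are ignored."""
    started = []
    for topic in topics:
        handler = PROCESS_BY_TOPIC.get(topic)
        if handler is not None:
            started.append(handler(doc_id))
    return started


if __name__ == "__main__":
    # Topics here are supplied by hand; in practice they would come from text analysis.
    print(route("doc-17", ["contract", "weather"]))
```

Keeping the topic-to-process mapping in a table means new processes can be added without touching the routing logic.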