There’s one thing that Tom Reamy, chief knowledge architect at KAPS Group, says is a continual refrain among enterprise business users: Search sucks. IT regularly attempts to make things better by buying new search engines, and for a while everything’s good – until content grows and things start to go downhill again.

Enterprise search, he explained to an audience at this week’s Enterprise Search & Discovery summit, “is never going to be solved by search engine technology” alone. It needs help from several quarters to improve the experience. Good governance and taxonomies can help, for example, but both are hard to apply. The people who write documents for enterprise repositories can be very creative at avoiding tasks they don’t consider part of their jobs, such as categorizing documents so that others can find them. Even when authors are willing to do it, figuring out what a document is about is a very complex decision.

And however beautiful a taxonomy may be to behold, marrying it to millions of documents is a complex task in its own right. The authors and librarians doing the tagging may have had nothing to do with the taxonomy’s creation, so they can’t be counted on to apply it well.

Less recognized for the role it can play in rescuing enterprise search is text analytics.

“It’s not going to be the whole answer but it is part of the answer of making search work, because it’s adding a whole dimension of meaning to documents and search engines,” he said. “The real heart of text analytics is automatic categorization, where you train the software to tell you what a document or parts of a document are about.” The value of taxonomies becomes more evident alongside text analytics, which closes the gap between a taxonomy and its application to documents, resulting in “consistent, powerful and economic tagging that can be run almost automatically,” he said.

The standard approach is to have a content management system where documents are published, with text analytics stepping in to work out what each document is about and present that to the author, “who has the cognitively simple task to say, ‘yes, go ahead’ or make fixes. Authors will actually do that,…react to a suggested set of keywords, so you can get good results. And if they do decide that something is not right, you get feedback into your categorization rules and taxonomies that can be used to improve the whole thing,” he explained.
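To make that suggest-and-confirm loop concrete, here is a minimal Python sketch. It is not KAPS Group’s software or any particular product’s API: the `TAXONOMY` terms, `suggest_tags` and `review_feedback` are hypothetical names invented for illustration, and simple substring matching stands in for real categorization rules.

```python
# Hypothetical taxonomy: category -> trigger terms (illustrative only).
TAXONOMY = {
    "clinical research": ["clinical trial", "placebo", "dosage"],
    "finance": ["invoice", "budget", "forecast"],
}


def suggest_tags(text, taxonomy=TAXONOMY):
    """Suggest every category that has at least one trigger term in the text."""
    lowered = text.lower()
    return [
        category
        for category, terms in taxonomy.items()
        if any(term in lowered for term in terms)
    ]


def review_feedback(suggested, author_final):
    """Compare the suggestions with the tags the author actually kept.

    Removed tags hint at rules that fire too easily; added tags hint at
    terms or categories the rules are missing.
    """
    removed = [tag for tag in suggested if tag not in author_final]
    added = [tag for tag in author_final if tag not in suggested]
    return {"removed": removed, "added": added}


suggested = suggest_tags("The clinical trial budget was approved.")
# The author accepts one suggestion and rejects the other.
feedback = review_feedback(suggested, ["clinical research"])
```

In this sketch the author only confirms or corrects a short list, and the `removed` and `added` lists are the kind of signal that could be fed back into the rules and the taxonomy.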

Providing the “aboutness” of a complete document, he believes, is just the beginning of where text analytics and automatic categorization can take things. “It can take advantage of the fact that documents are not really unstructured, but that they have lots of structures,” he said – pages, sections, even down to the sentence level for more in-depth categorization than just looking at the document as a whole. As an example of the sections idea, he mentioned that KAPS has built rules for a pharmaceuticals company looking to improve its enterprise search. “It’s saying that if you find words we’ve decided to find on something that’s an abstract or methods section of a document, even if the word abstract doesn’t appear, use these dynamic rules to say this is the section of the document called something like Abstract,” he said.

“Define the section dynamically with the rule and if you find phrases like clinical trials and humans but not words like animals, then count it and add it to the relevancy score. That’s a way to more powerful and precise relevancy. For the pilot we are doing this using these section headings and getting very close to 100 percent accuracy in our categorization,” he noted.
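The rule Reamy describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the rules KAPS built for its client: the cue phrases in `SECTION_CUES`, the paragraph-level splitting, the substring matching and the flat `boost` value are all hypothetical choices made for the example.

```python
import re

# Hypothetical cue phrases suggesting a passage is an abstract or a methods
# section even when no such heading is present (illustrative only).
SECTION_CUES = {
    "abstract": ["we investigated", "this study", "results suggest"],
    "methods": ["participants were", "randomized", "was administered"],
}
REQUIRED = ["clinical trial", "humans"]  # all must appear in the section
EXCLUDED = ["animals"]                   # none may appear in the section


def split_paragraphs(text):
    """Split on blank lines; each paragraph is treated as a candidate section."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def detect_section(paragraph):
    """Return a section name if the paragraph starts with it or contains a cue."""
    lowered = paragraph.lower()
    first_line = lowered.splitlines()[0]
    for name, cues in SECTION_CUES.items():
        if first_line.startswith(name) or any(cue in lowered for cue in cues):
            return name
    return None


def relevancy_score(text, boost=1.0):
    """Add `boost` for each detected section that passes the term rule."""
    score = 0.0
    for paragraph in split_paragraphs(text):
        if detect_section(paragraph) is None:
            continue  # mentions outside the target sections do not count
        lowered = paragraph.lower()
        has_required = all(term in lowered for term in REQUIRED)
        has_excluded = any(term in lowered for term in EXCLUDED)
        if has_required and not has_excluded:
            score += boost
    return score


doc = (
    "This study reports a clinical trial in humans over two years.\n\n"
    "Participants were randomized; animals were used in a pilot clinical "
    "trial before humans.\n\n"
    "The clinical trial in humans is mentioned in the appendix."
)
score = relevancy_score(doc)
```

Here only the first paragraph contributes: the second is recognized as a methods passage but mentions animals, and the third matches the terms but is not recognized as an abstract or methods section.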

He also explained that enterprises looking to improve search need to consider their whole semantic infrastructure: the information lifecycle, the organizational environment, content types and structures, and even business processes and people.

Reamy will also be speaking at the upcoming Semantic Technology & Business Conference in August, on the topic of turning big text into big data.