Posts Tagged ‘Entity Extraction’

Alchemy Aims to Add More API Wizardry

Orchestr8’s NLP- and machine learning-based AlchemyAPI service for analyzing content and extracting semantic metadata has added some new capabilities.

One new feature is dubbed Relation Extraction, which project engineer Shaun Roach tells the Semantic Web Blog “detects sentences describing actions, events, and facts, and then codes them into a machine-readable format. It is a key feature for developers who want to go a step beyond tagging, to understand specifically how all the people, places, and things mentioned in the document are interacting.”

So, it processes natural language, and converts documents and web pages into actionable, semantically enriched “Subject-Action-Object” data, as the company blog describes it.
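To make the Subject-Action-Object idea concrete, here is a minimal sketch of how one such fact could be held and serialized. It is not AlchemyAPI's actual output format: the `Relation` class, its field names, and the example sentence are all assumptions made for illustration, and no real extraction takes place.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class Relation:
    """One extracted fact in Subject-Action-Object form (hypothetical schema)."""
    subject: str
    action: str
    object: str


def to_machine_readable(relations):
    """Serialize relations as JSON so downstream code can query them."""
    return json.dumps([asdict(r) for r in relations], sort_keys=True)


# Hand-written example of what an extractor might emit for the sentence
# "Alice visited Paris." The extraction step itself is not implemented here.
relations = [Relation(subject="Alice", action="visited", object="Paris")]
encoded = to_machine_readable(relations)
print(encoded)
```

Once facts are in this shape, a developer can filter by subject or action instead of searching raw text, which is the step beyond tagging that the quote describes.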


What’s Next For OpenText As It Continues Integration of Nstein’s Technologies?

Since OpenText acquired Nstein a little over a year ago, work has been underway to build Nstein’s semantic technology for text mining, analytics, and search into OpenText’s enterprise content management platform. So far, that’s resulted in adding Semantic Navigation, the on-premises or cloud web site search and content discovery solution, to OpenText’s Web content management (WCM) products, such as OpenText Web Experience Management and Web Site Management.

This covers aspects such as content tagging and semantic faceting at the content and document levels. This year and the following should see further integration of Nstein technologies into the OpenText solutions set, as well as some new offerings emerging to support other use cases.

As an example, the company is working on a listening platform application. It draws on work Nstein did for the Canadian government’s public health agency, which used Nstein’s Text Mining Engine to identify potential threats to human health by scouring multiple sources, including news aggregators like Factiva, and parsing them for roughly 1,000 concepts such as “mysterious ailments” and “outbreak.” OpenText is building up a framework for ingesting different data sources to support this, says Charles-Olivier Simard, product manager for semantic technologies at OpenText.


Patent, Patent, Digital Reasoning’s Got a Text Discovery Patent

Are you starting to hear more about patents that relate to the Semantic Web space? There was an interesting discussion by Erik Sherman of how Facebook’s patent for automatic search curation feeds its semantic search ambitions, for instance.

Generally speaking, in fact, patents are big in the news, with the passage last week by the Senate of the Patent Reform Bill, which has among its goals getting patents issued sooner — but which also is spurring concern, especially in the tech industry, about its impact on patent infringement actions.

Against this backdrop, and perhaps flying a bit more under the radar, was a U.S. patent (No. 7,882,055) granted to Digital Reasoning for its distributed system of intelligent software agents for discovering the meaning in text. Company CEO Tim Estes calls what the vendor has applied to its Synthesys technology a “bottom-up” patent.

Specifically, it covers the mechanism of measurement and the applications of algorithms to develop machine-understandable structures from patterns of symbol usage, the company says, as well as the semantic alignment of those learned structures from unstructured data with pre-existing structured data, a necessary step in creating enterprise-class entity-oriented systems.

So, in plain(er) English, it’s about using algorithms to bootstrap the creation of semantic models from large-scale unstructured data with minimal a priori information – in other words, to let the data speak for itself. It aims at being a fast route to entity-oriented analytics for harvesting critical facts and relationships across a spread of information in documents.


Time for Semantic ETL?

What’s the link between the trends of more and more objects and even commercial transactions on the web being described in a machine-readable, semantic format and the endless streaming of all that data? Revenue-funded startup First Retail, whose principals Anne Jude Hunt and Simon G. Handley will be speaking at the upcoming Semantic Technology Conference in June, thinks the answer is semantic ETL.

Extract, transform, load (ETL) is a widely known concept in the well-charted terrain of the IT world. That’s about transforming a bunch of heterogeneous data to unify it within a data warehouse and get some use out of it.

Semantic ETL, says Hunt, is brought on by the fact that today people want to deal with the growing loads of streaming data while it’s streaming and that “people want intelligent data, machine-readable tags, [they want] to slice and dice it for BI in lots of different ways, so the traditional data warehouse and relational database approach is just not working for people.” Cleansed and integrated semantic data loaded into distributed, scalable triple stores can come to the rescue.
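As a rough illustration of the extract, transform, load pattern applied to semantic data, the sketch below normalizes records from two differently labeled feeds into subject-predicate-object triples and loads them into a tiny in-memory store. It is not First Retail's approach: the `http://example.org/` namespace, the field names, the sample records, and the `TripleStore` class are all assumptions, and a production system would use a real distributed triple store.

```python
from collections import defaultdict

# Hypothetical namespace used only for this illustration.
EX = "http://example.org/"


def extract(sources):
    """Yield raw records from heterogeneous sources (here: in-memory lists)."""
    for source in sources:
        yield from source


def transform(record):
    """Turn one record into (subject, predicate, object) triples.

    Field names are normalized so that differently labeled inputs
    end up under the same predicates.
    """
    aliases = {"sku": "id", "product_id": "id", "cost": "price"}
    clean = {aliases.get(key, key): value for key, value in record.items()}
    subject = f"{EX}product/{clean['id']}"
    return [
        (subject, f"{EX}{key}", str(value))
        for key, value in sorted(clean.items())
        if key != "id"
    ]


class TripleStore:
    """Tiny in-memory stand-in for a scalable triple store."""

    def __init__(self):
        self._by_subject = defaultdict(set)

    def load(self, triples):
        # A set per subject drops duplicate facts arriving from several feeds.
        for s, p, o in triples:
            self._by_subject[s].add((p, o))

    def about(self, subject):
        """Return all (predicate, object) pairs known for a subject, sorted."""
        return sorted(self._by_subject[subject])


# Two made-up feeds describing the same product with different field names.
feed_a = [{"sku": "42", "name": "kettle", "price": 19.5}]
feed_b = [{"product_id": "42", "cost": 19.5, "color": "red"}]

store = TripleStore()
for record in extract([feed_a, feed_b]):
    store.load(transform(record))

facts = store.about(EX + "product/42")
for predicate, obj in facts:
    print(predicate, obj)
```

Because each record is transformed and loaded as it arrives, the same loop would work over a stream as well as over the static lists used here.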


The Spotlight’s on DBpedia

The spotlight’s on DBpedia. Literally. A new open source tool that goes by the name of DBpedia Spotlight annotates mentions of DBpedia resources in text to link unstructured information sources to the Linked Open Data cloud. The idea behind it was to ‘go generic’ so that users could download, adapt and integrate it with their own stacks to meet their specific needs.

That idea started to play out in the community just a day or two after the tool’s release, in fact. The EuropeanaConnect Media Annotation Prototype is using DBpedia Spotlight to support semantic tagging and annotation of image, audio and video content, a use that one of DBpedia Spotlight’s creators, Pablo N. Mendes, hadn’t foreseen.
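For readers who want a feel for what annotating mentions of DBpedia resources means, here is a deliberately simplified sketch. It does not call DBpedia Spotlight and does not reproduce its algorithm: the `SURFACE_FORMS` table and the `annotate` function are assumptions for illustration, and a real annotator would also disambiguate between candidate resources.

```python
import re

# Toy lookup table from surface forms to DBpedia resource URIs.
# The entries are hand-picked for the example sentence below.
SURFACE_FORMS = {
    "Berlin": "http://dbpedia.org/resource/Berlin",
    "Semantic Web": "http://dbpedia.org/resource/Semantic_Web",
}


def annotate(text, surface_forms=SURFACE_FORMS):
    """Return (offset, surface form, URI) for every known mention in text."""
    # Try longer surface forms first so multi-word names win over substrings.
    ordered = sorted(surface_forms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, ordered)) + r")\b")
    return [
        (match.start(), match.group(1), surface_forms[match.group(1)])
        for match in pattern.finditer(text)
    ]


annotations = annotate("A Semantic Web meetup took place in Berlin.")
for offset, mention, uri in annotations:
    print(offset, mention, uri)
```

Each result ties a span of plain text to a resource URI, which is the link between unstructured sources and the Linked Open Data cloud that the tool is meant to provide.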


High Precision Entity Extraction: A U.S. State Department Case Study – SemTech 2009 Audio

Joseph C. Wicentowski, U.S. Department of State
Dan McCreary, Dan McCreary and Associates

The U.S. State Department’s Office of the Historian has embarked on an ambitious effort to migrate its diplomatic history document archive from paper to enriched electronic media for online consumption. Because of Congressional mandates, we hold extremely high standards for semantic precision and accuracy, which makes this unique resource useful to a broad audience that includes scholars, government officials, and the general public. Furthermore, the new format allows us to repurpose our content and integrate it with "mashup" applications such as timelines and geographical map views.

This case study reviews the U.S. State Department’s requirements and the decision process that led us to adopt high-precision semantic markup standards that are supported by our tools as well as by our vendors. We will review our requirements and decision-making, and will show concrete examples of how the precise identifiers for people, locations, and events allow us to enrich the display of our documents online.

We will also review the full document lifecycle and the need for automated but high quality entity extraction tools to minimize document conversion costs. This case study will discuss some of the tradeoffs others may face when advanced technology decisions have both risks and rewards for the digital historian.

In this presentation we will:

  • Review business requirements for a high precision entity extraction application
  • Describe our semantic approach
  • Demonstrate entity extraction
  • Demonstrate timeline and other mashups
  • Summarize project benefits

Attachment: High Precision Entity Extraction – A US State Department Case Study.mp3 (54.54 MB)


Joe Wicentowski

After completing a Fulbright grant in Asia for his doctoral research and receiving his Ph.D. in History from Harvard University, Joseph C. Wicentowski joined the U.S. Department of State’s Office of the Historian. He has taken a leadership role in digital history management as a digital historian, developing new digital formats for the Department’s archive of U.S. diplomatic and foreign affairs documents, which reach back to the founding of the historian’s office in 1861. He has led development of a new website for these documents, based on a native XML database, and is working to bring the benefits of data visualization, metadata management, and other digital history applications to the federal government and the public. He has particular interests in XML, XQuery, and U.S. and Chinese history.

Dan McCreary

Dan is an enterprise data architect/strategist living in Minneapolis. He has worked for organizations such as Bell Labs and Steve Jobs’s NeXT Computer, and founded his own consulting firm of over 75 people. He has a background in object-oriented programming and declarative XML languages (XSLT, XML Schema design, XForms, XQuery, RDF, and OWL). He has published articles on various technology topics including the Semantic Web, metadata registries, enterprise integration strategies, XForms, and XQuery. He is author of the XForms Tutorial and Cookbook.

Entity Extraction and the Semantic Web

Entity Extraction is the process of automatically extracting document metadata from unstructured text documents. Extracting key entities such as person names, locations, dates, specialized terms and product terminology from free-form text can empower organizations not only to improve keyword search but also to open the door to semantic search, faceted search and document repurposing. This article defines the field of entity extraction, shows some of the technical challenges involved, and shows how RDF can be used to store document annotations. It then shows how new tools such as Apache UIMA are poised to make entity extraction much more cost effective for an organization.
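As a hedged sketch of what storing document annotations in RDF can look like, the snippet below writes already-extracted entities out as N-Triples using only the standard library. It is not taken from the article: the `http://example.org/` URIs, the `mentions`, `entityType`, and `text` predicates, and the sample entities are assumptions, and the extraction step itself is left out.

```python
# Hypothetical vocabulary namespace used only for this illustration.
EX = "http://example.org/vocab#"


def ntriples_literal(value):
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def annotations_to_ntriples(doc_uri, entities):
    """Serialize (entity type, text) pairs as N-Triples lines about a document."""
    lines = []
    for index, (entity_type, text) in enumerate(entities):
        # Each annotation gets its own node so type and text stay together.
        node = f"<{doc_uri}#entity{index}>"
        lines.append(f"<{doc_uri}> <{EX}mentions> {node} .")
        lines.append(f"{node} <{EX}entityType> {ntriples_literal(entity_type)} .")
        lines.append(f"{node} <{EX}text> {ntriples_literal(text)} .")
    return "\n".join(lines)


# Entities as an extractor might report them; written by hand for this example.
entities = [("Person", "Dan McCreary"), ("Location", "Minneapolis")]
ntriples = annotations_to_ntriples("http://example.org/doc/1", entities)
print(ntriples)
```

Keeping annotations as triples outside the source text means the same document can carry several layers of metadata, which is what makes faceted search and repurposing practical.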
