Executive Summary

Chemistry is a central science and the data produced as a consequence is immense. However, much of this data is unstructured, which makes data integration difficult. In this article, we demonstrate how chemical data can be retrieved from reports, scientific theses, papers and patents, and discuss how these sources can be processed using natural language processing techniques and named-entity recognisers to produce chemical data and knowledge expressed in RDF.

1. Introduction

Chemistry is at the heart of several high-value industries, such as the pharmaceutical and biomedical industries and materials manufacturing and design. Future progress in science will crucially depend on the ability to mash up chemical data with data derived from other domains, such as biochemistry, genomics, immunology and materials science, and the semantic web has significant offerings to make in moving towards the goal of inter-disciplinary data mashups. The twin pillars of the URI and the web of linked data can be expected to have a profound impact on the way in which science will be carried out in the 21st century. For chemistry, this will mean, for example, that once a resource such as a molecule or a chemical substance has been defined, it can be linked with great ease to data about its properties and physico-chemical characteristics, and also to knowledge that disciplines outside of chemistry hold about this compound. It will then be possible to analyse for further information such as the co-occurrence of this compound with other compounds, research activity involving this and related compounds, citations and so on.

Chemists produce and report vast amounts of data every day. The Chemical Abstracts Service (CAS) indexes over 10,000 new substances every day [1]. This data is mainly derived from typical scientific publications. On top of this, there is an even larger amount of data contained in (electronic) laboratory notebooks, reports and patents. The widespread adoption of high-throughput experimentation is further swelling the data deluge. The one characteristic common to data from all of these sources is that it is usually contained in documents in a completely unstructured form, which makes it exceptionally hard to search, retrieve and mash up.

Given the importance of mashups and the unstructured nature in which chemical information is normally produced and recorded, one important task is to develop technologies that allow the extraction and structuring of chemical data and thus the "semantification" of chemistry. Roughly speaking, a "semantification workflow" could look like this: (a) identification of sources containing chemical information (typically scientific papers, scientific theses, blogs, Wikipedia entries and other web sources); (b) identification and extraction of chemical entities and other chemically relevant information; and (c) markup of the extracted entities and data in XML or RDF. The identification and extraction process is the crucial step here, and natural language processing (NLP) technologies and part-of-speech (POS) taggers are the tools of choice. Although there is considerable interest in this type of workflow in other domains such as the biosciences and medicine, and a number of tools have been developed by both commercial (e.g. Temis, Linguamatics) and academic groups (e.g. GENIA, PennBioIE), chemistry lags sadly behind, although a number of approaches to the extraction of chemical entities from the literature have been reported over the last several years [2-5].
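To make steps (a)-(c) concrete, the following is a minimal Python sketch of such a pipeline. The function names and the toy lexicon are assumptions made purely for illustration; the actual tools we use for steps (b) and (c) are introduced in the next section.

# Minimal sketch of the semantification workflow (a)-(c).
# The function names and toy lexicon are illustrative assumptions only.

def identify_sources():
    """(a) Collect candidate documents (papers, theses, blogs, web pages)."""
    return ["nBuLi (1.6M solution in Et2O, 18.75 ml, 30 mmol) was added ..."]

def extract_entities(text):
    """(b) Recognise chemical entities; a real pipeline would call a
    chemical named-entity recogniser such as OSCAR here."""
    known = {"nBuLi", "Et2O"}                      # toy lexicon
    return [tok.strip(",()") for tok in text.split() if tok.strip(",()") in known]

def to_rdf(doc_uri, entities):
    """(c) Emit simple subject-predicate-object triples for the entities."""
    return [(doc_uri, "mentionsCompound", e) for e in entities]

for i, doc in enumerate(identify_sources()):
    print(to_rdf("http://www.foo.bar/doc-%d" % i, extract_entities(doc)))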

2. Extraction of Chemical Information from Unstructured Text

The prime open tool for the extraction of chemical entities is the OSCAR 3 system [6], and we will show how entity extraction can be accomplished using OSCAR 3 together with a part-of-speech tagger provided by the Natural Language Toolkit (NLTK) [7]. OSCAR (Open Source Chemistry Analysis Routines) is an open source application and part of the SciBorg project [8] for the deep parsing and analysis of scientific texts, but it can also be used standalone or integrated with other NLP systems. NLTK is a suite of open source modules, data sets and tutorials supporting research and development in natural language processing, using both symbolic and statistical techniques. NLTK will be used here to determine the parts of speech and then extract the key phrases.

To demonstrate how we apply these tools, we will walk through a simple chemical synthesis procedure taken from a typical PhD thesis [9] in organic chemistry:

"nBuLi ( 1.6M solution in Et2O , 18.75 ml , 30 mmol ) was added to a stirred solution of alkyne 155 ( 5.0 g , 27 mmol ) in Et2O ( 20 ml ) at -78°C and the mixture stirred for 1 h . Freshly cracked paraformaldehyde ( mp =163-165°C ) was bubbled through the reaction mixture , which was under a constant argon flow . After 20 – 30 min , the mixture was diluted with Et2O ( 200 ml ) and poured onto saturated NaCl solution ( 150 ml ) , the phases were separated , and the aqueous layer extracted with Et2O ( 2 x 50 ml ) . The combined organic phases were dried ( MgSO4 ) , filtered and concentrated in vacuo . Purification by flash column chromatography ( eluent PE : Et2O 4:1 to 1:1 , gradient ) yielded alcohol 156 ( 4.67 g , 21 mmol , 81 % ) as an oil."

Figure 1: An example of a chemical synthesis procedure from a PhD thesis.

So far the only structure we have in this text is the title of the preparation and its content. We use OSCAR to identify the chemical names in this text:

"nBuLi ( 1.6M solution in Et2O , 18.75 ml , 30 mmol ) was added to a stirred solution of alkyne 155 ( 5.0 g , 27 mmol ) in Et2O ( 20 ml ) at -78°C and the mixture stirred for 1 h . Freshly cracked paraformaldehyde ( mp =163-165°C ) was bubbled through the reaction mixture , which was under a constant argon flow . After 20 – 30 min , the mixture was diluted with Et2O ( 200 ml ) and poured onto saturated NaCl solution ( 150 ml ) , the phases were separated , and the aqueous layer extracted with Et2O ( 2 x 50 ml ) . The combined organic phases were dried ( MgSO4 ) , filtered and concentrated in vacuo . Purification by flash column chromatography ( eluent PE : Et2O 4:1 to 1:1 , gradient ) yielded alcohol 156 ( 4.67 g , 21 mmol , 81 % ) as an oil."

Figure 2: An example of a chemical synthesis procedure from a PhD thesis after markup using the OSCAR 3 system.

OSCAR marks up the chemical entities contained in this paragraph using a mixture of SciXML and a technology developed by our group here in Cambridge. The first part of Figure 3 (A) presents the title of another paper and the first sentence of its abstract in natural language. Part (B) shows the same sentence after markup by OSCAR 3. In this example, chemical entities such as “oleic acid” or “magnetite” are marked up as chemical moieties (type="CM"), and additional information, such as in-line representations of chemical structure (SMILES and InChI) as well as ontology terms, can be added.

Figure 3: Markup of Chemical Entities via the OSCAR 3 system.
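To give a flavour of how such marked-up output can be consumed programmatically, the short Python sketch below pulls chemical moieties out of an OSCAR-style annotated snippet using ElementTree. The element name (ne) and the attribute names (type, SMILES) follow the description above, but the snippet is hand-written for illustration and may not match OSCAR 3's exact output format.

import xml.etree.ElementTree as ET

# Illustrative OSCAR-style annotation; element and attribute names are
# assumed from the description above and may differ from real OSCAR 3 output.
snippet = """<P>Magnetite nanoparticles were coated with
<ne type="CM" SMILES="CCCCCCCCC=CCCCCCCCC(=O)O">oleic acid</ne>
and dispersed in <ne type="CM">heptane</ne>.</P>"""

root = ET.fromstring(snippet)
for ne in root.iter("ne"):
    if ne.get("type") == "CM":                    # chemical moiety
        print(ne.text, "|", ne.get("SMILES"))     # name and structure, if given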


Once the chemical entities have been marked up in this way, we can use the Natural Language Toolkit to determine the syntactic structure of the text. By doing so, we are able to determine quantities and several types of experimental conditions. Crucially, it is also possible to detect actions such as addition, dissolution, extraction etc. in the text (Figure 4).

Figure 4: 'Action' Phrases marked up using NLTK.
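The following Python sketch shows one way to do this with NLTK: tag the tokens with their parts of speech and then apply a simple regular-expression chunk grammar to pick out candidate 'action' phrases. The chunk grammar here is a deliberately crude illustration and is not the grammar used to produce Figure 4.

import nltk

# The tokenizer and tagger models are downloaded once, e.g.
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

sentence = ("The combined organic phases were dried ( MgSO4 ) , "
            "filtered and concentrated in vacuo .")

tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)             # list of (token, POS tag) pairs

# Crude chunk grammar: a verb followed by prepositions, determiners,
# adjectives or nouns is treated as an "action" phrase.
grammar = r"ACTION: {<VB.*><IN|DT|JJ|NN.*>*}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

for subtree in tree.subtrees(filter=lambda t: t.label() == "ACTION"):
    print(" ".join(word for word, tag in subtree.leaves()))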


At this stage, the information has been marked up and the resulting parse tree is stored in XML (Figure 5).
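As an illustration of this serialisation step, the sketch below converts an NLTK parse tree into a small XML document with ElementTree. The element names (token and the lower-cased chunk labels) are chosen for the example and are not the schema behind Figure 5.

import xml.etree.ElementTree as ET
from nltk.tree import Tree

def tree_to_xml(tree):
    """Recursively map an nltk.Tree onto XML elements; leaves become
    <token> elements carrying their POS tag as an attribute."""
    elem = ET.Element(tree.label().lower())
    for child in tree:
        if isinstance(child, Tree):
            elem.append(tree_to_xml(child))
        else:
            word, tag = child
            tok = ET.SubElement(elem, "token", pos=tag)
            tok.text = word
    return elem

# A tiny hand-built chunk tree standing in for the chunker output above.
doc = Tree("S", [Tree("ACTION", [("dried", "VBN"), ("in", "IN"), ("vacuo", "NN")])])
print(ET.tostring(tree_to_xml(doc), encoding="unicode"))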

The parse tree is then converted to RDF. For simplicity, imagine that the resource www.foo.bar/preparation-1 is a unique URI identifying this preparation.
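The following rdflib sketch illustrates what such a conversion might produce. The vocabulary (a made-up chem: namespace with properties such as hasReactant, hasSolvent, hasProduct and hasYield) is invented for this example rather than prescribed by the tools described above.

from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import XSD

CHEM = Namespace("http://www.foo.bar/chem#")        # hypothetical vocabulary
prep = URIRef("http://www.foo.bar/preparation-1")

g = Graph()
g.bind("chem", CHEM)

g.add((prep, RDF.type, CHEM.Preparation))
g.add((prep, CHEM.hasReagent,  CHEM.nBuLi))
g.add((prep, CHEM.hasReactant, CHEM.compound_155))  # alkyne 155
g.add((prep, CHEM.hasSolvent,  CHEM.Et2O))
g.add((prep, CHEM.hasProduct,  CHEM.compound_156))  # alcohol 156
g.add((prep, CHEM.hasYield,    Literal(81, datatype=XSD.integer)))

print(g.serialize(format="turtle"))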

What results is in effect an RDF-graph based representation of the above paragraph. Not only does the graph contain all the compounds involved in the preparation (which themselves could have other information, such as properties and supplier data, associated with them), but also their roles (which can be deduced from the actions) and further information such as yields. Assignment of roles allows the classification of compounds into reactants and products, solvents, reagents, catalysts etc. Once chemical information is stored in this way, it becomes feasible to search for experiments by parameter: an example query would be to search for all experiments that have yields over 80% and that use substances which are solvents and have boiling points below 50 °C.
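Staying with the made-up vocabulary from the previous sketch, such a query can be expressed in SPARQL and run over the graph with rdflib; the hasBoilingPoint property and the solvent data are again assumptions added purely for illustration.

# Assumes the graph g and the CHEM namespace from the previous sketch.
from rdflib import Literal
from rdflib.namespace import XSD

g.add((CHEM.Et2O, CHEM.hasBoilingPoint, Literal(34.6, datatype=XSD.decimal)))

query = """
PREFIX chem: <http://www.foo.bar/chem#>
SELECT ?prep ?solvent WHERE {
    ?prep    chem:hasYield        ?yield ;
             chem:hasSolvent      ?solvent .
    ?solvent chem:hasBoilingPoint ?bp .
    FILTER (?yield > 80 && ?bp < 50)
}
"""

for prep, solvent in g.query(query):
    print(prep, solvent)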

When this approach is applied to a whole document rather than a single paragraph, it becomes possible to draw "chemical topic maps" from the literature. In documents detailing chemical experiments, it is common to assign numbers to each occurring chemical entity. By tracking the compounds involved in a synthesis procedure and the reactions they participate in, it is possible to generate "reactant yields product" graphs (Figure 7).

In the example in Figure 7, the compound with ID 155 is transformed into compound 156 during a chemical reaction. If we now plot this for all the transformations identified in a document, we arrive at a "topic map" of chemical transformations (Figure 8).

Figure 8: Topic Map of a Thesis.


This snapshot provides a concise overview of the entire document. The colours of the nodes represent the colours of the products (identified by NLP) and the shapes of the nodes represent the physical states of the compounds (oil = circle, solid = square, crystal = diamond). This information can be parsed directly out of the text or inferred from information which has been extracted. However, the shape of the graph and the connectivity between compounds also hold information: were chemical syntheses performed in parallel, or did isomers (i.e. compounds which have the same chemical composition but a different physical arrangement of atoms) form during the reaction? This and other information can potentially be gleaned by inspecting the shape of the graph.
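The sketch below shows how such a transformation graph might be assembled and interrogated with networkx once "reactant yields product" pairs have been extracted. Only the 155 to 156 edge comes from the worked example; the remaining compound numbers, states and attribute names are hypothetical.

import networkx as nx

# Directed "reactant yields product" graph; 155 -> 156 is from the worked
# example, the other compound numbers are hypothetical.
G = nx.DiGraph()
G.add_edge(155, 156)
G.add_edge(156, 157)
G.add_edge(156, 158)                      # a branch point: two routes from 156

# Node attributes mirroring the visual encoding described above.
nx.set_node_attributes(
    G, {155: "oil", 156: "oil", 157: "solid", 158: "crystal"}, name="state")

# Final products have no outgoing edges; branch points hint at parallel
# syntheses or isomer formation.
print("final products:", [n for n in G.nodes if G.out_degree(n) == 0])
print("branch points: ", [n for n in G.nodes if G.out_degree(n) > 1])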

While the above discussion has illustrated some technical solutions to the problem of semantification of unstructured information within the domain of chemistry, the real challenge remains one of access. Sources of chemical information and documents are typically proprietary and closed access, and content providers such as science, technology and medicine publishers are taking active steps to prevent the use of the technologies described here for the extraction of scientific data from their sources. The situation is aggravated by the fact that chemistry, unlike other sciences, has not evolved a culture or tradition of data sharing, which means that both the technological infrastructure and the "mindshare" are currently absent.

3. Summary and Conclusions

Chemistry is a central science which is at the heart of many high-value industries. While the volume of data produced by chemists is high, chemical information is often unstructured and thus hard to search, retrieve and mash up. However, structuring and thus semantification are possible using a combination of natural language processing and part-of-speech tagging. The results from NLP and POS processing can be translated to RDF, which, in turn, can be used both for the visualisation of topics and relationships in documents and for the quick search, retrieval and mashup of chemical information. The oftentimes proprietary and closed nature of chemical information sources presents a serious obstacle to the semantification of chemistry.
 

4. References

[1]    Massie RJ (2007) At the cusp of a new century, Chem. Eng. News 85:56-57
[2]    Chowdhury GG and Lynch MF (1992) Automatic interpretation of the texts of chemical patent abstracts. 1. Lexical analysis and categorization. J. Chem. Inf. Comp. Sci. 32:463-467
[3]    Chowdhury GG and Lynch MF (1992) Automatic interpretation of the texts of chemical patent abstracts. 2. Processing and results. J. Chem. Inf. Comp. Sci. 32:468-473
[4]    Vasserman A (2004) Identifying chemical names in biomedical text: an investigation of the sub-string co-occurrence based approaches. HLT-NAACL
[5]    Wilbur JW, Hazard GF, Divita G, et al. (1999) Analysis of biomedical text for chemical names: a comparison of three methods. AMIA Symposium:176-180
[6]    Corbett P and Murray-Rust P (2006) High-throughput identification of chemistry in life science texts. Computational Life Sciences II, Lecture Notes in Computer Science 4216:107-118
[7]    Natural Language Toolkit http://www.nltk.org Accessed Jan 11, 2009
[8]    Copestake A, Corbett P, Murray-Rust P, et al. (2006) An architecture for language processing for scientific texts. UK eScience All Hands Meeting 2006
[9]    Harter J (2002) π-Allyltricarbonyliron Lactone Complexes: Versatile Tools for Asymmetric Synthesis, PhD thesis, University of Cambridge.