Executive Summary

Chemistry is an important and high-value vertical in the modern world, and the "semantification" of chemistry will be crucial for further rapid innovation not only in the discipline itself, but also in related areas such as drug discovery, medicine and materials design. This article provides a short overview of the current technological state of the art in semantic chemistry and also discusses some of the obstacles which have, so far, impeded the widespread uptake of semantic technologies in the domain.

1. Introduction

Chemistry is arguably the most central of the physical sciences and at the heart of many fundamental industries: developments in chemical science very directly affect sectors such as the pharmaceutical and medical industry, the producers and processors of modern materials such as polymers and, of course, the chemical industry itself.

In modern science, it is important to realise that most of the truly exciting scientific and technological progress now happens at the interfaces between two or more scientific and technical disciplines. As such, the development of new knowledge or products can, from an informatician’s point of view, be considered an exercise in the integration of data from different scientific domains. Chemistry overlaps with almost all domains of modern science, from pharmacology, biochemistry and toxicology to genetics and materials science.

It is therefore of prime importance to develop a comprehensive semantic apparatus for the discipline, which can contribute to the “data integration” process. This short article is subdivided into three parts. The first part discusses the state of the art in semantic chemistry at the time of writing (early 2009), the second part looks at current efforts in “semantification” within the domain of chemistry and the third part discusses some of the technical and “socio-political” obstacles semantic chemistry is facing today.

2. The Current “Semantic Chemistry Technology Stack”

The general semantic web toolkit in common use today consists of three major components: XML dialects, RDF(S) vocabularies and OWL ontologies (Figure 1).

Figure 1: The semantic layer-cake.

Let us look at each of these components in turn and how they have been applied to the field of chemistry.

2.1 Markup Languages for Chemistry

The foremost markup language for chemistry is Chemical Markup Language (CML), developed over the last decade by Murray-Rust, Rzepa and others.[1-7]
CML is designed to hold a large variety of chemical information, such as molecular structures (the spatial location of and the connectivity between the atoms that make up a molecule), materials structures (in particular polymers), spectroscopic and other analytical data, and crystallographic and computational information. An example of the CML-based representation of a molecular structure is shown in Figure 2.
 

Figure 2: CML document describing the 2-dimensional molecular structure of the styrene molecule.

The CML document describes an entity of type <molecule>. The <molecule> is a data container for two further data containers called <atomArray> and <bondArray>. The <atomArray> element contains a list of all the atoms present in the molecule, together with IDs, element types and, in this case, 2D coordinates specifying the spatial arrangement of atoms in the molecule. The <bondArray> element by analogy contains a list of bonds, each with a bond ID, a specification of which atoms are connected by the bond and the bond order (single, double, triple or any other type of bond). Furthermore, CML can hold many other types of annotation on atoms, bonds and associated chemical data. CML was recently extended to deal with fuzzy materials such as polymers. This extension also brought free variables into an otherwise purely declarative language, by injecting XSLT into CML specifications and evaluating expressions lazily.[7]
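
The container structure described above can be illustrated with a short sketch. The document below is not the styrene example of Figure 2 but a much smaller, hand-written molecule (ethene, heavy atoms only) with made-up coordinates. The attribute names (elementType, x2, y2, atomRefs2, order) and the namespace follow common CML usage as far as we are aware, but should be checked against the CML schema before being relied upon. The function parse_molecule is a hypothetical helper, not part of any CML toolkit.

```python
import xml.etree.ElementTree as ET

# Namespace commonly used for CML documents (assumption: verify against the schema).
CML_NS = "http://www.xml-cml.org/schema"

# Hand-written illustrative document: ethene, heavy atoms only, invented coordinates.
CML_DOC = """<molecule xmlns="http://www.xml-cml.org/schema" id="ethene">
  <atomArray>
    <atom id="a1" elementType="C" x2="0.0" y2="0.0"/>
    <atom id="a2" elementType="C" x2="1.34" y2="0.0"/>
  </atomArray>
  <bondArray>
    <bond id="b1" atomRefs2="a1 a2" order="2"/>
  </bondArray>
</molecule>"""


def parse_molecule(cml_text):
    """Return (atoms, bonds) read from a <molecule> element.

    atoms maps atom ID -> element type and 2D coordinates;
    bonds is a list of (bond ID, first atom ID, second atom ID, order).
    """
    root = ET.fromstring(cml_text)
    ns = {"cml": CML_NS}

    atoms = {}
    for atom in root.findall("cml:atomArray/cml:atom", ns):
        atoms[atom.get("id")] = {
            "element": atom.get("elementType"),
            "xy": (float(atom.get("x2")), float(atom.get("y2"))),
        }

    bonds = []
    for bond in root.findall("cml:bondArray/cml:bond", ns):
        first, second = bond.get("atomRefs2").split()
        bonds.append((bond.get("id"), first, second, bond.get("order")))

    return atoms, bonds


atoms, bonds = parse_molecule(CML_DOC)
for bond_id, first, second, order in bonds:
    print(bond_id, atoms[first]["element"], atoms[second]["element"], "order", order)
```

Because the connectivity is expressed through atom IDs rather than through position in the file, the bond list can be resolved against the atom list in any order, which is what makes the format convenient for downstream tools.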

Scientific and chemical information in free unstructured text such as scientific papers, theses and reports can be marked up in an analogous manner. Figure 3 shows the markup of a sentence contained in a scientific paper using a mixture of SciXML and a technology developed by our group here in Cambridge.

Figure 3: An abstract (ref[39]) (A) prior to markup, (B) after markup with OSCAR 3.

The first part of the figure (A) presents the title of the paper and the first sentence of the abstract, and (B) shows the same sentence after (automated) markup through the OSCAR 3 natural language processing system.[8] In this example, chemical entities such as “oleic acid” or “magnetite” are marked up as chemical moieties (type="CM"), and additional information, such as in-line representations of chemical structure (SMILES and InChI), ontology terms and other data, can be added.
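
To give a flavour of how such in-line markup can be consumed, the following sketch pulls the chemical entities out of a marked-up sentence. The sentence is invented for illustration and is not the abstract shown in Figure 3. The element name ne is an assumption about the markup, as only the type="CM" attribute is described above, and both helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented example sentence; the <ne> element name is an assumption,
# only the type="CM" attribute is taken from the text above.
MARKED_UP = (
    '<sentence>Nanoparticles of <ne type="CM">magnetite</ne> coated with '
    '<ne type="CM">oleic acid</ne> were prepared.</sentence>'
)


def chemical_entities(xml_text, entity_type="CM"):
    """List the text of every annotated entity of the given type, in document order."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("ne") if el.get("type") == entity_type]


def plain_text(xml_text):
    """Recover the original sentence by discarding all markup."""
    return "".join(ET.fromstring(xml_text).itertext())


print(chemical_entities(MARKED_UP))
print(plain_text(MARKED_UP))
```

The second function shows a useful property of in-line annotation: the unannotated text remains fully recoverable, so the markup adds information without destroying the source.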

Other markup languages of relevance to chemistry include Analytical Markup Language (AnIML),[9] ThermoML, a markup language for thermochemical and thermophysical property data,[10] MathML[11] (Mathematical Markup Language) and SciXML.[8, 12] Furthermore, Indian researchers have recently reported the development of an alternative to CML for the markup of chemical reaction information.[13, 14]

2.2 RDF Vocabularies

While the ecosystem for markup languages in chemistry is relatively well developed, the same cannot currently be said for the availability of RDF vocabularies for the domain. The most notable efforts were reported by Frey et al. as part of the CombeChem project.[15-17] The proposed vocabulary provides the basic mechanism to describe both state-independent entities (e.g. identifiers, molecular weights etc.) and condition-dependent entities (e.g. experimentally determined physical properties which depend on, for example, measurement or environmental conditions) associated with molecules, as well as provenance information for both molecules and data (Figure 4).

Figure 4: Snapshot of the CombeChem RDF vocabulary for chemistry.[15]
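
The distinction between state-independent and condition-dependent properties can be sketched with a handful of subject-predicate-object triples. The example below uses plain Python tuples rather than an RDF library, and every name in it (the example.org namespace and predicates such as hasMeasurement or atPressureKPa) is hypothetical and is not taken from the CombeChem vocabulary. The numerical values are illustrative only.

```python
# Hypothetical namespace and predicate names; NOT the CombeChem vocabulary.
EX = "http://example.org/chem#"

STYRENE = EX + "styrene"
MEASUREMENT = EX + "measurement1"

TRIPLES = [
    # State-independent: attached directly to the molecule.
    (STYRENE, EX + "hasMolecularWeight", 104.15),
    # Condition-dependent: routed through a measurement node that
    # carries the value, its unit, the conditions and the provenance.
    (STYRENE, EX + "hasMeasurement", MEASUREMENT),
    (MEASUREMENT, EX + "ofProperty", "boilingPoint"),
    (MEASUREMENT, EX + "hasValue", 145.0),
    (MEASUREMENT, EX + "hasUnit", "degC"),
    (MEASUREMENT, EX + "atPressureKPa", 101.325),
    (MEASUREMENT, EX + "reportedBy", "hypothetical-source"),
]


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the triple list."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def describe(triples, subject):
    """Collect every statement about a subject, keyed by the predicate's local name."""
    return {p[len(EX):]: o for s, p, o in triples if s == subject}


print(objects(TRIPLES, STYRENE, EX + "hasMolecularWeight"))
for node in objects(TRIPLES, STYRENE, EX + "hasMeasurement"):
    print(describe(TRIPLES, node))
```

Routing a measured value through an intermediate node is one common way of attaching conditions and provenance to a number. A triple that linked the molecule directly to the value would leave nowhere to record the pressure or the source.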

Furthermore, the same authors have also modelled a synthetic chemistry experiment in RDF.[18] There are sporadic efforts to model aspects of molecular structure in both RDF and OWL, but this must be considered developing work at this stage.[19, 20] In further studies, RDF has been exploited for the purposes of publishing in the chemical domain[21, 22] and for developing technologies which could lead to the generation of “research interest” (social) networks for chemists.[22, 23]

While at least some RDF vocabularies for chemistry are therefore available, what is decidedly missing are mashup examples. This can be explained by the difficulty of getting hold of chemical data: unlike biology or physics, chemistry has not (yet) developed a culture of data sharing and is extremely conservative in its adoption of a more open culture. We will discuss this further below.

2.3 Ontologies

Ontologies are computable conceptualisations of a knowledge domain and thus crucially important for adding "meaning" to data. To date, only a few attempts have been made to construct formal ontologies for chemistry. Very early attempts predate the arrival of the semantic web and indeed the internet: in the 1980s, Gordon considered the syntax, semantics and history of structural formulae as well as the semantics and formal attributes of chemical transformations in a set of papers, which led to a formalised language for relational chemistry.[24-26] Somewhat later, van der Vet published construction rules for some very fundamental chemical concepts, such as "pure substance", "phase" and "heterogeneous system", as the basis for the development of further axiomatisations relevant to chemistry.[27]

The most widely used chemical ontology at present is the European Bioinformatics Institute’s (EBI) "Chemical Entities of Biological Interest" (ChEBI) ontology.[28] ChEBI combines information from three main sources, namely IntEnz,[29] KEGG COMPOUND[30] and the Chemical Ontology (CO),[19] and contains ontological associations which specify chemical relationships (e.g. "chloroform is a chloroalkane"), biological roles, and uses and applications of the molecules contained in the ontology. ChEBI is stored in a relational database, but can be exported to OBO format and translated into OWL. Other ontologies currently maintained by the EBI are REX[31] and FIX.[32] REX terms describe physicochemical processes, whereas FIX mainly describes physicochemical measurement methods. Again, both ontologies are available in the OBO format. There have been other attempts to model aspects of chemistry, such as chemical structure,[19] laboratory processes,[15-18] chemical reactions[13, 14] and polymers,[33] but these are isolated and somewhat small-scale efforts. There is currently no discernible community effort to develop a formalisation of chemical concepts.
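
The practical value of "is a" associations such as "chloroform is a chloroalkane" is that they can be followed transitively by a program. The sketch below reads a few hand-written stanzas in a simplified OBO-like layout and computes all ancestors of a term. The identifiers (EX:0001 etc.) are invented and are not ChEBI identifiers, the third term is added purely for illustration, and the parser handles only the three tags used here rather than the full OBO format.

```python
# Hand-written stanzas in a simplified OBO-like layout; the IDs are invented.
OBO_TEXT = """
[Term]
id: EX:0001
name: chloroform
is_a: EX:0002

[Term]
id: EX:0002
name: chloroalkane
is_a: EX:0003

[Term]
id: EX:0003
name: haloalkane
"""


def parse_terms(text):
    """Parse [Term] stanzas into {term ID: {"name": ..., "is_a": [parent IDs]}}."""
    terms = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == "[Term]":
            current = {"is_a": []}
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            if key == "id":
                current["id"] = value
                terms[value] = current
            elif key == "is_a":
                current["is_a"].append(value)
            else:
                current[key] = value
    return terms


def ancestors(terms, term_id):
    """Follow is_a links transitively and return every ancestor ID once."""
    seen = []
    stack = list(terms[term_id]["is_a"])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            stack.extend(terms.get(parent, {"is_a": []})["is_a"])
    return seen


terms = parse_terms(OBO_TEXT)
print([terms[t]["name"] for t in ancestors(terms, "EX:0001")])
```

A query for all haloalkanes would then return chloroform even though no statement links the two directly, which is the kind of inference that plain keyword search over chemical names cannot provide.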

3. "Semantification" in Chemistry

A significant amount of chemical data is currently tied up in unstructured sources such as scientific papers, theses and patents. As such, natural language processing (NLP) of these sources is often required to extract relevant information and data and to add metadata. While there is considerable activity in processing text in the biological, biochemical and medical literature by both companies (e.g. Temis, Linguamatics and others) and academic groups (e.g. GENIA, PennBioIE), chemistry is sadly lagging behind in this area, although a number of reports have appeared in the literature over the past several years.[34-37] The principal open tool for the extraction and semantic markup of chemical entities at the moment is the OSCAR 3 system, which is currently being developed by Corbett and Murray-Rust.[8] OSCAR 3 is part of the SciBorg project[38] for the deep parsing and analysis of scientific texts, but can also be used standalone or integrated with other NLP systems. A typical example of OSCAR’s output has been provided in Figure 3.

4. Cultural Access Barriers to Semantic Chemistry

So far, we have only discussed the technical aspects of semantic chemistry. While the field is in many ways still in its infancy (note the absence of a significant body of RDF vocabularies and ontologies), this situation is currently being addressed by a number of academic groups as well as commercial entities, and it is reasonable to expect that a substantial amount of work will become available over the next several years. The real challenges associated with semantic chemistry are not so much of a technological nature, but rather "socio-cultural". We have already alluded to the fact that, unlike other scientific, technical and medical fields, chemistry has not evolved a culture of data and knowledge sharing. Rather, chemistry has ceded the dissemination of data and knowledge almost entirely to commercial entities in the form of publishing businesses. However, as is the case in mainstream publishing, the internet is currently in the process of destroying the business model associated with scientific publishing (publishers justifying subscriptions and revenue by organising manuscript collection, peer review, editorial work, printing and distribution of the journal issue to subscribers). As a consequence, scientific publishers are increasingly shifting their value proposition to content, i.e. scientific data, and seem to be attempting to prevent the automatic extraction of data (i.e. non-copyrightable facts) from their journals. For obvious reasons, disciplines which have already evolved both the technological and the cultural mechanisms for data sharing are less severely impacted by this than chemistry, which currently has neither the technological nor indeed the cultural wherewithal for data sharing. Sooner or later, this will adversely affect the progress of science as a whole: the biosciences, for example, are crucially dependent on chemical data, and without the ability to mash up data from both sources, progress in biology etc. will undoubtedly be impeded.

The crucial task for anyone interested in the use of semantics in the chemical domain, therefore, is not only to develop the necessary technology, but first and foremost to make a contribution towards changing hearts and minds in the discipline and to create “data awareness” in practising scientists who are not also informaticians. The Open Access movement is making slow but steady progress in this (several very significant universities have recently adopted open access publishing mandates) and the current generation of undergraduate and postgraduate students is keenly aware of the possibilities and the promise of semantic technologies. Therefore, there is considerable reason for optimism that we will see the transition from "chemistry" to "semantic chemistry" and full participation of the discipline in the semantic web in the not too distant future.

5. Summary and Conclusions

Chemistry is a conservative discipline which is nevertheless starting to participate in the semantic web. There is a considerable and useful infrastructure of markup languages available for the dissemination and exchange of chemical data. While not currently highly developed, some first drafts of RDF vocabularies and ontologies are also coming on-stream, and good progress is being made in the extraction of chemical entities from unstructured sources. The main obstacle currently holding up both the further development of semantics in the chemical domain and its further adoption as a technology is socio-cultural in nature: to date, chemistry has not evolved a culture of data sharing and therefore neither the cultural nor the technical mechanisms are in place, which results in a scarcity of available data sets. Nevertheless, the increasing adoption of open access and the further penetration of semantic technology into chemistry will force change to occur and there is every reason to remain optimistic.

References
1    Holliday GL, Murray-Rust P and Rzepa HS (2006) Chemical markup, XML, and the world wide web. 6. CMLReact, an XML vocabulary for chemical reactions. J. Chem. Inf. Model. 46:145-157
2    Murray-Rust P, Rzepa HS, Williamson MJ, et al. (2004) Chemical markup, XML, and the world wide web. 5. Applications of chemical metadata in RSS aggregators. J. Chem. Inf. Comput. Sci. 44:462-469
3    Murray-Rust P and Rzepa HS (2003) Chemical markup, XML, and the world wide web. 4. CML schema. J. Chem. Inf. Comput. Sci. 43:757-772
4    Gkoutos GV, Murray-Rust P, Rzepa HS, et al. (2001) Chemical markup, XML and the world-wide web. 3. Toward a signed semantic chemical web of trust. J. Chem. Inf. Comput. Sci. 41:1124-1130
5    Murray-Rust P and Rzepa HS (2001) Chemical markup, XML and the world-wide web. 2. Information objects and the CMLDOM. J. Chem. Inf. Comput. Sci. 41:1113-1123
6    Murray-Rust P and Rzepa H (1999) Chemical markup, XML, and the world-wide web. 1. Basic principles. J. Chem. Inf. Comput. Sci. 39:928-942
7    Adams N, Murray-Rust P, Winter J, et al. (2008) Chemical markup, XML and the world wide web. 8. Polymer Markup Language. J. Chem. Inf. Model. 48:2118-2128
8    Corbett P and Murray-Rust P (2006) High-throughput identification of chemistry in life science texts. Computational Life Sciences II, Lecture Notes in Computer Science 4216:107-118
9    Davies T, Lampen P, Fiege M, et al. (2003) AnIMLs in the spectroscopic laboratory? Spectroscopy Europe 15:25-28
10    Frenkel M, Chirico RD, Diky V, et al. (2006) XML-based IUPAC standard for experimental, predicted, and critically evaluated thermodynamic property data storage and capture (ThermoML) (IUPAC Recommendations 2006). Pure Appl. Chem. 78:541-612
11    Buswell S, Devitt S, Diaz A, et al. (1999) Mathematical Markup Language (MathML™) 1.01 Specification. http://www.w3.org/TR/REC-MathML/, Accessed Jan 5, 2009
12    Teufel S (2008) SciXML. http://sourceforge.net/projects/scixml/, Accessed Jan 6, 2009
13    Sankar P and Aghila G (2006) Design and development of chemical ontologies for reaction representation. J. Chem. Inf. Model. 46:2355-2368
14    Sankar P and Aghila G (2007) Ontology aided modeling of organic reaction mechanisms with flexible and fragment based XML markup procedures. J. Chem. Inf. Model. 47:1747-1762
15    Taylor KR, Essex JW, Frey JG, et al. (2006) The semantic grid and chemistry: experience with CombeChem. J. Web Semant. 4:84-101
16    Frey JG, de Roure D, Schraefel MC, et al. (2003) Context slicing the chemical aether. First International Workshop on Hypermedia and the Semantic Web:9 Nottingham, UK
17    Taylor KR, Gledhill RJ, Essex JW, et al. (2006) Bringing chemical data onto the semantic web. J. Chem. Inf. Model. 46:939-952
18    Frey JG, Hughes GV, Mills HR, et al. (2003) Less is more: lightweight ontologies and user interfaces for smart labs. UK e-Science All Hands Meeting:500-507 Nottingham, UK
19    Feldman HJ, Dumontier M, Ling S, et al. (2005) CO: A chemical ontology for identification of functional groups and semantic comparison of small molecules. FEBS Letters 579:4685-4691
20    Bhat TN and Barkley J (2008) Development of a use case for chemical resource description framework for acquired immune deficiency syndrome drug discovery. Open Bioinf. J. 2:20-27
21    Casher O and Rzepa HS (2006) SemanticEye: a semantic web application to rationalize and enhance chemical electronic publishing. J. Chem. Inf. Model. 46:2396-2411
22    Rzepa HS, Casher O and Murray-Rust P (2005) RDF-based molecular relationships, the semantic web and the future of scientific publishing. Abstr. Papers, 229th ACS National Meeting, San Diego, USA
23    Rzepa HS and Willighagen EL (2008) FOAF (Friend-of-a-friend): RDF-metadata enhanced social networking in chemistry. Abstracts of Papers, 235th ACS National Meeting, New Orleans, United States
24    Gordon JE and Brockwell JC (1983) Chemical Inference. 1. Formalization of the language of organic chemistry: generic structural formulas. J. Chem. Inf. Comput. Sci. 23:117-134
25    Gordon JE (1984) Chemical Inference. 2. Formalization of the language of organic chemistry: generic systematic nomenclature. J. Chem. Inf. Comput. Sci. 24:81-92
26    Gordon JE (1988) Chemical Inference. 3. Formalization of the language of relational chemistry: ontology and algebra. J. Chem. Inf. Comput. Sci. 28:100-115
27    van der Vet PE (1993) Structured system of concepts for storing, retrieving and manipulating chemical information. J. Chem. Inf. Comput. Sci. 33:564-568
28    de Matos P, Ennis M, Zbinden M, et al. (2006) ChEBI – Chemical entities of biological interest. http://www3.oup.co.uk/nar/database/summary/646, Accessed December 12, 2008
29    Fleischmann A, Darsow M, Degtyarenko K, et al. (2004) IntEnz, the integrated relational enzyme database. Nucleic Acids Res. 32:D434-D437
30    Kanehisa M, Goto S, Kawashima S, et al. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res. 32:D277-D280
31    Degtyarenko K (2007) The Rex Ontology. http://obofoundry.org/cgi-bin/detail.cgi?id=rex, Accessed December 30, 2008
32    Degtyarenko K (2007) The FIX ontology. http://obofoundry.org/cgi-bin/detail.cgi?id=fix, Accessed December 30, 2008
33    Adams N and Murray-Rust P (2008) Engineering polymer informatics: towards the computer-aided design of polymers. Macromol. Rapid Commun. 29:615-632
34    Chowdhury GG and Lynch MF (1992) Automatic interpretation of the texts of chemical patent abstracts. 1. Lexical analysis and categorization. J. Chem. Inf. Comput. Sci. 32:463-467
35    Chowdhury GG and Lynch MF (1992) Automatic interpretation of the texts of chemical patent abstracts. 2. Processing and results. J. Chem. Inf. Comput. Sci. 32:468-473
36    Vasserman A (2004) Identifying chemical names in biomedical text: an investigation of the sub-string co-occurrence based approaches. HLTNAACL:
37    Wilbur JW, Hazard GF, Divita G, et al. (1999) Analysis of biomedical text for chemical names: a comparison of three methods. AMIA Symposium:176-180
38    Copestake A, Corbett P, Murray-Rust P, et al. (2006) An architecture for language processing for scientific texts. UK eScience All Hands Meeting 2006
39    Hamoudeh M, Faraj AA, Canet-Soulas E, et al. (2007) Elaboration of PLLA-based superparamagnetic nanoparticles: characterization, magnetic behaviour study and in vitro relaxivity evaluation. Int. J. Pharm. 338:248-257