
Chemical Taxonomies and Ontologies for Semantic Web


Executive Summary

The Semantic Web (SW) is the World Wide Web Consortium's vision for seamless integration and querying of complex data of all kinds; here we describe its implementation for drug-like chemical compounds. The Resource Description Framework (RDF) is a recommended technology for integrating a variety of applications, and it provides a lightweight ontology system to support annotation, search and exchange of knowledge in a SW. Establishing, managing and exchanging information on chemical compounds is a huge challenge for researchers working in many fields, ranging from drug discovery to enzymology and from bio-fuels to agriculture. Structures of chemical compounds are too complex to be described by names alone. Subtle structural changes may result in a huge change in chemical properties, and these changes may not all be explicit in a compound's name. Also, the total number of chemical compounds one may have to deal with in a particular field may run into millions. These are just a few of the reasons why we need a new-generation technology, the Chemical Semantic Web (CSW), to manage chemical compounds. Here we describe a technique for establishing a CSW using Chem-BLAST[1] and present a prototype CSW resource for AIDS research (http://bioinfo.nist.gov/SemanticWeb_pr2d/chemblast.do). Recently, we have extended this work (http://xpdb.nist.gov/pdb/chemblast.html ) to the tens of thousands of ligands held in the Protein Data Bank (http://www.rcsb.org/pdb/home/home.do ).

Article

Chemical compounds (compounds) are among the most abundant and fascinating creations of nature; many of these compounds existed prior to life on earth. Every living body is a mini-factory for many compounds, and these compounds play several critical roles in biological functions ranging from self-defense to survival, and from growth to self-regulation and reproduction. Extracting, identifying, naming, classifying and exchanging information on compounds has been a major part of the evolution of human civilization over tens of thousands of years. Names derived from the source of a compound, names identifying its discoverer or its use, and names based on its structural information (IUPAC name, International Chemical Identifier (InChI) http://old.iupac.org/inchi/ ) are some of the popular choices for identification. Though these choices are very useful and have been widely adopted, they fall short of providing a reliable mechanism for humans to organize, exchange and instantly visualize chemical structures.

Clear and succinct specification of a chemical structure by names alone can be difficult, because a compound may have atoms connected by a complex network of bonds stretching in three dimensions (Fig 1).

 

Fig 1 shows a compound with a hetero-atom (IUPAC name 1-CARBOXY-1′-[(DIMETHYLAMINO)-CARBONYL]FERROCENE) found in the PDB (ID = 1A3I). The three-dimensional structure of this compound is very complex, and it is difficult to visualize the structure from its IUPAC name. For this reason we present a rule-based Chemical Semantic Web concept that uses visual tools to present compounds using a recursive structural RDF. Structural RDFs are generated from structural sub-components (such as the five-member rings or the metal atom and its surroundings). For clarity over the Web, these elements of an RDF are presented as molecular images for visual inspection and subsequent selection. If an element of an RDF can be further expressed as a new RDF of its sub-components, then a recursive method for generating RDFs is used. The idea we use to present compounds is that of a super store where you no longer need to remember brand names: the items are pictured on shelves, laid out by a predictable rule-based method, so you see what is in the store, and what you do not see on the shelves is not in the store.

Three-dimensional arrangements of atoms determine the inter-molecular interaction properties (Fig 2) of a compound with other molecules such as drug targets. Thus an understanding of these arrangements affects one's ability to use these compounds in technological development ranging from medicine to pest control and from biofuels to agriculture.
 

Fig 2 shows a compound bound to the active site of HIV protease, a major target of AIDS drugs. The molecular surface (shown in dotted gold) of the residues of the protease forms complementary interactions with the molecular surface (shown in pink) of this compound, which is also known as a drug-candidate. This drug-candidate is made up of several sub-structures that branch out to reach into the cavities (shown by the dimples in the blue surface) formed by the protein residues. We propose to break up the drug-candidate into these sub-structures and use the sub-structures in a recursive RDF for developing an ontology for the Chemical Semantic Web. Change in the protein surface caused by drug-resistance mutations is a major cause of the drug-resistance phenomenon shown by the HIV protease. Presenting this drug-candidate in a way that facilitates the visualization of its interaction with the protein surface is a key component of the Chemical Semantic Web that we present.

For decades, scientists have been working to unravel the three-dimensional structure of compounds, or at least its two-dimensional projection. Currently, structural information on tens of millions of compounds is available. PubChem[2] has one of the largest (over five million) public collections of structural information, and Chemical Abstracts (http://www.cas.org/) and the Cambridge Structural Database (http://www.ccdc.cam.ac.uk/products/csd/) have been the leaders in providing such information for decades. The Protein Data Bank[3] (http://www.rcsb.org/pdb/home/home.do) is the sole resource for structures of compounds bound to macromolecules.

The sheer volume and distributed ownership of the data on compounds make it hard to envision managing them by traditional methods, such as integrating the information into a single data warehouse or a few warehouses like those mentioned above. Efficient use of these huge volumes of structural data in Web-based distributed data resources requires smart, easy-to-use technologies and common vocabularies. Making sense of these data in a multi-disciplinary distributed environment is an ideal challenge to be tackled by Semantic Web technologies (http://www.w3.org/2001/sw/ ). We have been working in collaboration with HCLS of W3C (http://www.w3.org/2001/sw/hcls/) to illustrate some of the concepts of the Chemical Semantic Web (CSW) and its implementation. This work has two main goals: first, to develop methods and automated procedures to prepare structural data for CSW, and second, to illustrate their use for a visible project such as AIDS research.

The Semantic Web (SW[4-14]) is the World Wide Web Consortium's vision for the future Web, both for distributing and for integrating data. There have been several recent proposals[15-17] to use InChI as a rule-based name for complete compounds in a CSW. However, the use of compounds in a CSW needs more than this. A CSW needs to assign names and associate an ontology not only with the complete compound but also with several additional parameters that are characteristic of the compound, such as its substructures, the types of its atoms and bonds, and their disposition in three dimensions. Thus the challenge is not only to uniquely name each compound but also to uniquely name and correlate each sub-component of every compound. Our research therefore also focuses on assigning names and managing ontologies for these additional parameters.

The Resource Description Framework (RDF, http://www.w3.org/RDF/ ) is a preferred framework to annotate, store and integrate data for CSW, as per the recommendations of W3C/HCLS. A chemical structural RDF may be written as ‘structure’ – ‘relationship’ – ‘a sub-component of the structure’. A ‘sub-component of a structure’ may often be a recursive property of the structure; for instance, a triple fused ring may be a sub-component of a structure, and that triple fused ring itself may be made of a six-, a five- and a seven-member ring, and one of these rings may also be bonded to a heavy atom. Our proposal is to develop an ontology and RDF with all this information on substructures included. We call this a recursive RDF. A recursive RDF on a structure is developed by expressing the ‘sub-component’ or ‘sub-components’ of an RDF element in a new RDF. Thus a recursive RDF is of the type ‘sub-component of a structure’ – ‘relationship’ – ‘sub-component of the sub-component of the structure’. The number of recursive layers of RDF that may be built on a given compound may depend on several factors, such as the size (number of atoms) and the number of unique structural scaffolds in that compound. It may also depend on the ‘use-case’. For instance, a sub-structure such as thio-proline may be of special interest for AIDS inhibitors but of lesser significance for another ‘use-case’ such as drug-design for malaria, and thus thio-proline need not be a part of the ontology for a ‘use-case’ focused on malaria.
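
As a minimal sketch of the recursive RDF idea, the Python fragment below turns a nested decomposition of a structure into layered triples of the form ‘structure’ – ‘relationship’ – ‘sub-component’. The predicate name hasSubComponent, the nesting used for the triple fused ring example, and the placement of the heavy atom under the five-member ring are assumptions made for illustration; they are not taken from Chem-BLAST or from any W3C vocabulary.

```python
# Hypothetical nested decomposition: each key is a (sub-)structure label and
# each value maps its sub-components to their own decompositions.
DECOMPOSITION = {
    "compound": {
        "triple fused ring": {
            "six-member ring": {},
            # Assumption: the heavy atom is attached to the five-member ring.
            "five-member ring": {"heavy atom": {}},
            "seven-member ring": {},
        },
    },
}

RELATIONSHIP = "hasSubComponent"  # assumed predicate name, for illustration only


def recursive_triples(tree, layer=1):
    """Yield (layer, subject, relationship, object) for every nesting level."""
    for subject, children in tree.items():
        for child in children:
            yield layer, subject, RELATIONSHIP, child
        # Each sub-component becomes the subject of the next RDF layer.
        yield from recursive_triples(children, layer + 1)


triples = list(recursive_triples(DECOMPOSITION))
for layer, s, p, o in triples:
    print(f"layer {layer}: '{s}' - '{p}' - '{o}'")
```

The number of layers produced depends only on how deeply the decomposition is nested, which mirrors the point that the depth of a recursive RDF varies with the compound and with the ‘use-case’.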

We use the Chem-BLAST[1] method to establish recursive RDF layers of a compound. The basic principle of Chem-BLAST is to first express a compound in the form

compound <-> ‘A’-‘B’-‘C’-‘D’-…

where ‘A’, ‘B’, … are sub-components of the compound. These sub-components may be thought of as the ‘amino-acids’ of the compound if the compound is regarded as a peptide. Each sub-component is then further divided into its own sub-components. The rules used to define sub-components of a compound may vary between different applications. These rules may be broadly classified into two types: ‘use-case based’ and ‘general purpose’.

Use-case based RDF: The sub-components of a compound are generated using rules that are applicable to a particular ‘use-case’. Structure-based drug-design, chemical synthesis, and the study of a chemical reaction and its products are a few examples of ‘use-cases’ of an RDF. In each of these ‘use-cases’, the compounds are divided into sub-components, and these sub-components into further sub-components, using rules applicable to that ‘use-case’. For instance, RECAP rules[18] or chemical reaction rules are possibilities for ‘use-cases’ on chemical synthesis. For drug-design for AIDS using peptidic inhibitors, sub-components may be defined by breaking the compound at peptidic bonds (http://esw.w3.org/topic/HCLS/ChemicalTaxonomiesUseCase ); an illustration is at http://bioinfo.nist.gov/SemanticWeb_pr2d/chemblast.do .

General purpose RDF: Rather than building a ‘use-case’ specific RDF, a developer of a CSW may try to target a wider audience by building RDFs on commonly used concepts that are applicable to many ‘use-cases’. We used structural scaffolds (http://xpdb.nist.gov/pdb/chemblast.html ) to illustrate the building of a general purpose RDF. This work presents a CSW case-study for structural data from the Protein Data Bank and focuses on several diseases. In this implementation, each compound is first examined for ring structures using an automated procedure. Then, in a recursive cycle, each of these ring structures is allowed to grow in size by adding one or two atoms that are bonded to it. Each recursive cycle generates a super-structure of the sub-structure it started with, and the new super-structure forms a new RDF with that sub-structure.
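
The growth cycle described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure used at the xpdb.nist.gov resource: the toy bond table, the atom labels, the hasSubStructure predicate and the rule for choosing which neighbours to add first (alphabetical order) are all hypothetical, and the ring is supplied by hand rather than detected automatically.

```python
# Toy molecular graph (hypothetical): atom label -> set of bonded atom labels.
BONDS = {
    "C1": {"C2", "C6", "N7"},
    "C2": {"C1", "C3"},
    "C3": {"C2", "C4"},
    "C4": {"C3", "C5"},
    "C5": {"C4", "C6"},
    "C6": {"C5", "C1"},
    "N7": {"C1", "C8"},
    "C8": {"N7", "O9"},
    "O9": {"C8"},
}
RING = ("C1", "C2", "C3", "C4", "C5", "C6")  # assumed to be already detected


def grow(structure, bonds, max_new_atoms=2):
    """Return the structure enlarged by up to max_new_atoms bonded neighbours."""
    neighbours = sorted(
        {atom for member in structure for atom in bonds[member]} - set(structure)
    )
    return frozenset(structure) | frozenset(neighbours[:max_new_atoms])


def growth_triples(ring, bonds):
    """Yield (super-structure, relationship, sub-structure) for each cycle."""
    current = frozenset(ring)
    while True:
        larger = grow(current, bonds)
        if larger == current:  # no bonded atoms left to add
            return
        yield tuple(sorted(larger)), "hasSubStructure", tuple(sorted(current))
        current = larger


ring_triples = list(growth_triples(RING, BONDS))
for sup, rel, sub in ring_triples:
    print(f"{len(sup)} atoms - {rel} - {len(sub)} atoms")
```

Each triple links a super-structure to the sub-structure it grew from, so consecutive triples chain together from the bare ring up to the largest fragment reached.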

Building a Chemical Ontology: Recursive RDFs are stacked head-to-tail to generate ontologies on compounds and their sub-structures as

‘A’ <-> ‘sub-structure1’;
‘sub-structure1’ <-> ‘sub-sub-structure1’;
…. additional classification of sub-structure1.

‘B’ <-> ‘sub-structure2’;
‘sub-structure2’ <-> ‘sub-sub-structure2’;
…. additional classification of sub-structure2.

‘A’ and ‘B’ refer to the sub-components of a compound.

Subsequently, RDFs for biological properties for compounds are generated and stacked against the corresponding structural ontologies as

Compound 1 <-> ‘cellular assay 1’;
Compound 1 <-> ‘anti-viral assay 1’;
…. other assays or data.

Compound 2 <-> ‘cellular assay 1’;
Compound 2 <-> ‘anti-viral assay 1’;
…. other compounds.
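
One way to picture how the structural and biological RDFs work together is the sketch below, which follows the stacked structural triples from a compound down to a chosen sub-structure and then reports the assays recorded for the matching compounds. The predicate names (hasSubComponent, hasSubStructure, hasAssay) and the assignment of sub-components ‘A’ and ‘B’ to Compound 1 and Compound 2 are assumptions made only for this example.

```python
# Structural triples stacked head-to-tail (assignment of A/B is hypothetical).
STRUCTURE_TRIPLES = [
    ("Compound 1", "hasSubComponent", "A"),
    ("Compound 1", "hasSubComponent", "B"),
    ("Compound 2", "hasSubComponent", "B"),
    ("A", "hasSubStructure", "sub-structure1"),
    ("sub-structure1", "hasSubStructure", "sub-sub-structure1"),
    ("B", "hasSubStructure", "sub-structure2"),
    ("sub-structure2", "hasSubStructure", "sub-sub-structure2"),
]
# Biological-property triples stacked against the same compounds.
ASSAY_TRIPLES = [
    ("Compound 1", "hasAssay", "cellular assay 1"),
    ("Compound 1", "hasAssay", "anti-viral assay 1"),
    ("Compound 2", "hasAssay", "cellular assay 1"),
    ("Compound 2", "hasAssay", "anti-viral assay 1"),
]


def contains(subject, target, triples):
    """True if target can be reached from subject by following the triples."""
    stack, seen = [subject], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(o for s, _, o in triples if s == node)
    return False


def assays_for_substructure(target):
    """Map each compound containing target to the assays recorded for it."""
    compounds = sorted(
        {s for s, p, _ in STRUCTURE_TRIPLES if p == "hasSubComponent"}
    )
    return {
        c: [o for s, _, o in ASSAY_TRIPLES if s == c]
        for c in compounds
        if contains(c, target, STRUCTURE_TRIPLES)
    }


print(assays_for_substructure("sub-sub-structure1"))
print(assays_for_substructure("sub-structure2"))
```

A production system would more likely hold these triples in an RDF store and query them there; plain tuples are used here only to keep the sketch self-contained.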

New research directions and possible focus areas: At present the Chemical Semantic Web is in its infancy, and global participation by many data providers is needed for it to grow. The lack of common vocabularies, general ontologies, efficient and scalable technologies for generating and searching ontologies, and ‘use-case’ based methods to define RDFs for compounds are just a few of the stumbling blocks for the growth of CSW. Presentation of information on a compound is also a challenge, due to the lack of an easy-to-use naming and organizing scheme for the millions of compounds to be handled by CSW. Here we propose a recursive automated method, similar to the ones used to manage file folders on a computer, for navigating the millions of elements of a CSW. In this method, a user specifies compounds using one of their low-level features, for instance a common structural scaffold of interest, and then refines the specification in successive steps. For example, for HIV protease inhibitors a ‘6-member ring’ may be a starter, and structures with more atoms bonded to the ‘6-member ring’ may be subsequent choices for refinement. Chemical structures are often too complex to draw on the fly, and the existence of several tautomeric forms of a structure may make it even more difficult to decide what to draw. Therefore a preferred, and probably the most intuitive, way to facilitate search on structures is to present the elements of CSW as visual examples of pre-defined ontological elements to choose from. A user may also be allowed to switch between different ontologies depending on his/her preferences (Fig 3).

Bhat - Figure 3


Fig 3 shows the Web interface that we developed to present the ontology for CSW for AIDS research. The lowest level of the CSW that we use here focuses on the types of the substructures and the atoms they contain. A user may open any one of these lowest-level folders (left of the figure) and view its contents. After opening a folder, the user may step through from lower to higher levels of the ontology, which are presented using images of structural sub-components.
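
The folder-style navigation can be sketched in a few lines. The catalogue below is invented for illustration (the compound names and level labels are hypothetical, and text labels stand in for the molecular images the real interface shows); the function returns both the compounds inside the opened folder and the choices offered at the next level of refinement.

```python
# Hypothetical catalogue: compound name -> ordered path of ontology levels,
# from the lowest-level scaffold to progressively larger sub-structures.
CATALOGUE = {
    "inhibitor-1": ("6-member ring", "6-member ring + N", "6-member ring + N + C=O"),
    "inhibitor-2": ("6-member ring", "6-member ring + N", "6-member ring + N + S"),
    "inhibitor-3": ("6-member ring", "6-member ring + O"),
    "inhibitor-4": ("5-member ring", "5-member ring + S"),
}


def open_folder(path):
    """Return (compounds in this folder, choices offered at the next level)."""
    depth = len(path)
    matches = sorted(
        name for name, levels in CATALOGUE.items()
        if levels[:depth] == tuple(path)
    )
    choices = sorted(
        {CATALOGUE[name][depth] for name in matches
         if len(CATALOGUE[name]) > depth}
    )
    return matches, choices


# Start at the top, then refine the selection in successive steps.
print(open_folder(()))
print(open_folder(("6-member ring",)))
print(open_folder(("6-member ring", "6-member ring + N")))
```

Because every choice offered comes from compounds actually present in the catalogue, a user only ever sees folders that lead somewhere, in keeping with the super-store analogy.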

One may also think of other ways of defining the lowest level of the ontology. For instance, the lowest level of the ontology need not be a classification of the structural scaffold as shown above. Instead it can be a range of potencies, the synthetic core of the compound, its manufacturer, or its toxicity. The interface we present here is a rule-based system that allows a user to infer the results of a query using familiar items. The items that may be used to query the database are all presented as molecular images. The result generated by this system in response to a query depends on the ontology chosen by the user. User requirements may vary between different ‘use-cases’, and thus community participation is needed to build ontologies that meet the needs of as many ‘use-cases’ as possible.

Coordination between the many efforts on developing CSW ontologies pursued around the world is also an important focus area. Certainly, full agreement between the ontologies adopted by all developers is not only impractical but may also hinder independent efforts to develop CSW. Therefore, emphasis on only a partial overlap between different ontologies is desirable at this time. Common elements among differing ontologies may be used as cross-over points between them. For instance, certain types of rings and/or types of atoms contained in a compound may be used as common elements among many ontologies, and a user may use these elements to switch between different applications available worldwide. These cross-over points between ontologies may also be used to provide inter-Web page query capabilities between CSW Web pages that do not share complete ontologies.
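
A rough sketch of cross-over points, assuming each ontology can be reduced to a mapping from term to the compounds filed under it: the terms shared by two otherwise different ontologies are found by set intersection, and a shared term is then used to move from one resource to the other. Both ontologies, their terms and the compound identifiers below are hypothetical.

```python
# Two hypothetical ontologies that overlap only partially: term -> compounds.
ONTOLOGY_A = {
    "6-member ring": {"cpd-1", "cpd-2"},
    "5-member ring": {"cpd-3"},
    "thio-proline": {"cpd-2"},
}
ONTOLOGY_B = {
    "6-member ring": {"lig-7", "lig-9"},
    "5-member ring": {"lig-4"},
    "sulfur atom": {"lig-9"},
}


def cross_over_points(a, b):
    """Terms present in both ontologies, usable to switch between them."""
    return sorted(a.keys() & b.keys())


def switch(term, a, b):
    """Return the compounds filed under a shared term in each ontology."""
    if term not in a or term not in b:
        raise KeyError(f"'{term}' is not a cross-over point")
    return sorted(a[term]), sorted(b[term])


shared = cross_over_points(ONTOLOGY_A, ONTOLOGY_B)
print(shared)
print(switch("6-member ring", ONTOLOGY_A, ONTOLOGY_B))
```

Terms such as ‘thio-proline’ remain private to one ontology, which reflects the suggestion that only a partial overlap between ontologies is needed.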

To summarize, we describe here how we have been formulating, developing and implementing the Chemical Semantic Web for a real-world problem, namely AIDS drug discovery. We also present a Web resource to illustrate it, and we hope that this work will be useful to others in their attempts to develop shared Web resources for complex data, such as chemical structures and cellular images, that may be best presented to users as visual examples of the data items being queried. Some of our related publications may be downloaded from http://xpdb.nist.gov/hiv2_d/download.html .

Acknowledgements: Some of the work reported here was done in collaboration with other researchers both within and outside NIST, and we refer readers to the Web page for their names.
 
Disclaimer:
Certain trade and company products are identified in this article to specify adequately the computer products used in the example. In no case does such identification imply endorsement by the National Institute of Standards and Technology (NIST), nor does it imply that the products are necessarily the best available for the purpose.

References:

[1]    M. D. Prasanna, J. Vondrasek, A. Wlodawer, H. Rodriguez, and T. N. Bhat, "Chemical compound navigator: a web-based chem-BLAST, chemical taxonomy-based search engine for browsing compounds," Proteins, vol. 63, pp. 907-17, Jun 1 2006.
[2]    D. L. Wheeler, T. Barrett, D. A. Benson, S. H. Bryant, K. Canese, V. Chetvernin, D. M. Church, M. DiCuccio, R. Edgar, S. Federhen, L. Y. Geer, W. Helmberg, Y. Kapustin, D. L. Kenton, O. Khovayko, D. J. Lipman, T. L. Madden, D. R. Maglott, J. Ostell, K. D. Pruitt, G. D. Schuler, L. M. Schriml, E. Sequeira, S. T. Sherry, K. Sirotkin, A. Souvorov, G. Starchenko, T. O. Suzek, R. Tatusov, T. A. Tatusova, L. Wagner, and E. Yaschenko, "Database resources of the National Center for Biotechnology Information," Nucleic Acids Res, vol. 34, pp. D173-80, Jan 1 2006.
[3]    H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, "The Protein Data Bank," Nucleic Acids Res, vol. 28, pp. 235-242, 2000.
[4]    L. A. Slaughter, D. Soergel, and T. C. Rindflesch, "Semantic representation of consumer questions and physician answers," Int J Med Inform, vol. 75, pp. 513-529, Aug 23 2005.
[5]    C. Weng, J. H. Gennari, and D. B. Fridsma, "User-centered semantic harmonization: a case study," J Biomed Inform, vol. 40, pp. 353-64, Jun 2007.
[6]    E. Neumann, "A life science Semantic Web: are we there yet?," Sci STKE, vol. 2005, p. pe22, May 10 2005.
[7]    E. Neumann and D. Quan, "Biodash: A Semantic Web Dashboard for Drug Development," Pacific Symposium on Biocomputing, vol. 11, pp. 176-187, 2006.
[8]    E. Neumann and L. Prusak, "Knowledge networks in the age of the Semantic Web," Brief Bioinform, vol. 8, pp. 141-9, May 2007.
[9]    E. K. Neumann and D. Quan, "BioDash: a Semantic Web dashboard for drug development," Pac Symp Biocomput, pp. 176-87, 2006.
[10]    G. Hughes, H. Mills, D. De Roure, J. G. Frey, L. Moreau, M. C. Schraefel, G. Smith, and E. Zaluska, "The semantic smart laboratory: a system for supporting the chemical eScientist," Org Biomol Chem, vol. 2, pp. 3284-93, Nov 21 2004.
[11]    T. N. Bhat and J. Barkley, "Semantic Web for the Life Sciences – Hype, Why, How and Use Case for AIDS Inhibitors," Proceedings of the IEEE Congress on Services, pp. 87-91, Jul 2007.
[12]    T. N. Bhat and J. Barkley, "Development of a Use case for Chemical Resource Description Framework for Acquired Immune Deficiency Syndrome Drug Discovery," The Open Bioinformatics Journal, vol. 2, pp. 20-27, 2008.
[13]    L. Feigenbaum, I. Herman, T. Hongsermeier, E. Neumann, and S. Stephens, "The Semantic Web in Action," Scientific American, vol. 297, pp. 90-97, Dec 2007.
[14]    H. J. Feldman, M. Dumontier, S. Ling, N. Haider, and C. W. Hogue, "CO: A chemical ontology for identification of functional groups and semantic comparison of small molecules," FEBS Lett, vol. 579, pp. 4685-91, Aug 29 2005.
[15]    M. Prasanna, J. Vondrasek, A. Wlodawer, and T. N. Bhat, "Application of InChI to curate, index and query 3-D structures," Proteins, vol. 60, pp. 1-4, 2005.
[16]    P. Murray-Rust, H. S. Rzepa, J. J. Stewart, and Y. Zhang, "A global resource for computational chemistry," J Mol Model (Online), vol. 11, pp. 532-41, Nov 2005.
[17]    A. M. Richard, L. S. Gold, and M. C. Nicklaus, "Chemical structure indexing of toxicity data on the internet: moving toward a flat world," Curr Opin Drug Discov Devel, vol. 9, pp. 314-25, May 2006.
[18]    X. Q. Lewell, D. B. Judd, S. P. Watson, and M. M. Hann, "RECAP–retrosynthetic combinatorial analysis procedure: a powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry," J Chem Inf Comput Sci, vol. 38, pp. 511-22, May-Jun 1998.
