Data mining methods for ASsessing Similarity of molecular structUres in MEtric spaces



Data mining still faces a dilemma when dealing with unstructured data that cannot easily be described through a finite set of attributes in a vector space. Examples include video and audio streams and even text documents. For many such types of data, the only option is to work in metric spaces. A metric space allows instances to be compared through a given similarity metric, even though it has no definite coordinate reference system: even when fixed attributes or coordinates cannot be assigned to data instances, it may still be possible to compare them. The question is then how to define the comparison function, that is, the similarity metric. Unfortunately, unlike with structured data, there is no definitive answer as to the best way to encode or compare instances. How similar are two songs? How similar are any two documents? These are typical questions that researchers face.
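
As a minimal sketch of what working in a metric space means, the Python example below compares text instances using only a distance function, with no attribute vectors or coordinates. The choice of Levenshtein edit distance, the helper names (`edit_distance`, `is_metric_on`, `nearest`) and the sample strings are assumptions made for illustration; they are not taken from the project.

```python
from itertools import permutations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def is_metric_on(items, dist) -> bool:
    """Check the metric axioms on a finite sample of distinct items."""
    if any(dist(x, x) != 0 for x in items):
        return False  # identity fails
    for x, y in permutations(items, 2):
        if dist(x, y) <= 0 or dist(x, y) != dist(y, x):
            return False  # positivity or symmetry fails
    for x, y, z in permutations(items, 3):
        if dist(x, z) > dist(x, y) + dist(y, z):
            return False  # triangle inequality fails
    return True


def nearest(query, items, dist):
    """Nearest-neighbour lookup that needs nothing but pairwise distances."""
    return min(items, key=lambda item: dist(query, item))


# Hypothetical sample data
documents = ["kitten", "sitting", "mitten", "fitting"]
print(edit_distance("kitten", "sitting"))
print(is_metric_on(documents, edit_distance))
print(nearest("bitten", documents, edit_distance))
```

The point of the sketch is that `nearest` never inspects the instances themselves: any data type for which a suitable distance can be defined, including molecules, can be searched in the same way.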

Molecules, whether proteins or organic compounds, are typical examples. A molecule can have arbitrary size, structure and composition, and it is known that similar compounds often share biological or physical properties. However, there is no single, unambiguous way of encoding and comparing these compounds. Several alternatives exist and several have provided good results for specific domains; however, to the best of our knowledge, there has never been a systematic evaluation of existing techniques in different application contexts. The purpose of this project is to perform this evaluation in a rigorous and unbiased way, by testing the most promising and proven techniques against a set of curated databases for different domains.

Scientifically and economically, this is a very pertinent issue. Understanding the biological activity of an organic compound or protein is a costly enterprise that can only be carried out through expensive and time-consuming laboratory work. This is critical for the pharmaceutical and biological industries, which spend a considerable part of their research budget searching for adequate compounds and determining protein function.

Several computational tools have been developed over the years to address this issue. As it is difficult to apply the most common machine learning methods to unstructured data, the majority of researchers have turned to metric space mining, analyzing similarities between compounds, which are then expected to share similar functional annotations. Although this holds to an extent, two difficulties need to be resolved, namely a) understanding what is the best way to encode biological compounds and b) understanding which are the most effective ways of measuring compound similarity under a specific encoding. The literature offers several approaches, both in bioinformatics and cheminformatics, that have gained prominence. However, several problems are not adequately handled by such algorithms, which only provide good results and fast answers when the similarity is very high. Other methods compare structures and are known to be more robust in many cases, but determining chemical structures is often a costly and time-demanding endeavor, and structures are not available for most known macromolecules. When mining simpler compounds for functional annotation the situation is the same; however, such molecules can be represented as adjacency graphs, and those graphs can be mined for comparison. Graph comparison is nevertheless a difficult problem in itself: graph isomorphism, for example, is not known to be solvable in polynomial time, nor is it known to be NP-complete. A number of approaches nevertheless exist, based on decomposing the graph into linear sequences that define binary fingerprints, bit string representations of molecular structure that can be compared using well-known statistical metrics.
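
The fingerprint idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of any particular toolkit or of this project: molecules are given as hydrogen-suppressed graphs (a list of element symbols plus a list of bonds), bond orders are ignored, paths are limited to four atoms, each path is hashed with CRC32 into a 256-bit string, and fingerprints are compared with the Tanimoto coefficient. All function names and parameters (`linear_paths`, `fingerprint`, `tanimoto`, `n_bits`, `max_atoms`) are hypothetical.

```python
import zlib


def linear_paths(atoms, bonds, max_atoms=4):
    """Enumerate simple linear paths of up to max_atoms atoms,
    written as strings of element symbols such as 'C-C-O'."""
    neighbours = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    paths = set()

    def extend(path):
        forward = "-".join(atoms[i] for i in path)
        backward = "-".join(atoms[i] for i in reversed(path))
        paths.add(min(forward, backward))  # same path regardless of direction
        if len(path) < max_atoms:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:  # keep paths simple (no repeated atoms)
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return paths


def fingerprint(atoms, bonds, n_bits=256, max_atoms=4):
    """Hash every linear path into one position of an n_bits-wide bit string,
    stored as a Python int. Distinct paths may collide on the same bit."""
    bits = 0
    for path in linear_paths(atoms, bonds, max_atoms):
        bits |= 1 << (zlib.crc32(path.encode()) % n_bits)
    return bits


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: bits set in both / bits set in either."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 1.0  # two empty fingerprints are treated as identical
    return bin(fp_a & fp_b).count("1") / union


# Hypothetical hydrogen-suppressed graphs: (element symbols, bonds)
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propanol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
propane = (["C", "C", "C"], [(0, 1), (1, 2)])

fp_ethanol = fingerprint(*ethanol)
fp_propanol = fingerprint(*propanol)
fp_propane = fingerprint(*propane)

print(sorted(linear_paths(*ethanol)))
print(tanimoto(fp_ethanol, fp_propanol))
print(tanimoto(fp_ethanol, fp_propane))
```

Comparing fingerprints avoids solving a graph matching problem directly, at the cost of losing information: different molecules can produce the same bit string. One useful property of this choice is that 1 minus the Tanimoto coefficient on bit sets (the Jaccard distance) satisfies the metric axioms, so the fingerprints can be used with metric space methods.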

As stated above, this project will provide a systematic comparison of the most relevant methods for determining structural similarity metrics for biological compounds. Unlike other studies, the purpose is not to review existing techniques, but to apply them to distinct, specific and well-described datasets. It will define contexts of use according to biological activity or physical properties, namely:

  1. A local and thoroughly annotated subset of enzyme classes;
  2. The ThermInfo database, an annotated database of circa 3000 simple organic compounds fully characterized in terms of their chemical and physical properties;
  3. The Shulgin Set, which describes two very strict sets of compounds with detailed physical, chemical and biological activity information for about 500 molecules.

Each of these sets presents separate challenges for testing the different assumptions in this proposal.

Research Team

Current Team

Past Members


  • Period: 1-Jan-2010 to 31-Dec-2014
  • Funding:
    • SFRH/BD/29797/2006, Doctoral research scholarship for Daniel Faria
    • SFRH/BD/64487/2009, Doctoral research scholarship for Ana Teixeira
    • MOTIVE contract from 1-Jan-2012 until 31-Dec-2014
    • LaSIGE - Pluriannual funding



Ana L. Teixeira, Andre O. Falcao 2013: Noncontiguous atom matching structural similarity function. Journal of Chemical Information and Modeling 53(10), 2511-2524.

Ana L. Teixeira, João P. Leal, Andre O Falcao, Improving QSPR models for predicting standard enthalpy of formation with a hybrid approach for feature selection (as a poster). CQB - Day 2013, p. 70, Faculdade de Ciências da Universidade de Lisboa, Portugal, July, 2013.

Ana L. Teixeira, João P. Leal, Andre O. Falcao, Automated Identification and Classification of Stereochemistry: Chirality and Double Bond Stereoisomerism. Technical Report. LaSIGE, Department of Informatics, Faculty of Sciences, University of Lisbon, February 2013.

Ana L. Teixeira, João P Leal, Andre O Falcao 2013: Random forests for feature selection in QSPR Models - an application for predicting standard enthalpy of formation of hydrocarbons. Journal of Cheminformatics.

Andre O Falcao, Luis Pinheiro, Ana L. Teixeira, B3Info -- An information system for molecular Blood-Brain Barrier penetration data (as a poster). Beating the Blood-Brain and other Blood Barriers, p. 57, Lisbon, Portugal, February, 2013.

Ana L. Teixeira, Andre O Falcao, Improving Blood Brain Barrier Penetration in silico Models with a Hybrid Approach for Descriptor Selection (as a poster). Beating the Blood-Brain and other Blood Barriers, p. 40, Lisbon, Portugal, February, 2013.

Daniel Faria, Andreas Schlicker, Catia Pesquita, Hugo Bastos, Antonio E N Ferreira, Mario Albrecht, Andre Falcao 2012: Mining GO Annotations for Improving Annotation Consistency. PLoS ONE 7(7), e40519.

Inês Martins, Ana Teixeira, Luís Pinheiro, Andre O. Falcao 2012: A Bayesian approach to in silico blood-brain barrier penetration modeling. Journal of Chemical Information and Modeling 52(6), 1686-1697.
