TY - JOUR
T1 - The choice of sequence homologs included in multiple sequence alignments has a dramatic impact on evolutionary conservation analysis
AU - Gil, Nelson
AU - Fiser, Andras
N1 - Funding Information:
This work was supported by National Institutes of Health (NIH) grant R01 GM118709, and the Extreme Science and Engineering Discovery Environment (XSEDE) project (NSF grant ACI-1053575). NG was supported by the National Research Service Award (NRSA) individual fellowship F31GM116570 and the Medical Scientist Training Program (MSTP) grant T32GM007288.
Publisher Copyright:
© 2018 The Author(s). Published by Oxford University Press. All rights reserved.
PY - 2019/1/1
Y1 - 2019/1/1
N2 - Motivation The analysis of sequence conservation patterns has been widely utilized to identify functionally important (catalytic and ligand-binding) protein residues for over a half-century. Despite decades of development, on average state-of-the-art nonlate-based functional residue prediction methods must predict ∼25% of a protein's total residues to correctly identify half of the protein's functional site residues. The overwhelming proportion of false positives results in reported â - F-Scores' of ∼0.3. We investigated the limits of current approaches, focusing on the so-far neglected impact of the specific choice of homologs included in multiple sequence alignments (MSAs). Results The limits of conservation-based functional residue prediction were explored by surveying the binding sites of 1023 proteins. A straightforward conservation analysis of MSAs composed of randomly selected homologs sampled from a PSI-BLAST search achieves average F-Scores of ∼0.3, a performance matching that reported by state-of-the-art methods, which often consider additional features for the prediction in a machine learning setting. Interestingly, we found that a simple combinatorial MSA sampling algorithm will in almost every case produce an MSA with an optimal set of homologs whose conservation analysis reaches average F-Scores of ∼0.6, doubling state-of-the-art performance. We also show that this is nearly at the theoretical limit of possible performance given the agreement between different binding site definitions. Additionally, we showcase the progress in this direction made by Selection of Alignment by Maximal Mutual Information (SAMMI), an information-theory-based approach to identifying biologically informative MSAs. This work highlights the importance and the unused potential of optimally composed MSAs for conservation analysis. Supplementary information.
AB - Motivation The analysis of sequence conservation patterns has been widely utilized to identify functionally important (catalytic and ligand-binding) protein residues for over a half-century. Despite decades of development, on average state-of-the-art nonlate-based functional residue prediction methods must predict ∼25% of a protein's total residues to correctly identify half of the protein's functional site residues. The overwhelming proportion of false positives results in reported â - F-Scores' of ∼0.3. We investigated the limits of current approaches, focusing on the so-far neglected impact of the specific choice of homologs included in multiple sequence alignments (MSAs). Results The limits of conservation-based functional residue prediction were explored by surveying the binding sites of 1023 proteins. A straightforward conservation analysis of MSAs composed of randomly selected homologs sampled from a PSI-BLAST search achieves average F-Scores of ∼0.3, a performance matching that reported by state-of-the-art methods, which often consider additional features for the prediction in a machine learning setting. Interestingly, we found that a simple combinatorial MSA sampling algorithm will in almost every case produce an MSA with an optimal set of homologs whose conservation analysis reaches average F-Scores of ∼0.6, doubling state-of-the-art performance. We also show that this is nearly at the theoretical limit of possible performance given the agreement between different binding site definitions. Additionally, we showcase the progress in this direction made by Selection of Alignment by Maximal Mutual Information (SAMMI), an information-theory-based approach to identifying biologically informative MSAs. This work highlights the importance and the unused potential of optimally composed MSAs for conservation analysis. Supplementary information.
UR - http://www.scopus.com/inward/record.url?scp=85058766219&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85058766219&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/bty523
DO - 10.1093/bioinformatics/bty523
M3 - Article
C2 - 29947739
AN - SCOPUS:85058766219
VL - 35
SP - 12
EP - 19
JO - Bioinformatics
JF - Bioinformatics
SN - 1367-4803
IS - 1
ER -