The Francis Crick Institute

Extreme Evolutionary Disparities Seen in Positive Selection across Seven Complex Diseases

Author: A Di Rienzo
Atul J. Butte
BF Voight
BM Rothschild
CC Babbitt
CC Spencer
EC Walsh
Erik Corona
EY Fung
FG Jimenez
J Askling
J Maddox
JB Kirsner
JD Wall
JL Mobley
Joel T. Dudley
John Hawks
K Ding
K Zhang
KJ Guinan
L Arbiza
LM Chevin
M Currat
M Sirota
MA Schaub
N Patil
PC Sabeti
PD Thomas
R Blekhman
R Plomin
S Fabre
S Myles
S Nejentsev
S Podder
SG Filler
SJ Richardson
ST Sherry
V Butty
Publication venue: Public Library of Science
Publication date: 17/08/2010

Positive selection is known to occur when the environment that an organism inhabits is suddenly altered, as has been the case across recent human history. Genome-wide association studies (GWASs) have successfully illuminated disease-associated variation. However, whether human evolution is heading towards or away from disease susceptibility in general remains an open question. The genetic basis of common complex disease may be caused in part by positive selection events that simultaneously increased fitness and susceptibility to disease. We analyze seven diseases studied by the Wellcome Trust Case Control Consortium to compare evidence for selection at every locus associated with disease. We take a large set of the most strongly associated SNPs in each GWA study in order to capture more hidden associations, at the cost of introducing false positives into our analysis. We then search for signs of positive selection in this inclusive set of SNPs. There are striking differences between the seven studied diseases. We find that alleles increasing susceptibility to Type 1 Diabetes (T1D), Rheumatoid Arthritis (RA), and Crohn's Disease (CD) underwent recent positive selection. There is more selection in alleles increasing, rather than decreasing, susceptibility to T1D. Among the 80 SNPs most associated with T1D (p-value < 7.01×10⁻⁵) that show strong signs of positive selection, the selected allele is associated with disease susceptibility in 58 and with disease protection in only 22. Alleles increasing susceptibility to RA are under selection as well. In contrast, selection in SNPs associated with CD favors protective alleles. These results inform the current understanding of disease etiology, shed light on potential benefits associated with the genetic basis of disease, and aid in the efforts to identify causal genetic factors underlying complex disease.
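
As a rough illustration of the 58-versus-22 split reported in the abstract above, the following sketch runs an exact two-sided binomial (sign) test against an even split. The choice of test is an assumption made here for illustration: the abstract does not say which statistical test the authors used, and nearby SNPs are unlikely to be independent, so the resulting p-value is indicative only.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int) -> float:
    """Exact two-sided binomial p-value under a null of p = 0.5.

    With p = 0.5 the distribution is symmetric, so the p-value is the
    probability of any count at least as far from trials / 2 as observed.
    """
    expected = trials / 2
    deviation = abs(successes - expected)
    tail = sum(
        comb(trials, k)
        for k in range(trials + 1)
        if abs(k - expected) >= deviation
    )
    return tail / 2 ** trials


# Counts taken from the abstract; everything else is illustrative.
risk_alleles = 58
protective_alleles = 22
n = risk_alleles + protective_alleles

fraction_risk = risk_alleles / n
p_value = binomial_two_sided_p(risk_alleles, n)

print(f"{risk_alleles}/{n} selected alleles increase risk ({fraction_risk:.1%})")
print(f"two-sided sign-test p-value under independence: {p_value:.2e}")
```

The sign test is deliberately the simplest possible choice; a careful analysis would need to account for linkage disequilibrium between neighbouring SNPs.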

Public Library of Science (PLOS)

Retrieving sequences of enzymes experimentally characterized but erroneously annotated: the case of the putrescine carbamoyltransferase

Author: A Bairoch
A Sekowska
B Barcelona-Andres
B Labedan
B Wargnies
C Tricot
C Vander Wauven
GH Gonnet
I Paulsen
I Schomburg
J Felsenstein
JA Gerlt
JP Simon
L Grivell
M Kanehisa
M Zuniga
PC Babbitt
PD Karp
R Apweiler
R Cunin
RJ Roon
S Dashuang
SE Brenner
T Janowitz
TA Hall
The Gene Ontology Consortium
V Stalon
Y Nakada
Publication venue: BioMed Central
Publication date: 01/01/2004

BACKGROUND: Annotating genomes remains a hazardous task. Mistakes or gaps in such a complex process may occur when relevant knowledge is ignored, whether lost, forgotten or overlooked. This paper exemplifies an approach that could help to resuscitate such meaningful data. RESULTS: We show that a set of closely related sequences that have been annotated as ornithine carbamoyltransferases are actually putrescine carbamoyltransferases. This demonstration is based on the following points: (i) use of enzymatic data that had been overlooked, (ii) rediscovery of a short NH2-terminal sequence that allows a wrongly annotated ornithine carbamoyltransferase to be reannotated as a putrescine carbamoyltransferase, (iii) identification of conserved motifs that distinguish unambiguously between the two kinds of carbamoyltransferases, and (iv) comparative study of the gene context of these different sequences. CONCLUSIONS: We explain why this specific case of misannotation had not yet been described and draw attention to the fact that analogous instances must be rather frequent. We urge special caution when high sequence similarity is coupled with an apparent lack of biochemical information. Moreover, from the point of view of genome annotation, proteins that have been studied experimentally but are not correlated with sequence data in current databases qualify as "orphans", just as unassigned genomic open reading frames do. The strategy we used in this paper to bridge such gaps in knowledge could work whenever it is possible to collect a body of facts about experimental data, homology, unnoticed sequence data, and accurate information about gene context.

DI-fusion

ComPath: comparative enzyme analysis and annotation in pathway/subsystem contexts

Author: A Andreeva
A Bateman
A Marchler-Bauer
A Osterman
AL Barabási
C Gene Ontology
C The UniProt
CJA Sigrist
CM Zmasek
DA Benson
DH Haft
HM Berman
HW Ma
J Wu
K Choi
Kwangmin Choi
L Pireddu
M Kanehisa
M Madera
N Hulo
P Stothard
PC Babbitt
PD Karp
R Caspi
R Overbeek
RA George
S Kim
SCH Pegg
SF Altschul
Sun Kim
V BATAGELJL
VM Markowitz
W Thompson
WR Pearson
Y Ye
Y Zheng
YI Wolf
Publication venue: BioMed Central
Publication date: 01/03/2008

Background: Once a new genome is sequenced, one of the important questions is which biological pathways are present and which are absent. Analysis of biological pathways in a genome is a complicated task, since a number of biological entities are involved in pathways and biological pathways in different organisms are not identical. Computational pathway identification and analysis thus involves a number of computational tools and databases and is typically done in comparison with pathways in other organisms. This computational requirement is far beyond the capability of biologists, so information systems for reconstructing, annotating, and analyzing biological pathways are much needed. We introduce a new comparative pathway analysis workbench, ComPath, which integrates various resources and computational tools using an interactive spreadsheet-style web interface for reliable pathway analyses. Results: ComPath allows users to compare biological pathways in multiple genomes using a spreadsheet-style web interface where various sequence-based analyses can be performed to compare enzymes (e.g. sequence clustering) and pathways (e.g. pathway hole identification), to search a genome for de novo prediction of enzymes, or to annotate a genome in comparison with reference genomes of choice. To fill in pathway holes or make de novo enzyme predictions, multiple computational methods such as FASTA, Whole-HMM, CSR-HMM (a method of our own introduced in this paper), and PDB-domain search are integrated in ComPath. Our experiments show that FASTA and CSR-HMM search methods generally outperform Whole-HMM and PDB-domain search methods in terms of sensitivity, but FASTA search performs poorly in terms of specificity, detecting more false positives as the E-value cutoff increases. Overall, the CSR-HMM search method performs best in terms of both sensitivity and specificity. Gene neighborhood and pathway neighborhood (global network) visualization tools can be used to get context information that is complementary to the conventional KEGG map representation. Conclusion: ComPath is an interactive workbench for pathway reconstruction, annotation, and analysis where experts can perform various sequence, domain, and context analyses using an intuitive and interactive spreadsheet-style interface.
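
To make the idea of "pathway hole identification" concrete, here is a minimal sketch that treats a pathway hole as an enzyme required by a reference pathway but missing from a genome's annotation. It is not ComPath's implementation: the function name, the pathway key, and the toy annotation set are all hypothetical, and the EC numbers are used only as example identifiers.

```python
def find_pathway_holes(
    reference_pathways: dict[str, set[str]],
    genome_ec_numbers: set[str],
) -> dict[str, list[str]]:
    """Return, per pathway, the EC numbers the genome annotation lacks."""
    holes: dict[str, list[str]] = {}
    for pathway, required_ecs in reference_pathways.items():
        missing = sorted(required_ecs - genome_ec_numbers)
        if missing:
            holes[pathway] = missing
    return holes


# Hypothetical reference pathway: the first steps of glycolysis.
reference_pathways = {
    "glycolysis_upper": {"2.7.1.1", "5.3.1.9", "2.7.1.11", "4.1.2.13"},
}

# Hypothetical genome annotation that lacks a 6-phosphofructokinase call.
genome_ec_numbers = {"2.7.1.1", "5.3.1.9", "4.1.2.13"}

holes = find_pathway_holes(reference_pathways, genome_ec_numbers)
print(holes)
```

Each reported hole would then be a candidate for a follow-up sequence search, such as the FASTA or HMM-based searches the abstract mentions.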

Public Library of Science (PLOS)

Quantitative comparison of catalytic mechanisms and overall reactions in convergently evolved enzymes: implications for classification of enzyme function

Author: Daniel E. Almonacid
Emmanuel R. Yera
John B. O. Mitchell
Patricia C. Babbitt
Christine A. Orengo
RA George
JA Gerlt
PC Babbitt
JA Gerlt
ME Glasner
GJ Bartlett
MY Galperin
H Hegyi
ACR Martin
PF Gherardini
AE Todd
Z Zhang
S Tsoka
SA Teichmann
TD Otto
NH Horowitz
RA Jensen
GA Petsko
RA Chiang
CS Wright
BW Matthews
SC Morris
N Nagano
CT Porter
C Andreini
GL Holliday
SC-H Pegg
N Nagano
AG Murzin
G Ausiello
H Berman
NM O'Boyle
P Jaccard
P Willett
TF Smith
CA Orengo
AG McDonald
ACR Martin
SB Needleman
M Kotera
Y Yamanishi
GL Holliday
L Maveyraud
J Pitarch
R Castillo
JC Hermann
Z Wang
AP Leech
DG Gourley
AW Roszak
LM Blomberg
A Matte
MM Benning
PC Babbitt
DL Scott
BW Segelke
YS Ho
ME Lowe
Y Bourne
RJ Kazlauskas
H van Tilbeurgh
GJ Bartlett
T Nakai
A Zajc
E Ortlund
AB Hickman
TJ Wyckoff
CR Sweet
W Plaga
V Leppänen
F Himo
M Mathieu
Y Modis
RF Doolittle
V Sangar
BE Shakhnovich
PW Lord
J Gasteiger
O Sacher
DARS Latino
Y Loewenstein
T Dandekar
R Overbeek
AJ Enright
EM Marcotte
M Pellegrini
JC Hermann
ME Glasner
L Jiang
M Bashton
K Tipton
KA Olszewski
R Preißner
J Park
PR Mittl
S Lorenzen
A Fersht
M Errami
EO Cannon
M Rizzi
J Symersky
Y Devedjiev
AR Battersby
N Frankenberg
MA Mathews
HL Schubert
B Stec
L Ma
JE Murphy
EB Fauman
RH Hoff
F Wang
Publication venue
Publication date: 01/01/2010

The authors thank the National Institutes of Health (NIH R01 GM60595 to PCB) and the Scottish Universities Life Sciences Alliance (SULSA to JBOM) for funding. Functionally analogous enzymes are those that catalyze similar reactions on similar substrates but do not share common ancestry, providing a window on the different structural strategies nature has used to evolve required catalysts. Identification and use of this information to improve reaction classification and computational annotation of enzymes newly discovered in the genome projects would benefit from systematic determination of reaction similarities. Here, we quantified similarity in bond changes for overall reactions and catalytic mechanisms for 95 pairs of functionally analogous enzymes (non-homologous enzymes with identical first three numbers of their EC codes) from the MACiE database. Similarity of overall reactions was computed by comparing the sets of bond changes in the transformations from substrates to products. For similarity of mechanisms, sets of bond changes occurring in each mechanistic step were compared; these similarities were then used to guide global and local alignments of mechanistic steps. Using this metric, only 44% of pairs of functionally analogous enzymes in the dataset had significantly similar overall reactions. For these enzymes, convergence to the same mechanism occurred in 33% of cases, with most pairs having at least one identical mechanistic step. Using our metric, overall reaction similarity serves as an upper bound for mechanistic similarity in functional analogs. For example, the four carbon-oxygen lyases acting on phosphates (EC 4.2.3) show neither significant overall reaction similarity nor significant mechanistic similarity. By contrast, the three carboxylic-ester hydrolases (EC 3.1.1) catalyze overall reactions with identical bond changes and have converged to almost identical mechanisms. The large proportion of enzyme pairs that do not show significant overall reaction similarity (56%) suggests that, at least for the functionally analogous enzymes studied here, more stringent criteria could be used to refine definitions of EC sub-subclasses for improved discrimination in their classification of enzyme reactions. The results also indicate that mechanistic convergence of reaction steps is widespread, suggesting that quantitative measurement of mechanistic similarity can inform approaches for functional annotation.
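
The comparison of bond-change sets described in this abstract can be sketched as follows. This is an assumption-laden illustration, not the authors' method: the abstract does not name the similarity coefficient or the alignment scoring, so the Jaccard coefficient, the Needleman-Wunsch-style global alignment, the gap penalty, and the bond-change labels are all choices made here.

```python
BondChange = tuple[str, str]  # (bond type, kind of change), hypothetical labels


def jaccard(a: set[BondChange], b: set[BondChange]) -> float:
    """Jaccard coefficient of two bond-change sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def global_alignment_score(
    mechanism_a: list[set[BondChange]],
    mechanism_b: list[set[BondChange]],
    gap_penalty: float = 0.5,  # assumed value, not taken from the source
) -> float:
    """Needleman-Wunsch-style score for aligning two lists of mechanistic steps."""
    rows, cols = len(mechanism_a) + 1, len(mechanism_b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] - gap_penalty
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] - gap_penalty
    for i in range(1, rows):
        for j in range(1, cols):
            step_similarity = jaccard(mechanism_a[i - 1], mechanism_b[j - 1])
            score[i][j] = max(
                score[i - 1][j - 1] + step_similarity,  # align the two steps
                score[i - 1][j] - gap_penalty,          # gap in mechanism_b
                score[i][j - 1] - gap_penalty,          # gap in mechanism_a
            )
    return score[-1][-1]


# Toy overall reactions; the labels are illustrative, not MACiE annotations.
ester_hydrolysis = {("C-O", "cleaved"), ("O-H", "cleaved"),
                    ("C-O", "formed"), ("O-H", "formed")}
same_bond_changes = set(ester_hydrolysis)
unrelated = {("C-O", "cleaved"), ("C-C", "order changed"), ("C-H", "cleaved")}

print(jaccard(ester_hydrolysis, same_bond_changes))
print(jaccard(ester_hydrolysis, unrelated))

# Toy two-step mechanism aligned against itself.
step_1 = {("C-O", "formed"), ("O-H", "cleaved")}
step_2 = {("C-O", "cleaved"), ("O-H", "formed")}
print(global_alignment_score([step_1, step_2], [step_1, step_2]))
```

A local alignment, which the abstract also mentions, would follow the same pattern with scores floored at zero and the best cell anywhere in the matrix taken as the result.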

University of St. Andrews - Pure

The Francis Crick Institute

St Andrews Research Repository

Is EC class predictable from reaction mechanism?

Author: A Statnikov
AG McDonald
BV Dasarathy
BW Matthews
C Andreini
DARS Latino
DE Almonacid
GL Holliday
IUBMB
J Gorodkin
J Menke
John BO Mitchell
JW Torrance
K Astikainen
KM Borgwardt
L Breiman
L De Ferrari
LD Hughes
M Aizerman
M Kanehisa
M Leber
N Furnham
N Nagano
Neetika Nath
NM O'Boyle
O Sacher
PC Babbitt
PD Dobson
R Lowe
RD Uriarte
SA Rahman
SCH Pegg
T Bray
T Bylander
V Egelhofer
VN Vapnik
WS Noble
X Hu
Y Yamanishi
Publication venue: BioMed Central
Publication date: 01/01/2012

We thank the Scottish Universities Life Sciences Alliance (SULSA) and the Scottish Overseas Research Student Awards Scheme of the Scottish Funding Council (SFC) for financial support. Background: We investigate the relationships between the EC (Enzyme Commission) class, the associated chemical reaction, and the reaction mechanism by building predictive models using Support Vector Machine (SVM), Random Forest (RF) and k-Nearest Neighbours (kNN). We consider two ways of encoding the reaction mechanism in descriptors, and also three approaches that encode only the overall chemical reaction. Both cross-validation and an external test set are used. Results: The three descriptor sets encoding overall chemical transformation perform better than the two descriptions of mechanism. SVM and RF models perform comparably well; kNN is less successful. Oxidoreductases and hydrolases are relatively well predicted by all types of descriptor; isomerases are well predicted by overall reaction descriptors but not by mechanistic ones. Conclusions: Our results suggest that pairs of similar enzyme reactions tend to proceed by different mechanisms. Oxidoreductases, hydrolases, and to some extent isomerases and ligases, have clear chemical signatures, making them easier to predict than transferases and lyases. We find evidence that isomerases as a class are notably mechanistically diverse and that their one shared property, of substrate and product being isomers, can arise in various unrelated ways. The performance of the different machine learning algorithms is in line with many cheminformatics applications, with SVM and RF being roughly equally effective. kNN is less successful, given the role that non-local information plays in successful classification. We note also that, despite a lack of clarity in the literature, EC number prediction is not a single problem; the challenge of predicting protein function from available sequence data is quite different from assigning an EC classification from a cheminformatics representation of a reaction.

University of St. Andrews - Pure

St Andrews Research Repository

Target selection and annotation for the structural genomics of the amidohydrolase and enolase superfamilies

Author: A Andreeva
A Sakai
A Weeks
AE Todd
Andrej Sali
C Nowlan
CH Wu
CM Seibert
D Vitkup
DA Benson
DL Wheeler
EF Pettersen
F Melo
Frank M. Raushel
H Berman
HJ Imker
J Akana
J Gough
J Lee
J. Michael Sauder
JA Gerlt
JB Bonanno
JB Thoden
JC Hermann
JC Norvell
JC Venter
JE Vick
Jeffrey B. Bonanno
Jennifer J. Seffernick
JF Rakus
JJ Irwin
John A. Gerlt
L Holm
L Song
L Williams
Libusha Kelly
Margaret E. Glasner
Mark R. Chance
Matthew P. Jacobson
ME Glasner
N Eswar
N Nagano
Narayanan Eswar
P Shannon
Patricia C. Babbitt
PC Babbitt
R Marti-Arbona
R Sanchez
R Tyagi
Ranyee Chiang
RS Hall
RZ Liao
SC Almo
SC Pegg
SD Brown
SF Altschul
Shoshana D. Brown
SL Schafer
Stephen K. Burley
Steven C. Almo
Subramanyam Swaminathan
TN Porter
TT Nguyen
U Pieper
Ursula Pieper
WS Yew
Xiaojing Zheng
Y Li
Publication venue: Springer Netherlands
Publication date: 01/01/2009

To study the substrate specificity of enzymes, we use the amidohydrolase and enolase superfamilies as model systems; members of these superfamilies share a common TIM barrel fold and catalyze a wide range of chemical reactions. Here, we describe a collaboration between the Enzyme Specificity Consortium (ENSPEC) and the New York SGX Research Center for Structural Genomics (NYSGXRC) that aims to maximize the structural coverage of the amidohydrolase and enolase superfamilies. Using sequence- and structure-based protein comparisons, we first selected 535 target proteins from a variety of genomes for high-throughput structure determination by X-ray crystallography; 63 of these targets were not previously annotated as superfamily members. To date, 20 unique amidohydrolase and 41 unique enolase structures have been determined, increasing the fraction of sequences in the two superfamilies that can be modeled based on at least 30% sequence identity from 45% to 73%. We present case studies of proteins related to uronate isomerase (an amidohydrolase superfamily member) and mandelate racemase (an enolase superfamily member) to illustrate how this structure-focused approach can be used to generate hypotheses about sequence–structure–function relationships.

The LabelHash algorithm for substructure matching

Background: There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity. Results: We present LabelHash, a novel algorithm for matching substructural motifs to large collections of protein structures. The algorithm consists of two phases. In the first phase the proteins are preprocessed in a fashion that allows for instant lookup of partial matches to any motif. In the second phase, partial matches for a given motif are expanded to complete matches. The general applicability of the algorithm is demonstrated with three different case studies. First, we show that we can accurately identify members of the enolase superfamily with a single motif. Next, we demonstrate how LabelHash can complement SOIPPA, an algorithm for motif identification and pairwise substructure alignment. Finally, a large collection of Catalytic Site Atlas motifs is used to benchmark the performance of the algorithm. LabelHash runs very efficiently in parallel; matching a motif against all proteins in the 95% sequence identity filtered non-redundant Protein Data Bank typically takes no more than a few minutes. The LabelHash algorithm is available through a web server and as a suite of standalone programs a
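
The two-phase idea in this abstract (index partial matches once, then expand them for a given motif) can be sketched with an ordinary hash table. This is a heavily simplified toy and not the LabelHash algorithm itself: the pair-based index, the single distance cutoff, the function names, and the example structures are all assumptions made here for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import dist

# A structure is a list of (residue label, (x, y, z)) entries; toy data only.
structures = {
    "protA": [("LYS", (0, 0, 0)), ("ASP", (3, 0, 0)),
              ("GLU", (0, 3, 0)), ("HIS", (20, 20, 20))],
    "protB": [("LYS", (0, 0, 0)), ("ASP", (3, 0, 0)), ("GLU", (30, 0, 0))],
}


def build_index(structures, cutoff):
    """Phase 1: hash every nearby residue pair by its sorted label pair."""
    index = defaultdict(list)
    for name, residues in structures.items():
        for i, j in combinations(range(len(residues)), 2):
            (label_i, xyz_i), (label_j, xyz_j) = residues[i], residues[j]
            if dist(xyz_i, xyz_j) <= cutoff:
                index[tuple(sorted((label_i, label_j)))].append((name, i, j))
    return index


def match_motif(motif_labels, structures, index, cutoff):
    """Phase 2: look up pair matches, then expand them to complete matches."""
    seed_key = tuple(sorted(motif_labels[:2]))
    matches = set()
    for name, i, j in index.get(seed_key, []):
        residues = structures[name]
        partial = [[i, j]]
        for label in motif_labels[2:]:
            extended = []
            for chosen in partial:
                for k, (label_k, xyz_k) in enumerate(residues):
                    if k in chosen or label_k != label:
                        continue
                    # Keep the residue only if it is close to all chosen ones.
                    if all(dist(xyz_k, residues[c][1]) <= cutoff for c in chosen):
                        extended.append(chosen + [k])
            partial = extended
        matches.update((name, tuple(sorted(m))) for m in partial)
    return sorted(matches)


cutoff = 5.0  # assumed distance cutoff in arbitrary units
index = build_index(structures, cutoff)
matches = match_motif(["LYS", "ASP", "GLU"], structures, index, cutoff)
print(matches)
```

The point of the split is that the index is built once per structure collection, so each new motif only pays for a dictionary lookup plus the expansion of the hits it returns.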

CiteSeerX

TSpace (University of Toronto)

clusterMaker: a multi-algorithm clustering plugin for Cytoscape

Background: In the post-genomic era, the rapid increase in high-throughput data calls for computational tools capable of integrating data of diverse types and facilitating recognition of biologically meaningful patterns within them. For example, protein-protein interaction data sets have been clustered to identify stable complexes, but scientists lack easily accessible tools to facilitate combined analyses of multiple data sets from different types of experiments. Here we present clusterMaker, a Cytoscape plugin that implements several clustering algorithms and provides network, dendrogram, and heat map views of the results. The Cytoscape network is linked to all of the other views, so that a selection in one is immediately reflected in the others. clusterMaker is the first Cytoscape plugin to implement such a wide variety of clustering algorithms and visualizations, including the only implementations of hierarchical clustering, dendrogram plus heat map visualization (tree view), k-means, k-medoid, SCPS, AutoSOME, and native (Java) MCL. Results: Results are presented in the form of three scenarios of use: analysis of protein expression data using a recently published mouse interactome and a mouse microarray data set of nearly one hundred diverse cell/tissue types; the identification of protein complexes in the yeast Saccharomyces cerevisiae; and the cluster analysis of the vicinal oxygen chelate (VOC) enzyme superfamily. For scenario one, we explore functionally enriched mouse interactomes specific to particular cellular phenotypes and apply fuzzy clustering. For scenario two, we explore the prefoldin complex in detail using both physical and genetic interaction clusters. For scenario three, we explore the possible annotation of a protein as a methylmalonyl-CoA epimerase within the VOC superfamily. Cytoscape session files for all three scenarios are provided in the Additional Files section. Conclusions: The Cytoscape plugin clusterMaker provides a number of clustering algorithms and visualizations that can be used independently or in combination for analysis and visualization of biological data sets, and for confirming or generating hypotheses about biological function. Several of these visualizations and algorithms are only available to Cytoscape users through the clusterMaker plugin. clusterMaker is available via the Cytoscape plugin manager.