91 research outputs found

    Biases in the Experimental Annotations of Protein Function and their Effect on Our Understanding of Protein Function Space

    Get PDF
    The ongoing functional annotation of proteins relies upon the work of curators to capture experimental findings from scientific literature and apply them to protein sequence and structure data. However, with the increasing use of high-throughput experimental assays, a small number of experimental studies dominate the functional protein annotations collected in databases. Here we investigate just how prevalent is the "few articles -- many proteins" phenomenon. We examine the experimentally validated annotation of proteins provided by several groups in the GO Consortium, and show that the distribution of proteins per published study is exponential, with 0.14% of articles providing the source of annotations for 25% of the proteins in the UniProt-GOA compilation. Since each of the dominant articles describes the use of an assay that can find only one function or a small group of functions, this leads to substantial biases in what we know about the function of many proteins. Mass-spectrometry, microscopy and RNAi experiments dominate high throughput experiments. Consequently, the functional information derived from these experiments is mostly of the subcellular location of proteins, and of the participation of proteins in embryonic developmental pathways. For some organisms, the information provided by different studies overlap by a large amount. We also show that the information provided by high throughput experiments is less specific than those provided by low throughput experiments. Given the experimental techniques available, certain biases in protein function annotation due to high-throughput experiments are unavoidable. Knowing that these biases exist and understanding their characteristics and extent is important for database curators, developers of function annotation programs, and anyone who uses protein function annotation data to plan experiments.Comment: Accepted to PLoS Computational Biology. Press embargo applies. v4: text corrected for style and supplementary material inserte

    Extreme Evolutionary Disparities Seen in Positive Selection across Seven Complex Diseases

    Get PDF
    Positive selection is known to occur when the environment that an organism inhabits is suddenly altered, as is the case across recent human history. Genome-wide association studies (GWASs) have successfully illuminated disease-associated variation. However, whether human evolution is heading towards or away from disease susceptibility in general remains an open question. The genetic-basis of common complex disease may partially be caused by positive selection events, which simultaneously increased fitness and susceptibility to disease. We analyze seven diseases studied by the Wellcome Trust Case Control Consortium to compare evidence for selection at every locus associated with disease. We take a large set of the most strongly associated SNPs in each GWA study in order to capture more hidden associations at the cost of introducing false positives into our analysis. We then search for signs of positive selection in this inclusive set of SNPs. There are striking differences between the seven studied diseases. We find alleles increasing susceptibility to Type 1 Diabetes (T1D), Rheumatoid Arthritis (RA), and Crohn's Disease (CD) underwent recent positive selection. There is more selection in alleles increasing, rather than decreasing, susceptibility to T1D. In the 80 SNPs most associated with T1D (p-value <7.01×10−5) showing strong signs of positive selection, 58 alleles associated with disease susceptibility show signs of positive selection, while only 22 associated with disease protection show signs of positive selection. Alleles increasing susceptibility to RA are under selection as well. In contrast, selection in SNPs associated with CD favors protective alleles. These results inform the current understanding of disease etiology, shed light on potential benefits associated with the genetic-basis of disease, and aid in the efforts to identify causal genetic factors underlying complex disease

    Retrieving sequences of enzymes experimentally characterized but erroneously annotated : the case of the putrescine carbamoyltransferase

    Get PDF
    BACKGROUND: Annotating genomes remains an hazardous task. Mistakes or gaps in such a complex process may occur when relevant knowledge is ignored, whether lost, forgotten or overlooked. This paper exemplifies an approach which could help to ressucitate such meaningful data. RESULTS: We show that a set of closely related sequences which have been annotated as ornithine carbamoyltransferases are actually putrescine carbamoyltransferases. This demonstration is based on the following points : (i) use of enzymatic data which had been overlooked, (ii) rediscovery of a short NH(2)-terminal sequence allowing to reannotate a wrongly annotated ornithine carbamoyltransferase as a putrescine carbamoyltransferase, (iii) identification of conserved motifs allowing to distinguish unambiguously between the two kinds of carbamoyltransferases, and (iv) comparative study of the gene context of these different sequences. CONCLUSIONS: We explain why this specific case of misannotation had not yet been described and draw attention to the fact that analogous instances must be rather frequent. We urge to be especially cautious when high sequence similarity is coupled with an apparent lack of biochemical information. Moreover, from the point of view of genome annotation, proteins which have been studied experimentally but are not correlated with sequence data in current databases qualify as "orphans", just as unassigned genomic open reading frames do. The strategy we used in this paper to bridge such gaps in knowledge could work whenever it is possible to collect a body of facts about experimental data, homology, unnoticed sequence data, and accurate informations about gene context

    ComPath: comparative enzyme analysis and annotation in pathway/subsystem contexts

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Once a new genome is sequenced, one of the important questions is to determine the presence and absence of biological pathways. Analysis of biological pathways in a genome is a complicated task since a number of biological entities are involved in pathways and biological pathways in different organisms are not identical. Computational pathway identification and analysis thus involves a number of computational tools and databases and typically done in comparison with pathways in other organisms. This computational requirement is much beyond the capability of biologists, so information systems for reconstructing, annotating, and analyzing biological pathways are much needed. We introduce a new comparative pathway analysis workbench, ComPath, which integrates various resources and computational tools using an interactive spreadsheet-style web interface for reliable pathway analyses.</p> <p>Results</p> <p>ComPath allows users to compare biological pathways in multiple genomes using a spreadsheet style web interface where various sequence-based analysis can be performed either to compare enzymes (e.g. sequence clustering) and pathways (e.g. pathway hole identification), to search a genome for <it>de novo </it>prediction of enzymes, or to annotate a genome in comparison with reference genomes of choice. To fill in pathway holes or make <it>de novo </it>enzyme predictions, multiple computational methods such as FASTA, Whole-HMM, CSR-HMM (a method of our own introduced in this paper), and PDB-domain search are integrated in ComPath. Our experiments show that FASTA and CSR-HMM search methods generally outperform Whole-HMM and PDB-domain search methods in terms of sensitivity, but FASTA search performs poorly in terms of specificity, detecting more false positive as E-value cutoff increases. Overall, CSR-HMM search method performs best in terms of both sensitivity and specificity. Gene neighborhood and pathway neighborhood (global network) visualization tools can be used to get context information that is complementary to conventional KEGG map representation.</p> <p>Conclusion</p> <p>ComPath is an interactive workbench for pathway reconstruction, annotation, and analysis where experts can perform various sequence, domain, context analysis, using an intuitive and interactive spreadsheet-style interface. </p

    Quantitative comparison of catalytic mechanisms and overall reactions in convergently evolved enzymes : implications for classification of enzyme function

    Get PDF
    The authors thank the National Institutes of Health (NIH R01 GM60595 to PCB) and the Scottish Universities Life Sciences Alliance (SULSA to JBOM) for funding.Functionally analogous enzymes are those that catalyze similar reactions on similar substrates but do not share common ancestry, providing a window on the different structural strategies nature has used to evolve required catalysts. Identification and use of this information to improve reaction classification and computational annotation of enzymes newly discovered in the genome projects would benefit from systematic determination of reaction similarities. Here, we quantified similarity in bond changes for overall reactions and catalytic mechanisms for 95 pairs of functionally analogous enzymes (non-homologous enzymes with identical first three numbers of their EC codes) from the MACiE database. Similarity of overall reactions was computed by comparing the sets of bond changes in the transformations from substrates to products. For similarity of mechanisms, sets of bond changes occurring in each mechanistic step were compared; these similarities were then used to guide global and local alignments of mechanistic steps. Using this metric, only 44% of pairs of functionally analogous enzymes in the dataset had significantly similar overall reactions. For these enzymes, convergence to the same mechanism occurred in 33% of cases, with most pairs having at least one identical mechanistic step. Using our metric, overall reaction similarity serves as an upper bound for mechanistic similarity in functional analogs. For example, the four carbon-oxygen lyases acting on phosphates (EC 4.2.3) show neither significant overall reaction similarity nor significant mechanistic similarity. By contrast, the three carboxylic-ester hydrolases (EC 3.1.1) catalyze overall reactions with identical bond changes and have converged to almost identical mechanisms. The large proportion of enzyme pairs that do not show significant overall reaction similarity (56%) suggests that at least for the functionally analogous enzymes studied here, more stringent criteria could be used to refine definitions of EC sub-subclasses for improved discrimination in their classification of enzyme reactions. The results also indicate that mechanistic convergence of reaction steps is widespread, suggesting that quantitative measurement of mechanistic similarity can inform approaches for functional annotation.Peer reviewe

    Is EC class predictable from reaction mechanism?

    Get PDF
    We thank the Scottish Universities Life Sciences Alliance (SULSA) and the Scottish Overseas Research Student Awards Scheme of the Scottish Funding Council (SFC) for financial support.Background: We investigate the relationships between the EC (Enzyme Commission) class, the associated chemical reaction, and the reaction mechanism by building predictive models using Support Vector Machine (SVM), Random Forest (RF) and k-Nearest Neighbours (kNN). We consider two ways of encoding the reaction mechanism in descriptors, and also three approaches that encode only the overall chemical reaction. Both cross-validation and also an external test set are used. Results: The three descriptor sets encoding overall chemical transformation perform better than the two descriptions of mechanism. SVM and RF models perform comparably well; kNN is less successful. Oxidoreductases and hydrolases are relatively well predicted by all types of descriptor; isomerases are well predicted by overall reaction descriptors but not by mechanistic ones. Conclusions: Our results suggest that pairs of similar enzyme reactions tend to proceed by different mechanisms. Oxidoreductases, hydrolases, and to some extent isomerases and ligases, have clear chemical signatures, making them easier to predict than transferases and lyases. We find evidence that isomerases as a class are notably mechanistically diverse and that their one shared property, of substrate and product being isomers, can arise in various unrelated ways. The performance of the different machine learning algorithms is in line with many cheminformatics applications, with SVM and RF being roughly equally effective. kNN is less successful, given the role that non-local information plays in successful classification. We note also that, despite a lack of clarity in the literature, EC number prediction is not a single problem; the challenge of predicting protein function from available sequence data is quite different from assigning an EC classification from a cheminformatics representation of a reaction.Publisher PDFPeer reviewe

    Target selection and annotation for the structural genomics of the amidohydrolase and enolase superfamilies

    Get PDF
    To study the substrate specificity of enzymes, we use the amidohydrolase and enolase superfamilies as model systems; members of these superfamilies share a common TIM barrel fold and catalyze a wide range of chemical reactions. Here, we describe a collaboration between the Enzyme Specificity Consortium (ENSPEC) and the New York SGX Research Center for Structural Genomics (NYSGXRC) that aims to maximize the structural coverage of the amidohydrolase and enolase superfamilies. Using sequence- and structure-based protein comparisons, we first selected 535 target proteins from a variety of genomes for high-throughput structure determination by X-ray crystallography; 63 of these targets were not previously annotated as superfamily members. To date, 20 unique amidohydrolase and 41 unique enolase structures have been determined, increasing the fraction of sequences in the two superfamilies that can be modeled based on at least 30% sequence identity from 45% to 73%. We present case studies of proteins related to uronate isomerase (an amidohydrolase superfamily member) and mandelate racemase (an enolase superfamily member), to illustrate how this structure-focused approach can be used to generate hypotheses about sequence–structure–function relationships

    The LabelHash algorithm for substructure matching

    Get PDF
    Background: There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity. Results: We present LabelHash, a novel algorithm for matching substructural motifs to large collections of protein structures. The algorithm consists of two phases. In the first phase the proteins are preprocessed in a fashion that allows for instant lookup of partial matches to any motif. In the second phase, partial matches for a given motif are expanded to complete matches. The general applicability of the algorithm is demonstrated with three different case studies. First, we show that we can accurately identify members of the enolase superfamily with a single motif. Next, we demonstrate how LabelHash can complement SOIPPA, an algorithm for motif identification and pairwise substructure alignment. Finally, a large collection of Catalytic Site Atlas motifs is used to benchmark the performance of the algorithm. LabelHash runs very efficiently in parallel; matching a motif against all proteins in the 95 % sequence identity filtered non-redundant Protein Data Bank typically takes no more than a few minutes. The LabelHash algorithm is available through a web server and as a suite of standalone programs a

    clusterMaker: a multi-algorithm clustering plugin for Cytoscape

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>In the post-genomic era, the rapid increase in high-throughput data calls for computational tools capable of integrating data of diverse types and facilitating recognition of biologically meaningful patterns within them. For example, protein-protein interaction data sets have been clustered to identify stable complexes, but scientists lack easily accessible tools to facilitate combined analyses of multiple data sets from different types of experiments. Here we present <it>clusterMaker</it>, a Cytoscape plugin that implements several clustering algorithms and provides network, dendrogram, and heat map views of the results. The Cytoscape network is linked to all of the other views, so that a selection in one is immediately reflected in the others. <it>clusterMaker </it>is the first Cytoscape plugin to implement such a wide variety of clustering algorithms and visualizations, including the only implementations of hierarchical clustering, dendrogram plus heat map visualization (tree view), k-means, k-medoid, SCPS, AutoSOME, and native (Java) MCL.</p> <p>Results</p> <p>Results are presented in the form of three scenarios of use: analysis of protein expression data using a recently published mouse interactome and a mouse microarray data set of nearly one hundred diverse cell/tissue types; the identification of protein complexes in the yeast <it>Saccharomyces cerevisiae</it>; and the cluster analysis of the vicinal oxygen chelate (VOC) enzyme superfamily. For scenario one, we explore functionally enriched mouse interactomes specific to particular cellular phenotypes and apply fuzzy clustering. For scenario two, we explore the prefoldin complex in detail using both physical and genetic interaction clusters. For scenario three, we explore the possible annotation of a protein as a methylmalonyl-CoA epimerase within the VOC superfamily. Cytoscape session files for all three scenarios are provided in the Additional Files section.</p> <p>Conclusions</p> <p>The Cytoscape plugin <it>clusterMaker </it>provides a number of clustering algorithms and visualizations that can be used independently or in combination for analysis and visualization of biological data sets, and for confirming or generating hypotheses about biological function. Several of these visualizations and algorithms are only available to Cytoscape users through the <it>clusterMaker </it>plugin. <it>clusterMaker </it>is available via the Cytoscape plugin manager.</p
    corecore