27 research outputs found
A High-Resolution Map of Human Evolutionary Constraint Using 29 Mammals
The comparison of related genomes has emerged as a powerful lens for genome interpretation. Here we report the sequencing and comparative analysis of 29 eutherian genomes. We confirm that at least 5.5% of the human genome has undergone purifying selection, and locate constrained elements covering ~4.2% of the genome. We use evolutionary signatures and comparisons with experimental data sets to suggest candidate functions for ~60% of constrained bases. These elements reveal a small number of new coding exons, candidate stop codon readthrough events and over 10,000 regions of overlapping synonymous constraint within protein-coding exons. We find 220 candidate RNA structural families, and nearly a million elements overlapping potential promoter, enhancer and insulator regions. We report specific amino acid residues that have undergone positive selection, 280,000 non-coding elements exapted from mobile elements and more than 1,000 primate- and human-accelerated elements. Overlap with disease-associated variants indicates that our findings will be relevant for studies of human biology, health and disease. Funding: National Human Genome Research Institute (U.S.); National Institute of General Medical Sciences (U.S.) (Grant number GM82901); National Science Foundation (U.S.), Postdoctoral Fellowship (Award 0905968); National Science Foundation (U.S.), Career (0644282); National Institutes of Health (U.S.) (R01-HG004037); Alfred P. Sloan Foundation; Austrian Science Fund, Erwin Schrodinger Fellowshi
Ultra-fast sequence clustering from similarity networks with SiLiX
Background: The number of gene sequences available for comparative genomics approaches is increasing extremely quickly. A current challenge is to handle this huge number of sequences in order to build families of homologous sequences in a reasonable time. Results: We present the software package SiLiX, which implements a novel method that reconsiders single linkage clustering with a graph theoretical approach. A parallel version of the algorithms is also presented. As a demonstration of the ability of our software, we clustered more than 3 million sequences from about 2 billion BLAST hits in 7 minutes, with a high clustering quality, both in terms of sensitivity and specificity. Conclusions: Compared with state-of-the-art software, SiLiX presents the best up-to-date capabilities to face the problem of clustering large collections of sequences. SiLiX is freely available at http://lbbe.univ-lyon1.fr/SiLiX.
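As a rough illustration of single linkage clustering seen as a graph problem, sequences can be treated as nodes and accepted similarity hits as edges, so that each family is a connected component. The sketch below is not SiLiX's implementation; it uses a plain union-find structure, and the function name `single_linkage_families` and the toy hit list are hypothetical.

```python
def single_linkage_families(hits):
    """Group sequence IDs into families (connected components of the hit graph).

    `hits` is an iterable of (seq_a, seq_b) pairs that already passed whatever
    similarity filter is in use (the filtering itself is not shown here).
    """
    parent = {}

    def find(x):
        # Return the representative of x's component, registering x if new.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    families = {}
    for seq in list(parent):
        families.setdefault(find(seq), set()).add(seq)
    return sorted(sorted(members) for members in families.values())


# Hypothetical toy input: s1-s2-s3 are chained by hits, s4-s5 form a second family.
toy_hits = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]
families = single_linkage_families(toy_hits)
print(families)
```

Because a single hit is enough to merge two groups, s1 and s3 end up in the same family even though no direct hit links them, which is the defining property of single linkage.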
Translog, a web browser for studying the expression divergence of homologous genes
Background: The increasing amount of data from comparative genomics and newly developed technologies producing accurate gene expression data facilitate the study of the expression divergence of homologous genes. Previous studies have individually highlighted factors that contribute to the expression divergence of duplicate genes, e.g. promoter changes, exon structure heterogeneity, asymmetric histone modifications and genomic neighborhood conservation. However, no tool integrates multiple factors and visualizes how they vary among homologous genes in a straightforward way. Results: We introduce Translog (a web-based tool for Transcriptome comparison of homologous genes) that assists in the comparison of homologous genes by displaying the loci in three different views: a promoter view for studying the sharing/turnover of transcription initiations, an exon structure view for displaying exon-intron structure changes, and a genomic neighborhood view for showing macro-synteny conservation at a larger scale. CAGE data for transcription initiation are mapped for each transcript and can be used to study transcription turnover and expression changes. Alignment anchors between homologous loci can be used to define the precise homologous transcripts. We demonstrate how these views can be used to visualize the changes of homologous genes during evolution, particularly after the 2R and 3R whole genome duplications. Conclusion: We have developed a web-based tool for assisting in the transcriptome comparison of homologous genes, facilitating the study of expression divergence.
Testing the Ortholog Conjecture with Comparative Functional Genomic Data from Mammals
A common assumption in comparative genomics is that orthologous genes share greater functional similarity than do paralogous genes (the “ortholog conjecture”). Many methods used to computationally predict protein function are based on this assumption, even though it is largely untested. Here we present the first large-scale test of the ortholog conjecture using comparative functional genomic data from human and mouse. We use the experimentally derived functions of more than 8,900 genes, as well as an independent microarray dataset, to directly assess our ability to predict function using both orthologs and paralogs. Both datasets show that paralogs are often a much better predictor of function than are orthologs, even at lower sequence identities. Among paralogs, those found within the same species are consistently more functionally similar than those found in a different species. We also find that paralogous pairs residing on the same chromosome are more functionally similar than those on different chromosomes, perhaps due to higher levels of interlocus gene conversion between these pairs. In addition to offering implications for the computational prediction of protein function, our results shed light on the relationship between sequence divergence and functional divergence. We conclude that the most important factor in the evolution of function is not amino acid sequence, but rather the cellular context in which proteins act.
The spotted gar genome illuminates vertebrate evolution and facilitates human-teleost comparisons
To connect human biology to fish biomedical models, we sequenced the genome of spotted gar (Lepisosteus oculatus), whose lineage diverged from teleosts before teleost genome duplication (TGD). The slowly evolving gar genome has conserved in content and size many entire chromosomes from bony vertebrate ancestors. Gar bridges teleosts to tetrapods by illuminating the evolution of immunity, mineralization and development (mediated, for example, by Hox, ParaHox and microRNA genes). Numerous conserved noncoding elements (CNEs; often cis regulatory) undetectable in direct human-teleost comparisons become apparent using gar: functional studies uncovered conserved roles for such cryptic CNEs, facilitating annotation of sequences identified in human genome-wide association studies. Transcriptomic analyses showed that the sums of expression domains and expression levels for duplicated teleost genes often approximate the patterns and levels of expression for gar genes, consistent with subfunctionalization. The gar genome provides a resource for understanding evolution after genome duplication, the origin of vertebrate genomes and the function of human regulatory sequences.
BioTIME 2.0: Expanding and Improving a Database of Biodiversity Time Series
Motivation: Here, we make available a second version of the BioTIME database, which compiles records of abundance estimates for species in sample events of ecological assemblages through time. The updated version expands version 1.0 of the database by doubling the number of studies and includes substantial additional curation to the taxonomic accuracy of the records, as well as the metadata. Moreover, we now provide an R package (BioTIMEr) to facilitate use of the database. Main Types of Variables Included: The database is composed of one main data table containing the abundance records and 11 metadata tables. The data are organised in a hierarchy of scales where 11,989,233 records are nested in 1,603,067 sample events, from 553,253 sampling locations, which are nested in 708 studies. A study is defined as a sampling methodology applied to an assemblage for a minimum of 2 years. Spatial Location and Grain: Sampling locations in BioTIME are distributed across the planet, including marine, terrestrial and freshwater realms. Spatial grain size and extent vary across studies depending on sampling methodology. We recommend gridding of sampling locations into areas of consistent size. Time Period and Grain: The earliest time series in BioTIME start in 1874, and the most recent records are from 2023. Temporal grain and duration vary across studies. We recommend doing sample-level rarefaction to ensure consistent sampling effort through time before calculating any diversity metric. Major Taxa and Level of Measurement: The database includes any eukaryotic taxa, with a combined total of 56,400 taxa. Software Format: .csv and .SQL
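The sample-level rarefaction recommended above can be sketched as follows: within one time series, every year is reduced to the same number of randomly drawn sample events (the minimum observed in any year) before a diversity metric is computed. This is a minimal illustration only, not the BioTIMEr API; the function name `rarefy_samples`, the data layout and the toy values are all assumptions.

```python
import random


def rarefy_samples(samples_by_year, seed=0):
    """Equalise sampling effort across years by subsampling sample events.

    `samples_by_year` maps a year to a list of sample events, each a dict of
    {species: abundance}. This layout is hypothetical, not the BioTIME schema.
    """
    rng = random.Random(seed)
    # The smallest number of sample events in any year sets the common effort.
    min_events = min(len(events) for events in samples_by_year.values())
    rarefied = {}
    for year, events in samples_by_year.items():
        chosen = rng.sample(events, min_events)  # draw without replacement
        pooled = {}
        for event in chosen:
            for species, abundance in event.items():
                pooled[species] = pooled.get(species, 0) + abundance
        rarefied[year] = pooled
    return rarefied


# Hypothetical toy series: three sample events in 1990, two in 1991.
toy_series = {
    1990: [{"A": 3, "B": 1}, {"A": 2}, {"C": 5}],
    1991: [{"A": 1}, {"B": 4, "C": 2}],
}
rarefied = rarefy_samples(toy_series)
# Species richness per year, now based on two sample events in each year.
richness = {year: len(pooled) for year, pooled in rarefied.items()}
print(richness)
```

In practice the draw is usually repeated many times and the metric averaged across draws, since a single random subsample can be unrepresentative.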
