223 research outputs found

    Discovery and genotyping of structural variation from long-read haploid genome sequence data

    Get PDF
    In an effort to more fully understand the full spectrum of human genetic variation, we generated deep single-molecule, real-time (SMRT) sequencing data from two haploid human genomes. By using an assembly-based approach (SMRT-SV), we systematically assessed each genome independently for structural variants (SVs) and indels resolving the sequence structure of 461,553 genetic variants from 2 bp to 28 kbp in length. We find that &gt;89% of these variants have been missed as part of analysis of the 1000 Genomes Project even after adjusting for more common variants (MAF &gt; 1%). We estimate that this theoretical human diploid differs by as much as ∼16 Mbp with respect to the human reference, with long-read sequencing data providing a fivefold increase in sensitivity for genetic variants ranging in size from 7 bp to 1 kbp compared with short-read sequence data. Although a large fraction of genetic variants were not detected by short-read approaches, once the alternate allele is sequence-resolved, we show that 61% of SVs can be genotyped in short-read sequence data sets with high accuracy. Uncoupling discovery from genotyping thus allows for the majority of this missed common variation to be genotyped in the human population. Interestingly, when we repeat SV detection on a pseudodiploid genome constructed in silico by merging the two haploids, we find that ∼59% of the heterozygous SVs are no longer detected by SMRT-SV. These results indicate that haploid resolution of long-read sequencing data will significantly increase sensitivity of SV detection.</jats:p

    Reconstructing complex regions of genomes using long-read sequencing technology

    Get PDF
    Cataloged from PDF version of article.Obtaining high-quality sequence continuity of complex regions of recent segmental duplication remains one of the major challenges of finishing genome assemblies. In the human and mouse genomes, this was achieved by targeting large-insert clones using costly and laborious capillary-based sequencing approaches. Sanger shotgun sequencing of clone inserts, however, has now been largely abandoned, leaving most of these regions unresolved in newer genome assemblies generated primarily by next-generation sequencing hybrid approaches. Here we show that it is possible to resolve regions that are complex in a genome-wide context but simple in isolation for a fraction of the time and cost of traditional methods using long-read single molecule, real-time (SMRT) sequencing and assembly technology from Pacific Biosciences (PacBio). We sequenced and assembled BAC clones corresponding to a 1.3-Mbp complex region of chromosome 17q21.31, demonstrating 99.994% identity to Sanger assemblies of the same clones. We targeted 44 differences using Illumina sequencing and find that PacBio and Sanger assemblies share a comparable number of validated variants, albeit with different sequence context biases. Finally, we targeted a poorly assembled 766-kbp duplicated region of the chimpanzee genome and resolved the structure and organization for a fraction of the cost and time of traditional finishing approaches. Our data suggest a straightforward path for upgrading genomes to a higher quality finished state

    Assembly of long error-prone reads using de Bruijn graphs

    Get PDF
    The recent breakthroughs in assembling long error-prone reads were based on the overlap-layout-consensus (OLC) approach and did not utilize the strengths of the alternative de Bruijn graph approach to genome assembly. Moreover, these studies often assume that applications of the de Bruijn graph approach are limited to short and accurate reads and that the OLC approach is the only practical paradigm for assembling long error-prone reads. We show how to generalize de Bruijn graphs for assembling long error-prone reads and describe the ABruijn assembler, which combines the de Bruijn graph and the OLC approaches and results in accurate genome reconstructions

    Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs.

    Get PDF
    Variable number tandem repeats (VNTRs) are composed of consecutive repetitive DNA with hypervariable repeat count and composition. They include protein coding sequences and associations with clinical disorders. It has been difficult to incorporate VNTR analysis in disease studies that use short-read sequencing because the traditional approach of mapping to the human reference is less effective for repetitive and divergent sequences. In this work, we solve VNTR mapping for short reads with a repeat-pangenome graph (RPGG), a data structure that encodes both the population diversity and repeat structure of VNTR loci from multiple haplotype-resolved assemblies. We develop software to build a RPGG, and use the RPGG to estimate VNTR composition with short reads. We use this to discover VNTRs with length stratified by continental population, and expression quantitative trait loci, indicating that RPGG analysis of VNTRs will be critical for future studies of diversity and disease

    Mechanism of organization increase in complex systems

    Get PDF
    This paper proposes a variational approach to describe the evolution of organization of complex systems from first principles, as increased efficiency of physical action. Most simply stated, physical action is the product of the energy and time necessary for motion. When complex systems are modeled as flow networks, this efficiency is defined as a decrease of action for one element to cross between two nodes, or endpoints of motion - a principle of least unit action. We find a connection with another principle that of most total action, or a tendency for increase of the total action of a system. This increase provides more energy and time for minimization of the constraints to motion in order to decrease unit action, and therefore to increase organization. Also, with the decrease of unit action in a system, its capacity for total amount of action increases. We present a model of positive feedback between action efficiency and the total amount of action in a complex system, based on a system of ordinary differential equations, which leads to an exponential growth with time of each and a power law relation between the two. We present an agreement of our model with data for core processing units of computers. This approach can help to describe, measure, manage, design and predict future behavior of complex systems to achieve the highest rates of self-organization and robustness.Comment: 22 pages, 4 figures, 1 tabl

    Scalable Nanopore Sequencing of Human Genomes Provides a Comprehensive View of Haplotype-Resolved Variation and Methylation

    Get PDF
    Long-read sequencing technologies substantially overcome the limitations of short-reads but have not been considered as a feasible replacement for population-scale projects, being a combination of too expensive, not scalable enough or too error-prone. Here we develop an efficient and scalable wet lab and computational protocol, Napu, for Oxford Nanopore Technologies long-read sequencing that seeks to address those limitations. We applied our protocol to cell lines and brain tissue samples as part of a pilot project for the National Institutes of Health Center for Alzheimer\u27s and Related Dementias. Using a single PromethION flow cell, we can detect single nucleotide polymorphisms with F1-score comparable to Illumina short-read sequencing. Small indel calling remains difficult within homopolymers and tandem repeats, but achieves good concordance to Illumina indel calls elsewhere. Further, we can discover structural variants with F1-score on par with state-of-the-art de novo assembly methods. Our protocol phases small and structural variants at megabase scales and produces highly accurate, haplotype-specific methylation calls

    Integrating Sequencing Technologies in Personal Genomics: Optimal Low Cost Reconstruction of Structural Variants

    Get PDF
    The goal of human genome re-sequencing is obtaining an accurate assembly of an individual's genome. Recently, there has been great excitement in the development of many technologies for this (e.g. medium and short read sequencing from companies such as 454 and SOLiD, and high-density oligo-arrays from Affymetrix and NimbelGen), with even more expected to appear. The costs and sensitivities of these technologies differ considerably from each other. As an important goal of personal genomics is to reduce the cost of re-sequencing to an affordable point, it is worthwhile to consider optimally integrating technologies. Here, we build a simulation toolbox that will help us optimally combine different technologies for genome re-sequencing, especially in reconstructing large structural variants (SVs). SV reconstruction is considered the most challenging step in human genome re-sequencing. (It is sometimes even harder than de novo assembly of small genomes because of the duplications and repetitive sequences in the human genome.) To this end, we formulate canonical problems that are representative of issues in reconstruction and are of small enough scale to be computationally tractable and simulatable. Using semi-realistic simulations, we show how we can combine different technologies to optimally solve the assembly at low cost. With mapability maps, our simulations efficiently handle the inhomogeneous repeat-containing structure of the human genome and the computational complexity of practical assembly algorithms. They quantitatively show how combining different read lengths is more cost-effective than using one length, how an optimal mixed sequencing strategy for reconstructing large novel SVs usually also gives accurate detection of SNPs/indels, how paired-end reads can improve reconstruction efficiency, and how adding in arrays is more efficient than just sequencing for disentangling some complex SVs. Our strategy should facilitate the sequencing of human genomes at maximum accuracy and low cost

    An integrated map of structural variation in 2,504 human genomes

    Get PDF
    © 2015 Macmillan Publishers Limited. All rights reserved. Structural variants are implicated in numerous diseases and make up the majority of varying nucleotides among human genomes. Here we describe an integrated set of eight structural variant classes comprising both balanced and unbalanced variants, which we constructed using short-read DNA sequencing data and statistically phased onto haplotype blocks in 26 human populations. Analysing this set, we identify numerous gene-intersecting structural variants exhibiting population stratification and describe naturally occurring homozygous gene knockouts that suggest the dispensability of a variety of human genes. We demonstrate that structural variants are enriched on haplotypes identified by genome-wide association studies and exhibit enrichment for expression quantitative trait loci. Additionally, we uncover appreciable levels of structural variant complexity at different scales, including genic loci subject to clusters of repeated rearrangement and complex structural variants with multiple breakpoints likely to have formed through individual mutational events. Our catalogue will enhance future studies into structural variant demography, functional impact and disease association

    Characterization of the Conus bullatus genome and its venom-duct transcriptome

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The venomous marine gastropods, cone snails (genus <it>Conus</it>), inject prey with a lethal cocktail of conopeptides, small cysteine-rich peptides, each with a high affinity for its molecular target, generally an ion channel, receptor or transporter. Over the last decade, conopeptides have proven indispensable reagents for the study of vertebrate neurotransmission. <it>Conus bullatus </it>belongs to a clade of <it>Conus </it>species called <it>Textilia</it>, whose pharmacology is still poorly characterized. Thus the genomics analyses presented here provide the first step toward a better understanding the enigmatic <it>Textilia </it>clade.</p> <p>Results</p> <p>We have carried out a sequencing survey of the <it>Conus bullatus </it>genome and venom-duct transcriptome. We find that conopeptides are highly expressed within the venom-duct, and describe an <it>in silico </it>pipeline for their discovery and characterization using RNA-seq data. We have also carried out low-coverage shotgun sequencing of the genome, and have used these data to determine its size, genome-wide base composition, simple repeat, and mobile element densities.</p> <p>Conclusions</p> <p>Our results provide the first global view of venom-duct transcription in any cone snail. A notable feature of <it>Conus bullatus </it>venoms is the breadth of A-superfamily peptides expressed in the venom duct, which are unprecedented in their structural diversity. We also find SNP rates within conopeptides are higher compared to the remainder of <it>C. bullatus </it>transcriptome, consistent with the hypothesis that conopeptides are under diversifying selection.</p
    corecore