GPU-Accelerated BWT Construction for Large Collection of Short Reads
Advances in DNA sequencing technology have stimulated the development of
algorithms and tools for processing very large collections of short strings
(reads). Short-read alignment and assembly are among the most well-studied
problems. Many state-of-the-art aligners, at their core, have used the
Burrows-Wheeler transform (BWT) as a main-memory index of a reference genome
(a typical example being the NCBI human genome). Recently, BWT has also found its use in
string-graph assembly, for indexing the reads (i.e., raw data from DNA
sequencers). In a typical data set, the volume of reads is tens of times that
of the sequenced genome and can be up to 100 Gigabases. Note that a reference
genome is relatively stable and computing its index is not a frequent task. For
reads, the index has to be computed from scratch for each given input, so
efficient BWT construction becomes a much bigger concern than before. In this
paper, we present a practical method called CX1 for constructing the BWT of
very large string collections. CX1 is the first tool that can take advantage of
the parallelism given by a graphics processing unit (GPU, a relatively cheap
device providing a thousand or more primitive cores), as well as, simultaneously,
the parallelism from a multi-core CPU and, more interestingly, from a cluster of
GPU-enabled nodes. Using CX1, the BWT of a short-read collection of up to 100
Gigabases can be constructed in less than 2 hours using a machine equipped with
a quad-core CPU and a GPU, or in about 43 minutes using a cluster with 4 such
machines (the speedup is almost linear after excluding the first 16 minutes for
loading the reads from the hard disk). The previously fastest tool BRC is
measured to take 12 hours to process 100 Gigabases on one machine; it is
non-trivial to parallelize BRC to take advantage of a cluster of machines, let
alone GPUs.
Comment: 11 pages
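To make the object being constructed concrete, the sketch below builds the BWT of a tiny read collection with a naive suffix sort. It is purely illustrative and is not CX1's GPU algorithm; the sentinel handling and the function name are assumptions for the example.

```python
# Naive, single-threaded sketch of the BWT of a read collection (not CX1's
# GPU algorithm).  Each read gets its own end-of-read sentinel, modelled as
# '\0' so it sorts before A/C/G/T; ties between sentinels of different reads
# are broken by the read id.  Far too slow for real data -- it only shows
# what the output object is.

def collection_bwt(reads):
    suffixes = []                                  # (suffix, read_id, offset)
    for rid, read in enumerate(reads):
        padded = read + "\0"                       # per-read sentinel
        for off in range(len(padded)):
            suffixes.append((padded[off:], rid, off))
    suffixes.sort()                                # lexicographic suffix order

    out = []
    for _, rid, off in suffixes:
        padded = reads[rid] + "\0"
        prev = padded[off - 1] if off > 0 else padded[-1]   # cyclic predecessor
        out.append("$" if prev == "\0" else prev)           # show sentinels as '$'
    return "".join(out)

if __name__ == "__main__":
    print(collection_bwt(["ACGT", "ACGA"]))        # -> 'TAG$$AACCG'
```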
MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph
MEGAHIT is an NGS de novo assembler for assembling large and complex
metagenomics data in a time- and cost-efficient manner. It finished assembling
a soil metagenomics dataset of 252 Gbp in 44.1 hours and 99.6 hours on a
single computing node with and without a GPU, respectively. MEGAHIT assembles
the data as a whole, i.e., it avoids pre-processing such as partitioning and
normalization, which might compromise result integrity. MEGAHIT generates an
assembly 3 times larger, with a longer contig N50 and a longer average contig
length, than the previous assembly. 55.8% of the reads were aligned to the
assembly, 4 times higher than for the previous one. The source code of MEGAHIT is freely
available at https://github.com/voutcn/megahit under the GPLv3 license.
Comment: 2 pages, 2 tables, 1 figure; submitted to Oxford Bioinformatics as an
Application Note
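As background for the approach above, the sketch below builds an ordinary dictionary-based de Bruijn graph from reads and walks one unambiguous path into a unitig. MEGAHIT's succinct (BWT-based) representation and its GPU construction are not reproduced here, and the function names and parameters are illustrative.

```python
# Dictionary-based de Bruijn graph sketch (not MEGAHIT's succinct structure):
# nodes are (k-1)-mers, edges are the bases that extend them, and unambiguous
# paths are collapsed into unitigs.

from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer node to the set of bases that extend it."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].add(kmer[-1])         # edge: prefix node -> last base
    return edges

def walk_unitig(edges, start):
    """Greedily extend a node while its out-degree is exactly one."""
    contig, node, seen = start, start, {start}
    while len(edges.get(node, ())) == 1:
        base = next(iter(edges[node]))
        node = node[1:] + base
        if node in seen:                           # stop if the path cycles
            break
        seen.add(node)
        contig += base
    return contig

g = de_bruijn_graph(["ACGTACGTG", "GTACGTGCA"], k=5)
print(walk_unitig(g, "CGTA"))                      # -> 'CGTACGT' (stops at the branching node ACGT)
```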
BASE: a practical de novo assembler for large genomes using long NGS reads
Background: De novo genome assembly using NGS data remains a computation-intensive task, especially for large genomes. In practice, efficiency is often a primary concern and favors using a more efficient assembler like SOAPdenovo2. Yet SOAPdenovo2, based on the de Bruijn graph, fails to take full advantage of longer NGS reads (say, 150 bp to 250 bp from Illumina HiSeq and MiSeq). Assemblers based on string graphs (e.g., SGA), though less popular and also very slow, are more favorable for longer reads. Methods: This paper presents a new de novo assembler called BASE. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of the reads sharing the seeds. Such consensus sequences are then extended to contigs. Results: Experiments on two bacterial and four human datasets show the advantage of BASE in both contig quality and speed in dealing with longer reads. In the bacterial experiments, two datasets with read lengths of 100 bp and 250 bp were used. Especially for the 250 bp dataset, BASE gives much better quality than SOAPdenovo2 and SGA and is similar to SPAdes. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. BASE and SOAPdenovo2 are further compared using human datasets with read lengths of 100 bp, 150 bp and 250 bp. BASE shows a higher N50 for all datasets, and the improvement becomes more significant when the read length reaches 250 bp. Besides, BASE is more memory-efficient than SOAPdenovo2 when the sequencing data contain errors. Conclusions: BASE is a practically efficient tool for constructing contigs, with significant improvement in quality for long NGS reads. It is relatively easy to extend BASE to include scaffolding.
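The adaptive-seed idea above can be illustrated with a toy read index: a seed is grown from a fixed-length k-mer until it occurs in few enough reads to be plausibly unique in the genome, and the reads sharing it would then feed the consensus/extension step. The plain dictionary index, the thresholds and the function names below are assumptions for illustration, not BASE's actual data structures.

```python
# Toy illustration of adaptive seeds: extend a k-mer seed until it is rare
# enough (<= max_occ hits) across the read collection.

from collections import defaultdict

def build_kmer_index(reads, k):
    """Map every k-mer to the (read_id, offset) positions where it occurs."""
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for off in range(len(read) - k + 1):
            index[read[off:off + k]].append((rid, off))
    return index

def adaptive_seed(read, start, reads, index, k, max_occ=3):
    """Grow a seed starting at `start` until it has at most `max_occ` hits."""
    kmer_hits = index.get(read[start:start + k], [])
    end = start + k
    seed, hits = read[start:end], list(kmer_hits)
    while len(hits) > max_occ and end < len(read):   # still too frequent: extend
        end += 1
        seed = read[start:end]
        hits = [(rid, off) for rid, off in kmer_hits
                if reads[rid][off:off + len(seed)] == seed]
    return seed, hits                                # reads sharing the seed

reads = ["ACGTACGTGACCT", "GTACGTGACCTTA", "ACGTACGAGACCT"]
idx = build_kmer_index(reads, k=4)
print(adaptive_seed(reads[0], 0, reads, idx, k=4))   # -> ('ACGTA', [(0, 0), (2, 0)])
```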
SOAP3-dp: Fast, Accurate and Sensitive GPU-based Short Read Aligner
To tackle the exponentially increasing throughput of Next-Generation
Sequencing (NGS), most existing short-read aligners can be configured to favor
speed at the expense of accuracy and sensitivity. SOAP3-dp, through leveraging
the computational power of both CPU and GPU with optimized algorithms, delivers
high speed and sensitivity simultaneously. Compared with widely adopted
aligners including BWA, Bowtie2, SeqAlto, GEM and GPU-based aligners including
BarraCUDA and CUSHAW, SOAP3-dp is two to tens of times faster, while
maintaining the highest sensitivity and lowest false discovery rate (FDR) on
Illumina reads with different lengths. Transcending its predecessor SOAP3,
which does not allow gapped alignment, SOAP3-dp by default tolerates alignment
similarity as low as 60 percent. Evaluation on real human genome data
demonstrates SOAP3-dp's power to discover more authentic variants and longer
indels. Fosmid sequencing shows a 9.1 percent FDR on newly
discovered deletions. SOAP3-dp natively supports the BAM file format and
provides the same scoring scheme as BWA, which enables it to be integrated into existing
analysis pipelines. SOAP3-dp has been deployed on Amazon-EC2, NIH-Biowulf and
Tianhe-1A.
Comment: 21 pages, 6 figures; submitted to PLoS ONE; additional files
available at "https://www.dropbox.com/sh/bhclhxpoiubh371/O5CO_CkXQE".
Comments most welcome
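The gapped alignment mentioned above boils down to dynamic programming over the read and a reference window. Below is a single-threaded Smith-Waterman scoring sketch with illustrative parameters; it is not SOAP3-dp's GPU kernel and not its actual scoring scheme.

```python
# Single-threaded sketch of gapped local-alignment DP (Smith-Waterman with a
# linear gap penalty).  The scoring parameters are illustrative only.

def local_align_score(read, ref, match=2, mismatch=-3, gap=-4):
    """Best local alignment score of `read` against `ref`."""
    rows, cols = len(read) + 1, len(ref) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, rows):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
            # Local alignment: never let a cell go below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(local_align_score("ACGTTAGC", "ACGTAGC"))    # -> 10 (the extra T is absorbed by a gap)
```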
Data-Flow-Based Normalization Generation Algorithm of R1CS for Zero-Knowledge Proof
The communities of blockchains and distributed ledgers have been stirred up
by the introduction of zero-knowledge proofs (ZKPs). Originally designed to
solve privacy issues, ZKPs have now evolved into an effective remedy for
scalability concerns and are applied in Zcash (a cryptocurrency similar to
Bitcoin). To enable ZKPs, Rank-1 Constraint Systems (R1CS) provide a verifiable
representation based on bilinear equations. To represent R1CS accurately and
efficiently, several language tools
like Circom, Noir, and Snarky have been proposed to automate the compilation of
advanced programs into R1CS. However, due to the flexible nature of R1CS
representation, there can be significant differences in the compiled R1CS forms
generated from circuit language programs with the same underlying semantics. To
address this issue, this paper presents a data-flow-based R1CS normalization
algorithm, which produces a standardized format for different R1CS instances
with identical semantics. Using the normalized R1CS circuits reduces the
complexity of circuit verification. In addition, this paper
presents an R1CS normalization algorithm benchmark, and our experimental
evaluation demonstrates the effectiveness and correctness of our methods.
Comment: 10 pages, 8 figures; a shorter version is accepted by PRDC 202
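For readers unfamiliar with R1CS, the sketch below fixes the object being normalized: each constraint is a bilinear equation (A·z)·(B·z) = (C·z) over a prime field, and even a trivial canonicalization (sorting sparse terms, dropping zero coefficients) already removes some surface-level differences between equivalent encodings. The field modulus, helper names and normalization rule here are illustrative assumptions and are not the paper's data-flow-based algorithm.

```python
# Sketch of an R1CS constraint and a trivial term-level canonicalization.
# Not the paper's data-flow-based normalization; illustration only.

P = 2**31 - 1   # illustrative prime modulus (real systems use ~254-bit primes)

def dot(coeffs, z):
    """Evaluate a sparse linear combination {var_index: coeff} at witness z."""
    return sum(c * z[i] for i, c in coeffs.items()) % P

def constraint_holds(a, b, c, z):
    """An R1CS constraint is the bilinear equation (A.z) * (B.z) = (C.z)."""
    return (dot(a, z) * dot(b, z)) % P == dot(c, z)

def normalize(constraint):
    """Canonical form: terms sorted by variable index, zero coefficients dropped."""
    return tuple(tuple(sorted((i, c % P) for i, c in part.items() if c % P))
                 for part in constraint)

# x * x = y, with witness z = (1, x, y); z[0] is the constant wire.
a, b, c = {1: 1}, {1: 1}, {2: 1}
z = (1, 5, 25)
print(constraint_holds(a, b, c, z))                                    # True
print(normalize((a, b, c)) == normalize(({1: 1, 0: 0}, {1: 1}, {2: 1})))  # True
```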
MICA: A fast short-read aligner that takes full advantage of Many Integrated Core Architecture (MIC)
Background: Short-read aligners have recently gained a lot of speed by exploiting the massive parallelism of GPU. An emerging alternative to GPU is the Intel MIC; supercomputers like Tianhe-2, currently at the top of the TOP500 list, are built with 48,000 MIC boards to offer ~55 PFLOPS. The CPU-like architecture of MIC allows CPU-based software to be parallelized easily; however, the performance is often inferior to GPU counterparts, as an MIC card contains only ~60 cores (while a GPU card typically has over a thousand cores). Results: To better utilize MIC-enabled computers for NGS data analysis, we developed a new short-read aligner, MICA, that is optimized in view of MIC's limitations and the extra parallelism inside each MIC core. By utilizing the 512-bit vector units in the MIC and implementing a new seeding strategy, experiments on aligning 150 bp paired-end reads show that MICA using one MIC card is 4.9 times faster than BWA-MEM (using 6 cores of a top-end CPU), and slightly faster than SOAP3-dp (using a GPU). Furthermore, MICA's simplicity allows very efficient scale-up when multiple MIC cards are used in a node (3 cards give a 14.1-fold speedup over BWA-MEM). Summary: MICA can be readily used by MIC-enabled supercomputers for production purposes. We have tested MICA on Tianhe-2 with 90 WGS samples (17.47 Tera-bases), which can be aligned in an hour using 400 nodes. MICA has impressive performance even though MIC is only in its initial stage of development. Availability and implementation: MICA's source code is freely available at http://sourceforge.net/projects/mica-aligner under GPL v3. Supplementary information: Supplementary information is available as "Additional File 1". Datasets are available at www.bio8.cs.hku.hk/dataset/mica.
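The wide-vector idea behind the 512-bit units can be mimicked in NumPy: a single vectorized comparison scores a seed against every reference window at once instead of looping base by base. This is only an analogy; the function, parameters and data below are illustrative and do not reflect MICA's MIC intrinsics or its actual seeding strategy.

```python
# NumPy analogy for wide-vector seed matching: score one seed against all
# reference windows in a single vectorized operation.

import numpy as np

def seed_hits(ref, seed, max_mismatches=2):
    """Return start positions where `seed` matches `ref` with few mismatches."""
    ref_arr = np.frombuffer(ref.encode(), dtype=np.uint8)
    seed_arr = np.frombuffer(seed.encode(), dtype=np.uint8)
    k = len(seed_arr)
    # All length-k windows of the reference as a strided 2-D view (no copy).
    windows = np.lib.stride_tricks.sliding_window_view(ref_arr, k)
    # One vectorized comparison of the seed against every window.
    mismatches = (windows != seed_arr).sum(axis=1)
    return np.flatnonzero(mismatches <= max_mismatches)

ref = "ACGTACGTGACCTACGTT"
print(seed_hits(ref, "ACGTG"))   # matches at positions 0, 4, 9 and 13 for this toy reference
```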
ZK-ProVer: Proving Programming Verification in Non-Interactive Zero-Knowledge Proofs
Program verification ensures software correctness through formal methods but incurs substantial computational overhead. It typically encodes program execution into formulas that are verified using a SAT solver and its extensions. However, this process exposes sensitive program details and requires redundant computations when multiple parties need to verify correctness. To overcome these limitations, zero-knowledge proofs (ZKPs) generate compact, reusable proofs with fast verification times, while provably hiding the program’s internal logic. We propose a two-phase zero-knowledge protocol that hides program implementation details throughout verification. Phase I uses a zero-knowledge virtual machine (zkVM) to encode programs into SAT formulas without
revealing their semantics. Phase II encodes resolution proofs for UNSAT instances and satisfying-assignment checks for SAT instances as PLONKish circuits. Evaluation on the Boolector benchmark demonstrates that our method achieves efficient verification time that is independent of clause width for UNSAT instances and of formula size for SAT instances. The resulting ZKPs enable efficient verification of program properties while providing strong end-to-end privacy guarantees.
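For the UNSAT side, the statement that Phase II's circuits attest to is that a claimed resolution proof really derives the empty clause (for SAT instances it is simply that an assignment satisfies the formula). The plain-Python checker below states that claim outside any circuit; the clause encoding and function names are assumptions for the example, and the PLONKish arithmetization itself is not reproduced.

```python
# Plain-Python statement of the UNSAT check: a resolution proof over clauses
# (sets of signed literals) must end in the empty clause.  The circuit
# encoding used by the paper is not reproduced here.

def resolve(c1, c2, pivot):
    """Resolve two clauses on `pivot`, which must occur with opposite signs."""
    assert pivot in c1 and -pivot in c2, "invalid pivot"
    return (c1 - {pivot}) | (c2 - {-pivot})

def check_resolution_proof(clauses, steps):
    """Each step is (i, j, pivot); the proof is valid if it derives {}."""
    derived = [frozenset(c) for c in clauses]
    for i, j, pivot in steps:
        derived.append(frozenset(resolve(set(derived[i]), set(derived[j]), pivot)))
    return len(derived[-1]) == 0

# (x1 or x2), (not x1 or x2), (not x2) is unsatisfiable:
cnf = [{1, 2}, {-1, 2}, {-2}]
proof = [(0, 1, 1),    # resolve on x1 -> {2}
         (3, 2, 2)]    # resolve on x2 -> {} (empty clause)
print(check_resolution_proof(cnf, proof))   # True
```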
Can Language Models Pretend Solvers? Logic Code Simulation with LLMs
Transformer-based large language models (LLMs) have demonstrated significant
potential in addressing logic problems. Capitalizing on the great capabilities
of LLMs for code-related activities, several frameworks leveraging logical
solvers for logic reasoning have been proposed recently. While existing
research predominantly focuses on viewing LLMs as natural language logic
solvers or translators, their roles as logic code interpreters and executors
have received limited attention. This study delves into a novel aspect, namely
logic code simulation, which forces LLMs to emulate logical solvers in
predicting the results of logical programs. To further investigate this novel
task, we formulate three research questions: Can LLMs efficiently simulate
the outputs of logic code? What strengths arise from logic code simulation?
And what are its pitfalls? To address these inquiries, we curate three
novel datasets tailored for the logic code simulation task and undertake
thorough experiments to establish the baseline performance of LLMs in code
simulation. Subsequently, we introduce a pioneering LLM-based code simulation
technique, Dual Chains of Logic (DCoL). This technique advocates a dual-path
thinking approach for LLMs, which has demonstrated state-of-the-art performance
compared to other LLM prompt strategies, achieving a notable improvement in
accuracy of 7.06% with GPT-4-Turbo.
Comment: 12 pages, 8 figures
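A dual-path prompting harness in the spirit described above might look like the skeleton below. The prompt wording, the `query_llm` placeholder and the reconciliation rule are all hypothetical; the actual DCoL prompt design and aggregation are defined in the paper.

```python
# Hypothetical skeleton of a dual-path ("dual chains of logic") prompting
# harness -- prompt wording, model client and reconciliation rule are
# placeholders, not the paper's exact DCoL design.

def query_llm(prompt: str) -> str:
    """Placeholder for a real model client; should return 'SAT' or 'UNSAT'."""
    raise NotImplementedError("plug in an actual LLM API call here")

def dual_path_simulate(logic_code: str) -> str:
    """Ask the model to reason along two complementary paths and reconcile."""
    path_a = ("Act as a logic solver. Trace the program below step by step and "
              "decide whether it is satisfiable.\n\n" + logic_code +
              "\n\nAnswer with SAT or UNSAT only.")
    path_b = ("Assume the program below is satisfiable and try to construct a "
              "concrete model; if no model can be found, conclude UNSAT.\n\n" +
              logic_code + "\n\nAnswer with SAT or UNSAT only.")
    answers = [query_llm(path_a).strip(), query_llm(path_b).strip()]
    # Reconciliation: accept agreement, otherwise flag the case for review.
    return answers[0] if answers[0] == answers[1] else "UNKNOWN"
```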
The oyster genome reveals stress adaptation and complexity of shell formation
The Pacific oyster Crassostrea gigas belongs to one of the most species-rich but genomically poorly explored phyla, the Mollusca. Here we report the sequencing and assembly of the oyster genome using short reads and a fosmid-pooling strategy, along with transcriptomes of development and stress response and the proteome of the shell. The oyster genome is highly polymorphic and rich in repetitive sequences, with some transposable elements still actively shaping variation. Transcriptome studies reveal an extensive set of genes responding to environmental stress. The expansion of genes coding for heat shock protein 70 and inhibitors of apoptosis is probably central to the oyster's adaptation to sessile life in the highly stressful intertidal zone. Our analyses also show that shell formation in molluscs is more complex than currently understood and involves extensive participation of cells and their exosomes. The oyster genome sequence fills a void in our understanding of the Lophotrochozoa.
