Fast and Scalable Minimal Perfect Hashing for Massive Key Sets
Minimal perfect hash functions provide space-efficient and collision-free hashing on static sets. Existing algorithms and implementations that build such functions have practical limitations on the number of input elements they can process, due to high construction time, RAM or external memory usage. We revisit a simple algorithm and show that it is highly competitive with the state of the art, especially in terms of construction time and memory usage. We provide a parallel C++ implementation called BBhash. It is capable of creating a minimal perfect hash function for 10^10 elements in less than 7 minutes using 8 threads and 5 GB of memory, and the resulting function uses 3.7 bits/element. To the best of our knowledge, this is also the first implementation that has been successfully tested on an input of cardinality 10^12.
Source code: https://github.com/rizkg/BBHash
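The multi-level bitmap scheme BBhash is known for can be sketched compactly. Below is a simplified, single-threaded C++ illustration of the idea; all names and parameters (SimpleMphf, gamma, max_levels) are ours, not BBhash's, and the naive bit-counting in query stands in for the precomputed rank structures a real implementation would use.

```cpp
// Simplified single-threaded sketch of a cascading-bitmap MPHF (illustrative
// names; not the real BBhash code). Keys that hash without collision at a
// level "settle" there; the rest cascade to the next, smaller level.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

struct SimpleMphf {
    std::vector<std::vector<bool>> levels;               // one bit array per level
    std::unordered_map<std::string, uint64_t> fallback;  // keys that never settled

    static uint64_t hash(const std::string& key, uint64_t level) {
        return std::hash<std::string>{}(key + '#' + std::to_string(level));
    }

    void build(std::vector<std::string> keys, double gamma = 2.0, int max_levels = 16) {
        for (int lvl = 0; lvl < max_levels && !keys.empty(); ++lvl) {
            uint64_t m = std::max<uint64_t>(1, uint64_t(gamma * double(keys.size())));
            std::vector<uint8_t> count(m, 0);
            for (const auto& k : keys) {
                uint64_t h = hash(k, lvl) % m;
                if (count[h] < 2) ++count[h];            // saturating collision counter
            }
            std::vector<bool> bits(m, false);
            std::vector<std::string> next;
            for (auto& k : keys) {
                uint64_t h = hash(k, lvl) % m;
                if (count[h] == 1) bits[h] = true;       // unique: key settles here
                else next.push_back(std::move(k));       // collision: retry next level
            }
            levels.push_back(std::move(bits));
            keys.swap(next);
        }
        uint64_t rank = 0;                    // rare leftovers go into a plain map
        for (const auto& lv : levels) for (bool b : lv) rank += b;
        for (const auto& k : keys) fallback[k] = rank++;
    }

    // Rank of the key's set bit among all set bits; only valid for built keys.
    uint64_t query(const std::string& key) const {
        uint64_t rank = 0;
        for (size_t lvl = 0; lvl < levels.size(); ++lvl) {
            const auto& bits = levels[lvl];
            uint64_t h = hash(key, lvl) % bits.size();
            if (bits[h]) {
                for (uint64_t i = 0; i < h; ++i) rank += bits[i];  // naive rank;
                return rank;        // real code uses a precomputed rank index
            }
            for (bool b : bits) rank += b;    // key not here: skip this level
        }
        return fallback.at(key);
    }
};
```

Because colliding keys are simply retried on a fresh bit array, construction is a sequence of embarrassingly parallel hashing passes, which is what makes the multi-threaded, memory-frugal builds reported above possible.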
Toward Optimal Fingerprint Indexing for Large Scale Genomics
Motivation: To keep up with the scale of genomic databases, several methods rely on locality-sensitive hashing to efficiently find potential matches within large genome collections. Existing solutions rely on MinHash or HyperLogLog fingerprints and require reading the whole index to perform a query. Such solutions cannot be considered scalable as the number of documents to index grows.
Results: We present NIQKI, a novel structure with well-designed fingerprints that lead to theoretical and practical improvements in query time, outperforming the state of the art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of HyperMinHash fingerprints into (h,m)-HMH fingerprints, which can be tuned to give the lowest false positive rate for the expected amount of sub-sampling. Second, we provide a structure, based on inverted indexes, able to index any kind of fingerprint with optimal query time, namely linear in the size of the output. Third, we implement these approaches in a tool dubbed NIQKI, which can index and compute pairwise distances for over one million bacterial genomes from GenBank in a few days on a small cluster. We show that our approach can be orders of magnitude faster than the state of the art with comparable precision. We believe this approach can lead to tremendous improvements, allowing fast queries and scaling on extensive genomic databases.
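As a rough illustration of the inverted-index principle described above (our sketch, not NIQKI's code): each genome is summarized by a fixed-length fingerprint vector, and for every (slot, value) pair the index keeps the list of genome ids sharing that value, so a query only walks lists that actually contribute to its output.

```cpp
// Illustrative inverted index over fingerprint vectors (our sketch, not
// NIQKI's code): for each fingerprint slot, map each value to the list of
// genome ids carrying it, so queries only touch lists that produce output.
#include <cstdint>
#include <unordered_map>
#include <vector>

using Fingerprint = std::vector<uint32_t>;  // one small hashed value per slot

struct InvertedFingerprintIndex {
    size_t slots;
    std::vector<std::unordered_map<uint32_t, std::vector<uint32_t>>> index;

    explicit InvertedFingerprintIndex(size_t s) : slots(s), index(s) {}

    void insert(uint32_t genome_id, const Fingerprint& fp) {
        for (size_t i = 0; i < slots; ++i)
            index[i][fp[i]].push_back(genome_id);
    }

    // Count matching slots per genome; work done is proportional to the
    // number of (genome, slot) matches reported, i.e. linear in output size.
    std::unordered_map<uint32_t, uint32_t> query(const Fingerprint& fp) const {
        std::unordered_map<uint32_t, uint32_t> shared;
        for (size_t i = 0; i < slots; ++i) {
            auto it = index[i].find(fp[i]);
            if (it == index[i].end()) continue;
            for (uint32_t id : it->second) ++shared[id];
        }
        return shared;  // shared[id] / slots estimates fingerprint similarity
    }
};
```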
Locality-preserving minimal perfect hashing of k-mers
Motivation: Minimal perfect hashing is the problem of mapping a static set of n distinct keys into the address space {1,...,n} bijectively. It is well known that n log2(e) bits are necessary to specify a minimal perfect hash function (MPHF) f when no additional knowledge of the input keys is to be used. However, it is often the case in practice that the input keys have intrinsic relationships that we can exploit to lower the bit complexity of f. For example, consider a string and the set of all its distinct k-mers as input keys: since two consecutive k-mers share an overlap of k - 1 symbols, it seems possible to beat the classic log2(e) bits/key barrier in this case. Moreover, we would like f to map consecutive k-mers to consecutive addresses, so as to preserve as much as possible their relationship in the codomain. This is a useful feature in practice, as it guarantees a certain degree of locality of reference for f, resulting in better evaluation time when querying consecutive k-mers.
Results: Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF, designed for k-mers extracted consecutively from a collection of strings. We design a construction whose space usage decreases for growing k and discuss experiments with a practical implementation of the method: in practice, the functions built with our method can be several times smaller and even faster to query than the most efficient MPHFs in the literature.
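For reference, the log2(e) bits/key barrier mentioned above comes from a standard counting argument (restated here, not taken from the paper): a uniformly random function from n keys to {1,...,n} is a bijection with probability n!/n^n, so pinning down an MPHF costs at least

```latex
% Standard counting argument for the MPHF space lower bound:
\log_2 \frac{n^n}{n!}
  = n \log_2 n - \log_2 n!
  \approx n \log_2 n - (n \log_2 n - n \log_2 e)
  = n \log_2 e
  \approx 1.44\, n \ \text{bits},
% using Stirling's approximation  \log_2 n! \approx n \log_2 n - n \log_2 e.
```

Exploiting the k-1 symbol overlap between consecutive k-mers is what lets the construction above dip below this bound, since the keys are no longer arbitrary.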
Fractional Hitting Sets for Efficient and Lightweight Genomic Data Sketching
The exponential increase in publicly available sequencing data and genomic resources necessitates the development of highly efficient methods for data processing and analysis. Locality-sensitive hashing techniques have successfully transformed large datasets into smaller, more manageable sketches while maintaining comparability using metrics such as Jaccard and containment indices. However, fixed-size sketches encounter difficulties when applied to divergent datasets.
Scalable sketching methods, such as Sourmash, provide valuable solutions but still lack resource-efficient, tailored indexing. Our objective is to create lighter sketches with comparable results while enhancing efficiency. We introduce the concept of Fractional Hitting Sets, a generalization of Universal Hitting Sets, which uniformly cover a specified fraction of the k-mer space. In theory and practice, we demonstrate the feasibility of achieving such coverage with simple but highly efficient schemes.
By encoding the covered k-mers as super-k-mers, we provide a space-efficient exact representation that also enables optimized comparisons. Our novel tool, SuperSampler, implements this scheme, and experimental results with real bacterial collections closely match our theoretical findings.
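One simple scheme matching this description (our illustration; SuperSampler's exact rule may differ) is to keep a k-mer if and only if the hash of its minimizer falls below a fraction f of the hash space. Because consecutive k-mers tend to share a minimizer, the selected k-mers cluster into runs that can be stored as super-k-mers:

```cpp
// Our illustration of a fractional hitting scheme (SuperSampler's exact rule
// may differ): keep a k-mer iff the hash of its minimizer is below fraction f
// of the hash space; runs of kept k-mers are emitted as super-k-mers.
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

static uint64_t h(const std::string& s) { return std::hash<std::string>{}(s); }

// Smallest-hash m-mer of window w (recomputed per window for clarity only;
// real tools use rolling minimizers).
static std::string minimizer(const std::string& w, size_t m) {
    std::string best = w.substr(0, m);
    for (size_t i = 1; i + m <= w.size(); ++i) {
        std::string cand = w.substr(i, m);
        if (h(cand) < h(best)) best = cand;
    }
    return best;
}

std::vector<std::string> super_kmers(const std::string& seq, size_t k,
                                     size_t m, double f) {
    const uint64_t threshold = uint64_t(f * double(UINT64_MAX));
    std::vector<std::string> out;
    long run_start = -1;                     // start of the current kept run
    for (size_t i = 0; i + k <= seq.size(); ++i) {
        bool keep = h(minimizer(seq.substr(i, k), m)) <= threshold;
        if (keep && run_start < 0) run_start = long(i);
        if (!keep && run_start >= 0) {       // close run: last kept k-mer at i-1
            out.push_back(seq.substr(size_t(run_start),
                                     (i - 1) - size_t(run_start) + k));
            run_start = -1;
        }
    }
    if (run_start >= 0)                      // run extends to the end of seq
        out.push_back(seq.substr(size_t(run_start)));
    return out;
}
```

Since each super-k-mer stores the shared overlaps only once, the covered k-mers come essentially for free once the minimizer test selects them.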
In comparison to Sourmash, SuperSampler achieves similar outcomes while using an order of magnitude less space and memory and running several times faster. This highlights the potential of our approach in addressing the challenges presented by the ever-expanding landscape of genomic data.
Immunomodulation of phloretin by impairing dendritic cell activation and function
Dietary compounds in fruits and vegetables have been shown to exert many biological activities. In addition to antioxidant effects, a number of flavonoids are able to modulate inflammatory responses. Here, we demonstrated that phloretin (PT), a natural dihydrochalcone found in many fruits, suppressed the activation and function of mouse dendritic cells (DCs). Phloretin disturbed multiple intracellular signaling pathways induced in DCs by the Toll-like receptor 4 (TLR4) agonist lipopolysaccharide (LPS), including ROS, MAPKs (ERK, JNK, p38 MAPK), and NF-κB, thereby reducing the production of inflammatory cytokines and chemokines. Phloretin also effectively suppressed the activation of DCs treated with different dosages of LPS or various TLR agonists. Phloretin attenuated LPS-induced DC maturation, down-regulating the expression of MHC class II and co-stimulatory molecules, and consequently inhibited LPS-stimulated DCs from activating naïve T cells in a mixed lymphocyte reaction. Moreover, in vivo administration of phloretin suppressed the phenotypic maturation of LPS-challenged splenic DCs and decreased IFN-γ production by activated CD4 T cells. Thus, we suggest that phloretin may act as an immunomodulator by impairing the activation and function of DCs, and that phloretin-containing fruits may be helpful in ameliorating inflammatory and autoimmune diseases.
BGREAT: A De Bruijn graph read mapping tool
Mapping reads on references is a central task in numerous genomic studies. Since references are increasingly extracted from assembly graphs, it is of high interest to map efficiently on such structures. The problem of mapping sequences on a de Bruijn graph has been shown to be NP-complete [1], and no scalable generic tool exists yet. We motivate the problem of mapping reads on a de Bruijn graph and present a practical solution and its implementation, called BGREAT. BGREAT handles real-world instances of billions of reads with moderate resources. Mapping on a de Bruijn graph keeps the whole genomic information and avoids possible assembly mistakes. However, the problem is theoretically hard on real-world datasets. Using a set of heuristics, our tool is able to map millions of reads per CPU hour even on complex human genomes. BGREAT is available at github.com/Malfoy/BGREAT.
[1] Limasset, A., & Peterlongo, P. (2015). Read mapping on de Bruijn graph. arXiv preprint arXiv:1505.04911.
[2] Langmead, B., et al. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10(3): R25.
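To give the flavor of heuristic mapping on a node-centric de Bruijn graph (a toy sketch; BGREAT itself works on unitigs with far richer heuristics), one can greedily follow the read through the graph and count the substitutions needed to stay on existing nodes:

```cpp
// Toy greedy mapping of a read onto a node-centric de Bruijn graph (our
// sketch, not BGREAT's algorithm). The graph is just the set of k-mers; we
// follow the read and pay one mismatch whenever we must deviate onto an
// existing graph node instead of the read's own next k-mer.
#include <string>
#include <unordered_set>

using Graph = std::unordered_set<std::string>;  // node-centric dBG: k-mer set

// Returns the number of substitutions on the greedy path, or -1 on failure.
int map_read(const Graph& g, const std::string& read, size_t k) {
    if (read.size() < k || !g.count(read.substr(0, k))) return -1;  // no anchor
    std::string cur = read.substr(0, k);
    int mismatches = 0;
    for (size_t i = k; i < read.size(); ++i) {
        std::string next = cur.substr(1) + read[i];
        if (!g.count(next)) {                   // read base absent: try an edit
            bool found = false;
            for (char c : {'A', 'C', 'G', 'T'}) {
                std::string alt = cur.substr(1) + c;
                if (g.count(alt)) { next = alt; ++mismatches; found = true; break; }
            }
            if (!found) return -1;              // dead end: greedy mapping fails
        }
        cur = next;
    }
    return mismatches;
}
```

The greedy choice at branching nodes is exactly where heuristics earn their keep: an exhaustive search over all graph paths is what makes the exact problem intractable.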
Minimal perfect hash functions in large-scale bioinformatics
Genomic and metagenomic fields, generating huge sets of short genomic sequences, brought their own share of high performance problems. To extract relevant pieces of information from the huge data sets generated by current sequencing techniques, one must rely on extremely scalable methods and solutions. Indexing billions of objects is a task considered too expensive, while being a fundamental need in this field. In this paper we propose a straightforward indexing structure that scales to billions of elements and we propose two direct applications in genomics and metagenomics. We show that our proposal solves problem instances for which no other known solution scales up. We believe that many tools and applications could benefit from either the fundamental data structure we provide or from the applications developed from this structure.
A resource-frugal probabilistic dictionary and applications in (meta)genomics
Genomic and metagenomic fields, generating huge sets of short genomic sequences, brought their own share of high performance problems. To extract relevant pieces of information from the huge data sets generated by current sequencing techniques, one must rely on extremely scalable methods and solutions. Indexing billions of objects is a task considered too expensive, while being a fundamental need in this field. In this paper we propose a straightforward indexing structure that scales to billions of elements and we propose two direct applications in genomics and metagenomics. We show that our proposal solves problem instances for which no other known solution scales up. We believe that many tools and applications could benefit from either the fundamental data structure we provide or from the applications developed from this structure.
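A probabilistic dictionary of this kind can be sketched as an MPHF plus a small fingerprint per key (our reading of the idea; names are illustrative): the MPHF assigns every indexed key a slot, and the stored fingerprint rejects most keys that were never indexed, with false positives at rate about 2^-b for b-bit fingerprints. In the sketch below the MPHF is mocked with an unordered_map purely so the code runs stand-alone; a real implementation stores no keys at all.

```cpp
// Sketch of an MPHF + fingerprint probabilistic dictionary (illustrative).
// The MPHF is mocked with an unordered_map for self-containment; a real MPHF
// (e.g. BBhash) stores no keys, so an alien key silently receives some slot
// and is caught by the fingerprint check with probability about 1 - 2^-8.
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct ProbabilisticDict {
    std::unordered_map<std::string, uint64_t> mphf;  // stand-in for a real MPHF
    std::vector<uint8_t>  fingerprints;              // 8 bits per indexed key
    std::vector<uint32_t> values;                    // payload per indexed key

    static uint8_t fp(const std::string& k) {
        return uint8_t(std::hash<std::string>{}(k) >> 24);
    }

    void build(const std::vector<std::pair<std::string, uint32_t>>& items) {
        fingerprints.resize(items.size());
        values.resize(items.size());
        uint64_t slot = 0;
        for (const auto& [k, v] : items) {
            mphf[k] = slot;                  // a real MPHF computes, not stores
            fingerprints[slot] = fp(k);
            values[slot] = v;
            ++slot;
        }
    }

    // False positives only when an alien key collides on its fingerprint.
    bool lookup(const std::string& k, uint32_t& out) const {
        auto it = mphf.find(k);
        if (it == mphf.end()) return false;  // mock-only exit; see note above
        if (fingerprints[it->second] != fp(k)) return false;
        out = values[it->second];
        return true;
    }
};
```

The space budget is then roughly the MPHF (a few bits per key) plus b fingerprint bits and the payload, which is what makes indexing billions of elements affordable.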
STRONG: metagenomics strain resolution on assembly graphs
We introduce STrain Resolution ON assembly Graphs (STRONG), which identifies strains de novo from multiple metagenome samples. STRONG performs coassembly and binning into metagenome-assembled genomes (MAGs), and stores the coassembly graph prior to variant simplification. This enables the subgraphs, and their per-sample unitig coverages, for individual single-copy core genes (SCGs) in each MAG to be extracted. A Bayesian algorithm, BayesPaths, determines the number of strains present, their haplotypes or sequences on the SCGs, and their abundances. STRONG is validated using synthetic communities, and for a real anaerobic digester time series it generates haplotypes that match those observed in long Nanopore reads.
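As a much-simplified illustration of the kind of computation at the core of strain resolution (ours; BayesPaths is a full variational Bayesian model, not a least-squares fit): given per-sample unitig coverages and a 0/1 matrix recording which candidate haplotypes use which unitigs, non-negative strain abundances can be estimated by projected gradient descent.

```cpp
// Much-simplified illustration (ours): estimate non-negative strain
// abundances from unitig coverages by projected gradient descent on
// min_{a >= 0} ||X a - c||^2, where X[i][j] = 1 iff haplotype j uses
// unitig i and c[i] is the observed coverage of unitig i in one sample.
#include <cstdio>
#include <vector>

std::vector<double> nnls_pg(const std::vector<std::vector<double>>& X,
                            const std::vector<double>& c,
                            int iters = 500, double step = 0.1) {
    size_t u = X.size(), t = X[0].size();
    std::vector<double> a(t, 1.0);
    for (int it = 0; it < iters; ++it) {
        std::vector<double> r(u, 0.0);            // residual X a - c
        for (size_t i = 0; i < u; ++i) {
            for (size_t j = 0; j < t; ++j) r[i] += X[i][j] * a[j];
            r[i] -= c[i];
        }
        for (size_t j = 0; j < t; ++j) {          // gradient step, then project
            double g = 0.0;
            for (size_t i = 0; i < u; ++i) g += X[i][j] * r[i];
            a[j] -= step * g;
            if (a[j] < 0.0) a[j] = 0.0;           // enforce non-negativity
        }
    }
    return a;
}

int main() {
    // Three unitigs, two strains: unitig 0 only in strain 0, unitig 2 only in
    // strain 1, unitig 1 shared; coverages generated by abundances (10, 4).
    std::vector<std::vector<double>> X = {{1, 0}, {1, 1}, {0, 1}};
    std::vector<double> c = {10, 14, 4};
    std::vector<double> a = nnls_pg(X, c);
    std::printf("estimated abundances: %.2f %.2f\n", a[0], a[1]);  // ~10.00 4.00
}
```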
