Fast and Scalable Minimal Perfect Hashing for Massive Key Sets
Minimal perfect hash functions provide space-efficient and collision-free hashing on static sets. Existing algorithms and implementations that build such functions have practical limitations on the number of input elements they can process, due to high construction time, RAM or external memory usage. We revisit a simple algorithm and show that it is highly competitive with the state of the art, especially in terms of construction time and memory usage. We provide a parallel C++ implementation called BBhash. It is capable of creating a minimal perfect hash function for 10^10 elements in less than 7 minutes using 8 threads and 5 GB of memory, and the resulting function uses 3.7 bits/element. To the best of our knowledge, this is also the first implementation that has been successfully tested on an input of cardinality 10^12.
Source code: https://github.com/rizkg/BBHash
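The multi-level bitmap scheme BBhash is known for can be sketched compactly. Below is a simplified, single-threaded C++ illustration of the idea; all names and parameters (SimpleMphf, gamma, max_levels) are ours, not BBhash's, and the naive bit-counting in query stands in for the precomputed rank structures a real implementation would use.

```cpp
// Simplified single-threaded sketch of a cascading-bitmap MPHF (illustrative
// names; not the real BBhash code). Keys that hash without collision at a
// level "settle" there; the rest cascade to the next, smaller level.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

struct SimpleMphf {
    std::vector<std::vector<bool>> levels;               // one bit array per level
    std::unordered_map<std::string, uint64_t> fallback;  // keys that never settled

    static uint64_t hash(const std::string& key, uint64_t level) {
        return std::hash<std::string>{}(key + '#' + std::to_string(level));
    }

    void build(std::vector<std::string> keys, double gamma = 2.0, int max_levels = 16) {
        for (int lvl = 0; lvl < max_levels && !keys.empty(); ++lvl) {
            uint64_t m = std::max<uint64_t>(1, uint64_t(gamma * double(keys.size())));
            std::vector<uint8_t> count(m, 0);
            for (const auto& k : keys) {
                uint64_t h = hash(k, lvl) % m;
                if (count[h] < 2) ++count[h];            // saturating collision counter
            }
            std::vector<bool> bits(m, false);
            std::vector<std::string> next;
            for (auto& k : keys) {
                uint64_t h = hash(k, lvl) % m;
                if (count[h] == 1) bits[h] = true;       // unique: key settles here
                else next.push_back(std::move(k));       // collision: retry next level
            }
            levels.push_back(std::move(bits));
            keys.swap(next);
        }
        uint64_t rank = 0;                    // rare leftovers go into a plain map
        for (const auto& lv : levels) for (bool b : lv) rank += b;
        for (const auto& k : keys) fallback[k] = rank++;
    }

    // Rank of the key's set bit among all set bits; only valid for built keys.
    uint64_t query(const std::string& key) const {
        uint64_t rank = 0;
        for (size_t lvl = 0; lvl < levels.size(); ++lvl) {
            const auto& bits = levels[lvl];
            uint64_t h = hash(key, lvl) % bits.size();
            if (bits[h]) {
                for (uint64_t i = 0; i < h; ++i) rank += bits[i];  // naive rank;
                return rank;        // real code uses a precomputed rank index
            }
            for (bool b : bits) rank += b;    // key not here: skip this level
        }
        return fallback.at(key);
    }
};
```

Because colliding keys are simply retried on a fresh bit array, construction is a sequence of embarrassingly parallel hashing passes, which is what makes the multi-threaded, memory-frugal builds reported above possible.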
Toward Optimal Fingerprint Indexing for Large Scale Genomics
Motivation: To keep up with the scale of genomic databases, several methods rely on locality-sensitive hashing to efficiently find potential matches within large genome collections. Existing solutions rely on MinHash or HyperLogLog fingerprints and require reading the whole index to perform a query. Such solutions cannot be considered scalable as the number of documents to index grows.
Results: We present NIQKI, a novel structure with well-designed fingerprints that lead to theoretical and practical improvements in query time, outperforming the state of the art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of HyperMinHash fingerprints into (h,m)-HMH fingerprints, which can be tuned to give the lowest false positive rate for the expected amount of sub-sampling. Second, we provide a structure, based on inverted indexes, able to index any kind of fingerprint with optimal query time, namely linear in the size of the output. Third, we implement these approaches in a tool dubbed NIQKI, which can index and compute pairwise distances for over one million bacterial genomes from GenBank in a few days on a small cluster. We show that our approach can be orders of magnitude faster than the state of the art with comparable precision. We believe this approach can lead to tremendous improvements, allowing fast queries and scaling on extensive genomic databases.
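As a rough illustration of the inverted-index principle described above (our sketch, not NIQKI's code): each genome is summarized by a fixed-length fingerprint vector, and for every (slot, value) pair the index keeps the list of genome ids sharing that value, so a query only walks lists that actually contribute to its output.

```cpp
// Illustrative inverted index over fingerprint vectors (our sketch, not
// NIQKI's code): for each fingerprint slot, map each value to the list of
// genome ids carrying it, so queries only touch lists that produce output.
#include <cstdint>
#include <unordered_map>
#include <vector>

using Fingerprint = std::vector<uint32_t>;  // one small hashed value per slot

struct InvertedFingerprintIndex {
    size_t slots;
    std::vector<std::unordered_map<uint32_t, std::vector<uint32_t>>> index;

    explicit InvertedFingerprintIndex(size_t s) : slots(s), index(s) {}

    void insert(uint32_t genome_id, const Fingerprint& fp) {
        for (size_t i = 0; i < slots; ++i)
            index[i][fp[i]].push_back(genome_id);
    }

    // Count matching slots per genome; work done is proportional to the
    // number of (genome, slot) matches reported, i.e. linear in output size.
    std::unordered_map<uint32_t, uint32_t> query(const Fingerprint& fp) const {
        std::unordered_map<uint32_t, uint32_t> shared;
        for (size_t i = 0; i < slots; ++i) {
            auto it = index[i].find(fp[i]);
            if (it == index[i].end()) continue;
            for (uint32_t id : it->second) ++shared[id];
        }
        return shared;  // shared[id] / slots estimates fingerprint similarity
    }
};
```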
Locality-preserving minimal perfect hashing of k-mers
Motivation: Minimal perfect hashing is the problem of mapping a static set of n distinct keys into the address space {1,...,n} bijectively. It is well known that n log2(e) bits are necessary to specify a minimal perfect hash function (MPHF) f when no additional knowledge of the input keys is to be used. However, it is often the case in practice that the input keys have intrinsic relationships that we can exploit to lower the bit complexity of f. For example, consider a string and the set of all its distinct k-mers as input keys: since two consecutive k-mers share an overlap of k - 1 symbols, it seems possible to beat the classic log2(e) bits/key barrier in this case. Moreover, we would like f to map consecutive k-mers to consecutive addresses, so as to preserve as much as possible their relationship in the codomain. This is a useful feature in practice, as it guarantees a certain degree of locality of reference for f, resulting in better evaluation time when querying consecutive k-mers.
Results: Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF, designed for k-mers extracted consecutively from a collection of strings. We design a construction whose space usage decreases for growing k and discuss experiments with a practical implementation of the method: in practice, the functions built with our method can be several times smaller and even faster to query than the most efficient MPHFs in the literature.
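For reference, the log2(e) bits/key barrier mentioned above comes from a standard counting argument (restated here, not taken from the paper): a uniformly random function from n keys to {1,...,n} is a bijection with probability n!/n^n, so pinning down an MPHF costs at least

```latex
% Standard counting argument for the MPHF space lower bound:
\log_2 \frac{n^n}{n!}
  = n \log_2 n - \log_2 n!
  \approx n \log_2 n - (n \log_2 n - n \log_2 e)
  = n \log_2 e
  \approx 1.44\, n \ \text{bits},
% using Stirling's approximation  \log_2 n! \approx n \log_2 n - n \log_2 e.
```

Exploiting the k-1 symbol overlap between consecutive k-mers is what lets the construction above dip below this bound, since the keys are no longer arbitrary.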
Fractional Hitting Sets for Efficient and Lightweight Genomic Data Sketching
The exponential increase in publicly available sequencing data and genomic resources necessitates the development of highly efficient methods for data processing and analysis. Locality-sensitive hashing techniques have successfully transformed large datasets into smaller, more manageable sketches while maintaining comparability using metrics such as Jaccard and containment indices. However, fixed-size sketches encounter difficulties when applied to divergent datasets.
Scalable sketching methods, such as Sourmash, provide valuable solutions but still lack resource-efficient, tailored indexing. Our objective is to create lighter sketches with comparable results while enhancing efficiency. We introduce the concept of Fractional Hitting Sets, a generalization of Universal Hitting Sets, which uniformly cover a specified fraction of the k-mer space. In theory and practice, we demonstrate the feasibility of achieving such coverage with simple but highly efficient schemes.
By encoding the covered k-mers as super-k-mers, we provide a space-efficient exact representation that also enables optimized comparisons. Our novel tool, SuperSampler, implements this scheme, and experimental results with real bacterial collections closely match our theoretical findings.
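One simple scheme matching this description (our illustration; SuperSampler's exact rule may differ) is to keep a k-mer if and only if the hash of its minimizer falls below a fraction f of the hash space. Because consecutive k-mers tend to share a minimizer, the selected k-mers cluster into runs that can be stored as super-k-mers:

```cpp
// Our illustration of a fractional hitting scheme (SuperSampler's exact rule
// may differ): keep a k-mer iff the hash of its minimizer is below fraction f
// of the hash space; runs of kept k-mers are emitted as super-k-mers.
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

static uint64_t h(const std::string& s) { return std::hash<std::string>{}(s); }

// Smallest-hash m-mer of window w (recomputed per window for clarity only;
// real tools use rolling minimizers).
static std::string minimizer(const std::string& w, size_t m) {
    std::string best = w.substr(0, m);
    for (size_t i = 1; i + m <= w.size(); ++i) {
        std::string cand = w.substr(i, m);
        if (h(cand) < h(best)) best = cand;
    }
    return best;
}

std::vector<std::string> super_kmers(const std::string& seq, size_t k,
                                     size_t m, double f) {
    const uint64_t threshold = uint64_t(f * double(UINT64_MAX));
    std::vector<std::string> out;
    long run_start = -1;                     // start of the current kept run
    for (size_t i = 0; i + k <= seq.size(); ++i) {
        bool keep = h(minimizer(seq.substr(i, k), m)) <= threshold;
        if (keep && run_start < 0) run_start = long(i);
        if (!keep && run_start >= 0) {       // close run: last kept k-mer at i-1
            out.push_back(seq.substr(size_t(run_start),
                                     (i - 1) - size_t(run_start) + k));
            run_start = -1;
        }
    }
    if (run_start >= 0)                      // run extends to the end of seq
        out.push_back(seq.substr(size_t(run_start)));
    return out;
}
```

Since each super-k-mer stores the shared overlaps only once, the covered k-mers come essentially for free once the minimizer test selects them.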
In comparison to Sourmash, SuperSampler achieves similar outcomes while using an order of magnitude less space and memory and running several times faster. This highlights the potential of our approach in addressing the challenges presented by the ever-expanding landscape of genomic data.
Immunomodulation of phloretin by impairing dendritic cell activation and function
Dietary compounds in fruits and vegetables have been shown to exert many biological activities. In addition to antioxidant effects, a number of flavonoids are able to modulate inflammatory responses. Here, we demonstrated that phloretin (PT), a natural dihydrochalcone found in many fruits, suppressed the activation and function of mouse dendritic cells (DCs). Phloretin disturbed multiple intracellular signaling pathways induced in DCs by the Toll-like receptor 4 (TLR4) agonist lipopolysaccharide (LPS), including ROS, MAPKs (ERK, JNK, p38 MAPK), and NF-κB, thereby reducing the production of inflammatory cytokines and chemokines. Phloretin also effectively suppressed the activation of DCs treated with different dosages of LPS or various TLR agonists. Phloretin attenuated LPS-induced DC maturation, down-regulating the expression of MHC class II and co-stimulatory molecules, and consequently inhibited LPS-stimulated DCs from activating naïve T cells in a mixed lymphocyte reaction. Moreover, in vivo administration of phloretin suppressed the phenotypic maturation of LPS-challenged splenic DCs and decreased IFN-γ production by activated CD4 T cells. Thus, we suggest that phloretin may act as an immunomodulator by impairing the activation and function of DCs, and that phloretin-containing fruits may be helpful in ameliorating inflammatory and autoimmune diseases.
BGREAT: A De Bruijn graph read mapping tool
Mapping reads on references is a central task in numerous genomic studies. Since references are increasingly extracted from assembly graphs, it is of high interest to map efficiently on such structures. The problem of mapping sequences on a de Bruijn graph has been shown to be NP-complete [1], and no scalable generic tool exists yet. We motivate the problem of mapping reads on a de Bruijn graph and present a practical solution and its implementation, called BGREAT. BGREAT handles real-world instances of billions of reads with moderate resources. Mapping on a de Bruijn graph keeps the whole genomic information and avoids possible assembly mistakes. However, the problem is theoretically hard on real-world datasets. Using a set of heuristics, our tool is able to map millions of reads per CPU hour even on complex human genomes. BGREAT is available at github.com/Malfoy/BGREAT.
[1] Limasset, A., & Peterlongo, P. (2015). Read mapping on de Bruijn graph. arXiv preprint arXiv:1505.04911.
[2] Langmead, B., et al. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10(3): R25.
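To give the flavor of heuristic mapping on a node-centric de Bruijn graph (a toy sketch; BGREAT itself works on unitigs with far richer heuristics), one can greedily follow the read through the graph and count the substitutions needed to stay on existing nodes:

```cpp
// Toy greedy mapping of a read onto a node-centric de Bruijn graph (our
// sketch, not BGREAT's algorithm). The graph is just the set of k-mers; we
// follow the read and pay one mismatch whenever we must deviate onto an
// existing graph node instead of the read's own next k-mer.
#include <string>
#include <unordered_set>

using Graph = std::unordered_set<std::string>;  // node-centric dBG: k-mer set

// Returns the number of substitutions on the greedy path, or -1 on failure.
int map_read(const Graph& g, const std::string& read, size_t k) {
    if (read.size() < k || !g.count(read.substr(0, k))) return -1;  // no anchor
    std::string cur = read.substr(0, k);
    int mismatches = 0;
    for (size_t i = k; i < read.size(); ++i) {
        std::string next = cur.substr(1) + read[i];
        if (!g.count(next)) {                   // read base absent: try an edit
            bool found = false;
            for (char c : {'A', 'C', 'G', 'T'}) {
                std::string alt = cur.substr(1) + c;
                if (g.count(alt)) { next = alt; ++mismatches; found = true; break; }
            }
            if (!found) return -1;              // dead end: greedy mapping fails
        }
        cur = next;
    }
    return mismatches;
}
```

The greedy choice at branching nodes is exactly where heuristics earn their keep: an exhaustive search over all graph paths is what makes the exact problem intractable.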
Minimal perfect hash functions in large-scale bioinformatics
Genomic and metagenomic fields, generating huge sets of short genomic sequences, brought their own share of high performance problems. To extract relevant pieces of information from the huge data sets generated by current sequencing techniques, one must rely on extremely scalable methods and solutions. Indexing billions of objects is a task considered too expensive, while being a fundamental need in this field. In this paper we propose a straightforward indexing structure that scales to billions of elements and we propose two direct applications in genomics and metagenomics. We show that our proposal solves problem instances for which no other known solution scales up. We believe that many tools and applications could benefit from either the fundamental data structure we provide or from the applications developed from this structure.
A resource-frugal probabilistic dictionary and applications in (meta)genomics
Genomic and metagenomic fields, generating huge sets of short genomic sequences, brought their own share of high performance problems. To extract relevant pieces of information from the huge data sets generated by current sequencing techniques, one must rely on extremely scalable methods and solutions. Indexing billions of objects is a task considered too expensive, while being a fundamental need in this field. In this paper we propose a straightforward indexing structure that scales to billions of elements and we propose two direct applications in genomics and metagenomics. We show that our proposal solves problem instances for which no other known solution scales up. We believe that many tools and applications could benefit from either the fundamental data structure we provide or from the applications developed from this structure.
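A probabilistic dictionary of this kind can be sketched as an MPHF plus a small fingerprint per key (our reading of the idea; names are illustrative): the MPHF assigns every indexed key a slot, and the stored fingerprint rejects most keys that were never indexed, with false positives at rate about 2^-b for b-bit fingerprints. In the sketch below the MPHF is mocked with an unordered_map purely so the code runs stand-alone; a real implementation stores no keys at all.

```cpp
// Sketch of an MPHF + fingerprint probabilistic dictionary (illustrative).
// The MPHF is mocked with an unordered_map for self-containment; a real MPHF
// (e.g. BBhash) stores no keys, so an alien key silently receives some slot
// and is caught by the fingerprint check with probability about 1 - 2^-8.
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct ProbabilisticDict {
    std::unordered_map<std::string, uint64_t> mphf;  // stand-in for a real MPHF
    std::vector<uint8_t>  fingerprints;              // 8 bits per indexed key
    std::vector<uint32_t> values;                    // payload per indexed key

    static uint8_t fp(const std::string& k) {
        return uint8_t(std::hash<std::string>{}(k) >> 24);
    }

    void build(const std::vector<std::pair<std::string, uint32_t>>& items) {
        fingerprints.resize(items.size());
        values.resize(items.size());
        uint64_t slot = 0;
        for (const auto& [k, v] : items) {
            mphf[k] = slot;                  // a real MPHF computes, not stores
            fingerprints[slot] = fp(k);
            values[slot] = v;
            ++slot;
        }
    }

    // False positives only when an alien key collides on its fingerprint.
    bool lookup(const std::string& k, uint32_t& out) const {
        auto it = mphf.find(k);
        if (it == mphf.end()) return false;  // mock-only exit; see note above
        if (fingerprints[it->second] != fp(k)) return false;
        out = values[it->second];
        return true;
    }
};
```

The space budget is then roughly the MPHF (a few bits per key) plus b fingerprint bits and the payload, which is what makes indexing billions of elements affordable.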
STRONG: metagenomics strain resolution on assembly graphs
We introduce STrain Resolution ON assembly Graphs (STRONG), which identifies strains de novo from multiple metagenome samples. STRONG performs coassembly and binning into metagenome-assembled genomes (MAGs), and stores the coassembly graph prior to variant simplification. This enables the subgraphs, and their per-sample unitig coverages, for individual single-copy core genes (SCGs) in each MAG to be extracted. A Bayesian algorithm, BayesPaths, determines the number of strains present, their haplotypes or sequences on the SCGs, and their abundances. STRONG is validated using synthetic communities, and for a real anaerobic digester time series it generates haplotypes that match those observed in long Nanopore reads.
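As a much-simplified illustration of the kind of computation at the core of strain resolution (ours; BayesPaths is a full variational Bayesian model, not a least-squares fit): given per-sample unitig coverages and a 0/1 matrix recording which candidate haplotypes use which unitigs, non-negative strain abundances can be estimated by projected gradient descent.

```cpp
// Much-simplified illustration (ours): estimate non-negative strain
// abundances from unitig coverages by projected gradient descent on
// min_{a >= 0} ||X a - c||^2, where X[i][j] = 1 iff haplotype j uses
// unitig i and c[i] is the observed coverage of unitig i in one sample.
#include <cstdio>
#include <vector>

std::vector<double> nnls_pg(const std::vector<std::vector<double>>& X,
                            const std::vector<double>& c,
                            int iters = 500, double step = 0.1) {
    size_t u = X.size(), t = X[0].size();
    std::vector<double> a(t, 1.0);
    for (int it = 0; it < iters; ++it) {
        std::vector<double> r(u, 0.0);            // residual X a - c
        for (size_t i = 0; i < u; ++i) {
            for (size_t j = 0; j < t; ++j) r[i] += X[i][j] * a[j];
            r[i] -= c[i];
        }
        for (size_t j = 0; j < t; ++j) {          // gradient step, then project
            double g = 0.0;
            for (size_t i = 0; i < u; ++i) g += X[i][j] * r[i];
            a[j] -= step * g;
            if (a[j] < 0.0) a[j] = 0.0;           // enforce non-negativity
        }
    }
    return a;
}

int main() {
    // Three unitigs, two strains: unitig 0 only in strain 0, unitig 2 only in
    // strain 1, unitig 1 shared; coverages generated by abundances (10, 4).
    std::vector<std::vector<double>> X = {{1, 0}, {1, 1}, {0, 1}};
    std::vector<double> c = {10, 14, 4};
    std::vector<double> a = nnls_pg(X, c);
    std::printf("estimated abundances: %.2f %.2f\n", a[0], a[1]);  // ~10.00 4.00
}
```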
