Machine learning and statistical inference in microbial population genomics
The availability of large genome datasets has changed the microbiology research landscape. Analyzing such data is computationally demanding, and new approaches have come from different data analysis philosophies. Machine learning and statistical inference have overlapping knowledge discovery aims and approaches. However, machine learning focuses on optimizing prediction, whereas statistical inference focuses on understanding the processes relating variables. In this review, we outline the different aspirations, precepts, and resulting methodologies, with examples from microbial genomics. Emphasizing complementarity, we argue that the combination and synthesis of machine learning and statistics has potential for pathogen research in the big data era.
Hemimetabolous genomes reveal molecular basis of termite eusociality
Around 150 million years ago, eusocial termites evolved from within the cockroaches, 50 million years before eusocial Hymenoptera, such as bees and ants, appeared. Here, we report the 2-Gb genome of the German cockroach, Blattella germanica, and the 1.3-Gb genome of the drywood termite Cryptotermes secundus. We show evolutionary signatures of termite eusociality by comparing the genomes and transcriptomes of three termites and the cockroach against the background of 16 other eusocial and non-eusocial insects. Dramatic adaptive changes in genes underlying the production and perception of pheromones confirm the importance of chemical communication in the termites. These are accompanied by major changes in gene regulation and the molecular evolution of caste determination. Many of these results parallel molecular mechanisms of eusocial evolution in Hymenoptera. However, the specific solutions are remarkably different, thus revealing a striking case of convergence in one of the major evolutionary transitions in biological complexity.
SARS-CoV-2 susceptibility and COVID-19 disease severity are associated with genetic variants affecting gene expression in a variety of tissues
Variability in SARS-CoV-2 susceptibility and COVID-19 disease severity between individuals is partly due to genetic factors. Here, we identify 4 genomic loci with suggestive associations for SARS-CoV-2 susceptibility and 19 for COVID-19 disease severity. Four of these 23 loci likely have an ethnicity-specific component. Genome-wide association study (GWAS) signals in 11 loci colocalize with expression quantitative trait loci (eQTLs) associated with the expression of 20 genes in 62 tissues/cell types (range: 1-43 tissues/gene), including lung, brain, heart, muscle, and skin as well as the digestive system and immune system. We perform genetic fine mapping to compute 99% credible SNP sets, which identify 10 GWAS loci that have eight or fewer SNPs in the credible set, including three loci with one single likely causal SNP. Our study suggests that the diverse symptoms and disease severity of COVID-19 observed between individuals are associated with variants across the genome, affecting gene expression levels in a wide variety of tissue types.
Big data approaches to microbial genomics
Alongside tremendous challenges in infectious diseases, like the rise of antimicrobial resistance and the coronavirus disease pandemic, the 21st century is also witness to the big data revolution, which offers opportunities to design methodology capable of addressing these great challenges. In developing tools, there are two competing philosophies of how to gain insight from big data: the modelling approach, where the natural data generating mechanism is approximated by statistical inference, and the algorithmic approach, where general-purpose algorithms are tuned to capture hidden structure in the data for prediction. The aim of the thesis is to contribute to solving existing infectious disease problems by motivating, designing, and applying the correct big data methodology, whilst facilitating future use by generating applications that can be easily re-purposed. I first design a machine learner that can predict the source of campylobacteriosis 33% more accurately than the previously most commonly used methods. The method broadens the data input spectrum to capture whole genomes, which uniquely allows sources to be assigned to individual samples, showing a shift in host affinity of one of the most common lineages of Campylobacter jejuni. Based on the individual predictions of the machine learner, I infer which genetic changes are associated with host specificity by conducting a genome-wide association study. I find fluoroquinolone resistance genes pre-adapting chicken isolates to infection of humans, and polyphosphate pathway-associated genes distinguishing adaptation to the chicken and ruminant niches. For the study of COVID-19 risk, I conduct a machine learning prediction of very severe forms of the disease, hospitalisation, and susceptibility, whilst also inferring risk factors for all phenotypes by applying Bayesian model averaging.
I re-discover commonly defined risk factors describing socio-economic standing, ill health and ethnicity, whilst discovering more novel factors, like previous lung injury predisposing to very severe COVID-19, and bring order to the wealth of published COVID-19 risk studies. In the closing arguments I discuss the limitations of my work and give recommendations on how the developed tools can be re-applied to make big data research more accessible. I also expand on how statistical inference and machine learning prediction can be used in unison to tap into the potential of big data to address the foremost infectious disease challenges of our time.
The past, present and future of ancient bacterial DNA
Groundbreaking studies conducted in the mid-1980s demonstrated the possibility of sequencing ancient DNA (aDNA), which has allowed us to answer fundamental questions about the human past. Microbiologists were thus given a powerful tool to glimpse directly into inscrutable bacterial history, hitherto inaccessible due to a poor fossil record. Initially plagued by concerns regarding contamination, the field has grown alongside technical progress, with the advent of high-throughput sequencing being a breakthrough in sequence output and authentication. Albeit burdened with challenges unique to the analysis of bacteria, a growing number of viable sources for aDNA has opened multiple avenues of microbial research. Ancient pathogens have been extracted from bones, dental pulp, mummies and historical medical specimens and have answered focal historical questions such as identifying the aetiological agent of the black death as Yersinia pestis. Furthermore, ancient human microbiomes from fossilized faeces, mummies and dental plaque have shown shifts in human commensals through the Neolithic demographic transition and industrial revolution, whereas environmental isolates stemming from permafrost samples have revealed signs of ancient antimicrobial resistance. Culminating in an ever-growing repertoire of ancient genomes, the quickly expanding body of bacterial aDNA studies has also enabled comparisons of ancient genomes to their extant counterparts, illuminating the evolutionary history of bacteria. In this review we summarize the present avenues of research and contextualize them in the past of the field whilst also pointing towards questions still to be answered.
Doublethink: simultaneous Bayesian-frequentist model-averaged hypothesis testing
Establishing the frequentist properties of Bayesian approaches widens their appeal and offers new understanding. In hypothesis testing, Bayesian model averaging addresses the problem that conclusions are sensitive to variable selection. But Bayesian false discovery rate (FDR) guarantees are contingent on prior assumptions that may be disputed. Here we show that Bayesian model-averaged hypothesis testing is a closed testing procedure that controls the frequentist familywise error rate (FWER) in the strong sense. The rate converges pointwise as the sample size grows and, under some conditions, uniformly. The 'Doublethink' method computes simultaneous posterior odds and asymptotic p-values for model-averaged hypothesis testing. We explore its benefits, including post-hoc variable selection, and limitations, including finite-sample inflation, through a Mendelian randomization study and simulations comparing approaches like LASSO, stepwise regression, the Benjamini-Hochberg procedure and e-values.
Comment: First full draft. 25 pages main text, 47 pages total. 9 figures, 2 tables
