Bayesian methods for genetic association analysis with heterogeneous subgroups: From meta-analyses to gene-environment interactions
Genetic association analyses often involve data from multiple
potentially-heterogeneous subgroups. The expected amount of heterogeneity can
vary from modest (e.g., a typical meta-analysis) to large (e.g., a strong
gene-environment interaction). However, existing statistical tools are limited
in their ability to address such heterogeneity. Indeed, most genetic
association meta-analyses use a "fixed effects" analysis, which assumes no
heterogeneity. Here we develop and apply Bayesian association methods to
address this problem. These methods are easy to apply (in the simplest case,
requiring only a point estimate for the genetic effect and its standard error,
from each subgroup) and effectively include standard frequentist meta-analysis
methods, including the usual "fixed effects" analysis, as special cases. We
apply these tools to two large genetic association studies: one a meta-analysis
of genome-wide association studies from the Global Lipids consortium, and the
second a cross-population analysis for expression quantitative trait loci
(eQTLs). In the Global Lipids data we find, perhaps surprisingly, that effects
are generally quite homogeneous across studies. In the eQTL study we find that
eQTLs are generally shared among different continental groups, and discuss
consequences of this for study design.
Comment: Published at http://dx.doi.org/10.1214/13-AOAS695 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
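The abstract above notes that, in the simplest case, the analysis needs only a point estimate and a standard error from each subgroup, and that the usual "fixed effects" analysis is a special case. A minimal sketch of that input format follows. It is not the authors' method: it pools the estimates by inverse-variance weighting and then computes a standard normal-approximation Bayes factor on the pooled estimate. The function names and the prior standard deviation `prior_sd` are hypothetical choices made for illustration.

```python
import math


def fixed_effects(estimates, std_errors):
    """Inverse-variance weighted ("fixed effects") pooled estimate and its SE."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se


def approx_log10_bf(beta_hat, se, prior_sd):
    """log10 Bayes factor for beta ~ N(0, prior_sd^2) versus beta = 0.

    Assumes beta_hat ~ N(beta, se^2). Under the null the estimate has
    variance se^2; under the alternative it has variance se^2 + prior_sd^2.
    The Bayes factor is the ratio of those two normal densities at beta_hat.
    """
    v = se ** 2
    w = prior_sd ** 2  # hypothetical prior variance, chosen by the analyst
    z = beta_hat / se
    log_bf = 0.5 * math.log(v / (v + w)) + 0.5 * z * z * w / (v + w)
    return log_bf / math.log(10)


# Usage: three subgroups, each reporting an effect estimate and standard error.
pooled, pooled_se = fixed_effects([0.12, 0.08, 0.10], [0.04, 0.05, 0.03])
log10_bf = approx_log10_bf(pooled, pooled_se, prior_sd=0.2)
```

A fuller treatment would also place a prior on between-subgroup heterogeneity; this sketch corresponds only to the case of no heterogeneity.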
Using linear predictors to impute allele frequencies from summary or pooled genotype data
Recently developed genotype imputation methods are a powerful tool for
detecting untyped genetic variants that affect disease susceptibility in
genetic association studies. However, existing imputation methods require
individual-level genotype data, whereas, in practice, it is often the case that
only summary data are available. For example, this may occur because, for
reasons of privacy or politics, only summary data are made available to the
research community at large; or because only summary data are collected, as in
DNA pooling experiments. In this article we introduce a new statistical method
that can accurately infer the frequencies of untyped genetic variants in these
settings, and indeed substantially improve frequency estimates at typed
variants in pooling experiments where observations are noisy. Our approach,
which predicts each allele frequency using a linear combination of observed
frequencies, is statistically straightforward, and related to a long history of
the use of linear methods for estimating missing values (e.g., Kriging). The
main statistical novelty is our approach to regularizing the covariance matrix
estimates, and the resulting linear predictors, which is based on methods from
population genetics. We find that, besides being both fast and flexible
(allowing new problems to be tackled that cannot be handled by existing
imputation approaches purpose-built for the genetic context), these
linear methods are also very accurate. Indeed, imputation accuracy using this
approach is similar to that obtained by state-of-the-art imputation methods
that use individual-level data, but at a fraction of the computational cost.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS338 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
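The abstract above describes predicting each allele frequency as a linear combination of observed frequencies, in the spirit of Kriging. A rough sketch of that idea follows, assuming a small reference panel of 0/1 haplotypes is available. The predictor is the usual conditional-mean formula, mu_u + Sigma_ut (Sigma_tt + ridge I)^(-1) (y_t - mu_t). The `ridge` term is a plain diagonal regularizer used here as a stand-in; it is not the population-genetics-based regularization the abstract refers to, and all function names are hypothetical.

```python
def mean_and_cov(panel):
    """Per-SNP allele frequencies and covariance matrix from 0/1 haplotypes."""
    n, p = len(panel), len(panel[0])
    mu = [sum(h[j] for h in panel) / n for j in range(p)]
    cov = [[sum((h[j] - mu[j]) * (h[k] - mu[k]) for h in panel) / n
            for k in range(p)] for j in range(p)]
    return mu, cov


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def impute_frequency(panel, typed, untyped, observed, ridge=0.01):
    """Predict the frequency at SNP index `untyped` from frequencies at `typed`.

    `observed[i]` is the study frequency at SNP `typed[i]`. `ridge` is an
    illustrative regularizer, not the authors' scheme.
    """
    mu, cov = mean_and_cov(panel)
    sigma_tt = [[cov[j][k] + (ridge if j == k else 0.0) for k in typed]
                for j in typed]
    resid = [observed[i] - mu[j] for i, j in enumerate(typed)]
    coef = solve(sigma_tt, resid)
    pred = mu[untyped] + sum(cov[untyped][j] * c for j, c in zip(typed, coef))
    return min(1.0, max(0.0, pred))  # clip to a valid frequency


# Usage: SNP 2 is untyped; in this toy panel it is perfectly correlated with SNP 0.
toy_panel = [[0, 0, 0], [1, 0, 1], [1, 1, 1], [0, 1, 0]]
estimate = impute_frequency(toy_panel, typed=[0, 1], untyped=2, observed=[0.8, 0.5])
```

Because the predictor needs only panel means and covariances plus the observed summary frequencies, no individual-level study genotypes enter the calculation.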
A nested mixture model for protein identification using mass spectrometry
Mass spectrometry provides a high-throughput way to identify proteins in
biological samples. In a typical experiment, proteins in a sample are first
broken into their constituent peptides. The resulting mixture of peptides is
then subjected to mass spectrometry, which generates thousands of spectra, each
characteristic of its generating peptide. Here we consider the problem of
inferring, from these spectra, which proteins and peptides are present in the
sample. We develop a statistical approach to the problem, based on a nested
mixture model. In contrast to commonly used two-stage approaches, this model
provides a one-stage solution that simultaneously identifies which proteins are
present, and which peptides are correctly identified. In this way our model
incorporates the evidence feedback between proteins and their constituent
peptides. Using simulated data and a yeast data set, we compare and contrast
our method with existing widely used approaches (PeptideProphet/ProteinProphet)
and with a recently published new approach, HSM. For peptide identification,
our single-stage approach yields consistently more accurate results. For
protein identification the methods have similar accuracy in most settings,
although we exhibit some scenarios in which the existing methods perform
poorly.
Comment: Published at http://dx.doi.org/10.1214/09-AOAS316 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
A statistical framework for joint eQTL analysis in multiple tissues
Mapping expression Quantitative Trait Loci (eQTLs) represents a powerful and
widely-adopted approach to identifying putative regulatory variants and linking
them to specific genes. Up to now, eQTL studies have been conducted in a
relatively narrow range of tissues or cell types. However, understanding the
biology of organismal phenotypes will involve understanding regulation in
multiple tissues, and ongoing studies are collecting eQTL data in dozens of
cell types. Here we present a statistical framework for powerfully detecting
eQTLs in multiple tissues or cell types (or, more generally, multiple
subgroups). The framework explicitly models the potential for each eQTL to be
active in some tissues and inactive in others. By modeling the sharing of
active eQTLs among tissues this framework increases power to detect eQTLs that
are present in more than one tissue compared with "tissue-by-tissue" analyses
that examine each tissue separately. Conversely, by modeling the inactivity of
eQTLs in some tissues, the framework allows the proportion of eQTLs shared
across different tissues to be formally estimated as parameters of a model,
addressing the difficulties of accounting for incomplete power when comparing
overlaps of eQTLs identified by tissue-by-tissue analyses. Applying our
framework to re-analyze data from transformed B cells, T cells and fibroblasts
we find that it substantially increases power compared with tissue-by-tissue
analysis, identifying 63% more genes with eQTLs (at FDR=0.05). Further, the
results suggest that, in contrast to previous analyses of the same data, the
majority of eQTLs detectable in these data are shared among all three tissues.
Comment: Submitted to PLoS Genetic
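The abstract above describes modeling each eQTL as active in some tissues and inactive in others. The sketch below illustrates that idea in a heavily simplified form: it enumerates every activity configuration and computes a posterior probability for each. Two assumptions are made purely for illustration and are not taken from the paper: the Bayes factor of a configuration is the product of per-tissue Bayes factors (independent effects), and the prior mass not assigned to the null is spread evenly over the non-null configurations. The function name and `prior_null` parameter are hypothetical.

```python
from itertools import product


def configuration_posteriors(tissue_bfs, prior_null=0.9):
    """Posterior probability of each active/inactive pattern across tissues.

    `tissue_bfs[i]` is a Bayes factor for an eQTL in tissue i versus no eQTL.
    Configurations are tuples of 0/1 flags, one flag per tissue.
    """
    configs = list(product([0, 1], repeat=len(tissue_bfs)))
    n_active = len(configs) - 1  # every pattern except all-inactive
    unnormalized = {}
    for config in configs:
        if not any(config):
            prior, bf = prior_null, 1.0  # null: no eQTL in any tissue
        else:
            prior = (1.0 - prior_null) / n_active
            bf = 1.0
            for active, tissue_bf in zip(config, tissue_bfs):
                if active:
                    bf *= tissue_bf  # independence assumption (illustrative)
        unnormalized[config] = prior * bf
    total = sum(unnormalized.values())
    return {config: value / total for config, value in unnormalized.items()}


# Usage: moderate evidence in each of three tissues.
posteriors = configuration_posteriors([20.0, 15.0, 30.0])
shared_by_all = posteriors[(1, 1, 1)]
```

In a framework like the one described, the configuration prior would be estimated from the data across many genes, which is what allows the proportion of shared eQTLs to be estimated rather than fixed in advance.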
