ProbCD: enrichment analysis accounting for categorization uncertainty
As in many other areas of science, systems biology makes extensive use of statistical association and significance estimates in contingency tables, a type of categorical data analysis known in this field as enrichment (also over-representation or enhancement) analysis. In spite of efforts to create probabilistic annotations, especially in the Gene Ontology context, or to deal with uncertainty in high-throughput-based datasets, current enrichment methods largely ignore this probabilistic information since they are mainly based on variants of the Fisher Exact Test. We developed an open-source R package to deal with probabilistic categorical data analysis, ProbCD, that does not require a static contingency table. The contingency table for the enrichment problem is built using the expectation of a Bernoulli Scheme stochastic process given the categorization probabilities. An on-line interface was created to allow usage by non-programmers and is available at: http://xerad.systemsbiology.net/ProbCD/. We present an analysis framework and software tools to address the issue of uncertainty in categorical data analysis. In particular, concerning the enrichment analysis, ProbCD can accommodate: (i) the stochastic nature of the high-throughput experimental techniques and (ii) probabilistic gene annotation.
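To make the idea of an expected contingency table concrete, here is a minimal Python sketch. It is not the ProbCD implementation (ProbCD is an R package, and its internals are not described here). The function name `expected_contingency` and the assumption that, for each gene, category membership and selection are independent Bernoulli events are illustrative choices made for this example only.

```python
def expected_contingency(p_category, p_selected):
    """Expected 2x2 table from per-gene probabilities.

    p_category[i]: probability that gene i belongs to the category.
    p_selected[i]: probability that gene i is in the selected list.
    Assumption (not from the source): the two events are independent
    for each gene, so the expected cell counts are sums of products.
    """
    if len(p_category) != len(p_selected):
        raise ValueError("both probability lists must have one entry per gene")
    both = only_cat = only_sel = neither = 0.0
    for p, q in zip(p_category, p_selected):
        if not (0.0 <= p <= 1.0 and 0.0 <= q <= 1.0):
            raise ValueError("probabilities must lie in [0, 1]")
        both += p * q                      # in category and selected
        only_cat += p * (1.0 - q)          # in category, not selected
        only_sel += (1.0 - p) * q          # selected, not in category
        neither += (1.0 - p) * (1.0 - q)   # in neither
    return [[both, only_cat], [only_sel, neither]]


if __name__ == "__main__":
    # Three hypothetical genes with uncertain annotation and selection.
    print(expected_contingency([0.9, 0.2, 0.5], [0.8, 0.1, 0.5]))
```

When every probability is 0 or 1, the table reduces to the ordinary count-based contingency table used by Fisher-type tests, which is the static special case the abstract contrasts against.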
Next-generation text-mining mediated generation of chemical response-specific gene sets for interpretation of gene expression data
Background: Availability of chemical response-specific lists of genes (gene sets) for pharmacological and/or toxic effect prediction for compounds is limited. We hypothesize that more gene sets can be created by next-generation text mining (next-gen TM), and that these can be used with gene set analysis (GSA) methods for chemical treatment identification, for pharmacological mechanism elucidation, and for comparing compound toxicity profiles. Methods: We created 30,211 chemical response-specific gene sets for human and mouse by next-gen TM, and derived 1,189 (human) and 588 (mouse) gene sets from the Comparative Toxicogenomics Database (CTD). We tested for significant differential expression (SDE) (false discovery rate-corrected p-values < 0.05) of the next-gen TM-derived gene sets and the CTD-derived gene sets in gene expression (GE) data sets of five chemicals (from experimental models). We tested for SDE of gene sets for six fibrates in a peroxisome proliferator-activated receptor alpha (PPARA) knock-out GE dataset and compared the results with those from the Connectivity Map (cMap). We tested for SDE of 319 next-gen TM-derived gene sets for environmental toxicants in three GE data sets of triazoles, and tested for SDE of 442 gene sets associated with embryonic structures. We compared the gene sets to triazole effects seen in the Whole Embryo Culture (WEC), and used principal component analysis (PCA) to discriminate triazoles from other chemicals. Results: Next-gen TM-derived gene sets matching the chemical treatment were significantly altered in three GE data sets, and the corresponding CTD-derived gene sets were significantly altered in five GE data sets. Six next-gen TM-derived and four CTD-derived fibrate gene sets were significantly altered in the PPARA knock-out GE dataset. None of the fibrate signatures in cMap scored significant against the PPARA GE signature. Thirty-three environmental toxicant gene sets were significantly altered in the triazole GE data sets; 21 of these toxicants had a toxicity pattern similar to that of the triazoles. We confirmed embryotoxic effects, and discriminated triazoles from other chemicals. Conclusions: Gene set analysis with next-gen TM-derived chemical response-specific gene sets is a scalable method for identifying similarities in gene responses to other chemicals, from which one may infer potential mode of action and/or toxic effect.
Classes of Multiple Decision Functions Strongly Controlling FWER and FDR
This paper provides two general classes of multiple decision functions where
each member of the first class strongly controls the family-wise error rate
(FWER), while each member of the second class strongly controls the false
discovery rate (FDR). These classes offer the possibility that an optimal
multiple decision function with respect to a pre-specified criterion, such as
the missed discovery rate (MDR), could be found within these classes. Such
multiple decision functions can be utilized in multiple testing, specifically,
but not limited to, the analysis of high-dimensional microarray data sets. Comment: 19 pages
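For orientation, the sketch below contrasts two textbook procedures: Holm's step-down method, which controls the FWER, and the Benjamini-Hochberg step-up method, which controls the FDR for independent or positively dependent tests. These are standard examples only, not the classes of multiple decision functions proposed in the paper. The function names `holm_reject` and `bh_reject` are chosen here for illustration.

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm step-down procedure (FWER control). Returns a reject flag per test."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # Compare the (rank+1)-th smallest p-value with alpha / (m - rank).
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject


def bh_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (FDR control)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Track the largest rank whose p-value lies under the BH line.
        if pvalues[i] <= rank * alpha / m:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


if __name__ == "__main__":
    p = [0.035, 0.001, 0.5, 0.008, 0.025]  # made-up p-values
    print("Holm:", holm_reject(p))
    print("BH:  ", bh_reject(p))
```

On these made-up p-values Holm rejects two hypotheses and Benjamini-Hochberg rejects four, which illustrates why FDR control is usually the less conservative criterion in high-dimensional settings.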
Gene set analysis exploiting the topology of a pathway
BACKGROUND: Recently, a great deal of effort in microarray data analysis has been directed towards the study of so-called gene sets. A gene set is defined by genes that are, in some way, functionally related. For example, genes appearing in a known biological pathway naturally define a gene set. Gene sets are usually identified from a priori biological knowledge. Many bioinformatics resources now store this kind of knowledge (see, for example, the Kyoto Encyclopedia of Genes and Genomes, among others). Although pathway maps carry important information about the structure of correlation among genes that should not be neglected, the currently available multivariate methods for gene set analysis do not fully exploit it.
RESULTS: We propose a novel gene set analysis specifically designed for gene sets defined by pathways. The analysis, based on graphical models, explicitly incorporates the dependence structure among genes highlighted by the topology of pathways. It is designed for overall surveillance of changes in a pathway across different experimental conditions. Under different circumstances, not only the expression of the genes in a pathway but also the strength of their relations may change. The resulting methods make it possible both to test for variations in the strength of the links and to properly account for heteroscedasticity in the usual tests for differential expression.
CONCLUSIONS: The use of graphical models allows a deeper look at the components of the pathway, which can be tested separately and compared marginally. In this way it is possible to test single components of the pathway and highlight only those involved in its deregulation.
Inflated false discovery rate due to volcano plots: problem and solutions
Motivation: Volcano plots are used to select the most interesting discoveries when too many discoveries remain after application of Benjamini-Hochberg's procedure (BH). The volcano plot suggests a double filtering procedure that selects features with both a small adjusted p-value and a large estimated effect size. Despite its popularity, this type of selection overlooks the fact that BH does not guarantee error control over filtered subsets of discoveries. Therefore the selected subset of features may include an inflated number of false discoveries. Results: In this paper, we illustrate the substantially inflated type I error rate of volcano plot selection with simulation experiments and RNA-seq data. In particular, we show that the feature with the largest estimated effect is very likely to be a false positive. Next, we investigate two alternative approaches for multiple testing with double filtering that do not inflate the false discovery rate. Our procedure is implemented in an interactive web application and is publicly available.
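The double filtering rule described above can be written down in a few lines. The sketch below only illustrates the selection rule that the abstract warns about, namely BH-adjusted p-value below a cutoff and absolute effect above a threshold. It does not implement the two alternative approaches the authors investigate. The function names, thresholds, and example numbers are assumptions made for illustration.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def volcano_select(pvalues, effects, alpha=0.05, min_abs_effect=1.0):
    """Naive volcano-plot selection: adjusted p <= alpha AND |effect| large.

    Caution: BH controls the FDR over the full set of BH discoveries,
    not over the subset that survives the extra effect-size filter.
    """
    adjusted = bh_adjust(pvalues)
    return [
        i
        for i in range(len(pvalues))
        if adjusted[i] <= alpha and abs(effects[i]) >= min_abs_effect
    ]


if __name__ == "__main__":
    pvals = [0.001, 0.008, 0.025, 0.035, 0.5]   # made-up p-values
    effects = [2.0, 0.3, 1.5, 0.2, 3.0]         # made-up log fold changes
    print(volcano_select(pvals, effects))
```

In this toy input four features pass BH at 0.05, but only two survive the effect-size filter. The FDR guarantee applies to the set of four, not to the filtered pair.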
A large scale survey reveals that chromosomal copy-number alterations significantly affect gene modules involved in cancer initiation and progression
Background
Recent observations point towards the existence of a large number of neighborhoods composed of functionally related gene modules that lie together in the genome. This local component in the distribution of functionality across chromosomes probably affects chromosomal architecture itself by limiting the ways in which genes can be arranged and distributed across the genome. As a direct consequence, diseases such as cancer that harbor DNA copy number alterations (CNAs) can be expected to have a symptomatology that depends strongly on modules of functionally related genes rather than on a unique "important" gene.
Methods
We carried out a systematic analysis of more than 140,000 observations of CNAs in cancers and searched for enrichments in gene functional modules associated with high frequencies of losses or gains.
Results
The analysis of CNAs in cancers clearly demonstrates the existence of a significant pattern of loss of gene modules functionally related to cancer initiation and progression, along with the amplification of modules of genes related to unspecific defense against xenobiotics (probably chemotherapeutic agents). By extending this analysis to an Array-CGH dataset (glioblastomas) from The Cancer Genome Atlas, we demonstrate the validity of this approach for investigating the functional impact of CNAs.
Conclusions
The presented results indicate promising clinical and therapeutic implications. Our findings also point directly to the necessity of adopting a function-centric, rather than a gene-centric, view in the understanding of phenotypes or diseases harboring CNAs. Funding: Spanish Ministry of Science and Innovation (grant BIO2008-04212); Spanish Ministry of Science and Innovation (grant FIS PI 08/0440); GVA-FEDER (PROMETEO/2010/001); Red Temática de Investigación Cooperativa en Cáncer (RTICC) (grant RD06/0020/1019); Instituto de Salud Carlos III (ISCIII); Spanish Ministry of Science and Innovation; Spanish Ministry of Health (FI06/00027
Loss of DPP6 in neurodegenerative dementia : a genetic player in the dysfunction of neuronal excitability
Emerging evidence suggests a converging mechanism in neurodegenerative brain diseases (NBD) involving early neuronal network dysfunctions and alterations in the homeostasis of neuronal firing as culprits of neurodegeneration. In this study, we used paired-end short-read and direct long-read whole genome sequencing to investigate an unresolved autosomal dominant dementia family significantly linked to 7q36. We identified and validated a chromosomal inversion of ca. 4 Mb, segregating on the disease haplotype and disrupting the coding sequence of the dipeptidyl-peptidase 6 gene (DPP6). DPP6 resequencing identified significantly more rare variants (nonsense, frame-shift, and missense) in early-onset Alzheimer's disease (EOAD, p = 0.03, OR = 2.21, 95% CI 1.05-4.82) and frontotemporal dementia (FTD, p = 0.006, OR = 2.59, 95% CI 1.28-5.49) patient cohorts. DPP6 is a type II transmembrane protein with a highly structured extracellular domain and is mainly expressed in brain, where it binds to the potassium channel K(v)4.2, enhancing its expression, regulating its gating properties and controlling the dendritic excitability of hippocampal neurons. Using in vitro modeling, we showed that the missense variants found in patients destabilize DPP6 and reduce its membrane expression (p < 0.001 and p < 0.0001), leading to a loss of protein. Reduced DPP6 and/or K(v)4.2 expression was also detected in brain tissue of missense variant carriers. Loss of DPP6 is known to cause neuronal hyperexcitability and behavioral alterations in Dpp6-KO mice. Taken together, the results of our genomic, genetic, expression and modeling analyses provide direct evidence supporting the involvement of DPP6 loss in dementia. We propose that loss-of-function variants have a higher penetrance and disease impact, whereas the missense variants have a variable risk contribution to disease that can range from high to low penetrance. Our findings on DPP6, as a novel gene in dementia, strengthen the involvement of neuronal hyperexcitability and alteration in the homeostasis of neuronal firing as a disease mechanism to investigate further.
Unraveling genetic predisposition to familial or early onset gastric cancer using germline whole-exome sequencing
Recognition of individuals with a genetic predisposition to gastric cancer (GC) enables preventive measures. However, the underlying cause of genetic susceptibility to gastric cancer remains largely unexplained. We performed germline whole-exome sequencing on leukocyte DNA of 54 patients from 53 families with genetically unexplained diffuse-type and intestinal-type GC to identify novel GC-predisposing candidate genes. As young age at diagnosis and familial clustering are hallmarks of genetic tumor susceptibility, we selected patients who were diagnosed below the age of 35, patients from families with two cases of GC at or below age 60, and patients from families with three GC cases at or below age 70. All included individuals tested negative for germline CDH1 mutations before or during the study. Variants that were possibly deleterious according to in silico predictions were filtered using several independent approaches based on gene function and gene mutation burden in controls. Despite a rigorous search, no obvious candidate GC predisposition genes were identified. This negative result stresses the importance of future research studies in large, homogeneous cohorts.
Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models
Background: Growing interest in biological pathways has called for new statistical methods for modeling and testing a genetic pathway effect on a health outcome. The fact that genes within a pathway tend to interact with each other and relate to the outcome in a complicated way makes nonparametric methods more desirable. The kernel machine method provides a convenient, powerful and unified method for multi-dimensional parametric and nonparametric modeling of the pathway effect. Results: In this paper we propose a logistic kernel machine regression model for binary outcomes. This model relates the disease risk to covariates parametrically, and to genes within a genetic pathway parametrically or nonparametrically using kernel machines. The nonparametric genetic pathway effect allows for possible interactions among the genes within the same pathway and a complicated relationship between the genetic pathway and the outcome. We show that kernel machine estimation of the model components can be formulated using a logistic mixed model. Estimation can hence proceed within a mixed model framework using standard statistical software. A score test based on a Gaussian process approximation is developed to test for the genetic pathway effect. The methods are illustrated using a prostate cancer data set and evaluated using simulations. An extension to continuous and discrete outcomes using generalized kernel machine models and its connection with generalized linear mixed models is discussed. Conclusion: Logistic kernel machine regression and its extension, generalized kernel machine regression, provide a novel and flexible statistical tool for modeling pathway effects on discrete and continuous outcomes. Their close connection to mixed models and attractive performance give them promising wide applications in bioinformatics and other biomedical areas.
Pathway testing for longitudinal metabolomics
We propose a top-down approach for pathway analysis of longitudinal metabolite data. We apply a score test based on a shared latent process mixed model which can identify pathways with differentially progressing metabolites. The strength of our approach is that it can handle unbalanced designs, deals with potential missing values in the longitudinal markers, and gives valid results even with small sample sizes. In contrast to bottom-up approaches, correlations between metabolites are modeled explicitly, which yields gains in power. For large pathway sizes, a computationally efficient solution is proposed based on pseudo-likelihood methodology. We demonstrate the advantages of the proposed method in identification of differentially expressed pathways through simulation studies. Finally, longitudinal metabolite data from a mouse experiment are analyzed to demonstrate our methodology.
