149 research outputs found

    Challenges in the Analysis of Mass-Throughput Data: A Technical Commentary from the Statistical Machine Learning Perspective

    Get PDF
    Sound data analysis is critical to the success of modern molecular medicine research that involves collection and interpretation of mass-throughput data. The novel nature and high-dimensionality in such datasets pose a series of nontrivial data analysis problems. This technical commentary discusses the problems of over-fitting, error estimation, curse of dimensionality, causal versus predictive modeling, integration of heterogeneous types of data, and lack of standard protocols for data analysis. We attempt to shed light on the nature and causes of these problems and to outline viable methodological approaches to overcome them

    The FAST-AIMS Clinical Mass Spectrometry Analysis System

    Get PDF
    Within clinical proteomics, mass spectrometry analysis of biological samples is emerging as an important high-throughput technology, capable of producing powerful diagnostic and prognostic models and identifying important disease biomarkers. As interest in this area grows, and the number of such proteomics datasets continues to increase, the need has developed for efficient, comprehensive, reproducible methods of mass spectrometry data analysis by both experts and nonexperts. We have designed and implemented a stand-alone software system, FAST-AIMS, which seeks to meet this need through automation of data preprocessing, feature selection, classification model generation, and performance estimation. FAST-AIMS is an efficient and user-friendly stand-alone software for predictive analysis of mass spectrometry data. The present resource review paper will describe the features and use of the FAST-AIMS system. The system is freely available for download for noncommercial use

    Effects of Environment, Genetics and Data Analysis Pitfalls in an Esophageal Cancer Genome-Wide Association Study

    Get PDF
    The development of new high-throughput genotyping technologies has allowed fast evaluation of single nucleotide polymorphisms (SNPs) on a genome-wide scale. Several recent genome-wide association studies employing these technologies suggest that panels of SNPs can be a useful tool for predicting cancer susceptibility and discovery of potentially important new disease loci.In the present paper we undertake a careful examination of the relative significance of genetics, environmental factors, and biases of the data analysis protocol that was used in a previously published genome-wide association study. That prior study reported a nearly perfect discrimination of esophageal cancer patients and healthy controls on the basis of only genetic information. On the other hand, our results strongly suggest that SNPs in this dataset are not statistically linked to the phenotype, while several environmental factors and especially family history of esophageal cancer (a proxy to both environmental and genetic factors) have only a modest association with the disease.The main component of the previously claimed strong discriminatory signal is due to several data analysis pitfalls that in combination led to the strongly optimistic results. Such pitfalls are preventable and should be avoided in future studies since they create misleading conclusions and generate many false leads for subsequent research

    Causal Feature Selection

    Full text link

    Causal graph-based analysis of genome-wide association data in rheumatoid arthritis

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>GWAS owe their popularity to the expectation that they will make a major impact on diagnosis, prognosis and management of disease by uncovering genetics underlying clinical phenotypes. The dominant paradigm in GWAS data analysis so far consists of extensive reliance on methods that emphasize contribution of individual SNPs to statistical association with phenotypes. Multivariate methods, however, can extract more information by considering associations of multiple SNPs simultaneously. Recent advances in other genomics domains pinpoint multivariate causal graph-based inference as a promising principled analysis framework for high-throughput data. Designed to discover biomarkers in the local causal pathway of the phenotype, these methods lead to accurate and highly parsimonious multivariate predictive models. In this paper, we investigate the applicability of causal graph-based method TIE* to analysis of GWAS data. To test the utility of TIE*, we focus on anti-CCP positive rheumatoid arthritis (RA) GWAS datasets, where there is a general consensus in the community about the major genetic determinants of the disease.</p> <p>Results</p> <p>Application of TIE* to the North American Rheumatoid Arthritis Cohort (NARAC) GWAS data results in six SNPs, mostly from the MHC locus. Using these SNPs we develop two predictive models that can classify cases and disease-free controls with an accuracy of 0.81 area under the ROC curve, as verified in independent testing data from the same cohort. The predictive performance of these models generalizes reasonably well to Swedish subjects from the closely related but not identical Epidemiological Investigation of Rheumatoid Arthritis (EIRA) cohort with 0.71-0.78 area under the ROC curve. Moreover, the SNPs identified by the TIE* method render many other previously known SNP associations conditionally independent of the phenotype.</p> <p>Conclusions</p> <p>Our experiments demonstrate that application of TIE* captures maximum amount of genetic information about RA in the data and recapitulates the major consensus findings about the genetic factors of this disease. In addition, TIE* yields reproducible markers and signatures of RA. This suggests that principled multivariate causal and predictive framework for GWAS analysis empowers the community with a new tool for high-quality and more efficient discovery.</p> <p>Reviewers</p> <p>This article was reviewed by Prof. Anthony Almudevar, Dr. Eugene V. Koonin, and Prof. Marianthi Markatou.</p

    Using gene expression profiles from peripheral blood to identify asymptomatic responses to acute respiratory viral infections

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>A recent study reported that gene expression profiles from peripheral blood samples of healthy subjects prior to viral inoculation were indistinguishable from profiles of subjects who received viral challenge but remained asymptomatic and uninfected. If true, this implies that the host immune response does not have a molecular signature. Given the high sensitivity of microarray technology, we were intrigued by this result and hypothesize that it was an artifact of data analysis.</p> <p>Findings</p> <p>Using acute respiratory viral challenge microarray data, we developed a molecular signature that for the first time allowed for an accurate differentiation between uninfected subjects prior to viral inoculation and subjects who remained asymptomatic after the viral challenge.</p> <p>Conclusions</p> <p>Our findings suggest that molecular signatures can be used to characterize immune responses to viruses and may improve our understanding of susceptibility to viral infection with possible implications for vaccine development.</p

    Analysis and Computational Dissection of Molecular Signature Multiplicity

    Get PDF
    Molecular signatures are computational or mathematical models created to diagnose disease and other phenotypes and to predict clinical outcomes and response to treatment. It is widely recognized that molecular signatures constitute one of the most important translational and basic science developments enabled by recent high-throughput molecular assays. A perplexing phenomenon that characterizes high-throughput data analysis is the ubiquitous multiplicity of molecular signatures. Multiplicity is a special form of data analysis instability in which different analysis methods used on the same data, or different samples from the same population lead to different but apparently maximally predictive signatures. This phenomenon has far-reaching implications for biological discovery and development of next generation patient diagnostics and personalized treatments. Currently the causes and interpretation of signature multiplicity are unknown, and several, often contradictory, conjectures have been made to explain it. We present a formal characterization of signature multiplicity and a new efficient algorithm that offers theoretical guarantees for extracting the set of maximally predictive and non-redundant signatures independent of distribution. The new algorithm identifies exactly the set of optimal signatures in controlled experiments and yields signatures with significantly better predictivity and reproducibility than previous algorithms in human microarray gene expression datasets. Our results shed light on the causes of signature multiplicity, provide computational tools for studying it empirically and introduce a framework for in silico bioequivalence of this important new class of diagnostic and personalized medicine modalities
    corecore