38 research outputs found

    Avoid Oversimplifications in Machine Learning: Going beyond the Class-Prediction Accuracy.

    Class-prediction accuracy provides a quick but superficial way of determining classifier performance. It does not inform on the reproducibility of the findings or on whether the selected or constructed features are meaningful and specific. Furthermore, the class-prediction accuracy oversummarizes and does not inform on how training and learning have been accomplished: two classifiers providing the same performance in one validation can disagree on many future validations. It does not explain the classifier's decision-making process and is not objective, as its value is also affected by class proportions in the validation set. Despite these issues, we should not omit the class-prediction accuracy. Instead, it needs to be enriched with accompanying evidence and tests that supplement and contextualize the reported accuracy. This additional evidence serves as augmentation and can help us perform machine learning better while avoiding naive reliance on oversimplified metrics.
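The dependence of accuracy on class proportions can be illustrated with a minimal sketch. The sensitivity and specificity values below are hypothetical and are not taken from the paper; the function names are introduced here for illustration only.

```python
def accuracy(sensitivity, specificity, positive_fraction):
    """Expected accuracy of a classifier with fixed per-class rates,
    evaluated on a validation set with the given fraction of positives."""
    return sensitivity * positive_fraction + specificity * (1.0 - positive_fraction)


def balanced_accuracy(sensitivity, specificity):
    """Mean of the per-class rates; does not depend on class proportions."""
    return (sensitivity + specificity) / 2.0


# Hypothetical classifier: strong on positives, weak on negatives.
SENS, SPEC = 0.95, 0.60

for frac in (0.1, 0.5, 0.9):
    print(
        f"positives={frac:.0%}  "
        f"accuracy={accuracy(SENS, SPEC, frac):.3f}  "
        f"balanced={balanced_accuracy(SENS, SPEC):.3f}"
    )
```

The same classifier appears much better or worse depending only on how many positives the validation set contains, while the balanced figure stays fixed. This is one possible accompanying statistic, not a replacement for the additional evidence the abstract calls for.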

    ProInfer: An interpretable protein inference tool leveraging on biological networks.

    In mass spectrometry (MS)-based proteomics, protein inference from identified peptides (protein fragments) is a critical step. We present ProInfer (Protein Inference), a novel protein assembly method that takes advantage of information in biological networks. ProInfer assists recovery of proteins supported only by ambiguous peptides (peptides that map to more than one candidate protein) and enhances the statistical confidence for proteins supported by both unique and ambiguous peptides. Consequently, ProInfer rescues weakly supported proteins, thereby improving proteome coverage. Evaluated across THP1 cell line, lung cancer and RAW267.4 datasets, ProInfer always infers the largest number of true positives in comparison to the mainstream protein inference tools Fido, EPIFANY and PIA. ProInfer is also adept at retrieving differentially expressed proteins, signifying its usefulness for functional analysis and phenotype profiling. The source code of ProInfer is available at https://github.com/PennHui2016/ProInfer

    How to do quantile normalization correctly for gene expression data analyses.

    Quantile normalization is an important normalization technique commonly used in high-dimensional data analysis. However, it is susceptible to class-effect proportion effects (the proportion of class-correlated variables in a dataset) and batch effects (the presence of potentially confounding technical variation) when applied blindly on whole data sets, resulting in higher false-positive and false-negative rates. We evaluate five strategies for performing quantile normalization, and demonstrate that good performance in terms of batch-effect correction and statistical feature selection can be readily achieved by first splitting data by sample class-labels before performing quantile normalization independently on each split ("Class-specific"). Via simulations with both real and simulated batch effects, we demonstrate that the "Class-specific" strategy (and others relying on similar principles) readily outperforms whole-data quantile normalization and is robust, preserving useful signals even during the combined analysis of separately normalized datasets. Quantile normalization is a commonly used procedure, but when carelessly applied on whole datasets without first considering class-effect proportion and batch effects, it can result in poor performance. If quantile normalization must be used, then we recommend using the "Class-specific" strategy.
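The "Class-specific" idea (split samples by class label, then quantile-normalize each split independently) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function names and the toy data are hypothetical, samples are represented as equal-length lists of feature values, and ties are broken by position rather than averaged.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of samples (each a list of feature values).

    Every sample is mapped onto the same reference distribution: the mean,
    across samples, of the values found at each rank.
    """
    n_features = len(samples[0])
    sorted_cols = [sorted(col) for col in samples]
    reference = [
        sum(col[rank] for col in sorted_cols) / len(samples)
        for rank in range(n_features)
    ]
    normalized = []
    for col in samples:
        # Feature indices ordered from smallest to largest value in this sample.
        order = sorted(range(n_features), key=col.__getitem__)
        new_col = [0.0] * n_features
        for rank, idx in enumerate(order):
            new_col[idx] = reference[rank]
        normalized.append(new_col)
    return normalized


def class_specific_quantile_normalize(samples, labels):
    """Quantile-normalize each class's samples separately, keeping sample order."""
    result = [None] * len(samples)
    for cls in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == cls]
        class_norm = quantile_normalize([samples[i] for i in members])
        for i, col in zip(members, class_norm):
            result[i] = col
    return result


# Toy data (hypothetical): two samples per class, three features each.
samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0], [50.0, 20.0, 30.0], [40.0, 10.0, 60.0]]
labels = ["A", "A", "B", "B"]
normalized = class_specific_quantile_normalize(samples, labels)
```

Within each class all samples end up sharing one distribution, while the difference in scale between classes A and B is preserved instead of being averaged away as it would be under whole-data normalization.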

    PROTREC: A probability-based approach for recovering missing proteins based on biological networks.

    A novel network-based approach for predicting missing proteins (MPs) is proposed here. This approach, PROTREC (short for PROtein RECovery), dominates existing network-based methods, such as Functional Class Scoring (FCS), Hypergeometric Enrichment (HE), and Gene Set Enrichment Analysis (GSEA), across a variety of proteomics datasets derived from different proteomics data acquisition paradigms: higher PROTREC scores are much more closely correlated with higher recovery rates of MPs across sample replicates. The PROTREC score, unlike methods reporting p-values, can be directly interpreted as the probability that an unreported protein in a proteomic screen is actually present in the sample being screened. SIGNIFICANCE: Mass spectrometry (MS) has developed rapidly in recent years; however, a considerable proportion of proteins still goes undetected, leading to the missing-protein problem. A few existing protein recovery methods are based on biological networks, but their performance is not satisfactory. We propose a new protein recovery method, PROTREC, a Bayesian-inspired approach based on biological networks, which shows exceptional performance across multiple validation strategies. It does not rely on peptide information, so it avoids the ambiguity issue that most protein assembly methods face.

    Mathematical-based microbiome analytics for clinical translation

    This is the final version. Available on open access from Elsevier via the DOI in this record. Traditionally, human microbiology has been strongly built on the laboratory-focused culture of microbes isolated from human specimens in patients with acute or chronic infection. These approaches primarily view human disease through the lens of a single species and its relevant clinical setting; however, such approaches fail to account for the surrounding environment and the wide microbial diversity that exists in vivo. Given the emergence of next-generation sequencing technologies and advancing bioinformatic pipelines, researchers now have unprecedented capabilities to characterise the human microbiome in terms of its taxonomy, function, antibiotic resistance and even bacteriophages. Despite this, analysis of microbial communities has largely been restricted to ordination, ecological measures, and discriminant taxa analysis. This is predominantly due to a lack of suitable computational tools to facilitate microbiome analytics. In this review, we first evaluate the key concerns related to the inherent structure of microbiome datasets, which include compositionality and batch effects. We describe the available and emerging analytical techniques, including integrative analysis, machine learning, microbial association networks, topological data analysis (TDA) and mathematical modelling. We also present how these methods may translate to clinical settings, including tools for implementation. Mathematical-based analytics for microbiome analysis represents a promising avenue for clinical translation across a range of acute and chronic disease states. Funders: Singapore Ministry of Health’s National Medical Research Council; Nanyang Technological University, Singapore; Engineering and Physical Sciences Research Council (EPSRC).

    A generalisability theory approach to quantifying changes in psychopathology among ultra-high-risk individuals for psychosis

    Distinguishing stable and fluctuating psychopathological features in young individuals at Ultra High Risk (UHR) for psychosis is challenging, but critical for building robust, accurate, early clinical detection and prevention capabilities. Over a 24-month period, 159 UHR individuals were assessed using the Positive and Negative Syndrome Scale (PANSS). Generalisability Theory was used to validate the PANSS with this population and to investigate stable and fluctuating features, by estimating the reliability and generalisability of three-factor (Positive, Negative, and General) and five-factor (Positive, Negative, Cognitive, Depression, and Hostility) symptom models. Acceptable reliability and generalisability of scores across occasions and sample population were demonstrated by the total PANSS scale (Gr = 0.85). Fluctuating symptoms (delusions, hallucinatory behaviour, lack of spontaneity, flow in conversation, emotional withdrawal, and somatic concern) showed high variability over time, with 50-68% of the variance explained by individual transient states. In contrast, more stable symptoms included excitement, poor rapport, anxiety, guilt feeling, uncooperativeness, and poor impulse control. The 3-factor model of PANSS and its subscales showed robust reliability and generalisability of their assessment scores across the UHR population and evaluation periods (G = 0.77-0.93), offering a suitable means to assess psychosis risk. Certain subscales within the 5-factor PANSS model showed comparatively lower reliability and generalisability (G = 0.33-0.66). The fluctuating symptoms identified in UHR individuals are more amenable to intervention, which could have significant implications for preventing and addressing psychosis. Prioritising the treatment of fluctuating symptoms could enhance intervention efficacy, offering a sharper focus in clinical trials. At the same time, using the more reliable total scale and three subscales can contribute to more accurate assessment of enduring psychosis patterns in clinical and experimental settings.
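For orientation, the relative generalisability coefficient has the following standard form in the simplest one-facet design, where persons are crossed with occasions. This is an illustrative assumption only: the abstract does not state the study's design or variance components, and the symbols here are introduced for the sketch ($\sigma^2_p$ is the between-person variance, $\sigma^2_{po,e}$ the person-by-occasion interaction confounded with residual error, and $n_o$ the number of occasions).

```latex
G_r = \frac{\sigma^2_p}{\sigma^2_p + \dfrac{\sigma^2_{po,e}}{n_o}}
```

Under this form, a coefficient close to 1 indicates that most observed score variance reflects stable differences between persons, whereas a large person-by-occasion component (transient states) lowers it.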

    Turning straw into gold: building robustness into gene signature inference.

    Reproducible and generalizable gene signatures are essential for clinical deployment, but are hard to come by. The primary issue is insufficient mitigation of confounders: ensuring that hypotheses are appropriate, test statistics and null distributions are appropriate, and so on. To further improve robustness, additional good analytical practices (GAPs) are needed, namely: leveraging existing data and knowledge; careful and systematic evaluation of gene sets, even if they overlap with known sources of confounding; and rigorous testing of inferred signatures against as many published data sets as possible. Here, using a re-examination of a breast cancer data set and 48 published signatures, we illustrate the value of adopting these GAPs.