Precision medicine methodology development with application to survival and genomics data
Precision medicine and genomics data offer opportunities for better decision making in the public health domain. In this dissertation, we develop several key elements of precision medicine and address some aspects of genomics data. The first element is a nonparametric regression method for interval censored data. We develop Interval Censored Recursive Forests (ICRF), an iterative random forest survival estimator for interval censored data that solves the splitting bias problem in tree-based methods for censored data. For this task, we develop consistent splitting rules and employ a recursion technique. The estimator is uniformly consistent and shows high prediction accuracy in simulations and data analyses. Second, we develop an estimator of the optimal dynamic treatment regime (DTR) for survival outcomes with dependent censoring. When the goal is to maximize the survival time or the survival probability of cancer patients who go through multiple rounds of chemotherapy, finding the optimal dynamic treatment regime is complicated by the incompleteness of the survival information: some patients may drop out or experience failure before going through all the preplanned treatment stages, resulting in different numbers of treatment stages across patients. To address this issue, we generalize the Q-learning approach and the random survival forest framework. The new method also overcomes limitations of existing methods, namely the assumption of independent censoring or a strong modeling structure for the failure time. We show consistency of the value of the estimator and illustrate the performance of the method through simulations and analyses of leukemia patient data and national mortality data. Third, we develop a method that measures gene-gene associations after adjusting for dropout events in single cell RNA sequencing (scRNA-seq) data.
Posing a bivariate zero-inflated negative binomial (BZINB) model, we estimate the dropout probability and measure the underlying correlation after controlling for dropout effects. The gene-gene association measured in this way can serve as a building block for gene set testing methods. The BZINB model has a straightforward latent variable interpretation and is estimated using the EM algorithm.
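The latent-variable construction behind a BZINB-style model can be illustrated with a small generative simulation. The sketch below is not the dissertation's parameterization: the shared and gene-specific gamma factors, the shape/scale values, and the dropout probabilities are illustrative assumptions, chosen only to show how a shared factor induces correlation and how dropout inflates zeros and attenuates the observed correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_bzinb(n, a0=2.0, a1=1.0, a2=1.0, b=0.5, p_drop=(0.3, 0.3)):
    """Generative sketch of a bivariate zero-inflated NB count pair.

    A shared gamma factor r0 plus gene-specific factors r1, r2 induce
    correlation; Poisson sampling given the gamma rates yields negative
    binomial margins; independent dropout indicators inflate the zeros.
    Parameter names are illustrative, not the paper's notation.
    """
    r0 = rng.gamma(a0, b, size=n)        # shared factor -> correlation
    r1 = rng.gamma(a1, b, size=n)        # gene-1-specific factor
    r2 = rng.gamma(a2, b, size=n)        # gene-2-specific factor
    x = rng.poisson(r0 + r1)             # NB-distributed count, gene 1
    y = rng.poisson(r0 + r2)             # NB-distributed count, gene 2
    x[rng.random(n) < p_drop[0]] = 0     # dropout inflates zeros, gene 1
    y[rng.random(n) < p_drop[1]] = 0     # dropout inflates zeros, gene 2
    return x, y

x, y = simulate_bzinb(50_000)
# The observed correlation is attenuated relative to the latent one,
# which is what an EM fit recovers by modeling dropout explicitly.
print(np.corrcoef(x, y)[0, 1])
```

Comparing the printed observed correlation with the correlation of the latent rates makes the attenuation caused by dropout concrete.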
Multi-stage optimal dynamic treatment regimes for survival outcomes with dependent censoring
We propose a reinforcement learning method for estimating an optimal dynamic
treatment regime for survival outcomes with dependent censoring. The estimator
allows the treatment decision times to be dependent on the failure time and
conditionally independent of censoring, supports a flexible number of treatment
arms and treatment stages, and can maximize either the mean survival time or
the survival probability at a certain time point. The estimator is constructed
using generalized random survival forests, and its consistency is shown using
empirical process theory. Simulations and a leukemia data analysis suggest that the new estimator yields higher expected outcomes than existing methods across a range of settings. An R package, dtrSurv, is available on CRAN.
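The backward-induction structure of multi-stage Q-learning can be sketched in a few lines. The estimator above uses generalized random survival forests and handles dependent censoring; the toy below instead uses uncensored outcomes and linear least-squares Q-functions, purely to show the two-pass structure (fit the last stage, maximize over its action, then regress the resulting pseudo-value at the earlier stage). The data-generating model and design matrices are assumptions for illustration, not the dtrSurv implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Toy two-stage data without censoring: the optimal action at each
# stage is a = 1(x > 0), and y plays the role of a log survival time.
x1 = rng.normal(size=n)
a1 = rng.integers(0, 2, size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
a2 = rng.integers(0, 2, size=n)
y = (0.5 * a1 * np.sign(x1) + 0.5 * a2 * np.sign(x2)
     + 0.1 * rng.normal(size=n))

def design2(x1, a1, x2, a2):
    """Stage-2 Q-function features: full history plus a2 interactions."""
    return np.column_stack([np.ones_like(x1), x1, a1 * np.sign(x1),
                            x2, a2, a2 * np.sign(x2)])

def design1(x1, a1):
    """Stage-1 Q-function features with a1 interactions."""
    return np.column_stack([np.ones_like(x1), x1, a1, a1 * np.sign(x1)])

# Stage 2: fit Q2 on the full history, then maximize over a2 to get
# the pseudo-value of acting optimally from stage 2 onward.
b2, *_ = np.linalg.lstsq(design2(x1, a1, x2, a2), y, rcond=None)
v2 = np.maximum(design2(x1, a1, x2, np.zeros(n)) @ b2,
                design2(x1, a1, x2, np.ones(n)) @ b2)

# Stage 1: fit Q1 to the pseudo-value, then maximize over a1.
b1, *_ = np.linalg.lstsq(design1(x1, a1), v2, rcond=None)
rule1 = design1(x1, np.ones(n)) @ b1 > design1(x1, np.zeros(n)) @ b1

# The learned stage-1 rule should largely recover a1* = 1(x1 > 0).
print((rule1 == (x1 > 0)).mean())
```

Replacing the least-squares fits with survival forests and adding censoring weights is, loosely, the direction the paper's generalization takes.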
Inference for change-plane regression
A key challenge in analyzing the behavior of change-plane estimators is that
the objective function has multiple minimizers. Two estimators are proposed to
deal with this non-uniqueness. For each estimator, an n-rate of convergence is
established, and the limiting distribution is derived. Based on these results,
we provide a parametric bootstrap procedure for inference. The validity of our
theoretical results and the finite sample performance of the bootstrap are
demonstrated through simulation experiments. We illustrate the proposed methods with an application to latent subgroup identification in precision medicine using the ACTG175 AIDS study data.
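The non-uniqueness the abstract refers to is easy to see in one dimension, where the change-plane reduces to a change-point. The sketch below (illustrative simulated data, not from the paper) profiles the least-squares criterion over candidate thresholds: the criterion is piecewise constant in the threshold, so every value between two adjacent observed covariate values attains the same minimum.

```python
import numpy as np

rng = np.random.default_rng(2)
n, c0 = 500, 0.3

# One-dimensional toy model: the mean of y jumps when w crosses c0.
w = rng.uniform(-1, 1, size=n)
y = 1.0 + 2.0 * (w > c0) + 0.3 * rng.normal(size=n)

def sse_at(c):
    """Profile least-squares criterion at candidate threshold c."""
    total = 0.0
    for grp in (y[w <= c], y[w > c]):
        if grp.size:
            total += ((grp - grp.mean()) ** 2).sum()
    return total

grid = np.sort(w)
sse = np.array([sse_at(c) for c in grid])
k = int(np.argmin(sse))
# Any threshold in [grid[k], grid[k+1]) induces the same partition of
# the data and hence the same criterion value: the minimizer is an
# interval, not a point -- the non-uniqueness the estimators address.
print(grid[k], grid[k + 1])
```

In the multivariate change-plane setting the same phenomenon occurs along every direction of the plane's normal vector, which is what motivates proposing specific representative estimators and a parametric bootstrap for inference.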
Machine Learning and Health Science Research: Tutorial
Machine learning (ML) has seen impressive growth in health science research due to its capacity for handling complex data to perform a range of tasks, including unsupervised, supervised, and reinforcement learning. To aid health science researchers in understanding the strengths and limitations of ML, and to facilitate its integration into their studies, we present a guideline for integrating ML into an analysis through a structured framework, covering steps from framing a research question to study design and analysis techniques for specialized data types.
BZINB Model-Based Pathway Analysis and Module Identification Facilitates Integration of Microbiome and Metabolome Data
Integration of multi-omics data is a challenging but necessary step to advance our understanding of the biology underlying human health and disease processes. To date, investigations seeking to integrate multi-omics (e.g., microbiome and metabolome) employ simple correlation-based network analyses; however, these methods are not always well-suited for microbiome analyses because they do not accommodate the excess zeros typically present in these data. In this paper, we introduce a bivariate zero-inflated negative binomial (BZINB) model-based network and module analysis method that addresses this limitation and improves microbiome–metabolome correlation-based model fitting by accommodating excess zeros. We use real and simulated data based on a multi-omics study of childhood oral health (ZOE 2.0; investigating early childhood dental caries, ECC) and find that the accuracy of the BZINB model-based correlation method is superior compared to Spearman’s rank and Pearson correlations in terms of approximating the underlying relationships between microbial taxa and metabolites. The new method, BZINB-iMMPath, facilitates the construction of metabolite–species and species–species correlation networks using BZINB and identifies modules of (i.e., correlated) species by combining BZINB and similarity-based clustering. Perturbations in correlation networks and modules can be efficiently tested between groups (i.e., healthy and diseased study participants). Upon application of the new method in the ZOE 2.0 study microbiome–metabolome data, we identify that several biologically-relevant correlations of ECC-associated microbial taxa with carbohydrate metabolites differ between healthy and dental caries-affected participants. 
In sum, we find that the BZINB model is a useful alternative to Spearman's rank or Pearson correlation for estimating the underlying correlation of zero-inflated bivariate count data, and it is thus well suited to integrative analyses of multi-omics data such as those encountered in microbiome and metabolome studies.
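The module-identification step, pairwise correlations turned into a similarity and then clustered, can be sketched generically. The example below uses Pearson correlations on simulated Gaussian blocks only because the paper's BZINB-based correlations would require the full EM fit; the two latent factors, six variables, and average-linkage clustering are all illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(3)
n = 200

# Two blocks of three correlated variables, each driven by its own
# latent factor (stand-ins for taxa; the paper would use BZINB-based
# correlations here in place of Pearson correlations).
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = np.hstack([f1[:, None] + 0.5 * rng.normal(size=(n, 3)),
               f2[:, None] + 0.5 * rng.normal(size=(n, 3))])

corr = np.corrcoef(X, rowvar=False)
dist = 1.0 - np.abs(corr)            # similarity -> distance
np.fill_diagonal(dist, 0.0)

# Similarity-based clustering: cut the average-linkage dendrogram
# into two modules.
Z = linkage(squareform(dist, checks=False), method="average")
modules = fcluster(Z, t=2, criterion="maxclust")
print(modules)   # the two blocks fall into separate modules
```

Swapping the correlation matrix for a model-based one leaves the clustering machinery unchanged, which is why the BZINB correlations slot cleanly into a network/module pipeline.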
An Automated Machine Learning Classifier for Early Childhood Caries.
Purpose: The purpose of the study was to develop and evaluate an automated machine learning (AutoML) algorithm for classifying children according to early childhood caries (ECC) status. Methods: Clinical, demographic, behavioral, and parent-reported oral health status information for a sample of 6,404 three- to five-year-old children (mean age 54 months) participating in an epidemiologic study of early childhood oral health in North Carolina was used. ECC prevalence (decayed, missing, and filled primary tooth surfaces [dmfs] score greater than zero, using an International Caries Detection and Assessment System score of three or higher as the caries lesion detection threshold) was 54 percent. Ten sets of ECC predictors were evaluated for ECC classification accuracy (i.e., area under the ROC curve [AUC], sensitivity [Se], and positive predictive value [PPV]) using an AutoML deployment on Google Cloud, followed by internal validation and external replication. Results: A parsimonious model with two terms (children's age and parent-reported child oral health status: excellent/very good/good/fair/poor) had the highest AUC (0.74), Se (0.67), and PPV (0.64), with similar performance on an external National Health and Nutrition Examination Survey (NHANES) dataset (AUC 0.80, Se 0.73, PPV 0.49). In contrast, a comprehensive model with 12 variables covering demographics (e.g., race/ethnicity, parental education), oral health behaviors, fluoride exposure, and dental home had worse performance (AUC 0.66, Se 0.54, PPV 0.61). Conclusions: Parsimonious automated machine learning early childhood caries classifiers, including single-item self-reports, can be valuable for ECC screening. The classifier can accommodate biological information that may help improve its performance in the future.
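The three reported metrics are standard and straightforward to compute directly. The snippet below evaluates them on simulated stand-ins (a Gaussian risk score and an outcome at roughly 54 percent prevalence, both assumptions for illustration); it is not the AutoML pipeline itself. The AUC uses the Mann-Whitney formulation: the probability that a randomly chosen case outscores a randomly chosen non-case.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000

# Simulated stand-ins: binary ECC status at ~54% prevalence and a
# continuous risk score that is higher, on average, for cases.
y = (rng.random(n) < 0.54).astype(int)
score = np.where(y == 1, rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n))
pred = (score > 0.5).astype(int)       # classification threshold

# AUC via the Mann-Whitney formulation: probability that a random
# case outscores a random non-case.
pos, neg = score[y == 1], score[y == 0]
auc = (pos[:, None] > neg[None, :]).mean()

sensitivity = pred[y == 1].mean()      # Se  = TP / (TP + FN)
ppv = y[pred == 1].mean()              # PPV = TP / (TP + FP)
print(round(auc, 2), round(sensitivity, 2), round(ppv, 2))
```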
Interval-censored linear quantile regression
Censored quantile regression has emerged as a prominent alternative to
the classical Cox proportional hazards model or the accelerated failure time model in
both theoretical and applied statistics. While quantile regression has been
extensively studied for right-censored survival data, methodologies for
analyzing interval-censored data remain limited in the survival analysis
literature. This paper introduces a novel local weighting approach for
estimating linear censored quantile regression, specifically tailored to handle
diverse forms of interval-censored survival data. The estimation equation and
the corresponding convex objective function for the regression parameter can be
constructed as a weighted average of quantile loss contributions at two
interval endpoints. The weighting components are nonparametrically estimated
using local kernel smoothing or ensemble machine learning techniques. To
estimate the nonparametric distribution mass for interval-censored data, a
modified EM algorithm for nonparametric maximum likelihood estimation is
employed by introducing subject-specific latent Poisson variables. The proposed
method's empirical performance is demonstrated through extensive simulation
studies and real data analyses of two HIV/AIDS datasets.
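The estimating idea, a weighted average of check-loss contributions at the two interval endpoints, can be sketched directly. In the paper the weights are estimated nonparametrically (local kernel smoothing or ensemble machine learning) and the distribution mass comes from a Poisson-augmented EM; the toy below fixes the weights at 0.5 and uses a simple simulated design, so it illustrates only the shape of the objective, not the full estimator.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, tau = 2000, 0.5

# Toy interval-censored data: T is only known to lie in [L, R];
# the tau-th conditional quantile of T is b0 + b1 * x = 1 + 2x.
x = rng.uniform(0, 2, size=n)
t = 1.0 + 2.0 * x + rng.normal(size=n)      # latent event times
width = rng.uniform(0.2, 1.0, size=n)
L = t - width * rng.uniform(size=n)
R = L + width

def check_loss(u, tau):
    """Quantile (check) loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def objective(beta, w):
    """Weighted average of check losses at the two interval endpoints."""
    fit = beta[0] + beta[1] * x
    return np.mean(w * check_loss(L - fit, tau)
                   + (1.0 - w) * check_loss(R - fit, tau))

# Fixed weights w = 0.5 for illustration; the paper estimates them
# nonparametrically from the data.
res = minimize(objective, x0=np.zeros(2), args=(np.full(n, 0.5),),
               method="Nelder-Mead", options={"maxiter": 2000})
print(res.x)   # roughly recovers the true (intercept, slope) = (1, 2)
```

The objective stays convex in the regression parameter because each check-loss term is convex in a linear function of it, which is what makes the weighted-endpoint construction computationally attractive.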
Model selection for survival individualized treatment rules using the jackknife estimator
Abstract
Background
Precision medicine is an emerging field that involves the selection of treatments based on patients’ individual prognostic data. It is formalized through the identification of individualized treatment rules (ITRs) that maximize a clinical outcome. When the type of outcome is time-to-event, the correct handling of censoring is crucial for estimating reliable optimal ITRs.
Methods
We propose a jackknife estimator of the value function to allow for right-censored data for a binary treatment. The jackknife estimator or leave-one-out-cross-validation approach can be used to estimate the value function and select optimal ITRs using existing machine learning methods. We address the issue of censoring in survival data by introducing an inverse probability of censoring weighted (IPCW) adjustment in the expression of the jackknife estimator of the value function. In this paper, we estimate the optimal ITR by using random survival forest (RSF) and Cox proportional hazards model (COX). We use a Z-test to compare the optimal ITRs learned by RSF and COX with the zero-order model (or one-size-fits-all). Through simulation studies, we investigate the asymptotic properties and the performance of our proposed estimator under different censoring rates. We illustrate our proposed method on a phase III clinical trial of non-small cell lung cancer data.
Results
Our simulations show that COX outperforms RSF for small sample sizes. As sample sizes increase, the performance of RSF improves, in particular when the expected log failure time is not linear in the covariates. The estimator is fairly normally distributed across different combinations of simulation scenarios and censoring rates. When applied to a non-small-cell lung cancer data set, our method determines the zero-order model (ZOM) as the best performing model. This finding highlights the possibility that tailoring may not be needed for this cancer data set.
Conclusion
The jackknife approach for estimating the value function in the presence of right-censored data shows satisfactory performance when there is small to moderate censoring. Winsorizing the upper and lower percentiles of the estimated survival weights for computing the IPCWs stabilizes the estimator.
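The core IPCW ingredient, including the winsorizing step the conclusion mentions, can be sketched with a Kaplan-Meier estimate of the censoring distribution. The toy below evaluates a fixed rule on simulated data; the paper's jackknife additionally wraps the rule estimation (RSF or Cox) in a leave-one-out loop, which is omitted here. All data-generating choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4000

# Toy randomized trial with right censoring: treatment a = 1 helps
# only when x > 0. We observe y = min(T, C) and delta = 1{T <= C}.
x = rng.normal(size=n)
a = rng.integers(0, 2, size=n)
t = np.exp(0.3 * a * np.sign(x) + 0.2 * rng.normal(size=n))
c = rng.exponential(4.0, size=n)
y = np.minimum(t, c)
delta = (t <= c).astype(float)

def km_censoring_survival(y, delta):
    """Kaplan-Meier estimate of the censoring survival function S_C,
    evaluated just before each subject's observed time (a censoring
    'event' here is delta == 0)."""
    m = len(y)
    s, at_risk = 1.0, m
    surv = np.empty(m)
    for i in np.argsort(y):
        surv[i] = s                     # left limit S_C(y_i-)
        if delta[i] == 0:
            s *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    return surv

def ipcw_value(rule):
    """IPCW (Hajek) estimate of the mean survival time under a rule:
    uncensored subjects whose assigned treatment matches the rule are
    reweighted by 1 / (P(A = a) * S_C(Y))."""
    sc = np.clip(km_censoring_survival(y, delta), 0.05, None)  # winsorize
    w = (a == rule) * delta / (0.5 * sc)
    return np.sum(w * y) / np.sum(w)

v_tailored = ipcw_value((x > 0).astype(int))
v_zom = ipcw_value(np.zeros(n, dtype=int))   # zero-order model: always a = 0
print(v_tailored, v_zom)   # tailoring raises the value in this toy setting
```

The clipping of the estimated censoring survival plays the role of winsorizing the survival weights: it bounds the inverse weights and stabilizes the value estimate when censoring is heavy in the tail.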
