    On the Use of Cauchy Prior Distributions for Bayesian Logistic Regression

    In logistic regression, separation occurs when a linear combination of the predictors can perfectly classify part or all of the observations in the sample, and as a result, finite maximum likelihood estimates of the regression coefficients do not exist. Gelman et al. (2008) recommended independent Cauchy distributions as default priors for the regression coefficients in logistic regression, even in the case of separation, and reported posterior modes in their analyses. As the mean does not exist for the Cauchy prior, a natural question is whether the posterior means of the regression coefficients exist under separation. We prove theorems that provide necessary and sufficient conditions for the existence of posterior means under independent Cauchy priors for the logit link and a general family of link functions, including the probit link. We also study the existence of posterior means under multivariate Cauchy priors. For full Bayesian inference, we develop a Gibbs sampler based on Polya-Gamma data augmentation to sample from the posterior distribution under independent Student-t priors, including Cauchy priors, and provide a companion R package in the supplement. We demonstrate empirically that even when the posterior means of the regression coefficients exist under separation, the magnitude of the posterior samples for Cauchy priors may be unusually large, and the corresponding Gibbs sampler shows extremely slow mixing. While alternative algorithms such as the No-U-Turn Sampler in Stan can greatly improve mixing, in order to resolve the issue of extremely heavy-tailed posteriors for Cauchy priors under separation, one would need to consider lighter-tailed priors such as normal priors or Student-t priors with degrees of freedom larger than one.
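    The separation phenomenon described above can be seen in a minimal pure-Python sketch: with a toy, perfectly separated one-dimensional data set (invented here for illustration), the logistic log-likelihood keeps increasing as the coefficient grows, so no finite maximum likelihood estimate exists.

    ```python
    import math

    # Toy 1-D data with complete separation: x < 0 -> y = 0, x > 0 -> y = 1.
    x = [-2.0, -1.0, 1.0, 2.0]
    y = [0, 0, 1, 1]

    def log_likelihood(beta):
        """Logistic log-likelihood with no intercept: P(y=1|x) = 1/(1+exp(-beta*x))."""
        ll = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-beta * xi))
            ll += yi * math.log(p) + (1 - yi) * math.log(1 - p)
        return ll

    # Under separation the likelihood increases monotonically in beta,
    # so the MLE "escapes to infinity": no finite maximiser exists.
    lls = [log_likelihood(b) for b in (1.0, 5.0, 10.0)]
    assert lls[0] < lls[1] < lls[2]
    ```

    A prior (Cauchy, Student-t, or normal) regularises this by pulling the posterior mass back toward finite coefficient values, which is what motivates the questions studied in the paper.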

    Estimating propensity scores with missing covariate data using general location mixture models

    In many observational studies, researchers estimate causal effects using propensity scores, e.g., by matching or sub-classifying on the scores. Estimation of propensity scores is complicated when some values of the covariates are missing. We propose to use multiple imputation to create completed datasets, from which propensity scores can be estimated, with a general location mixture model. The model assumes that the control units are a latent mixture of (i) units whose covariates are drawn from the same distributions as the treated units’ covariates and (ii) units whose covariates are drawn from different distributions. This formulation reduces the influence of control units outside the treated units’ region of the covariate space on the estimation of parameters in the imputation model, which can result in more plausible imputations and better balance in the true covariate distributions. We illustrate the benefits of the latent class modeling approach with simulations and with an observational study of the effect of breast feeding on children’s cognitive abilities.
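    The matching step mentioned above can be sketched with a simple greedy nearest-neighbour rule on the estimated scores. This is a common generic technique, not the paper's mixture-model method, and the unit identifiers and score values below are hypothetical.

    ```python
    # Hypothetical, pre-estimated propensity scores: unit id -> score.
    treated = {"t1": 0.62, "t2": 0.35}
    controls = {"c1": 0.60, "c2": 0.30, "c3": 0.90}

    def match(treated, controls):
        """Greedily match each treated unit to the closest unmatched control."""
        available = dict(controls)
        pairs = {}
        for t, score in sorted(treated.items()):
            c = min(available, key=lambda k: abs(available[k] - score))
            pairs[t] = c
            del available[c]       # matching without replacement
        return pairs

    pairs = match(treated, controls)
    # t1 (0.62) pairs with c1 (0.60); t2 (0.35) pairs with c2 (0.30);
    # c3 (0.90) lies outside the treated units' score range and is unused.
    assert pairs == {"t1": "c1", "t2": "c2"}
    ```

    The unused control c3 illustrates the paper's motivating point: control units far from the treated units' region of covariate space contribute little to the causal comparison, so letting them dominate the imputation model is undesirable.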

    A latent class model to multiply impute missing treatment indicators in observational studies when inferences of the treatment effect are made using propensity score matching

    Analysts often estimate treatment effects in observational studies using propensity score matching techniques. When there are missing covariate values, analysts can multiply impute the missing data to create m completed data sets. Analysts can then estimate propensity scores on each of the completed data sets, and use these to estimate treatment effects. However, there has been relatively little attention paid to developing imputation models to deal with the additional problem of missing treatment indicators, perhaps due to the consequences of generating implausible imputations. Yet simply ignoring the missing treatment values, akin to a complete case analysis, could also lead to problems when estimating treatment effects. We propose a latent class model to multiply impute missing treatment indicators. We illustrate its performance through simulations and with data taken from a study on determinants of children's cognitive development. This approach obtains treatment effect estimates closer to the true treatment effect than both conventional imputation procedures and a complete case analysis.
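    The basic mechanics of multiply imputing a missing binary treatment indicator can be sketched as follows. The records, the imputation probabilities, and the simple difference-in-means estimator are all invented for illustration; the paper's latent class model would supply better-calibrated probabilities.

    ```python
    import random

    random.seed(0)

    # Hypothetical records: outcome y, treatment z (None = missing), and an
    # assumed predictive probability of treatment for the missing cases.
    records = [
        {"y": 5.0, "z": 1}, {"y": 3.0, "z": 0},
        {"y": 4.5, "z": None, "p_treat": 0.8},
        {"y": 2.8, "z": None, "p_treat": 0.2},
    ]

    def effect(data):
        """Difference in mean outcome between treated and control units."""
        t = [r["y"] for r in data if r["z"] == 1]
        c = [r["y"] for r in data if r["z"] == 0]
        return sum(t) / len(t) - sum(c) / len(c)

    m = 20
    estimates = []
    for _ in range(m):
        # Draw one completed data set: impute each missing z from p_treat.
        completed = [
            {"y": r["y"],
             "z": r["z"] if r["z"] is not None else int(random.random() < r["p_treat"])}
            for r in records
        ]
        estimates.append(effect(completed))

    # Rubin's rule for the point estimate: average over the m completed data sets.
    mi_estimate = sum(estimates) / m
    ```

    Averaging over the m imputations propagates the uncertainty about the missing indicators, instead of discarding those records as a complete case analysis would.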

    The effect of snow accumulation on imaging riometer performance

    In January 1998 an imaging riometer system was deployed at Halley, Antarctica (76°S, 27°W), involving the construction of an array of 64 crossed-dipole antennas and a ground plane. Weather conditions at Halley mean that such an array rapidly becomes buried beneath the snow, so the system was tuned to operate efficiently when buried. Theoretical calculations indicate that because the distance between the ground plane and the array was scaled to be 1/4λ in the snow, as snow fills the gap the signal will increase by 0.6–2.5 dB. Similarly, the short antennas are resonant when operated in snow, not in air. Theoretical calculations show that the largest effect of this is the mismatch of their feed point impedance to the receiver network. As the signal for each riometer beam is composed of a contribution from all 64 antennas, for each antenna that becomes buried the signal level will increase by 1/64 of ∼9 dB. The measured response of the system to burial showed significant changes as snow accumulated in and over the array during 1998. The changes are consistent with the magnitude of the effects predicted by the theoretical calculations. The Halley imaging riometer system, having now been buried completely, is operating more efficiently than if a standard air-tuned configuration had been deployed. The results are of considerable relevance to the ever-increasing community of imaging riometer users regarding both deployment and the subsequent interpretation of scientific data. Some systems will experience similar permanent burial, while others will be subject to significant annual variability as a result of becoming snow-covered during winter and clear during summer.
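    The quarter-wave scaling and the per-antenna signal budget above reduce to simple arithmetic. The operating frequency and the refractive index of snow below are assumed values for illustration (the abstract does not state them); only the 64-antenna count and the ∼9 dB total come from the text.

    ```python
    # Back-of-envelope numbers for the burial-tuned array.
    freq_hz = 38.2e6          # common imaging-riometer frequency (assumed)
    c = 2.998e8               # speed of light, m/s
    n_snow = 1.3              # refractive index of dry snow (assumed)

    wavelength_air = c / freq_hz
    wavelength_snow = wavelength_air / n_snow     # wavelength shortens in snow
    spacing = wavelength_snow / 4                 # quarter-wave ground-plane gap

    # Each of the 64 antennas that becomes buried contributes 1/64 of the
    # ~9 dB total signal change quoted in the abstract.
    per_antenna_db = 9.0 / 64
    total_db = 64 * per_antenna_db
    ```

    The key design point is that the array-to-ground-plane spacing was chosen to be a quarter wavelength *in snow*, so burial moves the system toward, not away from, its tuned geometry.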

    Bayesian model-based clustering for populations of network data

    There is increasing appetite for analysing populations of network data due to the fast-growing body of applications demanding such methods. While methods exist to provide readily interpretable summaries of heterogeneous network populations, these are often descriptive or ad hoc, lacking any formal justification. In contrast, principled analysis methods often provide results difficult to relate back to the applied problem of interest. Motivated by two complementary applied examples, we develop a Bayesian framework to appropriately model complex heterogeneous network populations, while also allowing analysts to gain insights from the data and make inferences most relevant to their needs. The first application involves a study in computer science measuring human movements across a university. The second analyses data from neuroscience investigating relationships between different regions of the brain. While both applications entail analysis of a heterogeneous population of networks, network sizes vary considerably. We focus on the problem of clustering the elements of a network population, where each cluster is characterised by a network representative. We take advantage of the Bayesian machinery to simultaneously infer the cluster membership, the representatives, and the community structure of the representatives, thus allowing intuitive inferences to be made. The implementation of our method on the human movement study reveals interesting movement patterns of individuals in clusters, readily characterised by their network representative. For the brain networks application, our model reveals a cluster of individuals with different network properties of particular interest in neuroscience. The performance of our method is additionally validated in extensive simulation studies.
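    The idea of "each cluster characterised by a network representative" can be illustrated with a crude stand-in: assign each network to whichever representative differs from it in the fewest edges. This Hamming-distance rule is only a deterministic sketch of the clustering task, not the paper's Bayesian model, and the networks below are invented.

    ```python
    # Networks on a shared node set, encoded as frozensets of undirected edges.
    nets = {
        "a": frozenset({(1, 2), (2, 3)}),
        "b": frozenset({(1, 2), (2, 3), (3, 4)}),
        "c": frozenset({(1, 4), (2, 4)}),
    }
    # Two candidate cluster representatives (hypothetical).
    reps = {
        "r1": frozenset({(1, 2), (2, 3)}),
        "r2": frozenset({(1, 4), (2, 4)}),
    }

    def assign(net):
        """Pick the representative with the smallest edge (Hamming) distance."""
        return min(reps, key=lambda r: len(reps[r] ^ net))

    clusters = {name: assign(net) for name, net in nets.items()}
    assert clusters == {"a": "r1", "b": "r1", "c": "r2"}
    ```

    The Bayesian framework in the paper goes much further, jointly inferring memberships, the representatives themselves, and the representatives' community structure rather than fixing the representatives in advance.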

    Bayesian model-based clustering for multiple network data

    There is increasing appetite for analysing multiple network data. Unlike traditional data sets, each observation in the data here comprises a network. Recent technological advancements have allowed the collection of this type of data in a range of different applications. This has inspired researchers to develop statistical models that most accurately describe the probabilistic mechanism that generates a network population and use this to make inferences about the underlying structure of the network data. Only a few studies developed to date consider the heterogeneity that can exist in a network population. We propose a Mixture of Measurement Error Models for identifying clusters of networks in a network population, with respect to similarities detected in the connectivity patterns among the networks' nodes. Extensive simulation studies show our model performs well in both clustering multiple network data and inferring the model parameters. We further apply our model to two real-world multiple network data sets from the fields of Computing (Human Tracking Systems) and Neuroscience.

    Using saturated models for data synthesis


    The appeal of the gamma family distribution to protect the confidentiality of contingency tables

    Administrative databases, such as the English School Census (ESC), are rich sources of information that are potentially useful for researchers. For such data sources to be made available, however, strict guarantees of privacy would be required. To achieve this, synthetic data methods can be used. Such methods, when protecting the confidentiality of tabular data (contingency tables), often utilise the Poisson or Poisson-mixture distributions, such as the negative binomial (NBI). These distributions, however, are either equidispersed (in the case of the Poisson) or overdispersed (e.g. in the case of the NBI), which results in excessive noise being applied to large low-risk counts. This paper proposes the use of the (discretized) gamma family (GAF) distribution, which allows noise to be applied in a more bespoke fashion. Specifically, it allows less noise to be applied as cell counts become larger, providing an optimal balance in relation to the risk-utility trade-off. We illustrate the suitability of the GAF distribution on an administrative-type data set that is reminiscent of the ESC.
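    The dispersion argument above can be made concrete with a small sketch. Under Poisson synthesis the variance of a synthetic cell equals the count n, so the relative noise is sqrt(n)/n; a gamma-family style mechanism can let the variance grow more slowly than n. The exponent 0.5 below is an assumed value chosen only to illustrate "less noise for larger counts", not a parameter from the paper.

    ```python
    import math

    def rel_noise_poisson(n):
        """Relative noise (sd / count) for Poisson synthesis: variance = n."""
        return math.sqrt(n) / n

    def rel_noise_gaf(n):
        """Relative noise under an assumed gamma-style variance of n**0.5."""
        return math.sqrt(n ** 0.5) / n

    # Large, low-risk counts receive proportionally far less perturbation
    # under the sub-linear variance, unlike the equidispersed Poisson.
    for n in (10, 100, 10_000):
        assert rel_noise_gaf(n) < rel_noise_poisson(n)
    ```

    The flexibility to tune how variance scales with the cell count is what lets the GAF approach concentrate protection on small, disclosive cells while leaving large cells nearly intact.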

    Multiply imputing missing values arising by design in transplant survival data

    In this article, we address a missing data problem that occurs in transplant survival studies. Recipients of organ transplants are followed up from transplantation and their survival times recorded, together with various explanatory variables. Due to differences in data collection procedures in different centers or over time, a particular explanatory variable (or set of variables) may only be recorded for certain recipients, which results in this variable being missing for a substantial number of records in the data. The variable may also turn out to be an important predictor of survival and so it is important to handle this missing-by-design problem appropriately. Consensus in the literature is to handle this problem with complete case analysis, as the missing data are assumed to arise under an appropriate missing at random mechanism that gives consistent estimates here. Specifically, the missing values can reasonably be assumed not to be related to the survival time. In this article, we investigate the potential for multiple imputation to handle this problem in a relevant study on survival after kidney transplantation, and show that it comprehensively outperforms complete case analysis on a range of measures. This is a particularly important finding in the medical context as imputing large amounts of missing data is often viewed with scepticism.
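    Combining the analyses of the m imputed data sets is typically done with Rubin's rules; a minimal sketch follows. The estimates and variances are made-up numbers for illustration, not values from the kidney transplantation study.

    ```python
    # Rubin's rules for pooling estimates from m multiply imputed data sets.
    estimates = [0.52, 0.48, 0.55, 0.50, 0.45]    # e.g. log hazard ratios (invented)
    variances = [0.010, 0.012, 0.009, 0.011, 0.010]

    m = len(estimates)
    q_bar = sum(estimates) / m                    # pooled point estimate
    u_bar = sum(variances) / m                    # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation var.
    total_var = u_bar + (1 + 1 / m) * b           # Rubin's total variance
    ```

    The between-imputation term b is what a complete case analysis cannot capture: it carries the extra uncertainty due to the values being missing in the first place.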