
    Do unbalanced data have a negative effect on LDA?

    For two-class discrimination, Xie and Qiu [The effect of imbalanced data sets on LDA: a theoretical and empirical analysis, Pattern Recognition 40 (2) (2007) 557–562] claimed that, when the covariance matrices of the two classes are unequal, a (class) unbalanced data set has a negative effect on the performance of linear discriminant analysis (LDA). Through re-balancing 10 real-world data sets, they provided empirical evidence to support this claim, using AUC (Area Under the receiver operating characteristic Curve) as the performance metric. We suggest that such a claim is vague, if not misleading: no solid theoretical analysis is presented in their paper, and AUC can lead to quite a different conclusion from that reached via the misclassification error rate (ER) when assessing the discrimination performance of LDA on unbalanced data sets. Our empirical and simulation studies suggest that, for LDA, the increase in the median AUC (and thus the improvement in performance of LDA) from re-balancing is relatively small, while, in contrast, the increase in the median ER (and thus the decline in performance of LDA) from re-balancing is relatively large. Therefore, from our study, there is no reliable empirical evidence to support the claim that a (class) unbalanced data set has a negative effect on the performance of LDA. In addition, re-balancing affects the performance of LDA for data sets with either equal or unequal covariance matrices, indicating that unequal covariance matrices are not a key reason for the difference in performance between original and re-balanced data sets.
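
The AUC-versus-ER contrast at the heart of this abstract can be reproduced in miniature. The sketch below is a hedged illustration, not the cited study: the class sizes, means, covariance matrices, and undersampling scheme are all assumptions chosen for demonstration. It fits two-class LDA (pooled covariance, log-prior offset) on an imbalanced synthetic set, re-balances by undersampling the majority class, and reports both metrics on the original data.

```python
# Hedged sketch: compare ER and AUC of two-class LDA before/after re-balancing.
# All parameters below are illustrative assumptions, not the cited study's data.
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(0)

# Unbalanced classes (9:1) with unequal covariance matrices
n0, n1 = 900, 100
X0 = rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], n0)
X1 = rng.multivariate_normal([2, 2], [[2, 0.5], [0.5, 2]], n1)

def lda_fit(X0, X1):
    """Two-class LDA: pooled covariance, linear score w.x + b (b includes log-prior)."""
    m0, m1 = X0.mean(0), X1.mean(0)
    S = (np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1)) \
        / (len(X0) + len(X1) - 2)
    w = inv(S) @ (m1 - m0)
    b = -0.5 * w @ (m0 + m1) + np.log(len(X1) / len(X0))
    return w, b

def eval_lda(w, b, X0, X1):
    s0, s1 = X0 @ w + b, X1 @ w + b
    # ER: fraction of all points on the wrong side of the decision boundary
    er = ((s0 > 0).sum() + (s1 <= 0).sum()) / (len(X0) + len(X1))
    # AUC: P(score of a class-1 sample exceeds score of a class-0 sample)
    auc = np.mean(s1[:, None] > s0[None, :])
    return er, auc

w, b = lda_fit(X0, X1)
er_orig, auc_orig = eval_lda(w, b, X0, X1)

# Re-balance by undersampling the majority class, then evaluate on the original data
X0_bal = X0[rng.choice(n0, n1, replace=False)]
w, b = lda_fit(X0_bal, X1)
er_bal, auc_bal = eval_lda(w, b, X0, X1)

print(f"original:    ER={er_orig:.3f}  AUC={auc_orig:.3f}")
print(f"re-balanced: ER={er_bal:.3f}  AUC={auc_bal:.3f}")
```

In runs of this kind, AUC tends to move little under re-balancing (it is invariant to the decision threshold), while ER can shift substantially, which is the distinction the abstract draws.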

    Short note on two output-dependent hidden Markov models

    The purpose of this note is to study the assumption of "mutual information independence", which is used by Zhou (2005) for deriving an output-dependent hidden Markov model, the so-called discriminative HMM (D-HMM), in the context of determining a stochastic optimal sequence of hidden states. The assumption is extended to derive its generative counterpart, the G-HMM. In addition, state-dependent representations for two output-dependent HMMs, namely HMMSDO (Li, 2005) and D-HMM, are presented.

    D-optimal designs via a cocktail algorithm

    A fast new algorithm is proposed for numerical computation of (approximate) D-optimal designs. This "cocktail algorithm" extends the well-known vertex direction method (VDM; Fedorov 1972) and the multiplicative algorithm (Silvey, Titterington and Torsney, 1978), and shares their simplicity and monotonic convergence properties. Numerical examples show that the cocktail algorithm can lead to dramatically improved speed, sometimes by orders of magnitude, relative to either the multiplicative algorithm or the vertex exchange method (a variant of VDM). Key to the improved speed is a new nearest-neighbor exchange strategy, which acts locally and complements the global effect of the multiplicative algorithm. Possible extensions to related problems such as nonparametric maximum likelihood estimation are mentioned.
    Comment: A number of changes after accounting for the referees' comments, including new examples in Section 4 and more detailed explanations throughout.
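
The multiplicative algorithm that the cocktail algorithm builds on is simple enough to sketch. Below is a minimal, hedged implementation under illustrative assumptions (a random candidate set in three dimensions and an arbitrary tolerance); it is not the cocktail algorithm itself, only its multiplicative ingredient. Each iteration rescales the design weights by the normalized prediction variances d_i(w)/p, and the general equivalence theorem (max_i d_i(w) = p at optimality) supplies the stopping rule.

```python
# Minimal sketch of the multiplicative algorithm for D-optimal design
# (Silvey, Titterington and Torsney, 1978). Candidate set and tolerance
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 3))       # candidate design points, p = 3
m, p = X.shape
w = np.full(m, 1.0 / m)                    # start from the uniform design

for _ in range(2000):
    M = (X * w[:, None]).T @ X             # information matrix M(w)
    d = np.einsum("ij,jk,ik->i", X, np.linalg.inv(M), X)  # variances d_i(w)
    w *= d / p                             # multiplicative update; sum(w) stays 1
    if d.max() <= p * (1 + 1e-6):          # equivalence-theorem stopping rule
        break

print("max d_i(w):", d.max(), " (D-optimal when max d_i = p)")
```

The update preserves the weight normalization because sum_i w_i d_i(w) = trace(M^{-1} M) = p, and it increases log det M(w) monotonically. The paper's point is that this global but slow update pairs well with a local exchange step.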

    Microstructure Effects on Daily Return Volatility in Financial Markets

    We simulate a series of daily returns from intraday price movements initiated by microstructure elements. Significant evidence is found that daily returns and daily return volatility exhibit first-order autocorrelation, but trading volume and daily return volatility are not correlated, while intraday volatility is. We also consider GARCH effects in daily return series and show that estimates using daily returns are biased by the influence of the price level. Using daily price changes instead, we find evidence of a significant GARCH component. These results suggest that microstructure elements have a considerable influence on the return generating process.
    Comment: 15 pages, as presented at the Complexity Workshop in Aix-en-Provence.
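
How a microstructure element can induce first-order autocorrelation in returns is easy to demonstrate. The sketch below is a hedged toy example in the spirit of, but not identical to, the paper's simulation: it uses bid-ask bounce (trades alternating randomly between bid and ask around a latent efficient price), with spread, volatility, and sample length as illustrative assumptions.

```python
# Hedged toy: bid-ask bounce inducing first-order autocorrelation in returns.
# Spread, volatility, and sample length are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n = 20000
fundamental = np.cumsum(rng.normal(0, 0.01, n))   # latent efficient log-price
spread = 0.02
side = rng.choice([-1.0, 1.0], n)                 # trade at bid (-1) or ask (+1)
observed = fundamental + side * spread / 2        # observed transaction log-price
returns = np.diff(observed)

def lag1_autocorr(x):
    x = x - x.mean()
    return (x[:-1] @ x[1:]) / (x @ x)

print("lag-1 autocorrelation:", lag1_autocorr(returns))
```

With these parameters the bounce term contributes a negative lag-1 autocovariance of roughly one third of the return variance, so the measured autocorrelation comes out clearly negative even though the efficient price is a pure random walk.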

    Learning Mixtures of Gaussians in High Dimensions

    Efficiently learning mixtures of Gaussians is a fundamental problem in statistics and learning theory. Given samples coming from a random one out of k Gaussian distributions in R^n, the learning problem asks to estimate the means and the covariance matrices of these Gaussians. This learning problem arises in many areas ranging from the natural sciences to the social sciences, and has also found many machine learning applications. Unfortunately, learning mixtures of Gaussians is an information-theoretically hard problem: in order to learn the parameters up to a reasonable accuracy, the number of samples required is exponential in the number of Gaussian components in the worst case. In this work, we show that provided we are in high enough dimensions, the class of Gaussian mixtures is learnable in its most general form under a smoothed analysis framework, where the parameters are randomly perturbed from an adversarial starting point. In particular, given samples from a mixture of Gaussians with randomly perturbed parameters, when n > Ω(k^2), we give an algorithm that learns the parameters in polynomial running time and using a polynomial number of samples. The central algorithmic ideas consist of new ways to decompose the moment tensor of the Gaussian mixture by exploiting its structural properties. The symmetries of this tensor are derived from the combinatorial structure of higher-order moments of Gaussian distributions (sometimes referred to as Isserlis' theorem or Wick's theorem). We also develop new tools for bounding the smallest singular values of structured random matrices, which could be useful in other smoothed analysis settings.
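
The idea that moments of the mixture reveal its parameters can be illustrated in the simplest setting. The sketch below is a hedged toy, far weaker than the paper's method (which uses higher-order moment tensors and handles general covariances): for a mixture of spherical Gaussians with known noise level, the second-moment matrix minus the noise term has rank k, so its spectrum exposes the subspace spanned by the component means. All dimensions and parameters are illustrative assumptions.

```python
# Hedged toy: for spherical components, E[x x^T] - sigma^2 I is rank k and
# its column space contains the component means. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n_dim, k, sigma, n = 10, 2, 0.3, 200_000
means = rng.normal(0, 1, (k, n_dim))              # ground-truth component means
labels = rng.integers(0, k, n)
X = means[labels] + sigma * rng.normal(0, 1, (n, n_dim))

M2 = X.T @ X / n - sigma**2 * np.eye(n_dim)       # de-noised second moment
vals = np.sort(np.linalg.eigvalsh(M2))[::-1]
print("top eigenvalues:", vals[: k + 1])          # k large values, then ~0
```

The sharp drop after the k-th eigenvalue is what moment-based methods exploit; the paper's contribution is making this kind of argument work for general (non-spherical, unknown) covariances via higher-order tensors.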

    Scattering statistics of rock outcrops: Model-data comparisons and Bayesian inference using mixture distributions

    The probability density function of the acoustic field amplitude scattered by the seafloor was measured in a rocky environment off the coast of Norway using a synthetic aperture sonar system, and is reported here in terms of the probability of false alarm. Interpretation of the measurements focused on finding an appropriate class of statistical models (single-component versus two-component mixture models), and on appropriate models within these two classes. It was found that two-component mixture models performed better than single-component models. The two mixture models that performed best (and had a basis in the physics of scattering) were a mixture of two K distributions, and a mixture of a Rayleigh and a generalized Pareto distribution. Bayes' theorem was used to estimate the probability density function of the mixture model parameters. It was found that the K-K mixture exhibits significant correlation between its parameters. The mixture of the Rayleigh and generalized Pareto distributions also had significant parameter correlation, but additionally contained multiple modes. We conclude that the mixture of two K distributions is the most applicable to this dataset.
    Comment: 15 pages, 7 figures, accepted to the Journal of the Acoustical Society of America.
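
A probability-of-false-alarm curve for a two-component amplitude mixture like the Rayleigh / generalized Pareto model is straightforward to write down: it is the mixture of the two survival functions. The sketch below is hedged — the mixing weight and the shape/scale parameters are illustrative assumptions, not the paper's Bayesian estimates.

```python
# Hedged sketch: PFA curve for a Rayleigh / generalized Pareto amplitude
# mixture. The weight and parameters below are illustrative assumptions.
import numpy as np

eps, sigma_r, xi, sigma_p = 0.05, 1.0, 0.3, 1.5   # assumed mixture parameters

def rayleigh_sf(a, sigma):
    """Rayleigh survival function: P(A > a)."""
    return np.exp(-a**2 / (2 * sigma**2))

def gpd_sf(a, xi, sigma):
    """Generalized Pareto survival function (xi > 0, heavy tail)."""
    return (1 + xi * a / sigma) ** (-1 / xi)

def pfa(a):
    """Probability of false alarm at threshold a under the mixture."""
    return (1 - eps) * rayleigh_sf(a, sigma_r) + eps * gpd_sf(a, xi, sigma_p)

for a in np.linspace(0, 10, 6):
    print(f"threshold {a:4.1f}:  PFA = {pfa(a):.3e}")
```

At large thresholds the heavy Pareto tail dominates the PFA even with a small mixing weight, which is exactly why single Rayleigh-like models underpredict false alarms over rocky seafloors.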

    Quantitative assessment of sewer overflow performance with climate change in northwest England

    Changes in rainfall patterns associated with climate change can affect the operation of a combined sewer system through potential increases in rainfall amount. This could lead to excessive spill frequencies and could also introduce hazardous substances into the receiving waters, which, in turn, would have an impact on the quality of shellfish and bathing waters. This paper quantifies the spill volume, duration and frequency of 19 combined sewer overflows (CSOs) to receiving waters under two climate change scenarios, the high-emissions (A1FI) and the low-emissions (B1) scenarios, simulated by three global climate models (GCMs), for a study catchment in northwest England. The future rainfall is downscaled, using climatic variables from the HadCM3, CSIRO and CGCM2 GCMs, with a hybrid generalized linear-artificial neural network model. The model simulations for 2080 showed an annual increase of 37% in total spill volume, 32% in total spill duration, and 12% in spill frequency for the shellfish water limiting requirements; these maxima were obtained under the high-emissions scenario as projected by HadCM3. Nevertheless, the catchment drainage system is projected by all three GCMs to cope with the conditions in 2080. The results also indicate that, under scenario B1, CSIRO projects a significant drop, which could reach up to 50% in spill volume, 39% in spill duration and 25% in spill frequency. The results further show that, during the bathing season, a substantial drop in the CSO spill drivers is predicted by all GCMs under both scenarios.

    Characterizing and Improving Generalized Belief Propagation Algorithms on the 2D Edwards-Anderson Model

    We study the performance of different message passing algorithms in the two-dimensional Edwards-Anderson model. We show that the standard Belief Propagation (BP) algorithm converges only at high temperature to a paramagnetic solution. Then, we test a Generalized Belief Propagation (GBP) algorithm, derived from a Cluster Variational Method (CVM) at the plaquette level. We compare its performance with BP and with other algorithms derived under the same approximation: Double Loop (DL) and a two-way message passing algorithm (HAK). The plaquette-CVM approximation improves on BP in at least three ways: the quality of the paramagnetic solution at high temperatures, a better (lower) estimate for the critical temperature, and the fact that the GBP message passing algorithm also converges to non-paramagnetic solutions. The lack of convergence of the standard GBP message passing algorithm at low temperatures seems to be related to the implementation details and not to the appearance of long-range order. In fact, we prove that a gauge invariance of the constrained CVM free energy can be exploited to derive a new message passing algorithm which converges at even lower temperatures. In all of its region of convergence this new algorithm is faster than HAK and DL by several orders of magnitude.
    Comment: 19 pages, 13 figures.
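
The baseline algorithm being improved on here, standard BP on an Edwards-Anderson lattice, fits in a few lines using the usual cavity-field parametrization. The sketch below is hedged: lattice size, temperature, and initialization are illustrative assumptions, and it only demonstrates the high-temperature regime where, as the abstract states, plain BP converges to the paramagnetic solution (all local magnetizations near zero).

```python
# Hedged sketch: standard BP on a small 2D Edwards-Anderson model with
# random +-J couplings. Size, temperature, and init are illustrative.
import numpy as np

rng = np.random.default_rng(3)
L, beta = 6, 0.2                      # 6x6 periodic lattice, high temperature
N = L * L

def neighbours(i):
    x, y = divmod(i, L)
    return [((x + 1) % L) * L + y, ((x - 1) % L) * L + y,
            x * L + (y + 1) % L, x * L + (y - 1) % L]

# symmetric random couplings J_ij = +-1 on nearest-neighbour edges
J = {}
for i in range(N):
    for j in neighbours(i):
        if (j, i) not in J:
            J[(i, j)] = J[(j, i)] = float(rng.choice([-1.0, 1.0]))

# cavity fields u[(i, j)]: message from spin i to neighbour j
u = {e: 0.01 * rng.normal() for e in J}

for sweep in range(500):
    diff = 0.0
    for (i, j) in J:
        h = sum(u[(k, i)] for k in neighbours(i) if k != j)   # cavity field at i
        new = np.arctanh(np.tanh(beta * J[(i, j)]) * np.tanh(beta * h)) / beta
        diff = max(diff, abs(new - u[(i, j)]))
        u[(i, j)] = new
    if diff < 1e-10:
        break

# local magnetizations from the converged messages
m = [np.tanh(beta * sum(u[(k, i)] for k in neighbours(i))) for i in range(N)]
print("max |m_i| =", max(abs(x) for x in m))
```

At this temperature the message update is a contraction, so the cavity fields flow to zero and all magnetizations vanish; at low temperature this fixed point loses stability and plain BP stops converging, which is the failure mode the paper's GBP variants address.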

    First results from the Very Small Array -- I. Observational methods

    The Very Small Array (VSA) is a synthesis telescope designed to image faint structures in the cosmic microwave background on degree and sub-degree angular scales. The VSA has key differences from other CMB interferometers, with the result that different systematic errors are expected. We have tested the operation of the VSA with a variety of blank-field and calibrator observations and cross-checked its calibration scale against independent measurements. We find that systematic effects can be suppressed below the thermal noise level in long observations; the overall calibration accuracy of the flux density scale is 3.5 percent and is limited by the external absolute calibration scale.
    Comment: 9 pages, 10 figures, MNRAS in press (minor revisions).