Do unbalanced data have a negative effect on LDA?
For two-class discrimination, Xie and Qiu [The effect of imbalanced data sets on LDA: a theoretical and empirical analysis, Pattern Recognition 40 (2) (2007) 557–562] claimed that, when the covariance matrices of the two classes are unequal, a (class) unbalanced data set has a negative effect on the performance of linear discriminant analysis (LDA). By re-balancing 10 real-world data sets, Xie and Qiu provided empirical evidence for the claim, using AUC (Area Under the receiver operating characteristic Curve) as the performance metric. We suggest that the claim is vague if not misleading: no solid theoretical analysis is presented by Xie and Qiu, and AUC can lead to a conclusion about the discrimination performance of LDA on unbalanced data sets quite different from that reached using the misclassification error rate (ER). Our empirical and simulation studies suggest that, for LDA, the increase in the median AUC (and thus the improvement in the performance of LDA) from re-balancing is relatively small, whereas the increase in the median ER (and thus the decline in the performance of LDA) from re-balancing is relatively large. Therefore, from our study, there is no reliable empirical evidence to support the claim that a (class) unbalanced data set has a negative effect on the performance of LDA. In addition, re-balancing affects the performance of LDA for data sets with either equal or unequal covariance matrices, indicating that unequal covariance matrices are not a key reason for the difference in performance between the original and re-balanced data.
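The disagreement between the two metrics is easy to reproduce on synthetic data. The sketch below (an illustration with assumed class sizes, means and covariances, not one of the ten re-balanced data sets) fits two-class LDA with a pooled covariance estimate on imbalanced data and reports both AUC and ER:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes with unequal covariances and a 9:1 imbalance (illustrative setup).
n0, n1 = 900, 100
X0 = rng.normal(0.0, 1.0, size=(n0, 2))
X1 = rng.normal(1.5, 2.0, size=(n1, 2))
X = np.vstack([X0, X1])
y = np.r_[np.zeros(n0), np.ones(n1)]

def lda_scores(X, y):
    """Fisher LDA with a pooled covariance estimate; returns discriminant scores."""
    m0, m1 = X[y == 0].mean(0), X[y == 1].mean(0)
    S = ((X[y == 0] - m0).T @ (X[y == 0] - m0) +
         (X[y == 1] - m1).T @ (X[y == 1] - m1)) / (len(y) - 2)
    w = np.linalg.solve(S, m1 - m0)
    return X @ w

def auc(scores, y):
    """AUC as the probability that a random positive outscores a random negative."""
    pos, neg = scores[y == 1], scores[y == 0]
    return (pos[:, None] > neg[None, :]).mean()

s = lda_scores(X, y)
# Threshold at the midpoint of the projected class means (equal-prior cut).
t = (s[y == 0].mean() + s[y == 1].mean()) / 2
er = ((s > t).astype(float) != y).mean()
print(f"AUC = {auc(s, y):.3f}, error rate = {er:.3f}")
```

AUC depends only on the ranking of the discriminant scores, while ER also depends on the chosen threshold, which is one way the two metrics can move differently when a data set is re-balanced.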
Short note on two output-dependent hidden Markov models
The purpose of this note is to study the assumption of "mutual information independence", which is used by Zhou (2005) to derive an output-dependent hidden Markov model, the so-called discriminative HMM (D-HMM), in the context of determining a stochastic optimal sequence of hidden states. The assumption is extended to derive its generative counterpart, the G-HMM. In addition, state-dependent representations for two output-dependent HMMs, namely HMMSDO (Li, 2005) and D-HMM, are presented.
D-optimal designs via a cocktail algorithm
A fast new algorithm is proposed for numerical computation of (approximate)
D-optimal designs. This "cocktail algorithm" extends the well-known vertex
direction method (VDM; Fedorov 1972) and the multiplicative algorithm (Silvey,
Titterington and Torsney, 1978), and shares their simplicity and monotonic
convergence properties. Numerical examples show that the cocktail algorithm can
lead to dramatically improved speed, sometimes by orders of magnitude, relative
to either the multiplicative algorithm or the vertex exchange method (a variant
of VDM). Key to the improved speed is a new nearest neighbor exchange strategy,
which acts locally and complements the global effect of the multiplicative
algorithm. Possible extensions to related problems such as nonparametric
maximum likelihood estimation are mentioned.
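For context, the multiplicative algorithm of Silvey, Titterington and Torsney that the cocktail algorithm extends takes only a few lines. The sketch below applies it to a quadratic-regression candidate grid on [-1, 1] (an assumed illustrative example, not one from the paper), where the D-optimal design is known to place weight 1/3 at each of -1, 0 and 1:

```python
import numpy as np

# Candidate design points and regression functions f(t) = (1, t, t^2).
ts = np.linspace(-1.0, 1.0, 21)
F = np.column_stack([np.ones_like(ts), ts, ts**2])
m, p = F.shape

w = np.full(m, 1.0 / m)  # start from the uniform design
for _ in range(1000):
    M = F.T @ (w[:, None] * F)                            # information matrix M(w)
    d = np.einsum('ij,jk,ik->i', F, np.linalg.inv(M), F)  # variance function d(t, w)
    w *= d / p  # multiplicative update; preserves sum(w) = 1 since sum(w*d) = p

# Weight should concentrate on the known support points -1, 0, 1.
top = np.argsort(w)[-3:]
print(sorted(ts[top]), np.round(w[top], 3))
```

Each update multiplies w_i by d_i(w)/p, which increases the D-criterion monotonically but can need many iterations near the optimum; that slowness is what the nearest-neighbor exchange step of the cocktail algorithm is designed to overcome.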
Microstructure Effects on Daily Return Volatility in Financial Markets
We simulate a series of daily returns from intraday price movements initiated
by microstructure elements. Significant evidence is found that daily returns
and daily return volatility exhibit first-order autocorrelation, and that trading volume is uncorrelated with daily return volatility but correlated with intraday volatility. We also consider GARCH effects in daily return series and show
that estimates using daily returns are biased from the influence of the level
of prices. Using daily price changes instead, we find evidence of a significant
GARCH component. These results suggest that microstructure elements have a
considerable influence on the return generating process.Comment: 15 pages, as presented at the Complexity Workshop in Aix-en-Provenc
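A return series with the GARCH signature discussed above can be simulated in a few lines; the GARCH(1,1) parameters below are illustrative assumptions, not estimates from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate GARCH(1,1) daily returns: r_t = sqrt(h_t) * eps_t,
# h_t = omega + alpha * r_{t-1}^2 + beta * h_{t-1}.
omega, alpha, beta = 0.05, 0.10, 0.85
n = 20000
r = np.empty(n)
h = omega / (1 - alpha - beta)  # start at the unconditional variance
for t in range(n):
    r[t] = np.sqrt(h) * rng.standard_normal()
    h = omega + alpha * r[t]**2 + beta * h

def acf1(x):
    """Lag-1 sample autocorrelation."""
    x = x - x.mean()
    return (x[:-1] * x[1:]).sum() / (x * x).sum()

print(f"acf1(returns)   = {acf1(r):+.3f}")     # near zero: returns are uncorrelated
print(f"acf1(returns^2) = {acf1(r**2):+.3f}")  # positive: volatility clustering
```

Squared returns inherit positive lag-1 autocorrelation from the persistent conditional variance even though the returns themselves remain serially uncorrelated, which is the clustering pattern a GARCH component captures.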
Learning Mixtures of Gaussians in High Dimensions
Efficiently learning mixtures of Gaussians is a fundamental problem in
statistics and learning theory. Given samples coming from a random one out of k
Gaussian distributions in Rn, the learning problem asks to estimate the means
and the covariance matrices of these Gaussians. This learning problem arises in
many areas ranging from the natural sciences to the social sciences, and has
also found many machine learning applications. Unfortunately, learning mixtures of Gaussians is an information-theoretically hard problem: in order to learn the parameters up to a reasonable accuracy, the number of samples required is
exponential in the number of Gaussian components in the worst case. In this
work, we show that provided we are in high enough dimensions, the class of
Gaussian mixtures is learnable in its most general form under a smoothed
analysis framework, where the parameters are randomly perturbed from an
adversarial starting point. In particular, given samples from a mixture of
Gaussians with randomly perturbed parameters, when n > Ω(k^2), we give an algorithm that learns the parameters in polynomial time using a polynomial number of samples. The central algorithmic ideas consist of new ways
to decompose the moment tensor of the Gaussian mixture by exploiting its
structural properties. The symmetries of this tensor are derived from the
combinatorial structure of higher order moments of Gaussian distributions
(sometimes referred to as Isserlis' theorem or Wick's theorem). We also develop
new tools for bounding smallest singular values of structured random matrices,
which could be useful in other smoothed analysis settings.
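The moment-tensor decompositions at the heart of the algorithm are involved; as a much simpler baseline for the same estimation task, the sketch below fits a two-component univariate Gaussian mixture by plain EM (the mixture parameters and initialisation are illustrative assumptions, and EM carries none of the smoothed-analysis guarantees described above):

```python
import numpy as np

rng = np.random.default_rng(2)

# Draw from an assumed two-component mixture: 0.4 * N(-2, 1) + 0.6 * N(3, 1.5^2).
n = 5000
z = rng.random(n) < 0.4
x = np.where(z, rng.normal(-2.0, 1.0, n), rng.normal(3.0, 1.5, n))

# EM with a crude initialisation at the sample extremes.
pi, mu, sig = 0.5, np.array([x.min(), x.max()]), np.array([1.0, 1.0])
for _ in range(200):
    # E-step: responsibility of component 0 (normalising constants cancel).
    p0 = pi * np.exp(-0.5 * ((x - mu[0]) / sig[0])**2) / sig[0]
    p1 = (1 - pi) * np.exp(-0.5 * ((x - mu[1]) / sig[1])**2) / sig[1]
    g = p0 / (p0 + p1)
    # M-step: re-estimate the weight, means and standard deviations.
    pi = g.mean()
    mu = np.array([(g * x).sum() / g.sum(), ((1 - g) * x).sum() / (1 - g).sum()])
    sig = np.array([np.sqrt((g * (x - mu[0])**2).sum() / g.sum()),
                    np.sqrt(((1 - g) * (x - mu[1])**2).sum() / (1 - g).sum())])

print(f"pi≈{pi:.2f}, mu≈{np.round(mu, 2)}, sigma≈{np.round(sig, 2)}")
```

EM works well here because the components are well separated in one dimension; the point of the moment-tensor approach is to handle the high-dimensional regime where such local-search heuristics come with no guarantees.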
Scattering statistics of rock outcrops: Model-data comparisons and Bayesian inference using mixture distributions
The probability density function of the acoustic field amplitude scattered by
the seafloor was measured in a rocky environment off the coast of Norway using
a synthetic aperture sonar system, and is reported here in terms of the
probability of false alarm. Interpretation of the measurements focused on finding an appropriate class of statistical models (single versus two-component mixture models), and on appropriate models within these two classes. It was
found that two-component mixture models performed better than single models.
The two mixture models that performed the best (and had a basis in the physics
of scattering) were a mixture between two K distributions, and a mixture
between a Rayleigh and a generalized Pareto distribution. Bayes' theorem was used
to estimate the probability density function of the mixture model parameters.
It was found that the K-K mixture exhibits significant correlation between its
parameters. The mixture between the Rayleigh and generalized Pareto distributions also showed significant parameter correlation and, in addition, exhibited multiple modes. We conclude that the mixture between two K distributions is the most applicable to this dataset. (Accepted to the Journal of the Acoustical Society of America.)
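The tail behaviour that motivates two-component models can be sketched directly from the closed-form Rayleigh survival function; the mixture weight and scale parameters below are illustrative assumptions, not fits to the Norwegian data:

```python
import numpy as np

# Probability of false alarm, PFA(a) = P(amplitude > a), for a single Rayleigh
# model versus a two-component Rayleigh mixture with a small rough-patch term.
def pfa_rayleigh(a, s):
    """Rayleigh survival function with scale s."""
    return np.exp(-a**2 / (2 * s**2))

def pfa_mixture(a, w, s1, s2):
    """Survival function of a two-component Rayleigh mixture."""
    return w * pfa_rayleigh(a, s1) + (1 - w) * pfa_rayleigh(a, s2)

a = np.linspace(0.0, 6.0, 61)
single = pfa_rayleigh(a, 1.0)
mix = pfa_mixture(a, 0.9, 1.0, 2.5)  # assume 10% of pixels from a rougher patch
print(f"PFA at a=6: single={single[-1]:.2e}, mixture={mix[-1]:.2e}")
```

A small heavy-tailed component barely changes the bulk of the amplitude distribution but dominates the probability of false alarm at high thresholds, which is why single-component models tend to underfit measured tails.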
Quantitative assessment of sewer overflow performance with climate change in northwest England
Changes in rainfall patterns associated with climate change, in particular a potential increase in rainfall amount, can affect the operation of a combined sewer system. This could lead to excessive spill frequencies and could introduce hazardous substances into the receiving waters, which, in turn, would have an impact on the quality of shellfish and bathing waters. This paper quantifies the spill volume, duration and frequency of 19 combined sewer overflows (CSOs) to receiving waters under two climate change scenarios, the high-emissions (A1FI) and the low-emissions (B1) scenarios, simulated by three global climate models (GCMs), for a study catchment in northwest England. Future rainfall is downscaled, using climatic variables from the HadCM3, CSIRO and CGCM2 GCMs, with a hybrid generalized linear–artificial neural network model. Model simulations for 2080 showed, at most, an annual increase of 37% in total spill volume, 32% in total spill duration, and 12% in spill frequency relative to the shellfish water limiting requirements; these maxima were obtained under the high-emissions scenario as projected by HadCM3. Nevertheless, all three GCMs project that the catchment drainage system will cope with conditions in 2080. The results also indicate that under scenario B1 CSIRO projects a significant drop, which in the worst case could reach 50% in spill volume, 39% in spill duration and 25% in spill frequency. Furthermore, during the bathing season a substantial drop in the CSO spill drivers is expected under both scenarios for all GCMs.
Characterizing and Improving Generalized Belief Propagation Algorithms on the 2D Edwards-Anderson Model
We study the performance of different message passing algorithms in the two-dimensional Edwards-Anderson model. We show that the standard Belief
Propagation (BP) algorithm converges only at high temperature to a paramagnetic
solution. Then, we test a Generalized Belief Propagation (GBP) algorithm,
derived from a Cluster Variational Method (CVM) at the plaquette level. We
compare its performance with BP and with other algorithms derived under the same approximation: Double Loop (DL) and a two-way message passing algorithm (HAK). The plaquette-CVM approximation improves on BP in at least three ways: the
quality of the paramagnetic solution at high temperatures, a better estimate
(lower) for the critical temperature, and the fact that the GBP message passing algorithm also converges to non-paramagnetic solutions. The lack of convergence
of the standard GBP message passing algorithm at low temperatures seems to be
related to the implementation details and not to the appearance of long range
order. In fact, we prove that a gauge invariance of the constrained CVM free
energy can be exploited to derive a new message passing algorithm which
converges at even lower temperatures. In all its region of convergence this new
algorithm is faster than HAK and DL by some orders of magnitude.
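For comparison with the loopy 2D case, plain BP is exact on trees. The sketch below runs the standard cavity recursion for an open ferromagnetic Ising chain in a uniform field (the coupling, field and temperature values are illustrative assumptions):

```python
import numpy as np

def bp_chain(N, beta, J, h):
    """BP (cavity) magnetizations for an open ferromagnetic Ising chain."""
    t = np.tanh(beta * J)
    cL = np.zeros(N)  # cavity field on site i from the left part of the chain
    cR = np.zeros(N)  # cavity field on site i from the right part
    cL[0] = beta * h
    for i in range(1, N):
        cL[i] = beta * h + np.arctanh(t * np.tanh(cL[i - 1]))
    cR[-1] = beta * h
    for i in range(N - 2, -1, -1):
        cR[i] = beta * h + np.arctanh(t * np.tanh(cR[i + 1]))
    # The total local field counts h once, so subtract the double-counted beta*h.
    return np.tanh(cL + cR - beta * h)

m = bp_chain(10, 0.5, 1.0, 0.3)
print(np.round(m, 4))  # boundary spins are less magnetized than bulk spins
```

On a chain the forward and backward sweeps reach the BP fixed point in a single pass; the convergence difficulties discussed above arise only once the lattice has loops, which is what the plaquette-level CVM corrections address.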
First results from the Very Small Array -- I. Observational methods
The Very Small Array (VSA) is a synthesis telescope designed to image faint
structures in the cosmic microwave background on degree and sub-degree angular
scales. The VSA has key differences from other CMB interferometers with the
result that different systematic errors are expected. We have tested the
operation of the VSA with a variety of blank-field and calibrator observations
and cross-checked its calibration scale against independent measurements. We
find that systematic effects can be suppressed below the thermal noise level in
long observations; the overall calibration accuracy of the flux density scale
is 3.5 percent and is limited by the external absolute calibration scale. (MNRAS, in press.)
