287 research outputs found
Probabilistic performance estimators for computational chemistry methods: Systematic Improvement Probability and Ranking Probability Matrix. I. Theory
The comparison of benchmark error sets is an essential tool for the
evaluation of theories in computational chemistry. The standard ranking of
methods by their Mean Unsigned Error is unsatisfactory for several reasons
linked to the non-normality of the error distributions and the presence of
underlying trends. Complementary statistics have recently been proposed to
palliate such deficiencies, such as quantiles of the distribution of absolute
errors or the mean prediction uncertainty. We introduce here a new score,
the systematic improvement probability (SIP), based on the direct system-wise
comparison of absolute errors. Independently of the chosen scoring rule, the
uncertainty of the statistics due to the incompleteness of the benchmark data
sets is also generally overlooked. However, this uncertainty is essential to
appreciate the robustness of rankings. In the present article, we develop two
indicators based on robust statistics to address this problem: Pinv, the
inversion probability between two values of a statistic, and Pr, the ranking
probability matrix. We also demonstrate the essential contribution of the
correlations between error sets to these score comparisons.
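As a rough illustration of these ideas, the sketch below estimates a system-wise improvement probability and an inversion probability for the Mean Unsigned Error by paired bootstrap. The function names and the exact estimators are illustrative readings of the abstract, not the definitions of Paper I.

```python
import numpy as np

def sip(err_ref, err_new):
    """Fraction of systems for which the new method yields a strictly
    smaller absolute error than the reference method (a SIP-like score)."""
    return np.mean(np.abs(err_new) < np.abs(err_ref))

def p_inv(err_a, err_b, stat=lambda e: np.mean(np.abs(e)), n_boot=10_000, seed=None):
    """Paired-bootstrap probability that the ranking of two methods by a
    chosen statistic (default: MUE) is inverted with respect to the
    full-dataset ranking. Resampling systems in pairs preserves the
    correlation between the two error sets."""
    rng = np.random.default_rng(seed)
    err_a, err_b = np.asarray(err_a), np.asarray(err_b)
    n = len(err_a)
    ref_sign = np.sign(stat(err_a) - stat(err_b))
    flips = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # resample systems, keep pairs
        if np.sign(stat(err_a[idx]) - stat(err_b[idx])) != ref_sign:
            flips += 1
    return flips / n_boot
```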
Probabilistic performance estimators for computational chemistry methods: Systematic Improvement Probability and Ranking Probability Matrix. II. Applications
In the first part of this study (Paper I), we introduced the systematic
improvement probability (SIP) as a tool to assess the level of improvement on
absolute errors to be expected when switching between two computational
chemistry methods. We also developed two indicators based on robust statistics
to address the uncertainty of ranking in computational chemistry benchmarks:
Pinv, the inversion probability between two values of a statistic, and Pr,
the ranking probability matrix. In this second part, these indicators are
applied to nine data sets extracted from the recent benchmarking literature. We
also illustrate how the correlation between error sets might contain useful
information on the quality of the benchmark dataset, notably when experimental
data are used as reference.
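A minimal sketch of how a ranking probability matrix might be estimated by paired bootstrap over the systems of a benchmark; the estimator, defaults, and function name are assumptions based on the abstract, not the procedure of Paper II.

```python
import numpy as np

def ranking_probability_matrix(errors, stat=None, n_boot=5_000, seed=None):
    """errors: (n_systems, n_methods) array of signed errors.
    Returns P with P[r, j] = bootstrap probability that method j occupies
    rank r (rank 0 = best) according to the chosen statistic (default: MUE).
    Resampling whole systems preserves the correlations between error sets."""
    if stat is None:
        stat = lambda e: np.mean(np.abs(e), axis=0)
    rng = np.random.default_rng(seed)
    errors = np.asarray(errors)
    n_sys, n_meth = errors.shape
    P = np.zeros((n_meth, n_meth))
    for _ in range(n_boot):
        idx = rng.integers(0, n_sys, n_sys)   # paired resample of systems
        for rank, j in enumerate(np.argsort(stat(errors[idx]))):
            P[rank, j] += 1
    return P / n_boot
```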
Probabilistic performance estimators for computational chemistry methods: the empirical cumulative distribution function of absolute errors
Benchmarking studies in computational chemistry use reference datasets to
assess the accuracy of a method through error statistics. The commonly used
error statistics, such as the mean signed and mean unsigned errors, do not
inform end-users on the expected amplitude of prediction errors attached to
these methods. We show that, because the distributions of model errors are
neither normal nor zero-centered, these error statistics cannot be used to
infer prediction error probabilities. To overcome this limitation, we advocate for
the use of more informative statistics, based on the empirical cumulative
distribution function of unsigned errors, namely (1) the probability for a new
calculation to have an absolute error below a chosen threshold, and (2) the
maximal amplitude of errors one can expect with a chosen high confidence level.
Those statistics are also shown to be well suited for benchmarking and ranking
studies. Moreover, the standard error on all benchmarking statistics depends on
the size of the reference dataset. Systematic publication of these standard
errors would be very helpful to assess the statistical reliability of
benchmarking conclusions.
Comment: Supplementary material: https://github.com/ppernot/ECDF
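For concreteness, the ECDF-based statistics described above reduce to simple empirical estimates. The sketch below is one possible implementation with illustrative names (p_below, q_level, boot_se); it is not taken from the supplementary material.

```python
import numpy as np

def p_below(errors, eta):
    """P(|E| < eta): probability for a new calculation to have an absolute
    error below the threshold eta, from the empirical CDF of |errors|."""
    return np.mean(np.abs(errors) < eta)

def q_level(errors, level=0.95):
    """Q(level): amplitude of error not exceeded at the chosen confidence
    level (e.g. Q95)."""
    return np.quantile(np.abs(errors), level)

def boot_se(errors, statistic, n_boot=5_000, seed=None):
    """Bootstrap standard error of any benchmarking statistic, to be
    reported alongside its value."""
    rng = np.random.default_rng(seed)
    errors = np.asarray(errors)
    n = len(errors)
    vals = [statistic(errors[rng.integers(0, n, n)]) for _ in range(n_boot)]
    return np.std(vals, ddof=1)
```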
Investigating the performance of shear wave elastography for cardiac stiffness assessment through finite element simulations
Stratification of uncertainties recalibrated by isotonic regression and its impact on calibration error statistics
Post hoc recalibration of prediction uncertainties of machine
learning regression problems by isotonic regression might present a problem for
bin-based calibration error statistics (e.g. ENCE). Isotonic regression often
produces stratified uncertainties, i.e. subsets of uncertainties with identical
numerical values. Partitioning of the resulting data into equal-sized bins
introduces an aleatoric component to the estimation of bin-based calibration
statistics. The partitioning of stratified data into bins depends on the order
of the data, which is typically an uncontrolled property of calibration
test/validation sets. The tie-breaking method of the ordering algorithm used for
binning might also introduce an aleatoric component. I show with an example how
this can significantly affect the calibration diagnostics.
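The effect can be reproduced with a toy example: a simplified ENCE over equal-sized, uncertainty-ordered bins, applied to stratified uncertainties whose tie order depends on how the dataset happens to be ordered. The ENCE formula below follows its common definition; the data are synthetic and chosen only to exhibit the order dependence.

```python
import numpy as np

def ence(uncertainty, error, n_bins=10):
    """Expected Normalized Calibration Error over equal-sized bins ordered
    by uncertainty: mean of |RMV - RMSE| / RMV per bin."""
    order = np.argsort(uncertainty, kind="stable")  # stable sort: ties keep input order
    u, e = uncertainty[order], error[order]
    terms = []
    for b in np.array_split(np.arange(len(u)), n_bins):
        rmv = np.sqrt(np.mean(u[b] ** 2))           # root mean predicted variance
        rmse = np.sqrt(np.mean(e[b] ** 2))          # root mean squared error
        terms.append(abs(rmv - rmse) / rmv)
    return np.mean(terms)

# Stratified uncertainties (many identical values, as produced by isotonic
# regression): shuffling the dataset changes which tied points fall in which
# bin, and therefore the ENCE value.
rng = np.random.default_rng(0)
u = np.repeat([0.5, 1.0, 2.0], 200)                 # three strata of identical uncertainties
e = rng.normal(0.0, u)                              # errors consistent with u
perm = rng.permutation(len(u))
print(ence(u, e), ence(u[perm], e[perm]))           # generally differ
```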
Can bin-wise scaling improve consistency and adaptivity of prediction uncertainty for machine learning regression?
Binwise Variance Scaling (BVS) has recently been proposed as a post hoc
recalibration method for the prediction uncertainties of machine learning
regression problems that is capable of more efficient corrections than uniform
variance (or temperature) scaling. The original version of BVS uses
uncertainty-based binning, which aims to improve calibration conditionally
on uncertainty, i.e. consistency. I explore here several adaptations of BVS, in
particular with alternative loss functions and a binning scheme based on an
input feature (X), in order to improve adaptivity, i.e. calibration conditional
on X. The performances of BVS and its proposed variants are tested on a
benchmark dataset for the prediction of atomization energies and compared to
the results of isotonic regression.
Comment: This version corrects an error in the estimation of the Sx scores for
the test set, affecting Fig. 2 and Tables I-III of the initial version. The
main points of the discussion and the conclusions are unchanged.
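A bare-bones sketch of the bin-wise variance scaling idea as read from the abstract: one scaling factor per uncertainty bin, chosen so that the mean squared z-score in the bin equals one. Binning on an input feature X instead of the uncertainty (with X passed as the binning variable) would target adaptivity rather than consistency. The function names and the z-score criterion are assumptions, not the reference BVS implementation.

```python
import numpy as np

def fit_bvs(uncertainty, error, n_bins=10):
    """Fit one variance-scaling factor per uncertainty bin so that the mean
    squared z-score <(error/uncertainty)^2> becomes 1 within each bin."""
    u, e = np.asarray(uncertainty), np.asarray(error)
    edges = np.quantile(u, np.linspace(0.0, 1.0, n_bins + 1))
    edges[-1] = np.inf                               # open-ended last bin
    scale = np.ones(n_bins)
    for k in range(n_bins):
        mask = (u >= edges[k]) & (u < edges[k + 1])
        if mask.any():
            scale[k] = np.sqrt(np.mean((e[mask] / u[mask]) ** 2))
    return edges, scale

def apply_bvs(uncertainty, edges, scale):
    """Rescale new uncertainties with their bin-wise factors."""
    k = np.clip(np.searchsorted(edges, uncertainty, side="right") - 1,
                0, len(scale) - 1)
    return np.asarray(uncertainty) * scale[k]
```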
Validation of ML-UQ calibration statistics using simulated reference values: a sensitivity analysis
Some popular Machine Learning Uncertainty Quantification (ML-UQ) calibration
statistics do not have predefined reference values and are mostly used in
comparative studies. As a consequence, calibration is almost never validated and
the diagnostic is left to the reader's judgment. Simulated reference
values, based on synthetic calibrated datasets derived from actual
uncertainties, have been proposed to palliate this problem. As the generative
probability distribution for the simulation of synthetic errors is often not
constrained, the sensitivity of simulated reference values to the choice of
generative distribution might be problematic, casting doubt on the
calibration diagnostic. This study explores various facets of this problem, and
shows that some statistics are excessively sensitive to the choice of the
generative distribution used for validation when that distribution is
unknown. This is the case, for instance, for the correlation
coefficient between absolute errors and uncertainties (CC) and of the expected
normalized calibration error (ENCE). A robust validation workflow to deal with
simulated reference values is proposed.
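One possible shape for such a sensitivity check, in outline: generate synthetic, perfectly calibrated errors from the actual uncertainties under several candidate generative distributions and compare the resulting reference distributions of the statistic. The distribution choices and the CC example below are illustrative, not the paper's exact protocol.

```python
import numpy as np

def simulated_reference(uncertainty, statistic, dist="normal", n_mc=1_000, seed=None):
    """Distribution of a calibration statistic over synthetic, perfectly
    calibrated error sets generated from the actual uncertainties."""
    rng = np.random.default_rng(seed)
    u = np.asarray(uncertainty)
    vals = []
    for _ in range(n_mc):
        if dist == "normal":
            e = rng.normal(0.0, u)
        elif dist == "uniform":                      # same variance, different shape
            e = rng.uniform(-np.sqrt(3.0) * u, np.sqrt(3.0) * u)
        else:
            raise ValueError(f"unknown generative distribution: {dist}")
        vals.append(statistic(e, u))
    return np.array(vals)

# Example statistic: correlation between absolute errors and uncertainties (CC).
cc = lambda e, u: np.corrcoef(np.abs(e), u)[0, 1]
# Comparing simulated_reference(u, cc, "normal") with
# simulated_reference(u, cc, "uniform") probes the sensitivity of the
# reference values to the choice of generative distribution.
```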
