
    Measuring classifier performance: a coherent alternative to the area under the ROC curve


    A tool for subjective and interactive visual data exploration

    We present SIDE, a tool for Subjective and Interactive Visual Data Exploration, which lets users explore high-dimensional data via subjectively informative 2D data visualizations. Many existing visual analytics tools are either restricted to specific problems and domains, or they aim to find visualizations that align with the user's belief about the data. In contrast, our generic tool computes data visualizations that are surprising given a user's current understanding of the data. The user's belief state is represented as a set of projection tiles. This user-awareness offers an efficient way to interactively explore as-yet-unknown features of complex high-dimensional datasets.
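
    A minimal sketch of the general idea only, not the authors' actual SIDE algorithm (which models the belief state explicitly as projection tiles): treat the directions a user has already seen as "known", remove the structure they explain, and display the strongest remaining 2D projection. The function name and the orthogonal-residual heuristic are illustrative assumptions.

    ```python
    import numpy as np

    def surprising_projection(X, known_dirs):
        """X: (n, d) data; known_dirs: (k, d) directions the user already knows."""
        Xc = X - X.mean(axis=0)
        if len(known_dirs):
            Q, _ = np.linalg.qr(np.asarray(known_dirs, dtype=float).T)  # orthonormal basis
            Xc = Xc - Xc @ Q @ Q.T            # remove already-explained structure
        # top two principal directions of what remains
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:2].T                  # 2D coordinates to scatter-plot

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    coords = surprising_projection(X, known_dirs=np.eye(10)[:2])  # user has seen dims 0, 1
    ```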

    Efficient estimation of AUC in a sliding window

    In many applications, monitoring the area under the ROC curve (AUC) in a sliding window over a data stream is a natural way of detecting changes in the system. The drawback is that computing AUC in a sliding window is expensive, especially if the window size is large and the data flow is significant. In this paper we propose a scheme for maintaining an approximate AUC in a sliding window of length $k$. More specifically, we propose an algorithm that, given $\epsilon$, estimates AUC within $\epsilon/2$, and can maintain this estimate in $O((\log k)/\epsilon)$ time per update as the window slides. This provides a speed-up over the exact computation of AUC, which requires $O(k)$ time per update. The speed-up becomes more significant as the size of the window increases. Our estimate is based on grouping the data points together, and using these groups to calculate AUC. The grouping is designed carefully such that (i) the groups are small enough that the error stays small, (ii) the number of groups is small enough that enumerating them is not expensive, and (iii) the definition is flexible enough that we can maintain the groups efficiently. Our experimental evaluation demonstrates that the average approximation error in practice is much smaller than the approximation guarantee $\epsilon/2$, and that we can achieve significant speed-ups with only a modest sacrifice in accuracy.
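
    For reference, a hedged sketch of the exact baseline the paper speeds up: the AUC of one window computed from scratch via the rank-sum (Mann-Whitney) identity, costing $O(k \log k)$ per window position; the paper's grouping-based estimator replaces this with $O((\log k)/\epsilon)$ maintenance per update. The tie-free simplification and all names below are ours.

    ```python
    import numpy as np

    def window_auc(scores, labels):
        """Exact AUC of one window; assumes distinct scores (no tie correction)."""
        scores = np.asarray(scores, dtype=float)
        labels = np.asarray(labels, dtype=bool)
        n_pos, n_neg = labels.sum(), (~labels).sum()
        if n_pos == 0 or n_neg == 0:
            return float("nan")                        # AUC undefined for one class
        order = np.argsort(scores)
        ranks = np.empty(len(scores))
        ranks[order] = np.arange(1, len(scores) + 1)   # rank 1 = lowest score
        rank_sum = ranks[labels].sum()                 # rank sum of the positives
        return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

    # sliding a length-k window over a stream costs one O(k log k) call per step
    ```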

    Aspects of data ethics in a changing world: Where are we now?

    Ready data availability, cheap storage capacity, and powerful tools for extracting information from data have the potential to significantly enhance the human condition. However, as with all advanced technologies, this comes with the potential for misuse. Ethical oversight and constraints are needed to ensure that an appropriate balance is reached. Ethical issues involving data may be more challenging than the ethical challenges of some other advanced technologies, partly because data and data science are ubiquitous, having the potential to impact all aspects of life, and partly because of their intrinsic complexity. We explore the nature of data, personal data, data ownership, consent and purpose of use, the trustworthiness of data as well as of algorithms and of those using the data, and matters of privacy and confidentiality. A checklist is given of topics that need to be considered.

    Randomized Reference Classifier with Gaussian Distribution and Soft Confusion Matrix Applied to the Improving Weak Classifiers

    In this paper, the issue of building the randomized reference classifier (RRC) model using probability distributions other than the beta distribution is addressed. More precisely, we propose to build the RRC model using the truncated normal distribution. Heuristic procedures for setting the expected value and the variance of the truncated normal distribution are also proposed. The proposed approach is tested using the soft confusion matrix (SCM) based model to assess the consequences of applying the truncated normal distribution in the RRC model. The experimental evaluation is performed using four different base classifiers and seven quality measures. The results show that the proposed approach is comparable to the RRC model built using the beta distribution. What is more, for some base classifiers, the truncated-normal-based SCM algorithm turned out to be better at discovering objects coming from minority classes.
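
    As a hedged illustration of the distribution swap only (not the paper's heuristics for choosing the parameters), the first two moments of a normal distribution truncated to [0, 1] can be obtained directly from scipy; note that scipy's truncnorm takes its bounds in standardised units.

    ```python
    from scipy import stats

    def truncnorm_moments(mu, sigma, lo=0.0, hi=1.0):
        """Mean and variance of N(mu, sigma^2) truncated to [lo, hi]."""
        a, b = (lo - mu) / sigma, (hi - mu) / sigma   # scipy wants standardised bounds
        dist = stats.truncnorm(a, b, loc=mu, scale=sigma)
        return dist.mean(), dist.var()

    m, v = truncnorm_moments(mu=0.7, sigma=0.2)       # illustrative parameters
    ```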

    Interpreting random forest classification models using a feature contribution method

    Model interpretation is one of the key aspects of the model evaluation process. The explanation of the relationship between model variables and outputs is relatively easy for statistical models, such as linear regressions, thanks to the availability of model parameters and their statistical significance. For “black box” models, such as random forest, this information is hidden inside the model structure. This work presents an approach for computing feature contributions for random forest classification models. It allows for the determination of the influence of each variable on the model prediction for an individual instance. By analysing feature contributions for a training dataset, the most significant variables can be determined, and their typical contributions towards predictions made for individual classes, i.e., class-specific feature-contribution “patterns”, can be discovered. These patterns represent the standard behaviour of the model and allow for an additional assessment of the model's reliability on new data. Interpretation of feature contributions for two UCI benchmark datasets shows the potential of the proposed methodology. The robustness of the results is demonstrated through an extensive analysis of feature contributions calculated for a large number of generated random forest models.
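
    A minimal sketch of the feature-contribution idea for a single sklearn decision tree, under our own simplifying assumptions: walk an instance's decision path and credit each split feature with the change in the node's class-probability estimate. Averaging such contributions over all trees of a forest gives per-feature, per-class contributions in the spirit of the paper's method (the prediction then decomposes as root probabilities plus the summed contributions).

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def tree_contributions(tree, x):
        """Per-feature contributions to one instance's class-probability prediction."""
        t = tree.tree_
        # class-probability estimate at every node (counts normalised per node)
        probs = t.value[:, 0, :] / t.value[:, 0, :].sum(axis=1, keepdims=True)
        contrib = np.zeros((tree.n_features_in_, probs.shape[1]))
        node = 0
        while t.children_left[node] != -1:                 # descend to a leaf
            feat = t.feature[node]
            child = (t.children_left[node] if x[feat] <= t.threshold[node]
                     else t.children_right[node])
            contrib[feat] += probs[child] - probs[node]    # credit the split feature
            node = child
        return contrib  # leaf prediction = probs at root + contrib.sum(axis=0)

    X = np.random.rand(300, 4); y = (X[:, 0] + X[:, 1] > 1).astype(int)
    clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
    print(tree_contributions(clf, X[0]))
    ```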

    Estimating bank default with generalised extreme value regression models

    The paper proposes a novel model for the prediction of bank failures, on the basis of both macroeconomic and bank-specific microeconomic factors. As bank failures are rare, we apply a regression method for binary data based on extreme value theory, which turns out to be more effective than classical logistic regression models, as it better leverages the information in the tail of the default distribution. The application of this model to the occurrence of bank defaults in a highly bank-dependent economy (Italy) shows that, while microeconomic factors and regulatory capital are significant in explaining actual failures, macroeconomic factors are relevant only when failures are defined to include not only actual defaults but also mergers and acquisitions. In terms of predictive accuracy, the model based on extreme value theory outperforms classical logistic regression models.
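
    A hedged sketch of the core modelling idea: binary regression in which the logistic CDF is replaced by a generalised-extreme-value CDF, fitted by maximum likelihood, so the rare "default" class in the tail is modelled more flexibly. The shape parameter is held fixed here for brevity and the data are synthetic; the paper's estimation is more elaborate. Note that scipy parametrises the GEV shape as c = -xi.

    ```python
    import numpy as np
    from scipy import optimize, stats

    def fit_gev_regression(X, y, xi=0.25):
        """P(y=1|x) = GEV_xi(b0 + x.b), with xi fixed; fit b by maximum likelihood."""
        Xd = np.column_stack([np.ones(len(X)), X])        # add intercept column

        def nll(beta):
            p = stats.genextreme.cdf(Xd @ beta, c=-xi)    # scipy's c equals -xi
            p = np.clip(p, 1e-12, 1 - 1e-12)              # guard the logs
            return -(y * np.log(p) + (1 - y) * np.log1p(-p)).sum()

        return optimize.minimize(nll, np.zeros(Xd.shape[1]), method="BFGS").x

    rng = np.random.default_rng(1)
    X = rng.normal(size=(2000, 2))                        # two synthetic covariates
    p_true = stats.genextreme.cdf(-2 + X @ np.array([1.0, -0.5]), c=-0.25)
    y = (rng.random(2000) < p_true).astype(float)         # rare positives ("defaults")
    beta_hat = fit_gev_regression(X, y)
    ```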

    Premenopausal endogenous oestrogen levels and breast cancer risk: a meta-analysis.

    BACKGROUND: Many of the established risk factors for breast cancer implicate circulating hormone levels in the aetiology of the disease. Increased levels of postmenopausal endogenous oestradiol (E2) have been found to increase the risk of breast cancer, but no such association has been confirmed in premenopausal women. We carried out a meta-analysis to summarise the available evidence in women before the menopause. METHODS: We identified seven prospective studies of premenopausal endogenous E2 and breast cancer risk, including 693 breast cancer cases. From each study we extracted odds ratios of breast cancer between quantiles of endogenous E2, or for unit or s.d. increases in (log-transformed) E2, or (where odds ratios were unavailable) summary statistics for the distributions of E2 in breast cancer cases and unaffected controls. Estimates for a doubling of endogenous E2 were obtained from these extracted estimates, and random-effects meta-analysis was used to obtain a pooled estimate across the studies. RESULTS: Overall, we found weak evidence of a positive association between circulating E2 levels and the risk of breast cancer, with a doubling of E2 associated with an odds ratio of 1.10 (95% CI: 0.96, 1.27). CONCLUSION: Our findings are consistent with the hypothesis of a positive association between premenopausal endogenous E2 and breast cancer risk.
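
    For readers unfamiliar with the pooling step, a hedged sketch of DerSimonian-Laird random-effects meta-analysis of per-study log odds ratios (here, the OR per doubling of E2); the numbers below are placeholders, not the seven studies' actual estimates.

    ```python
    import numpy as np

    def random_effects_pool(log_or, se):
        """DerSimonian-Laird pooled OR with a 95% confidence interval."""
        log_or, se = np.asarray(log_or, dtype=float), np.asarray(se, dtype=float)
        w = 1 / se**2                                    # fixed-effect weights
        mu_fe = (w * log_or).sum() / w.sum()
        Q = (w * (log_or - mu_fe) ** 2).sum()            # heterogeneity statistic
        df = len(log_or) - 1
        tau2 = max(0.0, (Q - df) / (w.sum() - (w**2).sum() / w.sum()))
        w_re = 1 / (se**2 + tau2)                        # random-effects weights
        mu = (w_re * log_or).sum() / w_re.sum()
        se_mu = np.sqrt(1 / w_re.sum())
        return np.exp(mu), np.exp(mu - 1.96 * se_mu), np.exp(mu + 1.96 * se_mu)

    # seven hypothetical studies' log-ORs and standard errors
    or_, lo, hi = random_effects_pool([0.1, 0.2, -0.05, 0.15, 0.0, 0.3, 0.05],
                                      [0.15, 0.2, 0.25, 0.18, 0.3, 0.22, 0.2])
    ```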

    Evaluation of machine-learning methods for ligand-based virtual screening

    Machine-learning methods can be used for virtual screening by analysing the structural characteristics of molecules of known (in)activity, and here we discuss the use of kernel discrimination and naive Bayesian classifier (NBC) methods for this purpose. We report a kernel method that allows the processing of molecules represented by binary, integer and real-valued descriptors, and show that its screening performance differs little from that of a previously described kernel developed specifically for the analysis of binary fingerprint representations of molecular structure. We then evaluate the performance of an NBC when the training set contains only a very small number of active molecules. In such cases, a simpler approach based on group fusion would appear to provide superior screening performance, especially when structurally heterogeneous datasets are to be processed.
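
    A hedged sketch of the NBC setting the paper evaluates: a Bernoulli naive Bayes model ranking molecules by binary fingerprint. The fingerprints and activity labels below are random placeholders, not real screening data.

    ```python
    import numpy as np
    from sklearn.naive_bayes import BernoulliNB

    rng = np.random.default_rng(7)
    fps = rng.integers(0, 2, size=(1000, 166))           # e.g. MACCS-like bit vectors
    active = (fps[:, 0] & fps[:, 3]).astype(int)         # toy activity label
    model = BernoulliNB(alpha=1.0).fit(fps[:800], active[:800])
    scores = model.predict_proba(fps[800:])[:, 1]        # likelihood of activity
    ranking = np.argsort(-scores)                        # screen top-ranked molecules first
    ```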