A tool for subjective and interactive visual data exploration
We present SIDE, a tool for Subjective and Interactive Visual Data Exploration, which lets users explore high-dimensional data via subjectively informative 2D data visualizations. Many existing visual analytics tools are either restricted to specific problems and domains, or they aim to find visualizations that align with the user's beliefs about the data. In contrast, our generic tool computes data visualizations that are surprising given the user's current understanding of the data. The user's belief state is represented as a set of projection tiles. This user-awareness offers users an efficient way to interactively explore as-yet-unknown features of complex high-dimensional datasets.
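As a rough, purely illustrative sketch of the "surprising given current knowledge" idea (the abstract does not detail SIDE's actual algorithm), one could represent the already-explored structure as a set of projection directions and look for a high-variance direction orthogonal to them; all names below are hypothetical:

```python
import numpy as np

def most_surprising_direction(X, explored_dirs):
    """Toy illustration: the highest-variance direction of X orthogonal to a
    set of already-explored projection directions. This only conveys the
    'surprising given current knowledge' flavour; it is not SIDE's algorithm,
    and all names here are hypothetical."""
    X = X - X.mean(axis=0)                        # centre the data
    if explored_dirs:                             # project out the explored subspace
        Q, _ = np.linalg.qr(np.column_stack(explored_dirs))
        X = X - X @ Q @ Q.T
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[0]                                  # leading remaining direction

# d1 = most_surprising_direction(X, []); d2 = most_surprising_direction(X, [d1])
```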
Efficient estimation of AUC in a sliding window
In many applications, monitoring the area under the ROC curve (AUC) in a sliding window over a data stream is a natural way of detecting changes in the system. The drawback is that computing AUC in a sliding window is expensive, especially if the window size is large and the data flow is significant.
In this paper we propose a scheme for maintaining an approximate AUC in a sliding window of a given length. More specifically, we propose an algorithm that, given an approximation parameter ε, estimates AUC to within ε and can maintain this estimate efficiently per update as the window slides. This provides a speed-up over the exact computation of AUC, which is substantially more expensive per update; the speed-up becomes more significant as the size of the window increases. Our estimate is based on grouping the data points together and using these groups to calculate AUC. The grouping is designed carefully such that (i) the groups are small enough that the error stays small, (ii) the number of groups is small enough that enumerating them is not expensive, and (iii) the definition is flexible enough that we can maintain the groups efficiently.
Our experimental evaluation demonstrates that the average approximation error in practice is much smaller than the approximation guarantee ε, and that we can achieve significant speed-ups with only a modest sacrifice in accuracy.
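The abstract does not give the paper's exact data structure, but the "compute AUC over groups" idea can be illustrated with a simple binned estimator: positives and negatives are counted per score bin, and pairs that fall into the same bin are treated as ties. The function below is a generic sketch under that assumption, not the authors' algorithm:

```python
import numpy as np

def binned_auc(scores, labels, n_bins=64):
    """Approximate AUC from grouped (binned) scores: positives and negatives
    are counted per score bin, and pairs falling into the same bin are treated
    as ties (0.5 credit). A generic sketch of the 'compute AUC over groups'
    idea; the paper's grouping and maintenance scheme is more refined."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    edges = np.linspace(scores.min(), scores.max(), n_bins + 1)
    bins = np.digitize(scores, edges[1:-1])                  # bin index 0..n_bins-1
    pos = np.bincount(bins[labels == 1], minlength=n_bins)
    neg = np.bincount(bins[labels == 0], minlength=n_bins)
    neg_below = np.concatenate(([0], np.cumsum(neg)[:-1]))   # negatives in strictly lower bins
    correct_pairs = np.sum(pos * neg_below) + 0.5 * np.sum(pos * neg)
    return correct_pairs / (pos.sum() * neg.sum())
```

In a sliding-window setting, the per-bin positive and negative counts can be updated as points enter and leave the window, so only the (few) groups need to be aggregated per estimate; the paper's contribution lies in choosing and maintaining the groups so that the error guarantee holds.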
Aspects of data ethics in a changing world: Where are we now?
Ready data availability, cheap storage capacity, and powerful tools for extracting information from data have the potential to significantly enhance the human condition. However, as with all advanced technologies, this comes with the potential for misuse. Ethical oversight and constraints are needed to ensure that an appropriate balance is reached. Ethical issues involving data may be more challenging than the ethical challenges of some other advanced technologies, partly because data and data science are ubiquitous, having the potential to impact all aspects of life, and partly because of their intrinsic complexity. We explore the nature of data, personal data, data ownership, consent and purpose of use, the trustworthiness of data as well as of algorithms and of those using the data, and matters of privacy and confidentiality. A checklist is given of topics that need to be considered.
Randomized Reference Classifier with Gaussian Distribution and Soft Confusion Matrix Applied to the Improving Weak Classifiers
In this paper, the issue of building the randomized reference classifier (RRC) model using probability distributions other than the beta distribution is addressed. More precisely, we propose to build the RRC model using the truncated normal distribution. Heuristic procedures for the expected value and the variance of the truncated normal distribution are also proposed. The proposed approach is tested using the soft confusion matrix (SCM)-based model to assess the consequences of applying the truncated normal distribution in the RRC model. The experimental evaluation is performed using four different base classifiers and seven quality measures. The results show that the proposed approach is comparable to the RRC model built using the beta distribution. Moreover, for some base classifiers, the truncated-normal-based SCM algorithm turned out to be better at discovering objects coming from minority classes.
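For readers unfamiliar with the truncated normal, its exact mean and variance on an interval can be computed directly, for example with SciPy. The snippet below only illustrates these moments for a distribution truncated to the unit interval (where class supports live); it is not the heuristic procedure proposed in the paper:

```python
import numpy as np
from scipy.stats import truncnorm

def truncated_normal_moments(mu, sigma, lower=0.0, upper=1.0):
    """Mean and variance of a normal(mu, sigma) truncated to [lower, upper]
    (here the unit interval). SciPy parametrises the truncation points in
    standard-deviation units. This only illustrates the distribution itself;
    the paper proposes heuristic procedures for choosing these moments within
    the RRC/SCM framework."""
    a, b = (lower - mu) / sigma, (upper - mu) / sigma
    dist = truncnorm(a, b, loc=mu, scale=sigma)
    return dist.mean(), dist.var()

# e.g. a base classifier's support of 0.7 with spread 0.2, truncated to [0, 1]
m, v = truncated_normal_moments(0.7, 0.2)
```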
Interpreting random forest classification models using a feature contribution method
Model interpretation is one of the key aspects of the model evaluation process. The explanation of the relationship between model variables and outputs is relatively easy for statistical models, such as linear regressions, thanks to the availability of model parameters and their statistical significance. For “black box” models, such as random forests, this information is hidden inside the model structure. This work presents an approach for computing feature contributions for random forest classification models. It allows the influence of each variable on the model prediction to be determined for an individual instance. By analysing feature contributions for a training dataset, the most significant variables can be determined, and their typical contributions towards predictions made for individual classes, i.e., class-specific feature contribution “patterns”, can be discovered. These patterns represent the standard behaviour of the model and allow for an additional assessment of model reliability for new data. Interpretation of feature contributions for two UCI benchmark datasets shows the potential of the proposed methodology. The robustness of the results is demonstrated through an extensive analysis of feature contributions calculated for a large number of generated random forest models.
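The idea of decomposing a prediction into per-feature contributions can be sketched, assuming scikit-learn's tree internals, by walking an instance down each tree and attributing every change in the node class probabilities to the splitting feature. This is an illustrative approximation of the approach, not the authors' code:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def tree_feature_contributions(tree, x, n_features):
    """Walk one instance down a fitted sklearn decision tree and attribute each
    change in the node's class-probability vector to the feature used in the
    split. The root probabilities plus the summed contributions equal the leaf
    prediction. A sketch of the feature-contribution idea, assuming sklearn's
    tree_ internals; not the authors' implementation."""
    t = tree.tree_
    probs = t.value[:, 0, :] / t.value[:, 0, :].sum(axis=1, keepdims=True)
    contrib = np.zeros((n_features, probs.shape[1]))
    node = 0
    while t.children_left[node] != -1:            # stop at a leaf
        feat = t.feature[node]
        child = (t.children_left[node] if x[feat] <= t.threshold[node]
                 else t.children_right[node])
        contrib[feat] += probs[child] - probs[node]
        node = child
    return contrib

# illustrative usage on a toy dataset; forest-level contributions are the
# average of the per-tree contributions
X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
contribs = np.mean([tree_feature_contributions(est, X[0], X.shape[1])
                    for est in rf.estimators_], axis=0)
```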
Estimating bank default with generalised extreme value regression models
The paper proposes a novel model for the prediction of bank failures, on the basis of both macroeconomic and bank-specific microeconomic factors. As bank failures are rare, we apply a regression method for binary data based on extreme value theory, which turns out to be more effective than classical logistic regression models, as it better leverages the information in the tail of the default distribution. The application of this model to the occurrence of bank defaults in a highly bank-dependent economy (Italy) shows that, while microeconomic factors and regulatory capital are significant in explaining failures proper, macroeconomic factors are relevant only when failures are defined not only in terms of actual defaults but also in terms of mergers and acquisitions. In terms of predictive accuracy, the model based on extreme value theory outperforms classical logistic regression models.
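Below is a minimal sketch of a binary regression with a generalised extreme value (GEV) link, fitted by maximum likelihood for a fixed shape parameter (in practice the shape would also be estimated). The function names, the fixed shape value, and the optimiser choice are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def gev_link(eta, xi):
    """GEV cdf applied to the linear predictor; as xi -> 0 this approaches
    the Gumbel case (log-log link)."""
    if abs(xi) < 1e-8:
        return np.exp(-np.exp(-eta))
    z = np.maximum(1.0 + xi * eta, 1e-12)          # support constraint of the GEV cdf
    return np.exp(-z ** (-1.0 / xi))

def fit_gev_regression(X, y, xi=0.25):
    """Maximum-likelihood fit of a binary regression with a GEV link, for a
    fixed (assumed) shape parameter xi. A simplified sketch of the class of
    models described in the abstract, not the authors' estimation procedure."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X)])   # add intercept
    y = np.asarray(y, dtype=float)

    def nll(beta):
        p = np.clip(gev_link(X @ beta, xi), 1e-12, 1 - 1e-12)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    return minimize(nll, np.zeros(X.shape[1]), method="Nelder-Mead").x
```

The appeal of the GEV link in this setting is that, unlike the symmetric logistic link, it can place more probability mass in the tail corresponding to the rare default event.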
Premenopausal endogenous oestrogen levels and breast cancer risk: a meta-analysis.
BACKGROUND: Many of the established risk factors for breast cancer implicate circulating hormone levels in the aetiology of the disease. Increased levels of postmenopausal endogenous oestradiol (E2) have been found to increase the risk of breast cancer, but no such association has been confirmed in premenopausal women. We carried out a meta-analysis to summarise the available evidence in women before the menopause. METHODS: We identified seven prospective studies of premenopausal endogenous E2 and breast cancer risk, including 693 breast cancer cases. From each study we extracted odds ratios of breast cancer between quantiles of endogenous E2, or for unit or s.d. increases in (log-transformed) E2, or (where odds ratios were unavailable) summary statistics for the distributions of E2 in breast cancer cases and unaffected controls. Estimates for a doubling of endogenous E2 were obtained from these extracted estimates, and random-effects meta-analysis was used to obtain a pooled estimate across the studies. RESULTS: Overall, we found weak evidence of a positive association between circulating E2 levels and the risk of breast cancer, with a doubling of E2 associated with an odds ratio of 1.10 (95% CI: 0.96, 1.27). CONCLUSION: Our findings are consistent with the hypothesis of a positive association between premenopausal endogenous E2 and breast cancer risk.
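The pooling step described above can be sketched with a standard DerSimonian-Laird random-effects model over per-study log odds ratios (here, per doubling of E2). The numbers in the usage line are placeholders, not the study data:

```python
import numpy as np

def dersimonian_laird(log_or, se):
    """Random-effects (DerSimonian-Laird) pooling of per-study log odds ratios,
    returning the pooled OR and its 95% CI. A generic sketch of the pooling
    step described in the abstract; inputs are whatever per-doubling estimates
    and standard errors were extracted from each study."""
    log_or, se = np.asarray(log_or, float), np.asarray(se, float)
    w = 1.0 / se**2                                   # fixed-effect weights
    theta_fe = np.sum(w * log_or) / np.sum(w)
    q = np.sum(w * (log_or - theta_fe) ** 2)          # Cochran's Q
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (len(log_or) - 1)) / c)      # between-study variance
    w_re = 1.0 / (se**2 + tau2)                       # random-effects weights
    theta = np.sum(w_re * log_or) / np.sum(w_re)
    se_theta = np.sqrt(1.0 / np.sum(w_re))
    ci = np.exp([theta - 1.96 * se_theta, theta + 1.96 * se_theta])
    return np.exp(theta), ci

# hypothetical per-study log ORs for a doubling of E2 and their standard errors
pooled_or, ci95 = dersimonian_laird([0.15, 0.05, 0.20, -0.02], [0.10, 0.12, 0.15, 0.11])
```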
Evaluation of machine-learning methods for ligand-based virtual screening
Machine-learning methods can be used for virtual screening by analysing the structural characteristics of molecules of known (in)activity, and we here discuss the use of kernel discrimination and naive Bayesian classifier (NBC) methods for this purpose. We report a kernel method that allows the processing of molecules represented by binary, integer and real-valued descriptors, and show that its screening performance differs little from that of a previously described kernel developed specifically for the analysis of binary fingerprint representations of molecular structure. We then evaluate the performance of an NBC when the training set contains only a very few active molecules. In such cases, a simpler approach based on group fusion would appear to provide superior screening performance, especially when structurally heterogeneous datasets are to be processed.
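A minimal sketch of the naive Bayesian classifier side of the comparison, assuming binary fingerprint descriptors and using scikit-learn's BernoulliNB; the fingerprints and labels below are random placeholders rather than real screening data:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy sketch: rank a screening library with a Bernoulli naive Bayesian
# classifier trained on binary fingerprints (label 1 = active). The
# fingerprints and labels are random placeholders, not real screening data.
rng = np.random.default_rng(0)
train_fp = rng.integers(0, 2, size=(200, 1024))     # 1024-bit fingerprints
train_y = rng.integers(0, 2, size=200)
library_fp = rng.integers(0, 2, size=(5000, 1024))

nbc = BernoulliNB(alpha=1.0).fit(train_fp, train_y)
scores = nbc.predict_log_proba(library_fp)[:, 1]    # log P(active | fingerprint)
ranking = np.argsort(-scores)                       # screen top-ranked molecules first
```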
