Interpretable multiclass classification by MDL-based rule lists
Interpretable classifiers have recently witnessed an increase in attention
from the data mining community because they are inherently easier to understand
and explain than their more complex counterparts. Examples of interpretable
classification models include decision trees, rule sets, and rule lists.
Learning such models often involves optimizing hyperparameters, which typically
requires substantial amounts of data and may result in relatively large models.
In this paper, we consider the problem of learning compact yet accurate
probabilistic rule lists for multiclass classification. Specifically, we
propose a novel formalization based on probabilistic rule lists and the minimum
description length (MDL) principle. This results in virtually parameter-free
model selection that naturally trades off model complexity against
goodness of fit, effectively avoiding overfitting and the need for
hyperparameter tuning. Finally, we introduce the Classy algorithm, which
greedily finds rule lists according to the proposed criterion. We empirically
demonstrate that Classy selects small probabilistic rule lists that outperform
state-of-the-art classifiers when it comes to the combination of predictive
performance and interpretability. We show that Classy is insensitive to its
only parameter, i.e., the candidate set, and that compression on the training
set correlates with classification performance, validating our MDL-based
selection criterion.
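The trade-off between model complexity and goodness of fit described in this abstract can be illustrated with a two-part code length. The sketch below is only an illustration under stated assumptions: the fixed cost `bits_per_condition`, the Laplace-smoothed data encoding, and the names `data_bits` and `total_bits` are hypothetical and are not the encoding used by Classy.

```python
import math
from collections import Counter


def data_bits(labels, classes):
    """Bits needed to encode labels with Laplace-smoothed class frequencies."""
    counts = Counter(labels)
    n, k = len(labels), len(classes)
    return -sum(
        counts[c] * math.log2((counts[c] + 1) / (n + k))
        for c in classes
        if counts[c]
    )


def matches(rule, row):
    """A rule is a dict mapping feature name -> required value (assumed format)."""
    return all(row[f] == v for f, v in rule.items())


def total_bits(rules, rows, labels, classes, bits_per_condition=8.0):
    """Two-part code length L(model) + L(data | model) for an ordered rule list."""
    remaining = list(zip(rows, labels))
    length = 0.0
    for rule in rules:
        covered = [y for r, y in remaining if matches(rule, r)]
        remaining = [(r, y) for r, y in remaining if not matches(rule, r)]
        length += bits_per_condition * len(rule)  # model cost (assumed flat rate)
        length += data_bits(covered, classes)     # fit of this rule
    # Instances not covered by any rule fall through to the default rule.
    length += data_bits([y for _, y in remaining], classes)
    return length
```

Under this scoring, a rule is worth adding only if the bits it saves on the data exceed the bits it costs to describe, which is the sense in which no separate regularization parameter has to be tuned.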
Local Subspace-Based Outlier Detection using Global Neighbourhoods
Outlier detection in high-dimensional data is a challenging yet important
task, as it has applications in, e.g., fraud detection and quality control.
State-of-the-art density-based algorithms perform well because they 1) take the
local neighbourhoods of data points into account and 2) consider feature
subspaces. In highly complex and high-dimensional data, however, existing
methods are likely to overlook important outliers because they do not
explicitly take into account that the data is often a mixture distribution of
multiple components.
We therefore introduce GLOSS, an algorithm that performs local subspace
outlier detection using global neighbourhoods. Experiments on synthetic data
demonstrate that GLOSS more accurately detects local outliers in mixed data
than its competitors. Moreover, experiments on real-world data show that our
approach identifies relevant outliers overlooked by existing methods,
confirming that one should keep an eye on the global perspective even when
doing local outlier detection.
Comment: Short version accepted at IEEE BigData 201
Twente Optical Perfusion Camera: system overview and performance for video rate laser Doppler perfusion imaging
We present the Twente Optical Perfusion Camera (TOPCam), a novel laser Doppler perfusion imager based on CMOS technology. The tissue under investigation is illuminated and the resulting dynamic speckle pattern is recorded with a high-speed CMOS camera. Based on an overall analysis of the signal-to-noise ratio of CMOS cameras, we selected the camera that best fits our requirements. We applied a pixel-by-pixel noise correction to minimize the influence of noise in the perfusion images. We can achieve a frame rate of 0.2 fps for a perfusion image of 128×128 pixels (imaged tissue area of 7×7 cm²) if the data is analyzed online. If the data is analyzed offline, we can achieve a frame rate of 26 fps for a duration of 3.9 seconds. By reducing the image size to 128×16 pixels, this frame rate can be sustained for up to half a minute. We show the fast imaging capabilities of the system in order of increasing perfusion frame rate. First, at the fastest frame rate allowed with online analysis, we show the increase of skin perfusion after application of capsicum cream and the perfusion during an occlusion-reperfusion procedure. Then, at the highest frame rate allowed with offline analysis, we present the skin perfusion revealing the heartbeat and the perfusion during an occlusion-reperfusion procedure. Hence we have achieved video rate laser Doppler perfusion imaging.
Mining local staircase patterns in noisy data
Most traditional biclustering algorithms identify biclusters with little or no overlap. In this paper, we introduce the problem of identifying staircases of biclusters. Such staircases may be indicative of causal relationships between columns and cannot easily be identified by existing biclustering algorithms. Our formalization relies on a scoring function based on the Minimum Description Length principle. Furthermore, we propose a first algorithm for identifying staircase biclusters, based on a combination of local search and constraint programming. Experiments show that the approach is promising.
Helping Made Easy: Ease of Argument Generation Enhances Intentions to Help
Previous work has shown that self-generating arguments is more persuasive than reading arguments provided by others, particularly if self-generation feels easy. The present study replicates and extends these findings by providing evidence for fluency effects on behavioral intention in the realm of helping. In two studies, participants were instructed to either self-generate or read two versus ten arguments about why it is good to help. Subsequently, a confederate asked them for help. Results show that self-generating few arguments is more effective than self-generating many arguments. While this pattern reverses for reading arguments, easy self-generation is the most effective strategy compared to all other conditions. These results have important implications for fostering behavioral change in all areas of life.
Explainable Contextual Anomaly Detection using Quantile Regression Forests
Traditional anomaly detection methods aim to identify objects that deviate
from most other objects by treating all features equally. In contrast,
contextual anomaly detection methods aim to detect objects that deviate from
other objects within a context of similar objects by dividing the features into
contextual features and behavioral features. In this paper, we develop
connections between dependency-based traditional anomaly detection methods and
contextual anomaly detection methods. Based on resulting insights, we propose a
novel approach to inherently interpretable contextual anomaly detection that
uses Quantile Regression Forests to model dependencies between features.
Extensive experiments on various synthetic and real-world datasets demonstrate
that our method outperforms state-of-the-art anomaly detection methods in
identifying contextual anomalies in terms of accuracy and interpretability.
Comment: Manuscript submitted to Data Mining and Knowledge Discovery in
October 2022 for possible publication. This is the revised version submitted
in April 202
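The split into contextual and behavioral features described in this abstract can be made concrete with a small sketch. It does not use Quantile Regression Forests: as a deliberately simpler stand-in, it estimates conditional quartiles from the k rows whose context is nearest. The function name `contextual_score`, the single numeric context feature, and the default `k=10` are assumptions made for illustration only.

```python
import statistics


def contextual_score(context, behaviour, data, k=10):
    """Distance of `behaviour` outside the interquartile range of similar rows.

    `data` is a list of (context_value, behaviour_value) pairs. The k rows
    with the closest context value stand in for a learned conditional
    quantile model (an assumption, not the method from the abstract).
    """
    neighbours = sorted(data, key=lambda row: abs(row[0] - context))[:k]
    values = [b for _, b in neighbours]
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = (q3 - q1) or 1e-9  # guard against a zero-width interval
    if behaviour < q1:
        return (q1 - behaviour) / iqr
    if behaviour > q3:
        return (behaviour - q3) / iqr
    return 0.0  # inside the expected range for this context
```

The score is also a rudimentary explanation: it reports how far the behavioral value lies outside the range expected for objects with a similar context, so a value that is unremarkable globally can still be flagged in the wrong context.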
Truly Unordered Probabilistic Rule Sets for Multi-class Classification
Rule set learning has long been studied and has recently been frequently
revisited due to the need for interpretable models. Still, existing methods
have several shortcomings: 1) most recent methods require a binary feature
matrix as input, while learning rules directly from numeric variables is
understudied; 2) existing methods impose orders among rules, either explicitly
or implicitly, which harms interpretability; and 3) currently no method exists
for learning probabilistic rule sets for multi-class target variables (there is
only one for probabilistic rule lists).
We propose TURS, for Truly Unordered Rule Sets, which addresses these
shortcomings. We first formalize the problem of learning truly unordered rule
sets. To resolve conflicts caused by overlapping rules, i.e., instances covered
by multiple rules, we propose a novel approach that exploits the probabilistic
properties of our rule sets. We next develop a two-phase heuristic algorithm
that learns rule sets by carefully growing rules. An important innovation is
that we use a surrogate score to take the global potential of the rule set into
account when learning a local rule.
Finally, we empirically demonstrate that, compared to non-probabilistic and
(explicitly or implicitly) ordered state-of-the-art methods, our method learns
rule sets that not only have better interpretability but also better predictive
performance.
Comment: Camera-ready version for ECMLPKDD 2022, with Supplementary Material
Probabilistic Truly Unordered Rule Sets
Rule set learning has recently been frequently revisited because of its
interpretability. Existing methods have several shortcomings though. First,
most existing methods impose orders among rules, either explicitly or
implicitly, which makes the models less comprehensible. Second, due to the
difficulty of handling conflicts caused by overlaps (i.e., instances covered by
multiple rules), existing methods often do not consider probabilistic rules.
Third, learning classification rules for multi-class targets is understudied, as
most existing methods focus on binary classification or multi-class
classification via the "one-versus-rest" approach.
To address these shortcomings, we propose TURS, for Truly Unordered Rule
Sets. To resolve conflicts caused by overlapping rules, we propose a novel
model that exploits the probabilistic properties of our rule sets, with the
intuition of only allowing rules to overlap if they have similar probabilistic
outputs. We next formalize the problem of learning a TURS model based on the
MDL principle and develop a carefully designed heuristic algorithm. We
benchmark against a wide range of rule-based methods and demonstrate that our
method learns rule sets that have lower model complexity and highly competitive
predictive performance. In addition, we empirically show that rules in our
model are "independent" and hence truly unordered.
Comment: Submitted to JML
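The intuition of allowing rules to overlap only when their probabilistic outputs are similar can be sketched as follows. The abstract formalizes this through the MDL principle; the Jensen-Shannon divergence, the `tolerance` threshold, and the function names below are hypothetical stand-ins chosen for illustration, not the TURS criterion.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (in bits) between two class distributions."""

    def kl(a, b):
        # Terms with zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def may_overlap(p, q, tolerance=0.05):
    """Permit overlap only if the two rules predict similar class probabilities."""
    return js_divergence(p, q) <= tolerance
```

When two overlapping rules predict nearly the same class distribution, an instance covered by both receives essentially the same prediction whichever rule is consulted, so no ordering among the rules is needed to resolve the conflict.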
