Data Analytics with Differential Privacy
Differential privacy is the state-of-the-art definition of privacy, guaranteeing that any analysis performed on a sensitive dataset leaks only a strictly bounded amount of information about any individual whose data it contains. In this
thesis, we develop differentially private algorithms to analyze distributed and
streaming data. In the distributed model, we consider the particular problem of
learning -- in a distributed fashion -- a global model of the data, that can
subsequently be used for arbitrary analyses. We build upon PrivBayes, a
differentially private method that approximates the high-dimensional
distribution of a centralized dataset as a product of low-order distributions,
utilizing a Bayesian Network model. We examine three novel approaches to
learning a global Bayesian Network from distributed data, while offering the
differential privacy guarantee to all local datasets. Our work includes a
detailed theoretical analysis of the distributed, differentially private
entropy estimator which we use in one of our algorithms, as well as a detailed
experimental evaluation, using both synthetic and real-world data. In the
streaming model, we focus on the problem of estimating the density of a stream
of users, which expresses the fraction of all users that actually appear in the
stream. We offer one of the strongest privacy guarantees for the streaming
model, user-level pan-privacy, which ensures that the privacy of any user is
protected, even against an adversary that observes the internal state of the
algorithm. We provide a detailed analysis of an existing, sampling-based
algorithm for the problem and propose two novel modifications that
significantly improve it, both theoretically and experimentally, by optimally
using all the allocated "privacy budget."
Comment: Diploma Thesis, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 201
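For context, here is a minimal Python sketch of the classic sampling-based, user-level pan-private density estimator that this part of the thesis builds on; it is illustrative only, the thesis's modifications and its allocation of the privacy budget are not reproduced, and the Laplace output perturbation is omitted for brevity.

```python
import random

def pan_private_density(stream, universe, sample_size, epsilon, seed=0):
    """Sketch of a sampling-based, user-level pan-private density estimator.

    `universe` is a list of all possible user ids. The internal state is one
    noisy bit per sampled user, so an adversary who inspects the state learns
    little about whether any particular user appeared in the stream.
    """
    rng = random.Random(seed)
    sampled = rng.sample(universe, sample_size)
    # Initialise every bit uniformly at random: the state alone reveals nothing.
    bits = {u: rng.random() < 0.5 for u in sampled}
    p_appeared = 0.5 + epsilon / 4.0  # biased redraw for users that appear

    for user in stream:
        if user in bits:
            # Redraw (rather than set to 1) so repeated appearances look alike.
            bits[user] = rng.random() < p_appeared

    noisy_fraction = sum(bits.values()) / sample_size
    # Invert the bias: E[noisy_fraction] = 1/2 + (epsilon/4) * true_density.
    # A pan-private release would also add Laplace noise to the output here.
    return (noisy_fraction - 0.5) * 4.0 / epsilon
```

The fraction of appearing users among the sampled ones is an unbiased estimate of the density over the whole universe, which is why sampling suffices.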
Estimating the modal parameters from multiple measurement setups using a joint state space model
Computing the modal parameters of structural systems often requires processing data from multiple non-simultaneously recorded setups of sensors. These setups share some sensors in common, the so-called reference sensors, which are fixed across all measurements, while the remaining sensors change position from one setup to the next. One possibility is to process the setups separately, resulting in different modal parameter estimates for each setup; the reference sensors are then used to merge or glue the different parts of the mode shapes into global mode shapes, while the natural frequencies and damping ratios are usually averaged. In this paper, we present a new state space model that processes all setups at once. As a result, the global mode shapes are obtained automatically, and only one value for the natural frequency and damping ratio of each mode is estimated. We also investigate the estimation of this model using maximum likelihood and the Expectation-Maximization algorithm, and apply the technique to simulated and measured data corresponding to different structures.
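For intuition, here is a minimal numpy sketch (illustrative only, not the proposed joint model) of the conventional gluing step that the joint state space model makes unnecessary: each setup's mode-shape estimate is defined only up to scale, so the shared reference sensors are used to fit one least-squares scale factor per setup before assembling a global shape.

```python
import numpy as np

def glue_mode_shapes(setups):
    """Rescale per-setup mode-shape estimates so they agree on the shared
    reference sensors, then assemble a global mode shape.

    `setups` is a list of dicts with keys:
      'ref'   : shape values at the reference sensors (same order in every setup)
      'roving': shape values at that setup's roving sensors
    """
    anchor_ref = np.asarray(setups[0]['ref'], dtype=float)
    global_shape = [anchor_ref, np.asarray(setups[0]['roving'], dtype=float)]

    for setup in setups[1:]:
        ref = np.asarray(setup['ref'], dtype=float)
        # alpha minimises ||alpha * ref - anchor_ref||^2 (one scalar per setup)
        alpha = (ref @ anchor_ref) / (ref @ ref)
        global_shape.append(alpha * np.asarray(setup['roving'], dtype=float))

    return np.concatenate(global_shape)
```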
Adaptive hybrid optimization strategy for calibration and parameter estimation of physical models
A new adaptive hybrid optimization strategy, entitled squads, is proposed for
complex inverse analysis of computationally intensive physical models. The new
strategy is designed to be computationally efficient and robust in
identification of the global optimum (e.g. maximum or minimum value of an
objective function). It integrates a global Adaptive Particle Swarm
Optimization (APSO) strategy with a local Levenberg-Marquardt (LM) optimization
strategy using adaptive rules based on runtime performance. The global strategy
optimizes the location of a set of solutions (particles) in the parameter
space. The LM strategy is applied only to a subset of the particles at
different stages of the optimization based on the adaptive rules. After the LM
adjustment of the subset of particle positions, the updated particles are
returned to the APSO strategy. The advantages of coupling APSO and LM in the
manner implemented in squads are demonstrated by comparing the performance of
squads against Levenberg-Marquardt (LM), Particle Swarm Optimization
(PSO), Adaptive Particle Swarm Optimization (APSO; the TRIBES strategy), and an
existing hybrid optimization strategy (hPSO). All the strategies are tested on
2D, 5D and 10D Rosenbrock and Griewank polynomial test functions and a
synthetic hydrogeologic application to identify the source of a contaminant
plume in an aquifer. Tests are performed using a series of runs with random
initial guesses for the estimated (function/model) parameters. When both
robustness and efficiency are taken into consideration, squads outperforms the
other strategies on all test functions and on the hydrogeologic application.
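To make the coupling concrete, here is a minimal Python sketch (not the authors' squads implementation; all parameter values and adaptive rules below are illustrative) of a particle swarm whose best particles are periodically refined by a Levenberg-Marquardt solve, shown on the Rosenbrock test function.

```python
import numpy as np
from scipy.optimize import least_squares

def rosenbrock_residuals(x):
    # Rosenbrock in residual form, so an LM solver can be applied directly.
    return np.concatenate([10.0 * (x[1:] - x[:-1] ** 2), 1.0 - x[:-1]])

def rosenbrock(x):
    r = rosenbrock_residuals(x)
    return float(r @ r)

def hybrid_pso_lm(dim=5, n_particles=20, iters=200, lm_every=20, lm_subset=3, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5.0, 5.0, (n_particles, dim))      # positions
    v = np.zeros_like(x)                                 # velocities
    pbest = x.copy()
    pbest_f = np.array([rosenbrock(p) for p in x])
    gbest = pbest[np.argmin(pbest_f)].copy()

    for it in range(iters):
        # Global phase: standard PSO velocity and position update.
        r1, r2 = rng.random((2, n_particles, dim))
        v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (gbest - x)
        x = x + v
        f = np.array([rosenbrock(p) for p in x])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        gbest = pbest[np.argmin(pbest_f)].copy()

        # Local phase: periodically hand the best few particles to LM and
        # return the refined positions to the swarm via their personal bests.
        if (it + 1) % lm_every == 0:
            for i in np.argsort(pbest_f)[:lm_subset]:
                res = least_squares(rosenbrock_residuals, pbest[i], method='lm')
                if 2.0 * res.cost < pbest_f[i]:          # res.cost = 0.5 * ||r||^2
                    pbest[i], pbest_f[i] = res.x, 2.0 * res.cost
            gbest = pbest[np.argmin(pbest_f)].copy()

    return gbest, float(pbest_f.min())
```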
Frequency Estimation in Data Streams: Learning the Optimal Hashing Scheme
We present a novel approach for the problem of frequency estimation in data
streams that is based on optimization and machine learning. Contrary to
state-of-the-art streaming frequency estimation algorithms, which heavily rely
on random hashing to maintain the frequency distribution of the data stream
using limited storage, the proposed approach exploits an observed stream prefix
to near-optimally hash elements and compress the target frequency distribution.
We develop an exact mixed-integer linear optimization formulation, which
enables us to compute optimal or near-optimal hashing schemes for elements seen
in the observed stream prefix; then, we use machine learning to hash unseen
elements. Further, we develop an efficient block coordinate descent algorithm,
which, as we empirically show, produces high quality solutions, and, in a
special case, we are able to solve the proposed formulation exactly in linear
time using dynamic programming. We empirically evaluate the proposed approach
both on synthetic datasets and on real-world search query data. We show that
the proposed approach outperforms existing approaches by one to two orders of
magnitude in terms of its average (per element) estimation error and by 45-90%
in terms of its expected magnitude of estimation error.
Comment: Submitted to IEEE Transactions on Knowledge and Data Engineering on 07/2020. Revised on 05/202
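For contrast with the learned approach, here is a toy Python sketch of the underlying intuition: elements that the observed prefix marks as heavy get dedicated counters, while everything else is hashed into a small count-min sketch. The paper instead computes the hashing scheme with mixed-integer optimization and learns a rule for unseen elements; the class and parameters below are illustrative only.

```python
import hashlib
from collections import Counter

class PrefixAwareCountMin:
    """Toy frequency estimator: exact counters for prefix-identified heavy
    elements, a count-min sketch for the rest."""

    def __init__(self, prefix, n_heavy=100, width=1024, depth=4):
        self.heavy = {x: 0 for x, _ in Counter(prefix).most_common(n_heavy)}
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, x):
        # One independent-looking hash per row, derived from a keyed digest.
        for row in range(self.depth):
            h = hashlib.blake2b(f"{row}:{x}".encode(), digest_size=8).digest()
            yield row, int.from_bytes(h, "little") % self.width

    def update(self, x, count=1):
        if x in self.heavy:
            self.heavy[x] += count
        else:
            for row, col in self._buckets(x):
                self.table[row][col] += count

    def estimate(self, x):
        if x in self.heavy:
            return self.heavy[x]
        # Count-min estimate: minimum over rows upper-bounds the true count.
        return min(self.table[row][col] for row, col in self._buckets(x))
```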
The Backbone Method for Ultra-High Dimensional Sparse Machine Learning
We present the backbone method, a generic framework that enables sparse and
interpretable supervised machine learning methods to scale to ultra-high
dimensional problems. We solve sparse regression problems at this scale within
minutes to hours, as well as decision tree problems within minutes. The
proposed method operates in two phases: we first
determine the backbone set, consisting of potentially relevant features, by
solving a number of tractable subproblems; then, we solve a reduced problem,
considering only the backbone features. For the sparse regression problem, our
theoretical analysis shows that, under certain assumptions and with high
probability, the backbone set consists of the truly relevant features.
Numerical experiments on both synthetic and real-world datasets demonstrate
that our method outperforms or competes with state-of-the-art methods in
ultra-high dimensional problems, and competes with optimal solutions in
problems where exact methods scale, both in terms of recovering the truly
relevant features and in its out-of-sample predictive performance.
Comment: First submission to Machine Learning: 06/2020. Revised: 10/202
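A minimal Python sketch of the two-phase structure described above, with assumed details (the paper's subproblems, regularization choices, and final solver may differ): random feature subsets are screened with a cheap sparse fit, every feature that is ever selected joins the backbone, and a reduced problem is then solved on the backbone only.

```python
import numpy as np
from sklearn.linear_model import Lasso

def backbone_regression(X, y, n_subproblems=20, subset_size=2000,
                        screen_alpha=0.1, final_alpha=0.01, seed=0):
    """Illustrative backbone-style sparse regression."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    backbone = set()

    # Phase 1: tractable subproblems on random feature subsets.
    for _ in range(n_subproblems):
        cols = rng.choice(n_features, size=min(subset_size, n_features),
                          replace=False)
        model = Lasso(alpha=screen_alpha, max_iter=5000).fit(X[:, cols], y)
        backbone.update(cols[np.flatnonzero(model.coef_)])

    # Phase 2: solve the reduced problem on the backbone features only.
    backbone = np.sort(np.array(list(backbone)))
    final = Lasso(alpha=final_alpha, max_iter=10000).fit(X[:, backbone], y)
    return backbone, final
```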
Improving Stability in Decision Tree Models
Owing to their inherently interpretable structure, decision trees are
commonly used in applications where interpretability is essential. Recent work
has focused on improving various aspects of decision trees, including their
predictive power and robustness; however, their instability, albeit
well-documented, has been addressed to a lesser extent. In this paper, we take
a step towards the stabilization of decision tree models through the lens of
real-world health care applications due to the relevance of stability and
interpretability in this space. We introduce a new distance metric for decision
trees and use it to determine a tree's level of stability. We propose a novel
methodology to train stable decision trees and investigate the existence of
trade-offs that are inherent to decision tree models - including between
stability, predictive power, and interpretability. We demonstrate the value of
the proposed methodology through an extensive quantitative and qualitative
analysis of six case studies from real-world health care applications, and we
show that, on average, with a small 4.6% decrease in predictive power, we gain
a significant 38% improvement in the model's stability.
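As an illustration of what a distance between trees could look like, here is a short Python sketch (hypothetical; the paper defines its own metric for decision trees) that combines prediction disagreement on a reference sample with the gap between feature-importance profiles.

```python
import numpy as np

def tree_distance(tree_a, tree_b, X_ref, w_pred=0.5, w_feat=0.5):
    """Illustrative distance between two fitted sklearn decision trees.

    Combines (i) how often the trees disagree on a reference sample and
    (ii) how different their feature-importance profiles are; both terms
    lie in [0, 1], so the weighted sum does too when the weights sum to 1.
    """
    disagreement = np.mean(tree_a.predict(X_ref) != tree_b.predict(X_ref))
    feat_gap = 0.5 * np.abs(tree_a.feature_importances_ -
                            tree_b.feature_importances_).sum()
    return w_pred * disagreement + w_feat * feat_gap
```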
The Missing Link! A New Skeleton for Evolutionary Multi-agent Systems in Erlang
Evolutionary multi-agent systems (EMAS) play a critical role in many artificial intelligence applications that are in use today. In this paper, we present a new generic skeleton in Erlang for parallel EMAS computations. The skeleton enables us to capture a wide variety of concrete evolutionary computations that can exploit the same underlying parallel implementation. We demonstrate the use of our skeleton on two different evolutionary computing applications: (1) computing the minimum of the Rastrigin function; and (2) solving an urban traffic optimisation problem. We show that we can obtain very good speedups (up to 142.44× the sequential performance) on a variety of different parallel hardware, while requiring very little parallelisation effort.
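For reference, here is a compact Python (not Erlang) sketch of the single-island evolutionary loop that such a skeleton parallelises, applied to the Rastrigin benchmark mentioned above; the agent and migration layer that an EMAS skeleton would add is omitted, and all parameter values are illustrative.

```python
import numpy as np

def rastrigin(x):
    # Standard Rastrigin benchmark; global minimum 0 at the origin.
    return 10 * x.size + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x))

def evolve_island(dim=10, pop_size=50, generations=500, sigma=0.3, seed=0):
    """One 'island' of a simple truncation-selection evolutionary loop.
    An EMAS skeleton would run many such islands as concurrent agents
    and migrate individuals between them."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5.12, 5.12, (pop_size, dim))
    for _ in range(generations):
        fitness = np.array([rastrigin(ind) for ind in pop])
        parents = pop[np.argsort(fitness)[: pop_size // 2]]        # keep the best half
        children = parents + rng.normal(0.0, sigma, parents.shape) # Gaussian mutation
        pop = np.vstack([parents, children])
    fitness = np.array([rastrigin(ind) for ind in pop])
    best = pop[np.argmin(fitness)]
    return best, float(rastrigin(best))
```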
Towards Stable Machine Learning Model Retraining via Slowly Varying Sequences
We consider the task of retraining machine learning (ML) models when new
batches of data become available. Existing methods focus largely on greedy
approaches to find the best-performing model for each batch, without
considering the stability of the model's structure across retraining
iterations. In this study, we propose a methodology for finding sequences of ML
models that are stable across retraining iterations. We develop a mixed-integer
optimization formulation that is guaranteed to recover Pareto optimal models
(in terms of the predictive power-stability trade-off) and an efficient
polynomial-time algorithm that performs well in practice. We focus on retaining
consistent analytical insights - which is important to model interpretability,
ease of implementation, and fostering trust with users - by using
custom-defined distance metrics that can be directly incorporated into the
optimization problem. Our method shows stronger stability than greedily trained
models with a small, controllable sacrifice in predictive power, as evidenced
through a real-world case study in a major hospital system in Connecticut.
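A minimal Python sketch of the greedy flavour of this idea (a stand-in for the paper's exact mixed-integer formulation; the `distance` argument is any user-defined model distance, such as the tree distance sketched earlier, and the candidate set is assumed to be given):

```python
import numpy as np

def retrain_stably(candidates, prev_model, X_val, y_val, distance, lam=0.5):
    """Pick the next model in the sequence by trading off validation loss
    against distance to the previously deployed model."""
    def score(model):
        loss = np.mean(model.predict(X_val) != y_val)   # 0-1 validation loss
        return loss + lam * distance(model, prev_model)
    return min(candidates, key=score)
```

The weight `lam` controls the predictive power versus stability trade-off: `lam = 0` recovers the greedy best-performing model, while larger values keep the retrained model closer to its predecessor.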
Serum Profiles of C-Reactive Protein, Interleukin-8, and Tumor Necrosis Factor-α in Patients with Acute Pancreatitis
Background-Aims. Early prediction of the severity of acute pancreatitis would allow prompt intensive treatment, improving the outcome. The present study investigated the use of C-reactive protein (CRP), interleukin-8 (IL-8), and tumor necrosis factor-α (TNF-α) as prognosticators of the severity of the disease.
Methods. Twenty-six patients with acute pancreatitis were studied. Patients with APACHE II score of 9 or more formed the severe group, while the mild group consisted of patients with APACHE II score of less than 9. Serum samples for measurement of CRP, IL-8 and TNF-α were collected on the day of admission and additionally on the 2nd, 3rd and 7th days.
Results. Significantly higher levels of IL-8 were found in patients with severe acute pancreatitis compared to those with mild disease, especially on the 2nd and 3rd days (P = .001 and P = .014, respectively). No significant difference in CRP or TNF-α was observed between the two groups. The optimal cut-offs for IL-8 to discriminate severe from mild disease on the 2nd and 3rd days were 25.4 pg/mL and 14.5 pg/mL, respectively.
Conclusions. IL-8 in the early phase of acute pancreatitis is a superior marker to CRP and TNF-α for distinguishing patients with severe disease.
