A bagging SVM to learn from positive and unlabeled examples
We consider the problem of learning a binary classifier from a training set
of positive and unlabeled examples, both in the inductive and in the
transductive setting. This problem, often referred to as PU learning,
differs from the standard supervised classification problem by the lack of
negative examples in the training set. It corresponds to a ubiquitous
situation in many applications such as information retrieval or gene ranking,
when we have identified a set of data of interest sharing a particular
property, and we wish to automatically retrieve additional data sharing the
same property among a large and easily available pool of unlabeled data. We
propose a conceptually simple method, akin to bagging, to approach both
inductive and transductive PU learning problems, by converting them into a series
of supervised binary classification problems discriminating the known positive
examples from random subsamples of the unlabeled set. We empirically
demonstrate the relevance of the method on simulated and real data, where it
performs at least as well as existing methods while being faster.
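The bagging idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the base learner is a nearest-centroid scorer swapped in for the SVM so that the example needs only the standard library, and the out-of-bag averaging, the sampling without replacement, and all function and parameter names (`fit_base`, `bagging_pu_scores`, `n_rounds`, `sample_size`) are assumptions made for this sketch.

```python
import random

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_base(positives, pseudo_negatives):
    """Stand-in base learner (NOT an SVM): nearest-centroid scorer."""
    c_pos, c_neg = centroid(positives), centroid(pseudo_negatives)
    # A higher score means the point is closer to the positive centroid.
    return lambda x: sq_dist(x, c_neg) - sq_dist(x, c_pos)

def bagging_pu_scores(positives, unlabeled, n_rounds=50, sample_size=None, seed=0):
    """Score the unlabeled points (transductive setting).

    Each round trains positives against a random subsample of the unlabeled
    set, then scores only the unlabeled points left out of that subsample.
    sample_size must not exceed len(unlabeled).
    """
    rng = random.Random(seed)
    k = sample_size or len(positives)
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    for _ in range(n_rounds):
        drawn = set(rng.sample(range(len(unlabeled)), k))
        score = fit_base(positives, [unlabeled[i] for i in drawn])
        for i, x in enumerate(unlabeled):
            if i not in drawn:
                totals[i] += score(x)
                counts[i] += 1
    # Average over the rounds in which each point was held out.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]
```

Ranking the unlabeled points by the returned scores gives a candidate list of additional positives. Any binary classifier that outputs a real-valued score could replace `fit_base`.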
Kernel matrix regression
We address the problem of filling missing entries in a kernel Gram matrix,
given a related full Gram matrix. We attack this problem from the viewpoint of
regression, assuming that the two kernel matrices can be considered as
explanatory variables and response variables, respectively. We propose a
variant of the regression model based on the underlying features in the
reproducing kernel Hilbert space by modifying the idea of kernel canonical
correlation analysis, and we estimate the missing entries by fitting this model
to the existing samples. We obtain promising experimental results on gene
network inference and protein 3D structure prediction from genomic datasets. We
also discuss the relationship with the em-algorithm based on information
geometry.
Joint segmentation of many aCGH profiles using fast group LARS
Array-Based Comparative Genomic Hybridization (aCGH) is a method used to
search for genomic regions with copy number variations. For a given aCGH
profile, one challenge is to accurately segment it into regions of constant
copy number. Subjects sharing the same disease status, for example a type of
cancer, often have aCGH profiles with similar copy number variations, due to
duplications and deletions relevant to that particular disease. We introduce a
constrained optimization algorithm that jointly segments aCGH profiles of many
subjects. It simultaneously penalizes the amount of freedom the set of profiles
has to jump from one level of constant copy number to another, at genomic
locations known as breakpoints. We show that breakpoints shared by many
different profiles tend to be found first by the algorithm, even in the
presence of significant amounts of noise. The algorithm can be formulated as a
group LARS problem. We propose an extremely fast way to find the solution path,
i.e., a sequence of shared breakpoints in order of importance. For no extra
cost the algorithm smooths all of the aCGH profiles into piecewise-constant
regions of equal copy number, giving low-dimensional versions of the original
data. These can be shown for all profiles on a single graph, allowing for
intuitive visual interpretation. Simulations and an implementation of the
algorithm on bladder cancer aCGH profiles are provided.
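To make the piecewise-constant output concrete, the sketch below smooths several profiles once a set of shared breakpoints is known. It is not the group LARS procedure itself: the breakpoints are assumed to be given, and each segment is replaced by its plain mean, which is a simple least-squares stand-in. The function name `smooth_profiles` and the breakpoint convention are hypothetical.

```python
def smooth_profiles(profiles, breakpoints):
    """Replace each profile by its mean on every segment between shared breakpoints.

    profiles: list of equal-length lists of floats, one per subject.
    breakpoints: sorted positions b, meaning a jump between index b - 1 and b
                 (assumed to lie strictly between 0 and the profile length).
    """
    n = len(profiles[0])
    bounds = [0] + list(breakpoints) + [n]
    smoothed = []
    for y in profiles:
        out = []
        for start, end in zip(bounds, bounds[1:]):
            # Segment mean, repeated over the whole segment.
            mean = sum(y[start:end]) / (end - start)
            out.extend([mean] * (end - start))
        smoothed.append(out)
    return smoothed
```

Because every profile uses the same segment boundaries, the smoothed profiles can be drawn on a single graph with aligned jumps.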
Graph kernels based on tree patterns for molecules
Motivated by chemical applications, we revisit and extend a family of
positive definite kernels for graphs based on the detection of common subtrees,
initially proposed by Ramon et al. (2003). We propose new kernels with a
parameter to control the complexity of the subtrees used as features to
represent the graphs. This parameter allows one to interpolate smoothly between
classical graph kernels based on the count of common walks, on the one hand,
and kernels that emphasize the detection of large common subtrees, on the other
hand. We also propose two modular extensions to this formulation. The first
extension increases the number of subtrees that define the feature space, and
the second one removes noisy features from the graph representations. We
experimentally validate these new kernels on binary classification tasks
that consist of discriminating toxic from non-toxic molecules with support
vector machines.
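The walk-count end of the interpolation mentioned above can be illustrated with a small sketch. It covers only the classical baseline of counting common labeled walks, not the tree-pattern kernels proposed in the paper. The graph encoding (a list of vertex labels plus an adjacency dictionary) and the name `walk_kernel` are assumptions for this example.

```python
def walk_kernel(labels1, adj1, labels2, adj2, length):
    """Count pairs of walks, one per graph, that have `length` edges and
    identical vertex-label sequences.

    labels*: list of vertex labels; adj*: dict mapping a vertex index to
    the list of its neighbours.
    """
    # counts[(u, v)] = number of matching walk pairs ending at u and v.
    counts = {(u, v): 1
              for u in range(len(labels1))
              for v in range(len(labels2))
              if labels1[u] == labels2[v]}
    for _ in range(length):
        nxt = {}
        for (u, v), c in counts.items():
            # Extend both walks by one edge, keeping only label-matching ends.
            for u2 in adj1[u]:
                for v2 in adj2[v]:
                    if labels1[u2] == labels2[v2]:
                        nxt[(u2, v2)] = nxt.get((u2, v2), 0) + c
        counts = nxt
    return sum(counts.values())
```

For two toy molecules, a C-C-O chain and a C-O pair, `walk_kernel(["C", "C", "O"], {0: [1], 1: [0, 2], 2: [1]}, ["C", "O"], {0: [1], 1: [0]}, 1)` counts the shared one-edge walks C-O and O-C.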
The group fused Lasso for multiple change-point detection
We present the group fused Lasso for detection of multiple change-points
shared by a set of co-occurring one-dimensional signals. Change-points are
detected by approximating the original signals with a constraint on the
multidimensional total variation, leading to piecewise-constant approximations.
Fast algorithms are proposed to solve the resulting optimization problems,
either exactly or approximately. Conditions are given for consistency of both
algorithms as the number of signals increases, and empirical evidence is
provided to support the results on simulated and array comparative genomic
hybridization data.
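A short sketch may help show why a multidimensional total variation favours change-points shared across signals. It assumes the penalty is the sum, over positions, of the Euclidean norm of the jump taken jointly across all signals. The abstract does not spell out the norm, so that choice and the name `group_total_variation` are assumptions.

```python
import math

def group_total_variation(signals):
    """Sum over positions of the Euclidean norm of the joint jump.

    signals: list of equal-length lists; signals[j][i] is signal j at position i.
    """
    n = len(signals[0])
    total = 0.0
    for i in range(n - 1):
        # Norm of the vector of jumps between positions i and i + 1.
        total += math.sqrt(sum((s[i + 1] - s[i]) ** 2 for s in signals))
    return total

# Two signals jumping at the same position: one joint jump of norm 5.
aligned = [[0.0, 0.0, 3.0, 3.0], [0.0, 0.0, 4.0, 4.0]]
# Same jump sizes at different positions: two separate jumps, 4 + 3 = 7.
shifted = [[0.0, 0.0, 3.0, 3.0], [0.0, 4.0, 4.0, 4.0]]
```

Under this penalty the aligned pair costs less than the shifted pair, so an approximation constrained on it tends to place jumps at common positions.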
Kernel methods for in silico chemogenomics
Predicting interactions between small molecules and proteins is a crucial
ingredient of the drug discovery process. In particular, accurate predictive
models are increasingly used to preselect potential lead compounds from large
molecule databases, or to screen for side-effects. While classical in silico
approaches focus on predicting interactions with a given specific target, new
chemogenomics approaches adopt cross-target views. Building on recent
developments in the use of kernel methods in bio- and chemoinformatics, we
present a systematic framework to screen the chemical space of small molecules
for interaction with the biological space of proteins. We show that this
framework allows information sharing across the targets, resulting in a
dramatic improvement of ligand prediction accuracy for three important classes
of drug targets: enzymes, GPCRs and ion channels.
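One way to let a kernel method share information across targets is to define a kernel directly on (molecule, protein) pairs. The sketch below multiplies a molecule kernel by a protein kernel. The abstract does not state how the two spaces are combined, so the product form is an assumption of this example, and `pair_kernel` and `dot` are hypothetical names with toy vector descriptors standing in for real molecule and protein kernels.

```python
def pair_kernel(k_mol, k_prot):
    """Build a kernel on (molecule, protein) pairs as a product of two base kernels."""
    def k(pair_a, pair_b):
        (mol_a, prot_a), (mol_b, prot_b) = pair_a, pair_b
        return k_mol(mol_a, mol_b) * k_prot(prot_a, prot_b)
    return k

def dot(x, y):
    """Toy linear kernel on plain feature vectors (placeholder descriptors)."""
    return sum(a * b for a, b in zip(x, y))

k = pair_kernel(dot, dot)
# Similarity between (molecule [1, 0], protein [1, 1]) and (molecule [1, 1], protein [0, 2]).
value = k(([1, 0], [1, 1]), ([1, 1], [0, 2]))
```

With such a kernel, a single classifier trained on known interacting pairs can score a molecule against a protein that has few or no known ligands, as long as that protein is similar to better-studied targets.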
Can We Rebrand the Humanities?
As someone who studied both marketing and history (and who
finds her history degree a super valuable part of that mix), the
question often crosses my mind: “How can I sell my history
degree?”
Reconstruction of biological networks by supervised machine learning approaches
We review a recent trend in computational systems biology which aims at using
pattern recognition algorithms to infer the structure of large-scale biological
networks from heterogeneous genomic data. We present several strategies that
have been proposed and that lead to different pattern recognition problems and
algorithms. The strength of these approaches is illustrated on the
reconstruction of metabolic, protein-protein and regulatory networks of model
organisms. In all cases, state-of-the-art performance is reported.
