The analysis and advanced extensions of canonical correlation analysis
Drug discovery is the process of identifying compounds with potentially meaningful biological activity. A problem that arises is that the number of compounds to search over can be quite large, sometimes numbering in the millions, making exhaustive experimental testing intractable. For this reason, computational methods are employed to filter out those compounds that do not exhibit strong biological activity. This filtering step, also called virtual screening, reduces the search space, allowing the remaining compounds to be tested experimentally. In this dissertation I provide an approach to the virtual screening problem based on Canonical Correlation Analysis (CCA), along with several extensions that use kernel and spectral learning ideas. Specifically, these methods are applied to the protein-ligand matching problem. Additionally, theoretical results analyzing the behavior of CCA in the High Dimension Low Sample Size (HDLSS) setting are provided.
Local kernel canonical correlation analysis with application to virtual drug screening
Drug discovery is the process of identifying compounds with potentially meaningful biological activity. A major challenge is that the number of compounds to search over can be quite large, sometimes numbering in the millions, making exhaustive experimental testing intractable. For this reason, computational methods are employed to filter out those compounds that do not exhibit strong biological activity. This filtering step, also called virtual screening, reduces the search space, allowing the remaining compounds to be tested experimentally.
On-the-fly Autonomous Control of Neutron Diffraction via Physics-Informed Bayesian Active Learning
Neutron scattering is a unique and versatile characterization technique for probing the magnetic structure and dynamics of materials. However, the number of instruments at neutron scattering facilities worldwide is limited, and those facilities are perennially oversubscribed. We demonstrate a significant reduction in the experimental time required for neutron diffraction experiments by implementing autonomous navigation of the measurement parameter space through machine learning. Prior scientific knowledge and Bayesian active learning are used to dynamically steer the sequence of measurements. We developed the autonomous neutron diffraction explorer (ANDiE) and used it to determine the magnetic order of MnO and Fe1.09Te. ANDiE can determine the Néel temperature of these materials with a 5-fold enhancement in efficiency and correctly identify the transition dynamics via physics-informed Bayesian inference. ANDiE's active learning approach is broadly applicable to a variety of neutron-based experiments and can open the door to neutron scattering as a tool for accelerated materials discovery.
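The measurement-steering loop described above can be sketched generically: fit a surrogate model to the measurements taken so far, then choose the next measurement where the model is least certain. This is a minimal illustration with a Gaussian process surrogate and a hypothetical order-parameter curve; the true ANDiE system additionally encodes physical priors on the transition, which are omitted here.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)

def intensity(T, T_N=120.0):
    # Hypothetical ground truth: magnetic Bragg peak intensity
    # vanishing at an assumed transition temperature T_N.
    return np.clip(1.0 - T / T_N, 0.0, None) ** 0.7

T_grid = np.linspace(5.0, 200.0, 400)      # candidate temperatures (K)
measured_T = list(rng.uniform(5.0, 200.0, size=3))  # initial measurements

for _ in range(10):
    X = np.array(measured_T).reshape(-1, 1)
    y = intensity(X.ravel()) + 0.01 * rng.normal(size=len(measured_T))
    # Fixed kernel (optimizer=None) keeps the sketch transparent.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=30.0),
                                  alpha=1e-4, optimizer=None)
    gp.fit(X, y)
    mu, sigma = gp.predict(T_grid.reshape(-1, 1), return_std=True)
    # Acquisition rule: measure next where predictive uncertainty is largest.
    measured_T.append(T_grid[np.argmax(sigma)])
```

Because uncertainty concentrates where the curve changes fastest, this kind of loop tends to cluster measurements near the transition, which is the source of the reported efficiency gain.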
The Fast RODEO for Local Polynomial Regression
An open challenge in nonparametric regression is finding fast, computationally efficient approaches to estimating local bandwidths for large data sets, particularly in two or more dimensions. In the work presented here we introduce a novel local bandwidth estimation procedure for local polynomial regression which combines the greedy search of the RODEO algorithm with linear binning. The result is a fast, computationally efficient algorithm we refer to as the fast RODEO. We motivate the development of our algorithm by using a novel scale-space approach to derive the RODEO. We conclude with a toy example and a real-world example using data from the Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observation (CALIPSO) satellite validation study, where we show the fast RODEO's improvement in accuracy and computational speed over two other standard approaches.
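Linear binning, the speed-up ingredient named above, replaces each data point with fractional mass on its two flanking grid points, so subsequent kernel computations scale with the grid size rather than the sample size. A minimal one-dimensional sketch (the function name and grid parameters are illustrative, not from the paper):

```python
import numpy as np

def linear_binning(x, grid_min, grid_max, n_bins):
    """Standard linear binning: each point's unit mass is split between
    its two flanking grid points in proportion to proximity."""
    grid = np.linspace(grid_min, grid_max, n_bins)
    delta = grid[1] - grid[0]
    pos = (np.clip(x, grid_min, grid_max) - grid_min) / delta
    lo = np.minimum(pos.astype(int), n_bins - 2)   # left neighbor index
    w_hi = pos - lo                                # weight for right neighbor
    counts = np.zeros(n_bins)
    np.add.at(counts, lo, 1.0 - w_hi)
    np.add.at(counts, lo + 1, w_hi)
    return grid, counts

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=10_000)
grid, counts = linear_binning(x, -4.0, 4.0, 81)
```

The RODEO's greedy bandwidth search can then be run over these binned counts instead of the raw observations, which is where the "fast" in fast RODEO comes from.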
Using Replicates in Information Retrieval Evaluation
This article explores a method for more accurately estimating the main effect of the system in a typical test-collection-based evaluation of information retrieval systems, thus increasing the sensitivity of system comparisons. Randomly partitioning the test document collection allows for multiple tests of a given system and topic (replicates). Bootstrap ANOVA can use these replicates to extract system-topic interactions, something not possible without replicates, yielding a more precise value for the system effect and a narrower confidence interval around that value. Experiments using multiple TREC collections demonstrate that removing the topic-system interactions substantially reduces the confidence intervals around the system effect and increases the number of significant pairwise differences found. Further, the method is robust against small changes in the number of partitions used, against variability in the documents that constitute the partitions, and against the choice of effectiveness measure used to quantify system performance.
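The key point, that replicates make the system-topic interaction estimable, can be shown with a small simulation. The sizes and effect magnitudes below are hypothetical, not taken from the TREC experiments; the sketch only demonstrates why a replicated two-way layout lets the interaction be separated from noise.

```python
import numpy as np

rng = np.random.default_rng(2)
n_sys, n_topic, n_rep = 4, 25, 5  # hypothetical: systems, topics, partitions

# Simulated effectiveness scores: grand mean + system effect + topic effect
# + system-topic interaction + per-replicate noise.
sys_eff = rng.normal(0.0, 0.05, n_sys)
topic_eff = rng.normal(0.0, 0.15, n_topic)
interact = rng.normal(0.0, 0.05, (n_sys, n_topic))
noise = rng.normal(0.0, 0.03, (n_sys, n_topic, n_rep))
scores = (0.5 + sys_eff[:, None, None] + topic_eff[None, :, None]
          + interact[:, :, None] + noise)

# With replicates, the interaction is estimable as the double-centered
# cell means; without replicates it is confounded with the error term.
grand = scores.mean()
sys_hat = scores.mean(axis=(1, 2)) - grand
topic_hat = scores.mean(axis=(0, 2)) - grand
cell = scores.mean(axis=2)
interact_hat = cell - grand - sys_hat[:, None] - topic_hat[None, :]
```

Removing the estimated interaction from the error term is what shrinks the confidence interval on the system effect in the article's ANOVA.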
