High Dimensional Data Enrichment: Interpretable, Fast, and Data-Efficient
The high-dimensional structured data-enriched model describes groups of observations by shared and per-group individual parameters, each with its own structure such as sparsity or group sparsity. In this paper, we consider the general form of data enrichment where data comes in a fixed but arbitrary number of groups G. Any convex function, e.g., a norm, can characterize the structure of both shared and individual parameters. We propose an estimator for the high-dimensional data-enriched model and provide conditions under which it consistently estimates both shared and individual parameters. We also delineate the sample complexity of the estimator and present a high-probability non-asymptotic bound on the estimation error of all parameters. Interestingly, the sample complexity of our estimator translates to conditions on both the per-group sample sizes and the total number of samples. We propose an iterative estimation algorithm with a linear convergence rate and supplement our theoretical analysis with synthetic and real experimental results. In particular, we show the predictive power of the data-enriched model, along with its interpretable results, in anticancer drug sensitivity analysis.
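As a rough illustration of the setup (not the authors' estimator), the sketch below fits a data-enriched linear model in which each group's responses depend on the sum of a shared and a per-group parameter, with an L1 penalty standing in for the generic convex structure; all names, step sizes, and penalty values are illustrative assumptions.

```python
# Minimal sketch: alternating proximal-gradient updates for a data-enriched
# linear model y_g = X_g @ (beta_shared + beta_g) + noise, with L1 penalties
# as a stand-in for the paper's generic convex structure.
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_data_enriched(X_groups, y_groups, lam_shared, lam_group,
                      step=1e-3, n_iter=500):
    p = X_groups[0].shape[1]
    G = len(X_groups)
    beta0 = np.zeros(p)                      # shared parameter
    betas = [np.zeros(p) for _ in range(G)]  # per-group individual parameters

    for _ in range(n_iter):
        # The shared parameter is updated using all groups' samples.
        grad0 = sum(X.T @ (X @ (beta0 + b) - y)
                    for X, y, b in zip(X_groups, y_groups, betas))
        beta0 = soft_threshold(beta0 - step * grad0, step * lam_shared)
        # Each individual parameter only sees its own group's samples.
        for g, (X, y) in enumerate(zip(X_groups, y_groups)):
            grad_g = X.T @ (X @ (beta0 + betas[g]) - y)
            betas[g] = soft_threshold(betas[g] - step * grad_g,
                                      step * lam_group)
    return beta0, betas

# Illustrative usage with synthetic data.
rng = np.random.default_rng(0)
Xs = [rng.normal(size=(50, 20)) for _ in range(3)]
ys = [X @ rng.normal(size=20) for X in Xs]
b0, bs = fit_data_enriched(Xs, ys, lam_shared=0.1, lam_group=0.1)
print(np.count_nonzero(b0), [np.count_nonzero(b) for b in bs])
```

Note how the shared update aggregates over all groups while each individual update uses only its own group, which mirrors why the sample complexity involves both per-group sizes and the total sample size.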
How Accurate Are Blood (or Breath) Tests for Identifying Self-Reported Heavy Drinking Among People with Alcohol Dependence?
AIMS: Managing patients with alcohol dependence includes assessment for heavy drinking, typically by asking patients. Some recommend biomarkers to detect heavy drinking but evidence of accuracy is limited.
METHODS: Among people with dependence, we assessed the performance of disialo-carbohydrate-deficient transferrin (%dCDT, ≥1.7%), gamma-glutamyltransferase (GGT, ≥66 U/l), either %dCDT or GGT positive, and breath alcohol (> 0) for identifying 3 self-reported heavy drinking levels: any heavy drinking (≥4 drinks/day or >7 drinks/week for women, ≥5 drinks/day or >14 drinks/week for men), recurrent (≥5 drinks/day on ≥5 days) and persistent heavy drinking (≥5 drinks/day on ≥7 consecutive days). Subjects (n = 402) with dependence and current heavy drinking were referred to primary care and assessed 6 months later with biomarkers and validated self-reported calendar method assessment of past 30-day alcohol use.
RESULTS: The self-reported prevalence of any, recurrent and persistent heavy drinking was 54, 34 and 17%. Sensitivity of %dCDT for detecting any, recurrent and persistent self-reported heavy drinking was 41, 53 and 66%. Specificity was 96, 90 and 84%, respectively. %dCDT had higher sensitivity than GGT and breath test for each alcohol use level but was not adequately sensitive to detect heavy drinking (missing 34-59% of the cases). Either %dCDT or GGT positive improved sensitivity but not to satisfactory levels, and specificity decreased. Neither a breath test nor GGT was sufficiently sensitive (both tests missed 70-80% of cases).
CONCLUSIONS: Although biomarkers may provide some useful information, their sensitivity is low, and their incremental value over self-report in clinical settings is questionable.
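For readers less familiar with the metrics quoted above, the snippet below shows how sensitivity and specificity are computed from confusion-matrix counts; the counts are hypothetical and only chosen to mirror the reported 41%/96% figures for %dCDT.

```python
# Illustrative arithmetic only: sensitivity and specificity from counts.
def sensitivity_specificity(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)   # fraction of true heavy drinkers detected
    specificity = tn / (tn + fp)   # fraction of non-heavy drinkers correctly negative
    return sensitivity, specificity

# Hypothetical counts mirroring a 41% sensitivity / 96% specificity test.
print(sensitivity_specificity(tp=41, fn=59, tn=96, fp=4))  # (0.41, 0.96)
```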
Multi-Step Processing of Spatial Joins
Spatial joins are one of the most important operations for combining spatial objects of several relations. In this paper, spatial join processing is studied in detail for extended spatial objects in two-dimensional data space. We present an approach for spatial join processing that is based on three steps. First, a spatial join is performed on the minimum bounding rectangles of the objects, returning a set of candidates. Various approaches for accelerating this step of join processing were examined at last year's conference [BKS 93a]. In this paper, we focus on the problem of how to compute the answers from the set of candidates, which is handled by the following two steps. First of all, sophisticated approximations are used to identify answers as well as to filter out false hits from the set of candidates. For this purpose, we investigate various types of conservative and progressive approximations. In the last step, the exact geometry of the remaining candidates has to be tested against the join predicate. The time required for computing spatial join predicates can be substantially reduced when objects are adequately organized in main memory. In our approach, objects are first decomposed into simple components which are exclusively organized by a main-memory resident spatial data structure. Overall, we present a complete approach for spatial join processing on complex spatial objects. The performance of the individual steps of our approach is evaluated with data sets from real cartographic applications. The results show that our approach reduces the total execution time of the spatial join by significant factors.
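The sketch below illustrates the three-step filter-and-refine idea with circles as stand-in spatial objects (the paper targets general polygonal objects, index-based MBR joins, and decomposed exact tests); all geometry and thresholds here are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of filter-and-refine spatial join processing:
# step 1: MBR filter, step 2: progressive (contained) approximations
# identify sure answers, step 3: exact geometry test on the rest.
from math import hypot

def mbr(c):                      # minimum bounding rectangle of circle (x, y, r)
    x, y, r = c
    return (x - r, y - r, x + r, y + r)

def inner_box(c):                # progressive approximation: box inside the circle
    x, y, r = c
    s = r / 2**0.5
    return (x - s, y - s, x + s, y + s)

def boxes_intersect(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def circles_intersect(a, b):     # exact geometry test
    return hypot(a[0] - b[0], a[1] - b[1]) <= a[2] + b[2]

def spatial_join(R, S):
    results = []
    for r in R:
        for s in S:
            if not boxes_intersect(mbr(r), mbr(s)):   # step 1: MBR filter
                continue                               # (an index join in practice)
            if boxes_intersect(inner_box(r), inner_box(s)):
                results.append((r, s))                 # step 2: identified answer
            elif circles_intersect(r, s):              # step 3: exact test
                results.append((r, s))
    return results

print(spatial_join([(0, 0, 1.0)], [(1.5, 0, 1.0), (3.0, 0, 1.0)]))
```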
Querying Probabilistic Neighborhoods in Spatial Data Sets Efficiently
In this paper we define the notion of a probabilistic neighborhood in spatial data: let a set P of points in R^d, a query point q, a distance metric \dist, and a monotonically decreasing function f mapping distances to probabilities be given. Then a point p in P belongs to the probabilistic neighborhood of q with respect to f with probability f(\dist(p,q)). We envision applications in facility location, sensor networks, and other scenarios where a connection between two entities becomes less likely with increasing distance. A straightforward query algorithm would determine a probabilistic neighborhood in time linear in |P| by probing each point in P. To answer the query in sublinear time for the planar case, we augment a quadtree suitably and design a corresponding query algorithm. Our theoretical analysis shows that, for certain distributions of planar P, our algorithm answers a query in sublinear time with high probability (whp). This matches, up to a logarithmic factor, the cost induced by quadtree-based algorithms for deterministic queries and is asymptotically faster than the straightforward approach whenever the expected neighborhood is sufficiently small relative to |P|.
As practical proofs of concept we use two applications, one in the Euclidean
and one in the hyperbolic plane. In particular, our results yield the first
generator for random hyperbolic graphs with arbitrary temperatures in
subquadratic time. Moreover, our experimental data show the usefulness of our
algorithm even if the point distribution is unknown or not uniform: The running
time savings over the pairwise probing approach constitute at least one order
of magnitude already for a modest number of points and queries.
Comment: The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-44543-4_3
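The baseline that the quadtree-based method improves on is easy to state in code. The sketch below implements the straightforward linear-time probabilistic neighborhood query; the Euclidean metric, the decay function f, and the point set are illustrative assumptions.

```python
# Naive probabilistic neighborhood query: probe every point in P, so the
# cost is linear in |P| per query (the quantity the paper reduces to
# sublinear for planar data via an augmented quadtree).
import math, random

def probabilistic_neighborhood(P, q, f, dist=math.dist, rng=random.random):
    """Each p in P joins the neighborhood of q independently with
    probability f(dist(p, q))."""
    return [p for p in P if rng() < f(dist(p, q))]

points = [(random.random(), random.random()) for _ in range(1000)]
neighbors = probabilistic_neighborhood(points, q=(0.5, 0.5),
                                       f=lambda d: math.exp(-4 * d))
print(len(neighbors))
```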
Potential Role of Ultrafine Particles in Associations between Airborne Particle Mass and Cardiovascular Health
Numerous epidemiologic time-series studies have shown generally consistent associations of cardiovascular hospital admissions and mortality with outdoor air pollution, particularly mass concentrations of particulate matter (PM) ≤2.5 or ≤10 μm in diameter (PM(2.5), PM(10)). Panel studies with repeated measures have supported the time-series results showing associations between PM and risk of cardiac ischemia and arrhythmias, increased blood pressure, decreased heart rate variability, and increased circulating markers of inflammation and thrombosis. The causal components driving the PM associations remain to be identified. Epidemiologic data using pollutant gases and particle characteristics such as particle number concentration and elemental carbon have provided indirect evidence that products of fossil fuel combustion are important. Ultrafine particles < 0.1 μm (UFPs) dominate particle number concentrations and surface area and are therefore capable of carrying large concentrations of adsorbed or condensed toxic air pollutants. It is likely that redox-active components in UFPs from fossil fuel combustion reach cardiovascular target sites. High UFP exposures may lead to systemic inflammation through oxidative stress responses to reactive oxygen species and thereby promote the progression of atherosclerosis and precipitate acute cardiovascular responses ranging from increased blood pressure to myocardial infarction. The next steps in epidemiologic research are to identify more clearly the putative PM causal components and size fractions linked to their sources. To advance this, we discuss in a companion article (Sioutas C, Delfino RJ, Singh M. 2005. Environ Health Perspect 113:947–955) the need for and methods of UFP exposure assessment.
Sampling-based Algorithms for Optimal Motion Planning
During the last decade, sampling-based path planning algorithms, such as
Probabilistic RoadMaps (PRM) and Rapidly-exploring Random Trees (RRT), have
been shown to work well in practice and possess theoretical guarantees such as
probabilistic completeness. However, little effort has been devoted to the
formal analysis of the quality of the solution returned by such algorithms,
e.g., as a function of the number of samples. The purpose of this paper is to
fill this gap, by rigorously analyzing the asymptotic behavior of the cost of
the solution returned by stochastic sampling-based algorithms as the number of
samples increases. A number of negative results are provided, characterizing
existing algorithms, e.g., showing that, under mild technical conditions, the
cost of the solution returned by broadly used sampling-based algorithms
converges almost surely to a non-optimal value. The main contribution of the
paper is the introduction of new algorithms, namely, PRM* and RRT*, which are
provably asymptotically optimal, i.e., such that the cost of the returned
solution converges almost surely to the optimum. Moreover, it is shown that the
computational complexity of the new algorithms is within a constant factor of
that of their probabilistically complete (but not asymptotically optimal)
counterparts. The analysis in this paper hinges on novel connections between
stochastic sampling-based path planning algorithms and the theory of random
geometric graphs.
Comment: 76 pages, 26 figures, to appear in International Journal of Robotics Research.
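The sketch below is a compact, obstacle-free illustration of the RRT* choose-parent and rewire steps described above; it is not the paper's full algorithm, which additionally performs collision checking and uses a shrinking connection radius, and all parameters here are illustrative.

```python
# Minimal RRT*-style sketch in the unit square: steer toward random samples,
# attach each new node to the lowest-cost nearby parent, then rewire nearby
# nodes through the new node when that is cheaper. Descendant costs are not
# propagated after a rewire in this short sketch.
import math, random

def rrt_star(start, goal, n_samples=500, step=0.1, radius=0.3):
    nodes = [start]
    parent = {start: None}
    cost = {start: 0.0}

    for _ in range(n_samples):
        x_rand = (random.random(), random.random())
        x_nearest = min(nodes, key=lambda v: math.dist(v, x_rand))
        d = math.dist(x_nearest, x_rand)
        t = min(1.0, step / d) if d > 0 else 0.0
        x_new = (x_nearest[0] + t * (x_rand[0] - x_nearest[0]),
                 x_nearest[1] + t * (x_rand[1] - x_nearest[1]))
        near = [v for v in nodes if math.dist(v, x_new) <= radius]
        # Choose the parent that minimizes the cost of reaching x_new.
        best = min(near or [x_nearest],
                   key=lambda v: cost[v] + math.dist(v, x_new))
        nodes.append(x_new)
        parent[x_new] = best
        cost[x_new] = cost[best] + math.dist(best, x_new)
        # Rewire: reroute nearby nodes through x_new if that is cheaper.
        for v in near:
            if cost[x_new] + math.dist(x_new, v) < cost[v]:
                parent[v] = x_new
                cost[v] = cost[x_new] + math.dist(x_new, v)

    best_goal = min(nodes, key=lambda v: cost[v] + math.dist(v, goal))
    return cost[best_goal] + math.dist(best_goal, goal)

print(rrt_star(start=(0.0, 0.0), goal=(1.0, 1.0)))
```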
Efficient Representation of Multidimensional Data over Hierarchical Domains
The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-46049-9_19
We consider the problem of representing multidimensional data where the domain of each dimension is organized hierarchically, and the queries require summary information at a different node in the hierarchy of each dimension. This is the typical case of OLAP databases. A basic approach is to represent each hierarchy as a one-dimensional line and recast the queries as multidimensional range queries. This approach can be implemented compactly by generalizing to more dimensions the k^2-treap, a compact representation of two-dimensional points that allows for efficient summarization queries along generic ranges. Instead, we propose a more flexible generalization which, rather than using a generic quadtree-like partition of the space, follows the domain hierarchies across each dimension to organize the partitioning. The resulting structure is much more efficient than a generic multidimensional structure, since queries are resolved by aggregating far fewer nodes of the tree.
Funding: Ministerio de Economía, Industria y Competitividad (TIN2013-46238-C4-3-R, IDI-20141259, ITC-20151305); Ministerio de Economía y Competitividad (ITC-20151247); Xunta de Galicia (GRC2013/053); Chile, Fondo Nacional de Desarrollo Científico y Tecnológico (1-140796); COST (IC130)
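The sketch below illustrates the basic recasting idea mentioned above: each hierarchy node maps to the contiguous range of leaves below it, so an OLAP-style summary query becomes a multidimensional range-sum. The hierarchies, the dense array standing in for a compact k^2-treap-like structure, and the use of sum as the aggregate are all illustrative assumptions.

```python
# Minimal sketch: hierarchy nodes -> leaf ranges, summary query -> range-sum.
import numpy as np

# Illustrative hierarchies: node -> (first_leaf, last_leaf); leaves are laid
# out left to right so every internal node covers a contiguous leaf interval.
time_nodes = {"2023": (0, 11), "2023-Q1": (0, 2), "2023-01": (0, 0)}
geo_nodes = {"Europe": (0, 4), "Spain": (0, 1), "A Coruña": (0, 0)}

cube = np.arange(12 * 5).reshape(12, 5)   # fact table: 12 months x 5 regions

def summarize(time_node, geo_node):
    """Answer a summary query at one node per dimension as a range-sum."""
    t0, t1 = time_nodes[time_node]
    g0, g1 = geo_nodes[geo_node]
    return int(cube[t0:t1 + 1, g0:g1 + 1].sum())

print(summarize("2023-Q1", "Spain"))   # aggregate over Q1 x Spain's leaves
```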
