
    High Dimensional Data Enrichment: Interpretable, Fast, and Data-Efficient

    High dimensional data-enriched models describe groups of observations by shared and per-group individual parameters, each with its own structure, such as sparsity or group sparsity. In this paper, we consider the general form of data enrichment where data comes in a fixed but arbitrary number of groups G. Any convex function, e.g., a norm, can characterize the structure of both shared and individual parameters. We propose an estimator for the high dimensional data-enriched model and provide conditions under which it consistently estimates both shared and individual parameters. We also delineate the sample complexity of the estimator and present a high probability non-asymptotic bound on the estimation error of all parameters. Interestingly, the sample complexity of our estimator translates to conditions on both per-group sample sizes and the total number of samples. We propose an iterative estimation algorithm with a linear convergence rate and supplement our theoretical analysis with synthetic and real experimental results. In particular, we show the predictive power of the data-enriched model along with its interpretable results in anticancer drug sensitivity analysis.
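
    The abstract does not spell out the estimator, but the setting it describes can be illustrated with a minimal sketch. The sketch below assumes a linear data-enrichment model y_g = X_g(beta + theta_g) + noise with l1 (sparsity) structure on all parameters, fit by alternating proximal-gradient steps; the model form, function names, and step sizes are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the l1 norm (the sparsity-inducing structure)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_data_enriched(Xs, ys, lam_shared=0.1, lam_indiv=0.1, lr=1e-3, iters=2000):
    """Alternating proximal-gradient sketch for y_g = X_g (beta + theta_g) + noise.

    Assumed model, not the paper's exact estimator: beta is shared across
    all G groups, theta_g is individual to group g, both l1-regularized.
    """
    p = Xs[0].shape[1]
    G = len(Xs)
    beta = np.zeros(p)                        # shared parameter
    thetas = [np.zeros(p) for _ in range(G)]  # per-group individual parameters
    for _ in range(iters):
        # Gradient step on the shared parameter pools residuals from all groups.
        grad_b = sum(X.T @ (X @ (beta + th) - y)
                     for X, y, th in zip(Xs, ys, thetas))
        beta = soft_threshold(beta - lr * grad_b, lr * lam_shared)
        # Gradient step on each group's individual parameter.
        for g in range(G):
            r = Xs[g] @ (beta + thetas[g]) - ys[g]
            thetas[g] = soft_threshold(thetas[g] - lr * Xs[g].T @ r, lr * lam_indiv)
    return beta, thetas

# Toy usage: two groups sharing a common sparse signal.
rng = np.random.default_rng(0)
Xs = [rng.standard_normal((100, 20)) for _ in range(2)]
beta_true = np.zeros(20); beta_true[:3] = 1.0
ys = [X @ beta_true + 0.1 * rng.standard_normal(100) for X in Xs]
beta_hat, thetas_hat = fit_data_enriched(Xs, ys)
```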

    How Accurate Are Blood (or Breath) Tests for Identifying Self-Reported Heavy Drinking Among People with Alcohol Dependence?

    AIMS: Managing patients with alcohol dependence includes assessment for heavy drinking, typically by asking patients. Some recommend biomarkers to detect heavy drinking, but evidence of their accuracy is limited. METHODS: Among people with dependence, we assessed the performance of disialo-carbohydrate-deficient transferrin (%dCDT, ≥1.7%), gamma-glutamyltransferase (GGT, ≥66 U/l), either %dCDT or GGT positive, and breath alcohol (>0) for identifying 3 self-reported heavy drinking levels: any heavy drinking (≥4 drinks/day or >7 drinks/week for women, ≥5 drinks/day or >14 drinks/week for men), recurrent heavy drinking (≥5 drinks/day on ≥5 days) and persistent heavy drinking (≥5 drinks/day on ≥7 consecutive days). Subjects (n = 402) with dependence and current heavy drinking were referred to primary care and assessed 6 months later with biomarkers and a validated self-reported calendar-method assessment of past 30-day alcohol use. RESULTS: The self-reported prevalence of any, recurrent and persistent heavy drinking was 54, 34 and 17%, respectively. Sensitivity of %dCDT for detecting any, recurrent and persistent self-reported heavy drinking was 41, 53 and 66%; specificity was 96, 90 and 84%, respectively. %dCDT had higher sensitivity than GGT and the breath test at each alcohol use level but was not adequately sensitive to detect heavy drinking (missing 34-59% of cases). Either %dCDT or GGT positive improved sensitivity, but not to satisfactory levels, and specificity decreased. Neither a breath test nor GGT was sufficiently sensitive (both tests missed 70-80% of cases). CONCLUSIONS: Although biomarkers may provide some useful information, their sensitivity is low, and their incremental value over self-report in clinical settings is questionable.
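
    The sensitivity and specificity figures above follow from the standard definitions against the self-report reference standard. A minimal sketch of that computation; the counts in the usage example are illustrative, not data from the study.

```python
def sensitivity_specificity(test_positive, condition_present):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP).

    Parallel lists of booleans; self-reported heavy drinking is the
    reference standard, the biomarker result is the test.
    """
    tp = sum(t and c for t, c in zip(test_positive, condition_present))
    fn = sum((not t) and c for t, c in zip(test_positive, condition_present))
    tn = sum((not t) and (not c) for t, c in zip(test_positive, condition_present))
    fp = sum(t and (not c) for t, c in zip(test_positive, condition_present))
    return tp / (tp + fn), tn / (tn + fp)

# Illustrative example: a biomarker that misses half the true heavy drinkers
# (low sensitivity) while never flagging non-heavy drinkers (high specificity).
dcdt_positive  = [True, False, False, True, False, False, False, False]
heavy_drinking = [True, True,  True,  True, False, False, False, False]
sens, spec = sensitivity_specificity(dcdt_positive, heavy_drinking)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # 0.50, 1.00
```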

    Multi-Step Processing of Spatial Joins

    Spatial joins are one of the most important operations for combining spatial objects of several relations. In this paper, spatial join processing is studied in detail for extended spatial objects in two-dimensional data space. We present an approach for spatial join processing that is based on three steps. First, a spatial join is performed on the minimum bounding rectangles of the objects, returning a set of candidates. Various approaches for accelerating this step of join processing were examined at last year's conference [BKS 93a]. In this paper, we focus on the problem of how to compute the answers from the set of candidates, which is handled by the following two steps. First, sophisticated approximations are used to identify answers as well as to filter out false hits from the set of candidates. For this purpose, we investigate various types of conservative and progressive approximations. In the last step, the exact geometry of the remaining candidates has to be tested against the join predicate. The time required for computing spatial join predicates can be reduced substantially when objects are adequately organized in main memory. In our approach, objects are first decomposed into simple components which are exclusively organized by a main-memory resident spatial data structure. Overall, we present a complete approach to spatial join processing on complex spatial objects. The performance of the individual steps of our approach is evaluated with data sets from real cartographic applications. The results show that our approach reduces the total execution time of the spatial join by significant factors.
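
    A minimal sketch of the filter-and-refine structure the abstract describes: an MBR-overlap filter produces candidates, and an exact geometric test decides the remaining pairs (the paper's middle step, conservative and progressive approximations, is collapsed into a comment here). The polygon representation and the exact predicate are illustrative stand-ins, not the paper's implementation.

```python
from itertools import product

def mbr(poly):
    """Minimum bounding rectangle of a polygon given as [(x, y), ...]."""
    xs, ys = zip(*poly)
    return (min(xs), min(ys), max(xs), max(ys))

def mbrs_overlap(a, b):
    """Step 1 filter: do two MBRs (xmin, ymin, xmax, ymax) intersect?"""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def spatial_join(R, S, exact_predicate):
    """Filter-and-refine spatial join over two sets of polygons."""
    # Step 1: MBR join yields candidate pairs (cheap, conservative).
    candidates = [(r, s) for r, s in product(R, S)
                  if mbrs_overlap(mbr(r), mbr(s))]
    # Steps 2-3 (collapsed here): the paper interposes conservative and
    # progressive approximations to settle most pairs before the final,
    # expensive test of the exact geometry against the join predicate.
    return [(r, s) for r, s in candidates if exact_predicate(r, s)]

# Usage with an illustrative exact predicate (here: sharing a vertex).
R = [[(0, 0), (1, 0), (1, 1)], [(5, 5), (6, 5), (6, 6)]]
S = [[(1, 1), (2, 1), (2, 2)]]
shares_vertex = lambda r, s: bool(set(r) & set(s))
print(spatial_join(R, S, shares_vertex))  # one pair: triangles meeting at (1, 1)
```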

    Querying Probabilistic Neighborhoods in Spatial Data Sets Efficiently

    In this paper we define the notion of a probabilistic neighborhood in spatial data: let a set $P$ of $n$ points in $\mathbb{R}^d$, a query point $q \in \mathbb{R}^d$, a distance metric $\operatorname{dist}$, and a monotonically decreasing function $f : \mathbb{R}^+ \rightarrow [0,1]$ be given. Then a point $p \in P$ belongs to the probabilistic neighborhood $N(q, f)$ of $q$ with respect to $f$ with probability $f(\operatorname{dist}(p,q))$. We envision applications in facility location, sensor networks, and other scenarios where a connection between two entities becomes less likely with increasing distance. A straightforward query algorithm would determine a probabilistic neighborhood in $\Theta(n \cdot d)$ time by probing each point in $P$. To answer the query in sublinear time for the planar case, we augment a quadtree suitably and design a corresponding query algorithm. Our theoretical analysis shows that, for certain distributions of planar $P$, our algorithm answers a query in $O((|N(q,f)| + \sqrt{n})\log n)$ time with high probability (whp). This matches, up to a logarithmic factor, the cost induced by quadtree-based algorithms for deterministic queries and is asymptotically faster than the straightforward approach whenever $|N(q,f)| \in o(n / \log n)$. As practical proofs of concept we use two applications, one in the Euclidean and one in the hyperbolic plane. In particular, our results yield the first generator for random hyperbolic graphs with arbitrary temperatures in subquadratic time. Moreover, our experimental data show the usefulness of our algorithm even if the point distribution is unknown or not uniform: the running time savings over the pairwise probing approach constitute at least one order of magnitude already for a modest number of points and queries.
    Comment: The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-44543-4_3
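
    The straightforward $\Theta(n \cdot d)$ baseline follows directly from the definition above: probe every point and include it with probability $f(\operatorname{dist}(p, q))$. A minimal sketch of that baseline (the paper's quadtree-based sublinear algorithm is its contribution and is not reproduced here); the names and the decay function in the usage line are illustrative.

```python
import math
import random

def probabilistic_neighborhood(P, q, f, dist=math.dist):
    """Baseline Theta(n*d) query: include each p in P independently
    with probability f(dist(p, q)), per the definition of N(q, f)."""
    return [p for p in P if random.random() < f(dist(p, q))]

# Usage: membership probability decays exponentially with distance.
P = [(random.random(), random.random()) for _ in range(1000)]
neighborhood = probabilistic_neighborhood(
    P, q=(0.5, 0.5), f=lambda d: math.exp(-5 * d))
```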

    Potential Role of Ultrafine Particles in Associations between Airborne Particle Mass and Cardiovascular Health

    Numerous epidemiologic time-series studies have shown generally consistent associations of cardiovascular hospital admissions and mortality with outdoor air pollution, particularly mass concentrations of particulate matter (PM) ≤2.5 or ≤10 μm in diameter (PM(2.5), PM(10)). Panel studies with repeated measures have supported the time-series results, showing associations between PM and risk of cardiac ischemia and arrhythmias, increased blood pressure, decreased heart rate variability, and increased circulating markers of inflammation and thrombosis. The causal components driving the PM associations remain to be identified. Epidemiologic data using pollutant gases and particle characteristics such as particle number concentration and elemental carbon have provided indirect evidence that products of fossil fuel combustion are important. Ultrafine particles <0.1 μm (UFPs) dominate particle number concentrations and surface area and are therefore capable of carrying large concentrations of adsorbed or condensed toxic air pollutants. It is likely that redox-active components in UFPs from fossil fuel combustion reach cardiovascular target sites. High UFP exposures may lead to systemic inflammation through oxidative stress responses to reactive oxygen species and thereby promote the progression of atherosclerosis and precipitate acute cardiovascular responses ranging from increased blood pressure to myocardial infarction. The next steps in epidemiologic research are to identify more clearly the putative PM causal components and size fractions linked to their sources. To advance this, we discuss in a companion article (Sioutas C, Delfino RJ, Singh M. 2005. Environ Health Perspect 113:947–955) the need for and methods of UFP exposure assessment.

    Sampling-based Algorithms for Optimal Motion Planning

    During the last decade, sampling-based path planning algorithms, such as Probabilistic RoadMaps (PRM) and Rapidly-exploring Random Trees (RRT), have been shown to work well in practice and possess theoretical guarantees such as probabilistic completeness. However, little effort has been devoted to the formal analysis of the quality of the solution returned by such algorithms, e.g., as a function of the number of samples. The purpose of this paper is to fill this gap, by rigorously analyzing the asymptotic behavior of the cost of the solution returned by stochastic sampling-based algorithms as the number of samples increases. A number of negative results are provided, characterizing existing algorithms, e.g., showing that, under mild technical conditions, the cost of the solution returned by broadly used sampling-based algorithms converges almost surely to a non-optimal value. The main contribution of the paper is the introduction of new algorithms, namely, PRM* and RRT*, which are provably asymptotically optimal, i.e., such that the cost of the returned solution converges almost surely to the optimum. Moreover, it is shown that the computational complexity of the new algorithms is within a constant factor of that of their probabilistically complete (but not asymptotically optimal) counterparts. The analysis in this paper hinges on novel connections between stochastic sampling-based path planning algorithms and the theory of random geometric graphs.
    Comment: 76 pages, 26 figures, to appear in the International Journal of Robotics Research
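
    A heavily simplified sketch of the RRT* idea named above: extend the tree toward random samples as plain RRT does, but choose each new node's parent to minimize cost-to-come among nearby nodes, and rewire those nodes through the new one when that is cheaper. This obstacle-free 2-D toy omits collision checking, does not propagate rewired costs to descendants, and uses an illustrative connection-radius schedule; it is not the paper's full algorithm.

```python
import math
import random

def rrt_star(start, goal, n_samples=500, step=0.5, bound=10.0):
    """Minimal obstacle-free 2-D RRT* sketch over the square [0, bound]^2."""
    nodes = [start]
    parent = {start: None}
    cost = {start: 0.0}
    for _ in range(n_samples):
        x_rand = (random.uniform(0, bound), random.uniform(0, bound))
        x_near = min(nodes, key=lambda v: math.dist(v, x_rand))
        d = math.dist(x_near, x_rand)
        t = min(1.0, step / d) if d > 0 else 0.0
        x_new = (x_near[0] + t * (x_rand[0] - x_near[0]),
                 x_near[1] + t * (x_rand[1] - x_near[1]))
        # Illustrative shrinking connection radius ~ sqrt(log n / n) for d = 2.
        n = len(nodes)
        r = max(step, 2.0 * bound * math.sqrt(math.log(n + 1) / (n + 1)))
        near = [v for v in nodes if math.dist(v, x_new) <= r]
        # Choose the parent giving x_new the lowest cost-to-come.
        best = min(near, key=lambda v: cost[v] + math.dist(v, x_new))
        nodes.append(x_new)
        parent[x_new] = best
        cost[x_new] = cost[best] + math.dist(best, x_new)
        # Rewire: route nearby nodes through x_new where cheaper
        # (descendant costs are not updated in this simplified sketch).
        for v in near:
            c = cost[x_new] + math.dist(x_new, v)
            if c < cost[v]:
                parent[v], cost[v] = x_new, c
    # Trace back from the node that connects to the goal most cheaply.
    best_goal = min(nodes, key=lambda v: cost[v] + math.dist(v, goal))
    path = [goal, best_goal]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]

print(rrt_star(start=(0.0, 0.0), goal=(9.0, 9.0))[:3])
```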

    Efficient Representation of Multidimensional Data over Hierarchical Domains

    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-46049-9_19
    [Abstract] We consider the problem of representing multidimensional data where the domain of each dimension is organized hierarchically, and the queries require summary information at a different node in the hierarchy of each dimension. This is the typical case of OLAP databases. A basic approach is to represent each hierarchy as a one-dimensional line and recast the queries as multidimensional range queries. This approach can be implemented compactly by generalizing to more dimensions the $k^2$-treap, a compact representation of two-dimensional points that allows for efficient summarization queries along generic ranges. Instead, we propose a more flexible generalization, which, instead of a generic quadtree-like partition of the space, follows the domain hierarchies across each dimension to organize the partitioning. The resulting structure is much more efficient than a generic multidimensional structure, since queries are resolved by aggregating far fewer nodes of the tree.
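
    A toy sketch of the query setting: hierarchies per dimension, a fact table at the finest granularity, and a summary query that takes one hierarchy node per dimension. It implements the "basic approach" the abstract mentions (recasting the query as a range over leaf combinations); the paper's compact structure instead stores aggregates in a tree that follows the hierarchies so queries touch far fewer nodes. All names and data here are illustrative.

```python
from itertools import product

# Toy hierarchies: each node maps to its children; leaves map to [].
time_h = {"year": ["q1", "q2"], "q1": [], "q2": []}
geo_h  = {"world": ["eu", "us"], "eu": [], "us": []}

def leaves(h, node):
    """Leaf descendants of a hierarchy node (the node itself if it is a leaf)."""
    return [node] if not h[node] else [l for c in h[node] for l in leaves(h, c)]

# Fact table at the finest granularity: (time leaf, geo leaf) -> measure.
facts = {("q1", "eu"): 3, ("q1", "us"): 5, ("q2", "eu"): 2, ("q2", "us"): 7}

def summary(t_node, g_node):
    """Summary at one hierarchy node per dimension, via the basic approach:
    aggregate over the cross product of leaf ranges."""
    return sum(facts.get(pair, 0)
               for pair in product(leaves(time_h, t_node), leaves(geo_h, g_node)))

print(summary("year", "eu"))   # 5 = q1/eu + q2/eu
print(summary("q2", "world"))  # 9 = q2/eu + q2/us
```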