Nonlinear quantile mixed models
In regression applications, the presence of nonlinearity and correlation
among observations offers computational challenges not only in traditional
settings such as least squares regression, but also (and especially) when the
objective function is non-smooth as in the case of quantile regression. In this
paper, we develop methods for the modeling and estimation of nonlinear
conditional quantile functions when data are clustered within two-level nested
designs. This work represents an extension of the linear quantile mixed models
of Geraci and Bottai (2014, Statistics and Computing). We develop a novel
algorithm which is a blend of a smoothing algorithm for quantile regression and
a second-order Laplacian approximation for nonlinear mixed models. To assess
the proposed methods, we present a simulation study and two applications, one
in pharmacokinetics and one related to growth curve modeling in agriculture.
Comment: 26 pages, 8 figures, 8 tables
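The computational difficulty the abstract alludes to comes from pairing a non-differentiable check (quantile) loss with a nonlinear predictor. Below is a minimal sketch of that objective, assuming a hypothetical Michaelis-Menten-type fixed-effects curve and a derivative-free optimizer; it illustrates the check loss only, not the authors' smoothing/Laplacian algorithm or the random-effects structure.

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(u, tau):
    # Quantile (check) loss: rho_tau(u) = u * (tau - I(u < 0)); non-smooth at 0.
    return u * (tau - (u < 0))

def nonlinear_quantile_fit(x, y, tau, theta0):
    # Fit the tau-th conditional quantile under a hypothetical
    # Michaelis-Menten curve f(x; a, b) = a * x / (b + x), fixed effects only.
    def objective(theta):
        pred = theta[0] * x / (theta[1] + x)
        return np.sum(check_loss(y - pred, tau))
    # Nelder-Mead copes with the non-smooth objective on this small example.
    return minimize(objective, theta0, method="Nelder-Mead").x

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 10.0, 200)                  # toy dose/time values
y = 5 * x / (2 + x) + rng.normal(0, 0.5, 200)    # toy responses
print(nonlinear_quantile_fit(x, y, tau=0.5, theta0=np.array([1.0, 1.0])))
```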
Quantile contours and allometric modelling for risk classification of abnormal ratios with an application to asymmetric growth-restriction in preterm infants
We develop an approach to risk classification based on quantile contours and
allometric modelling of multivariate anthropometric measurements. We propose
the definition of an allometric direction tangent to the directional quantile
envelope, which divides ratios of measurements into half-spaces. This in turn
provides an operational definition of a directional quantile that can be used as a
cutoff for risk assessment. We show the application of the proposed approach
using a large dataset from the Vermont Oxford Network containing observations
of birthweight (BW) and head circumference (HC) for more than 150,000 preterm
infants. Our analysis suggests that disproportionately growth-restricted
infants with a larger HC-to-BW ratio are at increased mortality risk as
compared to proportionately growth-restricted infants. The role of maternal
hypertension is also investigated.
Comment: 31 pages, 3 figures, 8 tables
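On synthetic data, the core classification step can be sketched as follows: project the bivariate measurements onto a candidate direction, take the empirical tau-quantile of the projections as a cutoff, and flag the half-space beyond it. The direction and data below are hypothetical placeholders, not the paper's fitted allometric direction or the Vermont Oxford data.

```python
import numpy as np

def directional_quantile_cutoff(X, u, tau):
    # Empirical tau-quantile of the projections of the rows of X onto the
    # unit vector u; the hyperplane {x : <x, u> = cutoff} splits the plane.
    u = np.asarray(u, dtype=float)
    u /= np.linalg.norm(u)
    return np.quantile(X @ u, tau)

def flag_half_space(X, u, cutoff):
    # True for observations on the low side of the directional cutoff.
    u = np.asarray(u, dtype=float)
    u /= np.linalg.norm(u)
    return X @ u < cutoff

# Toy bivariate (BW, HC)-like data on a log scale.
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=1000)
u = np.array([1.0, -1.0])                        # hypothetical direction
cut = directional_quantile_cutoff(X, u, tau=0.10)
print(flag_half_space(X, u, cut).mean())         # ~0.10 by construction
```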
Dynamic User-Defined Similarity Searching in Semi-Structured Text Retrieval
Modern text retrieval systems often provide a similarity search utility that allows the user to efficiently find a fixed number k of documents in the data set that are most similar to a given query (here a query is either a simple sequence of keywords or the identifier of a full document, found in previous searches, that is considered of interest). We consider the case of a textual database made of semi-structured documents. For example, in a corpus of bibliographic records any record may be structured into three fields: title, authors and abstract, where each field is an unstructured free text. Each field, in turn, is modelled with a specific vector space. The problem is more complex when we also allow each such vector space to have an associated user-defined dynamic weight that influences its contribution to the overall dynamic aggregated and weighted similarity. This dynamic problem has been tackled in a recent paper by Singitham et al. in VLDB 2004. Their proposed solution, which we take as baseline, is a variant of the cluster-pruning technique that has the potential for scaling to very large corpora of documents, and is far more efficient than the naive exhaustive search. We devise an alternative way of embedding weights in the data structure, coupled with a non-trivial application of a clustering algorithm based on the furthest-point-first heuristic for the metric k-center problem. The validity of our approach is demonstrated experimentally: we significantly improve the tradeoffs between query time and output quality with respect to the baseline method in VLDB 2004, and also with respect to a novel method by Chierichetti et al. to appear in ACM PODS 2007. We also speed up the pre-processing time by a factor of at least thirty.
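As a point of reference, the naive exhaustive search that both the baseline and the proposed index aim to beat can be sketched in a few lines: compute a per-field cosine similarity, aggregate with the user's query-time weights, and take the top k. Field names and dimensions below are illustrative; cluster pruning replaces the full scan with a visit to the few clusters whose leaders best match the query.

```python
import numpy as np

def weighted_topk(query_fields, doc_fields, weights, k):
    # Exhaustive top-k under a dynamic weighted sum of per-field cosine
    # similarities. query_fields/doc_fields map a field name (e.g. 'title')
    # to L2-normalised vectors; doc vectors are stacked row-wise per field.
    score = np.zeros(next(iter(doc_fields.values())).shape[0])
    for field, w in weights.items():
        score += w * (doc_fields[field] @ query_fields[field])
    return np.argsort(-score)[:k]

def unit_rows(a):
    return a / np.linalg.norm(a, axis=-1, keepdims=True)

rng = np.random.default_rng(2)
fields = ("title", "authors", "abstract")
docs = {f: unit_rows(rng.normal(size=(500, 64))) for f in fields}
q = {f: unit_rows(rng.normal(size=64)) for f in fields}
print(weighted_topk(q, docs, {"title": 0.5, "authors": 0.2, "abstract": 0.3}, k=5))
```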
Extraction and classification of dense communities in the Web
The World Wide Web (WWW) is rapidly becoming important for society as a medium for sharing data, information and services, and there is a growing interest in tools for understanding collective behaviors and emerging phenomena in the WWW. In this paper we focus on the problem of searching and classifying communities in the web. Loosely speaking, a community is a group of pages related to a common interest. More formally, communities have been associated in the computer science literature with the existence of a locally dense sub-graph of the web-graph (where web pages are nodes and hyper-links are arcs of the web-graph). The core of our contribution is a new scalable algorithm for finding relatively dense subgraphs in massive graphs. We apply our algorithm on web-graphs built on three publicly available large crawls of the web (with raw sizes up to 120M nodes and 1G arcs). The effectiveness of our algorithm in finding dense subgraphs is demonstrated experimentally by embedding artificial communities in the web-graph and counting how many of these are blindly found. Effectiveness increases with the size and density of the communities: it is close to 100% for dense communities of a hundred nodes or more. Moreover, it is still about 80% even for small communities of twenty nodes and density at 50% of the arcs present. We complete our Community Watch system by clustering the communities found in the web-graph into homogeneous groups by topic and labelling each group by representative keywords.
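The paper's own scalable algorithm is not reproduced here, but the classic greedy peeling heuristic for the densest-subgraph problem (Charikar, 2000) conveys the underlying idea: repeatedly delete a minimum-degree node and keep the densest intermediate subgraph. A minimal sketch:

```python
import heapq
from collections import defaultdict

def densest_subgraph(edges):
    # Greedy 2-approximation for maximum average degree (Charikar 2000);
    # a stand-in for the paper's own scalable algorithm.
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    nodes, m = set(adj), sum(len(a) for a in adj.values()) // 2
    heap = [(len(adj[v]), v) for v in nodes]
    heapq.heapify(heap)
    best, best_density = set(nodes), m / max(len(nodes), 1)
    while nodes:
        d, v = heapq.heappop(heap)
        if v not in nodes or d != len(adj[v]):
            continue                          # stale heap entry, skip
        m -= len(adj[v]); nodes.discard(v)
        for w in adj[v]:
            adj[w].discard(v)
            heapq.heappush(heap, (len(adj[w]), w))
        adj[v].clear()
        if nodes and m / len(nodes) > best_density:
            best, best_density = set(nodes), m / len(nodes)
    return best, best_density

edges = [(1,2),(1,3),(1,4),(2,3),(2,4),(3,4),(4,5)]   # a K4 plus a pendant node
print(densest_subgraph(edges))                        # -> ({1, 2, 3, 4}, 1.5)
```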
Lumbricus webis: a parallel and distributed crawling architecture for the Italian web
Web crawlers have become popular tools for gathering large portions of the web that can be used for many tasks, from statistics to structural analysis of the web. Due to the amount of data and the heterogeneity of tasks to manage, it is essential for crawlers to have a modular and distributed architecture. In this paper we describe Lumbricus webis (L.webis for short), a modular crawling infrastructure built to mine data from the ccTLD .it web domain and portions of the web reachable from this domain. Its purpose is to support the gathering of advanced statistics and advanced analytic tools on the content of the Italian Web. This paper describes the architectural features of L.webis and its performance. L.webis can currently download a mid-sized ccTLD such as ".it" in about one week.
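As a toy illustration of the frontier/fetcher split that any such architecture builds on, here is a minimal single-threaded crawler restricted to a ccTLD-like hostname suffix; L.webis distributes this loop across modules and machines, and the regex-based link extraction below is a deliberate simplification.

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(seed, suffix=".it", max_pages=50):
    # Breadth-first crawl: a single frontier queue stands in for the
    # distributed frontier/fetcher/storage components of a real crawler.
    frontier, seen = deque([seed]), {seed}
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                      # a real fetcher module would log and retry
        for href in re.findall(r'href="(http[^"]+)"', html):
            link = urljoin(url, href)
            host = urlparse(link).hostname
            if host and host.endswith(suffix) and link not in seen:
                seen.add(link)
                frontier.append(link)
    return seen
```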
Cluster Generation and Cluster Labelling for Web Snippets: A Fast and Accurate Hierarchical Solution
This paper describes Armil, a meta-search engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labelling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted 'external' metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labelling algorithms. On a standard 1GHz machine, Armil performs clustering and labelling altogether in less than one second.
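The furthest-point-first (Gonzalez) heuristic at the heart of the clustering step is easy to state; a Euclidean sketch follows (Armil runs a fast variant over snippet vectors, and the data here are synthetic):

```python
import numpy as np

def furthest_point_first(X, k, seed=0):
    # Gonzalez's furthest-point-first heuristic for metric k-center:
    # start from an arbitrary center, then repeatedly promote the point
    # furthest from its nearest chosen center (a 2-approximation).
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]
    dist = np.linalg.norm(X - X[centers[0]], axis=1)   # dist to nearest center
    for _ in range(k - 1):
        centers.append(int(np.argmax(dist)))
        dist = np.minimum(dist, np.linalg.norm(X - X[centers[-1]], axis=1))
    # Assign every point to its nearest center.
    labels = np.argmin(
        np.linalg.norm(X[:, None, :] - X[centers][None, :, :], axis=2), axis=1)
    return centers, labels

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 16))            # stand-ins for snippet vectors
centers, labels = furthest_point_first(X, k=5)
print(centers, np.bincount(labels))
```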
Packet Classification via Improved Space Decomposition Techniques
Packet classification is a common task in modern Internet routers. The goal is to classify packets into "classes" or "flows" according to some ruleset that looks at multiple fields of each packet. Differentiated actions can then be applied to the traffic depending on the result of the classification. Even though rulesets can be expressed in a relatively compact way by using high-level languages, the resulting decision trees can partition the search space (the set of possible attribute values) into a potentially very large number of regions. This calls for methods that scale to such large problem sizes, though the only scalable proposal in the literature so far is the one based on a Fat Inverted Segment Tree [1]. In this paper we propose a new geometric technique called G-filter for packet classification on d dimensions. G-filter is based on an improved space decomposition technique. In addition to a theoretical analysis showing that classification in G-filter has O(1) time complexity and slightly super-linear space in the number of rules, we provide thorough experiments showing that the constants involved are extremely small on a wide range of problem sizes, and that G-filter improves the best results in the literature for large problem sizes, and is competitive for small sizes as well.
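G-filter's actual decomposition is not spelled out in the abstract; a quadtree-style recursive split of a two-dimensional rule space gives the flavour of geometric space decomposition for classification. The rules, regions, and depth budget below are hypothetical.

```python
def build(rules, region, depth=0, max_depth=4):
    # Quadtree-style decomposition of a 2-D rule space. Each rule is
    # (priority, (x0, x1, y0, y1)); a node keeps the rules intersecting
    # its region and splits into four quadrants until a small bucket or
    # the depth budget is reached.
    x0, x1, y0, y1 = region
    here = [r for r in rules
            if r[1][0] < x1 and r[1][1] > x0 and r[1][2] < y1 and r[1][3] > y0]
    if depth == max_depth or len(here) <= 2:
        return {"leaf": True, "rules": here}
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    quads = [(x0, xm, y0, ym), (xm, x1, y0, ym), (x0, xm, ym, y1), (xm, x1, ym, y1)]
    return {"leaf": False, "mid": (xm, ym),
            "kids": [build(here, q, depth + 1, max_depth) for q in quads]}

def classify(node, x, y):
    # Descend to the leaf bucket covering (x, y) and return the matching
    # rule with the highest priority (smallest number), or None.
    while not node["leaf"]:
        xm, ym = node["mid"]
        node = node["kids"][(x >= xm) + 2 * (y >= ym)]
    matches = [r for r in node["rules"]
               if r[1][0] <= x < r[1][1] and r[1][2] <= y < r[1][3]]
    return min(matches, default=None)

rules = [(1, (0, 50, 0, 50)), (2, (25, 100, 25, 100))]
tree = build(rules, (0, 100, 0, 100))
print(classify(tree, 30, 30))   # both rules match; priority 1 wins
```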
