11 research outputs found

    Explaining the production of milk in Gujarat and Haryana: A matter of scale

    This paper investigates parallel random sampling from a potentially unending data stream whose elements are revealed in a series of element sequences (minibatches). While sampling from a stream has been studied extensively in the sequential setting, little has been explored in the parallel context, with prior parallel random-sampling algorithms focusing on the static batch model. We present parallel algorithms for minibatch-stream sampling in two settings: (1) sliding window, which draws samples from a prespecified number of the most recently observed elements, and (2) infinite window, which draws samples from all the elements received. Our algorithms are computationally and memory efficient: their work matches the fastest sequential counterpart, their parallel depth is small (polylogarithmic), and their memory usage matches the best known sequential algorithms.
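    As a point of reference for the infinite-window setting, here is a minimal sequential sketch of classic reservoir sampling (Algorithm R), whose work the parallel algorithms are said to match; the paper's actual parallel minibatch constructions are not reproduced here, and the function name is illustrative.

        import random

        def reservoir_sample(stream, k):
            # Maintain a uniform random sample of k elements from all
            # elements received so far (the "infinite window" setting).
            sample = []
            for i, x in enumerate(stream):
                if i < k:
                    sample.append(x)
                else:
                    # x replaces a random slot with probability k / (i + 1),
                    # which keeps every element equally likely to be sampled.
                    j = random.randrange(i + 1)
                    if j < k:
                        sample[j] = x
            return sample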

    Sampling in Space Restricted Settings

    Space-efficient algorithms play a central role in dealing with large amounts of data. In such settings, one would like to analyse the data using only a small amount of "working space". A key step in many algorithms for analysing large data is to maintain a random sample (or a small number of random samples) from the data points. In this paper, we consider two space-restricted settings: (i) the streaming model, where data arrives over time and only a small amount of storage may be used, and (ii) the query model, where the data is structured in low space so that sampling queries can be answered. We prove the following results in these two settings:
    - In the streaming setting, we would like to maintain a random sample from the elements seen so far. We prove that one can maintain a random sample using O(log n) random bits and O(log n) space, where n is the number of elements seen so far. This extends to the case where elements have weights as well.
    - In the query model, there are n elements with weights w_1, ..., w_n (which are w-bit integers), and one would like to sample a random element with probability proportional to its weight. Bringmann and Larsen (STOC 2013) showed how to sample such an element using nw + 1 space, whereas the information-theoretic lower bound is nw. We consider the approximate sampling problem, where we are given an error parameter ε and the sampling probability of an element may be off by an ε factor. We give matching upper and lower bounds for this problem.
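    For the query model, a minimal sketch of the standard prefix-sum approach to weighted sampling is shown below; it illustrates the problem being solved, not the succinct nw + 1 structure of Bringmann and Larsen or the approximate scheme of this paper, and the names build_sampler and draw are illustrative.

        import bisect
        import itertools
        import random

        def build_sampler(weights):
            # Preprocess: store prefix sums of the weights. This uses more
            # space than the succinct structures discussed in the paper.
            prefix = list(itertools.accumulate(weights))
            total = prefix[-1]

            def draw():
                # Pick r uniformly in [0, total) and return the first index
                # whose prefix sum exceeds r, so that index i is returned
                # with probability w_i / total. The clamp guards against
                # r landing exactly on total due to floating-point rounding.
                r = total * random.random()
                return min(bisect.bisect_right(prefix, r), len(prefix) - 1)

            return draw

        draw = build_sampler([1, 3, 6])
        print([draw() for _ in range(5)])  # indices 0..2, weighted 1:3:6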

    A Family of Unsupervised Sampling Algorithms

    Three algorithms for unsupervised sampling are introduced. They are easy to tune, scalable, and yield a small sample. They are built on the same concepts: they combine density and distance, they use farthest-first traversal, which allows for runtime optimization, they yield a coreset, and they are driven by a single user parameter. DIDES gives priority to distance while density is also managed. In DENDIS, density is the first concern while space coverage is ensured. Both are tuned by a meaningful parameter called granularity: the lower its value, the larger the sample. The third algorithm in the family, ProTraS, aims to explicitly design a coreset; the sampling cost is its unique parameter and stopping criterion. In this chapter, their common properties and differences are studied.
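    A minimal sketch of plain farthest-first traversal with a granularity-style stopping rule is given below; it conveys the shared skeleton of DIDES, DENDIS, and ProTraS but omits their density weighting and coreset guarantees, and the threshold rule here is an illustrative stand-in for the papers' single user parameter.

        import math
        import random

        def farthest_first_sample(points, granularity):
            # Start from a random seed point.
            sample = [random.choice(points)]
            # d[j] = distance from points[j] to its nearest sampled point.
            d = [math.dist(p, sample[0]) for p in points]
            # Proxy for the data scale: the largest distance to the seed.
            scale = max(d)
            while True:
                i = max(range(len(points)), key=d.__getitem__)
                # Stop once even the farthest point is well covered; a
                # smaller granularity therefore yields a larger sample.
                if scale == 0 or d[i] <= granularity * scale:
                    break
                sample.append(points[i])
                d = [min(d[j], math.dist(points[j], points[i]))
                     for j in range(len(points))]
            return sample

        pts = [(random.random(), random.random()) for _ in range(500)]
        print(len(farthest_first_sample(pts, 0.1)))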