
    Sampling Correctors

    In many situations, sample data is obtained from a noisy or imperfect source. In order to address such corruptions, this paper introduces the concept of a sampling corrector. Such algorithms use structure that the distribution is purported to have in order to make "on-the-fly" corrections to samples drawn from probability distributions. These algorithms then act as filters between the noisy data and the end user. We show connections between sampling correctors, distribution learning algorithms, and distribution property testing algorithms, and show that these connections can be used to expand the applicability of known distribution learning and property testing algorithms as well as to achieve improved algorithms for those tasks. As a first step, we show how to design sampling correctors using proper learning algorithms. We then focus on the question of whether sampling correctors can be more efficient in terms of sample complexity than learning algorithms for the analogous families of distributions. When correcting monotonicity, we show that this is indeed the case when also granted query access to the cumulative distribution function. We also obtain sampling correctors for monotonicity without this stronger type of access, provided that the distribution is originally very close to monotone (namely, at distance $O(1/\log^2 n)$). In addition, we consider a restricted error model that aims at capturing "missing data" corruptions. In this model, we show that distributions that are close to monotone have sampling correctors that are significantly more efficient than achievable by the learning approach. We also consider the question of whether an additional source of independent random bits is required by sampling correctors to implement the correction process.
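
    For intuition, here is a minimal sketch of the filtering idea, assuming the purported structure is that the distribution is non-increasing on $\{0,\dots,n-1\}$: learn a rough empirical pmf from a batch of noisy samples, clip it to be non-increasing, renormalize, and serve samples from the corrected pmf. The function name `monotone_correct`, the batch size, and the crude clipping step are illustrative choices of ours; the paper's correctors are instead built from proper learning algorithms with provable guarantees.

        import random
        from collections import Counter

        def monotone_correct(sample_fn, n, batch=10_000):
            # Illustrative sampling-corrector sketch, NOT the paper's algorithm.
            # sample_fn() returns a draw from a noisy source on {0, ..., n-1}
            # that is purported to be (close to) non-increasing.
            counts = Counter(sample_fn() for _ in range(batch))
            pmf = [counts[i] / batch for i in range(n)]
            # Crude "correction": clip the empirical pmf to be non-increasing,
            # then renormalize so it is again a probability distribution.
            for i in range(1, n):
                pmf[i] = min(pmf[i], pmf[i - 1])
            total = sum(pmf)
            pmf = [p / total for p in pmf] if total > 0 else [1 / n] * n
            while True:                      # act as a filter: emit corrected samples
                yield random.choices(range(n), weights=pmf)[0]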

    Testing probability distributions using conditional samples

    We study a new framework for property testing of probability distributions, by considering distribution testing algorithms that have access to a conditional sampling oracle.* This is an oracle that takes as input a subset $S \subseteq [N]$ of the domain $[N]$ of the unknown probability distribution $D$ and returns a draw from the conditional probability distribution $D$ restricted to $S$. This new model allows considerable flexibility in the design of distribution testing algorithms; in particular, testing algorithms in this model can be adaptive. We study a wide range of natural distribution testing problems in this new framework and some of its variants, giving both upper and lower bounds on query complexity. These problems include testing whether $D$ is the uniform distribution $\mathcal{U}$; testing whether $D = D^\ast$ for an explicitly provided $D^\ast$; testing whether two unknown distributions $D_1$ and $D_2$ are equivalent; and estimating the variation distance between $D$ and the uniform distribution. At a high level, our main finding is that the new "conditional sampling" framework we consider is a powerful one: while all the problems mentioned above have $\Omega(\sqrt{N})$ sample complexity in the standard model (and in some cases the complexity must be almost linear in $N$), we give $\mathrm{poly}(\log N, 1/\varepsilon)$-query algorithms (and in some cases $\mathrm{poly}(1/\varepsilon)$-query algorithms independent of $N$) for all these problems in our conditional sampling setting. *Independently from our work, Chakraborty et al. also considered this framework; we discuss their work in Subsection 1.4. Comment: significant changes to Section 9 (detailing and expanding the proof of Theorem 16); several clarifications and typos fixed in various places.
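
    To make the model concrete, the sketch below (interface and names are ours, not the paper's) simulates a conditional sampling oracle for a known pmf and uses it for the basic pairwise-comparison primitive that conditional-sampling testers often build on: conditioning on a two-element set $\{x, y\}$ to estimate the relative weight of $x$ versus $y$.

        import random

        def make_cond_oracle(D):
            # Simulate a conditional sampling oracle COND_D for a known pmf D,
            # given as a dict {point: probability} over the domain [N].
            # A tester only gets to call the returned function, never D itself.
            def cond(S):
                S = list(S)
                weights = [D.get(x, 0.0) for x in S]
                if sum(weights) == 0:
                    return random.choice(S)          # convention for zero-mass sets
                return random.choices(S, weights=weights)[0]
            return cond

        def estimate_pair_weight(cond, x, y, trials=2_000):
            # Estimate D(x) / (D(x) + D(y)) by repeatedly conditioning on {x, y}:
            # the comparison primitive underlying many COND testers.
            hits = sum(cond({x, y}) == x for _ in range(trials))
            return hits / trials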

    Testing Conditional Independence of Discrete Distributions

    We study the problem of testing \emph{conditional independence} for discrete distributions. Specifically, given samples from a discrete random variable $(X, Y, Z)$ on domain $[\ell_1] \times [\ell_2] \times [n]$, we want to distinguish, with probability at least $2/3$, the case that $X$ and $Y$ are conditionally independent given $Z$ from the case that $(X, Y, Z)$ is $\epsilon$-far, in $\ell_1$-distance, from every distribution that has this property. Conditional independence is a concept of central importance in probability and statistics, with a range of applications in various scientific domains. As such, the statistical task of testing conditional independence has been extensively studied in various forms within the statistics and econometrics communities for nearly a century. Perhaps surprisingly, this problem has not been previously considered in the framework of distribution property testing, and in particular no tester with sublinear sample complexity is known, even for the important special case that the domains of $X$ and $Y$ are binary. The main algorithmic result of this work is the first conditional independence tester with {\em sublinear} sample complexity for discrete distributions over $[\ell_1] \times [\ell_2] \times [n]$. To complement our upper bounds, we prove information-theoretic lower bounds establishing that the sample complexity of our algorithm is optimal, up to constant factors, for a number of settings. Specifically, for the prototypical setting where $\ell_1, \ell_2 = O(1)$, we show that the sample complexity of testing conditional independence (upper bound and matching lower bound) is \[ \Theta\left(\max\left(n^{1/2}/\epsilon^2,\ \min\left(n^{7/8}/\epsilon,\, n^{6/7}/\epsilon^{8/7}\right)\right)\right). \]
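
    For intuition only, the sketch below computes a naive plug-in version of the quantity being tested: within each slice $Z = z$, the $\ell_1$-distance between the empirical joint of $(X, Y)$ and the product of its empirical marginals, summed with weights proportional to each slice's empirical mass. This plug-in estimator requires far more samples than the paper's sublinear tester; the function name and structure are illustrative assumptions.

        from collections import Counter, defaultdict

        def ci_plugin_statistic(samples):
            # samples: iterable of (x, y, z) triples drawn from the joint distribution.
            # Returns a weighted sum over z of the l1-distance between the empirical
            # joint of (X, Y) given Z = z and the product of its empirical marginals.
            samples = list(samples)
            n = len(samples)
            by_z = defaultdict(list)
            for x, y, z in samples:
                by_z[z].append((x, y))
            stat = 0.0
            for z, pairs in by_z.items():
                m = len(pairs)
                joint = Counter(pairs)
                px = Counter(x for x, _ in pairs)
                py = Counter(y for _, y in pairs)
                dist = sum(abs(joint[(x, y)] / m - (px[x] / m) * (py[y] / m))
                           for x in px for y in py)
                stat += (m / n) * dist       # weight each slice by its empirical mass
            return stat                      # small => consistent with conditional independence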

    Learning circuits with few negations

    Monotone Boolean functions, and the monotone Boolean circuits that compute them, have been intensively studied in complexity theory. In this paper we study the structure of Boolean functions in terms of the minimum number of negations in any circuit computing them, a complexity measure that interpolates between monotone functions and the class of all functions. We study this generalization of monotonicity from the vantage point of learning theory, giving near-matching upper and lower bounds on the uniform-distribution learnability of circuits in terms of the number of negations they contain. Our upper bounds are based on a new structural characterization of negation-limited circuits that extends a classical result of A. A. Markov. Our lower bounds, which employ Fourier-analytic tools from hardness amplification, give new results even for circuits with no negations (i.e., monotone functions).
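
    As a toy illustration of the Markov-style structure the upper bounds build on (brute force, usable only for very small $n$, with names of our own choosing): by A. A. Markov's classical theorem, the minimum number of negations needed to compute a Boolean function $f$ equals $\lceil \log_2(d(f)+1) \rceil$, where the decrease $d(f)$ is the largest number of $1 \to 0$ drops of $f$ along any increasing chain of the Boolean cube.

        from itertools import permutations

        def markov_negation_bound(f, n):
            # Brute-force computation of Markov's bound for tiny n (exponential time).
            # f: function taking an n-tuple of 0/1 values and returning 0 or 1.
            # d(f) = max number of 1 -> 0 drops of f along any increasing chain;
            # Markov's theorem: the minimum number of negations is ceil(log2(d(f)+1)).
            d = 0
            for order in permutations(range(n)):       # each maximal chain of {0,1}^n
                point = [0] * n
                values = [f(tuple(point))]
                for i in order:                        # flip bits from 0 to 1, one at a time
                    point[i] = 1
                    values.append(f(tuple(point)))
                drops = sum(1 for j in range(len(values) - 1)
                            if values[j] and not values[j + 1])
                d = max(d, drops)
            return d.bit_length()                      # == ceil(log2(d + 1))

        # Sanity check: a monotone function such as 3-way OR needs no negations.
        assert markov_negation_bound(lambda bits: int(any(bits)), 3) == 0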