Deep Dive into Compressed Regression

Recent research has studied the role of sparsity in high dimensional regression and signal reconstruction, establishing theoretical limits for recovering sparse models from sparse data. This line of work shows that $\ell_1$-regularized least squares regression can accurately estimate a sparse linear model from $n$ noisy examples in $p$ dimensions, even if $p$ is much larger than $n$. In this paper we study a variant of this problem where the original $n$ input variables are compressed by a random linear transformation to $m \ll n$ examples in $p$ dimensions, and establish conditions under which a sparse linear model can be successfully recovered from the compressed data. A primary motivation for this compression procedure is to anonymize the data and preserve privacy by revealing little information about the original data. We characterize the number of random projections that are required for $\ell_1$-regularized compressed regression to identify the nonzero coefficients in the true model with probability approaching one, a property called ``sparsistence.'' In addition, we show that $\ell_1$-regularized compressed regression asymptotically predicts as well as an oracle linear model, a property called ``persistence.'' Finally, we characterize the privacy properties of the compression procedure in information-theoretic terms, establishing upper bounds on the mutual information between the compressed and uncompressed data that decay to zero.
Two issues facing the use of statistical learning methods in applications are scale and privacy. Scale is an issue in storing, manipulating and analyzing extremely large, high dimensional data. Privacy is, increasingly, a concern whenever large amounts of confidential data are manipulated within an organization. It is often important to allow researchers to analyze data without compromising the privacy of customers or leaking confidential information outside the organization. In this paper we show that sparse regression for high dimensional data can be carried out directly on a compressed form of the data, in a manner that can be shown to guard privacy in an information theoretic sense.
The approach we develop here compresses the data by a random linear or affine transformation, reducing the number of data records exponentially, while preserving the number of original input variables. These compressed data can then be made available for statistical analyses; we focus on the problem of sparse linear regression for high dimensional data. Informally, our theory ensures that the relevant predictors can be learned from the compressed data as well as they could be from the original uncompressed data. Moreover, the actual predictions based on new examples are as accurate as they would be had the original data been made available. However, the original data are not recoverable from the compressed data, and the compressed data effectively reveal no more information than would be revealed by a completely new sample. At the same time, the inference algorithms run faster and require fewer resources than the much larger uncompressed data would require. In fact, the original data need never be stored; they can be transformed “on the fly” as they come in.
In more detail, the data are represented as an $n \times p$ matrix $X$. Each of the $p$ columns is an attribute, and each of the $n$ rows is the vector of attributes for an individual record. The data are compressed by a random linear transformation
$$\widetilde{X} = \Phi X,$$
where $\Phi$ is a random $m \times n$ matrix with $m \ll n$. It is also natural to consider a random affine transformation
$$\widetilde{X} = \Phi X + \Delta,$$
where $\Delta$ is a random $m \times p$ matrix. Such transformations have been called "matrix masking" in the privacy literature (Duncan and Pearson, 1991). The entries of $\Phi$ and $\Delta$ are taken to be independent Gaussian random variables, but other distributions are possible. We think of $\widetilde{X}$ as "public," while $\Phi$ and $\Delta$ are private and only needed at the time of compression. However, even with $\Delta = 0$ and $\Phi$ known, recovering $X$ from $\widetilde{X} = \Phi X$ requires solving a highly under-determined linear system, and comes with information-theoretic privacy guarantees, as we demonstrate.
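Concretely, the matrix-masking step can be sketched in a few lines of numpy. (This is an illustrative sketch, not the paper's exact construction; the $1/\sqrt{m}$ scaling of $\Phi$'s entries is one common normalization, assumed here for the demo.)

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, m = 1000, 50, 100          # records, attributes, projections (m << n)
X = rng.normal(size=(n, p))      # original data: one row per record

# Random projection matrix Phi with i.i.d. Gaussian entries; the 1/sqrt(m)
# scale is one common normalization choice (an assumption of this sketch).
Phi = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))

X_tilde = Phi @ X                # linear masking: an m x p compressed matrix
Delta = rng.normal(size=(m, p))  # optional additive mask
X_affine = Phi @ X + Delta       # affine masking ("matrix masking")

print(X_tilde.shape)             # (100, 50): far fewer rows, same p columns
```

Note that the number of rows drops from $n$ to $m$ while all $p$ attributes are preserved, so downstream analyses that operate column-wise still apply.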
In standard regression, a response $Y = X\beta + \epsilon \in \mathbb{R}^n$ is associated with the input variables, where the $\epsilon_i$ are independent, mean-zero additive noise variables. In compressed regression, we assume that the response is also compressed, resulting in the transformed response $\widetilde{Y} \in \mathbb{R}^m$ given by
$$\widetilde{Y} = \Phi Y = \Phi X \beta + \Phi \epsilon.$$
Note that under compression, the transformed noise $\widetilde{\epsilon} = \Phi \epsilon$ is not independent across examples.
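The loss of independence is easy to check numerically: since $\widetilde{\epsilon} = \Phi\epsilon$ with $\operatorname{Cov}(\epsilon) = \sigma^2 I$, the compressed noise has covariance $\sigma^2 \Phi\Phi^T$, whose off-diagonal entries are generically nonzero. A small numpy check (dimensions and scaling are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, sigma = 500, 20, 1.0

Phi = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))

# Covariance of the compressed noise eps_tilde = Phi @ eps is
# sigma^2 * Phi @ Phi.T, which is generally not diagonal: the
# compressed "examples" share noise even though the eps_i were independent.
cov = sigma**2 * (Phi @ Phi.T)

off_diag = cov - np.diag(np.diag(cov))
print(np.abs(off_diag).max() > 0)   # True: compression correlates the noise
```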
In the sparse setting, the parameter vector $\beta \in \mathbb{R}^p$ is sparse, with a relatively small number $s$ of nonzero coefficients, $\mathrm{supp}(\beta) = \{j : \beta_j \neq 0\}$. Two key tasks are to identify the relevant variables, and to predict the response $x^T\beta$ for a new input vector $x \in \mathbb{R}^p$. The method we focus on is $\ell_1$-regularized least squares, also known as the lasso (Tibshirani, 1996). The main contributions of this paper are two technical results on the performance of this estimator, and an information-theoretic analysis of the privacy properties of the procedure. Our first result shows that the lasso is sparsistent under compression, meaning that the correct sparse set of relevant variables is identified asymptotically. Omitting details and technical assumptions for clarity, our result is the following.

Sparsistence: If the number of projections $m$ grows suitably with $n$, $p$, and the sparsity $s$, then the compressed lasso solution
$$\widehat{\beta}_{n,m} = \arg\min_{\beta} \; \frac{1}{2m}\bigl\|\widetilde{Y} - \widetilde{X}\beta\bigr\|_2^2 + \lambda_{n,m}\|\beta\|_1$$
includes the correct variables, asymptotically:
$$\mathbb{P}\bigl(\mathrm{supp}(\widehat{\beta}_{n,m}) = \mathrm{supp}(\beta)\bigr) \longrightarrow 1. \tag{1.9}$$
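The compressed lasso can be simulated end to end in a few lines. The sketch below uses a hand-rolled coordinate-descent lasso rather than the paper's procedure, and the dimensions, regularization level, and support threshold are arbitrary illustrative choices; with strong coefficients and modest noise, the nonzero coordinates of $\beta$ are typically recovered from the compressed data alone.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of the l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2m))||y - X b||^2 + lam * ||b||_1."""
    m, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / m
    r = y - X @ b
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * b[j]              # remove coordinate j's fit
            cj = X[:, j] @ r / m                # partial residual correlation
            b[j] = soft_threshold(cj, lam) / col_sq[j]
            r = r - X[:, j] * b[j]              # restore with updated b[j]
    return b

rng = np.random.default_rng(2)
n, p, m, s = 400, 30, 120, 3
beta = np.zeros(p)
beta[:s] = [3.0, -2.0, 2.5]                     # sparse true model

X = rng.normal(size=(n, p))
y = X @ beta + 0.5 * rng.normal(size=n)

Phi = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))
X_t, y_t = Phi @ X, Phi @ y                     # compressed design and response

b_hat = lasso_cd(X_t, y_t, lam=0.2)
support = set(np.flatnonzero(np.abs(b_hat) > 0.1))
print(support)                                  # estimated support of beta
```

Note that $\widetilde{Y} = \widetilde{X}\beta + \Phi\epsilon$ holds exactly, so compression introduces no model misspecification, only correlated noise and fewer effective examples.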
Our second result shows that the lasso is persistent under compression. Roughly speaking, persistence (Greenshtein and Ritov, 2004) means that the procedure predicts well, as measured by the predictive risk
$$R(\beta) = \mathbb{E}\bigl(Y - X^T\beta\bigr)^2, \tag{1.10}$$
where now $X \in \mathbb{R}^p$ is a new input vector and $Y$ is the associated response. Persistence is a weaker condition than sparsistence, and in particular does not assume that the true model is linear.
Persistence (Theorem 4.1): Given a sequence of sets of estimators $B_{n,m}$, the sequence of compressed lasso estimators
$$\widehat{\beta}_{n,m} = \arg\min_{\|\beta\|_1 \le L_{n,m}} \; \frac{1}{m}\bigl\|\widetilde{Y} - \widetilde{X}\beta\bigr\|_2^2$$
is persistent with respect to the oracle risk over uncompressed data, meaning that
$$R(\widehat{\beta}_{n,m}) - \inf_{\beta \in B_{n,m}} R(\beta) \stackrel{P}{\longrightarrow} 0, \quad \text{as } n \to \infty, \tag{1.12}$$
in case $\log^2(np) \le m \le n$ and the radius of the $\ell_1$ ball satisfies $L_{n,m} = o\bigl((m/\log(np))^{1/4}\bigr)$.
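To make the predictive risk in (1.10) concrete, here is a Monte Carlo estimate under a toy setup (a hypothetical model chosen for illustration; the demo uses a linear truth with identity design covariance, under which the excess risk of any $\beta$ is exactly $\|\beta - \beta^*\|_2^2$, even though persistence itself does not require linearity):

```python
import numpy as np

rng = np.random.default_rng(3)
p, s = 20, 3
beta_true = np.zeros(p)
beta_true[:s] = [2.0, -1.5, 1.0]

def risk(beta, n_mc=200_000):
    """Monte Carlo estimate of the predictive risk R(beta) = E(Y - X^T beta)^2
    for a fresh draw (X, Y) with Y = X^T beta_true + eps, eps ~ N(0, 1)."""
    X = rng.normal(size=(n_mc, p))
    Y = X @ beta_true + rng.normal(size=n_mc)
    return np.mean((Y - X @ beta) ** 2)

# The oracle coefficients attain the noise floor sigma^2 = 1; any other
# beta pays excess risk ||beta - beta_true||^2 under this identity design.
print(risk(beta_true))        # ~ 1.0
print(risk(np.zeros(p)))      # ~ 1 + ||beta_true||^2 = 8.25
```

Persistence says the compressed lasso's risk approaches the best achievable risk over the $\ell_1$ ball, without requiring that any "true" $\beta$ exists.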
Our third result analyzes the privacy properties of compressed regression. We consider the problem of recovering the uncompressed data $X$ from the compressed data $\widetilde{X} = \Phi X + \Delta$. To preserve privacy, the random matrices $\Phi$ and $\Delta$ should remain private. However, even in the case where $\Delta = 0$ and $\Phi$ is known, if $m \ll \min(n, p)$ the linear system $\widetilde{X} = \Phi X$ is highly under-determined, so exact recovery of $X$ is not possible.
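The under-determination is easy to demonstrate: an adversary who knows $\Phi$ exactly can only pin each column of $X$ down to within an $(n-m)$-dimensional affine subspace, and the minimum-norm least-squares reconstruction recovers almost nothing. (A numpy sketch with arbitrary dimensions; this illustrates only the linear-algebraic obstruction, not the paper's information-theoretic bounds.)

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, m = 300, 10, 30            # m << n: far fewer equations than unknowns

X = rng.normal(size=(n, p))      # private data
Phi = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))
X_tilde = Phi @ X                # adversary sees X_tilde (and, worst case, Phi)

# Phi Z = X_tilde has an (n - m)-dimensional affine solution space per
# column; lstsq returns only the minimum-norm solution, the projection
# of X's columns onto the m-dimensional row space of Phi.
X_hat, *_ = np.linalg.lstsq(Phi, X_tilde, rcond=None)

rel_err = np.linalg.norm(X_hat - X) / np.linalg.norm(X)
print(rel_err)                   # close to 1: essentially nothing recovered
```

With $m/n = 0.1$, the best linear reconstruction captures only about a tenth of $X$'s energy, leaving a relative error near $\sqrt{1 - m/n} \approx 0.95$.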
…(Full text truncated)…