The Sample Complexity of Dictionary Learning

Reading time: 6 minutes

📝 Original Info

  • Title: The Sample Complexity of Dictionary Learning
  • ArXiv ID: 1011.5395
  • Date: 2015-03-01
  • Authors: D. Vainsencher, S. Mannor, A. M. Bruckstein

📝 Abstract

A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a set of signals to be represented. Can we expect that the representation found by such a dictionary for a previously unseen example from the same source will have L_2 error of the same magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study this question from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L_2 error in representation when the dictionary is used. For the case of l_1-regularized coefficient selection we provide a generalization bound of the order of O(sqrt(np log(m lambda)/m)), where n is the dimension, p is the number of elements in the dictionary, lambda is a bound on the l_1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(sqrt(np log(m k)/m)) under an assumption on the level of orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results further yield fast rates of order 1/m as opposed to 1/sqrt(m) using localized Rademacher complexity. We provide similar results in a general setting using kernels with weak smoothness requirements.


📄 Full Content

In processing signals from X = R^n it is now a common technique to use sparse representations; that is, to approximate each signal x by a "small" linear combination a of elements d_i from a dictionary D ∈ X^p, so that x ≈ Da = Σ_{i=1}^p a_i d_i. This has various uses detailed in Section 1.1. The smallness of a is often measured using either ||a||_1, or the number of non-zero elements in a, often denoted ||a||_0. The approximation error is measured here using a Euclidean norm appropriate to the vector space. We denote the approximation error of x using dictionary D and coefficients from A as

h_{A,D}(x) = min_{a ∈ A} ||Da − x||,    (1.1)

where A is one of the following sets determining the sparsity required of the representation:

H_k = {a : ||a||_0 ≤ k} induces a "hard" sparsity constraint, which we also call k-sparse representation, while R_λ = {a : ||a||_1 ≤ λ} induces a convex constraint that is a "relaxation" of the previous constraint.
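For concreteness, here is a minimal numpy/scipy sketch (ours, not from the paper; the function names, the support enumeration, and the SLSQP solver are arbitrary choices) that evaluates h_{A,D}(x) exactly for A = H_k on small problems, and for A = R_λ via the standard split into positive and negative parts:

```python
import numpy as np
from itertools import combinations
from scipy.optimize import minimize

def h_hard(D, x, k):
    """Exact h_{H_k,D}(x): enumerate all size-k supports (feasible only for small p)."""
    best = np.linalg.norm(x)  # a = 0 is always admissible
    for S in combinations(range(D.shape[1]), k):
        a, *_ = np.linalg.lstsq(D[:, list(S)], x, rcond=None)
        best = min(best, np.linalg.norm(x - D[:, list(S)] @ a))
    return best

def h_l1(D, x, lam):
    """h_{R_lambda,D}(x): min ||Da - x||_2 s.t. ||a||_1 <= lam.
    Split a = u - v with u, v >= 0, so the l1 ball becomes a linear constraint."""
    p = D.shape[1]
    obj = lambda z: np.linalg.norm(D @ (z[:p] - z[p:]) - x)
    cons = {"type": "ineq", "fun": lambda z: lam - z.sum()}
    res = minimize(obj, np.zeros(2 * p), bounds=[(0, None)] * (2 * p),
                   constraints=cons, method="SLSQP")
    return res.fun

# Example: three atoms in R^2, representation with at most one atom.
D = np.array([[1.0, 0.0, 0.6],
              [0.0, 1.0, 0.8]])
x = np.array([1.0, 1.0])
print(h_hard(D, x, k=1))     # 0.2: the third atom nearly spans x
print(h_l1(D, x, lam=1.4))
```

The enumeration in h_hard is exponential in k, which is one reason the convex relaxation R_λ is attractive in practice.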

The dictionary learning problem is to find a dictionary D minimizing the expected representation error

E(D) = E_{x∼ν} h_{A,D}(x),    (1.2)

where ν is a distribution over signals that is known to us only through samples from it. The problem addressed in this paper is the "generalization" (in the statistical learning sense) of dictionary learning: to what extent does the performance of a dictionary chosen based on a finite set of samples indicate its expected error in (1.2)? This clearly depends on the number of samples and other parameters of the problem such as dictionary size. In particular, an obvious algorithm is to represent each sample using itself, if the dictionary is allowed to be as large as the sample; but the performance on unseen signals is likely to disappoint.
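The paper studies generalization for any dictionary fit to the data rather than a particular algorithm. Purely as an illustration of minimizing the empirical counterpart of (1.2), here is a hedged sketch using scikit-learn's DictionaryLearning, which optimizes an l1-penalized (Lagrangian) variant of the R_λ-constrained objective by alternating between sparse coding and dictionary updates; all sizes and the alpha value are arbitrary choices of ours:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))   # m = 200 synthetic signals in R^16

# scikit-learn minimizes (1/2)||x - Da||^2 + alpha * ||a||_1 per signal, the
# Lagrangian form of the R_lambda constraint; atoms are the ROWS of
# components_, i.e., the transpose of the column convention used above.
dl = DictionaryLearning(n_components=32, alpha=0.5, max_iter=50, random_state=0)
A = dl.fit_transform(X)              # coefficients, shape (m, p) = (200, 32)
D = dl.components_                   # learned dictionary, shape (p, n) = (32, 16)

E_m = np.linalg.norm(X - A @ D, axis=1).mean()  # empirical representation error
print(E_m)
```

The empirical error E_m(D) computed on the last line is exactly the quantity whose gap to the expected error E(D) the paper's bounds control.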

To state our goal more quantitatively, assume that an algorithm finds a dictionary D suited to k-sparse representation, in the sense that the average representation error E_m(D) on the m examples it is given is low. Our goal is to bound the generalization error ε, which is the additional expected error that might be incurred:

E(D) ≤ (1 + η) E_m(D) + ε,

where η ≥ 0 is sometimes zero, and the bound depends on the number of samples and problem parameters. Since algorithms that find the optimal dictionary for a given set of samples (also known as empirical risk minimization, or ERM, algorithms) are not known for dictionary learning, we prove uniform convergence bounds that apply simultaneously over all admissible dictionaries D, thus bounding from above the sample complexity of the dictionary learning problem.

Many analytic and algorithmic methods relying on the properties of finite dimensional Euclidean geometry can be applied in more general settings through kernel methods. These consist of treating objects that are not naturally represented in R^n as having their similarity described by an inner product in an abstract feature space that is Euclidean. This allows the application of algorithms that depend on the data only through inner products to such diverse objects as graphs, DNA sequences and text documents, which are not naturally represented using vector spaces (Shawe-Taylor and Cristianini, 2004). Is it possible to extend the usefulness of dictionary learning techniques to this setting? We address sample complexity aspects of this question as well.
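To get a feel for the l1 rate quoted in the abstract, the following sketch (ours; multiplicative constants and the confidence term are suppressed, so the values are meaningful only relative to one another) evaluates sqrt(np log(mλ)/m) for growing sample sizes:

```python
import numpy as np

def l1_rate(n, p, lam, m):
    """Order of the l1 generalization bound, sqrt(n p log(m lam) / m);
    constants and the failure-probability term are suppressed."""
    return np.sqrt(n * p * np.log(m * lam) / m)

for m in (10**3, 10**4, 10**5, 10**6):
    print(f"m = {m:>7}: rate ~ {l1_rate(n=64, p=256, lam=4.0, m=m):.3f}")
```

The rate decays essentially as 1/sqrt(m), and is vacuous for small m relative to np; the localized Rademacher analysis mentioned in the abstract improves the order to 1/m under additional conditions.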

Sparse representations are a standard practice in diverse fields such as signal processing, natural language processing, etc. Typically, the dictionary is assumed to be known. The motivation for sparse representations is indicated by the following results, in which we assume the signals come from X = R^n and the representation coefficients from A = H_k, where k < n, p, and typically h_{A,D}(x) ≪ 1.

• Compression: If a signal x has an approximate sparse representation in some commonly known dictionary D, then by definition, storing or transmitting the sparse representation will not cause large error.

• Representation: If a signal x has an approximate sparse representation in a dictionary D that fulfills certain geometric conditions, then its sparse representation is unique and can be found efficiently (Bruckstein et al., 2009).

• Denoising: If a signal x has a sparse representation in some known dictionary D, and we observe x̃ = x + ν, where the random noise ν is Gaussian, then the sparse representation found for x̃ will likely be very close to x (for example Chen et al., 2001); a toy illustration appears after this list.

• Compressed sensing: Assuming that a signal x has a sparse representation in some known dictionary D that fulfills certain geometric conditions, this representation can be approximately retrieved with high probability from a small number of random linear measurements of x. The number of measurements needed depends on the sparsity of x in D (Candes and Tao, 2006).
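As a toy illustration of the Representation and Denoising bullets above (our sketch; orthogonal matching pursuit is one standard greedy method of the kind surveyed by Bruckstein et al., 2009, not a contribution of this paper), the following recovers a k-sparse signal from a noisy observation of it:

```python
import numpy as np

def omp(D, x, k):
    """Orthogonal matching pursuit: greedily build a size-k support, then
    least-squares fit on it. Assumes unit-norm columns (atoms) in D."""
    residual, support, coef = x.copy(), [], np.zeros(0)
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))  # most correlated atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    a = np.zeros(D.shape[1])
    a[support] = coef
    return a

rng = np.random.default_rng(0)
n, p, k = 64, 128, 4
D = rng.standard_normal((n, p))
D /= np.linalg.norm(D, axis=0)        # normalize atoms

a_true = np.zeros(p)
a_true[rng.choice(p, size=k, replace=False)] = rng.standard_normal(k)
x_clean = D @ a_true
x_noisy = x_clean + 0.05 * rng.standard_normal(n)   # Gaussian noise

x_hat = D @ omp(D, x_noisy, k)
print(np.linalg.norm(x_hat - x_clean))    # typically well below the noise norm
print(np.linalg.norm(x_noisy - x_clean))  # the noise norm, for comparison
```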

The implications of these results are significant when a dictionary D is known that sparsely represents many signals simultaneously. In some applications the dictionary is chosen based on prior knowledge, but in many applications the dictionary is learned based on a finite set of examples.


Reference

This content is AI-processed from open access ArXiv data.
