On Optimality Conditions for Auto-Encoder Signal Recovery
Devansh Arpit* (Department of Computer Science, SUNY Buffalo, devansha@buffalo.edu), Yingbo Zhou* (Salesforce Research, San Francisco Bay Area, yingbo.zhou@salesforce.com), Hung Q. Ngo (LogicBlox, hungngo@buffalo.edu), Nils Napp (Department of Computer Science, SUNY Buffalo, nnapp@buffalo.edu), Venu Govindaraju (Department of Computer Science, SUNY Buffalo, govind@buffalo.edu). *Equal contribution.

Abstract

Auto-encoders are unsupervised models that aim to learn patterns from observed data by minimizing a reconstruction cost. The useful representations they learn are often found to be sparse and distributed. Compressed sensing and sparse coding, on the other hand, assume a data generating process in which the observed data arise from some true latent signal source, and try to recover the corresponding signal from measurements. Viewing auto-encoders from this signal recovery perspective yields a more coherent picture of these techniques. In particular, we show in this paper that the true hidden representation can be approximately recovered if the weight matrix is highly incoherent with unit $\ell_2$ row length and the bias vector is set (approximately) to the negative of the data mean. The recovery also becomes more accurate as the hidden signals become sparser. Additionally, we demonstrate empirically that auto-encoders are capable of recovering the data generating dictionary when only data samples are given.

1 Introduction

Recovering a hidden signal from measurement vectors (observations) is a long-studied problem in compressed sensing and sparse coding, with many successful applications. Auto-encoders (AEs) (Bourlard and Kamp, 1988), on the other hand, are used in unsupervised representation learning to uncover patterns in data.
AEs focus on learning a mapping $x \mapsto h \mapsto \hat{x}$, where the reconstruction $\hat{x}$ should be as close to $x$ as possible over the entire data distribution. We show in this paper that if $x$ is actually generated from some true sparse signal $h$ by some process (see section 3), then switching perspective and analyzing $h \mapsto x \mapsto \hat{h}$ reveals that an AE is capable of recovering the true signal that generated the data, and yields useful insights into the optimality of auto-encoder parameters in terms of signal recovery. In other words, this perspective lets us view AEs from a signal recovery point of view in which forward propagating $x$ recovers the true signal $h$. We analyze the conditions under which the encoder of an AE recovers the true $h$ from $x$, while the decoder acts as the data generating process. Our main result shows that the true sparse signal $h$ (under mild distributional assumptions) can be approximately recovered by the encoder of an AE with high probability, under certain conditions on the weight matrix and bias vectors. Additionally, we show empirically that in a practical setting where only data are observed, optimizing the AE objective recovers both the data generating dictionary $W$ and the true sparse signal $h$; to the best of our knowledge, recovering the two together has not been well studied in the auto-encoder framework.

2 Sparse Signal Recovery Perspective

It is known, both empirically and theoretically, that the useful features learned by AEs are usually sparse (Memisevic et al., 2014; Nair and Hinton, 2010; Arpit et al., 2016). An important question that has not yet been answered is whether AEs are capable of recovering sparse signals in general. This question matters for sparse coding, which entails recovering the sparsest $h$ that approximately satisfies $x = W^T h$, for any given data vector $x$ and overcomplete weight matrix $W$.
However, since this problem is NP-complete (Amaldi and Kann, 1998), it is usually relaxed to an expensive optimization problem (Candes et al., 2006; Candes and Tao, 2006),

\[ \arg\min_h \; \|x - W^T h\|^2 + \lambda \|h\|_1 \quad (1) \]

where $W \in \mathbb{R}^{m \times n}$ is a fixed overcomplete ($m > n$) dictionary, $\lambda$ is the regularization coefficient, $x \in \mathbb{R}^n$ is the data and $h \in \mathbb{R}^m$ is the signal to recover. For this special case, Makhzani and Frey (2013) analyzed the conditions under which linear AEs can recover the support of the hidden signal. The general AE objective, on the other hand, minimizes the expected reconstruction cost

\[ J_{AE} = \min_{W, b_e, b_d} \; \mathbb{E}_x \, L\big(x, \; s_d(W^T s_e(W x + b_e) + b_d)\big) \quad (2) \]

for some reconstruction cost $L$, encoding and decoding activation functions $s_e(\cdot)$ and $s_d(\cdot)$, and bias vectors $b_e$ and $b_d$. In this paper we consider a linear decoding activation $s_d$, as it is the more general case; note, however, that in auto-encoders the activation functions can in general be non-linear, in contrast to the sparse coding objective. In addition, AEs do not have a separate parameter $h$ for the hidden representation of every data sample $x$ individually; instead, the hidden representation of every sample is a parametric function of the sample itself. This is an important distinction between the optimization in eq. 1 and our problem: the identity of $h$ in eq. 1 is only well defined in the presence of $\ell_1$ regularization, due to the overcompleteness of the dictionary. In our problem, by contrast, we assume a true signal $h$ generates the observed data $x$ as $x = W^T h + b_d$, where the dictionary $W$ and bias vector $b_d$ are fixed. Hence, what we mean by recovery of sparse signals in the AE framework is: if we generate data using the above process, can the estimate $\hat{h} = s_e(W x + b_e)$ indeed recover the true $h$ for some activation function $s_e(\cdot)$ and bias vector $b_e$?
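Concretely, the estimate in question is one affine map followed by the encoding non-linearity. A minimal numpy sketch (the function name is ours, and $s_e$ is taken to be ReLU purely for illustration):

```python
import numpy as np

def encoder_estimate(x, W, b_e, s_e=lambda a: np.maximum(0.0, a)):
    """Compute h_hat = s_e(W x + b_e); s_e defaults to ReLU."""
    return s_e(W @ x + b_e)

# A trivially recoverable case: orthonormal rows (W = I) and zero bias,
# so the encoder inverts the generation x = W^T h exactly.
W = np.eye(4)
h_true = np.array([0.0, 0.7, 0.0, 0.2])  # non-negative sparse signal
x = W.T @ h_true
print(np.allclose(encoder_estimate(x, W, np.zeros(4)), h_true))  # True
```

The remainder of the paper concerns when this one-step estimate stays accurate for overcomplete, merely incoherent $W$.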
And if so, what properties of $W$, $b_e$, $s_e(\cdot)$ and $h$ lead to good recovery? Note that, given $x$ and the true overcomplete $W$, the solution $h$ to $x = W^T h$ is not unique, so the question arises whether such an $h$ can be recovered at all. As we show, however, recovery using the AE mechanism is strongest when the signal $h$ is the sparsest possible one, which, by compressed sensing theory, guarantees uniqueness of $h$ if $W$ is sufficiently incoherent.¹

3 Data Generation Process

We consider the following data generation process:

\[ x = W^T h + b_d + e \quad (3) \]

where $x \in \mathbb{R}^n$ is the observed data, $b_d \in \mathbb{R}^n$ is a bias vector, $e \in \mathbb{R}^n$ is a noise vector, $W \in \mathbb{R}^{m \times n}$ is the weight matrix and $h \in \mathbb{R}^m$ is the true hidden representation (signal) that we want to recover. Throughout our analysis, we assume that the signal $h$ belongs to the following class of distributions.

Assumption 1 (Bounded Independent Non-negative Sparse, BINS). Every hidden unit $h_j$ is an independent random variable with density function

\[ f(h_j) = \begin{cases} (1 - p_j)\,\delta_0(h_j) & \text{if } h_j = 0 \\ p_j\, f_c(h_j) & \text{if } h_j \in (0, l_{\max_j}] \end{cases} \quad (4) \]

where $f_c(\cdot)$ can be any arbitrary normalized distribution bounded in the interval $(0, l_{\max_j}]$ with mean $\mu_{h_j}$, and $\delta_0(\cdot)$ is the Dirac delta function at zero.

¹ Coherence is defined as $\max_{i \neq j} \frac{|W_i^T W_j|}{\|W_i\| \|W_j\|}$.

As shorthand, we say that $h_j$ follows the distribution BINS($p, f_c, \mu_h, l_{\max}$). Notice that $\mathbb{E}[h_j] = p_j \mu_{h_j}$. This distributional assumption fits naturally with sparse coding when the intended signal is non-negative sparse. From the AE perspective, it is also justified by the following observation: in neural networks with ReLU activations, hidden unit pre-activations have a Gaussian-like symmetric distribution (Hyvärinen and Oja, 2000; Ioffe and Szegedy, 2015).
If we assume these distributions are mean centered,² then the distribution of the hidden units after ReLU has a large mass at 0, while the rest of the mass concentrates in $(0, l_{\max}]$ for some finite positive $l_{\max}$, because the pre-activations concentrate symmetrically around zero. As we show in the next section, ReLU is indeed capable of recovering such signals. As a side note, the distribution of assumption 1 can take shapes similar to the Exponential or Rectified Gaussian distributions³ (which are commonly used for modeling biological neurons) but is simpler to analyze, because we allow $f_c(\cdot)$ to be any arbitrary normalized distribution. The only restriction assumption 1 imposes is boundedness. This does not significantly reduce the representative power of the distribution because: a) the distributions used for modeling neurons have very small tail mass; b) in practice, we are generally interested in signals with upper-bounded values.

The generation process considered in this section (i.e., eq. 3 and assumption 1) is justified because:

1. This data generation model finds applications in a number of areas (Yang et al., 2009; Kavukcuoglu et al., 2010; Wright et al., 2009). Notice that while $x$ is the measurement vector (observed data), which can in general be noisy, $h$ denotes the actual signal (internal representation): it reflects the combination of dictionary ($W^T$) atoms involved in generating the observed samples and hence serves as the true identity of the data.

2. Sparse distributed representations (Hinton, 1984) are both observed and desired in hidden representations. It has been shown empirically that representations that are truly sparse (i.e., with a large number of exact zeros) and distributed usually yield better linear separability and performance (Glorot et al., 2011; Wright et al., 2009; Yang et al., 2009).
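For concreteness, BINS is straightforward to simulate. A short sketch, taking $f_c$ to be the uniform distribution on $(0, l_{\max}]$ (one concrete choice; assumption 1 allows any bounded $f_c$, and all names here are ours):

```python
import numpy as np

def sample_bins(m, n_samples, p=0.05, l_max=1.0, rng=None):
    """Sample from BINS(p, f_c, mu_h, l_max) with f_c taken to be
    uniform on (0, l_max] -- one concrete choice of the bounded f_c
    allowed by assumption 1."""
    rng = rng or np.random.default_rng(0)
    support = rng.random((n_samples, m)) < p          # unit active w.p. p
    values = rng.uniform(0.0, l_max, (n_samples, m))  # draws from f_c
    return support * values

H = sample_bins(m=200, n_samples=5000)
print(abs((H > 0).mean() - 0.05) < 0.01)   # empirical sparsity is about p
print(abs(H[H > 0].mean() - 0.5) < 0.05)   # active values average l_max/2
```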
Decoding bias ($b_d$): Consider the data generation process (excluding noise for now) $x = W^T h + b_d$. Here $b_d$ is a bias vector that can take any arbitrary value but, like $W$, is fixed for any particular data generation process. The following remark shows that if an AE can recover the sparse code $h$ from a data sample generated as $x = W^T h$, then it is also capable of recovering the sparse code from data generated as $x = W^T h + b_d$, and vice versa.

Remark 1. Let $x_1 = W^T h$ where $x_1 \in \mathbb{R}^n$, $W \in \mathbb{R}^{m \times n}$ and $h \in \mathbb{R}^m$. Let $x_2 = W^T h + b_d$ where $b_d \in \mathbb{R}^n$ is a fixed vector. Let $\hat{h}_1 = s_e(W x_1 + b)$ and $\hat{h}_2 = s_e(W x_2 + b - W b_d)$. Then $\hat{h}_1 = h$ iff $\hat{h}_2 = h$.

Thus, without any loss of generality, we will assume our data is generated as $x = W^T h + e$.

4 Signal Recovery Analysis

We analyze two separate classes of signals that follow BINS: continuous sparse and binary sparse. For notational convenience, we drop the subscript of $b_e$ and simply refer to this parameter as $b$, since it is the only bias vector remaining (we are not considering $b_d$, by remark 1). The auto-encoder signal recovery mechanism that we analyze throughout this paper is defined as follows.

Definition 1. Let a data sample $x \in \mathbb{R}^n$ be generated by the process $x = W^T h + e$, where $W \in \mathbb{R}^{m \times n}$ is a fixed matrix, $e$ is noise and $h \in \mathbb{R}^m$. Then the auto-encoder signal recovery mechanism $\hat{h}_{s_e}(x; W, b_e)$ recovers the estimate $\hat{h} = s_e(W x + b_e)$, where $s_e(\cdot)$ is an activation function.

4.1 Binary Sparse Signal Analysis

First we consider the noiseless case of data generation.
Theorem 1 (Noiseless Binary Signal Recovery). Let each element of $h$ follow BINS($p, \delta_1, \mu_h, l_{\max}$) and let $\hat{h} \in \mathbb{R}^m$ be recovered by the auto-encoder signal recovery mechanism with Sigmoid activation function and bias $b$ from a measurement vector $x \in \mathbb{R}^n$ such that $x = W^T h$. If we set $b_i = -\sum_j a_{ij} p_j \; \forall i \in [m]$, then $\forall \delta \in (0, 1)$,

\[ \Pr\!\left( \tfrac{1}{m} \|\hat{h} - h\|_1 \leq \delta \right) \geq 1 - \sum_{i=1}^m \left( (1 - p_i)\, e^{-\frac{2(\delta_0 + p_i a_{ii})^2}{\sum_{j \neq i} a_{ij}^2}} + p_i\, e^{-\frac{2(\delta_0 + (1 - p_i) a_{ii})^2}{\sum_{j \neq i} a_{ij}^2}} \right) \quad (5) \]

where $a_{ij} = W_i^T W_j$, $\delta_0 = \ln\!\big(\frac{\delta}{1 - \delta}\big)$, and $W_i$ is the $i$-th row of the matrix $W$ cast as a column vector.

Analysis: We first analyze the properties of the weight matrix $W$ that result in a strong recovery bound. Notice that the terms $(\delta_0 + p_i a_{ii})^2$ and $(\delta_0 + (1 - p_i) a_{ii})^2$ need to be as large as possible, while simultaneously the term $\sum_{j \neq i} a_{ij}^2$ needs to be as close to zero as possible. For the sake of analysis, let us set⁴ $\delta_0 = 0$ (achieved when $\delta = 0.5$). The problem then reduces to maximizing the ratio

\[ \frac{a_{ii}^2}{\sum_{j \neq i} a_{ij}^2} = \frac{\|W_i\|^4}{\sum_{j \neq i} (W_i^T W_j)^2} = \frac{\|W_i\|^2}{\sum_{j \neq i} \|W_j\|^2 \cos^2 \theta_{ij}} \]

where $\theta_{ij}$ is the angle between $W_i$ and $W_j$. By the definition of coherence, if the rows of the weight matrix are highly incoherent, then each $\cos \theta_{ij}$ is close to 0. Again for ease of analysis, let us replace each $\cos \theta_{ij}$ with a small positive number $\epsilon$. Then

\[ \frac{a_{ii}^2}{\sum_{j \neq i} a_{ij}^2} \approx \frac{\|W_i\|^2}{\epsilon^2 \sum_{j \neq i} \|W_j\|^2} = \frac{1}{\epsilon^2 \sum_{j \neq i} \|W_j\|^2 / \|W_i\|^2}. \]

² This happens, for instance, as a result of the Batch Normalization (Ioffe and Szegedy, 2015) technique, which leads to significantly faster convergence; it is thus good practice to have a mean-centered pre-activation distribution.
³ Depending on the distribution $f_c(\cdot)$.
Finally, since we would want this ratio to be maximized equally for each hidden unit $h_i$, the obvious choice is to set each weight length $\|W_i\|$ ($i \in [m]$) to 1. Now consider the bias vector. We have instantiated each element of the encoding bias as $b_i = -\sum_j a_{ij} p_j$. Since $p_j$ is essentially the mean of the binary hidden unit $h_j$, we can write $b_i = -\sum_j a_{ij} \mathbb{E}[h_j] = -W_i^T W^T \mathbb{E}[h] = -W_i^T \mathbb{E}[x]$. Signal recovery for binary signals is therefore strong when the recovery mechanism is given by

\[ \hat{h}_i \triangleq \mathrm{Sigmoid}\big( W_i^T (x - \mathbb{E}[x]) \big) \quad (6) \]

where the rows of $W$ are highly incoherent and of unit length ($\|W_i\|_2 = 1$), and each dimension of the data $x$ is approximately uncorrelated (see theorem 3). We now state the recovery bound for the noisy data generation scenario.

Proposition 1 (Noisy Binary Signal Recovery). Let each element of $h$ follow BINS($p, \delta_1, \mu_h, l_{\max}$) and let $\hat{h} \in \mathbb{R}^m$ be recovered by the auto-encoder signal recovery mechanism with Sigmoid activation function and bias $b$ from a measurement vector $x = W^T h + e$, where $e \in \mathbb{R}^n$ is any noise vector independent of $h$. If we set $b_i = -\sum_j a_{ij} p_j - W_i^T \mathbb{E}[e] \; \forall i \in [m]$, then $\forall \delta \in (0, 1)$,

\[ \Pr\!\left( \tfrac{1}{m} \|\hat{h} - h\|_1 \leq \delta \right) \geq 1 - \sum_{i=1}^m \left( (1 - p_i)\, e^{-\frac{2(\delta_0 - W_i^T(e - \mathbb{E}[e]) + p_i a_{ii})^2}{\sum_{j \neq i} a_{ij}^2}} + p_i\, e^{-\frac{2(\delta_0 - W_i^T(e - \mathbb{E}[e]) + (1 - p_i) a_{ii})^2}{\sum_{j \neq i} a_{ij}^2}} \right) \quad (7\text{-}8) \]

where $a_{ij} = W_i^T W_j$, $\delta_0 = \ln\!\big(\frac{\delta}{1 - \delta}\big)$, and $W_i$ is the $i$-th row of $W$ cast as a column vector.

We have not assumed any distribution for the noise variable $e$, and this term has no effect on recovery (compared to the noiseless case) if the noise distribution is orthogonal to the hidden weight vectors. Again, the same properties of $W$ lead to better recovery as in the noiseless case. As for the bias, we have set each element to $b_i \triangleq -\sum_j a_{ij} p_j - W_i^T \mathbb{E}[e] \; \forall i \in [m]$.
Notice from the definition of BINS that here $\mathbb{E}[h_j] = p_j$. Thus in essence $b_i = -\sum_j a_{ij} \mathbb{E}[h_j] - W_i^T \mathbb{E}[e]$, and expanding $a_{ij}$ gives $b_i \triangleq -W_i^T W^T \mathbb{E}[h] - W_i^T \mathbb{E}[e] = -W_i^T \mathbb{E}[x]$. Thus the expression for the bias is unaffected by the error statistics as long as we can compute the data mean.

In this section, we first consider the case where the data $x$ is generated by the linear process $x = W^T h + e$, and show that if $W$ and the encoding bias $b$ have certain properties, the signal recovery bound (on $\|h - \hat{h}\|$) is strong. We then consider the case where data generated by a non-linear process $x = s_d(W^T h + b_d + e)$ (for a certain class of functions $s_d(\cdot)$) can be recovered by the same mechanism. For deep non-linear networks, this means that when the network parameters satisfy the required conditions, forward propagating data through the hidden layers makes each hidden layer recover the true signal that generated the corresponding data. We have moved all proofs to the appendix for better readability.

⁴ Setting $\delta = 0.5$ is not such a bad choice after all: for binary signals, we can recover the exact true signal with high probability by simply binarizing the signal recovered by the Sigmoid with some threshold.

4.2 Continuous Sparse Signal Recovery

Theorem 2 (Noiseless Continuous Signal Recovery). Let each element of $h \in \mathbb{R}^m$ follow the BINS($p, f_c, \mu_h, l_{\max}$) distribution and let $\hat{h}_{\mathrm{ReLU}}(x; W, b)$ be the auto-encoder signal recovery mechanism with Rectified Linear activation function (ReLU) and bias $b$ for a measurement vector $x \in \mathbb{R}^n$ such that $x = W^T h$.
If we set $b_i \triangleq -\sum_j a_{ij} p_j \mu_{h_j} \; \forall i \in [m]$, then $\forall \delta \geq 0$,

\[ \Pr\!\left( \tfrac{1}{m} \|\hat{h} - h\|_1 \leq \delta \right) \geq 1 - \sum_{i=1}^m \left( e^{-\frac{2(\delta + \sum_j (1 - p_j)(l_{\max_j} - 2 p_j \mu_{h_j}) \max(0,\, a_{ij}))^2}{\sum_j a_{ij}^2 l_{\max_j}^2}} + e^{-\frac{2(\delta + \sum_j (1 - p_j)(l_{\max_j} - 2 p_j \mu_{h_j}) \max(0,\, -a_{ij}))^2}{\sum_j a_{ij}^2 l_{\max_j}^2}} \right) \quad (9) \]

where the $a_i$'s are vectors such that

\[ a_{ij} = \begin{cases} W_i^T W_j & \text{if } i \neq j \\ W_i^T W_i - 1 & \text{if } i = j \end{cases} \quad (10) \]

and $W_i$ is the $i$-th row of the matrix $W$ cast as a column vector.

Analysis: We first analyze the properties of the weight matrix that result in a strong recovery bound. For strong recovery, the terms $(\delta + \sum_j (1 - p_j)(l_{\max_j} - 2 p_j \mu_{h_j}) \max(0, a_{ij}))^2$ and $(\delta + \sum_j (1 - p_j)(l_{\max_j} - 2 p_j \mu_{h_j}) \max(0, -a_{ij}))^2$ should be as large as possible, while simultaneously the term $\sum_j a_{ij}^2 l_{\max_j}^2$ should be as close to zero as possible. First, consider the term $(1 - p_j)(l_{\max_j} - 2 p_j \mu_{h_j})$. Since $\mu_{h_j} < l_{\max_j}$ by definition, both terms containing $(1 - p_j)(l_{\max_j} - 2 p_j \mu_{h_j})$ are always positive and contribute towards stronger recovery whenever $p_j$ is less than 50% (sparse), and the recovery becomes stronger as the signal becomes sparser (smaller $p_j$). Now if we assume the rows of the weight matrix $W$ are highly incoherent and each row has unit $\ell_2$ length, then each $a_{ij}$ ($\forall i, j \in [m]$) is close to 0, by the definition of $a_{ij}$ and the assumed properties of $W$. Then for any small positive value of $\delta$, we can approximately say

\[ \frac{\big(\delta + \sum_j (1 - p_j)(l_{\max_j} - 2 p_j \mu_{h_j}) \max(0,\, a_{ij})\big)^2}{\sum_j a_{ij}^2 l_{\max_j}^2} \approx \frac{\delta^2}{\sum_j a_{ij}^2 l_{\max_j}^2} \]

where each $a_{ij}$ is very close to zero; the same argument holds for the other term. Thus a strong signal recovery bound is obtained when the weight matrix is highly incoherent and all hidden weight vectors have unit length. As for the bias, we have set each element to $b_i \triangleq -\sum_j a_{ij} p_j \mu_{h_j} \; \forall i \in [m]$.
Notice from the definition of BINS that $\mathbb{E}[h_j] = p_j \mu_{h_j}$, so in essence $b_i = -\sum_j a_{ij} \mathbb{E}[h_j]$. Expanding $a_{ij}$, we get $b_i = -W_i^T W^T \mathbb{E}[h] + \mathbb{E}[h_i] = -W_i^T \mathbb{E}[x] + \mathbb{E}[h_i]$. The recovery bound for continuous signals is therefore strong when the recovery mechanism is set to

\[ \hat{h}_i \triangleq \mathrm{ReLU}\big( W_i^T (x - \mathbb{E}[x]) + \mathbb{E}[h_i] \big) \quad (11) \]

and the rows of $W$ are highly incoherent and of unit length ($\|W_i\|_2 = 1$). We now state the recovery bound for the noisy data generation scenario.

Proposition 2 (Noisy Continuous Signal Recovery). Let each element of $h \in \mathbb{R}^m$ follow the BINS($p, f_c, \mu_h, l_{\max}$) distribution and let $\hat{h}_{\mathrm{ReLU}}(x; W, b)$ be the auto-encoder signal recovery mechanism with ReLU activation and bias $b$ for a measurement vector $x \in \mathbb{R}^n$ such that $x = W^T h + e$, where $e$ is any noise random vector independent of $h$. If we set $b_i \triangleq -\sum_j a_{ij} p_j \mu_{h_j} - W_i^T \mathbb{E}[e] \; \forall i \in [m]$, then $\forall \delta \geq 0$,

\[ \Pr\!\left( \tfrac{1}{m} \|\hat{h} - h\|_1 \leq \delta \right) \geq 1 - \sum_{i=1}^m \left( e^{-\frac{2(\delta - W_i^T(e - \mathbb{E}[e]) + \sum_j (1 - p_j)(l_{\max_j} - 2 p_j \mu_{h_j}) \max(0,\, a_{ij}))^2}{\sum_j a_{ij}^2 l_{\max_j}^2}} + e^{-\frac{2(\delta - W_i^T(e - \mathbb{E}[e]) + \sum_j (1 - p_j)(l_{\max_j} - 2 p_j \mu_{h_j}) \max(0,\, -a_{ij}))^2}{\sum_j a_{ij}^2 l_{\max_j}^2}} \right) \quad (12) \]

where the $a_i$'s are vectors such that

\[ a_{ij} = \begin{cases} W_i^T W_j & \text{if } i \neq j \\ W_i^T W_i - 1 & \text{if } i = j \end{cases} \quad (13) \]

and $W_i$ is the $i$-th row of $W$ cast as a column vector.

Notice that we have not assumed any distribution for the noise variable $e$. As before, this term has no effect on recovery (compared to the noiseless case) if the noise distribution is orthogonal to the hidden weight vectors, and the same properties of $W$ lead to better recovery as in the noiseless case. In the case of bias, we have set each element to $b_i \triangleq -\sum_j a_{ij} p_j \mu_{h_j} - W_i^T \mathbb{E}[e] \; \forall i \in [m]$. From the definition of BINS, $\mathbb{E}[h_j] = p_j \mu_{h_j}$.
Thus $b_i = -\sum_j a_{ij} \mathbb{E}[h_j] - W_i^T \mathbb{E}[e]$, and expanding $a_{ij}$ gives $b_i \triangleq -W_i^T W^T \mathbb{E}[h] + \mathbb{E}[h_i] - W_i^T \mathbb{E}[e] = -W_i^T \mathbb{E}[x] + \mathbb{E}[h_i]$. Thus the expression for the bias is unaffected by the error statistics as long as we can compute the data mean (i.e., the recovery mechanism is the same as in eq. 11).

4.3 Properties of Generated Data

Since the data we observe results from the hidden signal via $x = W^T h$, it is interesting to analyze the distribution of the generated data. This provides insight into what kind of pre-processing would ensure stronger signal recovery.

Theorem 3 (Uncorrelated Distribution Bound). If data is generated as $x = W^T h$, where $h \in \mathbb{R}^m$ has covariance matrix $\mathrm{diag}(\zeta)$ ($\zeta \in \mathbb{R}_+^m$) and $W \in \mathbb{R}^{m \times n}$ ($m > n$) is such that each row of $W$ has unit length and the rows of $W$ are maximally incoherent, then the covariance matrix of the generated data is approximately spherical (uncorrelated), satisfying

\[ \min_\alpha \|\Sigma - \alpha I\|_F \leq \sqrt{\tfrac{1}{n} \big( m \|\zeta\|_2^2 - \|\zeta\|_1^2 \big)} \quad (14) \]

where $\Sigma = \mathbb{E}[(x - \mathbb{E}[x])(x - \mathbb{E}[x])^T]$ is the covariance matrix of the generated data.

Analysis: Notice that for any vector $v \in \mathbb{R}^m$, $m \|v\|_2^2 \geq \|v\|_1^2$, with equality when all elements of $v$ are identical. Hence data $x$ generated from a maximally incoherent dictionary $W$ (with unit $\ell_2$ row length) as $x = W^T h$ is guaranteed to be highly uncorrelated if $h$ is uncorrelated with near-identity covariance. This ensures that the hidden units at the following layer are also uncorrelated during training. Further, the covariance matrix of $x$ is (a multiple of) the identity if all hidden units have equal variance. This analysis serves as a justification for data whitening, in which data is processed to have zero mean and identity covariance matrix. Notice that although the generated data does not have zero mean, the recovery mechanism (eq. 11) subtracts the data mean, so this does not affect recovery.
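Theorem 3 is easy to probe numerically. The sketch below (construction and sizes are our own choices) builds an incoherent unit-row $W$ by orthogonalizing the columns of a Gaussian matrix, draws independent sparse $h$, and checks that the covariance of $x = W^T h$ is close to its best spherical approximation $\alpha I$, where the minimizer of $\|\Sigma - \alpha I\|_F$ is $\alpha = \mathrm{tr}(\Sigma)/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N = 200, 180, 20000

# incoherent rows (approximated via column-orthogonalization), unit l2 length
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))
W = Q / np.linalg.norm(Q, axis=1, keepdims=True)

# independent sparse h => diagonal covariance; data rows are x^T = h^T W
H = (rng.random((N, m)) < 0.05) * rng.uniform(0.0, 1.0, (N, m))
X = H @ W

Sigma = np.cov(X, rowvar=False)             # n x n covariance of x
alpha = np.trace(Sigma) / n                 # argmin_a ||Sigma - a I||_F
off = Sigma - alpha * np.eye(n)
print(np.linalg.norm(off) / np.linalg.norm(Sigma) < 0.2)  # near-spherical
```

The residual is small relative to $\|\Sigma\|_F$, consistent with the bound of eq. 14.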
4.4 Connections with Existing Work

Auto-Encoders (AEs): Our analysis reveals the conditions on the parameters of an AE that lead to strong recovery of $h$ (in both the continuous and binary cases), which ultimately implies low data reconstruction error. Note, however, that the above arguments hold for AEs from a recovery point of view. Training an AE on data may lead to learning the identity function, so AEs are usually trained with a bottleneck to make the learned representation useful. One such bottleneck is the de-noising criterion,

\[ J_{DAE} = \min_{W, b} \; \| x - W^T s_e(W \tilde{x} + b) \|^2 \quad (15) \]

where $s_e(\cdot)$ is the activation function and $\tilde{x}$ is a corrupted version of $x$. It has been shown that the Taylor expansion of the DAE objective (Theorem 3 of Arpit et al., 2016) contains the term $\sum_{j,k=1, j \neq k}^{m} \frac{\partial h_j}{\partial a_j} \frac{\partial h_k}{\partial a_k} (W_j^T W_k)^2$. If we constrain the weight vectors to have fixed length, this regularization term minimizes a weighted sum of the cosines of the angles between every pair of weight vectors; as a result, the weight vectors become increasingly incoherent. Hence we achieve both our goals by adding one additional constraint to the DAE: constraining the weight vectors to have unit length. Even without an explicit constraint, we can expect the weight lengths to be upper bounded by the basic AE objective itself, which would explain the learning of incoherent weights under the DAE regularization. As a side note, our analysis also justifies the use of tied weights in auto-encoders.

Sparse Coding (SC): SC involves minimizing $\|x - W^T h\|^2$ using the sparsest possible $h$. The analysis following theorem 2 shows that signal recovery using the AE mechanism becomes stronger for sparser signals (as also confirmed experimentally in section 5).
In other words, for any given data sample and weight matrix, as long as the conditions on the weight matrix and bias are met, the AE recovery mechanism recovers the sparsest possible signal, which justifies using auto-encoders for recovering sparse codes (see Henaff et al., 2011; Makhzani and Frey, 2013; Ng, 2011, for work along this line).

Independent Component Analysis (ICA) (Hyvärinen and Oja, 2000): ICA assumes we observe data generated by the process $x = W^T h$, where all elements of $h$ are independent and $W$ is a mixing matrix. The task of ICA is to recover both $W$ and $h$ given data. This data generating process is precisely what we assumed in section 3. Under this assumption, our results show 1) the properties of $W$ that allow such independent signals $h$ to be recovered; and 2) that auto-encoders can be used to recover both such signals and the weight matrix $W$.

k-Sparse AEs: Makhzani and Frey (2013) propose to zero out all values of hidden units smaller than the top-k values for each sample during training, in order to achieve sparsity in the learned hidden representation. This strategy is also justified from the perspective of our analysis: the PAC bound derived for signal recovery (theorem 2) shows that we recover a noisy version of the true sparse signal. Since the noise in each recovered signal unit is roughly proportional to the original value, such recovered signals can be de-noised by thresholding the hidden unit values (exploiting the fact that the signal is sparse), either with a fixed threshold or by picking the top k values.

Data Whitening: Theorem 3 shows that data generated from BINS signals and incoherent weight matrices is roughly uncorrelated. Thus recovering such signals with auto-encoders is easier if we pre-process the data to have uncorrelated dimensions.
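The recovery mechanism of eq. 11 can also be exercised end-to-end in a few lines. This is a sketch under our own choice of sizes and with empirical (rather than exact) expectations standing in for $\mathbb{E}[x]$ and $\mathbb{E}[h_i]$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, N, p = 200, 180, 5000, 0.02

# incoherent W with unit-norm rows: orthogonalize the columns of a
# Gaussian matrix, then rescale each row to unit l2 length
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))
W = Q / np.linalg.norm(Q, axis=1, keepdims=True)

# BINS signals (f_c uniform on (0,1]) and noiseless data x = W^T h
H = (rng.random((N, m)) < p) * rng.uniform(0.0, 1.0, (N, m))
X = H @ W                                   # rows are x^T = h^T W

# eq. 11: h_hat_i = ReLU(W_i^T (x - E[x]) + E[h_i]), with empirical means
H_hat = np.maximum(0.0, (X - X.mean(0)) @ W.T + H.mean(0))

err = np.abs(H_hat - H).mean()              # mean absolute recovery error
print(err < 0.05)                           # recovery is close to exact
```

The per-unit error stays small because each $\hat{h}_i$ equals $h_i$ plus cross-talk terms weighted by the (small) row coherences of $W$.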
5 Empirical Verification

We empirically verify the fundamental predictions made in section 4, which both justifies the assumptions we have made and confirms our results. We verify the following: a) the optimality, for AE signal recovery, of weight matrices $W$ whose rows have unit length and are highly incoherent; b) the effect of sparsity on AE signal recovery; and c) that in practice an AE can recover not only the true sparse signal $h$ but also the dictionary $W$ used to generate the data.

5.1 Optimal Properties of Weights and Bias

Our analysis of signal recovery in section 4 (eq. 11) shows that the recovery bound is strong when: a) the data generating weight matrix $W$ has rows of unit $\ell_2$ length; b) the rows of $W$ are highly incoherent; c) each bias vector element is set to the negative expectation of the pre-activation; d) the signal $h$ has independent dimensions. To verify this, we generate $N = 5{,}000$ signals $h \in \mathbb{R}^{m=200}$ from BINS($p = 0.02$, $f_c = \text{uniform}$, $\mu_h = 0.5$, $l_{\max} = 1$), with $f_c(\cdot)$ set to the uniform distribution for simplicity. We then generate the corresponding $5{,}000$ data samples $x = c\, W^T h \in \mathbb{R}^{180}$ using an incoherent weight matrix $W \in \mathbb{R}^{200 \times 180}$ (each element sampled from a zero-mean Gaussian, the columns then orthogonalized, and the $\ell_2$ length of each row rescaled to 1; note that the rows cannot be orthogonal). We then recover each signal using

\[ \hat{h}_i \triangleq \mathrm{ReLU}\big( c\, W_i^T (x - \mathbb{E}[x]) + \mathbb{E}[h_i] + \Delta b \big) \quad (16) \]

[Figure 1: Error heatmaps showing the optimal values of $c$ and $\Delta b$ for recovering continuous (left) and binary (right) signals using incoherent weights.]
[Figure 2: Average percentage recovery error for noisy signal recovery.]
where $c$ and $\Delta b$ are scalars that we vary over $[0.1, 2]$ and $[-1, +1]$ respectively. We also generate $N = 5{,}000$ binary signals $h \in \{0, 1\}^{m=200}$ from BINS($0.02$, $\delta_1$, $1$, $1$), i.e., with $f_c(\cdot)$ set to the Dirac delta function at 1. We then generate the corresponding $5{,}000$ data samples $x = c\, W^T h \in \mathbb{R}^{180}$ following the same procedure as in the continuous case, and recover the signal using

\[ \hat{h}_i \triangleq \sigma\big( c\, W_i^T (x - \mathbb{E}[x]) + \Delta b \big) \quad (17) \]

where $\sigma$ is the Sigmoid function. For the recovered signals, we calculate the Average Percentage Recovery Error (APRE) as

\[ \mathrm{APRE} = \frac{100}{N m} \sum_{i=1}^{N} \sum_{j=1}^{m} w_{h^i_j} \, \mathbb{1}\big( |\hat{h}^i_j - h^i_j| > \epsilon \big) \quad (18) \]

where we set $\epsilon$ to $0.1$ for continuous signals and $0$ for the binary case, $\mathbb{1}(\cdot)$ is the indicator function, $\hat{h}^i_j$ denotes the $j$-th dimension of the recovered signal corresponding to the $i$-th true signal, and

\[ w_{h^i_j} = \begin{cases} \frac{0.5}{p} & \text{if } h^i_j > 0 \\ \frac{0.5}{1 - p} & \text{if } h^i_j = 0 \end{cases} \quad (19) \]

The error is weighted by $w_{h^i_j}$ so that recovery errors for zero and non-zero $h^i_j$ are penalized equally. This is especially needed here because $h^i_j$ is sparse, so a low unweighted error could be achieved trivially by setting all recovered $\hat{h}^i_j$ to zero. Along with the incoherent weight matrix, we also generate data separately using a highly coherent weight matrix, obtained by sampling each element from the uniform distribution on $[0, 1]$ and scaling each row to unit length. According to our analysis, we should get the least error at $c = 1$ and $\Delta b = 0$ for the incoherent matrix, while the coherent matrix should yield both higher recovery error and a different (unpredictable) choice of $c$ and $\Delta b$. The error heatmaps for both continuous and binary recovery⁵ are shown in fig. 1. For the incoherent weight matrix, the empirical optimum is precisely $c = 1$ and $\Delta b = 0$ (exactly as predicted), with 0.21 and 0.0 APRE for continuous and binary recovery, respectively.
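The APRE metric of eqs. 18-19 is simple to implement; a sketch (the function name and toy example are ours):

```python
import numpy as np

def apre(H_hat, H, p, eps=0.1):
    """Average Percentage Recovery Error (eq. 18): count per-unit errors
    above eps, reweighted (eq. 19) so active and inactive units are
    penalized equally under sparsity."""
    w = np.where(H > 0, 0.5 / p, 0.5 / (1.0 - p))
    fail = (np.abs(H_hat - H) > eps).astype(float)
    return 100.0 / H.size * np.sum(w * fail)

H = np.array([[0.0, 0.8, 0.0, 0.0]])
print(apre(H, H, p=0.25))                 # 0.0 -- perfect recovery
print(apre(np.zeros_like(H), H, p=0.25))  # 50.0 -- misses the active unit
```

Note that an all-zero guess is penalized despite matching 75% of the (sparse) entries, which is exactly the point of the reweighting.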
It is interesting to note that binary recovery is quite robust to the choice of $c$ and $\Delta b$, because 1) the recovery is de-noised through thresholding, and 2) the binary signal inherently contains less information and is thus easier to recover. For the coherent weight matrix, we instead get 45.75 and 32.63 APRE (see fig. 5). We also experiment with the noisy recovery case, generating data using the incoherent weight matrix with $c = 1$ and $\Delta b = 0$ and adding, to each data dimension, independent Gaussian noise with mean 100 and standard deviation varying from 0.001 to 1. Both signal recovery schemes are quite robust against noise (see fig. 2); in particular, binary signal recovery is very robust, which conforms with our previous observation.

5.2 Effect of Sparsity on Signal Recovery

We analyze the effect of signal sparsity on recovery using the mechanism of section 4. To do so, we generate incoherent matrices using two different methods: Gaussian⁶ and

⁵ We use 0.55 as the threshold to binarize the signal recovered by the Sigmoid function.
⁶ Gaussian and Xavier (Glorot and Bengio, 2010) initializations become identical after weight length normalization.
We find that for all weight matrices, recovery error reduces with increasing sparsity (decreasing $p$; see fig. 3). Additionally, we find that both recovery schemes are robust against noise. We also find that the recovery error is almost always lower for orthogonal weight matrices, especially when the signal is sparse.⁷ Recall that theorem 2 suggests stronger recovery for more incoherent matrices. We therefore look into the row coherence of $W \in \mathbb{R}^{m \times n}$ sampled from the Gaussian and orthogonal methods with $m = 200$ and varying $n \in [100, 300]$. We found that orthogonally initialized matrices have significantly lower coherence even though the orthogonalization is done column-wise (see fig. 6). This explains the significantly lower recovery error for orthogonal matrices in fig. 3.

[Figure 3: Effect of signal sparseness on continuous (left) and binary (right) signal recovery; APRE vs. unit activation probability $p$ for orthogonal and Gaussian matrices with and without noise. "(noise)" indicates the generated data was corrupted with Gaussian noise. Sparser signals are recovered better.]

[Figure 4: Cosine similarity between greedily paired rows of $W$ and $\hat{W}$ over training epochs for continuous (left) and binary (right) recovery. The upper, mid and lower bars denote the 95th, 50th and 5th percentiles.]

5.3 Recovery of Data Dictionary

We showed the conditions on $W$ and $b$ for good recovery of the sparse signal $h$. In practice, however, one does not in general have access to $W$. Therefore, in this section, we empirically demonstrate that an AE can indeed recover both $W$ and $h$ by optimizing the AE objective.
We generate $50{,}000$ signals $h \in \mathbb{R}^{m=200}$ from the same BINS distribution as in section 5.1. The data are then generated as $x = W^T h$ using an incoherent weight matrix $W \in \mathbb{R}^{200 \times 180}$ (same as in section 5.1). We then recover the data dictionary $\hat{W}$ by
$$\hat{W} = \arg\min_{W} \mathbb{E}_x \big\| x - W^T s_e\big(W(x - \mathbb{E}_x[x])\big) \big\|^2, \quad \text{where } \|W_i\|_2^2 = 1 \;\;\forall i \quad (20)$$
Notice that although, given the sparse signal $h$, the data dictionary $W$ is unique (Hillar and Sommer, 2015), there are $m!$ equivalent solutions for $\hat{W}$, since we can permute the dimensions of $h$ in the AE. To check whether the original data dictionary is recovered, we therefore pair up the rows of $W$ and $\hat{W}$ by greedily selecting the pairs with the highest dot product. We then measure the goodness of the recovery by looking at the values of all the paired dot products. In addition, since we know the pairing, we can calculate APRE to evaluate the quality of the recovered hidden signal. As can be observed from fig. 4, by optimizing the AE objective we can recover the original data dictionary $W$ (almost all of the cosine similarities are 1). The final APRE achieved is 1.61 and 0.15 for continuous and binary signal recovery respectively, which is slightly worse than what we achieved in section 5.1. However, one should note that for this set of experiments we only observed the data $x$; no other information about $W$ was exposed. Not surprisingly, we again observed that binary signal recovery is more robust than its continuous counterpart, which may be attributed to its lower information content. We also experimented on noisy data and achieved performance similar to section 5.1 when the noise is not significant (see supplementary materials for details). These results strongly suggest that AEs are capable of recovering the true hidden signal in practice.
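The greedy row-pairing used above to resolve the $m!$ permutation ambiguity can be sketched as follows. The paper gives no pseudocode, so the function name and details are ours; this is one plausible implementation that repeatedly takes the globally best remaining (row, row) match.

```python
import numpy as np

def greedy_pair_rows(W, W_hat):
    """Greedily pair rows of W with rows of W_hat by largest dot product,
    handling the permutation ambiguity of the recovered dictionary."""
    S = W @ W_hat.T                    # S[i, k] = <W_i, W_hat_k>
    sims, pairs = [], {}
    for _ in range(len(W)):
        i, k = np.unravel_index(np.argmax(S), S.shape)
        pairs[i] = k
        sims.append(S[i, k])
        S[i, :] = -np.inf              # exclude this row of W ...
        S[:, k] = -np.inf              # ... and this row of W_hat
    return pairs, np.array(sims)

# Sanity check: a row-permuted copy of W should pair up perfectly.
rng = np.random.default_rng(0)
W = rng.standard_normal((50, 40))
W /= np.linalg.norm(W, axis=1, keepdims=True)
perm = rng.permutation(50)
pairs, sims = greedy_pair_rows(W, W[perm])
assert np.allclose(sims, 1.0)
assert all(perm[k] == i for i, k in pairs.items())
```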
⁷ Notice that the rows of $W$ are not orthogonal for overcomplete filters; rather, the columns are orthogonalized, unless $W$ is undercomplete.

6 Conclusion

In this paper we looked at the sparse signal recovery problem from the auto-encoder perspective and provided novel insights into the conditions under which AEs can recover such signals. In particular: 1) From the signal recovery standpoint, if we assume that the observed data is generated from some sparse hidden signal according to the assumed data generating process, then the true hidden representation can be approximately recovered if a) the weight matrix is highly incoherent with unit $\ell_2$ row length, and b) the bias vector is as described in equation 11 (theorem 2).⁸ The recovery also becomes more and more accurate with increasing sparsity of the hidden signal. 2) From the data generation perspective, we found that data generated from such signals (assumption 1) has the property of being roughly uncorrelated (theorem 3); thus, pre-processing the data to have uncorrelated dimensions may encourage stronger signal recovery. 3) Given only measurement data, we empirically show that the AE reconstruction objective recovers the data generating dictionary, and hence the true signal $h$. 4) These conditions and observations allow us to view various existing techniques, such as data whitening and independent component analysis, in a more coherent picture when considering signal recovery.

References

Edoardo Amaldi and Viggo Kann. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 209(1):237–260, 1998.
Devansh Arpit, Yingbo Zhou, Hung Ngo, and Venu Govindaraju. Why regularized auto-encoders learn sparse representation? In ICML, 2016.
H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4-5):291–294, 1988. ISSN 0340-1200.
Emmanuel J. Candes and Terence Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.
Emmanuel J. Candes, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In AISTATS, pages 315–323, 2011.
Mikael Henaff, Kevin Jarrett, Koray Kavukcuoglu, and Yann LeCun. Unsupervised learning of sparse features for scalable audio classification. ISMIR, 11:445, 2011.
Christopher J. Hillar and Friedrich T. Sommer. When can dictionary learning uniquely recover sparse data from subsamples? IEEE Transactions on Information Theory, 61(11):6290–6297, 2015.
Geoffrey E. Hinton. Distributed representations. 1984.
Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4):411–430, 2000.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis R. Bach and David M. Blei, editors, ICML, volume 37 of JMLR Proceedings, pages 448–456. JMLR.org, 2015.
Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun. Fast inference in sparse coding algorithms with applications to object recognition. arXiv preprint arXiv:1010.3467, 2010.
Alireza Makhzani and Brendan Frey. k-sparse autoencoders. CoRR, abs/1312.5663, 2013. URL http://arxiv.org/abs/1312.5663.
Roland Memisevic, Kishore Reddy Konda, and David Krueger. Zero-bias autoencoders and the benefits of co-adapting features. In ICLR, 2014.
⁸ For binary recovery, the bias equation is described in equation 6.

Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
Andrew Ng. Sparse autoencoder. CS294A Lecture Notes, 2011.
Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE TPAMI, 31(2):210–227, Feb. 2009.
Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, pages 1794–1801, 2009.

Appendix: On Optimality Conditions for Auto-Encoder Signal Recovery

1 Proofs

Remark 1. Let $x_1 = W^T h$ where $x_1 \in \mathbb{R}^n$, $W \in \mathbb{R}^{m \times n}$ and $h \in \mathbb{R}^m$. Let $x_2 = W^T h + b_d$ where $b_d \in \mathbb{R}^n$ is a fixed vector. Let $\hat{h}_1 = s_e(Wx_1 + b)$ and $\hat{h}_2 = s_e(Wx_2 + b - Wb_d)$. Then $\hat{h}_1 = h$ iff $\hat{h}_2 = h$.

Proof: Let $\hat{h}_1 = h$, so $h = s_e(Wx_1 + b)$. On the other hand, $\hat{h}_2 = s_e(Wx_2 + b - Wb_d) = s_e(WW^T h + Wb_d + b - Wb_d) = s_e(WW^T h + b) = s_e(Wx_1 + b) = h$. The other direction can be proved similarly.

Theorem 1. Let each element of $h$ follow BINS$(p, \delta_1, \mu_h, l_{max})$ and let $\hat{h} \in \mathbb{R}^m$ be an auto-encoder signal recovery mechanism with sigmoid activation function and bias $b$ for a measurement vector $x \in \mathbb{R}^n$ such that $x = W^T h$. If we set $b_i = -\sum_j a_{ij} p_j \;\forall i \in [m]$, then $\forall \delta \in (0, 1)$,
$$\Pr\Big(\tfrac{1}{m}\|\hat{h} - h\|_1 \le \delta\Big) \ge 1 - \sum_{i=1}^{m}\Bigg((1 - p_i)\, e^{-\frac{2(\delta_0 + p_i a_{ii})^2}{\sum_{j=1, j \ne i}^{m} a_{ij}^2}} + p_i\, e^{-\frac{2(\delta_0 + (1 - p_i) a_{ii})^2}{\sum_{j=1, j \ne i}^{m} a_{ij}^2}}\Bigg) \quad (21)$$
where $a_{ij} = W_i^T W_j$, $\delta_0 = \ln\big(\frac{\delta}{1-\delta}\big)$ and $W_i$ is the $i$-th row of the matrix $W$ cast as a column vector.

Proof.
Notice that
$$\Pr(|\hat{h}_i - h_i| \ge \delta) = \Pr(|\hat{h}_i - h_i| \ge \delta \mid h_i = 0)\Pr(h_i = 0) + \Pr(|\hat{h}_i - h_i| \ge \delta \mid h_i = 1)\Pr(h_i = 1) \quad (22)$$
and from definition 1,
$$\hat{h}_i = \sigma\Big(\sum_j a_{ij} h_j + b_i\Big) \quad (23)$$
Thus,
$$\Pr(|\hat{h}_i - h_i| \ge \delta) = (1 - p_i)\Pr\Big(\sigma\big(\textstyle\sum_j a_{ij} h_j + b_i\big) \ge \delta \,\Big|\, h_i = 0\Big) + p_i \Pr\Big(\sigma\big(-\textstyle\sum_j a_{ij} h_j - b_i\big) \ge \delta \,\Big|\, h_i = 1\Big) \quad (24)$$
Notice that $\Pr(\sigma(\sum_j a_{ij} h_j + b_i) \ge \delta \mid h_i = 0) = \Pr(\sum_j a_{ij} h_j + b_i \ge \ln(\frac{\delta}{1-\delta}) \mid h_i = 0)$. Let $z_i = \sum_j a_{ij} h_j + b_i$ and $\delta_0 = \ln(\frac{\delta}{1-\delta})$. Then, setting $b_i = -\mathbb{E}_h[\sum_j a_{ij} h_j] = -\sum_j a_{ij} p_j$ and using Chernoff's inequality, for any $t > 0$,
$$\Pr(z_i \ge \delta_0 \mid h_i = 0) \le \frac{\mathbb{E}_h[e^{t z_i}]}{e^{t \delta_0}} = \frac{\mathbb{E}_h\big[e^{t \sum_{j \ne i} a_{ij}(h_j - p_j) - t p_i a_{ii}}\big]}{e^{t \delta_0}} = \frac{\mathbb{E}_h\big[\prod_{j \ne i} e^{t a_{ij}(h_j - p_j)}\big]}{e^{t(\delta_0 + p_i a_{ii})}} = \frac{\prod_{j \ne i} \mathbb{E}_{h_j}\big[e^{t a_{ij}(h_j - p_j)}\big]}{e^{t(\delta_0 + p_i a_{ii})}} \quad (25)$$
Let $T_j = \mathbb{E}_{h_j}\big[e^{t a_{ij}(h_j - p_j)}\big]$. Then,
$$T_j = (1 - p_j) e^{-t p_j a_{ij}} + p_j e^{t(1 - p_j) a_{ij}} = e^{-t p_j a_{ij}}\big(1 - p_j + p_j e^{t a_{ij}}\big) \quad (26)$$
Let $e^{g(t)} \triangleq T_j$. Thus,
$$g(t) = -t p_j a_{ij} + \ln\big(1 - p_j + p_j e^{t a_{ij}}\big) \implies g(0) = 0 \quad (27)$$
$$g'(t) = -p_j a_{ij} + \frac{p_j a_{ij} e^{t a_{ij}}}{1 - p_j + p_j e^{t a_{ij}}} \implies g'(0) = 0 \quad (28)$$
$$g''(t) = \frac{p_j (1 - p_j)\, a_{ij}^2\, e^{t a_{ij}}}{(1 - p_j + p_j e^{t a_{ij}})^2} \quad (29)$$
$$g'''(t) = \frac{p_j (1 - p_j)\, a_{ij}^3\, e^{t a_{ij}} (1 - p_j + p_j e^{t a_{ij}})(1 - p_j - p_j e^{t a_{ij}})}{(1 - p_j + p_j e^{t a_{ij}})^4} \quad (30)$$
Setting $g'''(t) = 0$, we get $t^* = \frac{1}{a_{ij}} \ln\big(\frac{1 - p_j}{p_j}\big)$. Thus, $g''(t) \le g''(t^*) = \frac{a_{ij}^2}{4}$.
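The bound $g''(t) \le a_{ij}^2/4$ is the Bernoulli case of Hoeffding's lemma, and it is what yields the moment-generating-function bound $T_j \le e^{t^2 a_{ij}^2/8}$ used next in the proof. As a quick numerical sanity check (the grid of $t$, $a_{ij}$ and $p_j$ values below is purely illustrative):

```python
import numpy as np

def mgf_centered_bernoulli(t, a, p):
    """T_j = E[exp(t * a * (h - p))] for h ~ Bernoulli(p), eq. (26)."""
    return (1 - p) * np.exp(-t * p * a) + p * np.exp(t * (1 - p) * a)

# Hoeffding-lemma bound used in the proof: T_j <= exp(t^2 a^2 / 8).
ts = np.linspace(-10, 10, 2001)
for a in (-0.3, 0.05, 0.7):
    for p in (0.02, 0.5, 0.9):
        lhs = mgf_centered_bernoulli(ts, a, p)
        rhs = np.exp(ts**2 * a**2 / 8)
        assert np.all(lhs <= rhs + 1e-12)      # bound holds across the grid
```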
By Taylor's theorem, $\exists c \in [0, t] \;\forall t > 0$ such that
$$g(t) = g(0) + t g'(0) + \frac{t^2}{2} g''(c) \le \frac{t^2 a_{ij}^2}{8} \quad (32)$$
Thus we can upper bound $T_j$ as
$$T_j \le e^{\frac{t^2 a_{ij}^2}{8}} \quad (33)$$
Hence we can write $\Pr(z_i \ge \delta_0)$ as
$$\Pr(z_i \ge \delta_0) \le \frac{\prod_{j \ne i} T_j}{e^{t(\delta_0 + a_{ii} p_i)}} = \frac{\prod_{j \ne i} e^{\frac{t^2 a_{ij}^2}{8}}}{e^{t(\delta_0 + a_{ii} p_i)}} = e^{\frac{t^2 \sum_{j \ne i} a_{ij}^2}{8} - t(a_{ii} p_i + \delta_0)} \quad (34)$$
On the other hand, notice that $\Pr(\sigma(-\sum_j a_{ij} h_j - b_i) \ge \delta \mid h_i = 1) = \Pr(-\sum_j a_{ij} h_j - b_i \ge \ln(\frac{\delta}{1-\delta}) \mid h_i = 1) = \Pr(-z_i \ge \delta_0 \mid h_i = 1)$.
$$\Pr(-z_i \ge \delta_0 \mid h_i = 1) \le \frac{\mathbb{E}_h[e^{-t z_i}]}{e^{t \delta_0}} = \frac{\mathbb{E}_h\big[e^{-t \sum_{j \ne i} a_{ij}(h_j - p_j) - t(1 - p_i) a_{ii}}\big]}{e^{t \delta_0}} = \frac{\mathbb{E}_h\big[\prod_{j \ne i} e^{-t a_{ij}(h_j - p_j)}\big]}{e^{t(\delta_0 + (1 - p_i) a_{ii})}} = \frac{\prod_{j \ne i} \mathbb{E}_{h_j}\big[e^{-t a_{ij}(h_j - p_j)}\big]}{e^{t(\delta_0 + (1 - p_i) a_{ii})}} \quad (35)$$
Let $T_j = \mathbb{E}_{h_j}\big[e^{-t a_{ij}(h_j - p_j)}\big]$. Then we can similarly bound $\Pr(-z_i \ge \delta_0)$ by effectively flipping the sign of the $a_{ij}$'s in the previous derivation:
$$\Pr(-z_i \ge \delta_0) \le \frac{\prod_{j \ne i} T_j}{e^{t(\delta_0 + a_{ii}(1 - p_i))}} = \frac{\prod_{j \ne i} e^{\frac{t^2 a_{ij}^2}{8}}}{e^{t(\delta_0 + a_{ii}(1 - p_i))}} = e^{\frac{t^2 \sum_{j \ne i} a_{ij}^2}{8} - t(a_{ii}(1 - p_i) + \delta_0)} \quad (38)$$
Minimizing both (34) and (38) with respect to $t$ and applying the union bound, we get
$$\Pr(|\hat{h}_i - h_i| \ge \delta) \le (1 - p_i)\, e^{-\frac{2(a_{ii} p_i + \delta_0)^2}{\sum_{j \ne i} a_{ij}^2}} + p_i\, e^{-\frac{2(a_{ii}(1 - p_i) + \delta_0)^2}{\sum_{j \ne i} a_{ij}^2}} \quad (39)$$
Since the above bound holds for all $i \in [m]$, applying the union bound over all units yields the desired result.

Proposition 1. Let each element of $h$ follow BINS$(p, \delta_1, \mu_h, l_{max})$ and let $\hat{h} \in \mathbb{R}^m$ be an auto-encoder signal recovery mechanism with sigmoid activation function and bias $b$ for a measurement vector $x = W^T h + e$, where $e \in \mathbb{R}^n$ is any noise vector independent of $h$.
If we set $b_i = -\sum_j a_{ij} p_j - W_i^T \mathbb{E}_e[e] \;\forall i \in [m]$, then $\forall \delta \in (0, 1)$,
$$\Pr\Big(\tfrac{1}{m}\|\hat{h} - h\|_1 \le \delta\Big) \ge 1 - \sum_{i=1}^{m}\Bigg((1 - p_i)\, e^{-\frac{2(\delta_0 - W_i^T(e - \mathbb{E}_e[e]) + p_i a_{ii})^2}{\sum_{j=1, j \ne i}^{m} a_{ij}^2}} + p_i\, e^{-\frac{2(\delta_0 - W_i^T(e - \mathbb{E}_e[e]) + (1 - p_i) a_{ii})^2}{\sum_{j=1, j \ne i}^{m} a_{ij}^2}}\Bigg) \quad (40)$$
where $a_{ij} = W_i^T W_j$, $\delta_0 = \ln(\frac{\delta}{1-\delta})$ and $W_i$ is the $i$-th row of the matrix $W$ cast as a column vector.

Proof. Notice that
$$\Pr(|\hat{h}_i - h_i| \ge \delta) = \Pr(|\hat{h}_i - h_i| \ge \delta \mid h_i = 0)\Pr(h_i = 0) + \Pr(|\hat{h}_i - h_i| \ge \delta \mid h_i = 1)\Pr(h_i = 1) \quad (42)$$
and from definition 1,
$$\hat{h}_i = \sigma\Big(\sum_j a_{ij} h_j + b_i + W_i^T e\Big) \quad (44)$$
Thus,
$$\Pr(|\hat{h}_i - h_i| \ge \delta) = (1 - p_i)\Pr\Big(\sigma\big(\textstyle\sum_j a_{ij} h_j + b_i + W_i^T e\big) \ge \delta \,\Big|\, h_i = 0\Big) + p_i \Pr\Big(\sigma\big(-\textstyle\sum_j a_{ij} h_j - b_i - W_i^T e\big) \ge \delta \,\Big|\, h_i = 1\Big) \quad (45)$$
Notice that $\Pr(\sigma(\sum_j a_{ij} h_j + b_i + W_i^T e) \ge \delta \mid h_i = 0) = \Pr(\sum_j a_{ij} h_j + b_i + W_i^T e \ge \ln(\frac{\delta}{1-\delta}) \mid h_i = 0)$. Let $z_i = \sum_j a_{ij} h_j + b_i + W_i^T e$ and $\delta_0 = \ln(\frac{\delta}{1-\delta})$. Then, setting $b_i = -\mathbb{E}_h[\sum_j a_{ij} h_j] - W_i^T \mathbb{E}_e[e] = -\sum_j a_{ij} p_j - W_i^T \mathbb{E}_e[e]$ and using Chernoff's inequality on the random variable $h$, for any $t > 0$,
$$\Pr(z_i \ge \delta_0 \mid h_i = 0) \le \frac{\mathbb{E}_h[e^{t z_i}]}{e^{t \delta_0 - t W_i^T(e - \mathbb{E}_e[e])}} = \frac{\mathbb{E}_h\big[e^{t \sum_{j \ne i} a_{ij}(h_j - p_j) - t p_i a_{ii}}\big]}{e^{t \delta_0 - t W_i^T(e - \mathbb{E}_e[e])}} = \frac{\prod_{j \ne i} \mathbb{E}_{h_j}\big[e^{t a_{ij}(h_j - p_j)}\big]}{e^{t(\delta_0 - W_i^T(e - \mathbb{E}_e[e]) + p_i a_{ii})}} \quad (46)$$
Setting $\bar{\delta} := \delta_0 - W_i^T(e - \mathbb{E}_e[e])$, we can rewrite the above inequality as
$$\Pr(z_i \ge \delta_0 \mid h_i = 0) \le \frac{\prod_{j \ne i} \mathbb{E}_{h_j}\big[e^{t a_{ij}(h_j - p_j)}\big]}{e^{t(\bar{\delta} + p_i a_{ii})}} \quad (47)$$
Since the above inequality is identical in form to equation (25), the rest of the proof proceeds as in theorem 1.

Theorem 2.
Let each element of $h \in \mathbb{R}^m$ follow the BINS$(p, f_c, \mu_h, l_{max})$ distribution and let $\hat{h}_{ReLU}(x; W, b)$ be an auto-encoder signal recovery mechanism with rectified linear (ReLU) activation function and bias $b$ for a measurement vector $x \in \mathbb{R}^n$ such that $x = W^T h$. If we set $b_i \triangleq -\sum_j a_{ij} p_j \mu_{h_j} \;\forall i \in [m]$, then $\forall \delta \ge 0$,
$$\Pr\Big(\tfrac{1}{m}\|\hat{h} - h\|_1 \le \delta\Big) \ge 1 - \sum_{i=1}^{m}\Bigg(e^{-\frac{2(\delta + \sum_j (1 - p_j)(l_{max_j} - 2 p_j \mu_{h_j}) \max(0,\, a_{ij}))^2}{\sum_j a_{ij}^2 l_{max_j}^2}} + e^{-\frac{2(\delta + \sum_j (1 - p_j)(l_{max_j} - 2 p_j \mu_{h_j}) \max(0,\, -a_{ij}))^2}{\sum_j a_{ij}^2 l_{max_j}^2}}\Bigg) \quad (48)$$
where the $a_i$'s are vectors such that
$$a_{ij} = \begin{cases} W_i^T W_j & \text{if } i \ne j \\ W_i^T W_i - 1 & \text{if } i = j \end{cases} \quad (49)$$
and $W_i$ is the $i$-th row of the matrix $W$ cast as a column vector.

Proof. From definition 1 and the definition of $a_{ij}$ above,
$$\hat{h}_i = \max\Big\{0,\; \sum_j a_{ij} h_j + h_i + b_i\Big\}, \qquad \hat{h}_i - h_i = \max\Big\{-h_i,\; \sum_j a_{ij} h_j + b_i\Big\} \quad (50)$$
Let $z_i = \sum_j a_{ij} h_j + b_i$, so $\hat{h}_i - h_i = \max\{-h_i, z_i\}$. Then, conditioning on $z_i$,
$$\begin{aligned}
\Pr(|\hat{h}_i - h_i| \le \delta) &= \Pr\big(|\hat{h}_i - h_i| \le \delta \mid h_i > 0, |z_i| \le \delta\big)\Pr(|z_i| \le \delta, h_i > 0) \\
&\quad + \Pr\big(|\hat{h}_i - h_i| \le \delta \mid h_i > 0, |z_i| > \delta\big)\Pr(|z_i| > \delta, h_i > 0) \\
&\quad + \Pr\big(|\hat{h}_i - h_i| \le \delta \mid h_i = 0, |z_i| \le \delta\big)\Pr(|z_i| \le \delta, h_i = 0) \\
&\quad + \Pr\big(|\hat{h}_i - h_i| \le \delta \mid h_i = 0, |z_i| > \delta\big)\Pr(|z_i| > \delta, h_i = 0)
\end{aligned} \quad (51)$$
Since $\Pr(|\hat{h}_i - h_i| \le \delta \mid |z_i| \le \delta) = 1$, we have
$$\Pr(|\hat{h}_i - h_i| \le \delta) \ge \Pr(|z_i| \le \delta) \quad (55)$$
The above inequality is obtained by dropping the non-negative terms that depend on the condition $|z_i| > \delta$ and marginalizing over $h_i$. For any $t > 0$, using Chernoff's inequality,
$$\Pr(z_i \ge \delta) \le \frac{\mathbb{E}_h[e^{t z_i}]}{e^{t \delta}} \quad (56)$$
Setting $b_i = -\sum_j a_{ij} \mu_j$, where $\mu_j = \mathbb{E}_{h_j}[h_j] = p_j \mu_{h_j}$,
$$\Pr(z_i \ge \delta) \le \frac{\mathbb{E}_h\big[e^{t \sum_j a_{ij}(h_j - \mu_j)}\big]}{e^{t \delta}} = \frac{\mathbb{E}_h\big[\prod_j e^{t a_{ij}(h_j - \mu_j)}\big]}{e^{t \delta}} = \frac{\prod_j \mathbb{E}_{h_j}\big[e^{t a_{ij}(h_j - \mu_j)}\big]}{e^{t \delta}} \quad (57)$$
Let $T_j = \mathbb{E}_{h_j}\big[e^{t a_{ij}(h_j - \mu_j)}\big]$.
Then,
$$T_j = (1 - p_j)\, e^{-t a_{ij} \mu_j} + p_j\, \mathbb{E}_{v \sim f_c(0^+,\, l_{max_j},\, \mu_{h_j})}\big[e^{t a_{ij}(v - \mu_j)}\big] \quad (58)$$
where $f_c(a, b, \mu_h)$ denotes an arbitrary distribution on the interval $(a, b]$ with mean $\mu_h$. If $a_{ij} \ge 0$, let $\alpha = -\mu_j$ and $\beta = l_{max_j} - \mu_j$, which are the lower and upper bounds of $h_j - \mu_j$. Then,
$$T_j = (1 - p_j) e^{t a_{ij} \alpha} + p_j\, \mathbb{E}_{v \sim f_c(0^+,\, l_{max_j},\, \mu_{h_j})}\big[e^{t a_{ij}(v - \mu_j)}\big] \quad (59)$$
$$\le (1 - p_j) e^{t a_{ij} \alpha} + p_j\, \mathbb{E}_v\Big[\frac{\beta - (v - \mu_j)}{\beta - \alpha}\, e^{t a_{ij} \alpha} + \frac{(v - \mu_j) - \alpha}{\beta - \alpha}\, e^{t a_{ij} \beta}\Big] \quad (60)$$
$$= (1 - p_j) e^{t a_{ij} \alpha} + \frac{\beta - (1 - p_j)\mu_{h_j}}{\beta - \alpha}\, p_j e^{t a_{ij} \alpha} + \frac{(1 - p_j)\mu_{h_j} - \alpha}{\beta - \alpha}\, p_j e^{t a_{ij} \beta} \quad (61)$$
$$= (1 - p_j) e^{t a_{ij} \alpha} + \frac{p_j \beta\, e^{t a_{ij} \alpha}}{\beta - \alpha} - \frac{p_j (1 - p_j)\mu_{h_j}}{\beta - \alpha}\big(e^{t a_{ij} \alpha} - e^{t a_{ij} \beta}\big) - \frac{p_j \alpha}{\beta - \alpha}\, e^{t a_{ij} \beta} \quad (62)$$
where the inequality in (60) follows from the convexity of the exponential. Define $u = t a_{ij}(\beta - \alpha)$ and $\gamma = \frac{-\alpha}{\beta - \alpha}$. Then,
$$T_j \le e^{-u\gamma}\Big(1 - p_j + \frac{p_j \beta}{\beta - \alpha} - \frac{p_j (1 - p_j)\mu_{h_j}}{\beta - \alpha}(1 - e^u) - \frac{p_j \alpha}{\beta - \alpha}\, e^u\Big) \quad (63)$$
$$= e^{-u\gamma}\Big(1 + \frac{p_j \alpha}{\beta - \alpha} - \frac{p_j (1 - p_j)\mu_{h_j}}{\beta - \alpha} - \Big(\frac{p_j \alpha}{\beta - \alpha} - \frac{p_j (1 - p_j)\mu_{h_j}}{\beta - \alpha}\Big) e^u\Big) \quad (64)$$
$$= e^{-u\gamma}\Big(1 - \Big(p_j \gamma + \frac{p_j (1 - p_j)\mu_{h_j}}{\beta - \alpha}\Big) + \Big(p_j \gamma + \frac{p_j (1 - p_j)\mu_{h_j}}{\beta - \alpha}\Big) e^u\Big) \quad (65)$$
Define $\phi = p_j \gamma + \frac{p_j (1 - p_j)\mu_{h_j}}{\beta - \alpha}$ and let $e^{g(u)} \triangleq e^{-u\gamma}(1 - \phi + \phi e^u)$, so that $T_j \le e^{g(u)}$. Then,
$$g(u) = -u\gamma + \ln(1 - \phi + \phi e^u) \implies g(0) = 0 \quad (67)$$
$$g'(u) = -\gamma + \frac{\phi e^u}{1 - \phi + \phi e^u} \implies g'(0) = -\gamma + \phi = -\gamma(1 - p_j) + \frac{p_j(1 - p_j)\mu_{h_j}}{\beta - \alpha} \quad (68)$$
$$g''(u) = \frac{\phi(1 - \phi) e^u}{(1 - \phi + \phi e^u)^2} \quad (69)$$
$$g'''(u) = \frac{\phi(1 - \phi)(1 - \phi + \phi e^u)\, e^u\, (1 - \phi - \phi e^u)}{(1 - \phi + \phi e^u)^4} \quad (70)$$
To find the maximum of $g''(u)$, we set $g'''(u) = 0$, which implies $1 - \phi - \phi e^u = 0$, i.e., $e^u = \frac{1 - \phi}{\phi}$. Substituting this $u$ into $g''(u)$ gives $g''(u) \le 1/4$.
By Taylor's theorem, $\exists c \in [0, u] \;\forall u > 0$ such that
$$g(u) = g(0) + u g'(0) + \frac{u^2}{2} g''(c) \le -u\gamma(1 - p_j) + \frac{u\, p_j (1 - p_j)\mu_{h_j}}{\beta - \alpha} + \frac{u^2}{8} \quad (71)$$
Thus we can upper bound $T_j$ as
$$T_j \le e^{u^2/8 - u\big(\gamma(1 - p_j) - \frac{p_j(1 - p_j)\mu_{h_j}}{\beta - \alpha}\big)} = e^{t^2 a_{ij}^2(\beta - \alpha)^2/8 + t a_{ij}(\beta - \alpha)\big(\frac{\alpha(1 - p_j)}{\beta - \alpha} + \frac{p_j(1 - p_j)\mu_{h_j}}{\beta - \alpha}\big)} \quad (72)$$
Substituting for $\alpha$ and $\beta$, we get
$$T_j \le e^{t^2 a_{ij}^2 l_{max_j}^2/8 + t a_{ij}(1 - p_j)(-\mu_j + p_j \mu_{h_j})} = e^{\frac{t^2 a_{ij}^2 l_{max_j}^2}{8}} \quad (73)$$
On the other hand, if $a_{ij} < 0$, we can set $\alpha = \mu_j - l_{max_j}$ and $\beta = \mu_j$ and, proceeding as from equation (59), we get
$$T_j \le e^{t^2 a_{ij}^2 l_{max_j}^2/8 + t |a_{ij}|(1 - p_j)(\mu_j - l_{max_j} + p_j \mu_{h_j})} = e^{\frac{t^2 a_{ij}^2 l_{max_j}^2}{8} - t |a_{ij}|(1 - p_j)(l_{max_j} - 2 p_j \mu_{h_j})} \quad (74)$$
Then, collectively, we can write $\Pr(z_i \ge \delta)$ as
$$\Pr(z_i \ge \delta) \le \frac{\prod_j T_j}{e^{t\delta}} = e^{t^2 \sum_j a_{ij}^2 l_{max_j}^2 / 8 - t\big(\delta + \sum_j (1 - p_j)(l_{max_j} - 2 p_j \mu_{h_j}) \max(0,\, -a_{ij})\big)} \quad (75)$$
We similarly bound $\Pr(-z_i \ge \delta)$ by effectively flipping the sign of the $a_{ij}$'s:
$$\Pr(-z_i \ge \delta) \le e^{t^2 \sum_j a_{ij}^2 l_{max_j}^2 / 8 - t\big(\delta + \sum_j (1 - p_j)(l_{max_j} - 2 p_j \mu_{h_j}) \max(0,\, a_{ij})\big)} \quad (76)$$
Minimizing both (75) and (76) with respect to $t$ and applying the union bound, we get, for all $i \in [m]$,
$$\Pr(|\hat{h}_i - h_i| \ge \delta) \le e^{-\frac{2(\delta + \sum_j (1 - p_j)(l_{max_j} - 2 p_j \mu_{h_j}) \max(0,\, a_{ij}))^2}{\sum_j a_{ij}^2 l_{max_j}^2}} + e^{-\frac{2(\delta + \sum_j (1 - p_j)(l_{max_j} - 2 p_j \mu_{h_j}) \max(0,\, -a_{ij}))^2}{\sum_j a_{ij}^2 l_{max_j}^2}} \quad (77)$$
Since the above bound holds for all $i \in [m]$, applying the union bound over all units yields the desired result.

Proposition 2. Let each element of $h$ follow the BINS$(p, f_c, \mu_h, l_{max})$ distribution and let $\hat{h} \in \mathbb{R}^m$ be an auto-encoder signal recovery mechanism with rectified linear activation function and bias $b$ for a measurement vector $x \in \mathbb{R}^n$ such that $x = W^T h + e$, where $e$ is any noise random vector independent of $h$.
If we set $b_i \triangleq -\sum_j a_{ij} p_j \mu_{h_j} - W_i^T \mathbb{E}_e[e] \;\forall i \in [m]$, then $\forall \delta \ge 0$,
$$\Pr\Big(\tfrac{1}{m}\|\hat{h} - h\|_1 \le \delta\Big) \ge 1 - \sum_{i=1}^{m}\Bigg(e^{-\frac{2(\delta - W_i^T(e - \mathbb{E}_e[e]) + \sum_j (1 - p_j)(l_{max_j} - 2 p_j \mu_{h_j}) \max(0,\, a_{ij}))^2}{\sum_j a_{ij}^2 l_{max_j}^2}} + e^{-\frac{2(\delta - W_i^T(e - \mathbb{E}_e[e]) + \sum_j (1 - p_j)(l_{max_j} - 2 p_j \mu_{h_j}) \max(0,\, -a_{ij}))^2}{\sum_j a_{ij}^2 l_{max_j}^2}}\Bigg) \quad (79)$$
where the $a_i$'s are vectors such that
$$a_{ij} = \begin{cases} W_i^T W_j & \text{if } i \ne j \\ W_i^T W_i - 1 & \text{if } i = j \end{cases} \quad (80)$$
and $W_i$ is the $i$-th row of the matrix $W$ cast as a column vector.

Proof. Recall that
$$\hat{h}_i = \max\Big\{0,\; \sum_j a_{ij} h_j + h_i + W_i^T e + b_i\Big\} \quad (81)$$
$$\hat{h}_i - h_i = \max\Big\{-h_i,\; \sum_j a_{ij} h_j + W_i^T e + b_i\Big\} \quad (82)$$
Let $z_i = \sum_j a_{ij} h_j + b_i + W_i^T e$. Then, similar to theorem 2, conditioning on $z_i$,
$$\begin{aligned}
\Pr(|\hat{h}_i - h_i| \le \delta) &= \Pr\big(|\hat{h}_i - h_i| \le \delta \mid h_i > 0, |z_i| \le \delta\big)\Pr(|z_i| \le \delta, h_i > 0) \\
&\quad + \Pr\big(|\hat{h}_i - h_i| \le \delta \mid h_i > 0, |z_i| > \delta\big)\Pr(|z_i| > \delta, h_i > 0) \\
&\quad + \Pr\big(|\hat{h}_i - h_i| \le \delta \mid h_i = 0, |z_i| \le \delta\big)\Pr(|z_i| \le \delta, h_i = 0) \\
&\quad + \Pr\big(|\hat{h}_i - h_i| \le \delta \mid h_i = 0, |z_i| > \delta\big)\Pr(|z_i| > \delta, h_i = 0)
\end{aligned} \quad (83)$$
Since $\Pr(|\hat{h}_i - h_i| \le \delta \mid |z_i| \le \delta) = 1$, we have
$$\Pr(|\hat{h}_i - h_i| \le \delta) \ge \Pr(|z_i| \le \delta) \quad (87)$$
For any $t > 0$, using Chernoff's inequality for the random variable $h$,
$$\Pr(z_i \ge \delta) \le \frac{\mathbb{E}_h[e^{t z_i}]}{e^{t\delta}} \quad (88)$$
Setting $b_i = -\sum_j a_{ij}\mu_j - W_i^T \mathbb{E}_e[e]$, where $\mu_j = \mathbb{E}_{h_j}[h_j] = p_j \mu_{h_j}$,
$$\Pr(z_i \ge \delta) \le \frac{\mathbb{E}_h\big[e^{t\sum_j a_{ij}(h_j - \mu_j)}\big]}{e^{t\delta - t W_i^T(e - \mathbb{E}_e[e])}} = \frac{\mathbb{E}_h\big[\prod_j e^{t a_{ij}(h_j - \mu_j)}\big]}{e^{t\delta - t W_i^T(e - \mathbb{E}_e[e])}} = \frac{\prod_j \mathbb{E}_{h_j}\big[e^{t a_{ij}(h_j - \mu_j)}\big]}{e^{t\delta - t W_i^T(e - \mathbb{E}_e[e])}} \quad (89)$$
Setting $\bar{\delta} := \delta - W_i^T(e - \mathbb{E}_e[e])$, we can rewrite the above inequality as
$$\Pr(z_i \ge \delta) \le \frac{\prod_j \mathbb{E}_{h_j}\big[e^{t a_{ij}(h_j - \mu_j)}\big]}{e^{t\bar{\delta}}} \quad (90)$$
Since the above inequality is identical in form to equation (57), the rest of the proof proceeds as in theorem 2.

Theorem 3.
(Uncorrelated Distribution Bound): If data is generated as $x = W^T h$, where $h \in \mathbb{R}^m$ has covariance matrix $\mathrm{diag}(\zeta)$ ($\zeta \in \mathbb{R}_+^m$), and $W \in \mathbb{R}^{m \times n}$ ($m > n$) is such that each row of $W$ has unit length and the rows of $W$ are maximally incoherent, then the covariance matrix of the generated data is approximately spherical (uncorrelated), satisfying
$$\min_\alpha \|\Sigma - \alpha I\|_F \le \sqrt{\frac{1}{n}\big(m\|\zeta\|_2^2 - \|\zeta\|_1^2\big)} \quad (91)$$
where $\Sigma = \mathbb{E}_x[(x - \mathbb{E}_x[x])(x - \mathbb{E}_x[x])^T]$ is the covariance matrix of the generated data.

Proof. Notice that
$$\mathbb{E}_x[x] = W^T \mathbb{E}_h[h] \quad (92)$$
Thus,
$$\mathbb{E}_x[(x - \mathbb{E}_x[x])(x - \mathbb{E}_x[x])^T] = \mathbb{E}_h\big[(W^T h - W^T \mathbb{E}_h[h])(W^T h - W^T \mathbb{E}_h[h])^T\big] = W^T \mathbb{E}_h\big[(h - \mathbb{E}_h[h])(h - \mathbb{E}_h[h])^T\big] W \quad (93)$$
Substituting the covariance of $h$ as $\mathrm{diag}(\zeta)$,
$$\Sigma = \mathbb{E}_x[(x - \mathbb{E}_x[x])(x - \mathbb{E}_x[x])^T] = W^T \mathrm{diag}(\zeta)\, W \quad (96)$$
Thus,
$$\|\Sigma - \alpha I\|_F^2 = \mathrm{tr}\Big(\big(W^T \mathrm{diag}(\zeta) W - \alpha I\big)\big(W^T \mathrm{diag}(\zeta) W - \alpha I\big)^T\Big) = \mathrm{tr}\Big(W^T \mathrm{diag}(\zeta) W W^T \mathrm{diag}(\zeta) W + \alpha^2 I - 2\alpha\, W^T \mathrm{diag}(\zeta) W\Big) \quad (97)$$
Using the cyclic property of the trace,
$$\|\Sigma - \alpha I\|_F^2 = \mathrm{tr}\Big(W W^T \mathrm{diag}(\zeta)\, W W^T \mathrm{diag}(\zeta) + \alpha^2 I - 2\alpha\, W W^T \mathrm{diag}(\zeta)\Big) = \|W W^T \mathrm{diag}(\zeta)\|_F^2 + \alpha^2 n - 2\alpha \sum_{i=1}^m \zeta_i \le \Big(\sum_{i=1}^m \zeta_i^2\Big)\big(1 + \mu^2(m - 1)\big) + \alpha^2 n - 2\alpha \sum_{i=1}^m \zeta_i \quad (99)$$
Finally, minimizing w.r.t. $\alpha$, we get $\alpha^* = \frac{1}{n}\sum_{i=1}^m \zeta_i$. Substituting this into the above inequality, we get
$$\min_\alpha \|\Sigma - \alpha I\|_F^2 \le \Big(\sum_{i=1}^m \zeta_i^2\Big)\big(1 + \mu^2(m - 1)\big) + \frac{1}{n}\Big(\sum_{i=1}^m \zeta_i\Big)^2 - \frac{2}{n}\Big(\sum_{i=1}^m \zeta_i\Big)^2 = \Big(\sum_{i=1}^m \zeta_i^2\Big)\big(1 + \mu^2(m - 1)\big) - \frac{1}{n}\Big(\sum_{i=1}^m \zeta_i\Big)^2 \quad (102)$$
Since the weight matrix is maximally incoherent, by the Welch bound we have $\mu \in \Big[\sqrt{\frac{m - n}{n(m - 1)}},\; 1\Big]$.
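The minimization over $\alpha$ above gives $\alpha^* = \frac{1}{n}\sum_i \zeta_i$, which (for unit-length rows) equals $\mathrm{tr}(\Sigma)/n$. This step can be checked numerically; the random unit-row $W$ and the $\zeta$ values below are illustrative choices, not the maximally incoherent matrix the theorem's final bound assumes.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 180
W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit l2 rows, as in the theorem
zeta = rng.uniform(0.5, 2.0, m)                 # diagonal covariance of h (illustrative)
Sigma = W.T @ (zeta[:, None] * W)               # Sigma = W^T diag(zeta) W

# The minimizer derived in the proof: alpha* = (1/n) sum_i zeta_i = tr(Sigma)/n.
alpha_star = zeta.sum() / n
f = lambda a: np.linalg.norm(Sigma - a * np.eye(n), "fro")
assert abs(alpha_star - np.trace(Sigma) / n) < 1e-8
assert all(f(alpha_star) <= f(alpha_star + d) for d in (-0.1, -0.01, 0.01, 0.1))
```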
Plugging the lower bound on $\mu$ (maximal incoherence) for any fixed $m$ and $n$ into the above bound yields
$$\min_\alpha \|\Sigma - \alpha I\|_F^2 \le \Big(\sum_{i=1}^m \zeta_i^2\Big)\Big(1 + \frac{m - n}{n(m - 1)}(m - 1)\Big) - \frac{1}{n}\Big(\sum_{i=1}^m \zeta_i\Big)^2 = \Big(\sum_{i=1}^m \zeta_i^2\Big)\Big(1 + \frac{m - n}{n}\Big) - \frac{1}{n}\Big(\sum_{i=1}^m \zeta_i\Big)^2 = \frac{1}{n}\big(m\|\zeta\|_2^2 - \|\zeta\|_1^2\big) \quad (105)$$

2 Supplementary Experiments

2.1 Supplementary Experiments for Section 5.1

Here we show the recovery error (APRE) for signals generated with the coherent weight matrix; as expected, recovery is poor and the optimal values of $c$ and $\Delta b$ are unpredictable. The minimum average percentage recovery error we obtained is 45.75 for continuous signals and 32.63 for binary signals.

[Figure 5: Error heatmaps showing the optimal values of $c$ and $\Delta b$ for recovering continuous (left) and binary (right) signals using coherent weights.]

2.2 Supplementary Experiments for Section 5.2

Fig. 6 shows that orthogonally initialized weight matrices are more incoherent than those obtained from Gaussian-based initialization.

[Figure 6: Coherence of orthogonal and Gaussian weight matrices with varying dimensions.]

2.3 Supplementary Experiments for Section 5.3

For noisy signal recovery we add independent Gaussian noise to the data with mean 100 and standard deviation ranging from 0.01 to 0.2. Note that the data normally lies within the range $[-1, 1]$, so the noise is quite significant when the standard deviation exceeds 0.1. Even in the noisy case the AE can recover the dictionary (see fig. 7). However, for continuous signals the recovery is not very strong when the noise is large (standard deviation $> 0.1$), because 1) the precise value to recover is continuous and thus more influenced by the noise, and 2) the dictionary recovery is poorer, which results in poorer signal recovery. On the other hand, the recovery is robust when recovering binary signals. Similar results were found for the APRE of the recovered hidden signals. The reasons for the more robust recovery of binary signals are that 1) their information content is lower, and 2) we binarize the recovered hidden signal by thresholding it, which further denoises the recovery. When optimizing the AE objective for the binary signal recovery case, we use a small trick to simulate the binarization of the signal. From our analysis (see theorem 1), a recovery error of $\delta = 0.5$ is acceptable, since we can binarize the recovery using a threshold. However, when optimizing the AE with gradient-based methods we cannot apply this thresholding directly. To simulate its effect, we offset the pre-activation by a constant $k$ and multiply the pre-activation by a constant $c$, which amplifies the input and pushes the post-activation values towards 0 and 1. In other words, we optimize the following objective for binary signal recovery:
$$\hat{W} = \arg\min_W \mathbb{E}_x \big\| x - W^T \sigma\big(c\{W(x - \mathbb{E}_x[x]) + k\}\big) \big\|^2, \quad \text{where } \|W_i\|_2^2 = 1 \;\;\forall i \quad (108)$$
where $\sigma$ is the sigmoid function. We find that setting $c = 6$ and $k = -0.6$ is sufficient to saturate the sigmoid and simulate the binarization of the hidden signals.
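The saturated objective of equation (108) can be sketched as a NumPy function. This is a minimal sketch under the same synthetic setup as section 5.1 (Gaussian unit-row dictionary, Bernoulli signals); the function name is ours, and it only evaluates the objective rather than optimizing it with gradients.

```python
import numpy as np

def binary_ae_objective(W, X, c=6.0, k=-0.6):
    """Reconstruction cost of eq. (108): the scaled/shifted sigmoid
    saturates the hidden code towards {0, 1} during training."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    H_hat = sigmoid(c * ((X - X.mean(axis=0)) @ W.T + k))
    R = H_hat @ W                          # reconstruction W^T h_hat per sample
    return np.mean(np.sum((X - R) ** 2, axis=1))

rng = np.random.default_rng(0)
m, n, p = 200, 180, 0.02
W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)
H = (rng.random((1000, m)) < p).astype(float)
X = H @ W
# With the true dictionary the saturated code is near-binary and the cost low,
# compared with an unrelated dictionary of the same shape.
print(binary_ae_objective(W, X))
```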
[Figure 7: Cosine similarity between greedily paired rows of $W$ and $\hat{W}$ over training epochs for noisy binary (upper) and continuous (lower) recovery. From left to right, the noise standard deviations are 0.01, 0.02, 0.05, 0.1 and 0.2. The upper, mid and lower bars represent the 95th, 50th and 5th percentiles.]

Table 1: Average percentage recovery error for noisy AE recovery.

Noise std.         0.01    0.02    0.05    0.1     0.2
Continuous APRE    2.06    1.63    9.48    34.16   56.79
Binary APRE        0.15    0.16    0.18    1.56    4.00
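The noisy-recovery pattern behind Table 1 (large noise mean, varying standard deviation) can be sketched as follows. This is a simplified stand-in, not the paper's experiment: it uses direct sigmoid recovery with the true dictionary rather than a trained AE, and an unweighted per-unit error instead of APRE; mean subtraction removes the noise mean (Proposition 1), so only the standard deviation should matter.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, n, p = 2000, 200, 180, 0.02
W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)
H = (rng.random((N, m)) < p).astype(float)
X = H @ W

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
errs = []
for std in (0.01, 0.02, 0.05, 0.1, 0.2):
    Xn = X + rng.normal(100.0, std, X.shape)   # large mean, varying std
    # Empirical mean subtraction cancels the noise mean of 100.
    H_hat = (sigmoid((Xn - Xn.mean(axis=0)) @ W.T) > 0.55).astype(float)
    errs.append(np.abs(H_hat - H).mean())      # unweighted per-unit error
print(errs)                                    # error grows with the noise std
```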