A Mathematical Theory of Deep Convolutional Neural Networks for Feature Extraction
Authors: Thomas Wiatowski, Helmut Bölcskei
Abstract—Deep convolutional neural networks have led to breakthrough results in numerous practical machine learning tasks such as classification of images in the ImageNet data set, control-policy-learning to play Atari games or the board game Go, and image captioning. Many of these applications first perform feature extraction and then feed the results thereof into a trainable classifier. The mathematical analysis of deep convolutional neural networks for feature extraction was initiated by Mallat, 2012. Specifically, Mallat considered so-called scattering networks based on a wavelet transform followed by the modulus non-linearity in each network layer, and proved translation invariance (asymptotically in the wavelet scale parameter) and deformation stability of the corresponding feature extractor. This paper complements Mallat's results by developing a theory that encompasses general convolutional transforms, or in more technical parlance, general semi-discrete frames (including Weyl-Heisenberg filters, curvelets, shearlets, ridgelets, wavelets, and learned filters), general Lipschitz-continuous non-linearities (e.g., rectified linear units, shifted logistic sigmoids, hyperbolic tangents, and modulus functions), and general Lipschitz-continuous pooling operators emulating, e.g., sub-sampling and averaging. In addition, all of these elements can be different in different network layers. For the resulting feature extractor we prove a translation invariance result of vertical nature in the sense of the features becoming progressively more translation-invariant with increasing network depth, and we establish deformation sensitivity bounds that apply to signal classes such as, e.g., band-limited functions, cartoon functions, and Lipschitz functions.

Index Terms—Machine learning, deep convolutional neural networks, scattering networks, feature extraction, frame theory.

The authors are with the Department of Information Technology and Electrical Engineering, ETH Zurich, 8092 Zurich, Switzerland. Email: {withomas, boelcskei}@nari.ee.ethz.ch. The material in this paper was presented in part at the 2015 IEEE International Symposium on Information Theory (ISIT), Hong Kong, China. Copyright (c) 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.

I. INTRODUCTION

A central task in machine learning is feature extraction [2]–[4] as, e.g., in the context of handwritten digit classification [5]. The features to be extracted in this case correspond, for example, to the edges of the digits. The idea behind feature extraction is that feeding characteristic features of the signals—rather than the signals themselves—to a trainable classifier (such as, e.g., a support vector machine (SVM) [6]) improves classification performance. Specifically, non-linear feature extractors (obtained, e.g., through the use of a so-called kernel in the context of SVMs) can map input signal space dichotomies that are not linearly separable into linearly separable feature space dichotomies [3]. Sticking to the example of handwritten digit classification, we would, moreover, want the feature extractor to be invariant to the
digits' spatial location within the image, which leads to the requirement of translation invariance. In addition, it is desirable that the feature extractor be robust with respect to (w.r.t.) handwriting styles. This can be accomplished by demanding limited sensitivity of the features to certain non-linear deformations of the signals to be classified.

Spectacular success in practical machine learning tasks has been reported for feature extractors generated by so-called deep convolutional neural networks (DCNNs) [2], [7]–[11], [13], [14]. These networks are composed of multiple layers, each of which computes convolutional transforms, followed by non-linearities and pooling operators (see Footnote 1). While DCNNs can be used to perform classification (or other machine learning tasks such as regression) directly [2], [7], [9]–[11], typically based on the output of the last network layer, they can also act as stand-alone feature extractors [15]–[21] with the resulting features fed into a classifier such as a SVM. The present paper pertains to the latter philosophy.

The mathematical analysis of feature extractors generated by DCNNs was pioneered by Mallat in [22]. Mallat's theory applies to so-called scattering networks, where signals are propagated through layers that compute a semi-discrete wavelet transform (i.e., convolutions with filters that are obtained from a mother wavelet through scaling and rotation operations), followed by the modulus non-linearity, without subsequent pooling. The resulting feature extractor is shown to be translation-invariant (asymptotically in the scale parameter of the underlying wavelet transform) and stable w.r.t. certain non-linear deformations. Moreover, Mallat's scattering networks lead to state-of-the-art results in various classification tasks [23]–[25].

Contributions. DCNN-based feature extractors that were found to work well in practice employ a wide range of i) filters, namely pre-specified structured filters such as wavelets [16], [19]–[21], pre-specified unstructured filters such as random filters [16], [17], and filters that are learned in a supervised [15], [16] or an unsupervised [16]–[18] fashion, ii) non-linearities beyond the modulus function [16], [21], [22], namely hyperbolic tangents [15]–[17], rectified linear units [26], [27], and logistic sigmoids [28], [29], and iii) pooling operators, namely sub-sampling [19], average pooling [15], [16], and max-pooling [16], [17], [20], [21]. In addition, the filters, non-linearities, and pooling operators can be different in different network layers [14]. The goal of this paper is to develop a mathematical theory that encompasses all these elements (apart from max-pooling) in full generality.

Footnote 1: In the literature, "pooling" broadly refers to some form of combining "nearby" values of a signal (e.g., through averaging) or picking one representative value (e.g., through maximization or sub-sampling).

Convolutional transforms as employed in DCNNs can be interpreted as semi-discrete signal transforms [30]–[37] (i.e., convolutional transforms with filters that are countably parametrized). Corresponding prominent representatives are curvelet [34], [35], [38] and shearlet [36], [39] transforms, both of which are known to be highly effective in extracting features characterized by curved edges in images.
Our theory allows for general semi-discrete signal transforms, general Lipschitz-continuous non-linearities (e.g., rectified linear units, shifted logistic sigmoids, hyperbolic tangents, and modulus functions), and incorporates continuous-time Lipschitz pooling operators that emulate discrete-time sub-sampling and averaging. Finally, different network layers may be equipped with different convolutional transforms, different (Lipschitz-continuous) non-linearities, and different (Lipschitz-continuous) pooling operators.

Regarding translation invariance, it was argued, e.g., in [15]–[17], [20], [21], that in practice invariance of the features is crucially governed by network depth and by the presence of pooling operators (such as, e.g., sub-sampling [19], average-pooling [15], [16], or max-pooling [16], [17], [20], [21]). We show that the general feature extractor considered in this paper, indeed, exhibits such a vertical translation invariance and that pooling plays a crucial role in achieving it. Specifically, we prove that the depth of the network determines the extent to which the extracted features are translation-invariant. We also show that pooling is necessary to obtain vertical translation invariance as otherwise the features remain fully translation-covariant irrespective of network depth. We furthermore establish a deformation sensitivity bound valid for signal classes such as, e.g., band-limited functions, cartoon functions [40], and Lipschitz functions [40]. This bound shows that small non-linear deformations of the input signal lead to small changes in the corresponding feature vector.

In terms of mathematical techniques, we draw heavily from continuous frame theory [41], [42]. We develop a proof machinery that is completely detached from the structures (see Footnote 2) of the semi-discrete transforms and the specific form of the Lipschitz non-linearities and Lipschitz pooling operators. The proof of our deformation sensitivity bound is based on two key elements, namely Lipschitz continuity of the feature extractor and a deformation sensitivity bound for the signal class under consideration, namely band-limited functions (as established in the present paper) or cartoon functions and Lipschitz functions as shown in [40]. This "decoupling" approach has important practical ramifications as it shows that whenever we have deformation sensitivity bounds for a signal class, we automatically get deformation sensitivity bounds for the DCNN feature extractor operating on that signal class. Our results hence establish that vertical translation invariance and limited sensitivity to deformations—for signal classes with inherent deformation insensitivity—are guaranteed by the network structure per se rather than the specific convolution kernels, non-linearities, and pooling operators.

Footnote 2: Structure here refers to the structural relationship between the convolution kernels in a given layer, e.g., scaling and rotation operations in the case of the wavelet transform.

Notation. The complex conjugate of $z \in \mathbb{C}$ is denoted by $\overline{z}$. We write $\mathrm{Re}(z)$ for the real, and $\mathrm{Im}(z)$ for the imaginary part of $z \in \mathbb{C}$. The Euclidean inner product of $x, y \in \mathbb{C}^d$ is $\langle x, y \rangle := \sum_{i=1}^{d} x_i \overline{y_i}$, with associated norm $|x| := \sqrt{\langle x, x \rangle}$. We denote the identity matrix by $E \in \mathbb{R}^{d \times d}$. For the matrix $M \in \mathbb{R}^{d \times d}$, $M_{i,j}$ designates the entry in its $i$-th row and $j$-th column, and for a tensor $T \in \mathbb{R}^{d \times d \times d}$, $T_{i,j,k}$ refers to its $(i,j,k)$-th component.
The supremum norm of the matrix $M \in \mathbb{R}^{d \times d}$ is defined as $|M|_\infty := \sup_{i,j} |M_{i,j}|$, and the supremum norm of the tensor $T \in \mathbb{R}^{d \times d \times d}$ is $|T|_\infty := \sup_{i,j,k} |T_{i,j,k}|$. We write $B_r(x) \subseteq \mathbb{R}^d$ for the open ball of radius $r > 0$ centered at $x \in \mathbb{R}^d$. $O(d)$ stands for the orthogonal group of dimension $d \in \mathbb{N}$, and $SO(d)$ for the special orthogonal group.

For a Lebesgue-measurable function $f : \mathbb{R}^d \to \mathbb{C}$, we write $\int_{\mathbb{R}^d} f(x)\,dx$ for the integral of $f$ w.r.t. Lebesgue measure $\mu_L$. For $p \in [1, \infty)$, $L^p(\mathbb{R}^d)$ stands for the space of Lebesgue-measurable functions $f : \mathbb{R}^d \to \mathbb{C}$ satisfying $\|f\|_p := (\int_{\mathbb{R}^d} |f(x)|^p\,dx)^{1/p} < \infty$. $L^\infty(\mathbb{R}^d)$ denotes the space of Lebesgue-measurable functions $f : \mathbb{R}^d \to \mathbb{C}$ such that $\|f\|_\infty := \inf\{\alpha > 0 \mid |f(x)| \le \alpha \text{ for a.e. } x \in \mathbb{R}^d\} < \infty$ (throughout, "a.e." is w.r.t. Lebesgue measure). For $f, g \in L^2(\mathbb{R}^d)$ we set $\langle f, g \rangle := \int_{\mathbb{R}^d} f(x)\overline{g(x)}\,dx$. For $R > 0$, the space of $R$-band-limited functions is denoted as $L^2_R(\mathbb{R}^d) := \{f \in L^2(\mathbb{R}^d) \mid \mathrm{supp}(\widehat{f}\,) \subseteq B_R(0)\}$. For a countable set $\mathcal{Q}$, $(L^2(\mathbb{R}^d))^{\mathcal{Q}}$ stands for the space of sets $s := \{s_q\}_{q \in \mathcal{Q}}$, $s_q \in L^2(\mathbb{R}^d)$, for all $q \in \mathcal{Q}$, satisfying $|||s||| := (\sum_{q \in \mathcal{Q}} \|s_q\|_2^2)^{1/2} < \infty$.

$\mathrm{Id} : L^p(\mathbb{R}^d) \to L^p(\mathbb{R}^d)$ denotes the identity operator on $L^p(\mathbb{R}^d)$. The tensor product of functions $f, g : \mathbb{R}^d \to \mathbb{C}$ is $(f \otimes g)(x, y) := f(x)g(y)$, $(x, y) \in \mathbb{R}^d \times \mathbb{R}^d$. The operator norm of the bounded linear operator $A : L^p(\mathbb{R}^d) \to L^q(\mathbb{R}^d)$ is defined as $\|A\|_{p,q} := \sup_{\|f\|_p = 1} \|Af\|_q$. We denote the Fourier transform of $f \in L^1(\mathbb{R}^d)$ by $\widehat{f}(\omega) := \int_{\mathbb{R}^d} f(x)\, e^{-2\pi i \langle x, \omega \rangle}\,dx$ and extend it in the usual way to $L^2(\mathbb{R}^d)$ [43, Theorem 7.9]. The convolution of $f \in L^2(\mathbb{R}^d)$ and $g \in L^1(\mathbb{R}^d)$ is $(f * g)(y) := \int_{\mathbb{R}^d} f(x)\, g(y - x)\,dx$. We write $(T_t f)(x) := f(x - t)$, $t \in \mathbb{R}^d$, for the translation operator, and $(M_\omega f)(x) := e^{2\pi i \langle x, \omega \rangle} f(x)$, $\omega \in \mathbb{R}^d$, for the modulation operator. Involution is defined by $(If)(x) := \overline{f(-x)}$.

A multi-index $\alpha = (\alpha_1, \ldots, \alpha_d) \in \mathbb{N}_0^d$ is an ordered $d$-tuple of non-negative integers $\alpha_i \in \mathbb{N}_0$. For a multi-index $\alpha \in \mathbb{N}_0^d$, $D^\alpha$ denotes the differential operator $D^\alpha := (\partial/\partial x_1)^{\alpha_1} \cdots (\partial/\partial x_d)^{\alpha_d}$, with order $|\alpha| := \sum_{i=1}^{d} \alpha_i$. If $|\alpha| = 0$, $D^\alpha f := f$, for $f : \mathbb{R}^d \to \mathbb{C}$. The space of functions $f : \mathbb{R}^d \to \mathbb{C}$ whose derivatives $D^\alpha f$ of order at most $N \in \mathbb{N}_0$ are continuous is designated by $C^N(\mathbb{R}^d, \mathbb{C})$, and the space of infinitely differentiable functions is $C^\infty(\mathbb{R}^d, \mathbb{C})$. $S(\mathbb{R}^d, \mathbb{C})$ stands for the Schwartz space, i.e., the space of functions $f \in C^\infty(\mathbb{R}^d, \mathbb{C})$ whose derivatives $D^\alpha f$ along with the function itself are rapidly decaying [43, Section 7.3] in the sense of $\sup_{|\alpha| \le N} \sup_{x \in \mathbb{R}^d} (1 + |x|^2)^N |(D^\alpha f)(x)| < \infty$, for all $N \in \mathbb{N}_0$. We denote the gradient of a function $f : \mathbb{R}^d \to \mathbb{C}$ as $\nabla f$. The space of continuous mappings $v : \mathbb{R}^p \to \mathbb{R}^q$ is $C(\mathbb{R}^p, \mathbb{R}^q)$, and for $k, p, q \in \mathbb{N}$, the space of $k$-times continuously differentiable mappings $v : \mathbb{R}^p \to \mathbb{R}^q$ is written as $C^k(\mathbb{R}^p, \mathbb{R}^q)$. For a mapping $v : \mathbb{R}^d \to \mathbb{R}^d$, we let $Dv$ be its Jacobian matrix, and $D^2 v$ its Jacobian tensor, with associated norms $\|v\|_\infty := \sup_{x \in \mathbb{R}^d} |v(x)|$, $\|Dv\|_\infty := \sup_{x \in \mathbb{R}^d} |(Dv)(x)|_\infty$, and $\|D^2 v\|_\infty := \sup_{x \in \mathbb{R}^d} |(D^2 v)(x)|_\infty$.
II. SCATTERING NETWORKS

We set the stage by reviewing scattering networks as introduced in [22], the basis of which is a multi-layer architecture that involves a wavelet transform followed by the modulus non-linearity, without subsequent pooling. Specifically, [22, Definition 2.4] defines the feature vector $\Phi_W(f)$ of the signal $f \in L^2(\mathbb{R}^d)$ as the set (we emphasize that the feature vector $\Phi_W(f)$ is a union of the sets of feature vectors $\Phi_W^n(f)$)

$$\Phi_W(f) := \bigcup_{n=0}^{\infty} \Phi_W^n(f), \qquad (1)$$

where $\Phi_W^0(f) := \{f * \psi_{(-J,0)}\}$, and

$$\Phi_W^n(f) := \big\{ U[\lambda^{(j)}, \ldots, \lambda^{(p)}]\, f * \psi_{(-J,0)} \big\}_{\lambda^{(j)}, \ldots, \lambda^{(p)} \in \Lambda_W \setminus \{(-J,0)\}},$$

for all $n \in \mathbb{N}$, where the path $(\lambda^{(j)}, \ldots, \lambda^{(p)})$ consists of $n$ indices, and

$$U[\lambda^{(j)}, \ldots, \lambda^{(p)}]\, f := \big|\, \cdots \big|\, |f * \psi_{\lambda^{(j)}}| * \psi_{\lambda^{(k)}} \big| \cdots * \psi_{\lambda^{(p)}} \big|$$

denotes an $n$-fold convolution followed by modulus. Here, the index set

$$\Lambda_W := \{(-J, 0)\} \cup \{(j,k) \mid j \in \mathbb{Z} \text{ with } j > -J,\ k \in \{0, \ldots, K-1\}\}$$

contains pairs of scales $j$ and directions $k$ (in fact, $k$ is the index of the direction described by the rotation matrix $r_k$), and

$$\psi_\lambda(x) := 2^{dj} \psi(2^j r_k^{-1} x), \qquad (2)$$

where $\lambda = (j,k) \in \Lambda_W \setminus \{(-J,0)\}$, are directional wavelets [30], [44], [45] with (complex-valued) mother wavelet $\psi \in L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$. The $r_k$, $k \in \{0, \ldots, K-1\}$, are elements of a finite rotation group $G$ (if $d$ is even, $G$ is a subgroup of $SO(d)$; if $d$ is odd, $G$ is a subgroup of $O(d)$). The index $(-J, 0) \in \Lambda_W$ is associated with the low-pass filter $\psi_{(-J,0)} \in L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$, and $J \in \mathbb{Z}$ corresponds to the coarsest scale resolved by the directional wavelets (2).

The family of functions $\{\psi_\lambda\}_{\lambda \in \Lambda_W}$ is taken to form a semi-discrete Parseval frame $\Psi_{\Lambda_W} := \{T_b I \psi_\lambda\}_{b \in \mathbb{R}^d, \lambda \in \Lambda_W}$ for $L^2(\mathbb{R}^d)$ [30], [41], [42] and hence satisfies

$$\sum_{\lambda \in \Lambda_W} \int_{\mathbb{R}^d} |\langle f, T_b I \psi_\lambda \rangle|^2\, db = \sum_{\lambda \in \Lambda_W} \|f * \psi_\lambda\|_2^2 = \|f\|_2^2,$$

for all $f \in L^2(\mathbb{R}^d)$, where $\langle f, T_b I \psi_\lambda \rangle = (f * \psi_\lambda)(b)$, $(\lambda, b) \in \Lambda_W \times \mathbb{R}^d$, are the underlying frame coefficients. Note that for given $\lambda \in \Lambda_W$, we actually have a continuum of frame coefficients as the translation parameter $b \in \mathbb{R}^d$ is left unsampled. We refer to Figure 1 for a frequency-domain illustration of a semi-discrete directional wavelet frame. In Appendix A, we give a brief review of the general theory of semi-discrete frames, and in Appendices B and C we collect structured example frames in 1-D and 2-D, respectively.

Fig. 1: Partitioning of the frequency plane $\mathbb{R}^2$ induced by a semi-discrete directional wavelet frame with $K = 12$ directions.

The architecture corresponding to the feature extractor $\Phi_W$ in (1), illustrated in Fig. 2, is known as a scattering network [22], and employs the frame $\Psi_{\Lambda_W}$ and the modulus non-linearity $|\cdot|$ in every network layer, but does not include pooling. For given $n \in \mathbb{N}$, the set $\Phi_W^n(f)$ in (1) corresponds to the features of the function $f$ generated in the $n$-th network layer, see Fig. 2.

Remark 1. The function $|f * \psi_\lambda|$, $\lambda \in \Lambda_W \setminus \{(-J,0)\}$, can be thought of as indicating the locations of singularities of $f \in L^2(\mathbb{R}^d)$. Specifically, with the relation of $|f * \psi_\lambda|$ to the Canny edge detector [46] as described in [31], in dimension $d = 2$, we can think of $|f * \psi_\lambda| = |f * \psi_{(j,k)}|$, $\lambda = (j,k) \in \Lambda_W \setminus \{(-J,0)\}$, as an image at scale $j$ specifying the locations of edges of the image $f$ that are oriented in direction $k$. Furthermore, it was argued in [23], [25], [47] that the feature vector $\Phi_W^1(f)$ generated in the first layer of the scattering network is very similar, in dimension $d = 1$, to mel frequency cepstral coefficients [48], and in dimension $d = 2$ to SIFT-descriptors [49], [50].
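To make the construction in (1) concrete, the following minimal numerical sketch cascades wavelet-style convolutions and the modulus in 1-D, collecting low-pass outputs per layer. The Gaussian band-pass and low-pass filters, their centers and widths, and all function names are illustrative stand-ins and not taken from [22]; a faithful implementation would use an actual directional wavelet frame.

```python
import numpy as np

# Minimal 1-D sketch of the scattering construction (1): cascaded
# wavelet-style convolutions and modulus, with low-pass outputs per layer.

N = 1024
omega = np.fft.fftfreq(N)  # digital frequencies in [-1/2, 1/2)

def gauss_bandpass(center, width):
    # crude band-pass atom, specified directly in the frequency domain
    return np.exp(-((omega - center) ** 2) / (2 * width ** 2))

psi_hats = [gauss_bandpass(2.0 ** j, 2.0 ** j / 4) for j in range(-5, -1)]
phi_hat = np.exp(-(omega ** 2) / (2 * 0.01 ** 2))   # low-pass, plays ψ_{(-J,0)}

def wavelet_modulus(f):
    # one scattering layer: |f * ψ_λ| for every atom λ
    f_hat = np.fft.fft(f)
    return [np.abs(np.fft.ifft(f_hat * p)) for p in psi_hats]

def lowpass(f):
    return np.real(np.fft.ifft(np.fft.fft(f) * phi_hat))

f = np.random.randn(N)
features = [lowpass(f)]                  # layer 0: f * ψ_{(-J,0)}
layer1 = wavelet_modulus(f)              # layer 1 propagated signals
features += [lowpass(u) for u in layer1]
for u in layer1:                         # layer 2 features
    features += [lowpass(v) for v in wavelet_modulus(u)]
```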
It is shown in [22, Theorem 2.10] that the feature extractor $\Phi_W$ is translation-invariant in the sense of

$$\lim_{J \to \infty} |||\Phi_W(T_t f) - \Phi_W(f)||| = 0, \qquad (3)$$

for all $f \in L^2(\mathbb{R}^d)$ and $t \in \mathbb{R}^d$. This invariance result is asymptotic in the scale parameter $J \in \mathbb{Z}$ and does not depend on the network depth, i.e., it guarantees full translation invariance in every network layer. Furthermore, [22, Theorem 2.12] establishes that $\Phi_W$ is stable w.r.t. deformations of the form $(F_\tau f)(x) := f(x - \tau(x))$. More formally, for the function space $(H_W, \|\cdot\|_{H_W})$ defined in [22, Eq. 2.46], it is shown in [22, Theorem 2.12] that there exists a constant $C > 0$ such that for all $f \in H_W$, and $\tau \in C^1(\mathbb{R}^d, \mathbb{R}^d)$ with $\|D\tau\|_\infty \le \frac{1}{2d}$ (see Footnote 5), the deformation error satisfies the following deformation stability bound:

$$|||\Phi_W(F_\tau f) - \Phi_W(f)||| \le C \big( 2^{-J} \|\tau\|_\infty + J \|D\tau\|_\infty + \|D^2 \tau\|_\infty \big) \|f\|_{H_W}. \qquad (4)$$

Note that this upper bound goes to infinity as translation invariance through $J \to \infty$ is induced.

Footnote 5: It is actually the assumption $\|D\tau\|_\infty \le \frac{1}{2d}$, rather than $\|D\tau\|_\infty \le \frac{1}{2}$ as stated in [22, Theorem 2.12], that is needed in [22, p. 1390] to establish that $|\det(E - (D\tau)(x))| \ge 1 - d\|D\tau\|_\infty \ge 1/2$.

Fig. 2: Scattering network architecture based on wavelet filters and the modulus non-linearity. The elements of the feature vector $\Phi_W(f)$ in (1) are indicated at the tips of the arrows.

In practice, signal classification based on scattering networks is performed as follows. First, the function $f$ and the wavelet frame atoms $\{\psi_\lambda\}_{\lambda \in \Lambda_W}$ are discretized to finite-dimensional vectors. The resulting scattering network then computes the finite-dimensional feature vector $\Phi_W(f)$, whose dimension is typically reduced through an orthogonal least squares step [51], and then feeds the result into a trainable classifier such as, e.g., a SVM. State-of-the-art results for scattering networks were reported for various classification tasks such as handwritten digit recognition [23], texture discrimination [23], [24], and musical genre classification [25].

III. GENERAL DEEP CONVOLUTIONAL FEATURE EXTRACTORS

As already mentioned, scattering networks follow the architecture of DCNNs [2], [7]–[11], [15]–[21] in the sense of cascading convolutions (with atoms $\{\psi_\lambda\}_{\lambda \in \Lambda_W}$ of the wavelet frame $\Psi_{\Lambda_W}$) and non-linearities, namely the modulus function, but without pooling. General DCNNs as studied in the literature exhibit a number of additional features:

– a wide variety of filters are employed, namely pre-specified unstructured filters such as random filters [16], [17], and filters that are learned in a supervised [15], [16] or an unsupervised [16]–[18] fashion.
– a wide variety of non-linearities are used such as, e.g., hyperbolic tangents [15]–[17], rectified linear units [26], [27], and logistic sigmoids [28], [29].

– convolution and the application of a non-linearity is typically followed by a pooling operator such as, e.g., sub-sampling [19], average-pooling [15], [16], or max-pooling [16], [17], [20], [21].

– the filters, non-linearities, and pooling operators are allowed to be different in different network layers [11], [14].

As already mentioned, the purpose of this paper is to develop a mathematical theory of DCNNs for feature extraction that encompasses all of the aspects above (apart from max-pooling) with the proviso that the pooling operators we analyze are continuous-time emulations of discrete-time pooling operators.

Formally, compared to scattering networks, in the $n$-th network layer, we replace the wavelet-modulus operation $|f * \psi_\lambda|$ by a convolution with the atoms $g_{\lambda_n} \in L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$ of a general semi-discrete frame $\Psi_n := \{T_b I g_{\lambda_n}\}_{b \in \mathbb{R}^d, \lambda_n \in \Lambda_n}$ for $L^2(\mathbb{R}^d)$ with countable index set $\Lambda_n$ (see Appendix A for a brief review of the theory of semi-discrete frames), followed by a non-linearity $M_n : L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ that satisfies the Lipschitz property $\|M_n f - M_n h\|_2 \le L_n \|f - h\|_2$, for all $f, h \in L^2(\mathbb{R}^d)$, and $M_n f = 0$ for $f = 0$. The output of this non-linearity, $M_n(f * g_{\lambda_n})$, is then pooled according to

$$f \mapsto S_n^{d/2}\, P_n(f)(S_n \cdot), \qquad (5)$$

where $S_n \ge 1$ is the pooling factor and $P_n : L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ satisfies the Lipschitz property $\|P_n f - P_n h\|_2 \le R_n \|f - h\|_2$, for all $f, h \in L^2(\mathbb{R}^d)$, and $P_n f = 0$ for $f = 0$.

We next comment on the individual elements in our network architecture in more detail. The frame atoms $g_{\lambda_n}$ are arbitrary and can, therefore, also be taken to be structured, e.g., Weyl-Heisenberg functions, curvelets, shearlets, ridgelets, or wavelets as considered in [22] (where the atoms $g_{\lambda_n}$ are obtained from a mother wavelet through scaling and rotation operations, see Section II). The corresponding semi-discrete signal transforms (see Footnote 6), briefly reviewed in Appendices B and C, have been employed successfully in the literature in various feature extraction tasks [32], [54]–[61], but their use—apart from wavelets—in DCNNs appears to be new. We refer the reader to Appendix D for a detailed discussion of several relevant example non-linearities (e.g., rectified linear units, shifted logistic sigmoids, hyperbolic tangents, and, of course, the modulus function) that fit into our framework.

Footnote 6: Let $\{g_\lambda\}_{\lambda \in \Lambda} \subseteq L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$ be a set of functions indexed by a countable set $\Lambda$. Then, the mapping $f \mapsto \{(f * g_\lambda)(b)\}_{b \in \mathbb{R}^d, \lambda \in \Lambda} = \{\langle f, T_b I g_\lambda \rangle\}_{\lambda \in \Lambda}$, $f \in L^2(\mathbb{R}^d)$, is called a semi-discrete signal transform, as it depends on discrete indices $\lambda \in \Lambda$ and continuous variables $b \in \mathbb{R}^d$. We can think of this mapping as the analysis operator in frame theory [53], with the proviso that for given $\lambda \in \Lambda$, we actually have a continuum of frame coefficients as the translation parameter $b \in \mathbb{R}^d$ is left unsampled.

We next explain how the continuous-time pooling operator (5) emulates discrete-time pooling by sub-sampling [19] or by averaging [15], [16]. Consider a one-dimensional discrete-time signal $f_d \in \ell^2(\mathbb{Z}) := \{f_d : \mathbb{Z} \to \mathbb{C} \mid \sum_{k \in \mathbb{Z}} |f_d[k]|^2 < \infty\}$. Sub-sampling by a factor of $S \in \mathbb{N}$ in discrete time is defined by [62, Sec. 4]
$$f_d \mapsto h_d := f_d[S\,\cdot\,]$$

and amounts to simply retaining every $S$-th sample of $f_d$. The discrete-time Fourier transform of $h_d$ is given by a summation over translated and dilated copies of $\widehat{f_d}$ according to [62, Sec. 4]

$$\widehat{h_d}(\theta) := \sum_{k \in \mathbb{Z}} h_d[k]\, e^{-2\pi i k \theta} = \frac{1}{S} \sum_{k=0}^{S-1} \widehat{f_d}\Big(\frac{\theta - k}{S}\Big). \qquad (6)$$

The translated copies of $\widehat{f_d}$ in (6) are a consequence of the 1-periodicity of the discrete-time Fourier transform. We therefore emulate the discrete-time sub-sampling operation in continuous time through the dilation operation

$$f \mapsto h := S^{d/2} f(S\,\cdot\,), \quad f \in L^2(\mathbb{R}^d), \qquad (7)$$

which in the frequency domain amounts to dilation according to $\widehat{h} = S^{-d/2} \widehat{f}(S^{-1}\,\cdot\,)$. The scaling by $S^{d/2}$ in (7) ensures unitarity of the continuous-time sub-sampling operation. The overall operation in (7) fits into our general definition of pooling as it can be recovered from (5) simply by taking $P$ to equal the identity mapping (which is, of course, Lipschitz-continuous with Lipschitz constant $R = 1$ and satisfies $\mathrm{Id}\,f = 0$ for $f = 0$).

Next, we consider average pooling. In discrete time, average pooling is defined by

$$f_d \mapsto h_d := (f_d * \phi_d)[S\,\cdot\,] \qquad (8)$$

for the (typically compactly supported) "averaging kernel" $\phi_d \in \ell^2(\mathbb{Z})$ and the averaging factor $S \in \mathbb{N}$. Taking $\phi_d$ to be a box function of length $S$ amounts to computing local averages of $S$ consecutive samples. Weighted averages are obtained by identifying the desired weights with the averaging kernel $\phi_d$. The operation (8) can be emulated in continuous time according to

$$f \mapsto S^{d/2} (f * \phi)(S\,\cdot\,), \quad f \in L^2(\mathbb{R}^d), \qquad (9)$$

with the averaging window $\phi \in L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$. We note that (9) can be recovered from (5) by taking $P(f) = f * \phi$, $f \in L^2(\mathbb{R}^d)$, and noting that convolution with $\phi$ is Lipschitz-continuous with Lipschitz constant $R = \|\phi\|_1$ (thanks to Young's inequality [63, Theorem 1.2.12]) and trivially satisfies $Pf = 0$ for $f = 0$. In the remainder of the paper, we refer to the operation in (5) as Lipschitz pooling through dilation to indicate that (5) essentially amounts to the application of a Lipschitz-continuous mapping followed by a continuous-time dilation. Note, however, that the operation in (5) will not be unitary in general.
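As a concrete illustration of the two discrete-time pooling operations that (5) emulates, the following sketch applies sub-sampling $f_d[S\,\cdot\,]$ and box-kernel average pooling $(f_d * \phi_d)[S\,\cdot\,]$ to a test signal; the factor $S$, the signal, and the kernel length are arbitrary choices.

```python
import numpy as np

# Discrete-time pooling: plain sub-sampling and average pooling with a
# box "averaging kernel" of length S, followed by sub-sampling.

S = 4
f_d = np.random.randn(64)

subsampled = f_d[::S]                                  # retain every S-th sample

phi_d = np.ones(S) / S                                 # box kernel of length S
averaged = np.convolve(f_d, phi_d, mode='same')[::S]   # local averages, then sub-sample
```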
We next state definitions and collect preliminary results needed for the analysis of the general DCNN feature extractor considered. The basic building blocks of this network are the triplets $(\Psi_n, M_n, P_n)$ associated with individual network layers $n$ and referred to as modules.

Definition 1. For $n \in \mathbb{N}$, let $\Psi_n = \{T_b I g_{\lambda_n}\}_{b \in \mathbb{R}^d, \lambda_n \in \Lambda_n}$ be a semi-discrete frame for $L^2(\mathbb{R}^d)$ and let $M_n : L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ and $P_n : L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ be Lipschitz-continuous operators with $M_n f = 0$ and $P_n f = 0$ for $f = 0$, respectively. Then, the sequence of triplets $\Omega := \big( (\Psi_n, M_n, P_n) \big)_{n \in \mathbb{N}}$ is referred to as a module-sequence.

The following definition introduces the concept of paths on index sets, which will prove useful in formalizing the feature extraction network. The idea for this formalism is due to [22].

Definition 2. Let $\Omega = \big( (\Psi_n, M_n, P_n) \big)_{n \in \mathbb{N}}$ be a module-sequence, let $\{g_{\lambda_n}\}_{\lambda_n \in \Lambda_n}$ be the atoms of the frame $\Psi_n$, and let $S_n \ge 1$ be the pooling factor (according to (5)) associated with the $n$-th network layer. Define the operator $U_n$ associated with the $n$-th layer of the network as $U_n : \Lambda_n \times L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$,

$$U_n(\lambda_n, f) := U_n[\lambda_n] f := S_n^{d/2}\, P_n\big( M_n(f * g_{\lambda_n}) \big)(S_n \cdot). \qquad (10)$$

For $n \in \mathbb{N}$, define the set $\Lambda_1^n := \Lambda_1 \times \Lambda_2 \times \cdots \times \Lambda_n$. An ordered sequence $q = (\lambda_1, \lambda_2, \ldots, \lambda_n) \in \Lambda_1^n$ is called a path. For the empty path $e := \emptyset$ we set $\Lambda_1^0 := \{e\}$ and $U_0[e] f := f$, for all $f \in L^2(\mathbb{R}^d)$.

The operator $U_n$ is well-defined, i.e., $U_n[\lambda_n] f \in L^2(\mathbb{R}^d)$, for all $(\lambda_n, f) \in \Lambda_n \times L^2(\mathbb{R}^d)$, thanks to

$$\|U_n[\lambda_n] f\|_2^2 = S_n^d \int_{\mathbb{R}^d} \big| P_n\big( M_n(f * g_{\lambda_n}) \big)(S_n x) \big|^2\, dx = \big\| P_n\big( M_n(f * g_{\lambda_n}) \big) \big\|_2^2 \le R_n^2\, \|M_n(f * g_{\lambda_n})\|_2^2 \qquad (11)$$

$$\le L_n^2 R_n^2\, \|f * g_{\lambda_n}\|_2^2 \le B_n L_n^2 R_n^2\, \|f\|_2^2. \qquad (12)$$

For the inequality in (11) we used the Lipschitz continuity of $P_n$ according to $\|P_n f - P_n h\|_2^2 \le R_n^2 \|f - h\|_2^2$, together with $P_n h = 0$ for $h = 0$, to get $\|P_n f\|_2^2 \le R_n^2 \|f\|_2^2$. Similar arguments lead to the first inequality in (12). The last step in (12) is thanks to

$$\|f * g_{\lambda_n}\|_2^2 \le \sum_{\lambda'_n \in \Lambda_n} \|f * g_{\lambda'_n}\|_2^2 \le B_n \|f\|_2^2,$$

which follows from the frame condition (30) on $\Psi_n$. We will also need the extension of the operator $U_n$ to paths $q \in \Lambda_1^n$ according to

$$U[q] f = U[(\lambda_1, \lambda_2, \ldots, \lambda_n)] f := U_n[\lambda_n] \cdots U_2[\lambda_2]\, U_1[\lambda_1] f, \qquad (13)$$

with $U[e] f := f$. Note that the multi-stage operation (13) is again well-defined thanks to

$$\|U[q] f\|_2^2 \le \Big( \prod_{k=1}^{n} B_k L_k^2 R_k^2 \Big) \|f\|_2^2, \qquad (14)$$

for $q \in \Lambda_1^n$ and $f \in L^2(\mathbb{R}^d)$, which follows by repeated application of (12).
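A minimal discrete sketch of the layer operator (10) and its path extension (13) might look as follows; the atoms, the modulus non-linearity, and pooling by plain sub-sampling are placeholder choices.

```python
import numpy as np

# Discrete 1-D sketch of U_n[λ_n]f in (10): convolve with an atom g,
# apply a pointwise Lipschitz non-linearity M, then pool by sub-sampling.

def layer_op(f, g, nonlin=np.abs, S=2):
    propagated = np.convolve(f, g, mode='same')   # f * g_{λ_n}
    return nonlin(propagated)[::S]                # M_n, then pooling by sub-sampling

def apply_path(f, atoms, S=2):
    # U[q]f = U_n[λ_n] ··· U_1[λ_1]f as in (13), one atom per layer
    for g in atoms:
        f = layer_op(f, g, S=S)
    return f

f = np.random.randn(256)
atoms = [np.random.randn(9) for _ in range(3)]    # three layers of length-9 filters
print(apply_path(f, atoms).shape)                 # 256 / 2^3 = 32 samples
```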
In scattering networks one atom $\psi_\lambda$, $\lambda \in \Lambda_W$, in the wavelet frame $\Psi_{\Lambda_W}$, namely the low-pass filter $\psi_{(-J,0)}$, is singled out to generate the extracted features according to (1), see also Fig. 2. We follow this construction and designate one of the atoms in each frame in the module-sequence $\Omega = \big( (\Psi_n, M_n, P_n) \big)_{n \in \mathbb{N}}$ as the output-generating atom $\chi_{n-1} := g_{\lambda_n^*}$, $\lambda_n^* \in \Lambda_n$, of the $(n-1)$-th layer. The atoms $\{g_{\lambda_n}\}_{\lambda_n \in \Lambda_n \setminus \{\lambda_n^*\}} \cup \{\chi_{n-1}\}$ in $\Psi_n$ are thus used across two consecutive layers in the sense of $\chi_{n-1} = g_{\lambda_n^*}$ generating the output in the $(n-1)$-th layer, and the $\{g_{\lambda_n}\}_{\lambda_n \in \Lambda_n \setminus \{\lambda_n^*\}}$ propagating signals from the $(n-1)$-th layer to the $n$-th layer according to (10), see Fig. 3. Note, however, that our theory does not require the output-generating atoms to be low-pass filters (it is evident, though, that the actual choices of the output-generating atoms will have an impact on practical performance). From now on, with slight abuse of notation, we shall write $\Lambda_n$ for $\Lambda_n \setminus \{\lambda_n^*\}$ as well. Finally, we note that extracting features in every network layer via an output-generating atom can be regarded as employing skip-layer connections [13], which skip network layers further down and feed the propagated signals into the feature vector.

Fig. 3: Network architecture underlying the general DCNN feature extractor. The index $\lambda_n^{(k)}$ corresponds to the $k$-th atom $g_{\lambda_n^{(k)}}$ of the frame $\Psi_n$ associated with the $n$-th network layer. The function $\chi_n$ is the output-generating atom of the $n$-th layer.

We are now ready to define the feature extractor $\Phi_\Omega$ based on the module-sequence $\Omega$.

Definition 3. Let $\Omega = \big( (\Psi_n, M_n, P_n) \big)_{n \in \mathbb{N}}$ be a module-sequence. The feature extractor $\Phi_\Omega$ based on $\Omega$ maps $L^2(\mathbb{R}^d)$ to its feature vector

$$\Phi_\Omega(f) := \bigcup_{n=0}^{\infty} \Phi_\Omega^n(f), \qquad (15)$$

where $\Phi_\Omega^n(f) := \{(U[q] f) * \chi_n\}_{q \in \Lambda_1^n}$, for all $n \in \mathbb{N}$.

The set $\Phi_\Omega^n(f)$ in (15) corresponds to the features of the function $f$ generated in the $n$-th network layer, see Fig. 3, where $n = 0$ corresponds to the root of the network. The feature extractor $\Phi_\Omega : L^2(\mathbb{R}^d) \to (L^2(\mathbb{R}^d))^{\mathcal{Q}}$, with $\mathcal{Q} := \bigcup_{n=0}^{\infty} \Lambda_1^n$, is well-defined, i.e., $\Phi_\Omega(f) \in (L^2(\mathbb{R}^d))^{\mathcal{Q}}$, for all $f \in L^2(\mathbb{R}^d)$, under a technical condition on the module-sequence $\Omega$ formalized as follows.

Proposition 1. Let $\Omega = \big( (\Psi_n, M_n, P_n) \big)_{n \in \mathbb{N}}$ be a module-sequence. Denote the frame upper bounds of $\Psi_n$ by $B_n > 0$ and the Lipschitz constants of the operators $M_n$ and $P_n$ by $L_n > 0$ and $R_n > 0$, respectively. If

$$\max\{B_n,\, B_n L_n^2 R_n^2\} \le 1, \quad \forall n \in \mathbb{N}, \qquad (16)$$

then the feature extractor $\Phi_\Omega : L^2(\mathbb{R}^d) \to (L^2(\mathbb{R}^d))^{\mathcal{Q}}$ is well-defined, i.e., $\Phi_\Omega(f) \in (L^2(\mathbb{R}^d))^{\mathcal{Q}}$, for all $f \in L^2(\mathbb{R}^d)$.

Proof. The proof is given in Appendix E.

As condition (16) is of central importance, we formalize it as follows.

Definition 4. Let $\Omega = \big( (\Psi_n, M_n, P_n) \big)_{n \in \mathbb{N}}$ be a module-sequence with frame upper bounds $B_n > 0$ and Lipschitz constants $L_n, R_n > 0$ of the operators $M_n$ and $P_n$, respectively. The condition

$$\max\{B_n,\, B_n L_n^2 R_n^2\} \le 1, \quad \forall n \in \mathbb{N}, \qquad (17)$$

is referred to as the admissibility condition. Module-sequences that satisfy (17) are called admissible.

Fig. 4: Handwritten digits from the MNIST data set [5]. For practical machine learning tasks (e.g., signal classification), we often want the feature vector $\Phi_\Omega(f)$ to be invariant to the digits' spatial location within the image $f$. Theorem 1 establishes that the features $\Phi_\Omega^n(f)$ become more translation-invariant with increasing layer index $n$.

We emphasize that condition (17) is easily met in practice. To see this, first note that $B_n$ is determined through the frame $\Psi_n$ (e.g., the directional wavelet frame introduced in Section II has $B = 1$), $L_n$ is set through the non-linearity $M_n$ (e.g., the modulus function $M = |\cdot|$ has $L = 1$, see Appendix D), and $R_n$ depends on the operator $P_n$ in (5) (e.g., pooling by sub-sampling amounts to $P = \mathrm{Id}$ and has $R = 1$). Obviously, condition (17) is met if

$$B_n \le \min\{1,\, L_n^{-2} R_n^{-2}\}, \quad \forall n \in \mathbb{N},$$

which can be satisfied by simply normalizing the frame elements of $\Psi_n$ accordingly. We refer to Proposition 3 in Appendix A for corresponding normalization techniques, which, as explained in Section IV, affect neither our translation invariance result nor our deformation sensitivity bounds.
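The normalization just described is easy to mechanize. The following sketch, based on Proposition 3 i) and the Littlewood-Paley characterization (32), rescales a set of discretized frequency responses so that the resulting upper bound satisfies $B \le \min\{1, L^{-2}R^{-2}\}$; the inputs are assumed to be sampled versions of $\widehat{g_\lambda}$ on a common grid.

```python
import numpy as np

# Enforce the admissibility condition (17) by rescaling the atoms by C^{-1/2}
# (Proposition 3 i)): the upper bound B is read off the Littlewood-Paley sum.

def admissible_atoms(g_hats, L=1.0, R=1.0):
    lp = np.sum([np.abs(g) ** 2 for g in g_hats], axis=0)  # Σ_λ |ĝ_λ(ω)|²
    B = lp.max()                                           # frame upper bound, cf. (32)
    C = B / min(1.0, L ** -2 * R ** -2)                    # smallest admissible rescaling
    return [g / np.sqrt(C) for g in g_hats] if C > 1 else g_hats
```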
IV. PROPERTIES OF THE FEATURE EXTRACTOR $\Phi_\Omega$

A. Vertical translation invariance

The following theorem states that under very mild decay conditions on the Fourier transforms $\widehat{\chi_n}$ of the output-generating atoms $\chi_n$, the feature extractor $\Phi_\Omega$ exhibits vertical translation invariance in the sense of the features becoming more translation-invariant with increasing network depth. This result is in line with observations made in the deep learning literature, e.g., in [15]–[17], [20], [21], where it is informally argued that the network outputs generated at deeper layers tend to be more translation-invariant.

Theorem 1. Let $\Omega = \big( (\Psi_n, M_n, P_n) \big)_{n \in \mathbb{N}}$ be an admissible module-sequence, let $S_n \ge 1$, $n \in \mathbb{N}$, be the pooling factors in (10), and assume that the operators $M_n : L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ and $P_n : L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ commute with the translation operator $T_t$, i.e.,

$$M_n T_t f = T_t M_n f, \quad P_n T_t f = T_t P_n f, \qquad (18)$$

for all $f \in L^2(\mathbb{R}^d)$, $t \in \mathbb{R}^d$, and $n \in \mathbb{N}$.

i) The features $\Phi_\Omega^n(f)$ generated in the $n$-th network layer satisfy

$$\Phi_\Omega^n(T_t f) = T_{t/(S_1 \cdots S_n)} \Phi_\Omega^n(f), \qquad (19)$$

for all $f \in L^2(\mathbb{R}^d)$, $t \in \mathbb{R}^d$, and $n \in \mathbb{N}$, where $T_t \Phi_\Omega^n(f)$ refers to element-wise application of $T_t$, i.e., $T_t \Phi_\Omega^n(f) := \{T_t h \mid \forall h \in \Phi_\Omega^n(f)\}$.

ii) If, in addition, there exists a constant $K > 0$ (that does not depend on $n$) such that the Fourier transforms $\widehat{\chi_n}$ of the output-generating atoms $\chi_n$ satisfy the decay condition

$$|\widehat{\chi_n}(\omega)|\,|\omega| \le K, \quad \text{a.e. } \omega \in \mathbb{R}^d,\ \forall n \in \mathbb{N}_0, \qquad (20)$$

then

$$|||\Phi_\Omega^n(T_t f) - \Phi_\Omega^n(f)||| \le \frac{2\pi |t| K}{S_1 \cdots S_n} \|f\|_2, \qquad (21)$$

for all $f \in L^2(\mathbb{R}^d)$ and $t \in \mathbb{R}^d$.

Proof. The proof is given in Appendix F.

We start by noting that all pointwise (also referred to as memoryless in the signal processing literature) non-linearities $M_n : L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ satisfy the commutation relation in (18). A large class of non-linearities widely used in the deep learning literature, such as rectified linear units, hyperbolic tangents, shifted logistic sigmoids, and the modulus function as employed in [22], are, indeed, pointwise and hence covered by Theorem 1. Moreover, $P = \mathrm{Id}$ as in pooling by sub-sampling trivially satisfies (18). Pooling by averaging $Pf = f * \phi$, with $\phi \in L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$, satisfies (18) as a consequence of the convolution operator commuting with the translation operator $T_t$.

Note that (20) can easily be met by taking the output-generating atoms $\{\chi_n\}_{n \in \mathbb{N}_0}$ either to satisfy $\sup_{n \in \mathbb{N}_0} \{\|\chi_n\|_1 + \|\nabla \chi_n\|_1\} < \infty$, see, e.g., [43, Ch. 7], or to be uniformly band-limited in the sense of $\mathrm{supp}(\widehat{\chi_n}) \subseteq B_r(0)$, for all $n \in \mathbb{N}_0$, with an $r$ that is independent of $n$ (see, e.g., [30, Ch. 2.3]).

The bound in (21) shows that we can explicitly control the amount of translation invariance via the pooling factors $S_n$. This result is in line with observations made in the deep learning literature, e.g., in [15]–[17], [20], [21], where it is informally argued that pooling is crucial to get translation invariance of the extracted features. Furthermore, the condition $\lim_{n \to \infty} S_1 \cdot S_2 \cdots S_n = \infty$ (easily met by taking $S_n > 1$, for all $n \in \mathbb{N}$) guarantees, thanks to (21), asymptotically full translation invariance according to

$$\lim_{n \to \infty} |||\Phi_\Omega^n(T_t f) - \Phi_\Omega^n(f)||| = 0, \qquad (22)$$

for all $f \in L^2(\mathbb{R}^d)$ and $t \in \mathbb{R}^d$.

Fig. 5: Handwritten digits from the MNIST data set [5]. If $f$ denotes the image of the handwritten digit "5" in (a), then—for appropriately chosen $\tau$—the function $F_\tau f = f(\cdot - \tau(\cdot))$ models images of "5" based on different handwriting styles as in (b) and (c).

This means that the features $\Phi_\Omega^n(T_t f)$ corresponding to the shifted versions $T_t f$ of the handwritten digit "3" in Figs. 4 (b) and (c) with increasing network depth increasingly "look like" the features $\Phi_\Omega^n(f)$ corresponding to the unshifted handwritten digit in Fig. 4 (a). Casually speaking, the shift operator $T_t$ is increasingly absorbed by $\Phi_\Omega^n$ as $n \to \infty$, with the upper bound (21) quantifying this absorption.
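The covariance relation (19) can be checked numerically in a discrete 1-D setting with one layer, modulus non-linearity, and pooling by sub-sampling; the filter and signal below are arbitrary, circular shifts stand in for translation, and $t$ is chosen divisible by $S$ so that the discrete shift $t/S$ is exact.

```python
import numpy as np

# One-layer check of (19): shifting the input by t shifts the pooled
# layer output by t/S (exact here since S divides t).

N, S, t = 256, 4, 8
f = np.random.randn(N)
g_hat = np.fft.fft(np.random.randn(N))      # some convolution atom (frequency domain)

def features(f):
    # |f * g| followed by sub-sampling with factor S
    return np.abs(np.fft.ifft(np.fft.fft(f) * g_hat))[::S]

lhs = features(np.roll(f, t))               # Φ¹(T_t f)
rhs = np.roll(features(f), t // S)          # T_{t/S} Φ¹(f)
assert np.allclose(lhs, rhs)
```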
In contrast, the translation invariance result (3) in [22] is asymptotic in the wavelet scale parameter $J$, and does not depend on the network depth, i.e., it guarantees full translation invariance in every network layer. We honor this difference by referring to (3) as horizontal translation invariance and to (22) as vertical translation invariance.

We emphasize that vertical translation invariance is a structural property. Specifically, if $P_n$ is unitary (such as, e.g., in the case of pooling by sub-sampling where $P_n$ simply equals the identity mapping), then so is the pooling operation in (5) owing to

$$\|S_n^{d/2} P_n(f)(S_n \cdot)\|_2^2 = S_n^d \int_{\mathbb{R}^d} |P_n(f)(S_n x)|^2\, dx = \int_{\mathbb{R}^d} |P_n(f)(y)|^2\, dy = \|P_n(f)\|_2^2 = \|f\|_2^2,$$

where we employed the change of variables $y = S_n x$, $\frac{dy}{dx} = S_n^d$. Regarding average pooling, as already mentioned, the operators $P_n(f) = f * \phi_n$, $f \in L^2(\mathbb{R}^d)$, $n \in \mathbb{N}$, are, in general, not unitary, but we still get translation invariance as a consequence of structural properties, namely translation covariance of the convolution operator combined with unitary dilation according to (7).

Finally, we note that in certain practical applications it is actually translation covariance in the sense of $\Phi_\Omega^n(T_t f) = T_t \Phi_\Omega^n(f)$, for all $f \in L^2(\mathbb{R}^d)$ and $t \in \mathbb{R}^d$, that is desirable, for example, in facial landmark detection where the goal is to estimate the absolute position of facial landmarks in images. In such applications features in the layers closer to the root of the network are more relevant as they are less translation-invariant and more translation-covariant. The reader is referred to [64] where corresponding numerical evidence is provided. We proceed to the formal statement of our translation covariance result.

Corollary 1. Let $\Omega = \big( (\Psi_n, M_n, P_n) \big)_{n \in \mathbb{N}}$ be an admissible module-sequence, let $S_n \ge 1$, $n \in \mathbb{N}$, be the pooling factors in (10), and assume that the operators $M_n : L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ and $P_n : L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ commute with the translation operator $T_t$ in the sense of (18). If, in addition, there exists a constant $K > 0$ (that does not depend on $n$) such that the Fourier transforms $\widehat{\chi_n}$ of the output-generating atoms $\chi_n$ satisfy the decay condition (20), then

$$|||\Phi_\Omega^n(T_t f) - T_t \Phi_\Omega^n(f)||| \le 2\pi |t| K\, \big| 1/(S_1 \cdots S_n) - 1 \big|\, \|f\|_2,$$

for all $f \in L^2(\mathbb{R}^d)$ and $t \in \mathbb{R}^d$.

Proof. The proof is given in Appendix G.

Corollary 1 shows that the absence of pooling, i.e., taking $S_n = 1$, for all $n \in \mathbb{N}$, leads to full translation covariance in every network layer. This proves that pooling is necessary to get vertical translation invariance as otherwise the features remain fully translation-covariant irrespective of the network depth. Finally, we note that scattering networks [22] (which do not employ pooling operators, see Section II) are rendered horizontally translation-invariant by letting the wavelet scale parameter $J \to \infty$.

B. Deformation sensitivity bound

The next result provides a bound—for band-limited signals $f \in L^2_R(\mathbb{R}^d)$—on the sensitivity of the feature extractor $\Phi_\Omega$ w.r.t. time-frequency deformations of the form

$$(F_{\tau,\omega} f)(x) := e^{2\pi i \omega(x)} f(x - \tau(x)).$$

This class of deformations encompasses non-linear distortions $f(x - \tau(x))$ as illustrated in Fig. 5,
and modulation-like deformations $e^{2\pi i \omega(x)} f(x)$, which occur, e.g., if the signal $f$ is subject to an undesired modulation and we therefore have access to a bandpass version of $f$ only. The deformation sensitivity bound we derive is signal-class specific in the sense of applying to input signals belonging to a particular class, here band-limited functions. The proof technique we develop applies, however, to all signal classes that exhibit "inherent" deformation insensitivity in the following sense.

Definition 5. A signal class $\mathcal{C} \subseteq L^2(\mathbb{R}^d)$ is called deformation-insensitive if there exist $\alpha, \beta, C > 0$ such that for all $f \in \mathcal{C}$, $\omega \in C(\mathbb{R}^d, \mathbb{R})$, and (possibly non-linear) $\tau \in C^1(\mathbb{R}^d, \mathbb{R}^d)$ with $\|D\tau\|_\infty \le \frac{1}{2d}$, it holds that

$$\|f - F_{\tau,\omega} f\|_2 \le C \big( \|\tau\|_\infty^\alpha + \|\omega\|_\infty^\beta \big). \qquad (23)$$

Fig. 6: Impact of the deformation $F_{\tau,\omega}$, with $\tau(x) = \frac{1}{2} e^{-x^2}$ and $\omega = 0$, on the functions $f_1 \in \mathcal{C}_1 \subseteq L^2(\mathbb{R})$ and $f_2 \in \mathcal{C}_2 \subseteq L^2(\mathbb{R})$. The signal class $\mathcal{C}_1$ consists of smooth, slowly varying functions (e.g., band-limited functions), and $\mathcal{C}_2$ consists of compactly supported functions that exhibit discontinuities (e.g., cartoon functions [65]). We observe that $f_1$, unlike $f_2$, is affected only mildly by $F_{\tau,\omega}$. The amount of deformation induced therefore depends drastically on the specific $f \in L^2(\mathbb{R})$.

The constant $C > 0$ and the exponents $\alpha, \beta > 0$ in (23) depend on the particular signal class $\mathcal{C}$. Examples of deformation-insensitive signal classes are the class of $R$-band-limited functions (see Proposition 5 in Appendix J), the class of cartoon functions [40, Proposition 1], and the class of Lipschitz functions [40, Lemma 1]. While a deformation sensitivity bound that applies to all $f \in L^2(\mathbb{R}^d)$ would be desirable, the example in Fig. 6 illustrates the difficulty underlying this desideratum. Specifically, we can see in Fig. 6 that for given $\tau(x)$ and $\omega(x)$ the impact of the deformation induced by $e^{2\pi i \omega(x)} f(x - \tau(x))$ can depend drastically on the function $f \in L^2(\mathbb{R}^d)$ itself. The deformation stability bound (4) for scattering networks reported in [22, Theorem 2.12] applies to a signal class as well, characterized, albeit implicitly, through [22, Eq. 2.46] and depending on the mother wavelet and the (modulus) non-linearity.

Our signal-class specific deformation sensitivity bound is based on the following two ingredients. First, we establish—in Proposition 4 in Appendix I—that the feature extractor $\Phi_\Omega$ is Lipschitz-continuous with Lipschitz constant $L_\Omega = 1$, i.e.,

$$|||\Phi_\Omega(f) - \Phi_\Omega(h)||| \le \|f - h\|_2, \quad \forall f, h \in L^2(\mathbb{R}^d), \qquad (24)$$

where, thanks to the admissibility condition (17), the Lipschitz constant $L_\Omega = 1$ in (24) is completely independent of the frame upper bounds $B_n$ and the Lipschitz constants $L_n$ and $R_n$ of $M_n$ and $P_n$, respectively. Second, we derive—in Proposition 5 in Appendix J—an upper bound on the deformation error $\|f - F_{\tau,\omega} f\|_2$ for $R$-band-limited functions, i.e., $f \in L^2_R(\mathbb{R}^d)$, according to

$$\|f - F_{\tau,\omega} f\|_2 \le C \big( R \|\tau\|_\infty + \|\omega\|_\infty \big) \|f\|_2. \qquad (25)$$

The deformation sensitivity bound for the feature extractor is then obtained by setting $h = F_{\tau,\omega} f$ in (24) and using (25) (see Appendix H for the corresponding technical details).
This "decoupling" into Lipschitz continuity of $\Phi_\Omega$ and a deformation sensitivity bound for the signal class under consideration (here, band-limited functions) has important practical ramifications as it shows that whenever we have a deformation sensitivity bound for the signal class, we automatically get a deformation sensitivity bound for the feature extractor thanks to its Lipschitz continuity. The same approach was used in [40] to derive deformation sensitivity bounds for cartoon functions and for Lipschitz functions.

Lipschitz continuity of $\Phi_\Omega$ according to (24) also guarantees that pairwise distances in the input signal space do not increase through feature extraction. An immediate consequence is robustness of the feature extractor w.r.t. additive noise $\eta \in L^2(\mathbb{R}^d)$ in the sense of

$$|||\Phi_\Omega(f + \eta) - \Phi_\Omega(f)||| \le \|\eta\|_2, \quad \forall f \in L^2(\mathbb{R}^d).$$

We proceed to the formal statement of our deformation sensitivity result.

Theorem 2. Let $\Omega = \big( (\Psi_n, M_n, P_n) \big)_{n \in \mathbb{N}}$ be an admissible module-sequence. There exists a constant $C > 0$ (that does not depend on $\Omega$) such that for all $f \in L^2_R(\mathbb{R}^d)$, $\omega \in C(\mathbb{R}^d, \mathbb{R})$, and $\tau \in C^1(\mathbb{R}^d, \mathbb{R}^d)$ with $\|D\tau\|_\infty \le \frac{1}{2d}$, the feature extractor $\Phi_\Omega$ satisfies

$$|||\Phi_\Omega(F_{\tau,\omega} f) - \Phi_\Omega(f)||| \le C \big( R \|\tau\|_\infty + \|\omega\|_\infty \big) \|f\|_2. \qquad (26)$$

Proof. The proof is given in Appendix H.

First, we note that the bound in (26) holds for $\tau$ with sufficiently "small" Jacobian matrix, i.e., as long as $\|D\tau\|_\infty \le \frac{1}{2d}$. We can think of this condition on the Jacobian matrix as follows (the ensuing argument is taken from [40]): Let $f$ be an image of the handwritten digit "5" (see Fig. 5 (a)). Then, $\{F_{\tau,\omega} f \mid \|D\tau\|_\infty < \frac{1}{2d}\}$ is a collection of images of the handwritten digit "5", where each $F_{\tau,\omega} f$ models an image that may be generated, e.g., based on a different handwriting style (see Figs. 5 (b) and (c)). The condition $\|D\tau\|_\infty < \frac{1}{2d}$ now imposes a quantitative limit on the amount of deformation tolerated. The deformation sensitivity bound (26) provides a limit on how much the features corresponding to the images in the set $\{F_{\tau,\omega} f \mid \|D\tau\|_\infty < \frac{1}{2d}\}$ can differ.

The strength of Theorem 2 derives from the fact that the only condition on the underlying module-sequence $\Omega$ needed is admissibility according to (17), which, as outlined in Section III, can easily be obtained by normalizing the frame elements of $\Psi_n$, for all $n \in \mathbb{N}$, appropriately. This normalization does not have an impact on the constant $C$ in (26). More specifically, $C$ is shown in (115) to be completely independent of $\Omega$. All this is thanks to the decoupling technique used to prove Theorem 2 being completely independent of the structures of the frames $\Psi_n$ and of the specific forms of the Lipschitz-continuous operators $M_n$ and $P_n$. The deformation sensitivity bound (26) is very general in the sense of applying to all Lipschitz-continuous (linear or non-linear) mappings $\Phi$, not only those generated by DCNNs.

The bound (4) for scattering networks reported in [22, Theorem 2.12] depends upon first-order ($D\tau$) and second-order ($D^2\tau$) derivatives of $\tau$. In contrast, our bound (26) depends on $D\tau$ implicitly only, as we need to impose the condition $\|D\tau\|_\infty \le \frac{1}{2d}$ for the bound to hold (we note that the condition $\|D\tau\|_\infty \le \frac{1}{2d}$ is needed for the bound (4) to hold as well). We honor this difference by referring to (4) as a deformation stability bound and to our bound (26) as a deformation sensitivity bound.
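The signal-class dependence of the deformation error, illustrated in Fig. 6, can be reproduced numerically. The following sketch applies $(F_\tau f)(x) = f(x - \tau(x))$ with $\tau(x) = \frac{1}{2}e^{-x^2}$ (and $\omega = 0$) to a smooth, essentially band-limited function and to a discontinuous one; the grid, test functions, and linear interpolation are arbitrary choices.

```python
import numpy as np

# Deformation error ||f - F_τ f||_2 for two signal classes (cf. Fig. 6):
# a smooth, slowly varying function vs. a function with a jump.

x = np.linspace(-8, 8, 4096)
dx = x[1] - x[0]
tau = 0.5 * np.exp(-x ** 2)

def l2_deformation_error(f):
    f_tau = np.interp(x - tau, x, f)        # sample f at the warped locations
    return np.sqrt(np.sum((f - f_tau) ** 2) * dx)

f_smooth = np.exp(-x ** 2 / 4)              # smooth, essentially band-limited
f_jump = (x > 0).astype(float)              # discontinuous at the origin

print(l2_deformation_error(f_smooth))       # small
print(l2_deformation_error(f_jump))         # comparatively large
```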
The dependence of the upper bound in (26) on the bandwidth $R$ reflects the intuition that the deformation sensitivity bound should depend on the input signal class "description complexity". Many signals of practical significance (e.g., natural images) are, however, either not band-limited due to the presence of sharp (and possibly curved) edges or exhibit large bandwidths. In the latter case, the bound (26) is effectively rendered void owing to its linear dependence on $R$. We refer the reader to [40] where deformation sensitivity bounds for non-smooth signals were established. Specifically, the main contributions in [40] are deformation sensitivity bounds—again obtained through decoupling—for non-linear deformations $(F_\tau f)(x) = f(x - \tau(x))$ according to

$$\|f - F_\tau f\|_2 \le C \|\tau\|_\infty^\alpha, \quad \forall f \in \mathcal{C} \subseteq L^2(\mathbb{R}^d), \qquad (27)$$

for the signal classes $\mathcal{C} \subseteq L^2(\mathbb{R}^d)$ of cartoon functions [65] and of Lipschitz-continuous functions. The constant $C > 0$ and the exponent $\alpha > 0$ in (27) depend on the particular signal class $\mathcal{C}$ and are specified in [40]. As the vertical translation invariance result in Theorem 1 applies to all $f \in L^2(\mathbb{R}^d)$, the results established in the present paper and in [40] taken together show that vertical translation invariance and limited sensitivity to deformations—for signal classes with inherent deformation insensitivity—are guaranteed by the feature extraction network structure per se rather than the specific convolution kernels, non-linearities, and pooling operators.

Finally, the deformation stability bound (4) for scattering networks reported in [22, Theorem 2.12] applies to the space

$$H_W := \big\{ f \in L^2(\mathbb{R}^d) \,\big|\, \|f\|_{H_W} < \infty \big\}, \qquad (28)$$

where

$$\|f\|_{H_W} := \sum_{n=0}^{\infty} \Big( \sum_{q \in (\Lambda_W)_1^n} \|U[q] f\|_2^2 \Big)^{1/2}$$

and $(\Lambda_W)_1^n$ denotes the set of paths $q = (\lambda^{(j)}, \ldots, \lambda^{(p)})$ of length $n$ with $\lambda^{(j)}, \ldots, \lambda^{(p)} \in \Lambda_W$. While [22, p. 1350] cites numerical evidence on the series $\sum_{q \in (\Lambda_W)_1^n} \|U[q] f\|_2^2$ being finite for a large class of signals $f \in L^2(\mathbb{R}^d)$, it seems difficult to establish this analytically, let alone to show that

$$\sum_{n=0}^{\infty} \Big( \sum_{q \in (\Lambda_W)_1^n} \|U[q] f\|_2^2 \Big)^{1/2} < \infty.$$

In contrast, the deformation sensitivity bound (26) applies provably to the space of $R$-band-limited functions $L^2_R(\mathbb{R}^d)$. Finally, the space $H_W$ in (28) depends on the wavelet frame atoms $\{\psi_\lambda\}_{\lambda \in \Lambda_W}$ and the (modulus) non-linearity, and thereby on the underlying signal transform, whereas $L^2_R(\mathbb{R}^d)$ is, trivially, independent of the module-sequence $\Omega$.

V. FINAL REMARKS AND OUTLOOK

It is interesting to note that the frame lower bounds $A_n > 0$ of the semi-discrete frames $\Psi_n$ affect neither the vertical translation invariance result in Theorem 1 nor the deformation sensitivity bound in Theorem 2. In fact, the entire theory in this paper carries through as long as the collections $\Psi_n = \{T_b I g_{\lambda_n}\}_{b \in \mathbb{R}^d, \lambda_n \in \Lambda_n}$, for all $n \in \mathbb{N}$, satisfy the Bessel property

$$\sum_{\lambda_n \in \Lambda_n} \int_{\mathbb{R}^d} |\langle f, T_b I g_{\lambda_n} \rangle|^2\, db = \sum_{\lambda_n \in \Lambda_n} \|f * g_{\lambda_n}\|_2^2 \le B_n \|f\|_2^2,$$

for all $f \in L^2(\mathbb{R}^d)$, for some $B_n > 0$, which, by Proposition 2, is equivalent to

$$\sum_{\lambda_n \in \Lambda_n} |\widehat{g_{\lambda_n}}(\omega)|^2 \le B_n, \quad \text{a.e. } \omega \in \mathbb{R}^d. \qquad (29)$$

Pre-specified unstructured filters [16], [17] and learned filters [15]–[18] are therefore covered by our theory as long as (29) is satisfied. In classical frame theory $A_n > 0$ guarantees completeness of the set $\Psi_n = \{T_b I g_{\lambda_n}\}_{b \in \mathbb{R}^d, \lambda_n \in \Lambda_n}$ for the signal space under consideration, here $L^2(\mathbb{R}^d)$.
The absence of a frame lower bound $A_n > 0$ therefore translates into a lack of completeness of $\Psi_n$, which may result in the frame coefficients $\langle f, T_b I g_{\lambda_n} \rangle = (f * g_{\lambda_n})(b)$, $(\lambda_n, b) \in \Lambda_n \times \mathbb{R}^d$, not containing all essential features of the signal $f$. This will, in general, have a (possibly significant) impact on practical feature extraction performance, which is why ensuring the entire frame property (30) is prudent. Interestingly, satisfying the frame property (30) for all $\Psi_n$, $n \in \mathbb{N}$, does, however, not guarantee that the feature extractor $\Phi_\Omega$ has a trivial null-space, i.e., $\Phi_\Omega(f) = 0$ if and only if $f = 0$. We refer the reader to [66, Appendix A] for an example of a feature extractor with non-trivial null-space.

APPENDIX A
SEMI-DISCRETE FRAMES

This appendix gives a brief review of the theory of semi-discrete frames. A list of structured example frames of interest in the context of this paper is provided in Appendix B for the 1-D case, and in Appendix C for the 2-D case. Semi-discrete frames are instances of continuous frames [41], [42], and appear in the literature, e.g., in the context of translation-covariant signal decompositions [31]–[33], and as an intermediate step in the construction of various fully-discrete frames [34], [35], [37], [52]. We first collect some basic results on semi-discrete frames.

Definition 6. Let $\{g_\lambda\}_{\lambda \in \Lambda} \subseteq L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$ be a set of functions indexed by a countable set $\Lambda$. The collection $\Psi_\Lambda := \{T_b I g_\lambda\}_{(\lambda, b) \in \Lambda \times \mathbb{R}^d}$ is a semi-discrete frame for $L^2(\mathbb{R}^d)$ if there exist constants $A, B > 0$ such that

$$A \|f\|_2^2 \le \sum_{\lambda \in \Lambda} \int_{\mathbb{R}^d} |\langle f, T_b I g_\lambda \rangle|^2\, db = \sum_{\lambda \in \Lambda} \|f * g_\lambda\|_2^2 \le B \|f\|_2^2, \quad \forall f \in L^2(\mathbb{R}^d). \qquad (30)$$

The functions $\{g_\lambda\}_{\lambda \in \Lambda}$ are called the atoms of the frame $\Psi_\Lambda$. When $A = B$ the frame is said to be tight. A tight frame with frame bound $A = 1$ is called a Parseval frame.

The frame operator associated with the semi-discrete frame $\Psi_\Lambda$ is defined in the weak sense as $S_\Lambda : L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$,

$$S_\Lambda f := \sum_{\lambda \in \Lambda} \int_{\mathbb{R}^d} \langle f, T_b I g_\lambda \rangle\, (T_b I g_\lambda)\, db = \sum_{\lambda \in \Lambda} g_\lambda * I g_\lambda * f, \qquad (31)$$

where $\langle f, T_b I g_\lambda \rangle = (f * g_\lambda)(b)$, $(\lambda, b) \in \Lambda \times \mathbb{R}^d$, are called the frame coefficients. $S_\Lambda$ is a bounded, positive, and boundedly invertible operator [41].

The reader might want to think of semi-discrete frames as shift-invariant frames [67], [68] with a continuous translation parameter, and of the countable index set $\Lambda$ as labeling a collection of scales, directions, or frequency-shifts, hence the terminology semi-discrete. For instance, scattering networks are based on a (single) semi-discrete wavelet frame, where the atoms $\{g_\lambda\}_{\lambda \in \Lambda_W}$ are indexed by the set $\Lambda_W := \{(-J, 0)\} \cup \{(j,k) \mid j \in \mathbb{Z} \text{ with } j > -J,\ k \in \{0, \ldots, K-1\}\}$ labeling a collection of scales $j$ and directions $k$.

The following result gives a so-called Littlewood-Paley condition [53], [69] for the collection $\Psi_\Lambda = \{T_b I g_\lambda\}_{(\lambda, b) \in \Lambda \times \mathbb{R}^d}$ to form a semi-discrete frame.

Proposition 2. Let $\Lambda$ be a countable set. The collection $\Psi_\Lambda = \{T_b I g_\lambda\}_{(\lambda, b) \in \Lambda \times \mathbb{R}^d}$ with atoms $\{g_\lambda\}_{\lambda \in \Lambda} \subseteq L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$ is a semi-discrete frame for $L^2(\mathbb{R}^d)$ with frame bounds $A, B > 0$ if and only if

$$A \le \sum_{\lambda \in \Lambda} |\widehat{g_\lambda}(\omega)|^2 \le B, \quad \text{a.e. } \omega \in \mathbb{R}^d. \qquad (32)$$

Proof. The proof is standard and can be found, e.g., in [30, Theorem 5.11].

Remark 2. What is behind Proposition 2 is a result on the unitary equivalence between operators [70, Definition 5.19.3].
Specifically, Proposition 2 follows from the fact that the multiplier $\sum_{\lambda \in \Lambda} |\widehat{g_\lambda}|^2$ is unitarily equivalent to the frame operator $S_\Lambda$ in (31) according to

$$\mathcal{F} S_\Lambda \mathcal{F}^{-1} = \sum_{\lambda \in \Lambda} |\widehat{g_\lambda}|^2,$$

where $\mathcal{F} : L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ denotes the Fourier transform. We refer the interested reader to [71] where the framework of unitary equivalence was formalized in the context of shift-invariant frames for $\ell^2(\mathbb{Z})$.

The following proposition states normalization results for semi-discrete frames that come in handy in satisfying the admissibility condition (17) as discussed in Section III.

Proposition 3. Let $\Psi_\Lambda = \{T_b I g_\lambda\}_{(\lambda, b) \in \Lambda \times \mathbb{R}^d}$ be a semi-discrete frame for $L^2(\mathbb{R}^d)$ with frame bounds $A, B$.

i) For $C > 0$, the family of functions $\widetilde{\Psi}_\Lambda := \{T_b I \widetilde{g}_\lambda\}_{(\lambda, b) \in \Lambda \times \mathbb{R}^d}$, $\widetilde{g}_\lambda := C^{-1/2} g_\lambda$, for all $\lambda \in \Lambda$, is a semi-discrete frame for $L^2(\mathbb{R}^d)$ with frame bounds $\widetilde{A} := \frac{A}{C}$ and $\widetilde{B} := \frac{B}{C}$.

ii) The family of functions $\Psi_\Lambda^\natural := \{T_b I g_\lambda^\natural\}_{(\lambda, b) \in \Lambda \times \mathbb{R}^d}$,

$$g_\lambda^\natural := \mathcal{F}^{-1} \Big( \widehat{g_\lambda} \Big( \sum_{\lambda' \in \Lambda} |\widehat{g_{\lambda'}}|^2 \Big)^{-1/2} \Big), \quad \forall \lambda \in \Lambda,$$

is a semi-discrete Parseval frame for $L^2(\mathbb{R}^d)$, i.e., the frame bounds satisfy $A^\natural = B^\natural = 1$.

Proof. We start by proving statement i). As $\Psi_\Lambda$ is a frame for $L^2(\mathbb{R}^d)$, we have

$$A \|f\|_2^2 \le \sum_{\lambda \in \Lambda} \|f * g_\lambda\|_2^2 \le B \|f\|_2^2, \quad \forall f \in L^2(\mathbb{R}^d). \qquad (33)$$

With $g_\lambda = \sqrt{C}\, \widetilde{g}_\lambda$, for all $\lambda \in \Lambda$, in (33) we get $A \|f\|_2^2 \le \sum_{\lambda \in \Lambda} \|f * \sqrt{C}\, \widetilde{g}_\lambda\|_2^2 \le B \|f\|_2^2$, for all $f \in L^2(\mathbb{R}^d)$, which is equivalent to $\frac{A}{C} \|f\|_2^2 \le \sum_{\lambda \in \Lambda} \|f * \widetilde{g}_\lambda\|_2^2 \le \frac{B}{C} \|f\|_2^2$, for all $f \in L^2(\mathbb{R}^d)$, and hence establishes i). To prove statement ii), we first note that $\mathcal{F} g_\lambda^\natural = \widehat{g_\lambda} \big( \sum_{\lambda' \in \Lambda} |\widehat{g_{\lambda'}}|^2 \big)^{-1/2}$, for all $\lambda \in \Lambda$, and thus

$$\sum_{\lambda \in \Lambda} |(\mathcal{F} g_\lambda^\natural)(\omega)|^2 = \Big( \sum_{\lambda \in \Lambda} |\widehat{g_\lambda}(\omega)|^2 \Big) \Big( \sum_{\lambda' \in \Lambda} |\widehat{g_{\lambda'}}(\omega)|^2 \Big)^{-1} = 1, \quad \text{a.e. } \omega \in \mathbb{R}^d.$$

Application of Proposition 2 then establishes that $\Psi_\Lambda^\natural$ is a semi-discrete Parseval frame for $L^2(\mathbb{R}^d)$, i.e., the frame bounds satisfy $A^\natural = B^\natural = 1$.

APPENDIX B
EXAMPLES OF SEMI-DISCRETE FRAMES IN 1-D

General 1-D semi-discrete frames are given by collections

$$\Psi = \{T_b I g_k\}_{(k, b) \in \mathbb{Z} \times \mathbb{R}} \qquad (34)$$

with atoms $g_k \in L^1(\mathbb{R}) \cap L^2(\mathbb{R})$, indexed by the integers $\Lambda = \mathbb{Z}$, and satisfying the Littlewood-Paley condition

$$A \le \sum_{k \in \mathbb{Z}} |\widehat{g_k}(\omega)|^2 \le B, \quad \text{a.e. } \omega \in \mathbb{R}. \qquad (35)$$

The structured example frames we consider are Weyl-Heisenberg (Gabor) frames, where the $g_k$ are obtained through modulation from a prototype function, and wavelet frames, where the $g_k$ are obtained through scaling from a mother wavelet.

Semi-discrete Weyl-Heisenberg (Gabor) frames: Weyl-Heisenberg frames [72]–[75] are well-suited to the extraction of sinusoidal features [76], and have been applied successfully in various practical feature extraction tasks [54], [77]. A semi-discrete Weyl-Heisenberg frame for $L^2(\mathbb{R})$ is a collection of functions according to (34), where $g_k(x) := e^{2\pi i k x} g(x)$, $k \in \mathbb{Z}$, with the prototype function $g \in L^1(\mathbb{R}) \cap L^2(\mathbb{R})$. The atoms $\{g_k\}_{k \in \mathbb{Z}}$ satisfy the Littlewood-Paley condition (35) according to

$$A \le \sum_{k \in \mathbb{Z}} |\widehat{g}(\omega - k)|^2 \le B, \quad \text{a.e. } \omega \in \mathbb{R}. \qquad (36)$$

A popular function $g \in L^1(\mathbb{R}) \cap L^2(\mathbb{R})$ satisfying (36) is the Gaussian function [74].
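For the Gaussian prototype just mentioned, the Littlewood-Paley condition (36) can be verified numerically as follows; the Gaussian width and the truncation of the sum over $k$ are arbitrary choices, and one period of $\omega$ suffices since the sum in (36) is 1-periodic.

```python
import numpy as np

# Numerical check of the Littlewood-Paley condition (36) for a semi-discrete
# Weyl-Heisenberg frame with Gaussian prototype: A ≤ Σ_k |ĝ(ω − k)|² ≤ B.

omega = np.linspace(0, 1, 1000, endpoint=False)    # one period of the 1-periodic sum
g_hat = lambda w: np.exp(-np.pi * w ** 2)          # Fourier transform of a Gaussian

lp_sum = sum(np.abs(g_hat(omega - k)) ** 2 for k in range(-20, 21))
A, B = lp_sum.min(), lp_sum.max()
print(A, B)    # finite positive bounds, so (36) holds
```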
APPENDIX B
EXAMPLES OF SEMI-DISCRETE FRAMES IN 1-D

General 1-D semi-discrete frames are given by collections
\[
\Psi = \{T_b I g_k\}_{(k,b) \in \mathbb{Z} \times \mathbb{R}} \tag{34}
\]
with atoms $g_k \in L^1(\mathbb{R}) \cap L^2(\mathbb{R})$, indexed by the integers $\Lambda = \mathbb{Z}$, and satisfying the Littlewood–Paley condition
\[
A \le \sum_{k \in \mathbb{Z}} |\widehat{g_k}(\omega)|^2 \le B, \quad \text{a.e. } \omega \in \mathbb{R}. \tag{35}
\]
The structural example frames we consider are Weyl–Heisenberg (Gabor) frames, where the $g_k$ are obtained through modulation from a prototype function, and wavelet frames, where the $g_k$ are obtained through scaling from a mother wavelet.

Semi-discrete Weyl–Heisenberg (Gabor) frames: Weyl–Heisenberg frames [72]–[75] are well-suited to the extraction of sinusoidal features [76], and have been applied successfully in various practical feature extraction tasks [54], [77]. A semi-discrete Weyl–Heisenberg frame for $L^2(\mathbb{R})$ is a collection of functions according to (34), where $g_k(x) := e^{2\pi i k x} g(x)$, $k \in \mathbb{Z}$, with the prototype function $g \in L^1(\mathbb{R}) \cap L^2(\mathbb{R})$. The atoms $\{g_k\}_{k \in \mathbb{Z}}$ satisfy the Littlewood–Paley condition (35) according to
\[
A \le \sum_{k \in \mathbb{Z}} |\widehat{g}(\omega - k)|^2 \le B, \quad \text{a.e. } \omega \in \mathbb{R}. \tag{36}
\]
A popular function $g \in L^1(\mathbb{R}) \cap L^2(\mathbb{R})$ satisfying (36) is the Gaussian function [74].

Semi-discrete wavelet frames: Wavelets are well-suited to the extraction of signal features characterized by singularities [31], [53], and have been applied successfully in various practical feature extraction tasks [55], [56]. A semi-discrete wavelet frame for $L^2(\mathbb{R})$ is a collection of functions according to (34), where $g_k(x) := 2^k \psi(2^k x)$, $k \in \mathbb{Z}$, with the mother wavelet $\psi \in L^1(\mathbb{R}) \cap L^2(\mathbb{R})$. The atoms $\{g_k\}_{k \in \mathbb{Z}}$ satisfy the Littlewood–Paley condition (35) according to
\[
A \le \sum_{k \in \mathbb{Z}} |\widehat{\psi}(2^{-k}\omega)|^2 \le B, \quad \text{a.e. } \omega \in \mathbb{R}. \tag{37}
\]
A large class of functions $\psi$ satisfying (37) can be obtained through a multi-resolution analysis in $L^2(\mathbb{R})$ [30, Definition 7.1].

APPENDIX C
EXAMPLES OF SEMI-DISCRETE FRAMES IN 2-D

Semi-discrete wavelet frames: Two-dimensional wavelets are well-suited to the extraction of signal features characterized by point singularities (such as, e.g., stars in astronomical images [78]), and have been applied successfully in various practical feature extraction tasks, e.g., in [19]–[21], [32]. Prominent families of two-dimensional wavelet frames are tensor wavelet frames and directional wavelet frames:

i) Semi-discrete tensor wavelet frames: A semi-discrete tensor wavelet frame for $L^2(\mathbb{R}^2)$ is a collection of functions according to $\Psi_{\Lambda_{TW}} := \{T_b I g_{(e,j)}\}_{(e,j) \in \Lambda_{TW}, b \in \mathbb{R}^2}$, $g_{(e,j)}(x) := 2^{2j} \psi_e(2^j x)$, where $\Lambda_{TW} := \{((0,0), 0)\} \cup \{(e,j) \mid e \in E \setminus \{(0,0)\},\ j \ge 0\}$, and $E := \{0,1\}^2$. Here, the functions $\psi_e \in L^1(\mathbb{R}^2) \cap L^2(\mathbb{R}^2)$ are tensor products of a coarse-scale function $\phi \in L^1(\mathbb{R}) \cap L^2(\mathbb{R})$ and a fine-scale function $\psi \in L^1(\mathbb{R}) \cap L^2(\mathbb{R})$ according to $\psi_{(0,0)} := \phi \otimes \phi$, $\psi_{(1,0)} := \psi \otimes \phi$, $\psi_{(0,1)} := \phi \otimes \psi$, and $\psi_{(1,1)} := \psi \otimes \psi$. The corresponding Littlewood–Paley condition (32) reads
\[
A \le |\widehat{\psi_{(0,0)}}(\omega)|^2 + \sum_{j \ge 0} \sum_{e \in E \setminus \{(0,0)\}} |\widehat{\psi_e}(2^{-j}\omega)|^2 \le B, \quad \text{a.e. } \omega \in \mathbb{R}^2. \tag{38}
\]
A large class of functions $\phi, \psi$ satisfying (38) can be obtained through a multi-resolution analysis in $L^2(\mathbb{R})$ [30, Definition 7.1].

ii) Semi-discrete directional wavelet frames: A semi-discrete directional wavelet frame for $L^2(\mathbb{R}^2)$ is a collection of functions according to $\Psi_{\Lambda_{DW}} := \{T_b I g_{(j,k)}\}_{(j,k) \in \Lambda_{DW}, b \in \mathbb{R}^2}$, with $g_{(-J,0)}(x) := 2^{-2J}\phi(2^{-J}x)$, $g_{(j,k)}(x) := 2^{2j}\psi(2^j R_{\theta_k} x)$, where $\Lambda_{DW} := \{(-J,0)\} \cup \{(j,k) \mid j \in \mathbb{Z} \text{ with } j > -J,\ k \in \{0, \dots, K-1\}\}$, $R_\theta$ is the $2 \times 2$ rotation matrix defined as
\[
R_\theta := \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix}, \quad \theta \in [0, 2\pi), \tag{39}
\]
and $\theta_k := \frac{2\pi k}{K}$, with $k = 0, \dots, K-1$, for a fixed $K \in \mathbb{N}$, are rotation angles. The functions $\phi \in L^1(\mathbb{R}^2) \cap L^2(\mathbb{R}^2)$ and $\psi \in L^1(\mathbb{R}^2) \cap L^2(\mathbb{R}^2)$ are referred to in the literature as coarse-scale wavelet and fine-scale wavelet, respectively. The integer $J \in \mathbb{Z}$ corresponds to the coarsest scale resolved, and the atoms $\{g_{(j,k)}\}_{(j,k) \in \Lambda_{DW}}$ satisfy the Littlewood–Paley condition (32) according to
\[
A \le |\widehat{\phi}(2^J \omega)|^2 + \sum_{j > -J} \sum_{k=0}^{K-1} |\widehat{\psi}(2^{-j} R_{\theta_k} \omega)|^2 \le B, \quad \text{a.e. } \omega \in \mathbb{R}^2. \tag{40}
\]
Prominent examples of functions $\phi, \psi$ satisfying (40) are the Gaussian function for $\phi$ and a modulated Gaussian function for $\psi$ [30].

Fig. 7: Partitioning of the frequency plane $\mathbb{R}^2$ induced by (left) a semi-discrete tensor wavelet frame, and (right) a semi-discrete directional wavelet frame.
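The 2-D directional construction can be checked in the same spirit. The sketch below (all modeling choices ours) assembles the Littlewood–Paley sum in (40) on a frequency grid, with a Gaussian for $\widehat{\phi}$ and a modulated Gaussian for $\widehat{\psi}$ as suggested above; $K$, $J$, the grid, and the truncated scale range are illustrative, so the sum is only approximately constant on an annulus of the frequency plane.

```python
import numpy as np

K, J = 8, 3                                            # illustrative direction/scale counts
wx, wy = np.meshgrid(np.linspace(-8, 8, 257), np.linspace(-8, 8, 257))
w = np.stack([wx, wy])                                 # frequency grid, shape (2, N, N)

phi_hat = lambda v: np.exp(-0.5 * (v[0] ** 2 + v[1] ** 2))            # Gaussian low-pass
psi_hat = lambda v: np.exp(-np.pi * ((v[0] - 1.5) ** 2 + v[1] ** 2))  # modulated Gaussian

s = np.abs(phi_hat(2.0 ** J * w)) ** 2                 # |phi_hat(2^J w)|^2 term in (40)
for j in range(-J + 1, 5):                             # truncated range of scales j > -J
    for k in range(K):
        t = 2 * np.pi * k / K                          # theta_k = 2*pi*k/K
        R = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
        s += np.abs(psi_hat(np.einsum("ab,bij->aij", R, 2.0 ** (-j) * w))) ** 2

r = np.sqrt(wx ** 2 + wy ** 2)
mask = (r > 0.5) & (r < 4.0)                           # annulus covered by the kept scales
print(f"LP sum on annulus: min {s[mask].min():.3f}, max {s[mask].max():.3f}")
```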
Semi-discrete curvelet frames: Curvelets, introduced in [34], [38], are well-suited to the extraction of signal features characterized by curve-like singularities (such as, e.g., curved edges in images), and have been applied successfully in various practical feature extraction tasks [60], [61]. A semi-discrete curvelet frame for $L^2(\mathbb{R}^2)$ is a collection of functions according to $\Psi_{\Lambda_C} := \{T_b I g_{(j,l)}\}_{(j,l) \in \Lambda_C, b \in \mathbb{R}^2}$, with $g_{(-1,0)}(x) := \phi(x)$, $g_{(j,l)}(x) := \psi_j(R_{\theta_{j,l}} x)$, where $\Lambda_C := \{(-1,0)\} \cup \{(j,l) \mid j \ge 0,\ l = 0, \dots, L_j - 1\}$, $R_\theta \in \mathbb{R}^{2\times 2}$ is the rotation matrix defined in (39), and $\theta_{j,l} := \pi l\, 2^{-\lceil j/2 \rceil - 1}$, for $j \ge 0$ and $0 \le l < L_j := 2^{\lceil j/2 \rceil + 2}$, are scale-dependent rotation angles. The functions $\phi \in L^1(\mathbb{R}^2) \cap L^2(\mathbb{R}^2)$ and $\psi_j \in L^1(\mathbb{R}^2) \cap L^2(\mathbb{R}^2)$ satisfy the Littlewood–Paley condition (32) according to
\[
A \le |\widehat{\phi}(\omega)|^2 + \sum_{j=0}^{\infty} \sum_{l=0}^{L_j - 1} |\widehat{\psi_j}(R_{\theta_{j,l}} \omega)|^2 \le B, \quad \text{a.e. } \omega \in \mathbb{R}^2. \tag{41}
\]
The $\psi_j$, $j \ge 0$, are designed to have their Fourier transforms $\widehat{\psi_j}$ supported on a pair of opposite wedges of size $2^{-j/2} \times 2^j$ in the dyadic corona $\{\omega \in \mathbb{R}^2 \mid 2^j \le |\omega| \le 2^{j+1}\}$, see Fig. 8 (left). We refer the reader to [34, Theorem 4.1] for constructions of functions $\phi, \psi_j$ satisfying (41) with $A = B = 1$.

Fig. 8: Partitioning of the frequency plane $\mathbb{R}^2$ induced by (left) a semi-discrete curvelet frame, and (right) a semi-discrete ridgelet frame.

Semi-discrete ridgelet frames: Ridgelets, introduced in [79], [80], are well-suited to the extraction of signal features characterized by straight-line singularities (such as, e.g., straight edges in images), and have been applied successfully in various practical feature extraction tasks [57]–[59], [61]. A semi-discrete ridgelet frame for $L^2(\mathbb{R}^2)$ is a collection of functions according to $\Psi_{\Lambda_R} := \{T_b I g_{(j,l)}\}_{(j,l) \in \Lambda_R, b \in \mathbb{R}^2}$, with $g_{(0,0)}(x) := \phi(x)$, $g_{(j,l)}(x) := \psi_{(j,l)}(x)$, where $\Lambda_R := \{(0,0)\} \cup \{(j,l) \mid j \ge 1,\ l = 1, \dots, 2^j - 1\}$, and the atoms $\{g_{(j,l)}\}_{(j,l) \in \Lambda_R}$ satisfy the Littlewood–Paley condition (32) according to
\[
A \le |\widehat{\phi}(\omega)|^2 + \sum_{j=1}^{\infty} \sum_{l=1}^{2^j - 1} |\widehat{\psi_{(j,l)}}(\omega)|^2 \le B, \quad \text{a.e. } \omega \in \mathbb{R}^2. \tag{42}
\]
The $\psi_{(j,l)} \in L^1(\mathbb{R}^2) \cap L^2(\mathbb{R}^2)$, $(j,l) \in \Lambda_R \setminus \{(0,0)\}$, are designed to be constant in the direction specified by the parameter $l$, and to have Fourier transforms $\widehat{\psi_{(j,l)}}$ supported on a pair of opposite wedges of size $2^{-j} \times 2^j$ in the dyadic corona $\{\omega \in \mathbb{R}^2 \mid 2^j \le |\omega| \le 2^{j+1}\}$, see Fig. 8 (right). We refer the reader to [37, Proposition 6] for constructions of functions $\phi, \psi_{(j,l)}$ satisfying (42) with $A = B = 1$.

Remark 3. For further examples of interesting structured semi-discrete frames, we refer to [36], which discusses semi-discrete shearlet frames, and to [35], which deals with semi-discrete $\alpha$-curvelet frames.
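For intuition on the curvelet parametrization, the following small sketch (ours) tabulates the scale-dependent angle grid $\theta_{j,l} = \pi l\,2^{-\lceil j/2\rceil - 1}$ with $L_j = 2^{\lceil j/2\rceil + 2}$ orientations at scale $j$: the number of wedges doubles every other scale, the signature of parabolic scaling, whereas the ridgelet count $2^j - 1$ roughly doubles at every scale.

```python
import math

for j in range(6):
    L_j = 2 ** (math.ceil(j / 2) + 2)                  # number of orientations at scale j
    spacing = math.pi * 2.0 ** (-math.ceil(j / 2) - 1) # theta_{j,1} - theta_{j,0}
    print(f"curvelet scale j={j}: L_j={L_j:3d} wedges, angular spacing {spacing:.4f} rad")
for j in range(1, 6):
    print(f"ridgelet scale j={j}: {2 ** j - 1} directional atoms")
```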
APPENDIX D
NON-LINEARITIES

This appendix gives a brief overview of non-linearities $M: L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ that are widely used in the deep learning literature and that fit into our theory. For each example, we establish how it satisfies the conditions on $M: L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ in Theorems 1 and 2 and in Corollary 1. Specifically, we need to verify the following:

(i) Lipschitz continuity: There exists a constant $L \ge 0$ such that $\|Mf - Mh\|_2 \le L\|f - h\|_2$, for all $f, h \in L^2(\mathbb{R}^d)$.

(ii) $Mf = 0$ for $f = 0$.

All non-linearities considered here are pointwise (memoryless) operators in the sense of
\[
M: L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d), \quad (Mf)(x) = \rho(f(x)), \tag{43}
\]
where $\rho: \mathbb{C} \to \mathbb{C}$. An immediate consequence of this property is that the operator $M$ commutes with the translation operator $T_t$ (see Theorem 2 and Corollary 1):
\[
(M T_t f)(x) = \rho((T_t f)(x)) = \rho(f(x - t)) = (T_t \rho(f))(x) = (T_t M f)(x), \quad \forall f \in L^2(\mathbb{R}^d),\ \forall t \in \mathbb{R}^d.
\]

Modulus function: The modulus function $|\cdot|: L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$, $|f|(x) := |f(x)|$, has been applied successfully in the deep learning literature, e.g., in [16], [21], and most prominently in scattering networks [22]. Lipschitz continuity with $L = 1$ follows from
\[
\||f| - |h|\|_2^2 = \int_{\mathbb{R}^d} \big| |f(x)| - |h(x)| \big|^2 \,dx \le \int_{\mathbb{R}^d} |f(x) - h(x)|^2 \,dx = \|f - h\|_2^2,
\]
for $f, h \in L^2(\mathbb{R}^d)$, by the reverse triangle inequality. Furthermore, obviously $|f| = 0$ for $f = 0$, and finally $|\cdot|$ is pointwise, as (43) is satisfied with $\rho(x) := |x|$.

Rectified linear unit: The rectified linear unit non-linearity (see, e.g., [26], [27]) is defined as $R: L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$, $(Rf)(x) := \max\{0, \operatorname{Re}(f(x))\} + i \max\{0, \operatorname{Im}(f(x))\}$. We start by establishing that $R$ is Lipschitz-continuous with $L = 2$. To this end, fix $f, h \in L^2(\mathbb{R}^d)$. We have
\begin{align}
|(Rf)(x) - (Rh)(x)| &= \big| \max\{0, \operatorname{Re}(f(x))\} + i \max\{0, \operatorname{Im}(f(x))\} - \max\{0, \operatorname{Re}(h(x))\} - i \max\{0, \operatorname{Im}(h(x))\} \big| \nonumber \\
&\le \big| \max\{0, \operatorname{Re}(f(x))\} - \max\{0, \operatorname{Re}(h(x))\} \big| + \big| \max\{0, \operatorname{Im}(f(x))\} - \max\{0, \operatorname{Im}(h(x))\} \big| \tag{44} \\
&\le |\operatorname{Re}(f(x)) - \operatorname{Re}(h(x))| + |\operatorname{Im}(f(x)) - \operatorname{Im}(h(x))| \tag{45} \\
&\le |f(x) - h(x)| + |f(x) - h(x)| = 2|f(x) - h(x)|, \tag{46}
\end{align}
where we used the triangle inequality in (44), the inequality $|\max\{0,a\} - \max\{0,b\}| \le |a - b|$, $\forall a, b \in \mathbb{R}$, in (45), and the Lipschitz continuity (with $L = 1$) of the mappings $\operatorname{Re}: \mathbb{C} \to \mathbb{R}$ and $\operatorname{Im}: \mathbb{C} \to \mathbb{R}$ in (46). We therefore get
\[
\|Rf - Rh\|_2 = \Big( \int_{\mathbb{R}^d} |(Rf)(x) - (Rh)(x)|^2 \,dx \Big)^{1/2} \le 2 \Big( \int_{\mathbb{R}^d} |f(x) - h(x)|^2 \,dx \Big)^{1/2} = 2\|f - h\|_2,
\]
which establishes Lipschitz continuity of $R$ with Lipschitz constant $L = 2$. Furthermore, obviously $Rf = 0$ for $f = 0$, and finally (43) is satisfied with $\rho(x) := \max\{0, \operatorname{Re}(x)\} + i \max\{0, \operatorname{Im}(x)\}$.
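As a quick empirical sanity check (ours, not part of the proof), one can estimate the Lipschitz constants of the modulus and the complex rectified linear unit by maximizing $\|\rho(f) - \rho(h)\|_2 / \|f - h\|_2$ over random pairs; the observed ratios lower-bound the true constants and must stay below the proven bounds $L = 1$ and $L = 2$, respectively (the latter is an upper bound and is not claimed to be tight).

```python
import numpy as np

rng = np.random.default_rng(0)
nonlinearities = {
    "modulus (bound L = 1)": np.abs,
    "complex ReLU (bound L = 2)": lambda z: np.maximum(z.real, 0) + 1j * np.maximum(z.imag, 0),
}
for name, rho in nonlinearities.items():
    worst = 0.0
    for _ in range(1000):                        # random complex signal pairs
        f = rng.standard_normal(256) + 1j * rng.standard_normal(256)
        h = rng.standard_normal(256) + 1j * rng.standard_normal(256)
        worst = max(worst, np.linalg.norm(rho(f) - rho(h)) / np.linalg.norm(f - h))
    print(f"{name}: worst observed ratio {worst:.3f}")
```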
Hyperbolic tangent: The hyperbolic tangent non-linearity (see, e.g., [15]–[17]) is defined as $H: L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$, $(Hf)(x) := \tanh(\operatorname{Re}(f(x))) + i \tanh(\operatorname{Im}(f(x)))$, where $\tanh(x) := \frac{e^x - e^{-x}}{e^x + e^{-x}}$. We start by proving that $H$ is Lipschitz-continuous with $L = 2$. To this end, fix $f, h \in L^2(\mathbb{R}^d)$. We have
\[
|(Hf)(x) - (Hh)(x)| \le |\tanh(\operatorname{Re}(f(x))) - \tanh(\operatorname{Re}(h(x)))| + |\tanh(\operatorname{Im}(f(x))) - \tanh(\operatorname{Im}(h(x)))|, \tag{47}
\]
where, again, we used the triangle inequality. In order to further upper-bound (47), we show that $\tanh$ is Lipschitz-continuous. To this end, we make use of the following result.

Lemma 1. Let $h: \mathbb{R} \to \mathbb{R}$ be a continuously differentiable function satisfying $\sup_{x \in \mathbb{R}} |h'(x)| \le L$. Then, $h$ is Lipschitz-continuous with Lipschitz constant $L$.

Proof. See [81, Theorem 9.5.1].

Since $\tanh'(x) = 1 - \tanh^2(x)$, $x \in \mathbb{R}$, we have $\sup_{x \in \mathbb{R}} |\tanh'(x)| \le 1$. By Lemma 1 we can therefore conclude that $\tanh$ is Lipschitz-continuous with $L = 1$, which, when used in (47), yields
\[
|(Hf)(x) - (Hh)(x)| \le |\operatorname{Re}(f(x)) - \operatorname{Re}(h(x))| + |\operatorname{Im}(f(x)) - \operatorname{Im}(h(x))| \le 2|f(x) - h(x)|.
\]
Here, again, we used the Lipschitz continuity (with $L = 1$) of $\operatorname{Re}: \mathbb{C} \to \mathbb{R}$ and $\operatorname{Im}: \mathbb{C} \to \mathbb{R}$. Putting things together, we obtain
\[
\|Hf - Hh\|_2 = \Big( \int_{\mathbb{R}^d} |(Hf)(x) - (Hh)(x)|^2 \,dx \Big)^{1/2} \le 2 \Big( \int_{\mathbb{R}^d} |f(x) - h(x)|^2 \,dx \Big)^{1/2} = 2\|f - h\|_2,
\]
which proves that $H$ is Lipschitz-continuous with $L = 2$. Since $\tanh(0) = 0$, we trivially have $Hf = 0$ for $f = 0$. Finally, (43) is satisfied with $\rho(x) := \tanh(\operatorname{Re}(x)) + i \tanh(\operatorname{Im}(x))$.

Shifted logistic sigmoid: The shifted logistic sigmoid non-linearity (see, e.g., [28], [29]) is defined as $P: L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$, $(Pf)(x) := \operatorname{sig}(\operatorname{Re}(f(x))) + i \operatorname{sig}(\operatorname{Im}(f(x)))$, where $\operatorname{sig}(x) := \frac{1}{1 + e^{-x}} - \frac{1}{2}$. (Strictly speaking, it is the sigmoid function $x \mapsto \frac{1}{1 + e^{-x}}$ rather than the shifted sigmoid function $x \mapsto \frac{1}{1 + e^{-x}} - \frac{1}{2}$ that is used in [28], [29]; we incorporated the offset $\frac{1}{2}$ in order to satisfy the requirement $Pf = 0$ for $f = 0$.) We first establish that $P$ is Lipschitz-continuous with $L = \frac{1}{2}$. To this end, fix $f, h \in L^2(\mathbb{R}^d)$. We have
\[
|(Pf)(x) - (Ph)(x)| \le |\operatorname{sig}(\operatorname{Re}(f(x))) - \operatorname{sig}(\operatorname{Re}(h(x)))| + |\operatorname{sig}(\operatorname{Im}(f(x))) - \operatorname{sig}(\operatorname{Im}(h(x)))|, \tag{48}
\]
where, again, we employed the triangle inequality. As before, to further upper-bound (48), we show that $\operatorname{sig}$ is Lipschitz-continuous. Specifically, we apply Lemma 1 with $\operatorname{sig}'(x) = \frac{e^{-x}}{(1 + e^{-x})^2}$, $x \in \mathbb{R}$, and hence $\sup_{x \in \mathbb{R}} |\operatorname{sig}'(x)| \le \frac{1}{4}$, to conclude that $\operatorname{sig}$ is Lipschitz-continuous with $L = \frac{1}{4}$. When used in (48), this yields (together with the Lipschitz continuity, with $L = 1$, of $\operatorname{Re}: \mathbb{C} \to \mathbb{R}$ and $\operatorname{Im}: \mathbb{C} \to \mathbb{R}$)
\[
|(Pf)(x) - (Ph)(x)| \le \tfrac{1}{4}|\operatorname{Re}(f(x)) - \operatorname{Re}(h(x))| + \tfrac{1}{4}|\operatorname{Im}(f(x)) - \operatorname{Im}(h(x))| \le \tfrac{1}{2}|f(x) - h(x)|. \tag{49}
\]
It now follows from (49) that
\[
\|Pf - Ph\|_2 = \Big( \int_{\mathbb{R}^d} |(Pf)(x) - (Ph)(x)|^2 \,dx \Big)^{1/2} \le \tfrac{1}{2} \Big( \int_{\mathbb{R}^d} |f(x) - h(x)|^2 \,dx \Big)^{1/2} = \tfrac{1}{2}\|f - h\|_2,
\]
which establishes Lipschitz continuity of $P$ with $L = \frac{1}{2}$. Since $\operatorname{sig}(0) = 0$, we trivially have $Pf = 0$ for $f = 0$. Finally, (43) is satisfied with $\rho(x) := \operatorname{sig}(\operatorname{Re}(x)) + i \operatorname{sig}(\operatorname{Im}(x))$.
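The derivative bounds invoked through Lemma 1 are easy to confirm numerically; the sketch below (ours) evaluates $\sup_x |\tanh'(x)| = 1$ and $\sup_x |\operatorname{sig}'(x)| = 1/4$ on a grid, which, combined with the $\operatorname{Re}/\operatorname{Im}$ splitting above, gives the pointwise constants $L = 2$ and $L = 1/2$.

```python
import numpy as np

x = np.linspace(-10, 10, 100001)
tanh_prime = 1 - np.tanh(x) ** 2                 # tanh'(x)
sig_prime = np.exp(-x) / (1 + np.exp(-x)) ** 2   # sig'(x); the offset -1/2 drops out
print(f"sup|tanh'| on grid: {tanh_prime.max():.4f}")  # = 1, attained at x = 0
print(f"sup|sig'|  on grid: {sig_prime.max():.4f}")   # = 0.25, attained at x = 0
```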
APPENDIX E
PROOF OF PROPOSITION 1

We need to show that $\Phi_\Omega(f) \in (L^2(\mathbb{R}^d))^{\mathcal{Q}}$, for all $f \in L^2(\mathbb{R}^d)$. This will be accomplished by proving an even stronger result, namely
\[
|||\Phi_\Omega(f)||| \le \|f\|_2, \quad \forall f \in L^2(\mathbb{R}^d), \tag{50}
\]
which, by $\|f\|_2 < \infty$, establishes the claim. For ease of notation, we let $f_q := U[q]f$, for $f \in L^2(\mathbb{R}^d)$, in the following. Thanks to (14) and (17), we have $\|f_q\|_2 \le \|f\|_2 < \infty$, and thus $f_q \in L^2(\mathbb{R}^d)$. The key idea of the proof is now, similarly to the proof of [22, Proposition 2.5], to judiciously employ a telescoping series argument. We start by writing
\[
|||\Phi_\Omega(f)|||^2 = \sum_{n=0}^{\infty} \sum_{q \in \Lambda_1^n} \|f_q * \chi_n\|_2^2 = \lim_{N \to \infty} \sum_{n=0}^{N} \underbrace{\sum_{q \in \Lambda_1^n} \|f_q * \chi_n\|_2^2}_{:=\, a_n}. \tag{51}
\]
The key step is then to establish that $a_n$ can be upper-bounded according to
\[
a_n \le b_n - b_{n+1}, \quad \forall n \in \mathbb{N}_0, \tag{52}
\]
with $b_n := \sum_{q \in \Lambda_1^n} \|f_q\|_2^2$, $n \in \mathbb{N}_0$, and to use this result in a telescoping series argument according to
\[
\sum_{n=0}^{N} a_n \le \sum_{n=0}^{N} (b_n - b_{n+1}) = (b_0 - b_1) + (b_1 - b_2) + \dots + (b_N - b_{N+1}) = b_0 - \underbrace{b_{N+1}}_{\ge 0} \tag{53}
\]
\[
\le b_0 = \sum_{q \in \Lambda_1^0} \|f_q\|_2^2 = \|U[e]f\|_2^2 = \|f\|_2^2. \tag{54}
\]
By (51) this then implies (50).

We start by noting that (52) reads
\[
\sum_{q \in \Lambda_1^n} \|f_q * \chi_n\|_2^2 \le \sum_{q \in \Lambda_1^n} \|f_q\|_2^2 - \sum_{q \in \Lambda_1^{n+1}} \|f_q\|_2^2, \tag{55}
\]
for all $n \in \mathbb{N}_0$, and proceed by examining the second term on the right-hand side (RHS) of (55). Every path
\[
\tilde{q} \in \Lambda_1^{n+1} = \underbrace{\Lambda_1 \times \dots \times \Lambda_n}_{=\, \Lambda_1^n} \times \Lambda_{n+1}
\]
of length $n+1$ can be decomposed into a path $q \in \Lambda_1^n$ of length $n$ and an index $\lambda_{n+1} \in \Lambda_{n+1}$ according to $\tilde{q} = (q, \lambda_{n+1})$. Thanks to (13) we have $U[\tilde{q}] = U[(q, \lambda_{n+1})] = U_{n+1}[\lambda_{n+1}] U[q]$, which yields
\[
\sum_{\tilde{q} \in \Lambda_1^{n+1}} \|f_{\tilde{q}}\|_2^2 = \sum_{q \in \Lambda_1^n} \sum_{\lambda_{n+1} \in \Lambda_{n+1}} \|U_{n+1}[\lambda_{n+1}] f_q\|_2^2. \tag{56}
\]
Substituting the second term on the RHS of (55) by (56) now yields
\[
\sum_{q \in \Lambda_1^n} \Big( \|f_q * \chi_n\|_2^2 + \sum_{\lambda_{n+1} \in \Lambda_{n+1}} \|U_{n+1}[\lambda_{n+1}] f_q\|_2^2 \Big) \le \sum_{q \in \Lambda_1^n} \|f_q\|_2^2, \quad \forall n \in \mathbb{N}_0. \tag{57}
\]
Next, note that the second term inside the sum on the left-hand side (LHS) of (57) can be written as
\[
\sum_{\lambda_{n+1} \in \Lambda_{n+1}} \|U_{n+1}[\lambda_{n+1}] f_q\|_2^2 = \sum_{\lambda_{n+1} \in \Lambda_{n+1}} \int_{\mathbb{R}^d} |(U_{n+1}[\lambda_{n+1}] f_q)(x)|^2 \,dx = \sum_{\lambda_{n+1} \in \Lambda_{n+1}} S_{n+1}^d \int_{\mathbb{R}^d} \big| \big(P_{n+1}(M_{n+1}(f_q * g_{\lambda_{n+1}}))\big)(S_{n+1} x) \big|^2 \,dx
\]
\[
= \sum_{\lambda_{n+1} \in \Lambda_{n+1}} \int_{\mathbb{R}^d} \big| \big(P_{n+1}(M_{n+1}(f_q * g_{\lambda_{n+1}}))\big)(y) \big|^2 \,dy = \sum_{\lambda_{n+1} \in \Lambda_{n+1}} \|P_{n+1}(M_{n+1}(f_q * g_{\lambda_{n+1}}))\|_2^2, \tag{58}
\]
for all $n \in \mathbb{N}_0$. Noting that $f_q \in L^2(\mathbb{R}^d)$, as established above, and $g_{\lambda_{n+1}} \in L^1(\mathbb{R}^d)$, by assumption, it follows that $(f_q * g_{\lambda_{n+1}}) \in L^2(\mathbb{R}^d)$ thanks to Young's inequality [63, Theorem 1.2.12]. We use the Lipschitz property of $M_{n+1}$ and $P_{n+1}$, i.e., $\|M_{n+1}(f_q * g_{\lambda_{n+1}}) - M_{n+1}h\|_2 \le L_{n+1}\|f_q * g_{\lambda_{n+1}} - h\|_2$ and $\|P_{n+1}(f_q * g_{\lambda_{n+1}}) - P_{n+1}h\|_2 \le R_{n+1}\|f_q * g_{\lambda_{n+1}} - h\|_2$, together with $M_{n+1}h = 0$ and $P_{n+1}h = 0$ for $h = 0$, to upper-bound the term inside the sum in (58) according to
\[
\|P_{n+1}(M_{n+1}(f_q * g_{\lambda_{n+1}}))\|_2^2 \le R_{n+1}^2 \|M_{n+1}(f_q * g_{\lambda_{n+1}})\|_2^2 \le L_{n+1}^2 R_{n+1}^2 \|f_q * g_{\lambda_{n+1}}\|_2^2, \quad \forall n \in \mathbb{N}_0. \tag{59}
\]
Substituting the second term inside the sum on the LHS of (57) by the upper bound resulting from insertion of (59) into (58) yields
\[
\sum_{q \in \Lambda_1^n} \Big( \|f_q * \chi_n\|_2^2 + L_{n+1}^2 R_{n+1}^2 \sum_{\lambda_{n+1} \in \Lambda_{n+1}} \|f_q * g_{\lambda_{n+1}}\|_2^2 \Big) \le \sum_{q \in \Lambda_1^n} \max\{1, L_{n+1}^2 R_{n+1}^2\} \Big( \|f_q * \chi_n\|_2^2 + \sum_{\lambda_{n+1} \in \Lambda_{n+1}} \|f_q * g_{\lambda_{n+1}}\|_2^2 \Big), \tag{60}
\]
for all $n \in \mathbb{N}_0$. As the functions $\{g_{\lambda_{n+1}}\}_{\lambda_{n+1} \in \Lambda_{n+1}} \cup \{\chi_n\}$ are the atoms of the semi-discrete frame $\Psi_{n+1}$ for $L^2(\mathbb{R}^d)$ and $f_q \in L^2(\mathbb{R}^d)$, as established above, we have
\[
\|f_q * \chi_n\|_2^2 + \sum_{\lambda_{n+1} \in \Lambda_{n+1}} \|f_q * g_{\lambda_{n+1}}\|_2^2 \le B_{n+1} \|f_q\|_2^2,
\]
which, when used in (60), yields
\[
\sum_{q \in \Lambda_1^n} \Big( \|f_q * \chi_n\|_2^2 + \sum_{\lambda_{n+1} \in \Lambda_{n+1}} \|U_{n+1}[\lambda_{n+1}] f_q\|_2^2 \Big) \le \sum_{q \in \Lambda_1^n} \max\{B_{n+1}, B_{n+1} L_{n+1}^2 R_{n+1}^2\} \|f_q\|_2^2, \tag{61}
\]
for all $n \in \mathbb{N}_0$. Finally, invoking the assumption $\max\{B_n, B_n L_n^2 R_n^2\} \le 1$, $\forall n \in \mathbb{N}$, in (61) yields (57) and thereby completes the proof.
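The telescoping argument can be watched in action on a toy discrete network (all modeling choices ours, with discrete circular convolutions standing in for the continuous-time setting): with a Parseval filter bank ($B_n = 1$, obtained via Proposition 3 ii)), the modulus non-linearity ($L_n = 1$), and no pooling ($P_n = \mathrm{Id}$, $R_n = 1$), the admissibility condition holds with equality and the accumulated feature energy stays below $\|f\|_2^2$; the gap is exactly the energy still propagating in the deepest layer.

```python
import numpy as np

N = 512
rng = np.random.default_rng(1)
freqs = np.fft.fftfreq(N, d=1.0 / N)                        # integer DFT frequencies
bank = np.array([np.exp(-0.5 * ((np.abs(freqs) - c) / 8.0) ** 2)
                 for c in (0, 20, 60, 140)])                # row 0 plays chi_n, rest g_lambda
bank /= np.sqrt(np.sum(bank ** 2, axis=0))                  # Proposition 3 ii): Parseval

def layer(u):
    """One layer: circular convolutions, modulus on the propagated bands."""
    outs = np.fft.ifft(np.fft.fft(u) * bank)
    return outs[0], [np.abs(o) for o in outs[1:]]

f = rng.standard_normal(N)
feat_energy, frontier = 0.0, [f]
for _ in range(3):                                          # output-generating layers n = 0, 1, 2
    nxt = []
    for u in frontier:
        low, props = layer(u)
        feat_energy += np.sum(np.abs(low) ** 2)             # ||U[q]f * chi_n||_2^2
        nxt += props
    frontier = nxt
print(f"|||Phi(f)|||^2 / ||f||_2^2 = {feat_energy / np.sum(f ** 2):.4f}  (<= 1)")
```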
With the definition of U [ q ] in (13) this then yields U [ q ] T t f = T t/ ( S 1 ··· S n ) U [ q ] f , (63) for all f ∈ L 2 ( R d ) , t ∈ R d , and q ∈ Λ n 1 . The identity (19) is then a direct consequence of (63) and the translation- cov ariance of the conv olution operator: Φ n Ω ( T t f ) = U [ q ] T t f ∗ χ n q ∈ Λ n 1 = T t/ ( S 1 ··· S n ) U [ q ] f ∗ χ n q ∈ Λ n 1 = T t/ ( S 1 ··· S n ) ( U [ q ] f ) ∗ χ n q ∈ Λ n 1 = T t/ ( S 1 ··· S n ) ( U [ q ] f ) ∗ χ n q ∈ Λ n 1 = T t/ ( S 1 ··· S n ) Φ n Ω ( f ) , ∀ f ∈ L 2 ( R d ) , ∀ t ∈ R d . T o establish (62), we first define the unitary operator D n : L 2 ( R d ) → L 2 ( R d ) , D n f := S d/ 2 n f ( S n · ) , and note that U n [ λ n ] T t f = S d/ 2 n P n M n ( T t f ) ∗ g λ n ( S n · ) = D n P n M n ( T t f ) ∗ g λ n = D n P n M n T t ( f ∗ g λ n ) = D n P n T t M n ( f ∗ g λ n ) (64) = D n T t P n M n ( f ∗ g λ n ) , (65) for all f ∈ L 2 ( R d ) and t ∈ R d , where in (64) and (65) we employed M n T t = T t M n , and P n T t = T t P n , for all n ∈ N and t ∈ R d , respectively , both of which are by assumption. Next, using D n T t f = S d/ 2 n f ( S n · − t ) = S d/ 2 n f ( S n ( · − t/S n )) = T t/S n D n f , ∀ f ∈ L 2 ( R d ) , ∀ t ∈ R d , in (65) yields U n [ λ n ] T t f = D n T t P n M n ( f ∗ g λ n ) = T t/S n D n P n M n ( f ∗ g λ n ) = T t/S n U n [ λ n ] f , for all f ∈ L 2 ( R d ) and t ∈ R d . This completes the proof of i). Next, we prove ii). F or ease of notation, again, we let f q := U [ q ] f , for f ∈ L 2 ( R d ) . Thanks to (14) and the admissibility condition (17), we ha ve k f q k 2 ≤ k f k 2 < ∞ , and thus f q ∈ L 2 ( R d ) . W e first write ||| Φ n Ω ( T t f ) − Φ n Ω ( f ) ||| 2 = ||| T t/ ( S 1 ··· S n ) Φ n Ω ( f ) − Φ n Ω ( f ) ||| 2 (66) = X q ∈ Λ n 1 k T t/ ( S 1 ··· S n ) ( f q ∗ χ n ) − f q ∗ χ n k 2 2 = X q ∈ Λ n 1 k M − t/ ( S 1 ··· S n ) ( \ f q ∗ χ n ) − \ f q ∗ χ n k 2 2 , (67) for all n ∈ N , where in (66) we used (19), and in (67) we employed Parse val’ s formula [43, p. 189] (noting that ( f q ∗ χ n ) ∈ L 2 ( R d ) thanks to Y oung’ s inequality [63, Theorem 1.2.12]) together with the relation d T t f = M − t b f , ∀ f ∈ L 2 ( R d ) , ∀ t ∈ R d . The key step is then to establish the upper bound k M − t/ ( S 1 ··· S n ) ( \ f q ∗ χ n ) − \ f q ∗ χ n k 2 2 ≤ 4 π 2 | t | 2 K 2 ( S 1 · · · S n ) 2 k f q k 2 2 , ∀ n ∈ N , (68) where K > 0 corresponds to the constant in the decay condition (20), and to note that X q ∈ Λ n 1 k f q k 2 2 ≤ X q ∈ Λ n − 1 1 k f q k 2 2 , ∀ n ∈ N , (69) which follows from (52) thanks to 0 ≤ X q ∈ Λ n − 1 1 || f q ∗ χ n − 1 || 2 2 = a n − 1 ≤ b n − 1 − b n (70) = X q ∈ Λ n − 1 1 k f q k 2 2 − X q ∈ Λ n 1 k f q k 2 2 , ∀ n ∈ N . (71) Iterating on (69) yields X q ∈ Λ n 1 k f q k 2 2 ≤ X q ∈ Λ n − 1 1 k f q k 2 2 ≤ · · · ≤ X q ∈ Λ 0 1 k f q k 2 2 = k U [ e ] f k 2 2 = k f k 2 2 , ∀ n ∈ N . (72) The identity (67) together with the inequalities (68) and (72) then directly imply ||| Φ n Ω ( T t f ) − Φ n Ω ( f ) ||| 2 ≤ 4 π 2 | t | 2 K 2 ( S 1 · · · S n ) 2 k f k 2 2 , (73) for all n ∈ N . It remains to prove (68). T o this end, we first note that k M − t/ ( S 1 ··· S n ) ( \ f q ∗ χ n ) − \ f q ∗ χ n k 2 2 = Z R d e − 2 π i h t,ω i / ( S 1 ··· S n ) − 1 2 | c χ n ( ω ) | 2 | b f q ( ω ) | 2 d ω . (74) Since | e − 2 π ix − 1 | ≤ 2 π | x | , for all x ∈ R , it follows that | e − 2 πi h t,ω i / ( S 1 ··· S n ) − 1 | 2 ≤ 4 π 2 |h t, ω i| 2 ( S 1 · · · S n ) 2 ≤ 4 π 2 | t | 2 | ω | 2 ( S 1 · · · S n ) 2 , (75) where in the last step we employed the Cauchy-Schwartz 17 inequality . 
Next, we prove ii). For ease of notation, again, we let $f_q := U[q]f$, for $f \in L^2(\mathbb{R}^d)$. Thanks to (14) and the admissibility condition (17), we have $\|f_q\|_2 \le \|f\|_2 < \infty$, and thus $f_q \in L^2(\mathbb{R}^d)$. We first write
\begin{align}
|||\Phi_\Omega^n(T_t f) - \Phi_\Omega^n(f)|||^2 &= |||T_{t/(S_1 \cdots S_n)} \Phi_\Omega^n(f) - \Phi_\Omega^n(f)|||^2 \tag{66} \\
&= \sum_{q \in \Lambda_1^n} \|T_{t/(S_1 \cdots S_n)}(f_q * \chi_n) - f_q * \chi_n\|_2^2 \nonumber \\
&= \sum_{q \in \Lambda_1^n} \|M_{-t/(S_1 \cdots S_n)} \widehat{(f_q * \chi_n)} - \widehat{(f_q * \chi_n)}\|_2^2, \tag{67}
\end{align}
for all $n \in \mathbb{N}$, where in (66) we used (19), and in (67) we employed Parseval's formula [43, p. 189] (noting that $(f_q * \chi_n) \in L^2(\mathbb{R}^d)$ thanks to Young's inequality [63, Theorem 1.2.12]) together with the relation $\widehat{T_t f} = M_{-t}\widehat{f}$, $\forall f \in L^2(\mathbb{R}^d)$, $\forall t \in \mathbb{R}^d$. The key step is then to establish the upper bound
\[
\|M_{-t/(S_1 \cdots S_n)} \widehat{(f_q * \chi_n)} - \widehat{(f_q * \chi_n)}\|_2^2 \le \frac{4\pi^2 |t|^2 K^2}{(S_1 \cdots S_n)^2} \|f_q\|_2^2, \quad \forall n \in \mathbb{N}, \tag{68}
\]
where $K > 0$ corresponds to the constant in the decay condition (20), and to note that
\[
\sum_{q \in \Lambda_1^n} \|f_q\|_2^2 \le \sum_{q \in \Lambda_1^{n-1}} \|f_q\|_2^2, \quad \forall n \in \mathbb{N}, \tag{69}
\]
which follows from (52) thanks to
\[
0 \le \sum_{q \in \Lambda_1^{n-1}} \|f_q * \chi_{n-1}\|_2^2 = a_{n-1} \le b_{n-1} - b_n \tag{70}
\]
\[
= \sum_{q \in \Lambda_1^{n-1}} \|f_q\|_2^2 - \sum_{q \in \Lambda_1^n} \|f_q\|_2^2, \quad \forall n \in \mathbb{N}. \tag{71}
\]
Iterating on (69) yields
\[
\sum_{q \in \Lambda_1^n} \|f_q\|_2^2 \le \sum_{q \in \Lambda_1^{n-1}} \|f_q\|_2^2 \le \dots \le \sum_{q \in \Lambda_1^0} \|f_q\|_2^2 = \|U[e]f\|_2^2 = \|f\|_2^2, \quad \forall n \in \mathbb{N}. \tag{72}
\]
The identity (67) together with the inequalities (68) and (72) then directly implies
\[
|||\Phi_\Omega^n(T_t f) - \Phi_\Omega^n(f)|||^2 \le \frac{4\pi^2 |t|^2 K^2}{(S_1 \cdots S_n)^2} \|f\|_2^2, \tag{73}
\]
for all $n \in \mathbb{N}$. It remains to prove (68). To this end, we first note that
\[
\|M_{-t/(S_1 \cdots S_n)} \widehat{(f_q * \chi_n)} - \widehat{(f_q * \chi_n)}\|_2^2 = \int_{\mathbb{R}^d} \big| e^{-2\pi i \langle t, \omega \rangle / (S_1 \cdots S_n)} - 1 \big|^2 |\widehat{\chi_n}(\omega)|^2 |\widehat{f_q}(\omega)|^2 \,d\omega. \tag{74}
\]
Since $|e^{-2\pi i x} - 1| \le 2\pi |x|$, for all $x \in \mathbb{R}$, it follows that
\[
\big| e^{-2\pi i \langle t, \omega \rangle / (S_1 \cdots S_n)} - 1 \big|^2 \le \frac{4\pi^2 |\langle t, \omega \rangle|^2}{(S_1 \cdots S_n)^2} \le \frac{4\pi^2 |t|^2 |\omega|^2}{(S_1 \cdots S_n)^2}, \tag{75}
\]
where in the last step we employed the Cauchy–Schwarz inequality. Substituting (75) into (74) yields
\[
\|M_{-t/(S_1 \cdots S_n)} \widehat{(f_q * \chi_n)} - \widehat{(f_q * \chi_n)}\|_2^2 \le \frac{4\pi^2 |t|^2}{(S_1 \cdots S_n)^2} \int_{\mathbb{R}^d} |\omega|^2 |\widehat{\chi_n}(\omega)|^2 |\widehat{f_q}(\omega)|^2 \,d\omega \le \frac{4\pi^2 |t|^2 K^2}{(S_1 \cdots S_n)^2} \int_{\mathbb{R}^d} |\widehat{f_q}(\omega)|^2 \,d\omega \tag{76}
\]
\[
= \frac{4\pi^2 |t|^2 K^2}{(S_1 \cdots S_n)^2} \|\widehat{f_q}\|_2^2 = \frac{4\pi^2 |t|^2 K^2}{(S_1 \cdots S_n)^2} \|f_q\|_2^2, \tag{77}
\]
for all $n \in \mathbb{N}$, where in (76) we employed the decay condition (20), and in the last step, again, we used Parseval's formula [43, p. 189]. This establishes (68) and thereby completes the proof of ii).
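In plain numbers, the right-hand side of (73) makes the vertical nature of the translation invariance explicit; the snippet below (with placeholder values for $|t|$ and the constant $K$ from (20)) tabulates the bound $2\pi |t| K / (S_1 \cdots S_n)$, the square root of the RHS of (73) per unit $\|f\|_2$, for pooling factors $S_n = 2$, which halves with every additional layer.

```python
import math

t_abs, K_const = 1.0, 1.0           # placeholder values for |t| and the constant K in (20)
for n in range(1, 7):
    prod_S = 2.0 ** n               # S_1 * ... * S_n with S_n = 2 in every layer
    bound = 2 * math.pi * t_abs * K_const / prod_S
    print(f"n = {n}: |||Phi^n(T_t f) - Phi^n(f)||| <= {bound:.4f} * ||f||_2")
```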
APPENDIX G
PROOF OF COROLLARY 1

The key idea of the proof is, similarly to the proof of ii) in Theorem 1, to upper-bound the deviation from perfect covariance in the frequency domain. For ease of notation, again, we let $f_q := U[q]f$, for $f \in L^2(\mathbb{R}^d)$. Thanks to (14) and the admissibility condition (17), we have $\|f_q\|_2 \le \|f\|_2 < \infty$, and thus $f_q \in L^2(\mathbb{R}^d)$. We first write
\begin{align}
|||\Phi_\Omega^n(T_t f) - T_t \Phi_\Omega^n(f)|||^2 &= |||T_{t/(S_1 \cdots S_n)} \Phi_\Omega^n(f) - T_t \Phi_\Omega^n(f)|||^2 \tag{78} \\
&= \sum_{q \in \Lambda_1^n} \|(T_{t/(S_1 \cdots S_n)} - T_t)(f_q * \chi_n)\|_2^2 \nonumber \\
&= \sum_{q \in \Lambda_1^n} \|(M_{-t/(S_1 \cdots S_n)} - M_{-t}) \widehat{(f_q * \chi_n)}\|_2^2, \tag{79}
\end{align}
for all $n \in \mathbb{N}$, where in (78) we used (19), and in (79) we employed Parseval's formula [43, p. 189] (noting that $(f_q * \chi_n) \in L^2(\mathbb{R}^d)$ thanks to Young's inequality [63, Theorem 1.2.12]) together with the relation $\widehat{T_t f} = M_{-t}\widehat{f}$, $\forall f \in L^2(\mathbb{R}^d)$, $\forall t \in \mathbb{R}^d$. The key step is then to establish the upper bound
\[
\|(M_{-t/(S_1 \cdots S_n)} - M_{-t}) \widehat{(f_q * \chi_n)}\|_2^2 \le 4\pi^2 |t|^2 K^2 \big| 1/(S_1 \cdots S_n) - 1 \big|^2 \|f_q\|_2^2, \tag{80}
\]
where $K > 0$ corresponds to the constant in the decay condition (20). Arguments similar to those leading to (73) then complete the proof. It remains to prove (80):
\[
\|(M_{-t/(S_1 \cdots S_n)} - M_{-t}) \widehat{(f_q * \chi_n)}\|_2^2 = \int_{\mathbb{R}^d} \big| e^{-2\pi i \langle t, \omega \rangle / (S_1 \cdots S_n)} - e^{-2\pi i \langle t, \omega \rangle} \big|^2 |\widehat{\chi_n}(\omega)|^2 |\widehat{f_q}(\omega)|^2 \,d\omega. \tag{81}
\]
Since $|e^{-2\pi i x} - e^{-2\pi i y}| \le 2\pi |x - y|$, for all $x, y \in \mathbb{R}$, it follows that
\[
\big| e^{-2\pi i \langle t, \omega \rangle / (S_1 \cdots S_n)} - e^{-2\pi i \langle t, \omega \rangle} \big|^2 \le 4\pi^2 |t|^2 |\omega|^2 \big| 1/(S_1 \cdots S_n) - 1 \big|^2, \tag{82}
\]
where, again, we employed the Cauchy–Schwarz inequality. Substituting (82) into (81), and employing arguments similar to those leading to (77), establishes (80) and thereby completes the proof.

APPENDIX H
PROOF OF THEOREM 2

As already mentioned at the beginning of Section IV-B, the proof of the deformation sensitivity bound (26) is based on two key ingredients. The first one, stated in Proposition 4 in Appendix I, establishes that the feature extractor $\Phi_\Omega$ is Lipschitz-continuous with Lipschitz constant $L_\Omega = 1$, i.e.,
\[
|||\Phi_\Omega(f) - \Phi_\Omega(h)||| \le \|f - h\|_2, \quad \forall f, h \in L^2(\mathbb{R}^d), \tag{83}
\]
and needs the admissibility condition (17) only. The second ingredient, stated in Proposition 5 in Appendix J, is an upper bound on the deformation error $\|f - F_{\tau,\omega} f\|_2$ given by
\[
\|f - F_{\tau,\omega} f\|_2 \le C \big( R \|\tau\|_\infty + \|\omega\|_\infty \big) \|f\|_2, \tag{84}
\]
for all $f \in L^2_R(\mathbb{R}^d)$, and is valid under the assumptions $\omega \in C(\mathbb{R}^d, \mathbb{R})$ and $\tau \in C^1(\mathbb{R}^d, \mathbb{R}^d)$ with $\|D\tau\|_\infty < \frac{1}{2d}$. We now show how (83) and (84) can be combined to establish the deformation sensitivity bound (26). To this end, we first apply (83) with $h := F_{\tau,\omega} f = e^{2\pi i \omega(\cdot)} f(\cdot - \tau(\cdot))$ to get
\[
|||\Phi_\Omega(f) - \Phi_\Omega(F_{\tau,\omega} f)||| \le \|f - F_{\tau,\omega} f\|_2, \tag{85}
\]
for all $f \in L^2(\mathbb{R}^d)$. Here, we used $F_{\tau,\omega} f \in L^2(\mathbb{R}^d)$, which is thanks to
\[
\|F_{\tau,\omega} f\|_2^2 = \int_{\mathbb{R}^d} |f(x - \tau(x))|^2 \,dx \le 2\|f\|_2^2,
\]
obtained through the change of variables $u = x - \tau(x)$, together with
\[
\frac{du}{dx} = |\det(E - (D\tau)(x))| \ge 1 - d\|D\tau\|_\infty \ge 1/2, \tag{86}
\]
for $x \in \mathbb{R}^d$. The first inequality in (86) follows from:

Lemma 2 ([82, Corollary 1]). Let $M \in \mathbb{R}^{d \times d}$ be such that $|M_{i,j}| \le \alpha$, for all $i, j$ with $1 \le i, j \le d$. If $d\alpha \le 1$, then $|\det(E - M)| \ge 1 - d\alpha$.

The second inequality in (86) is a consequence of the assumption $\|D\tau\|_\infty < \frac{1}{2d}$. The proof is finalized by replacing the RHS of (85) by the RHS of (84).

APPENDIX I
PROPOSITION 4

Proposition 4. Let $\Omega = (\Psi_n, M_n, P_n)_{n \in \mathbb{N}}$ be an admissible module-sequence. The corresponding feature extractor $\Phi_\Omega: L^2(\mathbb{R}^d) \to (L^2(\mathbb{R}^d))^{\mathcal{Q}}$ is Lipschitz-continuous with Lipschitz constant $L_\Omega = 1$, i.e.,
\[
|||\Phi_\Omega(f) - \Phi_\Omega(h)||| \le \|f - h\|_2, \quad \forall f, h \in L^2(\mathbb{R}^d). \tag{87}
\]

Remark 4. Proposition 4 generalizes [22, Proposition 2.5], which shows that the wavelet-modulus feature extractor $\Phi_W$ generated by scattering networks is Lipschitz-continuous with Lipschitz constant $L_W = 1$. Specifically, our generalization allows for general semi-discrete frames (i.e., general convolution kernels), general Lipschitz-continuous non-linearities $M_n$, and general Lipschitz-continuous operators $P_n$, all of which can be different in different layers. Moreover, thanks to the admissibility condition (17), the Lipschitz constant $L_\Omega = 1$ in (87) is completely independent of the frame upper bounds $B_n$ and the Lipschitz constants $L_n$ and $R_n$ of $M_n$ and $P_n$, respectively.
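Reprising the toy network used for Proposition 1 above (a self-contained variant, with all modeling choices ours), one can also probe the bound (87) empirically before reading the proof: with a Parseval filter bank, the modulus, and no pooling, the feature distance never exceeds the input distance.

```python
import numpy as np

N = 512
freqs = np.fft.fftfreq(N, d=1.0 / N)
bank = np.array([np.exp(-0.5 * ((np.abs(freqs) - c) / 8.0) ** 2) for c in (0, 20, 60, 140)])
bank /= np.sqrt(np.sum(bank ** 2, axis=0))       # Parseval normalization, B_n = 1

def phi(f, depth=3):
    feats, frontier = [], [f]
    for _ in range(depth):
        nxt = []
        for u in frontier:
            outs = np.fft.ifft(np.fft.fft(u) * bank)
            feats.append(outs[0])                 # output U[q]f * chi_n
            nxt += [np.abs(o) for o in outs[1:]]  # propagate via modulus, no pooling
        frontier = nxt
    return feats

rng = np.random.default_rng(3)
f, h = rng.standard_normal(N), rng.standard_normal(N)
dist = np.sqrt(sum(np.sum(np.abs(a - b) ** 2) for a, b in zip(phi(f), phi(h))))
print(f"feature distance {dist:.4f} <= input distance {np.linalg.norm(f - h):.4f}")
```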
Proof. The key idea of the proof is, again similarly to the proof of Proposition 1 in Appendix E, to judiciously employ a telescoping series argument. For ease of notation, we let $f_q := U[q]f$ and $h_q := U[q]h$, for $f, h \in L^2(\mathbb{R}^d)$. Thanks to (14) and the admissibility condition (17), we have $\|f_q\|_2 \le \|f\|_2 < \infty$ and $\|h_q\|_2 \le \|h\|_2 < \infty$, and thus $f_q, h_q \in L^2(\mathbb{R}^d)$. We start by writing
\[
|||\Phi_\Omega(f) - \Phi_\Omega(h)|||^2 = \sum_{n=0}^{\infty} \sum_{q \in \Lambda_1^n} \|f_q * \chi_n - h_q * \chi_n\|_2^2 = \lim_{N \to \infty} \sum_{n=0}^{N} \underbrace{\sum_{q \in \Lambda_1^n} \|f_q * \chi_n - h_q * \chi_n\|_2^2}_{=:\, a_n}.
\]
As in the proof of Proposition 1 in Appendix E, the key step is to show that $a_n$ can be upper-bounded according to
\[
a_n \le b_n - b_{n+1}, \quad \forall n \in \mathbb{N}_0, \tag{88}
\]
where here $b_n := \sum_{q \in \Lambda_1^n} \|f_q - h_q\|_2^2$, $\forall n \in \mathbb{N}_0$, and to note that, similarly to (54),
\[
\sum_{n=0}^{N} a_n \le \sum_{n=0}^{N} (b_n - b_{n+1}) = (b_0 - b_1) + (b_1 - b_2) + \dots + (b_N - b_{N+1}) = b_0 - \underbrace{b_{N+1}}_{\ge 0} \le b_0 = \sum_{q \in \Lambda_1^0} \|f_q - h_q\|_2^2 = \|U[e]f - U[e]h\|_2^2 = \|f - h\|_2^2,
\]
which then yields (87) according to
\[
|||\Phi_\Omega(f) - \Phi_\Omega(h)|||^2 = \lim_{N \to \infty} \sum_{n=0}^{N} a_n \le \lim_{N \to \infty} \|f - h\|_2^2 = \|f - h\|_2^2.
\]
Writing out (88), it follows that we need to establish
\[
\sum_{q \in \Lambda_1^n} \|f_q * \chi_n - h_q * \chi_n\|_2^2 \le \sum_{q \in \Lambda_1^n} \|f_q - h_q\|_2^2 - \sum_{q \in \Lambda_1^{n+1}} \|f_q - h_q\|_2^2, \tag{89}
\]
for all $n \in \mathbb{N}_0$. We start by examining the second term on the RHS of (89) and note that, thanks to the decomposition $\tilde{q} \in \Lambda_1^{n+1} = \Lambda_1^n \times \Lambda_{n+1}$ and $U[\tilde{q}] = U[(q, \lambda_{n+1})] = U_{n+1}[\lambda_{n+1}] U[q]$, by (13), we have
\[
\sum_{\tilde{q} \in \Lambda_1^{n+1}} \|f_{\tilde{q}} - h_{\tilde{q}}\|_2^2 = \sum_{q \in \Lambda_1^n} \sum_{\lambda_{n+1} \in \Lambda_{n+1}} \|U_{n+1}[\lambda_{n+1}] f_q - U_{n+1}[\lambda_{n+1}] h_q\|_2^2. \tag{90}
\]
Substituting (90) into (89) and rearranging terms, we obtain
\[
\sum_{q \in \Lambda_1^n} \Big( \|f_q * \chi_n - h_q * \chi_n\|_2^2 + \sum_{\lambda_{n+1} \in \Lambda_{n+1}} \|U_{n+1}[\lambda_{n+1}] f_q - U_{n+1}[\lambda_{n+1}] h_q\|_2^2 \Big) \le \sum_{q \in \Lambda_1^n} \|f_q - h_q\|_2^2, \quad \forall n \in \mathbb{N}_0. \tag{91}
\]
We next note that the second term inside the sum on the LHS of (91) satisfies
\[
\sum_{\lambda_{n+1} \in \Lambda_{n+1}} \|U_{n+1}[\lambda_{n+1}] f_q - U_{n+1}[\lambda_{n+1}] h_q\|_2^2 \le \sum_{\lambda_{n+1} \in \Lambda_{n+1}} \|P_{n+1}(M_{n+1}(f_q * g_{\lambda_{n+1}})) - P_{n+1}(M_{n+1}(h_q * g_{\lambda_{n+1}}))\|_2^2, \tag{92}
\]
where we employed arguments similar to those leading to (58). Substituting the second term inside the sum on the LHS of (91) by the upper bound (92), and using the Lipschitz property of $M_{n+1}$ and $P_{n+1}$, yields
\[
\sum_{q \in \Lambda_1^n} \Big( \|f_q * \chi_n - h_q * \chi_n\|_2^2 + \sum_{\lambda_{n+1} \in \Lambda_{n+1}} \|U_{n+1}[\lambda_{n+1}] f_q - U_{n+1}[\lambda_{n+1}] h_q\|_2^2 \Big) \le \sum_{q \in \Lambda_1^n} \max\{1, L_{n+1}^2 R_{n+1}^2\} \Big( \|(f_q - h_q) * \chi_n\|_2^2 + \sum_{\lambda_{n+1} \in \Lambda_{n+1}} \|(f_q - h_q) * g_{\lambda_{n+1}}\|_2^2 \Big), \tag{93}
\]
for all $n \in \mathbb{N}_0$. As the functions $\{g_{\lambda_{n+1}}\}_{\lambda_{n+1} \in \Lambda_{n+1}} \cup \{\chi_n\}$ are the atoms of the semi-discrete frame $\Psi_{n+1}$ for $L^2(\mathbb{R}^d)$ and $f_q, h_q \in L^2(\mathbb{R}^d)$, as established above, we have
\[
\|(f_q - h_q) * \chi_n\|_2^2 + \sum_{\lambda_{n+1} \in \Lambda_{n+1}} \|(f_q - h_q) * g_{\lambda_{n+1}}\|_2^2 \le B_{n+1} \|f_q - h_q\|_2^2,
\]
which, when used in (93), yields
\[
\sum_{q \in \Lambda_1^n} \Big( \|f_q * \chi_n - h_q * \chi_n\|_2^2 + \sum_{\lambda_{n+1} \in \Lambda_{n+1}} \|U_{n+1}[\lambda_{n+1}] f_q - U_{n+1}[\lambda_{n+1}] h_q\|_2^2 \Big) \le \sum_{q \in \Lambda_1^n} \max\{B_{n+1}, B_{n+1} L_{n+1}^2 R_{n+1}^2\} \|f_q - h_q\|_2^2, \tag{94}
\]
for all $n \in \mathbb{N}_0$. Finally, invoking the admissibility condition $\max\{B_n, B_n L_n^2 R_n^2\} \le 1$, $\forall n \in \mathbb{N}$, in (94), we get (91) and hence (88). This completes the proof.

APPENDIX J
PROPOSITION 5

Proposition 5. There exists a constant $C > 0$ such that for all $f \in L^2_R(\mathbb{R}^d)$, $\omega \in C(\mathbb{R}^d, \mathbb{R})$, and $\tau \in C^1(\mathbb{R}^d, \mathbb{R}^d)$ with $\|D\tau\|_\infty < \frac{1}{2d}$, it holds that
\[
\|f - F_{\tau,\omega} f\|_2 \le C \big( R \|\tau\|_\infty + \|\omega\|_\infty \big) \|f\|_2. \tag{95}
\]

Remark 5. A similar bound was derived in [22, App. B] for scattering networks, namely
\[
\|f * \psi_{(-J,0)} - F_\tau(f * \psi_{(-J,0)})\|_2 \le C\, 2^{-J+d} \|\tau\|_\infty \|f\|_2, \tag{96}
\]
for all $f \in L^2(\mathbb{R}^d)$, where $\psi_{(-J,0)}$ is the low-pass filter of a semi-discrete directional wavelet frame for $L^2(\mathbb{R}^d)$, and $(F_\tau f)(x) = f(x - \tau(x))$. The techniques for proving (95) and (96) are related in the sense of both employing Schur's Lemma [63, App. I.1] and a Taylor series expansion argument [83, p. 411]. The signal-class specificity of our bound (95) comes with new technical elements detailed at the beginning of the proof.

Proof. We first determine an integral operator
\[
(Kf)(x) = \int_{\mathbb{R}^d} k(x, u) f(u) \,du \tag{97}
\]
satisfying the signal-class specific identity $Kf = F_{\tau,\omega} f - f$, $\forall f \in L^2_R(\mathbb{R}^d)$, and then upper-bound the deformation error $\|f - F_{\tau,\omega} f\|_2$ according to
\[
\|f - F_{\tau,\omega} f\|_2 = \|F_{\tau,\omega} f - f\|_2 = \|Kf\|_2 \le \|K\|_{2,2} \|f\|_2,
\]
for all $f \in L^2_R(\mathbb{R}^d)$. Application of Schur's Lemma, stated below, then yields $\|K\|_{2,2} \le C(R\|\tau\|_\infty + \|\omega\|_\infty)$, with $C > 0$, which completes the proof.

Schur's Lemma ([63, App. I.1]). Let $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{C}$ be a locally integrable function satisfying
\[
\text{(i)} \ \sup_{x \in \mathbb{R}^d} \int_{\mathbb{R}^d} |k(x, u)| \,du \le \alpha, \qquad \text{(ii)} \ \sup_{u \in \mathbb{R}^d} \int_{\mathbb{R}^d} |k(x, u)| \,dx \le \alpha, \tag{98}
\]
where $\alpha > 0$. Then, $(Kf)(x) = \int_{\mathbb{R}^d} k(x, u) f(u) \,du$ is a bounded operator from $L^2(\mathbb{R}^d)$ to $L^2(\mathbb{R}^d)$ with operator norm $\|K\|_{2,2} \le \alpha$.
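A finite-dimensional analog (ours) of Schur's Lemma is instructive: for a matrix $K$, the larger of the maximum absolute row sum and the maximum absolute column sum, the discrete counterparts of conditions (i) and (ii) in (98), always dominates the spectral norm $\|K\|_{2,2}$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
decay = np.exp(-0.05 * np.abs(np.subtract.outer(np.arange(n), np.arange(n))))
Kmat = rng.standard_normal((n, n)) * decay       # kernel with off-diagonal decay
alpha = max(np.abs(Kmat).sum(axis=0).max(),      # max column sum, analog of (ii)
            np.abs(Kmat).sum(axis=1).max())      # max row sum, analog of (i)
print(f"Schur bound {alpha:.2f} >= spectral norm {np.linalg.norm(Kmat, 2):.2f}")
```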
We start by determining the integral operator $K$ in (97). To this end, consider $\eta \in \mathcal{S}(\mathbb{R}^d, \mathbb{C})$ such that $\widehat{\eta}(\omega) = 1$, for all $\omega \in B_1(0)$. Setting $\gamma(x) := R^d \eta(Rx)$ yields $\gamma \in \mathcal{S}(\mathbb{R}^d, \mathbb{C})$ and $\widehat{\gamma}(\omega) = \widehat{\eta}(\omega/R)$. Thus, $\widehat{\gamma}(\omega) = 1$, for all $\omega \in B_R(0)$, and hence $\widehat{f} = \widehat{f} \cdot \widehat{\gamma}$, so that $f = f * \gamma$, for all $f \in L^2_R(\mathbb{R}^d)$. Next, we define the operator $A_\gamma: L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$, $A_\gamma f := f * \gamma$, and note that $A_\gamma$ is well-defined, i.e., $A_\gamma f \in L^2(\mathbb{R}^d)$, for all $f \in L^2(\mathbb{R}^d)$, thanks to Young's inequality [63, Theorem 1.2.12] (since $f \in L^2(\mathbb{R}^d)$ and $\gamma \in \mathcal{S}(\mathbb{R}^d, \mathbb{C}) \subseteq L^1(\mathbb{R}^d)$). Moreover, $A_\gamma f = f$, for all $f \in L^2_R(\mathbb{R}^d)$. Setting $K := F_{\tau,\omega} A_\gamma - A_\gamma$, we get $Kf = F_{\tau,\omega} A_\gamma f - A_\gamma f = F_{\tau,\omega} f - f$, for all $f \in L^2_R(\mathbb{R}^d)$, as desired. Furthermore, it follows from
\[
(F_{\tau,\omega} A_\gamma f)(x) = e^{2\pi i \omega(x)} \int_{\mathbb{R}^d} \gamma(x - \tau(x) - u) f(u) \,du
\]
that the integral operator $K = F_{\tau,\omega} A_\gamma - A_\gamma$, i.e., $(Kf)(x) = \int_{\mathbb{R}^d} k(x, u) f(u) \,du$, has the kernel
\[
k(x, u) := e^{2\pi i \omega(x)} \gamma(x - \tau(x) - u) - \gamma(x - u). \tag{99}
\]
Before we can apply Schur's Lemma to establish an upper bound on $\|K\|_{2,2}$, we need to verify that $k$ in (99) is locally integrable, i.e., we need to show that for every compact set $S \subseteq \mathbb{R}^d \times \mathbb{R}^d$ we have $\int_S |k(x, u)| \,d(x, u) < \infty$. To this end, let $S \subseteq \mathbb{R}^d \times \mathbb{R}^d$ be a compact set. Next, choose compact sets $S_1, S_2 \subseteq \mathbb{R}^d$ such that $S \subseteq S_1 \times S_2$. Thanks to $\gamma \in \mathcal{S}(\mathbb{R}^d, \mathbb{C})$, $\tau \in C^1(\mathbb{R}^d, \mathbb{R}^d)$, and $\omega \in C(\mathbb{R}^d, \mathbb{R})$, all by assumption, the function $|k|: S_1 \times S_2 \to \mathbb{C}$ is continuous as a composition of continuous functions, and therefore also Lebesgue-measurable. We further have
\[
\int_{S_1} \int_{S_2} |k(x, u)| \,dx \,du \le \int_{S_1} \int_{\mathbb{R}^d} |k(x, u)| \,dx \,du \le \int_{S_1} \int_{\mathbb{R}^d} |\gamma(x - \tau(x) - u)| \,dx \,du + \int_{S_1} \int_{\mathbb{R}^d} |\gamma(x - u)| \,dx \,du
\]
\[
\le 2 \int_{S_1} \int_{\mathbb{R}^d} |\gamma(y)| \,dy \,du + \int_{S_1} \int_{\mathbb{R}^d} |\gamma(y)| \,dy \,du = 3 \mu_L(S_1) \|\gamma\|_1 < \infty, \tag{100}
\]
where the first term in (100) follows by the change of variables $y = x - \tau(x) - u$, together with
\[
\frac{dy}{dx} = |\det(E - (D\tau)(x))| \ge 1 - d\|D\tau\|_\infty \ge 1/2, \tag{101}
\]
for all $x \in \mathbb{R}^d$. The arguments underlying (101) were already detailed at the end of Appendix H. It follows that $k$ is locally integrable owing to
\[
\int_S |k(x, u)| \,d(x, u) \le \int_{S_1 \times S_2} |k(x, u)| \,d(x, u) = \int_{S_1} \int_{S_2} |k(x, u)| \,dx \,du < \infty, \tag{102}
\]
where the first step in (102) follows from $S \subseteq S_1 \times S_2$, the second step is thanks to the Fubini–Tonelli Theorem [84, Theorem 14.2], noting that $|k|: S_1 \times S_2 \to \mathbb{C}$ is Lebesgue-measurable (as established above) and non-negative, and the last step is due to (100). Next, we need to verify conditions (i) and (ii) in (98) and determine the corresponding $\alpha > 0$. In fact, we seek a specific constant $\alpha$ of the form
\[
\alpha = C \big( R \|\tau\|_\infty + \|\omega\|_\infty \big), \quad \text{with } C > 0. \tag{103}
\]
This will be accomplished as follows: For $x, u \in \mathbb{R}^d$, we parametrize the integral kernel in (99) according to
\[
h_{x,u}(t) := e^{2\pi i t \omega(x)} \gamma(x - t\tau(x) - u) - \gamma(x - u).
\]
A Taylor series expansion [83, p. 411] of $h_{x,u}(t)$ w.r.t. the variable $t$ now yields
\[
h_{x,u}(t) = \underbrace{h_{x,u}(0)}_{=0} + \int_0^t h'_{x,u}(\lambda) \,d\lambda = \int_0^t h'_{x,u}(\lambda) \,d\lambda, \tag{104}
\]
for $t \in \mathbb{R}$, where $h'_{x,u}(t) = (\frac{d}{dt} h_{x,u})(t)$. Note that $h_{x,u} \in C^1(\mathbb{R}, \mathbb{C})$ thanks to $\gamma \in \mathcal{S}(\mathbb{R}^d, \mathbb{C})$.
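Before completing the kernel estimates, it may help to see the quantity being bounded; the illustrative experiment below (setup entirely ours, with interpolation standing in for the continuous warp) applies $F_{\tau,\omega}f(x) = e^{2\pi i\omega(x)} f(x - \tau(x))$ to a smooth, roughly band-limited signal and shows the relative error $\|f - F_{\tau,\omega}f\|_2 / \|f\|_2$ growing essentially linearly in $\|\tau\|_\infty$ and $\|\omega\|_\infty$, consistent with (95).

```python
import numpy as np

x = np.linspace(0, 1, 2048, endpoint=False)
f = np.sin(2 * np.pi * 5 * x) * np.exp(-((x - 0.5) / 0.2) ** 2)  # smooth test signal

for eps in (0.002, 0.004, 0.008):
    tau = eps * np.sin(2 * np.pi * x)            # displacement field, ||tau||_inf = eps
    omega = eps * np.cos(2 * np.pi * x)          # modulation field, ||omega||_inf = eps
    Ff = np.exp(2j * np.pi * omega) * np.interp(x - tau, x, f, period=1.0)
    err = np.linalg.norm(f - Ff) / np.linalg.norm(f)
    print(f"eps = {eps}: relative deformation error {err:.4f}")
```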
Setting $t = 1$ in (104), we get
\[
|k(x, u)| = |h_{x,u}(1)| \le \int_0^1 |h'_{x,u}(\lambda)| \,d\lambda, \tag{105}
\]
where
\[
h'_{x,u}(\lambda) = -e^{2\pi i \lambda \omega(x)} \langle \nabla\gamma(x - \lambda\tau(x) - u), \tau(x) \rangle + 2\pi i \omega(x) e^{2\pi i \lambda \omega(x)} \gamma(x - \lambda\tau(x) - u), \tag{106}
\]
for $\lambda \in [0, 1]$. We further have
\[
|h'_{x,u}(\lambda)| \le |\langle \nabla\gamma(x - \lambda\tau(x) - u), \tau(x) \rangle| + |2\pi\omega(x)\gamma(x - \lambda\tau(x) - u)| \le |\tau(x)|\,|\nabla\gamma(x - \lambda\tau(x) - u)| + 2\pi |\omega(x)|\,|\gamma(x - \lambda\tau(x) - u)|. \tag{107}
\]
Now, using $|\tau(x)| \le \sup_{y \in \mathbb{R}^d} |\tau(y)| = \|\tau\|_\infty$ and $|\omega(x)| \le \sup_{y \in \mathbb{R}^d} |\omega(y)| = \|\omega\|_\infty$ in (107), together with (105), we get the upper bound
\[
|k(x, u)| \le \|\tau\|_\infty \int_0^1 |\nabla\gamma(x - \lambda\tau(x) - u)| \,d\lambda + 2\pi \|\omega\|_\infty \int_0^1 |\gamma(x - \lambda\tau(x) - u)| \,d\lambda. \tag{108}
\]
Next, we integrate (108) w.r.t. $u$ to establish (i) in (98):
\[
\int_{\mathbb{R}^d} |k(x, u)| \,du \le \|\tau\|_\infty \int_0^1 \int_{\mathbb{R}^d} |\nabla\gamma(x - \lambda\tau(x) - u)| \,du \,d\lambda + 2\pi \|\omega\|_\infty \int_0^1 \int_{\mathbb{R}^d} |\gamma(x - \lambda\tau(x) - u)| \,du \,d\lambda \tag{109}
\]
\[
= \|\tau\|_\infty \int_0^1 \int_{\mathbb{R}^d} |\nabla\gamma(y)| \,dy \,d\lambda + 2\pi \|\omega\|_\infty \int_0^1 \int_{\mathbb{R}^d} |\gamma(y)| \,dy \,d\lambda = \|\tau\|_\infty \|\nabla\gamma\|_1 + 2\pi \|\omega\|_\infty \|\gamma\|_1, \tag{110}
\]
where (109) follows by application of the Fubini–Tonelli Theorem [84, Theorem 14.2], noting that the functions $(u, \lambda) \mapsto |\nabla\gamma(x - \lambda\tau(x) - u)|$ and $(u, \lambda) \mapsto |\gamma(x - \lambda\tau(x) - u)|$, $(u, \lambda) \in \mathbb{R}^d \times [0,1]$, are both non-negative and continuous (and thus Lebesgue-measurable) as compositions of continuous functions. Finally, using $\gamma = R^d \eta(R\,\cdot)$, and thus $\nabla\gamma = R^{d+1}(\nabla\eta)(R\,\cdot)$, $\|\gamma\|_1 = \|\eta\|_1$, and $\|\nabla\gamma\|_1 = R\|\nabla\eta\|_1$, in (110) yields
\[
\sup_{x \in \mathbb{R}^d} \int_{\mathbb{R}^d} |k(x, u)| \,du \le R\|\tau\|_\infty \|\nabla\eta\|_1 + 2\pi \|\omega\|_\infty \|\eta\|_1 \le \max\{\|\nabla\eta\|_1, 2\pi\|\eta\|_1\} \big( R\|\tau\|_\infty + \|\omega\|_\infty \big), \tag{111}
\]
which establishes an upper bound of the form (i) in (98) that exhibits the desired structure for $\alpha$. Condition (ii) in (98) is established similarly by integrating (108) w.r.t. $x$ according to
\[
\int_{\mathbb{R}^d} |k(x, u)| \,dx \le \|\tau\|_\infty \int_0^1 \int_{\mathbb{R}^d} |\nabla\gamma(x - \lambda\tau(x) - u)| \,dx \,d\lambda + 2\pi \|\omega\|_\infty \int_0^1 \int_{\mathbb{R}^d} |\gamma(x - \lambda\tau(x) - u)| \,dx \,d\lambda \tag{112}
\]
\[
\le 2\|\tau\|_\infty \int_0^1 \int_{\mathbb{R}^d} |\nabla\gamma(y)| \,dy \,d\lambda + 4\pi \|\omega\|_\infty \int_0^1 \int_{\mathbb{R}^d} |\gamma(y)| \,dy \,d\lambda \tag{113}
\]
\[
= 2\|\tau\|_\infty \|\nabla\gamma\|_1 + 4\pi \|\omega\|_\infty \|\gamma\|_1 \le \max\{2\|\nabla\eta\|_1, 4\pi\|\eta\|_1\} \big( R\|\tau\|_\infty + \|\omega\|_\infty \big). \tag{114}
\]
Here, again, (112) follows by application of the Fubini–Tonelli Theorem [84, Theorem 14.2], noting that the functions $(x, \lambda) \mapsto |\nabla\gamma(x - \lambda\tau(x) - u)|$ and $(x, \lambda) \mapsto |\gamma(x - \lambda\tau(x) - u)|$, $(x, \lambda) \in \mathbb{R}^d \times [0,1]$, are both non-negative and continuous (and thus Lebesgue-measurable) as compositions of continuous functions. The inequality (113) follows from a change-of-variables argument similar to the one in (100) and (101). Combining (111) and (114), we finally get (103) with
\[
C := \max\{2\|\nabla\eta\|_1, 4\pi\|\eta\|_1\}. \tag{115}
\]
This completes the proof.

ACKNOWLEDGMENTS

The authors would like to thank P. Grohs, S. Mallat, R. Alaifari, M. Tschannen, and G. Kutyniok for helpful discussions and comments on the paper.
REFERENCES

[1] T. Wiatowski and H. Bölcskei, "Deep convolutional neural networks based on semi-discrete frames," in Proc. of IEEE International Symposium on Information Theory (ISIT), pp. 1212–1216, 2015.
[2] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, 2013.
[3] C. M. Bishop, Pattern recognition and machine learning. Springer, 2009.
[4] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern classification. John Wiley, 2nd ed., 2001.
[5] Y. LeCun and C. Cortes, "The MNIST database of handwritten digits." http://yann.lecun.com/exdb/mnist/, 1998.
[6] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[7] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, "Handwritten digit recognition with a back-propagation network," in Proc. of International Conference on Neural Information Processing Systems (NIPS), pp. 396–404, 1990.
[8] D. E. Rumelhart, G. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel distributed processing: Explorations in the microstructure of cognition (J. L. McClelland and D. E. Rumelhart, eds.), pp. 318–362, MIT Press, 1986.
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," in Proc. of the IEEE, pp. 2278–2324, 1998.
[10] Y. LeCun, K. Kavukcuoglu, and C. Farabet, "Convolutional networks and applications in vision," in Proc. of IEEE International Symposium on Circuits and Systems (ISCAS), pp. 253–256, 2010.
[11] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, 2015.
[12] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier, "Parseval networks: Improving robustness to adversarial examples," in Proc. of International Conference on Machine Learning (ICML), pp. 854–863, 2017.
[13] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2015.
[14] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT Press, 2016. http://www.deeplearningbook.org.
[15] F. J. Huang and Y. LeCun, "Large-scale learning with SVM and convolutional nets for generic object categorization," in Proc. of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 284–291, 2006.
[16] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "What is the best multi-stage architecture for object recognition?," in Proc. of IEEE International Conference on Computer Vision (ICCV), pp. 2146–2153, 2009.
[17] M. A. Ranzato, F. J. Huang, Y. Boureau, and Y. LeCun, "Unsupervised learning of invariant feature hierarchies with applications to object recognition," in Proc. of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8, 2007.
[18] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun, "Efficient learning of sparse representations with an energy-based model," in Proc. of International Conference on Neural Information Processing Systems (NIPS), pp. 1137–1144, 2006.
[19] N. Pinto, D. Cox, and J. DiCarlo, "Why is real-world visual object recognition hard," PLoS Computational Biology, vol. 4, no. 1, pp. 151–156, 2008.
[20] T. Serre, L. Wolf, and T. Poggio, "Object recognition with features inspired by visual cortex," in Proc. of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 994–1000, 2005.
[21] J. Mutch and D. Lowe, "Multiclass object recognition with sparse, localized features," in Proc. of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11–18, 2006.
[22] S. Mallat, "Group invariant scattering," Comm. Pure Appl. Math., vol. 65, no. 10, pp. 1331–1398, 2012.
[23] J. Bruna and S. Mallat, "Invariant scattering convolution networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1872–1886, 2013.
[24] L. Sifre, Rigid-motion scattering for texture classification. PhD thesis, Centre de Mathématiques Appliquées, École Polytechnique Paris-Saclay, 2014.
[25] J. Andén and S. Mallat, "Deep scattering spectrum," IEEE Trans. Signal Process., vol. 62, no. 16, pp. 4114–4128, 2014.
[26] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proc. of International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 315–323, 2011.
[27] V. Nair and G. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. of International Conference on Machine Learning (ICML), pp. 807–814, 2010.
[28] A. Mohamed, G. E. Dahl, and G. Hinton, "Acoustic modeling using deep belief networks," IEEE Trans. Audio, Speech, and Language Process., vol. 20, pp. 14–22, Jan. 2011.
[29] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. of International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 249–256, 2010.
[30] S. Mallat, A wavelet tour of signal processing: The sparse way. Academic Press, 3rd ed., 2009.
[31] S. Mallat and S. Zhong, "Characterization of signals from multiscale edges," IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 7, pp. 710–732, 1992.
[32] M. Unser, "Texture classification and segmentation using wavelet frames," IEEE Trans. Image Process., vol. 4, no. 11, pp. 1549–1560, 1995.
[33] P. Vandergheynst, "Directional dyadic wavelet transforms: Design and algorithms," IEEE Trans. Image Process., vol. 11, no. 4, pp. 363–372, 2002.
[34] E. J. Candès and D. L. Donoho, "Continuous curvelet transform: II. Discretization and frames," Appl. Comput. Harmon. Anal., vol. 19, no. 2, pp. 198–222, 2005.
[35] P. Grohs, S. Keiper, G. Kutyniok, and M. Schäfer, "Cartoon approximation with α-curvelets," J. Fourier Anal. Appl., pp. 1–59, 2015.
[36] G. Kutyniok and D. Labate, eds., Shearlets: Multiscale analysis for multivariate data. Birkhäuser, 2012.
[37] P. Grohs, "Ridgelet-type frame decompositions for Sobolev spaces related to linear transport," J. Fourier Anal. Appl., vol. 18, no. 2, pp. 309–325, 2012.
[38] E. J. Candès and D. L. Donoho, "New tight frames of curvelets and optimal representations of objects with piecewise C² singularities," Comm. Pure Appl. Math., vol. 57, no. 2, pp. 219–266, 2004.
[39] K. Guo, G. Kutyniok, and D. Labate, "Sparse multidimensional representations using anisotropic dilation and shear operators," in Wavelets and Splines (G. Chen and M. J. Lai, eds.), pp. 189–201, Nashboro Press, 2006.
[40] P. Grohs, T. Wiatowski, and H. Bölcskei, "Deep convolutional neural networks on cartoon functions," in Proc. of IEEE International Symposium on Information Theory (ISIT), pp. 1163–1167, 2016.
[41] S. T. Ali, J. P. Antoine, and J. P. Gazeau, "Continuous frames in Hilbert spaces," Annals of Physics, vol. 222, no. 1, pp. 1–37, 1993.
[42] G. Kaiser, A friendly guide to wavelets. Birkhäuser, 1994.
[43] W. Rudin, Functional analysis. McGraw-Hill, 2nd ed., 1991.
[44] J. P. Antoine, R. Murenzi, P. Vandergheynst, and S. T. Ali, Two-dimensional wavelets and their relatives. Cambridge University Press, 2008.
[45] T. Lee, "Image representation using 2D Gabor wavelets," IEEE Trans. Pattern Anal. Mach. Intell., vol. 18, no. 10, pp. 959–971, 1996.
[46] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, 1986.
[47] E. Oyallon and S. Mallat, "Deep roto-translation scattering for object classification," in Proc. of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2865–2873, 2015.
[48] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, and Signal Process., vol. 28, no. 4, pp. 357–366, 1980.
[49] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[50] E. Tola, V. Lepetit, and P. Fua, "DAISY: An efficient dense descriptor applied to wide-baseline stereo," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 5, pp. 815–830, 2010.
[51] S. Chen, C. Cowan, and P. M. Grant, "Orthogonal least squares learning algorithm for radial basis function networks," IEEE Trans. Neural Netw., vol. 2, no. 2, pp. 302–309, 1991.
[52] G. Kutyniok and D. Labate, "Introduction to shearlets," in Shearlets: Multiscale analysis for multivariate data [36], pp. 1–38.
[53] I. Daubechies, Ten lectures on wavelets. Society for Industrial and Applied Mathematics, 1992.
[54] D. Ellis, Z. Zeng, and J. McDermott, "Classifying soundtracks with audio texture features," in Proc. of IEEE International Conference on Acoust., Speech, and Signal Process. (ICASSP), pp. 5880–5883, 2011.
[55] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. Speech Audio Process., vol. 10, no. 5, pp. 293–302, 2002.
[56] J. Lin and L. Qu, "Feature extraction based on Morlet wavelet and its application for mechanical fault diagnosis," J. Sound Vib., vol. 234, no. 1, pp. 135–148, 2000.
[57] G. Y. Chen, T. D. Bui, and A. Krzyżak, "Rotation invariant pattern recognition using ridgelets, wavelet cycle-spinning and Fourier features," Pattern Recognition, vol. 38, no. 12, pp. 2314–2322, 2005.
[58] Y. L. Qiao, C. Y. Song, and C. H. Zhao, "M-band ridgelet transform based texture classification," Pattern Recognition Letters, vol. 31, no. 3, pp. 244–249, 2010.
[59] S. Arivazhagan, L. Ganesan, and T. S. Kumar, "Texture classification using ridgelet transform," Pattern Recognition Letters, vol. 27, no. 16, pp. 1875–1883, 2006.
[60] J. Ma and G. Plonka, "The curvelet transform," IEEE Signal Process. Mag., vol. 27, no. 2, pp. 118–133, 2010.
[61] L. Dettori and L. Semler, "A comparison of wavelet, ridgelet, and curvelet-based texture classification algorithms in computed tomography," Computers in Biology and Medicine, vol. 37, no. 4, pp. 486–498, 2007.
[62] P. P. Vaidyanathan, Multirate systems and filter banks. Prentice Hall, 1993.
[63] L. Grafakos, Classical Fourier analysis. Springer, 2nd ed., 2008.
[64] T. Wiatowski, M. Tschannen, A. Stanić, P. Grohs, and H. Bölcskei, "Discrete deep feature extraction: A theory and new architectures," in Proc. of International Conference on Machine Learning (ICML), pp. 2149–2158, 2016.
[65] D. L. Donoho, "Sparse components of images and optimal atomic decompositions," Constructive Approximation, vol. 17, no. 3, pp. 353–382, 2001.
[66] T. Wiatowski, P. Grohs, and H. Bölcskei, "Energy propagation in deep convolutional neural networks," IEEE Transactions on Information Theory, to appear.
[67] A. J. E. M. Janssen, "The duality condition for Weyl-Heisenberg frames," in Gabor analysis: Theory and applications (H. G. Feichtinger and T. Strohmer, eds.), pp. 33–84, Birkhäuser, 1998.
[68] A. Ron and Z. Shen, "Frames and stable bases for shift-invariant subspaces of L²(ℝᵈ)," Canad. J. Math., vol. 47, no. 5, pp. 1051–1094, 1995.
[69] M. Frazier, B. Jawerth, and G. Weiss, Littlewood-Paley theory and the study of function spaces. American Mathematical Society, 1991.
[70] A. W. Naylor and G. R. Sell, Linear operator theory in engineering and science. Springer, 1982.
[71] H. Bölcskei, F. Hlawatsch, and H. G. Feichtinger, "Frame-theoretic analysis of oversampled filter banks," IEEE Trans. Signal Process., vol. 46, no. 12, pp. 3256–3268, 1998.
[72] A. J. E. M. Janssen, "Duality and biorthogonality for Weyl-Heisenberg frames," J. Fourier Anal. Appl., vol. 1, no. 4, pp. 403–436, 1995.
[73] I. Daubechies, H. J. Landau, and Z. Landau, "Gabor time-frequency lattices and the Wexler-Raz identity," J. Fourier Anal. Appl., vol. 1, no. 4, pp. 438–478, 1995.
[74] K. Gröchenig, Foundations of time-frequency analysis. Birkhäuser, 2001.
[75] I. Daubechies, A. Grossmann, and Y. Meyer, "Painless nonorthogonal expansions," J. Math. Phys., vol. 27, no. 5, pp. 1271–1283, 1986.
[76] K. Gröchenig and S. Samarah, "Nonlinear approximation with local Fourier bases," Constr. Approx., vol. 16, no. 3, pp. 317–331, 2000.
[77] C. Lee, J. Shih, K. Yu, and H. Lin, "Automatic music genre classification based on modulation spectral analysis of spectral and cepstral features," IEEE Trans. Multimedia, vol. 11, no. 4, pp. 670–682, 2009.
[78] G. Kutyniok and D. L. Donoho, "Microlocal analysis of the geometric separation problem," Comm. Pure Appl. Math., vol. 66, no. 1, pp. 1–47, 2013.
[79] E. J. Candès, Ridgelets: Theory and applications. PhD thesis, Stanford University, 1998.
[80] E. J. Candès and D. L. Donoho, "Ridgelets: A key to higher-dimensional intermittency?," Philos. Trans. R. Soc. London Ser. A, vol. 357, no. 1760, pp. 2495–2509, 1999.
[81] M. Ó Searcóid, Metric spaces. Springer, 2007.
[82] R. P. Brent, J. H. Osborn, and W. D. Smith, "Note on best possible bounds for determinants of matrices close to the identity matrix," Linear Algebra and its Applications, vol. 466, pp. 21–26, 2015.
[83] W. Rudin, Real and complex analysis. McGraw-Hill, 2nd ed., 1983.
[84] E. DiBenedetto, Real analysis. Birkhäuser, 2002.

Thomas Wiatowski was born in Strzelce Opolskie, Poland, on December 20, 1987, and received the BSc and MSc degrees, both in Mathematics, from the Technical University of Munich, Germany, in 2010 and 2012, respectively. In 2012 he was a researcher with the Institute of Computational Biology at the Helmholtz Zentrum in Munich, Germany. He joined ETH Zurich in 2013, where he graduated with the Dr. sc. degree in 2017. His research interests are in deep machine learning, mathematical signal processing, and applied harmonic analysis.
Helmut Bölcskei was born in Mödling, Austria, on May 29, 1970, and received the Dipl.-Ing. and Dr. techn. degrees in electrical engineering from Vienna University of Technology, Vienna, Austria, in 1994 and 1997, respectively. In 1998 he was with Vienna University of Technology. From 1999 to 2001 he was a postdoctoral researcher in the Information Systems Laboratory, Department of Electrical Engineering, and in the Department of Statistics, Stanford University, Stanford, CA. He was in the founding team of Iospan Wireless Inc., a Silicon Valley-based startup company (acquired by Intel Corporation in 2002) specialized in multiple-input multiple-output (MIMO) wireless systems for high-speed Internet access, and was a co-founder of Celestrius AG, Zurich, Switzerland. From 2001 to 2002 he was an Assistant Professor of Electrical Engineering at the University of Illinois at Urbana-Champaign. He has been with ETH Zurich since 2002, where he is a Professor of Electrical Engineering. He was a visiting researcher at Philips Research Laboratories Eindhoven, The Netherlands, ENST Paris, France, and the Heinrich Hertz Institute Berlin, Germany. His research interests are in information theory, mathematical signal processing, machine learning, and statistics. He received the 2001 IEEE Signal Processing Society Young Author Best Paper Award, the 2006 IEEE Communications Society Leonard G. Abraham Best Paper Award, the 2010 Vodafone Innovations Award, and the ETH "Golden Owl" Teaching Award, is a Fellow of the IEEE and a 2011 EURASIP Fellow, was a Distinguished Lecturer (2013-2014) of the IEEE Information Theory Society and an Erwin Schrödinger Fellow (1999-2001) of the Austrian National Science Foundation (FWF), was included in the 2014 Thomson Reuters List of Highly Cited Researchers in Computer Science, and is the 2016 Padovani Lecturer of the IEEE Information Theory Society. He served as an associate editor of the IEEE Transactions on Information Theory, the IEEE Transactions on Signal Processing, the IEEE Transactions on Wireless Communications, and the EURASIP Journal on Applied Signal Processing. He was editor-in-chief of the IEEE Transactions on Information Theory during the period 2010-2013. He served on the editorial board of the IEEE Signal Processing Magazine and is currently on the editorial boards of "Foundations and Trends in Networking" and "Foundations and Trends in Communications and Information Theory". He was TPC co-chair of the 2008 IEEE International Symposium on Information Theory and the 2016 IEEE Information Theory Workshop and serves on the Board of Governors of the IEEE Information Theory Society. He has been a delegate of the president of ETH Zurich for faculty appointments since 2008.