Connectivity-Optimized Representation Learning via Persistent Homology

Christoph D. Hofer (1), Roland Kwitt (1), Mandar Dixit (2), Marc Niethammer (3)

(1) Department of Computer Science, University of Salzburg, Austria; (2) Microsoft; (3) UNC Chapel Hill. Correspondence to: Christoph D. Hofer <chr.dav.hofer@gmail.com>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

We study the problem of learning representations with controllable connectivity properties. This is beneficial in situations when the imposed structure can be leveraged upstream. In particular, we control the connectivity of an autoencoder's latent space via a novel type of loss, operating on information from persistent homology. Under mild conditions, this loss is differentiable and we present a theoretical analysis of the properties induced by the loss. We choose one-class learning as our upstream task and demonstrate that the imposed structure enables informed parameter selection for modeling the in-class distribution via kernel density estimators. Evaluated on computer vision data, these one-class models exhibit competitive performance and, in a low sample size regime, outperform other methods by a large margin. Notably, our results indicate that a single autoencoder, trained on auxiliary (unlabeled) data, yields a mapping into latent space that can be reused across datasets for one-class learning.

1. Introduction

Much of the success of neural networks in (supervised) learning problems, e.g., image recognition (Krizhevsky et al., 2012; He et al., 2016; Huang et al., 2017), object detection (Ren et al., 2015; Liu et al., 2016; Dai et al., 2016), or natural language processing (Graves, 2013; Sutskever et al., 2014), can be attributed to their ability to learn task-specific representations, guided by a suitable loss.

In an unsupervised setting, the notion of a good/useful representation is less obvious. Reconstructing inputs from a (compressed) representation is one important criterion, highlighting the relevance of autoencoders (Rumelhart et al., 1986). Other criteria include robustness, sparsity, or informativeness for tasks such as clustering or classification.

To meet these criteria, the reconstruction objective is typically supplemented by additional regularizers or cost functions that directly (or indirectly) impose structure on the latent space. For instance, sparse (Makhzani & Frey, 2014), denoising (Vincent et al., 2010), or contractive (Rifai et al., 2011) autoencoders aim at robustness of the learned representations, either through a penalty on the encoder parametrization, or through training with stochastically perturbed data. Additional cost functions guiding the mapping into latent space are used in the context of clustering, where several works (Xie et al., 2016; Yang et al., 2017; Zong et al., 2018) have shown that it is beneficial to jointly train for reconstruction and a clustering objective. This is a prominent example of representation learning guided towards an upstream task. Other incarnations of imposing structure can be found in generative modeling, e.g., using variational autoencoders (Kingma & Welling, 2014).
Although, in this case, autoencoders arise as a model for approximate variational inference in a latent variable model, the additional optimization objective effectively controls distributional aspects of the latent representations via the Kullback-Leibler divergence. Adversarial autoencoders (Makhzani et al., 2016; Tolstikhin et al., 2018) equally control the distribution of the latent representations, but through adversarial training. Overall, the success of these efforts clearly shows that imposing structure on the latent space can be beneficial.

In this work, we focus on one-class learning as the upstream task. This is a challenging problem, as one needs to uncover the underlying structure of a single class using only samples of that class. Autoencoders are a popular backbone model for many approaches in this area (Zhou & Paffenroth, 2017; Zong et al., 2018; Sabokrou et al., 2018). By controlling topological characteristics of the latent representations, connectivity in particular, we argue that kernel-density estimators can be used as effective one-class models. While earlier works (Pokorny et al., 2012a;b) show that informed guidelines for bandwidth selection can be derived from studying the topology of a space, our focus is not on passively analyzing topological properties, but rather on actively controlling them. Besides work by Chen et al. (2019) on topologically-guided regularization of decision boundaries (in a supervised setting), we are not aware of any other work along the direction of backpropagating a learning signal derived from topological analyses.

Contributions of this paper.

1. A novel loss, termed connectivity loss (Sec. 3), that operates on persistence barcodes, obtained by computing persistent homology of mini-batches. Our specific incarnation of this loss enforces a homogeneous arrangement of the representations learned by an autoencoder.

2. Differentiability, under mild conditions, of the connectivity loss (Sec. 3.1), enabling backpropagation of the loss signal through the persistent homology computation.

3. A theoretical analysis (Sec. 4) of the implications of controlling connectivity via the proposed loss. This reveals sample-size dependent densification effects that are beneficial upstream, e.g., for kernel-density estimation.

4. One-class learning experiments (Sec. 5) on large-scale vision data, showing that kernel-density based one-class models can be built on top of representations learned by a single autoencoder. These representations are transferable across datasets and, in a low sample size regime, our one-class models outperform recent state-of-the-art methods by a large margin.

2. Background

We begin by discussing the machinery to extract connectivity information of latent representations. All proofs for the presented results can be found in the appendix.

Let us first revisit a standard autoencoding architecture. Given a data space X, we denote by {x_i}, x_i in X, a set of training samples. Further, let f: X -> Z (a subset of R^n) and g: Z -> X be two (non-)linear functions, referred to as the encoder and the decoder. Typically, f and g are parametrized by neural networks with parameters theta and phi. Upon composition, i.e., g_phi composed with f_theta, we obtain an autoencoder.
Optimization then aims to find

    (theta*, phi*) = argmin_{(theta, phi)} sum_i l(x_i, g_phi(f_theta(x_i))) ,    (1)

where l: X x X -> R denotes a suitable reconstruction loss. If n is much smaller than the dimensionality of X, autoencoder training can be thought of as learning a (non-linear) low-dimensional embedding of x, i.e., z = f_theta(x), referred to as its latent representation.

Our goal is to control connectivity properties of Z, observed via samples. As studying connectivity requires analyzing multiple samples jointly, we focus on controlling the connectivity of samples in mini-batches of fixed size.

Notation. We use the following notational conventions. We let [N] denote the set {1, ..., N} and P([N]) its power set. Further, let B(z, r) = {z' in R^n : ||z - z'|| <= r} denote the closed ball of radius r around z. By S, we denote a random batch of size b of latent representations z_i = f_theta(x_i).

Figure 1. Vietoris-Rips complex built from S = {z_1, z_2, z_3} with only zero- and one-dimensional simplices, i.e., vertices and edges: V_0(S) = {{1}, {2}, {3}}; V_{eps_1/2}(S) = V_0(S) u {{1, 2}} with eps_1 = delta(z_1, z_2); V_{eps_2/2}(S) = V_{eps_1/2}(S) u {{2, 3}} with eps_2 = delta(z_2, z_3); V_{eps_3/2}(S) = V_{eps_2/2}(S) u {{1, 3}} with eps_3 = delta(z_1, z_3).

2.1. Filtration/Persistent homology

To study point clouds of latent representations, z_i, from a topological perspective, consider the union of closed balls (with radius r) around the z_i w.r.t. some metric delta on R^n, i.e.,

    S_r = union_{i=1}^{b} B(z_i, r)  with  r >= 0 .    (2)

S_r induces a topological (sub-)space of the metric space (R^n, delta). The number of connected components of S_r is a topological property. A widely-used approach to access this information, grounded in algebraic topology, is to assign a growing sequence of simplicial complexes (induced by the parameter r). This is referred to as a filtration and we can study how the homology groups of these complexes evolve as r increases. Specifically, we study the rank of the 0-dimensional homology groups (capturing the number of connected components) as r varies. This extension of homology to include the notion of scale is called persistent homology (Edelsbrunner & Harer, 2010).

For unions of balls, the prevalent way to build a filtration is via a Vietoris-Rips complex, see Fig. 1. We define the Vietoris-Rips complex in a way beneficial to address differentiability and, as we only study connected components, we restrict our definition to simplices, sigma, of dimension <= 1.

Definition 1 (Vietoris-Rips complex). Let (R^n, delta) be a metric space. For S, a subset of R^n with |S| = b, let V(S) = {sigma in P([b]) : 1 <= |sigma| <= 2} and define f_S: V(S) -> R,

    f_S(sigma) = 0                          if sigma = {i} ,
    f_S(sigma) = (1/2) delta(z_i, z_j)      if sigma = {i, j} .

The Vietoris-Rips complex w.r.t. r >= 0, restricted to its 1-skeleton, is defined as V_r(S) = f_S^{-1}((-inf, r]).

Given that (eps_k)_{k=1}^{M} denotes the increasing sequence of pairwise distance values of S w.r.t. delta (formally, eps_k in {delta(z, z') : z, z' in S, z != z'} with eps_k < eps_{k+1}), then

    {} subset V_0(S) subset V_{eps_1/2}(S) subset ... subset V_{eps_M/2}(S)    (3)

is a filtration (for convenience we set eps_0 = 0). Hence, we can use 0-dimensional persistent homology to observe the impact of r = eps/2 on the connectivity of S_r, see Eq. (2).

2.2. Persistence barcode
Given a filtration, as in Eq. (3), 0-dimensional persistent homology produces a multi-set of pairings (i, j), i < j, where each tuple (i, j) indicates a connected component that persists from S_{eps_i/2} to S_{eps_j/2}.

All b points emerge in S_0, therefore all possible connected components appear, see Fig. 1 (top-left). If there are two points z_i, z_j contained in different connected components and delta(z_i, z_j) = eps_t, those components merge when transitioning from S_{eps_{t-1}/2} to S_{eps_t/2}. In the filtration, this is equivalent to V_{eps_{t-1}/2}(S) u {{i, j}} being a subset of V_{eps_t/2}(S). Hence, this specific type of connectivity information is captured by merging events of this form. The 0-dimensional persistence barcode, B(S), represents the collection of those merging events by a multi-set of tuples. In our case, tuples are of the form (0, eps_t/2), 1 <= t <= M, as each tuple represents a connected component that persists from S_0 to S_{eps_t/2}.

Definition 2 (Death times). Let S, a subset of R^n, be a finite set, (eps_k)_{k=1}^{M} be the increasing sequence of pairwise distance values of S and B(S) the 0-dimensional barcode of the Vietoris-Rips filtration of S. We then define

    dagger(S) = {t : (0, eps_t/2) in B(S)}

as the multi-set of death times, where t is contained in dagger(S) with the same multiplicity as (0, eps_t/2) in B(S).

Informally, dagger(S) can be considered a multi-set of filtration indices where merging events occur.

3. Connectivity loss

To control the connectivity of a batch, S, of latent representations, we need (1) a suitable loss and (2) a way to compute the partial derivative of the loss with respect to its input. Our proposed loss operates directly on dagger(S) with |S| = b.

As a thought experiment, assume that all eps_t, t in dagger(S), are equal to eta, meaning that the graph defined by the 1-skeleton V_eta(S) is connected. For (eps_k)_{k=1}^{M}, the connectivity loss

    L_eta(S) = sum_{t in dagger(S)} |eta - eps_t|    (4)

penalizes deviations from such a configuration. Trivially, for all points in S, there would now be at least one neighbor at distance eta (a beneficial property as we will see later). The loss is optimized over mini-batches of data. In Sec. 4, we take into account that, in practice, eta can only be achieved approximately, and study how enforcing the proposed connectivity characteristics affects sets with cardinality larger than b.
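To make Definition 2 and the loss in Eq. (4) concrete, the following is a minimal sketch that computes dagger(S) and L_eta(S) for a small point cloud. It uses the fact that, under unique pairwise distances, the 0-dimensional merging events of the Vietoris-Rips filtration coincide with the edges of a minimum spanning tree of S; the helper names and the use of SciPy are our own illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def death_distances(S, metric="cityblock"):
    """Distances eps_t, t in dagger(S), at which connected components merge.

    Under unique pairwise distances, the 0-dimensional merging events of the
    Vietoris-Rips filtration are exactly the edges of a minimum spanning tree
    of S (each barcode tuple is (0, eps_t / 2)).
    """
    D = squareform(pdist(S, metric=metric))       # pairwise distances (delta)
    mst = minimum_spanning_tree(D)                # sparse matrix of MST edges
    return np.sort(mst.data)                      # the b - 1 values eps_t

def connectivity_loss(S, eta):
    """Eq. (4): sum of |eta - eps_t| over the multi-set of death times."""
    return np.abs(eta - death_distances(S)).sum()

# example: a random batch of b = 50 points in R^2
S = np.random.default_rng(0).normal(size=(50, 2))
print(connectivity_loss(S, eta=2.0))
```

For a batch of b points this yields b - 1 death times, consistent with the b - 1 merging events described above.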
3.1. Differentiability

We fix (R^n, delta) = (R^n, ||.||), where ||.|| denotes a p-norm, and restate that eps_t reflects a distance at which a merging event occurs, transitioning from S_{eps_{t-1}/2} to S_{eps_t/2}.

In this section, we show that L_eta is differentiable with respect to the points in S. This is required for end-to-end training via backpropagation, as eps_t depends on two latent representations, z_{i_t}, z_{j_t}, which in turn depend on the parametrization theta of f_theta. The following definition allows us to re-formulate L_eta to conveniently address differentiability.

Definition 3. Let S be a subset of R^n with |S| = b and z_i in S. We define the indicator function

    1_{i,j}(z_1, ..., z_b) = 1  if there exists t in dagger(S) with eps_t = ||z_i - z_j|| ,  and 0 else,

where {i, j} is a subset of [b] and (eps_k)_{k=1}^{M} is the increasing sequence of all pairwise distance values of S.

The following theorem states that we can compute L_eta using Definition 3. Theorem 2 subsequently establishes differentiability of L_eta using the derived reformulation.

Theorem 1. Let S be a subset of R^n with |S| = b, such that the pairwise distances are unique. Further, let L_eta be defined as in Eq. (4) and 1_{i,j} as in Definition 3. Then,

    L_eta(S) = sum_{{i,j} subset [b]} |eta - ||z_i - z_j||| * 1_{i,j}(z_1, ..., z_b) .

Theorem 2. Let S be a subset of R^n with |S| = b, such that the pairwise distances are unique. Then, for 1 <= u <= b and 1 <= v <= n, the partial (sub-)derivative of L_eta(S) w.r.t. the v-th coordinate of z_u exists, i.e.,

    d L_eta(S) / d z_{u,v} = sum_{{i,j} subset [b]} ( d |eta - ||z_i - z_j||| / d z_{u,v} ) * 1_{i,j}(z_1, ..., z_b) .

By using an automatic differentiation framework, such as PyTorch (Paszke et al., 2017), we can easily realize L_eta by implementing 1_{i,j} from Definition 3.

Remark 1. Theorems 1 and 2 require unique pairwise distances, computed from S. Dropping this requirement would dramatically increase the complexity of those results, as the derivative may not be uniquely defined. However, under the practical assumption that the distribution of the latent representations is non-atomic, i.e., P(f_theta(x) = z) = 0 for x in X, z in Z, the requirement is fulfilled almost surely.
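As a minimal sketch of how Theorems 1 and 2 translate into an automatic differentiation setting (assuming PyTorch and, again, a minimum-spanning-tree shortcut for the 0-dimensional pairing; this is not the authors' GPU implementation), the merging pairs {i, j} with 1_{i,j} = 1 are located without gradient tracking and the corresponding pairwise distances are then read from a differentiable distance matrix:

```python
import torch
from scipy.sparse.csgraph import minimum_spanning_tree

def connectivity_loss(z, eta, p=1):
    """Differentiable L_eta for a batch z of shape (b, n).

    The pairs {i, j} with 1_{i,j} = 1 are found without gradient tracking;
    |eta - ||z_i - z_j||_p| is summed over exactly those pairs, so the
    partial derivatives match Theorem 2.
    """
    D = torch.cdist(z, z, p=p)                    # differentiable pairwise distances
    # indicator 1_{i,j}: locate merging pairs via a minimum spanning tree
    mst = minimum_spanning_tree(D.detach().cpu().numpy()).tocoo()
    i_idx = torch.as_tensor(mst.row, dtype=torch.long, device=z.device)
    j_idx = torch.as_tensor(mst.col, dtype=torch.long, device=z.device)
    eps = D[i_idx, j_idx]                         # the eps_t, t in dagger(S)
    return torch.abs(eta - eps).sum()

# usage inside a training step (lam and eta are hyperparameters):
#   z = f_theta(x_batch)
#   loss = reconstruction_loss + lam * connectivity_loss(z, eta=2.0)
```

Because the indicator is locally constant (Remark 1 and Lemma 3 in the appendix), detaching the pair selection does not change the gradient of the loss at points with unique pairwise distances.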
Figure 2. 2D toy example of a connectivity-optimized mapping, mlp: R^2 -> R^2 (see Sec. 3.2), learned on 1,500 samples, x_i, from three Gaussians (left). The figure highlights the homogenization effect enforced by the proposed loss, at 20 (middle) / 60 (right) training epochs; the panel annotations list the mean min./avg./max. values of eps_t, i.e., (alpha_hat, eps_hat, beta_hat): (0.08, 0.96, 4.71), (0.23, 1.67, 4.63) and (0.24, 1.65, 4.01), computed over 3,000 batches of size 50.

3.2. Toy example

We demonstrate the effect of L_eta on toy data generated from three Gaussians with random means/covariances, see Fig. 2 (left). We train a three-layer multi-layer perceptron, mlp: R^2 -> R^2, with leaky ReLU activations and hidden layer dimensionality 20. No reconstruction loss is used and L_eta operates on the output, i.e., on fixed-size batches of x_hat_i = mlp(x_i). Although this is different from controlling the latent representations of an autoencoder, the example is sufficient to demonstrate the effect of L_eta. The MLP is trained for 60 epochs with batch size 50 and eta = 2. We then compute the mean min./avg./max. values (denoted as alpha_hat, eps_hat, beta_hat) of eps_t over 3,000 random batches. Fig. 2 (middle & right) shows the result of applying the model after 20 and 60 epochs, respectively.

Two observations are worth pointing out. First, the gap between alpha_hat and beta_hat is fairly large, even at convergence. However, our theoretical analysis in Sec. 4 (Remark 2) shows that this is the expected behavior, due to the interplay between batch size and dimensionality. In this toy example, the range of eps_t would only be small if we were to train with small batch sizes (e.g., 5). In that case, however, gradients become increasingly unstable. Notably, as dimensionality increases, optimizing L_eta is less difficult and effectively leads to a tighter range of eps_t around eta (see Fig. 6). Second, Fig. 2 (right) shows the desired homogenization effect of the point arrangement, with eps_hat close to (but smaller than) eta. The latter can, to some extent, be explained by the previous batch size vs. dimensionality argument. We also conjecture that optimization is more prone to get stuck in local minima where eps_hat is close to, but smaller than, eta. This is observed in higher dimensions as well (cf. Fig. 6), but less prominently.

Notably, by only training with L_eta, we cannot expect to obtain useful representations that capture salient data characteristics, as mlp can distribute points freely while minimizing L_eta. Hence, learning the mapping as part of an autoencoder, optimized for reconstruction and L_eta, is a natural choice. Intuitively, the reconstruction loss controls "what" is worth capturing, while the connectivity loss encourages "how" to topologically organize the latent representations.

4. Theoretical analysis

Assume we have minimized a reconstruction loss jointly with the connectivity loss, using mini-batches, S, of size b. Ideally, we obtain a parametrization of f_theta such that for every b-sized random sample, it holds that eps_t equals eta for t in dagger(S). Due to the two competing optimization objectives, however, we can only expect eps_t to lie in an interval [alpha, beta] around eta. This is captured in the following definition.

Definition 4 (alpha-beta-connected set). Let S, a subset of R^n, be a finite set and let (eps_k)_{k=1}^{M} be the increasing sequence of pairwise distance values of S. We call S alpha-beta-connected iff

    alpha = min_{t in dagger(S)} eps_t  and  beta = max_{t in dagger(S)} eps_t .

If S is alpha-beta-connected, all merging events of connected components occur during the transition from S_{alpha/2} to S_{beta/2}. Importantly, during training, L_eta only controls properties of b-sized subsets explicitly. Thus, at convergence, f_theta(S) with |S| = b is alpha-beta-connected. When building upstream models, it is desirable to understand how the latent representations are affected for samples larger than b.

To address this issue, let B(z, r)^o = {z' in R^n : ||z - z'|| < r} denote the interior of B(z, r) and let B(z, r, s) = B(z, s) \ B(z, r)^o with r < s denote the annulus around z. In the following, we formally investigate the impact of alpha-beta-connectedness on the density around a latent representation. The next lemma captures one particular densification effect that occurs if sets larger than b are mapped via a learned f_theta.

Lemma 1. Let 2 <= b <= m and M, a subset of R^n, with |M| = m such that for each S, a subset of M with |S| = b, it holds that S is alpha-beta-connected. Then, for d = m - b and z in M arbitrary but fixed, we find M_z, a subset of M, with |M_z| = d + 1 and M_z contained in B(z, alpha, beta).

Lemma 1 yields a lower bound, d + 1, on the number of points in the annulus around z in M. However, it does not provide any further insight whether there may or may not exist more points of this kind. Nevertheless, the density around z in M increases with |M| = m, for b fixed.

Definition 5 (d-eps-dense set). Let S be a subset of R^n and eps > 0. We call S eps-dense iff for all z in S there exists z' in S \ {z} with ||z - z'|| <= eps. For d in N, we call S d-eps-dense iff for all z in S there exists M, a subset of S \ {z}, with |M| = d such that z' in M implies ||z - z'|| <= eps.

The following corollary of Lemma 1 provides insights into the density behavior of samples around points z in M.

Corollary 1. Let 2 <= b <= m and M, a subset of R^n, with |M| = m such that for each S, a subset of M with |S| = b, it holds that S is alpha-beta-connected. Then M is (m - b + 1)-beta-dense.

Informally, this result can be interpreted as follows: Assume we have optimized for a specific eta. At convergence, we can collect eps_t for t in dagger(S) over batches (of size b) in the last training epoch to estimate alpha and beta according to Definition 4.
Corollary 1 now quantifies how many neighbors, i.e., m - b + 1, within distance beta can be found around each z in M. We exploit this insight in our experiments to construct kernel density estimators with an informed choice of the kernel support radius, set to the value eta we optimized for.

We can also study the implications of Lemma 1 on the separation of points in M. Intuitively, as m increases, we expect the separation of points in M to decrease, as densification occurs. We formalize this by drawing a connection to the concept of metric entropy, see (Tao, 2014).

Definition 6 (eps-metric entropy). Let S be a subset of R^n, eps > 0. We call S eps-separated iff for all z, z' in S: z != z' implies ||z - z'|| >= eps. For X, a subset of R^n, the eps-metric entropy of X is defined as

    N_eps(X) = max{|S| : S subset X and S is eps-separated} .

Setting E^{eps,n}_{alpha,beta} = N_eps(B(0, alpha, beta)), i.e., the metric entropy of the annulus in R^n, allows formulating a second corollary of Lemma 1.

Corollary 2. Let 2 <= b <= m and M, a subset of R^n, with |M| = m such that for each S, a subset of M with |S| = b, it holds that S is alpha-beta-connected. Then, for eps > 0 and m - b + 1 > E^{eps,n}_{alpha,beta}, it follows that M is not eps-separated.

Consequently, understanding the behavior of E^{eps,n}_{alpha,beta} is important, specifically in relation to the dimensionality, n, of the latent space. To study this in detail, we have to choose a specific p-norm. We use ||.||_1 from now on, due to its better behavior in high dimensions, see (Aggarwal et al., 2001).

Lemma 2. Let eps < 2*alpha and alpha < beta. Then, in (R^n, ||.||_1), it holds that

    E^{eps,n}_{alpha,beta} <= (2*beta/eps + 1)^n - (2*alpha/eps - 1)^n .

This reveals an exponential dependency on n, in other words, a manifestation of the curse of dimensionality. Furthermore, the bound in Lemma 2 is not sharp, as it is based on a volume argument (see appendix). Yet, in light of Corollary 2, it yields a conservative guideline to assess whether M is large enough to be no longer eps-separated. In particular, let |M| = m and set eps = eta. If

    m - b + 1 > (2*beta/eta + 1)^n - (2*alpha/eta - 1)^n ,    (5)

then M is not eta-separated, by virtue of Lemma 2. In comparison to the densification result of Corollary 1, we obtain no quantification of separatedness for each z in M. We can only guarantee that beyond a certain sample size, m, there exist two points with distance smaller than eps.

Remark 2. We can also derive necessary conditions on the size b = |S|, given alpha, beta, eta and n, such that M satisfies the conditions of Lemma 1. In particular, assume that the conditions are satisfied and set |M| = m = 2b - 1. Hence, we can find M_z with M_z contained in M intersected with B(z, alpha, beta) and |M_z| = d + 1 = m - b + 1 = b = |S| for z in M. As every b-sized subset is alpha-beta-connected, it follows that M_z is alpha-beta-connected, in particular, alpha-separated. This yields the necessary condition b <= E^{alpha,n}_{alpha,beta}. By applying Lemma 2 with eps = alpha, we get b <= (2*beta/alpha + 1)^n - 1, establishing a relation between b, alpha, beta and n. For example, choosing b large in relation to n results in an increased gap between alpha_hat and beta_hat, as seen in Fig. 2 (for b = 50, n = 2 fixed). Increasing n in relation to b tightens this gap, as we will later see in Sec. 5.4.
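To illustrate how Lemma 2 and Eq. (5) can be used as a guideline, the snippet below evaluates the (conservative) bound for assumed values of alpha, beta, eta, n and b; the concrete numbers are placeholders, not estimates reported in the paper.

```python
def entropy_bound(alpha, beta, eps, n):
    """Lemma 2: upper bound on the eps-metric entropy of the annulus B(0, alpha, beta) in (R^n, l1)."""
    return (2 * beta / eps + 1) ** n - (2 * alpha / eps - 1) ** n

def min_size_not_separated(alpha, beta, eta, n, b):
    """Smallest m for which Eq. (5) holds, i.e., m - b + 1 > entropy bound."""
    return int(entropy_bound(alpha, beta, eta, n)) + b

# hypothetical values; alpha, beta would be estimated from death times collected over training batches
alpha, beta, eta, n, b = 1.8, 2.1, 2.0, 5, 100
print(entropy_bound(alpha, beta, eta, n))              # conservative bound on E
print(min_size_not_separated(alpha, beta, eta, n, b))  # sample size beyond which eta-separation fails
```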
5. Experimental study

We focus on one-class learning for visual data, i.e., building classifiers for single classes, using only data from that class.

Problem statement. Let C, a subset of X, be a class from the space of images, X, from which a sample {x_1, ..., x_m} contained in C is available. Given a new sample, y* in X, the goal is to identify whether this sample belongs to C. It is customary to ignore the actual binary classification task and to consider a scoring function s: X -> R instead. Higher scores indicate membership in C. We further assume access to an unlabeled auxiliary dataset. This is reasonable in the context of visual data, as such data is readily available.

Architecture & Training. We use a convolutional autoencoder following the DCGAN encoder/discriminator architecture of (Radford et al., 2016). The encoder has three convolution layers (followed by leaky ReLU activations) with 3 x 3 filters, applied with a stride of 2. From layer to layer, the number of filters (initially, 32) is doubled. The output of the last convolution layer is mapped into the latent space Z, a subset of R^n, via a restricted variant of a linear layer (I-Linear). The weight matrix W of this layer is block-diagonal, corresponding to B branches, independently mapping into R^D with D = n/B. Each branch has its own connectivity loss, operating on the D-dimensional representations. This is motivated by the dilemma that we need dimensionality (1) sufficiently high to capture the underlying characteristics of the data and (2) low enough to effectively optimize connectivity (see Sec. 5.3). The decoder mirrors the encoder, using convolutional transpose operators (Zeiler et al., 2010). The full architecture is shown in Fig. 3. For optimization, we use Adam (Kingma & Ba, 2014) with a fixed learning rate of 0.001, (beta_1, beta_2) = (0.9, 0.999) and a batch size of 100. The model is trained for 50 epochs.

Figure 3. Autoencoder architecture with B independent branches mapping into latent space Z = R^D x ... x R^D (a subset of R^n). The connectivity loss L_eta is computed per branch, summed, and added to the reconstruction loss (here ||.||_1). Total loss: (1/b) sum_{i=1}^{b} ||x_i - g_phi(f_theta(x_i))||_1 + lambda sum_{j=1}^{B} L_eta({z^j_1, ..., z^j_b}).

One-class models. As mentioned in Sec. 1, our goal is to build one-class models that leverage the structure imposed on the latent representations. To this end, we use a simple non-parametric approach. Given m training instances, {x_i}_{i=1}^{m}, of a new class C, we first compute z_i = f_theta(x_i) and then split z_i into its D-dimensional parts z^1_i, ..., z^B_i, provided by each branch (see Fig. 3). For a test sample y*, we compute its latent representation z_* = f_theta(y*) and its corresponding parts z^1_*, ..., z^B_*. The one-class score for y* is defined as

    s(y*) = sum_{j=1}^{B} |{z^j_i : ||z^j_* - z^j_i|| <= eta, 1 <= i <= m}| ,    (6)

where eta is the value previously used to learn f_theta; for one test sample this scales with O(Bm). For each branch, Eq. (6) counts how many of the stored training points of class C lie in the ||.||_1-ball of radius eta around z_*. If normalized, this constitutes a non-parametric kernel density estimate with a uniform kernel of radius eta. No optimization, or parameter tuning, is required to build such a model. The scoring function only uses the imposed connectivity structure. Given enough training samples (i.e., m > b), Corollary 2 favors that the set of training points within a ball of radius eta around z_* is non-empty.
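A direct, hedged transcription of Eq. (6) follows; the variable names and tensor layout are our own assumptions.

```python
import torch

def one_class_score(z_star, z_train, num_branches, eta=2.0):
    """Eq. (6): per-branch count of training codes within an l1-ball of radius eta.

    z_star  : (n,)   latent code of the test sample, n = B * D
    z_train : (m, n) latent codes of the m in-class training samples
    """
    score = 0
    for zs_j, zt_j in zip(z_star.chunk(num_branches, dim=0),
                          z_train.chunk(num_branches, dim=1)):
        dists = torch.sum(torch.abs(zt_j - zs_j), dim=1)   # ||z^j_* - z^j_i||_1
        score += int((dists <= eta).sum())
    return score
```

Dividing the per-branch counts by m (and by the volume of the ||.||_1-ball of radius eta) turns this into the kernel density estimate with a uniform kernel mentioned above.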
5.1. Datasets

CIFAR-10/100. CIFAR-10 (Krizhevsky & Hinton, 2009) contains 60,000 natural images of size 32 x 32 in 10 classes. 5,000 images/class are available for training, 1,000/class for validation. CIFAR-100 contains the same number of images, but consists of 100 classes (with little class overlap to CIFAR-10). For comparison to other work, we also use the coarse labels of CIFAR-100, where all 100 classes are aggregated into 20 coarse categories (CIFAR-20).

Tiny-ImageNet. This dataset represents a medium-scale image corpus of 200 visual categories with 500 images/class available for training, 50/class for validation and 50/class for testing. For experiments, we use the training and validation portion, as labels for the test set are not available.

ImageNet. For large-scale testing, we use the ILSVRC 2012 dataset (Deng et al., 2009), which consists of 1,000 classes with about 1.2 million images for training (about 1,281/class on average) and 50,000 images (50/class) for validation.

All images are resized to 32 x 32 (ignoring non-uniform aspect ratios) and normalized to the range [0, 1]. We resize to 32 x 32 to ensure that autoencoders trained on, e.g., CIFAR-10/100, can be used for one-class experiments on ImageNet.

5.2. Evaluation protocol

To evaluate one-class learning performance on one dataset, we only train a single autoencoder on the unlabeled auxiliary dataset to obtain f_theta. E.g., our results on Tiny-ImageNet and ImageNet use the same autoencoder trained on CIFAR-100. The experimental protocol follows (Ruff et al., 2018) and (Golan & El-Yaniv, 2018). Performance is measured via the area under the ROC curve (AUC), which is a common choice (Iwata & Yamada, 2016; Golan & El-Yaniv, 2018; Ruff et al., 2018). We use a one-vs-all evaluation scheme. Assume we have N classes and want to evaluate one-class performance on class j. Then, a one-class model is built from m randomly chosen samples of class j. For evaluation, all test samples of class j are assigned a label of 1; all other samples are assigned label 0. The AUC is computed from the scores provided by Eq. (6). This is repeated for all N classes and the AUC, averaged over (1) all classes and (2) five runs (of randomly picking m points), is reported.
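A sketch of this one-vs-all protocol; the exact data handling in the authors' evaluation code may differ, and score_fn is assumed to be a batched variant of the scoring function above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_one_vs_all(encode, score_fn, X_train, y_train, X_test, y_test,
                        num_classes, m=120, seed=0):
    """Average AUC of a one-vs-all evaluation over all classes.

    encode   : maps an array of images to latent codes (i.e., f_theta)
    score_fn : score_fn(z_test, z_class_train) -> one score per test sample (Eq. (6))
    """
    rng = np.random.default_rng(seed)
    z_test = encode(X_test)
    aucs = []
    for c in range(num_classes):
        idx = rng.choice(np.flatnonzero(y_train == c), size=m, replace=False)
        z_class = encode(X_train[idx])            # the one-class "model": stored codes
        scores = score_fn(z_test, z_class)
        labels = (y_test == c).astype(int)        # 1 for class c, 0 otherwise
        aucs.append(roc_auc_score(labels, scores))
    return float(np.mean(aucs))
```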
5.3. Parameter analysis

We fix the dataset to CIFAR-100 and focus on the aspects of latent space dimensionality, the weighting of L_eta and the transferability of the connectivity characteristics (we fix eta = 2 throughout our experiments).

First, it is important to understand the interplay between the latent dimensionality and the constraint imposed by L_eta. On the one hand, a low-dimensional space allows fewer possible latent configurations without violating the desired connectivity structure. On the other hand, as dimensionality increases, the concept of proximity degrades quickly for p-norms (Aggarwal et al., 2001), rendering the connectivity optimization problem trivial. Depending on the dataset, one also needs to ensure that the underlying data characteristics are still captured. To balance these objectives, we divide the latent space into sub-spaces (via separated branches). Fig. 4 (left) shows an example where the latent dimensionality is fixed (to 160), but branching configurations differ. As expected, the connectivity loss without branching is small, even at initialization. In comparison, models with separate branches exhibit a high connectivity loss initially, but the loss decreases rapidly throughout training. Notably, the reconstruction error, see Fig. 4 (right), is almost equal (at convergence) across all models.

Figure 4. Connectivity (left) and reconstruction (right) loss over all training iterations on CIFAR-100 w/ and w/o branching (#branches = 1, D = 160; #branches = 16, D = 10; #branches = 32, D = 5).

Thus, with respect to reconstruction, the latent space carries equivalent information with and without branching, but is structurally different. Further evidence is provided when using f_theta for one-class learning on CIFAR-10. Branching leads to an average AUC of 0.78 and 0.75 (for 16/32 branches), while no branching yields an AUC of 0.70. This indicates that controlling connectivity in low-dimensional subspaces leads to a structure beneficial for our one-class models.

Second, we focus on the branching architecture and study the effect of weighting L_eta via lambda. Fig. 5 (left) shows the connectivity loss over all training iterations on CIFAR-100 for four different values of lambda and 16 branches.

Figure 5. (Left) Connectivity loss over training iterations on CIFAR-100 for 16 branches and varying lambda (1.0, 10.0, 20.0, 40.0); (Right) One-class performance (AUC) on CIFAR-10 over the number of training samples, 10 <= m <= 5,000, per class.

During training, the behavior of L_eta is almost equal for lambda >= 10.0. For lambda = 1.0, however, the loss noticeably converges to a higher value. In fact, the reconstruction error dominates in the latter case, leading to a less homogeneous arrangement of latent representations. This detrimental effect is also evident in Fig. 5 (right), which shows the average AUC for one-class learning on CIFAR-10 classes as a function of the number of samples used to build the kernel density estimators.

Finally, we assess whether the properties induced by f_theta, learned on auxiliary data (CIFAR-100), generalize to another dataset (CIFAR-10). To this end, we train an autoencoder with 16 sub-branches and lambda = 20. We then compute the average death times per branch using batches of size 100 on (i) the test split of CIFAR-100 and (ii) over all samples of CIFAR-10. Fig. 6 shows that the distribution of death times is consistent within and across datasets. Also, the increased dimensionality (compared to our 2D toy example) per branch leads to (i) a tight range of death times and (ii) death times closer to eta = 2, consistent with Remark 2.

Figure 6. Average eps_d, d in dagger(S), per branch (1-16), computed from batches, S, of size 100 over CIFAR-10 (all) and CIFAR-100 (test split); f_theta is learned from the training portion of CIFAR-100.
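To make the branching and the weighting by lambda concrete, the following is a minimal sketch of the block-diagonal latent head and the total loss of Fig. 3; layer sizes are placeholders and connectivity_loss refers to the PyTorch sketch given in Sec. 3.1.

```python
import torch
import torch.nn as nn

class BranchedLatentHead(nn.Module):
    """Block-diagonal ('I-Linear') map into Z = R^D x ... x R^D (B branches)."""
    def __init__(self, in_features, num_branches=16, branch_dim=10):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Linear(in_features // num_branches, branch_dim)
             for _ in range(num_branches)])

    def forward(self, h):                                # h: (batch, in_features)
        chunks = h.chunk(len(self.branches), dim=1)
        return [lin(c) for lin, c in zip(self.branches, chunks)]   # B tensors (batch, D)

def total_loss(x, x_rec, branch_codes, eta=2.0, lam=20.0):
    """l1 reconstruction plus lambda times the sum of per-branch connectivity losses."""
    rec = torch.abs(x - x_rec).flatten(1).sum(dim=1).mean()
    conn = sum(connectivity_loss(z_j, eta) for z_j in branch_codes)
    return rec + lam * conn
```

Splitting the preceding feature vector into disjoint chunks, each handled by its own linear layer, is equivalent to one linear layer with a block-diagonal weight matrix, which is how the branching is described above.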
5.4. One-class learning performance

Various incarnations of one-class problems occur throughout the literature, mostly in an anomaly or novelty detection context; see (Pimentel et al., 2014) for a survey. Outlier detection (Xia et al., 2015; You et al., 2017) and out-of-distribution detection (Hendrycks & Gimpel, 2017; Liang et al., 2018; Lee et al., 2018) are related tasks, but the problem setup is different. The former works under the premise of corrupted data, the latter considers a dataset as one class.

We compare against recent state-of-the-art approaches, including techniques using autoencoders and techniques that do not. In the DSEBM approach of (Zhai et al., 2016), the density of one-class samples is modeled via a deep structured energy model. The energy function then serves as a scoring criterion. DAGMM (Zong et al., 2018) follows a similar objective, but, as in our approach, density estimation is performed in an autoencoder's latent space. Autoencoder and density estimator, i.e., a Gaussian mixture model (GMM), are trained jointly. The negative log-likelihood under the GMM is then used for scoring. Deep-SVDD (Ruff et al., 2018) is conceptually different. Here, the idea of support vector data description (SVDD) from (Tax & Duin, 2004) is extended to neural networks. An encoder (pretrained in an autoencoder setup) is trained to map one-class samples into a hypersphere with minimal radius and fixed center. The distance to this center is used for scoring. Motivated by the observation that softmax scores of trained multi-class classifiers tend to differ between in- and out-of-distribution samples (Hendrycks & Gimpel, 2017), (Golan & El-Yaniv, 2018) recently proposed a technique (ADT) based on self-labeling. In particular, a neural network classifier is trained to distinguish among 72 geometric transformations applied to one-class samples. For scoring, each transform is applied to new samples and the softmax outputs (of the class corresponding to the transform) of this classifier are averaged. Non-linear dimensionality reduction via autoencoders also facilitates using classic approaches to one-class problems, e.g., one-class SVMs (Schölkopf et al., 2001). We compare against such a baseline, OC-SVM (CAE), using the latent representations of a convolutional autoencoder (CAE).

Table 1. AUC scores for one-class learning, averaged over all classes and 5 runs. ADT-m and Ours-m denote that only m training samples/class are used. The dataset in parentheses denotes the auxiliary dataset on which f_theta is trained. All std. deviations for our method are within 10^-3 and 10^-4.

Eval. data      Method                                  AUC
CIFAR-10        OC-SVM (CAE)                            0.62
                DAGMM (Zong et al., 2018)               0.53
                DSEBM (Zhai et al., 2016)               0.61
                Deep-SVDD (Ruff et al., 2018)           0.65
                ADT (Golan & El-Yaniv, 2018)            0.85
                Low sample-size regime:
                ADT-120                                 0.69
                ADT-500                                 0.73
                ADT-1,000                               0.75
                Ours-120 (CIFAR-100)                    0.76
CIFAR-20        OC-SVM (CAE)                            0.63
                DAGMM (Zong et al., 2018)               0.50
                DSEBM (Zhai et al., 2016)               0.59
                Deep-SVDD (Ruff et al., 2018)           0.60
                ADT (Golan & El-Yaniv, 2018)            0.77
                Low sample-size regime:
                ADT-120                                 0.66
                ADT-500                                 0.69
                ADT-1,000                               0.71
                Ours-120 (CIFAR-10)                     0.72
CIFAR-100       ADT-120                                 0.75
                Ours-120 (CIFAR-10)                     0.79
Tiny-ImageNet   Ours-120 (CIFAR-10)                     0.73
                Ours-120 (CIFAR-100)                    0.72
ImageNet        Ours-120 (CIFAR-10)                     0.72
                Ours-120 (CIFAR-100)                    0.72

Implementation. For our approach (code available at https://github.com/c-hofer/COREL_icml2019), we fix the latent dimensionality to 160 (as in Sec. 5.3), use 16 branches and set lambda = 20 (the encoder, f_theta, has about 800k parameters).
We implement a PyTorch-compatible GPU variant of the persistent homology computation, i.e., Vietoris-Rips construction and matrix reduction (see appendix). For all reference methods, except Deep-SVDD, we use the implementation(s) provided by (Golan & El-Yaniv, 2018). OC-SVM (CAE) and DSEBM use a DCGAN-style convolutional encoder with slightly more parameters (about 1.4M) than our variant and 256 latent dimensions. DAGMM relies on the same encoder, a latent dimensionality of five and three GMM components.

Results. Table 1 lists the AUC score (averaged over classes and 5 runs) obtained on each dataset. For our approach, the name in parentheses denotes the auxiliary (unlabeled) dataset used to learn f_theta.

First, ADT exhibits the best performance on CIFAR-10/20. However, if one aims to thoroughly assess one-class performance, testing on CIFAR-10/20 can be misleading, as the variation in the out-of-class samples is limited to 9/19 categories. Hence, it is desirable to evaluate on datasets with higher out-of-class variability, e.g., ImageNet. In this setting, the bottleneck of all other methods is the requirement of optimizing one model/class. In the case of ADT, e.g., one Wide-ResNet (Zagoruyko & Komodakis, 2016) with 1.4M parameters needs to be trained per class. On ImageNet, this amounts to a total of 1,400M parameters (spread over 1,000 models). On one GPU (Nvidia GTX 1080 Ti) this requires about 75 hrs. Our approach requires training f_theta only once, e.g., on CIFAR-100, and f_theta can be reused across datasets.

Second, CIFAR-10/20 contains a large number of training samples/class. As the number of classes increases, training set size per class typically drops, e.g., to about 1,000 on ImageNet. We therefore conduct a second experiment, studying the impact of training set size per class on ADT. Our one-class models are built from a fixed sample size of 120, which is slightly higher than the training batch size (100), thereby implying densification (by our results of Sec. 4). We see that the performance of ADT drops rapidly from 0.85 to 0.69 AUC on CIFAR-10 and from 0.77 to 0.66 on CIFAR-20 when only 120 class samples are used. Even for 1,000 class samples, ADT performs slightly worse than our approach. Overall, in this low sample-size regime, our one-class models seem to clearly benefit from the additional latent space structure.

Third, to the best of our knowledge, we report the first full evaluation of one-class learning on CIFAR-100, Tiny-ImageNet and ImageNet. This is possible as f_theta is reusable across datasets and the one-class models do not require optimization. For CIFAR-100, we also ran ADT with 120 samples to establish a fair comparison. Although this requires training 100 Wide-ResNet models, it is still possible at reasonable effort. Importantly, our method maintains performance when moving from Tiny-ImageNet to full ImageNet, indicating beneficial scaling behavior with respect to the amount of out-of-class variability in a given dataset.

6. Discussion

We presented one possibility for controlling topological/geometric properties of an autoencoder's latent space. The connectivity loss is tailored to enforce beneficial properties for one-class learning. We believe this to be a key task that clearly reveals the usefulness of a representation. Being able to backpropagate through a loss based on persistent homology has broader implications.
For example, other types of topological constraints may be useful for a wide range of tasks, such as clustering. From a theoretical perspective, we show that controlling connectivity allows establishing provable results for latent space densification and separation. Composing multi-class models from one-class models (cf. (Tax & Duin, 2008)), built on top of a topologically-regularized representation, is another promising direction.

Acknowledgements

This research was supported by NSF ECCS-1610762, the Austrian Science Fund (FWF project P 31799) and the Spinal Cord Injury and Tissue Regeneration Center Salzburg (SCI-TReCS), Paracelsus Medical University, Salzburg.

References

Aggarwal, C., Hinneburg, A., and Keim, D. On the surprising behavior of distance metrics in high dimensional space. In ICDT, 2001.

Bauer, U., Kerber, M., and Reininghaus, J. Distributed computation of persistent homology. In ALENEX, 2014a.

Bauer, U., Kerber, M., and Reininghaus, J. Clear and compress: Computing persistent homology in chunks. In Topological Methods in Data Analysis and Visualization III, pp. 103-117. Springer, 2014b.

Chen, C., Ni, X., Bai, Q., and Wang, Y. A topological regularizer for classifiers via persistent homology. In AISTATS, 2019.

Dai, J., Li, Y., He, K., and Sun, J. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.

de Silva, V., Morozov, D., and Vejdemo-Johansson, M. Dualities in persistent (co)homology. Inverse Problems, 27(12):124003, 2011.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Dey, T., Shi, D., and Wang, Y. SimBa: An efficient tool for approximating Rips-filtration persistence via simplicial batch-collapse. In ESA, 2016.

Edelsbrunner, H. and Harer, J. L. Computational Topology: An Introduction. American Mathematical Society, 2010.

Golan, I. and El-Yaniv, R. Deep anomaly detection using geometric transformations. In NIPS, 2018.

Graves, A. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.

Hendrycks, D. and Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR, 2017.

Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Densely connected convolutional networks. In CVPR, 2017.

Iwata, T. and Yamada, M. Multi-view anomaly detection via robust probabilistic latent variable models. In NIPS, 2016.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2014.

Kingma, D. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

Lee, K., Lee, H., Lee, K., and Shin, J. Training confidence-calibrated classifiers for detecting out-of-distribution samples. In ICLR, 2018.

Liang, S., Li, Y., and Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. In ICLR, 2018.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. SSD: Single shot multibox detector. In ECCV, 2016.
Makhzani, A. and Frey, B. k-sparse autoencoders. In ICLR, 2014.

Makhzani, A., Shlens, J., Jaitly, N., and Goodfellow, I. Adversarial autoencoders. In ICLR, 2016.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

Pimentel, M., Clifton, D. A., Clifton, L., and Tarassenko, L. A review of novelty detection. Signal Processing, 99:215-249, 2014.

Pokorny, F., Ek, C., Kjellström, H., and Kragic, D. Persistent homology for learning densities with bounded support. In NIPS, 2012a.

Pokorny, F., Ek, C., Kjellström, H., and Kragic, D. Topological constraints and kernel-based density estimation. In NIPS Workshop on Algebraic Topology and Machine Learning, 2012b.

Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.

Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. Contractive auto-encoders: Explicit invariance during feature extraction. In ICML, 2011.

Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S., Binder, A., Müller, E., and Kloft, M. Deep one-class classification. In ICML, 2018.

Rumelhart, D., Hinton, G., and Williams, R. Learning representations by backpropagating errors. Nature, 323:533-536, 1986.

Sabokrou, M., Khalooei, M., Fathy, M., and Adeli, E. Adversarially learned one-class classifier for novelty detection. In CVPR, 2018.

Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., and Williamson, R. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443-1471, 2001.

Sutskever, I., Vinyals, O., and Le, Q. Sequence to sequence learning with neural networks. In NIPS, 2014.

Tao, T. Metric entropy analogues of sum set theory. Online: https://bit.ly/2zRAKUy, 2014.

Tausz, A., Vejdemo-Johansson, M., and Adams, H. JavaPlex: A research software package for persistent (co)homology. In ICMS, 2014.

Tax, D. and Duin, R. Support vector data description. Machine Learning, 54(1):45-66, 2004.

Tax, D. and Duin, R. Growing multi-class classifiers with a reject option. Pattern Recognition Letters, 29:1565-1570, 2008.

Tolstikhin, I., Bousquet, O., Gelly, S., and Schölkopf, B. Wasserstein auto-encoders. In ICLR, 2018.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11:3371-3408, 2010.

Xia, Y., Cao, X., Wen, F., Hua, G., and Sun, J. Learning discriminative reconstructions for unsupervised outlier removal. In ICCV, 2015.

Xie, J., Girshick, R., and Farhadi, A. Unsupervised deep embedding for clustering analysis. In ICML, 2016.

Yang, B., Fu, X., Sidiropoulos, N., and Hong, M. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In ICML, 2017.

You, C., Robinson, D., and Vidal, R. Provable self-representation based outlier detection in a union of subspaces. In CVPR, 2017.

Zagoruyko, S. and Komodakis, N. Wide residual networks. In BMVC, 2016.

Zeiler, M., Krishnan, D., Taylor, G., and Fergus, R. Deconvolutional networks. In CVPR, 2010.
Zhai, S., Cheng, Y., Lu, W., and Zhang, Z. Deep structured energy based models for anomaly detection. In ICML, 2016.

Zhou, C. and Paffenroth, R. Anomaly detection with robust deep autoencoders. In KDD, 2017.

Zong, B., Song, Q., Min, M., Cheng, W., Lumezanu, C., Cho, D., and Chen, H. Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In ICLR, 2018.

This supplementary material contains all proofs omitted in the main submission. For readability, all necessary definitions, theorems, lemmas and corollaries are restated (in dark blue) and the numbering matches the original numbering. Additional (technical) lemmas are prefixed by the section letter, e.g., Lemma 3.

A. Proofs for Section 3

First, we recall that the connectivity loss is defined as

    L_eta(S) = sum_{t in dagger(S)} |eta - eps_t| .    (4)

Definition 3. Let S be a subset of R^n with |S| = b and z_i in S. We define the indicator function

    1_{i,j}(z_1, ..., z_b) = 1  if there exists t in dagger(S) with eps_t = ||z_i - z_j|| ,  and 0 else,

where {i, j} is a subset of [b] and (eps_k)_{k=1}^{M} is the increasing sequence of all pairwise distance values of S.

Theorem 1. Let S be a subset of R^n with |S| = b, such that the pairwise distances are unique. Further, let L_eta be defined as in Eq. (4) and 1_{i,j} as in Definition 3. Then,

    L_eta(S) = sum_{{i,j} subset [b]} |eta - ||z_i - z_j||| * 1_{i,j}(z_1, ..., z_b) .

Proof. We have to show that sum_{t in dagger(S)} |eta - eps_t| from Eq. (4), denoted as A, equals the right-hand side of Theorem 1, denoted as B.

Part 1 (A <= B). Let t in dagger(S). Since the pairwise distances of S are unique, t is contained only once in the multi-set dagger(S) and we can treat dagger(S) as an ordinary set. Further, there is a (unique) {i_t, j_t} such that eps_t = ||z_{i_t} - z_{j_t}|| and hence 1_{i_t,j_t}(z_1, ..., z_b) = 1. This means every summand in A is also present in B. As all summands are non-negative, A <= B follows.

Part 2 (B <= A). Consider {i, j}, a subset of [b], contributing to the sum, i.e., 1_{i,j}(z_1, ..., z_b) = 1. By definition, there exists t in dagger(S) with eps_t = ||z_i - z_j|| and therefore the summand corresponding to {i, j} in B is present in A. Again, as all summands are non-negative, B <= A follows, which concludes the proof.

Lemma 3. Let S be a subset of R^n with |S| = b, such that the pairwise distances are unique. Then, 1_{i,j}(S) is locally constant in S. Formally, let 1 <= u <= b, 1 <= v <= n, h in R and S' = {z_1, ..., z_{u-1}, z_u + h * e_v, z_{u+1}, ..., z_b}, where e_v is the v-th unit vector. Then,

    there exists xi > 0 such that |h| < xi implies 1_{i,j}(S) = 1_{i,j}(S') .

Proof. 1_{i,j}(X) is defined via dagger(X), which, in turn, is defined via the Vietoris-Rips filtration of X. Hence, it is sufficient to show that the corresponding Vietoris-Rips filtrations of S and S' are equal, which we will do next.

Let (eps_k)_{k=1}^{M} be the increasingly sorted sequence of pairwise distance values of S. As all pairwise distances are unique, there is exactly one {i_k, j_k} for each k such that eps_k = ||z_{i_k} - z_{j_k}||. Further, let S' = {z'_1, ..., z'_b} be such that

    z'_i = z_i  for i != u,  and  z'_u = z_u + h * e_v ,

and eps'_k = ||z'_{i_k} - z'_{j_k}||. We now show that (eps'_k)_{k=1}^{M} is sorted and strictly increasing. First, let

    mu = min_{1 <= k < M} (eps_{k+1} - eps_k) .    (1)

Since the pairwise distances are unique, mu > 0. Now, by the triangle inequality,

    | ||z'_i - z'_j|| - ||z_i - z_j|| | <= | ||z_i - z_j|| + |h| - ||z_i - z_j|| | = |h| ,

which is equivalent to

    -|h| <= ||z'_i - z'_j|| - ||z_i - z_j|| <= |h| .
This yields

    ||z'_i - z'_j|| >= ||z_i - z_j|| - |h|    (2)

and

    -||z'_i - z'_j|| >= -||z_i - z_j|| - |h| .    (3)

Using Eqs. (2) and (3) we get

    eps'_{k+1} - eps'_k = ||z'_{i_{k+1}} - z'_{j_{k+1}}|| - ||z'_{i_k} - z'_{j_k}||
                       >= ||z_{i_{k+1}} - z_{j_{k+1}}|| - |h| - ||z_{i_k} - z_{j_k}|| - |h|
                        = eps_{k+1} - eps_k - 2|h|
                       >= mu - 2|h|    (by Eq. (1)) .

Overall, (eps'_k)_{k=1}^{M} is sorted and strictly increasing if mu - 2|h| > 0, i.e., if |h| < mu/2. It remains to show that the Vietoris-Rips filtration

    {} subset V_0(S) subset V_{eps_1/2}(S) subset ... subset V_{eps_M/2}(S)

is equal to

    {} subset V_0(S') subset V_{eps'_1/2}(S') subset ... subset V_{eps'_M/2}(S') .

For V_0(S) = {{1}, ..., {b}} = V_0(S') this is obvious. For f_S, f_{S'}, as in Definition 1 (main paper; Vietoris-Rips complex), we get

    f_S^{-1}(eps_{k+1}/2) = {{i_{k+1}, j_{k+1}}} = f_{S'}^{-1}(eps'_{k+1}/2)

since eps_k = ||z_{i_k} - z_{j_k}|| and eps'_k = ||z'_{i_k} - z'_{j_k}|| and the pairwise distances are unique. Now, by induction,

    V_{eps'_{k+1}/2}(S') = V_{eps'_k/2}(S') u f_{S'}^{-1}(eps'_{k+1}/2)
                        = V_{eps_k/2}(S) u f_{S'}^{-1}(eps'_{k+1}/2)
                        = V_{eps_k/2}(S) u f_S^{-1}(eps_{k+1}/2)
                        = V_{eps_{k+1}/2}(S) .

Setting xi = mu/2 concludes the proof.

Theorem 2. Let S be a subset of R^n with |S| = b, such that the pairwise distances are unique. Then, for 1 <= u <= b and 1 <= v <= n, the partial (sub-)derivative of L_eta(S) w.r.t. the v-th coordinate of z_u exists, i.e.,

    d L_eta(S) / d z_{u,v} = sum_{{i,j} subset [b]} ( d |eta - ||z_i - z_j||| / d z_{u,v} ) * 1_{i,j}(z_1, ..., z_b) .

Proof. By Theorem 1, we can write

    L_eta(S) = sum_{{i,j} subset [b]} |eta - ||z_i - z_j||| * 1_{i,j}(z_1, ..., z_b) .

Further, from Lemma 3, we know that 1_{i,j} is locally constant for u, v. Consequently, the partial derivative of 1_{i,j} w.r.t. z_{u,v} exists and is zero. The rest follows from the product rule of differential calculus.

B. Proofs for Section 4

Lemma 1. Let 2 <= b <= m and M, a subset of R^n, with |M| = m such that for each S, a subset of M with |S| = b, it holds that S is alpha-beta-connected. Then, for d = m - b and z in M arbitrary but fixed, we find M_z, a subset of M, with |M_z| = d + 1 and M_z contained in B(z, alpha, beta).

Proof. Let z in M. Our strategy is to iteratively construct a set of points {z_1, ..., z_{d+1}} contained in B(z, alpha, beta) intersected with (M \ {z}).

First, consider some S^(1), a subset of M, with z in S^(1) and |S^(1)| = b. Since S^(1) is alpha-beta-connected (by assumption), there is z_1 in S^(1) with z_1 in B(z, alpha, beta). By repeatedly considering S^(i), a subset of M, with z in S^(i) and |S^(i)| = b, we can construct M_z^(i) = {z_1, ..., z_i} for i <= d = m - b. It holds that

    |M \ M_z^(i)| = m - i >= m - d = m - (m - b) = b .    (5)

Hence, we find S^(i+1), a subset of M \ M_z^(i), with z in S^(i+1) such that |S^(i+1)| = b. Again, as S^(i+1) is alpha-beta-connected, there is z_{i+1} in S^(i+1) with z_{i+1} in B(z, alpha, beta). Overall, this specific procedure allows constructing d + 1 points, as for i >= d + 1, Eq. (5) is no longer fulfilled.

Corollary 1. Let 2 <= b <= m and M, a subset of R^n, with |M| = m such that for each S, a subset of M with |S| = b, it holds that S is alpha-beta-connected. Then M is (m - b + 1)-beta-dense.

Proof. By Lemma 1, we can construct m - b + 1 points, M_z, such that M_z is contained in B(z, alpha, beta). Conclusively, y in M_z implies ||z - y|| <= beta.

Corollary 2. Let 2 <= b <= m and M, a subset of R^n, with |M| = m such that for each S, a subset of M with |S| = b, it holds that S is alpha-beta-connected. Then, for eps > 0 and m - b + 1 > E^{eps,n}_{alpha,beta}, it follows that M is not eps-separated.

Proof.
Choose some z in M. By Lemma 1, we can construct m - b + 1 points, M_z, such that M_z is contained in B(z, alpha, beta). The distance induced by ||.|| is translation invariant, hence E^{eps,n}_{alpha,beta} = N_eps(B(z, alpha, beta)). If m - b + 1 > E^{eps,n}_{alpha,beta}, we conclude that M_z is not eps-separated and therefore M is not eps-separated.

Lemma 2. Let eps < 2*alpha and alpha < beta. Then, in (R^n, ||.||_1), it holds that

    E^{eps,n}_{alpha,beta} <= (2*beta/eps + 1)^n - (2*alpha/eps - 1)^n .

Proof. Let M, a subset of B(0, alpha, beta), be such that M is eps-separated. Then, the open balls B(z, eps/2)^o, z in M, are pairwise disjointly contained in B(0, alpha - eps/2, beta + eps/2). To see this, let y in B(z, eps/2)^o. We get

    ||y|| <= ||y - z|| + ||z|| < eps/2 + beta

and (by the reverse triangle inequality)

    ||y|| = ||z - (z - y)|| >= | ||z|| - ||z - y|| | >= ||z|| - ||z - y|| >= alpha - eps/2 .

Hence, y in B(0, alpha - eps/2, beta + eps/2). The balls are pairwise disjoint as M is eps-separated and the radius of each ball is chosen as eps/2. Let lambda denote the Lebesgue measure in R^n. It holds that

    |M| * lambda(B(0, eps/2)^o) = lambda( union_{z in M} B(z, eps/2)^o ) <= lambda(B(0, alpha - eps/2, beta + eps/2))

as lambda is translation invariant and the union of the B(z, eps/2)^o is contained in B(0, alpha - eps/2, beta + eps/2). The volume of the ||.||_1-ball with radius r is lambda(B(0, r)) = (2^n / n!) r^n. Hence, we get

    |M| * eps^n / n! <= (2^n / n!) ((beta + eps/2)^n - (alpha - eps/2)^n)

and thus

    |M| <= (2^n / eps^n) ((beta + eps/2)^n - (alpha - eps/2)^n)
         = (2^n / eps^n) (eps^n / 2^n) ((2*beta/eps + 1)^n - (2*alpha/eps - 1)^n)
         = (2*beta/eps + 1)^n - (2*alpha/eps - 1)^n .

As the upper bound holds for any such M, it specifically holds for the largest M, which bounds the metric entropy E^{eps,n}_{alpha,beta} and completes the proof.

C. Parallel persistent homology computation

While there exist many libraries for computing persistent homology of a filtered simplicial complex, e.g., DIPHA (Bauer et al., 2014a), Dionysus (http://www.mrzv.org/software/dionysus2), JavaPlex (Tausz et al., 2014; https://appliedtopology.github.io/javaplex/) and GUDHI (http://gudhi.gforge.inria.fr), or for fast (RIPSER, https://github.com/Ripser/ripser) and approximate (SimBa (Dey et al., 2016)) computation of Vietoris-Rips persistent homology, we are not aware of an available implementation that (P1) fully operates on the GPU and (P2) offers easy access to the persistence pairings.

As most deep learning platforms are optimized for GPU computations, (P1) is important to avoid efficiency bottlenecks caused by expensive data transfer operations between main memory and GPU memory; (P2) is required to enable the integration of persistent homology in an automatic differentiation framework, such as PyTorch.

Next, we present a straightforward (and not necessarily optimal) variant of the standard reduction algorithm to compute persistent homology, as introduced in (Edelsbrunner & Harer, 2010, p. 153), that offers both properties. While many improvements of our parallelization approach are possible, e.g., using clearing (Bauer et al., 2014b) or computing cohomology (de Silva et al., 2011) instead, we do not follow these directions here. We only present a simple parallel variant that is sufficient for the purpose of this work.

The core idea of the original reduction algorithm is to transform the boundary matrix of a filtered simplicial complex such that the "birth-death" times of its homological features can be easily read off. More precisely, the boundary matrix (Edelsbrunner & Harer, 2010) is transformed to its reduced form (see Definition 4) via left-to-right column additions, defined in Algorithm 1. First, we need to define what is meant by a reduced form of a boundary matrix B over Z_2^{m x n}.

Definition 4. Let B be in Z_2^{m x n} and let B[i] and B[<= i] denote the i-th column and the sub-matrix of the first i columns, respectively, of B. Then, for B[i] != 0, we define low(B, i) = j iff j is the row index of the lowest 1 in B[i]. For convenience, we set low(B, i) = -1 for B[i] = 0. We call B reduced iff for 1 <= i < j <= n:

    B[i], B[j] != 0  implies  low(B, i) != low(B, j) .

Algorithm 1 (Column addition)
    function ADD(B, i, j):
        B[j] <- B[j] + B[i]    # addition in Z_2
    end function

Next, we restate the original (sequential) reduction algorithm. Let d (the boundary operator) be the boundary matrix of a filtered simplicial complex. Algorithm 2 consists of two nested loops.

Algorithm 2 (Standard PH algorithm, (Edelsbrunner & Harer, 2010, p. 153))
    B <- d
    for j <- 1, ..., n do
        while there exists j' < j with low(B, j') = low(B, j) do
            ADD(B, j', j)
        end while
    end for
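For reference, a small Python rendition of this standard left-to-right reduction (Algorithm 2 above), with each Z_2 column stored as a set of row indices; this is a readable sequential version, not the GPU variant described in this section.

```python
def low(col):
    """Row index of the lowest 1 in a column (stored as a set of row indices), or -1."""
    return max(col) if col else -1

def reduce_boundary_matrix(columns):
    """Standard left-to-right reduction over Z_2 (cf. Algorithm 2).

    columns : list of sets; columns[j] holds the row indices of the 1-entries
              of the j-th column of the boundary matrix (in filtration order).
    After reduction, (low(columns[j]), j) for non-empty columns are the persistence pairs.
    """
    lowest = {}                                   # low value -> index of the column owning it
    for j in range(len(columns)):
        while columns[j] and low(columns[j]) in lowest:
            columns[j] ^= columns[lowest[low(columns[j])]]   # column addition in Z_2
        if columns[j]:
            lowest[low(columns[j])] = j
    return columns
```

For the 1-skeleton used in this work, the columns are the edges in order of increasing eps_k, and the resulting pairs directly yield the death times of Definition 2.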
C. Parallel persistent homology computation

While there exist many libraries for computing persistent homology of a filtered simplicial complex (DIPHA (Bauer et al., 2014a); Dionysus, http://www.mrzv.org/software/dionysus2; JavaPlex (Tausz et al., 2014), https://appliedtopology.github.io/javaplex/; GUDHI, http://gudhi.gforge.inria.fr), as well as for fast (Ripser, https://github.com/Ripser/ripser) and approximate (SimBa (Dey et al., 2016)) computation of Vietoris-Rips persistent homology, we are not aware of an available implementation that (P1) fully operates on the GPU and (P2) offers easy access to the persistence pairings.

As most deep learning platforms are optimized for GPU computations, (P1) is important to avoid efficiency bottlenecks caused by expensive data transfer operations between main memory and GPU memory; (P2) is required to enable the integration of persistent homology into an automatic differentiation framework, such as PyTorch.

Next, we present a straightforward (and not necessarily optimal) variant of the standard reduction algorithm for computing persistent homology, as introduced in (Edelsbrunner & Harer, 2010, p. 153), that offers both properties. While many improvements of our parallelization approach are possible, e.g., using clearing (Bauer et al., 2014b) or computing cohomology (de Silva et al., 2011) instead, we do not follow these directions here. We only present a simple parallel variant that is sufficient for the purpose of this work.

The core idea of the original reduction algorithm is to transform the boundary matrix of a filtered simplicial complex such that the "birth-death" times of its homological features can be easily read off. More precisely, the boundary matrix (Edelsbrunner & Harer, 2010) is transformed to its reduced form (see Definition 4) via left-to-right column additions, defined in Algorithm 1. First, we need to define what is meant by a reduced form of a boundary matrix B over Z_2^{m×n}.

Definition 4. Let B ∈ Z_2^{m×n} and let B[i], B[≤ i] denote the i-th column and the sub-matrix of the first i columns of B, respectively. Then, for B[i] ≠ 0, we define low(B, i) = j iff j is the row index of the lowest 1 in B[i]. For convenience, we set low(B, i) = −1 if B[i] = 0. We call B reduced iff, for 1 ≤ i < j ≤ n,

    B[i], B[j] ≠ 0  ⇒  low(B, i) ≠ low(B, j) .

Algorithm 1  Column addition
    function ADD(B, i, j):
        B[j] ← B[j] + B[i]        ▷ addition in Z_2
    end function

Next, we restate the original (sequential) reduction algorithm. Let ∂ be the boundary matrix of a filtered simplicial complex.

Algorithm 2  Standard PH algorithm (Edelsbrunner & Harer, 2010, p. 153)
    B ← ∂
    for j ← 1, n do
        while ∃ j' < j : low(B, j') = low(B, j) do
            ADD(B, j', j)
        end while
    end for

Algorithm 2 consists of two nested loops. We argue that, if the column additions were data-independent, we could easily perform these operations in parallel without conflicts. To formalize this idea, let us consider a set M of index pairs

    M = {(i_k, j_k)}_k ⊂ {1, ..., n} × {1, ..., n} .

If the conditions

    (i)  {i_k}_k ∩ {j_k}_k = ∅ , and
    (ii) ∀ j_k : ∃! i_k : (i_k, j_k) ∈ M

are satisfied, the ADD(B, i_k, j_k) operations from Algorithm 1 are data-independent. Informally, condition (i) ensures that no column is both target and origin of a merge operation, and condition (ii) ensures that each column is targeted by at most one merge operation. In the following definition, we construct two auxiliary operators that will allow us to construct M such that conditions (i) and (ii) are satisfied.

Definition 5. Let B ∈ Z_2^{m×n} and 1 ≤ j ≤ m. We define

    I(B, j) = ∅                          if |{ i : low(B, i) = j }| < 2
              { i : low(B, i) = j }       else

and

    M(B, j) = ∅                                      if I(B, j) = ∅
              µ(B, j) × ( I(B, j) \ µ(B, j) )         else

where µ(B, j) = { min I(B, j) }. Finally, let

    M(B) = ⋃_{j=1}^{m} M(B, j) .

By construction, it holds that M(B) = ∅ iff B is reduced. We can now propose a parallel algorithm, i.e., Algorithm 3, that iterates until M(B) = ∅.

Algorithm 3  GPU PH algorithm
    function ADDPARALLEL(B, M):
        parallel for (i, j) ∈ M do
            ADD(B, i, j)
        end parallel for
    end function

    B ← ∂
    M ← M(B)
    while M ≠ ∅ do
        ADDPARALLEL(B, M)
        M ← M(B)
    end while

Upon termination, M(B) = ∅, and hence B is reduced.
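For concreteness, the following is a small NumPy sketch, an illustration rather than the GPU implementation used in the paper. It implements low from Definition 4, the merge set M(B) from Definition 5, and the iteration of Algorithm 3; within one iteration the column additions are data-independent, so on a GPU they could run concurrently, whereas the sketch simply loops over them.

```python
import numpy as np

def low(B: np.ndarray, i: int) -> int:
    """Row index of the lowest 1 in column i of B (over Z_2), or -1 if B[i] = 0."""
    rows = np.flatnonzero(B[:, i])
    return int(rows[-1]) if rows.size > 0 else -1

def merge_set(B: np.ndarray):
    """M(B) from Definition 5: pairs (i, j) meaning 'add column i onto column j'."""
    m, n = B.shape
    by_low = {}
    for i in range(n):
        l = low(B, i)
        if l >= 0:
            by_low.setdefault(l, []).append(i)
    pairs = []
    for l, cols in by_low.items():
        if len(cols) >= 2:
            src = min(cols)                      # mu(B, l)
            pairs += [(src, j) for j in cols if j != src]
    return pairs

def reduce_boundary_matrix(B: np.ndarray) -> np.ndarray:
    """Apply batched column additions until B is reduced (Algorithm 3, run sequentially)."""
    B = B.copy() % 2
    while True:
        pairs = merge_set(B)
        if not pairs:
            return B
        # within one batch the additions are data-independent (conditions (i) and (ii))
        for i, j in pairs:
            B[:, j] = (B[:, j] + B[:, i]) % 2
```

The sketch operates on a full boundary matrix over Z_2 whose columns are ordered by filtration value; the persistence pairing can then be read off via low on the reduced matrix.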
It only remains to show that termination is achieved after a finite number of iterations.

Lemma 3. For B ∈ Z_2^{m×n}, Algorithm 3 terminates after finitely many iterations.

Proof. Let B^{(k)} be the state of B in the k-th iteration. For 1 ≤ l ≤ n it holds that M(B^{(k)}[≤ l]) = ∅ if B^{(k)}[≤ l] is reduced. Consequently, for k' > k,

    B^{(k)}[≤ l] is reduced  ⇒  B^{(k')}[≤ l] is reduced ,

as B[≤ l] does not change any more after the k-th iteration. Hence, we can inductively show that the algorithm terminates after finitely many iterations.

First, note that B^{(k)}[≤ 1] is reduced. Now assume B^{(k)}[≤ l] is reduced and consider B^{(k)}[≤ l + 1]. If B^{(k)}[≤ l + 1] is not reduced, then

    M( B^{(k)}[≤ l + 1] ) ⊂ {1, ..., l} × {l + 1}

as B^{(k)}[≤ l] is already reduced. Thus, if the algorithm continues to the (k+1)-th iteration, the lowest 1 of B^{(k)}[l + 1] is eliminated and therefore

    low(B^{(k+1)}, l + 1) < low(B^{(k)}, l + 1) .

Hence, after d ≤ low(B^{(k)}, l + 1) iterations, B^{(k+d)}[≤ l + 1] is reduced, as either B^{(k+d)}[l + 1] = 0 or there is no j ≤ l such that

    low( B^{(k+d)}[≤ l], j ) = low( B^{(k+d)}[≤ l + 1], l + 1 ) .

In consequence, B^{(k')}[≤ n] = B^{(k')} is reduced for some finite k', which concludes the proof.

Runtime study. We conducted a simple runtime comparison to Ripser and Dionysus (which both run on the CPU). Both implementations are available through Python wrappers (for Ripser, see https://scikit-tda.org/). Dionysus implements persistent cohomology computation (de Silva et al., 2011), while Ripser implements multiple recent algorithmic improvements, such as the aforementioned clearing optimization as well as computing cohomology. Rips complexes are built using ‖·‖_1, up to the enclosing radius of the point cloud. Specifically, we compute 0-dimensional features on samples of varying size (b), drawn from a unit multivariate Gaussian in R^10. Runtime is measured on a system with ten Intel(R) Core(TM) i9-7900X CPUs (3.30 GHz), 64 GB of RAM, and an Nvidia GTX 1080 Ti GPU. Figure 7 shows runtime in seconds, averaged over 50 runs. Note that, in this experiment, runtime includes construction of the Rips complex as well. While Ripser is, on average, slightly faster than our implementation, we note that for mini-batch sizes customary in training neural networks (e.g., 32, 64, 128), the runtime difference is negligible, especially compared to the overall cost of backpropagation. Importantly, our method integrates well into existing deep learning frameworks, such as PyTorch, and thus makes it easy to experiment with new loss functions, such as the proposed connectivity loss.

[Figure 7. Runtime comparison of Ripser & Dionysus (both CPU) vs. our parallel GPU variant. Runtime (in seconds) is reported for 0-dimensional VR persistent homology, computed from random samples of size b drawn from a unit multivariate Gaussian in R^10.]

D. Supplementary figures

Fig. 8 shows a second variant of Fig. 6 from the main paper, only that we replace CIFAR-10 with Tiny-ImageNet (testing portion). The autoencoder was trained on the training portion of CIFAR-100.

[Figure 8. Average ε_d, d ∈ †(S), per branch, computed from batches S of size 100 over CIFAR-100 (test split) and Tiny-ImageNet (test split); f_θ is learned from the training portion of CIFAR-100 with η = 2.]

E. Algorithmic summary

Algorithm 4 provides a high-level description of the workflow to apply the presented method for one-class learning.

Algorithm 4  Summary of training steps
    Parameters: η > 0 (scaling parameter for L_η); λ > 0 (weighting for L_η); B ≥ 1 (number of branches); D ≥ 1 (branch dimensionality); b (mini-batch size).
    Remark: these are all global parameters.
    function SLICE(z, j):
        return z[D·(j−1) : D·j]
    end function

    Step 1: Autoencoder training
    Train g_φ and f_θ on an auxiliary unlabeled dataset {a_1, ..., a_M}, minimizing (over batches of size b)

        (1/b) Σ_{i=1}^{b} ‖a_i − g_φ ∘ f_θ(a_i)‖_1 + λ Σ_{j=1}^{B} L_η({z^j_1, ..., z^j_b})

    where z_i = f_θ(a_i) and z^j_i = SLICE(z_i, j).
    Remark: this autoencoder can be re-used. That is, if we already have f_θ trained on {a_1, ..., a_M} (e.g., from another one-class scenario) using the same η, B, D parameter choices, autoencoder training can be omitted.

    Step 2: Create one-class model
    For one-class samples {x_1, ..., x_m}, compute and store z^j_i = SLICE(f_θ(x_i), j).

    Step 3: Evaluate one-class model
    For each new sample y_*, obtain y^j_* = SLICE(f_θ(y_*), j) and compute the one-class score

        s(y_*) = Σ_{j=1}^{B} | { z^j_i : ‖y^j_* − z^j_i‖ ≤ η, 1 ≤ i ≤ m } |

    using the stored z^j_i from Step 2.
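As a complement to Algorithm 4, the following is a brief PyTorch sketch of Steps 2 and 3, illustrative only; the names slice_branch, fit_one_class, and one_class_score are not from the paper, and the use of the 1-norm for the score (matching the distances used elsewhere in this appendix) is an assumption of this sketch. It presupposes an encoder f_θ that maps a batch of inputs to codes of dimensionality B·D.

```python
import torch

def slice_branch(z: torch.Tensor, j: int, D: int) -> torch.Tensor:
    """SLICE from Algorithm 4: the j-th branch (1-indexed) of a latent code."""
    return z[..., D * (j - 1): D * j]

def fit_one_class(encoder, x: torch.Tensor, B: int, D: int):
    """Step 2: encode the one-class samples and store the per-branch codes."""
    with torch.no_grad():
        z = encoder(x)                                        # shape (m, B*D)
    return [slice_branch(z, j, D) for j in range(1, B + 1)]

def one_class_score(encoder, y: torch.Tensor, stored, eta: float, B: int, D: int) -> int:
    """Step 3: per-branch count of stored codes within distance eta of the query."""
    with torch.no_grad():
        zy = encoder(y.unsqueeze(0))                          # shape (1, B*D)
    score = 0
    for j in range(1, B + 1):
        # 1-norm distances between the query branch and all stored branch codes
        d = torch.cdist(slice_branch(zy, j, D), stored[j - 1], p=1)
        score += int((d <= eta).sum())
    return score
```

A higher score indicates that more stored one-class codes lie within an η-ball of the query in each branch, which is the quantity summed in Step 3.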
