ATOL: Measure Vectorization for Automatic Topologically-Oriented Learning


Authors: Martin Royer (DATASHAPE), Frédéric Chazal (DATASHAPE), Clément Levrard (LPSM (UMR_8001))

Martin Royer, Frédéric Chazal (Datashape, Inria Saclay, Palaiseau, France); Clément Levrard (LPSM, Univ. Paris Diderot, Paris, France); Umeda Yuhei, Ike Yuichi (Fujitsu Laboratories, AI Lab, Tokyo, Japan)

Under review.

Abstract

Robust topological information commonly comes in the form of a set of persistence diagrams, finite measures that are in nature uneasy to affix to generic machine learning frameworks. We introduce a fast, learnt, unsupervised vectorization method for measures in Euclidean spaces and use it for reflecting underlying changes in topological behaviour in machine learning contexts. The algorithm is simple and efficiently discriminates important space regions where meaningful differences to the mean measure arise. It is proven to be able to separate clusters of persistence diagrams. We showcase the strength and robustness of our approach on a number of applications, from emulous and modern graph collections where the method reaches state-of-the-art performance to a geometric synthetic dynamical orbits problem. The proposed methodology comes with a single high-level tuning parameter: the total measure encoding budget. We provide completely open access software.

1 Introduction

Topological Data Analysis (TDA) is a field dedicated to the capture and description of relevant geometric or topological information from data. The use of TDA with standard machine learning tools has proved particularly advantageous in dealing with all sorts of complex data, meaning objects that are not or only partly Euclidean, for instance graphs, time series, etc. The applications are abundant, from social network analysis, bio- and chemoinformatics, to physics, imaging and computer vision, to name a few. Recent examples include [DUC20], [PKP+19], [Dup18], [CS19], [KMP18].
Through Persistent Homology, a multi-scale analysis of the topological properties of the data, robust information is extracted. But the resulting features are commonly generated in the form of a persistence diagram, whose structure does not easily fit the general machine learning input format. So TDA captures relevant information in a form that is challenging to handle, and it is therefore generally combined with machine learning by way of an embedding method for persistence diagrams. This work is set in that trend.

Contributions. First we introduce a learnt, unsupervised vectorization method for measures in Euclidean spaces of any dimension (Section 2.1). Then we show how this method can be used for Topologically-Oriented Learning (Section 2.2), allowing for easy integration of topological features such as persistence diagrams into challenging machine learning problems. We illustrate our approach with sets of experiments that lead to state-of-the-art results on challenging problems (Section 3). We provide an open source implementation.

Our algorithm is simple and easy to use. It relies on a quantization of the space of diagrams that is statistically optimal. It is fast and practical for large-scale and high-dimensional problems. It is competitive and sometimes largely surpasses more sophisticated methods involving kernels, deep learning, or computations of Wasserstein distance. To the best of our knowledge, we introduce the first vectorization method for persistence diagrams that is proven to be able to separate clusters of persistence diagrams. There is little to no tuning to this method, and no knowledge of TDA is required for using it.

Related work.
Finding representations of persistence diagrams that are well-suited to be combined with standard machine learning pipelines is a problem that has attracted a lot of interest in recent years. A first family of approaches consists in finding convenient vector representations of persistence diagrams: for instance, interpreting diagrams as images in [AEK+17], extracting topological signatures with respect to fixed points whose optimal positions are supervisedly learnt in [HKNU17], or a square-root transform of their approximated pdf in [AVRT16]. Recently [PMK19] introduced template functions, a mathematical framework to understand featurisation functions that integrate against the measure of a persistence diagram; our method is interpretable in this framework. A second family of approaches consists in designing specific kernels on the space of persistence diagrams, such as the multi-scale kernel of [RHBK15], the weighted Gaussian kernel of [KHF16] or the sliced Wasserstein kernel of [CCO17]. Those techniques have state-of-the-art behaviour on problems, but as a drawback they require another step for an explicit representation, and are known to scale poorly. Another recent line of work has managed to directly combine the uneasy structure of persistence diagrams with neural network architectures [ZKR+17], [CCI+19]. Despite their successful performances, these neural networks are heavy to deploy and hard to understand. They are sometimes paired with a representation method as in [HKNU17], [HKN19].

Persistent homology in TDA. Persistent homology provides a rigorous mathematical framework and efficient algorithms to encode relevant multi-scale topological features of complex data such as point clouds, time series, 3D images...
More precisely, persistent homology encodes the evolution of the topology of families of nested topological spaces (F_α)_{α ∈ A}, called filtrations, built on top of the data and indexed by a set of real numbers A that can be seen as scale parameters. For example, for a point cloud in a Euclidean space, F_α can be the union of the balls of radius α centered on the data points – see Figure 1. Given a filtration (F_α)_{α ∈ A}, its topology (homology) changes as α increases: new connected components can appear, existing connected components can merge, loops and cavities can appear or be filled, etc.

Figure 1: filtration by union of balls built on top of a 2-dimensional data set (red points) and its corresponding persistence diagram. As the ball radii increase (from left to right and top to bottom), the connected components (red points) are merged; two-cycles (blue points) appear and disappear along the filtration.

Persistent homology tracks these changes, identifies features and associates, to each of them, an interval or lifetime from α_birth to α_death. For instance, a connected component is a feature that is born at the smallest α such that the component is present in F_α, and dies when it merges with an older connected component. The set of intervals representing the lifetimes of the identified features is called the barcode of the filtration. As an interval can also be represented as a point in the plane with coordinates (α_birth, α_death), the persistence barcode is equivalently represented as a union of such points and called the persistence diagram – see [EH10, BCY18] for a more detailed introduction.
The classical main advantages of persistence diagrams are that: (i) they are proven to provide robust qualitative and quantitative topological information about the data [CdSGO16]; (ii) since each point of the diagram represents a specific topological feature with its lifespan, they are easily interpretable as features; (iii) from a practical perspective, persistence diagrams can be efficiently computed from a wide family of filtrations [The20]. However, as persistence diagrams come as unordered sets of points with non-constant cardinality, they cannot be immediately processed as standard vector features in machine learning algorithms. Considering diagrams as measures has proven beneficial in the literature before (see for instance [CdSGO16], [CD18]) and allows to naturally encode point multiplicity in the form of weighted measures.

Notations. Consider M_d the set of finite measures on the d-dimensional ball B_d(0, R) of the Euclidean space R^d with total mass smaller than M, for some given M, R > 0. For m ∈ M_d and a Borel function χ: R^d → R, let χ · m := ∫_{x ∈ R^d} χ(x) m(dx) whenever |χ| · m is finite. Next, for b ∈ N*, we call a codebook c = (c_1, ..., c_b) ∈ B_d(0, R)^b the support of a distribution supported on b points, with associated Voronoi cells

  W_j(c) = { x ∈ R^d | ∀i < j, ‖x − c_j‖ < ‖x − c_i‖ and ∀i > j, ‖x − c_j‖ ≤ ‖x − c_i‖ }.

Finally, we assume that the set of input persistence diagrams comes as an i.i.d. sample from a distribution of uniformly bounded diagrams: given M, R > 0, let D be the space of persistence diagrams with at most M points contained in the Euclidean disc B_2(0, R).
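To make the χ · m notation concrete, here is a small numpy illustration of our own (the three-point diagram is made up): for a discrete measure m given by a sum of Dirac masses, χ · m is simply the sum of χ over the support points.

```python
import numpy as np

# A persistence diagram viewed as a discrete measure D = sum_p delta_p:
# integrating chi against D reduces to summing chi over its points.
# (Illustrative toy example; the diagram below is made up.)
diagram = np.array([[0.1, 0.5], [0.2, 0.9], [0.4, 0.6]])  # (birth, death) points

def integrate(chi, points):
    """Compute chi . m for the discrete measure m = sum of Dirac masses at `points`."""
    return sum(chi(p) for p in points)

persistence = integrate(lambda p: p[1] - p[0], diagram)  # total lifetime mass
total_mass = integrate(lambda p: 1.0, diagram)           # = number of points
```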
The space D is considered as a subspace of the set M_2 of finite measures on B_2(0, R) with total mass smaller than M: for any D ∈ D, D := Σ_{p ∈ D} δ_p, where δ_p is the Dirac measure centered at point p.

2 Methodology

In this section we introduce Atol, a simple unsupervised data-driven method for measure vectorization. Atol allows to automatically convert a distribution of persistence diagrams into a distribution of feature vectors that are well-suited for use as topological features in standard machine learning pipelines.

As an overview, given a positive integer b, Atol proceeds in two steps: it computes a discrete measure in R^d supported on b points that approximates the average measure of the distribution from which the input observations have been sampled. Then it computes a set of well-chosen contrast functions centered on each point of the support of this measure, which are used to convert each observed measure into a vector of size b. The resulting vectorization can then be used in standard machine-learning problems such as clustering, classification, etc.

2.1 Measure vectorization through quantization

We now introduce Algorithm 1, Atol-featurisation: a featurisation method for elements of M_d. The first step in our procedure is to use quantization in the space M_d. Starting from an i.i.d. sample of measures X_1, ..., X_n drawn from a probability distribution L_X on M_d, and given an integer budget b ∈ N*, we produce a compact representation of the mean measure E(X). That is, we produce a distribution P_ĉn supported on a fixed-length codebook ĉ_n = (c_1, ..., c_b) ∈ B_d(0, R)^b that aims to minimize, over such distributions P based on b points, the distortion R(P) := W_2²(P, E(X)): the squared 2-Wasserstein distance to the mean measure. In practice, one considers the empirical mean measure X̄_n and the k-means problem for this X̄_n measure.
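For intuition, here is a minimal numpy sketch of this quantization step, under the simplifying assumption (ours) that every observed measure is a finite sum of equal-mass Dirac masses, so that k-means for the empirical mean measure reduces to Lloyd iterations on the pooled points. The data and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
measures = [rng.random((30, 2)) for _ in range(20)]  # n toy measures in R^2
points = np.vstack(measures)                         # support of the mean measure

def lloyd(points, b, n_iter=50):
    """Lloyd iterations: quantize the pooled points onto a b-point codebook."""
    codebook = points[rng.choice(len(points), b, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid (its Voronoi cell)
        d2 = ((points[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # move each centroid to the mean of its cell
        for i in range(b):
            cell = points[labels == i]
            if len(cell):
                codebook[i] = cell.mean(0)
    return codebook

codebook = lloyd(points, b=8)
```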
Then the adaptation of Lloyd's algorithm [Llo82] to measures can be used.

From this quantization, our aim is to derive spatial information on measures in order to discriminate them. Much like one would compactly describe a point cloud with respect to its barycenter in a PCA procedure, we describe measures based on a number of reduced differences to our approximate mean measure. To this end, our second step is to tailor b individual contrast functions, each based on the estimated codebook, that individually describe the space with respect to a certain viewpoint. In other words, we set out to find regions of the space where measures seem to aggregate on average, and build a dedicated descriptor for those regions. We define and use the following contrast family R^d → R_+, for i ∈ [b]:

  Ψ_i(x, ĉ_n) := exp( −‖x − c_i‖ / σ_i(ĉ_n) ),   (1)

where

  σ_i(ĉ_n) := min_{j ∈ [b], j ≠ i} ‖c_i − c_j‖ / 2.   (2)

These specific contrast functions are chosen to decrease away from the approximate mean centroids in a Laplacian fashion, and we choose the scale to correspond to the minimum distance to the closest Voronoi cell in the corresponding codebook ĉ_n. To our knowledge there is nothing that prevents other well-designed contrast families from being substituted in their place; some intuition in that regard is provided in the ablation study of Section 3.2.

Given a mean-measure codebook approximate ĉ_n, an element X ∈ M_d can now be compactly described through its integrated contribution to each contrast function: Ψ_i(·, ĉ_n) · X. Our algorithm concatenates each of those contributions into a vector.

This algorithm is practical for large-scale and high-dimensional problems: it has a running time in O(n × M × b × d), and is therefore able to handle difficult problems as long as corresponding measures are found.
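The contrast functions of Equations (1) and (2) can be sketched as follows (a toy numpy illustration of our own, with a made-up codebook, not the paper's implementation):

```python
import numpy as np

# Laplacian contrast functions Psi_i centered on the codebook, with scale
# sigma_i equal to half the distance to the nearest other centroid.
codebook = np.array([[0.2, 0.2], [0.8, 0.3], [0.5, 0.9]])  # toy codebook, b = 3

# sigma_i = min_{j != i} ||c_i - c_j|| / 2   (Equation (2))
gaps = np.linalg.norm(codebook[:, None, :] - codebook[None, :, :], axis=-1)
np.fill_diagonal(gaps, np.inf)
sigma = gaps.min(1) / 2.0

def vectorize(X):
    """v_Atol(X): integrate each contrast Psi_i against the discrete measure X."""
    dists = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=-1)
    return np.exp(-dists / sigma).sum(0)  # Equation (1), summed over X's points

v = vectorize(np.array([[0.25, 0.25], [0.75, 0.35]]))  # a 2-point toy measure
```

Each coordinate of `v` reports how much mass of X falls near the corresponding centroid, which is the "viewpoint" described in the text.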
If necessary, a single-pass quantization step can be derived as a minibatch adaptation of MacQueen's algorithm [Mac67] (we refer to [CLR20]), and then combined with the contrast-function vectorization. Therefore Algorithm 1 is simple, fast and automatic once the desired vectorization length has been chosen.

Algorithm 1: Atol-vectorization
Data: a collection of measures X_1, ..., X_n ∈ (M_d)^n.
Parameters: budget b ∈ N*.
1  Quantization of the mean measure with fixed-length support (Lloyd's adaptation to measures): sample c = (c_1, ..., c_b) from X̄_n;
2  while c_new = (c_new_1, ..., c_new_b) ≠ c do
3    c = c_new;
4    ∀i ∈ [b], c_new_i := ( u ↦ u 1_{W_i(c)}(u) ) · X̄_n / X̄_n(W_i(c));
5  let ĉ_n be the resulting codebook; define the b measurable contrast functions (Ψ_1(·, ĉ_n), ..., Ψ_b(·, ĉ_n)) to compute the featurisation map v_Atol: X ↦ [Ψ_i(·, ĉ_n) · X]_{i ∈ [b]};
Result: vectorization map v_Atol: M_d → R^b.

Now let us introduce how it appears in machine-learning contexts.

2.2 Topological learning with Atol

Set in the context of a standard learning problem, we introduce Algorithm 2, Atol: Automatic Topologically-Oriented Learning. Let Ω := (X, y), with given observations X in some space 𝒳 corresponding to a known, partially available or hidden label y ∈ 𝒴. Assume that one has a way to extract topological features from X (for example a collection of diagrams associated to those elements), and let κ: 𝒳 → M_d be the corresponding map. Then applying Algorithm 1 to the resulting collection of descriptors provides some simplified topological understanding of the elements X of this problem.

Algorithm 2: Atol: Automatic Topologically-Oriented Learning
Data: learning problem Ω := (X, y) with X ∈ 𝒳 collections and y ∈ 𝒴 labels.
Parameters: κ: 𝒳 → M_d yielding topological descriptors, and budget b ∈ N*.
1  Compute the intermediate learning problem Ω_Topo := ((X, κ(X)), y) ∈ (𝒳 × M_d) × 𝒴 with topological features, potentially unfit for general machine learning routines;
2  Use Algorithm 1 to derive Euclidean representations of those features, i.e. transform Ω_Topo into a generic machine learning problem Ω̃ := ((X, v_Atol ∘ κ(X)), y) ∈ (𝒳 × R^b) × 𝒴;
Result: enhanced problem Ω̃ := ((X, v_Atol ∘ κ(X)), y) where v_Atol ∘ κ(X) ∈ R^b.

This algorithm is integrated in the open source topological library GUDHI [The20], accessible at https://gudhi.inria.fr/python/latest/representations.html. We point out that, as the embedding map v_Atol is automatically computed without knowledge of a learning task, its derivation is fully unsupervised. The representation is learned since it is data-dependent, but it is also agnostic to the task and eventually only depends on getting a glimpse at an average persistence diagram.

Atol in dimension 2 for persistence diagrams. We now specialise this algorithm to the context of persistent homology, which is usually set in dimension d = 2. Applying Algorithm 2 to a collection from M_2 such as persistence diagrams, as D ⊂ M_2, is straightforward and allows to embed the complex, unstructured space M_2 in Euclidean terms.

Now let us assume that the measures in M_2 come from distinct sources: the observed measures D_1, ..., D_n are sampled with noise from a mixture model D = Σ_{l=1}^{L} π_l D^(l) of distinct measures D^(1), ..., D^(L) (by that we mean that any two measures in this set differ in support by at least one point). Let Z be the latent variable of the mixture, so that D | Z = l ∼ D^(l). The following result ensures that v_Atol has separative power, i.e. that the vectorization clearly separates the different sources:

Theorem 1 (Separation with Atol).
For a given noise level, assuming E(D) satisfies some (explicit) margin condition, and for n and b large enough, there exists a non-empty segment for σ_1, ..., σ_b in Equation (1) such that for all (i, j) ∈ [n]², with high probability:

  Z_i = Z_j ⟹ ‖v_Atol(D_i) − v_Atol(D_j)‖_∞ ≤ 1/4,   (3)
  Z_i ≠ Z_j ⟹ ‖v_Atol(D_i) − v_Atol(D_j)‖_∞ ≥ 1/2.   (4)

To our knowledge this is the first time that a measure vectorization method (or a persistence diagram vectorization method) has been proven to separate clusters. This result follows from Corollary 19 in [CLR20], which studies theoretical properties of Atol-like procedures. The explicit statement of the assumptions and margin conditions is standard and rather technical, but the theory behind it uses an idealistic framework (including the so-called margin condition) under which such procedures will succeed in separating different sources. Based on this framework, the requirements of Theorem 1 cannot be checked in practice: apart from the technical margin condition, the prescribed bounds on the budget b are unknown and theoretically grow quite large with the number of underlying centers and the covering number, and the bounds for the bandwidths σ_1, ..., σ_b are heavily dependent on the structure of the source model D.

In practice we find it need not be so difficult: for intuition we refer to the ablation study on the influence of b exposed in Table 3, which shows it is easy to choose a low budget for efficient results, so we leave it as the only parameter of the algorithm. For full automaticity, a simple adaptive strategy would be to try a range of budgets during the training task, since the combination of Algorithm 1 and a standard learning algorithm such as random forests runs very fast. Furthermore, the adaptive strategy of Equation (2) for the bandwidths σ_1, ...
, σ_b proves efficient; see the bandwidth variation study of Section 3.2, Figure 3.

In dimension 2 this vectorization is conceptually close to two other recent works. [HKNU17] computes a persistence diagram vectorization through a deep learning layer that adjusts the Gaussian contrast functions used to produce topological signatures; in essence our approach substitutes quantization for deep learning, with no need for supervision and allowing mathematical guarantees. Next, the bag-of-words method of [ZLJ+19] uses an ad-hoc form of quantization for the space of diagrams, then count functions as contrast functions to produce histograms as topological signatures. Those are in fact sensible differences that ultimately translate in terms of effectiveness: Section 3.1 shows the Atol-featurisation to produce state-of-the-art mean accuracy on two difficult multi-class classification problems (67.1% on REDDIT5K and 51.4% on REDDIT12K) that are also analysed by those papers: [HKNU17] report mean accuracies of respectively 54.5% and 44.5%, and [ZLJ+19] report accuracies of respectively 49.9% and 38.6%.

3 Competitive TDA-Learning

In this section we show the Atol framework to be competitive, sometimes greatly improving on the state of the art, but also versatile and easy to use. This section presents experiments on two sorts of classification problems (graphs and point clouds); another applied experiment, on time series, is provided in the Supplementary Materials.

Algorithm 2 transforms the initial problem into a typically standard machine-learning problem, so the problem, although transformed, remains to be solved. In the instances below we use the scikit-learn [PVG+11] random-forest classification tool with 100 trees and all other parameters set as default.
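As a toy illustration of this pipeline, the following numpy sketch is entirely ours: the two-blob data, the identity descriptor map and the nearest-class-mean classifier stand in for the real datasets, κ maps and random forest used in the paper.

```python
import numpy as np

# End-to-end sketch of Algorithm 2: kappa maps each raw observation to a
# measure, measures are vectorized against a shared codebook, and a standard
# classifier consumes the resulting vectors.
rng = np.random.default_rng(1)

def kappa(obs):             # topological descriptor map (identity here)
    return obs

def make_obs(label, m=40):  # two synthetic sources with different locations
    center = np.array([0.3, 0.3]) if label == 0 else np.array([0.7, 0.7])
    return center + 0.1 * rng.standard_normal((m, 2))

X = [make_obs(l) for l in (0, 1) * 50]
y = np.array([0, 1] * 50)

# quantize the empirical mean measure (one sampling pass for brevity)
points = np.vstack([kappa(x) for x in X])
codebook = points[rng.choice(len(points), 8, replace=False)]
gaps = np.linalg.norm(codebook[:, None] - codebook[None, :], axis=-1)
np.fill_diagonal(gaps, np.inf)
sigma = gaps.min(1) / 2

def v_atol(x):
    d = np.linalg.norm(kappa(x)[:, None] - codebook[None, :], axis=-1)
    return np.exp(-d / sigma).sum(0)

V = np.array([v_atol(x) for x in X])
means = np.array([V[y == c].mean(0) for c in (0, 1)])
pred = np.linalg.norm(V[:, None] - means[None, :], axis=-1).argmin(1)
accuracy = (pred == y).mean()
```

Any off-the-shelf classifier can replace the nearest-class-mean rule on the vectors `V`; the point is that the vectorization, not the classifier, carries the structure.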
We use random forests as a ready-to-use tool, but comparable performances can be obtained using a linear SVM classifier or a neural network classifier, depending on the problem. It is a light choice that requires no particular infrastructure or tuning efforts that would produce overly design-dependent results, as our ambition is to show an ability to perform well overall.

3.1 Graph Classification

As learning problems involving graph data are receiving strong interest at the moment, consider a standard graph classification framework: Ω := (G, y) ∈ 𝒢 × 𝒴 is a finite family of graphs with available labels, and one learns to map 𝒢 → 𝒴.

Recently [CCI+19] introduced a powerful way of extracting topological information from graph structures. They make use of heat kernel signatures (HKS) for graphs [HRG14], a spectral family of signatures (with diffusion parameter t > 0) whose topological structure can be encoded in the extended persistence framework, yielding four types of topological features with exclusively finite persistence. We replicate their methodology; for both HKS and extended persistence we refer to Sections 4.2 and 2 of [CCI+19]. Schematically, for diffusion time t > 0 and graph G(V, E) (with V, E the sets of vertices and edges), the topological descriptors are computed as the composition

  κ_t := h ∘ g,   (5)

where

  g: G(V, E) ∈ 𝒢 ↦ HKS_t(G) ∈ R^{|V|}   (heat kernel signatures),   (6)
  h: HKS_t(G) ↦ PD(HKS_t(G)) ∈ D⁴   (extended persistence).   (7)

For the entire set of problems to come, we choose to use the same two HKS diffusion times t_1 = 0.1 and t_2 = 10, so that Algorithm 2 is used with the topological map κ := κ_{t_1} + κ_{t_2}: 𝒢 → D⁸; all in all, 8 persistence diagrams are computed and considered per graph.
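For intuition on the map g: heat kernel signatures admit a short spectral implementation, with HKS_t(v) = Σ_k exp(−t λ_k) φ_k(v)² for (λ_k, φ_k) the eigenpairs of the graph Laplacian. The sketch below is ours, on a toy 5-cycle rather than one of the benchmark graphs:

```python
import numpy as np

# Heat kernel signature of a graph from the spectrum of its combinatorial
# Laplacian L = D - A. The 5-cycle below is an illustrative toy graph.
A = np.zeros((5, 5))
for i in range(5):                       # adjacency matrix of a 5-cycle
    A[i, (i + 1) % 5] = A[(i + 1) % 5, i] = 1
L = np.diag(A.sum(1)) - A                # combinatorial graph Laplacian

def hks(L, t):
    """HKS_t(v) = sum_k exp(-t * lambda_k) * phi_k(v)**2, one value per vertex."""
    lam, phi = np.linalg.eigh(L)         # eigenpairs of the symmetric Laplacian
    return (np.exp(-t * lam) * phi ** 2).sum(1)

signature = hks(L, t=0.1)                # vector in R^{|V|}
```

On a vertex-transitive graph like this cycle, the signature is constant across vertices; on the benchmark graphs its level sets are what extended persistence summarises.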
For the budget in Algorithm 2 we choose b = 80 for all experiments, which means Algorithm 1 will rely on approximating the mean measure with ten points per diagram type and HKS time filtration. We make no use of the graph attributes on edges or vertices that some datasets possess, and no other sort of features is collected, so our results are solely based on the topological graph structure of the problems. To sum up, Algorithm 2 here simply consists in reducing the original problem from Ω to Ω̃ := (v_Atol ∘ κ(G), y) with v_Atol ∘ κ(G) ∈ R^80. The embedding map v_Atol from Algorithm 1 is computed using only 10% of all diagrams from the training set, without supervision.

On each problem we perform a 10-fold cross-validation procedure and average the resulting accuracies; we report accuracies and standard deviations over ten such experiments. We use two sets of graph classification problems for benchmarking, one of Social Network origin and one of Chemoinformatics and Bioinformatics origin.

problem                  | RetGK [ZWX+18] | FGSD [VZ17] | WKPI [ZW19] | GNTK [DHS+19] | PersLay [CCI+19] | Atol
REDDIT (5K, 5 classes)   | 56.1 ± .5      | 47.8        | 59.5 ± .6   | —             | 55.6 ± .3        | 67.1 ± .3
REDDIT (12K, 11 classes) | 48.7 ± .2      | —           | 48.5 ± .5   | —             | 47.7 ± .2        | 51.4 ± .2
COLLAB (5K, 3 classes)   | 81.0 ± .3      | 80.0        | —           | 83.6 ± .1     | 76.4 ± .4        | 88.3 ± .2
IMDB-B (1K, 2 classes)   | 71.9 ± 1.      | 73.6        | 75.1 ± 1.1  | 76.9 ± 3.6    | 71.2 ± .7        | 74.8 ± .3
IMDB-M (1.5K, 3 classes) | 47.7 ± .3      | 52.4        | 48.4 ± .5   | 52.8 ± 4.6    | 48.8 ± .6        | 47.8 ± .7

Table 1: Mean accuracy and standard deviations for Large Social Network datasets.

They include small and large sets of graphs (MUTAG has 188 graphs, REDDIT12K has 12000), small
and large graphs (IMDB-M has 13 nodes on average, REDDIT5K more than 500), dense and sparse graphs (FRANKENSTEIN has around 12 edges per node, COLLAB more than 2000), and binary and multi-class problems (REDDIT12K has 11 classes), all available in the public repository [KKM+16]. Computations are run on a single laptop (i5-7440HQ 2.80 GHz CPU), in batch version for datasets smaller than a thousand observations and mini-batch version otherwise. Average computing times of Algorithm 1 (the average time to calibrate the vectorization map on the training set, then compute the vectorization on the entire dataset) are: less than 1 second for datasets with less than a thousand observations, less than 5 seconds for datasets with less than five thousand observations, 7.5 seconds for REDDIT-5K, and less than 16 seconds for the largest problem, REDDIT-12K, and the densest, COLLAB.

We compare performances to the top-scoring methods for these problems, to the best of our knowledge. Those methods are mostly graph kernel methods tailored to graph problems: two graph kernel methods based on random walks (RetGKI, RetGKII from [ZWX+18]), one graph embedding method based on spectral distances (FGSD from [VZ17]), two topological graph kernel methods (WKPI-kM and WKPI-kC from [ZW19]), and one graph kernel combined with a graph neural network (GNTK from [DHS+19]). Finally, PersLay from [CCI+19] is a topological vectorization method learnt by a neural network that encodes most topological frameworks from the literature – landscapes, silhouettes, persistence images, etc.
Note that the comparisons to PersLay were computed with the exact same persistence diagrams in most cases (except for a few cases where the authors used those same two HKS diffusion times then discarded one with no loss of performance), and the total budget for Atol (b = 80) is a magnitude below that required to build the PersLay architecture (several hundred nodes before the optimisation phase). Competitor accuracies are quoted from their respective publications and should be interpreted as follows: for RetGK, WKPI and PersLay the evaluation procedure is done over ten 10-folds, just as ours is, so the results compare directly; for FGSD the average accuracy over a single 10-fold is reported, and for GNTK the average accuracy and deviations are reported over a single 10-fold as well. When there are two or more methods under one label (e.g. RetGKI and RetGKII), we always favorably reported the best outcome.

Our results in Table 1 are state-of-the-art, or substantially improve the state of the art, on the Large Social Network datasets, which are difficult multi-class problems. The REDDIT and COLLAB datasets all see major improvements in mean accuracy, and those three datasets can readily be considered the most difficult problems (by size, graph density and number of classes) in the entire series. The results on the Chemoinformatics and Bioinformatics datasets, Table 2, are on par with or sometimes below the state of the art, with a significant achievement on DHFR. It is not surprising that Atol is not always on par with the state of the art, especially on the smaller binary classification datasets, where considering the mean measure can potentially be too simple a model, easily refined upon – recall that, contrary to competitors, Atol does not build a kernel or a neural network.
Quantisation without supervision makes the learning process stricter, heavily dependent on the measure input: Atol is only capable of interpreting behaviour with respect to the mean measure, therefore if some discriminant feature in a problem is found at a border of the measure space, as could be happening on the PROTEINS and NCI datasets, it shall not be captured and there is no learning room to change that. This is a liability, as well as a virtue, of the method. We surmise that the Atol performances can be interpreted as a general optimal score for discriminative capacity with respect to the mean, in a problem. So for instance on the IMDB datasets, potentially whatever is gained on top of this baseline is obtained through supervision, at the cost of general discriminative power.

The simplicity and absence of tuning indicate robustness and generalisation power. Overall these results are especially positive seeing how Algorithm 1 has been employed with a universal configuration.

problem (size)   | RetGK [ZWX+18] | FGSD [VZ17] | WKPI [ZW19] | GNTK [DHS+19] | PersLay [CCI+19] | Atol
MUTAG (188)      | 90.3 ± 1.1     | 92.1        | 88.3 ± 2.6  | 90.0 ± 8.5    | 89.8 ± .9        | 88.3 ± .8
COX2 (467)       | 81.4 ± .6      | —           | —           | —             | 80.9 ± 1.        | 79.4 ± .7
DHFR (756)       | 81.5 ± .9      | —           | —           | —             | 80.3 ± .8        | 82.7 ± .7
PROTEINS (1113)  | 78.0 ± .3      | 73.4        | 78.5 ± .4   | 75.6 ± 4.2    | 74.8 ± .3        | 71.4 ± .6
NCI1 (4110)      | 84.5 ± .2      | 79.8        | 87.5 ± .5   | 84.2 ± 1.5    | 73.5 ± .3        | 78.5 ± .3
NCI109 (4127)    | —              | 78.8        | 87.4 ± .3   | —             | 69.5 ± .3        | 77.0 ± .3
FRNKNSTN (4337)  | 76.4 ± .3      | —           | —           | —             | 70.7 ± .4        | 72.9 ± .3

Table 2: Mean accuracy and standard deviations for Chemoinformatics and Bioinformatics datasets, all binary classification problems.

3.2 A measured look at discrete dynamical systems

We now show the modularity capacity of the Atol framework, as well as its efficiency in compactly encoding information.

Figure 2: Synthetised orbit x, y coordinates in [0, 1]² for parameter 4.0 (top) and 4.
1 (bottom).

[AEK+17] use a synthetic discrete dynamical system (used to model flows in DNA microarrays) with the following property: the resulting chaotic trajectories exhibit distinct topological characteristics depending on a parameter r > 0. The dynamical system is:

  x_{n+1} := x_n + r y_n (1 − y_n) mod 1,
  y_{n+1} := y_n + r x_{n+1} (1 − x_{n+1}) mod 1.

With random initialisation and five different parameters r ∈ {2.5, 3.5, 4, 4.1, 4.3}, a thousand iterations per trajectory and a thousand orbits per parameter, a dataset of five thousand orbits is built. Figure 2 shows a few orbits generated with parameters r ∈ {4.0, 4.1}. For orbits generated with parameter r = 4.1, it happens that the initialisation spawns close to an attractor point, which gives the special shape seen in the leftmost orbit. The problem of classifying this dataset according to the underlying parameter is rather uneasy and challenging. This dataset is commonly used for evaluating topological methods under the following experimental setup: a learning phase with a 70/30 split, and accuracy with standard deviation computed over a hundred such experiments. The state-of-the-art accuracy of 87.7 ± 1.0 with persistence diagrams is reported in [CCI+19].

Since those discrete orbits can be seen as measures in [0, 1]², we instead decide to directly apply Algorithm 2 to the observed point cloud, using the modularity of our framework – so in this instance κ is the identity map. Therefore Atol is used here as a purely spatial approach, and in this context it is alike an image classification algorithm where, instead of a fixed grid, we have learnt center points to perform measurements. We present results in the form of a short ablation study, Table 3, designed to illustrate the influence of the small number of parameters of Algorithm 1.
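The orbit synthesis above can be sketched directly from the recurrence (a minimal implementation of our own; the dataset sizes are reduced for illustration):

```python
import numpy as np

# Generate chaotic orbits of the coupled recurrence: note that y_{n+1}
# uses the already-updated x_{n+1}, as in the system's definition.
rng = np.random.default_rng(2)

def orbit(r, n_iter=1000):
    x, y = rng.random(2)                  # random initialisation in [0, 1)^2
    pts = np.empty((n_iter, 2))
    for n in range(n_iter):
        x = (x + r * y * (1 - y)) % 1.0   # x_{n+1}
        y = (y + r * x * (1 - x)) % 1.0   # y_{n+1}, uses x_{n+1}
        pts[n] = x, y
    return pts

# 10 orbits per parameter here instead of the paper's 1000
dataset = {r: [orbit(r) for _ in range(10)] for r in (2.5, 3.5, 4.0, 4.1, 4.3)}
```

Each orbit is then treated as a point cloud in [0, 1]², i.e. directly as an input measure for Algorithm 2 with κ the identity.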
In this study we consider varying the parameter b ∈ N* describing the measure space; replacing the contrast functions Ψ_i of Equation (1) with Φ_i(·, ĉ_n) : x ↦ exp(−‖x − c_i‖₂² / σ_i²(ĉ_n)) for the vectorization of the quantised space; and lastly changing the proportion of training observations used for deriving the quantization, with 10% indicating that a random selection of a tenth of the measures from the training set was used to calibrate Algorithm 1. We measure accuracies over 10 70/30 splits and, for comparison purposes, we also compute results for a 2D-grid quantization scheme labeled grid, which uses the same contrast family and a regular grid of size ⌊√b⌋ × ⌊√b⌋.

It is expected that a higher budget for vectorising the measure space will yield a better description of said space, and this intuition is confirmed by Table 3. Although the differences are small, there is a slight advantage to operating Atol over a fixed grid at lower budgets, which is coherent with the intuition that the mean measure performs better than other procedures as a first approximation. Next, there do not seem to be significant differences between Gaussian and Laplacian contrast functions in this experiment, although this can be the case on other problems. Understanding the ability of such contrast functions to describe some particular observation space is challenging and left for future work.
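The calibrate-then-vectorize pipeline can be sketched as follows. This is an illustrative approximation, not the paper's exact procedure: plain k-means (Lloyd's algorithm) stands in for the paper's optimal quantization of the mean measure, the adaptive bandwidth rule (half the distance from each center to its nearest other center) is one plausible reading of Equation (2), and all function names are our own.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def atol_fit(measures, budget=16, seed=0):
    """Learn a codebook of `budget` centers by pooling all points of the
    training measures and quantising with k-means (a stand-in for the
    paper's optimal quantization of the mean measure)."""
    pooled = np.concatenate(measures)
    km = KMeans(n_clusters=budget, n_init=10, random_state=seed).fit(pooled)
    centers = km.cluster_centers_
    # Adaptive bandwidths: half the distance to the nearest other center,
    # an assumed reading of the paper's Equation (2).
    d = cdist(centers, centers)
    np.fill_diagonal(d, np.inf)
    sigmas = d.min(axis=1) / 2.0
    return centers, sigmas

def atol_vectorize(measure, centers, sigmas, contrast="laplacian"):
    """Map one measure (an (n, d) point set) to a budget-dimensional
    vector by summing contrast-function evaluations at each center."""
    d = cdist(measure, centers)
    if contrast == "laplacian":
        feats = np.exp(-d / sigmas)          # Psi_i, Equation (1)
    else:
        feats = np.exp(-d**2 / sigmas**2)    # Gaussian variant Phi_i
    return feats.sum(axis=0)

# Demo on toy "measures": three random point clouds in the plane.
rng = np.random.default_rng(0)
train = [rng.random((50, 2)) for _ in range(3)]
centers, sigmas = atol_fit(train, budget=4)
vec = atol_vectorize(train[0], centers, sigmas)
```

The resulting fixed-length vectors can then be fed to any standard classifier, which is how the accuracies of Table 3 would be obtained.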
method          | b = 4      | b = 16     | b = 36     | b = 100   | Φ-Gaussian | Ψ-Laplacian | 10%       | 100%
Atol (accuracy) | 56.3 ± 1.6 | 83.1 ± 2.2 | 89.6 ± 1.3 | 93.8 ± .8 | 93.8 ± .5  | 93.8 ± .8   | 93.8 ± .8 | 93.6 ± .4
Atol (time)     | 2.4 s      | 3.1 s      | 5.5 s      | 12.7 s    | 12.7 s     | 12.7 s      | 12.7 s    | 50.2 s
grid (accuracy) | 55.8 ± 1.1 | 82.7 ± .8  | 88.9 ± 1.0 | 93.8 ± .7 | 94.2 ± .5  | 93.8 ± .7   | 93.8 ± .7 | 93.8 ± .7

Table 3: Mean accuracy, deviation and vectorization time (including the calibration step) over 10 experiments for ORBIT5K, varying the budget b, the contrast functions, and the calibration proportion. Blue indicates the default parameters; only one parameter is varied at a time.

Lastly, the percentage of observations used in the calibration part of the algorithm does not have a strong influence on the final result either (it does have a significant influence when the budget is lower than 80). This tells us that the calibration step is rather stable for a given level of information in a problem, and that our procedure is well designed for dealing with problems online. Finally, we report that when using budgets greater than 250 (i.e. finer than 12 × 12 for the regular grid), both methods reach comparable mean accuracies above 95%. This indicates that this problem can be precisely described by a purely spatial approach, without topological descriptors.

Lastly, using the default parameters of Table 3, we compare the adaptive strategy of Equation (2) for bandwidths σ_1, ..., σ_b to using identical constant values for those bandwidths. For investigating constant values we use the array μ × 10^[−2, −1.5, −1, −0.5, −0.2, −0.1, 0, 0.1, 0.2, 0.5, 1, 1.5, 2], where μ is the average distance between codebook points of ĉ_n. The results are shown in Figure 3. This experiment shows that even if a constant bandwidth value can be found that yields optimal results, the adaptive strategy introduced in this paper already produces competitive results effortlessly.
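The grid baseline of Table 3 and the constant-bandwidth alternative of Figure 3 can be sketched as follows. Function names are our own, and the grid is assumed to cover [0, 1]^2, which matches the orbit data.

```python
import numpy as np
from scipy.spatial.distance import cdist

def grid_codebook(b):
    """Regular floor(sqrt(b)) x floor(sqrt(b)) grid of centers over
    [0, 1]^2, the 'grid' quantization baseline of Table 3.
    Centers are placed at the midpoints of the grid cells."""
    k = int(np.floor(np.sqrt(b)))
    ticks = (np.arange(k) + 0.5) / k
    xx, yy = np.meshgrid(ticks, ticks)
    return np.stack([xx.ravel(), yy.ravel()], axis=1)

def constant_bandwidths(centers, exponent):
    """Constant-sigma alternative of Figure 3: every bandwidth is set to
    mu * 10**exponent, where mu is the average pairwise distance
    between codebook points."""
    d = cdist(centers, centers)
    mu = d[np.triu_indices(len(centers), k=1)].mean()
    return np.full(len(centers), mu * 10.0**exponent)

# The sweep of exponents used for the constant-bandwidth comparison.
exponents = [-2, -1.5, -1, -0.5, -0.2, -0.1, 0, 0.1, 0.2, 0.5, 1, 1.5, 2]
```

Sweeping `exponents`, vectorizing with each constant bandwidth, and scoring a downstream classifier would reproduce the shape of the comparison in Figure 3.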
4 Conclusion

This paper introduces an unsupervised vectorization method for measures in Euclidean spaces based on optimal quantization procedures, then shows how this method can be employed in machine learning contexts to exploit topological information. Atol is fast, has a simple design, is multifaceted, and ties theoretical guarantees to practical efficiency.

From a practical viewpoint, we can expect our method to be less prone to bias and overfitting for two reasons: the centers are designed without supervision (thus no possible overfitting) and the dimension of our vectorization is credibly low. Furthermore, effective insight can be gained as the method is interpretable: once the mean measure is computed, one can observe its location (e.g. for diagrams, whether centers are close to the diagonal or not) and derive further information from it (e.g. in a classification task, center importance). For instance, on a particular dataset of diagrams, one can learn whether low-persistence points are meaningful signal or not, with no preconceived hypothesis. Lastly, Atol only depends on a single parameter: the size b of the codebook.

Figure 3: Classification accuracy and deviations for ORBIT5K as σ_1, ..., σ_b = σ is varied (in blue), compared to the adaptive strategy of Equation (2) (in orange).

References

[AEK+17] Henry Adams, Tegan Emerson, Michael Kirby, Rachel Neville, Chris Peterson, Patrick Shipman, Sofya Chepushtanova, Eric Hanson, Francis Motta, and Lori Ziegelmeier. Persistence images: a stable vector representation of persistent homology. Journal of Machine Learning Research, 18(8), 2017.

[AVRT16] R. Anirudh, V. Venkataraman, K. N. Ramamurthy, and P. Turaga. A Riemannian framework for statistical analysis of topological persistence diagrams. In 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1023-1031, June 2016.
[BCY18] Jean-Daniel Boissonnat, Frédéric Chazal, and Mariette Yvinec. Geometric and Topological Inference, volume 57. Cambridge University Press, 2018.

[CCI+19] Mathieu Carrière, Frédéric Chazal, Yuichi Ike, Théo Lacombe, Martin Royer, and Yuhei Umeda. PersLay: a simple and versatile neural network layer for persistence diagrams. AISTATS 2020, arXiv:1904.09378, April 2019.

[CCO17] Mathieu Carrière, Marco Cuturi, and Steve Oudot. Sliced Wasserstein kernel for persistence diagrams. In International Conference on Machine Learning, volume 70, pages 664-673, July 2017.

[CD18] Frédéric Chazal and Vincent Divol. The density of expected persistence diagrams and its kernel based estimation. In SoCG 2018 - Symposium of Computational Geometry, Budapest, Hungary, June 2018. Extended version of the SoCG proceedings, submitted to a journal.

[CdSGO16] Frédéric Chazal, Vin de Silva, Marc Glisse, and Steve Oudot. The Structure and Stability of Persistence Modules. SpringerBriefs in Mathematics. Springer, 2016.

[CLR20] Frédéric Chazal, Clément Levrard, and Martin Royer. Optimal quantization of the mean measure and application to clustering of measures. arXiv preprint arXiv:2002.01216, 2020.

[CS19] Alex Cole and Gary Shiu. Topological data analysis for the string landscape. Journal of High Energy Physics, 2019, March 2019.

[DHS+19] Simon S. Du, Kangcheng Hou, Russ R. Salakhutdinov, Barnabas Poczos, Ruosong Wang, and Keyulu Xu. Graph neural tangent kernel: fusing graph neural networks with graph kernels. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 5724-5734. Curran Associates, Inc., 2019.

[DUC20] Meryll Dindin, Yuhei Umeda, and Frédéric Chazal.
Topological data analysis for arrhythmia detection through modular neural networks. In Cyril Goutte and Xiaodan Zhu, editors, Advances in Artificial Intelligence, pages 177-188, Cham, 2020. Springer International Publishing.

[Dup18] Ludovic Duponchel. Exploring hyperspectral imaging data sets with topological data analysis. Analytica Chimica Acta, 1000:123-131, 2018.

[EH10] Herbert Edelsbrunner and John Harer. Computational Topology: An Introduction. AMS, 2010.

[HKN19] Christoph Hofer, Roland Kwitt, and Marc Niethammer. Graph filtration learning, 2019.

[HKNU17] Christoph Hofer, Roland Kwitt, Marc Niethammer, and Andreas Uhl. Deep learning with topological signatures. In Advances in Neural Information Processing Systems, pages 1634-1644, 2017.

[HRG14] Nan Hu, Raif Rustamov, and Leonidas Guibas. Stable and informative spectral signatures for graph matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2305-2312, 2014.

[KHF16] Genki Kusano, Yasuaki Hiraoka, and Kenji Fukumizu. Persistence weighted Gaussian kernel for topological data analysis. In International Conference on Machine Learning, volume 48, pages 2004-2013, June 2016.

[KKM+16] Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark data sets for graph kernels, 2016. http://graphkernels.cs.tu-dortmund.de.

[KMP18] Firas A. Khasawneh, Elizabeth Munch, and Jose A. Perea. Chatter classification in turning using machine learning and topological data analysis. IFAC-PapersOnLine, 51(14):195-200, 2018. 14th IFAC Workshop on Time Delay Systems TDS 2018.

[Llo82] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Trans. Inform. Theory, 28(2):129-137, 1982.

[Mac67] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. Fifth Berkeley Sympos. Math. Statist.
and Probability (Berkeley, Calif., 1965/66), Vol. I: Statistics, pages 281-297. Univ. California Press, Berkeley, Calif., 1967.

[PKP+19] Jeremy A. Pike, Abdullah O. Khan, Chiara Pallini, Steven G. Thomas, Markus Mund, Jonas Ries, Natalie S. Poulter, and Iain B. Styles. Topological data analysis quantifies biological nano-structure from single molecule localization microscopy. Bioinformatics, 36(5):1614-1621, October 2019.

[PMK19] J. Perea, E. Munch, and Firas A. Khasawneh. Approximating continuous functions on persistence diagrams using template functions. ArXiv, abs/1902.07190, 2019.

[PVG+11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

[RHBK15] Jan Reininghaus, Stefan Huber, Ulrich Bauer, and Roland Kwitt. A stable multi-scale kernel for topological machine learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[The20] The GUDHI Project. GUDHI User and Reference Manual. GUDHI Editorial Board, 3.3.0 edition, 2020.

[VZ17] Saurabh Verma and Zhi-Li Zhang. Hunt for the unique, stable, sparse and fast feature learning on graphs. In Advances in Neural Information Processing Systems, pages 88-98, 2017.

[ZKR+17] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3391-3401, 2017.

[ZLJ+19] Bartosz Zieliński, Michał Lipiński, Mateusz Juda, Matthias Zeppelzauer, and Paweł Dłotko. Persistence bag-of-words for topological data analysis.
In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 4489-4495. International Joint Conferences on Artificial Intelligence Organization, July 2019.

[ZW19] Qi Zhao and Yusu Wang. Learning metrics for persistence-based summaries and applications for graph classification. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 9855-9866. Curran Associates, Inc., 2019.

[ZWX+18] Zhen Zhang, Mianzhi Wang, Yijian Xiang, Yan Huang, and Arye Nehorai. RetGK: graph kernels based on return probabilities of random walks. In Advances in Neural Information Processing Systems, pages 3968-3978, 2018.
