Statistical Analysis of Metric Graph Reconstruction

Fabrizio Lecci lecci@cmu.edu
Alessandro Rinaldo arinaldo@cmu.edu
Larry Wasserman larry@cmu.edu
Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Abstract

A metric graph is a 1-dimensional stratified metric space consisting of vertices and edges or loops glued together. Metric graphs can be naturally used to represent and model data that take the form of noisy filamentary structures, such as street maps, neurons, networks of rivers and galaxies. We consider the statistical problem of reconstructing the topology of a metric graph embedded in $\mathbb{R}^D$ from a random sample. We derive lower and upper bounds on the minimax risk for the noiseless case and the tubular noise case. The upper bound is based on the reconstruction algorithm given in Aanjaneya et al. (2012).

Keywords: Metric Graph, Filament, Reconstruction, Manifold Learning, Minimax Estimation

1. Introduction

We are concerned with the problem of estimating the topology of filamentary data structures. Datasets consisting of points roughly aligned along intersecting or branching filamentary paths embedded in 2 or higher dimensional spaces have become an increasingly common type of data in a variety of scientific areas. For instance, road reconstruction based on GPS traces, localization of earthquake faults and galaxy reconstruction are all instances of a more general problem of estimating basic topological features of an underlying filamentary structure. The recent paper by Aanjaneya et al. (2012), upon which our work is based, contains further applications, as well as numerous references.

To provide a more concrete example, consider Figure 1. The left hand side displays raw data portraying a neuron from the hippocampus of a rat (Gulyás et al., 1999). The data were obtained from NeuroMorpho.Org (Ascoli et al., 2007).
The right hand side of the figure shows the output of the metric graph reconstruction obtained using the algorithm analyzed in this paper, originally proposed by Aanjaneya et al. (2012). The reconstruction, which takes the form of a graph, captures perfectly all the topological features of the neuron, namely, the relationship between the edges and vertices, the number of branching points and the degree of each node.

Metric graphs provide the natural geometric framework for representing intersecting filamentary structures. A metric graph embedded in a $D$-dimensional Euclidean space ($D \ge 2$) is a 1-dimensional stratified metric space. It consists of a finite number of points (0-dimensional strata) and curves (1-dimensional strata) of finite length, where the boundary of each curve is given by a pair of (not necessarily distinct) vertices (see the next section for a formal definition of a metric graph).

© 2013 Fabrizio Lecci, Alessandro Rinaldo and Larry Wasserman.

In this paper we study the problem of reconstructing the topology of metric graphs from possibly noisy data, from a statistical point of view. Specifically, we assume that we have a sample of points from a distribution supported on a metric graph or in a small neighborhood of it, and we are interested in recovering the topology of the corresponding metric graph. To this end, we use the metric graph reconstruction algorithm given in Aanjaneya et al. (2012). Furthermore, in our theoretical analysis we characterize explicitly the minimal sample size required for perfect topological reconstruction as a direct function of parameters defining the shape of the metric graph, introduced in Section 2. This leads to an upper bound on the risk of topological reconstruction.
Finally, we obtain a lower bound on the risk of topological reconstruction, which, in the noiseless case, almost matches the derived upper bound, indicating that the algorithm of Aanjaneya et al. (2012) behaves nearly optimally.

Outline. In Section 2 we formally define metric graphs, the statistical models we will consider and the assumptions we will use throughout. We will also describe several geometric quantities that are central to our analysis. Section 3 contains a detailed analysis of the performance of the algorithm of Aanjaneya et al. (2012) for metric graph reconstruction, under modified settings and assumptions. In Section 4 we derive lower and upper bounds for the minimax risk of the metric graph reconstruction problem. In Section 5 we conclude with some final comments.

Related Work. The work most closely related to ours is Aanjaneya et al. (2012) which was, in fact, the motivation for our work. From the theoretical side, we replace the key assumption in Aanjaneya et al. (2012) of the sample being an $(\varepsilon, R)$-approximation to the underlying metric graph, by the milder assumption of the sample being dense in a neighborhood of the metric graph. Approximation and reconstruction of metric graphs has also been considered in Chazal and Sun (2013) and Ge et al. (2011). Metric graph reconstruction is related to the problem of estimating stratified spaces (basically, intersecting manifolds). Stratified spaces have been studied by a number of authors such as Bendich et al. (2010, 2012); Bendich (2008). A spectral method for estimating intersecting structures is given in Arias-Castro et al. (2011). There are a variety of algorithms for specific problems, for example, see Ahmed and Wenk (2012); Chen et al. (2010) for the reconstruction of road networks. Finally, Chernov and Kurlin (2013) derived an alternative algorithm that uses ideas from homology.

2. Background and Assumptions

The assumptions in Aanjaneya et al.
(2012) lead to a reconstruction process that is aimed at capturing the intrinsic structure of the data and is somewhat oblivious to its extrinsic embedding. The authors assume that the sample comes with a metric that is close to the intrinsic metric of the underlying graph, by imposing a limit on the Gromov-Hausdorff distance between the two metrics. By considering data embedded in the Euclidean space and focusing on the topological aspect, we show that the notion of dense sample is sufficient to guarantee a correct reconstruction.

Figure 1: Left: Neuron cr22e from the hippocampus of a rat; NeuroMorpho.Org (Ascoli et al., 2007). Right: A metric graph reconstruction of the neuron.

In this section we provide background on metric graph spaces and describe the assumptions and the geometric parameters that we will be using throughout.

Informally, a metric graph is a collection of vertices and edges glued together in some fashion. Here we state the formal definitions of path metric space and metric graph. For more details see Aanjaneya et al. (2012) and Kuchment (2004).

Definition 1 A metric space $(G, d_G)$ is a path metric space if the distance between any pair of points is equal to the infimum of the lengths of the continuous curves joining them. A metric graph is a path metric space $(G, d_G)$ that is homeomorphic to a 1-dimensional stratified space. A vertex of $G$ is a 0-dimensional stratum of $G$ and an edge of $G$ is a 1-dimensional stratum of $G$.

We will consider metric graphs embedded in $\mathbb{R}^D$. Note that, if one ignores the metric structure, namely the length of edges and loops, the shape or topology of a metric graph $(G, d_G)$ is encoded by a graph, whose vertices and edges correspond to vertices and edges of $G$. Since we allow for two vertices to be connected by more than one edge we are actually dealing with pseudographs.
We recall that an undirected pseudograph $(V, E)$ consists of a set of vertices $V$ and a multiset $E$ of unordered pairs of (not necessarily distinct) vertices. To a given pseudograph we can associate a function $f : E \to V \times V$, which, when applied to an edge $e \in E$, simply extracts the vertices to which $e$ is incident. Thus, if $e_1, e_2 \in E$ are such that $f(e_1) = f(e_2)$, then $e_1$ and $e_2$ are parallel edges. Similarly, if $e \in E$ is such that $f(e) = \{v, v\}$ for some $v \in V$, then $e$ is a loop. For each pair $(u, v) \in V \times V$, let $\nu(u, v) = |f^{-1}(\{u, v\})|$ if $\{u, v\} \in E$ and 0 otherwise. In particular, $\nu(u, v)$ is the number of edges between $u$ and $v$ (or loops if $u = v$).

We say that a metric graph reconstruction algorithm perfectly recovers the topology of $G$ if it outputs a pseudograph isomorphic to the pseudograph representing the topology of $G$.

We now define some key quantities regarding the structure of a metric graph. We start with the definition of reach. Let $M$ be a 1-dimensional manifold embedded in $\mathbb{R}^D$. Let $T_u M$ denote the 1-dimensional tangent space to $M$ at $u$ and let $T_u^{\perp} M$ be the $(D-1)$-dimensional normal space.

Definition 2 Define the fiber of size $a$ at $u \in M$ to be $L_a(u, M) = T_u^{\perp} M \cap B(u, a)$, where $B(u, a)$ is the $D$-dimensional ball of radius $a$ centered at $u$. If $M$ has boundary $\{v_1, v_2\}$, the fiber of size $a$ at $v_i$ is defined as the limit of $L_a(u, M)$, as $u$ approaches $v_i$ in $M \setminus \{v_1, v_2\}$. The reach of $M$ is the largest number $\tau$ such that the fibers $L_\tau(u, M)$ never intersect.

The reach sets a limit on the curvature of a manifold. A manifold with large reach does not come too close to being self-intersecting. For example, the reach of an arc of a circle is equal to its radius. The quantity $1/\tau$ is called the condition number in Niyogi et al. (2008). For more details see also Federer (1959); Chazal and Lieutier (2006); Genovese et al. (2012a).
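The pseudograph bookkeeping defined earlier in this section (the multiset $E$ and the multiplicity $\nu(u, v)$) can be illustrated with a minimal Python sketch. The helper names `edge_key`, `multiplicity` and `nu`, and the example graph, are hypothetical and chosen only for illustration; they are not part of the reconstruction algorithm.

```python
from collections import Counter


def edge_key(u, v):
    """Represent the unordered pair {u, v} as a sorted tuple."""
    return tuple(sorted((u, v)))


def multiplicity(edges):
    """Map each unordered pair of vertices to its number of edges."""
    return Counter(edge_key(u, v) for u, v in edges)


def nu(counts, u, v):
    """nu(u, v): number of edges between u and v (loops if u == v)."""
    return counts.get(edge_key(u, v), 0)


# Hypothetical pseudograph: two parallel edges a-b, one loop at b, one edge b-c.
edges = [("a", "b"), ("b", "a"), ("b", "b"), ("b", "c")]
counts = multiplicity(edges)
```

Here `nu(counts, "a", "b")` counts the two parallel edges, `nu(counts, "b", "b")` counts the loop, and a pair with no edge has multiplicity 0.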
Each edge of a metric graph $(G, d_G)$ can be seen as a 1-dimensional manifold with boundary. Let the local reach of the metric graph $G$ be the minimum reach associated to an edge of $G$.

When 2 edges intersect at a vertex $v$ they create an angle, where the angle between two intersecting curves is formally defined as follows. Suppose that $e_1$ and $e_2$ intersect at $x$. Let $B(x, \varepsilon)$ be the $D$-dimensional ball of radius $\varepsilon$ centered at $x$. Let $\ell_1(\varepsilon)$ be the line segment joining the two points $x$ and $\partial B(x, \varepsilon) \cap e_1$. Let $\ell_2(\varepsilon)$ be the line segment joining the two points $x$ and $\partial B(x, \varepsilon) \cap e_2$. Let $\alpha_\varepsilon(e_1, e_2)$ be the angle between $\ell_1(\varepsilon)$ and $\ell_2(\varepsilon)$. The angle between $e_1$ and $e_2$ is $\alpha(e_1, e_2) = \lim_{\varepsilon \to 0} \alpha_\varepsilon(e_1, e_2)$. We assume that, for each pair of intersecting edges $e_1$ and $e_2$, the angle $\alpha(e_1, e_2)$ is well-defined.

To control points far away in the graph distance, but close in the embedding space, we define
$$A_G = \{(x, x') \in G \times G : d_G(x, x') \ge \min(b, \tau\alpha)\},$$
where $b$ is the length of the shortest edge of $G$, $\tau$ is the local reach and $\alpha$ is the smallest angle formed by two edges of $G$. We define the global reach as the infimum of the Euclidean distances among pairs of points in $A_G$, that is $\xi = \inf_{(x, x') \in A_G} \|x - x'\|_2$.

Let $(G, d_G)$ be a metric graph and, for a constant $\sigma \ge 0$, let $G_\sigma = \{y : \inf_{x \in G} \|x - y\|_2 \le \sigma\}$ be the $\sigma$-tube around $G$. If $\sigma = 0$, then, trivially, $G_\sigma = G$. Notice that $G_\sigma$ is a set of dimension $D$ if $\sigma > 0$. We will use the assumption that the sample $Y$ is sufficiently dense in $G_\sigma$ with respect to the Euclidean metric, as formalized below.

Definition 3 The sample $Y = \{y_1, \dots, y_n\} \subset G_\sigma \subset \mathbb{R}^D$ is $\frac{\delta}{2}$-dense in $G_\sigma$ if for every $x \in G_\sigma$, there exists a $y \in Y$ such that $\|x - y\|_2 < \frac{\delta}{2}$.

The problem of metric graph reconstruction consists of reconstructing a metric graph G given a dense sample {y_1, ...
, y_n} = Y ⊂ G_σ endowed with a distance $d_Y$, which could be the $D$-dimensional Euclidean distance or some more complicated notion of distance. If $\sigma = 0$ we say that the sample $Y$ is noiseless, while if $\sigma > 0$, we say that $Y$ is a noisy sample.

Throughout our analysis we restrict attention to metric graphs embedded in $\mathbb{R}^D$ that satisfy the following assumptions:

A1 The graphs have finite total length and are free of nodes of degree 2 (though they may contain vertices of degree 1 or 3 and higher).

A2 Each edge is a smooth embedded sub-manifold of dimension 1, of length at least $b > 0$ and with reach at least $\tau > 0$.

A3 Each pair of intersecting edges forms a well-defined angle of size at least $\alpha > 0$.

A4 The global reach is at least $\xi > 0$.

Assumptions A1 and A2 allow us to consider each edge of a metric graph as a single smooth curve. A3 and A4 are additional regularity conditions on the separation between different edges. Assumptions similar to A1-A4 are common in the literature. For different regularity conditions that allow for corners within an edge see, for example, Chazal et al. (2009) and Chen et al. (2010).

Let $\mathcal{G}$ be the set of metric graphs embedded in $\mathbb{R}^D$ that satisfy assumptions A1, A2, A3 and A4, involving the parameters $b$, $\alpha$, $\tau$, $\xi$. We consider two noise models:

Noiseless. We observe data $Y_1, \dots, Y_n \sim P$, where $P \in \mathcal{P}$, a collection of probability distributions supported over metric graphs $(G, d_G)$ in $\mathcal{G}$ having densities $p$ with respect to the length of $G$ bounded from below by a constant $a > 0$.

Tubular Noise. We observe data $Y_1, \dots, Y_n \sim P_{G,\sigma}$ where $P_{G,\sigma}$ is uniform on the $\sigma$-tube $G_\sigma$. In this case we consider the collection $\mathcal{P} = \{P_{G,\sigma} : G \in \mathcal{G}\}$.
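The density requirement of Definition 3 can be checked approximately on a computer. The sketch below is only an illustration under stated assumptions: the tube $G_\sigma$ is replaced by a finite reference set of points (a discretization chosen by the user), so a positive answer certifies density only with respect to that discretization. The function name `is_half_delta_dense` and the toy data are hypothetical.

```python
import math


def is_half_delta_dense(sample, reference, delta):
    """Check that every reference point has a sample point at
    Euclidean distance strictly less than delta / 2.

    `reference` is a finite stand-in for the tube G_sigma (an assumption
    of this sketch); the definition itself quantifies over all of G_sigma.
    """
    half = delta / 2.0
    return all(
        any(math.dist(x, y) < half for y in sample)
        for x in reference
    )


# Toy noiseless example (sigma = 0): G is the unit segment in the plane.
reference = [(i / 100, 0.0) for i in range(101)]  # fine discretization of G
sample = [(j / 10, 0.0) for j in range(11)]       # 11 equally spaced points

dense_coarse = is_half_delta_dense(sample, reference, delta=0.2)
dense_fine = is_half_delta_dense(sample, reference, delta=0.05)
```

With sample points 0.1 apart, every reference point lies within 0.05 of the sample, so the check passes for $\delta = 0.2$ and fails for $\delta = 0.05$.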
We are interested in bounding the minimax risk
$$R_n = \inf_{\hat{G}} \sup_{P \in \mathcal{P}} P^n\left(\hat{G} \not\simeq G\right), \tag{1}$$
where the infimum is over all estimators $\hat{G}$ of the topology of $(G, d_G)$, the supremum is over the class of distributions $\mathcal{P}$ for $Y$ and $\hat{G} \not\simeq G$ means that $\hat{G}$ and $G$ are not isomorphic. In Section 4 we will find lower and upper bounds for $R_n$ in the noiseless case and the tubular noise case.

We conclude this section by summarizing the many parameters and symbols involved in our analysis. See Table 1.

Symbol          Meaning
$(G, d_G)$      metric graph
$\alpha$        smallest angle
$b$             shortest edge
$\tau$          local reach
$\xi$           global reach
$\mathcal{G}$   set of metric graphs embedded in $\mathbb{R}^D$, satisfying A1-A4
$\mathcal{P}$   set of distributions on $G$ or $G_\sigma$
$G_\sigma$      $\sigma$-tube around $G$
$Y$             sample, subset of $G_\sigma$
$\delta$        $Y$ is a $\delta/2$-dense sample

Table 1: Summary of the symbols used in our analysis.

3. Performance Analysis for the Algorithm of Aanjaneya et al. (2012)

In this section we study the performance of the metric graph reconstruction algorithm of Aanjaneya et al. (2012), under assumptions A1-A4 and with a choice of parameters adapted to our setting. In Section 4 we will use these results to derive bounds on the minimax rate for topology reconstruction.

The metric graph reconstruction algorithm is presented in Algorithm 1. The algorithm takes a (possibly noisy) sample $Y$ from a metric graph $G$ and a distance $d_Y$ defined on $Y$ and returns a graph $\hat{G}$ that approximates $G$. The key idea is the following: a shell of radius $r$ is constructed around each point in the sample, which is labeled edge point if its shell contains 2 well separated clusters of sampled points and vertex point otherwise. Several steps of the algorithm require the construction of a Rips-Vietoris graph of parameter $\delta$: $R_\delta(S_y)$ is a graph whose vertices are all the points of $S_y$ and there is an edge between two points if the Euclidean distance between them is not larger than $\delta$.
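The labeling idea described above (count the connected components of the Rips-Vietoris graph built on the shell around a point) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy T-shaped sample and the parameter values are hypothetical, and the shell is taken to be half-open, $r < \|y - p\|_2 \le r + \delta$, which is an assumption about the boundary convention.

```python
import math


def rips_components(points, delta):
    """Connected components of the Rips-Vietoris graph R_delta(points):
    two points are joined when their Euclidean distance is <= delta."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= delta:
                parent[find(i)] = find(j)

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())


def shell_degree(y, sample, r, delta):
    """Number of Rips components in the shell B(y, r + delta) minus B(y, r)."""
    shell = [p for p in sample if r < math.dist(y, p) <= r + delta]
    return len(rips_components(shell, delta))


# Toy sample on a T-shaped graph: a horizontal edge through the origin
# and a vertical branch attached at the origin (integer coordinates).
sample = [(x, 0) for x in range(-30, 31)] + [(0, k) for k in range(1, 31)]
delta, r = 2.5, 5.5

deg_at_junction = shell_degree((0, 0), sample, r, delta)
deg_on_edge = shell_degree((15, 0), sample, r, delta)
```

A point whose shell has exactly 2 components would be labeled an edge point, and any other count yields a preliminary vertex point. The quadratic pairwise loop keeps the sketch short; a spatial index would be preferable for large samples.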
At Step 11 some of the edge points that are close to vertices are re-labeled as vertex points. This expansion guarantees a precise borderline between clusters of vertex points and clusters of edge points. At Steps 15-17 each of these clusters is associated to a vertex or to an edge of the reconstructed graph $\hat{G}$. We will analyze the algorithm considering the Euclidean distance on the sample $Y$, that is, $d_Y = \|\cdot\|_2$. The inner radius of the shell at Step 3 and the width of the expansion at Step 11 are parameters the user has to specify.

Algorithm 1 Metric Graph Reconstruction Algorithm
Input: sample $Y$, $d_Y$, $r$, $p_{11}$.
 1: Labeling points as edge or vertex
 2: for all $y \in Y$ do
 3:     $S_y \leftarrow B(y, r + \delta) \setminus B(y, r)$
 4:     $\deg_r(y) \leftarrow$ number of connected components of the Rips-Vietoris graph $R_\delta(S_y)$
 5:     if $\deg_r(y) = 2$ then
 6:         Label $y$ as an edge point.
 7:     else
 8:         Label $y$ as a preliminary vertex point.
 9:     end if
10: end for
11: Label all points within Euclidean distance $p_{11}$ from a preliminary vertex point as vertices.
12: Let $E$ be the points of $Y$ labeled as edge points.
13: Let $V$ be the points of $Y$ labeled as vertices.
14: Reconstructing the graph structure
15: Compute the connected components of the Rips-Vietoris graphs $R_\delta(E)$ and $R_\delta(V)$.
16: Let the connected components of $R_\delta(V)$ be the vertices of the reconstructed graph $\hat{G}$.
17: Let there be an edge between vertices of $\hat{G}$ if their corresponding connected components in $R_\delta(V)$ contain points at distance less than $\delta$ from the same connected component of $R_\delta(E)$.
Output: $\hat{G}$.

Before finding how dense a sample has to be in order to guarantee a correct reconstruction of a metric graph, we show that it is sufficient to study a particular metric graph embedded in $\mathbb{R}^2$, which represents the worst case.
In other words, if the metric graph algorithm can reconstruct this particular planar graph, then it can reconstruct any other metric graph that satisfies A1-A4.

3.1 The worst case: a metric graph in $\mathbb{R}^2$

The worst case is the one for which it is hard to distinguish two edges that intersect at a vertex because they are too close in the embedding space. Figure 2 (top left) shows an edge $e$ that intersects two edges $e_1$, $e_2$ with reach $\tau$, forming an angle $\alpha$ at vertex $x$. For simplicity, we consider this metric graph embedded in $\mathbb{R}^3$ ($D = 3$). Therefore Figure 2 shows the projections of $e$, $e_1$ and $e_2$ on the (limit) plane formed by $e_1$ and $e_2$, passing through $x$.

Figure 2: Even in the worst case, edges $e_1$ and $e_2$ must lie outside of the tori constructed on the fibers $L_\tau(x, e_1)$ and $L_\tau(x, e_2)$.

We focus on edge $e_2$. The blue segment $AB$ is the projection of $L_\tau(x, e_2)$, the fiber of size $\tau$ around $x$. In $\mathbb{R}^3$, $L_\tau(x, e_2)$ is a circle of radius $\tau$ centered at $x$. By definition, for any $y \in e_2$, the fiber $L_\tau(y, e_2)$ cannot intersect the fiber $L_\tau(x, e_2)$, otherwise the assumption involving the reach would be violated. We represent this condition by taking a circle $C$ of radius $\tau$ centered at $B$ and rotating it around $x$ along the circumference of $L_\tau(x, e_2)$. This procedure forms a torus with an inner loop of radius 0. Edge $e_2$ must lie outside of this torus, so that its fibers do not intersect $L_\tau(x, e_2)$. See the top right plot of Figure 2. The same reasoning applies to edge $e_1$, which must lie outside of the torus constructed on $L_\tau(x, e_1)$. See the bottom left plot.

The worst case is the one for which $e_1$ and $e_2$ are as close as possible: on the same plane and on the boundaries of the two tori. This case is represented in the bottom right plot of Figure 2. Note that $e_1$ and $e_2$ are simply arcs of circles of radius $\tau$.
Figure 3: Left: edges $e_1$ and $e_2$ with minimum reach $\tau$ forming the smallest angle $\alpha$ at vertex $x$. Right: same metric graph with a tube of radius $\sigma$ around it.

We will use basic trigonometric properties of the worst case. In Figure 3 (left), $O$ and $O'$ are the centers of the circles associated to edges $e_1$ and $e_2$. It is easy to see that angle $\widehat{OxO'}$ has width $\pi - \alpha$. It can be shown that
$$\widehat{xOO'} = \alpha/2, \tag{2}$$
$$\widehat{TxQ} = \alpha/4. \tag{3}$$

Let $Y$ be a noisy sample of $G$. In other words $Y$ is a subset of $G_\sigma$, the tube of radius $\sigma \ge 0$ around the metric graph $G$. See Figure 3 (right). Let $Q$ be the midpoint of segment $OO'$ and let $T$ be the intersection point of $OO'$ and edge $e_1$. For $0 \le \sigma \le QT = \tau - \tau\cos(\alpha/2)$, the smallest angle formed by the inner faces of the tube around the metric graph is
$$\alpha' = \pi - \arccos\left(\frac{2(\tau - \sigma)^2 - 4\tau^2\cos^2(\alpha/2)}{2(\tau - \sigma)^2}\right), \tag{4}$$
where we applied the cosine law to the triangle $OsO'$ and the fact that angle $\widehat{OsO'}$ has width $\pi - \alpha'$. Note that if $\sigma = 0$ then $\alpha' = \alpha$. As in (3), it can be shown that
$$\widehat{RsQ} = \alpha'/4. \tag{5}$$

The few basic trigonometric equations described above will be used to determine under which conditions on $b, \alpha, \tau, \xi, \sigma$ the metric graph reconstruction algorithm can reconstruct the worst case.

3.2 Analysis of Algorithm 1 with Euclidean distance

In this section we analyze Algorithm 1. It is sufficient to study the worst case of Figures 2 and 3 and extend the results to any metric graph in $\mathbb{R}^D$. The Euclidean distance is used at every step of the algorithm, which requires the specification of $r$, the inner radius of the shell, and $p_{11}$, the parameter governing the expansion of Step 11.
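As a quick numerical sanity check, the angle $\alpha'$ of equation (4) can be evaluated directly. The sketch below is illustrative only: the function name `alpha_prime` is hypothetical, the numerical values are arbitrary, and the clamping of the arccos argument is a floating-point safeguard added here, not part of the formula.

```python
import math


def alpha_prime(alpha, tau, sigma):
    """Smallest angle between the inner faces of the sigma-tube, equation (4).

    Valid for 0 <= sigma <= tau - tau * cos(alpha / 2).
    """
    sigma_max = tau - tau * math.cos(alpha / 2)
    if not 0 <= sigma <= sigma_max:
        raise ValueError("sigma must lie in [0, tau - tau*cos(alpha/2)]")
    num = 2 * (tau - sigma) ** 2 - 4 * tau ** 2 * math.cos(alpha / 2) ** 2
    den = 2 * (tau - sigma) ** 2
    # Clamp to [-1, 1] to guard against rounding at the boundary.
    cosine = max(-1.0, min(1.0, num / den))
    return math.pi - math.acos(cosine)


# Arbitrary illustrative values (not taken from any data set).
alpha, tau = math.pi / 3, 30.0
noiseless = alpha_prime(alpha, tau, 0.0)   # should coincide with alpha
noisy = alpha_prime(alpha, tau, 1.0)       # tube of radius 1 narrows the angle
```

For $\sigma = 0$ the function returns $\alpha$, in agreement with the remark after (4), and the angle shrinks as $\sigma$ grows towards $\tau - \tau\cos(\alpha/2)$.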
We set
$$r = \frac{\delta}{2} + \sigma + \tau\sin(\alpha/2) - (\tau - \sigma)\sin(\alpha'/2) + \frac{\delta}{2\sin(\alpha'/4)} \tag{6}$$
and
$$p_{11} = \frac{\delta}{2} + \tau\sin(\alpha/2) - (\tau - \sigma)\sin(\alpha'/2) + \frac{r + \delta}{\sin(\alpha'/2)}. \tag{7}$$
This choice is justified in the proof of Proposition 4. Define
$$f(b, \alpha, \tau, \xi, \sigma) := \frac{(\tau - \sigma)\sin\left(\frac{\min(b, \alpha\tau) - (\alpha - \alpha')\tau}{2\tau}\right) - \left[\tau\sin(\alpha/2) - (\tau - \sigma)\sin(\alpha'/2)\right]\left(1 + \frac{2}{\sin(\alpha'/2)}\right) - \frac{2\sigma}{\sin(\alpha'/2)}}{1 + 3[\sin(\alpha'/2)]^{-1} + [\sin(\alpha'/2)\sin(\alpha'/4)]^{-1}}, \tag{8}$$
where $\alpha'$ is given in (4). Note that $f(b, \alpha, \tau, \xi, \sigma)$ is a decreasing function of $\sigma$.

Proposition 4 If $Y$ is $\frac{\delta}{2}$-dense in $G_\sigma$ and
$$0 < r + \delta < \xi - 2\sigma, \tag{9}$$
$$0 < \delta < f(b, \alpha, \tau, \xi, \sigma), \tag{10}$$
then the graph $\hat{G}$ provided by Algorithm 1 (input: $Y$, $\|\cdot\|_2$, $r$, $p_{11}$) is isomorphic to $G$.

Proof We will show that under conditions (9) and (10), Algorithm 1 can reconstruct the worst case described in Section 3.1, formed by edges $e_1$ and $e_2$ of reach $\tau$ forming an angle of width $\alpha$. This will automatically imply that the algorithm can reconstruct the topology of other vertices and edges in the $D$-dimensional space.

Condition (9) guarantees that points of $G$ which are far apart in the metric graph distance $d_G$, and close in the embedding space, do not interfere in the construction of the shells at Steps 3-4.

The rest of the proof involves condition (10). Since the sample is $\frac{\delta}{2}$-dense in the tube, there is at least a point $y \in Y$ inside the ball of radius $\frac{\delta}{2}$ centered at any vertex $x \in G$. When using Algorithm 1 we want to be sure that $y$ is labeled as a vertex, that is, the number of connected components of the shell around $y$ is different from 2 (Steps 3-4). The worst case is depicted in Figure 4 (left), where $x$ is the vertex of minimum angle $\alpha$, formed

Figure 4: Left: edges $e_1$ and $e_2$ with minimum reach $\tau$ forming the smallest angle $\alpha$ at vertex $x$.
Right: The distance $\|F - G\|_2$ between the two connected components of the shell around an edge point $y'$ must be greater than $\delta$.

by two edges, $e_1$ and $e_2$, of reach $\tau$. First, we show that for the value of $r$ selected in (6), points close to an actual vertex are labeled as vertices at Steps 3-10 and points far from actual vertices are labeled as edges.

The inner faces of the tube of radius $\sigma$ around $e_1$ and $e_2$ form an angle of width $\alpha'$ at vertex $s$, as described in Section 3.1. Let $u$ and $v$ be the two points on the faces of the tube such that they are equidistant from $x$ and $\|u - v\|_2 = \delta$. Since at Step 4 we construct a $\delta$-graph to determine the number of connected components of the shell $S_y$ and we want $y$ to be a vertex, we choose $r$, the inner radius of the shell $S_y$, so that if $u, v \in Y$ then $r \ge \max\{d_Y(y, u), d_Y(y, v)\}$. This guarantees that for all $t_1, t_2 \in Y$ with $t_1$ around edge $e_1$, $t_2$ around edge $e_2$ such that $\{t_1, t_2\} \subset S_y$, we have $d_Y(t_1, t_2) \ge \delta$, that is, $t_1$ and $t_2$ belong to different connected components of the shell around $y$ at Step 4. The distance between $y$ and $u$ is bounded by $\|y - x\|_2 + \|x - s\|_2 + \|s - u\|_2$, where, using (2),
$$\|x - s\|_2 = \|x - Q\|_2 - \|s - Q\|_2 = \tau\sin(\alpha/2) - (\tau - \sigma)\sin(\alpha'/2)$$
and, using (5),
$$\|s - u\|_2 \le \frac{\delta}{2\sin(\alpha'/4)}. \tag{11}$$
Therefore we require that $r$, the inner radius of the shell of Step 4, satisfies
$$r \ge \frac{\delta}{2} + \|x - s\|_2 + \frac{\delta}{2\sin(\alpha'/4)} \ge \|y - x\|_2 + \|x - s\|_2 + \|s - u\|_2. \tag{12}$$

Another condition on $r$ arises when we label edge points far from actual vertices. See Figure 4 (right). If $y' \in Y$, then it should be labeled as an edge point. That is, at Step 4, the Rips graph $R_\delta(S_{y'})$ on the shell $S_{y'}$ should have 2 connected components. Therefore the distance $\|F - G\|_2$ between them must be greater than $\delta$.
We require that
$$r \ge 2\sigma + \delta/\sqrt{2}, \tag{13}$$
which implies $\|F - G\|_2 > \delta$ when $r$ is small enough, as implied by (10). Note that the value
$$r = \frac{\delta}{2} + \sigma + \|x - s\|_2 + \frac{\delta}{2\sin(\alpha'/4)}$$
satisfies both (12) and (13).

The outer radius of the shell at Steps 3-4 has length $r + \delta$. This guarantees that when the shell around an edge point intersects the tube around $G$ there is at least a point $y \in Y$ in each connected component of the shell, since $Y$ is $\frac{\delta}{2}$-dense in $G_\sigma$.

In the last part of this proof we show that condition (10) is needed to guarantee that the sample is dense enough and the radius of the shells of Step 3 has the correct size, so that, even in the worst case, each vertex is associated to one set of sampled points at Steps 15-17 and these connected components are correctly linked by sets of sampled points labeled as edge points.

Let $z \in G_\sigma$ be the point around $e_2$ where the segment of length $r + \delta$, orthogonal to the face of the tube around edge $e_1$, intersects the face of the tube around edge $e_2$. See Figure 5. If this segment does not exist we simply consider the segment of length $r + \delta$ from $s$ to a point $z$ on $e_2$.

Figure 5: The shell around $z$ is tangent to edge $e_2$.

Suppose $z \in Y$. Among the points that might be labeled as vertices at Step 8 because of their closeness to vertex $x$, $z$ is the furthest from $x$, since the shell around $z$ is tangent to the tube around $e_1$. At Step 11, in order to control the labeling of the points in the tube between $y$ and $z$ we would like to label all the points in $\{y' \in Y : \|y' - y\|_2 \le \|y - z\|_2\}$ as vertices. To simplify the calculation we use the following bound:
$$\|y - z\|_2 \le \|y - x\|_2 + \|x - s\|_2 + \|s - z\|_2,$$
where, using (5),
$$\|s - z\|_2 \le \frac{r + \delta}{\sin(\alpha'/2)}. \tag{14}$$
This justifies the choice of
$$p_{11} = \frac{\delta}{2} + \|x - s\|_2 + \frac{r + \delta}{\sin(\alpha'/2)} \ge \|y - z\|_2.$$
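The parameter choices (6) and (7) are simple enough to evaluate directly. The following sketch is a hedged illustration: the function name `shell_parameters` and the numerical values are hypothetical, and $\alpha'$ is passed in as an argument (its value comes from (4), and in the noiseless case $\alpha' = \alpha$).

```python
import math


def shell_parameters(delta, alpha, alpha_p, tau, sigma):
    """Inner shell radius r (equation 6) and expansion width p_11 (equation 7).

    alpha_p is the tube angle alpha' of equation (4); it equals alpha
    when sigma = 0.
    """
    # ||x - s||_2 = tau*sin(alpha/2) - (tau - sigma)*sin(alpha'/2)
    x_to_s = tau * math.sin(alpha / 2) - (tau - sigma) * math.sin(alpha_p / 2)
    r = delta / 2 + sigma + x_to_s + delta / (2 * math.sin(alpha_p / 4))
    p11 = delta / 2 + x_to_s + (r + delta) / math.sin(alpha_p / 2)
    return r, p11


# Illustrative noiseless setting (sigma = 0, so alpha' = alpha).
delta, alpha, tau, sigma = 1.0, math.pi / 3, 30.0, 0.0
r, p11 = shell_parameters(delta, alpha, alpha, tau, sigma)

# Requirements (12) and (13) from the proof, checked for these values.
lower_12 = delta / 2 + delta / (2 * math.sin(alpha / 4))
lower_13 = 2 * sigma + delta / math.sqrt(2)
satisfies_12_and_13 = r >= lower_12 and r >= lower_13
```

In this noiseless example $\|x - s\|_2 = 0$, so (12) holds with equality, and $p_{11}$ exceeds $r$ because it adds the term $(r + \delta)/\sin(\alpha'/2)$.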
Thus, at Step 11 we label all the points in $\{y' \in Y : \|y' - y\|_2 \le p_{11}$ and $y$ is labeled as vertex at Step 8$\}$ as vertices. If $z$ is actually labeled as a vertex at Step 8, then through the expansion of Step 11, all the points at distance not greater than $p_{11}$ from $z$ are labeled as vertices.

Finally we determine under which conditions there is at least a point in the tube around $e_2$ labeled as an edge point after Step 11. Consider the worst case in which $e_1$ and $e_2$ are forming an angle of size $\alpha$ at both their extremes $x$ and $x'$. See Figure 6.

Figure 6: Edges $e_1$ and $e_2$, forming an angle of size $\alpha$ at both their extremes $x$ and $x'$.

All the points $y' \in Y$ such that $\|y' - z\|_2 \le p_{11}$ or $\|y' - z'\|_2 \le p_{11}$ might be labeled as vertices. When we construct $R_\delta(E)$ and $R_\delta(V)$ at Step 15 the two sets of vertices around $x$ and $x'$ must be disconnected and there must be at least an edge point between them. A sufficient condition is that the length of edge $e_2$ is greater than $2(a_1 + a_2 + a_3) + a_4$, where

• $a_1$ is the length of the arc of $e_2$ formed by the projections of lines $Ox$ and $Os$ on $e_2$,
• $a_2$ is the length of the arc of $e_2$ formed by the projection of the chord of length $\|s - z\|_2$,
• $a_3$ is the length of the arc of $e_2$ formed by the projection of the chord of length $p_{11}$,
• $a_4$ is the length of the arc of $e_2$ formed by the projection of the chord of length $\delta$.

Note that, in Figure 6, the length of $e_2$ is $2\tau\arcsin\left(\frac{\|x - x'\|_2}{2\tau}\right) = \alpha\tau$, but in general it might be shorter, so that $e_1$ and $e_2$ might not intersect in $x'$. However, by assumption A2, $e_2$ must be longer than $b$. Thus we require $\min(b, \alpha\tau) > 2(a_1 + a_2 + a_3) + a_4.$
(15)

By simple properties involving arcs and chords we have
$$a_1 = \left(\frac{\alpha - \alpha'}{2}\right)\tau, \quad a_2 = 2\tau\arcsin\left(\frac{\|s - z\|_2}{2(\tau - \sigma)}\right), \quad a_3 = 2\tau\arcsin\left(\frac{p_{11}}{2(\tau - \sigma)}\right), \quad a_4 = 2\tau\arcsin\left(\frac{\delta}{2(\tau - \sigma)}\right).$$
Since the arcsin is superadditive in $[0, 1]$ we require the stronger condition
$$\min(b, \alpha\tau) - (\alpha - \alpha')\tau > 2\tau\arcsin\left(\frac{2\|s - z\|_2 + 2p_{11} + \delta}{2(\tau - \sigma)}\right),$$
which holds if
$$\sin\left(\frac{\min(b, \alpha\tau) - (\alpha - \alpha')\tau}{2\tau}\right) > \frac{2\,\frac{r + \delta}{\sin(\alpha'/2)} + 2p_{11} + \delta}{2(\tau - \sigma)}.$$
The last condition is equivalent to (10). If this condition is satisfied then the graph is correctly reconstructed at Steps 15-17: every connected component of $R_\delta(V)$ corresponds to a vertex of $G$ and every connected component of $R_\delta(E)$ corresponds to an edge of $G$.

Example 1 A Neuron in Three Dimensions. We return to the neuron example and we try to apply Proposition 4 to the 3D data of Figure 1, namely the neuron cr22e from the hippocampus of a rat (Gulyás et al., 1999). The data were obtained from NeuroMorpho.Org (Ascoli et al., 2007). The total length of the graph is 1750.86 µm. We assume the smallest edge has length 100 µm, the smallest angle $\pi/3$, the local reach 30 µm and $\xi = 50$ µm. The conditions of Proposition 4 are satisfied for $\delta = 2.00$ µm. Algorithm 1 reconstructs the topology of the metric graph starting from a $\delta/2$-dense sample. Figure 1 (right) shows the reconstructed graph.

4. Minimax Analysis

In this section we derive lower and upper bounds for the minimax risk
$$R_n = \inf_{\hat{G}} \sup_{P \in \mathcal{P}} P^n\left(\hat{G} \not\simeq G\right), \tag{16}$$
where, as described in Section 2, the infimum is over all estimators $\hat{G}$ of the metric graph $G$, the supremum is over the class of distributions $\mathcal{P}$ for $Y$ and $\hat{G} \not\simeq G$ means that $\hat{G}$ and $G$ are not isomorphic.

4.1 Lower Bounds

To derive a lower bound on the minimax risk, we make repeated use of Le Cam's lemma.
See, e.g., Yu (1997) and Chapter 2 of Tsybakov (2008). Recall that the total variation distance between two measures $P$ and $Q$ on the same probability space is defined by $\mathrm{TV}(P, Q) = \sup_A |P(A) - Q(A)|$ where the supremum is over all measurable sets. It can be shown that $\mathrm{TV}(P, Q) = P(H) - Q(H)$, where $H = \{y : p(y) \ge q(y)\}$ and $p$ and $q$ are the densities of $P$ and $Q$ with respect to any measure that dominates both $P$ and $Q$.

Lemma 5 (Le Cam) Let $\mathcal{Q}$ be a set of distributions. Let $\theta(Q)$ take values in a metric space with metric $\rho$. Let $Q_1, Q_2 \in \mathcal{Q}$ be any pair of distributions in $\mathcal{Q}$. Let $Y_1, \dots, Y_n$ be drawn iid from some $Q \in \mathcal{Q}$ and denote the corresponding product measure by $Q^n$. Then
$$\inf_{\hat{\theta}} \sup_{Q \in \mathcal{Q}} \mathbb{E}_{Q^n}\left[\rho(\hat{\theta}, \theta(Q))\right] \ge \frac{1}{8}\,\rho(\theta(Q_1), \theta(Q_2))\,(1 - \mathrm{TV}(Q_1, Q_2))^{2n} \tag{17}$$
where the infimum is over all the estimators of $\theta(Q)$.

Below we apply Le Cam's lemma using several pairs of distributions. Any pair $Q_1, Q_2$ is associated with a pair of metric graphs $G', G'' \in \mathcal{G}$. We take $\theta(Q_1)$ and $\theta(Q_2)$ to be the classes of graphs that are isomorphic to $G'$ and $G''$. We set $\rho(\theta(Q_1), \theta(Q_2)) = 0$ if $G'$ and $G''$ are isomorphic and $\rho(\theta(Q_1), \theta(Q_2)) = 1$ otherwise. Figure 7 shows several pairs of metric graphs that are used to derive lower bounds in the noiseless case and in the tubular noise case. In the noiseless case we ignore the $\sigma$-tubes around the metric graphs.

Figure 7: Pairs of metric graphs used in the derivation of lower bounds in the noiseless case and in the tubular noise case.
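To get a feel for the right-hand side of (17), it can be evaluated numerically for a given total variation distance and sample size. The sketch below is purely illustrative: the function name `le_cam_lower_bound` and the numerical values are hypothetical and are not taken from the analysis.

```python
def le_cam_lower_bound(rho, tv, n):
    """Right-hand side of equation (17): rho * (1 - TV)^(2n) / 8.

    rho is the distance between the two parameters (0 or 1 in the
    graph-isomorphism setting), tv the total variation distance between
    the two distributions, n the sample size.
    """
    if not 0.0 <= tv <= 1.0:
        raise ValueError("total variation distance must lie in [0, 1]")
    if n < 0:
        raise ValueError("sample size must be non-negative")
    return rho * (1.0 - tv) ** (2 * n) / 8.0


# Two non-isomorphic graphs (rho = 1) whose distributions differ by TV = 0.01.
bound_small_n = le_cam_lower_bound(rho=1, tv=0.01, n=10)
bound_large_n = le_cam_lower_bound(rho=1, tv=0.01, n=1000)
```

The bound starts at $1/8$ when the two distributions coincide and decays geometrically in $n$; it vanishes when the two graphs are isomorphic ($\rho = 0$).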
Theorem 6 In the noiseless case ($\sigma = 0$), for $b \leq b_0(a)$, $\alpha \leq \alpha_0(a)$, $\xi \leq \xi_0(a)$, $\tau \leq \tau_0(a)$, where $b_0(a)$, $\alpha_0(a)$, $\xi_0(a)$ and $\tau_0(a)$ are constants which depend on $a$, a lower bound on the minimax risk for metric graph reconstruction is

$$R_n \geq \exp\left(-2a \min\{b,\ 2\sin(\alpha/2),\ \xi,\ 2\pi\tau\}\, n\right). \tag{18}$$

Proof We consider the 4 parameters separately. See Figure 7, ignoring the red lines representing the tubular noise, which is not considered in this theorem.

Shortest edge $b$. Consider the metric graph $G_1$ consisting of a single edge of length $1 + b$ and metric graph $G_2$ with an edge of length 1 and an orthogonal edge of length $b$ glued in the middle. The density on $G_1$ is constructed in the following way: on the set $G_1 \setminus G_2$ of length $b$ we set $p_1(x) = a$ and the rest of the mass is evenly distributed over the remaining portion of $G_1$. Similarly, for $G_2$ we set $p_2(x) = a$ on $G_2 \setminus G_1$, which corresponds to the orthogonal edge of length $b$. We evenly spread the remaining mass. The two densities differ only on the sets $G_1 \setminus G_2$ and $G_2 \setminus G_1$. Therefore $\mathrm{TV}(p_1, p_2) \leq ab$ and, by Le Cam's lemma, $R_n \geq \frac{1}{8}(1 - ab)^{2n} \geq \frac{1}{8} e^{-2abn}$ for all $b \leq b_0(a)$, where $b_0(a)$ is a constant depending on $a$.

Smallest angle $\alpha$. Now consider the metric graphs $G_3$ and $G_4$. $G_3$ consists of two edges of length 2 forming an angle $\alpha$ and a third edge of length $1 + 2\sin(\alpha/2)$ glued to the first two. $G_4$ is similar: an edge of length $2\sin(\alpha/2)$ is added to complete the triangle, while the edge on the left has length 1. As in the previous case we set $p_3(x) = a$ on $G_3 \setminus G_4$, $p_4(x) = a$ on $G_4 \setminus G_3$ and spread evenly the rest of the mass.
The total variation distance is $\mathrm{TV}(p_3, p_4) \leq 2a\sin(\alpha/2)$ and, by Le Cam's lemma, $R_n \geq \frac{1}{8}(1 - 2a\sin(\alpha/2))^{2n} \geq \frac{1}{8} e^{-4a\sin(\alpha/2)n}$ for all $\alpha \leq \alpha_0(a)$, where $\alpha_0(a)$ is a constant depending on $a$.

Global reach $\xi$. We defined the global reach as the shortest Euclidean distance between two points that are far apart in the graph distance. Figure 7 shows metric graph $G_5$ formed by a single edge of length 1 and metric graph $G_6$ consisting of two edges of length 0.5, $\xi$ apart from each other. Again, we set $p_5(x) = a$ on $G_5 \setminus G_6$, $p_6(x) = a$ on $G_6 \setminus G_5$ and evenly spread the rest. We obtain $\mathrm{TV}(p_5, p_6) \leq a\xi$ and, by Le Cam's lemma, $R_n \geq \frac{1}{8}(1 - a\xi)^{2n} \geq \frac{1}{8} e^{-2a\xi n}$ for all $\xi \leq \xi_0(a)$, where $\xi_0(a)$ is a constant depending on $a$.

Local reach $\tau$. The local reach $\tau$ is the smallest reach of the edges forming the metric graph. Consider metric graphs $G_7$ and $G_8$. $G_7$ consists of a loop of radius $\tau$ attached to an edge of length 1 and metric graph $G_8$ is a single edge of length $1 + 2\pi\tau$. As in the previous cases $p_7(x) = a$ on $G_7 \setminus G_8$ and $p_8(x) = a$ on $G_8 \setminus G_7$. It follows that $\mathrm{TV}(p_7, p_8) \leq 2a\pi\tau$ and, by Le Cam's lemma, $R_n \geq \frac{1}{8}(1 - 2a\pi\tau)^{2n} \geq \frac{1}{8} e^{-4a\pi\tau n}$ for all $\tau \leq \tau_0(a)$, where $\tau_0(a)$ is a constant depending on $a$.

For the tubular noise case we assume that $\sigma$ is small enough to guarantee that $R_n < 1$, that is, the problem is not hopeless. In particular, we require that $\sigma$ satisfies conditions (9) and (10) of Proposition 4, which can be combined into the following condition

$$0 < \min\left\{ \frac{\xi - 3\sigma - \tau\sin(\alpha/2) + (\tau - \sigma)\sin(\alpha'/2)}{3/2 + [2\sin(\alpha'/4)]^{-1}},\ f(b, \alpha, \tau, \xi, \sigma) \right\}. \tag{19}$$

Theorem 7 Assume that $\sigma$ is positive and satisfies condition (19).
In the tubular noise case, for $b \leq b_0(D)$, $\alpha \leq \alpha_0(D)$, $\xi \leq \xi_0(D)$, $\tau \leq \tau_0(D)$, where $b_0(D)$, $\alpha_0(D)$, $\xi_0(D)$ and $\tau_0(D)$ are constants which depend on the ambient dimension $D$, a lower bound on the minimax risk for metric graph reconstruction is

$$R_n \geq \frac{1}{8} \exp\left(-2 \min\{C_{D,1}\, b,\ C_{D,2} \sin(\alpha/2),\ C_{D,3}\, \xi,\ C_{D,4}\, \tau\}\, n\right), \tag{20}$$

for some constants $C_{D,1}, C_{D,2}, C_{D,3}, C_{D,4}$.

Proof As in the proof of Theorem 6 we consider the 4 parameters separately. We compare the pairs of graphs shown in Figure 7, including the tubular regions constructed around them, from which we sample uniformly.

Shortest edge $b$. Consider the metric graph $G_1$ consisting of a single edge of length $1 + b$ and metric graph $G_2$ with an edge of length 1 and an orthogonal edge of length $b$ glued in the middle. Since $\mathrm{vol}(G_1) > \mathrm{vol}(G_2)$, the density $q_1$ at a point in the tube around $G_1$ is lower than the density $q_2$ at a point around $G_2$. From the definition of total variation, $\mathrm{TV} = q_1(H) - q_2(H)$ where $H$ is the set where $q_1 > q_2$, the shaded area in Figure 7. Note that $q_2(H) = 0$ and

$$\mathrm{TV}(q_1, q_2) = q_1(H) = \frac{\mathrm{vol}(H)}{\mathrm{vol}(G_1)} \leq \frac{C_{D,1}\, b\, \sigma^{D-1}}{(1 + b)\, \sigma^{D-1}} \leq C_{D,1}\, b.$$

By Le Cam's lemma, $R_n \geq \frac{1}{8}(1 - C_{D,1} b)^{2n} \geq \frac{1}{8} e^{-2 C_{D,1} b n}$ for all $b \leq b_0(D)$, where $b_0(D)$ is a constant depending on $D$.

Smallest angle $\alpha$. Now consider the metric graphs $G_3$ and $G_4$. Since $\mathrm{vol}(G_3) > \mathrm{vol}(G_4)$, the density $q_3$ at a point in the tube around $G_3$ is lower than the density $q_4$ at a point around $G_4$. $\mathrm{TV} = q_3(H) - q_4(H)$ where $H$ is the set where $q_3 > q_4$, the shaded area in the tube around $G_3$. Note that $q_4(H) = 0$ and

$$\mathrm{TV}(q_3, q_4) = q_3(H) = \frac{\mathrm{vol}(H)}{\mathrm{vol}(G_3)} \leq \frac{C_{D,2} \sin(\alpha/2)\, \sigma^{D-1}}{(1 + \sin(\alpha/2))\, \sigma^{D-1}} \leq C_{D,2} \sin(\alpha/2).$$
By Le Cam's lemma, $R_n \geq \frac{1}{8}(1 - C_{D,2}\sin(\alpha/2))^{2n} \geq \frac{1}{8} e^{-2 C_{D,2} \sin(\alpha/2) n}$ for all $\alpha \leq \alpha_0(D)$, where $\alpha_0(D)$ is a constant depending on $D$.

Global reach $\xi$. Figure 7 shows metric graph $G_5$ formed by a single edge of length 1 and metric graph $G_6$ consisting of two edges of length 0.5, $\xi$ apart from each other. Since $\mathrm{vol}(G_5) > \mathrm{vol}(G_6)$, the density $q_5$ at a point in the tube around $G_5$ is lower than the density $q_6$ at a point around $G_6$. $\mathrm{TV} = q_5(H) - q_6(H)$ where $H$ is the set where $q_5 > q_6$, the shaded area in the tube around $G_5$. Note that $q_6(H) = 0$ and

$$\mathrm{TV}(q_5, q_6) = q_5(H) = \frac{\mathrm{vol}(H)}{\mathrm{vol}(G_5)} \leq \frac{C_{D,3}\, \xi\, \sigma^{D-1}}{\sigma^{D-1}} = C_{D,3}\, \xi.$$

By Le Cam's lemma, $R_n \geq \frac{1}{8}(1 - C_{D,3}\xi)^{2n} \geq \frac{1}{8} e^{-2 C_{D,3} \xi n}$ for all $\xi \leq \xi_0(D)$, where $\xi_0(D)$ is a constant depending on $D$.

Local reach $\tau$. The local reach $\tau$ is the smallest reach of the edges forming the metric graph. Consider metric graphs $G_7$ and $G_8$ in Figure 7. Since $\mathrm{vol}(G_7) > \mathrm{vol}(G_8)$, the density $q_7$ at a point in the tube around $G_7$ is lower than the density $q_8$ at a point around $G_8$. $\mathrm{TV} = q_7(H) - q_8(H)$ where $H$ is the set where $q_7 > q_8$, the shaded area in the tube around $G_7$. Note that $q_8(H) = 0$ and

$$\mathrm{TV}(q_7, q_8) = q_7(H) = \frac{\mathrm{vol}(H)}{\mathrm{vol}(G_7)} \leq \frac{C_{D,4}\, \tau\, \sigma^{D-1}}{(1 + \tau)\, \sigma^{D-1}} \leq C_{D,4}\, \tau.$$

By Le Cam's lemma, $R_n \geq \frac{1}{8}(1 - C_{D,4}\tau)^{2n} \geq \frac{1}{8} e^{-2 C_{D,4} \tau n}$ for all $\tau \leq \tau_0(D)$, where $\tau_0(D)$ is a constant depending on $D$.

Note that, up to constants, the lower bound obtained in the tubular noise case is identical to the lower bound of Theorem 6 for the noiseless case.

4.2 Upper Bounds

In this section we use the analysis of the performance of Algorithm 1 to derive an upper bound on the minimax risk. We will use the strategy of Niyogi et al. (2008) to find the sample size that guarantees a $\delta/2$-dense sample with high probability. We will use the following two lemmas.
Lemma 8 (5.1 in Niyogi et al. (2008)) Let $\{A_i\}$ for $i = 1, \ldots, l$ be a finite collection of measurable sets and let $\mu$ be a probability measure on $\bigcup_{i=1}^{l} A_i$ such that for all $1 \leq i \leq l$, we have $\mu(A_i) > \gamma$. Let $\bar{x} = \{x_1, \ldots, x_n\}$ be a set of $n$ i.i.d. draws according to $\mu$. Then if

$$n \geq \frac{1}{\gamma}\left(\log l + \log\frac{1}{\lambda}\right)$$

we are guaranteed that with probability $> 1 - \lambda$, the following is true: $\forall i,\ \bar{x} \cap A_i \neq \emptyset$.

Recall that the $\epsilon$-covering number $C(\epsilon)$ of a set $S$ is the smallest number of Euclidean balls of radius $\epsilon$ required to cover the set. The $\epsilon$-packing number $P(\epsilon)$ is the maximum number of sets of the form $B(x, \epsilon) \cap S$, where $x \in S$, that may be packed into $S$ without overlap.

Lemma 9 (5.2 in Niyogi et al. (2008)) For every $\epsilon > 0$, $P(2\epsilon) \leq C(2\epsilon) \leq P(\epsilon)$.

Combining Lemma 8 and Proposition 4, we obtain an upper bound on $R_n$ for the noiseless case.

Theorem 10 In the noiseless case ($\sigma = 0$), an upper bound on the minimax risk $R_n$ is given by

$$R_n \leq \frac{8\,\mathrm{length}(G)}{\delta} \exp\left(-\frac{a\,\delta\, n}{4\,\mathrm{length}(G)}\right),$$

where

$$\delta = \frac{1}{2} \min\left\{ \xi\, \frac{2\sin(\alpha/4)}{3\sin(\alpha/4) + 1},\ \frac{\tau \sin(\alpha/2)\sin(\alpha/4)}{\sin(\alpha/2)\sin(\alpha/4) + 3\sin(\alpha/4) + 1}\, \sin\left(\frac{\min\{b, \alpha\tau\}}{2\tau}\right) \right\}. \tag{21}$$

Proof In the noiseless case, Proposition 4 implies that the graph $G$ can be reconstructed from a $\delta/2$-dense sample $Y$ if

$$\delta < \min\left\{ \xi\, \frac{2\sin(\alpha/4)}{3\sin(\alpha/4) + 1},\ f(b, \alpha, \tau, \xi, 0) \right\}. \tag{22}$$

The value of $\delta$ selected in (21) satisfies condition (22), which follows from conditions (9) and (10), with $\sigma = 0$. We look for the sample size $n$ that guarantees a $\delta/2$-dense sample with high probability. Following the strategy in Niyogi et al. (2008), we consider a cover of the metric graph $G$ by balls of radius $\delta/4$. Let $\{x_i : 1 \leq i \leq l\}$ be the centers of such balls that constitute a minimal cover. We can choose $A_i^{\delta/4} = B_{\delta/4}(x_i) \cap G$.
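Lemma 8 is applied next. As a minimal sketch of what its sample-size condition buys, the Python snippet below computes the smallest integer $n$ satisfying $n \geq \frac{1}{\gamma}(\log l + \log\frac{1}{\lambda})$ and evaluates the union bound $l(1 - \gamma)^n$ on the probability that some $A_i$ receives no sample point. The function names are hypothetical. The usage example plugs in the length and $\delta$ of Example 1 together with the noiseless-case quantities $\gamma \geq a\delta / (4\,\mathrm{length}(G))$ and $l \leq 8\,\mathrm{length}(G)/\delta$ from this proof; the values $a = 1$ and $\lambda = 0.05$ are assumptions made only for illustration.

```python
import math


def lemma8_sample_size(gamma: float, l: int, lam: float) -> int:
    """Smallest integer n with n >= (1/gamma) * (log l + log(1/lam))."""
    if not (0.0 < gamma <= 1.0 and l >= 1 and 0.0 < lam < 1.0):
        raise ValueError("need 0 < gamma <= 1, l >= 1 and 0 < lam < 1")
    return math.ceil((math.log(l) + math.log(1.0 / lam)) / gamma)


def miss_probability_bound(gamma: float, l: int, n: int) -> float:
    """Union bound on P(some A_i contains no sample point): l * (1 - gamma)^n."""
    return l * (1.0 - gamma) ** n


if __name__ == "__main__":
    length_G = 1750.86  # total length of the graph in Example 1 (micrometres)
    delta = 2.0         # delta from Example 1 (micrometres)
    a = 1.0             # ASSUMPTION: hypothetical lower bound on the density
    lam = 0.05          # ASSUMPTION: target failure probability

    gamma = a * delta / (4.0 * length_G)     # lower bound on mu(A_i)
    l = math.ceil(8.0 * length_G / delta)    # upper bound on the cover size
    n = lemma8_sample_size(gamma, l, lam)
    print("sample size n:", n)
    print("union bound  :", miss_probability_bound(gamma, l, n))
```

Since $(1 - \gamma)^n \leq e^{-\gamma n}$, the union bound is at most $\lambda$ whenever $n$ meets the condition of Lemma 8, which is the mechanism behind the sample sizes derived in the rest of the proof.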
Applying Lemma 8 we find that the sample size that guarantees a correct reconstruction with probability at least $1 - \lambda$ is

$$\frac{1}{\gamma}\left(\log l + \log\frac{1}{\lambda}\right), \tag{23}$$

where

$$\gamma \geq \min_i \frac{a\,\mathrm{length}(A_i^{\delta/4})}{\mathrm{length}(G)} \geq \frac{a\,\delta}{4\,\mathrm{length}(G)},$$

and we bound the covering number $l$ in terms of the packing number, using Lemma 9:

$$l \leq \frac{\mathrm{length}(G)}{\min_i \mathrm{length}(A_i^{\delta/8})} \leq \frac{8\,\mathrm{length}(G)}{\delta}.$$

Therefore, from (23), if

$$n = \frac{4\,\mathrm{length}(G)}{a\,\delta}\left(\log\left(\frac{8\,\mathrm{length}(G)}{\delta}\right) + \log\frac{1}{\lambda}\right) \tag{24}$$

we have a $\delta/2$-dense sample with probability at least $1 - \lambda$ and, by Proposition 4, $P(\hat G \not\simeq G) \leq \lambda$. Solving (24) for $\lambda$ gives the result.

Note that, in the noiseless case, the upper and lower bounds are tight up to polynomial factors in the parameters $\tau, b, \xi$. There is a small gap with respect to $\alpha$; closing this gap is an open problem. In the tubular noise case, we assume that $\sigma$ is small enough to guarantee that Algorithm 1 correctly reconstructs a metric graph starting from a $\delta/2$-dense sample.

Theorem 11 Assume that $\sigma$ satisfies condition (19) and $0 < \sigma < \min\{3\tau/16,\ \delta/8\}$, where

$$\delta = C_0 \min\left\{ \frac{\xi - 3\sigma - \tau\sin(\alpha/2) + (\tau - \sigma)\sin(\alpha'/2)}{3/2 + [2\sin(\alpha'/4)]^{-1}},\ f(b, \alpha, \tau, \xi, \sigma) \right\}, \tag{25}$$

for some $0 < C_0 < 1$. Under the tubular noise model, an upper bound on the minimax risk $R_n$ is given by

$$R_n \leq \frac{16\,\mathrm{length}(G)}{\delta} \exp\left(-\frac{C'_D\, \delta\, (\tau - 8\sigma)\, n}{\tau\, \mathrm{length}(G)}\right),$$

where $C'_D$ is a constant depending on the ambient dimension.

Proof Proposition 4 implies that the graph $G$ can be reconstructed from a $\delta/2$-dense sample $Y$ if

$$\delta < \min\left\{ \frac{\xi - 3\sigma - \tau\sin(\alpha/2) + (\tau - \sigma)\sin(\alpha'/2)}{3/2 + [2\sin(\alpha'/4)]^{-1}},\ f(b, \alpha, \tau, \xi, \sigma) \right\}, \tag{26}$$

which is satisfied by the value of $\delta$ selected in (25). We look for the sample size $n$ that guarantees a $\delta/2$-dense sample in $G_\sigma$ with high probability. We consider a cover of the metric graph $G$ by Euclidean balls of radius $\delta/8$.
Let $\{x_i : 1 \leq i \leq l\}$ be the centers of such balls that constitute a minimal cover. Note that $D$-dimensional balls of radius $\delta/8 + \sigma \leq \delta/4$ centered at the same $x_i$'s constitute a cover of the tubular region $G_\sigma$. We define $A_i^{\delta/8+\sigma} = B_{\delta/8+\sigma}(x_i) \cap G_\sigma$. Applying Lemma 8 we find that the sample size that guarantees a $\delta/2$-dense sample in $G_\sigma$ (and a correct topological reconstruction of $G$) with probability at least $1 - \lambda$ is

$$\frac{1}{\gamma}\left(\log l + \log\frac{1}{\lambda}\right), \tag{27}$$

where

$$\gamma = \min_i \frac{\mathrm{vol}(A_i^{\delta/8+\sigma})}{\mathrm{vol}(G_\sigma)}. \tag{28}$$

Define $\tilde{A}_i^{\delta} = B_\delta(x_i) \cap G$. The covering number $l$ is bounded in terms of the packing number, using Lemma 9,

$$l \leq \frac{\mathrm{length}(G)}{\min_i \mathrm{length}(\tilde{A}_i^{\delta/16})} \leq \frac{16\,\mathrm{length}(G)}{\delta}.$$

We construct a lower bound on $\gamma$ by deriving an upper bound on the denominator of (28) and a lower bound on the numerator.

Upper bound on $\mathrm{vol}(G_\sigma)$. Let $N_\sigma$ be the $\sigma$-covering number of $G$ and let $\mathcal{C}_\sigma$ be the set of centers of this cover. By Lemma 9, $N_\sigma$ is bounded by the $\sigma/2$-packing number. A simple volume argument gives $N_\sigma \leq C\,\mathrm{length}(G)/\sigma$, for some constant $C$. Note that $D$-dimensional balls of radius $2\sigma$ around each of the centers in $\mathcal{C}_\sigma$ cover $G_\sigma$. Thus

$$\mathrm{vol}(G_\sigma) \leq v_D\, N_\sigma\, (2\sigma)^D \leq C_D\, \mathrm{length}(G)\, \sigma^{D-1}$$

for some constant $C_D$ depending on the ambient dimension.

Lower bound on $\mathrm{vol}(A_i^{\delta/8+\sigma})$, for all $i$. Let $P_A(\sigma)$ be the $\sigma$-packing number of $\tilde{A}_i^{\delta/8}$ and let $\mathcal{D}_A$ be the set of centers of this packing. Then $\mathrm{vol}(A_i^{\delta/8+\sigma}) \geq P_A(\sigma)\, v_D\, \sigma^D$, because the union of the balls of radius $\sigma$ around $\mathcal{D}_A$ is contained in $A_i^{\delta/8+\sigma}$. Let $C_A(2\sigma)$ be the $2\sigma$-covering number of $\tilde{A}_i^{\delta/8}$ and let $\mathcal{C}_A = \{z_1, \ldots, z_{C_A(2\sigma)}\}$ be the set of centers of this cover.
By Lemma 9,

$$P_A(\sigma) \geq C_A(2\sigma) \geq \frac{\mathrm{length}(\tilde{A}_i^{\delta/8})}{\max_{z_j \in \mathcal{C}_A} \mathrm{length}(B_{2\sigma}(z_j) \cap \tilde{A}_i^{\delta/8})} \geq \frac{\delta/8}{\max_{z_j \in \mathcal{C}_A} \mathrm{length}(B_{2\sigma}(z_j) \cap \tilde{A}_i^{\delta/8})}$$

and, since $2\sigma < 3\tau/8$, by Corollary 1.3 in Chazal (2013),

$$\max_{z_j \in \mathcal{C}_A} \mathrm{length}(B_{2\sigma}(z_j) \cap \tilde{A}_i^{\delta/8}) \leq C_2 \left(\frac{\tau}{\tau - 8\sigma}\right) \sigma,$$

for some constant $C_2$. Thus

$$\gamma \geq \frac{P_A(\sigma)\, v_D\, \sigma^D}{C_D\, \mathrm{length}(G)\, \sigma^{D-1}} \geq C'_D\, \frac{\delta\,(\tau - 8\sigma)}{\tau\, \mathrm{length}(G)},$$

where $C'_D$ is a constant depending on the ambient dimension. Finally, from (27), if

$$n = \frac{\tau\, \mathrm{length}(G)}{C'_D\, \delta\, (\tau - 8\sigma)}\left(\log\left(\frac{16\,\mathrm{length}(G)}{\delta}\right) + \log\frac{1}{\lambda}\right), \tag{29}$$

then the sample is $\delta/2$-dense with probability at least $1 - \lambda$ and $P(\hat G \not\simeq G) \leq \lambda$. Rearranging we obtain

$$R_n \leq \exp\left(-\frac{C'_D\, \delta\, (\tau - 8\sigma)\, n}{\tau\, \mathrm{length}(G)} + \log\left(\frac{16\,\mathrm{length}(G)}{\delta}\right)\right).$$

5. Discussion

In this paper, we presented a statistical analysis of metric graph reconstruction. We derived sufficient conditions on random samples from a graph metric space that guarantee topological reconstruction and we derived lower and upper bounds on the minimax risk for this problem.

Various improvements and theoretical extensions are possible. In Proposition 4 we have analyzed Algorithm 1 using the Euclidean distance at every step. It is possible to obtain a similar result using a different notion of distance, for example, the distance induced by a Rips-Vietoris graph constructed on the sample.

While in our analysis we mainly relied on the assumption of a dense sample, Aanjaneya et al. (2012) used the more refined but stronger assumption of the sample being an approximation of the metric graph, which we recall: given positive numbers $\varepsilon$ and $R$, we say that $(Y, d_Y)$ is an $(\varepsilon, R)$-approximation of the metric space $(G, d_G)$ if there exists a correspondence $C \subset G \times Y$ such that

$$(x, y), (x', y') \in C,\ \min(d_G(x, x'), d_Y(y, y')) \leq R \implies \left| d_G(x, x') - d_Y(y, y') \right| \leq \varepsilon.$$
(30)

As shown in Aanjaneya et al. (2012), the $(\varepsilon, R)$-approximation assumption is sufficient, for appropriate choice of the parameters $\varepsilon$ and $R$, to recover not only the topology of a metric graph $(G, d_G)$, but also its metric $d_G$ with high accuracy. However, when compared to the dense sample assumption, it demands a larger sample complexity to achieve accurate topological reconstruction. A strategy similar to the one used in this paper could be used to determine the sample size that guarantees an $(\varepsilon, R)$-approximation of the underlying metric graph with high probability. This would guarantee a correct topological reconstruction, as well as an approximation of the metric $d_G$.

We are also investigating the idea of combining metric graph reconstruction with the subspace constrained mean-shift algorithm (Fukunaga and Hostetler, 1975; Comaniciu and Meer, 2002; Genovese et al., 2012b) to provide similar guarantees. Our preliminary results indicate that this mixed strategy works very well under more general noise assumptions and with relatively low sample size.

Acknowledgments

Research supported by NSF CAREER Grant DMS 114967, Air Force Grant FA95500910373, NSF Grant DMS-0806009. The authors thank the referees for helpful comments and suggestions.

References

Mridul Aanjaneya, Frederic Chazal, Daniel Chen, Marc Glisse, Leonidas Guibas, and Dmitriy Morozov. Metric graph reconstruction from noisy data. International Journal of Computational Geometry & Applications, 22(04):305–325, 2012.

Mahmuda Ahmed and Carola Wenk. Probabilistic street-intersection reconstruction from gps trajectories: approaches and challenges. In Proceedings of the Third ACM SIGSPATIAL International Workshop on Querying and Mining Uncertain Spatio-Temporal Data, pages 34–37. ACM, 2012.

Ery Arias-Castro, Guangliang Chen, and Gilad Lerman.
Spectral clustering based on local linear approximations. Electronic Journal of Statistics, 5:1537–1587, 2011.

Giorgio A Ascoli, Duncan E Donohue, and Maryam Halavi. Neuromorpho.org: a central resource for neuronal morphologies. The Journal of Neuroscience, 27(35):9247–9251, 2007.

Paul Bendich. Analyzing stratified spaces using persistent versions of intersection and local homology. ProQuest, 2008.

Paul Bendich, Sayan Mukherjee, and Bei Wang. Towards stratification learning through homology inference. arXiv preprint arXiv:1008.3572, 2010.

Paul Bendich, Bei Wang, and Sayan Mukherjee. Local homology transfer and stratification learning. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1355–1370. SIAM, 2012.

F. Chazal and A. Lieutier. Topology guaranteeing manifold reconstruction using distance function to noisy data. In Proceedings of the twenty-second annual symposium on Computational geometry, pages 112–118. ACM, 2006.

Frédéric Chazal. An upper bound for the volume of geodesic balls in submanifolds of euclidean spaces. Technical report, INRIA, January 2013.

Frédéric Chazal and Jian Sun. Gromov-hausdorff approximation of metric spaces with linear structure. arXiv preprint arXiv:1305.1172, 2013.

Frédéric Chazal, David Cohen-Steiner, and André Lieutier. A sampling theory for compact sets in euclidean space. Discrete & Computational Geometry, 41(3):461–479, 2009.

Daniel Chen, Leonidas J Guibas, John Hershberger, and Jian Sun. Road network reconstruction for organizing paths. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1309–1320. Society for Industrial and Applied Mathematics, 2010.

Alexey Chernov and Vitaliy Kurlin. Reconstructing persistent graph structures from noisy images. Electronic Journal Image-A, 3(5):19–22, 2013.

Dorin Comaniciu and Peter Meer.
Mean shift: A robust approach toward feature space analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(5):603–619, 2002.

Herbert Federer. Curvature measures. Transactions of the American Mathematical Society, 93(3):418–491, 1959.

Keinosuke Fukunaga and Larry Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. Information Theory, IEEE Transactions on, 21(1):32–40, 1975.

Xiaoyin Ge, Issam I Safa, Mikhail Belkin, and Yusu Wang. Data skeletonization via reeb graphs. In Advances in Neural Information Processing Systems, pages 837–845, 2011.

Christopher R Genovese, Marco Perone-Pacifico, Isabella Verdinelli, and Larry Wasserman. Minimax manifold estimation. Journal of Machine Learning Research, 13:1263–1291, 2012a.

Christopher R Genovese, Marco Perone-Pacifico, Isabella Verdinelli, and Larry Wasserman. Nonparametric ridge estimation. arXiv preprint arXiv:1212.5156, 2012b.

Attila I Gulyás, Manuel Megías, Zsuzsa Emri, and Tamás F Freund. Total number and ratio of excitatory and inhibitory synapses converging onto single interneurons of different types in the ca1 area of the rat hippocampus. The Journal of Neuroscience, 19(22):10082–10097, 1999.

Peter Kuchment. Quantum graphs: I. some basic structures. Waves in Random Media, 14(1):107–128, 2004.

P. Niyogi, S. Smale, and S. Weinberger. Finding the homology of submanifolds with high confidence. Discrete and Computational Geometry, 38(1-3):419–441, 2008.

Alexandre B Tsybakov. Introduction to nonparametric estimation. Springer, 2008.

Bin Yu. Assouad, fano, and le cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer, 1997.
