Statistical Analysis of Metric Graph Reconstruction

A metric graph is a 1-dimensional stratified metric space consisting of vertices and edges or loops glued together. Metric graphs can be naturally used to represent and model data that take the form of noisy filamentary structures, such as street map…

Authors: Fabrizio Lecci, Alessandro Rinaldo, Larry Wasserman

Fabrizio Lecci (lecci@cmu.edu)
Alessandro Rinaldo (arinaldo@cmu.edu)
Larry Wasserman (larry@cmu.edu)
Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Abstract

A metric graph is a 1-dimensional stratified metric space consisting of vertices and edges or loops glued together. Metric graphs can be naturally used to represent and model data that take the form of noisy filamentary structures, such as street maps, neurons, networks of rivers and galaxies. We consider the statistical problem of reconstructing the topology of a metric graph embedded in $\mathbb{R}^D$ from a random sample. We derive lower and upper bounds on the minimax risk for the noiseless case and tubular noise case. The upper bound is based on the reconstruction algorithm given in Aanjaneya et al. (2012).

Keywords: Metric Graph, Filament, Reconstruction, Manifold Learning, Minimax Estimation

1. Introduction

We are concerned with the problem of estimating the topology of filamentary data structures. Datasets consisting of points roughly aligned along intersecting or branching filamentary paths embedded in 2 or higher dimensional spaces have become an increasingly common type of data in a variety of scientific areas. For instance, road reconstruction based on GPS traces, localization of earthquake faults and galaxy reconstruction are all instances of a more general problem of estimating basic topological features of an underlying filamentary structure. The recent paper by Aanjaneya et al. (2012), upon which our work is based, contains further applications, as well as numerous references.

To provide a more concrete example, consider Figure 1. The left hand side displays raw data portraying a neuron from the hippocampus of a rat (Gulyás et al., 1999). The data were obtained from NeuroMorpho.Org (Ascoli et al., 2007).
The right hand side of the figure shows the output of the metric graph reconstruction obtained using the algorithm analyzed in this paper, originally proposed by Aanjaneya et al. (2012). The reconstruction, which takes the form of a graph, captures perfectly all the topological features of the neuron, namely, the relationship between the edges and vertices, the number of branching points and the degree of each node.

Metric graphs provide the natural geometric framework for representing intersecting filamentary structures. A metric graph embedded in a $D$-dimensional Euclidean space ($D \ge 2$) is a 1-dimensional stratified metric space. It consists of a finite number of points (0-dimensional strata) and curves (1-dimensional strata) of finite length, where the boundary of each curve is given by a pair of (not necessarily distinct) vertices (see the next section for a formal definition of a metric graph).

© 2013 Fabrizio Lecci, Alessandro Rinaldo and Larry Wasserman.

In this paper we study the problem of reconstructing the topology of metric graphs from possibly noisy data, from a statistical point of view. Specifically, we assume that we have a sample of points from a distribution supported on a metric graph or in a small neighborhood of it, and we are interested in recovering the topology of the corresponding metric graph. To this end, we use the metric graph reconstruction algorithm given in Aanjaneya et al. (2012). Furthermore, in our theoretical analysis we characterize explicitly the minimal sample size required for perfect topological reconstruction as a direct function of parameters defining the shape of the metric graph, introduced in Section 2. This leads to an upper bound on the risk of topological reconstruction.
Finally, we obtain a lower bound on the risk of topological reconstruction, which, in the noiseless case, almost matches the derived upper bound, indicating that the algorithm of Aanjaneya et al. (2012) behaves nearly optimally.

Outline. In Section 2 we formally define metric graphs, the statistical models we will consider and the assumptions we will use throughout. We will also describe several geometric quantities that are central to our analysis. Section 3 contains a detailed analysis of the performance of the algorithm of Aanjaneya et al. (2012) for metric graph reconstruction, under modified settings and assumptions. In Section 4 we derive lower and upper bounds for the minimax risk of the metric graph reconstruction problem. In Section 5 we conclude with some final comments.

Related Work. The work most closely related to ours is Aanjaneya et al. (2012), which was, in fact, the motivation for our work. From the theoretical side, we replace the key assumption in Aanjaneya et al. (2012) of the sample being an $(\varepsilon, R)$-approximation to the underlying metric graph by the milder assumption of the sample being dense in a neighborhood of the metric graph. Approximation and reconstruction of metric graphs have also been considered in Chazal and Sun (2013) and Ge et al. (2011). Metric graph reconstruction is related to the problem of estimating stratified spaces (basically, intersecting manifolds). Stratified spaces have been studied by a number of authors such as Bendich et al. (2010, 2012); Bendich (2008). A spectral method for estimating intersecting structures is given in Arias-Castro et al. (2011). There are a variety of algorithms for specific problems; for example, see Ahmed and Wenk (2012); Chen et al. (2010) for the reconstruction of road networks. Finally, Chernov and Kurlin (2013) derived an alternative algorithm that uses ideas from homology.

2. Background and Assumptions

The assumptions in Aanjaneya et al.
(2012) lead to a reconstruction process that is aimed at capturing the intrinsic structure of the data and is somewhat oblivious to its extrinsic embedding. The authors assume that the sample comes with a metric that is close to the intrinsic metric of the underlying graph, by imposing a limit on the Gromov-Hausdorff distance between the two metrics. By considering data embedded in the Euclidean space and focusing on the topological aspect, we show that the notion of dense sample is sufficient to guarantee a correct reconstruction.

Figure 1: Left: Neuron cr22e from the hippocampus of a rat; NeuroMorpho.Org (Ascoli et al., 2007). Right: A metric graph reconstruction of the neuron.

In this section we provide background on metric graph spaces and describe the assumptions and the geometric parameters that we will be using throughout.

Informally, a metric graph is a collection of vertices and edges glued together in some fashion. Here we state the formal definitions of path metric space and metric graph. For more details see Aanjaneya et al. (2012) and Kuchment (2004).

Definition 1 A metric space $(G, d_G)$ is a path metric space if the distance between any pair of points is equal to the infimum of the lengths of the continuous curves joining them. A metric graph is a path metric space $(G, d_G)$ that is homeomorphic to a 1-dimensional stratified space. A vertex of $G$ is a 0-dimensional stratum of $G$ and an edge of $G$ is a 1-dimensional stratum of $G$.

We will consider metric graphs embedded in $\mathbb{R}^D$. Note that, if one ignores the metric structure, namely the length of edges and loops, the shape or topology of a metric graph $(G, d_G)$ is encoded by a graph, whose vertices and edges correspond to vertices and edges of $G$. Since we allow for two vertices to be connected by more than one edge we are actually dealing with pseudographs.
We recall that an undirected pseudograph $(V, E)$ consists of a set of vertices $V$ and a multiset $E$ of unordered pairs of (not necessarily distinct) vertices. To a given pseudograph we can associate a function $f : E \to V \times V$, which, when applied to an edge $e \in E$, simply extracts the vertices to which $e$ is adjacent. Thus, if $e_1, e_2 \in E$ are such that $f(e_1) = f(e_2)$, then $e_1$ and $e_2$ are parallel edges. Similarly, if $e \in E$ is such that $f(e) = \{v, v\}$ for some $v \in V$, then $e$ is a loop. For each pair $(u, v) \in V \times V$, let $\nu(u, v) = |f^{-1}(\{u, v\})|$ if $\{u, v\} \in E$ and 0 otherwise. In particular, $\nu(u, v)$ is the number of edges between $u$ and $v$ (or loops if $u = v$).

We say that a metric graph reconstruction algorithm perfectly recovers the topology of $G$ if it outputs a pseudograph isomorphic to the pseudograph representing the topology of $G$.

We now define some key quantities regarding the structure of a metric graph. We start with the definition of reach. Let $M$ be a 1-dimensional manifold embedded in $\mathbb{R}^D$. For $u \in M$, let $T_u M$ denote the 1-dimensional tangent space to $M$ and let $T_u^\perp M$ be the $(D-1)$-dimensional normal space.

Definition 2 Define the fiber of size $a$ at $u \in M$ to be $L_a(u, M) = T_u^\perp M \cap B(u, a)$, where $B(u, a)$ is the $D$-dimensional ball of radius $a$ centered at $u$. If $M$ has boundary $\{v_1, v_2\}$, the fiber of size $a$ at $v_i$ is defined as the limit of $L_a(u, M)$, as $u$ approaches $v_i$ in $M \setminus \{v_1, v_2\}$. The reach of $M$ is the largest number $\tau$ such that the fibers $L_\tau(u, M)$ never intersect.

The reach sets a limit on the curvature of a manifold. A manifold with large reach does not come too close to being self-intersecting. For example, the reach of an arc of a circle is equal to its radius. The quantity $1/\tau$ is called the condition number in Niyogi et al. (2008). For more details see also Federer (1959); Chazal and Lieutier (2006); Genovese et al. (2012a).
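The pseudograph bookkeeping introduced earlier in this section (the multiplicity $\nu(u, v)$, parallel edges and loops) can be illustrated with a short sketch. This is only one possible encoding, assuming a multiset of sorted vertex pairs; the function names `make_pseudograph`, `nu` and `degree`, and the toy edge list, are hypothetical and not part of the original analysis.

```python
from collections import Counter

def make_pseudograph(edges):
    """Store a pseudograph as a multiset of unordered vertex pairs."""
    # Sorting each pair makes (u, v) and (v, u) the same key; (v, v) is a loop.
    return Counter(tuple(sorted(e)) for e in edges)

def nu(pg, u, v):
    """Number of edges between u and v (number of loops at u if u == v)."""
    return pg[tuple(sorted((u, v)))]

def degree(pg, v):
    """Degree of v, with every loop at v contributing 2."""
    return sum(m * (2 if a == b else 1)
               for (a, b), m in pg.items() if v in (a, b))

# Hypothetical example: two parallel edges between "u" and "v", one loop at "v".
pg = make_pseudograph([("u", "v"), ("v", "u"), ("v", "v")])
print(nu(pg, "u", "v"), nu(pg, "v", "v"), degree(pg, "v"))
```

With this encoding, two pseudographs on the same labeled vertex set are equal exactly when their multisets coincide; checking isomorphism in general additionally requires searching over vertex relabelings, which the sketch does not attempt.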
Each edge of a metric graph $(G, d_G)$ can be seen as a 1-dimensional manifold with boundary. Let the local reach of a metric graph $G$ be the minimum reach associated to an edge of $G$.

When 2 edges intersect at a vertex $v$ they create an angle, where the angle between two intersecting curves is formally defined as follows. Suppose that $e_1$ and $e_2$ intersect at $x$. Let $B(x, \epsilon)$ be the $D$-dimensional ball of radius $\epsilon$ centered at $x$. Let $\ell_1(\epsilon)$ be the line segment joining the two points $x$ and $\partial B(x, \epsilon) \cap e_1$. Let $\ell_2(\epsilon)$ be the line segment joining the two points $x$ and $\partial B(x, \epsilon) \cap e_2$. Let $\alpha_\epsilon(e_1, e_2)$ be the angle between $\ell_1(\epsilon)$ and $\ell_2(\epsilon)$. The angle between $e_1$ and $e_2$ is $\alpha(e_1, e_2) = \lim_{\epsilon \to 0} \alpha_\epsilon(e_1, e_2)$. We assume that, for each pair of intersecting edges $e_1$ and $e_2$, the angle $\alpha(e_1, e_2)$ is well-defined.

To control points far away in the graph distance, but close in the embedding space, we define
$$A_G = \{(x, x') \in G \times G : d_G(x, x') \ge \min(b, \tau\alpha)\},$$
where $b$ is the length of the shortest edge of $G$, $\tau$ is the local reach and $\alpha$ is the smallest angle formed by two edges of $G$. We define the global reach as the infimum of the Euclidean distances among pairs of points in $A_G$, that is, $\xi = \inf_{(x, x') \in A_G} \|x - x'\|_2$.

Let $(G, d_G)$ be a metric graph and, for a constant $\sigma \ge 0$, let $G_\sigma = \{y : \inf_{x \in G} \|x - y\|_2 \le \sigma\}$ be the $\sigma$-tube around $G$. If $\sigma = 0$, then, trivially, $G_\sigma = G$. Notice that $G_\sigma$ is a set of dimension $D$ if $\sigma > 0$. We will use the assumption that the sample $Y$ is sufficiently dense in $G_\sigma$ with respect to the Euclidean metric, as formalized below.

Definition 3 The sample $Y = \{y_1, \ldots, y_n\} \subset G_\sigma \subset \mathbb{R}^D$ is $\frac{\delta}{2}$-dense in $G_\sigma$ if for every $x \in G_\sigma$, there exists a $y \in Y$ such that $\|x - y\|_2 < \frac{\delta}{2}$.

The problem of metric graph reconstruction consists of reconstructing a metric graph $G$ given a dense sample $\{y_1, \ldots$
$, y_n\} = Y \subset G_\sigma$ endowed with a distance $d_Y$, which could be the $D$-dimensional Euclidean distance or some more complicated notion of distance. If $\sigma = 0$ we say that the sample $Y$ is noiseless, while if $\sigma > 0$, we say that $Y$ is a noisy sample.

Throughout our analysis we restrict our attention to metric graphs embedded in $\mathbb{R}^D$ that satisfy the following assumptions:

A1 The graphs have finite total length and are free of nodes of degree 2 (though they may contain vertices of degree 1 or 3 and higher).
A2 Each edge is a smooth embedded sub-manifold of dimension 1, of length at least $b > 0$ and with reach at least $\tau > 0$.
A3 Each pair of intersecting edges forms a well-defined angle of size at least $\alpha > 0$.
A4 The global reach is at least $\xi > 0$.

Assumptions A1 and A2 allow us to consider each edge of a metric graph as a single smooth curve. A3 and A4 are additional regularity conditions on the separation between different edges. Assumptions similar to A1-A4 are common in the literature. For different regularity conditions that allow for corners within an edge see, for example, Chazal et al. (2009) and Chen et al. (2010).

Let $\mathcal{G}$ be the set of metric graphs embedded in $\mathbb{R}^D$ that satisfy assumptions A1, A2, A3 and A4, involving the parameters $b$, $\alpha$, $\tau$, $\xi$. We consider two noise models:

Noiseless. We observe data $Y_1, \ldots, Y_n \sim P$, where $P \in \mathcal{P}$, a collection of probability distributions supported over metric graphs $(G, d_G)$ in $\mathcal{G}$ having densities $p$ with respect to the length of $G$ bounded from below by a constant $a > 0$.

Tubular Noise. We observe data $Y_1, \ldots, Y_n \sim P_{G,\sigma}$, where $P_{G,\sigma}$ is uniform on the $\sigma$-tube $G_\sigma$. In this case we consider the collection $\mathcal{P} = \{P_{G,\sigma} : G \in \mathcal{G}\}$.
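The density requirement of Definition 3 can be checked approximately in code. The sketch below is only an illustration under stated assumptions: it handles the noiseless case ($\sigma = 0$), replaces the continuum $G$ by a fine discretization (so it can only approximate the definition), and uses a hypothetical graph consisting of a single unit segment. The names `covering_radius` and `is_dense` are not from the original analysis.

```python
import numpy as np

def covering_radius(reference, sample):
    """Largest distance from a reference point to its nearest sample point."""
    diffs = reference[:, None, :] - sample[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
    return nearest.max()

def is_dense(reference, sample, delta):
    """Approximate Definition 3: each reference point has a sample point
    at Euclidean distance strictly less than delta / 2."""
    return covering_radius(reference, sample) < delta / 2

# Hypothetical noiseless example: G is the unit segment in R^2.
t = np.linspace(0.0, 1.0, 1001)
reference = np.column_stack([t, np.zeros_like(t)])  # fine discretization of G
sample = reference[::50]                            # 21 points, spacing 0.05

print(covering_radius(reference, sample))
print(is_dense(reference, sample, delta=0.1))
print(is_dense(reference, sample, delta=0.04))
```

Because the check runs over a discretization rather than over every point of $G_\sigma$, a positive answer is only reliable up to the resolution of the reference grid.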
We are interested in bounding the minimax risk
$$R_n = \inf_{\hat G} \sup_{P \in \mathcal{P}} P^n\big(\hat G \not\simeq G\big), \tag{1}$$
where the infimum is over all estimators $\hat G$ of the topology of $(G, d_G)$, the supremum is over the class of distributions $\mathcal{P}$ for $Y$ and $\hat G \not\simeq G$ means that $\hat G$ and $G$ are not isomorphic. In Section 4 we will find lower and upper bounds for $R_n$ in the noiseless case and the tubular noise case.

We conclude this section by summarizing the many parameters and symbols involved in our analysis. See Table 1.

  Symbol            Meaning
  $(G, d_G)$        metric graph
  $\alpha$          smallest angle
  $b$               shortest edge
  $\tau$            local reach
  $\xi$             global reach
  $\mathcal{G}$     set of metric graphs embedded in $\mathbb{R}^D$, satisfying A1-A4
  $\mathcal{P}$     set of distributions on $G$ or $G_\sigma$
  $G_\sigma$        $\sigma$-tube around $G$
  $Y$               sample, subset of $G_\sigma$
  $\delta$          $Y$ is a $\delta/2$-dense sample

Table 1: Summary of the symbols used in our analysis.

3. Performance Analysis for the Algorithm of Aanjaneya et al. (2012)

In this section we study the performance of the metric graph reconstruction algorithm of Aanjaneya et al. (2012), under assumptions A1-A4 and with a choice of parameters adapted to our setting. In Section 4 we will use these results to derive bounds on the minimax rate for topology reconstruction.

The metric graph reconstruction algorithm is presented in Algorithm 1. The algorithm takes a (possibly noisy) sample $Y$ from a metric graph $G$ and a distance $d_Y$ defined on $Y$ and returns a graph $\hat G$ that approximates $G$. The key idea is the following: a shell of radius $r$ is constructed around each point in the sample, which is labeled edge point if its shell contains 2 well separated clusters of sampled points and vertex point otherwise. Several steps of the algorithm require the construction of a Rips-Vietoris graph of parameter $\delta$: $R_\delta(S_y)$ is a graph whose vertices are all the points of $S_y$ and there is an edge between two points if the Euclidean distance between them is not larger than $\delta$.
At Step 11 some of the edge points that are close to vertices are re-labeled as vertex points. This expansion guarantees a precise borderline between clusters of vertex points and clusters of edge points. At Steps 15-17 each of these clusters is associated to a vertex or to an edge of the reconstructed graph $\hat G$.

Algorithm 1 Metric Graph Reconstruction Algorithm
Input: sample $Y$, $d_Y$, $r$, $p_{11}$.
 1: Labeling points as edge or vertex
 2: for all $y \in Y$ do
 3:   $S_y \leftarrow B(y, r + \delta) \setminus B(y, r)$
 4:   $\deg_r(y) \leftarrow$ number of connected components of the Rips-Vietoris graph $R_\delta(S_y)$
 5:   if $\deg_r(y) = 2$ then
 6:     Label $y$ as an edge point.
 7:   else
 8:     Label $y$ as a preliminary vertex point.
 9:   end if
10: end for
11: Label all points within Euclidean distance $p_{11}$ from a preliminary vertex point as vertices.
12: Let $E$ be the points of $Y$ labeled as edge points.
13: Let $V$ be the points of $Y$ labeled as vertices.
14: Reconstructing the graph structure
15: Compute the connected components of the Rips-Vietoris graphs $R_\delta(E)$ and $R_\delta(V)$.
16: Let the connected components of $R_\delta(V)$ be the vertices of the reconstructed graph $\hat G$.
17: Let there be an edge between vertices of $\hat G$ if their corresponding connected components in $R_\delta(V)$ contain points at distance less than $\delta$ from the same connected component of $R_\delta(E)$.
Output: $\hat G$.

We will analyze the algorithm considering the Euclidean distance on the sample $Y$, that is, $d_Y = \|\cdot\|_2$. The inner radius of the shell at Step 3 and the width of the expansion at Step 11 are parameters the user has to specify.

Before finding how dense a sample has to be in order to guarantee a correct reconstruction of a metric graph, we show that it is sufficient to study a particular metric graph embedded in $\mathbb{R}^2$, which represents the worst case.
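The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration with the Euclidean distance, not the authors' implementation, and it rests on several assumptions: the shell is taken as the half-open set $\{u : r < \|u - y\|_2 \le r + \delta\}$, each connected component of $R_\delta(E)$ is turned into exactly one edge (a loop when it touches a single vertex cluster), and the star-shaped sample and the values of `r`, `delta` and `p11` are hypothetical rather than derived from the parameter choices analyzed later.

```python
import numpy as np

def components(points, delta):
    """Label the connected components of the Rips-Vietoris graph R_delta(points)."""
    labels = -np.ones(len(points), dtype=int)
    current = 0
    for start in range(len(points)):
        if labels[start] >= 0:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first search over pairs at distance <= delta
            i = stack.pop()
            near = np.linalg.norm(points - points[i], axis=1) <= delta
            for j in np.flatnonzero(near & (labels < 0)):
                labels[j] = current
                stack.append(j)
        current += 1
    return labels

def reconstruct(Y, r, delta, p11):
    """Sketch of Algorithm 1; returns the vertex count and a list of edges."""
    dist = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    # Steps 2-10: a point is an edge point iff its shell has exactly 2 components.
    prelim = np.zeros(len(Y), dtype=bool)
    for i in range(len(Y)):
        shell = Y[(dist[i] > r) & (dist[i] <= r + delta)]
        prelim[i] = len(set(components(shell, delta))) != 2
    # Step 11: points within p11 of a preliminary vertex point become vertices.
    is_vertex = (dist[:, prelim] <= p11).any(axis=1)
    V, E = Y[is_vertex], Y[~is_vertex]
    # Step 15: cluster vertex points and edge points separately.
    v_lab, e_lab = components(V, delta), components(E, delta)
    # Steps 16-17: vertex clusters are vertices; each edge cluster yields one edge.
    edges = []
    for c in range(e_lab.max() + 1 if len(E) else 0):
        d = np.linalg.norm(E[e_lab == c][:, None, :] - V[None, :, :], axis=2)
        touched = sorted(set(v_lab[(d < delta).any(axis=0)]))
        if len(touched) == 1:
            edges.append((touched[0], touched[0]))  # loop
        elif len(touched) == 2:
            edges.append(tuple(touched))
    return (v_lab.max() + 1 if len(V) else 0), edges

# Hypothetical noiseless sample: three unit arms meeting at the origin.
angles = 2 * np.pi * np.arange(3) / 3
t = 0.02 * np.arange(1, 51)
arms = [np.outer(t, [np.cos(a), np.sin(a)]) for a in angles]
Y = np.vstack([np.zeros((1, 2))] + arms)

n_vertices, edges = reconstruct(Y, r=0.105, delta=0.05, p11=0.15)
print(n_vertices, edges)
```

On this toy sample the sketch should recover one branching vertex joined to three degree-1 vertices. The quadratic distance matrix keeps the code short but limits it to small samples.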
In other words, if the metric graph algorithm can reconstruct this particular planar graph, then it can reconstruct any other metric graph that satisfies A1-A4.

3.1 The worst case: a metric graph in $\mathbb{R}^2$

The worst case is the one for which it is hard to distinguish two edges that intersect at a vertex because they are too close in the embedding space. Figure 2 (top left) shows an edge $e$ that intersects two edges $e_1$, $e_2$ with reach $\tau$, forming an angle $\alpha$ at vertex $x$. For simplicity, we consider this metric graph embedded in $\mathbb{R}^3$ ($D = 3$). Therefore Figure 2 shows the projections of $e$, $e_1$ and $e_2$ on the (limit) plane formed by $e_1$ and $e_2$, passing through $x$.

Figure 2: Even in the worst case, edges $e_1$ and $e_2$ must lie outside of the tori constructed on the fibers $L_\tau(x, e_1)$ and $L_\tau(x, e_2)$.

We focus on edge $e_2$. The blue segment $AB$ is the projection of $L_\tau(x, e_2)$, the fiber of size $\tau$ around $x$. In $\mathbb{R}^3$, $L_\tau(x, e_2)$ is a circle of radius $\tau$ centered at $x$. By definition, for any $y \in e_2$, the fiber $L_\tau(y, e_2)$ cannot intersect the fiber $L_\tau(x, e_2)$, otherwise the assumption involving the reach would be violated. We represent this condition by taking a circle $C$ of radius $\tau$ centered at $B$ and rotating it around $x$ along the circumference of $L_\tau(x, e_2)$. This procedure forms a torus with an inner loop of radius 0. Edge $e_2$ must lie outside of this torus, so that its fibers do not intersect $L_\tau(x, e_2)$. See the top right plot of Figure 2. The same reasoning applies to edge $e_1$, which must lie outside of the torus constructed on $L_\tau(x, e_1)$. See the bottom left plot.

The worst case is the one for which $e_1$ and $e_2$ are as close as possible: on the same plane and on the boundaries of the two tori. This case is represented in the bottom right plot of Figure 2. Note that $e_1$ and $e_2$ are simply arcs of circles of radius $\tau$.
Figure 3: Left: edges $e_1$ and $e_2$ with minimum reach $\tau$ forming the smallest angle $\alpha$ at vertex $x$. Right: same metric graph with a tube of radius $\sigma$ around it.

We will use basic trigonometric properties of the worst case. In Figure 3 (left), $O$ and $O'$ are the centers of the circles associated to edges $e_1$ and $e_2$. It is easy to see that angle $\widehat{O x O'}$ has width $\pi - \alpha$. It can be shown that
$$\widehat{x O O'} = \alpha/2, \tag{2}$$
$$\widehat{T x Q} = \alpha/4. \tag{3}$$
Let $Y$ be a noisy sample of $G$. In other words, $Y$ is a subset of $G_\sigma$, the tube of radius $\sigma \ge 0$ around the metric graph $G$. See Figure 3 (right). Let $Q$ be the midpoint of segment $OO'$ and let $T$ be the intersection point of $OO'$ and edge $e_1$. For $0 \le \sigma \le QT = \tau - \tau\cos(\alpha/2)$, the smallest angle formed by the inner faces of the tube around the metric graph is
$$\alpha' = \pi - \arccos\left(\frac{2(\tau - \sigma)^2 - 4\tau^2\cos^2(\alpha/2)}{2(\tau - \sigma)^2}\right), \tag{4}$$
where we applied the cosine law to the triangle $O s O'$ and the fact that angle $\widehat{O s O'}$ has width $\pi - \alpha'$. Note that if $\sigma = 0$ then $\alpha' = \alpha$. As in (3), it can be shown that
$$\widehat{R s Q} = \alpha'/4. \tag{5}$$
The few basic trigonometric equations described above will be used to determine under which conditions on $b, \alpha, \tau, \xi, \sigma$ the metric graph reconstruction algorithm can reconstruct the worst case.

3.2 Analysis of Algorithm 1 with Euclidean distance

In this section we analyze Algorithm 1. It is sufficient to study the worst case of Figures 2 and 3 and extend the results to any metric graph in $\mathbb{R}^D$. The Euclidean distance is used at every step of the algorithm, which requires the specification of $r$, the inner radius of the shell, and $p_{11}$, the parameter governing the expansion of Step 11.
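Equation (4) is easy to evaluate numerically, which gives a quick check of the remark that $\alpha' = \alpha$ when $\sigma = 0$. The following is a small sketch; the function name `alpha_prime` and the sample values of $\alpha$, $\tau$ and $\sigma$ are illustrative assumptions, and the clamping of the arccos argument is only there to absorb floating-point rounding.

```python
import math

def alpha_prime(alpha, tau, sigma):
    """Smallest angle between the inner faces of the sigma-tube, equation (4)."""
    if not 0 <= sigma <= tau - tau * math.cos(alpha / 2):
        raise ValueError("equation (4) assumes 0 <= sigma <= tau - tau*cos(alpha/2)")
    num = 2 * (tau - sigma) ** 2 - 4 * tau ** 2 * math.cos(alpha / 2) ** 2
    den = 2 * (tau - sigma) ** 2
    # Clamp to [-1, 1] so rounding errors cannot push acos out of its domain.
    return math.pi - math.acos(max(-1.0, min(1.0, num / den)))

# Illustrative values (assumed): alpha = pi/3 and tau = 30.
alpha, tau = math.pi / 3, 30.0
for sigma in (0.0, 1.0, 2.0):
    print(sigma, alpha_prime(alpha, tau, sigma))
```

For these values the angle shrinks as $\sigma$ grows, and it reaches 0 at the largest admissible noise level $\sigma = \tau - \tau\cos(\alpha/2)$, where the inner faces of the tube meet.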
We set
$$r = \frac{\delta}{2} + \sigma + \tau\sin(\alpha/2) - (\tau - \sigma)\sin(\alpha'/2) + \frac{\delta}{2\sin(\alpha'/4)} \tag{6}$$
and
$$p_{11} = \frac{\delta}{2} + \tau\sin(\alpha/2) - (\tau - \sigma)\sin(\alpha'/2) + \frac{r + \delta}{\sin(\alpha'/2)}. \tag{7}$$
This choice is justified in the proof of Proposition 4. Define
$$f(b, \alpha, \tau, \xi, \sigma) := \frac{(\tau - \sigma)\sin\left(\frac{\min(b, \alpha\tau) - (\alpha - \alpha')\tau}{2\tau}\right) - \left[\tau\sin(\alpha/2) - (\tau - \sigma)\sin(\alpha'/2)\right]\left(1 + \frac{2}{\sin(\alpha'/2)}\right) - \frac{2\sigma}{\sin(\alpha'/2)}}{1 + 3[\sin(\alpha'/2)]^{-1} + [\sin(\alpha'/2)\sin(\alpha'/4)]^{-1}}, \tag{8}$$
where $\alpha'$ is given in (4). Note that $f(b, \alpha, \tau, \xi, \sigma)$ is a decreasing function of $\sigma$.

Proposition 4 If $Y$ is $\frac{\delta}{2}$-dense in $G_\sigma$ and
$$0 < r + \delta < \xi - 2\sigma, \tag{9}$$
$$0 < \delta < f(b, \alpha, \tau, \xi, \sigma), \tag{10}$$
then the graph $\hat G$ provided by Algorithm 1 (input: $Y$, $\|\cdot\|_2$, $r$, $p_{11}$) is isomorphic to $G$.

Proof We will show that under conditions (9) and (10), Algorithm 1 can reconstruct the worst case described in Section 3.1, formed by edges $e_1$ and $e_2$ of reach $\tau$ forming an angle of width $\alpha$. This will automatically imply that the algorithm can reconstruct the topology of other vertices and edges in the $D$-dimensional space.

Condition (9) guarantees that points of $G$ which are far apart in the metric graph distance $d_G$, and close in the embedding space, do not interfere in the construction of the shells at Steps 3-4.

The rest of the proof involves condition (10). Since the sample is $\frac{\delta}{2}$-dense in the tube, there is at least a point $y \in Y$ inside the ball of radius $\frac{\delta}{2}$ centered at any vertex $x \in G$. When using Algorithm 1 we want to be sure that $y$ is labeled as a vertex, that is, the number of connected components of the shell around $y$ is different from 2 (Steps 3-4). The worst case is depicted in Figure 4 (left), where $x$ is the vertex of minimum angle $\alpha$, formed

Figure 4: Left: edges $e_1$ and $e_2$ with minimum reach $\tau$ forming the smallest angle $\alpha$ at vertex $x$.
Right: The distance $\|F - G\|_2$ between the two connected components of the shell around an edge point $y'$ must be greater than $\delta$.

by two edges, $e_1$ and $e_2$, of reach $\tau$. First, we show that for the value of $r$ selected in (6), points close to an actual vertex are labeled as vertices at Steps 3-10 and points far from actual vertices are labeled as edges.

The inner faces of the tube of radius $\sigma$ around $e_1$ and $e_2$ form an angle of width $\alpha'$ at vertex $s$, as described in Section 3.1. Let $u$ and $v$ be the two points on the faces of the tube such that they are equidistant from $x$ and $\|u - v\|_2 = \delta$. Since at Step 4 we construct a $\delta$-graph to determine the number of connected components of the shell $S_y$ and we want $y$ to be a vertex, we choose $r$, the inner radius of the shell $S_y$, so that if $u, v \in Y$ then $r \ge \max\{d_Y(y, u), d_Y(y, v)\}$. This guarantees that for all $t_1, t_2 \in Y$ with $t_1$ around edge $e_1$, $t_2$ around edge $e_2$ such that $\{t_1, t_2\} \subset S_y$, we have $d_Y(t_1, t_2) \ge \delta$, that is, $t_1$ and $t_2$ belong to different connected components of the shell around $y$ at Step 4. The distance between $y$ and $u$ is bounded by $\|y - x\|_2 + \|x - s\|_2 + \|s - u\|_2$, where, using (2),
$$\|x - s\|_2 = \|x - Q\|_2 - \|s - Q\|_2 = \tau\sin(\alpha/2) - (\tau - \sigma)\sin(\alpha'/2)$$
and, using (5),
$$\|s - u\|_2 \le \frac{\delta}{2\sin(\alpha'/4)}. \tag{11}$$
Therefore we require that $r$, the inner radius of the shell of Step 4, satisfies
$$r \ge \frac{\delta}{2} + \|x - s\|_2 + \frac{\delta}{2\sin(\alpha'/4)} \ge \|y - x\|_2 + \|x - s\|_2 + \|s - u\|_2. \tag{12}$$
Another condition on $r$ arises when we label edge points far from actual vertices. See Figure 4 (right). If $y' \in Y$, then it should be labeled as an edge point. That is, at Step 4, the Rips graph $R_\delta(S_{y'})$ on the shell $S_{y'}$ should have 2 connected components. Therefore the distance $\|F - G\|_2$ between them must be greater than $\delta$.
We require that
$$r \ge 2\sigma + \delta/\sqrt{2}, \tag{13}$$
which implies $\|F - G\|_2 > \delta$ when $r$ is small enough, as implied by (10). Note that the value $r = \frac{\delta}{2} + \sigma + \|x - s\|_2 + \frac{\delta}{2\sin(\alpha'/4)}$ satisfies both (12) and (13).

The outer radius of the shell at Steps 3-4 has length $r + \delta$. This guarantees that when the shell around an edge point intersects the tube around $G$ there is at least a point $y \in Y$ in each connected component of the shell, since $Y$ is $\frac{\delta}{2}$-dense in $G_\sigma$.

In the last part of this proof we show that condition (10) is needed to guarantee that the sample is dense enough and the radius of the shells of Step 3 has the correct size, so that, even in the worst case, each vertex is associated to one set of sampled points at Steps 15-17 and these connected components are correctly linked by sets of sampled points labeled as edge points.

Let $z \in G_\sigma$ be the point around $e_2$ where the segment of length $r + \delta$, orthogonal to the face of the tube around edge $e_1$, intersects the face of the tube around edge $e_2$. See Figure 5. If this segment does not exist we simply consider the segment of length $r + \delta$ from $s$ to a point $z$ on $e_2$.

Figure 5: The shell around $z$ is tangent to edge $e_2$.

Suppose $z \in Y$. Among the points that might be labeled as vertices at Step 8 because of their closeness to vertex $x$, $z$ is the furthest from $x$, since the shell around $z$ is tangent to the tube around $e_1$. At Step 11, in order to control the labeling of the points in the tube between $y$ and $z$, we would like to label all the points in $\{y' \in Y : \|y' - y\|_2 \le \|y - z\|_2\}$ as vertices. To simplify the calculation we use the following bound:
$$\|y - z\|_2 \le \|y - x\|_2 + \|x - s\|_2 + \|s - z\|_2,$$
where, using (5),
$$\|s - z\|_2 \le \frac{r + \delta}{\sin(\alpha'/2)}. \tag{14}$$
This justifies the choice of $p_{11} = \frac{\delta}{2} + \|x - s\|_2 + \frac{r + \delta}{\sin(\alpha'/2)} \ge \|y - z\|_2$.
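The parameter choices (6) and (7), the bound $f$ in (8) and conditions (9) and (10) can be evaluated directly. The sketch below is an illustration only: the function names are hypothetical, the numerical values of $b$, $\alpha$, $\tau$, $\xi$, $\sigma$ and $\delta$ are assumed for the example, and the code presumes $0 \le \sigma < \tau - \tau\cos(\alpha/2)$ so that $\alpha' > 0$.

```python
import math

def alpha_prime(alpha, tau, sigma):
    """Equation (4): angle between the inner faces of the sigma-tube."""
    ratio = 1 - 2 * (tau * math.cos(alpha / 2) / (tau - sigma)) ** 2
    return math.pi - math.acos(max(-1.0, min(1.0, ratio)))

def shell_parameters(delta, alpha, tau, sigma):
    """Inner shell radius r from (6) and expansion width p11 from (7)."""
    ap = alpha_prime(alpha, tau, sigma)
    xs = tau * math.sin(alpha / 2) - (tau - sigma) * math.sin(ap / 2)  # ||x - s||_2
    r = delta / 2 + sigma + xs + delta / (2 * math.sin(ap / 4))
    p11 = delta / 2 + xs + (r + delta) / math.sin(ap / 2)
    return r, p11

def f_bound(b, alpha, tau, sigma):
    """Right-hand side of (10), equation (8); the formula does not involve xi."""
    ap = alpha_prime(alpha, tau, sigma)
    s2, s4 = math.sin(ap / 2), math.sin(ap / 4)
    xs = tau * math.sin(alpha / 2) - (tau - sigma) * s2
    arc = (min(b, alpha * tau) - (alpha - ap) * tau) / (2 * tau)
    num = (tau - sigma) * math.sin(arc) - xs * (1 + 2 / s2) - 2 * sigma / s2
    return num / (1 + 3 / s2 + 1 / (s2 * s4))

def conditions_hold(delta, b, alpha, tau, xi, sigma):
    """Check conditions (9) and (10) of Proposition 4."""
    r, _ = shell_parameters(delta, alpha, tau, sigma)
    return (0 < r + delta < xi - 2 * sigma) and (0 < delta < f_bound(b, alpha, tau, sigma))

# Hypothetical parameter values, chosen only for illustration.
b, alpha, tau, xi, sigma = 10.0, math.pi / 2, 5.0, 8.0, 0.1
print(f_bound(b, alpha, tau, sigma))
print(shell_parameters(0.2, alpha, tau, sigma))
print(conditions_hold(0.2, b, alpha, tau, xi, sigma))
```

A helper of this kind makes it easy to scan over $\delta$ for a given graph geometry and find the coarsest sample density that the proposition covers.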
Thus, at Step 11 we label all the points in $\{y' \in Y : \|y' - y\|_2 \le p_{11}$ and $y$ is labeled as vertex at Step 8$\}$ as vertices. If $z$ is actually labeled as a vertex at Step 8, then through the expansion of Step 11, all the points at distance not greater than $p_{11}$ from $z$ are labeled as vertices.

Finally we determine under which conditions there is at least a point in the tube around $e_2$ labeled as an edge point after Step 11. Consider the worst case in which $e_1$ and $e_2$ are forming an angle of size $\alpha$ at both their extremes $x$ and $x'$. See Figure 6.

Figure 6: Edges $e_1$ and $e_2$, forming an angle of size $\alpha$ at both their extremes $x$ and $x'$.

All the points $y' \in Y$ such that $\|y' - z\|_2 \le p_{11}$ or $\|y' - z'\|_2 \le p_{11}$ might be labeled as vertices. When we construct $R_\delta(E)$ and $R_\delta(V)$ at Step 15 the two sets of vertices around $x$ and $x'$ must be disconnected and there must be at least an edge point between them. A sufficient condition is that the length of edge $e_2$ is greater than $2(a_1 + a_2 + a_3) + a_4$, where

• $a_1$ is the length of the arc of $e_2$ formed by the projections of lines $Ox$ and $Os$ on $e_2$,
• $a_2$ is the length of the arc of $e_2$ formed by the projection of the chord of length $\|s - z\|_2$,
• $a_3$ is the length of the arc of $e_2$ formed by the projection of the chord of length $p_{11}$,
• $a_4$ is the length of the arc of $e_2$ formed by the projection of the chord of length $\delta$.

Note that, in Figure 6, the length of $e_2$ is $2\tau\arcsin\left(\frac{\|x - x'\|_2}{2\tau}\right) = \alpha\tau$, but in general it might be shorter, so that $e_1$ and $e_2$ might not intersect in $x'$. However, by assumption A2, $e_2$ must be longer than $b$. Thus we require
$$\min(b, \alpha\tau) > 2(a_1 + a_2 + a_3) + a_4.$$
(15)

By simple properties involving arcs and chords we have
$$a_1 = \left(\frac{\alpha - \alpha'}{2}\right)\tau, \quad a_2 = 2\tau\arcsin\left(\frac{\|s - z\|_2}{2(\tau - \sigma)}\right), \quad a_3 = 2\tau\arcsin\left(\frac{p_{11}}{2(\tau - \sigma)}\right), \quad a_4 = 2\tau\arcsin\left(\frac{\delta}{2(\tau - \sigma)}\right).$$
Since the arcsin is superadditive in $[0, 1]$ we require the stronger condition
$$\min(b, \alpha\tau) - (\alpha - \alpha')\tau > 2\tau\arcsin\left(\frac{2\|s - z\|_2 + 2p_{11} + \delta}{2(\tau - \sigma)}\right),$$
which holds if
$$\sin\left(\frac{\min(b, \alpha\tau) - (\alpha - \alpha')\tau}{2\tau}\right) > \frac{2\,\frac{r + \delta}{\sin(\alpha'/2)} + 2p_{11} + \delta}{2(\tau - \sigma)}.$$
The last condition is equivalent to (10). If this condition is satisfied then the graph is correctly reconstructed at Steps 15-17: every connected component of $R_\delta(V)$ corresponds to a vertex of $G$ and every connected component of $R_\delta(E)$ corresponds to an edge of $G$.

Example 1 A Neuron in Three Dimensions. We return to the neuron example and we try to apply Proposition 4 to the 3D data of Figure 1, namely the neuron cr22e from the hippocampus of a rat (Gulyás et al., 1999). The data were obtained from NeuroMorpho.Org (Ascoli et al., 2007). The total length of the graph is 1750.86 µm. We assume the smallest edge has length 100 µm, the smallest angle is $\pi/3$, the local reach is 30 µm and $\xi = 50$ µm. The conditions of Proposition 4 are satisfied for $\delta = 2.00$ µm. Algorithm 1 reconstructs the topology of the metric graph starting from a $\delta/2$-dense sample. Figure 1b shows the reconstructed graph.

4. Minimax Analysis

In this section we derive lower and upper bounds for the minimax risk
$$R_n = \inf_{\hat G} \sup_{P \in \mathcal{P}} P^n\big(\hat G \not\simeq G\big), \tag{16}$$
where, as described in Section 2, the infimum is over all estimators $\hat G$ of the metric graph $G$, the supremum is over the class of distributions $\mathcal{P}$ for $Y$ and $\hat G \not\simeq G$ means that $\hat G$ and $G$ are not isomorphic.

4.1 Lower Bounds

To derive a lower bound on the minimax risk, we make repeated use of Le Cam's lemma.
See, e.g., Yu (1997) and Chapter 2 of Tsybakov (2008). Recall that the total variation distance between two measures $P$ and $Q$ on the same probability space is defined by $\mathrm{TV}(P, Q) = \sup_A |P(A) - Q(A)|$, where the supremum is over all measurable sets. It can be shown that $\mathrm{TV}(P, Q) = P(H) - Q(H)$, where $H = \{y : p(y) \ge q(y)\}$ and $p$ and $q$ are the densities of $P$ and $Q$ with respect to any measure that dominates both $P$ and $Q$.

Lemma 5 (Le Cam) Let $\mathcal{Q}$ be a set of distributions. Let $\theta(Q)$ take values in a metric space with metric $\rho$. Let $Q_1, Q_2 \in \mathcal{Q}$ be any pair of distributions in $\mathcal{Q}$. Let $Y_1, \ldots, Y_n$ be drawn iid from some $Q \in \mathcal{Q}$ and denote the corresponding product measure by $Q^n$. Then
$$\inf_{\hat\theta} \sup_{Q \in \mathcal{Q}} E_{Q^n}\left[\rho(\hat\theta, \theta(Q))\right] \ge \frac{1}{8}\,\rho(\theta(Q_1), \theta(Q_2))\,(1 - \mathrm{TV}(Q_1, Q_2))^{2n}, \tag{17}$$
where the infimum is over all the estimators of $\theta(Q)$.

Below we apply Le Cam's lemma using several pairs of distributions. Any pair $Q_1, Q_2$ is associated with a pair of metric graphs $G', G'' \in \mathcal{G}$. We take $\theta(Q_1)$ and $\theta(Q_2)$ to be the classes of graphs that are isomorphic to $G'$ and $G''$. We set $\rho(\theta(Q_1), \theta(Q_2)) = 0$ if $G'$ and $G''$ are isomorphic and $\rho(\theta(Q_1), \theta(Q_2)) = 1$ otherwise. Figure 7 shows several pairs of metric graphs that are used to derive lower bounds in the noiseless case and in the tubular noise case. In the noiseless case we ignore the $\sigma$-tubes around the metric graphs.

Figure 7: Pairs of metric graphs used in the derivation of lower bounds in the noiseless case and in the tubular noise case.
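The two characterizations of the total variation distance recalled above, and the right-hand side of (17), can be illustrated on a finite sample space. The sketch below is an assumption-laden toy: the distributions `p` and `q`, the sample size `n` and all function names are hypothetical, and the brute-force supremum over events is only feasible for very small supports.

```python
from itertools import combinations

def tv_via_H(p, q):
    """TV(P, Q) = P(H) - Q(H) with H = {y : p(y) >= q(y)}, for discrete P and Q."""
    return sum(pi - qi for pi, qi in zip(p, q) if pi >= qi)

def tv_via_sup(p, q):
    """TV(P, Q) = sup_A |P(A) - Q(A)|, by enumerating every event A."""
    idx = range(len(p))
    return max(abs(sum(p[i] - q[i] for i in A))
               for k in range(len(p) + 1) for A in combinations(idx, k))

def le_cam_bound(rho, tv, n):
    """Right-hand side of (17): (1/8) * rho * (1 - TV)^(2n)."""
    return rho * (1 - tv) ** (2 * n) / 8

# Hypothetical distributions on a three-point space.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
tv = tv_via_H(p, q)
print(tv, tv_via_sup(p, q))
print(le_cam_bound(rho=1, tv=tv, n=10))
```

With the 0-1 metric $\rho$ used in this section, the bound is zero for a pair of isomorphic graphs and equals $(1 - \mathrm{TV})^{2n}/8$ otherwise, so the useful pairs are non-isomorphic graphs whose distributions are close in total variation.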
Theorem 6 In the noiseless case ($\sigma = 0$), for $b \le b_0(a)$, $\alpha \le \alpha_0(a)$, $\xi \le \xi_0(a)$, $\tau \le \tau_0(a)$, where $b_0(a)$, $\alpha_0(a)$, $\xi_0(a)$ and $\tau_0(a)$ are constants which depend on $a$, a lower bound on the minimax risk for metric graph reconstruction is
$$R_n \ge \exp\left(-2a \min\{b,\ 2\sin(\alpha/2),\ \xi,\ 2\pi\tau\}\, n\right). \tag{18}$$

Proof We consider the 4 parameters separately. See Figure 7, ignoring the red lines representing the tubular noise, which is not considered in this theorem.

Shortest edge $b$. Consider the metric graph $G_1$ consisting of a single edge of length $1 + b$ and the metric graph $G_2$ with an edge of length 1 and an orthogonal edge of length $b$ glued in the middle. The density on $G_1$ is constructed in the following way: on the set $G_1 \setminus G_2$ of length $b$ we set $p_1(x) = a$ and the rest of the mass is evenly distributed over the remaining portion of $G_1$. Similarly, for $G_2$ we set $p_2(x) = a$ on $G_2 \setminus G_1$, which corresponds to the orthogonal edge of length $b$. We evenly spread the remaining mass. The two densities differ only on the sets $G_1 \setminus G_2$ and $G_2 \setminus G_1$. Therefore $\mathrm{TV}(p_1, p_2) \le ab$ and, by Le Cam's lemma, $R_n \ge \frac{1}{8}(1 - ab)^{2n} \ge \frac{1}{8} e^{-2abn}$ for all $b \le b_0(a)$, where $b_0(a)$ is a constant depending on $a$.

Smallest angle $\alpha$. Now consider the metric graphs $G_3$ and $G_4$. $G_3$ consists of two edges of length 2 forming an angle $\alpha$ and a third edge of length $1 + 2\sin(\alpha/2)$ glued to the first two. $G_4$ is similar: an edge of length $2\sin(\alpha/2)$ is added to complete the triangle, while the edge on the left has length 1. As in the previous case we set $p_3(x) = a$ on $G_3 \setminus G_4$, $p_4(x) = a$ on $G_4 \setminus G_3$ and spread evenly the rest of the mass.
The total variation distance is $\mathrm{TV}(p_3, p_4) \le 2a\sin(\alpha/2)$ and, by Le Cam's lemma, $R_n \ge \frac{1}{8}(1 - 2a\sin(\alpha/2))^{2n} \ge \frac{1}{8} e^{-4a\sin(\alpha/2)n}$ for all $\alpha \le \alpha_0(a)$, where $\alpha_0(a)$ is a constant depending on $a$.

Global reach $\xi$. We defined the global reach as the shortest Euclidean distance between two points that are far apart in the graph distance. Figure 7 shows metric graph $G_5$ formed by a single edge of length 1 and metric graph $G_6$ consisting of two edges of length 0.5, $\xi$ apart from each other. Again, we set $p_5(x) = a$ on $G_5 \setminus G_6$, $p_6(x) = a$ on $G_6 \setminus G_5$ and evenly spread the rest. We obtain $\mathrm{TV}(p_5, p_6) \le a\xi$ and, by Le Cam's lemma, $R_n \ge \frac{1}{8}(1 - a\xi)^{2n} \ge \frac{1}{8} e^{-2a\xi n}$ for all $\xi \le \xi_0(a)$, where $\xi_0(a)$ is a constant depending on $a$.

Local reach $\tau$. The local reach $\tau$ is the smallest reach of the edges forming the metric graph. Consider metric graphs $G_7$ and $G_8$. $G_7$ consists of a loop of radius $\tau$ attached to an edge of length 1 and metric graph $G_8$ is a single edge of length $1 + 2\pi\tau$. As in the previous cases $p_7(x) = a$ on $G_7 \setminus G_8$ and $p_8(x) = a$ on $G_8 \setminus G_7$. It follows that $\mathrm{TV}(p_7, p_8) \le 2a\pi\tau$ and, by Le Cam's lemma, $R_n \ge \frac{1}{8}(1 - 2a\pi\tau)^{2n} \ge \frac{1}{8} e^{-4a\pi\tau n}$ for all $\tau \le \tau_0(a)$, where $\tau_0(a)$ is a constant depending on $a$.

For the tubular noise case we assume that $\sigma$ is small enough to guarantee that $R_n < 1$, that is, the problem is not hopeless. In particular, we require that $\sigma$ satisfies conditions (9) and (10) of Proposition 4, which can be combined into the following condition:
$$0 < \min\left\{\frac{\xi - 3\sigma - \tau\sin(\alpha/2) + (\tau - \sigma)\sin(\alpha'/2)}{3/2 + [2\sin(\alpha'/4)]^{-1}},\ f(b, \alpha, \tau, \xi, \sigma)\right\}. \tag{19}$$

Theorem 7 Assume that $\sigma$ is positive and satisfies condition (19).
In the tubular noise case, for $b \le b_0(D)$, $\alpha \le \alpha_0(D)$, $\xi \le \xi_0(D)$, $\tau \le \tau_0(D)$, where $b_0(D)$, $\alpha_0(D)$, $\xi_0(D)$ and $\tau_0(D)$ are constants which depend on the ambient dimension $D$, a lower bound on the minimax risk for metric graph reconstruction is
$$R_n \ge \frac{1}{8} \exp\left(-2 \min\{C_{D,1}\, b,\ C_{D,2} \sin(\alpha/2),\ C_{D,3}\, \xi,\ C_{D,4}\, \tau\}\, n\right), \tag{20}$$
for some constants $C_{D,1}, C_{D,2}, C_{D,3}, C_{D,4}$.

Proof As in the proof of Theorem 6 we consider the 4 parameters separately. We compare the pairs of graphs shown in Figure 7, including the tubular regions constructed around them, from which we sample uniformly.

Shortest edge $b$. Consider the metric graph $G_1$ consisting of a single edge of length $1 + b$ and the metric graph $G_2$ with an edge of length 1 and an orthogonal edge of length $b$ glued in the middle. Since $\mathrm{vol}(G_1) > \mathrm{vol}(G_2)$, the density $q_1$ at a point in the tube around $G_1$ is lower than the density $q_2$ at a point around $G_2$. From the definition of total variation, $\mathrm{TV} = q_1(H) - q_2(H)$, where $H$ is the set where $q_1 > q_2$, the shaded area in Figure 7. Note that $q_2(H) = 0$ and
$$\mathrm{TV}(q_1, q_2) = q_1(H) = \frac{\mathrm{vol}(H)}{\mathrm{vol}(G_1)} \le \frac{C_{D,1}\, b\, \sigma^{D-1}}{(1 + b)\, \sigma^{D-1}} \le C_{D,1}\, b.$$
By Le Cam's lemma, $R_n \ge \frac{1}{8}(1 - C_{D,1} b)^{2n} \ge \frac{1}{8} e^{-2 C_{D,1} b n}$ for all $b \le b_0(D)$, where $b_0(D)$ is a constant depending on $D$.

Smallest angle $\alpha$. Now consider the metric graphs $G_3$ and $G_4$. Since $\mathrm{vol}(G_3) > \mathrm{vol}(G_4)$, the density $q_3$ at a point in the tube around $G_3$ is lower than the density $q_4$ at a point around $G_4$. We have $\mathrm{TV} = q_3(H) - q_4(H)$, where $H$ is the set where $q_3 > q_4$, the shaded area in the tube around $G_3$. Note that $q_4(H) = 0$ and
$$\mathrm{TV}(q_3, q_4) = q_3(H) = \frac{\mathrm{vol}(H)}{\mathrm{vol}(G_3)} \le \frac{C_{D,2} \sin(\alpha/2)\, \sigma^{D-1}}{(1 + \sin(\alpha/2))\, \sigma^{D-1}} \le C_{D,2} \sin(\alpha/2).$$
By Le Cam's lemma, $R_n \ge \frac{1}{8}(1 - C_{D,2}\sin(\alpha/2))^{2n} \ge \frac{1}{8} e^{-2 C_{D,2} \sin(\alpha/2) n}$ for all $\alpha \le \alpha_0(D)$, where $\alpha_0(D)$ is a constant depending on $D$.

Global reach $\xi$. Figure 7 shows metric graph $G_5$ formed by a single edge of length 1 and metric graph $G_6$ consisting of two edges of length 0.5, $\xi$ apart from each other. Since $\mathrm{vol}(G_5) > \mathrm{vol}(G_6)$, the density $q_5$ at a point in the tube around $G_5$ is lower than the density $q_6$ at a point around $G_6$. We have $\mathrm{TV} = q_5(H) - q_6(H)$, where $H$ is the set where $q_5 > q_6$, the shaded area in the tube around $G_5$. Note that $q_6(H) = 0$ and
$$\mathrm{TV}(q_5, q_6) = q_5(H) = \frac{\mathrm{vol}(H)}{\mathrm{vol}(G_5)} \le \frac{C_{D,3}\, \xi\, \sigma^{D-1}}{\sigma^{D-1}} = C_{D,3}\, \xi.$$
By Le Cam's lemma, $R_n \ge \frac{1}{8}(1 - C_{D,3}\xi)^{2n} \ge \frac{1}{8} e^{-2 C_{D,3} \xi n}$ for all $\xi \le \xi_0(D)$, where $\xi_0(D)$ is a constant depending on $D$.

Local reach $\tau$. The local reach $\tau$ is the smallest reach of the edges forming the metric graph. Consider metric graphs $G_7$ and $G_8$ in Figure 7. Since $\mathrm{vol}(G_7) > \mathrm{vol}(G_8)$, the density $q_7$ at a point in the tube around $G_7$ is lower than the density $q_8$ at a point around $G_8$. We have $\mathrm{TV} = q_7(H) - q_8(H)$, where $H$ is the set where $q_7 > q_8$, the shaded area in the tube around $G_7$. Note that $q_8(H) = 0$ and
$$\mathrm{TV}(q_7, q_8) = q_7(H) = \frac{\mathrm{vol}(H)}{\mathrm{vol}(G_7)} \le \frac{C_{D,4}\, \tau\, \sigma^{D-1}}{(1 + \tau)\, \sigma^{D-1}} \le C_{D,4}\, \tau.$$
By Le Cam's lemma, $R_n \ge \frac{1}{8}(1 - C_{D,4}\tau)^{2n} \ge \frac{1}{8} e^{-2 C_{D,4} \tau n}$ for all $\tau \le \tau_0(D)$, where $\tau_0(D)$ is a constant depending on $D$.

Note that, up to constants, the lower bound obtained in the tubular noise case is identical to the lower bound of Theorem 6 for the noiseless case.

4.2 Upper Bounds

In this section we use the analysis of the performance of Algorithm 1 to derive an upper bound on the minimax risk. We will use the strategy of Niyogi et al. (2008) to find the sample size that guarantees a $\delta/2$-dense sample with high probability. We will use the following two lemmas.
Lemma 8 (5.1 in Niyogi et al. (2008)) Let $\{A_i\}$ for $i = 1, \ldots, l$ be a finite collection of measurable sets and let $\mu$ be a probability measure on $\bigcup_{i=1}^{l} A_i$ such that for all $1 \le i \le l$, we have $\mu(A_i) > \gamma$. Let $\bar{x} = \{x_1, \ldots, x_n\}$ be a set of $n$ i.i.d. draws according to $\mu$. Then if
$$n \ge \frac{1}{\gamma}\left(\log l + \log \frac{1}{\lambda}\right)$$
we are guaranteed that with probability $> 1 - \lambda$, the following is true: $\forall i$, $\bar{x} \cap A_i \ne \emptyset$.

Recall that the $\epsilon$-covering number $C(\epsilon)$ of a set $S$ is the smallest number of Euclidean balls of radius $\epsilon$ required to cover the set. The $\epsilon$-packing number $P(\epsilon)$ is the maximum number of sets of the form $B(x, \epsilon) \cap S$, where $x \in S$, that may be packed into $S$ without overlap.

Lemma 9 (5.2 in Niyogi et al. (2008)) For every $\epsilon > 0$, $P(2\epsilon) \le C(2\epsilon) \le P(\epsilon)$.

Combining Lemma 8 and Proposition 4, we obtain an upper bound on $R_n$ for the noiseless case.

Theorem 10 In the noiseless case ($\sigma = 0$), an upper bound on the minimax risk $R_n$ is given by
$$R_n \le \frac{8\,\mathrm{length}(G)}{\delta} \exp\left(-\frac{a\,\delta\, n}{4\,\mathrm{length}(G)}\right),$$
where
$$\delta = \frac{1}{2} \min\left\{\xi\, \frac{2\sin(\alpha/4)}{3\sin(\alpha/4) + 1},\ \frac{\tau \sin(\alpha/2)\sin(\alpha/4)}{\sin(\alpha/2)\sin(\alpha/4) + 3\sin(\alpha/4) + 1}\, \sin\left(\frac{\min\{b, \alpha\tau\}}{2\tau}\right)\right\}. \tag{21}$$

Proof In the noiseless case, Proposition 4 implies that the graph $G$ can be reconstructed from a $\delta/2$-dense sample $Y$ if
$$\delta < \min\left\{\xi\, \frac{2\sin(\alpha/4)}{3\sin(\alpha/4) + 1},\ f(b, \alpha, \tau, \xi, 0)\right\}. \tag{22}$$
The value of $\delta$ selected in (21) satisfies condition (22), which follows from conditions (9) and (10) with $\sigma = 0$. We look for the sample size $n$ that guarantees a $\delta/2$-dense sample with high probability. Following the strategy in Niyogi et al. (2008), we consider a cover of the metric graph $G$ by balls of radius $\delta/4$. Let $\{x_i : 1 \le i \le l\}$ be the centers of such balls that constitute a minimal cover. We can choose $A_i^{\delta/4} = B_{\delta/4}(x_i) \cap G$.
Applying Lemma 8 we find that the sample size that guarantees a correct reconstruction with probability at least $1 - \lambda$ is
$$\frac{1}{\gamma}\left(\log l + \log \frac{1}{\lambda}\right), \tag{23}$$
where
$$\gamma \ge \min_i \frac{a\, \mathrm{length}(A_i^{\delta/4})}{\mathrm{length}(G)} \ge \frac{a\,\delta}{4\,\mathrm{length}(G)},$$
and we bound the covering number $l$ in terms of the packing number, using Lemma 9:
$$l \le \frac{\mathrm{length}(G)}{\min_i \mathrm{length}(A_i^{\delta/8})} \le \frac{8\,\mathrm{length}(G)}{\delta}.$$
Therefore, from (23), if
$$n = \frac{4\,\mathrm{length}(G)}{a\,\delta}\left(\log\left(\frac{8\,\mathrm{length}(G)}{\delta}\right) + \log \frac{1}{\lambda}\right) \tag{24}$$
we have a $\delta/2$-dense sample with probability at least $1 - \lambda$ and, by Proposition 4, $P(\hat G \not\simeq G) \le \lambda$. Rearranging we have the result.

Note that, in the noiseless case, the upper and lower bounds are tight up to polynomial factors in the parameters $\tau$, $b$, $\xi$. There is a small gap with respect to $\alpha$; closing this gap is an open problem.

In the tubular noise case, we assume that $\sigma$ is small enough to guarantee that Algorithm 1 correctly reconstructs a metric graph starting from a $\delta/2$-dense sample.

Theorem 11 Assume that $\sigma$ satisfies condition (19) and $0 < \sigma < \min\{3\tau/16,\ \delta/8\}$, where
$$\delta = C_0 \min\left\{\frac{\xi - 3\sigma - \tau\sin(\alpha/2) - (\tau - \sigma)\sin(\alpha'/2)}{3/2 + [2\sin(\alpha'/4)]^{-1}},\ f(b, \alpha, \tau, \xi, \sigma)\right\}, \tag{25}$$
for some $0 < C_0 < 1$. Under the tubular noise model, an upper bound on the minimax risk $R_n$ is given by
$$R_n \le \frac{16\,\mathrm{length}(G)}{\delta} \exp\left(-\frac{C'_D\, \delta\, (\tau - 8\sigma)\, n}{\tau\, \mathrm{length}(G)}\right),$$
where $C'_D$ is a constant depending on the ambient dimension.

Proof Proposition 4 implies that the graph $G$ can be reconstructed from a $\delta/2$-dense sample $Y$ if
$$\delta < \min\left\{\frac{\xi - 3\sigma - \tau\sin(\alpha/2) - (\tau - \sigma)\sin(\alpha'/2)}{3/2 + [2\sin(\alpha'/4)]^{-1}},\ f(b, \alpha, \tau, \xi, \sigma)\right\}, \tag{26}$$
which is satisfied by the value of $\delta$ selected in (25). We look for the sample size $n$ that guarantees a $\delta/2$-dense sample in $G_\sigma$ with high probability. We consider a cover of the metric graph $G$ by Euclidean balls of radius $\delta/8$.
Let $\{x_i : 1 \le i \le l\}$ be the centers of such balls that constitute a minimal cover. Note that $D$-dimensional balls of radius $\delta/8 + \sigma \le \delta/4$ centered at the same $x_i$'s constitute a cover of the tubular region $G_\sigma$. We define $A_i^{\delta/8+\sigma} = B_{\delta/8+\sigma}(x_i) \cap G_\sigma$. Applying Lemma 8 we find that the sample size that guarantees a $\delta/2$-dense sample in $G_\sigma$ (and a correct topological reconstruction of $G$) with probability at least $1 - \lambda$ is
$$\frac{1}{\gamma}\left(\log l + \log \frac{1}{\lambda}\right), \tag{27}$$
where
$$\gamma = \min_i \frac{\mathrm{vol}(A_i^{\delta/8+\sigma})}{\mathrm{vol}(G_\sigma)}. \tag{28}$$
Define $\tilde{A}_i^{\delta} = B_\delta(x_i) \cap G$. The covering number $l$ is bounded in terms of the packing number, using Lemma 9:
$$l \le \frac{\mathrm{length}(G)}{\min_i \mathrm{length}(\tilde{A}_i^{\delta/16})} \le \frac{16\,\mathrm{length}(G)}{\delta}.$$
We construct a lower bound on $\gamma$ by deriving an upper bound on the denominator of (28) and a lower bound on the numerator.

Upper bound on $\mathrm{vol}(G_\sigma)$. Let $N_\sigma$ be the $\sigma$-covering number of $G$ and let $\mathcal{C}_\sigma$ be the set of centers of this cover. By Lemma 9, $N_\sigma$ is bounded by the $\sigma/2$-packing number. A simple volume argument gives $N_\sigma \le C\, \mathrm{length}(G)/\sigma$, for some constant $C$. Note that the $D$-dimensional balls of radius $2\sigma$ around each of the centers in $\mathcal{C}_\sigma$ cover $G_\sigma$. Thus
$$\mathrm{vol}(G_\sigma) \le v_D\, N_\sigma\, (2\sigma)^D \le C_D\, \mathrm{length}(G)\, \sigma^{D-1}$$
for some constant $C_D$ depending on the ambient dimension.

Lower bound on $\mathrm{vol}(A_i^{\delta/8+\sigma})$, for all $i$. Let $P_A(\sigma)$ be the $\sigma$-packing number of $\tilde{A}_i^{\delta/8}$ and let $\mathcal{D}_A$ be the set of centers of this packing. Then
$$\mathrm{vol}(A_i^{\delta/8+\sigma}) \ge P_A(\sigma)\, v_D\, \sigma^D,$$
because the union of the balls of radius $\sigma$ around $\mathcal{D}_A$ is contained in $A_i^{\delta/8+\sigma}$. Let $C_A(2\sigma)$ be the $2\sigma$-covering number of $\tilde{A}_i^{\delta/8}$ and let $\mathcal{C}_A = \{z_1, \ldots, z_{C_A(2\sigma)}\}$ be the set of centers of this cover.
By Lemma 9,
$$P_A(\sigma) \ge C_A(2\sigma) \ge \frac{\mathrm{length}(\tilde{A}_i^{\delta/8})}{\max_{z_j \in \mathcal{C}_A} \mathrm{length}(B_{2\sigma}(z_j) \cap \tilde{A}_i^{\delta/8})} \ge \frac{\delta/8}{\max_{z_j \in \mathcal{C}_A} \mathrm{length}(B_{2\sigma}(z_j) \cap \tilde{A}_i^{\delta/8})}$$
and, since $2\sigma < 3\tau/8$, by Corollary 1.3 in Chazal (2013),
$$\max_{z_j \in \mathcal{C}_A} \mathrm{length}(B_{2\sigma}(z_j) \cap \tilde{A}_i^{\delta/8}) \le C_2 \left(\frac{\tau}{\tau - 8\sigma}\right) \sigma,$$
for some constant $C_2$. Thus
$$\gamma \ge \frac{P_A(\sigma)\, v_D\, \sigma^D}{C_D\, \mathrm{length}(G)\, \sigma^{D-1}} \ge \frac{C'_D\, \delta\, (\tau - 8\sigma)}{\tau\, \mathrm{length}(G)},$$
where $C'_D$ is a constant depending on the ambient dimension. Finally, from (27), if
$$n = \frac{\tau\, \mathrm{length}(G)}{C'_D\, \delta\, (\tau - 8\sigma)}\left(\log\left(\frac{16\,\mathrm{length}(G)}{\delta}\right) + \log \frac{1}{\lambda}\right), \tag{29}$$
then the sample is $\delta/2$-dense with probability at least $1 - \lambda$ and $P(\hat G \not\simeq G) \le \lambda$. Rearranging we obtain
$$R_n \le \exp\left(-\frac{C'_D\, \delta\, (\tau - 8\sigma)\, n}{\tau\, \mathrm{length}(G)} + \log\left(\frac{16\,\mathrm{length}(G)}{\delta}\right)\right).$$

5. Discussion

In this paper, we presented a statistical analysis of metric graph reconstruction. We derived sufficient conditions on random samples from a graph metric space that guarantee topological reconstruction and we derived lower and upper bounds on the minimax risk for this problem. Various improvements and theoretical extensions are possible.

In Proposition 4 we have analyzed Algorithm 1 using the Euclidean distance at every step. It is possible to obtain a similar result using a different notion of distance, for example, the distance induced by a Rips-Vietoris graph constructed on the sample.

While in our analysis we mainly relied on the assumption of a dense sample, Aanjaneya et al. (2012) used the more refined but stronger assumption of the sample being an approximation of the metric graph, which we recall: given positive numbers $\varepsilon$ and $R$, we say that $(Y, d_Y)$ is an $(\varepsilon, R)$-approximation of the metric space $(G, d_G)$ if there exists a correspondence $C \subset G \times Y$ such that
$$(x, y), (x', y') \in C,\quad \min(d_G(x, x'),\ d_Y(y, y')) \le R \implies \left|d_G(x, x') - d_Y(y, y')\right| \le \varepsilon.$$
(30)

As shown in Aanjaneya et al. (2012), the $(\varepsilon, R)$-approximation assumption is sufficient, for an appropriate choice of the parameters $\varepsilon$ and $R$, to recover not only the topology of a metric graph $(G, d_G)$, but also its metric $d_G$ with high accuracy. However, when compared to the dense sample assumption, it demands a larger sample complexity to achieve accurate topological reconstruction. A strategy similar to the one used in this paper could be used to determine the sample size that guarantees an $(\varepsilon, R)$-approximation of the underlying metric graph with high probability. This would guarantee a correct topological reconstruction, as well as an approximation of the metric $d_G$.

We are also investigating the idea of combining metric graph reconstruction with the subspace constrained mean-shift algorithm (Fukunaga and Hostetler, 1975; Comaniciu and Meer, 2002; Genovese et al., 2012b) to provide similar guarantees. Our preliminary results indicate that this mixed strategy works very well under more general noise assumptions and with relatively low sample size.

Acknowledgments

Research supported by NSF CAREER Grant DMS 114967, Air Force Grant FA95500910373, NSF Grant DMS-0806009. The authors thank the referees for helpful comments and suggestions.

References

Mridul Aanjaneya, Frederic Chazal, Daniel Chen, Marc Glisse, Leonidas Guibas, and Dmitriy Morozov. Metric graph reconstruction from noisy data. International Journal of Computational Geometry & Applications, 22(04):305–325, 2012.

Mahmuda Ahmed and Carola Wenk. Probabilistic street-intersection reconstruction from gps trajectories: approaches and challenges. In Proceedings of the Third ACM SIGSPATIAL International Workshop on Querying and Mining Uncertain Spatio-Temporal Data, pages 34–37. ACM, 2012.

Ery Arias-Castro, Guangliang Chen, and Gilad Lerman.
Spectral clustering based on local linear approximations. Electronic Journal of Statistics, 5:1537–1587, 2011.

Giorgio A Ascoli, Duncan E Donohue, and Maryam Halavi. Neuromorpho.org: a central resource for neuronal morphologies. The Journal of Neuroscience, 27(35):9247–9251, 2007.

Paul Bendich. Analyzing stratified spaces using persistent versions of intersection and local homology. ProQuest, 2008.

Paul Bendich, Sayan Mukherjee, and Bei Wang. Towards stratification learning through homology inference. arXiv preprint arXiv:1008.3572, 2010.

Paul Bendich, Bei Wang, and Sayan Mukherjee. Local homology transfer and stratification learning. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1355–1370. SIAM, 2012.

F. Chazal and A. Lieutier. Topology guaranteeing manifold reconstruction using distance function to noisy data. In Proceedings of the twenty-second annual symposium on Computational geometry, pages 112–118. ACM, 2006.

Frédéric Chazal. An upper bound for the volume of geodesic balls in submanifolds of euclidean spaces. Technical report, INRIA, January 2013.

Frédéric Chazal and Jian Sun. Gromov-hausdorff approximation of metric spaces with linear structure. arXiv preprint arXiv:1305.1172, 2013.

Frédéric Chazal, David Cohen-Steiner, and André Lieutier. A sampling theory for compact sets in euclidean space. Discrete & Computational Geometry, 41(3):461–479, 2009.

Daniel Chen, Leonidas J Guibas, John Hershberger, and Jian Sun. Road network reconstruction for organizing paths. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1309–1320. Society for Industrial and Applied Mathematics, 2010.

Alexey Chernov and Vitaliy Kurlin. Reconstructing persistent graph structures from noisy images. Electronic Journal Image-A, 3(5):19–22, 2013.

Dorin Comaniciu and Peter Meer.
Mean shift: A robust approach toward feature space analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(5):603–619, 2002.

Herbert Federer. Curvature measures. Transactions of the American Mathematical Society, 93(3):418–491, 1959.

Keinosuke Fukunaga and Larry Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. Information Theory, IEEE Transactions on, 21(1):32–40, 1975.

Xiaoyin Ge, Issam I Safa, Mikhail Belkin, and Yusu Wang. Data skeletonization via reeb graphs. In Advances in Neural Information Processing Systems, pages 837–845, 2011.

Christopher R Genovese, Marco Perone-Pacifico, Isabella Verdinelli, and Larry Wasserman. Minimax manifold estimation. Journal of Machine Learning Research, 13:1263–1291, 2012a.

Christopher R Genovese, Marco Perone-Pacifico, Isabella Verdinelli, and Larry Wasserman. Nonparametric ridge estimation. arXiv preprint arXiv:1212.5156, 2012b.

Attila I Gulyás, Manuel Megías, Zsuzsa Emri, and Tamás F Freund. Total number and ratio of excitatory and inhibitory synapses converging onto single interneurons of different types in the ca1 area of the rat hippocampus. The Journal of Neuroscience, 19(22):10082–10097, 1999.

Peter Kuchment. Quantum graphs: I. some basic structures. Waves in Random Media, 14(1):107–128, 2004.

P. Niyogi, S. Smale, and S. Weinberger. Finding the homology of submanifolds with high confidence. Discrete and Computational Geometry, 38(1-3):419–441, 2008.

Alexandre B Tsybakov. Introduction to nonparametric estimation. Springer, 2008.

Bin Yu. Assouad, fano, and le cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer, 1997.
