Solution Manifold and Its Statistical Applications



Yen-Chi Chen
Department of Statistics, University of Washington
yenchic@uw.edu

Abstract. A solution manifold is the collection of points in a $d$-dimensional space satisfying a system of $s$ equations with $s < d$. Solution manifolds occur in several statistical problems including missing data, algorithmic fairness, hypothesis testing, partial identifications, and nonparametric set estimation. We theoretically and algorithmically analyze solution manifolds. In terms of theory, we derive five useful results: a smoothness theorem, a stability theorem (which implies the consistency of a plug-in estimator), the convergence of a gradient flow, a local center manifold theorem, and the convergence of the gradient descent algorithm. We propose a Monte Carlo gradient descent algorithm to numerically approximate a solution manifold. In the case of likelihood inference, we design a manifold constraint maximization procedure to find the maximum likelihood estimator on the manifold.

Keywords and phrases: set estimation, gradient descent, nonconvex optimization, manifold. Supported by NSF grants DMS-195278 and DMS-2112907 and NIH grant U24 AG072122.

1. Introduction. A solution manifold [64] is the collection of points in a $d$-dimensional space that solves a system of $s$ equations where $s < d$. Namely, the feasible set is a collection of points in an under-constrained system. Under smoothness conditions, the feasible set forms a manifold known as a solution manifold. The solution manifold occurs in many problems in statistics such as missing data (Example 1), algorithmic fairness (Example 2), constrained likelihood space (Example 3), and density ridges/level sets (Example 4). In the regular case that $s = d$, the solution manifold reduces to the usual problems such as the Z-estimators [77] or estimating equations [45]. While there has been a tremendous amount of literature on the analysis of regular cases ($s = d$), little is known when $s < d$. This study aims to analyze the problem when $s < d$ and design a practical algorithm to find the manifold.

Formally, let $\Psi: \mathbb{R}^d \mapsto \mathbb{R}^s$ be a vector-valued function with $s < d$. The solution set of $\Psi$,
$$M = \{x : \Psi(x) = 0\},$$
is called the solution manifold, and we call $\Psi$ the generator (function) of $M$. Note that in some applications, $x$ represents the parameter in a model; thus, sometimes we write $M = \{\theta : \Psi(\theta) = 0\}$. Here we provide examples of solution manifolds from various statistical problems.

EXAMPLE 1 (Missing data). Consider a simple missing data problem where we have a binary response variable $Y$ and a binary covariate $X$. The response variable is subject to missing. We use a binary variable $R$ to indicate the response pattern of $Y$ (i.e., $Y$ is observed if $R = 1$). Depending on the value of $R$, we may observe $(X, Y, R=1)$ or $(X, R=0)$. In this case, the entire distribution is characterized by the following parameters:
$$\zeta_{x,y} = P(R=1 \mid X=x, Y=y), \quad \mu_x = P(Y=1 \mid X=x), \quad \xi = P(X=1)$$
for $x, y \in \{0,1\}$. The parameter $\zeta_{x,y}$ is called the missing data mechanism [47], $\mu_x$ is the regression function, and $\xi$ is the marginal mean of $X$. Thus, this problem has seven parameters.
[FIG 1. An example of a solution manifold formed by the parameter $(\mu, \sigma)$ of a Gaussian with a tail probability bound $P(-5 < Y < 2) = 0.5$. The left panel shows 1000 random initializations (uniformly distributed within $[1,3] \times [2,4]$). We keep applying the gradient descent algorithm until convergence (right panel). The black dashed line indicates the actual location of the solution manifold.]

From the observed data (IID elements in the form of $(X, Y, R=1)$ or $(X, R=0)$), we can identify $P(x, y, R=1)$ and $P(x, R=0)$ for $x, y \in \{0,1\}$, which leads to six constraints (note that $P(x, y, r) = P(X=x, Y=y, R=r)$):
$$
\begin{aligned}
P(1,1,1) &= \zeta_{11}\mu_1\xi, \quad P(1,0,1) = \zeta_{10}(1-\mu_1)\xi, \\
P(0,1,1) &= \zeta_{01}\mu_0(1-\xi), \quad P(0,0,1) = \zeta_{00}(1-\mu_0)(1-\xi), \\
P(X=0, R=0) &= (1-\zeta_{01})\mu_0(1-\xi) + (1-\zeta_{00})(1-\mu_0)(1-\xi), \\
P(X=1, R=0) &= (1-\zeta_{11})\mu_1\xi + (1-\zeta_{10})(1-\mu_1)\xi.
\end{aligned}
\tag{1}
$$
Thus, the feasible values of the parameters $(\zeta_{x,y}, \mu_x, \xi)$ form a solution manifold, and the above constraints describe the generator $\Psi$. Note that the resulting solution manifold is related to the nonparametric bound of the parameters [49, 50, 17]; see Section 4.1 for more discussion.
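To make the construction concrete, here is a minimal sketch (our illustration, not from the paper) of the generator $\Psi: \mathbb{R}^7 \to \mathbb{R}^6$ for this example; the parameter ordering and the vector `p_obs` of the six identifiable probabilities are our conventions, and in practice `p_obs` would be replaced by sample proportions:

```python
import numpy as np

def missing_data_generator(theta, p_obs):
    """Generator Psi: R^7 -> R^6 for Example 1 (a sketch).

    theta = (zeta00, zeta01, zeta10, zeta11, mu0, mu1, xi), where
    zeta_xy = P(R=1|X=x,Y=y), mu_x = P(Y=1|X=x), xi = P(X=1).
    p_obs holds the six identifiable probabilities in equation (1).
    """
    z00, z01, z10, z11, mu0, mu1, xi = theta
    model = np.array([
        z11 * mu1 * xi,              # P(1,1,1)
        z10 * (1 - mu1) * xi,        # P(1,0,1)
        z01 * mu0 * (1 - xi),        # P(0,1,1)
        z00 * (1 - mu0) * (1 - xi),  # P(0,0,1)
        (1 - z01) * mu0 * (1 - xi) + (1 - z00) * (1 - mu0) * (1 - xi),  # P(X=0,R=0)
        (1 - z11) * mu1 * xi + (1 - z10) * (1 - mu1) * xi,              # P(X=1,R=0)
    ])
    return model - p_obs  # Psi(theta) = 0 exactly on the solution manifold
```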
EXAMPLE 2 (Algorithmic fairness). Algorithmic fairness is a trending topic in modern machine learning research [40, 26, 27]. We consider a post-processing method in the algorithmic fairness study. Suppose we have a binary response $Y \in \{0,1\}$, a sensitive binary variable $A \in \{0,1\}$ that we wish to protect, and an output from a trained classifier $W \in \{0,1\}$ (one can view it as $W = c(X, A)$, where $X$ is the covariate/feature and $c$ is a trained classifier). The sensitive variable is often a race or gender indicator. The classification result based on $W$ may discriminate against the sensitive variable $A$; that is, it is likely that $A = 1$ and $W = 1$ occur at the same time. We want to construct a new 'fair' classifier $Q \in \{0,1\}$ such that $Q$ is a random variable whose distribution depends only on $A$ and $W$ and whose output will not discriminate against $A$. In other words, we design a new random variable $Q$ such that $Q \perp Y \mid A, W$.

While there are many principles of algorithmic fairness, we consider test fairness [26]: we want to construct $Q$ such that
$$P(Y=1 \mid Q=s, A=0) = P(Y=1 \mid Q=s, A=1) \tag{2}$$
for each $s = 0, 1$. To construct $Q$ satisfying the above constraint, we generate $Q$ based on $A, W$ such that its distribution is determined by the parameter $q_{w,a} = P(Q=1 \mid W=w, A=a)$. As long as we properly choose $q_{w,a}$, the resulting $Q$ will satisfy equation (2). In this case, any $q_{w,a}$ solving the following two equations will satisfy equation (2) (see Appendix B):
$$
\begin{aligned}
\frac{\sum_w q_{w,0}\, P(W=w, Y=1 \mid A=0)}{\sum_{w'} q_{w',0}\, P(W=w' \mid A=0)} &= \frac{\sum_w q_{w,1}\, P(W=w, Y=1 \mid A=1)}{\sum_{w'} q_{w',1}\, P(W=w' \mid A=1)}, \\
\frac{\sum_w (1-q_{w,0})\, P(W=w, Y=1 \mid A=0)}{\sum_{w'} (1-q_{w',0})\, P(W=w' \mid A=0)} &= \frac{\sum_w (1-q_{w,1})\, P(W=w, Y=1 \mid A=1)}{\sum_{w'} (1-q_{w',1})\, P(W=w' \mid A=1)}.
\end{aligned}
\tag{3}
$$
Note that $P(W=w, Y=y, A=a)$ is identifiable from the data. The original parameter $\{q_{w,a} : w, a = 0, 1\}$ lives in a four-dimensional space and we have two constraints, leading to a solution manifold of two dimensions.

EXAMPLE 3 (Constrained likelihood space). Consider a random variable $Y$ from an unknown distribution. We place a parametric model $p(y; \theta)$ on the underlying PDF of $Y$, where $\theta \in \mathbb{R}^d$ is the parameter vector. Suppose that we have a set of constraints on the model such that a feasible parameter must satisfy $f_1(\theta) = f_2(\theta) = \cdots = f_s(\theta) = 0$ for some given functions $f_1, \cdots, f_s$. These functions may come from independence assumptions or moment constraints $E(g_j(Y)) = 0$ for some functions $g_1, \cdots, g_s$. The set of parameters that satisfies these constraints is
$$\Theta_0 = \{\theta : f_\ell(\theta) = 0, \ \ell = 1, \cdots, s\} = \{\theta : \Psi(\theta) = 0\}, \tag{4}$$
which is a solution manifold with $\Psi_\ell(\theta) = f_\ell(\theta)$; in the moment-constraint case, $f_\ell(\theta) = \int g_\ell(y)\, p(y; \theta)\, dy$. The above model is used in algebraic statistics [32, 55], partial identification problems with equality constraints [38, 25], and mixture models with moment constraints [46, 14]. We will return to this problem in Section 4.3.2. Figure 1 shows an example of a solution manifold formed by the tail probability constraint $P(-5 < Y < 2) = 0.5$ where $Y \sim N(\mu, \sigma^2)$. We have two parameters $(\mu, \sigma^2)$ and one constraint; thus, the resulting solution set is a one-dimensional manifold. From the left to the right panels, we show that we can recover the underlying manifold by random initializations with a suitable gradient descent process (Algorithm 1).

EXAMPLE 4 (Density ridges). A $k$-ridge [35] of a density function $p(x)$ is defined as the collection of points satisfying $\{x : V_k(x)^T \nabla p(x) = 0, \ \lambda_{k+1}(x) < 0\}$, where $V_k(x) = [v_{k+1}(x), \cdots, v_d(x)] \in \mathbb{R}^{d \times (d-k)}$ is the collection of eigenvectors of $H(x) = \nabla\nabla p(x)$ and $\lambda_{k+1}(x)$ is the $(k+1)$-th eigenvalue. The eigenpairs are ordered as $\lambda_1(x) \geq \lambda_2(x) \geq \cdots \geq \lambda_d(x)$. In this case $\Psi(x) = V_k(x)^T \nabla p(x)$; hence, ridges are also solution manifolds. In addition to ridges, the level sets and critical points of a function [79, 48, 13] are examples of solution manifolds. We will discuss this in Section 4.4.

Although the aforementioned examples are from different statistical problems, they all share a similar structure: the feasible set forms a solution manifold. Thus, we study the properties of solution manifolds in this paper, and our results will be applicable to all of these cases.
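For concreteness, one such generator can be written out directly in code. The following minimal sketch (ours, not from the paper) implements the generator of Example 2, mapping $q \in \mathbb{R}^{2 \times 2}$ to the two constraint gaps in equation (3); the conditional probability arrays `p_wy_a` and `p_w_a` are assumed to be estimated from data:

```python
import numpy as np

def fairness_generator(q, p_wy_a, p_w_a):
    """Generator Psi: R^4 -> R^2 for the test-fairness constraints (3).

    q[w, a] = P(Q=1 | W=w, A=a); p_wy_a[w, a] = P(W=w, Y=1 | A=a);
    p_w_a[w, a] = P(W=w | A=a). All arguments are 2x2 arrays.
    """
    def rate(prob_q):
        # P(Y=1 | Q, A=a) for a = 0, 1, summing over w (axis 0)
        num = (prob_q * p_wy_a).sum(axis=0)
        den = (prob_q * p_w_a).sum(axis=0)
        return num / den
    r1 = rate(q)        # condition on Q = 1
    r0 = rate(1.0 - q)  # condition on Q = 0
    # both gaps vanish exactly when equation (3) holds
    return np.array([r1[0] - r1[1], r0[0] - r0[1]])
```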
Main results. Our main results include theoretical developments and algorithmic innovations. In the theoretical analysis, we show that under a similar set of assumptions, we have the following:
1. Smoothness theorem. The solution manifold is a $(d-s)$-dimensional manifold with a positive reach (Lemma 1 and Theorem 3).
2. Stability theorem. As long as $\widehat\Psi$ and $\Psi$ and their derivatives are sufficiently close, $\widehat M = \{x : \widehat\Psi(x) = 0\}$ converges to $M$ under the Hausdorff distance (Theorem 5).
3. Convergence of a gradient flow. For the gradient descent flow of $\|\Psi(x)\|^2$, the flow converges (in the normal direction of $M$) to a point on $M$ when the starting point is sufficiently close to $M$ (Theorem 7).
4. Local center manifold theorem. The collection of points converging to the same location $z \in M$ forms an $s$-dimensional manifold (Theorem 8).
5. Convergence of a gradient descent algorithm. With a good initialization, the gradient descent algorithm of $\|\Psi(x)\|^2$ converges linearly to a point in $M$ (Theorem 9) when the step size is sufficiently small.

We propose three algorithms to numerically find solution manifolds and use them to handle statistical problems:
1. Monte Carlo gradient descent algorithm: an algorithm generating points over $M$ that requires only access to $\Psi$ and its gradient (Section 3 and Algorithm 1).
2. Manifold-constraint maximizing algorithm: an algorithm that finds the MLE on the solution manifold (Section 4.3.1 and Algorithm 2).
3. Approximated manifold posterior algorithm: a Bayesian procedure that approximates the posterior distribution on a manifold (Appendix A and Algorithm 3).

We would like to emphasize that while some of the theoretical results have appeared in the regular case ($s = d$, i.e., the solution manifold is a collection of points or just a single point), generalizing these results to the manifold case ($s < d$) requires non-trivial extensions of existing techniques. The major challenge comes from the fact that the set $M$ contains an infinite number of points, and the geometry of $M$ poses technical issues during the theoretical analysis. Also, although the stability theorem has appeared for specific examples such as level set and ridge estimation [12, 66, 35], there is no such result for the general class of solution manifolds. This paper provides a unified framework for analyzing solution manifolds.

The impact of this paper goes beyond statistics. Our result provides a new analysis of the partial identification problem in econometrics [38, 25]. The local center manifold theorem offers a new class of statistical problems where dynamical systems interact with statistics [61]. The Monte Carlo approximation of a solution manifold leads to a point cloud over the manifold, which is a common scenario in computational geometry [24, 23, 30]. The algorithmic convergence of the gradient descent demonstrates a new class of non-convex functions for which we still obtain linear convergence [10, 58].

Outline. The remainder of this paper is organized as follows. Section 2 provides a formal definition of a solution manifold and studies the smoothness and stability of the manifold. Section 3 presents an algorithm for approximating the solution manifold and an analysis of its properties. Section 4 discusses several statistical applications of solution manifolds.
Section 5 provides future directions, connections with other fields, and some manifolds in statistics that are not in a solution form.

Notations. Let $v \in \mathbb{R}^d$ be a vector and $V \in \mathbb{R}^{n \times m}$ be a matrix. $\|v\|_2$ is the $L_2$ norm (Euclidean norm) of $v$, and $\|v\|_{\max} = \max\{|v_1|, \cdots, |v_d|\}$ is the vector max norm. For matrices, we use
$$\|V\| = \|V\|_2 = \max_{u \in \mathbb{R}^m,\, u \neq 0} \frac{\|Vu\|_2}{\|u\|_2}$$
as the $L_2$ norm and $\|V\|_{\max} = \max_{i,j} |V_{ij}|$ as the max norm. For a square matrix $A$, we define $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ to be the minimal and maximal eigenvalues, respectively, and $\lambda_{\min,>0}(A)$ as the smallest non-zero eigenvalue. For a vector-valued function $\Psi$, we define a maximal norm of derivatives as
$$\|\Psi\|_\infty^{(J)} = \sup_x \max_i \max_{j_1} \cdots \max_{j_J} \left| \frac{\partial^J}{\partial x_{j_1} \cdots \partial x_{j_J}} \Psi_i(x) \right|$$
for $J = 0, 1, 2, 3$. When $\Psi$ is a scalar function, this reduces to
$$\|\Psi\|_\infty^{(1)} = \sup_x \|\nabla \Psi(x)\|_{\max}, \qquad \|\Psi\|_\infty^{(2)} = \sup_x \|\nabla\nabla \Psi(x)\|_{\max},$$
which are the usual maximal norms of the gradient vector and Hessian matrix over all $x$. We also define
$$\|\Psi\|^*_{\infty,J} = \max_{j=0,\cdots,J} \|\Psi\|_\infty^{(j)}$$
as a norm that measures distance using up to the $J$-th derivative. The Jacobian (gradient) of $\Psi(x)$ is an $s \times d$ matrix
$$G_\Psi(x) = \nabla \Psi(x) = \begin{pmatrix} \nabla \Psi_1(x)^T \\ \nabla \Psi_2(x)^T \\ \vdots \\ \nabla \Psi_s(x)^T \end{pmatrix} \in \mathbb{R}^{s \times d},$$
the Hessian of $\Psi(x)$ is an $s \times d \times d$ array
$$H_\Psi(x) = \nabla\nabla \Psi(x) \in \mathbb{R}^{s \times d \times d}, \qquad [H_\Psi(x)]_{ijk} = \frac{\partial^2}{\partial x_j \partial x_k} \Psi_i(x),$$
and the third derivative of $\Psi$ is an array
$$\nabla\nabla\nabla \Psi(x) \in \mathbb{R}^{s \times d \times d \times d}, \qquad [\nabla\nabla\nabla \Psi(x)]_{ijk\ell} = \frac{\partial^3}{\partial x_j \partial x_k \partial x_\ell} \Psi_i(x).$$
Let $A$ be a set and $x$ be a point. We then define $d(x, A) = \inf\{\|x - y\| : y \in A\}$ as the projected distance from $x$ to $A$. For a set $A$ and a positive number $r$, we denote $A \oplus r = \{x : d(x, A) \leq r\}$.

2. Solution manifold and its geometry. Let $\Psi: \mathbb{R}^d \mapsto \mathbb{R}^s$ be a vector-valued function and $M = \{x : \Psi(x) = 0\}$ be the solution set/manifold. When the Jacobian matrix $G_\Psi(x) = \nabla \Psi(x)$ has rank $s$ at every $x \in M$, the set $M$ is a $(d-s)$-dimensional manifold locally at every point $x$ by the implicit function theorem [72]. In algebraic statistics, the parameters in the solution set $\{x : \Psi(x) = 0\}$ are called an implicit (statistical) algebraic model [36]. For a solution manifold, its normal space can be characterized using the following lemma.

LEMMA 1. For every point $x \in M$, the row space of $G_\Psi(x) \in \mathbb{R}^{s \times d}$ spans the normal space of $M$ at $x$.

Lemma 1 is an elementary result from geometry (see Section 6.5.1 of [34]); hence, we omit its proof. Lemma 1 states that the Jacobian/gradient of $\Psi$ is normal to the solution manifold. This is a natural result because the gradient of a function is always normal to its level set, and the solution manifold can be viewed as the intersection of level sets of different functions.

2.1. Assumptions. We will make the following two major assumptions in this paper. All the theoretical results rely on these two assumptions.

(D-k) $\Psi(x)$ is bounded $k$-times differentiable.
(F) There exist $\lambda_0, \delta_0, c_0 > 0$ such that
A. $\lambda_{\min}(G_\Psi(x) G_\Psi(x)^T) \equiv \lambda_{\min,>0}(G_\Psi(x)^T G_\Psi(x)) > \lambda_0^2$ for all $x \in M \oplus \delta_0$, and
B. $\|\Psi(x)\|_{\max} > c_0$ for all $x \notin M \oplus \delta_0$.
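In practice, condition (F)-A can be probed numerically on a candidate point cloud: $\lambda_{\min}(G_\Psi(x) G_\Psi(x)^T)$ is the squared smallest singular value of the Jacobian. A minimal sketch (our illustration, not part of the paper; it assumes $\Psi$ is available as a callable and uses a finite-difference Jacobian):

```python
import numpy as np

def smallest_singular_value(psi, x, eps=1e-6):
    """Estimate the smallest singular value of the Jacobian G_Psi(x)
    by central finite differences; lambda_min(G G^T) = sigma_min^2.
    """
    d = len(x)
    cols = []
    for j in range(d):
        e = np.zeros(d); e[j] = eps
        cols.append((psi(x + e) - psi(x - e)) / (2 * eps))
    G = np.column_stack(cols)  # the s x d Jacobian
    return np.linalg.svd(G, compute_uv=False).min()

# Condition (F)-A asks this value to stay above lambda_0 > 0 near M.
```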
Assumption (D-k) is an ordinary smoothness condition on the generator function; it may be relaxed to a Hölder-type condition on $\Psi(x)$, and it is weaker for a smaller integer $k$. In the stability analysis, we need the (D-1) condition (i.e., a bounded first-order derivative). If we want to control the smoothness of the manifold, we need (D-2) or a higher-order condition. The stability of a gradient flow requires a (D-3) condition.

Assumption (F) is a curvature assumption on $\Psi$ around the solution manifold; together with Lemma 1, it implies that the normal space of $M$ at each point is well-defined. This assumption reduces to commonly assumed conditions in the literature. For instance, in the case of mode estimation (finding the local modes of a PDF $p(x)$), (F) reduces to the assumption that the local modes are well-defined and separated [68, 69]. This is similar to the assumption that the PDF $p(x)$ is a Morse function [19, 43]. In MLE theory, (F) refers to the Fisher information matrix being positive definite at the MLE (and other local maxima), which is often viewed as a classical condition (see, e.g., Chapter 5 of [77]). In the problem of finding density level sets (finding the set $\{x : p(x) = \lambda\}$), this assumption is equivalent to assuming that $p(x)$ has a non-zero gradient around the level set [56, 76, 57, 12, 48, 44].

Assumption (F) is critical to ensuring that the set $M$ forms a manifold. When there is no lower bound $\lambda_0$, i.e., the gradient quantity $\lambda_{\min}(G_\Psi(x) G_\Psi(x)^T)$ attains $0$, the set $M$ may not form a manifold. One scenario where this occurs is the density level set $\{x : p(x) = \lambda\}$ such that the density function $p(x)$ is flat at the level $\lambda$. The constants in (F) can be further characterized by the following lemma.

LEMMA 2. Assume (D-2) and that there exists $\lambda_M > 0$ such that
$$(F') \qquad \inf_{x \in M} \lambda_{\min}(G_\Psi(x) G_\Psi(x)^T) \geq \lambda_M^2.$$
Then the constants in (F) can be chosen as
$$\lambda_0 = \frac{1}{2}\lambda_M, \qquad \delta_0 = \frac{3\lambda_M^2}{8 \|\Psi\|^*_{\infty,1} \|\Psi\|^*_{\infty,2}}, \qquad c_0 = \inf_{x \notin M \oplus \delta_0} \|\Psi(x)\|_{\max}.$$

Lemma 2 only places assumptions on the behavior of $\Psi$ and its derivatives on the manifold $M$. (F') is an eigengap condition for the row space of $G_\Psi(x)$. The assumption in Lemma 2 is very mild: when estimating the local modes of a function, it is the same as requiring that the Hessian matrix at the local modes has all positive eigenvalues. The requirement $\inf_{x \in M} \lambda_{\min}(G_\Psi(x) G_\Psi(x)^T) \geq \lambda_M^2$ implies that the rows of $G_\Psi(x)$ are linearly independent for all $x \in M$.

2.2. Smoothness of solution manifolds. We introduce the concept of reach [33, 28, 1, 2] (also known as the condition number in [59] and the minimal feature size in [15]) to describe the smoothness of a manifold. The reach is the longest distance away from $M$ within which every point has a unique projection onto $M$; that is,
$$\text{reach}(M) = \sup\{r > 0 : \forall x \in M \oplus r, \ x \text{ has a unique projection onto } M\}. \tag{5}$$
The reach can be viewed as the largest radius of a ball that can freely move along the manifold $M$; see Figure 2 for an example. The reach has been used in nonparametric set estimation as a condition to guarantee the stability of a set estimator [18, 20, 28].

The smoothness of $\Psi$ does not suffice to guarantee the smoothness of a solution manifold. Consider the example of a density level set $\{x : p(x) = \lambda\}$ with a smooth density $p(x)$. By construction, this level set is a solution manifold with $\Psi(x) = p(x) - \lambda$, a smooth function. Suppose that $p(x)$ has two modes and a saddle point $c$. If we choose $\lambda = p(c)$, the level set does not have a positive reach.
See Figure 3 for an example. Although the smoothness of $\Psi$ is not enough to guarantee a smooth $M$, with the additional condition (F), the solution manifold will be smooth, as described in the following theorem.

THEOREM 3 (Smoothness theorem). Conditions (D-2) and (F) imply that the reach of $M$ has a lower bound
$$\text{reach}(M) \geq \min\left(\frac{\delta_0}{2}, \ \frac{\lambda_0}{\|\Psi\|^*_{\infty,2}}\right).$$

[FIG 2. An illustration of reach. Reach is the largest radius of a ball that can freely move along the manifold $M$ without penetrating any part of $M$. In (a), the radius of the pink ball equals the reach. In (b), the radius is too large; hence, the ball cannot pass the small gap in $M$.]

[FIG 3. An example of a smooth generator $\Psi$ for which the resulting solution manifold $M = M_1 \cup M_2$ may have zero reach: a density with two local modes and a saddle point. The dashed lines are the contour lines at different levels.]

Theorem 3 shows a lower bound on the reach of a solution manifold. Essentially, it shows that as long as the generator function is not flat around the solution manifold (assumption (F)), the resulting manifold will be smooth. Note that for Theorem 3, condition (D-2) can be relaxed to a 2-Hölder condition, and the quantity $\|\Psi\|^*_{\infty,2}$ can be replaced by the corresponding Hölder constant.

REMARK 1. Reach is related to the curvature and a quantity called folding [65]. Folding is defined as the smallest radius $r$ such that $B(x, r) \cap M$ is connected for each $x \in M$. The first quantity $\frac{\delta_0}{2}$ is related to the folding; the second quantity $\frac{\lambda_0}{\|\Psi\|^*_{\infty,2}}$ is related to the curvature. When $s = 1$, a result similar to Theorem 3 appeared in [79]. Note that the reach is also related to the 'rolling properties' and '$\alpha$-convexity'; see [28] and Appendix A of [60].

2.3. Stability of solution manifolds. In this section, we show that when two generator functions are sufficiently close, the associated solution manifolds will be similar as well. We first start with a partial stability theorem, which holds under a weaker smoothness condition.

PROPOSITION 4. Let $\Psi, \tilde\Psi: \mathbb{R}^d \mapsto \mathbb{R}^s$ be two vector-valued functions and let $M = \{x : \Psi(x) = 0\}$, $\tilde M = \{x : \tilde\Psi(x) = 0\}$ be the corresponding solution manifolds. Moreover, let $\delta_0$ and $c_0$ be the constants in (F). Assume that (D-1) holds within $M \oplus \delta_0$ and that (F) holds. When $\|\tilde\Psi - \Psi\|^*_{\infty,0} < c_0$, we have
$$\sup_{y \in \tilde M} d(y, M) = O\left(\|\tilde\Psi - \Psi\|_\infty^{(0)}\right).$$

In the case of $s = d$, the conditions in Proposition 4 show many connections to existing work. For instance, the convergence rate for estimating a mode (or a local mode) is often based on a condition similar to (D-1); see, e.g., Theorem 2 of [78]. In MLE and Z-estimation theory (see, e.g., Sections 5.6 and 5.7 of [77]), classical conditions often require the first-order derivative of the score equation or estimating equations to be uniformly bounded, which is similar to condition (D-1). Moreover, in MLE theory, we often need the derivative of the score function (or the Fisher information matrix under appropriate conditions) to be non-singular within a small neighborhood of the population maximizer. This is exactly condition (F); the constant $\lambda_0$ in (F) is the smallest absolute eigenvalue of the derivative of the score function in the case of MLE problems.
Note that in the case of $s = d$, Proposition 4 is often enough for statistical consistency because $M$ is a collection of disjoint points; thus, there is no need to consider the smoothness of $M$. However, when $s < d$, the set $M$ contains an infinite number of points, and the smoothness of $M$ plays a role in analyzing its stability. Therefore, we will need additional derivatives.

Before we formally state the stability theorem, we first introduce the concept of the Hausdorff distance, a common metric on sets:
$$\text{Haus}(A, B) = \max\left\{\sup_{x \in B} d(x, A), \ \sup_{x \in A} d(x, B)\right\}.$$
The Hausdorff distance is a distance between two sets and can be viewed as an $L_\infty$-type distance between sets.

THEOREM 5 (Stability theorem). Let $\Psi, \tilde\Psi: \mathbb{R}^d \mapsto \mathbb{R}^s$ be two vector-valued functions and let $M = \{x : \Psi(x) = 0\}$, $\tilde M = \{x : \tilde\Psi(x) = 0\}$ be the corresponding solution manifolds. Assume (D-2) and (F), and that $\tilde\Psi$ is bounded two-times differentiable. When $\|\tilde\Psi - \Psi\|^*_{\infty,2}$ is sufficiently small,
1. (F) holds for $\tilde\Psi$;
2. $\text{Haus}(M, \tilde M) = O\left(\|\tilde\Psi - \Psi\|_\infty^{(0)}\right)$;
3. $\text{reach}(\tilde M) \geq \min\left\{\frac{\delta_0}{2}, \frac{\lambda_0}{\|\Psi\|^*_{\infty,2}}\right\} + O\left(\|\tilde\Psi - \Psi\|^*_{\infty,2}\right)$.

Theorem 5 shows that two similar generator functions have similar solution manifolds. Claim 2 is a geometric convergence property indicating that a consistent generator function estimator implies a consistent manifold estimator. The need for a second-order derivative comes from the constants in (F); these constants are associated with second-order derivatives through Lemma 2. Claim 3 is convergence in smoothness, which implies that $\tilde M$ cannot be very wiggly when $\tilde\Psi$ is sufficiently close to $\Psi$.

EXAMPLE 5 (Missing data). The stability theorem (Theorem 5) provides a simple approach for obtaining the convergence rate of an estimator. Consider the missing data problem in Example 1. The 'population' solution manifold of the parameters $\theta = (\zeta_{x,y}, \mu_x, \xi)$ is
$$\Theta = \{\theta : \Psi(\theta) = 0\} \subset \mathbb{R}^7,$$
where $\Psi(\theta) \in \mathbb{R}^6$ is based on equation (1). When we observe random samples of size $n$ in the form of $(X_i, Y_i, R_i = 1)$ or $(X_i, R_i = 0)$, we can derive estimators $\widehat P(X=x, Y=y, R=1)$ and $\widehat P(X=x, R=0)$ and construct an estimated version $\widehat\Psi_n(\theta)$ by replacing $P(x, y, r=1)$ and $P(x, r=0)$ in equation (1) with the estimated versions. This leads to an estimated solution manifold
$$\widehat\Theta_n = \{\theta : \widehat\Psi_n(\theta) = 0\} \subset \mathbb{R}^7.$$
The stability theorem (Theorem 5) bounds the distance between $\widehat\Theta_n$ and $\Theta$ via the difference
$$\max\{|\widehat P(x, y, r=1) - P(x, y, r=1)|, \ |\widehat P(x, r=0) - P(x, r=0)| : x, y = 0, 1\}.$$
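When both manifolds are represented by point clouds (as produced by the algorithm of Section 3), the Hausdorff distance appearing in Theorem 5 can be evaluated directly. A minimal sketch of that computation (ours, for illustration; it holds the full pairwise distance matrix in memory):

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between point clouds A (n x d) and B (m x d)."""
    # pairwise Euclidean distances, shape (n, m)
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return max(D.min(axis=1).max(),   # sup_{x in A} d(x, B)
               D.min(axis=0).max())   # sup_{y in B} d(y, A)
```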
3. Monte Carlo approximation to solution manifolds. Given $\Psi$ or its estimator/approximation $\widehat\Psi$, numerically finding the solution manifold $M$ is a non-trivial task. In this section, we propose a simple gradient descent procedure to find points on $M$. Note that although we describe the algorithm using $\Psi$, we will apply the algorithm to $\widehat\Psi$ in practice.

$M$ is the solution set of $\Psi$; thus, we may rewrite it as
$$M = \{x : \Psi(x) = 0\} = \{x : f(x) = 0\}, \qquad f(x) = \Psi(x)^T \Lambda \Psi(x), \tag{6}$$
where $\Lambda$ is an $s \times s$ positive definite matrix. Let $x$ be an initial point and consider the gradient flow $\pi_x(t)$:
$$\pi_x(0) = x, \qquad \pi'_x(t) = -\nabla f(\pi_x(t)).$$
Points in $M$ are stationary points of this gradient system; moreover, they are the minima of the function $f(x)$. Thus, we can use a gradient descent approach to find points on $M$. Algorithm 1 summarizes the gradient descent procedure for approximating $M$.

Algorithm 1 (Monte Carlo gradient descent algorithm)
1. Randomly choose an initial point $x_0 \sim Q$, where $Q$ is a distribution over the region of interest $K$.
2. Iterate
$$x_{t+1} \leftarrow x_t - \gamma \nabla f(x_t) \tag{7}$$
until convergence. Let $x_\infty$ be the convergent point.
3. If $\Psi(x_\infty) = 0$ (or is sufficiently small), we keep $x_\infty$; otherwise, we discard $x_\infty$.
4. Repeat the above procedure until we obtain enough points for approximating $M$.

Note that we may choose $\Lambda = I$ to be the identity matrix. In this case, $f(x) = \|\Psi(x)\|^2$, so we will be investigating the gradient descent flow of $\|\Psi(x)\|^2$. For the case of $d = s$, this is a common method in numerical analysis for finding the solution set of non-linear equations; see, e.g., Section 6.5 of [29].

Algorithm 1 consists of three steps: a random initialization step, a gradient descent step, and a rejection step. The random initialization step allows us to explore different parts of the manifold. The gradient descent step moves the initial points to possible candidates on $M$ by iterating the gradient descent. The rejection step ensures that the points being kept are indeed on the solution manifold. Note that random initialization and rejection are popular strategies in numerical analysis; they serve as a remedy for local minimizers of $f$ that are not in the solution manifold $M$ (see the discussion on page 152 of [29]).

Figure 1 shows an example of finding the solution manifold $\{(\mu, \sigma) : P(-5 < Y < 2) = 0.5, \ Y \sim N(\mu, \sigma^2)\}$ using random initializations (from a uniform distribution over $[1,3] \times [2,4]$) and gradient descent (we provide more details on the implementation later in Example 6). We recover the underlying one-dimensional manifold structure using Algorithm 1.
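A compact implementation sketch of Algorithm 1 (our illustration; `psi` and its Jacobian `jac` are assumed user-supplied, and we take $\Lambda = I$ so that $f(x) = \|\Psi(x)\|^2$):

```python
import numpy as np

def monte_carlo_gd(psi, jac, sample_init, n_points=1000,
                   gamma=0.01, max_iter=5000, tol=1e-8):
    """Algorithm 1 with Lambda = I: f(x) = ||Psi(x)||^2,
    so grad f(x) = 2 * G_Psi(x)^T Psi(x)."""
    kept = []
    while len(kept) < n_points:
        x = sample_init()                  # step 1: draw x0 ~ Q
        for _ in range(max_iter):          # step 2: gradient descent
            g = 2.0 * jac(x).T @ psi(x)
            x = x - gamma * g
            if np.linalg.norm(g) < tol:
                break
        if np.linalg.norm(psi(x)) < 1e-6:  # step 3: rejection
            kept.append(x)
    return np.array(kept)                  # step 4: point cloud over M
```

For the example in Figure 1, `sample_init` would draw uniformly from $[1,3] \times [2,4]$; the rejection step discards the runs that stall at stationary points of $f$ off the manifold.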
3.1. Analysis of the gradient flow. When an initial point is given, we perform gradient descent to find a minimum. We analyze this process by starting with the (continuous-time) gradient flow $\pi_x(t)$; the gradient descent algorithm can be viewed as a discrete-time approximation to this flow. To analyze the convergence of the flow and the algorithm, we denote $\Lambda_{\max}$ and $\Lambda_{\min}$ as the largest and smallest eigenvalues of $\Lambda$, respectively.

LEMMA 6. Assume (D-2) and (F). Let $G_f(x) = \nabla f(x)$, $H_f(x) = \nabla\nabla f(x)$, and $G_\Psi(x) = \nabla \Psi(x) \in \mathbb{R}^{s \times d}$. Then we have the following properties:
1. For each $x \in M$,
a) the eigenvectors of $H_f(x)$ with non-zero eigenvalues span the normal space of $M$ at $x$;
b) the minimal non-zero eigenvalue satisfies $\lambda_{\min,>0}(H_f(x)) \geq \psi^2_{\min}(x) \equiv 2\lambda_{\min,>0}(G_\Psi(x)^T G_\Psi(x))\Lambda_{\min} \geq 2\lambda_0^2 \Lambda_{\min}$;
c) the minimal eigenvalue in the normal space of $M$ at $x$ satisfies $\lambda_{\min,\perp}(H_f(x)) = \lambda_{\min,>0}(H_f(x))$.
2. Suppose that $x$ has a unique projection $x_M \in M$ and let $N_M(x_M)$ be the normal space of $M$ at $x_M$. If
$$d(x, M) < \delta_c = \min\left\{\delta_0, \ \frac{\Lambda_{\min}\lambda_0^2}{8 d \Lambda_{\max} \|\Psi\|^*_{\infty,2} \|\Psi\|^*_{\infty,3}}\right\},$$
then
$$\lambda_{\min,\perp,M}(H_f(x)) \equiv \min_{v \in N_M(x_M)} \frac{v^T H_f(x) v}{\|v\|^2} = \min_{v \in N_M(x_M)} \frac{\|H_f(x) v\|}{\|v\|} \geq \lambda_0^2 \Lambda_{\min}.$$

Property 1 of Lemma 6 describes the behavior of the Hessian $H_f(x)$ on the manifold: the eigenspace corresponding to non-zero eigenvalues is the same as the normal space of the manifold. With this insight, it is easy to understand property 1-(c): the minimal eigenvalue in the normal space is the same as the minimal non-zero eigenvalue. Property 2 concerns the behavior of $H_f(x)$ around the manifold; the Hessian $H_f(x)$ is well-behaved as long as we are sufficiently close to $M$. The following theorem characterizes several important properties of the gradient flow.

THEOREM 7 (Convergence of gradient flows). Assume (D-3) and (F). Let $\delta_c$ be defined as in Lemma 6 and define $\pi_x(\infty) = \lim_{t \to \infty} \pi_x(t)$. The gradient flow $\pi_x(t)$ satisfies the following properties:
1. (Convergence radius) For all $x \in M \oplus \delta_c$, $\pi_x(\infty) \in M$.
2. (Terminal flow orientation) Let $v_x(t) = \frac{\pi'_x(t)}{\|\pi'_x(t)\|}$ be the orientation of the gradient flow. If $\pi_x(\infty) \in M$, then $v_x(\infty) = \lim_{t \to \infty} v_x(t) \perp M$ at $\pi_x(\infty)$.

The first result of Theorem 7 gives the convergence radius of the gradient flow: the flow converges to the manifold as long as it starts within distance $\delta_c$ of the manifold. The second statement characterizes the orientation of the gradient flow: the flow intersects the manifold from the normal space; namely, the flow hits the manifold orthogonally. If we choose $\Lambda = I$ to be the identity matrix ($\Lambda_{\min} > 0$ in this case), Theorem 7 implies the convergence of the gradient flow of $\|\Psi\|^2$. Note that Theorem 7 requires one additional derivative (D-3) because we need to perform a Taylor expansion of the Jacobian of $\Psi$ around $M$ to ensure that the gradient flow converges from a normal direction to the manifold; the third-order derivatives ensure that the remainders are small.

Suppose that the initial point $x$ is drawn from a distribution $Q$ with PDF $q$. The convergent point $\pi_x(\infty)$ can then be viewed as a random draw from a distribution $Q_M$ defined over the manifold $M$. The distributions $Q$ and $Q_M$ are associated via the mapping induced by the gradient descent process; thus, $Q_M$ is a pushforward measure of $Q$ [7]. We now investigate how $Q$ and $Q_M$ are associated. For every point $z \in M$, we define its basin of attraction [13, 19] as
$$A(z) = \{x : \pi_x(\infty) = z\}.$$
Namely, $A(z)$ is the collection of initial points from which the gradient flow converges to $z \in M$. Let $A_M = \cup_{z \in M} A(z)$ be the union of all basins of attraction. The set $A_M$ characterizes the regions where the initialization leads to an accepted point in Algorithm 1. Thus, the acceptance probability of the rejection step of Algorithm 1 is $Q(A_M) = \int_{A_M} Q(dx)$.

The basin $A(z)$ has an interesting geometric property: it forms an $s$-dimensional manifold under smoothness assumptions. This result is similar to the stable manifold theorem in the dynamical systems literature [53, 54, 5]; in fact, it is more relevant to the local center manifold theorem (see, e.g., Section 2.12 of [61]).

THEOREM 8 (Local center manifold theorem). Assume (D-3) and (F). The basin of attraction $A(z)$ forms an $s$-dimensional manifold at each $z \in M$.

An outcome of Theorem 8 is that the pushforward measure $Q_M$ has an $s$-dimensional Hausdorff density function [52, 62] if $Q$ has a regular PDF $q$. Note that the $s$-dimensional Hausdorff density at a point $x$ is defined through
$$\lim_{r \to 0} \frac{Q_M(B(x, r))}{C_s r^s},$$
where $C_s$ is the $s$-dimensional volume of a unit ball. If this limit exists, $Q_M$ has an $s$-dimensional Hausdorff density at $x$. Thus, if we obtain $Z_1, \cdots, Z_N \in M$ from applying Algorithm 1, we may view them as IID observations from a distribution $Q_M$ defined over the manifold $M$, and this distribution has an $s$-dimensional Hausdorff density function. The model in which we observe IID $Z_1, \cdots, Z_N$ from a distribution supported on a lower-dimensional manifold is common in computational geometry [24, 30, 31, 16]; hence, Theorem 8 implies that Algorithm 1 provides a new statistical example for this model. Note that [3] proved the stability of a gradient ascent flow when the target is the local modes of a density function, and the stability of basins of attraction was studied in [21] in a similar scenario. These results may be generalized to solution manifolds. The major technical issue to solve is that the convergent points of the flows form a collection of an infinite number of points; therefore, the analysis is much more complicated. We leave this as future work.
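Before turning to the discrete algorithm, it may help to record the computation behind Lemma 6, which the text leaves implicit (assuming the positive definite $\Lambda$ is symmetric). Differentiating $f(x) = \Psi(x)^T \Lambda \Psi(x)$ twice gives
$$\nabla f(x) = 2\, G_\Psi(x)^T \Lambda \Psi(x), \qquad H_f(x) = 2\, G_\Psi(x)^T \Lambda\, G_\Psi(x) + 2\sum_{i=1}^s [\Lambda\Psi(x)]_i\, \nabla\nabla \Psi_i(x).$$
On $M$ we have $\Psi(x) = 0$, so the second term vanishes and $H_f(x) = 2 G_\Psi(x)^T \Lambda G_\Psi(x)$ is positive semi-definite with rank $s$; its column space is the row space of $G_\Psi(x)$, which is the normal space of $M$ by Lemma 1. This is exactly property 1 of Lemma 6, and assumption (F) then bounds the smallest non-zero eigenvalue from below.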
3.2. Analysis of the gradient descent algorithm. In Algorithm 1, we do not follow the flow $\pi_x$ exactly; instead, we use an iterative gradient descent approach that creates a sequence of discrete points $x_0, x_1, \cdots, x_\infty$ such that
$$x_{t+1} = x_t - \gamma \nabla f(x_t), \qquad x_0 = x, \tag{8}$$
where $\gamma > 0$ is the step size. The gradient descent algorithm will diverge if the step size $\gamma$ is chosen incorrectly. Thus, it is crucial to investigate the range of $\gamma$ leading to a convergent point $x_\infty$ and how fast the sequence $\{x_t : t = 0, 1, \cdots\}$ converges to a point on $M$. The following theorem characterizes the algorithmic convergence along with a feasible range of the step size $\gamma$.

THEOREM 9 (Linear convergence). Assume (D-2) and (F). When the initial point $x_0$ and step size $\gamma$ satisfy
$$d(x_0, M) < \delta_c = \min\left(\delta_0, \ \frac{\Lambda_{\min}\lambda_0^2}{8 d \Lambda_{\max} \|\Psi\|^*_{\infty,2} \|\Psi\|^*_{\infty,3}}\right), \qquad \gamma < \min\left(\frac{1}{\Lambda_{\max}\|\Psi\|^*_{\infty,2}}, \ \frac{\Lambda_{\max}\|\Psi\|^*_{\infty,2}}{4\lambda_0^4 \Lambda_{\min}^2}, \ \delta_c\right),$$
we have the following properties for $t = 0, 1, 2, \cdots$:
$$f(x_t) \leq f(x_0) \cdot \left(1 - \gamma\, \frac{\lambda_0^4 \Lambda_{\min}^2}{\Lambda_{\max}\|\Psi\|^*_{\infty,2}}\right)^t, \qquad d(x_t, M) \leq d(x_0, M)\left(1 - \gamma \lambda_0^2 \Lambda_{\min}\right)^{t/2}.$$

The convergence radius $\delta_c$ is the same as in Theorem 7. Theorem 9 shows that when the initial point is within the convergence radius of the gradient flow and the step size is sufficiently small, the gradient descent algorithm converges linearly to a point on the manifold. An equivalent statement of Theorem 9 is that the algorithm takes only $O(\log(1/\epsilon))$ iterations to converge to $\epsilon$-error of the minimum.

The key step in the derivation of Theorem 9 is to investigate the minimal eigenvalue in the normal space, $\lambda_{\min,\perp}(H_f(x))$, for each $x \in M$. This quantity (which appears in the theorem through the lower bound $\lambda_0^2 \Lambda_{\min}$) controls the flattest direction of $f(x)$ in the normal space. The three requirements on the step size arise for different reasons. The first requirement ($\gamma < \frac{1}{\Lambda_{\max}\|\Psi\|^*_{\infty,2}}$) ensures that the objective function is decreasing. The second requirement ($\gamma < \frac{\Lambda_{\max}\|\Psi\|^*_{\infty,2}}{4\lambda_0^4 \Lambda_{\min}^2}$) establishes the convergence rate. The third requirement ($\gamma < \delta_c$) guarantees that the Hessian of $f$ is well-behaved when applying the gradient descent algorithm.
The first and third requirements together are enough for the convergence of the gradient descent algorithm but do not lead to the convergence rate; we need the additional second requirement to obtain the rate. Theorem 9 is a very interesting result: the function $f(x)$ is non-convex within any neighborhood of $M$ (i.e., not locally convex), but the gradient descent algorithm still converges linearly to a stationary point. An intuitive explanation of this result is that $f(x)$ is a 'directionally' convex function in the normal subspace of $M$ (property 2 in Lemma 6). Note that, similar to Theorem 7, Theorem 9 applies to the gradient descent algorithm with $\Lambda = I$.

4. Statistical applications. In this section, we show that the idea of solution manifolds can be applied to various statistical problems. We also include a Bayesian approach that finds a credible region on a solution manifold in Appendix A.

4.1. Missing data. The solution manifold framework we developed can be used to analyze the nonparametric bound in the missing data problem [49, 50, 17]. We use Example 1 to illustrate the idea, but our analysis can be generalized to complex missing data scenarios. The nonparametric bound refers to the feasible range of the parameters $\theta = (\zeta_{x,y}, \mu_x, \xi : x, y = 0, 1) \in \mathbb{R}^7$ without any additional assumptions. Hence, the only constraints on these seven parameters are the six equations in equation (1). Thus, we know that the resulting parameter space is a one-dimensional manifold.

This manifold will be smooth due to Theorem 3. The stability theorem (Theorem 5) informs us that when we estimate the constraints by sample analogues, the estimated manifold (which can be viewed as an estimated nonparametric bound) will be at $O_P(1/\sqrt{n})$ distance from the population manifold. Algorithm 1 provides a simple approach for numerically finding points on this solution manifold: with multiple random initializations, we can obtain a point cloud approximation of the one-dimensional manifold characterizing the nonparametric bound of all the parameters.

4.2. Algorithmic fairness. We now revisit the algorithmic fairness problem in Example 2. We have shown that a simple method of generating a test-fair classifier $Q$ from $W, A$ is to sample from a Bernoulli random variable with parameter $q_{W,A} = P(Q=1 \mid W, A)$ that satisfies equation (3). This leads to a solution manifold $\Theta = \{\theta = (q_{w,a} : w, a = 0, 1) : \Psi(\theta) = 0\}$, where $\Psi(\theta)$ is described in equation (3). By Theorem 3, the collection of feasible parameters will be a smooth manifold. When we estimate the underlying constraint from a random sample, the convergence rate (of manifolds) is described by Theorem 5.

In practice, finding $\Theta$ is often not the ultimate goal. Our goal is to find a classifier that is test fair and has a good classification error [40, 27]. A conventional approach to measuring classification accuracy is via a loss function $L(y, y')$, and we want to find the optimal $q^* \in \Theta$ such that
$$q^* = \operatorname*{argmin}_{q \in \Theta} R(q), \qquad R(q) = E_{Q \sim q}(L(Y, Q)).$$
This is essentially a manifold-constrained maximization/minimization problem. This problem also occurs in constrained likelihood inference (see the next section), where we want to find the MLE under a solution manifold constraint. We will discuss a unified treatment of this manifold-constrained optimization problem in the next section and propose an algorithm for it (Algorithm 2). While the algorithm is written in the form of finding the MLE, one can easily adapt it to the test fairness problem.
4.3. Parametric models. One scenario where solution manifolds are useful is the analysis of parametric models. We provide two different examples showing how solution manifolds can be used in parametric modeling. Suppose that we observe IID observations $X_1, \cdots, X_n$ from some distribution $P$, and we model the distribution using a parametric family $P_\theta$ with $\theta \in \Theta$. Let $p_\theta$ be the PDF/PMF of $P_\theta$ and let
$$\ell(\theta \mid X_1, \cdots, X_n) = \frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i)$$
be the log-likelihood function. Note that in Appendix A, we also provide a Bayesian procedure that approximates the posterior distribution on a manifold (Algorithm 3).

4.3.1. Constrained MLE. In likelihood inference, we may need to compute the MLE under some constraints. One example is the likelihood ratio test when the parameter space $\Theta_0$ under $H_0$ is generated by equality constraints, namely $\Theta_0 = \{\theta \in \Theta : \Psi(\theta) = 0\}$. This problem occurs in algebraic statistics, and asymptotic theories can be found in [55] and Section 5 of [32]. To use the likelihood ratio test, we need to find the MLEs under both $\Theta_0$ and $\Theta$. Finding the MLE under $\Theta$ is a regular statistical problem. However, finding the MLE under $\Theta_0$ may not be easy because of the constraint $\Psi(\theta) = 0$. We propose a procedure combining the gradient ascent of the likelihood function with the gradient descent to the manifold to compute the constrained MLE. Algorithm 2 describes the procedure, and Figure 4 provides a graphical illustration.

Algorithm 2 (Manifold-constraint maximizing algorithm)
1. Randomly choose an initial point $\theta^{(0)}_0 = \theta^{(0)}_\infty \in \Theta$.
2. For $m = 1, 2, \cdots$, do steps 3-6:
3. Ascent of likelihood. Update
$$\theta^{(m)}_0 = \theta^{(m-1)}_\infty + \alpha \nabla \ell(\theta^{(m-1)}_\infty \mid X_1, \cdots, X_n), \tag{9}$$
where $\alpha > 0$ is the step size of the gradient ascent over the log-likelihood function $\ell(\theta \mid X_1, \cdots, X_n)$.
4. Descent to manifold. For each $t = 0, 1, 2, \cdots$, iterate
$$\theta^{(m)}_{t+1} \leftarrow \theta^{(m)}_t - \gamma \nabla f(\theta^{(m)}_t)$$
until convergence. Let $\theta^{(m)}_\infty$ be the convergent point.
5. If $\Psi(\theta^{(m)}_\infty) = 0$ (or is sufficiently small), we keep $\theta^{(m)}_\infty$; otherwise, we discard $\theta^{(m)}_\infty$ and return to step 1.
6. If $\nabla \ell(\theta^{(m)}_\infty \mid X_1, \cdots, X_n)$ belongs to the row space of $\nabla \Psi(\theta^{(m)}_\infty)$, we stop and output $\theta^{(m)}_\infty$.

This algorithm consists of a one-step gradient ascent of the likelihood function (step 3) and a gradient descent to the manifold (Algorithm 1; steps 4 and 5). The stopping criterion (step 6) is that $\nabla \ell(\theta^{(m)}_\infty \mid X_1, \cdots, X_n)$ belongs to the row space of $\nabla \Psi(\theta^{(m)}_\infty)$. By Lemma 1, the row space of $\nabla \Psi(\theta^{(m)}_\infty)$ is the normal space of $M$ at $\theta^{(m)}_\infty$. It is easy to see that any critical point of the log-likelihood function on the manifold satisfies the condition that the likelihood gradient belongs to the row space of $\nabla \Psi$; thus, the constrained MLE is a stationary point of Algorithm 2. As a result, we stop the algorithm when the stopping criterion is met. However, other local modes, saddle points, and local minima are also stationary points. Hence, in practice, we need to run the algorithm with multiple initial points to increase the chance of finding the true MLE.
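A minimal sketch of Algorithm 2 (ours, not the paper's implementation; `grad_loglik`, `psi`, and `jac` are assumed user-supplied, with $\Lambda = I$), reusing the descent step of Algorithm 1:

```python
import numpy as np

def constrained_mle(grad_loglik, psi, jac, theta0,
                    alpha=0.01, gamma=0.01, n_outer=500, tol=1e-6):
    """Algorithm 2: alternate likelihood ascent and descent to the manifold."""
    theta = theta0
    for _ in range(n_outer):
        theta = theta + alpha * grad_loglik(theta)  # step 3: ascent
        for _ in range(5000):                       # step 4: descent to manifold
            g = 2.0 * jac(theta).T @ psi(theta)
            theta = theta - gamma * g
            if np.linalg.norm(g) < tol:
                break
        # step 6: stop when grad_loglik lies in the row space of the
        # Jacobian, i.e., its tangential component is (nearly) zero
        G = jac(theta)
        v = grad_loglik(theta)
        proj_normal = G.T @ np.linalg.solve(G @ G.T, G @ v)
        if np.linalg.norm(v - proj_normal) < tol:
            break
    return theta
```

The stopping check projects the likelihood gradient onto the row space of $\nabla\Psi$ via $G^T(GG^T)^{-1}G\,v$ and stops when the remainder, the tangential component, vanishes.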
EXAMPLE 6 (Testing a tail probability in a Gaussian model). To illustrate the idea, suppose that $X_i \in \mathbb{R}$ is from an unknown distribution on which we place a parametric model. We further assume that the parametric distribution is a Gaussian $N(\mu, \sigma^2)$ with unknown mean and variance. Consider the null hypothesis $H_0 : P(r_0 \leq X \leq r_1) = s_0$ for some given $s_0 > 0$ and $r_0, r_1$ (note that this example also appears in Figure 1). Let $\Phi(y) = P(Z \leq y)$ denote the CDF of a standard normal. The null hypothesis $H_0$ imposes the following constraint on $(\mu, \sigma^2)$:
\[
s_0 = \Phi\left(\frac{r_1 - \mu}{\sigma}\right) - \Phi\left(\frac{r_0 - \mu}{\sigma}\right).
\]
Thus,
\[
\Psi(\mu, \sigma) = \Phi\left(\frac{r_1 - \mu}{\sigma}\right) - \Phi\left(\frac{r_0 - \mu}{\sigma}\right) - s_0 \in \mathbb{R}.
\]
The feasible set of $(\mu, \sigma^2)$ forms a 1D solution manifold in $\mathbb{R}^2$. It is difficult to find the analytical form of the MLE under $H_0$, but we may use the method in Algorithm 2 to obtain a numerical approximation.

FIG 4. An example illustrating how Algorithm 2 works. We consider the example of estimating the tail probability in a Gaussian model $N(\mu, \sigma^2)$ with the constraint $P(-5 \leq X \leq 2) = 0.5$. We generate $n = 1000$ points from $N(1.5, 3^2)$ and display the log-likelihood function in the two panels (contours are the log-likelihood surface). Left: we initialize Algorithm 2 with five random points (blue boxes), and the algorithm creates an ascending path (blue lines) to the maximum point (orange cross). Right: we illustrate Algorithm 2 by showing the points in each iteration in an area zoomed in relative to the left panel. Starting from the solid black point, we first perform a gradient ascent with respect to the log-likelihood function (brown arrow) and then apply Algorithm 1 to descend to the solution manifold. We keep repeating this process until it converges.
The derivative of $\Psi(\mu, \sigma)$ with respect to $\mu$ and $\sigma$ has the following closed form:
\[
\frac{\partial}{\partial \mu}\Psi(\mu, \sigma) = -\frac{1}{\sigma}\phi\left(\frac{r_1 - \mu}{\sigma}\right) + \frac{1}{\sigma}\phi\left(\frac{r_0 - \mu}{\sigma}\right),
\]
\[
\frac{\partial}{\partial \sigma}\Psi(\mu, \sigma) = -\frac{r_1 - \mu}{\sigma^2}\phi\left(\frac{r_1 - \mu}{\sigma}\right) + \frac{r_0 - \mu}{\sigma^2}\phi\left(\frac{r_0 - \mu}{\sigma}\right),
\]
where $\phi(y) = \frac{1}{\sqrt{2\pi}}e^{-y^2/2}$ is the PDF of the standard normal. Algorithm 2 can easily be implemented with the above derivatives; a sketch follows below. Figure 4 shows an example of applying Algorithm 2 to this problem with $r_0 = -5$, $r_1 = 2$, and $s_0 = 0.5$ (and 1000 random numbers generated from $N(1.5, 3^2)$). All five random initial points converge to the maximum on the manifold.
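For concreteness, the following minimal sketch applies Algorithm 2 to this example in the setting of Figure 4; the step sizes and iteration budgets are hand-picked, and the stopping rule is a simple tolerance on $\Psi$ rather than the row-space criterion of step 6.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(1.5, 3.0, size=1000)      # simulated data, as in Figure 4
r0, r1, s0 = -5.0, 2.0, 0.5              # constraint: P(r0 <= X <= r1) = s0

def psi(mu, sigma):                      # generator of the solution manifold
    return norm.cdf((r1 - mu) / sigma) - norm.cdf((r0 - mu) / sigma) - s0

def grad_psi(mu, sigma):                 # the closed-form derivatives above
    z1, z0 = (r1 - mu) / sigma, (r0 - mu) / sigma
    return np.array([(-norm.pdf(z1) + norm.pdf(z0)) / sigma,
                     (-z1 * norm.pdf(z1) + z0 * norm.pdf(z0)) / sigma])

def grad_loglik(mu, sigma):              # gradient of the average log-likelihood
    return np.array([np.mean(X - mu) / sigma**2,
                     -1.0 / sigma + np.mean((X - mu)**2) / sigma**3])

theta = np.array([1.0, 2.5])             # initial (mu, sigma)
alpha, gamma = 0.05, 0.5                 # hand-picked step sizes
for _ in range(300):                     # outer loop: ascent of likelihood
    theta = theta + alpha * grad_loglik(*theta)
    for _ in range(500):                 # inner loop: descent to the manifold
        r = psi(*theta)
        if abs(r) < 1e-10:
            break
        theta = theta - gamma * 2.0 * r * grad_psi(*theta)   # grad of psi^2

print(theta, psi(*theta))                # constrained MLE estimate; psi should be ~0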
4.3.2. Partial identification and generalized method of moments. Solution manifolds appear in the partial identification problem [51] in econometrics. One example is the moment constraint problem [25], also known as the generalized method of moments [38, 39]. In this case, we want to estimate the parameter $\theta \in \mathbb{R}^d$ that solves the moment equation $\mathbb{E}(g(Y; \theta)) = 0$, where $g(y; \theta) \in \mathbb{R}^s$ is a vector-valued function and $Y$ is a random variable denoting the observed data. When $s < d$, the solution set (also called an identified set in [25]) $M = \{\theta : \mathbb{E}(g(Y; \theta)) = 0\}$ forms a solution manifold. Thus, the smoothness theorem (Theorem 3) and the stability theorem (Theorem 5) can be applied to this case. When the estimator is obtained by the empirical moment equation $\widehat{M}_n = \{\theta : \frac{1}{n}\sum_{i=1}^n g(Y_i; \theta) = 0\}$, Theorem 5 implies $\mathsf{Haus}(\widehat{M}_n, M) \stackrel{P}{\to} 0$ when the empirical moments $\frac{1}{n}\sum_{i=1}^n g(Y_i; \theta)$ and their derivatives with respect to $\theta$ converge to the population moments $\mathbb{E}(g(Y; \theta))$ and their derivatives, respectively.

In the generalized method of moments, a common approach to finding a solution to $\mathbb{E}(g(Y; \theta)) = 0$ is by minimizing a criterion function $Q(\theta) = \mathbb{E}(g(Y; \theta))^T \Lambda\, \mathbb{E}(g(Y; \theta))$, where $\Lambda$ is a positive definite matrix [39]. This is identical to the function $f$ defined in equation (6). Thus, the analysis in Section 3 can be used to study the minimization problem in the generalized method of moments; a toy sketch of this minimization follows.
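As a toy illustration (not an example from the paper), take $\theta \in \mathbb{R}^2$ with the single moment condition $\mathbb{E}(Y) - \theta_1\theta_2 = 0$, so $s = 1 < d = 2$ and the identified set is a curve; the sketch below runs gradient descent on the empirical criterion with $\Lambda = I$ and a hand-picked step size.

import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(2.0, 1.0, size=500)       # observed data; here E(Y) = 2

def g_bar(theta):                        # empirical moment: mean(Y) - theta1*theta2
    return np.array([Y.mean() - theta[0] * theta[1]])

def jac_g_bar(theta):                    # Jacobian of the empirical moment, (s, d)
    return np.array([[-theta[1], -theta[0]]])

def minimize_Q(theta, gamma=0.05, n_iter=2000):
    # gradient descent on Q_n(theta) = g_bar(theta)^T g_bar(theta)
    for _ in range(n_iter):
        r = g_bar(theta)
        theta = theta - gamma * 2.0 * jac_g_bar(theta).T @ r
    return theta

# point-cloud approximation of the identified set {theta : theta1 * theta2 = E(Y)}
cloud = np.array([minimize_Q(rng.uniform(0.5, 3.0, size=2)) for _ in range(200)])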
REMARK 2. In econometrics, a problem similar to the solution manifold is the inequality constraint problem, which occurs when we replace the equality constraints with inequality constraints [75, 70], i.e., $\mathbb{E}(g_\ell(Y; \theta)) \leq 0$ for $\ell = 1, \cdots, s$. The goal is to find $\theta$ satisfying the above inequality constraints. A common approach to finding the feasible set is to define an objective function
\[
Q(\theta) = \left\{\sum_{\ell=1}^s \left[\mathbb{E}(g_\ell(Y; \theta))\right]_+\right\}^2, \quad [y]_+ = \max(y, 0),
\]
such that the feasible set is $\{\theta : Q(\theta) = 0\}$. The inequality constraints imply that $\{\theta : Q(\theta) = 0\}$ may not form a lower-dimensional manifold but a subset of the original parameter space. A common estimator of $\{\theta : Q(\theta) = 0\}$ is
\[
\left\{\theta : \widehat{Q}_n(\theta) \leq c_n\right\}, \quad \widehat{Q}_n(\theta) = \left\{\sum_{\ell=1}^s \left[\frac{1}{n}\sum_{i=1}^n g_\ell(Y_i; \theta)\right]_+\right\}^2,
\]
for some sequence $c_n \to 0$. Note that by properly choosing $c_n$, we may construct both an estimator and a confidence region; see [25] and [70] for more details.

4.4. Nonparametric set estimation. Solution manifolds occur in many scenarios of nonparametric set estimation problems. One famous example is the density level set problem, in which the parameter of interest is the (density) level set $\{x : p(x) = \lambda\}$, where $p$ is the PDF that generates our data and $\lambda$ is a pre-specified level. In this case, the smoothness theorem (Theorem 3) yields the same result as [20]. Moreover, the stability theorem (Theorem 5) suggests that the convergence rate under the Hausdorff distance will be the rate of estimating the density function, which is consistent with several existing works [12, 67, 66]. The methods developed in Section 3 can be used to find points on the level set; a sketch appears at the end of this section. As an illustration, Figure 5 shows an example of approximating the level set using Algorithm 1 with the Graft-versus-Host Disease (GvHD) data [11] from the mclust package in R [6]. We use the variables CD3, CD4, and CD8b and focus on the control group. The density is computed using a Gaussian kernel density estimator with an equal bandwidth $h = 20$ in all coordinates. We choose the level as the 25% quantile of the densities at all observations. The three panels display the approximated level sets from three angles. The three surfaces indicate three connected components in the regions where the density is above this threshold.

FIG 5. An example of approximating a density level set, using the control group of the GvHD data from the mclust package in R with the variables CD3, CD4, and CD8b; the three panels display the estimated level set under three different angles.

In addition to the level set problem, density ridges [18, 35] are also examples of solution manifolds. The density $k$-ridges are the collection $\{x : V_k(x)^T \nabla p(x) = 0, \lambda_{k+1}(x) < 0\}$, where $V_k(x) = [v_{k+1}(x), \cdots, v_d(x)]$ denotes the matrix of the $d - k$ eigenvectors of $\nabla\nabla p(x)$ corresponding to the smallest eigenvalues and $\lambda_k(x)$ is the $k$-th largest eigenvalue. If we only pick the lowest $d - k$ eigenvectors, we obtain a system of $d - k$ equations, leading to a $k$-dimensional manifold (under smoothness conditions). The stability theorem (Theorem 5) states that the convergence rate of a density ridge estimator will be at the rate of estimating the Hessian matrix, which is consistent with the findings of [35]. Note that the density local modes can be viewed as $0$-dimensional ridges. In this special case, the matrix $V_1(x)$ is full rank under assumption (F); thus,
\[
\{x : V_0(x)^T \nabla p(x) = 0, \lambda_1(x) < 0\} = \{x : \nabla p(x) = 0, \lambda_1(x) < 0\}.
\]
As a result, the function $\Psi$ that generates the local modes does not involve the Hessian matrix, so the convergence rate of the mode estimator is the gradient estimation rate rather than the rate of estimating the Hessian.
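As promised above, here is a minimal level-set sketch on synthetic two-dimensional data (a stand-in for the GvHD variables, which live in R's mclust package). The bandwidth is hand-picked, and we use a damped Gauss-Newton step length in place of a fixed $\gamma$, a small deviation from the plain gradient descent of Algorithm 1 that avoids tuning the step size to the scale of the density.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 2))          # synthetic 2D stand-in for the GvHD data
h = 0.5                                   # bandwidth (hand-picked for this toy data)

def p_hat(x):
    # Gaussian kernel density estimate at x
    d2 = np.sum((data - x)**2, axis=1)
    return np.mean(np.exp(-d2 / (2 * h**2))) / (2 * np.pi * h**2)

def grad_p_hat(x):
    # gradient of the kernel density estimate at x
    diff = data - x
    w = np.exp(-np.sum(diff**2, axis=1) / (2 * h**2))
    return (diff * w[:, None]).mean(axis=0) / (2 * np.pi * h**4)

lam = np.quantile([p_hat(x) for x in data], 0.25)   # level: 25% density quantile

def to_level_set(x, n_iter=200, tol=1e-12, max_step=0.5):
    # move x to {x : p_hat(x) = lam}; psi(x) = p_hat(x) - lam, and we follow
    # -r * grad_p_hat (the direction of -grad f for f = psi^2) with a
    # Gauss-Newton step length, damped so early steps cannot overshoot
    for _ in range(n_iter):
        r, g = p_hat(x) - lam, grad_p_hat(x)
        if abs(r) < tol:
            break
        move = (r / (g @ g + 1e-12)) * g
        nrm = np.linalg.norm(move)
        if nrm > max_step:
            move *= max_step / nrm
        x = x - move
    return x

level_cloud = np.array([to_level_set(x.copy()) for x in data])

The endpoints form a point cloud on $\{x : \widehat{p}(x) = \lambda\}$, analogous to the surfaces displayed in Figure 5.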
5. Discussion. In this paper, we investigate both geometric and algorithmic properties of solution manifolds. While solution manifolds may seem abstract, we showed that they appear in various statistical problems, including missing data, algorithmic fairness, likelihood inference, and nonparametric set estimation. Hence, the methodologies and theories developed in this paper provide a generic framework for analyzing all these problems. This framework may inform us of the hidden relations among these seemingly different statistical problems. In what follows, we discuss some topics relevant to solution manifolds.

5.1. Smoothness, stability, and convergence of gradient flow. We developed five major theoretical results: the smoothness theorem (Theorem 3), the stability theorem (Theorem 5), the gradient flow theorem (Theorem 7), the local center manifold theorem (Theorem 8), and the algorithmic convergence theorem (Theorem 9). These results characterize different properties of solution manifolds and are often studied in various fields. In our work, we showed that they all rely on a similar set of assumptions: (D), the smoothness of $\Psi$, and (F), the curvature assumption on $\Psi$ around $M$. Assumption (D) is more than enough for some of the theoretical results. The smoothness theorem can be relaxed to assuming that $\Psi$ satisfies a $\beta$-Hölder condition with various $\beta$. For instance, the (D-1) condition in the weak stability result (Proposition 4) can be relaxed to a $1$-Hölder condition, and condition (D-2) in the smoothness theorem (Theorem 3) can be replaced by a $2$-Hölder condition. Moreover, we observe a hierarchy of smoothness corresponding to the different theoretical results. If we only have (D-1), the first-order derivatives, we obtain the weak stability result of Proposition 4. If we make the further assumption (D-2), we obtain the stability theorem (Theorem 5), a characterization of smoothness (Theorem 3), and algorithmic convergence (Theorem 9). Under the additional assumption (D-3), we can derive the even stronger results on the corresponding gradient flow (Theorems 7 and 8).

5.2. Connections to other fields. We would like to point out that the results of this paper have several connections to other fields.
• Econometrics. Solution manifolds occur in the partial identification problem (Section 4.3.2); hence, our analysis provides some insights into the moment equality constraint problem [25]. Our analysis of the gradient descent (e.g., Theorem 7) can be applied to investigate the properties of the minimization problem in the generalized method of moments approach [38, 39].
• Dynamical systems. As mentioned before, Theorem 8 is related to the stable manifold theorem and the local center manifold theorem in dynamical systems [53, 54, 5, 61]. Our analysis provides statistical examples in which these theorems may be useful for data analysis.
• Computational geometry. If we stop the gradient descent process early, we do not obtain points that are exactly on the manifold. The resulting points $Z_1, \cdots, Z_n$ may be viewed as drawn from $Z_i = X_i + \epsilon_i$, where $X_i \in M$ is from a distribution over the manifold and $\epsilon_i$ is some additive noise.
This model is a common additive noise model in the computational geometry literature [24, 23, 30, 31, 16, 8]. Our proposed method provides another concrete example of the manifold additive noise model.
• Optimization. In general, a gradient descent method has a linear convergence rate when the objective function is strongly convex and has a smooth gradient [10, 58]. However, in our setting, the objective function $f(x)$ is non-convex (and is not locally convex), yet the gradient descent algorithm still obtains a linear (algorithmic) convergence rate (Theorem 9). This reveals a class of non-convex objective functions that can be minimized quickly using a gradient descent algorithm.

5.3. Future work. The framework developed in this paper has potential in many other problems. We provide some possible directions that we plan to pursue in the future.
• Log-linear model. The log-linear model of categorical variables is an interesting example in the sense that it can be expressed as a solution manifold when there are constraints such as conditional independence, although it may be unnecessary to use the developed techniques. Consider a $d$-dimensional categorical random vector $X$ that takes values in $\{0, 1, 2, \cdots, J-1\}^d$. The joint PMF of $X$, $p(x_1, \cdots, x_d)$, has $J^d$ entries with the constraint that $\sum_x p(x_1, \cdots, x_d) = 1$, so it has $J^d - 1$ degrees of freedom. In the log-linear model, we reparametrize the PMF using the log-linear expansion $\log p(x) = \sum_A \psi_A(x_A)$, where $A$ is any non-empty subset of $\{1, 2, \cdots, J\}$ and $x_A = (x_j : j \in A)$, with the constraint that $\psi_A(x_A) = 0$ if $x_j = 0$ for any $j \in A$. Under the log-linear model, we reparametrize the joint PMF using the parameters $\Theta_{LL} = \{\psi_A(x_A) : A \subset \{1, 2, \cdots, J\}, x_A \in \{0, 1\}^{|A|}\}$, where $|A|$ is the cardinality of $A$. The feasible parameters in $\Theta_{LL}$ form a solution manifold due to the aforementioned constraints. However, common constraints in the log-linear model are that the interaction terms $\psi_A = 0$ for some $A$. This leads to a flat manifold; hence, there is no need to use the developed techniques. We may need techniques from the solution manifold when the constraint is placed on the PMF $p(x_1, \cdots, x_d)$ rather than on the log-linear parameters, because constraints on the PMF lead to implicit constraints on $\Theta_{LL}$. We leave this as future work.
• Confidence regions of solution manifolds. Another future direction is to develop a method for constructing confidence regions of solution manifolds. There are two common approaches to constructing a confidence region of a set. The first is based on the "vertical uncertainty", which is the uncertainty due to $\widehat{\Psi} - \Psi$. This idea has been applied in generalized method of moments problems [25, 70, 17] and level set estimation [48, 63, 22]. The other approach is based on the "horizontal uncertainty", which is the uncertainty due to $\mathsf{Haus}(\widehat{M}, M)$. This technique has been used in constructing confidence sets of density ridges and level sets [18, 20]. Based on these results, we believe that it is possible to develop a procedure for constructing confidence regions of solution manifolds. We leave this as future work.
• A new class of non-convex problems. We observe an interesting phenomenon in Theorem 9. Although the objective function $f(x) = \Psi(x)^T \Lambda \Psi(x)$ is non-convex, we still obtain a linear (algorithmic) convergence.
Note that for a non-convex function that is locally convex around the minimizer, linear convergence can be established by assuming local strong convexity of the objective function [4], i.e., $f(x)$ is strongly convex within $B(x^*, r)$ for some radius $r > 0$, where $x^*$ is the global minimizer. However, our problem is more complicated in the sense that $f(x)$ is flat along $M$, so it is not locally strongly convex. The key element in our result is assumption (F), stating that $f(x)$ behaves as if it were "locally strongly convex" in the normal direction of $M$. Thus, with some additional structure on the non-convex function, we may still obtain fast convergence. We will investigate how this may be useful in other non-convex optimization problems. In addition, the analysis may be applied to other forms of $f(x)$ that are not limited to a "squared"-type transformation of $\Psi(x)$ (i.e., $f(x)$ behaving like the square of $\Psi$), which may further improve the convergence rate. For instance, the gradient descent over $f_1(x) = \|\Psi(x)\|_1$ may also converge faster than over the function $f(x)$. We will investigate this in the future.

APPENDIX A: BAYESIAN INFERENCE

The techniques we developed for solution manifolds can be used for Bayesian inference after some modifications. One example is the univariate Gaussian with unknown mean $\mu$ and variance $\sigma^2$ and a second moment constraint. The parameter space is $\Theta(s_0) = \{(\mu, \sigma^2) : \mathbb{E}(Y^2) = \mu^2 + \sigma^2 = s_0^2\}$. We place a prior $\pi(\theta)$ over $\Theta(s_0)$ that reflects our prior belief about the parameter $\theta = (\mu, \sigma)$. However, sampling from $\pi$ (and the posterior) is a non-trivial task because $\pi$ is supported on a manifold. The Monte Carlo approximation method in Section 3 offers a solution to sampling from $\pi$. With a little modification of Algorithm 1, we can approximate the posterior distribution defined on the solution manifold.

Let $\pi$ be a prior PDF defined over the solution set $M = \{\theta : \Psi(\theta) = 0\}$, where $\Psi : \Theta \mapsto \mathbb{R}^k$. We observe IID observations $X_1, \cdots, X_n$ that are assumed to be from a parametric model $p(x \mid \theta)$. The posterior distribution of $\theta$ is
\[
\pi(\theta \mid X_1, \cdots, X_n) \propto \begin{cases} \pi(\theta)\prod_{i=1}^n p(X_i \mid \theta), & \text{if } \theta \in M; \\ 0, & \text{if } \theta \notin M. \end{cases}
\]
We propose a method that approximates the posterior distribution using a weighted point cloud. Our approach is formally described in Algorithm 3. Note that the algorithm we develop only requires the ability to evaluate a function $\rho(\theta) \propto \pi(\theta)$; we do not need the exact value of the prior density.

EXAMPLE 7 (Bayesian analysis of Example 6). Figure 6 shows an example of 90% credible intervals and MAPs under three scenarios: prior distribution only (left panel), posterior distribution with $n = 100$ (middle panel), and posterior distribution with $n = 1000$ (right panel). This is the same setting as in Example 6 and Figure 4, where the manifold is formed by the constraint $P(-5 < X < 2) = 0.5$ with $X \sim N(\mu, \sigma^2)$. We choose the prior distribution (density) as
\[
\pi(\mu, \sigma) \propto \phi(\mu; 2, 0.2)\,\phi(\sigma; 2.5, 0.2)\,I((\mu, \sigma) \in M),
\]
where $\phi(x; a, b)$ is the density of $N(a, b^2)$. In the left panel, the credible interval is completely determined by the prior distribution, and the MAP is the mode of the prior. In the middle and right panels, the data are incorporated into the posterior distributions. Both the credible intervals and the MAPs change because of the influence of the data. Our method (Algorithm 3 and equation (12)) provides a simple and elegant approach to approximating credible intervals on the manifold.
Algorithm 3 Approximated manifold posterior algorithm
1. Apply Algorithm 1 to generate many points $Z_1, \cdots, Z_N \in M$.
2. Estimate a density score of $Z_i$ using
\[
\widehat{\rho}_{i,N} = \frac{1}{N}\sum_{j=1}^N K\left(\frac{\|Z_i - Z_j\|}{h}\right),
\]
where $h > 0$ is a tuning parameter and $K$ is a smooth function such as a Gaussian.
3. Compute the posterior density score of $Z_i$ as
\[
(10) \quad \widehat{\omega}_{i,N} = \frac{1}{\widehat{\rho}_{i,N}}\cdot\widehat{\pi}_{i,N}, \quad \widehat{\pi}_{i,N} = \pi(Z_i)\cdot\prod_{j=1}^n p(X_j \mid Z_i).
\]
return: the weighted point cloud $(Z_1, \widehat{\omega}_{1,N}), \cdots, (Z_N, \widehat{\omega}_{N,N})$.
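A minimal sketch of Algorithm 3 for Example 7 follows; the projection step uses a Gauss-Newton step length (a hand-tuned $\gamma$ would also do), the bandwidth $h$ is hand-picked, and the weights are computed on the log scale for numerical stability.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(1.5, 3.0, size=100)        # observed data, as in Example 7
r0, r1, s0 = -5.0, 2.0, 0.5               # manifold: P(r0 <= X <= r1) = s0

def psi(t):
    mu, sig = t
    return norm.cdf((r1 - mu) / sig) - norm.cdf((r0 - mu) / sig) - s0

def grad_psi(t):
    mu, sig = t
    z1, z0 = (r1 - mu) / sig, (r0 - mu) / sig
    return np.array([(-norm.pdf(z1) + norm.pdf(z0)) / sig,
                     (-z1 * norm.pdf(z1) + z0 * norm.pdf(z0)) / sig])

def project(t, n_iter=100, tol=1e-12):
    # step 1 helper: descend psi^2 onto the manifold (Gauss-Newton step length)
    for _ in range(n_iter):
        r, g = psi(t), grad_psi(t)
        if abs(r) < tol:
            break
        t = t - (r / (g @ g)) * g
    return t

# step 1: generate points Z_1, ..., Z_N on M from random initializations
Z = np.array([project(np.array([rng.uniform(0.5, 3.0), rng.uniform(1.5, 4.0)]))
              for _ in range(400)])

# step 2: density score rho_i of each Z_i with a Gaussian kernel
h = 0.1
D2 = np.sum((Z[:, None, :] - Z[None, :, :])**2, axis=-1)
rho = np.mean(np.exp(-D2 / (2 * h**2)), axis=1)

# step 3: posterior score = prior x likelihood, reweighted by 1 / rho_i
log_prior = norm.logpdf(Z[:, 0], 2.0, 0.2) + norm.logpdf(Z[:, 1], 2.5, 0.2)
log_post = np.array([norm.logpdf(X, mu, sig).sum() for mu, sig in Z]) + log_prior
w = np.exp(log_post - log_post.max()) / rho     # weights omega_i (up to scale)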
FIG 6. An example showing the credible interval (credible region) and the MAP on the manifold. We use the same example as in Figure 4 and the prior distribution $\pi(\mu, \sigma) \propto \phi(\mu; 2, 0.2)\phi(\sigma; 2.5, 0.2)I((\mu, \sigma) \in M)$, where $\phi(x; a, b)$ is the density of $N(a, b^2)$. Left: the 90% credible interval along with the MAP using only the prior distribution. Middle: we randomly generate $n = 100$ observations from $N(1.5, 3^2)$ and compute the 90% credible interval and the MAP from the posterior distribution. Right: the same analysis as the middle panel but with a sample of size $n = 1000$. The background gray contours show the log-likelihood function (as an indication of how the likelihood will influence the posterior).

To see why the outputs from Algorithm 3 are a valid approximation of the posterior density, note that the density score $\widehat{\rho}_{i,N}$ is proportional to the underlying density of $Z_1, \cdots, Z_N$ defined over $M$. Hence, the weighted point cloud $(Z_1, \widehat{\rho}^{-1}_{1,N}), \cdots, (Z_N, \widehat{\rho}^{-1}_{N,N})$ behaves like a uniform sample over $M$.
Thus, to account for the sampling density of the point cloud, we rescale the posterior score of $Z_i$ in equation (10) by the factor $\widehat{\rho}^{-1}_{i,N}$. Note that the value $\widehat{\pi}_{i,N}$ is proportional to the posterior density $\pi(\theta \mid X_1, \cdots, X_n)$ evaluated at $\theta = Z_i$. The quantity $h$ is the smoothing bandwidth in the kernel density estimation. Because this is a density estimation problem, we recommend choosing it using Silverman's rule of thumb [74] or other popular approaches such as least squares cross-validation [71, 9]; see the review paper [73] for a list of reliable methods.

With the output from Algorithm 3, the posterior density $\pi(\theta \mid X_1, \cdots, X_n)$ is represented by the collection of points $Z_1, \cdots, Z_N$ along with the corresponding weights $\widehat{\omega}_{1,N}, \cdots, \widehat{\omega}_{N,N}$. The posterior mean can be approximated using
\[
\widehat{\theta}_{\text{Pmean}} = \frac{\sum_{i=1}^N \widehat{\omega}_{i,N} Z_i}{\sum_{i=1}^N \widehat{\omega}_{i,N}}.
\]
This estimator is essentially an importance sampling estimator. The posterior mode (MAP: maximum a posteriori) can be approximated using
\[
\widehat{\theta}_{\text{MAP}} = Z_{i^*}, \quad i^* = \operatorname*{argmax}_{i \in \{1, \cdots, N\}} \widehat{\pi}_{i,N}.
\]
The weighted point cloud also leads to an approximated credible region. Let $1 - \alpha$ be the credible level and $Z_{(1)}, \cdots, Z_{(N)}$ be the ordered points such that $\widehat{\pi}_{(1),N} \geq \widehat{\pi}_{(2),N} \geq \cdots \geq \widehat{\pi}_{(N),N}$. Define
\[
(11) \quad i(\alpha) = \operatorname*{argmin}\left\{i : \frac{\sum_{j=1}^i \widehat{\omega}_{(j),N}}{\sum_{\ell=1}^N \widehat{\omega}_{(\ell),N}} \geq 1 - \alpha\right\}.
\]
Then we may use the collection of points
\[
(12) \quad \{Z_{(1)}, \cdots, Z_{(i(\alpha))}\}
\]
as an approximation of a $1 - \alpha$ credible region. Alternatively, one may use the set $\{\theta \in M : \pi(\theta \mid X_1, \cdots, X_n) \geq \pi(Z_{i(\alpha)} \mid X_1, \cdots, X_n)\}$ as another approximation of a $1 - \alpha$ credible region.

Here is an explanation of the choice in equation (11). $\widehat{\pi}_{i,N}$ is proportional to the posterior value at $Z_i$; hence, $\pi(Z_{(1)} \mid X_1, \cdots, X_n) \geq \pi(Z_{(2)} \mid X_1, \cdots, X_n) \geq \cdots \geq \pi(Z_{(N)} \mid X_1, \cdots, X_n)$. Define the upper-level set of level $\lambda$ of the posterior distribution as $L(\lambda) = \{\theta : \pi(\theta \mid X_1, \cdots, X_n) \geq \lambda\}$. The posterior probability within $L(\lambda)$ is
\[
\pi(L(\lambda) \mid X_1, \cdots, X_n) = \int I(\theta \in L(\lambda))\,\pi(\theta \mid X_1, \cdots, X_n)\,d\theta.
\]
A $1 - \alpha$ credible region can be constructed by choosing the minimal level $\lambda_\alpha$ such that
\[
(13) \quad \lambda_\alpha = \inf\{\lambda : \pi(L(\lambda) \mid X_1, \cdots, X_n) \geq 1 - \alpha\}.
\]
With the weights $\widehat{\omega}_{1,N}, \cdots, \widehat{\omega}_{N,N}$, an approximation to $\pi(L(\lambda) \mid X_1, \cdots, X_n)$ is
\[
\widehat{\pi}(L(\lambda) \mid X_1, \cdots, X_n) = \frac{\sum_{j=1}^N \widehat{\omega}_{j,N} I(Z_j \in L(\lambda))}{\sum_{\ell=1}^N \widehat{\omega}_{\ell,N}}.
\]
The posterior levels $\widehat{\pi}_{1,N}, \cdots, \widehat{\pi}_{N,N}$ form a discrete approximation of all levels $\lambda$. Thus, an approximation to equation (13) is
\[
\widehat{\lambda}_\alpha = \min\left\{\widehat{\pi}_{i,N} : \frac{\sum_{j=1}^N \widehat{\omega}_{j,N} I(Z_j \in L(\widehat{\pi}_{i,N}))}{\sum_{\ell=1}^N \widehat{\omega}_{\ell,N}} \geq 1 - \alpha\right\}
= \min\left\{\widehat{\pi}_{i,N} : \frac{\sum_{j=1}^N \widehat{\omega}_{j,N} I(\widehat{\pi}_{j,N} \geq \widehat{\pi}_{i,N})}{\sum_{\ell=1}^N \widehat{\omega}_{\ell,N}} \geq 1 - \alpha\right\}
= \min\left\{\widehat{\pi}_{(i),N} : \frac{\sum_{j=1}^i \widehat{\omega}_{(j),N}}{\sum_{\ell=1}^N \widehat{\omega}_{(\ell),N}} \geq 1 - \alpha\right\}
= \widehat{\pi}_{(i(\alpha)),N},
\]
with $i(\alpha)$ as in equation (11). The choice in equation (11) follows from the above approximation to the level $\lambda_\alpha$. A code sketch of these posterior summaries is given below.

REMARK 3. Note that the posterior mean may not be on the manifold. One may replace it by the posterior Fréchet mean [37], defined as
\[
\widehat{\theta}_{\text{PFmean}} = Z_{i^\dagger}, \quad i^\dagger = \operatorname*{argmin}_{i \in \{1, \cdots, N\}} \sum_{j=1}^N \widehat{\omega}_{j,N}\|Z_i - Z_j\|^2.
\]
The Fréchet mean defines the mean of a random variable $X$ via the minimization problem $\operatorname*{argmin}_\mu \mathbb{E}\|X - \mu\|^2$ and constrains the minimizer to lie on the manifold; here, we use the weighted point cloud to approximate this minimization.
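Continuing the Algorithm 3 sketch, the posterior summaries above can be computed from the weighted point cloud in a few lines; the synthetic stand-in for $(Z, w, \log\widehat{\pi})$ is only there so the snippet runs on its own.

import numpy as np

# assumes (Z, w, log_post) from the Algorithm 3 sketch above; the lines below
# create a synthetic stand-in so this snippet is self-contained
rng = np.random.default_rng(0)
Z = rng.normal(size=(400, 2))
log_post = -np.sum(Z**2, axis=1)
w = np.exp(log_post - log_post.max())

alpha = 0.10
order = np.argsort(-log_post)             # Z_(1), Z_(2), ...: descending posterior
cum = np.cumsum(w[order]) / w.sum()
i_alpha = int(np.searchsorted(cum, 1 - alpha)) + 1     # equation (11)
credible_region = Z[order[:i_alpha]]                   # equation (12)

theta_map = Z[order[0]]                                # posterior mode (MAP)
theta_pmean = (w[:, None] * Z).sum(axis=0) / w.sum()   # importance-sampling mean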
APPENDIX B: DERIVATION OF EQUATION (3)

To derive the constraint in equation (3) from equation (2), we expand the first term and consider $s = 1$:
\[
P(Y = 1 \mid Q = 1, A = 0) = \frac{P(Y = 1, Q = 1 \mid A = 0)}{P(Q = 1 \mid A = 0)} = \frac{\sum_w P(Y = 1, Q = 1, W = w \mid A = 0)}{\sum_{w'} P(Q = 1, W = w' \mid A = 0)} \stackrel{(*)}{=} \frac{\sum_w P(Q = 1 \mid W = w, A = 0)P(W = w, Y = 1 \mid A = 0)}{\sum_{w'} P(Q = 1 \mid W = w', A = 0)P(W = w' \mid A = 0)} = \frac{\sum_w q_{w,0} P(W = w, Y = 1 \mid A = 0)}{\sum_{w'} q_{w',0} P(W = w' \mid A = 0)}.
\]
The equality labeled with (*) is due to $Q \perp Y \mid A, W$. The two probabilities $P(W = w, Y = 1 \mid A = 0)$ and $P(W = w' \mid A = 0)$ are identifiable from the data. A similar calculation shows that
\[
P(Y = 1 \mid Q = 1, A = 1) = \frac{\sum_w q_{w,1} P(W = w, Y = 1 \mid A = 1)}{\sum_{w'} q_{w',1} P(W = w' \mid A = 1)}.
\]
So the test fairness constraint in equation (3) requires
\[
\frac{\sum_w q_{w,0} P(W = w, Y = 1 \mid A = 0)}{\sum_{w'} q_{w',0} P(W = w' \mid A = 0)} = \frac{\sum_w q_{w,1} P(W = w, Y = 1 \mid A = 1)}{\sum_{w'} q_{w',1} P(W = w' \mid A = 1)}.
\]
For the case of $s = 0$, the above constraint becomes
\[
\frac{\sum_w (1 - q_{w,0}) P(W = w, Y = 1 \mid A = 0)}{\sum_{w'} (1 - q_{w',0}) P(W = w' \mid A = 0)} = \frac{\sum_w (1 - q_{w,1}) P(W = w, Y = 1 \mid A = 1)}{\sum_{w'} (1 - q_{w',1}) P(W = w' \mid A = 1)}.
\]
These two equations are what equation (3) refers to.

APPENDIX C: PROOFS

PROOF OF LEMMA 2. Essentially, we only need to show that when $d(x, M) \leq \frac{3\lambda_M^2}{8\|\Psi\|^*_{\infty,1}\|\Psi\|^*_{\infty,2}}$, the minimal eigenvalue satisfies $\lambda_{\min}(G_\Psi(x)G_\Psi(x)^T) \geq \frac{1}{4}\lambda_M^2$. For any point $x$ with $d(x, M) \leq \frac{3\lambda_M^2}{8\|\Psi\|^*_{\infty,1}\|\Psi\|^*_{\infty,2}}$, let $x_M$ be its projection onto $M$. The minimal eigenvalue satisfies
\[
(14) \quad \lambda_{\min}(G_\Psi(x)G_\Psi(x)^T) = \lambda_{\min}(G_\Psi(x_M)G_\Psi(x_M)^T) + \left(\lambda_{\min}(G_\Psi(x)G_\Psi(x)^T) - \lambda_{\min}(G_\Psi(x_M)G_\Psi(x_M)^T)\right) \geq \lambda_M^2 - \left|\lambda_{\min}(G_\Psi(x)G_\Psi(x)^T) - \lambda_{\min}(G_\Psi(x_M)G_\Psi(x_M)^T)\right|.
\]
Weyl's theorem (see, e.g., Theorem 4.3.1 of [41]) shows that the eigenvalue difference can be bounded via
\[
\left|\lambda_{\min}(G_\Psi(x)G_\Psi(x)^T) - \lambda_{\min}(G_\Psi(x_M)G_\Psi(x_M)^T)\right| \leq \|G_\Psi(x)G_\Psi(x)^T - G_\Psi(x_M)G_\Psi(x_M)^T\|_{\max} \leq 2\|\Psi\|^*_{\infty,1}\|\Psi\|^*_{\infty,2}\|x - x_M\| = 2\|\Psi\|^*_{\infty,1}\|\Psi\|^*_{\infty,2}\,d(x, M),
\]
where the second inequality follows from Taylor's theorem. Thus, as long as $2\|\Psi\|^*_{\infty,1}\|\Psi\|^*_{\infty,2}\,d(x, M) \leq \frac{3}{4}\lambda_M^2$, we have $\lambda_{\min}(G_\Psi(x)G_\Psi(x)^T) \geq \frac{1}{4}\lambda_M^2$, which completes the proof.

PROOF OF THEOREM 3. The proof is modified from the proofs of Lemma 4.11 and Theorem 4.12 in [33]. Let $r_0 = \min\{\frac{\delta_0}{2}, \frac{\lambda_0}{\|\Psi\|^*_{\infty,2}}\}$. We will show that $r_0$ is a lower bound on $\mathsf{reach}(M)$. We proceed by proof of contradiction. Suppose the conclusion is incorrect and the reach is less than $r_0$. Then there exists a point $x$ such that $d(x, M) < r_0$ and $x$ has two projections onto $M$, denoted $b, c \in M$. Since $b, c \in M$, $\Psi(b) = \Psi(c) = 0$, and by Taylor's remainder theorem and conditions (D-2) and (F),
\[
(15) \quad \|G_\Psi(b)(b - c)\|_2 = \|\Psi(b) - \Psi(c) - G_\Psi(b)(b - c)\|_2 \leq \frac{1}{2}\|b - c\|_2^2\|\Psi\|^*_{\infty,2}.
\]
By the nature of the projection, we can find a vector $t_b \in \mathbb{R}^s$ such that $x - b = t_b^T G_\Psi(b)$ because the normal space is spanned by the row space of $G_\Psi(b)$ (Lemma 1).
Together with (15), this implies
\[
(16) \quad 2|(x - b)^T(b - c)| = 2|t_b^T G_\Psi(b)(b - c)| \leq 2\|G_\Psi(b)(b - c)\|_2\|t_b\|_2 \leq \|\Psi\|^*_{\infty,2}\|b - c\|_2^2\|t_b\|_2.
\]
Since $b, c$ are projections of $x$ onto $M$, $\|x - b\|_2 = \|x - c\|_2$. As a result,
\[
(17) \quad 0 = \|x - c\|_2^2 - \|x - b\|_2^2 = \|b - c\|_2^2 + 2(b - c)^T(x - b) \geq \|b - c\|_2^2 - \|\Psi\|^*_{\infty,2}\|b - c\|_2^2\|t_b\|_2 = \|b - c\|_2^2\left(1 - \|\Psi\|^*_{\infty,2}\|t_b\|_2\right),
\]
where the inequality uses (16). However, starting from the definition of $r_0$, we have
\[
(18) \quad \frac{\lambda_0}{\|\Psi\|^*_{\infty,2}} \geq r_0 > \|x - b\|_2 = \|t_b^T G_\Psi(b)\|_2 \geq \lambda_0\|t_b\|_2,
\]
where the last inequality follows from (F). As a result, $\|\Psi\|^*_{\infty,2}\|t_b\|_2 < 1$, so $\|b - c\|_2^2 = 0$ by (17), which implies $b = c$, a contradiction. Accordingly, $x$ must have a unique projection, so the reach has the lower bound $r_0$.

PROOF OF PROPOSITION 4. Consider a point $x \in \tilde{M}$. By the condition $\|\tilde{\Psi} - \Psi\|^*_{\infty,0} < c_0$ and assumption (F), we know that $d(x, M) \leq \delta_0$, where $c_0$ and $\delta_0$ are the constants in (F). Define
\[
(19) \quad h(x) = \|\Psi(x)\|_2 = \sqrt{\Psi(x)^T\Psi(x)}
\]
to be the $L_2$ norm of $\Psi$. The derivative of $h(x)$,
\[
(20) \quad \nabla h(x) = \frac{\Psi(x)^T G_\Psi(x)}{\|\Psi(x)\|_2},
\]
is a vector in $\mathbb{R}^d$. Note that $G_\Psi(x) = \nabla\Psi(x) \in \mathbb{R}^{s \times d}$ is the Jacobian. For any point $x \in \tilde{M}$, we define a flow
\[
(21) \quad \phi_x : \mathbb{R} \mapsto \mathbb{R}^d
\]
such that
\[
(22) \quad \phi_x(0) = x, \quad \frac{\partial}{\partial t}\phi_x(t) = -\nabla h(\phi_x(t)).
\]
Later, we will prove in Theorem 7 that $\phi_x(\infty) \in M$ when $x \in M \oplus \delta_c$, where $\delta_c$ is defined in Theorem 7. By Theorem 3.39 in [42], $\phi_x(t)$ is uniquely defined since the gradient $\nabla h(x)$ is well-defined for all $x \notin M$. We define an arc-length flow (i.e., a constant-velocity flow) based on $\phi_x$:
\[
(23) \quad \gamma_x(0) = x, \quad \frac{\partial}{\partial t}\gamma_x(t) = -\frac{\nabla h(\gamma_x(t))}{\|\nabla h(\gamma_x(t))\|_2}.
\]
The time traveled along this flow equals the distance traveled (the velocity is a unit vector). Let $T_x = \inf\{t > 0 : \gamma_x(t) \in M\}$ be the terminal time point, so that $\gamma_x(T_x) \in M$ is the endpoint of the flow. This means that $T_x$ is the length of the flow from $x$ to its destination on $M$. The goal is to bound $T_x$, since this length must be greater than or equal to the projection distance for $x \in \tilde{M}$. We define $\xi_x(t) = h(\gamma_x(t)) - h(\gamma_x(T_x)) = h(\gamma_x(t))$. Differentiating $\xi_x(t)$ with respect to $t$ leads to
\[
(24) \quad \xi'_x(t) = \frac{d}{dt}h(\gamma_x(t)) = [\nabla h(\gamma_x(t))]^T\frac{d}{dt}\gamma_x(t) = -\|\nabla h(\gamma_x(t))\| = -\frac{\|\Psi(\gamma_x(t))^T G_\Psi(\gamma_x(t))\|_2}{\|\Psi(\gamma_x(t))\|_2} \leq -\sqrt{\lambda_{\min}(G_\Psi(\gamma_x(t))G_\Psi(\gamma_x(t))^T)} \leq -\lambda_0,
\]
because $\gamma_x(t) \in M \oplus \delta_0$ for all $t$. Let $\epsilon_0 = \|\Psi - \tilde{\Psi}\|^*_{\infty,0} = \sup_x\|\Psi(x) - \tilde{\Psi}(x)\|_{\max}$ and recall that $x \in \tilde{M}$, so $\tilde{\Psi}(x) = 0$. Then, by the fact that $\|v\|_2 \leq \sqrt{d}\,\|v\|_{\max}$ for a vector $v$,
\[
\sqrt{d}\,\epsilon_0 = \sqrt{d}\sup_x\|\Psi(x) - \tilde{\Psi}(x)\|_{\max} \geq \sup_x\|\Psi(x) - \tilde{\Psi}(x)\|_2 \geq \|\Psi(x) - \tilde{\Psi}(x)\|_2 \geq h(x) = h(\gamma_x(0)) - h(\gamma_x(T_x)) = \xi_x(0) - \xi_x(T_x) = -T_x\,\xi'_x(T^*_x) \geq T_x\lambda_0,
\]
where the second-to-last equality follows from the mean value theorem (using $h(\gamma_x(T_x)) = \xi_x(T_x) = 0$) and the final inequality follows from equation (24). Hence, $T_x \leq \frac{\sqrt{d}}{\lambda_0}\epsilon_0 = O(\epsilon_0)$, which is independent of $x$. This implies that
\[
\sup_{x \in \tilde{M}} d(x, M) \leq \frac{\sqrt{d}}{\lambda_0}\epsilon_0 = O(\|\tilde{\Psi} - \Psi\|^*_{\infty,0}).
\]

PROOF OF THEOREM 5.
1. Since condition (F) involves only $\Psi$ and its derivative, when $\|\Psi - \tilde{\Psi}\|^*_{\infty,2}$ is sufficiently small, (F) holds for $\tilde{\Psi}$.
2. By the first assertion, condition (F) holds for $\tilde{\Psi}$.
Thus, we can exchange $\tilde{M}$ and $M$ and repeat the proof of Proposition 4, which leads to $\sup_{x \in M} d(x, \tilde{M}) \leq \frac{\sqrt{d}}{\lambda_0}\epsilon_0$. As a result, we conclude that
\[
\mathsf{Haus}(\tilde{M}, M) \leq \frac{\sqrt{d}}{\lambda_0}\epsilon_0 = O(\epsilon_0).
\]
3. By Theorem 3, the reach of $M$ has the lower bound $\min\{\delta_0/2, \lambda_0/\|\Psi\|^*_{\infty,2}\}$. Note that $\delta_0, \lambda_0$ depend on the first derivative of $\Psi$. Hence, the difference between the lower bounds for the reach of $M$ and of $\tilde{M}$ will be bounded at rate $O(\|\Psi - \tilde{\Psi}\|^*_{\infty,2})$.

Before moving forward, we note that the gradient and higher derivatives of $f$ can be expressed as
\[
(25) \quad G_f(x) = \nabla f(x) = 2\Psi(x)^T\Lambda[\nabla\Psi(x)],
\]
\[
(26) \quad H_f(x) = \nabla\nabla f(x) = 2[\nabla\Psi(x)]^T\Lambda[\nabla\Psi(x)] + 2\Psi(x)\Lambda[\nabla\nabla\Psi(x)],
\]
\[
(27) \quad \nabla\nabla\nabla f(x) = 6[\nabla\Psi(x)]^T\Lambda[\nabla\nabla\Psi(x)] + 2\Psi(x)\Lambda[\nabla\nabla\nabla\Psi(x)],
\]
where $\nabla\Psi(x) \in \mathbb{R}^{s \times d}$ and $\nabla\nabla\Psi(x) \in \mathbb{R}^{s \times d \times d}$.

PROOF OF LEMMA 6.
Property 1 (for each $x \in M$).
1-(a). By equation (26) and the fact that $\Psi(x) = 0$ whenever $x \in M$, we obtain $H_f(x) = 2[\nabla\Psi(x)]^T\Lambda[\nabla\Psi(x)]$. Because $\Lambda$ is positive definite, it can be decomposed as $\Lambda = UDU^T$, where $D$ is a diagonal matrix, so the eigenvectors corresponding to the non-zero eigenvalues of $H_f$ span the same subspace as the row space of $[\nabla\Psi(x)]$. By Lemma 1, the eigenvectors with non-zero eigenvalues therefore span the normal space of $M$ at $x$.
1-(b). Because $H_f(x) = 2[\nabla\Psi(x)]^T\Lambda[\nabla\Psi(x)]$ when $x \in M$, the minimal non-zero eigenvalue is $\lambda_{\min,>0}(H_f(x)) = 2\lambda_{\min,>0}(G_\Psi(x)^T\Lambda G_\Psi(x))$. Since $\Lambda$ is positive definite and symmetric, we can decompose $G_\Psi(x)^T\Lambda G_\Psi(x) = G_\Psi(x)^T\Lambda^{1/2}\Lambda^{1/2}G_\Psi(x)$, so we obtain
\[
\lambda_{\min,>0}(H_f(x)) = 2\lambda_{\min,>0}(G_\Psi(x)^T\Lambda G_\Psi(x)) = 2\lambda_{\min}(\Lambda^{1/2}G_\Psi(x)G_\Psi(x)^T\Lambda^{1/2}) \geq 2\Lambda_{\min}\lambda_{\min}(G_\Psi(x)G_\Psi(x)^T) \geq 2\Lambda_{\min}\lambda_0^2.
\]
1-(c). Because the normal space of $M$ at $x$ is spanned by the rows of $G_\Psi(x) = \nabla\Psi(x)$, which by 1-(a) is spanned by the eigenvectors of $H_f(x)$ with non-zero eigenvalues, the result follows.
Property 2. Because $d(x, M) < \delta_c$, the point $x$ is within the reach of $M$; thus $x_M$, the projection of $x$ onto $M$, is unique. As a result, the normal space at $x_M$, $N_M(x)$, is well-defined. We can decompose
\[
\lambda_{\min,\perp,M}(H_f(x)) = \min_{v \in N_M(x)}\frac{v^T H_f(x)v}{\|v\|^2} = \min_{v \in N_M(x)}\frac{v^T(H_f(x_M) + H_f(x) - H_f(x_M))v}{\|v\|^2} \geq \min_{v \in N_M(x)}\frac{v^T H_f(x_M)v}{\|v\|^2} - \max_{v \in N_M(x)}\frac{v^T(H_f(x) - H_f(x_M))v}{\|v\|^2} \geq \min_{v \in N_M(x)}\frac{v^T H_f(x_M)v}{\|v\|^2} - d\,\|H_f(x) - H_f(x_M)\|_{\max} \geq 2\lambda_0^2\Lambda_{\min} - d\,\|H_f(x) - H_f(x_M)\|_{\max}.
\]
By equation (26),
\[
\|H_f(x) - H_f(x_M)\|_{\max} \leq 2\|G_\Psi(x)^T\Lambda G_\Psi(x) - G_\Psi(x_M)^T\Lambda G_\Psi(x_M)\|_{\max} + 2\|\Psi(x)\Lambda H_\Psi(x) - \Psi(x_M)\Lambda H_\Psi(x_M)\|_{\max} \leq 4\|\Psi\|^*_{\infty,1}\Lambda_{\max}\|\Psi\|^*_{\infty,2}\|x - x_M\| + 4\|\Psi\|^*_{\infty,2}\Lambda_{\max}\|\Psi\|^*_{\infty,3}\|x - x_M\| \leq 8\|\Psi\|^*_{\infty,2}\Lambda_{\max}\|\Psi\|^*_{\infty,3}\|x - x_M\|.
\]
Thus, as long as
\[
\|x - x_M\| = d(x, M) \leq \frac{\lambda_0^2\Lambda_{\min}}{8d\,\Lambda_{\max}\|\Psi\|^*_{\infty,2}\|\Psi\|^*_{\infty,3}},
\]
we have $\lambda_{\min,\perp,M}(H_f(x)) \geq \lambda_0^2\Lambda_{\min}$, which completes the proof.

PROOF OF THEOREM 7.
1. Convergence radius. We prove this by showing that for any $x \in M \oplus \delta_c$ with $x \notin M$, the destination $\pi_x(\infty) \in M$. The idea of the proof relies on two properties:
(P1) Any stationary point of $f$ inside $M \oplus \delta_c$ must be a point in $M$.
(P2) Let $x_M$ be the point on $M$ closest to $x$. For any point $x \in M \oplus \delta_c$, $(x - x_M)^T\nabla f(x) > 0$.
Namely, the gradient flow only moves $\pi_x(t)$ closer toward $M$. With the above two properties, it is easy to see that if we start a gradient flow $\pi_x$ from $x \in M \oplus \delta_c$, then by (P2) this flow must stay within $M \oplus \delta_c$. Because the stationary points within $M \oplus \delta_c$ are all in $M$ by (P1), and the destination of a gradient flow must be a stationary point, we conclude that $\pi_x(\infty) \in M$, which completes the proof of the convergence radius. In what follows, we show the two properties.

Property P1: any stationary point inside $M \oplus \delta_c$ must be a point in $M$. Because $\nabla f(x) = 2\Psi(x)^T\Lambda G_\Psi(x)$ and $\Lambda$ is positive definite, there are only two cases in which $\nabla f(x) = 0$: (i) $\Psi(x) = 0$, or (ii) the row space of $G_\Psi(x)$ has dimension less than $s$ (in fact, if $\Psi(x) \neq 0$, the second case is a necessary condition). The first case is the solution manifold $M$, so we only need to show that the second case cannot happen for $x \in M \oplus \delta_c$. The row space of $G_\Psi(x)$ has dimension less than $s$ when some singular value of $G_\Psi(x)$ is $0$, or, equivalently, $\lambda_{\min}(G_\Psi(x)G_\Psi(x)^T) = 0$. However, assumption (F) already rules this out within $M \oplus \delta_c$. Thus, this property holds.

Property P2: for any $x \in M \oplus \delta_c$, the directional gradient $(x - x_M)^T\nabla f(x) > 0$. By a Taylor expansion and Property 2 of Lemma 6,
\[
(28) \quad (x - x_M)^T\nabla f(x) = (x - x_M)^T(\nabla f(x) - \underbrace{\nabla f(x_M)}_{=0}) = (x - x_M)^T\int_{\epsilon=0}^{\epsilon=1} H_f(x_M + \epsilon(x - x_M))(x - x_M)\,d\epsilon \geq \|x - x_M\|^2\inf_{y \in M \oplus \delta_c}\lambda_{\min,\perp,M}(H_f(y)) \geq d(x, M)^2\lambda_0^2\Lambda_{\min} > 0.
\]

2. Terminal flow orientation. To study the gradient flow close to $M$, it suffices to analyze the behavior of the gradient close to $M$. Let $x \in M$ and let $u$ be a unit vector in the normal space of $M$ at $x$. By Lemma 1, $u$ belongs to the row space of $\nabla\Psi(x) = G_\Psi(x)$. Now we consider the gradient at $x + \epsilon u$ as $\epsilon \to 0$. By Taylor's theorem and the fact that $f$ has bounded third derivatives (from (D-3)),
\[
G_f(x + \epsilon u) \equiv \nabla f(x + \epsilon u) = \nabla f(x + \epsilon u) - \nabla f(x) = \epsilon H_f(x)u + O(\epsilon^2).
\]
Thus,
\[
\lim_{\epsilon \to 0}\frac{1}{\epsilon}G_f(x + \epsilon u) = H_f(x)u.
\]
By equation (26), $H_f(x) = 2G_\Psi(x)^T\Lambda G_\Psi(x) + 2\Psi(x)\Lambda H_\Psi(x) = 2G_\Psi(x)^T\Lambda G_\Psi(x)$ because $\Psi(x) = 0$ when $x \in M$. Using the fact that $G_\Psi(x)^T = [\nabla\Psi_1(x), \cdots, \nabla\Psi_s(x)]$, it is easy to see that
\[
H_f(x)u = \sum_{\ell=1}^s a_\ell\nabla\Psi_\ell(x), \quad a_\ell = 2e_\ell^T\Lambda G_\Psi(x)u,
\]
where $e_\ell = (0, \cdots, 0, 1, 0, \cdots, 0)^T \in \mathbb{R}^s$ is the coordinate vector pointing toward the $\ell$-th coordinate. Thus, by Lemma 1, $H_f(x)u$ belongs to the normal space of $M$ at $x$, which completes the proof of terminal orientation.

PROOF OF THEOREM 8. We prove this result using the idea of the Lyapunov-Perron method [61]. Recall that $A(z) = \{x : \pi_x(\infty) = z\}$ for $z \in M$ is the basin of attraction of the point $z$. Consider a ball $B(z, r)$ such that any gradient flow $\pi_x(t)$ that converges to $z = \pi_x(\infty)$ intersects one and only one point of the boundary $\partial B(z, r) = \{y : \|y - z\| = r\}$. This occurs when $r < \delta_c$ due to property (P2) in the proof of Theorem 7. Consider the gradient flow $\pi_x(t)$ with $x \in \partial B(z, r)$ and $\pi_x(\infty) = z$.
By Taylor's theorem, this flow solves the equation

(29) $\pi_x'(t) = -G_f(\pi_x(t)) = -G_f(\pi_x(t)) + \underbrace{G_f(\pi_x(\infty))}_{=0} = -H_f(\pi_x(\infty))(\pi_x(t) - \pi_x(\infty)) + \epsilon(\pi_x(t)),$

where $\|\epsilon(\pi_x(t))\| \leq C_0\|\pi_x(t) - \pi_x(\infty)\| \leq C_0 r$ for some finite constant $C_0$, due to Assumption (D-3). Equation (29) is a perturbed ODE with a fixed point $\pi_x(\infty)$, and by the variation of parameters its solution can be written as
$$\pi_x(t) - \pi_x(\infty) = e^{-tH_f(\pi_x(\infty))}(\pi_x(0) - \pi_x(\infty)) + \int_{s=0}^{s=t} e^{-(t-s)H_f(\pi_x(\infty))}\epsilon(\pi_x(s))\,ds.$$
Denoting $v_x = \pi_x(0) - \pi_x(\infty)$, we can rewrite the flow as
$$\pi_x(t) - \pi_x(\infty) = e^{-tH_f(\pi_x(\infty))}v_x + \int_{s=0}^{s=t} e^{-(t-s)H_f(\pi_x(\infty))}\epsilon(\pi_x(s))\,ds.$$
By Lemma 1, the normal space of $M$ at $z = \pi_x(\infty)$ is the row space of $G_\Psi(z) = \nabla\Psi(z)$, which is also the space spanned by the eigenvectors of $H_f(\pi_x(\infty))$ corresponding to non-zero eigenvalues (Lemma 6, 1-(a)). The spectral decomposition gives $H_f(\pi_x(\infty)) = \sum_{\ell=1}^s\lambda_\ell u_\ell u_\ell^T$, and we define the projection matrix onto the normal space of $M$ as $\Pi_N = \sum_{\ell=1}^s u_\ell u_\ell^T$ and the projection matrix onto the tangent space of $M$ as $\Pi_T = I_d - \Pi_N$. By construction, $\Pi_N H_f(\pi_x(\infty)) = H_f(\pi_x(\infty))$ and $\Pi_T H_f(\pi_x(\infty)) = 0$, so $\Pi_N e^{-tH_f(\pi_x(\infty))} = e^{-tH_f(\pi_x(\infty))}\Pi_N$ and $\Pi_T e^{-tH_f(\pi_x(\infty))} = \Pi_T$. We decompose

(30) $\pi_x(t) - \pi_x(\infty) = \Pi_T(\pi_x(t) - \pi_x(\infty)) + \Pi_N(\pi_x(t) - \pi_x(\infty))$
$$= \Pi_T e^{-tH_f(\pi_x(\infty))}v_x + \Pi_T\int_{s=0}^{s=t} e^{-(t-s)H_f(\pi_x(\infty))}\epsilon(\pi_x(s))\,ds + \Pi_N e^{-tH_f(\pi_x(\infty))}v_x + \Pi_N\int_{s=0}^{s=t} e^{-(t-s)H_f(\pi_x(\infty))}\epsilon(\pi_x(s))\,ds$$
$$= v_{x,T} + \int_{s=0}^{s=t}\epsilon_T(\pi_x(s))\,ds + e^{-tH_f(\pi_x(\infty))}v_{x,N} + \int_{s=0}^{s=t} e^{-(t-s)H_f(\pi_x(\infty))}\epsilon_N(\pi_x(s))\,ds,$$
where $v_{x,T} = \Pi_T v_x$, $v_{x,N} = \Pi_N v_x$, $\epsilon_T(\pi_x(s)) = \Pi_T\epsilon(\pi_x(s))$, and $\epsilon_N(\pi_x(s)) = \Pi_N\epsilon(\pi_x(s))$. In the tangent direction, when $t\to\infty$,
$$0 = \lim_{t\to\infty}\Pi_T(\pi_x(t) - \pi_x(\infty)) = \lim_{t\to\infty}\Pi_T e^{-tH_f(\pi_x(\infty))}v_x + \lim_{t\to\infty}\Pi_T\int_{s=0}^{s=t} e^{-(t-s)H_f(\pi_x(\infty))}\epsilon(\pi_x(s))\,ds = v_{x,T} + \int_{s=0}^{s=\infty}\epsilon_T(\pi_x(s))\,ds.$$
Thus,

(31) $v_{x,T} = -\int_{s=0}^{s=\infty}\epsilon_T(\pi_x(s))\,ds,$

and equation (30) can be rewritten as

(32) $\pi_x(t) - \pi_x(\infty) = -\int_{s=0}^{s=\infty}\epsilon_T(\pi_x(s))\,ds + \int_{s=0}^{s=t}\epsilon_T(\pi_x(s))\,ds + e^{-tH_f(\pi_x(\infty))}v_{x,N} + \int_{s=0}^{s=t} e^{-(t-s)H_f(\pi_x(\infty))}\epsilon_N(\pi_x(s))\,ds$
$$= e^{-tH_f(\pi_x(\infty))}v_{x,N} + \int_{s=0}^{s=t} e^{-(t-s)H_f(\pi_x(\infty))}\epsilon_N(\pi_x(s))\,ds - \int_{s=t}^{s=\infty}\epsilon_T(\pi_x(s))\,ds.$$
The latter two terms, both integrals, are determined entirely by the Taylor remainder $\epsilon(\pi_x(t))$. Thus, to uniquely determine a point on a gradient flow $\pi_x(t)$ that converges to $z$ (and lies inside $B(z,r)$), we only need to specify the time $t$ and the vector $v_{x,N}$, which belongs to the normal space of $M$ at $z$ with $\|v_{x,N}\| = r$. Namely, by equation (32) there exists a mapping $\Omega$ such that $\pi_x(t) = \Omega(t, v_{x,N})$ for all $\pi_x(t)$ with $\|x - z\| = r$. Note that equation (32) implies that the mapping $\Omega$ has bounded derivatives with respect to both $t$ and $v_{x,N}$.
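The projection identities used in (30) can be checked concretely. The sketch below is our own illustration, assuming the hypothetical sphere generator with $\Lambda = I$ and $z = e_1$, so that $H_f(z) = 2G_\Psi(z)^T G_\Psi(z) = 8zz^T$: the tangent component of $e^{-tH_f(z)}$ stays put, while the normal component decays at the rate of the non-zero eigenvalue.

```python
import numpy as np
from scipy.linalg import expm   # matrix exponential

# At z on M, with Lambda = I and G_Psi(z) = 2 z^T, the Hessian is
# H_f(z) = 8 z z^T: rank s = 1, and the normal space at z is spanned by z.
z = np.array([1.0, 0.0, 0.0])
H = 8.0 * np.outer(z, z)

Pi_N = np.outer(z, z)           # projection onto the normal space at z
Pi_T = np.eye(3) - Pi_N         # projection onto the tangent space at z

t = 2.0
E = expm(-t * H)
print(np.allclose(Pi_T @ E, Pi_T))                    # Pi_T e^{-tH} = Pi_T
print(np.allclose(Pi_N @ E, E @ Pi_N))                # Pi_N commutes with e^{-tH}
print(np.allclose(Pi_N @ E, np.exp(-8 * t) * Pi_N))   # normal part decays as e^{-8t}
```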
By the mapping $\Omega$, the set
$$A(z)\cap B(z,r) = \left\{\pi_x(t) = \Omega(t, v_{x,N}) : t\in[0,\infty),\ v_{x,N} = \sum_{\ell=1}^s a_\ell u_\ell,\ \sum_{\ell=1}^s a_\ell^2 = r^2\right\}$$
is parameterized by $(t, a_1,\cdots,a_s)$ with the constraint $\sum_{\ell=1}^s a_\ell^2 = r^2$, so it is an $s$-dimensional manifold. To generalize this to the entire set $A(z)$, note that every gradient flow ending at $z$ must pass through the boundary $\partial B(z,r)$, so allowing the flow $\pi_x(t)$ to run toward $t\to-\infty$ covers the entire basin, i.e.,
$$A(z) = \left\{\pi_x(t) = \Omega(t, v_{x,N}) : t\in\mathbb{R},\ v_{x,N} = \sum_{\ell=1}^s a_\ell u_\ell,\ \sum_{\ell=1}^s a_\ell^2 = r^2\right\}.$$
This implies that $A(z)$ is parameterized by $(t, a_1,\cdots,a_s)$ with the constraint $\sum_{\ell=1}^s a_\ell^2 = r^2$, so again it is an $s$-dimensional manifold.

Proof of Theorem 9. Convergence of $f(x_t)$. Because $x_{t+1} = x_t - \gamma G_f(x_t)$, a simple Taylor expansion shows that
$$f(x_t) - f(x_{t+1}) = f(x_t) - f(x_t - \gamma G_f(x_t)) = \gamma\|G_f(x_t)\|^2 - \frac{1}{2}\gamma^2\int_{\epsilon=0}^{\epsilon=1} G_f(x_t)^T H_f(x_t - \epsilon\gamma G_f(x_t))G_f(x_t)\,d\epsilon$$
$$\geq \gamma\|G_f(x_t)\|^2 - \frac{1}{2}\gamma^2\|G_f(x_t)\|^2\sup_z\|H_f(z)\|_2.$$
Note that one can also use the fact that the gradient $G_f$ is Lipschitz to obtain a similar bound. Thus, when $\gamma < \frac{2}{\sup_z\|H_f(z)\|_2}$, we obtain $f(x_t) - f(x_{t+1}) > 0$, which implies $f(x_{t+1}) < f(x_t)$; i.e., the objective function is decreasing. We can summarize the result as

(33) $f(x_{t+1}) \leq f(x_t) - \gamma\|G_f(x_t)\|^2\left(1 - \frac{1}{2}\gamma\sup_z\|H_f(z)\|_2\right).$

To obtain the algorithmic convergence rate, we need to relate the objective function $f(x)$ to the squared gradient $\|G_f(x)\|^2$. We focus on the case $t = 0$ and investigate

(34) $f(x_1) \leq f(x_0) - \gamma\|G_f(x_0)\|^2\left(1 - \frac{1}{2}\gamma\sup_z\|H_f(z)\|_2\right).$

Because $d(x_0, M) \leq \delta_c \leq \mathsf{reach}(M)$, there is a unique projection $x_M\in M$ of $x_0$; note that $d(x_0, M) = \|x_0 - x_M\|$. The gradient has a lower bound from the following Taylor expansion:
$$\|G_f(x_0)\| = \|G_f(x_0) - \underbrace{G_f(x_M)}_{=0}\| = \left\|\int_{\epsilon=0}^{\epsilon=1} H_f(x_M + \epsilon(x_0 - x_M))(x_0 - x_M)\,d\epsilon\right\|$$
$$\geq \|x_0 - x_M\|\inf_{\epsilon\in[0,1]}\lambda_{\min,\perp}(H_f(x_M + \epsilon(x_0 - x_M))) \geq \|x_0 - x_M\|\lambda_0^2\Lambda_{\min} = d(x_0, M)\lambda_0^2\Lambda_{\min},$$
where the second-to-last inequality follows from property 2 of Lemma 6. Thus,

(35) $\|G_f(x_0)\|^2 \geq d(x_0, M)^2\lambda_0^4\Lambda_{\min}^2.$

The distance $d(x_0, M)$ and the objective $f(x_0)$ can also be related through another Taylor expansion:
$$f(x_0) = f(x_0) - f(x_M) = (x_0 - x_M)^T\underbrace{G_f(x_M)}_{=0} + \frac{1}{2}(x_0 - x_M)^T\int_{\epsilon=0}^{\epsilon=1} H_f(x_M + \epsilon(x_0 - x_M))\,d\epsilon\,(x_0 - x_M) \leq \frac{1}{2}d^2(x_0, M)\sup_z\|H_f(z)\|_2.$$
Thus, $d^2(x_0, M) \geq \frac{2f(x_0)}{\sup_z\|H_f(z)\|_2}$, which improves equation (35) to

(36) $\|G_f(x_0)\|^2 \geq d(x_0, M)^2\lambda_0^4\Lambda_{\min}^2 \geq \frac{\lambda_0^4\Lambda_{\min}^2}{\sup_z\|H_f(z)\|_2}\,2f(x_0).$

Inserting equation (36) into equation (34), we obtain
$$f(x_1) \leq f(x_0) - \gamma\|G_f(x_0)\|^2\left(1 - \frac{1}{2}\gamma\sup_z\|H_f(z)\|_2\right) \leq f(x_0) - \gamma\left(1 - \frac{1}{2}\gamma\sup_z\|H_f(z)\|_2\right)\frac{\lambda_0^4\Lambda_{\min}^2}{\sup_z\|H_f(z)\|_2}\,2f(x_0)$$
$$= f(x_0)\left(1 - 2\gamma\left(1 - \frac{1}{2}\gamma\sup_z\|H_f(z)\|_2\right)\frac{\lambda_0^4\Lambda_{\min}^2}{\sup_z\|H_f(z)\|_2}\right).$$
When $\gamma < \frac{1}{\sup_z\|H_f(z)\|_2}$, the above inequality simplifies to
$$f(x_1) \leq f(x_0)\cdot\left(1 - \gamma\frac{\lambda_0^4\Lambda_{\min}^2}{\sup_z\|H_f(z)\|_2}\right).$$
Thus, we have proved the result for $t = 0$. The same derivation works for any other $t$ (by treating $x_t$ as $x_0$).
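The per-step contraction derived above is easy to observe numerically. The sketch below is our own illustration on the hypothetical sphere objective (it is not the Monte Carlo gradient descent algorithm of the main text); near $M$ one has $\sup_z\|H_f(z)\|_2\approx 8$ in this example, so any step size $\gamma < 1/8$ should yield per-step ratios $f(x_{t+1})/f(x_t)$ strictly below one.

```python
import numpy as np

# Gradient descent x_{t+1} = x_t - gamma * G_f(x_t) on f(x) = (x.x - 1)^2.
# The theorem bounds the per-step ratio by 1 - gamma * lambda0^4 * Lambda_min^2 / sup ||H_f||_2.
def f(x):   return (x @ x - 1.0) ** 2
def G_f(x): return 4.0 * (x @ x - 1.0) * x

gamma = 0.01                        # step size below 1 / sup ||H_f(z)||_2
x = np.array([0.9, 0.3, 0.1])       # start within the convergence radius of M
ratios = []
for _ in range(50):
    x_new = x - gamma * G_f(x)
    ratios.append(f(x_new) / f(x))  # per-step contraction factor
    x = x_new
print(max(ratios) < 1.0)            # every step strictly decreases f
```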
By telescoping, we conclude that
$$f(x_t) \leq f(x_0)\cdot\left(1 - \gamma\frac{\lambda_0^4\Lambda_{\min}^2}{\sup_z\|H_f(z)\|_2}\right)^t.$$
Finally, using the fact that $\sup_z\|H_f(z)\|_2 \leq d\Lambda_{\max}\|\Psi\|^*_{\infty,2}$, we obtain the desired bound.

Convergence of $d(x_t, M)$. Let $x_{t,M}\in M$ be the point on the manifold that is closest to $x_t$; again, by the reach condition this projection is unique. A Taylor expansion together with property 2 of Lemma 6 shows that
$$-f(x_t) = f(x_{t,M}) - f(x_t) = (x_{t,M} - x_t)^T G_f(x_t) + \frac{1}{2}(x_{t,M} - x_t)^T\int_{\epsilon=0}^{\epsilon=1} H_f(x_t + \epsilon(x_{t,M} - x_t))(x_{t,M} - x_t)\,d\epsilon$$
$$\geq (x_{t,M} - x_t)^T G_f(x_t) + \frac{1}{2}\|x_t - x_{t,M}\|^2\lambda_0^2\Lambda_{\min}.$$
Thus,

(37) $-f(x_t) - \frac{1}{2}\|x_{t,M} - x_t\|^2\lambda_0^2\Lambda_{\min} \geq -(x_t - x_{t,M})^T G_f(x_t).$

Because of equation (33) and the fact that $\sup_z\|H_f(z)\|_2 \leq d\Lambda_{\max}\|\Psi\|^*_{\infty,2}$, we have
$$f\left(x - \frac{1}{d\Lambda_{\max}\|\Psi\|^*_{\infty,2}}G_f(x)\right) - f(x) \leq -\frac{1}{2d\Lambda_{\max}\|\Psi\|^*_{\infty,2}}\|G_f(x)\|^2.$$
Using the fact that $f\left(x - \frac{1}{d\Lambda_{\max}\|\Psi\|^*_{\infty,2}}G_f(x)\right) \geq 0$, we conclude that $\frac{1}{2d\Lambda_{\max}\|\Psi\|^*_{\infty,2}}\|G_f(x)\|^2 \leq f(x)$, which implies

(38) $\|G_f(x)\|^2 \leq 2d\Lambda_{\max}\|\Psi\|^*_{\infty,2}\,f(x).$

For any $t$, we have
$$d(x_{t+1}, M)^2 \leq \|x_{t+1} - x_{t,M}\|^2 = \|x_t - x_{t,M} - \gamma G_f(x_t)\|^2 = \|x_t - x_{t,M}\|^2 - 2\gamma(x_t - x_{t,M})^T G_f(x_t) + \gamma^2\|G_f(x_t)\|^2$$
$$\overset{(37)}{\leq} \|x_t - x_{t,M}\|^2(1 - \gamma\lambda_0^2\Lambda_{\min}) - 2\gamma f(x_t) + \gamma^2\|G_f(x_t)\|^2 \overset{(38)}{\leq} \|x_t - x_{t,M}\|^2(1 - \gamma\lambda_0^2\Lambda_{\min}) - 2\gamma f(x_t)\underbrace{\left(1 - d\gamma\Lambda_{\max}\|\Psi\|^*_{\infty,2}\right)}_{\geq 0}$$
$$\leq \|x_t - x_{t,M}\|^2(1 - \gamma\lambda_0^2\Lambda_{\min}) = d(x_t, M)^2(1 - \gamma\lambda_0^2\Lambda_{\min})$$
whenever $\gamma < \frac{1}{d\Lambda_{\max}\|\Psi\|^*_{\infty,2}}$. By telescoping, the result follows.

REFERENCES

[1] Aamari, E., Kim, J., Chazal, F., Michel, B., Rinaldo, A. and Wasserman, L. (2019). Estimating the reach of a manifold. Electronic Journal of Statistics 13 1359–1399.
[2] Aamari, E. and Levrard, C. (2019). Nonasymptotic rates for manifold, tangent space and curvature estimation. The Annals of Statistics 47 177–204.
[3] Arias-Castro, E., Mason, D. and Pelletier, B. (2016). On the estimation of the gradient lines of a density and the consistency of the mean-shift algorithm. Journal of Machine Learning Research 17 1–28.
[4] Balakrishnan, S., Wainwright, M. J. and Yu, B. (2017). Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics 45 77–120.
[5] Banyaga, A. and Hurtubise, D. (2013). Lectures on Morse Homology 29. Springer Science & Business Media.
[6] Baudry, J.-P., Raftery, A. E., Celeux, G., Lo, K. and Gottardo, R. (2010). Combining mixture components for clustering. Journal of Computational and Graphical Statistics 19 332–353.
[7] Bogachev, V. I. (2007). Measure Theory 1. Springer Science & Business Media.
[8] Boissonnat, J.-D. and Ghosh, A. (2014). Manifold reconstruction using tangential Delaunay complexes. Discrete & Computational Geometry 51 221–267.
[9] Bowman, A. W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika 71 353–360.
[10] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
[11] Brinkman, R. R., Gasparetto, M., Lee, S.-J. J., Ribickas, A. J., Perkins, J.,
Janssen, W., Smiley, R. and Smith, C. (2007). High-content flow cytometry and temporal data analysis for defining a cellular signature of graft-versus-host disease. Biology of Blood and Marrow Transplantation 13 691–700.
[12] Cadre, B. (2006). Kernel estimation of density level sets. Journal of Multivariate Analysis 97 999–1023.
[13] Chacón, J. E. (2015). A population background for nonparametric density-based clustering. Statistical Science 30 518–532.
[14] Chauveau, D. and Hunter, D. (2013). ECM and MM algorithms for normal mixtures with constrained parameters.
[15] Chazal, F. and Lieutier, A. (2005). The "λ-medial axis". Graphical Models 67 304–331.
[16] Chazal, F. and Lieutier, A. (2008). Smooth manifold reconstruction from noisy and non-uniform approximation with guarantees. Computational Geometry 40 156–170.
[17] Chen, X., Christensen, T. M. and Tamer, E. (2018). Monte Carlo confidence sets for identified sets. Econometrica 86 1965–2018.
[18] Chen, Y.-C., Genovese, C. R. and Wasserman, L. (2015). Asymptotic theory for density ridges. The Annals of Statistics 43 1896–1928.
[19] Chen, Y.-C., Genovese, C. R. and Wasserman, L. (2016). A comprehensive approach to mode clustering. Electronic Journal of Statistics 10 210–241.
[20] Chen, Y.-C., Genovese, C. R. and Wasserman, L. (2017). Density level sets: Asymptotics, inference, and visualization. Journal of the American Statistical Association 112 1684–1696.
[21] Chen, Y.-C., Genovese, C. R. and Wasserman, L. (2017). Statistical inference using the Morse-Smale complex. Electronic Journal of Statistics 11 1390–1433.
[22] Cheng, G., Chen, Y.-C. et al. (2019). Nonparametric inference via bootstrapping the debiased estimator. Electronic Journal of Statistics 13 2194–2256.
[23] Cheng, S.-W., Dey, T. K. and Ramos, E. A. (2005). Manifold reconstruction from point samples. In SODA 5 1018–1027.
[24] Cheng, S.-W., Funke, S., Golin, M., Kumar, P., Poon, S.-H. and Ramos, E. (2005). Curve reconstruction from noisy samples. Computational Geometry 31.
[25] Chernozhukov, V., Hong, H. and Tamer, E. (2007). Estimation and confidence regions for parameter sets in econometric models. Econometrica 75 1243–1284.
[26] Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data 5 153–163.
[27] Corbett-Davies, S., Pierson, E., Feller, A., Goel, S. and Huq, A. (2017). Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 797–806.
[28] Cuevas, A. (2009). Set estimation: Another bridge between statistics and geometry. Bol. Estad. Investig. Oper. 25 71–85.
[29] Dennis Jr., J. E. and Schnabel, R. B. (1996). Numerical Methods for Unconstrained Optimization and Nonlinear Equations. SIAM.
[30] Dey, T. K. (2006). Curve and Surface Reconstruction: Algorithms with Mathematical Analysis 23. Cambridge University Press.
[31] Dey, T. K. and Goswami, S. (2006). Provable surface reconstruction from noisy samples. Computational Geometry 35 124–141.
[32] Drton, M. and Sullivant, S. (2007). Algebraic statistical models. Statistica Sinica 1273–1297.
[33] Federer, H. (1959). Curvature measures. Transactions of the American Mathematical Society 93 418–491.
[34] Garrity, T. A. (2001). All the Mathematics You Missed: But Need to Know for Graduate School.
[35] Genovese, C. R., Perone-Pacifico, M., Verdinelli, I. and Wasserman, L. (2014). Nonparametric ridge estimation. The Annals of Statistics 42 1511–1545.
[36] Gibilisco, P., Riccomagno, E., Rogantin, M. P. and Wynn, H. P. (2010). Algebraic and Geometric Methods in Statistics. Cambridge University Press.
[37] Grove, K. and Karcher, H. (1973). How to conjugate C¹-close group actions. Mathematische Zeitschrift 132 11–20.
[38] Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society 1029–1054.
[39] Hansen, L. P. and Singleton, K. J. (1982). Generalized instrumental variables estimation of nonlinear rational expectations models. Econometrica: Journal of the Econometric Society 1269–1286.
[40] Hardt, M., Price, E. and Srebro, N. (2016). Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems 3315–3323.
[41] Horn, R. A. and Johnson, C. R. (2012). Matrix Analysis. Cambridge University Press.
[42] Irwin, M. (1980). Smooth Dynamical Systems. Academic Press.
[43] Kim, J., Chen, Y.-C., Balakrishnan, S., Rinaldo, A. and Wasserman, L. (2016). Statistical inference for cluster trees. In Advances in Neural Information Processing Systems 1839–1847.
[44] Laloe, T. and Servien, R. (2013). Nonparametric estimation of regression level sets. Journal of the Korean Statistical Society.
[45] Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73 13–22.
[46] Lindsay, B. G. (1995). Mixture models: theory, geometry and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics i–163. JSTOR.
[47] Little, R. J. and Rubin, D. B. (2019). Statistical Analysis with Missing Data 793. John Wiley & Sons.
[48] Mammen, E. and Polonik, W. (2013). Confidence regions for level sets. Journal of Multivariate Analysis 122 202–214.
[49] Manski, C. F. (1990). Nonparametric bounds on treatment effects. The American Economic Review 80 319.
[50] Manski, C. F. (1999). Identification Problems in the Social Sciences. Harvard University Press.
[51] Manski, C. F. (2003). Partial Identification of Probability Distributions. Springer Science & Business Media.
[52] Mattila, P. (1999). Geometry of Sets and Measures in Euclidean Spaces: Fractals and Rectifiability 44. Cambridge University Press.
[53] McGehee, R. (1973). A stable manifold theorem for degenerate fixed points with applications to celestial mechanics. Journal of Differential Equations 14 70–88.
[54] McGehee, R. and Sander, E. (1996). A new proof of the stable manifold theorem. Zeitschrift für angewandte Mathematik und Physik ZAMP 47 497–513.
[55] Michałek, M., Sturmfels, B., Uhler, C. and Zwiernik, P. (2016). Exponential varieties.
Proceedings of the London Mathematical Society 112 27–56.
[56] Molchanov, I. (1991). Empirical estimation of distribution quantiles of random closed sets. Theory of Probability & Its Applications 35 594–600.
[57] Molchanov, I. S. (1998). A limit theorem for solutions of inequalities. Scandinavian Journal of Statistics 25 235–242.
[58] Nesterov, Y. (2018). Lectures on Convex Optimization 137. Springer.
[59] Niyogi, P., Smale, S. and Weinberger, S. (2008). Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry 39 419–441.
[60] Pateiro López, B. (2008). Set estimation under convexity type restrictions. Univ. Santiago de Compostela.
[61] Perko, L. (2001). Differential Equations and Dynamical Systems 7. Springer Science & Business Media.
[62] Preiss, D. (1987). Geometry of measures in R^n: distribution, rectifiability, and densities. Annals of Mathematics 125 537–643.
[63] Qiao, W. and Polonik, W. (2019). Nonparametric confidence regions for level sets: Statistical properties and geometry. Electronic Journal of Statistics 13 985–1030.
[64] Rheinboldt, W. C. (1988). On the computation of multi-dimensional solution manifolds of parametrized equations. Numerische Mathematik.
[65] Rice, J. R. (1967). Nonlinear approximation. II. Curvature in Minkowski geometry and local uniqueness. Transactions of the American Mathematical Society 128 437–459.
[66] Rinaldo, A., Singh, A., Nugent, R. and Wasserman, L. (2012). Stability of density-based clustering. The Journal of Machine Learning Research 13 905–948.
[67] Rinaldo, A. and Wasserman, L. (2010). Generalized density clustering. The Annals of Statistics 38 2678–2722.
[68] Romano, J. P. (1988). Bootstrapping the mode. Annals of the Institute of Statistical Mathematics 40 565–586.
[69] Romano, J. P. (1988). On weak convergence and optimality of kernel density estimates of the mode. The Annals of Statistics 629–647.
[70] Romano, J. P. and Shaikh, A. M. (2010). Inference for the identified set in partially identified econometric models. Econometrica 78 169–211.
[71] Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scandinavian Journal of Statistics 65–78.
[72] Rudin, W. (1964). Principles of Mathematical Analysis 3. McGraw-Hill, New York.
[73] Sheather, S. J. (2004). Density estimation. Statistical Science 19 588–597.
[74] Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis 26. CRC Press.
[75] Tamer, E. (2010). Partial identification in econometrics. Annu. Rev. Econ. 2 167–195.
[76] Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. The Annals of Statistics 25 948–969.
[77] van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.
[78] Vieu, P. (1996). A note on density mode estimation. Statistics & Probability Letters 26 297–307.
[79] Walther, G. (1997). Granulometric smoothing. The Annals of Statistics 25 2273–2299.
