An axiomatic approach to intrinsic dimension of a dataset

Reading time: 6 minutes
...

📝 Original Info

  • Title: An axiomatic approach to intrinsic dimension of a dataset
  • ArXiv ID: 0712.2063
  • Date: 2009-11-17
  • Authors: Researchers from original ArXiv paper

📝 Abstract

We perform a deeper analysis of an axiomatic approach to the concept of intrinsic dimension of a dataset proposed by us in the IJCNN'07 paper (arXiv:cs/0703125). The main features of our approach are that a high intrinsic dimension of a dataset reflects the presence of the curse of dimensionality (in a certain mathematically precise sense), and that the dimension of a discrete i.i.d. sample of a low-dimensional manifold is, with high probability, close to that of the manifold. At the same time, the intrinsic dimension of a sample is easily corrupted by moderate high-dimensional noise (of the same amplitude as the size of the manifold) and suffers from prohibitively high computational complexity (computing it is an $NP$-complete problem). We outline a possible way to overcome these difficulties.

📄 Full Content

An often-held opinion on intrinsic dimensionality of data sampled from submanifolds of the Euclidean space is expressed in [10] thus: "...the goal of estimating the dimension of a submanifold is a well-defined mathematical problem. Indeed all the notions of dimensionality like e.g. topological, Hausdorff, or correlation dimension agree for submanifolds in $R^d$."

We will argue that it may be useful to have at one’s disposal a concept of intrinsic dimension of data which behaves in a different fashion from the more traditional concepts.

Our approach is shaped by the following five goals.

  1. We want a high value of intrinsic dimension to be indicative of the presence of the curse of dimensionality.

  2. The concept should make no distinction between continuous and discrete objects, and the intrinsic dimension of a discrete sample should be close to that of the underlying manifold.

  3. The intrinsic dimension should agree with our geometric intuition and return standard values for familiar objects such as Euclidean spheres or Hamming cubes.

  4. We want the concept to be insensitive to high-dimensional random noise of moderate amplitude (on the same order of magnitude as the size of the manifold).

  5. Finally, in order to be useful, the intrinsic dimension should be computationally feasible.

For the moment, we have managed to attain goals (1), (2), and (3), while (4) and (5) are not met. However, it appears that in both cases the problem is the same, and we outline a promising way to address it.

Among the existing approaches to intrinsic dimension, that of [5] comes closest to meeting goals (2), (3), (5), and to some extent (1); cf. the discussion in [18]. (Lemma 1 in [11] seems to imply that (4) does not hold for moderate noise with $E\|x\| = O(1)$, i.e., $\sigma^2 \sim 1/d$.)

We work in a setting of metric spaces with measure (mm-spaces), i.e., triples (X, d, µ) consisting of a set, X, furnished with a distance, d, satisfying the axioms of a metric, and a probability measure, µ. This concept is broad enough to include submanifolds of $R^n$ (equipped with the induced, or Minkowski, measure, or with some other probability distribution), as well as data samples themselves (with their empirical, that is, normalized counting, measure). In Section 2, we describe this setting and discuss in some detail the phenomenon of concentration of measure on high-dimensional structures, presenting it from a number of different viewpoints, including an approach via soft margin classification.
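As a concrete illustration of this setting, a finite data sample together with its normalized counting measure is itself an mm-space. The sketch below is ours, not from the paper, and the class and function names are hypothetical; it merely encodes the triple (X, d, µ) for a Euclidean sample.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FiniteMMSpace:
    """A finite metric space with measure: pairwise distances and point weights."""
    dist: np.ndarray     # (n, n) symmetric matrix of pairwise distances (the metric d)
    measure: np.ndarray  # (n,) probability weights summing to 1 (the measure mu)

def empirical_mm_space(points: np.ndarray) -> FiniteMMSpace:
    """Build the mm-space of a data sample with its empirical (counting) measure."""
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)   # induced Euclidean metric
    measure = np.full(n, 1.0 / n)          # normalized counting measure
    return FiniteMMSpace(dist=dist, measure=measure)

sample = np.random.default_rng(0).normal(size=(100, 3))
X = empirical_mm_space(sample)
assert np.isclose(X.measure.sum(), 1.0)
assert np.allclose(X.dist, X.dist.T)
```

The same container accommodates both discrete samples and discretizations of manifolds, which is what goal (2) above requires of the dimension function.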

The curse of dimensionality is understood as a geometric property of mm-spaces whereby features (1-Lipschitz, or non-expanding, functions) sharply concentrate near their means and become non-discriminating. This way, the curse of dimensionality is equated with the phenomenon of concentration of measure on high-dimensional structures [15,9], and can be dealt with in a precise mathematical fashion, adopting (1) as an axiom.
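The concentration phenomenon is easy to see numerically. The sketch below is our illustration, not the paper's: the coordinate function $f(x) = x_1$ is 1-Lipschitz on the unit sphere $S^{d-1}$, and its spread around its mean (zero) shrinks like $1/\sqrt{d}$, so for large $d$ the feature stops discriminating between points.

```python
import numpy as np

def lipschitz_spread(d: int, n: int = 20000, seed: int = 0) -> float:
    """Std of the 1-Lipschitz feature f(x) = x[0] over uniform points on S^{d-1}."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # normalize to the unit sphere
    return x[:, 0].std()

# The spread decays like 1/sqrt(d): the feature concentrates near its mean.
for d in (3, 30, 300, 3000):
    print(d, round(lipschitz_spread(d), 4))
```

In the paper's terms, a dataset exhibiting this behavior for all 1-Lipschitz features would be assigned a high intrinsic dimension.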

The intrinsic dimension, ∂, is defined for mm-spaces in an axiomatic way in Section 4, following [18].

To deal with goal (2), we resort to the notion of a distance, $d_{\mathrm{conc}}(X, Y)$, between two mm-spaces, X and Y, measuring their similarity [9]. This forms the subject of Section 3. Our second axiom says that if two mm-spaces are close to each other in the above distance, then their intrinsic dimension values are also close. In this article, we show that if a dataset X is sampled with regard to a probability measure µ on a manifold M, then, with high confidence, the distance between X and M is small, and so ∂(M) and ∂(X) are close to each other.

Goal (3) can be made into an axiom in a more or less straightforward way. We give a new example of a dimension function ∂ satisfying our axioms.

We show that a low-dimensional manifold M and its corruption by high-dimensional Gaussian noise of moderate amplitude are close in the Gromov distance. However, this property does not carry over to samples unless their size is exponential in the dimension of $R^d$ (an unrealistic assumption), and thus our approach suffers from high sensitivity to noise (Section 6). Another drawback is computational complexity: we show that computing the intrinsic dimension of a finite sample is an $NP$-complete problem (Section 5).
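The "moderate amplitude" noise regime can be made concrete with a small numerical sketch (ours, not from the paper): per-coordinate Gaussian noise with $\sigma^2 = 1/d$ is tiny in any single coordinate, yet its total Euclidean norm concentrates near 1, i.e., on the same order as the size of the manifold.

```python
import numpy as np

def noisy_circle_sample(n: int, d: int, seed: int = 0):
    """Sample the unit circle embedded in R^d, then add Gaussian noise with
    per-coordinate variance 1/d, so that E||noise|| = O(1) -- an illustrative
    instance of the 'moderate amplitude' regime, not the paper's exact setup."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0, 2 * np.pi, size=n)
    pts = np.zeros((n, d))
    pts[:, 0], pts[:, 1] = np.cos(t), np.sin(t)
    noise = rng.normal(scale=1.0 / np.sqrt(d), size=(n, d))
    return pts, pts + noise

clean, noisy = noisy_circle_sample(n=500, d=1000)
# Each coordinate is barely perturbed, but the total displacement is ~1,
# comparable to the circle's radius.
print((noisy - clean).std())                          # ~0.03 per coordinate
print(np.linalg.norm(noisy - clean, axis=1).mean())   # ~1.0 overall
```

This is exactly why a finite sample of the noisy manifold looks high-dimensional unless it is exponentially large: pointwise, every sample point has been pushed a unit distance off the manifold.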

However, we believe that the underlying cause of both problems is the same: allowing arbitrary non-expanding functions as features is clearly too generous. Restricting the class of features to that of low-complexity functions whose capacity is manageable and rewriting the entire theory in this setting opens up a possibility to use statistical learning theory and offers a promising way to solve both problems, which we discuss in Conclusion.

As in [18], we model datasets within the framework of spaces with metric and measure (mm-spaces). An mm-space is a triple (X, d, µ), consisting of a (finite or infinite) set X, a metric d on X, and a probability measure µ defined on the family B of all Borel subsets o

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
