Generative model evaluation commonly relies on high-dimensional embedding spaces to compute distances between samples. We show that dataset representations in these spaces are affected by the hubness phenomenon, which distorts nearest neighbor relationships and biases distance-based metrics. Building on the classical Iterative Contextual Dissimilarity Measure (ICDM), we introduce Generative ICDM (GICDM), a method that corrects neighborhood estimation for both real and generated data, together with a multi-scale extension that improves its empirical behavior. Extensive experiments on synthetic and real benchmarks demonstrate that GICDM resolves hubness-induced failures, restores reliable metric behavior, and improves alignment with human judgment.
Generative models have achieved significant progress in recent years, enabling the synthesis of data for arbitrary modalities, including critical areas such as medical imaging (Pinaya et al., 2022; Koetzier et al., 2024; Bluethgen et al., 2024). Evaluating the quality of generated data is essential to ensure its reliability for downstream applications.
Yet, structured data, such as images, are high-dimensional, which makes direct density estimation infeasible. A common approach for evaluating generative models is to use simple distributional approximations. For example, Fréchet Inception Distance (FID) variants (Heusel et al., 2017; Stein et al., 2023) assume normality of the distribution. This enables evaluation via a single score, typically the distance between the real and synthetic approximated distributions. However, beyond the untested assumptions, this aggregate score makes it difficult to diagnose whether a generative model lacks realism or diversity (Sajjadi et al., 2018).
Preprint. February 19, 2026.

Figure 1: The scenario in (a) consists of real samples from a 60/40 mixture of two hyperspheres, and generated samples from a mixture with swapped radii and proportions. Standard metrics are plotted in (b) as the dimension increases. The real and generated sets are disjoint, so all metrics should score 0; in high dimensions (top), however, none does, due to hubness. After applying GICDM (bottom), the scores correctly remain at 0. See Section G for individual metric results.
Pairs of fidelity and coverage metrics aim to measure these two aspects separately. Their computation most often relies on distances (Sajjadi et al., 2018; Naeem et al., 2020; Salvy et al., 2026). A synthetic sample is then considered realistic (high fidelity) if it is sufficiently close to real data points, while a real data point is considered covered if there are synthetic samples sufficiently close to it. The closeness threshold is usually defined locally as the distance from each real point to its k-th nearest real neighbor.
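The k-NN-ball construction described above can be sketched as follows. This is a minimal illustration in the spirit of the precision/recall-style metrics cited (Naeem et al., 2020), not the paper's exact implementation; the function names and the choice k=3 are illustrative.

```python
import numpy as np

def pairwise_sq_dists(A, B):
    """Squared Euclidean distances between rows of A and rows of B."""
    sq_a = (A ** 2).sum(axis=1)[:, None]
    sq_b = (B ** 2).sum(axis=1)[None, :]
    return np.maximum(sq_a + sq_b - 2.0 * A @ B.T, 0.0)

def fidelity_coverage(real, gen, k=3):
    """k-NN ball fidelity/coverage sketch.

    Each real point gets a radius equal to its distance to its k-th
    nearest real neighbor; a generated point counts as realistic if it
    falls inside some real ball, and a real point counts as covered if
    some generated point falls inside its ball."""
    d2_rr = pairwise_sq_dists(real, real)
    np.fill_diagonal(d2_rr, np.inf)            # a point is not its own neighbor
    radii2 = np.sort(d2_rr, axis=1)[:, k - 1]  # squared k-NN radius per real point
    inside = pairwise_sq_dists(real, gen) <= radii2[:, None]
    fidelity = inside.any(axis=0).mean()       # fraction of gen points near real data
    coverage = inside.any(axis=1).mean()       # fraction of real points reached by gen
    return float(fidelity), float(coverage)
```

When the generated set coincides with the real set both scores are 1; when the two sets are far apart both drop to 0, which is the behavior the hypersphere test in Figure 1 probes.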
A recent position paper argued that all existing fidelity and diversity metrics are flawed (Räisä et al., 2025). The paper introduced a synthetic benchmark of tests for evaluating generative model metrics and showed that all current metrics fail to meet at least 40% of the success criteria. While subsequent work has led to improvements (Salvy et al., 2026), many failures persist. We argue that the underlying cause of many reported failures is high dimensionality, specifically the hubness phenomenon. As shown in Figure 1, all metrics fail a simple hypersphere test in high dimensions.
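The hypersphere scenario of Figure 1 builds on a standard sampling primitive: normalizing a Gaussian draw yields a uniform direction on the sphere, which is then scaled by a per-sample radius. The sketch below generates a two-sphere mixture in this way; the concrete radii, dimension, and the pairing of radii with proportions for the generated set are one hedged reading of the figure caption, not the paper's exact configuration.

```python
import numpy as np

def sphere_mixture(n, dim, radii, props, rng):
    """Uniform samples on a mixture of concentric hyperspheres:
    draw a uniform direction, then scale by a radius chosen per sample."""
    comp = rng.choice(len(radii), size=n, p=props)     # mixture component per sample
    x = rng.standard_normal((n, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)      # project onto unit sphere
    return x * np.asarray(radii, dtype=float)[comp, None]

rng = np.random.default_rng(0)
# Real: 60% on radius 1, 40% on radius 3.
real = sphere_mixture(1000, 64, radii=[1.0, 3.0], props=[0.6, 0.4], rng=rng)
# Generated: radii exchanged across the two components (illustrative reading).
gen = sphere_mixture(1000, 64, radii=[3.0, 1.0], props=[0.6, 0.4], rng=rng)
```

Running distance-based metrics on such pairs as `dim` grows is what exposes the hubness-induced failures shown in Figure 1(b).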
In distance-based evaluation, it is assumed that distances in the feature space are meaningful: close points should be semantically similar, and distant points should be semantically different. In practice, evaluation is performed in a pre-trained embedding space, because feature extractors provide richer semantic representations than the raw observation space. However, modern embedding spaces are usually high-dimensional (e.g., 2048 for InceptionV3 (Szegedy et al., 2016), 4096 for DINOv3 (Siméoni et al., 2025), 1024 for CLAP (Elizalde et al., 2023)), making them vulnerable to the curse of dimensionality (Bellman, 1961).
A key aspect of this curse, known as hubness (Radovanovic et al., 2010), undermines the reliability of nearest neighbor relationships in high-dimensional spaces (Beyer et al., 1999; Aggarwal et al., 2001). Specifically, certain points, called hubs, appear disproportionately often among the k-nearest neighbors of other data points, even when they are not semantically related to those points (Pachet & Aucouturier, 2004). This phenomenon arises from the structure of high-dimensional data distributions and is not simply due to limited sample size (Radovanovic et al., 2010).
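Hubness is commonly quantified through the k-occurrence N_k(x), the number of times a point appears in other points' k-NN lists: as dimension grows, the distribution of N_k becomes increasingly right-skewed, with a few hubs accumulating most occurrences. A minimal sketch of this measurement on i.i.d. Gaussian data (sample sizes, dimensions, and k are illustrative):

```python
import numpy as np

def k_occurrence(X, k=10):
    """N_k(x): how often each point appears in other points' k-NN lists."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                     # exclude self-neighbors
    knn = np.argsort(d2, axis=1)[:, :k]              # k nearest neighbors per point
    return np.bincount(knn.ravel(), minlength=len(X))

def skewness(x):
    """Sample skewness; large positive values indicate hub formation."""
    x = np.asarray(x, dtype=float)
    return float(((x - x.mean()) ** 3).mean() / x.std() ** 3)

rng = np.random.default_rng(0)
n = 1000
skew_low = skewness(k_occurrence(rng.standard_normal((n, 3))))
skew_high = skewness(k_occurrence(rng.standard_normal((n, 500))))
print(f"skew of N_10  d=3: {skew_low:.2f}   d=500: {skew_high:.2f}")
```

The same skewness statistic, applied to embeddings rather than synthetic Gaussians, is how one would check whether a feature extractor's space exhibits hubness.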
Hubness has been recognized as problematic in various domains such as image recognition (Tomasev et al., 2011) and recommender systems (Hara et al., 2015). We show that hubness also affects modern embedding spaces (Table 4).
To enable reliable distance-based generative model evaluation in high-dimensional spaces, we aim to mitigate hubness while preserving metric validity. Our goals are: (1) to reduce hubness in the real dataset so that nearest neighbor relationships accurately reflect the local structure of the data manifold, (2) to preserve the relative positioning of generated points with respect to real data, and (3) to ensure that the evaluation of each generated point is independent of the others. Standard hubness mitigation methods focus only on in-sample reduction and do not directly satisfy these requirements (Feldbauer & Flexer, 2019).
In this paper, we introduce GICDM, a hubness reduction method tailored for distance-based generative model evaluation. Our contributions are:
• Demonstrating hubness: We show that common embedding spaces for generative model evaluation exhibit hubness, and find that the Iterative Contextual Dissimilarity Measure (ICDM)