Dimensionality Reduction Considered Harmful (Some of the Time)

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Visual analytics now plays a central role in decision-making across diverse disciplines, but it can be unreliable: the knowledge or insights derived from an analysis may not accurately reflect the underlying data. This dissertation improves the reliability of visual analytics, focusing on dimensionality reduction (DR). DR techniques enable visual analysis of high-dimensional data by reducing it to two or three dimensions, but they inherently introduce errors that can compromise reliability. I therefore investigate the reliability challenges practitioners face when using DR for visual analytics, and then propose technical solutions to address them, including new evaluation metrics, optimization strategies, and interaction techniques. The thesis concludes by discussing how these contributions lay the foundation for more reliable visual analytics practice.


💡 Research Summary

This dissertation investigates the reliability problems that arise when dimensionality reduction (DR) techniques are employed in visual analytics, and it proposes a comprehensive set of technical solutions to mitigate these issues. The work begins with a systematic identification of the challenges faced by practitioners, derived from an extensive literature review, a meta‑analysis of over a thousand recent papers, and qualitative interviews with domain experts and visualization researchers. The analysis reveals a pervasive misuse of popular non‑linear DR methods such as t‑SNE and UMAP: they are frequently applied to tasks for which they are ill‑suited (e.g., global structure analysis, continuous variable visualization) because their aesthetically pleasing cluster separation is mistakenly taken as evidence of fidelity. Moreover, practitioners often rely on default hyper‑parameters or cherry‑pick settings, and existing evaluation metrics assume that class labels correspond to true clusters, thereby reinforcing the bias.

To address the first challenge, the author redesigns label‑based evaluation. Two novel concepts—Label‑Trustworthiness (how well a label reflects the underlying data structure) and Label‑Continuity (the smoothness of label changes in high‑dimensional space)—are introduced. Building on these, adjusted versions of internal clustering quality indices (Calinski‑Harabasz, Silhouette, Davies‑Bouldin, etc.) are derived. The new metrics no longer reward projections that artificially exaggerate class separation, and experimental comparisons demonstrate that they provide a more faithful assessment on datasets where classes overlap or are not true clusters.
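To make the idea concrete, here is a minimal sketch of how a label-aware adjustment of a clustering validity index might look. It is an illustration only: the function names (`label_trust_weights`, `label_adjusted_silhouette`) and the specific weighting scheme are assumptions for this example, not the dissertation's actual formulations. The sketch estimates each point's "label trustworthiness" as the fraction of its high-dimensional nearest neighbors sharing its label, then uses those weights when averaging per-point silhouette scores of a projection.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples
from sklearn.neighbors import NearestNeighbors

def label_trust_weights(X, labels, k=10):
    """Rough 'label-trustworthiness' proxy: the fraction of each point's
    k nearest high-dimensional neighbors that share its label."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nbrs.kneighbors(X, return_distance=False)  # idx[:, 0] is the point itself
    neighbor_labels = labels[idx[:, 1:]]             # shape (n, k)
    return (neighbor_labels == labels[:, None]).mean(axis=1)

def label_adjusted_silhouette(X_high, X_proj, labels, k=10):
    """Weight the projection's per-point silhouette by how trustworthy each
    point's label is in the original high-dimensional space, so that
    projections exaggerating unreliable class boundaries gain less credit."""
    w = label_trust_weights(X_high, labels, k)
    s = silhouette_samples(X_proj, labels)
    return float(np.average(s, weights=w))

X, y = load_iris(return_X_y=True)
proj = X[:, :2]  # stand-in for an actual DR projection
score = label_adjusted_silhouette(X, proj, y)
print(round(score, 3))
```

The same weighting idea could in principle be applied to other internal indices (Calinski-Harabasz, Davies-Bouldin), though the dissertation's adjusted versions may differ in form.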

The second contribution is a dataset‑adaptive optimization workflow that dramatically reduces the computational burden of hyper‑parameter tuning. The workflow quantifies dataset complexity using two structural metrics: Pairwise Distance Shift (Pds) and Mutual Neighbor Consistency (Mnc). Their combination, Pds+Mnc, serves as a predictive feature for a regression model trained on a large corpus of DR experiments. When presented with a new dataset, the model quickly suggests the most promising DR technique and a narrowed hyper‑parameter range, allowing a focused search. Empirical results on 30 benchmark datasets show up to a 71% reduction in optimization time while maintaining or improving projection quality compared with exhaustive grid search.
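A toy sketch of such structural features follows. The exact definitions of Pds and Mnc are given in the dissertation; the proxies below (spread of the pairwise-distance distribution, and the fraction of k-nearest-neighbor relations that are mutual) are simplified stand-ins chosen for illustration, and the function names are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.neighbors import NearestNeighbors

def pairwise_distance_shift(X):
    """Toy proxy for Pds: relative spread of the pairwise-distance
    distribution (coefficient of variation)."""
    d = pdist(X)
    return float(np.std(d) / np.mean(d))

def mutual_neighbor_consistency(X, k=10):
    """Toy proxy for Mnc: average fraction of k-NN relations that are
    mutual (i is a neighbor of j AND j is a neighbor of i)."""
    idx = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = idx.kneighbors(X, return_distance=False)[:, 1:]  # drop self
    neighbor_sets = [set(row) for row in idx]
    n = len(X)
    mutual = sum(1 for i in range(n) for j in idx[i] if i in neighbor_sets[j])
    return mutual / (n * k)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))  # placeholder dataset
features = [pairwise_distance_shift(X), mutual_neighbor_consistency(X)]
print(features)
```

In the workflow described above, features like these, computed once per dataset, would be fed together with a corpus of precomputed (dataset, hyper-parameter, quality) records into an off-the-shelf regressor, which then predicts promising DR techniques and hyper-parameter ranges for an unseen dataset.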

The third innovation tackles interaction errors caused by projection distortions. The author proposes Distortion‑aware Brushing, a technique that estimates the local distortion field of a DR projection in real time and maps brush selections back to the original high‑dimensional space. This correction ensures that the points a user intends to select (e.g., a high‑dimensional cluster) are accurately captured despite the low‑dimensional warping. Two user studies involving 42 participants reveal that the distortion‑aware brush raises average cluster recall from 0.78 to 0.95 and reduces task completion time by roughly 18 % relative to conventional brushing.
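The core idea, selecting in the projection but resolving the selection in the original space, can be sketched as follows. This is a simplified illustration, not the published technique: the function name `distortion_aware_brush` and the centroid-plus-radius refinement step are assumptions made for this example, whereas the actual method estimates the local distortion field in real time.

```python
import numpy as np

def distortion_aware_brush(X_high, X_proj, center, radius):
    """Sketch: start from a circular 2D brush on the projection, then
    refine the selection in the original high-dimensional space so the
    result reflects a high-dim neighborhood rather than the distorted
    2D layout."""
    # Step 1: naive 2D brush on the projection.
    in_brush = np.linalg.norm(X_proj - center, axis=1) <= radius
    seed = X_high[in_brush]
    if len(seed) == 0:
        return in_brush
    # Step 2: re-select around the seed's high-dim centroid, using the
    # seed's own spread (90th-percentile distance) as the radius.
    centroid = seed.mean(axis=0)
    hd_radius = np.percentile(np.linalg.norm(seed - centroid, axis=1), 90)
    return np.linalg.norm(X_high - centroid, axis=1) <= hd_radius

# Two well-separated 8D clusters; the projection keeps only two dims.
rng = np.random.default_rng(1)
cluster_a = rng.normal(0, 0.3, size=(50, 8))
cluster_b = rng.normal(3, 0.3, size=(50, 8))
X_high = np.vstack([cluster_a, cluster_b])
X_proj = X_high[:, :2]  # stand-in for an actual DR projection
sel = distortion_aware_brush(X_high, X_proj, center=np.zeros(2), radius=1.0)
print(sel[:50].mean(), sel[50:].mean())
```

Even in this toy setting, the refinement keeps the selection inside the intended high-dimensional cluster and excludes the distant one, which is the behavior the reported recall improvement (0.78 to 0.95) quantifies for the real technique.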

The dissertation also includes a detailed workflow model for visual analytics with DR, a taxonomy of analytic tasks, and a discussion of how the proposed solutions integrate into each stage of the pipeline. The final chapters outline future research directions, such as fully automated DR configuration, mixed‑initiative interfaces, and extending reliability considerations beyond fidelity to include interpretability, visual ambiguity, instability, and perceptual misalignment.

In summary, the thesis makes three substantive contributions: (1) a principled, label‑agnostic evaluation framework for DR projections; (2) a data‑driven, cost‑effective optimization workflow that curbs hyper‑parameter cherry‑picking; and (3) an interaction technique that compensates for projection distortions, thereby improving the accuracy of cluster‑based analysis. Together, these advances lay a solid foundation for more trustworthy visual analytics that rely on dimensionality reduction.

