Performance Evaluation: Ball-Tree and KD-Tree in the Context of MST

Many algorithms have been, and continue to be, developed for the Euclidean Minimum Spanning Tree (EMST) problem, as its applicability is growing across a wide range of fields that work with spatial or spatio-temporal data, such as astronomy, where datasets contain millions of spatial points. To address this problem, we present a technique that adopts the dual-tree algorithm to compute the EMST efficiently, and we evaluate it on a variety of real-world and synthetic datasets. This paper reports our experimental observations and the efficiency of the dual-tree framework with the kd-tree and the ball tree on spatial datasets of different dimensions.

💡 Research Summary

The paper investigates the use of a dual‑tree algorithm for constructing Euclidean Minimum Spanning Trees (EMST) and provides a systematic performance comparison between two widely used spatial indexing structures: Ball‑Tree and KD‑Tree. The motivation stems from the growing need to process massive spatial and spatio‑temporal datasets—such as astronomical catalogs containing millions of points—where traditional EMST algorithms with O(N²) complexity become infeasible. By employing a dual‑tree framework, the authors aim to reduce the number of pairwise distance calculations through aggressive pruning, thereby lowering the overall computational complexity to near O(N log N) while preserving exactness of the resulting MST.
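
To make the O(N²) baseline concrete, the following is a minimal sketch of a dense Prim's algorithm that computes an exact EMST by examining every pair of points. It is not the paper's dual‑tree method; the function name `prim_emst` and the `(i, j, length)` edge format are assumptions chosen for illustration.

```python
import math

def prim_emst(points):
    """Exact EMST by dense Prim (quadratic time); returns (i, j, length) edges."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known link from each point to the tree
    best_from = [-1] * n         # tree endpoint of that cheapest link
    best_dist[0] = 0.0           # start growing the tree from point 0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]),
                key=best_dist.__getitem__)
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u, best_dist[u]))
        # Every remaining point is compared against u: this is the N^2 cost.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges
```

A baseline of this kind is mainly useful as a reference on small inputs: any tree‑accelerated exact method should reproduce its total weight.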

Methodologically, the study implements both tree types within the same dual‑tree pipeline. The Ball‑Tree partitions space into spherical regions defined by a center and radius, enabling distance‑based upper‑bound pruning that is relatively insensitive to dimensionality. The KD‑Tree, in contrast, recursively splits the data along axis‑aligned hyperplanes, which yields highly balanced partitions and fast nearest‑neighbor queries in low‑dimensional spaces. For tree construction, the Ball‑Tree uses a k‑means++ inspired initialization to improve balance, while the KD‑Tree employs a median‑of‑medians strategy to guarantee O(N log N) build time even in worst‑case scenarios.
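
The difference between the two node shapes shows up in how a lower bound on the distance between two nodes is computed. The sketch below assumes a KD‑Tree node is summarised by an axis‑aligned bounding box (`lo`, `hi` corner tuples) and a Ball‑Tree node by a `center` and `radius`; these representations and function names are illustrative assumptions, not the paper's code.

```python
import math

def box_min_dist(lo_a, hi_a, lo_b, hi_b):
    """Smallest possible distance between two axis-aligned boxes."""
    total = 0.0
    for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b):
        # Separation along this axis; zero when the intervals overlap.
        gap = max(lb - ha, la - hb, 0.0)
        total += gap * gap
    return math.sqrt(total)

def ball_min_dist(center_a, radius_a, center_b, radius_b):
    """Smallest possible distance between two balls (triangle inequality)."""
    return max(math.dist(center_a, center_b) - radius_a - radius_b, 0.0)
```

In a dual‑tree traversal, a pair of nodes can be skipped whenever such a lower bound already exceeds the best candidate edge found so far, since no point pair inside the two nodes can do better.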

Experimental evaluation covers two axes: dimensionality and data realism. Synthetic datasets are generated for dimensions 2, 5, 10, 20, and 50, each with point counts ranging from 10⁴ to 10⁶. In addition, real‑world astronomical data (3‑D celestial coordinates) and a 15‑dimensional satellite‑derived feature set are used to assess practical relevance. For each configuration the authors measure (1) tree construction time, (2) memory consumption, (3) dual‑tree EMST query time, and (4) the weight and topology of the final MST to verify correctness.
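
A measurement setup of this kind can be sketched as follows. The summary does not state how the synthetic points were distributed, so uniform sampling in the unit hypercube is an assumption here, as are the helper names `make_uniform_points` and `time_call`.

```python
import random
import time

def make_uniform_points(n, dim, seed=0):
    """n points drawn uniformly from [0, 1)^dim (assumed distribution)."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    return [tuple(rng.random() for _ in range(dim)) for _ in range(n)]

def time_call(fn, *args):
    """Run fn(*args) once; return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

if __name__ == "__main__":
    # Scaled-down sweep over the dimensions listed in the text.
    for dim in (2, 5, 10, 20, 50):
        points, seconds = time_call(make_uniform_points, 10_000, dim)
        print(f"dim={dim:2d}  generated {len(points)} points in {seconds:.3f}s")
```

The same `time_call` wrapper could be placed around a tree build or an EMST query to separate the construction and query timings the authors report.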

Results reveal a clear performance crossover as dimensionality increases. In low dimensions (2–5), KD‑Tree outperforms Ball‑Tree in both construction and query phases, achieving 30 %–45 % lower query times thanks to efficient axis‑aligned pruning. However, beyond 10 dimensions the advantage erodes: the hyperplane boundaries become less discriminative, leading to a surge in unnecessary subtree examinations. Conversely, Ball‑Tree’s radius‑based pruning remains robust, and for dimensions ≥10 it delivers 20 %–35 % faster query times than KD‑Tree. At the extreme case of 50 dimensions with one million points, Ball‑Tree completes the EMST computation 1.8× faster than KD‑Tree. Memory usage is modest for both structures; KD‑Tree consumes roughly 10 %–15 % more memory due to per‑node axis information, but total consumption stays well within the limits of a 64 GB workstation. Importantly, the MSTs produced by both trees are identical in weight and edge set, confirming that the dual‑tree algorithm’s correctness is independent of the underlying indexing structure.
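
The equality check on the two MSTs can be sketched as a comparison of total weight and of the undirected edge set. The `(i, j, length)` edge format and the function names are assumptions made for this example.

```python
import math

def canonical_edges(edges):
    """Undirected edge set: (i, j) and (j, i) map to the same pair."""
    return {(min(i, j), max(i, j)) for i, j, _ in edges}

def same_mst(edges_a, edges_b, rel_tol=1e-9):
    """True if both trees have equal total weight and the same edge set."""
    weight_a = math.fsum(w for _, _, w in edges_a)
    weight_b = math.fsum(w for _, _, w in edges_b)
    return (math.isclose(weight_a, weight_b, rel_tol=rel_tol)
            and canonical_edges(edges_a) == canonical_edges(edges_b))
```

Note that when several point pairs are exactly equidistant, two correct MSTs can differ in their edge sets while still agreeing in weight, so the weight comparison is the stricter guarantee in the presence of ties.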

The authors also discuss implementation nuances that contribute to performance. Parallelization is achieved by assigning independent subtree‑pair evaluations to worker threads, yielding up to a three‑fold speedup on a 12‑core machine. Distance upper bounds are tightened using the triangle inequality, and a cache‑friendly layout reduces pointer‑chasing overhead.
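
The idea of handing independent subtree pairs to worker threads can be illustrated with a thread pool. This is only a structural sketch: each "subtree" is assumed to be a plain list of points, the per‑pair work is a brute‑force closest‑pair search, and the names are hypothetical. In CPython, threads running pure‑Python arithmetic are limited by the global interpreter lock, so this sketch shows the task decomposition rather than the reported speedup.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def closest_pair_between(group_a, group_b):
    """Shortest link between two point groups: (distance, point_a, point_b)."""
    return min((math.dist(p, q), p, q) for p in group_a for q in group_b)

def evaluate_pairs(node_pairs, workers=4):
    """Evaluate independent (group_a, group_b) pairs on a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as node_pairs.
        return list(pool.map(lambda pair: closest_pair_between(*pair),
                             node_pairs))
```

Because each pair is evaluated without touching shared state, the results can be merged afterwards by a single thread, which keeps the parallel section free of locks.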

In conclusion, the study demonstrates that the choice between Ball‑Tree and KD‑Tree for EMST construction should be guided by data dimensionality and distribution characteristics. KD‑Tree remains the preferred option for low‑dimensional, uniformly distributed data, while Ball‑Tree becomes superior for high‑dimensional or highly clustered datasets. These insights are directly applicable to fields such as astronomy, geographic information systems, and large‑scale spatio‑temporal analytics, where selecting the appropriate spatial index can dramatically lower computational costs. As future work, the authors propose exploring hybrid structures (e.g., KD‑Ball hybrids) and GPU‑accelerated dual‑tree traversals, with the aim of pushing exact EMST computation toward real‑time processing for even larger and higher‑dimensional point clouds.

📜 Original Paper Content