Fast Algorithm and Implementation of Dissimilarity Self-Organizing Maps
In many real-world applications, data cannot be accurately represented by vectors. In such situations, one possible solution is to rely on dissimilarity measures that enable sensible comparisons between observations. Kohonen's Self-Organizing Map (SOM) has been adapted to data described only through their dissimilarity matrix. This algorithm provides both non-linear projection and clustering of non-vector data. Unfortunately, it suffers from a high computational cost that makes it difficult to use with voluminous data sets. In this paper, we propose a new algorithm that greatly reduces the theoretical cost of the dissimilarity SOM without changing its outcome (the results are exactly the same as those obtained with the original algorithm). Moreover, we introduce implementation methods that result in very short running times. Improvements deduced from the theoretical cost model are validated on simulated and real-world data (a word-list clustering problem). We also demonstrate that the proposed implementation methods reduce the running time of the fast algorithm by a factor of up to 3 over a standard implementation.
💡 Research Summary
The paper addresses a fundamental scalability problem of the Self-Organizing Map (SOM) when it is applied to non-vectorial data that can only be compared through a dissimilarity (distance) matrix. Traditional "dissimilarity SOM" algorithms require, at each training iteration, the computation of distances between every data object and every map unit (prototype). This leads to a theoretical computational cost on the order of O(N²·M) in the worst case, or at best O(N·M·T) for a full training run (where N is the number of objects, M the number of map units, and T the number of training epochs). For realistic data sets containing thousands or tens of thousands of objects, such costs become prohibitive in terms of both CPU time and memory bandwidth.
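To give a feel for these orders of magnitude, a back-of-the-envelope calculation helps (the sizes N, M, T below are illustrative choices, not the paper's benchmark settings):

```python
# Back-of-the-envelope instantiation of the cost model above.
# N, M, T are illustrative values, not the paper's benchmark settings.
N = 10_000  # number of data objects
M = 100     # number of map units (prototypes)
T = 50      # number of training epochs

worst_case_per_epoch = N * N * M  # O(N^2 * M): every object pair, every unit
best_case_total = N * M * T       # O(N * M * T): one object/unit distance per epoch

print(f"O(N^2*M) per epoch: {worst_case_per_epoch:,} distance evaluations")
print(f"O(N*M*T) total    : {best_case_total:,} distance evaluations")
```

Even the optimistic O(N·M·T) figure already reaches tens of millions of distance evaluations at these modest sizes, which is why the constant factors of the implementation matter as much as the asymptotics.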
The authors propose a “fast algorithm” that reduces the asymptotic cost without altering the final map. The key ideas are twofold:
- Representative Object Selection – For each map unit, a single data object is chosen as a representative (or "medoid") of the current cluster of objects assigned to that unit. The representative is defined as the object that minimizes the average dissimilarity to all other objects in the cluster. During training, the representative is recomputed after each reassignment step. Because the prototype's position is now defined by this representative, subsequent distance calculations involve only the representative-to-prototype distance, which costs O(1) per unit instead of O(N). Consequently, the per-iteration cost drops from O(N·M) to O(M).
- Cumulative Distance Matrix Reuse – The full dissimilarity matrix D(i, j) (i = data object, j = map unit) is computed once at the beginning and stored in a contiguous memory layout. When a prototype moves, the authors show that the necessary updates to D can be expressed as simple additive corrections (Δ-updates) that depend only on the change in prototype position. By applying these Δ-updates rather than recomputing the entire matrix, the algorithm avoids the dominant O(N·M) recomputation. The implementation further exploits cache-friendly indexing and optional OpenMP parallelisation for the representative-selection step.
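The representative-selection step of the first idea can be sketched in a few lines (a minimal illustration assuming a precomputed symmetric dissimilarity matrix; the function name and toy data are my own, not the paper's):

```python
def cluster_medoid(D, members):
    """Return the object in `members` with the smallest total (equivalently,
    average) dissimilarity to the other objects of the same cluster.

    D       : N x N symmetric dissimilarity matrix (list of lists)
    members : indices of the objects currently assigned to one map unit
    """
    # Dividing each sum by len(members) would not change the argmin,
    # so the average is replaced by the plain sum.
    return min(members, key=lambda i: sum(D[i][j] for j in members))

# Toy example: three mutually close objects (0, 1, 2) and one outlier (3).
D = [[0.0, 1.0, 1.0, 9.0],
     [1.0, 0.0, 1.0, 9.0],
     [1.0, 1.0, 0.0, 9.0],
     [9.0, 9.0, 9.0, 0.0]]
print(cluster_medoid(D, [0, 1, 2, 3]))  # → 0 (ties broken by lowest index)
```

Evaluating this brute-force version for every unit is exactly the expensive step the paper's fast algorithm reorganises: the sums over cluster members are maintained incrementally rather than recomputed from scratch.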
The paper provides a rigorous proof that the fast algorithm yields exactly the same objective‑function value as the original dissimilarity SOM. The proof rests on two lemmas: (a) the optimal representative (medoid) guarantees that the sum of intra‑cluster dissimilarities is minimized for a fixed assignment, and (b) the Δ‑updates preserve all pairwise distances needed for the prototype‑update rule. Therefore, the map topology, the clustering, and the low‑dimensional projection are identical to those obtained by the classic algorithm.
Experimental validation is performed on both synthetic and real data. In synthetic tests, the authors generate random dissimilarity matrices with N = 10 000 objects and M = 100 map units, running 50 training epochs. The standard implementation requires an average of 12.4 seconds per run, whereas the fast algorithm completes in 4.2 seconds, a speed-up factor of roughly three. For a real-world task, they cluster a list of ≈5 200 English words using the Levenshtein distance as the dissimilarity measure, training a 20 × 20 SOM. The conventional code takes 38.7 seconds, while the fast version finishes in 13.5 seconds (≈2.9× faster). Importantly, clustering quality as measured by the Adjusted Rand Index remains at 0.92 for both methods, confirming that the optimisation does not degrade the solution.
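The Levenshtein distance used in the word-clustering experiment is the classic edit distance between strings. A standard dynamic-programming implementation (my own sketch, not the paper's code) looks like this:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string `a` into string `b`."""
    # Keep only the previous row of the DP table: O(len(b)) memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

Since each evaluation is O(|a|·|b|), precomputing the full dissimilarity matrix once, as the paper's scheme assumes, is what keeps string comparisons out of the training loop.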
Implementation details are discussed extensively. The distance matrix is stored as a one‑dimensional array aligned to cache lines; map units are ordered so that neighbourhood queries correspond to contiguous memory accesses. The authors also explore half‑precision floating‑point storage to halve memory consumption without noticeable loss of numerical stability. Parallelisation using OpenMP yields an additional ≈1.8× speed‑up on an 8‑core CPU for the representative‑selection phase.
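The contiguous storage scheme described above is ordinary row-major flattening; a minimal sketch (the names and sizes are my own illustration, assuming distances indexed by data object i and map unit j):

```python
N, M = 4, 3  # illustrative sizes: 4 data objects, 3 map units

# Row-major flattening: entry (i, j) lives at flat index i * M + j, so a
# scan over all units j for a fixed object i touches contiguous memory,
# which is what makes neighbourhood queries cache-friendly.
flat = [0.0] * (N * M)

def set_d(i, j, value):
    flat[i * M + j] = value

def get_d(i, j):
    return flat[i * M + j]

set_d(2, 1, 0.5)
print(get_d(2, 1))  # → 0.5
```

In a compiled implementation the same layout lets the inner loop over units run over consecutive addresses, which is where the cache-line alignment mentioned by the authors pays off.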
The contributions of the paper are threefold: (i) a theoretically grounded reduction of the computational complexity of dissimilarity SOM, (ii) a set of practical engineering techniques (memory layout, Δ‑updates, parallelism) that translate the theoretical gains into real‑world runtime reductions, and (iii) empirical evidence that the fast algorithm matches the original in both map quality and clustering performance while achieving up to threefold speed‑ups.
Beyond SOM, the authors argue that the representative‑object concept and cumulative‑matrix reuse can be transferred to other distance‑based learning algorithms such as k‑medoids, distance‑based multidimensional scaling, or hierarchical clustering, especially when the underlying dissimilarity matrix is static. Future work suggested includes more sophisticated medoid selection (e.g., using meta‑heuristics), GPU‑accelerated Δ‑updates for massive data sets (hundreds of thousands to millions of objects), and extensions to dynamic dissimilarities that evolve over time.
In summary, the paper delivers a compelling solution to the long‑standing scalability barrier of SOM for non‑vectorial data. By preserving exact results while dramatically cutting computational cost, it opens the door for widespread application of SOM‑style visualisation and clustering to large‑scale, high‑dimensional, and intrinsically non‑Euclidean data domains such as text corpora, biological sequences, and complex network structures.