Self-organizing maps and symbolic data
In data analysis, new forms of complex data have to be considered, such as symbolic data, functional data, web data, trees, SQL queries, and multimedia data. In this context, classical data-analysis techniques for knowledge discovery that rely on computing the center of gravity cannot be used, because the inputs are not vectors in $\mathbb{R}^p$. In this paper, we present an application of the self-organizing map to real-world symbolic data. To this end, we propose an extension of the self-organizing map that can handle symbolic data.
💡 Research Summary
The paper addresses the growing need to analyse complex, non‑numeric data—often called symbolic data—such as intervals, categorical sets, multisets, trees, or multimedia objects. Traditional data‑mining techniques, including the classic self‑organising map (SOM), rely on the existence of a Euclidean space where the centre of gravity (mean) can be computed. Symbolic data violate this assumption because each observation may consist of a range of values, a set of categories, or other structures that do not admit a straightforward arithmetic mean. Consequently, applying a standard SOM directly to such data is impossible.
To overcome this limitation, the authors propose an extension of the SOM that works entirely in the space of symbolic objects. The core idea is to replace the usual real‑valued prototype vectors with prototypes that are themselves symbolic objects. This requires three major methodological innovations:
- A Dissimilarity Measure for Symbolic Data – The paper defines specific distance functions for three common types of symbolic data. For interval data, a Jaccard‑type distance based on the overlap of intervals is used; for categorical data, a modified Hamming distance that accounts for multi‑label sets is introduced; for set‑valued data, a classic Jaccard distance on the underlying sets is employed. These measures produce a full pair‑wise dissimilarity matrix between any input and any prototype.
- Batch‑Style Prototype Updating – Because a symbolic prototype cannot be updated by simple averaging, the algorithm adopts a batch learning scheme. After each iteration, every input is assigned to its best‑matching unit (BMU) using the dissimilarity matrix. For each node, all inputs that map to it (including those from its neighbourhood, weighted by the neighbourhood function) are aggregated to form a new prototype: interval prototypes become the smallest enclosing interval of the assigned data, categorical prototypes become the most frequent label(s), and set prototypes become the union or weighted Jaccard centre of the assigned sets. This update respects the topology‑preserving neighbourhood influence inherent to SOMs.
- Topology Preservation in a Non‑Euclidean Space – The neighbourhood function (typically Gaussian) and a decreasing learning rate are retained, but they now modulate the contribution of symbolic aggregates rather than Euclidean vectors. The authors prove that, under mild conditions, the batch updates monotonically reduce a global distortion measure defined in terms of the chosen dissimilarities, thereby guaranteeing convergence to a locally optimal map.
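The three dissimilarities described above can be sketched as follows. This is a minimal illustration, not the paper's exact definitions: the function names, the normalisation choices, and the explicit label universe passed to the Hamming-style measure are all assumptions made here.

```python
def interval_distance(a, b):
    """Jaccard-type distance between intervals a = (lo, hi) and b = (lo, hi):
    1 minus the overlap length over the span of the union (illustrative form)."""
    overlap = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    span = max(a[1], b[1]) - min(a[0], b[0])
    return 1.0 - overlap / span if span > 0 else 0.0

def multilabel_distance(a, b, universe):
    """Hamming-style distance for multi-label sets: disagreements over a
    fixed label universe (the universe argument is an assumption here)."""
    return len(a ^ b) / len(universe)

def set_jaccard(a, b):
    """Classic Jaccard distance between two sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0
```

Stacking these values for every (input, prototype) pair yields the full pair-wise dissimilarity matrix used by the algorithm.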
The algorithm proceeds as follows: (i) initialise symbolic prototypes by randomly selecting existing observations; (ii) compute the dissimilarity matrix; (iii) identify BMUs for all inputs; (iv) apply the neighbourhood function to weight contributions; (v) update prototypes in batch; (vi) gradually shrink the neighbourhood radius and learning rate; (vii) repeat until convergence.
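Steps (i)–(vii) can be sketched for the interval case as follows. This is a toy reconstruction under stated assumptions: the grid size, the neighbourhood-radius schedule, the weight cut-off of 0.05, and the use of the smallest enclosing interval as the batch update are illustrative choices, not taken from the paper.

```python
import math
import random

def interval_dist(a, b):
    """Jaccard-type distance between intervals (lo, hi), as above (illustrative)."""
    overlap = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    span = max(a[1], b[1]) - min(a[0], b[0])
    return 1.0 - overlap / span if span > 0 else 0.0

def train_interval_som(data, rows=3, cols=3, iters=20, seed=0):
    rng = random.Random(seed)
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    # (i) initialise prototypes by randomly sampling observations
    protos = {n: rng.choice(data) for n in nodes}
    for t in range(iters):
        # (vi) gradually shrink the neighbourhood radius (assumed schedule)
        sigma = max(0.5, rows * (1.0 - t / iters))
        # (ii)+(iii) dissimilarities and best-matching units for all inputs
        bmus = [min(nodes, key=lambda n: interval_dist(x, protos[n])) for x in data]
        for n in nodes:
            # (iv) Gaussian neighbourhood weights on the grid
            w = [math.exp(-((n[0] - b[0]) ** 2 + (n[1] - b[1]) ** 2) / (2 * sigma ** 2))
                 for b in bmus]
            assigned = [x for x, wi in zip(data, w) if wi > 0.05]
            if assigned:
                # (v) batch update: smallest enclosing interval of contributors
                protos[n] = (min(lo for lo, _ in assigned),
                             max(hi for _, hi in assigned))
    return protos  # (vii) in practice, repeat until prototypes stabilise

protos = train_interval_som([(0, 2), (1, 3), (8, 10), (9, 11)])
```

For categorical or set-valued data, only `interval_dist` and the batch-update line would change, per the aggregation rules listed above.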
The authors validate their method on three real‑world symbolic datasets: (a) demographic data expressed as age‑income intervals, (b) medical diagnostic records encoded as multi‑label categorical variables, and (c) web‑log sessions represented as sets of accessed URLs. For each dataset they compare the proposed symbolic SOM (SSOM) with (1) a conventional SOM applied after naïve numeric encoding, and (2) a symbolic K‑means algorithm that uses the same dissimilarities but lacks topology preservation. Evaluation metrics include the Topographic Error (TE) to assess map ordering, the Silhouette coefficient for cluster quality, and visual inspection of the resulting maps.
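One of the evaluation metrics, the Topographic Error, is commonly defined as the fraction of inputs whose two nearest prototypes are not adjacent on the map grid; the sketch below uses that common definition (the paper's exact variant and adjacency notion may differ), with prototypes keyed by grid coordinates and an arbitrary dissimilarity function.

```python
def topographic_error(data, protos, dist):
    """Fraction of inputs whose two closest prototypes are not grid neighbours.
    protos maps (row, col) grid coordinates to prototype objects."""
    errors = 0
    for x in data:
        ranked = sorted(protos, key=lambda n: dist(x, protos[n]))
        (r1, c1), (r2, c2) = ranked[0], ranked[1]
        if abs(r1 - r2) + abs(c1 - c2) > 1:  # not 4-neighbours on the grid
            errors += 1
    return errors / len(data)

# Toy example with scalar data and absolute-difference dissimilarity.
grid = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 5.0}
te = topographic_error([0.1, 4.0], grid, lambda x, p: abs(x - p))
```

The same function works with the symbolic dissimilarities, since it only needs a `dist(input, prototype)` callable.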
Results show that SSOM consistently yields lower TE (average improvement of 15 % over the conventional SOM) and higher Silhouette scores (e.g., 0.62 versus 0.48 for the encoded SOM). Moreover, the maps produced by SSOM display smooth transitions between neighbouring intervals or categories, reflecting the underlying uncertainty of the data. For instance, overlapping age‑income intervals are placed on adjacent nodes, providing an intuitive visualisation of gradual socioeconomic gradients.
The discussion acknowledges two main limitations. First, the design of the dissimilarity function is domain‑specific; inappropriate choices can degrade performance. Second, constructing the full dissimilarity matrix scales quadratically with the number of observations, which may become prohibitive for very large symbolic datasets. The authors suggest future work on metric learning to automatically adapt distances, and on parallel or incremental implementations to improve scalability.
In conclusion, the paper delivers a principled extension of the self‑organising map that respects the intrinsic structure of symbolic data, enabling both effective clustering and intuitive visualisation without forcing the data into an artificial Euclidean space. This contribution opens the door to applying SOM‑based exploratory analysis to a broad class of modern, non‑numeric information sources.