An adaptation of self-organizing maps to data described by a dissimilarity table
Many data analysis methods cannot be applied to data that are not represented by a fixed number of real values, whereas most real-world observations are not readily available in such a format. Vector-based data analysis methods must therefore be adapted before they can be used with non-standard, complex data. A flexible and general solution for this adaptation is to use a (dis)similarity measure. Indeed, thanks to expert knowledge of the studied data, it is generally possible to define a measure that can be used to make pairwise comparisons between observations. General data analysis methods are then obtained by adapting existing methods to (dis)similarity matrices. In this article, we propose an adaptation of Kohonen's Self-Organizing Map (SOM) to (dis)similarity data. The proposed algorithm is an adapted version of the vector-based batch SOM. The method is validated on real-world data: we provide an analysis of the usage patterns of the web site of the Institut National de Recherche en Informatique et Automatique, constructed thanks to web log mining methods.
💡 Research Summary
The paper addresses a fundamental challenge in modern data analysis: many real‑world observations cannot be expressed as fixed‑length real‑valued vectors, yet most statistical and machine‑learning techniques assume such a representation. To bridge this gap, the authors propose a general strategy based on pairwise (dis)similarity measures. By defining a (dis)similarity matrix that captures expert knowledge about how two observations should be compared, any algorithm that originally works on vectors can be adapted to operate directly on the matrix.
The core contribution is an adaptation of Kohonen’s Self‑Organizing Map (SOM) to work with (dis)similarity data. The authors start from the batch version of SOM, which updates prototype (codebook) vectors after processing the whole dataset in each iteration. They replace the Euclidean distance used in the original algorithm with the entries of a pre‑computed (dis)similarity matrix D(i, j). The learning procedure proceeds as follows:
- Initialization – a set of M prototypes is created, either randomly or by a heuristic that uses the (dis)similarity matrix.
- Best‑Matching Unit (BMU) assignment – for each data point i the prototype k* that minimizes D(i, k) is identified. This step is the analogue of finding the nearest neuron in the classic SOM.
- Neighborhood weighting – a neighborhood function h_{kl} (typically Gaussian) is defined on the 2‑D lattice of prototypes. The weight assigned to a data point i when updating prototype k is the product of h_{kl} (where l is the BMU for i) and a learning rate that decays over iterations.
- Prototype update – each prototype w_k is replaced by the object that minimizes the weighted sum of (dis)similarities to the data points assigned to k and to its neighbours. In practice the authors approximate this by selecting the data point with the smallest weighted (dis)similarity, or by computing a weighted average when the matrix is metric‑like.
- Annealing – the neighborhood radius and learning rate are gradually reduced until convergence, just as in the standard batch SOM.
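The steps above can be sketched in code. The following is a minimal NumPy sketch of the "median" variant described in step 4, in which each prototype is itself a data point, so only the dissimilarity matrix is ever needed. The function name, the annealing schedule, and all parameter values are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np

def median_batch_som(D, grid_w=10, grid_h=10, n_iter=30, seed=0):
    """Batch SOM on a dissimilarity matrix (median-SOM sketch).

    D is an (N, N) symmetric, non-negative dissimilarity matrix.
    Each prototype is the index of a data point, chosen to minimise
    the neighbourhood-weighted sum of dissimilarities.
    """
    rng = np.random.default_rng(seed)
    n, m = D.shape[0], grid_w * grid_h
    # 2-D lattice coordinates of the m units, and squared lattice distances
    coords = np.array([(k % grid_w, k // grid_w) for k in range(m)], dtype=float)
    lat2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    protos = rng.choice(n, size=m, replace=False)   # initialization (random variant)
    sigma0, sigma_end = max(grid_w, grid_h) / 2.0, 0.5
    for t in range(n_iter):
        # annealing: shrink the neighbourhood radius geometrically
        sigma = sigma0 * (sigma_end / sigma0) ** (t / max(n_iter - 1, 1))
        h = np.exp(-lat2 / (2.0 * sigma**2))        # Gaussian neighbourhood weighting
        bmu = D[:, protos].argmin(axis=1)           # BMU assignment via D(i, k)
        W = h[bmu]                                  # W[i, k]: weight of point i for unit k
        # prototype update: cost[k, j] = sum_i W[i, k] * D(i, j); keep the minimiser j
        protos = (W.T @ D).argmin(axis=1)
    bmu = D[:, protos].argmin(axis=1)               # final assignment
    return protos, bmu
```

Because prototypes are restricted to observed data points, no vector representation is ever required; a weighted-average update is only possible when the matrix is metric-like, which is exactly the case this variant avoids depending on.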
Because the algorithm only requires the (dis)similarity matrix, it can be applied to any data for which a meaningful pairwise comparison can be defined, even if the data are strings, graphs, sequences, or mixed‑type records. The method, however, inherits the O(N²) memory requirement of storing a full matrix, and it assumes the matrix is symmetric and non‑negative. The authors discuss possible extensions such as sparse matrix handling, asymmetric dissimilarities, and metric‑learning to alleviate these constraints.
To validate the approach, the authors conduct a case study on web‑log data from the Institut National de Recherche en Informatique et Automatique (INRIA). Each user session is treated as an observation. Experts construct a composite dissimilarity that combines three components: (i) edit distance between the ordered lists of pages visited, (ii) absolute time difference between session start times (log‑scaled), and (iii) Jaccard similarity of URL path structures. The resulting matrix captures both temporal and navigational similarities.
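As a concrete illustration, the three components could be combined as in the following Python sketch. The session field names, the normalisations, the one-day time cap, and the equal weights are assumptions for illustration only; the Jaccard similarity is converted to a dissimilarity so that all three terms point the same way. The paper's exact formula and weighting are not reproduced here.

```python
import math

def levenshtein(a, b):
    """Edit distance between two sequences (here: ordered page lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def jaccard_dissimilarity(a, b):
    """1 - |A ∩ B| / |A ∪ B| on sets of URL path components."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def session_dissimilarity(s1, s2, weights=(1.0, 1.0, 1.0),
                          time_scale=math.log1p(86400)):
    """Composite dissimilarity between two sessions (illustrative sketch).

    A session is assumed to be a dict with 'pages' (ordered list of
    visited pages), 'start' (start time in seconds), and 'paths'
    (set of URL path components).
    """
    w1, w2, w3 = weights
    # (i) edit distance on page sequences, normalised to [0, 1]
    longest = max(len(s1['pages']), len(s2['pages']), 1)
    d_nav = levenshtein(s1['pages'], s2['pages']) / longest
    # (ii) log-scaled absolute start-time difference, capped at one day
    d_time = min(math.log1p(abs(s1['start'] - s2['start'])) / time_scale, 1.0)
    # (iii) Jaccard-based dissimilarity of URL path structures
    d_path = jaccard_dissimilarity(s1['paths'], s2['paths'])
    return (w1 * d_nav + w2 * d_time + w3 * d_path) / (w1 + w2 + w3)
```

Feeding every pair of sessions through such a function yields the symmetric, non-negative matrix the adapted SOM consumes.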
Applying the adapted SOM to this matrix yields a 10 × 10 two‑dimensional map. Sessions with similar browsing patterns cluster in neighboring cells, revealing interpretable usage groups: one region corresponds to researchers downloading technical reports during late‑night hours, another to students accessing tutorial material during daytime, and a few isolated cells highlight rare internal project pages accessed by specific teams. The map therefore provides a topologically coherent visualization that is richer than simple frequency histograms or flat clustering.
The authors enumerate the strengths of their method: (1) it eliminates the need for ad‑hoc vectorisation of complex data, (2) it leverages domain expertise through custom dissimilarities, and (3) the batch learning scheme scales well to moderately large datasets. They also acknowledge limitations: the quadratic memory footprint, sensitivity to the choice of dissimilarity (which can introduce bias), and the fact that non‑metric dissimilarities may lead to prototype updates that are not true minimizers of a global objective.
Future work suggested includes: (i) exploiting sparse or block‑structured representations to reduce memory usage, (ii) extending the algorithm to handle asymmetric or partially missing dissimilarities, (iii) integrating metric‑learning techniques to automatically tune the weighting of the dissimilarity components, and (iv) adopting soft‑assignment (multiple BMUs) to better model ambiguous observations.
In summary, the paper presents a practical and theoretically grounded extension of the Self‑Organizing Map to arbitrary (dis)similarity data, demonstrates its applicability on a real‑world web‑log mining task, and outlines a roadmap for scaling and refining the approach for broader use in domains such as bioinformatics, social‑network analysis, and any setting where data are naturally described by pairwise comparisons rather than fixed‑length vectors.