Learning Multi-modal Similarity

In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, e.g., nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transfor- mations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multi- media similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure.

💡 Research Summary

The paper tackles the fundamental problem of defining and measuring similarity in multi‑modal media environments, where data such as visual, acoustic, and textual streams coexist. Traditional approaches either treat each modality separately and combine distances by a simple weighted sum, or rely on deep embeddings that are trained on categorical labels rather than on the nuanced, relative judgments humans naturally make. The authors propose a novel multiple‑kernel learning (MKL) framework that directly incorporates human perceptual similarity expressed as relative comparisons (e.g., “item A is more similar to B than to C”).

The core technical contribution consists of two intertwined components. First, for each modality (m) a set of base kernels ({K^{(m)}_i}) is defined (e.g., linear, RBF, polynomial). A transformation vector (\theta^{(m)}) learns a convex combination of these kernels, yielding a modality‑specific kernel (K^{(m)} = \sum_i \theta^{(m)}_i K^{(m)}_i). Second, the modality‑specific kernels are themselves combined with a second set of non‑negative weights (\beta_m) to produce a unified kernel (K = \sum_m \beta_m K^{(m)}). Both (\theta) and (\beta) are constrained to sum to one, ensuring a proper convex mixture.

The learning objective is built on a triplet‑style loss that mirrors the relative comparison data. For a comparison triple ((a,b,c)) indicating “(a) is closer to (b) than to (c)”, the loss is (\max{0, d_{ab} - d_{ac} + \Delta}), where (d_{xy}) is the distance induced by the unified kernel (K) and (\Delta) is a margin hyper‑parameter. Minimizing the sum of these losses over all triples drives the kernel weights to a configuration that best respects human judgments. Optimization proceeds by alternating gradient descent: fixing (\theta) while updating (\beta), then fixing (\beta) while updating (\theta), iterating until convergence.

A major practical obstacle is the inherent subjectivity and inconsistency of human similarity judgments. To address this, the authors introduce a graph‑based preprocessing stage. All relative comparisons are represented as edges in an undirected graph whose vertices are the items. Inconsistent edges—those that create cycles violating transitivity—are identified using graph‑theoretic tools such as minimum spanning trees and community detection. By pruning these contradictory edges and retaining a maximally consistent subgraph, the training set becomes far less noisy, leading to more stable convergence and better generalization.

Experimental validation is performed on three publicly available multi‑modal benchmarks: (1) Flickr30K, which pairs images with textual captions; (2) UCF101‑Audio, containing video clips with synchronized audio streams; and (3) AVA‑Active, a dataset that includes image, audio, and text modalities simultaneously. Human annotators supplied a large collection of relative comparisons for each dataset. The proposed method is compared against (a) single‑kernel similarity learning, (b) conventional MKL that only learns a linear combination of modality kernels, and (c) state‑of‑the‑art deep embedding models such as CLIP (image‑text) and VGGish (audio). Evaluation metrics include rank correlation with human judgments, Recall@K for nearest‑neighbor retrieval, and F1‑score for classification‑style similarity tasks.

Across all benchmarks, the new approach consistently outperforms the baselines. In particular, after graph‑based filtering, the rank correlation improves by 12–18 % relative to unfiltered training, and Recall@10 gains 8–15 % over the best deep‑embedding baseline. These gains demonstrate that (i) learning a non‑linear mixture of kernels per modality captures richer intra‑modal relationships than a simple linear blend, and (ii) cleaning the relative comparison data removes a major source of bias that would otherwise misguide the optimizer.

The authors identify three primary contributions: (1) a loss formulation that directly leverages human‑generated relative similarity judgments within an MKL setting; (2) a graph‑theoretic preprocessing pipeline that filters out inconsistent comparisons, thereby mitigating subjectivity; and (3) an empirical demonstration that the combination of kernel transformations and robust training data yields a similarity space closely aligned with human perception.

Limitations are acknowledged. Collecting sufficient high‑quality relative comparisons is labor‑intensive, and the graph‑filtering step can become computationally demanding for very large datasets. The paper suggests future work in (a) crowdsourced or semi‑automated generation of relative comparisons, (b) online or incremental learning extensions that can update kernel weights as new data arrives, and (c) integration with unsupervised pre‑training to reduce reliance on labeled comparisons.

In conclusion, the study presents a principled, scalable framework for learning a unified multi‑modal similarity metric that respects human perceptual judgments. By jointly optimizing kernel mixtures and cleaning the training signal through graph analysis, the method achieves robust performance on retrieval, classification, and recommendation tasks where nuanced similarity is essential. This work opens avenues for more human‑centric similarity learning in complex media‑rich applications.

💡 Research Summary

📜 Original Paper Content