Learning Contextualized Semantics from Co-occurring Terms via a Siamese Architecture

One of the biggest challenges in multimedia information retrieval and understanding is to bridge the semantic gap by properly modeling concept semantics in context. The presence of out-of-vocabulary (OOV) concepts exacerbates this difficulty. To address the semantic gap, we formulate the problem of learning contextualized semantics from descriptive terms and propose a novel Siamese architecture to model it. By means of pattern aggregation and probabilistic topic models, our Siamese architecture captures contextualized semantics from co-occurring descriptive terms via unsupervised learning, which leads to a concept embedding space for terms in context. Furthermore, co-occurring OOV concepts can be easily represented in the learnt concept embedding space. The main properties of the concept embedding space are demonstrated via visualization. Using various settings in semantic priming, we carry out a thorough evaluation, comparing our approach to a number of state-of-the-art methods on six annotation corpora in different domains: MagTag5K, CAL500 and the Million Song Dataset in the music domain, as well as Corel5K, LabelMe and SUN Database in the image domain. Experimental results on semantic priming suggest that our approach considerably outperforms those state-of-the-art methods in various aspects.


💡 Research Summary

This paper tackles the persistent “semantic gap” problem in multimedia information retrieval by learning contextualized meanings of descriptive terms directly from their co‑occurrence patterns. The authors observe that traditional tag‑based or global‑embedding approaches ignore the fact that the same term can convey different meanings depending on the surrounding terms, and they also struggle with out‑of‑vocabulary (OOV) concepts that were not seen during training. To address these issues, the authors propose a novel Siamese neural architecture that integrates pattern aggregation, probabilistic topic modeling, and a contrastive learning objective.

The methodology consists of four main stages. First, a pattern‑aggregation step builds a term‑term co‑occurrence matrix from the entire corpus and applies statistical weighting (e.g., PMI) to obtain a robust representation of each term’s local context. Second, these context vectors are fed into a probabilistic topic model (LDA or a variant of PLSA) to derive a latent‑topic distribution for every term, effectively mapping each term into a lower‑dimensional latent semantic space that captures global co‑occurrence structure. Third, a Siamese network with two weight‑sharing sub‑networks processes pairs of term sets (e.g., two different contexts). Each sub‑network receives the topic distribution of its input set, passes it through several fully‑connected layers, and produces a dense embedding. The training objective combines a contrastive loss—pulling together embeddings from semantically similar contexts and pushing apart those from different contexts—with a reconstruction loss that forces each embedding to retain information about its original topic distribution. This dual‑loss formulation encourages the model to learn embeddings that are both context‑sensitive and semantically coherent.
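The summary does not give the exact loss formulae, so the following is a minimal sketch of the dual objective, assuming the standard margin-based contrastive form and a mean-squared reconstruction term; the function names, the margin value, and the weighting factor `alpha` are illustrative assumptions, not the paper's notation:

```python
import numpy as np

def contrastive_loss(e1, e2, same, margin=1.0):
    """Margin-based contrastive loss over a pair of branch embeddings.
    same=1 pulls the pair together; same=0 pushes it beyond the margin."""
    d = np.linalg.norm(np.asarray(e1, dtype=float) - np.asarray(e2, dtype=float))
    return same * d ** 2 + (1 - same) * max(0.0, margin - d) ** 2

def dual_loss(e1, e2, recon1, recon2, topics1, topics2, same, alpha=0.5):
    """Contrastive term plus a reconstruction term that forces each
    embedding to retain its original topic distribution (assumed MSE)."""
    recon = np.mean((np.asarray(recon1, dtype=float) - np.asarray(topics1, dtype=float)) ** 2) \
          + np.mean((np.asarray(recon2, dtype=float) - np.asarray(topics2, dtype=float)) ** 2)
    return contrastive_loss(e1, e2, same) + alpha * recon
```

With this shape, a similar pair with identical embeddings and perfect reconstruction incurs zero loss, while a dissimilar pair is penalized only when its embeddings fall inside the margin.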

A key contribution is the handling of OOV terms. When an unseen term appears, the model estimates its topic distribution by aggregating the distributions of the known terms that co‑occur with it, typically via a weighted average. This estimated distribution is then passed through the already‑trained Siamese network, yielding an embedding that naturally resides in the same space as all in‑vocabulary terms, without any additional parameter updates.
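The weighted-average projection described above can be sketched as follows; this is an illustrative reading of the mechanism (the function name and the choice of raw co-occurrence counts as weights are assumptions), not the paper's implementation:

```python
import numpy as np

def oov_topic_distribution(cooc_counts, neighbour_topics):
    """Estimate a topic distribution for an unseen term as the
    co-occurrence-weighted average of its in-vocabulary neighbours.

    cooc_counts      -- how often the OOV term co-occurs with each neighbour
    neighbour_topics -- one topic distribution (row) per neighbour
    """
    w = np.asarray(cooc_counts, dtype=float)
    w = w / w.sum()                                   # normalise the weights
    est = w @ np.asarray(neighbour_topics, dtype=float)
    return est / est.sum()                            # renormalise to a distribution
```

The estimated distribution can then be fed through the trained Siamese branch exactly like any in-vocabulary term, which is why no parameter updates are needed.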

The authors evaluate the approach on six annotation corpora spanning two modalities: music (MagTag5K, CAL500, Million Song Dataset) and images (Corel5K, LabelMe, SUN Database). The primary evaluation task is semantic priming, where a query term must retrieve its most semantically related terms. Performance is measured using precision@k, mean average precision (MAP), and normalized discounted cumulative gain (NDCG). Baselines include classic word‑embedding models (Word2Vec, GloVe, FastText), pure topic‑model embeddings, and recent deep‑learning multi‑label classifiers. Across all datasets and metrics, the proposed Siamese‑based method outperforms the baselines by a substantial margin, especially in scenarios with high contextual ambiguity or a large proportion of OOV terms. Visualization with t‑SNE further demonstrates that identical terms placed in different contexts form distinct clusters, confirming that the model captures contextual nuances.
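For concreteness, the first two retrieval metrics mentioned above have standard definitions and can be sketched as below; this is a generic reference implementation, not code from the paper:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved terms that are relevant."""
    return sum(1 for term in ranked[:k] if term in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant
    term appears; averaging this over queries gives MAP."""
    hits, score = 0, 0.0
    for i, term in enumerate(ranked, start=1):
        if term in relevant:
            hits += 1
            score += hits / i
    return score / max(len(relevant), 1)
```

For example, with ranking `['a', 'b', 'c']` and relevant set `{'a', 'c'}`, precision@2 is 0.5 and average precision is (1/1 + 2/3) / 2 = 5/6.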

The paper’s contributions can be summarized as follows: (1) a hybrid framework that fuses co‑occurrence pattern aggregation with probabilistic topic modeling to generate context‑aware term representations; (2) a Siamese architecture with a combined contrastive‑reconstruction loss that learns embeddings reflecting both local context and global semantic structure; (3) an OOV handling mechanism that projects unseen terms into the learned embedding space without retraining; and (4) extensive empirical validation across multiple domains showing the method’s robustness and superiority.

Limitations are acknowledged. The topic‑modeling stage incurs significant computational overhead on very large corpora, which may hinder real‑time deployment; future work could explore online or streaming LDA variants to mitigate this cost. Additionally, the current design uses only a pairwise Siamese setup; extending the architecture to handle multiple contexts simultaneously (e.g., a multi‑branch network) could further enrich the learned semantics. Overall, the paper presents a compelling solution to contextual semantic learning and opens avenues for more nuanced multimedia retrieval systems.

