Enhancing Multimodal Representation Learning via Feature-Adaptive Noise Injection

📝 Abstract

Representation learning is fundamental to modern machine learning, powering applications such as text retrieval and multimodal understanding. However, learning robust and generalizable representations remains challenging. While prior work has demonstrated that active noise injection, a form of data augmentation, can enhance encoding performance, most existing methods rely on heuristic or static noise, overlooking the dynamic nature of feature distributions during training. In this work, we systematically study the role of noise in representation learning from both gradient-based and feature distribution perspectives, using the InfoNCE loss as a representative example. Focusing on multimodal representation learning, we propose FANoise, a novel feature-adaptive noise injection strategy. By leveraging the dynamics of contrastive learning, FANoise effectively mitigates the negative impacts of noise while preserving its benefits. Under this theoretically grounded framework, comprehensive experiments demonstrate that FANoise consistently improves overall performance on multimodal tasks across various base VLMs.

📄 Content

FANoise: Singular Value-Adaptive Noise Modulation for Robust Multimodal Representation Learning

Jiaoyang Li*, Jun Fang*, Tianhao Gao, Xiaohui Zhang, Zhiyuan Liu, Chao Liu†, Pengzhang Liu, Qixia Jiang
JD Retail, Beijing, China
{lijiaoyang7, fangjun8, gaotianhao1, zhangxiaohui40, liuzhiyuan8, liuchao397, liupengzhang, jiangqixia}@jd.com

Introduction

Representation learning, which aims to capture meaningful and transferable features from raw data, has become a cornerstone of modern machine learning. It plays a pivotal role across diverse applications, from text retrieval (e.g., Xiao et al. 2023; Li et al. 2023c) to multimodal understanding (e.g., Radford et al. 2021; Jia et al. 2021; Jiang et al. 2024b; Wei et al. 2024; Ren et al. 2024; Zhang et al. 2024a; Liu et al. 2022).
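As background for the analysis that follows, the InfoNCE contrastive objective the paper builds on can be sketched minimally as below. This is a plain NumPy version of the standard symmetric image-text form; the temperature value and batch layout are conventional choices, not taken from the paper:

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    z_a, z_b: (N, d) L2-normalized embeddings of matched pairs
    (e.g., image and text features). Pair i is the positive;
    all other rows in the batch serve as negatives.
    """
    logits = z_a @ z_b.T / temperature       # (N, N) similarity matrix
    labels = np.arange(len(z_a))             # positives lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(len(y)), y].mean()

    # Symmetric form: image-to-text plus text-to-image
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Because each positive pair is contrasted against every other sample in the batch, any perturbation of the embeddings shifts the whole logit matrix at once, which is why the paper studies noise through the gradient of this loss.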
Despite its remarkable successes, learning robust and generalizable representations remains challenging. In multimodal representation learning, contrastive learning frameworks such as CLIP (Radford et al. 2021), ALIGN (Jia et al. 2021), SigLIP (Zhai et al. 2023), and BLIP (Li et al. 2022) have achieved significant progress. Most current multimodal embedding methods primarily focus on architectural innovations (Li et al. 2022; Wei et al. 2024; Ren et al. 2024; Zhang et al. 2024a; Liu et al. 2022) or on enriching training data via data augmentation (Zhang et al. 2024b; Chen et al. 2025; Zhou et al. 2024), such as applying transformations to existing samples or generating new synthetic data. While data augmentation increases data diversity and thereby improves model robustness, it does so by implicitly introducing variations or perturbations at the data level. In contrast, explicit noise injection strategies operate directly at the representation or feature level, offering a more controllable and theoretically analyzable approach to improving robustness and generalization. However, the systematic study of such representation-level noise injection, especially in complex multimodal settings, remains largely unexplored.

Recently, active noise injection approaches such as CLAE (Ho and Vasconcelos 2020), SimCSE (Gao, Yao, and Chen 2021), and NEFTune (Jain et al. 2023) have demonstrated the effectiveness of controlled noise injection strategies, including adversarial noise, dropout masks, and feature perturbations, in improving representation quality, particularly in unimodal scenarios. These studies highlight the crucial role of both explicit and implicit noise injection mechanisms in representation learning.

*These authors contributed equally. †Corresponding author.
Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
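To make the "static noise" limitation concrete, here is a sketch of the NEFTune-style feature perturbation mentioned above, assuming the commonly reported scheme of uniform noise scaled by a fixed hyperparameter alpha over the embedding dimensions. Note how the noise magnitude is set once and does not adapt to the feature distribution as training evolves:

```python
import numpy as np

def neftune_noise(embeddings, alpha=5.0, rng=None):
    """NEFTune-style additive noise on a sequence of token embeddings.

    embeddings: (L, d) array for one sequence.
    Noise is uniform in [-1, 1], scaled by alpha / sqrt(L * d), so its
    expected magnitude is independent of sequence length and width.
    Applied only during training; inference uses clean embeddings.
    The scale is static: it ignores the current feature statistics.
    """
    rng = rng or np.random.default_rng()
    L, d = embeddings.shape
    scale = alpha / np.sqrt(L * d)
    return embeddings + rng.uniform(-1.0, 1.0, size=(L, d)) * scale
```

This is exactly the heuristic, distribution-agnostic behavior the next paragraph's open questions target.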
However, most existing methods rely on heuristic or static noise augmentation schemes, without explicitly modeling the underlying feature distributions or adapting to the dynamic nature of training. This leads to several fundamental open questions:

• What are the underlying mechanisms by which noise injection affects representation learning?
• How can we design and adapt optimal noise injection strategies for different feature structures and learning scenarios?
• How do theoretical principles of noise injection lead to practical improvements in robustness and generalization?

Addressing these questions is essential for developing a systematic and theoretically grounded framework for multimodal representation learning.

Motivated by these open questions and the limitations of existing approaches, this work systematically investigates the role of noise in multimodal representation learning from both gradient-based and feature distribution perspectives. Specifically, we analyze how noise injection affects the gradient dynamics of the InfoNCE loss and examine its impact on feature distributions through the lens of singular values.
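The paper's title points to modulating noise by the singular values of the feature matrix. The exact FANoise formulation is not included in this excerpt, so the following is only a hypothetical illustration of what "feature-adaptive" noise could look like: noise is injected along the batch's singular directions with per-direction scales tied to the singular values, rather than isotropically.

```python
import numpy as np

def feature_adaptive_noise(features, alpha=0.1, rng=None):
    """Hypothetical singular value-adaptive noise injection.

    features: (N, d) batch of embeddings. We take the SVD of the
    centered batch, then inject Gaussian noise along each right-
    singular direction with a scale proportional to that direction's
    singular value, so the noise follows the current feature
    distribution instead of being static and isotropic.
    Illustrative sketch only; NOT the authors' FANoise formulation.
    """
    rng = rng or np.random.default_rng()
    N, d = features.shape
    # Right-singular vectors (rows of vt) and singular values of the batch
    _, s, vt = np.linalg.svd(features - features.mean(axis=0),
                             full_matrices=False)
    # Per-direction scales, normalized so the dominant direction gets alpha
    scales = alpha * s / (s.max() + 1e-12)
    noise = (rng.normal(size=(N, len(s))) * scales) @ vt
    return features + noise
```

Because the scales are recomputed from each batch, the perturbation automatically tracks how the feature distribution changes over training, which is the property the static schemes above lack.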

This content is AI-processed based on ArXiv data.
