Noise-aware few-shot learning through bi-directional multi-view prompt alignment

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Vision-language models offer strong few-shot capability through prompt tuning but remain vulnerable to noisy labels, which can corrupt prompts and degrade cross-modal alignment. Existing approaches struggle because they often lack the ability to model fine-grained semantic cues and to adaptively separate clean from noisy signals. To address these challenges, we propose NA-MVP, a framework for Noise-Aware few-shot learning through bi-directional Multi-View Prompt alignment. NA-MVP is built upon a key conceptual shift: robust prompt learning requires moving from global matching to region-aware alignment that explicitly distinguishes clean cues from noisy ones. To realize this, NA-MVP employs (1) multi-view prompts combined with unbalanced optimal transport to achieve fine-grained patch-to-prompt correspondence while suppressing unreliable regions; (2) a bi-directional prompt design that captures complementary clean-oriented and noise-aware cues, enabling the model to focus on stable semantics; and (3) an alignment-guided selective refinement strategy that uses optimal transport to correct only mislabeled samples while retaining reliable data. Experiments on synthetic and real-world noisy benchmarks demonstrate that NA-MVP consistently outperforms state-of-the-art baselines, confirming its effectiveness in enabling robust few-shot learning under noisy supervision.


💡 Research Summary

The paper tackles the problem of label noise in few‑shot learning with vision‑language models (VLMs). Prompt tuning has shown strong few‑shot capabilities, but when only a handful of labeled examples are available, even a small amount of mislabeled data can severely bias the learned prompts and degrade cross‑modal alignment. Existing methods for learning with noisy labels (LNL) in VLMs typically rely on global image‑text matching and single‑view or explicit negative prompts, which are insufficient for capturing fine‑grained semantic cues and for separating clean from corrupted signals.

To address these issues, the authors propose NA‑MVP (Noise‑Aware Multi‑View Prompt alignment). The core idea is to move from global matching to region‑aware alignment and to use a bi‑directional, multi‑view prompt design that explicitly models both clean‑oriented and noise‑aware semantics. For each class, NA‑MVP learns N clean prompts and N noise‑aware prompts, each consisting of M learnable context tokens plus a class token. These prompts are encoded by the VLM’s text encoder into two sets of feature matrices, G_c^k (clean) and G_n^k (noise‑aware).
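As a concrete illustration of the shapes involved, here is a minimal numpy sketch of the multi‑view prompt construction. All names and dimensions are illustrative assumptions, and a simple mean‑pooling function stands in for the VLM's text encoder; this is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

K, N, M, d = 3, 4, 16, 512  # classes, prompts per view, context tokens, embed dim

# Learnable context tokens for the clean and noise-aware views (shared across
# classes here for simplicity), plus one class token per class.
ctx_clean = rng.standard_normal((N, M, d))
ctx_noise = rng.standard_normal((N, M, d))
cls_tok = rng.standard_normal((K, 1, d))

def encode(ctx, cls_k):
    """Stand-in text encoder: concatenate context + class token, mean-pool,
    then L2-normalize (the real model uses CLIP's text transformer)."""
    tokens = np.concatenate(
        [ctx, np.broadcast_to(cls_k, (ctx.shape[0], 1, d))], axis=1)
    feats = tokens.mean(axis=1)  # (N, d)
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

# Per-class prompt feature matrices G_c^k (clean) and G_n^k (noise-aware)
G_c = np.stack([encode(ctx_clean, cls_tok[k]) for k in range(K)])  # (K, N, d)
G_n = np.stack([encode(ctx_noise, cls_tok[k]) for k in range(K)])  # (K, N, d)
```

Each class thus contributes N clean and N noise‑aware prompt features, which are later matched against image patches.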

Given an image, the visual encoder extracts a global feature f_i and a dense map of L local patch features F_i = {f_l}. The local patches are aligned with the prompt features via Unbalanced Optimal Transport (UOT). UOT relaxes the strict mass‑conservation constraint of classic OT, allowing partial matching between patches and prompts. This prevents noisy or irrelevant patches from forcing a full alignment that would corrupt the prompt learning. The cost matrix is defined as C_k = 1 – F_i G_k^T (cosine‑based), and the UOT problem is solved with entropic regularization using a Sinkhorn‑type algorithm. The resulting transport plan yields clean and noisy alignment scores s_c(i,k) and s_n(i,k), which are turned into probabilities p_c(i,k) and p_n(i,k) via a temperature‑scaled softmax.
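The patch‑to‑prompt step can be sketched with a generalized Sinkhorn solver for entropically regularized UOT with KL‑relaxed marginals. The solver below and its parameters (`eps`, `tau`, uniform reference marginals) are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def uot_sinkhorn(C, a, b, eps=0.05, tau=0.5, n_iter=200):
    """Entropic unbalanced OT via generalized Sinkhorn scaling.
    C: (L, N) cost matrix; a: (L,), b: (N,) reference marginals.
    The exponent tau/(tau+eps) softens the marginal constraints,
    allowing partial matching between patches and prompts."""
    Kmat = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    fi = tau / (tau + eps)
    for _ in range(n_iter):
        u = (a / (Kmat @ v)) ** fi
        v = (b / (Kmat.T @ u)) ** fi
    return u[:, None] * Kmat * v[None, :]  # transport plan T

# Toy example: L patch features vs. N prompt features of one class.
rng = np.random.default_rng(1)
L, N, d = 6, 4, 32
F = rng.standard_normal((L, d)); F /= np.linalg.norm(F, axis=1, keepdims=True)
G = rng.standard_normal((N, d)); G /= np.linalg.norm(G, axis=1, keepdims=True)
C = 1.0 - F @ G.T                      # cosine-based cost, as in the paper
T = uot_sinkhorn(C, np.full(L, 1 / L), np.full(N, 1 / N))
score = (T * (F @ G.T)).sum()          # alignment score for this class
```

Computing such a score against the clean and noise‑aware prompt sets of every class yields s_c(i,k) and s_n(i,k), which a temperature‑scaled softmax turns into p_c(i,k) and p_n(i,k).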

An auxiliary Image‑Text Bi‑directional Prompt (ITBP) loss is introduced to reinforce the separation of clean and noisy semantics. ITBP encourages images to be close to clean prompts while pushing them away from the corresponding noise‑aware prompts and unrelated negatives, effectively stabilizing the bi‑directional learning dynamics.
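One common way to realize such an objective is an InfoNCE‑style contrastive loss. The sketch below treats the clean prompt of the labeled class as the positive and the noise‑aware and other‑class prompts as negatives; this is our reading of the ITBP idea, not the paper's exact formulation.

```python
import numpy as np

def itbp_loss(img, g_clean, g_noise, label, temp=0.07):
    """InfoNCE-style sketch: pull the image toward the clean prompt of its
    labeled class; the noise-aware prompts and other classes' clean prompts
    act as negatives. img: (d,); g_clean, g_noise: (K, d) unit-norm rows."""
    feats = np.concatenate([g_clean, g_noise], axis=0)  # clean first: (2K, d)
    sims = feats @ img / temp
    m = sims.max()
    logZ = m + np.log(np.exp(sims - m).sum())           # stable log-sum-exp
    return logZ - sims[label]                           # positive at index `label`

# Toy check: an image matching its clean prompt incurs a lower loss.
g_clean = np.eye(2, 8)       # 2 classes, d = 8; rows are unit vectors
g_noise = -np.eye(2, 8)      # noise-aware prompts point the opposite way
img = g_clean[0]
loss_true = itbp_loss(img, g_clean, g_noise, label=0)
loss_wrong = itbp_loss(img, g_clean, g_noise, label=1)
```

The pull toward the positive and push away from the remaining rows happen in a single softmax, which is what stabilizes the bi‑directional learning dynamics described above.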

For label correction, NA‑MVP leverages the alignment probabilities. An adaptive threshold φ, derived from the distribution of p_c and p_n, identifies samples whose labels are likely corrupted. Those samples undergo selective refinement: a classic (mass‑preserving) OT aligns the global image feature with class‑level clean text features, producing a corrected label. This selective refinement avoids the over‑correction problems of global pseudo‑labeling and keeps clean hard examples untouched.
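A minimal sketch of this selective refinement follows, assuming a simple mean‑plus‑std form for the adaptive threshold φ (the paper derives φ from the distribution of p_c and p_n; the exact rule is not reproduced here, and the helper names are ours).

```python
import numpy as np

def sinkhorn(C, eps=0.05, n_iter=300):
    """Classic (mass-preserving) entropic OT with uniform marginals."""
    B, K = C.shape
    Kmat = np.exp(-C / eps)
    a, b = np.full(B, 1 / B), np.full(K, 1 / K)
    u, v = np.ones(B), np.ones(K)
    for _ in range(n_iter):
        u = a / (Kmat @ v)
        v = b / (Kmat.T @ u)
    return u[:, None] * Kmat * v[None, :]

def selective_refine(p_c, p_n, f_global, G_class, labels):
    """Flag samples whose noise-aware probability exceeds the clean one for
    their given label by an adaptive margin phi, then relabel only those via
    balanced OT against class-level clean text features. Clean and hard-but-
    clean samples keep their original labels."""
    idx = np.arange(len(labels))
    gap = p_n[idx, labels] - p_c[idx, labels]
    phi = gap.mean() + gap.std()            # stand-in adaptive threshold
    suspect = gap > phi
    labels = labels.copy()
    if suspect.any():
        C = 1.0 - f_global[suspect] @ G_class.T   # cosine cost, global features
        T = sinkhorn(C)
        labels[suspect] = T.argmax(axis=1)        # corrected labels
    return labels, suspect

# Toy run with random probabilities and unit-norm features.
rng = np.random.default_rng(2)
B, K, d = 8, 3, 16
p_c, p_n = rng.random((B, K)), rng.random((B, K))
f = rng.standard_normal((B, d)); f /= np.linalg.norm(f, axis=1, keepdims=True)
G = rng.standard_normal((K, d)); G /= np.linalg.norm(G, axis=1, keepdims=True)
y = rng.integers(0, K, size=B)
y_new, suspect = selective_refine(p_c, p_n, f, G, y)
```

Because only flagged samples pass through the OT relabeling, reliable annotations are never overwritten, which is the over‑correction safeguard described above.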

The framework is evaluated on several benchmarks: CIFAR‑10/100, ImageNet‑R, and the real‑world noisy dataset Clothing1M. Synthetic symmetric and asymmetric noise is injected at varying rates (10%–80%), while Clothing1M supplies naturally occurring label noise. NA‑MVP consistently outperforms strong baselines such as CoOp, CoCoOp, PLOT, and CLIPN, achieving 3–7 percentage‑point gains in 1‑shot and 2‑shot settings, especially when noise exceeds 40%. Ablation studies confirm that each component—multi‑view prompts, UOT‑based fine‑grained alignment, ITBP loss, and selective OT refinement—contributes positively to the final performance.

The authors also discuss limitations: the multi‑view prompt set and UOT computation increase memory and runtime, particularly for high‑resolution images or large embedding dimensions. Convergence of the Sinkhorn iterations can be sensitive to the regularization parameter ε and the choice of marginal masses. Future work may explore more efficient OT approximations, adaptive prompt cardinality, or integration with other robust loss functions.

In summary, NA‑MVP introduces a novel combination of region‑level optimal‑transport alignment, bi‑directional multi‑view prompting, and alignment‑guided selective label refinement. This synergy yields a VLM‑based few‑shot learner that is markedly more robust to label noise, offering a practical solution for real‑world applications where clean annotations are scarce and noisy supervision is unavoidable.

