Adaptive Debiasing Tsallis Entropy for Test-Time Adaptation
Mainstream Test-Time Adaptation (TTA) methods for adapting vision-language models, e.g., CLIP, typically rely on Shannon Entropy (SE) at test time to measure prediction uncertainty and inconsistency. However, since CLIP carries a built-in bias from pretraining on highly imbalanced web-crawled data, SE inevitably produces biased uncertainty estimates. To address this issue, we find and demonstrate that Tsallis Entropy (TE), a generalized form of SE, is naturally suited to characterizing biased distributions through a non-extensive parameter q, with the performance of SE serving as a lower bound for TE. Building upon this, we generalize TE into Adaptive Debiasing Tsallis Entropy (ADTE) for TTA, customizing for each category a class-specific parameter q^l derived by normalizing the label bias estimated from continuously incoming test instances. This adaptive approach allows ADTE to accurately select high-confidence views and to integrate seamlessly with a label adjustment strategy that enhances adaptation, without introducing distribution-specific hyperparameter tuning. Moreover, our investigation reveals that both TE and ADTE can serve as direct, drop-in alternatives to SE in TTA, without any other modifications. Experimental results show that ADTE outperforms state-of-the-art methods on ImageNet and its five variants, and achieves the highest average performance on 10 cross-domain benchmarks, regardless of the model architecture or text prompts used. Our code is available at https://github.com/Jinx630/ADTE.
💡 Research Summary
The paper tackles a fundamental flaw in current test‑time adaptation (TTA) methods for vision‑language models (VLMs) such as CLIP. Existing approaches rely on Shannon entropy (SE) to gauge prediction uncertainty and to select high‑confidence augmented views. Because CLIP is pretrained on heavily imbalanced web‑crawled data, its output probabilities are biased: head classes receive overly high confidence while tail classes are under‑confident. This bias contaminates SE, which treats every probability with the same $-p\log p$ formula, leading to distorted uncertainty estimates and sub‑optimal view selection.
To remedy this, the authors introduce Tsallis entropy (TE), a non‑extensive generalization of SE that adds a tunable parameter $q$. TE is defined as

$$H_{TE}(P)=\frac{1}{1-q}\Big(\sum_{l}p_l^{q}-1\Big).$$
When $q \to 1$, TE reduces to SE, establishing SE as a special case. The key insight is that for $q<1$ the entropy surface becomes "sharper": low‑confidence (tail) predictions are penalized less, and the entropy values of high‑confidence views drop more dramatically. Empirically, the authors show that decreasing $q$ raises the Top‑K Cumulative Reliability (the sum of the K highest similarity scores) of the selected views, meaning TE with $q<1$ consistently picks more accurate candidates than SE.
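The $q \to 1$ limit and the behavior at $q<1$ are easy to check numerically; a minimal NumPy sketch, with made-up probability vectors for illustration:

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy -sum p*log(p), natural log, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def tsallis_entropy(p, q):
    """Tsallis entropy (sum p^q - 1) / (1 - q), for q != 1."""
    p = np.asarray(p, dtype=float)
    return (np.sum(p ** q) - 1.0) / (1.0 - q)

# Made-up probability vectors: a confident and a near-flat prediction.
confident = np.array([0.90, 0.05, 0.05])
flat = np.array([0.40, 0.35, 0.25])

# As q -> 1, Tsallis entropy converges to Shannon entropy.
print(tsallis_entropy(confident, 0.999999))  # ~= shannon_entropy(confident)
print(shannon_entropy(confident))

# Entropy values at several q < 1 for both predictions.
for q in (0.3, 0.7):
    print(q, tsallis_entropy(confident, q), tsallis_entropy(flat, q))
```

In every case the confident view scores lower entropy than the flat one, so ranking views by TE still orders them sensibly at any $q<1$.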
However, a fixed $q$ is impractical because optimal values vary across domains and even across classes within a single domain. The authors therefore propose Adaptive Debiasing Tsallis Entropy (ADTE). ADTE estimates the label bias for each class on the fly by monitoring the incoming stream of test instances. The per‑class bias is min‑max normalized to produce a class‑specific parameter $q^{(l)}$. Tail classes, which exhibit larger bias, receive smaller $q^{(l)}$ (closer to 0), while head classes obtain values near 1. This dynamic adjustment tailors the entropy landscape to the current distribution, effectively debiasing the uncertainty measure.
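A minimal sketch of this normalization step. The bias estimate (one minus the running predicted-label frequency) and the target range $[0.1, 1.0]$ are illustrative assumptions here, not necessarily the paper's exact estimator:

```python
import numpy as np

def adaptive_q(bias_estimate, q_min=0.1, q_max=1.0):
    """Map per-class bias estimates to class-specific q values.

    Classes with larger estimated bias (tail classes) get smaller q;
    classes with smaller bias get q near 1. The bias estimator and the
    [q_min, q_max] range are hypothetical choices for illustration.
    """
    b = np.asarray(bias_estimate, dtype=float)
    normed = (b - b.min()) / (b.max() - b.min() + 1e-12)  # min-max normalize
    return q_max - (q_max - q_min) * normed

# Hypothetical running counts of predicted labels over the test stream,
# ordered head -> tail; rare predictions signal larger bias.
pred_counts = np.array([500, 120, 30, 5])
bias = 1.0 - pred_counts / pred_counts.sum()
q_per_class = adaptive_q(bias)
print(q_per_class)  # head class gets q near 1, tail classes smaller q
```

Because the counts update with every incoming instance, $q^{(l)}$ tracks the test distribution online with no dataset-specific tuning.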
The ADTE‑based TTA pipeline works as follows: (1) each test image is augmented N times to generate a set of views; (2) for each view the ADTE entropy is computed; (3) the τ‑fraction of views with the lowest entropy is kept as "high‑confidence views"; (4) these views are either averaged (ensemble) or combined with a logit‑adjustment (LA) step that further compensates for class imbalance using the same $q^{(l)}$ values. The final prediction is obtained by aggregating the adjusted logits of the selected views.
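The selection-and-ensemble steps can be sketched as below. The per-class entropy form $\sum_l (p_l^{q_l} - p_l)/(1-q_l)$, which collapses to TE when all $q_l$ share one value, is only one plausible reading of ADTE and is assumed here; the plain-ensemble variant (no logit adjustment) is shown:

```python
import numpy as np

rng = np.random.default_rng(0)

def adte(p, q):
    """Assumed per-class generalization: sum_l (p_l^q_l - p_l) / (1 - q_l).

    For a single shared q (< 1) this equals (sum p^q - 1)/(1 - q),
    since the probabilities sum to 1."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum((p ** q - p) / (1.0 - q))

def select_and_ensemble(view_probs, q_per_class, tau=0.1):
    """Keep the tau-fraction of augmented views with lowest ADTE,
    then average their probabilities (plain ensemble)."""
    ent = np.array([adte(p, q_per_class) for p in view_probs])
    k = max(1, int(tau * len(view_probs)))
    keep = np.argsort(ent)[:k]  # lowest entropy = highest confidence
    return view_probs[keep].mean(axis=0)

# Toy example: N=64 augmented views over 4 classes (random softmax outputs).
logits = rng.normal(size=(64, 4))
view_probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
q_per_class = np.array([0.95, 0.7, 0.4, 0.2])  # hypothetical class-wise q
pred = select_and_ensemble(view_probs, q_per_class, tau=0.1)
print(pred, pred.argmax())
```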
Theoretical analysis demonstrates that TE's non‑extensive term $(1-q)H_{TE}(A)H_{TE}(B)$ captures interactions between independent components, which SE ignores. By examining the function $F(p,q)=p^{q}/(1-q)+p\log p$, the authors prove that for $0<q<1$ the correction term is positive, i.e., TE yields higher entropy than SE for small probabilities, thereby mitigating the under‑confidence of tail classes.
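The cross term comes from Tsallis pseudo-additivity: for independent $A$ and $B$, $H_{TE}(A,B)=H_{TE}(A)+H_{TE}(B)+(1-q)H_{TE}(A)H_{TE}(B)$. This identity is easy to verify numerically with made-up distributions:

```python
import numpy as np

def tsallis(p, q):
    """Tsallis entropy (sum p^q - 1) / (1 - q)."""
    p = np.asarray(p, dtype=float)
    return (np.sum(p ** q) - 1.0) / (1.0 - q)

q = 0.5
a = np.array([0.7, 0.2, 0.1])   # distribution of A (made up)
b = np.array([0.6, 0.4])        # distribution of B, independent of A
joint = np.outer(a, b).ravel()  # joint distribution of (A, B)

lhs = tsallis(joint, q)
rhs = tsallis(a, q) + tsallis(b, q) + (1.0 - q) * tsallis(a, q) * tsallis(b, q)
print(abs(lhs - rhs))  # ~0: pseudo-additivity holds
```

At $q\to1$ the cross term vanishes and the identity reduces to Shannon's ordinary additivity for independent variables, which is exactly the interaction SE discards.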
Extensive experiments validate ADTE across multiple model backbones (CLIP‑ViT‑B/32, CLIP‑ViT‑L/14, CLIP‑ViT‑H/14) and prompt variations ("a photo of {class}", "an image of {class}", etc.). On five ImageNet variants (R, A, Sketch, V2, C) and ten cross‑domain benchmarks (Office‑Home, DomainNet, VisDA‑2017, PACS, VLCS, etc.), ADTE consistently outperforms state‑of‑the‑art TTA methods such as Zero, TPT, DiffTPT, and ML‑TTA. Average accuracy gains range from 1.5 to 3.2 percentage points, and ADTE achieves the highest mean performance across all ten cross‑domain datasets. Importantly, ADTE requires no dataset‑specific hyper‑parameter tuning; the class‑wise $q^{(l)}$ are computed automatically. Ablation studies show that (i) fixing $q$ degrades performance dramatically, (ii) the method remains robust when bias estimates are noisy, and (iii) combining ADTE with logit adjustment yields an additional 0.5–1.0 % improvement over the plain ensemble.
In summary, the paper makes three major contributions: (1) it identifies and theoretically characterizes the bias introduced by SE in TTA for imbalanced VLMs; (2) it demonstrates that Tsallis entropy with $q<1$ naturally corrects this bias and serves as a stronger uncertainty metric; (3) it proposes ADTE, a fully automatic, class‑adaptive entropy measure that integrates seamlessly with existing TTA pipelines and logit‑adjustment techniques. The result is a simple yet powerful framework that substantially boosts test‑time adaptation performance without extra training, hyper‑parameter search, or architectural changes, opening a new direction for robust, on‑the‑fly domain adaptation of large vision‑language models.