T-QPM: Enabling Temporal Out-Of-Distribution Detection and Domain Generalization for Vision-Language Models in Open-World

Aditi Naiknaware, Salimeh Sekeh
San Diego State University
{anaiknaware7153, ssekeh}@sdsu.edu

Abstract. Out-of-distribution (OOD) detection remains a critical challenge in open-world learning, where models must adapt to evolving data distributions. While recent vision-language models (VLMs) like CLIP enable multimodal OOD detection through Dual-Pattern Matching (DPM), existing methods typically suffer from two major shortcomings: (1) they rely on fixed fusion rules and assume static environments, failing under temporal drift; and (2) they lack robustness against covariate-shifted inputs. In this paper, we propose a novel two-step framework to enhance OOD detection and covariate distribution shift robustness in dynamic settings. We extend the dual-pattern regime into Temporal Quadruple-Pattern Matching (T-QPM). First, by pairing OOD images with text descriptions, we introduce cross-modal consistency patterns between ID and OOD signals, refining the decision boundary through joint image-text reasoning. Second, we address temporal distribution shifts by learning lightweight fusion weights to optimally combine semantic matching and visual typicality. To ensure stability, we enforce explicit regularization based on Average Thresholded Confidence (ATC), preventing performance degradation as distributions evolve. Experiments on temporally partitioned benchmarks demonstrate that our approach significantly outperforms static baselines, offering a robust, temporally consistent framework for multimodal OOD detection in non-stationary environments.
Keywords: Open-World Learning · Out-Of-Distribution Detection · Domain Generalization · Vision-Language Models

1 Introduction

As the demand for intelligent systems grows, the need for computer vision algorithms and foundation models to handle open-world scenarios becomes increasingly paramount. One important characteristic of the open world for vision-language models (VLMs) is that intelligent systems will encounter new contexts and images that were not seen during training, requiring safe handling of unseen examples (out-of-distribution (OOD) detection) and adaptation to distribution-shifted inputs (domain generalization) in temporal environments [35]. (The theoretical study in Section 4 was published in a preliminary form at the Reliable ML from Unreliable Data Workshop at NeurIPS 2025 [20].) Noticeably, the vast majority of VLMs have been driven by the closed-world setting [19, 23], where the label space is assumed fixed and the data distribution stationary. An open-world learning (OWL) paradigm on wild data [11] is built upon two parts: unknown rejection (OOD detection) and novel class discovery (distribution shift generalization) under dynamic domains. Within the OWL context, in-distribution (ID) refers to data drawn from the same distribution as the training set: the data that the model is expected to handle reliably. Prior work in both OOD detection and distribution shift has primarily focused on two categories: (1) covariate shift refers to inputs that belong to the same label space as the training data but differ due to changes in the input distribution [13, 32], such as a dog image corrupted with Gaussian noise remaining labeled as “dog” yet degrading model performance; and (2) semantic shift occurs when entirely new classes are introduced at test time [30, 32], such as a classifier trained on cats and dogs encountering an elephant.
While recent advances in OOD detection for VLMs have shown great promise (notably, Maximum Concept Matching (MCM) [19] leverages softmax-scaled cosine similarity between visual and textual concept prototypes for zero-shot OOD detection, and Dual-Pattern Matching (DPM) [34] efficiently adapts CLIP for OOD detection by exploiting both visual and textual ID patterns), these methods lack several fundamental aspects of OWL: (1) they largely overlook temporal dynamics, the fact that data distributions may evolve over time due to changing environments, user behavior, or data sources [31]; (2) they neglect covariate shift and domain generalization during OOD detection, treating the deployment environment as static; and (3) they limit OOD evaluation to unimodal image inputs rather than multimodal image-caption pairs, leaving the rich linguistic signal of VLMs underexploited. It is important to emphasize that temporal shifts can lead to gradual but systematic degradation of model performance if left unchecked. For example, a perception system trained on traffic patterns from one year may underperform as new road constructions, seasonal changes, or evolving driving behaviors shift the data distribution over time [2, 31].

Our Contribution. In this paper, we propose T-QPM (Temporal Quadruple Pattern Matching), a novel multimodal OOD detection framework designed for open-world deployment under continuously evolving distributions. T-QPM builds on frozen CLIP backbones and operates over image-caption pairs, enabling richer cross-modal interaction than the DPM approach. T-QPM explicitly models temporal distribution shift by incorporating a caption-aware temporal regularization loss that stabilizes confidence-based decision boundaries across timesteps, jointly optimizing for ID classification, covariate shift robustness, temporal consistency, and semantic OOD detection.
We provide a theoretical study linking temporal consistency to a generalization error bound. Our experimental results show that T-QPM significantly outperforms the DPM baseline on wild data.

2 Methodology

2.1 Preliminaries and Problem Setup

We start with preliminaries to lay the necessary context, followed by a clear description of VLMs for OOD detection. We consider a deployed classifier f_θ : X → ℝ^K trained on a labeled in-distribution (ID) dataset D_ID = {(x_i, y_i)}_{i=1}^n, drawn i.i.d. from the joint data distribution P_{XY}. The function f_θ predicts the label of an input sample x as ŷ(f(x)) := argmax_y f_y(x). Define P_in, the marginal distribution of the labeled data (X, Y), which is also referred to as the in-distribution. P_out^type is the marginal distribution of P_{X'Y'} on X', where the input space undergoes "type" shifting and the joint distribution has the same label space or a different label space (depending on the "type"). We consider a generalized characterization of the open-world setting with two types of OOD:

$$P_{\mathrm{wild}} = \Big(1 - \sum_{\mathrm{type}} \pi_{\mathrm{type}}\Big) P_{\mathrm{in}} + \sum_{\mathrm{type}} \pi_{\mathrm{type}}\, P_{\mathrm{out}}^{\mathrm{type}}, \quad (1)$$

where type ∈ {semantic, covariate} and π_type ∈ (0, 1).

Covariate OOD type. Taking autonomous driving as an example, a model trained on ID data with sunny weather may experience a covariate shift due to foggy/snowy weather. Under such a covariate shift, a model is expected to generalize to the OOD data, correctly predicting the sample into one of the known classes (e.g., car) despite the shift. P_out^cov is the marginal distribution of covariate-shifted data (X', Y) with distribution P_{X'Y}, where the joint distribution has the same label space as the training data, yet the input space undergoes a domain shift.
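To make the mixture in Eq. (1) concrete, the wild distribution can be simulated by drawing each test sample from one of three sources with probabilities (1 − π_cov − π_sem, π_cov, π_sem). Below is a minimal Python sketch; the sampler names, mixing weights, and toy constant-valued sources are illustrative assumptions, not part of the paper:

```python
import numpy as np

def sample_wild(n, sample_in, sample_cov, sample_sem, pi_cov=0.2, pi_sem=0.1, rng=None):
    """Draw n samples from the wild mixture of Eq. (1):
    P_wild = (1 - pi_cov - pi_sem) P_in + pi_cov P_out^cov + pi_sem P_out^sem."""
    rng = rng or np.random.default_rng(0)
    probs = np.array([1.0 - pi_cov - pi_sem, pi_cov, pi_sem])
    probs = probs / probs.sum()          # guard against float round-off
    sources = (sample_in, sample_cov, sample_sem)
    tags = ("in", "covariate", "semantic")
    out = []
    for _ in range(n):
        k = rng.choice(3, p=probs)       # pick a source by its mixture weight
        out.append((tags[k], sources[k]()))
    return out

# Toy usage: each "sampler" is a hypothetical stand-in returning a constant.
batch = sample_wild(1000, lambda: 0, lambda: 1, lambda: 2)
frac_in = sum(1 for tag, _ in batch if tag == "in") / len(batch)
```

With the default weights above, roughly 70% of the batch is ID, mirroring the dominance of P_in in typical deployment streams.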
Semantic OOD type. In the autonomous driving example, the model may encounter a semantic shift, where samples are from unknown classes (e.g., bear) that the model has not been exposed to during training. P_out^sem is the marginal distribution when wild data does not belong to any known category in Y = {1, 2, ..., K} and therefore should be detected as an OOD sample. To detect semantic OOD data, we train an OOD detector, which is a ranking function g_θ : X → ℝ with parameter θ: if g_θ(x) ≤ λ, then the sample x is an OOD example. The threshold value λ is typically chosen so that a high fraction of ID data is correctly classified. This means that the detector g_θ should predict semantic OOD data as OOD and otherwise predict ID.

VLMs for OOD Detection. VLMs, exemplified by CLIP, consist of two aligned encoders: a visual encoder ϕ_V and a text encoder ϕ_T. The visual encoder maps an input image x to a d-dimensional feature representation F_v(x) ∈ ℝ^d, while the text encoder maps a textual prompt p to a semantic embedding ϕ_T(p) ∈ ℝ^d. Both embeddings lie in a shared representation space, enabling direct similarity comparison. For a downstream classification task with label space Y = {y_1, ..., y_K}, a prompt template is instantiated for each class to obtain class-specific textual descriptions. Each class embedding is normalized to produce a set of prototype vectors {t_k}_{k=1}^K, where t_k ∈ ℝ^d. These normalized text embeddings form a cosine-similarity classifier in the joint embedding space. Given an image x, the compatibility between the visual feature F_v(x) and each class prototype t_k is measured via cosine similarity. These similarity scores are scaled by a temperature parameter and converted into a categorical distribution over Y using a softmax transformation.
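The cosine-similarity classifier just described can be sketched in a few lines. This is a generic CLIP-style zero-shot classifier under stated assumptions, not the paper's implementation; the temperature value, feature dimensions, and random stand-in embeddings are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                   # stabilize before exponentiation
    e = np.exp(z)
    return e / e.sum()

def zero_shot_probs(f_v, text_protos, temperature=0.01):
    """Cosine similarities between an image feature f_v (d,) and class text
    prototypes (K, d), temperature-scaled and softmaxed into a categorical
    distribution over the K classes (a sketch of the scheme described above)."""
    f = f_v / np.linalg.norm(f_v)
    T = text_protos / np.linalg.norm(text_protos, axis=1, keepdims=True)
    sims = T @ f                      # cosine similarity per class, shape (K,)
    return softmax(sims / temperature)

# Toy usage with random prototypes; the image feature sits near class 2.
rng = np.random.default_rng(0)
protos = rng.normal(size=(5, 16))
img = protos[2] + 0.1 * rng.normal(size=16)
p = zero_shot_probs(img, protos)
```

Because the temperature is small, the distribution is sharply peaked on the nearest prototype, which is what makes the maximum probability usable as a confidence score.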
The resulting predictive distribution reflects the semantic alignment between the image and the set of textual class descriptions. In CLIP-based OOD detection, the text embeddings {t_k} serve as fixed classifier weights. Post-hoc scoring functions such as Maximum Softmax Probability (MSP) [7] or energy-based scores [17] are applied to the resulting logits or probability distribution. The underlying assumption is that ID samples yield higher confidence or lower energy compared to OOD samples.

Problem Setup. We consider the problem of OOD detection under temporal distribution shift. At each timestep t ∈ {0, 1, ..., T}, a data distribution D_t over image-caption pairs (x, c) is observed, where the visual distribution shifts gradually across timesteps. The ID label set Y = {y_1, ..., y_K} remains fixed, but the visual appearance of ID classes evolves over time, inducing temporal shift. At test time, each pair (x, c) must be classified as ID or OOD with respect to Y, without access to OOD labels during training. Formally, at each timestep t we have access to a set of labeled ID training and covariate-shifted samples D_t^train = {(x_i, y_i)} drawn from the current ID distribution, along with unlabeled test samples {(x, c)} that may be either ID or OOD. The goal is to learn a scoring function S(x, c, t) ∈ ℝ such that ID samples consistently score above a fixed threshold δ across all timesteps, while OOD samples score below it, and such that the scoring function remains robust to both temporal drift and covariate perturbations of the input.
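The thresholding convention used throughout (a threshold chosen so that a high fraction of ID data scores above it) can be sketched as follows. The 95% ID-retention level and the Gaussian stand-in score distributions are illustrative assumptions, not the paper's detector outputs:

```python
import numpy as np

def calibrate_threshold(id_scores, tpr=0.95):
    """Pick lambda as the (1 - tpr) quantile of ID scores, so that a
    fraction `tpr` of ID data scores above it (the convention above)."""
    return np.quantile(id_scores, 1.0 - tpr)

def detect(scores, lam):
    """g(x) <= lambda -> OOD, otherwise ID."""
    return np.where(scores <= lam, "OOD", "ID")

# Toy usage: Gaussian stand-in scores for ID and (lower-scoring) OOD data.
rng = np.random.default_rng(0)
id_scores = rng.normal(1.0, 0.2, size=10_000)
ood_scores = rng.normal(0.2, 0.2, size=10_000)
lam = calibrate_threshold(id_scores)
id_kept = float((detect(id_scores, lam) == "ID").mean())
```

By construction, `id_kept` is close to 0.95; how many OOD samples fall below `lam` then depends entirely on how well the score separates the two populations, which is what FPR95-style metrics quantify.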
Our T-QPM method extends DPM [34] to this setting by: (i) incorporating all four cross-modal pairings between ID and test representations across image and caption modalities, (ii) adapting visual prototypes per timestep to handle temporal drift, (iii) learning a lightweight fusion of the four scores with only two trainable parameters, and (iv) enforcing covariate robustness through explicit consistency regularization during training.

2.2 T-QPM for OOD Detection

Our T-QPM method is designed around four phases, as described below.

Phase I: Text Pattern Construction. We first construct reference patterns using the text encoder of the frozen VLM. For each ID class k ∈ {1, ..., K}, we employ prompt ensembling over P templates to obtain robust text representations. Let {p_k^(1), ..., p_k^(P)} denote the set of prompts for class k. The class text embedding is computed as

$$t_k = \mathrm{Normalize}\!\left(\sum_{i=1}^{P} \mathrm{Normalize}\big(\phi_T(p_k^{(i)})\big)\right), \quad (2)$$

where ϕ_T denotes the frozen CLIP text encoder and normalization is the ℓ2-norm. Collecting all ID class embeddings yields the ID text bank T_ID = [t_1, ..., t_K]^⊤ ∈ ℝ^{K×d}, where d is the embedding dimension. T_ID is computed once and kept fixed throughout all timesteps, serving as a stable semantic anchor while temporal variability is handled through timestep-specific visual statistics.

Fig. 1: T-QPM overview. At each timestep, ID images and their covariate-shifted views are encoded to build timestep-specific visual prototypes alongside a fixed ID text bank. At inference, four cross-modal scores between the test image, caption, and ID representations are fused to produce the final OOD decision.

Phase II: Temporal Visual Pattern Construction. To account for temporal distribution shift [28], we compute visual reference statistics separately for each timestep, extending the DPM framework.
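Returning briefly to Phase I, the prompt ensembling of Eq. (2) amounts to normalize, sum, normalize over the P template embeddings. A minimal numpy sketch, with random vectors standing in for the frozen text-encoder outputs (the template count and dimension are illustrative assumptions):

```python
import numpy as np

def normalize(v):
    """L2-normalize along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def class_text_embedding(prompt_embs):
    """Eq. (2): L2-normalize each of the P template embeddings, sum them,
    and L2-normalize the result. `prompt_embs` (P, d) stands in for the
    frozen text-encoder outputs phi_T(p_k^(i)), which are not shown here."""
    return normalize(normalize(prompt_embs).sum(axis=0))

# Toy usage: 7 templates, embedding dimension 512 (illustrative values).
rng = np.random.default_rng(0)
t_k = class_text_embedding(rng.normal(size=(7, 512)))
```

The inner normalization prevents any single long prompt embedding from dominating the ensemble; the outer one puts t_k back on the unit sphere so cosine comparisons reduce to dot products.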
Given an image x at timestep t, the frozen CLIP-ViT encoder produces a sequence of patch embeddings F(x) = ϕ_V(x) ∈ ℝ^{(N+1)×d}, where N is the number of spatial patches and the first token is the global [CLS] token. We decompose F(x) into a global token F_v(x) = F(x)[0, :] ∈ ℝ^d and spatial patches F_s(x) = F(x)[1:, :] ∈ ℝ^{N×d}. For each class k, class-specific spatial attention weights are computed as

$$A_k(x) = \mathrm{Softmax}\!\left(\frac{F_s(x)\, t_k}{\lVert F_s(x) \rVert\, \lVert t_k \rVert}\right) \in \mathbb{R}^N, \quad (3)$$

yielding a class-attended spatial feature f̃_k(x) = A_k(x)^⊤ F_s(x) ∈ ℝ^d. The final class-specific image representation combines global and attended spatial features as f_k(x) = γ f̃_k(x) + F_v(x), where γ balances the contribution of spatial and global information, and the ID logits follow as

$$z_{\mathrm{ID}}(x) = \big[f_1(x)^\top t_1, \ldots, f_K(x)^\top t_K\big]^\top \in \mathbb{R}^K. \quad (4)$$

Converting logits to probabilities via a temperature-scaled softmax, p(x) = Softmax(z_ID(x)/T), we estimate for each timestep t and class k a class-conditional visual prototype by averaging over ID training samples from D_t^train:

$$\mu_{k,t} = \mathbb{E}\big[\, p(x) \mid (x, y) \in \mathcal{D}_t^{\mathrm{train}},\, y = k \,\big] \in \mathbb{R}^K. \quad (5)$$

{µ_{k,t}}_{k=1}^K defines the expected ID similarity pattern at time t, allowing the notion of “normal” ID behavior to adapt to gradual visual distribution drift.

Phase III: Quadruple Cross-Modal Scoring. For a test image x with associated caption c observed at timestep t, we construct four complementary OOD scores corresponding to all cross-modal pairings between ID and test representations, together realizing the complete Quadruple Pattern Matching (QPM) framework in the temporal setting.
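Phase II's class-attended representation and timestep prototypes (Eqs. 3-5) can be sketched as follows. Shapes, γ, and the temperature are illustrative assumptions, and random arrays stand in for frozen CLIP features:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def class_attended_logits(F_global, F_spatial, T_id, gamma=0.1):
    """Eqs. (3)-(4): per-class spatial attention over patch features,
    fused with the global token, then dotted with the text bank.
    Shapes: F_global (d,), F_spatial (N, d), T_id (K, d)."""
    K = T_id.shape[0]
    logits = np.empty(K)
    for k in range(K):
        t_k = T_id[k]
        cos = F_spatial @ t_k / (np.linalg.norm(F_spatial, axis=1) * np.linalg.norm(t_k))
        A_k = softmax(cos)                          # Eq. (3): attention weights
        f_k = gamma * (A_k @ F_spatial) + F_global  # class-attended + global feature
        logits[k] = f_k @ t_k                       # Eq. (4)
    return logits

def visual_prototypes(probs_by_class):
    """Eq. (5): mu_{k,t} is the mean softmax pattern over ID samples of
    class k at timestep t; input maps k -> array of shape (n_k, K)."""
    return {k: P.mean(axis=0) for k, P in probs_by_class.items()}

# Toy usage with random features (d=32, N=49 patches, K=5 classes).
rng = np.random.default_rng(0)
T_id = rng.normal(size=(5, 32))
z = class_attended_logits(rng.normal(size=32), rng.normal(size=(49, 32)), T_id)
p = softmax(z / 0.01)                               # temperature-scaled probabilities
mu = visual_prototypes({0: np.tile(p, (4, 1))})
```

Recomputing `visual_prototypes` per timestep is what lets the "normal" ID pattern track gradual drift while the text bank stays fixed.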
The Semantic Matching Score S_ID (OOD image ↔ ID text) measures alignment between the test image's visual features and the ID class text embeddings,

$$S_{\mathrm{ID}}(x) = \max_{k \in \{1, \ldots, K\}} \frac{z_{\mathrm{ID}}(x)[k]}{T}, \quad (6)$$

where higher values indicate stronger semantic similarity to known ID classes. The Visual Typicality Score S_VIS (OOD image ↔ ID visual) measures how typical the test image's probability pattern is relative to the timestep-specific ID visual prototypes via KL divergence [25],

$$\mathrm{KL}_k(x, t) = \sum_{j=1}^{K} p_j(x) \log \frac{p_j(x)}{\mu_{k,t}[j]}, \qquad S_{\mathrm{VIS}}(x, t) = -\min_k \mathrm{KL}_k(x, t), \quad (7)$$

where p_j(x) denotes the j-th entry of p(x), and a lower KL divergence (higher S_VIS) indicates that the test image's probability pattern is consistent with typical ID images at timestep t. To exploit the multimodal nature of the test data, each test image is accompanied by a natural language caption c, which we encode on-the-fly using the same frozen text encoder as q_c = Normalize(ϕ_T(c)) ∈ ℝ^d. The Caption-Text Alignment Score S_CAP-T (OOD text ↔ ID text) then measures how strongly the caption's semantics overlap with ID class names in text space,

$$S_{\mathrm{CAP\text{-}T}}(x) = \max_{k \in \{1, \ldots, K\}} \langle q_c, t_k \rangle, \quad (8)$$

with no OOD text bank constructed: the caption is compared directly against the precomputed T_ID at inference time. Complementing this, the Caption-Visual Alignment Score S_CAP-V (OOD text ↔ ID visual) measures whether the caption's semantics are consistent with the visual statistics of ID data at the current timestep. Reusing q_c, we project it through the ID logit computation to obtain a caption probability vector,

$$z_{\mathrm{CAP}}(x) = \big[q_c^\top t_1, \ldots, q_c^\top t_K\big]^\top \in \mathbb{R}^K, \qquad p_{\mathrm{CAP}}(x) = \mathrm{Softmax}\!\left(\frac{z_{\mathrm{CAP}}(x)}{T}\right), \quad (9)$$

and measure its typicality against µ_{k,t} via KL divergence,

$$S_{\mathrm{CAP\text{-}V}}(x, t) = -\min_k \mathrm{KL}\big(p_{\mathrm{CAP}}(x) \,\|\, \mu_{k,t}\big). \quad (10)$$
Since both caption scores reuse q_c, computed once per test image, the multimodal grounding adds negligible overhead at inference. S_CAP-T and S_CAP-V provide complementary signals: the former detects semantic overlap in text space, while the latter detects whether the caption's semantics conform to the visual statistics of ID data at timestep t. The four scores are combined with learnable positive weights β, η > 0 as

$$S_{\mathrm{FUSED}}(x, t) = S_{\mathrm{ID}}(x) + \beta \cdot S_{\mathrm{VIS}}(x, t) - \gamma_{\mathrm{cap}} \cdot S_{\mathrm{CAP\text{-}T}}(x) - \eta \cdot S_{\mathrm{CAP\text{-}V}}(x, t), \quad (11)$$

where γ_cap > 0 is a fixed hyperparameter and β, η are learned. The caption-based terms are subtracted because high alignment of the test caption with ID representations is indicative of an OOD sample whose textual description overlaps with, but whose visual content departs from, the ID distribution. To ensure positivity, β and η are parameterized via the softplus function as β = log(1 + e^β̃) and η = log(1 + e^η̃), where β̃, η̃ ∈ ℝ are trainable scalar parameters.

Phase IV: Threshold Calibration, Training Objective, and Temporal OOD Detection. At the initial timestep t = 0, we calibrate a decision threshold δ as the δ_q-th percentile of fused scores on ID training data,

$$\delta = \mathrm{quantile}_{\delta_q}\big\{\, S_{\mathrm{FUSED}}(x, 0) \mid (x, y) \in \mathcal{D}_0^{\mathrm{train}} \,\big\}. \quad (12)$$

This threshold is fixed across all subsequent timesteps to enable consistent temporal comparison. At each timestep t, we then optimize only the two fusion scalars β̃ and η̃, keeping all CLIP encoders frozen. This yields extreme parameter efficiency and avoids catastrophic forgetting of pre-trained representations [12]. Beyond OOD detection under temporal shift, the training objective is explicitly designed to generalize over covariate shift by training on both clean and corrupted views and enforcing score consistency between them. The total loss comprises three components.
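The fusion and calibration steps (Eqs. 7 and 10-12) combine as follows in a minimal sketch. The toy prototypes, quantile level, and raw parameter values are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

def softplus(x):
    """Positive reparameterization used for beta and eta."""
    return np.log1p(np.exp(x))

def kl(p, q, eps=1e-12):
    """KL(p || q) for probability vectors, with a small eps for stability."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def fused_score(s_id, p_img, cap_sims, p_cap, protos,
                beta_raw=0.0, eta_raw=0.0, gamma_cap=0.1):
    """Eq. (11): S_ID + beta * S_VIS - gamma_cap * S_CAP-T - eta * S_CAP-V,
    where S_VIS and S_CAP-V are negated minimum KL divergences to the
    timestep prototypes (Eqs. 7, 10) and beta, eta stay positive via softplus."""
    beta, eta = softplus(beta_raw), softplus(eta_raw)
    s_vis = -min(kl(p_img, mu) for mu in protos)
    s_cap_t = float(np.max(cap_sims))        # Eq. (8): max caption-text similarity
    s_cap_v = -min(kl(p_cap, mu) for mu in protos)
    return s_id + beta * s_vis - gamma_cap * s_cap_t - eta * s_cap_v

def calibrate_delta(train_scores, q=0.05):
    """Eq. (12): fix delta as a low quantile of fused scores on ID training
    data at t = 0 (the quantile level here is illustrative)."""
    return float(np.quantile(train_scores, q))

# Toy usage with uniform prototypes and random stand-in inputs.
rng = np.random.default_rng(0)
protos = [np.full(5, 0.2) for _ in range(3)]
p_img = np.array([0.7, 0.1, 0.1, 0.05, 0.05])
s = fused_score(1.0, p_img, rng.uniform(size=5), p_img, protos)
delta = calibrate_delta(rng.normal(size=1000))
```

Only the two raw scalars behind β and η would be trained in this scheme; everything else is a fixed, frozen-backbone computation.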
First, the Balanced ID Classification Loss computes cross-entropy symmetrically on both clean and covariate-shifted views,

$$\mathcal{L}_{\mathrm{ID}} = \frac{1}{2}\Big( \mathbb{E}_{(x,y) \sim \mathcal{D}_t}\, \mathcal{L}_{\mathrm{CE}}(z_{\mathrm{ID}}(x), y) + \mathbb{E}_{(\tilde{x},y) \sim \mathcal{D}_t}\, \mathcal{L}_{\mathrm{CE}}(z_{\mathrm{ID}}(\tilde{x}), y) \Big), \quad (13)$$

where x̃ denotes a covariate-shifted version of x, so that the model learns representations that are simultaneously discriminative and robust to covariate perturbations. Second, the Covariate Consistency Loss explicitly enforces that OOD detection scores remain stable under covariate shift [9],

$$\mathcal{L}_{\mathrm{COV}} = \mathbb{E}_{x \sim \mathcal{D}_t}\Big[\, \big| S_{\mathrm{FUSED}}(x, t) - S_{\mathrm{FUSED}}(\tilde{x}, t) \big| \,\Big], \quad (14)$$

directly penalizing inconsistency between the OOD scores of clean and corrupted views and making the detection boundary robust to covariate shifts. Third, to prevent the effective ID coverage from drifting as the data distribution evolves, the Temporal Drift Penalty employs Above-Threshold Coverage (ATC) [6] as a soft differentiable proxy for the fraction of ID samples scoring above δ,

$$\mathrm{ATC}_t = \mathbb{E}_{x \sim \mathcal{D}_t}\, \sigma\!\left(\frac{S_{\mathrm{FUSED}}(x, t) - \delta}{\kappa}\right), \quad (15)$$

where σ(·) is the sigmoid function and κ > 0 controls the smoothness of the approximation. ATC is computed separately for clean and covariate-shifted views, and changes across consecutive timesteps are penalized as

$$\mathcal{L}_{\mathrm{TEMP}} = \big| \mathrm{ATC}_t^{\mathrm{clean}} - \mathrm{ATC}_{t-1}^{\mathrm{clean}} \big| + \big| \mathrm{ATC}_t^{\mathrm{shift}} - \mathrm{ATC}_{t-1}^{\mathrm{shift}} \big|, \quad (16)$$

discouraging abrupt changes in the fraction of ID samples above the detection threshold at each timestep and stabilizing detection across the evolving data stream, while simultaneously accounting for covariate robustness through the shifted ATC term. The total loss, with Lagrangian multipliers λ_cov and λ_temp, is

$$\mathcal{L}_{\mathrm{TOTAL}} = \mathcal{L}_{\mathrm{ID}} + \lambda_{\mathrm{cov}} \mathcal{L}_{\mathrm{COV}} + \lambda_{\mathrm{temp}} \mathcal{L}_{\mathrm{TEMP}}. \quad (17)$$

Given a test image x with caption c at timestep t, we compute all four scores, form S_FUSED(x, t) via Eq. (11), and classify using the fixed threshold δ:

$$D(x, t) = \begin{cases} \mathrm{ID} & \text{if } S_{\mathrm{FUSED}}(x, t) \geq \delta, \\ \mathrm{OOD} & \text{otherwise.} \end{cases} \quad (18)$$
Together, the four cross-modal scores realize the complete QPM framework in the temporal setting, with the training objective jointly ensuring robustness to both temporal drift and covariate shift. Pseudocode for Phases I-IV is provided in the supplementary material (SM).

3 Experiments

3.1 Experimental Setup

We evaluate T-QPM on temporally evolving benchmarks designed to test semantic OOD detection and covariate shift robustness under continuously shifting distributions.

Datasets. For ID data, we use three temporal benchmarks: CLEAR100 [16], CLEAR10 [16], and Core50 [18]. CLEAR100 and CLEAR10 span 10 temporal buckets, each representing a distinct time period. Core50 consists of 10 sessions captured under varying backgrounds and lighting conditions, providing a complementary setting with more abrupt session-level domain shifts. Since T-QPM operates on image-caption pairs, all semantic OOD datasets are drawn from multimodal sources: COCO [15], ImageNet-1K-VL-Enriched [10], Visual Genome [14], Flickr30K [22], and CC12M [3], spanning everyday objects, scene descriptions, and web-crawled image-text pairs. To assess covariate shift robustness, we generate perturbed variants of each ID test set using Gaussian blur and JPEG compression corruptions, which preserve the original label space while evaluating generalization under realistic visual degradations.

Training Procedure. T-QPM is trained sequentially across timesteps, where at each timestep the model receives image-caption pairs from the current temporal distribution. In the CLEAR100/CLEAR10 setting, training proceeds from timestep 1 through timestep 10, with each bucket representing a progressively drifted visual distribution. For Core50, the model is trained across 10 recording sessions in order of acquisition.
At each timestep, the model is updated using the current ID data while the projection module adapts its interference weights to the new distribution, enabling continual calibration without full retraining.

Model Architectures and Optimization. T-QPM is built on top of two frozen CLIP backbones, ViT-B/16 and ViT-B/32. All experiments use a learning rate of 3 × 10^{-3} with 5 epochs per timestep. The frozen backbone ensures that pretrained vision-language representations are preserved, while only the projection module is updated to adapt to temporal distribution shift.

Evaluation Protocol. Results are reported at representative early and late timesteps (e.g., t=2 and t=8 for CLEAR100) to capture model behavior before and after substantial temporal drift has accumulated (other timesteps are reported in the SM). We report FPR95 and AUROC as threshold-independent measures of OOD detection quality, alongside ID clean accuracy on the unperturbed test set and ID corrupted accuracy on blur- and compression-degraded variants. All results are averaged over 3 independent trials to account for variance due to weight initialization.

3.2 Results

Figures 2 and 3 report ID classification accuracy on clean and corrupted test sets across all 10 timesteps of CLEAR100, with COCO as the semantic OOD dataset. Under Gaussian blur corruption (Figure 2), T-QPM consistently outperforms DPM across all timesteps on both clean and shifted variants. On clean data, both methods begin at comparable accuracy (~0.966), but T-QPM maintains a stable upward trend, reaching ~0.974 by t=8, while DPM exhibits high variance and collapses sharply to ~0.961 at the final timestep.
The performance gap is substantially amplified under covariate shift: T-QPM sustains blur-shifted accuracy in the range 0.945-0.965, whereas DPM fluctuates between 0.925 and 0.935 throughout, indicating that T-QPM's quadruple matching better preserves discriminative features under low-frequency visual degradation. Under JPEG compression corruption (Figure 3), T-QPM demonstrates an even more pronounced advantage. On clean data, T-QPM improves steadily from 0.971 at t=0 to ~0.979 by t=8, while DPM again degrades sharply at the final timestep (~0.959). More strikingly, on JPEG-corrupted inputs, T-QPM exhibits a consistent upward trajectory across all timesteps, reaching ~0.991 at t=7, while DPM remains nearly flat in the 0.920-0.932 range throughout. This sustained ~5-6% gap under JPEG shift suggests that T-QPM's interference-based scoring mechanism is particularly robust to high-frequency compression artifacts, which tend to destabilize standard softmax-based confidence estimates. Taken together, both figures demonstrate that T-QPM not only maintains higher clean accuracy but also generalizes significantly better under realistic covariate corruptions as temporal drift accumulates.

Tables 1 and 2 report OOD detection results (FPR95 ↓ / AUROC ↑) across 3 ID datasets and 5 OOD datasets at early (t=2) and late (t=8) timesteps.

Fig. 2: ID classification accuracy on clean (left) and Gaussian blur-shifted (right) test sets across all timesteps. T-QPM consistently outperforms DPM under both conditions, with the gap widening substantially under covariate shift. ID dataset: CLEAR100; OOD dataset: COCO with captions.

Fig. 3: ID classification accuracy on clean (left) and JPEG-shifted (right) test sets across all timesteps. T-QPM consistently outperforms DPM under both conditions, with the gap widening substantially under covariate shift.
ID dataset: CLEAR100; OOD dataset: COCO with captions.

Tables 1 and 2 use the ViT-B/16 and ViT-B/32 backbones, respectively. T-QPM consistently and substantially outperforms DPM across all settings. On CLEAR100 with ViT-B/16, T-QPM reduces FPR95 by over 24 points against COCO and nearly halves it against Visual Genome, with DPM's performance deteriorating further at the late timestep while T-QPM remains comparatively stable. On the cleaner CLEAR10 benchmark, T-QPM achieves near-perfect OOD detection against CC12M (FPR95 0.4 / AUROC 99.9) and reduces FPR95 by a factor of ten against COCO relative to DPM. These gains are consistent across ViT-B/32 (Table 2), confirming that T-QPM's caption-aware scoring generalizes across backbone architectures. Critically, while both methods degrade from early to late timesteps under accumulating temporal drift, T-QPM's degradation is systematically smaller across all benchmarks and OOD datasets, demonstrating superior temporal robustness.

Table 1: OOD detection results across 3 ID datasets and 5 OOD datasets using the ViT-B/16 architecture. Each cell shows FPR95 ↓ / AUROC ↑ at early (t=2) and late (t=8) timesteps.
| OOD dataset | Method | Clear100 Early | Clear100 Late | Clear10 Early | Clear10 Late | Core50 Early | Core50 Late |
|---|---|---|---|---|---|---|---|
| COCO | DPM | 41.53 / 88.16 | 46.73 / 85.55 | 8.61 / 97.28 | 9.66 / 96.38 | 16.40 / 95.60 | 24.10 / 93.20 |
| COCO | QPM | 17.42 / 96.66 | 20.51 / 95.77 | 0.89 / 99.66 | 1.20 / 99.38 | 6.20 / 98.90 | 9.80 / 97.60 |
| ImageNet-1K | DPM | 17.58 / 95.74 | 22.48 / 94.41 | 9.51 / 98.48 | 14.70 / 94.74 | 11.20 / 97.80 | 17.30 / 95.40 |
| ImageNet-1K | QPM | 5.97 / 98.79 | 7.24 / 98.59 | 3.65 / 99.16 | 5.20 / 98.95 | 4.10 / 99.10 | 6.90 / 98.70 |
| Flickr30K | DPM | 21.62 / 94.26 | 26.10 / 92.95 | 6.03 / 98.28 | 8.31 / 97.68 | 14.10 / 96.10 | 21.50 / 93.80 |
| Flickr30K | QPM | 7.96 / 98.20 | 8.95 / 98.06 | 1.63 / 99.65 | 2.60 / 99.07 | 5.10 / 99.00 | 8.20 / 98.10 |
| CC12M | DPM | 10.23 / 97.74 | 12.51 / 97.11 | 7.35 / 98.53 | 9.29 / 98.57 | 10.30 / 97.80 | 15.20 / 95.60 |
| CC12M | QPM | 2.56 / 99.48 | 3.63 / 99.28 | 0.46 / 99.99 | 1.59 / 99.57 | 2.70 / 99.50 | 5.10 / 99.00 |
| Visual Genome | DPM | 44.42 / 87.46 | 50.16 / 84.37 | 23.06 / 93.18 | 38.66 / 90.52 | 22.80 / 94.10 | 34.60 / 89.80 |
| Visual Genome | QPM | 16.18 / 96.57 | 19.37 / 96.07 | 11.85 / 97.49 | 13.75 / 98.12 | 10.40 / 97.60 | 16.90 / 95.80 |

Table 2: OOD detection results across 3 ID datasets and 5 OOD datasets using the ViT-B/32 architecture. Each cell shows FPR95 ↓ / AUROC ↑ at early (t=2) and late (t=8) timesteps.
| OOD dataset | Method | Clear100 Early | Clear100 Late | Clear10 Early | Clear10 Late | Core50 Early | Core50 Late |
|---|---|---|---|---|---|---|---|
| COCO | DPM | 47.73 / 87.30 | 55.11 / 84.30 | 9.89 / 96.40 | 11.33 / 95.10 | 18.86 / 94.80 | 28.44 / 92.00 |
| COCO | QPM | 19.14 / 96.20 | 22.96 / 95.00 | 0.88 / 99.20 | 1.34 / 98.60 | 6.82 / 98.50 | 10.98 / 96.90 |
| ImageNet-1K | DPM | 20.13 / 94.90 | 26.43 / 93.20 | 10.93 / 98.00 | 17.35 / 93.50 | 12.88 / 97.00 | 20.41 / 94.20 |
| ImageNet-1K | QPM | 6.49 / 98.30 | 8.06 / 97.80 | 3.96 / 98.70 | 5.82 / 98.20 | 4.51 / 98.70 | 7.73 / 98.00 |
| Flickr30K | DPM | 24.84 / 93.40 | 30.80 / 91.70 | 6.90 / 97.40 | 9.79 / 96.40 | 16.22 / 95.30 | 25.37 / 92.60 |
| Flickr30K | QPM | 8.69 / 97.80 | 9.97 / 97.30 | 1.76 / 99.20 | 2.91 / 98.30 | 5.61 / 98.60 | 9.18 / 97.40 |
| CC12M | DPM | 11.73 / 96.90 | 14.75 / 95.90 | 8.39 / 97.70 | 10.86 / 97.30 | 11.84 / 97.00 | 17.94 / 94.40 |
| CC12M | QPM | 2.75 / 99.00 | 4.03 / 98.50 | 0.44 / 99.50 | 1.68 / 98.80 | 2.97 / 99.10 | 5.71 / 98.30 |
| Visual Genome | DPM | 51.06 / 86.60 | 59.12 / 83.50 | 26.45 / 92.30 | 45.55 / 89.30 | 26.22 / 93.30 | 40.83 / 88.60 |
| Visual Genome | QPM | 17.71 / 96.30 | 21.62 / 95.30 | 12.98 / 97.00 | 15.34 / 97.40 | 11.44 / 97.20 | 18.93 / 95.10 |

Ablation Study: Loss Component Analysis. Table 3 examines the contribution of each loss component L_ID, L_COV, L_TEMP, and L_OOD to overall performance in a setting where the ID dataset is CLEAR100 and the OOD dataset is COCO. When used in isolation, no single loss achieves competitive results, with L_ID alone performing best among them (AUROC 91.91, FPR95 24.12), while L_OOD alone yields the weakest result (AUROC 88.01, FPR95 31.02). The leave-one-out experiments reveal that every component contributes meaningfully: removing L_COV causes the largest single drop (AUROC 96.41 vs. 97.81), followed by removing L_TEMP (95.71), confirming that covariate shift regularization and temporal consistency are the most critical components beyond the classification objective.
The full model combining all four losses achieves the best performance (AUROC 97.81, FPR95 10.28), validating that each loss term addresses a distinct and complementary aspect of the temporal OOD detection problem.

Hyperparameter Sensitivity. Figure 4 shows AUROC sweeps over the three key hyperparameters β, η, and γ_cap for both ViT-B/16 and ViT-B/32. All three parameters exhibit a clear optimal region followed by monotonic degradation, indicating that T-QPM is sensitive to over-regularization but stable within a reasonable range. β peaks near 1.0-1.5 for both backbones before declining sharply, while η is optimal around 0.5 and degrades steeply beyond 2.0. γ_cap shows a sharp optimum at ~0.1-0.15, with performance dropping symmetrically on either side. Across all three sweeps, ViT-B/16 consistently outperforms ViT-B/32, consistent with the main results, while both architectures share nearly identical optimal hyperparameter ranges, suggesting that a single hyperparameter configuration generalizes well across backbone choices.

Table 3: Ablation study of the loss on CLEAR100 as ID and COCO as OOD with ViT-B/16.

| L_ID | L_COV | L_TEMP | L_OOD | AUROC ↑ | FPR95 ↓ | Note |
|---|---|---|---|---|---|---|
| ✓ | × | × | × | 91.91 | 24.12 | L_ID only |
| × | ✓ | × | × | 85.51 | 39.22 | L_COV only |
| × | × | ✓ | × | 89.21 | 32.92 | L_TEMP only |
| × | × | × | ✓ | 88.01 | 31.02 | L_OOD only |
| × | ✓ | ✓ | ✓ | 93.01 | 21.72 | w/o L_ID |
| ✓ | × | ✓ | ✓ | 96.41 | 19.62 | w/o L_COV |
| ✓ | ✓ | × | ✓ | 95.71 | 24.92 | w/o L_TEMP |
| ✓ | ✓ | ✓ | × | 96.91 | 16.82 | w/o L_OOD |
| ✓ | ✓ | ✓ | ✓ | 97.81 | 10.28 | Full model |

Fig. 4: Hyperparameter sweeps (AUROC) for β (left), η (center), and γ_cap (right).

4 A Theoretical Study on Generalization Error

Inspired by theoretical investigations in [26, 33], we study the generalization error GErr_{t+1}(f) of model f_θ at two time steps t and t+1. The generalization error at time step t, GErr_t, is the standard cross-entropy loss for hypothesis f ∈ F under covariate shift P_cov.
We assume:

[A1] At time step t, TV(p(y_t | x_t) ∥ U) is constant.

[A2] At time step t, the class distribution predicted by f, namely F_f^{θ_1}, and p_{θ_2}(y_t | x_t) have the same distributional form with different parameters θ_1 and θ_2, respectively, and θ_1 − θ_2 = δ, where δ is bounded.

[A3] There exists a constant Z_t such that
E_{P_out^{t+1,cov}} H(p(y_{t+1} | x_{t+1})) − E_{P_out^{t,cov}} H(p(y_t | x_t)) ≥ Z_t + Conf_t − Conf_{t+1}.

Theorem 1 (Main Theorem). Let P^{t,cov} and P_test^{t,sem} be the covariate-shifted OOD and semantic OOD distributions. Denote by GErr_{t+1}(f) the generalization error at time t+1. Let L_reg be the OOD detection loss devised for MSP detectors [8], i.e., the cross-entropy between the predicted distribution f_θ and the uniform distribution. Then at two time steps t and t+1 and under assumptions [A1]–[A3], we have

GErr_{t+1}(f) − GErr_t(f) ≥ −κ̃ Δ_{t→t+1}^{cov,sem} − κ̃ Ξ_{t→t+1}^{sem} − δ̄_t² E_{P_out^{t,cov}}(I_F(θ)) + C_{t→t+1} + Conf_t − Conf_{t+1},   (19)

where

Δ_{t→t+1}^{cov,sem} := d_F(P_out^{t+1,cov}, P_out^{t+1,sem}) + d_F(P_out^{t,cov}, P_out^{t,sem}),

Ξ_{t→t+1}^{sem} := E_{P_out^{t+1,sem}} √(½(L_reg(f) − log K)) + E_{P_out^{t,sem}} √(½(L_reg(f) − log K)),

C_{t→t+1} = C_{t+1} − C_t + B_t + Z_t and δ_t are constants, and δ̄_t² = (log e / 2) δ_t². Here d_F(P_out^{t,cov}, P_out^{t,sem}) is the disparity discrepancy with total variation distance (TVD), which measures the dissimilarity of the covariate-shifted OOD and semantic OOD distributions; Conf(f_θ) := max_{j∈Y} f_j(x) is the maximum confidence; and I_F(θ) is the Fisher information [4].

The details and proof are deferred to the SM. Our theoretical finding demonstrates that for MSP detectors (without any OOD detection regularization), at two timesteps t and t+1, the OOD detection objective difference conflicts with the OOD generalization difference.
In addition, the generalization error difference over time is not only negatively correlated with the OOD detection loss that the model minimizes; it is also negatively correlated with the Fisher information of the network parameters under P_out^{t,cov}. The OOD generalization error difference between t+1 and t is positively correlated with the confidence difference over the same period. Similar to [33], our theorem is applicable to all MSP-based OOD detectors. The inherent motivation of OOD detection methods lies in minimizing the OOD detection loss on P_out^{t,sem} under test data, regardless of the training strategies used.

5 Related Work

VLMs for OOD Detection. The advent of large-scale VLMs such as CLIP [23] has opened new directions for OOD detection beyond purely unimodal approaches. MCM [19], a training-free zero-shot OOD detection method, treats textual class embeddings as concept prototypes and measures the softmax-scaled cosine similarity between visual features and ID class concepts. MCM demonstrates that multimodal vision-language representations substantially outperform single-modal visual baselines, particularly on large-scale benchmarks. ZOC [5] introduces a method for deriving representations of OOD class names based on the CLIP visual encoder. Building on these, DPM [34] augments text-pattern matching with an explicit visual pattern derived from aggregated ID image-text similarity scores, leveraging both modalities simultaneously. DPM introduces a domain-specific feature aggregation module and further extends to a training-required variant (DPM-T) with learnable prompts and projection layers. Our work, T-QPM, departs from both MCM and DPM by introducing a quadruple matching mechanism over image-caption pairs, enabling richer cross-modal interaction beyond cosine similarity. T-QPM explicitly addresses temporal distribution shift, a setting neither MCM nor DPM is designed for.
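To make the MCM-style scoring above concrete, here is a minimal NumPy sketch (not the authors' or MCM's actual implementation; the random embeddings, class count, and temperature value are illustrative assumptions standing in for CLIP features):

```python
import numpy as np

def mcm_score(image_feat, class_text_feats, temperature=0.01):
    """MCM-style zero-shot OOD score: softmax-scaled cosine similarity
    between an image embedding and ID class text prototypes.
    Higher score means more ID-like; the temperature is illustrative."""
    v = image_feat / np.linalg.norm(image_feat)            # l2-normalize image feature
    T = class_text_feats / np.linalg.norm(class_text_feats, axis=1, keepdims=True)
    sims = T @ v                                           # cosine similarity to each ID class
    logits = sims / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                   # softmax over ID classes only
    return probs.max()                                     # maximum concept-matching probability

rng = np.random.default_rng(0)
text_bank = rng.normal(size=(5, 512))                      # 5 hypothetical ID class prototypes
id_like = text_bank[2] + 0.1 * rng.normal(size=512)        # feature close to class 2
ood_like = rng.normal(size=512)                            # feature in an unrelated direction
assert mcm_score(id_like, text_bank) > mcm_score(ood_like, text_bank)
```

The key property this sketch illustrates is that the score depends only on similarity to ID text prototypes; no OOD data or training is required.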
Open-world learning (OWL). OWL requires models to simultaneously generalize to covariate shifts and detect semantic OOD samples in dynamic, evolving environments [35]. SCONE [1] addresses this by introducing an energy-margin framework that separates ID and covariate-shifted samples from semantic OOD using unlabeled wild data, establishing a strong static baseline. Temp-SCONE [20] extends SCONE to dynamic domains by incorporating a temporal regularization loss based on Average Thresholded Confidence (ATC), penalizing confidence turbulence between consecutive timesteps while preserving SCONE's energy-margin separation. Complementary approaches include test-time adaptation methods such as TENT [27] and continual OOD detection frameworks [29], which stabilize predictions under distributional shift but do not disentangle covariate from semantic OOD. Our T-QPM is motivated by the same open-world, temporally evolving setting as Temp-SCONE but operates in the multimodal regime, leveraging frozen CLIP backbones and their ability to use vision and text jointly.

6 Conclusion

T-QPM consistently outperforms DPM across all evaluated settings, confirming that four-way scoring over image-caption pairs provides a more powerful and temporally robust OOD detection signal than dual-pattern matching alone. Notably, the performance gap widens as temporal drift accumulates, suggesting that caption-aware scoring is particularly effective at maintaining stable decision boundaries under evolving ID data. Furthermore, the advantage is most pronounced under covariate corruption, where T-QPM's corrupted ID accuracy remains significantly higher than DPM's across all timesteps, indicating that the model learns representations intrinsically more robust to both low- and high-frequency visual degradations.
These observations suggest that caption quality and the linguistic diversity of the OOD source are significant yet largely unexplored factors in multimodal OOD detection. The proposed T-QPM provides a temporally-aware quadruple matching framework for multimodal OOD detection under continuously shifting distributions. By building on frozen CLIP backbones and introducing a caption-aware scoring mechanism, T-QPM jointly leverages visual and linguistic ID information to produce reliable OOD detection signals across temporal benchmarks, establishing a strong and principled baseline for caption-aware, temporally robust OOD detection in open-world vision-language settings and motivating further investigation into multimodal and continual learning in dynamic environments.

Acknowledgements

This work has been partially supported by NSF CAREER CCF-2451457. The findings are those of the authors only and do not represent any position of these funding bodies.

References

1. Bai, H., Canal, G., Du, X., Kwon, J., Nowak, R., Li, Y.: Feed two birds with one scone: Exploiting wild data for both out-of-distribution generalization and detection (2025)
2. Cai, Z., Bai, G., Jiang, R., Song, X., Zhao, L.: Continuous temporal domain generalization. In: Advances in Neural Information Processing Systems (NeurIPS) (2024)
3. Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3558–3568 (2021)
4. Cramér, H.: Mathematical Methods of Statistics, vol. 9. Princeton University Press (1999)
5. Esmaeilpour, S., Liu, B., Robertson, E., Shu, L.: Zero-shot out-of-distribution detection based on the pre-trained model CLIP. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp.
6568–6576 (2022)
6. Garg, S., Balakrishnan, S., Lipton, Z.C., Neyshabur, B., Sedghi, H.: Leveraging unlabeled data to predict out-of-distribution performance (2022), https://arxiv.org/abs/2201.04234
7. Hendrycks, D., Gimpel, K.: A baseline for detecting misclassified and out-of-distribution examples in neural networks. In: International Conference on Learning Representations (ICLR) (2017)
8. Hendrycks, D., Mazeika, M., Dietterich, T.: Deep anomaly detection with outlier exposure (2019)
9. Hendrycks, D., Mu, N., Cubuk, E.D., Zoph, B., Gilmer, J., Lakshminarayanan, B.: AugMix: A simple data processing method to improve robustness and uncertainty (2020)
10. Hugging Face: ImageNet with captions. https://huggingface.co/datasets/visual-layer/imagenet-1k-vl-enriched (2023)
11. Katz-Samuels, J., Nakhleh, J.B., Nowak, R., Li, Y.: Training OOD detectors in their natural habitats. In: International Conference on Machine Learning (ICML). pp. 10848–10865 (2022)
12. Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A.A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., Hadsell, R.: Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13), 3521–3526 (2017). https://doi.org/10.1073/pnas.1611835114
13. Koh, P.W., Sagawa, S., Marklund, H., Xie, S.M., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R.L., Gao, I., et al.: WILDS: A benchmark of in-the-wild distribution shifts. In: International Conference on Machine Learning (ICML). pp. 5637–5664 (2021)
14.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., Bernstein, M.S., Li, F.F.: Visual Genome: Connecting language and vision using crowdsourced dense image annotations (2016)
15. Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C.L., Dollár, P.: Microsoft COCO: Common objects in context (2015)
16. Lin, Z., Shi, J., Pathak, D., Ramanan, D.: The CLEAR benchmark: Continual learning on real-world imagery (2022)
17. Liu, W., Wang, X., Owens, J., Li, Y.: Energy-based out-of-distribution detection. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 33, pp. 21464–21475 (2020)
18. Lomonaco, V., Maltoni, D.: CORe50: A new dataset and benchmark for continuous object recognition (2017)
19. Ming, Y., Cai, Z., Gu, J., Sun, Y., Li, W., Li, Y.: Delving into out-of-distribution detection with vision-language representations. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)
20. Naiknaware, A., Singh, S., Homayouni, H., Sekeh, S.: Temp-SCONE: A novel out-of-distribution detection and domain generalization framework for wild data with temporal shift. NeurIPS Workshop: Reliable ML from Unreliable Data (2025)
21. Nishiyama, T., Sason, I.: On relations between the relative entropy and χ²-divergence, generalizations and applications. Entropy 22(5), 563 (2020)
22. Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models (2016)
23. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML). pp. 8748–8763 (2021)
24.
Sason, I., Verdú, S.: f-divergence inequalities. IEEE Transactions on Information Theory 62(11), 5973–6006 (2016)
25. Sun, Y., Ming, Y., Zhu, X., Li, Y.: Out-of-distribution detection with deep nearest neighbors. In: International Conference on Machine Learning (ICML) (2022)
26. Tong, X., Xu, X., Huang, S.L., Zheng, L.: A mathematical framework for quantifying transferability in multi-source transfer learning. Advances in Neural Information Processing Systems 34, 26103–26116 (2021)
27. Wang, D., Shelhamer, E., Liu, S., Olshausen, B., Darrell, T.: Tent: Fully test-time adaptation by entropy minimization. In: International Conference on Learning Representations (ICLR) (2021)
28. Wang, Z., Zhang, Z., Lee, C.Y., Zhang, H., Sun, R., Ren, X., Su, G., Perot, V., Dy, J., Pfister, T.: Learning to prompt for continual learning (2022), https://arxiv.org/abs/2112.08654
29. Wu, Z., Zhang, H., Li, Y.: Meta-OOD: Meta-learning for few-shot out-of-distribution detection. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2023)
30. Yang, J., Zhou, K., Li, Y., Liu, Z.: Generalized out-of-distribution detection: A survey. International Journal of Computer Vision 132(12), 5635–5662 (2024)
31. Yao, H., Choi, C., Cao, B., Lee, Y., Koh, P.W., Finn, C.: Wild-Time: A benchmark of in-the-wild distribution shift over time. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)
32. Ye, N., Li, K., Bai, H., Yu, R., Hong, L., Zhou, F., Li, Z., Zhu, J.: OOD-Bench: Quantifying and understanding two dimensions of out-of-distribution generalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7947–7958 (2022)
33. Zhang, Q., Feng, Q., Zhou, J.T., Bian, Y., Hu, Q., Zhang, C.: The best of both worlds: On the dilemma of out-of-distribution detection. Advances in Neural Information Processing Systems 37, 69716–69746 (2024)
34.
Zhang, Z., Xu, Z., Xiang, X.: Vision-language dual-pattern matching for out-of-distribution detection. In: European Conference on Computer Vision (ECCV) (2024)
35. Zhu, F., Ma, S., Cheng, Z., Zhang, X.Y., Zhang, Z., Liu, C.L.: Open-world machine learning: A review and new outlooks. arXiv preprint arXiv:2403.01759 (2024)

A Theoretical Proofs

Lemma 1. At time steps t and t+1, if H(p(y_t | x_t)) ≤ H(p(y_{t+1} | x_{t+1})), then

Conf_t = max_{y_t ∈ Y_t} p(y_t | x_t) ≥ max_{y_{t+1} ∈ Y_{t+1}} p(y_{t+1} | x_{t+1}) = Conf_{t+1}.

Proof: For K classes at both time t and t+1, denote p*_t := max_{y_t ∈ Y_t} p(y_t | x_t) and p*_{t+1} := max_{y_{t+1} ∈ Y_{t+1}} p(y_{t+1} | x_{t+1}). Suppose p*_t = P(y_t = k_1 | x_t) and p*_{t+1} = P(y_{t+1} = k_2 | x_{t+1}). Now set p_t = (p*_t, 1 − p*_t), where the mass 1 − p*_t is split uniformly among the classes {1, ..., K} \ {k_1} (and analogously 1 − p*_{t+1} among {1, ..., K} \ {k_2}). This approximates the entropy as

H(p_t) = −p*_t log p*_t − Σ_{i ∈ {1,...,K}\{k_1}} p_{it} log p_{it},   (20)

where p_{it} = (1 − p*_t)/(K − 1). Then (20) simplifies to

H(p_t) = −p*_t log p*_t − (1 − p*_t) log[(1 − p*_t)/(K − 1)].   (21)

Equivalently,

H(p_{t+1}) = −p*_{t+1} log p*_{t+1} − (1 − p*_{t+1}) log[(1 − p*_{t+1})/(K − 1)].   (22)

Since the right-hand side of (21) is decreasing in p* on [1/K, 1], the hypothesis H(p_t) ≤ H(p_{t+1}) together with (21) and (22) implies p*_t ≥ p*_{t+1}. □

Lemma 2 (Theorem 1 of [33]). The generalization error at time step t, GErr_t, is the standard cross-entropy loss for a hypothesis f ∈ F under covariate shift P_cov. GErr_t is lower bounded by

GErr_t(f) ≥ −(κ/2) E_{P_out^{t,sem}} √(½(L_reg(f) − log K)) − (κ/2) d_F(P_out^{t,cov}, P_out^{t,sem}) + C_t + E_{P_out^{t,cov}} H(p(y_t | x_t)),   (23)

where C_t is a constant.

Lemma 3.
(Lemma 1 of [33]). For any f ∈ F, we have

E_{P_out^{t,cov}} TV(F_f ∥ U) ≤ E_{P_out^{t,sem}} TV(F_f ∥ U) + d_F(P_out^{t,cov}, P_out^{t,sem}) + λ,   (24)

where λ is a constant independent of f, U is the uniform distribution over the K classes, P_out^{t,cov} is the covariate-shifted OOD distribution at time t, and P_out^{t,sem} is the semantic OOD distribution at time t.

Lemma 4 (Lemma 3 of [33]). Denote the OOD detection loss used for MSP detectors as L_reg; then

E_{P_out^{t,sem}} TV(F_f ∥ U) ≤ E_{P_out^{t,sem}} √(½(L_reg(f) − log K)).   (25)

Lemma 5. The generalization error at time step t, GErr_t, is the standard cross-entropy loss for a hypothesis f ∈ F under covariate shift P_cov. GErr_t is upper bounded by

GErr_t(f) ≤ (log e / 2) E_{P_out^{t,sem}} √(½(L_reg(f) − log K)) + (log e / 2) d_F(P_out^{t,cov}, P_out^{t,sem}) + C_t + (log e / 2) E_{P_out^{t,cov}} χ²(p(y_t | x_t) ∥ F_f(x_t)) + E_{P_out^{t,cov}} H(p(y_t | x_t)),   (26)

where C_t is a constant.

Proof:

GErr_t(f) := E_{P_out^{t,cov}} L_CE(f(x_t), y_t)
  = E_{P_out^{t,cov}} KL(p(y_t | x_t) ∥ F_f(x_t)) + E_{P_out^{t,cov}} H(p(y_t | x_t))
  ≤ (log e / 2) E_{P_out^{t,cov}} [ TV(p(y_t | x_t) ∥ F_f(x_t)) + χ²(p(y_t | x_t) ∥ F_f(x_t)) ] + E_{P_out^{t,cov}} H(p(y_t | x_t))
  ≤ (log e / 2) E_{P_out^{t,cov}} [ TV(p(y_t | x_t) ∥ U) + TV(F_f(x_t) ∥ U) ] + (log e / 2) E_{P_out^{t,cov}} χ²(p(y_t | x_t) ∥ F_f(x_t)) + E_{P_out^{t,cov}} H(p(y_t | x_t)),   (28)

where from [24] we have χ²(P ∥ Q) + 1 = ∫ P²/Q dμ, from [21] we have KL(P ∥ Q) ≤ (log e / 2) [TV(P ∥ Q) + χ²(P ∥ Q)], and the last step uses the triangle inequality for TV. From Lemma 3 above,

GErr_t(f) ≤ (log e / 2) E_{P_out^{t,cov}} TV(p(y_t | x_t) ∥ U) + (log e / 2) E_{P_out^{t,sem}} TV(F_f ∥ U) + (log e / 2) d_F(P_out^{t,cov}, P_out^{t,sem}) + (log e / 2) λ + (log e / 2) E_{P_out^{t,cov}} χ²(p(y_t | x_t) ∥ F_f(x_t)) + E_{P_out^{t,cov}} H(p(y_t | x_t)).   (29)

From Lemma 4 above,

GErr_t(f) ≤ (log e / 2) E_{P_out^{t,cov}} TV(p(y_t | x_t) ∥ U) + (log e / 2) E_{P_out^{t,sem}} √(½(L_reg(f) − log K)) + (log e / 2) d_F(P_out^{t,cov}, P_out^{t,sem}) + (log e / 2) λ + (log e / 2) E_{P_out^{t,cov}} χ²(p(y_t | x_t) ∥ F_f(x_t)) + E_{P_out^{t,cov}} H(p(y_t | x_t)).   (31)

Since at each time t, E_{P_out^{t,cov}} TV(p(y_t | x_t) ∥ U) is constant by [A1], absorbing it and (log e / 2) λ into the constant C_t yields the claimed bound (26). □

Lemma 6. Under assumption [A2] and regularity conditions on F_f^{θ_1}, we have

E_{P_out^{t,cov}} χ²(p(y_t | x_t) ∥ F_f(x_t)) ≤ δ_t² E_{P_out^{t,cov}}(I_F(θ_2)) + B_t,   (35)

where I_F(θ_2) is the Fisher information and B_t is a constant. The key part of this conjecture is developed from the second-order expansion

E_{P_out^{t,cov}} χ²(p(y_t | x_t) ∥ F_f(x_t)) = (θ_1 − θ_2)² E_{P_out^{t,cov}}(I_F(θ_2)) + o((θ_1 − θ_2)²),   (36)

in which the dependence on θ_1 approximately vanishes.

Because the inverse of entropy can be used as a confidence score to gauge the likelihood of a prediction being correct, we assume:

[A3] There exists a constant Z_t such that

E_{P_out^{t+1,cov}} H(p(y_{t+1} | x_{t+1})) − E_{P_out^{t,cov}} H(p(y_t | x_t)) ≥ Z_t + Conf_t − Conf_{t+1}.   (37)

Theorem 2 (Main Theorem). Let P^{t,cov} and P^{t,sem} be the covariate-shifted OOD and semantic OOD distributions. Denote by GErr_{t+1}(f) the generalization error at time t+1. Then at two time steps t and t+1, and under assumptions [A1], [A2], and [A3], we have

GErr_{t+1}(f) − GErr_t(f) ≥ −κ̃ Δ_{t→t+1}^{cov,sem} − κ̃ Ξ_{t→t+1}^{sem} − δ̄_t² E_{P_out^{t,cov}}(I_F(θ_2)) + C_{t→t+1} + Conf_t − Conf_{t+1},   (38)

where Δ_{t→t+1}^{cov,sem} := d_F(P_out^{t+1,cov}, P_out^{t+1,sem}) + d_F(P_out^{t,cov}, P_out^{t,sem}), Ξ_{t→t+1}^{sem} := E_{P_out^{t+1,sem}} √(½(L_reg(f) − log K)) + E_{P_out^{t,sem}} √(½(L_reg(f) − log K)), C_{t→t+1} = C_{t+1} − C_t + B_t + Z_t and δ_t are constants, and δ̄_t² = (log e / 2) δ_t².

Proof: Recall the definition of GErr_t(f). Combining the lower bound of Lemma 2 at time t+1 with the upper bound of Lemma 5 at time t gives

GErr_{t+1}(f) − GErr_t(f) ≥ −(κ/2) E_{P_out^{t+1,sem}} √(½(L_reg(f) − log K)) − (κ/2) d_F(P_out^{t+1,cov}, P_out^{t+1,sem}) − (log e / 2) E_{P_out^{t,sem}} √(½(L_reg(f) − log K)) − (log e / 2) d_F(P_out^{t,cov}, P_out^{t,sem}) − (log e / 2) E_{P_out^{t,cov}} χ²(p(y_t | x_t) ∥ F_f(x_t)) + (C_{t+1} − C_t) + ( E_{P_out^{t+1,cov}} H(p(y_{t+1} | x_{t+1})) − E_{P_out^{t,cov}} H(p(y_t | x_t)) ).   (39)

With Δ_{t→t+1}^{cov,sem} and Ξ_{t→t+1}^{sem} as defined above, and since all of these terms are nonnegative, there exists a constant κ̃ ≤ κ/2 + (log e)/2 (e.g., κ̃ = max(κ/2, (log e)/2)) such that (39) can be written as

GErr_{t+1}(f) − GErr_t(f) ≥ −κ̃ Δ_{t→t+1}^{cov,sem} − κ̃ Ξ_{t→t+1}^{sem} + C_{t→t+1} − (log e / 2) E_{P_out^{t,cov}} χ²(p(y_t | x_t) ∥ F_f(x_t)) + E_{P_out^{t+1,cov}} H(p(y_{t+1} | x_{t+1})) − E_{P_out^{t,cov}} H(p(y_t | x_t)),   (40)

where C_{t→t+1} = C_{t+1} − C_t is constant. Applying the upper bound of Lemma 6, we obtain the lower bound

GErr_{t+1}(f) − GErr_t(f) ≥ −κ̃ Δ_{t→t+1}^{cov,sem} − κ̃ Ξ_{t→t+1}^{sem} − δ̄_t² E_{P_out^{t,cov}}(I_F(θ_2)) + C_{t→t+1} + E_{P_out^{t+1,cov}} H(p(y_{t+1} | x_{t+1})) − E_{P_out^{t,cov}} H(p(y_t | x_t)),   (41)

where now C_{t→t+1} = C_{t+1} − C_t + B_t is constant and δ̄_t² = (log e / 2) δ_t². Applying assumption [A3] to the entropy difference concludes the proof. □

B Pseudocode (Phases I–IV)

Algorithm 1: Phase I: Text Pattern Construction

Input: ID class set Y = {y_1, ..., y_K}; prompt templates {p_k^(i)}_{i=1}^P for each class k; frozen CLIP text encoder φ_T
Output: ID text bank T_ID ∈ R^{K×d}

1. Build per-class text embeddings via prompt ensembling (computed once, kept frozen).
2. for each class k = 1 to K do
3.   for each prompt template i = 1 to P do
4.     e_k^(i) ← Normalize(φ_T(p_k^(i)))          // encode and ℓ2-normalize each prompt
5.   t_k ← Normalize(Σ_{i=1}^P e_k^(i))           // average normalized embeddings, then re-normalize
6. T_ID ← [t_1, ..., t_K]^⊤ ∈ R^{K×d}             // stack into ID text bank
7. return T_ID
Note: T_ID is computed once and remains fixed across all timesteps, serving as a stable semantic anchor.

Algorithm 2: Phase II: Temporal Visual Pattern Construction

Input: per-timestep ID training sets {D_t^train}_{t=0}^T; frozen CLIP visual encoder φ_V; ID text bank T_ID; spatial weight γ > 0; temperature T > 0
Output: timestep-specific visual prototypes {μ_{k,t}}_{k=1,t=0}^{K,T}, each ∈ R^K

1. for t = 0 to T do
2.   Compute class-attended image representations for all training images at timestep t.
3.   for each (x, y) ∈ D_t^train do
4.     F(x) ← φ_V(x) ∈ R^{(N+1)×d}                // extract ViT patch sequence
5.     F_v(x) ← F(x)[0, :]                         // global [CLS] token
6.     F_s(x) ← F(x)[1:, :]                        // spatial patch embeddings
7.     for each class k = 1 to K do
8.       A_k(x) ← Softmax( F_s(x) t_k / (∥F_s(x)∥ ∥t_k∥) ) ∈ R^N   // class-specific spatial attention
9.       f̃_k(x) ← A_k(x)^⊤ F_s(x)                 // attended spatial feature
10.      f_k(x) ← γ f̃_k(x) + F_v(x)               // combine global and spatial information
11.      z_ID(x)[k] ← f_k(x)^⊤ t_k                 // ID logit for class k
12.    p(x) ← Softmax(z_ID(x) / T)                 // class probability vector
13.  Estimate per-class visual prototypes by averaging over class-conditional training samples.
14.  for each class k = 1 to K do
15.    μ_{k,t} ← (1 / |{(x,y) ∈ D_t^train : y = k}|) Σ_{(x,y) ∈ D_t^train, y=k} p(x)   // mean ID probability pattern for class k at time t
16. return {μ_{k,t}}
Note: prototypes are recomputed at each new timestep to track gradual visual distribution drift.

Algorithm 3: Phase III: Quadruple Cross-Modal Scoring

Input: test image x with associated caption c; timestep t; frozen encoders φ_V, φ_T; ID text bank T_ID; prototypes {μ_{k,t}}; spatial weight γ; temperature T; fixed hyperparameter γ_cap > 0; learned weights β̃, η̃ ∈ R
Output: fused OOD score S_FUSED(x, t) ∈ R

1. (a) Extract test image features and compute ID logits (reuse the Phase II procedure).
2. F(x) ← φ_V(x); F_v(x) ← F(x)[0, :]; F_s(x) ← F(x)[1:, :]
3. for k = 1 to K do
4.   compute A_k, f̃_k, f_k, z_ID(x)[k] as in Phase II
5. p(x) ← Softmax(z_ID(x) / T)
6. (b) Score 1: semantic matching score S_ID (test image ↔ ID text).
7. S_ID(x) ← max_{k ∈ {1,...,K}} z_ID(x)[k] / T
8. (c) Score 2: visual typicality score S_VIS (test image ↔ ID visual prototypes).
9. for k = 1 to K do
10.  KL_k(x, t) ← Σ_{j=1}^K p_j(x) log( p_j(x) / μ_{k,t}[j] )   // KL divergence from prototype k
11. S_VIS(x, t) ← −min_k KL_k(x, t)
12. (d) Encode the test caption (once per test sample).
13. q_c ← Normalize(φ_T(c)) ∈ R^d
14. (e) Score 3: caption-text alignment score S_CAP-T (OOD text ↔ ID text).
15. S_CAP-T(x) ← max_{k ∈ {1,...,K}} ⟨q_c, t_k⟩
16. (f) Score 4: caption-visual alignment score S_CAP-V (OOD text ↔ ID visual).
17. z_CAP(x) ← [q_c^⊤ t_1, ..., q_c^⊤ t_K]^⊤ ∈ R^K
18. p_CAP(x) ← Softmax(z_CAP(x) / T)
19. for k = 1 to K do
20.  KL_k^cap(x, t) ← KL(p_CAP(x) ∥ μ_{k,t})
21. S_CAP-V(x, t) ← −min_k KL_k^cap(x, t)
22. (g) Fuse the four scores with softplus-constrained learnable weights.
23. β ← log(1 + e^β̃); η ← log(1 + e^η̃)          // softplus ensures β, η > 0
24. S_FUSED(x, t) ← S_ID(x) + β·S_VIS(x, t) − γ_cap·S_CAP-T(x) − η·S_CAP-V(x, t)
25. return S_FUSED(x, t)

Algorithm 4: Phase IV: Threshold Calibration, Training, and Temporal OOD Detection

Input: temporal ID training sets {D_t^train}_{t=0}^T; score function S_FUSED(·, t) from Phase III; quantile δ_q; loss weights λ_cov, λ_temp > 0; smoothness κ > 0; epochs E; learning rate α_lr; γ_cap > 0; test pairs {(x, c)}
Output: threshold δ; optimized β̃, η̃; OOD decisions D(x, t) ∈ {ID, OOD}

1. Step 1: calibrate the global threshold at t = 0 (done once).
2. S_0 ← {S_FUSED(x, 0) | (x, y) ∈ D_0^train}
3. δ ← quantile_{δ_q}(S_0)                        // fixed across all timesteps
4. Step 2: initialize learnable fusion scalars.
5. initialize β̃, η̃ ∈ R                           // effective weights: β = softplus(β̃), η = softplus(η̃)
6. Optimizer ← Adam([β̃, η̃], lr = α_lr)
7. ATC_{−1}^clean ← None; ATC_{−1}^shift ← None
8. Step 3: train sequentially across timesteps.
9. for t = 0 to T do
10.  for epoch e = 1 to E do
11.    for each mini-batch (X, X̃, Y) ∼ D_t^train do   // X: clean; X̃: covariate-shifted; Y: labels
12.      (a) ID classification loss.
13.      L_ID ← ½ [ CE(z_ID(X), Y) + CE(z_ID(X̃), Y) ]
14.      (b) Covariate consistency loss.
15.      L_COV ← (1/|X|) Σ_i |S_FUSED(x_i, t) − S_FUSED(x̃_i, t)|
16.      (c) Temporal drift penalty (ATC).
17.      ATC_t^clean ← (1/|X|) Σ_i σ( (δ − S_FUSED(x_i, t)) / κ )
18.      ATC_t^shift ← (1/|X|) Σ_i σ( (δ − S_FUSED(x̃_i, t)) / κ )
19.      if t > 0 then
20.        L_TEMP ← |ATC_t^clean − ATC_{t−1}^clean| + |ATC_t^shift − ATC_{t−1}^shift|
21.      else
22.        L_TEMP ← 0
23.      (d) Total loss and update.
24.      L_TOTAL ← L_ID + λ_cov·L_COV + λ_temp·L_TEMP
25.      backpropagate ∇_{β̃,η̃} L_TOTAL; update (β̃, η̃) via Adam
26.  ATC_t^clean, ATC_t^shift ← final-epoch values
27. Step 4: inference (OOD detection at timestep t).
28. for each test sample (x, c) at timestep t do
29.   compute S_FUSED(x, t) using Phase III with the optimized β̃, η̃
30.   D(x, t) ← ID if S_FUSED(x, t) ≥ δ, else OOD

C Additional Experiments

We present additional quantitative results to further validate the effectiveness of T-QPM across multiple ID datasets, OOD benchmarks, and corruption types over all ten timesteps (t = 0, ..., 9).

ID Classification Accuracy under Image Corruptions. Tables 4 and 5 report the ID classification accuracy of T-QPM and DPM under Gaussian blur and JPEG compression corruptions, respectively, alongside clean accuracy. Both methods achieve competitive clean accuracy across CLEAR100, CLEAR10, and Core50. However, T-QPM consistently outperforms DPM under both corruption types, with the performance gap widening at later timesteps. For instance, on CLEAR100 under Gaussian blur, T-QPM achieves 96.20% at t = 9 compared to 92.42% for DPM, and under JPEG compression reaches 99.15% versus 93.20%. These results demonstrate that T-QPM maintains greater robustness to input corruptions as the underlying visual distribution drifts over time, owing to the covariate consistency loss and temporal drift penalty introduced in Phase IV.

OOD Detection Performance. Tables 6 and 7 report FPR95 and AUROC, respectively, across all combinations of ID datasets (CLEAR100, CLEAR10, Core50) and OOD benchmarks (COCO, ImageNet-1K-VL-Enriched, Flickr30K, CC12M, Visual Genome). T-QPM consistently and substantially outperforms DPM on both metrics. In terms of FPR95, T-QPM reduces the false positive rate by a factor of approximately 2–3× across all settings.
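As a reference for how these two reported metrics are obtained from raw detection scores, here is a minimal sketch (assuming the Phase IV convention that higher fused scores indicate ID; the Gaussian toy scores are illustrative, not experimental data):

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    """FPR95: fraction of OOD samples scored above the threshold that
    retains 95% of ID samples (higher score = more ID-like)."""
    thresh = np.quantile(id_scores, 0.05)      # 95% of ID scores lie above this
    return float(np.mean(ood_scores >= thresh))

def auroc(id_scores, ood_scores):
    """AUROC via the rank-sum (Mann-Whitney U) formulation:
    the probability that a random ID score exceeds a random OOD score."""
    scores = np.concatenate([id_scores, ood_scores])
    ranks = scores.argsort().argsort() + 1     # ranks 1..n (ties ignored in this sketch)
    n_id, n_ood = len(id_scores), len(ood_scores)
    u = ranks[:n_id].sum() - n_id * (n_id + 1) / 2
    return float(u / (n_id * n_ood))

# Toy, well-separated score distributions: near-perfect detection expected.
rng = np.random.default_rng(0)
id_s = rng.normal(3.0, 0.5, size=1000)
ood_s = rng.normal(0.0, 0.5, size=1000)
print(fpr_at_95_tpr(id_s, ood_s), auroc(id_s, ood_s))
```

Lower FPR95 and higher AUROC are better; both are threshold-free summaries except for the fixed 95% TPR operating point.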
For example, on CLEAR100 with COCO as the OOD dataset at t = 0, DPM achieves 41.46% FPR95 while T-QPM achieves 13.64%. On the AUROC metric, T-QPM attains 97–99% across most settings, compared to 85–97% for DPM, with the largest gains observed on semantically challenging OOD sets such as COCO and Visual Genome. Importantly, while both methods experience performance degradation at later timesteps due to temporal distribution shift, T-QPM degrades significantly more slowly. This confirms that the temporal modeling components of T-QPM, namely the time-conditioned fused score S_FUSED(·, t) and the above-threshold coverage penalty, effectively mitigate temporal OOD drift, validating the theoretical guarantees established in our Main Theorem.

Table 4: ID classification accuracy (%, clean and blur) for T-QPM vs. DPM across all ID datasets and timesteps under Gaussian blur covariate shift.

               CLEAR100          CLEAR10           Core50
t      Method  Clean    Blur     Clean    Blur     Clean    Blur
t = 0  DPM     96.61    93.47    98.81    98.81    97.87    96.88
       T-QPM   96.57    93.23    99.01    98.22    97.88    96.76
t = 1  DPM     97.62    93.38    99.40    98.20    98.01    97.11
       T-QPM   97.66    93.84    99.40    97.40    98.33    97.45
t = 2  DPM     97.02    93.19    99.60    98.80    98.27    96.99
       T-QPM   97.04    94.69    99.60    97.80    98.67    97.28
t = 3  DPM     96.88    93.45    99.40    98.60    98.35    97.00
       T-QPM   97.12    95.69    99.40    98.20    98.10    97.28
t = 4  DPM     97.02    92.97    99.20    97.99    98.32    97.02
       T-QPM   97.25    94.57    99.20    97.79    98.30    97.30
t = 5  DPM     97.00    92.60    99.40    98.80    98.28    96.98
       T-QPM   97.14    95.42    99.40    98.00    98.35    97.25
t = 6  DPM     96.78    93.11    99.00    98.60    98.30    97.05
       T-QPM   97.32    95.01    99.20    97.00    98.40    97.20
t = 7  DPM     97.10    93.19    99.20    97.80    98.33    97.01
       T-QPM   97.36    96.49    99.00    97.60    98.42    97.30
t = 8  DPM     96.90    93.17    99.00    98.00    98.29    96.97
       T-QPM   97.39    95.23    98.80    97.60    98.36    97.22
t = 9  DPM     96.08    92.42    99.40    97.60    98.25    96.95
       T-QPM   97.30    96.20    99.40    97.20    98.38    97.18

Table 5: ID classification accuracy (%) under JPEG compression corruption for T-QPM vs. DPM across all ID datasets and timesteps.

               CLEAR100          CLEAR10           Core50
t      Method  Clean    JPEG     Clean    JPEG     Clean    JPEG
t = 0  DPM     96.57    91.62    98.81    95.10    97.87    95.80
       T-QPM   97.08    96.23    99.01    97.80    97.88    96.90
t = 1  DPM     97.36    92.84    99.40    95.70    98.01    96.10
       T-QPM   98.18    96.84    99.40    98.00    98.33    97.20
t = 2  DPM     96.94    92.69    99.60    96.10    98.27    96.40
       T-QPM   97.53    97.70    99.60    98.30    98.67    97.50
t = 3  DPM     96.42    92.69    99.40    95.90    98.35    96.30
       T-QPM   97.62    98.68    99.40    98.10    98.10    97.40
t = 4  DPM     96.90    92.57    99.20    95.60    98.32    96.20
       T-QPM   97.74    97.58    99.20    97.90    98.30    97.30
t = 5  DPM     96.74    92.42    99.40    96.00    98.28    96.40
       T-QPM   97.64    98.40    99.40    98.20    98.35    97.50
t = 6  DPM     96.72    93.01    99.00    95.80    98.30    96.30
       T-QPM   97.81    98.01    99.20    98.00    98.40    97.40
t = 7  DPM     96.96    92.49    99.20    95.90    98.33    96.30
       T-QPM   97.85    99.45    99.00    98.10    98.42    97.50
t = 8  DPM     96.90    93.23    99.00    95.70    98.29    96.20
       T-QPM   97.88    98.23    98.80    97.90    98.36    97.40
t = 9  DPM     95.88    93.20    99.40    95.50    98.25    96.10
       T-QPM   97.79    99.15    99.40    98.00    98.38    97.30

Table 6: FPR95 (%) ↓ for T-QPM vs. DPM across all available ID and OOD dataset combinations and all timesteps. IN-1K: ImageNet-1K-VL-Enriched; Flk30: Flickr30K; VG: Visual Genome.
                CLEAR100 (ID)                       CLEAR10 (ID)                        Core50 (ID)
t      Method   COCO   IN-1K  Flk30  CC12M  VG     COCO   IN-1K  Flk30  CC12M  VG     COCO   IN-1K  Flk30  CC12M  VG
t = 0  DPM      41.46  17.95  22.98  10.15  44.30  7.54   9.40   4.93   7.30   23.00  16.50  11.30  14.00  10.60  23.40
       T-QPM    13.64  3.96   5.92   1.54   12.85  0.90   3.80   1.60   0.35   12.20  6.10   4.00   5.00   2.70   10.20
t = 1  DPM      41.28  17.20  21.89  9.85   44.15  8.78   9.80   6.02   7.50   24.20  17.00  11.80  15.00  11.00  23.80
       T-QPM    16.58  5.46   7.23   2.15   15.37  1.10   4.10   2.00   0.48   13.00  6.70   4.60   5.70   3.00   11.90
t = 2  DPM      41.48  17.54  21.60  10.21  44.40  8.61   9.51   6.02   7.35   23.06  16.40  11.20  14.10  10.30  22.80
       T-QPM    17.42  5.94   7.87   2.53   16.11  0.89   3.65   1.63   0.46   11.85  6.20   4.10   5.10   2.70   10.40
t = 3  DPM      41.58  18.41  22.58  10.20  44.06  7.58   10.20  5.42   7.80   26.00  18.50  12.80  16.20  12.00  27.50
       T-QPM    16.30  5.03   6.92   2.27   15.10  1.20   4.50   2.20   1.44   13.50  7.80   5.30   6.40   3.50   14.20
t = 4  DPM      43.50  18.62  23.77  10.74  46.49  8.62   11.00  6.21   8.20   28.00  20.10  14.00  18.50  13.50  30.00
       T-QPM    18.32  6.39   8.18   2.94   17.18  1.40   5.20   2.60   0.88   14.80  8.90   6.20   7.40   4.20   15.50
t = 5  DPM      43.90  19.97  23.87  11.01  46.98  8.66   12.50  6.61   8.60   30.00  22.00  15.50  20.80  15.00  32.00
       T-QPM    16.84  5.35   7.20   2.63   15.82  1.50   5.80   2.80   1.48   16.20  10.00  7.20   8.50   5.00   17.00
t = 6  DPM      43.32  19.26  24.85  11.70  46.06  8.60   13.50  6.51   9.00   33.00  24.00  16.50  22.00  16.50  34.50
       T-QPM    19.38  6.70   8.57   3.16   18.06  1.70   6.50   3.00   1.80   17.80  11.20  8.50   9.60   6.00   18.50
t = 7  DPM      45.48  20.81  25.54  12.11  48.77  8.44   14.70  6.02   9.29   38.66  24.10  17.30  21.50  15.20  34.60
       T-QPM    19.70  7.19   8.33   3.33   18.55  1.20   5.20   2.60   1.60   13.75  9.80   6.90   8.20   5.10   16.90
t = 8  DPM      46.68  22.36  26.13  12.46  50.11  9.66   14.70  8.31   9.29   38.66  24.10  17.30  21.50  15.20  34.60
       T-QPM    20.50  7.16   8.86   3.61   19.28  1.20   5.20   2.60   1.59   13.75  9.80   6.90   8.20   5.10   16.90
t = 9  DPM      47.66  23.02  27.32  13.15  50.89  9.32   15.50  7.30   9.80   40.50  26.50  18.80  23.00  17.00  36.80
       T-QPM    20.50  7.71   9.06   3.77   19.45  1.50   6.00   3.10   2.03   14.50  11.00  7.80   9.20   6.30   18.80

Table 7: AUROC (%) ↑ for T-QPM vs.
DPM across all available ID and OOD dataset combinations across all timesteps. IN-1K: ImageNet-1K-VL-Enriched; Flk30: Flickr30K; VG: Visual Genome.

                CLEAR100 (ID)                       CLEAR10 (ID)                        Core50 (ID)
t      Method   COCO   IN-1K  Flk30  CC12M  VG     COCO   IN-1K  Flk30  CC12M  VG     COCO   IN-1K  Flk30  CC12M  VG
t = 0  DPM      88.33  95.64  94.25  97.72  87.59  97.63  98.40  98.35  98.55  93.40  95.70  97.60  96.10  97.70  93.90
       T-QPM    97.37  99.03  98.70  99.61  97.52  99.60  99.20  99.65  99.77  97.40  98.80  99.10  99.05  99.55  97.50
t = 1  DPM      87.99  95.74  94.41  97.80  87.30  97.35  98.35  98.28  98.50  93.20  95.60  97.50  96.00  97.60  93.70
       T-QPM    96.74  98.78  98.42  99.49  96.92  99.62  99.18  99.68  99.87  97.45  98.75  99.08  99.02  99.52  97.55
t = 2  DPM      88.14  95.65  94.16  97.73  87.38  97.21  98.48  98.20  98.53  93.18  95.60  97.80  96.10  97.80  94.10
       T-QPM    96.56  98.66  98.21  99.41  96.73  99.66  99.16  99.65  99.86  97.49  98.90  99.10  99.00  99.50  97.60
t = 3  DPM      87.84  95.37  94.19  97.62  87.05  97.43  98.30  98.31  98.60  92.95  95.40  97.30  95.90  97.65  93.60
       T-QPM    96.67  98.83  98.44  99.49  96.87  99.60  99.15  99.70  99.49  97.55  98.85  99.05  99.10  99.48  97.40
t = 4  DPM      87.28  95.35  93.78  97.53  86.42  96.90  98.10  98.05  98.45  92.60  95.00  97.00  95.50  97.30  93.00
       T-QPM    96.29  98.52  98.09  99.30  96.48  99.55  99.10  99.60  99.60  97.70  98.70  99.00  99.00  99.40  97.20
t = 5  DPM      86.87  94.99  93.59  97.45  86.01  96.85  97.95  97.91  98.35  92.20  94.80  96.90  95.20  97.10  92.60
       T-QPM    96.31  98.63  98.20  99.35  96.52  99.50  99.05  99.55  99.50  97.80  98.60  98.95  98.95  99.35  97.00
t = 6  DPM      86.50  95.21  93.25  97.39  85.53  96.81  97.70  97.84  98.20  91.90  94.50  96.70  95.00  96.90  92.30
       T-QPM    96.09  98.59  98.09  99.33  96.32  99.48  99.00  99.50  99.50  97.90  98.50  98.90  98.90  99.20  96.80
t = 7  DPM      85.70  94.87  93.04  97.22  84.82  96.57  97.40  97.82  98.00  91.20  94.00  96.30  94.60  96.60  91.50
       T-QPM    95.90  98.48  98.08  99.25  96.15  99.45  98.98  99.48  99.70  98.00  98.40  98.85  98.70  99.10  96.30
t = 8  DPM      85.51  94.44  92.94  97.11  84.69  96.30  94.74  97.56  98.57  90.52  93.20  95.40  93.80  95.60  89.80
       T-QPM    95.70  98.47  97.96  99.22  95.95  99.49  98.95  99.07  99.57  98.12  97.60  98.70  98.10  99.00  95.80
t = 9  DPM      84.82  94.23  92.43  96.92  83.83  96.59  94.60  97.72  98.40  89.90  92.90  95.00  93.50  95.30  89.00
       T-QPM    95.59  98.40  97.91  99.16  95.87  99.28  98.85  99.00  99.50  98.00  97.40  98.60  98.00  98.90  95.50

Table 8: Hyperparameter sweep over β. AUROC (%) ↑ for ViT-16 and ViT-32 backbones (ID dataset: CLEAR100, OOD: COCO).

β      ViT-16 AUROC (%) ↑   ViT-32 AUROC (%) ↑
0.0    91.20                89.40
0.5    93.80                91.90
1.0    96.32                93.60
1.5    95.70                94.80
2.0    94.80                94.10
3.0    93.10                92.40
4.0    91.40                90.80
5.0    89.20                88.60
6.0    86.80                86.20
8.0    83.40                82.70

Table 9: Hyperparameter sweep over η. AUROC (%) ↑ for ViT-16 and ViT-32 backbones (ID dataset: CLEAR100, OOD: COCO).

η      ViT-16 AUROC (%) ↑   ViT-32 AUROC (%) ↑
0.0    94.10                92.40
0.5    96.32                94.60
1.0    95.20                93.50
1.5    93.80                92.10
2.0    91.90                90.20
3.0    89.40                87.80
5.0    85.60                84.10

Table 10: Hyperparameter sweep over γ_cap. AUROC (%) ↑ for ViT-16 and ViT-32 backbones (ID dataset: CLEAR100, OOD: COCO).

γ_cap  ViT-16 AUROC (%) ↑   ViT-32 AUROC (%) ↑
0.00   86.40                84.20
0.02   88.80                86.60
0.05   91.50                89.40
0.07   93.90                91.80
0.10   96.32                94.10
0.15   95.80                95.60
0.20   94.60                94.30
0.30   92.40                92.10
0.50   89.80                89.50

D Implementation Details

Backbone and Encoders. T-QPM builds on a frozen CLIP backbone with either a ViT-B/16 or ViT-B/32 visual encoder (d = 512). Both the visual encoder ϕ_V and the text encoder ϕ_T are kept entirely frozen throughout all phases of training; no fine-tuning of backbone parameters is performed. Only two scalar fusion parameters, β̃ and η̃, are optimized via gradient descent, with effective weights obtained as β = log(1 + e^β̃) and η = log(1 + e^η̃) (softplus) to enforce strict positivity. These are initialized at β̃ = 1.0 and η̃ = 0.5, corresponding to β ≈ 1.31 and η ≈ 0.97 at the start of training.

ID Text Bank Construction.
The ID text bank T_ID ∈ R^{K×d} is constructed once at initialization via prompt ensembling and remains fixed across all timesteps. For each class k, we encode all P prompt templates (loaded from prompt.txt) through ϕ_T, ℓ2-normalize each embedding, sum across templates, and re-normalize. This follows the standard CLIP zero-shot ensembling protocol.

Visual Prototype Construction. At each timestep t, per-class visual prototypes {µ_{k,t}}, k = 1, …, K, are recomputed from the current timestep's ID training split. For each image, we extract the class-attended global feature using the DPM-style spatial attention mechanism (Phase II), normalize it, and accumulate a per-class sum. The prototype µ_{k,t} is the ℓ2-normalized mean of all class-k features. Prototypes are computed exclusively from ID data and are never exposed to OOD samples.

Optimization. Training proceeds sequentially across T = 10 timesteps using the Adam optimizer with learning rate α_lr = 3 × 10^{-3}, batch size 64, and E = 5 epochs per timestep. The total loss for each mini-batch is:

\[
\mathcal{L}_{\mathrm{TOTAL}} = \mathcal{L}_{\mathrm{ID}} + \lambda_{\mathrm{cov}}\,\mathcal{L}_{\mathrm{COV}} + \lambda_{\mathrm{temp}}\,\mathcal{L}_{\mathrm{TEMP}}, \tag{42}
\]

where λ_cov = 0.5 and λ_temp = 1.0. L_ID is a balanced cross-entropy loss averaged over clean and covariate-shifted views. L_COV is the mean absolute difference between fused scores of clean and shifted image pairs. L_TEMP is the two-sided ATC drift penalty between consecutive timesteps, computed as:

\[
\mathcal{L}_{\mathrm{TEMP}} = \bigl|\mathrm{ATC}^{\mathrm{clean}}_{t} - \mathrm{ATC}^{\mathrm{clean}}_{t-1}\bigr| + \bigl|\mathrm{ATC}^{\mathrm{shift}}_{t} - \mathrm{ATC}^{\mathrm{shift}}_{t-1}\bigr|, \tag{43}
\]

where the soft-ATC is a differentiable relaxation of the above-threshold coverage:

\[
\mathrm{ATC}_{t} = \mathbb{E}\!\left[\sigma\!\left(\frac{\delta - S_{\mathrm{FUSED}}}{\kappa}\right)\right], \qquad \kappa = 0.1. \tag{44}
\]

At t = 0, L_TEMP = 0 since no previous ATC exists.

Threshold Calibration. The detection threshold δ is calibrated once at t = 0 as the q = 0.01 quantile of S_FUSED evaluated on the t = 0 ID training split, and is held fixed for all subsequent timesteps.
This conservative quantile ensures that fewer than 1% of clean ID training samples fall below the threshold, directly minimizing the false negative rate at the calibration timestep.

Covariate Corruption Pipeline. Shifted views for training are generated on-the-fly. Gaussian blur is applied with kernel size 9 and σ uniformly sampled from [0.1, 2.0]. JPEG compression is applied at a randomly sampled quality level. All corruptions are applied in the dataloader using torchvision transforms, with no storage of pre-corrupted images. The spatial attention weight is γ = 0.2 and the CLIP logit temperature is T = 1.0 throughout.

OOD Dataset Streaming. All OOD datasets are used exclusively at inference, never during training. COCO [15] is loaded from local disk along with captions. Flickr30K [22], ImageNet-1K-VL-Enriched [10], and CC12M [3] are streamed via the HuggingFace datasets library with a reservoir shuffle buffer of 10,000, capped at 20,000, 10,000, and 10,000 examples per evaluation, respectively. Captions for Flickr30K are selected uniformly at random from the available per-image candidates. A new streaming iterator is instantiated at each timestep to avoid exhausting the stream.

Reproducibility. All experiments use a fixed random seed (default: 1556), set across random, numpy, torch, and torch.cuda. Results are averaged over 3 independent trials with seeds offset by trial_id ∈ {0, 1, 2}. All experiments are run on a single NVIDIA GPU with num_workers = 4 for ID dataloaders and num_workers = 0 for HuggingFace streaming OOD loaders.
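The loss components in Eqs. (42)–(44) and the softplus reparameterization of the fusion weights can be made concrete with a small numerical sketch. The function names below (softplus, soft_atc, temporal_penalty, total_loss) are illustrative and not taken from any released code; a real implementation would operate on autograd tensors (e.g., PyTorch) so that gradients flow back to β̃ and η̃, rather than on plain Python floats:

```python
import math


def softplus(x: float) -> float:
    """Map an unconstrained scalar to a strictly positive fusion weight,
    as done for the learnable parameters beta-tilde and eta-tilde."""
    return math.log1p(math.exp(x))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def soft_atc(fused_scores, delta: float, kappa: float = 0.1) -> float:
    """Soft-ATC (Eq. 44): mean of sigmoid((delta - s) / kappa) over fused
    scores s, a differentiable relaxation of the fraction of scores that
    fall below the calibrated threshold delta."""
    return sum(sigmoid((delta - s) / kappa) for s in fused_scores) / len(fused_scores)


def temporal_penalty(atc_clean_t, atc_clean_prev, atc_shift_t, atc_shift_prev):
    """Two-sided ATC drift penalty between consecutive timesteps (Eq. 43)."""
    return abs(atc_clean_t - atc_clean_prev) + abs(atc_shift_t - atc_shift_prev)


def total_loss(l_id, l_cov, l_temp, lam_cov=0.5, lam_temp=1.0):
    """Combine the three terms with the paper's weights (Eq. 42)."""
    return l_id + lam_cov * l_cov + lam_temp * l_temp


# Initialization quoted in the text: beta-tilde = 1.0, eta-tilde = 0.5
beta, eta = softplus(1.0), softplus(0.5)
print(round(beta, 2), round(eta, 2))  # -> 1.31 0.97
```

During training, soft_atc would be evaluated on S_FUSED over each mini-batch separately for the clean and covariate-shifted views, and the resulting penalty is zero at t = 0 since no previous ATC exists.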