AGCD: Agent-Guided Cross-Modal Decoding for Weather Forecasting
Jing Wu (1,*), Yang Liu (2,*), Lin Zhang (3,*), Junbo Zeng (1), Jiabin Wang (1), Zi Ye (1), Guowen Li (1), Shilei Cao (1), Jiashun Cheng (2), Fang Wang (4), Meng Jin (5), YeRong Feng (6), Hong Cheng (2), Yutong Lu (1,7), Haohuan Fu (8,7), and Juepeng Zheng (1,7,†)

1 Sun Yat-sen University, Zhuhai, China
2 The Chinese University of Hong Kong, Hong Kong, China
3 Jiangxi Science and Technology Normal University, Nanchang, China
4 China Meteorological Administration, Beijing, China
5 Huawei Technologies Co., Ltd, China
6 Guangdong-Hong Kong-Macao Greater Bay Area Weather Research Center for Monitoring Warning and Forecasting, China
7 National Supercomputing Center in Shenzhen, Shenzhen, China
8 Tsinghua University, Shenzhen, China
* Equal contribution. † Corresponding author.

Abstract. Accurate weather forecasting is more than grid-wise regression: it must preserve coherent synoptic structures and the physical consistency of meteorological fields, especially under autoregressive rollouts, where small one-step errors can amplify into structural bias. Existing physics-prior approaches typically impose global, once-for-all constraints via architectures, regularization, or NWP coupling, offering limited state-adaptive and sample-specific controllability at deployment. To bridge this gap, we propose Agent-Guided Cross-modal Decoding (AGCD), a plug-and-play decoding-time prior-injection paradigm that derives state-conditioned physics priors from the current multivariate atmosphere and injects them into forecasters in a controllable and reusable way. Specifically, we design a multi-agent meteorological narration pipeline that leverages MLLMs to extract diverse meteorological elements and generate state-conditioned physics priors.
To apply these priors effectively, AGCD further introduces cross-modal region interaction decoding, which performs region-aware multi-scale tokenization and efficient physics-prior injection to refine visual features without changing the backbone interface. Experiments on WeatherBench demonstrate consistent gains for 6-hour forecasting across two resolutions (5.625° and 1.40625°) and diverse backbones (generic and weather-specialized), including strictly causal 48-hour autoregressive rollouts that reduce early-stage error accumulation and improve long-horizon stability.

Keywords: Weather forecasting · Multi-agent generation · Physics-prior injection

1 Introduction

Short-range weather forecasting is a cornerstone of operational prediction, underpinning public safety and high-stakes decision-making. High-impact phenomena can develop within hours, requiring accurate forecasts of evolving multi-variable atmospheric states that preserve cross-variable physical consistency. In this regime, small single-step errors, seemingly minor under grid-wise metrics, can accumulate and amplify into structural biases during autoregressive deployment. Traditionally, Numerical Weather Prediction (NWP) maintains consistency by solving dynamical equations, but it incurs prohibitive computational cost at high resolution and under frequent update cycles [1, 2]. In contrast, data-driven forecasters trained on large reanalysis datasets enable substantially faster inference while achieving competitive short-to-medium-range accuracy [5, 28]. Despite their efficiency, purely data-driven forecasters do not explicitly enforce physical consistency across variables and space, so small short-range errors can be amplified under autoregressive deployment and evolve into physically implausible states.
In contrast, operational forecasting routinely performs state-aware diagnosis and targeted corrections to maintain coherent synoptic structures. Recognizing that purely data-driven, grid-wise regression is insufficient to preserve meteorologically meaningful structures and constraints under complex atmospheric dynamics, prior work has revisited a central principle of NWP: constraining evolution with physical knowledge. Accordingly, researchers have attempted to inject meteorological physics priors into learning-based forecasters in various forms to guide physics-aware representation learning and improve predictive performance. Existing attempts to incorporate physical knowledge into data-driven forecasters mainly differ in where the prior is imposed: (1) model-level biases baked into architectures (e.g., spectral or operator designs [27, 37, 48], variable embeddings [15, 35, 46], spherical/mesh representations [22], and tailored objectives [26]); (2) training-time constraints added as regularization or physics-informed objectives [3, 31, 42, 61]; and (3) hybrid schemes [40] that couple with NWP to enhance physical consistency. While effective, these priors are usually imposed in a global, once-for-all manner, limiting sample-specific controllability and state-adaptive guidance during multi-step deployment. Fig. 1 summarizes this gap and motivates an alternative: deriving state-conditioned, physically consistent guidance from the current atmosphere and applying it in a controllable and reusable way. Recently, Multimodal Large Language Models (MLLMs) and agent workflows have achieved strong results across computer vision [13, 14, 16, 29, 44, 49, 53, 59] and natural language processing [7, 10, 25, 54, 60, 62], and are increasingly trained with an emphasis on physical consistency and guidance [19, 43, 51, 52, 55].
Their ability to produce consistent visual descriptions suggests a route to summarizing the current multi-variable atmosphere into an explicit, controllable prior that highlights synoptic structures and enforces cross-variable consistency. Unlike static physics injection, which is baked into architectures or losses and hard to steer per sample, state-conditioned summaries provide situation-aware guidance with per-variable evidence and checkable consistency constraints.

Fig. 1: Global static physics priors vs. state-conditioned physics priors: the proposed AGCD injects cached state-conditioned physics priors at decoding time.

However, naively applying generic captioners or single-round MLLMs to meteorological fields is hindered by two bottlenecks: reliability (coverage gaps and cross-variable inconsistencies in strongly coupled, high-dimensional states) and efficiency (online multi-agent reasoning is costly for training and deployment). We therefore seek a reliable and efficient mechanism that generates causally valid, state-conditioned priors and injects them into forecasters at low runtime cost.

Motivated by this perspective, we propose Agent-Guided Cross-modal Decoding (AGCD), a plug-and-play prior-injection paradigm designed for Transformer-based neural forecasters. Concretely, AGCD employs an offline Multi-agent Meteorological Narration Pipeline (MMNP) to generate concise, state-conditioned physics priors from multi-variable heatmaps and injects them as decoding-time guidance into Transformer-based forecasters. To realize this, we further introduce Cross-modal Region Interaction Decoding (CRID), a plug-and-play cross-modal decoder that efficiently fuses the cached priors with visual tokens for region-adaptive refinement, improving structural fidelity without modifying the backbone I/O interface.
We evaluate AGCD on WeatherBench [39] at two resolutions; for long-horizon assessment, we perform strictly causal 6-hour-step autoregressive rollouts up to 48 hours, where the narrative is refreshed from the current rollout state without introducing future information. Across settings, AGCD consistently improves accuracy and reduces error accumulation, leading to more stable long-horizon behavior.

Contributions. Our main contributions are three-fold:

– We introduce a new perspective on physics-prior injection in weather forecasting: leveraging MLLMs to convert multi-variable atmospheric states into state-conditioned physics priors that are explicit, controllable, and reusable.
– We propose AGCD, a plug-and-play decoding-time prior-injection framework that couples an offline multi-agent narration pipeline (MMNP) with a lightweight cross-modal decoder (CRID) to enable region-adaptive refinement without modifying backbone interfaces.
– We demonstrate consistent gains for 6-hour forecasting on WeatherBench across two resolutions and diverse backbones (generic and weather-specialized), including 48-hour autoregressive rollouts that reduce early-stage error accumulation.

2 Related Work

2.1 Data-driven Weather Forecasting

Data-driven weather forecasting has advanced rapidly with deep models that learn spatiotemporal dynamics from large reanalysis datasets.
Beyond early convolutional and recurrent approaches, recent progress mainly follows three directions: (i) neural operators that approximate the evolution operator in function space and enable efficient global mixing for autoregressive rollouts [6, 32, 37]; (ii) Transformer-style forecasters that scale sequence modeling on latitude–longitude grids, often incorporating weather-aware designs such as variable-level embeddings, pressure-level structure, and latitude-weighted objectives [4, 5, 36]; and (iii) graph-based forecasters that perform message passing on spherical meshes for long-range transport and multi-scale interactions beyond regular grids [28, 34, 63]. These methods achieve strong single-step accuracy and practical inference efficiency, making them promising alternatives or complements to traditional NWP in short-range settings.

Existing physics-prior injection is largely static: it is often hard to control at training time and lacks a mechanism for state-aware revision over dynamically sensitive regions. Such static priors become fragile under autoregressive deployment: small structural misplacements and weak cross-variable coherence at early steps can be recursively amplified, yielding systematic bias and unstable long-horizon trajectories. These limitations motivate an explicit, controllable, plug-and-play guidance mechanism that injects state-conditioned priors at decoding time, without redesigning strong backbones, thereby improving the stability of early-stage autoregressive rollouts.
2.2 MLLMs and Agentic Workflows for Structured Guidance

Recent multimodal large language models (MLLMs) and agentic workflows [11] have become practical mechanisms for structured guidance, showing strong capabilities in grounded description [20, 24, 44, 53], region-centric reasoning [14, 49], multi-step decomposition [12, 21, 23, 41, 50, 57], and tool-augmented verification [7, 47, 62] across vision and language tasks. In particular, verification-oriented designs are used to suppress omissions, contradictions, and overconfident statements [17, 30, 56]. This paradigm suggests a route to converting high-dimensional visual observations [9] into compact semantic summaries that can act as controllable signals for downstream models [8, 33, 58].

However, transferring generic captioners or online multi-agent reasoning to meteorological fields is challenging due to two constraints: reliability and efficiency. Weather states are high-dimensional and strongly cross-variable coupled, making one-shot generation prone to incomplete coverage and inconsistent semantics, which is undesirable for a stable training-time prior. Meanwhile, online multi-agent execution is costly and often difficult to reproduce within large-scale forecasting pipelines. These limitations motivate guidance mechanisms that produce deterministic, evidence-grounded state summaries with explicit consistency control and enable offline caching to avoid online multi-agent iterations during training and one-step inference, while supporting strictly causal rollouts via a lightweight single-step editor.

3 Methodology

3.1 Overall Framework

Fig. 2: Overview of the proposed AGCD.

As illustrated in Fig. 2, our framework couples structured language guidance with visual spatiotemporal representation learning for meteorological forecasting.
It consists of a language pathway that provides semantic cues and a visual pathway that produces spatiotemporal tokens for prediction.

Language pathway. For each meteorological variable field $X_i \in \mathbb{R}^{H \times W}$, we render it into an RGB heatmap $I_i \in \mathbb{R}^{H \times W \times 3}$ using a fixed colormap and a fixed normalization scheme to ensure a deterministic value-to-color mapping. Given the multivariate heatmaps $\{I_i\}_{i=1}^{N}$, the proposed Multi-agent Meteorological Narration Pipeline (MMNP) (Sec. 3.2) generates a coherent meteorological narrative $S_{\mathrm{final}}$ summarizing salient atmospheric states and potential inter-variable interactions. To avoid running multi-agent iterations online, $S_{\mathrm{final}}$ is precomputed offline for each sample and cached for training and inference.

We then encode $S_{\mathrm{final}}$ with a pretrained Large Language Model (LLM) and extract the last-layer hidden states as token embeddings:

$$T = E_{\mathrm{LLM}}(S_{\mathrm{final}}) \in \mathbb{R}^{N_t \times d_t}. \quad (1)$$

Importantly, the LLM is kept frozen throughout training and inference.

Visual pathway and cross-modal coupling. In parallel, the raw fields are fed into a Transformer-based forecasting backbone (such as Pangu [4] or ClimaX [36]), producing patch tokens $P \in \mathbb{R}^{N \times d}$ (with $N = H \cdot W$) and a global class token $C \in \mathbb{R}^{1 \times d}$. We then perform a cross-modal guidance preprocessing step in our Cross-modal Region Interaction Decoding (CRID) (Sec. 3.3). Specifically, the class token $C$ generates token-wise and channel-wise gates that refine the frozen text embeddings $T$, yielding visually guided text features $\tilde{T}$ aligned to the visual feature space. CRID then injects $\tilde{T}$ into region-aware decoding through token distillation and cross-attention modulation, producing improved forecasts that leverage both local atmospheric patterns and global semantic context.
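The deterministic rendering step of the language pathway can be sketched as follows. This is a minimal illustration, not the authors' code: the paper fixes a colormap and normalization but does not specify which, so the linear blue-to-red map and the per-variable `(vmin, vmax)` bounds here are assumptions.

```python
import numpy as np

def render_heatmap(field: np.ndarray, vmin: float, vmax: float) -> np.ndarray:
    """Render a 2D variable field X_i of shape (H, W) into an RGB heatmap
    I_i of shape (H, W, 3).

    Fixed normalization bounds plus a fixed colormap (here: a simple linear
    blue-to-red ramp, an illustrative assumption) give the deterministic
    value-to-color mapping the language pathway requires.
    """
    t = np.clip((field - vmin) / (vmax - vmin), 0.0, 1.0)[..., None]  # (H, W, 1)
    blue = np.array([0.0, 0.0, 255.0])
    red = np.array([255.0, 0.0, 0.0])
    return ((1.0 - t) * blue + t * red).astype(np.uint8)              # (H, W, 3)
```

Because each variable keeps its own fixed bounds across the whole dataset, identical field values always map to identical colors, so the downstream agents see a stable visual encoding rather than per-sample autoscaling.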
3.2 Multi-agent Meteorological Narration Pipeline (MMNP)

Generating a coherent and meteorologically plausible narrative from multivariate atmospheric inputs requires (i) capturing salient spatial patterns within each variable and (ii) integrating cross-variable cues without introducing contradictions or temporally confounded causal claims. We therefore propose MMNP, a collaborative multi-agent pipeline that produces an offline narrative prior $S_{\mathrm{final}}$ from deterministically rendered RGB heatmaps $\{I_i\}_{i=1}^{N}$ (Sec. 3.1). To keep computation bounded and reproducible, MMNP operates under fixed prompt templates and a fixed refinement budget.

Agents and roles. MMNP consists of three agent types with complementary responsibilities:

(1) Variable-specific description agents $A_{V_i}$. For each variable $V_i$, agent $A_{V_i}$ takes the corresponding heatmap $I_i$ and extracts salient spatial structures with coarse localization cues in a concise textual form:

$$d_i = A_{V_i}(I_i), \quad i = 1, \dots, N. \quad (2)$$

Each $d_i$ follows a lightweight, template-constrained style (short clauses with approximate regions and intensity trends) to facilitate downstream integration and verification.

(2) Sequential integration agent $A_I$. The integration agent $A_I$ merges $\{d_i\}_{i=1}^{N}$ into a unified narrative by iteratively updating a running state $S_{i-1} \rightarrow S_i$ under a fixed variable order:

$$S_i = A_I(S_{i-1}, d_i), \quad S_0 = \varnothing. \quad (3)$$

To prevent uncontrolled verbosity and ensure consistent phrasing across samples, $A_I$ writes $S_i$ in a template-constrained format (short sentences or bullets) and explicitly separates (a) observations grounded in the current heatmap patterns from (b) hypothesized interactions across variables, phrased as tentative rather than factual or future-dependent claims.

(3) Evidence-grounded evaluator $E$.
Given the variable-wise descriptions $\{d_i\}_{i=1}^{N}$ and the integrated narrative $S_{\mathrm{final}}$, the evaluator $E$ performs a structured consistency check and returns either PASS or a feedback package. Concretely, $E$ assesses three aspects:

– Per-variable coverage: whether salient structures described in each $d_i$ are reflected in $S_{\mathrm{final}}$ (mitigating coverage gaps);
– Consistency with described evidence: whether statements in $S_{\mathrm{final}}$ preserve the coarse localization and intensity trends stated in $\{d_i\}$, without distortion or unwarranted specificity;
– Coherence: whether the narrative is concise, well-structured, and non-redundant.

The evaluator reports localized issues of different types (such as missing, distorted, contradictory, and overstated-causality) to enable targeted refinement.

Forward generation and evaluation. All variable-specific agents are executed in parallel to produce $\{d_i\}_{i=1}^{N}$, followed by chained integration to obtain $S_{\mathrm{final}}$. The evaluator then verifies $S_{\mathrm{final}}$ against the variable-wise descriptions:

$$\mathrm{flag} = E(\{d_i\}_{i=1}^{N}, S_{\mathrm{final}}). \quad (4)$$

If flag is PASS, we output $S_{\mathrm{final}}$ as the final narrative prior for the subsequent frozen LLM. If flag is FAIL, $E$ returns a feedback package that specifies the issue type and the implicated variable, together with the current integrated narrative:

$$\mathrm{Feedback} = (\mathrm{type},\, i,\, d_i,\, S_{\mathrm{final}}), \quad (5)$$

where type ∈ {missing, distorted, contradictory, overstated-causality} and $i$ indexes the variable whose description is implicated. Conditioned on Feedback, the integration agent $A_I$ revises $S_{\mathrm{final}}$ by (i) adding missing but supported content from $d_i$, (ii) correcting distorted localization and intensity phrasing, (iii) resolving contradictions by rephrasing or narrowing claims, and (iv) weakening causal language into hypothesis form, while preserving unaffected content.
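The generate-integrate-verify control flow above can be sketched as follows. This is a schematic under stated assumptions, not the authors' implementation: the agents (MLLM calls) are abstracted as plain callables, `max_rounds` plays the role of the refinement budget R, and the best-scoring-version fallback, fixed prompt templates, and offline caching are elided.

```python
from typing import Callable, Sequence, Tuple

def mmnp(heatmaps: Sequence,                 # rendered heatmaps {I_i}
         describers: Sequence[Callable],     # variable-specific agents A_{V_i}
         integrate: Callable,                # sequential integration agent A_I
         evaluate: Callable,                 # evidence-grounded evaluator E
         max_rounds: int = 3) -> str:
    """Sketch of the MMNP loop: describe each variable, chain-integrate
    under a fixed variable order, then refine under a fixed budget until
    the evaluator returns PASS."""
    descs = [agent(img) for agent, img in zip(describers, heatmaps)]  # Eq. (2)
    narrative = ""
    for d in descs:                               # running-state update, Eq. (3)
        narrative = integrate(narrative, d)
    for _ in range(max_rounds):                   # bounded refinement budget R
        flag, feedback = evaluate(descs, narrative)       # Eqs. (4)-(5)
        if flag == "PASS":
            break
        narrative = integrate(narrative, feedback)        # targeted revision
    return narrative                              # cached offline for reuse
```

In the paper's setting the returned narrative is what gets encoded by the frozen LLM; the dummy-agent structure here only illustrates how the feedback package routes back into the same integration agent.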
To ensure bounded and reproducible computation, we run at most $R$ refinement rounds (fixed across the dataset). If the narrative still fails after $R$ rounds, we fall back to the best-scoring version selected by $E$. The finalized $S_{\mathrm{final}}$ is cached offline and reused during training and inference, avoiding online multi-agent iterations during optimization.

3.3 Cross-modal Region Interaction Decoding (CRID)

As discussed in Sec. 3.1, the generated meteorological narrative is not merely a post-hoc explanation; it serves as an explicit decoding-time physics prior that guides the forecaster toward dynamically sensitive regions and cross-variable-consistent structures. To this end, we propose CRID, a plug-and-play decoder that injects state-conditioned physics priors without changing the backbone interface. CRID consists of two components: Cross-Modal Guidance (CMG), which produces visually conditioned text features, and Cross-Modal Interaction (CMI), which performs region-aware multi-source interaction to modulate patch tokens for forecasting (see Fig. 3).

Fig. 3: Structure of Cross-Modal Interaction.

Inputs. Given an input state at time $t$, the forecasting backbone produces a set of patch-wise visual tokens $P \in \mathbb{R}^{N \times d}$ (with $N = H \cdot W$) and a global summary token $C \in \mathbb{R}^{1 \times d}$. In parallel, a frozen text encoder embeds the narrative prior (Sec. 3.1) into token features $T \in \mathbb{R}^{N_t \times d_t}$. CRID takes $(P, C, T)$ as inputs and performs decoding-time revision.

Cross-Modal Guidance (CMG). CMG converts frozen text features into visually conditioned semantics. The core idea is to use the class token $C$ as a compact summary of the current atmospheric state and let it gate the narrative tokens $T$, thereby selectively emphasizing state-relevant semantic cues. We first align the text features to the visual channel dimension as Eq.
(6), where $g(\cdot)$ is a learnable linear projection when $d_t \neq d$ (and the identity otherwise). We then map $C$ through a lightweight MLP $f(\cdot)$ and split the output into two queries as Eq. (7):

$$U = g(T) \in \mathbb{R}^{N_t \times d}, \quad (6)$$

$$[q_{\mathrm{tok}}, q_{\mathrm{ch}}] = f(C), \quad q_{\mathrm{tok}} \in \mathbb{R}^{1 \times N_t}, \; q_{\mathrm{ch}} \in \mathbb{R}^{1 \times d}. \quad (7)$$

Token-wise gating reweights narrative tokens by their compatibility with the global state (Eq. (8)), and channel-wise gating further refines the semantic channels to match the state-dependent emphasis (Eq. (9)):

$$\alpha = \mathrm{softmax}(q_{\mathrm{ch}} U^{\top}) \in \mathbb{R}^{1 \times N_t}, \quad U^{(1)} = \alpha \odot U. \quad (8)$$

$$\beta = \mathrm{softmax}(q_{\mathrm{tok}} U^{(1)}) \in \mathbb{R}^{1 \times d}, \quad \tilde{T} = \beta \odot U^{(1)}. \quad (9)$$

The resulting $\tilde{T} \in \mathbb{R}^{N_t \times d}$ serves as a state-conditioned physics prior and is injected into CMI for region-aware interaction.

Cross-Modal Interaction (CMI). CMI injects the guided semantics $\tilde{T}$ into patch tokens via region-aware tokenization and memory-based modulation. Given patch tokens $P \in \mathbb{R}^{N \times d}$, we construct multi-scale region tokens by pooling on the token grid. Let $P_{\mathrm{grid}} \in \mathbb{R}^{H \times W \times d}$ be the reshaped tokens; for scales $\mathcal{S}$ we compute

$$R^{(s)} = \mathrm{Flatten}(\mathrm{AvgPool}_{s \times s}(P_{\mathrm{grid}})), \quad R = [R^{(s)}]_{s \in \mathcal{S}} \in \mathbb{R}^{N_r \times d}. \quad (10)$$

We first construct a unified decoding context by concatenating patch tokens, multi-scale region tokens, and guided semantic tokens:

$$X = \mathrm{Concat}(P, R, \tilde{T}) = [P; R; \tilde{T}] \in \mathbb{R}^{L \times d}, \quad L = N + N_r + N_t. \quad (11)$$

Since directly operating on $X$ is computationally expensive and may dilute salient cross-modal cues, we further distill $X$ into a compact set of $M$ memory tokens ($M \ll L$) via Hopfield pooling [38], yielding representative prototypes for efficient decoding-time modulation:

$$Z = \mathrm{HopfieldPool}(Q_h, X) \in \mathbb{R}^{M \times d}, \quad (12)$$

where $Q_h \in \mathbb{R}^{M \times d}$ denotes learnable pooling queries. We apply multi-head attention (MHA) with $P$ as queries and the memory $Z$ as keys and values:

$$\hat{P} = \mathrm{MHA}(P W_Q, Z W_K, Z W_V) + P, \quad P_{\mathrm{out}} = \mathrm{MLP}(\hat{P}).$$
(13)

where $W_Q$, $W_K$, $W_V$ are learnable linear projections for queries, keys, and values, respectively. The proposed CMI acts as a plug-in decoder that replaces the original decoding head and directly outputs the final forecasts, without modifying the backbone encoder.

4 Experiments

4.1 Setup

Dataset. We evaluate on WeatherBench at 5.625° and 1.40625° for 6-hour forecasting: given the state at time $t$, predict $t{+}6$h. We further assess long-horizon behavior via autoregressive rollouts up to 48 hours by iteratively feeding predictions back as inputs. Inputs include surface variables {wind10m, t2m} and upper-air variables {z, r, q, wind, t} over 13 pressure levels; we report canonical WeatherBench scores on Z500, T850, T2m, and 10m wind. We use a strict temporal split: train (1979-01-01 to 2016-12-31) and test (2017-01-01 to 2017-12-31).

Metrics. All methods are trained under an identical supervised setup to predict $t{+}6$h from $t$, using the same variable configuration and optimization schedule; evaluation uses latitude-weighted RMSE and ACC computed on climatology-based anomalies. Full details are provided in the supplementary material (Sec. S2).

Table 1: 6-hour forecasting results on WeatherBench at two resolutions. AGCD consistently improves RMSE and ACC across backbones. RMSE ↓ and ACC ↑.
At 5.625° (each cell: RMSE / ACC):

Method         T2m [K]           10m Wind [m/s]    Z500 [m²/s²]      T850 [K]
ViT            1.6859 / 0.9554   0.5490 / 0.9723   131.01 / 0.9929   1.21 / 0.9736
ViT+AGCD       1.2601 / 0.9768   0.4781 / 0.9788   113.07 / 0.9951   0.98 / 0.9830
CaiT           1.8747 / 0.9317   0.6051 / 0.9674   149.95 / 0.9917   1.51 / 0.9516
CaiT+AGCD      1.8703 / 0.9450   0.5993 / 0.9678   132.80 / 0.9938   1.41 / 0.9635
ClimaX         1.2308 / 0.9759   0.4970 / 0.9776   88.98 / 0.9972    0.92 / 0.9857
ClimaX+AGCD    0.8843 / 0.9880   0.4513 / 0.9812   70.22 / 0.9979    0.78 / 0.9905
Pangu          0.4965 / 0.9961   0.5963 / 0.9666   80.91 / 0.9970    0.52 / 0.9951
Pangu+AGCD     0.4916 / 0.9962   0.5507 / 0.9716   68.92 / 0.9978    0.50 / 0.9954

At 1.40625° (each cell: RMSE / ACC):

Method         T2m [K]           10m Wind [m/s]    Z500 [m²/s²]      T850 [K]
ViT            1.3570 / 0.9710   0.5946 / 0.9668   80.80 / 0.9970    0.91 / 0.9852
ViT+AGCD       1.2450 / 0.9754   0.5600 / 0.9695   75.90 / 0.9976    0.86 / 0.9871
CaiT           1.5200 / 0.9658   0.6420 / 0.9622   104.60 / 0.9953   1.06 / 0.9829
CaiT+AGCD      1.4700 / 0.9682   0.6210 / 0.9638   96.80 / 0.9960    0.99 / 0.9846
ClimaX         0.7799 / 0.9904   0.3443 / 0.9889   32.84 / 0.9995    0.49 / 0.9957
ClimaX+AGCD    0.7420 / 0.9912   0.3320 / 0.9896   31.10 / 0.9996    0.46 / 0.9962
Pangu          0.5147 / 0.9958   0.5321 / 0.9733   74.36 / 0.9974    0.67 / 0.9920
Pangu+AGCD     0.4551 / 0.9967   0.4451 / 0.9814   58.73 / 0.9984    0.63 / 0.9929

Baselines. We evaluate AGCD as a plug-and-play module on both generic vision backbones and weather-specialized forecasters. ViT [18] is a pure Transformer that models an image as a sequence of patch tokens, serving as a strong and scalable generic backbone for grid-like inputs. CaiT [45] extends ViT with class-attention mechanisms that enable deeper image Transformers with improved optimization and representation. ClimaX [36] is a foundation model for weather and climate designed to be flexible over heterogeneous datasets (different variables and spatiotemporal coverage), and can be pretrained and then finetuned for downstream forecasting tasks.
Pangu-Weather [4] is a high-resolution global weather forecasting model that performs fast deterministic forecasts with a 3D architecture tailored to atmospheric fields.

Implementation details. MMNP uses fixed prompt templates with a bounded refinement budget $R$ to produce deterministic physics priors from multi-variable heatmaps. Full MMNP details and all hyperparameters are provided in the supplementary material (Sec. S1–S2; Table S1).

4.2 6-hour Forecasting

For each framework, we report both the vanilla model and its AGCD counterpart, obtained by plugging our semantic guidance (MMNP+CRID) into the decoding stage. Table 1 summarizes the 6-hour forecasting performance at 5.625° and 1.40625°. Our plug-and-play AGCD consistently improves all tested backbones, reducing RMSE and increasing ACC on the canonical variables. We provide qualitative comparisons of representative 6-hour forecasts for Z500, T850, T2m, and 10m wind at 1.40625° (Pangu, Fig. 4) and at 5.625° (ClimaX), which show that our method yields results that closely match the ground truth with smaller bias. The 5.625° visualization is deferred to the supplementary material (Sec. S3).

Fig. 4: Qualitative comparison of 6-hour weather forecasting with Pangu and Pangu+AGCD on 1.40625° data across multiple variables. (a) Initial fields at time t. (b) Ground-truth targets at t+6h. (c) Predictions from the vanilla Pangu. (d) Error maps from the vanilla Pangu. (e) Predictions from Pangu with our AGCD. (f) Error maps from Pangu with our AGCD. Error maps visualize Pred − GT.

4.3 Autoregressive Forecasting

Text update rule for rollouts. While our base task is 6-hour forecasting and the narrative prior is intentionally concise, regenerating the full MMNP narrative at every rollout step is unnecessary and inefficient.
Therefore, we adopt a lightweight rollout update: we keep the variable-specific describers and the evaluator off during rollouts and reuse only the sequential integration agent as a single-step editor. In all autoregressive experiments, we instantiate this editor with InternVL3.5. Concretely, at step $k$ the editor takes (i) the current predicted meteorological heatmap stack $\{I_i^{(k)}\}$ and (ii) the previous-step narrative $S^{(k-1)}$, then outputs an updated narrative $S^{(k)}$ by making minimal, evidence-grounded edits:

$$S^{(k)} = A_I\!\left(S^{(k-1)}, \{I_i^{(k)}\}_{i=1}^{N}\right). \quad (14)$$

This yields causally valid, step-adaptive physics priors with negligible overhead, while avoiding repeated multi-agent refinement. The updated $S^{(k)}$ is then encoded by the frozen LLM and injected by CRID at the next rollout step.

We evaluate AGCD via strictly causal autoregressive rollouts with a 6-hour step: starting from the initial state at time $t$, the model iteratively feeds its own prediction back as input to forecast $t{+}6$h, ..., $t{+}48$h. Fig. 5 reports the lead-time curves of latitude-weighted RMSE/ACC; detailed RMSE results at 12-hour intervals across backbones are deferred to the supplementary material (Sec. S3). Across variables, AGCD consistently reduces error accumulation and yields more stable trajectories under rollout.

Fig. 5: Autoregressive rollout comparison between Pangu and Pangu+AGCD up to 48 hours (6-hour steps).

5 Discussion

5.1 How crucial is semantic alignment for improvement?

We keep the backbone and CRID identical and change only the text: Matched (sample-aligned), Shuffled (mismatched), and Empty (null).

Table 2: Semantic relevance controls. All settings keep the visual backbone (ViT) and CRID identical; only the text input is modified.

Table 2 shows
Text setting            Z500 (RMSE/ACC)   T850 (RMSE/ACC)   T2m (RMSE/ACC)    10m Wind (RMSE/ACC)
Vision-only (no text)   131.01 / 0.9929   1.21 / 0.9736     1.6859 / 0.9554   0.5490 / 0.9723
Matched (ours)          113.07 / 0.9951   0.98 / 0.9830     1.2601 / 0.9768   0.4781 / 0.9788
Shuffled (mismatch)     136.40 / 0.9922   1.24 / 0.9730     1.7120 / 0.9550   0.5650 / 0.9718
Empty (null prompt)     134.80 / 0.9924   1.23 / 0.9732     1.7050 / 0.9552   0.5600 / 0.9720

that improvements appear only with Matched text, while Shuffled/Empty largely remove the benefit and can even underperform the vision-only baseline, confirming that semantic alignment is necessary. Fig. 6 provides a concrete example showing that matched narratives offer localized, state-consistent priors rather than generic text cues. For T850, the narrative explicitly highlights the dynamically active regions over Eurasia and North Africa, which coincide with the boxed areas where the baseline exhibits structured warm/cold displacement errors. For Z500, the prior emphasizes the Siberian ridge, aligning with the synoptic-scale height pattern and guiding corrections on the corresponding ridge-related error patches. For T2m, the narrative points to a temperature-gradient band around 60°S, matching the sharp frontal-like transitions where the baseline tends to blur gradients and incur coherent bias. For 10m wind, the prior focuses on the North Pacific, consistent with the prominent wind structures and the concentrated error clusters in that region. Across variables, these region-specific priors translate into targeted error reductions in the zoomed-in boxes, supporting that the gain comes from sample-aligned semantic guidance rather than extra text capacity.

Fig. 6: Relevance case study: state-consistent priors yield targeted error reductions.

Table 3: Ablation on MMNP generation strategies (same CRID and backbone (ViT)).
Text generator                               Z500              T850             T2m               10m Wind
                                             RMSE↓    ACC↑     RMSE↓   ACC↑     RMSE↓    ACC↑     RMSE↓    ACC↑
Single-agent (A_I only)                      123.80   0.9937   1.05    0.9802   1.3900   0.9688   0.5070   0.9756
Multi-agent w/o evaluator ({A_V_i} + A_I)    118.20   0.9944   1.01    0.9819   1.3200   0.9742   0.4920   0.9773
Full MMNP (ours) ({A_V_i} + A_I + E)         113.07   0.9951   0.98    0.9830   1.2601   0.9768   0.4781   0.9788

5.2 Can multi-agent decomposition enhance narrative reliability?

We compare three narrative-generation strategies under the same forecasting backbone and CRID: (1) Single-agent uses only the integration agent A_I to produce a single-pass narrative from the full multi-variable heatmap stack {I_i}_{i=1..N}, without variable-wise decomposition or post-hoc verification. (2) Multi-agent w/o evaluator decomposes the input into variable-specific descriptions {d_i} and integrates them with A_I, but removes the evidence-grounded evaluator E. (3) Full MMNP further adds E to detect and revise omissions and cross-variable inconsistencies. Representative narrative comparisons are provided in the supplementary material (Sec. S3).

Table 3 shows a monotonic improvement as the generation pipeline becomes more reliable: variable-wise decomposition already improves over Single-agent, and adding the evaluator yields further gains. This supports that the benefit comes from better evidence coverage and consistency control rather than from increased text length.

5.3 What drives per-variable fidelity?

Finally, fixing the integration agent A_I and evaluator E, we study per-variable fidelity from two angles: (i) swapping the variable-specific agents {A_V_i} while keeping the rest of the pipeline unchanged (Fig. 7); and (ii) incrementally enabling a subset of {A_V_i} (Table 4). Fig. 7 shows that stronger variable-specific describers yield consistently better RMSE/ACC across all four canonical variables, confirming that MMNP benefits from higher-quality per-variable evidence rather than from simply increasing text capacity.

Fig. 7: Ablation on variable-specific description agents in MMNP (ViT backbone; A_I and E fixed). We report latitude-weighted RMSE (bars; lower is better) and ACC (line; higher is better) on Z500, T850, T2m, and 10m wind.

Table 4: Incremental ablation on enabling variable-specific description agents in MMNP (ViT backbone; A_I and E fixed). RMSE↓ / ACC↑ for 6-hour forecasts.

Enabled A_·                  Z500              T850             T2m               10m wind
                             RMSE↓    ACC↑     RMSE↓   ACC↑     RMSE↓    ACC↑     RMSE↓    ACC↑
No MMNP                      131.01   0.9929   1.21    0.9736   1.6859   0.9554   0.5490   0.9723
A_t2m                        130.62   0.9929   1.16    0.9752   1.5438   0.9620   0.5352   0.9735
A_t2m,10m                    130.12   0.9930   1.13    0.9768   1.4820   0.9650   0.5249   0.9745
A_t2m,10m,t850               129.66   0.9930   1.10    0.9782   1.4187   0.9678   0.5146   0.9755
A_t2m,10m,t850,z500          113.07   0.9951   0.98    0.9830   1.2601   0.9768   0.4781   0.9788

5.4 Which CRID components matter most?

We next validate the design choices in CRID by ablating its key components while keeping the backbone, training protocol, and text generator fixed. We consider removing (i) the region-aware multi-scale tokens, (ii) the Hopfield-based distillation (performing attention directly over the concatenated tokens instead), and (iii) the CMG gating that produces visually aligned text features. Results in Table 5 indicate that each component contributes to the final performance.

Table 5: Ablation on CRID components.

Setting                      Z500              T850             T2m               10m Wind
                             RMSE↓    ACC↑     RMSE↓   ACC↑     RMSE↓    ACC↑     RMSE↓    ACC↑
Vision-only (no CRID)        131.01   0.9929   1.21    0.9736   1.6859   0.9554   0.5490   0.9723
+ Text, w/o Region tokens    124.70   0.9936   1.09    0.9786   1.4200   0.9681   0.5150   0.9750
+ Text, w/o HopfieldPool     115.60   0.9948   1.00    0.9824   1.3000   0.9758   0.4860   0.9780
+ Text, w/o CMG gating       118.90   0.9943   1.02    0.9818   1.3400   0.9748   0.4930   0.9775
Full CRID (ours)             113.07   0.9951   0.98    0.9830   1.2601   0.9768   0.4781   0.9788
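The Hopfield-based distillation ablated above can be viewed as one or a few steps of modern-Hopfield retrieval (Ramsauer et al., 2021), in which a small set of learnable state patterns queries the concatenated cross-modal token memory, compressing N tokens into K pooled tokens. The sketch below is a minimal NumPy illustration under assumed shapes; the prototype count, inverse temperature beta, and single-step update are illustrative choices, not the paper's exact CRID implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hopfield_pool(tokens, prototypes, beta=1.0, steps=1):
    """Modern-Hopfield retrieval update: xi <- softmax(beta * xi @ X^T) @ X,
    with X the (N, d) token memory and xi the (K, d) learnable state patterns.
    Returns K pooled tokens, reducing token complexity from N to K."""
    xi = prototypes
    for _ in range(steps):
        xi = softmax(beta * xi @ tokens.T) @ tokens  # (K, d) convex combinations
    return xi
```

With a large beta the update sharpens toward the single closest stored token (pattern retrieval); with a moderate beta it behaves like attention pooling, which is the regime a distillation module would plausibly operate in.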
Notably, region-aware tokens consistently improve variables characterized by sharp gradients or coherent synoptic structures, while Hopfield distillation provides a favorable accuracy-efficiency trade-off by retaining informative cross-modal cues at reduced token complexity. Removing CMG gating degrades performance, suggesting that coarse global visual context (the class token) is beneficial for reweighting and aligning frozen text embeddings before fusion.

6 Conclusions

In this work, we propose AGCD, an explicit and plug-and-play decoding-time prior-injection paradigm for neural weather forecasting. AGCD is motivated by a key gap in existing forecasters: grid-wise regression alone lacks state-aware physics priors, so structural errors and cross-variable inconsistencies can be amplified under autoregressive rollouts. To bridge this gap, we introduce MMNP to produce state-conditioned physics priors with evidence-grounded consistency control, and a lightweight CRID decoder to inject these priors for region-adaptive refinement without changing the backbone interface. Extensive experiments on WeatherBench at 5.625° and 1.40625° demonstrate consistent gains in latitude-weighted RMSE and ACC across both generic vision backbones and weather-specialized forecasters, as well as improved stability under strictly causal 48-hour autoregressive rollouts. Further analyses and ablations validate that the improvements stem from matched and reliable narratives with sufficient per-variable coverage, rather than merely from adding extra text tokens. In the future, we plan to extend AGCD to broader variable sets and higher-resolution forecasting, explore more efficient state-update and caching strategies for long-horizon deployment, and integrate stronger physically grounded constraints to further enhance robustness in operational settings.
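The latitude-weighted RMSE and ACC reported throughout follow the standard WeatherBench-style definitions: grid cells are weighted by the cosine of latitude (normalized to mean 1), and ACC correlates forecast and truth anomalies with respect to a climatology. The sketch below is a minimal NumPy version for (lat, lon) fields, assuming the climatology is supplied externally; it is illustrative, not the paper's exact evaluation code.

```python
import numpy as np

def lat_weights(lats_deg):
    """Cosine-latitude weights, normalized so they average to 1."""
    w = np.cos(np.deg2rad(lats_deg))
    return w / w.mean()

def lat_rmse(forecast, truth, lats_deg):
    """Latitude-weighted RMSE for fields shaped (lat, lon)."""
    w = lat_weights(lats_deg)[:, None]  # broadcast over longitude
    return np.sqrt(np.mean(w * (forecast - truth) ** 2))

def lat_acc(forecast, truth, climatology, lats_deg):
    """Latitude-weighted anomaly correlation coefficient (ACC)."""
    w = lat_weights(lats_deg)[:, None]
    fa = forecast - climatology  # forecast anomaly
    ta = truth - climatology     # truth anomaly
    num = np.sum(w * fa * ta)
    den = np.sqrt(np.sum(w * fa ** 2) * np.sum(w * ta ** 2))
    return num / den
```

A perfect forecast gives RMSE 0 and ACC 1; a constant-offset forecast shows up in RMSE but, with the offset absorbed into the anomalies, can still score a high ACC, which is why the two metrics are reported together.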