Fusion Complexity Inversion: Why Simpler Cross View Modules Outperform SSMs and Cross View Attention Transformers for Pasture Biomass Regression



Mridankan Mandal
Department of Information Technology
Indian Institute of Information Technology, Allahabad
Prayagraj, India
mridankanmandal2006@gmail.com

Abstract—Accurate estimation of pasture biomass from agricultural imagery is critical for sustainable livestock management, yet existing methods are limited by the small, imbalanced, and sparsely annotated datasets typical of real-world monitoring. In this study, adaptation of vision foundation models to agricultural regression is systematically evaluated on the CSIRO Pasture Biomass benchmark, a 357-image dual-view dataset with laboratory-validated, component-wise ground truth for five biomass targets, through 17 configurations spanning four backbones (EfficientNet-B3 to DINOv3-ViT-L), five cross-view fusion mechanisms, and a 4 × 2 metadata factorial. A counterintuitive principle, termed fusion complexity inversion, is uncovered: on scarce agricultural data, a two-layer gated depthwise convolution (R² = 0.903) outperforms cross-view attention transformers (0.833), bidirectional SSMs (0.819), and full Mamba (0.793, below the no-fusion baseline). Backbone pretraining scale is found to monotonically dominate all architectural choices, with the DINOv2 → DINOv3 upgrade alone yielding +5.0 R² points. Training-only metadata (species, state, and NDVI) is shown to create a universal ceiling at R² ≈ 0.829, collapsing an 8.4-point fusion spread to 0.1 points. Actionable guidelines for sparse agricultural benchmarks are established: backbone quality should be prioritized over fusion complexity, local modules preferred over global alternatives, and features unavailable at inference excluded.
Keywords: pasture biomass estimation, foundation models, agricultural imagery, dual-view fusion, sparse annotations, precision agriculture.

I. INTRODUCTION

Scalable, non-destructive estimation of crop and pasture properties from imagery alone is increasingly enabled by computer vision [20]. Pasture biomass estimation, the prediction of dry-matter weight of vegetation components from field photographs, is a canonical agricultural vision task requiring fine-grained pattern recognition, robustness to sparse and imbalanced annotations, and generalization across geographic and seasonal conditions. Traditional methods (e.g., rising plate meters and destructive harvesting) cannot scale to the millions of hectares under pastoral management, motivating vision-based alternatives.

Self-supervised vision transformers pretrained at scale, notably DINOv2 [17] and DINOv3 [18] (up to 1.7B images), now provide general-purpose encoders that transfer to narrow agricultural domains with minimal task-specific data [28], [4]. Scale-dependent transfer has been further demonstrated by vision-language [27] and large language models [29]. A critical question remains, however: given a powerful pretrained backbone, how much task-specific complexity should be added when training data is scarce?

The CSIRO Pasture Biomass dataset [15], the first publicly available, multi-site pasture resource combining visual, spectral, and structural modalities with laboratory-validated, component-wise ground truth, is adopted as the benchmark. Unlike prior datasets relying on visual estimation or single-site collection, CSIRO spans 19 sites across four Australian states over three years (2014–2017), with consumer-grade cameras under natural lighting.
Each of 357 dual-view training photographs is paired with destructive cut-dry-weigh measurements: vegetation within a 70 cm × 30 cm quadrat is harvested, sorted into green, dead, and clover fractions, oven-dried at 70 °C for 48 h, and laboratory-weighed, producing ground truth unmatched by any comparable benchmark. Three targets show significant zero inflation (up to 37.8% for clover) and right-skewed distributions. Auxiliary metadata (species, state, NDVI, height, and date), available only during training, creates a realistic modality-shift scenario common in agricultural deployment.

A systematic study spanning 17 configurations is presented along three axes: (1) cross-view fusion complexity (identity, gated depthwise convolution, cross-view gated attention transformer, bidirectional Mamba SSM, and full Mamba SSM), (2) backbone scale (EfficientNet-B3 [30] through DINOv3-ViT-L, spanning ImageNet-1K [38] to LVD-1.7B), and (3) training-only metadata fusion. All experiments are conducted with identical recipes, 5-fold stratified group cross-validation, and a single 8 GB consumer GPU.

Three principal findings are established. (1) Fusion complexity inversion: a two-layer gated depthwise convolution (R² = 0.903) outperforms all global alternatives, and full Mamba (0.793) falls below the no-fusion baseline. (2) Foundation-model scale dominance: R² increases monotonically from EfficientNet-B3 (0.555) to DINOv3-ViT-L (0.903), and the DINOv2 → DINOv3 upgrade alone yields +5.0 points. (3) Metadata fusion can harm: training-only metadata collapses all fusion types to R² ≈ 0.829, destroying the best model's 7.4-point advantage. Feature-space analysis of image-derived color indices and sensor metadata reveals moderate correlations between simple hand-crafted features and biomass components, establishing an interpretable baseline that learned representations must surpass.

II. RELATED WORK

Computer vision for agricultural imagery. Crop and pasture analysis has progressed from hand-crafted vegetation indices [2], [9], [20] to CNN-based proximal sensing [10], [19], composition classification [33], and aerial monitoring [39]. A persistent bottleneck in agricultural vision is the scarcity of annotated data. The CSIRO Pasture Biomass dataset [15] exemplifies this: laboratory-validated, component-wise ground truth (destructive cut-dry-weigh) is paired with heterogeneous inputs (imagery, NDVI, and compressed height) across 19 sites in four Australian states, yet only 357 training images are provided, with significant zero inflation and right-skewed targets.

Foundation models in agriculture. Self-supervised pretraining at scale has been catalyzed by the vision transformer paradigm [26], [31]. DINO [4], DINOv2 [17] (LVD-142M), and DINOv3 [18] (LVD-1.7B) learn label-free features adopted for remote sensing [5] and plant phenotyping [11], and complementary signals are provided by masked autoencoders [28]. Scale-dependent transfer has been confirmed by vision-language [27] and large language models [29], yet systematic guidance on task-specific complexity atop foundation models for scarce agricultural data is lacking. This gap is addressed here by quantifying backbone–fusion interactions on a small agricultural benchmark.

Efficient training on sparse agricultural data. Differential learning rates [13], gradient checkpointing, mixed-precision training [14], data augmentation [30], [37], and robust loss functions [36] enable fine-tuning of large backbones on consumer hardware. In this study, R² = 0.903 is achieved on 357 images without external data through these strategies combined with an appropriate foundation model.

Data fusion and auxiliary metadata integration.
Dual-branch architectures employ cross-attention [21], [22], concatenation [3], [6], or late fusion [7], and ModDrop [40] provides robustness to missing modalities. Metadata fusion is common in remote sensing [1], [8], and Gaussian process regression [34] provides a probabilistic framework for sparse-data regression. Selective SSMs (Mamba [12]) have been adapted for vision through VMamba [16], with bidirectional [23], [24] and hybrid [25] variants. Five fusion paradigms are benchmarked here, and local fusion is found to dominate all global alternatives. Training-only metadata is shown to act as a harmful shortcut (−7.4 R² points).

III. METHOD

A. Problem Formulation

Given a dual-view pasture photograph, the task is to predict five biomass targets: Dry Green (g), Dry Dead (g), Dry Clover (g), Green Dry Matter (GDM = Green + Clover), and Dry Total (Total = GDM + Dead). All targets are log(1+y)-transformed to stabilize variance across right-skewed distributions. The primary metric is a weighted R²:

R²_weighted = Σ_{i=1}^{5} w_i · R²_i,  w = [0.1, 0.1, 0.1, 0.2, 0.5]   (1)

where the weights reflect agronomic priorities: Dry Total (50%) is the primary indicator of carrying capacity, GDM (20%) measures the digestible fraction, and the three components each receive 10%.

B. Dual-View Input Pipeline

Each input image (∼2000 × 1000 pixels) captures a 70 cm × 30 cm pasture quadrat from above, split vertically into left/right halves and resized to 512 × 512 with area-based resampling (INTER_AREA). Both halves are normalized with ImageNet statistics and pass through a weight-tied backbone, halving the parameter count relative to independent encoders while providing complementary spatial coverage of the quadrat. The resulting token sequences (1024 × 1024 each) are concatenated to form a 2048 × 1024 joint representation before fusion.
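The metric in Eq. (1) is straightforward to compute directly. A minimal NumPy sketch, assuming predictions and targets arrive as (samples × 5) arrays in the column order Dry Green, Dry Dead, Dry Clover, GDM, Dry Total (function names are illustrative, not from the released code):

```python
import numpy as np

def r2(y_true, y_pred):
    """Per-target coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def weighted_r2(Y_true, Y_pred, w=(0.1, 0.1, 0.1, 0.2, 0.5)):
    """Weighted R^2 of Eq. (1), evaluated on log(1+y)-transformed targets.
    Columns: Dry Green, Dry Dead, Dry Clover, GDM, Dry Total."""
    Yt, Yp = np.log1p(Y_true), np.log1p(Y_pred)
    return sum(wi * r2(Yt[:, i], Yp[:, i]) for i, wi in enumerate(w))

# Perfect predictions score 1.0; any error lowers the weighted sum.
Y = np.abs(np.random.default_rng(0).normal(30.0, 10.0, size=(8, 5)))
print(round(weighted_r2(Y, Y), 6))  # -> 1.0
```

Because the weights sum to one, a model that is perfect on every component scores exactly 1.0, and errors on Dry Total are penalized five times as heavily as errors on any single component.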
C. Architecture

The model comprises four components: a weight-tied backbone, a local fusion module, adaptive pooling, and compositional prediction heads (Figure 1).

Backbone. DINOv3-ViT-L (303.08M parameters, 24 transformer layers) pretrained on LVD-1.7B is used. Each 512 × 512 view yields a 32 × 32 = 1024-token grid of dimension 1024. Gradient checkpointing fits the dual-view forward pass within 8 GB VRAM. The backbone is fine-tuned at 1 × 10⁻⁵, while task-specific layers use 5 × 10⁻⁴.

Fusion: Gated Depthwise Convolution. The concatenated 2048 × 1024 token sequence is processed by two stacked GatedDepthwiseConvBlocks. Each block applies LayerNorm, sigmoid gating, depthwise 1-D convolution (k = 5), linear projection with dropout (p = 0.2), and a residual connection:

GatedDWConv(x) = x + Drop(W_p · DWConv_{k=5}(LN(x) ⊙ σ(W_g · LN(x))))   (2)

Two stacked blocks have an effective receptive field of 9 tokens. This local operation does not attend across the full sequence: the backbone's self-attention already captures global dependencies within each view. Total fusion parameters: 4.21M.

Fig. 1. Architecture overview of the proposed dual-view biomass regression pipeline. Each input image is split into left/right halves, encoded by a weight-tied DINOv3-ViT-L backbone, and fused through two stacked GatedDepthwiseConv blocks before compositional prediction heads.

Prediction Heads. Three independent heads (Green, Dead, and Clover) map the average-pooled 1024-d vector to scalars through a two-layer MLP with Softplus output:

head(x) = Softplus(W_2 · Drop(GELU(W_1 · x)))   (3)

Composite targets are computed by summation: GDM = Green + Clover, Total = GDM + Dead. Total task-specific parameters: 5.79M (1.9% of the 308.87M total), comprising 4.21M fusion and 1.58M heads. No metadata is used at training or inference.
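Eq. (2) is compact enough to sketch end to end. A minimal NumPy version of the block, with dropout omitted (as at inference), random placeholder weights, and a reduced token dimension; names such as `gated_dw_block` are mine, not from the released code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 5  # token dimension (1024 in the paper, shrunk here) and kernel size

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def dwconv1d(x, kern):
    """Depthwise 1-D convolution along the token axis: one length-k filter
    per channel, zero 'same' padding. x: (N, d), kern: (k, d)."""
    n, pad = x.shape[0], kern.shape[0] // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return sum(xp[i:i + n] * kern[i] for i in range(kern.shape[0]))

def gated_dw_block(x, Wg, Wp, kern):
    """One GatedDepthwiseConvBlock per Eq. (2); dropout omitted at inference."""
    h = layer_norm(x)
    gated = h * (1.0 / (1.0 + np.exp(-h @ Wg)))   # sigmoid gate
    return x + dwconv1d(gated, kern) @ Wp         # residual connection

x = rng.normal(size=(32, d))                      # 32 tokens of dimension d
for _ in range(2):                                # two stacked blocks
    Wg = rng.normal(scale=0.1, size=(d, d))
    Wp = rng.normal(scale=0.1, size=(d, d))
    x = gated_dw_block(x, Wg, Wp, rng.normal(scale=0.1, size=(k, d)))
print(x.shape)  # shape is preserved: (32, 16)
```

Stacking two k = 5 blocks yields the 9-token effective receptive field noted above, since each block widens the neighborhood each token sees by k − 1 = 4 positions.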
D. Baseline Fusion Mechanisms

Four alternative fusion modules are benchmarked with the same DINOv3-ViT-L backbone, training recipe, and cross-validation protocol.

Cross-View Gated Attention (CVGA). The concatenated sequence is split into left and right halves, and bidirectional cross-attention (8 heads, d_h = 128) with sigmoid gating enables global cross-view interaction at O(N²) cost. Two blocks: 10.50M parameters.

Bidirectional Mamba SSM (BidirMamba). Combines local depthwise convolution (k = 5) with a weight-tied bidirectional Mamba SSM (d_state = 16, expand = 2) for global O(N) sequence modeling. Requires FP32 due to CUDA kernel constraints. Two blocks: 17.55M parameters.

Full Mamba SSM (MambaFusionBlock). Unidirectional Mamba with expand = 2 and no gating or depthwise convolution overhead, making each block leaner than BidirMamba despite the same expand factor. Two blocks: 13.34M parameters.

Identity (no fusion). The concatenated sequence passes directly to pooling with no learned cross-view interaction (zero parameters).

Table I summarizes all fusion blocks.

TABLE I. FUSION BLOCK COMPARISON SUMMARY.

Property         | GatedDWConv | CVGA         | BidirMamba   | FullMamba    | Identity
Receptive field  | Local (k=5) | Global O(N²) | Local+Global | Global O(N)  | None
Params/block     | 2.11M       | 5.25M        | 8.77M        | 6.67M        | 0
Total (2 blocks) | 4.21M       | 10.50M       | 17.55M       | 13.34M       | 0
AMP compatible   | Yes         | Yes          | No           | No           | Yes

E. Metadata Injection (Ablation Variant)

For metadata-inclusive experiments, the CSIRO training set's auxiliary information is encoded into a 23-dimensional vector: state (one-hot, 4-d), species (one-hot, 15-d), NDVI (1-d), height (1-d), and cyclical sampling month (2-d). A two-layer MLP (23 → 64, 1.12M parameters) produces a metadata embedding that is concatenated with the 1024-d pooled image features and projected back to 1024 dimensions. During training, the metadata vector is zeroed with probability p = 0.2 per sample, and at test time metadata is absent entirely. As shown in Section IV-F, this dropout rate is insufficient to prevent the metadata shortcut.

IV. EXPERIMENTS

A. Dataset

The competition subset of the CSIRO Pasture Biomass dataset [15] comprises 357 dual-view photographs from 19 sites across four Australian states (2014–2017), selected from 3,187 samples through rigorous quality control. At each site, vegetation within a 70 cm × 30 cm quadrat was photographed with consumer-grade cameras under natural lighting, then harvested, sorted into green, dead, and clover fractions (≥ 30 g subsample), oven-dried at 70 °C for 48 h, and laboratory-weighed. CSIRO is the first public pasture benchmark to provide separate dead-matter annotations and to combine visual, spectral (GreenSeeker NDVI, 100-reading average), and structural (falling plate meter) modalities.

TABLE II. TARGET VARIABLE STATISTICS IN THE TRAINING SET (n = 357).

Target     | w   | Mean | Std  | Min | Med. | Max   | Skew | Zero%
Dry Green  | 0.1 | 26.6 | 25.4 | 0.0 | 20.8 | 158.0 | 1.75 | 5.0
Dry Dead   | 0.1 | 12.0 | 12.4 | 0.0 | 8.0  | 83.8  | 1.76 | 11.2
Dry Clover | 0.1 | 6.7  | 12.1 | 0.0 | 1.4  | 71.8  | 2.84 | 37.8
GDM        | 0.2 | 33.3 | 24.9 | 1.0 | 27.1 | 158.0 | 1.56 | 0.0
Dry Total  | 0.5 | 45.3 | 28.0 | 1.0 | 40.3 | 185.7 | 1.43 | 0.0

All five biomass targets are right-skewed (skewness 1.4–2.8), with Dry Clover showing 37.8% zero values (Figure 2). The heavy right tails motivate the log(1+y) transform, and composite targets (GDM, Dry Total) have no zeros by construction. Table II provides summary statistics.

Fig. 2. Histograms of the five biomass target variables in the training set (n = 357).

Correlation structure. Figure 3 presents the Pearson correlation matrix among all five biomass targets and four metadata variables. Dry Green and GDM are highly correlated (r = 0.98) by construction, as are GDM and Dry Total (r = 0.90). Auxiliary metadata includes NDVI (r = 0.54 with green biomass), compressed height (r = 0.48 with total biomass), pasture species, state, and sampling date, all absent at test time.

Fig. 3. Pearson correlation heatmap among biomass targets and metadata variables.

Figure 4 visualizes metadata–biomass relationships: both correlations show substantial scatter, confirming genuine but insufficient signal, and incorporating metadata degrades the best model by 7.4 R² points (Section IV-F).

Fig. 4. NDVI and compressed height vs. biomass scatter plots, colored by pasture species.

Geographic and seasonal variation. Victoria and Tasmania show higher median biomass (∼50–55 g) than NSW and WA (∼30–40 g), though substantial interquartile overlap indicates state alone is a weak predictor (Figure 5). Collection peaks in spring and autumn (Figure 6), species composition spans 15 types dominated by ryegrass-clover mixtures (Figure 7), and lucerne has higher average biomass, but wide per-species ranges motivate the visual-only approach.

Fig. 5. Dry Total biomass distributions by Australian state.

Fig. 6. Temporal distribution of sampling dates and seasonal biomass dynamics.

Fig. 7. Species distribution and associated biomass ranges in the training set.

B. Cross-Validation Protocol

All models are evaluated with 5-fold stratified group K-fold cross-validation (scikit-learn, seed 17). Folds are stratified on Dry Total quintiles and grouped by image ID to prevent leakage. Each fold contains ∼71 validation and ∼286 training images. Models train for up to 50 epochs with early stopping (patience 10) on validation weighted R². All 17 configurations converged before the 50-epoch maximum, with training halted by early stopping in every case. A single master seed controls all randomness across the 17 configurations.

C. Training Configuration

All experiments use AdamW [13] with differential learning rates (1 × 10⁻⁵ backbone, 5 × 10⁻⁴ heads), weight decay 10⁻², cosine annealing with 5-epoch linear warmup, Huber loss [36] (β = 5.0) on log(1+y) targets, and gradient clipping at 1.0. Differential rates preserve pretrained backbone representations while accelerating randomly initialized heads. Mixed precision (FP16) is used for AMP-compatible blocks (GatedDWConv, CVGA), while Mamba blocks require FP32, consuming 1.5× more VRAM. Gradient checkpointing trades ∼30% computation for a 40% VRAM reduction, essential for dual-view DINOv3-ViT-L on 8 GB. Augmentations (flip, rotation ±15°, shift-scale-rotate, and color jitter) are applied identically to both views through albumentations, and no test-time augmentation (TTA) is used. All experiments run on a single NVIDIA RTX 4060 Laptop GPU (8 GB VRAM), with each 5-fold CV run requiring 4–8 hours.

D. Main Results

Table III presents all 17 configurations. The proposed model (DINOv3-ViT-L + 2× GatedDWConv, no metadata) achieves R² = 0.903 ± 0.064, outperforming all alternatives by at least 5 points. A dense cluster of DINOv3-based models occupies the 0.81–0.85 range regardless of fusion mechanism, while VMamba-based models (∼0.72) and EfficientNet-B3 (0.555) form progressively weaker tiers. Figure 8(b) reveals a near-linear relationship between log pretraining scale and downstream R². The median predictor (B1, R² = −0.065) and zero-shot DINOv2+ConvNeXt (B3, R² = −1.999) confirm that neither constant prediction nor off-the-shelf features are viable, with B3's negative score underscoring the necessity of task-specific training. Among fine-tuned models, a clear three-tier hierarchy emerges: DINOv3 (0.793–0.903), VMamba (0.717–0.743), and EfficientNet-B3 (0.555), aligning with pretraining data scale.

TABLE III. MAIN RESULTS. ALL MODELS ARE EVALUATED THROUGH 5-FOLD STRATIFIED GROUP CROSS-VALIDATION ON THE CSIRO TRAINING SET (n = 357). WEIGHTED R² ON log(1+y) TARGETS IS REPORTED. BEST IN BOLD.

ID | Model              | Backbone  | Fusion      | Meta | R²     | Std
B1 | Median Predictor   | None      | None        | No   | −0.065 | 0.006
B3 | DINOv2+ConvNeXt ZS | ViT-B/14  | Ensemble    | No   | −1.999 | 0.341
B2 | EfficientNet-B3    | EffNet-B3 | Single-view | No   | 0.555  | 0.084
B9 | VMamba+Mamba       | VMamba-B  | 2× Mamba    | No   | 0.717  | 0.052
B6 | VMamba+Mamba       | VMamba-B  | 2× Mamba    | Yes  | 0.743  | 0.048
E5 | DINOv3+FullMamba   | DINOv3-L  | 2× Mamba    | No   | 0.793  | 0.034
E7 | DINOv3+4×GDWC      | DINOv3-L  | 4× GDWC     | No   | 0.814  | 0.039
E4 | DINOv3+Identity    | DINOv3-L  | Identity    | No   | 0.819  | 0.055
E1 | DINOv3+BidirMamba  | DINOv3-L  | 2× BidirM   | No   | 0.819  | 0.051
E6 | DINOv3+1×GDWC      | DINOv3-L  | 1× GDWC     | No   | 0.821  | 0.034
E8 | DINOv3+Identity+M  | DINOv3-L  | Identity    | Yes  | 0.828  | 0.053
B8 | DINOv3+BidirM+M    | DINOv3-L  | 2× BidirM   | Yes  | 0.829  | 0.042
E3 | DINOv3+GDWC+M      | DINOv3-L  | 2× GDWC     | Yes  | 0.829  | 0.045
B7 | DINOv3+CVGA+M      | DINOv3-L  | 2× CVGA     | Yes  | 0.830  | 0.050
E2 | DINOv3+CVGA        | DINOv3-L  | 2× CVGA     | No   | 0.833  | 0.051
B4 | DINOv2+GDWC        | DINOv2-L  | 2× GDWC     | No   | 0.853  | 0.097
B5 | DINOv3+GDWC        | DINOv3-L  | 2× GDWC     | No   | 0.903  | 0.064

Fig. 8. Main results across all 17 configurations.

E. Fusion Complexity Analysis

Table IV isolates fusion complexity effects on DINOv3-ViT-L without metadata.

TABLE IV. FUSION COMPARISON ON DINOv3-ViT-L, NO METADATA.

Rank | Fusion               | ID | R²    | Std   | Complexity
1    | 2× GatedDWConv       | B5 | 0.903 | 0.064 | O(Nk)
2    | 2× CVGA              | E2 | 0.833 | 0.051 | O(N²)
3    | 1× GatedDWConv       | E6 | 0.821 | 0.034 | O(Nk)
4    | Identity (no fusion) | E4 | 0.819 | 0.055 | O(1)
5    | 2× BidirMamba        | E1 | 0.819 | 0.051 | O(N)
6    | 4× GatedDWConv       | E7 | 0.814 | 0.039 | O(Nk)
7    | 2× FullMamba         | E5 | 0.793 | 0.034 | O(N)

Three patterns emerge. (1) BidirMamba (R² = 0.819) matches the no-fusion identity baseline despite having the highest parameter count (17.55M), while FullMamba (0.793) falls 2.6 points below identity with 13.34M fusion parameters, establishing a negative correlation between fusion parameters and performance.
(2) The GatedDWConv depth curve traces an inverted U: 0 blocks (0.819) → 1 (0.821) → 2 (0.903) → 4 (0.814). The disproportionate 1-to-2 block jump (+8.2 points) suggests that the 9-token receptive field captures a critical spatial scale at the left-right boundary, while the 2-to-4 drop (−8.9) signals overfitting. (3) CVGA (0.833) outperforms both SSM variants but trails 2× GatedDWConv by 7.0 points: quadratic attention's expressiveness is offset by overfitting on 286 training images per fold. Figure 9 visualizes these results.

Fig. 9. Ablation studies: (a) fusion type comparison, (b) fusion depth curve, (c) metadata interaction heatmap.

F. The Metadata Paradox

Table V presents a 4 × 2 factorial crossing fusion type with metadata on/off on DINOv3-ViT-L.

TABLE V. FUSION × METADATA FACTORIAL ON DINOv3-ViT-L.

Fusion            | No Meta    | With Meta  | ∆
2× GatedDWConv    | 0.903 (B5) | 0.829 (E3) | −0.074
2× CVGA           | 0.833 (E2) | 0.830 (B7) | −0.003
2× BidirMamba     | 0.819 (E1) | 0.829 (B8) | +0.010
Identity          | 0.819 (E4) | 0.828 (E8) | +0.009
∆ (Fusion Effect) | +0.084     | +0.001     |

Without metadata, the fusion spread is 8.4 R² points; with metadata, it collapses to 0.1 points as all configurations converge to R² ≈ 0.829. Metadata destroys the best model: GatedDWConv drops from 0.903 to 0.829 (−7.4 points). The mechanism is straightforward: during training, species and state metadata provide predictive shortcuts through the MLP; at test time, metadata is absent, and GatedDWConv suffers the largest degradation from this distribution shift. The ∆ column reveals an asymmetry: weaker models (Identity, BidirMamba) gain slightly (+0.009 to +0.010), CVGA shows a near-zero effect (−0.003), and GatedDWConv is harmed most (−0.074). Metadata harm is proportional to model quality: the better the visual backbone, the more important it is to exclude training-only metadata.

G. Backbone Pretraining Scale

Table VI controls for backbone scale with matched fusion and no metadata. Performance is strictly monotonic with pretraining scale. The DINOv2 → DINOv3 transition keeps architecture fixed while increasing pretraining data from 142M to 1.7B images, yielding +5.0 R² points with zero additional parameters.

TABLE VI. BACKBONE PRETRAINING SCALE VS. DOWNSTREAM R².

Backbone        | Pretrain Data      | Params  | R²    | ∆
EfficientNet-B3 | ImageNet-1K (1.2M) | 10.70M  | 0.555 | baseline
VMamba-Base     | ImageNet-1K (1.2M) | 88.56M  | 0.717 | +0.162
DINOv2-ViT-L    | LVD-142M           | 304.37M | 0.853 | +0.298
DINOv3-ViT-L    | LVD-1.7B           | 303.08M | 0.903 | +0.348

Across the full backbone range, pretraining scale contributes +34.8 points (EfficientNet-B3 to DINOv3), far exceeding the maximum fusion (+8.4) or metadata (±7.4) effects. VMamba-Base (0.717), despite 8.3× more parameters than EfficientNet-B3, remains far below DINOv2 (−13.6), confirming that pretraining data dominates architecture choice. Practically, upgrading from DINOv2 to DINOv3 (+5.0 points, zero additional parameters, no overfitting risk) provides the most efficient single improvement, and the log-linear relationship in Figure 8(b) suggests continued data scaling would yield predictable gains.

H. Feature Space Analysis

Figure 10 examines the relationship between image-derived color indices, sensor metadata, and biomass targets. The Excess Green Index (ExG) is moderately correlated with GDM (ρ = 0.525) and Image Greenness with Dry Green (ρ = 0.404), whereas Mean Brightness shows negligible association with Dry Dead (ρ = 0.102), indicating that simple color features capture green-biomass signals but fail for senescent material.
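For reference, ExG is conventionally computed on chromaticity-normalized RGB. A sketch under that standard 2g − r − b formulation (whether the paper normalizes identically is not stated, so treat the exact definition as an assumption):

```python
import numpy as np

def excess_green(img):
    """Excess Green index ExG = 2g - r - b on chromaticity coordinates
    (r = R/(R+G+B), etc.); img is an H x W x 3 float RGB array."""
    s = img.sum(axis=-1, keepdims=True) + 1e-8   # avoid division by zero
    r, g, b = np.moveaxis(img / s, -1, 0)
    return 2.0 * g - r - b

# Pure green pixels score ExG near the maximum of 2; achromatic (gray)
# pixels score near 0, which is why ExG tracks green but not dead matter.
green = np.zeros((2, 2, 3)); green[..., 1] = 1.0
gray = np.full((2, 2, 3), 0.5)
print(float(excess_green(green).mean()), float(excess_green(gray).mean()))
```

Averaging the per-pixel map over a quadrat image gives the scalar feature correlated against GDM in panel (a).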
The NDVI × Height hexbin plot reveals that these two sensor metadata variables jointly stratify mean Dry Total across most of its range, explaining why metadata fusion improves weaker backbones but saturates for DINOv3, which already encodes equivalent visual cues. The state-level density plot further shows that geographic provenance induces distinct clusters in the NDVI–Height space, motivating the stratified group cross-validation protocol adopted in Section IV. Figure 11 shows spatial feature maps overlaid on representative images: DINOv3 produces spatially coherent activations that cleanly segment green vegetation from dead material, DINOv2 shows similar but less refined selectivity, and VMamba-Base produces coarser maps, paralleling the R² ranking and confirming that spatial discrimination, not fusion capacity, is the primary bottleneck.

Fig. 10. Feature space analysis. (a)–(c) Image-derived color indices versus biomass targets, colored by state. (d) NDVI × Height hexbin colored by mean Dry Total. (e) State-level density in the NDVI–Height plane.

Fig. 11. Spatial feature map visualizations from three backbone architectures.

I. Fold Analysis and Stability

Figure 12 shows per-fold R² for all DINOv3-based models. Fold 4 is consistently hardest across every model (5–12-point drops), suggesting edge-case pasture conditions (under-represented species, atypical seasonal states, and a higher proportion of low-biomass, clover-dominant swards) rather than a model-specific failure. Table VII details Fold 4 performance: the proposed model (B5) achieves the highest score (0.779) but shows the largest drop (−0.125), revealing a performance–stability tradeoff. Table VIII presents stability analysis: the proposed model's CV of 7.0% is moderate, while the most stable models (FullMamba 4.3%, 1× GatedDWConv 4.2%) achieve lower R².
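The stability measure in Table VIII is the coefficient of variation of the five fold scores. A small sketch with illustrative fold values (not the paper's actual per-fold numbers):

```python
import numpy as np

def cv_percent(fold_r2):
    """Coefficient of variation of per-fold R^2: std / mean x 100.
    Lower values mean more consistent cross-fold performance."""
    scores = np.asarray(fold_r2, dtype=float)
    return float(scores.std() / scores.mean() * 100.0)

# Illustrative 5-fold scores: a high-mean but volatile model versus a
# lower-mean but steadier one, mirroring the B5 vs. E6 tradeoff.
peaky = [0.95, 0.92, 0.90, 0.88, 0.78]
steady = [0.83, 0.82, 0.82, 0.81, 0.80]
print(round(cv_percent(peaky), 1), round(cv_percent(steady), 1))
```

A model can therefore lead on mean R² while ranking last on CV, which is exactly the pattern Table VIII reports for the proposed configuration.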
Metadata models cluster at lower CV (5.1–6.0%) because the ceiling suppresses both peaks and troughs, and practitioners valuing consistency may prefer the 1-block GatedDWConv (CV = 4.2%, R² = 0.821).

Fig. 12. Per-fold performance analysis across DINOv3-based configurations.

TABLE VII. FOLD 4 PERFORMANCE (HARDEST FOLD) FOR DINOv3-BASED CONFIGS.

Model                | Fold 4 R² | Drop
E4: Identity         | 0.732     | −0.086
E1: BidirMamba       | 0.746     | −0.073
B7: CVGA+Meta        | 0.749     | −0.081
E7: 4× GDWC          | 0.749     | −0.065
E2: CVGA             | 0.752     | −0.081
E3: GDWC+Meta        | 0.758     | −0.071
B8: BidirMamba+Meta  | 0.761     | −0.068
E6: 1× GDWC          | 0.761     | −0.060
B5: Proposed         | 0.779     | −0.125

TABLE VIII. STABILITY ANALYSIS (COEFFICIENT OF VARIATION; LOWER IS MORE STABLE).

Model              | R²    | Std   | CV (%)
E5: FullMamba      | 0.793 | 0.034 | 4.3
E6: 1× GDWC        | 0.821 | 0.034 | 4.2
E7: 4× GDWC        | 0.814 | 0.039 | 4.7
B8: BidirMamba+M   | 0.829 | 0.042 | 5.1
E3: GDWC+Meta      | 0.829 | 0.045 | 5.4
B7: CVGA+Meta      | 0.830 | 0.050 | 6.0
E1: BidirMamba     | 0.819 | 0.051 | 6.2
E2: CVGA           | 0.833 | 0.051 | 6.2
E4: Identity       | 0.819 | 0.055 | 6.7
B5: Proposed       | 0.903 | 0.064 | 7.0
B4: DINOv2+GDWC    | 0.853 | 0.097 | 11.4
B2: EffNet-B3      | 0.555 | 0.084 | 15.0

J. Prediction Quality Analysis

Figure 13 shows predicted-vs-actual scatter plots for all five targets (log(1+y)) of the proposed model (B5, R² = 0.903). Predictions cluster tightly around the identity line, with larger residuals only for high-biomass tail samples, and Dry Clover shows the widest scatter, consistent with its 37.8% zero inflation. Residuals are symmetric and centered near zero, confirming no systematic bias, and the tight Total scatter is encouraging given that compositional heads propagate component-level errors.

V. DISCUSSION

Implications for agricultural vision pipelines.
Since global self-attention is already performed across 24 transformer layers within each view by the DINOv3-ViT-L backbone, the fusion module need only enable local cross-view communication, a role adequately served by kernel-5 depthwise convolution. Global mechanisms (attention transformers, SSMs) introduce parameters that overfit on ∼286 images per fold, a scale typical of precision agriculture datasets. A general design principle is thus indicated: fusion complexity should be matched to dataset scale, not task aspiration.

Fig. 13. Prediction quality analysis for the proposed model.

Cautionary lessons for metadata fusion in agriculture. A broader risk for agricultural pipelines that fuse heterogeneous data sources is revealed by the metadata paradox: auxiliary features available only at training time can create harmful shortcuts. Predictive cues from species and state metadata (e.g., "Lucerne in Victoria") cause information to be routed through the metadata MLP at the expense of visual feature learning, and at test time the strongest visual learner (GatedDWConv) suffers the largest degradation (−7.4 R² points). This finding is generalizable: any pipeline fusing training-only auxiliary data (sensor readings, weather logs, field management records) risks the same shortcut, and modality dropout alone may be insufficient on small datasets.

Benchmarking position of the CSIRO Biomass Dataset. Compared to GrassClover [33] (435 images, 2 Danish sites, May–October, no dead-matter labels, and controlled lighting), the CSIRO Biomass Dataset provides 2.7× greater geographic diversity (19 sites, 4 states), year-round coverage, dead-matter annotations, camera diversity, and heterogeneous auxiliary inputs (NDVI and height) absent from any comparable benchmark.
Other agricultural vision datasets (CropHarvest, PlantNet, DeepW eeds, Agriculture-V ision) target classifica- tion, detection, or aerial segmentation, and none provides ground lev el, component wise biomass regression with lab- oratory v alidated measurements. CSIR O Biomass Dataset is thus the only appropriate benchmark for this study , and the 17 configuration suite serves as a reproducible reference for future work. Limitations. All findings are deriv ed from a single dataset, though no comparable alternativ e with laboratory validated, component wise biomass ground truth for proximal pasture imagery currently exists. The fusion complexity in version may not hold where global modules have sufficient data to av oid ov erfitting. The backbone comparison partially confounds pre- training data size with algorithmic improv ements, and 8 GB VRAM limits batch sizes to 1. V alidation on emerging agri- cultural benchmarks with similar ground truth quality is left to future work. V I . C O N C L U S I O N Foundation model adaptation for agricultural imagery is sys- tematically studied on the 357 image CSIRO Pasture Biomass benchmark through 17 configurations. Three findings are es- tablished. First, fusion complexity in version : a two layer gated depthwise con volution ( R 2 =0 . 903 ) outperforms cross view attention transformers ( 0 . 833 ), bidirectional SSMs ( 0 . 819 ), and full SSMs ( 0 . 793 , below the no fusion baseline). Second, foundation model dominance : R 2 rises monotonically from EfficientNet-B3 ( 0 . 555 ) to DINOv3-V iT -L ( 0 . 903 ), confirm- ing representation quality as the primary bottleneck. Third, the metadata fusion trap : training only metadata creates a ceiling at R 2 ≈ 0 . 829 , collapsing an 8.4 point spread to 0.1 points. For scarce agricultural data, backbone quality should be prioritized, local fusion preferred, and inference-unav ailable modalities excluded. R E F E R E N C E S [1] N. Jean, M. Burke, M. Xie, W . M. Davis, D. B. 
Lobell, and S. Ermon, "Combining satellite imagery and machine learning to predict poverty," Science, vol. 353, no. 6301, pp. 790–794, 2016.
[2] C. Adjorlolo, O. Mutanga, and M. A. Cho, "Estimation of canopy nitrogen concentration across C3 and C4 grasslands using WorldView-2 multispectral data," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 11, pp. 4385–4392, 2014.
[3] S. Bhojanapalli, W. Chen, A. Veit, and A. S. Rawat, "Understanding robustness of transformers for image classification," in Advances in Neural Information Processing Systems (NeurIPS), vol. 34, 2021.
[4] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, "Emerging properties in self-supervised vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9650–9660, 2021.
[5] D. Wang, J. Zhang, B. Du, G. S. Xia, and D. Tao, "Self-supervised pre-training for remote sensing image analysis," IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–16, 2022.
[6] Y. Chen, X. Wang, Z. Zhang, and H. Liu, "Multi-view learning for fusion of multi-sensor data," Information Fusion, vol. 79, pp. 75–94, 2022.
[7] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal deep learning," in Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 689–696, 2011.
[8] M. Rußwurm, N. Jacobs, and D. Tuia, "Meta-learning for cross-regional crop type mapping," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR Agriculture-Vision Workshop), 2023.
[9] J. Gao, "NDVI – a review," Remote Sensing Reviews, vol. 13, no. 1–2, pp. 145–174, 1996.
[10] L. Petrich, G. Lohrmann, M. Neumann, and N. Weishaupt, "Estimation of ground cover and vegetation height from images using deep learning," Precision Agriculture, vol. 21, no. 6, pp. 1243–1262, 2020.
[11] S. A. Tsaftaris, M. Minervini, and H. Scharr, "Plant phenotyping with deep learning," Annual Review of Plant Biology, vol. 74, 2023.
[12] A. Gu and T. Dao, "Mamba: Linear-time sequence modeling with selective state spaces," arXiv preprint arXiv:2312.00752, 2023.
[13] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in International Conference on Learning Representations (ICLR), 2019.
[14] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, "Mixed precision training," in International Conference on Learning Representations (ICLR), 2018.
[15] Q. Liao, D. Wang, R. Haling, J. Liu, X. Li, M. Plomecka, A. Robson, M. Pringle, R. Pirie, M. Walker, and J. Whelan, "Estimating pasture biomass from top-view images: A dataset for precision agriculture," arXiv preprint arXiv:2510.22916, 2025.
[16] Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, J. Jiao, and Y. Liu, "VMamba: Visual state space model," arXiv preprint arXiv:2401.10166, 2024.
[17] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Y. Huang, S. W. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, "DINOv2: Learning robust visual features without supervision," Transactions on Machine Learning Research, 2024.
[18] O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, et al., "DINOv3," arXiv preprint arXiv:2508.10104, 2025.
[19] A. Bauer, A. G. Bostrom, J. Ball, C. Applegate, T. Cheng, S. Laycock, S. M. Rojas, J. Kirwan, and J. Zhou, "Combining computer vision and interactive spatial statistics for the characterization of precision agriculture observations," Computers and Electronics in Agriculture, vol. 162, pp. 223–234, 2019.
[20] D. Lu, "The potential and challenge of remote sensing-based biomass estimation," International Journal of Remote Sensing, vol. 27, no. 7, pp. 1297–1328, 2006.
[21] C. F. Chen, Q. Fan, and R. Panda, "CrossViT: Cross-attention multi-scale vision transformer for image classification," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 357–366, 2021.
[22] Y. H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L. P. Morency, and R. Salakhutdinov, "Multimodal transformer for unaligned multimodal language sequences," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 6558–6569, 2019.
[23] L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang, "Vision Mamba: Efficient visual representation learning with bidirectional state space model," in Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR 235:62429–62442, 2024.
[24] R. Xu, S. Yang, Y. Wang, Y. Cai, B. Du, and H. Chen, "A survey on vision mamba: Models, applications and challenges," arXiv preprint arXiv:2404.18861, 2024.
[25] A. Hatamizadeh, H. Hosseini, N. Parchami, D. Terzopoulos, and J. Kautz, "MambaVision: A hybrid Mamba-Transformer vision backbone," arXiv preprint arXiv:2407.08083, 2024.
[26] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," in International Conference on Learning Representations (ICLR), 2021.
[27] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning transferable visual models from natural language supervision," in Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR 139:8748–8763, 2021.
[28] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16000–16009, 2022.
[29] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language models are few-shot learners," in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 1877–1901, 2020.
[30] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proceedings of the 36th International Conference on Machine Learning (ICML), PMLR 97:6105–6114, 2019.
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, pp. 5998–6008, 2017.
[32] R. Wightman, "PyTorch image models (timm)," GitHub repository, https://github.com/rwightman/pytorch-image-models, 2019.
[33] S. Skovsen, M. Dyrmann, A. K. Mortensen, K. A. Steen, O. Green, J. Eriksen, R. Gislum, R. N. Jørgensen, and H. Karstoft, "The GrassClover image dataset for semantic and hierarchical species understanding in agriculture," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshop), 2019.
[34] E. Schulz, M. Speekenbrink, and A. Krause, "A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions," Journal of Mathematical Psychology, vol. 85, pp. 1–16, 2018.
[35] D. Hendrycks and K. Gimpel, "Gaussian error linear units (GELUs)," arXiv preprint arXiv:1606.08415, 2016.
[36] P. J. Huber, "Robust estimation of a location parameter," The Annals of Mathematical Statistics, vol. 35, no. 1, pp. 73–101, 1964.
[37] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[38] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255, 2009.
[39] W. Guo, U. K. Rage, and S. Ninomiya, "Aerial imagery analysis – quantifying appearance and number of sorghum heads for breeding optimization," Frontiers in Plant Science, vol. 9, p. 1544, 2018.
[40] N. Neverova, C. Wolf, G. Taylor, and F. Nebout, "ModDrop: Adaptive multi-modal gesture recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1692–1706, 2016.
