Navigating Simply, Aligning Deeply: Winning Solutions for Mouse vs. AI 2025


Visual robustness and neural alignment remain critical challenges in developing artificial agents that can match biological vision systems. We present the winning approaches from Team HCMUS_TheFangs for both tracks of the NeurIPS 2025 Mouse vs. AI: Robust Visual Foraging Competition. For Track 1 (Visual Robustness), we demonstrate that architectural simplicity combined with targeted components yields superior generalization, achieving a 95.4% final score with a lightweight two-layer CNN enhanced by Gated Linear Units and observation normalization. For Track 2 (Neural Alignment), we develop a deep ResNet-like architecture with 16 convolutional layers and GLU-based gating that achieves top-1 neural prediction performance with 17.8 million parameters. Our systematic analysis of ten model checkpoints trained for between 60K and 1.14M steps reveals that training duration exhibits a non-monotonic relationship with performance, with optimal results achieved around 200K steps. Through comprehensive ablation studies and failure case analysis, we provide insights into why simpler architectures excel at visual robustness while deeper models with increased capacity achieve better neural alignment. Our results challenge conventional assumptions about model complexity in visuomotor learning and offer practical guidance for developing robust, biologically-inspired visual agents.


💡 Research Summary

This paper reports the winning solutions of Team HCMUS_TheFangs in the NeurIPS 2025 “Mouse vs. AI: Robust Visual Foraging Competition,” covering both the Visual Robustness track (Track 1) and the Neural Alignment track (Track 2). The competition required agents to navigate a 3D Unity environment using only a low‑resolution grayscale view (86 × 155 pixels) while a mouse performed the same task. Track 1 measured behavioral success under unseen visual perturbations (fog, lighting changes, etc.), combining Average Success Rate (ASR) on the training distribution with Modified Success Rate (MSR) on perturbed conditions. Track 2 evaluated how well internal representations of the agents could predict neural activity recorded from over 19 000 mouse visual‑cortex neurons using linear read‑out R² and representational similarity analysis.

Track 1 – Visual Robustness
Initial experiments with state‑of‑the‑art deep architectures (InceptionNet, a 24‑block IMPALA‑style ResNet, LSTM‑based temporal models) consistently failed: they either did not converge, over‑fit the training distribution, or suffered dramatic performance drops under perturbations. The team therefore adopted a minimalist design: a two‑layer convolutional backbone followed by a Gated Linear Unit (GLU) and Observation Normalization (ON). The first convolution uses an 8 × 8 kernel with stride 4, producing 16 channels; the second uses a 4 × 4 kernel with stride 2, expanding to 32 channels. Both layers employ LeakyReLU (α = 0.2). After flattening, a fully‑connected layer projects to a 256‑dimensional feature vector. The GLU splits this vector into a Swish‑activated transformation path and a sigmoid‑gated path, multiplying them element‑wise to retain only perturbation‑robust features. ON maintains running channel‑wise mean and variance, normalizing each observation to mitigate global illumination changes. This combination yields a model with only a few hundred thousand parameters yet achieves a final competition score of 95.4 %. Ablation studies show that removing GLU reduces performance by ~7 %, while removing ON costs ~5 %, confirming their essential role.
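The two robustness components can be sketched in NumPy as follows. This is a simplified illustration, not the team's code: the weight shapes, the Welford-style running statistics, and the function names (`glu`, `ObservationNorm`) are assumptions for exposition.

```python
import numpy as np

def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def glu(features, W_transform, W_gate):
    """Gated Linear Unit as described above: a Swish-activated
    transform path multiplied element-wise by a sigmoid gate,
    so the gate can suppress perturbation-sensitive features."""
    transform = swish(features @ W_transform)
    gate = 1.0 / (1.0 + np.exp(-(features @ W_gate)))
    return transform * gate

class ObservationNorm:
    """Running per-channel mean/variance normalization of observations,
    which counteracts global illumination shifts."""
    def __init__(self, shape, eps=1e-5):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps  # avoids division by zero on the first update

    def update(self, obs):
        # Welford-style online update of mean and population variance.
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.var += (delta * (obs - self.mean) - self.var) / self.count

    def __call__(self, obs):
        self.update(obs)
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)
```

After many observations, `ObservationNorm` converges on the stream's statistics, so a uniform brightness change shifts the running mean rather than the normalized input the policy sees.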

Track 2 – Neural Alignment
For neural alignment, the team built a deeper, biologically‑inspired architecture: a 16‑layer ResNet‑style network with 17.8 M parameters and multiple GLU gating modules. The network begins with a 4 × 4 convolution (stride 4) producing 64 channels, then proceeds through four stages (64 → 128 → 256 → 512 channels). Each stage contains a strided 2 × 2 down‑sampling convolution followed by two 3 × 3 convolutions organized as residual blocks. GLU layers are inserted after each residual block; they compute a softmax‑based gating vector that weights feature maps, allowing the network to selectively emphasize informative channels. Empirically, only the first GLU layer learns substantial weights, while later GLUs remain relatively static, suggesting that early‑stage filtering is the dominant mechanism for aligning with mouse visual processing. The final representation is fed to a linear read‑out fitted by ridge regression against the neural data. This model attains the highest R² among all submissions and shows strong representational similarity with higher visual areas (V2, LM, etc.). Ablating the GLU reduces R² by ~12 %, and removing residual connections leads to training instability, underscoring the necessity of both depth and gating.
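The softmax-based channel gating described above can be sketched as follows. This is a minimal NumPy illustration; the global-average pooling, the learned projection `W`, and the rescaling by the channel count are assumptions about how such a gate might be wired, not the paper's exact design.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def channel_gate(fmap, W):
    """Weight each channel of a (C, H, W) feature map by a softmax
    gate computed from its global-average-pooled summary, so that
    informative channels are emphasized and others suppressed."""
    C = fmap.shape[0]
    pooled = fmap.reshape(C, -1).mean(axis=1)   # (C,) per-channel summary
    gate = softmax(pooled @ W)                  # (C,), entries sum to 1
    # Rescale by C so the average channel gain stays near one
    # (an assumption; otherwise deep stacks of gates shrink activations).
    return fmap * gate[:, None, None] * C
```

Because the gate is a softmax, channels compete for weight; a near-uniform gate (weight ≈ 1/C per channel) leaves the feature map essentially unchanged, which is consistent with the later GLU layers remaining relatively static.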

Training Duration vs. Performance
The authors saved ten checkpoints spanning 60 K to 1.14 M training steps. Performance curves are non‑monotonic: the peak for both tracks occurs around 200 K steps. Beyond this point, additional training causes over‑fitting, decreasing ASR/MSR by 5‑10 % in Track 1 and lowering R² by 3‑4 % in Track 2. This finding highlights that, in reinforcement‑learning settings with limited environmental diversity, longer training does not guarantee better generalization and that early stopping is crucial.
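In practice, this finding amounts to checkpoint selection on a held-out score rather than taking the final model. A minimal sketch (the step counts and scores below are illustrative, not the paper's numbers):

```python
def best_checkpoint(scores):
    """Given a dict mapping training step -> held-out score,
    return the step whose checkpoint scored highest."""
    return max(scores, key=scores.get)

# Illustrative evaluation of saved checkpoints on a validation set:
validation_scores = {60_000: 0.80, 200_000: 0.95, 1_140_000: 0.88}
```

With non-monotonic training curves like these, `best_checkpoint(validation_scores)` picks the intermediate 200K-step model, not the most-trained one.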

Failed Approaches and Lessons Learned
Detailed documentation of failed attempts (InceptionNet, deep ResNet, LSTM, extensive data‑augmentation pipelines) reveals that architectural complexity often introduces training instability, high memory consumption, and sensitivity to visual perturbations. For example, a full‑scale InceptionNet never converged, while a 24‑block ResNet achieved high training accuracy but suffered a 35 % drop on perturbed test sets. Aggressive data augmentation, intended to improve robustness, paradoxically reduced the robust model’s score from 87.7 % to 59.8 %. These negative results provide practical guidance for future competitions and research.

Key Insights

  1. Simplicity for Robustness – A shallow CNN with targeted GLU gating and observation normalization outperforms deeper, more sophisticated models on unseen visual perturbations.
  2. Depth for Biological Plausibility – Hierarchical, residual architectures with sufficient capacity are required to capture the multi‑scale representations present in mouse visual cortex, leading to superior neural prediction.
  3. Non‑Monotonic Training Dynamics – Optimal performance is achieved at intermediate training lengths; excessive training harms both behavioral robustness and neural alignment.
  4. Targeted Gating is Crucial – GLU modules consistently improve both tracks by allowing the network to filter out perturbation‑sensitive features while preserving task‑relevant information.

Conclusion
The paper demonstrates that the design goals of visual robustness and neural alignment can be at odds, necessitating distinct architectural strategies. By systematically exploring a spectrum of models, documenting failures, and performing extensive ablations, the authors provide a clear roadmap for building visual agents that are both behaviorally resilient and biologically plausible. Their findings challenge the prevailing assumption that “bigger is better” in visuomotor learning and suggest that careful alignment of model capacity, gating mechanisms, and training schedules is essential for advancing AI systems that truly emulate animal vision.

