SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv paper.

Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs perform reasoning primarily with textual CoT, which limits their effectiveness on vision-intensive tasks. Recent approaches inject a fixed number of continuous hidden states as “visual thoughts” into the reasoning process and improve visual performance, but often at the cost of degraded text-based logical reasoning. We argue that the core limitation lies in a rigid, pre-defined reasoning pattern that cannot adaptively choose the most suitable thinking modality for different user queries. We introduce SwimBird, a reasoning-switchable MLLM that dynamically switches among three reasoning modes conditioned on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision-text reasoning. To enable this capability, we adopt a hybrid autoregressive formulation that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and design a systematic reasoning-mode curation strategy to construct SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning patterns. By enabling flexible, query-adaptive mode selection, SwimBird preserves strong textual logic while substantially improving performance on vision-dense tasks. Experiments across diverse benchmarks covering textual reasoning and challenging visual understanding demonstrate that SwimBird achieves state-of-the-art results and robust gains over prior fixed-pattern multimodal reasoning methods.


💡 Research Summary

SwimBird tackles a fundamental limitation of current multimodal large language models (MLLMs): the reliance on a single, pre‑defined reasoning template that forces every query to follow either a text‑only chain‑of‑thought (CoT), a vision‑only latent visual CoT, or a fixed interleaved schedule. This rigidity creates a modality mismatch in both directions. Injecting visual thoughts into purely textual logic problems degrades symbolic reasoning, while text‑only reasoning falls short on vision‑dense tasks such as maze solving, fine‑grained visual search, or spatial navigation, where intermediate visual states are essential.

The paper proposes a hybrid autoregressive framework that unifies next‑token prediction for textual thoughts with next‑embedding prediction for visual thoughts. Textual spans are handled exactly like a standard language model, trained with a shifted cross‑entropy loss. Visual thoughts are represented as continuous hidden‑state embeddings; at each visual step the model predicts the next embedding, supervised by a mean‑squared‑error loss against target embeddings generated by the same vision encoder. A single loss function combines the two modalities with tunable weights (λ_text, λ_vis), allowing the model to learn all three reasoning patterns without forcing unnecessary supervision.
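The combined objective described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the `is_visual` mask, and the choice to average each term over its own positions are assumptions made for illustration. The summary only states that a shifted cross-entropy term and an MSE term are combined with weights λ_text and λ_vis.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def hybrid_loss(logits, token_ids, pred_embeds, target_embeds, is_visual,
                lam_text=1.0, lam_vis=1.0):
    """Weighted sum of shifted cross-entropy (text) and MSE (visual thoughts).

    logits[t]        : next-token logits produced at position t
    token_ids[t]     : ground-truth token id at position t (unused if visual)
    pred_embeds[t]   : next-embedding prediction produced at position t
    target_embeds[t] : target embedding at position t (unused if textual)
    is_visual[t]     : True if position t holds a visual thought (assumed mask)
    """
    text_terms, vis_terms = [], []
    for t in range(len(is_visual) - 1):
        nxt = t + 1  # shifted target: the output at t is scored against t + 1
        if is_visual[nxt]:
            sq = [(p - q) ** 2 for p, q in zip(pred_embeds[t], target_embeds[nxt])]
            vis_terms.append(sum(sq) / len(sq))
        else:
            text_terms.append(-log_softmax(logits[t])[token_ids[nxt]])
    # Assumption: each term is averaged over its own positions; a sample with
    # no visual (or no text) targets contributes 0 for that term.
    text_loss = sum(text_terms) / len(text_terms) if text_terms else 0.0
    vis_loss = sum(vis_terms) / len(vis_terms) if vis_terms else 0.0
    return lam_text * text_loss + lam_vis * vis_loss, text_loss, vis_loss

# Toy interleaved sequence: two text positions followed by two visual thoughts.
logits = [[0.0, 0.0, 0.0]] * 4          # uniform over a 3-token vocabulary
token_ids = [0, 1, 0, 0]
pred_embeds = [[0.0, 0.0]] * 4
target_embeds = [[1.0, 1.0]] * 4
is_visual = [False, False, True, True]
total, text_loss, vis_loss = hybrid_loss(
    logits, token_ids, pred_embeds, target_embeds, is_visual, lam_vis=0.5)
```

Under this sketch, a text-only sample (all `is_visual` entries `False`) reduces to ordinary language-model cross-entropy, which is one way to read the claim that all three reasoning patterns are learned without forcing unnecessary supervision.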

A key architectural innovation is the dynamic visual‑token budget. Prior latent‑visual methods allocate a fixed number of latent tokens (e.g., eight) regardless of image resolution or task difficulty, which either discards fine‑grained details on high‑resolution inputs or wastes computation on easy, low‑resolution queries. SwimBird leverages the resolution‑aware tokenization of Qwen‑ViT: for both the original question image and any intermediate “thinking” images, the vision encoder is allowed to emit a variable number of visual tokens bounded by an independent range

