Jina-VLM: Small Multilingual Vision Language Model

Reading time: 12 minutes
...

📝 Original Info

  • Title: Jina-VLM: Small Multilingual Vision Language Model
  • ArXiv ID: 2512.04032
  • Date: 2025-12-03
  • Authors: Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao

📝 Abstract

We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. The model achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.

📄 Full Content

Vision-language models (VLMs) combine pretrained vision encoders with large language models to tackle tasks requiring joint visual and textual understanding (Alayrac et al., 2022; Liu et al., 2023). Recent VLMs have achieved strong results on visual question answering (VQA), OCR, and multimodal reasoning. However, two challenges limit their practical deployment. First, multilingual capabilities often degrade during vision adaptation: models that perform well on English benchmarks show uneven results across other languages (Manea & Libovický, 2025). Second, high-quality VLMs remain computationally expensive to train and deploy, limiting accessibility for researchers and practitioners with constrained resources.

This work introduces jina-vlm, a 2.4B parameter VLM that addresses both challenges. The model aligns a SigLIP2-So400M/14-384 vision encoder (Tschannen et al., 2025) with Qwen3-1.7B-Base (Yang et al., 2025) through an attention-pooling connector, trained with a two-stage pipeline that explicitly incorporates multilingual data. Among open 2B-scale VLMs, jina-vlm achieves state-of-the-art performance on multilingual multimodal benchmarks including MMMB and Multilingual MMBench, demonstrating that small models can excel at cross-lingual visual understanding without sacrificing general capabilities. On standard English benchmarks spanning diagrams, charts, documents, and OCR, jina-vlm achieves the highest average score (72.3) across eight VQA benchmarks among 2B-scale VLMs. These results are enabled by two technical contributions: an efficient arbitrary-resolution pipeline that combines overlapping tiling with attention-based token pooling to reduce the visual token count by 4×, and a training recipe that incorporates text-only data to preserve the language understanding of the backbone LLM.

VLM architecture and training. Modern VLMs follow an architecture introduced by PaLI (Chen et al., 2023): a pretrained vision encoder extracts visual features, a connector projects them into the language model’s embedding space, and a decoder-only language model generates text conditioned on these visual tokens. Vision Transformers (ViTs) (Dosovitskiy et al., 2021) produce patch-level representations that the language model processes alongside text embeddings. This design is adopted by LLaVA (Liu et al., 2023; 2024a; Xu et al., 2024; Li et al., 2024d; 2024a), Qwen-VL (Bai et al., 2023; Wang et al., 2024c; Bai et al., 2025), InternVL (Chen et al., 2024d; 2024c; 2025; Zhu et al., 2025; Wang et al., 2025), and Ovis (Lu et al., 2024b; 2025). Training strategies vary: Wang et al. (2024c) and Chen et al. (2025) alternate between multimodal instruction tuning and general training; Liu et al. (2024a) incorporate academic VQA datasets; Deitke et al. (2025), Li et al. (2024a), and Tong et al. (2024) curate large-scale, diverse data mixtures.

Efficient resolution-agnostic image processing. Standard ViTs process fixed-resolution images, requiring resizing that discards fine-grained detail. Since visual token count scales with resolution and Transformer computation scales quadratically with sequence length, naive high-resolution processing is prohibitive. Several solutions exist: Deitke et al. (2025) tile images with overlap; Wang et al. (2024c) introduce Naive Dynamic Resolution with Multimodal Rotary Position Embedding (Su et al., 2024; Heo et al., 2024); Lu et al. (2025) use native-resolution ViTs (Dehghani et al., 2023). Orthogonally, images often contain low-information regions (e.g., sky backgrounds), making visual tokens highly redundant. Token compression methods address this (Chen et al., 2024a; Shang et al., 2024; Yang et al., 2024; Xing et al., 2025). Chen et al. (2024c) develop Dynamic High-Resolution Tiling, and Liu et al. (2025) propose scale-then-compress strategies. Recent work on training-free token budgeting, such as HERO (Li et al., 2025), demonstrates that inference-time pruning can achieve significant speedups while preserving accuracy; our approach differs by learning compact representations during training rather than dropping tokens at inference.

Vision-language connectors. The connector bridging vision encoders and language models significantly impacts both efficiency and performance. BLIP-2 (Li et al., 2023a) introduces Q-Former, a learnable query-based transformer that extracts fixed-length representations from visual features, reducing the number of tokens fed to the LLM. Flamingo (Alayrac et al., 2022) uses a Perceiver Resampler with cross-attention to compress visual tokens. Our attention-pooling connector shares the goal of token reduction but operates differently: rather than learning a fixed set of queries, we apply local 2×2 attention pooling that preserves spatial structure while achieving 4× compression, which we found more effective for tasks requiring fine-grained spatial understanding.

Small VLMs. Efficiency has become a central objective. Chu et al. (2024) demonstrate competitive performance below 2B parameters. Shao et al. (2025) combine quantization with aggressive resolution reduction for mobile deployment, matching larger models’ performance. MiniCPM-V (Yao et al., 2024) targets edge deployment while maintaining strong OCR and multilingual capabilities. Marafioti et al. (2025) systematically explore design parameters to train VLMs as small as 256M parameters.

Multilingual VLMs. Many lightweight VLMs (Beyer et al., 2024; Steiner et al., 2024; Abdin et al., 2024) achieve strong English performance but degrade on other languages. Wang et al. (2024c) and Chen et al. (2024c) address this through targeted multilingual training data. Yue et al. (2025) introduce instruction-tuning data spanning 39 languages.

Retaining text-only performance. Multimodal training often degrades text-only capabilities. Mitigation strategies include balanced data mixtures, careful learning rate scheduling (Laurençon et al., 2024), and partial backbone freezing (Li et al., 2024a; Wang et al., 2025).

The vision encoder, SigLIP2-So400M/14-384, is a 27-layer Vision Transformer with 400M parameters that processes 378×378 pixel inputs as 27×27 grids of 14×14 patches. To handle arbitrary resolutions, we decompose each image into overlapping tiles of this size and process each tile independently through the encoder. A global thumbnail, the full image resized to 378×378, provides context alongside the tile representations. We use a default of 12 tiles during training; this limit can be increased at inference or during continued training to handle higher resolutions, with memory scaling linearly with tile count. The tiling algorithm is detailed in Appendix A.1.

[Figure: architecture overview. The VL connector concatenates features from ViT layers 24 and 18 (the third- and ninth-to-last layers), then applies 2×2 attention pooling to reduce 729 tokens per tile to 182 before projecting to the decoder dimension. Visual tokens are combined with text embeddings for the Qwen3 decoder (Yang et al., 2025).]
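As a concrete illustration, here is a minimal Python sketch of the overlapping-tiling step described above. The grid-selection and overlap heuristics are assumptions for illustration only; the exact algorithm (grid choice, overlap placement, resizing) is the one in Appendix A.1 and may differ from this sketch.

```python
# Hedged sketch of arbitrary-resolution tiling: cover the image with up to
# `max_tiles` overlapping 378x378 crops plus a global 378x378 thumbnail.
# The grid-selection and overlap heuristics below are assumptions, not the
# paper's Appendix A.1 algorithm.
from PIL import Image
import math

TILE = 378  # SigLIP2 input size: a 27x27 grid of 14x14 patches

def tile_image(img: Image.Image, max_tiles: int = 12):
    w, h = img.size
    if w < TILE or h < TILE:                      # ensure at least one full tile fits
        img = img.resize((max(w, TILE), max(h, TILE)))
        w, h = img.size
    # Pick a tile grid that roughly matches the aspect ratio under the budget.
    cols, rows = max(1, math.ceil(w / TILE)), max(1, math.ceil(h / TILE))
    while cols * rows > max_tiles:
        if cols >= rows and cols > 1:
            cols -= 1
        elif rows > 1:
            rows -= 1
        else:
            break
    # Evenly spaced top-left corners; neighboring tiles overlap whenever
    # cols * TILE exceeds the image width (likewise for rows).
    xs = [0] if cols == 1 else [round(i * (w - TILE) / (cols - 1)) for i in range(cols)]
    ys = [0] if rows == 1 else [round(j * (h - TILE) / (rows - 1)) for j in range(rows)]
    tiles = [img.crop((x, y, x + TILE, y + TILE)) for y in ys for x in xs]
    thumbnail = img.resize((TILE, TILE))          # global context view
    return tiles, thumbnail
```

Since each crop is an independent 378×378 encoder pass, memory scales linearly with the number of tiles, as noted above.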

Rather than using the final ViT output, jina-vlm concatenates features from two intermediate layers: the third-to-last and ninth-to-last, corresponding to layers 24 and 18 of the 27-layer encoder. This captures both fine-grained spatial details from earlier layers and high-level semantics from later layers. The connector then applies attention pooling over 2×2 patch neighborhoods, using mean-pooled features as queries. This reduces the token count by 4× while preserving local structure.

A SwiGLU projection layer maps the pooled representations to the language model’s embedding dimension.

In more formal terms, let $H^{(\ell)} \in \mathbb{R}^{N \times d_v}$ denote the hidden states from ViT layer $\ell$, where $N$ is the number of patches, $d_v$ is the vision encoder hidden size, and negative indices count from the final layer (e.g., $\ell = -1$ is the last layer). We concatenate features from two internal layers:

$$Z = \left[\, H^{(-3)} \;\Vert\; H^{(-9)} \,\right] \in \mathbb{R}^{N \times 2 d_v}.$$

For each 2×2 patch neighborhood $\mathcal{N}_i$, we compute a query vector as the mean of the neighborhood features:

$$q_i = \frac{1}{4} \sum_{j \in \mathcal{N}_i} z_j, \qquad i = 1, \dots, M,$$

where $z_j$ is the $j$-th row of $Z$, $\mathcal{N}_i$ contains the four patches at positions $(2i_x, 2i_y)$, $(2i_x + 1, 2i_y)$, $(2i_x, 2i_y + 1)$, and $(2i_x + 1, 2i_y + 1)$, and $M = N/4$.

Attention pooling is then computed as

$$p_i = \mathrm{softmax}\!\left( \frac{(q_i W_Q)\,(Z_{\mathcal{N}_i} W_K)^{\top}}{\sqrt{d}} \right) Z_{\mathcal{N}_i} W_V,$$

where $Z_{\mathcal{N}_i}$ stacks the features of the four patches in neighborhood $\mathcal{N}_i$, $d$ is the key dimension, and $W_Q$, $W_K$, and $W_V$ are learnable weight matrices. Finally, the pooled visual features are projected to the language model embedding dimension via a SwiGLU (Shazeer, 2020) layer:

$$v_i = \left( \mathrm{Swish}(p_i W_1) \odot (p_i W_2) \right) W_3,$$

where $\mathrm{Swish}(x) = x \cdot \sigma(x)$, $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, $W_1, W_2 \in \mathbb{R}^{d_v \times 3 d_l}$ and $W_3 \in \mathbb{R}^{3 d_l \times d_l}$ are learnable parameters, and $d_l$ is the language model embedding size.
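To make the connector concrete, here is a minimal PyTorch sketch of 2×2 attention pooling followed by a SwiGLU projection. Module names, projection shapes, and the handling of the patch grid are assumptions rather than the released implementation.

```python
# Minimal sketch of the attention-pooling connector (assumed shapes and names,
# not the released code): concatenate two ViT layers, pool each 2x2 patch
# neighborhood with attention (mean-pooled query), then project with SwiGLU.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPoolConnector(nn.Module):
    def __init__(self, d_vision: int, d_llm: int):
        super().__init__()
        d_in = 2 * d_vision                                # concat of layers -3 and -9
        self.q_proj = nn.Linear(d_in, d_in, bias=False)    # W_Q
        self.k_proj = nn.Linear(d_in, d_in, bias=False)    # W_K
        self.v_proj = nn.Linear(d_in, d_in, bias=False)    # W_V
        self.w1 = nn.Linear(d_in, 3 * d_llm, bias=False)   # SwiGLU gate
        self.w2 = nn.Linear(d_in, 3 * d_llm, bias=False)   # SwiGLU value
        self.w3 = nn.Linear(3 * d_llm, d_llm, bias=False)  # output projection

    def forward(self, h_minus3: torch.Tensor, h_minus9: torch.Tensor) -> torch.Tensor:
        # h_minus3, h_minus9: (B, H, W, d_vision) patch grids from ViT layers -3 and -9.
        # This sketch assumes H and W are even; the actual model handles the
        # 27x27 SigLIP2 grid (729 -> 182 tokens per tile).
        z = torch.cat([h_minus3, h_minus9], dim=-1)        # (B, H, W, 2*d_vision)
        B, H, W, D = z.shape
        # Group patches into non-overlapping 2x2 neighborhoods: (B, M, 4, D).
        z = z.view(B, H // 2, 2, W // 2, 2, D).permute(0, 1, 3, 2, 4, 5)
        z = z.reshape(B, (H // 2) * (W // 2), 4, D)
        q = self.q_proj(z.mean(dim=2, keepdim=True))       # mean of the 4 patches as query
        k, v = self.k_proj(z), self.v_proj(z)
        attn = torch.softmax(q @ k.transpose(-1, -2) / D ** 0.5, dim=-1)
        pooled = (attn @ v).squeeze(2)                     # (B, M, D): 4x fewer tokens
        # SwiGLU projection to the language-model embedding dimension.
        return self.w3(F.silu(self.w1(pooled)) * self.w2(pooled))
```

Because each query attends only over its own 2×2 neighborhood, the pooling cost stays linear in the number of patches, in contrast to global query-based resamplers such as Q-Former.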

The language decoder is initialized from Qwen3-1.7B-Base, which empirically outperformed the instruction-tuned variant in our setting. We introduce three special tokens to structure visual inputs: two delimit the image and thumbnail sequences, while a third marks row boundaries within the patch grid, where tokens are arranged left-to-right and top-to-bottom. Input and output embedding weights are not tied.
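The sketch below illustrates one plausible serialization of the pooled patch grids. The actual special-token strings are not preserved in this text, so <image>, <thumbnail>, and <row_break> are hypothetical placeholder names, and the per-tile layout is an assumption.

```python
# Hedged sketch of visual-token layout for the decoder. The token names are
# hypothetical placeholders; the real model defines three special tokens whose
# exact strings are not reproduced here.
def layout_visual_tokens(tile_grids, thumbnail_grid):
    """Each grid is a list of rows (top-to-bottom), each row a list of tokens."""
    seq = ["<image>"]                       # delimits the image (tile) sequence
    for grid in tile_grids:
        for row in grid:
            seq.extend(row)                 # tokens left-to-right within a row
            seq.append("<row_break>")       # marks a row boundary in the patch grid
    seq.append("<thumbnail>")               # delimits the global thumbnail sequence
    for row in thumbnail_grid:
        seq.extend(row)
        seq.append("<row_break>")
    return seq
```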

Table 1 quantifies the computational benefits of attention pooling. With the default 12-tile configuration (plus thumbnail), the unpooled baseline would produce 9,477 visual tokens per image, while our 2×2 pooling reduces this to 2,366 tokens. Since the ViT processes each tile identically regardless of pooling, the savings apply exclusively to the LLM: we observe a 3.9× reduction in prefill FLOPs and a 4× reduction in KV-cache memory. The overall FLOPs reduction is 2.3× when including the shared ViT cost.
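As a quick check, these counts follow directly from the per-tile figures: 12 tiles plus the global thumbnail give 13 encoder passes of $27 \times 27 = 729$ patches each, so

$$13 \times 729 = 9{,}477 \ \text{(unpooled)}, \qquad 13 \times 182 = 2{,}366 \ \text{(pooled)}, \qquad \frac{9{,}477}{2{,}366} \approx 4.0\times.$$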

Training proceeds in two stages, both updating all model components (encoder, connector, and decoder) without freezing, following Deitke et al. (2025). The combined data comprises approximately 5M multimodal samples and 12B text tokens across 30+ languages, with roughly half in English and the remainder spanning high- and moderate-resource languages. Table 2 summarizes hyperparameters for both stages.

The first stage focuses on cross-language semantic grounding rather than task-specific objectives.

Training data consists primarily of caption datasets (PixmoCap (Deitke et al., 2025), PangeaIns (Yue et al., 2025)) spanning diverse visual domains: natural scenes, documents, infographics, and diagrams. We include 15% text-only data from PleiAS/common corpus (Langlais et al., 2025) to mitigate degradation on text-only tasks. The connector uses a higher learning rate and shorter warmup than the encoder and decoder.
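A minimal sketch of how such per-module learning rates and warmups could be wired up is shown below; the module names, optimizer choice, and all numeric values are placeholders rather than the paper's Table 2 hyperparameters.

```python
# Hedged sketch: separate parameter groups so the connector gets a higher
# learning rate and a shorter warmup than the encoder and decoder.
# All names and values are placeholders, not the paper's hyperparameters.
import torch

def build_optimizer_and_scheduler(model, base_lr=1e-5, connector_lr=1e-4,
                                  warmups=(2000, 500, 2000)):
    groups = [
        {"params": model.vision_encoder.parameters(), "lr": base_lr},
        {"params": model.connector.parameters(),      "lr": connector_lr},
        {"params": model.language_model.parameters(), "lr": base_lr},
    ]
    optimizer = torch.optim.AdamW(groups, weight_decay=0.01)
    # One linear-warmup lambda per parameter group (encoder, connector, decoder).
    lambdas = [lambda step, w=w: min(1.0, step / max(1, w)) for w in warmups]
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambdas)
    return optimizer, scheduler
```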

The second stage trains instruction-following for VQA and reasoning tasks. We combine public dataset collections, including LLaVA OneVision (Li et al., 2024a). Given the diversity of instruction data, we found single-source batches more effective initially, likely due to the heterogeneous data mixture. We train for 30K steps with single-source batches, then 30K steps with mixed-source batches.
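The two batching regimes can be sketched as follows; the sampling scheme and weighting are assumptions, illustrating only the single-source-then-mixed progression described above.

```python
# Hedged sketch of the two batching regimes used in the second stage:
# single-source batches first, then fully mixed batches. Sampling details
# (weights, shuffling, epochs) are assumptions.
import random

def single_source_batches(datasets, batch_size):
    while True:
        ds = random.choice(datasets)                  # one dataset per batch
        yield random.sample(ds, batch_size)

def mixed_source_batches(datasets, batch_size):
    pool = [example for ds in datasets for example in ds]
    while True:
        yield random.sample(pool, batch_size)         # examples drawn across all datasets
```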

We compare jina-vlm against lightweight VLMs across seven capability areas: general VQA, multimodal comprehension, multi-image reasoning, hallucination control, mathematical reasoning, text-only performance, and multilingual understanding. All evaluations use VLMEvalKit (Duan et al., 2024) with English prompts matching our training format (e.g., “Return only the letter of the best answer option” for multiple-choice, “Respond very briefly” for open-ended questions).

Table 3 reports results on eight VQA benchmarks covering diagrams (AI2D (Kembhavi et al., 2016)), charts (ChartQA (Masry et al., 2022), CharXiv (Wang et al., 2024e)), scene text (TextVQA (Singh et al., 2019)), documents (DocVQA (Mathew et al., 2021), InfoVQA (Mathew et al., 2022)), OCR (OCRBench (Liu et al., 2024c)), and diverse scenes (SEED-Bench-2-Plus (Li et al., 2024b)). jina-vlm achieves the highest average (72.3), with particularly strong performance on diagram interpretation and text extraction. Results for models other than jina-vlm are from their respective papers (Wang et al., 2025; Zhu et al., 2025; Wang et al., 2024c), except those marked with *, which are computed using VLMEvalKit. All scores represent accuracy (%).

Table 4 shows results on multimodal comprehension (MME (Fu et al., 2025), MMB v1.1 (Liu et al., 2024b), MMStar (Chen et al., 2024b)) and real-world understanding (RealWorldQA (xAI, 2024), MME-RealWorld (Zhang et al., 2025), R-Bench (Li et al., 2024c)). jina-vlm scores 67.4 on multimodal tasks and 61.9 on real-world tasks, achieving the best RealWorldQA result (68.2).

Table 5 reports multi-image reasoning (BLINK (Fu et al., 2024), MuirBench (Wang et al., 2024a), MMT (Ying et al., 2024)) and hallucination benchmarks that measure the tendency to fabricate visual details (HallBench (Guan et al., 2024), POPE (Li et al., 2023b)). jina-vlm scores 47.3 on multi-image tasks, which is expected given limited multi-image training data, but achieves the best POPE score (90.3), indicating low hallucination rates.

Table 6 reports structured reasoning benchmarks: multidisciplinary comprehension (MMMU (Yue et al., 2024)), visual mathematics (MathVista (Lu et al., 2024a), MathVision (Wang et al., 2024b), MathVerse (Zhang et al., 2024), WeMath (Qiao et al., 2025)), and logical reasoning (LogicVista (Xiao et al., 2024)). jina-vlm performs comparably to InternVL3-2B and outperforms Qwen2-VL-2B.

Table 7 compares jina-vlm against the backbone Qwen3-1.7B on text-only benchmarks: MMLU (Hendrycks et al., 2021), MMLU-Pro (Wang et al., 2024d), GSM-8K (Cobbe et al., 2021), ARC-C (Clark et al., 2018), and HellaSwag (Zellers et al., 2019). Results show mixed preservation of text-only capabilities: jina-vlm matches or exceeds the backbone on commonsense reasoning (ARC-C, HellaSwag) and retains most performance on MMLU and GSM-8K. However, MMLU-Pro shows substantial degradation (46.4 → 30.3), likely because this benchmark emphasizes extended multi-step reasoning that conflicts with our instruction-tuning toward concise visual responses. This suggests a trade-off between optimizing for multimodal tasks and preserving complex text-only reasoning, which future work could address through more balanced data mixtures or curriculum scheduling. Results for models other than jina-vlm are from their respective papers (Wang et al., 2025; Zhu et al., 2025; Wang et al., 2024c), except those marked with *, which are computed using VLMEvalKit; † indicates scores for InternVL3.5-2B without thinking mode, evaluated using VLMEvalKit. All scores represent accuracy (%).

Table 8 reports multilingual multimodal benchmarks: MMMB (Sun et al., 2025), Multilingual MM-Bench (Sun et al., 2025), and MTVQA (Tang et al., 2025). jina-vlm achieves state-of-the-art multilingual performance among 2B-scale VLMs, with the highest averages on MMMB (78.8) and Multilingual MMBench (74.3).

We presented jina-vlm, a 2.4B vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. Our results demonstrate that small VLMs can attain strong cross-lingual visual understanding through careful architectural and training choices: attention-based token pooling reduces visual tokens by 4× while preserving spatial information, and incorporating text-only data during multimodal training mitigates the catastrophic forgetting typically observed in vision-adapted language models. On standard English VQA benchmarks, jina-vlm achieves leading results, demonstrating that multilingual capabilities need not come at the cost of general performance.

The current approach has limitations. Multi-tile processing introduces computational overhead that scales with image resolution, and tiling can fragment global spatial context, potentially impairing performance on tasks requiring holistic scene understanding such as object counting or precise spatial reasoning across tile boundaries. While the global thumbnail partially mitigates this, native-resolution approaches (Dehghani et al., 2023) may be better suited for such tasks. We have not emphasized safety-critical training or alignment, and multi-image reasoning remains weak due to limited training data in this regime. Future work could explore more efficient resolution handling, targeted improvements for counting and spatial tasks, and whether our multilingual training recipe transfers to larger model scales.

https://huggingface.co/Qwen/Qwen3-1.7B-Base

https://github.com/open-compass/VLMEvalKit



Reference

This content is AI-processed based on open access ArXiv data.
