How Much Information Can a Vision Token Hold? A Scaling Law for Recognition Limits in VLMs

Recent vision-centric approaches have made significant strides in long-context modeling. Exemplified by DeepSeek-OCR, these models encode rendered text into continuous vision tokens, achieving high compression rates without sacrificing recognition precision. However, viewing the vision encoder as a lossy channel with finite representational capacity raises a fundamental question: what is the information upper bound of visual tokens? To probe this limit, we conduct controlled stress tests that progressively increase the information quantity (character count) within an image. We observe a distinct phase-transition phenomenon with three regimes: a near-perfect Stable Phase, an Instability Phase marked by increased error variance, and a total Collapse Phase. We analyze the mechanical origins of these transitions and identify the key contributing factors. Furthermore, we formulate a probabilistic scaling law that unifies average vision token load and visual density into a latent difficulty metric. Extensive experiments across various Vision-Language Models demonstrate the universality of this scaling law, providing empirical guidance for optimizing the efficiency-accuracy trade-off in visual context compression.


💡 Research Summary

This paper investigates the fundamental capacity limit of vision tokens used in Vision‑Language Models (VLMs) when encoding dense textual information rendered as images. The authors treat the vision encoder as a lossy communication channel with a finite representational bandwidth and ask: how much semantic information can a single vision token reliably carry? To answer this, they construct a controlled testbed where pure text from six domains (novels, law, economics, medicine, newspapers, letters) is rendered into images with varying typographic densities. A block‑wise shuffling strategy randomizes the order of text blocks, preventing large language models from exploiting linguistic priors and ensuring that performance reflects pure visual perception.
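The block-wise shuffling strategy can be sketched as follows. The block granularity and character-level splitting are illustrative assumptions, since the summary does not specify the paper's exact block definition:

```python
import random

def blockwise_shuffle(text: str, block_size: int = 64, seed: int = 0) -> str:
    """Split the text into fixed-size character blocks and randomize their
    order. Content within each block is preserved, but cross-block linguistic
    structure is destroyed, so a decoder cannot lean on language priors to
    guess characters it failed to perceive."""
    blocks = [text[i:i + block_size] for i in range(0, len(text), block_size)]
    random.Random(seed).shuffle(blocks)
    return "".join(blocks)
```

Because shuffling is a permutation of blocks, the character inventory (and hence the information load per image) is unchanged; only the exploitable linguistic context is removed.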

The experiments fix the number of vision tokens by controlling the input resolution (R = 512, 640, 1024, 1280) of a Vision Transformer‑based encoder (DeepSeek‑OCR) and progressively increase the amount of information by adding characters (including whitespace and punctuation). Recognition quality is measured with the edit distance (ED) between the decoded text and the ground truth.
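The ED metric here is the standard Levenshtein distance, typically normalized by text length so that scores are comparable across character counts. A minimal reference implementation (the exact normalization convention is an assumption, not confirmed by the summary):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_ed(pred: str, truth: str) -> float:
    """ED scaled to [0, 1]: 0 means a perfect reconstruction."""
    if not pred and not truth:
        return 0.0
    return edit_distance(pred, truth) / max(len(pred), len(truth))
```

Under this convention the "near zero" Stable Phase corresponds to normalized ED ≈ 0, and the >0.6 collapse threshold means more than 60% of the longer string would have to be edited.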

The results reveal a striking three‑phase transition as the character count grows:

  1. Stable Phase – For short texts the ED remains near zero; the token budget is sufficient and reconstruction is virtually perfect.
  2. Instability Phase – In a middle range the average ED rises, but the same character length can produce widely varying errors. This variance is traced to spatial‑alignment sensitivity of the fixed‑size patch partitioning in Vision Transformers. A pixel‑shift perturbation experiment shows that shifting the image by up to one patch size can dramatically reduce ED for many samples, confirming that misalignment across patch boundaries is the primary cause of instability.
  3. Collapse Phase – Beyond a critical “Hard Wall” the ED jumps abruptly to high values (>0.6) and cannot be rescued by pixel shifting. Here the information load exceeds the intrinsic capacity of the vision tokens, leading to irreversible loss of content.
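
The pixel-shift probe used to diagnose the Instability Phase can be sketched as follows. Here `decode_and_score` is a hypothetical wrapper around the model that returns, e.g., the normalized ED, and the patch size and shift stride are illustrative defaults rather than the paper's exact settings:

```python
import numpy as np

def best_score_under_shift(image, decode_and_score, patch=16, stride=4):
    """Translate the rendered-text image by up to one ViT patch in each
    axis and keep the best (lowest) score. If misalignment across patch
    boundaries caused the error, some shift should recover a low ED."""
    best = float("inf")
    for dy in range(0, patch, stride):
        for dx in range(0, patch, stride):
            # Circular shift: a cheap stand-in for translation with padding.
            shifted = np.roll(np.roll(image, dy, axis=0), dx, axis=1)
            best = min(best, decode_and_score(shifted))
    return best
```

In the Instability Phase many samples see their ED drop sharply under some shift, while past the Hard Wall no shift helps, which separates alignment failures from genuine capacity failures.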

Higher input resolutions shift the Hard Wall to the right, extending both the Stable and Instability phases, which demonstrates that the capacity limit scales with the total number of tokens but is not linearly proportional to resolution.
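The quadratic link between resolution and token budget can be made concrete. This sketch assumes a plain ViT with non-overlapping 16-pixel patches and no token compression; DeepSeek-OCR's actual encoder compresses further, so the numbers are illustrative only:

```python
def num_vision_tokens(resolution: int, patch: int = 16) -> int:
    """One token per non-overlapping patch: (R // p) ** 2 tokens in total."""
    assert resolution % patch == 0, "resolution must be a multiple of patch"
    return (resolution // patch) ** 2

def avg_token_load(num_chars: int, resolution: int, patch: int = 16) -> float:
    """Average token load: characters carried per vision token."""
    return num_chars / num_vision_tokens(resolution, patch)
```

Under these assumptions, doubling R from 512 to 1024 quadruples the token budget (1024 to 4096 tokens), which is consistent with the Hard Wall moving right with resolution even though the limit is not linear in R itself.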

To quantitatively model these phenomena, the authors introduce two variables: the average token load \( \bar{L} \) (characters per vision token) and the visual density \( D \) (characters per unit image area). They propose a probabilistic scaling law that combines these two variables into a single latent difficulty metric governing the probability of recognition failure.
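One common way to formalize such a law is a log-linear latent difficulty pushed through a logistic link. This is an illustrative sketch only; the paper's exact parameterization is not reproduced in this summary, and the coefficients \( \alpha, \beta, \gamma \) are assumed placeholders:

```latex
% Illustrative form (assumed): latent difficulty z as a log-linear
% combination of token load \bar{L} and visual density D, with a
% logistic link giving the probability of recognition collapse.
z(\bar{L}, D) = \alpha \log \bar{L} + \beta \log D + \gamma,
\qquad
P(\mathrm{collapse} \mid \bar{L}, D) = \sigma\bigl( z(\bar{L}, D) \bigr)
```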

