Multimodal Large Language Models (MLLMs) deliver strong vision-language performance but at high computational cost, driven by the numerous visual tokens processed by the Vision Transformer (ViT) encoder. Existing token pruning strategies are inadequate: LLM-stage token pruning overlooks the ViT's overhead, while conventional ViT token pruning, without language guidance, risks discarding textually critical visual cues and introduces feature distortions amplified by the ViT's bidirectional attention. To meet these challenges, we propose IPCV, a training-free, information-preserving compression framework for MLLM visual encoders. IPCV enables aggressive token pruning inside the ViT via Neighbor-Guided Reconstruction (NGR), which temporarily reconstructs pruned tokens so that they can participate in attention with minimal overhead, and then fully restores them before passing them to the LLM. In addition, we introduce Attention Stabilization (AS) to further alleviate the negative influence of token pruning by approximating the keys and values of pruned tokens; AS can also be applied directly to existing LLM-side token pruning methods to enhance their performance. Extensive experiments show that IPCV substantially reduces end-to-end computation and outperforms state-of-the-art training-free token compression methods across diverse image and video benchmarks. Our code is available at https://github.com/Perkzi/IPCV.
Multimodal large language models (MLLMs) [25,45] have achieved remarkable performance across diverse vision-language tasks [5,15,46,47,51,57-59], benefiting from powerful language reasoning capabilities and strong vision encoders [39,56]. However, the visual branch of these models often produces hundreds to thousands of tokens per image or video, especially at high resolutions or with multiple frames [6]. Due to the quadratic computational complexity $O(N^2)$ of self-attention [44], longer token sequences incur significantly higher latency, which severely hinders the widespread adoption and edge deployment of MLLMs. In practice, however, visual tokens in MLLMs exhibit substantial redundancy [36]. To address this, a surge of recent research has focused on token pruning methods [26,27,29,30,50], which aim to directly reduce the number of tokens and lower computational costs. Despite the remarkable progress of these approaches, most operate only on the LLM within MLLMs. As vision encoders increasingly adopt dynamic designs or higher native resolutions, the resulting visual token sequences grow significantly longer, rendering the computational cost of the vision encoder impossible to overlook (as shown in Figure 1). While several methods have been proposed for token compression in the vision transformer (e.g., ToMe [1], ToFu [16]), they primarily target vision-only models and tasks (e.g., classification), resulting in suboptimal performance on multimodal understanding tasks [60]. Generally speaking, token pruning in the ViT for MLLMs faces two significant challenges.
Absence of language guidance. A fundamental challenge of performing token compression at the vision encoder in MLLMs lies in the absence of language guidance [48]. Since the token compression process occurs before any cross-modal interaction, it risks discarding visual tokens that may appear unimportant from a purely visual perspective but are in fact crucial for downstream language-driven reasoning.
Negative influence from pruned tokens. Moreover, the removal of certain vision tokens can introduce feature distortions in the remaining tokens, as the bidirectional attention mechanism in vision transformers [7] propagates information globally. This issue is even more pronounced at the vision encoding stage than in the language model, where unidirectional attention limits the scope of such errors [37,41]. As a result, naive token pruning in the vision tower can inadvertently undermine the model's ability to preserve information essential for multimodal understanding.
Figure 2. Visualization of the delta of tokens from the shallow layers to the deep layers. (a) PCA projection of the hidden-state trajectories of a pruned token and the mean of its top-10 nearest neighbors, traced from the pruning layer to the final layer of the ViT, with arrows indicating the shift direction. (b) and (c): Distributions of L1 distance and cosine similarity, computed from pairwise comparisons between the change (delta) of retained tokens and that of pruned tokens across layers. The deep-blue vertical line denotes the overall median of the pairwise distances (cosine similarities) computed between each pruned token and the mean of its top-10 most similar tokens. These visualizations reveal that tokens with high similarity tend to exhibit highly similar changes from shallow to deep layers.
To tackle the above two challenges, we propose Information-Preserving Compression for MLLM Visual Encoders (IPCV). Specifically, in the vision encoder (i.e., ViT), IPCV prunes tokens in shallow layers to reduce computation costs. Then, we introduce Neighbor-Guided Reconstruction (NGR), which reconstructs the pruned tokens in subsequent layers, including the final layer, based on their most similar remaining tokens (i.e., tokens that are not pruned). As studied in Figure 2, we find that similar tokens exhibit similar deltas from the shallow layers to the deep layers. As a result, instead of directly copying the similar remaining tokens [1], NGR reconstructs each pruned token in the final layer by adding to its shallow-layer value the delta that its most similar remaining tokens accumulate from the shallow layers to the deep layers.
Secondly, to further reduce the negative influence of the pruned tokens on the remaining tokens, we introduce Attention Stabilization (AS), which approximates the keys and values of pruned tokens in the middle layers: the pruned tokens are first reconstructed via NGR, and their keys and values are then computed from these reconstructions.
The proposed IPCV is training-free for the ViT and integrates seamlessly into diverse MLLMs. For the LLM stage, IPCV is orthogonal to and compatible with various token compression schemes; in this paper, we pair IPCV by default with the LLM-stage compressor in DART [49] to further reduce redundancy after modal fusion. In summary, our main contributions are as follows:
• We provide a systematic comparison of LLM-stage and ViT-stage token pruning in MLLMs, revealing the need for and the challenges of vision-side token pruning.
Visual Compression in MLLMs. To reduce the inflow of visual tokens into the LLM, recent work compresses at multiple junctures [12,13,23,28,34,52]. At the projector, adaptive pooling (DeCo [54]) and query-based projectors (BLIP-2's Q-Former [18]) shorten the sequence before it enters the language model. Within the LLM, methods include training-based policies guided by cross-modal supervision (e.g., LVPruning [43], Skip-Vision [55]) and training-free heuristics that use attention or feature similarity to prune or merge tokens dynamically during inference (e.g., FastV [3], SparseVLM [60], DART [49], EfficientVLA [53]). However, since a significant portion of computation already occurs in the ViT, where the initial visual tokens are processed, compression applied only within the LLM yields limited acceleration, and the vision encoder remains a bottleneck for overall efficiency.
Visual Compression in Visual Encoders. In visual encoders, the token count can be reduced structurally or dynamically. Structurally, multi-scale encoders and hierarchical backbones with patch merging (e.g., LLaMA-VID [21], Swin Transformer [33]) progressively lower spatial or temporal resolution across stages, thus reducing the number of tokens. Dynamically, tokens are compressed within transformer layers: Token Merging (ToMe) [1] merges redundant tokens, while pruning within the ViT (DynamicViT [40], EViT [22], SparseViT [4]) removes low-importance tokens and reduces encoder computation. Most of these methods target single ViTs rather than MLLMs and often require extra training, which is costly in the MLLM setting.
Token compression in the vision encoder of MLLMs inherently risks discarding tokens that appear visually redundant but are semantically important. IPCV is designed to solve this problem (see Figure 3): in the early ViT layers, redundant tokens are pruned to reduce computation. Their semantic contribution is maintained through Neighbor-Guided Reconstruction, which rectifies the pruned tokens, allowing them to continue contributing in subsequent layers. In this way, the LLM ultimately receives a complete and semantically intact set of vision tokens.
Let $l_p$ denote the index of the transformer block in the vision encoder where token pruning is applied. Given the input patch embeddings to block $l_p$, denoted as $H_{l_p} \in \mathbb{R}^{L \times D}$, together with the embeddings from the preceding block $H_{l_p-1}$, we compute a token-wise importance score $s_i$ based on the simple feature-difference criterion:
Let $\mathcal{I}_{\text{keep}}$ be the indices of the top-$K$ tokens with the largest $s_i$, and $\mathcal{I}_{\text{rem}}$ be the remaining indices. The retained and removed token embeddings at the pruning layer $l_p$ are
$$H_{\text{keep},l_p} = H_{l_p}[\mathcal{I}_{\text{keep}}], \qquad H_{\text{rem},l_p} = H_{l_p}[\mathcal{I}_{\text{rem}}].$$
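For concreteness, the following is a minimal PyTorch sketch of this pruning step. The function and variable names (e.g., prune_tokens, keep_ratio) are illustrative rather than the released implementation, and the L2 norm of the feature difference is only one plausible instantiation of the score $s_i$ above.

```python
import torch

def prune_tokens(h_lp: torch.Tensor, h_prev: torch.Tensor, keep_ratio: float):
    """Split the tokens at pruning layer l_p into retained and removed sets.

    h_lp:   [L, D] embeddings entering block l_p
    h_prev: [L, D] embeddings from block l_p - 1
    """
    # Feature-difference importance score (one plausible instantiation).
    scores = (h_lp - h_prev).norm(dim=-1)                 # [L]
    num_keep = max(1, int(keep_ratio * h_lp.shape[0]))
    idx_keep = scores.topk(num_keep).indices              # I_keep: top-K by score
    mask = torch.ones(h_lp.shape[0], dtype=torch.bool, device=h_lp.device)
    mask[idx_keep] = False
    idx_rem = mask.nonzero(as_tuple=True)[0]              # I_rem: all remaining indices
    return idx_keep, idx_rem, h_lp[idx_keep], h_lp[idx_rem]
```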
We visualize the behavior of pruned tokens and their nearest neighbors in Figure 2. Panel (a) shows that, in the unpruned model, the hidden-state trajectory of a token expected to be pruned remains highly consistent with the mean trajectory of its top-10 most similar unpruned tokens. This consistency indicates that the evolution of pruned tokens can be well approximated by the updates of their similar neighbors. Panels (b) and (c) present the empirical distributions of L1 distance and cosine similarity between token changes, demonstrating that the change of a pruned token is much closer to the mean change of its top-10 similar tokens than to that of randomly selected tokens. These findings support the design of our Neighbor-Guided Reconstruction mechanism. NGR aims to reconstruct the removed tokens at a layer $l > l_p$ by transferring local updates from their nearest retained neighbors, thereby enabling pruned tokens to continue contributing contextual information either within the current attention layer or in subsequent processing stages.
Given the retained and removed index sets $\mathcal{I}_{\text{keep}}$ and $\mathcal{I}_{\text{rem}}$ (and their embeddings $H_{\text{keep},l_p}$ and $H_{\text{rem},l_p}$), we construct the rectified embeddings $\hat{H}_{\text{rem},l}$ for any given layer $l > l_p$ in the vision transformer.
Neighbor selection. For each removed token $i \in \mathcal{I}_{\text{rem}}$ with feature $h_{i,l_p}$, we find its $k$ nearest neighbors among $H_{\text{keep},l_p}$ by minimizing the Euclidean distance:
$$\mathcal{N}(i) = \operatorname*{arg\,min}_{\mathcal{S} \subset \mathcal{I}_{\text{keep}},\, |\mathcal{S}| = k} \; \sum_{j \in \mathcal{S}} \| h_{i,l_p} - h_{j,l_p} \|_2 , \quad (3)$$
i.e., the set of the $k$ retained tokens closest to $h_{i,l_p}$.
Local update transfer. Let $H_{\text{keep},l}$ be the retained-token embeddings at the current layer $l$, and define the per-token update for the kept set from $l_p$ to $l$ as
$$\Delta_{j,l} = h_{j,l} - h_{j,l_p}, \qquad j \in \mathcal{I}_{\text{keep}}.$$
We reconstruct each removed token $i$ by adding the average update of its $k$ neighbors to its original embedding $h_{i,l_p}$:
$$\hat{h}_{i,l} = h_{i,l_p} + \frac{1}{k} \sum_{j \in \mathcal{N}(i)} \Delta_{j,l} .$$
Stacking all removed tokens yields $\hat{H}_{\text{rem},l} \in \mathbb{R}^{|\mathcal{I}_{\text{rem}}| \times D}$. The complete NGR procedure is presented in Algorithm 1.
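A minimal PyTorch sketch of NGR under the assumptions above (illustrative names, not the released code); it mirrors Algorithm 1 in vectorized form:

```python
import torch

def ngr_reconstruct(h_rem_lp: torch.Tensor, h_keep_lp: torch.Tensor,
                    h_keep_l: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Reconstruct removed tokens at a layer l > l_p.

    h_rem_lp:  [R, D] removed-token embeddings frozen at l_p
    h_keep_lp: [K, D] retained-token embeddings at l_p
    h_keep_l:  [K, D] retained-token embeddings at the current layer l
    """
    # Neighbor selection: k nearest retained tokens (Euclidean) per removed token.
    dists = torch.cdist(h_rem_lp, h_keep_lp)               # [R, K]
    nn_idx = dists.topk(k, largest=False).indices          # [R, k]

    # Local update transfer: per-token update of the kept set from l_p to l.
    delta_keep = h_keep_l - h_keep_lp                      # [K, D]
    delta_nn = delta_keep[nn_idx].mean(dim=1)              # [R, D] averaged neighbor delta

    # Add the averaged neighbor update to the embedding frozen at l_p.
    return h_rem_lp + delta_nn
```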
Pruning at l p alters the attention distribution because the removed tokens no longer participate in query-key interactions.
To mitigate this with minimal overhead, IPCV temporarily retains the pruned tokens as keys and values for a few layers after pruning. Thus, they contribute context in attention while skipping the costly FFN layers [55].
Algorithm 1 Neighbor-Guided Reconstruction (NGR)
1: Input: index sets $\mathcal{I}_{\text{keep}}$, $\mathcal{I}_{\text{rem}}$; embeddings $H_{\text{keep},l_p}$, $H_{\text{rem},l_p}$, $H_{\text{keep},l}$; neighbor size $k$
2: Output: $\hat{H}_{\text{rem},l}$
3: for each $i \in \mathcal{I}_{\text{rem}}$ do
4:   Select the $k$ nearest retained neighbors $\mathcal{N}(i)$ via Eq. (3)
5:   Compute the neighbor updates $\Delta_{j,l} = h_{j,l} - h_{j,l_p}$ for $j \in \mathcal{N}(i)$
6:   Reconstruct $\hat{h}_{i,l} = h_{i,l_p} + \frac{1}{k}\sum_{j\in\mathcal{N}(i)}\Delta_{j,l}$
7: end for
8: Stack all $\hat{h}_{i,l}$ to form $\hat{H}_{\text{rem},l}$

Concretely, for layers $l \in [\,l_p,\, l_p + \Delta l_{\max})$, the removed tokens are temporarily rectified via
$$\tilde{H}_{\text{rem},l} = \mathrm{NGR}\bigl(H_{\text{rem},l_p},\, H_{\text{keep},l_p},\, H_{\text{keep},l}\bigr).$$
At $l = l_p$, this is equivalent to directly using $H_{\text{rem},l_p}$. Then, we restore the full token sequence $H_{\text{full},l} \in \mathbb{R}^{L \times D}$ by placing both retained and reconstructed tokens back in their corresponding original positions:
$$H_{\text{full},l}[\mathcal{I}_{\text{keep}}] = H_{\text{keep},l}, \qquad H_{\text{full},l}[\mathcal{I}_{\text{rem}}] = \tilde{H}_{\text{rem},l} .$$
The full sequence is provided as input to the multi-head attention module for bidirectional token interactions,
$$\tilde{H}_{\text{full},l} = \mathrm{MHA}\bigl(H_{\text{full},l}\bigr),$$
after which only the updated retained tokens are kept and forwarded to the FFN:
$$H_{\text{keep},l+1} = \mathrm{FFN}\bigl(\tilde{H}_{\text{full},l}[\mathcal{I}_{\text{keep}}]\bigr).$$
This strategy is motivated by Skip-Vision [55], which shows that in transformer-based MLLMs, the feed-forward network (FFN) dominates computation for visual tokens, while attention is relatively lightweight. By enabling pruned tokens to skip the FFN but still join attention, IPCV preserves semantic aggregation with negligible extra FLOPs.
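A hedged sketch of one Attention Stabilization layer under these assumptions is given below: `attn` and `ffn` stand in for the block's sub-modules (residual connections and layer norms are omitted for brevity), and `ngr_reconstruct` refers to the NGR sketch above.

```python
import torch

def as_block(h_keep_l, h_rem_lp, h_keep_lp, idx_keep, idx_rem, attn, ffn, k=10):
    """One AS layer: reconstructed pruned tokens join attention but skip the FFN."""
    L = len(idx_keep) + len(idx_rem)
    D = h_keep_l.shape[-1]

    # Temporarily rectify pruned tokens via NGR (at l = l_p this returns h_rem_lp).
    h_rem_l = ngr_reconstruct(h_rem_lp, h_keep_lp, h_keep_l, k)

    # Restore the full-length sequence in the original token order.
    h_full = h_keep_l.new_zeros(L, D)
    h_full[idx_keep] = h_keep_l
    h_full[idx_rem] = h_rem_l

    # Bidirectional attention over all L tokens; pruned tokens act as extra keys/values.
    h_full = attn(h_full.unsqueeze(0)).squeeze(0)

    # Only the updated retained tokens proceed to the costly FFN.
    return ffn(h_full[idx_keep].unsqueeze(0)).squeeze(0)
```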
After the final ViT block $l_{\text{final}}$, IPCV merges the updated retained tokens $H_{\text{keep},l_{\text{final}}+1} \in \mathbb{R}^{|\mathcal{I}_{\text{keep}}| \times D}$ (denoting the output of $l_{\text{final}}$) with the newly reconstructed removed tokens $\hat{H}_{\text{rem},l_{\text{final}}+1} \in \mathbb{R}^{|\mathcal{I}_{\text{rem}}| \times D}$ to restore the complete sequence:
$$H_{\text{full},l_{\text{final}}+1}[\mathcal{I}_{\text{keep}}] = H_{\text{keep},l_{\text{final}}+1}, \qquad H_{\text{full},l_{\text{final}}+1}[\mathcal{I}_{\text{rem}}] = \hat{H}_{\text{rem},l_{\text{final}}+1} .$$
The resulting $H_{\text{full},l_{\text{final}}+1} \in \mathbb{R}^{L \times D}$ matches the original sequence length $L$ prior to pruning. This Reintegration step guarantees that the downstream LLM receives the same number and ordering of visual tokens as in the uncompressed setting, thereby preserving both interface compatibility and semantic integrity without introducing additional computational overhead in earlier layers.
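The Reintegration step itself reduces to one final NGR call followed by a scatter back to the original token positions; a short sketch under the same assumed tensor names:

```python
import torch

def reintegrate(h_keep_final, h_rem_lp, h_keep_lp, idx_keep, idx_rem, k=10):
    """Rebuild the full-length, original-order visual token sequence for the LLM."""
    # Reconstruct the removed tokens one last time at the ViT output.
    h_rem_final = ngr_reconstruct(h_rem_lp, h_keep_lp, h_keep_final, k)

    L = len(idx_keep) + len(idx_rem)
    h_full = h_keep_final.new_zeros(L, h_keep_final.shape[-1])
    h_full[idx_keep] = h_keep_final   # updated retained tokens
    h_full[idx_rem] = h_rem_final     # NGR-reconstructed pruned tokens
    return h_full                     # length L, same ordering as the uncompressed sequence
```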
A key property of IPCV is the strict preservation of the visual token sequence: after vision-side pruning and NGR-based reconstruction, the sequence of visual tokens passed to the LLM exactly matches the original uncompressed sequence in both length and ordering. This consistency in sequence structure ensures interface compatibility with the multimodal fusion module as well as the LLM input layer, allowing IPCV to be seamlessly integrated into existing architectures without requiring any modifications. Consequently, IPCV can be readily combined with existing LLM-side token pruning or acceleration techniques that assume a complete token set. For instance, methods such as FastV [3] and DART [49] can be directly applied to the LLM input sequence to further reduce computation by pruning less informative tokens in the language model's self-attention layers. In practice, IPCV serves as a drop-in vision-side token compression module that complements language-side efficiency methods in an orthogonal manner.
Models and Baselines. We evaluate IPCV primarily on Qwen2-VL-7B-Instruct [45], an instruction-tuned vision-language model, and additionally on InternVL3-38B [62] to verify generalization across architectures. Baselines include training-free vision-side token compression methods (ToMe [1], ToFu [16]) and LLM-side token pruning methods (FastV [3], V²Drop [2], SparseVLM [60], DART [49]). For a direct comparison, we also implement a vision-side variant of DART, denoted DART (ViT).
To enable a fair comparison between ViT-stage and LLM-stage methods, we report results using the overall acceleration ratio as the evaluation metric. For ViT-stage pruning methods, we adopt symmetric settings where both the ViT encoder and LLM decoder retain 50%, 35%, or 20% of tokens. For LLM-stage pruning methods, we use asymmetric settings that retain 100% of ViT tokens while pruning LLM tokens to 20% or 5%. Benchmarks cover diverse image datasets (GQA [14], MMBench [31], MME [9], POPE [20], SEED [17], TextVQA [42], VizWiz [11], OCRBench [32]) and video datasets (MVBench [19], EgoSchema [35], MLVU [61],
VideoMME [10]). Additional details on implementation and datasets are provided in Appendices C and D.
Tables 1 and 2 summarize the accuracy-latency trade-off of IPCV and baselines on Qwen2-VL-7B across diverse benchmarks. Here, Avg. Acc. is the average percentage of performance relative to the vanilla model, and Rel. Latency is the total inference time normalized to vanilla (100%), measured on MMBench-EN for image and MVBench for video benchmarks. To highlight the speed-accuracy trade-off, latencies are color-coded by relative runtime: dark blue (≥ 80%), blue-green (70-80%), green (60-70%), and light green (50-60%). On image understanding tasks, IPCV consistently achieves a superior balance between performance and efficiency. Under moderate acceleration (70.9% of vanilla runtime), it preserves 97.8% of the vanilla performance, outperforming all baselines with comparable or higher runtime. At a higher speedup of 60.8% runtime, IPCV still retains 94.9% Avg. Acc., exceeding the next-best method by at least 12 points. In contrast, ViT-stage pruning baselines are prone to removing visual tokens that are critical for grounding textual understanding, resulting in pronounced accuracy degradation. This issue is particularly evident on visual detail-sensitive tasks (OCRBench, VizWiz, TextVQA), where IPCV proves more stable. LLM-stage pruning methods generally maintain reasonable accuracy under modest acceleration, but their performance declines noticeably when aiming for higher speedups, indicating a limited ability to sustain accuracy at high compression ratios.
On video understanding tasks, IPCV attains 95.7% of vanilla performance with lower inference latency, outperforming all baselines in this setting. Preserving text-critical visual cues also appears to be important for sustaining temporal-spatial reasoning. Similar to the image setting, ViT-stage pruning baselines suffer clear accuracy drops, while LLM-stage pruning methods, though relatively stable, still lag behind our method. Overall, IPCV reliably preserves multimodal reasoning across image and video benchmarks.
To further examine the generalization of IPCV beyond Qwen2-VL, we conduct additional experiments on InternVL3-38B (Table 3). Rel. Latency is still measured on MMBench-EN for consistency. In this setting, we pair IPCV with FastV as the LLM-side pruner to test its overall effectiveness. The results show that IPCV continues to deliver favorable performance-efficiency trade-offs under this larger architecture: compared with alternative baselines, IPCV achieves higher accuracy at similar or lower runtime.
We evaluate efficiency on MMBench-EN, measuring total GPU inference latency and prefilling latency, together with FLOPs and KV cache usage (Table 4). IPCV achieves a favorable balance of accuracy and efficiency: latency drops to 70.9%, 60.8%, and 51.7% of the vanilla baseline while maintaining stronger accuracy than competing methods.
As shown in Figure 4, ViT-stage pruning achieves larger latency reductions but suffers sharper accuracy degradation, whereas LLM-stage methods plateau around 69% runtime and fail to deliver additional speedup under higher pruning ratios. Prefilling latency decreases proportionally with total latency, and results on VizWiz confirm IPCV's effectiveness across benchmarks. Compared with LLM-stage pruning, ViT-stage methods such as IPCV retain more LLM tokens and consequently incur larger KV caches, yet this overhead remains acceptable for practical deployment. Moreover, FLOPs alone do not reliably predict latency: methods with similar FLOPs can yield different GPU times.
Table 5 examines the compatibility of IPCV with representative LLM-stage token pruning methods. We observe that IPCV combined with FastV, DART, and SparseVLM achieves comparable accuracy, all maintaining over 93% of vanilla performance. IPCV+V²Drop, on the other hand, shows more noticeable fluctuations across benchmarks. As seen in the InternVL3-38B results, V²Drop's performance can be less stable across architectures. While SparseVLM preserves accuracy, it often incurs higher inference latency, limiting practical efficiency. Thus, in practice we primarily consider FastV and DART as suitable LLM-stage token pruning methods to pair with IPCV.
Table 6 presents the ablation analysis of IPCV. Removing either the AS module or the Reintegration step slightly reduces accuracy, and omitting both causes a more noticeable drop. These mechanisms provide such benefits with only a marginal impact on runtime. The key reason is that pruned tokens are reused only in lightweight attention while skipping the more costly feed-forward layers. The results confirm that AS and Reintegration are indeed complementary, enhancing IPCV without sacrificing efficiency.
In our default configuration, the pruning start layer is $l_p = 3$, the Attention Stabilization depth is $\Delta l_{\max} = 7$, and the neighbor size in NGR is $k = 10$. We evaluate IPCV's sensitivity to these hyperparameters on MMBench-EN (see Figure 5). Varying the Attention Stabilization depth $\Delta l_{\max}$ shows that performance remains stable across a broad range (3 to 9). This indicates that a moderate depth suffices for stabilizing attention, with IPCV delivering consistent performance across choices of this parameter. For the pruning start layer $l_p$, later pruning is more accurate but less efficient, while earlier pruning is faster with a slight accuracy drop. We also analyze the sensitivity of the neighbor size $k$ in NGR. A very small $k$ can increase variance, as reconstruction relies on too few neighbors, while a large $k$ may introduce noise from less similar tokens. Empirically, IPCV remains stable over a wide range of $k$, confirming NGR's robustness.
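For reference, these defaults could be gathered in a small configuration object; the class and field names below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class IPCVConfig:
    prune_layer: int = 3        # l_p: ViT block where pruning starts
    as_depth: int = 7           # Delta l_max: number of Attention Stabilization layers
    num_neighbors: int = 10     # k: neighbors used by NGR
    keep_ratio: float = 0.35    # fraction of ViT tokens retained (50%/35%/20% in our settings)
```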
We introduce IPCV, a training-free framework for token compression in MLLM visual encoders. IPCV combines early-stage pruning with two NGR-based modules: Attention Stabilization and Reintegration. This design enables aggressive token reduction in the ViT while preserving the semantic information needed for downstream reasoning. Experiments on diverse image and video benchmarks demonstrate that IPCV achieves superior accuracy-efficiency trade-offs compared to existing training-free methods. IPCV also generalizes well across different model architectures. Looking ahead, IPCV provides a flexible foundation: future work may focus on refining the reconstruction mechanism and further analyzing the Attention Stabilization component to improve semantic preservation and computational efficiency, thereby enabling more powerful multimodal systems.
To fairly evaluate our proposed IPCV framework, we compare it against representative baselines from two perspectives: MLLM language-side pruning and ViT compression.
These methods are developed for MLLM token compression. They prune less informative tokens within the language model, without performing compression at the ViT stage.
FastV. FastV [3] prunes vision tokens in large vision-language models based on their attention scores. Tokens with low scores are removed early to accelerate inference.
SparseVLM. SparseVLM [60] proposes a text-aware visual token sparsification framework for efficient vision-language model inference. It leverages text tokens as raters to assess the significance of visual tokens, pruning redundant ones with a recycling mechanism to minimize information loss.
V²Drop. V²Drop [2] introduces a variation-aware pruning strategy for large vision-language models. By measuring variation across layers, it adaptively discards uninformative tokens rather than relying on fixed importance scores.
DART. DART [49] leverages token similarity to identify and remove duplicated tokens. This approach does not rely on explicit attention scores, making it compatible with FlashAttention and avoiding large GPU memory overhead.
ViT approaches are initially designed for vision-only transformers, targeting tasks like image classification. When applied to MLLMs, they operate directly on the vision encoder, merging or pruning visual tokens before the LLM stage.
ToMe. ToMe [1] accelerates vision transformers by merging similar tokens. Using a bipartite matching algorithm based on attention keys, it gradually combines redundant tokens across layers to shorten the sequence length.
ToFu. ToFu [16] combines token pruning and merging in a unified framework. It adaptively chooses between them according to each layer's characteristics and employs MLERP merging to better preserve feature norms.
Multimodal Large Language Models. Multimodal large language models (MLLMs) extend LLMs beyond text to vision and other modalities, enabling VQA, captioning, and multimodal reasoning [8,25]. They typically couple a ViT pretrained with CLIP or SigLIP [38,56], or Qwen2-VL's dynamic-resolution ViT [45], with a lightweight projector and an LLM. Higher-resolution images or longer temporal windows yield longer visual sequences: Qwen2-VL encodes a 308 × 196 image into 5,220 tokens in the ViT stage and 1,305 after merging [45]; LLaVA maps 336 × 336 to 576 tokens, rising to 2,304 at 672 × 672 [25]; Video-LLaVA (8 frames) produces 2,048 tokens [24]. The resulting abundance of tokens strains the vision encoder and cross-modal layers, increasing compute and latency.
All experiments are conducted on Nvidia GPUs. Qwen2-VL-7B-Instruct is evaluated on image benchmarks using RTX 4090 (48GB), and on video benchmarks using A100-80G. InternVL3-38B is evaluated on image benchmarks using A100-80G. For ToMe and ToFu, we follow the original implementations with a minor modification: tokens are reduced proportionally across layers until the target sparsity is reached. Other baseline settings follow the original papers.
Our evaluation spans a diverse set of benchmarks for both image and video understanding.
GQA. GQA [14] is built upon images, scene graphs, and compositional questions. It provides detailed annotations of entities, attributes, and their relationships, together with diverse queries that require multi-step reasoning, spatial understanding, and logical inference.
MMBench. MMBench [31] comprises over 3,000 multiple-choice questions across 20 fine-grained ability dimensions such as object localization and social reasoning. Each dimension contains more than 125 questions.
MME. MME [9] contains 14 subtasks covering perception (e.g., object existence, count, OCR) and cognition (e.g., commonsense reasoning, calculation, translation). All instruction-answer pairs are manually constructed in a concise yes/no format for straightforward evaluation.
POPE. POPE [20] provides a polling-based evaluation benchmark designed to measure object hallucination in large vision-language models. It reformulates hallucination detection as a binary yes/no probing task about the presence of specific objects in an image.
SEED. SEED [17] introduces a large-scale benchmark with human-annotated multiple-choice questions across 27 dimensions, providing hierarchical evaluation of MLLMs from image-text comprehension to joint text-image generation.
TextVQA. TextVQA [42] is built from natural images in Open Images containing textual elements, paired with human-posed questions requiring reading the embedded text. It is designed to test whether models can integrate OCR outputs with visual reasoning to answer text-centric questions.
VizWiz. VizWiz [11] is a goal-driven VQA dataset built from images taken and spoken questions asked by blind people using a mobile phone application. It is designed to evaluate models in realistic assistive scenarios, where images may be of low quality, questions conversational, and some visual questions inherently unanswerable.
OCRBench. OCRBench [32] evaluates large multimodal models on five OCR-related tasks, including text recognition, text-centric and document VQA, key information extraction, and handwritten expression recognition, aiming to expose their limitations on text-heavy vision tasks.
MVBench. MVBench [19] systematically converts static image tasks into dynamic ones, defining 20 temporal understanding tasks spanning perception to cognition. It provides multiple-choice QAs generated from 11 public video datasets.
EgoSchema. EgoSchema [35] is a benchmark of long egocentric video clips with multiple-choice questions, designed to test very long-form video-language understanding. It introduces temporal certificates to measure intrinsic temporal hardness and exposes limitations in long-term reasoning.
MLVU. MLVU [61] is built from diverse videos lasting minutes to hours across real and simulated domains. It defines nine tasks, such as summarization, action counting, and ordering, to evaluate models on complex long-video reasoning.
Video-MME. Video-MME [10] contains 900 videos from six domains (e.g., knowledge, film & television, and multilingual) with 2,700 multiple-choice questions. The videos range from 11 seconds to 1 hour with subtitles and audio.
This section provides a theoretical analysis of perturbations arising from ViT-stage pruning and Reintegration.
Assumption E.1 (Hausdorff-Lipschitz Continuity). We assume the ViT mapping $F$ from layer $l_p$ to the output of $l_{\text{final}}$ is $L_{\text{ViT}}$-Lipschitz with respect to the Hausdorff distance. Formally, for any two sets $\mathcal{X}, \mathcal{Y} \subset \mathbb{R}^d$ and any indices $i \in \mathrm{Idx}(\mathcal{X})$, $j \in \mathrm{Idx}(\mathcal{Y})$, where $\mathrm{Idx}(\cdot)$ denotes the token index set, we assume index correspondence is preserved across layers. Then, for corresponding indices $i$ and $j$, we have
$$\| F_i(\mathcal{X}) - F_j(\mathcal{Y}) \|_2 \le L_{\text{ViT}} \, d_H(\mathcal{X}, \mathcal{Y}),$$
where $F_i(\cdot)$ denotes the final-layer output at position $i$, and $d_H$ denotes the Hausdorff distance under the Euclidean norm:
$$d_H(\mathcal{X}, \mathcal{Y}) = \max\Bigl\{\, \sup_{x \in \mathcal{X}} \inf_{y \in \mathcal{Y}} \| x - y \|_2, \; \sup_{y \in \mathcal{Y}} \inf_{x \in \mathcal{X}} \| x - y \|_2 \,\Bigr\}.$$
Theorem E.5 (NGR Reconstruction Error). For each removed token $i$, the NGR reconstruction satisfies the bound
$$\| \hat{h}_{i,l_{\text{final}}+1} - h_{i,l_{\text{final}}+1} \|_2 \le 2B\,(L_{\text{ViT}} + 1).$$
Here $\hat{h}_{i,l_{\text{final}}+1}$ denotes the reconstructed hidden state of pruned token $i$ via NGR, and $h_{i,l_{\text{final}}+1}$ the true hidden state without pruning. Consequently, the Hausdorff distance between the uncompressed final-layer set $\mathcal{X}'$ and the reconstructed set $\hat{\mathcal{X}}'$ is bounded by $d_H(\mathcal{X}', \hat{\mathcal{X}}') \le 2B\,(L_{\text{ViT}} + 1)$.
By the triangle inequality, the NGR reconstruction error of each removed token $i$ can be bounded as follows.
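A plausible expansion of this step, assuming $\Delta_r = h_{r,l_{\text{final}}+1} - h_{r,l_p}$ denotes the final-layer update of retained neighbor $r$ and $\Delta_i$ the corresponding unpruned update of token $i$ (notation consistent with the NGR definition above):
$$\begin{aligned}
\| \hat{h}_{i,l_{\text{final}}+1} - h_{i,l_{\text{final}}+1} \|_2
&= \Bigl\| \frac{1}{k} \sum_{r \in \mathcal{N}(i)} \Delta_r - \Delta_i \Bigr\|_2
 = \Bigl\| \frac{1}{k} \sum_{r \in \mathcal{N}(i)} (\Delta_r - \Delta_i) \Bigr\|_2 \\
&\le \frac{1}{k} \sum_{r \in \mathcal{N}(i)} \| \Delta_r - \Delta_i \|_2 .
\end{aligned}$$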
From Remark E.4 we obtain the worst-case bound $\| \Delta_r - \Delta_i \|_2 \le 2B\,(L_{\text{ViT}} + 1)$ for any $r \in \mathcal{N}(i)$. Hence each term in the sum is bounded by $2B(L_{\text{ViT}} + 1)$, so the average yields the desired inequality.