Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs

Xuanpu Zhao 1,2, Zhentao Tan 3, Dianmo Sheng 1,2, Tianxiang Chen 1,2, Yao Liu 3, Yue Wu 3, Tao Gong 1,2 *, Qi Chu 1,2, Nenghai Yu 1,2
1 School of Cyber Science and Technology, University of Science and Technology of China
2 Anhui Province Key Laboratory of Digital Security
3 Independent Researcher
zhaoxuanpu@mail.ustc.edu.cn  zhentaotan5@gmail.com  tgong@ustc.edu.cn

Abstract

To enhance the perception and reasoning capabilities of multimodal large language models (MLLMs) in complex visual scenes, recent research has introduced agent-based workflows. In these works, MLLMs autonomously utilize an image cropping tool to analyze regions of interest for question answering. While existing training strategies, such as those employing supervised fine-tuning and reinforcement learning, have made significant progress, our empirical analysis reveals a key limitation: the model relies strongly on the global input and only weakly on the details within the cropped region. To address this issue, we propose a novel two-stage reinforcement learning framework that does not require trajectory supervision. In the first stage, we introduce an "Information Gap" mechanism by adjusting the granularity of the global image. This mechanism trains the model to answer questions by focusing on cropped key regions, driven by the information gain these regions provide. The second stage further enhances cropping precision by incorporating a grounding loss that uses a small number of bounding box annotations. Experiments show that our method significantly enhances the model's attention to cropped regions, enabling it to achieve state-of-the-art performance on high-resolution visual question-answering benchmarks.
Our method provides a more efficient approach for perceiving and reasoning about fine-grained details in MLLMs. Code is available at: https://github.com/XuanPu-Z/LFPC.

1. Introduction

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in understanding and reasoning about the visual world, achieving impressive performance across a wide range of vision-language tasks [7, 10, 23, 35]. However, their ability to perform fine-grained perception and reasoning in complex visual scenes remains a significant challenge. Especially when the object of reasoning is small or obscured by a complex background [41, 45], the model struggles to accurately locate the target and attend to detailed cues in a single inference.

To address these limitations, a promising direction in recent research has been the introduction of agent-based workflows [14, 39, 51, 54]. In this paradigm, the MLLM is not merely a passive observer but an active agent that can use external tools to dynamically explore the visual input. A particularly effective approach involves empowering the MLLM with a "cropping tool", allowing it to autonomously identify and zoom in on specific regions of interest to gather detailed information necessary for answering complex questions. These agentic MLLMs have achieved substantial progress through two primary training strategies: a hybrid of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) [14, 39, 51], or a pure Reinforcement Learning approach [54].

Despite their success, existing training methodologies have some inherent limitations. The SFT+RL hybrid approach heavily relies on generating vast quantities of reasoning trajectories from a powerful, proprietary teacher MLLM.

* Corresponding author.
This process is not only computationally expensive and time-consuming but also creates a performance ceiling, as the student model's capabilities are inherently upper-bounded by those of its teacher. While pure RL methods circumvent the need for a teacher model, they introduce a subtler yet critical issue that we observed, which is also present to some extent in hybrid approaches. Taking DeepEyes [54], a typical RL-based model, as a sample, we discovered a recurring pattern in its reasoning process: the model often predicts the final answer before executing a crop, and subsequently uses the cropped region merely to confirm its pre-existing conclusion, as shown in Figure 1. We hypothesize that this "answer first, crop later" behavior indicates that the model is often performing a perfunctory tool call for cropping, rather than genuinely leveraging the information within the cropped region to aid its reasoning and question-answering process. To validate this hypothesis, we constructed a dedicated evaluation benchmark, which confirmed that the model's performance is not significantly dependent on the content of the cropped region.

Figure 1. Reasoning examples for RL-based methods: (a) an example with a successful crop; (b) an example with a failed crop. In example (a), the model crops the region correctly but still fails to answer the color of the logo, while in example (b) it confuses the sign with a telephone booth even after zooming in.

To tackle this fundamental challenge of ensuring authentic reliance on tool-based observations, we propose a novel two-stage pure reinforcement learning framework. Our core objective is to compel the model to actively seek out and utilize information from cropped regions. In the first stage, we introduce an "Information Gap" mechanism. In contrast to previous work that directly feeds original high-resolution images into the model, we strategically downsample the initial input images. The degree of downsampling is determined by the model's own uncertainty; we select an appropriately low resolution such that the model yields answers inconsistent with those obtained at higher resolutions. However, when the agent decides to use its cropping tool, the region is extracted from the original, full-resolution image. This creates a crucial "information gap" between the low-detail global view and the high-detail local view, making the information within the crop indispensable for a correct answer and thus enhancing the model's focus on the cropped region. In the second stage, to further refine the agent's behavior, we annotate a small number of bounding boxes and introduce a grounding reward signal.
This encourages the model to not only use the cropping tool but also to place the cropped region at a more precise location. Our experiments demonstrate the effectiveness of our method. It significantly improves the model's attention to and reliance on cropped regions, while mitigating the perfunctory cropping-tool invocation issue observed in previous work. Remarkably, our method achieves state-of-the-art performance under both visual token budgets of 16,384 and 1,024 for the input image. Furthermore, when the visual token budget is limited to 1,024, our method still outperforms other approaches on some benchmarks even when they are allowed 16,384 visual tokens, highlighting both its effective utilization of fine-grained details within cropped regions and its computational efficiency.

2. Related Work

Multi-modal Large Language Models. Multimodal large language models (MLLMs) have progressed from early systems that aligned a vision encoder with a large language model [6, 38] to more tightly integrated architectures enabled by joint training. BLIP-2 [19] and LLaVA [21, 22] pioneered the alignment of visual features with LLMs using lightweight Q-Formers or projectors, while approaches such as LLaVA-OneVision [16] enhanced visual fidelity by accommodating flexible image resolutions. These advancements have fostered a rich ecosystem of powerful open-source models, including the Qwen-VL [1, 2, 40] and InternVL [3-5, 55] series. With the maturation of these models, attention has increasingly turned to their reasoning abilities, particularly through Multimodal Chain-of-Thought (MCoT). Research in MCoT reasoning can be broadly categorized into two main paradigms. One paradigm depends on manually designed workflows to structure reasoning [25, 27], typically targeting problems like localization [8, 43] or knowledge retrieval [17, 36].
The other explores RL-based approaches [29, 32] to adapt text-centric reasoning for multimodal challenges like spatial reasoning [20, 28, 44] and reasoning segmentation [24].

High-Resolution Visual Question Answering. Multimodal large language models perform poorly on the High-Resolution Visual Question Answering [41, 45] task, which requires them to perceive fine-grained details in complex scenes. To address this, one line of work uses attention maps to identify important regions [26, 49]. Another approach represents the image as a hierarchical tree of regions and employs a search algorithm to locate the important regions [18, 26, 39]. Despite demonstrating significant performance improvements, these methods often involve complex pipelines and suffer from low inference efficiency.

Agent-based Methods. Recently, a line of work has emerged to address complex visual reasoning by introducing agentic workflows [48]. These methods can actively invoke a cropping tool to focus on critical regions and facilitate the reasoning process. They can be broadly categorized into two streams. The first stream employs a two-stage training paradigm: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) [14, 39, 51]. Initially, a powerful teacher MLLM (e.g., GPT-4o [11]) generates reasoning trajectories, which include both textual rationales and the coordinates of corresponding cropped regions. These trajectories are then used to supervise the fine-tuning of the agent model, which is subsequently optimized further with RL. The second stream relies solely on RL for training [54], using a dataset that consists only of image-question pairs. However, the SFT+RL paradigm is constrained by the need to generate high-quality trajectories, a process that is both costly and fundamentally capped by the teacher model's capabilities.
As for the pure RL methods, we observe that their impressive performance still heavily relies on the original, full input image, with weak attention paid to the cropped regions.

3. Preliminary Analysis

To quantitatively evaluate how effectively existing models utilize the identified cropped regions, we conducted experiments on three challenging high-resolution visual understanding datasets: HR-Bench 8K, HR-Bench 4K, and V∗. Following the inference process of previous work [14, 54], the input consists of a complete image and a corresponding question. The model can autonomously determine whether to invoke the image cropping tool to assist in answering. We limit the maximum number of visual tokens (obtained from the MLLM's vision encoder) for the complete input image to 16,384. This setting allows the full image to be processed at a high resolution. The cropped regions are extracted from this original high-resolution image. Building upon this, we further analyzed the model's attention to cropped regions using the following two controlled settings:

Figure 2. Testing pipeline of agentic-based MLLMs.

1) Ground Truth Cropped Region: replacing the bounding boxes predicted by the MLLM with ground-truth bounding boxes creates a "perfect" high-resolution cropped region.
2) Random Noise Cropped Region: replacing the high-resolution cropped region with random noise containing no useful information.

In both replacements, if the model truly pays attention to the provided cropped region, its performance should significantly improve when using the ground truth; conversely, its performance should drastically decrease when the cropped region is replaced with random noise.
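The two controlled settings can be sketched as follows. The function name, the array-based image representation, and the fixed random seed are illustrative choices of this sketch, not the paper's actual evaluation code.

```python
import numpy as np

def crop_with_intervention(image, pred_bbox, gt_bbox=None,
                           mode="prediction", seed=0):
    """Return the region fed back to the MLLM under one controlled setting.

    image:     H x W x C uint8 array (the original full-resolution image).
    bboxes:    (x1, y1, x2, y2) in pixel coordinates.
    mode:      "prediction" (model's own crop), "ground_truth", or "random_noise".
    """
    box = gt_bbox if (mode == "ground_truth" and gt_bbox is not None) else pred_bbox
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2].copy()
    if mode == "random_noise":
        # Same shape as the predicted crop, but carrying no useful information.
        rng = np.random.default_rng(seed)
        crop = rng.integers(0, 256, size=crop.shape, dtype=np.uint8)
    return crop
```

If attention to the crop were genuine, the "ground_truth" setting should help and the "random_noise" setting should hurt; the analysis below measures exactly this gap.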
As shown in Table 1a, when we replace the MLLM's predicted bounding box with the ground truth, or substitute the cropped region with random noise, the performance change is marginal (particularly for DeepEyes). This suggests that when the full input image retains rich information, the model tends to rely on the full image to answer questions, rather than on the fine-grained details within the cropped region. We attribute this to the following: during training, the resolution of the full image and the source image for cropping are kept consistent. Consequently, the cropped region, compared to the full image, merely removes irrelevant context without providing any additional information. As a result, the MLLM fails to effectively learn to utilize the cropped region for answering questions during training, leading it to continue relying on the full image.

Table 1. Evaluation results with different cropped region settings. Columns are HR-Bench 8K (FSP / FCP / Overall), HR-Bench 4K (FSP / FCP / Overall), and V∗ (Attr / Spatial / Overall).

(a) Maximum visual tokens limited at 16,384.

| Model & Setting | FSP | FCP | Overall | FSP | FCP | Overall | Attr | Spatial | Overall |
|---|---|---|---|---|---|---|---|---|---|
| DeepEyes prediction | 83.0 | 53.0 | 68.0 | 92.0 | 56.0 | 74.0 | 84.3 | 88.2 | 85.9 |
| DeepEyes Random Noise | 83.0 | 54.0 | 68.5 | 94.0 | 57.0 | 75.5 | 83.5 | 88.2 | 85.4 |
| DeepEyes Ground Truth | 86.0 | 53.0 | 69.5 | 93.0 | 55.0 | 74.0 | 86.1 | 89.5 | 87.5 |
| CoF-sft prediction | 87.0 | 53.0 | 70.0 | 94.0 | 52.0 | 73.0 | 90.4 | 85.5 | 88.5 |
| CoF-sft Random Noise | 85.0 | 52.0 | 68.5 | 90.0 | 53.0 | 71.5 | 80.9 | 84.2 | 82.2 |
| CoF-sft Ground Truth | 89.0 | 51.0 | 70.0 | 96.0 | 54.0 | 75.0 | 91.3 | 85.5 | 89.0 |

(b) Maximum visual tokens limited at 1,024.

| Model & Setting | FSP | FCP | Overall | FSP | FCP | Overall | Attr | Spatial | Overall |
|---|---|---|---|---|---|---|---|---|---|
| DeepEyes prediction | 68.0 | 59.0 | 63.5 | 78.0 | 54.0 | 66.0 | 75.7 | 76.3 | 75.9 |
| DeepEyes Random Noise | 62.0 | 57.0 | 59.5 | 73.0 | 54.0 | 63.5 | 73.0 | 77.6 | 74.8 |
| DeepEyes Ground Truth | 70.0 | 59.0 | 64.5 | 79.0 | 54.0 | 66.5 | 77.4 | 77.6 | 77.5 |
| CoF-sft prediction | 68.0 | 53.0 | 60.5 | 78.0 | 54.0 | 66.0 | 73.0 | 72.4 | 72.8 |
| CoF-sft Random Noise | 55.0 | 55.0 | 55.0 | 65.0 | 55.0 | 60.0 | 65.2 | 72.4 | 68.1 |
| CoF-sft Ground Truth | 77.0 | 54.0 | 65.5 | 85.0 | 57.0 | 71.0 | 87.0 | 78.9 | 83.8 |

To further investigate whether introducing a resolution disparity at test time could encourage the model to focus on the cropped region, we downsample the full input image to a maximum of 1,024 tokens, while the cropped regions are still extracted from the original, high-resolution image. This significant downsampling results in the loss of some fine details crucial for answering questions in these datasets. However, it preserves enough coarse information to allow the model to still locate the approximate region of interest. Under this setting, we re-apply the two aforementioned replacement strategies. The results, presented in Table 1b, show that although the performance change is more pronounced than in Table 1a, it still remains insignificant. This indicates that even when the information in the full image is insufficient, the model still fails to effectively attend to the cropped region.

We argue that the reason for this phenomenon is intuitive. In previous methods, the original image fed to the model and the image used to obtain the cropped region shared the same resolution during training. Consequently, the introduction of the cropped region provided no additional information. As a result, the model failed to effectively learn how to leverage the cropped region to aid in reasoning.
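For concreteness, capping an image to a visual token budget (e.g., 1,024 or 16,384 tokens) can be sketched as below. The one-token-per-28x28-patch assumption is illustrative; actual MLLM vision encoders may merge or pad patches differently.

```python
import math

def fit_to_token_budget(width, height, max_tokens=1024, patch=28):
    """Downscale (width, height) until the image fits the visual token budget,
    assuming one token per patch x patch cell (an illustrative simplification)."""
    w, h = width, height
    while math.ceil(w / patch) * math.ceil(h / patch) > max_tokens:
        n_tokens = math.ceil(w / patch) * math.ceil(h / patch)
        scale = math.sqrt(max_tokens / n_tokens)
        # int() truncation guarantees strict shrinkage, so the loop terminates.
        w, h = max(patch, int(w * scale)), max(patch, int(h * scale))
    return w, h
```

Under this scheme an 8K-resolution image shrinks drastically to meet a 1,024-token budget, while the crop is still taken from the untouched original, which is precisely the disparity studied here.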
Inspired by the above analysis, we introduce an information gap mechanism during the first stage of training to explicitly encourage the model to better focus on and utilize the cropped regions. Building upon this foundational ability, our second training stage is dedicated to enhancing the model's cropping precision to achieve further performance gains.

4. Method

Consistent with prior work, we expect the model to predict the coordinates of a critical region, which is then cropped and leveraged to answer the given question. The action at step $t$ can be formulated as:

$$a_t \sim \pi_\theta\big(a \mid I_0,\ q,\ [r_1, I_1^{\mathrm{crop}}], \ldots, [r_m, I_m^{\mathrm{crop}}]\big), \qquad (1)$$

where $I_0$ is the full input image, $r_i$ is the text response, and $I_i^{\mathrm{crop}}$ is the cropped image. Our model is trained via a two-stage reinforcement learning process. In the first stage, we enhance the model's attention to the cropped region by introducing an information gap mechanism. In the second stage, we improve the model's cropping precision by annotating a small number of bounding boxes and incorporating a grounding reward.

4.1. Learn to Focus

The primary objective of our first training stage is to cultivate the model's ability to focus its attention on a specific cropped region for detailed analysis. We hypothesize that the information contained within both the complete input image, which we denote as $I_0$, and the cropped region, $I_i^{\mathrm{crop}}$, is crucial for this learning process. The complete image provides essential context, while the cropped region contains the fine-grained details necessary to answer the question.

To effectively train the model's focus, the cropped region must retain maximum detail, ensuring that the correct answer can be derived from it, thereby highlighting its importance. However, a critical challenge arises from the information balance in the complete image. If the complete image is excessively rich in detail (i.e., high-resolution), the model may develop a "shortcut" learning behavior: it tends to directly extract the answer from the comprehensive global view rather than the crop (an issue that DeepEyes suffers from, as revealed in our preliminary analysis). Conversely, if the complete image contains insufficient information, the model cannot identify the region relevant to the question, which renders the problem unsolvable.

Figure 3. Framework of the proposed two-stage training method.

To strike this delicate balance, we introduce an Offline Resolution Selection process to prepare the optimal complete image for training. In this procedure, we start with an original high-resolution image from the training set. We then iteratively downsample it, creating a sequence of progressively lower-resolution versions. At each step, we use the to-be-trained MLLM to answer the question based on the original image and the current downsampled version. The process terminates as soon as the answer derived from the downsampled image diverges from the answer obtained from the original image. The image at the resolution that causes this divergence is selected as the optimal complete image for our training stage. This ensures it is just informative enough to provide context but not so detailed as to allow the model to bypass the high-resolution crop. Through this selection strategy, we formulate the training data for this stage.
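The Offline Resolution Selection loop can be sketched as follows. Here `answer_fn` stands in for a forward pass of the to-be-trained MLLM, and the stride-2 subsampling is a crude stand-in for proper image resizing; both are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def select_offline_resolution(answer_fn, image, question, min_side=64):
    """Iteratively downsample `image` (an H x W [x C] array) until the model's
    answer diverges from its answer on the original image; return that version."""
    reference = answer_fn(image, question)     # answer on the full-resolution image
    current = image
    while min(current.shape[:2]) // 2 >= min_side:
        current = current[::2, ::2]            # halve each spatial dimension
        if answer_fn(current, question) != reference:
            return current                     # first resolution causing divergence
    return current                             # fallback: lowest resolution tried
```

The returned version is exactly the one that is "just informative enough": one downsampling step coarser than the last resolution at which the model still matched its full-resolution answer.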
Each data instance is a tuple $(I_0, I_{\mathrm{ori}}, q, \mathit{answer})$, where $I_0$ is the selected optimal-resolution image, $I_{\mathrm{ori}}$ is the original high-resolution image used for extracting the detailed cropped region $I_i^{\mathrm{crop}}$, $q$ is the question, and $\mathit{answer}$ is the ground-truth response.

In this stage, we formulate the total reward $r$ for an output trajectory as a sum of three distinct components:

$$r = r_{\mathrm{acc}} + r_{\mathrm{format}} + \mathbb{1}[r_{\mathrm{acc}} > 0] \cdot r_{\mathrm{tool}}. \qquad (2)$$

The accuracy reward is determined by whether the model response matches the ground-truth response, whereas the formatting reward penalizes outputs with deficient structures. Furthermore, a tool usage bonus is granted exclusively when the model not only provides a correct solution but also leverages at least one external perception tool during its process. We then compute the advantages $A$ by normalizing the rewards and update the policy by adopting the GRPO [9, 31] optimization objective over mini-batches. Formally, the optimization objective is given by:

$$\mathcal{J} = \mathbb{E}_{q \sim \mathcal{D},\ \{o_i\}_{i=1}^{N} \sim \pi_{\theta'}(\cdot \mid q)}\left[ \frac{1}{N} \sum_{i=1}^{N} \min\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta'}(o_i \mid q)} A_i,\ \mathrm{clip}\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta'}(o_i \mid q)},\ 1-\epsilon,\ 1+\epsilon \right) A_i \right) \right], \qquad (3)$$

where $A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \ldots, r_N\})}{\mathrm{std}(\{r_1, r_2, \ldots, r_N\})}$.

4.2. Learn to Crop Precisely

While the initial training stage enables the model to locate question-relevant regions, we observe a significant challenge stemming from the inherent information discrepancy in our approach. Specifically, the MLLM receives a down-sampled, low-resolution version of the full image, whereas the cropped patch retains high-resolution details. This asymmetry incentivizes the model, guided by the accuracy reward ($r_{\mathrm{acc}}$), to progressively enlarge the predicted bounding box ($B_p$). A larger crop naturally contains more detailed information, making it easier for the MLLM to answer the question correctly and maximize its reward.
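A minimal sketch of the stage-one reward in Eq. (2) and the group-normalized advantages used by GRPO follows. The reward weights and the treatment of the format term as a bonus rather than a penalty are assumptions of this sketch, since the paper does not specify them.

```python
import statistics

def stage1_reward(correct, well_formatted, used_tool,
                  w_acc=1.0, w_fmt=0.5, w_tool=0.5):
    """r = r_acc + r_format + 1[r_acc > 0] * r_tool (weights are illustrative)."""
    r_acc = w_acc if correct else 0.0
    r_fmt = w_fmt if well_formatted else 0.0
    r_tool = w_tool if (used_tool and r_acc > 0) else 0.0  # bonus gated on correctness
    return r_acc + r_fmt + r_tool

def grpo_advantages(rewards):
    """A_i = (r_i - mean) / std over a group of N rollouts for the same query."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]
```

Note the gating: a trajectory that uses the tool but answers incorrectly earns no tool bonus, which is what discourages perfunctory tool calls.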
However, this behavior leads to two primary issues. First, the resulting oversized crops often include substantial redundant information, imposing unnecessary computational overhead on the MLLM and reducing overall efficiency. Second, this extraneous content can act as noise, potentially distracting the MLLM and causing interference during the reasoning process. To mitigate this, we introduce a second training stage focused on refining the agent's cropping policy with direct grounding supervision.

Hierarchical Bbox Annotation. We introduce an additional reward signal based on a small set of human-annotated bounding boxes. Inspired by the observation that visual regions of interest often possess a compositional and hierarchical structure [12, 30, 50], we annotate a series of nested ground-truth (GT) bboxes for each question-image pair, denoted as $B_1, B_2, \ldots, B_n$. As illustrated in Figure 3 (right), these boxes range from the minimal necessary region required to answer the question ($B_1$, e.g., the text "Harbinger School") to a larger, more contextually relevant area ($B_2$, e.g., the entire sign). This hierarchical annotation provides a more flexible and robust target for our model.

Grounding Reward Formulation. To guide the agent towards generating a precise yet sufficient crop, we design a composite grounding reward $r_{\mathrm{geo}}$, which consists of an IoU reward and an $L_1$ distance reward. The IoU reward ($r_{\mathrm{IoU}}$) encourages the predicted box $B_p$ to have high overlap with one of the GT boxes. However, naively maximizing the Intersection over Union (IoU) could encourage the model to predict tighter bounding boxes, which may not completely cover the smallest GT box (e.g., $B_1$), thus causing the omission of crucial details needed to answer the question. To prevent this, we introduce a conditional reward. We first calculate the coverage ratio of $B_1$ by $B_p$, which we term $\mathit{overlap}$.
The IoU reward is only granted if this ratio exceeds a threshold $\tau$ (set to 0.9 in our experiments). The reward is formulated as:

$$r_{\mathrm{IoU}} = \begin{cases} \max_{i \in \{1,\ldots,n\}} \mathrm{IoU}_i & \text{if } \mathit{overlap} > \tau \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

where $\mathrm{IoU}_i = \frac{\mathrm{Area}(B_p \cap B_i)}{\mathrm{Area}(B_p \cup B_i)}$ and $\mathit{overlap} = \frac{\mathrm{Area}(B_p \cap B_1)}{\mathrm{Area}(B_1)}$.

The condition on $r_{\mathrm{IoU}}$ can lead to sparse rewards, especially early in training when the agent struggles to satisfy the overlap condition. To provide a denser and more consistent training signal, we incorporate an $L_1$ Distance Reward ($r_{l_1}$). This reward measures the normalized $L_1$ distance between the corners of the predicted box and the closest GT box, ensuring the agent receives a corrective gradient even when the IoU reward is zero. The $L_1$ reward is defined as:

$$r_{l_1} = 1 - \min_{i \in \{1,\ldots,n\}} d_{L_1}(B_p, B_i) \qquad (5)$$

Finally, the grounding reward is a combination of the IoU reward and the $L_1$ reward:

$$r_{\mathrm{geo}} = \omega \cdot r_{\mathrm{IoU}} + (1 - \omega) \cdot r_{l_1}. \qquad (6)$$

Figure 4. Training data distribution in Stage-I (Pixel-Reasoner 2.7k, CoF 2.1k, and a small set from ThinkLite-VL).

5. Experiment

5.1. Setups

Training Details. We train Qwen2.5-VL-7B-Instruct on 8 A100 GPUs for 80 steps with GRPO. Each step contains 256 samples and 16 rollouts per sample. The maximum response length is set to 2048. The learning rate is 1 × 10^-6, and neither KL regularization nor an entropy bonus is applied.

Training Datasets. In the first stage, we address the potential information loss caused by down-sampling high-resolution images. To this end, we train the model's visual perception capabilities using high-resolution VQA data from Pixel-Reasoner and CoF, and enhance its analytical abilities with a small set of samples from ThinkLite-VL. The distribution of the datasets is detailed in Figure 4. Notably, while Pixel-Reasoner and CoF also provide reasoning trajectories generated by MLLMs for supervised fine-tuning, our training methodology does not utilize them.
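The grounding reward of Eqs. (4)-(6) can be sketched as follows. The per-corner normalization by image width and height in the $L_1$ term is an assumption of this sketch, since the paper only states that the distance is normalized.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def grounding_reward(pred, gt_boxes, tau=0.9, omega=0.5, img_w=1000, img_h=1000):
    """gt_boxes = [B1, ..., Bn], nested, with B1 the minimal necessary region."""
    b1 = gt_boxes[0]
    ix = max(0, min(pred[2], b1[2]) - max(pred[0], b1[0]))
    iy = max(0, min(pred[3], b1[3]) - max(pred[1], b1[1]))
    overlap = (ix * iy) / ((b1[2] - b1[0]) * (b1[3] - b1[1]))  # coverage of B1, Eq. (4)
    r_iou = max(box_iou(pred, b) for b in gt_boxes) if overlap > tau else 0.0
    # Normalized corner-wise L1 distance to the closest GT box, Eq. (5).
    def l1(b):
        norms = (img_w, img_h, img_w, img_h)
        return sum(abs(p - g) / n for p, g, n in zip(pred, b, norms)) / 4
    r_l1 = 1.0 - min(l1(b) for b in gt_boxes)
    return omega * r_iou + (1 - omega) * r_l1  # Eq. (6)
```

The gating term matters: a box that is tight but misses part of $B_1$ forfeits the entire IoU reward, so the model is pushed toward crops that fully cover the minimal region while the $L_1$ term still provides a dense corrective signal.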
We only require the images and their corresponding question-answer (QA) pairs. We observed that after the first stage of training, the model could already crop regions that generally covered the question-relevant areas. To refine this capability and further improve the cropping precision, we curated a dataset for the second stage by sampling 256 instances from the VisualProbe datasets from Mini-o3 [14] and manually annotating them with precise bounding boxes (BBoxes).

Benchmarks and Compared Methods. Our test benchmark is built upon HR-Bench 8K, HR-Bench 4K, and V∗. To evaluate a model's ability to answer questions based on cropped regions, we limit the maximum number of visual tokens to 1,024 and 16,384, respectively. However, the cropped regions themselves are extracted from the original, uncompressed images. The models we compare against include CoF (only the SFT-stage model is publicly available), Pixel Reasoner, and Mini-o3, which are trained via a two-stage process of SFT+RL, and DeepEyes, which is trained purely with RL. For the evaluation of Mini-o3, we disabled sampling and employed a greedy decoding strategy to maintain consistency with the testing setups of the other models. Furthermore, we find that Mini-o3 exhibits a propensity for a high number of calls to the cropping tool. To address this, we imposed an upper limit of 12 dialogue turns. If the model reached this maximum turn count without producing a final answer, we classified the attempt as incorrect.

Table 2. Comparison with other models. Our method achieves SOTA performance on the HR-Bench 8K, HR-Bench 4K, and V∗ benchmarks. Columns are HR-Bench 8K (FSP / FCP / Overall), HR-Bench 4K (FSP / FCP / Overall), and V∗ (Attr / Spatial / Overall).

(a) Maximum visual tokens limited at 1,024.

| Model | Size | Trajectory Free | FSP | FCP | Overall | FSP | FCP | Overall | Attr | Spatial | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CoF-sft [51] | 7B | × | 66.0 | 51.7 | 58.9 | 75.5 | 56.5 | 66.0 | 73.0 | 72.4 | 72.8 |
| Pixel Reasoner [39] | 7B | × | 62.8 | 57.0 | 59.9 | 69.8 | 61.8 | 65.8 | 73.1 | 75.0 | 73.9 |
| Mini-o3 [14] | 7B | × | 77.5 | 54.5 | 66.0 | 82.0 | 59.3 | 70.7 | 78.3 | 82.9 | 80.1 |
| DeepEyes [54] | 7B | ✓ | 68.8 | 55.2 | 62.0 | 73.0 | 54.7 | 63.9 | 75.7 | 76.3 | 75.9 |
| Ours | 7B | ✓ | 83.3 | 60.8 | 72.1 | 86.8 | 63.7 | 75.3 | 82.6 | 77.6 | 80.6 |

(b) Maximum visual tokens limited at 16,384.

| Model | Size | Trajectory Free | FSP | FCP | Overall | FSP | FCP | Overall | Attr | Spatial | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CoF-sft [51] | 7B | × | 87.7 | 51.0 | 69.4 | 92.2 | 53.7 | 73.0 | 90.4 | 85.5 | 88.5 |
| Pixel Reasoner [39] | 7B | × | 78.3 | 55.5 | 66.9 | 85.8 | 61.5 | 73.6 | 82.6 | 86.8 | 84.3 |
| Mini-o3 [14] | 7B | × | 82.0 | 49.2 | 65.6 | 84.0 | 54.0 | 69.0 | 87.8 | 88.2 | 88.0 |
| Mini-o3 † [14] | 7B | × | - | - | 73.3 | - | - | 77.5 | - | - | 88.2 |
| DeepEyes [54] | 7B | ✓ | 84.3 | 54.7 | 69.5 | 90.5 | 55.2 | 72.9 | 84.3 | 88.2 | 85.9 |
| DeepEyes ‡ [54] | 7B | ✓ | 86.8 | 58.5 | 72.6 | 91.3 | 59.0 | 75.1 | 91.3 | 88.2 | 90.1 |
| Ours | 7B | ✓ | 87.7 | 63.0 | 75.4 | 92.7 | 60.0 | 76.4 | 91.3 | 86.8 | 89.5 |

† Result from the Mini-o3 paper, obtained by setting the temperature to 1.0 and averaging results over 32 runs per question.
‡ Result from the DeepEyes paper, where the order of options on the V∗ dataset was not shuffled, unlike in our evaluation.

5.2. Results

As shown in Table 2, our method achieves state-of-the-art results on the evaluated benchmarks when the number of visual tokens is limited to both 1,024 and 16,384. Notably, under the 1,024-token constraint, our method significantly outperforms other approaches, and this performance advantage becomes more pronounced as the input image resolution increases. This indicates that our method can effectively utilize information from the cropped regions to aid inference and demonstrates a significant performance advantage in high-resolution scenarios.

5.3. Ablation Study

We conduct a comprehensive ablation study to validate the effectiveness of each component. Unless otherwise specified, all ablation studies in this section are conducted with the maximum number of visual tokens set to 1,024.
Two-Stage Training. We first validate the effectiveness of the two-stage training strategy. As summarized in Table 3, compared to training that follows earlier work [14, 54] (denoted as the baseline), the information gap we introduce in the first stage effectively helps the model quickly learn to utilize the detailed information in the cropped image, thereby improving the accuracy of the responses. On the HR-Bench 8K, HR-Bench 4K, and V∗ test sets, the average accuracy improved by 8.9%, 6.5%, and 5.7%, respectively. To further demonstrate the model's attention to the cropped image, we summarize the changes in response accuracy after replacing the cropped image in Table 5, following the testing method in Section 3. Clearly, the model trained in the first stage pays more attention to information from the cropped image. In the second stage, we focus on training the model's ability to accurately locate and crop local details. Table 3 also shows that this stage further improves the accuracy of the model's responses and achieves a significant improvement (around 5%-9%) in the IoU metric for cropped images.

Effect of Data Selection. As shown in Table 4, an intuitive approach for the second stage of training is to select a portion of the training data from the first stage, label it, and then train and optimize the model. However, we find that the model trained in the first stage can already correctly answer almost all the questions in that training data, so direct training brings no additional gains; in fact, performance even slightly decreased on the three benchmarks. Therefore, we chose the more challenging VisualProbe datasets [14] for the second stage of training.
| Model | HR-8K: FSP Acc/IoU | FCP Acc/IoU | Acc | HR-4K: FSP Acc/IoU | FCP Acc/IoU | Acc | V∗: Attr Acc/IoU | Spatial Acc/IoU | Acc |
| Baseline | 65.2/28.5 | 56.8/40.1 | 61.0 | 75.0/30.3 | 60.8/45.6 | 67.9 | 72.2/39.1 | 72.4/30.3 | 72.3 |
| Stage-I | 77.7/29.8 | 62.0/39.3 | 69.9 | 85.3/32.3 | 63.5/46.5 | 74.4 | 80.0/25.7 | 75.0/21.0 | 78.0 |
| Stage-II | 83.3/34.5 | 60.8/44.4 | 72.1 | 86.8/37.8 | 63.7/51.0 | 75.3 | 82.6/37.4 | 77.6/30.8 | 80.6 |

Table 3. Effectiveness of the two-stage training strategy. The best result is indicated in bold.

| Model | HR-Bench 8K (FSP / FCP / Overall) | HR-Bench 4K (FSP / FCP / Overall) | V∗ (Attr / Spatial / Overall) |
| Stage-I | 77.7 / 62.0 / 69.9 | 85.3 / 63.5 / 74.4 | 80.0 / 75.0 / 78.0 |
| + Original Data | 76.7 / 59.5 / 68.1 | 82.5 / 62.0 / 72.3 | 79.1 / 75.0 / 77.5 |
| + VP Data | 81.8 / 63.5 / 72.7 | 83.8 / 62.0 / 72.9 | 77.4 / 75.0 / 76.4 |
| + VP Data & L1 Reward | 83.3 / 60.8 / 72.1 | 86.8 / 63.7 / 75.3 | 82.6 / 77.6 / 80.6 |

Table 4. Effectiveness of different designs in the second training stage. Original Data means training the model with data from the same source used in the first stage; VP Data denotes the Visual Probe datasets. The VP Data & L1 Reward setting is denoted as Stage-II.

| Model | HR-Bench 8K | HR-Bench 4K | V∗ |
| Baseline | 4.5 | 0.5 | 3.2 |
| Stage-I | 18.5 | 10.5 | 11.0 |

Table 5. Accuracy delta when replacing the cropped images with ground truth or random noise on the three benchmarks.

With the help of the IoU reward, the model improves its accuracy by an average of 4.6% on the most challenging benchmark, HR-Bench 8K.

Effect of L1 Reward. While Visual Probe data helps the model improve response accuracy under high-resolution conditions, these more difficult samples increase the sparsity of the IoU reward, reducing the model's learning efficiency. We therefore additionally add an L1 reward on the distance between box centers to help the model learn to localize more accurately on difficult samples.
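A hedged sketch of how the IoU reward and the L1 center-distance reward described above could be combined. The paper's exact weighting, gating threshold, and normalization are not reproduced here; the constants below are our own assumptions for illustration.

```python
# Sketch (our assumptions): IoU reward plus a dense L1 term on the normalized
# center distance, which stays informative even when the boxes do not overlap.

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0

def grounding_reward(pred, gt, img_w, img_h):
    """IoU term plus an L1 center-distance term, each in [0, 1]."""
    pcx, pcy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gcx, gcy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    l1 = abs(pcx - gcx) / img_w + abs(pcy - gcy) / img_h
    return iou(pred, gt) + (1.0 - l1 / 2.0)
```

The design intent is that the L1 term provides a gradient signal on hard samples where predicted and ground-truth boxes rarely overlap, exactly the sparsity problem noted above.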
The results in Table 4 show that introducing this reward mechanism effectively improves response accuracy on all three benchmarks, significantly exceeding the performance of the Stage-I model.

6. Efficiency Analysis

We compare the accuracy (Acc) and inference time per question (Time) of our method against others. For this comparison, our method processes full images with a maximum of 1,024 visual tokens, whereas the others utilize up to 16,384 tokens.

| Model | HR-Bench 8K Acc | Time (s) ↓ | HR-Bench 4K Acc | Time (s) ↓ |
| CoF-sft | 69.4 | 6.0 | 73.0 | 5.5 |
| Pixel Reasoner | 66.9 | 5.0 | 73.6 | 8.4 |
| Mini-o3 | 65.6 | 27.8 | 69.0 | 21.4 |
| DeepEyes | 69.5 | 12.4 | 72.9 | 9.3 |
| Ours (1,024) | 72.1 | 2.8 | 75.3 | 2.6 |

Table 6. Our method processes complete images with a maximum of 1,024 visual tokens, whereas the others utilize up to 16,384.

The results in Table 6 indicate that our method surpasses existing approaches even when using lower-resolution complete images (implied by the fewer tokens). Owing to the significantly smaller number of visual tokens, our method also shows a substantial efficiency advantage, with a much lower time cost per question. This indicates that our approach enables the model to utilize visual tokens more efficiently, which we attribute to its ability to focus on accurately cropped regions of interest.

7. Conclusion

We analyze the limitations of existing agent-based MLLMs, which overemphasize the original input image while neglecting the cropped region. To guide the model to utilize the crop, we propose a two-stage RL training strategy. In the first stage, we construct an information gap mechanism by compressing the input image while maintaining the resolution of the cropped region, forcing the model to attend to details in the cropped image. In the second stage, we select a smaller but more challenging dataset and further optimize the model's ability to locate regions.
References

[1] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 1(2):3, 2023.
[2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025.
[3] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
[4] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101, 2024.
[5] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
[6] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
[7] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148–166. Springer, 2024.
[8] Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, and Cha Zhang. Refocus: Visual editing as a chain of thought for structured image understanding. arXiv preprint arXiv:2501.05452, 2025.
[9] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[10] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
[11] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[12] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
[13] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.
[14] Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search. arXiv preprint arXiv:2509.07969, 2025.
[15] Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Proxyclip: Proxy attention improves clip for open-vocabulary segmentation. In European Conference on Computer Vision, pages 70–88. Springer, 2024.
[16] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
[17] Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, and Furu Wei. Imagine while reasoning in space: Multimodal visualization-of-thought. arXiv preprint arXiv:2501.07542, 2025.
[18] Geng Li, Jinglin Xu, Yunzhen Zhao, and Yuxin Peng. Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9098–9108, 2025.
[19] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
[20] Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, and Wenbing Huang. Star-r1: Spatial transformation reasoning by reinforcing multimodal llms. arXiv preprint arXiv:2505.15804, 2025.
[21] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
[22] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
[23] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024.
[24] Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520, 2025.
[25] Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, and Jiwen Lu. Chain-of-spot: Interactive reasoning improves large vision-language models. arXiv preprint arXiv:2403.12966, 2024.
[26] Shunqi Mao, Chaoyi Zhang, and Weidong Cai. Through the magnifying glass: Adaptive perception magnification for hallucination-free vlm decoding. arXiv preprint arXiv:2503.10183, 2025.
[27] Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, and Godawari Sudhakar Rao. Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 18798–18806, 2024.
[28] Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint arXiv:2504.01805, 2025.
[29] Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. arXiv preprint arXiv:2503.07536, 2025.
[30] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
[31] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[32] Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model, 2025. URL https://arxiv.org/abs/2504.07615, 3(6):11, 2025.
[33] Dianmo Sheng, Dongdong Chen, Zhentao Tan, Qiankun Liu, Qi Chu, Jianmin Bao, Tao Gong, Bin Liu, Shengwei Xu, and Nenghai Yu. Towards more unified in-context visual understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13362–13372, 2024.
[34] Dianmo Sheng, Dongdong Chen, Zhentao Tan, Qiankun Liu, Qi Chu, Tao Gong, Bin Liu, Jing Han, Wenbin Tu, Shengwei Xu, et al. Unicl-sam: Uncertainty-driven in-context segmentation with part prototype discovery. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 20201–20211, 2025.
[35] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
[36] Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Tong Geng, Ying Nian Wu, Yongfeng Zhang, and Dongfang Liu. Visual agents as fast and slow thinkers. arXiv preprint arXiv:2408.08862, 2024.
[37] Hao Tang, Chenwei Xie, Haiyang Wang, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, and Liwei Wang. Ufo: A unified approach to fine-grained visual perception via open-ended language interface. arXiv preprint arXiv:2503.01342, 2025.
[38] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[39] Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025.
[40] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
[41] Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7907–7915, 2025.
[42] Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, and Tiejun Huang. Seggpt: Segmenting everything in context. arXiv preprint, 2023.
[43] Yana Wei, Liang Zhao, Kangheng Lin, En Yu, Yuang Peng, Runpei Dong, Jianjian Sun, Haoran Wei, Zheng Ge, Xiangyu Zhang, et al. Perception in reflection. arXiv preprint arXiv:2504.07165, 2025.
[44] Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv preprint arXiv:2506.09965, 2025.
[45] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024.
[46] Weijia Wu, Yuzhong Zhao, Hao Chen, Yuchao Gu, Rui Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, and Chunhua Shen. Datasetdm: Synthesizing data with perception annotations using diffusion models. Advances in Neural Information Processing Systems, 36:54683–54695, 2023.
[47] Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2945–2954, 2023.
[48] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022.
[49] Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. Mllms know where to look: Training-free perception of small visual details with multimodal llms. arXiv preprint arXiv:2502.17422, 2025.
[50] Renrui Zhang, Zhengkai Jiang, Ziyu Guo, Shilin Yan, Junting Pan, Hao Dong, Peng Gao, and Hongsheng Li. Personalize segment anything model with one shot. arXiv preprint arXiv:2305.03048, 2023.
[51] Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl. arXiv e-prints, pages arXiv–2505, 2025.
[52] Hanqing Zhao, Dianmo Sheng, Jianmin Bao, Dongdong Chen, Dong Chen, Fang Wen, Lu Yuan, Ce Liu, Wenbo Zhou, Qi Chu, et al. X-paste: Revisiting scalable copy-paste for instance segmentation using clip and stablediffusion. In International Conference on Machine Learning, pages 42098–42109. PMLR, 2023.
[53] Xuanpu Zhao, Dianmo Sheng, Zhentao Tan, Zhiwei Zhao, Tao Gong, Qi Chu, Bin Liu, and Nenghai Yu. Training-free open-vocabulary semantic segmentation via diverse prototype construction and sub-region matching. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10474–10482, 2025.
[54] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint, 2025.
[55] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.

Supplementary Material

8. Training Dynamics Analysis

In this section, we visualize and analyze the evolution of several key metrics during the first and second stages of training.

8.1. Stage-I

To analyze the model's behavior during the first stage of training, we present in Figure 5 the evolution of four key metrics for both the Baseline and our Stage-I model with the information gap. These metrics are: the total reward (Reward), the number of tool calls (Tool Call Num), the Intersection over Union (IoU) of predicted crop regions with the ground truth, and the overlap ratio (Overlap), which measures the proportion of the ground-truth region covered by the model's predicted crops.

Reward. As shown in the 'Reward' plots, the Baseline model (Figure 5(a)) exhibits a faster initial increase in reward. We attribute this to the uncompressed images in its training set, which present a simpler learning task. Critically, both models eventually converge to a similar reward level. This demonstrates that the introduction of the information gap compels the Stage-I model to achieve a comparable performance level on more challenging (compressed) data, demonstrating its effective utilization of the crop.

Tool Calls. The 'Tool Call Num' for both models increases rapidly and then stabilizes.
Notably, the Stage-I model (Figure 5(b)) converges to a higher number of tool calls. We hypothesize this is because the compressed input image necessitates more exploratory crops to gather sufficient information to answer the question. This behavior confirms that the Stage-I model learns to actively use the tool to attend to relevant image regions.

IoU. The 'IoU' metric, which evaluates the precision of the cropped regions, shows no significant improvement for either model throughout training. This suggests that without an explicit grounding reward, the model is not incentivized to refine the precision of its crops. The consistently higher IoU of the Baseline is likely due to the uncompressed images making box prediction an easier task.

Overlap. The 'Overlap' plots reveal a distinct difference between the training progress of the Baseline and Stage-I. The Baseline's overlap remains relatively stable without a clear upward trend. In contrast, Stage-I's overlap increases significantly before plateauing. This indicates that while image compression initially makes it difficult for the model to identify question-relevant regions, the model, driven by the accuracy reward, learns a stronger region-identification capability in order to answer questions correctly.

8.2. Stage-II

Figure 6 illustrates the evolution of four key metrics during the second training stage, with and without the grounding reward. These metrics are: IoU, Overlap, the ratio of the predicted crop-box area to the image area (Bbox Ratio), and the number of visual tokens in the predicted cropped region (Token Num).

IoU. Comparing the IoU evolution in Figures 6(a) and (b), we observe a distinct upward trend in (b) during Stage-II training. This indicates that the introduction of the grounding reward enables the model to localize key regions with greater precision.

Bbox Ratio and Token Num.
A comparison of the Bbox Ratio reveals that this metric remains largely constant in (a), whereas it decreases significantly in (b), leading to a corresponding reduction in Token Num. This demonstrates that the grounding reward encourages smaller crop regions, thereby improving the model's inference efficiency.

Overlap. As shown in (b), the Overlap metric exhibits a moderate increase. This suggests that the model is not indiscriminately shrinking the crop area; rather, it is learning to prune redundant, non-critical parts of the region. Consequently, the resulting crop contains less distracting information, making it more conducive to answering questions based on the visual content.

Figure 5. Training progress of the Baseline (a) and of the Stage-I model with the information gap (b).
Figure 6. Training progress of Stage-II without (a) and with (b) the grounding reward.

(a) GT test:
| Setting | HR-Bench 8K | HR-Bench 4K | V∗ |
| prediction | 41.7 | 47.0 | 55.6 |
| GT | 50.0 | 53.0 | 60.3 |

(b) Noise test:
| Setting | HR-Bench 8K | HR-Bench 4K | V∗ |
| prediction | 100.0 | 100.0 | 100.0 |
| RandNoise | 100.0 | 80.8 | 100.0 |

Table 7. Results of the more rigorous preliminary analysis.

9. More Preliminary Analysis

To more rigorously demonstrate DeepEyes's insufficient attention to cropped regions, we conducted the following additional experiments. All experiments in this section are run with the maximum number of visual tokens set to 1,024.

GT test. We isolate samples from the benchmarks where the regions predicted by DeepEyes cover the GT poorly (Overlap ≤ 0.2). As shown in Table 7a, replacing the predicted regions of these samples with the GT yields gains of less than 10%, whereas significant growth would be expected if the model attended to these regions. This provides more rigorous evidence that DeepEyes does not fully utilize the cropped regions.

Noise test.
We select samples where DeepEyes fails when forced to answer directly (without the cropping pattern) but succeeds with cropping. As shown in Table 7b, when the cropped region is replaced with noise, the model still answers most of these samples correctly. This demonstrates that the performance gain is attributable to the cropping pattern (structural cues) rather than to the enhanced visual information within the crops.

10. More Ablation Studies

Data Utilization Strategy. As shown in Table 8, we compare two data-usage strategies. The first, 'Mix Data', trains the model in a single stage on a 1:1 mixture of our collected data and the Visual Probe data. The second, which we term 'Stage-II', is the two-stage approach we adopt: in the first stage, the model is trained exclusively on our collected data; in the second stage, it is trained on a small amount of Visual Probe data with the grounding reward incorporated. The results demonstrate that our proposed two-stage strategy achieves superior performance.

| Model | HR-Bench 8K (FSP / FCP / Overall) | HR-Bench 4K (FSP / FCP / Overall) | V∗ (Attr / Spatial / Overall) |
| Mix Data | 77.5 / 59.0 / 68.3 | 85.0 / 61.0 / 73.0 | 82.6 / 73.7 / 79.1 |
| Stage-II | 83.3 / 60.8 / 72.1 | 86.8 / 63.7 / 75.3 | 82.6 / 77.6 / 80.6 |

Table 8. Comparison between different data utilization strategies. Mix Data: training the model in a single stage on a mixture of our collected data and Visual Probe data. Stage-II: first train the model on our collected data, then train it with a minimal amount of Visual Probe data with the grounding reward.

Resolution Selection Strategy. To demonstrate the effectiveness of our resolution selection strategy in the first stage, we design two simple baselines, 'Hard' and 'Random', and compare them against our method on the multi-target subsets (FCP / Spatial) of the datasets. These subsets are chosen because they are more challenging than the single-target subsets (FSP / Attr), and thus better highlight the efficacy of the information-gap mechanism in our first stage. In the 'Hard' baseline, we consistently select the most heavily downsampled image (where max(h, w) = 224). In the 'Random' baseline, we randomly select one image from the pool of all images generated by downsampling the original image at various scales. As shown in Table 9, our method outperforms both of these simple baselines.

| Model | HR-Bench 8K | HR-Bench 4K | V∗ |
| Hard | 60.5 | 63.2 | 71.1 |
| Random | 59.0 | 64.0 | 72.4 |
| Answer | 62.0 | 63.5 | 75.0 |

Table 9. Comparison of accuracy between different resolution selection strategies in the multi-region case.

Reward Design in Stage-II. We compare different reward designs in Stage-II on HR-Bench 8K. As shown in Table 10, the first row presents our design, where an IoU reward is employed during second-stage training (without the L1 reward). In the second row, we remove the condition that the IoU reward is only applied when the overlap exceeds a certain threshold. The results demonstrate that retaining the conditional IoU reward (i.e., applying it only when the overlap is above the threshold) leads to higher IoU and accuracy, suggesting that this condition effectively guides the model to perform more precise cropping. Furthermore, the models trained with an IoU reward consistently outperform the model without it in both accuracy and IoU. This indicates that the performance gain from Stage-II training is primarily attributable to the IoU reward itself, rather than simply to the introduction of additional training data.

| Design | FSP (Acc / IoU) | FCP (Acc / IoU) | Acc |
| IoU Reward | 81.8 / 39.3 | 63.5 / 47.4 | 72.7 |
| − IoU Thresh | 79.0 / 29.3 | 64.2 / 39.0 | 71.6 |
| − IoU Reward | 79.0 / 30.6 | 63.7 / 37.7 | 71.2 |

Table 10. Comparison between different reward designs in Stage-II on HR-Bench 8K.

11. Benchmarks and Metrics Details

Our method is evaluated on three benchmarks. The first, HR-Bench 8K, has an average resolution of 7680 and consists of two sub-tasks: Fine-grained Single-instance Perception (FSP) and Fine-grained Cross-instance Perception (FCP). HR-Bench 4K is produced by cropping the 8K images around the objects in question. The third, V∗, with an average resolution of 2246×1582, features sub-tasks in attribute recognition (Attr) and spatial reasoning (Spatial). For all three benchmarks, the evaluation metric is accuracy (Acc), defined as the proportion of questions answered correctly. Additionally, to assess the cropping precision of our model, we also report the Intersection over Union (IoU), calculated between the model's predicted cropping box and the GT box of the question-relevant region.

| Method | Time (A100-hours) |
| DeepEyes | 480.0 |
| Ours: Stage-I | 320.0 |
| Ours: Stage-II | 96.0 |
| Ours: Offline Res. Selection | 2.5 |
| Ours: Total | 418.5 |

Table 11. Comparison of training time.

12. Training Details

We list the relevant hyperparameters in Table 12. We also compare our training time with that of DeepEyes in Table 11. Our total training time is shorter than that of DeepEyes because we reduce the number of input visual tokens in the first stage by compressing the original images, which accelerates the model rollout process.

| Parameter | Value |
| train batch size | 256 |
| rollout num per sample | 16 |
| ppo mini batch size | 32 |
| ppo micro batch size per gpu | 2 |
| rollout log prob micro batch size per gpu | 4 |
| ref log prob micro batch size per gpu | 4 |
| single response max tokens | 2048 |
| max turns | 5 |
| kl loss coef | 0.0 |
| entropy coeff | 0.0 |
| nodes num | 1 |
| gpus num per node | 8 |
| learning rate in Stage-I | 1e-6 |
| learning rate in Stage-II | 5e-7 |
| overlap threshold (τ) | 0.9 |
| acc reward weight (r_acc) | 0.8 |
| format reward weight (r_format) | 0.2 |
| tool call weight (r_tool) | 1.2 |

Table 12. The hyperparameters used in the training pipeline.

13. Visualization Analysis

In Figure 7, we visually analyze the inference processes of DeepEyes and Stage-I. In the first example, while both models correctly crop the flag, DeepEyes provides an incorrect answer, whereas Stage-I arrives at the correct one. In the second example, both models initially fail to crop the jacket's sleeve. However, DeepEyes proceeds to answer the question even with the sleeve absent from the cropped region. In contrast, Stage-I notices the absence of the jacket's sleeve in the initial crop, performs a second cropping action, and ultimately succeeds in locating it and answering the question correctly. In Figure 8, we visualize and analyze the inference processes of the Stage-I and Stage-II models. In the first crop, both models fail to capture the target car and subsequently enlarge their cropping regions in the second attempt. However, the Stage-I model's second crop includes excessive redundant areas, causing it to fail again to identify the car. In contrast, the Stage-II model accurately crops the entire road area, leading to a correct answer.

14. Limitations and Future Works

In this work, we first identify a critical issue in existing agent-based workflows for complex image understanding: sub-optimal tool invocation stemming from a rigid formalization of the cropping tool. We address this by proposing an information gap mechanism. Building upon this, we further enhance the model's cropping precision, and consequently its overall performance, by manually annotating a small set of bboxes and introducing a grounding reward. However, the performance gain from the second stage of training is relatively limited, which we attribute to the small number of annotated bboxes.
This suggests that leveraging synthetic data [46, 52] could address the data-scarcity problem in high-resolution VQA. For instance, a large-scale, high-resolution VQA dataset with bbox annotations could be synthesized by cropping objects from general-purpose VQA datasets such as GQA [10] (which are low-resolution but rich in annotations, including bboxes) and pasting them onto high-resolution backgrounds.

Furthermore, current methods only contain a cropping tool, whose effectiveness is constrained by the inherent grounding ability of the MLLM and by the fact that cropped regions contain background distractors. A promising future direction is to integrate segmentation models, such as open-vocabulary semantic segmentation [15, 47, 53], referring expression segmentation [13, 37], or in-context segmentation [33, 34, 42], as callable tools for the agent.

Figure 7. Comparison between DeepEyes and Stage-I (multi-turn reasoning examples on the flag-color and jacket-sleeve questions).
Figure 8. Comparison between Stage-I and Stage-II (multi-turn reasoning examples on the car-color question).
