FINER: MLLMs Hallucinate under Fine-grained Negative Queries

Rui Xiao 1,2, Sanghwan Kim 1,2,3, Yongqin Xian 4, Zeynep Akata 1,2,3, Stephan Alaniz 5
1 Technical University of Munich  2 Munich Center for Machine Learning  3 Helmholtz Munich  4 Google  5 LTCI, Télécom Paris, Institut Polytechnique de Paris, France

Abstract

Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and "what" questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields up to 24.2% gains (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites and enhancing general multimodal capabilities across six benchmarks. Code, benchmark, and models are available at https://explainableml.github.io/finer-project/.

1. Introduction

Multimodal large language models (MLLMs) have demonstrated significant progress in visual perception [2] and instruction following [25], enabling increasingly sophisticated image question answering. Real-world users, however, often ask fine-grained questions requiring precise understanding of image content. While current models [4, 27, 46] handle coarse questions reasonably well, it remains unclear whether they can detect nuanced errors in detailed user queries when describing image content.
This is critical in domains like medical visual question answering, where trustworthiness requires spotting and correcting errors in complex queries. In the context of natural images, we focus on hallucination [5, 37], the generation of answers unsupported by the image, and define "negative queries" as those asking about non-existent image content.

Figure 1. (a) Negative queries from coarse to fine granularity (e.g., "Can you see the cat in this image?" vs. "Can you see the wolf in this image?", progressively extended with attributes and relations); (b) comparison between Baseline and FINER-Tuning. We compare the performance of InternVL3.5-14B [46] (Baseline) with the model fine-tuned by FINER-Tuning under negative queries of seven different granularity levels.

Prior studies show MLLMs often exhibit false-positive hallucination, failing to answer "No" to negative queries [3, 22, 44, 56]. Yet, these probes are largely coarse; POPE and DASH focus on single object presence [3, 22], and AMBER includes only single objects, attributes, and relations [44]. This raises an important question: Can MLLMs reject fine-grained mistakes involving multiple objects, attributes, and relations, rather than only coarse mismatches? To investigate, we first conduct a motivation study, increasing the granularity of negative queries to probe for false positives.

Question granularity affects hallucination. We examine how MLLMs behave as negative queries become progressively more fine-grained. Mimicking how a human constructs a sentence, starting with a single object and then adding attributes and relations, we construct queries of increasing granularity from coarse to fine, as shown in Fig. 1.
This yields seven levels, each injecting a single, fine-grained contradiction (NEG_OBJ, NEG_ATTR, or NEG_REL) while keeping the rest of the description visually consistent. For each sample, we feed the model with the image and each of the seven queries separately, limiting the answer to "Yes" or "No", while the correct answer is always "No". We sample from two sources: 320 from FINER-CompreCap and 1,687 from FINER-DOCCI. We report averaged accuracy per level for InternVL3.5-14B [46] and the model fine-tuned with FINER-Tuning.

As shown in Fig. 1, the accuracy of InternVL3.5-14B steadily decreases with increased query granularity, dropping from ∼80% at level 1 to ∼20% by levels 5-7 on FINER-CompreCap, and from ∼58% at level 1 to ∼15% by levels 6-7 on FINER-DOCCI. This demonstrates the model's brittleness to fine-grained negations: as granularity increases, it more often answers "Yes" to queries that should be "No", resulting in more false positives. The model finetuned with FINER-Tuning, however, consistently demonstrates performance gains, particularly at finer granularity. This highlights MLLMs' susceptibility to hallucination at finer granularity and the potential for improvement.

Hence, we ask: Can we systematically study hallucinations under fine-grained negative queries? Our initial analysis mixes objects, attributes, and relations, hindering isolation of causal factors. To disentangle these, we introduce FINER-CompreCap and FINER-DOCCI, which group queries into four settings: multiple objects (Multi-obj), multiple attributes (Multi-attr), multiple relations (Multi-rel), and "what"-questions (Wh). The first three target existence and binding, assessing whether the model can detect errors hidden in multiple objects, attributes, and relations.
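As a minimal sketch, the per-level probing protocol of the motivation study above can be implemented as follows. Here `ask_yes_no` is a hypothetical stand-in for a real MLLM call, and `stub_model` merely simulates coarse-vs-fine behavior for illustration; neither is part of the paper's released code.

```python
from collections import defaultdict

def per_level_accuracy(samples, ask_yes_no):
    """samples: list of (image_id, level, negative_query).
    The gold answer to every negative query is 'No'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for image_id, level, query in samples:
        prompt = f"{query} Answer only 'Yes' or 'No'."
        answer = ask_yes_no(image_id, prompt)
        total[level] += 1
        correct[level] += int(answer.strip().lower().startswith("no"))
    return {lvl: correct[lvl] / total[lvl] for lvl in sorted(total)}

# Toy stand-in: right on a coarse object swap, wrong on a fine attribute swap.
def stub_model(image_id, prompt):
    return "No" if "wolf" in prompt else "Yes"

samples = [
    ("img0", 1, "Can you see the wolf in this image?"),
    ("img0", 3, "Can you see the cat with a brown coat in this image?"),
]
print(per_level_accuracy(samples, stub_model))  # {1: 1.0, 3: 0.0}
```

Averaging the per-level scores over the 320 FINER-CompreCap and 1,687 FINER-DOCCI samples would reproduce the curves plotted in Fig. 1.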
The Wh-setting probes factual answering with ill-posed queries, asking "what"-questions about a target object with one incorrect attribute. Together, these four settings reveal whether a model can say "No" to precise but wrong claims, beyond handling coarse mismatches.

2. FINER Benchmarks

Our FINER benchmarks aim to compose negative questions involving multiple semantic elements, i.e., objects, attributes, and relations, to evaluate an MLLM's ability to detect and reason about missing or incorrect components in a scene, even with subtle perturbations. We begin by explaining our benchmark construction as illustrated in Fig. 2.

2.1. Question Construction Pipeline

We base our FINER benchmarks on the scene graph (SG) of an image, encoding objects (OBJ), their attributes (ATTR), and spatial or semantic relations (REL). For each component, we generate negative counterparts (NEG_OBJ, NEG_ATTR, NEG_REL): semantically plausible but incorrect substitutions (e.g., replacing "door frame" with "pillar"). Unlike prior work [3, 22], which relies on a single negative, we generate four distinct negative variants per entity (as described in Sec. 2.3). The initial processing steps are visualized at the top of Fig. 2.

We then use a template-based approach to compose positive questions (q+) mentioning multiple elements of the same category sampled from the positive SG. For example, a multiple-object question (q+_multi-obj) might be "Can you see cat and door frame?". Corresponding negative questions (q−) are constructed by replacing one randomly chosen element with a randomly sampled negative counterpart (e.g., "Can you see cat and pillar?"). The correct answers are "Yes" and "No", respectively. To move beyond binary responses, we construct Multiple Choice Questions (MCQs) requiring the model to specify the correct entities in the image.
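A minimal sketch of this template-based construction, using the paper's cat / "door frame" / "pillar" example. The template wording and helper names here are illustrative, not the paper's exact templates (those are in the supplementary):

```python
import random

TEMPLATE = "Can you see {items} in this image?"

def build_pair(positives, neg_variants, rng):
    """positives: objects present in the image.
    neg_variants: dict mapping each object to its four negative variants."""
    q_pos = TEMPLATE.format(items=" and ".join(positives))
    # Corrupt one randomly chosen element with one of its negatives.
    swap_idx = rng.randrange(len(positives))
    negs = list(neg_variants[positives[swap_idx]])
    chosen_neg = rng.choice(negs)
    corrupted = list(positives)
    corrupted[swap_idx] = chosen_neg
    q_neg = TEMPLATE.format(items=" and ".join(corrupted))
    # MCQ options for q_neg: the correct option restores the true entity;
    # the remaining negative variants of the same entity act as distractors.
    correct = f"No, but I can see {' and '.join(positives)}."
    distractors = []
    for other in negs:
        if other != chosen_neg:
            alt = list(positives)
            alt[swap_idx] = other
            distractors.append(f"No, but I can see {' and '.join(alt)}.")
    yes_option = f"Yes, I can see {' and '.join(corrupted)}."
    options = [correct] + distractors + [yes_option]  # shuffled in the real benchmark
    return q_pos, q_neg, options

q_pos, q_neg, opts = build_pair(
    ["cat", "door frame"],
    {"cat": ["dog", "fox", "rabbit", "raccoon"],
     "door frame": ["pillar", "archway", "baseboard", "banister"]},
    random.Random(0),
)
print(q_pos)
print(q_neg)
```

Each MCQ thus has one correct answer and four distractors, matching the evaluation setting described in Sec. 2.4.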
For example, the correct answer to q−_multi-obj would be "No, but I can see cat and door frame". We use the other negative options of the same component as distractors for the remaining answer options (see "Multi-obj" in Fig. 2). Equivalently, we construct q±_multi-attr and q±_multi-rel from the SGs' attributes and relations. Finally, we create "what"-questions (Wh) asking about an object in relation to another, using either its positive or negative attribute. The complete question template is described in Sec. B in the supplementary.

Benchmarks. Based on this pipeline, we constructed FINER-CompreCap (based on CompreCap [31]) and FINER-DOCCI (based on DOCCI [34]). CompreCap provides human-annotated scene graphs, but is limited to COCO images. DOCCI consists of 5K images with long human-annotated captions, which allow us to create a larger-scale question set. The detailed statistics of both benchmarks are in Sec. B in the supplementary. FINER-CompreCap consists of 6,300 Multi-obj, 3,338 Multi-attr, 4,280 Multi-rel, and 3,166 Wh MCQs, with a maximum of 6 objects, 3 attributes, or 3 relations per question. FINER-DOCCI comprises 10,000 Multi-obj, 28,630 Multi-attr, 11,542 Multi-rel, and 20,944 Wh MCQs, with a maximum of 6 objects, 5 attributes, or 3 relations per question. In the following, we detail how we extract the SG from DOCCI, and how we generate the negative components.

2.2. Scene Graph Extraction

For DOCCI, where ground-truth SGs are unavailable, we build a non-panoptic SG by extracting objects, attributes, and relations directly from the human-written long captions. We use a multi-stage pipeline powered by Gemini-2.0-Flash [41], with filtering by a strong MLLM (Qwen2.5-VL-72B [4]) and human verification on sampled data, to convert captions into SG-like annotations.
The validation steps reduce the risk of introducing incorrect features into the SG, which is particularly important for REL. We provide more details regarding the pipeline in Sec. B.2 in the supplementary.
Figure 2. Data construction pipeline for FINER benchmarks. For FINER-DOCCI, we extract the positive scene graph (SG) from DOCCI [34] captions, while for FINER-CompreCap, the SG is provided by CompreCap [31]. From the positive SG, we generate the negative SG using Qwen3-14B [51] as negatives generator for FINER-CompreCap and Gemini-2.0-Flash [41] for FINER-DOCCI. Finally, a rule-based query construction pipeline builds multiple-choice questions. In practice, choices are shuffled in both benchmarks.

2.3. Negatives Generation

Starting from the positive SGs, we generate four corresponding negatives for each object, attribute, and relation, using an LLM with carefully designed prompts. We use Qwen3-14B [51] for FINER-CompreCap and Gemini-2.0-Flash [41] for FINER-DOCCI to ensure consistency with the SG creation. To decrease the risk of generated negatives being present in the image, we use a strong MLLM (Qwen2.5-VL-72B) as a discriminator. If it fails to identify the positive item mixed into the negatives, we conclude that at least one negative is ambiguous or present in the image. Based on the MLLM's classification entropy, we identify which negatives need to be regenerated and repeat this process iteratively. Humans verify samples to specify regeneration thresholds. For more details on the negatives generation, please refer to Sec. B.3 in the supplementary.

2.4. Evaluation Setting

As binary "Yes/No" responses are vulnerable to model biases, we use MCQs to move models beyond simple negation and enforce visual understanding, with each MCQ including one correct answer and four distractors. To prevent bias toward positive or negative answers, we pair each negative MCQ (q−) with its corresponding positive MCQ (q+), requiring both to be answered correctly. This pairing ensures models cannot succeed by simply memorizing "No" patterns or exploiting label imbalances. As a result, letting M(·) be the model, we define paired accuracy as the primary evaluation metric over N question pairs (q+, q−):

\mathrm{Acc}_{\text{paired}} = \frac{1}{N} \sum_{i=1}^{N} \Gamma\big(M(x_i, q_i^{+})\big)\, \Gamma\big(M(x_i, q_i^{-})\big)   (1)

where Γ(·) evaluates to 1 for correct responses and 0 otherwise. This metric requires success on both positive and negative variants, ensuring robustness against false positives and false negatives.
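Eq. (1) reduces to a few lines of code. The sketch below takes precomputed 0/1 correctness indicators (the Γ(M(x, q)) terms) rather than live model calls, which keeps it self-contained; the indicator lists are illustrative.

```python
def paired_accuracy(pos_correct, neg_correct):
    """pos_correct[i], neg_correct[i] are the 0/1 indicators Γ(M(x_i, q_i+))
    and Γ(M(x_i, q_i-)); a sample counts only if both are correct."""
    assert len(pos_correct) == len(neg_correct)
    n = len(pos_correct)
    return sum(p * q for p, q in zip(pos_correct, neg_correct)) / n

# A model right on every positive but only half the negatives scores 0.5,
# not the 0.75 that unpaired accuracy would report.
print(paired_accuracy([1, 1, 1, 1], [1, 0, 1, 0]))  # 0.5
```

The product inside the sum is what blocks a model from scoring well by always answering "Yes" (or always "No"): either bias zeroes out one factor of every pair.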
3. Training with FINER (FINER-Tuning)

Observing MLLM vulnerabilities under FINER, we address them with a data-driven training approach via Direct Preference Optimization (DPO) [36] using fine-grained negative queries, denoted FINER-Tuning. Unlike approaches optimizing for simple queries [52, 55, 57], FINER-Tuning employs minimally edited, semantically precise contradictions over objects, attributes, and relations (e.g., "car with yellow bumper" vs. "car with chrome bumper"), including both fine-grained positive and negative queries. Fig. 3 illustrates our training data generation pipeline. It is inspired by the four settings in our benchmarks, with both accepted and rejected answers for every query. This focuses learning on detecting fine-grained hallucinations in the queries, rather than solely avoiding them in the model's responses.

Figure 3. Training data generation pipeline for FINER-Tuning. (1) We adopt long captions from Pixmo [11] and extract diverse phrases with Phi-4-14B [1]. (2) We then prompt the same LLM to modify and generate negative phrases. (3) We construct both positive and negative query-answer tuples via template-based composition or LLM generation.

Setup. We select data avoiding in-distribution leakage, excluding COCO data [23] and the DOCCI training split [34]. To leverage the availability of dense image annotations, we
adopt Pixmo-caption [11] as our base corpus. We further avoid using the LLMs used for benchmark construction, employing Phi-4-14B [1] for our training data pipeline.

(1) Extract Positives. As illustrated in Fig. 3, given a long caption, we prompt Phi-4-14B to extract fine-grained positive phrases, mirroring our four evaluation scenarios: Multi-obj, Multi-attr, Multi-rel, and Wh. We define the following four positive phrase types:

\Psi^{+} \in \{\Psi^{+}_{\mathrm{OBJ}}, \Psi^{+}_{\mathrm{ATTR}}, \Psi^{+}_{\mathrm{REL}}, \Psi^{+}_{\mathrm{WH}}\}  (2)

The LLM produces: Ψ+_OBJ, a phrase summarizing the objects; Ψ+_ATTR, a phrase summarizing attributes for a random object; Ψ+_REL, a phrase summarizing relations between a random object and others; and Ψ+_WH, a composed sentence describing two objects with a relation and summarized attributes, subsequently forming a positive question-answer pair. Our prompt templates are detailed in Sec. G.

(2) Generate Negatives. Transforming the positive phrases Ψ+, we generate negative phrases Ψ− with the same LLM:

\Psi^{-} \in \{\Psi^{-}_{\mathrm{OBJ}}, \Psi^{-}_{\mathrm{ATTR}}, \Psi^{-}_{\mathrm{REL}}, \Psi^{-}_{\mathrm{WH}}\}  (3)

For each phrase type Ψ+_T (where T ∈ {OBJ, ATTR, REL, WH}), we randomly select one instance of T and prompt the LLM to replace that instance with a negative, forming Ψ−_T. Please refer to Sec. E for the complete prompt details.

(3) Query & Answer Construction. With Ψ+ and Ψ−, we construct query-answer pairs for DPO training, including both positive (q+) and negative (q−) questions paired with accepted (a+) and rejected (a−) responses. a+ begins with the correct response ("Yes" for q+, "No" for q−) and mentions the correct image features, while a− is the opposite. For OBJ/ATTR/REL, we directly apply question-answer templates to Ψ+ and Ψ− to construct (q+, a+_+, a−_+) and (q−, a+_−, a−_−) pairs. We use five templates to avoid overfitting to the benchmark's prompt pattern, as detailed in Sec. G.
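The template-based composition of step (3) can be sketched as follows; the templates and phrases are illustrative stand-ins, not the paper's exact five templates:

```python
# Illustrative sketch: given a positive phrase and its minimally edited
# negative, compose (q+, a+, a-) and (q-, a+, a-) tuples from question
# templates. Template strings and phrases are examples, not the released set.
TEMPLATES = ["Can you see {} in this image?",
             "Is there {} in this image?",
             "Is it true that there is {} in this image?"]

def build_preference_pairs(pos_phrase, neg_phrase, template_id=0):
    t = TEMPLATES[template_id]
    q_pos = t.format(pos_phrase)   # query about a feature truly in the image
    q_neg = t.format(neg_phrase)   # query containing a fine-grained mismatch
    pair_pos = {"question": q_pos,
                "accepted": f"Yes, I can see {pos_phrase}.",
                "rejected": f"No, but I can see {neg_phrase}."}
    pair_neg = {"question": q_neg,
                "accepted": f"No, but I can see {pos_phrase}.",
                "rejected": f"Yes, I can see {neg_phrase}."}
    return pair_pos, pair_neg

pos, neg = build_preference_pairs("the car with chrome bumper",
                                  "the car with yellow bumper")
```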
For WH, data pairs are already constructed by the LLM due to the free-form nature of these questions and answers. Fig. 3 provides example data for all data types; more examples are provided in Sec. C in the supplementary.

DPO Training. This creates a dataset of preference tuples

\mathcal{D} = \{(x, q_s, a^{+}_s, a^{-}_s)\}, \quad s \in \{+, -\}  (4)

where x is the image. Let π_θ(·|x, q) be the policy and π_ref a frozen reference model. We train with DPO, maximizing the probability that the policy ranks a+ above a−:

\Delta_{\theta}(x, q) := \log \pi_{\theta}(a^{+} \mid x, q) - \log \pi_{\theta}(a^{-} \mid x, q),
\Delta_{\mathrm{ref}}(x, q) := \log \pi_{\mathrm{ref}}(a^{+} \mid x, q) - \log \pi_{\mathrm{ref}}(a^{-} \mid x, q),
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x, q, a^{+}, a^{-}) \sim \mathcal{D}} \left[ \log \sigma\left( \beta (\Delta_{\theta} - \Delta_{\mathrm{ref}}) \right) \right]  (5)

where σ(·) is the logistic function and β = 0.1.

4. Experiments

We present experiments of FINER-Tuning on three tasks: evaluation on FINER benchmarks (Sec. 4.2), other hallucination benchmarks (Sec. 4.3), and general MLLM capabilities (Sec. 4.4). In addition, we show qualitative examples on FINER benchmarks (Sec. 4.5) and ablate important training strategies and subset selections (Sec. 4.6).

4.1. Experimental Setup

Fine-tuning Setup. We are interested in applying FINER-Tuning to frontier MLLMs: LLaVA-NeXT-7B (LLaVA-1.6-7B) [27], Qwen2.5-VL-7B-Instruct [4], and InternVL-3.5-8B [46]. To test scalability within our compute limits, we also include InternVL-3.5-14B [46]. We fine-tune each model on our constructed data with at most 160k preference tuples. All models are trained for one epoch using LLaMA-Factory [58] with LoRA [17]. Full training details are in Sec. C in the supplementary.

Evaluation Setup. We evaluate all models on three tasks across 16 benchmarks. We primarily use VLMEvalKit [14]

Table 1. Paired accuracy (Acc_paired) results on FINER-CompreCap and FINER-DOCCI.
∗For Gemini-2.5-Flash, we evaluate on the whole FINER-CompreCap and on 3K MCQs per setting in FINER-DOCCI due to the scale of the benchmark.

| Models | Size | FINER-CompreCap (Multi-obj / Multi-attr / Multi-rel / Wh) | FINER-DOCCI (Multi-obj / Multi-attr / Multi-rel / Wh) |
|---|---|---|---|
| Random Guess | - | 4.0 / 4.0 / 4.0 / 4.0 | 4.0 / 4.0 / 4.0 / 4.0 |
| LRV-V2 [24] | 13B | 6.1 / 6.8 / 5.6 / 4.0 | 6.3 / 5.4 / 6.1 / 5.2 |
| LLaVA-RLHF [40] | 13B | 11.4 / 2.0 / 1.1 / 6.9 | 7.3 / 3.0 / 5.1 / 5.3 |
| RLHF-V [54] | 13B | 13.4 / 6.1 / 1.6 / 10.8 | 13.2 / 7.2 / 8.1 / 7.0 |
| OPA-DPO [52] | 13B | 10.9 / 3.0 / 2.2 / 6.9 | 8.1 / 5.5 / 8.3 / 8.0 |
| RLAIF-V [55] | 12B | 62.2 / 39.6 / 19.2 / 20.5 | 46.5 / 31.7 / 32.4 / 19.4 |
| LLaVA-1.6 [27] | 7B | 25.3 / 13.0 / 7.6 / 15.3 | 10.1 / 12.3 / 8.2 / 13.3 |
| +FINER-Tuning | 7B | 48.4 (+23.1) / 38.4 (+25.4) / 24.2 (+16.6) / 22.1 (+6.8) | 26.4 (+16.3) / 29.4 (+17.1) / 24.7 (+16.5) / 18.5 (+5.2) |
| Qwen2.5-VL [4] | 7B | 69.2 / 62.5 / 30.1 / 28.9 | 48.7 / 47.5 / 36.7 / 23.4 |
| +FINER-Tuning | 7B | 71.4 (+2.2) / 67.0 (+4.5) / 38.3 (+8.2) / 34.8 (+5.9) | 49.8 (+1.1) / 52.2 (+4.7) / 43.4 (+6.7) / 28.0 (+4.6) |
| InternVL-3.5 [46] | 8B | 75.0 / 72.5 / 49.8 / 23.5 | 58.1 / 54.3 / 41.8 / 16.8 |
| +FINER-Tuning | 8B | 77.1 (+2.1) / 78.9 (+6.4) / 64.1 (+14.3) / 34.2 (+10.7) | 62.6 (+4.5) / 60.1 (+5.8) / 52.7 (+10.9) / 23.7 (+6.9) |
| InternVL-3.5 [46] | 14B | 74.5 / 68.1 / 47.0 / 21.8 | 58.6 / 55.9 / 41.4 / 15.6 |
| +FINER-Tuning | 14B | 80.0 (+5.5) / 78.9 (+10.8) / 71.2 (+24.2) / 30.1 (+8.3) | 65.9 (+7.3) / 65.0 (+9.1) / 57.0 (+15.6) / 23.0 (+7.4) |
| InternVL-3.5 [46] | 38B | 77.8 / 78.1 / 66.8 / 50.9 | 62.3 / 64.8 / 54.2 / 36.6 |
| Gemini-2.5-Flash [10]∗ | - | 75.7 / 77.3 / 77.8 / 58.2 | 64.4 / 64.5 / 56.7 / 49.6 |

for standardized evaluations. For benchmarks not integrated in VLMEvalKit, we follow each benchmark's official evaluation protocol. Refer to Sec. D in the supplementary for details.

4.2. Results on FINER benchmarks

Baselines. We primarily compare the performance of the four frontier MLLMs before and after FINER-Tuning, and also show the performance of stronger models such as InternVL-3.5-38B and Gemini-2.5-Flash [41]. Additionally, we benchmark hallucination-aware fine-tuning methods such as RLAIF-V [55], OPA-DPO [52], RLHF-V [54], LLaVA-RLHF [40], and LRV-Instruct-V2 [24].
Note that different methods are typically based on different MLLMs and fine-tuned on different data. Given their effectiveness on general hallucination reduction, we aim to find out how well they fare on our FINER benchmarks. Furthermore, we estimate human performance with a human study on a subset of 20 MCQs for each setting. The results and details of our human study can be found in Sec. F in the supplementary.

Main results. The results are presented in Tab. 1. Base model capability strongly influences overall performance. Hallucination-aware fine-tuning methods like RLHF-V [54] and LLaVA-RLHF [40] achieve only 1.6% and 1.1% paired accuracy on the Multi-rel subset of FINER-CompreCap. RLAIF-V-12B, while the best among these methods, scores substantially below advanced MLLMs, including Qwen2.5-VL and InternVL-3.5. This shows that mitigating hallucination on previous datasets does not directly translate to our FINER benchmarks, highlighting the importance of starting from and improving upon frontier MLLMs.

Meanwhile, FINER-Tuning consistently improves all baselines. Specifically, on FINER-CompreCap, LLaVA-1.6 shows remarkable gains of 23.1%, 25.4%, and 16.6% on the Multi-obj, Multi-attr, and Multi-rel subsets, and InternVL-3.5-14B shows improvements of up to 24.2% (Multi-rel), outperforming its 38B version by 4.4%. On FINER-DOCCI, FINER-Tuning on InternVL-3.5-14B scores on par with Gemini-2.5-Flash in 3 out of 4 settings. Moreover, Wh-questions challenge all models. Even InternVL-3.5-38B and Gemini-2.5-Flash achieve only 36.6% and 49.6% Acc_paired on FINER-DOCCI, leaving room for future research on reducing hallucinations in FINER.

Different numbers of objects, attributes, and relations. Both FINER benchmarks cover the Multi-obj, Multi-attr, and Multi-rel settings. We study how Acc_paired changes as the number of entities increases (Fig. 4).
Models show similar trends in all three settings: performance drops as the entity count increases, with much smaller drops in Multi-obj. FINER-Tuning consistently improves performance, with larger gains in Multi-attr and Multi-rel, and the gains grow with higher counts. For example, FINER-Tuning improves InternVL3.5-14B by 8.3%, 19.1%, and 28.1% in the 6-obj, 3-attr, and 3-rel settings on FINER-CompreCap.

Figure 4. Acc_paired versus the number of objects, attributes, and relations. Top: FINER-CompreCap; Bottom: FINER-DOCCI. Dashed arrows show the gain from FINER-Tuning.

4.3. Results on other hallucination benchmarks

FINER-Tuning achieves consistent improvements on FINER benchmarks. Hence, we are interested in how well models fine-tuned with FINER-Tuning generalize to other hallucination benchmarks. Additionally, we show the performance of RLAIF-V-12B against its baseline model OmniLMM-12B [35], to see whether other hallucination reduction methods achieve balanced improvements across various hallucination benchmarks. We evaluate models on both discriminative benchmarks like DASH [3], POPE [22], RePOPE [33], HallusionBench [16], AMBER [44], and the CRPE relation split (CRPE R) [45], as well as generative benchmarks like MMHal-Bench [40] and HaloQuest [47].

Table 2. Results on hallucination benchmarks, including discriminative (DASH [3], POPE [22], RePOPE [33], HallusionBench [16], AMBER [44], CRPE R [45]) and generative ones (MMHal-Bench [40], HaloQuest [47]). Sc.: Score (max. 6); HR.: Hallucination Rate.

| Models | Size | DASH Acc.↑ | POPE Acc.↑ | RePOPE Acc.↑ | HallBench aAcc.↑ | AMBER Acc.↑ | CRPE R Acc.↑ | MMHal Sc.↑ | MMHal HR.↓ | HaloQuest Sc.↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| OmniLMM [35] | 12B | 79.0 | 88.0 | 93.8 | 54.9 | 86.9 | 51.7 | 3.5 | 34.0 | 39.9 |
| +RLAIF-V [55] | 12B | 76.3 (−2.7) | 87.7 (−0.3) | 93.4 (−0.4) | 53.7 (−1.2) | 87.4 (+0.5) | 52.2 (+0.5) | 4.0 (+0.5) | 29.0 (−5.0) | 62.4 (+22.5) |
| LLaVA-1.6 [27] | 7B | 58.0 | 88.2 | 92.3 | 33.0 | 78.1 | 56.5 | 3.3 | 43.0 | 44.2 |
| +FINER-Tuning | 7B | 57.4 (−0.6) | 88.8 (+0.6) | 93.2 (+0.9) | 36.3 (+3.3) | 85.0 (+6.9) | 56.0 (−0.5) | 3.5 (+0.2) | 40.0 (−3.0) | 63.5 (+19.3) |
| Qwen2.5-VL [4] | 7B | 74.6 | 86.4 | 92.4 | 65.4 | 85.2 | 69.9 | 4.6 | 18.0 | 74.8 |
| +FINER-Tuning | 7B | 76.6 (+2.0) | 87.2 (+0.8) | 92.8 (+0.4) | 68.5 (+3.1) | 85.8 (+0.6) | 70.7 (+0.8) | 4.7 (+0.1) | 15.0 (−3.0) | 80.8 (+6.0) |
| InternVL-3.5 [46] | 8B | 68.3 | 88.6 | 91.5 | 71.0 | 88.2 | 67.7 | 4.5 | 19.0 | 62.4 |
| +FINER-Tuning | 8B | 74.5 (+6.2) | 89.4 (+0.8) | 93.1 (+1.6) | 73.0 (+2.0) | 88.6 (+0.4) | 68.0 (+0.3) | 4.6 (+0.1) | 14.0 (−5.0) | 73.5 (+11.1) |
| InternVL-3.5 [46] | 14B | 55.8 | 89.5 | 91.8 | 69.5 | 88.0 | 67.2 | 4.7 | 11.0 | 65.0 |
| +FINER-Tuning | 14B | 61.3 (+5.5) | 90.2 (+0.7) | 93.6 (+1.8) | 71.2 (+1.7) | 89.4 (+1.4) | 69.0 (+1.8) | 4.7 | 10.0 (−1.0) | 71.0 (+6.0) |

The summarized results are shown in Tab. 2. In the supplementary, we further include detailed breakdowns (Tabs. 13 and 14), results for AMBER generative (Tab. 15), and comparisons with more methods (Tab. 16). Intuitively, FINER-Tuning strengthens discrimination through FINER training; our results on discriminative benchmarks confirm this. FINER-Tuning consistently improves Qwen2.5-VL and InternVL-3.5 across all benchmarks. On DASH, it boosts the two InternVL-3.5 variants by 6.2% and 5.5%.
LLaVA-1.6 also gains 6.9% on AMBER with FINER-Tuning. FINER-Tuning further reduces hallucination on generative benchmarks. On MMHal-Bench, it lowers the hallucination rate for all base models, reaching 10% with InternVL-3.5-14B. On HaloQuest, it improves LLaVA-1.6 by 19.3%. Even for Qwen2.5-VL and InternVL-3.5, we observe at least 6% gains. In contrast, while RLAIF-V delivers strong gains on generative benchmarks, its improvements on discriminative tasks are less consistent, whereas FINER-Tuning benefits both. RLAIF-V degrades performance compared to the base OmniLMM on benchmarks like DASH, POPE, RePOPE, and HallusionBench. By comparing these "deltas" between fine-tuned models and baselines, we show that FINER-Tuning is a balanced approach that leads to a comprehensive reduction in hallucination. These results also validate the effectiveness of the FINER benchmarks, showing that improvements on FINER benchmarks align with broader improvements on other benchmarks as well.

4.4. Results on general capabilities

Since FINER-Tuning adds fine-grained negative queries to DPO, a natural concern is over-rejection: the model becoming overly cautious, refusing answerable questions, or regressing on existing skills. To test this, we compare each base model and its FINER-Tuning counterpart on six additional benchmarks: MMStar [7] (general abilities), TextVQA [39], ChartQA [32], MMVP [42] (vision-centric abilities), NaturalBench [21] (compositionality), and V∗ Bench [48] (visual search). The results are shown in Tab. 3.

Table 3. Results on six general purpose MLLM benchmarks. M.S.: MMStar [7]; Text: TextVQA [39]; Chart: ChartQA [32]; M.P.: MMVP [42]; N.B.: NaturalBench [21]; V∗: V∗ Bench [48].

| Models | M.S. | Text | Chart | M.P. | N.B. | V∗ | Avg. |
|---|---|---|---|---|---|---|---|
| OmniLMM-12B | 39.7 | 64.5 | 24.2 | 69.7 | 26.9 | 52.9 | 46.3 |
| +RLAIF-V | 40.9 | 64.5 | 25.1 | 70.0 | 19.4 | 54.4 | 45.7 |
| LLaVA-1.6-7B | 37.6 | 63.7 | 54.4 | 65.0 | 15.7 | 53.9 | 48.4 |
| +FINER-Tuning | 39.2 | 63.9 | 54.9 | 68.7 | 19.8 | 55.0 | 50.3 |
| Qwen2.5-VL-7B | 63.7 | 84.9 | 87.0 | 76.7 | 34.1 | 72.7 | 69.8 |
| +FINER-Tuning | 64.7 | 85.1 | 86.4 | 77.3 | 34.1 | 72.8 | 70.1 |
| InternVL3.5-8B | 68.0 | 77.8 | 86.7 | 76.7 | 30.4 | 69.1 | 68.1 |
| +FINER-Tuning | 68.3 | 77.9 | 86.7 | 77.0 | 31.1 | 71.2 | 68.7 |
| InternVL3.5-14B | 67.2 | 77.2 | 86.4 | 78.3 | 30.7 | 68.0 | 68.0 |
| +FINER-Tuning | 67.7 | 77.2 | 86.8 | 78.7 | 35.5 | 70.2 | 69.4 |

Unlike prior work reporting an "alignment tax", with gains on target benchmarks at the cost of general ability [56], FINER-Tuning avoids this trade-off and even improves strong baselines on general benchmarks (improving InternVL3.5-14B by 1.4%). This shows that FINER provides a useful training signal that complements the model's internal capabilities.

4.5. Qualitative Results

Figure 5 shows four FINER-CompreCap examples; more qualitative results, including FINER-DOCCI, are in Sec. E in the supplementary. FINER-Tuning avoids the spurious "necklace" in the Multi-obj case and correctly identifies the fine color details of the strawberry-patterned food in the Multi-attr case. In the Multi-rel example, both Qwen2.5-VL and InternVL3.5 hallucinate the second relation as "hiding behind the football". In the Wh example, FINER-Tuning shifts InternVL-3.5-14B from answering "bear" to flagging the incorrect attribute of the rock. These examples indicate that FINER-Tuning helps the model detect fine-grained errors and locate the correct information in complex queries.

4.6. Ablation Studies

Training strategies. FINER-Tuning trains on both positive and negative queries {(x, q+, a+_+, a−_+), (x, q−, a+_−, a−_−)}.
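The DPO objective of Eq. (5) over such preference tuples reduces to a logistic loss on the difference between policy and reference log-ratio margins. A minimal numeric sketch; the sequence log-probabilities are placeholder numbers standing in for log π(a | x, q) summed over answer tokens:

```python
import math

def dpo_loss(logp_acc, logp_rej, ref_logp_acc, ref_logp_rej, beta=0.1):
    delta_theta = logp_acc - logp_rej          # policy margin (Eq. 5)
    delta_ref = ref_logp_acc - ref_logp_rej    # frozen reference margin
    z = beta * (delta_theta - delta_ref)
    return -math.log(1.0 / (1.0 + math.exp(-z)))  # -log sigmoid(z)

# If the policy prefers the accepted answer more strongly than the reference
# does, the loss drops below -log(0.5); equal margins give exactly -log(0.5).
loss_better = dpo_loss(-12.0, -20.0, -14.0, -16.0)   # policy margin 8 > ref margin 2
loss_neutral = dpo_loss(-14.0, -16.0, -14.0, -16.0)  # margins equal
```

Minimizing the loss therefore widens the policy's accepted-vs-rejected margin relative to the reference, without requiring an explicit reward model.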
To ablate this setting, we investigate training with and without positive questions, and compare the performance of DPO against supervised fine-tuning (SFT). We train four InternVL-3.5-8B variants accordingly and compare with the baseline in Tab. 4. Results show mixed outcomes for SFT: with both queries, SFT reduces Multi-obj performance by 36.7% relative to the baseline. DPO with only negative queries exceeds the base model but still lags behind DPO with both query types (FINER-Tuning), underscoring the value of training with both.

Table 4. Ablation study on different training strategies. SFT methods only use a+. The base model is InternVL-3.5-8B [46]. Q.Type: Query Type; M.S.: MMStar [7].

| Method | Q.Type | Obj | Attr | Rel | Wh | RePOPE | M.S. |
|---|---|---|---|---|---|---|---|
| Base | - | 74.2 | 71.9 | 49.8 | 25.5 | 91.5 | 68.0 |
| +SFT | Neg | 47.4 | 59.7 | 53.8 | 38.7 | 69.1 | 61.7 |
| +SFT | Both | 37.5 | 49.5 | 55.2 | 18.9 | 92.2 | 63.3 |
| +DPO | Neg | 75.8 | 75.2 | 52.4 | 29.8 | 93.1 | 68.3 |
| +DPO | Both | 76.5 | 78.3 | 64.1 | 36.1 | 93.1 | 68.3 |

Table 5. Training-on-subset ablation for FINER-Tuning with InternVL-3.5-8B [46]. Obj/Attr/Rel denote Multi-obj/Multi-attr/Multi-rel for both training and evaluation.

| Train Subset | Obj | Attr | Rel | Wh | RePOPE | M.S. |
|---|---|---|---|---|---|---|
| Base | 74.2 | 71.9 | 49.8 | 25.5 | 91.5 | 68.0 |
| Obj | 78.8 | 76.4 | 54.2 | 28.7 | 93.5 | 67.9 |
| Attr | 71.3 | 76.7 | 56.8 | 26.5 | 91.5 | 68.2 |
| Rel | 69.2 | 73.0 | 66.7 | 24.1 | 91.4 | 67.7 |
| Wh | 75.9 | 75.3 | 55.0 | 46.5 | 92.9 | 68.3 |
| All | 76.5 | 78.3 | 64.1 | 36.1 | 93.1 | 68.3 |

Training on subsets. Our training data matches the benchmark query types: Multi-obj, Multi-attr, Multi-rel, and Wh. We train InternVL-3.5-8B on each subset separately and compare to FINER-Tuning trained on all subsets, keeping the total number of training samples fixed at 160k. As shown in Tab. 5, models trained only on Multi-obj, Multi-rel, or Wh achieve the best scores on their corresponding tests.
Notably, they also improve on other settings, suggesting the model is not merely echoing supervision from the data: FINER fosters a more general rejection pattern that transfers beyond the seen subset. Overall, training on all subsets yields the most balanced results.

5. Related Works

Hallucination Benchmarks. POPE [22] probes object hallucination by asking yes-or-no questions. RePOPE [33] identifies and corrects annotation errors in POPE. AMBER [44] categorizes hallucinations into "object," "relation," and "attribute" types in its discriminative subset. A common limitation of these benchmarks is their reliance on the MSCOCO dataset [23]. Therefore, DASH [3] applies retrieval to select challenging images from LAION-5B [20]. CRPE [45] focuses on relation hallucinations but is limited to single-relation cases. NOPE [30] targets non-existent objects, not attribute or relation hallucinations. ROPE [8] probes object classes with visual prompts (bounding boxes). Unlike ROPE, our Multi-obj setting randomly replaces a positive object with a negative one and does not rely on MSCOCO/ADE20K box annotations [23, 59]. MMHal-Bench [40] evaluates hallucination via eight types of questions with limited scale. HaloQuest [47] includes a "false premise" subset with a similar motivation to our Wh setting. However, our setting differs: we target false premises in fine-grained attributes of existing objects, whereas HaloQuest primarily targets non-existent objects.

Figure 5. Qualitative examples of FINER-CompreCap MCQs for each category, together with MLLM answers.

Hallucination-aware Fine-tuning. Prior work reduces hallucinations via supervised or contrastive tuning and instruction-based data augmentation: LRV-Instruct [24] adds negative instructions to MiniGPT-4 [61] and mPLUG-Owl [53]; HALVA [38] builds paired correct vs. hallucinated responses for contrastive learning; PerturboLLaVA [6] trains under misleading contexts; REVERSE [49] adds uncertainty tokens and retrospective reasoning. Other studies use preference learning: OPA-DPO [52] constructs on-policy corrections with GPT-4V; CHiP [15] decomposes the DPO loss into three hierarchies; HA-DPO [57] detects and corrects hallucinations with GPT-4; LLaVA-RLHF [40] and RLHF-V [54] rely on human preferences; RLAIF-V [55] iterates with model feedback.
FINER-Tuning differs in three ways: (1) we target fine-grained negative input queries, not only response-side errors [38, 40, 52, 54, 55, 57]; (2) we post-train frontier MLLMs beyond the LLaVA family [38, 52] and show strong performance against FINER; (3) we use standard DPO with a scalable data pipeline and a small LLM [1] for annotation, avoiding costly closed-source models and multi-iteration training [6, 24, 38, 52, 55, 57].

6. Conclusion and Limitation

Conclusion. We introduced FINER, a suite of fine-grained negative queries that reveals how current MLLMs fail under precise negations. Systematic evaluation across all four settings of FINER-CompreCap and FINER-DOCCI shows that even frontier MLLMs remain vulnerable to FINER-induced hallucinations. To address this, we proposed FINER-Tuning, a simple, model-agnostic recipe that aligns models to react correctly to fine-grained negative queries. Across diverse backbones and training regimes, FINER-Tuning consistently reduces hallucinations and improves paired accuracy on the FINER benchmarks, as well as on a wide range of hallucination and general purpose benchmarks. Despite these gains, high-granularity cases and Wh questions remain challenging. Future work will focus on stronger negation-aware reasoning that comprehensively enhances MLLMs' capabilities. We envision FINER as a start to incentivize better benchmarks and methods.

Limitations. Despite careful filtering, the large-scale benchmark is not fully curated by humans; constructing a noise-free, fully human-validated FINER benchmark is left for future research. Our rule-based MCQ construction enables flexible entity combinations but may reduce question naturalness. Future work could refine phrasing with LLMs or human rewrites while ensuring correctness.
In addition, our Multi-rel subsets contain at most three relations, which, with a suitable data source, could be extended to improve model capabilities and further challenge FINER.

Acknowledgments. This work was supported by the German Research Foundation (DFG): SFB 1233, Robust Vision: Inference Principles and Neural Mechanisms, TP A2, project number: 276693517. This work was partially funded by the ERC (853489 - DEXIM), the German Federal Ministry of Education and Research (BMBF, grant number: 01IS18039A), and the Alfried Krupp von Bohlen und Halbach Foundation, which we thank for their generous support. This work is also supported by Hi! PARIS and the ANR/France 2030 program (ANR-23-IACL-0005). This project was also supported by Google.org with a Google Cloud Platform (GCP) credit award. The authors gratefully acknowledge the scientific support and resources of the AI service infrastructure LRZ AI Systems provided by the Leibniz Supercomputing Centre (LRZ) of the Bavarian Academy of Sciences and Humanities (BAdW), funded by the Bayerisches Staatsministerium für Wissenschaft und Kunst (StMWK). In addition, the authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this project by providing computing time on the GCS Supercomputer JUWELS [18] at Jülich Supercomputing Centre (JSC).

References

[1] Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv, 2024.
[2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv, 2023.
[3] Maximilian Augustin, Yannic Neuhaus, and Matthias Hein. DASH: Detection and assessment of systematic hallucinations of VLMs. In ICCV, 2025.
[4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv, 2025.
[5] Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey. arXiv, 2024.
[6] Cong Chen, Mingyu Liu, Chenchen Jing, Yizhou Zhou, Fengyun Rao, Hao Chen, Bo Zhang, and Chunhua Shen. PerturboLLaVA: Reducing multimodal hallucinations with perturbative visual training. ICLR, 2025.
[7] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? NeurIPS, 2024.
[8] Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David Fouhey, and Joyce Chai. Multi-object hallucination in vision language models. In NeurIPS, 2024.
[9] Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James R Glass, and Pengcheng He. DoLa: Decoding by contrasting layers improves factuality in large language models. In ICLR, 2024.
[10] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv, 2025.
[11] Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and PixMo: Open weights and open data for state-of-the-art vision-language models. In CVPR, 2025.
[12] Ailin Deng, Tri Cao, Zhirui Chen, and Bryan Hooi.
Words or vision: Do vision-language models have blind faith in text? In CVPR, 2025.
[13] Peng Ding, Jingyu Wu, Jun Kuang, Dan Ma, Xuezhi Cao, Xunliang Cai, Shi Chen, Jiajun Chen, and Shujian Huang. Hallu-PI: Evaluating hallucination in multi-modal large language models within perturbed inputs. In ACM MM, 2024.
[14] Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. VLMEvalKit: An open-source toolkit for evaluating large multi-modality models. In ACM MM, 2024.
[15] Jinlan Fu, Shenzhen Huangfu, Hao Fei, Xiaoyu Shen, Bryan Hooi, Xipeng Qiu, and See-Kiong Ng. CHiP: Cross-modal hierarchical direct preference optimization for multimodal LLMs. In ICLR, 2025.
[16] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In CVPR, 2024.
[17] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 2022.
[18] Jülich Supercomputing Centre. JUWELS Cluster and Booster: Exascale pathfinder with modular supercomputing architecture at Jülich Supercomputing Centre. Journal of large-scale research facilities, 2021.
[19] Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, and Zeynep Akata. COSMOS: Cross-modality self-distillation for vision language pre-training. In CVPR, pages 14690–14700, 2025.
[20] LAION. Releasing Re-LAION-5B: Transparent iteration on LAION-5B with additional safety fixes, 2024. Accessed: 30 Aug 2024.
[21] Baiqi Li, Zhiqiu Lin, Wenxuan Peng, Jean de Dieu Nyandwi, Daniel Jiang, Zixian Ma, Simran Khanuja, Ranjay Krishna, Graham Neubig, and Deva Ramanan. NaturalBench: Evaluating vision-language models on natural adversarial samples. In NeurIPS, 2024.
[22] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In EMNLP, 2023.
[23] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[24] Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. In ICLR, 2024.
[25] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.
[26] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024.
[27] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.
[28] Yexin Liu, Zhengyang Liang, Yueze Wang, Xianfeng Wu, Feilong Tang, Muyang He, Jian Li, Zheng Liu, Harry Yang, Sernam Lim, et al. Unveiling the ignorance of MLLMs: Seeing clearly, answering incorrectly. In CVPR, 2025.
[29] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[30] Holy Lovenia, Wenliang Dai, Samuel Cahyawijaya, Ziwei Ji, and Pascale Fung. Negative object presence evaluation (NOPE) to measure object hallucination in vision-language models. In Proceedings of the 3rd Workshop on ALVR, 2024.
[31] Fan Lu, Wei Wu, Kecheng Zheng, Shuailei Ma, Biao Gong, Jiawei Liu, Wei Zhai, Yang Cao, Yujun Shen, and Zheng-Jun Zha. Benchmarking large vision-language models via directed scene graph for comprehensive image captioning. In CVPR, 2025.
[32] Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of ACL, 2022.
[33] Yannic Neuhaus and Matthias Hein. RePOPE: Impact of annotation errors on the POPE benchmark. arXiv, 2025.
[34] Yasumasa Onoe, Sunayana Rane, Zachary Berger, Yonatan Bitton, Jaemin Cho, Roopal Garg, Alexander Ku, Zarana Parekh, Jordi Pont-Tuset, Garrett Tanzer, et al. DOCCI: Descriptions of connected and contrasting images. In ECCV, 2024.
[35] OpenBMB. Large multi-modal models for strong performance and efficient deployment, 2024.
[36] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. NeurIPS, 2023.
[37] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. arXiv, 2018.
[38] Pritam Sarkar, Sayna Ebrahimi, Ali Etemad, Ahmad Beirami, Sercan Ö Arık, and Tomas Pfister. Data-augmented phrase-level alignment for mitigating object hallucination. In ICLR, 2025.
[39] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In CVPR, 2019.
[40] Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented RLHF. arXiv, 2023.
[41] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv, 2023.
[42] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In CVPR, 2024.
[43] Yahan Tu, Rui Hu, and Jitao Sang. ODE: Open-set evaluation of hallucinations in multimodal large language models. In CVPR, 2025.
[44] Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. AMBER: An LLM-free multi-dimensional benchmark for MLLM hallucination evaluation. arXiv, 2023.
[45] Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, et al. The All-Seeing Project V2: Towards general relation comprehension of the open world. In ECCV, 2024.
[46] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv, 2025.
[47] Zhecan Wang, Garrett Bingham, Adams Wei Yu, Quoc V Le, Thang Luong, and Golnaz Ghiasi. HaloQuest: A visual hallucination dataset for advancing multimodal reasoning. In ECCV, 2024.
[48] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal LLMs. In CVPR, 2024.
[49] Tsung-Han Wu, Heekyung Lee, Jiaxin Ge, Joseph E Gonzalez, Trevor Darrell, and David M Chan. Generate, but verify: Reducing hallucination in vision-language models with retrospective resampling.
In NeurIPS, 2025.
[50] Rui Xiao, Sanghwan Kim, Mariana-Iuliana Georgescu, Zeynep Akata, and Stephan Alaniz. FLAIR: VLM with fine-grained language-informed image representations. In CVPR, 2025.
[51] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv, 2025.
[52] Zhihe Yang, Xufang Luo, Dongqi Han, Yunjian Xu, and Dongsheng Li. Mitigating hallucinations in large vision-language models via DPO: On-policy data hold the key. In CVPR, 2025.
[53] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv, 2023.
[54] Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. RLHF-V: Towards trustworthy MLLMs via behavior alignment from fine-grained correctional human feedback. In CVPR, 2024.
[55] Tianyu Yu, Haoye Zhang, Yuan Yao, Yunkai Dang, Da Chen, Xiaoman Lu, Ganqu Cui, Taiwen He, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. RLAIF-V: Aligning MLLMs through open-source AI feedback for super GPT-4V trustworthiness. In CVPR, 2025.
[56] Zongmeng Zhang, Wengang Zhou, Jie Zhao, and Houqiang Li. Robust multimodal large language models against modality conflict. In ICML, 2025.
[57] Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, and Conghui He. Beyond hallucinations: Enhancing LVLMs through hallucination-aware direct preference optimization, 2023.
[58] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. LlamaFactory: Unified efficient fine-tuning of 100+ language models. In ACL, 2024.
[59] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017.
[60] Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao. Aligning modalities in vision large language models via preference fine-tuning. arXiv, 2024.
[61] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In ICLR, 2023.

Supplementary Material

A. Extended Related Works

A.1. Hallucination benchmarks

CHAIR [37] benchmarks object hallucination in image captioning by measuring how many generated words actually appear in the image, based on ground-truth captions and object segmentations. However, the CHAIR metric suffers from instability issues [22]. POPE [22] simplifies hallucination detection by asking models yes-or-no questions. RePOPE [33] identifies annotation errors in POPE and provides a revised version. AMBER [44] evaluates hallucinations in both generative and discriminative settings. In the discriminative setting, it categorizes hallucinations into "object," "relation," and "attribute" types. A common limitation of these benchmarks is their reliance on the MSCOCO dataset [23]. To better detect object hallucinations at scale, DASH [3] adopts a retrieval-based approach to select images from LAION-5B [20]. CRPE [45] focuses on relation-based hallucinations but limits its evaluation to single-relation cases.

Beyond hallucination detection, MMMC [56] introduces the concept of "modality conflicts," referring to mismatches between the image and the text query, an approach we consider coarse-grained negative querying.
FLAIR [50] constructs DOCCI-FG, which also adopts DOCCI captions to test how well vision-language models understand images from a fine-grained perspective. COSMOS [19] evaluates and further improves fine-grained vision-language alignment via a self-distillation approach. The "blind faith in text" phenomenon [12] shows that when a conflicting textual context is prefixed to a query, models tend to trust the text more than the image. Similarly, Hallu-PI [13] evaluates hallucinations by appending additional images or texts as a perturbation. In our work, we do not add extra textual context. Instead, we design user queries that contain subtle and nuanced conflicts with the image, allowing us to study hallucination behavior without altering the conversational setup. MMVU [28] also proposes a benchmark that investigates "negative questions." The key difference is that our work studies this problem at a finer level of granularity.

HaloQuest [47] includes a "false premise" subset with a similar motivation to our Wh setting. However, our setting differs because our false premises lie in the fine-grained attributes of existing objects, while HaloQuest mainly focuses on non-existent objects. Likewise, NOPE [30] mainly evaluates hallucinations involving non-existent objects but does not test hallucinations related to attributes or relations. ROPE [8] evaluates object hallucinations by prompting the MLLM to pick the correct objects corresponding to multiple input visual prompts. While this approach shares similarity with our Multi-obj subset, we aim for more flexibility by directly inserting the negative object at a random position in the prompt, and we do not rely on bounding-box annotations from MSCOCO-Panoptic [23] or ADE20K [59]. ODE [43] introduces an open-set dynamic hallucination evaluation to prevent data contamination.
This also aligns with our intuition to adopt DOCCI [34] as an additional data source and create the less-saturated FINER-DOCCI.

A.2. Hallucination-aware Fine-tuning

To reduce hallucinations, various fine-tuning techniques have been developed for MLLMs. Closely related to our motivation, LRV-Instruct [24] applies supervised fine-tuning (SFT) to MiniGPT-4 [61] and mPLUG-Owl [53], and introduces negative instructions by manipulating objects and factual knowledge using GPT-4 [2]. HALVA [38] leverages Gemini Vision Pro [41] to construct both correct and hallucinated responses, and applies a contrastive loss between them, explicitly pushing the model away from hallucinated generations.

PerturboLLaVA [6] appends misleading textual context as perturbations generated by GPT-4o [2] and trains the model via instruction tuning to remain robust under such distracting inputs. REVERSE [49] expands the model's vocabulary with special uncertainty tokens and builds a large-scale instruction-following dataset; the model learns to perform retrospective reasoning whenever these tokens are triggered, allowing it to revise potentially hallucinated content. RLHF-V [54] and LLaVA-RLHF [40] apply reinforcement learning from human feedback (RLHF) to vision-language models, using human preference signals to improve response quality and reduce hallucinations. RLAIF-V [55] instead leverages AI feedback (RLAIF): a stronger teacher model provides automatic preference judgments, and the student model is updated in a self-evolving manner over multiple training rounds.

Several studies employ Direct Preference Optimization (DPO) to reduce hallucinations. OPA-DPO [52] constructs on-policy data for hallucination mitigation and uses GPT-4V for fine-grained hallucination correction in the training set.
CHiP [15] decomposes the DPO objective into response-level, segment-level, and token-level components to better localize hallucinations. HA-DPO [57] also uses GPT-4 [2] to identify and correct hallucinations in model outputs. POVID [60] adopts GPT-4V to inject hallucinated objects, attributes, and relations directly into the dispreferred responses, encouraging the model to reject these patterns during training.

In light of these works, our approach differs in three main aspects. First, most prior studies [38, 40, 52, 54, 55, 57, 60] focus on detecting and correcting hallucinations in model responses, whereas we explicitly construct fine-grained negative input queries at the object, attribute, and relation level. Second, previous efforts [38, 52] primarily target the LLaVA family, while we directly post-train several state-of-the-art MLLMs and evaluate them on the FINER benchmarks, improving the models' robustness against nuanced errors in queries. Third, FINER-Tuning follows the standard DPO algorithm and does not require multi-iteration training as in RLAIF-V. Unlike prior works [6, 24, 38, 52, 57, 60] that rely heavily on costly closed-source models to build training data, we propose a scalable pipeline that uses an open-source LLM [1] to generate high-quality preference pairs from existing long-caption datasets.

B. FINER Benchmark Details

In this section, we describe the construction of FINER-COMPRECAP and FINER-DOCCI. FINER-COMPRECAP starts from human-annotated positive scene graphs (SGs) with minor edits (Sec. B.1). FINER-DOCCI derives positive SGs from dense captions (Sec. B.2). We then apply the same negative-generation and filtering pipeline to obtain negative SGs (Sec. B.3). Finally, both positive and negative SGs are converted into benchmark questions via our rule-based MCQ pipeline (Sec. B.4).
The two benchmarks are motivated slightly differently. FINER-COMPRECAP builds on human-annotated SGs, supporting more precise evaluation. In contrast, FINER-DOCCI explores whether dense captions can be used to synthesize SGs beyond COCO object classes and images, enabling open-set evaluation [43] at substantially larger scale. As a result, FINER-DOCCI is primarily designed to validate our findings at scale, rather than to maximize per-sample annotation fidelity.

B.1. Positive SG for FINER-COMPRECAP

CompreCap [31] offers 560 human-annotated images, each with a scene-graph (SG) annotation. Each SG annotation already consists of objects, attributes, and relations. The attribute annotations in the original SG are lists of simple sentences, which we rewrite with Qwen3-14B [51] into "with {attr}" phrases without changing their original meaning. The original relation annotations are also sentences describing a relation between a subject and an object. Therefore, we use a rule-based method to parse the relation sentences into dictionary-like annotations. These steps are necessary because we need to combine objects, relations, and attributes in our MCQ construction. We manually inspect the positive annotations to ensure their integrity. Since our preprocessing only changes sentence structure and does not introduce new annotations, it is robust. We provide an example SG in Fig. 7. As shown in Fig. 7, the original attribute "The cat is black and orange" is rewritten as "with a black and orange color". Meanwhile, the original relation "The cat is lying on a desk" is parsed into a dictionary-like structure.

B.2. SG Extraction Pipeline for FINER-DOCCI

DOCCI [34] consists of 5,000 images, each paired with a detailed human-annotated caption. Such rich descriptions already contain the necessary information about objects, attributes, and relations. Fig.
8 shows an example caption together with the positive scene graph extracted by Gemini-2.0-Flash [10].

Directly prompting an LLM to "summarize" a full scene graph is known to be brittle and prone to errors. Instead, inspired by PerturboLLaVA [6], which prompts an LLM to extract objects, attributes, and relations from long captions, we design a conservative two-stage extraction pipeline that decomposes the task into simpler subproblems and incorporates explicit cross-checks and human validation.

Stage 1: object and attribute extraction. In the first stage, we only ask Gemini-2.0-Flash to extract objects and their attributes from the caption. The model is instructed to copy phrases verbatim from the caption and to avoid inventing new entities or attributes. This turns the problem into a pure information-extraction task rather than open-ended generation. The prompt is visualized in Fig. 24. Human annotators inspect randomly sampled outputs to check the robustness of this stage, as the model only needs to detect and group textual mentions instead of inferring unseen content.

Stage 2: relation extraction and validation. In the second stage, we consider pairs of extracted objects and ask Gemini-2.0-Flash whether the caption explicitly states a relation between them. Given the full caption and a candidate object pair, Gemini is instructed to either (i) return the exact relation phrase from the caption, or (ii) return nothing if no relation is explicitly mentioned. The model is explicitly told not to infer or imagine relations that are not written in the caption. This again restricts Gemini to acting as an information extractor, which increases reliability. The prompt is displayed in Fig. 25.

Even with these restrictions, some errors in the extracted relations remain. To further filter noisy relations, we perform a joint visual-textual validation step.
For each candidate relation, we:
• run a binary classifier with Qwen2.5-VL-72B [4] to decide whether the relation holds in the image; and
• query Gemini again, this time asking whether the relation is explicitly supported by the caption.
If both models disagree with the proposed relation, we discard it. Among the misclassified relations, we further ask human annotators to verify a subset of 400 samples and, whenever they spot errors, remove incorrect extracted relation annotations.

Figure 6. Positional bias analysis on FINER-COMPRECAP. We select all q±_multi-obj, q±_multi-attr, and q±_multi-rel that contain three entities. Since each q− always has exactly one negated entity, we cyclically move that negated entity to each of the three positions (and move the corresponding positive entity accordingly), and compute the averaged paired accuracy Acc_paired for each position. [Plot panels: paired accuracy (%) versus position for Multi-obj, Multi-attr, and Multi-rel, comparing LLaVA-NeXT, Qwen2.5-VL-7B, InternVL3.5-8B, and InternVL3.5-14B, base versus with FINER-Tuning.]

Figure 7. Example of positive scene graph (SG) in FINER-COMPRECAP. CompreCap [31] already pairs each image with an SG-like annotation. We further adopt Qwen3-14B [51] to simply rewrite attribute sentences into phrases. Original SG: {"object": "cat", "attribute": ["The cat is black and orange.", "It has large, round, golden-yellow and black eyes that stand out.", "Its ears are pointy and alert."], "relation": ["The cat is lying on a desk.", "The cat is in the drawer."]}. Rewritten positive SG: {"object": "cat", "attribute": ["with a black and orange color", "with large, round, golden-yellow and black eyes that stand out", "with pointy and alert ears"], "relation": [{"object_a": "cat", "rel": "is lying on the", "object_b": "desk"}, {"object_a": "cat", "rel": "is in the", "object_b": "drawer"}]}.
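The two-model agreement rule described above can be sketched as follows. Both callables are hypothetical stand-ins: `vlm_judges_relation` for the Qwen2.5-VL-72B binary classifier on the image, and `caption_supports_relation` for the second Gemini query on the caption; the paper does not specify these interfaces.

```python
def filter_relations(relations, image, caption,
                     vlm_judges_relation, caption_supports_relation):
    """Keep a candidate relation unless BOTH checks reject it.

    vlm_judges_relation(image, rel) -> bool: does a strong MLLM
        judge the relation to hold in the image?
    caption_supports_relation(caption, rel) -> bool: is the relation
        explicitly stated in the caption?
    """
    kept = []
    for rel in relations:
        visual_ok = vlm_judges_relation(image, rel)
        textual_ok = caption_supports_relation(caption, rel)
        # discard only when both models disagree with the relation
        if visual_ok or textual_ok:
            kept.append(rel)
    return kept
```

This keeps recall reasonable while still removing relations that neither modality supports; the subsequent human pass then handles the residual errors.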
In total, this joint process of Qwen2.5-VL, Gemini, and humans filters out 1,771 relations. Overall, this pipeline is deliberately conservative: we only keep relations that are supported by the caption (via extraction) and by the image (via a strong MLLM), with additional human checks on top. This design prioritizes precision over recall and makes our extracted SG for FINER-DOCCI more reliable despite the known challenges of using LLMs for scene-graph extraction.

Quality Assessment. To assess the quality of the extracted objects, attributes, and relations in the positive SG of FINER-DOCCI, we run InternVL3.5-8B [46] as a binary classifier. For each extracted object, attribute, or relation, the model is asked to answer "Yes" or "No" regarding its presence in the image. As a baseline, we apply the same procedure to the positive SG of FINER-COMPRECAP, whose scene graphs are human-annotated. The results are reported in Tab. 6. InternVL3.5-8B achieves comparable performance (96.4% vs. 96.1%) when classifying ground-truth objects in both benchmarks. For attributes, its accuracy on FINER-DOCCI is 3.2% lower than on FINER-COMPRECAP. Given that the SG in FINER-DOCCI is much larger in scale than in FINER-COMPRECAP (see Tab. 7), this gap is acceptable. Notably, the accuracy on relations in the positive SG of FINER-DOCCI is slightly higher than that of FINER-COMPRECAP (85.1% vs. 82.8%). This likely reflects that the relation annotations in FINER-DOCCI are more detailed, providing the MLLM with more information to verify their correctness, rather than indicating that the human-annotated relations in FINER-COMPRECAP are of lower quality.

B.3. Negatives Generation Pipeline
Having obtained the positive scene graphs (SGs) for both FINER-COMPRECAP and FINER-DOCCI, we construct a pipeline for generating negatives. For each object (OBJ), attribute (ATTR), and relation (REL), we generate four negative counterparts, denoted as NEG OBJ, NEG ATTR, and NEG REL.

LLM-based negatives proposal. We first use an LLM as a "negatives generator". For FINER-DOCCI we use Gemini-2.0-Flash [41], and for FINER-COMPRECAP we use Qwen3-14B [51]. Given a positive phrase (OBJ, ATTR, or REL), the LLM is prompted to produce four negative phrases that have the opposite or a clearly different meaning from the positive. This step is efficient and does not directly inherit visual biases from any vision model, since it operates purely in text space. A limitation of this step is that some generated negatives may in fact describe entities that are present in the image. Such "false negatives" are harmful for evaluation.

Figure 8. Example positive scene graph (SG) extracted by Gemini-2.0-Flash [41]. Given a long human-annotated caption from DOCCI [34], we apply a two-stage extraction pipeline to obtain the positive SG. Caption: "An outdoor front view of a turtle that is sitting on a floating tree trunk that has moss growing at the front of it. The turtle is yellow and green and has a dark green shell. The turtle is pointing his head up and soaking up the sun. On the water, there are a couple pieces of foam floating in the swamp. In the far background, there are multiple dried pieces of grass. On the far left side of the swamp, there is a fallen tree trunk that has moss on it." Extracted positive SG: [{"object": "turtle", "object_index": 0, "attribute": ["with a yellow and green color", "with a dark green shell", "with his head pointing up", "with a posture soaking up the sun"], "relation": [{"object_a": 0, "rel": "is sitting on the", "object_b": 1}]}, {"object": "trunk", "object_index": 1, "attribute": ["with moss growing at the front"], "relation": []}]. After negatives generation (Sec. B.3): {"example_id": "test_00001", "objects": [{"object": "turtle", "object_index": 0, "neg_object": ["frog", "fish", "duck", "beaver"], "attribute": ["with a yellow and green color", "with a dark green shell", "with his head pointing up", "with a posture soaking up the sun"], "neg_attribute": [["with a red and blue color", "with a purple and orange color", "with a black and white color", "with a silver and gold color"], ["with a light blue shell", "with a bright pink shell", "with a pale yellow shell", "with a dull grey shell"], ["with his head pointing down", "with his head pointing left", "with his head buried in the sand", "with his head turned away"], ["with a posture shivering in the cold", "with a posture running from the rain", "with a posture hiding in the shadows", "with a posture digging in the dirt"]], "parsed_relation": [{"object_a": 0, "rel": "is sitting on the", "object_b": 1}], "neg_relation": [["is hanging from the", "is running to the left of the", "is falling behind the", "is standing under the"]]}, {"object": "trunk", "object_index": 1, "neg_object": ["root", "branch", "bottle", "stump"], "attribute": ["with moss growing at the front"], "neg_attribute": [["with dirt covering the back", "with a carved wooden handle", "with a crack running down the side", "with a hole worn in the top"]], "parsed_relation": [], "neg_relation": []}]}.

Table 6. Quality assessment of the extracted positive objects, attributes, and relations for FINER-DOCCI using InternVL3.5-8B [46] as a binary classifier. As a baseline, we also run InternVL3.5-8B as a binary classifier to classify the human annotations from FINER-COMPRECAP.

           FINER-COMPRECAP      FINER-DOCCI
           Obj   Attr   Rel     Obj   Attr   Rel
Acc. (%)   96.4  91.5   82.8    96.1  88.3   85.1
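A minimal sketch of the text-only proposal step, assuming a generic chat-model callable `llm` that stands in for Gemini-2.0-Flash or Qwen3-14B; the prompt wording and interface are illustrative, not the paper's actual prompt:

```python
import json

# Illustrative prompt template (not the paper's exact wording)
NEG_PROMPT = (
    "Given the phrase '{positive}', propose four phrases with the "
    "opposite or a clearly different meaning. "
    "Answer with a JSON list of four strings."
)

def propose_negatives(llm, positive):
    # llm: text-only chat callable (hypothetical interface);
    # no image is passed, so this step cannot inherit visual biases
    raw = llm(NEG_PROMPT.format(positive=positive))
    negatives = json.loads(raw)
    if len(negatives) != 4:
        raise ValueError("expected exactly four negative phrases")
    return negatives
```

Because the generator never sees the image, some proposals can still describe visible entities; those are caught by the entropy-based filter described next.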
Given the scale of the two positive SGs, pure human validation on the whole set is unfortunately not possible, so we need an automatic way to detect and filter these false negatives.

MLLM-based discrimination and entropy. To filter these cases, we use Qwen2.5-VL-72B [4] as a visual discriminator. For each positive phrase $x$ (where $x$ can be either OBJ, ATTR, or REL) and its four candidate negatives $\{x^-_j\}_{j=1}^{4}$, we form a five-choice multiple-choice question with the candidate set $C(x) = \{x, x^-_1, x^-_2, x^-_3, x^-_4\}$. We query Qwen2.5-VL-72B with the image and the set $C(x)$, and obtain a probability distribution $p = (p_1, \dots, p_5)$, with $\sum_{i=1}^{5} p_i = 1$, over the five choices. We treat the original positive $x$ as the correct label. If the model selects $x$, the classification is correct; otherwise it is misclassified. We compute the entropy of the model output

$$H(p) = -\sum_{i=1}^{5} p_i \log p_i, \quad (6)$$

where the logarithm is natural. Low entropy means that the model is very confident in one of the options, while high entropy indicates uncertainty. If Qwen2.5-VL-72B misclassifies by choosing one negative while maintaining very low entropy, this indicates high confidence in its prediction. This likely reflects that the chosen entity somehow exists in the image (or, of course, the model can also be too confident about an actually wrong prediction).

We show several examples in Fig. 9. Empirically, we observe that many bad negatives that actually appear in the image lead to misclassifications with very low entropy. For example, in one sample, "ground" is proposed as a negative for the object "wall". Since the ground region is clearly visible in the image, Qwen2.5-VL-72B strongly prefers the option "ground", with an entropy of $H(p) = 0.0119$. This indicates that the model is highly confident that "ground" is present in the image, and therefore this negative should be rejected.
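The low-entropy misclassification check can be sketched as below. The distribution `p` would come from the discriminator's scores over the five options; the function names are ours, not the paper's.

```python
import math

def entropy(p):
    # Shannon entropy with natural logarithm, as in Eq. (6)
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)

def needs_regeneration(p, positive_idx, theta):
    """Flag a negatives set for regeneration when the discriminator
    confidently (entropy below theta) picks an option other than the
    positive phrase: the chosen negative likely appears in the image."""
    predicted = max(range(len(p)), key=lambda i: p[i])
    return predicted != positive_idx and entropy(p) < theta
```

High-entropy misclassifications are deliberately left alone: under this rule they are retained as hard but valid negatives rather than regenerated.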
In such cases, we prompt the LLM again and rewrite the negative, for example from "ground" to "ceiling", which does not appear in the image. However, low entropy does not always mean that the negative actually appears in the image; the MLLM can also be confidently wrong. For instance, in the car example in Fig. 9, Qwen2.5-VL-72B misclassifies the relation phrase "is behind the" with low entropy H(p) = 0.0119, even though "is behind the" is a valid negative. In this case, we still replace it with a new negative proposal such as "is on top of the", which remains valid. Since our primary goal is to remove negatives that truly appear in the image, occasionally regenerating valid negatives is acceptable.

Entropy-based filtering with human verification. We denote the entropy filtering threshold as θ. For each benchmark and each level (object, attribute, relation), we choose a separate threshold θ. To set these thresholds, we first run Qwen2.5-VL-72B on the entire dataset and record, for each example, the model prediction and the corresponding entropy H(p). We then collect all misclassified examples and sort them in ascending order of entropy. Starting from the lowest-entropy region, a human annotator verifies 10 misclassified examples and labels whether the proposed negative actually appears in the image.

Table 7. Statistics for generating the negative scene graphs for FINER-COMPRECAP (denoted as C-SG) and FINER-DOCCI (denoted as D-SG). Counts: number of objects, attributes, and relations in the SG annotation. θ: entropy-based filtering threshold; #Re-gen.: number of regenerated negatives.

Benchmark        θ     Counts    #Re-gen.
C-SG   Obj      0.8     3,505       320
       Attr     0.8     4,509       414
       Rel      0.4     3,494       173
D-SG   Obj      0.8    24,528     3,242
       Attr     0.4    52,911     2,827
       Rel      0.8    15,342     2,143
We then incrementally increase the candidate entropy threshold and, at each step, again sample 10 misclassified examples around the current threshold for human verification. We repeat this process until no "bad negatives" (negatives that truly appear in the image) are found among the 10 inspected samples; we then take the current entropy value as the threshold θ, such that misclassified examples with H(p) < θ are likely to be true false negatives (the negative phrase is in the image), while those with higher entropy are retained as hard but valid negatives.

During the full pipeline, each negative candidate that leads to a misclassification with H(p) < θ is sent back to the LLM and regenerated. The new proposal is checked again by Qwen2.5-VL-72B with the same procedure. After each round of regeneration and classification, we subsample a small set of misclassified examples and ask a human annotator to inspect the remaining negatives. This human-in-the-loop process reduces the risk of systematic errors introduced by the automatic filtering pipeline. We summarize the thresholds θ, the total number of samples, and the number of regenerated negatives for each benchmark and each level (Obj, Attr, Rel) in Tab. 7.

Quality Assessment. Given the scale of our benchmarks, we adopt a model-based assessment approach. We assess the quality of the generated negatives by evaluating Qwen2.5-VL-72B on objects (Obj), attributes (Attr), and relations (Rel) in FINER-COMPRECAP and FINER-DOCCI. Tab. 8 reports the corresponding classification accuracies. For example, Qwen2.5-VL-72B achieves 94.1% accuracy when selecting the positive relation from its four negative counterparts in FINER-COMPRECAP, which supports the quality of the constructed negatives in this benchmark. On FINER-DOCCI, the model attains close to 90% accuracy on objects and attributes. Note that FINER-DOCCI is designed to test whether rich, human-described semantics can enable large-scale hallucination evaluation, rather than to build a small, noise-free benchmark fully curated by humans. Given its substantially larger scale and higher difficulty, we consider the achieved classification accuracies on the negatives sufficient to validate our findings at scale.

Table 8. Quality assessment of generated negatives. We show the classification accuracy of Qwen2.5-VL-72B [4] when classifying the objects (Obj), attributes (Attr), and relations (Rel) in FINER-COMPRECAP and FINER-DOCCI.

           FINER-COMPRECAP        FINER-DOCCI
           Obj   Attr  Rel        Obj   Attr  Rel
Acc. (%)   89.8  91.1  94.1       89.5  88.3  82.8

B.4. MCQ Design

Having obtained the positive SG and negative SG for FINER-COMPRECAP and FINER-DOCCI, we now construct MCQs. Sec. 2.1 already provides an overview of our MCQ construction pipeline: we use a fixed template to compose both positive and negative MCQs (q±_multi-obj, q±_multi-attr, q±_multi-rel). For q±_Wh, we prompt Gemini-2.0-Flash to construct the question templates. We describe the two templates in detail below.

Fixed question template. We use a simple yes/no-style template for all q±_multi-obj, q±_multi-attr, and q±_multi-rel. To make the format explicit, we display it as a small template box:

    Can you see {X} in this image?
    A. Yes, I can see {Y} in this image.
    B. No, but I can see {Z1} in this image.
    C. No, but I can see {Z2} in this image.
    D. No, but I can see {Z3} in this image.
    E. No, but I can see {Z4} in this image.

Here, {X}, {Y}, and {Z1}, ..., {Z4} are placeholders that will later be filled with phrases. In the benchmark, the choices are randomly shuffled.
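A small sketch of how this fixed template can be instantiated and shuffled (our own illustration; function names and phrases are hypothetical):

```python
import random

def build_mcq(x, y, z_phrases, correct_phrase, rng=random):
    """Fill the fixed yes/no template: {X} goes in the question, {Y} in
    the "Yes" option, {Z1}..{Z4} in the "No, but" options. Options are
    shuffled; return the question, lettered options, and correct letter."""
    question = f"Can you see {x} in this image?"
    options = [(y, f"Yes, I can see {y} in this image.")]
    options += [(z, f"No, but I can see {z} in this image.") for z in z_phrases]
    rng.shuffle(options)
    letters = "ABCDE"
    answer = letters[[ph for ph, _ in options].index(correct_phrase)]
    body = "\n".join(f"{letters[i]}. {t}" for i, (_, t) in enumerate(options))
    return question, body, answer

# Positive MCQ: the "Yes" option carries the true phrase and is correct.
pos = "a dog, a ball, and a tree"
negs = ["a cat, a ball, and a tree", "a dog, a kite, and a tree",
        "a dog, a ball, and a lamp", "a fox, a ball, and a tree"]
q, opts, ans = build_mcq(pos, pos, negs, correct_phrase=pos,
                         rng=random.Random(0))
```

For a negative MCQ, the same call flips the roles: a corrupted phrase fills both x and y, while the true phrase becomes one of the `z_phrases` and is passed as `correct_phrase`.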
Figure 9. Examples of entropy-based filtering for objects, attributes, and relations. The corresponding objects are shown with red bounding boxes. The ground-truth object/attribute/relation is highlighted in green. We prompt Qwen2.5-VL-72B [4] to select the positive among four negatives. Green text indicates that the model makes an incorrect prediction and chooses a negative with a low entropy score. Blue text shows new negative candidates generated by the LLM. The examples are from both FINER-COMPRECAP and FINER-DOCCI.

[Figure 9 panel contents: per-example GT entity, chosen negative, entropy, and regenerated negative, e.g. GT object "wall" vs. chosen "ground" (entropy 0.0119, regenerated as "ceiling"); GT attribute "with a gray color" vs. chosen "with a brown color" (entropy 0.0009, regenerated as "with a pink color"); GT relation "is to the right of" vs. chosen "is behind the" (entropy 0.3826, regenerated as "is on top of the").]

Construction of q±_multi-obj, q±_multi-attr, and q±_multi-rel. We only describe the construction process for q±_multi-obj; the same procedure is applied to q±_multi-attr and q±_multi-rel.

From the positive SG of an image, we first sample k distinct objects and concatenate them into a positive multi-object phrase P+_obj (for example, "dog, ball, and tree"). This phrase P+_obj contains only objects that truly appear in the image. We then randomly select one of these k objects, denote the selected object by o, and retrieve its four negative counterparts {o−_j}, j = 1, ..., 4, from the negative SG. For each j ∈ {1, ..., 4}, we form a corrupted phrase P−_obj,j by replacing o in P+_obj with o−_j while keeping all other objects unchanged. Thus we obtain one positive phrase P+_obj and four negative phrases P−_obj,1, ..., P−_obj,4.

To build a positive MCQ q+_multi-obj, we instantiate the template by setting

    {X} = P+_obj, {Y} = P+_obj, {Zj} = P−_obj,j for j = 1, ..., 4.

In this case, the question and the "Yes" option both describe the true configuration P+_obj, while each "No, but I can see {Zj}" option contains exactly one incorrect object. The option that contains P+_obj is treated as the correct answer.

To build a negative MCQ q−_multi-obj, we flip the roles of the positive and corrupted phrases in the template. We randomly choose one corrupted phrase, say P−_obj,1, and set

    {X} = P−_obj,1, {Y} = P−_obj,1, {Z1} = P+_obj, {Zj} = P−_obj,j for j = 2, 3, 4.

Now the question asks about the corrupted phrase P−_obj,1, which does not match the image. Consequently, the "Yes" choice becomes a false-positive option, because it incorrectly confirms the existence of P−_obj,1. The option that says "No, but I can see P+_obj in this image" is now the correct answer, since it both denies the existence of P−_obj,1 and affirms the true configuration P+_obj. Note that we randomly pick which corrupted phrase is used as the query, so each of P−_obj,1, ..., P−_obj,4 has an equal chance to replace {X}.
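The phrase-corruption step above can be sketched as follows (a minimal illustration with made-up objects and negatives):

```python
def make_phrase(objects):
    """Concatenate sampled objects into a multi-object phrase."""
    if len(objects) == 1:
        return objects[0]
    return ", ".join(objects[:-1]) + ", and " + objects[-1]

def corrupted_phrases(objects, slot, negatives):
    """Replace the object at position `slot` with each of its four
    negative counterparts, keeping all other objects unchanged."""
    phrases = []
    for neg in negatives:
        objs = list(objects)
        objs[slot] = neg
        phrases.append(make_phrase(objs))
    return phrases

positive = make_phrase(["dog", "ball", "tree"])     # "dog, ball, and tree"
negs = corrupted_phrases(["dog", "ball", "tree"], slot=1,
                         negatives=["cat", "bench", "lamp", "kite"])
print(negs[0])   # dog, cat, and tree
```

Each corrupted phrase differs from the positive phrase in exactly one slot, which is what makes the resulting MCQs fine-grained: four of the five listed entities are genuinely present.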
This fixed pattern keeps the surface form of the questions consistent across all MCQs while allowing the underlying content to vary. The same construction is applied to q±_multi-attr and q±_multi-rel by treating attribute phrases and relation phrases as the basic units instead of objects.

Wh question generation. Wh questions have more flexible surface forms than yes/no questions. To construct Wh-style questions, we start from a relation triplet in the scene graph, (OBJ1, REL, OBJ2), where OBJ1 and OBJ2 are two objects and REL is the relation between them. Each object can have one or more attributes, e.g. A(OBJ1) for the first object.

Given a triplet (OBJ1, REL, OBJ2), we randomly choose one of the two objects as the answer target and treat the other as context. Concretely, we either ask about OBJ1 given OBJ2 or about OBJ2 given OBJ1. We then mask the answer target in the textual description and prompt Gemini-2.0-Flash to produce a natural Wh question. For example, for the relation (dog, is standing under, table), Gemini-2.0-Flash can generate questions such as "What is standing under the table?" (asking about the dog) or "What is the dog standing under?" (asking about the table).

Wh MCQ template. Once we fix the Wh question pattern for a given triplet, we turn it into an MCQ by providing five answer options. We represent the question body and the five options using placeholders:

    Q: {Q}
    A. {O1}
    B. {O2}
    C. {O3}
    D. {O4}
    E. {C}

Here, {Q} is the Wh question text, {O1}, ..., {O4} are object-level answer candidates, and {C} is a full-sentence correction option that explicitly talks about the attribute of the target object. In the benchmark, the choices are randomly shuffled.

Construction of q±_Wh. We illustrate the construction using the running example with the context object "dog" and the answer target "table". The dog has a positive attribute A+ (e.g.
"with brown fur") and a sampled negative attribute A− (e.g. "with yellow fur"), while the relation and context (e.g. "standing under the table") are fixed by the triplet (OBJ1, REL, OBJ2). From the positive SG, we select "table" as the target object o⋆. We then randomly pick three negative objects o−_1, o−_2, o−_3 for this slot from the negative SG (e.g. "chair", "bench", "sofa").

Starting from the Wh question "What is the dog standing under?", we insert an attribute phrase for the dog and obtain an attribute-conditional question template q(A) ≡ "What is the dog A standing under?". Filling this template with A+ or A− gives a positive or negative Wh question with the same surface pattern. Note that in the FINER benchmarks, a single object can have multiple attributes. In that case, we include all of its attributes in the descriptive context, then randomly choose one of them as the target attribute A+ and sample the corresponding negative attribute as A−.

Positive Wh MCQ. For the positive Wh question q+_Wh, we fill the attribute slot with the true attribute A+ and instantiate the MCQ template as

    {Q} = q(A+), {O1} = o⋆, {Oj} = o−_{j−1} for j = 2, 3, 4,
    {C} = "The dog is not A+, but is A−."

The question {Q} is now a valid Wh question about the image, and {O1} (the true object o⋆) is the correct answer. The three options {O2}, {O3}, {O4} are incorrect objects, and the correction sentence {C} is also incorrect because it denies the true attribute A+.

Negative Wh MCQ. For the negative Wh question q−_Wh, we instead fill the question template with the negative attribute A−, which makes the premise of the question partially inconsistent with the image. We keep the same four object candidates but flip the correction sentence:

    {Q} = q(A−), {O1} = o⋆, {Oj} = o−_{j−1} for j = 2, 3, 4,
    {C} = "The dog is not A−, but is A+."

Now the question {Q} is incorrect with respect to the image, because it attributes A− to the dog. The object-only options {O1}, ..., {O4} all implicitly accept the wrong attribute in the question and are therefore treated as incorrect. The correction option {C} is the unique correct answer: it denies the wrong attribute A− and restores the true attribute A+.

In summary, q+_Wh asks a Wh question whose premise matches the image and is answered by the true object o⋆, while q−_Wh asks a Wh question whose premise uses a corrupted attribute and is correctly answered only by the explicit correction sentence. This construction mirrors the positive/negative symmetry used for the yes/no-style templates and keeps the Wh MCQs tightly grounded in the underlying scene graph.

Benchmark statistics. As described in Sec. 2.1, our MCQ design constructs both positive and negative questions for four settings: q±_multi-obj, q±_multi-attr, q±_multi-rel, and q±_wh. We present the detailed statistics of FINER-COMPRECAP and FINER-DOCCI in Tab. 9.

Table 9. Distribution of MCQ pairs over entity counts in FINER-COMPRECAP (FINER-C) and FINER-DOCCI (FINER-D). For each setting, k denotes the entity count for Obj/Attr/Rel and n_k the corresponding number of pairs, listed in matching order. (1, 6) means that k ranges from 1 to 6.

Benchmark  Setting        k       # pairs n_k
FINER-C    q±_multi-obj   (1, 6)  560, 560, 560, 558, 535, 377
           q±_multi-attr  (1, 3)  966, 472, 231
           q±_multi-rel   (1, 3)  1217, 616, 307
           q±_wh          -       1583
FINER-D    q±_multi-obj   (1, 6)  65, 496, 909, 980, 874, 1676
           q±_multi-attr  (1, 5)  2451, 5363, 3092, 1575, 1843
           q±_multi-rel   (1, 3)  4404, 1168, 199
           q±_wh          -       10472

Post-hoc correction of MCQs.
After constructing the MCQs for FINER-COMPRECAP and FINER-DOCCI, humans further corrected a subset of them: 100 MCQs per setting for FINER-COMPRECAP and 200 MCQs per setting for FINER-DOCCI. In the 3-relation subset of FINER-DOCCI, we additionally observed cases where multiple relations referred to the same objects. We therefore performed further human cleaning, resulting in 199 improved paired MCQs in this setting.

C. Training Details

Sec. 3 explains our training data generation pipeline, on which FINER-Tuning is trained. We also briefly describe the fine-tuning setup in Sec. 4.1. In this section, we first present concrete examples of the training data, and then provide the detailed fine-tuning configuration.

Training set examples. We apply the training data construction pipeline from Fig. 3 to the first 24 shards of Pixmo-caption [11]. As described in Sec. 3, each image x can yield up to eight preference tuples (x, q, a+, a−) across the four subsets {OBJ, ATTR, REL, WH}. Applying the pipeline to 24 shards produces more than 1.6M preference tuples, which is more than we need for training. In practice, we only use the first 6 shards (about 440K tuples) and uniformly subsample at most 160K tuples for DPO training. We visualize representative training examples (x, q, a+, a−) from all four subsets in Fig. 10.

Finetuning Setup. We summarize the training hyperparameters for FINER-Tuning in Tab. 10. All models are trained with LLaMA-Factory [58], using LoRA [17] as the parameter-efficient fine-tuning method. We apply LoRA adapters only to the projection layers q_proj and v_proj. We reserve 0.5% of the data as a validation set.
Since the validation distribution closely matches the training distribution, we observe that training for too long drives the validation loss close to zero and brings little or no performance gain, sometimes even degrading downstream results. For DPO training, we therefore limit the number of training samples for each model: LLaVA-1.6 is trained on 40K examples, Qwen2.5-VL on 120K, and the InternVL3.5 series on 160K. For the SFT experiments in Tab. 4, we fine-tune InternVL3.5-8B on 160K examples with a learning rate of 1 × 10^-4. We use 4 NVIDIA H100 94GB GPUs to train InternVL3.5-14B, and 2 NVIDIA H100 GPUs for the other, smaller models.

Table 10. Fine-tuning hyperparameters for FINER-Tuning on all baselines. Global BS: global batch size. LR scheduler: learning rate scheduler. β: inverse temperature parameter in the DPO loss, as shown in Eq. 5. Val. ratio: ratio of validation data size.

Config          LLaVA-1.6-7B  Qwen2.5-VL-7B  InternVL3.5-8B  InternVL3.5-14B
Training data   40K           120K           160K            160K
Global BS       64 (all models)
Optimizer       AdamW [29]
Learning rate   5 × 10^-6
Total epochs    1
Warmup ratio    0.1
LR scheduler    cosine decay
LoRA rank       32
LoRA target     q_proj, v_proj
β               0.1
Val. ratio      0.005

D. Evaluation Details

We detail the evaluation setups for three groups of tasks: the FINER benchmarks, other hallucination benchmarks, and general capabilities.

FINER benchmarks. Since the FINER benchmarks are multiple-choice (MCQ) benchmarks, we evaluate all models using greedy decoding with temperature 0, no sampling, and a maximum of 3 output tokens. Given an image and an MCQ, we append the instruction: "Please answer with a single capital letter (A, B, C, D, or E)." We compute the paired accuracy Acc_paired, which counts a pair as correct only if the model answers both q+ and q− correctly, ensuring that the model does not systematically favor either the positive or the negative version.

Other hallucination benchmarks.
We evaluate all models on both discriminative hallucination benchmarks (DASH [3], POPE [22], RePOPE [33], HallusionBench [16], AMBER [44], CRPE_R [45]) and generative hallucination benchmarks (MMHal-Bench [40], HaloQuest [47]).

We use VLMEvalKit [14] to evaluate HallusionBench, AMBER, and CRPE_R with their default configurations. We report all accuracy (aAcc.) for HallusionBench and averaged accuracy for CRPE_R. For DASH, POPE, and RePOPE, we follow their official evaluation protocols and prompt models to answer only with "yes" or "no". We again adopt greedy decoding for this binary setting to keep the setup consistent across models. We report the averaged accuracy in Tab. 2 and show the accuracy on each subset in Tab. 13.

Figure 10. Examples from our constructed training set to train FINER-Tuning. Positive queries are in green, while negative queries are in red. We show both positive ((x, q+, a+_+, a−_+)) and negative ((x, q−, a+_−, a−_−)) preference tuples across four subsets: Multi-obj, Multi-attr, Multi-rel, Wh.

[Figure 10 panel contents: per-subset example queries with their accepted and rejected answers, e.g. a Multi-obj query about a book, a chair, a castle, a gate, and a door; a Multi-attr query about a bird with a white patch under its chin; a Multi-rel query about a tick completely hidden by a small leaf; and Wh queries about a man's tie and a pocket knife on a wooden table.]

For MMHal-Bench, we use the original evaluation code but replace the judge model with GPT-4.1-mini [2], since the original judge has been deprecated. For HaloQuest, we similarly follow the released evaluation pipeline but replace the judge with Gemini-2.0-Flash [41], as Gemini-1.5-Pro is no longer accessible. In both generative benchmarks, we use temperature 0 to ensure reproducible results. We follow the metrics of both benchmarks, reporting the score (max. 6) as well as the hallucination rate for MMHal-Bench, and the averaged score for HaloQuest.

General capabilities. We evaluate general capabilities using six benchmarks: MMStar [7] (broad multi-skill evaluation), TextVQA [39] (text understanding from images), ChartQA [32] (chart and figure understanding), MMVP [42] (vision-centric reasoning), NaturalBench [21] (natural, compositional multi-step reasoning), and V∗ (visual search on high-resolution images).
NaturalBench contains grouped, real-world questions that require models to jointly use perception, world knowledge, and compositional reasoning, making it a challenging test of robust, general-purpose vision-language ability.

We use VLMEvalKit [14] with default settings to evaluate all models on these six benchmarks. We report overall accuracy for MMStar, TextVQA, ChartQA, MMVP, and V∗. For NaturalBench, we report group accuracy (G-Acc), as it is the most stringent and informative metric.

E. Additional Experiments

Beyond the main experimental results presented in Sec. 4, we report additional experiments in this section. Specifically, we conduct a positional bias study (Sec. E.1), analyze the impact of training data filtering (Sec. E.2), present more qualitative results from FINER-DOCCI (Sec. E.3), provide per-subset results of three benchmarks (Sec. E.4), provide an extended comparison with additional hallucination reduction methods (Sec. E.5), briefly discuss an alternative random-guess baseline (Sec. E.6), and show results on the MCQ version of our motivational study (Sec. E.7).

E.1. Positional bias study

Both FINER-COMPRECAP and FINER-DOCCI contain MCQs that involve multiple objects, attributes, and relations (q±_multi-obj, q±_multi-attr, and q±_multi-rel). When constructing a negative MCQ q−, we choose one entity (object, attribute, or relation) at a random position and replace it with its negative counterpart. A natural question is whether the model's behavior depends on which position is negated.

To test this, for all q±_multi-obj, q±_multi-attr, and q±_multi-rel with exactly three entities, we keep the same triplet but rotate which entity is negated, so that the negative appears once in each of the three positions. We then measure the paired accuracy Acc_paired for each position. As shown in Fig.
6, base models exhibit clear positional bias. For example, in q±_multi-obj, LLaVA-Next performs much worse when the negative is in the middle position, and Qwen2.5-VL-7B shows a drop of about 15% when the last position is negated compared to the first. In q±_multi-rel, the preferred position even differs across models: InternVL3.5-8B achieves the highest accuracy when the middle entity is negated, while InternVL3.5-14B peaks when the third entity is negated.

Fine-tuning with FINER-Tuning consistently improves accuracy at all positions, but the curves are still not flat, indicating that positional bias remains. We suspect this is related to the inherent sequence structure of current MLLM architectures and leave a deeper investigation to future work. We also note that the current MCQ format may not be the best option for testing positional bias, and we encourage the community to dig deeper into language positional bias in open-ended generation questions.

Table 11. Category statistics for Pixmo-caption [11].

Category         Count     Percentage
natural image    176,881   78.13%
screenshot ui     36,701   16.21%
chart graph        8,061    3.56%
document text      4,739    2.09%

Table 12. Ablation on filtering to keep only natural images for FINER-Tuning with InternVL3.5-8B [46]. Obj/Attr/Rel denote Multi-obj/Multi-attr/Multi-rel for both training and evaluation. The best results are bold.

                 FINER-CompreCap              Other
Filtered?        Obj   Attr  Rel   Wh        RePOPE  M.S.
-                74.2  71.9  49.8  25.5      91.5    68.0
Yes              76.8  78.6  62.8  36.1      93.1    68.1
No               76.5  78.3  64.1  36.1      93.1    68.3

E.2. Ablation: Training Data Filtering

In Pixmo-caption [11], we observed that a certain share of images are charts/graphs or screenshots: content outside the evaluation scope of FINER-COMPRECAP and FINER-DOCCI (which target natural images). For example, one screenshot image can be found in the upper left corner of Fig. 10.
Therefore, since the FINER benchmarks target only natural images, we first run Phi-4-14B over all the long captions to classify the images into four categories: "natural image", "screenshot ui", "chart graph", and "document text". The statistics are in Tab. 11. Excluding the non-natural images resulted in almost no significant difference in performance (Tab. 12). Therefore, to maintain simplicity and generality, we do not apply any filtering and retain the original dataset composition.

E.3. Qualitative Results

Following the qualitative results on FINER-COMPRECAP in Sec. 4.5, we provide additional examples from FINER-DOCCI in Fig. 11. These cases cover all four settings: Multi-obj, Multi-attr, Multi-rel, and Wh. We only visualize the negative MCQs here, as they are much more challenging than their positive counterparts. However, some positive MCQs can be found in our human study examples (Fig. 15 and Fig. 14).

As shown in Fig. 11, in the Multi-obj setting, only Gemini-2.5-Flash [10] and our FINER-Tuning-tuned InternVL3.5-14B reliably identify the fine-grained concept "macbook". In the Multi-attr setting, the questions target subtle details such as "the white note on the back driver's side window" or "the cat with perked-up ears". In the Multi-rel setting, some models, such as Qwen2.5-VL-7B [4], hallucinate the dog as being "behind the fence", even though it is clearly in front of the fence. Finally, in the Wh setting, only Gemini and FINER-Tuning correctly detect the anomalous attributes of the floor and the duck and answer the questions accordingly.

E.4. Per-subset results

POPE, RePOPE, AMBER. In Sec. 4.3, we report the averaged performance on POPE [22], RePOPE [33], and the AMBER discriminative subset [44] (denoted as AMBER throughout this paper). In Tab. 13, we further break down the results and report the accuracy for each subset of these three benchmarks.
Notably, with FINER-Tuning, LLaVA-1.6 achieves a 20.1% absolute improvement on AMBER, further demonstrating the effectiveness of FINER-Tuning.

HallBench, CRPE_R, HaloQuest. Apart from the per-subset results reported in Tab. 13, we further report detailed breakdowns for HallusionBench [16], CRPE_R [45], and HaloQuest [47] in Tab. 14. On HallusionBench, FINER-Tuning improves over all baselines by up to 6.8% (fAcc. of LLaVA-1.6), showing that FINER-Tuning also works effectively in reducing general hallucinations. On HaloQuest, the performance gain comes mainly from the Insufficient Context (IC) subset and the False Premise (FP) subset. Some notable improvements: FINER-Tuning improves LLaVA-1.6 by 19.0% on IC and 31% on FP, and improves the latest InternVL3.5-8B by 15.7% and 15.3%, respectively. Note that HaloQuest is a free-form generative benchmark. This shows that FINER-Tuning can effectively correct false-premise hallucinations or withhold over-confident predictions in free-form generation.

AMBER G. To further probe the captioning capabilities of different models, we include the results for the AMBER generative subset (AMBER G) and report four metrics: CHAIR (CH.), COVER (CO.), Hal., and Cog. in Tab. 15. FINER-Tuning consistently improves over three baselines (Qwen2.5-VL-7B, InternVL3.5-8B, InternVL3.5-14B) on AMBER G. We therefore conclude that when the base models are strong enough, FINER-Tuning can further improve their captioning capabilities.

E.5. Comparing with more methods

It is challenging to compare hallucination reduction methods in a fully fair way, because they are often trained on different datasets and base models.
In this section, we fine-tune LLaVA-1.5-7B [26] with FINER-Tuning using 40K training examples from our dataset. We then evaluate on discriminative hallucination benchmarks (POPE [22], AMBER [44]) and generative benchmarks (MMHal-Bench (MMHal) [40] and HaloQuest [47]). We compare against the state-of-the-art REVERSE [49], as well as DoLA [9], HA-DPO [57], and HALVA [38]. We also compare FINER-Tuning with RLAIF-V-7B [55] on the same LLaVA-1.5-7B base model, resulting in a more direct comparison than Tab. 1 and Tab. 2. The results are in Tab. 16.

Using 40K training samples curated by Phi-4-14B [1], FINER-Tuning already achieves performance on discriminative benchmarks comparable to HALVA and HA-DPO, whose training data are curated by Gemini Vision Pro [41] and GPT-4 [2], respectively, while substantially outperforming them on generative benchmarks. Compared with the SOTA method REVERSE, FINER-Tuning matches or surpasses its performance on discriminative tasks and further improves HaloQuest by 6.3%, but still lags behind on MMHal-Bench. Overall, these results indicate that FINER-Tuning is effective at reducing hallucinations, and its benefits appear more pronounced when applied to stronger, frontier MLLMs, as also evidenced in Tab. 2. Compared to RLAIF-V, FINER-Tuning performs better on discriminative benchmarks such as POPE and AMBER (a +5.5% gain on AMBER), but remains weaker on generative benchmarks like MMHal-Bench.

E.6. Smarter random guess baselines

In Tab. 1, we report a uniform random-guess baseline of 4%, which corresponds to independently sampling one out of five answer options for both the positive and negative questions: (1/5)^2. However, due to the structured answer space in our Multi-obj/Multi-attr/Multi-rel MCQs (one "Yes, I can see..." option and four "No, but I can see..." options), a stronger no-knowledge baseline is a polarity-aware random guesser.
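The expected accuracy of such a polarity-aware guesser can be checked with a short simulation. The sketch below is illustrative (helper names are ours); it only assumes the answer-space structure described in the text: one Yes option and four No options, with exactly one No option correct for a negative question.

```python
import random

def polarity_aware_guess(rng):
    """Guess Yes/No uniformly; on No, pick one of the four No options uniformly."""
    if rng.random() < 0.5:
        return ("yes", None)
    return ("no", rng.randrange(4))

def simulate(num_pairs=200_000, seed=0):
    rng = random.Random(seed)
    pos = neg = paired = 0
    for _ in range(num_pairs):
        # Positive question: the ground truth is always the single "Yes" option.
        p_ok = polarity_aware_guess(rng)[0] == "yes"
        # Negative question: the ground truth is one specific "No" option
        # (index 0 without loss of generality).
        polarity, k = polarity_aware_guess(rng)
        n_ok = (polarity == "no" and k == 0)
        pos += p_ok
        neg += n_ok
        paired += (p_ok and n_ok)
    return pos / num_pairs, neg / num_pairs, paired / num_pairs

# Exact values: P(positive correct) = 1/2, P(negative correct) = 1/2 * 1/4 = 0.125,
# so the paired accuracy is 1/2 * 0.125 = 0.0625.
```

The simulated rates converge to the closed-form values derived in the text (0.5, 0.125, and 0.0625).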
Specifically, it first guesses the polarity (Yes vs. No) uniformly, and if it guesses No, it then uniformly selects one of the four No options. Since each pair consists of one positive question whose ground truth is always Yes and one negative question whose ground truth is always one of the four No options, the probability of guessing correctly is 0.5 for a positive MCQ.
Figure 11. Qualitative Results from FINER-DOCCI.

Table 13. Per-subset results on POPE [22], RePOPE [33], and AMBER [44]. Ran.: Random; Pop.: Popular; Adv.: Adversarial; Exis.: Existence; Attr.: Attribute; Rel.: Relation. Deltas relative to the corresponding base model are given in parentheses.

Models (Size)            | POPE Ran. / Pop. / Adv.                  | RePOPE Ran. / Pop. / Adv.                | AMBER Exis. / Attr. / Rel.
OmniLMM (12B)            | 89.3 / 87.8 / 87.1                       | 95.1 / 93.2 / 93.1                       | 85.6 / 94.2 / 80.7
+RLAIF-V (12B)           | 89.0 (-0.3) / 87.5 (-0.3) / 86.8 (-0.3)  | 95.0 (-0.1) / 92.8 (-0.4) / 92.6 (-0.5)  | 86.1 (+0.5) / 90.2 (-4.0) / 85.7 (+5.0)
LLaVA-1.6 [27] (7B)      | 89.7 / 88.4 / 86.6                       | 93.9 / 92.1 / 91.0                       | 82.0 / 93.6 / 58.7
+FINER-Tuning (7B)       | 90.4 (+0.7) / 88.8 (+0.4) / 87.2 (+0.6)  | 94.9 (+1.0) / 92.9 (+0.8) / 91.8 (+0.8)  | 83.5 (+1.5) / 92.6 (-1.0) / 78.8 (+20.1)
Qwen2.5-VL [4] (7B)      | 87.0 / 86.5 / 85.8                       | 93.6 / 91.9 / 91.7                       | 84.1 / 95.7 / 75.6
+FINER-Tuning (7B)       | 88.0 (+1.0) / 87.0 (+0.5) / 86.4 (+0.6)  | 94.1 (+0.5) / 92.2 (+0.3) / 91.9 (+0.2)  | 84.0 (-0.1) / 96.2 (+0.5) / 77.1 (+1.5)
InternVL-3.5 [46] (8B)   | 93.3 / 87.7 / 85.0                       | 95.4 / 90.7 / 88.5                       | 80.4 / 88.0 / 80.1
+FINER-Tuning (8B)       | 92.7 (-0.6) / 88.7 (+1.0) / 86.6 (+1.6)  | 95.9 (+0.5) / 92.6 (+1.9) / 90.9 (+2.4)  | 80.6 (+0.2) / 88.2 (+0.2) / 80.6 (+0.5)
InternVL-3.5 [46] (14B)  | 93.4 / 89.6 / 85.7                       | 94.7 / 92.1 / 88.8                       | 82.6 / 89.4 / 81.9
+FINER-Tuning (14B)      | 93.0 (-0.4) / 90.2 (+0.6) / 87.3 (+1.6)  | 95.8 (+1.1) / 93.6 (+1.5) / 91.4 (+2.6)  | 82.5 (-0.1) / 91.0 (+1.6) / 81.5 (-0.4)

Table 14. Per-subset results on HallBench [16], CRPE relation subset (CRPE R) [45], and HaloQuest [47]. Sub.: Subject; Pred.: Predicate; Obj.: Object; Tot.: Total; VC.: Visually Challenging subset; IC.: Insufficient Context subset; FP.: False Premise subset. Deltas relative to the corresponding base model are given in parentheses.

Models            | HallBench aAcc. / fAcc. / qAcc.          | CRPE R Sub. / Pred. / Obj. / Tot.                      | HaloQuest VC. / IC. / FP.
LLaVA-1.6-7B      | 33.0 / 10.6 / 8.3                        | 61.7 / 52.6 / 61.6 / 56.5                              | 50.5 / 38.0 / 42.9
+FINER-Tuning     | 36.3 (+3.3) / 17.4 (+6.8) / 13.0 (+4.7)  | 62.6 (+0.9) / 51.7 (-0.9) / 59.8 (-1.8) / 56.0 (-0.5)  | 50.5 (+0.0) / 57.0 (+19.0) / 73.9 (+31.0)
Qwen2.5-VL-7B     | 65.4 / 35.8 / 40.0                       | 77.2 / 66.1 / 71.7 / 69.9                              | 66.5 / 76.0 / 79.2
+FINER-Tuning     | 68.5 (+3.1) / 40.0 (+4.2) / 43.6 (+3.6)  | 77.9 (+0.7) / 67.0 (+0.9) / 72.4 (+0.7) / 70.7 (+0.8)  | 65.9 (-0.6) / 86.7 (+10.7) / 87.5 (+8.3)
InternVL-3.5-8B   | 71.0 / 45.1 / 47.0                       | 75.6 / 63.3 / 70.8 / 67.7                              | 66.5 / 51.2 / 64.4
+FINER-Tuning     | 73.0 (+2.0) / 48.9 (+3.8) / 49.3 (+2.3)  | 76.5 (+0.9) / 63.4 (+0.1) / 70.9 (+0.1) / 68.0 (+0.3)  | 65.9 (-0.6) / 66.9 (+15.7) / 80.7 (+15.3)
InternVL-3.5-14B  | 69.5 / 46.8 / 47.0                       | 77.2 / 60.7 / 73.3 / 67.1                              | 63.7 / 54.5 / 70.0
+FINER-Tuning     | 71.2 (+1.7) / 49.2 (+2.4) / 49.7 (+2.7)  | 78.5 (+1.3) / 63.1 (+2.4) / 73.9 (+0.6) / 68.9 (+1.8)  | 63.7 (+0.0) / 61.2 (+6.7) / 79.2 (+9.2)

Table 15. Extended results on the AMBER generative subset (AMBER G). Deltas relative to the corresponding base model are given in parentheses.

Models            | CHAIR ↓     | COVER ↑      | Hal ↓        | Cog ↓
Qwen2.5-VL-7B     | 5.3         | 64.0         | 27.1         | 1.9
+FINER-Tuning     | 5.0 (-0.3)  | 64.7 (+0.7)  | 25.9 (-1.2)  | 1.6 (-0.3)
InternVL-3.5-8B   | 6.9         | 61.3         | 49.9         | 3.1
+FINER-Tuning     | 6.3 (-0.6)  | 61.4 (+0.1)  | 47.0 (-2.9)  | 2.5 (-0.6)
InternVL-3.5-14B  | 7.9         | 68.6         | 57.6         | 5.4
+FINER-Tuning     | 7.4 (-0.5)  | 68.7 (+0.1)  | 54.4 (-3.2)  | 4.4 (-1.0)

For the negative MCQ, it is 0.5 × 0.25 = 0.125. Therefore, the paired accuracy is 0.5 × (0.5 × 0.25) = 0.0625.

E.7. MCQ Version of the Motivational Study

Yes/no probing is standard in prior benchmarks such as DASH, POPE, and AMBER for evaluating false-positive hallucinations. In the main paper, we adopt this simple setup for the motivational study because it is easy to understand. In contrast, our FINER benchmarks are evaluated using multiple-choice questions (MCQs). Using two different evaluation protocols may cause confusion for some readers. Therefore, we additionally reformulate the motivational study in the same MCQ format as used in our benchmarks. Fig. 12 shows the same trend as the yes/no version in the main paper: accuracy decreases as query granularity increases. More specifically, the false-positive (FP) rate is much higher than the false-negative (FN) rate, confirming that false-positive hallucination is the main cause of the performance drop.

Figure 12. Left: MCQ version of the motivational study. Right: False-positive (FP) and false-negative (FN) rates at each granularity level.

Table 16. Extended comparison with other hallucination reduction methods on LLaVA-1.5-7B [26]. HR.: Hallucination rate. In the original table, the best results are bold while the second best results are underlined.

Method            | POPE Acc. ↑ | AMBER Acc. ↑ | MMHal HR. ↓ | HaloQuest Score ↑
LLaVA-1.5-7B      | 85.9        | 74.7         | 54.0        | 22.6
+HALVA [38]       | 84.8        | 83.4         | 54.0        | 23.9
+HA-DPO [57]      | 86.9        | 78.1         | 60.0        | -
+DoLA [9]         | 85.7        | 74.5         | 56.0        | 22.9
+RLAIF-V [55]     | 85.2        | 76.8         | 32.3        | -
+REVERSE [49]     | 85.9        | 74.2         | 30.0        | 32.3
+FINER-Tuning     | 86.7        | 82.3         | 49.0        | 38.8

F. Human Study

Since the FINER benchmarks are text-intensive, we asked human participants to answer a limited number of questions: 20 MCQs per subset. With eight subsets in total (four from FINER-CompreCap and four from FINER-DOCCI), this yields 160 MCQs. The results are shown in Tab. 17.

Table 17. Human performance in paired accuracy (Acc_paired) on FINER-CompreCap and FINER-DOCCI.

                 | Multi-obj | Multi-attr | Multi-rel | Wh
FINER-CompreCap  | 92.5      | 92.5       | 97.5      | 95.0
FINER-DOCCI      | 92.5      | 95.0       | 90.0      | 90.0

Unlike models, which answer the positive and negative versions of each MCQ independently, humans could in principle remember an MCQ and use the correspondence between q^+ and q^- to make the task easier. To avoid this, we create two versions (A and B) for each setting. For every MCQ pair, the positive and negative versions are randomly assigned to different versions.
Each annotator only sees one version (either A or B), so they never see both sides of the same pair. We recruit four human participants for each setting and compute paired accuracy based on their responses. The numerical results are reported in Tab. 1. Example survey pages from our human study are shown for Multi-rel and Wh questions from FINER-CompreCap in Fig. 14, and for Multi-obj and Multi-attr questions from FINER-DOCCI in Fig. 15. As illustrated in these figures, each MCQ has two versions (A and B), corresponding to its positive and negative forms, and no annotator ever answers both versions of the same MCQ.

Success and failure cases. As Tab. 17 shows, humans achieve over 90% paired accuracy across all settings in FINER-CompreCap and FINER-DOCCI. Although we can only evaluate human performance on a limited subset due to resource constraints, we do observe many cases where humans succeed on MCQs that a model like InternVL-3.5-14B [46] fails on. Notably, there are also MCQs where humans fail but models succeed. Representative success and failure cases are shown in Fig. 13.

From Fig. 13, human errors can be grouped into two main types: carelessness and ambiguity. In the upper-right example, the human selects "sleeping behind the window", likely due to a simple oversight or a "yes" bias, similar to how InternVL-3.5-14B fails in the lower-right example. The second type of error arises from subjective or ambiguous visual attributes. In the dog example, the human chooses "with bald ears that flap sideways" instead of "with floppy ears that hang down". This is partly understandable, since "flap sideways" describes some of the observed motion even though the ears are not truly "bald". Strictly speaking, "bald ears that flap sideways" should be considered a false attribute (only partially correct), especially when compared to "floppy ears that hang down" (correct).
This motivates our choice to design FINER as an MCQ benchmark rather than using simple yes/no questions. By comparing multiple options, both humans and models are encouraged to pick the better description, which reduces ambiguity to some extent. Nevertheless, even with our entropy-based filtering pipeline, additional human verification, and MCQ design, the scale of FINER means that a certain amount of subjectivity, ambiguity, and annotation errors in the descriptions remains unavoidable. A valid future direction is to construct FINER benchmarks fully with human annotations, better aligning the evaluation with human subjectivity in assessing hallucinations.

In our human studies, participants answer 20 MCQs per subset, which is small relative to the scale of both benchmarks. This is mainly because FINER is highly text-intensive, requiring substantial reading time. Scaling up the human study would likely further reduce human accuracy due to the reading burden and potential noise, since the benchmark is not fully created and validated by humans. We therefore treat the limited scale of the human studies as a limitation, and emphasize that these results only reflect human behavior on a small subset and given ample answering time, rather than serving as a valid measure of overall benchmark quality.

G. Templates

To construct the training set for FINER-Tuning. Sec. 3 describes how we run Phi-4-14B [1] over captions to extract
Figure 13. Success & failure analysis matrix for InternVL3.5-14B [46] (denoted as "model" in the figure) and Human. All MCQs are included in the human study.

positive phrases {Ψ^+_OBJ, Ψ^+_ATTR, Ψ^+_REL, Ψ^+_WH} and negative phrases {Ψ^-_OBJ, Ψ^-_ATTR, Ψ^-_REL, Ψ^-_WH}.

OBJ / ATTR / REL. For OBJ, ATTR, and REL, we first extract positive phrases {Ψ^+_OBJ, Ψ^+_ATTR, Ψ^+_REL} using the prompts shown in Fig. 16, Fig. 17, and Fig. 18. We then prompt the same LLM to generate the corresponding negative phrases {Ψ^-_OBJ, Ψ^-_ATTR, Ψ^-_REL} with the prompts in Fig. 20, Fig. 21, and Fig. 22. Given these positive/negative phrase sets, we construct preference tuples (q^+, a^+_+, a^-_+) and (q^-, a^+_-, a^-_-) for each of OBJ, ATTR, and REL via template-based composition, using a pool of five templates as below:

(1) Does this image contain {X}? / Yes, this image contains {Y}. / No, but this image contains {Z}.
(2) Does this image show {X}? / Yes, this image shows {Y}. / No, but this image shows {Z}.
(3) Does this image include {X}? / Yes, this image includes {Y}. / No, but this image includes {Z}.
(4) Can you see {X} in this image? / Yes, I can see {Y} in this image. / No, but I can see {Z} in this image.
(5) Can {X} be seen in this image?
/ Yes, {Y} can be seen in this image. / No, but {Z} can be seen in this image.

To avoid overfitting to a single fixed pattern and to stay consistent with the FINER benchmarks, we randomly choose one of the above five templates for each example. Each template contains placeholders {X}, {Y}, and {Z_1}, ..., {Z_4} that are filled with phrases.

In the positive configuration (q^+, a^+_+, a^-_+), the "Yes" answer is the accepted response a^+_+ while the "No" answer is the rejected response a^-_+. The question and the "Yes" answer both use the positive phrase Ψ^+, while all "No" answers use the negative phrase Ψ^-:

{X} = Ψ^+, {Y} = Ψ^+, {Z} = Ψ^-

In the negative configuration (q^-, a^+_-, a^-_-), the "No" answer is the accepted response a^+_- while the "Yes" answer is the rejected response a^-_-. The question and all "No" answers use the negative phrase Ψ^-, while the "Yes" answer uses the positive phrase Ψ^+:

{X} = Ψ^-, {Y} = Ψ^+, {Z} = Ψ^-

WH. For WH, the preference tuples (q^+, a^+_+, a^-_+) and (q^-, a^+_-, a^-_-) are directly constructed by the LLM, rather than via our fixed templates. We therefore do not apply the above template-based composition to WH, and instead use dedicated prompts to let the LLM generate the question and its positive/negative answers. The prompts used to construct a pair of (q^+, a^+_+) and (q^-, a^+_-) for WH are shown in Fig. 19 and Fig. 23, respectively. Concretely, the LLM first produces two Wh questions about the same underlying scene: a positive question q^+, whose premise is consistent with the image and whose accepted response a^+_+ directly answers what the question asks for, and a negative question q^-, whose premise partially conflicts with the image content so that its accepted response a^+_- explicitly negates the question itself.
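The template-based composition for OBJ/ATTR/REL described above can be sketched in a few lines. This is an illustrative sketch, not the authors' released code: the function name and tuple layout are ours, while the five templates and the {X}/{Y}/{Z} filling rules follow the text.

```python
import random

# One (question, yes_answer, no_answer) pattern per template;
# {X}/{Y}/{Z} are the placeholders described in the text.
TEMPLATES = [
    ("Does this image contain {X}?",
     "Yes, this image contains {Y}.",
     "No, but this image contains {Z}."),
    ("Does this image show {X}?",
     "Yes, this image shows {Y}.",
     "No, but this image shows {Z}."),
    ("Does this image include {X}?",
     "Yes, this image includes {Y}.",
     "No, but this image includes {Z}."),
    ("Can you see {X} in this image?",
     "Yes, I can see {Y} in this image.",
     "No, but I can see {Z} in this image."),
    ("Can {X} be seen in this image?",
     "Yes, {Y} can be seen in this image.",
     "No, but {Z} can be seen in this image."),
]

def make_preference_tuples(pos_phrase, neg_phrase, rng=random):
    """Return (question, accepted, rejected) tuples for the positive and
    negative configurations, from one positive/negative phrase pair."""
    q_tpl, yes_tpl, no_tpl = rng.choice(TEMPLATES)
    # Positive configuration: question and "Yes" answer use the positive
    # phrase, the "No" answer uses the negative phrase; "Yes" is accepted.
    pos = (q_tpl.format(X=pos_phrase),
           yes_tpl.format(Y=pos_phrase),
           no_tpl.format(Z=neg_phrase))
    # Negative configuration (as specified in the text): the question and
    # "No" answer use the negative phrase, the "Yes" answer uses the
    # positive phrase; "No" is accepted, "Yes" is rejected.
    neg = (q_tpl.format(X=neg_phrase),
           no_tpl.format(Z=neg_phrase),
           yes_tpl.format(Y=pos_phrase))
    return pos, neg
```

Each returned tuple maps directly onto a DPO training example: the question is the prompt, the accepted response is the chosen completion, and the rejected response is the discouraged one.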
We then symmetrize this pair by assigning each accepted response as the other question's rejected response, i.e., a^-_+ := a^+_- and a^-_- := a^+_+. In this way we obtain the final preference tuples (q^+, a^+_+, a^-_+) and (q^-, a^+_-, a^-_-).

Figure 14. Examples of our human study survey for FINER-CompreCap. Example questions from Multi-rel and Wh are shown in the figure. Ticked boxes represent ground-truth choices. We use blue to represent the questions for version A, while orange represents the questions for version B.

Figure 15. Examples of our human study survey for FINER-DOCCI. Example questions from Multi-attr and Multi-obj. Ticked boxes represent the ground-truth choice. We use blue to represent the questions for version A, while orange represents the questions for version B.

"You are an information extraction assistant.\n"
"From the caption, select up to FIVE main objects. A main object must have at least one descriptive attribute in the caption "
"(e.g., color, size, material, possession/with-phrase, appositive, relative clause, or an explicit number).\n"
"Rules:\n"
"• Output only object names that appear in the caption and are part of the described scene. Never invent or infer.\n"
"• Prefer plain object names in the output (omit adjectives), but KEEP explicit numbers/quantifiers if the caption states them "
"(e.g., 'two cats', 'five chickens', 'a pair of skis').\n"
"• If multiple mentions share the same name and no explicit number is given, output the plural form (e.g., 'dogs').\n"
"• List 1–5 main objects; if only one is present, output just that one.\n"
"Do not add quantifiers like 'some' unless present in the caption.\n"
"Format:\n"
"Return EXACTLY one line:\n"
"PHRASE=\n"
"No trailing period. No extra text."
"You are an information extraction assistant.\n" "Select ONE main object from the caption that has at least one described attribute.\n" "Extract attribute phrases ONLY if they are explicitly stated and are used to describe the chosen main object—never infer or guess.\n" "Then compose a SINGLE noun phrase describing that object with the extracted attribute phrases.\n" "Use ONLY evidence from the caption. Never invent attributes.\n" "Allowed attribute types:\n" "• Appearance, color, pattern, size, shape, material, markings/printed text/numbers, " "condition/state, orientation/pose, and other visible features that describe the main object.\n" "• Accessories physically attached to the main object (e.g., a collar on a dog) count as attributes; unrelated co-occurring objects do not.\n" "Constraints:\n" . . . "• The extracted attributes must clearly describe the chosen main object. NEVER invent attributes. NEVER extract attributes for the others.\n" "• Extract 1–5 attributes for the chosen main object. If fewer than five are stated, extract fewer. If only one is present, use that one.\n" . . . "Return EXACTLY one line:\n" "PHRASE=\n" "No trailing period. No extra text." Prompt template for extracting Figure 16. Prompt T emplate for extracting Ψ + O BJ "You are an information extraction assistant.\n" "Select ONE main object from the caption that has at least one described attribute.\n" "Extract attribute phrases ONLY if they are explicitly stated and are used to describe the chosen main object—never infer or guess.\n" "Then compose a SINGLE noun phrase describing that object with the extracted attribute phrases.\n" "Use ONLY evidence from the caption. 
Never invent attributes.\n" "Allowed attribute types:\n" "• Appearance, color, pattern, size, shape, material, markings/printed text/numbers, " "condition/state, orientation/pose, and other visible features that describe the main object.\n" "• Accessories physically attached to the main object (e.g., a collar on a dog) count as attributes; unrelated co-occurring objects do not.\n" "Constraints:\n" . . . "• The extracted attributes must clearly describe the chosen main object. NEVER invent attributes. NEVER extract attributes for the others.\n" "• Extract 1–5 attributes for the chosen main object. If fewer than five are stated, extract fewer. If only one is present, use that one.\n" . . . "Return EXACTLY one line:\n" "PHRASE=\n" "No trailing period. No extra text." Prompt template for extracting "You are an information extraction assistant.\n" "Select ONE main object from the caption that has at least one described attribute.\n" "Extract attribute phrases ONLY if they are explicitly stated and are used to describe the chosen main object—never infer or guess.\n" "Then compose a SINGLE noun phrase describing that object with the extracted attribute phrases.\n" "Use ONLY evidence from the caption. Never invent attributes.\n" "Allowed attribute types:\n" "• Appearance, color, pattern, size, shape, material, markings/printed text/numbers, " "condition/state, orientation/pose, and other visible features that describe the main object.\n" "• Accessories physically attached to the main object (e.g., a collar on a dog) count as attributes; unrelated co-occurring objects do not.\n" "Constraints:\n" "• Do NOT include spatial relations to other main objects (e.g., 'to the left of the bus').\n" "• Do NOT include actions involving other main objects (e.g., 'holding a cup').\n" "• The extracted attributes must clearly describe the chosen main object. NEVER invent attributes. NEVER extract attributes for the other objects.\n" "• Extract 1–5 attributes for the chosen main object. 
If fewer than five are stated, extract fewer. If only one is present, use that one. NEVER invent attributes.\n" "• Optionally, rewrite the original attribute phrase to either a plain adjective phrase (e.g., 'red', 'shiny metal', 'long-tailed'), " "or a 'with ...' phrase (e.g., 'with yellow eyes', 'with its nose pointing to the left', 'with the text \"SALE\"'). The rewriting should not change the original meaning.\n" "• Connect multiple 'with ...' phrases smoothly using commas and 'and' (e.g., 'with yellow eyes, with a striped tail, and with a scar').\n" "Return EXACTLY one line:\n" "PHRASE=\n" "No trailing period. No extra text." Figure 17. Prompt T emplate for extracting Ψ + A T TR 19 "You are an information extraction assistant.\n" "Select ONE main object from the caption that has at least one described attribute.\n" "Extract attribute phrases ONLY if they are explicitly stated and are used to describe the chosen main object—never infer or guess.\n" "Then compose a SINGLE noun phrase describing that object with the extracted attribute phrases.\n" "Use ONLY evidence from the caption. Never invent attributes.\n" "Allowed attribute types:\n" "• Appearance, color, pattern, size, shape, material, markings/printed text/numbers, " "condition/state, orientation/pose, and other visible features that describe the main object.\n" "• Accessories physically attached to the main object (e.g., a collar on a dog) count as attributes; unrelated co-occurring objects do not.\n" "Constraints:\n" . . . "• The extracted attributes must clearly describe the chosen main object. NEVER invent attributes. NEVER extract attributes for the others.\n" "• Extract 1–5 attributes for the chosen main object. If fewer than five are stated, extract fewer. If only one is present, use that one.\n" . . . "Return EXACTLY one line:\n" "PHRASE=\n" "No trailing period. No extra text." 
Prompt template for extracting Ψ+REL

"You are an information extraction assistant.\n" "Select ONE main object from the caption that clearly participates in at least one relation with another object.\n" "Extract relations ONLY if they are explicitly stated in the caption—never infer or guess.\n" "Allowed relation types:\n" "• Spatial: e.g., 'behind X', 'in front of Y', 'on Z', 'under W', 'next to Q', 'between A and B', 'near C', 'inside D', 'at E'.\n" "• Action with a target: verb phrases that take an object, e.g., 'holding a cup', 'biting a bone', 'looking at the door'.\n" "Constraints:\n" "• Every relation must involve the chosen object.\n" "• Refer to other objects with plain nouns; add attributes only to disambiguate same-named objects.\n" "• Use ONLY what the caption states; do NOT invent relations.\n" "• List 1–5 relations; if only one is present, output just that one.\n" "Compose ONE fluent phrase that starts with the object and then lists the relations.\n" "Prefer: 'The <object> is <relation 1>, is <relation 2>, ... and is <relation n>'.\n" "Return EXACTLY one line:\n" "PHRASE=\n" "No trailing period. No extra text."

Figure 18. Prompt Template for extracting Ψ+REL

Prompt template for generating (q+, a++) for WH setting

"You create one WH-style QA pair from ONE sentence describing two main objects and their explicit relation, " "optionally with attributes. The sentence has the logical structure:\n" "[obj_a] [attr_a…] [rel] [obj_b] [attr_b…].\n" "\n" "Your task (A-mode):\n" "• Choose [obj_a][attr_a…] as the exact answer span.\n" "• Write ONE natural WH question whose answer is exactly that span.\n" "• In the QUESTION, preserve as much of [rel][obj_b][attr_b…] as natural, quoted verbatim when it fits, " " and DO NOT repeat or paraphrase [obj_a][attr_a…] inside the question.\n" "• Be fluent and grammatical; do not invent details.\n" "Output EXACTLY one line:\n" "Q= || A=\n" "No extra text."

"You create one WH-style QA pair from ONE sentence describing two main objects and their explicit relation, " "optionally with attributes. The sentence has the logical structure:\n" "[obj_a] [attr_a…] [rel] [obj_b] [attr_b…].\n" "\n" "Your task (B-mode):\n" "• Choose [obj_b][attr_b…] as the exact answer span.\n" "• Write ONE natural WH question whose answer is exactly that span.\n" "• In the QUESTION, preserve as much of [obj_a][attr_a…] and [rel] as natural, quoted verbatim when it fits, " " and DO NOT repeat or paraphrase [obj_b][attr_b…] inside the question.\n" "• Be fluent and grammatical; do not invent details.\n" "Output EXACTLY one line:\n" "Q= || A=\n" "No extra text."

Figure 19. Prompt Template for generating (q+, a++) for WH setting

Prompt template for generating Ψ−OBJ

"You are a negative object creator.\n" "You will receive a caption, an object list as PHRASE=..., and REPLACE_INDEX=k (1-based).\n" "Replace EXACTLY the k-th object in PHRASE with a distinctly different NEGATIVE object.\n" "\n" "Keep ALL other objects unchanged and preserve their order and punctuation.
Keep the same quantifier/determiner " "for the replaced slot (e.g., 'two cats' -> 'two bicycles').\n" "\n" "Constraints for the NEGATIVE object:\n" "• It must be distinctly different from the replaced object (not a synonym; not just singular/plural).\n" "• It must NOT be a synonym or near-equivalent of ANY object that appears in the caption.\n" "• It must NOT appear anywhere in the caption (as a whole word, singular or plural).\n" "• Do not modify any other items; do not reorder items; do not add or remove items.\n" "\n" "Edge cases (must follow):\n" "• If REPLACE_INDEX is greater than the number of objects in PHRASE, replace the LAST object.\n" "• If REPLACE_INDEX is less than 1, replace the FIRST object.\n" "\n" "Self-check (must hold):\n" "• Same number of items as input; exactly one item (the k-th per the rule above) differs.\n" "\n" "Output EXACTLY one line:\n" "PHRASE=\n" "No extra text. No quotes. No trailing period." Figure 20. Prompt T emplate for generating Ψ − O BJ "You are an information extraction assistant.\n" "Select ONE main object from the caption that has at least one described attribute.\n" "Extract attribute phrases ONLY if they are explicitly stated and are used to describe the chosen main object—never infer or guess.\n" "Then compose a SINGLE noun phrase describing that object with the extracted attribute phrases.\n" "Use ONLY evidence from the caption. Never invent attributes.\n" "Allowed attribute types:\n" "• Appearance, color, pattern, size, shape, material, markings/printed text/numbers, " "condition/state, orientation/pose, and other visible features that describe the main object.\n" "• Accessories physically attached to the main object (e.g., a collar on a dog) count as attributes; unrelated co-occurring objects do not.\n" "Constraints:\n" . . . "• The extracted attributes must clearly describe the chosen main object. NEVER invent attributes. 
NEVER extract attributes for the others.\n" "• Extract 1–5 attributes for the chosen main object. If fewer than five are stated, extract fewer. If only one is present, use that one.\n" . . . "Return EXACTLY one line:\n" "PHRASE=\n" "No trailing period. No extra text." Prompt template for generating "You are a negative attribute editor.\n" "You will receive an ATTRIBUTE PHRASE: a single noun phrase describing one main object with 1–5 attributes.\n" "Each attribute is one replaceable unit: either (a) a pre-nominal adjective group (e.g., 'long-sleeved red') " "or (b) one entire 'with ...' clause or other forms of clause separated by commas or 'and'.\n" "\n" "Task:\n" "Pick exactly ONE attribute unit at random and replace it with a distinctly different NEGATIVE attribute.\n" "\n" "Randomness:\n" "• Replace the attribute unit at random position. Both pre-nominal adjective group or 'with ...' clause should have a chance to be replaced. \n" "\n" "Definitions & scope of attributes that can be changed:\n" "• Appearance, color, pattern, size, shape, material, texture, markings/printed text/numbers, " "condition/state, orientation/pose, and accessories physically attached to the main object.\n" "\n" "Constraints for the replacement:\n" "• Keep the object head and all other attributes unchanged; preserve order, punctuation, articles, quotes, units, and capitalization.\n" "• Keep the grammatical shape of the replaced unit (adjective group stays an adjective group; a 'with ...' clause stays a 'with ...' 
clause).\n" "• The replacement must be distinctly different from the original and NOT a synonym, near-synonym, or morphological variant of any attribute in the phrase\n" "• Do not duplicate any existing attribute already present in the phrase.\n" "• Avoid always changing the same type of attribute; consider changing any types of attributes stated in the definitions above.\n" "\n" "Self-check before answering (must be satisfied):\n" "• Exactly one attribute unit differs; all other attribute units are identical.\n" "\n" "Output EXACTLY one line:\n" "PHRASE=\n" "No extra text. No quotes. No trailing period." Figure 21. Prompt T emplate for generating Ψ − A T TR 21 "You are an information extraction assistant.\n" "Select ONE main object from the caption that has at least one described attribute.\n" "Extract attribute phrases ONLY if they are explicitly stated and are used to describe the chosen main object—never infer or guess.\n" "Then compose a SINGLE noun phrase describing that object with the extracted attribute phrases.\n" "Use ONLY evidence from the caption. Never invent attributes.\n" "Allowed attribute types:\n" "• Appearance, color, pattern, size, shape, material, markings/printed text/numbers, " "condition/state, orientation/pose, and other visible features that describe the main object.\n" "• Accessories physically attached to the main object (e.g., a collar on a dog) count as attributes; unrelated co-occurring objects do not.\n" "Constraints:\n" . . . "• The extracted attributes must clearly describe the chosen main object. NEVER invent attributes. NEVER extract attributes for the others.\n" "• Extract 1–5 attributes for the chosen main object. If fewer than five are stated, extract fewer. If only one is present, use that one.\n" . . . "Return EXACTLY one line:\n" "PHRASE=\n" "No trailing period. No extra text." 
Prompt template for generating "You are a negative relation editor.\n" "Input format:\n" " CLAUSE_INDEX=<1-based index to edit>\n" " PHRASE=The , , ... and \n" "Each clause is a relation expressed as a verb + complement, e.g., " "'is on a table', 'are between two cars', 'has a transparent faceplate', " "'holds a bottle', 'wears a red jersey', 'faces left', 'shows a temperature above 50 degrees'.\n" "\n" "Task:\n" "Select and edit EXACTLY the clause with the given CLAUSE_INDEX (1-based) to make it a clearly different (ideally opposite) NEGATIVE relation.\n" "\n" "Style guidance (choose ONE option to edit the selected clause):\n" " (A) If the selected clause encodes a spatial relation via a preposition or comparator " "(e.g., in/on/inside/outside/under/over/above/below/behind/in front of/" "to the left of/to the right of/between/near/at/surrounding/is surrounded by/on top of/at the bottom of, etc.), replace that spatial term with" "opposite or distinctly different spatial relation (e.g., on → inside, in → out of, left → right, above → below, beside → inside). " " (B) If the clause describes an action of the HEAD, replace this action with one distinctly different or opposite. Change the clause’s main lexical verb " "(e.g., holds → drops, wears → removes, shows → hides, opens → closes, runs → stands). You may also adjust adverbs or prepositions if any " "('is standing on' → 'is running away from', 'is driving slowly to' → 'is flying high from'). 
Preserve tense/number/aspect and auxiliaries " "(e.g., 'is holding' → 'is dropping', 'has opened' → 'has closed').\n" " (C) If the selected clause describes possession or properties of the HEAD " "(e.g., has/have…, is/are made of…, shows/displays/reads/contains/wears…), " "replace the complement with something clearly different or opposite (e.g., 'contains two plastic bags' → 'contains three paper bags').\n" "\n" "Hard constraints (must follow):\n" "• If the CLAUSE_INDEX is larger than the number of clauses you see, edit the LAST clause.\n" "• Keep the HEAD EXACTLY as in the input.\n" "• Keep ALL other clauses unchanged; preserve separators (commas and the final 'and').\n" "• Do NOT reorder clauses.\n" "• Edit ONLY the selected clause; do NOT add/remove clauses; the edited clause MUST be distinctly different from the original clause.\n" "• Avoid merely inserting 'not'; prefer concrete lexical or complement changes.\n" "• The new clause must not duplicate another clause and should remain grammatical (tense/number agreement intact).\n" "\n" "Output EXACTLY one line:\n" "PHRASE=\n" "No extra text. No trailing period."

Figure 22. Prompt Template for generating Ψ−REL

Prompt template for generating (q−, a+−) for WH setting

"You will convert a POSITIVE wh-question into a counterfactual, NEGATIVE wh-question + answer by replacing EXACTLY ONE " "ATTRIBUTE CLAUSE that describes the main object mentioned in the question.\n" "\n" "DEFINITIONS (apply to the input question):\n" "• Main object: the plain head noun phrase that the attributes modify (e.g., 'a mug', 'the DSLR camera'). If multiple objects present in the question, pick the one with more attributes as the main object.\n" "• Attribute clause: a modifier that directly describes the main object. It can be\n" " – pre-nominal adjectives (color, material, pattern, size, shape, quantity), e.g., 'red', 'ceramic', 'wide'.\n" " – post-nominal phrases (e.g., 'with …', 'featuring …', 'bearing …', 'labeled \"…\"', participial phrases like 'wearing …').\n" " – other short descriptors attached to the object (texture, condition/state, orientation/pose, printed text/numbers).\n" "• Relation clause: the words expressing spatial or action relations that position the main object relative to something else, " " e.g., 'on', 'under', 'next to', 'in front of', 'behind', 'to the left of', 'below', 'above', or light-verb forms like " " 'is on', 'is next to', 'is holding', 'is below'.\n" "\n" "To help you better identify the attribute clauses, the input questions are usually in the following forms:\n" " – WH + [main object + attribute clauses] + [relation clause]?\n" " – WH + [relation clause] + [main object + attribute clauses]?\n" "Note that the attribute clauses can either be pre-nominal (before the main object) or post-nominal (after the main object).\n" "\n" "EDIT RULES:\n" "1) Identify all attribute clauses attached to the main object you pick.\n" "2) Randomly choose ONE attribute clause (denoted as [original attribute]) and replace its content with a CLEARLY DIFFERENT or even OPPOSITE attribute clause (denoted as [new attribute]).\n" " • You may change multiple adjectives INSIDE [original attribute] to increase contrast.\n" " • Do NOT add, remove, or reorder other attribute clauses—only replace the contents of the chosen [original attribute clause].\n" "3) Keep everything else unchanged:\n" " • Do NOT change the main object.\n" " • Do NOT change the relation clause.\n" "4) If the question truly has no attribute clauses for the main object, output exactly: SKIP\n" "\n" "RANDOMNESS:\n" "You MUST choose one attribute clause at random position. Both the attribute clauses before or after the main object should have a chance to be chosen.\n" "\n" "ANSWER FORMAT (pick what fits; ensure correct number agreement and echo the original attribute verbatim):\n" "• The [main object] is not [new attribute], but it is [original attribute].\n" "• The [main object] does not have [new attribute], but it has [original attribute].\n" "• The [main object] contains no [new attribute], but it has/contains [original attribute].\n" "If none fits perfectly, write a brief, natural denial that clearly states the object lacks the [new attribute] and has the [original attribute]." "\n" "OUTPUT:\n" "Return EXACTLY ONE line:\n" "Q= || A=\n" "No extra text."

Figure 23. Prompt Template for generating (q−, a+−) for WH setting

Prompt template for Gemini-2.0-Flash extracting objects & attributes

"You are an expert at information extraction. Your task is to analyze an image description " "and extract all MAIN objects (objects that have at least ONE attribute) and their attributes into a JSON list of dictionaries.\n" "Follow these rules precisely:\n" "1. Your output MUST be a valid JSON list `[]` where each element is a dictionary.\n" "2. Each dictionary must contain two keys: `object` and `attribute`.\n" "3. The value for the `object` key must be a string containing the plain name of the MAIN object in the scene (e.g., 'airplane', 'man', 'bus').\n" "4. The value for the `attribute` key must be a list of strings. Each string must describe a characteristic of the object, such as its appearance, material, color, text, number, size, state, or posture.\n" "5. Every attribute string MUST begin with the word 'with'. If the original description doesn't use 'with', rewrite the attribute to include it while preserving the meaning.\n" "6. Do NOT include attributes that describe spatial relationships ('to the left of', 'behind') with OTHER MAIN objects.
However, spatial orientations NOT involving other MAIN OBJECTS should be counted as attributes ('with its nose facing left').\n" "7. Do NOT include attributes that describe actions with OTHER MAIN objects ('holding a black camera with flashy surface'). However, actions NOT involving other MAIN OBJECTS should be counted as attributes ('with hands raising in the air').\n" "8. Return ONLY the JSON list and nothing else. Do not add explanations or markdown formatting."

Figure 24. Prompt Template for extracting objects and attributes using Gemini-2.0-Flash [41] when constructing FINER-DOCCI.

Prompt template for Gemini-2.0-Flash extracting relations

You are an expert at STRICT information extraction. Given (1) a natural-language image description, and (2) a numbered catalog of MAIN objects (each with ALL its attributes), your task is to inspect pairs of objects and extract all **directed** relations that are **explicitly stated in the description text**. You must follow these rules:
1) Use ONLY the integer indices from the catalog for `object_a_idx` and `object_b_idx`.
2) For a pair (object_a, object_b), only output a relation if the caption directly describes how object_a is related to object_b in words.
   - Do NOT guess, infer, or rely on world knowledge.
   - If the relation is not explicitly written or is only implied, you MUST NOT output it.
3) The relation phrase `rel` must:
   - be a spatial or verb phrase,
   - be lower-case,
   - be copied verbatim from the description whenever possible, or be a **minimal paraphrase** that preserves exactly the same meaning (e.g., shortening function words).
4) The relation phrase must describe either:
   - a clear spatial relation (e.g., "is behind the", "is at the intersection of the"), or
   - a clear action or interaction (e.g., "holds", "moves along the").
5) The relation phrase must form a grammatical predicate between object_a and object_b:
   - it either begins with "is"/"are" (e.g., "is behind the", "is on top of the"), OR
   - it is a finite verb phrase that can directly follow object_a (e.g., "holds", "touches", "moves along the").
6) Output a JSON list of dictionaries, each of the form: {"object_a_idx": int, "rel": str, "object_b_idx": int}
7) No self-relations and no duplicate entries.
If **no** explicit relations are found, return an empty list: [].

Figure 25. Prompt Template for extracting relations using Gemini-2.0-Flash [41] when constructing FINER-DOCCI.
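The QA-generation prompts (Figures 19 and 23) return a single `Q=... || A=...` line, and Figure 23 may instead return the literal marker `SKIP` when the question has no attribute clause. A hedged sketch (ours; the function name and tolerance choices are assumptions, not the paper's code) of how this reply format might be post-processed:

```python
def parse_qa(raw: str):
    """Split a 'Q=<question> || A=<answer>' reply into a (q, a) tuple.

    Returns None for the explicit SKIP marker (Figure 23) or for any
    reply that breaks the one-line, '||'-separated output contract.
    """
    line = raw.strip()
    if line == "SKIP":                     # Figure 23: nothing to negate
        return None
    if "\n" in line or "||" not in line or not line.startswith("Q="):
        return None
    q_part, _, a_part = line.partition("||")
    q = q_part.strip()[len("Q="):].strip()
    a = a_part.strip()
    if not a.startswith("A="):
        return None
    a = a[len("A="):].strip()
    return (q, a) if q and a else None

# Example replies (hypothetical model outputs):
assert parse_qa("Q=Which animal is behind the fence? || A=the brown horse") == (
    "Which animal is behind the fence?", "the brown horse")
assert parse_qa("SKIP") is None
assert parse_qa("Q=only a question") is None
```

As with the `PHRASE=` format, malformed replies are dropped (and can be resampled) rather than repaired.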
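Figure 20's prompt ends with an explicit self-check: same number of items, exactly one item changed, and the negative object must not appear in the caption as a whole word (singular or plural). These conditions are mechanically verifiable on the caller's side; a rough sketch (ours — the list-based interface and the crude plural handling via a trailing-'s' heuristic are assumptions):

```python
import re

def check_negative_objects(caption: str, pos_items: list, neg_items: list) -> bool:
    """Verify Figure 20's self-check on a generated negative object list:
    equal length, exactly one differing item, and the replacement absent
    from the caption as a whole word (singular or plural).
    """
    if len(pos_items) != len(neg_items):
        return False
    diffs = [i for i, (p, n) in enumerate(zip(pos_items, neg_items)) if p != n]
    if len(diffs) != 1:                    # exactly one item may change
        return False
    new = neg_items[diffs[0]]
    head = new.split()[-1]                 # crude head noun: 'two bicycles' -> 'bicycles'
    # Whole-word match for singular or plural form (naive +s pluralization).
    pattern = rf"\b{re.escape(head.rstrip('s'))}s?\b"
    return re.search(pattern, caption, flags=re.IGNORECASE) is None

# Valid: one item swapped, 'bicycles' never occurs in the caption.
assert check_negative_objects("two cats sit on a rug", ["two cats", "a rug"], ["two bicycles", "a rug"])
# Invalid: two items changed at once.
assert not check_negative_objects("two cats sit on a rug", ["two cats", "a rug"], ["two bicycles", "a mat"])
```

Enforcing the self-check outside the model lets a pipeline reject and regenerate the rare replies where the model silently violates its own constraints.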
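Figure 25's prompt constrains its JSON output tightly: integer catalog indices, a lower-case non-empty `rel` phrase, no self-relations, and no duplicates. A minimal validator sketch for that schema (ours; in particular, the 0-based index range check is an assumption — adjust if the object catalog is numbered from 1):

```python
import json

def validate_relations(raw: str, num_objects: int) -> list:
    """Validate the JSON relation list required by Figure 25's prompt.

    Raises ValueError on any schema violation; returns the parsed list.
    """
    rels = json.loads(raw)
    if not isinstance(rels, list):
        raise ValueError("output must be a JSON list")
    seen = set()
    for r in rels:
        if not isinstance(r, dict):
            raise ValueError(f"entry is not a dictionary: {r!r}")
        a, b, rel = r.get("object_a_idx"), r.get("object_b_idx"), r.get("rel")
        if not (isinstance(a, int) and isinstance(b, int) and isinstance(rel, str)):
            raise ValueError(f"malformed entry: {r!r}")
        if not (0 <= a < num_objects and 0 <= b < num_objects):
            raise ValueError(f"index out of range: {r!r}")
        if a == b:                         # rule 7: no self-relations
            raise ValueError(f"self-relation: {r!r}")
        if not rel.strip() or rel != rel.lower():
            raise ValueError(f"bad relation phrase: {r!r}")
        key = (a, rel, b)                  # relations are directed
        if key in seen:                    # rule 7: no duplicates
            raise ValueError(f"duplicate entry: {r!r}")
        seen.add(key)
    return rels

rels = validate_relations('[{"object_a_idx": 0, "rel": "is behind the", "object_b_idx": 1}]', num_objects=2)
assert rels[0]["rel"] == "is behind the"
```

An empty list `[]` passes unchanged, matching the prompt's fallback for captions with no explicit relations.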
