Fig. 1. Illustration of the DarkEQA benchmark. Traditional Embodied Question Answering (EQA) primarily evaluates VLMs on well-lit images, overlooking their robustness to real-world low-light conditions. We present DarkEQA, a new benchmark designed to address this evaluation void. DarkEQA assesses VLM performance under two distinct conditions: clean, well-lit inputs (L0) and a multi-level ladder of physics-based low-light images (L1-L5). This heterogeneous design enables a clear analysis of both commonsense reasoning and robustness to visual degradation. Furthermore, the benchmark examines the effect of applying Low-Light Image Enhancement (LLIE) models as a pre-processing step.
Advances in vision-language models (VLMs) have significantly enhanced robotic capabilities, improving semantic scene understanding [1], [2], spatial reasoning [3], [4], and vision-language-action (VLA) policies [5], [6], [7]. Numerous
Embodied Question Answering (EQA) benchmarks have been proposed to assess commonsense reasoning for embodied agents, largely assuming well-lit, ideal visual conditions [8], [9]. However, household robots are often intended for 24/7 operation, which means they will frequently encounter low-light scenarios such as nighttime operation, entering dark rooms, or power blackouts. As robot deployment in varied environments grows, robust perception under these conditions is not an edge case but a core necessity [10]. Accordingly, benchmarks that explicitly stress-test embodied VLM reasoning under low illumination are essential to quantify real-world robustness. Nevertheless, acquiring large-scale, real-world low-light images with clean, paired annotations (ideally with corresponding well-lit reference views) is challenging and costly, which has hindered the construction of such benchmarks. As a result, existing benchmarks have largely overlooked systematic evaluation of VLM-based reasoning and perception under degraded illumination, limiting their ability to predict real-world robustness.
To fill this evaluation void, we present DarkEQA, an open-source benchmark that systematically measures the perceptual primitives required for embodied tasks under low-light conditions. The design of DarkEQA is grounded in a physically based formulation, where all visual degradations are modeled at the RAW sensor level (or in linear RGB space), following the physics of illumination and sensor noise to realistically simulate real-world Image Signal Processing (ISP) scenarios. Moreover, to ensure benchmark integrity and prevent potential data contamination [11], all Question Answering (QA) pairs are deterministically generated via a rule-based procedure rather than by commodity VLM services. QA generation produces a family of queries targeting perceptual primitives, ranging from simple object recognition (e.g., "Is there a cushion in the image?") to affordance reasoning (e.g., "I want to sleep, is this room suitable for this?"). DarkEQA provides 9.4k question-image pairs, a standardized evaluation protocol, and a public codebase to reproduce our low-light degradation pipeline. Using DarkEQA, we benchmark a diverse set of vision-language models (VLMs), including both open- and closed-source systems [12], [13], [14], [15], [16]. We also evaluate a state-of-the-art low-light image enhancement (LLIE) model [17] as a preprocessing baseline. Our evaluation yields two observations. First, while humans can recognize structural scene information in the input images from intensity contrast, all tested VLMs show a clear performance decline as the images degrade. Second, while LLIE preprocessing can improve performance at certain degradation levels, its effect is not consistently positive; in some cases it yields limited gains or even degrades performance, highlighting its practical limitations. Together, these results show that current VLM-based EQA pipelines remain brittle under low-light corruption and that perceptual enhancement alone is insufficient as a general solution, motivating robustness-oriented evaluation and method development.
Embodied Question Answering (EQA), first introduced by Das et al. [8], requires an agent to navigate and interact with an environment to answer a question. Early benchmarks primarily centered on static 3D scenes, such as ScanQA [20], to evaluate tasks like object identification and basic spatial relationships. OpenEQA [9] was introduced to assess an agent's exploration capabilities, posing diverse questions related to scene state, agent knowledge, and object attributes.
Concurrently, a substantial body of research has focused on benchmarking deep spatial reasoning [21], [22], [23], [24] and evaluating complex object relationships [25]. Other works have pushed towards dynamic and procedural understanding, utilizing 3D scene graphs [26], [27], [28] or focusing on multimodal reasoning [29], [30].
However, existing EQA benchmarks often overlook real-world robustness. While NoisyEQA [31] addresses query noise, robustness to adverse environmental conditions remains a significant gap. Notably, no current benchmark evaluates EQA in dark or low-light situations, which are common in the real world. We therefore introduce the first benchmark for indoor embodied question answering in dark environments, designed to assess robustness under poor visibility.
Recent research has explored two main directions for addressing the challenges of low-light visual perception. The first line of work targets robust recognition under low-light conditions, aiming to improve performance on specific vision tasks such as depth estimation, object detection, or pose estimation [32], [33], [34], [35], [36]. Although these approaches demonstrate impressive robustness, they are typically constrained to a single task, highlighting a gap between low-light robustness in isolated perception and the embodied reasoning required in EQA. The second research stream focuses on low-light image enhancement (LLIE), where the goal is to improve the visual quality of dark images for human perception or downstream models [37], [38], [39], [17], [40], [41]. These methods enhance brightness, contrast, and detail visibility using learning-based or physically inspired approaches. While LLIE methods improve visual quality, it remains unclear how they influence general embodied agents in low-light conditions. We therefore further explore whether LLIE can help EQA agents overcome the challenges they face in dark environments.
Our DarkEQA is designed to evaluate VLMs' recognition of core perceptual primitives from a single image-question pair under controlled low-light conditions. However, acquiring real-world low-light images with clean, paired annotations is challenging. To address this, we synthesize low-light images from a well-established indoor scene dataset, HM3D-Sem [42]. This section describes the low-light image synthesis for benchmark inputs (Sec. III-A) and the EQA dataset construction process (Sec. III-B). A key feature of our work is a dataset construction pipeline designed for high reproducibility and expandability.
Low-light images suffer from two distinct physical degradations. First, the reduced photon count leads to a fundamental loss of signal, which we term illumination degradation (i.e., an exposure-value (EV) drop). Second, this weakened signal yields a low signal-to-noise ratio (SNR), as sensor noise (e.g., shot, read, pattern, and quantization noise) becomes dominant relative to the remaining signal [19]. To reproduce these conditions for benchmark inputs, we design a physics-based low-light synthesis pipeline. Specifically, across multiple degradation severities (L1-L5, in increasing severity), we synthesize two paired low-light variants per original image: (i) a noise-free EV-drop variant and (ii) a physics-motivated variant with level-dependent sensor noise injected in the RAW domain, as shown in Fig. 3. This design enables disentangling the respective impacts of illumination degradation and sensor noise on the perceptual performance of VLMs.
- Noise-free low-light image synthesis: The exposure-value (EV) drop is applied in linear RGB space after decoding the sRGB images, as shown in the lower branch of the low-light image synthesis pipeline depicted in Fig. 2.
Decoding to linear RGB. First, we approximate linearization using gamma expansion. Let $x_{\mathrm{sRGB}}$ denote an sRGB pixel value of the input image and $x_{\mathrm{lin}}$ its linear form. Following [43], [18], we compute
$$x_{\mathrm{lin}} = \max(x_{\mathrm{sRGB}}, \epsilon)^{2.2},$$
where $\epsilon = 10^{-8}$ ensures numerical stability.
Exposure scaling. Next, let $\Delta\mathrm{EV}$ denote the absolute change in exposure value. Reducing the exposure by $\Delta\mathrm{EV}$ scales $x_{\mathrm{lin}}$ by $2^{-\Delta\mathrm{EV}}$. The exposure-scaled pixel value is computed by
$$x'_{\mathrm{lin}} = 2^{-\Delta\mathrm{EV}} \, x_{\mathrm{lin}}.$$
Re-encoding to sRGB. Finally, the exposure-scaled pixel value $x'_{\mathrm{lin}}$ is mapped back to sRGB via gamma encoding:
$$x'_{\mathrm{sRGB}} = \max(x'_{\mathrm{lin}}, \epsilon)^{1/2.2}.$$
We standardize degradation levels L1-L5 with $\Delta\mathrm{EV} \in \{2.0, 4.0, 6.0, 7.5, 9.0\}$, respectively (L0 is the original).
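To make this three-step chain concrete, the following minimal Python sketch reproduces the decode / scale / re-encode procedure described above. It is an illustrative sketch under the stated gamma-2.2 approximation; the function name and constants are ours, not necessarily those of the released codebase.

```python
import numpy as np

# Benchmark EV drops for levels L1-L5 (L0 is the unmodified image).
EV_DROP = {"L1": 2.0, "L2": 4.0, "L3": 6.0, "L4": 7.5, "L5": 9.0}

def ev_drop_srgb(img_srgb_u8: np.ndarray, delta_ev: float, eps: float = 1e-8) -> np.ndarray:
    """Noise-free low-light synthesis: sRGB -> linear RGB -> EV scaling -> sRGB."""
    x_srgb = img_srgb_u8.astype(np.float32) / 255.0
    # Approximate gamma expansion to linear RGB (gamma 2.2, eps for numerical stability).
    x_lin = np.maximum(x_srgb, eps) ** 2.2
    # Reducing exposure by delta_ev scales linear intensities by 2^(-delta_ev).
    x_lin_dark = x_lin * 2.0 ** (-delta_ev)
    # Gamma re-encoding back to sRGB, then quantization to 8 bits.
    x_srgb_dark = np.maximum(x_lin_dark, eps) ** (1.0 / 2.2)
    return np.clip(np.round(x_srgb_dark * 255.0), 0, 255).astype(np.uint8)

# Example: synthesize the L3 (delta_ev = 6.0) variant of an image `img`.
# dark_l3 = ev_drop_srgb(img, EV_DROP["L3"])
```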
- Physics-motivated low-light image synthesis: We synthesize realistic low-light images using a physics-based pipeline that combines ISP inversion/forward processing [18] with RAW-domain noise modeling [19]. The process is shown in the upper branch of the low-light image synthesis pipeline of Fig. 2.
Unprocessing (sRGB → RAW). We first normalize an 8-bit sRGB image $I \in \{0, \ldots, 255\}^{H \times W \times 3}$, where $H$ and $W$ denote the image height and width, respectively, to
$$I_{\mathrm{sRGB}} = I / 255 \in [0, 1]^{H \times W \times 3}.$$
To obtain a camera-linear RAW image from $I_{\mathrm{sRGB}}$, we invert the ISP following [18]. We denote the unprocessing operator by $u(\cdot)$ and express the resulting Bayer RAW mosaic as
$$B = u(I_{\mathrm{sRGB}}).$$
The unprocessing operator $u(\cdot)$ consists of five steps: (i) inverse tone mapping, (ii) gamma expansion, (iii) RGB→camera color correction with a sampled matrix $M_{\mathrm{rgb}\to\mathrm{cam}}$, (iv) inversion of white-balance/brightness gains with highlight preservation, and (v) mosaic extraction into an RGGB Bayer representation. This restores a scene-referred signal whose noise statistics are defined with respect to photon counts and sensor readout electronics, not post-ISP perceptual tone curves.
Noise formation in RAW. Following the physics-based formation model of [19], we inject four noise components into the camera-linear RAW signal. Let $B$ denote the clean, mosaiced RAW image obtained from unprocessing. After converting $B$ from normalized units to the sensor's ADU domain, we sample a system gain $K$ log-uniformly from $[0.1, 6.0]$. The noisy RAW image is then expressed as
$$\tilde{B} = \left(N_4 \circ N_3 \circ N_2 \circ N_1\right)(B, K),$$
where $N_i$ denotes the $i$-th noise operator, described below, mapping a Bayer RAW tensor and the system gain $K$ to a Bayer RAW tensor.
(1) Photon shot noise. Photon arrival is discrete and stochastic: for each pixel, the number of photoelectrons $N$ follows $N \sim \mathrm{Poisson}(\lambda)$, where $\lambda$ is proportional to the scene irradiance. To simulate extreme low-light capture, we apply an ISO amplification ratio $r \in [100, 300]$: (i) reduce the signal by $r$ (low-light capture), (ii) add Poisson noise, and (iii) amplify the signal back by $r$ using the sensor gain. This preserves the characteristic low-photon-count statistics while allowing the final output brightness to be controlled independently via the EV drop.
(2) Read noise. The readout electronics introduce an additive noise term $N_{\mathrm{read}}$. We model it using a Tukey-λ distribution with a channel-wise DC offset (color bias). The scale parameter $\sigma_{\mathrm{TL}}$ grows log-linearly with the system gain $K$,
$$\log \sigma_{\mathrm{TL}} \mid \log K \sim \mathcal{N}\!\left(a_{\mathrm{TL}} \log K + b_{\mathrm{TL}},\ \hat{\sigma}_{\mathrm{TL}}^{2}\right),$$
capturing the heavy-tailed distribution observed under extreme low light [19].
(3) Row noise. Line-wise variations in the readout circuitry produce banding artifacts. Each row $i$ receives a shared offset $n_{\mathrm{row}}^{(i)} \sim \mathcal{N}(0, \sigma_r^2)$, where $\sigma_r$ also scales log-linearly with $K$.
(4) Quantization noise. Analog-to-digital conversion introduces a rounding error $N_q$ modeled as $N_q \sim \mathcal{U}(-0.5, 0.5)$, where $\mathcal{U}$ denotes a uniform distribution on $[-0.5, 0.5]$, assuming a standard unit (1 ADU) quantization step.
Simplified ISP (RAW → sRGB). Converting RAW back to sRGB inverts the unprocessing: (i) white balance with sampled gains, (ii) bilinear demosaicing from the RGGB Bayer pattern to RGB, (iii) color correction using $M_{\mathrm{cam}\to\mathrm{rgb}}$, (iv) an EV drop by $\Delta\mathrm{EV}$ in linear space (multiplying intensities by $2^{-\Delta\mathrm{EV}}$) to match the target degradation levels L1-L5, (v) gamma compression, and (vi) quantization to 8-bit sRGB.
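To make noise components (1)-(4) concrete, the sketch below injects them into a single-channel Bayer plane already expressed in ADU. It is a simplified illustration under several assumptions: the Tukey-λ shape parameter, the log-linear coefficients, and the purely additive combination of the read, row, and quantization terms are placeholders, and the channel-wise color bias of the read noise is omitted; the calibrated values follow [19] and our released pipeline.

```python
import numpy as np
from scipy.stats import tukeylambda

rng = np.random.default_rng(0)

def inject_raw_noise(raw_adu: np.ndarray, K: float, iso_ratio: float,
                     lam: float = 0.1, a_read: float = 1.0, b_read: float = 0.0,
                     a_row: float = 1.0, b_row: float = -1.0) -> np.ndarray:
    """Physics-motivated noise injection on a camera-linear Bayer RAW image (ADU).

    K:         system gain, sampled log-uniformly from [0.1, 6.0].
    iso_ratio: ISO amplification ratio r in [100, 300].
    lam, a_*, b_* are illustrative stand-ins for calibrated sensor parameters.
    """
    # (1) Photon shot noise: convert to photoelectrons, attenuate by r to mimic
    # low-light capture, Poissonize, then amplify back by r and the gain K.
    electrons = np.maximum(raw_adu / (K * iso_ratio), 0.0)
    shot = rng.poisson(electrons).astype(np.float64) * K * iso_ratio
    # (2) Read noise: Tukey-lambda with scale growing log-linearly in K
    # (the channel-wise DC offset is omitted in this sketch).
    sigma_tl = np.exp(a_read * np.log(K) + b_read)
    read = tukeylambda.rvs(lam, scale=sigma_tl, size=raw_adu.shape, random_state=1)
    # (3) Row noise: one Gaussian offset shared by all pixels in each row.
    sigma_row = np.exp(a_row * np.log(K) + b_row)
    row = rng.normal(0.0, sigma_row, size=(raw_adu.shape[0], 1))
    # (4) Quantization noise: uniform rounding error of one ADU step.
    quant = rng.uniform(-0.5, 0.5, size=raw_adu.shape)
    return shot + read + row + quant
```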
We build the evaluation dataset upon a representative subset of 52 scenes from HM3D-Sem [42], selected for diversity and semantic richness. For each scene, we record a human-demonstrated navigation trajectory that systematically explores the environment to maximize spatial coverage. To generate the ground-truth QA pairs, we uniformly subsample the trajectory, selecting keyframes at a fixed time interval (e.g., one frame every 2 s) and rendering their geometric and semantic modalities (e.g., RGB, depth, segmentation). We then use Algorithm 1 as a deterministic procedure to automatically generate QA pairs from the pre-computed per-keyframe statistics. This approach ensures each question has a single, verifiable answer by filtering ambiguities (e.g., tiny objects), requires no manual annotation, and avoids potential data contamination by not using commodity VLM services. The entire process is fully reproducible.
[Algorithm 1: rule-based QA generation. Stage 1 iterates over the segments of each keyframe to collect per-frame statistics Φ_f; Stage 2 generates all viable questions per frame and returns the QA set Q.]
Algorithm 1 operates in two stages: frame-statistics extraction (Stage 1) and QA generation (Stage 2). In Stage 1, we cache the frame statistics $\Phi_f$ required for Stage 2. Each frame $f$ is represented as a quadruple $f = (I^f_{\mathrm{RGB}}, I^f_{\mathrm{depth}}, I^f_{\mathrm{sem}}, I^f_{\mathrm{over}})$, comprising an RGB image, a depth map, a semantic label map, and an over-segmentation map, respectively (the RGB image is three-channel, whereas the others are single-channel). Using these frame-wise statistics and a set of predefined rules, Stage 2 predicts the room type for each frame, enumerates the applicable question templates, and generates the corresponding per-frame QA pairs. For example, consider the "Closest Object Recognition" question in Fig. 4. Object-level statistics are first extracted. The QA generation pipeline then validates two conditions: (i) at least two non-structural, non-quasi-2D object instances with valid depth measurements exist, and (ii) the depth gap between the top-two closest objects exceeds a minimum threshold to ensure perceptual validity. If both conditions are satisfied, the closest object is taken as the ground-truth answer. In this example, "chair" is identified as the closest object.
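As an illustration of such a Stage-2 rule, the sketch below mirrors the two validity checks described for the "Closest Object Recognition" family. The data structure, field names, and the depth-gap threshold are illustrative assumptions, not the exact definitions used in our codebase.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ObjStat:
    """Per-object statistics cached in Stage 1 (illustrative structure)."""
    label: str           # semantic class, e.g. "chair"
    median_depth: float  # metres, taken from the depth map
    is_structural: bool  # wall, floor, ceiling, ...
    is_quasi_2d: bool    # rugs, pictures, mirrors, ...

MIN_DEPTH_GAP = 0.5  # assumed separation threshold (metres)

def closest_object_qa(objects: list[ObjStat]) -> Optional[dict]:
    """Stage-2 rule for the 'Closest Object Recognition' family.

    Returns a QA pair only if the frame passes both validity checks;
    otherwise the template is skipped for this frame.
    """
    candidates = [o for o in objects
                  if not o.is_structural and not o.is_quasi_2d
                  and o.median_depth > 0.0]
    if len(candidates) < 2:                       # (i) need >= 2 valid instances
        return None
    candidates.sort(key=lambda o: o.median_depth)
    if candidates[1].median_depth - candidates[0].median_depth < MIN_DEPTH_GAP:
        return None                               # (ii) ambiguous depth gap
    return {
        "question": "Which object is closest to the camera?",
        "choices": sorted({o.label for o in candidates}),
        "answer": candidates[0].label,            # e.g. "chair"
    }
```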
This pipeline generates five question families targeting visual primitives for embodied operation: Room-Type Recognition, Room Affordance Check, Object Recognition, Object Attribute, and Closest Object Recognition. The examples for each family are provided in Fig. 4.
Our DarkEQA comprises 52 scenes selected from HM3D-Sem, yielding 3,911 frames at 1440 × 2560 resolution with ∼9.4K QA pairs. Fig. 5 shows that the dataset exhibits semantic-class and room-category distributions representative of typical residential environments. The semantic annotations cover 23 non-structural object classes; the most prevalent are cabinet, bed, mirror, and table, which together account for about 53%. The room-category distribution reflects the natural spatial composition of household scenes. The question distribution across the five question families shows moderate imbalance, with frequencies determined by the geometric and semantic constraints of our rule-based QA generation pipeline and validated through human sanity checks to ensure answer correctness.
In this section, we describe our experimental settings and provide quantitative evaluation results of various VLMs on DarkEQA, along with an analysis of the effects of illumination degradation, noise injection, and LLIE models used as a preprocessing module.
We evaluate both VLMs and text-only LLMs (blind LLMs) on DarkEQA. For each keyframe and degradation condition, we present a single question together with a fixed, small set of candidate answers (room-type labels, object classes, color names, or a candidate list for the closest object). VLMs receive the image and the question-choice template, whereas blind LLMs see only the textual question and choices. Each question is thus cast as a multiple-choice problem, and models are instructed to output exactly one answer from the choices. This constrains the response space, avoids ambiguities of free-form generation, and enables exact-match scoring.
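A minimal sketch of this protocol is shown below; the exact prompt wording and normalization are illustrative assumptions, while the multiple-choice casting and exact-match scoring follow the setup described above.

```python
def build_prompt(question: str, choices: list[str]) -> str:
    """Cast a DarkEQA item as a multiple-choice query (wording is illustrative)."""
    options = "\n".join(f"({chr(ord('A') + i)}) {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer with exactly one of the listed choices."

def exact_match(prediction: str, answer: str) -> bool:
    """Exact-match scoring after trivial whitespace/case normalization."""
    return prediction.strip().lower() == answer.strip().lower()

# accuracy = mean over the dataset of
#   exact_match(model(image, build_prompt(question, choices)), answer)
```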
Blind LLMs. We consider the scenario of a blind agent that must answer questions requiring visual information without access to the image [9]. Although DarkEQA focuses on VLM behavior under illumination change and noise injection, we use the blind-LLM results to detect possible biases in our dataset and to test how well the questions can be answered purely from priors about indoor environments. For the LLM choice, we report results for GPT-4 [44] and LLaMA-3.1-8B [45].
VLMs. We evaluate a range of VLMs across different parameter scales. For 7-8B models, we report results for LLaVA-1.6-7B [12], LLaVA-OneVision-8B [13], InternVL3.5-8B [14], and Qwen3-VL-8B [15]. For larger-scale models (≥ 30B), we additionally evaluate InternVL3.5-30B [14] and Qwen3-VL-32B [15] from the same respective series. Finally, we include GPT-4o [16] as an upper bound.
LLIE model. We use DarkIR [17] as our LLIE baseline throughout the evaluation for enhancing low-light images.
Impact of illumination drop and sensor noise. To understand the robustness of VLMs to illumination degradation, we first observe their performance under two types of low-light simulation: (1) a pure EV drop and (2) an EV drop with sensor noise. As shown in Fig. 6-(a), both degradations consistently lead to a significant decrease in VLM accuracy. Notably, the introduction of sensor noise compounds this decline, resulting in a more pronounced performance drop than the pure EV reduction. This confirms that VLMs are highly sensitive to such visual degradation, with noise being a critical factor.
Effectiveness of low-light image enhancement (LLIE) preprocessing. Given the observed performance degradation, we investigate whether pre-processing low-light images with a state-of-the-art LLIE model [17] can mitigate these issues. We apply the LLIE model to the noise-added low-light images before feeding them into the VLMs. As illustrated in Fig. 6-(b), this approach yields mixed results. While we observe a significant accuracy improvement at the more severe low-light levels (L4 and L5), performance decreases at moderate levels (L1-L3). This unstable behavior highlights the challenge of reliably enhancing low-light images across different levels of degradation. Although current LLIE models enhance perceptual quality, the results suggest they may be biased toward certain degradation levels, as shown in Fig. 6-(d).
Model-specific accuracy. Fig. 6-(c) provides a detailed comparison of the performance trends across individual VLMs for noisy inputs without LLIE preprocessing. While the specific degradation curves vary slightly across models, the overall trend is a largely similar decline in accuracy as low-light conditions intensify. Although the commodity service GPT-4o consistently demonstrates the highest performance, it also degrades under low-light conditions. Furthermore, we observe an interesting point: at the most severe low-light level (L5), some VLMs achieve accuracy lower than that of GPT-4 (the blind-LLM baseline), which operates solely on textual input without any visual information. This indicates that, for images under extreme degradation, the models are unable to effectively utilize the visual information, leading to a poorer understanding of scene semantics than relying purely on language priors. This is all the more striking because the LLIE-enhanced L5 image in Fig. 6-(d) appears perceivable to human eyes. This hints that (1) there is low correlation between perceptual quality and VLMs' task performance, and (2) effective LLIE integration with VLMs requires task-oriented LLIE modules tailored to VLM perception.
Question-wise accuracy. To gain a more granular understanding of the performance decline, we further analyze the accuracy degradation across question types, as shown in Fig. 7. While most categories exhibit a steady decline, we observe a critical phenomenon in two specific types: "Room Type Recognition" and "Object Attribute - Color". For these categories, VLM accuracy drops below that of the GPT-4 (blind-LLM) baseline at severe degradation levels (L5 for the former, and L4 and L5 for the latter). That this effect is particularly pronounced for the "Color" category strongly suggests that VLMs struggle to extract or preserve essential visual semantic information, such as color, when processing heavily darkened images.
Interestingly, this observation is analogous to the behavior of human vision in dark scenes, where the visual system primarily relies on luminance-sensitive rod cells because color-sensitive cone cells function much less effectively.
For details and case-by-case results, please refer to Table I, which contains the complete data supporting our analysis. L0 denotes the original images and serves as the baseline, while L1-L5 are darkened via the EV drop, optionally combined with noise and/or LLIE. The (✓/✗) indicators specify the active components, and the small gray numbers next to each score denote changes relative to L0.
We introduce DarkEQA, a new benchmark designed to address an overlooked yet critical regime in VLM evaluation: the lack of systematic analysis of embodied reasoning in low-light conditions. Using a physically grounded low-light image synthesis pipeline, we create a reproducible benchmark to measure VLM robustness against realistic visual degradations. Our findings reveal that current VLMs are brittle in the dark and that seemingly straightforward solutions such as LLIE preprocessing can yield unstable results. Although our benchmark provides evaluations based on HM3D-Sem, our contribution extends beyond this single dataset: the provided low-light image synthesis algorithm and rule-based QA generation pipeline can be leveraged to adapt numerous existing datasets for new training and evaluation purposes. While our benchmark reveals the vulnerabilities of both VLMs and LLIE models to indoor low-light conditions, a detailed causal analysis of these failures remains a valuable direction for future research. Furthermore, our adopted approach of synthesizing low-light images from rendered inputs is a practical choice given physical and financial constraints; however, mitigating a potential real-to-sim gap presents another important avenue for subsequent work.