DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments
[Fig. 1 overview: well-lit images → darkening process → multi-level low-light images; optional LLIE stage; example QA pair — Question: "Is this room suitable for sleeping?" Answer: "No, this looks like a kitchen."]

Fig. 1. Illustration of the DarkEQA benchmark. Traditional Embodied Question Answering (EQA) primarily evaluates VLMs on well-lit images, overlooking their robustness to real-world low-light conditions. We present DarkEQA, a new benchmark designed to fill this evaluation gap. DarkEQA assesses VLM performance under two distinct conditions: clean, well-lit inputs (L0) and a multi-level ladder of physics-based low-light images (L1-L5). This design enables a clear, separate analysis of commonsense reasoning and of robustness to visual degradation. The benchmark additionally examines the effect of applying Low-Light Image Enhancement (LLIE) models as a pre-processing step.
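The caption describes a physics-based darkening process that turns each well-lit image into an L1-L5 severity ladder, but the exact formation model is not given in this excerpt. The sketch below is a minimal illustration of such a process, assuming a common low-light model: linear illumination attenuation followed by Poisson shot noise and Gaussian read noise. The gain schedule (`GAINS`), photon scale, and noise parameters are hypothetical placeholders, not DarkEQA's actual settings.

```python
import numpy as np

# Hypothetical per-level illumination gains for L0 (clean) through L5 (darkest).
GAINS = (1.0, 0.5, 0.25, 0.12, 0.06, 0.03)

def darken(img: np.ndarray, level: int, read_noise_std: float = 0.01,
           photon_scale: float = 1000.0, rng=None) -> np.ndarray:
    """Degrade a well-lit sRGB image (float array in [0, 1]) to severity `level`."""
    if level == 0:
        return img  # L0: the clean, well-lit condition passes through untouched
    rng = rng or np.random.default_rng()
    # Undo display gamma so attenuation and noise act in (approximately) linear light.
    lin = np.clip(img, 0.0, 1.0) ** 2.2
    # Linear illumination attenuation according to the severity level.
    lin = lin * GAINS[level]
    # Signal-dependent shot noise: photon counts drawn from a Poisson distribution.
    photons = photon_scale * GAINS[level]
    lin = rng.poisson(lin * photons).astype(np.float64) / photons
    # Additive, signal-independent sensor read noise.
    lin = lin + rng.normal(0.0, read_noise_std, lin.shape)
    # Re-apply display gamma and clip back to the valid range.
    return np.clip(lin, 0.0, 1.0) ** (1.0 / 2.2)
```

Under a model like this, each image can be evaluated at all six levels (L0-L5), with or without an LLIE model applied before the VLM; the Poisson-Gaussian pairing is a standard choice because shot noise dominates as illumination drops while read noise stays fixed.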