Beyond Needle(s) in the Embodied Haystack: Environment, Architecture, and Training Considerations for Long Context Reasoning

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original ArXiv source.

We introduce $\infty$-THOR, a new framework for long-horizon embodied tasks that advances long-context understanding in embodied AI. $\infty$-THOR provides: (1) a generation framework for synthesizing scalable, reproducible, and unlimited long-horizon trajectories; (2) a novel embodied QA task, Needle(s) in the Embodied Haystack, where multiple scattered clues across extended trajectories test agents’ long-context reasoning ability; and (3) a long-horizon dataset and benchmark suite featuring complex tasks that span hundreds of environment steps, each paired with ground-truth action sequences. To enable this capability, we explore architectural adaptations, including interleaved Goal-State-Action modeling, context extension techniques, and Context Parallelism, to equip LLM-based agents for extreme long-context reasoning and interaction. Experimental results and analyses highlight the challenges posed by our benchmark and provide insights into training strategies and model behaviors under long-horizon conditions. Our work provides a foundation for the next generation of embodied AI systems capable of robust, long-term reasoning and planning.


💡 Research Summary

The paper introduces ∞‑THOR, a comprehensive framework designed to push the limits of long‑context reasoning in embodied AI. Built on top of the AI2‑THOR simulator, ∞‑THOR provides (1) a trajectory generation pipeline capable of synthesizing unlimited, reproducible long‑horizon episodes that can exceed 1 million tokens when processed by language models, (2) a novel embodied question‑answering benchmark called Needle(s) in the Embodied Haystack (NiEH), and (3) a suite of architectural and training techniques tailored for handling extreme sequence lengths.

Trajectory generation works by sampling from seven predefined task templates (e.g., “pick two objects and place”, “pick and place with movable receptacle”), selecting objects, and using a classical PDDL planner to produce ground‑truth action sequences. Successful rollouts are concatenated into episodes of 400–950 steps containing 14–33 sub‑goals on average. The final synthetic goal deliberately references objects that appear only in the first 20% and the last 20% of the episode, forcing agents to remember and reason over information separated by hundreds of steps.
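The generation loop above can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: `plan_subgoal` stands in for the classical PDDL planner, the template names are an illustrative subset of the seven, and the 20%-slice goal construction is a simplification of the described behavior.

```python
import random

TASK_TEMPLATES = [
    "pick_and_place",
    "pick_two_and_place",
    "pick_and_place_movable_receptacle",
]  # illustrative subset of the seven templates

def plan_subgoal(template, objects):
    """Stand-in for the classical PDDL planner: returns a
    ground-truth action sequence for one sampled sub-goal."""
    target = random.choice(objects)
    return [("goto", target), ("pickup", target), ("putdown", target)]

def generate_episode(objects, min_steps=400, max_steps=950, seed=0):
    random.seed(seed)  # fixed seed -> reproducible trajectory
    actions, subgoals = [], []
    while len(actions) < min_steps:
        template = random.choice(TASK_TEMPLATES)
        subgoals.append(template)
        actions.extend(plan_subgoal(template, objects))
        if len(actions) >= max_steps:
            break
    # Synthetic final goal references objects seen only near the
    # start and the end of the episode, forcing long-range recall.
    early = actions[: len(actions) // 5]   # first 20% of steps
    late = actions[-(len(actions) // 5):]  # last 20% of steps
    goal = (random.choice(early)[1], random.choice(late)[1])
    return actions, subgoals, goal
```

Because each rollout is generated from a fixed seed and template, episodes are reproducible, which is what makes unlimited yet comparable trajectory synthesis possible.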

NiEH is presented in two evaluation modes. The static mode generates QA pairs from the recorded trajectories. Questions are of two types: single‑evidence (answerable from a single timestep) and multi‑evidence (requiring integration of multiple temporally distant observations). Question templates cover binary, “what”, “where”, and “how many” queries. An ensemble of four state‑of‑the‑art multimodal LLMs (LLaVA‑OneVision, Qwen2.5‑VL, DeepSeek‑VL, Pixtral) validates answerability, ensuring that only questions solvable by current models are kept. The interactive mode lets agents act in the environment to accomplish the synthetic final goal, providing a real‑time test of long‑term memory and planning.
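The static-mode QA construction and ensemble filtering can be sketched as follows. This is a hypothetical simplification: the question templates, the `(timestep, object, location)` trajectory encoding, and `ensemble_filter` are stand-ins for the paper's template set and its four-VLM answerability check.

```python
def make_questions(trajectory):
    """Generate NiEH-style QA pairs from a recorded trajectory,
    given as a list of (timestep, object, location) tuples."""
    qa = []
    for t, obj, loc in trajectory:
        # Single-evidence: answerable from a single timestep.
        qa.append({"type": "single",
                   "q": f"Where was the {obj} at step {t}?",
                   "a": loc, "evidence": [t]})
    # Multi-evidence: integrate temporally distant observations,
    # e.g. a "how many" question over the whole episode.
    for obj in {o for _, o, _ in trajectory}:
        steps = [t for t, o, _ in trajectory if o == obj]
        if len(steps) > 1:
            qa.append({"type": "multi",
                       "q": f"How many times was the {obj} observed?",
                       "a": str(len(steps)), "evidence": steps})
    return qa

def ensemble_filter(qa_pairs, models):
    """Keep a question only if some validator model answers it
    correctly (stand-in for the four-VLM answerability check)."""
    return [qa for qa in qa_pairs
            if any(m(qa["q"]) == qa["a"] for m in models)]
```

The key design point is the `evidence` field: multi-evidence questions list several temporally distant timesteps, so no single observation suffices to answer them.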

To process such long sequences, the authors explore two model families. The Interleaved Goal‑State‑Action (GSA) model serializes visual frames, language instructions, and action tokens into a single multimodal token stream, feeding over 1M tokens directly into a large‑scale vision‑language model. The Memory‑Augmented GSA model compresses the historical trajectory into textual summaries or visual memory slots and retrieves relevant chunks on demand, keeping the active context window within a few hundred tokens while still accessing the full episode. Both families use recent VLM backbones and the LLaVA‑OneVision tokenizer for token counting.
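The interleaved serialization can be sketched schematically. The special markers (`<goal>`, `<image>`, `<act:...>`) are illustrative placeholders, not the model's actual vocabulary; in a real VLM the `<image>` slots would be replaced by the vision tower's frame embeddings.

```python
def serialize_gsa(goal, steps, image_token="<image>"):
    """Interleave a language goal with per-step (frame, action)
    pairs into one token stream, schematically mirroring the
    Interleaved Goal-State-Action layout."""
    tokens = ["<goal>"] + goal.split() + ["</goal>"]
    for _frame, action in steps:
        tokens.append(image_token)        # visual state at this step
        tokens.append(f"<act:{action}>")  # discrete action token
    return tokens
```

Over hundreds of steps this stream grows linearly, which is why uncompressed GSA inputs reach the million-token scale while the memory-augmented variant trades full access for a bounded active window.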

Long‑context handling is further enhanced with extension techniques such as linear interpolation, dynamic scaling, YaRN, and LongRoPE, allowing the models to attend over contexts of up to 1.3M tokens. Training efficiency is achieved through Context Parallelism based on Ring‑Attention, which distributes attention computation across multiple GPUs, reducing memory consumption by ~30% and accelerating fine‑tuning by ~1.8×.
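Of these techniques, linear position interpolation is the simplest: query/key positions are divided by a scale factor so that a longer sequence is squeezed back into the RoPE position range the model was trained on. A minimal NumPy sketch of the rotation angles under this assumption (not the paper's implementation):

```python
import numpy as np

def rope_freqs(dim, base=10000.0):
    """Standard RoPE inverse frequencies for a head dimension."""
    return base ** (-np.arange(0, dim, 2) / dim)

def rope_angles(positions, dim, scale=1.0):
    """Rotation angles with linear position interpolation: dividing
    positions by `scale` maps an extended sequence back into the
    trained position range."""
    pos = np.asarray(positions, dtype=np.float64) / scale
    return np.outer(pos, rope_freqs(dim))
```

With `scale = new_len / trained_len`, the last position of an extended sequence receives the same angles as a mid-range trained position, at the cost of compressing positional resolution; dynamic scaling, YaRN, and LongRoPE refine this trade-off per frequency band.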

Empirical results reveal several key findings. (1) Models that ingest the full uncompressed trajectory outperform memory‑augmented variants by an average of 12 percentage points, especially on multi‑evidence questions where the gap reaches 15 pp. (2) Context Parallelism enables stable convergence even with 1.2 M‑token inputs, demonstrating scalability. (3) Incorporating the NiEH QA dataset into existing photorealistic benchmarks (e.g., Habitat‑3.0) yields up to +11.2 % improvement in task success, indicating that long‑context pre‑training transfers beneficially to other domains. (4) The framework integrates with ManipulaTHOR, allowing low‑level robot arm manipulation, and showcases successful sim‑to‑real transfer in a controlled setting.

In summary, ∞‑THOR delivers an end‑to‑end ecosystem for long‑horizon embodied AI research: (i) unlimited generation of richly annotated, hundreds‑step trajectories; (ii) a challenging multi‑modal QA benchmark that stresses memory, counting, and spatio‑temporal reasoning; (iii) architectural designs and training strategies that make extreme context lengths tractable; and (iv) evidence of real‑world applicability through sim‑to‑real experiments. By unifying environment, data, model, and training components, ∞‑THOR sets a new baseline for future work aiming to build agents capable of robust, long‑term reasoning and planning in complex embodied settings.

