VideoAtlas: Navigating Long-Form Video in Logarithmic Compute
Mohamed Eltahir¹, Ali Habibullah¹*, Yazan Alshoibi¹*, Lama Ayash¹·², Tanveer Hussain³†, Naeemullah Khan¹‡

¹ King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
² Department of Computer Science, King Khalid University (KKU), Abha, Saudi Arabia
³ Department of Computer Science, Edge Hill University, Ormskirk, England
{mohamed.hamid, ali.habibullah, yazen.shaebi, lama.ayash}@kaust.edu.sa
hussaint@edgehill.ac.uk, naeemullah.khan@kaust.edu.sa

Abstract

Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations such as uniform sampling, and long context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. To overcome this, we introduce VideoAtlas, a task-agnostic environment that represents video as a hierarchical grid that is simultaneously lossless, navigable, scalable, and caption- and preprocessing-free. An overview of the video is available at a glance, and any region can be recursively zoomed into, with the same visual representation used uniformly for the video, intermediate investigations, and the agent's memory, eliminating lossy text conversion end-to-end. This hierarchical structure ensures that access depth grows only logarithmically with video length. For long context, Recursive Language Models (RLMs) recently offered a powerful solution for long text, but extending them to the visual domain requires a structured environment to recurse into, which VideoAtlas provides. Casting VideoAtlas as a Markov Decision Process unlocks Video-RLM: a parallel Master-Worker architecture where a Master coordinates global exploration while Workers concurrently drill into assigned regions to accumulate lossless visual evidence.
We demonstrate three key findings: (1) logarithmic compute growth with video duration, in contrast to the linear cost of baselines, further amplified by a 30-60% multimodal cache hit rate arising from the grid's structural reuse; (2) environment budgeting, where bounding the maximum exploration depth provides a principled compute-accuracy hyperparameter; and (3) emergent adaptive compute allocation that scales with question granularity. When scaling from 1-hour to 10-hour benchmarks, Video-RLM remains the most duration-robust method with minimal accuracy degradation while baselines degrade significantly, demonstrating that structured environment navigation is a viable and scalable paradigm for video understanding.

* Equal Contribution  † Corresponding Author  ‡ Principal Investigator (PI)
§ Code: github.com/mohammad2012191/VideoAtlas

Figure 1. Logarithmic compute scaling with video duration. Video-RLM's hierarchical grid grows sub-linearly (O(log T)), requiring up to 9.7× fewer tokens than linear-scaling baselines. A uniform VLM maxes out its 256K context, trading off sampled frame count against resolution.

1. Introduction

Understanding long-form video requires locating sparse, task-relevant evidence within a massive temporal space: an hour-long video has 90,000 frames at 25 fps, yet the answer to a query often resides in a few seconds. When a film editor faces the same challenge, the solution is well established: a contact sheet (a single composite image showing sampled shots) to identify promising regions at a glance before zooming into only those clips. This loop of overview, identify, zoom is the key to efficient visual navigation, and it is precisely what current VLMs lack.

Existing approaches to long-form video understanding can be broadly categorized into four paradigms: uniform sampling, composite grids, caption-based, and agentic approaches.
Uniform sampling [6, 20] introduces severe temporal sparsity: at practical budgets, frames are sampled minutes apart, so short events are systematically missed. Moreover, within a fixed context window, increasing the number of sampled frames forces a proportional decrease in per-frame resolution, creating a fundamental coverage-vs-fidelity tradeoff. Composite grids [5, 10] pack frames into a single representative image, improving token efficiency but remaining a fixed, lossy snapshot. Caption-based and agentic approaches [17, 22, 25] rely on text as their primary reasoning medium (captioning clips, storing text summaries, or converting visual observations into language before planning). Even when these systems adaptively sample frames, their intermediate memory and decision-making operate over text, not over a structured visual space. Any visual detail overlooked during transcription or abstraction cannot be recovered by subsequent reasoning. These paradigms also face distinct scalability bottlenecks: standard VLM pipelines [2] must decode the video, extract frames, and perform visual tokenization on CPU before any reasoning begins. For long videos, this preprocessing alone can exhaust hundreds of gigabytes of system RAM. Caption-based and agentic methods avoid this by converting video to text first, but incur a different cost: an offline captioning stage that scales linearly with video duration and irreversibly discards visual fidelity. While some agentic methods [22] perform this conversion online, they still rely on text as the intermediate representation, inheriting the same information loss.

We claim that a useful video representation must be simultaneously lossless (frame-level access at any resolution), navigable (agent-directed), scalable (no context ceiling), caption-free (native visual reasoning), and preprocessing-free (no offline decoding). As detailed in Tab.
1, current approaches typically optimize for a subset of these properties at the expense of others.

VideoAtlas. We propose a task-agnostic environment that represents any video as a navigable, hierarchical K × K image grid (Fig. 2). The root grid renders the full video as a contact sheet. By invoking EXPAND (a recursive descent action that generates a new, finer-resolution sub-grid for a selected cell), an agent achieves sub-second temporal precision in O(log T) steps, where T is the video duration in seconds. The design is uniform throughout: the video, intermediate investigations, and the agent's internal evidence scratchpad (a lossless multimodal memory that stores collected frames, subtitles, timestamps, and descriptions) are all rendered as grids. This completely eliminates captioning, offline preprocessing, and context-window ceilings, satisfying all five properties above. Crucially, VideoAtlas also escapes the coverage-vs-fidelity tradeoff inherent to uniform VLMs: within a fixed context window, sampling more frames forces lower per-frame resolution, and vice versa. VideoAtlas sidesteps this entirely: each grid image is always rendered at full resolution, and the agent zooms only where needed, never sacrificing visual fidelity for temporal coverage. Structurally, the hierarchy yields logarithmic compute growth: as video length increases, only a few additional depth layers are needed rather than linearly more frames. Moreover, the fixed hierarchical grid is inherently cache-friendly: root grids and overlapping sub-grids are naturally reused across exploration rounds, achieving 30-60% multimodal cache hit rates that further reduce effective GPU compute (see Appendix Sec. C.1).

From representation to reasoning. With a lossless and navigable video representation in hand, a crucial observation follows: the long-video problem reduces to a long-context problem.
The video is the context, and what is needed is a mechanism for agents to explore it recursively without compressing it. Recursive Language Models (RLMs) [23] provide exactly this mechanism for text, allowing agents to query arbitrarily long contexts through recursive subagent calls and accumulate exact symbolic variables. RLMs, however, require a structured environment to recurse into. VideoAtlas is precisely that structure. We deploy Master-Worker agents (Video-RLM) within this environment to extend RLMs to the video domain, yielding depth-controlled compute budgeting and logarithmic cost growth. Our main contributions are as follows.

1. VideoAtlas. We formulate video understanding as navigation within a formally defined geometric environment. The hierarchical grid is lossless, caption-free, preprocessing-free, and strategy-agnostic, with logarithmic access depth, parallelizable subgrids, and structural cache-friendliness.
2. Video-RLM. A parallel Master-Worker architecture extending Recursive Language Models to video. Workers explore grid subtrees concurrently and accumulate evidence in a lossless Visual Scratchpad, while a Master steers exploration via uncertainty analysis.
3. Configurable Traversal Strategies. Breadth-First and Depth-First instantiations plus a query-adaptive policy that selects traversal order automatically, all composable with the environment without modification.
4. Environment Budgeting. We budget the environment, not the agent: bounding exploration depth d directly controls temporal resolution and compute, providing a principled compute-accuracy hyperparameter.

Beyond these architectural contributions, experiments reveal that the formulation produces emergent scaling behaviors (adaptive compute allocation and logarithmic cost growth) that we detail in Sec. 4.

2. Related Work

Long-Form Video Understanding.
Standard Video-Language Models process videos by uniformly sampling a fixed number of frames in a single forward pass [6, 20]. This introduces two structural problems. First, at any practical budget (e.g., 64 frames in an hour), the temporal stride is ~56 seconds per frame, so short events, fine-grained visual details, and scene transitions are easily missed. Second, the context window imposes a hard ceiling: beyond a few hundred high-resolution frames, the model truncates input or degrades. One practical workaround is to pack multiple frames into a single K × K composite image (a contact-sheet grid) [5, 10], which improves token efficiency. However, a single-resolution grid is still fundamentally lossy: it represents the video with a fixed sample of moments and cannot recover the events in between. Grids alleviate the context-packing problem, but they do not resolve the coverage problem.

Table 1. Comparison of long-video QA methods. "Caption-Free" = no text captions used as intermediate representation. "Lossless" = no information lost between input and reasoning. "∞ Context" = can handle arbitrarily long videos without context overflow. "Parallel" = workers explore concurrently.

Method          | Caption-Free | Lossless | ∞ Context | Parallel
VLM (Uniform)   | ✓            | ✗        | ✗         | ✗
LLoVi [24]      | ✗            | ✗        | ✗         | ✗
MR.Video [12]   | ✗            | ✗        | ✗         | ✓
VideoARM [22]   | ✗            | ✗        | ✗         | ✗
DVD [25]        | ✗            | ✗        | ✗         | ✗
Ours            | ✓            | ✓        | ✓         | ✓

Caption-Based Approaches. A prominent line of work avoids the frame-count limit by first transcribing the video into text captions and then reasoning over them. LLoVi [24] converts densely sampled short clips into text summaries and aggregates them with an LLM. MR.Video [12] scales this with a MapReduce design: clips are captioned in parallel, standardized, and then synthesized into a final answer by a reducer LLM.
Video-to-text conversion is standard practice in this line of work: even systems that explicitly observe video frames at a coarse step immediately convert those observations into text before any planning or memory update. Pang et al. [12] explicitly acknowledge that video-to-text modality transitions cause reasoning failures on scene transitions and fine-grained visual details.

Agentic, Hierarchical, and Memory Approaches. Another set of approaches treats long-video understanding as agentic search. DVD [25] constructs a multi-granular database (global summaries, clip captions/embeddings, and indexed raw frames) and queries it with tools (Global Browse, Clip Search, Frame Inspect). VideoARM [22] performs on-the-fly coarse-to-fine search via a set of predefined tools (e.g., captioning, temporal localization, visual QA) over a hierarchical multimodal memory, avoiding exhaustive preprocessing. VideoTree [18] builds a query-adaptive hierarchical representation to guide efficient exploration. On the memory side, WorldMM [21] organizes long-video memory into episodic, semantic, and visual components, retrieved adaptively per query [8]. Despite their diversity, these systems share a common limitation: intermediate evidence is stored as captions, text summaries, or compressed embeddings, never as raw visual frames, meaning none provide lossless, navigable access to an arbitrary video moment by construction.

Long Context as the Core Challenge. Recursive Language Models (RLMs) [23] address long text contexts by letting agents access context through recursive subagent calls, storing results in lossless symbolic variables rather than compressing them into the model's context window. The RLM insight transfers naturally to video, but only if an environment is defined in which agents can navigate the video visually. Existing video "environments" are built around clip databases and text-based retrieval [22, 25].
No visual, lossless, recursively navigable environment for video has been proposed. VideoAtlas fills precisely this gap.

Environment Budgeting vs. Prior Compute Adaptation. Chain-of-thought reasoning [19] and adaptive test-time compute allocation [15] have shown that allocating more inference compute consistently improves performance on language and reasoning tasks. In the video domain, the closest analog is VideoARM [22], which adaptively chooses how many frames N1 to sample per localized interval, a form of density adaptation that improves efficiency. However, this controls sampling quantity (how many frames), not structural resolution (how fine the temporal decomposition is): within each interval, sampling remains uniform, and events falling between sample points can still be missed regardless of N1. MR.Video [12] offers no such control at all: its captioning cost is fixed by video duration regardless of the query. A fundamentally different form of budgeting is absent from prior work: controlling the temporal resolution of the environment itself, where each depth level geometrically subdivides time, providing formal precision guarantees calibrated to video length and query granularity. We introduce exactly this form of budgeting with VideoAtlas.

What Is Missing? Tab. 1 summarizes the key properties of representative methods. In the next section, we introduce VideoAtlas, which addresses all the aforementioned gaps.

3. Methodology

We present our methodology in two parts. First, we introduce VideoAtlas (Sec. 3.1): a task-agnostic environment that renders any video as a navigable, hierarchical grid with formally defined state, action, and observation spaces (Fig. 2). Second, we describe Video-RLM (Sec. 3.2): a parallel Master-Worker agent architecture that operates within VideoAtlas to answer questions about arbitrarily long videos (Fig. 3).

3.1. VideoAtlas

Hierarchical Grid.
At the core of VideoAtlas is a recursive K × K image grid (default K = 8, yielding 64 cells). Given a video of duration T seconds, the root grid S_0 assigns each cell c_i to a contiguous temporal interval [t_i^start, t_i^end] and displays a representative frame sampled at the interval midpoint, providing a "bird's-eye view" of the entire video (Fig. 3). Every cell is addressable: applying EXPAND to cell c_i deterministically generates a child grid S_{d+1} for that cell's sub-interval, increasing temporal resolution by a factor of K². At depth d, the temporal resolution is Δt_d = T / K^{2(d+1)}, and reaching any frame requires at most D_max = ⌈log_{K²}(T · fps)⌉ steps, achieving sub-second precision even for 10-hour videos. Sub-grids are generated on the fly with no offline preprocessing. Agents interact with raw frames at every level.

Figure 2. The VideoAtlas Environment. (Left) The state space is a hierarchical grid stack S_0, S_1, ..., S_D, where S_0 is the root grid covering the entire video of duration T. Each grid has K² cells. Deeper levels d provide finer temporal resolution Δt_d = T / K^{2(d+1)}. (Top Right) The discrete action space A is divided into navigation (e.g., Expand to S_{t+1}), perception, and commit actions. (Bottom Right) The visual scratchpad memory M+ accumulates multimodal evidence (images, timestamps, QA pairs) across exploration rounds.

Action Space. Unlike agentic methods [22] whose actions perform video-processing operations (captioning, translating), VideoAtlas exposes environment-navigation actions grouped into three categories (Fig. 2, right):

Navigation (move through the hierarchy): EXPAND(c_i) descends into cell c_i, generating a child grid. BACKTRACK() returns to the parent grid. MARKPROMISING(c_i, c_j, ...) flags cells for later exploration via a FIFO queue (BFS mode only).
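The grid's timing geometry (Δt_d and D_max above) can be sketched in a few lines. This is an illustrative sketch only; the function names are ours, not the released code:

```python
import math

def max_depth(T: float, fps: float = 25.0, K: int = 8) -> int:
    """Expansions needed for single-frame precision: D_max = ceil(log_{K^2}(T * fps))."""
    return math.ceil(math.log(T * fps, K * K))

def cell_span(T: float, d: int, K: int = 8) -> float:
    """Temporal span of one cell at depth d: Delta_t_d = T / K^(2(d+1)) seconds."""
    return T / K ** (2 * (d + 1))

def expand(t0: float, t1: float, i: int, K: int = 8):
    """EXPAND(c_i): the sub-interval covered by cell i of the K x K child grid."""
    step = (t1 - t0) / (K * K)
    return t0 + i * step, t0 + (i + 1) * step

# For a 10-hour video at 25 fps: root cells span 562.5 s, depth-2 cells ~137 ms,
# and 4 expansions already reach frame-level precision.
T = 10 * 3600
print(max_depth(T), cell_span(T, 0), round(cell_span(T, 2), 3))
```

Note that `cell_span(T, 2)` for a 10-hour video yields the 137 ms span reported at d = 2 in Sec. 4.4.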
Perception (sense the environment): ZOOM(c_i) returns a full-resolution frame for cell c_i. INVESTIGATE(c_i, direction ∈ {before, after}) generates a temporal context scan of the frames immediately before or after a cell, used when an anchor event is found but the answer lies in neighboring frames.

Commit (record evidence): ADDTOSCRATCHPAD(items) stores evidence tuples to the scratchpad. FINISHED() declares the current region fully explored.

The available action set is state-dependent: EXPAND is removed when the cell span drops below a threshold (e.g., < 1 s), BACKTRACK is removed at the root, and BFS and DFS workers receive different action sets. The agent cannot select what it cannot see, eliminating invalid actions by construction while letting the agent decide its own explore-exploit balance from visual cues.

Memory. Positive memory (M+, Visual Scratchpad): a lossless multimodal memory that stores evidence as tuples (I_img, s, τ, c, d) representing image patch, subtitle, timestamp, confidence score, and a text description relating the evidence to the query. When presented to the VLM, M+ is rendered as a grid image with timestamps, subtitles, and indices burned into pixel space, enabling unambiguous cross-referencing. Negative memory (M−, Dead Zones): intervals explored with no relevant findings are marked as dead zones. The grid renderer enforces this visually by blacking out overlapping cells, physically preventing the VLM from hallucinating details in already-explored regions.

Figure 3. Video-RLM overview. The query is converted into a search task. In each round r, the Master examines the root grid S_0 (with dead zones masked) and the scratchpad M+, then assigns promising cells to Workers. Each Worker autonomously explores its assigned region via navigation, perception, and commit actions. After all Workers return, M+ and M− are updated.
The Master performs an uncertainty analysis: if evidence is sufficient, the final answer is produced; otherwise, a new round begins.

Formal Environment Definition. At any step, the environment state S comprises five components: the current temporal position p = (center, span), the depth d in the hierarchy, the positive and negative memories M+ and M−, and the navigation stack σ for backtracking. The observation is the grid image rendered for the interval defined by p at depth d, together with aligned subtitle context filtered for the current temporal window.

This state definition, together with the action space, formally defines a Markov Decision Process (MDP). The reward is task-defined (e.g., answer correctness for QA, temporal IoU for grounding), making VideoAtlas a general substrate for any task reducible to "find relevant moments in a video." In this work we solve it via zero-shot VLM reasoning, but the formal MDP opens a direct path to reinforcement learning.

The environment exhibits four structural properties: (1) Parallelizable: the grid decomposes into independent subtrees explorable concurrently. (2) Traversal-agnostic: BFS, DFS, beam search, or learned policies can govern expansion order without modifying the environment. (3) Depth-controlled compute: bounding d yields a principled compute-accuracy hyperparameter. (4) Logarithmic overhead: as video duration grows, the hierarchy adds depth levels logarithmically, yielding O(log T) scaling rather than O(T). Notably, the depth parameter d interpolates between uniform sampling (d = 0, equivalent to a single composite grid of K² frames) and full recursive exploration (d = D_max); prior uniform-sampling and composite-grid methods are thus degenerate cases of VideoAtlas with no exploration.

3.2. Video-RLM: Master-Worker Architecture

We extend Recursive Language Models [23] to videos by deploying agents in VideoAtlas.
Agents access video context through recursive subagents (Workers) and store outputs in the Visual Scratchpad M+. Exploration proceeds in discrete rounds: in each round r, the Master assigns cells to Workers, Workers explore in parallel, and results are merged into M+ and M− before the next round begins (Fig. 3).

Search Task Extraction. Before visual exploration, a text-only step converts the raw query into a concrete search task. For example: "What treaty was signed after the London conference?" → "Find the London conference scene. Look immediately after for treaty names in text overlays or subtitles." This search task guides all subsequent prompts.

Master Agent. The Master holds the global view: it examines the root grid S_0 (with dead zones masked) and the current scratchpad M+, then selects promising cells for the next round (Global Probing). A priority queue with Virtual Loss [3] ensures that cells already assigned to Workers are deprioritized, preventing redundant exploration. After each round, the Master performs Uncertainty Analysis: (a) a sufficiency check, (b) temporal interpolation to suggest targeted search bounds from gaps between evidence anchors, and (c) dynamic memory pruning.

Worker Agents. Each Worker receives one cell from the frontier and explores it autonomously. Two modes are supported: Depth-First Search (DFS) mode, where the Worker EXPANDs deeper into the timeline with a multi-step budget, ideal for localizing specific details; and Breadth-First Search (BFS) mode, where the Worker scans one level with a single-step budget, ideal for evidence spread across the video. The traversal queue is re-prioritized via the Master's visual scoring.

Query-Adaptive Traversal. The Master selects the traversal strategy before any frames are processed by analyzing the query's linguistic traits: DFS for specific detail localization, BFS for sequence or flow understanding.
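The round structure in Fig. 3 can be summarized as the following skeleton. This is our own minimal sketch under assumed interfaces (`probe`, `explore`, `sufficient`, and `answer` are placeholder names, not the paper's API):

```python
from concurrent.futures import ThreadPoolExecutor

def video_rlm(master, make_worker, root_grid, query, max_rounds=5, n_workers=5):
    """One Master, parallel Workers: probe the root grid, explore assigned cells
    concurrently, merge evidence into M+ and dead zones into M-, and stop once
    the Master's sufficiency check passes or the round budget is exhausted."""
    m_plus, m_minus = [], []  # M+: lossless evidence tuples; M-: explored-but-empty intervals
    for _ in range(max_rounds):
        cells = master.probe(root_grid, m_plus, m_minus)  # Global Probing (dead zones masked)
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            results = list(pool.map(lambda c: make_worker(c).explore(query), cells))
        for evidence, explored in results:  # merge Worker outputs after the round
            m_plus.extend(evidence)
            m_minus.extend(explored)
        if master.sufficient(m_plus, query):  # Uncertainty Analysis: stop or iterate
            break
    return master.answer(m_plus, query)
```

In the actual system the Master additionally re-prioritizes the frontier with Virtual Loss and prunes memory between rounds; those steps are omitted here for brevity.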
Sufficiency, Stopping, and Final Decision. Exploration stops at three levels: (1) worker-level (budget exhausted or FINISHED), (2) master-level (sufficiency check passes after round r), (3) global (total compute budget reached). Once exploration terminates, the Master synthesizes the answer from M+: it sees the actual collected evidence frames (rendered as a grid with burned-in labels), not text summaries, and evaluates each candidate against the visual evidence.

4. Experiments

4.1. Experimental Setup

Benchmarks. We evaluated Video-RLM on the long subsets of two benchmarks: LongVideoBench [20] (LVB, 15-60 min videos) and Video-MME [6] (VMME, without subtitles). To stress-test scalability beyond VLM context limits, we constructed 10-hour variants by concatenating multiple videos from each benchmark. Each query targeted a single source video placed at a random position among distractors. Subtitle tracks were merged with correct temporal offsets. This isolates the "needle in a haystack" challenge at extreme durations. For VMME, we evaluated the system without subtitles to verify that it genuinely understands visual content rather than relying on textual cues.

Model. Our primary experiments used Qwen3.5-35B-A3B [13] (35B total parameters, 3B active per forward pass) for both Master and Workers, differentiated via separate system prompts and action sets, served through vLLM [11] on 4× A100 80 GB GPUs. Each image in the grids used throughout the system was rendered at a unified resolution of 320 × 320 pixels. We use grid size K = 8. To demonstrate VLM-agnosticism, we additionally evaluate with Gemini-3-Flash as the backbone (Tab. 2).

Baselines. We compared against five categories. (1) Proprietary Models: GPT-4o, GPT-5, Gemini-3-Flash, and Claude-Opus-4.5.
(2) Open-Source VLMs: InternVL3.5-241B (28B active) and GLM-4.5V-106B (12B active), uniform-sampling baselines with significantly more active parameters than ours. (3) Caption-Reliant Agentic Methods: DVD [25], MR.Video [12], and VideoARM [22], reported from their original papers. (4) Uniform Sampling: Qwen3.5-35B-A3B with 160 uniformly sampled frames at a resolution of 320 × 320 pixels (matching our framework), along with their temporally aligned subtitles, representing the strongest single-pass baseline within our hardware budget. (5) LLM over Captions: following LLoVi [24], GPT-4o captions (from the MR.Video repository) are concatenated temporally and answered by Qwen3.5-35B-A3B, isolating the benefit of visual exploration over textual summarization.

Table 2. Video QA accuracy (%) on the standard (Long) subsets. LVB: LongVideoBench. VMME: Video-MME (no subs).

Method                          | Active     | LVB  | VMME
Proprietary Models
GPT-4o [9]                      | Prop.      | 66.7 | 65.3
GPT-5 [14]                      | Prop.      | 72.6 | 81.8
Gemini-3-Flash [16]             | Prop.      | 74.5 | —
Claude-Opus-4.5 [1]             | Prop.      | 67.2 | 77.6
Open-Source VLMs
InternVL3.5-241B [4]            | 28B        | 67.1 | 72.9
GLM-4.5V-106B [7]               | 12B        | 76.7 | 74.6
Caption-Reliant Agentic Methods
MR.Video [12] (Gemini+GPT-4o)   | Prop.      | 61.6 | 61.8
VideoARM [22] (Qwen3-VL-235B)   | 22B        | —    | 54.9
VideoARM [22] (GPT-o3+GPT-4o)   | Prop.      | 76.4 | 81.2
Same-Backbone Baselines
Qwen3.5 (uniform, 160 fr.)      | 3B         | 61.5 | 63.8
LLM over Captions ⋆             | Prop. + 3B | 62.4 | 64.2
VideoAtlas (Ours)
Video-RLM (Qwen3.5-35B)         | 3B         | 52.5 | 50.4
Video-RLM (Gemini-3-Flash)      | Prop.      | 72.0 | 76.2

⋆ GPT-4o captions (MR.Video repo), answered by Qwen3.5-35B-A3B.

4.2. Main Results

Tab. 2 compares all methods on the standard long-video benchmarks. Tab. 3 evaluates the 10-hour extended-duration variants, alongside average token consumption per question.

Standard benchmarks. At standard durations (Tab.
2), Video-RLM (3B active parameters) achieves accuracy competitive with substantially larger open-source VLMs (12-28B active) and proprietary baselines. Importantly, Qwen3.5 is video-finetuned, whereas Video-RLM assumes a purely zero-shot agent with no video-specific training, achieving these results without any intermediate text representation or captioning. The accuracy gap between zero-shot navigation and finetuned uniform sampling narrows with a stronger backbone: Video-RLM (Gemini) reaches 72.0% on LVB, within 2.5 points of Gemini-3-Flash's direct performance (74.5%), confirming that VideoAtlas is VLM-agnostic and that performance scales with backbone capability.

Extended duration (10 hours). The 10-hour variants (Tab. 3) reveal a more significant comparison.

Table 3. 10-hour variant: accuracy (%) and average tokens per question. Δ: accuracy drop from standard benchmarks (Tab. 2).

Method                | Acc. | Δ     | Avg. Tokens
LVB-10hr
Qwen3.5 (uniform)     | 49.2 | -12.3 | 212K
LLM over Captions ∗   | 62.1 | -0.3  | 207K ‡
Video-RLM (Qwen)      | 47.7 | -4.8  | 146K †
Video-RLM (Gemini)    | 70.1 | -1.9  | 307K
VMME-10hr (no subs)
Qwen3.5 (uniform)     | 50.6 | -13.2 | 232K
LLM over Captions ∗   | 36.0 | -28.2 | 235K ‡
Video-RLM (Qwen)      | 49.7 | -0.7  | 403K †
Video-RLM (Gemini)    | 69.1 | -7.1  | 390K

∗ GPT-4o captions + Qwen3.5 (exceeds 256K context in many samples).
† Effective tokens after vLLM multimodal prefix cache (avg. 36-42% hit rate).
‡ QA tokens only; excludes GPT-4o captioning cost.

LLM over Captions collapses to 36.0% on VMME-10hr: concatenated captions exceed Qwen's 256K context window, forcing truncation and information loss, demonstrating that linear captioning fails beyond context limits. Notably, captions degrade only −0.3% on LVB-10hr (where subtitle tracks are available) but −28.2% on VMME-10hr (no subtitles), exposing a heavy reliance on textual cues rather than genuine visual understanding. Uniform Qwen degrades moderately (63.8% → 50.6% on VMME) as sampling becomes prohibitively sparse.
Video-RLM maintains highly stable performance across durations (e.g., Qwen on VMME drops only 0.7%, vs. 13.2% for uniform sampling and 28.2% for captions), validating that VideoAtlas buffers the agent against duration scaling. On VMME-10hr, the absence of subtitles forces the purely zero-shot Qwen agent into extensive visual exploration (403K effective tokens), while the stronger Gemini backbone zeroes in on the answer faster (390K tokens), an emergent adaptive-compute property where weaker perception necessitates more search steps. Crucially, the recursive nature of the environment is inherently cache-friendly: Workers re-examine the same grid view across multiple reasoning steps that do not change the navigation state, creating repeated visual token prefixes. For self-hosted Qwen (vLLM), automatic multimodal prefix caching exploits this redundancy transparently, achieving 36-42% hit rates at 10 hours (up to 61% for shorter videos; see Appendix Sec. C.1). Video-RLM (Gemini) achieves 70.1% on LVB-10hr with near-zero degradation (-1.9%) from the standard benchmark.

Error analysis. Manual inspection of failure cases reveals three dominant error modes: VLM perception errors (misreading text overlays, confusing visually similar scenes), premature sufficiency (the Master declares evidence sufficient despite contradictions), and text latching (anchoring on phrases in evidence that superficially match a candidate answer). All three are model-dependent, as confirmed by the substantial accuracy improvement when switching from Qwen (3B active) to Gemini-3-Flash without any changes to VideoAtlas. We discuss these errors in more detail in Appendix Sec. A.

4.3. Logarithmic Compute Scaling

Fig. 1 demonstrates the fundamental efficiency advantage of hierarchical navigation using LVB-10hr.
As video duration grows from 1 minute to 10 hours (a 600× increase), Video-RLM's compute cost increases sub-linearly: the hierarchy adds depth levels logarithmically (⌈log_{K²}(T · fps)⌉), and the sufficiency mechanism halts exploration once evidence is found. Caption-based pipelines scale linearly, since every clip must be captioned regardless of the query, requiring over 1.4M tokens per query at 10 hours. Video-RLM achieves comparable accuracy using only 148K effective tokens (a 9.7× reduction), and unlike uniform VLMs whose cost is fixed, depth can always be extended to accommodate longer videos.

4.4. Environment Budgeting and Adaptive Compute

Fig. 4(a) shows accuracy vs. maximum exploration depth on 30 questions sampled from LVB-10hr. Accuracy rises from 30% (d = 0, root grid only) to 43.3% (d = 2, 137 ms span), then plateaus at depths 3-4, where the finest resolution drops below one millisecond, well beyond any meaningful visual granularity. In practice, we set the maximum depth to the first sub-second layer, which automatically adapts to video duration (a 1-minute video reaches sub-second at d = 1, a 10-hour video at d = 2). Depth d is thus a principled compute-accuracy hyperparameter that directly controls temporal resolution rather than frame quantity.

Fig. 4(b) verifies that compute allocation adapts to question difficulty without explicit supervision. Grouping LVB questions by the number of ground-truth temporal positions containing answer evidence, scattered answers (3+ positions) consume 40% more tokens (322K vs. 230K) than localized ones. This emergent behavior arises from the interaction between the Master's uncertainty analysis, the sufficiency mechanism, and the hierarchical structure.

4.5. Worker Scaling

We vary the number of workers ∈ {1, 3, 5, 7} on 30 questions sampled from LVB-10hr (Fig. 5).
After normalizing for workload differences, wall-clock time decreases from 588s (1 worker) to 257s (7 workers), a 2.25× speedup, while accuracy remains stable (40-47%). This is a structural property of the environment: each subtree is self-contained, so adding workers improves throughput without modifying the search protocol.

Figure 4. (a) Environment budgeting: accuracy and tokens vs. max depth on a subset of LVB-10hr (temporal span annotated). Green: optimal depth (first sub-second layer). (b) Adaptive compute: average tokens scale with evidence spread without ground-truth supervision.

5. Limitations

We identify four principal limitations. (1) VLM perception bottleneck: the system's perceptual ceiling is set entirely by the backbone VLM. Our error analysis reveals three dominant failure modes, all VLM-dependent rather than environment-dependent: (a) perception errors (misreading text overlays, confusing visually similar scenes), (b) premature sufficiency, where the Master declares evidence sufficient despite contradictions rather than directing further exploration, and (c) text latching, where the agent over-relies on subtitle cues when available. Crucially, switching from Qwen (3B active) to Gemini-3-Flash eliminates a substantial portion of these errors without any architectural changes (Tab. 2), confirming that performance scales directly with backbone capability and will improve as VLMs advance. (2) No-anchor exploration overhead: when the root grid S_0 contains no visually obvious anchor for the query, the agent may require additional exploration rounds before finding relevant regions. The Master mitigates this progressively, as each round's newly collected evidence refines subsequent cell assignments. Developing methods to surface semantically relevant information into upper depth layers could substantially improve efficiency and is a promising direction for future work. (3) Evaluation

Figure 5.
Wall-clock time (normalized to equal workload) vs. number of workers, on 30 questions sampled from LVB-10hr. Accuracy (annotated) remains stable across all configurations.

scope: we validate on multiple-choice QA. The MDP formulation supports temporal grounding, summarization, and anomaly detection (only the reward signal changes; the environment does not), but these remain to be demonstrated empirically. (4) Zero-shot only: we solve the MDP entirely via zero-shot VLM reasoning, the weakest possible agent. The discrete, finite action space makes VideoAtlas directly compatible with RL training (PPO, DQN), which would likely improve exploration efficiency, but we leave this to future work.

6. Conclusion

We introduced VideoAtlas, a formulation that reframes video understanding as navigation within a formally defined hierarchical environment, and Video-RLM, a parallel Master-Worker agent that operates within it. Three properties emerge from the formulation: logarithmic compute growth with duration, principled environment budgeting via depth control, and emergent adaptive compute allocation. This environment defines the state, action, and observation spaces of a complete MDP, opening a direct path from zero-shot reasoning to learned exploration policies, and from question answering to any task reducible to "find relevant moments in a video."

Acknowledgements

We are grateful to the KAUST Academy for its generous support, and especially to Prof. Sultan Albarakati, who made this work possible. For computer time, this research used Ibex, managed by the Supercomputing Core Laboratory at King Abdullah University of Science & Technology (KAUST) in Thuwal, Saudi Arabia.

References

[1] Anthropic: The Claude 3 model family: Opus, Sonnet, Haiku. Claude-3 Model Card 1(1), 4 (2024)
[2] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report.
arXiv preprint arXiv:2511.21631 (2025)
[3] Chaslot, G.M.B., Winands, M.H., van den Herik, H.J.: Parallel Monte-Carlo tree search. In: Computers and Games. Lecture Notes in Computer Science, vol. 5131, pp. 60-71. Springer, Berlin, Heidelberg (2008)
[4] Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24185-24198 (2024)
[5] Eltahir, M., Habibullah, A., Ayash, L., Hussain, T., Khan, N.: Vote-in-context: Turning VLMs into zero-shot rank fusers. arXiv preprint arXiv:2511.01617 (2025)
[6] Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24108-24118 (2025)
[7] GLM, T., Zeng, A., Xu, B., Wang, B., Zhang, C., Yin, D., Zhang, D., Rojas, D., Feng, G., Zhao, H., et al.: ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. arXiv preprint arXiv:2406.12793 (2024)
[8] Hu, Y., Shi, W., Fu, X., Roth, D., Ostendorf, M., Zettlemoyer, L., Smith, N.A., Krishna, R.: Visual Sketchpad: Sketching as a visual chain of thought for multimodal language models. Advances in Neural Information Processing Systems 37, 139348-139379 (2024)
[9] Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: GPT-4o system card. arXiv preprint arXiv:2410.21276 (2024)
[10] Kim, W., Choi, C., Lee, W., Rhee, W.: An image grid can be worth a video: Zero-shot video question answering using a VLM.
IEEE Access 12, 193057-193075 (2024)
[11] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J.E., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with PagedAttention. arXiv preprint arXiv:2309.06180 (2023)
[12] Pang, Z., Wang, Y.X.: Mr. Video: MapReduce as an effective principle for long video understanding. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems
[13] Qwen Team: Qwen3.5: Towards native multimodal agents (February 2026), https://qwen.ai/blog?id=qwen3.5
[14] Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267 (2025)
[15] Snell, C.V., Lee, J., Xu, K., Kumar, A.: Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In: The Thirteenth International Conference on Learning Representations (2025)
[16] Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
[17] Wang, X., Zhang, Y., Zohar, O., Yeung-Levy, S.: VideoAgent: Long-form video understanding with large language model as agent. In: European Conference on Computer Vision. pp. 58-76. Springer (2024)
[18] Wang, Z., Yu, S., Stengel-Eskin, E., Yoon, J., Cheng, F., Bertasius, G., Bansal, M.: VideoTree: Adaptive tree-based video representation for LLM reasoning on long videos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 3272-3283 (2025)
[19] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models.
Advances in Neural Information Processing Systems 35, 24824-24837 (2022)
[20] Wu, H., Li, D., Chen, B., Li, J.: LongVideoBench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems 37, 28828-28857 (2024)
[21] Yeo, W., Kim, K., Yoon, J., Hwang, S.J.: WorldMM: Dynamic multimodal memory agent for long video reasoning (2025)
[22] Yin, Y., Meng, Q., Chen, M., Ding, J., Shao, Z., Yu, Z.: VideoARM: Agentic reasoning over hierarchical memory for long-form video understanding. arXiv preprint arXiv:2512.12360 (2025)
[23] Zhang, A.L., Kraska, T., Khattab, O.: Recursive language models. arXiv preprint arXiv:2512.24601 (2025)
[24] Zhang, C., Lu, T., Islam, M.M., Wang, Z., Yu, S., Bansal, M., Bertasius, G.: A simple LLM framework for long-range video question-answering. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 21715-21737 (2024)
[25] Zhang, X., Jia, Z., Guo, Z., Li, J., Li, B., Li, H., Lu, Y.: Deep Video Discovery: Agentic search with tool use for long-form video understanding. arXiv preprint (2025)

Appendix

We present a complete end-to-end trace of Video-RLM answering "How many yellow cards were given in this video?" on a 25-minute FIFA World Cup Final highlight reel (90,117 frames). Figures 6 to 8 show the three stages of the pipeline.

Figure 6. Navigation grid before and after exploration. (a) The Master's initial 8×8 root grid provides a temporal overview of the full 25-minute video; each cell covers ∼23 s. (b) After 8 DFS rounds, explored regions are blacked out (24 of 64 cells), visually showing the coverage pattern. The system explored 4.7% of total frames.

Figure 7. Final evidence scratchpad.
The system's lossless visual memory after 8 rounds of exploration: 51 collected frames with burned-in labels [A]-[AY], each paired with a timestamp and natural-language description. Representative entries include "[D] @86.3 s: Referee shows yellow card to Marcus Thuram (Minute 18)" and "[O] @371.1 s: Referee gives yellow card to Enzo Fernández (39′)". Not all items are yellow-card events: the scratchpad also captures contextual evidence such as match score overlays, player close-ups, and celebration scenes, enabling the Master to cross-reference events against the full match timeline. This grid image is passed directly to the Master for the final decision.

A. Detailed Error Analysis

We extend the error analysis from Sec. 4 with a systematic characterization of failure modes. To isolate environment failures from backbone limitations, we analyze all questions where our two backbone configurations (Qwen3.5-35B-A3B, 3B active; Gemini-3-Flash) disagree under identical VideoAtlas configurations. Across LongVideoBench-Long and VideoMME-Long combined, we observe 522 disagreement cases; the stronger backbone is correct in 423 of these (81%), confirming that the dominant failure modes are backbone-dependent rather than architectural. We identify three systematic patterns from these cases, described below. These patterns are not mutually exclusive: a single failure may exhibit more than one.

A.1. VLM Perception Errors

The agent navigates to the correct temporal region but misperceives the visual content. Two sub-patterns emerge. (i) Attribute confusion. The agent correctly identifies the scene and entities but misreads fine-grained attributes such as colors, materials, spatial relationships, or on-screen text. This is especially common when the question hinges on a single distinguishing visual feature (e.g., the color of a specific object, the label on a chart axis). (ii) Cross-frame inconsistency.
The backbone produces contradictory descriptions of the same scene across different frames, then arbitrarily selects one rather than reconciling the conflict, for example describing the same object as "purple/pink" in one frame, "white against blue" in another, and "blue" in a third.

Figure 8. Exploration trace. Condensed log of the 8-round DFS exploration. The Master probes the root grid and dispatches 3 parallel Workers (W) per round to the highest-scored cells (C). Workers use ZOOM, INVESTIGATE, EXPAND, and BACKTRACK to drill into their assigned regions and collect evidence via ADD_TO_SCRATCHPAD. After each round, the Master runs an uncertainty analysis to decide whether to continue or declare sufficiency. Evidence grows from 0 to 51 items across 8 rounds (97 VLM calls, 449K tokens), with the Master declaring FINAL_DECISION after identifying 8 distinct yellow-card events, which is the correct answer.

A.2. Surface-Text Latching

A reasoning failure where the agent anchors on a phrase in the evidence/subtitles that superficially matches a candidate answer, without verifying whether the match is contextually correct. The agent's reasoning frequently contains high-confidence language ("the evidence explicitly states...," "this directly supports candidate X"), but that confidence is built on literal pattern-matching rather than understanding. This is particularly problematic in documentary and educational videos, where narrators use rhetorical phrasings that contain candidate-answer keywords without implying them as the correct answer.

A.3. Early Evidence Anchoring

The agent commits to an answer based on the first plausible evidence item it encounters, failing to integrate later evidence that would contradict or refine its conclusion.
This failure mode interacts with the system's sufficiency mechanism: the Master may declare evidence "sufficient" after a single supporting item, rather than verifying coverage across all candidate answers.

A.4. Impact of Backbone Quality

All three failure patterns are backbone-dependent: switching to a stronger VLM under identical VideoAtlas configuration, prompts, and exploration budget resolves the majority of these errors without any architectural changes. The 4:1 to 5:1 win ratio across both benchmarks reflects this consistently. Figure 9 illustrates two representative cases. These results reinforce the main paper's finding that VideoAtlas performance scales directly with backbone capability, and suggest that the framework is well-positioned to benefit from future advances in VLM quality.

B. Per-Category Accuracy Breakdown

LongVideoBench annotates questions along three axes: question type, reasoning level, and topic. Tabs. 4 to 6 report Video-RLM (Qwen3.5, 3B active) accuracy on both LVB-Long and LVB-10hr. Several patterns emerge: (1) Sequence-type questions are hardest: SSS (Scene Sequence Summary, 21.4%) and SAA (33.3%) require ordering multiple events across the full video, demanding both broad coverage and temporal precision. (2) Perception degrades more than relation at 10 hours: L1-Perception drops 9.6 points (59.4 → 49.8) vs. L2-Relation dropping 2.7 points (47.6 → 44.9), suggesting that the additional temporal distance primarily harms low-level visual recognition rather than relational reasoning. (3) Life-Vlogs are consistently hardest: at 35.3% (1hr) and 26.7% (10hr), these videos feature rapid visual changes, informal framing, and minimal subtitles, stressing the VLM's perception most severely.

Table 4. Accuracy (%) by question type. Categories present only in one split are marked with –.
                 LVB-Long      LVB-10hr
Question Type     n    Acc      n    Acc
S2A              30   66.7     27   63.0
T2E              33   63.6     33   60.6
S2E              42   61.9     42   50.0
S2O              33   60.6     33   42.4
O3O              30   60.0      –      –
SOS              42   57.1      –      –
TOS              21   57.1      –      –
T2O              18   55.6     18   44.4
T2A              42   54.8     42   47.6
E3E              51   54.9     48   54.2
O2E              24   54.2      –      –
TAA              36   52.8     36   44.4
E2O              12   50.0     12   25.0
T3O              39   48.7     33   30.3
T3E              33   48.5     30   46.7
SAA              36   33.3      –      –
SSS              42   21.4      –      –

Table 5. Accuracy (%) by reasoning level.

                 LVB-Long      LVB-10hr
Level             n    Acc      n    Acc
L1 – Perception 234   59.4    207   49.8
L2 – Relation   330   47.6    147   44.9

Table 6. Accuracy (%) by topic category.

                     LVB-Long      LVB-10hr
Topic                 n    Acc      n    Acc
Cooking-Recipes      42   64.3     30   56.7
Knowledge-Geography  54   63.0     45   46.7
Travel-Guides        54   61.1     30   53.3
Knowledge-STEM       75   54.7     48   52.1
Knowledge-History    69   53.6     42   54.8
Movie-Recaps         63   52.4     36   47.2
News-Programs        48   52.1     27   40.7
Knowledge-CS         51   47.1     30   40.0
Knowledge-Art        57   42.1     36   52.8
Life-Vlogs           51   35.3     30   26.7

C. Compute Breakdown

Tab. 7 decomposes the average token consumption per question into Master and Worker contributions. The Master accounts for a small fraction of total tokens (global probing, uncertainty analysis, final decision), while the bulk of compute is spent on Worker exploration. This confirms that the parallel architecture efficiently distributes work: the Master acts as a lightweight coordinator while Workers perform the heavy visual exploration. At 10 hours, Worker tokens increase by 80% (121K → 219K) while Master tokens increase by only 10%

Failure mode: VLM Perception Error
Question: "Which direction is the narrator in red facing relative to the narrator in green?"
Correct answer: Right front (A)
Lightweight (3B): "No narrator in red appears in any frame... neither is wearing red or green." Concluded the entity does not exist.
Strong Backbone: Identified both hosts by name from on-screen graphics across 5 evidence items. Reasoned through the spatial geometry of the desk layout.
Outcome: ✗ Impossible to determine (Lightweight) · ✓ Right front (Strong)

Failure mode: Surface-Text Latching
Question: "What is the primary reason for the appearance of giraffe images on the rocks?"
Correct answer: Climate change (C)
Lightweight (3B): "Items [A], [J], [L] explicitly state the narration attributes giraffe images to 'the creativity of the ancients.'" Matched the rhetorical phrase to option B. ✗ Creativity of the ancients
Strong Backbone: Synthesized items [M]–[O] showing whale skeletons and large animal remains in desert dunes. Linked all evidence to a dramatic past climate shift. ✓ Climate change

Figure 9. Representative backbone failure cases. Top: Perception error: the lightweight backbone cannot perceive a clearly visible host wearing red and concludes the entity does not exist; the stronger backbone identifies both hosts and reasons about their spatial relationship. Bottom: Surface-text latching: the lightweight backbone matches a rhetorical narrator phrase to a candidate answer with high confidence; the stronger backbone synthesizes broader evidence to identify the underlying scientific explanation. Both runs use identical VideoAtlas configurations.

Table 7. Average per-question compute breakdown for Video-RLM (Qwen3.5, 3B active).

             Avg. Tokens              Suff.    Expl.   Evid.
Benchmark   Master  Worker  Total    Checks   Rounds   Items
LVB-Long       28K    121K   149K       1.8      2.0     6.2
LVB-10hr       30K    219K   250K       1.9      2.3     7.2

(27.5K → 30.4K), confirming that the Master's coordination overhead scales minimally with video duration. The additional Worker cost reflects deeper exploration (2.3 vs. 2.0 rounds) needed to locate evidence in a 10× longer video, yet the increase is far sub-linear relative to the 10× duration increase, consistent with the logarithmic scaling property described in Sec. 4.3. Evidence items increase modestly (6.2 → 7.2), suggesting the system explores more but collects evidence at a similar density.

C.1.
Multimodal Token Efficiency

During DFS exploration, each worker re-examines the same grid view across multiple reasoning steps (e.g., ADD_TO_SCRATCHPAD steps that do not change the navigation state), creating inherent redundancy in visual token processing. For Qwen (self-hosted via vLLM), this is handled transparently: vLLM's automatic multimodal prefix caching detects repeated image-token prefixes across requests and serves KV-cache hits without any code changes. Tab. 8 reports the measured hit rates across video durations.

Table 8. vLLM multimodal prefix cache hit rates for Qwen3.5 across video durations.

Duration              Avg. Hit Rate (%)   Std. (%)
LongVideoBench
  1 min                     60.9             7.0
  10 min                    48.2             7.1
  1 hr                      42.7             7.5
  3 hr                      50.7            12.4
  5 hr                      48.9            11.2
  10 hr                     41.6             8.8
Video-MME (no subs)
  10 hr                     35.5             8.9

D. Prompt Templates

We include the complete prompt templates used by VideoAtlas. All prompts are zero-shot (no in-context examples). The same templates are used across all benchmarks and video durations without modification.

D.1. Search Task Extraction

A text-only call that converts the raw query and answer candidates into a concrete visual search task.

Convert this question + choices into a concrete SEARCH TASK for exploring a video.

Question: "{query}"
Choices: {candidates}

Describe EXACTLY what to look for visually. Be specific about scenes, objects, text overlays, or transitions that would confirm each choice. Output only the search task, no preamble.

D.2. Master: Global Probing

The Master examines the root grid (with dead zones blacked out) and ranks the top-N cells for worker assignment.

You are analyzing a KxK grid of frames sampled from a SINGLE video in chronological order (left-to-right, top-to-bottom).

**QUERY:** "{query}"
**GRID CELLS:** {context_str}

Pick EXACTLY {top_n} cells (no more, no fewer) most likely to help answer the query.

**OUTPUT (raw JSON, EXACTLY {top_n} entries):**
{"top": [{"id": }, ...]}

D.3.
Master: Uncertainty Analysis

After each round with new evidence, the Master performs three tasks in one call: sufficiency check, explore suggestions, and noise erasure.

**UNCERTAINTY ANALYSIS**
You are the MASTER coordinator analyzing search progress for a video question.
**QUERY:** "{query}"
**ANSWER CHOICES:** {candidates}
**EVIDENCE COLLECTED SO FAR:** {evidence_text}
**EXPLORATION PROGRESS:** {progress_text}
**NAVIGATION GRID (blacked-out = explored):** {context_str}
**YOUR 3 TASKS (do all in one response):**
1. **UNCERTAINTY CHECK:** For each answer choice, do you have sufficient evidence?
2. **EXPLORE SUGGESTIONS:** Suggest up to {N} regions. ONLY suggest non-blacked-out cells.
   - Cell IDs from the grid
   - Custom time ranges {"start","end"} (<60s)
3. **ERASE NOISE:** ONLY erase evidence completely unrelated to query, task, and ALL choices. Keep partial evidence. When in doubt, keep it.
**If sufficient:** {"action": "FINAL_DECISION",...}
**Otherwise:** {"action": "CONTINUE", "reasoning": "...", "explore": [...], "erase": [...]}

D.4. Worker: Exploration Step

Each worker receives a grid view of its assigned region and the available tool set. The prompt includes a 1-sentence summary of the previous step's outcome (conversation history).

SEARCH TASK: "{search_task}"
QUERY: "{query}"
You are exploring [{start}-{end}s] ({pct}% through a {duration}s video, depth {d}).
Grid: {K}x{K}, chronological L-to-R, T-to-B.
**CELLS:** {context_str}
**PREVIOUS:** {prev_summary}
**RULES:**
- EXPAND into promising cells to zoom in
- Use ZOOM only when you found a relevant scene and need a closer high-resolution look
- Use INVESTIGATE only when you found the anchor scene and need to check what happens before/after
- ADD_TO_SCRATCHPAD with timestamp, description, and confidence when you find evidence
- FINISHED when region has no relevant content
Pick ONE action. Be precise with timestamps.

D.5.
Master: Final Decision

After exploration terminates, the Master sees the evidence scratchpad grid and evaluates each candidate.

You are making a FINAL DECISION based on all collected visual evidence.
QUERY: "{query}"
ANSWER CHOICES: {candidates}
EVIDENCE (see grid image with burned-in labels): {evidence_descriptions}
For EACH choice: state which evidence supports or contradicts it. Then select the best-supported answer.
**OUTPUT:** {"answer": , "reasoning": "..."}
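As a practical note, the raw-JSON replies requested by the templates above typically need defensive parsing, since VLMs sometimes wrap the object in prose or code fences. The following is a minimal sketch of such a parser (our illustration with a hypothetical function name, not part of the described system):

```python
import json
import re

def parse_master_reply(raw: str) -> dict:
    # Extract the first JSON object from a possibly chatty VLM reply:
    # locate the outermost braces, then decode with the standard library.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    return json.loads(match.group(0))

reply = 'Sure! {"action": "FINAL_DECISION", "answer": "B", "reasoning": "..."}'
decision = parse_master_reply(reply)
```

A production parser would also need to handle malformed JSON (e.g., by re-prompting), but the greedy brace match suffices for well-behaved replies like those shown in the templates.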