Pixelis: Reasoning in Pixels, from Seeing to Acting
Most vision-language systems are static observers: they describe pixels but do not act, and they cannot safely improve under distribution shift. This passivity limits generalizable, physically grounded visual intelligence; learning through action, not static description, is essential for moving beyond curated data. We present Pixelis, a pixel-space agent that operates directly on images and videos via a compact set of executable operations (zoom/crop, segment, track, OCR, temporal localization) and learns from the consequences of its actions. Pixelis trains in three phases: (1) Supervised Fine-Tuning learns a pixel-tool grammar from Chain-of-Thought-Action traces, using a masked imitation loss that up-weights operation/argument tokens and auxiliary heads that stabilize pixel-grounded arguments; (2) Curiosity-Coherence Reward Fine-Tuning optimizes a dual-drive objective that combines prediction-error curiosity with adjacent-step coherence and a mild efficiency prior under a KL anchor, yielding short, valid, structured toolchains; (3) Pixel Test-Time RL performs label-free adaptation by retrieving neighbors, voting over complete trajectories rather than final answers alone, and updating toward short, high-fidelity exemplars while constraining drift with a KL-to-EMA safety control. Across six public image and video benchmarks, Pixelis yields consistent improvements: the average relative gain over the same 8B baseline is +4.08%, computed as (ours − baseline)/baseline, peaking at +6.03% on VSI-Bench, while producing shorter, auditable toolchains and keeping KL within its corridor during test-time learning. Acting within pixels rather than abstract tokens grounds multimodal perception in the physical world, links visual reasoning to actionable outcomes, and enables embodied adaptation without external feedback.
💡 Research Summary
Pixelis introduces a novel three‑stage training pipeline that transforms a conventional vision‑language model (VLM) from a passive observer into an active pixel‑space agent capable of executing and learning from visual actions. The core idea is to equip the model with a compact set of executable pixel‑level tools—segmentation, zoom/crop, object tracking, optical character recognition (OCR), and temporal localization—each of which takes concrete pixel arguments (e.g., bounding boxes, frame indices) and returns structured outputs that can be serialized back into the model’s token stream. By doing so, the model’s reasoning is grounded in actual visual manipulations rather than abstract textual descriptions.
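The summary above describes tool calls that take concrete pixel arguments and return structured outputs serialized back into the token stream. A minimal sketch of what such serialization could look like; the tag names, JSON schema, and `zoom_crop` argument format are illustrative assumptions, not the paper's exact protocol:

```python
import json

def serialize_tool_call(tool: str, args: dict, result: dict) -> str:
    """Serialize a pixel-tool call and its structured result into text that
    can be appended to the model's token stream (hypothetical format)."""
    call = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    out = json.dumps(result, sort_keys=True)
    return f"<action>{call}</action><observation>{out}</observation>"

# Example: a zoom/crop call with pixel-coordinate arguments (x1, y1, x2, y2)
step = serialize_tool_call(
    "zoom_crop",
    {"bbox": [120, 64, 360, 288]},
    {"crop_id": 0, "size": [240, 224]},
)
```

Keeping arguments in explicit pixel coordinates is what lets the auxiliary heads (described under Stage 1) supervise them directly.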
Stage 1 – Supervised Fine‑Tuning (SFT).
Pixelis first collects a large corpus of Chain‑of‑Thought‑Action (CoTA) trajectories, where each step interleaves natural‑language “thought” tokens with a tool call and its arguments. The dataset comprises 80k images and 28k video clips, yielding about 76k trajectories with an average of 5.8 steps. During SFT the model is trained with a standard next‑token cross‑entropy loss, but a masked imitation loss up‑weights tokens belonging to tool names and arguments (weight ≈ 2). In addition, lightweight auxiliary heads are attached to the backbone for each tool: a SmoothL1 head for bounding‑box regression, a Dice‑loss head for segmentation masks, a cross‑entropy head for OCR text, and a binary‑cross‑entropy head for tracking confidence. These heads are activated only when the corresponding tool is called, providing direct pixel‑level supervision that stabilizes the model’s ability to generate correct tool arguments. A curriculum that first warms up on the full data and then samples medium‑ and hard‑difficulty examples in a 2:1 ratio further improves convergence.
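The masked imitation loss can be sketched as ordinary per-token cross-entropy with the tool-name/argument tokens up-weighted by a factor of ~2, as the summary states. This NumPy version is a minimal illustration under those assumptions; the real loss operates on the VLM's logits:

```python
import numpy as np

def masked_imitation_loss(logits, targets, tool_mask, tool_weight=2.0):
    """Weighted next-token cross-entropy.
    logits: (T, V) unnormalized scores; targets: (T,) token ids;
    tool_mask: (T,) bool, True on tool-name/argument tokens."""
    logits = logits - logits.max(axis=-1, keepdims=True)            # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]              # per-token CE
    w = np.where(tool_mask, tool_weight, 1.0)                       # up-weight tool tokens
    return (w * nll).sum() / w.sum()
```

With uniform logits the loss reduces to log V regardless of the mask, which is a convenient sanity check when wiring this up.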
Stage 2 – Curiosity‑Coherence Reward Fine‑Tuning (CC‑RFT).
After SFT, the policy π_SFT is refined with reinforcement learning that balances three intrinsic drives: (1) a curiosity reward that measures the prediction error of a tool‑conditioned dynamics head (the larger the error, the higher the reward, gated by low epistemic variance); (2) a coherence reward that encourages adjacent step embeddings to be similar, computed as the z‑scored cosine similarity between successive embeddings E_t; and (3) an efficiency prior that penalizes invalid tool calls and chains longer than a preset maximum length L_0. The overall reward is a weighted sum R(τ)=w₁R_final + w₂R_cur + w₃R_coh – w₄R_pen, where R_final is a +1/‑1 signal for correct/incorrect final answers. Step embeddings are built from masked‑pooled visual tokens, recent thought embeddings, and a one‑hot encoding of the previous action, then projected to a unit‑norm 512‑dim vector. Gradients are blocked from flowing into the frozen backbone tokens, keeping the visual encoder stable while the policy and dynamics heads adapt. This phase dramatically shortens tool chains (average length drops from ~6.0 to ~4.2 steps) and raises the decisive‑step rate, indicating more purposeful tool usage.
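The Stage-2 reward can be sketched directly from the weighted sum given above. The weight values, the exact penalty form, and z-scoring against running batch statistics are illustrative assumptions; only the structure R(τ) = w₁R_final + w₂R_cur + w₃R_coh − w₄R_pen follows the text:

```python
import numpy as np

def coherence_reward(step_embs, mu=0.0, sigma=1.0):
    """Cosine similarity of adjacent unit-norm step embeddings, z-scored
    against running statistics (mu, sigma) — assumed to be batch-level."""
    cos = (step_embs[:-1] * step_embs[1:]).sum(axis=1)
    return ((cos - mu) / (sigma + 1e-8)).mean()

def trajectory_reward(correct, r_cur, r_coh, n_invalid, length, max_len=6,
                      w=(1.0, 0.3, 0.2, 0.1)):
    """R(τ) = w1*R_final + w2*R_cur + w3*R_coh - w4*R_pen (weights are guesses)."""
    w1, w2, w3, w4 = w
    r_final = 1.0 if correct else -1.0               # +1/-1 final-answer signal
    r_pen = n_invalid + max(0, length - max_len)     # efficiency prior: invalid calls + overlength
    return w1 * r_final + w2 * r_cur + w3 * r_coh - w4 * r_pen
```

Under this shaping, a correct answer reached with a short, valid chain collects the full +1 plus intrinsic bonuses, while each invalid call or extra step beyond L₀ eats into the return.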
Stage 3 – Pixel Test‑Time Reinforcement Learning (Pixel TTRL).
At inference time, Pixelis performs label‑free adaptation. For a new query, it retrieves the K most similar trajectories from a memory bank (the bank excludes any near‑duplicates identified during data cleaning). Instead of voting on the final answer alone, Pixelis votes over the entire trajectory, selecting the most consistent full tool chain as a pseudo‑label. The policy is then updated toward this exemplar using a KL‑to‑EMA safety constraint: the KL divergence between the current policy and an exponential‑moving‑average (EMA) copy of the policy is penalized if it exceeds a modest threshold (≈0.2). This mechanism prevents catastrophic drift while still allowing rapid adaptation to domain shift. Experiments show that removing the KL‑EMA anchor leads to KL values > 0.4 and a collapse in accuracy, confirming the necessity of the safety control.
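The KL-to-EMA safety control described above amounts to penalizing the policy only when its token-level divergence from a slowly moving average copy exceeds the corridor (~0.2). A minimal sketch; the function names, EMA decay, and penalty coefficient are illustrative assumptions:

```python
import numpy as np

def ema_update(ema_params, params, decay=0.99):
    """Exponential-moving-average copy of the policy parameters."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k] for k in params}

def kl_to_ema(logp_policy, logp_ema):
    """Mean token-level KL(policy || EMA).
    logp_*: (T, V) log-probabilities over the vocabulary."""
    p = np.exp(logp_policy)
    return (p * (logp_policy - logp_ema)).sum(axis=-1).mean()

def kl_penalty(kl, threshold=0.2, coef=1.0):
    """Zero inside the drift corridor, linear beyond it."""
    return coef * max(0.0, kl - threshold)
```

Because the anchor is an EMA of the policy itself (not the frozen SFT checkpoint), the corridor moves slowly with the adapted policy, which is what permits adaptation to domain shift while still catching the runaway drift (KL > 0.4) the ablation reports.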
Empirical Results.
Pixelis is evaluated on six public image and video benchmarks (including VSI‑Bench, RefCOCO, VideoQA, and others) using an 8‑billion‑parameter VLM as the shared backbone. Across all tasks, Pixelis achieves an average relative gain of +4.08% over the baseline (peak +6.03% on VSI‑Bench), computed as (Pixelis – baseline)/baseline. Moreover, the average tool‑chain length shrinks from 6.0 to 3.7 steps, and the token‑level KL remains below 0.2 during test‑time learning, demonstrating both efficiency and stability. Detailed ablations reveal that each component—masked imitation loss, curiosity‑coherence reward, and KL‑EMA‑guided test‑time voting—contributes measurably to the final performance.
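For clarity on how the headline numbers are computed, the reported gains are relative, not absolute. A one-line check with made-up scores (the per-benchmark values are not reproduced here):

```python
def relative_gain(ours, baseline):
    """Relative improvement as reported: (ours - baseline) / baseline."""
    return (ours - baseline) / baseline

# e.g. a hypothetical baseline score of 50.0 and a system score of 53.0
gain = relative_gain(53.0, 50.0)   # a +6% relative gain
```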
Safety, Auditing, and Reproducibility.
The authors implement a two‑stage de‑duplication pipeline (visual embeddings + perceptual hashes) to eliminate near‑duplicate images and video clips, and they mask duplicated entries from the retrieval index during test‑time adaptation. An audit of the retrieval process shows 0.00% overlap with the evaluation set (95 % CI