VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation
Screenshot-to-code generation aims to translate user interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from screenshots but are trained without observing the visual outcomes of their generated code. In contrast, human developers iteratively render their implementation, compare it with the design, and learn how visual differences relate to code changes. Inspired by this process, we propose VisRefiner, a training framework that enables models to learn from visual differences between rendered predictions and reference designs. We construct difference-aligned supervision that associates visual discrepancies with corresponding code edits, allowing the model to understand how appearance variations arise from implementation changes. Building on this, we introduce a reinforcement learning stage for self-refinement, where the model improves its generated code by observing both the rendered output and the target design, identifying their visual differences, and updating the code accordingly. Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity, while also endowing models with strong self-refinement ability. These results demonstrate the effectiveness of learning from visual differences for advancing screenshot-to-code generation.
💡 Research Summary
VisRefiner tackles the problem of translating UI screenshots into executable frontend code (HTML/CSS) by explicitly learning from the visual differences between a model’s rendered output and the target design. Existing multimodal large language models (MLLMs) are trained with a one‑way supervised mapping from screenshots to code and never observe the visual consequences of their predictions. Inspired by the human “render‑compare‑revise” workflow, the authors propose a two‑stage training framework.
In the first stage, Difference‑Aligned Supervision, a corpus of paired examples is built where each pair consists of an imperfect implementation (code Cₜ and its rendering Iₜ) and the corrected version (code Cₜ₊₁ and target rendering I_gt). The pairs are generated from two sources: (1) rule‑based perturbations that deliberately introduce localized visual errors across six dimensions (color, layout, alignment, component, image, text), and (2) real‑world imperfections produced by a pre‑trained baseline model. This yields the VisDiffUI dataset (≈20 K paired samples plus additional unpaired samples for reinforcement learning). The model is trained with a standard supervised fine‑tuning loss to predict Cₜ₊₁ given (Iₜ, I_gt, Cₜ), thereby learning a direct mapping from visual discrepancy to the required code edit.
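The rule-based perturbation idea can be sketched for one of the six dimensions. The snippet below is a minimal, illustrative example for the color dimension only; the function name `perturb_color` and the shift-based recoloring rule are assumptions for illustration, not the paper's actual perturbation rules, which also cover layout, alignment, component, image, and text.

```python
import random
import re

def perturb_color(code_gt, rng=random.Random(0)):
    """Replace one hex color in the ground-truth code with a different one,
    yielding an 'imperfect' code C_t whose rendering differs from I_gt
    in a localized, known way."""
    colors = re.findall(r"#[0-9a-fA-F]{6}", code_gt)
    if not colors:
        return code_gt
    target = rng.choice(colors)
    # Shift the color value by a random non-zero offset so the
    # replacement is guaranteed to differ from the original.
    offset = rng.randrange(1, 0x1000000)
    new_color = "#%06x" % ((int(target[1:], 16) + offset) % 0x1000000)
    return code_gt.replace(target, new_color, 1)

code_gt = '<div style="color:#336699;background:#ffffff">Hello</div>'
code_t = perturb_color(code_gt)
# Training pair: input (render(code_t), render(code_gt), code_t),
# target output code_gt -- the code edit that removes the visual difference.
```

Pairing each perturbed `code_t` with the original `code_gt` gives exactly the supervision described above: the model sees the two renderings plus the flawed code, and learns to emit the corrected code.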
The second stage introduces Reinforcement Learning with Self‑Refinement. After generating a candidate code Cₜ, the model renders it to Iₜ and computes a CLIP‑based similarity score sₜ against the ground‑truth design I_gt. Several refined candidates are then sampled from the current policy, rendered to obtain Iᵢ, and scored as sᵢ. A Relative Improvement Score (RIS) measures the normalized gain in similarity, (sᵢ − sₜ)/(1 − sₜ) when sᵢ > sₜ and zero otherwise. The reward combines three components: a format penalty for syntactic HTML/CSS errors, a binary improvement flag (1 if sᵢ > sₜ), and the continuous RIS value. Group‑Relative Policy Optimization (GRPO) is then applied: within each target‑image group, rewards are normalized by the group mean and standard deviation to produce a normalized advantage r̂ᵢ. The policy gradient is clipped in the usual PPO style, and the loss L_GRPO is minimized. This loop runs every epoch, continually exposing the model to its own latest predictions, so it learns to iteratively close the visual gap.
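The reward machinery above can be sketched in a few lines. This is a minimal sketch assuming a fixed format-penalty value and an equal-weight sum of the improvement flag and RIS; the paper's exact weighting may differ.

```python
import statistics

def ris(s_i, s_t):
    """Relative Improvement Score: normalized similarity gain,
    zero when the refined candidate does not improve on s_t."""
    if s_i <= s_t:
        return 0.0
    return (s_i - s_t) / (1.0 - s_t)

def reward(s_i, s_t, has_format_error, format_penalty=1.0):
    """Combine the three components: format penalty, binary
    improvement flag, and continuous RIS. Weights are illustrative."""
    if has_format_error:
        return -format_penalty
    improved = 1.0 if s_i > s_t else 0.0
    return improved + ris(s_i, s_t)

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative normalization: subtract the group mean and
    divide by the group standard deviation, giving advantages r̂ᵢ."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, a candidate that lifts similarity from sₜ = 0.8 to sᵢ = 0.9 gets RIS = 0.1 / 0.2 = 0.5 and reward 1.5, while a syntactically broken candidate is penalized regardless of its similarity.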
Experiments on multiple screenshot‑to‑code benchmarks show that VisRefiner substantially improves generation quality. Compared with a baseline MLLM trained only with one‑way supervision, difference‑aligned supervision alone yields roughly 12% absolute gains in CodeBLEU and layout accuracy. Adding GRPO‑based self‑refinement further boosts layout fidelity by about 18% and improves CLIP‑based perceptual similarity by 0.07 points. In a self‑refinement test, two to three refinement iterations make the rendered output almost indistinguishable from the target, without a significant increase in code length or complexity.
The paper’s key contributions are: (1) a novel training paradigm that treats visual differences as learning signals rather than post‑hoc diagnostics; (2) the VisDiffUI dataset that aligns visual deviations with concrete code edits; and (3) a reinforcement learning scheme that endows MLLMs with autonomous refinement capabilities. By internalizing the render‑compare‑revise loop, VisRefiner moves screenshot‑to‑code generation toward a more human‑like, iterative reasoning process, opening avenues for future work on dynamic UI elements, other frontend frameworks, and integration into real development pipelines.