CookAnything: A Framework for Flexible and Consistent
Multi-Step Recipe Image Generation
Ruoxuan Zhang
zhangrx22@mails.jlu.edu.cn
Jilin University
Changchun, China
Bin Wen
wenbin2122@mails.jlu.edu.cn
Jilin University
Changchun, China
Hongxia Xie∗
hongxiaxie@jlu.edu.cn
Jilin University
Changchun, China
Yi Yao
leo81005.ee10@nycu.edu.tw
National Yang Ming Chiao Tung
University
Hsinchu, Taiwan
Songhan Zuo
zuosh2122@mails.jlu.edu.cn
Jilin University
Changchun, China
Jian-Yu Jiang-Lin
jianyu@cmlab.csie.ntu.edu.tw
National Taiwan University
Taipei, Taiwan
Hong-Han Shuai
hhshuai@nycu.edu.tw
National Yang Ming Chiao Tung
University
Hsinchu, Taiwan
Wen-Huang Cheng
wenhuang@csie.ntu.edu.tw
National Taiwan University
Taipei, Taiwan
Figure 1: Demonstration of our CookAnything model generating multi-step cooking instructions in a single pass. Each example shows the user's prompt (left) and the corresponding series of dish images (right), from initial preparation steps through the final plated result (details of the complete recipe text can be found in Supplementary A.6).
∗Corresponding Author.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
MM ’25, Dublin, Ireland
© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-2035-2/2025/10
https://doi.org/10.1145/3746027.3755174
Abstract
Cooking is a sequential and visually grounded activity, where each step, such as chopping, mixing, or frying, carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods are unable to adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent,
semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything performs better than existing methods in both training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media and procedural content creation. More details are at https://github.com/zhangdaxia22/CookAnything.
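To make the three components above concrete, the following is a minimal, illustrative PyTorch sketch of the general idea behind step-wise regional control and a step-aware positional encoding; it is not the authors' implementation, and the function names (build_stepwise_attention_mask, step_aware_position_ids, masked_cross_attention) and the per-step offset value are hypothetical. The sketch assumes that all step images of a recipe are denoised jointly as one concatenated latent grid, so each step's text tokens can be restricted to its own image region by a block-diagonal cross-attention mask, while per-step offsets in the 2D position ids keep steps distinguishable without changing their internal spatial layout.

import torch

def build_stepwise_attention_mask(num_steps, tokens_per_step, patches_per_step):
    # Block-diagonal cross-attention mask: the text tokens of step i may only
    # interact with the latent patches of step i's image region.
    # Shape: [num_steps * patches_per_step, num_steps * tokens_per_step]; True = allowed.
    mask = torch.zeros(num_steps * patches_per_step,
                       num_steps * tokens_per_step, dtype=torch.bool)
    for s in range(num_steps):
        p0, p1 = s * patches_per_step, (s + 1) * patches_per_step
        t0, t1 = s * tokens_per_step, (s + 1) * tokens_per_step
        mask[p0:p1, t0:t1] = True
    return mask

def step_aware_position_ids(num_steps, height, width, step_offset=64):
    # Step-aware 2D position ids in the spirit of a "flexible" RoPE: every step
    # reuses the same local (row, col) grid, shifted along the column axis by a
    # per-step offset so steps are distinguishable yet spatially comparable.
    ids = []
    for s in range(num_steps):
        rows = torch.arange(height).repeat_interleave(width)
        cols = torch.arange(width).repeat(height) + s * step_offset
        ids.append(torch.stack([rows, cols], dim=-1))
    return torch.cat(ids, dim=0)  # [num_steps * height * width, 2]

def masked_cross_attention(q_img, k_txt, v_txt, mask):
    # q_img: [patches, d]; k_txt, v_txt: [tokens, d]. Disallowed pairs are set to
    # -inf before softmax, so each region only reads its own step's instruction text.
    scores = q_img @ k_txt.T / (q_img.shape[-1] ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v_txt

# Example for a 5-step recipe: 77 text tokens and a 16x16 latent patch grid per step.
mask = build_stepwise_attention_mask(num_steps=5, tokens_per_step=77, patches_per_step=256)
pos_ids = step_aware_position_ids(num_steps=5, height=16, width=16)

In a full framework, such a mask and positional scheme would be plugged into the attention layers of a pretrained diffusion backbone; because both are built from the number of steps at inference time, the same mechanism can accommodate recipes of arbitrary length within a single denoising pass.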
CCS Concepts
• Computing methodologies → Computer vision tasks.
Keywords
Recipe image generation, procedural sequence generation, food computing
ACM Reference Format:
Ruoxuan Zhang, Bin Wen, Hongxia Xie, Yi Yao, Songhan Zuo, Jian-Yu Jiang-Lin, Hong-Han Shuai, and Wen-Huang Cheng. 2025. CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation. In Proceedings of the 33rd ACM International Conference on Multimedia (MM '25), October 27–31, 2025, Dublin, Ireland. ACM, New York, NY, USA, 19 pages. https://doi.org/10.1145/3746027.3755174
1 Introduction
Cooking is a richly visual and sequential activity: from chopping onions to garnishing a dish, each step not only involves semantic transitions but also yields observable visual transformations [16, 21, 43]. Accurately illustrating these processes from textual instructions holds significant value for applications in culinary education, assistive technology, and multimodal content generation, enabling users to better understand, follow, and interact with complex procedures in an intuitive visual manner.
As textual recipes abstract the cooking process into language, recipe illustration aspires to reverse this abstraction, generating coherent image sequences that visually narrate each procedural step [8, 14, 18, 23]. Compared to single-image generation, this task introduces unique challenges