CookAnything: A Framework for Flexible and Consistent
Multi-Step Recipe Image Generation
Ruoxuan Zhang
zhangrx22@mails.jlu.edu.cn
Jilin University
Changchun, China
Bin Wen
wenbin2122@mails.jlu.edu.cn
Jilin University
Changchun, China
Hongxia Xie∗
hongxiaxie@jlu.edu.cn
Jilin University
Changchun, China
Yi Yao
leo81005.ee10@nycu.edu.tw
National Yang Ming Chiao Tung
University
Hsinchu, Taiwan
Songhan Zuo
zuosh2122@mails.jlu.edu.cn
Jilin University
Changchun, China
Jian-Yu Jiang-Lin
jianyu@cmlab.csie.ntu.edu.tw
National Taiwan University
Taipei, Taiwan
Hong-Han Shuai
hhshuai@nycu.edu.tw
National Yang Ming Chiao Tung
University
Hsinchu, Taiwan
Wen-Huang Cheng
wenhuang@csie.ntu.edu.tw
National Taiwan University
Taipei, Taiwan
Figure 1: Demonstration of our CookAnything model generating multi-step cooking instructions in a single pass. Each example shows the user's prompt (left) and the corresponding series of dish images (right), from initial preparation steps through the final plated result (details of the complete recipe text can be found in Supplementary A.6).
∗Corresponding Author.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
MM ’25, Dublin, Ireland
© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-2035-2/2025/10
https://doi.org/10.1145/3746027.3755174
Abstract
Cooking is a sequential and visually grounded activity, where each step, such as chopping, mixing, or frying, carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods are unable to adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent,
semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything performs better than existing methods in both training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media and procedural content creation. More details are at https://github.com/zhangdaxia22/CookAnything.
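To make the three components above concrete, the following is a minimal, illustrative PyTorch sketch of the general idea behind step-wise regional control and a step-aware positional encoding; it is not the authors' implementation, and the function names (build_stepwise_attention_mask, step_aware_position_ids, masked_cross_attention) and the per-step offset value are hypothetical. The sketch assumes that all step images of a recipe are denoised jointly as one concatenated latent grid, so each step's text tokens can be restricted to its own image region by a block-diagonal cross-attention mask, while per-step offsets in the 2D position ids keep steps distinguishable without changing their internal spatial layout.

import torch

def build_stepwise_attention_mask(num_steps, tokens_per_step, patches_per_step):
    # Block-diagonal cross-attention mask: the text tokens of step i may only
    # interact with the latent patches of step i's image region.
    # Shape: [num_steps * patches_per_step, num_steps * tokens_per_step]; True = allowed.
    mask = torch.zeros(num_steps * patches_per_step,
                       num_steps * tokens_per_step, dtype=torch.bool)
    for s in range(num_steps):
        p0, p1 = s * patches_per_step, (s + 1) * patches_per_step
        t0, t1 = s * tokens_per_step, (s + 1) * tokens_per_step
        mask[p0:p1, t0:t1] = True
    return mask

def step_aware_position_ids(num_steps, height, width, step_offset=64):
    # Step-aware 2D position ids in the spirit of a "flexible" RoPE: every step
    # reuses the same local (row, col) grid, shifted along the column axis by a
    # per-step offset so steps are distinguishable yet spatially comparable.
    ids = []
    for s in range(num_steps):
        rows = torch.arange(height).repeat_interleave(width)
        cols = torch.arange(width).repeat(height) + s * step_offset
        ids.append(torch.stack([rows, cols], dim=-1))
    return torch.cat(ids, dim=0)  # [num_steps * height * width, 2]

def masked_cross_attention(q_img, k_txt, v_txt, mask):
    # q_img: [patches, d]; k_txt, v_txt: [tokens, d]. Disallowed pairs are set to
    # -inf before softmax, so each region only reads its own step's instruction text.
    scores = q_img @ k_txt.T / (q_img.shape[-1] ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v_txt

# Example for a 5-step recipe: 77 text tokens and a 16x16 latent patch grid per step.
mask = build_stepwise_attention_mask(num_steps=5, tokens_per_step=77, patches_per_step=256)
pos_ids = step_aware_position_ids(num_steps=5, height=16, width=16)

In a full framework, such a mask and positional scheme would be plugged into the attention layers of a pretrained diffusion backbone; because both are built from the number of steps at inference time, the same mechanism can accommodate recipes of arbitrary length within a single denoising pass.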
CCS Concepts
• Computing methodologies → Computer vision tasks.
Keywords
Recipe image generation, procedural sequence generation, food computing
ACM Reference Format:
Ruoxuan Zhang, Bin Wen, Hongxia Xie, Yi Yao, Songhan Zuo, Jian-Yu Jiang-Lin, Hong-Han Shuai, and Wen-Huang Cheng. 2025. CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation. In Proceedings of the 33rd ACM International Conference on Multimedia (MM '25), October 27–31, 2025, Dublin, Ireland. ACM, New York, NY, USA, 19 pages. https://doi.org/10.1145/3746027.3755174
1 Introduction
Cooking is a richly visual and sequential activity: from chopping onions to garnishing a dish, each step not only involves semantic transitions but also yields observable visual transformations [16, 21, 43]. Accurately illustrating these processes from textual instructions holds significant value for applications in culinary education, assistive technology, and multimodal content generation, enabling users to better understand, follow, and interact with complex procedures in an intuitive visual manner.
As textual recipes abstract the cooking process into language, recipe illustration aspires to reverse this abstraction, generating coherent image sequences that visually narrate each procedural step [8, 14, 18, 23]. Compared to single-image generation, this task introduces unique challenges