Generative Digital Twin Innovation: GDT-120K and Vision-Language Simulation Models
📝 Abstract
Figure 1. Overview of the generative digital twins framework, showing major challenges in digital-twin modeling, the construction of the GDT-120K dataset with evaluation metrics, and the Vision-Language Simulation Models (VLSM) workflow.
📄 Content
Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems

YuChe Hsu, AnJui Wang, TsaiChing Ni, YuanFu Yang
Institute of Artificial Intelligence Innovation, National Yang Ming Chiao Tung University
danielhsu.ii13@nycu.edu.tw, anguswang.ii14@nycu.edu.tw, nina.ii13@nycu.edu.tw, yfyangd@gmail.com

arXiv:2512.20387v4 [cs.AI] 13 Jan 2026

Abstract. We propose a Vision-Language Simulation Model (VLSM) that unifies visual and textual understanding to synthesize executable FlexScript from layout sketches and natural-language prompts, enabling cross-modal reasoning for industrial simulation systems. To support this new paradigm, we construct the first large-scale dataset for generative digital twins, comprising over 120,000 prompt–sketch–code triplets that enable multimodal learning across textual descriptions, spatial structures, and simulation logic. In parallel, we propose three evaluation metrics tailored to this task, Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR), to comprehensively assess structural integrity, parameter fidelity, and simulator executability. Through systematic ablations across vision encoders, connectors, and code-pretrained language backbones, the proposed models achieve near-perfect structural accuracy and high execution robustness. This work establishes a foundation for generative digital twins that integrate visual reasoning and language understanding into executable industrial simulation systems.

Figure 2. Workflow of the GDT-120K dataset construction, integrating curated factory data, statistical validation, and FlexSim instantiation with human–AI co-authored prompts to create aligned multimodal pairs for model training and evaluation.
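The three metrics can be made concrete with a short sketch. The helper callables below (structural check, parameter extraction, simulator run) are hypothetical stand-ins for exposition; the paper's actual FlexScript parser and FlexSim evaluation harness are not specified in this excerpt.

```python
# Illustrative computation of the three proposed metrics.
# The callables passed in are hypothetical stand-ins for the paper's
# real FlexScript parser, parameter extractor, and simulator harness.

def svr(outputs, is_structurally_valid):
    """Structural Validity Rate: fraction of generated scripts whose
    object/connection structure is well-formed."""
    return sum(is_structurally_valid(o) for o in outputs) / len(outputs)

def pmr(outputs, references, extract_params):
    """Parameter Match Rate: fraction of reference parameters (name -> value)
    reproduced exactly in the generated script, averaged over samples."""
    total = 0.0
    for out, ref in zip(outputs, references):
        got, want = extract_params(out), extract_params(ref)
        if want:
            total += sum(got.get(k) == v for k, v in want.items()) / len(want)
    return total / len(outputs)

def esr(outputs, runs_in_simulator):
    """Execution Success Rate: fraction of generated scripts that run in
    the simulator without errors."""
    return sum(runs_in_simulator(o) for o in outputs) / len(outputs)
```

All three reduce to sample-level success ratios; they differ only in how strict the success predicate is (parseable, parameter-faithful, or fully executable).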
- Introduction

Digital twins replicate physical systems in virtual environments for process monitoring and optimization [5, 24]. With increasing adoption in smart manufacturing [16], platforms such as FlexSim are widely used for modeling complex production workflows. However, as shown in Figure 1, building digital twins in FlexSim remains highly labor-intensive, requiring manual object placement, parameter configuration, and logic scripting in the proprietary FlexScript language. This process is time-consuming and difficult to scale, motivating a transition from manual authoring toward automated code synthesis.

Recent progress in large language models (LLMs) [1, 4] suggests a pathway to automation. While LLMs can generate structured code from text and have been applied to control and planning [2], they lack visual grounding, which is crucial for spatially organized domains such as factory layout synthesis. We therefore propose Vision-Language Simulation Models (VLSM), a multimodal generative framework that integrates visual and textual inputs to synthesize executable FlexScript for digital twins. Our goal is to let users specify a layout via sketches and prompts, enabling AI-assisted simulation authoring.

Operationalizing this paradigm is challenging due to the domain-specific nature of FlexScript, which diverges from general-purpose languages and lacks public datasets or suitable evaluation metrics. To address these gaps, we introduce the generative digital twins (GDT) framework: a large-scale multimodal dataset containing 120K prompt–sketch–code triplets together with three evaluation metrics, Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR). By integrating multimodal models that unify visual encoders and language backbones through joint training, the GDT framework provides a solid technical foundation for scalable and reliable digital twin generation.
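A prompt–sketch–code triplet in GDT-120K pairs a natural-language description and a layout sketch with reference FlexScript. The record layout below is an illustrative assumption for exposition only; the field names, paths, and FlexScript snippet are hypothetical, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Illustrative record layout for one GDT-120K training example.
# All field names and values are hypothetical stand-ins; the released
# dataset's actual schema may differ.

@dataclass
class GDTTriplet:
    prompt: str       # natural-language description of the target layout
    sketch_path: str  # path to the layout sketch image
    flexscript: str   # reference FlexScript implementing the layout

example = GDTTriplet(
    prompt="Two sources feed a shared queue that drains into one processor.",
    sketch_path="sketches/line_0001.png",  # hypothetical path
    flexscript='Object s1 = Model.find("Source1");  // hypothetical snippet',
)
```

Training a VLSM then amounts to conditioning on `prompt` and the image at `sketch_path` and supervising generation against `flexscript`.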
- Related Work

Models such as Codex [7], AlphaCode [19], and StarCoder2 [22] demonstrate strong general-purpose coding ability. For domain-specific DSLs (e.g., SQL, Verilog), specialized datasets and structure-aware training are required [20, 28]. In robotics and simulation, GenSim [31] and EnvGen [32] use LLMs to bootstrap synthetic scenarios, while MineDojo [10] and Voyager [30] show agents writing environment scripts. These systems, however, do not target structured layout scripting as in FlexSim.

The integration of vision and language has been explored in image captioning [17], VQA [23], and instruction-following agents [3, 21]. Recent LMMs (e.g., GPT-4 [1], Kosmos-1 [13]) incorporate visual encoders into LLMs via modular fusion (e.g., Q-Former [18]), allowing image-guided code or text generation. Unlike prior work that focuses on language or classification, our task requires precise layout-aware code generation.

While recent works have explored digital twins for monitoring
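The modular fusion discussed above (e.g., Q-Former-style connectors) compresses a variable number of image-patch features into a fixed set of query tokens via cross-attention before handing them to the language backbone. A minimal single-head NumPy sketch of that step, with random matrices standing in for learned weights (real connectors stack several such layers with feed-forward blocks and layer norm):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def connector(patch_feats, queries, Wq, Wk, Wv):
    """One cross-attention step of a Q-Former-style connector.
    patch_feats: (N, d) image features; queries: (K, d) learnable tokens.
    Returns (K, d) fused tokens to feed the language backbone."""
    Q = queries @ Wq           # project queries           -> (K, d)
    K_ = patch_feats @ Wk      # project patch keys        -> (N, d)
    V = patch_feats @ Wv       # project patch values      -> (N, d)
    attn = softmax(Q @ K_.T / np.sqrt(Q.shape[-1]))  # (K, N) weights
    return attn @ V            # weighted sum of patches   -> (K, d)

rng = np.random.default_rng(0)
d, n_patches, n_queries = 64, 256, 32
out = connector(rng.normal(size=(n_patches, d)),
                rng.normal(size=(n_queries, d)),
                *(rng.normal(size=(d, d)) * 0.1 for _ in range(3)))
print(out.shape)  # -> (32, 64): 256 patches compressed to 32 tokens
```

The key design property is that the output token count is fixed by the number of queries, so the language backbone sees a constant-length visual prefix regardless of input image resolution.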