Composing Concepts from Images and Videos via Concept-prompt Binding

Reading time: 5 minutes
...

📝 Original Info

  • Title: Composing Concepts from Images and Videos via Concept-prompt Binding
  • ArXiv ID: 2512.09824
  • Date: 2025-12-10
  • Authors: Xianghao Kong¹, Zeyu Zhang¹, Yuwei Guo², Zhuoran Zhao¹,³, Songchun Zhang¹, Anyi Rao¹ (¹Hong Kong University of Science and Technology (HKUST); ²The Chinese University of Hong Kong (CUHK); ³HKUST (Guangzhou), HKUST(GZ))

📝 Abstract

Project page: https://refkxh.github.io/BiCo_Webpage

Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.

💡 Deep Analysis

Figure 1
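The abstract describes the binder only at a high level: visual concept features are encoded into the corresponding prompt tokens through cross-attention conditioning in a Diffusion Transformer, with an extra absorbent token soaking up concept-irrelevant details. The PyTorch sketch below is one plausible reading of that description, not the authors' implementation; the class name ConceptBinder, the dimensions, and the single-level attention are all simplifying assumptions (the paper's binder is hierarchical, and video concepts use a dual-branch variant with two-stage training, neither of which is modeled here).

```python
# Hedged sketch of concept-prompt binding: visual concept features are injected
# into the prompt-token embeddings that a Diffusion Transformer consumes.
# Module and parameter names (ConceptBinder, absorbent_token) and all dimensions
# are illustrative assumptions, not the paper's interface.
import torch
import torch.nn as nn


class ConceptBinder(nn.Module):
    """Binds visual concept features to selected prompt tokens.

    A learnable "absorbent" token is appended to the queries so that
    concept-irrelevant details in the reference image/video have somewhere
    to go instead of leaking into the bound tokens (loosely mirroring the
    Diversify-and-Absorb idea from the abstract).
    """

    def __init__(self, text_dim: int = 4096, vis_dim: int = 1152, heads: int = 8):
        super().__init__()
        self.absorbent_token = nn.Parameter(torch.zeros(1, 1, text_dim))
        self.attn = nn.MultiheadAttention(embed_dim=text_dim, kdim=vis_dim,
                                          vdim=vis_dim, num_heads=heads,
                                          batch_first=True)
        self.out = nn.Linear(text_dim, text_dim)

    def forward(self, prompt_tokens, concept_feats, bound_mask):
        """
        prompt_tokens: (B, L, text_dim) text-encoder embeddings of the prompt
        concept_feats: (B, N, vis_dim)  visual features of the reference concept
        bound_mask:    (B, L) bool, True for tokens bound to this concept
        """
        B = prompt_tokens.shape[0]
        absorb = self.absorbent_token.expand(B, 1, -1)
        queries = torch.cat([prompt_tokens, absorb], dim=1)      # (B, L+1, D)

        # Cross-attend prompt (+ absorbent) tokens to the visual concept features.
        attended, _ = self.attn(queries, concept_feats, concept_feats)
        attended = self.out(attended)

        # Only the bound tokens receive concept information; the absorbent
        # token's output is dropped so irrelevant details stay out of the prompt.
        delta = attended[:, :-1] * bound_mask.unsqueeze(-1).float()
        return prompt_tokens + delta
```

Under this reading, the binder output for each concept's tokens is what gets reused at composition time, when a new target prompt is assembled from bound tokens learned on different sources.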

📄 Full Content

Composing Concepts from Images and Videos via Concept-prompt Binding
Xianghao Kong¹, Zeyu Zhang¹, Yuwei Guo², Zhuoran Zhao¹,³, Songchun Zhang¹, Anyi Rao¹
¹HKUST ²CUHK ³HKUST(GZ)
https://refkxh.github.io/BiCo_Webpage

Figure 1. Illustration of BiCo, a one-shot method that enables flexible visual concept composition by binding visual concepts with the corresponding prompt tokens and composing the target prompt with bound tokens from various sources (§1). The figure pairs composed results (a butterfly on a vibrant yellow flower set in an erupting Minecraft volcano landscape; a beagle dog mixing a drink at a bar with a cityscape visible through a window) with the source image and video prompts they draw from.

1. Introduction

Visual concept composition aims to integrate different elements from images and videos into a single, coherent visual output. This process is a reflection of human artists' creation: combining ingredients from various inspirations to form a brand new masterpiece [15]. Consequently, it plays a fundamental role in visual creativity and filmmaking [62]. With the rapid advancement of diffusion-based visual content generation models [16, 20, 29–31, 35, 40, 42, 54, 58, 61, 63], an increasing number of works [1, 3, 11, 14, 18, 26, 32–34, 55, 56] have been exploring the field of visual concept composition by exploiting the generative models' strong capability of concept grounding and customization.

Despite considerable efforts devoted to this field, challenges still remain in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. First, the capability to precisely extract specific concepts from various sources is of great significance for visual content creators. Nevertheless, existing mainstream methods [1, 3, 14, 26, 32, 34, 55, 56] use either adapters like LoRA [25] or learnable embeddings with explicit or implicit masks to realize concept selection, which fall short in decoupling complex concepts with occlusions and temporal alterations, and extracting non-object concepts such as styles. Second, it is a common practice to integrate different visual elements from both images and videos in the visual content creation process [62]. However, previous works are confined to animating designated subjects from images with motion from videos [26, 55, 56], without further exploration of flexibly combining various attributes (e.g., visual styles and lighting variations) from both images and videos. Although there has been recent effort on flexible concept composition [18] in the image domain, achieving universal visual concept composition for both images and videos remains an underexplored problem.

To this end, we introduce Bind & Compose (BiCo), a one-shot method that enables flexible visual concept composition…
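The extracted introduction stops right as BiCo is defined, so the composition step itself is only described in the abstract: the target prompt is composed with bound tokens taken from different sources (images and videos). The minimal sketch below shows one way such splicing could look; the function name splice_bound_tokens, the span bookkeeping, and the shapes are hypothetical stand-ins, not the paper's actual interface.

```python
# Hedged sketch of "composing the target prompt with bound tokens from various
# sources". The helper and its span bookkeeping are illustrative assumptions.
import torch


def splice_bound_tokens(target_emb, sources):
    """Overwrite spans of the target prompt embedding with bound concept tokens.

    target_emb: (L, D) embedding of the composed target prompt.
    sources: list of (bound_emb, src_span, tgt_span) triples, where bound_emb is
             the binder-conditioned embedding of a source prompt and the spans
             index the concept's tokens in the source and target prompts.
    """
    composed = target_emb.clone()
    for bound_emb, (s0, s1), (t0, t1) in sources:
        assert s1 - s0 == t1 - t0, "concept token spans must align"
        composed[t0:t1] = bound_emb[s0:s1]
    return composed


# Toy usage: splice a butterfly concept learned from an image and a volcano
# concept learned from a video into one target prompt before DiT sampling.
D = 4096
target = torch.randn(30, D)            # embedding of the composed target prompt
image_bound = torch.randn(12, D)       # binder output for the image concept prompt
video_bound = torch.randn(20, D)       # binder output for the video concept prompt
composed = splice_bound_tokens(target, [(image_bound, (2, 5), (2, 5)),
                                        (video_bound, (7, 11), (7, 11))])
```

Because each concept lives in its own bound token span, concepts extracted from images and from videos can, in principle, be mixed freely in one prompt, which is the flexibility the abstract claims over motion-transfer-only approaches.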

📸 Image Gallery

aam.png applications.png binder.png overview.png

Reference

This content is AI-processed based on open access ArXiv data.
