HunyuanImage 3.0 Technical Report
We present HunyuanImage 3.0, a native multimodal model that unifies multimodal understanding and generation within an autoregressive framework, with its image generation module publicly available. The achievement of HunyuanImage 3.0 relies on several key components, including meticulous data curation, advanced architecture design, a native Chain-of-Thought schema, progressive model pre-training, aggressive model post-training, and an efficient infrastructure that enables large-scale training and inference. With these advancements, we successfully trained a Mixture-of-Experts (MoE) model comprising over 80 billion parameters in total, with 13 billion parameters activated per token during inference, making it the largest and most powerful open-source image generative model to date. In extensive experiments, automatic and human evaluations of text-image alignment and visual quality demonstrate that HunyuanImage 3.0 rivals previous state-of-the-art models. By releasing the code and weights of HunyuanImage 3.0, we aim to enable the community to explore new ideas with a state-of-the-art foundation model, fostering a dynamic and vibrant multimodal ecosystem. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanImage-3.0
💡 Research Summary
HunyuanImage 3.0 is an open‑source, native multimodal foundation model that unifies text understanding, image understanding, and image generation within a single autoregressive framework. The backbone is the Hunyuan‑A13B large language model, a decoder‑only transformer with a Mixture‑of‑Experts (MoE) configuration of 64 experts, 8 active per token, yielding roughly 13 billion active parameters per token while the total parameter count exceeds 80 billion.
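The MoE configuration described above, 64 experts with 8 active per token, can be sketched as a top-k router: a gating network scores all experts, only the k highest-scoring ones run, and their outputs are combined with softmax weights renormalized over the selected set. This is a minimal illustration of the general technique, not the report's exact router; all shapes and the linear experts are hypothetical.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=8):
    """Route one token embedding through the top-k experts (softmax-weighted).

    x: (d,) token embedding; gate_w: (num_experts, d) router weights;
    experts: list of callables, one per expert. Shapes are illustrative.
    """
    logits = gate_w @ x                        # one router score per expert
    top = np.argsort(logits)[-k:]              # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                               # softmax over the selected experts only
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 64
# hypothetical linear experts standing in for the model's FFN experts
experts = [(lambda W: (lambda v: W @ v))(rng.standard_normal((d, d)) / d)
           for _ in range(n_experts)]
y = moe_forward(rng.standard_normal(d), rng.standard_normal((n_experts, d)), experts, k=8)
print(y.shape)  # (16,)
```

Because only 8 of the 64 experts execute per token, the per-token compute tracks the ~13B active parameters rather than the 80B+ total.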
To handle visual data, the authors attach two parallel encoders: a conventional vision encoder that extracts high‑level semantic features, and a variational auto‑encoder (VAE) that compresses raw pixels into a 32‑dimensional latent space with a 16× down‑sampling factor. Both encoder outputs are projected into the LLM’s embedding space—VAE features via a timestep‑modulated residual block, vision‑encoder features via a two‑layer MLP—allowing a single token sequence to contain interleaved text, image‑understanding tokens, and image‑generation tokens.
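The 16× down-sampling factor directly determines how many latent positions an image occupies in the token sequence. A small back-of-the-envelope helper, assuming one generation token per spatial latent cell (the report's actual patchification may group cells differently):

```python
def latent_grid(h, w, downsample=16):
    """Spatial size of the VAE latent for an h-by-w image at 16x down-sampling."""
    return h // downsample, w // downsample

def num_image_tokens(h, w, downsample=16, patch=1):
    """Token count, assuming one token per patch-by-patch latent cell.

    The patch size is an assumption for illustration only.
    """
    lh, lw = latent_grid(h, w, downsample)
    return (lh // patch) * (lw // patch)

print(latent_grid(1024, 1024))       # (64, 64) grid, each cell a 32-dim latent
print(num_image_tokens(1024, 1024))  # 4096
```

Under these assumptions, a 1024×1024 image compresses to a 64×64 grid of 32-dimensional latents, i.e. about 4096 generation tokens interleaved with the text tokens.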
A novel “Generalized Causal Attention” mechanism reconciles the autoregressive nature of text with the global context needed for image patches. Text tokens attend only to preceding tokens (causal mask), while image tokens are permitted full attention over all earlier multimodal tokens and over all tokens within the same image segment. This design preserves sequential text generation while enabling image patches to capture global spatial dependencies.
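The mask described above can be made concrete: start from a standard causal (lower-triangular) mask, then open up bidirectional attention within each image segment. A minimal sketch, with segment bookkeeping invented for illustration:

```python
import numpy as np

def generalized_causal_mask(segments):
    """Attention mask for a mixed text/image sequence.

    segments: ordered list of ("text", n) or ("image", n) runs.
    Text tokens attend causally; image tokens additionally attend to every
    token in their own image segment (bidirectional within the image).
    Returns a boolean (L, L) matrix where True means "may attend".
    """
    kinds, seg_id = [], []
    for s, (kind, n) in enumerate(segments):
        kinds += [kind] * n
        seg_id += [s] * n
    L = len(kinds)
    mask = np.tril(np.ones((L, L), dtype=bool))   # causal baseline
    for i in range(L):
        if kinds[i] == "image":
            for j in range(L):
                if seg_id[j] == seg_id[i]:
                    mask[i, j] = True             # full attention inside the image
    return mask

# 3 text tokens, a 2-token image, then 1 more text token
m = generalized_causal_mask([("text", 3), ("image", 2), ("text", 1)])
```

Here the image token at position 3 may attend to position 4 (same image), while text tokens never see the future, preserving autoregressive decoding for text.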
Data preparation is a major contribution. Starting from over 10 billion raw images, a three‑stage filtering pipeline reduces the corpus to roughly 5 billion high‑quality images (≈45 % retention). Stage 1 removes low‑resolution, corrupted, over‑/under‑exposed, and oversaturated files and deduplicates by MD5. Stage 2 applies learning‑based detectors for watermarks, logos, extensive text (via hy‑OCR), collages, borders, and AI‑generated content, together with aesthetic and clarity scoring models. A subject‑scoring system evaluates color, lighting, and composition, with genre‑specific thresholds. Stage 3 performs embedding‑cluster deduplication and adds specialized subsets (knowledge‑augmented, stylized, graphic‑design collections).
Caption generation follows a bilingual (English/Chinese) hierarchical schema that separates descriptive levels (short to extra‑long), stylistic attributes (art style, lighting, composition), and factual entities (named persons, landmarks, brands). The pipeline includes compositional caption synthesis (dynamic field combination for length and pattern diversity) and factual grounding via two agents: an OCR agent for in‑image text and a Named‑Entity agent for real‑world entities. A bidirectional verification loop ensures only captions that faithfully reflect detected entities are kept. An additional image‑difference captioner produces change descriptions for paired images, supporting edit‑instruction generation.
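The compositional caption synthesis step can be illustrated as dynamic field combination: a required description is joined with a random subset of optional schema fields to vary caption length and pattern. Field names here are illustrative, not the report's exact schema:

```python
import random

def compose_caption(fields, rng=None):
    """Combine the required description with a random subset of optional
    fields (style, lighting, composition, entities) to diversify captions."""
    rng = rng or random.Random(0)
    parts = [fields["description"]]
    optional = [k for k in ("style", "lighting", "composition", "entities")
                if k in fields]
    for key in rng.sample(optional, k=rng.randint(0, len(optional))):
        parts.append(fields[key])
    return ". ".join(parts) + "."

cap = compose_caption({
    "description": "A red-crowned crane standing in shallow water",
    "style": "ink-wash painting style",
    "lighting": "soft dawn light",
})
```

Running the same schema through many random draws yields short and long captions for the same image, which is what lets the model learn both terse and highly specified prompts.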
Training data for reasoning consists of three streams: (1) Text‑to‑Text (T2T) reasoning data to improve logical inference and instruction following; (2) Text‑to‑Text‑to‑Image (T2TI) data that pairs images with short and long captions plus a reasoning trace that refines user intent into detailed visual specifications; (3) Image‑Edit‑to‑Text‑Edit (TI2TI) data containing source images, complex edit instructions, ground‑truth edited images, and an atomic‑operation edit trace. These datasets enable the model to perform Chain‑of‑Thought (CoT) reasoning: interpreting a prompt, “thinking” through intermediate refinement, and finally generating the target image.
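The three streams share a common shape: a prompt, a reasoning trace, and a target, with images on none, one, or both sides. A unified record sketch (field names are illustrative, not the report's actual data format):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReasoningSample:
    """One training record across the three reasoning streams.

    T2T:   prompt + trace + target_text, no images.
    T2TI:  prompt + intent-refining trace + target_image.
    TI2TI: adds a source_image and an atomic-operation edit trace.
    """
    stream: str                        # "T2T", "T2TI", or "TI2TI"
    prompt: str
    reasoning_trace: str               # CoT refinement of user intent
    target_text: Optional[str] = None
    source_image: Optional[bytes] = None
    target_image: Optional[bytes] = None

s = ReasoningSample(
    stream="T2TI",
    prompt="a crane at dawn",
    reasoning_trace="refine subject, lighting, and composition ...",
    target_image=b"<image bytes>",
)
```

Casting all three streams into one schema is what lets a single autoregressive sequence format cover text-only reasoning, text-to-image generation, and instruction-driven editing.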
Training proceeds by first pre‑training the MoE LLM, then fine‑tuning the combined multimodal system on image‑generation tasks only, followed by aggressive post‑training on the curated multimodal datasets. Distributed infrastructure and mixed‑precision training allow scaling to the full 80B‑parameter model.
Evaluation employs both automatic metrics (FID, CLIP‑Score) and human preference studies. In text‑image alignment and visual fidelity, HunyuanImage 3.0 matches or surpasses leading closed‑source systems such as Seedream 4.0, Nano Banana, and GPT‑Image, as well as its predecessor HunyuanImage 2.1. Qualitative examples demonstrate strong prompt‑following, reasoning, concept generalization, and text‑rendering capabilities across a wide range of aspect ratios.
By releasing the full code, model weights, and data pipelines at the provided GitHub repository, the authors invite the research community to build upon a state‑of‑the‑art, openly accessible multimodal foundation model, potentially accelerating progress in image generation, multimodal dialogue, and visual editing applications.