NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

Reading time: 40 minutes

๐Ÿ“ Original Info

  • Title: NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
  • ArXiv ID: 2601.02204
  • Date: 2026-01-05
  • Authors: Huichao Zhang, Liao Qu, Yiheng Liu, Hang Chen, Yangyang Song, Yongsheng Dong, Shikun Sun, Xian Li, Xu Wang, Yi Jiang, Hu Ye, Bo Chen, Yiming Gao, Peng Liu, Akide Liu, Zhipeng Yang, Qili Deng, Linjie Xing, Jiyang Liu, Zhao Wang, Yang Zhou, Mingcong Liu, Yi Zhang, Qian He, Xiwei Hu, Zhongqi Qi, Jie Shao, Zhiye Fu, Shuai Wang, Fangmin Chen, Xuezhi Chai, Zhihua Wu, Yitong Wang, Zehuan Yuan, Daniel K. Du, Xinglong Wu

๐Ÿ“ Abstract

We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking image editing, interleaved content creation, and video generation. Motivated by the distinct nature of the two modalities (text is strictly sequential, while images are inherently hierarchical), we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024 × 1024 images in just 5 seconds, orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.

(Teaser examples: local edits such as changing the color of a leaf-shaped pillow from light pink to soft blue or adding a black desk lamp with a white fabric shade to a wooden desk, and global edits such as shifting the overall color to a cooler, less saturated tone.)

Recently, AR-Diffusion hybrid architectures, including Transfusion [93] and Bagel [16], have demonstrated promising results for unified understanding and generation. However, the reliance on two different representations creates a gap between generation and understanding. This separation imposes re-encoding overheads for interleaved tasks and might fundamentally constrain the potential for deep multimodal integration. On the other hand, pure autoregressive efforts like Chameleon [66], EMU3 [74], and EMU3.5 [13] remain constrained by two fundamental bottlenecks that hinder their practical deployment and multimodal capability. First, the reliance on the standard raster-scan next-token prediction paradigm for visual generation incurs a prohibitive computational cost at high resolutions. Unlike text, the sequence length of flattened visual tokens grows quadratically with resolution. Consequently, generating a single 1024 × 1024 image via raster-scan autoregression can take over 10 minutes [13, 74], making these models significantly slower than their diffusion counterparts and impractical for interactive applications. Second, the visual representations in these models are typically derived from reconstruction-oriented VQ tokenizers. While these tokenizers optimize for pixel-level fidelity, the resulting discrete codes often lack high-level semantic density. This semantic gap fundamentally limits the model's performance on multimodal understanding tasks, as the visual tokens fail to capture the abstract concepts necessary for complex reasoning and alignment with the textual latent space.

In this work, we introduce NextFlow, a unified sequential modeling framework that activates both multimodal understanding and generation within a single decoder-only transformer. To address the efficiency bottleneck, NextFlow departs from raster-scan generation, adopting a next-scale prediction paradigm [69]. This hierarchical approach generates visual content from coarse structural layouts to fine-grained details, significantly reducing inference latency to just 5 seconds for a 1024 × 1024 image, orders of magnitude faster than raster-scan counterparts. To bridge the semantic gap, we employ a dual-codebook tokenizer [54] that decouples semantic and pixel-level features, ensuring both high-level conceptual alignment and fine-grained visual fidelity within a dynamic resolution framework.

📄 Full Content

The pursuit of artificial general intelligence (AGI) has long envisioned a unified system capable of perceiving, reasoning, and creating across diverse modalities. While Large Language Models (LLMs) have achieved mastery in text understanding and reasoning, and Diffusion Models [17,35,57] have revolutionized visual generation, these two distinct paradigms remain largely separated. This separation creates friction: diffusion models excel at pixel-level fidelity but lack the inherent logical reasoning and in-context learning capabilities of LLMs, whereas traditional multimodal LLMs are often restricted to perception only.

Introducing next-scale prediction in a unified decoder-only structure is non-trivial and presents unique challenges. We scale NextFlow on a massive corpus of 6 trillion tokens, comprising text, image-text pairs, and interleaved multimodal data. Throughout this journey, we identify and resolve key instabilities inherent to next-scale AR generation. Furthermore, we implement a rigorous post-training pipeline to boost model performance. Uniquely, we propose a prefix-tuning strategy for Group Reward Policy Optimization (GRPO), which focuses optimization on the coarse-scale “prefixes” that determine global structure. This approach stabilizes RL training and effectively aligns the model with downstream objectives. For scenarios demanding hyper-realistic detail, we further integrate an optional diffusion-based decoder that refines the discrete output, pushing the boundaries of visual fidelity without compromising the unified architecture.

NextFlow demonstrates that a unified AR model can rival state-of-the-art diffusion models in visual quality while retaining the reasoning power of LLMs. Our 7B parameter model achieves competitive performance on text-to-image benchmarks and outperforms specialized models in image editing. Crucially, the unified architecture naturally supports interleaved text-image tasks. NextFlow can perform Chain-of-Thought (CoT) reasoning to refine prompts before generation or enable in-context learning for zero-shot image editing. Analysis reveals that NextFlow is highly efficient, requiring 6× fewer FLOPs during inference compared to MMDiT-based diffusion models [17] at 1024² resolution, achieving a generation speed of 5 s per image.

Our contributions are summarized as follows:

• We propose NextFlow, a unified decoder-only Transformer that activates multimodal understanding, generation, and editing. It employs next-scale prediction for efficient generation and a dual-codebook tokenizer to ensure high semantic density.

• We present a robust training recipe validated on 6 trillion tokens, introduce a novel RL strategy for multi-scale generation that focuses optimization on coarse scales, and add an optional diffusion decoder for enhanced visual detail, establishing a comprehensive training pipeline.

• Extensive experiments demonstrate that NextFlow achieves state-of-the-art performance, outperforming specialized image editing models and rivaling top-tier diffusion models in visual quality. Crucially, we show that a unified AR architecture can be both computationally efficient and structurally simple, offering a superior alternative to complex hybrid architectures.

2 Model Architecture

The NextFlow tokenizer adopts a dual-codebook architecture building upon TokenFlow [54], which simultaneously achieves high-fidelity image reconstruction and semantically rich discrete representations for multimodal understanding.

Specifically, this design decouples the learning of semantic and pixel-level features while maintaining their alignment via a shared-mapping mechanism. Unlike a standard VQ-VAE, which relies solely on pixel reconstruction, our quantization process is jointly constrained by both reconstruction fidelity and semantic consistency. By minimizing the weighted sum of semantic and pixel-level distances during codebook lookup, we ensure that the discrete tokens encapsulate both high-level concepts (distilled from the semantic teacher) and fine-grained visual details.
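A minimal sketch of this joint lookup, assuming precomputable distances to each codebook entry; the weighting factors w_sem and w_pix are placeholders, since the exact weights are not given here:

```python
import torch

def joint_quantize(sem_feat, pix_feat, sem_codebook, pix_codebook, w_sem=1.0, w_pix=1.0):
    """Pick one shared code index per position by minimizing a weighted sum of
    semantic and pixel-level distances (sketch of a TokenFlow-style lookup).

    sem_feat: (N, Ds), pix_feat: (N, Dp); sem_codebook: (K, Ds), pix_codebook: (K, Dp).
    The shared mapping means index k addresses the same discrete token in both codebooks.
    """
    d_sem = torch.cdist(sem_feat, sem_codebook)  # (N, K) semantic distances
    d_pix = torch.cdist(pix_feat, pix_codebook)  # (N, K) pixel-level distances
    indices = torch.argmin(w_sem * d_sem + w_pix * d_pix, dim=-1)
    return indices, sem_codebook[indices], pix_codebook[indices]
```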

To accommodate dynamic resolution processing, we upgrade the semantic encoder initialization from siglip-so400m to siglip2-so400m-naflex (checkpoint links are given at the end of this document), enabling variable resolution and aspect ratio processing.

When combined with our CNN-based pixel branch, this architecture enables fully dynamic spatial processing, allowing our AR model to train directly at native resolutions without the constraints of fixed input ratios. We employ multi-scale VQ [69] to further enhance the quantization quality. The scale settings are detailed in the appendix A.1.

Our autoregressive framework builds upon a standard decoder-only Transformer architecture, initialized from Qwen2.5-VL-7B [1] to leverage its strong multimodal priors. We extend this architecture to support visual token prediction using a next-scale prediction paradigm. For the newly added visual codebook, we initialize the embeddings directly from the tokenizer's codebook embeddings. Empirically, we find that a unified prediction head achieves comparable performance to separate text and image heads while maintaining architectural simplicity. Therefore, we adopt a single output head for both modalities. The model is trained with cross-entropy loss to predict codebook indices across both modalities.

Positional Encoding. We introduce Multiscale 3D RoPE to handle interleaved text and multiscale vision tokens, as shown in Figure 5. For text tokens at position t, we simply replicate the position across all three dimensions: (t, t, t). For vision tokens, we encode spatial and scale information explicitly: each patch at scale s with grid coordinates (i, j) receives a normalized position, where H × W is the grid size and C is a constant range factor (omitted in Figure 5 for simplicity). The 0.5 offset centers the positional encoding within each patch.

This normalized formulation enables resolution-invariant training: by mapping all spatial positions to a fixed range [0, C], grids of different resolutions (e.g. 16 × 16 and 32 × 32) share the same coordinate space. This eliminates positional extrapolation during high-resolution fine-tuning, as the model encounters only previously learned position ranges.
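The exact position formula is not reproduced above; the sketch below is one plausible reading of the description (a scale index plus grid coordinates normalized to [0, C] with a 0.5 centering offset). The placement of s on the first axis and the default value of C are assumptions, not taken from the paper.

```python
def text_position(t):
    # Text tokens replicate the sequence position across all three RoPE axes.
    return (t, t, t)

def vision_position(s, i, j, H, W, C=64.0):
    # Patch (i, j) of an H x W grid at scale s: spatial coordinates are mapped
    # into the fixed range [0, C], and the 0.5 offset centers each patch.
    return (s, C * (i + 0.5) / H, C * (j + 0.5) / W)
```

Under this mapping, a 16 × 16 grid and a 32 × 32 grid occupy the same coordinate range, which is what lets high-resolution fine-tuning reuse only previously seen positions.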

Following [69], we incorporate additional learnable scale embeddings for vision tokens. To enhance the model’s ability to adapt to varying resolutions during both training and inference, we further introduce scale length positional embeddings. Specifically, we employ sinusoidal encoding over the number of scales, which serves to disambiguate the target output resolution. This design is particularly critical in the VAR framework, where the generative process for images of different sizes follows distinct scale-wise patterns. By explicitly encoding scale length, the model can accordingly adjust its generation paradigm, leading to more coherent and resolution-aware synthesis.

Scale Reweight. In the next-scale prediction paradigm, earlier scales play a crucial role in determining the overall layout and structure of the image. However, these scales contain significantly fewer tokens compared to later scales, creating an imbalanced loss landscape during training. With uniform token weighting, the model tends to prioritize abundant fine-scale tokens at the expense of coarse-scale structural information, leading to layout degradation, particularly severe at high resolutions where the token count disparity becomes extreme.

To address this imbalance, we introduce scale-aware loss reweighting while keeping the total vision loss constant. Specifically, we assign each scale s a weight that depends on its spatial resolution h_s × w_s, with a hyperparameter α controlling the reweighting strength (Eq. 2). This formulation substantially increases the importance of early-scale predictions, ensuring stable structural generation.
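Equation 2 itself is not reproduced in this summary. A plausible form that matches the description (per-token weights decaying with the token count h_s · w_s at rate α, renormalized so that the total vision loss is unchanged) would be:

```latex
\lambda_s \;\propto\; (h_s \, w_s)^{-\alpha},
\qquad
\sum_{s} h_s \, w_s \, \lambda_s \;=\; \sum_{s} h_s \, w_s,
\qquad
\mathcal{L}_{\text{vision}} \;=\; \sum_{s} \lambda_s \sum_{(i,j)} \ell_{s,(i,j)} .
```

This is a reconstruction under stated assumptions, not the paper's exact formula; α = 0.9 is the value reported later for the 512-level stage.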

Self Correction with Residual Features. During inference in the next-scale prediction paradigm, tokens within each scale are sampled independently using top-k/top-p sampling. This independence can introduce local conflicts: adjacent positions may sample semantically redundant content due to the lack of joint modeling. More fundamentally, autoregressive models suffer from exposure bias [2,9,29,56]: during training with teacher forcing, the model always conditions on ground-truth tokens from previous steps, but at inference time, it must condition on its own predictions. This train-test mismatch means that errors from earlier scales, which the model was never exposed to during training, tend to propagate and amplify through the generation process [46,54].

To address this, we introduce a self-correction mechanism during training, inspired by [25,46]. Rather than deterministically selecting the closest codebook index during encoding, we sample from a multinomial distribution over the top-k nearest indices, while the model continues to predict the top-1 index as the target. This trains the model to correct suboptimal choices from previous scales. In the original VAR framework, the input features for each scale are accumulated from all previous scales. However, we find that applying self-correction to accumulated features in our decoder-only architecture leads to performance degradation. We hypothesize that self-correction significantly complicates the input feature space, creating a mismatch with text features that are directly retrieved from the codebook.

We modify the visual input to use residual features directly from the codebook without accumulation. The features of each scale are independently retrieved and upsampled as needed. This approach significantly constrains the complexity of visual input feature space, yielding substantial performance improvements and reducing local artifacts in generated images.
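A minimal sketch of the training-time self-correction, assuming per-position distances to the codebook are available; the value of k and the distance-to-probability mapping are assumptions (the text only specifies applying correction to roughly 60% of visual tokens per scale with probability 1.0):

```python
import torch

def self_correct_inputs(distances, k=8, correct_frac=0.6):
    """distances: (N, K) distances from each visual feature to each codebook entry.

    Targets stay the top-1 (closest) index; input indices are re-sampled from the
    top-k nearest codes for a fraction of positions, so the model learns to recover
    from suboptimal choices made at earlier scales.
    """
    target = distances.argmin(dim=-1)                            # (N,) ground-truth indices
    topk_d, topk_idx = distances.topk(k, dim=-1, largest=False)  # k nearest codes
    probs = torch.softmax(-topk_d, dim=-1)                       # closer codes are more likely
    sampled = topk_idx.gather(-1, torch.multinomial(probs, 1)).squeeze(-1)
    corrupt = torch.rand(target.shape, device=target.device) < correct_frac
    input_idx = torch.where(corrupt, sampled, target)
    return input_idx, target
```

The transformer input for each scale would then be retrieved directly from the codebook at input_idx (the residual-feature variant described above), rather than accumulated across scales.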

Diffusion Model


Figure 6 Architecture of the optional diffusion decoder.

While our default VQ decoder achieves satisfactory reconstruction fidelity with high inference efficiency, the discrete quantization process inevitably results in the loss of high-frequency details. To push the boundaries of visual quality, particularly for generation tasks requiring photo-realistic details, we introduce an optional diffusion-based decoder acting as a refinement module.

Conditioning Mechanism. After the next-scale visual index prediction, we obtain the corresponding embeddings from both the semantic and pixel codebooks. The semantic embeddings can be processed through the tokenizer's semantic decoder to yield high-dimensional semantic features, representations that are explicitly aligned with ground-truth semantic features during tokenizer training [54]. We concatenate these three elements (semantic embeddings, pixel embeddings, and decoded semantic features), pass them through a linear projection layer, and feed the result into the diffusion model as a visual condition. Simultaneously, the diffusion model integrates the image caption via the text branch.
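A sketch of this conditioning path; module and dimension names are placeholders, and the actual projection layer and diffusion interface are not specified in the text:

```python
import torch
import torch.nn as nn

class VisualCondition(nn.Module):
    """Concatenate semantic embeddings, pixel embeddings, and decoded semantic
    features, then project them into the diffusion decoder's conditioning space."""

    def __init__(self, d_sem_emb, d_pix_emb, d_sem_feat, d_cond):
        super().__init__()
        self.proj = nn.Linear(d_sem_emb + d_pix_emb + d_sem_feat, d_cond)

    def forward(self, sem_emb, pix_emb, sem_feat):
        # All inputs are (B, N, *) token grids obtained after next-scale prediction.
        return self.proj(torch.cat([sem_emb, pix_emb, sem_feat], dim=-1))
```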

Usage and Trade-offs. Within the NextFlow framework, we observe that the diffusion decoder significantly mitigates detail degradation, particularly in challenging regions such as small-scale faces and text. However, this generative refinement introduces a trade-off: the stochastic nature of the diffusion process may alter fine-grained structures, potentially impacting performance in tasks requiring strict spatial consistency, such as local editing or identity preservation. Unless explicitly stated otherwise, the diffusion decoder is disabled in our reported experiments, including both quantitative evaluations and qualitative visualizations.

Experimental progress is rarely linear. Throughout training, we encountered a variety of challenges. In this section, we present our training odyssey: a chronological account of our training recipes, the issues that arose at each stage, and the solutions we developed to address them. The overall training pipeline is shown in Figure 7.

The original TokenFlow framework [54] initializes the semantic encoder from a pretrained model while training the pixel encoder from scratch. This leads to semantic features dominating early optimization and ultimately limits reconstruction quality. To mitigate this, we adopt a multi-stage training strategy: we first independently train the pixel branch to establish strong reconstruction capabilities. We then initialize all components (semantic encoder, pixel encoder/decoder, and pixel codebook) from their pre-trained checkpoints and train them jointly. This ensures that both branches begin from a favorable initialization, significantly accelerating convergence and improving final performance. Finally, we double the capacity of the pixel decoder, following architectural insights from [8,54], and fine-tune it separately. This stage substantially reduces local artifacts and enhances fine details such as small faces and text.

The tokenizer is trained on high-fidelity images with oversampling of face-containing samples to improve the preservation of facial detail. We find that randomly dropping 50% of VAR scales enhances the robustness of numerical distributions in the codebook and results in better reconstruction. Additionally, we enforce fp32 precision during quantization to maintain numerical stability.

Advantage of TokenFlow-style tokenizer. We validate the efficacy of the TokenFlow-style dual-codebook design by comparing it against a standard single-branch VQGAN. While the single-branch baseline yields marginally higher raw reconstruction fidelity (+0.5 PSNR at 256²), it proves less effective for generative modeling. As shown in Figure 8, under identical training protocols (40k steps, ∼40M samples), the dual-branch tokenizer achieves significantly lower vision loss and consistently superior GenEval scores. We attribute this to the semantic constraints inherent in the dual-branch architecture; these constraints shape a latent space that is structurally easier for the autoregressive model to learn, aligning with findings in REPA [85] and VA-VAE [82].

A single shared output head has been adopted by prior unified models [39,66], while separate heads allow modality-specific optimization [34,80,81]. To systematically compare these approaches in our decoder-only framework, we conducted a controlled ablation under a lightweight setting using 5M alignment and 5M supervised fine-tuning (SFT) samples. During alignment, we fine-tuned the adapter and the expanded shared output head for the single-head model, keeping the remaining parameters frozen. For the dual-head model, we fine-tuned the additional vision head along with the adapter. All parameters were fine-tuned during the SFT phase.

As illustrated by the training losses in Figure 9, the single-head architecture consistently demonstrates superior performance. Specifically, it achieves lower total loss and vision loss throughout both the alignment and SFT phases. Furthermore, its text loss during the SFT phase is comparable to that of the dual-head counterpart.

Given its architectural simplicity and empirically better performance, we adopted the shared, single-head design for all subsequent large-scale experiments. Losses are computed using cross-entropy loss.

Impact of Self Correction. We validate self-correction strategies on a 2B parameter model. All experiments are conducted using a high-quality subset of LAION [59] and utilize a tokenizer with a 16,384-size codebook. Each 10,000 training steps corresponds to 5M image-text pairs.

Figure 10 reveals an intriguing phenomenon: directly applying self-correction with accumulated features (green line, p = 1.0, 30%) degrades performance below the non-corrected baseline (origin line). However, replacing accumulated VAR features with residual features leads to significant improvements. We attribute this to the fundamental difference in feature space complexity: in our decoder-only architecture, self-correction on accumulated features excessively complicates the input space, creating a mismatch with text tokens that are directly retrieved from the codebook. Residual features effectively constrain this complexity while maintaining consistency with the text modality, enabling self-correction to function as intended. Further ablations on self-correction intensity reveal that applying correction with probability p = 1.0 to 60% of visual tokens per scale achieves optimal performance (0.56 at 50k steps).

To preserve the original model’s text capabilities and evaluate whether mixing text-only data harms text-to-image performance, we conducted a small-scale ablation. As shown in Table 1, incorporating 25% text data during training does not adversely affect text-to-image generation quality.

During the alignment stage, we replace the original ViT in Qwen2.5-VL-7B [1] with our vision tokenizer and expand the vocabulary to include vision codes and image boundary tokens ⟨boi⟩, ⟨eoi⟩. We explore two training strategies to align the tokenizer with the language backbone: (1) a two-stage approach that first aligns the connector module and subsequently fine-tunes the output projection layer, and (2) a joint training strategy that optimizes both components simultaneously. To compare these strategies, we monitor performance metrics after an identical supervised fine-tuning (SFT) phase. Under controlled experimental conditions, both methods yield comparable performance. For simplicity, we adopt the joint alignment strategy in our main experiments. We used 10 million image-text pairs for bidirectional alignment tasks (image captioning and text-to-image generation) during this phase. Training is conducted at 256-level resolution.

During pre-training, all model parameters except those of the tokenizer are trainable. The training corpus consisted of approximately 6 trillion tokens drawn from various sources, including pure text, image-text pairs, editing data, and interleaved multimodal data. We adopt a progressive resolution curriculum across three sub-stages: 256-level, 512-level, and 1024-level pre-training.

In this stage, we conducted large-scale pre-training using around 2 billion text-to-image samples to enable the model to learn a wide range of visual semantics and establish fundamental image generation capabilities. We mixed pure text and multimodal understanding data to maintain the model’s original language and visual comprehension abilities. Additionally, we incorporate 147M interleaved samples, which teach the model to understand relationships across multiple images, laying the foundation for advanced editing capabilities.

Upon transitioning to 512 × 512 resolution, we observed a significant increase in artifacts and structural degradation in generated images. Although overall vision loss decreased, lower-scale losses exhibited an upward trend (Fig. 11). The GenEval score also dropped from 0.67 to 0.57, which we attribute to the substantial increase in the number of tokens per image (approximately 3000 additional tokens) at higher resolution. In the VAR architecture, earlier scales play a critical role in determining the global layout [33,54,69]. Equal weighting of tokens disproportionately suppressed learning on these foundational scales. Inspired by flow matching techniques that emphasize difficult intermediate timesteps [17,36], we introduced a scale-reweighting strategy (Eq. 2) with α = 0.9. This adjustment ensured stable loss reduction on all scales and eliminated localized artifacts.

At 1024 × 1024 resolution, the exponential growth in token count necessitated the use of a carefully curated subset of 40 million high-quality samples. Despite the reduced dataset size, the tokenizer achieved significantly finer reconstruction of local details. This stage required minimal data to enable high-resolution generation while substantially improving visual fidelity compared to previous stages.

After pre-training, we employ a two-phase post-training strategy to refine the model's capabilities. First, we conduct continued training (CT) [19,23] on a curated subset of high-quality data to improve the aesthetic quality of the generated images. While pretrained models demonstrate strong general capabilities, they often produce outputs with inconsistent aesthetic standards due to the heterogeneous nature of pre-training datasets. The CT phase addresses this limitation by fine-tuning on aesthetically superior samples while preserving prompt adherence and structural accuracy. Subsequently, we perform supervised fine-tuning (SFT) using a small set of high-quality conversational data. During SFT, we format the data in a dialogue structure and apply supervision exclusively to the model's responses, enabling more natural and contextually appropriate interactions while further improving generation quality.

As our architecture follows a pure autoregressive approach, unlike Bagel-like unified models [16,41,93], we can directly employ Group Reward Policy Optimization (GRPO) [60] for reinforcement learning [31,47,70,71,86,88]. The generation of a VAR sequence can be formulated as a multi-step Markov Decision Process (MDP), where each action a_t corresponds to generating the token grid for the next resolution level; the policy factorizes over the tokens a_t,(i,j) at positions (i, j) within that grid, conditioned on the prefix s_t = {a_1, ..., a_{t-1}}.

However, applying RL to a multi-scale VAR generation process introduces unique challenges. As discussed in 3.2.3, early steps in the VAR architecture produce coarse grids with few tokens, while later steps generate fine-grained grids with orders of magnitude more tokens. This imbalance causes learning signals from later steps to dominate optimization, destabilizing training and hindering effective policy updates for the initial steps that define global structure. To address this, we introduce two targeted strategies to stabilize the RL fine-tuning process. The first is scale reweighting, the same strategy adopted at the pre-training stage. The second is a prefix-tuning strategy. Since low-resolution steps are most critical for the global layout and semantics of the generated image, we concentrate our RL updates on these formative stages. Given the condition c, which may also contain condition images, we roll out a group of G image token sequences and decode the corresponding images {x^1, ..., x^G}. Within each group, advantages are computed from the rewards assigned by a reward model R, and these group-relative advantages drive the GRPO objective.
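The policy, advantage, and objective equations are not reproduced above. The text states that standard GRPO [60] is applied directly, so the following is the standard group-relative formulation together with the within-scale factorization implied by the description; treat it as a reconstruction rather than the paper's exact equations:

```latex
\pi_\theta(a_t \mid s_t) \;=\; \prod_{(i,j)} \pi_\theta\!\left(a_{t,(i,j)} \mid s_t\right),
\qquad
A^{i} \;=\; \frac{R(x^{i}, c) - \mathrm{mean}\big(\{R(x^{k}, c)\}_{k=1}^{G}\big)}
               {\mathrm{std}\big(\{R(x^{k}, c)\}_{k=1}^{G}\big)},
\\[6pt]
\mathcal{J}(\theta) \;=\;
\mathbb{E}\!\left[
\frac{1}{G}\sum_{i=1}^{G}
\min\!\left(
r^{i}_\theta A^{i},\;
\mathrm{clip}\!\left(r^{i}_\theta,\, 1-\epsilon,\, 1+\epsilon\right) A^{i}
\right)
\right],
\quad
r^{i}_\theta = \frac{\pi_\theta\!\left(s^{i}_T \mid c\right)}{\pi_{\theta_{\text{old}}}\!\left(s^{i}_T \mid c\right)} .
```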

As shown in Fig. 12, we find that fine-tuning only the policies for the first m (e.g., m = 8) scales, while keeping the policies for the remaining T − m finer scales frozen, is highly effective. This strategy, termed “prefix-tuning,” focuses the limited and high-variance RL signal on the most impactful decisions, avoiding noisy updates to later-stage policies that primarily refine local details. This not only accelerates convergence but also better preserves the generative quality of the pretrained model while adapting its high-level attributes. Together, these two techniques enable stable and efficient RL-based fine-tuning of NextFlow, allowing for precise alignment with downstream objectives without sacrificing generative coherence.
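A minimal sketch of the prefix-tuning idea: restrict the RL loss to tokens belonging to the first m coarse scales. The mask construction below is an illustrative implementation choice, with m = 8 taken from the example in the text:

```python
import torch

def prefix_tuning_mask(scale_sizes, m=8):
    """Per-token mask keeping the RL loss only on the first m (coarse) scales.

    scale_sizes: list of (h_s, w_s) grid sizes, ordered coarse to fine.
    Returns a 1-D tensor over the flattened visual sequence: 1.0 for tokens in
    the first m scales, 0.0 for tokens in the remaining finer scales.
    """
    parts = [torch.full((h * w,), 1.0 if s < m else 0.0)
             for s, (h, w) in enumerate(scale_sizes)]
    return torch.cat(parts)

# Usage (per_token_loss is the flattened GRPO loss over all visual tokens):
# mask = prefix_tuning_mask(scale_sizes, m=8)
# loss = (per_token_loss * mask).sum() / mask.sum()
```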

We explore three model variants: a 1B parameter UNet-based model, and two Transformer-based models with 12B and 18B parameters, respectively. To preserve the text-conditional capabilities of the pre-trained backbones, we inject the discrete tokens strictly via the visual conditioning branch. We avoid concatenating discrete tokens with text embeddings, as our preliminary experiments indicated this would corrupt the textual semantic information. Structurally, the UNet variant adapts its input convolution to match the input channel dimensions, while the Transformer variants utilize an MLP layer to project the discrete tokens into the model’s native dimension.

Training Strategy. We adopt different training strategies based on model scale and computational constraints. The 1B model undergoes full parameter fine-tuning, whereas the 12B and 18B models are trained using Low-Rank Adaptation (LoRA) [27]. Data selection is also tailored to model capacity: the 1B model is trained on high-quality synthetic data, as its limited capacity struggles to fit the complex distribution of real-world data, often resulting in generation artifacts. Conversely, the larger 12B and 18B models are trained on real-world data to maximize reconstruction fidelity. We employ a two-stage curriculum: models are first trained at the base resolution, followed by fine-tuning on a 2 × 2 upsampling task to yield sharper high-frequency details. We rely exclusively on the standard diffusion loss, explicitly avoiding pixel-level losses (e.g., MSE), which we found to inflate PSNR metrics while degrading perceptual quality.

NextFlow is pre-trained on a cluster equipped with 1024 GPUs. We employ DeepSpeed ZeRO [55] with gradient checkpointing to enable efficient distributed training at scale.

Workload Balance. The pre-training of NextFlow involves heterogeneous data types (pure text vs. text-to-image vs. interleaved, varying resolution), which introduces significant computational imbalance across GPUs. We address this through a workload balancing strategy during data packing, as shown in Figure 13. Specifically, we precompute the TFLOPS for all sequence lengths and balance workloads accordingly. As demonstrated in Table 3, this strategy significantly reduces inter-GPU idle time, achieving a 4.1× speedup compared to the naive approach.
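A sketch of the packing idea: greedily assign the heaviest remaining sequence to the least-loaded GPU bucket, using a precomputed cost table. The flops_for lookup and the sample fields are placeholders, not names from the paper:

```python
import heapq

def balance_workload(samples, num_gpus, flops_for):
    """Greedy longest-processing-time packing by estimated compute cost.

    samples: iterable of dicts with a "seq_len" field (assumed schema).
    flops_for: maps a sequence length to its precomputed TFLOPS estimate.
    """
    buckets = [(0.0, gpu, []) for gpu in range(num_gpus)]  # (total_tflops, gpu_id, assigned)
    heapq.heapify(buckets)
    for sample in sorted(samples, key=lambda s: flops_for(s["seq_len"]), reverse=True):
        total, gpu, assigned = heapq.heappop(buckets)
        assigned.append(sample)
        heapq.heappush(buckets, (total + flops_for(sample["seq_len"]), gpu, assigned))
    return {gpu: assigned for _, gpu, assigned in buckets}
```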

High-Performance Kernel. The large codebook size results in significant memory consumption when computing output logits during training. To address this challenge, we adapt FusedLinearCrossEntropy [26], which fuses the final linear projection and cross-entropy loss computation into a single kernel, reducing peak memory usage by ∼20 GB per GPU by avoiding the storage of the full logit tensor in memory. We further identified memory-bound operations, such as RoPE, RMSNorm, and Flash-Attention [14], and fused them into single kernels to minimize redundant memory access. These fused kernels store intermediate results in registers or shared memory, significantly improving arithmetic intensity.
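The fused kernel itself comes from [26]; the sketch below only illustrates the underlying memory-saving idea in plain PyTorch, by never materializing the full (num_tokens × vocab_size) logit tensor and instead processing the projection and cross-entropy in token chunks:

```python
import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(hidden, weight, targets, chunk_size=4096):
    """Compute cross_entropy(hidden @ weight.T, targets) chunk by chunk.

    hidden: (N, d) final hidden states, weight: (V, d) output head, targets: (N,).
    Only a (chunk_size, V) logit slice exists at any time; a true fused kernel
    additionally fuses the backward pass, which this sketch does not.
    """
    total = hidden.new_zeros(())
    for start in range(0, hidden.size(0), chunk_size):
        h = hidden[start:start + chunk_size]
        t = targets[start:start + chunk_size]
        logits = h @ weight.t()
        total = total + F.cross_entropy(logits, t, reduction="sum")
    return total / targets.numel()
```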

Pre-extract image indices. To minimize computational overhead during large-scale training, we pre-extract image indices offline across all training stages, thereby eliminating online encoding latency. For each image-text sequence, we precompute and store the index sequence along with the original ordering of the data. This approach allows the encoders to remain offloaded from the GPU during training, significantly reducing memory requirements. To support self-correction during training, we incorporate a predefined self-correction probability (100% samples, 60% tokens per scale). The corresponding input indices and ground-truth indices are stored accordingly. Once pre-extraction is complete, we perform offline data packing using these precomputed samples, further improving training efficiency as detailed in Sec. 4.

NextFlow is trained on a large-scale multimodal dataset that covers a wide range of tasks involving images and text, including visual understanding, image generation, image editing, interleaved image-text document generation, and text generation. We elaborate on how each component of our dataset is built in the following sections.

In the pre-training stage, the model learns to perceive and understand images from text supervision. We therefore collect a large number of image captioning samples from open-source image-text datasets. In addition, we supplement them with 1) text-rich images, comprising scene text and documents of various types such as tables, charts, and plots, to enhance the OCR capability of our model; and 2) images with associated world knowledge. The image captions are rewritten using Vision-Language Models (VLMs) to improve the quality and comprehensiveness of the descriptions.

We curate a billion-scale dataset for text-to-image generation with great diversity in image content and image types. The images are gathered from open-source datasets, which exhibit broad coverage of content on the Internet; from Megalith [4] and CommonCatalog [22], which provide valuable high-quality photographs; and from our in-house image gallery, which further extends the distribution of our data. We first apply a set of heuristic filters to the images and run aesthetic models to rule out images of poor visual quality. For sources whose images have a noticeably imbalanced content distribution, we employ an image topic classifier built on pretrained SigLIP2 in a zero-shot manner to categorize images into topics such as landscape, human, animal, plant, food & drinks, and so on. Data resampling is then performed to obtain a balanced distribution. As shown in recent works [17], captions play a crucial role in training image generation models with strong instruction-following ability. Therefore we caption all images using a VLM to obtain accurate and detailed descriptions. In the CT stage, we introduce a small amount of synthetic data to further boost the aesthetics of our model's generation.

Traditional Editing and Subject-Driven Editing

For clarity, we categorize image editing into two distinct tasks: traditional editing and subject-driven generation.

Traditional editing tackles the problem of transforming an image either locally or globally, while the resulting image remains faithful to the input image in spirit: for example, local editing such as object removal and object insertion; global editing such as lighting modification and stylization; and view editing such as zooming in/out or changing the camera's view angle. In contrast, subject-driven image generation aims to create an image with new content that shares only the target subject with the input image, for example, generating an image of a dog running on a grass field given a close-up photo of it as input.

We start with open-source traditional editing datasets, including UniWorld-V1 [42], RealEdit [65], and ShareGPT-4o-Image [10]. First, we filter out edit pairs with mismatched image resolutions. Then we identify and remove low-quality subsets that contain over 5% bad edits through manual examination, followed by a VLM assessment at the granularity of individual samples. An edit is considered “bad” if the target image does not respond to or closely follow the edit instruction. The resulting dataset exhibits a task distribution biased towards common and simple editing tasks such as adding, removing, or replacing an object, which substantially limits the model's capability to perform complex, varied real-world editing.

We therefore build an additional synthetic dataset that spans a diverse spectrum of traditional editing tasks. Its task distribution is significantly more balanced than that of the open-source counterpart, as illustrated in Figure 15.

We observe that subjects such as humans, animals, objects, and even abstract content often recur across real-world data. This indicates that there are numerous natural demonstrations for diverse subject-driven generation tasks. We collect images that co-occur on the Internet and utilize VLMs to identify image pairs sharing common subjects, which turns out to be a scalable way to build the training data. Compared to workflow-based methods that rely heavily on generative models, our approach inherently produces data with a significant advantage in task diversity as well as subject consistency.

To endow the model with the ability to generate coherent interleaved image-text sequences, we construct a large-scale video-text dataset. We treat video clips as sequences of interleaved frames and leverage the temporal continuity of video to model narrative progression. Our data pipeline aggregates sources including OmniCorpus-CC, OmniCorpus-YT [40], and Koala36M [72], processed through a rigorous multi-stage filtering strategy.

Quality and Heuristic Filtering. We apply strict quality controls to the raw video corpus, specifically targeting the Koala36M subset. To minimize fragmentation errors during training, we discard long clips exceeding 20 seconds. Visual quality is ensured by filtering for high aesthetic scores (> 4.3, retaining the top 30%) and clarity (> 0.7, retaining the top 50%). Furthermore, to avoid degenerate solutions where the model generates static images, we require a motion score greater than 4. This pipeline removes approximately 75% of the raw data, resulting in a refined subset of 8M clips free from blur, static scenes, and low-aesthetic content.
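The thresholds above translate directly into a per-clip filter; the metadata field names below are placeholders for whatever the clip annotations provide:

```python
def keep_clip(clip):
    """Apply the heuristic quality filters described above to one clip's metadata.
    Assumed fields: duration_s, aesthetic, clarity, motion."""
    return (
        clip["duration_s"] <= 20      # drop long clips to limit fragmentation errors
        and clip["aesthetic"] > 4.3   # roughly the top 30% by aesthetic score
        and clip["clarity"] > 0.7     # roughly the top 50% by clarity
        and clip["motion"] > 4        # avoid static clips that encourage static generations
    )

# refined = [c for c in raw_clips if keep_clip(c)]   # ~25% of the raw corpus survives
```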

Semantic Balancing. To prevent the model from overfitting to dominant categories, we employ SigLIP [87] to classify video content. We downsample overrepresented classes, removing 50% of clips containing “person” (reducing the total volume by 20%) and 50% of “television news broadcast” clips (reducing the total by 10%). This results in a balanced corpus of approximately 5.3M clips.

Motion-Adaptive Frame Selection. We extract frames at a baseline rate of 0.5 FPS (one frame every 2 seconds). However, uniform sampling often captures redundant frames in static scenes or blurry frames during rapid camera movement. We introduce an optical flow-based filtering mechanism using RAFT [68] to refine frame selection (a sketch of the decision logic follows the list below):

• Static Filtering: Frames with negligible optical flow magnitude are discarded.

• Camera vs. Object Motion: We analyze the variance of flow directions. Low variance implies global camera movement (pan/zoom), which we discard to focus on semantic content changes. High variance indicates independent object motion, which is preserved.

• Large Displacements: Frames exhibiting significant structural changes (high flow magnitude) are retained to capture scene transitions.
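A sketch of this decision logic operating on a dense flow field (e.g. produced by RAFT); the thresholds are illustrative placeholders, not values from the paper, and the variance test is a simplification of whatever direction statistic was actually used:

```python
import numpy as np

def frame_decision(flow, static_thr=0.5, dir_var_thr=0.2, large_thr=8.0):
    """flow: (H, W, 2) optical flow from this frame to the next candidate frame."""
    mag = np.linalg.norm(flow, axis=-1)
    if mag.mean() < static_thr:
        return "discard"  # static scene: negligible motion
    if mag.mean() > large_thr:
        return "keep"     # large displacement: likely a scene transition
    angles = np.arctan2(flow[..., 1], flow[..., 0])
    moving = angles[mag > static_thr]
    if moving.size and np.var(moving) < dir_var_thr:
        return "discard"  # low direction variance: global camera pan/zoom
    return "keep"         # high variance: independent object motion
```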

During training, we constrain the input to a maximum of 5 frames at 512px resolution or 3 frames at 1k resolution.

Transition Text Generation. Finally, to bridge the visual gaps between selected frames, we utilize VLMs to generate coherent transition texts, effectively converting the video clips into interleaved image-text documents.

To ensure NextFlow maintains strong pure-text instruction following and reasoning capabilities, we incorporate high-quality text-only data into the training mix. We source general-purpose instruction data from Nemotron-CC HQ [64] to maintain conversational fluency and general knowledge. Additionally, we integrate mathematical reasoning data from MegaMath [94] to enhance the model's logical reasoning and problem-solving abilities.

6 Model Performance

We conduct a comprehensive evaluation of NextFlow across multiple dimensions, including prompt following, world knowledge, and aesthetic quality. The quantitative results demonstrate that our autoregressive approach, particularly when enhanced with reinforcement learning, achieves state-of-the-art performance comparable to or exceeding top-tier models.

As shown in Tables 4 and 5, NextFlow demonstrates exceptional prompt following capabilities, with our RL-finetuned model achieving state-of-the-art scores (0.84 on GenEval [21] and 88.32 on DPG [28]) that outperform strong diffusion baselines like FLUX.1-dev [35] and match top-tier performers.

A critical advantage of our large-scale pre-training on diverse multimodal data is reflected in the WISE benchmark [48] (Table 6). NextFlow RL achieves an overall score of 0.62, matching the performance of Qwen-Image and significantly outperforming other autoregressive models like Show-o (0.30) and Janus-Pro-7B (0.35). The model demonstrates robust understanding across cultural, spatial, and physical domains, suggesting that our unified architecture effectively internalizes complex world knowledge.

On PRISM-Bench [18] (Table 7), which evaluates broader generative aspects including style and affect, NextFlow RL achieves an overall score of 78.8. This result places our model on par with top-tier systems like Seedream 3.0 [20] and Qwen-Image [75], demonstrating that our unified architecture achieves competitive aesthetic quality and text rendering capabilities alongside its strong instruction following.

We assess the image editing capabilities of NextFlow across a diverse set of benchmarks, ranging from traditional instruction-based editing to complex subject-driven generation. As evidenced by the quantitative results, our unified autoregressive framework, particularly when enhanced with reinforcement learning, establishes new state-of-the-art performance levels. The OmniContext benchmark [76] evaluates in-context editing and subject preservation ability. In the single-subject setting (Table 9), NextFlow RL achieves a Subject Consistency (SC) score of 9.22, significantly outperforming models like OmniGen2 (8.34) and even surpassing the proprietary GPT-4o (9.03) in maintaining subject fidelity. This result highlights the effectiveness of our interleaved pre-training in learning robust identity representations. On GEdit-Bench [45] (Table 10), which explicitly measures the trade-off between semantic consistency and perceptual quality, NextFlow RL achieves the best overall score of 7.87.

Comprehensive Evaluation on EditCanvas. Finally, we report results on our proposed EditCanvas benchmark (Table 11). To address the limitations of existing datasets that often focus on isolated capabilities or lack granularity, EditCanvas covers both Traditional Editing and Subject-Driven Generation across 56 fine-grained tasks with over 5,000 high-quality samples (please refer to Appendix B for detailed dataset statistics and comparisons). On this rigorous benchmark, NextFlow RL demonstrates balanced excellence, achieving an overall score of 8.04. It shows particular strength in Subject-Driven Generation (8.78). These results confirm that our “next-scale prediction” paradigm allows for precise local modifications while maintaining the high aesthetic quality required for creative workflows, fully leveraging the comprehensive evaluation scope provided by EditCanvas.

To evaluate the unified modeling capabilities of NextFlow, we test its performance on interleaved multimodal generation tasks. Unlike pipeline approaches that rely on separate text and image models, our unified autoregressive framework treats text and visual tokens as a single continuous sequence. Figure 18 demonstrates the model's ability to generate coherent, alternating sequences of text and images across distinct domains: one example retells “The Ugly Duckling” across the seasons, ending with the grown swan gliding across a sunset lake, and another walks through preparing tomato basil pasta step by step, from raw ingredients on a countertop to the plated dish, with each narrative beat paired with a generated image.

A distinct advantage of our NextFlow architecture is its native support for interleaved text-image generation, which allows for the seamless integration of Chain-of-Thought (CoT) reasoning prior to visual synthesis. To investigate the potential of this paradigm in handling implicit constraints and cultural nuances, we conducted an exploratory study using a curated dataset of 1M instruction-reasoning-image triplets.

As illustrated in Figure 19, standard generation often struggles with prompts containing latent ambiguities.

For instance, baseline models may render a Red Panda for “China's national treasure” or generate autumnal red leaves for a “maple leaf in summer,” failing to resolve logical conflicts. By fine-tuning NextFlow to articulate a reasoning trace (analyzing the prompt for cultural context and physical constraints) before predicting visual tokens, the model effectively self-corrects these semantic discrepancies.

We evaluated this on WISE [48]. This experiment is conducted as a controlled study on an intermediate model checkpoint. The integration of CoT yielded a substantial performance boost, raising the WISE score from 0.60 to 0.70. This significant relative improvement demonstrates that enabling the model to “think” before “drawing” creates a strong inductive bias for logical consistency and superior prompt adherence.

Benefiting from large-scale pre-training on interleaved image-text data, NextFlow demonstrates the capability to perform in-context learning. As shown in Fig. 20, when prompted with instructions to “infer the transformation pattern from the preceding examples and apply it to the target image,” the model is able to capture the stylistic mapping and modifies the input accordingly.

Quantitative Evaluation. As detailed in Table 12, our optimized training recipe yields substantial gains over the baseline TokenFlow architecture. On ImageNet-1K (512²), NextFlow achieves a marked improvement of +2.08 dB in PSNR, validating the efficacy of our multi-stage training and scale-dropout strategies. Scaling to 1024² further enhances fidelity, reaching a PSNR of 28.04 on our internal benchmark.
Example reasoning traces from Figure 19, each showing the user prompt followed by the model's reasoning and the expanded final prompt:

Prompt: “A powerful marsupial, a national symbol of Australia.”

The user wants an image of a powerful marsupial that is a national symbol of Australia. The key elements are “powerful,” “marsupial,” and “national symbol of Australia.” Common national symbols of Australia include kangaroos, koalas, and maybe the platypus, but kangaroos are the most iconic. Since “powerful” is mentioned, perhaps a large kangaroo, maybe a red kangaroo, which is the largest. Need to describe the kangaroo in detail, its posture to show power, maybe in a natural Australian setting like the outback, with elements that highlight its status as a national symbol, such as a clear sky or iconic Australian landscape.

The final prompt is: A majestic, powerful red kangaroo standing tall in the Australian outback, its muscular hind legs planted firmly on the red earth, chest puffed out to showcase its strength. The kangaroo's thick fur glows in the warm sunlight, with its large, powerful tail balancing its massive body, embodying the resilience and iconic status of Australia as a national symbol. In the background, a vast, arid landscape stretches to the horizon, dotted with sparse eucalyptus trees and a clear blue sky, emphasizing the kangaroo's connection to the land it calls home.

Prompt: “A national treasure animal of China.”

The user wants an image of a national treasure animal of China. Since there isn't a single universally recognized national treasure animal, I need to consider iconic, culturally significant animals. Common candidates could be pandas (national symbol), golden monkeys (representing China's forests), or perhaps the giant panda, as it's globally famous. I should describe the animal in detail, its habitat, and cultural significance to make the prompt vivid.

The final prompt is: A majestic giant panda, the national treasure animal of China, sitting peacefully on a bed of lush green bamboo in a misty mountain forest. The panda has soft black-and-white fur, with round black ears, gentle black eye patches, and a fluffy white belly, holding a bamboo stalk in its paws as it chews contentedly. Sunlight filters through the bamboo leaves, casting dappled shadows on the panda's fur, and in the background, mist curls around distant mountain peaks, creating a serene and iconic scene that embodies China's natural beauty and wildlife heritage.

Prompt: “A close-up of a maple leaf in summer.”

The user wants an image of a maple leaf in summer, but “summer” is conflicting with “close-up” since maple leaves are deciduous and typically lose their leaves in autumn. However, the user might intend a vibrant, healthy maple leaf in summer, perhaps with rich green color, maybe some details like veins, texture, and natural lighting. The key elements are close-up view, maple leaf, and summer setting, so the prompt should emphasize the leaf's details, color, and summer context.

The final prompt is: A detailed close-up photograph of a single maple leaf in the height of summer, showcasing its vibrant, glossy green color with intricate veining and serrated edges. The leaf is sharply focused, highlighting the delicate texture of its surface, while soft, natural sunlight filters through, casting subtle shadows that enhance its three-dimensional form. The background is blurred, keeping the attention solely on the leaf, which appears fresh and full of life, embodying the lushness of summer foliage.

The optional diffusion decoder acts as a perceptual enhancer, further refining high-frequency textures in complex regions, such as small faces and text. However, we observe that this generative refinement involves a trade-off: while it boosts perceptual realism, the VQ decoder ensures higher fidelity to the original signal, avoiding the potential detail alteration inherent in diffusion-based re-synthesis.

We observe that simultaneously supporting multimodal understanding and image generation within a dense 7B decoder-only model imposes a significant capacity bottleneck. Furthermore, due to the scarcity of high-quality pre-training data for multimodal understanding, we minimize understanding training in the late pre-training stage, employing only task inversion, in which text-to-image samples are converted into image captioning samples with a probability of 10%.

To evaluate the understanding capabilities of our base model, we conducted SFT experiments with increasing amounts of composite data, consisting of a 19M high-quality captioning mid-train followed by 21M SFT samples. As shown in Table 13, despite the reduced model size, our 7B model fine-tuned on just 0.7M samples delivers performance comparable to the 13B LLaVA-1.5 baseline. While scaling to 21M significantly boosts document-oriented tasks (e.g., ChartQA, OCRBench), the integration of the 19M high-quality captioning data (40M total) yields the most robust improvements across all benchmarks, demonstrating the potential of our base model.

In this work, we introduced NextFlow, a unified sequential modeling framework that activates multimodal understanding and generation. By shifting from the traditional raster-scan paradigm to a next-scale prediction framework, we have demonstrated that autoregressive models can achieve inference speeds comparable to, or exceeding, state-of-the-art diffusion models without compromising on visual fidelity or reasoning capabilities. Validated on a massive corpus of 6 trillion tokens, our contributions, ranging from the dual-codebook tokenizer and scale-reweighting training objective to prefix-tuning GRPO, establish a robust recipe for training unified multimodal systems. NextFlow serves as a proof of concept that a single decoder-only transformer can effectively perceive, reason, and create.

Despite these advancements, limitations remain. While our dual-codebook tokenizer significantly improves semantic density, the discrete nature of vector quantization inevitably imposes an information bottleneck compared to continuous latent spaces, occasionally necessitating our optional diffusion decoder for hyper-realistic refinement. Furthermore, balancing the objectives of text generation and visual synthesis within a shared parameter space remains a non-trivial optimization challenge, particularly at lower parameter counts.

Looking forward, we identify four critical avenues to further scale and refine unified autoregressive modeling:

• Data Scaling for Advanced Understanding. While our current pre-training corpus is extensive, the ratio of high-quality, dense reasoning data remains a limiting factor for complex multimodal comprehension. Future iterations will prioritize the integration of more diverse and high-quality understanding data, specifically dense image captions and complex interleaved reasoning chains. We hypothesize that scaling the density of semantic supervision will unlock emergent reasoning capabilities in the visual domain, mirroring the trajectory of text-only LLMs.

• Model Scaling and Mixture-of-Experts (MoE). Following neural scaling laws, we anticipate that increasing model capacity will yield predictable gains in both generation fidelity and instruction following. In a toy experiment, we found that transitioning from dense architectures to Mixture-of-Experts (MoE) frameworks significantly improves overall generation quality.

• Next-Generation Tokenization. The tokenizer remains the fundamental upper bound on autoregressive visual generation performance. We aim to develop improved tokenization strategies that achieve higher compression rates without sacrificing reconstruction quality. This includes exploring variable-rate quantization and semantic-aware compression that can further reduce the sequence length for high-resolution images, thereby compounding the efficiency gains of our next-scale prediction paradigm.

• Native Multimodal Chain-of-Thought and Unified RL. Our decoder-only architecture naturally extends the “Chain-of-Thought” paradigm to “Thinking with Images,” enabling the model to reason via intermediate visual generation. Unlike hybrid architectures such as Transfusion [93], which grapple with disjoint optimization objectives for discrete text and continuous visual latents, our fully autoregressive framework allows for seamless reinforcement learning across modalities. This unique structural advantage permits the direct application of standard LLM RL techniques to multimodal reasoning paths, ensuring consistent alignment and unlocking the potential for self-improving multimodal intelligence.

Compared with existing editing benchmarks, EditCanvas offers:

  1. Unified and Comprehensive Framework: While GEdit-Bench focuses on genuine user instructions for direct editing and OmniContext specializes in in-context subject generation, EditCanvas is the first benchmark to systematically integrate both paradigms into a single, unified framework. This allows for a holistic evaluation of a model's diverse editing capabilities, from minor tweaks to complex generative tasks.

  2. Granular, Hierarchical Taxonomy: EditCanvas's three-level classification system is significantly more detailed than the taxonomies of previous benchmarks. While GEdit-Bench offers 11 categories and OmniContext defines 8 subtasks, EditCanvas provides 57 distinct third-level tasks. This granularity enables precise identification of a model's specific strengths and weaknesses, for instance, distinguishing its performance in “object addition” versus “object repositioning.”

  3. Large-Scale, High-Quality Dataset: With over 5,000 carefully curated samples, EditCanvas offers a larger and more diverse dataset compared to GEdit-Bench's 606 examples. The semi-automated generation process, combining a powerful VLM with multi-round human verification, ensures both scale and quality, creating a robust testbed for modern models.

By providing a structured, comprehensive, and large-scale benchmark, EditCanvas sets a new standard for evaluating image editing models, enabling a more thorough and insightful understanding of their real-world performance.

Figure 12 Prefix-tuning strategy for RL fine-tuning in NextFlow.

On the ImgEdit benchmark [83], which covers a wide spectrum of editing operations including addition, removal, and style transfer, NextFlow RL achieves the highest overall score of 4.49, surpassing strong baselines such as Qwen-Image (4.27) and Emu3.5 (4.41). Notably, our model demonstrates exceptional precision in the Adjust (4.68) and Remove (4.67) sub-tasks, indicating a robust ability to manipulate specific image regions without disrupting the global context.


Visualization Results

Qualitative Comparison.

Scale settings for 2048 × 512: (1,1), (4,1), (6,2), (8,2), (10,2), (12,3), (14,4), (16,4), (20,5), (24,6), (28,7), (32,8), (40,10), (48,12), (56,14), (64,16), (96,24), (128,32).

SigLIP checkpoints: https://huggingface.co/google/siglip-so400m-patch14-384 and https://huggingface.co/google/siglip2-so400m-patch16-naflex
