IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Recent advances in motion-aware large language models have shown remarkable promise for unifying motion understanding and generation tasks. However, these models typically treat understanding and generation separately, limiting the mutual benefits that could arise from interactive feedback between tasks. In this work, we reveal that motion assessment and refinement tasks act as crucial bridges to enable bidirectional knowledge flow between understanding and generation. Leveraging this insight, we propose Interleaved Reasoning for Motion Generation (IRMoGen), a novel paradigm that tightly couples motion generation with assessment and refinement through iterative text-motion dialogue. To realize this, we introduce IRG-MotionLLM, the first model that seamlessly interleaves motion generation, assessment, and refinement to improve generation performance. IRG-MotionLLM is developed progressively with a novel three-stage training scheme, initializing and subsequently enhancing native IRMoGen capabilities. To facilitate this development, we construct an automated data engine to synthesize interleaved reasoning annotations from existing text-motion datasets. Extensive experiments demonstrate that: (i) Assessment and refinement tasks significantly improve text-motion alignment; (ii) Interleaving motion generation, assessment, and refinement steps yields consistent performance gains across training stages; and (iii) IRG-MotionLLM clearly outperforms the baseline model and achieves advanced performance on standard text-to-motion generation benchmarks. Cross-evaluator testing further validates its effectiveness. Code & Data: https://github.com/HumanMLLM/IRG-MotionLLM/tree/main.


💡 Research Summary

This paper, titled “IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation,” introduces a novel paradigm and model that fundamentally rethinks how text-to-human-motion generation should be approached by enabling iterative self-critique and refinement.

Core Problem & Proposed Solution: While recent Unified Motion-aware Large Language Models (UniMoLMs) combine motion understanding and generation capabilities within a single model, they typically execute these tasks in isolation. This limits the potential for the model’s understanding abilities to directly inform and improve its generation outputs. To bridge this gap, the authors propose a new paradigm called Interleaved Reasoning for Motion Generation (IRMoGen). The key insight is that motion assessment (evaluating how well a generated motion aligns with a text goal) and motion refinement (providing instructions to improve misaligned motions) act as crucial bridges, enabling a bidirectional knowledge flow between understanding and generation.

The Model: IRG-MotionLLM: To realize the IRMoGen paradigm, the authors develop IRG-MotionLLM, the first model capable of natively interleaving generation, assessment, and refinement in a cohesive reasoning loop. Given a goal text (e.g., “a person jumps and then spins”), the model engages in a step-by-step “text-motion dialogue”: it first analyzes the goal, generates an initial motion, assesses its alignment with the text, provides refinement instructions based on the assessment, and then generates an improved motion. This cycle can repeat for multiple rounds until a satisfactory motion is produced.
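The text-motion dialogue described above can be sketched as a simple control loop. This is a hypothetical illustration, not the paper's actual API: the helper names (`analyze_goal`, `generate_motion`, `assess_alignment`, `refine_instruction`), the score threshold, and the `ToyModel` stand-in are all assumptions made for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Step:
    motion: str       # placeholder for a generated motion (token sequence)
    score: float      # text-motion alignment score in [0, 1]
    instruction: str  # refinement instruction for the next round ("" if done)

def irmogen_loop(goal_text, model, max_rounds=3, threshold=0.9):
    """Interleave generation, assessment, and refinement until the
    assessed alignment clears the threshold or rounds run out."""
    trace = []
    instruction = model.analyze_goal(goal_text)          # step 1: analyze goal
    for _ in range(max_rounds):
        motion = model.generate_motion(goal_text, instruction)  # step 2: generate
        score = model.assess_alignment(goal_text, motion)       # step 3: assess
        if score >= threshold:                                  # satisfactory: stop
            trace.append(Step(motion, score, ""))
            break
        instruction = model.refine_instruction(goal_text, motion)  # step 4: refine
        trace.append(Step(motion, score, instruction))
    return trace

class ToyModel:
    """Trivial stand-in model: each refinement round raises alignment by 0.3."""
    def __init__(self):
        self.quality = 0.4
    def analyze_goal(self, text):
        return "start with: " + text
    def generate_motion(self, text, instruction):
        return f"<motion for '{text}' | {instruction}>"
    def assess_alignment(self, text, motion):
        return self.quality
    def refine_instruction(self, text, motion):
        self.quality = min(1.0, self.quality + 0.3)
        return "adjust the timing of the spin"

trace = irmogen_loop("a person jumps and then spins", ToyModel())
```

With the toy model, the loop runs two refinement rounds before the assessed score clears the threshold; in the real model each step is produced autoregressively within one reasoning chain.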

Progressive Three-Stage Training: Building this complex capability requires careful, staged training:

  1. Stage 1 (IRMoGen Initialization): The base model (a pre-trained UniMoLM) is fine-tuned on eight atomic tasks. These include four basic tasks (e.g., motion captioning, direct generation) and four improving tasks (e.g., text-motion alignment evaluation, refinement instruction generation). This stage implicitly instills the core skills needed for IRMoGen.
  2. Stage 2 (IRMoGen Chain-of-Thought Learning): The model is explicitly trained using a structured reasoning template that chains together goal analysis, motion generation, assessment, and refinement instruction. This teaches the model to autonomously plan and execute the multi-step IRMoGen process.
  3. Stage 3 (IRMoGen Reinforcement): The model undergoes reinforcement learning (using Group Relative Policy Optimization - GRPO) with rewards based on motion quality and text alignment. This stage unlocks the model’s potential to freely explore longer, multi-round reasoning paths to optimize the final generated motion.
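The core of the Stage 3 objective, GRPO, normalizes each sampled rollout's reward against its group rather than learning a value function. The sketch below shows only that group-relative advantage computation; the combined reward and its weights are illustrative assumptions, not the paper's exact reward design.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: A_i = (r_i - mean(group)) / std(group)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

def combined_reward(alignment, quality, w_align=0.7, w_quality=0.3):
    """Illustrative scalar reward mixing text alignment and motion quality
    (the weights are assumptions for this sketch)."""
    return w_align * alignment + w_quality * quality

# One group of rollouts sampled for the same goal text:
scores = [(0.9, 0.8), (0.5, 0.6), (0.7, 0.7)]  # (alignment, quality) pairs
rewards = [combined_reward(a, q) for a, q in scores]
advantages = grpo_advantages(rewards)
```

Rollouts that beat their group's mean reward get positive advantage and are reinforced; below-average rollouts are suppressed, which is what pushes the policy toward longer, better-refined reasoning paths.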

Automated Data Engine: A significant challenge is the lack of training data for interleaved reasoning. The authors construct an automated data engine that synthesizes such data from existing text-motion datasets (HumanML3D, KIT-ML). Using a pre-trained motion encoder and an LLM, the engine automatically generates multiple “negative” motion samples with varying alignment levels for each ground-truth text-motion pair, along with corresponding assessment comments and refinement instructions.
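The engine's negative-sampling idea can be sketched as ranking candidate motions by embedding similarity to the goal text and bucketing them into alignment levels. The cosine scorer and bucketing below are a minimal stand-in for the pre-trained motion encoder, and the LLM-written assessment comments and refinement instructions are omitted entirely.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def bucket_by_alignment(text_emb, candidate_embs, n_levels=3):
    """Sort candidate motions by text-motion similarity and split them into
    n_levels buckets, from worst-aligned to best-aligned negatives."""
    ranked = sorted(range(len(candidate_embs)),
                    key=lambda i: cosine(text_emb, candidate_embs[i]))
    size = math.ceil(len(ranked) / n_levels)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

# Toy 2-D embeddings: index 0 matches the text exactly, index 3 opposes it.
text_emb = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
              [-1.0, 0.0], [0.9, 0.1], [0.1, 0.9]]
buckets = bucket_by_alignment(text_emb, candidates)
```

Each bucket then pairs with the ground-truth text to yield negatives of a known misalignment level, giving the assessment and refinement tasks graded supervision.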

Key Findings & Results: Extensive experiments demonstrate the effectiveness of the proposed approach:

  • Introducing assessment and refinement tasks alone (Stage 1) significantly improves text-motion alignment, boosting performance on both generation and captioning tasks.
  • Interleaving generation, assessment, and refinement consistently improves final motion quality across all training stages. Notably, after RL tuning (Stage 3), the model produces longer reasoning chains with more refinement rounds, leading to the best performance—echoing findings in other reasoning domains.
  • IRG-MotionLLM clearly outperforms its base model and achieves state-of-the-art or competitive results on standard text-to-motion benchmarks (HumanML3D, KIT-ML). Its effectiveness is further validated through cross-evaluator testing.

Significance: This work represents a paradigm shift from viewing motion generation as a one-shot translation task to framing it as an iterative reasoning process involving self-assessment and refinement. It demonstrates that a unified model can leverage its own understanding capabilities to critically evaluate and iteratively improve its generations, paving the way for more robust, reliable, and intelligent multimodal generative systems.

