MoGlow: Probabilistic and controllable motion synthesis using normalising flows


Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, generative, and controllable motion-data models based on normalising flows. Models of this kind can describe highly complex distributions, yet can be trained efficiently using exact maximum likelihood, unlike GANs or VAEs. Our proposed model is autoregressive and uses LSTMs to enable arbitrarily long time-dependencies. Importantly, it is also causal, meaning that each pose in the output sequence is generated without access to poses or control inputs from future time steps; this absence of algorithmic latency is important for interactive applications with real-time motion control. The approach can in principle be applied to any type of motion since it does not make restrictive, task-specific assumptions regarding the motion or the character morphology. We evaluate the models on motion-capture datasets of human and quadruped locomotion. Objective and subjective results show that randomly-sampled motion from the proposed method outperforms task-agnostic baselines and attains a motion quality close to recorded motion capture.


💡 Research Summary

The paper “MoGlow: Probabilistic and Controllable Motion Synthesis using Normalising Flows” introduces a novel generative model for motion data that combines the expressive power of normalising flows with an autoregressive, LSTM‑based architecture. The authors argue that realistic motion synthesis requires stochasticity because weak control signals (e.g., a simple path for a character) leave a large amount of freedom in how the motion should unfold. Deterministic models tend to regress toward a mean pose, producing artefacts such as foot‑sliding and repetitive behaviour. Existing probabilistic approaches—Gaussian mixture models, hidden Markov models, variational auto‑encoders (VAEs), and generative adversarial networks (GANs)—either impose restrictive distributional assumptions, suffer from training instability, or cannot provide exact likelihoods, limiting their ability to faithfully capture the true distribution of human or animal motion.

MoGlow addresses these shortcomings by employing a normalising flow (specifically a Glow‑type architecture) to model the conditional distribution of the next pose given past poses and control inputs. Normalising flows are invertible transformations that map complex data distributions to a simple base distribution (typically a standard Gaussian). Because the Jacobian determinant of the transformation is tractable, the model can be trained by directly maximising the exact log‑likelihood, avoiding the variational lower‑bound used in VAEs and the adversarial game in GANs. Sampling is also efficient: a random vector drawn from the base distribution is passed through the learned flow to obtain a realistic pose sample.
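The exact-likelihood property described above follows from the change-of-variables formula. As a minimal sketch (a single one-dimensional affine flow, not the paper's multi-layer Glow architecture; the parameters `mu` and `sigma` are hypothetical "learned" values), both density evaluation and sampling reduce to cheap, exact computations:

```python
import math

# A single affine flow step x = sigma * z + mu maps a standard Gaussian
# base variable z to the data. The change-of-variables formula gives an
# exact log-likelihood:
#   log p(x) = log N(f_inv(x); 0, 1) + log |d f_inv / dx|
mu, sigma = 2.0, 0.5  # hypothetical learned flow parameters

def log_prob(x):
    z = (x - mu) / sigma                               # invert the flow
    log_base = -0.5 * (z * z + math.log(2 * math.pi))  # standard-normal density
    log_det = -math.log(sigma)                         # log-Jacobian of the inverse
    return log_base + log_det

def sample(z):
    return sigma * z + mu  # forward pass: base sample -> data sample

x = sample(0.0)  # z = 0 maps to the mode of p(x), i.e. x = mu
print(x, log_prob(x))
```

Training maximises `log_prob` directly over the data, with no variational bound and no adversarial objective; a Glow model stacks many such invertible steps so the composite map can represent far richer distributions.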

The model is fully autoregressive. At each time step t, an LSTM processes the sequence of previously generated poses together with the current control signal (e.g., desired footstep location, speed). The LSTM’s hidden state conditions the Glow flow, which then produces the pose for time t+1. This design yields two important properties. First, causality: the generation of a pose never accesses future poses or future control inputs, eliminating algorithmic latency and making the system suitable for real‑time interactive applications such as video games or robot control. Second, long‑range memory: the recurrent hidden state enables the network to capture gait cycles, phase information, and other temporal dependencies that span many frames, which is crucial for coherent locomotion.
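The causal sampling loop can be sketched as follows. This is a toy illustration with hypothetical stand-ins, not the paper's code: `recurrent_step` plays the role of the LSTM update, and `conditional_flow` the role of the invertible Glow transform whose parameters are computed from the hidden state. The key structural point is that each pose is produced from the recurrent state and a fresh base-distribution sample, never from future poses or future controls:

```python
import math
import random

random.seed(0)

def recurrent_step(hidden, pose, control):
    # Stand-in for an LSTM update over [previous pose; current control].
    return math.tanh(0.5 * hidden + 0.3 * pose + 0.2 * control)

def conditional_flow(z, hidden):
    # Stand-in for the invertible flow: an affine map whose shift and
    # scale are functions of the recurrent hidden state.
    mu, sigma = hidden, 0.1 + abs(hidden)
    return sigma * z + mu

hidden, pose = 0.0, 0.0
poses = []
for control in [0.2, 0.4, 0.6]:          # control signal arrives as a stream
    hidden = recurrent_step(hidden, pose, control)
    z = random.gauss(0.0, 1.0)           # sample from the base Gaussian
    pose = conditional_flow(z, hidden)   # next pose; no future inputs consulted
    poses.append(pose)
print(poses)
```

Because nothing in the loop peeks ahead, the generator can run online, frame by frame, which is exactly the zero-algorithmic-latency property the paper emphasises.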

To improve adherence to control signals, the authors introduce a data‑dropout technique during training: a random subset of input pose dimensions is masked, forcing the network to rely more heavily on the control input for reconstruction. This regularisation encourages the model to respect the control constraints while still preserving stochastic diversity.
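The masking step itself is simple. A hedged sketch of the idea (the function name, masking granularity, and dropout rate here are illustrative assumptions, not taken from the paper): during training, conditioning values from the pose history are randomly zeroed so that reconstructing the next pose from them alone becomes unreliable, pushing the model to use the control input:

```python
import random

random.seed(1)

def data_dropout(pose_history, drop_prob=0.3):
    # Independently zero out each conditioning value with probability
    # drop_prob. Applied during training only; at synthesis time the
    # full pose history is used.
    return [0.0 if random.random() < drop_prob else v for v in pose_history]

history = [0.9, -0.4, 1.2, 0.3]  # illustrative pose-history values
masked = data_dropout(history)
print(masked)
```

This resembles standard input dropout, but targeted specifically at the autoregressive conditioning path rather than at hidden units.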

The authors evaluate MoGlow on two motion‑capture datasets: human walking and quadruped (dog) locomotion. Both datasets provide rich, high‑dimensional pose data (over 50 joint angles) and ground‑truth foot‑contact information. Quantitative metrics include mean joint error, foot‑placement error, and a diversity score measuring the spread of sampled motions. Subjective user studies assess perceived naturalness and variety. Across all metrics, MoGlow outperforms strong baselines: a conditional VAE, a conditional GAN, a Gaussian mixture autoregressive model, and a WaveGlow‑based open‑loop model. Notably, random samples from MoGlow are judged nearly indistinguishable from real motion capture, demonstrating that the model successfully learns the full conditional distribution rather than collapsing to a mean.

The paper’s contributions are threefold: (1) the first application of normalising flows to motion‑sequence generation, providing exact likelihood training and high‑fidelity sampling; (2) an autoregressive, recurrent architecture that preserves causality and long‑term temporal coherence; (3) a simple yet effective data‑dropout regularisation that strengthens control‑signal conditioning. The authors discuss future extensions, including handling more complex actions (jumps, turns), integrating multimodal controls (speech, text, environmental cues), and deploying the model on embedded hardware for real‑time robot locomotion.

In summary, MoGlow demonstrates that normalising‑flow‑based generative models can overcome the limitations of previous probabilistic motion synthesis approaches, delivering both high‑quality stochastic motion and real‑time controllability, thereby opening new possibilities for interactive animation, gaming, and robotics.

