FMPose3D: monocular 3D pose estimation via flow matching


Monocular 3D pose estimation is fundamentally ill-posed due to depth ambiguity and occlusions, thereby motivating probabilistic methods that generate multiple plausible 3D pose hypotheses. In particular, diffusion-based models have recently demonstrated strong performance, but their iterative denoising process typically requires many timesteps for each prediction, making inference computationally expensive. In contrast, we leverage Flow Matching (FM) to learn a velocity field defined by an Ordinary Differential Equation (ODE), enabling efficient generation of 3D pose samples with only a few integration steps. We propose a novel generative pose estimation framework, FMPose3D, that formulates 3D pose estimation as a conditional distribution transport problem. It continuously transports samples from a standard Gaussian prior to the distribution of plausible 3D poses conditioned only on 2D inputs. Although ODE trajectories are deterministic, FMPose3D naturally generates various pose hypotheses by sampling different noise seeds. To obtain a single accurate prediction from those hypotheses, we further introduce a Reprojection-based Posterior Expectation Aggregation (RPEA) module, which approximates the Bayesian posterior expectation over 3D hypotheses. FMPose3D surpasses existing methods on the widely used human pose estimation benchmarks Human3.6M and MPI-INF-3DHP, and further achieves state-of-the-art performance on the 3D animal pose datasets Animal3D and CtrlAni3D, demonstrating strong performance across both 3D pose domains. The code is available at https://github.com/AdaptiveMotorControlLab/FMPose3D.
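The abstract does not spell out how RPEA weights its hypotheses, so the following is only a rough sketch of one plausible reading: score each 3D hypothesis by how well it reprojects onto the observed 2D pose, turn the scores into softmax weights, and return the weighted mean as an approximate posterior expectation. The pinhole camera with unit focal length, the `temperature` parameter, and the function names are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def project(pose_3d):
    """Pinhole projection with unit focal length (an assumed camera model)."""
    return pose_3d[..., :2] / pose_3d[..., 2:3]

def aggregate_by_reprojection(hypotheses, pose_2d, temperature=0.01):
    """Softmax-weighted mean of 3D hypotheses (hypothetical stand-in for RPEA).

    hypotheses: array of shape (K, J, 3), K candidate poses with J joints.
    pose_2d:    array of shape (J, 2), the observed 2D pose.
    """
    # Mean per-joint reprojection error of each hypothesis, shape (K,).
    errors = np.linalg.norm(project(hypotheses) - pose_2d, axis=-1).mean(axis=-1)
    # Lower error -> higher weight; subtract the max logit for numerical stability.
    logits = -errors / temperature
    logits = logits - logits.max()
    weights = np.exp(logits)
    weights = weights / weights.sum()
    # Weighted average over the hypothesis axis, shape (J, 3).
    aggregated = np.einsum("k,kjd->jd", weights, hypotheses)
    return aggregated, weights
```

Because hypotheses that lie along the same camera rays reproject identically, a score of this kind can only down-weight candidates that disagree with the 2D evidence; it cannot resolve depth ambiguity on its own.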

💡 Research Summary

FMPose3D tackles monocular 3D pose estimation by reframing it as a conditional distribution transport problem rather than a deterministic regression task. The authors first observe that existing 2D‑to‑3D lifting pipelines either produce a single average pose (deterministic models) or rely on diffusion‑based generative models that generate diverse hypotheses but require dozens of iterative denoising steps, making them too slow for real‑time use. To overcome this bottleneck, the paper adopts Flow Matching (FM), a recent generative modeling paradigm that learns a deterministic velocity field governing an Ordinary Differential Equation (ODE).
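To make the sampling idea concrete, here is a minimal sketch of drawing one pose hypothesis by integrating a velocity field with a few forward Euler steps, starting from Gaussian noise. The velocity function is a toy stand-in, not the paper's network: it uses the exact velocity of a straight-line path toward a fixed `target`, whereas a learned model would instead be conditioned on the 2D pose. The 17-joint shape, the step count, and all names are assumptions chosen for illustration.

```python
import numpy as np

def make_toy_velocity(target):
    """Build a hypothetical velocity field that points at a fixed target pose.

    For the straight-line path x_t = (1 - t) * x0 + t * x1, the velocity at
    (x_t, t) is (x1 - x_t) / (1 - t). A trained network would replace this.
    """
    def velocity(x, t):
        return (target - x) / max(1.0 - t, 1e-6)
    return velocity

def sample_euler(velocity_fn, shape, num_steps=3, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with forward Euler."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)  # x0 ~ N(0, I), one draw per seed
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy usage: 17 joints in 3D (an assumed skeleton size).
target = np.full((17, 3), 0.5)
pose = sample_euler(make_toy_velocity(target), target.shape, num_steps=3, seed=1)
```

In this toy setting every seed ends at the same `target`, because the field transports all noise to a single point. With a learned field that models a distribution of plausible poses, different seeds would instead yield different hypotheses, even though each individual trajectory is deterministic.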

During training, a paired 2D pose c and its ground‑truth 3D pose x₁ are sampled. A Gaussian noise vector x₀ ∼ N(0, I) and a random time t ∈
