PEAR: Pixel-aligned Expressive humAn mesh Recovery
Reconstructing detailed 3D human meshes from a single in-the-wild image remains a fundamental challenge in computer vision. Existing SMPLX-based methods often suffer from slow inference, produce only coarse body poses, and exhibit misalignments or unnatural artifacts in fine-grained regions such as the face and hands. These issues make current approaches difficult to apply to downstream tasks. To address these challenges, we propose PEAR, a fast and robust framework for pixel-aligned expressive human mesh recovery. PEAR explicitly tackles three major limitations of existing methods: slow inference, inaccurate localization of fine-grained human pose details, and insufficient facial expression capture. Specifically, to enable real-time SMPLX parameter inference, we depart from prior designs that rely on high-resolution inputs or multi-branch architectures. Instead, we adopt a clean and unified ViT-based model capable of recovering coarse 3D human geometry. To compensate for the loss of fine-grained details caused by this simplified architecture, we introduce pixel-level supervision to optimize the geometry, significantly improving the reconstruction accuracy of fine-grained human details. To make this approach practical, we further propose a modular data annotation strategy that enriches the training data and enhances the robustness of the model. Overall, PEAR is a preprocessing-free framework that can simultaneously infer EHM (SMPLX and scaled-FLAME) parameters at over 100 FPS. Extensive experiments on multiple benchmark datasets demonstrate that our method achieves substantial improvements in pose estimation accuracy compared to previous SMPLX-based approaches. Project page: https://wujh2001.github.io/PEAR
💡 Research Summary
The paper introduces PEAR (Pixel‑aligned Expressive humAn mesh Recovery), a novel framework for reconstructing detailed 3D human meshes from a single in‑the‑wild image. Existing SMPLX‑based approaches suffer from three major drawbacks: slow inference due to high‑resolution inputs or multi‑branch networks, coarse body pose estimation that neglects fine‑grained regions such as the face and hands, and poor robustness to diverse cropping conditions. PEAR tackles all three issues with a clean, unified design.
First, the authors replace the cumbersome multi‑branch pipelines with a single Vision Transformer‑B (ViT‑B) backbone. This transformer jointly encodes global image features and directly regresses the full set of parameters required for an expressive human model: SMPLX body pose (θ_b) and shape (β_b), hand pose (θ_h) and shape (β_h), and FLAME facial pose (θ_f), shape (β_f), and expression (ϕ_f), together with an explicit head‑scale parameter s. By using FLAME for the head, PEAR gains a richer expression space while keeping the overall model lightweight. The unified ViT backbone enables inference speeds exceeding 100 FPS, eliminating the need for high‑resolution inputs.
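The single-backbone design amounts to regressing one flat parameter vector and splitting it into the named groups above. The sketch below illustrates this layout with NumPy; the per-group dimensions are assumptions following common SMPLX/FLAME conventions, since the summary does not specify them.

```python
import numpy as np

# Hypothetical parameter dimensions (not from the paper) -- chosen to
# follow common SMPLX/FLAME conventions for illustration only.
PARAM_DIMS = {
    "body_pose":  63,   # theta_b: 21 body joints x 3 (axis-angle)
    "body_shape": 10,   # beta_b
    "hand_pose":  90,   # theta_h: 2 hands x 15 joints x 3
    "face_pose":   9,   # theta_f: jaw + eyes (axis-angle)
    "face_shape": 100,  # beta_f: FLAME shape coefficients
    "expression":  50,  # phi_f: FLAME expression coefficients
    "head_scale":  1,   # s: explicit head-scale parameter
}

def split_params(pred: np.ndarray) -> dict:
    """Split a flat regression output into named parameter groups."""
    out, offset = {}, 0
    for name, dim in PARAM_DIMS.items():
        out[name] = pred[offset:offset + dim]
        offset += dim
    assert offset == pred.shape[0], "vector length must match the layout"
    return out

total = sum(PARAM_DIMS.values())       # 323 under these assumed dims
params = split_params(np.zeros(total))
```

In practice the flat vector would come from a linear head on the ViT-B class token or pooled patch features; the splitting logic is independent of the backbone.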
Second, to compensate for the loss of fine‑detail capacity inherent in a simple ViT, the authors introduce a pixel‑level supervision stage. They integrate a differentiable 3D Gaussian Splatting (3DGS) neural renderer that projects the predicted mesh back onto the original image. A photometric loss composed of L1 and LPIPS terms forces the rendered image to match the input at the pixel level. This dense supervision corrects local misalignments that are invisible to sparse keypoint losses. Training proceeds in two stages: a coarse stage that optimizes body, hand, and facial parameters using standard parameter and keypoint losses, followed by a fine stage that adds the photometric loss to refine facial and hand geometry. The two‑stage scheme ensures that the mesh is already roughly aligned before the renderer is applied, preventing unstable coupling between geometry and appearance.
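The two-stage schedule described above can be sketched as a simple loss switch: parameter and keypoint terms are always active, while the dense photometric term is enabled only in the fine stage, once the mesh is roughly aligned. This is a minimal sketch with an L1 photometric term only; the LPIPS term would require a pretrained perceptual network, and the weight `w_photo` is illustrative, not from the paper.

```python
import numpy as np

def photometric_l1(rendered: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error between the splatted render and the input image."""
    return float(np.abs(rendered - target).mean())

def total_loss(param_loss: float, keypoint_loss: float,
               rendered: np.ndarray, target: np.ndarray,
               stage: str, w_photo: float = 1.0) -> float:
    """Two-stage schedule: the pixel-level term joins only in the fine
    stage, after the coarse stage has roughly aligned the geometry."""
    loss = param_loss + keypoint_loss
    if stage == "fine":
        loss += w_photo * photometric_l1(rendered, target)
    return loss

coarse = total_loss(1.0, 0.5, np.ones((4, 4, 3)), np.zeros((4, 4, 3)), "coarse")
fine = total_loss(1.0, 0.5, np.ones((4, 4, 3)), np.zeros((4, 4, 3)), "fine")
```

Gating the photometric term this way mirrors the paper's rationale: applying dense appearance supervision before coarse alignment would couple geometry and appearance unstably.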
Third, the paper proposes a part‑level pseudo‑labeling strategy for data generation. Instead of relying on a single SMPLX fitting pipeline, the authors independently annotate body, face, and hand components, producing more accurate ground‑truth for each part. This modular annotation enables the model to learn robustly from partial inputs (e.g., head‑only crops, upper‑body shots) and to generalize across a wide range of cropping scenarios without any preprocessing. The inclusion of the head‑scale parameter further allows the model to handle subjects with atypical head‑to‑body ratios, such as children or stylized characters.
Extensive experiments on benchmark datasets—including Human3.6M, 3DPW, AGORA, and MPI‑INF‑3DHP—demonstrate that PEAR consistently outperforms prior SMPLX‑based methods. It lowers MPJPE and PA‑MPJPE by 2–5% and substantially reduces facial and hand keypoint errors. Importantly, these accuracy gains come with a dramatic speed improvement: PEAR runs at over 100 FPS on a single GPU, making it suitable for real‑time downstream tasks such as AR/VR avatar generation, robotic perception, and interactive telepresence.
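For readers unfamiliar with the two headline metrics: MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joints, and PA-MPJPE measures the same error after rigidly aligning the prediction to the ground truth with a similarity (Procrustes) transform, isolating pose accuracy from global rotation, scale, and translation. A minimal NumPy implementation of both standard metrics:

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint position error: mean Euclidean joint distance."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Procrustes-aligned MPJPE: fit the optimal similarity transform
    (rotation, scale, translation) from pred to gt, then measure error."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)          # Kabsch via SVD
    if np.linalg.det(Vt.T @ U.T) < 0:          # avoid reflections
        Vt[-1] *= -1
        S[-1] *= -1
    R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)

# Sanity check: a rotated, scaled, translated copy of the ground truth
# has nonzero MPJPE but near-zero PA-MPJPE.
rng = np.random.default_rng(0)
gt = rng.normal(size=(17, 3))
c, s = np.cos(0.5), np.sin(0.5)
R0 = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
pred = 2.0 * gt @ R0.T + np.array([0.1, -0.2, 0.3])
```

Both metrics are conventionally reported in millimeters on Human3.6M and 3DPW.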
In summary, PEAR delivers a fast, accurate, and robust solution for expressive human mesh recovery. By unifying a ViT backbone, pixel‑aligned differentiable rendering, and modular data annotation, it bridges the gap between high‑fidelity reconstruction and real‑time deployment. Future work could extend the framework to video sequences, incorporate clothing and accessories, and explore tighter integration with implicit neural representations for even richer appearance modeling.