Learning 3D Texture-Aware Representations for Parsing Diverse Human Clothing and Body Parts

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Existing methods for human parsing into body parts and clothing often use fixed mask categories with broad labels that obscure fine-grained clothing types. Recent open-vocabulary segmentation approaches leverage pretrained text-to-image (T2I) diffusion model features for strong zero-shot transfer, but typically group entire humans into a single person category, failing to distinguish diverse clothing or detailed body parts. To address this, we propose Spectrum, a unified network for part-level pixel parsing (body parts and clothing) and instance-level grouping. While diffusion-based open-vocabulary models generalize well across tasks, their internal representations are not specialized for detailed human parsing. We observe that, unlike diffusion models with broad representations, image-driven 3D texture generators maintain faithful correspondence to input images, enabling stronger representations for parsing diverse clothing and body parts. Spectrum introduces a novel repurposing of an Image-to-Texture (I2Tx) diffusion model (obtained by fine-tuning a T2I model on 3D human texture maps) for improved alignment with body parts and clothing. From an input image, we extract human-part internal features via the I2Tx diffusion model and generate semantically valid masks aligned to diverse clothing categories through prompt-guided grounding. Once trained, Spectrum produces semantic segmentation maps for every visible body part and clothing category, ignoring standalone garments or irrelevant objects, for any number of humans in the scene. We conduct extensive cross-dataset experiments, separately assessing body parts, clothing parts, unseen clothing categories, and full-body masks, and demonstrate that Spectrum consistently outperforms baseline methods in prompt-based segmentation.


💡 Research Summary

The paper introduces Spectrum, a unified network that delivers fine‑grained human parsing by jointly handling pixel‑level part segmentation (both body parts and on‑body clothing) and instance‑level grouping. Existing human parsing approaches rely on a fixed set of coarse categories, which limits their ability to distinguish the vast variety of clothing styles. Recent open‑vocabulary segmentation methods exploit pretrained text‑to‑image diffusion models for zero‑shot transfer, yet they typically collapse the whole person into a single “person” mask, lacking the granularity needed for detailed clothing and body‑part segmentation.

Spectrum addresses this gap by repurposing an Image‑to‑Texture (I2Tx) diffusion model. The I2Tx model is obtained by fine‑tuning a Stable Diffusion backbone with LoRA adapters on 3D human texture maps from the ATLAS dataset. This specialization yields internal representations that preserve a tight correspondence between the input image and the generated texture, which is crucial for distinguishing subtle clothing details.
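The LoRA fine-tuning described above can be folded back into the base weights at inference time, so the I2Tx model runs as a single dense network. The sketch below illustrates this merge step in NumPy; the shapes, rank, and scaling constant are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical shapes for one projection layer in the diffusion U-Net.
d_out, d_in, r = 320, 320, 16   # r is the LoRA rank (illustrative)
alpha = 16.0                     # LoRA scaling constant (illustrative)

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)).astype(np.float32)   # frozen base weight
A = rng.standard_normal((r, d_in)).astype(np.float32)       # LoRA "down" matrix
B = np.zeros((d_out, r), dtype=np.float32)                  # LoRA "up" matrix (zero-init)

# Merging folds the low-rank update into the base weight, so inference
# needs only one forward pass through the merged network:
W_merged = W + (alpha / r) * (B @ A)

# With B zero-initialized, the merge is a no-op until training updates B.
assert np.allclose(W_merged, W)
```

Because `B` starts at zero, the adapted model is exactly the pretrained Stable Diffusion backbone at initialization, which is what makes LoRA fine-tuning stable.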

The pipeline works as follows: an input image is encoded by a CLIP‑VISION encoder and an IM2CONTEXT encoder to produce a context embedding C and a CLS token. The base Stable Diffusion weights are merged with the LoRA adapters, and a single forward pass at diffusion step t = 0 extracts three streams—encoder features f_E, denoiser features f_U, and decoder features f_D. Concatenating these streams forms a texture‑aligned feature f that feeds a pixel decoder (implemented as a Feature‑Pyramid Network). A transformer decoder then predicts N = 100 class‑agnostic binary masks. For each mask, a masked‑average‑pooled embedding z_i is computed.

Training leverages the CosmicMan‑HQ dataset, which provides 100 K images together with dense BLIP‑generated captions. From each caption, K = 9 key phrases (nouns and adjectives) are extracted and turned into natural‑language prompts (e.g., “a photo of a tan purse”). Prompt embeddings are obtained with OpenCLIP. A contrastive grounding loss L_G aligns prompt embeddings T(p_k) with their corresponding mask embeddings z_i, while binary‑cross‑entropy (L_BCE) and Dice (L_DICE) losses supervise mask shape and pixel‑wise accuracy. The total loss is a weighted sum: L_total = λ_BCE L_BCE + λ_DICE L_DICE + λ_G L_G (λ_BCE = 2.0, λ_DICE = 5.0, λ_G = 1.0).
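The three loss terms and their weighted sum can be sketched as follows. The grounding term is implemented here as an InfoNCE-style contrastive loss over normalized embeddings; the temperature, dimensions, and the assumption of a one-to-one prompt-to-mask matching are illustrative, not details confirmed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, D, P = 100, 9, 512, 64 * 64   # masks, key phrases, embed dim, pixels

pred = rng.random((N, P))                              # mask probabilities in (0, 1)
gt = (rng.random((N, P)) > 0.5).astype(np.float64)     # binary targets

eps = 1e-6
# Binary cross-entropy, averaged over masks and pixels (L_BCE).
l_bce = -np.mean(gt * np.log(pred + eps) + (1 - gt) * np.log(1 - pred + eps))

# Dice loss: 1 minus the mean Dice coefficient (L_DICE).
inter = (pred * gt).sum(axis=1)
l_dice = 1.0 - np.mean((2 * inter + eps) / (pred.sum(1) + gt.sum(1) + eps))

# Contrastive grounding (L_G): InfoNCE between prompt embeddings T(p_k)
# and their matched mask embeddings z_i (random, pre-matched pairs here).
z = rng.standard_normal((K, D))
t = rng.standard_normal((K, D))
z /= np.linalg.norm(z, axis=1, keepdims=True)
t /= np.linalg.norm(t, axis=1, keepdims=True)
logits = (t @ z.T) / 0.07                              # temperature 0.07 (assumed)
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
l_g = -np.mean(np.diag(log_probs))

# Weighted sum with the paper's coefficients.
l_total = 2.0 * l_bce + 5.0 * l_dice + 1.0 * l_g
```

The BCE and Dice terms supervise mask geometry, while the grounding term is what ties each class-agnostic mask to an open-vocabulary text phrase.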

The model contains 2.005 B parameters, of which only 29.55 M (≈1.5 %) are trainable. Training runs for 12.2 days (≈370 K iterations) on eight A100 GPUs with standard augmentations (horizontal/vertical flips, resizing to 1024 px). At inference time, the network accepts an image and a descriptive caption and outputs segmentation maps for every visible body part and clothing category, handling any number of people in the scene while ignoring standalone garments and irrelevant objects.

Evaluation is performed under four distinct setups: Full‑Person Parsing (FPP), Bare Human Parsing (BHP), COCO Category Parsing (CCP), and Clothing‑Only Parsing (COP). Across these tasks, Spectrum consistently outperforms strong baselines such as ODISE, MaskCLIP, SEEM, CDGNet, and SAPIENS. Notably, in BHP it achieves an average IoU of 68.4 % (vs. 61.2 % for the best prior method), and in COP it reaches mAP = 55.7 % (vs. 48.3 %). The model also demonstrates robust zero‑shot performance on unseen clothing categories (e.g., traditional sarees) and complex multi‑person poses.
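The IoU numbers reported above are the standard intersection-over-union between predicted and ground-truth masks, averaged over categories. A minimal sketch of the per-mask metric, with a toy example:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union else 0.0

# Toy masks: prediction and ground truth share 2 pixels,
# with 4 pixels in their union -> IoU = 2 / 4 = 0.5.
pred = np.array([[1, 1, 0],
                 [1, 0, 0]], dtype=bool)
gt   = np.array([[0, 1, 0],
                 [1, 1, 0]], dtype=bool)
print(iou(pred, gt))   # -> 0.5
```

The mean IoU for a setup such as BHP is then the average of this value over all evaluated part categories.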

Limitations include difficulty in precisely segmenting very thin accessories (necklaces, watches) due to reliance on texture‑level cues, and the current focus on single‑frame images without temporal consistency. Future work is suggested to extend the texture‑based features to video streams, incorporate self‑supervised learning to reduce annotation costs, and explore richer 3D reconstruction pipelines.

In summary, Spectrum shows that repurposing a 3D texture‑generation diffusion model yields representations that are far more suitable for detailed human parsing than generic T2I diffusion features. This approach opens new possibilities for applications such as fashion search, virtual try‑on, and fine‑grained human‑object interaction analysis.

