Mask-Guided Multi-Task Network for Face Attribute Recognition


Face Attribute Recognition (FAR) plays a crucial role in applications such as person re-identification, face retrieval, and face editing. Conventional multi-task attribute recognition methods often process the entire feature map for feature extraction and attribute classification, which can produce redundant features due to reliance on global regions. To address these challenges, we propose a novel approach emphasizing the selection of specific feature regions for efficient feature learning. We introduce the Mask-Guided Multi-Task Network (MGMTN), which integrates Adaptive Mask Learning (AML) and Group-Global Feature Fusion (G2FF) to address the aforementioned limitations. Leveraging a pre-trained keypoint annotation model and a fully convolutional network, AML accurately localizes critical facial parts (e.g., eye and mouth groups) and generates group masks that delineate meaningful feature regions, thereby mitigating negative transfer from global region usage. Furthermore, G2FF combines group and global features to enhance FAR learning, enabling more precise attribute identification. Extensive experiments on two challenging facial attribute recognition datasets demonstrate the effectiveness of MGMTN in improving FAR performance.


💡 Research Summary

This paper addresses the redundancy and negative transfer problems inherent in conventional multi‑task face attribute recognition (FAR) systems that rely on global feature maps for all attributes. The authors propose the Mask‑Guided Multi‑Task Network (MGMTN), which consists of Adaptive Mask Learning (AML) and Group‑Global Feature Fusion (G2FF). AML first employs a pre‑trained facial key‑point detector (FaRL) to obtain 98 precise landmarks. Using these landmarks, eight rectangular region masks are defined (mouth, ear, lower face, cheeks, nose, eyes, hair, and an overall object mask). A UNet is then trained as a multi‑task pixel‑wise binary classifier to predict all masks simultaneously, and is kept frozen during attribute learning. The resulting masks are applied element‑wise to the backbone feature map (ResNeSt‑50), isolating region‑specific features for each attribute group and thereby suppressing irrelevant information that causes negative transfer.
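The mask-application step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name and the choice of in-region average pooling are assumptions, and the paper's masks come from the UNet rather than being passed in directly.

```python
import numpy as np

def mask_group_features(feature_map, group_mask):
    """Apply one binary group mask element-wise to a backbone feature map.

    feature_map: (C, H, W) array, e.g. the output of a ResNeSt-50 stage.
    group_mask:  (H, W) binary mask for one facial group (e.g. eyes).
    Returns a (C,) group feature vector pooled over in-mask locations only,
    so features outside the region cannot contribute (suppressing the
    irrelevant context that causes negative transfer).
    """
    masked = feature_map * group_mask[None, :, :]  # broadcast mask over channels
    area = group_mask.sum()
    if area == 0:  # empty mask: no region detected for this group
        return np.zeros(feature_map.shape[0])
    # Average only over the unmasked (in-region) spatial positions.
    return masked.sum(axis=(1, 2)) / area
```

In practice the same masking would be applied once per group mask, yielding one region-specific vector per attribute group.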

G2FF then concatenates the group‑specific features with the global features extracted from the entire face, leveraging the complementary strengths of fine‑grained local cues and holistic context. The fused representation is passed through a two‑layer fully‑connected classifier per group (dimensionality reduced from 3584 to 512) to predict the binary attributes.
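The fusion head described above can be sketched as follows. Only the fused dimension (3584) and hidden width (512) come from the summary; the split between group and global dimensions, the ReLU activation, and all names are illustrative assumptions.

```python
import numpy as np

def g2ff_head(group_feat, global_feat, W1, b1, W2, b2):
    """Sketch of a G2FF-style per-group classifier head.

    Concatenates a group-specific feature vector with the global feature
    vector (fused dimension 3584 in the paper), reduces it to 512 with a
    hidden fully-connected layer, and emits one logit per binary attribute
    in the group.
    """
    fused = np.concatenate([group_feat, global_feat])  # group + global cues
    hidden = np.maximum(0.0, W1 @ fused + b1)          # FC 3584 -> 512, ReLU
    return W2 @ hidden + b2                            # FC 512 -> n_attributes
```

Each attribute group would get its own pair of weight matrices, so region-specific and holistic evidence are weighed separately per group.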

Extensive experiments on CelebA and LFWA demonstrate that MGMTN outperforms state‑of‑the‑art methods such as DMTL, MGG‑Net, and APS, achieving 1.5–3 percentage‑point gains in mean accuracy. The improvement is especially pronounced for attributes strongly tied to specific facial regions (e.g., eyes, mouth, nose), confirming that the mask‑guided approach effectively mitigates feature redundancy and negative transfer.

The paper also discusses limitations: mask quality depends on key‑point detection accuracy, and freezing the UNet prevents end‑to‑end optimization of mask generation and feature extraction. Future work may explore joint training of key‑point, mask, and attribute branches, as well as more sophisticated fusion strategies. Overall, MGMTN offers a compelling solution that explicitly guides the network to “look where it matters,” advancing both efficiency and performance in multi‑task face attribute recognition.

