S$^3$POT: Contrast-Driven Face Occlusion Segmentation via Self-Supervised Prompt Learning
Existing face parsing methods often misclassify occlusions as facial components. This is because occlusion is a high-level concept rather than a concrete object category: constructing a real-world face dataset covering every category of occluding object is practically impossible, and accurate mask annotation is labor-intensive. To address these problems, we present S$^3$POT, a contrast-driven framework that synergizes face generation with self-supervised spatial prompting to achieve occlusion segmentation. The framework is inspired by two insights: 1) modern face generators can realistically reconstruct occluded regions, creating an image that preserves facial geometry while eliminating occlusion, and 2) foundation segmentation models (e.g., SAM) can extract precise masks when provided with appropriate prompts. In particular, S$^3$POT consists of three modules: Reference Generation (RG), Feature Enhancement (FE), and Prompt Selection (PS). First, RG produces a reference image using structural guidance from a parsed mask. Second, FE contrasts tokens between the raw and reference images to obtain an initial prompt, then modifies image features with this prompt via cross-attention. Third, based on the enhanced features, PS constructs a set of positive and negative prompts and screens them with a self-attention network for a mask decoder. The network is trained under the guidance of three novel and complementary objective functions, without any ground-truth occlusion masks. Extensive experiments on a dedicatedly collected dataset demonstrate S$^3$POT's superior performance and the effectiveness of each module.
💡 Research Summary
The paper tackles the long‑standing problem of occlusion segmentation in face parsing, where occlusions (e.g., masks, hands, glasses) are often mis‑identified as facial components. Because occlusion is a high‑level spatial relationship rather than a concrete object class, creating a fully annotated real‑world dataset that covers all possible occluding objects is practically impossible, and manual mask annotation is prohibitively expensive.
S³POT (Segmentation via Self‑Supervised Prompting Occlusion with Contrast) proposes a novel contrast‑driven framework that leverages two recent advances: (1) high‑fidelity face generators capable of reconstructing occluded regions, and (2) foundation segmentation models such as the Segment Anything Model (SAM) that can produce precise masks when supplied with appropriate prompts. The system consists of three tightly coupled modules:
- Reference Generation (RG) – Using a dual‑conditioning approach based on Regional GAN Inversion (RGI), the original face image I and its parsed mask Mᵣ are fed into a pretrained face generator G. The generator outputs an occlusion‑free reference face Iʳ that preserves the original geometry while realistically filling in the hidden regions.
- Feature Enhancement (FE) – Both I and Iʳ are encoded by SAM's image encoder, yielding token sequences Z and Zʳ. Direct token subtraction is insufficient to separate occlusion from face, so the cosine similarity between Z and Zʳ is computed and the token with the highest similarity is selected as an initial prompt pᵢ. pᵢ is passed through a prompt encoder, then used in two cross‑attention operations: prompt‑to‑image (pᵢZ_CA) and image‑to‑prompt (Zpᵢ_CA). These bidirectional attentions enrich the original and reference features, producing enhanced tokens Zᵉ and Zʳᵉ that exhibit a sharper contrast between facial and occluded regions.
- Prompt Selection (PS) – From the enhanced tokens, a facial‑region mask M_f is derived by selecting parsed components (eyes, nose, mouth, skin, etc.). Tokens belonging to M_f are extracted from Zʳᵉ (Z_fʳᵉ) and Zᵉ (Z_fᵉ). A greedy matching algorithm aligns each token in Z_fʳᵉ with the most similar token in Zᵉ. Matched pairs become non‑occlusion (negative) prompts P_N, while unmatched tokens in Z_fᵉ are treated as occlusion (positive) prompts P_O. Both sets are concatenated, encoded, and fed into a self‑attention screening layer that learns to re‑weight each prompt, discarding noisy or redundant entries. The re‑weighted prompt set P′ is finally combined with Z and passed to SAM's mask decoder to predict the occlusion mask M_O.
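The FE module's initial-prompt step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the encoder tokens arrive as (N, D) arrays and that "highest similarity" means the per-token cosine similarity between the raw and reference sequences; the function name `initial_prompt` is hypothetical.

```python
import numpy as np

def initial_prompt(Z, Zr):
    """Pick the token where raw and reference features agree most.

    Z, Zr: (N, D) token sequences from the raw image I and the
    occlusion-free reference Ir. Returns the index of the selected
    initial prompt p_i and the per-token cosine similarity.
    """
    # L2-normalize each token so the dot product is cosine similarity.
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Zrn = Zr / np.linalg.norm(Zr, axis=1, keepdims=True)
    sim = np.sum(Zn * Zrn, axis=1)  # (N,) cosine similarity per token
    return int(np.argmax(sim)), sim
```

Tokens that survive the raw-to-reference contrast unchanged are the most reliably "facial", which is why a single high-agreement token can seed the cross-attention that follows.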
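The greedy matching inside PS can be sketched like this. It is a plausible reading of the summary, not the paper's code: the similarity `threshold` and the one-to-one matching constraint are assumptions, since the summary only says each reference facial token is aligned with its most similar raw token.

```python
import numpy as np

def greedy_match(Zf_re, Zf_e, threshold=0.8):
    """Greedily pair reference facial tokens with raw facial tokens.

    Zf_re: (M, D) facial tokens from the enhanced reference features.
    Zf_e:  (K, D) facial tokens from the enhanced raw features.
    Raw tokens claimed by a match become non-occlusion (negative)
    prompts P_N; unmatched raw tokens become occlusion (positive)
    prompts P_O. Returns (positive_indices, negative_indices).
    """
    A = Zf_re / np.linalg.norm(Zf_re, axis=1, keepdims=True)
    B = Zf_e / np.linalg.norm(Zf_e, axis=1, keepdims=True)
    sim = A @ B.T  # (M, K) cosine similarities
    matched = set()
    for i in range(sim.shape[0]):
        j = int(np.argmax(sim[i]))  # best raw token for this reference token
        if sim[i, j] >= threshold and j not in matched:
            matched.add(j)
    negatives = sorted(matched)  # non-occlusion prompts P_N
    positives = [k for k in range(B.shape[0]) if k not in matched]  # P_O
    return positives, negatives
```

The intuition: a raw facial token with no close counterpart in the occlusion-free reference is likely covered by an occluder, so it is promoted to a positive (occlusion) prompt.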
Because no ground‑truth occlusion masks are available during training, the authors devise three complementary self‑supervised loss terms:
- Occlusion Prompt Recall (L_rec^occ) – Maximizes the log‑probability of the mask at locations indicated by occlusion prompts, encouraging the model to assign high confidence to true occluded pixels.
- Face Prompt Recall (L_rec^face) – Minimizes the average probability at non‑occlusion prompts, preventing the mask from expanding into facial regions.
- Face Prompt Penalty (L_face^penalty) – Applies a sigmoid‑scaled penalty to any non‑occlusion prompt whose probability exceeds 0.5, strongly suppressing outliers that could otherwise cause false positives.
The total objective is L_total = L_rec^occ + L_rec^face + λ·L_face^penalty, where λ balances the penalty term.
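The three terms can be sketched numerically as below. This is an interpretation under stated assumptions: the summary names the terms but not their exact forms, so the sigmoid scaling of the penalty, the 0.5 cutoff, and the default λ are guesses, and `total_loss` is a hypothetical helper operating on predicted probabilities at the prompt locations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def total_loss(p, occ_idx, face_idx, lam=1.0, eps=1e-8):
    """L_total = L_rec^occ + L_rec^face + lam * L_face^penalty.

    p: (N,) predicted occlusion probabilities at N prompt locations.
    occ_idx:  indices of occlusion (positive) prompts.
    face_idx: indices of non-occlusion (negative) prompts.
    """
    # Occlusion prompt recall: maximize log-probability at occlusion prompts.
    l_occ = -np.mean(np.log(p[occ_idx] + eps))
    # Face prompt recall: minimize average probability at face prompts.
    l_face = np.mean(p[face_idx])
    # Face prompt penalty: sigmoid-scaled cost for face prompts above 0.5.
    over = p[face_idx][p[face_idx] > 0.5]
    l_pen = float(np.sum(sigmoid(over - 0.5))) if over.size else 0.0
    return l_occ + l_face + lam * l_pen
```

Note how the three terms pull in complementary directions: the recall term rewards confidence inside the occlusion, the face term caps average leakage, and the penalty term specifically hammers individual face prompts that drift past 0.5.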
For evaluation, the authors construct a real‑world occlusion dataset by prompting the large language model Qwen to filter occluded faces from CelebA‑Mask‑HQ and FFHQ. The selected images are manually annotated with precise occlusion masks using the X‑AnyLabeling tool, yielding 2,493 training samples (1,389 from CelebA, 1,104 from FFHQ) and a validation split of 200 images.
Extensive experiments demonstrate that S³POT outperforms prior synthetic‑occlusion methods (e.g., Voo) and recent SAM‑based personalization approaches (PerSAM, Matcher, VRP‑SAM) across IoU, F1, and pixel‑accuracy metrics, especially on diverse real‑world occlusions such as masks, hands, glasses, and hats. Ablation studies confirm the necessity of each component: removing the reference generation degrades contrast; omitting cross‑attention reduces prompt quality; skipping the greedy matching or self‑attention screening leads to noisy prompts and lower segmentation accuracy.
In summary, S³POT introduces a contrast‑driven, self‑supervised prompt generation paradigm that enables robust face occlusion segmentation without any occlusion ground‑truth masks. By synergistically combining face generation, feature contrast, and adaptive prompt selection, the method offers a practical solution for downstream applications—face swapping, AR makeup, security‑focused facial analysis—where accurate occlusion handling is critical.