Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework
Video editing-based talking face generation aims to preserve video details such as pose, lighting, and gestures while modifying only lip motion, often using an identity reference image to maintain speaker consistency. However, this mechanism can introduce lip leakage, where generated lips are influenced by the reference image rather than solely by the driving audio. Such leakage is difficult to detect with standard metrics and conventional test setups. To address this, we propose a systematic evaluation methodology to analyze and quantify lip leakage. Our framework employs three complementary test setups: silent-input generation, mismatched audio-video pairing, and matched audio-video synthesis. We also introduce derived metrics, including lip-sync discrepancy and silent-audio-based lip-sync scores. In addition, we study how different identity reference selections affect leakage, providing insights into reference design. The proposed methodology is model-agnostic and establishes a more reliable benchmark for future research in talking face generation.
💡 Research Summary
The paper addresses a subtle but critical problem in audio‑driven talking‑face generation: the phenomenon of “lip leakage,” where the identity reference image supplied to preserve speaker consistency unintentionally influences the generated lip motion, causing the output to be driven more by the reference than by the input audio. Existing evaluation protocols—typically measuring lip‑sync accuracy (e.g., LSE‑C, LSE‑D) and visual quality (SSIM, PSNR, FID)—are insufficient to detect this bias because a model can achieve high scores while still leaking lip information from the reference.
To expose and quantify lip leakage, the authors propose a systematic, model‑agnostic evaluation framework consisting of three complementary generation setups and a set of derived metrics.
Generation Setups
- Silent‑Input (SI) – videos are generated using a silent audio track. The resulting lip movements are compared against the original (ground‑truth) audio. Since no speech signal is provided, any lip motion must originate from the reference image, making this setup a direct probe of leakage.
- Audio‑Matched (AM) – the conventional setup where each video is generated with its corresponding ground‑truth audio. This provides a baseline for normal lip‑sync performance and visual quality.
- Audio‑Mismatched (XM) – videos are generated with randomly paired, non‑matching audio. By contrasting lip‑sync scores between AM and XM, one can assess whether the model follows the audio or defaults to reference‑driven lip shapes.
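The three setups above can be sketched as a single test-pair generator. This is a minimal illustration, not the authors' released code: the video/audio lists and the setup tags are the only things taken from the paper, and the derangement-style shuffle used to guarantee that XM audio never matches its own video is an implementation choice of this sketch.

```python
import random

def make_setups(videos, audios, seed=0):
    """Yield (video, audio, setup_tag) triples for SI, AM, and XM runs."""
    rng = random.Random(seed)
    shuffled = audios[:]
    # Re-shuffle until no audio sits at its original position, so every
    # XM pair is genuinely mismatched (a derangement of the audio list).
    while True:
        rng.shuffle(shuffled)
        if all(a is not b for a, b in zip(shuffled, audios)):
            break
    for video, audio, mismatched in zip(videos, audios, shuffled):
        yield video, None, "SI"        # silent input: no speech signal at all
        yield video, audio, "AM"       # matched ground-truth audio (baseline)
        yield video, mismatched, "XM"  # randomly paired non-matching audio
```

Under SI, any lip motion in the output must come from the reference image, which is what makes that setup a direct probe of leakage.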
Reference Selection Strategies
- Current Reference (CR) – the reference image is the same frame that is being masked and edited. This maximizes visual consistency but also maximizes the potential for leakage.
- Alternative Reference (AR) – the reference is chosen according to the method’s own prescription (first frame, random frame, multiple frames, or a specially designed “canonical” face). When a method does not specify a strategy, the first frame is used as a default.
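The two reference regimes can be expressed as a small selection helper. This is a hypothetical sketch: the strategy names `"CR"`, `"AR-first"`, and `"AR-random"` mirror the options described above, but the function itself is not from the paper.

```python
import random

def pick_reference(frames, edited_idx, strategy="CR", rng=None):
    """Return the identity-reference frame for the frame being edited."""
    if strategy == "CR":          # current reference: the frame being edited
        return frames[edited_idx]
    if strategy == "AR-first":    # default alternative: the first frame
        return frames[0]
    if strategy == "AR-random":   # alternative: any frame except the edited one
        rng = rng or random.Random(0)
        candidates = [i for i in range(len(frames)) if i != edited_idx]
        return frames[rng.choice(candidates)]
    raise ValueError(f"unknown strategy: {strategy}")
```

CR gives the model the exact mouth it is supposed to regenerate, which is why it maximizes both visual consistency and leakage potential.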
Leakage Metrics
- Silent LSE‑C (LSE‑C S) – the confidence‑based lip‑sync error (LSE‑C) computed on SI generations against the original audio, using SyncNet features.
- Silent LSE‑D (LSE‑D S) – the distance‑based lip‑sync error (LSE‑D) under the same silent‑input condition.
- Lip‑Sync Discrepancy with Current Reference (LSD‑CR) – the average of the absolute AM–XM differences in the LSE‑C and LSE‑D scores when using CR. Formally:
LSD‑CR = 0.5 × (|C_CR^AM − C_CR^XM| + |D_CR^AM − D_CR^XM|).
Larger values indicate stronger leakage under the current‑reference scenario.
- Lip‑Sync Discrepancy with Alternative Reference (LSD‑AR) – analogous to LSD‑CR but computed with the AR strategy.
These four metrics together capture (i) how much lip motion is generated without any audio cue, and (ii) how much the model’s lip‑sync score deteriorates when the audio is deliberately mismatched, both under the two reference regimes.
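The discrepancy metrics reduce to one formula applied under two reference regimes. A minimal sketch, directly transcribing the LSD definition above; the LSE‑C/LSE‑D inputs are assumed to come from a SyncNet-based scorer, which is not shown here.

```python
def lsd(c_am, c_xm, d_am, d_xm):
    """Lip-Sync Discrepancy: mean absolute AM-XM gap of LSE-C and LSE-D.

    Fed with current-reference scores this yields LSD-CR; with
    alternative-reference scores it yields LSD-AR.
    """
    return 0.5 * (abs(c_am - c_xm) + abs(d_am - d_xm))
```

For example, AM/XM LSE‑C scores of 7.0 and 5.0 combined with LSE‑D scores of 6.0 and 7.5 give an LSD of 0.5 × (2.0 + 1.5) = 1.75.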
Experimental Protocol
The authors evaluate six recent talking‑face models on the LRS2 benchmark: Wav2Lip, TalkLip, IPLAP, AVTFG, PLGAN, and Diff2Lip. For each model they generate videos under all three setups (SI, AM, XM) and both reference strategies (CR, AR). Standard visual‑quality metrics (SSIM, PSNR, FID) and identity preservation (CSIM via ArcFace) are also reported to contextualize leakage findings.
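The protocol enumerates a full grid of configurations. A sketch of that grid, using the model, setup, and reference names from the paper; note the paper's SI setup is only meaningful with the ground-truth audio as comparison target, so in practice some cells share generations, which this simple enumeration does not model.

```python
from itertools import product

MODELS = ["Wav2Lip", "TalkLip", "IPLAP", "AVTFG", "PLGAN", "Diff2Lip"]
SETUPS = ["SI", "AM", "XM"]
REFS = ["CR", "AR"]

def evaluation_grid():
    """Return every (model, setup, reference) configuration to evaluate."""
    return list(product(MODELS, SETUPS, REFS))
```

This yields 6 × 3 × 2 = 36 configurations, each of which is then scored with the lip-sync, leakage, and visual-quality metrics.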
Key Findings
- Silent‑Input results reveal that TalkLip and AVTFG produce surprisingly high LSE‑C scores (5.21 and 6.31 respectively) even when no audio is supplied, confirming that the current‑frame reference heavily dictates lip shape. Switching to AR dramatically reduces these scores, indicating that leakage is mitigated when a different reference is used.
- Audio‑Mismatched (XM) results show that most models experience a drop in lip‑sync performance, yet TalkLip retains relatively high scores under CR, again evidencing reference‑driven lip generation. Diff2Lip, which employs multiple references, exhibits the smallest LSD‑CR (0.98) and LSD‑AR (0.15), suggesting robustness against leakage.
- Audio‑Matched (AM) baseline can be misleading: models that look excellent under CR (high SSIM, low LSE‑D) may suffer severe performance degradation under AR, exposing hidden leakage that standard AM evaluation would miss.
- Visual quality vs. leakage trade‑off: While AR often improves leakage metrics, it can sometimes lower SSIM/PSNR or CSIM, especially for TalkLip, highlighting the need for balanced reference design.
Contributions
- Introduction of a comprehensive test methodology (SI, AM, XM) that isolates the influence of the identity reference on lip motion.
- Definition of four leakage‑specific metrics (LSE‑C S, LSE‑D S, LSD‑CR, LSD‑AR) that complement existing evaluation tools.
- Empirical demonstration across six state‑of‑the‑art models, revealing that many current systems suffer from significant lip leakage, especially when the current frame is used as the reference.
- Insight that alternative reference strategies (first frame, random, multi‑frame) can mitigate leakage but may affect visual fidelity, suggesting a design space for future models.
Implications and Future Work
The proposed framework provides a reliable benchmark for detecting and quantifying lip leakage, a failure mode that standard lip‑sync and visual‑quality metrics overlook. By making leakage measurement routine, future research can develop architectures that explicitly disentangle reference‑derived identity features from audio‑driven lip dynamics—e.g., through adversarial regularization, feature‑level masking, or the use of canonical “neutral‑mouth” reference images. Moreover, correlating the proposed objective metrics with human perceptual studies would further validate their relevance. Ultimately, the work encourages the community to adopt more rigorous, multi‑facet evaluation protocols to ensure that talking‑face generators are both visually convincing and semantically faithful to the driving audio.