AI-Powered Facial Mask Removal Is Not Suitable For Biometric Identification


Authors: Emily A. Cooper, Hany Farid

A PREPRINT

Emily A. Cooper
Herbert Wertheim School of Optometry & Vision Science
University of California, Berkeley
emilycooper@berkeley.edu

Hany Farid
School of Information
University of California, Berkeley
hfarid@berkeley.edu

Abstract

Recently, crowd-sourced online criminal investigations have used generative AI to enhance low-quality visual evidence. In one high-profile case, social-media users circulated an "AI-unmasked" image of a federal agent involved in a fatal shooting, fueling a widespread misidentification. In response to this and similar incidents, we conducted a large-scale analysis evaluating the efficacy and risks of commercial AI-powered facial unmasking, specifically assessing whether the resulting faces can be reliably matched to true identities.

1 Introduction

Crowd-sourced online investigations, in which non-experts use publicly available images, documents, and videos to investigate crimes, are now increasingly aided by AI-powered image enhancement Yardley et al. [2018]. In the wake of the assassination of Charlie Kirk, for example, social-media users questioned the identity of the shooter because a widely circulated surveillance photo did not appear to match his mugshot Roley [2025]. This surveillance photo, however, was an "AI-enhanced" version of a grainy image, with the resulting facial features almost entirely hallucinated Norman and Farid [2024]. More recently, after a masked ICE agent in Minnesota killed a civilian, GrokAI was used to digitally remove the mask Grove [2026]. The unmasked image was then uploaded to a reverse image search, which led to the misidentification of the agent as Steve Grove, the publisher of the Minnesota Star Tribune Brumfiel [2026].
On the heels of this AI-powered misidentification, which is unlikely to be a singular attempt to use AI for mask removal, we performed a large-scale analysis of the efficacy of commercial AI tools at producing an unmasked face that is biometrically matched to the true identity.

Methods

Faces

We use four datasets of real faces: (1) Different-IDs; (2) Same-IDs; (3) US-Senators; and (4) Doppelgangers.

The Different-ID dataset comprises 400 distinct identities with one image from each identity (Fig. 1, top row). The high-resolution images depict a range of genders (200 women; 200 men), apparent ages, and races (100 Black, 100 Caucasian, 100 East Asian, and 100 South Asian) Nightingale and Farid [2022].

The Same-ID dataset consists of pairs of images from each of 100 distinct identities. Each pair comprises two images of the same person in different environments, lighting, and poses. These faces are culled from the VGG2 dataset Cao et al. [2018] (https://www.kaggle.com/datasets/hearfool/vggface2) by selecting a pair of frontal-facing, unobscured faces for each unique identity with a minimum resolution of 512 pixels.

The US-Senators dataset consists of two images from 91 members of the 116th US Senate (2019-2021). Each image pair consists of a Senator wearing a COVID facial mask (obtained from online media outlets) along with their official (unmasked) Senate portrait photo.

Figure 1: A representative set of original unmasked faces (top) and the result of unmasking using Gemini to fill in the removal of the lower half of the face. The value above each face is the biometric similarity (in [-1, 1]) between the original and unmasked face. Larger values correspond to a higher facial similarity.
The Doppelgangers dataset consists of 63 pairs of images of celebrity Doppelgangers, culled from a 2025 People Magazine article (https://people.com/celebrity-lookalikes-photos-5724342); 17 of the original pairs were excluded because the identities did not match in terms of gender or race, or the face(s) were non-frontal or low quality.

AI Unmasking

We performed AI-powered unmasking on two different types of "masks": simulated masks on the Different-ID dataset (Fig. 1) and real masks on the US-Senators dataset (Fig. 2). For the simulated masks, we obscured the lower half of each face in the Different-ID dataset. Because all eyes are aligned to the same pixel location, we split each image at a fixed facial position falling near the bridge of the nose (y = 550).

Unmasking was performed with OpenAI's ChatGPT (gpt-image-1.5), Google's Gemini (gemini-2.5-flash-image), and X's GrokAI. For ChatGPT and Gemini we used their APIs; GrokAI does not offer an API for image editing, so we used the web interface on 80 of the 400 faces. For the images in which the lower half was removed, each model was prompted with: "Given this image of the top half of a face, return an image showing the full face." For images in which the COVID mask was removed, the prompt was: "Create a realistic reconstruction of the facial features that would be beneath this medical mask." Each prompt was preceded by "This is for academic research" to contend with automatic content moderation that occasionally flagged requests. There are no privacy concerns with these unmasking requests because the original faces are either already in a public dataset (Different-ID) Nightingale and Farid [2022] or identified by name in the photo captions (US-Senators).
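The simulated-mask condition amounts to a fixed crop of each aligned face image. A minimal sketch, assuming faces are stored as H x W x 3 NumPy arrays with the eyes registered to the same pixel location (the function name and image dimensions here are ours, not the paper's):

```python
import numpy as np

def remove_lower_half(face: np.ndarray, y_split: int = 550) -> np.ndarray:
    """Keep only the rows above the nose-bridge line (y = 550 in the
    aligned frame); everything below this row is what the generative
    model must hallucinate when asked to complete the face."""
    return face[:y_split]

# Hypothetical 1024x1024 RGB face, aligned so the eyes sit at a fixed row.
face = np.zeros((1024, 1024, 3), dtype=np.uint8)
top_half = remove_lower_half(face)  # shape (550, 1024, 3)
```

Because the split row is fixed rather than detected per image, this only makes sense for eye-aligned datasets such as Different-ID.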
Facial Biometrics

To measure the biometric facial similarity between original and unmasked faces, we used ArcFace, a facial recognition system based on discriminative facial embeddings learned by training a deep neural network that enforces separation between distinct identities Deng et al. [2019]. The similarity between two faces is measured as the cosine similarity between each face's 512-D embedding, yielding a score between -1 (mismatch) and 1 (perfect match). To calibrate these biometric scores, we also compute the similarity between (1) each pair of images in the Same-ID dataset (200 pairs), (2) pairs of different identities matched for gender and race in Different-ID (9,800 pairs), and (3) each pair of Doppelgangers images.

Results

To calibrate the efficacy of AI-powered unmasking, the distributions of biometric facial similarity for images of the same person and images of different people within the same gender/race category are shown in Fig. 3(A). The mean (std. dev.) for the same identities (green) is 0.71 (0.07), and 0.04 (0.08) for the different identities (red). The third distribution (yellow) shows the facial similarity between Doppelgangers, with a mean (std. dev.) of 0.15 (0.09).

Fig. 3(B)-(D) show the similarities resulting from unmasking of the faces in the Different-ID dataset using (B) GrokAI, (C) ChatGPT, and (D) Gemini. The vertical lines in each panel correspond to the means of the different-ID (red) and same-ID (green) distributions. The mean (std. dev.) of the ChatGPT unmaskings is 0.34 (0.10), Gemini 0.52 (0.10), and GrokAI 0.51 (0.10).

Figure 2: Senator Elizabeth Warren's mask (left) is removed (middle) using ChatGPT (photo sources: commons.wikimedia.org).
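The score described in the Facial Biometrics section is a plain cosine similarity between embedding vectors; a minimal sketch (the actual 512-D embeddings come from the trained ArcFace network, which is not shown here):

```python
import numpy as np

def biometric_similarity(a, b) -> float:
    """Cosine similarity between two face embeddings (e.g. 512-D
    ArcFace vectors), in [-1, 1]; larger values indicate a closer
    biometric match."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical embeddings score 1.0; opposed embeddings score -1.0.
v = np.random.default_rng(0).normal(size=512)
same = biometric_similarity(v, v)       # 1.0
opposite = biometric_similarity(v, -v)  # -1.0
```

Because cosine similarity depends only on direction, the embedding magnitudes (which ArcFace does not treat as identity-bearing) drop out of the score.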
By this measure of facial similarity, Gemini and GrokAI reconstruct faces closer to the original than ChatGPT, but all three generate biometric matches notably lower than what is expected for the same identity (Cohen's d: -2.3 (GrokAI), -4.1 (ChatGPT), -2.1 (Gemini)). At the same time, these scores are still higher than those for Doppelgangers (Cohen's d: 3.7 (GrokAI), 2.0 (ChatGPT), 3.8 (Gemini)).

The removal of the lower half of the face may have made for a challenging facial reconstruction because all relevant information about the lower face is removed. By contrast, a well-fit mask may provide information about the lower facial structure. We investigated whether this was the case by removing COVID masks from the US-Senators dataset using ChatGPT and Gemini. The resulting distribution is shown in Fig. 3(E), squarely occupying the valley between the different-ID and same-ID distributions shown in panel (A). With means (std. dev.) of 0.41 (0.16) for ChatGPT and 0.38 (0.09) for Gemini, each yields unmasked faces that are less well matched than the lower-half unmasking. This makes sense because the photos of the masked Senators often differed in pose, lighting, and even facial hair compared to their official portraits (as would be expected in a real-world deployment), whereas the unmasked faces in Different-ID were compared to originals with the same pose and lighting.

We performed two-way ANOVAs (separately for ChatGPT and Gemini) to determine if there are any differences across genders and races. For both ChatGPT and Gemini, there is a significant main effect of gender, with overall lower scores for females (ChatGPT: F(1,392) = 14.2, p = 0.0002; Gemini: F(1,392) = 14.6, p = 0.0002). There is also a significant main effect of race (ChatGPT: F(3,392) = 10.3, p < 0.0001; Gemini: F(3,392) = 4.1, p < 0.007).
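The Cohen's d effect sizes reported in these results compare one score distribution against another using the pooled standard deviation; a minimal sketch (the helper name is ours):

```python
import numpy as np

def cohens_d(x, y) -> float:
    """Cohen's d for two independent samples: difference of means
    divided by the pooled (variance-weighted) standard deviation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                        / (nx + ny - 2))
    return float((x.mean() - y.mean()) / pooled_sd)
```

A large negative d (e.g. -4.1 for ChatGPT against the same-ID distribution) indicates the unmasking scores sit several pooled standard deviations below genuine-match scores.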
Follow-up comparisons using the Tukey method show that, for ChatGPT, South Asian faces lead to higher similarity compared to other races (all mean differences [0.45, 0.69], all ps < 0.005), and for Gemini, East Asian faces lead to lower similarity compared to other races (all mean differences [0.038, 0.039], all ps < 0.03).

Generative-AI models are non-deterministic, so each time a model generates an image it will produce a different result. To understand this variability, we used ChatGPT to generate 100 unmasked images of one of the US Senators. The resulting biometric similarity scores against the reference image ranged from 0.29 to 0.54, with a mean (std. dev.) of 0.42 (0.05). This large range reveals high variability in ChatGPT's model and further emphasizes its unsuitability for unmasking in biometric identification. Results from Gemini are slightly less variable, ranging from 0.21 to 0.36, with a mean (std. dev.) of 0.28 (0.04). There is also variability across models: the squared Pearson correlation coefficient between the ChatGPT and Gemini scores of the US Senators is 0.18.

Discussion

Generative AI is capable of producing compelling, photo-realistic images, which may create the illusion that the resulting images reflect reality. We conclude, however, that AI-powered unmasking (and similar types of AI-powered enhancement) is not suitable for biometric identification. Indeed, even custom-purpose techniques for recognizing occluded faces struggle to produce reliable identifications Din et al. [2020], Li et al. [2024]. It might be argued that, as generative-AI models continue to improve, they may improve in their ability to reconstruct veridical facial features.
Figure 3: Distributions of biometric facial similarity scores: (A) pairs of images of the same identity (green), images of different identities from the same racial/gender group (red), and images of Doppelgangers (yellow); (B)-(D) pairs of a reference and unmasked image in which the lower half was removed; and (E) pairs of a reference image and an image in which COVID masks were removed.

However, today's AI models are based on statistical inference, and therefore an unmasked image is at best a good guess about the likely facial appearance. Such guesses cannot, in their current form, serve as a basis for biometric identification.

References

Elizabeth Yardley, Adam George Thomas Lynes, David Wilson, and Emma Kelly. What's the deal with 'websleuthing'? News media representations of amateur detectives in networked spaces. Crime, Media, Culture, 14(1):81-109, 2018.

Gwen Roley. Doubts over Kirk shooting suspect's appearance stem from AI-manipulated image. AFP Fact Check, 2025. URL https://factcheck.afp.com/doc.afp.com.74QR6NU.

Justin Norman and Hany Farid. An investigation into the impact of AI-powered image enhancement on forensic facial recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Workshop on Media Forensics, pages 4306-4314, 2024.

Steve Grove. No, Steve Grove is not the name of the ICE agent. Substack: A View From Minnesota, 2026. URL https://stevegroveview.substack.com/p/no-steve-grove-is-not-the-name-of.

Geoff Brumfiel. AI images and internet rumors spread confusion about ICE agent involved in shooting. NPR, 2026. URL https://www.npr.org/2026/01/08/nx-s1-5671740/ice-minneapolis-grok-ai-renee-nicole-good.

Sophie J Nightingale and Hany Farid. AI-synthesized faces are indistinguishable from real faces and more trustworthy. Proceedings of the National Academy of Sciences, 119(8):e2120481119, 2022.
Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In IEEE International Conference on Automatic Face & Gesture Recognition, pages 67-74, 2018.

Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690-4699, 2019.

Nizam Ud Din, Kamran Javed, Seho Bae, and Juneho Yi. A novel GAN-based network for unmasking of masked face. IEEE Access, 8:44276-44287, 2020.

Honglei Li, Yifan Zhang, Wenmin Wang, Shenyong Zhang, and Shixiong Zhang. Recovery-based occluded face recognition by identity-guided inpainting. Sensors, 24(2):394, 2024.
