VerLM: Explaining Face Verification Using Natural Language

📝 Abstract

Face verification systems have seen substantial advancements; however, they often lack transparency in their decision-making processes. In this paper, we introduce an innovative Vision-Language Model (VLM) for Face Verification, which not only accurately determines if two face images depict the same individual but also explicitly explains the rationale behind its decisions. Our model is uniquely trained using two complementary explanation styles: (1) concise explanations that summarize the key factors influencing its decision, and (2) comprehensive explanations detailing the specific differences observed between the images. We adapt and enhance a state-of-the-art modeling approach originally designed for audio-based differentiation to suit visual inputs effectively. This cross-modal transfer significantly improves our model’s accuracy and interpretability. The proposed VLM integrates sophisticated feature extraction techniques with advanced reasoning capabilities, enabling clear articulation of its verification process. Our approach demonstrates superior performance, surpassing baseline methods and existing models. These findings highlight the immense potential of vision language models in face verification set up, contributing to more transparent, reliable, and explainable face verification systems.

💡 Deep Analysis

📄 Full Content

VerLM: Explaining Face Verification Using Natural Language Syed Abdul Hannan1,, Hazim Bukhari1,, Thomas Cantalapiedra1, Eman Ansar1, Massa Baali1, Rita Singh1, and Bhiksha Raj1 1Carnegie Mellon University, Pittsburgh, USA *Equal Contribution Abstract Face verification systems have seen substantial advancements; however, they often lack transparency in their decision-making processes. In this paper, we introduce an innovative Vision-Language Model (VLM) for Face Verification, which not only accurately determines if two face images depict the same individ- ual but also explicitly explains the rationale behind its decisions. Our model is uniquely trained using two complementary explanation styles: (1) concise explanations that summarize the key factors influencing its decision, and (2) comprehensive explanations detailing the specific differences observed between the images. We adapt and enhance a state-of-the-art modeling approach orig- inally designed for audio-based differentiation to suit visual inputs effectively. This cross-modal transfer significantly improves our model’s accuracy and in- terpretability. The proposed VLM integrates sophisticated feature extraction techniques with advanced reasoning capabilities, enabling clear articulation of its verification process. Our approach demonstrates superior performance, surpass- ing baseline methods and existing models. These findings highlight the immense potential of vision language models in face verification set up, contributing to more transparent, reliable, and explainable face verification systems. 1 Introduction Face verification systems have experienced significant advancements in recent years, with deep learning models achieving high accuracy in identifying whether two facial images belong 1 arXiv:2601.01798v1 [cs.CV] 5 Jan 2026 to the same individual [1, 2] allowing them to be used in Video surveillance and criminal identification. Gaining insight into why a decision was made by these systems allows for detecting biases and mitigating them easily and leads to more trust in decisions made by these systems, current approaches for explaining decisions depend on saliency heatmaps [3]. Large Language Models (LLM) have been shown to have an emergent ability in the form of reasoning over the text modality [4,5], later work in the Vision Language Model (VLM) and Audio Language Model (ALM) domains such as [6,7] and [8,9] has extended reasoning abilities to different modalities. Using VLM’s language ability to output reasoning for it’s decision is valuable for un- derstanding why the decision was made, unlike previous examples where a single input is provided our approach requires dual input to allow the model to compare and reason over them at the same time, for this we take inspiration from [10] where the authors explored using reasoning in explaining differences using dual audio input. We introduce a novel Vision-Language Model (VLM) specifically designed for face verifi- cation. Our model not only determines whether two face images depict the same individual, but also provides explicit explanations for its decisions. This is achieved through a unique training approach that utilizes two complementary explanation styles: concise explanations that highlight the key factors influencing the decision and comprehensive explanations that detail the specific differences observed between the images. Our approach leverages a state-of-the-art modeling framework originally developed for audio-based differentiation [10], which we adapt and enhance to effectively process visual in- put. By integrating image feature extraction techniques with Large Language model (LLM) reasoning capabilities, our VLM enables a clear articulation of its verification process, offer- ing a substantial step forward in the development of transparent and reliable face verification systems. In this paper, our main contributions are as follows: • Introduce the explainable face verification task which aims to provide a natural lan- guage explanation for the differences between two facial images. To train and evaluate models, we created Explainable Face Verification Dataset, where the difference ex- planation is generated by LLM using VLM annotated captions and later verified by human annotators. To mimic human explanations, the dataset contains two tiers of explanation: (1) concise explanations that summarize the key factors influencing its decision and (2) comprehensive explanations detailing the specific differences observed between the images. • We introduce the Verification Language Model (VerLM) model. • Under the proposed framework, we conduct several ablation studies to understand the 2 impact of cross-projection, language model scaling, and image-grounding. Our findings reveal that our approach outperform the baseline, in addition our approach benefits from scaled model size. 2 Explainable Face Verification The emergence of Vision-Language Models (VLMs) has fundamentally transformed multi- modal understanding, with Liu et al.’s groundbreaking work on visual instruction tuning demonstrating how large language models can be effectively adapted to understand and rea- son about visual content [7]. Their LLaVA framework established the feasibility of integrat- ing sophisticated visual understanding with natural language generation through carefully designed instruction-following paradigms, paving the way for more specialized applications in computer vision tasks that require both perception and linguistic reasoning. Cross-modal explanation approaches have gained significant traction, particularly in the audio domain where pioneering work has established effective methodologies for ex- plaining perceptual differences using natural language. Deshmukh et al. introduced ADIFF, a novel Audio Language Model that explains differences between audio samples through natural language descriptions [10]. Their architecture employed a unique cross-projection mechanism combined with separator tokens to highlight distinguishing features between au- dio pairs, demonstrating that structured difference computation substantially outperforms naive feature concatenation approaches. Complementing this work, Elizalde et al. devel- oped CLAP (Contrastive Language-Audio Pre-training), which established robust methods for learning audio concepts from natural language supervision through contrastive learning in shared embedding spaces [11]. These audio-centric approaches provided crucial architec- tural insights and demonstrated the viability of conditioning language models on non-textual inputs to generate meaningful explanations. Image difference analysis has been advanced significantly by Hu et al.’s OneDiff model, which introduced specialized architectures for general image difference captioning [12]. Their work emphasized the critical importance of explicit difference modeling through dedicated “Visual Delta Modules,” showing substantial performance improvements over baseline ap- proaches that simply concatenate image features. OneDiff demonstrated that successful image difference explanation requires architectural components specifically designed to cap- ture and articulate distinguishing characteristics, rather than relying on implicit difference detection within standard vision-language frameworks. However, their approach focused primarily on general image differences rather than domain-specific applications, leaving sub- stantial room for specialized optimizations in areas requiring fine-grained analysis such as face verification. 3 Explainable face recognition has traditionally relied on post-hoc explanation meth- ods, primarily through saliency-based visualizations and heatmap generation. DeAndres- Tame et al. explored early integration of explainability in face recognition through inter- active natural language processing, highlighting the limitations of existing approaches that provide visual explanations without rich contextual understanding [13]. Current explain- able face recognition systems typically employ visualization-based methods that highlight important facial regions but fail to provide the detailed, human-interpretable explanations that natural language can offer, particularly for understanding specific facial similarities and differences that drive verification decisions. Research gaps and motivation emerge clearly from this landscape. While general VLMs have demonstrated impressive capabilities, their application to specialized tasks like face verification requires domain-specific architectural adaptations that have not been sys- tematically explored. Existing image difference models like OneDiff lack the fine-grained analysis capabilities required for face verification, where subtle facial features can be crucial for accurate identification. Furthermore, the successful cross-projection mechanisms demon- strated in audio domains have not been adapted and evaluated for visual face verification tasks, representing a significant opportunity for cross-modal knowledge transfer. Current approaches also typically provide single-level explanations, whereas human understanding often benefits from both high-level summaries and detailed analyses—a multi-tier approach that remains largely unexplored in face verification contexts. Our proposed VerLM addresses these identified gaps by adapting successful cross-projection mechanisms from audio-based difference explanation to the visual domain, while incorporat- ing domain-specific face encoders and implementing a novel dual-tier explanation system. This approach uniquely combines high-accuracy face verification with human-interpretable explanations, contributing to more transparent and trustworthy face recognition systems that can articulate their reasoning processes in natural language. 3 Explainable Face Verification Dataset We chose to use a subset of the VGGFace2 dataset [14], which contains 3.31 million images of 9,131 individuals, with a wide variety of poses and orientations. We created 2 datasets, each with 79,771 pairs of images and 7,689 instances of matching faces. We used Llama 2 to generate the descriptions for each pair of images. Dataset 1 had an average training de- scription length of 53.35 words, and dataset 2 had more comprehensive training descriptions with an average length of 121.64 words. Statistics on our datasets can be found in Table 1. We began by using the Llama 3.2 VLM to generate descriptions for each face in the dataset. We then took random pairs of faces and assigned them labels (match or no match). 4 Finally, we used Llama 3.2 again to compress the descriptions for the faces in each pair into one paragraph describing similarities and differences between the individuals in the images. Additionally, with access to the label, it was prompted to write an explanation about whether the images depicted the same person. Below are examples of descriptions generated for each dataset. Descriptions from Dataset 1: • Faces Match: the two facial images exhibit a high degree of similarity in key identifying features. both images depict a woman with a warm, tan complexion, almond-shaped dark brown eyes, and full eyebrows. the minor difference lies in the level of detail regarding facial hair. • Faces Don’t Match: the two facial images share some matching features, including almond-shaped eyes, straight nose, high cheekbones, strong jawline, and warm skin tone. however, notable differences in eye color, hair color and style, and facial hair suggest they do not depict the same individual. Descriptions from Dataset 2: • Faces Match: upon analyzing the two facial images, i have identified several matching features that suggest they belong to the same individual. both images share a similar skin tone, facial structure, jawline, and nose shape. the dark hair color and style, as well as the 5 eyebrow shape and thickness, are also consistent between the two images. however, i have noticed a slight difference in the hair texture and a minor variation in the eye color, which could be attributed to lighting conditions or minor grooming changes. despite these minor differences, the overall facial features and structure are remarkably similar, leading me to conclude that these images represent the same individual. • Faces Don’t Match: based on the provided descriptions, the two facial images do not match. key differences include the skin tone (fair and smooth in person 1 vs. light to medium brown in person 2), facial structure (slightly pointed chin in person 1 vs. oval with a slightly rounded chin in person 2), eye color and shape (pale blue almond-like eyes in person 1 vs. dark brown almond-shaped eyes in person 2), hair color and style (light blonde wavy hair in person 1 vs. dark brown dreadlocks in person 2), facial hair (none in person 1 vs. mustache and goatee in person 2), and eyebrow shape and thickness (thin and light brown in person 1 vs. thick and dark in person 2). these significant differences indicate that the images do not belong to the same individual. Average Median Max Vocab Dataset 1 - Train 53.35 54 127 4341 Dataset 1 - Test 53.71 54 124 1727 Dataset 2 - Train 121.64 121 325 6008 Dataset 2 - Test 122.19 121 263 2664 Table 1: Description Length Statistics 6 Use Llama 3.2 VLM to give descriptions to all images Take random pairs of images and assign la- bels Create a paragraph comparing the images using Llama 3.2 4 Model In this section, we describe our proposed vision-language model, VerLM, designed for an end-to-end, explainable face verification task using natural language. The VerLM model accepts three distinct inputs: two images (image 1 and image 2) and a user-provided textual prompt. The output generated by the model is free-form text based on these inputs. The VerLM architecture comprises six primary components: (1) an Image Encoder, (2) an Image Projection Layer, (3) a Text Embedder, (4) a Text Projection Layer, (5) a Cross- Projection Layer, and (6) a Decoder-only Language Model. Image Encoder: The Image Encoder primarily used in our experiments for extract- ing high-level facial representations is the VGG Face model, which achieves an accuracy of 99.65% on the Labeled Faces in the Wild (LFW) dataset. We also provide an ablation study comparing performance when substituting the VGG Face encoder with another promi- nent face verification encoder, CASIA-WebFace, which achieved an accuracy of 99.05% on the LFW dataset. Detailed results and comparative analysis from these experiments are presented in the Ablations section. Image Projection Layer: This projeccton layer is used to project the face embeddings from the image encoder to the latent space of the cross projection layer and the language model. The image projection layer converts the single embedding [b, h] into a sequence of latent tokens [b, s, d]. where b is batch size, h is the hidden dimension layer from the final output layer of the image encoder, s is the sequence token length, and d is the dimension for the input of the decoder only language model. To do this, we first project the hidden dimension to a larger hidden dimension k where k = s ∗d. This is then followed by con- 7 catenating a learnable constant, resulting in [s + c, d]. Finally, this output is then passed through a transformer which is then followed by clipping of the learnable constant output c. The final output shape from this layer is [b, s, d]. This type of projection architecture is shown to perform well for prefix-tuning architectures. Text Embedder: The text embedder here that is used is found from the tokenizer used in the decoder language model. In this case, we mainly employ gpt 2 based tokenizer as our text embedder. The output shape from the text embedder is [b, t, d] where b is the batch size, t is the prompt len of the model, and d is the dimension for the input of the decoder only language model. Text Projection Layer: The text projection layer is very similar to the image projec- tion layer without the converting of single embedding into a sequence of latent tokens. It is basically just concatenating the embedding from the text embedding [b, t, d] with a learnable constant c to make the inermediate layer [b, t+c, d] and then put through a transformer layer which is then followed by the clipping of the learnable constant c to get the final dimension as [b, t, d]. Cross Projection: Following the esteemed paper [10], we employed a cross-projection layer to improve the models ability to highlight differences [10]. This is done by first con- catenating the two image projection layer outputs along with a separator token which is the EOS token of the model. The resulting shape of this concatenation is [b, 2 ∗s + 1, d]. This is then further concatenated with the text projection layer to give the new dimension as [b, 2∗s+t+1, d]. This is then put through a similar transformer layer as the text projection layer with the clipping of the learnable constant. Language Model: This is a decoder only language model which is used to autore- gressively generate text conditioned on the output of the cross projection layer. We conduct various ablations with the family of gpt2 based decoder language models due to compute limitations. The model takes in the output of the cross projection layer to get the description or answer autoregressively. 5 Training 5.1 Training Stages We train the model in three distinct stages to ensure that visual features are effectively in- tegrated into the language model. These stages are: (1) Unimodal Training, (2) Multimodal Mapper Training, and (3) End-to-End Fine-tuning.

Unimodal Training: In this stage, we independently pretrain the verification en- coder and the language model on tasks specific to their respective modalities. For the vision 8 Figure 1: VerLM takes in two images and text prompt as inputs to the model. It processes the images through an image encoder and then through the image projection layer. The text is put through a text embedder followed by a text projection layer. The two encoded images are concatenated with an separator token in between and then further concatenated with the text projection layer output. This is then put through a cross-projection layer. Finally, the output of the cross-projection layer is input to the decoder model which autoregressively generates the description encoder, we experiment with two architectures: VGGFace and CASIA-WebFace—both well- established in face verification literature. This serves as an ablation study to evaluate the impact of different visual backbones. For the language modality, we use a GPT-2-based language model, pretrained on a large corpus of text. This provides a strong linguistic foundation before integrating visual features.

Multimodal Mapper Training Once unimodal training is complete, we freeze both the vision encoder and the language model, and train the cross-projection and image- projection layers. The goal of this stage is to align the visual representation space with the language model’s embedding space, allowing the model to condition textual generation on visual input effectively. 3. End-to-End Fine-tuning: In the final stage, we unfreeze all components of the model and perform end-to-end fine-tuning. This allows the entire architecture to jointly adapt, leading to a more nuanced integration of visual and linguistic features. We adopt a low learning rate with warm-up scheduling to ensure stable convergence and prevent catastrophic forgetting during this sensitive phase of training. 9 5.2 Loss Function To encourage more descriptive and diverse outputs from our model, we employ a variant of the standard cross-entropy loss, augmented with an entropy-based regularization term. This combined objective—referred to as Diversity Loss—strikes a balance between accurate generation and lexical variability, which is especially important in multimodal generative tasks involving free-form language. Let z ∈RB×T×V be the predicted logits from the language model, where B is the batch size, T is the sequence length, and V is the vocabulary size. The corresponding ground-truth target sequence is denoted by y ∈NB×T. The total loss L is defined as: L = LCE −λ · H where: • LCE is the standard cross-entropy loss between the predicted token distributions and the ground-truth targets. • H is the mean entropy of the predicted token distributions. • λ is a scalar coefficient that controls the weight of the entropy regularization term. The entropy H is computed as: H = − 1 B · T B X b=1 T X t=1 V X v=1 pb,t,v log(pb,t,v + ϵ) where pb,t,v represents the softmax-normalized probability of token v at time step t for batch b, and ϵ is a small constant added for numerical stability. By subtracting the entropy term from the loss, we explicitly encourage the model to assign higher entropy to its output distributions—i.e., to be less confident and more exploratory. This regularization helps mitigate overconfident, generic outputs (e.g., repetitive or tem- plated responses), pushing the model toward producing more expressive and contextually grounded text. Empirically, we find that this diversity-promoting mechanism improves the fluency and specificity of generated descriptions, particularly in the later stages of end-to-end fine-tuning where overfitting to deterministic language patterns is more likely. 10 6 Results 6.1 Experimental Setup We compare our proposed architecture, VerLM, with two baselines: the OneDiff architec- ture and a Base model comprising a verification encoder, a mapper, and a language model. To ensure a fair comparison focused solely on the impact of the mapper architecture, we fix the backbone components across all models—using GPT-2 Base as the language model and VGGFace as the verification encoder. In the OneDiff architecture, a cross-modal projector is introduced before the language model to align the image pairs with their corresponding textual descriptions. The Base model lacks this projector and serves as a minimal multimodal pipeline. Our proposed VerLM architecture extends this setup by incorporating a separator module and a cross- projection layer that enhances the multimodal alignment prior to language generation. All three architectures were trained using the same training strategy to ensure a fair ablation study. The mapper components were initially trained for 30 epochs using the Adam optimizer with a learning rate of 1×10−4. We employed the Cosine Annealing with Warm Restarts learning rate scheduler and a batch size of 64. After this stage, the entire model was fine-tuned end-to-end by unfreezing all modules and continuing training with a reduced learning rate of 1 × 10−5, while keeping all other training configurations the same. All experiments were conducted using NVIDIA V100 GPUs. 6.2 Evaluation Metrics and Performance Analysis Ablation Type Dataset 1 Dataset 2 METEOR BLEU BERT METEOR BLEU BERT Our Model 0.3986 0.1557 0.9039 0.3548 0.1338 0.9004 OneDiff 0.3227 0.1143 0.8728 0.3419 0.1301 0.8980 Base Model 0.3343 0.1052 0.8882 0.3306 0.1137 0.8955 Table 2: Comparison of models across Dataset 1 and Dataset 2 using METEOR, BLEU, and BERTScore. To assess the quality of generated descriptions across different models, we evaluate per- formance using three widely-adopted metrics: METEOR, BLEU, and BERTScore. Each of these metrics captures different aspects of generation quality, offering a comprehensive evaluation of fluency, precision, and semantic similarity. METEOR (Metric for Evaluation of Translation with Explicit ORdering) computes a weighted F-score based on the alignment between the generated and reference sentences. It 11 incorporates stemming, synonymy matching (using WordNet), and exact matches, making it more semantically aware than traditional n-gram overlap metrics. Higher METEOR scores indicate better alignment and semantic similarity with reference sentences. BLEU (Bilingual Evaluation Understudy) measures the n-gram precision between the generated and reference text, typically up to 4-grams. While it is a widely used metric in machine translation and text generation, BLEU is known to be overly sensitive to exact word matches and may not capture semantic equivalence well, especially in open-ended generation tasks. BERTScore leverages contextual embeddings from pretrained BERT models to compute a token-level similarity between candidate and reference sentences. Unlike BLEU and ME- TEOR, BERTScore is semantic in nature and is capable of capturing paraphrastic similarity and richer contextual meaning. Performance Analysis. As shown in Table 2, our proposed model VerLM consis- tently outperforms both OneDiff and Base Model across all three evaluation metrics on both datasets. Specifically, VerLM achieves the highest METEOR (0.3986 and 0.3548), BLEU (0.1557 and 0.1338), and BERTScore (0.9039 and 0.9004) on Dataset 1 and Dataset 2, respectively. These results suggest that VerLM generates descriptions that are not only more fluent and accurate at the surface level (as indicated by BLEU and METEOR), but also semantically richer and more contextually aligned with the reference (as evidenced by BERTScore). In contrast, OneDiff and base Model demonstrate lower scores across all metrics, indi- cating that their generated outputs are less aligned with the ground truth both lexically and semantically. The consistent improvement of VerLM across datasets and metrics high- lights the effectiveness of its cross-projection and separator modules in enhancing multimodal alignment and guiding the language model to produce higher-quality descriptions. 7 Ablations We implement our VerLM model for the task of explainable face verification using natural language. We begin by detailing the datasets used for training and evaluating the model in this setting. Following this, we describe the architectural design and training configuration of the proposed system. To validate its effectiveness, we benchmark our model against existing state-of-the-art approaches. Finally, we conduct an ablation study to analyze the impact of key architectural components and training strategies, highlighting the design choices that contribute most to model performance. 12 7.1 Training Strategy Ablation Type Dataset 1 Dataset 2 METEOR BLEU BERT METEOR BLEU BERT Train Mapper 0.3672 0.1421 0.9015 0.3519 0.1321 0.9001 Train Entire End-to-End 0.3753 0.1516 0.9021 0.3478 0.1309 0.8901 Train Mapper Then Finetune 0.3986 0.1557 0.9039 0.3548 0.1338 0.9004 Table 3: Ablation study comparing different training strategies on Dataset 1 and Dataset 2 using METEOR, BLEU, and BERTScore. We explored three distinct training strategies for our proposed VerLM architecture to determine the most effective approach for aligning visual features with the language model. Each strategy was designed to assess the impact of different levels of parameter freezing and optimization flow on performance. The first strategy involved training only the vision projection layer and cross projection layer, while keeping both the verification encoder and the language model frozen. These two projection modules are collectively referred to as the mapper, as they serve to bridge the vision and language domains. This setup aims to ensure that the visual features are mapped into the language model’s embedding space without disrupting pretrained representations in the encoder or decoder. The second strategy involved fully end-to-end training of the entire model from the beginning. In this case, no components were frozen, allowing the optimization process to jointly adapt the verification encoder, mapper, and language model to the task. While more flexible, this method introduces challenges in stability and may require careful learning rate scheduling to prevent overfitting or forgetting. The final and most effective approach was a three-stage training procedure. In this setup, we first train the mapper alone (as in the first strategy), then unfreeze the entire model and perform end-to-end fine-tuning. This progressive learning approach allows the mapper to first align the modalities without interference, after which the entire model is jointly optimized for improved multimodal integration. As shown in Table 3, the third strategy consistently outperformed the other two across both datasets and all evaluation metrics. Notably, it achieved the highest scores in METEOR (0.3986 and 0.3548), BLEU (0.1557 and 0.1338), and BERTScore (0.9039 and 0.9004). These results validate the effectiveness of a staged training paradigm in preserving pretrained rep- resentations while still allowing for holistic adaptation during fine-tuning. 13 Ablation Type METEOR BLEU BERT GPT Base 0.3672 0.1421 0.9015 GPT Medium 0.3937 0.1570 0.9036 GPT Large 0.4098 0.1572 0.9047 Table 4: Ablation study on the effect of language model size (GPT variants) on description quality, evaluated using METEOR, BLEU, and BERTScore. 7.2 Scaling of Language Model Recent studies have shown that the performance of language models—and by extension, vision-language models—tends to improve with increasing model scale, often following a predictable power-law behavior [15,16]. To investigate this trend within our framework, we conducted an ablation study using three variants of the GPT-2 architecture: GPT-2 Base (124M parameters), GPT-2 Medium (355M), and GPT-2 Large (774M) [17]. As presented in Table 4, we observed a consistent improvement in performance across all evaluation metrics—METEOR, BLEU, and BERTScore—as the size of the language model increased. Specifically, GPT-2 Large achieved the highest scores, with a METEOR of 0.4098, BLEU of 0.1572, and BERTScore of 0.9047. These results align with prior findings and rein- force the idea that larger language models possess greater capacity to generate contextually rich and semantically aligned textual descriptions when paired with visual inputs. 7.3 Model Strategy Ablation Type Dataset 1 Dataset 2 METEOR BLEU BERT METEOR BLEU BERT With SEP 0.3672 0.1421 0.9015 0.3519 0.1321 0.9001 Without SEP 0.3532 0.1308 0.8921 0.3509 0.1323 0.9025 Without Cross Projection 0.3343 0.1052 0.8882 0.3398 0.1281 0.8871 Table 5: Ablation study on model components: effect of separator (SEP) and cross-projection layer on performance across Dataset 1 and Dataset 2. We examine the impact of the two architectural components adapted from ADIFF - the insertion of a separator token (SEP) between the image embeddings and the use of a cross projection module. We compare three variants in this experiment: (1) With SEP (the full VerLM model, including the SEP token and cross-projection layer), (2) Without SEP (cross-projection module is retained but no explicit separator token is inserted), and (3) 14 Without Cross-Projection (both the SEP token and the entire cross-projection mechanism are removed, akin to a direct embedding prompt baseline). This experiment helps us eval- uate how well the model can highlight the fine-grained facial differences when these design elements are ablated. We found that incorporating the cross projection module with the separator token yields the best performance. The With Sep (i.e. full model) achieves the highest score in METEOR (0.3762), BLEU (0.142), AND BERTScore (0.9015) in dataset 1. Removing the separator token causes a slight drop in these metrics (METEOR falls by about 1.4 points, BLEU by ∼1 point), indicating that the SEP token provides a modest but consistent benefit. Notably, eliminating the entire cross-projection module leads to a significant performance degrada- tion: METEOR plummets to 0.334, and BLEU drops to 0.105 for impostor pairs (a relative decrease of about 25% and 26% respectively, compared to the full model). BERTScore like- wise declines (from ∼0.901 to 0.888), reflecting poorer semantic alignment with the reference explanations. These trends hold similarly for genuine (same-identity) pairs, underscoring the importance of the cross-projection design across both scenarios. The superior performance with our full model suggests that VerLM’s difference-focused strategy is effective. The cross-projection module evidently helps the model isolate and represent distinguishing attributes of the two faces in the latent space, thereby facilitating more precise difference descriptions. This aligns with observations from ADIFF, where a cross-projection layer (with a learned separator) was critical to “store difference attributes” in the prompt for the language model [10]. By inserting a dedicated token between the two face embeddings, VerLM explicitly demarcates the two inputs, enabling the transformer to attend to each face’s features without conflation. The slight performance dip without SEP suggests that even a single learned boundary token can improve the model’s ability to discern which visual features belong to which face, likely reducing ambiguity during comparison.In contrast, removing the entire cross-projection mechanism forces the language model to rely on a naive concatenation of image features (similar to the baseline in ADIFF or a simple prefix-tuning approach), which dramatically hampers its ability to articulate fine-grained differences. This result is consistent with prior vision-language models that emphasize structured multimodal fusion: for example, OneDiff [12] introduces a “Visual Delta Module” to explicitly capture image differences, and ablations in that work show performance drops when the module is removed. Our findings here similarly demonstrate that a targeted difference-computation layer (and its SEP token) substantially boosts descriptive accuracy. 7.4 Textual Encoder Strategy To assess the impact of incorporating a textual encoder, we conducted an ablation study comparing model performance with and without this component. In the full VerLM archi- 15 Ablation Type Dataset 1 Dataset 2 METEOR BLEU BERT METEOR BLEU BERT Without Text Encoder 0.3544 0.1363 0.8992 0.3393 0.1218 0.8901 With Text Encoder 0.3672 0.1421 0.9015 0.3519 0.1321 0.9001 Table 6: Ablation study on the impact of including a text encoder in the VerLM architecture. Results reported on Dataset 1 and Dataset 2 using METEOR, BLEU, and BERTScore. tecture, the text encoder processes the tokenized input prompts into semantic embeddings, which are then used in conjunction with visual embeddings during generation. As shown in Table 6, the inclusion of the text encoder results in consistent improvements across all metrics—METEOR, BLEU, and BERTScore—on both datasets. These results highlight the critical role of the text encoder in preserving semantic alignment and coherence between the input prompt and the generated output. We hypothesize that removing the text encoder introduces a misalignment between the textual conditioning and the generated language, thereby degrading output quality. This observation is consistent with recent findings in multimodal generative modeling [18], which emphasize the importance of a dedicated text encoder for maintaining contextual consistency and improving generation fidelity. 7.5 Different Image Encoders Image Encoder Dataset 1 Dataset 2 METEOR BLEU BERT METEOR BLEU BERT VGGFace 0.3672 0.1421 0.9015 0.3519 0.1321 0.9001 CASIA 0.3413 0.1389 0.8908 0.3209 0.1203 0.887 Table 7: Comparison of image encoders (VGGFace vs. CASIA-WebFace) on Dataset 1 and Dataset 2 using METEOR, BLEU, and BERTScore. To assess the importance of the visual encoder, we conduct an ablation study by swapping VerLM’s face feature extractor. Specifically, we compare two widely-used face recognition models: VGGFace, trained on the VGGFace dataset [19], and CASIA, a ResNet-based model trained on the CASIA-WebFace dataset [20]. Both encoders produce latent embeddings from facial images, which are then passed to the model’s cross-projection module to condition the language generation process. As shown in Table 7, the model using VGGFace significantly outperforms the CASIA- based model across METEOR, BLEU, and BERTScore metrics on both datasets. The VG- 16 GFace encoder achieves higher semantic and syntactic alignment with reference descriptions, suggesting that it provides more discriminative and expressive facial embeddings. We hypothesize that this performance gap arises not only from architectural differences but also from the underlying training datasets. Since our model is trained using facial inputs aligned with the VGGFace embedding space, using an encoder trained on the same dataset (VGGFace) results in better modality alignment. In contrast, CASIA-WebFace differs in its identity distribution, pose diversity, and image quality, which likely introduces a domain mismatch and degrades the downstream generation performance. This finding underscores the importance of dataset alignment between the pretrained vision backbone and the multimodal training regime, especially in tasks involving fine-grained identity-level conditioning. 7.6 Model Scaling Number of GPT Layers Dataset 1 Dataset 2 METEOR BLEU BERT METEOR BLEU BERT 2 0.3403 0.1241 0.8975 0.3329 0.1221 0.8891 4 0.3672 0.1421 0.9015 0.3519 0.1321 0.9001 8 0.3662 0.1317 0.8990 0.3532 0.1317 0.9015 16 0.3703 0.1486 0.9094 0.3669 0.1415 0.9033 Table 8: Ablation study on the number of transformer layers in GPT-base. Results are reported on Dataset 1 and Dataset 2 using METEOR, BLEU, and BERTScore. 8 Conclusion In this work, we introduced VerLM, a difference-aware vision–language architecture for ex- plainable face verification that generates natural-language rationales grounded in paired facial embeddings. Under a controlled comparison where the language model (GPT-2 Base) and face encoder (VGGFace) are held fixed, VerLM consistently outperforms both a mini- mal multimodal baseline and the OneDiff-inspired alternative across METEOR, BLEU, and BERTScore on two datasets, demonstrating more fluent text and stronger semantic align- ment with references. Our ablations highlight that these gains come from explicit difference modeling, not simply adding capacity. A staged training procedure—training the mapper first, then fine-tuning end-to-end yields the best overall results, indicating that progressive alignment helps preserve pretrained representations while still enabling holistic adaptation. 17 Architecturally, the cross-projection module is the primary driver of explanation quality, while the separator token (SEP) provides a smaller but consistent improvement by preventing feature conflation between the two faces. We also find that explanation quality improves with larger language models and deeper decoders, and that both a dedicated text encoder and better aligned face embeddings (VGGFace vs. CASIA) materially impact performance underscoring the importance of multimodal alignment at every stage of the pipeline. Looking ahead, promising directions include scaling to stronger instruction-tuned LLMs, expanding to more diverse face verification conditions (pose/occlusion/lighting), adding human centered evaluation for explanation usefulness and faithfulness, and tightening visual grounding (e.g., coupling generated language with region-level evidence) so that explanations are not only fluent, but also reliably attributable to the true factors driving verification decisions. 18 References [1] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. [2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human- level performance in face verification,” in Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), June 2014. [3] J. R. Williford, B. B. May, and J. Byrne, “Explainable face recognition,” in European conference on computer vision. Springer, 2020, pp. 248–263. [4] Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic chain of thought prompting in large language models,” in The eleventh international conference on learning represen- tations. [5] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” Evaluation, vol. 11, pp. 6–1. [6] S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal, S. Ma, T. Lv, L. Cui, O. K. Mo- hammed, B. Patra et al., “Language is not all you need: Aligning perception with language models,” Advances in Neural Information Processing Systems, vol. 36, pp. 72 096–72 109, 2023. [7] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, pp. 34 892–34 916, 2023. [8] P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, A. Bapna, Z. Borsos, F. d. C. Quitry, P. Chen, D. E. Badawy, W. Han, E. Kharitonov et al., “Audiopalm: A large language model that can speak and listen,” arXiv preprint arXiv:2306.12925, 2023. [9] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. MA, and C. Zhang, “Salmonn: Towards generic hearing abilities for large language models,” in The Twelfth International Conference on Learning Representations. [10] S. Deshmukh, S. Han, R. Singh, and B. Raj, “ADIFF: Explaining audio difference using natural language,” in The Thirteenth International Conference on Learning Rep- resentations, 2025. [Online]. Available: https://openreview.net/forum?id=l4fMj4Vnly 19 [11] B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Confer- ence on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5. [12] E. Hu, L. Guo, T. Yue, Z. Zhao, S. Xue, and J. Liu, “Onediff: A generalist model for image difference captioning,” 2024. [Online]. Available: https://arxiv.org/abs/2407.05645 [13] I. DeAndres-Tame, M. Faisal, R. Tolosana, R. Al-Refai, R. Vera-Rodriguez, and P. Terh¨orst, “From pixels to words: Leveraging explainability in face recognition through interactive natural language processing,” in International Conference on Pattern Recog- nition. Springer, 2024, pp. 314–328. [14] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “VGGFace2: A dataset for recognising faces across pose and age,” in International Conference on Automatic Face and Gesture Recognition, 2018. [15] J. Kaplan, S. McCandlish, T. Henighan et al., “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020. [16] J. Hoffmann et al., “Training compute-optimal large language models,” arXiv preprint arXiv:2203.15556, 2022. [17] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” https://cdn.openai.com/better-language- models/language models are unsupervised multitask learners.pdf, 2019, openAI Tech- nical Report. [18] S. Deshmukh, B. Elizalde, R. Singh, and H. Wang, “Pengi: An audio language model for audio tasks,” 2024. [Online]. Available: https://arxiv.org/abs/2305.11834 [19] O. M. Parkhi, “Features and methods for improving large scale face recognition,” Ph.D. dissertation, Department of Engineering Science Oxford University., 2015. [20] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,” 2014. [Online]. Available: https://arxiv.org/abs/1411.7923 20 9 Appendix Image 1 Image 2 Llama 3.2 Description Our Model Results the two facial images share sev- eral matching features, including a heart-shaped face, almond-shaped dark brown eyes, straight nose, thin eyebrows, high cheekbones, small chin, and smooth forehead. however, notable differences in- clude the jawline shape, with one image featuring a soft jawline and the other a slightly more defined jawline, and the eyebrow shape, with one image having arched eye- brows and the other straight eye- brows. the two facial images exhibit dis- tinct differences in key identifying features. person 1 has a round face, almond-shaped eyes, and a straight nose, while person 2 has a heart-shaped face, almond-shaped eyes, and a straight nose. the no- table discrepancies in facial struc- ture, skin tone, and hair color sug- gest they are not the same individ- ual the two images show a woman with dark hair and fair, warm skin tone. the first image features a bun, while the second has shoulder- length hair with loose waves and dark brown eyes. the facial struc- ture appears similar, but the hair style and eye color discrepancies suggest they are not the same in- dividual. the two facial images exhibit dis- tinct differences in key identifying features. person 1 has a fair skin tone, dark brown hair, and a de- fined jawline, while person 2 has a warm, golden brown skin tone, dark brown hair, and a defined jawline. the notable discrepancies in skin tone, hair color, and facial structure suggest they are not the same individual the two facial images exhibit a high degree of similarity in key identify- ing features, including a rounded face shape, almond-shaped dark brown eyes, straight nose, thin mouth, and dark eyebrows. the main difference lies in the hair, with one image showing a more pronounced wave than the other. the two facial images exhibit nu- merous matching features, in- cluding oval face shape, almond- shaped dark brown eyes, straight nose, thick eyebrows, and fair skin tone. the only minor difference is the slight variation in the hair color and style, which is not signif- icant enough to suggest they are not the same individual. 21 the two facial images exhibit a high degree of similarity. matching fea- tures include an oval face shape, almond-shaped dark brown eyes, and fair skin tone. the overall fa- cial structure is identical, confirm- ing that the images represent the same individual. the two facial images exhibit nu- merous matching features, in- cluding oval face shape, almond- shaped dark brown eyes, straight nose, thick eyebrows, and fair skin tone. the only minor difference is the presence of a small mole in the first image, but this is insignificant in comparison to the overall simi- larity. the two facial images share several key features, including fair skin, oval-shaped face, almond-shaped dark brown eyes, and thin, arched eyebrows. the notable similarities suggest that these images repre- sent the same individual. the two facial images exhibit nu- merous matching features, in- cluding oval face shape, almond- shaped dark brown eyes, straight nose, thin eyebrows, and fair skin tone. the only minor difference is the slight variation in hair color, but this is negligible in compari- son to the overall similarity. the two images exhibit distinct differences in key identifying fea- tures. the second image shows a man with a medium-brown skin tone, short beard and mustache, black dreadlocks, almond-shaped dark brown eyes, and oval face shape. the lack of discernible fea- tures in the first image and signif- icant differences in the second im- age suggest they are not the same individual. the two facial images exhibit dis- tinct differences in key identify- ing features. person 1 has a fair skin tone, dark brown eyes, and a prominent jawline, while person 2 has a medium to dark brown skin tone, almond-shaped eyes, and a straight nose. the notable discrep- ancies in skin tone, eye color, and facial structure suggest they are not the same individual. the two facial images share several matching features, including skin tone, long dark hair, jawline, facial structure, and eye color. however, a difference in facial expression ex- ists, with one image showing a se- rious expression. the two facial images exhibit nu- merous matching features, in- cluding oval face shape, almond- shaped dark brown eyes, straight nose, thick eyebrows, and fair skin tone. the only minor difference is the presence of a small mole in the first image, but this is insignificant in comparison to the overall simi- larity. 22 the two facial images share sev- eral key features, including oval face shape, almond-shaped dark brown eyes, straight eyebrows, high cheekbones, and smooth skin tone. however, notable differences include hair color, style, eyebrow color, and jawline shape. despite these differences, the facial struc- ture and eye shape remain consis- tent. based on these findings, the two images do not represent the same individual. the two facial images exhibit dis- tinct differences in key identifying features. person 1 has a fair skin tone, blonde hair, and a defined jawline, while person 2 has a fair skin tone, dark brown hair, and a defined jawline. the notable dis- crepancies in skin tone, hair color, and facial structure suggest they are not the same individual. the two images exhibit distinct dif- ferences in facial structure and hair color. the first image depicts a man with a square-shaped face, almond-shaped dark brown eyes, and thick dark eyebrows. in con- trast, the second image shows a man with short gray hair and no matching facial features. the dras- tic change in hair color suggests they are not the same individual. the two facial images exhibit dis- tinct differences. person 1 has a round face shape, prominent jawline, and dark brown eyes, while person 2 has a round face shape, prominent jawline, and dark brown eyes. the notable dis- crepancies in facial structure, jaw- line, and eye color suggest they are not the same individual. the two facial images share several matching characteristics, includ- ing oval face shapes, dark brown almond-shaped eyes, straight noses, and prominent cheekbones. however, the first image shows a woman with short blonde hair, a smoother complexion, and a less angular jawline, while the second image depicts a person with long straight light brown hair, fine lines, and a stronger jawline. based on these distinct differences, the two images do not represent the same individual. the two facial images exhibit dis- tinct differences in key identifying features. person 1 has a fair skin tone, almond-shaped eyes, and a straight nose, while person 2 has a light skin tone, almond-shaped eyes, and a straight nose. the no- table discrepancies in skin tone, eye color, and nose shape suggest they are not the same individual. 23 the two images show distinct in- dividuals. person 1 is a black man with a rounded face, almond- shaped eyes, and a prominent nose. person 2 is a woman with long brown hair, a white top, and a dif- ferent facial structure. the key dif- ferences indicate they are not the same person. the two facial images exhibit dis- tinct differences in key identifying features. person 1 has a fair skin tone, almond-shaped dark brown eyes, and a straight nose, while person 2 has a medium to dark brown skin tone, almond-shaped dark brown eyes, and a straight nose. the notable discrepancies in skin tone, eye color, and nose shape suggest they are not the same individual. the facial images of person 1 and person 2 exhibit key differences. person 1 has a brown skin tone, long brown hair, while person 2 has a fair skin tone, short light-brown hair, and an oval face shape with almond-shaped eyes. the discrep- ancies in skin tone, hair color and style, and facial structure suggest they are different individuals. the two facial images exhibit dis- tinct differences in key identifying features. person 1 has a fair skin tone, almond-shaped eyes, and a straight nose, while person 2 has a light skin tone, almond-shaped eyes, and a straight nose. the no- table discrepancies in skin tone, eye color, and nose shape suggest they are not the same individual. the two images exhibit distinct dif- ferences in key identifying features. person 1 has a round face, fair skin, and dark brown hair, while person 2 has an oval face, light to medium brown skin, and light blond hair. the facial features, such as nose shape, eyebrow shape, and jawline, also differ significantly. the two facial images exhibit dis- tinct differences. person 1 has a round face, almond-shaped eyes, and a straight nose, while person 2 has a heart-shaped face, almond- shaped eyes, and a straight nose. the notable discrepancies in facial structure, skin tone, and hair color suggest they are not the same in- dividual. the two facial images share some matching features, including fair skin tone, small nose, prominent cheekbones, and defined jawline. however, they differ in hairstyles and colors, eye positions, and eye- brow shapes, suggesting the im- ages are not of the same individ- ual. the two facial images exhibit dis- tinct differences in key identifying features. person 1 has a fair skin tone, dark brown hair, and a de- fined jawline, while person 2 has a warm, golden brown skin tone, dark brown hair, and a defined jawline. the notable discrepancies in skin tone, hair color, and facial structure suggest they are not the same individual. 24 the two facial images exhibit dis- tinct differences in key features. the first depicts a woman with a round face, soft jawline, brown or hazel eyes, and a non-prominent nose, while the second shows a woman with a flat face, prominent cheekbones, dark eyes, and a de- fined jawline. the notable discrep- ancies suggest that they are not the same individual. the two facial images exhibit dis- tinct differences in key identifying features. person 1 has a fair skin tone, almond-shaped eyes, and a straight nose, while person 2 has a light skin tone, almond-shaped eyes, and a pointed nose. the no- table discrepancies in skin tone, eye color, and nose shape suggest they are not the same individual. the two facial images depict a bald man with a prominent forehead, high cheekbones, strong jawline, light tan skin tone, and piercing blue eyes. the only minor differ- ence lies in the background, but the facial features remain identical, suggesting they belong to the same individual. the two facial images exhibit nu- merous matching features, includ- ing a prominent forehead, almond- shaped dark brown eyes, straight nose, thick eyebrows, and fair skin tone. the only minor difference is the slight variation in the hair color and style, which is not signif- icant enough to suggest they are not the same individual. the two facial images exhibit similarities in several key fea- tures, including short brown hair, straight noses, dark brown eye- brows, prominent cheekbones, de- fined jawlines, and fair skin tone. the matching features suggest that they belong to the same individ- ual. the two facial images exhibit nu- merous matching features, in- cluding oval face shape, almond- shaped dark brown eyes, straight nose, thick eyebrows, and fair skin tone. the only minor difference is the slight variation in the hair color and style, which is not signif- icant enough to suggest they are not the same individual. the two facial images exhibit sim- ilarities in face shape, forehead, cheekbones, nose, and eye shape. however, significant differences ex- ist in skin tone, facial hair, hair style, mouth shape, and presence of fine lines. the two facial images exhibit dis- tinct differences in key identify- ing features. person 1 has a fair skin tone, almond-shaped dark brown eyes, and a straight nose, while person 2 has a warm, golden brown skin tone, almond-shaped dark brown eyes, and a straight nose. the notable discrepancies in skin tone, eye color, and nose shape suggest they are not the same individual. 25 the facial features of both images exhibit several key differences. the most notable discrepancies include skin tone (medium to dark in per- son 1 vs. light in person 2), facial hair (stubble in person 1 vs. none in person 2), and jawline shape. the two facial images exhibit dis- tinct differences. person 1 has a medium to dark brown skin tone, dark brown hair, and a prominent jawline, while person 2 has a light to medium brown skin tone, short dark hair, and a prominent jaw- line. the notable discrepancies in skin tone, hair color, and facial structure suggest they are not the same individual. the two facial images exhibit dis- tinct differences. the first is a blurry photograph of a man with short, dark hair, while the sec- ond is a close-up of a man with a shaved head and short beard. despite similar skin tone and fa- cial structure, the significant dif- ferences in hair style, facial hair, and attire suggest they are not the same individual. the two facial images exhibit dis- tinct differences in key identifying features. person 1 has a fair skin tone, almond-shaped eyes, and a straight nose, while person 2 has a dark skin tone, almond-shaped eyes, and a pointed chin. the no- table discrepancies in skin tone, eye color, and facial structure sug- gest they are not the same individ- ual. Table 9: 20 image pairs with their descriptions and model results. 26