Generalist Multimodal LLMs Gain Biometric Expertise via Human Salience


Authors: Jacob Piland, Byron Dowling, Christopher Sweet, Adam Czajka

Jacob Piland (1), Byron Dowling (1), Christopher Sweet (2), and Adam Czajka (1)

(1) Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA (aczajka@nd.edu, jpiland@nd.edu, bdowlin2@nd.edu)
(2) Center for Research Computing, University of Notre Dame, Notre Dame, IN 46556, USA (csweet1@nd.edu)

Corresponding author: Adam Czajka (aczajka@nd.edu)

Abstract

Iris presentation attack detection (PAD) is critical for secure biometric deployments, yet developing specialized models faces significant practical barriers: collecting data representing future unknown attacks is impossible, and collecting diverse-enough data, yet still limited in terms of its predictive power, is expensive. Additionally, sharing biometric data raises privacy concerns. Due to the rapid emergence of new attack vectors demanding adaptable solutions, we thus investigate in this paper whether general-purpose multimodal large language models (MLLMs) can perform iris PAD when augmented with human expert knowledge, operating under strict privacy constraints that prohibit sending biometric data to public cloud MLLM services. Through analysis of vision encoder embeddings applied to our dataset, we demonstrate that pre-trained vision transformers in MLLMs inherently cluster many iris attack types despite never being explicitly trained for this task. However, where clustering shows overlap between attack classes, we find that structured prompts incorporating human salience (verbal descriptions from subjects identifying attack indicators) enable these models to resolve ambiguities.
Testing on an IRB-restricted dataset of 224 iris images spanning seven attack types, using only university-approved services (Gemini 2.5 Pro) or locally-hosted models (e.g., Llama 3.2-Vision), we show that Gemini with expert-informed prompts outperforms both a specialized convolutional neural network (CNN)-based baseline and human examiners, while the locally-deployable Llama achieves near-human performance. Our results establish that MLLMs deployable within institutional privacy constraints offer a viable path for iris PAD.

Keywords: Biometric security, human salience, iris presentation attack detection, iris recognition, multimodal large language models, privacy-preserving biometrics, prompt engineering, UMAP clustering, vision transformers, zero-shot learning, few-shot learning.

This material is based upon work supported by the U.S. National Science Foundation under grant No. 2237880 and work supported by the U.S. Department of Defense under contract No. W52P1J-20-9-3009. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the U.S. National Science Foundation or U.S. Department of Defense. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation hereon.

1 Introduction

1.1 Background and Motivation

Biometric presentation attack detection (PAD) stands at a critical intersection of security necessity and practical constraints. While deep learning approaches have shown promise for detecting various attack types, ranging from artificial objects presented to biometric cameras to synthetic imagery, their deployment faces four fundamental challenges that motivate our investigation into alternative approaches, leveraging recent advances in foundational and multimodal large models.
First, the diversity of presentation attacks requires extensive data collection across multiple attack vectors, which is expensive, ethically complex when involving human subjects, and will never exhaust all possible attack types potentially to be seen in the future.

Figure 1: Experimental pipeline and paper contributions. We start with the generation of synthetic MESH salience from human annotations and a corresponding image dataset. This salience, along with the control and human salience, is then combined with our novel prompts for testing. Results are obtained for both models to which we have ethical access and compared with human and salience-guided CNN baselines.

Second, the rapid evolution of attack methodologies, particularly with advances in generative AI, demands systems that can adapt quickly, without complete retraining [66]. Thus, an effective PAD solution should ideally be based on models with a general "understanding" of visual tasks that are only directed by limited data (multimodal, if available) towards the PAD task. Large multimodal (vision-language) models may offer such capability [42].
Third, while biometric PAD algorithms generally surpass human classification, forensic expert judgment retains an important role in court cases, where the machine serves in a supporting capacity but the final decision is left, by law, to the expert. In such cases there is a need for seamless human-machine pairing approaches. Fourth, the sensitive nature of biometric data creates significant barriers to model development and deployment. Biometric images are personally identifiable information subject to stringent privacy regulations and Institutional Review Board (IRB) restrictions that explicitly prohibit transmission to public cloud services such as ChatGPT or Claude. These privacy-related constraints eliminate many state-of-the-art and commercial solutions from consideration, creating a need for privacy-safe, usually locally-run approaches.

1.2 Proposed Solution

Due to the impracticality of attaining ever-larger biometric datasets, which is the standard approach to training specialized CNNs, we propose a fundamentally different approach: leveraging the latent visual understanding of general-purpose MLLMs that can be deployed within institutional privacy boundaries. Our investigation is constrained to models that either have special institutional agreements for secure processing (Gemini 2.5 Pro, through our university's bilateral agreement with Google) or can be hosted entirely on local infrastructure (e.g., Llama 3.2-Vision). This constraint, while limiting our model choices, reflects real-world deployment scenarios where biometric data cannot leave institutional control. In this work, we demonstrate the solution for iris PAD as the example PAD domain, although the proposed methodology can be applied to design a PAD method for other biometric modalities for which both images and language descriptions of these images are available.
Our key insight is that privacy-compliant MLLMs, despite never being explicitly trained on iris PAD tasks, have learned visual features that partially separate different attack types. We illustrate this through exploratory Uniform Manifold Approximation and Projection (UMAP) visualization [37] using SigLIP [69], a widely-used vision-language model, for image encoding and Gemma [54], a large language model, for text embedding.

Figure 2: Uniform Manifold Approximation and Projection (UMAP) visualization of iris samples encoded by (left) SigLIP vision-only embeddings and (right) SigLIP + Gemma multimodal embeddings using a simple binary prompt asking whether the iris is "real and healthy" or "synthetic/unhealthy." Despite never being trained specifically for iris PAD, SigLIP alone achieves partial separation of attack types. However, adding even minimal semantic guidance through Gemma yields much improved visual separation between live and spoof samples, with clearer cluster boundaries and reduced overlap between classes. This visual separation motivates our investigation into whether general-purpose MLLMs can address specialized biometric security tasks through appropriate prompting.

To leverage both SigLIP's strong visual representations and Gemma's text processing capabilities (as SigLIP is limited to short prompts), we combine their embeddings through a shallow Multi-Layer Perceptron (MLP) fusion layer, a standard approach in multimodal learning [3]. The shallow MLP is a two-layer network with normalization and GELU activation, an input size of 2048, and a hidden dimension of 512. This combined representation, shown in Fig. 2, uses only a simple binary prompt asking whether the iris is 'real and healthy' or 'synthetic/unhealthy' (see Footnote 1), yet provides strong clustering of iris attack types.
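The fusion layer described above can be sketched in pure Python with randomly initialized weights. The text specifies the dimensions (2048-d input, 512-d hidden, two layers, normalization, GELU); everything else here is our assumption: the 1024/1024 image-text split of the input, normalization placed before the GELU, and a 2-way output head.

```python
import math, random

random.seed(0)
IN, HID, OUT = 2048, 512, 2  # input and hidden sizes as stated in the text

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def layer_norm(v, eps=1e-5):
    m = sum(v) / len(v)
    var = sum((x - m) ** 2 for x in v) / len(v)
    s = math.sqrt(var + eps)
    return [(x - m) / s for x in v]

def linear(v, W, b):
    # W[j] holds the input weights of output unit j
    return [sum(vi * wij for vi, wij in zip(v, row)) + bj for row, bj in zip(W, b)]

W1 = [[random.gauss(0, 0.02) for _ in range(IN)] for _ in range(HID)]
b1 = [0.0] * HID
W2 = [[random.gauss(0, 0.02) for _ in range(HID)] for _ in range(OUT)]
b2 = [0.0] * OUT

def fuse(img_emb, txt_emb):
    """Concatenate image and text embeddings (1024-d each, an assumed split
    of the 2048-d input), then apply the shallow two-layer MLP."""
    x = img_emb + txt_emb                                  # concatenation -> 2048-d
    h = [gelu(z) for z in layer_norm(linear(x, W1, b1))]   # hidden dim 512
    return linear(h, W2, b2)

out = fuse([0.1] * 1024, [-0.1] * 1024)
print(len(out))  # 2
```

In practice the combined embedding feeding the UMAP projection would be taken from the hidden representation rather than the output head; the head shown here is only illustrative.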
While this natural clustering provides a strong foundation, we observe overlapping regions where attack types are not clearly distinguished by visual features alone. This observation suggests that the challenge extends beyond visual discrimination to include domain-specific interpretation of ambiguous cases.

To address this challenge, we introduce a framework that incorporates human salience, specifically verbal descriptions of attack indicators from both expert and non-expert observers, directly into MLLM prompts. This approach transforms the PAD task from one requiring extensive training data to one of knowledge transfer, where human expertise guides pre-trained models toward domain-specific decisions, particularly in cases where visual features alone are insufficient. Working within strict IRB constraints that limited us to 224 images across seven attack types and restricted us to institution-approved or locally-hosted models, we demonstrate that this approach not only works but can exceed both specialized CNN baselines and human expert performance. At the same time, this work does not argue for replacing specialized PAD systems in high-throughput production environments. Instead, we establish feasibility for scenarios where traditional approaches face insurmountable barriers, particularly where privacy regulations prohibit cloud-based processing, offering a new tool for the biometric security community when conventional methods cannot be applied.

1.3 Research Questions

Our investigation and presented results are organized around the following four research questions:

RQ1: To what extent can general-purpose MLLMs perform binary iris PAD under minimal, i.e., naive prompting, relative to human examiners and a domain-specific state-of-the-art CNN baseline?

Footnote 1: Full prompt: 'Is this a real and healthy iris or synthetic/unhealthy iris?
Return a single float number from 0 to 1 with 0 being real/healthy and 1 being synthetic/unhealthy and no other output.' The (un)healthy specification is in reference to the diseased-iris attack type, which the model will not otherwise detect, as diseased irises are real irises in a layman's context.

RQ2: By how much does a structured, task-scaffolded prompt improve the MLLMs' iris PAD performance?

RQ3: Does injecting human salience as in-context exemplars further improve MLLMs' iris PAD performance, as it does for CNN performance, and is the gain consistent across models?

RQ4: Can LLM-expanded human descriptions (introduced as MESH: Machine-Expanded Saliency from Human) produce a more complete salience signal that further enhances iris PAD performance, and under which prompt regimes and base models does it help most?

1.4 Biometric Data and Public MLLM Services

Due to the large and growing number of public MLLM services and their impressive generalization capabilities, there is an increasing body of work in which biometric data is submitted to such services. However, since many research biometric datasets were collected before the MLLM era, the consent forms used during those collections may not contain appropriate provisions to protect subjects' privacy in the context of large commercial models. This ethical and privacy aspect of biometric data usage in the modern AI environment warrants short commentary. As an example, the consent forms used by the University of Notre Dame, whose data was used in these experiments, contain several provisions to protect subjects' privacy. These provisions forbid licensees from redistributing the data to third parties, using the data commercially, or using the data in a way that could cause subject embarrassment or mental anguish.
The data-sharing license, which must be duly executed by the institution requesting a copy of the data, also requires licensees to obtain the Principal Investigators' permission to publish more than ten images from the dataset in a paper. None of these provisions can be guaranteed when data is submitted to public MLLM services without a data-sharing agreement executed by the MLLM service provider, which makes fulfilling all provisions included in the consent form impossible. In the case of the authors' institution, such an agreement was secured with Google Inc. for the use of the "Gemini" service. Consequently, this work reports only on results obtained with either locally-maintained open-source MLLMs or the commercial model available via the "Gemini" service.

2 Related Work

2.1 Detection of Biometric Presentation Attacks

Presentation attacks are attempts at manipulating a biometric system into an incorrect decision in one of two forms: impersonation, where the objective is to deceive the biometric system into a positive match, or identity concealment, where the attacker wishes to evade detection entirely [13]. Examples of iris presentation attacks include synthetic iris imagery generated with Generative Adversarial Networks [29, 28, 55], presenting a cadaver post-mortem sample to the iris scanner [61], printing a high-resolution image of an authentic iris and presenting it to the scanner [12], or presenting an artificial eye such as a glass prosthesis [13]. Many current PAD techniques rely on deep learning-based approaches [5, 6, 38, 11]. More recent works leverage the generalization capabilities of image foundation models [53], or propose new loss functions for image-text multimodal alignment [71] in the context of iris PAD.
What is currently underexplored, and covered in the next section, is leveraging the powerful capabilities of large multimodal foundation models, which, by seeing vast amounts of data during training, acquire the ability to solve general vision tasks, and which may offer an important component of few-shot learning approaches that adapt such models to domain-specific tasks. This paper specifically explores the application of multimodal large language models to the task of iris PAD.

2.2 Foundation Models and Iris Biometrics

MLLMs have many definitions, but the consensus is that they are multimodal AI models capable of interpreting image and text, typically by tethering a vision encoder to a Large Language Model (LLM) [4]. As newer versions of MLLMs have been released with increasingly larger parameter counts, researchers have attempted to push their capabilities into the realm of biometrics with promising results [16, 50, 22, 14]. Farmanifard et al. explored GPT-4's abilities on zero-shot evaluation of iris matching tasks, cross-modality matching, and PAD, finding that GPT-4 often succeeded in distinguishing iris pairs under a variety of challenging conditions such as occlusion and noisy environments [16]. Sony et al. evaluated 41 MLLMs on biometric tasks, including limited PAD, hypothesizing that the zero-shot baseline performance of these models can be improved with the addition of a classification head. Using a dataset consisting of both full-size and cropped images of bona fide live irises and irises wearing patterned contact lenses, the authors measured the baseline zero-shot inference accuracy of the foundation models.
Next, testing three types of classifiers (support vector machines with linear and RBF kernels, as well as logistic regression), most models, when trained on these embeddings, showed a moderate to substantial jump in classification accuracy, with DINOv2 [40] and DINO-ViTB16 [8] showing the best performance on full-size and cropped irises, respectively. Despite the success of Farmanifard's and Sony's works in demonstrating the capabilities of MLLMs for PAD, the authors only tested two attack types and, for the purposes of their studies, avoided directly prompting the model with the PAD task. It is therefore an interesting follow-up to test the capabilities of these MLLMs on a wider variety of attack types and to investigate the effect of different prompting techniques and of injecting human salience on model performance, which is the topic of this paper.

2.3 Prompt Engineering

Zero-shot prompting is the core building block from which other prompt engineering techniques derive, as it is the core functionality of foundation models, i.e., directly asking the model to perform a task or answer a query without any prior examples or injection of knowledge. Closely related are the concepts of few-shot prompting and many-shot prompting, which are both forms of in-context learning where the model is fed high-quality input-output examples to better understand a task [49, 1]. Brown et al. found that on a variety of natural language processing (NLP) benchmarking datasets, few-shot learning often beats zero-shot learning by a significant margin and in some cases even beats SotA fine-tuned models, such as on the LAMBADA dataset [7, 43]. Chain-of-Thought (CoT) is a form of instructional prompting that guides the AI model through a series of intermediate reasoning steps to elicit a better response and significantly improve the model's answers to complex queries [63].
In zero-shot CoT prompting, this can be as simple as appending the phrase "Let's think step by step" to the end of a logic puzzle or math problem, while few-shot CoT prompting involves providing examples of how to approach such problems step-by-step using CoT reasoning. Wei et al. tested few-shot CoT prompting against the standard few-shot prompting popularized by Brown et al. [7] by comparing the techniques on multiple benchmarks across five different LLM models. Their results indicate that CoT prompting outperforms few-shot prompting, particularly on multi-step reasoning tasks with larger models, and in some cases beats SotA fine-tuned models. Finally, Retrieval-Augmented Generation (RAG) is a technique that enhances a model's output by retrieving relevant information from an external knowledge base and injecting this knowledge into the prompt prior to generating a response from the model [35]. The LLMs' "memory" is limited by a fixed cut-off date at the time of training, and they can be prone to confabulations, i.e., generating plausible but factually inaccurate information [64, 25]. RAG offers a method to improve upon these issues by continuously maintaining a specialized knowledge base that enhances the LLM's baseline capabilities without the need for retraining or fine-tuning the entire model [35].

2.4 Human Salience

Previously, the inexplicability of deep learning methods, in contrast to LLMs, has been alleviated by comparison to observed human experts [47]. In many tasks, machine learning accuracy is essentially always at least as good as human accuracy [41]; however, human psychophysics has aided deep learning tasks such as handwriting [21], natural language processing [70], and scene description [23, 26]. Specifically in biometrics (including iris PAD), human saliency has been shown to complement machine saliency [6, 5].
Figure 3: From left to right (with third-party dataset sources, where appropriate): live iris (with no abnormalities) [31], StyleGAN2-generated sample, StyleGAN3-generated sample, iris wearing a textured contact lens then printed and re-captured in near-infrared light [31], synthetic sample generated by a non-deep-learning-based algorithm [52], diseased eye [58], glass prosthesis, post-mortem sample [59], iris printout [12], iris wearing a textured contact lens [15], and artificial eye [30].

3 Experimental Design

3.1 Experiments

3.1.1 Prompt Design

We performed sixteen experiments pertaining to our research questions. In each experiment we test either the Gemini 2.5 Pro model [19] (hereafter referred to simply as "Gemini") or the Llama3.2-Vision:90b model [20] (hereafter referred to simply as "Llama") on the task of binary iris PAD, once with each of eight prompt variants. During an experiment, for each test image in the dataset, we send the model a multimodal prompt consisting of the test image and a text prompt. As part of the prompt in each experiment we explicitly request a floating point number from zero to one (inclusive), with 0.0 representing a normal iris and 1.0 representing an attack sample. By thresholding the responses at 0.5, we obtain a binary classification for each test image, comparable to a traditional CNN-based classifier. Table 1 summarizes the eight prompt configurations used in experiments with each of the two models considered in this work (hence sixteen experiments). Full prompt and salience examples are presented in the Supplementary Materials. Each prompt consists of one of two base prompts with either an appended form of salience or no salience. With no salience, the experiment is an example of zero-shot learning; with salience, it is few-shot learning.
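The response-handling loop described above can be sketched as follows. The function names and the regular expression are our assumptions, as is the convention that a score of exactly 0.5 maps to the attack class; the retry behavior reflects the paper's later note that prompts are re-sent until the response contains a parseable number.

```python
import re

FLOAT_RE = re.compile(r"[-+]?\d*\.?\d+")

def first_float(text):
    """Extract the first floating point number from a model response, or None."""
    m = FLOAT_RE.search(text)
    return float(m.group()) if m else None

def query_score(query_model, prompt, image, max_tries=5):
    """Re-send the prompt until the response contains a parseable score in [0, 1]."""
    for _ in range(max_tries):
        score = first_float(query_model(prompt, image))
        if score is not None and 0.0 <= score <= 1.0:
            return score
    raise RuntimeError("no parseable score after retries")

def to_decision(score, threshold=0.5):
    """0.0 = normal iris, 1.0 = attack sample; thresholded at 0.5 as in the paper."""
    return "attack" if score >= threshold else "bona fide"

# Toy stand-in for an MLLM endpoint: fails to give a number once, then answers.
responses = iter(["I cannot say.", "My confidence is 0.8"])
score = query_score(lambda p, img: next(responses), "Is this a real iris?", None)
print(score, to_decision(score))  # 0.8 attack
```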
The first and most basic prompt is the short prompt, which simply asks the model "Is this a real and healthy iris or a synthetic/unhealthy iris?" and then requests the floating point number. The second base prompt is the long prompt, which was crafted in an expert-in-the-loop feedback session using Claude Sonnet 4 [2], with the following structure:

Role and Task: Take on the role of an expert biometrics examiner and understand the iris PAD task as described.

Analysis Framework and Classification Context: Provide an assessment including, but not limited to, texture, reflections, artifacts, lighting, and other anomalous indicators.

Required Output Format and Instructions: Specifically provide a classification label, related confidence, and a short explanation of the decision.

Figure 4: Authentic (left) and synthetically generated by a StyleGAN2 model (right) iris images, along with the expert and non-expert descriptions ("Human Text Annotations" in Fig. 1).

Examiner Responses:
• Expert Examiner: "I think this one is normal. I see specular highlights, the main specular highlight, and then two Purkinje reflections which I think are in the correct place, so it will be difficult for StyleGAN to mimic this, I think. There's a reflection from the nose on the right, the eyelashes are okay. Final answer normal."
• Non-Expert Examiner: "Abnormal. Something seems off about that iris. Maybe it's the darker color surrounding the iris, but I don't like that it has, that it looks like they have spikes around it. Other than that, it seems legit. I just really don't like the pupil here. Final answer abnormal."

Examiner Responses:
• Expert Examiner: "This one is difficult, but I would say normal. I see the Purkinje reflection which is a good sign. But I also see something weird happening with the eyelids and eyelashes near the corner, which means that maybe it is a synthetic image. StyleGAN actually did a really good job here in this case. So I think I will change my mind. Final answer abnormal."
• Non-Expert Examiner: "This looks normal to me. Wait, no it doesn't. There's flash in weird parts of the eye on the top and bottom of the iris and on the left and right too, not on the iris but in the corners of the eye. There's also that like dark patch on top, so I think this is abnormal, final answer."

Table 1: Prompt configurations used in experiments with each of the models considered in this work (Gemini and Llama).

Prompt           In-context learning    Salience type
Short                                   None
Short + Human    ✓                      Human description
Short + Llama    ✓                      Llama MESH
Short + Gemini   ✓                      Gemini MESH
Long                                    None
Long + Human     ✓                      Human description
Long + Llama     ✓                      Llama MESH
Long + Gemini    ✓                      Gemini MESH

As shown in Table 1, we also test appending salience to the base prompts, one salience entry for each attack type, derived from a sequestered dataset. The "Human description" salience refers to raw transcriptions from humans performing iris PAD, as described in Sec. 3.3.2. The "Llama MESH" and "Gemini MESH" refer to text descriptions of a given image being classified, created by the MLLM as described in Sec. 3.3.4.

3.1.2 Accessing the MLLMs

We accessed Gemini through Google's API and Llama through the OpenWebUI API. In both cases, we use the default values for temperature, unlimited maximum tokens (including thinking tokens), and automatically retry until a floating point number is included in the model's response.

3.1.3 Iris PAD Baselines

The CNN baseline used in this work is the iris PAD-specific model that uses human salience to guide model training [6].
When training our instance of the CNN, we followed the procedure outlined in Boyd's work: 10 independently trained model weight sets (for statistical assessment of training-related uncertainty), architecture instantiated from the DenseNet-121 model [24] (for comparison purposes we do not test other architectures), Stochastic Gradient Descent optimizer with a learning rate of 0.0005, each model trained and tested in a leave-one-out attack-type scenario for 50 epochs, and use of the final weights obtained after 50 epochs. We do not replicate the human recognition experiments (to serve as a human recognition baseline); instead we report (with the original authors' permission) the values from [5]. According to these numbers, humans are exceptionally good at solving the binary iris PAD task, making this baseline much stricter than the CNN baseline and providing the target for our tested models to outperform.

3.2 Metrics

Traditional threshold-based evaluation for iris PAD, recommended by ISO/IEC 30107, would report the Attack Presentation Classification Error Rate (APCER) at a chosen Bona fide Presentation Classification Error Rate (BPCER) [27]. However, we find it necessary to grade APCER and BPCER at a fixed threshold of 0.5 for each attack type and for live samples, and then aggregate these scores for each model using the Mean Squared Error (MSE). We motivate this choice by the fact that MLLM confidence outputs are typically sparse and discrete, e.g., 0.2, 0.8, 0.95, 1.0 (as illustrated later in Fig. 6), as they are not a calculation but a response to the prompt, making continuous threshold analysis (such as a Detection-Error-Tradeoff curve) unreliable. For each class the target APCER or BPCER is 0.00, and the aggregate is thus a measure of how far the model deviates from that perfect score over all classes. The worst possible MSE is thus 1.00 and the best possible is 0.00.
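Under our reading, the aggregate is the mean of squared per-class error rates (each attack class's APCER, plus the live class's BPCER), all computed at the fixed 0.5 threshold. A sketch of this metric, with function names and the score-at-threshold convention being our assumptions:

```python
def class_error_rate(scores, is_attack, threshold=0.5):
    """APCER for an attack class (fraction of attacks scored below threshold),
    or BPCER for the live class (fraction of bona fides scored at/above it)."""
    if is_attack:
        errors = sum(s < threshold for s in scores)   # missed attacks
    else:
        errors = sum(s >= threshold for s in scores)  # rejected bona fides
    return errors / len(scores)

def aggregate_mse(per_class_scores):
    """per_class_scores: dict mapping class name -> (scores, is_attack).
    Returns the mean of squared per-class error rates; 0.0 is perfect, 1.0 is worst."""
    rates = [class_error_rate(s, atk) for s, atk in per_class_scores.values()]
    return sum(r * r for r in rates) / len(rates)

demo = {
    "live":     ([0.1, 0.2, 0.9, 0.0], False),  # one bona fide rejected -> BPCER 0.25
    "printout": ([0.8, 0.95, 0.4, 1.0], True),  # one attack missed      -> APCER 0.25
}
print(aggregate_mse(demo))  # 0.0625
```

Squaring before averaging penalizes a model that fails badly on one class more than one with small errors spread across classes, which matches the paper's goal of grading per-class behavior rather than a pooled error rate.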
As there is only one set of human responses, one set of Llama-based responses, and one set of Gemini-based responses, we do not calculate a mean or standard deviation for these values. However, for the CNN baseline we report the mean and standard deviation obtained over 10 independent train-test runs.

3.3 Datasets

3.3.1 Iris Test Images

In order to make a valid comparison with the human and CNN baselines, we use the established iris PAD test set with human-annotated saliency maps from [6], consisting of the following classes: (a) normal irises [9, 34, 18, 56, 48, 33, 67, 62, 68], (b) artificial eyes (e.g., glass prostheses) [34], (c) iris printouts [34, 18, 32], (d) eyes wearing textured contacts [34, 33, 67, 68], (e) printouts of irises wearing textured contact lenses (later referred to simply as "contacts+print") [34], (f) diseased eyes [56, 57], (g) post-mortem eyes [60], and (h) synthetically-generated iris images [65]. However, due to cost constraints associated with running a commercial MLLM (Gemini), we sample 30 images from each class (or all images, if there are fewer than 30) to make a final dataset of 224 test images (see Tab. 2). We confirmed the statistical validity of our test sample size in two ways: 1) we performed a learning curve analysis (see Fig. 5), wherein MSE is repeatedly recalculated as samples are added, to check that the metric converges, and 2) we used the Wilcoxon [10] and Mann-Whitney U [17] tests to compare multiple runs of samples (see Supplementary Materials for the full statistical analysis). We evaluate the models with a low number of samples because we leverage the abilities of MLLMs trained on massive amounts of data.
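The learning-curve check can be sketched as follows: samples from a class are consumed one at a time and the error estimate is recomputed after each addition, so convergence can be read off the resulting sequence. The data below is synthetic and purely illustrative; the paper's actual analysis recomputes the full aggregate MSE over all classes.

```python
def running_error(scores, labels, threshold=0.5):
    """Recompute the misclassification rate after each added sample,
    mimicking the learning-curve analysis for a single class."""
    curve, errors = [], 0
    for n, (s, y) in enumerate(zip(scores, labels), start=1):
        errors += int((s >= threshold) != y)   # y: True for attack samples
        curve.append(errors / n)
    return curve

# Synthetic illustration for one attack class: the estimate stabilizes
# as more samples are added (here at an error rate of 0.2).
scores = [0.9, 0.8, 0.2, 0.95, 1.0, 0.85, 0.1, 0.9, 0.8, 0.9]
labels = [True] * 10
curve = running_error(scores, labels)
print(curve[0], curve[-1])  # 0.0 0.2
```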
The purpose of this evaluation is to estimate generalization performance on such a set-up. We further recommend this kind of analysis for works with access to small sample sizes. Note that the iris PAD test set is image-disjoint from the provided training set in [6] used to recreate the CNN baseline, and from the set used to derive the human and machine-expanded saliency from human (MESH) salience, described in the following subsections.

Figure 5: Learning curve analysis of the eight Gemini experiments, showing that the MSE scores converge before even 25 samples from each attack type.

Table 2: Attack types and number of samples.

Attack type        Number of samples
Live iris          30
Artificial         14
Contacts+print     30
Diseased           30
Post-mortem        30
Printout           30
Synthetic          30
Textured contact   30

3.3.2 Verbal Human Saliency Acquisition

We conducted a data collection involving 70 non-experts and 8 experts in iris PAD. Participants were tasked with identifying samples as either "normal" or "abnormal," and were then asked to verbally describe what led them to their decision. To ensure coherency, participants were given definitions of normal and abnormal that matched the division between attack (artificial, printout, diseased, post-mortem, synthetic, and irises with textured contacts) and non-attack (live and healthy irises) samples. To record the verbal descriptions, participants wore a MeetSummer X1 Lavalier wireless microphone (purchased in February 2024). Ethnicity, gender, and age were documented for all participants.
Expert participants were recruited from university faculty and PhD students who had been actively researching iris recognition or iris PAD for at least one year, while non-expert participants were university students and faculty with no biometric research experience. The study was approved by the Institutional Review Board (IRB), and each participant signed a consent form allowing us to release anonymized data. The iris images used in the human saliency collection were sourced from the dataset used previously in a study conducted by Boyd et al. [5]. In addition to these images, 100 synthetic iris samples were generated using StyleGAN2 and StyleGAN3, 50 for each category [29, 28, 55]. These images were divided into 16 unique decks, carefully balanced to contain the same number of samples per anomaly type, ensuring that all decks were equal in difficulty and that sufficient verbal descriptions were collected for each sample. Each session began with a brief training, regardless of expertise, to ensure that everyone had a baseline understanding of what constitutes normal and attack samples; this training included sample verbal descriptions. Samples shown during training were excluded from the actual experiment. Once a session started, each participant was shown a single image per slide and was instructed to first inspect the image and classify it as either normal or attack. Next, participants were instructed to verbally justify their answers. The instructions for the verbal description phase were intentionally open-ended, so as not to instill arbitrary guidelines that might skew the authenticity of the results. Participants were advised to be as thorough as they felt necessary and told that no time limit was imposed.
Instead, they were encouraged to focus on what features stood out during the initial phase, or on their thought process when diagnosing the image. Upon completing the verbal description phase, participants announced their final answer followed by the word "done," at which point the moderator advanced to the next sample, continuing until completion.

3.3.3 Data Post-Processing

The audio recordings of each participant's session were transcribed to text using a local installation of Whisper [46] and manually inspected to correct transcription errors, while verbal filler words were retained to preserve the authenticity of the original descriptions. These final corrected transcripts were designated the human saliency descriptions and serve as the verbal saliency source for the human-injected experimental results. The next step was to create the Machine-Expanded Saliency from Human (MESH) descriptions. The inspiration for MESH descriptions came from observing how the majority of participants approached the verbal description phase. Our guidelines asked participants to identify the features that stood out as critical to their decision, or to narrate their thought process during the evaluation; we deliberately avoided strict guidelines so as not to elevate features that were not critical to the decision-making process. While this choice preserves only the most important features and observations, it leaves the verbal descriptions incomplete when viewed as image descriptions: read without the iris sample in view, an individual human saliency description lacks context. MESH descriptions, by contrast, are intended to be thorough, comprehensive image descriptions elevated by the observations of the human subjects.
We therefore leveraged two MLLMs, Llama3.2-Vision:90b and Gemini 2.5 Pro [20, 19], to turn the human saliency descriptions into robust image descriptions that, read independently of the image, explain the image as a whole much more thoroughly. The end result is two different sets of MESH descriptions for each image in the dataset, referred to as "Llama MESH" and "Gemini MESH." Due to the sensitive nature of biometric data, our choice of MLLMs for generating the MESH descriptions was limited to models either approved by our institution's compliance team or deployed within a local environment.

3.3.4 MESH Prompt Creation

To make the MESH prompts, we fine-tuned a prompt with the assistance of Claude Sonnet 4 [2]. After explaining the objective to Claude and sharing iris samples and paired descriptions not constrained by the IRB, we tested variations of the prompts on Llama until we were satisfied with the results. Claude was tasked with designing a prompt that, paired with an image, would be given to Llama to extract a MESH description for that sample. Claude was given an eye sample not constrained by any existing data licensing agreement or IRB, examples of human examiner observations, and a final comprehensive image description demonstrating the desired output of the prompt it was tasked with creating. The resulting prompt was tested on Llama with relevant samples, and this feedback was relayed to Claude with an explanation of what needed to be corrected, in the form of revisions to the prompt structure. Examples of issues addressed by revisions included diseased irises initially being labeled as normal instead of attack, agreement bias in which the model would simply agree with whatever an expert said regardless of how illogical the observations were, and failure to properly synthesize all observations into a seamless final description.
The final prompt structure was as follows:

Analysis Framework: Provide an initial assessment by examining things including but not limited to texture, reflections, artifacts, lighting, and other spoofing indicators.

Examiner Feedback Evaluation: You will receive feedback formatted as examiner ID, expertise status, correct or incorrect classification, and verbal description.

Critical Synthesis: Validate human examiner observations; weigh expert technical knowledge with higher priority, but do consider non-expert intuitive insights; use classification accuracy to calibrate examiner reliability; resolve conflicts by prioritizing verifiable observations.

Required Output Format: Image Classification, Confidence, Key Features Observed, Spoofing Indicators, Examiner Integration, Technical Details, Comprehensive Iris Description.

Table 3: MSE scores for Gemini and Llama by prompt variant. The best short- and long-prompt variant for each model is marked with *.

Model           Prompt variant    MSE
Human Subjects                    0.062
Baseline CNN                      0.345 ± 0.041
Gemini          Short             0.416
                Short + Human     0.183
                Short + Llama     0.273
                Short + Gemini*   0.118
Gemini          Long              0.062
                Long + Human*     0.053
                Long + Llama      0.087
                Long + Gemini     0.078
Llama           Short             0.422
                Short + Human     0.190
                Short + Llama     0.267
                Short + Gemini*   0.125
Llama           Long              0.411
                Long + Human*     0.074
                Long + Llama      0.098
                Long + Gemini     0.125

At a high level, the MESH description process starts by submitting the image to the model, which assesses it according to the analysis framework and generates a baseline description of the sample. Next, the model, either Llama or Gemini, considers the human examiner feedback and looks for agreement between its own observations and the humans', with emphasis on correct human observations and priority given to expert observations.
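The prompt-assembly step above can be sketched as plain string templating. This is a hypothetical illustration, not the exact prompt text: the section wording is abbreviated, and `build_mesh_prompt` and `format_feedback` are invented helper names; only the four section headings and the feedback format (examiner ID, expertise status, correctness, verbal description) come from the paper.

```python
# Abbreviated stand-ins for the four prompt sections described in Sec. 3.3.4.
SECTIONS = {
    "Analysis Framework": (
        "Provide an initial assessment by examining texture, reflections, "
        "artifacts, lighting, and other spoofing indicators."
    ),
    "Examiner Feedback Evaluation": (
        "You will receive feedback formatted as examiner ID, expertise "
        "status, correct or incorrect classification, and verbal description."
    ),
    "Critical Synthesis": (
        "Validate examiner observations; weigh expert knowledge with higher "
        "priority; use classification accuracy to calibrate reliability."
    ),
    "Required Output Format": (
        "Image Classification, Confidence, Key Features Observed, Spoofing "
        "Indicators, Examiner Integration, Technical Details, "
        "Comprehensive Iris Description."
    ),
}

def format_feedback(examiner_id, is_expert, was_correct, description):
    """Render one examiner's feedback in the paper's stated format."""
    status = "expert" if is_expert else "non-expert"
    outcome = "correct" if was_correct else "incorrect"
    return f"{examiner_id}, {status}, {outcome}: {description}"

def build_mesh_prompt(feedback_entries):
    """Concatenate the four sections plus per-examiner feedback lines."""
    parts = [f"{title}: {body}" for title, body in SECTIONS.items()]
    parts.append("Examiner feedback:")
    parts.extend(format_feedback(*entry) for entry in feedback_entries)
    return "\n\n".join(parts)

prompt = build_mesh_prompt([
    ("E01", True, True, "Iris texture looks printed; visible dot pattern."),
])
```

The assembled `prompt` would be sent to the local Llama or Gemini instance together with the image; the actual prompt used in the study was iteratively refined with Claude as described above.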
Once complete, the analysis is revised into a final comprehensive description of the image in its entirety, informed by the human examiners, as opposed to a list of specific feature observations. This process was performed with both Llama and Gemini to test for performance differences between the two MLLMs, particularly as the models differ in context size, architecture, and proprietary status.

4 Results

4.1 Answering RQ1: To what extent can general-purpose MLLMs perform binary iris PAD under naive prompting, relative to human examiners and a state-of-the-art CNN baseline?

As seen in Tab. 3, given only the short prompt with no guiding salience, Gemini approaches but does not surpass the traditional CNN baseline, with an MSE of 0.416 versus 0.345. Viewing Gemini's per-attack-type results with the short prompt variant (Tab. 4), we see that it can detect some attack types innately: it correctly identifies all of the live irises and performs better than the human baseline on printout attacks. It outperforms the CNN baseline on 4 of 8 attack types: contacts+print, post-mortem, printouts, and textured contacts. Llama performs comparably to Gemini in terms of MSE. Hence, the answer to RQ1 is mixed: while Gemini and Llama outperform the human and CNN baselines on some attack types, they are overall worse than both.

Table 4: BPCER (live iris) and APCER (each attack type) scores at threshold 0.5, by model and prompt variant. Columns correspond to the attack types.

Model           Prompt variant   Live iris    Artificial   Contacts+print  Diseased     Post-mortem  Printout     Synthetic    Textured contact
Human Subjects                   0.437        0.349        0.057           0.020        0.162        0.298        0.118        0.227
Baseline CNN                     0.000±0.000  0.236±0.143  0.990±0.016     0.227±0.099  0.900±0.061  0.647±0.118  0.377±0.146  0.460±0.168
Gemini          Short            0.000        0.769        0.700           0.833        0.833        0.267        0.833        0.300
                Short + Human    0.033        0.077        0.567           0.700        0.400        0.167        0.633        0.233
                Short + Llama    0.033        0.462        0.667           0.700        0.700        0.167        0.667        0.267
                Short + Gemini   0.200        0.077        0.333           0.767        0.367        0.133        0.067        0.200
Gemini          Long             0.167        0.083        0.222           0.517        0.233        0.074        0.241        0.172
                Long + Human     0.300        0.000        0.233           0.433        0.133        0.133        0.167        0.167
                Long + Llama     0.133        0.077        0.400           0.533        0.300        0.167        0.267        0.200
                Long + Gemini    0.367        0.000        0.233           0.467        0.300        0.100        0.300        0.167
Llama           Short            0.167        0.692        0.759           0.800        0.533        0.467        0.724        0.793
                Short + Human    0.233        0.385        0.633           0.433        0.400        0.433        0.400        0.467
                Short + Llama    0.400        0.583        0.643           0.533        0.433        0.433        0.400        0.633
                Short + Gemini   1.000        0.000        0.000           0.000        0.000        0.000        0.000        0.000
Llama           Long             0.048        1.000        0.688           0.667        0.412        0.381        0.529        0.880
                Long + Human     0.640        0.286        0.115           0.208        0.048        0.000        0.050        0.200
                Long + Llama     0.500        0.429        0.263           0.364        0.105        0.053        0.143        0.333
                Long + Gemini    1.000        0.000        0.000           0.000        0.000        0.000        0.000        0.000

4.2 Answering RQ2: By how much does a structured, task-scaffolded prompt improve the MLLMs' iris PAD performance?

Using the long, engineered prompt yields much better results for Gemini, but almost no improvement for Llama. Gemini both surpasses the traditional CNN and matches the human results, with an MSE of 0.062 against scores of 0.345 and 0.062, respectively. Overall, Gemini with the long prompt is the second-best model tested.
On a per-attack-type basis, Llama with the long prompt outperforms the human baseline and Gemini only on identifying normal irises, but outperforms the CNN on two attack types: contacts+print and post-mortem. Gemini outperforms the human subjects on normal, artificial, printout, and textured contact irises (half of the categories), and outperforms the CNN on all but normal and diseased irises. Hence, the answer to RQ2 is that a structured prompt can improve general MLLM performance on iris PAD beyond that of comparable CNNs and even human subjects.

4.3 Answering RQ3: Does injecting human salience improve the MLLMs' iris PAD performance, and is the gain consistent across models?

In all cases, including human salience examples in the prompt improves performance over the base prompt. Gemini with the short prompt plus human salience outperforms the baseline CNN, with an MSE of 0.183 against 0.345. Gemini with the long prompt plus human salience performs best of all tests and is the only experiment to outperform the human subjects, with an MSE of 0.053 against 0.062. Llama also sees a performance boost, outscoring the baseline CNN with both the short and long prompts plus human salience. On a per-attack-type basis, Llama with the long prompt plus human salience beats the baseline CNN on five of eight categories, and Gemini with the long prompt plus human salience outperforms the human subjects on all but three attack types: contacts+print, diseased, and synthetic. Thus, the answer to RQ3 is that, just as human salience allowed CNNs to improve on iris PAD, injecting human salience into MLLM prompts allows both models to outperform the comparable CNN, and allows Gemini to outperform even the humans.
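The metrics reported in Tabs. 3 and 4 can be computed as follows. This is a minimal sketch under stated assumptions: scores are confidences in [0, 1] with 1 meaning "attack," labels are 0 for bona fide and 1 for attack, the MSE is taken between score and label, and APCER/BPCER follow the standard ISO/IEC 30107-3 definitions [27] at the paper's threshold of 0.5.

```python
def mse(scores, labels):
    """Mean squared error between confidence scores and 0/1 ground truth
    (0 = live/bona fide, 1 = attack)."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)

def apcer(scores, labels, threshold=0.5):
    """Attack Presentation Classification Error Rate: fraction of attack
    samples misclassified as bona fide."""
    attacks = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s < threshold for s in attacks) / len(attacks)

def bpcer(scores, labels, threshold=0.5):
    """Bona Fide Presentation Classification Error Rate: fraction of bona
    fide samples misclassified as attacks."""
    bona_fide = [s for s, y in zip(scores, labels) if y == 0]
    return sum(s >= threshold for s in bona_fide) / len(bona_fide)

# Toy example: four confidence scores against ground-truth labels.
scores = [0.9, 0.2, 0.7, 0.4]
labels = [1, 0, 1, 1]
```

In the paper, APCER is reported per attack type (one column of Tab. 4 per type), while BPCER is the single live-iris column.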
4.4 Answering RQ4: Can LLM-expanded human descriptions (MESH: Machine-Expanded Saliency from Human) further enhance iris PAD performance, and under which prompt regimes and base models does it help most?

In all cases, including a MESH (from either Llama or Gemini) improves MLLM performance beyond the baseline CNN. However, neither MLLM improves over the human baseline with MESH-injected prompts. Only with the short prompt does including MESH improve over the prompt plus human salience; this higher-performing MESH came from Gemini, while the Llama MESH yielded a smaller performance boost.

4.5 Observations

We note several key observations from all of our experiments:

a) Including human salience in well-engineered prompts achieves state-of-the-art iris PAD performance. Including human exemplars not only outperforms traditional CNNs trained with human visual salience, but can allow larger models to outperform even humans.

b) In the absence of a well-engineered prompt, MLLMs can expand on salience to their own benefit. The short prompt represents a poorly engineered prompt. In this case, while including human salience did improve performance (Tab. 3), it did not improve performance as much as MESH salience did. The MLLMs are able to expand human descriptions with task-relevant details that help them.

Figure 6: Histograms of the test-set iris PAD classification confidence values generated by the two MLLMs: (a) Gemini using the short prompt with human salience; (b) Llama using the short prompt with human salience. We see that Llama is more capable of expressing uncertainty.

c) Gemini performs better, but local Llama is more capable of expressing uncertainty.
While Gemini's best performance ultimately achieves a lower MSE than Llama's, neither is perfect. In Fig. 6 we see that Gemini produces essentially bimodal confidence values, while Llama is more able to place its confidence in the middle of the range. (Histograms for all experiments are available in the Supplementary Materials.)

d) General MLLMs are capable of overfitting. For Llama, both the short and long prompts with Gemini MESH salience led to classifying every iris as an attack attempt. The Gemini MESH is the longest prompt, and this overfitting may be due to its length: the Gemini MESH contains on average three times more text tokens than the Llama MESH (estimates made using the OpenAI tokenizer [39]), and both of these saliency types add significantly more tokens to the overall text input. However, context window size is not the issue, as we estimate the maximum prompt, including the image, at only 16,000 tokens, well below Llama's 128,000-token limit. Recent findings from Sun et al. [51] suggest that MLLMs suffer visual attention degradation as long-chain text reasoning increases. While the models tested in Sun's experiments did not include Llama 3.2-Vision:90b, many of the models that suffer from this problem use CLIP-based [45] vision backbones and share similar architectures with Llama, according to their PyTorch implementations [44, 36].

5 Limitations and Future Work

The principal complications of this study relate to (a) expense and (b) compliance. Due to the sensitive nature of biometric (in this case, iris) data, we used the only commercial model for which we could obtain a security guarantee, based on a bilateral institutional agreement. Such agreements are rare, so the portfolio of commercial MLLMs that can be used without compromising subjects' privacy, and thus breaking the IRB protocols, is very small.
Due to the expense of running Gemini and hosting Llama, we sub-sampled our dataset to the smallest but still statistically viable size. Furthermore, this study exhibits only the two extremes of prompting: the short, minimal prompt and the long, expert-engineered prompt. Future work will delve into the components of prompting to bridge the gap between these two. As Llama exhibited the ability to express uncertainty, other future work will explore MLLM calibration and adversarial robustness. We also note that model strengths tend to be complementary across the different attack types, indicating the potential for future ensemble systems. Last but not least, a further study should explore the effect of using MLLMs to justify decisions to human examiners as part of trustworthy AI.

6 Conclusion

The main and important conclusion from this work is that generalist MLLMs can perform a rather niche iris spoofing detection task given a scaffold of well-structured prompts augmented with human salience. These MLLMs can consistently outperform salience-based CNNs, and the best MLLM prompt variant, Gemini with the long, well-engineered prompt and raw human textual information, can outperform human subjects. Prompt engineering makes a significant difference for this specialized task: the short prompt variant performed poorly regardless of MLLM, but with the addition of MESH salience it could achieve a lower MSE than the CNN. Ultimately, prompting with human salience makes using an off-the-shelf MLLM, even a local one, viable for iris PAD and opens up opportunities for new implementations in the field. This paper offers a dataset of human verbal descriptions for the existing iris PAD benchmark [5] to facilitate follow-up studies and replicability of this work.²

7 Acknowledgments

This work was supported by the U.S. Department of Defense (Contract No.
W52P1J-20-9-3009) and by the National Science Foundation (Grant No. 2237880). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation, the U.S. Department of Defense, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation hereon.

² Instructions on how to request a copy of the dataset will be available at https://github.com/CVRL/Multimodal-LLMs-Biometric-Expertise in the event of this paper being accepted.

References

[1] Rishabh Agarwal et al. "Many-shot in-context learning". In: Advances in Neural Information Processing Systems (NeurIPS) 37 (2024), pp. 76930–76966.
[2] Anthropic. Claude Sonnet 4. https://claude.ai/share/6deb0e63-bb41-4a02-833a-431ad4be5b33. Sept. 2025.
[3] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. "Multimodal Machine Learning: A Survey and Taxonomy". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 41.2 (2019), pp. 423–443. doi: 10.1109/TPAMI.2018.2798607.
[4] Florian Bordes et al. An Introduction to Vision-Language Modeling. 2024. arXiv: 2405.17247 [cs.LG]. url: https://arxiv.org/abs/2405.17247.
[5] Aidan Boyd, Kevin W. Bowyer, and Adam Czajka. "Human-aided saliency maps improve generalization of deep learning". In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2022, pp. 2735–2744.
[6] Aidan Boyd et al. "CYBORG: Blending Human Saliency Into The Loss Improves Deep Learning-based Synthetic Face Detection". In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023, pp. 6108–6117.
[7] Tom Brown et al. "Language models are few-shot learners".
In: Advances in Neural Information Processing Systems (NeurIPS) 33 (2020), pp. 1877–1901.
[8] Mathilde Caron et al. Emerging Properties in Self-Supervised Vision Transformers. 2021. arXiv: 2104.14294 [cs.CV]. url: https://arxiv.org/abs/2104.14294.
[9] Chinese Academy of Sciences Institute of Automation (CASIA) Datasets Webpage. Accessed: 03-12-2021. url: http://www.cbsr.ia.ac.cn/china/Iris%20Databases%20CH.asp.
[10] W. J. Conover. Practical Nonparametric Statistics. 3rd ed. Wiley Series in Probability and Statistics. Nashville, TN: John Wiley & Sons, Dec. 1998.
[11] Colton R. Crum and Adam Czajka. "MENTOR: Human Perception-Guided Pretraining for Increased Generalization". In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE. 2025, pp. 7470–7479.
[12] Adam Czajka. "Database of iris printouts and its application: Development of liveness detection method for iris recognition". In: 2013 18th International Conference on Methods and Models in Automation and Robotics (MMAR). 2013, pp. 28–33. doi: 10.1109/MMAR.2013.6669876.
[13] Adam Czajka and Kevin W. Bowyer. "Presentation attack detection for iris recognition: An assessment of the state-of-the-art". In: ACM Computing Surveys (CSUR) 51.4 (2018), pp. 1–35.
[14] Ivan Deandres-Tame et al. "How good is ChatGPT at face biometrics? A first look into recognition, soft biometrics, and explainability". In: IEEE Access 12 (2024), pp. 34390–34401.
[15] James S. Doyle, Patrick J. Flynn, and Kevin W. Bowyer. "Automated classification of contact lens type in iris images". In: 2013 International Conference on Biometrics (ICB). Madrid, Spain: IEEE, 2013, pp. 1–6. doi: 10.1109/ICB.2013.6612954.
[16] Parisa Farmanifard and Arun Ross. "ChatGPT Meets Iris Biometrics". In: 2024 IEEE International Joint Conference on Biometrics (IJCB). 2024, pp. 1–10. doi: 10.1109/IJCB62174.2024.10744525.
[17] Michael P. Fay and Michael A.
Proschan. "Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules". In: Statistics Surveys 4 (Jan. 2010). issn: 1935-7516. doi: 10.1214/09-ss051. url: http://dx.doi.org/10.1214/09-SS051.
[18] Javier Galbally et al. "Iris liveness detection based on quality related features". In: 2012 5th IAPR International Conference on Biometrics (ICB). IEEE. 2012, pp. 271–276.
[19] Google. Gemini. https://gemini.google.com. Sept. 2025.
[20] Aaron Grattafiori et al. The Llama 3 Herd of Models. 2024. arXiv: 2407.21783 [cs.AI]. url: https://arxiv.org/abs/2407.21783.
[21] Samuel Grieggs et al. "Measuring human perception to improve handwritten document transcription". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 44.10 (2021), pp. 6594–6601.
[22] Ahmad Hassanpour et al. "ChatGPT and biometrics: an assessment of face recognition, gender detection, and age estimation capabilities". In: 2024 IEEE International Conference on Image Processing (ICIP). IEEE. 2024, pp. 3224–3229.
[23] Sen He et al. "Human attention in image captioning: Dataset and analysis". In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 8529–8538.
[24] Gao Huang et al. "Densely Connected Convolutional Networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). July 2017.
[25] Lei Huang et al. "A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions". In: ACM Transactions on Information Systems 43.2 (2025), pp. 1–55.
[26] Yupan Huang, Zhaoyang Zeng, and Yutong Lu. "Be Specific, Be Clear: Bridging Machine and Human Captions by Scene-Guided Transformer". In: Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding. 2021, pp. 4–13.
[27] ISO/IEC 30107-3:2017.
Information technology — Biometric presentation attack detection — Part 3: Testing and reporting. International Standard. International Organization for Standardization, 2017.
[28] Tero Karras et al. "Alias-free generative adversarial networks". In: Advances in Neural Information Processing Systems (NeurIPS) 34 (2021), pp. 852–863.
[29] Tero Karras et al. "Analyzing and improving the image quality of StyleGAN". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020, pp. 8110–8119.
[30] Dongik Kim et al. "An empirical study on iris recognition in a mobile phone". In: Expert Systems with Applications 54 (2016), pp. 328–339. issn: 0957-4174. doi: 10.1016/j.eswa.2016.01.050. url: https://www.sciencedirect.com/science/article/pii/S0957417416300148.
[31] Naman Kohli et al. "Detecting medley of iris spoofing attacks using DESIST". In: 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS). 2016, pp. 1–6. doi: 10.1109/BTAS.2016.7791168.
[32] Naman Kohli et al. "Detecting medley of iris spoofing attacks using DESIST". In: 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS). IEEE. 2016, pp. 1–6.
[33] Naman Kohli et al. "Revisiting iris recognition with color cosmetic contact lenses". In: 2013 International Conference on Biometrics (ICB). IEEE. 2013, pp. 1–7.
[34] Sung Joo Lee et al. "Multifeature-based fake iris detection method". In: Optical Engineering 46.12 (2007), pp. 127204–127204.
[35] Patrick Lewis et al. "Retrieval-augmented generation for knowledge-intensive NLP tasks". In: Advances in Neural Information Processing Systems (NeurIPS) 33 (2020), pp. 9459–9474.
[36] torchtune maintainers and contributors. torchtune: PyTorch's finetuning library. Apr. 2024. url: https://github.com/pytorch/torchtune.
[37] Leland McInnes et al.
"UMAP: Uniform Manifold Approximation and Projection". In: Journal of Open Source Software 3.29 (2018), p. 861. doi: 10.21105/joss.00861.
[38] Kien Nguyen, Hugo Proença, and Fernando Alonso-Fernandez. "Deep Learning for Iris Recognition: A Survey". In: ACM Comput. Surv. 56.9 (Apr. 2024). issn: 0360-0300. doi: 10.1145/3651306. url: https://doi.org/10.1145/3651306.
[39] OpenAI Platform: Tokenizer – Learn about language model tokenization. Accessed: 02-16-2026. url: https://platform.openai.com/tokenizer.
[40] Maxime Oquab et al. DINOv2: Learning Robust Visual Features without Supervision. 2024. arXiv: 2304.07193 [cs.CV]. url: https://arxiv.org/abs/2304.07193.
[41] Alice J. O'Toole et al. "Comparing face recognition algorithms to humans on challenging tasks". In: ACM Transactions on Applied Perception (TAP) 9.4 (2012), pp. 1–13.
[42] Hatef Otroshi Shahreza and Sébastien Marcel. "Foundation Models and Biometrics: A Survey and Outlook". In: IEEE Transactions on Information Forensics and Security 20 (2025), pp. 9113–9138. doi: 10.1109/TIFS.2025.3602233.
[43] Denis Paperno et al. "The LAMBADA dataset: Word prediction requiring a broad discourse context". In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Ed. by Katrin Erk and Noah A. Smith. Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 1525–1534. doi: 10.18653/v1/P16-1144. url: https://aclanthology.org/P16-1144/.
[44] Adam Paszke et al. "PyTorch: An imperative style, high-performance deep learning library". In: Advances in Neural Information Processing Systems (NeurIPS) 32 (2019).
[45] Alec Radford et al. "Learning transferable visual models from natural language supervision". In: International Conference on Machine Learning (ICML). PMLR. 2021, pp. 8748–8763.
[46] Alec Radford et al.
"Robust speech recognition via large-scale weak supervision". In: International Conference on Machine Learning (ICML). PMLR. 2023, pp. 28492–28518.
[47] Brandon RichardWebster et al. "Visual psychophysics for making face recognition algorithms more explainable". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 252–270.
[48] Ioannis Rigas and Oleg V. Komogortsev. "Eye movement-driven defense against iris print-attacks". In: Pattern Recognition Letters 68 (2015), pp. 316–326.
[49] Pranab Sahoo et al. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. 2025. arXiv: 2402.07927 [cs.AI].
[50] Redwan Sony et al. Benchmarking Foundation Models for Zero-Shot Biometric Tasks. 2025. arXiv: 2505.24214 [cs.CV]. url: https://arxiv.org/abs/2505.24214.
[51] Hai-Long Sun et al. "Mitigating visual forgetting via take-along visual conditioning for multi-modal long CoT reasoning". In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025, pp. 5158–5171.
[52] Tieniu Tan. CASIA-Iris-Syn (part of CASIA-IrisV4). 2024. url: https://hycasia.github.io/dataset/casia-irisv4/.
[53] Juan E. Tapia, Lázaro Janier González-Soler, and Christoph Busch. "Towards Iris Presentation Attack Detection with Foundation Models". In: 2025 IEEE 19th International Conference on Automatic Face and Gesture Recognition (FG). 2025, pp. 1–5. doi: 10.1109/FG61629.2025.11099412.
[54] Gemma Team et al. Gemma: Open Models Based on Gemini Research and Technology. 2024. arXiv: 2403.08295 [cs.CL]. url: https://arxiv.org/abs/2403.08295.
[55] Patrick Tinsley et al. "Iris Liveness Detection Competition (LivDet-Iris) – The 2023 Edition". In: 2023 IEEE International Joint Conference on Biometrics (IJCB). IEEE. 2023, pp. 1–10.
[56] Mateusz Trokielewicz, Adam Czajka, and Piotr Maciejewicz. "Assessment of iris recognition reliability for eyes affected by ocular pathologies". In: 2015 IEEE 7th International Conference on Biometrics Theory, Applications and Systems (BTAS). IEEE. 2015, pp. 1–6.
[57] Mateusz Trokielewicz, Adam Czajka, and Piotr Maciejewicz. "Assessment of iris recognition reliability for eyes affected by ocular pathologies". In: 2015 IEEE 7th International Conference on Biometrics Theory, Applications and Systems (BTAS). IEEE. 2015, pp. 1–6.
[58] Mateusz Trokielewicz, Adam Czajka, and Piotr Maciejewicz. "Database of iris images acquired in the presence of ocular pathologies and assessment of iris recognition reliability for disease-affected eyes". In: 2015 IEEE 2nd International Conference on Cybernetics (CYBCONF). 2015, pp. 495–500. doi: 10.1109/CYBConf.2015.7175984.
[59] Mateusz Trokielewicz, Adam Czajka, and Piotr Maciejewicz. "Iris Recognition After Death". In: IEEE Transactions on Information Forensics and Security 14.6 (2019), pp. 1501–1514. doi: 10.1109/TIFS.2018.2881671.
[60] Mateusz Trokielewicz, Adam Czajka, and Piotr Maciejewicz. "Post-mortem iris recognition with deep-learning-based image segmentation". In: Image and Vision Computing 94 (2020), p. 103866.
[61] Mateusz Trokielewicz, Adam Czajka, and Piotr Maciejewicz. "Presentation attack detection for cadaver iris". In: 2018 IEEE 9th International Conference on Biometrics Theory, Applications and Systems (BTAS). IEEE. 2018, pp. 1–10.
[62] Warsaw University of Technology Datasets Webpage. http://zbum.ia.pw.edu.pl/EN/node/46. 2013.
[63] Jason Wei et al. "Chain-of-thought prompting elicits reasoning in large language models". In: Advances in Neural Information Processing Systems (NeurIPS) 35 (2022), pp. 24824–24837.
[64] Jason Wei et al. "Emergent Abilities of Large Language Models".
In: Transactions on Machine Learning Research (2022). Survey Certification. issn: 2835-8856. url: https://openreview.net/forum?id=yzkSU5zdwD.
[65] Zhuoshi Wei, Tieniu Tan, and Zhenan Sun. "Synthesis of large realistic iris databases using patch-based sampling". In: 2008 19th International Conference on Pattern Recognition. IEEE. 2008, pp. 1–4.
[66] Shivangi Yadav and Arun Ross. "Synthesizing iris images using generative adversarial networks: survey and comparative analysis". In: arXiv preprint arXiv:2404.17105 (2024).
[67] David Yambay et al. "LivDet-Iris 2015 – Iris Liveness Detection Competition 2015". In: 2017 IEEE International Conference on Identity, Security and Behavior Analysis (ISBA). 2017, pp. 1–6. doi: 10.1109/ISBA.2017.7947701.
[68] David Yambay et al. "LivDet-Iris 2017 – Iris Liveness Detection Competition 2017". In: 2017 IEEE International Joint Conference on Biometrics (IJCB). 2017, pp. 733–741. doi: 10.1109/BTAS.2017.8272763.
[69] Xiaohua Zhai et al. "Sigmoid loss for language image pre-training". In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023, pp. 11975–11986.
[70] Ruohan Zhang et al. "Human gaze assisted artificial intelligence: A review". In: IJCAI: Proceedings of the Conference. NIH Public Access. 2020, p. 4951.
[71] Tian Zhang et al. "Cross-Device Iris Presentation Attack Detection Based on Image-Text Multimodal Alignment". In: Proceedings of the 2024 2nd Asia Symposium on Image and Graphics (ASIG '24). New York, NY, USA: Association for Computing Machinery, 2025, pp. 199–206. isbn: 9798400709906. doi: 10.1145/3718441.3718472. url: https://doi.org/10.1145/3718441.3718472.
