Why Aggregate Accuracy is Inadequate for Evaluating Fairness in Law Enforcement Facial Recognition Systems


Khalid Adnan Alsayed
BSc (Hons) Artificial Intelligence, Teesside University, United Kingdom
Email: F5044605@live.tees.ac.uk

Abstract— Facial recognition systems are increasingly deployed in law enforcement and security contexts, where algorithmic decisions can carry significant societal consequences. Despite high reported accuracy, growing evidence demonstrates that such systems often exhibit uneven performance across demographic groups, leading to disproportionate error rates and potential harm. This paper argues that aggregate accuracy is an insufficient metric for evaluating the fairness and reliability of facial recognition systems in high-stakes environments. Through analysis of subgroup-level error distribution, including false positive and false negative rates, we demonstrate how overall performance metrics can obscure critical disparities across demographic groups. Drawing on existing literature and empirical observations from classification-based systems, the paper highlights the operational risks associated with accuracy-centric evaluation practices, particularly in law enforcement applications where misclassification may result in wrongful suspicion or missed identification. We further discuss the importance of model-agnostic fairness auditing approaches that enable post-deployment evaluation without access to proprietary systems. Finally, the paper outlines the inherent trade-offs between fairness and accuracy and emphasizes the need for more comprehensive fairness-aware evaluation strategies in high-stakes AI systems.

1. Introduction

Facial recognition systems have become increasingly embedded within law enforcement and security infrastructures, supporting tasks such as identification, surveillance, and decision-making. Advances in deep learning and large-scale datasets have enabled these systems to achieve high levels of overall accuracy, contributing to their perception as reliable and objective tools for operational use. However, a growing body of research has demonstrated that such systems often exhibit significant performance disparities across demographic groups, particularly with respect to race and age. These disparities raise critical ethical, legal, and societal concerns, especially in high-stakes environments where algorithmic decisions may directly affect individuals' rights and freedoms.

A key issue underlying these concerns is the widespread reliance on aggregate performance metrics, particularly accuracy, as the primary indicator of system reliability. While accuracy provides a general measure of overall correctness, it fails to capture how errors are distributed across different demographic subgroups. As a result, a system may appear highly accurate while disproportionately misclassifying individuals from specific populations. This limitation is particularly problematic in law enforcement contexts, where elevated false positive rates may lead to wrongful suspicion or identification, and elevated false negative rates may reduce the effectiveness of investigative processes. Empirical studies have consistently highlighted such disparities, with significantly higher error rates reported for underrepresented demographic groups in widely deployed facial recognition systems [1].
Despite increasing awareness of algorithmic bias [2], evaluation practices in real-world deployments often remain centered on aggregate metrics, with limited emphasis on subgroup-level performance analysis. This gap reflects a broader disconnect between academic research on algorithmic fairness and operational practices in applied settings. While various fairness metrics and auditing frameworks have been proposed, their adoption in practice remains inconsistent, particularly in environments where systems are proprietary or externally sourced. Consequently, there is a need to critically examine the adequacy of current evaluation paradigms and to reconsider the role of accuracy as a primary measure of system performance in high-stakes applications.

This paper argues that aggregate accuracy is insufficient for evaluating the fairness and reliability of facial recognition systems in law enforcement contexts. Drawing on existing literature and empirical insights from classification-based systems, the paper demonstrates how accuracy can obscure critical disparities in error distribution across demographic groups. It further examines the operational risks associated with accuracy-centric evaluation, highlighting the potential for unequal outcomes and systemic bias. Finally, the paper emphasizes the importance of fairness-aware evaluation approaches that incorporate subgroup-level metrics and support more transparent, accountable deployment of artificial intelligence systems.

2. Limitations of Accuracy-Based Evaluation

Accuracy is one of the most widely used metrics for evaluating machine learning models, representing the proportion of correctly classified instances across a dataset. Its simplicity and interpretability have contributed to its widespread adoption in both academic research and real-world deployments. However, in the context of high-stakes applications such as law enforcement facial recognition systems, reliance on accuracy as a primary evaluation metric presents significant limitations.

A fundamental issue with accuracy is that it provides an aggregate measure of performance, offering no insight into how errors are distributed across different demographic groups. In classification tasks involving heterogeneous populations, a model can achieve high overall accuracy while simultaneously exhibiting substantially different error rates for specific subgroups. This occurs because accuracy treats all instances equally, without accounting for disparities in subgroup representation or the distribution of misclassifications. As a result, performance inequalities that disproportionately affect certain populations may remain hidden within aggregated results.

To illustrate this limitation, consider a scenario in which a facial recognition system achieves an overall accuracy of 90%. While this figure may suggest strong performance, it does not reveal whether the system performs consistently across demographic groups. For example, one subgroup may experience a low false positive rate (FPR), while another may experience significantly higher rates of false identification. Similarly, false negative rates (FNR) may vary across groups, leading to uneven detection performance. Such disparities are not reflected in accuracy alone, highlighting its inadequacy as a standalone measure of fairness.
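To make this scenario concrete, the short Python sketch below computes overall accuracy alongside per-group FPR and FNR for two hypothetical subgroups. The labels are purely synthetic assumptions, not measurements from any deployed system; they are constructed so that both groups share the same accuracy while failing in opposite ways.

```python
# Minimal sketch of subgroup-level evaluation: two illustrative groups share
# the same overall accuracy while their error types differ sharply. The labels
# below are synthetic and only demonstrate the arithmetic.

def rates(y_true, y_pred):
    """Accuracy, false positive rate and false negative rate (1 = match)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / len(y_true)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return acc, fpr, fnr

# Group A's errors are all missed matches; group B's are all false matches.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0] * 2
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
groups = ["A"] * 10 + ["B"] * 10

print("overall  acc=%.2f FPR=%.2f FNR=%.2f" % rates(y_true, y_pred))
for g in ("A", "B"):
    idx = [i for i, gi in enumerate(groups) if gi == g]
    sub = rates([y_true[i] for i in idx], [y_pred[i] for i in idx])
    print("group %s  acc=%.2f FPR=%.2f FNR=%.2f" % ((g,) + sub))
```

Under these assumed labels, both groups report 80% accuracy, yet group A's errors are entirely missed matches (FNR 0.40, FPR 0.00) while group B's are entirely false matches (FPR 0.40, FNR 0.00); the aggregate figures alone would never reveal the difference.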
Figure 1 illustrates how two subgroups can exhibit identical overall accuracy while experiencing significantly different false positive and false negative rates.

Figure 1. Equal aggregate accuracy can conceal unequal subgroup error rates.

This limitation is particularly critical in law enforcement contexts, where different types of errors carry distinct and potentially severe consequences. A false positive, in which an individual is incorrectly identified as a match, may lead to wrongful suspicion or investigation. Conversely, a false negative may result in failure to identify a relevant individual, potentially undermining operational objectives. Metrics such as false positive rate and false negative rate provide a more granular understanding of these error types and their distribution across demographic groups. Prior work has emphasized the importance of these metrics in fairness evaluation, particularly in domains where decision outcomes have significant societal impact [3].

Furthermore, accuracy is sensitive to class imbalance, which is common in real-world datasets. In such cases, a model may achieve high accuracy by correctly predicting the majority class while performing poorly on minority classes. When demographic groups are unevenly represented, this can exacerbate disparities in performance, further masking bias within aggregated metrics. This issue has been widely documented in studies of algorithmic bias, where models trained on imbalanced datasets exhibit reduced performance for underrepresented populations [4].

The continued reliance on accuracy as a primary evaluation metric reflects a broader challenge in aligning machine learning evaluation practices with real-world requirements. While accuracy remains useful as a general indicator of performance, it is insufficient for assessing fairness in systems that operate across diverse populations. In high-stakes domains such as law enforcement, where the cost of errors is unevenly distributed, evaluation frameworks must move beyond aggregate measures and incorporate subgroup-level analysis to ensure equitable and accountable system behavior.

3. Evidence of Demographic Disparities

A substantial body of research has demonstrated that facial recognition systems exhibit uneven performance across demographic groups, reinforcing concerns regarding fairness and reliability in real-world applications. One of the most widely cited studies [1] revealed that commercial facial analysis systems showed significantly higher error rates for darker-skinned individuals and women compared to lighter-skinned males. Their findings highlighted systemic disparities in widely deployed models, drawing attention to the limitations of relying solely on aggregate performance metrics. Subsequent studies have confirmed that such disparities persist across both academic and industrial systems [5]. Variation in dataset composition, including imbalanced representation of demographic groups, has been identified as a key contributing factor. When models are trained predominantly on data representing certain populations, they may fail to generalize effectively to underrepresented groups, resulting in higher error rates.
These disparities are often not immediately visible when evaluation focuses on overall accuracy, further emphasizing the need for subgroup-level analysis. Beyond dataset imbalance, additional factors such as variations in lighting conditions, facial features, and image quality have been shown to influence model performance differently across demographic groups. These technical challenges can interact with underlying biases in training data, compounding disparities in system outputs. As a result, even high-performing models may exhibit unequal behavior when deployed in diverse real-world environments.

Empirical observations from classification-based systems further support these findings. Analysis of subgroup-level performance metrics frequently reveals variations in both false positive rates (FPR) and false negative rates (FNR) across demographic groups, even when overall accuracy remains relatively high. This aligns with the argument presented in the previous section, where aggregate metrics obscure differences in error distribution. In practical terms, this means that two systems with similar accuracy may have significantly different fairness profiles, depending on how errors are distributed across populations.

Table 1. Baseline overall performance metrics of the evaluated system.

    Metric                        Value
    Accuracy                      75.47%
    False Positive Rate (FPR)     28.42%
    False Negative Rate (FNR)     21.04%

Table 2. Range of subgroup error rates observed across demographic groups in the baseline evaluation.

    Demographic Attribute    Metric    Minimum Value    Maximum Value
    Race                     FPR       0.2015           0.3518
    Race                     FNR       0.1357           0.3117
    Age                      FPR       0.2453           0.4601
    Age                      FNR       0.1117           0.4958

These results demonstrate that error rates vary significantly across demographic groups despite a single aggregate accuracy value, reinforcing that accuracy alone is insufficient for evaluating fairness in real-world systems. In law enforcement contexts, these disparities carry heightened significance due to the potential consequences of misclassification. Elevated false positive rates for specific demographic groups may result in disproportionate scrutiny or wrongful identification, while elevated false negative rates may reduce the effectiveness of investigative processes. These risks highlight the importance of evaluating not only whether a system is accurate, but also how that accuracy is distributed across different segments of the population.

Collectively, the existing literature and empirical observations provide strong evidence that demographic disparities are a persistent and systemic issue in facial recognition systems. These findings reinforce the argument that aggregate accuracy is insufficient as a standalone evaluation metric and underscore the need for fairness-aware approaches that explicitly account for subgroup-level performance differences.
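As a small illustration of the max-min disparity analysis discussed in Section 5, the sketch below computes the gap for each metric using per-group values chosen to reproduce the ranges in Table 2. The group labels are placeholders, since the full subgroup breakdown behind the table is not reproduced here.

```python
# Sketch: quantifying disparity as the gap between the worst- and best-served
# subgroups for each metric. The per-group values below reproduce the ranges
# reported in Table 2 for illustration; a real audit would compute them from
# subgroup confusion matrices, as in the earlier example.

subgroup_metrics = {
    # attribute -> metric -> {group: value}; group labels are placeholders.
    "race": {
        "FPR": {"group_1": 0.2015, "group_2": 0.3518},
        "FNR": {"group_1": 0.1357, "group_2": 0.3117},
    },
    "age": {
        "FPR": {"bucket_1": 0.2453, "bucket_2": 0.4601},
        "FNR": {"bucket_1": 0.1117, "bucket_2": 0.4958},
    },
}

def disparity_gap(per_group):
    """Max-min gap of a metric across demographic groups."""
    values = list(per_group.values())
    return max(values) - min(values)

for attribute, metrics in subgroup_metrics.items():
    for metric, per_group in metrics.items():
        print(f"{attribute:5s} {metric}: gap = {disparity_gap(per_group):.4f}")
```

On these values, the largest gap appears for FNR across age groups (roughly 0.38), a disparity that is invisible in the single 75.47% accuracy figure of Table 1.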
4. Operational Risks in Law Enforcement

The presence of demographic disparities in facial recognition systems is not merely a technical concern but poses significant operational risks when such systems are deployed in law enforcement contexts. Unlike low-stakes applications, where errors may have limited consequences, misclassification in law enforcement settings can directly affect individuals' rights, freedoms, and interactions with authorities. As a result, the distribution of errors across demographic groups becomes a critical factor in assessing the real-world impact of these systems.

False positives, where an individual is incorrectly identified as a match, represent one of the most serious risks in law enforcement applications. An elevated false positive rate (FPR) for specific demographic groups can lead to disproportionate targeting, increased surveillance, or wrongful suspicion. In extreme cases, this may contribute to wrongful arrests or investigations, particularly when algorithmic outputs are treated as reliable evidence. Even when used as decision-support tools, such systems can influence human judgment, reinforce biases, and increase the likelihood of unequal treatment across populations.

Conversely, false negatives, where a system fails to correctly identify an individual, also carry operational consequences. In investigative contexts, elevated false negative rates (FNR) may reduce the effectiveness of identification processes, potentially leading to missed matches or overlooked individuals of interest. While false negatives may not carry the same immediate ethical implications as false positives, they can still undermine system reliability and operational objectives. Importantly, the impact of these errors may also vary across demographic groups, further contributing to uneven system performance.

A key issue in current deployment practices is that these risks are often evaluated implicitly rather than explicitly measured and reported. Systems are frequently assessed using aggregate performance metrics without sufficient consideration of how errors are distributed across different populations. This creates a scenario in which a system deemed "accurate" may still produce systematically unequal outcomes. In law enforcement settings, where accountability and fairness are essential, such evaluation practices are inadequate.

The operational risks associated with demographic disparities are further compounded by the opacity of many deployed systems. Facial recognition technologies are often provided by third-party vendors, limiting access to model internals, training data, and evaluation processes. This lack of transparency makes it difficult for practitioners to identify, understand, and mitigate bias in complex sociotechnical systems [6], increasing reliance on surface-level metrics such as accuracy. Without robust auditing mechanisms, these systems may be deployed with limited awareness of their potential biases and associated risks.

These challenges highlight the need to shift from purely performance-driven evaluation to risk-aware assessment frameworks. In high-stakes domains such as law enforcement, it is not sufficient for systems to be accurate on average; they must also demonstrate consistent and equitable performance across all demographic groups. Failure to account for these factors can lead to systemic bias, reduced trust in technology, and potential legal and ethical consequences. As such, understanding and addressing the operational implications of demographic disparities is essential for responsible deployment of facial recognition systems.

5. The Need for Fairness-Aware Evaluation

The limitations of accuracy-based evaluation and the operational risks associated with demographic disparities highlight the need for more comprehensive approaches to assessing machine learning systems.
In high-stakes domains such as law enforcement, evaluation frameworks must move beyond aggregate performance metrics and incorporate fairness-aware methodologies that explicitly account for differences in error distribution across demographic groups.

A key component of fairness-aware evaluation is the use of subgroup-level performance metrics. Measures such as false positive rate (FPR) and false negative rate (FNR) provide a more granular understanding of how a system behaves across different populations. Unlike overall accuracy, these metrics reveal whether certain groups are disproportionately affected by specific types of errors, enabling more informed assessment of system reliability and risk. For example, a system with balanced FPR and FNR across demographic groups is less likely to produce unequal outcomes than one where these metrics vary significantly between populations.

In addition to individual metrics, comparative measures of disparity are essential for quantifying differences in performance. One common approach is to compute the gap between the maximum and minimum values of a given metric across demographic groups [7]. This provides an interpretable measure of fairness, allowing practitioners to identify the extent of inequality in system behavior. Such disparity-based evaluations align more closely with real-world concerns, where the objective is not only to achieve high performance but also to ensure that this performance is distributed equitably.

Another important consideration is the practicality of fairness evaluation in real-world deployment scenarios. Many operational systems are proprietary or externally provided, limiting access to model internals and training data. In such cases, fairness assessment must be conducted using available outputs, such as predicted labels and associated confidence scores. Model-agnostic evaluation approaches, which operate independently of the underlying model architecture, are therefore particularly valuable. These approaches enable practitioners to audit systems post-deployment, providing insights into performance disparities without requiring modification to the original model.

Despite the availability of fairness metrics and evaluation techniques, their adoption in practice remains inconsistent. One contributing factor is the lack of accessible and interpretable tools that support fairness analysis in operational settings. Many existing frameworks are designed for research environments and require specialized expertise, limiting their applicability in real-world contexts. As a result, there is a need for evaluation approaches that are not only theoretically sound but also practical, interpretable, and deployable.

Ultimately, fairness-aware evaluation should be viewed as a fundamental component of system assessment rather than an optional extension. In high-stakes applications, it is insufficient to demonstrate that a system performs well on average; it must also be shown that this performance is consistent and equitable across all relevant populations. Incorporating subgroup-level metrics, disparity analysis, and model-agnostic auditing into evaluation practices represents a critical step towards more transparent, accountable, and responsible use of facial recognition technologies.
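The following sketch shows one possible shape of such a model-agnostic audit: it consumes only logged outputs (predicted label, verified ground truth, and demographic group) and flags any metric whose max-min gap exceeds a chosen tolerance. The record format, group labels, and the 0.10 threshold are illustrative assumptions rather than an established standard.

```python
# Sketch of a model-agnostic, post-deployment audit. It reads only logged
# system outputs and never touches model weights or training data, so it can
# be run on proprietary or externally sourced systems.

from collections import defaultdict

def audit(records, disparity_threshold=0.10):
    """records: iterable of dicts with keys 'y_true', 'y_pred' and 'group'."""
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "pos": 0, "neg": 0})
    for r in records:
        c = counts[r["group"]]
        if r["y_true"] == 1:
            c["pos"] += 1
            c["fn"] += int(r["y_pred"] == 0)
        else:
            c["neg"] += 1
            c["fp"] += int(r["y_pred"] == 1)
    fpr = {g: c["fp"] / c["neg"] for g, c in counts.items() if c["neg"]}
    fnr = {g: c["fn"] / c["pos"] for g, c in counts.items() if c["pos"]}
    findings = {}
    for name, per_group in (("FPR", fpr), ("FNR", fnr)):
        gap = max(per_group.values()) - min(per_group.values())
        findings[name] = {"per_group": per_group, "gap": round(gap, 3),
                          "flagged": gap > disparity_threshold}
    return findings

# Usage over a tiny hypothetical log of verified outcomes (a real audit would
# need much larger, representative samples for each demographic group).
log = [
    {"y_true": 0, "y_pred": 0, "group": "A"},
    {"y_true": 0, "y_pred": 1, "group": "A"},
    {"y_true": 1, "y_pred": 1, "group": "A"},
    {"y_true": 0, "y_pred": 1, "group": "B"},
    {"y_true": 0, "y_pred": 1, "group": "B"},
    {"y_true": 1, "y_pred": 0, "group": "B"},
]
print(audit(log))
```

Because the audit relies only on operational logs, it could in principle be run by an oversight body without vendor cooperation, which is precisely the proprietary-system setting described above.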
6. Trade-offs Between Fairness and Accuracy

While fairness-aware evaluation provides a more comprehensive understanding of system performance, it also introduces a fundamental challenge: the trade-off between fairness and accuracy. In many machine learning systems, particularly those operating on complex and imbalanced datasets, efforts to reduce disparities across demographic groups can lead to reductions in overall predictive performance. This tension has been widely recognized in the literature [8], [9], which highlights that achieving perfect fairness across all groups without impacting accuracy is often not feasible [8].

The underlying reason for this trade-off lies in the differing statistical distributions and representation of demographic groups within the training data. A model optimized for overall accuracy tends to fit the best-represented populations most closely, because that is where it can achieve the greatest reduction in aggregate error. However, this optimization may come at the expense of minority or underrepresented groups, whose characteristics are less well captured by the training data. As a result, improving performance for these groups, for example by reducing their false positive or false negative rates, may require adjustments that slightly degrade overall accuracy.

From an operational perspective, this trade-off presents a critical decision point. In high-stakes environments such as law enforcement, the cost of errors is not uniform across different types of misclassification or across different populations. A small reduction in overall accuracy may be acceptable if it results in a substantial decrease in harmful disparities, particularly those that disproportionately affect vulnerable or underrepresented groups. Conversely, prioritizing accuracy alone may lead to systems that are efficient on average but produce inequitable outcomes in practice.

Importantly, the fairness-accuracy trade-off is not solely a technical issue but also an ethical and policy consideration. Decisions regarding acceptable levels of disparity and performance degradation must consider the broader societal contexts in which the system operates. In law enforcement applications, where decisions can have significant legal and social implications, there is a strong argument for prioritizing fairness and accountability over marginal gains in aggregate accuracy. This perspective aligns with calls for responsible AI practices that emphasise transparency, human oversight, and the mitigation of harm.

At the same time, it is essential to recognize that fairness interventions must be applied carefully and transparently. Overcorrection or poorly designed mitigation strategies may introduce new forms of bias or reduce system effectiveness beyond acceptable levels. Therefore, evaluation frameworks should not aim to eliminate trade-offs entirely but rather make them explicit and measurable. By quantifying both performance and disparity, practitioners can make informed decisions about how to balance competing objectives in each application context.

Ultimately, acknowledging and managing the trade-offs between fairness and accuracy is a key aspect of responsible AI deployment. Rather than viewing fairness as an operational constraint, it should be treated as a core component of system evaluation and design. In high-stakes domains, this requires moving beyond simplistic performance metrics and adopting evaluation approaches that reflect both technical performance and societal impact.
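One way to make such trade-offs explicit and measurable is sketched below: two threshold policies are compared on synthetic match scores, one with per-group thresholds chosen purely for accuracy and one with per-group thresholds matched to a common 1% false positive rate, loosely in the spirit of post-processing approaches such as equalized odds [3]. The score distributions, group composition, and 1% target are assumptions made only for illustration.

```python
# Sketch: quantifying a fairness-accuracy trade-off under assumed score
# distributions. Group B's non-matches receive inflated scores, making it
# more prone to false matches than group A at any shared operating point.

import random

random.seed(0)

def synth_group(n, nonmatch_mu):
    """Synthetic (score, label) pairs; 1 = genuine match."""
    data = []
    for _ in range(n):
        y = int(random.random() < 0.3)
        mu = 0.75 if y else nonmatch_mu
        data.append((min(max(random.gauss(mu, 0.1), 0.0), 1.0), y))
    return data

def metrics(scored, thr):
    """Accuracy and false positive rate at a given decision threshold."""
    acc = sum((s >= thr) == bool(y) for s, y in scored) / len(scored)
    negatives = [s for s, y in scored if y == 0]
    fpr = sum(s >= thr for s in negatives) / len(negatives)
    return acc, fpr

groups = {"A": synth_group(5000, 0.30), "B": synth_group(5000, 0.45)}
grid = [i / 100 for i in range(10, 91)]

# Policy (a): per-group accuracy-optimal thresholds (ignores disparity).
acc_opt = {g: max(grid, key=lambda t: metrics(d, t)[0]) for g, d in groups.items()}
# Policy (b): per-group thresholds matched to a common 1% target FPR.
fpr_eq = {g: min(grid, key=lambda t: abs(metrics(d, t)[1] - 0.01)) for g, d in groups.items()}

for name, thresholds in (("accuracy-optimal", acc_opt), ("FPR-equalised", fpr_eq)):
    results = {g: metrics(d, thresholds[g]) for g, d in groups.items()}
    mean_acc = sum(a for a, _ in results.values()) / len(results)
    fpr_gap = abs(results["A"][1] - results["B"][1])
    print(f"{name:16s}  mean acc={mean_acc:.3f}  FPR gap={fpr_gap:.3f}")
```

Under these assumptions, the FPR-equalised policy narrows the false positive gap between the groups to nearly zero at the cost of a modest drop in average accuracy, which is exactly the kind of quantified trade-off that practitioners and policymakers can then deliberate over explicitly rather than leaving implicit.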
7. Conclusion

The increasing deployment of facial recognition systems in law enforcement contexts has amplified the need for robust and reliable evaluation practices. While high levels of accuracy are often presented as evidence of system effectiveness, this paper has argued that accuracy alone is insufficient for assessing fairness in high-stakes applications. Aggregate metrics fail to capture how errors are distributed across demographic groups, allowing significant disparities in performance to remain obscured.

Drawing on existing literature and empirical observations, the paper has demonstrated that facial recognition systems frequently exhibit uneven error rates across populations, particularly in terms of false positive and false negative rates. In law enforcement settings, these disparities carry substantial operational and ethical implications, including the risk of wrongful suspicion, unequal treatment, and reduced system reliability. As such, evaluation practices that rely solely on accuracy provide an incomplete and potentially misleading representation of system performance.

The analysis further highlighted the importance of fairness-aware evaluation approaches that incorporate subgroup-level metrics and disparity analysis. By examining performance across demographic groups, these methods provide a more comprehensive understanding of system behavior and enable the identification of potential biases. In addition, the discussion of model-agnostic auditing emphasises the need for practical evaluation strategies that can be applied in real-world deployment scenarios, where access to model internals is often limited.

The paper also addressed the inherent trade-off between fairness and accuracy, noting that efforts to reduce disparity may impact overall performance. In high-stakes domains, however, such trade-offs should be explicitly considered rather than ignored. Ensuring equitable system behavior and minimizing harm may, in many cases, justify modest reductions in aggregate accuracy. This perspective reinforces the need for evaluation frameworks that balance technical performance with ethical and societal considerations.

In conclusion, the findings underscore the necessity of moving beyond accuracy as the primary metric for evaluating facial recognition systems in law enforcement. Fairness-aware evaluation should be treated as a fundamental component of system assessment, enabling more transparent, accountable, and responsible deployment of artificial intelligence technologies. Future work should explore the integration of fairness auditing into operational workflows and the development of standardised evaluation protocols that better reflect the complexities of real-world applications.

References

[1] J. Buolamwini and T. Gebru, "Gender shades: Intersectional accuracy disparities in commercial gender classification," in Proc. Machine Learning Research, 2018.
[2] A. Chouldechova and A. Roth, "A snapshot of the frontiers of fairness in machine learning," Communications of the ACM, vol. 63, no. 5, pp. 82-89, 2020.
[3] M. Hardt, E. Price, and N. Srebro, "Equality of opportunity in supervised learning," in Advances in Neural Information Processing Systems, 2016.
[4] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan, "A survey on bias and fairness in machine learning," ACM Computing Surveys, vol. 54, no. 6, 2021.
[5] I. Raji and J. Buolamwini, "Actionable auditing: Investigating the impact of publicly naming biased performance results of commercial AI products," in Proc. AAAI/ACM Conference on AI, Ethics, and Society, 2019.
[6] A. Selbst, D. Boyd, S. A. Friedler, S. Venkatasubramanian, and J. Vertesi, "Fairness and abstraction in sociotechnical systems," in Proc. Conference on Fairness, Accountability, and Transparency (FAT*), 2019.
[7] S. Verma and J. Rubin, "Fairness definitions explained," arXiv:1808.00023, 2018.
[8] S. Barocas, M. Hardt, and A. Narayanan, Fairness and Machine Learning. 2019. [Online]. Available: https://fairmlbook.org
[9] J. Kleinberg, S. Mullainathan, and M. Raghavan, "Inherent trade-offs in the fair determination of risk scores," in Proc. Innovations in Theoretical Computer Science (ITCS), 2017.
