Large Language Models Naively Recover Ethnicity from Individual Records


I demonstrate that large language models can infer ethnicity from names with accuracy exceeding that of Bayesian Improved Surname Geocoding (BISG) without additional training data, extending inference beyond the United States and allowing classification into contextually appropriate categories. Using stratified samples from Florida and North Carolina voter files with self-reported race, LLM-based classification achieves up to 84.7% accuracy, outperforming BISG (68.2%) on balanced samples. I test six models including Gemini 3 Flash, GPT-4o, and open-source alternatives such as DeepSeek v3.2 and GLM-4.7. Enabling extended reasoning can improve accuracy by 1-3 percentage points, though effects vary across contexts; including metadata such as party registration reaches 86.7%. LLM classification also reduces the income bias inherent in BISG, where minorities in wealthier neighborhoods are systematically misclassified as White. I further validate using Lebanese voter registration with religious sect (64.3% accuracy), Indian MPs from reserved constituencies (99.2%), and Indian land records with caste classification (74.0%). Aggregate validation across India, Uganda, Nepal, Armenia, Chile, and Costa Rica using original full-count voter rolls demonstrates that the method recovers known population distributions where naming conventions are distinctive. For large-scale applications, small transformer models fine-tuned on LLM labels exceed BISG accuracy while enabling local deployment at no cost.


💡 Research Summary

This paper proposes a novel approach for inferring race, ethnicity, caste, or religious affiliation from individual names by leveraging large language models (LLMs). The authors argue that the dominant method in the United States, Bayesian Improved Surname Geocoding (BISG), is limited by its reliance on U.S. Census surname‑race distributions, its focus on surnames only, and its confinement to Census‑defined racial categories. In contrast, LLMs are trained on massive, multilingual text corpora that implicitly encode associations between personal names and cultural or demographic attributes across many societies.

The study evaluates six models—Gemini 3 Flash, GPT‑4o, GPT‑4.1‑mini, DeepSeek v3.2, GLM‑4.7, and GLM‑4.7‑Flash—using a simple zero‑temperature prompt: “Classify the race/ethnicity of this person based on their name and location. Name: … Location: … Return only one of: …”. No prompt engineering, few‑shot examples, or iterative refinement are applied, providing a conservative baseline of LLM capability.
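The quoted template can be rendered as a small helper. This is a minimal sketch: the exact field ordering and the reply-validation step are assumptions, since the summary only quotes the template itself, and the actual API client and model identifiers are omitted.

```python
def build_prompt(name: str, location: str, categories: list) -> str:
    """Reconstruct the paper's zero-shot classification prompt.

    Temperature is set to 0 at the API call, not here.
    """
    return (
        "Classify the race/ethnicity of this person based on their name "
        f"and location. Name: {name} Location: {location} "
        f"Return only one of: {', '.join(categories)}"
    )

def parse_reply(reply: str, categories: list):
    """Map the model's raw reply onto a valid category, else None.

    A validation step like this is an assumption; the paper's exact
    post-processing is not described in the summary.
    """
    cleaned = reply.strip().rstrip(".").lower()
    for c in categories:
        if cleaned == c.lower():
            return c
    return None
```

Because the prompt asks for exactly one category and temperature is zero, the reply can be matched verbatim against the allowed label set, with unmatched replies flagged for review.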

U.S. Individual‑Level Validation
Stratified samples of 10,000 voter records (2,500 per racial group) from Florida and North Carolina are used. With full name + geography, Gemini 3 Flash achieves 83.8 % (FL) and 83.5 % (NC) accuracy, substantially outperforming BISG (68.2 %/68.9 %). Removing geography reduces performance only modestly, while dropping first/middle names causes a larger drop (≈8–9 pp), indicating that first names carry significant demographic signal. Recall analysis shows dramatic gains for Black voters (≈80 % vs. ≈50 % for BISG) and consistently high recall for Hispanic and Asian groups.
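The balanced-sample design and the per-group recall computation above can be sketched in a few lines of Python; the record field name (`"race"`) and dict-based record layout are illustrative, not taken from the paper's replication code.

```python
import random
from collections import defaultdict

def stratified_sample(records, group_key, per_group, seed=0):
    """Draw an equal-sized random sample per group, mirroring the
    2,500-per-race design described above. `records` is a list of dicts."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for r in records:
        groups[r[group_key]].append(r)
    out = []
    for g in sorted(groups):
        out.extend(rng.sample(groups[g], min(per_group, len(groups[g]))))
    return out

def recall_by_group(y_true, y_pred):
    """Per-class recall: the share of each true group recovered correctly,
    the metric behind the Black-voter comparison (~80% vs. ~50%)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return {g: hits[g] / totals[g] for g in totals}
```

Balancing the sample by group matters here: on a raw voter file, overall accuracy would be dominated by the White majority and mask exactly the minority-recall differences the paper highlights.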

Comparison with Existing Benchmarks
Using the Lee & Velez (2025) dataset of local elected officials, Gemini 3 Flash reaches 90.4 % accuracy, surpassing BISG (84.5 %) and a hybrid image‑plus‑name model (88.8 %). GPT‑4o comes within half a point of the hybrid at 88.3 %, while open‑source models achieve 80–85 % accuracy.

Non‑U.S. Contexts

  • Lebanon: A stratified sample of 3,500 voter records with religious sect labels yields 64.3 % overall accuracy for Gemini 3 Flash. Performance is very high for groups with distinctive naming conventions (e.g., Armenian Orthodox 97.4 %) and lower for sects with overlapping name pools.
  • India (Reserved Constituencies): Among 130 Lok Sabha MPs elected from Scheduled‑Caste (SC) or Scheduled‑Tribe (ST) seats, Gemini 3 Flash attains 99.2 % overall accuracy (SC recall 98.8 %, ST recall 100 %). GPT‑4o and GLM‑4.7 also perform well (>90 % for SC, >85 % for ST), while the smallest open‑source model struggles with tribal names, reflecting training‑data representation gaps.
  • Aggregate Validation: Small random samples (≈500 records) from full‑count voter rolls in India, Uganda, Nepal, Armenia, Chile, and Costa Rica are classified into locally relevant ethnic or religious categories. The inferred population‑level distributions closely match official census figures, demonstrating that LLMs can recover macro‑level demographic structures even when individual‑level ground truth is unavailable.
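One natural way to score the aggregate validation is total variation distance between the LLM-inferred label distribution and official census shares. This is an illustrative choice: the summary does not name the comparison statistic the paper actually uses.

```python
from collections import Counter

def total_variation(pred_labels, census_shares):
    """Distance between the inferred label distribution and census shares.

    0 means the classified sample exactly recovers the census mix;
    1 means complete divergence. `census_shares` maps each category
    to its official population share.
    """
    n = len(pred_labels)
    counts = Counter(pred_labels)
    inferred = {g: counts.get(g, 0) / n for g in census_shares}
    return 0.5 * sum(abs(inferred[g] - census_shares[g])
                     for g in census_shares)
```

Because only ~500 records are classified per country, a small distance here indicates the method recovers the macro-level distribution even though no individual-level ground truth exists.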

Model Size and Cost Considerations
Larger proprietary models generally achieve the highest accuracy, but the study shows that a 30 B‑parameter open‑source model (GLM‑4.7‑Flash) can be fine‑tuned on LLM‑generated labels to exceed BISG performance, enabling low‑cost, on‑premise deployment. The authors note that enabling extended reasoning or modest prompt refinements (few‑shot examples, temperature tuning) could add 1–3 percentage points to accuracy, suggesting further room for improvement.
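The distillation idea — label a large corpus with an expensive LLM, then train a cheap local model on those labels — can be sketched with a tiny character-n-gram Naive Bayes classifier standing in for the small fine-tuned transformer. Everything below (the classifier choice, trigram features, the example names) is illustrative, not the paper's architecture.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(name, n=3):
    """Character trigrams with boundary markers, e.g. '^ga', 'gar', ..."""
    s = f"^{name.lower()}$"
    return [s[i:i + n] for i in range(len(s) - n + 1)]

class NgramNB:
    """Multinomial Naive Bayes over name trigrams -- a stdlib stand-in
    for the small transformer fine-tuned on LLM-produced labels."""

    def fit(self, names, llm_labels):
        self.priors = Counter(llm_labels)
        self.counts = defaultdict(Counter)
        for name, y in zip(names, llm_labels):
            self.counts[y].update(char_ngrams(name))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, name):
        total = sum(self.priors.values())
        best, best_lp = None, -math.inf
        for y, prior in self.priors.items():
            denom = sum(self.counts[y].values()) + len(self.vocab)
            lp = math.log(prior / total)
            for g in char_ngrams(name):  # Laplace-smoothed likelihoods
                lp += math.log((self.counts[y][g] + 1) / denom)
            if lp > best_lp:
                best, best_lp = y, lp
        return best
```

Once trained on LLM labels, a model like this runs locally at effectively zero marginal cost per name, which is the deployment property the text emphasizes.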

Bias and Ethical Issues
The paper highlights that BISG systematically misclassifies minorities in affluent neighborhoods as White, a bias that LLMs mitigate by leveraging first‑name information. However, LLMs inherit biases from their training data; for example, certain religious sects in Lebanon are under‑ or over‑represented. The authors call for systematic bias audits, privacy safeguards, and transparent reporting when deploying these models in research or policy contexts.
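A minimal bias audit of the kind the authors call for can be sketched as the misclassification rate by neighborhood income bin. The field names (`"race"`, `"pred"`, `"income"`) and the binning scheme are assumptions for illustration.

```python
from collections import defaultdict

def white_error_by_income(records, bin_edges):
    """Share of non-White voters misclassified as White, per income bin.

    `records` holds dicts with self-reported 'race', model 'pred', and
    neighborhood 'income'; `bin_edges` are ascending upper bounds.
    A rate that rises with income reproduces the BISG gradient described
    above; a flat profile suggests the classifier avoids it.
    """
    stats = defaultdict(lambda: [0, 0])  # bin -> [errors, total]
    for r in records:
        if r["race"] == "White":
            continue  # the audit covers minority voters only
        b = next((u for u in bin_edges if r["income"] <= u), bin_edges[-1])
        stats[b][1] += 1
        stats[b][0] += int(r["pred"] == "White")
    return {b: e / n for b, (e, n) in sorted(stats.items())}
```

Running the same audit separately for BISG and for LLM predictions on identical records would make the claimed mitigation directly comparable.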

Conclusions
LLM‑based name inference offers a flexible, high‑accuracy alternative to traditional Bayesian methods, especially in settings lacking census‑linked surname data or when researchers need custom classification schemes (caste, religion, tribal affiliation). The approach works across scripts (Latin, Arabic) and cultural contexts, and small fine‑tuned models can deliver comparable or superior performance to BISG at negligible marginal cost. Future work should explore prompt optimization, domain‑specific fine‑tuning, bias mitigation strategies, and the development of ethical guidelines for large‑scale demographic inference.

