Neutral Prompts, Non-Neutral People: Quantifying Gender and Skin-Tone Bias in Gemini Flash 2.5 Image and GPT Image 1.5


This study quantifies gender and skin-tone bias in two widely deployed commercial image generators - Gemini Flash 2.5 Image (NanoBanana) and GPT Image 1.5 - to test the assumption that neutral prompts yield demographically neutral outputs. We generated 3,200 photorealistic images using four semantically neutral prompts. The analysis employed a rigorous pipeline combining hybrid color normalization, facial landmark masking, and perceptually uniform skin tone quantification using the Monk Skin Tone (MST), PERLA, and Fitzpatrick scales. Neutral prompts produced highly polarized defaults. Both models exhibited a strong “default white” bias (>96% of outputs). However, they diverged sharply on gender: Gemini favored female-presenting subjects, while GPT favored male-presenting subjects with lighter skin tones. This research provides a large-scale, comparative audit of state-of-the-art models using an illumination-aware colorimetric methodology, distinguishing aesthetic rendering from underlying pigmentation in synthetic imagery. The study demonstrates that neutral prompts function as diagnostic probes rather than neutral instructions. It offers a robust framework for auditing algorithmic visual culture and challenges the sociolinguistic assumption that unmarked language results in inclusive representation.


💡 Research Summary

This paper conducts a large‑scale audit of two widely deployed commercial text‑to‑image generators—Gemini Flash 2.5 Image (referred to as NanoBanana) and GPT Image 1.5—to test the common industry assumption that “neutral” prompts produce demographically neutral outputs. The authors created 3,200 photorealistic images using four semantically neutral prompts (“a person”, “someone”, “an individual”, “a human”), each repeated 200 times per model.
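
A minimal reproduction harness for this sweep might look like the sketch below. The `generate_image` wrapper, the model identifiers, and the directory layout are all hypothetical (the paper does not publish its generation code); only the prompt list and the repetition count come from the description above.

```python
# Hypothetical harness for the prompt sweep described above.
# generate_image() is a placeholder for the vendor-specific SDK call
# (Gemini or OpenAI image API); it is not taken from the paper.
from pathlib import Path

PROMPTS = ["a person", "someone", "an individual", "a human"]
MODELS = ["gemini-flash-2.5-image", "gpt-image-1.5"]  # names as used in the paper
RUNS_PER_PROMPT = 200

def generate_image(model: str, prompt: str) -> bytes:
    """Placeholder: call the vendor API and return PNG bytes."""
    raise NotImplementedError

def run_sweep(out_dir: str = "outputs") -> None:
    for model in MODELS:
        for prompt in PROMPTS:
            for i in range(RUNS_PER_PROMPT):
                png = generate_image(model, prompt)
                path = Path(out_dir, model, prompt.replace(" ", "_"), f"{i:03d}.png")
                path.parent.mkdir(parents=True, exist_ok=True)
                path.write_bytes(png)
```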

To isolate true skin pigmentation from artistic lighting, the study introduces a three-stage image-analysis pipeline. First, hybrid color normalization and background-referenced white balancing correct for illumination variance. Second, a 68-point facial landmark mask extracts only the skin region. Third, the masked pixels are transformed into CIELAB space, and perceptually uniform ΔE*ab distances yield illumination-invariant color measurements. These measurements are then mapped onto three established dermatological scales: the Monk Skin Tone (MST) scale, the PERLA scale, and the Fitzpatrick Skin Type (FST) scale, providing both continuous and categorical representations of skin tone.
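
The colorimetric core of this pipeline can be sketched with scikit-image. Everything below is illustrative rather than the paper's implementation: the gray-world step stands in for the hybrid, background-referenced normalization, the skin mask is assumed to be supplied by a 68-point landmark detector (e.g., dlib), and the MST anchor colors are hypothetical placeholders, not calibrated swatch values.

```python
# Illustrative sketch of the colorimetric core, assuming scikit-image.
import numpy as np
from skimage.color import rgb2lab, deltaE_cie76

def gray_world_normalize(rgb: np.ndarray) -> np.ndarray:
    """Simple gray-world white balance as a stand-in for the paper's
    hybrid, background-referenced normalization."""
    img = rgb.astype(np.float64)
    gain = img.mean() / (img.reshape(-1, 3).mean(axis=0) + 1e-8)
    return np.clip(img * gain, 0, 255)

def mean_skin_lab(rgb: np.ndarray, skin_mask: np.ndarray) -> np.ndarray:
    """Average CIELAB color over the landmark-derived skin mask
    (mask construction from the 68 facial landmarks is omitted here)."""
    lab = rgb2lab(gray_world_normalize(rgb) / 255.0)
    return lab[skin_mask].mean(axis=0)

# Hypothetical CIELAB anchors for the 10 Monk Skin Tone bins, spaced in
# L*; a real audit would use calibrated MST swatch values instead.
MST_ANCHORS = np.array([[95.0 - 7.0 * i, 5.0, 12.0] for i in range(10)])

def mst_bin(lab: np.ndarray) -> int:
    """Assign the nearest MST bin by perceptual distance (Delta E*ab)."""
    d = deltaE_cie76(MST_ANCHORS, lab[np.newaxis, :])
    return int(np.argmin(d)) + 1  # MST bins are 1-indexed
```

The same nearest-anchor assignment generalizes to the PERLA and Fitzpatrick scales by swapping in their reference colors.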

Gender classification is performed with a pre-trained facial gender detector, yielding binary male/female labels. Statistical analysis (chi-square tests, binomial tests, and ANOVA) shows that Gemini generates a higher proportion of female-presenting subjects (≈58% female vs. 42% male), whereas GPT produces more male-presenting subjects (≈63% male vs. 37% female). Both models exhibit an overwhelming “default white” bias: over 96% of all images fall into the lightest MST categories (MST 1-2) and Fitzpatrick types I-II. Darker skin tones (MST 5-6, Fitzpatrick IV-VI) appear in less than 2% of the sample. Moreover, within each gender, Gemini’s female images are on average 0.4 MST units lighter than its male images, while GPT’s male images are about 0.3 MST units lighter than its female images. All differences are statistically significant (p < 0.001).
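
These tests are standard and easy to reproduce in outline with SciPy. The sketch below uses the gender proportions quoted above and assumes 1,600 images per model (3,200 total split evenly between the two models); that even split is an assumption, not a figure stated in this summary.

```python
# Reproducing the flavor of the reported significance tests with SciPy.
from scipy.stats import binomtest, chi2_contingency

n = 1600                            # assumed images per model
gemini_female = round(0.58 * n)     # ~58% female for Gemini
gpt_female = round(0.37 * n)        # ~37% female for GPT

# Binomial test per model against a 50/50 null hypothesis
print(binomtest(gemini_female, n, p=0.5).pvalue)
print(binomtest(gpt_female, n, p=0.5).pvalue)

# Chi-square test of independence on the model x gender table
table = [[gemini_female, n - gemini_female],
         [gpt_female, n - gpt_female]]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.3g}")
```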

Prompt variation analysis reveals no meaningful differences across the four neutral prompts; the bias is driven primarily by model architecture, training data, and alignment strategies rather than by wording. This finding aligns with sociolinguistic literature showing that “neutral” language often encodes dominant cultural norms.
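
The prompt-variation check can be captured in miniature as a one-way ANOVA on a continuous skin-tone measure (such as mean L*) grouped by prompt. The sketch below uses synthetic placeholder data, not the paper's measurements; in the actual audit each group would hold the per-image values from something like `mean_skin_lab()` above.

```python
# One-way ANOVA across the four prompt groups, with placeholder data.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Placeholder per-prompt L* samples standing in for real measurements.
groups = [rng.normal(loc=72.0, scale=3.0, size=200) for _ in range(4)]

f_stat, p_value = f_oneway(*groups)
print(f"F={f_stat:.2f}, p={p_value:.3f}")  # large p -> no prompt effect
```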

The paper contributes four key advances: (1) empirical evidence that neutral prompts do not yield demographically neutral outputs in commercial generators; (2) a comparative demonstration that bias manifests differently across vendors, with Gemini leaning toward female subjects and GPT toward male subjects; (3) a robust, illumination‑aware skin‑tone auditing methodology that can serve as a standard protocol for future studies; and (4) intersectional insight showing that gender and skin tone interact, producing amplified biases that single‑axis analyses would miss.

In the discussion, the authors argue that these default representations have real‑world implications for visual culture, reinforcing historical patterns of whiteness and maleness as normative. They call for greater dataset diversity, bias‑mitigation techniques, and transparent auditing practices during model development and deployment. The study ultimately reframes “neutral” prompts as diagnostic probes rather than unbiased instructions, urging the AI community to reconsider linguistic assumptions when evaluating generative visual systems.

