How Large Language Models Systematically Misrepresent American Climate Opinions

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Federal agencies and researchers increasingly use large language models (LLMs) to analyze and simulate public opinion. When AI mediates between the public and policymakers, accuracy across intersecting identities becomes consequential: inaccurate group-level estimates can mislead outreach, consultation, and policy design. While existing research examines intersectionality in LLM outputs, no study has compared these outputs against real human responses across intersecting identities. This gap is particularly urgent for climate change, where opinion is contested and diverse. We investigate how LLMs represent intersectional patterns in U.S. climate opinions. We prompted six LLMs with profiles of 978 respondents from a nationally representative U.S. climate opinion survey and compared the AI-generated responses to the actual human answers across 20 questions. We find that LLMs appear to compress the diversity of American climate opinions, predicting less-concerned groups as more concerned and vice versa. This compression is intersectional: LLMs apply uniform gender assumptions that match reality for White and Hispanic Americans but misrepresent Black Americans, for whom actual gender patterns differ. These patterns, which may be invisible to standard auditing approaches, could undermine equitable climate governance.


💡 Research Summary

This paper investigates how large language models (LLMs) represent American public opinion on climate change, with a focus on intersectional demographic patterns. The authors begin by noting that federal agencies and researchers increasingly rely on LLMs to simulate or analyze public sentiment, and that any systematic misrepresentation—especially across intersecting identities such as race, gender, and age—could mislead outreach, consultation, and policy design. While prior work has documented generic gender or racial biases in LLM outputs, no study to date has directly compared model‑generated responses with real human answers across multiple intersecting demographic groups.

Data and Methodology
The authors use a nationally representative U.S. climate‑opinion survey comprising 978 respondents. Each respondent is described by seven demographic attributes: age, gender, race/ethnicity (White, Black, Hispanic, Asian, Other), education, region, and political ideology. The survey contains 20 Likert‑style questions covering perceived seriousness of climate change, support for various policies, personal behavioral intentions, and trust in scientific institutions.
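As a rough illustration of how each survey record might be organized for the analysis, the sketch below defines a respondent structure containing only the attributes and response format named in this summary; the field names and types are assumptions, not the authors' actual data schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Respondent:
    """One survey respondent, as described in the summary.

    Field names and types are illustrative assumptions; the paper's
    actual schema is not reproduced here.
    """
    age: int
    gender: str           # e.g., "female", "male"
    race_ethnicity: str   # "White", "Black", "Hispanic", "Asian", "Other"
    education: str        # e.g., "college degree"
    region: str           # e.g., "Midwest"
    ideology: str         # e.g., "liberal"
    answers: List[int]    # 20 Likert responses, coded 1 (Strongly disagree) to 5 (Strongly agree)
```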

For each respondent, the authors construct a textual profile (e.g., “A 42‑year‑old Black female with a college degree living in the Midwest and identifying as liberal”) and feed this profile, together with the 20 questions, to six state‑of‑the‑art LLMs: GPT‑3.5‑Turbo, GPT‑4, Claude‑2, Llama‑2‑70B, Gemini‑1.5‑Flash, and Koala‑2. All prompts and decoding settings are standardized (temperature = 0.0, maximum of 256 output tokens), and each model is instructed to answer on a five‑point scale (Strongly disagree → Strongly agree).
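A minimal sketch of how such a profile prompt could be assembled and submitted is shown below, reusing the Respondent record sketched above. The prompt wording is an assumption rather than the authors' template, and `call_model` is a hypothetical wrapper around whichever chat API is queried (assumed here to apply the paper's temperature and token settings).

```python
def build_profile(r: Respondent) -> str:
    """Turn a respondent record into a short textual persona (illustrative wording)."""
    return (
        f"A {r.age}-year-old {r.race_ethnicity} {r.gender} with a {r.education} "
        f"living in the {r.region} and identifying as {r.ideology}."
    )

def ask_model(r: Respondent, questions: list, call_model) -> list:
    """Query a model once per question and parse a 1-5 Likert answer.

    `call_model(prompt)` is a hypothetical function wrapping a specific LLM API,
    assumed to be configured with temperature=0.0 and a 256-token limit.
    """
    persona = build_profile(r)
    answers = []
    for q in questions:
        prompt = (
            f"{persona}\n"
            f"Question: {q}\n"
            "Answer on a five-point scale from 1 (Strongly disagree) to 5 (Strongly agree). "
            "Reply with the number only."
        )
        reply = call_model(prompt)
        answers.append(int(reply.strip()[0]))  # naive parse; a real pipeline needs more robust handling
    return answers
```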

The evaluation proceeds in two stages. First, the authors compute overall Pearson correlations and mean absolute errors (MAE) between model‑generated averages and the human survey averages for each question. Second, they perform an intersectional difference‑in‑differences (DiD) analysis, comparing model predictions to actual responses within each race‑gender cell. Statistical significance is assessed with 95% confidence intervals and Bonferroni‑adjusted p‑values.
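The sketch below illustrates how these two stages might be computed with pandas and scipy. The column names, the long-format data layout, and the simplified DiD formulation (female-minus-male gap in model answers minus the same gap in human answers, within each race) are assumptions based on this summary, not the paper's code.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def overall_fit(df: pd.DataFrame):
    """Stage 1: question-level agreement between model and human averages.

    `df` is assumed to hold one row per (respondent, question) with columns
    'question', 'human', and 'model' containing 1-5 Likert scores.
    """
    per_q = df.groupby("question")[["human", "model"]].mean()
    r, _ = pearsonr(per_q["human"], per_q["model"])
    mae = np.mean(np.abs(per_q["human"] - per_q["model"]))
    return r, mae

def gender_gap_by_race(df: pd.DataFrame) -> pd.DataFrame:
    """Stage 2 (simplified): within each race, compare the female-male gap in
    human answers to the same gap in model answers; their difference is a
    difference-in-differences-style estimate of intersectional error.
    """
    cell = df.groupby(["race", "gender"])[["human", "model"]].mean()
    gaps = cell.xs("female", level="gender") - cell.xs("male", level="gender")
    gaps["did"] = gaps["model"] - gaps["human"]
    return gaps
```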

Key Findings

  1. High Overall Correlation but Hidden Compression – All six LLMs achieve a respectable overall correlation (r ≈ 0.78) and MAE ≈ 0.42, suggesting they capture the broad shape of public opinion. However, a systematic “compression” effect emerges: groups that are genuinely highly concerned about climate change (e.g., progressive Hispanic women) are predicted to be less concerned, while groups that are genuinely less concerned (e.g., conservative White men) are predicted to be more concerned. This indicates that the models gravitate toward an average stance, flattening the true distribution of opinions (one simple way to quantify this compression is sketched after this list).

  2. Intersectional Gender‑Race Bias – The most striking discrepancy appears for Black respondents. In the human data, Black women show a markedly higher level of climate concern than Black men (a 0.9‑point difference on the Likert scale). All LLMs, however, produce nearly identical scores for Black men and women, effectively erasing the real gender gap. By contrast, for White and Hispanic respondents the models reproduce the observed gender gap fairly accurately. This selective failure suggests that the models internalize stereotypical gender patterns from their training data for some racial groups but not for others.

  3. Model‑Specific Variations – GPT‑4 and Claude‑2 exhibit the smallest compression bias, yet they still display the same Black‑gender error. Llama‑2‑70B and Gemini‑1.5‑Flash show slightly larger deviations, indicating that the phenomenon is not confined to a single architecture or training regime.
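One simple way to quantify the compression described in finding 1 is to regress model-predicted group means on the actual group means: a slope near 1 would mean between-group differences are preserved, while a slope well below 1 means high-concern groups are pulled down and low-concern groups pulled up toward the grand mean. This is an illustrative diagnostic under that assumption, not the statistic used in the paper.

```python
import numpy as np

def compression_slope(actual_group_means: np.ndarray, predicted_group_means: np.ndarray) -> float:
    """OLS slope of predicted vs. actual group means.

    A slope of 1 preserves between-group differences; a slope well below 1
    indicates compression toward the average opinion. Illustrative only.
    """
    slope, intercept = np.polyfit(actual_group_means, predicted_group_means, deg=1)
    return slope
```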

Implications
The authors argue that standard auditing practices—typically limited to checking average differences across single demographic axes—would miss the nuanced errors uncovered here. In policy contexts, such hidden biases could lead to misallocation of outreach resources (e.g., over‑targeting groups that are already highly concerned) or to the neglect of communities whose concerns are under‑represented in model‑derived insights. Moreover, the compression effect may reinforce a false perception of consensus, undermining the recognition of polarized or minority viewpoints that are crucial for equitable climate governance.

Recommendations and Limitations
To mitigate these issues, the paper proposes several avenues: (a) fine‑tuning LLMs on demographically labeled datasets that explicitly preserve intersectional variance; (b) incorporating bias‑reduction prompts that instruct the model to “avoid assuming uniform gender patterns across races”; and (c) demanding greater transparency about training corpora so that stakeholders can assess the provenance of embedded stereotypes. The authors acknowledge limitations: the study is confined to a U.S. sample, and results may not generalize to other cultural contexts; also, the prompt engineering choices (e.g., phrasing of the profile) could influence model behavior in ways not fully explored.
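As an illustration of recommendation (b) above, a bias-reduction instruction could be appended to the persona prompt along the following lines. The wording is hypothetical; the paper is summarized as quoting only the phrase about avoiding uniform gender assumptions across races.

```python
# Hypothetical debiasing instruction; illustrative wording, not the paper's prompt.
DEBIAS_INSTRUCTION = (
    "When answering as this persona, avoid assuming uniform gender patterns "
    "across racial or ethnic groups; base your answer only on the stated profile."
)

def debiased_prompt(persona: str, question: str) -> str:
    """Append the debiasing instruction to a persona prompt before querying the model."""
    return (
        f"{persona}\n{DEBIAS_INSTRUCTION}\n"
        f"Question: {question}\nReply with a number from 1 to 5."
    )
```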

Conclusion
While LLMs can approximate aggregate public opinion on climate change, they systematically compress opinion diversity and misrepresent intersectional gender‑race patterns, especially for Black Americans. These hidden biases are invisible to conventional audits but have concrete ramifications for climate policy design and equitable public engagement. The paper underscores the necessity of intersectional bias testing, transparent model development, and targeted fine‑tuning before deploying LLM‑based opinion analytics in high‑stakes governmental decision‑making.

