The American Ghost in the Machine: How language models align culturally and the effects of cultural prompting

Culture is the bedrock of human interaction; it shapes how we perceive and respond to everyday situations. As human-computer interaction grows with the rise of generative Large Language Models (LLMs), the cultural alignment of these models becomes an important field of study. This work, using the VSM13 International Survey and Hofstede's cultural dimensions, identifies the cultural alignment of popular LLMs (DeepSeek-V3, DeepSeek-V3.1, GPT-5, GPT-4.1, GPT-4, Claude Opus 4, Llama 3.1, and Mistral Large). We then use cultural prompting, i.e., system prompts that shift a model's cultural alignment toward a desired country, to test how well these models adapt to other cultures, namely China, France, India, Iran, Japan, and the United States. We find that the majority of the eight LLMs tested default to the United States when no culture is specified, with varying results when prompted for other cultures. Under cultural prompting, seven of the eight models shifted closer to the target culture. Models had particular trouble aligning with Japan and China, even though two of the models tested originate from the Chinese company DeepSeek.


💡 Research Summary

The paper investigates how large language models (LLMs) align with human cultures and whether a simple “cultural prompting” technique can shift a model’s behavior toward a target national culture. Using Hofstede’s six cultural dimensions (Power Distance, Individualism, Masculinity, Uncertainty Avoidance, Long‑Term Orientation, Indulgence) and the VSM13 International Survey, the authors evaluate eight state‑of‑the‑art LLMs: DeepSeek‑V3, DeepSeek‑V3.1, OpenAI’s GPT‑5, GPT‑4.1, GPT‑4, Claude Opus 4, Llama 3.1, and Mistral Large. Gemini is excluded due to poor instruction following.

Methodology: Each of the 24 VSM13 questionnaire items is slightly re‑phrased for LLM compatibility. For every model‑culture pair, 50 responses are generated with a fixed random seed sequence (1‑8400) and temperature set to the maximum (1.0) to emulate human variability. Mean responses (m₀₁…m₂₄) are plugged into linear equations supplied by Hofstede’s manual, producing a score from 0‑100 for each dimension, normalized to a midpoint of 50. The absolute difference between a model’s dimension scores and the real‑world country scores is summed to obtain a “total alignment distance.”
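
To make the scoring step concrete, here is a minimal sketch assuming the published VSM 2013 formulas; the per-dimension anchoring constants and the exact normalization the authors apply are assumptions here, not details taken from the paper:

```python
def vsm13_scores(m: dict[int, float], c: dict[str, float] | None = None) -> dict[str, float]:
    """Compute Hofstede's six dimension scores from VSM13 item means.

    `m` maps item number (1-24) to the mean of the 50 sampled responses.
    `c` holds the per-dimension anchoring constants (C(pd), C(ic), ...)
    from the VSM 2013 manual, which shift scores toward a 0-100 range
    with a midpoint of 50; their values are left to the caller, since
    the paper does not restate them.
    """
    c = c or {d: 0.0 for d in ("PDI", "IDV", "MAS", "UAI", "LTO", "IVR")}
    return {
        "PDI": 35 * (m[7] - m[2]) + 25 * (m[20] - m[23]) + c["PDI"],
        "IDV": 35 * (m[4] - m[1]) + 35 * (m[9] - m[6]) + c["IDV"],
        "MAS": 35 * (m[5] - m[3]) + 35 * (m[8] - m[10]) + c["MAS"],
        "UAI": 40 * (m[18] - m[15]) + 25 * (m[21] - m[24]) + c["UAI"],
        "LTO": 40 * (m[13] - m[14]) + 25 * (m[19] - m[22]) + c["LTO"],
        "IVR": 35 * (m[12] - m[11]) + 40 * (m[17] - m[16]) + c["IVR"],
    }


def total_alignment_distance(model: dict[str, float], country: dict[str, float]) -> float:
    """Sum of absolute per-dimension differences ("total alignment distance")."""
    return sum(abs(model[d] - country[d]) for d in model)
```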

Results without prompting show a strong default bias toward the United States: five of the eight models are closest to the US, with DeepSeek‑V3.1 (total distance 52.13) and GPT‑5 (87.13) nearest. Some models (Llama 3.1, GPT‑4) land closer to Iran, while the DeepSeek models are, paradoxically, far from China despite being products of a Chinese company.

Cultural prompting is implemented by adding a system prompt such as "Answer as if you are from [country]," where the country is one of the six targets (China, France, India, Iran, Japan, or the United States).
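
A hedged sketch of how one such prompted sample might be collected with the OpenAI Python SDK follows; the model name, exact prompt wording, and seed scheme are illustrative assumptions rather than the authors' precise setup:

```python
from openai import OpenAI

client = OpenAI()


def sample_responses(question: str, country: str, n: int = 50) -> list[str]:
    """Collect n answers to one re-phrased VSM13 item under a cultural prompt.

    Temperature is set to the maximum described in the paper (1.0) and
    each call receives its own seed; the exact seed sequence used by the
    authors (1-8400 across all items and cultures) is approximated here.
    """
    system_prompt = f"Answer as if you are from {country}."  # illustrative wording
    answers = []
    for seed in range(1, n + 1):
        resp = client.chat.completions.create(
            model="gpt-4.1",  # one of the evaluated models
            temperature=1.0,
            seed=seed,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question},
            ],
        )
        answers.append(resp.choices[0].message.content)
    return answers
```

The per-item means fed into the scoring formulas above would then come from parsing these 50 answers for each model-culture pair.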

