As an emerging and powerful tool, Large Language Model (LLM)-driven Human Digital Twins show great potential for healthcare systems research. However, their actual ability to simulate complex human psychological traits, such as distrust in the healthcare system, remains unclear. This research gap particularly affects health professionals' trust in, and use of, LLM-based Artificial Intelligence (AI) systems for assisting their routine work. In this study, based on the Twin-2K-500 dataset, we systematically evaluated LLM-driven human digital twin simulations using the Health Care System Distrust Scale (HCSDS) against an established human-subject sample, analyzing item-level distributions, summary statistics, and demographic subgroup patterns. Results showed that the digital twins' simulated responses were significantly more concentrated, with lower variance and fewer selections of extreme options (all p < 0.001). While the digital twins broadly reproduced major demographic patterns in the human data, such as those related to age and gender, they showed relatively low sensitivity to finer differences across education levels. LLM-based digital twin simulation thus has the potential to reproduce population-level trends, but it struggles to make detailed, subgroup-specific distinctions. These findings suggest that current LLM-driven Digital Twins have limitations in modeling complex human attitudes and require careful calibration and validation before being applied to inferential analyses or policy simulations in health systems engineering. Future studies should examine the emotional reasoning mechanisms of LLMs before deployment, particularly for simulations involving socially sensitive topics, such as human-automation trust.
Distrust in the health care system is widely considered an important psychological determinant of patient behavior, including service utilization, adherence, information disclosure, and health outcomes, particularly among minority, low-income, and historically marginalized populations [1], [2]. The Health Care System Distrust Scale (HCSDS), developed by Rose et al. [3], provides a well-structured and well-validated measurement tool with demonstrated reliability and validity across diverse groups. In health systems engineering, the ability to measure and simulate distrust is a practical necessity: getting it right can help hospitals fine-tune their services, guide the design of public health campaigns, and support the development of fairer health policies [4], [5].
Recently, Artificial Intelligence (AI), particularly Large Language Models (LLMs), has shown great potential in the healthcare sector. LLMs can help answer patients’ questions, simulate behaviors, serve as virtual subjects, and carry out predictive analytics [6], [7], [8]. However, when LLMs are asked to fill out surveys or mimic human responses, they are more likely to choose mid-scale options, and the overall spread of their answers is narrower than what is observed in real human respondents [9], [10], [11]. Moreover, LLMs struggle to capture demographic heterogeneity, such as differences in race or education [12], [13]. The objective of this study is to construct LLM-driven digital twins and test how well they reflect real human behaviors and responses. We aim to quantitatively assess item-level concordance between LLM-generated and real-world data, to investigate the digital twins’ ability to reproduce demographic heterogeneity, and to offer suggestions for the calibration and validation of LLM-based digital twins in health systems engineering.
To build the human digital twins, we used the Twin-2K-500 dataset [14], which contains detailed profiles of 2,058 U.S. adults as personas. Each persona has more than 500 features, covering everything from basic demographics to psychological, economic, and behavioral measures. We then used ChatGPT-4 to power the digital twins, with the persona summaries provided in the dataset serving as prompts, ensuring that each twin closely reflects the human diversity within the dataset. The human reference sample is drawn from the 400 Philadelphia jurors in the study of Rose et al. [3].
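The persona-to-prompt step can be illustrated with a minimal sketch. The function name and prompt wording below are hypothetical (the exact prompts used with ChatGPT-4 are not reproduced here); the sketch only shows how a persona summary and one HCSDS item might be combined into a single instruction for the model:

```python
def build_twin_prompt(persona_summary: str, item_text: str) -> str:
    """Combine a Twin-2K-500 persona summary with one HCSDS item.

    Hypothetical sketch: wording and structure are illustrative only,
    not the prompts actually used in the study.
    """
    return (
        "You are simulating a specific survey respondent.\n"
        "Persona profile:\n"
        f"{persona_summary}\n\n"
        "Answer the following statement on a 5-point Likert scale "
        "(1 = strongly disagree, 5 = strongly agree). "
        "Reply with the number only.\n\n"
        f"Statement: {item_text}"
    )
```

Each twin would answer all ten items this way, with responses parsed back into 1–5 integers for analysis.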
To ensure comparability with the human-subject study, we applied stratified random sampling to obtain a 500-case subsample from the full dataset (N = 2,058). Stratification variables included gender, age group, and ethnicity; education level was treated as a secondary variable due to limitations of the full dataset. Because of differences between the dataset and the human-subject study, ages were aggregated into three groups: 18-30, 31-50, and 51+. Target sample sizes for each stratum were proportionally scaled to 500 cases. If a stratum was underrepresented, all available cases were included and the remaining slots were filled at random. The final subsample closely matched the reference human population in major demographics; however, due to data source limitations, the educational distribution showed some deviation (see Table 1), with around 70% of the sample having received undergraduate or higher education.

We applied the 10-item Health Care System Distrust Scale (HCSDS) developed by Rose et al. [3] with 5-point Likert responses. Three positively worded items, B (“My medical records are kept private”), H (“I receive high-quality medical care from the health care system”), and I (“The health care system puts my medical needs above all other considerations”), were reverse-scored so that higher scores indicate greater distrust. For each item, we calculated means and standard deviations and drew distribution histograms for visualization. Chi-square tests were conducted to compare response distributions between the digital twins and the human reference. We additionally conducted between-group difference analyses and subgroup analyses. All analyses were conducted in R 4.4.1 with a significance level of α = 0.05.
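The proportional allocation with random top-up described above can be sketched as follows. The function `stratified_subsample` and its arguments are illustrative (the study's actual strata were gender × age group × ethnicity, and the analyses were run in R):

```python
import random


def stratified_subsample(records, key, target_n, seed=0):
    """Proportionally allocate target_n slots across strata defined by key(record).

    Illustrative sketch of the paper's procedure: each stratum's quota is
    scaled to its share of the full dataset; rounding shortfalls (or the
    underrepresented-stratum case) are filled at random from the remainder.
    """
    rng = random.Random(seed)
    strata = {}
    for r in records:
        strata.setdefault(key(r), []).append(r)

    total = len(records)
    chosen, leftovers = [], []
    for members in strata.values():
        quota = round(len(members) / total * target_n)
        rng.shuffle(members)
        take = min(quota, len(members))  # underrepresented stratum: take all
        chosen.extend(members[:take])
        leftovers.extend(members[take:])

    # Rounding may leave the sample slightly short or long; fix up at random.
    if len(chosen) > target_n:
        chosen = rng.sample(chosen, target_n)
    elif len(chosen) < target_n:
        chosen.extend(rng.sample(leftovers, target_n - len(chosen)))
    return chosen
```

With exact proportional quotas (e.g. a 60/40 split scaled to 50 cases), the subsample preserves the strata shares exactly; otherwise the random top-up keeps the total at `target_n`.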
The digital twins tended to give answers clustered tightly around the center, producing taller peaks in the response distributions. By comparison, real participants showed a broader range of scores, with responses more spread out and noticeably lower peaks (see Figure 1).
To quantitatively evaluate the correspondence between the digital twin results and the human reference, we conducted chi-square tests for each item. The results showed highly significant differences for all items (all p < 0.001).

A subgroup analysis was conducted to assess the human digital twins’ ability to model demographic heterogeneity in terms of age, gender, and educational level, compared against the human reference data. In both the LLM and the human samples, no significant
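The item-level comparison can be sketched as below. The study's analyses were run in R 4.4.1; this Python version using `scipy.stats.chi2_contingency` is an illustrative equivalent, and the helper names (`reverse_score`, `item_chisq`) are ours, not from the paper:

```python
import numpy as np
from scipy.stats import chi2_contingency


def reverse_score(x, points=5):
    """Reverse-score a positively worded Likert item (1<->5, 2<->4, ...)."""
    return points + 1 - x


def item_chisq(human_responses, twin_responses, points=5):
    """Chi-square test comparing one item's response distribution
    between the human reference and the digital twins (illustrative sketch)."""
    human = np.asarray(human_responses)
    twin = np.asarray(twin_responses)
    # 2 x 5 contingency table: rows = group, columns = Likert options 1..5.
    table = np.array([
        [np.sum(human == k) for k in range(1, points + 1)],
        [np.sum(twin == k) for k in range(1, points + 1)],
    ])
    # Drop options never chosen by either group to avoid zero-sum columns.
    table = table[:, table.sum(axis=0) > 0]
    stat, p, dof, _ = chi2_contingency(table)
    return stat, p
```

A strongly center-clustered twin distribution tested against a more uniform human distribution yields the kind of p < 0.001 result reported above.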