Large Language Models for Large-Scale, Rigorous Qualitative Analysis in Applied Health Services Research
Large language models (LLMs) show promise for improving the efficiency of qualitative analysis in large, multi-site health-services research. Yet methodological guidance for LLM integration into qualitative analysis and evidence of their impact on real-world research methods and outcomes remain limited. We developed a model- and task-agnostic framework for designing human-LLM qualitative analysis methods to support diverse analytic aims. Within a multi-site study of diabetes care at Federally Qualified Health Centers (FQHCs), we leveraged the framework to implement human-LLM methods for (1) qualitative synthesis of researcher-generated summaries to produce comparative feedback reports and (2) deductive coding of 167 interview transcripts to refine a practice-transformation intervention. LLM assistance enabled timely feedback to practitioners and the incorporation of large-scale qualitative data to inform theory and practice changes. This work demonstrates how LLMs can be integrated into applied health-services research to enhance efficiency while preserving rigor, offering guidance for continued innovation with LLMs in qualitative research.
💡 Research Summary
This paper addresses the growing need for efficient qualitative analysis in large-scale, multi-site health services research by proposing a systematic human-LLM collaboration framework. Recognizing that qualitative methods are essential for uncovering contextual, interpersonal, and organizational factors in health care delivery, the authors note that traditional manual analysis often becomes a bottleneck when studies involve dozens of sites and hundreds of interview hours. Large language models (LLMs) such as ChatGPT have demonstrated impressive text-processing capabilities, yet their indiscriminate use can lead to superficial, context-blind classifications and reduced methodological transparency. To balance efficiency with rigor, the authors develop a model- and task-agnostic four-step framework:

1. Define the task on a small pilot sample to clarify objectives, output formats, and quality expectations, and to identify which components must remain under direct researcher control (e.g., data familiarization and final interpretation).
2. Design a human-LLM method by decomposing the overall task into discrete sub-tasks, explicitly assigning responsibilities to the LLM (e.g., pattern detection, thematic grouping) and to the researcher (e.g., contextual judgment, theory integration).
3. Conduct a small-scale evaluation in which two qualitative analysts perform the sub-tasks both with and without LLM assistance; outputs are compared using quantitative metrics (time saved, thematic overlap; see the sketch below) and qualitative rigor criteria (grounding in data, alignment with research questions, relevance to practice).
4. Apply the refined method to the full dataset and assess its impact on research goals and efficiency.
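To make the Step 3 comparison concrete, here is a minimal sketch of one plausible quantitative metric; the Jaccard-overlap formulation, the helper name, and the example theme labels are illustrative assumptions, not the paper's published computation.

```python
# Hypothetical Step 3 helper: compare the theme labels an analyst produced
# manually against those produced with LLM assistance, using Jaccard overlap.

def thematic_overlap(manual_themes: set[str], assisted_themes: set[str]) -> float:
    """Jaccard similarity between two sets of normalized theme labels."""
    if not manual_themes and not assisted_themes:
        return 1.0  # both empty: treat as perfect agreement
    return len(manual_themes & assisted_themes) / len(manual_themes | assisted_themes)

manual = {"staffing shortages", "care coordination", "patient outreach"}
assisted = {"staffing shortages", "care coordination", "telehealth access"}
print(f"Thematic overlap: {thematic_overlap(manual, assisted):.2f}")  # 0.50
```

A set-overlap score like this captures only label agreement; the paper pairs such metrics with qualitative rigor criteria precisely because two theme sets can overlap numerically while differing in interpretive quality.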
The framework is illustrated through two real-world tasks within a comparative case study of diabetes care across 12 Federally Qualified Health Centers (FQHCs). Task 1 involved synthesizing site-level summary bullet points (≈31,200 words total) into cross-site comparative reports across 22 practice domains. Researchers first produced manual cross-site syntheses, then introduced GPT-4o to automatically sort bullet points into thematic categories and generate draft syntheses. In a pilot, the LLM-assisted approach reduced synthesis time by 30% for one analyst and 55% for another, while the thematic structures produced by the LLM matched those created manually in over 90% of cases. However, the final reports required human revision to ensure actionable language, to omit irrelevant or redundant data, and to align themes with the management framework guiding the study. The LLM proved valuable for organizing raw data and reducing cognitive load, but it could not replace the nuanced interpretation needed for practitioner-focused feedback.
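The following is a minimal sketch of how the Task 1 sorting-and-drafting step might be scripted against the OpenAI chat completions API; the prompt wording, practice domain, and bullet points are illustrative assumptions rather than the authors' actual prompts or data.

```python
# Hedged sketch of the Task 1 workflow: group site-level bullet points by
# theme and draft a cross-site synthesis with an LLM, for later human review.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

bullets = [
    "Site A: embedded pharmacist reviews diabetes panels weekly.",
    "Site B: no dedicated staff for panel management.",
    "Site C: community health workers handle outreach for overdue A1c tests.",
]

prompt = (
    "Sort the following site-level bullet points into thematic categories, "
    "then draft a brief cross-site synthesis for the practice domain "
    "'panel management'. Ground every statement in the bullets provided.\n\n"
    + "\n".join(bullets)
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
draft = response.choices[0].message.content
print(draft)  # a human analyst reviews and revises the draft before reporting
```

Keeping the human revision step outside the script mirrors the paper's division of labor: the LLM organizes and drafts, while the analyst supplies contextual judgment and actionable language.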
Task 2 focused on deductive coding of 167 interview transcripts (≈1.33 million words) using a predefined codebook of 19 practice-area codes. Directly feeding whole interviews to the LLM exceeded token limits and produced vague outputs. To overcome this, the team implemented Retrieval-Augmented Generation (RAG): relevant excerpts were first retrieved via keyword search, then supplied to GPT-4o for code-specific extraction and summarization. The LLM generated candidate excerpts and preliminary summaries for each code, which researchers then reviewed, edited, and integrated into the final coding matrix. This hybrid approach preserved the depth of human interpretation while leveraging the LLM's speed in locating and summarizing relevant text, accelerating the coding process without sacrificing analytic fidelity.
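A minimal sketch of this RAG pattern appears below, assuming fixed-size word chunking and case-insensitive keyword matching; the file name, keyword list, code label, and prompt are hypothetical, and the authors' actual retrieval details are not reproduced here.

```python
# Hedged sketch of the Task 2 RAG pipeline: keyword retrieval narrows each
# transcript to candidate chunks, and the LLM extracts and summarizes only
# those chunks for a single codebook code, staying under token limits.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chunk(text: str, size: int = 400) -> list[str]:
    """Split a transcript into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(chunks: list[str], keywords: list[str]) -> list[str]:
    """Keep only chunks mentioning at least one keyword for this code."""
    return [c for c in chunks if any(k.lower() in c.lower() for k in keywords)]

transcript = open("interview_042.txt").read()           # hypothetical file
keywords = ["referral", "specialist", "endocrinology"]  # illustrative keywords

excerpts = retrieve(chunk(transcript), keywords)
prompt = (
    "For the code 'referral pathways', extract the relevant excerpts from the "
    "passages below and write a two-sentence summary of what they say.\n\n"
    + "\n---\n".join(excerpts)
)
summary = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content
# Researchers review and edit the summary before it enters the coding matrix.
```

Keyword retrieval is a deliberately simple stand-in; an embedding-based retriever could slot into `retrieve` without changing the rest of the pipeline, at the cost of the additional computational overhead the authors note.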
Overall, the study demonstrates that LLMs excel at data organization, pattern identification, and draft generation, delivering substantial time savings and enabling more rapid feedback loops to health-center partners. Crucially, the framework ensures that human analysts retain control over contextual judgment, theoretical integration, and the production of actionable recommendations, thereby preserving the rigor traditionally expected of qualitative research. The authors discuss limitations, including potential model bias, the need for transparent prompt documentation, and the computational overhead associated with RAG pipelines. They advocate for broader testing of the framework across diverse health-services and social-science domains, as well as the development of ethical and privacy guidelines for LLM-assisted qualitative work. In sum, this work provides a practical, reproducible roadmap for integrating LLMs into large-scale qualitative health-services research, balancing efficiency gains with methodological soundness.