Automated Coding of Communications in Collaborative Problem-solving Tasks Using ChatGPT

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

Collaborative problem solving (CPS) is widely recognized as a critical 21st-century skill. Assessing CPS depends heavily on coding communication data with a construct-relevant framework, a process that has long been a major bottleneck to scaling up such assessments. Using five datasets and two coding frameworks, we demonstrate that ChatGPT can code communication data to a satisfactory level, though performance varies across ChatGPT models and depends on the coding framework and task characteristics. Interestingly, newer reasoning-focused models such as GPT-o1-mini and GPT-o3-mini do not necessarily yield better coding results. Additionally, we show that refining prompts based on feedback from miscoded cases can improve coding accuracy in some instances, though the effectiveness of this approach is not consistent across tasks. These findings offer practical guidance for researchers and practitioners in developing scalable, efficient methods for analyzing communication data in support of 21st-century skill assessment.


💡 Research Summary

This paper investigates whether large language models (LLMs), specifically the ChatGPT family, can reliably automate the coding of textual communication data generated in collaborative problem‑solving (CPS) tasks. CPS is widely recognized as a cornerstone 21st‑century skill, yet its assessment hinges on labor‑intensive human coding of chat transcripts according to construct‑relevant frameworks. To address this bottleneck, the authors conducted a systematic empirical study across five distinct CPS datasets (including ATC21S and PISA‑2025) and two widely used coding schemes: the ATC21S five‑skill framework (participation, perspective‑taking, social regulation, task regulation, knowledge building) and the PISA three‑skill framework (shared understanding, problem‑solving actions, team process management).
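For concreteness, the two coding schemes can be represented as simple label sets. A minimal sketch (label names are taken from the summary above; the exact rubric wording and any normalization rules are assumptions):

```python
# Label sets for the two coding frameworks described above.
# Names follow the summary; the full rubric definitions live in the original frameworks.
ATC21S_LABELS = {
    "participation",
    "perspective-taking",
    "social regulation",
    "task regulation",
    "knowledge building",
}

PISA_LABELS = {
    "shared understanding",
    "problem-solving actions",
    "team process management",
}

def validate_code(label: str, framework: str) -> bool:
    """Check that a (model- or human-assigned) code is valid for the chosen framework.

    Case-insensitive matching is an illustrative assumption, not a rule from the paper.
    """
    labels = ATC21S_LABELS if framework == "ATC21S" else PISA_LABELS
    return label.lower() in labels
```

A validity check like this is a cheap first filter on model output before any accuracy scoring, since LLMs occasionally return labels outside the rubric.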

Four ChatGPT models were evaluated: GPT‑4, GPT‑4o, GPT‑o1‑mini, and GPT‑o3‑mini. Each model was prompted in a zero‑shot or few‑shot manner, receiving a textual description of the coding rubric, a few exemplar turns, and, where applicable, meta‑instructions designed to steer the model toward the correct label. Human‑coded subsets (10‑25 % of each dataset) served as the gold standard for performance measurement. Accuracy, precision, recall, F1‑score, and confusion matrices were computed for each model‑framework‑task combination.
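The metrics listed above are standard classification measures. A minimal sketch of computing accuracy and per-label precision/recall/F1 from paired gold and model labels (toy implementation, not the authors' evaluation code):

```python
def accuracy(gold, pred):
    """Fraction of turns where the model's code matches the human gold code."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def per_label_f1(gold, pred):
    """Precision, recall, and F1 for each label, from paired gold/predicted codes."""
    scores = {}
    for label in set(gold) | set(pred):
        tp = sum(g == p == label for g, p in zip(gold, pred))          # true positives
        fp = sum(p == label and g != label for g, p in zip(gold, pred))  # false positives
        fn = sum(g == label and p != label for g, p in zip(gold, pred))  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```

With skewed label distributions, as is common in CPS transcripts, per-label F1 is more informative than overall accuracy, which is presumably why the study reports both.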

Key findings include:

  1. Model‑level performance – GPT‑4 and GPT‑4o achieved the highest overall accuracies (≈78‑85 %). However, performance varied by task, with up to a 10 % swing depending on the conversational complexity. Surprisingly, the newer reasoning‑focused mini‑models (GPT‑o1‑mini, GPT‑o3‑mini) performed 5‑7 % worse on average, suggesting that raw reasoning power does not directly translate to fine‑grained coding tasks that require nuanced label discrimination.

  2. Framework influence – The ATC21S five‑skill rubric, with its larger label set and frequent overlap (e.g., distinguishing “perspective‑taking” from “knowledge building”), yielded higher error rates than the more compact PISA three‑skill scheme. The latter’s clearer definitions facilitated more consistent model predictions.

  3. Prompt refinement – By analyzing mis‑coded instances, the authors crafted targeted prompt augmentations (e.g., “If the turn involves evaluating a partner’s idea, prioritize the ‘social regulation’ label”). This iterative prompting improved ATC21S accuracy by 3‑4 % on average, but had negligible impact on PISA tasks, indicating that the benefit of prompt engineering is task‑dependent.

  4. Confidence calibration – Model‑generated confidence scores correlated inversely with error frequency; high‑confidence predictions were substantially more reliable. This opens the possibility of a hybrid workflow where only low‑confidence turns are sent to human raters, dramatically reducing manual effort.

  5. Cost‑benefit analysis – Compared with traditional human coding pipelines, the LLM‑based approach cut costs by roughly 70‑80 % and reduced processing time by over 90 %, while maintaining a level of coding fidelity sufficient for large‑scale assessment purposes.
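Finding 4 suggests a hybrid workflow: auto-accept high-confidence model codes and send the rest to human raters. A minimal sketch (the record fields and the 0.85 threshold are illustrative assumptions, not values from the study):

```python
def route_turns(coded_turns, threshold=0.85):
    """Split model-coded turns into auto-accept and human-review queues by confidence.

    `coded_turns` is a list of dicts: {"turn": str, "label": str, "confidence": float}.
    The default threshold is an illustrative choice, not a value from the paper.
    """
    auto, review = [], []
    for t in coded_turns:
        (auto if t["confidence"] >= threshold else review).append(t)
    return auto, review
```

In practice the threshold would be tuned on a human-coded calibration set so that auto-accepted turns meet a target agreement rate, with only the review queue incurring manual coding cost.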

The study concludes that ChatGPT can serve as a viable, scalable alternative for coding CPS communication data, provided that researchers carefully select the appropriate model, consider the granularity of the coding framework, and apply task‑specific prompt engineering. The findings also caution against assuming that newer, more “reasoning‑oriented” LLMs will automatically outperform older, more generalist models on fine‑grained educational coding tasks. Future work is suggested in fine‑tuning LLMs on domain‑specific CPS corpora, exploring active‑learning loops that combine model confidence with selective human verification, and extending the approach to multimodal data (e.g., audio‑to‑text pipelines).

