"Should I Give Up Now?" Investigating LLM Pitfalls in Software Engineering

"Should I Give Up Now?" Investigating LLM Pitfalls in Software Engineering
Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Software engineers are increasingly incorporating AI assistants into their workflows to enhance productivity and reduce cognitive load. However, experiences with large language models (LLMs) such as ChatGPT vary widely: while some engineers find them useful, others deem them counterproductive due to inaccurate responses. Researchers have likewise observed that ChatGPT often provides incorrect information. Given these limitations, it is crucial to determine how to effectively integrate LLMs into software engineering (SE) workflows. Analyzing data from 26 participants in a complex web development task, we identified nine failure types, categorized into incorrect or incomplete responses, cognitive overload, and context loss. Users attempted to mitigate these issues through scaffolding, prompt clarification, and debugging. Nevertheless, 17 participants ultimately abandoned ChatGPT due to persistent failures. Our quantitative analysis revealed that unhelpful responses increased the likelihood of abandonment by a factor of 11, while each additional prompt reduced the abandonment probability by 17%. This study advances the understanding of human-AI interaction in SE tasks and outlines directions for future research and tooling support.


💡 Research Summary

The paper investigates why software engineers sometimes give up on using large language model (LLM) assistants such as ChatGPT during complex development tasks. The authors conducted an observational study with 26 participants—21 students and 5 professional developers—who were asked to complete a multi‑step web‑application project while interacting with ChatGPT. Interaction logs, screen recordings, and post‑task interviews were collected and analyzed both qualitatively and quantitatively.

Through manual coding, the researchers identified 92 failure instances (58% of all interactions) that fell into nine distinct failure types, which they grouped into three high‑level categories: (1) incomplete or incorrect responses, (2) cognitive overload and information‑management errors, and (3) loss of context leading to redundant effort. Underlying these failures were 12 root causes, ranging from user miscommunication (e.g., omitting crucial details) to model limitations (e.g., hallucinations, ignoring user expertise). Participants employed seven mitigation strategies, such as re‑phrasing prompts, debugging generated code, and consulting external resources.

A logistic regression model revealed that receiving an unhelpful response increased the odds of abandoning ChatGPT by a factor of 11. Conversely, each additional prompting iteration reduced the abandonment probability by 17%, indicating that users can partially recover from failures by persisting with more queries. Experienced developers tended to prompt more and were less likely to quit than novices. A replication with the newer GPT‑5.1 model reproduced the same failure categories, suggesting that the observed interaction problems are not merely artifacts of a specific model version.
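To make the reported effect sizes concrete, the odds ratios from a logistic regression can be converted into probability shifts using the standard logistic identity. The sketch below is illustrative only: the baseline abandonment probability is hypothetical (the paper does not report one here), and the "17% per prompt" figure is treated as an odds ratio of 0.83 per additional prompt.

```python
def shift_probability(p_base: float, odds_ratio: float) -> float:
    """Apply an odds ratio to a baseline probability and return the new probability."""
    odds = p_base / (1 - p_base)          # convert probability to odds
    new_odds = odds * odds_ratio          # odds ratios multiply on the odds scale
    return new_odds / (1 + new_odds)      # convert back to a probability

# Hypothetical baseline abandonment probability (not taken from the paper).
p0 = 0.10

# An unhelpful response multiplies the odds of abandonment by 11.
p_unhelpful = shift_probability(p0, 11.0)

# Treating the 17% reduction as an odds ratio of 0.83 per extra prompt,
# five additional prompts compound to 0.83 ** 5.
p_after_5_prompts = shift_probability(p0, 0.83 ** 5)

print(f"baseline:                 {p0:.2f}")
print(f"after unhelpful response: {p_unhelpful:.2f}")
print(f"after 5 extra prompts:    {p_after_5_prompts:.2f}")
```

The key point the arithmetic makes visible is that a single odds ratio of 11 can swing a modest baseline risk to better-than-even odds of abandonment, whereas the per-prompt effect only compounds gradually.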

The paper’s contributions are fourfold: (1) it releases a fine‑grained dataset of human‑LLM interaction histories for complex SE tasks; (2) it provides a taxonomy of nine failure types and twelve causal factors; (3) it catalogs real‑world mitigation tactics and highlights opportunities for tool support such as automatic context preservation and prompt‑guidance features; and (4) it offers statistical evidence that unhelpful model outputs are a primary driver of user abandonment, extending prior work that focused mainly on model accuracy in isolated benchmark settings.

Limitations include the modest sample size, reliance on a single task domain, and a definition of “abandonment” that only captures mid‑task cessation. Future work should explore larger, more diverse task sets, longitudinal studies of abandonment behavior, and the design of adaptive interfaces that proactively address the identified failure modes.
