Debugging Defective Visualizations: Empirical Insights Informing a Human-AI Co-Debugging System
Visualization authoring is an iterative process in which users repeatedly adjust parameters to achieve the desired aesthetics. Because of this complexity, users often create defective visualizations and struggle to fix them. Many seek help on forums (e.g., Stack Overflow), while others turn to AI, yet little is known about the strengths and limitations of these approaches, or how they can be effectively combined. We analyze Vega-Lite debugging cases from Stack Overflow, categorizing the types of questions askers pose, evaluating human responses, and assessing AI performance. Guided by these findings, we design a human-AI co-debugging system that combines LLM-generated suggestions with forum knowledge. We evaluated the system in a user study on 36 unresolved problems, comparing it with forum answers and LLM baselines. Our results show that while forum contributors provide accurate but slow solutions and LLMs offer immediate but sometimes misaligned guidance, the hybrid system resolves 86% of cases, more than either approach alone.
💡 Research Summary
The paper investigates how users debug defective visualizations created with Vega‑Lite, focusing on the complementary roles of human community support (e.g., Stack Overflow) and large language models (LLMs). The authors first construct a curated dataset of 297 real‑world debugging cases from Stack Overflow. Starting from an initial crawl of 889 Vega‑Lite‑tagged posts, they use a GPT‑4‑based classifier followed by manual verification to isolate genuine debugging questions. Two validation steps—JSON syntax checking and data‑availability verification—reduce the set to 297 fully specified cases that include the user’s description, defective visualization image, code snippet, dataset, and the accepted answer.
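The two validation steps can be sketched as a simple filter over the crawled posts. This is a minimal illustration, not the authors' implementation: the post fields (`spec`, `data_url`) are hypothetical, and a real data-availability check would also issue an HTTP request to confirm the dataset is reachable.

```python
import json
from urllib.parse import urlparse

def is_valid_case(post):
    """Keep a post only if its Vega-Lite spec parses and its dataset URL is well-formed.

    `post` is assumed to be a dict with hypothetical keys:
      "spec"     - the Vega-Lite specification as a JSON string
      "data_url" - optional URL of the dataset the spec references
    """
    # Step 1: JSON syntax check - the defective spec must at least parse.
    try:
        spec = json.loads(post["spec"])
    except (json.JSONDecodeError, TypeError, KeyError):
        return False

    # Step 2: data-availability check - the case must point at an absolute
    # data URL (a real pipeline would additionally verify it still resolves).
    url = post.get("data_url") or spec.get("data", {}).get("url", "")
    parsed = urlparse(url)
    return bool(parsed.scheme and parsed.netloc)

posts = [
    {"spec": '{"mark": "bar", "data": {"url": "https://example.com/cars.json"}}'},
    {"spec": '{"mark": "bar", "data": {'},  # truncated JSON - rejected
]
kept = [p for p in posts if is_valid_case(p)]
```

Applying such a filter to the 889 crawled posts is what narrows the corpus down to the fully specified cases.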
Three research questions guide the empirical study: (1) What types of questions do askers pose? (2) How effective are human answers? (3) Can LLMs provide comparable or superior assistance? The authors categorize questions into six major groups (color/scale adjustments, axis issues, layout problems, data mismatches, interaction bugs, and miscellaneous). Human answers resolve about 78 % of the cases but exhibit a long average latency of roughly 12 hours.
For the AI side, the study evaluates five models—open‑source (Qwen 2.5‑72B, DeepSeek‑R1) and closed‑source (Claude‑3.5‑Sonnet, GPT‑4o, o1‑Pro)—both text‑only and multimodal variants. Two prompting regimes are tested: a zero‑shot prompt that feeds the raw forum text (images replaced by placeholders) and a multi‑turn prompt that supplies additional context such as Vega‑Lite compiler logs and official documentation. In zero‑shot mode, LLMs achieve an average correctness of 65 %; with contextual augmentation, correctness rises to about 73 %. Multimodal models that ingest the image of the defective chart can pinpoint visual errors more quickly, yet they often introduce “aesthetic over‑corrections” (e.g., changing a user‑requested red‑white‑green gradient to a default palette), indicating a misalignment with fine‑grained user intent.
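The two prompting regimes can be sketched as follows. The templates and field names below are illustrative assumptions, not the paper's exact prompts: the zero-shot regime forwards the raw forum text with images swapped for placeholders, while the multi-turn regime adds context turns carrying compiler logs and documentation snippets.

```python
IMAGE_PLACEHOLDER = "[image omitted]"

def zero_shot_prompt(question_text, spec):
    """Zero-shot regime: raw forum text, with embedded images replaced by placeholders."""
    body = question_text.replace("<img>", IMAGE_PLACEHOLDER)
    return (
        "You are a Vega-Lite debugging assistant.\n"
        f"Question:\n{body}\n\n"
        f"Defective specification:\n{spec}\n"
        "Return a corrected specification."
    )

def multi_turn_context(compiler_log, doc_snippets):
    """Multi-turn regime: extra turns supplying Vega-Lite compiler logs and docs."""
    turns = [{"role": "user", "content": f"Vega-Lite compiler log:\n{compiler_log}"}]
    for snippet in doc_snippets:
        turns.append({"role": "user", "content": f"Relevant documentation:\n{snippet}"})
    return turns
```

The contextual turns are what lift average correctness from 65% to about 73% in the study; multimodal variants would instead pass the chart image itself rather than a placeholder.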
Guided by these findings, the authors design a mixed‑initiative co‑debugging system that leverages the speed of LLMs and the accuracy of human knowledge. The system consists of four pipeline stages: (1) automatic classification and multimodal retrieval of similar forum posts, documentation, and compiler logs; (2) LLM‑generated draft fixes for the Vega‑Lite specification; (3) verification and refinement using retrieved human answers and execution logs; (4) an interactive UI that highlights suggested code changes, provides natural‑language explanations, and allows the user to accept or edit the suggestions. The design principles emphasize that AI should propose rapid hypotheses while humans (or curated community knowledge) validate and correct them.
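The four stages can be sketched as a chain of functions. This is a structural sketch only: the retrieval, drafting, and verification bodies below are stubs standing in for the real components (similarity search over forum posts, an LLM call, and checking the draft against accepted human answers), and all field names are assumptions.

```python
def retrieve_context(case, forum_index):
    """Stage 1: classify the question and retrieve similar forum posts/docs/logs
    (stubbed here as a match on a precomputed question category)."""
    return [post for post in forum_index if post["category"] == case["category"]]

def draft_fix(case, context):
    """Stage 2: LLM-generated draft fix for the Vega-Lite spec
    (stubbed: copy the spec and mark it as a draft edit)."""
    fix = dict(case["spec"])
    fix["_draft"] = True  # placeholder for the model's proposed change
    return fix

def verify_fix(fix, context):
    """Stage 3: verify/refine the draft against retrieved human answers and logs."""
    fix["_verified"] = any(post.get("accepted") for post in context)
    return fix

def present(fix):
    """Stage 4: surface the suggested change for the user to accept or edit."""
    return {"suggestion": fix, "user_action": "accept_or_edit"}

case = {"category": "axis", "spec": {"mark": "bar"}}
forum_index = [
    {"category": "axis", "accepted": True},
    {"category": "color", "accepted": True},
]

ctx = retrieve_context(case, forum_index)
result = present(verify_fix(draft_fix(case, ctx), ctx))
```

The division of labor mirrors the stated design principle: the fast, fallible step (drafting) is delegated to the model, while validation draws on curated community knowledge before anything reaches the user.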
A user study evaluates the system on 36 previously unsolved debugging cases. Three conditions are compared: (i) the original best answer from Stack Overflow, (ii) LLM‑only assistance, and (iii) the proposed hybrid system. The hybrid approach resolves 86 % of the cases (31/36), outperforming human‑only (68 %) and LLM‑only (71 %). Moreover, the average time to a usable solution drops to under three minutes for the hybrid system, compared with twelve hours for human answers and about one minute for LLM‑only (though many LLM outputs required later correction). Participants rate the hybrid system higher on accuracy, speed, and explanation clarity (average 4.6/5).
The paper contributes (1) a publicly released dataset of real Vega‑Lite debugging incidents, (2) a quantitative comparison framework for human vs. AI assistance in visualization debugging, (3) design guidelines for human‑AI co‑debugging, and (4) an implemented system that demonstrably improves resolution rates and user satisfaction. Limitations include the focus on a single visualization grammar, reliance on external data URLs that raise privacy concerns, and the need for a human verification step that prevents full automation. Future work is outlined to extend the approach to other grammars (e.g., ggplot2, Vega), handle more complex interactive visualizations, incorporate privacy‑preserving local LLMs, and develop IDE plugins that continuously learn from community contributions. Overall, the study shows that a thoughtfully integrated human‑AI workflow can substantially enhance the debugging experience for visualization authors.