Reporting and Reviewing LLM-Integrated Systems in HCI: Challenges and Considerations
What should HCI scholars consider when reporting and reviewing papers that involve LLM-integrated systems? We interview 18 authors of LLM-integrated system papers on their authoring and reviewing experiences. We find that norms of trust-building between authors and reviewers appear to be eroded by the uncertainty of LLM behavior and hyperbolic rhetoric surrounding AI. Authors perceive that reviewers apply uniquely skeptical and inconsistent standards towards papers that report LLM-integrated systems, and mitigate mistrust by adding technical evaluations, justifying usage, and de-emphasizing LLM presence. Authors’ views challenge blanket directives to report all prompts and use open models, arguing that prompt reporting is context-dependent and justifying proprietary model usage despite ethical concerns. Finally, some tensions in peer review appear to stem from clashes between the norms and values of HCI and ML/NLP communities, particularly around what constitutes a contribution and an appropriate level of technical rigor. Based on our findings and additional feedback from six expert HCI researchers, we present a set of guidelines and considerations for authors, reviewers, and HCI communities around reporting and reviewing papers that involve LLM-integrated systems.
💡 Research Summary
This paper investigates how the rapid rise of large language model (LLM)-integrated systems is reshaping reporting and peer-review practices within the Human-Computer Interaction (HCI) community. The authors conducted semi-structured interviews with 18 scholars who have authored LLM-integrated system papers, supplemented by feedback from six expert HCI researchers. They also performed a quantitative sweep of CHI and UIST proceedings from 2021 onward, showing a steep increase in the number and proportion of papers that incorporate LLMs as system engines.
The study uncovers several intertwined challenges. First, the inherent nondeterminism of LLM outputs and the surrounding hype about AI have eroded traditional trust-building norms between authors and reviewers. Reviewers often lack a shared understanding of what distinguishes a “real contribution” from a mere “wrapper” around an LLM, leading to inconsistent and sometimes overly skeptical judgments. To compensate, authors frequently add technical evaluations (e.g., multi-model comparisons or prompt-sensitivity analyses; a sketch follows below) and, to stay within page limits, shift detailed implementation information to appendices and code repositories.
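To make the flavor of such evaluations concrete, here is a minimal sketch of a prompt-sensitivity analysis for a summarization-style task. Everything in it (the `query_model` stub, the toy `score_output` metric, the paraphrase set) is our illustrative assumption, not a procedure taken from the paper:

```python
import statistics

# Minimal sketch of a prompt-sensitivity analysis. `query_model` stands
# in for whatever LLM call the system under study actually makes.

PARAPHRASES = [
    "Summarize the user's note in one sentence.",
    "Give a one-sentence summary of the note below.",
    "In a single sentence, summarize the following note.",
]

def query_model(model: str, prompt: str, text: str) -> str:
    """Placeholder for the system's actual LLM call."""
    raise NotImplementedError("wire this to your model API")

def score_output(output: str, reference: str) -> float:
    """Toy quality metric: token overlap with a reference answer.
    A real study would use a task-appropriate metric or human ratings."""
    out, ref = set(output.lower().split()), set(reference.lower().split())
    return len(out & ref) / max(len(ref), 1)

def prompt_sensitivity(model: str, text: str, reference: str) -> dict:
    """Score semantically equivalent prompts; a wide spread suggests
    the reported results hinge on exact prompt wording."""
    scores = [
        score_output(query_model(model, p, text), reference)
        for p in PARAPHRASES
    ]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "range": max(scores) - min(scores),
    }
```

Reporting the spread across paraphrases, rather than a single best-prompt score, is one way authors in the study tried to preempt reviewer skepticism about cherry-picked prompts.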
Second, the paper highlights a clash of epistemic cultures. HCI traditionally values interpretive, design‑oriented contributions, whereas ML/NLP fields prioritize model‑centric rigor, reproducibility, and quantitative validation. This divergence manifests in reviewers demanding open‑model baselines, extensive statistical testing, or large participant pools even for work whose primary goal is exploratory design or user experience. Consequently, authors feel pressured to justify the use of proprietary models, to explain cost or accessibility constraints, and to provide ethical rationales for model selection.
Third, the authors question blanket mandates such as “report all prompts” or “use only open models.” Their interview data reveal that full prompt disclosure can be impractical or unnecessary when hundreds of prompts are generated automatically or when prompts are not central to the scientific claim. Context‑dependent reporting—disclosing only those prompts that materially affect outcomes—offers a more balanced approach to transparency and reproducibility.
Fourth, institutional responses are already emerging. Both UIST 2025 and CHI 2026 introduced desk‑reject policies targeting papers that insufficiently justify LLM usage or that appear to be mere “LLM wrappers.” While intended to reduce reviewer overload, these policies risk creating a chilling effect that discourages legitimate LLM‑integrated research.
Based on these findings, the authors synthesize practical guidelines for three stakeholder groups.

For authors, the recommendations emphasize: (a) clearly articulating why an LLM is chosen; (b) disclosing the model version, training-data provenance, and any prompt-engineering strategies that influence results; (c) including at least one technical evaluation that isolates the LLM’s contribution; and (d) providing ethical statements regarding model licensing, cost, and bias mitigation.

For reviewers, the paper suggests: (i) evaluating submissions against the authors’ stated goals rather than imposing a one-size-fits-all technical checklist; (ii) requesting additional details only when they are essential for assessing validity; and (iii) staying aware of the broader HCI value system, which may prioritize design insight over raw performance metrics.

Finally, for the HCI community, the authors propose a checklist that integrates transparency, reproducibility, contextual relevance, and ethical considerations, aiming to restore a shared baseline of trust while respecting the diverse methodological traditions within HCI.
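As one concrete illustration of recommendation (b), a submission’s supplementary material might carry a structured disclosure record along these lines. The field names and values are our hypothetical example, not a format the paper prescribes:

```python
# Hypothetical LLM-usage disclosure record; field names and values are
# illustrative assumptions, not a format prescribed by the paper.
LLM_DISCLOSURE = {
    "model": "gpt-4o-2024-08-06",          # exact model snapshot used
    "provider_rationale": "proprietary model chosen for capability/latency",
    "decoding": {"temperature": 0.7, "max_tokens": 512},
    "prompts": "appendix_A_prompts.md",    # prompts that materially affect results
    "technical_evaluation": "multi-model comparison; prompt-sensitivity analysis",
    "ethics": "licensing, cost, and bias-mitigation steps stated in the paper",
}
```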
In conclusion, this work documents a pivotal moment in HCI where LLM‑integrated systems are both expanding research possibilities and exposing fissures in peer‑review culture. By offering empirically grounded recommendations, the paper seeks to guide the community toward more consistent, fair, and transparent evaluation of LLM‑enhanced HCI research.