Detecting UX smells in Visual Studio Code using LLMs

Andrés Rodriguez
arodrig@lifia.info.unlp.edu.ar
LIFIA, Fac. Informática, Univ. Nac. La Plata
La Plata, Argentina

Juan Cruz Gardey
jcgardey@lifia.info.unlp.edu.ar
LIFIA, Fac. Informática, Univ. Nac. La Plata
La Plata, Argentina

Alejandra Garrido
garrido@lifia.info.unlp.edu.ar
LIFIA, Fac. Informática, Univ. Nac. La Plata & CONICET
La Plata, Argentina

Abstract

Integrated Development Environments shape developers' daily experience, yet the empirical study of their usability and user experience (UX) remains limited. This work presents an LLM-assisted approach to detecting UX smells in Visual Studio Code by mining and classifying user-reported issues from the GitHub repository. Using a validated taxonomy and expert review, we identified recurring UX problems that affect the developer experience. Our results show that the majority of UX smells are concentrated in informativeness, clarity, intuitiveness, and efficiency, qualities that developers value most.

CCS Concepts
• Human-centered computing → Empirical studies in HCI; HCI theory, concepts and models; • Software and its engineering → Integrated and visual development environments.

Keywords
Developer Experience, LLM-assisted coding, UX smells, UXDebt

ACM Reference Format:
Andrés Rodriguez, Juan Cruz Gardey, and Alejandra Garrido. 2026. Detecting UX smells in Visual Studio Code using LLMs. In 3rd International Workshop on Integrated Development Environments (IDE '26), April 12–18, 2026, Rio de Janeiro, Brazil. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3786151.3788606

1 Introduction

Integrated Development Environments (IDEs) play a central role in shaping developers' everyday experience with code. Far from being neutral instruments, IDEs mediate cognition, workflow, and collaboration, influencing how developers search, refactor, and reason about software [5].
Over the past decade, the scope and complexity of IDEs have increased dramatically; modern platforms such as Visual Studio Code (VSCode), IntelliJ IDEA, and Eclipse integrate not only editors and compilers but also live collaboration tools, AI-assisted completion, and plugin ecosystems that redefine how developers interact with their codebases. This evolution has made the developer experience (DEX) an essential quality dimension of software tools [5].

Despite this shift, the empirical study of IDE usability and developer experience remains fragmented. Much of the literature still focuses on feature-level performance or adoption metrics, rather than on the nuanced forms of interaction friction that developers face in daily use. Kuusinen [9] observed that developers appreciate IDEs that are efficient to use, flexible, informative and intuitive. These qualities, while generally understood, are rarely used as an analytical framework to assess or monitor the user experience (UX) health of an IDE over time.

Parallel to this, recent software engineering research has drawn attention to the notion of UXDebt: a form of debt that accumulates when UX issues are postponed or insufficiently addressed during development [3, 13]. In complex tools such as IDEs, this debt often materializes as UX smells: recurring patterns of interaction breakdowns, confusing feedback, or inconsistency between user expectations and system behavior [6].

Identifying such UX smells in real-world development tools remains challenging. Controlled usability testing is rarely feasible for open, continuously evolving IDEs with millions of users.

This work is licensed under a Creative Commons Attribution 4.0 International License. IDE '26, Rio de Janeiro, Brazil. © 2026 Copyright held by the owner/author(s). ACM ISBN 979-8-4007-2384-1/2026/04. https://doi.org/10.1145/3786151.3788606

In this paper, our goal is to mine issues from public repositories, such as GitHub, for evidence of UX smells directly from user reports and community dialogue. The volume and diversity of user reports available in large open repositories allow observing how developers articulate friction points, how maintainers triage them, and how recurring UX problems persist or evolve. Yet the sheer volume of data requires automating the analysis. Thus, we introduce a Large Language Model (LLM) assisted mining approach that leverages recent advances in natural language understanding to act as a first-pass semantic coder over developer discourse. Using LLMs, we support the semantic categorization of UX smells according to the desirable IDE qualities identified by Kuusinen [9]. The combination of manual inspection and LLM-assisted classification aims to balance interpretive depth with scalability, integrating automation with human judgment.

Our contributions are: (1) an empirical corpus of IDEs' UX smells grounded in developer discourse, raising the understanding of the human aspects of software engineering, and (2) foundations for a broader characterization of UXDebt in IDEs, connecting design-level frictions with potential downstream consequences, such as cognitive overload, inefficient workflows, or even the accumulation of Technical Debt/UXDebt in the resulting code.

2 Background on UX, UX Smells and UXDebt

UX is an essential aspect of a product, determining its quality as well as its success. The notion of UX considers not only the pragmatic aspects of interaction (functionality, interactive behavior, user skills, context of use) but also the hedonic ones (brand image, presentation, internal state of the user resulting from previous experiences) [7].

UX evaluation is often neglected, especially in agile methods that are driven by customer satisfaction and short iteration cycles.
Therefore, lightweight methods are needed to evaluate UX as part of iterative development. One method previously proposed is UX refactoring, defined as changes applied with the purpose of improving UX quality while preserving functionality [6]. In turn, a UX smell hints at a problem with the navigation, presentation, interaction, or any UX aspect that may be solved by applying UX refactoring. An example of a UX smell is a free text input that only accepts a small set of possible values ("Free input for limited values"); it may be solved by applying alternative UX refactorings like "Add Autocomplete" or "Turn Input into Select". Another type of UX smell refers to issues related to the style or aesthetics of a user interface (UI) such as low color contrast, misaligned elements, and lack of responsiveness, among others. Note that UX smells are different from bugs in the UI, since smells do not prevent users from accomplishing their goal but just make it cumbersome or uncomfortable.

The presence of UX smells may contribute to the accumulation of UXDebt. This concept has been defined as a type of Technical Debt (TD) with a cumulative cost for the development team as well as stakeholders [13]. Similar to TD, UXDebt can undermine code maintainability and increase rework costs.

3 Related Work on DEX

There are several studies that evaluate the usability and UX of IDEs through empirical and/or inspection methods [8]. Moreover, Fagerholm and Münch defined the concept of Developer Experience (DEX) to help understand developers' perceptions and feelings as users of IDEs, under the assumption that improving DEX will have a positive impact on productivity [5]. They define DEX as consisting of three dimensions: cognitive (perceptions of development infrastructure), affective (feelings about work and social aspects), and conative (alignment of developers and project goals).

Further studies on DEX tried to gain an understanding of developers' expectations in the use of IDEs, through coding camps [11] and surveys [9]. In the first case, authors study online collaborative coding and categorize IDE features supporting DEX at the level of operations, actions and activities [11]. Moreover, they highlight that DEX is composed not only of the experience of using tools, but also the processes involved, the rules, and other people. In the second case, Kuusinen asked developers about the best qualities of an IDE and the improvements that could better support their work [9]. The author found that developers expect IDEs to be more flexible, informative, and reliable.

There are two works that specifically study the Visual Studio IDE. Amann et al. present a large empirical study with C# developers on how they use their time in the IDE, although they do not report on its usability [2]. In the study from Vaithilingam et al., the authors conducted user tests with 61 programmers at Microsoft, over several prototype interfaces for change suggestions in VSCode [14]. Through a user study they found a better design that improved the usage of change suggestions.

Thus, previous studies report on user tests or manual expert inspection, which are usually limited in volume and costly, while our approach of analyzing DEX in issue repositories is extensive, automated and low-cost.

4 Method

We adopted a mixed approach combining repository mining, LLM-assisted text analysis, and expert validation to identify and characterize UX smells in the VSCode project. Our methodological goal was to balance scalability enabled by automation, with interpretive reliability ensured through human judgment and iteration. The process included three stages described below: (1) data collection, (2) LLM-assisted categorization, and (3) synthesis and interpretation.

4.1 Data Collection

Repository mining is a common strategy for investigating user experience in real-world settings, allowing researchers to capture "naturally occurring evidence" of interaction breakdowns and user perceptions at scale [12]. We extracted all issues from the public GitHub repository of VSCode¹ using the GitHub REST API. From this corpus, we retained only those issues explicitly tagged with the label UX, a convention used by the maintainers to flag UX-related reports. This filtering step provided an initial dataset of N = 2350 issues (as of October 2025)².

4.2 LLM-Assisted Categorization of UX Smells

The filtered issues were processed using an LLM to support semantic categorization regarding an existing catalog of UX smells [6]. The prompt instructed the model to: (i) identify UX-related problems that match entries in the UX smell catalog; (ii) associate each issue with the most relevant UX smell, providing a short rationale describing the reasoning behind the classification; and (iii) detect any additional UX smells not represented in the catalog, propose their tentative label, and link them to the corresponding issue(s) with justification. This LLM-assisted annotation follows emerging methodological practices in software repository mining, where models serve as first-pass semantic coders to reveal latent structures in unstructured developer discourse [1]. All LLM-assisted classifications were conducted with OpenAI GPT-5, accessed via the ChatGPT interface (April 2025 build).

To ensure validity and reliability, three researchers independently reviewed the same elements from two sample sets of LLM output: (i) a 10% random sample of issues classified by the LLM, verifying correctness and rationale clarity, and (ii) a 10% sample of unclassified issues, checking for missed or ambiguous cases.
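
The two review samples described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source does not describe how the samples were drawn, and the `smell` field, the `draw_validation_samples` name, and the fixed seed are hypothetical.

```python
import random


def draw_validation_samples(issues, fraction=0.10, seed=42):
    """Return (classified_sample, unclassified_sample) for manual review.

    `issues` is a list of dicts with a hypothetical "smell" key holding the
    LLM-assigned label, or None when the LLM assigned no smell.
    """
    rng = random.Random(seed)  # fixed seed so all reviewers see the same items

    classified = [issue for issue in issues if issue["smell"] is not None]
    unclassified = [issue for issue in issues if issue["smell"] is None]

    def sample(group):
        # Sample without replacement; keep at least one item per non-empty group.
        if not group:
            return []
        size = max(1, round(len(group) * fraction))
        return rng.sample(group, size)

    return sample(classified), sample(unclassified)


if __name__ == "__main__":
    # Toy data: 60 classified and 40 unclassified issues.
    toy = [{"number": n, "smell": "Overloaded Menus" if n < 60 else None}
           for n in range(100)]
    reviewed_classified, reviewed_unclassified = draw_validation_samples(toy)
    print(len(reviewed_classified), len(reviewed_unclassified))
```

Sampling each group separately, rather than sampling 10% of the whole corpus, guarantees that unclassified issues are represented in the review even when they are the minority.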
Disagreements and misclassifications were discussed in consensus meetings following a constant comparison approach inspired by grounded theory methodology [4], leading to the construction of a validated label for 236 manually reviewed issues. This set yielded a baseline accuracy of 0.695 between the LLM-assigned and validated labels. Subsequently, we conducted a calibration phase using an updated prompt and an extended catalog. A confusion matrix and per-category metrics revealed systematic errors, which informed heuristic adjustments and a full reclassification. To preserve rigor, we adopted a hybrid strategy: human-validated labels take precedence, while calibrated heuristics cover unreviewed issues. A second manual inspection, in which more than 80% of the labels were confirmed, validated this strategy. This semi-supervised approach broadens corpus coverage and forms the empirical basis for subsequent analyses [10].

¹ https://github.com/microsoft/vscode
² The complete data set is available at https://t.ly/dRhwP

4.3 Descriptive and Interpretive Analysis

In the third phase, we conducted: (i) Descriptive statistics, quantifying the frequency and distribution of UX smells across the dataset, and (ii) Analytical mapping, relating UX smells with IDE qualities [9], followed by the assignment of IDE qualities to issues. Step (ii) was also LLM-assisted and manually reviewed, allowing us to explore how the different forms of UX friction reflected in the issues cluster around specific experiential qualities valued by developers.
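
Returning to the validation step of Section 4.2, the baseline accuracy, the confusion matrix, and the hybrid labeling rule can be illustrated with a minimal Python sketch. It assumes labels are stored as dictionaries keyed by issue number; the function names and data layout are hypothetical and are not taken from the authors' tooling.

```python
from collections import Counter


def accuracy(validated, predicted):
    """Share of reviewed issues whose LLM label equals the validated label."""
    shared = validated.keys() & predicted.keys()
    if not shared:
        raise ValueError("no issue has both a validated and a predicted label")
    hits = sum(validated[issue] == predicted[issue] for issue in shared)
    return hits / len(shared)


def confusion_counts(validated, predicted):
    """Count (validated_label, llm_label) pairs over the reviewed issues."""
    return Counter(
        (validated[issue], predicted[issue])
        for issue in validated
        if issue in predicted
    )


def hybrid_labels(validated, calibrated):
    """Human-validated labels take precedence; calibrated labels fill the rest."""
    return {**calibrated, **validated}


if __name__ == "__main__":
    # Toy labels keyed by issue number (illustrative values only).
    validated = {1: "Overloaded Menus", 2: "Late Validation", 3: "Overloaded Menus"}
    llm = {1: "Overloaded Menus", 2: "Overloaded Menus",
           3: "Overloaded Menus", 4: "Misleading Link"}
    print(accuracy(validated, llm))
    print(confusion_counts(validated, llm).most_common())
    print(hybrid_labels(validated, llm))
```

For reference, an accuracy of 0.695 over 236 reviewed issues corresponds to 164 agreeing labels (164/236 ≈ 0.695); the source reports only the ratio. Off-diagonal pairs in the confusion counts are the "systematic errors" a calibration phase would target.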

The interpretive stage aims to connect patterns in UX smells to broader hypotheses about UXDebt in IDEs using these criteria:
• Salience-Neglect Hypothesis: A high density of smells associated with a highly valued IDE characteristic (e.g., efficiency) may signal that this dimension, despite being central to DEX, receives insufficient design attention, thus accumulating UXDebt.
• Saturation-Resolution Hypothesis: Conversely, if smells cluster around less valued characteristics (e.g., reliability), it may indicate that core experiential qualities (efficiency, intuitiveness) are relatively mature and that residual issues now emerge in peripheral dimensions.

5 Results

5.1 UX Smells and IDE Qualities

To ground the interpretive analysis, we first examined how the UX smell framework aligns with the desirable IDE qualities identified by Kuusinen [9]. This mapping enables a dual perspective: highlighting which forms of UX friction are most frequent and revealing how the taxonomy itself resonates with the experiential expectations developers hold for their work environments. The distribution shows a clear concentration of UX smells around the cognitive-perceptual qualities (informativeness (6 UX smells), clarity (6), intuitiveness (5), efficiency (4), and ease of use (4)), indicating that the framework primarily captures breakdowns in perception, comprehension, and control. Conversely, qualities such as flexibility, empowerment, and learnability appear only marginally represented, suggesting that current UX smell taxonomies tend to diagnose short-term interaction breakdowns more than long-term experiential frictions.

To interpret these tendencies in relation to the broader model of DEX, we organized Kuusinen's qualities into four higher-order clusters reflecting distinct experiential tensions: cognitive transparency, flow efficiency, structural reliability, and peripheral experience.
Our four clusters refine (not extend) Fagerholm's DEX framework [5] by increasing analytic granularity: cognitive transparency maps to DEX's cognitive dimension, flow efficiency spans cognitive/conative experience, and peripheral experience covers affective/conative facets. Structural reliability is made explicit as an enabling condition that, when degraded, systematically undermines cognitive and conative experience.

5.2 Descriptive Statistics from VSCode Issues

Out of 2350 analyzed issues, about 62% were identified as UX smells (1455 issues), while the remainder were bugs or feature requests.

Among UX smells, mapping to Kuusinen's 13 IDE quality dimensions yielded a markedly asymmetric distribution (see Table 1). Roughly 70% of the selected issues cluster around Informativeness, Clarity, Efficiency, and Intuitiveness, the same dimensions Kuusinen identified as most valued by developers.

Table 1: UX smells across IDE Quality Dimensions

| IDE Quality | # Issues | % Issues | Frequent smells |
|---|---|---|---|
| Informativeness | 312 | 21.4% | Undescriptive Element, Inconsistent Feedback, No Progress Indicator |
| Clarity | 286 | 19.7% | Overlooked Content, General UI Inconsistency, Unformatted Input |
| Efficiency | 223 | 15.3% | Overloaded Menus, Distant Content, Inconsistent Spacing |
| Intuitiveness | 194 | 13.3% | Misleading Link, Wrong Default Value, Inconsistent Placement |
| Ease of Use | 103 | 7.1% | Late Validation, Abandoned Form, Overloaded Menus |
| Reliability | 97 | 6.7% | Unresponsive Element, Inconsistent Theming |
| Aesthetic Design | 72 | 4.9% | Clipped/Overlapping UI, General UI Inconsistency |
| Effectiveness | 58 | 4.0% | No Client Validation, Scarce Search Results |
| Value | 33 | 2.3% | Useless Search Results, Poor Accessibility |
| Learnability | 29 | 2.0% | Inconsistent Placement, Poor Discoverability |
| Flexibility | 21 | 1.4% | Forced Bulk Action |
| Approachability | 16 | 1.1% | Poor Discoverability, Poor Accessibility |
| Empowerment | 11 | 0.8% | Forced Bulk Action |

Figure 1: IDE profile according to UX friction
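
The shares in Table 1 and the cluster percentages discussed in Section 5.3 can be recomputed from the issue counts alone. The sketch below is a consistency check written for this purpose, not the authors' analysis script; the counts are copied from Table 1, the cluster membership follows the lists given in Section 5.3, and the variable and function names are hypothetical.

```python
# Issue counts per IDE quality, copied from Table 1.
COUNTS = {
    "Informativeness": 312, "Clarity": 286, "Efficiency": 223,
    "Intuitiveness": 194, "Ease of Use": 103, "Reliability": 97,
    "Aesthetic Design": 72, "Effectiveness": 58, "Value": 33,
    "Learnability": 29, "Flexibility": 21, "Approachability": 16,
    "Empowerment": 11,
}

# Cluster membership as listed in Section 5.3.
CLUSTERS = {
    "Cognitive Transparency": ["Informativeness", "Clarity", "Intuitiveness"],
    "Flow Efficiency": ["Efficiency", "Ease of Use", "Value", "Effectiveness"],
    "Structural Reliability": ["Reliability", "Aesthetic Design"],
    "Peripheral Experience": ["Learnability", "Flexibility",
                              "Approachability", "Empowerment"],
}


def quality_shares(counts):
    """Percentage of UX-smell issues attributed to each quality."""
    total = sum(counts.values())
    return {quality: 100 * n / total for quality, n in counts.items()}


def cluster_shares(counts, clusters):
    """Percentage of UX-smell issues falling into each experiential cluster."""
    total = sum(counts.values())
    return {
        name: 100 * sum(counts[q] for q in members) / total
        for name, members in clusters.items()
    }


if __name__ == "__main__":
    print("total issues:", sum(COUNTS.values()))
    for quality, share in quality_shares(COUNTS).items():
        print(f"{quality:<18}{share:5.1f}%")
    for name, share in cluster_shares(COUNTS, CLUSTERS).items():
        print(f"{name:<24}{share:4.0f}%")
```

Keeping the cluster definition as data makes it easy to test alternative groupings of the 13 qualities without touching the arithmetic.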

Conversely, qualities linked to learning, autonomy, or flexibility represent less than 5% of all cases, suggesting low visibility and prioritization in the design process (see Fig. 1).

5.3 Analytical and Interpretive Mapping

In this section, UX smells are examined within the four experiential clusters to trace how different types of UXDebt map onto the experiential qualities most valued by developers (see Fig. 2).

Figure 2: Bar graph with the UX smells dimensions clustering

Cognitive Transparency Cluster (54%).
Comprising Informativeness, Clarity, and Intuitiveness, this cluster concentrates over half of all UX smells. Frequent issues include insufficient feedback, unclear tooltips, ambiguous icons, and inconsistent visual cues, e.g. “The diff viewer shows changes but not which file is active”. These cases reduce situational awareness and cognitive legibility, aligning with the cognitive dimension of the DEX model [5]. UXDebt manifests as cognitive opacity, the accumulation of small inconsistencies and missing cues that erode the readability of system state over time.

Flow Efficiency Cluster (29%).
Integrating Efficiency, Ease of Use, Value, and Effectiveness, this group covers issues that interrupt workflow continuity or require redundant steps, e.g. “Settings editor workspace folder selector dropdown opens too far away from tab”. Such frictions reflect the disruption of the flow-related qualities highlighted by Kuusinen: efficiency and ease of use are central to how developers experience productivity within an IDE [9]. UXDebt in this cluster accumulates as process fragmentation, when local optimizations or feature additions compromise the seamless continuity of core workflows.

Structural Reliability Cluster (12%).
Covering Reliability and Aesthetic Design, this cluster captures inconsistent feedback, delayed visual refreshes, and unsynchronized themes, e.g. “macOS: inconsistent UI when it comes to inputs border radius.” This emerges as a form of structural UXDebt rooted in architectural or rendering constraints.

Peripheral Experience Cluster (5%).
Encompassing Learnability, Flexibility, Approachability, and Empowerment, this cluster shows scarce representation. A few issues include discoverability or customization problems. Rather than indicating the absence of UXDebt, this low density may reflect latent or postponed debt in peripheral qualities, dimensions that receive less attention once functional stability is reached.

These results, when examined in light of the hypotheses posited in Section 4.3, may be interpreted as follows:
• Salience-Neglect Hypothesis: we observed a high concentration of UX smells in Efficiency, Informativeness, and Clarity (accounting for about 56% of all cases). This pattern suggests that the IDE's most valued experiential qualities are also the ones most affected by UXDebt. These core dimensions concentrate both functional complexity and user interaction, making them especially vulnerable to degradation through iterative growth and design trade-offs [5]. We interpret this concentration as a bias toward prioritizing functional expansion over cognitive experience: aspects that developers consider essential for productivity tend to accumulate subtle but pervasive usability frictions.
• Saturation-Resolution Hypothesis: this hypothesis is supported by the comparatively low frequency of issues in Reliability, Learnability, and Flexibility. Such scarcity may indicate functional maturity: once core mechanics and workflows stabilize, residual UX smells emerge primarily in peripheral or supporting dimensions, where design iteration is slower or less visible to users.
This pattern aligns with the idea that UXDebt shifts from central to marginal layers as the product evolves.

Together, these observations highlight how UXDebt in VSCode evolves not merely through accumulation but through re-localization: from emergent friction in new features to persistent cognitive drag in long-standing ones.

6 Conclusions and future work

Our results show that VSCode exhibits a maturity pattern typical of large IDEs: most UXDebt concentrates in clarity, informativeness, and intuitiveness, dimensions mediating the dialogue between interface and user, rather than technical reliability. This supports the view that in complex environments, UXDebt accumulates where interaction is most cognitive and frequent, not where code is most fragile. From a longitudinal perspective, this shift reflects an evolution from how it works toward how it communicates and feels.

This study focuses exclusively on the core VSCode IDE to establish a baseline of UXDebt within the primary host platform. This may limit generalizability, as mature developers rely on a vast extension ecosystem. While the host's architecture constrains UI disruption, UX friction could emerge from unforeseen extension interactions. Moreover, our reliance on GitHub issues only captures "reported" friction, potentially omitting "silent" UX smells. Our future work will combine other data collection methods to mitigate these limitations. We also plan to compare our findings across different IDE ecosystems to determine if the concentration of UXDebt in cognitive transparency is an intrinsic feature of mature development environments.

Acknowledgments

Authors acknowledge grant PICT-2019-02485 from Agencia I+D+i.

References

[1] S. Abedu, A. Abdellatif, and E. Shihab. 2024. LLM-Based Chatbots for Mining Software Repositories: Challenges and Opportunities. In 28th EASE. 201–210.
[2] S. Amann, S. Proksch, S. Nadi, and M. Mezini. 2016. A study of visual studio usage in practice. In IEEE 23rd Int. Conf. on Software Analysis, Evolution, and Reengineering (SANER), Vol. 1. IEEE, 124–134.
[3] S. Baltes and V. Dashuber. 2024. UX debt: Developers borrow while users pay. In IEEE/ACM 17th CHASE. 79–84.
[4] K. Charmaz. 2014. Constructing Grounded Theory (2nd ed.). SAGE.
[5] F. Fagerholm and J. Münch. 2012. Developer experience: Concept and definition. In Int. Conf. on Software and System Process (ICSSP). IEEE, 73–77.
[6] J. Grigera, A. Garrido, J.M. Rivero, and G. Rossi. 2017. Automatic detection of usability smells in web applications. Int. Journal of Human-Computer Studies 97 (2017), 129–148.
[7] M. Hassenzahl, M. Burmester, and F. Koller. 2021. User experience is all there is: twenty years of designing positive experiences and meaningful technology. i-com 20, 3 (2021), 197–213.
[8] R.B. Kline and A. Seah. 2005. Evaluation of integrated software development environments: Challenges and results from three empirical studies. Int. Journal of Human-Computer Studies 63, 6 (2005), 607–627.
[9] K. Kuusinen. 2015. Software developers as users: Developer experience of a cross-platform integrated development environment. In Int. Conf. on Product-Focused Software Process Improvement. Springer, 546–552.
[10] M. Liu, L. Jiang, J. Liu, X. Wang, J. Zhu, and S. Liu. 2017. Improving Learning-from-Crowds through Expert Validation. In IJCAI. Melbourne, 2329–2336.
[11] J. Palviainen, T. Kilamo, J. Koskinen, J. Lautamäki, T. Mikkonen, and A. Nieminen. 2015. Design framework enhancing developer experience in collaborative coding environment. In 30th Annual ACM Symposium on Applied Computing. 149–156.
[12] S. Panichella, A. Di Sorbo, C.A. Visaggio, and G. Canfora. 2015. How Can I Improve My App? Classifying User Reviews for Software Maintenance and Evolution. In IEEE ICSME. 281–290.
[13] A. Rodriguez, J. C. Gardey, J. Grigera, G. Rossi, and A. Garrido. 2023. UX debt in an agile development process: evidence and characterization. Software Quality Journal 31, 4 (2023), 1467–1498.
[14] P. Vaithilingam, E. Glassman, P. Groenwegen, S. Gulwani, A. Henley, et al. 2023. Towards more effective AI-assisted programming: A systematic design exploration to improve Visual Studio IntelliCode's user experience. In IEEE/ACM 45th ICSE-SEIP. 185–195.
