Where are the Hidden Gems? Applying Transformer Models for Design Discussion Detection

Lawrence Arkoh · Daniel Feitosa · Wesley K. G. Assunção

Received: date / Accepted: date

Abstract Design decisions are at the core of software engineering and appear in Q&A forums, mailing lists, pull requests, issue trackers, and commit messages. Design discussions spanning a project's history provide valuable information for informed decision-making, such as refactoring and software modernization. Machine learning techniques have been used to detect design decisions in natural language discussions; however, their effectiveness is limited by the scarcity of labeled data and the high cost of annotation. Prior work adopted cross-domain strategies with traditional classifiers, training on one domain and testing on another. Despite their success, transformer-based models, which often outperform traditional methods, remain largely unexplored in this setting. The goal of this work is to investigate the performance of transformer-based models (i.e., BERT, RoBERTa, XLNet, LaMini-Flan-T5-77M, and ChatGPT-4o-mini) for detecting design-related discussions. To this end, we conduct a conceptual replication of prior cross-domain studies while extending them with modern transformer architectures and addressing methodological issues in earlier work. The models were fine-tuned on Stack Overflow and evaluated on GitHub artifacts (i.e., pull requests, issues, and commits). BERT and RoBERTa show strong recall across domains, while XLNet achieves higher precision but lower recall. ChatGPT-4o-mini yields the highest recall and competitive overall performance, whereas LaMini-Flan-T5-77M provides a lightweight alternative with stronger precision but less balanced performance. We also evaluated similar-word injection for data augmentation, but unlike prior findings, it did not yield meaningful improvements.
Overall, these results highlight both the opportunities and trade-offs of using modern language models for detecting design discussions.

Keywords Software Architecture · Mining Software Repositories · Language Models · Replication study

Lawrence Arkoh, North Carolina State University, Raleigh, USA. E-mail: larkoh@ncsu.edu
Daniel Feitosa, University of Groningen, Groningen, Netherlands. E-mail: d.feitosa@rug.nl
Wesley K. G. Assunção, North Carolina State University, Raleigh, USA. E-mail: wguezas@ncsu.edu

1 Introduction

The foundation of a software system is its architectural design and the decisions made to arrive at its current state (Wan et al., 2023; Bucaioni et al., 2024). In this paper, the term "design" refers explicitly to software structural design, and design decisions are architecture-related text, as described by Brunet et al. (2014). Thus, during the software's development life cycle, developers need to make many design decisions (Alkadhi et al., 2018). These decisions are usually related to functional requirements, non-functional requirements, or technological constraints, which, in turn, affect the architecture (Ameller et al., 2012). When made incorrectly or suboptimally, software design decisions can lead to software degradation and incur financial and maintenance costs, namely technical debt (Li et al., 2015; da Silva Maldonado et al., 2017; Skryseth et al., 2023), pushing software systems deep into the legacy system zone (Assunção et al., 2025).

To maintain and evolve aging systems, ensuring their value over time, developers should be able to recover and understand the design decisions made during software development (Josephs et al., 2022). Modern software development platforms (e.g., https://github.com, https://gitlab.com, and https://www.atlassian.com/software/jira) offer features such as pull request and issue management, allowing developers to review code, raise issues with a given piece of code, and then have discussions to resolve them (Yang et al., 2022). Additionally, software design decisions can also be discussed in mailing lists (Viviani et al., 2019) and commit messages, where developers document their intentions behind code changes (Tian et al., 2022; Dhaouadi et al., 2024). Furthermore, developer forums such as Stack Overflow (https://StackOverflow.com) are yet another platform where developers exchange information and discuss architecture decisions, among other topics (de Dieu et al., 2023).

Pull requests, issues, emails, messages, Q&A forums, or any other discussion medium are valuable sources of information related to design discussions (Oliveira et al., 2024; Viviani et al., 2019; Brunet et al., 2014). These pieces of information enable an understanding of software architectures and the history of their design evolution. Consequently, this information supports system maintenance and evolution (i.e., code refactoring, redesign, migration, or modernization) to meet new demands. However, exploiting multiple communication channels has been a challenge due to a lack of tools and processes for identifying such discussions (Maarleveld et al., 2023; Mehrpour and Latoza, 2023). Because these discussions are seldom formally documented, studies have shown that the rationale of decisions related to a software's architecture may vaporize with time (Borrego et al., 2019; Li et al., 2022). Also, unfortunately, current tools and practices poorly track which discussions are design decision-related.

A few studies have proposed approaches based on machine learning (ML) for classifying textual discussions from different sources as design-related or non-design-related (Brunet et al., 2014; Viviani et al., 2019; Shakiba et al., 2016; Maarleveld et al., 2023). These studies trained ML classifiers and evaluated them on datasets from the same source; that is, both the training and testing datasets were derived solely from pull requests, issues, or commit messages. These approaches are difficult to apply in practice because, in addition to their sensitivity to the source dataset type, labelled datasets are scarce, which further hinders the performance and generalizability of the tools.

To address the problem mentioned above, Mahadi et al. (2022) curated a dataset from Stack Overflow discussions and trained traditional classifiers to identify similar design discussions in commit messages, pull requests, and issue discussions. This approach is referred to as cross-domain dataset classification, which involves training the model on one domain (in this case, Stack Overflow discussions) and applying it to classify data from other domains, e.g., commits, pull requests, and issues (Zimmermann et al., 2009). The authors also employed similar-word injection, a data augmentation strategy that enhances classifier performance by reinforcing its capacity to generalize and transfer contextual information from the Stack Overflow dataset to commit messages, pull requests, and issue discussions. Mahadi et al. (2022) highlighted challenges in the form of low ML model performance when trained on data from one domain (i.e., Stack Overflow) and tested on data from different domains, despite both containing design discussions in natural language text.

To advance the state of the art in design discussion detection and pave the way for building better tools, the present work aims to evaluate transformer-based models for design discussion identification. Motivated by the strong performance of transformer-based language models (e.g., BERT (Devlin et al., 2019) and ChatGPT (OpenAI, 2025)) in understanding and classifying natural language, we addressed the limitations of and extended the work by Mahadi et al. (2022) by conducting a conceptual replication (Cruz et al., 2019; Gómez et al., 2014). Specifically, we explored whether five state-of-the-art transformer-based models, namely BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and XLNet (Yang et al., 2019) (all discriminative models), as well as LaMini-Flan-T5-77M (Lepagnol et al., 2024) and ChatGPT-4o-mini (OpenAI, 2025) (generative language models), can effectively generalize their ability to distinguish between design-related and non-design discussions in a cross-domain setting. To this end, we fine-tuned the models on Stack Overflow discussions and tested them on similar categories of text drawn from GitHub pull requests (Viviani et al., 2019; Brunet et al., 2014), code comments (da Silva Maldonado et al., 2017), and issues and commit messages (Brunet et al., 2014). Our key findings are:

– Transformer-based models can accurately detect design-related discussions when training and testing data originate from the same domain.
– A transformer-based model with a reduced dataset, namely Stack Overflow questions excluding answers and comments, can be sufficient for robust detection while reducing fine-tuning time and computational costs.
– Despite outperforming traditional ML classifiers, transformers exhibit performance degradation, especially in balanced precision and recall, when applied across domains (e.g., Stack Overflow to GitHub). This highlights the need for careful use of domain adaptation techniques.
– ChatGPT-4o-mini consistently achieved higher recall across datasets, making it effective for exploratory analyses where capturing design discussions is critical, while LaMini-Flan-T5-77M offered stronger precision and efficiency, positioning it as a practical choice for resource-constrained cases.
– Tool builders in this domain should consider the trade-off between recall and precision, as no single model is universally optimal across settings.

Given these findings, the contributions of our research provide better support for software engineering practitioners by enabling a lightweight, automated means of detecting design-related discussions. This improves design traceability and team alignment without requiring manual effort, as these models can identify design decisions without additional tedious tasks for humans. The detected software design-related discussions can enable practitioners to enhance documentation, streamline onboarding, and support knowledge transfer, ultimately guiding maintenance and modernization efforts. By making design decisions more accessible, our work contributes to a deeper understanding of a given software's architecture and aids in its modernization over time.

In the remainder of this paper, Section 2 presents the related work, followed by our study design in Section 3. Section 4 presents the results, and Section 5 presents the discussion and implications, along with the threats to validity. We conclude the paper in Section 6.

2 Related Work

Despite its importance, the detection and classification of software design decisions remain relatively underexplored. Existing studies report a lack of systematic documentation and automated techniques, limited empirical evidence, and immature classification approaches in this area (Assunção et al., 2025; Mondal et al., 2022; Bhat et al., 2017; Bi et al., 2021; Su et al., 2026). Wijerathna et al.
(2022) proposed a hybrid approach to mine design patterns and contextual data by combining unsupervised and supervised ML techniques. It employs unsupervised methods to generate vector representations of design patterns, capturing their semantic features. Then, a Support Vector Classifier (SVC) is trained on a manually labeled Stack Overflow dataset to categorize posts into different design patterns. The SVC achieved a ROC-AUC of 0.87, with a precision of 0.815 and a recall of 0.880. Further incorporation of collocations into the test data led to a slight improvement in performance, with precision and recall increasing to 0.815 and 0.882, respectively. Zhao et al. (2024) studied prompt tuning with large language models to mine latent design rationales from Jira issue logs. This method yielded a 22% improvement in F1 over baseline techniques in their study. Brunet et al. (2014) examined design-related discussions across GitHub commits, issues, and pull requests. They manually labeled 1,000 samples as related or unrelated to structural design, then trained a Decision Tree classifier using this data. The model achieved an accuracy of 94 ± 1% and was later applied to over 102,000 discussion samples, revealing that approximately 25% were related to software structural design. da Silva Maldonado et al. (2017) developed a method to automatically detect comments related to design requirements and technical debt. They used a Maximum Cross Entropy classifier trained on labeled code comments, which achieved an average F1 of 0.62 on their test set. Viviani et al. (2019) manually labeled paragraphs from GitHub pull requests of open-source projects as design-related or not.
Then, they trained multiple classifiers, with the best-performing model (Random Forest) achieving ROC-AUC scores of 0.87 on the training set and 0.81 on a cross-project test set.

All the studies above focus on detecting design-related discussions within a single domain by training and testing models on data from the same source. In contrast, Mahadi et al. (2022) investigated the cross-domain performance of several ML classifiers, including Linear SVM, Logistic Regression, and RBF SVM. They trained models on Stack Overflow data and evaluated them on datasets from other domains, such as code comments, commit messages, and pull request discussions. The classifiers performed well on a similar Stack Overflow dataset, achieving a ROC-AUC of 0.882, which demonstrates high performance in differentiating design-related texts from non-design-related texts. However, the performance of the models deteriorated significantly on cross-domain datasets. Specifically, when tested on data from Brunet et al. (2014), the ROC-AUC dropped to 0.632, and even lower scores were observed on the datasets from da Silva Maldonado et al. (2017), Viviani et al. (2019), and Shakiba et al. (2016), with ROC-AUC scores of 0.496, 0.483, and 0.513, respectively. These scores indicate that the models performed poorly in cross-domain classification, achieving only random-guessing accuracy. To address this, they applied similar-word injection data augmentation to the dataset from Brunet et al. (2014), resulting in a ROC-AUC score of 0.7985, which represents a 26% improvement. While a previous study employed transformer models (Zhao et al., 2024), it did not address the challenge of cross-domain dataset classification.
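To make the augmentation idea concrete, the following is a minimal sketch of one simple form of similar-word injection: each token with a known related term is followed by that term, so cross-domain inputs carry extra lexical cues resembling the source-domain vocabulary. The tiny `SIMILAR_WORDS` lexicon and the append-after-token strategy are illustrative assumptions; Mahadi et al. (2022) derive similar words from learned word representations rather than a hand-built table.

```python
# Illustrative sketch of similar-word injection, assuming a hand-built
# similarity lexicon; prior work derives such pairs from word vectors.
SIMILAR_WORDS = {  # hypothetical entries for illustration only
    "refactor": "redesign",
    "module": "component",
    "architecture": "design",
}

def inject_similar_words(text: str, lexicon: dict = SIMILAR_WORDS) -> str:
    """Append a semantically related term after every token found in the lexicon."""
    augmented = []
    for token in text.split():
        augmented.append(token)
        key = token.lower().strip(".,;:!?")  # ignore trailing punctuation
        if key in lexicon:
            augmented.append(lexicon[key])
    return " ".join(augmented)

# Example: a commit-message phrasing gains source-domain vocabulary.
print(inject_similar_words("Refactor the logging module"))
# -> "Refactor redesign the logging module component"
```

Texts without any lexicon hits pass through unchanged, so the augmentation only adds information where a similarity pair is known.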
The few studies that explored cross-domain design discussion detection primarily relied on traditional NLP techniques and classical ML algorithms (da Silva Maldonado et al., 2017; Viviani et al., 2019; Mahadi et al., 2022), which generally underperform compared to modern transformer-based models. Throughout this paper, ROC-AUC (Area Under the Receiver Operating Characteristic Curve) measures a classifier's ability to distinguish between different class labels, with performance values ranging from 0 (worst), through 0.5 (equivalent to random guessing), to 1 (best) (Sofaer et al., 2019).

3 Study Design

The goal of our study is to investigate the stability of transformer-based models in identifying design discussions across datasets from different domains. We conduct a conceptual replication of the study by Mahadi et al. (2022), described in the previous section, by evaluating the performance of transformer-based models on multiple independently collected datasets. Specifically, we assess model performance using datasets provided by Brunet et al. (2014), Viviani et al. (2019), and da Silva Maldonado et al. (2017). In doing so, we also address methodological issues in the original study, as discussed in Section 3.3. We structure our study around the following two research questions (RQs):

RQ1. How effective are transformer-based models at identifying software design-related discussions in cross-domain developer discussions? This RQ assesses the capability of modern transformer-based models (presented in Section 3.2) in identifying software design-related discussions from developer artifacts such as commit messages, pull requests, and issue discussions. By answering this RQ, we aim to understand whether these models can serve as reliable tools for automatically extracting design knowledge, potentially improving documentation and traceability in software projects.
The goal is to understand which models perform best and thereby offer insights for both researchers and practitioners looking to adopt NLP techniques in design discussion detection.

RQ2. Does similar-word injection enhance the cross-domain performance of transformer-based models in detecting software design-related discussions? This RQ investigates whether applying the similar-word injection (Long et al., 2024) augmentation strategy (i.e., replacing words in unseen inputs with semantically related terms) can improve the cross-domain generalization of transformer-based models in detecting software design-related discussions. Generalization remains a significant challenge in software engineering NLP tasks, as terminology, phrasing, and contextual structure often vary across projects. By enriching test inputs with lexical variations, this approach aims to make models more robust to domain shift without requiring repetitive model retraining (Lu et al., 2022; Shanmugam et al., 2025). In this RQ, we assess the impact of augmentation on classification accuracy in cross-domain settings and explore whether it offers a practical alternative for improving performance.

In the following subsections, we detail the methodology of our replication study. To enable further research and replication, the data and code used in our study are publicly available as supplementary material (Arkoh et al., 2026).

Table 1 Datasets of design discussions used in our study.

Reference | Source(s) | Design-related: Yes | No | Total
Mahadi (Mahadi et al., 2022) | Stack Overflow questions, comments, answers | 115,000 | 115,000 | 230,000
Brunet (Brunet et al., 2014) | GitHub commit messages, issues, and pull requests | 246 | 754 | 1,000
Viviani (Viviani et al., 2019) | GitHub pull requests | 6,250 | 2,365 | 8,615
SATD (da Silva Maldonado et al., 2017) | Code comments | 2,599 | 1,183 | 3,782

3.1 Data Collection and Preprocessing

Given the time and effort required to label data for text classification tasks (Fredriksson et al., 2020), we reused datasets from prior studies. Table 1 presents a summary of the datasets gathered from four studies, which we used in our replication. To obtain a large labeled dataset, Mahadi et al. (2022) relied on Stack Overflow. In their study, texts from questions, answers, and comments were labeled as design or general based on the user-supplied tags added when the questions were asked. Brunet et al. (2014) collected text from commit messages, issues, and pull requests on GitHub and classified them as design or non-design related discussions. Similarly, Viviani et al. (2019) and da Silva Maldonado et al. (2017) also labeled text from pull requests and commit messages as design or non-design related discussions.

We followed the same approach as existing studies to pre-process the data by removing some special characters and other data-specific content (e.g., thread-start, thread-end, FIXME, TODO, web URLs) that would not have any impact on the fine-tuning (Skryseth et al., 2023; Mahadi et al., 2022). The datasets provided by Mahadi et al. (2022) and Brunet et al. (2014) were already cleaned, requiring no further action. However, we performed these cleaning steps on the data provided by da Silva Maldonado et al. (2017) and Viviani et al. (2019). We note that transformer-based models do not require additional steps like stop-word removal and part-of-speech tagging due to the nature of their training. In fact, further pre-processing could impact their performance (Fan et al., 2023).

3.2 Model Selection and Fine-Tuning

The introduction of Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) established a new standard for understanding natural language.
Its bidirectional attention mechanism enables it to capture semantic meaning and sentence structure effectively (Zhao et al., 2024). BERT's widespread success in various classification tasks across domains (Pan et al., 2021; Shang et al., 2024; Ajagbe and Zhao, 2022; Li et al., 2024; Skryseth et al., 2023; Josephs et al., 2022) highlights its robust generalization and accuracy. Motivated by these strengths, we apply transformer-based models to the task of cross-domain software design discussion detection. Although models like RoBERTa (Liu et al., 2019) and XLNet (Yang et al., 2019) have demonstrated superior performance in general classification benchmarks (Izadi et al., 2022; Nadeem et al., 2021; Arabadzhieva-Kalcheva and Kovachev, 2022; Kumar et al., 2021), their effectiveness in the specific context of identifying design-related communication within software repositories remains underexplored. In our study, we used BERT (bert-base-uncased) (Devlin et al., 2019), RoBERTa (roberta-base) (Liu et al., 2019), and XLNet (xlnet-base-cased) (Yang et al., 2019) through the HuggingFace library (https://huggingface.co).

We also include ChatGPT-4o-mini (OpenAI, 2025) in our study, a large-scale generative transformer model that has shown strong performance in software engineering tasks (Widyasari et al., 2025; Sun et al., 2023). Its conversational fine-tuning allows it to capture nuanced design-related discussions with high recall and competitive overall effectiveness. In addition to large-scale transformer models, we also consider LaMini-Flan-T5-77M (Wu et al., 2024), a smaller and resource-efficient model that has demonstrated competitive results under limited-resource settings (Lepagnol et al., 2024), as opposed to the high computational costs of large language models (Patel et al., 2024; Shekhar et al., 2024; Liagkou et al., 2024).

We applied transfer learning and fine-tuning to adapt each model to our datasets. For ChatGPT-4o-mini and LaMini-Flan-T5-77M, which require natural language prompting, we employed the prompt and context template shown below. The prompt was curated to align with the design discussion classification criteria introduced by Brunet et al. (2014), which, to the best of our knowledge, represents the earliest study to formulate the detection of software design discussions as a supervised machine learning classification task. This conceptual definition has since been adopted by subsequent work, including the replicated study by Mahadi et al. (2022). Grounding our prompt in this established formulation ensures methodological consistency with prior studies and enables fair comparison and replication rather than relying on ad-hoc or task-specific phrasing.

    Classify the following text as design if it is related to software structural design or general otherwise: {{ TEXT TO CLASSIFY }}

In contrast, models such as BERT, RoBERTa, and XLNet were fine-tuned using structured input representations without the need for natural language prompts.

Table 2 Summary of the primary dataset (Stack Overflow training and validation only).

Category | Design | Non-Design | Total
Questions | 50,000 | 50,000 | 100,000
Answers | 20,000 | 20,000 | 40,000
Comments | 30,000 | 30,000 | 60,000
Combined | 100,000 | 100,000 | 200,000

As part of a conceptual replication of the study by Mahadi et al. (2022), we reused their Stack Overflow training dataset to fine-tune all models, enabling direct comparison with prior results while isolating the impact of transformer-based architectures. Originally, their dataset consisted of 200,000 samples for training, 30,000 samples for testing, and 30,000 samples for validation.
However, we observed that the test and validation samples were duplicates, which is a methodological issue, so we discarded the validation samples and only considered the remaining 230,000 samples for our study. We selected the same 200,000 samples for fine-tuning and validation (an 80%/20% split) as well as the same test samples for model testing, which were never exposed to our models during fine-tuning. The full decomposition of the Stack Overflow dataset is presented in Table 2.

Furthermore, Mahadi et al. (2022) labeled their data by identifying questions tagged with software architecture or design decision-related terms, categorizing them as "Design", while those without such tags were labeled as "General". However, based on our observations of how users interact on Stack Overflow, a question tagged as design-related does not necessarily imply that its associated answers and comments are related to design. Additionally, tagging on Stack Overflow is applied only to parent questions, not to their corresponding answers or comments. An example of this is illustrated in Figure 1, where the question description pertains to a design-related topic, but some of the comments and answers lack any discussion or context related to design patterns. Based on that, we conducted an additional experiment. First, we fine-tuned our models on the combined questions, answers, and comments dataset. Second, we fine-tuned the models on a dataset comprising questions only, and compared the two results to verify the following hypothesis: Discarding answers and comments alongside questions in the fine-tuning data does not significantly affect the performance of models in detecting design discussions.
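The prompt-based classification for the generative models can be wrapped in small helpers. The sketch below is an illustration under stated assumptions rather than our exact pipeline: the model identifier "gpt-4o-mini" and the `classify_with_openai` helper name are assumptions, the template mirrors the one shown in Section 3.2, and the deterministic decoding settings follow the configuration described in Section 3.3.

```python
# Sketch of prompt-based classification for the generative models; the
# OpenAI call is illustrative and "gpt-4o-mini" is an assumed model id.
PROMPT_TEMPLATE = (
    "Classify the following text as design if it is related to "
    "software structural design or general otherwise: {text}"
)

def build_prompt(text: str) -> str:
    """Fill the classification template with the text to classify."""
    return PROMPT_TEMPLATE.format(text=text.strip())

def parse_label(raw_output: str) -> str:
    """Map a free-form model reply onto the two expected labels."""
    tokens = raw_output.strip().lower().split()
    first = tokens[0].strip(".,:;\"'") if tokens else ""
    return "design" if first.startswith("design") else "general"

def classify_with_openai(text: str, model: str = "gpt-4o-mini") -> str:
    """Hypothetical helper: one API call per text to classify."""
    from openai import OpenAI  # requires the openai package and an API key
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(text)}],
        temperature=0,  # deterministic output for reproducibility
        max_tokens=10,  # constrain the reply to a short label
    )
    return parse_label(response.choices[0].message.content)
```

Normalizing the reply in `parse_label` matters because generative models may echo the label with extra casing or punctuation, and any reply that is not recognizably "design" falls back to "general".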
3.3 Experimentation Setup

We used a separate, balanced subset of the Stack Overflow dataset comprising 30,000 textual discussions (15,000 design-related and 15,000 non-design-related), drawn from the full 230,000-text corpus, for the evaluation on the same domain (i.e., Within Dataset). Then, the other datasets (Brunet, Viviani, and SATD) were used to evaluate the models' cross-domain performance.

Fig. 1 Sample Stack Overflow question about design, but with non-design related comments and answer (screenshot omitted; source: https://StackOverflow.com/questions/18065115/how-to-call-a-class-or-what-pattern-to-use).

To enable a close comparison with the replicated study, we computed the ROC-AUC score (Sofaer et al., 2019) for each of our fine-tuned models per dataset, except for LaMini-Flan-T5-77M and ChatGPT-4o-mini, which produce text outputs instead of probability scores, preventing the computation of ROC-AUC values. Furthermore, we also computed metrics such as accuracy, precision, recall, and F1-score, which are the metrics commonly associated with classification tasks (Goutte and Gaussier, 2005; Naidu et al., 2023).

The parameters of all transformers were kept at their default values, except for ChatGPT-4o-mini, whose temperature was set to 0 for reproducibility and whose max tokens were fixed at 10 to constrain the output to short labels. We evaluated each model ten times (i.e., 10 independent runs) on each dataset and report the mean values of the considered metrics. These repeated runs were also used to assess statistical significance through Wilcoxon signed-rank tests (Wilcoxon, 1945), allowing us to determine whether the observed differences in performance across datasets or augmentation strategies are significant.

Except for ChatGPT-4o-mini, which relies on the OpenAI APIs, all experiments were conducted on a machine equipped with an Intel Core i9 (13th Generation) processor, an NVIDIA GeForce RTX 4090 GPU, and running Ubuntu 24.04.1 LTS. On average, fine-tuning BERT, RoBERTa, and XLNet required approximately 30 minutes per epoch on the combined dataset and about 15 minutes on the Questions Only dataset.
For LaMini-Flan-T5-77M, fine-tuning on the combined dataset took roughly 2 hours, compared to about 1 hour on the Questions Only dataset. To prevent model overfitting, we utilized early stopping.

Similar to prior work (Mahadi et al., 2022), we focused on evaluating the inherent generalization ability of each model architecture under standard configurations rather than maximizing performance through extensive hyperparameter tuning. Using default settings ensures a fair and consistent comparison across models, reduces the risk of dataset-specific overfitting, and improves the reproducibility of our results. Moreover, large-scale tuning across multiple models and datasets would introduce substantial computational cost and confound the analysis by attributing improvements to optimization choices rather than architectural differences. Therefore, we adopted the commonly used default hyperparameters provided by the HuggingFace implementations for all models.

4 Results

This section describes the results of testing our models on the "within" domain (Stack Overflow) as well as on "cross-domain" datasets from GitHub issues, discussions, and commit messages (see Table 1).

4.1 Model Performance Within Dataset

First, we test our initial hypothesis that, when fine-tuned on a combination of questions, comments, and answers, the model's performance is similar to that when fine-tuned on questions only (see Section 3.2). Although the models were fine-tuned on these dataset variants, we evaluated them on the same dataset, which combined questions, comments, and answers, because this allowed us to compare our results with those of Mahadi et al. (2022). The combined dataset comprised 200,000 samples, whereas the questions-only dataset contained 100,000. We excluded ChatGPT-4o-mini (accessed through the OpenAI API: https://github.com/openai/openai-python) from this evaluation, as it would require an estimated 600,000 API calls (30,000 samples × 2 model variants × 10 runs), which would incur prohibitive computational and monetary costs. Additionally, we noted that fine-tuning ChatGPT-4o-mini on the combined dataset would take approximately 5 hours 30 minutes, whereas fine-tuning on the Questions Only dataset would take about 2 hours 30 minutes.

Table 3 Performance metrics (mean values) for different models and fine-tuning data variants. AC: Accuracy; R: Recall; P: Precision.

Model | Train Variant | AC | ROC-AUC | R | P
XLNet | Combined | 0.898 | 0.955 | 0.916 | 0.885
XLNet | Questions Only | 0.899 | 0.958 | 0.879 | 0.915
RoBERTa | Combined | 0.898 | 0.958 | 0.897 | 0.899
RoBERTa | Questions Only | 0.887 | 0.947 | 0.912 | 0.868
BERT | Combined | 0.900 | 0.954 | 0.899 | 0.902
BERT | Questions Only | 0.893 | 0.946 | 0.881 | 0.903
LaMini | Combined | 0.894 | - | 0.894 | 0.894
LaMini | Questions Only | 0.886 | - | 0.905 | 0.885

Table 3 presents the results obtained by fine-tuning the four models on these two variants of data composition. We observe in this table that discarding the answers and comments does not affect the models' performance across any of the metrics; the values for each metric differ by only a small margin. This observation confirms the known phenomenon of model saturation as training set size grows (Zhu et al., 2016). In comparison with the results from Mahadi et al. (2022) (i.e., the replicated study), our highest ROC-AUC score of 0.958, achieved by RoBERTa on Combined and by XLNet on Questions Only (see Table 3), is greater than their score of 0.881 obtained using a Linear SVM classifier trained on the combined dataset of 200,000 samples. The high values exhibited by the models in our study, particularly precision and recall, imply that transformer-based models trained on a smaller dataset outperform the traditional ML classifiers in detecting design discussions from natural language text.
Thus, the transformer-based models have a superior ability to differentiate between design and non-design discussions when trained and tested in the same domain.

To assess whether fine-tuning on Combined data versus Questions Only data significantly impacts model performance, we applied the Wilcoxon signed-rank test (Wilcoxon, 1945) across multiple evaluation metrics. The results revealed statistically significant differences in all pair comparisons (Combined vs. Questions Only) for all transformers and metrics, except for LaMini-Flan-T5's precision and XLNet's accuracy. Despite this statistical significance, the mean difference between Combined- and Questions Only-trained models ranged from –0.0110 to 0.0004 for accuracy, –0.0306 to 0.0303 for precision, –0.0368 to 0.01549 for recall, and –0.0101 to 0.0030 for ROC-AUC. We conclude that discarding comments and answers halves model training time, as noted previously in the model training time differences, without a practically meaningful loss in performance. The results for each metric are provided in full tables in the supplementary materials (Arkoh et al., 2026).

Summary of results Within Dataset: The results confirm our hypothesis that discarding answers and comments from the fine-tuning data does not significantly affect model performance. This suggests that question text alone is sufficient for reliably detecting design discussions, offering a more streamlined and efficient approach to data preparation while reducing fine-tuning time. Our results also reveal that transformer-based models perform better than the traditional classifiers used in prior work in detecting design-related discussions when both training and testing data originate from the same domain, with ROC-AUC scores as high as 0.958. This makes them a suitable replacement for traditional classifiers in the same domain.
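The Wilcoxon signed-rank test applied above operates on paired per-run differences: zeros are discarded, the remaining absolute differences are ranked (with ties averaged), and the test statistic is the smaller of the positive and negative rank sums. A minimal sketch of the statistic W; in practice scipy.stats.wilcoxon supplies both the statistic and the p-value, and the paired values below are illustrative, not the study's runs:

```python
# Minimal Wilcoxon signed-rank statistic for paired samples.
# Illustrative sketch only; use scipy.stats.wilcoxon for the full test.

def wilcoxon_statistic(x, y):
    # Paired differences, zeros discarded (standard practice).
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ranked = sorted(diffs, key=abs)
    # Assign ranks 1..n, averaging ranks over ties in |d|.
    ranks = {}
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and abs(ranked[j]) == abs(ranked[i]):
            j += 1
        avg = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg
        i = j
    w_plus = sum(r for k, r in ranks.items() if ranked[k] > 0)
    w_minus = sum(r for k, r in ranks.items() if ranked[k] < 0)
    return min(w_plus, w_minus)  # the test statistic W

W = wilcoxon_statistic([10, 8, 12, 9, 11], [8, 9, 9, 9, 7])
```

A small W indicates that the differences lean heavily in one direction, which is what drives the significance results reported above.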
4.2 Model Performance Cross-Domain Dataset

Having observed satisfactory model performance on the Stack Overflow dataset, we proceed to evaluate the fine-tuned models' performance on datasets from different domains (cross-domain).

Figure 2 shows the ROC-AUC scores recorded when our models fine-tuned on Questions Only are applied to the cross-domain datasets for design discussion classification. For the cross-domain ROC-AUC score, Mahadi et al. (2022) recorded their best scores for the datasets provided by Brunet (Brunet et al., 2014), Viviani (Viviani et al., 2019), and SATD (da Silva Maldonado et al., 2017) to be 0.632, 0.513, and 0.505, respectively. For these same datasets, however, the models used in our study achieved mean scores of 0.872 (Brunet: XLNet, questions only), 0.645 (Viviani: XLNet, questions only), and 0.607 (SATD: RoBERTa). It is also worth mentioning that the models we fine-tuned on the Questions Only dataset perform closely to their counterparts fine-tuned on the Combined dataset for these cross-domain datasets (see Section 4.1). This suggests that a model fine-tuned on less information, namely the questions without comments, gives an equivalent level of understanding and performance when classifying design discussions in texts from different domains. Although we do not present the results for the Combined dataset in the paper, the details are available in our supplementary materials (Arkoh et al., 2026).

Another noteworthy consideration is that in the prior work by Mahadi et al. (2022), the authors also tried to improve the models' performance by using data augmentation, which aimed to transfer context from the training data to the validation data. After this improvement, the Linear SVM classifier reached its best ROC-AUC of 0.798, precision of 0.647, and recall of 0.901 for the dataset provided by Brunet et al. (2014).

Fig. 2 Boxplots of ROC-AUC scores for the 10 independent runs per model (columns) and per dataset (rows).

Interestingly, our XLNet model fine-tuned on the Questions Only dataset, without any hyperparameter tuning or data augmentation, obtained a greater ROC-AUC score, namely 0.872, on the same dataset. This can be observed in Figure 2 for ROC-AUC. Table 4 reports the precision and recall scores, which are 0.665 and 0.679, respectively. Although our model achieves a higher ROC-AUC and precision compared to Mahadi et al. (2022), its recall is comparatively lower. Notably, XLNet demonstrated strong performance in identifying design discussions, closely aligning with the manual annotations in the cross-domain dataset provided by Brunet et al. (2014), despite not being trained on data from the same domain (i.e., GitHub pull request discussions). We attribute the lower recall to the model's limited ability to fully capture the nuanced context under which the original annotators classified discussions as design-related, leading the model to miss some design-related discussions.

Our transformer-based classifiers yielded lower performance on the Viviani and SATD datasets, as shown in Table 4. This outcome indicates limited generalization across datasets originating from different discussion domains, such as GitHub, despite the presence of valuable design discussions. We attribute this to variations in dataset annotation guidelines. For instance, the Viviani dataset defines design discussions through "the use of speculative language or the presence of rationale supporting statements" (Viviani et al., 2019).
In contrast, the SATD dataset identifies design-related content through concepts such as "misplaced code", "lack of abstraction", "long methods", "poor implementation", and "temporary solutions" (da Silva Maldonado et al., 2017). Also, Brunet et al. (2014) focus on textual elements associated with software structural design.

Table 4 Cross-domain performance of models fine-tuned on question-only data.

Model           Dataset  AC     ROC-AUC  R      P      F1
BERT            Brunet   0.697  0.800    0.767  0.435  0.555
                SATD     0.533  0.581    0.483  0.748  0.587
                Viviani  0.578  0.592    0.543  0.335  0.414
RoBERTa         Brunet   0.715  0.828    0.790  0.454  0.577
                SATD     0.526  0.602    0.456  0.758  0.569
                Viviani  0.593  0.619    0.566  0.351  0.433
XLNet           Brunet   0.837  0.872    0.679  0.665  0.672
                SATD     0.462  0.607    0.293  0.793  0.428
                Viviani  0.682  0.645    0.348  0.408  0.375
ChatGPT         Brunet   0.731  –        0.684  0.467  0.555
                SATD     0.562  –        0.543  0.750  0.630
                Viviani  0.521  –        0.610  0.310  0.411
LaMini-Flan-T5  Brunet   0.807  –        0.536  0.625  0.577
                SATD     0.444  –        0.276  0.765  0.406
                Viviani  0.687  –        0.280  0.400  0.329

Although transformer models outperform traditional ML classifiers on cross-domain data, their overall performance, particularly in terms of balanced precision and recall, declines noticeably in comparison to the within-dataset setting. For example, our RoBERTa model achieves a precision of 0.454 and a recall of 0.790 when evaluated on the Brunet dataset, meaning that while the model captures a large proportion of relevant instances (high recall), it also includes a notable number of false positives (moderate precision). Thus, caution is advised when applying models trained on one platform (e.g., Stack Overflow) to another (e.g., GitHub developer discussions), as domain adaptation or fine-tuning may be necessary.

The results presented above are the basis to answer RQ1. We provide the answer and the associated discussion in Section 5.
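The ROC-AUC scores reported in this section admit a direct probabilistic reading: the chance that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counting half. A minimal sketch under that interpretation; the scores and labels below are illustrative, not taken from the study:

```python
# ROC-AUC via its probabilistic interpretation: P(score(pos) > score(neg)),
# with ties counted as 1/2. Equivalent to the area under the ROC curve.

def roc_auc(scores, labels, positive=1):
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

auc = roc_auc([0.9, 0.8, 0.4, 0.3, 0.7], [1, 0, 1, 0, 1])
```

This pairwise reading also explains why ROC-AUC is undefined for models that emit only text labels rather than scores, as noted for LaMini-Flan-T5-77M and ChatGPT-4o-mini.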
4.3 Model Performance With Similar Words Injection

We experimented with transferring contextual information from the fine-tuning data to the test data to investigate whether transformer-based classifiers could perform better on cross-domain datasets using word synonym injection. Our models were fine-tuned exclusively on Stack Overflow data. To this end, we extracted words from the fine-tuning set, filtered out stop words (as they typically do not carry meaningful semantic content), and injected the remaining terms into each cross-domain dataset as synonyms following prior data augmentation work (Wei and Zou, 2019), using a newly initialized pre-trained XLNet model. Unlike traditional synonym replacement approaches based on cosine similarity between static word embeddings, which identify globally similar words independent of context, our method relies on contextualized masked language modeling. XLNet predicts substitute tokens conditioned on the full sentence, allowing injected words to better preserve the original semantic intent while aligning lexical usage with the training domain. The XLNet model was not fine-tuned on our datasets and was used solely as a fixed augmentation mechanism to prevent information leakage.

Fig. 3 Boxplots of ROC-AUC scores for the 10 independent runs per model (columns) and per dataset (rows) with similar words injected.

The results are summarized in Table 5. Figure 3 presents the boxplots for the ROC-AUC scores. When comparing the ROC-AUC scores in Tables 4 and 5, we observe that the values for XLNet on the Brunet dataset varied from 0.872 to 0.795, on SATD from 0.607 to 0.596, and on Viviani from 0.645 to 0.621. We can corroborate this observation with an analysis of the boxplots in Figures 2 and 3.
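The injection step described above can be outlined as follows. This is a simplified, dependency-free sketch: a plain dictionary stands in for the contextualized XLNet masked-LM predictions, and the stop-word list, vocabulary, and similar-word mapping are illustrative assumptions, not the study's artifacts:

```python
# Sketch of similar-word injection: vocabulary is extracted from the
# fine-tuning set, stop words are filtered out, and a related term is
# injected next to matching words in the cross-domain text. The `similar`
# dictionary is a stand-in for masked-LM predictions; all data is made up.

STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "we"}

def build_vocabulary(training_texts):
    words = {w.lower() for text in training_texts for w in text.split()}
    return words - STOP_WORDS

def inject_similar_words(text, vocabulary, similar):
    # `similar` maps a training-domain word to a "predicted" substitute;
    # the real pipeline queries a contextualized language model instead.
    out = []
    for word in text.split():
        out.append(word)
        key = word.lower()
        if key in vocabulary and key in similar:
            out.append(similar[key])
    return " ".join(out)

vocab = build_vocabulary(["we refactor the class hierarchy", "design of the module"])
augmented = inject_similar_words(
    "the module design is unclear", vocab,
    {"module": "component", "design": "architecture"},
)
```

The augmented text keeps every original token and only adds candidates, which is why surface-level changes of this kind leave sentence semantics, and hence the models' predictions, largely intact.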
We attribute the negligible variation to the models' strong ability to generalize based on the semantic content of input sentences. Since the models have already learned robust representations of sentence meaning and semantics during fine-tuning, replacing certain words with their semantically similar counterparts did not significantly alter the underlying sentence representations. Consequently, the predictions remained largely unaffected, indicating that the models were not overly sensitive to surface-level lexical variations. Similar findings have been reported with pretrained models like BERT and RoBERTa (Longpre et al., 2020). Additionally, our findings corroborate the observations in prior work, which found that the performance of classifiers decreases or shows only marginal increases after similar word injection (Mahadi et al., 2022).

Table 5 Performance of models on datasets (questions only) injected with similar words.

Model           Dataset  AC     ROC-AUC  R      P      F1
BERT            Brunet   0.652  0.707    0.676  0.383  0.489
                SATD     0.522  0.584    0.459  0.748  0.569
                Viviani  0.598  0.577    0.444  0.329  0.378
RoBERTa         Brunet   0.622  0.768    0.804  0.375  0.511
                SATD     0.554  0.585    0.548  0.736  0.628
                Viviani  0.533  0.593    0.649  0.325  0.433
XLNet           Brunet   0.753  0.795    0.681  0.499  0.576
                SATD     0.496  0.596    0.384  0.767  0.512
                Viviani  0.646  0.621    0.426  0.373  0.397
ChatGPT         Brunet   0.683  –        0.844  0.427  0.567
                SATD     0.577  –        0.572  0.754  0.650
                Viviani  0.531  –        0.648  0.323  0.431
LaMini-Flan-T5  Brunet   0.742  –        0.742  0.767  0.751
                SATD     0.485  –        0.485  0.625  0.491
                Viviani  0.648  –        0.648  0.650  0.649

To draw robust conclusions about the impact of synonym-based augmentation, we applied Wilcoxon signed-rank tests across various metrics. While the tests revealed statistically significant differences (p-values lower than 0.05), these did not translate into meaningful performance improvements across the metrics.
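The mean deltas used in this section to judge practical impact are simple averages of paired per-run differences (augmented minus raw, matched by run index). A minimal sketch with illustrative values, not the study's numbers:

```python
# Per-metric mean delta between paired runs (e.g., with vs. without
# similar-word injection). A negative delta means the augmented runs
# scored lower on average. Values below are illustrative only.

def mean_delta(runs_a, runs_b):
    # Paired runs: same run index on both sides.
    return sum(a - b for a, b in zip(runs_a, runs_b)) / len(runs_a)

raw = [0.87, 0.88, 0.86, 0.87]
augmented = [0.80, 0.81, 0.79, 0.80]
delta = mean_delta(augmented, raw)  # negative here: augmentation hurt
```

Pairing by run keeps the comparison aligned with the Wilcoxon signed-rank test, which operates on exactly these per-run differences.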
We found that the mean delta for the ROC-AUC metric ranged from –0.0924 to 0.0039, for recall from –0.0988 to 0.1598, for precision from –0.1660 to 0.0128, and for accuracy from –0.0932 to 0.0411. Despite statistically significant differences between similar-word injection and raw data, the SATD dataset, as one example, showed mean accuracy differences of 0.0279 for RoBERTa, –0.0104 for BERT, 0.0347 for XLNet, 0.0158 for ChatGPT, and –0.0650 for LaMini-Flan-T5. The takeaway is that simple augmentation strategies introduce extra cost without practical benefit, highlighting the need for more advanced approaches.

An overall summary of the results above is given in Section 5, with the answer to RQ2 and an associated discussion.

4.4 Evaluation with Generative Language Models

We further explored the capabilities of state-of-the-art small and large language models for cross-domain text classification, specifically LaMini-Flan-T5-77M and ChatGPT-4o-mini. Unlike the other models, LaMini-Flan-T5-77M and ChatGPT-4o-mini output text rather than probability logits, making ROC-AUC not ideal; we therefore report only precision and recall (Tables 4 and 5). Figures 4 and 5 illustrate the precision and recall of the generative language models compared to the other baselines. Since these datasets are imbalanced (Table 1), we focus on accuracy, precision, and recall, and report F1 scores with caution due to the data imbalance.

Fig. 4 Precision scores per model per dataset.

Fig. 5 Recall scores per model per dataset.

On the Brunet dataset, XLNet achieved the strongest performance with the highest accuracy and precision, while LaMini-Flan-T5-77M performed the weakest due to low recall. For the SATD dataset, ChatGPT-4o-mini attained the most balanced performance, with the highest recall and competitive accuracy, whereas LaMini-Flan-T5-77M again underperformed with very low recall despite relatively high precision. On the Viviani dataset, LaMini-Flan-T5-77M and XLNet were the strongest, with LaMini-Flan-T5-77M obtaining the highest accuracy and XLNet showing competitive recall, while ChatGPT-4o-mini achieved the highest recall but the lowest precision.

Table 6 presents two representative samples from each dataset alongside their ground-truth labels and the predictions produced by the five models.

Table 6 Sampled disagreements across datasets (per-model predictions, where ✓ marks a "design" prediction and ✗ a "general" prediction).

SATD 1585 (label: general): "is it possible to use more than one variable"
  BERT ✓, RoBERTa ✓, XLNet ✗, ChatGPT ✗, LaMini-Flan-T5 ✗
SATD 3284 (label: design): "arguments compilers always create irubyobject but we want to use rubyarray concat here as a result this is not efficient since it creates and then later unwraps an array"
  BERT ✓, RoBERTa ✓, XLNet ✗, ChatGPT ✗, LaMini-Flan-T5 ✓
BRUNET 131 (label: design): "I still think it would make sense to take and stash handOverData during the entire lifetime of the singleton"
  BERT ✓, RoBERTa ✓, XLNet ✓, ChatGPT ✗, LaMini-Flan-T5 ✓
BRUNET 129 (label: general): "The return value is only used in one place line 195 where the whole bug fix originated"
  BERT ✗, RoBERTa ✗, XLNet ✗, ChatGPT ✓, LaMini-Flan-T5 ✗
VIVIANI 6429 (label: design): "it is passed as an argument to module wrapper. Not a good idea for stream readable.js though as it is supposed to be the same as in readable-stream, so only public APIs."
  BERT ✗, RoBERTa ✓, XLNet ✓, ChatGPT ✗, LaMini-Flan-T5 ✓
VIVIANI 3399 (label: general): "Could at least have a class attribute with the messages you want to store to test against it, it's more self-contained."
  BERT ✗, RoBERTa ✓, XLNet ✗, ChatGPT ✓, LaMini-Flan-T5 ✗

In the SATD dataset, the first instance is clearly non-design-related, which was correctly identified by XLNet, ChatGPT-4o-mini, and LaMini-Flan-T5-77M. The second instance, however, describes an explicit design decision in the code; this was correctly predicted by BERT and RoBERTa, but missed by XLNet and ChatGPT-4o-mini. Turning to the Brunet dataset, the first instance reflects a design decision, yet ChatGPT-4o-mini misclassified it. Similarly, in the second instance, text concerning the usage of a return value tied to a bug fix, ChatGPT-4o-mini again produced an incorrect prediction. For the Viviani dataset, the first instance discusses argument passing within a module wrapper and its suitability for another JavaScript file. This reflects a structural decision and was correctly identified by RoBERTa and XLNet. The second instance, however, highlights a suggestion that a class should maintain an attribute to facilitate self-contained testing. Although this instance is labeled as general, we interpret it as more indicative of a design-related suggestion, aligning with the classifications given by RoBERTa and ChatGPT-4o-mini.

5 Discussion

This section presents an interpretation of the results, answering the two RQs of our study. Additionally, we describe the implications of our work for software engineering practice and research. Finally, the threats to validity are discussed.

5.1 Answering RQ1

The results presented in the previous section highlight complementary models with distinct strengths. ChatGPT-4o-mini consistently achieves higher recall across datasets, capturing a greater proportion of design-related discussions than traditional transformer baselines. However, this advantage often comes at the expense of precision, as ChatGPT-4o-mini tends to overpredict design-related labels, occasionally misclassifying general discussions.
By contrast, LaMini-Flan-T5-77M, despite its lightweight architecture and smaller parameter footprint, delivers competitive accuracy and strong precision, though its recall is less consistent, particularly in cross-domain settings. Thus, ChatGPT-4o-mini is better suited for contexts where maximizing recall is critical, such as exploratory analysis or comprehensive coverage of design discussions. LaMini-Flan-T5-77M is a more computationally efficient option with competitive precision, making it ideal for integration into resource-constrained tools such as development assistants or code review pipelines. From this observation, we present the answer to RQ1 of our study:

RQ1 Answer: Transformer-based models varied in effectiveness for detecting design-related discussions. LaMini-Flan-T5-77M and XLNet favored precision, while BERT and RoBERTa leaned toward recall. ChatGPT-4o-mini achieved the highest recall but with reduced precision. These results highlight clear precision–recall trade-offs, making model choice dependent on whether minimizing false positives or maximizing coverage is prioritized.

Implications of the findings for RQ1:
– In agile workflows with short cycles and limited documentation, ChatGPT-4o-mini can serve as a safety net to prevent design-related discussions from being missed. For example, comments that seem implementation-level (e.g., adding a class attribute for self-containment) often reflect deeper architectural concerns such as encapsulation and maintainability. Capturing these discussions reduces knowledge vaporization and enhances team alignment.
– In resource-constrained environments, such as IDE assistants, code review tools, or issue triage systems, LaMini-Flan-T5-77M offers a lightweight yet precise mechanism for surfacing design decisions without overwhelming developers with false positives.
This supports more targeted documentation, onboarding, and long-term maintenance.
– The variation across models indicates that no single model is universally optimal. Organizations must explicitly decide whether avoiding false positives (precision) or capturing as many relevant discussions as possible (recall) better aligns with their development and compliance needs.

5.2 Answering RQ2

The results of our study (in Section 4.3) show that using synonym-based word injection to enhance cross-domain performance resulted in minimal improvement in our models' performance. For instance, using XLNet on the Brunet dataset, we achieved a ROC-AUC score of 0.872, with a precision of 0.665 and a recall of 0.679. After applying the similar-word injection strategy, the ROC-AUC dropped to 0.795, with precision decreasing to 0.499 and recall slightly increasing to 0.681. Although the statistical tests are significant, the mean differences across metrics are very small, suggesting that this augmentation strategy may not yield meaningful improvements in design discussion detection. Further research is needed to fully understand its potential. The answer to RQ2 is:

RQ2 Answer: Similar-word injection led to statistically significant changes in most metrics; however, the actual performance improvements were minimal. The technique did not meaningfully enhance cross-domain detection of design discussions, suggesting that synonym-based data augmentation is not a worthwhile strategy for this task. The limited gains do not justify the additional complexity introduced by the augmentation process.

Implications of the findings for RQ2:
– Synonym injection does not provide sufficient benefit for practitioners seeking to improve cross-domain detection of design discussions. The complexity it introduces outweighs the marginal gains.
– Future practice should explore more promising methods, such as few-shot learning, contrastive training, or domain-specific augmentation, which may better generalize across heterogeneous datasets.
– Since synonym-based augmentation produced statistically significant but practically negligible improvements, relying on such techniques may create a false sense of progress. Practitioners should be cautious when adopting lightweight augmentation strategies without evaluating their real-world impact.

5.3 Threats To Validity

We discuss the potential threats to the validity of our study following the framework by Wohlin et al. (2012). In particular, we elaborate on the threats to construct, internal, external, and conclusion validity, and our strategies to mitigate them.

Construct validity concerns whether our measurements and treatments accurately represent the theoretical constructs they are intended to capture. The most significant threat in this category stems from the subjectivity of labeling what constitutes a "design discussion." The definition of design discussion is inherently subjective and may vary depending on the annotator's understanding of the concept, their experience level, and the specific domain or context of the project. This subjectivity may introduce label inconsistencies in the dataset, potentially affecting the model's ability to perform effectively on new or unseen data. To mitigate this threat, we reused established datasets from prior peer-reviewed studies rather than creating our own annotations, and we manually examined representative samples from each dataset to understand their labeling criteria and identify potential inconsistencies. Additionally, we face a mono-method bias, as we relied exclusively on transformer-based approaches for cross-domain classification, which may limit our understanding of the problem space.
We partially addressed this by evaluating five different transformer architectures with varying design principles and training strategies, providing a broader perspective within the transformer paradigm.

Internal validity addresses whether a causal relationship can be established between our treatments and outcomes without confounding factors. While we do not seek to establish causal relationships, the model chosen to inject semantically similar words into the cross-domain dataset may itself introduce systematic bias. If the similarity model does not adequately reflect the contextual or domain-specific meaning of words, the substitutions may fail to preserve the intended semantics, leading to context distortion or semantic drift that could mislead the classifier and negatively affect its performance. To assess this threat, we conducted parallel experiments both with and without similar word injection, allowing us to isolate and quantify its impact on model performance. Furthermore, the instrumentation threat manifests through the pre-processed nature of our training data. The primary data used for model fine-tuning had already undergone preprocessing before we accessed it, and we did not have control over the cleaning steps that were applied. Some of the words removed during preprocessing may have influenced the model's performance, as certain domain-relevant terms or contextual clues could have been lost, potentially affecting the model's ability to learn nuanced patterns. While we could not fully mitigate this threat due to dataset reuse, we performed minimal additional preprocessing to avoid compounding the issue.

External validity regards the generalizability of our findings beyond the specific context of our study. The primary threat here relates to experimental configuration.
The datasets used for cross-domain evaluation (GitHub pull requests, issues, and commit messages) represent only a subset of possible software development communication channels, potentially limiting the broader applicability of our findings to other contexts such as Slack conversations, email threads, or documentation wikis. Despite this inherent limitation, we still evaluated our models across multiple distinct communication channels. This diversity provides evidence that our approach has the potential to transfer across various text formats and discussion styles commonly found in software development.

Lastly, conclusion validity concerns whether the relationships observed in our data are statistically sound. One threat in this category relates to our limited emphasis on hyperparameter tuning. As this study is a conceptual replication, the primary focus was on evaluating the core idea under different conditions rather than on optimizing model performance. Accordingly, we did not perform an extensive search over fine-tuning hyperparameters such as learning rate, batch size, or number of epochs. While this aligns with the goals of our study, more refined tuning could potentially impact the observed effects. We mitigated potential overfitting by using early stopping via PyTorch during fine-tuning, and we used consistent default configurations across all models to ensure fair comparisons. Additionally, the reliability of our measures may be affected by the use of pre-trained models with default configurations, which might not be optimal for the specific task of design discussion detection. To address this threat, we employed tests to evaluate the statistical significance of differences, ensuring our conclusions are based on sound statistical analysis rather than numerical observations alone.
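The early-stopping mechanism mentioned above reduces to a patience rule on the validation metric: stop once the metric has failed to improve for a fixed number of epochs. A standalone sketch with an illustrative loss curve (the study relies on PyTorch's built-in support during fine-tuning, not this code):

```python
# Generic early-stopping rule on a validation loss: training stops after
# `patience` consecutive epochs without improvement over the best value.
# The loss curve below is illustrative.

class EarlyStopping:
    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
losses = [0.52, 0.41, 0.39, 0.40, 0.42, 0.38]
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

Note that the rule stops at the first sustained plateau even if a later epoch would have improved, which is the usual trade-off accepted to curb overfitting.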
6 Conclusion

In this study, we conducted a conceptual replication to investigate the effectiveness of transformer-based models, namely BERT, RoBERTa, XLNet, LaMini-Flan-T5-77M, and ChatGPT-4o-mini, in identifying software design discussions across diverse sources of developer communication. By fine-tuning the transformer models on Stack Overflow data and evaluating them on cross-domain sources such as commit messages, GitHub pull requests, and issue discussion threads, we derived several key insights to inform future research and practical applications, namely: (i) domain-specific training matters, (ii) cross-domain performance degrades, (iii) synonym-based data augmentation is largely ineffective, confirming the findings of prior work, (iv) LaMini-Flan-T5-77M offers a lightweight and precision-oriented option, and (v) ChatGPT-4o-mini consistently achieves high recall, making it effective in maximizing coverage of design-related discussions. These findings demonstrate that large language models, including ChatGPT-4o-mini, can effectively identify software architectural design discussions in natural language text such as commit messages, pull requests, and issue discussions. In future work, we aim to develop tools to help extract meaningful insights from the discussions flagged as design-related, leveraging them to analyze the evolution of legacy systems and to inform the planning of targeted, data-driven software modernization strategies.

Data Availability

All source code, collected data, and complementary results are in the supplementary material (Arkoh et al., 2026).

References

Ajagbe M, Zhao L (2022) Retraining a BERT model for transfer learning in requirements engineering: A preliminary study. In: 30th Int. Requirements Engineering Conference (RE), IEEE, pp 309–315
Alkadhi R, Nonnenmacher M, Guzman E, Bruegge B (2018) How do developers discuss rationale? In: 25th Int.
Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, pp 357–369
Ameller D, Ayala C, Cabot J, Franch X (2012) Non-functional requirements in architectural decision making. IEEE Software 30(2):61–67
Arabadzhieva-Kalcheva N, Kovachev I (2022) Comparison of BERT and XLNet accuracy with classical methods and algorithms in text classification. In: Int. Conference on Biomedical Innovations and Applications (BIA), IEEE, vol 1, pp 74–76
Arkoh L, Feitosa D, Assunção WKG (2026) Where are the hidden gems? Applying transformer models for design discussion detection [supplementary material]. DOI 10.5281/zenodo.15285224, URL https://doi.org/10.5281/zenodo.15285224
Assunção WK, Marchezan L, Arkoh L, Egyed A, Ramler R (2025) Contemporary software modernization: Strategies, driving forces, and research opportunities. ACM Trans Softw Eng Methodol
Bhat M, Shumaiev K, Biesdorf A, Hohenstein U, Matthes F (2017) Automatic extraction of design decisions from issue management systems: a machine learning based approach. In: European Conference on Software Architecture, Springer, pp 138–154
Bi T, Liang P, Tang A, Xia X (2021) Mining architecture tactics and quality attributes knowledge in Stack Overflow. Journal of Systems and Software 180:111005
Borrego G, Morán AL, Palacio RR, Vizcaíno A, García FO (2019) Towards a reduction in architectural knowledge vaporization during agile global software development. Information and Software Technology 112:68–82
Brunet J, Murphy GC, Terra R, Figueiredo J, Serey D (2014) Do developers discuss design? In: 11th Working Conference on Mining Software Repositories, pp 340–343
Bucaioni A, Di Salle A, Iovino L, Mariani L, Pelliccione P (2024) Continuous conformance of software architectures. In: 21st Int.
Conference on Software Architecture (ICSA), IEEE, pp 112–122
Cruz M, Bernárdez B, Durán A, Galindo JA, Ruiz-Cortés A (2019) Replication of studies in empirical software engineering: A systematic mapping study, from 2013 to 2018. IEEE Access 8:26773–26791
Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp 4171–4186
Dhaouadi M, Oakes BJ, Famelis M (2024) Rationale dataset and analysis for the commit messages of the Linux kernel out-of-memory killer. In: 32nd Int. Conference on Program Comprehension, pp 415–425
de Dieu MJ, Liang P, Shahin M, Khan AA (2023) Characterizing architecture related posts and their usefulness in Stack Overflow. J Syst Softw 198:111608
Fan Y, Arora C, Treude C (2023) Stop words for processing software engineering documents: Do they matter? In: 2nd Int. Workshop on Natural Language-Based Software Engineering (NLBSE), IEEE, pp 40–47
Fredriksson T, Mattos DI, Bosch J, Olsson HH (2020) Data labeling: An empirical investigation into industrial challenges and mitigation strategies. In: International Conference on Product-Focused Software Process Improvement, Springer, pp 202–216
Gómez OS, Juristo N, Vegas S (2014) Understanding replication of experiments in software engineering: A classification. Information and Software Technology 56(8):1033–1048
Goutte C, Gaussier E (2005) A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In: European Conference on Information Retrieval, Springer, pp 345–359
Izadi M, Akbari K, Heydarnoori A (2022) Predicting the objective and priority of issue reports in software repositories.
Empirical Software Engineering 27(2):50
Josephs A, Gilson F, Galster M (2022) Towards automatic classification of design decisions from developer conversations. In: 19th Int. Conference on Software Architecture Companion (ICSA-C), IEEE, pp 10–14
Kumar A, Trueman TE, Cambria E (2021) Fake news detection using XLNet fine-tuning model. In: Int. Conference on Computational Intelligence and Computing Applications (ICCICA), IEEE, pp 1–4
Lepagnol P, Gerald T, Ghannay S, Servan C, Rosset S (2024) Small language models are good too: An empirical study of zero-shot classification. In: Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, pp 14923–14936
Li R, Liang P, Soliman M, Avgeriou P (2022) Understanding software architecture erosion: A systematic mapping study. Journal of Software: Evolution and Process 34(3):e2423
Li Y, Shi C, Duan Z, Liu F, Yang M (2024) Fine-tuning BERT for intelligent software system fault classification. In: 24th Int. Conference on Software Quality, Reliability, and Security Companion (QRS-C), IEEE, pp 872–879
Li Z, Avgeriou P, Liang P (2015) A systematic mapping study on technical debt and its management. J Syst Softw 101:193–220
Liagkou V, Filiopoulou E, Fragiadakis G, Nikolaidou M, Michalakelis C (2024) The cost perspective of adopting large language model-as-a-service. In: Int. Conference on Joint Cloud Computing (JCC), IEEE, pp 80–83
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692
Long Z, Li H, Shi J, Ma X (2024) Data augmentation using virtual word insertion techniques in text classification tasks. Expert Systems 41(4):e13519
Longpre S, Wang Y, DuBois C (2020) How effective is task-agnostic data augmentation for pretrained transformers?
arXiv preprint arXiv:2010.01764
Lu H, Shanmugam D, Suresh H, Guttag J (2022) Improved text classification via test-time augmentation. arXiv preprint arXiv:2206.13607
Maarleveld J, Dekker A, Druyts S, Soliman M (2023) Maestro: a deep learning based tool to find and explore architectural design decisions in issue tracking systems. In: European Conference on Software Architecture, Springer, pp 390–405
Mahadi A, Ernst NA, Tongay K (2022) Conclusion stability for natural language based mining of design discussions. Empirical Software Engineering 27:1–42
Mehrpour S, LaToza TD (2023) A survey of tool support for working with design decisions in code. ACM Computing Surveys 56(2):1–37
Mondal AK, Schneider KA, Roy B, Roy CK (2022) A survey of software architectural change detection and categorization techniques. Journal of Systems and Software 194:111505
Nadeem A, Sarwar MU, Malik MZ (2021) Automatic issue classifier: A transfer learning framework for classifying issue reports. In: Int. Symposium on Software Reliability Engineering Workshops (ISSREW), IEEE, pp 421–426
Naidu G, Zuva T, Sibanda EM (2023) A review of evaluation metrics in machine learning algorithms. In: Computer Science On-line Conference, Springer, pp 15–25
Oliveira A, Correia J, Assunção WK, Pereira JA, de Mello R, Coutinho D, Barbosa C, Libório P, Garcia A (2024) Understanding developers' discussions and perceptions on non-functional requirements: The case of the Spring ecosystem. ACM on Software Engineering 1(FSE):517–538
OpenAI (2025) ChatGPT. URL https://chatgpt.com
Pan S, Bao L, Ren X, Xia X, Lo D, Li S (2021) Automating developer chat mining. In: 36th IEEE/ACM Int. Conference on Automated Software Engineering (ASE), IEEE, pp 854–866
Patel P, Choukse E, Zhang C, Goiri Í, Warrier B, Mahalingam N, Bianchini R (2024) Characterizing power management opportunities for LLMs in the cloud. In: 29th ACM Int.
Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pp 207–222
Shakiba A, Green R, Dyer R (2016) FourD: do developers discuss design? revisited. In: 2nd Int. Workshop on Software Analytics, pp 43–46
Shang X, Zhang S, Zhang Y, Guo S, Li Y, Chen R, Li H, Li X, Jiang H (2024) Analyzing and detecting information types of developer live chat threads. ACM Trans Softw Eng Methodol 33(5):1–32
Shanmugam D, Lu H, Sankaranarayanan S, Guttag J (2025) Test-time augmentation improves efficiency in conformal prediction. In: Computer Vision and Pattern Recognition Conference, pp 20622–20631
Shekhar S, Dubey T, Mukherjee K, Saxena A, Tyagi A, Kotla N (2024) Towards optimizing the costs of LLM usage. arXiv preprint arXiv:2402.01742
da Silva Maldonado E, Shihab E, Tsantalis N (2017) Using natural language processing to automatically detect self-admitted technical debt. IEEE Transactions on Software Engineering 43(11):1044–1062
Skryseth D, Shivashankar K, Pilán I, Martini A (2023) Technical debt classification in issue trackers using natural language processing based on transformers. In: ACM/IEEE Int. Conference on Technical Debt (TechDebt), IEEE, pp 92–101
Sofaer HR, Hoeting JA, Jarnevich CS (2019) The area under the precision-recall curve as a performance metric for rare binary events. Methods in Ecology and Evolution 10(4):565–577
Su R, Bakhtin A, Ahmad N, Esposito M, Lenarduzzi V, Taibi D (2026) Evaluating large language models for detecting architectural decision violations. arXiv preprint arXiv:2602.07609
Sun W, Fang C, You Y, Miao Y, Liu Y, Li Y, Deng G, Huang S, Chen Y, Zhang Q, et al. (2023) Automatic code summarization via ChatGPT: How far are we? arXiv preprint arXiv:2305.12865
Tian Y, Zhang Y, Stol KJ, Jiang L, Liu H (2022) What makes a good commit message? In: 44th Int.
Conference on Software Engineering, pp 2389–2401
Viviani G, Famelis M, Xia X, Janik-Jones C, Murphy GC (2019) Locating latent design information in developer discussions: A study on pull requests. IEEE Transactions on Software Engineering 47(7):1402–1413
Wan Z, Zhang Y, Xia X, Jiang Y, Lo D (2023) Software architecture in practice: Challenges and opportunities. In: 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp 1457–1469
Wei J, Zou K (2019) EDA: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196
Widyasari R, Zhang T, Bouraffa A, Maalej W, Lo D (2025) Explaining explanations: An empirical study of explanations in code reviews. ACM Trans Softw Eng Methodol 34(6):1–30
Wijerathna L, Aleti A, Bi T, Tang A (2022) Mining and relating design contexts and design patterns from Stack Overflow. Empirical Software Engineering 27:1–53
Wilcoxon F (1945) Individual comparisons by ranking methods. Biometrics Bulletin 1(6):80–83
Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2012) Experimentation in Software Engineering. Springer Berlin Heidelberg, DOI 10.1007/978-3-642-29044-2
Wu M, Waheed A, Zhang C, Abdul-Mageed M, Aji AF (2024) LaMini-LM: A diverse herd of distilled models from large-scale instructions. In: 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp 944–964
Yang W, Zhang C, Pan M, Xu C, Zhou Y, Huang Z (2022) Do developers really know how to use Git commands? A large-scale study using Stack Overflow. ACM Trans Softw Eng Methodol 31(3)
Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV (2019) XLNet: Generalized autoregressive pretraining for language understanding.
Advances in Neural Information Processing Systems 32
Zhao J, Yang Z, Zhang L, Lian X, Yang D, Tan X (2024) DRMiner: Extracting latent design rationale from Jira issue logs. In: 39th Int. Conference on Automated Software Engineering, pp 468–480
Zhu X, Vondrick C, Fowlkes CC, Ramanan D (2016) Do we need more training data? Int Journal of Computer Vision 119(1):76–92
Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In: 7th European Software Engineering Conference / ACM Symposium on the Foundations of Software Engineering, pp 91–100