TurnWise: The Gap between Single- and Multi-turn Language Model Capabilities



Victoria Graf 1,2   Valentina Pyatkin 2   Nouha Dziri 2   Nathan Lambert 2   Hannaneh Hajishirzi 1,2
1 University of Washington   2 Allen Institute for AI

Abstract

Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings, failing to capture the additional dimension of these longer interactions. To understand this multi-/single-turn gap, we first introduce a new benchmark, TurnWiseEval, for multi-turn capabilities that is directly comparable to single-turn chat evaluation. Our evaluation isolates multi-turn-specific conversational ability through pairwise comparison to equivalent single-turn settings. We additionally introduce our synthetic multi-turn data pipeline, TurnWiseData, which allows the scalable generation of multi-turn training data. Our experiments with Olmo 3 show that training with multi-turn data is vital to achieving strong multi-turn chat performance, and that including as little as 10k multi-turn conversations during post-training can lead to a 12% improvement on TurnWiseEval.

1 Introduction

Large language models (LLMs) are increasingly deployed to real-world users for free-form, multi-turn conversations. However, many training and evaluation methodologies continue to rely primarily on single-turn data, creating a mismatch between real-world usage and current paradigms. Open language modeling recipes (Lambert et al., 2025; Olmo et al., 2025) focus on and are primarily composed of single-turn data. Where multi-turn data is included in these recipes, it is added without systematic consideration of its contribution to conversational abilities. Moreover, collecting scalable multi-turn conversational data proves difficult due to the challenges of user simulation (Mehri et al., 2026) and the expensive nature of large-scale manual collection of user interactions (Ye et al., 2022).

While gaps in multi-turn conversational ability have been documented (Laban et al., 2025; Sirdeshmukh et al., 2025), current measurements of this gap are incomplete and insufficiently informative. Existing multi-turn dialogue benchmarks (e.g., MT-Bench (Zheng et al., 2023), MT-Eval (Kwan et al., 2024)) are being saturated, making it difficult to assess what remains of the multi-/single-turn performance gap. Additionally, many conversational benchmarks do not distinguish whether low model performance in multi-turn conversations stems from multi-turn-specific performance gaps or from general model ability for similar tasks (Zheng et al., 2023; Bai et al., 2024).

To improve evaluation of the multi-turn conversational abilities of LLMs, we first introduce a new benchmark, TurnWiseEval (Section 2), which isolates multi-turn abilities from other confounding signals. Unlike previous multi-turn evaluations, our approach grounds evaluation in pairwise comparisons between responses to multi-turn and equivalent single-turn conversations (Figure 1). This methodology isolates multi-turn performance beyond task difficulty, minimizing sensitivity to other LLM capabilities such as factuality and knowledge and providing a more focused assessment of conversational abilities. Additionally, we use controlled conversational context to further separate multi-turn ability from dependencies on conversational state and to enable better comparisons of results between models. We show that even frontier models like GPT-5 Chat can underperform on TurnWiseEval and that open model recipes lag behind.
To address this gap between chat performance of LLMs in single- versus multi-turn conversations, we introduce a scalable method, TurnWiseData, for creating synthetic multi-turn context on arbitrary single-turn conversations (Section 3). We use data generated with our methodology to analyze how incorporating multi-turn data in training can improve conversational performance with Olmo 3 (Olmo et al., 2025) (Section 4). Through these experiments, we show that adding even a small number of synthetic multi-turn conversations in training can significantly reduce the gap with single-turn performance by up to 12%, demonstrating the importance of multi-turn conversations in training.

Figure 1: TurnWiseEval-Absolute example. Model responses to a multi-turn conversation are compared to a strong model's response on an equivalent single-turn prompt. Our evaluation metric is the win rate of the evaluated model as determined by an LM judge.

2 The TurnWiseEval Benchmark

In this section we detail the motivation, design, and resulting data inside the TurnWiseEval benchmark. We then show that TurnWiseEval measures multi-turn conversational abilities for models with a range of single-turn abilities, and that multi-turn conversational ability is a distinct dimension of evaluation to consider beyond single-turn chat.

2.1 Evaluation Design

To isolate multi-turn conversational ability beyond single-turn task abilities, we use pairwise comparisons of responses to single- and multi-turn conversations. We consider two comparison settings:

• TurnWiseEval-Absolute. In our first evaluation setting, we measure "absolute" multi-turn performance by comparing an evaluated model's completions of multi-turn conversations to responses from a strong model, specifically GPT-4 (1106-preview), on equivalent single-turn prompts (Figure 1). These reference responses represent a strong baseline response to the single-turn setting. This setting uses length-controlled win rate as the primary metric and is intended for easy interpretation of TurnWiseEval scores across models.

• TurnWiseEval-Self. Our second setting further isolates multi-turn ability by comparing an evaluated model's responses to multi-turn conversations to its own responses to corresponding single-turn conversations. This setting uses discrete win rate as the primary metric. The single-turn response here represents the model's own baseline ability for the task instruction. Win rates below 50% imply that a model's responses to multi-turn conversations are consistently worse than their single-turn counterparts, indicating degraded performance in the multi-turn setting beyond what can be attributed to general task ability. In best-case scenarios, models can leverage context from previous turns to improve their final response, leading to win rates above 50%.

For pairwise judgments in TurnWiseEval, we use GPT-4.1 as judge with the judgment prompt from AlpacaEval (Appendix A).

2.2 Multi-turn Conversation Curation

Our evaluation captures a common pattern in real-world interactions where users explore related topics across multiple turns. Our data curation is informed by manual inspection of multiple real-world data sources. To ground our design, we read and analyzed multi-turn interactions from the Ai2 Playground model hosting platform (Appendix B).

Sourcing Prompts. To build our TurnWiseEval dataset, we create multi-turn conversations of up to eight user turns, where user turns pose related questions to each seed instruction drawn from AlpacaEval.
Each user turn is synthetically generated and contains a self-contained, topically related request, allowing the assistant to optionally leverage earlier turns for richer responses without removing any necessary contextual information from the single-turn comparison conversation. To control the information seen by the evaluated model in the conversational context so far, we include assistant turns generated by the model used for the reference single-turn completions in TurnWiseEval-Absolute to create conversational states for each evaluation sample. We choose not to go beyond eight user turns to avoid confounding our evaluation with varying long-context abilities.[1] We manually ensured conversational coherence and topical consistency for generated user turns in TurnWiseEval and verified a random subset of 100 conversations (Appendix B).

[1] Conversations are limited to a maximum of ~4.7k tokens.

Comparison to AlpacaEval. For reliable comparison to widely used chat evaluation settings, our final evaluation instructions are sourced from AlpacaEval (Li et al., 2023), with additional context created as earlier turns in the conversations. Thus, there exists a single-turn complement to each TurnWiseEval conversation in AlpacaEval, making comparison of these metrics 1:1. This relation with single-turn evaluation allows direct comparison of a model's multi- and single-turn abilities. Comparison between TurnWiseEval-Absolute and AlpacaEval performance can isolate differences in performance due to the multi-turn setting rather than general model abilities.[2] Specifically, if performance on TurnWiseEval-Absolute is significantly below that on AlpacaEval, it suggests degraded performance in the multi-turn setting.

[2] All AlpacaEval results reported here refer to length-controlled AlpacaEval with a GPT-4.1 judge (gpt-4.1-2025-04-14). This matches the judge used for TurnWiseEval and the evaluation setting used by Olmo 3 (Olmo et al., 2025).

Model           TW-Absolute  AlpacaEval2    ∆     TW-Self
Olmo 3 7B           36.8        42.2       −5.4    38.9
Olmo 3.1 32B        52.4        60.1       −7.7    34.0
Qwen 3 8B           48.9        49.7       −0.8    53.2
Qwen 3 32B          67.7        69.6       −1.9    48.5
Llama 3.1 8B        18.7        16.3       +2.4    40.2
Llama 3.1 70B       26.3        27.7       −1.4    36.5
GPT-4.1             82.5        83.4       −0.9    42.0
GPT-5 Nano          68.2        68.0       +0.2    41.1
GPT-5 Chat          83.5        88.5       −5.0    40.2
GPT-5.2             82.1        83.5       −1.3    47.6

Table 1: Performance of chat models on TurnWiseEval. Olmo 3 refers to Olmo 3 Instruct models; Qwen 3 results are without thinking enabled. ∆ is the difference between TW-Absolute and AlpacaEval scores.

2.3 Demonstrating the Multi-/Single-turn Gap

Table 1 reports results of different chat models on TurnWiseEval. Open model recipes (Olmo 3 Instruct 7B and 32B) lag behind on TurnWiseEval, reflecting the lack of attention multi-turn conversations have received in these efforts. Meanwhile, even advanced models such as GPT-5 Chat underperform in the multi-turn setting compared against their own single-turn abilities.

Additionally, the importance of isolating multi-turn conversational performance from a model's own single-turn baseline is clear: TurnWiseEval-Self can measure the multi-/single-turn performance gap even when models have particularly low (Llama 3.1) or high (e.g., GPT-4.1) single-turn chat evaluation scores. In these cases, single-turn chat ability becomes the primary influence on all absolute scores, but TurnWiseEval-Self still reveals models' shortcomings in multi-turn conversations.

Figure 2: An example of using TurnWiseData to create a synthetic conversation. User turns are independently generated from the seed prompt (green) and conversations are appended with the seed prompt and response (blue).
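Concretely, the TurnWiseEval-Self metric above is a discrete win rate over pairwise judge verdicts; since the judge prompt forces a binary choice (Appendix A), no ties occur. A minimal sketch, with function and label names that are ours rather than from a released implementation:

```python
# Sketch: discrete win rate for TurnWiseEval-Self. Each verdict records
# whether the LM judge preferred the model's multi-turn response ("multi")
# or its own single-turn response ("single") to the same final instruction.

def discrete_win_rate(verdicts):
    """Percentage of pairwise judgments won by the multi-turn response.
    Below 50% means multi-turn responses are consistently worse than the
    model's own single-turn baseline; above 50% means the model benefits
    from the extra conversational context."""
    wins = sum(1 for v in verdicts if v == "multi")
    return 100.0 * wins / len(verdicts)

print(discrete_win_rate(["multi", "single", "single", "multi"]))  # 50.0
```

TurnWiseEval-Absolute instead uses AlpacaEval's length-controlled win rate, which additionally corrects for the judge's preference for longer responses.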
3 Creating Multi-turn Training Data

We introduce a scalable method for creating synthetic multi-turn data, which we call TurnWiseData. Additionally, we describe a self-talk data baseline which we combine with TurnWiseData conversations in our experimentation in Section 4.

Sourcing prompts. All conversations build on seed prompts randomly selected from the WildChat subset of the Dolci Instruct SFT mixture, a primarily single-turn dataset used for Olmo 3 post-training (Olmo et al., 2025). If seed prompts were selected from already multi-turn conversations, only the first turn was used. We exclude prompts shorter than 15 characters. For both data generation methods, we create conversations with 2 to 8 user turns.

TurnWiseData. To create multi-turn conversational data with TurnWiseData, seed prompts are used to independently generate possible user turns. By creating user turns independently, our pipeline avoids issues such as conversational drift, long contexts, and expensive online user simulation. These turns are then stacked to create a synthetic conversation with the original seed prompt as the final turn (Figure 2). We place the seed prompt at the end of the conversation to preserve useful properties of the original dataset such as preference pairs. User turns were generated with GPT-4.1.

We generate synthetic conversational context from existing prompts with two distinct methods. Simulating conversations in which users are unsatisfied with an initial response and seek iteration, we generate paraphrased prompts as intermediate user turns. Similarly, we represent exploratory conversations with follow-up information seeking by generating related queries as intermediate user prompts.

Self-talk. Prior work has used self-talk as a method for creating artificial conversations where LMs play the part of a user (Lambert et al., 2024; Ulmer et al., 2024).
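The TurnWiseData stacking construction can be sketched as follows. The two generator callbacks stand in for the GPT-4.1 prompts used to produce user turns and intermediate assistant responses; all names here are illustrative, not from a released pipeline:

```python
import random

def build_turnwise_conversation(seed_prompt, generate_user_turn,
                                generate_assistant_turn,
                                min_turns=2, max_turns=8):
    """Stack independently generated user turns, placing the original seed
    prompt last so existing labels attached to the seed (e.g. preference
    pairs) remain usable for the multi-turn version."""
    n_user_turns = random.randint(min_turns, max_turns)
    conversation = []
    # Each synthetic context turn is derived only from the seed prompt,
    # not from earlier turns, avoiding conversational drift and expensive
    # online user simulation.
    for _ in range(n_user_turns - 1):
        conversation.append({"role": "user",
                             "content": generate_user_turn(seed_prompt)})
        conversation.append({"role": "assistant",
                             "content": generate_assistant_turn(conversation)})
    conversation.append({"role": "user", "content": seed_prompt})
    return conversation
```

In the paraphrase variant, `generate_user_turn` would rephrase the seed prompt; in the exploratory variant, it would produce a topically related query.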
We use self-talk as a baseline for online conversation generation without user simulation. To capture a diverse range of follow-up interactions, we consider three user behaviors to create user turns: (1) amending previous instructions, (2) seeking clarification on a response, and (3) building on new assistant-provided information.

4 Training with Multi-turn Data

To understand the importance of multi-turn data during training, we conduct experiments with an open-data model, Olmo 3 7B Instruct (Olmo et al., 2025). We consider two training stages: supervised fine-tuning (SFT) and preference tuning.

4.1 Experimental Setting

We create two types of training data: supervised fine-tuning data and preference pairs. In both settings, we create data with TurnWiseData and with self-talk. For creating SFT data, we use GPT-4.1 and Qwen 3 32B (without thinking) to generate assistant turns. To construct preference pairs, we use the delta-learning heuristic proposed by Geng et al. (2025). We use two model pairs for data creation: (1) rejected responses produced by GPT-3.5 Turbo and preferred responses by GPT-4.1, and (2) rejected responses by Qwen 3 0.6B and preferred responses by Qwen 3 32B, both without thinking enabled.

For our training baselines, we focus on supervised fine-tuning and Direct Preference Optimization (DPO) (Rafailov et al., 2024). We perform all training experiments on Olmo3-7B-Instruct-SFT. For SFT experiments, we continue training with additional multi-turn SFT data. For preference tuning, we use the full Olmo 3 7B preference mixture augmented with our synthetic data. All experiments use the same hyperparameters and training libraries as in Olmo 3 (Olmo et al., 2025).

4.2 Effectiveness of Multi-turn Data

Table 2 shows that the single- vs. multi-turn performance gap is narrowed by fine-tuning or preference tuning with synthetic multi-turn data.
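The delta-learning pair construction from the experimental setting can be sketched as below. Following Geng et al. (2025), the stronger generator's response is taken as chosen and the weaker generator's as rejected, with no explicit judging step; the model callbacks and field names are illustrative:

```python
def make_delta_preference_pair(conversation, strong_model, weak_model):
    """Build a DPO-style preference example for the final user turn of a
    (possibly multi-turn) conversation. The delta-learning heuristic
    assumes the strong generator's response is preferable to the weak
    generator's, so no judge is needed."""
    return {
        "prompt": conversation,               # context plus final user turn
        "chosen": strong_model(conversation),   # e.g. GPT-4.1 or Qwen 3 32B
        "rejected": weak_model(conversation),   # e.g. GPT-3.5 Turbo or Qwen 3 0.6B
    }

# Illustrative usage with stub generators standing in for API calls.
pair = make_delta_preference_pair(
    [{"role": "user", "content": "Summarize the plot of Hamlet."}],
    strong_model=lambda c: "A detailed, accurate summary...",
    weak_model=lambda c: "A short, vague summary...",
)
```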
Training with TurnWiseData conversations improves the TurnWiseEval-Self score by up to 12.8 points with fine-tuning and 9.2 points with preference tuning. Notably, improvements can arise even when synthetic multi-turn conversations comprise less than 5% of the preference mixture.

Effect on Single-Turn Baselines. Table 2 additionally reports the effect of training with our multi-turn data on single-turn instruction-following (Zhou et al., 2023) and knowledge (Hendrycks et al., 2021) baselines. Single-turn chat capabilities remain largely unaffected by the addition of multi-turn data during preference tuning. This stability allows us to achieve improvements on TurnWiseEval while maintaining single-turn performance. However, supervised fine-tuning with synthetic multi-turn data did decrease IFEval and MMLU performance. We conjecture that this is due to higher sensitivity to the data generator's abilities with SFT than with preference tuning, since preference tuning is less dependent on the absolute quality of the data. Thus, we recommend using preference tuning or strong data generators to avoid single-turn performance degradation.

Additionally, multi-turn tuned models showed some improvements on (single-turn) AlpacaEval. We conjecture that this is a result of introducing more general chat data. The effect of introducing more chat data is further explored in Appendix C.

TurnWiseData Effectiveness. To isolate the effectiveness of TurnWiseData conversations from that of self-talk data, we additionally perform experiments with only TurnWiseData conversations. With the same amount of additional data, fine-tuning achieved better results with TurnWiseData conversations than with the self-talk/TurnWiseData mixture (Table 2), implying that TurnWiseData yields stronger data than the self-talk baseline.
Moreover, training with more multi-turn data leads to stronger performance.

4.3 Effect of Previous Conversational Turns

Recent work has shown how earlier turns in a conversation can shift final responses to either improve (Javaji et al., 2025) or degrade (Laban et al., 2025) performance. We control for conversational state in TurnWiseEval by generating intermediate assistant turns with a strong model to isolate final-turn performance and decrease noise sources. To understand how this affects scores, we experiment with the effect of intermediate response quality in Appendix D.

Training Data                         TW-Absolute  AlpacaEval2    ∆     TW-Self  IFEval  MMLU
SFT (original)                            10.0        23.9      −13.9    30.1     81.3   66.7
+ 10k TurnWiseData conversations          24.1        25.1       −1.0    42.9     76.7   61.4
+ 10k TurnWiseData and self-talk          20.7        25.6       −4.9    41.9     72.8   63.2
+ 20k TurnWiseData and self-talk          22.2        24.8       −2.6    42.7     75.6   62.7
DPO (no multi-turn)                       37.4        42.8       −5.4    35.0     82.6   68.0
+ 10k TurnWiseData conversations          39.1        43.8       −4.7    41.2     81.5   68.3
+ 10k TurnWiseData and self-talk          38.7        44.9       −6.2    40.9     83.2   68.1
+ 20k TurnWiseData and self-talk          43.4        44.7       −1.3    44.2     82.8   68.0

Table 2: Performance of models trained with TurnWiseData conversations. All training is done on Olmo 3 7B Instruct SFT models. The SFT (original) line represents the base SFT model, and each additional line represents fine-tuning with different additional multi-turn data. The DPO (no multi-turn) line represents the SFT model preference tuned with only single-turn Olmo 3 preference data; each additional line represents adding different additional data to the preference mixture. ∆ = TurnWiseEval-Absolute − AlpacaEval.

Additionally, we examined performance patterns across conversational lengths to understand how greater conversational context affects model performance (Figure 3).
Results show that training with multi-turn data significantly increases performance at every conversational turn and flattens the decay of performance through longer conversations.

Figure 3: Win rate of models on TurnWiseEval-Absolute decays across conversation turns. One-turn conversations are represented by AlpacaEval. (Plot: discrete win rate (%) versus number of turns, 1 through 8, for SFT (original), SFT + MT (20k), DPO (no MT), and DPO + MT (20k).)

5 Related Work

Synthetic Multi-turn Data. Prior work has considered creation of synthetic multi-turn data. SDSD (Lambert et al., 2024), for example, creates multi-turn data through LMs talking to themselves, an approach called "self-talk" (Ulmer et al., 2024). Other efforts focus on user simulation to create realistic synthetic user turns (Prabhakar et al., 2025; Shim et al., 2025; Wang et al., 2024). However, user simulations introduce their own challenges, such as dialogue state tracking (Wang et al., 2024) and goal alignment (Mehri et al., 2026).

Multi-turn Evaluations. Prior evaluations of multi-turn capabilities in LLMs have introduced various core skills that arise in multi-turn settings (Kwan et al., 2024; Bai et al., 2024). For example, MT-Bench-101 establishes a three-tiered taxonomy founded in educational psychology (Bai et al., 2024). To address saturation of multi-turn benchmarks, MultiChallenge combines hard tasks in long conversations (Sirdeshmukh et al., 2025).

TurnWiseEval differs from previous multi-turn benchmarks by anchoring judgments in pairwise comparisons to the single-turn setting. While prior multi-turn benchmarks have used pairwise evaluation (Zheng et al., 2023), they do not isolate the difference between multi- and single-turn performance. Our comparison of these settings uniquely identifies how performance in multi-turn settings differs from that of single turns.
6 Conclusion

To address the single-/multi-turn data gap, we introduce TurnWiseData, a scalable method for generating multi-turn conversations. Additionally, we introduce a new multi-turn benchmark, TurnWiseEval, to evaluate the difference in a model's performance between single- and multi-turn settings. Our experiments with training on synthetic multi-turn conversations demonstrate significant improvements in addressing this gap in open model recipes. Our results suggest that multi-turn conversational ability represents a distinct dimension of model ability not captured by traditional single-turn evaluation, and we recommend greater attention to multi-turn conversations in open data pipelines.

Limitations

TurnWiseEval scores may vary with the quality of the intermediate turns, as analyzed by using a weaker model to generate the responses in context (Table 4). While this affects absolute scores, it does not typically affect overall model rankings. Relative changes in scores reflect evaluated models' robustness to lower-quality previous turns in context.

Additionally, our experimental settings are limited in scope to focus on dimensions specific to multi-turn training. Thus, we fix training hyperparameters to those of the Olmo 3 7B Instruct model, since the Olmo 3 Instruct data comprises the vast majority of the training data seen by these models.

References

Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. 2024. MT-Bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7421–7454. Association for Computational Linguistics.

Scott Geng, Hamish Ivison, Chun-Liang Li, Maarten Sap, Jerry Li, Ranjay Krishna, and Pang Wei Koh. 2025. The delta learning hypothesis: Preference tuning on weak data can yield strong gains. Preprint.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. Preprint.

Shashidhar Reddy Javaji, Bhavul Gauri, and Zining Zhu. 2025. Another turn, better output? A turn-wise analysis of iterative LLM prompting. ArXiv, abs/2509.06770.

Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. 2024. MT-Eval: A multi-turn capabilities evaluation benchmark for large language models. Preprint.

Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. 2025. LLMs get lost in multi-turn conversation. Preprint.

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, and 4 others. 2025. Tülu 3: Pushing frontiers in open language model post-training. Preprint.

Nathan Lambert, Hailey Schoelkopf, Aaron Gokaslan, Luca Soldaini, Valentina Pyatkin, and Louis Castricato. 2024. Self-directed synthetic dialogues and revisions technical report. arXiv preprint arXiv:2407.18421.

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.

Shuhaib Mehri, Xiaocheng Yang, Takyoung Kim, Gokhan Tur, Shikib Mehri, and Dilek Hakkani-Tür. 2026. Goal alignment in LLM-based user simulators for conversational AI. Preprint.

Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, and 50 others. 2025. Olmo 3. Preprint, arXiv:2512.13961.

Akshara Prabhakar, Zuxin Liu, Ming Zhu, Jianguo Zhang, Tulika Awalgaonkar, Shiyu Wang, Zhiwei Liu, Haolin Chen, Thai Hoang, Juan Carlos Niebles, Shelby Heinecke, Weiran Yao, Huan Wang, Silvio Savarese, and Caiming Xiong. 2025. APIGen-MT: Agentic pipeline for multi-turn data generation via simulated agent-human interplay. Preprint.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Preprint.

Jeonghoon Shim, Gyuhyeon Seo, Cheongsu Lim, and Yohan Jo. 2025. ToolDial: Multi-turn dialogue generation method for tool-augmented language models. Preprint.

Ved Sirdeshmukh, Kaustubh Deshpande, Johannes Mols, Lifeng Jin, Ed-Yeremai Cardona, Dean Lee, Jeremy Kritz, Willow Primack, Summer Yue, and Chen Xing. 2025. MultiChallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier LLMs. Preprint.

Dennis Ulmer, Elman Mansimov, Kaixiang Lin, Justin Sun, Xibin Gao, and Yi Zhang. 2024. Bootstrapping LLM-based task-oriented dialogue agents via self-talk. ArXiv, abs/2401.05033.

Xingguang Wang, Xuxin Cheng, Juntong Song, Tong Zhang, and Cheng Niu. 2024. Enhancing dialogue state tracking models through LLM-backed user-agents simulation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8724–8741, Bangkok, Thailand. Association for Computational Linguistics.

Fanghua Ye, Jarana Manotumruksa, and Emine Yilmaz. 2022. MultiWOZ 2.4: A multi-domain task-oriented dialogue dataset with essential annotation corrections to improve state tracking evaluation. Preprint.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Preprint.

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. 2023. Instruction-following evaluation for large language models. Preprint, arXiv:2311.07911.

A Judge Template

See Figure 4 for the pairwise judgment template. We use the same judgment template provided by the AlpacaEval repository.[3] All reported evaluations are from individual runs.

B Manual Inspection of Data

Our benchmark creation is guided by inspection of over 200 conversations with Tülu 3 collected from the Ai2 model hosting interface. We additionally inspected multi-turn conversations from WildChat and LMSYS Chatbot Arena to verify our observed behaviors. From our inspection, we observed two common modes of user behavior. First, users enter the conversation with a specific target and continue the conversation through multiple turns to iterate on the output. Second, users enter exploratory conversations that include following up on information from previous turns in the conversation. Our synthetic data methodology is grounded in these observed conversation types.

Manual verification of data quality. Figure 5 shows an example of a conversation from TurnWiseEval. To ensure a high-quality evaluation, all conversations in TurnWiseEval were tagged by GPT-4.1 for coherence and degenerate turns. A total of 64 turns across 38 conversations were flagged for quality.
Each flagged turn was manually inspected, and where relevant, the user turn was rewritten (and the response regenerated) or the turn was removed. As a result, 22 turns were edited and 39 turns removed across 17 conversations. Final verification of a random sample of 100 conversations showed topical relevance and coherent instructions across all turns.

[3] https://github.com/tatsu-lab/alpaca_eval/blob/cd543a149df89434d8a54582c0151c0b945c3d20/src/alpaca_eval/evaluators_configs/alpaca_eval_clf_gpt4_turbo/alpaca_eval_clf.txt

C Turn Number Ablations

We perform additional training ablations using truncated versions of the augmented multi-turn conversations (Table 3). Our results demonstrate that improvement goes beyond benefits from training with more chat data, as substituting our synthetic multi-turn data with truncated single-turn versions cannot account for all improvements on TurnWiseEval.

Model           TW-A   AE2     ∆     TW-S  IFEval  MMLU
SFT (original)  10.0  23.9  −13.9    30.1   81.3   66.7
+ max 1 turn    16.9  25.5   −8.6    40.2   73.0   59.7
+ max 2 turns   20.7  22.7   −2.0    43.5   76.3   62.2
+ max 4 turns   21.3  25.3   −4.0    41.8   75.6   63.1
+ max 8 turns   20.7  25.6   −4.9    41.9   72.8   63.2
DPO (no MT)     37.4  42.8   −5.4    35.0   82.6   68.0
+ max 1 turn    38.6  50.1  −11.5    38.9   83.4   68.0
+ max 2 turns   38.1  45.8   −7.7    41.6   82.1   68.5
+ max 4 turns   39.3  45.0   −5.7    42.9   83.7   68.2
+ max 8 turns   38.7  44.9   −6.2    40.9   83.2   68.1

Table 3: Ablations of long conversations in multi-turn training data. "+ max n turns" indicates truncating the additional multi-turn data to n user turns.

D Effect of Context Response Quality

We hypothesize that high-quality intermediate responses in context can inform and improve the final evaluated responses. To investigate this potential for in-context learning, we examine how intermediate response quality affects performance. We created an alternative version of TurnWiseEval using GPT-3.5 Turbo generated responses for this analysis.
Results show that scores with this alternative dataset preserve rankings and conclusions about model abilities (Table 4). This suggests that a model's success on TurnWiseEval is not solely driven by high-quality in-context examples.

Model           TW-A   TW-A (Weak)  TW-S  TW-S (Weak)
SFT (original)  10.0       9.4      30.1     28.4
+ MT (10k)      20.7      20.4      41.9     40.9
+ MT (20k)      22.2      23.3      42.7     43.0
DPO (no MT)     37.4      35.7      35.0     33.0
+ MT (10k)      38.7      40.3      40.9     39.4
+ MT (20k)      43.4      44.3      44.2     42.7

Table 4: Performance of models trained with synthetic multi-turn data on TurnWiseEval with weak context. TW-A (Weak) and TW-S (Weak) indicate alternative versions of TurnWiseEval that use GPT-3.5 Turbo generated assistant turns in context. + MT indicates training with additional multi-turn data (TurnWiseData and self-talk).

<|im_start|>system
You are a highly efficient assistant, who evaluates and selects the best large language models (LLMs) based on the quality of their responses to a given instruction. This process will be used to create a leaderboard reflecting the most accurate and human-preferred answers.
<|im_end|>

<|im_start|>user
I require a leaderboard for various large language models. I'll provide you with prompts given to these models and their corresponding outputs. Your task is to assess these responses, and select the model that produces the best output from a human perspective.

## Instruction

{
    "instruction": """{instruction}""",
}

## Model Outputs

Here are the unordered outputs from the models. Each output is associated with a specific model, identified by a unique model identifier.

{
    {
        "model_identifier": "m",
        "output": """{output_1}"""
    },
    {
        "model_identifier": "M",
        "output": """{output_2}"""
    }
}

## Task

Evaluate the models based on the quality and relevance of their outputs, and select the model that generated the best output. Answer by providing the model identifier of the best model.
We will use your output as the name of the best model, so make sure your output only contains one of the following model identifiers and nothing else (no quotes, no spaces, no new lines, ...): m or M.

## Best Model Identifier
<|im_end|>

Figure 4: The judge template for pairwise comparison.

User: What are some memorable game-winning plays in sports history?

Assistant: There have been many unforgettable game-winning plays in sports history that have left fans in awe. Some of the most iconic ones include Michael Jordan's last shot with the Chicago Bulls in the 1998 NBA Finals, known as "The Last Shot," where he hit a jumper to secure the championship. Another legendary play is David Tyree's helmet catch in Super Bowl XLII, where he pinned the ball against his helmet to help the New York Giants defeat the New England Patriots. And who can forget the "Miracle on Ice" in the 1980 Winter Olympics, when the USA hockey team upset the Soviet Union in a thrilling victory. These moments showcase the magic and drama that sports can bring.

User: As a sports commentator, describe the winning play in the final seconds of a championship game.

Assistant:

Figure 5: An example conversation from TurnWiseEval.
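Operationally, the judging step reduces to filling the template with an instruction and two outputs, then parsing a one-character verdict. A minimal sketch, where the template is abbreviated from Figure 4 and `call_judge` stands in for an actual GPT-4.1 API call (both names are ours):

```python
# Abbreviated stand-in for the Figure 4 judge prompt; the real template
# additionally includes the system message and JSON-style formatting.
JUDGE_PROMPT = (
    "Select the best output for the given instruction.\n"
    "## Instruction\n{instruction}\n"
    '## Model Outputs\nmodel "m": {output_1}\nmodel "M": {output_2}\n'
    "## Best Model Identifier\n"
)

def judge_pair(instruction, output_1, output_2, call_judge):
    """Format the pairwise prompt and map the judge's verdict back to an
    output. call_judge is any text-in/text-out LM call, e.g. GPT-4.1
    through an API client."""
    prompt = JUDGE_PROMPT.format(instruction=instruction,
                                 output_1=output_1, output_2=output_2)
    # The template constrains the judge to answer exactly "m" or "M".
    verdict = call_judge(prompt).strip()
    return "output_1" if verdict == "m" else "output_2"
```

In practice, AlpacaEval-style harnesses also randomize which response is shown as "m" versus "M" to control for position bias.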
