The repetition problem, where Large Language Models (LLMs) continuously generate repetitive content without proper termination, poses a critical challenge in production deployments, causing severe performance degradation and system stalling. This paper presents a comprehensive investigation and multiple practical solutions for the repetition problem encountered in real-world batch code interpretation tasks.
We identify three distinct repetition patterns: (1) business rule generation repetition, (2) method call relationship analysis repetition, and (3) PlantUML diagram syntax generation repetition. Through rigorous theoretical analysis based on Markov models, we establish that the root cause lies in greedy decoding's inability to escape repetitive loops, exacerbated by self-reinforcement effects.
Our comprehensive experimental evaluation demonstrates three viable solutions: (1) Beam Search decoding with early_stopping=True serves as a universal post-hoc mechanism that effectively resolves all three repetition patterns; (2) presence_penalty hyperparameter provides an effective solution specifically for BadCase 1; and (3) Direct Preference Optimization (DPO) fine-tuning offers a universal model-level solution for all three BadCases.
The primary value of this work lies in combining first-hand production experience with extensive experimental validation. Our main contributions include systematic theoretical analysis of repetition mechanisms, comprehensive evaluation of multiple solutions with task-specific applicability mapping, identification of early_stopping as the critical parameter for Beam Search effectiveness, and practical production-ready solutions validated in real deployment environments.
Large Language Models (LLMs) have become increasingly important in real-world applications, demonstrating exceptional capabilities in various domains including natural language understanding, code generation, and system analysis. The practical deployment of LLMs in production environments has opened new possibilities for automating complex tasks that previously required significant human expertise.
In particular, code interpretation tasks have emerged as a critical application area where LLMs can assist developers in understanding complex codebases, generating documentation, and analyzing system architectures. These applications require LLMs to process multiple code segments sequentially, maintaining context across different levels of abstraction (transactions, service methods, etc.), and generating structured outputs such as business rules and PlantUML diagrams.
However, as LLMs are deployed in increasingly complex and demanding scenarios, certain behavioral anomalies have been observed that can significantly impact system performance and reliability. One such critical issue is the repetition problem, which manifests as the model continuously generating repetitive content without proper termination.
The repetition problem, commonly known as the “repeater” problem, occurs when LLMs continuously generate repetitive text in a loop during inference. In our specific case study involving batch code interpretation, we observed the following phenomena:
• The model experiences significant stalling during batch processing
• The model continuously streams repetitive content without interruption
• Generation continues until reaching the maximum token limit (max_tokens)
• Processing time increases dramatically from normal operation (28 minutes) to problematic scenarios (40-160 minutes)
This problem has substantial practical implications. In a batch processing scenario where 20 transactions are processed sequentially, the occurrence of the repetition problem can increase total processing time by 43% to 471%, severely impacting system throughput and resource utilization. The high reproducibility rate (75-80% across different deployment modes) makes this a critical issue that requires systematic investigation and resolution.
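The reported range follows directly from the quoted timings: going from a 28-minute baseline to 40 or 160 minutes is a 43% or 471% increase, respectively. A quick check:

```python
# Reproduce the reported slowdown figures: a 20-transaction batch normally
# takes ~28 minutes; with repetition it takes 40-160 minutes (values from
# the text).
normal_min = 28
best_bad_min = 40
worst_bad_min = 160

def pct_increase(baseline, observed):
    """Percentage increase of observed over baseline."""
    return (observed - baseline) / baseline * 100

low = pct_increase(normal_min, best_bad_min)    # ~43%
high = pct_increase(normal_min, worst_bad_min)  # ~471%
```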
Addressing the repetition problem is of paramount importance for the practical deployment of LLMs in production environments. The contributions of this work include:
• Practical Problem Solving: We provide a systematic approach to resolving a real-world performance issue that significantly impacts production systems.
• Deployment Optimization: Our solution offers insights into LLM deployment optimization that can benefit other similar applications.
• Critical Parameter Discovery: We identify and validate the early_stopping parameter as the decisive factor for Beam Search effectiveness. Our experiments reveal a dramatic difference: early_stopping=True achieves near-zero repetition rate, while early_stopping=False still exhibits significant repetition rate, demonstrating that this parameter is not optional but essential for solving the repetition problem.
• Framework Integration: Our work addresses parameter passing issues between different frameworks (FastChat and vLLM), which is a common challenge in real-world deployments.
The remainder of this paper is organized as follows: Section 2 reviews related work on repetition problems in text generation and decoding strategies. Section 3 presents a detailed problem description and analysis of our specific application scenario. Section 4 provides theoretical analysis of the root causes of the repetition problem. Section 5 describes our solution design including Beam Search, presence penalty, and DPO fine-tuning. Section 6 discusses the solution’s advantages, limitations, and best practices. Finally, Section 7 concludes the paper with a summary of contributions and future work. Detailed BadCase examples, DPO dataset examples, mathematical proofs, and implementation details are provided in the Appendix.
2 Related Work
The repetition problem in neural text generation has been an active area of research. A seminal work is "Learning to Break the Loop: Analyzing and Mitigating Repetitions for Neural Text Generation" [1], which establishes that repetition problems are fundamentally related to decoding strategy. Other works have explored neural text degeneration [8,11] and various mitigation approaches including repetition penalty [12,13], length normalization [7,14], diversity-promoting methods [15,17], and training-based solutions [1]. However, these methods often require careful tuning, introduce randomness, or require costly model retraining.
Text generation in LLMs involves prefill and decoding stages. Greedy decoding, which selects the highest-probability token at each step, is computationally efficient but particularly susceptible to repetition problems as it cannot explore alternative paths. Beam Search addresses this limitation by maintaining multiple candidate sequences simultaneously [7,20,21], allowing exploration of multiple paths and escape from repetitive loops.
The beam width parameter (k) controls the number of candidate sequences, with empirical studies showing values between 3 and 10 are typically optimal [7,10]. Various Beam Search variants have been proposed including diverse beam search [15,16], length penalty [14], and constrained beam search [22,23]. Other decoding strategies include Top-k sampling [9], Top-p (nucleus) sampling [17], temperature scaling [18], and contrastive search [19]. However, these sampling-based methods introduce randomness that may be undesirable in deterministic production environments and may not effectively address systematic repetition patterns due to self-reinforcement effects.
3 Problem Description and Analysis
Our investigation is motivated by a real-world application involving batch code interpretation for financial transaction systems. The system processes transactions in batches, where each batch contains 20 transactions that are executed sequentially. To investigate the system behavior in production, we conducted experiments comparing two deployment modes for vLLM (Mode 1: LoRA-enabled; Mode 2: Merged model). When running 20 transactions sequentially in batch mode across multiple experimental runs, we observed significant time discrepancies: the system exhibits severe performance degradation in a significant proportion of experimental runs (75-80% occurrence rate), with processing time increasing dramatically from normal operation to problematic scenarios. This time discrepancy indicates an underlying issue that causes system stalling during batch processing. Detailed experimental comparison results are provided in Appendix C.2.
Log analysis revealed that the stalling behavior corresponds exactly to the repetition problem, where the model continuously outputs repetitive content without stopping until reaching the max_tokens limit. We identified three distinct BadCase types:
• BadCase 1: Business rule generation repetition: the model generates valid rules initially but then falls into repeating similar conditional patterns.
• BadCase 2: Method call relationship analysis repetition: the model correctly identifies call relationships initially but then repeatedly outputs the same method name.
• BadCase 3: PlantUML diagram syntax generation repetition: the model generates valid PlantUML code initially but then repeatedly generates closing statements.
All three BadCase types share a common characteristic: the model becomes trapped in repetitive loops during text generation, leading to significant processing delays. Detailed case-by-case analysis with complete examples is provided in Appendix A.
To understand why these repetition patterns occur and how they can be systematically addressed, we now turn to theoretical analysis. The following section provides a mathematical foundation for understanding the root causes of the repetition problem, which will inform our solution design.
Based on Markov model analysis, the repetition problem in LLMs can be understood through three key mechanisms: (1) Context Repetition Leading to Probability Enhancement: When the context contains repeated patterns, the model tends to increase the probability of tokens that appeared previously, learning a shortcut to copy from recent history. (2) Self-Reinforcement Effect: The probability of repetition increases monotonically with the number of historical repetitions, creating a positive feedback loop.
(3) High Initial Probability Reinforcement: Sentences with higher initial probabilities exhibit stronger self-reinforcement effects. Detailed mathematical formalization is provided in Appendix C.8.
Greedy decoding, which selects the highest-probability token at each step, is particularly vulnerable to the repetition problem for several reasons:
- Single-Path Exploration: Greedy decoding maintains only one candidate sequence, so once a repetitive pattern begins, there is no mechanism to explore alternative paths.
- Myopic Decision-Making: The algorithm makes decisions based only on immediate probabilities, without considering long-term consequences or recognizing repetitive patterns.
- Amplification of Self-Reinforcement: Since greedy decoding always selects the highest-probability token, and repetitive tokens have elevated probabilities due to self-reinforcement, the algorithm naturally continues the repetition cycle.
The self-reinforcement effect combined with greedy decoding creates a scenario where the model becomes trapped in a loop, continuously generating repetitive content until external constraints (such as max_tokens) force termination.
We model the repetition problem using a Markov chain where the state represents whether we are in a repetitive pattern. The repetition probability evolves according to a recurrence relation that captures the cumulative effect where each repetition makes future repetitions more likely. Under greedy decoding with self-reinforcement effects, once the model enters a repetitive state, the expected escape time is infinite, explaining why greedy decoding cannot break out of repetition loops.
Beam Search addresses this limitation by maintaining multiple candidate sequences. With beam width B ≥ 2 and proper early stopping, Beam Search can escape repetitive loops by maintaining at least one non-repetitive candidate sequence, provided that the initial probability gap between repetitive and non-repetitive continuations is bounded. Detailed mathematical modeling, proofs of propositions, theoretical predictions vs experimental validation, and critical beam width analysis are provided in Appendix C.8.
The early stopping mechanism plays a critical role: when early_stopping=True, the algorithm terminates once it finds B complete sequences, preventing exhaustive search that might lead back to repetitive patterns. This ensures that diverse candidates are preserved throughout the search process.
Building on this theoretical foundation, we now present practical solutions that address the repetition problem at different levels. The following section describes three complementary approaches: Beam Search as a post-hoc mechanism, presence penalty for task-specific cases, and DPO fine-tuning as a model-level solution.
Based on the theoretical understanding of the repetition problem and our analysis of the three BadCases, we have explored multiple solutions. Through analysis and experimental validation, we discovered that all three BadCase types can be effectively resolved using Beam Search decoding strategy, which serves as an effective inference-time solution. However, as Beam Search operates as a post-hoc mechanism that addresses the symptom rather than the root cause at the model level, we further explored alternative solutions that provide more fundamental approaches to the repetition problem. This section presents a comprehensive comparison of solutions and their applicability to different BadCase scenarios.
Beam Search serves as an inference-time solution that resolves repetition problems across all three BadCase types without requiring model modifications.
Beam Search maintains multiple candidate sequences (beams) at each decoding step, selecting the top-k candidates based on cumulative joint probability rather than just the single highest-probability token. Unlike greedy decoding which maintains only one candidate sequence, Beam Search’s multi-path exploration allows it to consider alternative continuations when one path becomes repetitive, providing a mechanism to escape repetition loops. The cumulative score for a sequence is computed as:
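In the standard formulation, the cumulative score of a candidate sequence y_{1:T} given input x is the sum of token-level log-probabilities:

```latex
\mathrm{score}(y_{1:T} \mid x) \;=\; \sum_{t=1}^{T} \log P\big(y_t \mid y_{<t},\, x\big)
```

Length-normalized variants divide this sum by T^α; we state the unnormalized form here since the surrounding text does not specify a length penalty.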
Critical Parameter Configuration For Beam Search to function correctly in vLLM, the following parameters must be set as specified:
• use_beam_search: Must be True
• best_of: Beam width, set to 5 (provides good balance between exploration and efficiency)
• temperature: Must be 0 (deterministic decoding)
• top_p: Must be 1 (no nucleus filtering)
• top_k: Must be -1 (no top-k filtering)
• early_stopping: Must be True (critical!)
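For concreteness, the checklist above can be enforced programmatically before any request is issued. The parameter names mirror the list above (and vLLM's sampling options at the time of this work); treat the exact schema as an assumption rather than a pinned API:

```python
# The Beam Search configuration above, collected as a plain dict so it can
# be validated before being handed to the serving stack.
BEAM_SEARCH_PARAMS = {
    "use_beam_search": True,
    "best_of": 5,           # beam width
    "temperature": 0.0,     # deterministic decoding
    "top_p": 1.0,           # no nucleus filtering
    "top_k": -1,            # no top-k filtering
    "early_stopping": True, # critical: stop once best_of sequences finish
}

def validate_beam_params(params):
    """Fail fast if any setting would silently break Beam Search."""
    assert params.get("use_beam_search") is True
    assert params.get("best_of", 1) >= 2, "need at least two beams to escape loops"
    assert params.get("temperature") == 0.0
    assert params.get("early_stopping") is True, "early_stopping=True is essential"
    return params

validate_beam_params(BEAM_SEARCH_PARAMS)
```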
The early_stopping parameter is the most critical factor for solving the repetition problem. This is our key finding: when set to True, Beam Search stops as soon as best_of complete candidate sequences are found, allowing rapid termination and preserving diverse candidates. When set to False or Never, the algorithm continues searching indefinitely, potentially leading back to repetitive patterns. Our experimental validation (Table 4) demonstrates that early_stopping=True achieves a 0% repetition rate, while early_stopping=False still exhibits a 60% repetition rate, a dramatic difference that underscores the critical importance of this parameter. This finding contradicts the common assumption that Beam Search alone is sufficient; rather, early_stopping=True is an essential requirement, not an optional optimization.
We initially used FastChat integrated with vLLM, but discovered that sampling parameters were not being correctly passed. We implemented a solution using the extra_body approach to correctly pass sampling parameters. Detailed implementation notes and framework integration considerations are provided in Appendix C.12.
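A sketch of the workaround, assuming an OpenAI-compatible chat endpoint in front of vLLM: standard fields go in the normal request body, while vLLM-specific sampling parameters travel in extra_body. The model name is a placeholder:

```python
# Build the kwargs for an OpenAI-compatible chat completion request.
# Fields unknown to the OpenAI schema (the Beam Search settings) are
# tucked into extra_body so the gateway (e.g. FastChat) can forward
# them to vLLM's inference engine.
def build_request(prompt: str) -> dict:
    return {
        "model": "local-code-interpreter",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
        "extra_body": {
            "use_beam_search": True,
            "best_of": 5,
            "top_p": 1.0,
            "top_k": -1,
            "early_stopping": True,  # the critical parameter
        },
    }

req = build_request("Summarize the business rules of this method.")
```

With the openai Python client, these kwargs can then be passed as `client.chat.completions.create(**req)`; the client serializes the extra_body fields into the JSON payload alongside the standard ones.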
Beam Search with early_stopping=True completely eliminates the repetition problem and restores normal processing time. In stark contrast, Beam Search with early_stopping=False still exhibits problematic behavior, confirming that early_stopping=True is not merely beneficial but essential for actually solving the repetition problem. This dramatic difference represents our most significant finding: the early_stopping parameter is the decisive factor that determines whether Beam Search succeeds or fails in addressing repetition problems. Detailed experimental results, performance comparison tables, and statistical analysis are provided in Appendix C.4.
To understand the individual contribution of each parameter, we conducted ablation studies systematically varying key parameters. The results unequivocally confirm that early_stopping=True is the most critical parameter. This finding demonstrates that early_stopping is not just one of many parameters but the determining factor for success. Other parameters (beam width, temperature, etc.) have secondary effects, but without early_stopping=True, Beam Search cannot effectively solve the repetition problem. Detailed ablation study results including beam width impact and presence_penalty value analysis are provided in Appendix C.11.
While Beam Search serves as an effective post-hoc mechanism, we explored alternative solutions that address the repetition problem at different levels. For BadCase 1 (business rule generation repetition), we found that the presence_penalty hyperparameter can effectively mitigate repetition issues.
The presence_penalty parameter penalizes tokens that have already appeared in the generated text, reducing their probability of being selected again. This mechanism helps prevent repetitive patterns by decreasing the likelihood of repeating recently generated tokens.
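A minimal sketch of the mechanism: a flat penalty is subtracted from the logit of every token that has appeared at least once, independent of how often it appeared (which is what distinguishes presence penalty from frequency penalty). The four-token vocabulary below is illustrative:

```python
# Presence penalty: subtract a flat penalty from the logit of each token
# that already appears in the generated text, regardless of count.
def apply_presence_penalty(logits, generated_ids, penalty=1.2):
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] -= penalty
    return out

logits = [2.0, 1.0, 0.5, 0.1]  # toy vocabulary of 4 tokens
generated = [0, 0, 0]          # token 0 has been emitted repeatedly
adjusted = apply_presence_penalty(logits, generated, penalty=1.2)
# token 0 drops from 2.0 to 0.8, so token 1 (1.0) now wins greedy selection
```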
We conducted experiments to evaluate the effectiveness of presence_penalty for BadCase 1. The results demonstrate that setting presence_penalty=1.2 effectively eliminates repetition problems. Detailed experimental results and statistical analysis are provided in Appendix C.5.
Limitations However, this approach has limitations: (1) Task-specific: Only effective for BadCase 1, not applicable to BadCase 2 and BadCase 3. (2) Parameter tuning: Requires careful tuning of the penalty value for optimal results. (3) Inference-time adjustment: Still a post-hoc mechanism, similar to Beam Search.
While BadCase 1 can be effectively addressed using presence_penalty at the inference-time parameter level, Direct Preference Optimization (DPO) fine-tuning provides a universal model-level solution that can be applied to all three BadCase types. Unlike presence_penalty, which is only effective for BadCase 1, DPO fine-tuning addresses repetition problems at the model level by fundamentally modifying the model’s behavior through preference-based learning.
DPO Approach Overview DPO fine-tuning trains the model to prefer non-repetitive outputs over repetitive ones by using preference datasets that explicitly contrast chosen (correct, non-repetitive) responses with rejected (repetitive) responses.
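For reference, the standard DPO objective (Rafailov et al., 2023) trains the policy π_θ to widen the margin between the chosen response y_w and the rejected response y_l relative to a frozen reference policy π_ref:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l)}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
\;-\;
\beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```

In our setting, the repetitive outputs play the role of y_l, so minimizing the loss directly lowers the model's likelihood of generating the repetition patterns.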
We developed a systematic approach to construct DPO preference datasets using generalization and a power-of-2 repetition pattern, in which rejected responses repeat the problematic segment 2, 4, 8, or 16 times.
Beam Search serves as a universal post-hoc mechanism effective for all three BadCase types. Presence penalty is effective only for BadCase 1, while DPO fine-tuning provides a universal model-level solution. The solution applicability comparison is summarized in Table 1.
Our comprehensive study presents multiple solutions with distinct advantages: Beam Search serves as an inference-time solution with universal applicability, immediate deployment capability, proven effectiveness, and deterministic outputs. Presence Penalty offers a lightweight alternative specifically for BadCase 1 with simple configuration and minimal overhead, though its task-specific nature limits broader applicability. DPO Fine-tuning provides a fundamental model-level solution with universal applicability and long-term benefits, though it requires upfront training cost. Detailed comparison of advantages and trade-offs is discussed in Section 6.
While effective, our solution has some limitations: (1) Computational Overhead: Beam Search requires increased memory and processing time compared to greedy decoding, though this overhead is minimal compared to the performance degradation from repetition problems. (2) Strict Parameter Requirements: The solution requires strict adherence to parameter configuration, especially early_stopping=True, and careful parameter passing between frameworks. (3) Context-Dependent Effectiveness: Effectiveness may vary depending on the specific model, task, and deployment scenario. Detailed overhead analysis, computational overhead tables, performance-overhead tradeoff figures, and effectiveness analysis are provided in Appendix C.9 and C.10.
Based on our experience, we recommend the following best practices: (1) verify all required Beam Search parameters are correctly set, especially early_stopping=True, (2) ensure proper parameter propagation between frameworks (e.g., FastChat + vLLM), and (3) implement monitoring and logging to detect repetition problems. Detailed parameter configuration checklists, integration considerations, and monitoring guidelines are provided in Appendix C.13.
This paper makes the following contributions:
- Systematic Root Cause Analysis: We provide a comprehensive analysis of the LLM repetition problem across three distinct BadCase types, explaining causes through Markov model theory, including context repetition effects, self-reinforcement mechanisms, and high initial probability reinforcement.
- Comprehensive Solution Evaluation: We evaluate multiple solutions: Beam Search as a universal inference-time solution effective for all three BadCases, presence_penalty for BadCase 1, and DPO fine-tuning as a universal model-level solution for all three BadCases, providing a complete solution landscape.
- Task-Specific Applicability Mapping: We identify which solutions are effective for which BadCase types, enabling practitioners to choose the most appropriate approach based on their specific requirements and constraints.
- Critical Parameter Discovery: We identify and validate that early_stopping=True is the decisive factor for Beam Search effectiveness. Our experiments reveal a dramatic difference: early_stopping=True achieves a near-zero repetition rate, while early_stopping=False still exhibits a significant repetition rate. This finding demonstrates that early_stopping is not optional but essential, a critical insight that contradicts common assumptions about Beam Search. We also demonstrate the effectiveness of presence_penalty for BadCase 1.
- DPO Dataset Construction Methodology: We present a systematic approach for constructing DPO preference datasets using generalization and power-of-2 repetition patterns, providing a scalable methodology for addressing repetition at the model level.
- Complete Production Solutions: We provide complete, practical solutions including troubleshooting processes, parameter configuration details, and framework integration considerations that enable immediate application in production environments.
Our experimental results demonstrate that: (1) Beam Search with early_stopping=True completely eliminates repetition problems across all BadCase types, (2) presence_penalty effectively resolves BadCase 1 repetition, and (3) DPO fine-tuning provides a universal model-level solution for all three BadCases. All solutions effectively restore normal processing performance, demonstrating significant improvements over baseline approaches.
Several directions for future research emerge:
- Hybrid Solutions: Explore combinations of different solutions (e.g., Beam Search + presence_penalty) for enhanced effectiveness or reduced computational overhead.
- Adaptive Solution Selection: Develop mechanisms to automatically select the most appropriate solution based on input characteristics or detected repetition patterns.
- Parameter Optimization: Investigate optimal beam width and other parameter settings for different scenarios, tasks, and model sizes. Develop adaptive parameter selection mechanisms.
- Automated Parameter Tuning: Develop automated systems for parameter tuning that can adapt to different models and tasks without manual configuration.
- Framework Robustness: Work toward more robust parameter passing mechanisms between different frameworks to reduce integration challenges.
- Real-Time Repetition Detection: Develop real-time detection mechanisms for repetition problems that can trigger adaptive responses or fallback strategies.
Listing 1: Expected Output Format for Business Rules
This appendix provides detailed DPO preference dataset examples for all three BadCase types discussed in Section 5.3.
Note: For improved readability, the JSON string values in the following examples are formatted with actual line breaks instead of escape sequences (\n). In the actual JSON format used for training, these line breaks are represented as \n escape sequences within single-line strings.
Listing 7: DPO Preference Dataset Example (BadCase 1) Note: The rejected example shows repetition of the business rule statement "设置【个人存款证明凭证信息】的数据更新标志为已更新。" (Set the data update flag of deposit certificate voucher information to updated). Similar to BadCase 2 and 3, we construct four groups of rejected examples with varying degrees of repetition (2, 4, 8, or 16 repetitions), enabling the model to learn to avoid generating repetitive business rule descriptions.
Step 1: Discovery (No LLM)
Step 2: Service Method Processing: For each service method M_i in the transaction, the following sub-steps are executed:
• 2-1 Recursive Method Drilling: Recursively identify service methods called by the current service method. This step requires LLM involvement (1 or more times, depending on recursion depth). Normal processing time: 2-5 seconds per call.
• 2-2 Business Rule Generation: Generate business rules based on service method code. This step requires LLM involvement (1 time per method). Normal processing time: 3-14 seconds.
• 2-3 PlantUML Code Generation: Convert business rules into PlantUML flowchart code. This step requires LLM involvement (1 time per method). Normal processing time: 5-9 seconds.
• 2-4 PlantUML Diagram Generation: Convert PlantUML code to diagrams using JAR tools. This step does not require LLM involvement.
- Transaction-Level Processing: Concatenate business rules from all service methods in order and convert to transaction-level PlantUML flowchart code. This step requires LLM involvement (1 time per transaction). Normal processing time: 5-9 seconds.
Convert transaction PlantUML code to diagrams using JAR tools. This step does not require LLM involvement.

For a transaction T orchestrated from k service methods, T = Orchestrate(M_1, M_2, ..., M_k), where service method i has a recursion depth of D_i, the total number of LLM calls is:
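Assuming recursive drilling of service method i requires D_i LLM calls, and adding one call each for business rule generation and PlantUML code generation (Steps 2-2 and 2-3) plus one transaction-level call, the total would be:

```latex
N_{\mathrm{LLM}}(T) \;=\; 1 + \sum_{i=1}^{k} \left( D_i + 2 \right)
```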
This formula accounts for:
• One call for transaction-level processing (Step 3)
• For each service method i: the recursive drilling calls (one or more, depending on recursion depth D_i), one call for business rule generation, and one call for PlantUML code generation (Steps 2-1 through 2-3)

Under normal operating conditions, the processing time for a single transaction is distributed across different stages as shown in Table 3:
We developed a systematic approach to construct DPO preference datasets:
- Generalization: For examples that exhibit repetition, we first generalize them to create variants that capture the same pattern without directly using the exact bad case. This ensures the model learns generalizable patterns rather than memorizing specific examples.
- Verification: We verify that generalized examples also exhibit repetition problems, confirming that the repetition pattern is not specific to the original bad case.
- Preference Dataset Construction: For each generalized example, we construct preference pairs:
• Chosen: The correct, non-repetitive output (standard answer)
• Rejected: Repetitive outputs with varying degrees of repetition, constructed using powers of 2 (2, 4, 8, and 16 repetitions)
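The construction can be sketched as follows; the prompt, chosen answer, and repeated segment below are placeholders, and the record keys follow the common prompt/chosen/rejected convention for DPO datasets:

```python
# Build power-of-2 rejected examples: for each generalized prompt, keep one
# non-repetitive "chosen" answer and create rejected answers that append the
# problematic segment 2, 4, 8, and 16 times.
def build_preference_pairs(prompt, chosen, repeated_segment):
    pairs = []
    for n in (2, 4, 8, 16):  # powers of 2, matching the Appendix examples
        pairs.append({
            "prompt": prompt,
            "chosen": chosen,
            "rejected": chosen + repeated_segment * n,
        })
    return pairs

pairs = build_preference_pairs(
    prompt="Describe the business rules of this method.",     # placeholder
    chosen="Rule 1: validate the account.\n",                  # placeholder
    repeated_segment="Set the data update flag to updated.\n", # placeholder
)
```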
Table 5 presents the effectiveness of presence penalty for BadCase 1:
The fine-tuning process utilized 4×A100 GPUs with the following configuration:
• Learning rate: 5e-6
• Batch size: 4
• Training epochs: 3
• Framework: LlamaFactory
Table 6 presents the DPO fine-tuning effectiveness across BadCase types.

We also evaluated the impact of different repetition degrees in the training data. Table 7 shows the effectiveness of training with different repetition patterns:
The results indicate that training with higher repetition degrees (8 and 16) or mixed repetition patterns yields the best performance, achieving near-complete elimination of repetition problems.
Proof. Let p_r(t) denote the probability of a repetitive continuation at step t, and p_n(t) denote the probability of a non-repetitive continuation. Given the self-reinforcement effect, we have:
For greedy decoding, once p_r(t) > p_n(t), the algorithm will always select the repetitive continuation, leading to an infinite loop.
For Beam Search with beam width B, at each step we maintain the top-B candidates based on cumulative log-probability:
Even if the repetitive continuation has higher per-step probability, the cumulative score over a longer horizon may favor non-repetitive sequences. Specifically, if there exists a non-repetitive continuation with:
where δ is the initial probability gap, then Beam Search with B ≥ 2 can maintain at least one non-repetitive candidate, provided that:
where ϵ is the desired probability of maintaining a non-repetitive candidate. With early stopping, the algorithm terminates once B complete sequences are found, ensuring that at least one non-repetitive sequence is preserved if it exists in the top-B candidates.
To validate our theoretical model, we compared theoretical predictions with experimental observations. Table 8 presents the comparison:
The comparison demonstrates strong agreement between theoretical predictions and experimental observations, validating our Markov model’s accuracy. The small discrepancies (2-4%) are within expected measurement variance and can be attributed to:
• Model-specific variations in probability distributions
Based on our theoretical model, we can derive the minimum beam width required to escape repetition with high probability. For a given repetition probability p_r and desired escape probability P_escape ≥ 0.95, the minimum beam width is:
For our observed repetition probability of p_r ≈ 0.77, this yields k_min ≈ 3.2, confirming that k = 5 provides a safety margin while maintaining reasonable computational overhead. This theoretical analysis aligns with our empirical finding that beam widths in the range of 3-10 are effective, with k = 5 providing optimal balance.
The measurements demonstrate that with beam width k = 5, Beam Search requires approximately 5× the KV cache memory (62.5 GB vs 12.5 GB) and increases processing time by 18.6% (33.2 min vs 28.0 min) compared to greedy decoding. However, this overhead is minimal compared to the performance degradation from repetition problems (up to 471% time increase from 28 min to 160 min). The overhead can be quantified as: memory overhead ≈ k× (single sequence memory), computational overhead ≈ O(k × N ) where N is the sequence length.
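The quoted figures are internally consistent, as a quick computation shows:

```python
# Sanity check of the overhead numbers quoted above (values from the text).
kv_greedy_gb, kv_beam_gb = 12.5, 62.5   # KV cache memory, greedy vs k = 5
t_greedy_min, t_beam_min = 28.0, 33.2   # batch processing time in minutes

memory_factor = kv_beam_gb / kv_greedy_gb                          # 5.0x
time_overhead = (t_beam_min - t_greedy_min) / t_greedy_min * 100   # ~18.6%
repetition_penalty_pct = (160 - 28) / 28 * 100                     # ~471% worst case
```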
Figure 2 illustrates the performance-overhead tradeoff across different beam widths:
While our experiments show consistent results, the effectiveness may vary depending on the specific model, task, and deployment scenario. Theoretical analysis suggests that the solution is most effective when:
• The probability gap between repetitive and non-repetitive continuations is bounded (as stated in Proposition 2)
Regarding optimal beam width selection: While we used k = 5 based on empirical evidence, the optimal beam width depends on the specific task and computational budget. Larger beam widths provide better exploration but with diminishing returns. Our analysis suggests that beam widths in the range of 3-10 are typically sufficient, with k = 5 providing a good balance between quality and computational cost for most scenarios.
To understand the individual contribution of each parameter, we conducted ablation studies systematically varying key parameters while keeping others fixed. Table 9 presents the results:
The ablation study reveals critical insights:
• Early Stopping is Essential: Regardless of beam width, early_stopping=False fails to eliminate repetition (45-60% repetition rate), while early_stopping=True achieves a 0-5% repetition rate. This confirms that early_stopping is the most critical parameter. (All configurations used temperature=0, top_p=1, top_k=-1.)
• Beam Width Impact: With early_stopping=True, increasing beam width from k = 3 to k = 5 reduces repetition rate from 5% to 0%, but further increase to k = 10 provides no additional benefit while increasing computational overhead.
• Optimal Configuration: The combination of k = 5 and early_stopping=True provides the best balance, achieving 0% repetition rate with moderate computational overhead (+18.6%).
We also conducted ablation studies on presence_penalty values for BadCase 1; Table 10 shows the results. The results indicate that presence_penalty=1.2 provides an optimal balance between repetition elimination and output quality, with higher values (1.5-2.0) causing quality degradation due to over-penalization.
We initially used FastChat integrated with vLLM, but discovered that sampling parameters were not being correctly passed. We implemented a solution using the extra_body approach to correctly pass sampling parameters from FastChat to vLLM, ensuring that all Beam Search parameters were properly propagated to vLLM’s inference engine.
Advantages and Limitations DPO fine-tuning offers several advantages: (1) addresses repetition at the model level through fine-tuning, (2) can fundamentally change model behavior to avoid repetition, (3) effective for all BadCase types, providing a universal model-level solution, and (4) particularly valuable for BadCase 2 and BadCase 3, where presence_penalty fails. However, it has limitations: (1) requires model fine-tuning, which is computationally expensive (15-18 GPU hours per BadCase type), (2) requires careful dataset construction and generalization, and (3) may need periodic retraining as new bad cases emerge.
Table 1 summarizes the applicability of different solutions to each BadCase type:
[... continues with repeated similar rule patterns until max_tokens ...]
Listing 2: BadCase 1: Business Rule Generation Repetition