Keeping Code-Aware LLMs Fresh: Full Refresh, In-Context Deltas, and Incremental Fine-Tuning
📝 Abstract
Modern codebases evolve continuously: files are renamed or deleted; public APIs drift; behavior shifts within otherwise familiar modules. A model trained yesterday to map a developer’s natural-language question to the exact set of repository file paths that matter will degrade tomorrow, even if the questions themselves look unchanged. In this paper we study, at system scale and across several widely used repositories, how to keep such a model fresh without surrendering retention on earlier code. We frame freshness as a form of domain drift between a base snapshot and the current HEAD, and we compare three families of update strategies: (A) Full Refresh, retraining the entire model at the new snapshot; (B) In-Context Learning (ICL) that injects recent deltas (raw git diffs or concise English summaries) at inference; and (C) Incremental Fine-Tuning (Inc-FT) on delta-derived training sets, with carefully controlled NEW:OLD mixing to mitigate catastrophic forgetting. We contribute an alias-aware evaluation protocol that credits rename while never rewarding deleted paths, and a practical Forgetting Probe that quantifies residual emissions of obsolete paths. Across Flask, SQLAlchemy, Pandas, and Poetry, Inc-FT with old-aware mixes delivers the best overall balance on mixed sets, ICL with English delta summaries delivers the fastest new-code lift when training is not feasible, and Full Refresh remains the ceiling when maximum NEW accuracy matters. We also compare Git-diff Inc-FT to full-file Inc-FT, showing that diffs excel in rename/delete-heavy windows while full-file context wins in behavior-change-heavy windows.
📄 Content
Modern engineering teams face a deceptively simple, relentlessly recurrent question: given this natural-language question about the repository, which files should I open first? In production workflows (triaging incidents, scoping changes, reviewing PRs, or steering an autonomous agent), fast, correct file selection is often more valuable than generating a verbose explanation or a speculative patch. The challenge is not building a one-off capable model; it is keeping that model faithful to a moving target. Even modest commit windows induce three distinct kinds of drift that systematically erode performance. Structural drift renames, moves, or deletes paths, so that an answer that was correct yesterday becomes an alias or a tombstone today. Surface churn (typing, formatting, comments) inflates diffs without changing behavior, diluting signal for learning algorithms that treat all changes equally. Behavioral drift modifies logic within otherwise stable files, invalidating previously accurate associations between questions and files. Any update strategy that chases "freshness" risks catastrophic forgetting of reliable historical mappings; any strategy that prioritizes retention risks staleness on the most recent code. Our work is motivated by the need for a principled, repeatable way to maintain both freshness and retention, measured rigorously and achieved with predictable cost and latency.
We operationalize the task as set-valued retrieval. Given a natural-language question q, the model must produce a set of repository-root-relative file paths Ŷ ⊆ P, where P is the finite universe of paths at the evaluation snapshot. This formulation is intentionally conservative: by constraining outputs to verifiable artifacts (exact, existing paths), we minimize hallucinations and make scoring unambiguous. Instead of open-ended text generation, the model performs closed-set, set-valued prediction. Two metrics capture the complementary goals: Exact Match (EM) requires the predicted set to equal the gold set; Micro-averaged Recall (MR) rewards partial recovery on multi-file answers and exposes under- or over-selection behavior. The central engineering question then becomes temporal: how do we maintain high EM and MR on new, old, and mixed question sets as the repository evolves?
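The two metrics are straightforward to state precisely in code. The sketch below is illustrative (the function names are ours, not from the paper): EM compares predicted and gold path sets for equality, while MR pools hits and gold paths across the whole evaluation set before dividing, so large multi-file answers weigh more than singletons.

```python
def exact_match(pred: set, gold: set) -> bool:
    """EM: the predicted path set must equal the gold set exactly."""
    return pred == gold

def micro_recall(examples: list) -> float:
    """Micro-averaged recall over (pred, gold) pairs: pool recovered
    gold paths across all examples, then divide once at the end."""
    hits = sum(len(pred & gold) for pred, gold in examples)
    total = sum(len(gold) for _, gold in examples)
    return hits / total if total else 0.0
```

For example, recovering one of two gold files on one question and the single gold file on another yields MR = 2/3, even though EM is satisfied on only one of the two questions.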
To answer this, we compare three families of update strategies that teams can deploy on a cadence:
(1) Full Refresh @ HEAD. Train a new repository-specific model on the latest snapshot. This yields a clean target and the best attainable “ceiling” when drift is large, but it incurs the highest training cost and wall-clock latency and is ill-suited to frequent, small updates.
(2) In-Context Learning (ICL). Keep model weights fixed and inject changes at inference time.
We study two realizations: diff-only prompting (raw per-file git hunks) and diff→English prompting (compact, three-to-five-sentence English summaries). As the commit distance widens and diffs grow noisy, we consistently observe that English summaries outperform raw diffs by providing higher signal-to-noise in the prompt, at the expense of prompt tokens and context-window pressure.
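A minimal sketch of the ICL setup, under our own assumptions about prompt layout (the function and budget parameter are illustrative, not the paper's implementation): per-file delta summaries are prepended, newest first, under a simple character budget that stands in for context-window pressure, followed by the file-selection question.

```python
def build_icl_prompt(question: str, delta_summaries: list, max_chars: int = 6000) -> str:
    """Inject recent-change summaries (newest first) into the prompt,
    stopping when a crude character budget is exhausted."""
    header = "Recent repository changes:\n"
    lines, used = [], len(header)
    for summary in delta_summaries:
        if used + len(summary) > max_chars:   # stand-in for context-window limits
            break
        lines.append("- " + summary)
        used += len(summary)
    return (header + "\n".join(lines)
            + "\n\nQuestion: " + question
            + "\nAnswer with repository-root-relative file paths only.")
```

Swapping the English summaries for raw git hunks reproduces the diff-only variant; only the injected strings change, not the scaffold.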
(3) Incremental Fine-Tuning (Inc-FT). Continue training on a delta-derived dataset (examples synthesized from files marked Modified/Added (M/A) in the commit window) while mixing in a carefully controlled sample of OLD data to resist forgetting. We show that both the NEW:OLD ratio and the training schedule (learning rate, epochs) are make-or-break: too little OLD and the model forgets; too much OLD and it underfits the delta.
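The NEW:OLD mixing described above can be sketched as a small data-assembly step (a hypothetical helper under our assumptions, not the paper's code): keep every delta-derived NEW example and subsample OLD replay data to hit the target ratio before shuffling.

```python
import random

def build_incft_mix(new_examples: list, old_examples: list,
                    new_to_old=(3, 1), seed: int = 0) -> list:
    """Assemble an Inc-FT training set with a controlled NEW:OLD ratio.
    All NEW (delta-derived) examples are kept; OLD data is subsampled
    as replay to resist catastrophic forgetting."""
    rng = random.Random(seed)
    n_new, n_old = new_to_old
    k = min(len(old_examples), (len(new_examples) * n_old) // n_new)
    mix = list(new_examples) + rng.sample(list(old_examples), k)
    rng.shuffle(mix)
    return mix
```

Sweeping `new_to_old` (e.g., 3:1 vs. 1:1 vs. 1:3) alongside learning rate and epochs is exactly the schedule search the paper flags as make-or-break.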
A key contribution of this paper is an evaluation and data-building protocol that respects structural drift rather than punishing the model for remembering yesterday’s names. We build an alias map from git metadata that records path renames and deletions (e.g., old/path.py → new/path.py or → DELETED). During scoring, we first remap predictions through this alias table so that a prediction of the old name for a renamed file receives credit at the current snapshot, while predictions that point to truly deleted files never receive credit. To probe residual “old memory” explicitly, we also introduce a Forgetting Probe: a diagnostic test set where every gold label underwent a structural change, evaluated without remapping, which quantifies raw emissions of old names. This alias-aware lens lets us separate behavioral learning (did the model internalize changed logic?) from structural housekeeping (did names merely move?), preventing misleading regressions when repositories reorganize.
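The alias-aware scoring rule admits a compact sketch (illustrative names, assuming an alias map already parsed from git rename/delete metadata): predictions are remapped through rename chains before comparison, and anything resolving to a deleted path is dropped so it can never earn credit. The Forgetting Probe is simply the same comparison with remapping switched off.

```python
DELETED = "DELETED"   # sentinel for paths removed in the commit window

def remap(pred_paths: set, alias_map: dict) -> set:
    """Follow rename chains (a.py -> b.py -> c.py) to current names;
    drop predictions that resolve to deleted files."""
    out = set()
    for path in pred_paths:
        seen = set()
        while path in alias_map and path not in seen:
            seen.add(path)                 # guard against cyclic metadata
            path = alias_map[path]
        if path != DELETED:
            out.add(path)
    return out

def alias_aware_em(pred: set, gold: set, alias_map: dict) -> bool:
    """Credit old names for renamed files; never credit deleted paths."""
    return remap(pred, alias_map) == set(gold)
```

Running `alias_aware_em` with an empty alias map recovers strict scoring, which is the Forgetting Probe's view: any emission of an old name counts against the model.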
Our empirical study spans Flask, SQLAlchemy, Pandas, and Poetry, and it proceeds in two phases. First, across repositories we establish the trade-offs between Full Refresh, ICL (diff vs. English), and Inc-FT under mixed, new-only, and old-only evaluations. Second, we conduct targeted case studies that answer practical questions engineers ask when choosing an update path. In Poetry, we introduce and validate the alias-aware protocol alongside an index
This content is AI-processed based on ArXiv data.