Diffusion-Inspired Masked Fine-Tuning for Knowledge Injection in Autoregressive LLMs

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.


Large language models (LLMs) are often used in environments where facts evolve, yet factual knowledge updates via fine-tuning on unstructured text often suffer from 1) reliance on compute-heavy paraphrase augmentation and 2) the reversal curse. Recent studies show diffusion large language models (dLLMs) require fewer training samples to achieve lower loss in pre-training and are more resistant to the reversal curse, suggesting dLLMs may learn new knowledge more easily than autoregressive LLMs (arLLMs). We test this hypothesis in controlled knowledge fine-tuning experiments and find that while arLLMs rely on paraphrase augmentation to generalize knowledge text into question-answering (QA) capability, dLLMs do not require paraphrases to achieve high QA accuracy. To further investigate whether the demasking objective alone can induce such a knowledge injection advantage in dLLMs, regardless of their diffusion denoising paradigm, we propose masked fine-tuning for arLLMs, which prompts an arLLM to reconstruct the original text given a masked version in context. Masked fine-tuning substantially improves the efficacy of knowledge injection for arLLMs, i.e., it requires no paraphrases and resists the reversal curse, closing the gap between arLLMs and dLLMs. We also demonstrate that the same demasking objective improves supervised fine-tuning (SFT) on math tasks over standard SFT, suggesting broader applicability of the demasking objective.


💡 Research Summary

The paper investigates how large language models (LLMs) can be efficiently updated with new factual knowledge, focusing on two families of models: autoregressive LLMs (arLLMs) such as Llama and Qwen, and diffusion‑style masked language models (dLLMs) such as LLaDA. Existing work shows that arLLMs struggle to turn raw knowledge documents into reliable question‑answering (QA) ability unless the fine‑tuning data is heavily augmented with paraphrases, and they suffer from the “reversal curse” – the inability to answer a fact when the subject and object are swapped (e.g., answering “B is A” after learning “A is B”). In contrast, recent dLLMs, trained with a bidirectional demasking objective, appear more sample‑efficient and less prone to the reversal curse.

The authors formulate two hypotheses. Hypothesis 1 claims that dLLMs can generalize from a single knowledge document to both forward and backward QA without paraphrase augmentation. Hypothesis 2 posits that this advantage stems primarily from the demasking training objective rather than architectural or decoding differences, implying that applying a demasking‑style objective to arLLMs should close the performance gap.

To test these ideas, the authors conduct controlled fine‑tuning experiments on three datasets: (1) NameDescription (synthetic statements of the form “A is B” and “B is A”), (2) Biography (short fictional biographies with forward and backward queries), and (3) a newly built Wiki set consisting of 94 recent Wikipedia articles (2025 events) with both same‑order and permuted‑order paraphrases. For each dataset they evaluate forward (subject‑to‑object) and backward (object‑to‑subject) QA using ROUGE‑1 overlap as an accuracy proxy.
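The ROUGE-1 accuracy proxy can be sketched as follows. This is a minimal illustration of unigram-overlap scoring, not the paper's exact evaluation code; the 0.5 correctness threshold in `is_correct` is an assumption for illustration only.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a model answer and the gold answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Clipped unigram overlap: each reference token counts at most once.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_correct(prediction: str, reference: str, threshold: float = 0.5) -> bool:
    # The threshold is an illustrative assumption, not the paper's criterion.
    return rouge1_f1(prediction, reference) >= threshold
```

In practice, an overlap-based proxy like this tolerates minor surface variation in generated answers (articles, word order) while still penalizing answers that name the wrong entity.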

Results confirm Hypothesis 1: the dLLM (LLaDA‑8B‑Instruct) achieves >90 % accuracy on both forward and backward questions across all datasets without any paraphrase augmentation. In contrast, arLLMs fine‑tuned on raw documents alone perform well on forward queries but collapse on backward queries (often <2 % accuracy), demonstrating a clear reversal curse.

To probe Hypothesis 2, the authors introduce Masked Fine‑Tuning for arLLMs. During fine‑tuning each knowledge document is randomly masked (varying mask ratios and positions) and the model is prompted to reconstruct the original text. By sampling different masks across steps, a single document yields many distinct conditioning patterns, mimicking the implicit data augmentation that dLLMs receive from their denoising process, while preserving the decoder‑only architecture.
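The masking procedure described above can be sketched as a simple data-construction step. This is a hedged illustration: the mask token, prompt wording, and tokenization granularity (word-level here, rather than the model's subword vocabulary) are assumptions, not the paper's exact recipe.

```python
import random

MASK = "<mask>"  # placeholder; the paper's actual mask token is an assumption


def make_masked_example(document: str, mask_ratio: float, rng: random.Random):
    """Build one (prompt, target) pair: the model sees a partially masked
    document in context and is trained to reconstruct the original text."""
    tokens = document.split()
    n_mask = max(1, int(len(tokens) * mask_ratio))
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    masked = " ".join(MASK if i in hidden else t for i, t in enumerate(tokens))
    prompt = ("Reconstruct the original text from the masked version:\n"
              f"{masked}\nOriginal text:\n")
    return prompt, document


# Resampling the mask at each training step turns one document into many
# distinct conditioning patterns -- the implicit augmentation dLLMs get
# from their denoising objective.
rng = random.Random(0)
examples = [make_masked_example("Paris is the capital of France.", 0.3, rng)
            for _ in range(3)]
```

Because only the input context changes while the target stays fixed, this fits an unmodified decoder-only training loop: the (prompt, target) pairs can be fed to any standard causal-LM fine-tuning setup.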

Masked Fine‑Tuning yields dramatic improvements. Without any paraphrases, arLLMs now reach forward accuracies of 65‑99 % and backward accuracies of 90‑99 %, effectively eliminating the reversal curse. When paraphrases are added, performance matches or exceeds that of dLLMs, narrowing the gap to a negligible level. Table 1 shows, for example, Llama‑8B with “Masked + no paraphrase” achieving 65 % forward and 95 % backward accuracy, compared to 37 % forward and 0 % backward for standard fine‑tuning. The method also outperforms a baseline “reverse‑training” approach that explicitly reorders entities.

To assess whether the demasking objective is beneficial beyond factual knowledge, the authors adapt it to supervised fine‑tuning (SFT) on two math datasets (MATH‑QA and GSM‑8K). They mask random parts of the solution steps and train the model to reconstruct them. Masked SFT consistently outperforms standard SFT by 3‑5 percentage points in accuracy and improves ROUGE‑1 scores, especially on problems requiring multi‑step reasoning, suggesting that demasking promotes more robust reasoning abilities.
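The masked-SFT adaptation might look like the following sketch, where the question is kept intact and only solution tokens are masked. The prompt format and word-level masking are illustrative assumptions, not the authors' exact implementation.

```python
import random

MASK = "<mask>"  # illustrative mask token, not the paper's exact choice


def make_masked_sft_example(question: str, solution: str,
                            mask_ratio: float, rng: random.Random):
    """Masked SFT: the question stays visible; random solution tokens are
    masked, and the model is trained to emit the full original solution."""
    tokens = solution.split()
    n_mask = max(1, int(len(tokens) * mask_ratio))
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    masked_solution = " ".join(MASK if i in hidden else t
                               for i, t in enumerate(tokens))
    prompt = (f"Question: {question}\n"
              f"Partially masked solution: {masked_solution}\n"
              f"Full solution:")
    return prompt, solution
```

Masking intermediate steps forces the model to attend to both earlier and later parts of a solution when reconstructing it, which is one plausible mechanism for the multi-step reasoning gains the authors report.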

Key contributions are: (1) empirical validation that dLLMs can acquire new knowledge efficiently without paraphrase augmentation; (2) identification of the demasking objective as the core factor mitigating the reversal curse; (3) a simple, architecture‑agnostic masked fine‑tuning recipe that brings arLLMs to dLLM‑level performance; and (4) demonstration of broader applicability to procedural tasks such as math problem solving.

The paper acknowledges limitations: experiments are limited to models ≤8 B parameters, leaving open whether the same gains hold for 70 B‑scale models; hyper‑parameters of the masking process (ratio, token selection) are not exhaustively explored; and real‑world deployment costs (memory, latency) of repeated masking are not quantified. Future work could explore scaling to larger models, dynamic mask‑policy learning, integration into continuous knowledge‑update pipelines, and extension to other domains like code or clinical notes.

In summary, the study shows that the demasking training paradigm—originally a hallmark of diffusion‑style masked language models—can be transplanted to autoregressive LLMs to achieve efficient, paraphrase‑free knowledge injection and to overcome the longstanding reversal curse, with promising implications for continual learning and downstream reasoning tasks.
