Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG



Seungju Han¹, Konwoo Kim¹, Chanwoo Park², Benjamin Newman³, Suhas Kotha¹, Jaehun Jung³, James Zou¹, Yejin Choi¹
¹Stanford University  ²MIT  ³University of Washington

Abstract

Synthetic data augmentation helps language models learn new knowledge in data-constrained domains. However, naively scaling existing synthetic data methods by training on more synthetic tokens or using stronger generators yields diminishing returns below the performance of RAG. To break the RAG ceiling, we introduce Synthetic Mixed Training, which combines synthetic QAs and synthetic documents. This leverages their complementary training signals and enables log-linear improvements as both synthetic data volume and generator strength increase, allowing the model to outperform RAG by a 2.6% relative gain on QuaLITY, a long-document reading comprehension benchmark. In addition, we introduce Focal Rewriting, a simple technique for synthetic document generation that explicitly conditions document generation on specific questions, improving the diversity of synthetic documents and yielding a steeper log-linear scaling curve. On QuaLITY, our final recipe trains a Llama 8B model that outperforms RAG by 4.4% relative. Across models and benchmarks (QuaLITY, LongHealth, FinanceBench), our training enables models to beat RAG in five of six settings, outperforming RAG by 2.6% on average, and achieves a 9.1% gain when combined with RAG.

1 Introduction

Language models fail to internalize all knowledge during pretraining, so recent studies have investigated whether domain-specific fine-tuning can improve knowledge learning. They report that retrieval-augmented generation (RAG)—the de facto approach for data-constrained domains—sets a strong upper bound that is difficult to surpass [Ovadia et al., 2024, Soudani et al., 2024].
This is because incorporating new knowledge into language model parameters is challenging in data-constrained settings. One common approach is to perform continued pretraining on synthetic data generated from domain-specific source documents. However, the vast majority of studies have found only limited success [Yang et al., 2025b, Lin et al., 2025a, Caccia et al., 2025, Eyuboglu et al., 2025, Lampinen et al., 2025]. While it is natural to attribute this failure to the quality of the synthetic data generators, prior work has observed that stronger generators yield diminishing returns [Lin et al., 2025a, Maini et al., 2025, Kang et al., 2025, Maini et al., 2024, Niklaus et al., 2026]. In this work, we address these issues by answering the question:

How can we design synthetic data recipes for knowledge learning that scale better with the number of synthetic tokens and with stronger generators?

First, we investigate existing data generation algorithms and find that they do not scale well. We experiment with four existing data generation algorithms—one for generating synthetic QA pairs and three for generating synthetic documents—and use Llama 3.1 8B and 70B models to generate up to 700M synthetic tokens for training an 8B model. We find that, on a reading comprehension benchmark requiring the acquisition of new knowledge (QuaLITY; Pang et al. [2022]), existing training recipes based on synthetic data are insufficient to train a model that outperforms RAG, even when the data is generated by a 70B model that is much stronger than the 8B model being trained. In particular, we find that training on synthetic QAs performs better than training on synthetic documents for a fixed generator. However, the gains from scaling the generator depend on the choice of data generation algorithm: document generation benefits more from generator scaling.
Still, all methods exhibit diminishing returns as the number of synthetic tokens increases and remain 4.6% behind RAG in relative accuracy.

Figure 1: Naively scaling synthetic data plateaus, but our simple methods allow effective scaling and surpass RAG. We evaluate four synthetic data generation strategies using both 8B and 70B generators, scaling training data up to 700M tokens. Across all four baselines, performance saturates and remains below RAG, showing that simply increasing synthetic data or compute is insufficient. In contrast, our two simple techniques—Synthetic Mixed Training and Focal Rewriting—exhibit clear log-linear scaling with both more data and a stronger generator, ultimately surpassing RAG.

Building on the observation that synthetic QAs and documents have different scaling properties with respect to data and generator strength, we hypothesize that QA and document data each provide unique benefits during training. To achieve the best of both worlds, we propose Synthetic Mixed Training, which combines synthetic QAs with synthetic documents during training. This substantially improves synthetic token efficiency, exhibits clear log-linear scaling behavior up to 700M synthetic tokens, and enables the model to surpass RAG by a 2.6% relative gain.
In addition, we introduce Focal Rewriting, a simple technique for diversifying the topics covered by synthetic documents by explicitly conditioning document generation on synthetic questions about the source document. This increases the lexical and semantic diversity of the synthetic documents and further improves model performance, yielding a 4.4% relative accuracy gain over RAG. We show that our recipe generalizes well across different base models, including Qwen3 models ranging from 1.7B to 14B parameters, and across three benchmarks: QuaLITY, LongHealth, and FinanceBench. In particular, our recipe enables models to outperform RAG in five of six setups, yielding an average relative accuracy gain of 2.6% over vanilla RAG. Our approach is also complementary to RAG, providing a 9.1% relative accuracy improvement over vanilla RAG. These results suggest an untapped potential of synthetic data in enhancing internalization of new knowledge for language models.

2 Existing synthetic data recipes plateau when scaled

Because the community lacks a holistic comparison of diverse synthetic data augmentation strategies, we present one here at scale in a controlled setup. Specifically, we present experimental results for training an 8B model on variants of synthetic data derived from documents in the QuaLITY benchmark [Pang et al., 2022]. To study how performance changes with data scale, we vary the number of synthetic tokens across runs, scaling up to 700M tokens.

Setup. We train the Llama 3.1 8B Instruct model [Grattafiori et al., 2024] on the synthetic data using a fixed set of hyperparameters (except for the learning rate; we search across three LRs—5e-6, 1e-6, and 5e-5—and report the best accuracy), varying only the method used to generate the synthetic data. Similar to Yang et al. [2025b], we use FineWeb [Penedo et al., 2024] as replay data, with a mixing ratio of 10%.
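The 10% replay mixture can be built with a simple sampling loop. The sketch below is our own illustration (example-level mixing via a hypothetical `mix_with_replay` helper; the paper does not specify its exact batching), but it shows how a target replay ratio can be enforced in expectation:

```python
import random

def mix_with_replay(synthetic, replay, replay_ratio=0.1, seed=0):
    """Build a training list where roughly `replay_ratio` of examples come
    from a replay corpus (FineWeb in the paper's setup) and the rest are
    synthetic examples. Replay is sampled with replacement, so a small
    replay pool can cover a large synthetic set."""
    rng = random.Random(seed)
    mixed = []
    # Adding replay with probability p/(1-p) per synthetic example makes the
    # expected replay fraction of the final stream equal to p.
    p = replay_ratio / (1.0 - replay_ratio)
    for ex in synthetic:
        mixed.append(ex)
        if rng.random() < p:
            mixed.append(rng.choice(replay))
    rng.shuffle(mixed)
    return mixed
```

With `replay_ratio=0.1`, about one in ten examples of the shuffled stream comes from the replay corpus, matching the 10% mixing ratio described above.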
For evaluation, we use the QuaLITY multiple-choice QA set with a zero-shot instruction that asks the LM to provide a short explanation followed by the final answer. See Appendix B and C for more details.

Figure 2: (Left) Comparing the data scaling of existing methods: self-generated (8B) synthetic QAs and synthetic documents. This shows QuaLITY accuracy as a function of the number of synthetic training tokens; shaded areas indicate the standard deviation corresponding to the 95% confidence interval, estimated from n = 8 inference runs. We use Llama 3.1 8B Inst for both data generation and model training. AR indicates Active Reading [Lin et al., 2025a], EG indicates EntiGraph [Yang et al., 2025b], and WRAP indicates rephrasing [Maini et al., 2024]. On QuaLITY, synth QA is substantially more efficient than all existing methods that generate synthetic documents. (Right) Scaling the generator to improve synthetic token efficiency. (1) Scaling the generator to 70B does not improve synthetic token efficiency for QA, yielding only a 0.1% gain over the 8B generator at 88M tokens. (2) In contrast, document-based methods do benefit from scaling the generator, achieving a 4.5% gain on average. (3) For all methods, even with a stronger generator, data scaling plateaus.

Synthetic document generation. We test four data generation methods. All of these methods take original source documents as input and generate data grounded in them. For synthetic document generation, we test three algorithms. The first is rephrasing (WRAP; Maini et al. [2024]), which rephrases documents using an LM. The second is EntiGraph (EG; Yang et al. [2025b]), which uses a two-stage approach: it first extracts core entities from the document using an LM, and then constructs documents describing the relationships between those entities using an LM again. The last is Active Reading (AR; Lin et al. [2025a]), which also uses a two-stage approach: first, an LM generates strategies to rephrase the document, which are then incorporated into the instruction to guide the LM in rewriting the document.

Synthetic QA generation. For synthetic QA generation, we first generate diverse QA pairs using an LM with a simple instruction, and then use an LM again to produce a response with a short explanation for each question. This is similar to the QA generation approaches of Lin et al. [2025a] and Yang et al. [2025b], except that we explicitly instruct the LMs to generate explanations in the responses. We generate open-ended QA pairs rather than multiple-choice QA (MCQA) pairs, since generating difficult yet faithful distractors for MCQA is nontrivial. See Appendix D for more details.

2.1 Scaling self-generated synthetic data has diminishing returns

Figure 2 (left) shows the evaluation results on QuaLITY when we train the model on synthetic data generated by the same 8B model (i.e., self-generated data).
Training on synthetic QAs is more synthetic-token efficient than training on any of the three synthetic document variants, while scaling synthetic documents begins to plateau. This was unexpected, as prior work found a different empirical result when training an 8B model with 8B-generated data [Lin et al., 2025a]. We speculate that the difference arises because their synthetic QA pairs contain only short answers without explanations: we find that training the model on responses containing only the answer, without an explanation (i.e., using the answers from the initially generated QA pairs), leads to poor accuracy. However, when we scale synthetic QAs up to 350M tokens, they also begin to show signs of plateauing.

2.2 Strategy matters when scaling the generator to improve synthetic token efficiency

We further explore a natural direction for improving synthetic token efficiency: scaling the generator model (using Llama 3.1 70B Instruct) to produce higher-quality synthetic data. Figure 2 (right) shows the results of generator scaling for synthetic data generation. For synthetic QA generation, a stronger generator does not yield meaningful improvements in synthetic token efficiency, and performance also saturates as we scale the number of synthetic QA tokens. In contrast, for document generation, a stronger generator improves synthetic token efficiency: it closes the gap between training on synthetic QAs and training on synthetic documents.
Figure 3: Mixing synthetic documents does not help. Mixing different kinds of synthetic documents (blue line) provides a minimal gain over just using AR documents (pink line).

Figure 4: Synthetic Mixed Training breaks the RAG ceiling. We combine 70B-generated synthetic QAs and AR documents at a 1:1 ratio, attempting to achieve the best of both worlds. This (sky-blue line) yields performance comparable to RAG at 350M synthetic training tokens and ultimately surpasses RAG when scaled to 700M tokens.

The choice of data generation algorithm also matters greatly: AR shows the best synthetic token efficiency among the variants when using the 70B generator, whereas WRAP shows the worst synthetic token efficiency relative to its performance with the 8B generator. However, 70B-generated AR documents also show saturating performance when scaled up to 700M tokens. These results suggest that we should think carefully about which generation procedures are likely to improve with more capable generators, as not every data generation algorithm benefits from a stronger generator model.
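The synthetic QA procedure described in Section 2 (generate open-ended questions from a source document, then answer each with a short explanation) can be sketched as a two-stage pipeline. The prompt wording and the `generate`/`make_qa_data` helpers below are our own illustrative stand-ins, not the paper's prompts (those are in Appendix D):

```python
# Stage 1: ask an LM for open-ended questions grounded in the document.
QA_PROMPT = (
    "Read the document below and write {n} diverse, open-ended questions "
    "about it, one per line.\n\nDocument:\n{document}"
)
# Stage 2: ask an LM to answer with a short explanation before the answer.
ANSWER_PROMPT = (
    "Document:\n{document}\n\nQuestion: {question}\n"
    "Give a short explanation, then state the final answer."
)

def generate(prompt: str) -> str:
    # Placeholder for any chat-completion call (e.g., a Llama 3.1 client).
    raise NotImplementedError("plug in an LM client here")

def make_qa_data(document: str, n: int = 5, llm=generate):
    """Two-stage synthetic QA generation: questions first, then responses
    that contain an explanation (the detail Section 2.1 finds important)."""
    questions = llm(QA_PROMPT.format(n=n, document=document)).splitlines()
    questions = [q.strip() for q in questions if q.strip()]
    return [
        {"question": q,
         "response": llm(ANSWER_PROMPT.format(document=document, question=q))}
        for q in questions
    ]
```

Training targets are the full responses, so the explanation text is part of what the model learns to produce.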
Prior synthetic data work reports related empirical findings that stronger generators do not always provide better synthetic data (without a clear explanation): Lin et al. [2025a] show that using 70B-generated data to train an 8B model underperforms using 8B-generated data; Guha et al. [2025] show that a weaker, smaller generator (QwQ-32B) can produce better synthetic data for math and code than a stronger, larger generator (DeepSeek-R1); and Maini et al. [2025] observe saturating returns when scaling the generator from a 3B to an 8B model for document rephrasing in pretraining setups. Kang et al. [2025] show that using a 70B model does not yield better performance when rephrasing pre-training data, and more recently, Niklaus et al. [2026] show that a 1B generator is sufficient for rephrasing, supported by extensive experimental results.

3 Unlocking synthetic data scaling

Building on our empirical findings, we present two simple methods for improving the efficiency of synthetic data and overcoming the limitations of existing synthetic data scaling. Our methods enable clear log-linear scaling up to 700M training tokens, and our trained 8B model substantially outperforms RAG on QuaLITY, achieving a 4.4% relative accuracy gain. In addition, we find that these methods also work well with different base models (Qwen3 1.7B–14B) and additional benchmarks (LongHealth, FinanceBench).

Figure 5: Synthetic Mixed Training with a mixture of domains. Here, the x-axis denotes the number of synthetic tokens grounded in the QuaLITY dataset.
(1) Mixing 50% synthetic QAs grounded in a different domain with 50% synthetic documents grounded in the target domain yields a better scaling curve than training solely on target-domain synthetic documents. (2) The best performance comes from mixing target-domain synthetic QAs with target-domain synthetic documents, suggesting that synthetic QAs not only teach recall behavior but also provide domain-specific knowledge.

3.1 Synthetic Mixed Training: Mixing synthetic QAs and synthetic documents

Since we have shown that synthetic data strategies have different scaling properties, we investigate whether we can design a synthetic data recipe that achieves the best of all worlds. We hypothesize that synthetic QAs and synthetic documents play different roles: synthetic QAs primarily teach behavioral knowledge (e.g., how to recall facts through chain-of-thought reasoning), which can transfer across domains, whereas synthetic documents primarily teach factual knowledge, which is more domain-specific. To validate this hypothesis, we test mixing all three types of synthetic documents during training; Figure 3 shows that the synthetic document types are similar and provide minimal improvement when mixed. Additionally, we measure the similarity between data points in gradient space [Jung et al., 2025] to quantify how similar synthetic QAs and documents from different domains are; see Appendix E for this analysis.

Based on this hypothesis, we explore training models on a mixture of synthetic QAs and synthetic documents. We call this approach Synthetic Mixed Training.¹ We choose AR for document generation because it benefits the most from generator scaling, and we mix QAs and AR documents at a 1:1 ratio. Figure 4 shows the results of Synthetic Mixed Training using 70B-generated QAs and AR documents, compared with training only on synthetic QAs or only on AR documents.
The scaling curve exhibits persistent log-linear behavior as the number of training tokens increases up to 700M, eventually surpassing RAG with 67.0% accuracy (+2.6% relative gain). Moreover, when comparing results using 70B- and 8B-generated data for Synthetic Mixed Training, we find that a stronger generator is substantially more helpful. This suggests that QA simply dominates AR at the 8B-generator scale, highlighting AR's unique benefit from generator scaling.

We further test whether target-domain synthetic QAs are important by mixing synthetic QAs from a different dataset (LongHealth, an unrelated domain) with synthetic AR documents from QuaLITY (the target domain). Figure 5 shows two findings. (1) Synthetic QAs can teach domain-agnostic behavior that is difficult to learn from AR documents alone: mixing LongHealth synthetic QAs with QuaLITY synthetic AR documents improves synthetic token efficiency on QuaLITY compared to using only QuaLITY synthetic AR documents. This supports our hypothesis and helps explain the synergy between synthetic QAs and AR documents. (2) Synthetic QAs also teach domain-specific factual knowledge: the best-performing recipe mixes both synthetic QAs and synthetic AR documents from QuaLITY, outperforming the mixture of LongHealth synthetic QAs and QuaLITY synthetic AR documents. This suggests that target-domain synthetic QAs provide not only transferable behavior but also domain-specific knowledge.

¹ This name is inspired by Mixed Training, introduced by Allen-Zhu and Li [2024]. Their approach is designed for pretraining from scratch and does not consider synthetic documents when mixing.
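The 1:1 mix in Synthetic Mixed Training can be sketched as a token-budgeted selection from two streams. Whether the paper enforces the ratio by tokens or by examples is our simplifying assumption here, and the whitespace word count stands in for a real tokenizer; helper names like `mixed_training_data` are illustrative:

```python
def take_tokens(examples, token_budget, count_tokens):
    """Greedily take examples until their total token count reaches the budget."""
    taken, used = [], 0
    for ex in examples:
        if used >= token_budget:
            break
        taken.append(ex)
        used += count_tokens(ex)
    return taken, used

def mixed_training_data(qa_examples, doc_examples, total_tokens,
                        count_tokens=lambda ex: len(ex.split())):
    """Split the total token budget 1:1 between synthetic QAs and
    synthetic (AR) documents. Shuffle the result before training."""
    qa, _ = take_tokens(qa_examples, total_tokens // 2, count_tokens)
    docs, _ = take_tokens(doc_examples, total_tokens // 2, count_tokens)
    return qa + docs
```

The same two-stream structure makes it easy to test the cross-domain variants in Figure 5 by swapping in QAs from another dataset.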
Figure 6: Scaling synthetic document generation with Focal Rewriting. We apply Focal Rewriting to AR when generating synthetic documents. Synthetic documents generated with Focal Rewriting (purple line) exhibit better scaling behavior than those generated without it (sky-blue line), as shown by the steeper slope of the fitted scaling curve. All data are generated using the 70B model.

Dataset        #docs   #avg tokens/doc   #eval   Eval type   Domain
QuaLITY          265                6K    4609   MCQA        Fictional stories
LongHealth       400               12K     400   MCQA        Medical
FinanceBench    1367               16K     150   Free-form   Finance

Table 1: Dataset statistics. #docs indicates the number of source documents used for data generation, and #avg tokens/doc indicates the average number of tokens per source document. #eval indicates the number of QA sets used in evaluation. For FinanceBench, we use Qwen3-14B to judge the model-generated answer using the gold answer as a reference.

3.2 Focal Rewriting: Improving the topic diversity of synthetic documents

In Section 2, we show that scaling synthetic documents yields diminishing returns, which we hypothesize is due to limited diversity in the generated data. In particular, WRAP and AR produce stylistically diverse documents, but the generated documents often cover highly similar topics. This is because these approaches do not explicitly condition the LM on specific topics; instead, the model implicitly decides what to focus on, leading to repeated or overlapping topics across generations.
In contrast, EG generates documents covering more diverse topics by explicitly conditioning on different entities, but the resulting documents tend to be stylistically similar. We suspect that this limited diversity degrades performance when synthetic data is scaled extensively, for example, when the number of synthetic tokens exceeds that of the original documents by more than 100 times.

Based on this observation, we introduce Focal Rewriting, which diversifies both the content and the style of generated documents. When rewriting documents with AR (or WRAP), we explicitly condition generation on a specific question, asking for a document that would be useful for answering that query. This technique can be implemented by simply adding the clause "Focus on the question {{ query }}" to the generation instruction (see Appendix D for the prompts). We use the questions generated by an LM as in Section 2. As shown in Appendix F, applying Focal Rewriting to AR produces documents with greater lexical and semantic diversity.

Figure 6 shows the results of scaling synthetic data using AR with Focal Rewriting. We find that under Synthetic Mixed Training, accuracy follows a log-linear relationship with the number of synthetic tokens. Accordingly, we fit a log-linear curve and plot it alongside the empirical results. The results show that Focal Rewriting yields a steeper log-linear scaling curve and achieves higher accuracy when data is scaled extensively (beyond 175M tokens; 100× more tokens than the original data).

3.3 Testing on different models and benchmarks

We additionally verify our recipe on another base model (Qwen3 8B; Yang et al. [2025a]) and two additional benchmarks (LongHealth; Adams et al. [2025] and FinanceBench; Islam et al. [2023]) that require learning new knowledge. Table 1 shows the key statistics of the datasets.
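Since Focal Rewriting (Section 3.2) reduces to a one-clause prompt edit, it is easy to sketch. `BASE_INSTRUCTION` below is a stand-in for an AR/WRAP rewriting prompt rather than the paper's exact wording (the real prompts are in Appendix D); only the focus clause follows the paper:

```python
# Stand-in for a document-rewriting instruction (AR or WRAP style).
BASE_INSTRUCTION = "Rewrite the document below in your own words.\n{document}"

def focal_instruction(document: str, query: str) -> str:
    """Append the Focal Rewriting clause, conditioning the rewrite on a
    specific generated question."""
    focus = f"Focus on the question {query}"
    return BASE_INSTRUCTION.format(document=document) + "\n" + focus

def focal_rewriting_prompts(document: str, queries):
    # One rewriting prompt per question, so different rewrites of the same
    # document target different topics.
    return [focal_instruction(document, q) for q in queries]
```

Each document then yields as many rewriting prompts as there are generated questions, which is where the extra topical diversity comes from.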
In particular, FinanceBench provides source documents in PDF format, so we use olmOCR-2-7B-1025 [Poznanski et al., 2025] to preprocess them into Markdown and use them as source documents.

Figure 7: Training Qwen3 8B and Llama 3.1 8B Instruct using synthetic data (generated with the 70B model) across three benchmarks (QuaLITY, LongHealth, FinanceBench). Using our recipe (purple line) enables beating RAG in 5/6 of the setups.

Figure 7 shows the results. On QuaLITY and LongHealth, our training recipe enables the models to outperform RAG, and on FinanceBench, our methods allow the Llama 8B model to outperform RAG. On average, our method gives a 2.6% relative accuracy gain compared to RAG. Our method also outperforms training recipes that scale only synthetic QAs or only synthetic AR documents.

3.4 Training larger models is more synthetic-token efficient

We train four Qwen3 models of different sizes (1.7B, 4B, 8B, 14B) to study how model size affects scaling behavior. We keep all configurations (including the learning rate) the same as in the previous experiments. Figure 8 shows that these models all exhibit log-linear scaling when tokens are scaled up to 88M, and larger models are more synthetic-token efficient, achieving RAG-level performance with fewer synthetic tokens according to the fitted log-linear curve. For instance, the 14B model requires 102× more synthetic tokens than the original data, whereas the 1.7B model requires 813× more. This result is intuitive, as larger models have greater capacity to store knowledge [Morris et al., 2025].

3.5 Our training recipe enhances RAG

We test whether our trained model can be improved further when paired with retrieval augmentation. As shown in Table 2, pairing our trained model with RAG yields clear improvement and significantly outperforms the vanilla RAG baselines.
On average, our trained models combined with RAG provide a 9.1% relative gain compared to RAG. This suggests that domain-specific training with synthetic data augmentation can be helpful even when RAG is used.

Figure 8: Training different-sized models from the Qwen3 family on the synthetic QuaLITY dataset. We train using our best recipe: Synthetic Mixed Training with synthetic QAs and Focal Rewriting AR documents, using the 70B generator. All models exhibit log-linear scaling behavior, and larger models can match RAG performance with fewer synthetic tokens. Based on the fitted curves, we observe that the 14B model requires 102×, the 8B model 142×, the 4B model 177×, and the 1.7B model 813× more synthetic tokens than the original token count (shown as #synth tokens/#original tokens).

Benchmark     Model  Ours   Ours + RAG  Vanilla RAG    ∆
QuaLITY       Llama  68.2%  69.7%       65.3%        +4.4
              Qwen   70.1%  73.6%       69.2%        +4.4
LongHealth    Llama  71.2%  80.3%       70.0%       +10.3
              Qwen   76.7%  82.3%       73.7%        +8.6
FinanceBench  Llama  52.6%  54.3%       48.0%        +6.3
              Qwen   55.5%  60.2%       58.8%        +1.4

Table 2: Our training complements retrieval augmentation. "Ours" denotes the trained model without RAG, "Ours + RAG" denotes the same trained model with RAG, and "Vanilla RAG" denotes the RAG baseline. ∆ denotes the absolute improvement of "Ours + RAG" over "Vanilla RAG". Averaged across all settings, our training improves over Vanilla RAG by 5.9 points.
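The token multipliers reported in Section 3.4 and Figure 8 (e.g., 102× for the 14B model) are read off fitted log-linear curves. A minimal sketch of that fit, assuming the form accuracy = a + b · ln(tokens) used throughout the paper:

```python
import math

def fit_log_linear(token_counts, accuracies):
    """Closed-form least-squares fit of accuracy = a + b * ln(tokens)."""
    xs = [math.log(t) for t in token_counts]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(accuracies) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, accuracies))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def tokens_to_match(target_acc, a, b):
    """Tokens at which the fitted curve reaches a target (e.g., RAG) accuracy."""
    return math.exp((target_acc - a) / b)
```

Dividing `tokens_to_match(rag_acc, a, b)` by the original corpus token count gives the #synth tokens/#original tokens multiplier shown in Figure 8; the function names here are our own.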
4 Related Work

Training with synthetic data. Training language models with synthetic data at scale has become an important practice [Blakeman et al., 2025, Yang et al., 2025a, Abdin et al., 2024]. At the pre-training stage, language models are often used to rewrite original data, such as web text. For example, Maini et al. [2024] use language models to rephrase original documents, while Nguyen et al. [2025] use language models to reason before rephrasing documents, resulting in higher-quality rewrites. More recently, Yang et al. [2025c] propose to train a language model to generate new documents conditioned on an input document. In continued pre-training settings, where language models are further trained on domain-specific data after pre-training [Gururangan et al., 2020], more aggressive forms of data augmentation are often used because these settings are typically data-constrained and do not provide enough data for scaling [Muennighoff et al., 2023, Kim et al., 2025]. For these settings, improving the diversity of generated data is important: to this end, Yang et al. [2025b] use language models to extract core entities from the documents and then generate synthetic documents that describe relations between entities in the original documents, thereby improving the diversity of the synthetic data. They empirically show that accuracy improves in a log-linear trend as the number of synthetic data tokens increases. Similarly, Lin et al. [2025a] use language models to diversify rewriting strategies. In a more domain-specific direction, Ruan et al. [2025] use reasoning traces produced by language models to capture the underlying thought processes related to the original documents, and show that these traces are helpful for continued training in math.

Analyzing knowledge of language models.
How language models acquire knowledge during training and use that knowledge when performing downstream tasks is still not fully understood. To help understand this, Allen-Zhu and Li [2024] and Allen-Zhu and Li [2025] conduct systematic studies on small language models and show that both storing knowledge in the parameters and learning how to use that knowledge are important. More recently, Calderon et al. [2026] argue that even frontier models are bottlenecked more by knowledge recall than by knowledge storage, and Gekhman et al. [2026] and Ma and Hewitt [2026] show that reasoning can improve fact recall, indicating that teaching language models how to use the facts learned during training is important. We believe that the success of Synthetic Mixed Training aligns well with these findings, as synthetic QAs may help models learn how to use knowledge, highlighting the importance of knowledge use beyond mere storage in language models.

5 Conclusion

We study how to make synthetic data scale more effectively for knowledge learning in data-constrained domains. Our results show that simply increasing the amount of synthetic data or using a stronger generator is not sufficient: existing methods exhibit diminishing returns and still underperform RAG. Based on the observation that synthetic QAs and documents have different scaling properties, we introduce Synthetic Mixed Training, which combines synthetic QAs and documents to leverage their complementary training signals and achieve the best of both worlds. We further introduce Focal Rewriting, which improves the diversity of generated documents and leads to an even steeper scaling trend. Our methods generalize well across a range of settings, and the trained models are also complementary to RAG.

6 Acknowledgments

We thank Suhong Moon and Sehoon Kim for their valuable feedback and support throughout this work.
We also acknowledge Upstage and Vessl for their compute support for this work, and thank Singapore DSO and the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean MSIT (No. RS-2024-00457882, National AI Research Lab Project) for supporting this work.

References

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024.

Lisa Adams, Felix Busch, Tianyu Han, Jean-Baptiste Excoffier, Matthieu Ortala, Alexander Löser, Hugo JWL Aerts, Jakob Nikolas Kather, Daniel Truhn, and Keno Bressem. LongHealth: A question answering benchmark with long clinical documents. Journal of Healthcare Informatics Research, 9(3):280–296, 2025.

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024.

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction. In International Conference on Machine Learning, pages 1067–1077. PMLR, 2024.

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.2, knowledge manipulation. In International Conference on Learning Representations, 2025.

Vincent-Pierre Berges, Barlas Oğuz, Daniel Haziza, Wen-tau Yih, Luke Zettlemoyer, and Gargi Ghosh. Memory layers at scale. arXiv preprint arXiv:2412.09764, 2024.

Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. LoRA learns less and forgets less. Transactions on Machine Learning Research, 2025.
Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, et al. Nvidia Nemotron 3: Efficient and open intelligence. arXiv preprint arXiv:2512.20856, 2025.

Lucas Caccia, Alan Ansell, Edoardo Ponti, Ivan Vulić, and Alessandro Sordoni. Training plug-and-play knowledge modules with deep context distillation. In Second Conference on Language Modeling, 2025.

Nitay Calderon, Eyal Ben-David, Zorik Gekhman, Eran Ofek, and Gal Yona. Empty shelves or lost keys? Recall is the bottleneck for parametric factuality. arXiv preprint arXiv:2602.14080, 2026.

Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, et al. Cartridges: Lightweight and general-purpose long context representations via self-study. arXiv preprint arXiv:2506.06266, 2025.

Dan Friedman and Adji Bousso Dieng. The Vendi Score: A diversity evaluation metric for machine learning. Transactions on Machine Learning Research, 2023.

Tianyu Gao, Alexander Wettig, Luxi He, Yihe Dong, Sadhika Malladi, and Danqi Chen. Metadata conditioning accelerates language model pre-training. arXiv preprint arXiv:2501.01956, 2025.

Zorik Gekhman, Roee Aharoni, Eran Ofek, Mor Geva, Roi Reichart, and Jonathan Herzig. Thinking to recall: How reasoning unlocks parametric knowledge in LLMs, 2026. URL 2603.09906.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, 2021.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models.
arXiv preprint arXiv:2407.21783, 2024.

Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. OpenThoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178, 2025.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, 2020.

Xu Owen He. Mixture of a million experts. arXiv preprint arXiv:2407.04153, 2024.

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. FinanceBench: A new benchmark for financial question answering. arXiv preprint arXiv:2311.11944, 2023.

William B Johnson, Joram Lindenstrauss, et al. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189-206):1, 1984.

Jaehun Jung, Seungju Han, Ximing Lu, Skyler Hallinan, David Acuna, Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, and Yejin Choi. Prismatic synthesis: Gradient-based data diversification boosts generalization in LLM reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

Feiyang Kang, Newsha Ardalani, Michael Kuchnik, Youssef Emad, Mostafa Elhoushi, Shubhabrata Sengupta, Shang-Wen Li, Ramya Raghavendra, Ruoxi Jia, and Carole-Jean Wu. Demystifying synthetic data in LLM pre-training: A systematic study of scaling laws, benefits, and pitfalls.
In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 10750–10769, 2025.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Konwoo Kim, Suhas Kotha, Percy Liang, and Tatsunori Hashimoto. Pre-training under infinite compute. arXiv preprint arXiv:2509.14786, 2025.

Suhas Kotha and Percy Liang. Replaying pre-training data improves fine-tuning. arXiv preprint arXiv:2603.04964, 2026.

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.

Andrew Kyle Lampinen, Martin Engelcke, Yuxuan Li, Arslan Chaudhry, and James L McClelland. Latent learning: Episodic memory complements parametric learning by enabling flexible reuse of experiences. arXiv preprint arXiv:2509.16189, 2025.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

Jessy Lin, Vincent-Pierre Berges, Xilun Chen, Wen-Tau Yih, Gargi Ghosh, and Barlas Oğuz. Learning facts at scale with active reading. arXiv preprint arXiv:2508.09494, 2025a.

Jessy Lin, Luke Zettlemoyer, Gargi Ghosh, Wen-Tau Yih, Aram Markosyan, Vincent-Pierre Berges, and Barlas Oğuz. Continual learning via sparse memory finetuning. arXiv preprint arXiv:2510.15103, 2025b.

Emmy Liu, Graham Neubig, and Chenyan Xiong.
Midtraining bridges pretraining and posttraining distributions. arXiv preprint arXiv:2510.14865, 2025.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation.

Melody Ma and John Hewitt. Improving parametric knowledge access in reasoning language models. arXiv preprint arXiv:2602.22193, 2026.

Pratyush Maini, Skyler Seto, Richard Bai, David Grangier, Yizhe Zhang, and Navdeep Jaitly. Rephrasing the web: A recipe for compute and data-efficient language modeling. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14044–14072, 2024.

Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, et al. BeyondWeb: Lessons from scaling synthetic data for trillion-scale pretraining. arXiv preprint arXiv:2508.10975, 2025.

John X Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G Edward Suh, Alexander M Rush, Kamalika Chaudhuri, and Saeed Mahloujifar. How much do language models memorize? arXiv preprint arXiv:2505.24832, 2025.

Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A Raffel. Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36:50358–50376, 2023.

Thao Nguyen, Yang Li, Olga Golovneva, Luke Zettlemoyer, Sewoong Oh, Ludwig Schmidt, and Xian Li. Recycling the web: A method to enhance pre-training data quality and quantity for language models. In Second Conference on Language Modeling, 2025.
Joel Niklaus, Guilherme Penedo, Hynek Kydlicek, Elie Bakouch, Lewis Tunstall, Ed Beeching, Thibaud Frere, Colin Raffel, Leandro von Werra, and Thomas Wolf. The synthetic data playbook: Generating trillions of the finest tokens, 2026.

Oded Ovadia, Menachem Brief, Moshik Mishaeli, and Oren Elisha. Fine-tuning or retrieval? Comparing knowledge injection in LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 237–250, 2024.

Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, and Samuel Bowman. QuALITY: Question answering with long input texts, yes! In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5336–5358, Seattle, United States, July 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.naacl-main.391.

Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. The FineWeb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849, 2024.

Jake Poznanski, Luca Soldaini, and Kyle Lo. olmOCR 2: Unit test rewards for document OCR. arXiv preprint arXiv:2510.19817, 2025.

Yangjun Ruan, Neil Band, Chris J Maddison, and Tatsunori Hashimoto. Reasoning to learn from latent thoughts. arXiv preprint arXiv:2503.18866, 2025.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

Charlie Snell, Dan Klein, and Ruiqi Zhong. Learning by distilling context. arXiv preprint arXiv:2209.15189, 2022.
Heydar Soudani, Evangelos Kanoulas, and Faegheh Hasibi. Fine tuning vs. retrieval augmented generation for less popular knowledge. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pages 12–22, 2024.

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a.

Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candes, and Tatsunori Hashimoto. Synthetic continued pretraining. In The Thirteenth International Conference on Learning Representations, 2025b.

Zitong Yang, Aonan Zhang, Hong Liu, Tatsunori Hashimoto, Emmanuel Candès, Chong Wang, and Ruoming Pang. Synthetic bootstrapped pretraining. arXiv preprint arXiv:2509.15248, 2025c.

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.

Adam Zweiger, Xinghong Fu, Han Guo, and Yoon Kim. Fast KV compaction via attention matching. arXiv preprint arXiv:2602.16284, 2026.

A Limitations

Due to compute limitations, our study focuses on small-scale models up to 8B parameters, which stand to benefit the most from novel methods for knowledge learning. In addition, although we focus on learning new knowledge, mitigating the forgetting of existing knowledge is also an important problem. We use pretraining data replay [Yang et al., 2025b, Kotha and Liang, 2026] to mitigate this forgetting issue, and we view a deeper treatment of forgetting alongside knowledge acquisition as a promising direction for future work.

B Training details

Hyperparameters.
We train all models on the synthetic data using a fixed set of hyperparameters: a batch size of 16, a sequence length of 2048, two training epochs, and a fixed replay rate of 0.1 from the FineWeb dataset [Penedo et al., 2024]. We use a cosine learning rate schedule with a warmup ratio of 0.05 and the AdamW optimizer [Loshchilov and Hutter, 2017] with a weight decay of 0.01, beta1 of 0.9, and beta2 of 0.999. We also apply gradient clipping with a threshold of 1.0 and use FSDP2 [Zhao et al., 2023] for model training (4 GPUs for training 8B models). The only variation is the method used to generate the synthetic data. For Llama 3.1 8B Instruct, we train models with two learning rates (5e-6 and 1e-5) and report the best result for each configuration. For Qwen3 models (1.7B, 4B, 8B, and 14B), we use two learning rates (1e-5 and 5e-5) and again report the best result for each configuration. To use training compute efficiently, as in pretraining, we pack randomly shuffled data instances into each sequence, separated by the EOD delimiter.

Data formatting: inclusion of metadata is important. After generating synthetic data, we emphasize that data formatting is also important: metadata about the data should be included (e.g., company name, story title and author name, etc.). For example, if the generated data (using Active Reading) is based on FinanceBench, it could follow this format:

Here's a learning strategy. {strategy} Apply this strategy to the document ``{doc name}'' of {company}. Output: {generated text}

Table 3: Example data format for synthetic training data with metadata. This is an example of using FinanceBench metadata.

In our preliminary experiments, we found that omitting metadata leads to lower accuracy after training. This is reasonable, as without metadata, the model struggles to associate knowledge with the correct source.
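As a concrete illustration of the metadata-prefixed format in Table 3, a small helper might assemble one training instance as follows; only the template text comes from the paper, while the function name, argument names, and example values are ours:

```python
def format_with_metadata(strategy: str, doc_name: str, company: str, generated: str) -> str:
    """Assemble one synthetic training instance in the metadata-prefixed
    format of Table 3 (field and function names here are illustrative)."""
    return (
        "Here's a learning strategy. "
        f"{strategy} "
        f"Apply this strategy to the document ``{doc_name}'' of {company}. "
        f"Output: {generated}"
    )

# Hypothetical example values, in the spirit of FinanceBench metadata.
example = format_with_metadata(
    strategy="Summarize each section, then list the key figures.",
    doc_name="FY2022 10-K",
    company="ExampleCo",
    generated="(synthetic active-reading text)",
)
```

Keeping the document name and company in every instance is what lets the model tie each synthetic passage back to its source.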
This observation is also consistent with the findings of Allen-Zhu and Li [2025] and Gao et al. [2025].

Estimating compute for synthetic training. We provide a crude estimate of the amount of compute required for synthetic training. We follow the common approximation for FLOP calculation from Kaplan et al. [2020]: 2ND for the forward pass and 4ND for the backward pass, where N is the number of model parameters and D is the number of data tokens. Under this approximation, when we train a model with N parameters on D synthetic tokens generated by a model with M parameters, the total compute required for synthetic training can be written as C ≈ 2MD + 6ND. Using this formula, we estimate the computational cost of our most expensive training run: training an 8B model on 700M tokens generated by a 70B model requires approximately 1.316 × 10^20 FLOPs. Assuming H100 GPUs (1979 TFLOPS) for synthetic training without any other overhead, it takes 18.5 H100 hours to generate the synthetic data and train the model on it.

### Question
{question}

### Choices
{options}

Choose the best answer from the following options after thinking step by step. There is only one correct choice. Your answer format should be like this:
Explanation: [your explanation]
Answer: [your answer (only one letter, A, B, C, D, or E)]

Table 4: Prompt template used for multiple-choice QA evaluation. This is used for QuaLITY and LongHealth.

### Question
{question}

Answer the question using the document above. Your answer format should be like this:
Explanation: [your explanation]
Answer: [your answer]

Table 5: Prompt template used for open-ended QA evaluation. This is used for FinanceBench.

C Evaluation details

RAG implementation. We compare the trained model against a RAG pipeline [Lewis et al., 2020] that uses the model before synthetic data training.
To implement RAG, we use Qwen3-Embedding-8B as the retriever to fetch the top-128 document chunks most relevant to the query. We then apply Qwen3-Reranker-8B to rerank these chunks and select the top-8 as context.

Evaluation hyperparameters. During evaluation, we use temperature=0.1, top-p=0.95, and a maximum length of 512. We generate eight responses per question (n=8) and report the average accuracy for a more robust evaluation. For model evaluations on the MCQA benchmarks (QuaLITY, LongHealth), we use the prompt in Table 4, and for the open-ended generation task (FinanceBench), we use the prompt in Table 5.

D Synthetic Data Generation Details

We use vLLM [Kwon et al., 2023] for efficient LLM inference during data generation. For QA pair generation, we use the prompt in Table 6. For document generation, we use the prompts in Table 7, Table 8, Table 9, Table 10, and Table 11. For QA generation, we use a temperature of 1.0, a top-p of 1.0, and a maximum length of 2048. For document generation, we use a temperature of 0.7, a top-p of 0.95, and a maximum length of 4096. We use Llama 3.1 8B Instruct for QA generation in all experiments, including the generation of questions for Focal Rewriting. For the experiments in Section 3.3, we use Llama 3.1 70B Instruct for response and document generation, except on FinanceBench, where we use Qwen3 30B A3B Instruct for both response and document generation.

E Measuring Similarity of the Synthetic Data in Gradient Space

To analyze how similar synthetic QAs and synthetic documents are to each other, we compute gradient embeddings for synthetic QAs and synthetic documents (AR) generated from two datasets (QuaLITY, LongHealth [Adams et al., 2025]), and measure both intra- and inter-set gradient-embedding similarity. All synthetic datasets are generated using the 70B generator. Specifically, we follow Jung et al.
[2025] when computing the gradient embeddings: we use the next-token prediction loss, Qwen3 0.6B as the model, and a Johnson–Lindenstrauss transform [Johnson et al., 1984] to reduce the gradient dimensionality. We sample 16 batches with a sequence length of 2048 from each data type.

Figure 9 shows the results. QA datasets exhibit high gradient similarity across different domains, whereas AR datasets show lower gradient similarity. However, QA and AR data from the same domain exhibit the lowest gradient similarity, suggesting that data type is an important factor in shaping training signals.

Generate question-answer pairs from the following article.
Article: {article}
ONLY `Question: ...` and `Answer: ...` tags are allowed. DO NOT include any other text.

Table 6: Prompt template used for question-answer pair generation from an article.

Rewrite the following document to help the user understand the document better.
{document}

Table 7: Prompt template used for document rephrasing [Maini et al., 2024].

Figure 9: Average gradient similarity between datasets. We compute gradient embeddings for each data point and use cosine similarity between embeddings to measure the similarity of gradients across datasets. (1) QA examples from different datasets exhibit high gradient similarity (≥ 0.94), whereas QA examples and documents (AR) from different datasets show low gradient similarity (≤ 0.25). (2) Even QA and documents from the same dataset do not exhibit high gradient similarity (≤ 0.26).

F More results

F.1 Measuring diversity of generated documents

We measure diversity from two perspectives: semantic diversity and lexical diversity.
For semantic diversity, we use the Vendi Score [Friedman and Dieng, 2023], which quantifies how varied the data instances are. Specifically, we compute an embedding for each instance using Qwen3-Embedding-8B and then use the cosine similarity between embeddings to construct the pairwise similarity matrix required for the Vendi Score. For lexical diversity, we report the unique 4-gram ratio, which captures the extent of surface-form variation in the text. In Figure 10, we show how the diversity of AR documents and Focal Rewriting AR documents changes across different data sizes on QuaLITY. We plot the results this way to compare how diversity changes under different data budgets. The figure shows that Focal Rewriting yields higher lexical and semantic diversity. Interestingly, using a stronger model does not lead to higher diversity.

As a knowledge analyzer, your task is to dissect and understand an article provided by the user. You are required to perform the following steps:
1. Summarize the Article: Provide a concise summary of the entire article, capturing the main points and themes.
2. Extract Entities: Identify and list all significant "nouns" or entities mentioned within the article. These entities should include but not limited to:
* People: Any individuals mentioned in the article, using the names or references provided.
* Places: Both specific locations and abstract spaces relevant to the content.
* Object: Any concrete object that is referenced by the provided content.
* Concepts: Any significant abstract ideas or themes that are central to the article's discussion.
Try to exhaust as many entities as possible. Your response should be structured in a JSON format to organize the information effectively. Ensure that the summary is brief yet comprehensive, and the list of entities is detailed and accurate. Here is the format you should use for your response:
{{ "summary": "", "entities": ["entity1", "entity2", ...] }}
Article: {document}

Table 8: Prompt template used to extract entities for EntiGraph [Yang et al., 2025b].

Figure 10: How data diversity changes with different synthetic document generation methods. (Left) Semantic diversity of synthetic documents, measured by the Vendi Score [Friedman and Dieng, 2023] using embedding-based similarity to compute distances between data points. (Right) Lexical diversity of synthetic documents, measured as the ratio of unique 4-grams in the data. For both metrics, higher values indicate greater diversity. (1) Focal Rewriting increases both semantic and lexical diversity at all dataset sizes, and (2) scaling the generator does not significantly affect diversity.

You will act as a knowledge analyzer tasked with dissecting an article provided by the user. Your role involves two main objectives:
1. Rephrasing Content: The user will identify two specific entities mentioned in the article. You are required to rephrase the content of the article twice:
* Once, emphasizing the first entity.
* Again, emphasizing the second entity.
2. Analyzing Interactions: Discuss how the two specified entities interact within the context of the article.
Your responses should provide clear segregation between the rephrased content and the interaction analysis. Ensure each section of the output include sufficient context, ideally referencing the article's title to maintain clarity about the discussion's focus.
Here is the format you should follow for your response:
### Discussion of <title> in relation to <entity1>
<Rephrased content focusing on the first entity>
### Discussion of <title> in relation to <entity2>
<Rephrased content focusing on the second entity>
### Discussion of Interaction between <entity1> and <entity2> in context of <title>
<Discussion on how the two entities interact within the article>
### Document
{document}
### Entities:
- {entity1}
- {entity2}

Table 9: Prompt template used for entity linking when generating EntiGraph documents [Yang et al., 2025b].

Figure 11: Synthetic mixed training with different synthetic QA-document mixing ratios. We test four mixing ratios of QA and AR: (1:1), (1:8), (2:7), and (8:1). The remaining 10% is used for replay with FineWeb. Using 1:1 mixing gives the best result.

F.2 Finding the optimal mixing ratio for synthetic mixed training

We experiment with different mixing ratios of synthetic QA data and documents for mixed training. Figure 11 shows the results of training Llama 3.1 8B on data generated by the 70B model. Among the tested variants, a 1:1 mixing ratio yields the best performance.

Consider the following document. What are some strategies specific to this document that I can use to help me learn and remember all of the information contained? Use markdown and prefix each strategy with ##.
<document>
{document}
</document>

Table 10: Prompt template used for generating active reading strategies [Lin et al., 2025a].

Here's a learning strategy.
{strategy} Apply this strategy to the following document:
<document>
{document}
</document>

Table 11: Prompt template used for active reading document generation with a provided learning strategy [Lin et al., 2025a].

G Additional Related Works

Parameter-efficient training for new knowledge. Parameter-efficient adaptation is a promising direction for teaching models new knowledge. LoRA [Hu et al., 2022] has been widely used to adapt models through low-rank updates to their weights, and, combined with context distillation [Snell et al., 2022], Caccia et al. [2025] propose training LoRA layers to acquire new knowledge. However, their Llama 8B model trained on QuaLITY achieves 59.3% accuracy, which remains substantially below our results. Biderman et al. [2025] also show that low-rank updates can limit the acquisition of new knowledge, highlighting a key limitation of LoRA for knowledge-intensive learning.

Motivated by the hypothesis that Transformer key-value (KV) caches function as a form of knowledge base [Geva et al., 2021], Eyuboglu et al. [2025] propose an end-to-end training approach that optimizes only the KV cache to store knowledge. More recently, Zweiger et al. [2026] introduce an optimization method that updates the KV cache to compress knowledge without requiring end-to-end training. Although we view these approaches as promising, we do not include them as baselines for two reasons. First, applying these compression-based methods to our setting is infeasible because concatenating all documents would require context lengths of 1.6M and 4.8M tokens, respectively, which are not supported by the base model we use. Second, even if the base model supported context lengths beyond 1M tokens, their performance degrades substantially at high compression ratios (i.e., when compressing by more than 20x), making them difficult to apply in our setting. Consistent with these limitations, Zweiger et al.
[2026] evaluate on QuaLITY by compressing only a single document, while on LongHealth, Zweiger et al. [2026] and Eyuboglu et al. [2025] compress only five and ten documents, respectively. In contrast, we train models on all documents in each dataset: 265 documents from QuaLITY and 400 documents from LongHealth.

Alleviating forgetting. Continued training often leads to the forgetting of existing knowledge in language models. A common technique for mitigating this issue is replay: reusing pretraining data during the continued training stage [Lin et al., 2025a, Yang et al., 2025b, Kotha and Liang, 2026, Liu et al., 2025]. In a symbolic distillation setup, Agarwal et al. [2024] and Lu and Lab [2025] suggest using on-policy distillation: compared to training the model on synthetic data generated by another (e.g., stronger) model, computing the loss on self-generated data leads to less forgetting while still learning new knowledge well. There have also been attempts to address forgetting through improved language model architectures. For example, Lin et al. [2025b] suggest using memory layers [He, 2024, Berges et al., 2024], which are identical to Mixture-of-Experts models [Shazeer et al., 2017] but use a large number of experts in a specific layer. These layers are updated specifically for new knowledge, reducing interference with existing knowledge. The paper shows that there is a trade-off between learning new knowledge and forgetting existing knowledge, and that sparse model updates with memory layers provide a better Pareto frontier than full fine-tuning or LoRA [Biderman et al., 2025, Hu et al., 2022].

<document>
{document}
</document>
Here's a learning strategy. {strategy} Apply this strategy to the document above, with the focus on the question: {query}

Table 12: Prompt template used for Focal Rewriting active reading with a provided learning strategy.
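The Focal Rewriting template in Table 12 has only three slots (document, strategy, question), so assembling a generation request is straightforward. A minimal sketch follows; the template text comes from Table 12, while the function name and the example values are ours:

```python
# Template text taken from Table 12 of the paper.
FOCAL_REWRITING_TEMPLATE = """<document>
{document}
</document>
Here's a learning strategy. {strategy} Apply this strategy to the document above, with the focus on the question: {query}"""

def build_focal_rewriting_prompt(document: str, strategy: str, query: str) -> str:
    """Fill the Table 12 template: one source document, one active-reading
    strategy, and one synthetic question that the rewrite should focus on."""
    return FOCAL_REWRITING_TEMPLATE.format(
        document=document, strategy=strategy, query=query
    )

# Hypothetical usage: the strategy and question would come from the
# QA-generation model (Llama 3.1 8B Instruct in the paper's setup).
prompt = build_focal_rewriting_prompt(
    document="(full source document here)",
    strategy="Rewrite the document as a question-and-answer study guide.",
    query="What year was the company founded?",
)
```

Conditioning the rewrite on a specific question is what distinguishes Focal Rewriting from plain active reading (Table 11), which applies the strategy to the document alone.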