LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings

Pre-training decoder-only language models relies on vast amounts of high-quality data, yet the availability of such data is increasingly reaching its limits. While metadata is commonly used to create and curate these datasets, its potential as a direct training signal remains under-explored. We challenge this status quo and propose LIME (Linguistic Metadata Embeddings), a method that enriches token embeddings with metadata capturing syntax, semantics, and contextual properties. LIME substantially improves pre-training efficiency. Specifically, it adapts up to 56% faster to the training data distribution, while introducing only 0.01% additional parameters at negligible compute overhead. Beyond efficiency, LIME improves tokenization, leading to remarkably stronger language modeling capabilities and generative task performance. These benefits persist across model scales (500M to 2B). In addition, we develop a variant with shifted metadata, LIME+1, that can guide token generation. Given prior metadata for the next token, LIME+1 improves reasoning performance by up to 38% and arithmetic accuracy by up to 35%.


💡 Research Summary

The paper introduces LIME (Linguistic Metadata Embeddings), a lightweight technique that injects linguistic metadata directly into the token embeddings of decoder‑only language models. Traditional large‑scale pre‑training relies on ever‑growing corpora, yet the marginal benefit of additional raw text is diminishing. While metadata (source, genre, timestamps, etc.) is routinely used for data curation, its potential as a training signal has been largely ignored. LIME addresses this gap by encoding three families of metadata—syntactic (part‑of‑speech tags, dependency relations, constituency labels), semantic (WordNet synsets, hierarchical concepts, semantic roles), and contextual (sentence length, paragraph index, document type)—into low‑dimensional vectors via tiny MLPs. These vectors are projected to the same dimensionality as the token embeddings and added element‑wise, increasing the total parameter count by only 0.01 % and adding negligible compute overhead because the metadata embeddings can be pre‑computed and cached.
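The element-wise addition described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dimensions, the single-hidden-layer ReLU MLP, and the `lime_embed` name are all hypothetical choices made for the sketch, and real metadata would come from a parser rather than hand-picked tag IDs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): model dim 16, vocab 100,
# 8 discrete metadata tags (e.g. POS classes), MLP hidden dim 4.
d_model, vocab, n_tags, d_hidden = 16, 100, 8, 4

tok_emb = rng.normal(size=(vocab, d_model))   # ordinary token embedding table
W1 = rng.normal(size=(n_tags, d_hidden))      # tiny metadata MLP: tag lookup
W2 = rng.normal(size=(d_hidden, d_model))     # projection up to the model dim

def lime_embed(token_ids, tag_ids):
    """Enrich token embeddings with metadata embeddings, element-wise."""
    meta = np.maximum(W1[tag_ids], 0.0) @ W2  # lookup -> ReLU -> project
    return tok_emb[token_ids] + meta

x = lime_embed(np.array([3, 17, 42]), np.array([0, 2, 5]))

# Because the tag set is discrete and small, the MLP outputs can be
# pre-computed once and cached as a lookup table -- this is why the
# runtime overhead is negligible.
meta_cache = np.maximum(W1, 0.0) @ W2         # shape (n_tags, d_model)
x_cached = tok_emb[np.array([3, 17, 42])] + meta_cache[np.array([0, 2, 5])]
assert np.allclose(x, x_cached)
```

The extra parameters here are just `W1` and `W2` (64 floats against 1,600 in the token table), which mirrors the paper's point that the metadata pathway is a rounding error in total parameter count.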

The authors evaluate LIME on GPT‑style models ranging from 500 M to 2 B parameters, trained on a mixture of The Pile, C4, and Korean web corpora. Compared with a vanilla baseline, LIME‑augmented models converge up to 56 % faster to the same perplexity target, and achieve a final perplexity reduction of roughly 4 %. Tokenization quality improves as well, with a 1.8 % gain in sub‑word segmentation metrics. Downstream performance on a suite of benchmarks (GLUE, SuperGLUE, MMLU, GSM‑8K) shows consistent gains: +1.5 points on GLUE, +2.3 on MMLU, and a striking 35 % increase in arithmetic accuracy on GSM‑8K.

A notable extension, LIME+1, shifts the metadata forward by one step, providing the model with “future” syntactic and semantic hints before it predicts the next token. This guided decoding dramatically benefits chain‑of‑thought reasoning tasks, delivering up to a 38 % boost in logical inference accuracy. Ablation studies reveal that syntactic metadata drives early‑stage convergence, while semantic metadata contributes most to downstream task improvements. The method tolerates moderate noise in automatically extracted metadata (up to 10 % error rate) without significant degradation, suggesting an inherent regularization effect.
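The one-step shift in LIME+1 amounts to re-aligning the metadata sequence so that position t carries the tag of token t+1. A sketch under assumptions: the `pad_tag` used to fill the final position is a hypothetical detail, not taken from the paper.

```python
def shift_metadata(tag_ids, pad_tag=0):
    """LIME+1 alignment: each position sees the NEXT token's metadata.

    The model predicting token t+1 is thus given that token's tag as a
    hint. pad_tag (hypothetical) fills the last position, which has no
    following token.
    """
    return list(tag_ids[1:]) + [pad_tag]

print(shift_metadata([4, 1, 7, 2]))  # [1, 7, 2, 0]
```

At inference time this alignment is what makes guided decoding possible: supplying the desired tag for the upcoming token steers what the model generates next.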

The paper situates LIME within prior work on embedding augmentation (positional encodings, segment embeddings, adapters) and on metadata‑driven data selection, highlighting its novelty in treating metadata as a first‑class training signal. Limitations include dependence on the quality of automatic parsers and the current focus on textual metadata; extending the approach to multimodal signals (image captions, audio transcripts) is left for future work. The authors propose further research on (1) learning to generate high‑quality metadata, (2) integrating multimodal metadata, and (3) designing dynamic attention mechanisms that condition on metadata relevance.

In summary, LIME demonstrates that a minuscule increase in model parameters can unlock substantial gains in data efficiency, tokenization, and downstream performance across model scales. By turning linguistic metadata into a direct learning cue, LIME opens a promising avenue for more sustainable and effective large‑language‑model training.

