ArXiv-to-Model: A Practical Study of Scientific LM Training

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

While frontier large language models demonstrate strong reasoning and mathematical capabilities, the practical process of training domain-specialized scientific language models from raw sources remains under-documented. In this work, we present a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics. We describe an end-to-end pipeline covering metadata filtering, archive validation, LaTeX extraction, text normalization, domain-aware tokenization, and dense transformer training under constrained compute (2×A100 GPUs). Through 24 experimental runs, we analyze training stability, scaling behavior, data yield losses, and infrastructure bottlenecks. Our findings highlight how preprocessing decisions significantly affect usable token volume, how tokenization impacts symbolic stability, and how storage and I/O constraints can rival compute as limiting factors. We further analyze convergence dynamics and show stable training behavior in a data-rich regime (52B pretraining tokens). Rather than proposing a novel architecture, this work provides an engineering-grounded, transparent account of training a small scientific language model from scratch. We hope these insights support researchers operating under moderate compute budgets who seek to build domain-specialized models.


💡 Research Summary

The paper presents a thorough, reproducible case study of building a 1.36 billion‑parameter scientific language model directly from raw arXiv LaTeX sources. Unlike most large‑scale LLM papers that rely on curated, proprietary corpora, the authors start from the open‑access arXiv dump and document every step required to turn heterogeneous LaTeX archives into a high‑quality pre‑training corpus.

Data pipeline. They first filter metadata to keep only papers in mathematics, computer science, and theoretical physics (categories: math, cs, hep‑th, hep‑ph, quant‑ph, stat.ML, stat.TH) published after 2000, discarding withdrawn papers and any document with fewer than 2 000 characters of body text. After verifying the integrity of each tar archive, they extract every .tex file, resolve \input and \include directives, and preserve custom macros as much as possible. A cleaning stage removes figures, references, and formatting commands while retaining equation environments. Exact deduplication (hash‑based) and near‑duplicate detection are applied to avoid version‑induced redundancy. The resulting cleaned corpus is about 80 GB of raw text, which after tokenization yields 52.18 billion tokens for pre‑training and an additional 5 billion tokens from post‑training resources (StackExchange, MathInstruct, UltraChat).
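The filtering and deduplication steps above can be sketched in a few lines. This is an illustrative reading of the criteria described, not the paper's actual code; the metadata field names and the inclusive year boundary are assumptions.

```python
import hashlib

# Categories listed in the summary; "math" and "cs" are treated as
# top-level prefixes (e.g. math.AG, cs.LG), the rest as exact matches.
KEEP_CATEGORIES = {"hep-th", "hep-ph", "quant-ph", "stat.ML", "stat.TH"}

def keep_paper(meta: dict, body: str) -> bool:
    """Apply the metadata filter: domain, year, withdrawal, minimum length."""
    cat = meta["category"]
    in_domain = cat in KEEP_CATEGORIES or cat.split(".")[0] in {"math", "cs"}
    return (in_domain
            and meta["year"] > 2000            # "published after 2000"
            and not meta.get("withdrawn", False)
            and len(body) >= 2000)             # discard very short bodies

def dedup_exact(docs: list[str]) -> list[str]:
    """Hash-based exact deduplication, keeping the first occurrence."""
    seen, unique = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique
```

Near-duplicate detection (e.g. between paper versions) would sit on top of this, typically via shingling or MinHash rather than exact hashes.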

Tokenizer design. Recognizing that scientific text is dense with symbols, the authors experiment with custom BPE and SentencePiece models. They ultimately adopt a LLaMA‑compatible SentencePiece tokenizer with a vocabulary of ~102 k tokens, chosen for architectural compatibility, stable embedding initialization, and reduced risk of token‑ID misalignment. The tokenizer is trained to keep common LaTeX commands and mathematical operators as single tokens, which improves compression efficiency and, more importantly, stabilizes early training dynamics.
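The idea of keeping LaTeX commands atomic can be illustrated with a toy pre-tokenizer; in a real SentencePiece setup one would instead pass the chosen commands as `user_defined_symbols` to `SentencePieceTrainer.train`. The regex below is a sketch, not the paper's tokenizer.

```python
import re

# Keep each \command as one unit; everything else falls back to
# single non-whitespace characters (a stand-in for subword merging).
LATEX_TOKEN = re.compile(r"\\[a-zA-Z]+|\S")

def pretokenize(text: str) -> list[str]:
    """Split text so LaTeX commands survive as single tokens."""
    return LATEX_TOKEN.findall(text)

pretokenize(r"\frac{a}{b} \leq \sum_i x_i")
# \frac, \leq, \sum each stay whole instead of fragmenting into
# backslash + letters, which is the stability benefit described above.
```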

Model architecture. The model follows the dense decoder‑only Transformer design of LLaMA: 24 layers, hidden size 2048, 16 attention heads, feed‑forward dimension 5504, RoPE positional embeddings (θ = 10 000), maximum context length 4096 tokens, SiLU activation, RMSNorm, and separate input/output embedding matrices. The total parameter count is 1.36 B. The authors deliberately avoid sparse or Mixture‑of‑Experts (MoE) variants because routing complexity and communication overhead would outweigh any parameter‑efficiency gains on a modest 2 × A100 setup.
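A back-of-the-envelope parameter count can be derived from these hyperparameters. The sketch below assumes a LLaMA-style SwiGLU feed-forward block and ignores norm parameters; with an untied ~102k vocabulary it lands above the reported 1.36B, so the exact figure evidently depends on the final vocabulary size and embedding details not spelled out here.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Values as reported in the summary.
    n_layers: int = 24
    d_model: int = 2048
    n_heads: int = 16
    d_ff: int = 5504
    vocab_size: int = 102_000   # "~102k"; the exact value is an assumption
    tie_embeddings: bool = False

def estimate_params(cfg: ModelConfig) -> int:
    """Rough dense-decoder parameter count (no norms/biases)."""
    attn = 4 * cfg.d_model * cfg.d_model        # q, k, v, o projections
    mlp = 3 * cfg.d_model * cfg.d_ff            # gate, up, down (SwiGLU)
    emb = cfg.vocab_size * cfg.d_model
    emb_total = emb if cfg.tie_embeddings else 2 * emb
    return cfg.n_layers * (attn + mlp) + emb_total
```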

Training infrastructure and curriculum. Training runs on two NVIDIA A100 GPUs (80 GB each) using data‑parallelism with gradient accumulation to achieve an effective global batch size of 512–2 048 sequences. Mixed‑precision (bfloat16) and ZeRO Stage 2 memory optimizations keep GPU memory usage within limits, while activation checkpointing further reduces the footprint. The total compute budget is estimated at 5 000–8 000 GPU‑hours.
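The relationship between global batch size, per-GPU micro-batch, and accumulation steps is simple arithmetic; the micro-batch size below is a hypothetical value for illustration, not one reported in the paper.

```python
def accumulation_steps(global_batch: int, micro_batch: int, n_gpus: int) -> int:
    """Gradient-accumulation steps needed to reach a target global batch."""
    per_step = micro_batch * n_gpus
    assert global_batch % per_step == 0, "global batch must divide evenly"
    return global_batch // per_step

# e.g. a 512-sequence global batch on 2 GPUs with micro-batches of 8
steps = accumulation_steps(global_batch=512, micro_batch=8, n_gpus=2)  # 32
```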

A three‑stage curriculum is employed: (1) textual warm‑up on abstracts, introductions, and conclusions; (2) full LaTeX bodies with theorem and proof environments; (3) a mixed curriculum that balances prose and formula‑heavy sections. Although the model can handle 4096‑token contexts, training sequences are limited to 768 tokens to maximize throughput.
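The three-stage curriculum can be expressed as a schedule over data-source mixtures. The stage boundaries and mixing weights below are purely illustrative assumptions; the paper does not report the actual proportions.

```python
# Hypothetical schedule: (fraction of training completed, source weights).
STAGES = [
    {"until_frac": 0.1, "sources": {"abstracts_intros_conclusions": 1.0}},
    {"until_frac": 0.6, "sources": {"full_latex_bodies": 1.0}},
    {"until_frac": 1.0, "sources": {"prose": 0.5, "formula_heavy": 0.5}},
]

def stage_for(progress: float) -> dict:
    """Return the sampling mixture for a point in training (0.0-1.0)."""
    for stage in STAGES:
        if progress < stage["until_frac"]:
            return stage["sources"]
    return STAGES[-1]["sources"]
```

A data loader would sample sources according to these weights, while still packing every sequence to the 768-token training length.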

Experimental findings. Across 24 runs the authors explore hyper‑parameter variations, dataset scales, and preprocessing tweaks. Small‑data experiments (≈20 GB) show unstable loss curves with oscillations and high plateau values, confirming that the model is data‑hungry. Scaling up to the full 200 GB corpus yields smooth loss decay and stable convergence. The authors note that the Chinchilla scaling law (T ≈ 20 × P) would suggest ~27 B tokens for optimal compute efficiency, but they deliberately train on 52 B tokens, placing the model in a “data‑rich” regime (≈38 tokens per parameter). This choice improves the model’s ability to capture diverse mathematical notation and proof patterns at the cost of over‑training relative to pure compute‑optimality.
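The scaling arithmetic above checks out directly from the reported numbers:

```python
# Chinchilla rule of thumb: compute-optimal tokens T ≈ 20 × parameters P.
params = 1.36e9
chinchilla_tokens = 20 * params              # 2.72e10, i.e. the ~27B cited

# Actual budget puts the model well past compute-optimality.
actual_tokens = 52.18e9
tokens_per_param = actual_tokens / params    # ≈ 38.4 tokens per parameter
```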

Infrastructure bottlenecks. I/O throughput emerges as a primary limitation; streaming the 200 GB dataset from NVMe storage caps GPU utilization at ~85 %. The authors mitigate this with parallel data loading pipelines and high‑speed storage. Memory pressure is alleviated through activation checkpointing and Fully‑Sharded Data Parallel (FSDP), achieving roughly a 30 % reduction in memory consumption without harming convergence.
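The parallel-loading idea reduces to overlapping storage reads with compute. Here is a minimal thread-based prefetcher as a sketch; a production pipeline would use a framework data loader with multiple worker processes, and the `read_shard` callable is a placeholder for actual NVMe reads.

```python
import queue
import threading

def prefetching_reader(read_shard, shard_ids, depth: int = 4):
    """Yield shards while a background thread reads ahead by `depth`."""
    q = queue.Queue(maxsize=depth)   # bounded buffer caps read-ahead memory
    SENTINEL = object()

    def producer():
        for sid in shard_ids:
            q.put(read_shard(sid))   # blocks whenever the buffer is full
        q.put(SENTINEL)              # signal end of stream

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not SENTINEL:
        yield item                   # training step consumes here
```

Because `put` blocks on a full queue, I/O naturally throttles to match consumption instead of exhausting RAM.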

Conclusions and impact. The paper does not propose a new model architecture; instead, it delivers a transparent, reproducible blueprint for training domain‑specific scientific LLMs on modest hardware. It highlights how preprocessing decisions, tokenizer design, curriculum scheduling, and hardware‑aware optimizations collectively dictate model quality. The work serves as a practical reference for researchers who lack access to massive clusters but wish to build high‑performing scientific language models. Future directions include systematic comparison of domain‑trained tokenizers, exploration of sparse MoE models under similar constraints, and extending context windows to improve long‑range reasoning in mathematics and physics.

