Texo: Formula Recognition within 20M Parameters
In this paper we present Texo, a minimalist yet high-performance formula recognition model that contains only 20 million parameters. Through careful design, together with distillation and transfer of the vocabulary and the tokenizer, Texo achieves performance comparable to state-of-the-art models such as UniMERNet-T and PPFormulaNet-S while reducing model size by 80% and 65%, respectively. This enables real-time inference on consumer-grade hardware and even in-browser deployment. We also developed a web application to demonstrate the model's capabilities and make it easy for end users to adopt.
💡 Research Summary
The paper introduces Texo, a formula‑recognition model that achieves near state‑of‑the‑art performance while using only 20 million parameters. The authors identify two main bottlenecks in existing MER (Mathematical Expression Recognition) systems: (1) oversized vocabularies inherited from natural‑language models, which cause the embedding layers to dominate the parameter budget, and (2) heavyweight transformer decoders that are unnecessary for the relatively limited LaTeX token set.
To address these issues, Texo first performs vocabulary distillation. By parsing the full set of LaTeX macros with the open‑source KaTeX parser, the authors build a rule‑based tokenizer containing only 687 meaningful tokens, removing whitespace and sub‑word splits that are common in BPE tokenizers. A systematic transfer algorithm (Algorithm 1) maps the original embedding vectors to the new token set via averaging, reducing the embedding parameter count from 38 M to under 1 M.
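The averaging-based transfer can be sketched as follows. This is a hypothetical illustration of the idea behind Algorithm 1, not the paper's implementation: each token in the distilled vocabulary receives the mean of the old embeddings of the sub-word pieces that the original BPE tokenizer would split it into. All names (`transfer_embeddings`, `bpe_split`) are illustrative.

```python
import numpy as np

def transfer_embeddings(old_embeddings: np.ndarray,
                        old_token_to_id: dict,
                        new_vocab: list,
                        bpe_split) -> np.ndarray:
    """Build an embedding matrix for a distilled vocabulary by averaging.

    old_embeddings: (old_vocab_size, dim) matrix from the pretrained model.
    bpe_split: callable splitting a new token into old sub-word pieces.
    """
    dim = old_embeddings.shape[1]
    new_embeddings = np.zeros((len(new_vocab), dim), dtype=old_embeddings.dtype)
    for i, token in enumerate(new_vocab):
        # e.g. a macro like "\frac" may have been split into several BPE pieces
        pieces = bpe_split(token)
        ids = [old_token_to_id[p] for p in pieces if p in old_token_to_id]
        if ids:
            # average the old sub-word embeddings to initialize the new token
            new_embeddings[i] = old_embeddings[ids].mean(axis=0)
    return new_embeddings
```

Because the new matrix has only 687 rows instead of tens of thousands, the embedding parameters shrink accordingly while retaining a sensible initialization.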
The model architecture reuses components from PPFormulaNet‑S. The image encoder is HGNetV2‑B4, a lightweight CNN backbone originally designed for the RT‑DETR detector and comparable in accuracy to Vision Transformers on classification and detection tasks. The text decoder is a 2‑layer MBart transformer with a hidden dimension of 384 and a context length of 1024. The decoder attends to the encoder's visual features via cross‑attention and outputs a sequence of per‑token log‑probabilities, which are trained with a standard cross‑entropy loss.
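The training objective described above is ordinary token-level cross-entropy over the decoder's log-probabilities. A minimal numpy sketch, with illustrative shapes rather than the actual implementation:

```python
import numpy as np

def cross_entropy(log_probs: np.ndarray, targets: np.ndarray) -> float:
    """Token-level cross-entropy.

    log_probs: (seq_len, vocab_size) log-softmax outputs from the decoder.
    targets:   (seq_len,) ground-truth LaTeX token ids.
    """
    # pick out the log-probability assigned to each correct token, then
    # average the negative log-likelihood over the sequence
    return float(-log_probs[np.arange(len(targets)), targets].mean())
```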
Training is performed on the UniMER‑1M dataset (≈1 M image‑LaTeX pairs) with extensive augmentations (morphological operations, affine transforms, Gaussian noise, brightness/contrast changes, and weather‑like perturbations). The optimizer is AdamW (β₁ = 0.9, β₂ = 0.999, weight decay = 0.05), with a learning rate of 1e‑5, a linear warm‑up of 5 k steps, and cosine annealing thereafter. The total training budget is 1 × 10⁵ steps on a single NVIDIA A40 (46 GB), but the model and optimizer together occupy only 230 MB, making it feasible to fine‑tune on consumer‑grade GPUs such as an RTX 3090.
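The learning-rate schedule above (linear warm-up for 5 k steps, then cosine annealing over the 100 k-step budget) can be written down directly. The annealing floor of zero is an assumption; the paper may use a nonzero minimum rate.

```python
import math

PEAK_LR = 1e-5      # peak learning rate from the paper
WARMUP = 5_000      # linear warm-up steps
TOTAL = 100_000     # total training steps

def lr_at(step: int) -> float:
    """Learning rate at a given step: linear warm-up, then cosine annealing."""
    if step < WARMUP:
        return PEAK_LR * step / WARMUP
    # progress through the annealing phase, in [0, 1]
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```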
Evaluation uses the Character Detection Matching (CDM) metric, which aligns visual tokens rather than enforcing strict string equality, making it more suitable for LaTeX, where multiple syntactically equivalent representations exist. On the UniMER‑Test split (23 k samples covering simple printed, complex printed, screen‑captured, and handwritten expressions), Texo achieves an overall CDM of 0.902, with category scores of 0.958 (SPE), 0.825 (CPE), 0.882 (SCE), and 0.902 (HWE). While slightly below UniMERNet‑T (per‑category scores 0.991 to 0.933) and PPFormulaNet‑S (0.949 to 0.818), Texo's performance is remarkable given its 20 M parameter budget, approximately 35% of PPFormulaNet‑S's size.
The vocabulary reduction also cuts the average output token length by roughly half, which directly speeds up autoregressive inference. Measured on a single A40 GPU with batch size = 1, Texo processes a sample in 311 ms, a 7× speed‑up over UniMERNet‑T (2266 ms), and is only modestly slower than PPFormulaNet‑S (217 ms), which relies on a multi‑token parallel prediction scheme that trades accuracy for speed.
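A quick back-of-envelope check of the reported latencies: since autoregressive decoding cost grows with output length, halving the token count roughly halves decode time, and the measured numbers bear out the cited speed-up.

```python
# Latencies reported in the text (single A40, batch size 1, milliseconds)
unimernet_ms, texo_ms, ppformulanet_ms = 2266, 311, 217

# speed-up of Texo over UniMERNet-T, matching the ~7x figure cited above
speedup_vs_unimernet = unimernet_ms / texo_ms

# how much slower Texo is than the parallel-decoding PPFormulaNet-S
slowdown_vs_ppformulanet = texo_ms / ppformulanet_ms
```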
For deployment, the authors export the best checkpoint to ONNX, serve it with Transformers.js, and wrap the inference in a Web Worker to keep the UI responsive. The front‑end is built with Vue/Nuxt and provides real‑time conversion to LaTeX, MathML, and a WYSIWYG editor, all running entirely in the browser. This design eliminates server costs, removes network latency, and guarantees user privacy because no image data leaves the client device.
In conclusion, Texo demonstrates that careful vocabulary engineering, reuse of an efficient CNN backbone, and a slim transformer decoder can produce a MER model that is both lightweight and high‑performing. Limitations include a modest inference speed gap relative to the fastest existing model and a focus on formula recognition rather than broader document OCR. Future work may extend the parameter‑efficient architecture to general document understanding tasks, integrate layout analysis, or further improve robustness to noisy handwritten inputs.