LEMUR: A Corpus for Robust Fine-Tuning of Multilingual Law Embedding Models for Retrieval

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative wording, please refer to the original arXiv source.

Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages. We quantify the fidelity of PDF-to-text conversion by measuring lexical consistency against authoritative HTML versions using the Lexical Content Score (LCS). Building on LEMUR, we fine-tune three state-of-the-art multilingual embedding models using contrastive objectives in both monolingual and bilingual settings, reflecting realistic legal-retrieval scenarios. Experiments across low- and high-resource languages demonstrate that legal-domain fine-tuning consistently improves Top-k retrieval accuracy relative to strong baselines, with particularly pronounced gains for low-resource languages. Cross-lingual evaluations show that these improvements transfer to unseen languages, indicating that fine-tuning primarily enhances language-independent, content-level legal representations rather than language-specific cues. We publish code (https://github.com/nargesbh/eur_lex) and data (https://huggingface.co/datasets/G4KMU/LEMUR).


💡 Research Summary

The paper introduces LEMUR, a large‑scale multilingual corpus specifically designed for fine‑tuning legal embedding models for semantic retrieval. The authors harvested all official EU environmental legislation (Category 15, Sub‑category 10) from the EUR‑Lex portal, resulting in 1,174 distinct legal acts spanning 1961‑2025. Because each act is published in all 25 official EU languages, the final collection comprises 24,953 PDF files (≈461 k pages).

A central challenge is the noisy conversion of PDFs—often containing multi‑column layouts, tables, and footnotes—into plain text. The authors evaluated several conversion tools (Docling, Unstructured, PyMuPDF) and found that the OCR‑based pipeline olmOCR produced the highest quality output. To quantify conversion fidelity, they propose the Lexical Content Score (LCS), a cosine similarity computed on bag‑of‑words vectors after aggressive normalization of both the HTML reference and the extracted text. High‑resource languages (English, German, French) achieve LCS > 0.95, while low‑resource languages (Latvian, Maltese) reach ≈0.90 and 0.80 respectively, indicating that the extracted texts retain most of the original lexical content.
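The LCS as described is a cosine similarity over bag-of-words count vectors built from normalized text. A minimal sketch follows; the exact normalization steps (lowercasing, stripping punctuation and digits) are assumptions, as the paper's precise pipeline is not reproduced here.

```python
import re
from collections import Counter
from math import sqrt

def normalize(text: str) -> list[str]:
    # Assumed normalization: lowercase, drop punctuation and digits,
    # split on whitespace.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    text = re.sub(r"\d+", " ", text)
    return text.split()

def lexical_content_score(reference: str, extracted: str) -> float:
    """Cosine similarity between bag-of-words count vectors of the
    HTML reference text and the PDF-extracted text."""
    ref_counts = Counter(normalize(reference))
    ext_counts = Counter(normalize(extracted))
    vocab = set(ref_counts) | set(ext_counts)
    dot = sum(ref_counts[w] * ext_counts[w] for w in vocab)
    norm = (sqrt(sum(c * c for c in ref_counts.values()))
            * sqrt(sum(c * c for c in ext_counts.values())))
    return dot / norm if norm else 0.0

html_text = "Article 1: Member States shall ensure compliance."
pdf_text = "Article 1 Member States shall ensure compliance"
print(lexical_content_score(html_text, pdf_text))
```

Because the score ignores word order and punctuation, it tolerates layout differences between HTML and PDF while still penalizing dropped or hallucinated content.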

For retrieval‑oriented training, each legal act is split into a short “metadata” block (act type, date, brief description, legal basis, etc.) and the remaining substantive text. The metadata serves as a realistic query, while the full act text is the target document. This yields a natural query‑document pair without any manual annotation. The dataset is partitioned 60 %/20 %/20 % for training, validation, and testing, ensuring that translations of the same act stay together across splits.
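Keeping all translations of an act in the same partition amounts to splitting at the level of act IDs rather than individual documents. A sketch under that assumption (function and variable names are illustrative, not from the paper's code):

```python
import random

def grouped_split(act_ids, seed=42, ratios=(0.6, 0.2, 0.2)):
    """Split at the level of legal acts, so that all language versions
    of the same act end up in the same partition."""
    acts = sorted(set(act_ids))
    random.Random(seed).shuffle(acts)
    n = len(acts)
    n_train, n_val = int(ratios[0] * n), int(ratios[1] * n)
    train = set(acts[:n_train])
    val = set(acts[n_train:n_train + n_val])
    test = set(acts[n_train + n_val:])
    return train, val, test

# Each (act_id, language) record then inherits the split of its act_id,
# so e.g. ("act_3", "mt") and ("act_3", "de") can never be separated.
records = [(f"act_{i}", lang) for i in range(10)
           for lang in ("en", "de", "fr")]
train, val, test = grouped_split([a for a, _ in records])
```

This avoids the leakage that a naive per-document split would cause: a model could otherwise see the English version of an act in training and be tested on its French translation.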

Three state‑of‑the‑art multilingual embedding models are fine‑tuned on LEMUR: Qwen‑3‑0.6B, Qwen‑3‑4B, and E5‑Multilingual. Two fine‑tuning regimes are explored. (1) Monolingual contrastive fine‑tuning: each language is trained separately using a symmetric Multiple Negatives Ranking (MNR) loss, where the query‑document pair is positive and all other documents in the batch are negatives. (2) Bilingual multi‑positive contrastive fine‑tuning: leveraging the parallel nature of the corpus, a single query is paired with all language versions of the same act, treating them as multiple positives. The loss is extended to a grouped symmetric MNR that encourages a query embedding to be close to every aligned document embedding while staying far from unrelated documents. Training runs for up to 30 epochs with early stopping, uses a maximum sequence length of 2,048 tokens (512 for E5), bfloat16 precision, gradient checkpointing, and a linear warm‑up schedule. Training hardware includes RTX A6000 and A100 GPUs; fine‑tuning times range from ~30 minutes (E5) to 6–8 hours (Qwen‑3‑4B) per language.
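The symmetric MNR objective can be sketched in NumPy as an InfoNCE-style cross-entropy applied in both directions over the in-batch similarity matrix. This is a simplified stand-in for the actual training loss (the temperature value and the grouped multi-positive extension are assumptions):

```python
import numpy as np

def _log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_mnr_loss(q, d, temperature=0.05):
    """Symmetric Multiple Negatives Ranking loss on query embeddings q
    and document embeddings d, both of shape (batch, dim). Pair i is
    the positive; every other in-batch document acts as a negative.
    The bilingual multi-positive variant would instead mark all aligned
    language versions of an act as positives for the same query."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sim = (q @ d.T) / temperature          # (batch, batch) cosine similarities
    idx = np.arange(len(q))
    q2d = -_log_softmax(sim, axis=1)[idx, idx]  # query -> document direction
    d2q = -_log_softmax(sim, axis=0)[idx, idx]  # document -> query direction
    return float((q2d + d2q).mean() / 2)
```

Minimizing this pulls each metadata query toward its own act's embedding while pushing it away from the other acts in the batch, in both retrieval directions.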

Evaluation is performed by indexing the resulting embeddings in a FAISS vector database and measuring Top‑k retrieval accuracy (k = 1, 5, 10) for metadata queries. Across all languages, fine‑tuned models improve Top‑k accuracy by 8 %–15 % relative to the off‑the‑shelf baselines. Gains are especially pronounced for low‑resource languages, where improvements reach 12 %–20 %. The bilingual multi‑positive setting further boosts cross‑lingual transfer: models fine‑tuned on a subset of language pairs still outperform baselines on unseen languages, suggesting that the training encourages language‑independent, content‑level representations rather than over‑fitting to language‑specific surface forms.
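The Top-k metric itself is straightforward: a query "hits" if its gold act is among the k nearest documents. The sketch below uses exact brute-force cosine search in NumPy as a stand-in for the FAISS index used in the paper; the synthetic data is illustrative only.

```python
import numpy as np

def top_k_accuracy(query_emb, doc_emb, k=5):
    """Fraction of queries whose gold document (same row index) appears
    among the k nearest documents by cosine similarity. Exact NumPy
    search here stands in for a FAISS flat inner-product index."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = q @ d.T
    top_k = np.argsort(-sim, axis=1)[:, :k]     # k highest-scoring docs per query
    gold = np.arange(len(q))[:, None]
    return float((top_k == gold).any(axis=1).mean())

# Toy setup: queries are noisy versions of their target documents,
# mimicking metadata blocks that paraphrase the full act text.
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 32))
queries = docs + 0.1 * rng.normal(size=docs.shape)
for k in (1, 5, 10):
    print(f"Top-{k} accuracy: {top_k_accuracy(queries, docs, k=k):.2f}")
```

On normalized embeddings, inner-product search is equivalent to cosine search, which is why a flat inner-product index suffices for this evaluation.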

The authors’ contributions are fourfold: (1) the release of LEMUR, a 25‑language, 24 k‑document legal corpus with a rigorously evaluated PDF‑to‑text pipeline; (2) the introduction of the LCS metric for systematic assessment of conversion quality; (3) a comprehensive contrastive fine‑tuning framework that works both monolingually and bilingually, demonstrating substantial retrieval gains especially for under‑represented languages; (4) open‑source code and data (GitHub and Hugging Face) to ensure reproducibility and to foster further research in multilingual legal AI. The work bridges the gap between large‑scale pre‑training corpora and downstream retrieval benchmarks, offering a practical pathway for building robust, multilingual legal search systems and providing a template that can be adapted to other specialized domains where PDF‑based documents dominate.

