ILSIC: Corpora for Identifying Indian Legal Statutes from Queries by Laypeople

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Legal Statute Identification (LSI), i.e., identifying the statutes relevant to a given situation, is one of the most fundamental tasks in Legal NLP. The task has traditionally been modeled using facts from court judgments as input queries, owing to their abundance. In practical settings, however, input queries are likely to be informal questions posed by laypersons (non-professionals). While a few laypeople LSI datasets exist, there has been little research exploring the differences between court and laypeople data for LSI. In this work, we create ILSIC, a corpus of laypeople queries covering 500+ statutes from Indian law. The corpus also contains court case judgments, enabling researchers to compare court and laypeople data for LSI directly. We conduct extensive experiments on our corpus, including benchmarking on the laypeople dataset with zero- and few-shot inference, retrieval-augmented generation, and supervised fine-tuning. We observe that models trained purely on court judgments are ineffective when tested on laypeople queries, while transfer learning from court to laypeople data can be beneficial in certain scenarios. We also conduct fine-grained analyses of our results in terms of query categories and statute frequency.


💡 Research Summary

The paper addresses a critical gap in Legal Natural Language Processing (Legal NLP) concerning the mismatch between the formal language of court judgments, which have traditionally been used to train Legal Statute Identification (LSI) models, and the informal, everyday language used by laypeople when seeking legal advice. To bridge this gap, the authors introduce ILSIC, a two‑part corpus focused on Indian law.

ILSIC‑Lay consists of 8,127 real‑world queries collected from the Indian legal forum kaanoon.com, each paired with one or more statutes cited by lawyers in their responses. The authors built an extraction pipeline that uses GPT‑3.5‑Turbo prompts to pull statute references from lawyer answers, followed by extensive regular‑expression and fuzzy‑matching normalization to map diverse citations to a canonical form. Manual verification by a law graduate on 50 random samples confirmed >95 % coverage, establishing high label quality. Queries were anonymized using the OpenNyAI model to mask personal entities.

ILSIC‑Multi is designed for direct comparison between laypeople and court‑derived inputs. It focuses on a common set of 399 statutes that appear both in ILSIC‑Lay and in a collection of 50,000 Indian Supreme Court and High Court judgments. From the judgments, the factual portions were extracted (13,930 documents) and anonymized, yielding queries that are on average four times longer than lay queries. ILSIC‑Multi provides (i) a “lay train/validation” split (5,793/735 queries) drawn from ILSIC‑Lay, (ii) a “court train/validation” split (12,930/1,652 queries) drawn from judgments, and (iii) a test set of 757 lay queries.

The experimental protocol evaluates both retrieval‑based and generative approaches. Retrieval baselines include BM25 (sparse keyword matching), SBERT (dense sentence‑level BERT embeddings), and SAILER (a legal‑domain pre‑trained dense retriever). For each query, the top‑k candidates (with k tuned on validation) are taken as predictions and evaluated with micro‑ and macro‑F1. Generative baselines comprise GPT‑4.1, Llama‑3, and Gemma‑3, tested under (a) zero‑shot prompting, (b) few‑shot prompting, (c) Retrieval‑Augmented Generation (RAG), where retrieved statutes are fed into the prompt, and (d) supervised fine‑tuning (SFT) on the various training splits. Transfer experiments also explore sequential fine‑tuning: first on court data, then on lay data.
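
To make the evaluation protocol concrete, here is a minimal, stdlib-only sketch of micro- and macro-averaged F1 for multi-label statute prediction, where each query's prediction set is the retriever's top-k candidates. This is a generic implementation of the standard metrics, not the authors' evaluation code.

```python
from collections import defaultdict

def micro_macro_f1(gold: list[set[str]], pred: list[set[str]]) -> tuple[float, float]:
    """Micro- and macro-averaged F1 for multi-label statute prediction.

    gold/pred: per-query sets of statute labels (pred = top-k retrieved).
    Micro pools TP/FP/FN over all queries; macro averages per-statute F1.
    """
    tp = fp = fn = 0
    per_label = defaultdict(lambda: [0, 0, 0])  # label -> [tp, fp, fn]
    for g, p in zip(gold, pred):
        for s in p & g:   # correctly predicted statutes
            tp += 1; per_label[s][0] += 1
        for s in p - g:   # spurious predictions
            fp += 1; per_label[s][1] += 1
        for s in g - p:   # missed gold statutes
            fn += 1; per_label[s][2] += 1
    micro = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    f1s = [2 * t / (2 * t + f + n) if t + f + n else 0.0
           for t, f, n in per_label.values()]
    macro = sum(f1s) / len(f1s) if f1s else 0.0
    return micro, macro
```

Note that macro-F1 weights every statute equally, which is why it is the more sensitive metric for the rare-statute degradation the paper reports.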

Key findings: (1) Models trained solely on court facts perform poorly on lay queries, achieving near‑random F1 scores, highlighting the severe domain shift caused by differences in length, vocabulary, and syntactic style. (2) Fine‑tuning on lay queries alone yields modest improvements; even the strongest model (GPT‑4.1) does not exceed 35 % micro‑F1, indicating the intrinsic difficulty of the task. (3) Sequential transfer learning (court → lay) provides limited gains, observable only for Llama‑3, suggesting that knowledge from formal judgments does not readily transfer to informal queries without substantial adaptation. (4) Performance degrades sharply for rare statutes across all models, while the correlation between statute frequency and query category (e.g., family law, tax law) is weaker. (5) Retrieval‑augmented generation does not close the gap, underscoring that simply providing candidate statutes to an LLM is insufficient when the input query is informal.
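
The frequency analysis behind finding (4) can be sketched as a simple binning of statutes by how often they occur in the training labels, after which per-bucket F1 can be computed by restricting gold and predicted label sets to each bucket. The thresholds below are illustrative assumptions; the paper's actual binning scheme is not shown here.

```python
from collections import Counter

def frequency_bucket_fn(train_labels: list[set[str]], edges=(10, 100)):
    """Build a function mapping each statute to a training-frequency bucket.

    edges are illustrative thresholds: rare (<10), medium (10-99),
    frequent (>=100 training occurrences).
    """
    counts = Counter(s for labels in train_labels for s in labels)

    def bucket(statute: str) -> str:
        c = counts[statute]  # Counter returns 0 for unseen statutes
        if c < edges[0]:
            return "rare"
        if c < edges[1]:
            return "medium"
        return "frequent"

    return counts, bucket
```

A statute never seen in training falls into the "rare" bucket, which is exactly the regime where all evaluated models degrade most sharply.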

The paper contributes a high‑quality, publicly released dataset (code, splits, and fine‑tuned model checkpoints are available at https://github.com/Law-AI/ilsic), enabling systematic study of LSI under realistic, user‑centric conditions. It also demonstrates that existing LLMs, even state‑of‑the‑art ones like GPT‑4.1, are not yet ready for reliable statute identification from lay queries without further domain‑specific pre‑training or specialized adaptation techniques. Suggested future work includes expanding to multilingual queries, developing legal‑domain pre‑training for LLMs, and exploring meta‑learning or few‑shot strategies to better handle low‑frequency statutes.

