AIReg-Bench: Benchmarking Language Models That Assess AI Regulation Compliance


As governments move to regulate AI, there is growing interest in using Large Language Models (LLMs) to assess whether or not an AI system complies with a given AI Regulation (AIR). However, there is presently no way to benchmark the performance of LLMs at this task. To fill this void, we introduce AIReg-Bench: the first open benchmark dataset designed to test how well LLMs can assess compliance with the EU AI Act (AIA). We created this dataset through a two-step process: (1) by prompting an LLM with carefully structured instructions, we generated 120 technical documentation excerpts (samples), each depicting a fictional, albeit plausible, AI system – of the kind an AI provider might produce to demonstrate their compliance with AIR; (2) legal experts then reviewed and annotated each sample to indicate whether, and in what way, the AI system described therein violates specific Articles of the AIA. The resulting dataset, together with our evaluation of whether frontier LLMs can reproduce the experts’ compliance labels, provides a starting point to understand the opportunities and limitations of LLM-based AIR compliance assessment tools and establishes a benchmark against which subsequent LLMs can be compared. The dataset and evaluation code are available at https://github.com/camlsys/aireg-bench.


💡 Research Summary

AIReg‑Bench introduces the first publicly available benchmark for evaluating large language models (LLMs) on the task of assessing compliance with the European Union’s AI Act (AIA). The authors address a clear gap: while governments are rolling out AI regulations and there is growing interest in using LLMs to automate compliance assessments, no standardized dataset existed to measure how well these models perform.

The benchmark consists of 120 synthetic technical documentation excerpts that describe fictional high‑risk AI (HRAI) systems. The excerpts were generated through a three‑stage pipeline built around gpt‑4.1‑mini: (1) high‑level system overviews for eight distinct use cases (e.g., road traffic control, credit scoring); (2) “compliance profiles” that pair each system overview with a specific AIA article (Article 9, 10, 12, 14, or 15) and indicate whether the system should be compliant or non‑compliant; (3) full technical excerpts derived from the overview and article context. The pipeline includes a steering mechanism that deliberately biases one‑third of the samples toward compliance and the remainder toward non‑compliance, producing a controlled label distribution weighted toward violations.
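
Viewed as code, the three‑stage pipeline with label steering might look roughly like the following minimal sketch. This is an illustration only: `call_llm` is a hypothetical stand‑in for the gpt‑4.1‑mini API, and the prompt texts and use‑case list are invented, not taken from the paper.

```python
import random

AIA_ARTICLES = [9, 10, 12, 14, 15]
USE_CASES = ["road traffic control", "credit scoring"]  # the paper uses eight; two shown here


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a gpt-4.1-mini API call."""
    return f"<generated text for: {prompt[:40]}...>"


def steered_labels(n: int, compliant_fraction: float = 1 / 3, seed: int = 0):
    """Steering mechanism: fix the label mix up front, then shuffle."""
    n_compliant = round(n * compliant_fraction)
    labels = ["compliant"] * n_compliant + ["non-compliant"] * (n - n_compliant)
    random.Random(seed).shuffle(labels)
    return labels


def generate_dataset(n_samples: int = 120):
    labels = steered_labels(n_samples)
    samples = []
    for i, label in enumerate(labels):
        use_case = USE_CASES[i % len(USE_CASES)]
        article = AIA_ARTICLES[i % len(AIA_ARTICLES)]
        # Stage 1: high-level system overview for the use case.
        overview = call_llm(f"Describe a high-risk AI system for {use_case}.")
        # Stage 2: compliance profile pairing overview, article, and target label.
        profile = call_llm(
            f"Given: {overview} Explain how it is {label} with AIA Article {article}."
        )
        # Stage 3: full technical documentation excerpt.
        excerpt = call_llm(f"Expand into a technical documentation excerpt: {profile}")
        samples.append({"article": article, "label": label, "excerpt": excerpt})
    return samples
```

With `n_samples=120` and a one‑third compliance fraction, this yields 40 samples steered toward compliance and 80 toward non‑compliance.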

Human annotation was performed by six legal experts (law graduates, law students, and qualified lawyers) specializing in AI regulation. Each excerpt was reviewed by three annotators, who assigned a 1‑5 Likert score for “probability of compliance” and provided a free‑text justification. The median plausibility score was 4, indicating that most generated excerpts were judged realistic. Inter‑rater reliability, measured by Krippendorff’s α, was 0.651 (rising to 0.786 when two outlier annotators were excluded), reflecting moderate agreement and highlighting the inherent subjectivity of legal judgment, especially for Articles 10 and 15.
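
For reference, Krippendorff’s α compares observed disagreement within items to the disagreement expected by chance. The sketch below uses the interval distance metric and assumes complete ratings per item; this is a simplification, since for 1‑5 Likert data the ordinal metric is usually preferred and the paper does not specify its exact computation.

```python
def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval distance metric.

    `units` is a list of items, each a list of numeric ratings from
    different annotators. Items with fewer than two ratings are skipped.
    """
    pairable = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in pairable)  # total number of pairable ratings

    # Observed disagreement: squared differences within each item.
    d_obs = 0.0
    for u in pairable:
        within = sum((a - b) ** 2 for i, a in enumerate(u)
                     for j, b in enumerate(u) if i != j)
        d_obs += within / (len(u) - 1)
    d_obs /= n

    # Expected disagreement: squared differences over all rating pairs.
    values = [v for u in pairable for v in u]
    d_exp = sum((a - b) ** 2 for i, a in enumerate(values)
                for j, b in enumerate(values) if i != j)
    d_exp /= n * (n - 1)

    return 1.0 - d_obs / d_exp
```

Perfect agreement yields α = 1, chance-level agreement yields α ≈ 0, and systematic disagreement drives α negative.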

To demonstrate the benchmark’s utility, the authors evaluated ten frontier LLMs, including Gemini 2.5 Pro, GPT‑4o, and Claude 3, by prompting each model with the same documentation, system description, and AIA article given to the human annotators. Models were restricted to the provided text (no external web search). Performance was quantified using quadratically weighted Cohen’s κ, with scores ranging from 0.62 to 0.86. Gemini 2.5 Pro achieved the highest agreement with the experts (κ = 0.856), while other models showed varying degrees of alignment. Notably, performance gaps widened on articles requiring nuanced interpretation (e.g., Articles 10 and 15).
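
Quadratically weighted κ penalizes each disagreement by the squared distance between the two ratings, so a model that is off by one Likert point is penalized far less than one that is off by four. A minimal self‑contained implementation for a 1‑5 scale (standard formulation, not code from the paper):

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating=1, max_rating=5):
    """Cohen's kappa with quadratic weights over an ordinal rating scale."""
    k = max_rating - min_rating + 1

    # Observed confusion matrix between the two raters.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1
    n = len(rater_a)

    # Marginal histograms give the chance-expected matrix.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic penalty
            expected = hist_a[i] * hist_b[j] / n
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den
```

Perfect agreement gives κ = 1, chance agreement gives κ ≈ 0, and perfectly inverted ratings give κ = −1.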

The paper’s contributions are threefold: (1) an open‑source LLM‑based generation pipeline that can be reused for other regulatory domains; (2) the AIReg‑Bench dataset, complete with expert annotations and plausibility assessments; (3) an initial empirical evaluation of state‑of‑the‑art LLMs on AIA compliance assessment. The authors also discuss limitations: reliance on synthetic documents may not capture the full complexity of real‑world compliance dossiers; annotation subjectivity limits the absolute ground truth; and the experimental setup excludes external knowledge retrieval, which auditors often employ.

Future work is suggested in several directions: validating the benchmark against actual corporate AI documentation, extending the label schema to cover technical, ethical, and governance dimensions, exploring semi‑automated annotation methods, and integrating retrieval‑augmented generation to allow LLMs to consult external legal texts. Moreover, adapting the pipeline to other jurisdictions (e.g., the U.S. AI Bill of Rights, Korean AI guidelines) could broaden its impact.

In summary, AIReg‑Bench provides a solid, reproducible foundation for the emerging field of AI‑regulation automation. It demonstrates that contemporary LLMs can already approximate human expert assessments to a notable degree, while also exposing clear areas for improvement. The benchmark is poised to become a standard reference for researchers, regulators, and industry practitioners seeking to develop trustworthy, scalable compliance tools.

