Lost in Localization: Building RabakBench with Human-in-the-Loop Validation to Measure Multilingual Safety Gaps

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large language models (LLMs) often fail to maintain safety in low-resource language varieties, such as code-mixed vernaculars and regional dialects. We introduce RabakBench, a multilingual safety benchmark and scalable pipeline localized to Singapore’s unique linguistic landscape, covering Singlish, Chinese, Malay, and Tamil. We construct the benchmark through a three-stage pipeline: (1) Generate: augmenting real-world unsafe web content via LLM-driven red teaming; (2) Label: applying semi-automated multi-label annotation using majority-voted LLM labelers; and (3) Translate: performing high-fidelity, toxicity-preserving translation. The resulting dataset contains over 5,000 examples across six fine-grained safety categories. Despite using LLMs for scalability, our framework maintains rigorous human oversight, achieving 0.70-0.80 inter-annotator agreement. Evaluations of 13 state-of-the-art guardrails reveal significant performance degradation, underscoring the need for localized evaluation. RabakBench provides a reproducible framework for building safety benchmarks in underserved communities.


💡 Research Summary

The paper addresses a critical gap in the safety evaluation of large language models (LLMs): most existing safety benchmarks focus on standard English and overlook low‑resource language varieties, especially code‑mixed vernaculars and regional dialects. Singapore, with its four official languages—English (in the form of Singlish), Mandarin Chinese, Malay, and Tamil—provides a natural laboratory for studying these challenges.

The authors introduce RabakBench, a multilingual safety benchmark specifically localized to Singapore’s linguistic landscape, and a three‑stage pipeline—Generate, Label, Translate—that combines the scalability of LLMs with continuous human‑in‑the‑loop (HITL) verification to ensure cultural and semantic fidelity.

Stage 1: Generate
Real‑world Singlish content is harvested from local forums and social media. The raw posts are transformed into instruction‑style prompts using template‑based methods to make them compatible with safety classifiers. To surface edge cases that typical guardrails miss, an automated red‑team framework is deployed. Two attack LLMs (GPT‑4o and DeepSeek‑R1) generate adversarial Singlish prompts aimed at causing false negatives (undetected harms) and false positives (benign content flagged). Five baseline guardrails—LionGuard, OpenAI Moderation, AWS Bedrock Guardrails, Azure AI Content Safety, and LlamaGuard—are probed in an iterative multi‑agent loop inspired by PAIR. Every generated candidate is screened by human reviewers to discard nonsensical or culturally inconsistent outputs. This hybrid approach yields a rich corpus comprising organic web‑scraped examples, template‑augmented variants, and adversarially generated failure cases.
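The iterative attacker–guardrail loop can be sketched as follows. This is a minimal illustration of the PAIR-style approach described above, not the authors' implementation; `attacker` and `guardrail` are hypothetical callables standing in for the attack LLM (e.g. GPT‑4o) and the probed moderation system.

```python
def red_team_loop(attacker, guardrail, seed, target="false_negative", max_iters=5):
    """Iteratively rewrite a seed prompt until the guardrail fails.

    target="false_negative": find harmful text the guardrail does NOT flag.
    target="false_positive": find benign text the guardrail DOES flag.
    """
    candidate, history = seed, []
    for _ in range(max_iters):
        candidate = attacker(candidate, history)  # attack LLM proposes a rewrite
        flagged = guardrail(candidate)            # probe the moderation system
        history.append((candidate, flagged))      # feedback for the next attempt
        success = (not flagged) if target == "false_negative" else flagged
        if success:
            return candidate, history  # candidate goes on to human screening
    return None, history
```

In the paper's pipeline, every candidate that survives this loop is then screened by human reviewers before entering the corpus.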

Stage 2: Label
Given the high cost of manual annotation for Singlish, the authors adopt a weak‑supervision strategy using LLMs as surrogate annotators. Six candidate LLMs (o3‑mini‑low, Gemini 2.0 Flash, Claude 3.5 Haiku, Llama 3.3 70B, Mistral Small 3, and AWS Nova Lite) are benchmarked against a gold set of 50 examples labeled by six human experts fluent in Singlish. The Alt‑Test methodology quantifies the “Average Advantage Probability” (AAP), measuring how often an LLM aligns with the human panel better than a randomly chosen human. Gemini 2.0 Flash, o3‑mini‑low, and Claude 3.5 Haiku achieve the highest AAP and Cohen’s κ scores (0.68‑0.72), indicating substantial agreement with humans. These three models then label the entire Singlish set, providing binary yes/no judgments for each of six fine‑grained harm categories (Hate, Sexual, Minor‑Unsuitable, Self‑Harm, Insults, Physical Violence) and, where applicable, two severity levels. Majority voting across the three LLM outputs yields the final label, achieving inter‑annotator agreement in the 0.70‑0.80 range. The result is a corpus of 1,341 Singlish examples with multi‑label annotations.
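The majority‑vote step over the three selected annotator LLMs can be sketched as below. The category keys are simplified versions of the paper's six‑category taxonomy; this is an illustrative sketch, not the released labeling code.

```python
CATEGORIES = ["hate", "sexual", "minor_unsuitable",
              "self_harm", "insults", "physical_violence"]

def majority_label(annotations):
    """Combine per-category yes/no judgments from several LLM annotators.

    annotations: list of dicts mapping category -> bool, one per annotator.
    Returns the final multi-label dict: a category is positive when more
    than half of the annotators (i.e. at least 2 of 3) marked it positive.
    """
    n = len(annotations)
    return {cat: sum(a[cat] for a in annotations) > n / 2 for cat in CATEGORIES}
```

With three annotators this reduces to a simple two-of-three vote per category, which is what produces the single consolidated multi-label record per example.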

Stage 3: Translate
To extend the benchmark to Mandarin Chinese, Malay, and Tamil, the authors focus on preserving both semantic content and toxicity. They construct a few‑shot translation prompt pool of 20 human‑verified examples per target language, created through a three‑round expert workshop: (1) initial selection from LLM outputs (GPT‑4o mini, DeepSeek‑R1, Gemini 2.0 Flash) or manual authoring, (2) preference filtering, and (3) final consensus. Several translation LLMs (Gemini 2.0 Flash, Grok 3 Beta Mini, DeepSeek‑R1, GPT‑4o mini) are evaluated using two metrics: direct semantic similarity (cosine similarity between source Singlish and target translation) and back‑translation consistency (similarity after translating back to Singlish). Embeddings are generated with text‑embedding‑3‑large. Prompt optimization experiments vary the number of few‑shot examples (k = 5, 10, 15, 20) and rank them by similarity to the input. Optimal k values are 15 for Chinese, 10 for Malay, and 20 for Tamil. The best translation model (Gemini 2.0 Flash) reaches direct similarity scores of 63‑66 % and back‑translation scores above 70 % for Chinese and Malay, substantially outperforming a baseline.
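The two translation‑quality metrics can be sketched as follows. The `embed` parameter is a stand‑in for the text‑embedding‑3‑large encoder used in the paper; any sentence‑embedding function with the same interface would work in this sketch.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def translation_quality(embed, source, translation, back_translation):
    """Score a candidate translation on the paper's two metrics.

    direct_similarity: cosine similarity between the source and the
        target-language translation, in a shared embedding space.
    back_translation_consistency: similarity between the source and its
        round-trip back-translation, which penalizes meaning (and toxicity)
        lost in either direction.
    """
    e_src = embed(source)
    return {
        "direct_similarity": cosine(e_src, embed(translation)),
        "back_translation_consistency": cosine(e_src, embed(back_translation)),
    }
```

Scores near 1.0 indicate the translation preserved the source's meaning; a large gap between the two metrics suggests the forward or backward translation drifted.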

Benchmark Evaluation
Using RabakBench, the authors evaluate 13 contemporary moderation systems, including commercial guardrails and open‑source models. Across all four languages, performance degrades markedly compared to English‑centric evaluations. False‑negative rates increase by 20‑40 percentage points, and many systems incorrectly flag culturally benign code‑mixed expressions as harmful. The results highlight a “localization blind spot” in current safety alignment pipelines.
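The per‑language degradation reported above can be quantified with a simple false‑negative‑rate comparison. The sketch below assumes binary flagged/unflagged guardrail outputs and binary gold labels, and uses a hypothetical `"english"` baseline key; the paper's actual evaluation harness is more detailed.

```python
def false_negative_rate(predictions, labels):
    """Fraction of truly harmful examples the guardrail failed to flag."""
    positives = [p for p, y in zip(predictions, labels) if y]
    if not positives:
        return 0.0
    return sum(1 for p in positives if not p) / len(positives)

def fnr_gap(preds_by_language, labels_by_language, baseline="english"):
    """Percentage-point FNR increase of each language over the baseline."""
    base = false_negative_rate(preds_by_language[baseline],
                               labels_by_language[baseline])
    return {lang: 100 * (false_negative_rate(preds, labels_by_language[lang]) - base)
            for lang, preds in preds_by_language.items() if lang != baseline}
```

A gap of 20‑40 for a given language, as observed in the paper, means the guardrail misses 20‑40 percentage points more harmful content in that language than in the English baseline.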

Contributions

  1. Release of the first open safety benchmark covering Singaporean language variants with fine‑grained, multi‑label harm taxonomy.
  2. A reproducible Generate‑Label‑Translate framework that integrates human verification at every critical step.
  3. A systematic, toxicity‑preserving translation methodology applicable to other low‑resource languages.
  4. Comprehensive analysis of 13 guardrails, exposing significant gaps in multilingual safety performance.

All data (5,000+ examples), human‑verified translations, and evaluation code are publicly available on Hugging Face and GitHub, enabling other researchers to replicate the pipeline for different underserved linguistic communities. The work underscores that robust multilingual LLM safety requires localized, culturally aware datasets and that scalable human‑in‑the‑loop processes are essential for high‑quality benchmark creation.

