DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation
Contact languages like English exhibit rich regional variations in the form of dialects, which are often used by dialect speakers interacting with generative models. However, can multimodal generative models effectively produce content given dialectal textual input? In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects. We work with dialect speakers to collect and verify over 4200 unique prompts and evaluate on 17 image and video generative models. Our automatic and human evaluation results show that current state-of-the-art multimodal generative models exhibit 32.26% to 48.17% performance degradation when a single dialect word is used in the prompt. Common mitigation methods such as fine-tuning and prompt rewriting can only improve dialect performance by small margins (< 7%), while potentially incurring significant performance degradation in Standard American English (SAE). To this end, we design a general encoder-based mitigation strategy for multimodal generative models. Our method teaches the model to recognize new dialect features while preserving SAE performance. Experiments on models such as Stable Diffusion 1.5 show that our method is able to simultaneously raise performance on five dialects to be on par with SAE (+34.4%), while incurring near zero cost to SAE performance.
💡 Research Summary
The paper “DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation” investigates whether state‑of‑the‑art text‑to‑image and text‑to‑video models can reliably generate visual content when the textual prompt contains regional English dialect words rather than Standard American English (SAE). To answer this, the authors construct DialectGen, a large‑scale benchmark covering six common English dialects: Standard American English (SAE), British English (BrE), African American English (AAE), Chicano English (ChE), Indian English (InE), and Singaporean English (SgE).
Dataset creation begins with a systematic collection of 1,126 dialect lexemes from five reputable dictionaries (OED Regional, DARE, Singlish, Indian English, and AAE dictionaries). After filtering out potentially derogatory or culturally unique terms lacking SAE equivalents, the remaining lexemes are paired with exact SAE synonyms. Using GPT‑4o, the authors generate two prompt styles for each lexeme: a concise prompt (≤ 6 words) and a detailed prompt (≥ 9 words). The concise style mimics casual user input, while the detailed style reflects professional or descriptive usage. For each lexeme pair, the SAE word is swapped with its dialect counterpart, yielding a total of 6,552 raw prompt pairs.
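The lexeme-swap step above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's pipeline: in the actual dataset the two prompt styles are written by GPT‑4o, whereas here they are stand-in templates, and the function name and data layout are invented for the example.

```python
def build_prompt_pairs(lexemes):
    """For each {sae, dialect} synonym entry, produce a concise and a
    detailed SAE prompt, then swap in the dialect word to form the pair.
    (Stand-in templates below; the paper generates prompts with GPT-4o.)"""
    pairs = []
    for entry in lexemes:
        sae_word, dialect_word = entry["sae"], entry["dialect"]
        concise = f"a photo of a {sae_word}"  # concise style: <= 6 words
        detailed = f"a detailed photo of a shiny new {sae_word} parked outside"  # detailed style: >= 9 words
        for sae_prompt in (concise, detailed):
            # The only difference between the two prompts is the single lexeme.
            dialect_prompt = sae_prompt.replace(sae_word, dialect_word)
            pairs.append({"sae": sae_prompt, "dialect": dialect_prompt})
    return pairs

# Example pair from the paper: AAE "whip" <-> SAE "car".
pairs = build_prompt_pairs([{"sae": "car", "dialect": "whip"}])
```

Because each pair differs in exactly one word, any downstream score gap can be attributed to that lexeme alone.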
Human validation is performed by dialect‑native annotators recruited via Amazon MTurk. Annotators first self‑declare their dialect background and then pass a dialect‑speaker assessment quiz to ensure correct matching. Each prompt pair is evaluated by two independent annotators who must answer “Yes” to “Does the dialect prompt make sense and convey exactly the same meaning as the SAE prompt?” and “No” to “Is the dialect prompt ambiguous (i.e., could it be interpreted in SAE)?” Only pairs satisfying both criteria are retained, resulting in a final set of 4,200 high‑quality, non‑ambiguous, synonym‑matched prompts.
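The retention rule reduces to a one-line filter over the annotations. The field names (`same_meaning`, `ambiguous`) are hypothetical; the paper specifies only the two annotator questions.

```python
def keep_pair(annotations):
    """Retain a prompt pair only if every annotator judged the dialect
    prompt synonymous with the SAE prompt AND not interpretable in SAE."""
    return all(a["same_meaning"] and not a["ambiguous"] for a in annotations)

# Two annotators agree on both criteria -> keep; any disagreement -> drop.
keep = keep_pair([{"same_meaning": True, "ambiguous": False},
                  {"same_meaning": True, "ambiguous": False}])
```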
The benchmark is used to evaluate 17 widely used multimodal generative models, including multiple versions of Stable Diffusion (1.4, 1.5, 2.1, XL, 3, 3.5 Large, 3.5 Large Turbo), Flux.1 (dev), DALL·E Mini, DALL·E 2, DALL·E 3 (with and without prompt rewriting), GPT‑4‑image, and several text‑to‑video systems (Cosmos‑1, Open‑Sora, VideoCrafter‑2, CogVideoX, Wan 2.1). Automatic evaluation employs reference‑free image‑text alignment metrics: VQAScore and CLIPScore. For each prompt, n images (or video frames) are generated, and the average alignment score is computed for both the SAE prompt and the dialect prompt; in both cases the generated output is scored against the SAE prompt text, so any score gap isolates the effect of the single dialect word. Human evaluation is conducted on a 5% random sample, where three external annotators rate visual‑caption alignment on a 0–10 scale, later rescaled to match the automatic metric ranges.
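Under this protocol, the per-prompt score and the relative dialect degradation can be sketched as follows. Here `generate` and `score` are placeholders for a generative model and an alignment metric such as VQAScore or CLIPScore; the function names and the choice of `n` are illustrative, not from the paper.

```python
def avg_alignment(prompt, reference, generate, score, n=4):
    """Generate n outputs for `prompt` and average their alignment with
    `reference` (always the SAE prompt text, per the evaluation setup)."""
    return sum(score(generate(prompt), reference) for _ in range(n)) / n

def dialect_degradation(sae_prompt, dialect_prompt, generate, score, n=4):
    """Relative score drop when the dialect prompt replaces the SAE one."""
    sae = avg_alignment(sae_prompt, sae_prompt, generate, score, n)
    dia = avg_alignment(dialect_prompt, sae_prompt, generate, score, n)
    return (sae - dia) / sae  # e.g. 0.32 ... 0.48 for the models tested

# Stub model/metric just to exercise the arithmetic.
generate = lambda p: p
score = lambda img, ref: 1.0 if img == ref else 0.6
drop = dialect_degradation("a photo of a car", "a photo of a whip",
                           generate, score)
```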
Results show a dramatic performance drop when a single dialect word replaces its SAE counterpart: overall drops range from 32 % to 48 % across models, with the most severe degradation observed in text‑to‑video models (up to 48 % for Wan 2.1). Even the most advanced diffusion models (e.g., Stable Diffusion 3.5 Large Turbo) suffer ~30 % degradation. The issue is especially pronounced for polysemous dialect terms (e.g., “whip” in AAE meaning “car”), where models default to the SAE meaning regardless of context.
Baseline mitigation strategies—fine‑tuning on dialect data, prompt rewriting, and simple data augmentation—provide modest gains (< 7% improvement) but often cause a noticeable drop in SAE performance (5–15%). To address this, the authors propose a general encoder‑based mitigation strategy with two components: (1) a dialect‑aware text encoder fine‑tuned to map dialect lexemes onto their SAE equivalents while preserving the original SAE embeddings, and (2) a KL‑regularization loss computed on SAE image‑caption pairs (e.g., from MS‑COCO) that constrains how far dialect fine‑tuning can shift the encoder's output distribution. This dual objective teaches the model to recognize dialect features without altering its overall generative behavior.
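A stdlib-only sketch of such a dual objective is below. The concrete losses chosen here (mean squared error for embedding alignment, KL divergence between softmaxed logits for the regularizer) and all function names are assumptions for illustration; the paper's exact formulation and weighting may differ.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dual_loss(dialect_emb, sae_emb, logits_orig, logits_ft, lam=1.0):
    """Alignment term pulls the dialect lexeme's embedding toward its SAE
    synonym's embedding; the KL term (computed on SAE image-caption data
    such as MS-COCO) penalizes drift of the fine-tuned encoder's output
    distribution away from the frozen original encoder's."""
    align = sum((d - s) ** 2 for d, s in zip(dialect_emb, sae_emb)) / len(sae_emb)
    reg = kl(softmax(logits_orig), softmax(logits_ft))
    return align + lam * reg
```

The regularizer is what lets the method keep SAE behavior intact: on SAE inputs the alignment term is inactive, so any shift in the encoder's outputs is pure penalty.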
Applying this method to Stable Diffusion 1.5 and SDXL yields striking improvements: performance on the five non‑SAE dialects rises by an average of +34.4 % (bringing them on par with SAE), while the SAE performance on the standard MS‑COCO validation set drops by less than 1 % (near‑zero cost). The method outperforms all baseline mitigations across both automatic (VQAScore, CLIPScore) and human evaluations.
Key contributions of the paper are:
- DialectGen Benchmark – a rigorously validated, large‑scale dataset for assessing dialect robustness in multimodal generation, covering both concise and detailed prompt styles.
- Comprehensive Evaluation – systematic analysis of 17 multimodal models and five baseline mitigation techniques, revealing pervasive dialect performance gaps.
- Encoder‑Based Mitigation – a novel, model‑agnostic training strategy that simultaneously boosts dialect robustness and preserves SAE performance, demonstrated on leading diffusion models.
The study highlights a critical fairness issue: current generative models, despite impressive visual quality, systematically disadvantage speakers of non‑standard dialects. By providing both a benchmark and an effective mitigation technique, the work paves the way for more inclusive multimodal AI systems that serve the linguistic diversity of real‑world users.