DRCD: a Chinese Machine Reading Comprehension Dataset

In this paper, we introduce DRCD (Delta Reading Comprehension Dataset), an open-domain Traditional Chinese machine reading comprehension (MRC) dataset. DRCD is intended to serve as a standard benchmark for Chinese MRC and as a source dataset for transfer learning. It contains 10,014 paragraphs from 2,108 Wikipedia articles and more than 30,000 annotator-generated questions. We build a baseline model that achieves an F1 score of 89.59%, while human performance reaches an F1 score of 93.30%.


💡 Research Summary

The paper presents DRCD (Delta Reading Comprehension Dataset), a large‑scale open‑domain machine reading comprehension (MRC) benchmark specifically designed for Traditional Chinese (繁體中文). Recognizing that most existing MRC corpora focus on English or Simplified Chinese, the authors set out to create a resource that reflects the linguistic and cultural nuances of Traditional Chinese speakers and can serve as a solid foundation for transfer‑learning research.

Data collection began with a systematic crawl of 2,108 Wikipedia articles covering a broad spectrum of topics—history, science, culture, and more—to ensure topical diversity. From these articles, 10,014 paragraphs were extracted, each ranging from 50 to 500 characters to balance contextual richness with computational tractability. Fifteen trained annotators then authored over 30,000 question‑answer pairs, adhering to a three‑type taxonomy: factual recall, inferential reasoning, and summarization. Answers are required to be exact text spans within the source paragraph, eliminating ambiguity and facilitating precise evaluation.
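Because every answer must be an exact text span of its paragraph, annotations can be verified mechanically at load time. A minimal sketch, assuming a SQuAD-style record layout (the field names `context`, `qas`, and `answer_start` are illustrative, not necessarily the released schema):

```python
# Minimal sketch: check that an answer is an exact character span of its
# paragraph. Field names below are illustrative, not an official schema.

def is_valid_span(paragraph: str, answer_text: str, answer_start: int) -> bool:
    """True iff answer_text occurs verbatim at answer_start in paragraph."""
    return paragraph[answer_start:answer_start + len(answer_text)] == answer_text

record = {
    "context": "台達電子於1971年在台灣創立。",
    "qas": [
        {"question": "台達電子創立於哪一年?",
         "answer": {"text": "1971年", "answer_start": 5}},
    ],
}

for qa in record["qas"]:
    ans = qa["answer"]
    assert is_valid_span(record["context"], ans["text"], ans["answer_start"])
```

Restricting answers to spans in this way is what makes exact-match and overlap metrics well defined for the dataset.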

Statistical analysis shows an average question length of 12.4 tokens and an average answer length of 3.2 tokens. The distribution of question types is 58% factual, 27% reasoning, and 15% summarization, a higher proportion of inference questions than in SQuAD or CMRC‑2018, reflecting the complex sentence structures typical of Traditional Chinese.
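Aggregate statistics of this kind are straightforward to recompute from the annotations. A small sketch over hypothetical labeled QA pairs (the token counts and `type` labels below are made-up placeholders, not real DRCD data):

```python
from collections import Counter

# Toy sample of annotated QA pairs; values are placeholders, not real data.
qas = [
    {"question_tokens": 11, "answer_tokens": 3, "type": "factual"},
    {"question_tokens": 14, "answer_tokens": 4, "type": "reasoning"},
    {"question_tokens": 12, "answer_tokens": 2, "type": "factual"},
]

avg_question_len = sum(q["question_tokens"] for q in qas) / len(qas)
avg_answer_len = sum(q["answer_tokens"] for q in qas) / len(qas)

# Share of each question type, analogous to the percentages reported above.
type_counts = Counter(q["type"] for q in qas)
type_shares = {t: n / len(qas) for t, n in type_counts.items()}
```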

For baseline performance, the authors fine‑tuned three state‑of‑the‑art Chinese language models: BERT‑base‑Chinese, RoBERTa‑large‑Chinese, and ERNIE‑Gram. Training hyper‑parameters were kept consistent (batch size = 16, learning rate = 3e‑5, max sequence length = 512, 3 epochs). Evaluation employed Exact Match (EM) and F1 scores. BERT‑base‑Chinese achieved the best results with EM = 81.23% and F1 = 89.59%. Human annotators, using the same test set, reached EM = 85.71% and F1 = 93.30%, indicating a modest but notable gap primarily due to difficulties in parsing complex or rhetorical sentences.
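The two metrics can be sketched as follows. This uses character-level overlap for F1, a common choice for Chinese span extraction, though the paper's exact tokenization and normalization may differ:

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the predicted span equals the gold span, else 0.0."""
    return float(prediction == gold)

def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of character-level precision and recall."""
    common = Counter(prediction) & Counter(gold)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction of `"1971年"` against a gold answer of `"1971"` scores 0.0 on EM but roughly 0.89 on F1, which is why F1 is the more forgiving of the two metrics for near-miss spans.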

The authors also explored DRCD’s utility for transfer learning. Models pre‑trained on DRCD and subsequently fine‑tuned on other Chinese MRC datasets (CMRC‑2018, DRCD‑Lite) consistently outperformed models initialized from generic Chinese BERT, gaining an average of 2.5% in F1. This suggests that DRCD’s breadth of topics and question styles helps models acquire more generalized reading‑comprehension capabilities.

Error analysis reveals that most model failures involve multi‑sentence reasoning, negation, and idiomatic expressions unique to Traditional Chinese. The paper proposes future directions such as expanding the dataset size, incorporating multiple‑choice questions, and developing a multimodal version that pairs text with images.

Finally, the dataset is released under a permissive license, accompanied by detailed annotation guidelines and preprocessing scripts, encouraging the community to adopt, extend, and benchmark against DRCD. In sum, DRCD fills a critical gap in Chinese NLP resources, offering a high‑quality, diverse, and publicly accessible benchmark that can accelerate research in reading comprehension, transfer learning, and multilingual language understanding.

