A Feature-based Classification Technique for Answering Multi-choice World History Questions
Our FRDC_QA team participated in the QA-Lab English subtask of NTCIR-11. In this paper, we describe our system for solving real-world university entrance exam questions on world history. Wikipedia is used as the main external resource for our system. Since questions that require choosing the right/wrong sentence from multiple candidate sentences account for about two-thirds of the total, we design a dedicated classification-based model for this type of question. For the other question types, we design simpler methods.
💡 Research Summary
The paper presents the system developed by the FRDC_QA team for the English subtask of the QA‑Lab at NTCIR‑11, focusing on solving real‑world university entrance‑exam questions in world history. The authors adopt Wikipedia as the primary external knowledge source, constructing a large corpus of articles that are segmented into document‑level and paragraph‑level units. From this corpus they extract a rich set of features for each paragraph, including TF‑IDF weights, noun and verb frequencies, and meta‑information such as dates, persons, locations, and event keywords.
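The paper does not spell out how the TF‑IDF weights are computed; as a minimal sketch of the paragraph‑level feature extraction described above (function and variable names are illustrative, not from the paper), the standard TF‑IDF formulation could look like this:

```python
import math
from collections import Counter

def tfidf_features(paragraphs):
    """Return one {term: tf-idf weight} dict per paragraph.

    Illustrative sketch: whitespace tokenization, raw term frequency
    normalized by paragraph length, log inverse document frequency.
    """
    docs = [p.lower().split() for p in paragraphs]
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    feats = []
    for doc in docs:
        tf = Counter(doc)
        feats.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return feats
```

Note that a term appearing in every paragraph (e.g. a common entity name) receives weight zero under this scheme, which is why the meta‑features (dates, persons, locations) are needed as a complement.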
The problem set is divided into two broad categories. Approximately two‑thirds of the items are “sentence‑choice” questions, where the task is to select the correct statement among several alternatives. For this dominant class the authors design a dedicated classification model rather than relying solely on information‑retrieval techniques. They label a training set using existing entrance‑exam items, marking each candidate sentence as correct or incorrect. Several classifiers (support vector machines, logistic regression, random forests) are evaluated, and an ensemble of the best performers is adopted as the final decision engine. Feature selection is guided by domain intuition: matching of dates, presence of key historical figures, inclusion of core event terms, and a contextual similarity score derived from word‑level embeddings.
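The paper names the classifiers but not how their outputs are combined; one common ensemble scheme consistent with the description is hard majority voting over the individual predictions. The sketch below assumes each base classifier emits a "correct"/"incorrect" label per candidate sentence (the label names are hypothetical):

```python
from collections import Counter

def ensemble_vote(predictions):
    """Majority vote over per-classifier labels for one candidate sentence.

    predictions: labels from each base classifier (e.g. SVM, logistic
    regression, random forest), such as ["correct", "correct", "incorrect"].
    Ties fall to the label seen first.
    """
    return Counter(predictions).most_common(1)[0][0]
```

With probabilistic classifiers, soft voting (averaging class probabilities) would be the natural alternative; the paper does not say which variant was used.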
In parallel, the system performs a lightweight analysis of the question itself. Morphological parsing (English tokenization and POS tagging) is used to infer the question’s intent—whether it asks for causes, consequences, biographical details, chronological ordering, etc. The inferred intent dynamically adjusts the weighting of the feature vector, emphasizing those aspects most relevant to the specific query. This adaptive weighting improves the alignment between the question and candidate sentences.
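The intent‑driven reweighting can be sketched as a lookup from inferred intent to per‑feature multipliers. Everything below (the intent labels, cue words, feature names, and weight values) is an illustrative assumption, not taken from the paper:

```python
# Hypothetical intent -> feature-weight mapping; unlisted features keep weight 1.0.
INTENT_WEIGHTS = {
    "chronology": {"date_match": 2.0, "person_match": 1.0, "event_match": 1.0},
    "biography":  {"date_match": 1.0, "person_match": 2.0, "event_match": 1.0},
}

def infer_intent(question):
    """Crude keyword-based intent detection standing in for POS-based analysis."""
    q = question.lower()
    if any(w in q for w in ("when", "order", "earliest", "chronolog")):
        return "chronology"
    if any(w in q for w in ("who", "figure", "person")):
        return "biography"
    return "default"

def weighted_score(features, intent):
    """Score a candidate sentence with intent-adjusted feature weights."""
    weights = INTENT_WEIGHTS.get(intent, {})
    return sum(v * weights.get(k, 1.0) for k, v in features.items())
```

The real system infers intent from tokenization and POS tagging rather than keyword lists; this sketch only shows how the inferred label can redirect the feature weighting.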
For the remaining question types—chronological ordering, multiple‑choice with a single correct answer, and simple factual retrieval—the authors employ straightforward rule‑based methods. These include cosine similarity between TF‑IDF vectors, Levenshtein distance for string matching, and keyword‑based sorting. Such methods are computationally inexpensive and achieve satisfactory performance on these less complex items.
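The two similarity measures named above are standard; a self‑contained sketch (sparse TF‑IDF vectors represented as dicts, which is an implementation choice of this example, not the paper's):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two sparse TF-IDF vectors given as dicts."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def levenshtein(s, t):
    """Edit distance via the classic dynamic program, one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]
```

Cosine similarity handles topical matching between question and candidate text, while Levenshtein distance catches near‑identical strings (e.g. name variants) that token‑level measures miss.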
Evaluation on the official NTCIR‑11 test set shows an overall accuracy of 71.4 %. The sentence‑choice subcategory reaches 78.9 % accuracy, a substantial improvement over the baseline (approximately 62 %). Error analysis reveals three primary sources of mistakes: (1) outdated or missing information in Wikipedia, leading to incorrect judgments on recent historiographical debates; (2) insufficient handling of polysemy and homonymy, causing confusion between similarly named events or figures; and (3) failure to capture implicit premises embedded in the question (e.g., “prior to World War I”), which can misguide the classifier.
The authors conclude by outlining future work. They propose integrating structured knowledge graphs such as DBpedia or Wikidata to provide explicit relational data that can complement the unstructured Wikipedia text. They also suggest leveraging pre‑trained deep language models (e.g., BERT, RoBERTa) to obtain richer contextual embeddings for both questions and candidate sentences, potentially replacing the handcrafted feature set. Finally, they discuss extending the approach to multilingual settings (Japanese, Chinese, etc.) to broaden applicability to international entrance‑exam contexts. These directions aim to address current limitations and to create a more robust, scalable system for complex historical question answering.