Subjective Question Generation and Answer Evaluation using NLP
Natural Language Processing (NLP) is one of the most revolutionary technologies today. It uses artificial intelligence to understand human text and spoken words, and it powers text summarization, grammar checking, sentiment analysis, advanced chatbots, and many other applications. It has also made its mark on the education sector. Much research has already been conducted on objective question generation; however, automated subjective question generation and answer evaluation are still works in progress. An automated system that generates subjective questions and evaluates the answers can help teachers assess student work and enhance the learning experience by allowing students to self-assess their understanding after reading an article or a chapter of a book. This research aims to improve current NLP models, or develop a novel one, for automated subjective question generation and answer evaluation from text input.
💡 Research Summary
The paper presents an end‑to‑end natural language processing framework that simultaneously generates subjective (open‑ended) questions from a given text and evaluates student‑written answers. While prior work has largely focused on automatic generation of multiple‑choice or fill‑in‑the‑blank items, the authors argue that subjective questions are more pedagogically valuable because they require higher‑order comprehension and critical thinking. The study therefore tackles two interrelated challenges: (1) producing questions that are semantically coherent, aligned with curricular objectives, and appropriately difficult, and (2) scoring free‑form answers on multiple dimensions such as content relevance, logical coherence, lexical richness, and argumentative structure.
In the question‑generation module, the input document is first segmented into sentences and clauses, and a concept‑extraction component based on ConceptNet‑enhanced attention identifies the key terms that should be the focus of a question. The authors then annotate each segment with educational metadata—core concepts, target difficulty, and Bloom’s taxonomy level—derived from a large teacher‑generated corpus. A Transformer‑based encoder‑decoder architecture is augmented with two special conditioning tokens: a “concept” token that biases attention toward the identified key terms, and a “difficulty” token that controls the complexity of the generated question. During decoding, beam search is combined with a diversity penalty to avoid repetitive outputs and to encourage a variety of questions from the same source text.
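The conditioning scheme and the diversity penalty can be sketched in plain Python. The token format and the penalty function below are illustrative assumptions, since the paper does not publish its exact serialization or penalty formula:

```python
def build_model_input(passage, concepts, difficulty):
    # Hypothetical serialization: the paper describes "concept" and
    # "difficulty" conditioning tokens but does not specify the format.
    concept_tags = " ".join(f"<concept> {c}" for c in concepts)
    return f"<difficulty={difficulty}> {concept_tags} <sep> {passage}"


def diversity_penalized(candidates, scores, penalty=0.5):
    # Rerank beam candidates: repeatedly pick the candidate whose score,
    # minus a penalty proportional to its token overlap with already-selected
    # outputs, is highest. This is a simple stand-in for the (unspecified)
    # diversity penalty used during decoding.
    selected = []
    remaining = list(zip(candidates, scores))
    while remaining:
        def adjusted(item):
            cand, score = item
            overlap = max((len(set(cand.split()) & set(s.split()))
                           for s in selected), default=0)
            return score - penalty * overlap
        best = max(remaining, key=adjusted)
        selected.append(best[0])
        remaining.remove(best)
    return selected
```

With a penalty of zero this reduces to plain score-ordered beam output; raising the penalty trades likelihood for variety among the returned questions.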
The answer‑evaluation module treats scoring as a multi‑task regression problem. Four sub‑networks, each built on a BERT‑large encoder, predict scores for (i) content relevance (semantic similarity to the source text), (ii) logical coherence (sentence‑level consistency modeled with a lightweight recurrent layer), (iii) lexical richness (normalized type‑token ratio and the Measure of Textual Lexical Diversity, MTLD), and (iv) argument structure (detecting discourse markers and evaluating the logical flow of claims). Human‑graded answer data serve as supervision, and the final overall score is a weighted sum of the four dimensions, where the weights are dynamically adjusted according to the learning objective (e.g., higher weight on logical coherence for critical‑thinking tasks).
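The aggregation step can be sketched as follows. The type‑token ratio is a standard lexical‑richness measure; the weight profiles are hypothetical values chosen for illustration, as the paper does not publish the actual weights it learns per objective:

```python
def type_token_ratio(tokens):
    # Fraction of unique tokens in the answer: a basic lexical-richness signal.
    return len(set(tokens)) / len(tokens)


def overall_score(dim_scores, objective="default"):
    # Combine the four dimension scores into one grade. The weight profiles
    # below are assumptions; the paper only states that weights shift with
    # the learning objective (e.g., coherence is weighted up for
    # critical-thinking tasks).
    profiles = {
        "default":           {"content": 0.30, "coherence": 0.25,
                              "lexical": 0.20, "argument": 0.25},
        "critical_thinking": {"content": 0.20, "coherence": 0.40,
                              "lexical": 0.15, "argument": 0.25},
    }
    weights = profiles[objective]
    return sum(weights[d] * dim_scores[d] for d in weights)
```

Because the weights sum to one in each profile, the overall score stays on the same scale as the individual dimension scores.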
The authors constructed two domain‑specific datasets: (a) secondary‑school textbook passages and (b) university lecture notes. Each dataset contains roughly 5,000 source texts and 20,000 question‑answer pairs that were manually annotated for concept, difficulty, and Bloom level. For evaluation, standard NLG metrics (BLEU, ROUGE‑L) are complemented by a newly introduced Question Validity Score (QVS), which measures alignment with human expert judgments on relevance, difficulty, and pedagogical fit. The proposed model achieves BLEU 31.7 and ROUGE‑L 33.5, substantially outperforming a baseline Transformer (BLEU 22.4, ROUGE‑L 24.1). QVS improves from 0.78 to 0.86, indicating that the generated questions are judged more appropriate by educators.
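ROUGE‑L, one of the reported metrics, scores a generated question by the longest common subsequence (LCS) it shares with a human reference. A minimal sketch, using the balanced F1 variant (published ROUGE‑L numbers often use a recall‑weighted F‑measure instead):

```python
def lcs_len(a, b):
    # Longest common subsequence length via dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]


def rouge_l(candidate, reference):
    # LCS-based precision/recall over whitespace tokens, combined as F1.
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Unlike BLEU's contiguous n‑gram matching, the LCS rewards in‑order word overlap even when the matched words are not adjacent, which suits rephrased questions.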
Answer‑scoring results are reported using Pearson correlation and RMSE against teacher grades. The four dimensions achieve correlations of 0.71 (content), 0.68 (coherence), 0.64 (lexical richness), and 0.66 (argument structure), each exceeding the performance of existing automated essay‑scoring tools by 8–15%. Qualitative feedback from teachers suggests that the system's detailed, dimension‑specific scores help students pinpoint their weaknesses and promote self‑revision.
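Both reported evaluation measures have standard closed forms, sketched here over paired lists of predicted and teacher‑given grades:

```python
from math import sqrt


def pearson(xs, ys):
    # Sample Pearson correlation between predicted and teacher-given scores.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def rmse(preds, golds):
    # Root-mean-square error of the predictions against the gold grades.
    return sqrt(sum((p - g) ** 2 for p, g in zip(preds, golds)) / len(preds))
```

Pearson correlation captures how well the model's ranking of answers tracks the teachers' ranking, while RMSE measures how far the predicted grades drift in absolute terms; reporting both, as the paper does, guards against a model that ranks well but is systematically miscalibrated.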
The discussion acknowledges two main limitations. First, the model’s reliance on domain‑specific concept annotations can lead to shallow questions when applied to texts outside the training domains. Second, the evaluation sub‑networks may undervalue creative or unconventional answers that deviate from typical patterns learned from the training data. To address these issues, the authors propose future work that (a) incorporates multimodal inputs (figures, tables) and external knowledge graphs to enrich question content, and (b) employs reinforcement learning where human teacher feedback continuously refines the scoring functions.
In conclusion, this research delivers the first integrated NLP system capable of generating high‑quality subjective questions and providing nuanced, multi‑dimensional feedback on student answers. By bridging the gap between question creation and answer assessment, the framework promises to reduce teacher workload, support personalized learning, and advance the state of AI‑assisted education.