Empirical Analysis of the Effect of Context in the Task of Automated Essay Scoring in Transformer-Based Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Automated Essay Scoring (AES) has risen to prominence in response to the growing demand for educational automation. By providing an objective and cost-effective solution, AES standardises the assessment of extended responses. Although substantial research has been conducted in this domain, recent investigations reveal that alternative deep-learning architectures outperform transformer-based models on this task. Given the dominance of transformer architectures across most other tasks, this discrepancy motivates enriching transformer-based AES models with contextual information. This study delves into diverse contextual factors using the ASAP-AES dataset, analysing their impact on transformer-based model performance. Our most effective model, augmented with multiple contextual dimensions, achieves a mean Quadratic Weighted Kappa score of 0.823 across the entire essay dataset and 0.8697 when trained on individual essay sets. This augmented approach surpasses prior transformer-based models and underperforms the state-of-the-art deep-learning model trained essay-set-wise by an average of only 3.83%, while exhibiting superior performance on three of the eight sets. Importantly, this enhancement is orthogonal to architecture-based advancements and seamlessly adaptable to any AES model. Consequently, this contextual augmentation methodology presents a versatile technique for refining AES capabilities, contributing to the evolution of automated grading and evaluation in educational settings.


💡 Research Summary

This paper presents an empirical investigation into enhancing Transformer-based models for Automated Essay Scoring (AES) through systematic context augmentation. The research is motivated by the observed performance gap where alternative deep-learning architectures (e.g., CNN-RNN hybrids) have recently surpassed Transformer models on the AES task, despite the latter’s dominance in other NLP domains. The authors hypothesize that enriching Transformer models with relevant contextual information can bridge this gap.

The study utilizes the ASAP-AES dataset, a standard benchmark comprising essays written in response to eight distinct prompts. The core methodology involves augmenting the input to a Transformer encoder (such as BERT) with multiple layers of contextual information. Four primary types of context are explored:

1) Relative Context: implemented using a Margin Ranking loss function that teaches the model the ordinal relationship between essays within a batch.

2) Prompt Context: the essay’s original prompt text is provided alongside the essay to ground the evaluation in the specific topic.

3) Structural Context: derived from discourse and argumentation theory. Auxiliary models (BiLSTM-CRF) predict Elementary Discourse Unit (EDU) spans, capturing logical flow, and Argument Component (AC) spans, capturing claims, evidence, and so on, within the essay; these predicted labels are supplied as additional context.

4) Feature-based Context: surface-level features such as essay length (word count) and the counts of predicted EDUs/ACs, acknowledging known correlations in the dataset.
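The relative-context objective described above can be sketched as a pairwise margin ranking loss over a batch of predicted scores. The sketch below is our own minimal illustration of that idea, not the paper’s code; the function names and the margin value are illustrative assumptions.

```python
def margin_ranking_loss(score_a, score_b, target, margin=0.0):
    """Pairwise ranking loss for two predicted essay scores.

    target = +1.0 if essay A should rank above essay B, -1.0 otherwise.
    The loss is zero once the pair is ordered correctly by at least `margin`.
    """
    return max(0.0, -target * (score_a - score_b) + margin)


def batch_relative_context_loss(pred_scores, gold_scores, margin=0.1):
    """Average margin ranking loss over all ordered pairs in a batch.

    The ordinal target for each pair is derived from the gold scores;
    tied pairs carry no ordinal signal and are skipped.
    """
    losses = []
    n = len(pred_scores)
    for i in range(n):
        for j in range(i + 1, n):
            if gold_scores[i] == gold_scores[j]:
                continue  # no ordering information for ties
            target = 1.0 if gold_scores[i] > gold_scores[j] else -1.0
            losses.append(
                margin_ranking_loss(pred_scores[i], pred_scores[j], target, margin)
            )
    return sum(losses) / len(losses) if losses else 0.0
```

This mirrors the semantics of PyTorch’s `MarginRankingLoss`; in practice such a term would be added to the model’s regression loss during training so the encoder learns relative orderings within each batch.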

A comprehensive experimental setup evaluates these context types individually and in combination. The primary evaluation metric is the Quadratic Weighted Kappa (QWK). The results demonstrate that while individual context types provide improvements, the most effective model combines Prompt, AC, and Feature-based contexts. This “Composite Context Augmentation” model achieves a mean QWK of 0.823 when trained on all prompts together and an impressive 0.8697 when trained on individual prompt sets separately. This performance substantially surpasses all previous Transformer-based AES models.
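For reference, Quadratic Weighted Kappa measures agreement between two sets of integer ratings, penalising disagreements by the square of their distance. The sketch below is a standard self-contained implementation of the metric (not the paper’s code; the function name and rating range are illustrative):

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Quadratic Weighted Kappa between two lists of integer ratings.

    Returns 1.0 for perfect agreement, 0.0 for chance-level agreement,
    and negative values for worse-than-chance agreement.
    """
    n = max_rating - min_rating + 1
    num_items = len(rater_a)

    # Observed confusion matrix O[i][j].
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    # Marginal histograms of each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = ((i - j) ** 2) / ((n - 1) ** 2)  # quadratic penalty
            expected = hist_a[i] * hist_b[j] / num_items  # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator
```

The same value can be obtained with `sklearn.metrics.cohen_kappa_score(..., weights="quadratic")`, which is the form typically used in AES benchmarks.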

A key finding is that while the proposed context-augmented Transformer model still underperforms the state-of-the-art non-Transformer model (DeLAES) by an average of 3.83%, it outperforms DeLAES on three of the eight prompts. Crucially, the authors emphasize that their context augmentation approach is orthogonal to architectural advancements. The technique is not a competing alternative to models like DeLAES but a complementary, model-agnostic method that can be integrated on top of any AES architecture—including DeLAES itself—for potential further gains.

In conclusion, the paper successfully establishes that contextual enrichment is a powerful and versatile strategy for advancing AES systems. It provides a significant performance boost to Transformer-based models and offers a flexible framework for input enhancement that is applicable across different model architectures, contributing a valuable tool to the evolution of automated educational assessment.

