GradingAttack: Attacking Large Language Models Towards Short Answer Grading Ability
Large language models (LLMs) have demonstrated remarkable potential for automatic short answer grading (ASAG), significantly boosting the efficiency and scalability of student assessment in educational scenarios. However, their vulnerability to adversarial manipulation raises critical concerns about the fairness and reliability of automatic grading. In this paper, we introduce GradingAttack, a fine-grained adversarial attack framework that systematically evaluates the vulnerability of LLM-based ASAG models. Specifically, we align general-purpose attack methods with the specific objectives of ASAG by designing token-level and prompt-level strategies that manipulate grading outcomes while maintaining high camouflage. Furthermore, to quantify attack camouflage, we propose a novel evaluation metric that balances attack success and camouflage. Experiments on multiple datasets demonstrate that both attack strategies effectively mislead grading models, with prompt-level attacks achieving higher success rates and token-level attacks exhibiting superior camouflage capability. Our findings underscore the need for robust defenses to ensure fairness and reliability in ASAG. Our code and datasets are available at https://anonymous.4open.science/r/GradingAttack.
💡 Research Summary
Title: GradingAttack: Attacking Large Language Models Towards Short Answer Grading Ability
Overview
The paper investigates the security and fairness of automatic short‑answer grading (ASAG) systems that rely on large language models (LLMs). While LLMs have shown impressive capabilities in educational tools, their susceptibility to adversarial manipulation threatens the reliability of automated grading. To fill a gap in the literature (most prior work focuses on general‑purpose jailbreaks), the authors propose GradingAttack, a fine‑grained adversarial framework specifically designed for ASAG scenarios.
Framework Components
- Grading Input Alignment – Constructs a grading prompt P that concatenates the question q, the reference solution a_q, and the student answer a_s. This aligns the input format with typical ASAG pipelines while leaving room for adversarial modifications.
- Adversarial Prompt Generation – Two complementary attack strategies are introduced:
- Token‑level attacks: Small, surface‑level edits to the student answer (synonym substitution, adjective/adverb insertion, word order changes). The goal is to keep the answer semantically similar to the original but to steer the LLM’s grading logic toward a targeted outcome.
- Prompt‑level attacks: Larger manipulations of the entire grading prompt, such as inserting misleading instructions, re‑ordering the three components, or adding crafted “hints”. These changes have a stronger influence on the model’s internal reasoning, yielding higher success rates.
- Attack Evaluation – Traditional evaluation relies solely on the Attack Success Rate (ASR). The authors argue that ASR alone cannot capture the stealthiness of an attack, so they define a Camouflage Attack Score (CAS) that blends ASR with the ratio of post‑attack to pre‑attack overall grading accuracy (A_after / A_before); the combination is formulated using a Beta distribution.
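To make the pipeline above concrete, the sketch below builds a grading prompt from the question q, reference solution a_q, and student answer a_s, then applies a minimal token-level attack via synonym substitution. The prompt template and the synonym table are illustrative assumptions for this summary, not the paper's actual method (a real attack would search for substitutions, e.g. with embeddings or a thesaurus).

```python
# Illustrative synonym table -- an assumption, not the paper's substitution method.
SYNONYMS = {"increases": "boosts", "because": "since", "large": "substantial"}


def build_grading_prompt(question, reference, student_answer):
    """Concatenate question q, reference solution a_q, and student answer a_s
    into a grading prompt P (template is a hypothetical ASAG-style layout)."""
    return (f"Question: {question}\n"
            f"Reference solution: {reference}\n"
            f"Student answer: {student_answer}\n"
            f"Grade the student answer as correct or incorrect.")


def token_level_attack(student_answer):
    """Surface-level edit: swap words for synonyms so the answer stays
    semantically close to the original while its tokens change."""
    return " ".join(SYNONYMS.get(w.lower(), w) for w in student_answer.split())


adversarial = token_level_attack("Pressure increases because temperature rises")
prompt = build_grading_prompt("Why does pressure rise when a gas is heated?",
                              "Heating raises molecular kinetic energy.",
                              adversarial)
```

A prompt-level attack would instead edit the template itself, e.g. appending a misleading instruction after the student answer, which is why it influences the model's reasoning more strongly.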
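The evaluation step can be sketched as follows. Since the summary does not reproduce the paper's exact CAS formula, the combining rule below (ASR scaled by a Beta(a, 1)-density weight on the accuracy ratio, which rewards attacks that leave overall grading accuracy intact) and the shape parameter are assumptions chosen only to illustrate how ASR and A_after / A_before could be blended.

```python
def attack_success_rate(orig_grades, attacked_grades, target):
    """ASR: fraction of eligible answers (not already at the attacker's
    target grade) whose grade flips to the target after the attack."""
    flips = sum(1 for o, a in zip(orig_grades, attacked_grades)
                if o != target and a == target)
    eligible = sum(1 for o in orig_grades if o != target)
    return flips / eligible if eligible else 0.0


def camouflage_attack_score(asr, acc_before, acc_after, a=2.0):
    """Hypothetical CAS: ASR modulated by the post-/pre-attack accuracy
    ratio under a Beta(a, 1)-density weight (normalised to 1 at ratio = 1).
    The combining rule and shape parameter a are assumptions."""
    ratio = min(acc_after / acc_before, 1.0) if acc_before else 0.0
    weight = ratio ** (a - 1)  # Beta(a, 1) density a*x^(a-1), rescaled
    return asr * weight
```

For example, an attack that flips 60% of eligible grades while overall accuracy drops from 0.90 to 0.81 scores 0.6 × 0.9 = 0.54 under these assumed parameters, penalising attacks that visibly degrade grading accuracy.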