Automated Multiple Mini Interview (MMI) Scoring

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Assessing soft skills such as empathy, ethical judgment, and communication is essential in competitive selection processes, yet human scoring is often inconsistent and biased. While Large Language Models (LLMs) have improved Automated Essay Scoring (AES), we show that state-of-the-art rationale-based fine-tuning methods struggle with the abstract, context-dependent nature of Multiple Mini-Interviews (MMIs), missing the implicit signals embedded in candidate narratives. We introduce a multi-agent prompting framework that breaks down the evaluation process into transcript refinement and criterion-specific scoring. Using 3-shot in-context learning with a large instruct-tuned model, our approach outperforms specialised fine-tuned baselines (Avg QWK 0.62 vs 0.32) and achieves reliability comparable to human experts. We further demonstrate the generalisability of our framework on the ASAP benchmark, where it rivals domain-specific state-of-the-art models without additional training. These findings suggest that for complex, subjective reasoning tasks, structured prompt engineering may offer a scalable alternative to data-intensive fine-tuning, altering how LLMs can be applied to automated assessment.


💡 Research Summary

The paper tackles the problem of automatically scoring virtual Multiple Mini‑Interviews (VMMIs) used in healthcare‑related university admissions. Traditional human raters suffer from fatigue and subjectivity, leading to inconsistent scores for soft‑skill criteria such as empathy, ethical judgment, and communication. While large language models (LLMs) have dramatically improved Automated Essay Scoring (AES), the authors argue that the abstract, context‑dependent nature of MMI responses makes direct transfer of state‑of‑the‑art rationale‑based fine‑tuning methods (e.g., RMTS) ineffective.

The authors collected a dataset of 1,001 transcribed VMMI responses to four distinct scenarios (consoling a grieving friend, confronting professional inadequacy, navigating a public ethical dilemma, and resolving team conflict). Each response was evaluated by expert human raters on a 7‑point Likert scale across nine textual criteria (c2‑c10). The transcripts were obtained solely from automatic speech‑to‑text services to avoid non‑verbal cues, and the score distribution is heavily skewed toward higher marks (85 % of scores are 4‑6).
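Agreement on such ordinal scores is measured throughout the paper with Quadratic Weighted Kappa (QWK), which penalises disagreements by the squared distance between the two ratings and normalises against chance agreement. A minimal stdlib implementation for a 7-point scale (the function name and toy score lists are illustrative, not from the paper):

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, min_score=1, max_score=7):
    """QWK between two equal-length lists of ordinal scores."""
    scores = range(min_score, max_score + 1)
    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))       # joint score counts
    hist_a, hist_b = Counter(rater_a), Counter(rater_b)  # marginals
    span = (max_score - min_score) ** 2
    num = den = 0.0
    for i in scores:
        for j in scores:
            w = (i - j) ** 2 / span                  # quadratic disagreement weight
            num += w * observed[(i, j)] / n          # observed weighted disagreement
            den += w * (hist_a[i] / n) * (hist_b[j] / n)  # expected by chance
    return 1.0 - num / den

# Identical ratings give perfect agreement:
print(quadratic_weighted_kappa([4, 5, 6, 5], [4, 5, 6, 5]))  # 1.0
```

QWK is 1.0 for perfect agreement, near 0 for chance-level scoring, and negative when raters systematically disagree, which is why it is the standard metric for ordinal rubrics like this one.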

Initial experiments using Sentence‑BERT embeddings and K‑means clustering revealed that semantic similarity alone cannot capture evaluative nuance: sentences describing “moving toward the friend” and “walking away from the friend” cluster together despite opposite implications for empathy. This demonstrated that simple embedding‑based AES techniques fail to distinguish context‑specific valence required for MMI scoring.
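The failure mode can be sketched with cosine similarity over toy hand-crafted vectors (these are not real Sentence-BERT embeddings; the four dimensions and their values are invented purely to illustrate why topical overlap dominates evaluative valence):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 4-d "embeddings": [friend, movement, grief, valence].
toward  = [0.9, 0.8, 0.7,  0.1]   # "I moved toward my friend"
away    = [0.9, 0.8, 0.7, -0.1]   # "I walked away from my friend"
neutral = [0.1, 0.0, 0.9,  0.0]   # "Grief is a natural process"

# The two opposite-empathy sentences are far more similar to each
# other than to a topically different one, so K-means groups them.
print(cosine(toward, away) > cosine(toward, neutral))  # True
```

Because the shared topical dimensions swamp the single valence dimension, distance-based clustering places the two opposite responses in the same cluster, exactly the behaviour the authors observed.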

Consequently, the authors explored prompt engineering with instruction‑tuned Llama models (8B, 70B, 405B). They compared zero‑shot, few‑shot, and retrieval‑augmented generation (RAG) strategies. A balanced 3‑shot in‑context learning setup—providing low, medium, and high exemplars selected from the 5th, 50th, and 95th percentiles—proved optimal, yielding a Quadratic Weighted Kappa (QWK) of 0.363 on a representative question. Adding a fourth example caused the model to over‑rely on the last exemplar, degrading performance. RAG attempts (retrieving the most similar responses) underperformed because the dataset’s high‑score bias reduced score diversity in the retrieved set.
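The balanced exemplar selection described above can be sketched as follows (the `balanced_exemplars` helper and the toy score pool are assumptions for illustration; only the 5th/50th/95th-percentile strategy comes from the paper):

```python
def balanced_exemplars(pool, pcts=(0.05, 0.50, 0.95)):
    """Pick low/medium/high exemplars from a scored pool by score percentile."""
    ranked = sorted(pool, key=lambda ex: ex["score"])
    # Clamp the index so the 95th percentile stays inside the list.
    return [ranked[min(int(p * len(ranked)), len(ranked) - 1)] for p in pcts]

# Toy pool mimicking the dataset's skew toward higher scores.
pool = [{"id": i, "score": s} for i, s in enumerate([2, 3, 4, 4, 5, 5, 5, 6, 6, 7])]
low, mid, high = balanced_exemplars(pool)
print(low["score"], mid["score"], high["score"])  # 2 5 7
```

Selecting by percentile rather than by similarity is the key design choice: it guarantees score diversity among the shots, which is precisely what the similarity-based RAG retrieval lost on this high-score-biased dataset.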

Building on these findings, the authors introduced a two‑stage multi‑agent framework. Stage 1 is a preprocessing agent that cleans the raw transcript, removing filler words and extraneous dialogue. Stage 2 consists of nine independent scoring agents, each dedicated to one criterion. Every scoring agent receives a tailored prompt containing the 7‑point rubric and the same balanced 3‑shot exemplars, but the exemplars are chosen specifically from the percentile band relevant to that criterion. This separation eliminates cross‑criterion interference observed in single‑prompt multi‑trait approaches. The multi‑agent system achieved a QWK of 0.533 on the same question—a substantial gain over the best single‑prompt configuration and comparable to human‑expert reliability.
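A minimal sketch of the two-stage wiring, with a stub standing in for the LLM call (the prompt wording, rubric placeholder, and `llm` callable are all illustrative assumptions; only the clean-then-score structure and the nine criteria c2-c10 come from the paper):

```python
CRITERIA = [f"c{i}" for i in range(2, 11)]  # nine textual criteria, c2-c10

def preprocessing_agent(transcript, llm):
    # Stage 1: clean the raw speech-to-text transcript.
    prompt = ("Remove filler words and extraneous dialogue; otherwise keep "
              "the candidate's answer verbatim.\n\n" + transcript)
    return llm(prompt)

def scoring_agent(criterion, cleaned, exemplars, llm):
    # Stage 2: one dedicated agent per criterion, each with its own
    # balanced 3-shot exemplars drawn for that criterion.
    shots = "\n".join(f"[score {e['score']}] {e['text']}" for e in exemplars)
    prompt = (f"Rubric for {criterion} (7-point Likert scale):\n"
              f"...criterion-specific rubric text...\n\n"
              f"Exemplars:\n{shots}\n\nResponse:\n{cleaned}\n\nScore (1-7):")
    return int(llm(prompt))

def score_response(transcript, exemplars_by_criterion, llm):
    cleaned = preprocessing_agent(transcript, llm)
    return {c: scoring_agent(c, cleaned, exemplars_by_criterion[c], llm)
            for c in CRITERIA}

# Stub LLM: echoes cleaning prompts, answers "5" to scoring prompts.
stub = lambda p: "5" if p.rstrip().endswith("(1-7):") else p
exemplars = {c: [{"score": s, "text": "..."} for s in (2, 5, 7)] for c in CRITERIA}
print(score_response("um, so, I would sit with them", exemplars, stub))
```

Running each criterion in its own isolated prompt is what eliminates the cross-criterion interference: no agent ever sees another criterion's rubric, exemplars, or score.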

To assess whether fine‑tuning could match this performance, the authors fine‑tuned two models: Llama 3.1 8B (decoder‑only) and ModernBERT (encoder‑only). They experimented with grouped training (one model scores all criteria), individual training (a separate model per criterion), and the use of cleaned transcripts. Across all configurations, the fine‑tuned models lagged behind the prompting‑based multi‑agent approach, underscoring the data‑efficiency of well‑designed prompts.

Finally, the same framework was applied to the ASAP essay‑scoring benchmark without any additional training. The prompt‑only system achieved QWK scores on par with the specialized RMTS and other state‑of‑the‑art AES models, demonstrating that the methodology generalises beyond interview contexts to broader subjective assessment tasks.

In summary, the study provides strong empirical evidence that for complex, subjective reasoning tasks—where evaluation hinges on nuanced, context‑specific interpretations—structured prompt engineering combined with a multi‑agent architecture can outperform data‑intensive fine‑tuning. The approach delivers human‑level reliability, requires only a handful of exemplars, and scales across domains, suggesting a promising path forward for automated assessment in education, recruitment, and professional training.

