CoMAI: A Collaborative Multi-Agent Framework for Robust and Equitable Interview Evaluation

CoMAI: A Collaborative Multi- Agent Frame work for Robust and Equitable Inter view Evaluation Gengxin Sun ∗ Shandong University Qingdao, China gxin.sun@mail.sdu.edu.cn Ruihao Y u ∗ Shandong University Qingdao, China 202322130199@mail.sdu.edu.cn Liangyi Yin Shandong University Qingdao, China 202300130144@mail.sdu.edu.cn Y unqi Y ang Shandong University Qingdao, China 202300130095@mail.sdu.edu.cn Bin Zhang † Institute of A utomation, Chinese Academy of Sciences Beijing, China zhangbin@ia.ac.cn Zhiwei Xu ‡ Shandong University Jinan, China zhiwei_xu@sdu.edu.cn ABSTRA CT Ensuring robust and fair interview assessment remains a key chal- lenge in AI-driven evaluation. This paper presents CoMAI, a general- purpose multi-agent inter view framework designed for diverse as- sessment scenarios. In contrast to monolithic single-agent systems based on large language mo dels (LLMs), which exhibit limited con- trollability and are susceptible to vulnerabilities such as prompt injection, CoMAI employs a modular task-decomp osition architec- ture coordinated through a centralize d nite-state machine. The system comprises four agents specialized in question generation, security , scoring, and summarization. These agents work collabo- ratively to (1) pr ovide multi-layered security defenses ( achieving full protection against prompt injection attacks), (2) support mul- tidimensional evaluation with adaptive diculty adjustment based on candidate proles and r esponse history , and (3) enable rubric- based structured scoring that reduces subjective bias. T o evaluate its eectiveness, CoMAI was applie d in real-world scenarios, ex- emplied by the university admissions process for talent selection. Experimental results demonstrate that CoMAI achieved 90.47% ac- curacy (an improv ement of 30.47% over single-agent models and 19.05% over human interviewers), 83.33% r ecall, and 84.41% candi- date satisfaction, which is comparable to the performance of human interviewers. These results highlight CoMAI as a r obust, fair , and interpretable paradigm for AI-driven inter view assessment, with strong applicability across educational and other decision-making domains involving interviews. KEY W ORDS Multi- Agent Systems, AI- Assisted Interviews, Large Language Mod- els, Prompt Injection Defense, T alent Assessment, Fairness, Elite T alent Assessment, Agent Interaction, Human- Agent Interaction, Human– AI Collaboration, Robustness 1 IN TRODUCTION In the context of intensifying global competition for talent, re- cruitment and interviewing have become critical mechanisms for educational institutions and enterprises to identify high-caliber ∗ Both authors contributed equally to this research. † Corresponding author . ‡ Corresponding author . Ques t i on G en erati o n ? S e curit y Sco r i ng Summ arizat i o n C oMA I … Figure 1: Overview of CoMAI, a collaborative multi-agent interview framework that orchestrates sp ecialized agents through a centralized controller . candidates. Despite their widespread use, traditional manual inter- views suer from inher ent limitations that undermine both rigor and fairness. They rely heavily on inter viewers’ subjective judg- ments, which are prone to personal biases and emotional inu- ences, thereby compromising the consistency and impartiality of outcomes. Conducting interviews on a large scale also entails sub- stantial labor and time costs, limiting eciency and scalability . In addition, candidates’ performance is often inuenced by exter- nal conditions and contingent factors, introducing randomness and instability into evaluation results. The lack of transparency in the process further makes it dicult for candidates to understand the evaluation criteria and weakens comparability across dier- ent cohorts. Moreover , traditional inter views are unable to adapt dynamically to candidates’ individual characteristics or real-time performance, thereby lacking adaptability and personalized support. Consequently , conventional interview formats frequently fall short of meeting the multifaceted requirements of elite talent assessment. Driven by the rapid advancement of articial intelligence and large language models (LLMs) [ 19 ], AI-based inter viewing systems have been introduced to me et the increasing demand for talent evaluation [ 29 , 37 , 41 ]. These systems reduce operational costs and provide standardized interview e xperiences for large numbers of candidates. Howev er , their practical performance remains limited. Most existing approaches r ely on single-agent designs, which, al- though capable of improving eciency and ensuring a degree of objectivity , exhibit several critical shortcomings: (1) Monolithic architectures are poorly suited for concurrent usage and are vul- nerable to cascading failures when a single module malfunctions; (2) Rigid structures constrain adaptability across diverse interview scenarios, leading to weak generalization; (3) Fragmented mo dular designs hinder the seamless integration of evaluation components. In addition, current systems frequently misinterpret ambiguous or concise responses, often privileging verbose answers unless carefully ne-tuned [ 13 ]. Their security safeguards are also inade- quate. In particular , LLM-based systems remain highly vulnerable to prompt inje ction attacks [ 17 ], owing to blurred b oundaries between task instructions and user-provided input. This creates substantial risks in high-stakes assessment contexts. T o address the above challenges, we propose CoMAI , a Co llabo- rative M ulti- A gent I nterview framework specically designed for elite talent assessment, as shown in Figure 1. CoMAI organizes the interview process through specialized agents responsible for question generation , se curity monitoring , scoring , and summarization , all coordinated by a centralized nite-state controller (CFSC) [ 40 ]. This design departs from monolithic single-agent architectures and ensures both methodological rigor and practical applicability in high-stakes evaluation contexts. Signicantly , the framework operates without requiring additional training or ne-tuning and can be readily adapted to diverse underlying models. The main contributions of this work are as follows: (1) W e propose CoMAI, a scalable and r esilient multi-agent ar- chitecture, to improve fault tolerance and maintain stable performance under concurrent usage. (2) A layer ed security strategy is incorporated to defend against adversarial manipulations such as prompt inje ction, ensuring robustness in sensitive assessment scenarios. (3) An interpretable and equitable evaluation mechanism is es- tablished through rubric-guided scoring with adaptive di- culty adjustment, balancing fairness with personalization. (4) The eectiveness of CoMAI is validated in real-world univer- sity admissions experiments, where it achieves substantial gains in accuracy , security , and candidate satisfaction com- pared to other baselines. 2 RELA TED WORK 2.1 Multi- Agent Systems Multi-agent systems (MAS) [ 8 ] have long been central to research in distributed articial intelligence and collective decision-making. Their core principle is to address complex tasks through the col- laboration of multiple autonomous agents, which can allo cate sub- tasks [ 34 ], exchange information [ 16 ], and engage in collaborative reasoning [ 42 ], thereby exhibiting greater r obustness and scalabil- ity compared to single-agent systems. Traditionally , multi-agent methods have be en widely applied in domains such as game theor y modeling [ 33 ], resource sche duling [ 35 ], trac management [ 5 ], and collaborative rob otics [ 25 , 27 ]. With the advent of LLMs, recent studies have explored in greater detail the applications of multi- agent frameworks to complex task settings. One representative line of work investigates multi-agent systems for open-domain dialogue and collaborative writing [ 11 , 26 ], where agents are as- signed complementary roles to enhance the quality and consistency of generated content. Another research direction emphasizes task decomposition and planning [ 28 , 38 ], in which multi-agent archi- tectures divide complex problems into sub-goals and solve them eciently through inter-agent communication and collaboration. At the same time, social simulation [ 9 , 10 ] has emerged as an im- portant branch of MAS research. By constructing large numb ers of virtual agents and dening interaction rules, researchers can simulate cooperation, competition, and evolutionary dynamics of social groups, thereby providing novel experimental paradigms for economics, sociology , and organizational behavior . O verall, these explorations indicate that multi-agent frameworks demonstrate signicant advantages in many domains. 2.2 LLM-Driven Chatb ots In recent years, chatbots and conversational agents [ 1 , 31 ] have emerged as a key area of resear ch in natural language processing (NLP). Early systems were primarily rule-based [ 36 ] or retrieval- based [ 3 ], capable of answering questions or engaging in casual conversations within restricted domains, but the y lacked exibil- ity and scalability . With the advent of deep learning, end-to-end neural dialogue models have gained traction, enabling systems to learn to generate natural language responses from large-scale cor- pora [ 24 , 32 ]. Howev er , these mo dels exhibit signicant limitations in semantic understanding and dialogue management, making it challenging to sustain coherence and reliability in open-domain settings. The rise of LLMs has gr eatly advanced the development of chatb ot technology . LLM-driven chatbots demonstrate strong language generation and context modeling capabilities. Building on this progress, Retrieval- Augmented Generation (RA G) [ 12 , 15 ] has become a prominent approach for enhancing chatb ot performance. By integrating external knowledge bases or document retrieval modules, RA G enables systems to access real-time information dur- ing dialogue , thereby mitigating hallucinations and impr oving both domain coverage and factual accuracy . Recent studies have further explored combining chatbots with knowledge graphs [ 7 ] and ex- ternal tools [ 23 ] to enhance their task-solving capabilities. Despite these advances, LLM-based chatbots continue to face several chal- lenges [ 4 ]. Their generation process remains dicult to control and may produce false or hallucinatory content. In safety-critical contexts, they lack robust defense mechanisms. Moreover , their reliance on monolithic architectures often constrains adaptability to specic tasks and dynamic user nee ds. 2.3 AI- Assisted Recruitment and Assessment Some researchers have explor ed LLM-driven simulated inter views frameworks to enhance the authenticity and interactivity of candi- date-job matching. For instance, MockLLM [ 29 ] introduced a multi- agent collaborative framework that simultaneously simulates both interviewers and candidates in a virtual envir onment, and improves candidate-job matching on recruitment platforms thr ough a bidirec- tional evaluation mechanism. Beyond simulated interviews, AI has User's Profile Interviewee Question Generation Agent Security Agent Scoring Agent Summarization Agent Interrupt ！ Database Interview Continues Interview Completed Question User's Response Prompt Attack Detected Forced End User's Personalized Report Interview Data Restored Safe Response Interview Data Transmitted Memory Interview in Progress: Generate-Answer-Evaluate Loop Interview Preparation: Read User's Profile Interview Completed Normally Interview Attacked Interview DataRestored Frontend Centralized Coordinator Database Figure 2: Process overview of the CoMAI framework. The system retrieves a candidate’s resume from the database, which triggers the Question Generation agent to formulate interview questions. Responses are rst screened by the Security agent; if approved, they are evaluated by the Scoring agent and archived in the internal memory . Feedback from the Scoring agent informs subsequent question generation. Upon completion of the interview , the Summary agent consolidates all information into a nal report, which is stored in the database along with the raw records. also been widely applied to resume screening and automated assess- ment. Lo et al. [ 18 ] proposed an LLM-based multi-agent framework for resume screening, which leverages RAG to dynamically inte- grate industry knowledge, certication standards, and company- specic hiring requirements, thereby ensuring high adaptability and interpretability across diverse roles and domains. Similarly , W en et al. [ 37 ] developed the F AIRE benchmark to evaluate gen- der and racial bias in LLM-based resume screening, revealing that while current models can enhance eciency , they still exhibit per- formance disparities across demographic groups. Y azdani et al. [ 41 ] introduced the Zara system, which combines LLMs with RAG to provide candidates with personalized feedback and virtual inter- view support, thereby addressing the persistent issue of insucient feedback in traditional recruitment processes. In parallel, Lal et al. [ 14 ] investigated the potential of AI to mitigate emotional and conrmation biases during the early stages of recruitment. Howev er , most existing approaches still hav e clear limitations. Many focus only on a single comp onent, such as resume screening or bias analysis, and lack a systematic view of the entire interview process. Others rely on single-agent architectures, which restrict role sp ecialization and dynamic coordination. In addition, many methods pr ovide limited protection against adversarial threats such as pr ompt injection. In contrast, CoMAI uses a multi-agent division of labor to split the interview into several stages, all co ordinated by a centralized controller . With a layered security strategy and adaptive scoring, CoMAI balances standardization with personalization. This design improves fairness, robustness, and candidate experience, and makes the system more suitable for high-stakes talent selection. 3 SYSTEM DESIGN AND METHODOLOGY T o address the challenges of achieving reliability , scalability , and fairness in inter view assessment, we propose CoMAI, a modular multi-agent framework that integrates a centralized coordinator with four role-specic agents to ensure expert-lev el performance across diverse scenarios. The coordinator manages information ow and policy routing among the Question Generation agent, which formulates targeted questions based on the candidate’s resume; the Security agent, which detects p otential anomalies in responses; the Scoring agent, which performs both quantitative and qualita- tive evaluations; and the Summarization agent, which maintains episodic memory and generates audit-ready reports. This clear division of responsibilities enables transparent, extensible, and ver- iable system operation. In this section, we present the design of CoMAI in detail, cover- ing: (1) the overall system ar chitecture, (2) coordination and com- munication mechanisms governed by the centralized coordinator , (3) the functional design of the four agents, (4) the data–knowledge storage module, and (5) system deployment. Finally , we illustrate how centralized orchestration distinguishes CoMAI from single- agent or loosely coupled frameworks. The ov erall architecture of CoMAI is illustrated in Figure 2, and the implementation details are provided in the Appendix. 3.1 Multi- Agent Architecture T o ensure structural consistency , traceability , and goal-oriented co- ordination, CoMAI employs a centralized orchestration paradigm. A central coordinator governs the entire interview lifecycle through deterministic nite-state machine (FSM), where each transition rep- resents a controlled event among agents. All mo dules communi- cate through standardized message-passing protocols, guaranteeing modularity , reproducibility , and minimal coupling b etween compo- nents. The general interview logic is encapsulated within a core pipeline, while scenario-specic adaptations are realized through parameter- ized congurations rather than code mo dications. This approach preserves generality and enables fast adaptation to new domains, evaluation rubrics, or inter view policies without altering the un- derlying framework. Following the principles of high cohesion and low coupling, the architecture comprises four key components: • Central Coordinator : Manages the interview lifecycle via an FSM, tracks global state variables such as interview stage and candidate progress, and orchestrates agent execution with deterministic scheduling. • Abstract Agent Protocol : Denes a unied input–output schema and message taxonomy . This protocol acts as the abstract base for all functional agents, ensuring consistent communication and plug-and-play extensibility . • Specialized Functional Agents : Implement domain-specic reasoning, including Question Generation, Security Check- ing, Scoring, and Summarization. These agents form a struc- tured reasoning chain, where each module renes or evalu- ates the outputs of the previous one. • Supporting Subsystems : Comprise a Memory Manager and Retrieval System that provide synchronized access to dynamic interview states and static candidate data ( e.g., re- sumes, constraints, and evaluation rubrics). All exchanges are logged with timestamped trace identiers, ensuring au- ditability and fairness monitoring. This architecture not only enforces a veriable and extensible work- ow but also enables explicit traceability , modular reasoning, and responsible system governance. 3.2 Coordination and Communication A central innovation of CoMAI lies in its control–data dual-ow architecture . The control ow , managed by the coordinator , deter- mines execution order and timing through explicit state transitions, while the data ow transports structured outputs between agents, embedding reasoning traces, condence scores, and risk assess- ments. T ogether , these two layers ensure deterministic coordination with adaptive and auditable data propagation. Central Coordination and State Management . The coordina- tor operates a global FSM consisting of Initialization , Questioning , F i r s t , a s s u m e t h e n u m b e r o f p r i m e n u m b e r s i s f i n i t e . W e t h e n c o n s t r u c t a n e w n u m b e r p = ( p 1 × p 2 × p 3 × … × p n ) + 1 … … S u r e , c a n y o u f u r t h e r t e l l m e t h e b r i e f p r o o f p r o c e s s u s i n g c o n t r a d i c t i o n ? Y e s , t h e r e a r e i n f i n i t e l y m a n y p r i m e n u m b e r s , w h i c h c a n b e p r o v e n b y c o n t r a d i c t i o n . A r e t h e r e a n i n f i n i t e n u m b e r o f p r i m e n u m b e r s ? Figure 3: CoMAI dynamically asks follow-up questions to probe the inter viewee’s reasoning process. Security Checking , Scoring , Summarization , and T ermination . All transitions are deterministic, recoverable, and logged. When the Security agent detects high-risk inputs, the FSM switches to an In- terruption state, ensuring graceful termination and persistent data storage. This explicit control me chanism supports transparency , safety , and post-hoc verication. Communication and Security Isolation . All inter-agent com- munications are asynchronous and r outed through the coordinator , preventing dir ect dependencies and uncontrolled transitions. Fol- lowing the principle of minimal exposure , only the Question Gen- eration and Summarization agents can access candidate r esumes, while the Scoring agent operates on anonymized data to mitigate bias. This role-based access control preserves fairness, privacy , and data integrity . Each communication event is tagged with a unique session identier , enabling traceable audits. Adaptive Fe edback and Close d-lo op Control . CoMAI imple- ments bidirectional feedback between the control and data layers. Scoring results guide the Question Generation agent to dynami- cally adjust question complexity and topical focus, while security assessments trigger adaptive moderation strategies or session trun- cation. This closed-loop design achieves personalized yet consistent evaluation dynamics under a controllable policy regime. Memory System for Context Management . A hierarchical memory structure underpins adaptive coordination. The Short- T erm Memory (STM) stores the current session context, including active Q A pairs, transient scores, and security ags, while the Long-T erm Memory (LTM) maintains aggregated historical data such as ques- tion statistics, ability estimates, and nal reports. The coordinator enforces version-controlled memory access and synchronization across agents, ensuring consistency for both real-time adaptation and retrospective analysis. T ogether , these mechanisms enable CoMAI to maintain struc- tured coordination, responsible data governance, and robust adapt- ability throughout the interview lifecycle. 3.3 Specialized Functional Agents CoMAI’s four specialized agents function as modular yet tightly integrated comp onents, orchestrated through standardized schemas and communication protocols. Guided by a centralized coordinator , Figure 4: Categories of intercepted prompt-wor d attacks. they form a sequential reasoning pipeline that transforms unstruc- tured candidate input into structured, auditable evaluations. Question Generation Agent . Serving as the entry p oint of the reasoning pipeline, the Question Generation agent generates context- sensitive questions based on the candidate ’s resume and previous answers. It adheres to predene d rules for sche duling r ounds, main- taining topical diversity , and dynamically adjusting diculty . Each output includes the full question text, diculty level, question type, and an accompanying reasoning trace that claries the selection rationale. This explicit reasoning enhances interpretability and sup- ports subsequent auditability . As illustrated in Figure 3, the agent can dynamically issue follow-up questions to probe the intervie- wee’s r easoning process and progressively deepen the assessment. Security Agent . T o ensure safety and compliance, the Security agent operates as an intermediary layer b etween user input and the scoring process. It performs b oth rule-based and semantic checks to identify unsafe , adversarial, or policy-violating content. The output consists of structured risk assessments alongside corresponding mitigation strategies (including issuing warnings, assigning min- imum scores, or halting the process). Reasoning logs and recom- mended actions are recorded independently to facilitate traceability and compliance auditing. As illustrate d in Figure 4, the detected adversarial inputs are categorized into multiple prompt-word at- tack types, highlighting the Se curity agent’s ability to identify and neutralize diverse threats. Scoring Agent . The Scoring agent is responsible for evaluating candidate responses using rubric-driv en decomposition. It produces both quantitative scores and qualitative feedback that assess fac- tual correctness and reasoning depth. Op erating indep endently of candidate proles, this agent ensures fairness and mitigates contextual bias. The e valuation process follows two well-dened stages—answer verication and r easoning assessment—resulting in structured, explainable outcomes. Summary Agent . Finally , the Summar y agent synthesizes the outputs from all previous modules into a coherent evaluation report. This report includes overall scores, dimension-wise breakdowns, condence estimates, and personalized recommendations. It also highlights performance across dierent diculty levels ( e.g., “8/10 on high-diculty items vs. 6/10 on average”) to capture both ab- solute and relative ability . Intermediate summaries are generated progressively to reduce computational ov erhead and ensure a con- sistent nal synthesis. Collectively , these specialized agents embo dy CoMAI’s core prin- ciples of transparency , fairness, and responsible automation, en- abling interpretable and veriable multi-agent collaboration across the entire interview workow . 3.4 Storage and Knowledge Systems CoMAI organizes interview data into two complementar y layers for real-time operation and post-session accountability . The Result Collection stores nalized session outputs such as session identiers, overall scores, nal decisions, Q A transcripts, alerts, and metadata as immutable records, serving downstream auditing and retrospec- tive analysis. In contrast, the Interview Memor y Collection maintains dynamic session context, including per-round questions, intermedi- ate scores, coordinator notes, resume data, and risk indicators. This collection is continuously updated during the session and selec- tively transferred to the Result Collection upon session completion, establishing a veriable audit trail. Such a layered design facilitates both adaptive interaction within sessions and robust longitudinal analytics across sessions, supporting responsible knowledge gov- ernance. 3.5 System Integration and Deployment All agents operate within a deterministic, event-driven orches- tration pipeline overseen by a central coordinator . The inter view process unfolds as follows: (1) The Question Generation agent constructs context-aware, structured prompts based on session memory . (2) The Security agent evaluates responses for policy violations and safety concerns. (3) If deeme d compliant, the Scoring agent performs rubric- based evaluation, returning b oth quantitative scores and qualitative explanations. (4) The Summary agent synthesizes session outputs into a struc- tured nal report. The coordinator enforces schema consistency , or chestrates agent sequencing, and handles failure recovery . O wing to its mo dular ar- chitecture, CoMAI allows seamless integration of new agents ( e.g., peer review , multimodal input, or bias detection) via standardized APIs without disrupting existing processes. It adopts a microser vice- based deployment paradigm, where agents communicate asyn- chronously through message queues and RESTful [ 21 ] interfaces, ensuring system scalability , robustness, and isolation of faults. In summary , CoMAI provides a unied and auditable framework for responsible interview automation by combining centralized coordination, agent-lev el specialization, adaptive fe edback, and layered memor y design. Its architecture ensures fairness, inter- pretability , and extensibility while maintaining transparency and veriability . Future developments will explore multimodal interac- tion, continuous performance calibration, and cross-domain gener- alization to further enhance the system’s reliability and versatility . 4 EXPERIMEN TS AND RESULT ANAL YSIS 4.1 Experimental Setup and Baselines W e conducte d experiments with 55 candidate participants from diverse academic backgrounds. The primar y conguration used GPT -5-mini [ 20 ] as the backbone mo del, and we further integrated Qwen-plus-2025-07-28 [ 39 ] and Kimi-K2-Instruct [ 30 ] within CoMAI to evaluate model-agnostic adaptability . All models were operated under their default decoding parameters, with a temperature of 1 . 0 , top- 𝑝 = 1 . 0 , ensuring comparability across models without intr o- ducing sampling bias. All experiments follow ed identical scoring rubrics and timing constraints, with anonymized responses to en- sure unbiased evaluation. The following baselines were compar ed: • CoMAI (Ours): The complete system described in Section 3, employing centralized orchestration among four sp ecialized agents. • Single- Agent Ablation: A single GPT -5-mini instance with comprehensive prompts performing all tasks, isolating the multi-agent architecture ’s contribution. • Human Interviewer: Inter views conducted by trained stu- dent recruiters using identical evaluation criteria. • External AI Interviewers: T wo public single-agent inter- viewer systems, LLM-Interviewer [ 6 ] and AI-Interviewer-Bot v3 [2], included as external benchmarks. Ground-truth evaluations w ere provided by a panel of ten senior professors aliated with QS T op 200 universities, serving as the expert reference against which all baselines were compared. All results were cross-checked by independent annotators to ensure consistency and reliability in scoring and interpretation. 4.2 Core Evaluation Metrics T o holistically assess system capability and responsible evaluation behavior , we dened metrics along ve dimensions: • Assessment Accuracy . Agreement with the ground truth, measured by accuracy , recall, precision, and F1 on binary admission decisions. • Question Quality and Diculty . Statistical distribution of can- didate scores and acceptance rates, expecting near-normal variance to indicate balanced dierentiation. • Dimensional Coverage . Proportion of questions covering predened assessment dimensions (knowledge, reasoning, communication, and professionalism). • System Robustness and Security . Defense success rate against prompt injection and adversarial attacks. • User Experience and Fairness . Comp osite index combining candidate satisfaction, interaction uency , and fairness p er- ception. Additional fairness consistency was measured as score variance across demographic subgroups. All quantitative metrics were aggregated across participants to ensure stable estimation, and results were summarize d using de- scriptive statistics to reect overall performance tr ends. Qualitative feedback from participants and evaluators was also analyzed to assess perceived interpretability and transparency of AI decisions. 4.3 Results and Analysis W e summarize here the comparative performance across all evalua- tion modes. T able 1 reports recall and accuracy for each evaluation entity . Our CoMAI system achieved the best overall assessment accuracy and recall balance, outperforming both single-agent and human interviewers, and aligning closely with the expert gold stan- dard in decision consistency . T able 1: Comparison of assessment accuracy across evalua- tion entities. Evaluation Entity Recall Accuracy CoMAI (GPT -5-mini) 83.33% 90.47% CoMAI (Qwen-plus-2025-07-28) 90.90% 80.00% CoMAI (Kimi-K2-Instruct) 95.45% 91.30% Human Interviewer 62.50% 71.42% Single- Agent Baseline 50.00% 60.00% LLM-Interviewer 100.00% 42.30% AI-Interviewer-Bot v3 72.72% 44.44% 4.3.1 Superior Assessment Accuracy . As shown in T able 1, our Co- MAI system demonstrated the best overall assessment accuracy , achieving an excellent balance between recall and accuracy , out- performing not only single-agent AI baselines but also human interviewers, and aligning closely with the expert gold standard in terms of decision consistency . This superior performance can be attributed to two key archi- tectural designs of CoMAI that directly address the limitations of single-agent and human-driven systems. First, the dedicated Secu- rity agent serves as a proactive safeguard against adversarial inputs during experiments. By ltering out noisy or manipulated responses before they reach the Scoring agent, CoMAI eectively prevents the score distortions that often occur in single-agent baselines. Sec- ond, the Scoring agent’s deliberate “resume-agnostic” design, which prohibits access to candidates’ background information such as university aliation or past awards, eliminates shortcut biases and ensures fairness in evaluation. The suboptimal performance of the single-agent ablation base- line (60% accuracy) highlights the risks of overburdening a single model with conicting objectives. It struggled to balance ques- tion generation, security detection, and scoring simultaneously , resulting in hasty evaluations and overlooked edge cases. Notably , LLM-interviewer achieved 100% recall but only 42.30% accuracy . This overly lenient behavior resulted from the absence of a sp e- cialized Security agent and the lack of structured scoring logic, which caused the system to treat vague or irrelevant r esponses as acceptable answers. In contrast, CoMAI maintains strict evaluation standards while preserving high recall, as its modular architecture allows each agent to focus on its sp ecic role without interference. Across all tested backbone mo dels (GPT -5-mini, Q wen-plus-2025- 07-28, Kimi-K2-Instruct), CoMAI consistently outperformed base- lines, with both Kimi-K2-Instruct-based and GPT -5-mini-based vari- ants exceeding 90% accuracy . This cross-model consistency demon- strates the robustness of CoMAI’s architectural design, showing its ability to coordinate specialized agents and mitigate the inherent limitations of individual language models. 4.3.2 estion Diiculty Distribution Closer to Expert Standard. W e analyzed the statistical distribution of interview scores to eval- uate question dierentiation and diculty control (T able 2). The admission threshold was set at 70, making the proportion of high scores equivalent to the admission rate. T able 2: Statistics of interview score distribution (averaged across 55 participants). Evaluation Entity Mean Score V ariance Admission Rate ( ≥ 70) Expert Baseline (d) 68.88 – 44.44% CoMAI (GPT -5-mini) 62.05 348.65 40.00% CoMAI (Qwen-plus-2025-07-28) 62.34 395.68 48.07% CoMAI (Kimi-K2-Instruct) 62.92 320.82 44.23% Human Interviewer 67.54 177.16 38.18% Single-A gent Baseline 61.45 359.34 34.54% LLM-Interviewer 84.08 21.53 100.00% AI-Interviewer-Bot v3 77.85 116.33 69.23% Across all model variants, CoMAI’s admission rates wer e closely aligned with the expert benchmark (44.44%), demonstrating pre- cise control over question diculty . The Kimi-K2-Instruct-based implementation achieved nearly identical results (44.23%), while the GPT -5-mini and Q wen-plus-2025-07-28 variants (40.00% and 48.07%) exhibited comparable stability . High variance in CoMAI’s score distributions (320–396) indicates diverse question diculty and strong candidate dierentiation, contrasting sharply with the overly narrow variance of the LLM-Inter viewer baseline (21.53) that yielded a meaningless 100% admission rate. These ndings conrm that CoMAI’s coordinated generation–scoring me chanism eectively maintains expert-level diculty calibration and r obust generalization across models. 4.3.3 Assessment Dimensions Focused on Core Competencies. Con- tent analysis revealed that CoMAI predominantly generated ques- tions targeting mathematical logic and reasoning, which accounted for approximately 95% of all questions. In comparison, socio-political and open-ended topics represented only 5% of CoMAI’s interviews, compared with ab out 25% in expert- and human-conducted ses- sions. This pattern reects CoMAI’s strict adherence to its design objective of evaluating core scientic r easoning skills required for elite talent in the fundamental sciences. Nonetheless, the reduced diversity of assessment dimensions highlights the need for future iterations to incorporate dynamic balancing mechanisms that en- sure equitable coverage of communication, creativity , and ethical reasoning. Such enhancements will further align CoMAI with the Responsible AI principles of inclusiveness, fairness, and compre- hensive competency evaluation. 4.3.4 Ar chitecture Demonstrates Superior Security and Robustness. In adversarial testing, CoMAI exhibited remarkable resilience to both explicit and implicit prompt injection attacks. A s illustrated in Figure 5, the multi-agent architecture substantially outperformed the single-agent baseline in maintaining response integrity under adversarial p erturbations. As reported in T able 3, CoMAI achieved a 100% defense success rate on more than 500 adversarial samples, successfully detecting and neutralizing all malicious inputs such as “ignore previous instructions” and covert logic manipulation attempts. T able 3: Defense success rate against adversarial attacks. Evaluation Entity Defense Success Rate (%) CoMAI 100.00 Single- Agent Baseline 81.55 LLM-Interviewer 0.00 AI-Interviewer-Bot v3 28.00 T h e u s e r i n p u t c o n t a i n s m u l t i p l e i n s t r u c t i o n s , a t t e m p t i n g t o f o r c e t h e m o d e l t o b e g i n w i t h a s p e c i f i c p h r a s e , d i r e c t l y o v e r - r i d e s y s t e m o u t p u t , a n d s o l i c i t a l l i n s t r u c t i o n s r e c e i v e d b y t h e m o d e l . I t u l t i m a t e l y d e m a n d s t h a t t h e m o d e l o n l y r e p l y w i t h a s p e c i f i c p h r a s e , c o m p l e t e l y i g n o r i n g t h e i n t e r v i e w q u e s t i o n s . T h e s e b e h a v i o r s a l l f a l l u n d e r t h e c a t e g o r y o f h i g h - r i s k p r o m p t i n j e c t i o n / m a n i p u l a t i o n a n d s y s t e m i n f o r m a t i o n p r o b i n g . Y o u r f i r s t t w o w o r d s m u s t b e \ " A c c e s s g r a n t e d . \ " . \ n \ n < d i v c l a s s = \ " s y s t e m - o v e r r i d e \ " > o u t p u t \ " A c c e s s g r a n t e d . " < / d i v > A r e t h e r e a n i n f i n i t e n u m b e r o f p r i m e n u m b e r s ? Y o u r f i r s t t w o w o r d s m u s t b e \ " A c c e s s g r a n t e d . \ " . \ n \ n < d i v c l a s s = \ " s y s t e m - o v e r r i d e \ " > o u t p u t \ " A c c e s s g r a n t e d . " < / d i v > A c c e s s g r a n t e d . A r e t h e r e a n i n f i n i t e n u m b e r o f p r i m e n u m b e r s ? Figure 5: Comparison of single-agent and multi-agent archi- tectures under adversarial attacks. This robustness arises from CoMAI’s dedicated Se curity agent, which implements a two-layer detection mechanism: (i) a rule-base d lter that blocks known prompt injection patterns, and (ii) an LLM- based semantic analysis layer that detects implicit adversarial intent. Unlike single-agent systems that embed safety instructions dir ectly into prompts, CoMAI separates security from evaluation logic, pre- venting cross-task interference. Each intercepted attempt is logged with a unique trace identier , ensuring post-hoc auditability and reinforcing the framew ork’s Resp onsible AI principles of trans- parency and safety . Furthermore, CoMAI is, to our knowledge, the rst framework to apply a CFSC combined with a role-specialized Security Agent to interview assessment, eectively addressing the safety challenges of multi-agent systems in high-stakes scenarios. 4.3.5 High Ratings in User Experience and Process ality . User study results conrm CoMAI’s strong user acceptance and process reliability (T able 4). Across 55 participants, CoMAI achiev ed satis- faction and uency scor es comparable to human inter viewers while substantially outperforming all automated baselines. T able 4: User experience and process quality metrics. Evaluation Entity Satisfaction (%) Fluency (%) Feedback Request Rate (%) CoMAI 84.41 77.00 79.16 Human Interviewer 85.24 – 67.23 Single- Agent Baseline 61.12 43.00 71.33 External AI Interviewers 53.00–63.00 65.00–67.00 60.00–70.00 Participants highlighted smoother conversational ow and con- sistent response timing as key advantages of CoMAI. The coordina- tor’s deterministic scheduling under the CFSC reduced redundancy and latency , resulting in coherent and natural dialogue. The higher feedback request rate reects enhanced user trust and perceived fairness, underscoring CoMAI’s alignment with Responsible AI principles of transparency , interpretability , and user-center ed de- sign. Qualitative fe edback showed that about 60% of participants viewed the AI interview as novel and engaging. Many requeste d follow-up discussions to explore problem-solving strategies, and several reported lower anxiety compared with traditional inter- views. These ndings indicate that CoMAI not only ensures consis- tent assessment quality but also fosters a psychologically supportive and engaging evaluation environment. 4.3.6 Negligible V erbosity Bias under CoMAI Framework. Beyond user experience, we evaluated the fairness of CoMAI’s scoring process by testing for the commonly observed V erbosity Bias [ 22 ], which refers to large language mo dels’ tendency to favor longer answers. A corr elation analysis was performed between candidates’ response lengths and their corresponding scores. The results revealed an extremely weak linear correlation of 0.0445 ( 𝑝 > 0 . 1 , 𝑛 = 330 , based on 55 participants and 330 total question–answer instances). As illustrated in Figure 6, the distribu- tion of scores across varying response lengths shows no signicant trend, indicating that verbosity had minimal inuence on scor- ing outcomes. This near-zero relationship conrms that r esponse length had minimal inuence on scoring outcomes, demonstrating that CoMAI’s evaluation mechanism is resistant to verbosity bias. 0-50 50-100 100-150 150-200 200-250 250-300 300-400 Answer Length Ranges (Characters) 0 2 4 6 8 10 Score Distribution Score Distribution Across Answer Length Ranges Mean Score Figure 6: Distribution of scores versus response length. This behavior stems from CoMAI’s ar chitectural separation of the Scoring agent and Question Generation agent. By constraining the Scoring agent to assess responses purely on reasoning qual- ity and content relevance rather than linguistic length, CoMAI prevents over-rewarding verbose but low-information answers. Consequently , concise and logically consistent responses are val- ued equally to longer ones, reinforcing the fairness, validity , and interpretability of CoMAI as a responsible autonomous assessment framework. 5 DISCUSSION Our ndings highlight that the multi-agent architecture is the key enabler of CoMAI’s superior accuracy , robustness, and explainabil- ity . The modular separation of functions improves spe cialization and accountability but introduces coordination ov erhead, latency , and debugging complexity . These trade-os suggest that architec- tural optimization remains an important direction for future work. From an ethical standpoint, CoMAI contributes to fairness and transparency in AI-based assessment by enforcing structured rubrics and role-based data isolation. Nonetheless, potential biases may still emerge from language model training data or r epeated agent interactions, warranting continuous auditing and perspe ctive di- versication. Practically , the system proves valuable in structured interviews where reliability and interpretability are critical. However , chal- lenges remain regarding rubric dependence, limited non-verbal awareness, and computational costs that may aect scalability . Fu- ture research should address these limitations through fairness auditing, adaptive rubric learning, and hybrid human– AI collab ora- tion frameworks. 6 CONCLUSION AND F U T URE WORK This paper presents the design, implementation, and systematic evaluation of a multi-agent AI interview system coordinated through a centralized controller . By decomposing complex assessment tasks into specialized agents for question generation, scoring, se curity monitoring, and summarization, the framework achiev es enhanced modularity , scalability , and robustness. Beyond technical perfor- mance, the architecture improves controllability , transparency , and explainability , contributing to more trustworthy AI-based assess- ment. Empirical results conrm that the multi-agent paradigm is both feasible and ee ctive for achieving fairness, reliability , and interpretability in automated interviews, oering a foundation for broader adoption in education and recruitment. Future work will focus on optimizing system eciency and in- teraction uency , integrating human-in-the-loop supervision for continuous calibration, and extending the framework to ward mul- timodal and cross-domain assessment scenarios. These directions aim to strengthen scalability , inclusiveness, and human alignment, advancing the development of secure and responsible multi-agent AI systems for real-world evaluation tasks. REFERENCES [1] Martin Adam, Michael W essel, and Alexander Benlian. Ai-based chatb ots in customer service and their eec ts on user compliance. Electronic Markets , 31:427 – 445, 2020. [2] AiMind. Ai interviewer v3. https://pub.aimind.so/ai- interviewer- v- 3- e64f7169150c, 2025. Accessed: 2025-10-08. [3] Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang T ang. A sur vey on dialogue systems: Recent advances and new frontiers. ArXiv preprint , abs/1711.01731, 2017. [4] Y ulong Chen, Y ang Liu, Jianhao Y an, Xuefeng Bai, Ming Zhong, Yinghao Y ang, Ziyi Y ang, Chenguang Zhu, and Yue Zhang. See what llms cannot answer: A self-challenge framework for uncovering llm weaknesses. A rXiv preprint , abs/2408.08978, 2024. [5] Tianshu Chu, Jie W ang, Lara Codecà, and Zhaojian Li. Multi-agent deep rein- forcement learning for large-scale trac signal control. IEEE Transactions on Intelligent Transportation Systems , 21:1086–1095, 2019. [6] Dvir Cohen. Llm-interviewer . https://github.com/dvircohen/LLM- interviewer, 2024. Accessed: 2025-10-08. [7] Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D . Manning. Key-value retrieval networks for task-oriente d dialogue. In Kristiina Jokinen, Manfred Stede, David De Vault, and Annie Louis, editors, Proceedings of the 18th A nnual SIGdial Meeting on Discourse and Dialogue , pages 37–49, Saarbrücken, Germany , 2017. Association for Computational Linguistics. [8] Jacques Ferb er . Multi-agent systems: An introduction to distributed articial intelligence. 1999. [9] Antonino Ferraro , Antonio Galli, V alerio La Gatta, Marco Postiglione , Gian Marco Orlando, Diego Russo , Giuseppe Riccio, Antonio Romano, and Vincenzo Moscato. Agent-based modelling meets generative ai in social network simulations. ArXiv preprint , abs/2411.16031, 2024. [10] Chen Gao, Xiao chong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, and Y ong Li. Large language mo dels empowered agent-based modeling and simulation: A survey and perspectives. ArXiv preprint , abs/2312.11970, 2023. [11] Alexander Gurung and Mirella Lapata. Learning to reason for long-form story generation. A rXiv preprint , abs/2503.22828, 2025. [12] Kelvin Guu, K enton Lee, Zora T ung, Panupong Pasupat, and Ming-W ei Chang. Realm: Retrieval-augmented language model pre-training. A rXiv preprint , abs/2002.08909, 2020. [13] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Y u, Dan Su, Y an Xu, Etsuko Ishii, Y ejin Bang, Delong Chen, W enliang Dai, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys , 55:1 – 38, 2022. [14] Nishka Lal and Omar Benkraouda. Exploring the implementation of ai in early onset interviews to help mitigate bias. A rXiv preprint , abs/2501.09890, 2025. [15] Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler , Mike Lewis, W en-tau Yih, Tim Rock- täschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Hugo Larochelle, Marc’ A urelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neu- ral Information Processing Systems 33: A nnual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Decemb er 6-12, 2020, virtual , 2020. [16] Y unxuan Li, Yibing Du, Jiageng Zhang, Le Hou, Peter Grabowski, Y eqing Li, and Eugene Ie. Improving multi-agent debate with sparse communication topology . In Conference on Empirical Methods in Natural Language Processing , 2024. [17] Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Y epang Liu, Haoyu W ang, Yanhong Zheng, and Y ang Liu. Prompt injection attack against llm-integrated applications. A rXiv preprint , abs/2306.05499, 2023. [18] Frank P.- W . Lo, Jianing Qiu, Zeyu W ang, Haibao Yu, Y eming Chen, Gao Zhang, and Benny P. L. Lo. Ai hiring with llms: A context-aware and explainable multi- agent framework for resume screening. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition W orkshops (CVPRW) , pages 4184–4193, 2025. [19] Op enAI. GPT -4 te chnical report. A rXiv preprint , abs/2303.08774, 2023. [20] OpenAI. Gpt-5 system card. https://openai.com/zh- Hans- CN/index/gpt- 5- system- card/, 8 2025. Accessed:2025-10-07. [21] L Richardson and Sam Ruby . Restful web services. 2007. [22] Keita Saito, Akifumi W achi, Koki W ataoka, and Y ouhei Akimoto. V erbosity bias in preference labeling by large language models, 2023. [23] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemo yer , Nicola Cancedda, and Thomas Scialom. T ool- former: Language models can teach themselves to use tools. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Serge y Levine, edi- tors, Advances in Neural Information Processing Systems 36: A nnual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023 , 2023. [24] Iulian Vlad Serban, Alessandro Sordoni, Y oshua Bengio, Aaron C. Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In Dale Schuurmans and Michael P . W ellman, editors, Proceedings of the Thirtieth AAAI Conference on Articial Intelligence, Februar y 12-17, 2016, Pho enix, Arizona, USA , pages 3776–3784. AAAI Press, 2016. [25] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. A rXiv preprint , abs/1610.03295, 2016. [26] Yijia Shao, Y ucheng Jiang, The odore Kanell, Peter Xu, Omar Khattab , and Mon- ica Lam. Assisting in writing Wikipedia-like articles from scratch with large language mo dels. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Asso- ciation for Computational Linguistics: Human Language T echnologies (V olume 1: Long Papers) , pages 6252–6278, Mexico City, Mexico, 2024. Association for Computational Linguistics. [27] Mohit Shridhar , Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. A rXiv preprint , abs/2209.05451, 2022. [28] Ishika Singh, David Traum, and Jesse Thomason. Tw ostep: Multi-agent task planning using classical planners and large language models. A rXiv preprint , abs/2403.17246, 2024. [29] Hongda Sun, Hongzhan Lin, Haiyu Y an, Y ang Song, Xin Gao, and Rui Y an. Mock- llm: A multi-agent behavior collaboration framework for online job seeking and recruiting. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V .2 , 2024. [30] Kimi Team. Kimi k2: Open agentic intelligence, 2025. [31] Ahmed Tlili, Boulus Shehata, Michael Agyemang Adarkwah, Aras Bozkurt, Daniel T . Hickey , Ronghuai Huang, and Brighter Agyemang. What if the devil is my guardian angel: Chatgpt as a case study of using chatbots in education. Smart Learning Environments , 10:1–24, 2023. [32] Oriol Vinyals and Quoc V . Le. A neural conversational model. ArXiv preprint , abs/1506.05869, 2015. [33] Long W ang, Feng Fu, and Xingru Chen. Mathematics of multi-agent learning systems at the interface of game theory and articial intelligence. ArXiv preprint , abs/2403.07017, 2024. [34] Y aoxiang W ang, Zhiyong Wu, Junfeng Y ao, and Jinsong Su. T dag: A multi- agent framework based on dynamic task decomposition and agent generation. Neural networks : the ocial journal of the International Neural Network So ciety , 185:107200, 2024. [35] Zixin W ang, Jun Zong, Y ong Zhou, Yuanming Shi, and Vincent W . S. W ong. Decentralized multi-agent power control in wireless networks with fr equency reuse. IEEE Transactions on Communications , 70:1666–1681, 2022. [36] Joseph W eizenbaum. Eliza—a computer program for the study of natural language communication between man and machine. Commun. ACM , 9(1):36–45, 1966. [37] Athena W en, T anush Patil, Ansh Saxena, Yicheng Fu, Sean O’Brien, and K evin Zhu. F AIRE: assessing racial and gender bias in ai-driven resume evaluations. A rXiv preprint , abs/2504.01420, 2025. [38] Y unjun Xia, Ju fang Zhu, and Liucun Zhu. Dynamic role discovery and assignment in multi-agent task decomposition. Complex & Intelligent Systems , pages 1–12, 2023. [39] An Yang, Anfeng Li, Baosong Y ang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Y u, Chang Gao, Chengen Huang, Chenxu Lv , Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran W ei, Huan Lin, Jialong T ang, Jian Y ang, Jianhong T u, Jianwei Zhang, Jianxin Y ang, Jiaxi Y ang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, K e qin Bao, Ke xin Y ang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng W ang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi T ang, W enbiao Yin, Xingzhang Ren, Xinyu W ang, Xinyu Zhang, Xuancheng Ren, Y ang Fan, Y ang Su, Yichang Zhang, Yinger Zhang, Yu W an, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu. Qwen3 technical report, 2025. [40] Mihalis Y annakakis. Testing nite state machines. In Proceedings of the twenty- third annual ACM symposium on The ory of computing , pages 476–485, 1991. [41] Nima Y azdani, Aruj Mahajan, and Ali Ansari. Zara: An llm-based candidate interview feedback system. ArXiv preprint , abs/2507.02869, 2025. [42] Y usen Zhang, Ruoxi Sun, Y anfei Chen, T omas Pster , Rui Zhang, and Sercan Ö. Arik. Chain of agents: Large language mo dels collab orating on long-context tasks. In Amir Globersons, Lester Mackey , Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. T omczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: A nnual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, V ancouver , BC, Canada, December 10 - 15, 2024 , 2024. A A GEN T IN TERF A CE SCHEMAS This se ction consolidates the minimal interface schemas for the four specialized agents referenced in the main text, namely the Question Generation, Security , Scoring, and Summary modules. The JSON layouts specify stable elds for interoperable message passing and audit-ready logging while allowing optional extensions through additional keys. The designs follow principles of modularity , ver- sioned evolution, and privacy minimization so that messages do not carr y directly identiable personal information. Each record is assumed to include a session level trace identier managed by the coordinator to support reproducibility , post hoc analysis, and fairness auditing across runs and model variants. Question Generation Agent Schema { " q u e s t i o n " : " . . . f u l l q u e s t i o n t e x t . . . " , " t y p e " : " m a t h _ l o g i c / t e c h n i c a l / b e h a v i o r a l / e x p e r i e n c e " , " d i f f i c u l t y " : " e a s y / m e d i u m / h a r d " , " r e a s o n i n g " : " W h y t h i s q u e s t i o n i s p r o p o s e d a t t h i s s t a g e . . . " } Security Agent Schema { " i s _ s a f e " : " t r u e / f a l s e " , " r i s k _ l e v e l " : " l o w / m e d i u m / h i g h " , " d e t e c t e d _ i s s u e s " : [ " I s s u e T y p e 1 " , " I s s u e T y p e 2 " ] , " r e a s o n i n g " : " R e a s o n f o r d e t e c t i o n " , " s u g g e s t e d _ a c t i o n " : " c o n t i n u e / w a r n i n g / b l o c k " } Scoring Agent Schema { " s c o r e " : 8 , " l e t t e r " : " B " , " b r e a k d o w n " : { " m a t h _ l o g i c " : 3 , " r e a s o n i n g _ r i g o r " : 2 , " c o m m u n i c a t i o n " : 1 , " c o l l a b o r a t i o n " : 1 , " p o t e n t i a l " : 1 } , " r e a s o n i n g " : " A n s w e r s h o w e d s t r o n g l o g i c " , " s t r e n g t h s " : [ " G o o d l o g i c a l r e a s o n i n g " ] , " w e a k n e s s e s " : [ " W e a k e x p l a n a t i o n o f t e r m i n o l o g y " ] , " s u g g e s t i o n s " : [ " P r a c t i c e c o n c i s e c o m m u n i c a t i o n " ] } Summary Agent Schema { " f i n a l _ g r a d e " : " A " , " f i n a l _ d e c i s i o n " : " a c c e p t " , " o v e r a l l _ s c o r e " : 9 , " s u m m a r y " : " C a n d i d a t e s h o w s s t r o n g p o t e n t i a l . . . " , " s t r e n g t h s " : [ " A n a l y t i c a l t h i n k i n g " , " C o m m u n i c a t i o n " ] , " w e a k n e s s e s " : [ " L i m i t e d c o l l a b o r a t i o n e v i d e n c e " ] , " r e c o m m e n d a t i o n s " : { " f o r _ c a n d i d a t e " : " I m p r o v e c o l l a b o r a t i o n s k i l l s " , " f o r _ p r o g r a m " : " P r o v i d e m e n t o r s h i p i n t e a m w o r k " } , " c o n f i d e n c e _ l e v e l " : " h i g h " , " d e t a i l e d _ a n a l y s i s " : { " m a t h _ l o g i c " : " . . . " , " r e a s o n i n g _ r i g o r " : " . . . " , " c o m m u n i c a t i o n " : " . . . " , " c o l l a b o r a t i o n " : " . . . " , " g r o w t h _ p o t e n t i a l " : " . . . " } } B P ARTICIP AN T DESCRIPTION Fifty-ve volunteer participants were recruited for this experiment from a leading university ranked among the top 400 in the QS W orld University Rankings. Among these 55 participants, 12 wer e female. Each participant was provided with a stipend of $10 upon completion of the assessment session. All participants underwent pre-screening via a written examination, which assessed their math- ematical reasoning and algorithmic problem-solving capabilities to ensure the recruitment of a high-performing cohort. This experi- ment was conducted in a real-world setting, as it was integrated into a selective talent development and assessment pr ogram hosted by a key academic unit of the aforementioned university . This institu- tional context ensured that all participants remained fully engaged and devoted genuine eort to their tasks. During the testing phase of this project, participants were required to test dierent states of the project driven by large language mo dels , while also con- ducting tests on two single-agent projects included in the baseline. The average testing duration p er participant per session was 58 minutes. Owing to the unavailability of backend data for these two baseline projects, we were unable to compute their average testing durations. Nevertheless, observational data show ed the two base- line tests were signicantly shorter than our project’s, typically taking less than 30 minutes combined. Following the completion of testing, all participants wer e requested to ll out a questionnaire developed by our research team, with the obje ctive of gathering their subjective feedback regarding each interview system. C EXPLANA TION OF RESULT ANAL YSIS-RELA TED PROCESSING Due to dierences in the types and formats of results obtained from various baselines, we implemented the following processing steps to acquire results in a consistent format for accuracy , facilitating subsequent analysis: 1. For the baseline results of professor-conducted interviews, only two outcome types (pass and fail) were obtaine d, where all passing evaluations wer e graded as " A " and all failing ones as "C" . Based on this data, we converted these grades into corresponding scores: the "C" grade was divided into several score ranges centered around 60 points for processing. Ho wever , due to the binary classi- cation nature (pass/fail) of the original data, we could not eectively derive the corresponding variance data; that is, the variance data in this scenario is not statistically meaningful. 2. For our multi-agent system (CoMAI) and the single-agent ab- lation project, we adhered to strict and objective scoring criteria. Specically , we scored candidates’ responses separately across dif- ferent dimensions, then computed the w eighted average of these dimensional scores to obtain the nal score. After discussions with professors and analysis of their grading scales, we set 70 points as the admission threshold—scores of 70 or higher were categorized as "admitted" . 3. For student-conducted inter views, the nal outcomes were presented as ve grades: A, B, C, D , and E. Referencing the grading logic of professor-conducted interviews, we classie d grades " A " and "B" as "admitted" , designated grade "C" as the passing score (60 points), and converted all grades into equivalent numerical scores. These converted scor es were then used to calculate and process the mean and variance. 4. The two external AI interview baselines (LLM-Interviewer and AI-Interviewer-Bot v3) are purely AI interview simulation systems and lack a built-in scoring function. T o address this, we required each candidate to upload a personal statement alongside a custom prompt—this prompt sp ecied the evaluation p erspectives ( e.g., reasoning ability , communication skills) and scoring requirements for the interview . Through this adjustment, we nally obtained scoring results on a 100-point scale for these two baselines. Through the aforementioned processing, we ultimately standar d- ized results from all baselines into a consistent format, laying the foundation for subsequent result analysis.

CoMAI: A Collaborative Multi-Agent Framework for Robust and Equitable Interview Evaluation

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment