Heterogeneous Debate Engine: Identity-Grounded Cognitive Architecture for Resilient LLM-Based Ethical Tutoring

Jakub Masłowski [0009-0005-4597-6335] and Jarosław A. Chudziak [0000-0003-4534-8652]

Institute of Computer Science, Warsaw University of Technology, Poland
{jakub.maslowski2.stud, jaroslaw.chudziak}@pw.edu.pl

Abstract. Large Language Models (LLMs) are increasingly used as autonomous agents in complex reasoning tasks, opening a niche for dialectical interactions. However, multi-agent systems built from unconstrained components systematically undergo semantic drift and logical deterioration, and are therefore poorly suited to ethical tutoring, where precise answers are required. Current simulations often degenerate into dialectical stagnation: agents collapse into recursive concurrence or circular argument. A critical challenge remains: how to enforce doctrinal fidelity without suppressing the generative flexibility required for dialectical reasoning? To address this niche, we contribute the Heterogeneous Debate Engine (HDE), a cognitive architecture that combines Identity-Grounded Retrieval-Augmented Generation (ID-RAG) for doctrinal fidelity with a Heuristic Theory of Mind for strategic opponent modeling. Our evaluation shows that architectural heterogeneity is a crucial variable for stability: contrary doctrinal initializations (e.g., Deontology vs. Utilitarianism) increased students' Argument Complexity Scores by an order of magnitude over baselines. These findings validate ID-RAG and Heuristic ToM as architectural requirements for maintaining high-fidelity (adversarial) pedagogy.
Keywords: Artificial Intelligence · Multi-Agent Systems · Ethical Tutoring · Large Language Models · Theory of Mind · Identity-Grounded RAG · Critical Thinking · Moral Reasoning · Computational Ethics

1 Introduction

The accelerating growth of Artificial Intelligence is driving a transition to the Agentic World, in which AI-based systems are no longer mere tools but actors with goals, capable of complex cognition and thought processes [12,9,35]. This paradigm offers immense opportunities to scaffold critical thinking in education [25,11].

Fig. 1. An illustrative fragment of a debate generated by the Heterogeneous Debate Engine. Two autonomous agents (Kant and Mill) maintain doctrinal faithfulness in order to expose the student to philosophical conflict.

Improvement through the integration of this sophisticated technology raises inherent questions about the logical integrity of automated dialectics [31,32]. The transition is promising, but it brings many architectural complexities. Without strong constraints, autonomous agents are prone to hallucinating facts, forgetting the identity they were supposed to hold [26], or reaching consensus rather than a legitimate pedagogical confrontation [36]. Traditional simulations often fail to provide the personal, dialectical interaction required to induce profound moral thought [27,5], and in many cases degenerate into attempts to reach social agreement, oscillating between facts and falsehoods [36,15]. To harness the strength of an Agentic World, conversational skill alone is not enough; agents require cognitive mechanisms that provide stability, coherence, and planning abilities [17,30].
We hypothesize that the educational agent must be endowed with two cognitive abilities to overcome these limitations: a Theory of Mind (ToM) to simulate and predict the thinking of interlocutors [29,3], and Retrieval-Augmented Generation (RAG) to ground arguments in philosophical literature [39]. Though the language core is a Large Language Model (LLM), it is the combination of these higher-level modules that makes an AI agent a full-fledged philosophical debater [23]. An illustrative example of such a debate is presented in Fig. 1.

This highlights the need for sophisticated Multi-Agent Debate architectures [2,28]. In this paper, we investigate the viability and sustainability of such a platform. The main research question is to establish the validity of architectural stability empirically: we hypothesize that architectural heterogeneity, implemented through Identity-Grounded RAG (ID-RAG), is a precondition for preventing logical degeneration, in contrast to homogeneous baselines. In particular, we ask whether diverse schools of ethics (e.g., Utilitarianism vs. Deontology) encapsulated in separate agents can keep the system from derailing into consensus or circular argumentation [28,18]. As a measure of architectural validity, we treat pedagogical value as a proxy for system coherence: we determine whether observing a stable multi-agent debate increases students' Argument Complexity Scores more than interactions with single LLM tutors do [13].

This paper contributes the Heterogeneous Debate Engine (HDE) to the field of AI-supported tutoring: a novel architecture demonstrating that heterogeneity is not only a protection against consensus collapse, but a genuine prerequisite for building tools capable of improving student critical thinking.
2 Related Work

Our work sits at the intersection of three overlapping disciplines: pedagogical ethics, higher-order multi-agent systems, and cognitive reasoning [11]. To bridge the gap between static knowledge and active reasoning, we first turn to classical Argumentation Theory. The structural basis for distinguishing claim, data, and warrant is given by models such as Toulmin's layout of arguments [31]. This theoretical basis is crucial for AI tutoring: a system cannot rely solely on retrieving facts but must also build valid lines of inference.

Dialectical pedagogy, with Socratic questioning as its foundation, fosters critical thinking because it requires learners to develop independent knowledge systems instead of remaining passive consumers of knowledge [24,27,5]. Although historical dialectics emphasizes formal argumentation as a means of disproving statements [31], modern AI tutoring tends to revert to passive knowledge delivery [10]. In turn, typical LLMs often lack the antagonism required for deep moral reasoning, so systems that actively provoke thought through structural confrontation are desirable [36,6]. Recent approaches also attempt to formalize ethical reasoning using probabilistic or deontic logic models to ensure decision-making consistency [33,7].

In the transition to an Agentic World, autonomous agents are used to coordinate complex workflows, frequently applying Belief-Desire-Intention models to solve long-horizon tasks [12,9,19]. Building on this autonomy, Multi-Agent Debate (MAD) reduces Degeneration-of-Thought through interactive critical review [22]. However, unconstrained systems face social conformity bias, where agents value superficial agreement more than the logical correctness of their answers [36,28].
Architectural heterogeneity is therefore needed so that agents do not vacillate between true and false knowledge in pursuit of social concordance [14].

A robust debate must balance character stability against generative flexibility. On the strategic plane, Counterfactual Reasoning and Theory of Mind (ToM) enable agents to simulate "what-if" scenarios and model opponents' intentions [29,23,6]. On the other hand, Identity-Grounded RAG (ID-RAG) anchors agents to structured belief graphs to avoid identity drift under pressure, going beyond mere factual retrieval [26,38]. The architectural tension analyzed in this paper is the cooperation of the argumentative credibility provided by ToM with the doctrinal stability of ID-RAG [1].

3 Approach, Methods, and Tools

To address the gaps identified in the literature review, we employ a research-by-construction approach through our platform, "League of Moral Minds". In this section, we present our general research plan, the technology stack used in the implementation, and the analytical model created for assessment.

3.1 Research Methodology and Technology Stack

We perform three key experiments: a system resilience test with adversarial perturbations, an ablation study addressing the heterogeneity-advantage hypothesis, and finally a pedagogical potential assessment. This triangulation provides a thorough examination of both technical stability and educational efficacy [10,36]. The platform is based on LangGraph [20], which orchestrates deliberation in a cyclic manner. The cognitive backend LLM is Google Gemini 2.0 Flash, chosen to show that the system can produce coherent reasoning even with latency-optimised models. ChromaDB is used to index philosophical corpora (e.g., Kant's Groundwork) for ID-RAG retrieval [21,5].
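As a rough illustration of the retrieval step, the following is a minimal stand-in for the ChromaDB-backed index described above, using keyword overlap in place of real embedding similarity; the corpus snippets, ids, and function name are invented for illustration and are not from the actual system.

```python
# Toy stand-in for the ChromaDB-backed ID-RAG index: scores corpus
# passages by keyword overlap instead of embedding similarity.
# All passages, ids, and names here are illustrative, not the real corpora.
CORPUS = {
    "kant_groundwork_1": "act only according to that maxim whereby you can "
                         "will that it should become a universal law",
    "mill_util_1": "actions are right in proportion as they tend to promote "
                   "happiness for the greatest number",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return ids of the k passages sharing the most words with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        CORPUS,
        key=lambda pid: len(q_words & set(CORPUS[pid].split())),
        reverse=True,
    )
    return scored[:k]

print(retrieve("is it right to promote happiness"))  # → ['mill_util_1']
```

In the real pipeline, embedding-based similarity replaces this word-overlap score, but the interface (query in, ranked passages out) is the same.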
DeepSeek-V3 serves as an additional, external baseline.

3.2 Evaluation Framework

To systematically measure not only pedagogical success but also the architectural relevance of our system, we divided the evaluation into three categories of metrics.

System Argumentative Resilience (SysAR). To assess the system's ability to recover from adversarial pressure, this metric quantifies its capacity to return to the original topic after a predefined perturbation [36]. It is operationalized via keyword tracking, monitoring BASE_KEYWORDS (e.g., trolley, lever, five, one, death). A turn is classified as "recovered" when at least 3 base keywords are mentioned.

SysAR = 1 / Recovery_Time,   (1)

where Recovery_Time is the number of turns between injection and the first recovered turn. SysAR = 0.0 means there was no recovery within the 6 consecutive turns that followed.

Argumentative Coherence (ArCo). This metric, complementing SysAR, quantifies the proportion of post-perturbation turns that preserve any philosophically relevant debate (on either the original or the perturbed topic) [6,28]. It is operationalized with an extended keyword set, VALID_KEYWORDS; a turn is coherent when it contains at least 3 valid keywords.

ArCo = Coherent_Turns / Total_Turns_Post_Perturbation   (2)

This measure captures the critical difference between a graceful failure (SysAR = 0.0, ArCo = 1.0) and a catastrophic one (SysAR = 0.0, ArCo = 0.0). We note that this metric was adjusted from an original "time-to-first-coherence" formulation in order to capture debate quality more clearly.

Argument Complexity Score (ACS). To test the depth of student reasoning, we used a rubric-based evaluation of student-written justifications [13,27].
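The keyword-based SysAR (Eq. 1) and ArCo (Eq. 2) computations can be sketched as follows; the keyword lists are abbreviated and the turns are invented for illustration, not taken from the experiments.

```python
# Sketch of the SysAR / ArCo keyword metrics (Eqs. 1-2).
# Keyword sets are abbreviated; turns are invented for illustration.
BASE_KEYWORDS = {"trolley", "lever", "five", "one", "death"}
VALID_KEYWORDS = BASE_KEYWORDS | {"duty", "utility", "happiness", "maxim"}

def hits(turn: str, keywords: set[str]) -> int:
    """Count distinct keywords present in a turn."""
    return len(set(turn.lower().split()) & keywords)

def sys_ar(post_turns: list[str]) -> float:
    """1 / Recovery_Time, or 0.0 if no turn reaches >= 3 base keywords."""
    for i, turn in enumerate(post_turns, start=1):
        if hits(turn, BASE_KEYWORDS) >= 3:
            return 1.0 / i
    return 0.0

def ar_co(post_turns: list[str]) -> float:
    """Fraction of post-perturbation turns with >= 3 valid keywords."""
    coherent = sum(1 for t in post_turns if hits(t, VALID_KEYWORDS) >= 3)
    return coherent / len(post_turns)

turns = [
    "duty alone cannot settle this",                    # drifted turn
    "pulling the lever trades one death against five",  # recovered at turn 2
]
print(sys_ar(turns), ar_co(turns))  # → 0.5 0.5
```

The example shows why the two metrics are complementary: recovery time is a latency measure, while coherence is a proportion over the whole post-perturbation window.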
The ACS measures three dimensions: Perspective Range (consideration of different perspectives), Conceptual Sophistication (application of philosophical frameworks), and Argumentative Structuring (logical consistency). Each dimension is rated on a 0-2 scale, for a maximum of 6 points. Learning gain is defined as ΔACS.

Doctrinal Accuracy. This agent-level metric measures the proportion of turns in which an agent correctly applies its assigned philosophical framework; it was used in our ablation study to estimate identity coherence, serving as a proxy for identity relevance [26]. It is operationalized through framework-specific keyword tracking (e.g., for Kant: categorical imperative, duty). A turn scores Doctrinal Accuracy = 1 when it contains at least 2 keywords of its framework.

Cross-Referencing. This metric measures strategic engagement, i.e., the ratio of turns in which the opponents' frameworks are referenced [22,1]. It captures the shift from parallel monologue to dialectical conversation.

4 System Architecture

We present the Heterogeneous Debate Engine (HDE), a cognitive architecture that supports high-fidelity philosophical dialectics. To empirically verify our idea, we built it as a multi-agent platform, "League of Moral Minds," grounded in philosophical source texts. The system comprises three core modules: a Team Coordination Layer, Philosopher-Agent Cognitive Modules, and a central Debate Orchestration Engine. In this section, we detail how these elements are interconnected and describe the system's execution pipeline.

4.1 Architectural Topology and Debate Flow

The system is based on a multi-agent topology of teams coordinated through the LangGraph framework [20]; its control flow is shown in Fig. 2. It follows a deterministic, chronological order of operations.
Fig. 2. Control flow of the League of Moral Minds under LangGraph orchestration. The flow starts with coordinated deliberation and progresses to the iterative debate loop.

The underlying mechanisms invoke the cognitive modules in a loop. Phase 1: Internal Deliberation begins with intra-team exchange between agents, grouped in advance by affiliation with an ethical discourse (e.g., Mill and Bentham: Utilitarianism). Philosophers refer to a shared memory log to establish a common position. In Phase 2: Socratic Interrogation, a neutral Socratic moderator identifies logical gaps and key weaknesses in the agents' reasoning and poses personalized, elenctic questions. Finally, the main dialectic, with turn-taking rules and recording of interaction metrics, is enforced in Phase 3: Inter-Team Debate, structured similarly to formal Oxford-style protocols [8].

4.2 Cognitive Modules

To balance persona fidelity and argumentative competence, agents use a dual-module mental structure. To alleviate the effects of "identity drift," we adapt the structure of Platnick et al. [26]. Each agent's identity is based on a philosopher's belief graph, i.e., each agent is represented by a graph of BELIEFs and VALUEs with a certainty parameter γ ∈ [0, 1]. During response generation, the working memory is not merely appended to but also filtered by Doctrinal Boundaries. We model this as a constraint-checking procedure in which retrieved facts f that violate negative constraints N_constr are deleted:

WM′_t = WM_t ⊕ ( K^ID_t \ { f ∈ K^ID_t | ∃ n ∈ N_constr : f ⊨ n } )   (3)

Here, f ⊨ n denotes a fact that violates a core identity constraint (e.g., "REJECT: Reducing morality to calculation"), which serves to prevent persona mimicry [4].

Strategic Reasoning via ToM-Lite.
We implement ToM-Lite, a heuristic approximation of Theory of Mind. In contrast to recursive simulation methods, which are computationally intensive [29], ToM-Lite relies on explicit belief injection from static opponent profiles. Each agent has a weakness_map of the opposing agents' schools, allowing it to prepare counter-arguments without the latency of full mental simulation. The system ensures that the agent's core persona is not compromised by strategic overrides by storing the identity (ID-RAG) and strategy (ToM) modules separately [16].

4.3 Orchestration and Moderation

A LangGraph state machine maintains a consistent flow and keeps the chat history and turn logic in an object named TeamConversationState. The proposed structure allows an organized debate flow, preventing chaotic interference between agents.

The Socratic moderator works in two modes. The first operates during Phase 2 (Pre-Debate Interrogation), where it poses ontology-level questions to deepen the initial framing before the debate's main phase. Subsequently, in Phase 3, an adversarial scenario (e.g., the "Scientist vs. Killers" Trolley Problem variant) is injected into the system at Turn 4; this is used in the experiments to test the system's resilience.

5 Experimental Setup

We adopted a research-by-construction methodology to evaluate three main capabilities in the following structure: a system resilience test, an architecture ablation test, and a controlled human-subjects study. The following section describes the system configurations used in the experiments.

Each experiment employed Google Gemini (gemini-2.0-flash-exp) as the LLM backend with default hyperparameters. On average, identity graphs had 34 nodes per agent and 6 core beliefs labeled as immutable (γ = 1.0).
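As a toy sketch of how such an identity graph with doctrinal boundaries might behave, the following implements the constraint check of Eq. (3) as simple substring matching; the beliefs, constraints, and facts shown are invented for illustration, and the real system checks violation against a belief graph rather than substrings.

```python
# Toy version of the doctrinal filtering in Eq. (3): retrieved facts that
# match a negative constraint are dropped before entering working memory.
# Beliefs, constraints, and facts are invented for illustration; the real
# system checks violation against a belief graph, not substrings.
KANT_IDENTITY = {
    "beliefs": {"categorical imperative": 1.0, "duty over outcome": 1.0},
    "negative_constraints": ["reducing morality to calculation"],
}

def filter_retrieved(facts: list[str], identity: dict) -> list[str]:
    """Keep only facts that violate no negative constraint (cf. Eq. 3)."""
    return [
        f for f in facts
        if not any(n in f.lower() for n in identity["negative_constraints"])
    ]

retrieved = [
    "duty requires treating persons as ends in themselves",
    "reducing morality to calculation of aggregate utility is correct",
]
working_memory = filter_retrieved(retrieved, KANT_IDENTITY)
print(working_memory)  # only the duty-based fact survives the filter
```

The point of the filter is that retrieval quality alone is not enough: a well-retrieved but doctrine-violating passage would otherwise erode the agent's persona turn by turn.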
To exhibit the debate effects and cognitive modules, we used four configurations:

Baseline 1 (B_Chat). A single conversational LLM tutor with no RAG retrieval or persona specification, the simplest form of tutoring. Implemented with DeepSeek-V3 via the web interface with zero-shot prompting, serving as an external state-of-the-art reference.

Baseline 2 (B_SingleRAG). A vanilla RAG agent unified over all seven philosophical corpora. Importantly, this baseline employs the same Gemini 2.0 Flash backend as our proposed HDE system, so that a fair architectural comparison can be made. The hypothesis under test is whether a generic, retrieval-enhanced single agent suffices for ethical tutoring.

Homogeneous System (Homo). An intra-school debate configuration of our system, with teams defined as Aristotle + Plato (Ancient Virtue Ethics) vs. Aquinas + Augustine (Christian Virtue Ethics). Both teams are committed to virtue theory but differ in metaphysics and epistemology.

Heterogeneous System (Hetero). An inter-school debate configuration. For the resilience experiments: Aristotle + Aquinas (Virtue/Natural Law) vs. Mill + Bentham (Utilitarian Consequentialism). In the student learning experiment, this was changed to Kant + Aquinas (Deontology/Natural Law) vs. Mill + Bentham for a broader philosophical landscape. The teams present a visible axiom-level conflict.

To test the resilience of the system and enable the ablation, we inject three perturbations at Turn 4 of 10-turn debates:

Table 1. Adversarial Perturbations for Resilience Testing

P1 (push_vs_lever): Physical proximity intuition pump: "Is there a moral difference between pulling a lever and physically pushing someone?"
P2 (tyrant_argument): Character-based attack: "Historical tyrants claimed to act for the greater good.
Does utilitarian logic risk justifying atrocities?"
P3 (scientist_vs_killers): Value asymmetry: "Suppose the five are convicted murderers, and the one is a cancer-curing scientist. Does this change the calculation?"

6 Results and Evaluation

The experiments demonstrated that agent configuration is critical in ethical tutoring systems. The findings reveal that the underlying variations in pedagogical efficacy and system robustness depend on the architecture type. In this section, we present the product of our platform along with three experiments evaluated against our hypotheses.

6.1 Platform Capabilities and Debate Generation

The first system capability to be tested, prior to the educational influence assessment, is the generation of philosophical dialectics. The aim of this subsection is to demonstrate the ability of our HDE architecture to produce high-fidelity reasoning as a prerequisite for any pedagogical goals.

The platform successfully carried out a three-stage workflow. The process begins with Topic Definition, in which the Socratic Moderator enters a formalized ethical dilemma (e.g., the Trolley Problem) to initialize the debate. Afterward, during Debate Generation, the Heterogeneous Debate Engine produces a multi-turn dialogue using the dual-module cognitive architecture. Analysis of the system log confirms that the agents maintain their personas during multi-turn interactions, demonstrating the use of ID-RAG [26]. Simultaneously, we can observe ToM-Lite applied when needed to challenge the opponent's reasoning. Lastly, during the Debate Delivery stage, the generated dialogues are logged and can, for example, then be given to students during the pedagogical assessment in Section 6.4.
This qualitative confirmation indicates that the artifact is not mere stochastic LLM output, but the product of the intended architectural constraints.

6.2 System Resilience Analysis

To verify how important architectural heterogeneity is for effective debate generation, we compare the heterogeneous system's behavior with a homogeneous setup under adversarial pressure. This analysis empirically tests whether introducing doctrinal diversity can prevent the "consensus collapse" observed in multi-agent systems, which is the behavior we want to avoid [28,36].

Our experiments used a 2 × 3 factorial design crossing system type (Homo vs. Hetero) with perturbation type (P1, P2, P3). All experiments adhered to the designed pipeline: intra-team deliberation, Socratic moderation, inter-team debate (10 turns of single-agent statements), Turn 4 perturbation injection, and post-perturbation observation (the remaining 6 turns).

Table 2 presents results that reveal a significant difference between the two designs. The heterogeneous system, with ArCo = 1.00 across all tests, showed flawless performance, in contrast to the homogeneous one with ArCo = 0.06. The debate analysis showed that homogeneous agents often degenerated into unrelated meta-epistemological discussions (faith vs. reason).

This behavior can be interpreted as follows: although the agents share a framework-level commitment (virtue ethics) [34], they differ in meta-methodology (ancient philosophy vs. Christian theology). Their ethical motivations are usually similar, but when object-level dilemmas fail to provide common ground, they drift into irrelevant meta-debates [22]. By contrast, heterogeneous agents differ at the structural level (virtue vs. consequence), which makes meta-convergence impossible, and this becomes the foundation for the pedagogical objective.
These findings support the hypothesis that architectural heterogeneity is a requirement for maintaining argumentative coherence under adversarial pressure.

Table 2. System Resilience Across Perturbations

System       Pert.  SysAR  ArCo
Hetero       P1     0.50   1.00
Hetero       P2     0.00   1.00
Hetero       P3     0.00   1.00
Homo         P1     0.00   0.00
Homo         P2     0.00   0.00
Homo         P3     0.00   0.17
Mean (Het)   –      0.17   1.00
Mean (Hom)   –      0.00   0.06

Fig. 3. Complementary contributions of ID-RAG and ToM-Lite. Bars show Doctrinal Accuracy and Cross-Referencing. While ID-RAG maximizes identity stability and ToM improves engagement, the Full System achieves high performance on both metrics.

6.3 Architectural Ablation Study

In this experiment, we isolate the contributions of our two main cognitive components (ID-RAG and Heuristic ToM) to the system's performance. We aim to show that these modules offer distinct but complementary benefits, and are insufficient in isolation.

We chose the heterogeneous setup (Kant + Aquinas vs. Mill + Bentham) under adversarial pressure for the ablation study. Four system variants were evaluated: the full system, each module separately, and neither (baseline RAG only). Two perturbations (P1, P3) were used, yielding a total of n = 8. Doctrinal Accuracy (DA), Cross-Referencing (CR), and Argumentative Coherence (ArCo) were measured.

As detailed in Table 3, ID-RAG had a significant effect on doctrinal stability: Vanilla + ID-RAG obtained DA = 0.90 (+39% over Vanilla Only at 0.51), while ToM-Lite improved strategy, with Vanilla + ToM providing CR = 0.40 (+35% compared to Vanilla Only at 0.05). Notably, the Full System obtained a perfect DA = 1.00 and context-dependent peak engagement (CR = 0.60 for scientist_vs_killers).
All variants maintained maximal coherence (ArCo = 1.00), indicating a lack of architectural conflict between the modules. A visual comparison of the complementary contributions of these modules is depicted in Fig. 3.

Different vulnerabilities were observed at the agent level. In several circumstances, Aquinas reached DA = 0.00 through systematic abandonment of natural law: Vanilla RAG provides generic morality, but not in Thomistic form.

Table 3. Ablation Study Results (Mean Across Perturbations)

Condition           DA    CR    ArCo
Vanilla RAG Only    0.51  0.05  1.00
Vanilla + ID-RAG    0.90  0.25  1.00
Vanilla + ToM       0.79  0.40  1.00
Full System         1.00  0.45  1.00
Δ (ID-RAG)          +39%  +20%  –
Δ (ToM)             +28%  +35%  –

Similarly, Mill yielded DA = 0.33 under the push_vs_lever perturbation. Both were rescued by the explicitly set doctrinal boundaries of the Full System, attaining DA = 1.00 [26]. Adding ToM moved the parallel monologue toward a debate form; only then did agent Mill start to use framework-aware counter-arguments (CR = 1.00). The observed +28% DA improvement of Vanilla + ToM over Vanilla RAG Only reveals an unexpected mechanism: opponent modeling strengthens doctrinal boundaries indirectly, by compelling agents to articulate their own model in order to construct counterarguments [24].

6.4 Evaluation of Pedagogical Effectiveness

Lastly, we consider the real-world utility of the platform, meaning its application to student teaching. The primary objective of this experiment is to verify the pedagogical potential of the debate generated by the Heterogeneous Debate Engine. We test whether it yields a quantifiable increase in the complexity of students' arguments compared to baseline single-agent interactions.
To test this, a pilot group of university students (N = 22) was recruited through an anonymous online survey, with informed consent obtained from all participants. They were randomly assigned to one of four conditions (B_Chat: n = 5, B_SingleRAG: n = 5, Homo: n = 7, Hetero: n = 5) in a pre-test/exposure/post-test design. The participants declared moderate prior knowledge of the Trolley Problem and low-to-moderate philosophical knowledge. After stating a preliminary stance on the dilemma (pull or do not pull the lever) and a justification, participants were shown the corresponding debate content or LLM output. Afterwards, students were once again asked to give a revised position and argumentation (rated as ACS_Post) based on established argumentation quality frameworks [13], and completed a knowledge transfer quiz.

Table 4 contains the learning outcomes. The Heterogeneous condition provided an 11-fold improvement in ΔACS over baselines. More importantly, a performance regression was observed in the homogeneous condition (ΔACS = -0.29). The same pattern appeared in Quiz scores (Heterogeneous 55% vs. Homogeneous 25%). Despite the modest sample size, the large effect size (Cohen's d > 1.0) strongly suggests that the architectural intervention promotes pedagogical gains considerably greater than random variance.

Table 4. Student Learning Outcomes by Condition

Condition     N  ΔACS    Quiz  Shift
B_Chat        5  +0.20   0.4   0.0
B_SingleRAG   5  +0.40   0.6   0.8
Homo          7  −0.29   1.0   0.29
Hetero        5  +2.20   2.2   0.4

The analysis of secondary data showed that persuasiveness and learning are dissociated: B_SingleRAG exhibited the greatest stance change (0.8) alongside the lowest learning gains, which can be attributed to optimization for rhetorical power rather than pedagogical content [1].
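For reference, the effect-size computation behind the Cohen's d > 1.0 claim can be sketched as below; the ΔACS samples here are invented for illustration and are not the study data.

```python
# Sketch of Cohen's d for two independent groups, using the pooled SD.
# The sample values are invented for illustration, not the study data.
import statistics

def cohens_d(a: list[float], b: list[float]) -> float:
    """(mean_a - mean_b) / pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

hetero_gains = [2.0, 3.0, 2.0, 2.0, 2.0]    # hypothetical ΔACS values
baseline_gains = [0.0, 1.0, 0.0, 0.0, 0.0]  # hypothetical ΔACS values
print(round(cohens_d(hetero_gains, baseline_gains), 2))  # → 4.47
```

With samples this small, d is reported as suggestive rather than conclusive, which matches the pilot-study framing above.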
Qualitative error detection revealed a critical failure mode in which B_SingleRAG propagated a Kantian misconception (defending lever-pulling, which fundamentally contradicts the philosopher's views), and none of the students identified it. This suggests that retrieval-enhanced single agents are prone to errors of contextual misuse. The findings demonstrate the efficacy of heterogeneous systems and highlight the importance of enforcing inter-school diversity to outperform single-agent tutoring.

7 Discussion and Future Work

The experimental outcomes presented in this section highlight three key areas: architectural synergies, pedagogical potential, and future directions for system scalability.

7.1 Discussion

The dichotomy between student learning results under heterogeneous debate and under the homogeneous system reveals a crucial aspect of multi-agent debate system design. The analysis shows that naive setups can be pedagogically counterproductive [36]. The homogeneous system's failure is likely a result of meta-epistemological escalation, in which agents with common axioms (e.g., virtue ethics) but divergent theology are more concerned with conformity than with resolving problems [28]. In contrast, a heterogeneous debate, by introducing axiomatic conflict between agents, exposes students to a wider perspective, leading to higher mastery of philosophical terminology [22].

ID-RAG and Heuristic ToM are architecturally synergistic. ID-RAG acts as a safety measure (+39% doctrinal stability) through negative constraints [26], while ToM-Lite stimulates dialectical interaction (+35% cross-referencing). Unexpectedly, ToM alone also increased stability, by inducing agents to define their own structural boundaries. The Full System recorded the highest metrics, confirming the complementarity of the two modules.
While the observed results are substantial (d > 1.0), they should be interpreted within the boundary conditions of a pilot study (N = 22) in a single domain. Moreover, while the architecture provided high fidelity in a structured debate, the upper boundary of identity coherence in open-ended dialectics remains to be investigated in future scalability testing.

7.2 Future Directions

Our current research aims to extend the human study to a larger treatment group, scaling up the protocol to profile learning outcomes across specific target audiences and different degrees of philosophical literacy. This will enhance the pedagogical analysis and confirm the generalizability of the Argument Complexity Score gains.

We are also extending our architecture with a counterfactual reasoning module. This enhancement is meant to stress the agents by injecting hypothetical "what-if" scenarios that transcend the rigidity of identity boundaries [6,37]. We are additionally testing the feasibility of local deployment by exploring HDE on open-weights models, ensuring that the ID-RAG benefits extend beyond frontier-class models [23].

8 Conclusions

We presented the Heterogeneous Debate Engine (HDE), an architecture designed to impose doctrinal faithfulness in AI-based ethical tutoring. By comparing homogeneous and heterogeneous structures, we have shown that structural diversity is a pedagogical requirement. We have experimentally demonstrated that the combination of ID-RAG and ToM-Lite remains argumentatively coherent under pressure, mitigating consensus collapse and hallucination [36,28].

In the future, we aim to apply this framework to multi-turn reasoning scenarios and broader domains, exploring the boundary conditions of retrieval-augmented identity and its effect on retention. Finally, this research contributes a design pattern to the Agentic World.
According to our findings, identity-grounded heterogeneity should be a priority for architects seeking to avoid logical deterioration. The HDE architecture practically demonstrates that the ability to disagree is the basic requirement of true dialectical scaffolding [24,11].

Acknowledgments. The work reported in this paper was supported by the Polish National Science Centre under grant 2024/06/Y/HS1/00197.

Disclosure of Interests. The authors have no competing interests to declare that are relevant to the content of this article.

References

1. Carro, M.V., et al.: AI debaters are more persuasive when arguing in alignment with their own beliefs (2025), arXiv preprint
2. Chan, C.M., et al.: ChatEval: Towards better LLM-based evaluators through multi-agent debate. In: Proc. EMNLP. pp. 4199–4218. ACL (2023)
3. Chen, R., et al.: Theory of mind in large language models: Assessment and enhancement. In: Proc. ACL. pp. 31539–31558. ACL (2025)
4. Choi, Y., Kang, E.J., Choi, S., Lee, M.K., Kim, J.: Proxona: Supporting creators' sensemaking and ideation with LLM-powered audience personas (2025), arXiv preprint
5. Chudziak, J.A., Kostka, A.: AI-powered math tutoring: Platform for personalized and adaptive education. In: Artificial Intelligence in Education (AIED). pp. 462–469. Springer (2025). https://doi.org/10.1007/978-3-031-98465-5_58
6. Fang, Y., et al.: Counterfactual debating with preset stances for hallucination elimination of LLMs. In: Proc. CIKM. pp. 608–617. ACM (2024)
7. Ghosh, K., Smith, C.: Formal analysis of deontic logic model for ethical decisions. In: Proc. 17th Int. Conf. on Agents and Artificial Intelligence - Vol. 1: ICAART. pp. 218–223. SciTePress (2025). https://doi.org/10.5220/0013385200003890
8. Harbar, Y., Chudziak, J.A.: Simulating Oxford-style debates with LLM-based multi-agent systems. In: Intelligent Information and Database Systems (ACIIDS). pp. 286–300. Springer (2025)
9. Hayashi, H., et al.: A new planning agent architecture that efficiently integrates an online planner with external legal and ethical checkers. In: Proc. ICAART (1). pp. 263–272. SciTePress (2025)
10. Holmes, W., Porayska-Pomsta, K.: The Ethics of Artificial Intelligence in Education: Practices, Challenges, and Debates. Routledge (2022). https://doi.org/10.4324/9780429329067
11. Hou, X., et al.: EduThink4AI: Translating educational critical thinking into multi-agent LLM systems. arXiv preprint arXiv:2507.15015 (2025)
12. Hou, Z., et al.: HALO: Hierarchical autonomous logic-oriented orchestration for multi-agent LLM systems (2025), arXiv preprint
13. Ivanova, R.V., et al.: Let's discuss! Quality dimensions and annotated datasets for computational argument quality assessment. In: Proc. EMNLP. pp. 20749–20779. ACL (2024)
14. Ju, T., et al.: When disagreements elicit robustness: Investigating self-repair capabilities under LLM multi-agent disagreements (2025), arXiv preprint
15. Kim, K., et al.: Can LLMs produce faithful explanations for fact-checking? Towards faithful explainable fact-checking via multi-agent debate (2024), arXiv preprint
16. Kobayashi, Y., Fujita, K.: Pre-trained models and fine-tuning for negotiation strategies with end-to-end reinforcement learning. In: Proc. ICAART (1). pp. 400–411. SciTePress (2025)
17. Kostka, A., Chudziak, J.A.: Synergizing logical reasoning, knowledge management and collaboration in multi-agent LLM system (2025), arXiv preprint
18. Ku, H.B., et al.: Multi-agent LLM debate unveils the premise left unsaid. In: Proc. ArgMining. pp. 58–73. ACL (2025)
19. Kurchyna, V., et al.: Efficient selection of consistent plans using patterns and constraint satisfaction for beliefs-desires-intentions agents. In: Proc. ICAART (1). pp. 333–341. SciTePress (2025)
20. LangChain Team: LangGraph. https://langchain-ai.github.io/langgraph/ (2023), accessed: 2025-01-10
21. Li, F., et al.: AIstorian lets AI be a historian: A KG-powered multi-agent system for accurate biography generation (2025), arXiv preprint
22. Liang, T., et al.: Encouraging divergent thinking in large language models through multi-agent debate. In: Proc. EMNLP. pp. 17889–17904. ACL (2024)
23. Lore, N., et al.: Large model strategic thinking, small model efficiency: Transferring theory of mind in large language models (2024), arXiv preprint
24. Pei, J., et al.: Socratic style chain-of-thoughts help LLMs to be a better reasoner. In: Findings of ACL. pp. 12384–12395. ACL (2025)
25. Peng, X., et al.: KELE: A multi-agent framework for structured Socratic teaching with LLMs. In: Findings of EMNLP. pp. 16342–16362. ACL (2025)
26. Platnick, D., et al.: ID-RAG: Identity retrieval-augmented generation for long-horizon persona coherence. arXiv preprint arXiv:2509.25299 (2025)
27. Scarlatos, A., et al.: Training LLM-based tutors to improve student learning outcomes in dialogues. In: Artificial Intelligence in Education (AIED). pp. 251–266. Springer (2025)
28. Smit, A., et al.: Should we be going MAD? A look at multi-agent debate strategies for LLMs. In: Proc. ICML. vol. 235, pp. 45883–45905. PMLR (2024)
29. Strachan, J.W.A., et al.: Testing theory of mind in large language models and humans. Nature Human Behaviour 8(7), 1285–1295 (2024). https://doi.org/10.1038/s41562-024-01882-z
30. Tian, C., et al.: AgentInit: Initializing LLM-based multi-agent systems via diversity orchestration. arXiv preprint arXiv:2509.19236 (2025)
31. Toulmin, S.E.: The Uses of Argument. Cambridge University Press, Cambridge (1958)
32. Trepczyński, M.: AI as a Rational Theologian: A Comprehensive Skills Assessment. Wydawnictwa Uniwersytetu Warszawskiego, Warsaw (2025). https://doi.org/10.31338/uw.9788323569183
33. Upreti, N., et al.: Towards developing ethical reasoners: Integrating probabilistic reasoning and decision-making for complex AI systems. In: Proc. ICAART (1). pp. 588–599. SciTePress (2025)
34. Vallor, S.: Virtue ethics, technology, and human flourishing. In: Technology and the Virtues. Oxford University Press (2016)
35. West, A., et al.: Abduct, act, predict: Scaffolding causal inference for automated failure attribution in multi-agent systems (2025), arXiv preprint
36. Wynn, A., et al.: Talk isn't always cheap: Understanding failure modes in multi-agent debate. arXiv preprint arXiv:2509.05396 (2025)
37. Yang, S., et al.: On the eligibility of LLMs for counterfactual reasoning: A decompositional study (2025), arXiv preprint
38. Zamojska, M., Chudziak, J.A.: Simulating human communication games: Transactional analysis in LLM agent interactions. In: Recent Challenges in Intelligent Information and Database Systems (ACIIDS). pp. 173–187. Springer (2025)
39. Zhu, Y., et al.: ArgRAG: Explainable retrieval augmented generation using quantitative bipolar argumentation (2025), arXiv preprint