KGCE: Knowledge-Augmented Dual-Graph Evaluator for Cross-Platform Educational Agent Benchmarking
📝 Original Paper Info
- Title: KGCE: Knowledge-Augmented Dual-Graph Evaluator for Cross-Platform Educational Agent Benchmarking with Multimodal Language Models
- ArXiv ID: 2601.01366
- Date: 2026-01-04
- Authors: Zixian Liu, Sihao Liu, Yuqi Zhao
📝 Abstract
With the rapid adoption of multimodal large language models (MLMs) in autonomous agents, cross-platform task execution capabilities in educational settings have garnered significant attention. However, existing benchmark frameworks still exhibit notable deficiencies in supporting cross-platform tasks in educational contexts, especially when dealing with school-specific software (such as XiaoYa Intelligent Assistant, HuaShi XiaZi, etc.), where the efficiency of agents often significantly decreases due to a lack of understanding of the structural specifics of these private-domain software. Additionally, current evaluation methods heavily rely on coarse-grained metrics like goal orientation or trajectory matching, making it challenging to capture the detailed execution and efficiency of agents in complex tasks. To address these issues, we propose KGCE (Knowledge-Augmented Dual-Graph Evaluator for Cross-Platform Educational Agent Benchmarking with Multimodal Language Models), a novel benchmarking platform that integrates knowledge base enhancement and a dual-graph evaluation framework. We first constructed a dataset comprising 104 education-related tasks, covering Windows, Android, and cross-platform collaborative tasks. KGCE introduces a dual-graph evaluation framework that decomposes tasks into multiple sub-goals and verifies their completion status, providing fine-grained evaluation metrics. To overcome the execution bottlenecks of existing agents in private-domain tasks, we developed an enhanced agent system incorporating a knowledge base specific to school-specific software. The code can be found at https://github.com/Kinginlife/KGCE.
💡 Summary & Analysis
1. **Dataset Construction**: This research constructs a large-scale dataset tailored for educational environments, incorporating various school-specific proprietary software and multi-device collaborative workflows. It's akin to preparing all the ingredients before cooking to ensure you have everything needed in one place.
2. **Knowledge Base Development**: A structured JSON-based database providing knowledge on proprietary educational software is developed. This acts like a recipe book that chefs refer to for instructions, allowing models to quickly retrieve necessary information and improve execution efficiency.
3. **Dual-Graph Evaluation Framework**: Proposes two graph-based systems to evaluate task performance from the aspects of completeness and efficiency. It's similar to how athletes analyze their training after a game to identify strengths and weaknesses, enabling detailed model performance analysis.
📄 Full Paper Content (ArXiv Source)
The rapid advancement of multimodal large language models (MLMs) is reshaping the capability boundaries of autonomous agents, driving them from single-environment task execution towards cross-platform collaboration. Represented by models like GPT-4o, MLMs integrate visual, linguistic, and action-reasoning capabilities, demonstrating significant potential in general scenarios such as cross-device file transfer and multi-application collaborative operations.
Existing agents have predominantly focused on generic scenarios such as scientific research and code generation. However, their performance often declines sharply when transitioning to educational settings due to two major bottlenecks: lack of domain-specific knowledge and misalignment with assessment frameworks. Educational environments pose unique challenges: (1) they heavily rely on school-customized software, characterized by closed private-domain features that lack standardization in interface elements and operational logic; (2) cross-platform tasks require coordination across multiple devices like Windows and Android, involving complex process dependencies and state synchronization; (3) task objectives combine functional requirements with educational significance, demanding agents to exhibit both operational accuracy and cognitive understanding of educational contexts. Current research has yet to effectively address these challenges, thus limiting the practical deployment of educational agents.
style="width:90.0%" />
| System | Interactive Environment | Knowledge | Cross-Platform | Evaluation | Task Construction | Educational Task |
|---|---|---|---|---|---|---|
| MetaGUI | Android | ✓ | × | Trajectory | Manual | × |
| AgentBench | Multi-isolated | × | × | Multiple | Manual | × |
| EduAgent | Multi-platform | ✓ | × | Multi-dimensional | LLM+Tools | ✓ |
| WebArena | Web | ✓ | × | Goal-based | Template | × |
| GUICourse | Desktop/Web GUI | ✓ | ✓ | Trajectory | Sub-task Comp | ✓ |
| OSWorld | Linux/Windows | × | × | Goal-based | Template | × |
| AndroidWorld | Android | × | × | Goal-based | Template | × |
| EduBenchmark | Code/Web | × | × | Multi-dimensional | Template | ✓ |
| WORFBench | Multi | × | × | Graph-based | LLM-inspired | × |
| CRAB | Linux&Android | × | ✓ | Graph-based | Sub-task Comp | × |
| **KGCE** | Windows&Android | ✓ | ✓ | **Dual-Graph-based** | Sub-task Comp | **✓** |

✓ = Supported, × = Not supported. Cross-platform requires simultaneous multi-device operations.
Existing work exhibits significant limitations in three key areas.
Currently, there is a lack of task datasets tailored for educational
agents, which hampers research and development in educational scenarios.
Existing knowledge graphs and GUI operation libraries are primarily
designed for general-purpose software and cannot provide structured
knowledge support for school-specific systems. Conventional metrics,
such as task completion rate and trajectory similarity, focus solely on
macroscopic outcomes. These metrics cannot quantify fine-grained issues
like backtracking operations or omissions of critical steps.
To address the above issues, we propose KGCE (Knowledge-Augmented Dual-Graph Evaluator for Cross-Platform Educational Agent Benchmarking with Multimodal Language Models), a cross-platform educational agent benchmarking framework based on knowledge enhancement and dual-graph evaluation. The overall framework is shown in Fig. 1.
The comparison table above contrasts KGCE with existing benchmark frameworks. Key features are categorized as: *Interactive Environment* (the system's operating context: Web/GUI/Code); *Knowledge* (structured knowledge management with knowledge graphs or dynamic reasoning, marked ✓); *Cross-platform* (concurrent multi-device operations requiring OS/device interoperability); *Evaluation* (Goal-based: final-state verification; Trajectory: action-sequence alignment; Graph-based: DAG checkpoint validation); *Task Construction* (Template: predefined patterns; Manual: human-crafted; Sub-task Comp: modular composition); *Educational Task* (curriculum integration or pedagogical assessment, marked ✓).
The main contributions are as follows:
- We constructed a dataset of 104 educational tasks spanning Windows, Android, and cross-platform collaboration. These tasks involve operations on private-domain software and multi-device coordination workflows, with task dependencies modeled as a directed acyclic graph (DAG).
- For private-domain software, we developed a structured JSON knowledge
base. Knowledge from this base is dynamically retrieved and injected
into model prompts, significantly improving execution efficiency and
success rates in private-domain tasks.
- We propose a dual-graph evaluation framework consisting of a
*Completeness Graph* and an *Efficiency Graph*. This framework
introduces eight fine-grained metrics to assess task performance in
detail.
- We validate the effectiveness of the knowledge base across several
models, including Qwen-VL-Max-Latest, GPT-4o, and Gemini-2.0-Flash,
revealing differences in their dependence on domain-specific
knowledge.
# RELATED WORK
To comprehensively contextualize our research, we structure the related
work into four key dimensions: (1) cross-platform agents driven by large
models, (2) task modeling and agents in educational scenarios, (3)
knowledge-enhanced agent architectures, (4) agent evaluation
methodologies. These dimensions were selected because they collectively
address the challenges of building intelligent agents in educational
environments. By analyzing these aspects, we aim to highlight the gaps
in existing research and position our contributions accordingly.
## Large Model-Driven Cross-Platform Agents
In recent years, MLMs have demonstrated significant potential in the
field of cross-platform agents. CRAB introduced the first benchmark
framework supporting cross-environment tasks, enabling efficient
construction and evaluation of complex tasks through subtask composition
and a graph-based evaluator. AndroidWorld and OSWorld have established
dynamic Android environments and open computer environments,
respectively, providing diverse platform support for agent evaluation.
However, these cross-platform agent studies do not address the specific
support required for education-related private-domain software. Notably,
the directed acyclic graph-based task decomposition method proposed by
CRAB offers valuable inspiration for our task modeling. Nonetheless, its
coarse-grained trajectory-matching evaluation approach falls short of
capturing the nuanced execution differences critical to educational
scenarios.
## Task Modeling and Agents in Educational Scenarios
Research on educational agents faces the dual challenges of complex
environments and strong dependence on domain-specific knowledge.
EduAgent boosts task efficiency via multimodal interaction, yet only
targets general tools. GUICourse improves GUI grounding but ignores
cross-platform state sync. EduBench supplies real-world benchmarks, but
its task volume is small and lacks structured knowledge. Current studies
exhibit several key shortcomings: (1) Task datasets largely rely on
general educational platforms and lack support for school-customized
systems; (2) Evaluation metrics focus on macro-level metrics such as
task completion rate, which are insufficient for quantifying the
optimization level of execution paths. These limitations highlight the
critical need for a dedicated evaluation framework tailored to the
specific demands of educational scenarios.
## Knowledge-Augmented Agent Architectures
Knowledge base augmentation has emerged as an effective paradigm for
enhancing agent adaptability across domains. GAT proposes an attention mechanism that dynamically computes the importance of graph nodes; this applies naturally to prioritizing nodes in a knowledge graph and provides a theoretical basis for the priority-based retrieval mechanism of our knowledge base. PLaG employed graph structures to
enhance task planning capabilities; however, its static knowledge
representations are ill-suited for the dynamically evolving nature of
educational software. Notably, existing knowledge enhancement methods
predominantly rely on general-purpose knowledge graphs, lacking
structured modeling tailored to private-domain educational software.
This essential yet underexplored factor currently constrains the
performance of educational agents.
## Agent Evaluation Methods
The development of agent evaluation systems is trending from
outcome-focused metrics toward process-oriented analysis. AgentBench
established a multidimensional evaluation benchmark, but its API-based
validation approach is poorly suited for GUI operation scenarios. DyVal
introduced a dynamic evaluation framework using a DAG structure to
capture task execution increments; however, its discrete state labeling
fails to quantify continuous metrics such as the backtracking rate.
Regarding fine-grained evaluation, although CRAB’s graph-based evaluator
can verify subtask completion states, it lacks the capability to analyze
execution path efficiency. Recent studies have shown that combining
structural analysis with process tracing provides a more accurate
reflection of an agent’s cognitive capabilities. These findings offer
theoretical support for the design of our dual-graph evaluation
framework.
# METHOD
This section provides a detailed introduction to the specific
implementation of the cross-platform educational agent benchmark, which
combines knowledge base enhancement with a double-layer graph evaluation
framework.
## Educational Dataset Construction
While benchmarks cover cross-environment tasks, they miss multi-device
collaboration in education. We introduce the first education-focused
task set, grounded in real activities at Central China Normal
University. Spanning proprietary software, it supports Windows, Android,
and cross-platform runs. Inspired by CRAB, we scale via decomposition,
templates, and composition: each complex task splits into atomic
subtasks, linked in a DAG. For example, a complex task such as *“Use
Xiaoya to check the assignments for the Big Data Technology course and
add the task in the Tasks app”* can be broken down into the following
subtasks: *“Open the Xiaoya app,” “Enter the Big Data Technology
course,” “View the assignment tasks,” “Switch to the Tasks app,”* and
*“Add the task.”* These subtasks are organized into a **DAG** that
models the logical dependencies between them. Each node represents a
subtask, and each edge denotes a dependency.
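As a concrete illustration, the sketch below shows one way such a subtask DAG could be represented in code. The `SubTask`/`TaskDAG` classes and their field names are hypothetical and not taken from the KGCE codebase; only the sub-task instructions come from the Xiaoya example above.

``` python
from dataclasses import dataclass, field

@dataclass
class SubTask:
    """One atomic sub-goal, i.e. a node of the task DAG."""
    task_id: str
    instruction: str
    platform: str  # e.g. "android" or "windows"

@dataclass
class TaskDAG:
    """Sub-tasks plus dependency edges (prerequisite -> dependent)."""
    nodes: dict = field(default_factory=dict)   # task_id -> SubTask
    deps: dict = field(default_factory=dict)    # task_id -> list of prerequisite ids

    def add(self, node: SubTask, depends_on=()):
        self.nodes[node.task_id] = node
        self.deps[node.task_id] = list(depends_on)

    def ready(self, completed: set) -> list:
        """Sub-task ids whose prerequisites are all completed."""
        return [t for t, d in self.deps.items()
                if t not in completed and all(p in completed for p in d)]

# The Xiaoya example from the text, expressed as a five-node chain.
dag = TaskDAG()
dag.add(SubTask("t1", "Open the Xiaoya app", "android"))
dag.add(SubTask("t2", "Enter the Big Data Technology course", "android"), ["t1"])
dag.add(SubTask("t3", "View the assignment tasks", "android"), ["t2"])
dag.add(SubTask("t4", "Switch to the Tasks app", "android"), ["t3"])
dag.add(SubTask("t5", "Add the task", "android"), ["t4"])

print(dag.ready(completed={"t1"}))  # ['t2']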
To instantiate concrete tasks, we design task templates using natural
language instruction patterns. These templates contain input attributes
such as `{app_name}`, `{feature}`, and `{action}`, which can be
dynamically replaced with specific values. For example, the template
“Open {feature} in {app_name}, perform {action}, and save it” enables
efficient generation of diverse task instances by substituting real
values. For complex scenarios, we compose multiple subtask templates to
create multi-step tasks. For instance, a cross-platform task may require
opening the One-Stop Service Platform on Windows, accessing the message
center, and then recording the message content in the Keep Notes app on
an Android device.
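A minimal sketch of how such template instantiation could work is given below; only the `{app_name}`, `{feature}`, and `{action}` slots come from the paper, while the helper function and the example slot values are illustrative assumptions.

``` python
# Template with input attributes as described in the text; slot values are
# illustrative and not drawn from the actual task set.
TEMPLATE = "Open {feature} in {app_name}, perform {action}, and save it"

def instantiate(template: str, **slots) -> str:
    """Fill a natural-language task template with concrete attribute values."""
    return template.format(**slots)

task = instantiate(
    TEMPLATE,
    feature="the message center",
    app_name="the One-Stop Service Platform",
    action="a review of today's notices",
)
print(task)
# -> Open the message center in the One-Stop Service Platform,
#    perform a review of today's notices, and save it
```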
<figure id="fig:Knowledge-base" data-latex-placement="t">
<img src="/posts/2026/01/2026-01-04-190607-kgce__knowledge_augmented_dual_graph_evaluator_for/KGCE-Knowledge.png" style="width:100%" />
<figcaption>The knowledge base module. Given a task description and
package names, the system identifies relevant packages, retrieves their
descriptions from the knowledge base, and constructs prompts for the
LLM. The knowledge base is organized by packages, pages, and elements,
with each element including its position, description, and
sub-elements.</figcaption>
</figure>
To guarantee task feasibility and validity, we rigorously verify each
subtask, ensuring it can be executed on the target platform and that its
input-output logic is consistent. Ultimately, we construct a dataset of
**104 tasks**, covering applications such as HuaShi XiaZi, Xiaoya
Assistant, Keep Notes, and MOOC platforms. This dataset balances task
diversity and complexity, laying a solid foundation for subsequent
experiments and evaluations.
## Knowledge Base Construction
Proprietary software such as Xiaoya Intelligent Assistant presents
unique interfaces and workflows that existing MLMs rarely master. To
close the gap, we manually interact with each target application to
collect software names, page descriptions, UI-element positions, and
functional explanations, then serialize the data into a uniform JSON
schema. At runtime, a **Knowledge Invocation Decision** module checks
the task description: if the software is mentioned, the corresponding KB
records are retrieved and injected into the prompt; otherwise, they are
omitted. Fig. 2 illustrates the full pipeline from manual exploration to structured JSON and prompt augmentation.
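The sketch below illustrates what such a JSON knowledge-base entry and the Knowledge Invocation Decision check could look like. The package key, field names, and matching logic are assumptions made for illustration, not the schema of the released knowledge base.

``` python
import json

# Hypothetical KB entry following the structure described above:
# packages -> pages -> elements with position, description, and sub-elements.
KB = {
    "com.ccnu.xiaoya": {  # hypothetical package name
        "name": "Xiaoya Intelligent Assistant",
        "pages": [
            {
                "page": "home",
                "description": "Course list and campus notices",
                "elements": [
                    {"name": "course_card", "position": [120, 430],
                     "description": "Opens a course's detail page",
                     "sub_elements": ["assignments_tab", "resources_tab"]},
                ],
            },
        ],
    },
}

def knowledge_invocation_decision(task: str, kb: dict) -> str:
    """If the task mentions a known package, return its KB records for the prompt."""
    hits = [entry for entry in kb.values() if entry["name"].lower() in task.lower()]
    if not hits:
        return ""  # generic software: inject nothing
    return "Known software structure:\n" + json.dumps(hits, ensure_ascii=False, indent=2)

prompt_extra = knowledge_invocation_decision(
    "Use Xiaoya Intelligent Assistant to check the assignments", KB)
print(prompt_extra or "(no private-domain knowledge injected)")
```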
## Dual-Graph Evaluation Framework
Current coarse-grained, goal-oriented or trajectory-matching metrics
inadequately capture the nuanced execution of agents in complex tasks.
KGCE proposes a dual-graph evaluation framework—Task Completeness Graph
(TCG) and Execution Efficiency Graph (EEG)—to provide fine-grained
metrics quantifying both completion quality and execution efficiency.
### Task Completeness Graph
TCG decomposes a task into sub-goals and builds a dependency graph where
each node $`v_i`$ represents a sub-goal. During execution, we annotate
real-time completion status with indicator $`I(v_i=1)`$. The overall
progress is captured by the **Completion Ratio**
$`\text{CR} = \sum_{v_i\in V} I(v_i=1)/|V|`$. Action-level performance
is measured by **Completeness per Action**
$`\text{CPA} = \sum_{a_j\in A} I(a_j=1)/|A|`$, where $`A`$ is the set of all
actions and $`I(a_j=1)`$ marks whether action $`a_j`$ is completed.
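A minimal sketch of these two ratios, assuming the per-sub-goal and per-action completion indicators have already been extracted from an execution trace (the list-based representation is an illustrative assumption):

``` python
def completion_ratio(subgoal_done: list) -> float:
    """CR: completed sub-goals over all sub-goals |V| of the completeness graph."""
    return sum(subgoal_done) / len(subgoal_done)

def completeness_per_action(action_done: list) -> float:
    """CPA: actions that completed a sub-goal over all actions |A| taken."""
    return sum(action_done) / len(action_done)

# Example: 3 of 5 sub-goals reached; 4 of 20 actions actually advanced a sub-goal.
print(completion_ratio([1, 1, 1, 0, 0]))        # 0.6
print(completeness_per_action([1]*4 + [0]*16))  # 0.2
```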
### Execution Efficiency Graph
EEG evaluates execution efficiency. **Backtracking Ratio**
$`\text{BR} = \text{IO}/\text{ONU}`$ quantifies the frequency of
backtracking steps (IO) relative to total operations (ONU); lower BR
indicates more direct paths. **Precision** is $`\text{CAN}/\text{ANU}`$,
the ratio of completed actions (CAN) to attempted actions (ANU).
**Recall** measures coverage of essential key steps:
$`\text{Recall} = \sum_{k_m\in K} I(k_m=1)/|K|`$. Precision and Recall
are combined into an F1-score. Two exception metrics are recorded: **Out
of Range (OoR)** counts out-of-bounds errors, and **reach_max_step
(RMS)** tallies terminations due to reaching the maximum step limit.
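The sketch below collects the efficiency-graph ratios into one function; the harmonic-mean form of the F1-score is a standard assumption, since the text only states that Precision and Recall are combined.

``` python
def eeg_metrics(io: int, onu: int, can: int, anu: int, key_covered: list) -> dict:
    """Efficiency-graph metrics from the counts defined above (IO, ONU, CAN, ANU, K)."""
    br = io / onu                                 # Backtracking Ratio
    precision = can / anu                         # completed / attempted actions
    recall = sum(key_covered) / len(key_covered)  # covered key steps / |K|
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"BR": br, "Precision": precision, "Recall": recall, "F1": f1}

# Example trace: 4 backtracking steps in 22 operations, 6 completed actions
# out of 22 attempts, and 5 of 6 key steps covered.
print(eeg_metrics(io=4, onu=22, can=6, anu=22, key_covered=[1, 1, 1, 1, 1, 0]))
```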
# EXPERIMENT
In this section, we address the following research questions:
- RQ1: Why do we adopt the Dual-Graph Evaluation Framework?
- RQ2: What is the impact of incorporating the knowledge module on the final performance?
- RQ3: How do different MLMs perform when executing these tasks?
- RQ4: For the knowledge base we constructed, which MLM demonstrates the most significant performance improvement?
To evaluate the effectiveness of KGCE, we formulated RQ1 to validate the necessity of the dual-graph evaluator and its fine-grained metrics. RQ2 then uses these metrics to evaluate the impact of the knowledge module on agent performance. On this basis, RQ3 compares the performance of different MLM agents with and without the knowledge base. Finally, RQ4 evaluates how much the knowledge base improves each model's agent.
## Compared Methods
We compare KGCE with two state-of-the-art frameworks, CRAB and WORFBench, selected for their relevance to task decomposition and graph-based evaluation, while highlighting the gaps addressed by our framework: comprehensive evaluation, knowledge base utilization, and educational task support.
- **CRAB**: As a benchmark for cross-environment agent evaluation, CRAB introduces a graph-based framework that decomposes tasks into sub-tasks with DAG structures, enabling metrics like CR for multi-platform scenarios. Its interactive GUI environment support and DAG-based evaluator make it a critical baseline for assessing agents across devices, aligning with our focus on cross-platform task execution. However, CRAB lacks a knowledge base, specific support for educational scenarios, and fine-grained metrics, which limits its ability to capture detailed execution efficiency and task completion quality in complex educational tasks.
- **WORFBench**: WORFBench excels in evaluating complex graph-structured workflows and provides fine-grained metrics for linear and graph planning, which aligns with our interest in task execution analysis. However, it does not address the unique challenges of educational agents, such as domain-specific knowledge requirements and cross-platform task execution in educational software, and it lacks macro-level metrics. Moreover, it only analyzes theoretical results without actual task execution.
## RQ1: Validating the Effectiveness of the Dual-Graph Evaluation Framework
Experimental Setup: To assess the effectiveness of the framework, we conducted 624 task executions across 104 education-related tasks using MLM agents. For each task execution, we recorded all performance metrics generated from the Dual-Graph Evaluation Framework, including CR, CPA, Precision, Recall, F1-score, BR, OoR, and RMS. We then computed Pearson correlation coefficients among all these metrics to evaluate their interdependencies.
Experimental Results: Fig. 3 presents the correlation-matrix heatmap of all metrics. Fine-grained indicators from the Dual-Graph Evaluation Framework strongly align with macro-level success: Precision, Recall, and F1-score exhibit high positive correlations with CR, indicating that accurate node selection and comprehensive coverage directly boost task completion. Conversely, BR, OoR, and RMS are negatively correlated with CR—frequent backtracking (BR), boundary violations (OoR), or exceeding the maximum step limit (RMS) all lower success rates by introducing redundant, erroneous, or incomplete paths. Together, these micro and macro perspectives provide a clear, actionable diagnosis of agent strengths and weaknesses across multi-path tasks.
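The correlation analysis itself is straightforward to reproduce. The sketch below uses random placeholder numbers in place of the real 624 execution logs, so only the shape of the computation matches the experiment; column names follow the metrics listed above.

``` python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Placeholder metric log: 624 executions (104 tasks x 3 models x with/without KB).
runs = pd.DataFrame({m: rng.random(624) for m in
                     ["CR", "CPA", "Precision", "Recall", "F1", "BR", "OoR", "RMS"]})

corr = runs.corr(method="pearson")  # 8x8 matrix, visualized as the heatmap in Fig. 3
print(corr.round(2))
```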
## RQ2: Verifying the Effectiveness of the Knowledge Base
Experimental Setup: To verify the effectiveness of the knowledge base in enhancing agent performance, we divided the experiment into two groups: one group uses an agent without knowledge base support, relying solely on the native capabilities of the multimodal large models; the other uses an agent with knowledge base support, which enhances execution by combining the model with the knowledge base. Three models, Qwen-VL-Max-Latest, GPT-4o, and Gemini-2.0-Flash, were used to perform the 104 tasks. We assessed performance with our dual-graph evaluation framework, including CR, CPA, Precision, Recall, F1-score, BR, OoR, and RMS, and computed the average of each metric with and without the KB.
| Metric | Without KB (%) | With KB (%) | Improve (%) |
|---|---|---|---|
| CR | 60.02 | 75.26 | +25.39 |
| CPA | 7.22 | 11.29 | +56.37 |
| Precision | 24.68 | 32.84 | +33.06 |
| Recall | 63.87 | 75.79 | +18.66 |
| F1-score | 33.96 | 44.96 | +32.39 |
| BR | 52.01 | 41.47 | -20.27 |
| OoR | 13.42 | 7.54 | -43.81 |
| RMS | 46.33 | 31.27 | -32.51 |
*Table 1: RQ2 performance comparison with and without the knowledge base.*
style="width:30.0%" />
style="width:30.0%" />
style="width:30.0%" />
Experimental Results: Table 1 summarizes the impact of the knowledge base across all key metrics. CR rises from 60.02% to 75.26% (+25.39%), CPA from 7.22% to 11.29% (+56.37%); Precision improves from 24.68% to 32.84% (+33.06%), Recall from 63.87% to 75.79% (+18.66%). At the same time, errors drop sharply: RMS is cut by 32.51%, OoR by 43.81%, and BR by 20.27%. These gains stem from the knowledge base supplying prior rules, task structures, and refined action prompts that enable the agent to plan shorter, more accurate paths, avoid redundant or out-of-range operations, and cover essential steps. These results indicate that the knowledge base helps agents plan and execute tasks more effectively by providing structured information.
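For reference, the Improve column is consistent with relative change over the Without-KB value rather than an absolute percentage-point difference; for example, for CR:

``` math
\text{Improve} = \frac{\text{With KB} - \text{Without KB}}{\text{Without KB}} \times 100\%
= \frac{75.26 - 60.02}{60.02} \times 100\% \approx +25.39\%.
```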
## RQ3: Evaluating the Performance of Different Large Models
Experimental Setup: To evaluate the performance of different MLM agents on the 104 educational tasks we designed, we used three commercial MLMs: Qwen-VL-Max-Latest, GPT-4o, and Gemini-2.0-Flash. Each model performed the tasks both with and without the knowledge base, yielding six experiment sets in total; we recorded the results and computed overall performance.
Experimental Results: The results are shown in Fig. 4. GPT-4o delivers the best overall performance in terms of task completion rate and execution efficiency. With knowledge base support, GPT-4o performs best, achieving a CR of 77.21% and an F1-score of 47.71%. This indicates that GPT-4o possesses stronger multimodal processing, efficient reasoning, deep knowledge retrieval, and semantic understanding: it fully exploits the knowledge base and dialogue context, makes better use of prior knowledge when handling tasks, and avoids redundant exploration and incorrect operations. In contrast, Qwen-VL-Max-Latest and Gemini-2.0-Flash are comparatively weaker in multimodal processing and reasoning.
## RQ4: Evaluating the Knowledge Base's Improvement Effect on Different Large Models
Experimental Setup: To quantify how much the knowledge base improves each large-model agent, we computed the per-metric improvement for each model from the experimental measurements above.
| Model | Metric | Without KB (%) | With KB (%) | Improve (%) |
|---|---|---|---|---|
| Qwen-VL-Max-Latest | CR | 52.88 | 76.53 | +44.72 |
| | CPA | 5.82 | 12.09 | +107.73 |
| | Precision | 21.63 | 35.79 | +65.46 |
| | Recall | 56.79 | 79.12 | +39.32 |
| | F1-score | 28.95 | 48.45 | +67.43 |
| | BR | 53.71 | 38.20 | -28.88 |
| | OoR | 29.41 | 14.71 | -49.19 |
| | RMS | 41.58 | 25.49 | -38.70 |
| GPT-4o | CR | 65.39 | 77.21 | +18.08 |
| | CPA | 8.71 | 12.63 | +45.01 |
| | Precision | 28.33 | 35.37 | +24.85 |
| | Recall | 68.92 | 76.59 | +11.13 |
| | F1-score | 38.51 | 47.71 | +23.89 |
| | BR | 49.08 | 36.12 | -26.41 |
| | OoR | 8.91 | 5.94 | -33.33 |
| | RMS | 43.56 | 21.78 | -50.00 |
| Gemini-2.0-Flash | CR | 61.80 | 72.03 | +16.55 |
| | CPA | 7.14 | 9.16 | +28.29 |
| | Precision | 24.08 | 27.35 | +13.58 |
| | Recall | 65.91 | 71.66 | +8.72 |
| | F1-score | 34.43 | 38.72 | +12.46 |
| | BR | 53.25 | 50.08 | -6.07 |
| | OoR | 1.92 | 1.98 | +3.13 |
| | RMS | 53.85 | 46.53 | -13.59 |

*Table 2: Per-model performance comparison with and without the knowledge base.*
Experimental Results: As shown in Table 2, all three models improve after introducing the knowledge base, but Qwen-VL-Max-Latest and Gemini-2.0-Flash do not reach GPT-4o's overall performance. Qwen-VL-Max-Latest shows the most significant improvement after introducing the knowledge base, with CR increasing from 52.88% to 76.53%, while Gemini-2.0-Flash's CR increases from 61.80% to 72.03%. This indicates that the knowledge base has a differential impact on performance improvement across models, depending on each model's characteristics and requirements. However, Gemini-2.0-Flash's OoR increases by 3.13% after the introduction of the knowledge base, when it should theoretically decrease. This may be due to a conflict between the static rules of the knowledge base and Gemini's dynamic reasoning.
# CONCLUSIONS
This paper proposes KGCE, a cross-platform benchmark framework that integrates knowledge base augmentation and graph-based evaluation to enhance educational agents powered by MLMs. We constructed 104 tasks across Android, Windows, and hybrid platforms, along with a domain-specific software knowledge base. Experiments on Qwen-VL-Max-Latest, GPT-4o, and Gemini-2.0-Flash show consistent performance improvements with knowledge support, with GPT-4o achieving the best results. These findings underscore both the general effectiveness of knowledge augmentation and the impact of model architecture on task adaptability. However, limitations remain in scalability and error detection. Future work will expand task diversity, enrich software coverage, and improve agents’ ability to detect and respond to unachievable tasks.
📊 Figures













