MalMoE: Mixture-of-Experts Enhanced Encrypted Malicious Traffic Detection Under Graph Drift

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Encryption is widely used to secure network traffic, but it also complicates malicious traffic detection because packet payloads are no longer visible. Graph-based methods are emerging as promising solutions, leveraging multi-host interactions to improve detection accuracy. However, most of them face a critical problem: graph drift, where the flow statistics or topological information of a graph change over time. To address this problem, we propose a graph-assisted encrypted traffic detection system, MalMoE, which applies Mixture of Experts (MoE) to select the best expert model for drift-aware classification. In particular, we design 1-hop-GNN-like expert models that handle different graph drifts by analyzing graphs with different features. A redesigned gate model then selects experts according to the actual drift. MalMoE is trained with a stable two-stage strategy with data augmentation, which effectively teaches the gate how to route. Experiments on open-source, synthetic, and real-world datasets show that MalMoE performs precise, real-time detection.


💡 Research Summary

The paper introduces MalMoE, a novel system for detecting malicious network flows in encrypted traffic that leverages graph‑based analysis while explicitly addressing the problem of temporal graph drift. Encrypted traffic hides payload information, rendering traditional Deep Packet Inspection ineffective. Recent works have turned to flow‑level graphs—nodes represent IP addresses, edges represent individual NetFlow records, and node/edge attributes capture statistical characteristics. However, these graph‑based detectors assume a stationary distribution; in practice, two major drift phenomena occur: (1) flow‑statistic drift, where per‑flow metrics such as bytes, packets, and duration change due to congestion or service shifts, and (2) graph‑scale drift, where the overall number of connections fluctuates (e.g., diurnal patterns). When either drift appears, a model trained on a fixed time window suffers severe performance degradation.
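The flow-level graph described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the record format `(src_ip, dst_ip, bytes, packets, duration)` and the function name `build_flow_graph` are assumptions chosen for clarity.

```python
from collections import defaultdict

# Hypothetical minimal NetFlow records: (src_ip, dst_ip, bytes, packets, duration).
flows = [
    ("10.0.0.1", "10.0.0.2", 1200, 10, 0.5),
    ("10.0.0.1", "10.0.0.3", 400, 4, 0.2),
    ("10.0.0.2", "10.0.0.3", 900, 6, 1.1),
]

def build_flow_graph(flows):
    """Nodes are IP addresses; each edge is a single flow carrying its statistics."""
    adjacency = defaultdict(list)  # ip -> indices of incident flows (edges)
    edges = []                     # (src, dst, [bytes, packets, duration])
    for idx, (src, dst, nbytes, pkts, dur) in enumerate(flows):
        edges.append((src, dst, [nbytes, pkts, dur]))
        adjacency[src].append(idx)
        adjacency[dst].append(idx)
    return adjacency, edges

adjacency, edges = build_flow_graph(flows)
```

Classification then happens at the edge level: each NetFlow record is one edge to be labeled benign or malicious.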

The authors make a key empirical observation: certain node feature constructions are inherently robust to one type of drift but vulnerable to the other. The “average traffic” feature (AVG) computes the mean of neighboring flow statistics and remains stable under graph‑scale drift, while the “node degree” feature (DEG) counts incident edges and stays stable under flow‑statistic drift. Concatenating both features, however, harms performance because each component drifts under the opposite condition.
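The two feature constructions can be made concrete with a small sketch. The toy graph and function names below are illustrative assumptions; the point is only that DEG depends on edge count while AVG depends on edge statistics, so each is blind to the other's drift.

```python
# Toy graph: adjacency maps an IP to indices of its incident flows;
# edges[i][2] holds that flow's statistics [bytes, packets, duration].
adjacency = {"10.0.0.1": [0, 1], "10.0.0.2": [0], "10.0.0.3": [1]}
edges = [
    ("10.0.0.1", "10.0.0.2", [1200.0, 10.0, 1.0]),
    ("10.0.0.1", "10.0.0.3", [400.0, 4.0, 3.0]),
]

def deg_feature(ip):
    """DEG: count of incident flows -- unchanged when per-flow statistics drift."""
    return len(adjacency[ip])

def avg_feature(ip):
    """AVG: mean of incident flows' statistics -- unchanged when the number
    of connections (graph scale) drifts, as long as per-flow stats hold."""
    incident = adjacency[ip]
    dim = len(edges[0][2])
    sums = [0.0] * dim
    for idx in incident:
        for j, v in enumerate(edges[idx][2]):
            sums[j] += v
    return [s / len(incident) for s in sums]
```

Doubling the number of similar flows roughly doubles `deg_feature` but leaves `avg_feature` near its old value, which is exactly why concatenating both features exposes the model to both drift types at once.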

To exploit this insight, MalMoE adopts a Mixture‑of‑Experts (MoE) architecture. Two lightweight experts are built, each a 1‑hop GNN‑like model that directly concatenates the chosen node feature (AVG or DEG) from the source and destination IPs with the edge’s flow statistics to form a flow embedding. The 1‑hop design is sufficient because routers typically observe only immediate neighbor interactions, and it keeps inference cost low for edge‑level classification. Each expert is trained to be strong against the drift type it is designed for.
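A 1-hop expert of this kind reduces to feature concatenation plus a small classifier head. The sketch below uses a plain linear score as a stand-in for the learned head; the function names are hypothetical.

```python
def flow_embedding(src_feat, dst_feat, edge_stats):
    """1-hop expert input: concatenate the source node feature, the destination
    node feature, and the flow's own statistics. No multi-hop message passing
    is needed, since a vantage point mostly sees immediate neighbours."""
    return list(src_feat) + list(dst_feat) + list(edge_stats)

def expert_score(embedding, weights, bias=0.0):
    """Linear scoring head standing in for the expert's classifier;
    the real expert would use a small learned model."""
    return sum(w * x for w, x in zip(weights, embedding)) + bias

# AVG expert would pass AVG node features here; DEG expert would pass degrees.
emb = flow_embedding([2.0], [1.0], [1200.0, 10.0])
```

Keeping the experts this shallow is what makes per-edge inference cheap enough for line-rate classification.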

Traditional MoE uses a soft weighted sum of expert outputs, which is problematic in this setting: (i) near‑out‑of‑distribution (near‑OOD) samples are hard to detect with only sample‑level cues, (ii) the weighted sum introduces training instability, and (iii) the resulting gating weights are hard to interpret. MalMoE therefore redesigns the gate in two ways. First, the gate receives not only the per‑flow features but also a global graph representation (e.g., a pooled embedding) so it can sense overall drift conditions. Second, the gate is trained to perform hard selection—choosing a single expert per flow—by minimizing a cross‑entropy loss that treats the expert choice as a classification problem. This hard routing improves explainability, reduces gradient noise, and makes the system more robust to subtle distribution shifts.
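The two gate redesigns can be sketched as follows: a pooled graph-level vector gives the gate global drift context, and routing picks a single expert rather than mixing outputs. Both helpers are illustrative assumptions, not the paper's exact operators.

```python
def global_graph_representation(edge_stats):
    """Mean-pool all edge statistics into one graph-level vector -- a simple
    stand-in for the pooled embedding that lets the gate sense overall drift,
    which per-flow features alone cannot reveal for near-OOD samples."""
    n, dim = len(edge_stats), len(edge_stats[0])
    return [sum(e[j] for e in edge_stats) / n for j in range(dim)]

def hard_route(gate_logits):
    """Hard selection: choose exactly one expert (argmax) instead of a soft
    weighted sum, keeping routing interpretable and gradients less noisy."""
    return max(range(len(gate_logits)), key=lambda i: gate_logits[i])

graph_vec = global_graph_representation([[100.0, 2.0], [300.0, 4.0]])
chosen = hard_route([0.2, 1.4])
```

In practice the gate would consume the concatenation of `graph_vec` and the per-flow features before emitting logits over the expert set.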

Training proceeds in two stages. Stage 1 employs data augmentation to synthetically generate graphs exhibiting both types of drift (e.g., scaling the number of edges or perturbing flow statistics). Each expert is pre‑trained on its respective augmented data, learning drift‑specific invariances without interference from the other expert. Stage 2 freezes the experts and trains the gate on the same augmented set, using the ground‑truth expert that yields the lowest classification loss as the target label. This two‑stage scheme stabilizes gate learning, prevents the “expert collapse” problem, and eliminates the need for continual retraining with fresh labeled data.
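The Stage-2 supervision signal described above can be sketched in a few lines: with the experts frozen, each augmented sample's gate label is simply the index of the expert that classified it with the lowest loss. The helper name is an assumption.

```python
def gate_targets(losses_per_expert):
    """Stage-2 teaching signal: for each augmented sample, the target label is
    the frozen expert with the lowest classification loss on that sample; the
    gate is then trained with cross-entropy against these labels."""
    return [min(range(len(row)), key=lambda e: row[e]) for row in losses_per_expert]

# Rows: per-sample losses of the [AVG expert, DEG expert] on augmented data.
targets = gate_targets([[0.2, 0.9], [0.8, 0.1]])
```

Because the experts are frozen while the gate learns, no expert can hog the routing signal, which is what prevents expert collapse.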

The authors evaluate MalMoE on three fronts: (1) public benchmark datasets (e.g., CIC‑IDS‑2017) with artificially induced drifts, (2) a synthetic dataset where drift magnitude can be precisely controlled, and (3) real‑world NetFlow traces collected from a major backbone operator over a week in April 2025, which naturally exhibit diurnal flow‑statistic and scale drifts (the paper shows up to 400 % variation in flow count and 300 % in bytes‑per‑packet). Across all settings, MalMoE outperforms baseline GNN classifiers and recent drift‑aware methods by at least 24 % in accuracy and 31 % in F1 score under drift conditions. Moreover, after model quantization and optimized inference pipelines, MalMoE processes 858,646 flows per second on a consumer‑grade CPU, satisfying real‑time detection requirements for high‑throughput networks.

In summary, the contributions are: (i) identifying drift‑specific robust node features and formalizing the drift problem for encrypted traffic graphs, (ii) designing a modular MoE framework where each expert handles a distinct drift type, (iii) introducing a graph‑aware hard‑selection gate that improves routing accuracy and interpretability, (iv) proposing a two‑stage augmentation‑driven training protocol that eliminates the need for periodic retraining, and (v) demonstrating substantial gains in detection performance and throughput on diverse datasets. The work opens avenues for extending the expert pool with additional feature families (e.g., temporal embeddings, protocol‑level attributes) and for exploring multi‑hop GNN experts to capture longer‑range attack patterns, while preserving the core advantage of drift‑aware, retraining‑free, real‑time detection.

