Reading time: 14 minutes
...

๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.19697
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

With the rapid growth of data volume in modern telecommunication networks and the continuous expansion of their scale, maintaining high reliability has become a critical requirement. These networks support a wide range of applications and services, including highly sensitive and mission-critical ones, which demand rapid and accurate detection and resolution of network errors. Traditional fault-diagnosis methods are no longer efficient for such complex environments [1]. In this study, we leverage Large Language Models (LLMs) to automate network fault detection and classification. Various types of network errors were intentionally injected into a Kubernetes-based test network, and data were collected under both healthy and faulty conditions. The dataset includes logs from different network components (pods), along with complementary data such as system descriptions, events, Round Trip Time (RTT) tests, and pod status information. The dataset covers common fault types such as pod failure, pod kill, network delay, network loss, and disk I/O failures. We fine-tuned the GPT-4.1 nano model via its API on this dataset, resulting in a significant improvement in fault-detection accuracy compared to the base model. These findings highlight the potential of LLM-based approaches for achieving closed-loop, operator-free fault management, which can enhance network reliability and reduce downtime-related operational costs for service providers.

📄 Full Content

Maintaining the continuous operation of modern communication networks, especially advanced infrastructures such as 5G systems, has become both crucial and increasingly challenging. Traditional network monitoring approaches rely heavily on manual inspection of logs, alerts, and performance metrics. Such processes are time-consuming, error-prone, and lack scalability as networks grow in size, complexity, and heterogeneity [2]. As telecommunication systems continue to evolve, integrating software-defined architectures, virtualization, and containerized services, the rapid identification and mitigation of faults have become even more difficult to achieve.

Recent advancements in artificial intelligence (AI), and particularly in Large Language Models (LLMs), offer a new paradigm for intelligent and automated network management [3] [4] [6]. These models possess the ability to understand natural language, interpret unstructured data such as logs, and even reason about causal relations in complex systems. This capability introduces a significant opportunity to transform traditional network operations: from reactive troubleshooting to proactive, self-healing mechanisms. While earlier studies explored AI-driven monitoring through statistical and deep learning models [2] [5], LLMs now enable contextual reasoning across multimodal data sources, thereby bridging the gap between human expertise and automated fault diagnosis.

With the global transition from 4G to 5G and beyond, the scale and diversity of telecommunication infrastructures have expanded dramatically. Network environments now integrate cloud-native platforms, such as Kubernetes, where numerous microservices interact dynamically. Configuring, managing, and troubleshooting these environments demand both domain-specific and software-engineering expertise. Leveraging LLMs to automate such complex management tasks has therefore gained significant attention in recent research and industrial discussions, including works such as LLM for 5G: Network Management [7]. By combining domain knowledge with the reasoning capabilities of LLMs, it becomes feasible to realize autonomous, closed-loop network operations capable of detecting, classifying, and resolving errors in near real time. A review of previous research reveals that no prior work has explored a generalized framework for assessing the overall health of communication networks using Large Language Models (LLMs) to identify potential faults across different layers and components. Most existing studies focus on specific layers, such as the physical or data-link layer, and rely on structured numerical data rather than unstructured system information. Techniques such as anomaly detection and time-series forecasting have been applied successfully [1] [2] in these limited contexts to detect deviations in predefined performance indicators.

In parallel, several works have investigated log analysis using traditional machine-learning or rule-based approaches [4] [5] [6]. However, these methods typically operate on pre-parsed and highly filtered logs within controlled experimental conditions, requiring significant preprocessing efforts. As a result, their applicability is restricted to detecting a narrow range of explicitly defined errors that appear in standardized log formats. While such techniques can achieve high accuracy in constrained settings, they are not scalable or adaptable to heterogeneous, dynamic network environments.

Recent domain-specific efforts have targeted 5G fault detection and RCA with hybrid ML-LLM pipelines. In the core network, 5G Core Fault Detection and Root Cause Analysis using Machine Learning and Generative AI combines supervised anomaly detectors with LLM-based summarization to improve diagnosis quality on control/user-plane events, demonstrating that generative models can convert heterogeneous operational traces into operator-ready incident narratives [8]. Complementary trends extend beyond the core toward cross-domain or cross-vendor settings: An LLM-based Cross-Domain Fault Localization in Carrier Networks shows that aligning prompts with topology/context priors improves generalization across environments, while RCA Copilot operationalizes LLMs as a copilot over multi-source telemetry (logs, metrics, traces), yielding actionable recommendations rather than raw detections [9], [10]. These systems substantiate the utility of LLMs for interpretability and triage, yet most still depend on structured features, curated parsers, or offline post-processing pipelines, which limits real-time applicability in highly dynamic, cloud-native 5G cores.

A parallel line of work studies LLMs (and tool-augmented LLM agents) for automated RCA and decision support. RCAgent integrates tool calls and knowledge retrieval with conversational reasoning to correlate events and surface root causes in complex cloud stacks [11]. In telecom-adjacent 5G contexts, LLM-driven RCA has also been explored for RAN anomalies using graph/transformer models to represent dependencies, and for verticals such as power-grid 5G where domain knowledge must be fused with network telemetry [12], [13]. Reasoning Language Models for Root Cause Analysis in 5G Wireless Networks further argues for explicit chain-of-thought/plan-and-act prompting to improve causal attribution under noisy observability [14]. Across these works, three gaps remain: (i) reliance on pre-parsed or schema-constrained inputs rather than raw, heterogeneous logs and events; (ii) limited coupling with live orchestrators (e.g., Kubernetes) for closed-loop response; and (iii) scarcity of fine-tuned datasets built from realistic fault injections in operational 5G cores.

In this study, our objective is to design a comprehensive and flexible pipeline capable of processing heterogeneous network inputs, such as raw logs, events, and status reports, without extensive preprocessing or manual parsing. The proposed LLM-based approach acts as an intelligent conversational assistant that enables network operators, even without deep technical expertise, to monitor and diagnose network conditions using simple natural-language queries. Furthermore, the adaptability of LLMs allows new data types or fault categories to be seamlessly integrated into the model, dynamically expanding its operational scope. Finally, this model can be deployed in a closed-loop, real-time monitoring framework, continuously observing the network, identifying anomalies, and generating detailed diagnostic reports whenever faults occur. Such an approach has the potential to revolutionize network management by combining automation, interpretability, and reliability within a single unified system [3] [7].

An LLM is trained on logs from 5G mobile networks to detect the root cause of failures. For training, faults are generated using Chaos Mesh in an OpenAirInterface 5G network deployed on a Kubernetes cluster. The entire training pipeline is shown in Fig. 1.

To evaluate the proposed framework under realistic network conditions, we developed an experimental 5G testbed based on the OpenAirInterface (OAI) 5G Core, an open-source and standards-compliant implementation of the 3GPP Release 15 specification [15].

Our core network includes the Access and Mobility Management Function (AMF), the Session Management Function (SMF), and the User Plane Function (UPF). A dedicated database module maintains subscriber and session information. The radio access layer of the testbed consists of an OAI-based gNodeB (gNB) connected to multiple User Equipments (UEs), forming a fully functional RAN-Core integration.

In our implementation, each network component was deployed as a containerized pod within a Kubernetes cluster, allowing for isolated management, automated scaling, and continuous log collection. These logs, along with performance metrics from the gNB and UEs, served as the primary data source for our LLM-based monitoring and fault-detection pipeline.

To generate a dataset for the proposed LLM-based monitoring framework, controlled fault scenarios were introduced into the experimental testbed using Chaos Mesh, a powerful open-source chaos engineering platform designed for Kubernetes environments [16]. Chaos Mesh enables the simulation of various failure conditions at the pod, network, and system levels in a safe and repeatable manner. In this study, five representative categories of faults were selected to emulate common failure conditions in cloud-native 5G systems (a sketch of one such experiment definition follows the list):

  1. Pod Failure: Simulates unexpected termination or crash of a containerized network function, such as AMF or SMF. This fault type tests the resilience of service recovery mechanisms and the system’s ability to restart or reassign affected pods automatically.

  2. Pod Kill: Represents a forced termination of a specific pod process at runtime. Unlike pod failure, this type mimics operator-triggered or abrupt shutdown scenarios, providing insights into how quickly the orchestration layer (Kubernetes) detects and replaces the terminated component.

  3. Network Delay: Introduces artificial latency into communication between selected pods or components (e.g., between the gNB and the UPF). This allows evaluation of the system’s performance under delayed signaling and degraded transport conditions, reflecting scenarios such as congested backhaul links.

  4. Network Loss: Simulates packet drops within network interfaces to analyze the tolerance of the 5G core control and data planes to unreliable transmission. This experiment helps study how message retransmissions and session stability are affected under partial network loss.

  5. I/O Injection: Injects read/write latency or temporary disk unavailability into containerized components, replicating storage-related anomalies such as database I/O bottlenecks or temporary filesystem access failures. This fault type examines the robustness of stateful services like the core database module.
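As referenced above, each of these faults is expressed as a Chaos Mesh custom resource applied to the cluster. The snippet below is a minimal sketch, not the authors' actual experiment definitions, showing how a network-delay fault targeting the UPF pod could be created with the Kubernetes Python client; the namespace, pod labels, latency values, and duration are illustrative assumptions.

```python
# Hedged sketch: creating a Chaos Mesh NetworkChaos experiment (network delay
# against the UPF pod) through the Kubernetes Python client. Namespace, labels,
# latency values, and duration are illustrative assumptions, not the paper's config.
from kubernetes import client, config

config.load_kube_config()  # assumes kubeconfig access to the testbed cluster
api = client.CustomObjectsApi()

network_delay = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "NetworkChaos",
    "metadata": {"name": "upf-delay", "namespace": "oai"},  # hypothetical names
    "spec": {
        "action": "delay",
        "mode": "one",
        "selector": {
            "namespaces": ["oai"],
            "labelSelectors": {"app": "oai-upf"},  # hypothetical pod label
        },
        "delay": {"latency": "200ms", "jitter": "20ms"},
        "duration": "240s",  # fault window during which telemetry is collected
    },
}

api.create_namespaced_custom_object(
    group="chaos-mesh.org",
    version="v1alpha1",
    namespace="oai",
    plural="networkchaos",
    body=network_delay,
)
```

The other categories follow the same pattern: PodChaos with action pod-failure or pod-kill, NetworkChaos with action loss, and IOChaos with action latency or fault.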

Each fault type was applied to different components of the OAI 5G Core for predefined time intervals, while comprehensive telemetry data, including system logs, event traces, pod status updates, and round-trip time (RTT) measurements, were collected in parallel. This process yielded a rich dataset containing both normal and abnormal operational states, which was subsequently used for fine-tuning the LLM and validating its fault detection and reasoning capabilities.

Through this controlled and systematic approach, Chaos Mesh provided a reliable means to evaluate the network’s fault response and to train the LLM model under realistic and diverse failure conditions.

To build a training and evaluation dataset for our LLM-based fault diagnosis system, we developed an automated data collection pipeline that interacts directly with the Kubernetes-based 5G core deployment. The pipeline remotely controls the testbed, orchestrates experiments, injects faults, and captures multi-source telemetry before and after each fault scenario in a structured and repeatable manner. The multi-source telemetry collection and aggregation design is inspired by cloud-native observability frameworks such as Prometheus and Kubernetes logging best practices [17]. The data collection process is executed from a controller machine that connects to the cluster over SSH. Through this interface, the pipeline programmatically issues operational commands to query the state of the system, retrieve runtime information, and persist relevant outputs. The workflow consists of four main stages:

  1. Network Reset and Initialization: At the start of each experiment, the testbed is brought into a known healthy baseline state. The script tears down and reinstalls the mobile core network (AMF, SMF, UPF, database, gNB) and user equipment (UE) instances using predefined deployment descriptors. After redeployment, the pipeline automatically waits until all critical components are reported as ready by Kubernetes. This ensures that each experiment begins from a consistent configuration and that observed degradations are attributable to injected faults rather than residual transient behavior from previous runs.

  2. Health Monitoring and Status Capture: The script periodically inspects the lifecycle state of all pods (e.g., Running, CrashLoopBackOff, etc.) to detect any non-running or partially degraded components. Pod status summaries are stored to disk at each experiment step.

In addition, a “describe” snapshot is collected for every pod. This kubectl describe pod output includes recent events, restart counts, container-level conditions, resource usage signals, and warning messages generated by Kubernetes. Capturing these descriptions is critical for post hoc reasoning, because they often contain early indicators of faults (for example, repeated restarts, image pull errors, failed liveness probes) that may not yet appear in high-level KPIs.

  3. Runtime Telemetry and Connectivity Measurements:

For each core network function and UE pod, the pipeline retrieves recent logs directly from the workload containers. These logs include authentication traces, session setup messages, mobility management events, forwarding plane messages, and error messages from services such as AMF, SMF, and UPF. In addition to log data, the pipeline actively measures connectivity quality between the core and each UE by executing an RTT (round-trip time) test. RTT results are collected for all UE instances both before and after the fault is injected. This allows us to correlate control-plane or user-plane anomalies with observable service impact at the UE side.

  4. Event and System-Level Reporting: Alongside pod-local telemetry, the pipeline also records the global Kubernetes event history in the target namespace. The event trace captures cluster-level reactions to faults (e.g., pod eviction, rescheduling attempts, container kills, liveness probe failures, node-level resource pressure). By including these events in the dataset, we provide the model with causal context: not only what failed, but also how the orchestration layer responded.

All collected artifacts from a single experimental run (pod logs, pod descriptions, per-pod status, UE RTT measurements before and after the fault, and namespace-wide event history) are stored for each experiment.
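To make the shape of such a snapshot concrete, the following sketch gathers the same artifact types with plain kubectl calls; the actual pipeline runs over SSH and uses its own naming, so the namespace, pod names, log tail length, and RTT target below are assumptions.

```python
# Hedged sketch of one experiment snapshot: pod status, namespace events,
# per-pod describe/logs, and a UE-side RTT probe are written to disk.
import pathlib
import subprocess

NS = "oai"  # hypothetical namespace of the 5G core deployment

def sh(cmd: list[str]) -> str:
    """Run a command and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def collect_snapshot(out_dir: str, pods: list[str], ue_pods: list[str]) -> None:
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Lifecycle state of every pod in the namespace
    (out / "pod_status.txt").write_text(sh(["kubectl", "get", "pods", "-n", NS, "-o", "wide"]))
    # Namespace-wide event history (how the orchestration layer reacted)
    (out / "events.txt").write_text(
        sh(["kubectl", "get", "events", "-n", NS, "--sort-by=.lastTimestamp"]))
    for pod in pods:
        # "describe" snapshot: restart counts, probe failures, warnings, recent events
        (out / f"{pod}_describe.txt").write_text(sh(["kubectl", "describe", "pod", pod, "-n", NS]))
        # Recent container logs (AMF, SMF, UPF, DB, gNB, UEs)
        (out / f"{pod}_logs.txt").write_text(sh(["kubectl", "logs", pod, "-n", NS, "--tail=2000"]))
    for ue in ue_pods:
        # RTT probe from the UE pod toward the data network (target IP is an assumption)
        (out / f"{ue}_rtt.txt").write_text(
            sh(["kubectl", "exec", ue, "-n", NS, "--", "ping", "-c", "5", "8.8.8.8"]))
```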

This automated approach has three important properties. First, it guarantees repeatability: each dataset sample corresponds to a known network state and a known injected fault condition. Second, it captures heterogeneous data sources (raw logs, status descriptions, cluster events, RTT measurements) without requiring heavy manual preprocessing or hand-written parsing rules. Third, it produces operator-facing, human-readable traces that can be directly consumed by an LLM. In other words, rather than relying solely on numerical metrics or narrowly structured telemetry, the pipeline aggregates semantically rich, text-based observations that reflect how real engineers would inspect and debug a live 5G core deployment.

These consolidated experiment snapshots form the supervision data used for fine-tuning the LLM. Each snapshot is paired with the ground-truth fault label associated with that experiment (e.g., network delay, pod kill, I/O degradation), enabling supervised training for fault classification and explanatory diagnosis.
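A minimal sketch of how one labeled snapshot could be serialized into the chat-style JSONL format used for OpenAI fine-tuning is shown below; the snapshot field names and the abridged system prompt are assumptions, not the paper's exact schema.

```python
# Hedged sketch: turning one labeled snapshot into a chat-format JSONL record
# for OpenAI fine-tuning. The snapshot keys and the abridged system prompt are
# illustrative assumptions.
import json

SYSTEM_PROMPT = (
    "You are an expert 5G network fault analyzer for OpenAirInterface on Kubernetes. "
    "Identify exactly one failure type: IOInjection, NetworkDelay, NetworkLoss, PodFailure, or PodKill."
)

def to_jsonl_record(snapshot: dict, fault_label: str) -> str:
    # Concatenate the heterogeneous, text-based observations into one user turn
    user_content = "\n\n".join([
        "### Pod status\n" + snapshot["pod_status"],
        "### Kubernetes events\n" + snapshot["events"],
        "### Filtered pod logs\n" + snapshot["logs"],
        "### RTT before/after fault\n" + snapshot["rtt"],
    ])
    record = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": fault_label},  # e.g. "NetworkDelay"
        ]
    }
    return json.dumps(record)
```

One such line per experiment is appended to the training and validation JSONL files.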

Log files are typically very large. For instance, within four minutes, the gNodeB generates over 2,500 log lines (amounting to more than 60,000 tokens), costing about $0.001 when processed using gpt-4o-mini. Therefore, the logs are first filtered to retain only the important lines.

As illustrated in Fig. 4, the gNodeB log lines are color-coded. During filtering, only the colored lines (green, yellow, or red) are kept, as they represent the most relevant information.
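Assuming the colors are carried as standard ANSI escape codes in the raw gNB output (the paper does not state the exact mechanism), a simple filter of this kind would retain only the highlighted lines:

```python
# Hedged sketch of the color-based log filter: only lines carrying green,
# yellow, or red ANSI codes are kept. The exact escape codes the testbed
# emits are an assumption.
import re

# 31 = red, 32 = green, 33 = yellow (standard ANSI SGR foreground codes)
COLORED = re.compile(r"\x1b\[[0-9;]*3[123]m")

def filter_gnb_log(raw_log: str) -> str:
    kept = [line for line in raw_log.splitlines() if COLORED.search(line)]
    return "\n".join(kept)
```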

As described in the data-gathering section, the following data were collected after injecting a fault, concatenated, and combined with a system prompt. The complete input was then provided to the LLM, which was instructed to identify the source of the problem.

System Prompt (Fault Diagnosis Task Definition)

The model acts as an expert 5G network fault analyzer for OpenAirInterface (OAI) deployments on Kubernetes, identifying exactly one failure type among five: IOInjection, NetworkDelay, NetworkLoss, PodFailure, or PodKill. Input data include pod status (kubectl get pods), pod logs (AMF, SMF, UPF, DB, gNB, UEs), Kubernetes events, and RTT latency statistics. Detection logic:

• IOInjection: “Unknown database ‘oai 5g’” in DB logs.

In the system prompt, each component of the data and the corresponding fault types were briefly described along with the logic for fault detection. A simplified version of the system prompt is shown in Fig. 6. The collected data, together with the system prompt, were provided to the LLM to infer and report network errors.
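Putting the pieces together, a hedged sketch of the inference call, with an abridged prompt rather than the full system prompt of Fig. 6, could look like this:

```python
# Hedged sketch of the inference step: the system prompt defines the five fault
# classes and the concatenated telemetry is passed as the user message. The
# prompt text is abridged and the model id is whatever the fine-tuning job returns.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an expert 5G network fault analyzer for OpenAirInterface (OAI) deployments "
    "on Kubernetes. Given pod status, pod logs, Kubernetes events, and RTT statistics, "
    "answer with exactly one of: IOInjection, NetworkDelay, NetworkLoss, PodFailure, PodKill."
)

def diagnose(telemetry_text: str, model: str) -> str:
    """Ask the (fine-tuned) model to name the injected fault for one snapshot."""
    resp = client.chat.completions.create(
        model=model,  # e.g. the fine-tuned model id returned by the training job
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": telemetry_text},
        ],
        temperature=0,  # deterministic, classification-style output
    )
    return resp.choices[0].message.content.strip()
```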

50% of the data were used for training, 25% for validation, and the remaining 25% for testing.

The GPT-4.1-Nano model was fine-tuned on the dataset using the OpenAI API (a sketch of this step appears after the evaluation criteria below). The dataset contains 118 experiments. As shown in Fig. 7, the data volume is uniformly distributed across the fault categories. Model performance was evaluated using two criteria: (i) binary detection, determining whether any fault exists in the log response, and

(ii) exact matching, verifying the entire predicted response against the labeled dataset to ensure correct identification of the specific fault type.
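For reference, launching such a fine-tuning job through the OpenAI API after the 50/25/25 split can be sketched as follows; the file names and base-model identifier are assumptions:

```python
# Hedged sketch of launching the fine-tuning job via the OpenAI API after the
# 50/25/25 split. File names and the exact base-model identifier are assumptions.
from openai import OpenAI

client = OpenAI()

train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
val_file = client.files.create(file=open("validation.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    model="gpt-4.1-nano-2025-04-14",  # assumed base-model id for GPT-4.1 nano
    training_file=train_file.id,
    validation_file=val_file.id,
)
print(job.id, job.status)  # poll this job until it reports a fine-tuned model id
```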

Fig. 8 summarizes overall performance for binary classification. The fine-tuned model achieved 93% accuracy and 95% F1-score, representing a substantial improvement over the non-tuned baseline (40% accuracy and 45% F1-score). This indicates that fine-tuning significantly enhances the model’s consistency in detecting the presence of faults across diverse log inputs.

The improvement in recall from 30% to 93% indicates that the baseline model was weak in fault detection and often misclassified faulty networks as normal.
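The two scoring modes reported above can be reproduced with standard metric implementations; the sketch below assumes a "Normal" label for fault-free samples and uses scikit-learn, neither of which is specified in the paper.

```python
# Hedged sketch of the two scoring modes on the held-out test set.
from sklearn.metrics import accuracy_score, f1_score, recall_score

def binary_scores(y_true: list[str], y_pred: list[str]) -> dict:
    """Binary detection: does the response indicate any fault at all?"""
    t = [label != "Normal" for label in y_true]  # "Normal" label is an assumption
    p = [label != "Normal" for label in y_pred]
    return {
        "accuracy": accuracy_score(t, p),
        "f1": f1_score(t, p),
        "recall": recall_score(t, p),
    }

def exact_match_accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Exact matching: the predicted fault class must equal the ground-truth label."""
    return accuracy_score(y_true, y_pred)
```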

Fig. 9 illustrates the per-fault accuracy comparison between the fine-tuned and baseline GPT-4.1-Nano models across five fault categories. The fine-tuned model consistently outperforms its baseline counterpart for all fault types, achieving near-perfect accuracy for I/O Injection (1.00) and Pod Failure (0.97). The 100% accuracy achieved for I/O Injection is primarily attributable to the small sample size (10 instances) and the distinctive log pattern observed during this fault, where the database pod’s disk failed due to I/O injection and the unknown database ‘oai’ message was displayed in the DB logs.


References

This content is AI-processed based on open-access arXiv data.
