Policy-Guided Threat Hunting: An LLM enabled Framework with Splunk SOC Triage


Authors: Rishikesh Sahay, Bell Eapen, Weizhi Meng

Rishikesh Sahay (a), Bell Eapen (a), Weizhi Meng (b), Md Rasel Al Mamun (a), Nikhil Kumar Dora (c), Manjusha Sumasadan (a), Sumit Kumar Tetarave (c), Rod Soto (d), Elyson De La Cruz (e)

(a) Department of Management Information Systems, University of Illinois, Springfield, USA
(b) School of Computing and Communications, Lancaster University, United Kingdom
(c) School of Computer Applications, Kalinga Institute of Industrial Technology, India
(d) Splunk Research Team, USA
(e) School of Information Technology and Artificial Intelligence, University of the Cumberlands, USA

Abstract

With frequently evolving Advanced Persistent Threats (APTs) in cyberspace, traditional security approaches have become inadequate for threat hunting in organizations. Moreover, Security Operations Center (SOC) analysts are often overwhelmed and struggle to analyze the huge volume of logs received from diverse devices in organizations. To address these challenges, we propose an automated and dynamic threat hunting framework for monitoring evolving threats, adapting to changing network conditions, and performing risk-based prioritization for the mitigation of suspicious and malicious traffic. By integrating Agentic AI with Splunk, an established SIEM platform, we developed a unique threat hunting framework. The framework systematically and seamlessly integrates different threat hunting modules, ranging from traffic ingestion to anomaly assessment using a reconstruction-based autoencoder, deep reinforcement learning (DRL) with two layers for initial triage, and a large language model (LLM) for contextual analysis. We evaluated the framework against a publicly available benchmark dataset, as well as against a simulated dataset.
The experimental results show that the framework can effectively adapt to different SOC objectives autonomously and identify suspicious and malicious traffic. The framework enhances operational effectiveness by supporting SOC analysts in their decision-making to block, allow, or monitor network traffic. This study thus enhances the cybersecurity and threat hunting literature by presenting a novel threat hunting framework for security decision-making, as well as promoting cumulative research efforts to develop more effective frameworks to battle continuously evolving cyber threats.

Keywords: Threat hunting, Splunk, Security Operation Center, LLM, Agentic AI, Deep Reinforcement Learning, Autoencoder

Email addresses: rsaha@uis.edu (Rishikesh Sahay), bpunn@uis.edu (Bell Eapen), w.meng3@lancaster.ac.uk (Weizhi Meng), mmamu@uis.edu (Md Rasel Al Mamun), 2481140@kiit.ac.in (Nikhil Kumar Dora), msuma5@uis.edu (Manjusha Sumasadan), sumitkumar.fca@kiit.ac.in (Sumit Kumar Tetarave), rodsoto@cisco.com (Rod Soto), elyson.delacruz@ucumberlands.edu (Elyson De La Cruz)

1. Introduction

The frequently evolving threat landscape in cyberspace emphasizes the need for proactive and intelligent cyber threat hunting [1]. According to Kaspersky (2024), advanced persistent threats (APTs) increased by 74% in 2024 compared to 2023 [2]. Advanced Persistent Threats are cyber threats that use sophisticated tools and resources to exploit vulnerabilities in organizations. According to the Fortinet threat report 2025, attempts to exploit newly found vulnerabilities have increased, and cybercriminals are using artificial intelligence (AI) for phishing, impersonation, and evasion tactics, resulting in a 16.7% year-over-year increase in reconnaissance activity [3].
Traditional security approaches have become inadequate with the rise of these advanced persistent threats, because traditional endpoint detection and response tools rely on known attack signatures or clear anomalous patterns [4, 5]. Contemporary anomaly-based detection solutions are reactive and fail to address continuously evolving threat landscapes, thereby requiring proactive threat hunting approaches [6–8]. Thus, this study offers a novel, proactive threat hunting framework with improved effectiveness to assist security analysts in their decision-making to block, allow, or monitor network traffic.

Mitigation of advanced persistent threats requires continuous monitoring of security logs. Thus, security analysts in the security operations center (SOC) continuously analyze a large volume of traffic logs to effectively pinpoint potential threats [9]. However, the current threat hunting literature identifies that it is challenging for security analysts to analyze the huge volume and complex data types of logs [10]. Although Security Information and Event Management (SIEM) tools such as Splunk offer centralized log aggregation, correlation, and real-time monitoring features, they rely on predefined rules and existing signatures, which limits their effectiveness against new or context-driven threats [5, 11]. In addition, it is imperative for SOC analysts to perform risk-based prioritization for mitigation. Another critical issue in the current threat hunting process is the shortage of security analysts and the lack of automation in repetitive security workflows, which places a burden on SOC analysts [10]. Thus, there is a need for proactive cybersecurity solutions with automated security workflows in threat hunting that can monitor evolving threats, adapt to changing network conditions, and perform risk-based prioritization for the mitigation of suspicious and malicious traffic.
Thus, by integrating Agentic AI with Splunk, an established SIEM platform, this research proposes an automated threat hunting framework capable of identifying and mitigating evolving threats with high accuracy, and validates the framework. Recent advancements in Agentic AI systems built on Large Language Models (LLMs) present a significant opportunity to improve threat hunting operations [12]. The Agentic AI system leverages advancements in reinforcement learning, goal-oriented architectures, and adaptive control mechanisms [13]. The agentic system comprises collaborative agents with specialized roles such as planning, analysis, and execution, facilitated by an LLM along with tool use. In the Agentic AI architecture, the LLM is the main decision-making controller, also referred to as the brain of the system [13]. LLMs have shown considerable capabilities to identify complex patterns, analyze unstructured data, and provide contextual insights that are highly relevant to security operations [1, 14]. Agentic AI is goal oriented, with adaptable features that enable it to complete multi-layered tasks without instructions each time [13].

By integrating the Agentic AI system with established SIEM platforms such as Splunk within the threat hunting workflow, we can automate log analysis, detect subtle indicators of compromise (IoCs), and reduce the burden on SOC analysts [15, 16]. Recent works [17–19] highlight AI agents assisting the SOC analyst by correlating logs across diverse data sources, preserving investigative context through memory, and continuously updating hypotheses over time. Such agents can automate contextual analysis, generate detection rules, assist in formulating sophisticated queries for security information and event management systems, and even support the development of incident response playbooks [20].
This integration of Agentic AI with the SIEM platform can facilitate real-time monitoring and alerting, allowing faster response times and decision-making for security incidents [21]. Automating the threat hunting workflow allows SOC analysts to focus on the strategic and innovative aspects of threat hunting [22]. This proactive approach not only strengthens an organization's security posture, but also optimizes the utilization of limited security expertise. However, in the SOC environment, human oversight is very important for safe autonomy and crucial decision-making. In a fast-changing environment with incomplete information, agents may struggle to generalize, so a human-in-the-loop is necessary to validate inferred threats and ambiguous findings. Therefore, the challenge is to design an agentic threat hunting framework with a coordinated workflow that preserves SOC-analyst-driven flexibility along with explainable decision-making, providing broader automation without exceeding acceptable risk.

After a comprehensive analysis of the available threat hunting processes, we have identified several requirements to enhance them, and even a new design: (1) minimize false alarms; (2) automation of key security workflows; (3) prioritization of traffic flows for further investigation by SOC analysts and the LLM; (4) LLM-assisted contextual analysis of traffic flows and generation of queries to filter logs on SIEM tools such as Splunk; (5) automated development of incident response playbooks; (6) SOC analyst involvement in final decision-making. To meet all the above-mentioned requirements, we propose a framework for automated and dynamic threat hunting that leverages the capabilities of Agentic AI to address the changing threat landscape.
The framework is intended to systematically and seamlessly integrate different threat hunting modules, ranging from log ingestion to anomaly assessment using a reconstruction-based autoencoder, deep reinforcement learning with two layers, and LLM triage. In particular, after log collection, the autoencoder-based anomaly detection is trained on a portion of the initial benign traffic and assigns a confidence score to all traffic instances based on the learned normal network features. In the framework, the Deep Reinforcement Learning (DRL) module is trained on traffic from fixed-length time windows for decision-making. After the DRL decision, traffic flows are prioritized, considering the DRL decision and the autoencoder anomaly score, for LLM analysis using ChatGPT. Only flows with a high priority score are forwarded to the LLM for contextual analysis, to avoid unnecessary computational overload and hallucination. Based on contextual insights from the LLM, further validation is also performed in Splunk to identify malicious and suspicious activities related to flows with a high priority score. This workflow significantly improves alert triage, reduces the burden on SOC analysts, and helps them in informed decision-making.

Moreover, it is important to differentiate Agentic AI from the deep reinforcement learning (DRL) agent in our framework. Agentic AI refers to a broader architectural paradigm in which an LLM orchestrates a collection of specialized, tool-using agents that collaborate to achieve high-level goals. Within this architecture, the DRL agents provide outputs that the LLM-based agents incorporate into planning and decision-making.

The rest of the paper is organized as follows. The related literature on agentic AI and LLM applications in cybersecurity and the SIEM tool for threat hunting is described in Section 2.
A set of key observations and motivations are presented in Section 3. Section 4 presents the Agentic AI driven threat hunting framework and its different components. Section 5 describes the components and their functionality in detail. The use case and threat model are described in Section 6. Section 7 describes the experimental results. Section 8 presents a discussion of the framework and its limitations, and finally Section 9 concludes the article with future work.

2. Related Work

Agentic AI is reshaping threat hunting with an adaptive and dynamic approach that goes beyond traditional alert-driven detection systems in the network. Specifically, in the network, these agents monitor traffic, identify malicious activities, and trigger containment or mitigation to reduce the burden on security analysts. Recently, a few works have explored the application of agentic AI in different cybersecurity domains, including autonomous incident response, cyber threat intelligence, autonomous monitoring, and adversarial cyber defense [17, 23]. We review three groups of literature: (1) Agentic AI for threat hunting and adaptive defense, (2) Agentic AI based network monitoring and anomaly detection, and (3) LLM and Generative AI for Security Operations Center (SOC) tasks.

• Agentic AI for threat hunting and adaptive defense: In [19], the authors present autonomous threat hunting using machine learning and Deep Reinforcement Learning methods for proactive threat detection. The framework performs traffic analysis using ML models such as RNNs and CNNs, then leverages Deep Reinforcement Learning for optimal threat hunting. The paper highlights that automation can reduce the burden on the SOC analyst, although the main aim of this work is to detect and respond based on DRL learning. In [24], the authors use agentic AI for self-healing cyber systems with autonomous detection, mitigation, and adaptation to evolving cyber threats.
However, the main focus of that work is to improve threat detection using agentic AI, and it does not provide SIEM integration to support SOC analysts in decision-making. In [25], the authors proposed a conceptual framework for the integration of AI with human analysts in SOC environments. The framework focuses on organizational and workflow aspects rather than anomaly assessment, threat prioritization, and automated investigation. In [18, 26], the role of agentic AI in cybersecurity is presented along with the challenges of using these AI agents for cybersecurity as threat environments evolve. As a result, they emphasize that human analysts must be in the loop for validating threats and analyzing complex findings. Unlike these works, our framework provides autoencoder-based anomaly assessment, DRL-based traffic triage and prioritization, and LLM-assisted analysis, while also integrating with SIEM tools such as Splunk for the deep investigation of malicious flows.

• Agentic AI based network monitoring and anomaly detection: NetMonAI [27] proposed a scalable, distributed network monitoring framework combining packet-level and flow-level analysis. Each node in the architecture has an agent which captures traffic, finds anomalies, and reasons using LLMs. These agents work in an automated way and coordinate with a centralized controller which collects reports and provides human-readable summaries. In [28], a time-series-based anomaly detection system is proposed for cloud infrastructure. In that framework, multiple agents collaborate to autonomously generate detection rules using an LLM and improve detection accuracy. NetMonAI and ARGOS leverage agentic AI for scalable network monitoring and anomaly detection, but do not address the complete SOC workflow.
These works focus on anomaly detection, but our framework provides the full cycle of anomaly assessment, initial triage, LLM-assisted contextual analysis, and SIEM-platform-based investigation of malicious flows.

• LLMs and Generative AI for SOC operations: Recently, the integration of LLMs into cybersecurity has gained significant attention due to their ability to automate manual tasks, improve contextual analysis, and support decision-making [29]. In [30], the authors proposed a framework for threat hunting using an LLM and Splunk. The framework leverages the LLM for the initial triage of security logs and then performs further investigation in Splunk. In [31], the authors present the potential of LLMs to automate the analysis and triage of security alerts, highlighting how LLMs can provide contextual insight, reduce false positives, and assist SOC analysts in improving security operations. In [32], the use of different LLMs is explored for threat hunting. The main focus of that work is to determine whether LLMs can generate effective queries for security tools such as Splunk and Elasticsearch to analyze logs. In [33], the authors provide a framework called LLM4Sec to improve anomaly detection by fine-tuning LLMs. The authors demonstrate an extensive evaluation of five LLMs (BERT, RoBERTa, DistilRoBERTa, GPT-2, and GPT-Neo), investigating their effectiveness in log analysis. In [34], an AI system is presented, leveraging LLMs such as GPT-4 to automate the extraction of Indicators of Compromise (IoCs) and to construct relationship graphs from Cyber Threat Intelligence (CTI) reports, which minimizes the manual tasks of SOC analysts. Vinayak [35] studied the application of LLMs in cyber threat hunting. These works mainly investigated the potential application of LLMs in analyzing security logs and finding IoCs for threat detection.
Industry-grade solutions such as Microsoft Security Copilot have enabled natural language interaction and contextual summarization of alerts, which has improved SOC analyst productivity. However, these solutions are assistive tools and do not learn decision policies. In contrast, the proposed framework offers a DRL-based policy layer that operates upstream of alert generation, learning cost-aware containment decisions over aggregated network traffic, identifying suspicious traffic windows, and helping the SOC analyst decide whether to block or allow traffic in the network. Compared to monolithic copilot architectures, our framework uses a hierarchical agentic design in which the DRL governs when and how the LLM is invoked. Moreover, by invoking the LLM only for high-priority events, our framework reduces computational overhead and analyst burden, positioning it as a complementary and orthogonal solution to existing copilot-based platforms.

Agentic AI presents both opportunities and challenges: on the one hand, it can automate threat identification and analysis, but its adoption also carries the risk of manipulation and requires careful human oversight to verify the information. Previous works do not address implementing agentic systems without limiting auditability and analyst control. Moreover, prior works lack a complete cycle from data ingestion into the SIEM tool to the discovery of anomalous flows, prioritization of flows, LLM-based multi-agent alert triage, query generation, and validation of IoCs on the SIEM platform. Our work addresses this gap by integrating Agentic AI based analysis with a SIEM platform such as Splunk [36] to provide automated, context-aware threat hunting, along with operational validation of threats on the SIEM tool.
Our multi-layered framework integrates the strengths of AI-driven analysis, LLM-based contextual insights, and robust SIEM capabilities, providing a more effective and automated threat hunting framework.

3. Key observations and design objectives

We analyzed and compared the agentic AI and LLM based threat detection mechanisms in Section 2, and summarized them in Table 1. Our primary observations are highlighted below:

• Most of the agentic AI based mechanisms in our survey are designed only for threat detection and do not integrate the SIEM platform into the framework.

• One desirable feature is that an Agentic AI based threat hunting framework should perform anomaly assessment and prioritize alerts and traffic flows before forwarding them to the LLM for contextual analysis, to avoid processing overhead. In [25], the authors address both issues to some extent at a high level.

• Initial triage is important for threat hunting before delving into more detail. While some existing mechanisms have this feature, they provide it at the level of the LLM, which causes processing overhead on the LLM due to the huge volume of information.

• Most of the mechanisms provide only a partial decision support system for the SOC environment, without any SOAR suggestion.

• While analyzing traffic logs, it is also important to understand the attack technique used by the attackers. Therefore, mapping to MITRE ATT&CK is important for comprehending this. As we can see in Table 1, few methods address this issue.

• As shown in Table 1, most of the mechanisms that we reviewed do not offer the ability to adapt according to a learned policy, although a few methods use autonomous adaptation to detect zero-day attacks [19, 24].
With these key observations in mind, we are motivated to design an agentic AI based threat hunting framework that can overcome the identified limitations and achieve a number of desired properties, as follows:

• Anomaly assessment and initial triage: Anomaly assessment, traffic prioritization, and initial triage are performed before forwarding an alert to the LLM for analysis. This closes the gap between detection and response with investigation support.

• SIEM integration: The SIEM platform is integrated with the framework for further deep investigation of traffic flows.

• SOAR and SOC decision support: The framework provides SOAR suggestions and supports SOC analysts in decision-making by assisting them in filtering traffic flows on the SIEM platform with queries.

• MITRE mapping: Mapping to the MITRE ATT&CK framework is performed to understand the attack technique used.

• Adaptive learned policy: The framework offers a learned policy layer that adapts according to the requirements of the SOC objectives.

Thanks to the aforementioned properties, the features ranging from anomaly assessment, initial triage, traffic prioritization, and MITRE mapping to the SIEM tool can be systematically integrated together to assist SOC analysts in informed decision-making. Section 4.1 describes the major components of the proposed framework in detail.

4. Agentic AI SOC Framework

This section describes the major components of the proposed threat hunting framework. As shown in Fig. 1, the framework comprises the following major components: SIEM Indexing, Data Cleaning & Processing, Autoencoder Anomaly Detection, Deep Reinforcement Learning (DRL) Network, Prioritization, LLM Multi-agent Triage, and Splunk Validation. The components are described in Section 4.1.

4.1. Major Framework Components

Data Collection and SIEM Indexing: This component is responsible for indexing the logs collected from different devices in the organization and storing them in a central database for analysis. Collecting and indexing logs from different devices in the SIEM tool is an important step in threat hunting. Log collection agents are deployed on the devices; they collect and forward the logs to the SIEM server. It is also important to create a strategy for collecting the volume of logs from different devices, as it can cause processing overhead on the SIEM tool as well as alert fatigue for the SOC analyst.

Table 1: Comparison between existing LLM and Agentic AI based threat hunting techniques

| Threat Hunting Technique | Anomaly Assessment | Traffic Flow Prioritization | Initial Triage | Agentic AI/LLM Integration | MITRE Mapping | SIEM Integration and Validation | SOAR Suggestion | SOC Decision Support | Learned Policy Adaptation |
|---|---|---|---|---|---|---|---|---|---|
| Agentic AI for Autonomous Threat Hunting [19] | Yes | No | No | Yes | No | No | No | Yes | Partial |
| Agentic AI for Adaptive Threat Response [24] | Yes | No | No | Yes | No | No | No | Yes at high level (Partial) | Partial |
| Unified Framework for Human-AI Collaboration [18, 25] | Conceptual | Partial | No | Yes | No | Yes at high level | No | Yes | No |
| NetMonAI Framework [27] | Yes | No | Yes | Yes | No | No | No | No | No |
| ARGOS Agentic Detection [28] | Yes | No | No | Yes | No | No | No | No | No |
| LLM for Threat Intelligence and Automation [34] | No | No | No | LLM integration | No | No | No | Partial | No |
| LLM for Non-Security Experts [32] | No | No | Yes | LLM analysis | No | Yes | No | Partial | No |
| LLM Benchmarking for Log Analysis [33] | No | No | Yes | LLM integration | No | No | No | No | No |
| LLM for Security Alarm Analysis [31] | No | No | Yes | LLM integration | No | No | No | Partial | No |
| Microsoft Security Copilot | No | Partial | Yes | Yes | Partial | Yes | Yes | Yes | No |
| LLM and Splunk for SOC [30] | No | No | Yes | LLM integration | Yes | Yes | Yes | Yes | No |
| Proposed Framework | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |

Figure 1: Agentic AI-based Threat Hunting Framework

Data Cleaning and Processing: This module is responsible for cleaning and pre-processing the indexed data for further investigation. All duplicate instances are removed, because the data on the SIEM server often contain many duplicates. The module extracts features (such as IP addresses, ports, protocols, bytes in, bytes out, etc.) from the indexed data and exports them in CSV format for processing.

Autoencoder-based Anomaly Detection: In the framework, a reconstruction-based deep autoencoder is trained on a portion of legitimate traffic and learns the benign network pattern. The autoencoder is a neural network that tries to reconstruct the normal network pattern and assigns a reconstruction error to data samples. Data instances that are normal are reconstructed with minimal error, whereas anomalies are reconstructed with high error. In the framework, the reconstruction error is assigned to the data instances as the anomaly score (AAD). The AAD score, along with the DRL decision, is used to prioritize the malicious traffic windows to be forwarded to the LLM for contextual analysis. The details are described in Section 4.

Deep Reinforcement Learning Network: The DRL module in the framework comprises three major components: a) the DRL agent, b) the simulation environment, and c) the reward function. The DRL agent interacts with a simulated environment that contains aggregated traffic features over a fixed time window, and takes an action based on its current state. The environment responds with a new state and a reward as feedback on the DRL agent's action. The objective of the DRL agent is to take actions so as to maximize cumulative reward. In the framework, the action space, also termed the DRL decision, comprises: a) Containment (1), b) Allow (0).
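To make the agent-environment loop concrete, the sketch below simulates the interaction described above with a minimal, self-contained environment. The per-window feature layout, the reward values, and the threshold "policy" are illustrative assumptions for this example only, standing in for the paper's trained DRL agent and its actual reward function:

```python
# Hypothetical illustration of the DRL loop: a simulated environment serves
# aggregated per-window traffic features, and the agent chooses Allow (0)
# or Containment (1). All numbers here are invented for the example.

ALLOW, CONTAIN = 0, 1

class TrafficWindowEnv:
    """Serves fixed-length traffic windows as states and scores actions."""

    def __init__(self, windows):
        # Each entry: (feature_vector, is_malicious), in time order.
        self.windows = windows
        self.t = 0

    def reset(self):
        self.t = 0
        return self.windows[self.t][0]

    def step(self, action):
        _, malicious = self.windows[self.t]
        # Reward correct containment of malicious windows and correct
        # allowing of benign ones; penalize the opposite decisions.
        if malicious:
            reward = 1.0 if action == CONTAIN else -1.0
        else:
            reward = 1.0 if action == ALLOW else -0.5
        self.t += 1
        done = self.t >= len(self.windows)
        next_state = None if done else self.windows[self.t][0]
        return next_state, reward, done

# Tiny run with a hand-crafted trace and a trivial threshold rule standing
# in for the learned policy.
windows = [([10, 200], False), ([900, 5], True), ([12, 180], False)]
env = TrafficWindowEnv(windows)
state, done, total = env.reset(), False, 0.0
while not done:
    action = CONTAIN if state[0] > 500 else ALLOW
    state, reward, done = env.step(action)
    total += reward
print(total)  # all three decisions are correct here -> 3.0
```

In the actual framework a trained policy network would replace the threshold rule, but the state/action/reward contract of the loop is the same.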
Initial Triage and Prioritization: Based on the DRL decision and the anomaly score (AAD score) of the autoencoder, the Initial Triage and Prioritization module prioritizes malicious traffic windows for LLM analysis. The DRL action (1 or 0) is multiplied by the AAD score to prioritize the flow. This initial triage and prioritization of malicious traffic windows reduces the processing overhead on the LLM.

LLM Multi-Agent based Contextual Analysis: This module analyzes the received malicious traffic window and provides contextual insights. It is based on a multi-agent system, where one LLM agent acts as an orchestrator and two other LLM agents perform the analysis. In the framework, the agent titled Senior SOC Triage Analyst analyzes the traffic, provides its assessment, and generates SPL (Search Processing Language) queries to filter the traffic on the Splunk dashboard. Another LLM agent, called Threat Intelligence Analyst, performs the mapping of traffic behavior to the MITRE ATT&CK framework. Finally, the orchestrator summarizes the contextual analysis in a human-readable format. Listings 1 and 2 show sample prompts for the SOC triage agent and the threat intelligence agent. This component is implemented using the CrewAI framework [37]. Currently, in our proposed framework, a public LLM API such as ChatGPT is used for LLM triage, but a local LLM is suggested for the analysis of sensitive logs. Data anonymization can also be used to anonymize sensitive information before forwarding it to a public LLM.

Validation in Splunk: The insights provided by the LLM are further validated by the human SOC analyst on the Splunk dashboard by pulling the logs. The SPL queries provided by the Senior SOC Triage Analyst are used by the human SOC analyst to validate the insights in Splunk. Based on the findings in Splunk, the SOC analyst can decide to Block or Allow traffic.
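The triage rule above (DRL action multiplied by AAD score, with only high-priority windows forwarded to the LLM) can be sketched in a few lines. The threshold value and the record fields (`flow_id`, `drl_action`, `aad_score`) are assumptions chosen for illustration:

```python
# Sketch of the initial triage step: the DRL decision (1 = containment,
# 0 = allow) gates the autoencoder anomaly score, and only windows whose
# priority exceeds a threshold are queued for LLM contextual analysis.

def prioritize(windows, threshold=0.5):
    """Return windows queued for LLM analysis, highest priority first."""
    queued = []
    for w in windows:
        # Allowed windows (action 0) get priority 0 and are filtered out,
        # which is what keeps LLM cost and hallucination risk bounded.
        priority = w["drl_action"] * w["aad_score"]
        if priority > threshold:
            queued.append({**w, "priority": priority})
    return sorted(queued, key=lambda w: w["priority"], reverse=True)

windows = [
    {"flow_id": "f1", "drl_action": 1, "aad_score": 0.92},
    {"flow_id": "f2", "drl_action": 0, "aad_score": 0.88},  # allowed -> skipped
    {"flow_id": "f3", "drl_action": 1, "aad_score": 0.31},  # low anomaly -> skipped
    {"flow_id": "f4", "drl_action": 1, "aad_score": 0.64},
]
for w in prioritize(windows):
    print(w["flow_id"], round(w["priority"], 2))
# prints f1 0.92, then f4 0.64
```

Note how the multiplicative gate means a window must be flagged by both the DRL policy and the anomaly detector before it consumes LLM capacity.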
For instance, if the SOC analyst finds the presence of Indicators of Compromise (IoCs), such as a high number of packets in a very short time window in the logs, then they can block the traffic by configuring detection rules. This reduces the workload of the SOC analyst by directly identifying the relevant flows for investigation in Splunk and supplying the corresponding SPL queries, and it ensures that the LLM analysis is appropriately filtered.

4.2. Design Rationale

The key design principle in the proposed framework is strict functional separation to avoid feature leakage and model bias, as shown in Table 2. Specifically, the AAD score is explicitly excluded from DRL training to avoid feature leakage and inflated performance. It is only included after DRL training, for SOC triage and prioritization.

Table 2: Design rationale of the main components

| Component | Purpose | Input Features |
|---|---|---|
| AAD | Continuous anomaly scoring | Raw network features |
| DRL | Decision, initial triage and prioritization | Raw features except the AAD score |
| LLM Agent | Contextual insights and human-like triage and reasoning | Prioritized traffic flow along with the DRL decision and AAD score |

Listing 1: A sample prompt for the SOC triage agent

Role: Senior SOC Triage Analyst
Goal: Assess the threat level associated with a prioritized network flow.
Backstory: Expert at distinguishing benign network noise from suspicious or malicious traffic.

Input:
- Flow ID: {flow_id}
- Source IP: {src_ip}
- Destination IP: {dest_ip}
- Destination Port: {dest_port}
- Priority Score: {priority}
- Anomaly Score: {aad_score}

Task:
Analyze the flow and determine whether the observed communication appears benign, suspicious, or high risk. Briefly explain the reasoning using the provided network context.

Expected Output:
A concise SOC-style summary of the alert risk level.
Listing 2: A sample prompt for the threat intelligence agent

Role: Threat Intelligence Analyst
Goal: Map suspicious activity to MITRE ATT&CK and provide remediation guidance.
Backstory: Expert in associating network behaviors with adversarial techniques and response actions.

Input:
- Flow ID: {flow_id}
- Source IP: {src_ip}
- Destination IP: {dest_ip}
- Destination Port: {dest_port}
- Priority Score: {priority}
- Anomaly Score: {aad_score}

Task:
Identify the most relevant MITRE ATT&CK technique associated with the observed flow. Provide the MITRE technique ID, technique name, and a brief remediation recommendation.

Expected Output:
MITRE ATT&CK technique ID, technique name, and remediation guidance.

5. Methodology

In this section, we model the Security Operations Center (SOC) as a sequential decision-making problem, where a DRL agent is trained on aggregated network traffic features and decides whether to allow or contain the traffic. The system operates over fixed-length, non-overlapping time windows and is formulated as a Markov Decision Process (MDP). The DRL agent works exclusively on aggregated raw traffic features, while anomaly scores and LLM reasoning are introduced only after the containment decision is made. The framework enforces strict separation of responsibilities to avoid the feature leakage commonly seen in hybrid machine learning security systems.

5.1. Autoencoder-Based Anomaly Detection (AAD)

In our framework, the anomaly score (AAD score) is computed using a fully connected, reconstruction-based autoencoder anomaly detector (AAD) with a low-dimensional bottleneck architecture (8-2-8), which is trained on early benign traffic flows to model normal network behavior [38]. Algorithm 1 implements an unsupervised autoencoder trained on legitimate traffic observed during an initial period (we use the first fraction of the timeline, e.g., 25%).
Feature normalization is performed using statistics derived only from early benign traffic flows, simulating a clean baseline of normal network behavior and avoiding mixing the anomaly model with attack information.

Algorithm 1 Autoencoder-Based Anomaly Detection (AAD) with Benign-Only Standardization and Flow Mapping

Require: Flow dataset D with timestamps and identifiers (e.g., flow_id), sorted by time; feature subset F = {src_port, dest_port, bytes_in, bytes_out}; benign label indicator y ∈ {0, 1} (used only for selecting benign training samples); training fraction α (e.g., 0.25); autoencoder bottleneck dimension d (e.g., d = 2).
Ensure: Anomaly score AAD(w_t) for each time window w_t and a mapping M from each w_t to its representative flow metadata.
1: Sort D chronologically by timestamp.
2: Split D into an early training period D_train (first α fraction) and the full period D_all.
3: Filter benign-only samples from the training period: D_benign = {x ∈ D_train | y(x) = 0}.
4: Extract the benign training matrix X_benign ← D_benign[F] and the full matrix X_all ← D_all[F].
5: Benign-only standardization:
6:   Compute μ ← mean(X_benign) and σ ← std(X_benign).
7:   Standardize X_benign_s ← (X_benign − μ)/(σ + ε) and X_all_s ← (X_all − μ)/(σ + ε).
8: Initialize the reconstruction autoencoder f_φ with architecture (|F| → 8 → d → 8 → |F|).
9: Train f_φ on benign data by minimizing the reconstruction loss
     L(φ) = (1/|X_benign_s|) Σ_{x ∈ X_benign_s} ||x − f_φ(x)||².
10: Windowing and mapping: Partition D_all into fixed windows {w_t} (e.g., 5 minutes).
11: for each window w_t do
12:   Construct the window feature vector x_t ← Agg(X_all_s within w_t).
13:   Reconstruct x̂_t ← f_φ(x_t).
14:   Compute the anomaly score AAD(w_t) = (1/|F|) Σ_{i ∈ F} (x_{t,i} − x̂_{t,i})².
15:   Store the mapping M(w_t) to contextual metadata (e.g., src_ip, dest_ip, dest_port, flow_id) using representative values such as the first or mode within w_t.
16: end for
17: return {AAD(w_t)} and the mapping M.

Given a standardized feature vector x_t ∈ R^|F|, the autoencoder reconstructs x̂_t and computes the anomaly score as the mean squared reconstruction error shown in Equation 1. The anomaly score is computed per flow and then aggregated.

AAD(t) = (1/|F|) Σ_i (x_{t,i} − x̂_{t,i})².  (1)

As shown in Algorithm 1, the model first learns a compressed latent representation of normal network features and assigns a higher reconstruction error to malicious flows. In our framework, the AAD score is not used as part of the DRL state representation during policy learning; instead, it is utilized after DRL decisions to prioritize DRL-flagged traffic flows for LLM triage. Although the autoencoder is trained only on numeric features (ports and byte counts), the anomaly score is not detached from flow context. Each score is computed per fixed time window w_t and stored together with a metadata mapping M(w_t) containing representative identifiers (e.g., flow_id, src_ip, dest_ip, and service port). In practice, the autoencoder produces AAD(w_t) from F, while attribution fields are preserved outside the model and re-attached to the scored window for downstream DRL prioritization and LLM triage. This design prevents leakage from high-cardinality categorical fields while maintaining interpretability for SOC analysts.

Table 3: Reconstruction Autoencoder vs. Classification Model

Aspect | Reconstruction Autoencoder | General Classification Model
Labels required | No | Yes
Learns | Normal behavior | Decision boundary
Output | Anomaly score | Discrete class
Used for | Prioritization, triage | Detection

In the framework, the reconstruction-based autoencoder differs from general classification models: it learns normal benign traffic features and provides an anomaly score (AAD).
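The benign-only standardization and reconstruction-error scoring of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: for compactness it substitutes a closed-form linear autoencoder (projection onto the top-d principal directions of the benign data) for the trained nonlinear 8-2-8 network, while keeping the same benign-only statistics and per-row mean squared reconstruction error of Eq. (1).

```python
import numpy as np

def fit_benign_stats(X_benign):
    """Benign-only standardization statistics (Algorithm 1, steps 5-7)."""
    return X_benign.mean(axis=0), X_benign.std(axis=0)

def standardize(X, mu, sigma, eps=1e-8):
    return (X - mu) / (sigma + eps)

def fit_linear_autoencoder(Xs_benign, d=2):
    """Closed-form linear autoencoder: keep the top-d principal directions of
    the (already standardized) benign data. Stand-in for the 8-2-8 network."""
    _, _, Vt = np.linalg.svd(Xs_benign, full_matrices=False)
    return Vt[:d]                       # (d, |F|) shared encode/decode weights

def aad_score(Xs, W):
    """Mean squared reconstruction error per row, as in Eq. (1)."""
    X_hat = (Xs @ W.T) @ W              # encode to d dims, then decode
    return ((Xs - X_hat) ** 2).mean(axis=1)
```

Because the model is fit only on benign rows, flows that deviate from the learned structure reconstruct poorly and receive high AAD scores, which is the behavior the framework relies on for prioritization.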
To learn the traffic behavior, it does not need any labels, and this anomaly score is used for traffic prioritization for triage. In contrast, general classification models, as shown in Table 3, require labels and output a discrete class as the decision; they are generally used for threat detection.

5.2. Time-Window State Construction

Raw network traffic is represented as a sequence of flow records, each comprising source and destination addresses, ports, protocol, inbound bytes, and outbound bytes. Traffic is aggregated into fixed-length windows of duration Δt = 5 minutes. For each time window t, the aggregated state vector s_t is obtained by merging numerical statistics (such as mean and maximum port numbers and byte counts) with encoded categorical attributes such as protocol identifiers and low-cardinality representations of IP addresses. The use of fixed temporal windows is widely considered best practice in modern Security Operations Centers [39].

For example, let F_t = {f_1, f_2, ..., f_{N_t}} denote the set of raw network events observed during time window t, where N_t may be on the order of millions in high-throughput SOC environments. Modeling the state of the system as s_t = F_t is computationally infeasible. Instead, we build a state representation using an aggregation function φ(·), which maps a large set of raw traffic events into a fixed-dimensional feature vector that captures traffic statistics, protocol behavior, and temporal characteristics:

s_t = φ(F_t).  (2)

This aggregation captures both regular and burst-oriented patterns while simultaneously reducing noise and dimensionality. Detailed feature engineering, aggregation equations, and examples are provided in Appendix A.
5.3. Deep Reinforcement Learning for SOC

We employ Deep Reinforcement Learning (DRL) to build our containment agent, often referred to as the DRL agent in this paper, which learns directly from traffic features to identify anomalies. Specifically, we model the containment decision problem as a sequential decision-making process by defining the state and action spaces for the agent. Formally, the containment decision is modeled as a Markov Decision Process (MDP) M = (S, A, P, R, γ), where S denotes the state space, A the action space, P the state transitions across time windows, R a reward function reflecting SOC priorities, and γ ∈ (0, 1] the discount factor.

The state of the DRL agent at time step t, denoted s_t, represents the aggregated network traffic over a fixed time window; Equation 2 defines this aggregated state. The action space is A = {0, 1}: at each time step t, given state s_t, the agent selects a_t = 1 if the traffic F_t is considered malicious and a_t = 0 if F_t is assessed as legitimate. The aim of the agent is to learn an optimal containment policy that maximizes long-term operational utility while balancing detection accuracy and alert fatigue.

The DRL policy network π_θ(a_t | s_t) is implemented as a Multilayer Perceptron (MLP), i.e., a feedforward neural network with two hidden layers of 64 neurons each, as described in Algorithm 2. The network takes the aggregated traffic state s_t ∈ R^d as input, where s_t is a real-valued vector of dimension d, and outputs a probability distribution over two actions: Contain and Allow.
In our framework, we use Rectified Linear Unit (ReLU) activations in the hidden layers, and a softmax function is applied at the output layer to produce action probabilities [40]. Given the state s_t, the forward pass of the policy network is defined as follows:

Hidden Layer 1:  h_1 = ReLU(W_1 s_t + b_1),  (3)
Hidden Layer 2:  h_2 = ReLU(W_2 h_1 + b_2),  (4)
Output Layer (Logits):  z = W_3 h_2 + b_3,  (5)

where z ∈ R² contains the unnormalized action scores (logits) for allow and containment. These logits are the raw outputs of the policy network and indicate how strongly the agent favors each action before softmax normalization. The action probabilities are computed with a softmax, and the confidence of the action contain is given by max_a π(a | s_t):

π(a | s_t) = exp(z_a) / Σ_{a' ∈ {0,1}} exp(z_{a'}).  (6)

The anomaly score assigned to a traffic flow by the Autoencoder-based Anomaly Detection (AAD) is not included in the DRL state representation. This prevents feature leakage and ensures that the DRL agent learns containment behavior exclusively from raw, aggregated network traffic rather than precomputed anomaly scores. The anomaly score is used after the DRL decision, when prioritizing traffic flows for LLM triage. For each time window, a triage priority score is computed as shown in Equation 7:

Triage Priority = DRL_Action × AAD_Score.  (7)

This formulation ensures that only windows flagged by the DRL agent are prioritized, while the anomaly score validates their urgency. Highly anomalous flows receive higher priority, while low-risk anomalies are de-prioritized even if flagged for containment. This mirrors real SOC analyst workflows, where decisions are discrete but prioritization is continuous. Policy optimization is performed using Proximal Policy Optimization (PPO), which stabilizes learning by constraining policy updates within a clipped trust region [41].
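The forward pass of Eqs. (3)–(6) and the triage priority of Eq. (7) can be sketched in NumPy. This is a minimal illustration of the computation only (no PPO training); the parameter shapes follow the stated 2 × 64 architecture, and the `params` dictionary layout is an assumption for the sketch.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy_forward(s, params):
    """Two hidden layers of 64 ReLU units, softmax head, as in Eqs. (3)-(6)."""
    h1 = relu(params["W1"] @ s + params["b1"])
    h2 = relu(params["W2"] @ h1 + params["b2"])
    z = params["W3"] @ h2 + params["b3"]     # logits for {allow, contain}
    return softmax(z)

def triage_priority(drl_action, aad_score):
    """Eq. (7): only DRL-flagged windows (action = 1) receive nonzero priority."""
    return drl_action * aad_score
```

Note how Eq. (7) makes the discrete DRL decision act as a gate while the continuous AAD score grades urgency: an allowed window always has priority 0, so it is never forwarded to the LLM regardless of its anomaly score.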
Temporal consistency is preserved through time-series cross-validation, ensuring that training data strictly precede test data in the timeline. The agent iteratively observes the current network state, selects an action according to its policy, and receives a scalar reward based on the correctness of the decision relative to the ground-truth labels available during training. The reward policies are described in Section 5.3.1.

In summary, PPO determines how the policy is updated in the framework, while the 2 × 64 MLP (two hidden layers of 64 neurons each) defines what the policy looks like. Together, they form a DRL agent that:

• observes aggregated network traffic;
• predicts an action (contain or allow);
• receives a reward based on correctness and cost;
• updates the policy in a stable and constrained manner.

Algorithm 2 Deep Reinforcement Learning for SOC Containment

Require: Aggregated network windows W = {w_1, w_2, ..., w_T} with feature vectors s_t; ground-truth labels y_t (used only for reward computation during training).
Ensure: Containment policy π_θ(a | s).
1: Initialize the PPO policy network π_θ with 2 hidden layers (64 neurons each).
2: Define the action space A = {0: No Action, 1: Containment}.
3: Define the reward function R(a_t, y_t), emphasizing low false positives.
4: for each training episode do
5:   for each time step t do
6:     Observe environment state s_t.
7:     Sample action a_t ∼ π_θ(a_t | s_t).
8:     Execute action a_t.
9:     Receive reward r_t = R(a_t, y_t).
10:    Store (s_t, a_t, r_t).
11:  end for
12:  Update policy parameters θ using the PPO objective.
13: end for
14: return the trained containment policy π_θ.

5.3.1. Reward Shaping

The reward function is designed to reflect SOC operational priorities, with particular emphasis on minimizing false positives, which are costly and disruptive in real-world environments.
Correct containment of malicious activity (true positives) is positively rewarded, while false positives incur significant penalties. True negatives are rewarded to reinforce restraint (no containment) when traffic is benign, and false negatives are penalized to discourage missed detections. Multiple reward profiles are evaluated to study the trade-off between detection sensitivity and false-positive reduction:

r_t = R(a_t, y_t).  (8)

In the proposed DRL formulation, the agent operates in a binary decision space at each time step t. Let a_t ∈ {0, 1} be the action selected by the agent, where a_t = 1 corresponds to a containment decision (e.g., blocking or isolating a network flow) and a_t = 0 represents the allow action, i.e., the traffic is treated as benign. The ground-truth label at time step t is y_t ∈ {0, 1}, where y_t = 1 denotes that the observed network flow is malicious and y_t = 0 that it is benign. The agent's action a_t and the true label y_t result in one of four security outcomes: true positive (TP), false positive (FP), false negative (FN), and true negative (TN). These outcomes are the basis for the reward shaping strategy used to train the deep reinforcement learning agent. By modeling this interaction, the agent learns policies that balance detection effectiveness with operational costs such as alert fatigue and unnecessary containment actions.

To study the impact of reward shaping on decision-making behavior, we define four distinct reward profiles (Modes A–D). Each profile represents a different operational focus typically found in Security Operations Center (SOC) environments.

Mode A is a recall-oriented detection strategy. In this mode, both true positives and true negatives are rewarded equally, while false negatives are penalized more than false positives.
This motivates the agent to prioritize detection coverage and early identification of malicious activity, making Mode A appropriate for initial threat discovery and baseline sensitivity analysis rather than strict false-positive control.

Mode B focuses on false-positive reduction. In this mode, false positives are penalized more heavily than false negatives, while true negatives receive a significant positive reward. This reflects a SOC environment in which excessive containment measures and alert fatigue incur significant costs; it is effective for prioritizing high-confidence alerts and reducing the workload on SOC analysts.

Mode C provides a moderate trade-off between detection and operational cost. False positives and false negatives are penalized at intermediate levels, while correct decisions receive moderate rewards. This policy suits SOC environments where both false positives and false negatives are considered harmful, but neither dominates the operational policy.

Mode D incorporates controlled randomness into the reward function. It preserves a balanced reward scheme similar to Mode C but adds Gaussian noise to the reward signal. This form of stochastic regularization improves policy robustness by reducing overfitting to fixed reward structures and by mimicking the uncertainty found in SOC environments and analyst responses. It is represented by Equation 9:

r_t^(D) = R(a_t, y_t) + ε_t,  ε_t ∼ N(0, σ²).  (9)

5.3.2. DRL Objective

The agent learns a containment policy π_θ(a_t | s_t) that maps the aggregated flow state s_t to an action in {0, 1}. The learning objective is to maximize the expected discounted return shown in Equation 10:

J(θ) = E_{π_θ}[ Σ_{t=0}^{T} γ^t r_t ],  (10)

where γ ∈ (0, 1] is the discount factor and T is the episode horizon.
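The four reward profiles can be sketched as outcome-indexed reward tables. This is an illustrative sketch only: the paper does not publish its exact numeric reward values, so the magnitudes below are assumptions chosen to respect the stated orderings (Mode A penalizes FN more than FP, Mode B the reverse, Mode C balanced, Mode D = Mode C plus Gaussian noise as in Eq. (9)).

```python
import numpy as np

# Assumed (illustrative) reward magnitudes; only their ordering follows the text.
REWARDS = {
    "A": {"TP": 1.0, "TN": 1.0, "FP": -0.5, "FN": -2.0},  # recall-oriented
    "B": {"TP": 1.0, "TN": 1.0, "FP": -2.0, "FN": -0.5},  # false-positive reduction
    "C": {"TP": 0.5, "TN": 0.5, "FP": -1.0, "FN": -1.0},  # balanced trade-off
}

def outcome(action, label):
    """Map (a_t, y_t) to the security outcome TP/FP/FN/TN."""
    return {(1, 1): "TP", (1, 0): "FP", (0, 1): "FN", (0, 0): "TN"}[(action, label)]

def reward(action, label, mode, sigma=0.1, rng=None):
    """R(a_t, y_t) under Modes A-D; Mode D adds N(0, sigma^2) noise (Eq. 9)."""
    base = REWARDS["C" if mode == "D" else mode][outcome(action, label)]
    if mode == "D":
        if rng is None:
            rng = np.random.default_rng()
        base += rng.normal(0.0, sigma)
    return base
```

Keeping the reward profile in a lookup table like this makes it easy to switch operational objectives without touching the policy architecture, which is the adaptation mechanism the evaluation in Section 7 exercises.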
Since the reward R(a_t, y_t) is designed to reflect the operational priorities of the SOC, maximizing J(θ) is equivalent to minimizing the long-term operational burden under the chosen reward profile.

5.3.3. Decision Cost and Regret Analysis

To complement the standard detection metrics, we evaluate the learned containment policy using decision cost and regret, which provide insight into the operational quality of the agent's actions under SOC constraints. The decision cost at time step t is defined as the negative of the obtained reward, shown in Equation 11:

cost_t = −r_t.  (11)

This formulation reflects the operational burden associated with an action. High costs correspond to undesirable outcomes, such as unnecessary containment of benign traffic or missed detection of malicious activity, while low or negative costs indicate decisions aligned with SOC priorities. While decision cost evaluates the absolute penalty of an action, regret measures how far the agent's decision deviates from the best possible decision at that time step. Equation 12 defines the regret:

regret_t = r*_t − r_t,  (12)

where

r*_t = max_{a ∈ {0,1}} R(a, y_t)  (13)

is the maximum achievable reward assuming perfect knowledge of the ground-truth label y_t. By reporting both decision cost and regret, we measure not only whether the agent makes correct decisions, but also how costly its mistakes are in practice. This dual perspective is important in SOC environments, where different errors have different operational consequences.

5.3.4. Putting it Together

Unlike traditional supervised intrusion detection systems, the DRL agent in the framework does not aim to maximize classification accuracy alone. Instead, it learns a policy optimized for long-term operational efficiency under SOC constraints. The DRL agent acts as a decision-making filter that determines whether an alert needs escalation, reducing the analyst workload.
Specifically, the DRL agent is trained without using anomaly detection scores (e.g., the AAD score) as input features to avoid feature leakage; the anomaly scores are computed separately and applied only after the DRL stage to prioritize traffic flows for LLM-based triage. This separation ensures that the DRL policy generalizes beyond specific anomaly models and remains robust to changes in downstream scoring mechanisms.

5.4. LLM-Based Multi-Agent SOC Triage

Algorithm 3 outlines triage using Large Language Models. Traffic flows with a priority score greater than 5, assigned after the DRL decisions, are forwarded to the LLM for analysis. Once a traffic window is prioritized by the DRL agent, individual flows within that window are extracted and forwarded for LLM-based analysis. The LLM operates at flow-level granularity, generating per-flow contextual insights, MITRE ATT&CK mappings, and recommendations for the SOC analysts.

Algorithm 3 LLM-Based Multi-Agent SOC Triage

Require: Flagged traffic flows F from the DRL agent (traffic flows with triage score > 5); anomaly scores AAD(w_t); contextual metadata (IPs, ports, timestamps).
Ensure: SOC triage reports with MITRE mapping and remediation guidance.
1: for each traffic flow f_t ∈ F do
2:   Construct the contextual prompt ⟨s_t, a_t, AAD(w_t), network metadata⟩.
3:   Invoke the SOC Analyst Agent:
4:     Generate a tactical summary and risk assessment.
5:     Generate an SPL query to investigate in Splunk.
6:   Invoke the Threat Intelligence Agent:
7:     Map the activity to MITRE ATT&CK techniques.
8:     Suggest SOAR or mitigation actions.
9:   Store the structured triage report.
10: end for
11: return SOC investigation reports and the master triage table

This design enables scalable decision-making by first filtering high-risk windows and performing detailed analysis only on selected flows.
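The flow-selection and prompt-construction steps of Algorithm 3 (the threshold filter and step 2) can be sketched as follows. The template mirrors the fields of Listing 1; the flow dictionary keys and the helper names are assumptions for this sketch, not the paper's code.

```python
# A minimal sketch of Algorithm 3, steps 1-2: filter flows by triage priority
# and fill the SOC triage prompt template (cf. Listing 1) with per-flow context.
SOC_TRIAGE_TEMPLATE = """Role: Senior SOC Triage Analyst
Goal: Assess the threat level associated with a prioritized network flow.

Input:
- Flow ID: {flow_id}
- Source IP: {src_ip}
- Destination IP: {dest_ip}
- Destination Port: {dest_port}
- Priority Score: {priority}
- Anomaly Score: {aad_score}

Task: Analyze the flow and classify it as benign, suspicious, or high risk."""

PRIORITY_THRESHOLD = 5.0  # flows with triage score > 5 are forwarded to the LLM

def select_for_triage(flows):
    """Keep only DRL-flagged flows whose triage priority exceeds the threshold."""
    return [f for f in flows if f["priority"] > PRIORITY_THRESHOLD]

def build_prompt(flow):
    """Fill the Listing 1 template with one flow's contextual metadata."""
    return SOC_TRIAGE_TEMPLATE.format(**flow)
```

In the framework the resulting prompts would be handed to the CrewAI agents; here they are ordinary strings, which also makes the prompt-assembly step easy to unit-test in isolation.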
Traffic flows are forwarded to the LLM with contextual metadata, including source and destination addresses, ports, time windows, and the corresponding AAD scores. The source and destination IP addresses are anonymized before being forwarded to the LLM for analysis [42]. The LLM-based triage is implemented using the CrewAI multi-agent framework [37]. During the analysis process, two specialized agents are instantiated:

• a Senior SOC Triage Analyst, which validates containment actions, assesses immediate risk, and provides SPL (Splunk Processing Language) queries to pull the traffic flows in Splunk;
• a Threat Intelligence Analyst, which maps the observed behavior to MITRE ATT&CK techniques and recommends mitigation strategies.

LLM agents with specialized SOC analyst roles provide human-readable summaries with contextual insights. This stage transforms low-level traffic flow information into actionable intelligence, reducing the workload of the SOC analyst and accelerating decision making. Moreover, the SPL queries generated by the Senior SOC Triage Analyst agent can be used by analysts to filter the logs on the Splunk dashboard, helping them isolate the anomalous flows from the huge volume of logs.

Figure 2: Use case illustrating the application of the framework

In the framework, the DRL agent operates at the traffic-window level to identify high-risk windows, acting as a policy-level filter that reduces the volume of data forwarded for investigation. Once traffic windows are prioritized, individual flows within each window are extracted and analyzed by the LLM agents at per-flow granularity. Finally, the outputs are automatically compiled into per-flow investigation reports and a master SOC triage summary, enabling direct integration with Security Orchestration, Automation, and Response (SOAR) platforms.
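The IP anonymization step mentioned above can be sketched as deterministic pseudonymization. This is an assumption-level illustration (the paper does not specify its exact scheme): a salted hash yields a stable token for each IP, so the same address maps to the same placeholder across time windows, while a SOC-held table allows reversal before SPL execution.

```python
import hashlib
import ipaddress

SECRET_SALT = b"soc-local-secret"   # illustrative; kept inside the SOC environment
_reverse_map = {}                   # token -> original IP, SOC-controlled table

def pseudonymize_ip(ip: str) -> str:
    """Deterministically replace a real IP with a stable, non-reversible token."""
    ipaddress.ip_address(ip)        # validate the input is a real IP address
    digest = hashlib.sha256(SECRET_SALT + ip.encode()).hexdigest()[:10]
    token = f"IP_{digest}"
    _reverse_map[token] = ip        # recorded so SPL placeholders can be resolved
    return token

def deanonymize(token: str) -> str:
    """Resolve a token back to the original IP before executing SPL on Splunk."""
    return _reverse_map[token]
```

Determinism is the key property here: because the mapping is stable, the LLM can still correlate repeated activity from the same (pseudonymized) host across windows without ever seeing the real address.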
In the framework, a public LLM such as ChatGPT is used for the initial prototype; however, local large language models can also be used. To preserve data confidentiality, sensitive information such as IP addresses is anonymized prior to LLM-based analysis using deterministic pseudonymization [43]. It replaces real identifiers with stable tokens, ensuring consistency across time windows while preventing the exposure of sensitive information to the public LLM. The anonymization is non-destructive and reversible only within the SOC-controlled environment. The LLM agents generate SPL queries using anonymized identifiers, which function as placeholders rather than executable values. Prior to execution on Splunk, these placeholders are resolved by mapping the anonymized tokens back to their original identifiers, which are maintained in a table. As a result, the generated SPL queries remain applicable to the SIEM platform while ensuring that the original information is never exposed to the LLM.

6. Use Case

For a better understanding of the workflow, we provide a use case, shown in Fig. 2, in which the log collection agent is deployed on Windows 11 running inside a virtual machine. As shown in Fig. 2, Kali Linux is used as the attacker machine to launch the attack on Windows 11, the target. The Splunk Universal Forwarder deployed on Windows 11 collects and forwards logs to Splunk Enterprise running on a Windows Server 2019 host. For the experimental evaluation, the logs of the Suricata IDS, which is deployed on Windows 11, are forwarded by the Splunk Universal Forwarder to Splunk Enterprise, where they are indexed for investigation. We plan to extend the use case with multiple log sources. Once the logs are indexed, data cleaning and aggregation are applied before further processing by the autoencoder and DRL.
For the autoencoder, a benign sample is filtered from an early temporal window, ensuring that the learned representation captures normal network behavior without mixing in attack traffic. Moreover, alert labels and non-behavioral features such as hashes are removed. For DRL training, window-based aggregation and cardinality reduction are performed to ensure that the DRL agent operates on non-leaking raw events. After DRL training, the traffic window is prioritized for LLM triage using the DRL decision and the anomaly score (AAD score). Then, the CrewAI-based LLM agents analyze the received traffic window and provide contextual insights along with SPL queries for further validation on Splunk, as described in Section 4. Based on the insights provided by the agentic AI framework, entities such as IP addresses and ports are extracted and analyzed in Splunk Enterprise by the SOC analyst for effective decision making. The analysis performed using Splunk is shown in Section 7. During the evaluation with the public dataset (Boss of the SOC), we directly ingested the dataset into Splunk Enterprise and then exported it into CSV format for evaluation using the autoencoder and the DRL. The detailed results are discussed in Section 7.

6.1. Threat Model for Simulation

• Network Scanning Attack: aims to identify active hosts on a network along with the ports and services running on those hosts [44]. It helps attackers assess the weaknesses of assets and plan attacks to exploit those vulnerabilities [30].
• Volumetric Denial of Service (DoS) Attack: a volumetric DoS attack, such as a UDP flood, targets the victim system and network to deplete resources such as network bandwidth, CPU, and memory by sending a large number of bogus packets [45].

7. Experimentation Results

This section analyzes the experimentation results on the Boss of the SOC dataset [46] and the simulated dataset.
We used only a part of the Boss of the SOC dataset for this evaluation; after cleaning and removing duplicates, we retained around 12,000 instances. The dataset contains source IP, destination IP, source port, destination port, inbound bytes, outbound bytes, protocol, and time. We also simulated the use case described in Section 6 and collected a dataset for evaluation. The simulated dataset contains the same features as the public dataset, with a total of 300,000 instances.

Table 4: Automated LLM Triage based on Adaptive Scoring and Reinforcement Learning Agent

Flow ID | Source IP | Destination IP | Priority Score | MITRE ID | Agent Answer
26 | 172.31.38.181 | 172.31.0.2 | 7.12e+12 | T1071 | Critical: C2 communication identified
35 | 172.16.0.178 | 172.16.3.197 | 9.49e-01 | T1071 | Malicious behavior: suspected exfiltration via standard protocols
34 | 192.168.8.103 | 192.168.9.30 | 7.33e-01 | T1071 | Risk level medium; continuous monitoring advised
30 | 172.16.0.178 | 169.254.169.254 | 5.22e-01 | T1552.005 | Possible attempt to exploit cloud infrastructure
23 | 192.168.8.112 | 192.168.9.30 | 5.06e-01 | T1071 | Moderate level of concern; continuous monitoring advised

7.1. Evaluation on the Boss of the SOC Dataset

Table 4 summarizes the prioritized traffic flows forwarded to the LLM for analysis. Each traffic flow is characterized by its flow identifier, source and destination IP addresses, and the priority score assigned by the DRL agent after training. The corresponding MITRE ATT&CK mapping and final conclusion are provided by the Threat Intelligence Analyst and Senior SOC Triage agents. As shown in Table 4, the flow with Flow ID 26, with a priority score of 7.12 × 10¹², is mapped to technique T1071 (Application Layer Protocol), which is commonly associated with command-and-control (C2) communication channels.
The LLM agent classified this flow as critical, highlighting a high-confidence detection of malicious command-and-control activity that requires immediate containment. This indicates that our framework segregates high-impact threats from background traffic to facilitate a quick response from the SOC analyst.

7.2. SOC Analyst Validation using Splunk Analysis on the Public Dataset

After suspicious traffic was prioritized by the proposed DRL and AAD framework, the LLM-based triage module performed contextual analysis and generated Splunk queries for the SOC analyst to validate the alert context within a real SOC environment. Fig. 3 shows the SPL query applied to filter traffic originating from the specific host (src_ip = 172.31.38.181). This style of SPL query is very useful in network security analysis: as shown in Fig. 3, the transaction operator groups individual network events into logical communication sessions based on a temporal threshold (maxpause = 5 seconds), reconstructing burst-oriented patterns. The analysis suppresses benign, sporadic activity and isolates automated communication patterns by filtering events with eventcount > 10.

Figure 3: Analysis of DNS traffic identified by the proposed RL and AAD triage mechanism

As shown in Fig. 3, there is continuous, high-frequency DNS communication (dest_port = 53) originating from the IP address 172.31.38.181, with 1000 events occurring within a single transaction window lasting between 217 and 236 seconds. This continuous DNS activity is inconsistent with normal user behavior and is commonly associated with command-and-control beaconing or DNS-based tunneling techniques. Furthermore, the presence of multiple destination IP addresses within the same window indicates repeated attempts rather than legitimate service communication. The destination port 9997 shown in Fig. 3 is used by the Splunk Universal Forwarder to send data to Splunk Enterprise.
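The filtering described for Fig. 3 corresponds to an SPL query of roughly the following shape. This is an illustrative sketch, not the exact query in the figure; the index and sourcetype names are assumptions for this deployment.

```spl
index=suricata sourcetype=suricata:dns src_ip=172.31.38.181 dest_port=53
| transaction src_ip maxpause=5s
| where eventcount > 10
| table _time src_ip dest_ip dest_port eventcount duration
```

The `transaction` command stitches events from the same source into sessions whenever the gap between consecutive events is at most 5 seconds, and emits the `eventcount` and `duration` fields used above to isolate the high-frequency DNS bursts.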
Additionally, all DNS responses are found to be syntactically valid (e.g., reply_code = NoError), highlighting that detection is driven by behavioral aggregation rather than protocol violations or signature matching. The Splunk analysis confirms the correctness of the DRL and AAD framework and demonstrates that the LLM-assisted triage provides quick validation of anomalous network patterns and helps the SOC analyst make informed decisions.

7.3. Policy Adaptation Across Reward Modes with the Boss of the SOC Dataset

Fig. 4 presents the performance of the proposed two-layer DRL-based agent across the four reward modes on the Boss of the SOC dataset [46]. As shown in Fig. 4, reward shaping directly impacts the trade-off between precision, recall, and overall detection effectiveness. Mode A, which focuses on recall by penalizing false negatives more than false positives, achieves balanced performance with precision, recall, and F1-scores close to 0.85. This indicates that the agent maintains stable detection capability while allowing a limited number of false positives, making it suitable for early-stage threat detection.

Figure 4: Performance Evaluation Across Modes (A-D)

As shown in Fig. 4, Mode B provides the best overall results, achieving the highest recall (0.873) and F1-score (0.861). This is because its reward function rewards true negatives and penalizes false positives severely, enabling the agent to identify malicious activity with high confidence while limiting containment actions that are not required. Mode C shows a notable drop in recall (0.744) and F1-score (0.783), highlighting the impact of balanced but strict penalties on both false positives and false negatives. Although precision remains comparatively high (0.830), the reduced recall indicates a more conservative policy that reduces alerts at the cost of missing some malicious events.
Finally, Mode D shows stable and consistent performance across all three metrics, with precision, recall, and F1-scores close to 0.82, as shown in Fig. 4. The introduction of controlled stochasticity into the reward function improves the robustness of the policy and prevents overfitting to deterministic reward patterns, producing a reliable containment strategy under uncertain traffic conditions. Overall, these results confirm that the proposed two-layer DRL agent can be adapted to different SOC operational objectives through reward shaping alone, without modifying the underlying model architecture, to assist SOC analysts in deciding whether to contain or allow traffic.

7.4. Decision Cost and Regret Analysis on the Boss of the SOC Dataset

Setup. Let each aggregated time window (state) be indexed by t ∈ {1, ..., T}. The DRL policy π_θ outputs a containment decision a_t ∈ {0, 1}, where a_t = 1 indicates containment (raise an alert / take action) and a_t = 0 denotes allow (do nothing). The ground-truth label is y_t ∈ {0, 1}, where y_t = 1 indicates malicious activity and y_t = 0 legitimate activity. Each decision produces one of four outcomes: true positive (TP): (a_t, y_t) = (1, 1); false positive (FP): (a_t, y_t) = (1, 0); false negative (FN): (a_t, y_t) = (0, 1); true negative (TN): (a_t, y_t) = (0, 0).

Decision cost model. In the SOC environment, it is important to know the cost associated with each decision. We assign an operational cost to each outcome to reflect the use and risk of SOC resources. Let C_TP, C_FP, C_FN, C_TN ∈ R denote the costs (negative values may represent a benefit or cost savings). We define the per-window decision cost as:

c_t = C_TP,  a_t = 1, y_t = 1 (TP),
      C_FP,  a_t = 1, y_t = 0 (FP),
      C_FN,  a_t = 0, y_t = 1 (FN),
      C_TN,  a_t = 0, y_t = 0 (TN).  (14)
(14) The total decision cost and average decision cost ov er a fold are represented by: C total = T X t =1 c t , C = 1 T T X t =1 c t . (15) Regret analysis Regret analysis ev aluates how muc h less optimal the policy’s decision is compared to an oracle that alwa ys selects the action with the minimal cost for that particular activity . F or each time windo w t , the oracle decision cost is represen ted b y: c ⋆ t = min a ∈{ 0 , 1 } c ( y t , a ) , (16) where c ( y , a ) follows Eq. (14). The instantaneous regret is represented b y: r t = c t − c ⋆ t ≥ 0 , (17) and the total and aver age r e gr et are calculated b y: R total = T X t =1 r t , R = 1 T T X t =1 r t . (18) The lo wer regret sho ws that the p olicy is close to the oracle in terms of op erational cost, i.e., it mak es few er mistak es (sp ecifically false p ositiv es and false negatives under the chosen cost profile). 26 T able 5: A verage Decision Cost and Regret Across Time-Series F olds Mo de Mean Decision Cost Std. Dev. Mean Regret Std. Dev. A -0.526 1.049 1.474 1.049 B -0.254 1.126 2.588 3.059 C -0.789 1.233 1.684 1.921 D -0.794 1.027 1.358 1.250 Decision cost and regret ev aluation T able 5 shows the a veraged operational decision cost and regret calculated across differen t rew ard modes. As w e can see in T able 5 Mode C and D achiev e the lo w est a verage decision cost close to − 0 . 79 , indicating improv ed op erational efficiency compared to Mode A and B. Moreo ver, Mo de D achiev es the low est av erage regret of 1 . 358 , highligh ting that sto c hastic reward impro ves the robustness of the p olicy during shifts in temp oral distributions. In contrast, Mo de B shows the highest regret 2 . 588 and v ariabilit y across folds with a standard deviation of 3 . 059 , highlighting the sensitivity to false positive penalties. 
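The cost and regret definitions in Eqs. (14)-(18) can be sketched directly in code. The cost-profile values below are illustrative assumptions for demonstration; the paper's per-mode profiles are defined by its reward functions and are not reproduced here.

```python
# Minimal sketch of the per-window decision cost (Eq. 14) and regret
# (Eqs. 16-18). The cost values below are illustrative assumptions,
# not the profiles used in the paper's experiments.
C = {"TP": -1.0, "FP": 0.5, "FN": 2.0, "TN": -0.2}  # hypothetical costs

def decision_cost(a: int, y: int) -> float:
    """Cost of taking action a (1=contain, 0=allow) when the label is y."""
    outcome = {(1, 1): "TP", (1, 0): "FP", (0, 1): "FN", (0, 0): "TN"}[(a, y)]
    return C[outcome]

def regret(a: int, y: int) -> float:
    """Gap between the chosen action's cost and the oracle's minimal cost."""
    oracle = min(decision_cost(0, y), decision_cost(1, y))
    return decision_cost(a, y) - oracle

# Average cost and regret over a small fold of (action, label) pairs.
decisions = [(1, 1), (1, 0), (0, 0), (0, 1)]  # TP, FP, TN, FN
mean_cost = sum(decision_cost(a, y) for a, y in decisions) / len(decisions)
mean_regret = sum(regret(a, y) for a, y in decisions) / len(decisions)
```

Note that a correct decision always has zero regret regardless of the cost profile, so averaged regret isolates the policy's mistakes, which is why Table 5 reports it alongside raw decision cost.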
Overall, these results show that the balanced reward policies of Modes C and D offer stable containment policies across evolving traffic compared to the strictly recall-oriented (Mode A) or false-positive-intolerant (Mode B) configurations.

7.5. Percentage Reduction in Traffic Flows Forwarded to LLM

The percentage reduction in traffic flows indicates how each DRL reward mode filters traffic before forwarding it to the LLM for analysis. A higher percentage reduction signifies that the mode suppresses more traffic and reduces LLM processing overhead; a lower percentage reduction indicates that the mode forwards more traffic, preserving broader coverage for LLM analysis. As shown in Fig. 5, Mode A offers the greatest reduction in traffic flows forwarded to the LLM for analysis. This is because it penalizes false negatives more heavily than false positives, which makes the agent sensitive to malicious activity while still allowing some filtering of traffic. Mode B strongly penalizes false positives and strongly rewards true negatives. This encourages the DRL agent to suppress benign traffic, which supports traffic reduction and avoids unnecessary downstream analysis. Mode C has a more balanced reward structure between malicious detection and benign traffic suppression. As it filters less than the more selective modes, it tends to keep more traffic for downstream LLM analysis, leading to a lower percentage reduction. Mode D is less conservative and has a partially exploratory reward design, so it is less rigid in its traffic suppression, producing a moderate level of alert reduction while still allowing broader investigation of suspicious traffic flows.

Figure 5: Average Reduction in Traffic Flows Forwarded to LLM Across Modes (A-D)

7.6. Policy Adaptation Across Reward Modes with Simulated Dataset

We also simulated the threat model described in Section 6.1. Fig. 6 shows the performance of the DRL agent with a two-layer architecture (2 × 64) on the simulated Suricata traffic across the four reward modes. Fig. 6 clearly demonstrates that the reward function shapes the agent's containment behavior under controlled traffic conditions. Mode A achieves a recall of 0.998, resulting in almost perfect detection of malicious activity. However, its precision is 0.636, indicating that the agent aggressively flags suspicious traffic, which is desirable in early detection scenarios but causes alert fatigue for SOC analysts due to the increased number of false positives. Mode B achieves high precision (0.976) while significantly reducing recall to 0.495. This validates that the heavy penalization of false positives in Mode B yields a conservative containment policy that prioritizes high-confidence alerts but allows a substantial portion of malicious flows to pass; in exchange, it reduces the alert fatigue caused by a high number of false positives. Mode C achieves a more balanced performance, with a recall of 0.998 and a precision of 0.645, resulting in an F1-score of 0.784. This indicates that Mode C captures malicious activity while maintaining some control over false positives, making it appropriate for environments that require both high detection and operational stability. Finally, Mode D shows consistent and well-balanced behavior, with precision, recall, and F1-scores of 0.926, 0.696, and 0.795, respectively. This reflects that the stochasticity in the reward function enhances robustness by preventing the agent from overfitting to the deterministic traffic patterns commonly present in simulated datasets.

Figure 6: Performance Evaluation Across Modes (A-D) on Simulated Dataset
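The reported F1-scores follow directly from the stated precision and recall values, since F1 is their harmonic mean; a quick check:

```python
# Consistency check of the reported F1-scores on the simulated dataset:
# F1 is the harmonic mean of precision and recall.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

mode_c = f1(0.645, 0.998)  # reported F1: 0.784
mode_d = f1(0.926, 0.696)  # reported F1: 0.795
```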
Overall, these results indicate that the proposed DRL agent adapts its containment strategy in response to different reward modes, validating the effectiveness of reward shaping for modeling different SOC objectives.

7.7. Percentage Reduction in Simulated Traffic Forwarded to LLM

In the framework, the alert capture rate is equivalent to recall, as both evaluate the proportion of true alert instances that are successfully forwarded to the LLM for processing. As shown in Fig. 7, across all reward modes the DRL triage mechanism reduces the traffic forwarded to the LLM by approximately 63%-65%. However, this reduction is associated with different recall outcomes, as shown in Fig. 6. Modes A and C forward almost all true traffic alerts while still achieving strong traffic reduction, indicating an effective balance between efficiency and detection performance. Mode B obtains a similar reduction in traffic forwarded for LLM processing but with lower recall, highlighting that its policy filters too aggressively. Mode D performs moderately, but remains less effective than Modes A and C in maintaining traffic flow coverage. This design ensures that the LLM is not overloaded with a huge traffic volume for analysis.

Figure 7: Average Reduction in Traffic Forwarded to LLM for Analysis

7.8. SOC Analyst Validation via Splunk Analysis on Simulated Dataset

In the framework, after suspicious traffic is prioritized using the DRL network and the AAD score, an LLM-based multi-agent system provides contextual insights and generates SPL queries to validate the traffic on Splunk. As mentioned in Section 5, the Senior SOC Triage Analyst agent provides contextual insights and generates SPL queries for the analysis. Fig. 8 shows the SPL query applied to filter malicious traffic on Splunk. The SPL query filters the traffic originating from the flagged host machine with the IP address 10.0.2.4, allowing the SOC analyst to analyze the communication pattern. For clear presentation, we show the IP addresses in the paper, but LLM triage is performed on anonymized data. This design ensures that the LLM reasons over traffic behavior and context rather than sensitive identifiers, enabling privacy-preserving SOC triage without degrading analytical capability. As shown in Fig. 8, repetitive communication from the same source IP address (10.0.2.4) towards the destination IP address (10.0.2.15) on different destination ports within a very short time window indicates network scanning activity, which maps to MITRE ATT&CK technique T1046. The filtered traffic on Splunk shows no protocol violations or malformed packets; rather, the anomaly is identified through temporal aggregation and behavioral patterns, signifying the advantage of DRL-based containment decisions over signature-driven detection. It is important to note that during LLM-assisted triage the majority of prioritized flows originate from the same host; therefore, a detailed table listing the LLM agent determinations along with MITRE IDs is omitted from the paper for brevity. This is because the simulation used a Kali Linux machine with a single IP address. From a threat hunting perspective, validating the DRL-prioritized traffic through independent SIEM-level behavioral patterns demonstrates that the proposed framework not only achieves high detection performance, but also produces explainable insights and assists SOC analysts in deciding whether to allow the traffic in the network, block it, or monitor it for a while. The whole process aligns with real-world SOC workflows and ensures that SOC analysts verify DRL decisions and LLM insights on the SIEM tool, such as Splunk, rather than relying on the LLM alone.

Figure 8: Analysis of flows from malicious host

8.
Discussion

Our framework provides a multi-layer threat detection architecture that combines Deep Reinforcement Learning (DRL), Autoencoder-based Anomaly Detection (AAD), and LLM-based agents for triage and automated SOC decisions, along with verification of the results using a SIEM tool such as Splunk. In the framework, the DRL agent is first trained on aggregated traffic features to learn containment decisions; the AAD score is then used alongside the DRL decision to prioritize flows for LLM triage. This reduces the processing overhead on the LLM by forwarding only the prioritized flows for analysis, which also helps prevent hallucination. The LLM agents also generate SPL queries that are used for filtering malicious flows on the Splunk dashboard, which is helpful for junior SOC analysts who are not well versed in writing SPL queries. Moreover, in the agentic AI framework, another LLM agent generates an incident response playbook and provides a mapping to MITRE ATT&CK IDs. In contrast to manual investigation on a SIEM tool such as Splunk, our agentic AI framework automates detection, prioritization, contextual analysis, and report generation for detailed investigation.

The proposed framework is aligned with real-world SOC workflows. The DRL agent learns sequential decision policies that directly model containment actions under uncertainty rather than treating detection as a static classification problem. This formulation reflects operational SOC decision-making, where actions carry costs and delayed consequences. The use of multiple reward profiles further demonstrates the flexibility of the framework in adapting to different organizational risk tolerances, such as prioritizing low false-positive rates in high-volume environments. The current experimental evaluation focuses on binary containment decisions: containment and no action (allow).
However, SOC environments often involve actions at multiple stages, such as monitor, throttle, escalate, and isolate. Extending the action space would cover more practical scenarios, but would also increase the complexity of the reward function. One may argue that the framework's use of a public LLM for triage requires providing sensitive security data to an external provider. However, a local LLM can be used to avoid forwarding sensitive logs to a public LLM for triage. Another solution is to apply data anonymization techniques to anonymize sensitive information, such as IP addresses, before sending it to the public LLM for analysis. Another important concern for the current framework is scalability, as it is common for tens of thousands of traffic logs to arrive per second in security operation centers. Our framework addresses this challenge through triage prioritization after the DRL decision, ensuring that agentic analysis is only applied to the high-priority subset of traffic flows. Raw network telemetry is aggregated into fixed temporal time windows, and flows are summarized using statistical descriptors such as the mean and maximum values of ports, in-bytes, out-bytes, and protocol. Aggregating flows reduces the dimensionality of the data, transforming raw traffic into manageable volumes. The DRL agent operates as a policy-level filter, evaluating each aggregated window rather than individual flows. The autoencoder anomaly detection (AAD) score is combined with the DRL decision to calculate the triage priority, which ranks only the traffic windows flagged by the DRL agent for containment. Finally, the LLM-based agents are invoked only for a small number of high-priority traffic flows; this selective strategy makes the framework computationally feasible for SOC environments with high traffic volumes. However, the aggregated time-window features may limit the detection of short-lived attacks.
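The selective triage strategy described above can be sketched as follows. The ranking rule (AAD score over DRL-flagged windows) follows the text; the forwarding budget `k` and the toy data are illustrative assumptions, not parameters specified by the framework.

```python
# Sketch of the selective triage pipeline: the DRL policy acts as a
# coarse filter, the AAD score ranks only the flagged windows, and the
# LLM budget `k` (an illustrative assumption) caps how many windows
# are forwarded for agentic analysis.
def triage(windows, drl_flag, aad_score, k=3):
    """Return up to k window ids flagged by the DRL, ranked by AAD score."""
    flagged = [w for w in windows if drl_flag(w)]
    flagged.sort(key=aad_score, reverse=True)  # most anomalous first
    return flagged[:k]

# Toy example: six aggregated windows with precomputed flags/scores.
flags = {0: True, 1: False, 2: True, 3: True, 4: False, 5: True}
scores = {0: 0.4, 1: 2.1, 2: 1.9, 3: 0.1, 4: 0.9, 5: 1.2}
selected = triage(range(6), flags.get, scores.get, k=2)
```

Note that window 1 has the highest AAD score but is never forwarded because the DRL policy did not flag it; the score only ranks within the flagged subset, as described in the text.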
The experimental evaluation demonstrates that combining Deep Reinforcement Learning and autoencoder anomaly detection with contextual analysis improves the overall efficiency and decision making of SOC analysts. However, the proposed framework currently lacks complete automation from detection to mitigation. The main purpose is to assist the SOC analyst in identifying suspicious traffic for investigation and to support their decision making; the mitigation of malicious traffic is performed by SOC analysts after investigation on the SIEM tool. Moreover, the aim is also to study how agentic AI coupled with a SIEM tool can assist SOC analysts in threat hunting. Rather than pursuing complete automation from monitoring to mitigation at this stage, we consider it preferable to automate key security workflows for improved performance.

9. Conclusion and Future Work

This paper presented a threat hunting framework based on Agentic AI that integrates Deep Reinforcement Learning (DRL), Autoencoder-based Anomaly Detection (AAD), and LLM-driven multi-agent contextual analysis. The framework represents the SOC environment as a sequential decision-making problem, where actions are decided based on learned policies instead of being treated as static classification outputs. To demonstrate the feasibility and effectiveness of our proposed framework, we developed a proof-of-concept prototype and evaluated it on both a public dataset and a simulated dataset. The results show that reward shaping allows the DRL agent to adapt its containment decisions to different SOC objectives, such as maximizing recall, minimizing false positives, and balancing detection. Operational cost and regret analysis are also incorporated into the evaluation, providing a more realistic assessment of the SOC environment.
Our experiments and analysis of the prototype have identified some key benefits of the framework: (1) the modular architecture separates responsibilities across different components; (2) the DRL agent works on aggregated raw traffic features and is trained without the AAD score, preventing feature leakage; (3) the AAD score is applied only after the DRL decision, to prioritize flagged windows for LLM triage and avoid processing overhead on the LLM; (4) the integration of DRL-based decisions with anomaly-aware prioritization and LLM-driven reasoning offers explainable, analyst-aligned threat hunting outcomes; (5) the reduced traffic volume forwarded to the LLM lowers its processing overhead. This multi-layer architecture enables scalability by ensuring that computationally expensive LLM reasoning is applied only where it is most needed. As the SOC environment operates with multi-stage actions such as monitoring, escalating, and isolating traffic, in our future work we will extend the action space to support multi-level containment decisions. Currently, the framework relies on fixed time-window aggregation, which can make short-lived malicious traffic difficult to detect; we will explore adaptive window strategies and hierarchical policies that operate at multiple temporal levels. Moreover, we will investigate the use of domain-specific or locally deployed LLMs to improve reliability, reduce hallucination risk, and address data privacy concerns. Furthermore, we will investigate ethical considerations, including transparency, bias, and accountability in agentic SOC systems, to support responsible deployment of autonomous decision-making systems for cybersecurity operations. Finally, we will perform detailed experimentation with multiple heterogeneous log sources and evaluate the performance.

References

[1] M. A. Ferrag, M.
Ndhlovu, N. Tihanyi, L. C. Cordeiro, M. Debbah, T. Lestable, N. S. Thandi, Revolutionizing cyber threat detection with large language models: A privacy-preserving BERT-based lightweight model for IoT/IIoT devices (2024).

[2] Kaspersky, Advanced persistent threats target one in four companies in 2024 (2024). URL https://www.kaspersky.com/about/press-releases/advanced-persistent-threats-target-one-in-four-companies-in-2024

[3] Fortinet, 2025 Fortinet global threat landscape report (2025). URL https://www.fortinet.com/resources/cyberglossary/recent-cyber-attacks

[4] T. Nguyen, H. Nguyen, A. Ijaz, S. Sheikhi, A. V. Vasilakos, P. Kostakos, Large language models in 6G security: challenges and opportunities (2024).

[5] A. Naseer, H. Naseer, A. Ahmad, S. B. Maynard, A. Masood Siddiqui, Real-time analytics, incident response process agility and enterprise cybersecurity performance: A contingent resource-based analysis, International Journal of Information Management 59 (2021) 102334. doi:10.1016/j.ijinfomgt.2021.102334.

[6] F. Wang, C. Liu, L. Shi, H. Pang, MiniMaxAD: A lightweight autoencoder for feature-rich anomaly detection, Computers in Industry 171 (2025) 104315. doi:10.1016/j.compind.2025.104315.

[7] A. Zeiser, B. Özcan, B. van Stein, T. Bäck, Evaluation of deep unsupervised anomaly detection methods with a data-centric approach for on-line inspection, Computers in Industry 146 (2023) 103852. doi:10.1016/j.compind.2023.103852.

[8] C. Catalano, L. Paiano, F. Calabrese, M. Cataldo, L. Mancarella, F. Tommasi, Anomaly detection in smart agriculture systems, Computers in Industry 143 (2022) 103750.
doi:10.1016/j.compind.2022.103750.

[9] A. Tall, J. Wang, D. Han, Survey of data intensive computing technologies application to security log data management, in: Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, BDCAT '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 268-273. doi:10.1145/3006299.3006336.

[10] P. Badva, K. M. Ramokapane, E. Pantano, A. Rashid, Unveiling the hunter-gatherers: Exploring threat hunting practices and challenges in cyber defense, in: 33rd USENIX Security Symposium (USENIX Security 24), USENIX Association, Philadelphia, PA, 2024, pp. 3313-3330. URL https://www.usenix.org/conference/usenixsecurity24/presentation/badva

[11] R. K. Gupta, S. Shukla, A. T. Rajan, S. Aravind, Utilizing Splunk for proactive issue resolution in full stack development projects (2021).

[12] S. Raza, R. Sapkota, M. Karkee, C. Emmanouilidis, TRiSM for agentic AI: A review of trust, risk, and security management in LLM-based agentic multi-agent systems (2025).

[13] D. B. Acharya, K. Kuppan, B. Divya, Agentic AI: Autonomous intelligence for complex goals—a comprehensive survey, IEEE Access 13 (2025) 18912-18936. doi:10.1109/ACCESS.2025.3532853.

[14] H. Xu, S. Wang, N. Li, K. Wang, Y. Zhao, K. Chen, T. Yu, Y. Liu, H. Wang, Large language models for cyber security: A systematic literature review (2024).

[15] M. A. Ferrag, F. Alwahedi, A. Battah, B. Cherif, A. Mechri, N. Tihanyi, T. Bisztray, M. Debbah, Generative AI in cybersecurity: A comprehensive review of LLM applications and vulnerabilities, Internet of Things and Cyber-Physical Systems (2025). doi:10.1016/j.iotcps.2025.01.001.

[16] A.
Handler, K. R. Larsen, R. Hackathorn, Large language models present new questions for decision support, International Journal of Information Management 79 (2024) 102811. doi:10.1016/j.ijinfomgt.2024.102811.

[17] N. Kshetri, Transforming cybersecurity with agentic AI to combat emerging cyber threats, Telecommunications Policy 49 (6) (2025) 102976. doi:10.1016/j.telpol.2025.102976.

[18] Simbian, AI agents in cybersecurity (2025). URL https://resources.simbian.ai/hubfs/Whitepaper/AI%20Agents%20in%20Cybersecurity%20White%20Paper%20(1).pdf

[19] A. Sheth, A. Patel, C. Upadhyay, H. Ragothaman, B. Patil, S. K. Udayakumar, Agentic AI for autonomous cyber threat hunting and adaptive defense in dynamic security environments, in: 2025 IEEE International Conference on Electro Information Technology (eIT), 2025, pp. 316-321. doi:10.1109/eIT64391.2025.11103697.

[20] Dylan, Utilizing generative AI and LLMs to automate detection writing (2024). URL https://medium.com/@dylanhwilliams/utilizing-generative-ai-and-llms-to-automate-detection-writing-5e4ea074072e

[21] S. Balogh, M. Mlyncek, O. Vranak, P. Zajac, Using generative AI models to support cybersecurity analysts, Electronics 13 (23) (2024). doi:10.3390/electronics13234718.

[22] C. Hillier, T. Karroubi, Turning the hunted into the hunter via threat hunting: Life cycle, ecosystem, challenges and the great promise of AI (2022).

[23] S. J. Lazer, K. Aryal, M. Gupta, E. Bertino, A survey of agentic AI and cybersecurity: Challenges, opportunities and use-case prototypes (2026).

[24] A. Sheth, A. Achanta, P. Matam, A. Patel, P. Sharma, N. V. P. Janapareddy, B. Patil, V.
Gudur, AI driven self-healing cybersecurity systems with agentic AI for adaptive threat response and resilience, in: 2025 IEEE Cloud Summit, 2025, pp. 147-153. doi:10.1109/Cloud-Summit64795.2025.00030.

[25] A. Mohsin, H. Janicke, A. Ibrahim, I. H. Sarker, S. Camtepe, A unified framework for human AI collaboration in security operations centers with trusted autonomy (2025).

[26] N. Kshetri, J. Voas, Agentic artificial intelligence for cyber threat management, Computer 58 (05) (2025) 86-90. doi:10.1109/MC.2025.3544797.

[27] P. Zambare, V. N. Thanikella, N. P. Kottur, S. A. Akula, Y. Liu, NetMoniAI: An agentic AI framework for network security & monitoring (2025).

[28] Y. Gu, Y. Xiong, J. Mace, Y. Jiang, Y. Hu, B. Kasikci, P. Cheng, Argos: Agentic time-series anomaly detection with autonomous rule generation via large language models (2025).

[29] Y. Zhou, Y. Yuan, K. Huang, X. Hu, Can ChatGPT perform a grounded theory approach to do risk analysis? An empirical study, Journal of Management Information Systems 41 (4) (2024) 982-1015. doi:10.1080/07421222.2024.2415772.

[30] R. Sahay, M. Sumasadan, B. Eapen, W. Meng, M. R. A. Mamun, Enhancing threat hunting with Splunk and generative AI for automated security operations (2025). doi:10.21203/rs.3.rs-7515771/v1.

[31] B. Jonkhout, Evaluating large language models for automated cyber security analysis processes (July 2024). URL http://essay.utwente.nl/100846/

[32] A. Konstantinou, D. Kasimatis, W. J. Buchanan, S. U. Jan, J. Ahmad, I. Politis, N. Pitropakis, Leveraging LLMs for non-security experts in threat hunting: Detecting living off the land techniques, Machine Learning and Knowledge Extraction 7 (2) (2025). doi:10.3390/make7020031.

[33] E.
Karlsen, X. Luo, N. Zincir-Heywood, M. Heywood, Benchmarking large language models for log analysis, security, and interpretation (2023).

[34] P. Tseng, Z. Yeh, X. Dai, P. Liu, Using LLMs to automate threat intelligence analysis workflows in security operation centers (2024).

[35] V. Tanksale, Cyber threat hunting using large language models, in: X.-S. Yang, S. Sherratt, N. Dey, A. Joshi (Eds.), Proceedings of Ninth International Congress on Information and Communication Technology, Springer Nature Singapore, Singapore, 2024, pp. 629-641.

[36] C. Kidd, What is Splunk & what does it do? A Splunk intro (2024). URL https://www.splunk.com/en_us/blog/learn/what-splunk-does.html

[37] Z. Duan, J. Wang, Exploration of LLM multi-agent application implementation based on LangGraph+CrewAI (2024).

[38] R. Chalapathy, S. Chawla, Deep learning for anomaly detection: A survey (2019).

[39] Splunk Inc., Correlation searches. URL https://docs.splunk.com/Documentation/ES/latest/Admin/Correlationsearches

[40] M. Farhan, H. Waheed ud din, S. Ullah, M. S. Hussain, M. A. Khan, T. Mazhar, U. F. Khattak, I. H. Jaghdam, Network-based intrusion detection using deep learning technique, Scientific Reports 15 (1) (2025) 25550. doi:10.1038/s41598-025-08770-0.

[41] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms (2017).

[42] S. Srinivas, B. Kirk, J. Zendejas, M. Espino, M. Boskovich, A. Bari, K. Dajani, N. Alzahrani, AI-augmented SOC: A survey of LLMs and agents for security automation, Journal of Cybersecurity and Privacy 5 (4) (2025). doi:10.3390/jcp5040095.

[43] C.-Y. Sun, S.-S. Chen, Y.-H. Ho, De-identification of open-source intelligence using finetuned LLaMA-3, High-Confidence Computing (2025) 100357. doi:10.1016/j.hcc.2025.100357.
[44] N. Hoque, M. H. Bhuyan, R. Baishya, D. Bhattacharyya, J. Kalita, Network attacks: Taxonomy, tools and systems, Journal of Network and Computer Applications 40 (2014) 307-324. doi:10.1016/j.jnca.2013.08.001.

[45] R. Sahay, G. Blanc, Z. Zhang, H. Debar, Towards autonomic DDoS mitigation using software defined networking, 2015. URL https://api.semanticscholar.org/CorpusID:18725272

[46] Splunk, Boss of the SOC v3 dataset released (2020). URL https://www.splunk.com/en_us/blog/security/botsv3-dataset-released.html

Appendix A. Mathematical Details and Numerical Illustration

This appendix provides the detailed mathematical formulation and a worked numerical example supporting the reinforcement learning-based containment framework described in Section 5. These details are included for completeness and reproducibility and are not required for understanding the main results.

Appendix A.1. Time-Windowed State Construction

Let raw network traffic be represented as a sequence of flow records:

$$
F = \{f_1, f_2, \dots, f_N\}. \quad (A.1)
$$

Traffic is aggregated into non-overlapping windows of duration $\Delta t$ (5 minutes). For each window $w_t$, a state vector $s_t$ is constructed using statistical aggregation. For numerical attributes $x \in \{\text{src\_port}, \text{dest\_port}, \text{bytes\_in}, \text{bytes\_out}\}$, we compute:

$$
\mu(x)_t = \frac{1}{|w_t|} \sum_{f_i \in w_t} x_i, \quad (A.2)
$$

$$
\max(x)_t = \max_{f_i \in w_t} x_i. \quad (A.3)
$$

The mean captures typical behavior within the window, while the maximum captures bursty or extreme activity commonly associated with scanning, flooding, or data exfiltration. Categorical features (source IP, destination IP, protocol) are encoded using reduced-cardinality one-hot representations and aggregated as occurrence counts per window.
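The categorical aggregation just described can be sketched as follows. The reduced vocabulary (with an OTHER bucket for rare protocols) is an illustrative assumption; the paper does not specify its exact category set.

```python
from collections import Counter

# Sketch of per-window categorical aggregation: protocols observed in
# a window are counted and emitted as a fixed-length occurrence-count
# vector. The reduced vocabulary with an OTHER bucket is an
# illustrative assumption, not the paper's exact encoding.
VOCAB = ["tcp", "udp", "icmp", "OTHER"]

def protocol_counts(protocols):
    """Map a window's protocol list to fixed-length occurrence counts."""
    counts = Counter(p if p in VOCAB else "OTHER" for p in protocols)
    return [counts.get(v, 0) for v in VOCAB]

window = ["tcp", "tcp", "udp", "gre", "tcp"]  # "gre" falls into OTHER
vec = protocol_counts(window)
```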
The resulting state vector is:

$$
s_t = \big[\mu(\text{src\_port}), \max(\text{src\_port}), \mu(\text{dest\_port}), \max(\text{dest\_port}), \mu(\text{bytes\_in}), \max(\text{bytes\_out}), \text{IP}_{\text{src}}, \text{IP}_{\text{dst}}, \text{Proto}\big]_t. \quad (A.4)
$$

Appendix A.2. Worked Numerical Example

Consider a single 5-minute window with the following aggregated numerical features: $\mu(\text{src\_port}) = 443$, $\max(\text{src\_port}) = 443$, $\mu(\text{dest\_port}) = 51{,}024$, $\max(\text{dest\_port}) = 51{,}083$, $\mu(\text{bytes\_in}) = 1{,}420$, $\max(\text{bytes\_out}) = 98{,}230$. After concatenation with encoded categorical features, the state vector is $s_t \in \mathbb{R}^d$. Assume the policy network produces logits:

$$
z = [-1.31,\; 1.11], \quad (A.5)
$$

corresponding to no containment and containment, respectively. Applying softmax:

$$
\pi(0 \mid s_t) = \frac{e^{-1.31}}{e^{-1.31} + e^{1.11}} \approx 0.08, \qquad \pi(1 \mid s_t) = \frac{e^{1.11}}{e^{-1.31} + e^{1.11}} \approx 0.92.
$$

The agent selects $a_t = 1$ (containment) with confidence 0.92. If the ground-truth label is malicious ($y_t = 1$), the outcome is a true positive and the corresponding reward is assigned according to the active reward profile.

Appendix A.3. Decision Cost and Regret

To complement standard detection metrics, we quantify operational impact using decision cost and regret. The decision cost at time $t$ is defined as the negative of the obtained reward:

$$
\text{cost}_t = -r_t. \quad (A.6)
$$

This formulation directly reflects operational burden: high cost corresponds to undesirable actions such as false positives or missed detections. Regret measures how far the agent's decision deviates from the best possible action under the defined reward function. We define regret as:

$$
\text{regret}_t = r^\star_t - r_t, \quad (A.7)
$$

where

$$
r^\star_t = \max_{a \in \{0,1\}} R(a, y_t) \quad (A.8)
$$

is the oracle reward assuming perfect knowledge of the ground-truth label. A regret value of zero indicates an optimal decision, while larger values indicate increasingly costly deviations from the ideal SOC response.

Appendix A.4.
Appendix A.4. Worked Numeric Example (End-to-End): Aggregation → AAD Score → DRL Action Probability

Goal. This example shows (i) how raw flow logs inside a 5-minute window are converted into fixed-length features (mean/max), (ii) how an Autoencoder-based Anomaly Detection (AAD) score is computed from reconstruction error, and (iii) how a 2-layer DRL policy converts the feature vector into a containment probability (e.g., $0.92$).

Appendix A.4.1. Step 1: Raw flows in a 5-minute window

Assume that within a 5-minute time window $w$ we observe $N = 4$ flow records with numeric attributes src_port, dest_port, bytes_in, bytes_out. Let the four flows be:

Flow $i$   src_port$_i$   dest_port$_i$   bytes_in$_i$   bytes_out$_i$
1          443            52344           1200           300
2          443            52345           1300           350
3          22             52346           8000           9000
4          22             52347           9000           11000

Appendix A.4.2. Step 2: Window aggregation (mean and max)

We compute a fixed-length feature vector $x_w \in \mathbb{R}^6$ using mean and max statistics:
\[
x_w = \big[\, \mathrm{src\_port\_mean},\ \mathrm{src\_port\_max},\ \mathrm{dest\_port\_mean},\ \mathrm{dest\_port\_max},\ \mathrm{bytes\_in\_mean},\ \mathrm{bytes\_out\_max} \,\big].
\]
The aggregation operators are:
\[
\mathrm{mean}(v) = \frac{1}{N}\sum_{i=1}^{N} v_i, \qquad \mathrm{max}(v) = \max_{i \in \{1,\dots,N\}} v_i.
\]
Compute each component:
\[
\begin{aligned}
\mathrm{src\_port\_mean} &= \tfrac{443+443+22+22}{4} = 232.5, & \mathrm{src\_port\_max} &= \max(443, 443, 22, 22) = 443,\\
\mathrm{dest\_port\_mean} &= \tfrac{52344+52345+52346+52347}{4} = 52345.5, & \mathrm{dest\_port\_max} &= \max(52344, 52345, 52346, 52347) = 52347,\\
\mathrm{bytes\_in\_mean} &= \tfrac{1200+1300+8000+9000}{4} = 4875, & \mathrm{bytes\_out\_max} &= \max(300, 350, 9000, 11000) = 11000.
\end{aligned}
\]
Thus, $x_w = [232.5,\ 443,\ 52345.5,\ 52347,\ 4875,\ 11000]$.

Appendix A.4.3. Step 3: Standardization (as used in training)

Before AAD/DRL, numeric features are standardized:
\[
\tilde{x}_w = \frac{x_w - \mu}{\sigma},
\]
where $\mu$ and $\sigma$ are computed from the training data only. For illustration, assume:
\[
\mu = [200,\ 400,\ 52000,\ 52000,\ 2000,\ 5000], \qquad \sigma = [100,\ 100,\ 200,\ 200,\ 2000,\ 3000].
\]
Then:
\[
\tilde{x}_w = \Big[\tfrac{232.5-200}{100},\ \tfrac{443-400}{100},\ \tfrac{52345.5-52000}{200},\ \tfrac{52347-52000}{200},\ \tfrac{4875-2000}{2000},\ \tfrac{11000-5000}{3000}\Big].
\]
Numerically: $\tilde{x}_w = [0.325,\ 0.43,\ 1.7275,\ 1.735,\ 1.4375,\ 2.0]$.

Appendix A.4.4. Step 4: AAD score via Autoencoder reconstruction error

AAD model. Let the AAD be an autoencoder trained on early benign data, $f_\theta : \mathbb{R}^6 \to \mathbb{R}^6$. We use a hidden-layer encoder, bottleneck dimension $d = 2$, and a decoder:
\[
h = \phi(W_1 \tilde{x}_w + b_1), \quad z = \phi(W_2 h + b_2), \quad \hat{h} = \phi(W_3 z + b_3), \quad \hat{x}_w = W_4 \hat{h} + b_4,
\]
where $\phi(\cdot) = \max(0, \cdot)$ is ReLU and $\hat{x}_w$ is the reconstruction.

Concrete numeric forward pass (Example). For the example, we illustrate with a small hidden size (the real implementation can use larger widths). Assume:
\[
W_1 = \begin{bmatrix}
0.30 & 0.10 & 0.05 & 0.05 & 0.10 & 0.20\\
0.10 & 0.20 & 0.10 & 0.10 & 0.20 & 0.10\\
0.05 & 0.05 & 0.30 & 0.30 & 0.10 & 0.10\\
0.10 & 0.10 & 0.20 & 0.20 & 0.10 & 0.05
\end{bmatrix}, \qquad b_1 = 0.
\]
Compute $h = \phi(W_1 \tilde{x}_w)$:
\[
W_1 \tilde{x}_w = \begin{bmatrix}
0.30(0.325) + 0.10(0.43) + 0.05(1.7275) + 0.05(1.735) + 0.10(1.4375) + 0.20(2.0)\\
0.10(0.325) + 0.20(0.43) + 0.10(1.7275) + 0.10(1.735) + 0.20(1.4375) + 0.10(2.0)\\
0.05(0.325) + 0.05(0.43) + 0.30(1.7275) + 0.30(1.735) + 0.10(1.4375) + 0.10(2.0)\\
0.10(0.325) + 0.10(0.43) + 0.20(1.7275) + 0.20(1.735) + 0.10(1.4375) + 0.05(2.0)
\end{bmatrix}.
\]
Numerically:
\[
W_1 \tilde{x}_w \approx \begin{bmatrix}
0.0975 + 0.0430 + 0.0864 + 0.0868 + 0.1438 + 0.4000\\
0.0325 + 0.0860 + 0.1728 + 0.1735 + 0.2875 + 0.2000\\
0.0163 + 0.0215 + 0.5183 + 0.5205 + 0.1438 + 0.2000\\
0.0325 + 0.0430 + 0.3455 + 0.3470 + 0.1438 + 0.1000
\end{bmatrix}
= \begin{bmatrix} 0.8575\\ 0.9523\\ 1.4204\\ 1.0118 \end{bmatrix}.
\]
After ReLU: $h = [0.8575,\ 0.9523,\ 1.4204,\ 1.0118]^{\top}$. Now, we define the bottleneck weights as:
\[
W_2 = \begin{bmatrix}
0.6 & 0.2 & 0.1 & 0.1\\
0.1 & 0.2 & 0.6 & 0.1
\end{bmatrix}, \qquad b_2 = 0.
\]
Compute the bottleneck:
\[
z = \phi(W_2 h) = \phi\!\left( \begin{bmatrix}
0.6(0.8575) + 0.2(0.9523) + 0.1(1.4204) + 0.1(1.0118)\\
0.1(0.8575) + 0.2(0.9523) + 0.6(1.4204) + 0.1(1.0118)
\end{bmatrix} \right).
\]
Numerically:
\[
z \approx \phi\!\left( \begin{bmatrix}
0.5145 + 0.1905 + 0.1420 + 0.1012\\
0.0858 + 0.1905 + 0.8522 + 0.1012
\end{bmatrix} \right)
= \begin{bmatrix} 0.9482\\ 1.2297 \end{bmatrix}.
\]
For reconstruction, we assume a simple linear decoder (Example): $\hat{x}_w = A z$, with
\[
A = \begin{bmatrix}
0.20 & 0.00\\
0.10 & 0.05\\
0.30 & 0.10\\
0.30 & 0.10\\
0.10 & 0.20\\
0.05 & 0.40
\end{bmatrix}.
\]
Then:
\[
\hat{x}_w = \begin{bmatrix}
0.20(0.9482) + 0.00(1.2297)\\
0.10(0.9482) + 0.05(1.2297)\\
0.30(0.9482) + 0.10(1.2297)\\
0.30(0.9482) + 0.10(1.2297)\\
0.10(0.9482) + 0.20(1.2297)\\
0.05(0.9482) + 0.40(1.2297)
\end{bmatrix}
\approx \begin{bmatrix}
0.1896\\ 0.1563\\ 0.4074\\ 0.4074\\ 0.3408\\ 0.5393
\end{bmatrix}.
\]

AAD score as mean squared reconstruction error. Define AAD as:
\[
\mathrm{AAD}(\tilde{x}_w) = \frac{1}{6}\sum_{j=1}^{6} (\tilde{x}_{w,j} - \hat{x}_{w,j})^2.
\]
Compute the per-dimension errors:
\[
\tilde{x}_w - \hat{x}_w = [0.325 - 0.1896,\ 0.43 - 0.1563,\ 1.7275 - 0.4074,\ 1.735 - 0.4074,\ 1.4375 - 0.3408,\ 2.0 - 0.5393].
\]
Numerically: $\tilde{x}_w - \hat{x}_w \approx [0.1354,\ 0.2737,\ 1.3201,\ 1.3276,\ 1.0967,\ 1.4607]$. Squared errors: $[0.0183,\ 0.0749,\ 1.7427,\ 1.7625,\ 1.2028,\ 2.1336]$. Finally:
\[
\mathrm{AAD}(\tilde{x}_w) \approx \frac{0.0183 + 0.0749 + 1.7427 + 1.7625 + 1.2028 + 2.1336}{6} = \frac{6.9348}{6} \approx 1.1558.
\]
Interpretation: a higher AAD score indicates a larger deviation from the benign reconstruction, i.e., a more anomalous traffic window.

Appendix A.4.5. Step 5: DRL policy converts features into a containment probability

Let the DRL policy network (2 hidden layers; in the implementation, 64 neurons per layer) output two logits for actions $a \in \{0, 1\}$, where $a = 1$ means containment.
We denote:
\[
h^{(1)} = \phi(W^{(1)} \tilde{x}_w + b^{(1)}), \qquad h^{(2)} = \phi(W^{(2)} h^{(1)} + b^{(2)}), \qquad \ell = W^{(3)} h^{(2)} + b^{(3)} = \begin{bmatrix} \ell_0\\ \ell_1 \end{bmatrix},
\]
where $\ell_0$ is the logit for \emph{no containment} and $\ell_1$ is the logit for \emph{containment}.

Concrete numeric logits (Example). Assume the network produces the logits:
\[
\ell_0 = -1.2, \qquad \ell_1 = 1.6.
\]
These are not probabilities; they are raw preference scores.

Softmax to obtain action probabilities. The policy probability is given by:
\[
\pi(a = i \mid \tilde{x}_w) = \frac{\exp(\ell_i)}{\exp(\ell_0) + \exp(\ell_1)}, \qquad i \in \{0, 1\}.
\]
Compute:
\[
\exp(\ell_0) = e^{-1.2} \approx 0.301, \qquad \exp(\ell_1) = e^{1.6} \approx 4.953.
\]
Sum: $Z = \exp(\ell_0) + \exp(\ell_1) \approx 0.301 + 4.953 = 5.254$. Thus:
\[
\pi(a = 1 \mid \tilde{x}_w) = \frac{4.953}{5.254} \approx 0.943, \qquad \pi(a = 0 \mid \tilde{x}_w) = \frac{0.301}{5.254} \approx 0.057.
\]
Interpretation: the DRL policy assigns $\approx 94.3\%$ confidence to containment for this window.

Appendix A.4.6. Step 6: Combining the DRL decision with AAD for SOC triage (post-training)

Because AAD is excluded from DRL training (to prevent feature leakage), it is used after DRL for prioritization before forwarding to the LLM:
\[
\mathrm{Priority}_w = \mathbb{I}[a_w = 1] \cdot \mathrm{AAD}(\tilde{x}_w),
\]
where $\mathbb{I}[\cdot]$ is an indicator (1 if containment is selected, else 0). If the agent selects $a_w = 1$, then $\mathrm{Priority}_w = 1 \cdot 1.1558 = 1.1558$. If $a_w = 0$, then $\mathrm{Priority}_w = 0$.

Interpretation: only traffic windows flagged by DRL are escalated, and AAD provides a score for prioritization before LLM-based contextual analysis.

Appendix A.4.7. Variable Dictionary

• $w$: a 5-minute time window.
• $N$: number of flow records inside window $w$.
• $x_w \in \mathbb{R}^6$: aggregated window feature vector (mean/max).
• $\tilde{x}_w$: standardized features, using training-set mean $\mu$ and standard deviation $\sigma$.
• $f_\theta(\cdot)$: AAD autoencoder reconstruction function.
• $\hat{x}_w$: reconstruction of $\tilde{x}_w$ produced by the autoencoder.
• $\mathrm{AAD}(\tilde{x}_w)$: anomaly score computed as mean squared reconstruction error.
• $a \in \{0, 1\}$: DRL action (0 = no containment, 1 = containment).
• $\ell = [\ell_0, \ell_1]^{\top}$: logits (raw scores) output by the policy network.
• $\pi(a \mid \tilde{x}_w)$: softmax policy giving the probability of each action.
• $\mathrm{Priority}_w$: post-training SOC triage priority (used to rank events for LLM investigation).
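The end-to-end worked example (Steps 1–6) can be reproduced with the short Python sketch below. The autoencoder weights, standardization statistics, and policy logits are the illustrative values assumed in this appendix, not parameters of the trained models:

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    # Dense matrix-vector product for small illustrative weight matrices.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

# Step 1: raw flows in window w: (src_port, dest_port, bytes_in, bytes_out)
flows = [(443, 52344, 1200, 300), (443, 52345, 1300, 350),
         (22, 52346, 8000, 9000), (22, 52347, 9000, 11000)]
cols = list(zip(*flows))

# Step 2: mean/max aggregation into the window feature vector x_w
mean = lambda v: sum(v) / len(v)
x_w = [mean(cols[0]), max(cols[0]), mean(cols[1]), max(cols[1]),
       mean(cols[2]), max(cols[3])]

# Step 3: standardization with the illustrative training-set statistics
mu = [200, 400, 52000, 52000, 2000, 5000]
sigma = [100, 100, 200, 200, 2000, 3000]
x_t = [(x - m) / s for x, m, s in zip(x_w, mu, sigma)]

# Step 4: AAD score via the small illustrative autoencoder (biases zero)
W1 = [[0.30, 0.10, 0.05, 0.05, 0.10, 0.20],
      [0.10, 0.20, 0.10, 0.10, 0.20, 0.10],
      [0.05, 0.05, 0.30, 0.30, 0.10, 0.10],
      [0.10, 0.10, 0.20, 0.20, 0.10, 0.05]]
W2 = [[0.6, 0.2, 0.1, 0.1],
      [0.1, 0.2, 0.6, 0.1]]
A = [[0.20, 0.00], [0.10, 0.05], [0.30, 0.10],
     [0.30, 0.10], [0.10, 0.20], [0.05, 0.40]]

h = relu(matvec(W1, x_t))      # encoder hidden layer
z = relu(matvec(W2, h))        # bottleneck, d = 2
x_hat = matvec(A, z)           # linear decoder reconstruction
aad = sum((a - b) ** 2 for a, b in zip(x_t, x_hat)) / len(x_t)

# Step 5: softmax over the illustrative policy logits
l0, l1 = -1.2, 1.6
p_contain = math.exp(l1) / (math.exp(l0) + math.exp(l1))

# Step 6: containment decision gates the AAD-based triage priority
action = 1 if p_contain >= 0.5 else 0
priority = aad if action == 1 else 0.0
```

Running the sketch reproduces the appendix values to rounding: $\tilde{x}_w = [0.325, 0.43, 1.7275, 1.735, 1.4375, 2.0]$, $\mathrm{AAD} \approx 1.1558$, $\pi(a{=}1) \approx 0.943$, and a non-zero triage priority only because containment is selected.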
