Policy-Guided Threat Hunting: An LLM enabled Framework with Splunk SOC Triage


Authors: Rishikesh Sahay, Bell Eapen, Weizhi Meng

Rishikesh Sahay (a), Bell Eapen (a), Weizhi Meng (b), Md Rasel Al Mamun (a), Nikhil Kumar Dora (c), Manjusha Sumasadan (a), Sumit Kumar Tetarave (c), Rod Soto (d), Elyson De La Cruz (e)

(a) Department of Management Information Systems, University of Illinois, Springfield, USA
(b) School of Computing and Communications, Lancaster University, United Kingdom
(c) School of Computer Applications, Kalinga Institute of Industrial Technology, India
(d) Splunk Research Team, USA
(e) School of Information Technology and Artificial Intelligence, University of the Cumberlands, USA

Abstract

With frequently evolving Advanced Persistent Threats (APTs) in cyberspace, traditional security approaches have become inadequate for threat hunting in organizations. Moreover, Security Operations Center (SOC) analysts are often overwhelmed and struggle to analyze the huge volume of logs received from diverse devices in organizations. To address these challenges, we propose an automated and dynamic threat hunting framework for monitoring evolving threats, adapting to changing network conditions, and performing risk-based prioritization for the mitigation of suspicious and malicious traffic. By integrating Agentic AI with Splunk, an established SIEM platform, we developed a unique threat hunting framework. The framework systematically and seamlessly integrates different threat hunting modules, ranging from traffic ingestion to anomaly assessment using a reconstruction-based autoencoder, deep reinforcement learning (DRL) with two layers for initial triage, and a large language model (LLM) for contextual analysis. We evaluated the framework against a publicly available benchmark dataset, as well as against a simulated dataset.
The experimental results show that the framework can effectively adapt to different SOC objectives autonomously and identify suspicious and malicious traffic. The framework enhances operational effectiveness by supporting SOC analysts in their decision-making to block, allow, or monitor network traffic. This study thus enhances the cybersecurity and threat hunting literature by presenting a novel threat hunting framework for security decision-making, as well as promoting cumulative research efforts to develop more effective frameworks to battle continuously evolving cyber threats.

Keywords: Threat hunting, Splunk, Security Operation Center, LLM, Agentic AI, Deep Reinforcement Learning, Autoencoder

Email addresses: rsaha@uis.edu (Rishikesh Sahay), bpunn@uis.edu (Bell Eapen), w.meng3@lancaster.ac.uk (Weizhi Meng), mmamu@uis.edu (Md Rasel Al Mamun), 2481140@kiit.ac.in (Nikhil Kumar Dora), msuma5@uis.edu (Manjusha Sumasadan), sumitkumar.fca@kiit.ac.in (Sumit Kumar Tetarave), rodsoto@cisco.com (Rod Soto), elyson.delacruz@ucumberlands.edu (Elyson De La Cruz)

1. Introduction

The frequently evolving threat landscape in cyberspace emphasizes the need for proactive and intelligent cyber threat hunting [1]. According to Kaspersky (2024), advanced persistent threats (APTs) increased by 74% in 2024 compared to 2023 [2]. Advanced Persistent Threats are cyber threats that use sophisticated tools and resources to exploit vulnerabilities in organizations. According to the Fortinet threat report 2025, attempts to exploit newly found vulnerabilities have increased, and cybercriminals are using artificial intelligence (AI) for phishing, impersonation, and evasion tactics, resulting in a 16.7% year-over-year increase in reconnaissance activity [3].
Traditional security approaches have become inadequate with the rise of these advanced persistent threats, because traditional endpoint detection and response tools rely on known attack signatures or clear anomalous patterns [4, 5]. Contemporary anomaly-based detection solutions are reactive and fail to address continuously evolving threat landscapes, thereby requiring proactive threat hunting approaches [6–8]. Thus, this study offers a novel, proactive threat hunting framework with improved effectiveness to assist security analysts in their decision-making to block, allow, or monitor network traffic.

Mitigation of advanced persistent threats requires continuous monitoring of security logs. Thus, security analysts in the security operations center (SOC) continuously analyze a large volume of traffic logs to effectively pinpoint potential threats [9]. However, the current threat hunting literature identifies that it is challenging for security analysts to analyze the huge volume and complex data types of logs [10]. Although Security Information and Event Management (SIEM) tools such as Splunk offer centralized log aggregation, correlation, and real-time monitoring features, they rely on predefined rules and existing signatures, which limits their effectiveness against new or context-driven threats [5, 11]. In addition, it is imperative for SOC analysts to perform risk-based prioritization for mitigation. Another critical issue in the current threat hunting process is the shortage of security analysts and the lack of automation in repetitive security workflows, which places a burden on SOC analysts [10]. Thus, there is a need for proactive cybersecurity solutions with automated security workflows in threat hunting that can monitor evolving threats, adapt to changing network conditions, and perform risk-based prioritization for the mitigation of suspicious and malicious traffic.
Thus, by integrating Agentic AI with Splunk, an established SIEM platform, this research proposes an automated threat hunting framework capable of identifying and mitigating evolving threats with high accuracy, and validates the framework. Recent advancements in Agentic AI systems built on Large Language Models (LLMs) present a significant opportunity to improve threat hunting operations [12]. The Agentic AI system leverages advancements in reinforcement learning, goal-oriented architectures, and adaptive control mechanisms [13]. The agentic system comprises collaborative agents with specialized roles such as planning, analysis, and execution, facilitated by an LLM along with tool use. In the Agentic AI architecture, the LLM is the main decision-making controller, also referred to as the brain of the system [13]. LLMs have shown considerable capabilities to identify complex patterns, analyze unstructured data, and provide contextual insights that are highly relevant to security operations [1, 14]. Agentic AI is goal oriented, with adaptable features that enable it to complete multi-layered tasks without instructions each time [13].

By integrating the Agentic AI system with established SIEM platforms such as Splunk within the threat hunting workflow, we can automate log analysis, detect subtle indicators of compromise (IoCs), and reduce the burden on SOC analysts [15, 16]. Recent works [17–19] highlight AI agents assisting the SOC analyst by correlating logs across diverse data sources, preserving investigative context through memory, and continuously updating hypotheses over time. Such agents can automate contextual analysis, generate detection rules, assist in formulating sophisticated queries for security information and event management systems, and even support the development of incident response playbooks [20].
This integration of Agentic AI with the SIEM platform can facilitate real-time monitoring and alerting, allowing faster response times and decision-making for security incidents [21]. Automating the threat hunting workflow allows SOC analysts to focus on the strategic and innovative aspects of threat hunting [22]. This proactive approach not only strengthens an organization's security posture, but also optimizes the utilization of limited security expertise. However, in the SOC environment, human oversight is very important for safe autonomy and crucial decision-making. In a fast-changing environment with incomplete information, agents may struggle to generalize, so a human-in-the-loop is necessary to validate inferred threats and ambiguous findings. Therefore, the challenge is to design an agentic threat hunting framework with a coordinated workflow that preserves SOC-analyst-driven flexibility along with explainable decision-making, providing broader automation without exceeding acceptable risk.

After a comprehensive analysis of the available threat hunting processes, we have identified several requirements to enhance them, and even a new design: (1) minimize false alarms; (2) automation of key security workflows; (3) prioritization of traffic flows for further investigation by SOC analysts and the LLM; (4) LLM-assisted contextual analysis of traffic flows and generation of queries to filter logs on SIEM tools such as Splunk; (5) automated development of incident response playbooks; (6) SOC analyst involvement in final decision-making. To meet all the above-mentioned requirements, we propose a framework for automated and dynamic threat hunting that leverages the capabilities of Agentic AI to address the changing threat landscape.
The framework is intended to systematically and seamlessly integrate different threat hunting modules, ranging from log ingestion to anomaly assessment using a reconstruction-based autoencoder, deep reinforcement learning with two layers, and LLM triage. In particular, after log collection, the autoencoder-based anomaly detection is trained on a portion of the initial benign traffic and assigns a confidence score to all traffic instances based on the learned normal network features. In the framework, the Deep Reinforcement Learning (DRL) module is trained on traffic from fixed-length time windows for decision-making. After the DRL decision, traffic flows are prioritized, considering the DRL decision and the autoencoder anomaly score, for LLM analysis using ChatGPT. Only flows with a high priority score are forwarded to the LLM for contextual analysis, to avoid unnecessary computational overload and hallucination. Based on contextual insights from the LLM, further validation is also performed in Splunk to identify malicious and suspicious activities related to flows with a high priority score. This workflow significantly improves alert triage, reduces the burden on SOC analysts, and helps them in informed decision-making.

Moreover, it is important to differentiate Agentic AI from the deep reinforcement learning (DRL) agent in our framework. Agentic AI refers to a broader architectural paradigm in which an LLM orchestrates a collection of specialized, tool-using agents that collaborate to achieve high-level goals. Within this architecture, the DRL agents provide outputs that the LLM-based agents incorporate into planning and decision-making.

The rest of the paper is organized as follows. The related literature on agentic AI and LLM applications in cybersecurity and the SIEM tool for threat hunting is described in Section 2.
A set of key observations and motivations are presented in Section 3. Section 4 presents the Agentic AI driven threat hunting framework and its different components. Section 5 describes the components and their functionality in detail. The use case and threat model are described in Section 6. Section 7 describes the experimental results. Section 8 presents a discussion of the framework and its limitations, and finally Section 9 concludes the article with future work.

2. Related Work

Agentic AI is reshaping threat hunting with an adaptive and dynamic approach that goes beyond traditional alert-driven detection systems in the network. Specifically, in the network, these agents monitor traffic, identify malicious activities, and trigger containment or mitigation to reduce the burden on security analysts. Recently, a few works have explored the application of agentic AI in different cybersecurity domains, including autonomous incident response, cyber threat intelligence, autonomous monitoring, and adversarial cyber defense [17, 23]. We review three groups of literature: (1) Agentic AI for threat hunting and adaptive defense, (2) Agentic AI based network monitoring and anomaly detection, and (3) LLM and Generative AI for Security Operations Center (SOC) tasks.

• Agentic AI for threat hunting and adaptive defense: In [19], the authors present autonomous threat hunting using machine learning and Deep Reinforcement Learning methods for proactive threat detection. The framework performs traffic analysis using ML models such as RNNs and CNNs, then leverages Deep Reinforcement Learning for optimal threat hunting. The paper highlights that automation can reduce the burden on the SOC analyst, although the main aim of this work is to detect and respond based on DRL learning. In [24], the authors use agentic AI for self-healing cyber systems with autonomous detection, mitigation, and adaptation to evolving cyber threats.
However, the main focus of that work is to improve threat detection using agentic AI, and it does not provide SIEM integration to support SOC analysts in decision-making. In [25], the authors proposed a conceptual framework for the integration of AI with human analysts in SOC environments. The framework focuses on organizational and workflow aspects rather than anomaly assessment, threat prioritization, and automated investigation. In [18, 26], the role of agentic AI in cybersecurity is presented along with the challenges of using these AI agents for cybersecurity as threat environments evolve. As a result, they emphasize that human analysts must be in the loop for validating threats and analyzing complex findings. Unlike these works, our framework provides autoencoder-based anomaly assessment, DRL-based traffic triage and prioritization, and LLM-assisted analysis, while also integrating with SIEM tools such as Splunk for the deep investigation of malicious flows.

• Agentic AI based network monitoring and anomaly detection: NetMonAI [27] proposed a scalable, distributed network monitoring framework combining packet-level and flow-level analysis. Each node in the architecture has an agent which captures traffic, finds anomalies, and reasons using LLMs. These agents work in an automated way and coordinate with a centralized controller which collects reports and provides human-readable summaries. In [28], a time-series-based anomaly detection system is proposed for cloud infrastructure. In that framework, multiple agents collaborate to autonomously generate detection rules using an LLM and improve detection accuracy. NetMonAI and ARGOS leverage agentic AI for scalable network monitoring and anomaly detection, but do not address the complete SOC workflow.
These works focus on anomaly detection, but our framework provides the full cycle of anomaly assessment, initial triage, LLM-assisted contextual analysis, and SIEM-platform-based investigation of malicious flows.

• LLMs and Generative AI for SOC operations: Recently, the integration of LLMs into cybersecurity has gained significant attention due to their ability to automate manual tasks, improve contextual analysis, and support decision-making [29]. In [30], the authors proposed a framework for threat hunting using an LLM and Splunk. The framework leverages the LLM for the initial triage of security logs and then performs further investigation in Splunk. In [31], the authors present the potential of LLMs to automate the analysis and triage of security alerts, highlighting how LLMs can provide contextual insight, reduce false positives, and assist SOC analysts in improving security operations. In [32], the use of different LLMs is explored for threat hunting. The main focus of that work is to determine whether LLMs can generate effective queries for security tools such as Splunk and Elasticsearch to analyze logs. In [33], the authors provide a framework called LLM4Sec to improve anomaly detection by fine-tuning LLMs. The authors demonstrate an extensive evaluation of five LLMs (BERT, RoBERTa, DistilRoBERTa, GPT-2, and GPT-Neo), investigating their effectiveness in log analysis. In [34], an AI system is presented, leveraging LLMs such as GPT-4 to automate the extraction of Indicators of Compromise (IoCs) and to construct relationship graphs from Cyber Threat Intelligence (CTI) reports, which minimizes the manual tasks of SOC analysts. Vinayak [35] studied the application of LLMs in cyber threat hunting. These works mainly investigated the potential application of LLMs in analyzing security logs and finding IoCs for threat detection.
Industry-grade solutions such as Microsoft Security Copilot have enabled natural language interaction and contextual summarization of alerts, which has improved SOC analyst productivity. However, these solutions are assistive tools and do not learn decision policies. In contrast, the proposed framework offers a DRL-based policy layer that operates upstream of alert generation, learning cost-aware containment decisions over aggregated network traffic, identifying suspicious traffic windows, and helping the SOC analyst decide whether to block or allow traffic in the network. Compared to monolithic copilot architectures, our framework uses a hierarchical agentic design in which the DRL governs when and how the LLM is invoked. Moreover, by invoking the LLM only for high-priority events, our framework reduces computational overhead and analyst burden, positioning it as a complementary and orthogonal solution to existing copilot-based platforms.

Agentic AI presents both opportunities and challenges: on the one hand, it can automate threat identification and analysis, but its adoption also carries the risk of manipulation and requires careful human oversight to verify the information. Previous works do not address implementing agentic systems without limiting auditability and analyst control. Moreover, prior works lack a complete cycle from data ingestion into the SIEM tool to the discovery of anomalous flows, prioritization of flows, LLM-based multi-agent alert triage, query generation, and validation of IoCs on the SIEM platform. Our work addresses this gap by integrating Agentic AI based analysis with a SIEM platform such as Splunk [36] to provide automated, context-aware threat hunting, along with operational validation of threats on the SIEM tool.
Our multi-layered framework integrates the strengths of AI-driven analysis, LLM-based contextual insights, and robust SIEM capabilities, providing a more effective and automated threat hunting framework.

3. Key observations and design objectives

We analyzed and compared the agentic AI and LLM based threat detection mechanisms in Section 2, and summarized them in Table 1. Our primary observations are highlighted below:

• Most of the agentic AI based mechanisms in our survey are designed only for threat detection and do not integrate the SIEM platform into the framework.

• One desirable feature is that an Agentic AI based threat hunting framework should perform anomaly assessment and prioritize alerts and traffic flows before forwarding them to the LLM for contextual analysis, to avoid processing overhead. In [25], the authors address both issues to some extent at a high level.

• Initial triage is important for threat hunting before delving into more detail. While some existing mechanisms have this feature, they provide it at the level of the LLM, which causes processing overhead on the LLM due to the huge volume of information.

• Most of the mechanisms provide only a partial decision support system for the SOC environment, without any SOAR suggestion.

• While analyzing traffic logs, it is also important to understand the attack technique used by the attackers. Therefore, mapping to MITRE ATT&CK is important for comprehending this. As we can see in Table 1, few methods address this issue.

• As shown in Table 1, most of the mechanisms that we reviewed do not offer the ability to adapt according to a learned policy, although a few methods use autonomous adaptation to detect zero-day attacks [19, 24].
With these key observations in mind, we are motivated to design an agentic AI based threat hunting framework that can overcome the identified limitations and achieve a number of desired properties, as follows:

• Anomaly assessment and initial triage: Anomaly assessment, traffic prioritization, and initial triage are performed before forwarding an alert to the LLM for analysis. This closes the gap between detection and response with investigation support.

• SIEM integration: The SIEM platform is integrated with the framework for further deep investigation of traffic flows.

• SOAR and SOC decision support: The framework provides SOAR suggestions and supports SOC analysts in decision-making by assisting them in filtering traffic flows on the SIEM platform with queries.

• MITRE mapping: Mapping to the MITRE ATT&CK framework is performed to understand the attack technique used.

• Adaptive learned policy: The framework offers a learned policy layer that adapts according to the requirements of the SOC objectives.

Thanks to the aforementioned properties, the features ranging from anomaly assessment, initial triage, traffic prioritization, and MITRE mapping to the SIEM tool can be systematically integrated together to assist SOC analysts in informed decision-making. Section 4.1 describes the major components of the proposed framework in detail.

4. Agentic AI SOC Framework

This section describes the major components of the proposed threat hunting framework. As shown in Fig. 1, the framework comprises the following major components: SIEM Indexing, Data Cleaning & Processing, Autoencoder Anomaly Detection, Deep Reinforcement Learning (DRL) Network, Prioritization, LLM Multi-agent Triage, and Splunk Validation. The components are described in Section 4.1.

4.1. Major Framework Components

Data Collection and SIEM Indexing: This component is responsible for indexing the logs collected from different devices in the organization and storing them in a central database for analysis. Collecting and indexing logs from different devices in the SIEM tool is an important step in threat hunting. Log collection agents are deployed on the devices; they collect and forward the logs to the SIEM server. It is also important to create a strategy for collecting the volume of logs from different devices, as it can cause processing overhead on the SIEM tool as well as alert fatigue for the SOC analyst.

Table 1: Comparison between existing LLM and Agentic AI based threat hunting techniques

| Threat Hunting Technique | Anomaly Assessment | Traffic Flow Prioritization | Initial Triage | Agentic AI/LLM Integration | MITRE Mapping | SIEM Integration and Validation | SOAR Suggestion | SOC Decision Support | Learned Policy Adaptation |
|---|---|---|---|---|---|---|---|---|---|
| Agentic AI for Autonomous Threat Hunting [19] | Yes | No | No | Yes | No | No | No | Yes | Partial |
| Agentic AI for Adaptive Threat Response [24] | Yes | No | No | Yes | No | No | No | Yes at high level (Partial) | Partial |
| Unified Framework for Human-AI Collaboration [18, 25] | Conceptual | Partial | No | Yes | No | Yes at high level | No | Yes | No |
| NetMonAI Framework [27] | Yes | No | Yes | Yes | No | No | No | No | No |
| ARGOS Agentic Detection [28] | Yes | No | No | Yes | No | No | No | No | No |
| LLM for Threat Intelligence and Automation [34] | No | No | No | LLM integration | No | No | No | Partial | No |
| LLM for Non-Security Experts [32] | No | No | Yes | LLM analysis | No | Yes | No | Partial | No |
| LLM Benchmarking for Log Analysis [33] | No | No | Yes | LLM integration | No | No | No | No | No |
| LLM for Security Alarm Analysis [31] | No | No | Yes | LLM integration | No | No | No | Partial | No |
| Microsoft Security Copilot | No | Partial | Yes | Yes | Partial | Yes | Yes | Yes | No |
| LLM and Splunk for SOC [30] | No | No | Yes | LLM integration | Yes | Yes | Yes | Yes | No |
| Proposed Framework | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |

Figure 1: Agentic AI-based Threat Hunting Framework

Data Cleaning and Processing: This module is responsible for cleaning and pre-processing the indexed data for further investigation. All duplicate instances are removed, because the data on the SIEM server often contain many duplicates. The module extracts features (such as IP addresses, ports, protocols, bytes in, bytes out, etc.) from the indexed data and exports them in CSV format for processing.

Autoencoder-based Anomaly Detection: In the framework, a reconstruction-based deep autoencoder is trained on a portion of legitimate traffic and learns the benign network pattern. The autoencoder is a neural network that tries to reconstruct the normal network pattern and assigns a reconstruction error to data samples. Data instances that are normal are reconstructed with minimal error, whereas anomalies are reconstructed with high error. In the framework, the reconstruction error is assigned to the data instances as the anomaly score (AAD). The AAD score, along with the DRL decision, is used to prioritize the malicious traffic windows to be forwarded to the LLM for contextual analysis. The details are described in Section 4.

Deep Reinforcement Learning Network: The DRL module in the framework comprises three major components: a) the DRL agent, b) the simulation environment, and c) the reward function. The DRL agent interacts with a simulated environment that contains aggregated traffic features over a fixed time window, and takes an action based on its current state. The environment responds with a new state and a reward as feedback on the DRL agent's action. The objective of the DRL agent is to take actions so as to maximize cumulative reward. In the framework, the action space, also termed the DRL decision, comprises: a) Containment (1), b) Allow (0).
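To make the agent-environment loop concrete, the sketch below simulates the interaction described above with a minimal, self-contained environment. The per-window feature layout, the reward values, and the threshold "policy" are illustrative assumptions for this example only, standing in for the paper's trained DRL agent and its actual reward function:

```python
# Hypothetical illustration of the DRL loop: a simulated environment serves
# aggregated per-window traffic features, and the agent chooses Allow (0)
# or Containment (1). All numbers here are invented for the example.

ALLOW, CONTAIN = 0, 1

class TrafficWindowEnv:
    """Serves fixed-length traffic windows as states and scores actions."""

    def __init__(self, windows):
        # Each entry: (feature_vector, is_malicious), in time order.
        self.windows = windows
        self.t = 0

    def reset(self):
        self.t = 0
        return self.windows[self.t][0]

    def step(self, action):
        _, malicious = self.windows[self.t]
        # Reward correct containment of malicious windows and correct
        # allowing of benign ones; penalize the opposite decisions.
        if malicious:
            reward = 1.0 if action == CONTAIN else -1.0
        else:
            reward = 1.0 if action == ALLOW else -0.5
        self.t += 1
        done = self.t >= len(self.windows)
        next_state = None if done else self.windows[self.t][0]
        return next_state, reward, done

# Tiny run with a hand-crafted trace and a trivial threshold rule standing
# in for the learned policy.
windows = [([10, 200], False), ([900, 5], True), ([12, 180], False)]
env = TrafficWindowEnv(windows)
state, done, total = env.reset(), False, 0.0
while not done:
    action = CONTAIN if state[0] > 500 else ALLOW
    state, reward, done = env.step(action)
    total += reward
print(total)  # all three decisions are correct here -> 3.0
```

In the actual framework a trained policy network would replace the threshold rule, but the state/action/reward contract of the loop is the same.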
Initial Triage and Prioritization: Based on the DRL decision and the anomaly score (AAD score) of the autoencoder, the Initial Triage and Prioritization module prioritizes malicious traffic windows for LLM analysis. The DRL action (1 or 0) is multiplied by the AAD score to prioritize the flow. This initial triage and prioritization of malicious traffic windows reduces the processing overhead on the LLM.

LLM Multi-Agent based Contextual Analysis: This module analyzes the received malicious traffic window and provides contextual insights. It is based on a multi-agent system, where one LLM agent acts as an orchestrator and two other LLM agents perform the analysis. In the framework, the agent titled Senior SOC Triage Analyst analyzes the traffic, provides its assessment, and generates SPL (Search Processing Language) queries to filter the traffic on the Splunk dashboard. Another LLM agent, called Threat Intelligence Analyst, performs the mapping of traffic behavior to the MITRE ATT&CK framework. Finally, the orchestrator summarizes the contextual analysis in a human-readable format. Listings 1 and 2 show sample prompts for the SOC triage agent and the threat intelligence agent. This component is implemented using the CrewAI framework [37]. Currently, in our proposed framework, a public LLM API such as ChatGPT is used for LLM triage, but a local LLM is suggested for the analysis of sensitive logs. Data anonymization can also be used to anonymize sensitive information before forwarding it to a public LLM.

Validation in Splunk: The insights provided by the LLM are further validated by the human SOC analyst on the Splunk dashboard by pulling the logs. The SPL queries provided by the Senior SOC Triage Analyst are used by the human SOC analyst to validate the insights in Splunk. Based on the findings in Splunk, the SOC analyst can decide to Block or Allow traffic.
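The triage rule above (DRL action multiplied by AAD score, with only high-priority windows forwarded to the LLM) can be sketched in a few lines. The threshold value and the record fields (`flow_id`, `drl_action`, `aad_score`) are assumptions chosen for illustration:

```python
# Sketch of the initial triage step: the DRL decision (1 = containment,
# 0 = allow) gates the autoencoder anomaly score, and only windows whose
# priority exceeds a threshold are queued for LLM contextual analysis.

def prioritize(windows, threshold=0.5):
    """Return windows queued for LLM analysis, highest priority first."""
    queued = []
    for w in windows:
        # Allowed windows (action 0) get priority 0 and are filtered out,
        # which is what keeps LLM cost and hallucination risk bounded.
        priority = w["drl_action"] * w["aad_score"]
        if priority > threshold:
            queued.append({**w, "priority": priority})
    return sorted(queued, key=lambda w: w["priority"], reverse=True)

windows = [
    {"flow_id": "f1", "drl_action": 1, "aad_score": 0.92},
    {"flow_id": "f2", "drl_action": 0, "aad_score": 0.88},  # allowed -> skipped
    {"flow_id": "f3", "drl_action": 1, "aad_score": 0.31},  # low anomaly -> skipped
    {"flow_id": "f4", "drl_action": 1, "aad_score": 0.64},
]
for w in prioritize(windows):
    print(w["flow_id"], round(w["priority"], 2))
# prints f1 0.92, then f4 0.64
```

Note how the multiplicative gate means a window must be flagged by both the DRL policy and the anomaly detector before it consumes LLM capacity.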
For instance, if the SOC analyst finds the presence of Indicators of Compromise (IoCs), such as a high number of packets in a very short time window in the logs, then they can block the traffic by configuring detection rules. This reduces the workload of the SOC analyst by directly identifying the relevant flows for investigation in Splunk and supplying the corresponding SPL queries, and it ensures that the LLM analysis is appropriately filtered.

4.2. Design Rationale

The key design principle in the proposed framework is strict functional separation to avoid feature leakage and model bias, as shown in Table 2. Specifically, the AAD score is explicitly excluded from DRL training to avoid feature leakage and inflated performance. It is only included after DRL training, for SOC triage and prioritization.

Table 2: Design rationale of the main components

| Component | Purpose | Input Features |
|---|---|---|
| AAD | Continuous anomaly scoring | Raw network features |
| DRL | Decision, initial triage and prioritization | Raw features except the AAD score |
| LLM Agent | Contextual insights and human-like triage and reasoning | Prioritized traffic flow along with the DRL decision and AAD score |

Listing 1: A sample prompt for the SOC triage agent

Role: Senior SOC Triage Analyst
Goal: Assess the threat level associated with a prioritized network flow.
Backstory: Expert at distinguishing benign network noise from suspicious or malicious traffic.

Input:
- Flow ID: {flow_id}
- Source IP: {src_ip}
- Destination IP: {dest_ip}
- Destination Port: {dest_port}
- Priority Score: {priority}
- Anomaly Score: {aad_score}

Task:
Analyze the flow and determine whether the observed communication appears benign, suspicious, or high risk. Briefly explain the reasoning using the provided network context.

Expected Output:
A concise SOC-style summary of the alert risk level.
Listing 2: A sample prompt for the threat intelligence agent

Role: Threat Intelligence Analyst
Goal: Map suspicious activity to MITRE ATT&CK and provide remediation guidance.
Backstory: Expert in associating network behaviors with adversarial techniques and response actions.

Input:
- Flow ID: {flow_id}
- Source IP: {src_ip}
- Destination IP: {dest_ip}
- Destination Port: {dest_port}
- Priority Score: {priority}
- Anomaly Score: {aad_score}

Task:
Identify the most relevant MITRE ATT&CK technique associated with the observed flow. Provide the MITRE technique ID, technique name, and a brief remediation recommendation.

Expected Output:
MITRE ATT&CK technique ID, technique name, and remediation guidance.

5. Methodology

In this section, we model the Security Operations Center (SOC) as a sequential decision-making problem, where a DRL agent is trained on aggregated network traffic features and decides whether to allow or contain the traffic. The system operates over fixed-length, non-overlapping time windows and is formulated as a Markov Decision Process (MDP). The DRL agent works exclusively on aggregated raw traffic features, while anomaly scores and LLM reasoning are introduced only after the containment decision is made. The framework enforces strict separation of responsibilities to avoid the feature leakage commonly seen in hybrid machine learning security systems.

5.1. Autoencoder-Based Anomaly Detection (AAD)

In our framework, the anomaly score (AAD score) is computed using a fully connected, reconstruction-based autoencoder anomaly detector (AAD) with a low-dimensional bottleneck architecture (8-2-8), which is trained on early benign traffic flows to model normal network behavior [38]. Algorithm 1 implements an unsupervised autoencoder trained on legitimate traffic observed during an initial period (we use the first fraction of the timeline, e.g., 25%).
Feature normalization is performed using statistics derived only from early benign traffic flows, simulating a clean baseline of normal network behavior and avoiding mixing the anomaly model with attack information.

Algorithm 1 Autoencoder-Based Anomaly Detection (AAD) with Benign-Only Standardization and Flow Mapping

Require: Flow dataset D with timestamps and identifiers (e.g., flow_id), sorted by time; feature subset F = {src_port, dest_port, bytes_in, bytes_out}; benign label indicator y ∈ {0, 1} (used only for selecting benign training samples); training fraction α (e.g., 0.25); autoencoder bottleneck dimension d (e.g., d = 2).
Ensure: Anomaly score AAD(w_t) for each time window w_t and a mapping M from each w_t to its representative flow metadata.
1: Sort D chronologically by timestamp.
2: Split D into an early training period D_train (first α fraction) and the full period D_all.
3: Filter benign-only samples from the training period: D_benign = {x ∈ D_train | y(x) = 0}.
4: Extract the benign training matrix X_benign ← D_benign[F] and the full matrix X_all ← D_all[F].
5: Benign-only standardization:
6:   Compute μ ← mean(X_benign) and σ ← std(X_benign).
7:   Standardize X_benign_s ← (X_benign − μ)/(σ + ε) and X_all_s ← (X_all − μ)/(σ + ε).
8: Initialize the reconstruction autoencoder f_φ with architecture (|F| → 8 → d → 8 → |F|).
9: Train f_φ on benign data by minimizing the reconstruction loss
     L(φ) = (1/|X_benign_s|) Σ_{x ∈ X_benign_s} ||x − f_φ(x)||².
10: Windowing and mapping: Partition D_all into fixed windows {w_t} (e.g., 5 minutes).
11: for each window w_t do
12:   Construct the window feature vector x_t ← Agg(X_all_s within w_t).
13:   Reconstruct x̂_t ← f_φ(x_t).
14:   Compute the anomaly score AAD(w_t) = (1/|F|) Σ_{i ∈ F} (x_{t,i} − x̂_{t,i})².
15:   Store the mapping M(w_t) to contextual metadata (e.g., src_ip, dest_ip, dest_port, flow_id) using representative values such as the first or mode within w_t.
16: end for
17: return {AAD(w_t)} and the mapping M.

Given a standardized feature vector x_t ∈ R^|F|, the autoencoder reconstructs x̂_t and computes the anomaly score as the mean squared reconstruction error shown in Equation 1. The anomaly score is computed per flow and then aggregated.

AAD(t) = (1/|F|) Σ_i (x_{t,i} − x̂_{t,i})².  (1)

As shown in Algorithm 1, the model first learns a compressed latent representation of normal network features and assigns a higher reconstruction error to malicious flows. In our framework, the AAD score is not used as part of the DRL state representation during policy learning; instead, it is utilized after DRL decisions to prioritize DRL-flagged traffic flows for LLM triage. Although the autoencoder is trained only on numeric features (ports and byte counts), the anomaly score is not detached from flow context. Each score is computed per fixed time window w_t and stored together with a metadata mapping M(w_t) containing representative identifiers (e.g., flow_id, src_ip, dest_ip, and service port). In practice, the autoencoder produces AAD(w_t) from F, while attribution fields are preserved outside the model and re-attached to the scored window for downstream DRL prioritization and LLM triage. This design prevents leakage from high-cardinality categorical fields while maintaining interpretability for SOC analysts.

Table 3: Reconstruction Autoencoder vs. Classification Model

Aspect | Reconstruction Autoencoder | General Classification Model
Labels required | No | Yes
Learns | Normal behavior | Decision boundary
Output | Anomaly score | Discrete class
Used for | Prioritization, triage | Detection

In the framework, the reconstruction-based autoencoder differs from general classification models: it learns normal benign traffic features and provides an anomaly score (AAD).
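The benign-only standardization and reconstruction-error scoring of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: for compactness it substitutes a closed-form linear autoencoder (projection onto the top-d principal directions of the benign data) for the trained nonlinear 8-2-8 network, while keeping the same benign-only statistics and per-row mean squared reconstruction error of Eq. (1).

```python
import numpy as np

def fit_benign_stats(X_benign):
    """Benign-only standardization statistics (Algorithm 1, steps 5-7)."""
    return X_benign.mean(axis=0), X_benign.std(axis=0)

def standardize(X, mu, sigma, eps=1e-8):
    return (X - mu) / (sigma + eps)

def fit_linear_autoencoder(Xs_benign, d=2):
    """Closed-form linear autoencoder: keep the top-d principal directions of
    the (already standardized) benign data. Stand-in for the 8-2-8 network."""
    _, _, Vt = np.linalg.svd(Xs_benign, full_matrices=False)
    return Vt[:d]                       # (d, |F|) shared encode/decode weights

def aad_score(Xs, W):
    """Mean squared reconstruction error per row, as in Eq. (1)."""
    X_hat = (Xs @ W.T) @ W              # encode to d dims, then decode
    return ((Xs - X_hat) ** 2).mean(axis=1)
```

Because the model is fit only on benign rows, flows that deviate from the learned structure reconstruct poorly and receive high AAD scores, which is the behavior the framework relies on for prioritization.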
To learn the traffic behavior, it does not need any labels, and this anomaly score is used for traffic prioritization for triage. In contrast, general classification models, as shown in Table 3, require labels and output a discrete class as the decision; they are generally used for threat detection.

5.2. Time-Window State Construction

Raw network traffic is represented as a sequence of flow records, each comprising source and destination addresses, ports, protocol, inbound bytes, and outbound bytes. Traffic is aggregated into fixed-length windows of duration Δt = 5 minutes. For each time window t, the aggregated state vector s_t is obtained by merging numerical statistics (such as mean and maximum port numbers and byte counts) with encoded categorical attributes such as protocol identifiers and low-cardinality representations of IP addresses. The use of fixed temporal windows is widely considered best practice in modern Security Operations Centers [39].

For example, let F_t = {f_1, f_2, ..., f_{N_t}} denote the set of raw network events observed during time window t, where N_t may be on the order of millions in high-throughput SOC environments. Modeling the state of the system as s_t = F_t is computationally infeasible. Instead, we build a state representation using an aggregation function φ(·), which maps a large set of raw traffic events into a fixed-dimensional feature vector that captures traffic statistics, protocol behavior, and temporal characteristics:

s_t = φ(F_t).  (2)

This aggregation captures both regular and burst-oriented patterns while simultaneously reducing noise and dimensionality. Detailed feature engineering, aggregation equations, and examples are provided in Appendix A.
5.3. Deep Reinforcement Learning for SOC

We employ Deep Reinforcement Learning (DRL) to build our containment agent, often referred to as the DRL agent in this paper, which learns directly from traffic features to identify anomalies. Specifically, we model the containment decision problem as a sequential decision-making process by defining the state and action spaces for the agent. Formally, the containment decision is modeled as a Markov Decision Process (MDP) M = (S, A, P, R, γ), where S denotes the state space, A the action space, P the state transitions across time windows, R a reward function reflecting SOC priorities, and γ ∈ (0, 1] the discount factor.

The state of the DRL agent at time step t, denoted s_t, represents the aggregated network traffic over a fixed time window; Equation 2 defines this aggregated state. The action space is A = {0, 1}: at each time step t, given state s_t, the agent selects a_t = 1 if the traffic F_t is considered malicious and a_t = 0 if F_t is assessed as legitimate. The aim of the agent is to learn an optimal containment policy that maximizes long-term operational utility while balancing detection accuracy and alert fatigue.

The DRL policy network π_θ(a_t | s_t) is implemented as a Multilayer Perceptron (MLP), i.e., a feedforward neural network with two hidden layers of 64 neurons each, as described in Algorithm 2. The network takes the aggregated traffic state s_t ∈ R^d as input, where s_t is a real-valued vector of dimension d, and outputs a probability distribution over two actions: Contain and Allow.
In our framework, we use Rectified Linear Unit (ReLU) activations in the hidden layers, and a softmax function is applied at the output layer to produce action probabilities [40]. Given the state s_t, the forward pass of the policy network is defined as follows:

Hidden Layer 1:  h_1 = ReLU(W_1 s_t + b_1),  (3)
Hidden Layer 2:  h_2 = ReLU(W_2 h_1 + b_2),  (4)
Output Layer (Logits):  z = W_3 h_2 + b_3,  (5)

where z ∈ R² contains the unnormalized action scores (logits) for allow and containment. These logits are the raw outputs of the policy network and indicate how strongly the agent favors each action before softmax normalization. The action probabilities are computed with a softmax, and the confidence of the action contain is given by max_a π(a | s_t):

π(a | s_t) = exp(z_a) / Σ_{a' ∈ {0,1}} exp(z_{a'}).  (6)

The anomaly score assigned to a traffic flow by the Autoencoder-based Anomaly Detection (AAD) is not included in the DRL state representation. This prevents feature leakage and ensures that the DRL agent learns containment behavior exclusively from raw, aggregated network traffic rather than precomputed anomaly scores. The anomaly score is used after the DRL decision, when prioritizing traffic flows for LLM triage. For each time window, a triage priority score is computed as shown in Equation 7:

Triage Priority = DRL_Action × AAD_Score.  (7)

This formulation ensures that only windows flagged by the DRL agent are prioritized, while the anomaly score validates their urgency. Highly anomalous flows receive higher priority, while low-risk anomalies are de-prioritized even if flagged for containment. This mirrors real SOC analyst workflows, where decisions are discrete but prioritization is continuous. Policy optimization is performed using Proximal Policy Optimization (PPO), which stabilizes learning by constraining policy updates within a clipped trust region [41].
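The forward pass of Eqs. (3)–(6) and the triage priority of Eq. (7) can be sketched in NumPy. This is a minimal illustration of the computation only (no PPO training); the parameter shapes follow the stated 2 × 64 architecture, and the `params` dictionary layout is an assumption for the sketch.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy_forward(s, params):
    """Two hidden layers of 64 ReLU units, softmax head, as in Eqs. (3)-(6)."""
    h1 = relu(params["W1"] @ s + params["b1"])
    h2 = relu(params["W2"] @ h1 + params["b2"])
    z = params["W3"] @ h2 + params["b3"]     # logits for {allow, contain}
    return softmax(z)

def triage_priority(drl_action, aad_score):
    """Eq. (7): only DRL-flagged windows (action = 1) receive nonzero priority."""
    return drl_action * aad_score
```

Note how Eq. (7) makes the discrete DRL decision act as a gate while the continuous AAD score grades urgency: an allowed window always has priority 0, so it is never forwarded to the LLM regardless of its anomaly score.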
Temporal consistency is preserved through time-series cross-validation, ensuring that training data strictly precede test data in the timeline. The agent iteratively observes the current network state, selects an action according to its policy, and receives a scalar reward based on the correctness of the decision relative to the ground-truth labels available during training. The reward policies are described in Section 5.3.1.

In summary, PPO determines how the policy is updated in the framework, while the 2 × 64 MLP (two hidden layers of 64 neurons each) defines what the policy looks like. Together, they form a DRL agent that:

• observes aggregated network traffic;
• predicts an action (contain or allow);
• receives a reward based on correctness and cost;
• updates the policy in a stable and constrained manner.

Algorithm 2 Deep Reinforcement Learning for SOC Containment

Require: Aggregated network windows W = {w_1, w_2, ..., w_T} with feature vectors s_t; ground-truth labels y_t (used only for reward computation during training).
Ensure: Containment policy π_θ(a | s).
1: Initialize the PPO policy network π_θ with 2 hidden layers (64 neurons each).
2: Define the action space A = {0: No Action, 1: Containment}.
3: Define the reward function R(a_t, y_t), emphasizing low false positives.
4: for each training episode do
5:   for each time step t do
6:     Observe environment state s_t.
7:     Sample action a_t ∼ π_θ(a_t | s_t).
8:     Execute action a_t.
9:     Receive reward r_t = R(a_t, y_t).
10:    Store (s_t, a_t, r_t).
11:  end for
12:  Update policy parameters θ using the PPO objective.
13: end for
14: return the trained containment policy π_θ.

5.3.1. Reward Shaping

The reward function is designed to reflect SOC operational priorities, with particular emphasis on minimizing false positives, which are costly and disruptive in real-world environments.
Correct containment of malicious activity (true positives) is positively rewarded, while false positives incur significant penalties. True negatives are rewarded to reinforce restraint (no containment) when traffic is benign, and false negatives are penalized to discourage missed detections. Multiple reward profiles are evaluated to study the trade-off between detection sensitivity and false-positive reduction:

r_t = R(a_t, y_t).  (8)

In the proposed DRL formulation, the agent operates in a binary decision space at each time step t. Let a_t ∈ {0, 1} be the action selected by the agent, where a_t = 1 corresponds to a containment decision (e.g., blocking or isolating a network flow) and a_t = 0 represents the allow action, i.e., the traffic is treated as benign. The ground-truth label at time step t is y_t ∈ {0, 1}, where y_t = 1 denotes that the observed network flow is malicious and y_t = 0 that it is benign. The agent's action a_t and the true label y_t result in one of four security outcomes: true positive (TP), false positive (FP), false negative (FN), and true negative (TN). These outcomes are the basis for the reward shaping strategy used to train the deep reinforcement learning agent. By modeling this interaction, the agent learns policies that balance detection effectiveness with operational costs such as alert fatigue and unnecessary containment actions.

To study the impact of reward shaping on decision-making behavior, we define four distinct reward profiles (Modes A–D). Each profile represents a different operational focus typically found in Security Operations Center (SOC) environments.

Mode A is a recall-oriented detection strategy. In this mode, both true positives and true negatives are rewarded equally, while false negatives are penalized more than false positives.
This motivates the agent to prioritize detection coverage and early identification of malicious activity, making Mode A appropriate for initial threat discovery and baseline sensitivity analysis rather than strict false-positive control.

Mode B focuses on false-positive reduction. In this mode, false positives are penalized more heavily than false negatives, while true negatives receive a significant positive reward. This reflects a SOC environment in which excessive containment measures and alert fatigue incur significant costs; it is effective for prioritizing high-confidence alerts and reducing the workload on SOC analysts.

Mode C provides a moderate trade-off between detection and operational cost. False positives and false negatives are penalized at intermediate levels, while correct decisions receive moderate rewards. This policy suits SOC environments where both false positives and false negatives are considered harmful, but neither dominates the operational policy.

Mode D incorporates controlled randomness into the reward function. It preserves a balanced reward scheme similar to Mode C but adds Gaussian noise to the reward signal. This form of stochastic regularization improves policy robustness by reducing overfitting to fixed reward structures and by mimicking the uncertainty found in SOC environments and analyst responses. It is represented by Equation 9:

r_t^(D) = R(a_t, y_t) + ε_t,  ε_t ∼ N(0, σ²).  (9)

5.3.2. DRL Objective

The agent learns a containment policy π_θ(a_t | s_t) that maps the aggregated flow state s_t to an action in {0, 1}. The learning objective is to maximize the expected discounted return shown in Equation 10:

J(θ) = E_{π_θ}[ Σ_{t=0}^{T} γ^t r_t ],  (10)

where γ ∈ (0, 1] is the discount factor and T is the episode horizon.
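The four reward profiles can be sketched as outcome-indexed reward tables. This is an illustrative sketch only: the paper does not publish its exact numeric reward values, so the magnitudes below are assumptions chosen to respect the stated orderings (Mode A penalizes FN more than FP, Mode B the reverse, Mode C balanced, Mode D = Mode C plus Gaussian noise as in Eq. (9)).

```python
import numpy as np

# Assumed (illustrative) reward magnitudes; only their ordering follows the text.
REWARDS = {
    "A": {"TP": 1.0, "TN": 1.0, "FP": -0.5, "FN": -2.0},  # recall-oriented
    "B": {"TP": 1.0, "TN": 1.0, "FP": -2.0, "FN": -0.5},  # false-positive reduction
    "C": {"TP": 0.5, "TN": 0.5, "FP": -1.0, "FN": -1.0},  # balanced trade-off
}

def outcome(action, label):
    """Map (a_t, y_t) to the security outcome TP/FP/FN/TN."""
    return {(1, 1): "TP", (1, 0): "FP", (0, 1): "FN", (0, 0): "TN"}[(action, label)]

def reward(action, label, mode, sigma=0.1, rng=None):
    """R(a_t, y_t) under Modes A-D; Mode D adds N(0, sigma^2) noise (Eq. 9)."""
    base = REWARDS["C" if mode == "D" else mode][outcome(action, label)]
    if mode == "D":
        if rng is None:
            rng = np.random.default_rng()
        base += rng.normal(0.0, sigma)
    return base
```

Keeping the reward profile in a lookup table like this makes it easy to switch operational objectives without touching the policy architecture, which is the adaptation mechanism the evaluation in Section 7 exercises.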
Since the reward R(a_t, y_t) is designed to reflect the operational priorities of the SOC, maximizing J(θ) is equivalent to minimizing the long-term operational burden under the chosen reward profile.

5.3.3. Decision Cost and Regret Analysis

To complement the standard detection metrics, we evaluate the learned containment policy using decision cost and regret, which provide insight into the operational quality of the agent's actions under SOC constraints. The decision cost at time step t is defined as the negative of the obtained reward, shown in Equation 11:

cost_t = −r_t.  (11)

This formulation reflects the operational burden associated with an action. High costs correspond to undesirable outcomes, such as unnecessary containment of benign traffic or missed detection of malicious activity, while low or negative costs indicate decisions aligned with SOC priorities. While decision cost evaluates the absolute penalty of an action, regret measures how far the agent's decision deviates from the best possible decision at that time step. Equation 12 defines the regret:

regret_t = r*_t − r_t,  (12)

where

r*_t = max_{a ∈ {0,1}} R(a, y_t)  (13)

is the maximum achievable reward assuming perfect knowledge of the ground-truth label y_t. By reporting both decision cost and regret, we measure not only whether the agent makes correct decisions, but also how costly its mistakes are in practice. This dual perspective is important in SOC environments, where different errors have different operational consequences.

5.3.4. Putting it Together

Unlike traditional supervised intrusion detection systems, the DRL agent in the framework does not aim to maximize classification accuracy alone. Instead, it learns a policy optimized for long-term operational efficiency under SOC constraints. The DRL agent acts as a decision-making filter that determines whether an alert needs escalation, reducing the analyst workload.
Specifically, the DRL agent is trained without using anomaly detection scores (e.g., the AAD score) as input features to avoid feature leakage; the anomaly scores are computed separately and applied only after the DRL stage to prioritize traffic flows for LLM-based triage. This separation ensures that the DRL policy generalizes beyond specific anomaly models and remains robust to changes in downstream scoring mechanisms.

5.4. LLM-Based Multi-Agent SOC Triage

Algorithm 3 outlines triage using Large Language Models. Traffic flows with a priority score greater than 5, assigned after the DRL decisions, are forwarded to the LLM for analysis. Once a traffic window is prioritized by the DRL agent, individual flows within that window are extracted and forwarded for LLM-based analysis. The LLM operates at flow-level granularity, generating per-flow contextual insights, MITRE ATT&CK mappings, and recommendations for the SOC analysts.

Algorithm 3 LLM-Based Multi-Agent SOC Triage

Require: Flagged traffic flows F from the DRL agent (traffic flows with triage score > 5); anomaly scores AAD(w_t); contextual metadata (IPs, ports, timestamps).
Ensure: SOC triage reports with MITRE mapping and remediation guidance.
1: for each traffic flow f_t ∈ F do
2:   Construct the contextual prompt ⟨s_t, a_t, AAD(w_t), network metadata⟩.
3:   Invoke the SOC Analyst Agent:
4:     Generate a tactical summary and risk assessment.
5:     Generate an SPL query to investigate in Splunk.
6:   Invoke the Threat Intelligence Agent:
7:     Map the activity to MITRE ATT&CK techniques.
8:     Suggest SOAR or mitigation actions.
9:   Store the structured triage report.
10: end for
11: return SOC investigation reports and the master triage table

This design enables scalable decision-making by first filtering high-risk windows and performing detailed analysis only on selected flows.
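The flow-selection and prompt-construction steps of Algorithm 3 (the threshold filter and step 2) can be sketched as follows. The template mirrors the fields of Listing 1; the flow dictionary keys and the helper names are assumptions for this sketch, not the paper's code.

```python
# A minimal sketch of Algorithm 3, steps 1-2: filter flows by triage priority
# and fill the SOC triage prompt template (cf. Listing 1) with per-flow context.
SOC_TRIAGE_TEMPLATE = """Role: Senior SOC Triage Analyst
Goal: Assess the threat level associated with a prioritized network flow.

Input:
- Flow ID: {flow_id}
- Source IP: {src_ip}
- Destination IP: {dest_ip}
- Destination Port: {dest_port}
- Priority Score: {priority}
- Anomaly Score: {aad_score}

Task: Analyze the flow and classify it as benign, suspicious, or high risk."""

PRIORITY_THRESHOLD = 5.0  # flows with triage score > 5 are forwarded to the LLM

def select_for_triage(flows):
    """Keep only DRL-flagged flows whose triage priority exceeds the threshold."""
    return [f for f in flows if f["priority"] > PRIORITY_THRESHOLD]

def build_prompt(flow):
    """Fill the Listing 1 template with one flow's contextual metadata."""
    return SOC_TRIAGE_TEMPLATE.format(**flow)
```

In the framework the resulting prompts would be handed to the CrewAI agents; here they are ordinary strings, which also makes the prompt-assembly step easy to unit-test in isolation.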
Traffic flows are forwarded to the LLM with contextual metadata, including source and destination addresses, ports, time windows, and the corresponding AAD scores. The source and destination IP addresses are anonymized before being forwarded to the LLM for analysis [42]. The LLM-based triage is implemented using the CrewAI multi-agent framework [37]. During the analysis process, two specialized agents are instantiated:

• a Senior SOC Triage Analyst, which validates containment actions, assesses immediate risk, and provides SPL (Splunk Processing Language) queries to pull the traffic flows in Splunk;
• a Threat Intelligence Analyst, which maps the observed behavior to MITRE ATT&CK techniques and recommends mitigation strategies.

LLM agents with specialized SOC analyst roles provide human-readable summaries with contextual insights. This stage transforms low-level traffic flow information into actionable intelligence, reducing the workload of the SOC analyst and accelerating decision making. Moreover, the SPL queries generated by the Senior SOC Triage Analyst agent can be used by analysts to filter the logs on the Splunk dashboard, helping them isolate the anomalous flows from the huge volume of logs.

Figure 2: Use case illustrating the application of the framework

In the framework, the DRL agent operates at the traffic-window level to identify high-risk windows, acting as a policy-level filter that reduces the volume of data forwarded for investigation. Once traffic windows are prioritized, individual flows within each window are extracted and analyzed by the LLM agents at per-flow granularity. Finally, the outputs are automatically compiled into per-flow investigation reports and a master SOC triage summary, enabling direct integration with Security Orchestration, Automation, and Response (SOAR) platforms.
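The IP anonymization step mentioned above can be sketched as deterministic pseudonymization. This is an assumption-level illustration (the paper does not specify its exact scheme): a salted hash yields a stable token for each IP, so the same address maps to the same placeholder across time windows, while a SOC-held table allows reversal before SPL execution.

```python
import hashlib
import ipaddress

SECRET_SALT = b"soc-local-secret"   # illustrative; kept inside the SOC environment
_reverse_map = {}                   # token -> original IP, SOC-controlled table

def pseudonymize_ip(ip: str) -> str:
    """Deterministically replace a real IP with a stable, non-reversible token."""
    ipaddress.ip_address(ip)        # validate the input is a real IP address
    digest = hashlib.sha256(SECRET_SALT + ip.encode()).hexdigest()[:10]
    token = f"IP_{digest}"
    _reverse_map[token] = ip        # recorded so SPL placeholders can be resolved
    return token

def deanonymize(token: str) -> str:
    """Resolve a token back to the original IP before executing SPL on Splunk."""
    return _reverse_map[token]
```

Determinism is the key property here: because the mapping is stable, the LLM can still correlate repeated activity from the same (pseudonymized) host across windows without ever seeing the real address.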
In the framework, a public LLM such as ChatGPT is used for the initial prototype; however, local large language models can also be used. To preserve data confidentiality, sensitive information such as IP addresses is anonymized prior to LLM-based analysis using deterministic pseudonymization [43]. It replaces real identifiers with stable tokens, ensuring consistency across time windows while preventing the exposure of sensitive information to the public LLM. The anonymization is non-destructive and reversible only within the SOC-controlled environment. The LLM agents generate SPL queries using anonymized identifiers, which function as placeholders rather than executable values. Prior to execution on Splunk, these placeholders are resolved by mapping the anonymized tokens back to their original identifiers, which are maintained in a table. As a result, the generated SPL queries remain applicable to the SIEM platform while ensuring that the original information is never exposed to the LLM.

6. Use Case

For a better understanding of the workflow, we provide a use case, shown in Fig. 2, in which the log collection agent is deployed on Windows 11 running inside a virtual machine. As shown in Fig. 2, Kali Linux is used as the attacker machine to launch the attack on Windows 11, the target. The Splunk Universal Forwarder deployed on Windows 11 collects and forwards logs to Splunk Enterprise running on a Windows Server 2019 host. For the experimental evaluation, the logs of the Suricata IDS, which is deployed on Windows 11, are forwarded by the Splunk Universal Forwarder to Splunk Enterprise, where they are indexed for investigation. We plan to extend the use case with multiple log sources. Once the logs are indexed, data cleaning and aggregation are applied before further processing by the autoencoder and DRL.
For the autoencoder, a benign sample is filtered from an early temporal window, ensuring that the learned representation captures normal network behavior without mixing in attack traffic. Moreover, alert labels and non-behavioral features such as hashes are removed. For DRL training, window-based aggregation and cardinality reduction are performed to ensure that the DRL agent operates on non-leaking raw events. After DRL training, the traffic window is prioritized for LLM triage using the DRL decision and the anomaly score (AAD score). Then, the CrewAI-based LLM agents analyze the received traffic window and provide contextual insights along with SPL queries for further validation on Splunk, as described in Section 4. Based on the insights provided by the agentic AI framework, entities such as IP addresses and ports are extracted and analyzed in Splunk Enterprise by the SOC analyst for effective decision making. The analysis performed using Splunk is shown in Section 7. During the evaluation with the public dataset (Boss of the SOC), we directly ingested the dataset into Splunk Enterprise and then exported it into CSV format for evaluation using the autoencoder and the DRL. The detailed results are discussed in Section 7.

6.1. Threat Model for Simulation

• Network Scanning Attack: aims to identify active hosts on a network along with the ports and services running on those hosts [44]. It helps attackers assess the weaknesses of assets and plan attacks to exploit those vulnerabilities [30].
• Volumetric Denial of Service (DoS) Attack: a volumetric DoS attack, such as a UDP flood, targets the victim system and network to deplete resources such as network bandwidth, CPU, and memory by sending a large number of bogus packets [45].

7. Experimentation Results

This section analyzes the experimentation results on the Boss of the SOC dataset [46] and the simulated dataset.
We used only a part of the Boss of the SOC dataset for this evaluation; after cleaning and removing duplicates, we retained around 12,000 instances. The dataset contains source IP, destination IP, source port, destination port, inbound bytes, outbound bytes, protocol, and time. We also simulated the use case described in Section 6 and collected a dataset for evaluation. The simulated dataset contains the same features as the public dataset, with a total of 300,000 instances.

Table 4: Automated LLM Triage based on Adaptive Scoring and Reinforcement Learning Agent

Flow ID | Source IP | Destination IP | Priority Score | MITRE ID | Agent Answer
26 | 172.31.38.181 | 172.31.0.2 | 7.12e+12 | T1071 | Critical: C2 communication identified
35 | 172.16.0.178 | 172.16.3.197 | 9.49e-01 | T1071 | Malicious behavior: suspected exfiltration via standard protocols
34 | 192.168.8.103 | 192.168.9.30 | 7.33e-01 | T1071 | Risk level medium; continuous monitoring advised
30 | 172.16.0.178 | 169.254.169.254 | 5.22e-01 | T1552.005 | Possible attempt to exploit cloud infrastructure
23 | 192.168.8.112 | 192.168.9.30 | 5.06e-01 | T1071 | Moderate level of concern; continuous monitoring advised

7.1. Evaluation on the Boss of the SOC Dataset

Table 4 summarizes the prioritized traffic flows forwarded to the LLM for analysis. Each traffic flow is characterized by its flow identifier, source and destination IP addresses, and the priority score assigned by the DRL agent after training. The corresponding MITRE ATT&CK mapping and final conclusion are provided by the Threat Intelligence Analyst and Senior SOC Triage agents. As shown in Table 4, the flow with Flow ID 26, with a priority score of 7.12 × 10¹², is mapped to technique T1071 (Application Layer Protocol), which is commonly associated with command-and-control (C2) communication channels.
The LLM agent classified this flow as critical, highlighting a high-confidence detection of malicious command-and-control activity that requires immediate containment. This indicates that our framework segregates high-impact threats from background traffic to facilitate a quick response from the SOC analyst.

7.2. SOC Analyst Validation using Splunk Analysis on the Public Dataset

After suspicious traffic was prioritized by the proposed DRL and AAD framework, the LLM-based triage module performed contextual analysis and generated Splunk queries for the SOC analyst to validate the alert context within a real SOC environment. Fig. 3 shows the SPL query applied to filter traffic originating from the specific host (src_ip = 172.31.38.181). This style of SPL query is very useful in network security analysis: as shown in Fig. 3, the transaction operator groups individual network events into logical communication sessions based on a temporal threshold (maxpause = 5 seconds), reconstructing burst-oriented patterns. The analysis suppresses benign, sporadic activity and isolates automated communication patterns by filtering events with eventcount > 10.

Figure 3: Analysis of DNS traffic identified by the proposed RL and AAD triage mechanism

As shown in Fig. 3, there is continuous, high-frequency DNS communication (dest_port = 53) originating from the IP address 172.31.38.181, with 1000 events occurring within a single transaction window lasting between 217 and 236 seconds. This continuous DNS activity is inconsistent with normal user behavior and is commonly associated with command-and-control beaconing or DNS-based tunneling techniques. Furthermore, the presence of multiple destination IP addresses within the same window indicates repeated attempts rather than legitimate service communication. The destination port 9997 shown in Fig. 3 is used by the Splunk Universal Forwarder to send data to Splunk Enterprise.
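The filtering described for Fig. 3 corresponds to an SPL query of roughly the following shape. This is an illustrative sketch, not the exact query in the figure; the index and sourcetype names are assumptions for this deployment.

```spl
index=suricata sourcetype=suricata:dns src_ip=172.31.38.181 dest_port=53
| transaction src_ip maxpause=5s
| where eventcount > 10
| table _time src_ip dest_ip dest_port eventcount duration
```

The `transaction` command stitches events from the same source into sessions whenever the gap between consecutive events is at most 5 seconds, and emits the `eventcount` and `duration` fields used above to isolate the high-frequency DNS bursts.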
Additionally, all DNS responses are found to be syntactically valid (e.g., reply_code = NoError), highlighting that detection is driven by behavioral aggregation rather than protocol violations or signature matching. The Splunk analysis confirms the correctness of the DRL and AAD framework and demonstrates that the LLM-assisted triage provides quick validation of anomalous network patterns and helps the SOC analyst make informed decisions.

7.3. Policy Adaptation Across Reward Modes with the Boss of the SOC Dataset

Fig. 4 presents the performance of the proposed two-layer DRL-based agent across the four reward modes on the Boss of the SOC dataset [46]. As shown in Fig. 4, reward shaping directly impacts the trade-off between precision, recall, and overall detection effectiveness. Mode A, which focuses on recall by penalizing false negatives more than false positives, achieves balanced performance with precision, recall, and F1-scores close to 0.85. This indicates that the agent maintains stable detection capability while allowing a limited number of false positives, making it suitable for early-stage threat detection.

Figure 4: Performance Evaluation Across Modes (A-D)

As shown in Fig. 4, Mode B provides the best overall results, achieving the highest recall (0.873) and F1-score (0.861). This is because its reward function rewards true negatives and penalizes false positives severely, enabling the agent to identify malicious activity with high confidence while limiting containment actions that are not required. Mode C shows a notable drop in recall (0.744) and F1-score (0.783), highlighting the impact of balanced but strict penalties on both false positives and false negatives. Although precision remains comparatively high (0.830), the reduced recall indicates a more conservative policy that reduces alerts at the cost of missing some malicious events.
Finally, Mode D shows stable and consistent performance across all three metrics, with precision, recall, and F1-scores close to 0.82, as shown in Fig. 4. The introduction of controlled stochasticity into the reward function improves the robustness of the policy and prevents overfitting to deterministic reward patterns, producing a reliable containment strategy under uncertain traffic conditions. Overall, these results confirm that the proposed two-layer DRL agent can be adapted to different SOC operational objectives through reward shaping alone, without modifying the underlying model architecture, to assist SOC analysts in deciding whether to contain or allow traffic.

7.4. Decision Cost and Regret Analysis on the Boss of the SOC Dataset

Setup. Let each aggregated time window (state) be indexed by t ∈ {1, ..., T}. The DRL policy π_θ outputs a containment decision a_t ∈ {0, 1}, where a_t = 1 indicates containment (raise an alert / take action) and a_t = 0 denotes allow (do nothing). The ground-truth label is y_t ∈ {0, 1}, where y_t = 1 indicates malicious activity and y_t = 0 legitimate activity. Each decision produces one of four outcomes: true positive (TP): (a_t, y_t) = (1, 1); false positive (FP): (a_t, y_t) = (1, 0); false negative (FN): (a_t, y_t) = (0, 1); true negative (TN): (a_t, y_t) = (0, 0).

Decision cost model. In the SOC environment, it is important to know the cost associated with each decision. We assign an operational cost to each outcome to reflect the use and risk of SOC resources. Let C_TP, C_FP, C_FN, C_TN ∈ R denote the costs (negative values may represent a benefit or cost savings). We define the per-window decision cost as:

c_t = C_TP,  a_t = 1, y_t = 1 (TP),
      C_FP,  a_t = 1, y_t = 0 (FP),
      C_FN,  a_t = 0, y_t = 1 (FN),
      C_TN,  a_t = 0, y_t = 0 (TN).  (14)
(14) The total decision cost and average decision cost ov er a fold are represented by: C total = T X t =1 c t , C = 1 T T X t =1 c t . (15) Regret analysis Regret analysis ev aluates how muc h less optimal the policy’s decision is compared to an oracle that alwa ys selects the action with the minimal cost for that particular activity . F or each time windo w t , the oracle decision cost is represen ted b y: c ⋆ t = min a ∈{ 0 , 1 } c ( y t , a ) , (16) where c ( y , a ) follows Eq. (14). The instantaneous regret is represented b y: r t = c t − c ⋆ t ≥ 0 , (17) and the total and aver age r e gr et are calculated b y: R total = T X t =1 r t , R = 1 T T X t =1 r t . (18) The lo wer regret sho ws that the p olicy is close to the oracle in terms of op erational cost, i.e., it mak es few er mistak es (sp ecifically false p ositiv es and false negatives under the chosen cost profile). 26 T able 5: A verage Decision Cost and Regret Across Time-Series F olds Mo de Mean Decision Cost Std. Dev. Mean Regret Std. Dev. A -0.526 1.049 1.474 1.049 B -0.254 1.126 2.588 3.059 C -0.789 1.233 1.684 1.921 D -0.794 1.027 1.358 1.250 Decision cost and regret ev aluation T able 5 shows the a veraged operational decision cost and regret calculated across differen t rew ard modes. As w e can see in T able 5 Mode C and D achiev e the lo w est a verage decision cost close to − 0 . 79 , indicating improv ed op erational efficiency compared to Mode A and B. Moreo ver, Mo de D achiev es the low est av erage regret of 1 . 358 , highligh ting that sto c hastic reward impro ves the robustness of the p olicy during shifts in temp oral distributions. In contrast, Mo de B shows the highest regret 2 . 588 and v ariabilit y across folds with a standard deviation of 3 . 059 , highlighting the sensitivity to false positive penalties. 
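The cost and regret definitions in Eqs. (14)-(18) can be sketched directly in code. The cost-profile values below are illustrative assumptions for demonstration; the paper's per-mode profiles are defined by its reward functions and are not reproduced here.

```python
# Minimal sketch of the per-window decision cost (Eq. 14) and regret
# (Eqs. 16-18). The cost values below are illustrative assumptions,
# not the profiles used in the paper's experiments.
C = {"TP": -1.0, "FP": 0.5, "FN": 2.0, "TN": -0.2}  # hypothetical costs

def decision_cost(a: int, y: int) -> float:
    """Cost of taking action a (1=contain, 0=allow) when the label is y."""
    outcome = {(1, 1): "TP", (1, 0): "FP", (0, 1): "FN", (0, 0): "TN"}[(a, y)]
    return C[outcome]

def regret(a: int, y: int) -> float:
    """Gap between the chosen action's cost and the oracle's minimal cost."""
    oracle = min(decision_cost(0, y), decision_cost(1, y))
    return decision_cost(a, y) - oracle

# Average cost and regret over a small fold of (action, label) pairs.
decisions = [(1, 1), (1, 0), (0, 0), (0, 1)]  # TP, FP, TN, FN
mean_cost = sum(decision_cost(a, y) for a, y in decisions) / len(decisions)
mean_regret = sum(regret(a, y) for a, y in decisions) / len(decisions)
```

Note that a correct decision always has zero regret regardless of the cost profile, so averaged regret isolates the policy's mistakes, which is why Table 5 reports it alongside raw decision cost.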
Overall, these results show that the balanced reward policies of Modes C and D offer stable containment policies across evolving traffic compared to the strictly recall-oriented (Mode A) or false-positive-intolerant (Mode B) configurations.

7.5. Percentage Reduction in Traffic Flows Forwarded to LLM

The percentage reduction in traffic flows indicates how each DRL reward mode filters traffic before forwarding it to the LLM for analysis. A higher percentage reduction signifies that the mode suppresses more traffic and reduces LLM processing overhead; a lower percentage reduction indicates that the mode forwards more traffic, preserving broader coverage for LLM analysis. As shown in Fig. 5, Mode A offers the greatest reduction in traffic flows forwarded to the LLM for analysis. This is because it penalizes false negatives more heavily than false positives, which makes the agent sensitive to malicious activity while still allowing some filtering of traffic. Mode B strongly penalizes false positives and strongly rewards true negatives. This encourages the DRL agent to suppress benign traffic, which supports traffic reduction and avoids unnecessary downstream analysis. Mode C has a more balanced reward structure between malicious detection and benign traffic suppression. As it filters less than the more selective modes, it tends to keep more traffic for downstream LLM analysis, leading to a lower percentage reduction. Mode D is less conservative and has a partially exploratory reward design, so it is less rigid in its traffic suppression, producing a moderate level of alert reduction while still allowing broader investigation of suspicious traffic flows.

Figure 5: Average Reduction in Traffic Flows Forwarded to LLM Across Modes (A-D)

7.6. Policy Adaptation Across Reward Modes with Simulated Dataset

We also simulated the threat model described in Section 6.1. Fig. 6 shows the performance of the DRL agent with a two-layer architecture (2 × 64) on the simulated Suricata traffic across the four reward modes. Fig. 6 clearly demonstrates that the reward function shapes the agent's containment behavior under controlled traffic conditions. Mode A achieves a recall of 0.998, resulting in almost perfect detection of malicious activity. However, its precision is 0.636, indicating that the agent aggressively flags suspicious traffic, which is desirable in early detection scenarios but causes alert fatigue for SOC analysts due to the increased number of false positives. Mode B achieves high precision (0.976) while significantly reducing recall to 0.495. This validates that the heavy penalization of false positives in Mode B yields a conservative containment policy that prioritizes high-confidence alerts but allows a substantial portion of malicious flows to pass; in exchange, it reduces the alert fatigue caused by a high number of false positives. Mode C achieves a more balanced performance, with a recall of 0.998 and a precision of 0.645, resulting in an F1-score of 0.784. This indicates that Mode C captures malicious activity while maintaining some control over false positives, making it appropriate for environments that require both high detection and operational stability. Finally, Mode D shows consistent and well-balanced behavior, with precision, recall, and F1-scores of 0.926, 0.696, and 0.795, respectively. This reflects that the stochasticity in the reward function enhances robustness by preventing the agent from overfitting to the deterministic traffic patterns commonly present in simulated datasets.

Figure 6: Performance Evaluation Across Modes (A-D) on Simulated Dataset
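The reported F1-scores follow directly from the stated precision and recall values, since F1 is their harmonic mean; a quick check:

```python
# Consistency check of the reported F1-scores on the simulated dataset:
# F1 is the harmonic mean of precision and recall.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

mode_c = f1(0.645, 0.998)  # reported F1: 0.784
mode_d = f1(0.926, 0.696)  # reported F1: 0.795
```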
Overall, these results indicate that the proposed DRL agent adapts its containment strategy in response to different reward modes, validating the effectiveness of reward shaping for modeling different SOC objectives.

7.7. Percentage Reduction in Simulated Traffic Forwarded to LLM

In the framework, the alert capture rate is equivalent to recall, as both evaluate the proportion of true alert instances that are successfully forwarded to the LLM for processing. As shown in Fig. 7, across all reward modes the DRL triage mechanism reduces the traffic forwarded to the LLM by approximately 63%-65%. However, this reduction is associated with different recall outcomes, as shown in Fig. 6. Modes A and C forward almost all true traffic alerts while still achieving strong traffic reduction, indicating an effective balance between efficiency and detection performance. Mode B obtains a similar reduction in traffic forwarded for LLM processing but with lower recall, highlighting that its policy filters too aggressively. Mode D performs moderately, but remains less effective than Modes A and C in maintaining traffic flow coverage. This design ensures that the LLM is not overloaded with a huge traffic volume for analysis.

Figure 7: Average Reduction in Traffic Forwarded to LLM for Analysis

7.8. SOC Analyst Validation via Splunk Analysis on Simulated Dataset

In the framework, after suspicious traffic is prioritized using the DRL network and the AAD score, an LLM-based multi-agent system provides contextual insights and generates SPL queries to validate the traffic on Splunk. As mentioned in Section 5, the Senior SOC Triage Analyst agent provides contextual insights and generates SPL queries for the analysis. Fig. 8 shows the SPL query applied to filter malicious traffic on Splunk. The SPL query filters the traffic originating from the flagged host machine with the IP address 10.0.2.4, allowing the SOC analyst to analyze the communication pattern. For clear presentation, we show the IP addresses in the paper, but LLM triage is performed on anonymized data. This design ensures that the LLM reasons over traffic behavior and context rather than sensitive identifiers, enabling privacy-preserving SOC triage without degrading analytical capability. As shown in Fig. 8, repetitive communication from the same source IP address (10.0.2.4) towards the destination IP address (10.0.2.15) on different destination ports within a very short time window indicates network scanning activity, which maps to MITRE ATT&CK technique T1046. The filtered traffic on Splunk shows no protocol violations or malformed packets; rather, the anomaly is identified through temporal aggregation and behavioral patterns, signifying the advantage of DRL-based containment decisions over signature-driven detection. It is important to note that during LLM-assisted triage the majority of prioritized flows originate from the same host; therefore, a detailed table listing the LLM agent determinations along with MITRE IDs is omitted from the paper for brevity. This is because the simulation used a Kali Linux machine with a single IP address. From a threat hunting perspective, validating the DRL-prioritized traffic through independent SIEM-level behavioral patterns demonstrates that the proposed framework not only achieves high detection performance, but also produces explainable insights and assists SOC analysts in deciding whether to allow the traffic in the network, block it, or monitor it for a while. The whole process aligns with real-world SOC workflows and ensures that SOC analysts verify DRL decisions and LLM insights on the SIEM tool, such as Splunk, rather than relying on the LLM alone.

Figure 8: Analysis of flows from malicious host

8.
Discussion

Our framework provides a multi-layer threat detection architecture that combines Deep Reinforcement Learning (DRL), Autoencoder-based Anomaly Detection (AAD), and LLM-based agents for triage and automated SOC decisions, along with verification of the results using a SIEM tool such as Splunk. In the framework, the DRL agent is first trained on aggregated traffic features to learn containment decisions; the AAD score is then used alongside the DRL decision to prioritize flows for LLM triage. This reduces the processing overhead on the LLM by forwarding only the prioritized flows for analysis, which also helps prevent hallucination. The LLM agents also generate SPL queries that are used for filtering malicious flows on the Splunk dashboard, which is helpful for junior SOC analysts who are not well versed in writing SPL queries. Moreover, in the agentic AI framework, another LLM agent generates an incident response playbook and provides a mapping to MITRE ATT&CK IDs. In contrast to manual investigation on a SIEM tool such as Splunk, our agentic AI framework automates detection, prioritization, contextual analysis, and report generation for detailed investigation.

The proposed framework is aligned with real-world SOC workflows. The DRL agent learns sequential decision policies that directly model containment actions under uncertainty rather than treating detection as a static classification problem. This formulation reflects operational SOC decision-making, where actions carry costs and delayed consequences. The use of multiple reward profiles further demonstrates the flexibility of the framework in adapting to different organizational risk tolerances, such as prioritizing low false-positive rates in high-volume environments. The current experimental evaluation focuses on binary containment decisions: containment and no action (allow).
However, SOC environments often involve actions at multiple stages, such as monitor, throttle, escalate, and isolate. Extending the action space would cover more practical scenarios, but would also increase the complexity of the reward function. One may argue that the framework's use of a public LLM for triage requires providing sensitive security data to an external provider. However, a local LLM can be used to avoid forwarding sensitive logs to a public LLM for triage. Another solution is to apply data anonymization techniques to anonymize sensitive information, such as IP addresses, before sending it to the public LLM for analysis. Another important concern for the current framework is scalability, as it is common for tens of thousands of traffic logs to arrive per second in security operation centers. Our framework addresses this challenge through triage prioritization after the DRL decision, ensuring that agentic analysis is only applied to the high-priority subset of traffic flows. Raw network telemetry is aggregated into fixed temporal time windows, and flows are summarized using statistical descriptors such as the mean and maximum values of ports, in-bytes, out-bytes, and protocol. Aggregating flows reduces the dimensionality of the data, transforming raw traffic into manageable volumes. The DRL agent operates as a policy-level filter, evaluating each aggregated window rather than individual flows. The autoencoder anomaly detection (AAD) score is combined with the DRL decision to calculate the triage priority, which ranks only the traffic windows flagged by the DRL agent for containment. Finally, the LLM-based agents are invoked only for a small number of high-priority traffic flows; this selective strategy makes the framework computationally feasible for SOC environments with high traffic volumes. However, the aggregated time-window features may limit the detection of short-lived attacks.
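The selective triage strategy described above can be sketched as follows. The ranking rule (AAD score over DRL-flagged windows) follows the text; the forwarding budget `k` and the toy data are illustrative assumptions, not parameters specified by the framework.

```python
# Sketch of the selective triage pipeline: the DRL policy acts as a
# coarse filter, the AAD score ranks only the flagged windows, and the
# LLM budget `k` (an illustrative assumption) caps how many windows
# are forwarded for agentic analysis.
def triage(windows, drl_flag, aad_score, k=3):
    """Return up to k window ids flagged by the DRL, ranked by AAD score."""
    flagged = [w for w in windows if drl_flag(w)]
    flagged.sort(key=aad_score, reverse=True)  # most anomalous first
    return flagged[:k]

# Toy example: six aggregated windows with precomputed flags/scores.
flags = {0: True, 1: False, 2: True, 3: True, 4: False, 5: True}
scores = {0: 0.4, 1: 2.1, 2: 1.9, 3: 0.1, 4: 0.9, 5: 1.2}
selected = triage(range(6), flags.get, scores.get, k=2)
```

Note that window 1 has the highest AAD score but is never forwarded because the DRL policy did not flag it; the score only ranks within the flagged subset, as described in the text.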
The experimental evaluation demonstrates that combining Deep Reinforcement Learning and autoencoder anomaly detection with contextual analysis improves the overall efficiency and decision making of SOC analysts. However, the proposed framework currently lacks complete automation from detection to mitigation. The main purpose is to assist the SOC analyst in identifying suspicious traffic for investigation and to support their decision making; the mitigation of malicious traffic is performed by SOC analysts after investigation on the SIEM tool. Moreover, the aim is also to study how agentic AI coupled with a SIEM tool can assist SOC analysts in threat hunting. Rather than pursuing complete automation from monitoring to mitigation at this stage, we consider it preferable to automate key security workflows for improved performance.

9. Conclusion and Future Work

This paper presented a threat hunting framework based on Agentic AI that integrates Deep Reinforcement Learning (DRL), Autoencoder-based Anomaly Detection (AAD), and LLM-driven multi-agent contextual analysis. The framework represents the SOC environment as a sequential decision-making problem, where actions are decided based on learned policies instead of being treated as static classification outputs. To demonstrate the feasibility and effectiveness of our proposed framework, we developed a proof-of-concept prototype and evaluated it on both a public dataset and a simulated dataset. The results show that reward shaping allows the DRL agent to adapt its containment decisions to different SOC objectives, such as maximizing recall, minimizing false positives, and balancing detection. Operational cost and regret analysis are also incorporated into the evaluation, providing a more realistic assessment of the SOC environment.
Our experiments and analysis of the prototype have identified some key benefits of the framework: (1) the modular architecture separates responsibilities across different components; (2) the DRL agent works on aggregated raw traffic features and is trained without the AAD score, preventing feature leakage; (3) the AAD score is applied only after the DRL decision, to prioritize flagged windows for LLM triage and avoid processing overhead on the LLM; (4) the integration of DRL-based decisions with anomaly-aware prioritization and LLM-driven reasoning offers explainable, analyst-aligned threat hunting outcomes; (5) the reduced traffic volume forwarded to the LLM lowers its processing overhead. This multi-layer architecture enables scalability by ensuring that computationally expensive LLM reasoning is applied only where it is most needed. As the SOC environment operates with multi-stage actions such as monitoring, escalating, and isolating traffic, in our future work we will extend the action space to support multi-level containment decisions. Currently, the framework relies on fixed time-window aggregation, which can make short-lived malicious traffic difficult to detect; we will explore adaptive window strategies and hierarchical policies that operate at multiple temporal levels. Moreover, we will investigate the use of domain-specific or locally deployed LLMs to improve reliability, reduce hallucination risk, and address data privacy concerns. Furthermore, we will investigate ethical considerations, including transparency, bias, and accountability in agentic SOC systems, to support responsible deployment of autonomous decision-making systems for cybersecurity operations. Finally, we will perform detailed experimentation with multiple heterogeneous log sources and evaluate the performance.

References

[1] M. A. Ferrag, M.
Ndhlovu, N. Tihanyi, L. C. Cordeiro, M. Debbah, T. Lestable, N. S. Thandi, Revolutionizing cyber threat detection with large language models: A privacy-preserving BERT-based lightweight model for IoT/IIoT devices (2024).

[2] Kaspersky, Advanced persistent threats target one in four companies in 2024 (2024). URL https://www.kaspersky.com/about/press-releases/advanced-persistent-threats-target-one-in-four-companies-in-2024

[3] Fortinet, 2025 Fortinet global threat landscape report (2025). URL https://www.fortinet.com/resources/cyberglossary/recent-cyber-attacks

[4] T. Nguyen, H. Nguyen, A. Ijaz, S. Sheikhi, A. V. Vasilakos, P. Kostakos, Large language models in 6G security: challenges and opportunities (2024).

[5] A. Naseer, H. Naseer, A. Ahmad, S. B. Maynard, A. Masood Siddiqui, Real-time analytics, incident response process agility and enterprise cybersecurity performance: A contingent resource-based analysis, International Journal of Information Management 59 (2021) 102334. doi:10.1016/j.ijinfomgt.2021.102334.

[6] F. Wang, C. Liu, L. Shi, H. Pang, MiniMaxAD: A lightweight autoencoder for feature-rich anomaly detection, Computers in Industry 171 (2025) 104315. doi:10.1016/j.compind.2025.104315.

[7] A. Zeiser, B. Özcan, B. van Stein, T. Bäck, Evaluation of deep unsupervised anomaly detection methods with a data-centric approach for on-line inspection, Computers in Industry 146 (2023) 103852. doi:10.1016/j.compind.2023.103852.

[8] C. Catalano, L. Paiano, F. Calabrese, M. Cataldo, L. Mancarella, F. Tommasi, Anomaly detection in smart agriculture systems, Computers in Industry 143 (2022) 103750.
doi:10.1016/j.compind.2022.103750.

[9] A. Tall, J. Wang, D. Han, Survey of data intensive computing technologies application to security log data management, in: Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, BDCAT '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 268-273. doi:10.1145/3006299.3006336.

[10] P. Badva, K. M. Ramokapane, E. Pantano, A. Rashid, Unveiling the hunter-gatherers: Exploring threat hunting practices and challenges in cyber defense, in: 33rd USENIX Security Symposium (USENIX Security 24), USENIX Association, Philadelphia, PA, 2024, pp. 3313-3330. URL https://www.usenix.org/conference/usenixsecurity24/presentation/badva

[11] R. K. Gupta, S. Shukla, A. T. Rajan, S. Aravind, Utilizing Splunk for proactive issue resolution in full stack development projects (2021).

[12] S. Raza, R. Sapkota, M. Karkee, C. Emmanouilidis, TRiSM for agentic AI: A review of trust, risk, and security management in LLM-based agentic multi-agent systems (2025).

[13] D. B. Acharya, K. Kuppan, B. Divya, Agentic AI: Autonomous intelligence for complex goals—a comprehensive survey, IEEE Access 13 (2025) 18912-18936. doi:10.1109/ACCESS.2025.3532853.

[14] H. Xu, S. Wang, N. Li, K. Wang, Y. Zhao, K. Chen, T. Yu, Y. Liu, H. Wang, Large language models for cyber security: A systematic literature review (2024).

[15] M. A. Ferrag, F. Alwahedi, A. Battah, B. Cherif, A. Mechri, N. Tihanyi, T. Bisztray, M. Debbah, Generative AI in cybersecurity: A comprehensive review of LLM applications and vulnerabilities, Internet of Things and Cyber-Physical Systems (2025). doi:10.1016/j.iotcps.2025.01.001.

[16] A.
Handler, K. R. Larsen, R. Hackathorn, Large language models present new questions for decision support, International Journal of Information Management 79 (2024) 102811. doi:10.1016/j.ijinfomgt.2024.102811.

[17] N. Kshetri, Transforming cybersecurity with agentic AI to combat emerging cyber threats, Telecommunications Policy 49 (6) (2025) 102976. doi:10.1016/j.telpol.2025.102976.

[18] Simbian, AI agents in cybersecurity (2025). URL https://resources.simbian.ai/hubfs/Whitepaper/AI%20Agents%20in%20Cybersecurity%20White%20Paper%20(1).pdf

[19] A. Sheth, A. Patel, C. Upadhyay, H. Ragothaman, B. Patil, S. K. Udayakumar, Agentic AI for autonomous cyber threat hunting and adaptive defense in dynamic security environments, in: 2025 IEEE International Conference on Electro Information Technology (eIT), 2025, pp. 316-321. doi:10.1109/eIT64391.2025.11103697.

[20] Dylan, Utilizing generative AI and LLMs to automate detection writing (2024). URL https://medium.com/@dylanhwilliams/utilizing-generative-ai-and-llms-to-automate-detection-writing-5e4ea074072e

[21] S. Balogh, M. Mlyncek, O. Vranak, P. Zajac, Using generative AI models to support cybersecurity analysts, Electronics 13 (23) (2024). doi:10.3390/electronics13234718.

[22] C. Hillier, T. Karroubi, Turning the hunted into the hunter via threat hunting: Life cycle, ecosystem, challenges and the great promise of AI (2022).

[23] S. J. Lazer, K. Aryal, M. Gupta, E. Bertino, A survey of agentic AI and cybersecurity: Challenges, opportunities and use-case prototypes (2026).

[24] A. Sheth, A. Achanta, P. Matam, A. Patel, P. Sharma, N. V. P. Janapareddy, B. Patil, V.
Gudur, AI driven self-healing cybersecurity systems with agentic AI for adaptive threat response and resilience, in: 2025 IEEE Cloud Summit, 2025, pp. 147-153. doi:10.1109/Cloud-Summit64795.2025.00030.

[25] A. Mohsin, H. Janicke, A. Ibrahim, I. H. Sarker, S. Camtepe, A unified framework for human AI collaboration in security operations centers with trusted autonomy (2025).

[26] N. Kshetri, J. Voas, Agentic artificial intelligence for cyber threat management, Computer 58 (05) (2025) 86-90. doi:10.1109/MC.2025.3544797.

[27] P. Zambare, V. N. Thanikella, N. P. Kottur, S. A. Akula, Y. Liu, NetMoniAI: An agentic AI framework for network security & monitoring (2025).

[28] Y. Gu, Y. Xiong, J. Mace, Y. Jiang, Y. Hu, B. Kasikci, P. Cheng, Argos: Agentic time-series anomaly detection with autonomous rule generation via large language models (2025).

[29] Y. Zhou, Y. Yuan, K. Huang, X. Hu, Can ChatGPT perform a grounded theory approach to do risk analysis? An empirical study, Journal of Management Information Systems 41 (4) (2024) 982-1015. doi:10.1080/07421222.2024.2415772.

[30] R. Sahay, M. Sumasadan, B. Eapen, W. Meng, M. R. A. Mamun, Enhancing threat hunting with Splunk and generative AI for automated security operations (2025). doi:10.21203/rs.3.rs-7515771/v1.

[31] B. Jonkhout, Evaluating large language models for automated cyber security analysis processes (July 2024). URL http://essay.utwente.nl/100846/

[32] A. Konstantinou, D. Kasimatis, W. J. Buchanan, S. U. Jan, J. Ahmad, I. Politis, N. Pitropakis, Leveraging LLMs for non-security experts in threat hunting: Detecting living off the land techniques, Machine Learning and Knowledge Extraction 7 (2) (2025). doi:10.3390/make7020031.

[33] E.
Karlsen, X. Luo, N. Zincir-Heywood, M. Heywood, Benchmarking large language models for log analysis, security, and interpretation (2023).

[34] P. Tseng, Z. Yeh, X. Dai, P. Liu, Using LLMs to automate threat intelligence analysis workflows in security operation centers (2024).

[35] V. Tanksale, Cyber threat hunting using large language models, in: X.-S. Yang, S. Sherratt, N. Dey, A. Joshi (Eds.), Proceedings of Ninth International Congress on Information and Communication Technology, Springer Nature Singapore, Singapore, 2024, pp. 629-641.

[36] C. Kidd, What is Splunk & what does it do? A Splunk intro (2024). URL https://www.splunk.com/en_us/blog/learn/what-splunk-does.html

[37] Z. Duan, J. Wang, Exploration of LLM multi-agent application implementation based on LangGraph+CrewAI (2024).

[38] R. Chalapathy, S. Chawla, Deep learning for anomaly detection: A survey (2019).

[39] Splunk Inc., Correlation searches. URL https://docs.splunk.com/Documentation/ES/latest/Admin/Correlationsearches

[40] M. Farhan, H. Waheed ud din, S. Ullah, M. S. Hussain, M. A. Khan, T. Mazhar, U. F. Khattak, I. H. Jaghdam, Network-based intrusion detection using deep learning technique, Scientific Reports 15 (1) (2025) 25550. doi:10.1038/s41598-025-08770-0.

[41] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms (2017).

[42] S. Srinivas, B. Kirk, J. Zendejas, M. Espino, M. Boskovich, A. Bari, K. Dajani, N. Alzahrani, AI-augmented SOC: A survey of LLMs and agents for security automation, Journal of Cybersecurity and Privacy 5 (4) (2025). doi:10.3390/jcp5040095.

[43] C.-Y. Sun, S.-S. Chen, Y.-H. Ho, De-identification of open-source intelligence using finetuned LLaMA-3, High-Confidence Computing (2025) 100357. doi:10.1016/j.hcc.2025.100357.
[44] N. Hoque, M. H. Bhuyan, R. Baishya, D. Bhattacharyya, J. Kalita, Network attacks: Taxonomy, tools and systems, Journal of Network and Computer Applications 40 (2014) 307-324. doi:10.1016/j.jnca.2013.08.001.

[45] R. Sahay, G. Blanc, Z. Zhang, H. Debar, Towards autonomic DDoS mitigation using software defined networking, 2015. URL https://api.semanticscholar.org/CorpusID:18725272

[46] Splunk, Boss of the SOC v3 dataset released (2020). URL https://www.splunk.com/en_us/blog/security/botsv3-dataset-released.html

Appendix A. Mathematical Details and Numerical Illustration

This appendix provides the detailed mathematical formulation and a worked numerical example supporting the reinforcement learning-based containment framework described in Section 5. These details are included for completeness and reproducibility and are not required for understanding the main results.

Appendix A.1. Time-Windowed State Construction

Let raw network traffic be represented as a sequence of flow records:

$$
F = \{f_1, f_2, \dots, f_N\}. \quad (A.1)
$$

Traffic is aggregated into non-overlapping windows of duration $\Delta t$ (5 minutes). For each window $w_t$, a state vector $s_t$ is constructed using statistical aggregation. For numerical attributes $x \in \{\text{src\_port}, \text{dest\_port}, \text{bytes\_in}, \text{bytes\_out}\}$, we compute:

$$
\mu(x)_t = \frac{1}{|w_t|} \sum_{f_i \in w_t} x_i, \quad (A.2)
$$

$$
\max(x)_t = \max_{f_i \in w_t} x_i. \quad (A.3)
$$

The mean captures typical behavior within the window, while the maximum captures bursty or extreme activity commonly associated with scanning, flooding, or data exfiltration. Categorical features (source IP, destination IP, protocol) are encoded using reduced-cardinality one-hot representations and aggregated as occurrence counts per window.
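The categorical aggregation just described can be sketched as follows. The reduced vocabulary (with an OTHER bucket for rare protocols) is an illustrative assumption; the paper does not specify its exact category set.

```python
from collections import Counter

# Sketch of per-window categorical aggregation: protocols observed in
# a window are counted and emitted as a fixed-length occurrence-count
# vector. The reduced vocabulary with an OTHER bucket is an
# illustrative assumption, not the paper's exact encoding.
VOCAB = ["tcp", "udp", "icmp", "OTHER"]

def protocol_counts(protocols):
    """Map a window's protocol list to fixed-length occurrence counts."""
    counts = Counter(p if p in VOCAB else "OTHER" for p in protocols)
    return [counts.get(v, 0) for v in VOCAB]

window = ["tcp", "tcp", "udp", "gre", "tcp"]  # "gre" falls into OTHER
vec = protocol_counts(window)
```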
The resulting state vector is:

$$
s_t = \big[\mu(\text{src\_port}), \max(\text{src\_port}), \mu(\text{dest\_port}), \max(\text{dest\_port}), \mu(\text{bytes\_in}), \max(\text{bytes\_out}), \text{IP}_{\text{src}}, \text{IP}_{\text{dst}}, \text{Proto}\big]_t. \quad (A.4)
$$

Appendix A.2. Worked Numerical Example

Consider a single 5-minute window with the following aggregated numerical features: $\mu(\text{src\_port}) = 443$, $\max(\text{src\_port}) = 443$, $\mu(\text{dest\_port}) = 51{,}024$, $\max(\text{dest\_port}) = 51{,}083$, $\mu(\text{bytes\_in}) = 1{,}420$, $\max(\text{bytes\_out}) = 98{,}230$. After concatenation with encoded categorical features, the state vector is $s_t \in \mathbb{R}^d$. Assume the policy network produces logits:

$$
z = [-1.31,\; 1.11], \quad (A.5)
$$

corresponding to no containment and containment, respectively. Applying softmax:

$$
\pi(0 \mid s_t) = \frac{e^{-1.31}}{e^{-1.31} + e^{1.11}} \approx 0.08, \qquad \pi(1 \mid s_t) = \frac{e^{1.11}}{e^{-1.31} + e^{1.11}} \approx 0.92.
$$

The agent selects $a_t = 1$ (containment) with confidence 0.92. If the ground-truth label is malicious ($y_t = 1$), the outcome is a true positive and the corresponding reward is assigned according to the active reward profile.

Appendix A.3. Decision Cost and Regret

To complement standard detection metrics, we quantify operational impact using decision cost and regret. The decision cost at time $t$ is defined as the negative of the obtained reward:

$$
\text{cost}_t = -r_t. \quad (A.6)
$$

This formulation directly reflects operational burden: high cost corresponds to undesirable actions such as false positives or missed detections. Regret measures how far the agent's decision deviates from the best possible action under the defined reward function. We define regret as:

$$
\text{regret}_t = r^\star_t - r_t, \quad (A.7)
$$

where

$$
r^\star_t = \max_{a \in \{0,1\}} R(a, y_t) \quad (A.8)
$$

is the oracle reward assuming perfect knowledge of the ground-truth label. A regret value of zero indicates an optimal decision, while larger values indicate increasingly costly deviations from the ideal SOC response.

Appendix A.4.
Appendix A.4. Worked Numeric Example (End-to-End): Aggregation → AAD Score → DRL Action Probability

Goal. This example shows (i) how raw flow logs inside a 5-minute window are converted into fixed-length features (mean/max), (ii) how an Autoencoder-based Anomaly Detection (AAD) score is computed from reconstruction error, and (iii) how a 2-layer DRL policy converts the feature vector into a containment probability (e.g., $0.92$).

Appendix A.4.1. Step 1: Raw flows in a 5-minute window

Assume that within a 5-minute time window $w$ we observe $N = 4$ flow records with numeric attributes src_port, dest_port, bytes_in, bytes_out. Let the four flows be:

Flow $i$   src_port$_i$   dest_port$_i$   bytes_in$_i$   bytes_out$_i$
1          443            52344           1200           300
2          443            52345           1300           350
3          22             52346           8000           9000
4          22             52347           9000           11000

Appendix A.4.2. Step 2: Window aggregation (mean and max)

We compute a fixed-length feature vector $x_w \in \mathbb{R}^6$ using mean and max statistics:
\[
x_w = \big[\, \mathrm{src\_port\_mean},\ \mathrm{src\_port\_max},\ \mathrm{dest\_port\_mean},\ \mathrm{dest\_port\_max},\ \mathrm{bytes\_in\_mean},\ \mathrm{bytes\_out\_max} \,\big].
\]
The aggregation operators are:
\[
\mathrm{mean}(v) = \frac{1}{N}\sum_{i=1}^{N} v_i, \qquad \mathrm{max}(v) = \max_{i \in \{1,\dots,N\}} v_i.
\]
Compute each component:
\[
\begin{aligned}
\mathrm{src\_port\_mean} &= \tfrac{443+443+22+22}{4} = 232.5, & \mathrm{src\_port\_max} &= \max(443, 443, 22, 22) = 443,\\
\mathrm{dest\_port\_mean} &= \tfrac{52344+52345+52346+52347}{4} = 52345.5, & \mathrm{dest\_port\_max} &= \max(52344, 52345, 52346, 52347) = 52347,\\
\mathrm{bytes\_in\_mean} &= \tfrac{1200+1300+8000+9000}{4} = 4875, & \mathrm{bytes\_out\_max} &= \max(300, 350, 9000, 11000) = 11000.
\end{aligned}
\]
Thus, $x_w = [232.5,\ 443,\ 52345.5,\ 52347,\ 4875,\ 11000]$.

Appendix A.4.3. Step 3: Standardization (as used in training)

Before AAD/DRL, numeric features are standardized:
\[
\tilde{x}_w = \frac{x_w - \mu}{\sigma},
\]
where $\mu$ and $\sigma$ are computed from the training data only. For illustration, assume:
\[
\mu = [200,\ 400,\ 52000,\ 52000,\ 2000,\ 5000], \qquad \sigma = [100,\ 100,\ 200,\ 200,\ 2000,\ 3000].
\]
Then:
\[
\tilde{x}_w = \Big[\tfrac{232.5-200}{100},\ \tfrac{443-400}{100},\ \tfrac{52345.5-52000}{200},\ \tfrac{52347-52000}{200},\ \tfrac{4875-2000}{2000},\ \tfrac{11000-5000}{3000}\Big].
\]
Numerically: $\tilde{x}_w = [0.325,\ 0.43,\ 1.7275,\ 1.735,\ 1.4375,\ 2.0]$.

Appendix A.4.4. Step 4: AAD score via Autoencoder reconstruction error

AAD model. Let the AAD be an autoencoder trained on early benign data, $f_\theta : \mathbb{R}^6 \to \mathbb{R}^6$. We use a hidden-layer encoder, bottleneck dimension $d = 2$, and a decoder:
\[
h = \phi(W_1 \tilde{x}_w + b_1), \quad z = \phi(W_2 h + b_2), \quad \hat{h} = \phi(W_3 z + b_3), \quad \hat{x}_w = W_4 \hat{h} + b_4,
\]
where $\phi(\cdot) = \max(0, \cdot)$ is ReLU and $\hat{x}_w$ is the reconstruction.

Concrete numeric forward pass (Example). For the example, we illustrate with a small hidden size (the real implementation can use larger widths). Assume:
\[
W_1 = \begin{bmatrix}
0.30 & 0.10 & 0.05 & 0.05 & 0.10 & 0.20\\
0.10 & 0.20 & 0.10 & 0.10 & 0.20 & 0.10\\
0.05 & 0.05 & 0.30 & 0.30 & 0.10 & 0.10\\
0.10 & 0.10 & 0.20 & 0.20 & 0.10 & 0.05
\end{bmatrix}, \qquad b_1 = 0.
\]
Compute $h = \phi(W_1 \tilde{x}_w)$:
\[
W_1 \tilde{x}_w = \begin{bmatrix}
0.30(0.325) + 0.10(0.43) + 0.05(1.7275) + 0.05(1.735) + 0.10(1.4375) + 0.20(2.0)\\
0.10(0.325) + 0.20(0.43) + 0.10(1.7275) + 0.10(1.735) + 0.20(1.4375) + 0.10(2.0)\\
0.05(0.325) + 0.05(0.43) + 0.30(1.7275) + 0.30(1.735) + 0.10(1.4375) + 0.10(2.0)\\
0.10(0.325) + 0.10(0.43) + 0.20(1.7275) + 0.20(1.735) + 0.10(1.4375) + 0.05(2.0)
\end{bmatrix}.
\]
Numerically:
\[
W_1 \tilde{x}_w \approx \begin{bmatrix}
0.0975 + 0.0430 + 0.0864 + 0.0868 + 0.1438 + 0.4000\\
0.0325 + 0.0860 + 0.1728 + 0.1735 + 0.2875 + 0.2000\\
0.0163 + 0.0215 + 0.5183 + 0.5205 + 0.1438 + 0.2000\\
0.0325 + 0.0430 + 0.3455 + 0.3470 + 0.1438 + 0.1000
\end{bmatrix}
= \begin{bmatrix} 0.8575\\ 0.9523\\ 1.4204\\ 1.0118 \end{bmatrix}.
\]
After ReLU: $h = [0.8575,\ 0.9523,\ 1.4204,\ 1.0118]^{\top}$. Now, we define the bottleneck weights as:
\[
W_2 = \begin{bmatrix}
0.6 & 0.2 & 0.1 & 0.1\\
0.1 & 0.2 & 0.6 & 0.1
\end{bmatrix}, \qquad b_2 = 0.
\]
Compute the bottleneck:
\[
z = \phi(W_2 h) = \phi\!\left( \begin{bmatrix}
0.6(0.8575) + 0.2(0.9523) + 0.1(1.4204) + 0.1(1.0118)\\
0.1(0.8575) + 0.2(0.9523) + 0.6(1.4204) + 0.1(1.0118)
\end{bmatrix} \right).
\]
Numerically:
\[
z \approx \phi\!\left( \begin{bmatrix}
0.5145 + 0.1905 + 0.1420 + 0.1012\\
0.0858 + 0.1905 + 0.8522 + 0.1012
\end{bmatrix} \right)
= \begin{bmatrix} 0.9482\\ 1.2297 \end{bmatrix}.
\]
For reconstruction, we assume a simple linear decoder (Example): $\hat{x}_w = A z$, with
\[
A = \begin{bmatrix}
0.20 & 0.00\\
0.10 & 0.05\\
0.30 & 0.10\\
0.30 & 0.10\\
0.10 & 0.20\\
0.05 & 0.40
\end{bmatrix}.
\]
Then:
\[
\hat{x}_w = \begin{bmatrix}
0.20(0.9482) + 0.00(1.2297)\\
0.10(0.9482) + 0.05(1.2297)\\
0.30(0.9482) + 0.10(1.2297)\\
0.30(0.9482) + 0.10(1.2297)\\
0.10(0.9482) + 0.20(1.2297)\\
0.05(0.9482) + 0.40(1.2297)
\end{bmatrix}
\approx \begin{bmatrix}
0.1896\\ 0.1563\\ 0.4074\\ 0.4074\\ 0.3408\\ 0.5393
\end{bmatrix}.
\]

AAD score as mean squared reconstruction error. Define AAD as:
\[
\mathrm{AAD}(\tilde{x}_w) = \frac{1}{6}\sum_{j=1}^{6} (\tilde{x}_{w,j} - \hat{x}_{w,j})^2.
\]
Compute the per-dimension errors:
\[
\tilde{x}_w - \hat{x}_w = [0.325 - 0.1896,\ 0.43 - 0.1563,\ 1.7275 - 0.4074,\ 1.735 - 0.4074,\ 1.4375 - 0.3408,\ 2.0 - 0.5393].
\]
Numerically: $\tilde{x}_w - \hat{x}_w \approx [0.1354,\ 0.2737,\ 1.3201,\ 1.3276,\ 1.0967,\ 1.4607]$. Squared errors: $[0.0183,\ 0.0749,\ 1.7427,\ 1.7625,\ 1.2028,\ 2.1336]$. Finally:
\[
\mathrm{AAD}(\tilde{x}_w) \approx \frac{0.0183 + 0.0749 + 1.7427 + 1.7625 + 1.2028 + 2.1336}{6} = \frac{6.9348}{6} \approx 1.1558.
\]
Interpretation: a higher AAD score indicates a larger deviation from the benign reconstruction, i.e., a more anomalous traffic window.

Appendix A.4.5. Step 5: DRL policy converts features into a containment probability

Let the DRL policy network (2 hidden layers; in the implementation, 64 neurons per layer) output two logits for actions $a \in \{0, 1\}$, where $a = 1$ means containment.
We denote:
\[
h^{(1)} = \phi(W^{(1)} \tilde{x}_w + b^{(1)}), \qquad h^{(2)} = \phi(W^{(2)} h^{(1)} + b^{(2)}), \qquad \ell = W^{(3)} h^{(2)} + b^{(3)} = \begin{bmatrix} \ell_0\\ \ell_1 \end{bmatrix},
\]
where $\ell_0$ is the logit for \emph{no containment} and $\ell_1$ is the logit for \emph{containment}.

Concrete numeric logits (Example). Assume the network produces the logits:
\[
\ell_0 = -1.2, \qquad \ell_1 = 1.6.
\]
These are not probabilities; they are raw preference scores.

Softmax to obtain action probabilities. The policy probability is given by:
\[
\pi(a = i \mid \tilde{x}_w) = \frac{\exp(\ell_i)}{\exp(\ell_0) + \exp(\ell_1)}, \qquad i \in \{0, 1\}.
\]
Compute:
\[
\exp(\ell_0) = e^{-1.2} \approx 0.301, \qquad \exp(\ell_1) = e^{1.6} \approx 4.953.
\]
Sum: $Z = \exp(\ell_0) + \exp(\ell_1) \approx 0.301 + 4.953 = 5.254$. Thus:
\[
\pi(a = 1 \mid \tilde{x}_w) = \frac{4.953}{5.254} \approx 0.943, \qquad \pi(a = 0 \mid \tilde{x}_w) = \frac{0.301}{5.254} \approx 0.057.
\]
Interpretation: the DRL policy assigns $\approx 94.3\%$ confidence to containment for this window.

Appendix A.4.6. Step 6: Combining the DRL decision with AAD for SOC triage (post-training)

Because AAD is excluded from DRL training (to prevent feature leakage), it is used after DRL for prioritization before forwarding to the LLM:
\[
\mathrm{Priority}_w = \mathbb{I}[a_w = 1] \cdot \mathrm{AAD}(\tilde{x}_w),
\]
where $\mathbb{I}[\cdot]$ is an indicator (1 if containment is selected, else 0). If the agent selects $a_w = 1$, then $\mathrm{Priority}_w = 1 \cdot 1.1558 = 1.1558$. If $a_w = 0$, then $\mathrm{Priority}_w = 0$.

Interpretation: only traffic windows flagged by DRL are escalated, and AAD provides a score for prioritization before LLM-based contextual analysis.

Appendix A.4.7. Variable Dictionary

• $w$: a 5-minute time window.
• $N$: number of flow records inside window $w$.
• $x_w \in \mathbb{R}^6$: aggregated window feature vector (mean/max).
• $\tilde{x}_w$: standardized features, using training-set mean $\mu$ and standard deviation $\sigma$.
• $f_\theta(\cdot)$: AAD autoencoder reconstruction function.
• $\hat{x}_w$: reconstruction of $\tilde{x}_w$ produced by the autoencoder.
• $\mathrm{AAD}(\tilde{x}_w)$: anomaly score computed as mean squared reconstruction error.
• $a \in \{0, 1\}$: DRL action (0 = no containment, 1 = containment).
• $\ell = [\ell_0, \ell_1]^{\top}$: logits (raw scores) output by the policy network.
• $\pi(a \mid \tilde{x}_w)$: softmax policy giving the probability of each action.
• $\mathrm{Priority}_w$: post-training SOC triage priority (used to rank events for LLM investigation).
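The end-to-end worked example (Steps 1–6) can be reproduced with the short Python sketch below. The autoencoder weights, standardization statistics, and policy logits are the illustrative values assumed in this appendix, not parameters of the trained models:

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    # Dense matrix-vector product for small illustrative weight matrices.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

# Step 1: raw flows in window w: (src_port, dest_port, bytes_in, bytes_out)
flows = [(443, 52344, 1200, 300), (443, 52345, 1300, 350),
         (22, 52346, 8000, 9000), (22, 52347, 9000, 11000)]
cols = list(zip(*flows))

# Step 2: mean/max aggregation into the window feature vector x_w
mean = lambda v: sum(v) / len(v)
x_w = [mean(cols[0]), max(cols[0]), mean(cols[1]), max(cols[1]),
       mean(cols[2]), max(cols[3])]

# Step 3: standardization with the illustrative training-set statistics
mu = [200, 400, 52000, 52000, 2000, 5000]
sigma = [100, 100, 200, 200, 2000, 3000]
x_t = [(x - m) / s for x, m, s in zip(x_w, mu, sigma)]

# Step 4: AAD score via the small illustrative autoencoder (biases zero)
W1 = [[0.30, 0.10, 0.05, 0.05, 0.10, 0.20],
      [0.10, 0.20, 0.10, 0.10, 0.20, 0.10],
      [0.05, 0.05, 0.30, 0.30, 0.10, 0.10],
      [0.10, 0.10, 0.20, 0.20, 0.10, 0.05]]
W2 = [[0.6, 0.2, 0.1, 0.1],
      [0.1, 0.2, 0.6, 0.1]]
A = [[0.20, 0.00], [0.10, 0.05], [0.30, 0.10],
     [0.30, 0.10], [0.10, 0.20], [0.05, 0.40]]

h = relu(matvec(W1, x_t))      # encoder hidden layer
z = relu(matvec(W2, h))        # bottleneck, d = 2
x_hat = matvec(A, z)           # linear decoder reconstruction
aad = sum((a - b) ** 2 for a, b in zip(x_t, x_hat)) / len(x_t)

# Step 5: softmax over the illustrative policy logits
l0, l1 = -1.2, 1.6
p_contain = math.exp(l1) / (math.exp(l0) + math.exp(l1))

# Step 6: containment decision gates the AAD-based triage priority
action = 1 if p_contain >= 0.5 else 0
priority = aad if action == 1 else 0.0
```

Running the sketch reproduces the appendix values to rounding: $\tilde{x}_w = [0.325, 0.43, 1.7275, 1.735, 1.4375, 2.0]$, $\mathrm{AAD} \approx 1.1558$, $\pi(a{=}1) \approx 0.943$, and a non-zero triage priority only because containment is selected.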
