Agentic Observability: Automated Alert Triage for Adobe E-Commerce

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Modern enterprise systems exhibit complex interdependencies that make observability and incident response increasingly challenging. Manual alert triage, which typically involves log inspection, API verification, and cross-referencing operational knowledge bases, remains a major bottleneck in reducing mean time to recovery (MTTR). This paper presents an agentic observability framework deployed within Adobe’s e-commerce infrastructure that autonomously performs alert triage using a ReAct paradigm. Upon alert detection, the agent dynamically identifies the affected service, retrieves and analyzes correlated logs across distributed systems, and plans context-dependent actions such as handbook consultation, runbook execution, or retrieval-augmented analysis of recently deployed code. Empirical results from production deployment indicate a 90% reduction in mean time to insight compared to manual triage, while maintaining comparable diagnostic accuracy. Our results show that agentic AI enables an order-of-magnitude reduction in triage latency and a step-change in resolution accuracy, marking a pivotal shift toward autonomous observability in enterprise operations.


💡 Research Summary

The paper presents an “agentic observability” framework that automates alert triage in Adobe’s large‑scale e‑commerce platform. Unlike prior works that focus on post‑mortem root‑cause analysis (e.g., RCA‑Copilot, IRCopilot) or on executing predefined troubleshooting workflows (StepFly, FLASH), this system operates proactively the moment an alert is raised. It implements a ReAct‑style loop using three specialized GPT‑4o agents—Splunk Agent, Tools Agent, and Reflection Agent—coordinated through LangGraph.
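The control flow of this three-agent loop can be sketched in plain Python. The agent internals are stubbed out here (in the paper each is a GPT-4o agent, and the agents are wired together through LangGraph rather than direct function calls); the sketch only shows how the Splunk, Tools, and Reflection roles hand off to one another.

```python
# Plain-Python sketch of the three-agent triage loop. Agent bodies are
# stubs; in the paper these are GPT-4o agents coordinated via LangGraph.

MAX_REFLECTIONS = 5  # the paper bounds reflection at five cycles

def splunk_agent(alert: dict) -> dict:
    """Extract metadata and gather correlated logs (stubbed)."""
    return {"service": alert["service"], "errors": ["timeout in payment API"]}

def tools_agent(evidence: dict) -> dict:
    """Plan and execute context-dependent actions, produce a diagnosis (stubbed)."""
    return {"diagnosis": f"{evidence['service']}: {evidence['errors'][0]}",
            "confidence": 0.9}

def reflection_agent(report: dict) -> bool:
    """Meta-evaluate the diagnosis (stubbed as a confidence threshold)."""
    return report["confidence"] >= 0.8

def triage(alert: dict) -> dict:
    evidence = splunk_agent(alert)
    report = tools_agent(evidence)
    for _ in range(MAX_REFLECTIONS):
        if reflection_agent(report):
            return {**report, "uncertain": False}
        report = tools_agent(evidence)  # refine with another pass
    # Confidence still low after five cycles: return the most probable
    # hypothesis with an explicit uncertainty flag.
    return {**report, "uncertain": True}

print(triage({"service": "checkout"}))
```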

When an alert arrives, the Splunk Agent extracts metadata (service name, session ID, request ID) and queries the Splunk API to pull relevant logs and distributed traces. The retrieved data are filtered for error‑level events and organized into a structured evidence set. The Tools Agent then acts as planner and reasoner: it identifies gaps in the evidence, formulates sub‑goals (e.g., verify API responses, check recent deployments, run auxiliary health checks), and constructs a step‑wise action plan. Retrieval‑augmented generation (RAG) is employed to pull contextual information from internal wikis, runbooks, and recent code‑deployment metadata, grounding the agent’s hypotheses in concrete artifacts.
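The Splunk Agent's first two steps (identifier extraction and error-level filtering) are straightforward to illustrate. The field names follow the summary above; the index name and SPL shape are placeholders, not the paper's actual queries.

```python
# Illustrative sketch of the Splunk Agent's evidence gathering: build an
# SPL query from alert metadata, then keep only error-level events as the
# "structured evidence set". Index and field names are hypothetical.

def build_spl(alert: dict, window: str = "-15m") -> str:
    return (
        f'search index=app_logs service="{alert["service"]}" '
        f'(session_id="{alert["session_id"]}" OR request_id="{alert["request_id"]}") '
        f"earliest={window} | sort -_time"
    )

def filter_errors(events: list) -> list:
    """Keep only error-level events from the retrieved logs."""
    return [e for e in events if e.get("level", "").upper() in {"ERROR", "FATAL"}]

alert = {"service": "checkout", "session_id": "s-123", "request_id": "r-456"}
print(build_spl(alert))
events = [{"level": "INFO", "msg": "ok"}, {"level": "ERROR", "msg": "timeout"}]
print(filter_errors(events))  # only the ERROR event survives
```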

The Reflection Agent performs a meta‑evaluation of the generated diagnosis, checking for completeness (all affected components covered), causality (log evidence supports the inferred cause), and actionability (the recommendation can be safely executed). It allows up to five reflection cycles to resolve uncertainty; if confidence remains low, the most probable hypothesis is returned with an explicit uncertainty flag, preventing endless loops and bounding compute time.
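The three-part rubric can be expressed as a small evaluation function. In the paper each check is performed by an LLM agent; the stand-in predicates below operate on a structured report purely to show the control flow and the rejection signal fed back into the loop.

```python
# Sketch of the Reflection Agent's rubric: completeness, causality,
# actionability. The predicates are stand-ins for LLM-based checks.

def check_completeness(report: dict) -> bool:
    return bool(report.get("components"))          # all affected components named

def check_causality(report: dict) -> bool:
    return bool(report.get("supporting_logs"))     # log evidence backs the cause

def check_actionability(report: dict) -> bool:
    return bool(report.get("recommendation"))      # recommendation is executable

def reflect(report: dict) -> dict:
    failed = [name for name, check in [
        ("completeness", check_completeness),
        ("causality", check_causality),
        ("actionability", check_actionability),
    ] if not check(report)]
    return {"accepted": not failed, "failed_checks": failed}

report = {"components": ["checkout"],
          "supporting_logs": ["gateway timeout at 12:03"],
          "recommendation": "restart payment worker"}
print(reflect(report))  # {'accepted': True, 'failed_checks': []}
```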

The framework was deployed across multiple production services (checkout, subscription management, catalog ingestion) and evaluated over a 12-week period covering 250 alert events. Two baselines were used: manual triage by on-call engineers (average 18 minutes) and a centralized support team (average 33 minutes). The agent achieved a mean time to insight (MTTI) of 2.3 minutes, a 90% reduction, while maintaining an error-localization accuracy (ELA) of 88.4%, comparable to expert engineers. Engineer effort reduction (EER) was 65%, and alert responsiveness (the percentage of alerts with a diagnostic report within five minutes) reached 90%.
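The two time-based metrics follow directly from per-alert records. This sketch implements the definitions as stated above (MTTI as a mean over alerts, responsiveness as the share diagnosed within five minutes); the record schema is hypothetical.

```python
# Metric definitions from the evaluation, applied to a toy set of alerts.

def mtti(records: list) -> float:
    """Mean time to insight, in minutes."""
    return sum(r["minutes_to_insight"] for r in records) / len(records)

def responsiveness(records: list, threshold_min: float = 5.0) -> float:
    """Fraction of alerts with a diagnostic report within the threshold."""
    within = sum(r["minutes_to_insight"] <= threshold_min for r in records)
    return within / len(records)

records = [{"minutes_to_insight": m} for m in (1.5, 2.0, 3.5, 6.0)]
print(round(mtti(records), 2))  # 3.25
print(responsiveness(records))  # 0.75
```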

A detailed case study on “Content Validation Error – WARN” illustrates the gains. The manual workflow required 10-15 minutes of log hunting, variant/locale identification, script execution, and content correction. The agent completed log retrieval (≈30 seconds), variant inference (≈20 seconds), and script execution (≈25 seconds), and produced a structured diagnostic summary in under two minutes. Across 72 occurrences, the automated system generated initial reports for 91.6% of alerts versus 61.1% under manual triage, and automated three of the four manual steps, yielding a 75% EER in this scenario.
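The per-scenario EER figure is simply the share of manual workflow steps the agent takes over, which is easy to verify for this case study:

```python
# EER for the case study: three of the four manual steps (log hunting,
# variant/locale identification, script execution) were automated;
# content correction stayed with the engineer.

def eer(automated_steps: int, total_steps: int) -> float:
    """Engineer effort reduction as a fraction of workflow steps automated."""
    return automated_steps / total_steps

print(f"{eer(3, 4):.0%}")  # 75%
```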

Limitations identified include dependence on log quality, occasional Splunk API throttling that can delay evidence collection, and the need for human onboarding of new runbooks and tooling metadata. Future work will explore uncertainty quantification, feedback‑driven reinforcement learning to improve hypothesis robustness, and asynchronous log indexing or caching to mitigate API rate limits.
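The caching mitigation mentioned for API throttling can be sketched generically. This is not the paper's implementation: the throttle error type, cache policy, and query function are all stand-ins, shown only to illustrate result caching combined with exponential backoff.

```python
# Generic sketch of the proposed throttling mitigation: cache Splunk query
# results and retry throttled calls with exponential backoff. All names
# here are illustrative stand-ins, not the paper's code.

import time

_cache: dict = {}

def cached_query(spl: str, run_query, retries: int = 3, base_delay: float = 1.0):
    if spl in _cache:                      # serve repeats without an API call
        return _cache[spl]
    for attempt in range(retries):
        try:
            result = run_query(spl)
            _cache[spl] = result
            return result
        except RuntimeError:               # stand-in for a 429/throttle error
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("query failed after retries")

calls = []
def fake_query(spl):
    calls.append(spl)
    return [{"msg": "ok"}]

cached_query("search index=app_logs", fake_query)
cached_query("search index=app_logs", fake_query)
print(len(calls))  # 1: the second call was served from the cache
```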

In sum, the paper demonstrates that a production‑grade, multi‑agent LLM system can perform real‑time, context‑aware incident triage at enterprise scale, delivering order‑of‑magnitude reductions in triage latency without sacrificing diagnostic fidelity. This work marks a pivotal step toward autonomous observability and suggests a viable path for other large organizations to embed agentic AI into their AIOps pipelines.

