Using Large Language Models to Support Automation of Failure Management in CI/CD Pipelines: A Case Study in SAP HANA
CI/CD pipeline failure management is time-consuming when performed manually. Automating this process is non-trivial because the information required for effective failure management is unstructured and cannot be automatically processed by traditional programs. With their ability to process unstructured data, large language models (LLMs) have shown promising results for automated failure management in previous work. Following these studies, we evaluated whether an LLM-based system could automate failure management in a CI/CD pipeline in the context of a large industrial software project, namely SAP HANA. We evaluated the ability of the LLM-based system to identify the error location and to propose exact solutions that contain no unnecessary actions. To support the LLM in generating exact solutions, we provided it with different types of domain knowledge, including pipeline information, failure management instructions, and data from historical failures. We conducted an ablation study to determine which type of domain knowledge contributed most to solution accuracy. The results show that data from historical failures contributed the most to the system’s accuracy, enabling it to produce exact solutions in 92.1% of cases in our dataset. The system correctly identified the error location with 97.4% accuracy when provided with domain knowledge, compared to 84.2% accuracy without it. In conclusion, our findings indicate that LLMs, when provided with data from historical failures, represent a promising approach for automating CI/CD pipeline failure management.
💡 Research Summary
This paper investigates the feasibility of using a large language model (LLM) to automate failure management in a complex, real‑world CI/CD pipeline, specifically the Jenkins‑based delivery process for SAP HANA. The authors identify three research questions: (1) the most common causes of pipeline failures in SAP HANA, (2) the accuracy with which an LLM can locate the error within the pipeline, and (3) which type of domain knowledge most improves the LLM’s solution‑generation accuracy.
The study builds a failure‑management system that is itself implemented as a Jenkins pipeline and triggered automatically whenever a build fails. The system first retrieves the console logs of the failed build via the Python Jenkins API. Because Jenkins only reports which downstream job failed, not the root cause, the authors develop a regular‑expression‑based extractor that walks the pipeline hierarchy (main pipeline, sub‑pipelines, remote pipelines) to locate the “most downstream failed job” – the job whose console output actually contains the error message.
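The walk described above can be sketched as a small recursive search: starting from the main pipeline's log, follow each reported downstream failure until reaching a job whose log reports no further failed children. This is an illustrative reconstruction, not the authors' code; the job names, the `FAILURE` line format, and the regular expression below are assumptions about what such Jenkins console output might look like.

```python
import re

# Hypothetical pattern for a line where a Jenkins job reports a failed
# downstream build, e.g. "Build step 'unit-tests' #7 completed: FAILURE".
# The real extractor in the paper likely uses different, SAP-internal patterns.
DOWNSTREAM_FAILURE = re.compile(
    r"Build step '(?P<job>[\w\-/]+)' #\d+ completed: FAILURE"
)

def find_most_downstream_failure(job, get_console_log):
    """Follow reported downstream failures until reaching the leaf job
    whose console output actually contains the error message."""
    log = get_console_log(job)
    match = DOWNSTREAM_FAILURE.search(log)
    if match is None:
        return job, log  # leaf: no failed child reported, error is here
    return find_most_downstream_failure(match.group("job"), get_console_log)

# Toy pipeline hierarchy: main -> sub-pipeline -> unit-tests (illustrative).
logs = {
    "main": "Started by timer\nBuild step 'sub-pipeline' #42 completed: FAILURE",
    "sub-pipeline": "Build step 'unit-tests' #7 completed: FAILURE",
    "unit-tests": "FATAL: test_storage failed: AssertionError",
}
leaf, leaf_log = find_most_downstream_failure("main", logs.get)
```

In a real deployment, `get_console_log` would wrap the Python Jenkins API's console-output retrieval instead of a dictionary lookup.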
Once the relevant log is identified, it undergoes preprocessing to filter out misleading messages and retain only the essential error information. The preprocessed log is then combined with three categories of domain knowledge: (a) pipeline metadata (stage and step definitions), (b) failure‑management instructions (internal runbooks), and (c) historical failure records that share the same downstream job. The historical records are retrieved using a Retrieval‑Augmented Generation (RAG) approach, ranking the top three most similar past failures based on log similarity.
All of this information is fed to OpenAI’s GPT‑4o model with temperature set to 0.0, prompting it to (i) identify the exact error location, (ii) explain the root cause, and (iii) propose a precise remediation action that contains no unnecessary steps. The model’s response is then sent to on‑call engineers to accelerate resolution.
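The single-call structure might be assembled as below. This is a hypothetical reconstruction: the prompt wording and the section layout are assumptions, not the authors' actual prompt, and the API call itself is only indicated in a comment.

```python
def build_messages(log, pipeline_info, instructions, similar_failures):
    """Combine the preprocessed log with all three domain-knowledge types
    into a single prompt asking for location, root cause, and solution."""
    context = "\n\n".join(
        f"### Similar past failure\nLog:\n{f['log']}\nSolution:\n{f['solution']}"
        for f in similar_failures
    )
    user = (
        "A CI/CD build failed. Using the context below, "
        "(1) identify the exact error location in the pipeline, "
        "(2) explain the root cause, and "
        "(3) propose an exact solution with no unnecessary steps.\n\n"
        f"## Failed build log\n{log}\n\n"
        f"## Pipeline information\n{pipeline_info}\n\n"
        f"## Failure management instructions\n{instructions}\n\n"
        f"## Historical failures\n{context}"
    )
    return [
        {"role": "system",
         "content": "You are a CI/CD failure management assistant."},
        {"role": "user", "content": user},
    ]

messages = build_messages(
    log="FATAL: test_storage failed",
    pipeline_info="Stage: Test, Step: unit-tests",
    instructions="On test failures, check recent commits first.",
    similar_failures=[{"log": "test_storage flaky", "solution": "Rerun the test"}],
)

# With the official openai client, the deterministic single call would be:
# client.chat.completions.create(model="gpt-4o", temperature=0.0,
#                                messages=messages)
```

Setting the temperature to 0.0 makes the model's output as deterministic as the API allows, which matters when the response is forwarded directly to on-call engineers.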
The experimental evaluation uses 200 real failure instances from the SAP HANA delivery pipeline, which consists of 46 main steps, three sub‑pipelines, and a total of 64 individual steps. An ablation study compares five knowledge configurations: (1) full knowledge (pipeline + instructions + historical failures), (2) pipeline metadata only, (3) instructions only, (4) historical failures only, and (5) no additional knowledge (raw log only).
Key results:
- With full knowledge, the LLM correctly identifies the error location in 97.4 % of cases, compared to 84.2 % without any domain knowledge.
- Exact solution generation succeeds in 92.1 % of cases with full knowledge.
- Historical failure data contributes the most to accuracy: using only historical failures yields 95.1 % location accuracy and 88.3 % solution accuracy, outperforming the other knowledge types.
- Pipeline metadata and instructions provide modest additional gains (≈5 % and ≈3 % respectively).
The authors argue that, unlike prior work that separates root‑cause analysis and solution generation into multiple LLM calls, their single‑call approach reduces latency, cost, and error propagation. They also note that LLMs are robust to log‑format changes, a common problem for traditional machine‑learning‑based log parsers that require retraining on new data.
Threats to validity include the focus on a single product (SAP HANA), potential drift between the LLM’s knowledge and evolving pipeline configurations, and the impact of prompt design and token limits on performance. Future work will explore cross‑project generalization, automated prompt optimization, and tighter integration of LLM updates with pipeline changes.
In conclusion, the study demonstrates that an LLM, when augmented with relevant domain knowledge—especially historical failure records—can reliably locate errors and generate precise remediation steps in an industrial CI/CD pipeline. This approach promises to reduce on‑call engineers’ manual effort, accelerate release cycles, and improve overall system reliability in large‑scale software delivery environments.