LogSieve: Task-Aware CI Log Reduction for Sustainable LLM-Based Analysis

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

Logs are essential for understanding Continuous Integration (CI) behavior, particularly for diagnosing build failures and performance regressions. Yet their growing volume and verbosity make both manual inspection and automated analysis increasingly costly, time-consuming, and environmentally taxing. While prior work has explored log compression, anomaly detection, and LLM-based log analysis, most efforts target structured system logs rather than the unstructured, noisy, and verbose logs typical of CI workflows. We present LogSieve, a lightweight, RCA-aware and semantics-preserving log reduction technique that filters low-information lines while retaining content relevant to downstream reasoning. Evaluated on CI logs from 20 open-source Android projects using GitHub Actions, LogSieve achieves an average 42% reduction in lines and 40% reduction in tokens with minimal semantic loss. This pre-inference reduction lowers computational cost and can proportionally reduce energy use (and associated emissions) by decreasing the volume of data processed during LLM inference. Compared with structure-first baselines (LogZip and random-line removal), LogSieve preserves much higher semantic and categorical fidelity (Cosine = 0.93, GPTScore = 0.93, 80% exact-match accuracy). Embedding-based classifiers automate relevance detection with near-human accuracy (97%), enabling scalable and sustainable integration of semantics-aware filtering into CI workflows. LogSieve thus bridges log management and LLM reasoning, offering a practical path toward greener and more interpretable CI automation.


💡 Research Summary

LogSieve addresses the growing problem of verbose Continuous Integration (CI) logs, especially in mobile Android projects using GitHub Actions, by introducing a task‑aware, root‑cause‑analysis (RCA)‑focused log reduction technique. The authors first collected 9,166 CI workflow runs from 452 open‑source Android repositories, then stratified a sample of 100 failed runs and manually annotated 20 of them, labeling each of the 14,646 log lines as either RCA‑relevant or irrelevant. Inter‑rater agreement was high (Cohen’s κ = 0.80), establishing a reliable ground‑truth dataset.
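Cohen's κ corrects raw inter-rater agreement for agreement expected by chance. A minimal sketch of the computation, using invented annotator labels (1 = RCA-relevant, 0 = irrelevant) rather than the paper's real annotations:

```python
# Illustrative only: Cohen's kappa for two annotators' line labels.
# The label vectors below are made up; the paper reports kappa = 0.80
# on 14,646 manually labeled CI log lines.
from collections import Counter

def cohen_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # chance agreement: sum of products of each label's marginal rates
    p_e = sum((ca[k] / n) * (cb[k] / n) for k in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

rater1 = [1, 0, 0, 1, 1, 0, 1, 0]
rater2 = [1, 0, 1, 1, 1, 0, 1, 0]
print(round(cohen_kappa(rater1, rater2), 3))  # → 0.75
```

A κ around 0.8, as reported, is conventionally read as "substantial" agreement, which justifies treating the annotations as ground truth.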

LogSieve’s core algorithm computes a relevance score for every line using pre‑computed embeddings. Three embedding families are explored: sparse TF‑IDF vectors, dense BERT‑base contextual embeddings, and instruction‑tuned LLaMA‑3‑8B embeddings. No fine‑tuning is performed, preserving the lightweight nature of the approach. A diverse suite of classifiers—including logistic regression, linear and kernel SVMs, random forests, XGBoost, and LightGBM—is trained on these embeddings. Using stratified 10‑fold cross‑validation, the best models achieve 97% accuracy and a weighted F1 of 0.96, demonstrating near‑human performance in automatically detecting RCA‑relevant lines.
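The simplest variant of this pipeline (TF-IDF embeddings plus logistic regression, scored with stratified k-fold cross-validation) can be sketched as follows. The toy log lines and labels are invented for illustration, and 3 folds are used instead of 10 to suit the tiny sample; the paper additionally evaluates BERT and LLaMA-3 embeddings with SVMs, random forests, XGBoost, and LightGBM:

```python
# Minimal sketch of LogSieve-style relevance classification:
# sparse TF-IDF line embeddings -> logistic regression,
# evaluated with stratified cross-validation. Toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

lines = [
    # RCA-relevant lines (label 1)
    "FAILURE: Build failed with an exception.",
    "e: Unresolved reference: bindViewModel",
    "Error: Process completed with exit code 1.",
    "Execution failed for task ':app:compileDebugKotlin'.",
    "java.lang.NullPointerException at MainActivity.onCreate",
    "Caused by: org.gradle.api.GradleException: compilation error",
    # irrelevant boilerplate (label 0)
    "Downloading gradle-8.0-bin.zip",
    "Welcome to Gradle 8.0!",
    "Starting a Gradle Daemon",
    "> Task :app:preBuild UP-TO-DATE",
    "Set up JDK 17",
    "Post job cleanup.",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(clf, lines, labels, cv=cv, scoring="accuracy")
print(f"mean CV accuracy: {scores.mean():.2f}")
```

Because no embedding model is fine-tuned, the expensive step is a one-time embedding pass; the classifiers themselves train in seconds.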

Applying a threshold on the relevance scores, LogSieve removes on average 42% of log lines and 40% of tokens, thereby shrinking the input size for downstream Large Language Models (LLMs). To evaluate semantic preservation, the reduced logs are fed to GPT‑4o for two tasks: failure explanation generation and failure‑type categorization. Compared with the full‑log baseline, the reduced logs achieve cosine similarity of 0.93, GPTScore of 0.93, and ROUGE‑1/L of 0.91 on the generated explanations, with an exact‑match accuracy of 80% on failure‑type categorization. This indicates that the essential diagnostic information is retained despite substantial reduction.
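The reduction step itself is just a threshold filter over per-line scores. A sketch with hand-picked scores (in LogSieve these come from the embedding-based classifier; the lines, scores, and threshold here are all illustrative):

```python
# Sketch of the reduction step: keep only lines whose predicted
# RCA-relevance score clears a threshold. Scores are invented.
def reduce_log(lines, scores, threshold=0.5):
    """Return (kept_lines, fraction_of_lines_removed) for one CI log."""
    kept = [line for line, s in zip(lines, scores) if s >= threshold]
    return kept, 1 - len(kept) / len(lines)

log = [
    "Starting a Gradle Daemon",
    "> Task :app:compileDebugKotlin FAILED",
    "e: Unresolved reference: viewModel",
    "BUILD FAILED in 42s",
    "Post job cleanup.",
]
scores = [0.05, 0.97, 0.99, 0.90, 0.02]

kept, rate = reduce_log(log, scores)
print(kept)                    # only the three failure-related lines survive
print(f"{rate:.0%} removed")   # → 40% removed (42% on average in the paper)
```

Because filtering happens before any LLM call, the saving applies to every downstream task that consumes the log, not just one model or prompt.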

Baseline comparisons include LogZip (a structure‑first compression tool) and random line deletion. While LogZip achieves a slightly higher line‑reduction rate, it suffers from severe semantic loss, yielding lower cosine similarity and explanation accuracy. Random deletion performs similarly poorly, confirming that indiscriminate trimming harms LLM reasoning.

The token reduction translates directly into computational savings: a 40% token cut leads to roughly a 35% decrease in inference FLOPs, which in turn reduces energy consumption and associated carbon emissions proportionally. This demonstrates a concrete pathway to greener CI analytics.
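One plausible reason the FLOP saving (~35%) trails the token saving (40%) is that only the log portion of the prompt shrinks: instruction/template tokens and generated output tokens are fixed. A back-of-the-envelope estimate under a simple linear cost model (~2 FLOPs per parameter per token); the model size and token counts below are assumptions, not values from the paper:

```python
# Rough estimate: why a 40% cut in log tokens gives a somewhat
# smaller cut in total inference FLOPs. All constants are assumed.
def inference_flops(processed_tokens, params):
    # Simple linear cost model: ~2 * params FLOPs per processed token.
    return 2 * params * processed_tokens

P = 8e9                # assumed 8B-parameter model
log_tokens = 10_000    # assumed log size in tokens
prompt_tokens = 500    # fixed instruction/template tokens (assumed)
output_tokens = 300    # generated explanation tokens (assumed)

full = inference_flops(log_tokens + prompt_tokens + output_tokens, P)
reduced = inference_flops(0.6 * log_tokens + prompt_tokens + output_tokens, P)
print(f"FLOP savings: {1 - reduced / full:.0%}")  # ~37% under these assumptions
```

Energy and emissions scale with FLOPs on a fixed hardware setup, so a pre-inference cut of this size compounds across every CI run analyzed.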

The authors discuss threats to validity, noting that the dataset is limited to Android projects and that manual labeling may introduce domain‑specific bias. Nevertheless, the labeling guidelines and the embedding‑based classification pipeline are designed to be portable across languages, platforms, and CI systems. Future work is suggested to extend LogSieve to other ecosystems, incorporate multimodal artifacts (e.g., binary artifacts, screenshots), and explore adaptive thresholding based on real‑time resource constraints.

In summary, LogSieve provides a practical, semantics‑preserving log reduction method that enables efficient, sustainable LLM‑based CI analysis. By filtering out low‑information lines before inference, it maintains high diagnostic fidelity while cutting inference cost, offering immediate benefits for software engineering teams seeking both scalability and environmental responsibility.

