LLM-Assisted Logic Rule Learning: Scaling Human Expertise for Time Series Anomaly Detection
Time series anomaly detection is critical for proactive supply chain management, but it faces two challenges: classical unsupervised anomaly detection, which exploits statistical patterns in the data, often yields results misaligned with business requirements and domain knowledge, while manual expert analysis cannot scale to the millions of products in a supply chain. We propose a framework that leverages large language models (LLMs) to systematically encode human expertise into interpretable, logic-based rules for detecting anomalous patterns in supply chain time series data. Our approach operates in three stages: 1) LLM-based labeling of training data guided by domain knowledge, 2) automated generation and iterative improvement of symbolic rules through LLM-driven optimization, and 3) rule augmentation with business-relevant anomaly categories, supported by LLMs, to enhance interpretability. Experimental results show that our approach outperforms unsupervised learning methods in both detection accuracy and interpretability. Furthermore, compared to deploying LLMs directly for time series anomaly detection, our approach provides consistent, deterministic results with low computational latency and cost, making it well suited for production deployment. The proposed framework thus demonstrates how LLMs can bridge the gap between scalable automation and expert-driven decision-making in operational settings.
💡 Research Summary
The paper tackles the problem of anomaly detection in massive supply‑chain time‑series data, where traditional unsupervised methods (e.g., Isolation Forest, ARIMA, VAEs, Transformers) struggle for two reasons: (1) they rely solely on statistical patterns and ignore business context, leading to many false alarms on volatile product‑level series, and (2) they are black‑box models that provide no interpretable rationale for alerts. Manual expert analysis can incorporate contextual knowledge but does not scale to the hundreds of millions of ASINs Amazon monitors daily.
To bridge this gap, the authors propose a three‑stage framework that uses large language models (LLMs) as a knowledge‑extraction and rule‑optimization engine rather than as a direct production‑time detector.
Stage 1 – Labeling. Multimodal vision‑language LLMs (e.g., GPT‑4V, Claude‑Vision) are fed both a plotted image of each weekly time‑series and a textual description of the relevant business context (stock‑out ratio, demand surge, etc.). The model returns (a) a binary anomaly label and (b) a natural‑language justification. To improve label reliability, several LLM providers are used in parallel; each provider internally performs a majority‑vote over multiple runs, and a final label is accepted only when all providers agree. Human experts then review a random sample, identify systematic disagreements, and refine the prompting or add deterministic filter rules (e.g., “a drop in out‑of‑stock ratio is a positive signal”). This human‑in‑the‑loop step converts implicit domain expertise into a high‑quality, large‑scale labeled dataset.
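The consensus step above can be sketched as follows. This is an illustrative Python sketch, not the paper's implementation: the `providers` callables stand in for API calls to the multimodal LLMs, and all names and the `runs_per_provider` default are assumptions.

```python
from collections import Counter

def consensus_label(series_plot, context, providers, runs_per_provider=3):
    """Accept a label only when every provider's internal majority vote agrees.

    `providers` maps a provider name to a callable that takes the plotted
    time-series image and a textual business context and returns 0 (normal)
    or 1 (anomaly). All names here are illustrative.
    """
    provider_votes = {}
    for name, ask_llm in providers.items():
        # Each provider runs several times; its vote is the per-provider majority.
        votes = [ask_llm(series_plot, context) for _ in range(runs_per_provider)]
        provider_votes[name] = Counter(votes).most_common(1)[0][0]
    labels = set(provider_votes.values())
    # Require unanimity across providers; disagreements go to human review.
    return labels.pop() if len(labels) == 1 else None
```

Returning `None` on disagreement models the human-in-the-loop escalation: only unanimous examples enter the labeled set directly.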
Stage 2 – Rule Learning. From the labeled set, two streams of information are extracted: (i) qualitative reasoning extracted from the LLM‑generated justifications (phrases such as “significantly higher than recent weeks”, “breaks established trend”), and (ii) quantitative statistics (means, standard deviations, z‑scores, trend indicators) computed with conventional feature‑engineering pipelines. These are combined into a prompt that asks an LLM to synthesize a symbolic, executable rule (e.g., “IF recent 4‑week average > 2σ AND trend‑slope < –0.5 THEN anomaly”).
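A synthesized rule of the kind described above is, by design, directly executable. The sketch below shows one hypothetical rule in Python; the specific threshold (2 standard deviations over the trailing window) is an illustrative choice, not a rule reported in the paper.

```python
import statistics

def example_rule(weekly_values):
    """Illustrative symbolic rule of the form the LLM is asked to emit:
    flag an anomaly when the latest weekly value deviates from the
    trailing mean by more than 2 standard deviations."""
    history, latest = weekly_values[:-1], weekly_values[-1]
    mean = statistics.fmean(history)
    std = statistics.stdev(history)
    return abs(latest - mean) > 2 * std
```

Because the rule is plain code over precomputed statistics, it can be audited line by line and evaluated on millions of series without any LLM call at inference time.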
The initial rule is then refined through an automated iterative pipeline that mirrors a standard machine‑learning training loop:
- Performance evaluation & behavioral analysis. The rule is applied to the labeled data, yielding precision, recall, and F1. A confusion‑matrix‑based taxonomy classifies the rule's behavior (over‑conservative, over‑aggressive, specific pattern failures).
- Targeted modification. The current rule, its behavioral report, and the full history of previous attempts (trajectory) are fed back to the LLM, which proposes concrete modifications (e.g., lower thresholds, add a minimum consecutive‑weeks condition).
- Update & early stopping. The modified rule is re‑evaluated; if it improves the best F1, it becomes the new best rule. The loop continues until a maximum iteration count or convergence. The trajectory‑aware design prevents cyclic changes and encourages convergent optimization.
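The evaluate → modify → update cycle above can be sketched as a conventional training loop. This is a minimal sketch under assumed interfaces: `evaluate(rule, data)` returns an F1 score plus a behavioral report, and `propose_fix(rule, report, trajectory)` wraps the LLM call that proposes a modified rule; the `patience` early-stopping criterion is an assumption.

```python
def refine_rule(initial_rule, labeled_data, evaluate, propose_fix,
                max_iters=20, patience=5):
    """Trajectory-aware rule refinement mirroring an ML training loop.

    Keeps the best-scoring rule seen so far and stops early after
    `patience` iterations without improvement.
    """
    best_f1, report = evaluate(initial_rule, labeled_data)
    best_rule, rule = initial_rule, initial_rule
    trajectory = [(initial_rule, best_f1)]  # full history fed back to the LLM
    stale = 0
    for _ in range(max_iters):
        rule = propose_fix(rule, report, trajectory)   # targeted modification
        f1, report = evaluate(rule, labeled_data)      # re-evaluate on labels
        trajectory.append((rule, f1))
        if f1 > best_f1:
            best_rule, best_f1, stale = rule, f1, 0
        else:
            stale += 1
        if stale >= patience:                          # early stopping
            break
    return best_rule, best_f1
```

Passing the whole `trajectory` to `propose_fix` is what discourages cyclic edits: the LLM can see which modifications were already tried and with what effect.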
Stage 3 – Augmentation. Once a high‑performing rule set is obtained, the LLM categorizes each rule into business‑relevant anomaly types (stock‑outs, demand spikes, price anomalies, etc.). This step adds a layer of interpretability that allows operations teams to understand, audit, and, if necessary, manually adjust the rules.
Experimental results (though details are omitted in the excerpt) show that the rule‑based system outperforms the baseline unsupervised models on both detection accuracy (higher F1) and interpretability. Moreover, compared with deploying LLMs directly for inference, the rule engine delivers deterministic outputs with microsecond latency and negligible compute cost, making it production‑ready at Amazon’s scale.
Key contributions include:
- A novel use of multimodal LLMs for large‑scale, context‑aware labeling, turning tacit expert knowledge into data.
- An LLM‑driven rule synthesis and iterative refinement pipeline that is cast into a conventional ML training framework, enabling early stopping, multi‑start, and avoidance of over‑fitting.
- Demonstration that symbolic, business‑aligned rules can achieve superior performance to black‑box anomaly detectors while providing full auditability.
Limitations and future work are acknowledged: the labeling quality depends on the LLM’s visual reasoning and prompt design; rule complexity may affect runtime efficiency; and the current study focuses on a single metric per product. The authors suggest extending the approach to automatic prompt optimization, rule compression techniques, and multi‑metric, cross‑series correlation modeling.
In summary, the paper presents a practical, scalable methodology that leverages LLMs as knowledge extractors and rule optimizers, thereby marrying human expertise with automated, high‑throughput anomaly detection in supply‑chain time series.