Anomaly Detection for Automated Data Quality Monitoring in the CMS Detector

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Successful operation of large particle detectors like the Compact Muon Solenoid (CMS) at the CERN Large Hadron Collider requires rapid, in-depth assessment of data quality. We introduce the "AutoDQM" system for Automated Data Quality Monitoring using advanced statistical techniques and unsupervised machine learning. Anomaly detection algorithms based on the beta-binomial probability function, principal component analysis, and neural network autoencoder image evaluation are tested on the full set of proton-proton collision data collected by CMS in 2022. AutoDQM identifies anomalous "bad" data affected by significant detector malfunction at a rate 4–6 times higher than "good" data, demonstrating its effectiveness as a general data quality monitoring tool.


💡 Research Summary

The paper presents AutoDQM, an automated data‑quality‑monitoring (DQM) framework designed for the Compact Muon Solenoid (CMS) detector at the CERN Large Hadron Collider. Traditional DQM relies on human shifters visually inspecting thousands of one‑ and two‑dimensional histograms for each run, a process that is time‑consuming, error‑prone, and limited in scalability. AutoDQM replaces this manual workflow with a combination of rigorous statistical testing and unsupervised machine‑learning (ML) techniques, delivering rapid, quantitative, and visual feedback on detector performance.

The statistical core uses the beta‑binomial probability mass function to model the count in each histogram bin as a binomial outcome with a beta prior derived from reference runs. For a data run with total entries D and a reference run with total entries R, the likelihood L_i for each bin i is computed, then normalized against the maximum‑likelihood value to obtain a relative likelihood L_rel,i. This is transformed into a pull value Z_i = √(−2 ln L_rel,i). To guarantee a minimum 1 % prediction tolerance, a scaling factor τ is applied to the reference counts, effectively limiting the statistical uncertainty for high‑occupancy bins. When multiple reference runs are available, the Z_i values are averaged, allowing the test to accommodate systematic variations across runs. Two anomaly metrics are derived: a χ²‑like sum of Z_i² over all bins, and a modified maximum pull Z′_max that incorporates a look‑elsewhere correction.
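The pull computation above can be sketched in a few lines of Python. This is a minimal illustration, not AutoDQM's actual implementation: the beta prior built from the reference counts (`a = tau*r + 1`, `b = tau*(R - r) + 1`) and the evaluation of the maximum likelihood at the distribution mean are simplifying assumptions.

```python
import numpy as np
from scipy.stats import betabinom

def beta_binomial_pulls(data_counts, ref_counts, tau=1.0):
    """Per-bin pulls Z_i = sqrt(-2 ln L_rel,i) for a data run against one
    reference run.

    Sketch of the beta-binomial test described above: the count d_i in each
    data bin is modelled as BetaBinom(D, a_i, b_i), with the beta prior
    built from the tau-scaled reference counts. AutoDQM's exact prior
    parametrisation and tau handling may differ.
    """
    d = np.asarray(data_counts, dtype=int)
    r = np.asarray(ref_counts, dtype=float)
    D = int(d.sum())
    R = r.sum()
    a = tau * r + 1.0            # beta prior from reference bin counts
    b = tau * (R - r) + 1.0
    log_l = betabinom.logpmf(d, D, a, b)
    # Approximate the maximum likelihood by evaluating the pmf at the
    # distribution mean (close to the mode for well-populated bins).
    k_best = np.rint(D * a / (a + b)).astype(int)
    log_l_max = betabinom.logpmf(k_best, D, a, b)
    # Relative likelihood -> pull; clamp tiny negatives introduced by the
    # mode approximation to zero before the square root.
    return np.sqrt(np.maximum(-2.0 * (log_l - log_l_max), 0.0))
```

From these per-bin pulls, the χ²-like metric is simply the sum of Z_i² over all bins, and Z′_max is the largest pull after the look-elsewhere correction.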

For unsupervised ML, the authors implement two complementary models. Principal Component Analysis (PCA) is trained on 216 “good” runs, extracting the two most significant components. Histograms are flattened (2D → 1D) and low‑occupancy bins are merged until each contains at least 0.33 % of total entries, reducing statistical noise. The PCA reconstruction is compared to the original histogram, and a χ²′ score (χ² divided by D^{1/3}) quantifies deviation.
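The PCA scoring step can be illustrated with a short NumPy sketch. This is not the paper's code: it assumes histograms have already been flattened and re-binned to a common binning across runs, implements PCA via SVD, and uses a Pearson-style χ² as a stand-in for whatever exact χ² definition AutoDQM applies before the D^{1/3} normalization.

```python
import numpy as np

def fit_pca(good_hists, n_components=2):
    """Fit a PCA on flattened 'good'-run histograms via SVD.

    Illustrative stand-in for the paper's PCA; the input rows are assumed
    to share one common (already merged) binning.
    """
    X = np.asarray(good_hists, dtype=float)
    mean = X.mean(axis=0)
    # Right-singular vectors of the centred matrix are the principal axes.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_chi2_prime(mean, components, hist):
    """chi2' = chi2 / D**(1/3) between a histogram and its PCA
    reconstruction (Pearson-style chi2 is an assumption here)."""
    hist = np.asarray(hist, dtype=float)
    coeffs = components @ (hist - mean)       # project onto components
    recon = mean + coeffs @ components        # reconstruct from them
    D = hist.sum()
    chi2 = np.sum((hist - recon) ** 2 / np.maximum(np.abs(recon), 1.0))
    return chi2 / D ** (1.0 / 3.0)
```

A histogram that lies on the manifold spanned by the leading components of the good runs reconstructs almost perfectly and scores low; one with a localized deficit or excess reconstructs poorly and scores high.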

The second ML model is a convolutional autoencoder (AE). The encoder consists of 1‑D convolutional layers, a bottleneck, and a decoder built from transposed convolutions. The architecture uses 50 nodes, 12 filters, two hidden layers, and a learning rate of 0.001, implemented with TensorFlow. Like PCA, histograms are pre‑processed (flattened and low‑occupancy merging). After training on “good” data, the AE reconstructs each input histogram; the reconstruction error is again expressed as a χ²′ value, where the reconstructed histogram is scaled by a factor of 100 before applying the beta‑binomial function, thereby suppressing statistical fluctuations in the reference.
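A Keras model along these lines could look as follows. The hyperparameters quoted in the text (12 filters, a 50-node bottleneck, learning rate 0.001, TensorFlow) are used, but the number of convolutional layers, kernel sizes, and strides are illustrative assumptions, not the paper's exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_autoencoder(n_bins, n_filters=12, latent_dim=50, lr=1e-3):
    """Sketch of a 1-D convolutional autoencoder for flattened histograms.

    Matches the hyperparameters quoted in the text; layer counts and
    kernel sizes are illustrative. n_bins must be divisible by 4 so the
    two stride-2 stages invert cleanly.
    """
    inputs = tf.keras.Input(shape=(n_bins, 1))
    # Encoder: two stride-2 1-D convolutions, then a dense bottleneck.
    x = layers.Conv1D(n_filters, 3, strides=2, padding="same", activation="relu")(inputs)
    x = layers.Conv1D(n_filters, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    latent = layers.Dense(latent_dim, activation="relu")(x)
    # Decoder: dense expansion followed by transposed convolutions.
    x = layers.Dense((n_bins // 4) * n_filters, activation="relu")(latent)
    x = layers.Reshape((n_bins // 4, n_filters))(x)
    x = layers.Conv1DTranspose(n_filters, 3, strides=2, padding="same", activation="relu")(x)
    outputs = layers.Conv1DTranspose(1, 3, strides=2, padding="same")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(lr), loss="mse")
    return model
```

After training on good runs only, each histogram is passed through the model and its reconstruction error, converted to the χ²′ score described above, serves as the anomaly metric.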

Both PCA and AE produce heat‑map visualizations of anomalous regions, enabling shifters to pinpoint problematic detector areas instantly. The system is delivered as a web‑based GUI that highlights only histograms flagged as anomalous, while still allowing full access to the complete set if needed.

Performance is evaluated on the full 2022 CMS proton‑proton dataset, encompassing billions of events. Real anomalies—such as HCAL timing glitches, CSC track‑stub occupancy deficits, and unexpected muon pseudorapidity distributions—are used as ground truth. AutoDQM identifies “bad” runs at a rate four to six times higher than the baseline of “good” runs, despite “bad” runs constituting less than 2 % of the total data. In cases where visual inspection of the original DQM histogram shows negligible differences, the beta‑binomial heat map reveals statistically significant deficits, demonstrating the added sensitivity of the statistical layer. The unsupervised ML models successfully flag anomalous histograms that deviate from the learned manifold of good data, confirming that the approach does not require labeled “bad” examples, which are scarce in practice.

The authors acknowledge a limitation: χ²′ scores tend to be biased low for histograms with very few entries, potentially missing subtle anomalies. They suggest future work on improved normalization or weighting schemes, as well as exploring ensemble methods that combine statistical and ML scores into a single meta‑anomaly metric. Integration of AutoDQM into the Level‑1 trigger decision chain and extension to other LHC experiments are proposed as next steps.

In summary, AutoDQM delivers a robust, scalable, and interpretable solution for CMS data‑quality monitoring, merging principled Bayesian statistics with modern unsupervised deep learning. It dramatically reduces the manual burden on shifters, improves detection of subtle detector failures, and sets a new benchmark for automated quality assurance in high‑energy physics experiments.

