RangeAD: Fast On-Model Anomaly Detection


Authors: Luca Hinkamp, Simon Klüttermann, Emmanuel Müller

Luca Hinkamp¹ [0009-0004-1547-1590], Simon Klüttermann¹ [0000-0001-9698-4339], and Emmanuel Müller¹,² [0000-0002-5409-6875]
¹ TU Dortmund University, Dortmund, Germany
{luca.hinkamp, simon.kluettermann, emmanuel.mueller}@cs.tu-dortmund.de
² Research Center Trustworthy Data Science and Security, UA Ruhr, Germany

Abstract. In practice, machine learning methods commonly require anomaly detection (AD) to filter inputs or detect distributional shifts. Typically, this is implemented by running a separate AD model alongside the primary model. However, this separation ignores the fact that the primary model already encodes substantial information about the target distribution. In this paper, we introduce On-Model AD, a setting for anomaly detection that explicitly leverages access to a related machine learning model. Within this setting, we propose RangeAD, an algorithm that utilizes neuron-wise output ranges derived from the primary model. RangeAD achieves superior performance even on high-dimensional tasks while incurring substantially lower inference costs. Our results demonstrate the potential of the On-Model AD setting as a practical framework for efficient anomaly detection.

Keywords: Anomaly Detection · Outlier Detection · On-Model ML

1 Introduction

Machine learning (ML) has become a cornerstone of modern software systems; however, the transition from controlled experimental settings to practical, real-world deployment requires robust safeguards. Anomaly Detection (AD) is critical to this transition, acting as a gatekeeper for system reliability [38,17]. The necessity of AD in production environments is driven by three primary challenges.
First, the principle of "garbage in, garbage out" dictates that models perform unpredictably when fed corrupted or irrelevant data; AD serves as a crucial filter to reject such inputs before they degrade system performance [11]. Second, real-world environments are non-stationary. As data distributions evolve (a phenomenon known as concept drift), AD allows systems to detect when current inputs no longer match the training distribution, signaling the need for adaptation [22]. Finally, standard ML models are typically optimized for the average case and suffer from underspecification when facing "black swan" events or outliers. AD acts as a safety mechanism to identify and handle these edge cases where model guarantees fail [8].

Despite these necessities, integrating AD into production pipelines remains challenging. Typically, AD is deployed as a secondary system running alongside the primary predictive model. This duality increases computational overhead and introduces a risk of misalignment, where the AD model's definition of "normal" does not perfectly map to the primary model's operational domain. Furthermore, traditional AD methods are often hindered by the curse of dimensionality [2], which limits the complexity of inputs they can handle effectively. Perhaps most critically, the fundamental unknowability of future anomalies makes hyperparameter tuning and model selection a near-impossible task; optimizing for known anomalies often fails to generalize to unforeseen contingencies encountered in the wild [29,19]. These limitations are compounded by the high runtime of many state-of-the-art AD algorithms, which renders them unsuitable for real-time applications such as user input verification or high-frequency time-series monitoring [9].
To address these challenges, we introduce RangeAD, a novel framework that provides highly accurate, application-specific anomaly scores with close-to-zero computational overhead. Our key insight is that the predictive model already contains the information needed to identify anomalies. By exploiting the activation ranges of neurons within the trained model, we can derive a robust indication of "anomalousness" without the need for an external detector. Because these activation ranges are intrinsic to the primary model, they naturally align with the application's specific goals. Crucially, these statistics can be calculated during the standard forward pass of the model, resulting in negligible additional time cost.

Our main contributions are as follows:

– We introduce the On-Model AD setting, a framework for ML-ready anomaly detection that bridges the gap between theoretical AD and practical deployment.
– We propose the RangeAD algorithm, which leverages internal neural activation ranges to detect anomalies in real-time.
– We provide a comprehensive ablation study to validate our design choices and demonstrate the efficacy of our method against current baselines.

To facilitate reproducibility and further research, our code is available at anonymous.4open.science/r/RangeAD.

2 Related Work

Anomaly Detection (AD) is a long-standing and widely studied problem in machine learning, with a rich body of literature spanning classical statistical methods, modern machine learning approaches, and deep learning-based techniques. Its importance in practical machine learning systems has been repeatedly emphasized in the literature [40]. Numerous surveys provide comprehensive overviews of the field and its methodological diversity [38,9,13,33].
AD plays a critical role across a wide range of application domains, including fraud detection [15,48], industrial and machine fault detection [10,31,35], and scientific data analysis such as particle physics experiments [30,7]. In many real-world systems, anomaly detection operates alongside other machine learning components to ensure reliability, robustness, and safety.

In practice, anomaly detection frequently serves as a monitoring mechanism for machine learning systems deployed in sensitive domains such as healthcare [18], Internet-of-Things (IoT) infrastructures [1], and financial monitoring systems [15]. In these settings, predictive models are typically trained to perform domain-specific tasks (e.g., diagnosis, forecasting, or classification), while anomaly detection is implemented as a separate component tasked with identifying abnormal inputs or system behavior. However, these two systems are commonly developed independently, even though they operate on the same data streams and often rely on related representations. As a consequence, the anomaly detector may fail to fully exploit the information already learned by the primary model, potentially leading to inefficiencies or misalignment between the anomaly detection objective and the model's operational domain.

A central challenge in anomaly detection is the scarcity or complete absence of labeled anomaly samples. Most AD methods operate in an unsupervised or semi-supervised setting where only normal data is available during training. It is well established that even small amounts of labeled information can significantly improve detection performance by guiding the model toward the most informative subspaces of the data [51,38,13]. Motivated by this observation, we propose the On-Model AD setting, which leverages the presence of a supervised model trained for a related task.
Although the supervised model itself is not trained for anomaly detection, it encodes task-relevant structure that can provide valuable guidance for identifying abnormal inputs. This intuition is supported by prior work on neural network representations.

Deep neural networks are known to learn hierarchical feature representations, where deeper layers capture increasingly abstract and task-relevant information [49]. Several approaches have exploited this property by applying classical anomaly detection algorithms to the learned representations of neural networks rather than the raw input space [47,41]. Such approaches can significantly improve the performance of traditional AD algorithms on high-dimensional and complex datasets, as the learned embeddings often provide a more structured representation of the data.

However, representation-based approaches typically suffer from several limitations. Many methods rely on extracting a low-dimensional embedding from the supervised model, which can constrain the information available to the anomaly detector and require architectural modifications or additional training procedures. Furthermore, these approaches often assume strong alignment between the supervised task and the anomalies of interest, which may not hold in practice. They also typically rely on accessing only a small subset of internal representations, limiting the amount of information that can be exploited from the trained model.

Our work takes inspiration from a different property of neural networks: the phenomenon of dead neurons in networks with rectified linear unit (ReLU) activations [28]. During training, some neurons become permanently inactive because their inputs never enter the positive activation regime.
More broadly, the learned weights and the distribution of training data implicitly constrain the range of activation values that neurons can produce under normal operating conditions, even when using alternative activation functions. This observation suggests that each neuron implicitly defines a feasible activation range determined by the training data and model parameters. Deviations from these learned ranges, therefore, provide a natural signal for detecting anomalous inputs.

Building on this insight, our proposed method leverages activation ranges throughout the network to detect anomalies directly during the forward pass of a trained model. Unlike approaches that train a separate detector or rely on specialized representations, our framework integrates anomaly detection into the predictive model itself. This enables extremely efficient detection with negligible computational overhead while simultaneously leveraging the task-specific knowledge already captured by the pretrained model. As a result, our approach bridges the gap between standalone anomaly detection systems and the predictive models they are meant to safeguard.

3 Methodology

3.1 On-Model Anomaly Detection

We consider the following scenario: We assume a typical machine learning problem, e.g., a classification problem that is trained by applying a neural network to a set of normal data. During deployment, in addition to normal data, anomalies might also appear. Our method then tries to capture these anomalies using both normal data (one-class classification [39]) and the initial neural network's weights. We consider anomaly detection algorithms that can benefit from additional information through a trained neural network as On-Model Anomaly Detection.
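In this setting, the detector only needs read access to the intermediate activations of the trained network. The following sketch is our own illustration of collecting per-neuron pre-activation outputs during a single forward pass; a plain-NumPy ReLU MLP with random placeholder weights stands in for the trained classifier.

```python
import numpy as np

def forward_with_preacts(x, weights, biases):
    """Forward pass of a ReLU MLP that also returns the pre-ReLU
    outputs of all hidden layers, concatenated per sample."""
    preacts = []
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = h @ W + b
        if i < len(weights) - 1:      # hidden layer: record pre-activation, apply ReLU
            preacts.append(z)
            h = np.maximum(z, 0.0)
        else:                          # output layer: raw logits
            h = z
    return h, np.concatenate(preacts, axis=1)

# Hypothetical shapes: 30 input features, two hidden layers of 100, 5 classes.
rng = np.random.default_rng(0)
dims = [30, 100, 100, 5]
Ws = [rng.normal(0, 0.1, (a, b)) for a, b in zip(dims[:-1], dims[1:])]
bs = [np.zeros(b) for b in dims[1:]]
logits, acts = forward_with_preacts(rng.normal(size=(8, 30)), Ws, bs)
# acts has one column per monitored neuron (here 200), ready for range statistics.
```

In a deep learning framework, the same effect could be achieved with forward hooks on the trained model rather than a manual forward pass.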
Given a neural network classifier $f$ with pretrained parameters $\theta_i^k \in \Theta$ for every neuron $n_i^k$ in the layers $l^k \in L$ making up the network $f$, and a dataset of normal samples $x_j \in X_{\mathrm{train}}$, our method outputs scores $s(\hat{x}_j, X_{\mathrm{train}}, f) \in \mathbb{R}$ that capture the probability of $\hat{x}_j \in X_{\mathrm{test}}$ being an anomaly. By applying a threshold $t$, these scores $s(\hat{x}_j, X_{\mathrm{train}}, f) > t$ can then be converted to binary decisions with the label $y_{\mathrm{ano}} \in \{+1, -1\}$ for being an anomaly or not.

The proposed methodology comprises three distinct phases: training, preparation, and inference. In the training phase, the initial neural network $f$ is trained on, e.g., a classification problem with labeled data $(x, y) \in X_{\mathrm{train}}$. After that, metrics based on clean data, $M(X_{\mathrm{prep}}, f)$, are extracted from the trained network as preparation. With these, an anomaly score can be assigned to unknown test data during the inference phase. We note that $X_{\mathrm{train}}$ and $X_{\mathrm{prep}}$ can but need not be the same; however, they should contain only uncontaminated data, while $X_{\mathrm{test}}$ may contain anomalies.

3.2 RangeAD

Our instantiation of On-Model AD, which we name RangeAD, requires the following metrics $M(X_{\mathrm{prep}}, f)$: We build on the idea that data considered to be normal, e.g., in-distribution, usually stays in a finite, determinable range of values, and everything outside is abnormal. While such a check only finds a limited number of very severe anomalies, the same is true in subspaces of the data, such as those created by the outputs of various neurons in a trained neural network. By considering all neurons of a related neural network, we can thus gain a more decisive anomaly score that also benefits from setting-specific idiosyncrasies of the related network.
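The range construction and counting just described can be sketched end-to-end as follows. This is a hedged NumPy illustration with our own function names, operating on activations assumed to be already collected as (samples × neurons) matrices; the threshold expressed as a fraction of monitored neurons mirrors the option discussed in the paper.

```python
import numpy as np

def fit_ranges(acts_prep, sigma=0.01):
    """Per-neuron normal interval [Q(sigma), Q(1 - sigma)] from clean activations."""
    lo = np.quantile(acts_prep, sigma, axis=0)
    hi = np.quantile(acts_prep, 1.0 - sigma, axis=0)
    return lo, hi

def range_scores(acts_test, lo, hi):
    """Anomaly score = number of activations outside their normal interval."""
    outside = (acts_test < lo) | (acts_test > hi)
    return outside.sum(axis=1)

# Usage with synthetic activations: a shifted sample violates many intervals.
rng = np.random.default_rng(0)
acts_prep = rng.normal(0.0, 1.0, size=(1000, 50))        # clean preparation data
lo, hi = fit_ranges(acts_prep, sigma=0.01)
acts_test = np.vstack([rng.normal(0.0, 1.0, (1, 50)),    # in-distribution sample
                       rng.normal(5.0, 1.0, (1, 50))])   # shifted, anomalous sample
scores = range_scores(acts_test, lo, hi)
labels = np.where(scores > 0.10 * acts_prep.shape[1], 1, -1)  # 10% of neurons as threshold
```

The shifted sample leaves nearly all 50 intervals, while the in-distribution sample triggers only the few expected by the quantile choice.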
So for every neuron $n_i^k$ in every observed³ layer $l^k$, we calculate the range of normal values in the neural network's feature space as the interval in which clean data produces activation outputs:

$$I_{\mathrm{normal}}(n_i^k) = \left[\, Q\big(n_i^k(X_{\mathrm{prep}}), \sigma\big),\; Q\big(n_i^k(X_{\mathrm{prep}}), 1 - \sigma\big) \,\right]$$

where $n_i^k(X_{\mathrm{prep}})$ is the distribution of observed intermediate activation outputs of the neuron $n_i^k$ generated in the forward pass of $f(X_{\mathrm{prep}})$ and $Q(\cdot, \sigma)$ denotes the $\sigma$-quantile. While for $\sigma = 0$ we consider the maximum observed values, partially contaminated data makes using $\sigma > 0$ preferable.

After the forward pass of test samples $\hat{x} \in X_{\mathrm{test}}$, we check for all of the neurons whether the activations lie outside the previously calculated ranges. The anomaly score $s(\hat{x})$ for a sample $\hat{x}$ is then the number of activations lying outside the respective intervals:

$$s(\hat{x}) = \sum_{k}^{L} \sum_{i}^{N_k} \Theta\big(n_i^k(\hat{x}) \notin I_{\mathrm{normal}}(n_i^k)\big), \qquad \Theta(\mathrm{True}) = 1,\; \Theta(\mathrm{False}) = 0 \tag{1}$$

where $\Theta(\cdot)$ is a function that returns 1 if the statement is true and 0 otherwise. A higher count of out-of-range activations indicates a greater likelihood that the sample is an anomaly. Figure 1 illustrates this process.

To convert continuous anomaly scores into binary decisions, the default approach selects a threshold based on the expected number of anomalies in the test set [50]. When using RangeAD, we can also select a more intuitive threshold based on the fraction of observed neurons (e.g., 10%).

4 Evaluation

In this section, we present a comprehensive empirical evaluation of our proposed anomaly detection framework. We benchmark our method against several baselines across three distinct data modalities.
For each experimental scenario, out-of-distribution (OOD) test sets are systematically curated by either introducing external anomalous samples or selectively filtering existing classes to ensure contextually relevant detection challenges. In every experiment, our framework is built on a classification neural network trained on clean data (in-distribution) in the respective domain, and the detection performance is evaluated on a contaminated testing dataset. Our primary experiments focus on tabular data, which remains a cornerstone of anomaly detection research. To demonstrate the versatility of our approach, we extend our evaluation to image and time-series classification tasks. Finally, we conduct a detailed ablation study to assess the sensitivity of our method to various implementation choices and hyperparameters.

³ We do not consider every possible layer of the neural network, since, e.g., ReLU or pooling layers only represent degraded copies of previous layers.

Fig. 1: Depiction of our proposed methodology. (a) represents the neural network, from which the sample activations per neuron are depicted in (b). From these distributions, considered to be normal, interval borders are derived. The activations produced for a new test input are then checked per neuron to see whether they fall inside the corresponding interval in (c). For every test sample x_i, the number of feature outputs lying outside of the corresponding interval indicates the anomaly grade of x_i.

For the vision and time-series tasks, we took the σ = 1% / 0.1% quantiles to derive the interval borders from the activation distribution, and the σ = 1% quantile for the tabular data, for which other variants are covered in the corresponding ablation study. To measure the efficiency of our algorithm, we separate the required runtime into three phases. Both the training and forward pass of the initial neural network occur whether we use it for on-model anomaly detection or not, and are thus ignored here. The training time of our method (computing M(X_prep, f)) occurs only once and is thus usually negligible in practice. Hence, we focus here on the inference time, as this represents the time required to classify new samples, and state the remaining times in the supplementary material. All experiments were performed using an Intel Xeon Gold 6258R CPU, 252 GB RAM, and an Nvidia Quadro RTX 6000 GPU with 24 GB VRAM.

4.1 Anomaly detection performance

Tabular data The 12 tabular datasets are selected from OddBench [9] by taking all datasets with multiple normal classes and at least 30 features, for which our networks achieve at least 70% classification accuracy on the clean part of the test data⁴. Thus, we take the "BaseballEvents", "CasePriority", "CreativeSchoolCertification", "FinanceJobCategories", "GamePositionAnomaly", "HousingVulnerability", "NFLPerformanceAnomalies", "SentencingFraud", "TravelWeatherScores", "TreeCondition", "WeatherUVIndex", and "WindowsEditionAnomaly" datasets, which each contain multiple normal classes in addition to anomalies. For our experiments, we train a simple three-layer Multilayer Perceptron (MLP) with ReLU as a non-linear activation function on the normalized, multi-class training data as a classifier. We employ hidden layers with 100, 500, and 1000 neurons to achieve diversity, roughly 2-50 times the dataset's feature count. We take the outputs of the first two hidden layers (before ReLU activation) for calculating the anomaly detection ranges. In the supplementary material, we compare alternative layer choices for our method. The neural networks are trained for 500 epochs with a learning rate of 0.001. While evaluating our anomaly detection method, we use the training dataset during the building phase and add the anomalous class to the test data for evaluation.

We evaluate the proposed method mainly on runtime and the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) and compare it with several other anomaly detection techniques from the literature on these two metrics across the same 12 tabular datasets. The competitors comprise Autoencoder [42], CBLOF [14], COF [46], COPOD [23], DAGMM [52], DEAN [6], DeepSVDD [39], DTE [27], ECOD [24], GOAD [3], HBOS [12], ICL [44], Isolation Forest [25], k-Nearest Neighbour [36], LODA [34], LOF [5], OC-SVM [4], and SOD [20], providing various angles on anomaly detection.

Fig. 2: AUC-ROC of different anomaly detection methods over 12 tabular datasets. Our method is run three times with models having hidden-layer neuron counts of 100, 500, and 1000. It reaches the highest AUC-ROC on average compared to all competitors.

The anomaly detection results can be seen in Figure 2. Our method achieves the highest AUC-ROC most often on the datasets compared to the competitors.
Additionally, all three model sizes lead to an average AUC-ROC higher than all competitors, with a margin of 0.03 to the second-best approach (0.79 for our worst model compared to 0.76 for the autoencoder). In seven of the twelve datasets, the top AUC-ROCs were above 0.9, and our approach most often ranked among the top three competitors. In the remaining five datasets, our method still achieved AUC-ROCs in the better half of the compared methods. Having the highest average score across all competitors for all three model sizes shows that our approach is competitive with all the compared methods. We provide a critical difference plot in the supplementary material.

⁴ We filter for the performance of the related network, as we desire models that can be deployed in practice.

Figure 3 shows the results of the runtime evaluation. Our method has an average inference time of around 2 ms across all three model sizes, while the second-best competitor, an Autoencoder, is roughly 100 times slower. The fastest competitor, DeepSVDD, still needs almost double the runtime for an average inference pass over the test dataset, at 3.7 ms. The results highlight the benefits of the On-Model AD setting, as our method is the fastest and best-performing algorithm among those used: it outperforms the deeper and slower methods in terms of detection, and at the same time it is faster than the lightweight competitors. Comparing the three hidden-layer sizes used in our method, the wider networks seem to perform better while naturally being a bit slower. This indicates a trade-off, which we will cover in more detail in the ablation study.
Fig. 3: AUC-ROC and inference time of different anomaly detection methods averaged over 12 tabular datasets. Our method is run three times with models with hidden-layer neuron counts of 100, 500, and 1000. It achieves the highest detection performance while providing the fastest inference runtime of all competitors. We provide the same analysis on the training time in the supplementary material.

Vision Data While the previous section shows that our method achieves comparable or even better performance than the competitors on relatively low-dimensional tabular data, we also aimed to evaluate more complex scenarios. Therefore, we conduct anomaly detection experiments on image data, taking a modern Swin-v2 Vision Transformer [26] model pretrained on ImageNet-1k (in-distribution) as our base network. The feature spaces from which the detection intervals are calculated are taken either after every GELU activation function call ("in") or after a whole Transformer block ("after"). We note that all neurons could also be used at the cost of higher memory usage, but for our experiments, the 17664/4416 neurons (GELU/Transformer block) were already enough to perform well. We utilize the 100,000 ImageNet test images as the preparation dataset for building our anomaly detector. The test dataset to evaluate on consists of the whole Imagenette subset [16] (train+test) as in-distribution and two different datasets from domains that are not contained in ImageNet-1k as out-of-distribution, namely AstronomyImages [45]⁵ and Ambivision [32], a dataset of animal-based optical illusions, to bring in multiple angles for the OOD domain. We perform the evaluation on the two OOD datasets, both separately and combined, and compare our method against an Autoencoder, Isolation Forest, and DeepSVDD, the top three competitors from the tabular experiment, excluding ICL for computational reasons.

⁵ The classes "cosmos space", "galaxies", "nebula", and "stars" were used, as they provide the most similar photographs, and ImageNet-1k contains only very few space-based images.

| Dataset | in, 0.01 | after, 0.01 | in, 0.001 | after, 0.001 | IF | AE | DeepSVDD |
|---|---|---|---|---|---|---|---|
| INet + Ambivision | 0.9893 | 0.9923 | 0.9945 | 0.9937 | 0.9985 | 0.9472 | 0.8830 |
| INet + SpaceImages | 0.9625 | 0.9094 | 0.9592 | 0.909 | 0.4977 | 0.1864 | 0.6757 |
| INet + both | 0.971 | 0.9355 | 0.9703 | 0.9357 | 0.6597 | 0.4266 | 0.6833 |
| UrbanSound [43] | 0.8804 | - | 0.9142 | - | 0.6939 | 0.8002 | 0.7927 |

Table 1: AUC-ROC of vision and time series tasks. "in" means in the network blocks, e.g., after GELU and convolution layers; "after" means after the network blocks, e.g., Transformer block (vision only). 0.01 and 0.001 are the quantiles used. AE and DeepSVDD were trained with 10% of the INet test set for memory reasons. Our method reaches the most reliable performance across high-dimensional datasets.

The results from this experiment show that our approach can compete with the baselines in the vision anomaly detection task as well, and even outperforms them in almost every setup. The AUC-ROCs can be found in Table 1. When taking Ambivision as the OOD dataset, the Isolation Forest has an AUC-ROC negligibly (0.004) higher than the best variant of our method, which takes activations after the GELU output and uses the 0.1% quantile. Despite that, with SpaceImages as OOD or using both OOD datasets, all variants of our method perform at least 0.2 AUC-ROC better than the competitors. Internally, the GELU feature space tends to provide a better anomaly detection capability than the outputs at the end of the Transformer block. We discuss alternative layer choices in the supplementary material.

The inference runtimes show our method in an even better light compared to the three competitors. Table 2 shows that our method takes roughly 0.014 seconds to perform inference. The Isolation Forest performs inference in roughly 0.8 seconds, over 50 times more than ours, while the two deep approaches take several minutes. Across the different variants tested for our method, there is almost no difference in runtime, whether the observed layers or the quantiles are varied, highlighting the scalability of RangeAD.

| Dataset | in, 0.01 | after, 0.01 | in, 0.001 | after, 0.001 | IF | AE | DeepSVDD |
|---|---|---|---|---|---|---|---|
| INet + Ambivision | 0.0151 | 0.0161 | 0.0131 | 0.0129 | 0.8197 | 240.9355 | 133.9441 |
| INet + SpaceImages | 0.0131 | 0.012 | 0.0123 | 0.0129 | 0.8438 | 238.3285 | 138.8146 |
| INet + both | 0.0129 | 0.0126 | 0.0129 | 0.0136 | 0.8560 | 251.041 | 141.8705 |
| UrbanSound [43] | 0.0138 | - | 0.0103 | - | 0.0711 | 0.8833 | 0.5184 |

Table 2: Inference time in seconds of vision and time series tasks. "in" means in the network blocks, e.g., after GELU and convolution layers; "after" means after the network blocks, e.g., Transformer block (vision only). 0.01 and 0.001 are the quantiles used. AE and DeepSVDD were trained with 10% of the INet test set for memory reasons. Our method requires multiple orders of magnitude less runtime, as it does not use an additional model.

Time Series In addition to tabular and image data, we wanted to evaluate our method in a third domain to demonstrate its general applicability.
We choose time-series data and use the UrbanSound dataset [43], which consists of 44.1 kHz recordings of typical urban environments, for acoustic analysis. The dataset consists of ten classes representing common sounds from the urban area, like children playing, dogs barking, or construction site noise. Within this framework, gunshot sounds are designated as the OOD class, as they represent sparse, high-impact events that are naturally anomalous to the urban soundscape. This scenario is modeled using a temporal convolutional network (TCN) [21] coupled with an MLP classification head for classifying the sound data. We train the model for 20 epochs on the normal classes, achieving roughly 65% test accuracy. To calculate the anomaly detection ranges, we use the output of the convolution layers. After building our anomaly detection method, we add the anomaly class to the test data for evaluation. We compare the results to the same three competitors from the vision task: Autoencoder, Isolation Forest, and DeepSVDD.

Table 1 and Table 2 show the AUC-ROCs and inference times, respectively. While the best competitor, an Autoencoder, achieves around 0.8 AUC-ROC, our method detects anomalies with an AUC-ROC around 0.9. The difference between using the 1% and 0.1% quantile is over 0.03, compared to around 0.01 in the vision tasks, highlighting that tuning this hyperparameter can slightly increase the performance depending on the scenario, but is not strictly necessary. The inference runtime, at 0.013 seconds, is on par with the vision task, as it depends only on the sample count and the neurons used, not the feature size, which would be the most influential difference to the time-series dataset. The three competitors, on the other hand, scale worse with higher feature dimensions, needing from 0.07 seconds to 0.88 seconds, showing that our method is also more efficient and effective for time-series anomaly detection.

Fig. 4: When using differently sized hidden layers in our neural networks, AUC-ROC and model performance are correlated. The results represent the average of 5 runs with standard deviation over the 12 tabular datasets.

4.2 Ablation Study

To understand the details and variations of our method, we investigate different aspects of the approach in an ablation study. We therefore evaluate the behavior in different scenarios on the tabular datasets.

Model Size Ablations First, we want to investigate how model size and feature space size affect anomaly detection performance. For this, we alter the tabular setup so that the initial classifier consists of hidden layers with 10 to 3000 neurons. By taking the feature space of the first two layers of the MLP network, our method detects anomalies based on 20 to 6000 activation intervals. For each model size, the detection performance was captured as the AUC-ROC. While the accuracy of the initial classifier increases logarithmically and, interestingly, does not exhibit overfitting, as shown in Figure 4, the detection performance mostly follows this pattern closely. Both exhibit sharp initial gains at smaller scales, followed by shallower increases at larger layer sizes. As test accuracy continues to improve beyond 1000 neurons per layer, the AUC-ROC reaches its peak at 750 neurons and then declines slightly, indicating modest oversaturation.
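The ablation above, like the experiments in Section 4, ranks configurations by AUC-ROC. For reference, this metric can be computed directly from the two score populations as a rank statistic; the following is a hedged sketch, not the evaluation code used in the paper.

```python
import numpy as np

def auc_roc(scores_normal, scores_anomalous):
    """P(a random anomaly scores higher than a random normal sample), ties count 0.5.
    Equivalent to the normalized Mann-Whitney U statistic."""
    s_n = np.asarray(scores_normal, dtype=float)
    s_a = np.asarray(scores_anomalous, dtype=float)
    greater = (s_a[:, None] > s_n[None, :]).sum()   # anomaly outranks normal
    ties = (s_a[:, None] == s_n[None, :]).sum()     # ties contribute half
    return (greater + 0.5 * ties) / (s_a.size * s_n.size)
```

A value of 0.5 corresponds to random scoring and 1.0 to perfect separation, matching how the figures above should be read.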
Model Training Ablations After investigating the model size, we seek to understand the influence of the training of the base network. Our assumption here is that with a better-performing, longer-trained network, the activation ranges become more meaningful, leading to a more precise anomaly separation and superior detection performance. To achieve this comparison, we train the initial classifier and measure detection performance after every epoch. As the previous experiment showed a dependency of the performance on the model size, and to provide more meaningful outcomes, we use 100, 500, and 1000 neurons in the MLP's hidden layers.

From the results in Figure 5a, we note that across the three tested model sizes, the detection performance increased over the epochs, indicating generally better performance with better-trained models. Additionally, we note that even before training, the AUC ROC was high. Compared with the competitors in Figure 2, our models achieve an average AUC ROC higher than almost all competitors, even in an untrained state. From this, we infer that the intermediate feature spaces of neural networks are intrinsically good indicators for anomalies. By training these spaces, one can further improve these capabilities, but our method likely also still works well when the related model is unrelated to the type of anomalies we are searching for.

Fig. 5: Ablation study results. (a) Anomaly detection performance as AUC ROC of our method for three different model sizes, measured after every training epoch of the initial classifier. Even with minimal training, our method reaches competitive performance. (b) AUC ROC and false-positive rate (FPR) for different approaches to calculating the interval borders. The quantile-based approach reaches the best balance between FPR and anomaly detection performance.

Range Border Retrieval The final ablation study evaluates different methods for determining the borders of the normal intervals derived from each neuron's activation output. Our primary approach uses quantiles close to the extrema of the activation distribution. We therefore evaluate different values of σ, which control the distance of the selected quantiles from the distribution boundaries, in a tabular setting using an MLP with 1000 hidden units. We additionally consider approaches that place borders further from the extrema. First, we use the quartiles (q1, q3) and extend them by adding a fraction t of the interquartile range (IQR), producing borders between the quartiles and the extrema. Second, we apply the same idea using the minimum and maximum together with the min-max range, resulting in borders outside the observed activation distribution. The scaling parameter t varies from 0.001 to 0.9, while σ ranges from 0 to 0.1. Borders placed deep within the distribution classify many samples as anomalies, whereas borders outside the distribution yield few detections. To analyze this trade-off, we compare the AUC ROC to the resulting false-positive rate (FPR). Binary predictions for the FPR are obtained using a threshold based on 10% of the monitored neurons (200 of 2000 for two layers). Figure 5b shows the resulting AUC-ROC-FPR trade-offs. The extrema-based method yields the lowest FPR but also the lowest AUC ROC (maximum at 0.76).
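The three border-retrieval variants in this ablation can be written down directly. This is a hedged numpy sketch under stated assumptions; the function names are ours, not the paper's:

```python
import numpy as np

def quantile_borders(a, sigma=0.01):
    # Primary variant: quantiles close to the extrema of the activation
    # distribution (borders just inside the observed range).
    return np.quantile(a, sigma, axis=0), np.quantile(a, 1.0 - sigma, axis=0)

def iqr_borders(a, t=0.2):
    # Quartiles q1/q3 widened by a fraction t of the IQR:
    # borders between the quartiles and the extrema.
    q1 = np.quantile(a, 0.25, axis=0)
    q3 = np.quantile(a, 0.75, axis=0)
    iqr = q3 - q1
    return q1 - t * iqr, q3 + t * iqr

def minmax_borders(a, t=0.2):
    # Min/max widened by a fraction t of the min-max range:
    # borders strictly outside the observed activation distribution.
    mn, mx = a.min(axis=0), a.max(axis=0)
    span = mx - mn
    return mn - t * span, mx + t * span

# Toy activations for 8 monitored neurons.
gen = np.random.default_rng(1)
acts = gen.normal(size=(1000, 8))
q_lo, q_hi = quantile_borders(acts)
i_lo, i_hi = iqr_borders(acts)
m_lo, m_hi = minmax_borders(acts)
```

On this toy data the quartile-based borders are the tightest and the min-max borders the loosest, which mirrors the FPR ordering discussed above: tighter borders flag more samples.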
The quantile- and quartile-based approaches achieve similar AUC ROC values close to 0.8, although the quartile-based method produces a substantially higher FPR. Overall, the quantile-based approach provides the most favorable trade-off in this setting.

5 Conclusion and Future Work

In this paper, we introduce a novel anomaly detection setting that exploits a related machine learning model, which we call On-Model AD. We also introduce a technique that can leverage this setting, RangeAD. Our scenario and algorithm are especially useful in complicated production environments, where a related neural network model often exists to solve tasks beyond anomaly detection, e.g., classification. By leveraging the neural network's intermediate feature space and analyzing the activation outputs of the preparation data, our method constructs a single interval of normal activations per neuron. Based on this building phase, our method detects anomalies by aggregating the information whether the observed data's activations fall inside or outside each neuron's interval. We examined our method in anomaly detection scenarios across tabular, vision, and time series domains, in each of which it proved superior to the competitors with regard to both detection quality and efficiency. In an ablation study, we further analyzed the approach and made multiple changes to the detection framework to highlight its variations and limitations.

While this study focused on point-wise anomalies within static classification tasks, the framework's versatility suggests high extensibility to regression and streaming time-series environments. A particularly compelling frontier lies in Natural Language Processing. Given the scalability demands and the problem of anomalous Large Language Model states (hallucinations) [37], our lightweight add-on approach could enhance reliability in validating LLM outputs.
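As a minimal sketch of the detection step, assuming per-neuron intervals have already been fitted and using the 10%-of-neurons decision threshold from the FPR experiment (the function name is illustrative, not the authors' code):

```python
import numpy as np

def flag_anomalies(activations, lo, hi, frac=0.10):
    # A sample is flagged when more than a fraction `frac` of the
    # monitored neurons fall outside their normal interval
    # (the FPR experiment uses 10%, i.e. 200 of 2000 neurons).
    outside = (activations < lo) | (activations > hi)
    return outside.mean(axis=1) > frac

# Toy intervals for 10 monitored neurons, each [0, 1].
lo = np.zeros(10)
hi = np.ones(10)
inlier = np.full((1, 10), 0.5)                 # 0/10 neurons outside
outlier = np.array([[0.5] * 8 + [2.0, -1.0]])  # 2/10 neurons outside
flags = flag_anomalies(np.vstack([inlier, outlier]), lo, hi)
```

Here `flags` evaluates to `[False, True]`: only the shifted sample exceeds the 10% threshold, while the inlier triggers no interval violation at all.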
Future research will explore architectural optimizations to further refine this on-model detection, including selective neuron pruning, automated optimal layer selection, and specialized regularization techniques designed to cultivate more discriminative feature representations during training. These advancements aim to further reduce computational overhead while maximizing the sensitivity of the anomaly detection boundary.

Acknowledgments. This research was supported by the Research Center Trustworthy Data Science and Security (https://rc-trust.ai), one of the Research Alliance centers within the University Alliance Ruhr (https://uaruhr.de).

References

1. A novel machine learning pipeline to detect malicious anomalies for the internet of things. Internet of Things 20, 100603 (2022). https://doi.org/10.1016/j.iot.2022.100603
2. Bellman, R.: Dynamic programming. Science 153(3731), 34–37 (1966)
3. Bergman, L., Hoshen, Y.: Classification-based anomaly detection for general data. In: ICLR. OpenReview.net (2020), http://dblp.uni-trier.de/db/conf/iclr/iclr2020.html#BergmanH20
4. Bounsiar, A., Madden, M.G.: One-class support vector machines revisited. In: 2014 International Conference on Information Science Applications (ICISA). pp. 1–4 (2014). https://doi.org/10.1109/ICISA.2014.6847442
5. Breunig, M., Kröger, P., Ng, R., Sander, J.: LOF: Identifying density-based local outliers. vol. 29, pp. 93–104 (06 2000). https://doi.org/10.1145/342009.335388
6. Böing, B., Klüttermann, S., Müller, E.: Post-robustifying deep anomaly detection ensembles by model selection. In: ICDM (2022)
7. Craig, N., Howard, J.N., Li, H.: Exploring Optimal Transport for Event-Level Anomaly Detection at the Large Hadron Collider (1 2024)
8. D'Amour, A., et al.: Underspecification presents challenges for credibility in modern machine learning (2020)
9.
Ding, X., Klüttermann, S., Wen, H., Chen, Y., Akoglu, L.: Macro data: New benchmarks of thousands of datasets for tabular outlier detection (2026), https:
10. Dong, L., Shulin, L., Zhang, H.: A method of anomaly detection and fault diagnosis with online adaptive learning under small training samples. Pattern Recognition 64 (2017)
11. Geiger, R.S., Cope, D., Ip, J., Lotosh, M., Shah, A., Weng, J., Tang, R.: "Garbage in, garbage out" revisited: What do machine learning application papers report about human-labeled training data? (2021). https://doi.org/10.1162/qss_a_00144
12. Goldstein, M., Dengel, A.R.: Histogram-based outlier score (HBOS): A fast unsupervised anomaly detection algorithm (2012)
13. Han, S., Hu, X., Huang, H., Jiang, M., Zhao, Y.: ADBench: Anomaly detection benchmark. In: NeurIPS (2022)
14. He, Z., Xu, X., Deng, S.: Discovering cluster-based local outliers. Pattern Recognition Letters 24(9), 1641–1650 (2003). https://doi.org/10.1016/S0167-8655(03)00003-5
15. Hilal, W., Gadsden, S.A., Yawney, J.: Financial fraud: a review of anomaly detection techniques and recent advances. Expert Systems With Applications (2022)
16. Howard, J., Husain, H.: Imagenette: A smaller subset of 10 easily classified classes from imagenet, and a little more french. https://github.com/fastai/imagenette, accessed: 10.03.2026
17. Kamat, Pooja, Sugandhi, Rekha: Anomaly detection for predictive maintenance in industry 4.0 - a survey 170 (2020). https://doi.org/10.1051/e3sconf/202017002007
18. Khan, M.M., Alkhathami, M.: Anomaly detection in iot-based healthcare: machine learning for enhanced security. Scientific Reports (2024). https://doi.org/10.1038/s41598-024-56126-x
19. Klüttermann, S., Gupta, S., Müller, E.: Evaluating anomaly detection algorithms: The role of hyperparameters and standardized benchmarks. In: DSAA (2025). https://doi.org/10.1109/DSAA65442.2025.11247988
20.
Kriegel, H.P., Kröger, P., Schubert, E., Zimek, A.: Outlier detection in axis-parallel subspaces of high dimensional data. In: Advances in Knowledge Discovery and Data Mining (2009)
21. Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal Convolutional Networks for Action Segmentation and Detection. In: CVPR (2017). https://doi.org/10.1109/CVPR.2017.113
22. Li, B., Gupta, S., Müller, E.: State-transition-aware anomaly detection under concept drifts. Data and Knowledge Engineering 154, 102365 (2024). https://doi.org/10.1016/j.datak.2024.102365, https://www.sciencedirect.com/science/article/pii/S0169023X24000892
23. Li, Z., Zhao, Y., Botta, N., Ionescu, C., Hu, X.: COPOD: Copula-based outlier detection. In: 2020 IEEE International Conference on Data Mining (ICDM) (2020). https://doi.org/10.1109/ICDM50108.2020.00135
24. Li, Z., Zhao, Y., Hu, X., Botta, N., Ionescu, C., Chen, G.: ECOD: Unsupervised outlier detection using empirical cumulative distribution functions (01 2022)
25. Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation forest. In: ICDM (2008)
26. Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., Wei, F., Guo, B.: Swin transformer v2: Scaling up capacity and resolution (2022)
27. Livernoche, V., Jain, V., Hezaveh, Y., Ravanbakhsh, S.: On diffusion modeling for anomaly detection. In: ICLR 2024 (2024), https://openreview.net/forum?id=lR3rk7ysXz
28. Lu, L., Shin, Y., Su, Y., Karniadakis, G.E.: Dying ReLU and initialization: Theory and numerical examples (2020). https://doi.org/10.4208/cicp.OA-2020-0165
29. Ma, M.Q., Zhao, Y., Zhang, X., Akoglu, L.: The need for unsupervised outlier model selection: A review and evaluation of internal evaluation strategies (2023)
30. Mikuni, V., Nachman, B., Shih, D.: Online-compatible unsupervised nonresonant anomaly detection. Phys. Rev. D 105, 055006 (Mar 2022). https://doi.org/10.1103/PhysRevD.105.055006
31.
Miljković, D.: Fault detection methods: A literature survey. In: 2011 Proceedings of the 34th International Convention MIPRO. pp. 750–755 (2011)
32. Newen, C., Hinkamp, L., Ntonti, M., Müller, E.: Do you see what I see? An ambiguous optical illusion dataset exposing limitations of explainable AI (2025)
33. Olteanu, M., Rossi, F., Yger, F.: Meta-survey on outlier and anomaly detection. Neurocomputing 555, 126634 (2023). https://doi.org/10.1016/j.neucom.2023.126634
34. Pevný, T.: Loda: Lightweight on-line detector of anomalies. Mach. Learn. 102(2) (feb 2016). https://doi.org/10.1007/s10994-015-5521-0
35. Radaideh, M.I., Pappas, C., Wezensky, M., Ramuhalli, P., Cousineau, S.: Fault prognosis in particle accelerator power electronics using ensemble learning. ArXiv abs/2209.15570 (2022)
36. Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: SIGMOD (2000). https://doi.org/10.1145/342009.335437
37. Rateike, M., Cintas, C., Wamburu, J., Akumu, T.L., Speakman, S.D.: Weakly supervised detection of hallucinations in LLM activations. In: NeurIPS 2023 Workshop on Socially Responsible Language Modelling Research (SoLaR) (2023)
38. Ruff, L., Kauffmann, J.R., Vandermeulen, R.A., Montavon, G., Samek, W., Kloft, M., Dietterich, T.G., Muller, K.R.: A unifying review of deep and shallow anomaly detection. Proceedings of the IEEE 109(5), 756–795 (May 2021)
39. Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S.A., Binder, A., Müller, E., Kloft, M.: Deep one-class classification. In: ICML (2018)
40. Röchner, P., Klüttermann, S., Rothlauf, F., Schlör, D.: We need to rethink benchmarking in anomaly detection (2025)
41. Saeedi, J., Giusti, A.: Anomaly detection for industrial inspection using convolutional autoencoder and deep feature-based one-class classification. pp. 85–96 (01 2022).
https://doi.org/10.5220/0010780200003124
42. Sakurada, M., Yairi, T.: Anomaly detection using autoencoders with nonlinear dimensionality reduction. In: MLSDA 2014 - Machine Learning for Sensory Data Analysis. p. 4–11. MLSDA'14 (2014). https://doi.org/10.1145/2689746.2689747
43. Salamon, J., Jacoby, C., Bello, J.P.: A dataset and taxonomy for urban sound research. In: Proceedings of the 22nd ACM International Conference on Multimedia. pp. 1041–1044 (2014)
44. Shenkar, T., Wolf, L.: Anomaly detection for tabular data with internal contrastive learning. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=_hszZbt46bT
45. Srivastava, A.: Astronomy image classification dataset. https://www.kaggle.com/datasets/abhikalpsrivastava15/space-images-category, accessed: 10.03.2026
46. Tang, J., Chen, Z., Fu, A.W.c., Cheung, D.W.: Enhancing effectiveness of outlier detections for low density patterns (2002)
47. Wang, Y., et al.: Unsupervised anomaly detection with compact deep features for wind turbine blade images taken by a drone. International Journal of Computer Vision (2019). https://doi.org/10.1186/s41074-019-0056-0
48. Xu, W., Jang-Jaccard, J., Liu, T., Sabrina, F., Kwak, J.: Improved bidirectional GAN-based approach for network intrusion detection using one-class classifier. Computers (2022). https://doi.org/10.3390/computers11060085
49. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European Conference on Computer Vision (ECCV). pp. 818–833. Springer (2014)
50. Zhao, Y., Nasrullah, Z., Li, Z.: PyOD: A python toolbox for scalable outlier detection. Journal of Machine Learning Research 20(96), 1–7 (2019), http://jmlr.org/papers/v20/19-011.html
51. Zhong, Z., Yu, Z., Yang, K., Chen, C.L.P.: Labels matter more than models: Quantifying the benefit of supervised time series anomaly detection (2025)
52.
Zong, B., et al.: Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In: ICLR (2018)
