FlowCLAS: Enhancing Normalizing Flow Via Contrastive Learning For Anomaly Segmentation


Anomaly segmentation is an essential capability for safety-critical robotics applications that must be aware of unexpected events. Normalizing flows (NFs), a class of generative models, are a promising approach for this task due to their ability to model the inlier data distribution efficiently. However, their performance falters in dynamic scenes, where complex, multi-modal data distributions cause them to struggle with identifying out-of-distribution samples, leaving a performance gap to leading discriminative methods. To address this limitation, we introduce FlowCLAS, a hybrid framework that enhances the traditional maximum likelihood objective of NFs with a discriminative, contrastive loss. Leveraging Outlier Exposure, this objective explicitly enforces a separation between normal and anomalous features in the latent space, retaining the probabilistic foundation of NFs while embedding the discriminative power they lack. The strength of this approach is demonstrated by FlowCLAS establishing new state-of-the-art (SOTA) performance across multiple challenging anomaly segmentation benchmarks for robotics, including Fishyscapes Lost & Found, Road Anomaly, SegmentMeIfYouCan-ObstacleTrack, and ALLO. Our experiments also show that this contrastive approach is more effective than other outlier-based training strategies for NFs, successfully bridging the performance gap to leading discriminative methods. Project page: https://trailab.github.io/FlowCLAS


💡 Research Summary

FlowCLAS tackles the challenging problem of anomaly segmentation in safety‑critical robotics, where detecting and localizing out‑of‑distribution (OoD) objects is essential for autonomous driving, space manipulators, and similar applications. While normalizing flows (NFs) have been attractive for this task because they provide an exact likelihood estimate of the inlier data distribution, they struggle in complex, multimodal scenes. Standard NF‑based approaches model only low‑level pixel statistics and consequently assign high likelihood to many anomalous regions when the “normal” data itself exhibits large variability (different viewpoints, lighting, object configurations).
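The exact likelihood that makes NFs attractive comes from the standard change-of-variables formula: for an invertible flow f mapping a feature x to a latent z = f(x) with base density p_Z (here the learnable Gaussian), the model density is

```latex
\log p_X(x) \;=\; \log p_Z\!\big(f(x)\big) \;+\; \log \left| \det \frac{\partial f(x)}{\partial x} \right|
```

Anomaly scores can then be read directly as low values of this log-likelihood, which is the probabilistic foundation the paper seeks to preserve.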

To overcome this limitation, the authors propose a hybrid training framework that augments the classic maximum‑likelihood (ML) objective of NFs with a discriminative contrastive loss, leveraging Outlier Exposure (OE). The pipeline consists of three main components: (1) a frozen, pre‑trained vision encoder (e.g., a Vision Transformer) extracts dense feature maps from RGB images; (2) a multi‑scale invertible flow network maps these features to a latent space Z, where a multivariate Gaussian (learnable mean μ and diagonal covariance Σ) models the inlier distribution; (3) pseudo‑anomalies are generated by copy‑pasting objects from an auxiliary dataset (e.g., COCO) into normal training images using binary masks, creating a mixed dataset D_mix. Pure outlier images are also fed to the encoder to increase the pool of anomalous samples.
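The copy-paste step in component (3) can be sketched as below. This is not the paper's code; it is a minimal NumPy illustration of pasting a masked outlier crop (e.g., a COCO object) into a normal image, with `paste_outlier` and its arguments being hypothetical names:

```python
import numpy as np

def paste_outlier(normal_img, outlier_obj, obj_mask, top_left):
    """Copy-paste a masked outlier object into a normal image (illustrative sketch).

    normal_img  : (H, W, 3) uint8 normal training image
    outlier_obj : (h, w, 3) uint8 object crop from an auxiliary dataset (e.g., COCO)
    obj_mask    : (h, w) bool binary mask of the object within its crop
    top_left    : (y, x) paste location inside the normal image
    Returns the mixed image and its (H, W) bool anomaly mask.
    """
    mixed = normal_img.copy()
    anomaly_mask = np.zeros(normal_img.shape[:2], dtype=bool)
    y, x = top_left
    h, w = obj_mask.shape
    # Overwrite only the masked pixels, keeping the normal background intact.
    region = mixed[y:y + h, x:x + w]
    region[obj_mask] = outlier_obj[obj_mask]
    anomaly_mask[y:y + h, x:x + w] = obj_mask
    return mixed, anomaly_mask
```

Applying this to normal training images yields the mixed dataset D_mix, whose binary masks supply the normal/anomalous labels used by the contrastive loss.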

During training, two losses are jointly optimized:

  • Maximum‑likelihood loss (L_ml) – encourages latent vectors belonging to normal pixels to follow the base Gaussian, thus preserving the probabilistic foundation of NFs.

  • Contrastive loss (L_con) – follows the supervised InfoNCE formulation. Latent maps are projected to a lower‑dimensional space via a 1×1 convolution and L2‑normalization. An anchor set A is built by sampling equal numbers of normal and anomalous latent vectors from the mixed images, while a secondary set B is sampled from pure outlier images. Positive pairs are vectors of the same class (normal‑normal or anomalous‑anomalous); negatives are vectors of the opposite class. The loss maximizes intra‑class similarity and minimizes inter‑class similarity, explicitly pushing anomalous features into low‑likelihood regions of the latent space.
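A minimal sketch of the supervised InfoNCE formulation described above, written in NumPy rather than the authors' code: features are L2-normalized, positives are same-class pairs, and each anchor's log-probability is averaged over its positives. Function and argument names are hypothetical, and details such as temperature and sampling are assumptions:

```python
import numpy as np

def supervised_contrastive_loss(feats, labels, tau=0.1):
    """Supervised InfoNCE over L2-normalized projected latent vectors (sketch).

    feats  : (N, D) projected latent vectors (anchor set)
    labels : (N,) ints, 0 = normal pixel, 1 = anomalous pixel
    tau    : temperature (assumed value)
    """
    z = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)            # exclude self-similarity
    # log p(i -> j): softmax over all other samples for each anchor i
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)             # an anchor is not its own positive
    pos_counts = same.sum(axis=1)
    valid = pos_counts > 0                    # anchors with at least one positive
    per_anchor = np.where(same, log_prob, 0.0).sum(axis=1)[valid] / pos_counts[valid]
    return -per_anchor.mean()
```

Well-separated normal and anomalous clusters give a near-zero loss, while mixed clusters are penalized, which is the pressure that pushes anomalous features away from the high-likelihood region.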

The final objective is L = L_ml + λ·L_con, where λ balances generative and discriminative forces. Empirical studies show that a modest λ (0.1–0.5) yields the best trade‑off: the flow still models the normal distribution accurately, while the contrastive term prevents the model from collapsing anomalous samples onto high‑likelihood regions.
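The two terms can be combined as in the sketch below. This is an illustrative NumPy version, not the paper's implementation: L_ml is the negative log-likelihood of normal-pixel latents under the learnable diagonal Gaussian plus the flow's log-determinant, and the hypothetical `total_loss` weights the contrastive term by λ:

```python
import numpy as np

def nf_ml_loss(z_normal, log_det, mu, log_var):
    """Negative log-likelihood of normal-pixel latents under the Gaussian base (sketch).

    z_normal : (N, D) latent vectors of normal pixels
    log_det  : (N,) per-sample log |det Jacobian| from the invertible flow
    mu, log_var : (D,) learnable mean and log diagonal covariance of the base
    """
    d = z_normal.shape[1]
    # log N(z; mu, diag(exp(log_var))) per sample
    log_pz = -0.5 * (((z_normal - mu) ** 2 / np.exp(log_var)).sum(axis=1)
                     + log_var.sum() + d * np.log(2 * np.pi))
    # change of variables: log p_X(x) = log p_Z(z) + log |det J|
    return -(log_pz + log_det).mean()

def total_loss(l_ml, l_con, lam=0.25):
    """Combined objective L = L_ml + lambda * L_con (lambda in the 0.1-0.5 range)."""
    return l_ml + lam * l_con
```

At inference, only the likelihood branch is needed: pixels whose latents score a low log-likelihood under the base Gaussian are flagged as anomalous.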

Evaluation is performed on four robotics‑oriented anomaly segmentation benchmarks: Fishyscapes Lost & Found, Road Anomaly, SegmentMeIfYouCan‑ObstacleTrack, and ALLO. FlowCLAS consistently outperforms prior NF‑based methods such as FastFlow, Pyramid‑Flow, and AE‑Flow, achieving 4–7 percentage‑point gains in average precision. Moreover, it narrows the gap to leading discriminative approaches (UNO, RPL, DenseHybrid), sometimes matching or surpassing them, especially in scenarios with small or context‑dependent anomalies. Ablation experiments confirm that (i) removing the contrastive term degrades performance dramatically, (ii) using OE alone without contrastive learning yields only modest improvements, and (iii) overly large λ harms the likelihood estimation of normal data.

Qualitative results illustrate that FlowCLAS can detect subtle anomalous objects (e.g., a small bag on the road) that FastFlow completely misses, while producing segmentation masks comparable to the supervised SOTA UNO. The contrastive component is shown to steer the latent space toward high‑level semantic separation rather than low‑level texture differences, which is crucial for handling the multimodal nature of real‑world robotic perception.

In summary, FlowCLAS presents a principled integration of generative density estimation and discriminative contrastive learning. By retaining the exact likelihood computation of normalizing flows and simultaneously enforcing a structured latent space that separates normal from outlier features, the method delivers robust, probabilistically interpretable anomaly segmentation suitable for dynamic, safety‑critical robotic environments. Future directions include scaling the approach with larger vision backbones, real‑time flow architectures, and multimodal extensions (e.g., LiDAR, radar) to further enhance reliability in autonomous systems.

