BioTamperNet: Affinity-Guided State-Space Model Detecting Tampered Biomedical Images
Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

We propose BioTamperNet, a novel framework for detecting duplicated regions in tampered biomedical images, leveraging affinity-guided attention inspired by State Space Model (SSM) approximations. Existing forensic models, primarily trained on natural images, often underperform on biomedical data, where subtle manipulations can compromise experimental validity. To address this, BioTamperNet introduces an affinity-guided self-attention module to capture intra-image similarities and an affinity-guided cross-attention module to model cross-image correspondences. Our design integrates lightweight SSM-inspired linear attention mechanisms to enable efficient, fine-grained localization. Trained end-to-end, BioTamperNet simultaneously identifies tampered regions and their source counterparts. Extensive experiments on benchmark bio-forensic datasets demonstrate significant improvements over competitive baselines in accurately detecting duplicated regions. Code: https://github.com/SoumyaroopNandi/BioTamperNet


💡 Research Summary

BioTamperNet addresses the pressing problem of detecting duplicated or forged regions in biomedical images—a task that is critical for maintaining scientific integrity but remains challenging due to the domain‑specific visual characteristics of microscopy, blot/gel, flow‑cytometry, and macroscopic scans. Existing forensic detectors are largely trained on natural‑image datasets and therefore struggle with the subtle, often low‑contrast manipulations typical of biomedical publications.

The authors propose a unified Siamese architecture that simultaneously handles External Duplication Detection (EDD) – where duplicated regions appear across two images – and Internal Duplication Detection (IDD) as well as Cut/Sharp Transition Detection (CSTD) – where duplicated regions reside within a single image. The core novelty lies in “affinity‑guided” attention modules built on lightweight State Space Model (SSM) approximations of linear attention. By embedding the token sequence of a Vision Transformer (ViT‑Base) into a selective‑scan SSM, the model captures long‑range dependencies with O(N) complexity while preserving the expressive power of full self‑attention.
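The O(N) claim rests on a standard property of linear attention: replacing the softmax with a positive feature map lets the key–value product be computed once and reused for every query, so the N × N score matrix is never materialized. A minimal NumPy sketch of this idea (the paper's exact SSM parameterization may differ; the ELU+1 feature map here is a common choice, not confirmed by the source):

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """O(N) attention via a positive feature map phi (here ELU(x) + 1).
    By associativity, Qf @ (Kf.T @ V) avoids the N x N score matrix."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # ELU(x) + 1 > 0
    Qf, Kf = phi(Q), phi(K)                   # (N, d) feature-mapped tokens
    KV = Kf.T @ V                             # (d, d), computed once
    Z = Qf @ Kf.sum(axis=0)                   # (N,) per-query normalizer
    return (Qf @ KV) / (Z[:, None] + eps)     # (N, d_v)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 16, 8))
out = linear_attention(Q, K, V)
print(out.shape)
```

The output is identical (up to floating point) to the explicit quadratic form `softmax-free` attention `(phi(Q) @ phi(K).T) @ V` with row normalization, which is what makes the O(N) reordering valid.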

Affinity matrices are computed over SSM‑encoded features, normalized with Rotary Positional Embeddings (RoPE) and ELU‑shifted activations. To suppress the dominant diagonal (self‑correlation), a distance‑based spatial suppression kernel is applied, followed by a bidirectional softmax (temperature = 5) to obtain a refined affinity map. This map is further processed by a four‑layer convolutional refinement block, yielding a compact “Affinity_Map” that highlights the most similar patches for each spatial location.
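The suppression-plus-bidirectional-softmax step above can be sketched as follows. This is a simplified illustration with assumed details: cosine similarity as the base affinity, a 1-D index-distance mask in place of the paper's spatial kernel, and the temperature applied as a divisor (the source states temperature = 5 but not its exact form):

```python
import numpy as np

def affinity_map(F, radius=1, temperature=5.0):
    """Build a refined affinity map from token features F of shape (N, d).
    Near-diagonal (self-correlation) entries are suppressed, then a
    temperature softmax is applied along rows and columns and averaged."""
    Fn = F / (np.linalg.norm(F, axis=1, keepdims=True) + 1e-8)
    A = Fn @ Fn.T                                    # (N, N) raw affinity
    idx = np.arange(len(F))
    near = np.abs(idx[:, None] - idx[None, :]) <= radius
    A = np.where(near, -np.inf, A)                   # diagonal suppression

    def softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    # bidirectional softmax: normalize over rows and over columns, average
    return 0.5 * (softmax(A / temperature, axis=1)
                  + softmax(A / temperature, axis=0))
```

Suppressed entries receive exactly zero weight after the softmax, so each location's affinity mass is forced onto genuinely distant, similar patches rather than its own neighborhood.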

Three parallel Affinity‑Guided SSM (AGSSM) blocks perform self‑attention, each modulated by the flattened affinity map. Their outputs are averaged, projected with a 1×1 convolution, and added back to the original token features, producing enriched representations V′₁ and V′₂. Cross‑attention then operates on these enriched tokens, with the affinity‑guided similarity term Λ injected directly into the attention score. The authors prove (Proposition 1) that when the affinity margin between a true duplicated pair (i, j) and all other candidates exceeds a threshold δ, the softmax weights concentrate on the true counterpart, making the cross‑attention update effectively V₁(i) + W_V V₂(j). This property ensures that duplicated patches dominate the interaction, leading to precise alignment of source and target regions.
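The concentration behavior stated in Proposition 1 can be demonstrated with a toy version of the affinity-injected cross-attention. This sketch assumes a simplified score (dot product plus the affinity term Λ) and an identity W_V by default; names and shapes are illustrative, not the paper's exact formulation:

```python
import numpy as np

def affinity_cross_attention(V1, V2, Lam, W_V=None):
    """Cross-attention from V1 to V2 with affinity term Lam added to the
    scores. When Lam[i, j] dominates row i by a large margin, the softmax
    weight concentrates on j and the update approaches V1[i] + W_V @ V2[j]."""
    N, d = V1.shape
    if W_V is None:
        W_V = np.eye(d)
    scores = V1 @ V2.T + Lam                     # (N, N) affinity-injected
    scores = scores - scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=1, keepdims=True)         # row-wise softmax
    return V1 + A @ (V2 @ W_V.T)                 # residual update

rng = np.random.default_rng(2)
V1, V2 = rng.normal(size=(2, 6, 4))
Lam = np.zeros((6, 6))
Lam[2, 5] = 50.0   # true duplicated pair (2, 5) with a large margin
out = affinity_cross_attention(V1, V2, Lam)
```

With the margin this large, row 2 of the softmax is essentially one-hot at column 5, so `out[2]` is numerically `V1[2] + V2[5]`, matching the limit behavior the proposition describes.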

A lightweight decoder consisting of stacked convolutions, a 1×1 projection, sigmoid activation, and bilinear upsampling converts V′₁ and V′₂ into binary tampering masks O₁ and O₂. Training minimizes a weighted sum of binary cross‑entropy losses for self‑attention, cross‑attention, and fused outputs, using AdamW (lr = 1e‑4) with cosine decay and early stopping.
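The composite objective can be sketched as a weighted sum of per-branch binary cross-entropy terms. The loss weights below are hypothetical placeholders (the summary does not state their values), and the function operates on already-sigmoided mask predictions:

```python
import numpy as np

def weighted_bce_loss(pred_self, pred_cross, pred_fused, target,
                      weights=(1.0, 1.0, 1.0), eps=1e-7):
    """Weighted sum of BCE losses over the self-attention, cross-attention,
    and fused mask predictions. `weights` are assumed, not from the paper."""
    def bce(p, t):
        p = np.clip(p, eps, 1 - eps)   # avoid log(0)
        return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))
    preds = (pred_self, pred_cross, pred_fused)
    return sum(w * bce(p, target) for w, p in zip(weights, preds))
```

Supervising all three branches keeps the intermediate self- and cross-attention outputs aligned with the final fused mask rather than relying on the fused head alone.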

Because the BioFors benchmark provides only pristine training images, the authors synthesize a large training corpus: duplicated patches are inserted with extensive geometric augmentations (scale, rotation, flip, crop) and noise, and realism is boosted via GAN‑generated patches and sophisticated blending. For IDD and CSTD, each image is artificially split into a pseudo‑pair, allowing the same EDD‑oriented training pipeline to be reused without architectural changes.
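A stripped-down version of the synthetic-forgery step might look like the sketch below: copy a random patch, apply one geometric augmentation, and paste it elsewhere while recording source and target masks. This omits the paper's scaling, cropping, noise, GAN-generated patches, and blending; all names and the patch size are illustrative:

```python
import numpy as np

def synthesize_duplication(img, patch=24, rng=None):
    """Create one synthetic duplication forgery from a pristine image.
    Returns (forged_image, source_mask, target_mask). Only a random
    rotation/flip is applied here; the real pipeline is far richer."""
    rng = rng or np.random.default_rng()
    H, W = img.shape[:2]
    sy, sx = rng.integers(0, H - patch), rng.integers(0, W - patch)
    ty, tx = rng.integers(0, H - patch), rng.integers(0, W - patch)
    region = img[sy:sy + patch, sx:sx + patch].copy()
    region = np.rot90(region, k=rng.integers(0, 4))   # geometric augmentation
    forged = img.copy()
    forged[ty:ty + patch, tx:tx + patch] = region
    src_mask = np.zeros((H, W), dtype=bool)
    tgt_mask = np.zeros((H, W), dtype=bool)
    src_mask[sy:sy + patch, sx:sx + patch] = True
    tgt_mask[ty:ty + patch, tx:tx + patch] = True
    return forged, src_mask, tgt_mask
```

Because the model predicts both masks, each synthetic sample supervises source localization and target localization simultaneously, which is what enables the pseudo-pair reuse for IDD and CSTD described above.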

Experimental results on the BioFors test set (real forgeries from retracted papers) show that BioTamperNet achieves substantially higher Matthews Correlation Coefficient (MCC) scores than traditional local‑feature methods (SIFT, ORB, BRIEF) and recent forensic deep models across all four biomedical modalities (Microscopy, Blot/Gel, FACS, Macroscopy). The model not only localizes duplicated regions accurately but also predicts the source counterpart, a capability absent in prior work.
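For reference, the Matthews Correlation Coefficient used for evaluation can be computed directly from the binary-mask confusion counts; this standard formula is a sketch independent of any particular library:

```python
import numpy as np

def mcc(pred, target):
    """Matthews Correlation Coefficient between two binary masks.
    Ranges from -1 (total disagreement) through 0 (chance) to +1."""
    p = pred.astype(bool).ravel()
    t = target.astype(bool).ravel()
    tp = np.sum(p & t)
    tn = np.sum(~p & ~t)
    fp = np.sum(p & ~t)
    fn = np.sum(~p & t)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC is a natural choice for this task because tampered pixels are typically a small fraction of the image, and MCC, unlike plain accuracy, is robust to that class imbalance.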

Limitations include the fixed 224 × 224 input resolution, reliance on synthetic data which may not capture all real‑world forgery nuances, and potential scalability issues for very high‑resolution biomedical images (>1024 px). Nonetheless, the paper demonstrates that affinity‑guided SSM attention provides an efficient, domain‑adapted solution for biomedical image forensics, opening avenues for further hierarchical or multi‑scale extensions.
