CAPRMIL: Context-Aware Patch Representations for Multiple Instance Learning
📝 Abstract
In computational pathology, weak supervision has become the standard for deep learning due to the gigapixel scale of whole slide images (WSIs) and the scarcity of pixel-level annotations, with Multiple Instance Learning (MIL) established as the principal framework for slide-level model training. In this paper, we introduce a novel setting for MIL methods, inspired by recent advances in Neural Partial Differential Equation (PDE) solvers. Instead of relying on complex attention-based aggregation, we propose an efficient, aggregator-agnostic framework that removes the burden of correlation learning from the MIL aggregator. CAPRMIL produces rich, context-aware patch embeddings that promote effective correlation learning on downstream tasks. By projecting patch features (extracted with a frozen patch encoder) into a small set of global context/morphology-aware tokens and applying multi-head self-attention, CAPRMIL injects global context with computational complexity that is linear in the bag size. Paired with a simple Mean MIL aggregator, CAPRMIL matches state-of-the-art slide-level performance across multiple public pathology benchmarks, while reducing the total number of trainable parameters by 48%-92.8% versus state-of-the-art MIL methods, lowering inference FLOPs by 52%-99%, and ranking among the best models in GPU memory efficiency and training time. Our results indicate that learning rich, context-aware instance representations before aggregation is an effective and scalable alternative to complex pooling for whole-slide analysis. Our code is available at https://github.com/mandlos/CAPRMIL
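As a rough illustration of the mechanism described above, the following NumPy sketch routes a bag of patch features through a small set of global context tokens and mean-pools the result. This is not the authors' exact architecture: the single attention head, the residual connection, the token count, and all variable names here are simplifying assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def context_aware_embeddings(X, W_latent):
    """Single-head sketch: N patch features -> K latent context tokens -> back to patches.

    X:        (N, d) features from a frozen patch encoder (bag of N instances)
    W_latent: (K, d) learnable latent token queries, with K << N
    Total cost is O(N*K*d) + O(K^2*d): linear in the bag size N.
    """
    d = X.shape[1]
    scale = 1.0 / np.sqrt(d)

    # 1) Cross-attention: latent tokens summarize the whole bag, O(N*K*d)
    A = softmax(W_latent @ X.T * scale, axis=-1)   # (K, N) attention over patches
    T = A @ X                                      # (K, d) global context tokens

    # 2) Self-attention among the K tokens only, O(K^2*d)
    S = softmax(T @ T.T * scale, axis=-1)          # (K, K)
    T = S @ T                                      # (K, d) mixed global context

    # 3) Broadcast context back: each patch attends to the K tokens, O(N*K*d)
    B = softmax(X @ T.T * scale, axis=-1)          # (N, K)
    return X + B @ T                               # (N, d) context-aware patch embeddings

# A simple Mean MIL aggregator on top of the context-aware embeddings
N, d, K = 500, 64, 8
X = rng.normal(size=(N, d))
W = rng.normal(size=(K, d))
Z = context_aware_embeddings(X, W)
slide_repr = Z.mean(axis=0)   # (d,) slide-level representation for a linear classifier
```

Because every attention map involves the K latent tokens on one side, no N-by-N interaction matrix is ever formed, which is what keeps the cost linear in the number of patches.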
📄 Content
Andreas Lolos∗1,2 (andreaslolos@phys.uoa.gr)
Theofilos Christodoulou∗2 (th.christodoulou@athenarc.gr)
Aris L. Moustakas1,2 (arislm@phys.uoa.gr)
Stergios Christodoulidis3,4 (stergios.christodoulidis@centralesupelec.fr)
Maria Vakalopoulou2,3,4 (maria.vakalopoulou@centralesupelec.fr)

1 National and Kapodistrian University of Athens, Greece
2 Archimedes, Athena Research Center, Greece
3 MICS Laboratory, CentraleSupélec, Université Paris-Saclay
4 IHU PRISM, National Center for Precision Medicine in Oncology, Gustave Roussy

Keywords: Digital Pathology, Multiple Instance Learning, Context-Aware Representations
∗ Contributed equally.
© CC-BY 4.0, A. Lolos, T. Christodoulou, A.L. Moustakas, S. Christodoulidis & M. Vakalopoulou.
arXiv:2512.14540v1 [cs.CV], 16 Dec 2025

- Introduction

Whole Slide Image (WSI) analysis has become the foundation of clinical practice in computational pathology (Alkhalaf et al., 2024; Wang et al., 2024); however, the sheer size of WSIs poses a significant challenge for deep learning approaches (Brixtel et al., 2022; Lu et al., 2021b; Gadermayr and Tschuchnig, 2024). At the same time, pixel-level annotations are prohibitively expensive and time-consuming, so clinical datasets typically provide only slide-level labels rather than fine-grained annotations (Lu et al., 2021b; Song et al., 2023; Gadermayr and Tschuchnig, 2024).
To address the computationally prohibitive size of WSIs and the lack of pixel-level annotations, Multiple Instance Learning (MIL) has been established as the standard framework for WSI analysis. The MIL pipeline comprises patch feature extraction, typically using pre-trained foundation models (Xiong et al., 2025), followed by aggregation/pooling to produce the slide-level representation for downstream tasks. In recent years, attention-based mechanisms have emerged as a promising approach for trainable MIL aggregators (Ilse et al., 2018; Wang et al., 2024; Gadermayr and Tschuchnig, 2024), owing to their strong correlation-learning capabilities. While effective, approaches that apply standard attention directly to the patch embeddings face computational bottlenecks due to the quadratic complexity of the attention operator (Shao et al., 2021).
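For concreteness: the classic gated attention pooling of Ilse et al. (2018) scores each instance independently, so its cost is linear in the bag size; the quadratic bottleneck mentioned above arises in later aggregators that instead apply full self-attention across all patch embeddings. A minimal NumPy sketch of the gated pooling idea follows (untrained random weights, purely illustrative; names and dimensions are assumptions, not the published hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(1)

def abmil_pool(X, V, U, w):
    """Gated attention pooling in the style of Ilse et al. (2018).

    X: (N, d) patch embeddings; V, U: (d, h) projections; w: (h,) scoring vector.
    Each instance receives a scalar attention weight, and the slide embedding is
    the weighted sum of instances. Cost is O(N*d*h): instances are scored
    independently, with no pairwise instance-instance attention.
    """
    gate = np.tanh(X @ V) * (1.0 / (1.0 + np.exp(-(X @ U))))  # (N, h) gated features
    scores = gate @ w                                          # (N,) per-instance scores
    a = np.exp(scores - scores.max())
    a = a / a.sum()                                            # softmax over the bag
    return a @ X, a                                            # (d,) slide embedding, (N,) weights

N, d, h = 300, 64, 32
X = rng.normal(size=(N, d))
z, attn = abmil_pool(X, rng.normal(size=(d, h)),
                     rng.normal(size=(d, h)), rng.normal(size=h))
```

The attention weights also offer a coarse form of interpretability, since high-weight patches are the ones driving the slide-level prediction.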
Attention-based MIL methods for WSIs have also been found to be highly susceptible to overfitting and to offer limited interpretability (Zhang et al., 2025b), while often lacking principled uncertainty quantification (Sun et al., 2026; Cui et al., 2022; Lolos et al., 2025), limiting their potential for clinical translation. Therefore, developing aggregation strategies that can effectively model instance interactions, handle the challenges inherent to long-sequence processing in WSIs, and provide reliable representations remains an active area of research (Bilal et al., 2023; Fang et al., 2024). At the same time, we observe that neural Partial Differential Equation (PDE) solvers (Li et al., 2020; Hao et al., 2023; Wu et al., 2024) face a similar challenge: how to achieve efficient and reliable correlation learning on large-scale inputs. Solving PDEs often involves modeling complex phenomena with long-range interactions, on domains discretized into millions of mesh points (Grossmann et al., 2024).
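The scale of the quadratic bottleneck can be made concrete with a quick back-of-envelope count of multiply-accumulates for a single attention layer. The bag size, token count, and feature dimension below are illustrative assumptions, not figures from the paper, and projection costs are ignored:

```python
# Rough multiply-accumulate count for one attention layer (projections ignored):
# full self-attention over N patches vs. routing through K latent tokens.
N, K, d = 20_000, 64, 512   # illustrative WSI bag size, latent token count, feature dim

# Full self-attention: score matrix (N^2 * d) plus weighted sum (N^2 * d)
full_self_attention = 2 * N * N * d

# Latent routing: two N<->K cross-attentions plus one K x K self-attention
latent_routing = 2 * (2 * N * K * d) + 2 * K * K * d

ratio = full_self_attention / latent_routing
print(f"~{ratio:.0f}x fewer multiply-accumulates")
```

Under these assumed values the latent-token route is roughly two orders of magnitude cheaper, and unlike the quadratic term, its cost grows only linearly as the bag size increases.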