Collaborative Edge-to-Server Inference for Vision-Language Models

Reading time: 5 minutes
...

📝 Original Info

  • Title: Collaborative Edge-to-Server Inference for Vision-Language Models
  • ArXiv ID: 2512.16349
  • Date: 2025-12-18
  • Authors: Soochang Song, Yongjune Kim

📝 Abstract

We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces the communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge devices (clients) is transmitted to the server for VLM inference. However, resizing the original image (global image) to match the vision encoder's input resolution often discards fine-grained details, leading to accuracy degradation. To overcome this limitation, we design a two-stage framework. In the first stage, the server performs inference on the global image and identifies a region of interest (RoI) using the VLM's internal attention. The min-entropy of the output tokens is then computed as a confidence measure to determine whether retransmission is required. If the min-entropy exceeds a predefined threshold, the server requests the edge device to send a detail-preserved local image of the RoI. The server then refines its inference by jointly leveraging the global and local images. This selective retransmission strategy ensures that only essential visual content is transmitted. Experiments across multiple VLM architectures show that the proposed framework significantly reduces communication cost while maintaining inference accuracy.

💡 Deep Analysis

Figure 1

📄 Full Content

With the rapid advancement of artificial intelligence (AI), the integration of multiple data modalities, such as images, text, and audio, into a shared embedding space has become a central paradigm in modern AI systems [1]-[3]. Among these, vision-language models (VLMs), also referred to as multimodal large language models (MLLMs), have emerged as a prominent and widely adopted architecture, combining a vision encoder with a large language model (LLM) to enable visual reasoning capabilities [3]-[5]. This architecture facilitates complex multimodal tasks that require joint understanding of visual and textual inputs, including visual question answering (VQA), image captioning, and image-text retrieval [6].

In real-world applications, VLMs are often required to process visual data captured at edge devices (clients). However, deploying a full VLM directly on the edge is generally infeasible due to its substantial computation and memory costs [7], [8]. Consequently, the visual data is typically transmitted to a server hosting the VLM for inference. To match the input resolution requirements of the vision encoder (e.g., 336 × 336 pixels) [1], [9], [10], the original images can be resized before transmission. While resizing reduces communication cost, it may discard fine-grained visual information, particularly in high-resolution images, thereby degrading inference accuracy [11]. Alternatively, one may transmit the full-resolution image without downscaling, which preserves details but significantly increases communication load.
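To make the communication trade-off concrete, the back-of-the-envelope sketch below compares raw RGB payload sizes for a full-resolution capture versus an image resized to the encoder's 336 × 336 input. The 4032 × 3024 source resolution and the uncompressed-byte model are illustrative assumptions, not figures from the paper.

```python
# Illustrative arithmetic only: rough payload sizes for a raw RGB image,
# before any codec. The 336x336 target follows the vision-encoder resolution
# cited above; the 4032x3024 "original" is an assumed example.
def raw_rgb_bytes(width: int, height: int, bytes_per_channel: int = 1) -> int:
    """Uncompressed RGB payload in bytes (3 channels)."""
    return width * height * 3 * bytes_per_channel

full_res = raw_rgb_bytes(4032, 3024)   # e.g., a smartphone photo
resized  = raw_rgb_bytes(336, 336)     # vision-encoder input resolution

print(f"full resolution : {full_res / 1e6:.1f} MB")   # ~36.6 MB
print(f"resized (336^2) : {resized / 1e6:.3f} MB")    # ~0.339 MB
print(f"ratio           : {full_res / resized:.0f}x") # ~108x
```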

To address this problem, our framework introduces an uncertainty-aware retransmission mechanism tailored to the collaborative edge-to-server setting. In the first stage, the edge device transmits a low-resolution global image along with the question to the server. The server performs an initial inference and quantifies inference uncertainty using the min-entropy of the output tokens. If the min-entropy is below a predefined threshold, the inference result is directly finalized. Otherwise, a second stage is triggered: the server leverages the VLM's internal attention scores to identify a task-relevant region of interest (RoI) and requests a detail-preserved local image from the edge device. Upon receiving the retransmission request, the edge device extracts the local image from the original image and transmits it to the server. The server then refines its inference by jointly processing the global and local images. This uncertainty-aware, two-stage strategy reduces communication overhead by transmitting task-relevant high-quality visual details only when necessary, while maintaining inference accuracy across diverse tasks.
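The control flow of the two stages can be summarized as in the sketch below. This is a minimal sketch of the described protocol, not the authors' released code; every helper (resize_to_encoder_input, run_vlm, mean_min_entropy, attention_roi, crop_roi) and the edge/server interfaces are hypothetical placeholders.

```python
# Minimal sketch of the two-stage protocol described above.
# All helpers and the edge/server objects are assumed interfaces.
def collaborative_inference(edge, server, question, threshold):
    # Stage 1: the edge sends a low-resolution global image plus the question.
    global_img = edge.resize_to_encoder_input(edge.original_image)
    answer, token_probs, attn = server.run_vlm(images=[global_img],
                                               question=question)

    # Confidence check: mean per-token min-entropy of the output tokens.
    if server.mean_min_entropy(token_probs) <= threshold:
        return answer  # confident enough; no retransmission needed

    # Stage 2: locate a task-relevant RoI from internal attention scores
    # and request a detail-preserved crop of that region from the edge.
    roi = server.attention_roi(attn)
    local_img = edge.crop_roi(edge.original_image, roi)

    # Refined inference jointly over the global and local images.
    refined_answer, _, _ = server.run_vlm(images=[global_img, local_img],
                                          question=question)
    return refined_answer
```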

At the core of the proposed framework is an entropy-aware decision mechanism that determines whether retransmission is required based on the VLM’s inference uncertainty. Since the server cannot directly ascertain whether the initial inference is correct, uncertainty is estimated from output probabilities and used as a proxy. Specifically, we compute the per-token min-entropy from the softmax outputs of the LLM decoder at each generation step and average it across all output tokens to obtain a single scalar decision statistic. This enables the server to request a local image only when the initial inference is likely unreliable, thereby attaining accuracy comparable to unconditionally transmitting local details while substantially reducing communication overhead.
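As a concrete version of the mean_min_entropy helper sketched above, the snippet below computes the per-token min-entropy, H_min(p) = -log2(max_i p_i), from the decoder's per-step softmax distributions and averages it over the output tokens. The base-2 logarithm and the threshold value are illustrative assumptions; the paper's exact normalization is not reproduced here.

```python
import numpy as np

def mean_min_entropy(step_probs: list[np.ndarray]) -> float:
    """Average per-token min-entropy, H_min(p) = -log2(max_i p_i),
    over all generated output tokens (one softmax vector per step)."""
    per_token = [-np.log2(np.max(p)) for p in step_probs]
    return float(np.mean(per_token))

def needs_retransmission(step_probs, threshold: float) -> bool:
    """Request the local RoI image only if the decision statistic
    exceeds the predefined threshold."""
    return mean_min_entropy(step_probs) > threshold

# Toy usage: three generation steps over a 5-token vocabulary.
probs = [np.array([0.90, 0.04, 0.03, 0.02, 0.01]),   # confident step
         np.array([0.40, 0.30, 0.15, 0.10, 0.05]),   # uncertain step
         np.array([0.70, 0.15, 0.10, 0.03, 0.02])]
print(mean_min_entropy(probs))            # ~0.66 bits
print(needs_retransmission(probs, 0.5))   # True under this toy threshold
```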

Moreover, the proposed framework also alleviates server-side computational overhead compared with conventional schemes that either perform inference on higher-resolution images transmitted from the edge or always transmit an additional local image after the global image for inference. In VLMs, the LLM decoder accounts for the majority of overall floating-point operations (FLOPs). Since the number of visual tokens typically dominates the total number of input tokens in VQA tasks, the computational complexity of the VLM scales approximately linearly with the number of visual tokens [12].

By selectively increasing the number of visual tokens only when necessary, the min-entropy-based retransmission strategy significantly reduces the server-side computational cost while maintaining inference accuracy.
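A rough expected-cost calculation illustrates this point. The token counts, the retransmission rate, and the assumption that stage 2 re-processes the global image alongside the local crop are all illustrative, not measured values from the paper.

```python
# Back-of-the-envelope illustration of the claim that decoder cost scales
# roughly linearly with the number of visual tokens. All numbers are assumed.
GLOBAL_TOKENS = 576      # e.g., a 336x336 image encoded into a 24x24 token grid
LOCAL_TOKENS  = 576      # one additional detail-preserved RoI crop
RETRANS_RATE  = 0.3      # assumed fraction of queries triggering stage 2

# Expected visual tokens per query, assuming stage 2 jointly re-processes
# the global image together with the local crop.
always_two_stage = GLOBAL_TOKENS + (GLOBAL_TOKENS + LOCAL_TOKENS)
selective        = GLOBAL_TOKENS + RETRANS_RATE * (GLOBAL_TOKENS + LOCAL_TOKENS)

print(f"always retransmit : {always_two_stage} visual tokens")        # 1728
print(f"selective scheme  : {selective:.0f} visual tokens (expected)") # ~922
print(f"relative decoder cost ~ {selective / always_two_stage:.2f}")   # ~0.53
```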

Experimental results show that decisions guided by min-entropy [13] achieve higher inference accuracy than those based on other uncertainty metrics such as Shannon entropy [14] and probability margin [15]. Compared to high-resolution end-to-end models such as LLaVA-1.5-HD [3], the proposed framework provides a more favorable trade-off between inference accuracy and both communication and computation costs. In addition, it is complementary to image compression techniques, enabling further communication efficiency and remaining broadly applicable.

📸 Image Gallery

cover.png

Reference

This content is AI-processed based on open access ArXiv data.
