EventFlash: Towards Efficient MLLMs for Event-Based Vision


Event-based multimodal large language models (MLLMs) enable robust perception in high-speed and low-light scenarios, addressing key limitations of frame-based MLLMs. However, current event-based MLLMs often rely on dense image-like processing paradigms, overlooking the spatiotemporal sparsity of event streams and resulting in high computational cost. In this paper, we propose EventFlash, a novel and efficient MLLM to explore spatiotemporal token sparsification for reducing data redundancy and accelerating inference. Technically, we build EventMind, a large-scale and scene-diverse dataset with over 500k instruction sets, providing both short and long event stream sequences to support our curriculum training strategy. We then present an adaptive temporal window aggregation module for efficient temporal sampling, which adaptively compresses temporal tokens while retaining key temporal cues. Finally, a sparse density-guided attention module is designed to improve spatial token efficiency by selecting informative regions and suppressing empty or sparse areas. Experimental results show that EventFlash achieves a $12.4\times$ throughput improvement over the baseline (EventFlash-Zero) while maintaining comparable performance. It supports long-range event stream processing with up to 1,000 bins, significantly outperforming the 5-bin limit of EventGPT. We believe EventFlash serves as an efficient foundation model for event-based vision.


💡 Research Summary

EventFlash addresses the inefficiencies of current event‑based multimodal large language models (MLLMs) that treat asynchronous event streams as dense, image‑like inputs. By explicitly exploiting the inherent spatiotemporal sparsity of event cameras, the authors propose a two‑stage token sparsification pipeline that dramatically reduces computational load while preserving essential motion cues.

The first stage, Adaptive Temporal Window Aggregation (ATWA), partitions the raw microsecond‑resolution event stream into fine‑grained temporal bins. Each bin is modeled as a Gaussian‑kernel intensity function λ_B(x, y, t, p) that captures spatial location, timestamp, and polarity. Adjacent bins are merged when their L2 distance D(B_i, B_{i+1}) falls below a learned threshold τ, forming meta‑windows M_k that compress redundant temporal information. A second semantic‑aware merging step evaluates cosine similarity between CLS tokens extracted from each meta‑window (using a pre‑trained vision encoder such as CLIP‑ViT) and further merges adjacent windows whose similarity is high, so that only semantically distinct segments remain.
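The two merging passes described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature vectors, CLS tokens, and thresholds (`tau`, `sim_thresh`) are placeholders for whatever representations and learned values the paper actually uses.

```python
import numpy as np

def merge_temporal_bins(bin_features, tau):
    """First pass: greedily merge adjacent temporal bins whose L2
    distance falls below tau, forming meta-windows.

    bin_features: (N, D) array, one feature vector per temporal bin.
    Returns a list of bin-index lists, one per meta-window.
    """
    meta_windows = [[0]]
    for i in range(1, len(bin_features)):
        prev = bin_features[meta_windows[-1][-1]]
        if np.linalg.norm(bin_features[i] - prev) < tau:
            meta_windows[-1].append(i)   # redundant bin: absorb it
        else:
            meta_windows.append([i])     # distinct bin: start new window
    return meta_windows

def semantic_merge(cls_tokens, windows, sim_thresh):
    """Second pass: merge adjacent meta-windows whose CLS tokens
    (e.g. from a frozen CLIP-ViT) are highly similar, keeping only
    semantically distinct segments."""
    merged = [list(windows[0])]
    for w in windows[1:]:
        a = cls_tokens[merged[-1][-1]]
        b = cls_tokens[w[0]]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if cos > sim_thresh:
            merged[-1].extend(w)         # near-duplicate content: fuse
        else:
            merged.append(list(w))       # new semantic segment
    return merged
```

The greedy left-to-right scan keeps the pass linear in the number of bins, which matters when streams span up to 1,000 bins.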

The second stage, Sparse Density‑Guided Attention (SDGA), tackles spatial redundancy. For each meta‑window, the method computes a normalized event density r_i = (1/|M_i|)∑_{n∈M_i} f(p_n) and discards or attenuates regions whose density falls below a predefined cutoff. This selective attention focuses the model on informative pixels while suppressing empty or low‑activity areas, effectively reducing the number of spatial tokens fed to the language model.
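A density-guided token filter of this kind might look like the sketch below. It is a simplified stand-in, assuming per-patch event counts as the density signal and a fixed cutoff; the paper's actual f(p_n) weighting and attention mechanism may differ.

```python
import numpy as np

def density_mask(event_counts, cutoff):
    """Compute a normalized event density per spatial patch and keep
    only patches at or above the cutoff; empty or sparse patches
    contribute no tokens.

    event_counts: (P,) number of events per patch in a meta-window.
    Returns a boolean mask over patches (True = keep as token).
    """
    total = event_counts.sum()
    density = event_counts / total if total > 0 else event_counts
    return density >= cutoff

def select_tokens(patch_tokens, event_counts, cutoff):
    """Drop spatial tokens for low-density patches before they reach
    the language model."""
    return patch_tokens[density_mask(event_counts, cutoff)]
```

Because event cameras only fire on brightness changes, most patches in a static scene are empty, so even a modest cutoff can remove a large fraction of spatial tokens.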

To train and evaluate this architecture, the authors introduce EventMind, a 500k-instruction dataset covering seven tasks (simple captioning, scene captioning, motion captioning, Event QA, Fine‑grained QA, Multiple‑Choice QA, and Human‑Action QA). EventMind combines real‑world recordings from DSEC, HARDVS, and N‑ImageNet with synthetic streams generated via the V2E simulator from large video corpora (Kinetics‑700, UCF‑101, etc.). The dataset is organized into three curriculum stages based on sequence length: short (0–50 ms), medium (50 ms–5 s), and long (5 s–20 s). This progressive curriculum lets the model first master basic captioning on short clips and then gradually acquire complex reasoning abilities on longer, more challenging streams.
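The three-stage curriculum reduces to a simple bucketing of sequences by duration; a sketch following the stage boundaries stated above (function name and millisecond units are illustrative, not from the paper):

```python
def curriculum_stage(duration_ms: float) -> str:
    """Assign an event sequence to a curriculum stage by duration.
    Boundaries follow the dataset description: short (0-50 ms),
    medium (50 ms-5 s), long (5 s-20 s)."""
    if duration_ms <= 50:
        return "short"
    elif duration_ms <= 5_000:
        return "medium"
    elif duration_ms <= 20_000:
        return "long"
    raise ValueError("sequence exceeds the 20 s maximum length")
```

Training then proceeds stage by stage, exhausting the "short" bucket before sampling from "medium" and finally "long".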

Experimental results show that EventFlash achieves a 12.4× throughput improvement over a baseline “EventFlash‑Zero” that processes raw tokens without sparsification, while maintaining comparable performance on standard metrics (BLEU, CIDEr, VQA accuracy). Notably, EventFlash can handle up to 1,000 temporal bins, far exceeding the 5‑bin limit of the competing EventGPT. Ablation studies confirm that both ATWA and SDGA contribute substantially to token reduction and speedup, and sensitivity analyses explore the impact of the merging threshold τ and density cutoffs.

Limitations include reliance on a CLIP‑ViT backbone for event encoding, which may not fully exploit event‑specific features, and the absence of latency measurements on real‑time streaming hardware. Future work is suggested to develop dedicated event transformers, co‑optimize software and hardware accelerators, and extend the approach to domains such as robotics and autonomous driving where low‑latency, long‑range event understanding is critical.

