Reading Recognition in the Wild
To enable egocentric contextual AI in always-on smart glasses, it is crucial to keep a record of the user’s interactions with the world, including during reading. In this paper, we introduce a new task of reading recognition to determine when the user is reading. We first introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset, containing 100 hours of reading and non-reading videos in diverse and realistic scenarios. We then identify three modalities (egocentric RGB, eye gaze, head pose) that can be used to solve the task, and present a flexible transformer model that performs the task using these modalities, either individually or combined. We show that these modalities are relevant and complementary to the task, and investigate how to efficiently and effectively encode each modality. Additionally, we show the usefulness of this dataset for classifying types of reading, extending current reading-understanding studies conducted in constrained settings to greater scale, diversity, and realism.
💡 Research Summary
The paper introduces a novel task called “reading recognition,” which aims to determine in real time whether a wearer of always‑on smart glasses is actively reading. Recognizing reading moments is essential for contextual AI on wearable devices because it provides a lightweight proxy signal that can trigger more expensive processing (e.g., OCR, large‑scale vision‑language models) only when needed, thereby respecting the strict power, bandwidth, and heat constraints of such hardware.
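The gating idea above can be made concrete with a small sketch. This is an illustrative example, not code from the paper: the function name, threshold, and debouncing scheme are assumptions about how a per-frame reading score might gate expensive downstream processing on a power-constrained device.

```python
# Hypothetical gate: fire expensive processing (OCR, VLM calls) only
# when the reading-confidence score stays high for several consecutive
# frames, so brief glances at text do not waste power.
def should_trigger_ocr(scores, threshold=0.8, min_consecutive=5):
    """scores: per-frame reading-confidence values in [0, 1]."""
    streak = 0
    for s in scores:
        streak = streak + 1 if s >= threshold else 0
        if streak >= min_consecutive:
            return True
    return False

# A short spike (a glance) does not fire the gate...
print(should_trigger_ocr([0.9, 0.9, 0.2, 0.9, 0.9]))   # False
# ...but a sustained run of high scores does.
print(should_trigger_ocr([0.85, 0.9, 0.95, 0.9, 0.88]))  # True
```

Requiring a sustained streak rather than a single high score is a simple debouncing choice; a production system might instead smooth the score with an exponential moving average.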
To support this task, the authors release the “Reading in the Wild” dataset, the first large‑scale multimodal egocentric collection specifically designed for reading detection. The dataset comprises roughly 100 hours of video captured with a Meta Project Aria rig, including a forward‑facing RGB camera (30 Hz, 1408 px width, 110° FoV), two eye‑tracking cameras (60 Hz, calibrated gaze points), and dual IMUs providing head‑pose and odometry. Two subsets are provided: the “Seattle” subset (≈80 hours, 111 participants) focuses on diversity—different indoor/outdoor locations, a wide range of reading media (print, digital screens, product labels, whiteboards), multiple reading modes (deep reading, skimming, scanning, reading aloud), and multitask scenarios (reading while walking, typing, etc.). The “Columbus” subset (≈20 hours) is curated to stress‑test models with hard negatives (text present but not read), non‑English scripts (Arabic, Chinese, Bengali), and mirror setups where the same participant performs reading and non‑reading actions in identical environments. All recordings are anonymized (faces, license plates blurred) and timestamps are automatically labeled using voice cues (“start reading”, “finished reading”) transcribed by WhisperX, eliminating manual annotation effort.
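The voice-cue labeling scheme can be sketched as a small pairing routine. The transcript format and exact cue phrases below are assumptions for illustration; the paper only states that spoken cues like “start reading” and “finished reading” are transcribed (with WhisperX) and turned into timestamps automatically.

```python
# Hypothetical sketch: pair "start reading" / "finished reading" cues
# from a timestamped transcript into labeled reading intervals.
def reading_intervals(transcript):
    """transcript: list of (timestamp_seconds, text) tuples,
    ordered by time. Returns a list of (start, end) intervals."""
    intervals, start = [], None
    for t, text in transcript:
        text = text.lower()
        if "start reading" in text and start is None:
            start = t
        elif "finished reading" in text and start is not None:
            intervals.append((start, t))
            start = None
    return intervals

print(reading_intervals([(3.0, "Start reading"),
                         (41.5, "Finished reading"),
                         (60.0, "Start reading"),
                         (95.2, "Finished reading")]))
# [(3.0, 41.5), (60.0, 95.2)]
```

Ignoring unmatched cues (a “finished” with no preceding “start”, or vice versa) makes the labeling robust to transcription errors at the cost of dropping a few segments.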
The authors propose a flexible transformer‑based encoder that can ingest any combination of the three modalities. Each modality is treated as a separate token sequence: (1) gaze is fed as a raw high‑frequency time series without handcrafted fixation/saccade detection; (2) RGB frames are cropped around the instantaneous gaze point, limiting the visual field to the ~2° foveal region that humans actually use for reading and thereby reducing computational load; (3) IMU data (3‑axis accelerometer and gyroscope) are supplied as a short temporal window. These token streams are concatenated and processed by a standard multi‑head self‑attention transformer, producing a scalar confidence score sₜ ∈ [0, 1] that indicates whether the wearer is reading at time t.
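A minimal PyTorch sketch of this flexible design is given below. All dimensions, layer sizes, and the CLS-token classification head are assumptions for illustration; the point is the structure: each available modality is projected to a shared token space, the sequences are concatenated, and a standard transformer encoder produces a single confidence score from any subset of modalities.

```python
# Hypothetical sketch of a modality-flexible reading recognizer.
import torch
import torch.nn as nn

class ReadingRecognizer(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_layers=2,
                 gaze_dim=2, imu_dim=6, rgb_feat_dim=256):
        super().__init__()
        # Per-modality linear projections into the shared token space
        # (feature dimensions are illustrative).
        self.proj = nn.ModuleDict({
            "gaze": nn.Linear(gaze_dim, d_model),    # raw 2-D gaze points
            "imu":  nn.Linear(imu_dim, d_model),     # accel + gyro window
            "rgb":  nn.Linear(rgb_feat_dim, d_model) # gaze-crop features
        })
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, inputs):
        # inputs: dict of modality name -> (B, T_m, feat_dim) tensor;
        # any subset of {"gaze", "imu", "rgb"} may be supplied.
        B = next(iter(inputs.values())).shape[0]
        tokens = [self.cls.expand(B, -1, -1)]
        tokens += [self.proj[m](x) for m, x in inputs.items()]
        z = self.encoder(torch.cat(tokens, dim=1))
        # Scalar reading-confidence score in (0, 1) per batch element.
        return torch.sigmoid(self.head(z[:, 0])).squeeze(-1)

model = ReadingRecognizer()
# Gaze-only inference (60 gaze samples), no RGB or IMU supplied.
score = model({"gaze": torch.randn(1, 60, 2)})
print(score.shape)  # torch.Size([1])
```

Because missing modalities simply contribute no tokens, the same weights can run gaze-only on-device and fall back to richer inputs when available; the real model would also need positional/modality embeddings, which are omitted here for brevity.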