Eye Feel You: A DenseNet-driven User State Prediction Approach

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv paper.

Subjective self-reports, collected alongside eye-tracking data, reveal perceived states such as fatigue, effort, and task difficulty. However, these reports are costly to collect and challenging to interpret consistently in longitudinal studies. In this work, we investigate whether objective gaze dynamics can reliably predict subjective reports across repeated recording rounds of a longitudinal eye-tracking dataset. We formulate subjective-report prediction as a supervised regression problem and propose a DenseNet-based deep learning regressor that learns predictive representations from gaze velocity signals. We conduct two complementary experiments. First, a cross-round generalization experiment tests whether models trained on earlier rounds transfer to later rounds, evaluating their ability to capture longitudinal changes. Second, a cross-subject generalization experiment tests robustness by predicting subjective outcomes for individuals unseen during training. Together, these experiments aim to reduce reliance on hand-crafted feature design and to clarify which aspects of subjective experience are systematically reflected in oculomotor behavior over time.


💡 Research Summary

The paper presents a novel approach for predicting users’ subjective states, such as fatigue, mental effort, and task difficulty, directly from raw eye-tracking data. Instead of relying on handcrafted gaze features (e.g., fixation duration, saccade amplitude), the authors convert the 2-D gaze-position stream into a 1-D velocity signal, normalize it, and feed a fixed-length sequence (50 seconds, 5,000 samples) into a deep regression network. The core of the model is a pre-activation DenseNet adapted for 1-D temporal data: eight convolutional layers with a growth rate of 32 are stacked, each using an exponentially increasing dilation rate of 2^((n−1) mod 7) to expand the receptive field to 257 time steps. Batch Normalization and ReLU precede each convolution, and a Global Average Pooling layer collapses the temporal dimension into a compact embedding. This embedding passes through a fully connected regression head (FC → ReLU → Dropout → FC) that outputs multiple continuous scores (three in the cross-round setting, six in the cross-subject setting). The network is trained with a Smooth L1 (Huber) loss, the Adam optimizer (lr = 3e-4), a batch size of 16, and early stopping after 50 epochs.
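The preprocessing and the receptive-field claim above can be sketched numerically. This is a minimal illustration, not the authors' code: it assumes a convolution kernel size of 3 (consistent with the reported 257-step receptive field) and z-score normalization of the speed signal.

```python
import numpy as np

def gaze_to_velocity(xy, fs):
    """Convert a (T, 2) gaze-position stream into a normalized 1-D speed signal.

    Assumed preprocessing: finite differences scaled by the sampling rate,
    Euclidean speed, then z-score normalization.
    """
    vel = np.diff(xy, axis=0) * fs          # per-axis velocity, shape (T-1, 2)
    speed = np.linalg.norm(vel, axis=1)     # scalar speed, shape (T-1,)
    return (speed - speed.mean()) / (speed.std() + 1e-8)

# Dilation schedule for the 8 conv layers described above: 2^((n-1) mod 7).
dilations = [2 ** ((n - 1) % 7) for n in range(1, 9)]  # [1, 2, 4, 8, 16, 32, 64, 1]

# Receptive field of stacked dilated convolutions with kernel size k:
# rf = 1 + sum over layers of (k - 1) * dilation. With k = 3 (assumed),
# this reproduces the 257 time steps reported in the summary.
k = 3
rf = 1 + sum((k - 1) * d for d in dilations)
print(rf)  # 257
```

Note that the eighth layer wraps back to dilation 1 because (8 − 1) mod 7 = 0, which is what keeps the receptive field at exactly 257 rather than doubling again.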

Experiments use the publicly available GazeBase dataset, which contains 9 recording rounds spanning roughly 37 months, 322 participants, and 12,334 eye-tracking sessions captured at 1000 Hz with an EyeLink 1000 device. The authors evaluate two generalization scenarios: (1) cross-round generalization, where models trained on round 2 data for known participants predict three subjective scores (overall difficulty, mental tiredness, eye tiredness) on later rounds (3 and 4); and (2) cross-subject generalization, where models trained on participants from rounds 1 ∪ 2 predict six questionnaire scores (general comfort, shoulder fatigue, neck fatigue, eye fatigue, physical effort, mental effort) for participants not seen during training. The baseline is a global mean predictor that simply outputs the per-dimension average of the training labels.
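The two splits differ in what is held out: rounds for the same subjects versus entire subjects. A minimal sketch of the grouping logic, using hypothetical subject IDs and a toy (subject, round) record layout rather than the actual GazeBase session format:

```python
# Hypothetical session records: (subject_id, round_number).
# IDs and counts are illustrative, not taken from GazeBase.
sessions = [(s, r) for s in ("S01", "S02", "S03", "S04") for r in (1, 2, 3, 4)]

# Cross-round split: train on round 2 for known participants,
# test on rounds 3 and 4 for those same participants.
train_cr = [(s, r) for (s, r) in sessions if r == 2]
test_cr = [(s, r) for (s, r) in sessions if r in (3, 4)]

# Cross-subject split: train on rounds 1 and 2 for a subset of participants,
# test on participants never seen during training.
train_subjects = {"S01", "S02", "S03"}
train_cs = [(s, r) for (s, r) in sessions if r in (1, 2) and s in train_subjects]
test_cs = [(s, r) for (s, r) in sessions if s not in train_subjects]

# The cross-subject split keeps subject sets disjoint; the cross-round
# split intentionally shares subjects but holds out later rounds.
assert not ({s for s, _ in train_cs} & {s for s, _ in test_cs})
```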

Performance is measured with Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Pearson correlation (r), coefficient of determination (R²), and “Exact Accuracy” (the proportion of predictions that round to the correct integer on the 1‑7 scale). In cross‑round experiments, DenseNet reduces MAE from ~0.78 (baseline) to ~0.64 and raises exact accuracy from ~0.22 to ~0.60, with Pearson r ≈ 0.55–0.60 and R² ≈ 0.30–0.35. Similar improvements are observed in cross‑subject tests, where MAE drops by 0.15–0.20 and accuracy climbs to 0.45–0.55. These results demonstrate that the learned temporal representations capture meaningful information linking eye‑movement dynamics to subjective self‑reports, and that the model generalizes both across time for the same individuals and across different individuals.
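The less standard of these metrics, “Exact Accuracy,” is easy to state precisely in code. A minimal sketch, alongside MAE for comparison; clipping rounded predictions to [1, 7] is an assumption here, since rounding alone can leave the rating scale:

```python
import numpy as np

def exact_accuracy(pred, label, lo=1, hi=7):
    """Fraction of predictions that round to the correct integer on the rating scale."""
    rounded = np.clip(np.rint(pred), lo, hi)  # round, then clamp to the 1-7 scale
    return float(np.mean(rounded == label))

def mae(pred, label):
    """Mean Absolute Error of the raw (unrounded) predictions."""
    return float(np.mean(np.abs(pred - label)))

pred = np.array([1.4, 3.6, 5.0, 7.3])
label = np.array([1, 4, 5, 7])
print(exact_accuracy(pred, label))  # 1.0 (every prediction rounds to its label)
print(mae(pred, label))             # 0.275
```

This also illustrates why the two numbers can diverge: predictions can be uniformly biased yet still round to the right integers, so exact accuracy can look strong even where MAE reveals residual error.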

The authors highlight several contributions: (i) framing subjective‑report prediction as a multi‑target regression problem using raw gaze velocity; (ii) introducing a DenseNet‑based architecture that leverages dense connectivity and dilated convolutions for efficient temporal feature extraction; and (iii) providing systematic evaluations of longitudinal (cross‑round) and inter‑subject (cross‑subject) generalization. They also discuss limitations: the use of a regression loss on inherently ordinal 1‑7 ratings may introduce scaling artifacts; the dataset’s high‑precision laboratory eye‑tracker may limit transfer to consumer‑grade or mobile devices; the fixed 50‑second input window restricts applicability to shorter, real‑time interactions; and only velocity information is used, ignoring potentially informative cues such as pupil diameter or facial expressions.

Future work suggestions include adopting ordinal regression or label‑smoothing techniques to respect the discrete nature of the ratings, exploring variable‑length sequence models such as Temporal Convolutional Networks or Transformers, compressing the architecture for real‑time inference on lightweight hardware, and integrating multimodal signals (pupil size, blink rate, facial EMG) to enhance prediction robustness. Overall, the paper demonstrates that deep, densely connected temporal models can effectively bridge objective eye‑movement dynamics and subjective user experience, opening avenues for scalable, unobtrusive monitoring of mental and physical states in HCI and beyond.

