Transformer-Based Modeling of User Interaction Sequences for Dwell Time Prediction in Human-Computer Interfaces

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

This study investigates dwell time prediction and proposes a Transformer framework based on interaction behavior modeling. The method first represents user interaction sequences on the interface by integrating dwell duration, click frequency, scrolling behavior, and contextual features, which are mapped into a unified latent space through embedding and positional encoding. On this basis, a multi-head self-attention mechanism captures long-range dependencies, while a feed-forward network performs deep nonlinear transformations to model the dynamic patterns of dwell time. Comparative experiments are conducted against BiLSTM, DRFormer, FEDformer, and iTransformer baselines under identical conditions. The results show that the proposed method achieves the best performance on MSE, RMSE, MAPE, and RMAE, and more accurately captures the complex patterns in interaction behavior. In addition, sensitivity experiments on hyperparameters and environments examine the impact of the number of attention heads, sequence window length, and device environment on prediction performance, further demonstrating the robustness and adaptability of the method. Overall, this study provides a new solution for dwell time prediction from both theoretical and methodological perspectives and verifies its effectiveness in multiple aspects.


💡 Research Summary

The paper addresses the problem of predicting user dwell time on graphical user interfaces, a key indicator of attention, engagement, and cognitive processing. Recognizing that traditional statistical methods and shallow machine‑learning models cannot capture the high‑dimensional, nonlinear, and long‑range dependencies inherent in interaction logs, the authors propose a Transformer‑based framework that models the entire sequence of user actions.

First, raw interaction logs (clicks, scrolls, cursor movements, timestamps, etc.) are pre‑processed into a unified representation. Four behavioral dimensions—dwell duration, click frequency, scrolling behavior, and contextual features (device type, OS, browser, ad slot, etc.)—are each embedded into fixed‑size vectors. A learnable or sinusoidal positional encoding is added to preserve temporal order, yielding a sequence X ∈ ℝ^{T×d}. This sequence is linearly projected into a latent space h_t = W_e · x_t + p_t.
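The embedding step described above can be sketched in PyTorch. This is a minimal illustration, not the authors' code: it assumes the four behavioral dimensions arrive as a single per-step feature vector (`feat_dim=4`) projected by one matrix `W_e`, whereas the paper may embed each signal separately before fusion; the sinusoidal encoding is one of the two options the paper mentions.

```python
import torch
import torch.nn as nn

class InteractionEmbedding(nn.Module):
    """Project per-step interaction features into a shared latent space
    and add sinusoidal positional encodings: h_t = W_e * x_t + p_t.
    (Illustrative sketch; feat_dim=4 for the four behavioral signals
    is an assumption, not taken from the paper.)"""

    def __init__(self, feat_dim: int, d_model: int, max_len: int = 512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)  # W_e
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-torch.log(torch.tensor(10000.0)) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)   # even dims: sine
        pe[:, 1::2] = torch.cos(pos * div)   # odd dims: cosine
        self.register_buffer("pe", pe)       # p_t, not trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, feat_dim) -> (batch, T, d_model)
        T = x.size(1)
        return self.proj(x) + self.pe[:T]

emb = InteractionEmbedding(feat_dim=4, d_model=64)
out = emb(torch.randn(2, 50, 4))
print(out.shape)  # torch.Size([2, 50, 64])
```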

The core model follows the standard Transformer encoder architecture. For each time step, queries, keys, and values are obtained via linear projections (W_Q, W_K, W_V). Scaled dot‑product attention computes the affinity matrix A = softmax(QKᵀ / √d_k), allowing every token to attend to all others. Multi‑head attention (M heads) concatenates the heads and projects them back, enabling the model to capture diverse interaction patterns simultaneously. A position‑wise feed‑forward network (FFN) with GELU activation adds depth and nonlinearity. Layer normalization, residual connections, and dropout are employed to stabilize training and avoid over‑fitting.
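Because the encoder follows the standard architecture, the block described here maps directly onto PyTorch's built-in layer; `d_model`, the head count, FFN width, and depth below are illustrative placeholders, not the paper's reported hyper-parameters.

```python
import torch
import torch.nn as nn

# One encoder block per the description: multi-head self-attention,
# position-wise FFN with GELU, residual connections, layer norm, dropout.
layer = nn.TransformerEncoderLayer(
    d_model=64,            # latent dimension (assumed value)
    nhead=8,               # M attention heads
    dim_feedforward=256,   # FFN hidden width (assumed value)
    dropout=0.1,
    activation="gelu",
    batch_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=4)

h = torch.randn(2, 50, 64)  # (batch, T, d_model) embedded sequence
z = encoder(h)              # every token attends to all others
print(z.shape)  # torch.Size([2, 50, 64])
```

`nn.TransformerEncoderLayer` internally performs the Q/K/V projections and the scaled dot-product softmax(QKᵀ/√d_k) described above, so the sketch does not re-implement them by hand.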

After the final encoder layer, a global average pooling aggregates the sequence into a single vector z. A linear regression head (p·z + b) outputs the predicted dwell time ŷ. The loss function is mean squared error (MSE), optimized with Adam and equipped with learning‑rate scheduling and early stopping.
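The prediction head and training setup can be sketched as follows. The learning rate, scheduler choice (`ReduceLROnPlateau`), and patience are assumptions for illustration; the paper only states that Adam, scheduling, and early stopping are used.

```python
import torch
import torch.nn as nn

class DwellTimeHead(nn.Module):
    """Global average pooling over time, then a linear regression
    head: y_hat = p . z + b."""

    def __init__(self, d_model: int):
        super().__init__()
        self.out = nn.Linear(d_model, 1)  # p, b

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, T, d_model)
        z = h.mean(dim=1)                 # global average pooling
        return self.out(z).squeeze(-1)    # (batch,)

head = DwellTimeHead(64)
pred = head(torch.randn(8, 50, 64))
loss = nn.MSELoss()(pred, torch.rand(8))

# Adam with plateau-based LR scheduling (one illustrative step).
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, patience=3)
loss.backward()
opt.step()
sched.step(loss.item())
print(pred.shape)  # torch.Size([8])
```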

The experimental evaluation uses the Avazu Click‑Through Rate dataset, repurposed for dwell‑time prediction by attaching timestamps and constructing dwell labels for both clicked and non‑clicked impressions. This dataset contains tens of millions of records with rich contextual metadata, making it a realistic benchmark for large‑scale HCI modeling. The authors compare their method against four baselines: BiLSTM, DRFormer, FEDformer, and iTransformer, all trained under identical preprocessing and hyper‑parameter settings.

Results (Table 1) show that the proposed Transformer consistently outperforms the baselines across four metrics: MSE (0.1361 vs. 0.1642 for BiLSTM), RMSE (0.3690 vs. 0.4052), MAPE (7.12 % vs. 8.73 %), and RMAE (0.2745 vs. 0.3124). The improvement is especially notable in relative‑error measures (MAPE, RMAE), indicating that the model handles the high inter‑user variability of dwell times more robustly.
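For reference, the four metrics can be computed as below. MSE, RMSE, and MAPE are standard; "RMAE" is not a universally fixed acronym, so this sketch takes it as the square root of the mean absolute error, which is one plausible reading and is labeled as an assumption.

```python
import numpy as np

def dwell_metrics(y_true, y_pred, eps=1e-8):
    """Return the four metrics reported in the paper.
    NOTE: RMAE is computed here as sqrt(MAE) -- an assumed
    definition, since the paper's summary does not spell it out."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    # eps guards against division by zero for zero dwell times
    mape = np.mean(np.abs(err) / (np.abs(y_true) + eps)) * 100.0
    rmae = np.sqrt(np.mean(np.abs(err)))
    return {"MSE": mse, "RMSE": rmse, "MAPE": mape, "RMAE": rmae}

m = dwell_metrics([1.0, 2.0, 4.0], [1.1, 1.8, 4.4])
print({k: round(v, 4) for k, v in m.items()})
```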

Sensitivity analyses explore three key factors: (1) the number of attention heads, (2) the length of the input window, and (3) the device environment (mobile vs. desktop). Increasing the number of heads from 1 to 8 yields a monotonic reduction in RMAE, confirming that multi‑head attention enriches the representation of complex interaction patterns. Window‑size experiments reveal an optimal range (≈40–60 time steps); shorter windows miss long‑range dependencies, while excessively long windows introduce noise. Performance remains stable across device types, demonstrating the model’s adaptability to heterogeneous deployment contexts.
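The window-length factor in the sensitivity study corresponds to how per-user event streams are sliced into fixed-length model inputs. A minimal sketch of that slicing, with `window=50` chosen to sit inside the reported ≈40–60 optimum (the stride value is an assumption):

```python
import numpy as np

def make_windows(seq, window=50, stride=10):
    """Slice one user's event sequence into fixed-length input windows.
    `window` is the hyper-parameter probed in the sensitivity study;
    the stride is illustrative, not taken from the paper."""
    return np.stack([seq[i:i + window]
                     for i in range(0, len(seq) - window + 1, stride)])

w = make_windows(np.arange(120), window=50, stride=10)
print(w.shape)  # (8, 50)
```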

The authors conclude that integrating multi‑modal behavioral embeddings with a Transformer encoder provides a powerful, generalizable solution for dwell‑time prediction. Their contributions are: (i) a unified embedding scheme for heterogeneous interaction signals, (ii) a Transformer architecture that captures both global dependencies and fine‑grained dynamics, (iii) comprehensive empirical validation against strong baselines, and (iv) thorough hyper‑parameter and environment robustness testing.

Future work is suggested in three directions: (a) augmenting the model with graph‑based Transformers to explicitly encode UI element relationships, (b) applying meta‑learning or federated learning to enable rapid adaptation to new domains such as education or healthcare, and (c) developing lightweight, on‑device versions for real‑time personalization. Overall, the paper demonstrates that Transformer‑based sequence modeling can substantially advance the state of the art in HCI dwell‑time prediction, with clear implications for adaptive interface design, recommendation systems, and user‑centred analytics.

