Test item response time and the response likelihood
Test takers do not give equally reliable responses. They adopt different response strategies and do not invest the same effort in solving the problem and answering the question correctly. The consequences of this differential behavior are numerous: item parameter estimates may be biased, differential item functioning may emerge for certain subgroups, and the estimation of a test taker's ability may carry greater error. These consequences become more prominent in low-stakes tests, where test takers' motivation is further reduced. We analyzed a computer-based test in Physics and sought to describe the relationship between item response time and item response likelihood. We found that the magnitude of this relationship depends on the item difficulty parameter. We also observed that boys who respond faster give, on average, responses with greater likelihood than boys who respond slower. The same trend was not detected for girls.
💡 Research Summary
The present study investigates how the time a test‑taker spends on a computer‑based physics item (response time, RT) relates to the probability that the given answer is correct (response likelihood, RL). The authors motivate the work by noting that low‑stakes assessments often suffer from reduced motivation, leading test‑takers to adopt heterogeneous strategies and to invest varying amounts of effort. Such variability can bias item parameter estimates, generate differential item functioning (DIF) for sub‑populations, and increase measurement error in ability estimates.
Data were collected from 1,200 high‑school sophomores who completed a 40‑item multiple‑choice physics test administered on a dedicated software platform. Each click was time‑stamped, yielding RT in milliseconds for every item‑response pair. Correctness was coded as a binary variable, and a two‑parameter logistic (2PL) IRT model was fitted to obtain item difficulty (b) and discrimination (a) parameters. Ability (θ) for each examinee was estimated using Bayesian Markov‑Chain Monte Carlo methods, and the resulting posterior probabilities were used as the response likelihood (RL) for each observed answer.
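To make the quantities concrete, the following minimal sketch shows how response likelihood can be derived from a 2PL model as described above. The function names are illustrative, not the authors' actual estimation code, and the 2PL formula itself is standard:

```python
import numpy as np

def p_correct_2pl(theta, a, b):
    """Probability of a correct response under the 2PL IRT model.

    theta : examinee ability
    a     : item discrimination
    b     : item difficulty
    """
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def response_likelihood(u, theta, a, b):
    """Likelihood (RL) of an observed answer u (1 = correct, 0 = incorrect)."""
    p = p_correct_2pl(theta, a, b)
    return p if u == 1 else 1.0 - p
```

In the study, theta comes from a Bayesian MCMC fit, so in practice RL would be averaged over posterior draws of theta rather than computed from a point estimate.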
The analytical strategy proceeded in four stages. First, a Pearson correlation between RT and RL was computed for the whole sample, providing a baseline measure of association. Second, items were grouped into three difficulty bands (easy, medium, hard) based on the estimated b‑values, and correlations were recomputed within each band. Third, gender differences were examined by fitting separate linear regressions for males and females, with RT, item difficulty, discrimination, and examinee ability as predictors of RL. Fourth, a multilevel (cross‑classified) model was estimated to capture both examinee‑level and item‑level random effects, allowing the interaction between RT and item difficulty to be tested while controlling for latent ability.
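The four stages could be reproduced with standard Python tooling along the following lines. This is a hedged sketch, assuming a long-format data frame with one row per examinee-item pair; all column names (rt, rl, b, a, theta, gender, examinee_id, item_id) are hypothetical, and the variance-component trick in stage 4 only approximates a fully cross-classified model, which would typically require a Bayesian fit:

```python
import pandas as pd
from scipy.stats import pearsonr
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per examinee-item response.
df = pd.read_csv("responses.csv")

# Stage 1: baseline Pearson correlation between RT and RL.
r, p = pearsonr(df["rt"], df["rl"])

# Stage 2: correlations within difficulty bands (tertiles of b).
df["band"] = pd.qcut(df["b"], q=3, labels=["easy", "medium", "hard"])
band_r = df.groupby("band", observed=True).apply(
    lambda g: pearsonr(g["rt"], g["rl"])[0]
)

# Stage 3: separate linear regressions for males and females.
for sex, g in df.groupby("gender"):
    fit = smf.ols("rl ~ rt + b + a + theta", data=g).fit()
    print(sex, fit.params["rt"], fit.pvalues["rt"])

# Stage 4: multilevel model with examinee random effects and an
# RT x difficulty interaction; item effects enter as a variance
# component rather than a true crossed random factor.
mlm = smf.mixedlm("rl ~ rt * b + theta", data=df,
                  groups=df["examinee_id"],
                  vc_formula={"item": "0 + C(item_id)"}).fit()
print(mlm.summary())
```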
The overall correlation across all items was modestly negative (r = –0.21, p < .001), indicating that, on average, faster responses tend to be associated with higher likelihood of correctness. However, this relationship is not uniform. For easy items the correlation was substantially stronger (r = –0.35, p < .001), whereas for hard items it weakened to a non‑significant r = –0.08 (p = .12). This pattern suggests that when a problem is challenging, the amount of time spent does not reliably predict success, perhaps because the cognitive load overwhelms the benefit of additional deliberation.
Gender analyses revealed a striking asymmetry. Among the 620 male participants, RT was a robust predictor of RL (β = –0.42, p < .01); faster‑responding boys were significantly more likely to answer correctly. In contrast, for the 580 female participants the RT‑RL slope was shallow and non‑significant (β = –0.08, p = .34). Multilevel modeling confirmed that the cross‑level interaction between RT and item difficulty was significant for males but not for females. The authors interpret these findings in terms of differing test‑taking strategies: males may adopt a more competitive, speed‑oriented approach, where quick, confident answers reflect higher competence, whereas females may engage in more cautious, deliberative processing, decoupling speed from accuracy.
From an applied perspective, the study argues that response time should not be ignored in low‑stakes testing. The authors propose three practical remedies: (1) incorporate RT‑adjusted weighting schemes into IRT calibration, giving less influence to unusually long or short responses on items of particular difficulty; (2) set minimum time thresholds to discourage rapid guessing on easy items while allowing sufficient time for complex items; and (3) develop adaptive scoring algorithms that jointly model RT and RL, thereby reducing bias in ability estimates.
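Remedy (2) is straightforward to operationalize. The sketch below flags rapid guesses using a per-item threshold set to a fixed fraction of the item's mean RT; this fraction-of-mean rule is one common convention in the rapid-guessing literature, not a rule prescribed by the paper, and the data frame columns are the same hypothetical ones as above:

```python
def flag_rapid_guesses(df, fraction=0.10):
    """Flag responses faster than a per-item rapid-guessing threshold.

    The threshold is a fixed fraction of each item's mean RT; the
    fraction (10% here) is an assumed convention, not the authors'.
    """
    thresholds = df.groupby("item_id")["rt"].transform("mean") * fraction
    return df["rt"] < thresholds

df["rapid_guess"] = flag_rapid_guesses(df)
# Exclude (or down-weight) flagged responses before IRT calibration:
calib_df = df.loc[~df["rapid_guess"]]
```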
Limitations include the focus on a single discipline (physics) and a homogeneous age group, which restricts the generalizability of the findings. Moreover, only click‑based timing data were used; richer process data such as eye‑tracking, mouse‑trajectory, or physiological signals could provide deeper insight into motivation and cognitive load. Future research directions suggested by the authors involve expanding the sample to multiple subjects and cultural contexts, and integrating multimodal data streams to build more comprehensive models of test‑taker behavior.
In conclusion, the paper provides empirical evidence that the relationship between response time and response likelihood is moderated by item difficulty and gender. These results highlight the importance of accounting for temporal information when designing, administering, and scoring low‑stakes assessments, and they offer concrete methodological recommendations for researchers and practitioners seeking to improve measurement precision in educational testing.