Beyond Literacy: Predicting Interpretation Correctness of Visualizations with User Traits, Item Difficulty, and Rasch Scores
Data Visualization Literacy assessments are typically administered via fixed sets of Data Visualization items, despite substantial heterogeneity in how different people interpret the same visualization. This paper presents and evaluates an approach for Predicting Human Interpretation Correctness (P-HIC) of data visualizations; that is, anticipating whether a specific person will interpret a given data visualization correctly before being exposed to it, enabling more personalized assessment and training. We operationalize P-HIC as a binary classification problem using 22 features spanning Human Profile, Human Performance, and Item difficulty (including ExpertDifficulty and RaschDifficulty). We evaluate three machine-learning models (Logistic Regression, Random Forest, Multi-Layer Perceptron) with and without feature selection, using a survey with 1,083 participants who answered 32 Data Visualization items (eight data visualizations with four items each), yielding 34,656 item responses. Performance is assessed via ten repetitions of ten-fold cross-validation on each of the 32 item-specific datasets, using AUC and Cohen’s kappa. Logistic Regression with feature selection is the best-performing approach, reaching a median AUC of 0.72 and a median kappa of 0.32. Feature analyses show RaschDifficulty as the dominant predictor, followed by experts’ ratings and prior correctness (PercCorrect), whose relevance increases across sessions. Profile information contributed little to P-HIC. Our results support the feasibility of anticipating misinterpretations of data visualizations, and motivate the runtime selection of data visualization items tailored to an audience, thereby improving the efficiency of Data Visualization Literacy assessment and targeted training.
💡 Research Summary
The paper tackles the problem of predicting whether an individual will correctly interpret a data visualization before they actually see it, a task the authors term Predicting Human Interpretation Correctness (P‑HIC). Traditional visualization literacy assessments use a fixed set of items, ignoring the substantial variability in how different users understand the same visualization. To address this, the authors designed a large‑scale online survey involving 1,083 participants who answered 32 visualization items (eight visualizations, each with four questions drawn from three types: Name, Function, Content). This yielded 34,656 item‑level responses.
For each response the authors constructed 22 features grouped into three families: (1) Item difficulty – ExpertDifficulty (median rating from seven visualization experts) and RaschDifficulty (item difficulty estimated via the Rasch model, which separates person ability from item difficulty); (2) Human performance – prior correctness proportion (PercCorrect), cumulative correct count, session index (as a proxy for fatigue), etc.; (3) Human profile – demographics such as age, gender, education, and self‑reported visualization experience. All features are measurable before the participant interacts with the item.
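The Rasch model underlying the RaschDifficulty feature assumes the probability of a correct response depends only on the difference between a person's ability θ and an item's difficulty b: P(correct) = σ(θ − b). The sketch below is a minimal illustration (not the paper's estimation procedure, which is unspecified here): it jointly fits abilities and difficulties by gradient ascent on the Bernoulli log-likelihood, using a synthetic response matrix; the `fit_rasch` function and all parameter values are assumptions for illustration.

```python
import numpy as np

def fit_rasch(responses, n_iter=500, lr=0.05):
    """Jointly estimate person abilities (theta) and item difficulties (b)
    for the Rasch model P(correct) = sigmoid(theta - b), via gradient
    ascent on the Bernoulli log-likelihood.
    `responses` is a persons x items matrix of 0/1 correctness scores."""
    n_persons, n_items = responses.shape
    theta = np.zeros(n_persons)
    b = np.zeros(n_items)
    for _ in range(n_iter):
        logits = theta[:, None] - b[None, :]
        p = 1.0 / (1.0 + np.exp(-logits))
        resid = responses - p            # d(log-likelihood)/d(logit)
        theta += lr * resid.sum(axis=1) / n_items
        b -= lr * resid.sum(axis=0) / n_persons
        b -= b.mean()                    # anchor the scale: mean difficulty = 0
    return theta, b

# Synthetic check: 200 persons, 3 items of increasing true difficulty.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=200)
true_b = np.array([-1.0, 0.0, 1.0])
probs = 1.0 / (1.0 + np.exp(-(true_theta[:, None] - true_b[None, :])))
data = (rng.random(probs.shape) < probs).astype(float)

theta_hat, b_hat = fit_rasch(data)
print(b_hat)  # recovered difficulties should increase across the three items
```

The key property this illustrates is the separation the summary mentions: the same difficulty estimates b apply regardless of which persons answered, so they can serve as a person-independent item feature.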
Three machine‑learning classifiers were evaluated: Logistic Regression (LR), Random Forest (RF), and a Multi‑Layer Perceptron (MLP). Each was trained both with and without a univariate feature‑selection step (SelectKBest), resulting in six model variants. Performance was assessed separately for each of the 32 item‑specific datasets using ten repetitions of ten‑fold cross‑validation. Because the data are imbalanced (incorrect answers are less frequent), the authors used Area Under the ROC Curve (AUC) and Cohen’s κ (chance‑corrected agreement) as evaluation metrics.
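The evaluation protocol described above maps directly onto standard scikit-learn components. The sketch below shows one of the six variants (LR with SelectKBest) under ten repetitions of stratified ten-fold cross-validation, scored with AUC and Cohen's κ; the synthetic data, the choice of `k=10`, and the scaling step are assumptions, since the summary does not specify them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for one item-specific dataset: 22 features, imbalanced labels
# (the real features and class ratio come from the survey data).
X, y = make_classification(n_samples=1000, n_features=22, n_informative=6,
                           weights=[0.3, 0.7], random_state=42)

# LR with univariate feature selection, mirroring the best-performing variant.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),   # k is an assumption here
    ("clf", LogisticRegression(max_iter=1000)),
])

# Ten repetitions of ten-fold (stratified) cross-validation.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_validate(pipe, X, y, cv=cv,
                        scoring={"auc": "roc_auc",
                                 "kappa": make_scorer(cohen_kappa_score)})
print(f"median AUC:   {np.median(scores['test_auc']):.2f}")
print(f"median kappa: {np.median(scores['test_kappa']):.2f}")
```

Reporting the median over the 100 folds per item, as the paper does across its 32 item-specific datasets, dampens the influence of occasional unlucky folds under class imbalance.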
Results show that LR with feature selection consistently outperformed the other configurations, achieving a median AUC of 0.72 and a median κ of 0.32 across items. RF and MLP attained lower AUCs (≈0.65–0.68) and did not benefit substantially from feature selection. Feature‑importance analysis revealed that RaschDifficulty was the strongest predictor, followed by ExpertDifficulty and PercCorrect. The importance of PercCorrect grew in later sessions, suggesting that prior performance and fatigue effects become more informative as the test progresses. In contrast, demographic profile features contributed little to predictive power; in some cases they added noise.
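One standard way to obtain the kind of feature-importance ranking reported above (not necessarily the paper's exact analysis) is permutation importance: shuffle one feature at a time on held-out data and measure the resulting drop in AUC. In the sketch below, the data are synthetic and the feature names are purely illustrative labels; the first column merely plays the role of a dominant predictor like RaschDifficulty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: with shuffle=False, the first two columns are the
# informative ones; the names below are illustrative, not the paper's data.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=1)
names = ["RaschDifficulty", "ExpertDifficulty", "PercCorrect", "Age", "Session"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

# Permutation importance: mean AUC drop when each feature is shuffled.
result = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                                n_repeats=20, random_state=0)
for name, imp in sorted(zip(names, result.importances_mean),
                        key=lambda t: -t[1]):
    print(f"{name:18s} {imp:.3f}")
```

Uninformative features score near zero under this measure, which matches the summary's observation that demographic profile features contributed little and sometimes added noise.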
The study’s contributions are threefold: (1) it defines and operationalizes the novel P‑HIC problem; (2) it demonstrates that item‑level difficulty estimated via Rasch modeling can be leveraged effectively alongside expert ratings; (3) it shows that a relatively simple logistic‑regression model with a compact set of performance‑based features can achieve useful prediction accuracy without relying on extensive user profiling.
Limitations include the homogeneity of the sample (English‑speaking adults, largely university‑educated), the restriction to relatively simple chart types (bar, line, etc.), and the absence of fine‑grained interaction data such as eye‑tracking or response times. The authors suggest future work should expand the participant pool across cultures and languages, incorporate more complex visualizations, and integrate multimodal behavioral signals. Moreover, embedding P‑HIC into adaptive testing platforms could enable real‑time selection of items that are neither too easy nor too hard for each learner, thereby improving the efficiency of visualization‑literacy assessment and targeted training.
Overall, the paper provides a solid empirical foundation for anticipatory, personalized assessment in the visualization domain and opens avenues for adaptive learning systems that dynamically tailor visual tasks to individual users’ abilities and current state.