Simple Surveys: Response Retrieval Inspired by Recommendation Systems
In the last decade, simple rating and comparison surveys have proliferated on social and digital media platforms to fuel recommendations. These simple surveys, extrapolated with machine learning algorithms, shed light on user preferences over large and growing pools of items such as movies, songs, and ads. Social scientists have a long history of measuring perceptions, preferences, and opinions, often over smaller, discrete item sets with exhaustive rating or ranking surveys. This paper introduces simple surveys for social science applications. We ran experiments to compare the predictive accuracy of both individual and aggregate comparative assessments using four types of simple surveys (pairwise comparisons and ratings on 2-point, 5-point, and continuous scales) in three distinct contexts: perceived safety of Google Streetview images, likeability of artwork, and hilarity of animal GIFs. Across contexts, we find that continuous-scale ratings best predict individual assessments but consume the most time and cognitive effort. Binary-choice surveys are quick and best predict aggregate assessments, useful for collective decision tasks, but poorly predict personalized preferences, the task for which Netflix currently uses them to recommend movies. Pairwise comparisons, by contrast, predict personal assessments well but aggregate assessments poorly, despite being widely used to crowdsource ideas and collective preferences. We demonstrate how findings from these surveys can be visualized in a low-dimensional space that reveals distinct respondent interpretations of the question asked in each context. We conclude by reflecting on differences between sparse, incomplete simple surveys and their traditional counterparts in terms of efficiency, information elicited, and settings in which knowing less about more may be critical for social science.
💡 Research Summary
The paper “Simple Surveys: Response Retrieval Inspired by Recommendation Systems” investigates how four minimalist questionnaire formats—binary (2‑point) choice, 5‑point Likert, continuous (100‑point slider), and pairwise comparison—perform when used to elicit subjective judgments in three visual domains: perceived safety of Google Street‑View panoramas, likeability of fine‑art images, and hilarity of short animal GIFs. Over 600 crowd‑sourced participants on Amazon Mechanical Turk completed 100 items per domain, with each participant answering 80 rating questions of a single type plus 20 pairwise comparisons (or 100 pairwise comparisons for the PC condition). The authors treat the observed responses as a sparse respondent‑by‑item matrix and apply matrix‑factorization‑based collaborative filtering to predict (1) the missing individual ratings for the 20 held‑out items and (2) the aggregate ranking that would emerge from pooling all respondents’ judgments.
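The sparse-matrix framing and the 80-observed / 20-held-out evaluation design can be sketched as follows. The split sizes mirror the setup described above, but the code itself is an illustrative reconstruction, not the authors' implementation:

```python
import random

def split_holdout(ratings, n_holdout=20, seed=0):
    """Split one respondent's observed (item, rating) pairs into a training
    set and a held-out set, mirroring the 80-observed / 20-held-out design.
    Illustrative sketch; not the authors' code."""
    rng = random.Random(seed)
    items = list(ratings)
    rng.shuffle(items)
    held = {i: ratings[i] for i in items[:n_holdout]}
    train = {i: ratings[i] for i in items[n_holdout:]}
    return train, held

# Hypothetical example: one respondent rated 100 items on a 5-point scale.
ratings = {f"item_{i}": random.Random(i).randint(1, 5) for i in range(100)}
train, held = split_holdout(ratings)
print(len(train), len(held))  # 80 20
```

Stacking each respondent's `train` dictionary row-wise yields the sparse respondent-by-item matrix; the `held` cells are what the collaborative-filtering model is asked to predict.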
Key methodological steps include: (i) encoding each survey type as a numeric matrix (binary 0/1, Likert 1‑5, slider 1‑100); (ii) fitting low‑rank factor matrices U (respondent latent factors) and V (item latent factors) via regularized squared‑error loss for rating data and a pairwise‑ranking loss for comparison data; (iii) selecting the latent dimensionality k and regularization strength through out‑of‑sample cross‑validation; and (iv) visualizing the first two latent dimensions to reveal how respondents cluster in interpretation space.
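Steps (ii) and (iii) can be illustrated with a minimal matrix factorization fitted by stochastic gradient descent on the regularized squared-error loss. The optimizer, hyperparameters, and toy data below are assumptions for illustration; the paper does not publish its exact implementation:

```python
import random

def fit_mf(obs, n_resp, n_items, k=2, lam=0.01, lr=0.03, epochs=500, seed=0):
    """Fit low-rank factors U (respondent latent factors) and V (item latent
    factors) to sparse ratings by SGD on the regularized squared-error loss
    sum (r - u.v)^2 + lam * (|u|^2 + |v|^2). `obs` is a list of
    (respondent, item, rating) triples. Illustrative sketch only."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_resp)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for r, i, rating in obs:
            pred = sum(U[r][f] * V[i][f] for f in range(k))
            err = rating - pred
            for f in range(k):
                u, v = U[r][f], V[i][f]
                U[r][f] += lr * (err * v - lam * u)
                V[i][f] += lr * (err * u - lam * v)
    return U, V

def predict(U, V, r, i):
    """Reconstruct one cell of the dense matrix from the learned factors."""
    return sum(uf * vf for uf, vf in zip(U[r], V[i]))

# Toy rank-1 data: rating(r, i) = taste[r] * appeal[i]; one cell held out.
taste, appeal = [1.0, 2.0, 1.0, 2.0], [1.0, 1.0, 2.0, 2.0]
obs = [(r, i, taste[r] * appeal[i])
       for r in range(4) for i in range(4) if (r, i) != (0, 3)]
U, V = fit_mf(obs, n_resp=4, n_items=4)
print(predict(U, V, 0, 3))  # should approach the true value, 2.0
```

In practice, `k` and `lam` would be tuned by out-of-sample cross-validation as in step (iii), and the first two columns of `U` give the respondent coordinates used in the visualizations of step (iv).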
The empirical findings are three‑fold. First, in predicting individual preferences, the continuous slider (R100) yields the lowest root‑mean‑square error, followed closely by pairwise comparisons. The finer granularity of R100 allows respondents to express subtle differences, and the direct relative information in PC also proves highly informative. By contrast, the binary (R2) and 5‑point (R5) scales provide coarser signals and achieve substantially higher prediction error. Second, for aggregate (population‑level) predictions, the binary choice format outperforms all others. Its forced‑choice nature reduces inter‑respondent variance, making the mean response a stable estimator of the collective attitude. The continuous scale, while excellent for personal nuance, amplifies individual idiosyncrasies and thus distorts the group average. Pairwise comparisons, despite being rich in relative information, suffer from heterogeneous interpretation across participants, leading to poorer consensus ranking. Third, cognitive load and time‑on‑task differ markedly: average completion times follow R2 < R5 < PC < R100, with R100 taking roughly 12 seconds longer per item and eliciting higher self‑reported fatigue.
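The aggregation of pairwise comparisons discussed above can be made concrete with a Bradley-Terry model, a standard way to turn pairwise win counts into a consensus ranking. This is a common choice rather than the paper's exact method (the paper embeds a pairwise-ranking loss inside the factorization), and the tallies below are hypothetical:

```python
def bradley_terry(wins, items, iters=200):
    """Estimate item strengths from pairwise outcomes with the classic
    Bradley-Terry minorization-maximization updates. `wins[(a, b)]` counts
    how often a was chosen over b. Illustrative aggregate-only model."""
    p = {i: 1.0 for i in items}
    for _ in range(iters):
        new = {}
        for i in items:
            num = sum(w for (a, b), w in wins.items() if a == i)
            den = sum(w / (p[a] + p[b])
                      for (a, b), w in wins.items() if i in (a, b))
            new[i] = num / den if den > 0 else p[i]
        total = sum(new.values())  # renormalize so strengths sum to n
        p = {i: v * len(items) / total for i, v in new.items()}
    return p

# Hypothetical tallies: wins[(a, b)] = times a was chosen over b.
wins = {('A', 'B'): 8, ('B', 'A'): 2,
        ('B', 'C'): 7, ('C', 'B'): 3,
        ('A', 'C'): 9, ('C', 'A'): 1}
strength = bradley_terry(wins, ['A', 'B', 'C'])
ranking = sorted(strength, key=strength.get, reverse=True)
print(ranking)  # ['A', 'B', 'C']
```

The heterogeneous-interpretation problem noted above shows up here directly: if different respondents compare items on different criteria, their win counts pull the shared strength parameters in conflicting directions, degrading the consensus ranking.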
The latent‑space visualizations corroborate these quantitative results. In the safety domain, items cluster tightly, indicating a shared interpretation of “safety” across respondents. In the humor and art domains, items spread across the two‑dimensional latent plane, suggesting that participants rely on multiple, possibly orthogonal, criteria (e.g., familiarity vs. novelty for humor; technical skill vs. emotional impact for art). This qualitative insight demonstrates that simple surveys can be used not only for prediction but also for uncovering hidden dimensions of how people understand a question.
The authors conclude that the choice of survey format should be driven by research goals. When the objective is personalized recommendation or fine‑grained psychometric measurement, continuous sliders or pairwise comparisons are preferable despite higher respondent burden. When the aim is to capture a collective stance for policy, market research, or crowd‑sourced decision making, binary choices are the most efficient and accurate. Moreover, integrating matrix‑factorization techniques allows researchers to recover a dense preference matrix from a sparse set of cheap responses, dramatically reducing the length and cost of traditional exhaustive surveys.
Limitations noted include reliance on a non‑representative MTurk sample and the modest scale of 100 items per domain; future work should test scalability to thousands of items and explore demographic weighting to improve external validity. Nonetheless, the study bridges a gap between social‑science measurement traditions and modern recommendation‑system algorithms, offering a practical, data‑driven toolkit for designing lean yet powerful surveys in the digital age.