Harvesting Collective Intelligence: Temporal Behavior in Yahoo Answers
When harvesting collective intelligence, a user wishes to maximize the accuracy and value of the acquired information without spending too much time collecting it. We empirically study how people behave when facing these conflicting objectives using data from Yahoo Answers, a community-driven question-and-answer site. We take two complementary approaches. We first study how users behave when trying to maximize the amount of the acquired information, while minimizing the waiting time. We identify and quantify how question authors at Yahoo Answers trade off the number of answers they receive and the cost of waiting. We find that users are willing to wait longer to obtain an additional answer when they have only received a small number of answers; this implies decreasing marginal returns in the amount of collected information. We also estimate the user’s utility function from the data. Our second approach focuses on how users assess the qualities of the individual answers without explicitly considering the cost of waiting. We assume that users make a sequence of decisions, deciding to wait for an additional answer as long as the quality of the current answer exceeds some threshold. Under this model, the probability distribution for the number of answers that a question gets is an inverse Gaussian, which is a Zipf-like distribution. We use the data to validate this conclusion.
💡 Research Summary
The paper investigates how users of a collective‑intelligence platform balance the desire for more information against the cost of waiting for that information. Using a large dataset from Yahoo Answers (over 1.2 million questions posted between 2010 and 2012), the authors adopt two complementary analytical frameworks.
The first framework treats the user’s decision as a utility‑maximization problem in which the utility depends positively on the number of answers received (N) and negatively on the total waiting time (t). Formally, the utility is expressed as U(N, t)=f(N)−c·t, where f(N) captures the value of additional answers and c is a constant representing the per‑unit‑time cost. By fitting this model to the empirical data, the authors find that f(N) exhibits diminishing marginal returns: the first few answers provide a large boost to utility, while later answers add relatively little. For example, users are willing to wait an average of about two hours for the first answer, but only about half an hour for each subsequent answer after the fourth. Parameter estimation is performed via log‑linear regression and maximum‑likelihood methods, yielding statistically significant coefficients (p < 0.01). This analysis quantifies the trade‑off between information quantity and time cost and demonstrates that users behave as if they are rational agents facing decreasing returns to additional information.
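A minimal sketch of how such a utility model behaves, assuming a logarithmic value function; the concave form f(N) = a·log(1 + N) and the parameter values below are illustrative stand-ins, not the functional form or estimates reported in the paper:

```python
import math

# Illustrative utility model U(N, t) = f(N) - c*t. The concave form
# f(N) = A*log(1 + N) is a hypothetical choice used only to exhibit
# diminishing marginal returns; A and C are arbitrary example values.
A = 3.0  # hypothetical value-scale parameter
C = 1.0  # hypothetical waiting cost per hour

def f(n: int) -> float:
    """Value of having collected n answers (concave by construction)."""
    return A * math.log(1 + n)

def max_acceptable_wait(n: int) -> float:
    """Longest wait (in hours) for answer n+1 that still raises utility:
    waiting delta_t is worthwhile while f(n+1) - f(n) > C * delta_t."""
    return (f(n + 1) - f(n)) / C

# Marginal willingness to wait shrinks as answers accumulate,
# matching the decreasing-returns pattern described above.
waits = [max_acceptable_wait(n) for n in range(5)]
assert all(waits[i] > waits[i + 1] for i in range(len(waits) - 1))
```

Under any concave f, the acceptable wait for the next answer falls monotonically with the number of answers already in hand, which is the qualitative pattern the regression in the paper quantifies.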
The second framework focuses on the micro‑level decision process concerning answer quality. The authors assume that users evaluate each incoming answer against a quality threshold θ; they continue to wait for another answer only if the current answer’s quality exceeds θ. Answer arrivals are modeled as a Poisson process, and answer qualities are treated as independent random variables. Under the threshold policy, the total number of answers a question receives follows an inverse Gaussian (Wald) distribution:
f(N) = √(λ / (2πN³)) · exp(−λ(N − μ)² / (2μ²N)),
where μ is the mean number of answers and λ is a shape parameter.
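As a quick numerical sanity check of this density (a sketch only; the parameter values μ = 4 and λ = 2 are arbitrary examples, not estimates from the paper), the inverse Gaussian pdf should integrate to one over N > 0:

```python
import math

def inverse_gaussian_pdf(n: float, mu: float, lam: float) -> float:
    """Inverse Gaussian (Wald) density:
    f(n) = sqrt(lam / (2*pi*n^3)) * exp(-lam*(n - mu)^2 / (2*mu^2*n))."""
    return math.sqrt(lam / (2 * math.pi * n ** 3)) * math.exp(
        -lam * (n - mu) ** 2 / (2 * mu ** 2 * n)
    )

# Crude Riemann-sum check that the density integrates to ~1
# for example parameters (mean mu = 4 answers, shape lam = 2).
mu, lam = 4.0, 2.0
dx = 0.001
total = sum(inverse_gaussian_pdf(i * dx, mu, lam) * dx
            for i in range(1, 200_000))
assert abs(total - 1.0) < 1e-2
```

The N^(−3/2) prefactor is what gives the distribution its Zipf-like appearance over an intermediate range of N, before the exponential factor takes over in the far tail.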