Learning from Collective Intelligence in Groups

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Collective intelligence, which aggregates information shared by large crowds, is often degraded by unreliable sources that contribute low-quality data. This is a barrier to the effective use of collective intelligence in a variety of applications. To address this issue, we propose a probabilistic model that jointly assesses the reliability of sources and recovers the true data. We observe that different sources are often not independent of one another; rather, they tend to influence each other when sharing information, which makes them mutually dependent. High dependency between sources makes collective intelligence vulnerable to the overuse of redundant (and possibly incorrect) information from dependent sources. We therefore uncover the latent group structure among dependent sources and aggregate information at the group level rather than from individual sources directly. This prevents collective intelligence from being inappropriately dominated by dependent sources. We also explicitly model the reliability of groups and minimize the negative impact of unreliable groups. Experimental results on real-world datasets show the effectiveness of the proposed approach compared with existing algorithms.


💡 Research Summary

The paper tackles a fundamental weakness of collective intelligence systems: the susceptibility of aggregated crowd-sourced information to low‑quality or malicious sources. Traditional aggregation techniques, such as the Dawid‑Skene model, GLAD, or simple majority voting, assume that each contributor is an independent observer. In real‑world settings—social media, sensor networks, online reviews—contributors often influence one another, creating dependencies that cause redundant (and potentially erroneous) information to dominate the final estimate. To mitigate this, the authors propose a probabilistic framework that simultaneously discovers latent groups of mutually dependent sources and estimates the reliability of both groups and individual members.
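A toy simulation makes this failure mode concrete. The scenario below is illustrative only (not from the paper): six sources that copy one mediocre "seed" source overwhelm plain per-source majority voting, while collapsing the copying clique to a single group-level vote recovers most of the accuracy.

```python
import numpy as np

rng = np.random.default_rng(1)
t = rng.integers(0, 2, size=100)                      # true binary labels
# Three independent, fairly reliable sources (90% accurate)
honest = [np.where(rng.random(100) < 0.9, t, 1 - t) for _ in range(3)]
# One mediocre source (60% accurate) that six others copy verbatim
seed = np.where(rng.random(100) < 0.6, t, 1 - t)
copies = [seed.copy() for _ in range(6)]

# Per-source majority: the six identical copies dominate the vote
majority = (np.vstack(honest + copies).mean(axis=0) > 0.5).astype(int)
acc_individual = (majority == t).mean()

# Group-level aggregation: the copying clique collapses to one vote
group_majority = (np.vstack(honest + [seed]).mean(axis=0) > 0.5).astype(int)
acc_group = (group_majority == t).mean()
```

With nine per-source votes, the six dependent copies always outvote the three honest sources, so accuracy tracks the 60%-accurate seed; one vote per group restores the influence of the reliable sources.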

The model is built on a Bayesian non‑parametric foundation. Each source (i) is assigned to a latent group (z_i) drawn from a Dirichlet Process, allowing the number of groups to be inferred from the data rather than fixed a priori. Every group (k) possesses a reliability parameter (\theta_k) representing the probability that the group’s reported label matches the true underlying value. Within a group, each source (j) has its own reliability (\phi_{kj}), which modulates the group‑level reliability to capture individual variation. Observed labels (y_{ij}) are generated conditionally on the unknown true label (t_i), the group assignment, and the reliability parameters.
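The generative story above can be sketched as a simulation. This is a minimal illustration consistent with the description, not the authors' implementation: group assignments are drawn via a Chinese restaurant process (an equivalent sampling scheme for a Dirichlet Process mixture), and the particular way individual reliability modulates group reliability (a simple product here) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_sources=20, n_items=50, alpha=1.0):
    """Sample group assignments z, group reliabilities theta, individual
    reliabilities phi, true labels t, and observed labels y."""
    # Chinese restaurant process: source i joins an existing group with
    # probability proportional to its size, or opens a new one (prop. alpha)
    z, counts = [], []
    for _ in range(n_sources):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)          # open a new group
        else:
            counts[k] += 1
        z.append(k)
    K = len(counts)
    theta = rng.beta(8, 2, size=K)        # group-level reliability
    phi = rng.beta(8, 2, size=n_sources)  # individual modulation
    t = rng.integers(0, 2, size=n_items)  # true binary labels
    # Illustrative emission: source i reports the true label with
    # probability theta_{z_i} * phi_i, otherwise flips it
    y = np.empty((n_sources, n_items), dtype=int)
    for i in range(n_sources):
        correct = rng.random(n_items) < theta[z[i]] * phi[i]
        y[i] = np.where(correct, t, 1 - t)
    return z, theta, phi, t, y

z, theta, phi, t, y = simulate()
```

The number of groups K is not fixed in advance; it emerges from the CRP draw, mirroring the nonparametric prior described above.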

Because the true labels, group assignments, and reliability parameters are hidden, the authors employ variational Bayesian inference. They define a factorized variational distribution (q(z,\theta,\phi,t)) and maximize the evidence lower bound (ELBO). The update for the group assignment (q(z_i=k)) incorporates both the current estimate of group reliability and the assignments of other sources, effectively penalizing the over‑use of highly correlated information. Group and individual reliabilities are updated using conjugate Beta priors, which keep the estimates well‑behaved and prevent extreme values. This hierarchical treatment enables the model to down‑weight entire groups that appear unreliable while still leveraging trustworthy members within those groups.
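The flavor of this inference scheme can be conveyed with a simplified EM-style sketch, assuming known group assignments. It is not the paper's full variational algorithm, but it shows the two ingredients described above: aggregation of one vote per group (so correlated sources cannot dominate) and reliability updates smoothed by a conjugate Beta prior (so estimates avoid extreme values).

```python
import numpy as np

def group_level_em(y, z, n_iter=20, a=2.0, b=2.0):
    """Alternate between estimating true labels from log-odds-weighted
    group votes (E-step) and updating group reliabilities under a
    Beta(a, b) prior (M-step). y: (n_sources, n_items) binary labels,
    z: group index per source (each group assumed non-empty)."""
    y, z = np.asarray(y), np.asarray(z)
    K = z.max() + 1
    n_items = y.shape[1]
    rel = np.full(K, 0.8)                      # initial group reliabilities
    for _ in range(n_iter):
        # E-step: one vote per group (its mean report), weighted by the
        # log-odds of that group's reliability
        w = np.log(rel) - np.log1p(-rel)
        score = np.zeros(n_items)
        for k in range(K):
            g_vote = y[z == k].mean(axis=0)    # group's aggregate report
            score += w[k] * (2 * g_vote - 1)   # map [0,1] votes to [-1,1]
        t_hat = (score > 0).astype(int)
        # M-step: Beta-smoothed agreement rate against current labels
        for k in range(K):
            agree = (y[z == k] == t_hat).sum()
            total = (z == k).sum() * n_items
            rel[k] = (agree + a - 1) / (total + a + b - 2)
    return t_hat, rel
```

A group that mostly disagrees with the consensus acquires reliability below 0.5 and hence a negative log-odds weight, so its votes are effectively inverted or ignored, which is the down-weighting behavior the hierarchical model achieves in its full variational form.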

The experimental evaluation uses two real‑world datasets: a crowdsourced labeling task from a popular micro‑task platform and a set of product reviews from an e‑commerce site. Preliminary analysis confirms that source dependencies exist (e.g., reviewers often copy each other's opinions, and workers share common biases). The proposed method is benchmarked against Dawid‑Skene, GLAD, and recent graph‑based aggregation approaches. Across precision, recall, and F1 metrics, the new model achieves 5–12% higher scores. Moreover, when the identified low‑reliability groups are excluded or subjected to additional verification, overall accuracy improves by a further 3–4%.

Key contributions of the work are: (1) introducing a latent‑group structure that captures source dependencies, moving beyond the unrealistic independence assumption; (2) jointly estimating group‑level and individual reliabilities, thereby limiting the influence of unreliable clusters; (3) integrating a Dirichlet Process for automatic group discovery with an efficient variational inference algorithm; and (4) providing extensive empirical evidence that the approach outperforms state‑of‑the‑art methods on heterogeneous real data. The authors also discuss limitations, such as the static nature of groups and the reliance on discrete label spaces, and outline future directions, including dynamic group modeling, extension to multimodal data, and incorporation of explicit network information. Overall, the paper presents a robust, scalable solution for improving the fidelity of collective intelligence in environments where source dependence is the norm rather than the exception.

