We analyze the structure and evolution of discussion cascades on four popular websites: Slashdot, Barrapunto, Meneame and Wikipedia. Despite the large heterogeneity among these sites, a preferential attachment (PA) model with bias to the root can capture the temporal evolution of the observed trees and many of their statistical properties, namely, the probability distributions of the branching factors (degrees), the subtree sizes and certain correlations. The parameters of the model are learned efficiently using a novel maximum likelihood estimation scheme for PA, and they provide a figurative interpretation of the communication habits and the resulting discussion cascades on the four websites.
Human communication patterns on the Internet are characterized by transient responses to social events. Examples of such phenomena are the discussion threads generated in news aggregators, the propagation of massively circulated Internet chain letters, or the synthesis of articles in collaborative web-based spaces such as Wikipedia.
These responses can be regarded as tree-like cascades of activity generated from an underlying social network. Typically, a trigger event, or a small set of initiators, generates a chain reaction which may catch the attention of other users, who end up participating in the cascade (see Figure 1 for examples) and attract even more users. Since these cascades of comments are a direct consequence of the information flow in a social system, understanding the mechanisms and patterns which govern them plays a fundamental role in contexts such as the spreading of technological innovations [23], the diffusion of news and opinion [11,20], social influence [1] or collective problem-solving [15].
Although information cascades have been extensively analyzed for particular domains, such as blogs [11,20], chain letters [21], Flickr [6], Twitter [17] or page diffusion on Facebook [24], the cascades considered in those studies rarely involve elaborate discussions or a complex interchange of opinions: generally, a small piece of information is simply forwarded from an individual to his or her direct neighbors. To the best of our knowledge, with the exception of [16], no previous work exists on modeling the evolution and structure of long discussion-based cascades.
Here, as in [16], we consider several websites where the associated (discussion) cascades contain a high level of interaction. We analyze for the first time the cascades of the popular news aggregators Slashdot, Barrapunto (a Spanish version of Slashdot) and Meneame (a Spanish Digg clone), and of the English Wikipedia. As the reader may notice, these datasets are quite heterogeneous. For instance, although posts on both Slashdot and Meneame correspond to popular news items which rely on broadcast events, Slashdot contains rich and very extensive comments, which are less frequent on Meneame. The cascades found in Wikipedia, on the other hand, represent a collaborative effort towards a well-defined goal: to produce a free, reliable article.
In this study we address the following questions: what are the statistical patterns that determine the structure of such cascades and their evolution? Can these patterns be largely determined regardless of semantic information using a simple parametric model? Can the parameterization corresponding to a given website provide a global characterization for it?
We first provide a global analysis of the cascade behavior on the four websites. Among other results, we find that the sizes of the cascades typically have a clearly defined scale, which seems to contradict the recent results of [16]. Our analysis also highlights the importance of repetitive user participation in relation to other types of cascades and their impact on the entire social network. We also present a growth model for discussion cascades which is validated on the four datasets. Our approach is based on a simple model of preferential attachment (PA) [2], where new contributions in the cascade tree are linked to existing contributions with a probability which depends on their popularity (degree).
Two key ingredients characterize our approach. First, we account for a certain bias favoring the root, or event initiator. In this way, we are able to capture the different processes governing the global responses (direct reactions) and the localized responses of the system. Second, we use a likelihood method developed specifically for this study, which allows an efficient estimation of the model parameters and takes the entire generative model into account. The method is applicable not only to the data considered here but also to a more general class of growing graphs. Here we are only interested in the stochastic process which generates the cascade. We do not model network dynamics or a termination criterion for the cascades. Such a model could be built on top of our current model, as is done for example in [16].
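The two ingredients can be illustrated with a minimal sketch. It is not the exact model or estimator of this paper: we assume, purely for illustration, that the attachment weight of a comment is its number of direct replies plus one, and that the weight of the root is multiplied by a hypothetical parameter `root_bias`. All function and parameter names are our own. The sketch grows a cascade under these assumptions and evaluates the log-likelihood of an observed sequence of parent choices, accounting for the whole generative process step by step.

```python
import math
import random


def attachment_weights(degrees, root_bias):
    """Unnormalized attachment weights of all existing nodes.

    Assumption (illustrative only): weight = number of replies + 1,
    and the weight of the root (node 0) is multiplied by root_bias.
    """
    weights = [d + 1.0 for d in degrees]
    weights[0] *= root_bias
    return weights


def grow_cascade(n_comments, root_bias=1.0, seed=None):
    """Grow a cascade tree one comment per time step.

    Returns a list `parents` where parents[i] is the parent of node i + 1.
    Node 0 is the post (root).
    """
    rng = random.Random(seed)
    degrees = [0]  # reply counts; only the root exists at the start
    parents = []
    for _ in range(n_comments):
        weights = attachment_weights(degrees, root_bias)
        parent = rng.choices(range(len(degrees)), weights=weights, k=1)[0]
        parents.append(parent)
        degrees[parent] += 1
        degrees.append(0)  # the new comment has no replies yet
    return parents


def log_likelihood(parents, root_bias):
    """Log-likelihood of an observed parent sequence under the sketch model."""
    degrees = [0]
    total = 0.0
    for parent in parents:
        weights = attachment_weights(degrees, root_bias)
        total += math.log(weights[parent] / sum(weights))
        degrees[parent] += 1
        degrees.append(0)
    return total


if __name__ == "__main__":
    tree = grow_cascade(200, root_bias=5.0, seed=1)
    # Crude grid search over the hypothetical bias parameter.
    grid = [0.5, 1.0, 2.0, 5.0, 10.0, 20.0]
    best = max(grid, key=lambda b: log_likelihood(tree, b))
    print("root_bias maximizing the likelihood on the grid:", best)
```

Under these assumptions, a larger `root_bias` yields more direct reactions to the post, while a smaller one favors replies nested deep within the discussion. The grid search is only a stand-in for a proper maximum likelihood optimizer.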
In the next section, we explain the proposed model and how we estimate its parameters. Section 3 introduces the datasets and provides a global analysis of their main characteristics. In Section 4 we explain the main results and give an interpretation of the parameters of the model. Section 5 describes related work, and Section 6 discusses the results. In the Appendix we explain some aspects of the likelihood approach which are important for the estimation of the parameters.
We model a discussion cascade as a growing network in which nodes correspond to comments and the initial node corresponds to the post (a news article, for instance). A new node is added sequentially at discrete time-steps. Our model is based on the original PA model to which we add a