Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data
Use of socially generated “big data” to access information about the collective states of mind in human societies has become a new paradigm in the emerging field of computational social science. A natural application of this would be the prediction of a society’s reaction to a new product in terms of popularity and adoption rate. However, bridging the gap between “real time monitoring” and “early predicting” remains a big challenge. Here we report on an endeavor to build a minimalistic predictive model for the financial success of movies based on collective activity data of online users. We show that the popularity of a movie can be predicted well before its release by measuring and analyzing the activity level of editors and viewers of the movie’s entry in Wikipedia, the well-known online encyclopedia.
💡 Research Summary
This paper presents a novel approach in computational social science by utilizing “big data” from Wikipedia to predict the box office success of movies well before their release. Moving beyond real-time monitoring of social media platforms like Twitter, the research explores the predictive power of user activity on the more structured, collaborative environment of Wikipedia.
The study focuses on a sample of 312 movies released in the United States in 2010. For each movie, the researchers tracked the corresponding Wikipedia article and collected time-series data on four key activity metrics: the number of page views (V), the number of unique human editors (U), the number of edits (E), and a measure of collaborative rigor (R). These Wikipedia-based predictors were analyzed alongside a traditional market variable: the number of theaters screening the movie on its opening weekend (T). The target variable was the film’s opening weekend box office revenue.
The core methodological approach involved building multivariate linear regression models using different combinations of these predictor variables. The predictive power of each model was evaluated using 10-fold cross-validation and measured by the coefficient of determination (R²). The temporal evolution of both the correlation between individual predictors and revenue, and the R² of the models, was analyzed from the article’s creation up to the release date.
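The evaluation procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the data is synthetic, the weights and noise level are arbitrary, and the helper name `cross_validated_r2` is hypothetical. It also assumes that R² is computed on pooled out-of-fold predictions, as $R^2 = 1 - SS_{res}/SS_{tot}$; the summary does not say whether the paper pools predictions or averages per-fold scores, nor whether variables were transformed before fitting.

```python
import numpy as np


def cross_validated_r2(X, y, n_folds=10, seed=0):
    """Out-of-fold R^2 of an ordinary least squares fit with an intercept."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    folds = np.array_split(order, n_folds)
    predictions = np.empty(len(y), dtype=float)
    for test_idx in folds:
        train_idx = np.setdiff1d(order, test_idx)
        # Fit on the training folds only, with a leading column of ones
        # acting as the intercept term.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Predict the held-out fold with the coefficients just fitted.
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        predictions[test_idx] = A_test @ coef
    ss_res = np.sum((y - predictions) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for the data set: 312 rows, one column per predictor.
# Values and weights are invented for illustration only.
names = ["V", "U", "R", "E", "T"]
rng = np.random.default_rng(42)
X = rng.normal(size=(312, len(names)))
weights = np.array([1.0, 0.5, 0.3, 0.4, 0.8])
y = X @ weights + rng.normal(scale=0.5, size=312)

# Compare a theater-count-only model with the full predictor set.
theater_col = [names.index("T")]
r2_theaters = cross_validated_r2(X[:, theater_col], y)
r2_all = cross_validated_r2(X, y)
print(f"R^2 with {{T}} only:        {r2_theaters:.3f}")
print(f"R^2 with {{V, U, R, E, T}}: {r2_all:.3f}")
```

Repeating the same call on predictor values measured at different times before release would give an R² curve over time of the kind the study analyzes.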
Key findings reveal that all Wikipedia activity metrics showed a significant correlation with box office revenue, with the correlation strength increasing as the release date approached. Notably, the number of page views (V) exhibited the highest correlation in the pre-release period. The most effective predictive model incorporated all four Wikipedia metrics plus the theater count ({V, U, R, E, T}). This model achieved a high R² of approximately 0.77 as early as one month before a movie’s release, significantly outperforming a model based on theater count alone.
The study compares its results with a prominent prior work that used Twitter mention volume for prediction. While the Twitter-based model achieved a slightly higher R² (0.98) on the night before release for a smaller sample (24 movies), the Wikipedia-based model maintained strong predictive accuracy (R² > 0.92) a month in advance. The authors attribute this advantage to the nature of Wikipedia contributors: they are often deeply interested, informed followers of the film industry who begin gathering information and editing articles long before the marketing campaign peaks and generates mass chatter on platforms like Twitter.
The research demonstrates that collective attention and engagement, as reflected in the passive (views) and active (edits) use of collaborative knowledge platforms, can serve as powerful early indicators of real-world success. The methodology is language-agnostic and based on simple activity statistics, making it potentially applicable to non-English markets and other products beyond movies. The paper concludes by suggesting that predictive power could be further enhanced by incorporating more sophisticated techniques like sentiment analysis or neural networks, and that platforms like Wikipedia represent a valuable, underutilized resource for gauging public interest and predicting societal trends.