Insights from the Wikipedia Contest (IEEE Contest for Data Mining 2011)

The Wikimedia Foundation has recently observed that newly joining editors on Wikipedia are increasingly failing to integrate into the Wikipedia editors’ community, i.e. the community is becoming increasingly harder to penetrate. To sustain healthy growth of the community, the Wikimedia Foundation aims to quantitatively understand the factors that determine the editing behavior, and explain why most new editors become inactive soon after joining. As a step towards this broader goal, the Wikimedia foundation sponsored the ICDM (IEEE International Conference for Data Mining) contest for the year 2011. The objective for the participants was to develop models to predict the number of edits that an editor will make in future five months based on the editing history of the editor. Here we describe the approach we followed for developing predictive models towards this goal, the results that we obtained and the modeling insights that we gained from this exercise. In addition, towards the broader goal of Wikimedia Foundation, we also summarize the factors that emerged during our model building exercise as powerful predictors of future editing activity.

💡 Research Summary

The paper presents a comprehensive account of the team’s participation in the 2011 IEEE International Conference on Data Mining (ICDM) Wikipedia Contest, which aimed to predict the number of edits a Wikipedia contributor would make over the next five months based on their historical activity. The authors begin by describing the motivation behind the contest: the Wikimedia Foundation had observed a growing difficulty for new editors to integrate into the community and a high attrition rate among newcomers. To support the foundation’s strategic goal of sustaining healthy community growth, a predictive modeling challenge was launched.

The dataset supplied for the contest comprised edit logs for roughly one million registered users spanning from January 2010 to May 2011. Each record contained the editor’s identifier, timestamp, page identifier, namespace (article, talk, user, etc.), edit comment, and the size of the edit. The target variable was the total number of edits each user would make between June and October 2011. The evaluation metric was the Root Mean Squared Logarithmic Error (RMSLE), which penalizes large relative errors and is appropriate for the highly skewed distribution of edit counts.

Initial exploratory analysis revealed extreme right‑skewness: a large majority of users made very few edits, while a small minority contributed thousands. Directly feeding raw counts into a regression model would therefore lead to biased predictions and poor generalisation. To address this, the authors applied a log(1 + x) transformation to both the target and several continuous features, which stabilised variance and reduced the impact of outliers.

Feature engineering formed the core of the methodology. The authors derived more than thirty predictive variables, grouped into several categories:

Cumulative activity – total edits up to the cut‑off date.
Recency metrics – edits in the last 30, 90, and 180 days, as well as the average time between successive edits.
Diversity measures – number of distinct pages edited, proportion of edits across different namespaces, and the count of unique talk‑page interactions.
Edit‑comment signals – length of the edit summary, frequency of keywords such as “revert”, “add”, and “delete”.
Temporal patterns – distribution of edits across hours of the day and days of the week.
Interaction features – co‑editing with other users on talk pages, indicating social integration.

Correlation analysis and variable‑importance scores from tree‑based models highlighted three features that consistently ranked highest: edits in the most recent 30‑day window, the number of distinct pages edited, and the presence of “revert” in edit comments. These variables were later confirmed as the strongest predictors of future activity.

The modeling pipeline explored several algorithms. Linear regression and its regularised variants (Lasso, Ridge) served as baselines, achieving RMSLE scores around 0.78–0.74. Tree‑based ensembles—Random Forest, Gradient Boosting Machine (GBM), and XGBoost—substantially improved performance, with XGBoost reaching an RMSLE of 0.58 after careful tuning of depth, learning rate, and subsample ratios. However, individual models exhibited over‑fitting on high‑frequency editors, manifesting as overly optimistic predictions for a small subset of users.

To mitigate this, the authors employed 5‑fold cross‑validation and constructed a stacked ensemble. Base learners (Linear Regression, Random Forest, XGBoost) generated out‑of‑fold predictions, which were then combined using a meta‑learner (Lasso regression) that learned optimal weights. This stacking approach yielded the best result—an RMSLE of 0.55, placing the solution within the top five percent of contest submissions.

Beyond predictive accuracy, the study extracted actionable insights for the Wikimedia Foundation. First, recency is paramount: editors who were active in the month preceding the prediction window are far more likely to remain active. Second, activity diversity—editing many distinct pages and participating across namespaces—correlates with long‑term retention, suggesting that encouraging breadth rather than depth may improve newcomer survival. Third, social engagement measured through talk‑page edits is a positive signal, implying that mentorship or community‑building initiatives could reduce attrition. Conversely, a high frequency of “revert” comments signals conflict or frustration, which often precedes a sharp decline in activity.

The authors conclude that a combination of log‑transformed targets, rich temporal and semantic features, and ensemble learning provides a robust framework for forecasting user contributions in highly skewed, sparse activity data. They recommend that Wikipedia’s editorial support teams use the identified key predictors to flag at‑risk newcomers early and intervene with targeted onboarding resources. Moreover, the methodological lessons—particularly the effectiveness of stacking heterogeneous models and the importance of handling zero‑inflated distributions—are transferable to other online collaborative platforms seeking to predict user engagement and mitigate churn.