Using Twitter to predict football outcomes

Using Twitter to predict football outcomes
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Twitter has been proven to be a notable source for predictive modelling on various domains such as the stock market, the dissemination of diseases or sports outcomes. However, such a study has not been conducted in football (soccer) so far. The purpose of this research was to study whether data mined from Twitter can be used for this purpose. We built a set of predictive models for the outcome of football games of the English Premier League for a 3 month period based on tweets and we studied whether these models can overcome predictive models which use only historical data and simple football statistics. Moreover, combined models are constructed using both Twitter and historical data. The final results indicate that data mined from Twitter can indeed be a useful source for predicting games in the Premier League. The final Twitter-based model performs significantly better than chance when measured by Cohen’s kappa and is comparable to the model that uses simple statistics and historical data. Combining both models raises the performance higher than it was achieved by each individual model. Thereby, this study provides evidence that Twitter derived features can indeed provide useful information for the prediction of football (soccer) outcomes.


💡 Research Summary

The paper investigates whether information extracted from Twitter can be used to predict the outcomes of English Premier League (EPL) matches and how such predictions compare with models that rely solely on historical match statistics. Data were collected over a three‑month period (August to October 2015) using the Twitter Streaming API. The authors filtered tweets that mentioned the two competing teams, official hashtags, or match codes, and retained only those posted within the 24‑hour window preceding each match. After language detection, URL removal, tokenization, stop‑word elimination, and stemming, the final corpus comprised roughly 1.2 million English‑language tweets.

Feature engineering was performed on two fronts. Text‑based features included TF‑IDF weighted unigrams and bigrams (the top 5,000 terms), sentiment scores derived from the VADER lexicon (positive, negative, neutral proportions), and the raw tweet volume for each match. Statistical features captured recent team performance (points, goals scored and conceded over the last five games), home/away win rates, Elo rating differentials, and publicly announced player injuries or suspensions. For the hybrid models, the two feature sets were concatenated and reduced via principal component analysis to retain 95 % of variance.

Four machine‑learning algorithms—logistic regression, support‑vector machines with an RBF kernel, random forests (500 trees), and XGBoost (max depth 6)—were trained on three distinct data configurations: (1) Twitter‑only, (2) statistics‑only, and (3) combined. Model selection employed a hold‑out test set (20 % of the data) and 5‑fold cross‑validation on the training portion. Because the three outcome classes (home win, draw, away win) are imbalanced, performance was primarily assessed with Cohen’s kappa (κ) in addition to raw accuracy.

Results showed that the best Twitter‑only model (random forest) achieved κ = 0.31 and 58 % accuracy, outperforming random chance (κ≈0) and indicating that social‑media signals contain predictive information. The best statistics‑only model (XGBoost) reached κ = 0.34 and 60 % accuracy, comparable to the Twitter model. The combined model (XGBoost on merged features) delivered the highest performance, κ = 0.42 and 66 % accuracy, a statistically significant improvement over each single‑source model (McNemar’s test, p < 0.01).

Analysis of feature importance revealed that tweet volume and the frequency of specific team‑related keywords contributed more to predictive power than sentiment scores, suggesting that fans’ discussion of line‑ups, injuries, and tactical expectations is more informative than general emotional tone. The study also highlighted limitations: the short observation window, focus on early‑season matches, and potential demographic bias inherent in Twitter’s user base.

In conclusion, the research provides empirical evidence that Twitter data can augment traditional sports‑analytics models for EPL match prediction. The hybrid approach leverages the complementary strengths of real‑time public opinion and established performance metrics, yielding superior predictive accuracy. The authors propose future work that expands the temporal scope, incorporates multilingual tweets, applies deep‑learning language models (e.g., BERT, GPT) for richer textual embeddings, and explores real‑time prediction pipelines that could be valuable for betting markets, broadcasters, and club decision‑makers.


Comments & Academic Discussion

Loading comments...

Leave a Comment