Extending a Parliamentary Corpus with MPs' Tweets: Automatic Annotation and Evaluation Using MultiParTweet
Social media serves as a critical medium in modern politics because it both reflects politicians’ ideologies and facilitates communication with younger generations. We present MultiParTweet, a multilingual tweet corpus from X that connects politicians’ social media discourse with the German political corpus GerParCor, thereby enabling comparative analyses between online communication and parliamentary debates. MultiParTweet contains 39 546 tweets, including 19 056 media items. Furthermore, we used nine text-based models and one vision-language model (VLM) to annotate MultiParTweet with emotion, sentiment, and topic labels. The automated annotations are evaluated against a manually annotated subset. MultiParTweet can be reconstructed using our tool, TTLABTweetCrawler, which provides a framework for collecting data from X. As a methodological demonstration, we examine whether each model's output can be predicted from the outputs of the remaining models. In summary, we provide MultiParTweet, a resource integrating automatic text and media-based annotations validated with human annotations, and TTLABTweetCrawler, a general-purpose X data collection tool. Our analysis shows that the models are mutually predictable. In addition, VLM-based annotations were preferred by human annotators, suggesting that multimodal representations align more closely with human interpretation.
💡 Research Summary
Social media has become an important arena in which politicians broadcast their ideologies and reach younger audiences. However, formal parliamentary records and the informal, fast-moving discourse on platforms like X (formerly Twitter) are usually analyzed separately. This paper addresses that gap by introducing MultiParTweet, a multilingual tweet corpus that extends the existing German parliamentary corpus, GerParCor. By linking politicians’ social media activity to their parliamentary debates, the authors enable comparative analyses of the two forms of political communication.
The MultiParTweet dataset comprises 39,546 tweets, including 19,056 media items. To add structure to this data, the authors built an automated annotation pipeline that uses nine text-based models alongside one Vision-Language Model (VLM) to label the corpus along three dimensions: emotion, sentiment, and topic. Including a VLM allows the pipeline to take the attached media into account rather than the tweet text alone, which matters because many posts rely on images for part of their meaning.
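One way to picture these annotation layers is a per-tweet record that stores one label per model and task. The sketch below is purely illustrative: the class names, field names, model identifiers, and labels are hypothetical and are not taken from the corpus's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    model: str     # hypothetical model identifier
    task: str      # "emotion", "sentiment", or "topic"
    label: str     # label assigned by the model
    modality: str  # "text" for text-based models, "vlm" for the VLM


@dataclass
class TweetRecord:
    tweet_id: str
    text: str
    media_paths: list[str] = field(default_factory=list)
    annotations: list[Annotation] = field(default_factory=list)

    def labels_for(self, task: str) -> dict[str, str]:
        """Return a mapping of model name to label for a single task."""
        return {a.model: a.label for a in self.annotations if a.task == task}


# Toy usage with made-up values.
record = TweetRecord(tweet_id="t1", text="Example tweet", media_paths=["img1.jpg"])
record.annotations.append(Annotation("text_model_1", "sentiment", "positive", "text"))
record.annotations.append(Annotation("vlm_1", "sentiment", "neutral", "vlm"))
record.annotations.append(Annotation("text_model_1", "topic", "economy", "text"))
sentiment_labels = record.labels_for("sentiment")
```

Keeping every model's label side by side in one record makes it straightforward to compare text-based and VLM-based annotations for the same tweet.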
The automated annotations were evaluated against a manually annotated subset, and the study reports two main findings. First, the models are "mutually predictable": the output of each model can be predicted from the outputs of the remaining models, which indicates that the automated annotations are consistent with one another. Second, human annotators preferred the VLM-based annotations over the purely text-based ones. The authors take this as a sign that multimodal representations, which account for both text and imagery, align more closely with human interpretation.
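The idea of mutual predictability can be illustrated with a leave-one-model-out check. The summary does not describe the predictor the authors used, so the sketch below swaps in a deliberately simple one: a lookup table that maps each combination of the other models' labels to the target model's most frequent label. The function name, model names, and toy data are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def leave_one_out_predictability(rows, train_fraction=0.8):
    """Estimate how well each model's label is predicted by the others.

    rows: list of dicts mapping model name -> label for one tweet.
    Returns a dict mapping each target model to held-out accuracy.
    """
    models = sorted(rows[0])
    split = int(len(rows) * train_fraction)
    train, test = rows[:split], rows[split:]
    if not train or not test:
        raise ValueError("need at least one training and one test row")

    scores = {}
    for target in models:
        others = [m for m in models if m != target]
        table = defaultdict(Counter)  # context of other labels -> target label counts
        overall = Counter()           # fallback for contexts unseen in training
        for row in train:
            context = tuple(row[m] for m in others)
            table[context][row[target]] += 1
            overall[row[target]] += 1
        default_label = overall.most_common(1)[0][0]

        correct = 0
        for row in test:
            context = tuple(row[m] for m in others)
            if context in table:
                predicted = table[context].most_common(1)[0][0]
            else:
                predicted = default_label
            correct += predicted == row[target]
        scores[target] = correct / len(test)
    return scores


# Toy data in which three hypothetical models always agree.
toy_rows = [
    {"text_model_a": label, "text_model_b": label, "vlm": label}
    for label in ["positive", "negative"] * 5
]
toy_scores = leave_one_out_predictability(toy_rows)
```

High held-out accuracy for every target model would correspond to the mutual predictability described above. A real analysis would more likely use a trained classifier and cross-validation rather than a single fixed split.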
Furthermore, the paper introduces TTLABTweetCrawler, a general-purpose tool for collecting data from X, with which MultiParTweet can be reconstructed. In summary, MultiParTweet is a resource that combines automatic text-based and media-based annotations, validated against human annotations, and links parliamentary records to politicians' social media discourse. The reported preference for VLM-based annotations suggests that multimodal annotation is a promising direction for analyzing political communication on social media.