Auditing Meta and TikTok Research API Data Access under Article 40(12) of the Digital Services Act
Article 40(12) of the Digital Services Act (DSA) requires Very Large Online Platforms (VLOPs) to provide vetted researchers with access to publicly accessible data. While prior work has identified shortcomings of platform-provided data access mechanisms, existing research has not quantitatively assessed data quality and completeness in Research APIs across platforms, nor systematically mapped how current access provisions fall short. This paper presents a systematic audit of research access modalities by comparing data obtained through platform Research APIs with data collected about the same platforms’ user-visible public information environment (PIE). Focusing on two major platform APIs, the TikTok Research API and the Meta Content Library, we reconstruct full information feeds for two controlled sockpuppet accounts during two election periods and benchmark these against the data retrievable for the same posts through the corresponding Research APIs. Our findings show systematic data loss through three classes of platform-imposed mechanisms: scope narrowing, metadata stripping, and operational restrictions. Together, these mechanisms implement overlapping filters that exclude large portions of the platform PIE (up to approximately 50 percent), strip essential contextual metadata (up to approximately 83 percent), and impose severe technical constraints for researchers (down to approximately 1000 requests per day). Viewed through a data quality lens, these filters primarily undermine completeness, resulting in a structurally biased representation of platform activity. We conclude that, in their current form, the Meta and TikTok Research APIs fall short of supporting meaningful, independent auditing of systemic risks as envisioned under the DSA.
💡 Research Summary
The paper conducts a systematic audit of the research‑access mechanisms that Meta (through its Content Library) and TikTok provide under Article 40(12) of the European Union’s Digital Services Act (DSA). While prior work has highlighted procedural hurdles and qualitative concerns about platform‑provided data, no study has quantitatively compared the data that ordinary users actually see—the public information environment (PIE)—with the data that vetted researchers can retrieve via the official research APIs.
To create a reliable baseline, the authors built two controlled "sock‑puppet" accounts and used the System for Observing and Analyzing Posts (SOAP) to capture every HTTP response sent to the browser during two politically sensitive periods: the 2024 U.S. presidential election on TikTok's For You feed and the 2025 German federal election on Instagram's Explore feed. This method records not only the visible content (videos, captions, images) but also all accompanying fields transmitted to the client, such as internal post IDs, recommendation scores, location and language tags, and other contextual signals. The resulting dataset represents the full PIE for each platform.
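The capture step can be sketched as a small parser over recorded HTTP responses: given the JSON payloads a feed endpoint returns, collect every post and the union of metadata field names transmitted to the client. This is an illustrative sketch, not the paper's SOAP implementation; the payload layout used here (an `itemList` of post dicts) is a hypothetical stand‑in for a real feed endpoint's schema.

```python
import json

def extract_pie_records(responses):
    """Collect post records and the union of metadata field names
    from a list of captured HTTP response bodies (JSON strings).

    The payload layout ("itemList" of post dicts keyed by "id") is a
    hypothetical stand-in for a real platform feed schema.
    """
    posts = {}
    field_names = set()
    for body in responses:
        payload = json.loads(body)
        for item in payload.get("itemList", []):
            post_id = item.get("id")
            if post_id is None:
                continue
            posts[post_id] = item
            # Record every field the platform sent to the client,
            # whether or not the UI ever displays it.
            field_names.update(item.keys())
    return posts, field_names

# Usage with two mock captured responses:
captured = [
    json.dumps({"itemList": [
        {"id": "1", "caption": "a", "rank_score": 0.9, "lang": "en"},
        {"id": "2", "caption": "b", "region": "US"},
    ]}),
    json.dumps({"itemList": [
        {"id": "3", "caption": "c", "author_verified": True},
    ]}),
]
posts, fields = extract_pie_records(captured)
print(len(posts))       # 3
print(sorted(fields))
```

Keying posts by their client‑side ID is what later makes a one‑to‑one comparison against Research API returns possible.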
The authors then queried the official research APIs—Meta’s Content Library API and TikTok’s Research API—using the same accounts and collected the “research‑accessible” subsets. By aligning the two datasets, they identified three overlapping mechanisms that cause substantial data loss:
- Scope Narrowing – The APIs apply internal filters that limit results to accounts or content types meeting certain criteria (e.g., follower thresholds, geographic regions, media format). Consequently, roughly 45–52% of PIE posts are completely absent from API responses.
- Metadata Stripping – The APIs expose only a fraction of the fields present in the raw HTTP payload. For Instagram, 236 original parameters are reduced to 100 in the API and merely 14 in the UI; TikTok similarly omits 70–83% of key metadata such as unique identifiers, author verification status, and algorithmic ranking scores. This loss hampers any attempt to reconstruct the context in which content was recommended or to perform causal analyses of platform dynamics.
- Operational Restrictions – Research APIs impose strict rate limits (≈1,000 requests per day), pagination constraints, and unstable endpoint versions. These technical caps make large‑scale longitudinal studies, real‑time risk monitoring, and comprehensive audits practically infeasible.
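The alignment behind the first two measurements can be illustrated by comparing two post sets keyed by ID: scope narrowing as the share of PIE posts missing from the API returns, and metadata stripping as the share of client‑side fields absent from the API schema. This is a minimal sketch under the assumption that both datasets are dicts mapping post ID to metadata; all post and field values below are invented for illustration.

```python
def audit_coverage(pie_posts, api_posts):
    """Compare a PIE baseline against Research API returns.

    pie_posts / api_posts: dicts mapping post ID -> metadata dict.
    Returns (scope_loss, field_loss): the fraction of PIE posts
    missing from the API, and the fraction of client-side metadata
    fields absent from the API schema.
    """
    pie_ids, api_ids = set(pie_posts), set(api_posts)
    scope_loss = len(pie_ids - api_ids) / len(pie_ids)

    pie_fields = {f for post in pie_posts.values() for f in post}
    api_fields = {f for post in api_posts.values() for f in post}
    field_loss = len(pie_fields - api_fields) / len(pie_fields)
    return scope_loss, field_loss

# Hypothetical example: 4 PIE posts, 2 returned by the API,
# with the API dropping ranking and language fields.
pie = {
    "1": {"id": "1", "caption": "a", "rank_score": 0.9, "lang": "en"},
    "2": {"id": "2", "caption": "b", "rank_score": 0.4, "lang": "de"},
    "3": {"id": "3", "caption": "c", "rank_score": 0.7, "lang": "en"},
    "4": {"id": "4", "caption": "d", "rank_score": 0.1, "lang": "en"},
}
api = {
    "1": {"id": "1", "caption": "a"},
    "3": {"id": "3", "caption": "c"},
}
scope_loss, field_loss = audit_coverage(pie, api)
print(scope_loss, field_loss)  # 0.5 0.5
```

The same ID‑keyed join also exposes the operational constraint: at roughly 1,000 requests per day, re‑querying every post a sockpuppet account saw during an election period can take many days, which is why the paper treats the rate limit itself as a data‑loss mechanism.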
The combined effect of these mechanisms is a structurally biased representation of the PIE. Missing posts disproportionately affect certain political viewpoints, regions, or content formats, while stripped metadata eliminates crucial signals needed to understand algorithmic amplification. As a result, the current implementations of the Meta and TikTok Research APIs fall short of the DSA's intent to enable independent, systematic scrutiny of systemic risks such as election interference, misinformation spread, or public‑security threats.
Beyond the empirical findings, the paper proposes concrete regulatory and technical improvements: (a) clarify the definition of “publicly accessible data” in the DSA to limit platform discretion; (b) mandate a minimum set of metadata (post ID, author ID, recommendation score, etc.) to be included in research‑API responses; (c) relax daily request caps or introduce scalable quota mechanisms for approved research projects; and (d) require platforms to publish transparent mappings between API schemas and the full client‑side payloads, accompanied by regular third‑party verification. Implementing these recommendations would allow researchers to reconstruct a near‑complete PIE, thereby restoring the DSA’s promise of robust, independent oversight of Very Large Online Platforms.