The CTI Echo Chamber: Fragmentation, Overlap, and Vendor Specificity in Twenty Years of Cyber Threat Reporting
Despite the high volume of open-source Cyber Threat Intelligence (CTI), our understanding of long-term threat actor-victim dynamics remains fragmented due to the lack of structured datasets and inconsistent reporting standards. In this paper, we present a large-scale automated analysis of open-source CTI reports spanning two decades. We develop a high-precision, LLM-based pipeline to ingest and structure 13,308 reports, extracting key entities such as attributed threat actors, motivations, victims, reporting vendors, and technical indicators (IoCs and TTPs). Our analysis quantifies the evolution of CTI information density and specialization, characterizing patterns that relate specific threat actors to motivations and victim profiles. Furthermore, we perform a meta-analysis of the CTI industry itself. We identify a fragmented ecosystem of distinct silos where vendors demonstrate significant geographic and sectoral reporting biases. Our marginal coverage analysis reveals that intelligence overlap between vendors is typically low: while a few core providers may offer broad situational awareness, additional sources yield diminishing returns. Overall, our findings characterize the structural biases inherent in the CTI ecosystem, enabling practitioners and researchers to better evaluate the completeness of their intelligence sources.
💡 Research Summary
The paper tackles a fundamental gap in cyber‑threat‑intelligence (CTI) research: the lack of a large, structured, longitudinal dataset that captures both technical indicators and strategic context across two decades of open‑source reporting. The authors assemble 13,308 unique CTI reports spanning 2000‑2023 from ten public repositories (MITRE ATT&CK, APT‑Notes, Malpedia, etc.). Using a two‑stage large‑language‑model (LLM) pipeline—first a taxonomy‑generation step with OpenAI’s o4‑mini‑2025‑04‑16, then a full‑report extraction step with the reasoning model o3‑2025‑04‑16—they automatically extract ten key fields: report title, vendor, date, report type, attack motivation, threat actor, victim geography, victim sector, IoC list, and TTP list. Structured JSON prompts, enumerated label sets, and post‑processing rules ensure that the model’s output conforms to a predefined schema.
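As an illustration of the schema-conformance step, the following Python sketch checks one extracted record against the ten fields listed above. The key names, the Python types, and the motivation label set are assumptions made for this example; the summary names the fields but does not give the exact keys or the enumerated labels used by the authors.

```python
import json

# Hypothetical key names and types for the ten fields named in the summary.
REQUIRED_FIELDS = {
    "title": str,
    "vendor": str,
    "date": str,
    "report_type": str,
    "motivation": str,
    "threat_actor": str,
    "victim_geography": list,
    "victim_sector": list,
    "iocs": list,
    "ttps": list,
}

# Hypothetical enumerated label set; the real taxonomy is not given here.
ALLOWED_MOTIVATIONS = {"financial", "espionage", "other", "unknown"}


def validate_record(raw_json):
    """Return a list of schema problems; an empty list means the record conforms."""
    record = json.loads(raw_json)
    if not isinstance(record, dict):
        return ["top-level value is not an object"]

    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for field: {field}")

    # Enumerated fields must use one of the predefined labels.
    motivation = record.get("motivation")
    if isinstance(motivation, str) and motivation not in ALLOWED_MOTIVATIONS:
        problems.append(f"motivation not in label set: {motivation}")
    return problems
```

A record that fails validation could be re-prompted or routed to manual review; which of these the authors do is not stated in this summary.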
The raw LLM output undergoes extensive normalization. Country names are collapsed from 326 variants to 254 canonical entries; vendor names are unified from 2,538 to 1,626 by handling re‑branding and acquisitions; threat‑actor names are de‑duplicated using a consensus of five external alias mappings (Microsoft, CrowdStrike, Unit 42, SecureWorks, MITRE). Only alias pairs appearing in at least three sources are retained, reducing 4,241 raw actor strings to 2,722 canonical actors. Human validation on a random sample yields an overall F1‑score of 0.94, indicating that the pipeline achieves high precision and recall despite the heterogeneity of the source documents (PDFs, web pages, multi‑column layouts).
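The consensus rule for actor aliases can be sketched as follows. This is a minimal illustration, assuming each source provides groups of names it treats as the same actor; the union-find merging and the choice of the alphabetically first name as the canonical label are assumptions of this sketch, not details confirmed by the summary. All actor names in the usage example are placeholders.

```python
from collections import Counter
from itertools import combinations


def consensus_alias_groups(source_aliases, min_sources=3):
    """Merge names whose alias pairing is asserted by at least `min_sources` sources.

    source_aliases: dict mapping a source name to a list of sets, where each
    set holds names that the source treats as one actor.
    Returns a dict mapping a canonical name to the set of names merged into it.
    """
    pair_votes = Counter()
    for groups in source_aliases.values():
        pairs_in_source = set()
        for group in groups:
            for a, b in combinations(sorted(group), 2):
                pairs_in_source.add((a, b))
        # Each source votes at most once per alias pair.
        pair_votes.update(pairs_in_source)

    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for (a, b), votes in pair_votes.items():
        if votes >= min_sources:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Keep the alphabetically first root as the canonical name.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    merged = {}
    for name in list(parent):
        merged.setdefault(find(name), set()).add(name)
    return merged
```

For example, with placeholder data where three sources pair `"Actor-1"` with `"Alias-1a"` but only two pair `"Actor-2"` with `"Alias-2a"`, only the first pair is merged:

```python
sources = {
    "src_a": [{"Actor-1", "Alias-1a", "Alias-1b"}],
    "src_b": [{"Actor-1", "Alias-1a"}],
    "src_c": [{"Actor-1", "Alias-1a"}, {"Actor-2", "Alias-2a"}],
    "src_d": [{"Actor-2", "Alias-2a"}],
}
groups = consensus_alias_groups(sources)
```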
The resulting dataset, named CTIRep, contains 12,723 structured records, 12,723 distinct threat‑actor‑victim‑motivation tuples, 107,611 IoCs, and 833 TTPs. Temporal analysis reveals three distinct phases: an “inception” period (2000‑2010) with modest reporting volume, an “expansion” period (2011‑2019) marked by exponential growth in both report count and vendor diversity, and a “maturation” period (2020‑present) where growth stabilizes but the focus shifts from pure malware analysis toward strategic intelligence. Technical indicators (IoCs, TTPs) correlate strongly with report volume and vendor count (Pearson r = 0.93), whereas strategic metadata (motivation, actor‑victim relationships) shows a weaker correlation, indicating that the surge in reporting is driven primarily by technical detail.
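For readers who want to reproduce this kind of correlation on their own yearly counts, here is a minimal sketch of the Pearson coefficient, r = cov(x, y) / (σ_x σ_y), in plain Python. The function name is chosen for this example, and the numbers in the usage line are made-up toy values, not figures from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of centered products and squares (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy example: reports per year vs. IoCs per year (illustrative values only).
r = pearson_r([10, 20, 40, 80], [100, 230, 390, 820])
```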
Actor‑centric analysis shows a high degree of specialization: over 30 % of threat groups concentrate on a single motivation‑geography‑sector combination, while only 7 % display broad multi‑motivation, multi‑sector activity. Financially motivated cybercrime accounts for 53 % of campaigns, espionage for 35.4 %; the former targets high‑wealth commercial sectors in a few high‑volume economies, whereas espionage is geographically dispersed and focuses on government and military sectors.
The meta‑analysis of the CTI vendor ecosystem uncovers a pronounced long‑tail distribution. Approximately 88 % of vendors are “niche” contributors that each publish only a few reports, while a small elite (≈5 % of vendors) generates the majority of global, multi‑actor intelligence. Pairwise overlap between vendors is low: on average, two vendors share less than 12 % of the threat actors they cover, and overlap in the detailed intelligence reported about the same actor falls below 5 %. Consequently, adding more vendors yields diminishing returns for broad situational awareness, whereas deep, actor‑level insight still requires a diversified multi‑vendor strategy.
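The overlap and marginal-coverage ideas can be illustrated with a short sketch. It assumes each vendor is represented by the set of threat actors it reports on, uses Jaccard similarity as the overlap measure, and orders vendors greedily by how many new actors each adds. Both choices are assumptions of this example; the summary does not specify the exact metrics the authors use. Vendor and actor names are placeholders.

```python
def jaccard(a, b):
    """Share of actors two vendors have in common, relative to their union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_marginal_coverage(vendor_actors):
    """Order vendors by marginal gain in covered actors.

    vendor_actors: dict mapping a vendor name to the set of actors it covers.
    Returns a list of (vendor, new_actors_added, total_covered) tuples and
    stops once no remaining vendor adds a new actor.
    """
    covered = set()
    remaining = dict(vendor_actors)
    order = []
    while remaining:
        # Ties are broken alphabetically because max() keeps the first maximum.
        best = max(sorted(remaining), key=lambda v: len(remaining[v] - covered))
        gain = len(remaining[best] - covered)
        if gain == 0:
            break
        covered |= remaining.pop(best)
        order.append((best, gain, len(covered)))
    return order


vendors = {
    "vendor_1": {"a", "b", "c", "d"},
    "vendor_2": {"c", "d", "e"},
    "vendor_3": {"a", "e"},
}
overlap = jaccard(vendors["vendor_1"], vendors["vendor_2"])
order = greedy_marginal_coverage(vendors)
```

In this toy input the first vendor contributes four actors, the second only one, and the third none, which is the diminishing-returns pattern in miniature.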
In sum, the paper makes three core contributions: (1) it demonstrates that LLM‑assisted large‑scale CTI extraction is feasible and yields a high‑quality structured corpus; (2) it provides a quantitative portrait of how threat‑actor motivations and victim profiles have evolved and how technical and strategic reporting have diverged over time; (3) it reveals systemic biases in the CTI market, highlighting both the concentration of reporting power among a few vendors and the low redundancy across sources. These findings equip security practitioners with evidence‑based guidance for constructing intelligence portfolios and give researchers a benchmark dataset for future studies on CTI dynamics, bias mitigation, and automated threat‑modeling.