EvoTaxo: Building and Evolving Taxonomy from Social Media Streams
Constructing taxonomies from social media corpora is challenging because posts are short, noisy, semantically entangled, and temporally dynamic. Existing taxonomy induction methods are largely designed for static corpora and often struggle to balance robustness, scalability, and sensitivity to evolving discourse. We propose EvoTaxo, an LLM-based framework for building and evolving taxonomies from temporally ordered social media streams. Rather than clustering raw posts directly, EvoTaxo converts each post into a structured draft action over the current taxonomy, accumulates structural evidence over time windows, and consolidates candidate edits through dual-view clustering that combines semantic similarity with temporal locality. A refinement-and-arbitration procedure then selects reliable edits before execution, while each node maintains a concept memory bank to preserve semantic boundaries over time. Experiments on two Reddit corpora show that EvoTaxo produces more balanced taxonomies than baselines, with clearer post-to-leaf assignment, better corpus coverage at comparable taxonomy size, and stronger structural quality. A case study on the Reddit community /r/ICE_Raids further shows that EvoTaxo captures meaningful temporal shifts in discourse. Our codebase is available here.
💡 Research Summary
EvoTaxo addresses the long‑standing challenge of building hierarchical taxonomies from noisy, short, and temporally dynamic social‑media streams. Traditional taxonomy induction methods—whether clustering‑based, topic‑model‑based, or recent LLM‑assisted approaches—are designed for static corpora and struggle with three core issues: (1) the semantic sparsity and lexical variability of short posts, (2) the computational cost of repeatedly scanning massive, continuously growing streams, and (3) the inability to capture short‑lived but meaningful concepts that emerge during real‑world events.
The proposed framework introduces a novel pipeline that transforms each incoming post into a structured “draft action” using a large language model (LLM). Five action types are defined: set_node (assign post to an existing node), add_child (create a new child under an existing node), add_path (create a short root→topic→subtopic path), update_cmb (revise the concept memory bank of a node), and skip_post (ignore). Non‑structural actions (set_node, skip_post) are executed immediately, while structural actions are accumulated in a backlog for later consolidation.
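The action vocabulary and routing rule above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class names, field names, and `payload` structure are assumptions; only the five action strings and the structural/non-structural split come from the description.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class ActionType(Enum):
    SET_NODE = "set_node"      # assign post to an existing node
    ADD_CHILD = "add_child"    # create a new child under an existing node
    ADD_PATH = "add_path"      # create a short root->topic->subtopic path
    UPDATE_CMB = "update_cmb"  # revise a node's concept memory bank
    SKIP_POST = "skip_post"    # ignore the post

# Per the paper, structural actions are deferred; set_node/skip_post run now.
STRUCTURAL = {ActionType.ADD_CHILD, ActionType.ADD_PATH, ActionType.UPDATE_CMB}

@dataclass
class DraftAction:
    action: ActionType
    post_id: str
    target_node: Optional[str] = None            # node the edit refers to, if any
    payload: dict = field(default_factory=dict)  # e.g. proposed name, cues (assumed)

def route(action: DraftAction, backlog: list) -> str:
    """Queue structural actions for window-level consolidation;
    execute non-structural ones immediately."""
    if action.action in STRUCTURAL:
        backlog.append(action)
        return "backlog"
    return "execute"
```

The enum makes the LLM's output easy to validate: a draft action that does not parse into one of the five types can simply be rejected before it touches the taxonomy.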
Posts are processed chronologically and grouped into fixed‑size time windows (e.g., yearly or quarterly). At each window boundary, the backlog is first partitioned by action type and target node, preventing semantically unrelated edits from being mixed. Within each bucket, every draft action is embedded into a semantic vector. Two complementary clustering passes are then performed:
- Semantic‑only clustering uses cosine distance to group actions that express similar structural intents regardless of when they occurred, highlighting persistent themes.
- Joint semantic‑temporal clustering combines the semantic distance with a normalized temporal distance, weighted by a tunable λ (0 ≤ λ ≤ 1). This pass captures bursts of activity tied to specific events, ensuring that transient but important concepts are not smoothed away.
Both passes employ HDBSCAN, a density‑based algorithm that does not require a predefined number of clusters and can accept custom distance functions. The resulting candidate clusters from both views are fed into a two‑step review and arbitration process.
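The two clustering views differ only in the distance function handed to HDBSCAN. Below is a sketch of the precomputed distance matrices, assuming cosine distance for the semantic view and a convex combination `(1 - λ)·d_sem + λ·d_temp` for the joint view; the exact combination formula is not given in the summary, so this form is an assumption.

```python
import numpy as np

def cosine_distance_matrix(X: np.ndarray) -> np.ndarray:
    """Pairwise cosine distances between row embeddings."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)
    return np.clip(1.0 - Xn @ Xn.T, 0.0, None)

def joint_distance_matrix(X: np.ndarray, t: np.ndarray, lam: float) -> np.ndarray:
    """Assumed convex combination of semantic and temporal distance.
    lam = 0 recovers the semantic-only view; lam = 1 is purely temporal."""
    d_sem = cosine_distance_matrix(X)
    dt = np.abs(t[:, None] - t[None, :]).astype(float)
    span = dt.max()
    d_tmp = dt / span if span > 0 else dt  # normalize temporal gaps to [0, 1]
    return (1.0 - lam) * d_sem + lam * d_tmp
```

Either matrix can then be passed to a density-based clusterer that accepts precomputed distances, e.g. `hdbscan.HDBSCAN(metric="precomputed").fit(D)`, which is what lets the same machinery serve both views.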
In the cluster‑level refinement, an LLM reviewer examines the representative posts and actions of each cluster, deciding whether the cluster is coherent enough to be turned into a “refined action” (a higher‑level summary of the collective evidence) or whether it should be deferred.
In the window‑level arbitration, another LLM reviewer looks at all refined actions from the current window and resolves conflicts such as overlapping node names, mutually exclusive edits, or redundant additions. The arbitration step outputs a final set of actions that are guaranteed to be compatible with the current taxonomy.
Each node in the evolving taxonomy maintains a Concept Memory Bank (CMB) consisting of a short definition, inclusion cues, and exclusion cues. The CMB acts as a persistent semantic anchor, preventing drift as the taxonomy is repeatedly edited and ensuring that short posts are interpreted against a stable concept description rather than a bare label.
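A CMB as described is a small, mergeable record. The sketch below assumes a flat list representation for the cues and a simple de-duplicating merge; the field and method names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptMemoryBank:
    definition: str = ""
    include_cues: list = field(default_factory=list)  # signals a post belongs here
    exclude_cues: list = field(default_factory=list)  # signals it does not

    def update(self, definition=None, include=(), exclude=()):
        """Merge a refined update_cmb action, de-duplicating cues so
        repeated edits do not bloat the bank."""
        if definition:
            self.definition = definition
        for cue in include:
            if cue not in self.include_cues:
                self.include_cues.append(cue)
        for cue in exclude:
            if cue not in self.exclude_cues:
                self.exclude_cues.append(cue)

    def render(self) -> str:
        """Prompt-ready description, so short posts are judged against
        the full concept rather than a bare node label."""
        return (f"Definition: {self.definition}\n"
                f"Includes: {', '.join(self.include_cues)}\n"
                f"Excludes: {', '.join(self.exclude_cues)}")
```

Keeping `render()` deterministic matters: the same node always presents the same semantic anchor to the LLM, which is what limits drift across repeated edits.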
After arbitration, the final actions are deterministically applied: new nodes or paths are added, CMBs are updated, and the supporting posts are grounded to the affected nodes, preserving traceability from the taxonomy back to the original corpus.
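The execution step is the only part of the pipeline that needs no LLM. A minimal sketch of applying an `add_path` action to a node tree while grounding its supporting posts might look as follows (the `Node` layout is an assumption):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: dict = field(default_factory=dict)  # child name -> Node
    posts: list = field(default_factory=list)     # grounded post ids

def add_path(root: Node, path: list, post_ids: list) -> Node:
    """Deterministically create any missing nodes along `path` and ground
    the supporting posts to the final node, preserving taxonomy->corpus
    traceability."""
    node = root
    for name in path:
        node = node.children.setdefault(name, Node(name))
    node.posts.extend(post_ids)
    return node
```

Because `setdefault` reuses existing nodes, replaying the same arbitrated action is idempotent at the structural level, which is a useful property when windows are reprocessed.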
The authors evaluate EvoTaxo on two Reddit communities: /r/opiates, a high‑noise, experience‑focused forum with diverse personal narratives, and /r/ICE_Raids, an event‑driven community where discussions surge around specific incidents. For /r/opiates they test both a 10‑year yearly window (15,846 posts) and a recent 2‑year quarterly window (8,582 posts); for /r/ICE_Raids they use a similarly sized, temporally granular split. Baselines include a classic hierarchical topic model, a pure clustering hierarchy, and a recent static LLM‑generated taxonomy.
Quantitative results show that EvoTaxo consistently outperforms baselines across three key metrics:
- Leaf‑assignment entropy – lower values indicate that posts are assigned to more definitive leaf nodes; EvoTaxo achieves the lowest entropy, reflecting clearer topic delineation.
- Corpus coverage – measured as the proportion of posts that can be grounded to a leaf; EvoTaxo improves coverage by 12–18% at comparable tree sizes, demonstrating its ability to incorporate emerging sub‑topics.
- Structural quality – evaluated via average tree depth, balance factor, and branching factor; EvoTaxo produces more balanced and appropriately deep hierarchies, avoiding overly shallow or overly deep structures that plague other methods.
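The summary does not give the exact metric definitions, but the first two can be sketched under plausible assumptions: leaf-assignment entropy as the mean Shannon entropy of each post's probability distribution over leaves, and coverage as the grounded fraction of the corpus.

```python
import math

def leaf_assignment_entropy(assignments: list) -> float:
    """Mean Shannon entropy (bits) over per-post leaf probability
    distributions; lower means posts land on more definitive leaves.
    (Assumed formulation -- the paper's exact definition is not shown.)"""
    total = 0.0
    for probs in assignments:
        total += -sum(p * math.log2(p) for p in probs if p > 0)
    return total / len(assignments)

def coverage(grounded_posts: int, total_posts: int) -> float:
    """Fraction of the corpus that can be grounded to some leaf."""
    return grounded_posts / total_posts
```

Under this reading, a taxonomy whose leaves match the corpus well yields peaked assignment distributions (low entropy) while still grounding most posts (high coverage), which is the trade-off the results table measures.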
A qualitative case study on /r/ICE_Raids illustrates the temporal sensitivity of the approach. When a high‑profile ICE raid occurs, EvoTaxo quickly creates a burst‑aligned sub‑topic (“specific raid incident”) that later splits into follow‑up sub‑topics (“legal response”, “community backlash”). This dynamic adaptation is absent in static baselines, confirming that the joint semantic‑temporal clustering successfully captures short‑lived discourse spikes.
The paper’s contributions are fourfold: (1) introducing the first LLM‑based taxonomy induction framework explicitly designed for temporally evolving social‑media streams, (2) formulating taxonomy construction as incremental structured editing with dedicated mechanisms for robustness, scalability, and temporal awareness, (3) demonstrating empirically that the method yields higher‑quality, more interpretable taxonomies, and (4) releasing the codebase for reproducibility.
Limitations noted by the authors include the computational cost of frequent LLM calls (especially for large streams), sensitivity to the choice of λ and window size, and the current focus on text‑only data, leaving multimodal extensions (images, videos) for future work. Nonetheless, EvoTaxo represents a significant step toward automated, real‑time organization of social‑media discourse, with potential applications in public‑opinion monitoring, crisis response, and downstream NLP tasks such as stance detection or structured knowledge extraction.