Why Didn't You Listen to Me? Comparing User Control of Human-in-the-Loop Topic Models


To address the lack of comparative evaluation of Human-in-the-Loop Topic Modeling (HLTM) systems, we implement and evaluate three contrasting HLTM approaches using simulation experiments. These approaches extend previously proposed frameworks, including constraint-based and informed-prior-based methods. Because users should have a sense of control in HLTM systems, we propose a control metric that measures whether the results of refinement operations match users’ expectations. Informed-prior-based methods provide better control than constraints, but constraints yield higher-quality topics.


💡 Research Summary

This paper tackles a notable gap in the evaluation of Human‑in‑the‑Loop Topic Modeling (HLTM) systems: while many prior works have measured topic quality (e.g., coherence, perplexity), they have rarely quantified whether the system actually obeys the user’s refinement commands. To fill this void, the authors implement three contrasting HLTM pipelines, run extensive simulation experiments, and introduce a novel “control” metric that directly measures how closely a model’s post‑refinement state matches the user’s intent.

Three HLTM approaches

  1. Info‑gibbs – an informed‑prior method combined with Gibbs sampling. User feedback is encoded by adjusting the Dirichlet hyper‑parameters α (document‑topic) and β (topic‑word). This extends the work of Smith et al. (2018) with an additional “create‑topic” operation.
  2. Info‑vb – the same informed‑prior idea but using variational Expectation‑Maximization (EM) inference. Here a global λ matrix (topic‑word responsibilities) is reset to reflect new knowledge, allowing rapid updates while preserving the variational framework. This is, to the authors’ knowledge, the first HLTM system that couples informed priors with variational inference.
  3. Const‑gibbs – a constraint‑based approach that injects soft or hard “must‑link / cannot‑link” potentials fₘ(z,m,d) into a Gibbs‑sampling LDA model. Constraints directly penalize or reward assignments of particular words or documents to specific topics.

All three systems support seven user‑requested refinements that appear in prior HLTM literature: (i) remove a word from a topic, (ii) add a word to a topic, (iii) remove a document from a topic, (iv) merge two topics, (v) split a topic given seed words, (vi) reorder two words within a topic, and (vii) create a new topic from seed words. The paper details, for each refinement, how the three pipelines manipulate their internal parameters (e.g., setting a tiny prior ε = 10⁻⁸, adding a log ε soft constraint, or re‑assigning λ rows).
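A minimal sketch of how the two families might encode a “remove word” refinement, based on the parameter manipulations described above (the function names, matrix shapes, and NumPy representation are my own illustration, not the paper’s code):

```python
import numpy as np

EPSILON = 1e-8  # the tiny prior value quoted in the paper


def remove_word_informed_prior(beta, topic, word_id, epsilon=EPSILON):
    """Informed-prior encoding (info-gibbs / info-vb style):
    shrink the topic-word Dirichlet hyperparameter for (topic, word)
    to near zero so inference drives the word out of the topic.
    `beta` is a K x V matrix of asymmetric hyperparameters."""
    new_beta = beta.copy()
    new_beta[topic, word_id] = epsilon
    return new_beta


def remove_word_constraint(log_score, topic, word_id, epsilon=EPSILON):
    """Constraint encoding (const-gibbs style): add a log(epsilon)
    soft penalty to the unnormalized log score of assigning this
    word to this topic during Gibbs sampling."""
    new_score = log_score.copy()
    new_score[topic, word_id] += np.log(epsilon)
    return new_score
```

The other six refinements follow the same pattern: each one is a small, targeted rewrite of priors (informed-prior pipelines) or of assignment potentials (constraint pipeline).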

Control metric
The authors formalize control as a ratio between the observed rank change of a word (or document) and the rank change that the user explicitly requested. For a word w in topic t, let r₁(w) be its rank before refinement and r₂(w) after. The control score S = (r₁ – r₂) / (expected Δr). S = 1 indicates perfect compliance; S < 0 indicates the model moved in the opposite direction. Analogous ratios are defined for document‑level operations, topic creation (seed‑word coverage), topic merging (proportion of merged words that originated from the parent topics), and topic splitting (average of parent‑ and child‑topic scores).
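For the word-level case the score reduces to one line; a sketch (the function name is my own, not the authors’):

```python
def control_score(rank_before, rank_after, expected_delta):
    """Control S = (r1 - r2) / expected rank change.
    S = 1 means the model moved the item exactly as requested;
    0 < S < 1 means partial compliance; S < 0 means the item moved
    in the opposite direction (e.g. the word sank instead of rising)."""
    return (rank_before - rank_after) / expected_delta
```

For example, if the user asks to promote a word from rank 10 to rank 1 (expected Δr = 9) and it lands at rank 4, then S = (10 − 4) / 9 ≈ 0.67: the model partially complied.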

Simulation design
Real user interaction is expensive, so the authors simulate two user archetypes on a corpus of 7,000 news articles (14 categories, 500 articles each) harvested from the Guardian API.

  • Random user – selects refinements and their parameters uniformly at random.
  • Good user – mimics a realistic analyst: it first identifies a topic that mixes multiple categories, then applies refinements that push the topic toward its dominant category (using the true category labels as oracle).

Forty pretrained LDA models serve as starting points (20 with K = 10 topics, 20 with K = 20 topics). For each HLTM system and each refinement type, a random pretrained model is chosen, the simulated user’s command is applied, and inference proceeds until convergence or a fixed budget (20 Gibbs sweeps or 3 EM iterations). The process is repeated 100 times per condition, yielding distributions of control scores and changes in topic coherence measured by NPMI (using Wikipedia as the reference corpus).
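The protocol above can be sketched as a loop over trials; the callables here are hypothetical stand-ins for the paper’s components (model selection, refinement encoding, bounded inference, scoring), not its actual code:

```python
import random


def run_condition(pretrained_models, apply_refinement, run_inference,
                  score_control, n_trials=100, seed=0):
    """One simulation condition (one system x one refinement type):
    pick a pretrained LDA model at random, apply the simulated user's
    refinement, run inference for a fixed budget (e.g. 20 Gibbs sweeps
    or 3 EM iterations), and record the resulting control score."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        model = rng.choice(pretrained_models)   # one of the 40 pretrained models
        refined = apply_refinement(model)       # encode the user's command
        final = run_inference(refined)          # bounded inference budget
        scores.append(score_control(final))     # compare outcome to expectation
    return scores
```

Running this for every (system, refinement, user type) combination yields the distributions of control scores and NPMI deltas that the paper analyzes.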

Results
Statistical testing (Kruskal‑Wallis) confirms that the three systems differ significantly in both control and coherence. Key findings:

  • Control – Both informed‑prior pipelines (info‑gibbs, info‑vb) achieve markedly higher control scores for all refinements under both user types (C_Rand and C_Good). Scores for “remove word”, “remove document”, and “merge topics” are essentially perfect (≈ 1.0). The constraint‑based pipeline (const‑gibbs) also attains perfect control for operations that naturally map to hard constraints (e.g., removing a word or document). However, for “create topic” the constraint system performs poorly for random users (C_Rand ≈ 0.08) but improves for good users (C_Good ≈ 0.81), reflecting its dependence on underlying data to generate the requested topic.

  • Coherence – The constraint‑based approach consistently yields larger positive NPMI deltas (Q_Good) than the informed‑prior methods, especially for “add word” and “remove document”. This suggests that enforcing hard or soft constraints stabilizes the word distribution within topics, improving interpretability. In contrast, informed‑prior methods sometimes degrade coherence (negative Q_Good) for “create topic” and “split topic”, likely because the prior adjustments introduce new words that are not well supported by the corpus.

  • User type comparison – Across most operations, control scores are similar between random and good users, indicating that the pipelines react robustly to both purposeful and noisy feedback. The notable exception is const‑gibbs on “create topic”, where purposeful guidance dramatically improves control.

Discussion
The study reveals a clear trade‑off: informed priors excel at giving users fine‑grained, predictable control over the model’s behavior, but they may sacrifice topic coherence when the prior pushes the model away from the data’s natural structure. Constraints, on the other hand, preserve or even enhance coherence by anchoring the model to the observed word‑document statistics, yet they can be limited when the user wishes to introduce novel concepts (e.g., creating a new topic from seed words).

The proposed control metric fills a methodological void, offering a quantitative lens to evaluate whether an HLTM system “listens” to its users. By framing control as a ratio of observed to expected rank changes, the metric can be applied uniformly across word‑level, document‑level, and structural refinements.

Implications for system design
Developers of interactive topic modeling tools should consider the primary user goal. If the application demands precise steering—such as curating a taxonomy for a specialized domain—an informed‑prior pipeline (potentially with variational inference for speed) is advisable. If the goal is to maintain high interpretability while allowing modest user nudges, a constraint‑based pipeline may be preferable. Hybrid designs that combine soft constraints with prior adjustments could capture the best of both worlds, a promising direction for future work.

Conclusion
By implementing three distinct HLTM pipelines, devising a rigorous control metric, and conducting large‑scale simulations with both purposeful and noisy user behavior, the authors provide the first systematic comparison of user control versus topic quality in HLTM systems. The findings underscore that “listening to the user” and “producing coherent topics” are not automatically aligned; they represent a design trade‑off that must be explicitly managed. The control metric introduced here offers a practical tool for future research and for evaluating real‑world interactive topic modeling interfaces.

