On When and How to use SAT to Mine Frequent Itemsets

A new stream of research emerged in the last decade with the goal of mining itemsets of interest using Constraint Programming (CP), which offers a natural way to combine complex constraints in a highly flexible manner. Although state-of-the-art CP solutions formulate the task using Boolean variables, the few attempts to adopt propositional Satisfiability (SAT) yielded unsatisfactory performance. This work deepens the study of when and how to use SAT for the frequent itemset mining (FIM) problem by defining different encodings with multiple task-driven enumeration options and search strategies. Although SAT-based solutions appear non-competitive with their CP peers in the majority of scenarios, the results reveal a variety of interesting cases where SAT encodings are the best option.


💡 Research Summary

The paper investigates the conditions under which propositional satisfiability (SAT) solvers can be effectively employed for frequent itemset mining (FIM), a classic data‑mining task traditionally tackled with constraint programming (CP). While CP has become the de facto standard because Boolean variables combined with sophisticated domain‑propagation mechanisms allow flexible modeling of complex constraints, early attempts to translate FIM into SAT yielded disappointing performance. The authors therefore conduct a systematic study that spans three distinct SAT encodings, multiple enumeration schemes, and several search‑strategy enhancements, aiming to pinpoint the scenarios where SAT can match or surpass CP.

First, three encoding families are introduced. The item‑based encoding directly maps each item to a Boolean variable and represents each transaction as a set of clauses that enforce inclusion relationships; the minimum support constraint is expressed as a cardinality clause over the item variables. The transaction‑based encoding flips the perspective: Boolean variables denote whether a transaction is selected, and the presence of items follows from these selections; support constraints become clauses on the sum of selected transactions. The hybrid encoding combines the strengths of both: support constraints are handled transaction‑wise, while inter‑item exclusion or inclusion constraints are expressed item‑wise, thereby reducing the overall clause count and improving propagation.
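To make the item‑based encoding concrete, the sketch below generates DIMACS‑style CNF clauses (lists of signed integers) for a toy database. It follows a common formulation from the SAT‑for‑itemset‑mining literature: one Boolean variable per item, plus a "coverage" variable per transaction that is true exactly when the candidate itemset is contained in that transaction. The function name `item_encoding` and the variable layout are illustrative choices, not taken from the paper, and the minimum‑support cardinality constraint over the coverage variables is deliberately omitted.

```python
def item_encoding(transactions, items):
    """Emit CNF clauses for a minimal item-based FIM encoding.

    Variables 1..len(items) select items; one extra variable p_t per
    transaction is true iff every selected item occurs in t. A real
    model would additionally encode sum(p_t) >= minsup with a
    cardinality encoding; that part is left out here.
    """
    n = len(items)
    var = {it: i + 1 for i, it in enumerate(items)}  # item -> variable
    clauses = []
    for t_idx, t in enumerate(transactions):
        p_t = n + 1 + t_idx                 # coverage variable for t
        outside = [var[it] for it in items if it not in t]
        # p_t -> no selected item lies outside t
        for x in outside:
            clauses.append([-p_t, -x])
        # (every item outside t is unselected) -> p_t
        clauses.append([p_t] + outside)
    return clauses

db = [{"a", "b"}, {"a", "c"}, {"b", "c"}]
cls = item_encoding(db, ["a", "b", "c"])    # 6 clauses for this toy database
```

Each transaction contributes one clause per item it lacks, plus one closing clause, which is why the clause count grows with database sparsity.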

Second, the authors explore four enumeration strategies. A straightforward lexicographic enumeration forces the SAT solver to generate solutions in a fixed variable order. For maximal itemset enumeration, a blocking clause is added after each solution to forbid any superset, ensuring only maximal sets are produced. Closed‑itemset enumeration introduces additional clauses that block any solution that can be extended without violating support, guaranteeing closure. Finally, a conflict‑driven blocking technique leverages the solver’s learned clauses: when a conflict occurs, the derived clause is reused to prevent revisiting the same region of the search space, dramatically cutting redundancy.
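The effect of a maximal‑itemset blocking clause can be shown without a full SAT solver. After a frequent itemset S is found, the clause ∨_{i∉S} x_i forbids S and every subset of S from being reported again. The sketch below, with hypothetical helper names, brute‑forces the frequent itemsets (standing in for the solver's model enumeration) and then applies exactly this blocking rule, processing larger itemsets first:

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Brute-force all frequent itemsets (a stand-in for SAT model
    enumeration; exponential, for illustration only)."""
    items = sorted(set().union(*transactions))
    out = []
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = set(cand)
            if sum(s <= t for t in transactions) >= minsup:
                out.append(frozenset(s))
    return out

def maximal_via_blocking(frequents):
    """Keep an itemset only if no already-kept itemset blocks it.
    A kept set K induces the blocking clause OR_{i not in K} x_i,
    which an itemset S violates exactly when S is a subset of K."""
    kept = []
    for s in sorted(frequents, key=len, reverse=True):
        if not any(s <= k for k in kept):
            kept.append(s)
    return kept
```

For `db = [{"a","b","c"}, {"a","b"}, {"a","c"}]` with `minsup = 2`, the frequent itemsets are {a}, {b}, {c}, {a,b}, {a,c}, and the blocking step leaves only the maximal ones, {a,b} and {a,c}. In a real solver the clause is added incrementally after each model, so non‑maximal solutions are never even generated.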

Third, two search‑strategy refinements are proposed. Dynamic branching selects the next decision variable based on runtime statistics such as conflict frequency and activity, rather than a static ordering, which helps focus the solver on the most constrained parts of the problem. Adaptive restarts monitor the depth of the search tree and trigger a restart when progress stalls, preserving learned clauses while resetting the decision stack, thus avoiding deep, unproductive branches.
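Dynamic branching of this kind is usually realized with a VSIDS‑style activity heuristic: variables involved in recent conflicts get their scores bumped, all scores decay over time, and the solver branches on the highest‑scoring unassigned variable. The toy class below illustrates the mechanism; the class name and parameter values are illustrative, not taken from the paper.

```python
class ActivityPicker:
    """Toy VSIDS-style dynamic branching heuristic.

    Variables appearing in conflicts are bumped; a multiplicative
    decay after each conflict makes recent conflicts outweigh old
    ones, steering search toward the most constrained variables.
    """
    def __init__(self, num_vars, bump=1.0, decay=0.95):
        self.activity = {v: 0.0 for v in range(1, num_vars + 1)}
        self.bump, self.decay = bump, decay

    def on_conflict(self, conflict_vars):
        for v in conflict_vars:
            self.activity[v] += self.bump
        for v in self.activity:            # decay all scores
            self.activity[v] *= self.decay

    def pick(self, unassigned):
        """Branch on the unassigned variable with highest activity."""
        return max(unassigned, key=lambda v: self.activity[v])
```

Production solvers implement the same idea more efficiently (e.g., by growing the bump instead of decaying every score, and keeping variables in a heap), but the branching behavior is the same.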

The experimental evaluation uses four public benchmark datasets—Chess, Mushroom, Retail, and Kosarak—spanning a range of item and transaction densities. Minimum support thresholds are varied from 0.1 % to 5 % to test both sparse and dense regimes. Performance metrics include total runtime, number of search decisions, memory consumption, and the count of discovered itemsets. Results show that CP remains superior in the majority of settings, especially for low support thresholds where the search space explodes. However, in high‑support regimes (≥ 3 %) on datasets with many items but relatively few transactions (e.g., Chess and Mushroom), the hybrid encoding combined with dynamic branching and conflict‑driven blocking outperforms CP by a factor of 2–3. For closed‑itemset mining, the conflict‑driven blocking reduces the number of search decisions by more than 40 % without increasing memory usage. Conversely, the transaction‑based encoding performs poorly for low support because the number of generated clauses becomes prohibitive.

The discussion interprets these findings as evidence that SAT is not inherently unsuitable for FIM; rather, its competitiveness hinges on problem characteristics and on the careful design of encodings and search heuristics. The authors argue that SAT excels when the support constraint is tight and the item space is large, conditions under which CP’s propagation mechanisms may become less effective. They also highlight the potential of hybrid CP‑SAT frameworks that could exploit SAT’s powerful clause learning while retaining CP’s global constraint handling.

In conclusion, the paper demonstrates that SAT‑based frequent itemset mining can be a viable alternative to CP in specific scenarios, particularly when tailored encodings and advanced search strategies are employed. It opens avenues for future research on adaptive encoding selection, deeper integration of CP’s global constraints into SAT solvers, and the development of unified hybrid solvers that dynamically switch between CP and SAT techniques based on runtime diagnostics.