Flexible constrained sampling with guarantees for pattern mining


Authors: Vladimir Dzyuba, Matthijs van Leeuwen, Luc De Raedt

Abstract

Pattern sampling has been proposed as a potential solution to the infamous pattern explosion. Instead of enumerating all patterns that satisfy the constraints, individual patterns are sampled proportional to a given quality measure. Several sampling algorithms have been proposed, but each of them has its limitations when it comes to 1) flexibility in terms of quality measures and constraints that can be used, and/or 2) guarantees with respect to sampling accuracy. We therefore present Flexics, the first flexible pattern sampler that supports a broad class of quality measures and constraints, while providing strong guarantees regarding sampling accuracy. To achieve this, we leverage the perspective on pattern mining as a constraint satisfaction problem and build upon the latest advances in sampling solutions in SAT as well as existing pattern mining algorithms. Furthermore, the proposed algorithm is applicable to a variety of pattern languages, which allows us to introduce and tackle the novel task of sampling sets of patterns.

We introduce and empirically evaluate two variants of Flexics: 1) a generic variant that addresses the well-known itemset sampling task and the novel pattern set sampling task, as well as a wide range of expressive constraints within these tasks, and 2) a specialized variant that exploits existing frequent itemset techniques to achieve substantial speed-ups. Experiments show that Flexics is both accurate and efficient, making it a useful tool for pattern-based data exploration.

1 Introduction

Pattern mining [1] is an important and well-studied task in data mining. Informally, a pattern is a statement in a formal language that concisely describes a subset of a given dataset.
Pattern mining techniques aim at providing comprehensible descriptions of coherent regions in the data. (Affiliations: 1 DTAI, KU Leuven, Belgium, firstname.lastname@cs.kuleuven.be; 2 LIACS, Leiden University, The Netherlands, m.van.leeuwen@liacs.leidenuniv.nl.) Many variations of pattern mining have been proposed in the literature, together with even more algorithms to efficiently mine the corresponding patterns. Best known is frequent pattern mining [2], which includes frequent itemset mining and its extensions. Traditional pattern mining methods enumerate all frequent patterns, though it is well known that this usually results in humongous amounts of patterns (the infamous pattern explosion). To make pattern mining more useful for exploratory purposes, different solutions to this problem have been proposed. Each of these solutions has its own advantages and disadvantages. Condensed representations [3] can often be efficiently mined, but generally still result in large numbers of patterns. Top-k mining [4] is efficient but results in strongly related, redundant patterns showing a lack of diversity. Constrained mining [5] may result in too few or too many patterns, depending on the user-chosen constraints. Pattern set mining [6] takes into account the relationships between the patterns, which can result in small solution sets, but is computationally intensive.

In this paper, we study pattern sampling, another approach that has been proposed recently: instead of enumerating all patterns, patterns are sampled one by one, according to a probability distribution that is proportional to a given quality measure.
The promised benefits include: 1) flexibility, in that potentially a broad range of quality measures and constraints can be used; 2) 'anytime' data exploration, where a growing representative set of patterns can be generated and inspected at any time; and 3) diversity, in that the generated sets of patterns are independently sampled from different regions in the solution space. To be reliable, pattern samplers should provide theoretical guarantees regarding the sampling accuracy, i.e., the difference between the empirical probability of sampling a pattern and the (generally unknown) target probability determined by its quality. These properties are essential for pattern mining applications ranging from showing patterns directly to the user, where flexibility and the anytime property enable experimenting with and fine-tuning mining task formulations, to candidate generation for building pattern-based models, for which the approximation guarantees can be derived from those of the sampler.

While a number of pattern sampling approaches have been developed over the past years, they are either inflexible (as they only support a limited number of quality measures and constraints), or do not provide theoretical guarantees concerning the sampling accuracy. At the algorithmic level, they follow standard sampling approaches such as Markov Chain Monte Carlo (MCMC) random walks over the pattern lattice [7, 8, 9], or a special-purpose sampling procedure tailored for a restricted set of itemset mining tasks [10, 11]. Although MCMC approaches are in principle applicable to a broad range of tasks, they often converge only slowly to the desired target distribution and require the selection of the "right" proposal distributions.
To the best of our knowledge, none of the existing approaches to pattern sampling takes advantage of the latest developments in sampling technology from the SAT-solving community, where a number of powerful samplers based on random hash functions and XOR-sampling have been developed [12, 13, 14, 15]. WeightGen [16], one of the recent approaches, possesses the benefits mentioned above: it is an anytime algorithm, it is flexible as it works with any distribution, it generates diverse solutions, and it provides strong performance guarantees under reasonable assumptions. In this paper, we show that the latest developments in sampling solutions in SAT are also relevant to pattern sampling and essentially offer the same advantages. Our results build upon the view of pattern mining as constraint satisfaction, which is now commonly accepted in the data mining community [17].

Table 1: Our method is the first pattern sampler that combines flexibility with respect to the choice of constraints and sampling distributions with strong theoretical guarantees.

Sampler                 | Arbitrary constraints | Arbitrary distributions | Strong guarantees | Efficiency              | Pattern set sampling
ACFI [7]                | Minimal frequency     | -                       | -                 | X                       | -
LRW [8]                 | X                     | X                       | -                 | Implementation-specific | -
FCA [9]                 | Anti-/monotonic       | X                       | -                 | X                       | -
TS (Two-step) [10, 11]  | -                     | -                       | X                 | X                       | -
Flexics (this paper)    | GFlexics              | X                       | X                 | EFlexics                | X

Approach and contributions. More specifically, we introduce Flexics: a flexible pattern sampler that samples from distributions induced by a variety of pattern quality measures and allows for a broad range of constraints, while still providing strong theoretical guarantees. Notably, Flexics is, in principle, agnostic of the quality measure, as the sampler treats it as a black box. (However, its properties affect the efficiency of the algorithm.) The other building block is a constraint oracle that enumerates all patterns that satisfy the constraints, i.e., a mining algorithm.
The proposed approach allows converting an existing pattern mining algorithm into a sampler with guarantees. Thus, its flexibility is not limited by the choice of constraints and quality measures, but even allows tackling richer pattern languages, which we demonstrate by tackling the novel task of sampling sets of patterns. Table 1 compares the proposed approach to alternative samplers; see Section 3 for a more detailed discussion.

The main technical contribution of this paper consists of two variants of the Flexics sampler, which are based on different constraint oracles. First, we introduce a generic variant, dubbed GFlexics, that supports a wide range of pattern constraints, such as syntactic or redundancy-eliminating constraints. GFlexics uses cp4im [17], a declarative constraint programming-based mining system, as its oracle. Any constraint supported by cp4im can be used without interfering with the umbrella procedure that performs the actual sampling task. Unlike the original version of WeightGen, which is geared towards SAT, GFlexics can handle cardinality constraints that are ubiquitous in pattern mining. Furthermore, we identify (based on previous research) the properties of the constraint satisfaction-based formalization of pattern mining that further improve the efficiency of the sampling procedure without affecting its guarantees and thus make it applicable to practical problems. We use GFlexics to tackle a wide range of well-known itemset sampling tasks as well as the novel pattern set sampling task. Second, as it is well known that generic solvers impose an overhead on runtime, we introduce a variant specialized towards frequent itemsets, dubbed EFlexics, which has an extended version of Eclat [18] at its core as oracle.
Experiments show that Flexics' sampling accuracy is impressively high: in a variety of settings supported by the sampler, empirical frequencies are within a small factor of the target distribution induced by various quality measures. Furthermore, practical accuracy is substantially higher than the theory guarantees. EFlexics is shown to be faster than its generic cousin, demonstrating that developing specialized solvers for specific tasks is beneficial when runtime is an issue. Finally, the flexibility of the sampler allows us to use the same approach to successfully tackle the novel problem of sampling pattern sets. This demonstrates that Flexics is a useful tool for pattern-based data exploration.

This paper is organized as follows. We formally define the problem of pattern sampling in Section 2. After reviewing related research in Section 3, we present the two key ingredients of the proposed approach in Section 4: 1) the perspective on pattern mining as a constraint satisfaction problem and 2) hashing-based sampling with WeightGen. In Section 5, we present Flexics, a flexible pattern sampler with guarantees. In particular, we outline the modifications required to adapt WeightGen to pattern sampling and describe the procedure to convert two existing mining algorithms into oracles suitable for use with WeightGen, which yields two variants of Flexics. In Section 6, we introduce the pattern set sampling task and describe how it can be tackled with Flexics. We also outline sampling non-overlapping tilings, an example of pattern set sampling that is studied in the experiments. The experimental evaluation in Section 7 investigates the accuracy, scalability, and flexibility of the proposed sampler. We discuss its potential applications, advantages, and limitations in Section 8. Finally, we present our conclusions in Section 9.
2 Problem definition

Here we present a high-level definition of the task that we consider in this paper; for concrete instances and examples, see Sections 4 and 6. The pattern sampling problem is formally defined as follows: given a dataset D, a pattern language L, a set of constraints C, and a quality measure ϕ: L → R+, generate random patterns that satisfy the constraints in C with probability proportional to their qualities:

    P_ϕ(p) = ϕ(p) / Z_ϕ   if p ∈ L satisfies C
    P_ϕ(p) = 0            otherwise

where Z_ϕ is an (often unknown) normalization constant. A quality measure quantifies the domain-specific interestingness of a pattern. The choice of a quality measure and constraints allows a user to express her analysis requirements. The sampling procedure meets these requirements by satisfying the constraints and generating high-quality patterns more frequently. Thus, sampled patterns are a representative subset of all interesting regularities in the dataset.

Pattern set mining is an extension of pattern mining, which considers sets of patterns rather than individual patterns. Despite its popularity, we are not aware of the existence of pattern set samplers. The task of pattern set sampling can easily be formalized as an extension of pattern sampling, where we sample sets of patterns s ⊂ L, and the constraints C as well as the quality measure ϕ are specified over sets of patterns (from 2^L) rather than individual patterns (from L).

3 Related work

We here focus on two classes of related work, i.e., 1) pattern mining as constraint satisfaction and 2) pattern sampling.

Constrained pattern mining. The study of constraints has been a prominent subfield of pattern mining. A wide range of constraint classes were investigated, including anti-monotonic constraints [1], convertible constraints [19], and others.
Another development of these ideas led to the introduction of global constraints that concern multiple patterns and to the emergence of pattern set mining [20, 21]. Furthermore, generic mining systems that can freely combine various constraints were proposed [22, 23]. These insights allowed drawing a connection between pattern mining and constraint satisfaction in AI, e.g., SAT or constraint programming (CP). As a result, declarative mining systems, which use generic constraint solvers to mine patterns according to a declarative specification of the mining task, were proposed. For example, CP was used to develop the first declarative systems for itemset mining [17] and pattern set mining [24, 25]. Recently, declarative approaches have been extended to support sequence mining [26] and graph mining [27].

Constraint-based systems allow a user to specify a wide range of pattern constraints and thus provide tools to alleviate the pattern explosion. However, the underlying solvers use systematic search, which affects the order of pattern generation and thus prevents them from being used in a truly anytime manner, due to the low diversity of consecutive solutions. Similarly, pattern set miners that directly aim at obtaining diverse result sets typically incur prohibitive computational costs as the size of the pattern space grows.

Pattern sampling. In this paper we focus on approaches that directly aim at generating random pattern collections, rather than methods whose goal is to estimate dataset or pattern language statistics; cf. Shervashidze et al. [28]. Table 1 compares our method with the approaches described in Section 1, namely MCMC and two-step samplers [10, 11].
We further break down MCMC samplers into three groups: ACFI, the very first uniform sampler, developed for approximate counting of frequent itemsets [7]; LRW, a generic approach based on random walks over the pattern lattice [8]; and FCA, a sampler that uses Markov chains based on insights from formal concept analysis [9]. Although MCMC samplers provide theoretical guarantees, in practice their convergence is often slow and hard to diagnose. Solutions such as long burn-in or heuristic adaptations either increase the runtime or weaken the guarantees. Furthermore, ACFI is tailored for a single task; FCA only supports anti-/monotone constraints; and LRW checks constraints locally, while building the neighborhood of a state, which might require advanced reasoning and extensive caching. Two-step samplers, while provably accurate and efficient, only support a limited number of weight functions and do not support constraints.

4 Preliminaries

We first outline itemset mining, a prototypical pattern mining task, and formalize it as a CSP; we then describe WeightGen, a hashing-based sampling algorithm.

4.1 Itemset mining

Itemset mining is an instance of pattern mining specialized for binary data. Let I = {1 ... M} denote a set of items. A dataset D is a bag of transactions over I, where each transaction t is a subset of I, i.e., t ⊆ I; T = {1 ... N} is a set of transaction indices. The pattern language L also consists of sets of items, i.e., L = 2^I. An itemset p occurs in a transaction t iff p ⊆ t. The frequency of p is the number of transactions in which it occurs: freq(p) = |{t ∈ D | p ⊆ t}|. In labeled datasets, a transaction has a label from {−, +}; freq− and freq+ are defined accordingly.

We first give a brief overview of the general approach to solving CSPs and then present a formalization of itemset mining as a CSP, following that of cp4im [17].
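To make the task from Section 2 concrete for itemsets, consider a brute-force reference sampler. This sketch is ours and purely illustrative (avoiding exactly this exponential enumeration is the point of Flexics): it enumerates every itemset that satisfies a given constraint and draws one proportional to its quality.

```python
import random
from itertools import combinations

def sample_pattern(items, quality, constraint, rng=random.Random(0)):
    """Brute-force reference sampler: enumerate every itemset that satisfies
    the constraint, then draw one with probability proportional to quality."""
    patterns = [frozenset(p)
                for r in range(1, len(items) + 1)
                for p in combinations(items, r)
                if constraint(frozenset(p))]
    weights = [quality(p) for p in patterns]
    return rng.choices(patterns, weights=weights, k=1)[0]
```

With ϕ = freq and a minimum-frequency constraint, this realizes the target distribution P_ϕ exactly, but the enumeration is exponential in the number of items.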
Formally, a CSP is comprised of variables, along with their domains, and constraints over these variables. The goal is to find a solution, i.e., an assignment of values to all variables that satisfies all constraints. Every constraint is implemented by a propagator, i.e., an algorithm that takes domains as input and removes values that do not satisfy the constraint. Propagators are activated when variable domains change, e.g., by the search mechanism or other propagators. A CSP solver is typically based on depth-first search. After a variable is assigned a value, propagators are run until domains cannot be reduced any further. At this point, three cases are possible: 1) a variable has an empty domain, i.e., the current search branch has failed and backtracking is necessary; 2) there are unassigned variables, i.e., further branching is necessary; or 3) all variables are assigned a value, i.e., a solution is found.

Let I_i denote a variable corresponding to each item; T_t a variable corresponding to each transaction; and D_ti a constant that is equal to 1 if item i occurs in transaction t, and 0 otherwise. Variables I_i and T_t are binary, i.e., their domain is {0, 1}. Each CSP solution corresponds to a single itemset. Thus, for example, I_i = 1 implies that item i is included in the current (partial) solution, whereas T_t = 0 implies that transaction t is not covered by it.

Table 2: Constraint programming formulations of common itemset mining constraints. I_i = 1 implies that item i is included in the current (partial) solution, whereas T_t = 1 implies that it covers transaction t.

Constraint  | Parameters | CP formulation
coverage    |            | ∀t ∈ T: T_t = 1 ⇔ Σ_{i ∈ I} I_i (1 − D_ti) = 0
minfreq(θ)  | θ ∈ (0, 1] | ∀i ∈ I: I_i = 1 ⇒ Σ_{t ∈ T} T_t D_ti ≥ θ × |D|
closed      |            | ∀i ∈ I: I_i = 1 ⇔ Σ_{t ∈ T} T_t (1 − D_ti) = 0
minlen(λ)   | λ ∈ [1, M] | ∀t ∈ T: T_t = 1 ⇒ Σ_{i ∈ I} I_i D_ti ≥ λ
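The formulations in Table 2 can be checked mechanically on a complete assignment. The sketch below is our illustration (it verifies a full assignment rather than propagating partial domains, which is what cp4im's propagators actually do):

```python
def satisfies(I, T, D, theta):
    """Check a complete 0/1 assignment (I over items, T over transactions)
    against the coverage and minfreq(theta) formulations of Table 2."""
    n_items, n_trans = len(I), len(T)
    # coverage: T_t = 1 iff every selected item occurs in transaction t
    for t in range(n_trans):
        covered = all(D[t][i] == 1 for i in range(n_items) if I[i] == 1)
        if T[t] != (1 if covered else 0):
            return False
    # minfreq: every selected item must occur in >= theta * |D| covered transactions
    for i in range(n_items):
        if I[i] == 1 and sum(T[t] * D[t][i] for t in range(n_trans)) < theta * n_trans:
            return False
    return True
```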
Table 2 lists some of the most common constraints. The coverage constraint essentially models a dataset query and ensures that if the item variable assignment corresponds to an itemset p, only those transaction variables that correspond to indices of transactions where p occurs are assigned value 1. Other constraints allow users to remove uninteresting solutions, e.g., redundant non-closed itemsets. Most solvers provide facilities for enumerating all solutions in sequence, i.e., to enumerate all patterns. In contrast to hard constraints, quality measures are used to describe soft user preferences with respect to the interestingness of patterns. Common quality measures concern frequency, e.g., ϕ ≡ freq, or discriminativity in a labeled dataset, e.g., purity ϕ(p) = max{freq+(p), freq−(p)} / freq(p), etc.

4.2 WeightGen

WeightGen [16] is an algorithm for approximate weighted sampling of satisfying assignments (solutions) of a Boolean formula that only requires access to an efficient constraint oracle that enumerates the solutions, e.g., a SAT solver. The core idea consists in partitioning the solution space into a number of "cells" and sampling a solution from a random cell. A partitioning with the desired properties is obtained by augmenting the original problem with random XOR constraints. Theoretical guarantees stem from the properties of uniformly random XOR constraints. The sequel follows Sections 3-4 in Chakraborty et al. [16].

Problem statement and guarantees. Formally, let F denote a Boolean formula; σ a satisfying variable assignment of F; M the total number of variables; w(·) a black-box weight function that for each σ returns a number in (0, 1]; and w_min (resp. w_max) the minimal (resp. maximal) weight over all satisfying assignments of F.
The weight function induces a probability distribution over the satisfying assignments σ of F, where P_w(σ) = w(σ) / Σ_{σ′} w(σ′), with σ′ ranging over all satisfying assignments. The quantity r = w_max / w_min is the (possibly unknown) tilt of the distribution P_w. Given a user-provided upper bound on the tilt r̂ ≥ r and a desired sampling error tolerance κ ∈ (0, 1) (the lower κ, the tighter the bounds on the sampling error), WeightGen generates a random solution σ. Performance guarantees concern both the accuracy and the efficiency of the algorithm and depend on the parameters and the number of variables M; see Section 5 for details.

Algorithm. Recall that the core idea that underlies sampling with guarantees is partitioning the overall solution space into a number of random cells by adding random XOR constraints. WeightGen proceeds in two phases: 1) the estimation phase and 2) the sampling phase. The goal of the estimation phase is to estimate the number of XOR constraints necessary to obtain a "small" cell, where the required cell weight is determined by the desired sampling error tolerance. The sampling phase starts by applying the estimated number of XOR constraints. If it obtains a cell whose total weight lies within a certain range, which depends on κ, a solution is sampled exactly from all solutions in the cell; otherwise, it adds a new random XOR constraint. However, the number of XOR constraints that can be added is limited. If the algorithm cannot obtain a suitable cell, it indicates failure and returns no sample.

Both phases make use of a bounded oracle that terminates as soon as the total weight of the enumerated solutions exceeds a predefined number. It enumerates solutions of the original problem F augmented with the XOR constraints. An individual XOR constraint over variables X has the form ⊕ b_i · X_i = b_0, where the coefficients b_i and the parity bit b_0 are in {0, 1}.
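An XOR constraint of this form can be checked mechanically against an assignment. The sketch below (ours) draws the coefficients uniformly at random and filters a solution list down to one cell:

```python
import random

def random_xors(n_vars, m, rng=random.Random(42)):
    """m random XOR constraints: each is (coefficients b_1..b_n, parity bit b_0)."""
    return [([rng.randint(0, 1) for _ in range(n_vars)], rng.randint(0, 1))
            for _ in range(m)]

def in_cell(assignment, xors):
    """A 0/1 assignment is in the cell iff it satisfies every XOR constraint."""
    return all(sum(b * x for b, x in zip(coeffs, assignment)) % 2 == parity
               for coeffs, parity in xors)

def cell(solutions, xors):
    """The subset of solutions that fall into the cell identified by the XORs."""
    return [s for s in solutions if in_cell(s, xors)]
```

With m constraints, the 2^m choices of parity bits identify the 2^m cells of one random partitioning of the solution space.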
The coefficients b_i determine the variables involved in the constraint, whereas the parity bit b_0 determines whether an even or an odd number of these variables must be set to 1. Together, m XOR constraints identify one cell belonging to a partitioning of the overall solution space into 2^m cells. The core operation of WeightGen involves drawing the coefficients uniformly at random, which induces a random partitioning of the solution space that satisfies the 3-wise independence property, i.e., knowing the cells of two arbitrary assignments does not provide any information about the cell of a third assignment [12]. This ensures the desired statistical properties of the random partitions, required for the theoretical guarantees. The reader interested in further technical details should consult Appendix A and Chakraborty et al. [16].

5 Flexics: Flexible pattern sampler with guarantees

In this paper, we propose Flexics, a pattern sampler that uses WeightGen as the umbrella sampling procedure. To this end, we 1) extend it to CSPs with binary variables, a class of problems that is more general than SAT and that includes pattern mining as described in Section 4; 2) augment existing pattern mining algorithms for use with WeightGen; and 3) investigate the properties of pattern quality measures in the context of WeightGen's requirements.

WeightGen was originally presented as an algorithm to sample solutions of the SAT problem. Pattern mining problems cannot be efficiently tackled by pure Boolean solvers due to the prominence of cardinality constraints (e.g., minfreq). However, we observe that the core sampling procedure is applicable to any CSP with binary variables, as its solution space can be partitioned with XOR constraints in the required manner. Based on this insight, we present two variants of Flexics that differ in their oracles.
Each oracle is essentially a pattern mining algorithm extended to support XOR constraints along with common constraints on patterns. The first one, dubbed GFlexics, builds upon the generic formalization and solving techniques described in Section 4 and thus supports a wide range of constraints. Owing to the properties of the coverage constraint, XOR constraints only need to involve item variables (in other words, the item variables I are the independent support of a pattern mining CSP), which makes them relatively short, mitigating the computational overhead. Moreover, this perspective helps us design the second approach, dubbed EFlexics, which uses an extension of Eclat [18], a well-known mining algorithm, as an oracle. It is tailored for a single task (frequent itemset mining, i.e., it only supports the minfreq constraint), but is capable of handling larger datasets. We describe each oracle in detail in the following subsections.

Given a dataset D, constraints C, a quality measure ϕ, and the error tolerance parameter κ ∈ (0, 1), Flexics first constructs a CSP corresponding to the task of mining patterns satisfying C from D. It then determines the parameters for the sampling procedure, including the appropriate number of XOR constraints, and starts generating samples. To this end, it uses one of the two proposed oracles to enumerate patterns that satisfy C and the random XOR constraints. Both variants of Flexics support sampling from black-box distributions derived from quality measures and, most importantly, preserve the theoretical guarantees of WeightGen:

Theorem 1. The probability that Flexics samples a random pattern p that satisfies constraints C from a dataset D lies within a bounded range determined by the quality of the pattern ϕ(p) and κ:

    (ϕ(p) / Z_ϕ) × 1/(1 + ε(κ)) ≤ P(Flexics(D, C, ϕ; κ) = p) ≤ (ϕ(p) / Z_ϕ) × (1 + ε(κ))
Theorem 1 corresponds to and follows from Theorem 3 of Chakraborty et al. [16].

Proof. Theorem 3 of Chakraborty et al. [16] states:

    P_w(σ) / (1 + ε(κ)) ≤ P̂_σ ≤ P_w(σ) × (1 + ε(κ))

where P̂_σ denotes the probability that WeightGen, called with parameters r̂ and κ, samples a given solution σ of F; P_w(σ) ∝ w(σ) denotes the target probability of σ; and ε(κ) = (1 + κ)(2.36 + 0.51/(1 − κ)²) − 1 denotes the sampling error derived from κ.

For technical purposes, we introduce the notion of the weight of a pattern as its quality scaled to the range (0, 1], i.e., w_ϕ(p) = ϕ(p)/C, where C is an arbitrary constant such that C ≥ max_{p ∈ L} ϕ(p). The proof follows from Theorem 3 of Chakraborty et al. [16] and the observation that Flexics(D, C, ϕ; κ) is equivalent to WeightGen(CSP(D, C), w_ϕ; κ). The estimation phase effectively corrects for a potential discrepancy between C and Z_ϕ.

Furthermore, Theorem 4 of Chakraborty et al. [16] provides efficiency guarantees: the number of calls to the oracle is linear in r̂ and polynomial in M and 1/ε(κ). The assumption that the tilt is bounded from above by a reasonably low number is the only assumption regarding a (black-box) weight function. Moreover, it only affects the efficiency of the algorithm, but not its accuracy. Thus, using a quality measure with Flexics requires knowledge of two properties: the scaling constant C and the tilt bound r̂. In practice, both are fairly easy to come up with for a variety of measures. For example, for freq and purity, C = |D|, r̂ = θ⁻¹ and C = 1, r̂ = 2, respectively; see Section 6 for another example.

5.1 GFlexics: Generic pattern sampler

The first variant relies on cp4im [17], a constraint programming-based mining system. A wide range of constraints supported by cp4im are automatically supported by the sampler and can be freely combined with various quality measures.
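Returning to the scaling discussed above: wrapping a quality measure into a WeightGen-compatible weight function is a one-liner. The toy dataset and helper names below are ours, chosen to illustrate the bounds C = |D| and r̂ = 1/θ for freq under a minfreq(θ) constraint:

```python
def make_weight(quality, C):
    """Wrap a quality measure into a black-box weight w(p) = quality(p)/C in (0, 1]."""
    return lambda p: quality(p) / C

D = [{1, 2}, {1, 2}, {1, 3}, {2, 3}]        # toy dataset (ours)
freq = lambda p: sum(1 for t in D if p <= t)
w = make_weight(freq, C=len(D))             # C = |D| bounds freq from above
theta = 0.25                                # minfreq threshold, as a fraction
# Under minfreq(theta), weights lie in [theta, 1], so the tilt
# r = w_max / w_min is at most 1/theta, giving the bound r_hat = 1/theta.
```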
In order to turn cp4im into a suitable bounded oracle, we need to extend it with an efficient propagator for XOR constraints. This propagator is based on Gaussian elimination [29], a classical algorithm for solving systems of linear equations. Each XOR constraint can be viewed as a linear equality over the field F2 of two elements, 0 and 1, and all coefficients form a binary matrix (Figure 1.2). At each step, the matrix is updated with the latest variable assignments and transformed to row echelon form, where all ones are on or above the main diagonal and all non-zero rows are above any rows of all zeroes (Figure 1.3). During echelonization, two situations enable propagation. If a row becomes empty while its right-hand side is equal to 1, the system is unsatisfiable and the current search branch terminates (Figure 1.5). If a row contains only one free variable, it is assigned the right-hand side of the row (Figure 1.3).

[Figure 1: Propagating XOR constraints using Gaussian elimination in F2. Panels: 1) random XOR constraints; 2) initial constraint matrix; 3) echelonized matrix, from which the assignments x2 = 0 and x3 = 1 are derived; 4) updated matrix (rows 2 and 4 are swapped); 5) if x1 and x5 are set to 1 (e.g., by search), the system is unsatisfiable.]

Gaussian elimination in F2 can be performed very efficiently, because no division is necessary (all coefficients are 1), and subtraction and addition are equivalent operations. For a system of k XOR constraints over n variables, the total time complexity of Gaussian elimination is O(k²n).
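A minimal sketch (ours, far simpler than a real incremental propagator) of elimination over F2 with the two propagation rules described above; rows are (coefficient list, right-hand side) pairs, and row addition is bitwise XOR:

```python
def eliminate_gf2(rows):
    """Gaussian elimination over GF(2). rows: list of (coeffs, rhs) with 0/1 entries.
    Returns (forced, ok): forced variable assignments {index: value}, and
    ok = False if some row reduces to 0 = 1 (system unsatisfiable)."""
    rows = [(coeffs[:], rhs) for coeffs, rhs in rows]
    n = len(rows[0][0])
    pivot_row = 0
    for col in range(n):
        # find a row with a 1 in this column, at or below the current pivot row
        for r in range(pivot_row, len(rows)):
            if rows[r][0][col] == 1:
                rows[pivot_row], rows[r] = rows[r], rows[pivot_row]
                break
        else:
            continue
        # XOR the pivot row into every other row that has a 1 in this column
        pc, prhs = rows[pivot_row]
        for r in range(len(rows)):
            if r != pivot_row and rows[r][0][col] == 1:
                rows[r] = ([a ^ b for a, b in zip(rows[r][0], pc)], rows[r][1] ^ prhs)
        pivot_row += 1
    forced, ok = {}, True
    for coeffs, rhs in rows:
        support = [i for i, c in enumerate(coeffs) if c == 1]
        if not support and rhs == 1:
            ok = False                   # empty row with RHS 1: unsatisfiable
        elif len(support) == 1:
            forced[support[0]] = rhs     # single free variable: value is forced
    return forced, ok
```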
5.2 EFlexics: Efficient pattern sampler

Generic constraint solvers currently cannot compete with the efficiency and scalability of specialized mining algorithms. In order to develop a less flexible, yet more efficient version of our sampler, we extend the well-known Eclat algorithm to handle XOR constraints. Thus, EFlexics is tailored for frequent itemset sampling and uses EclatXOR (Algorithm 1) as an oracle.

Algorithm 1 shows the pseudocode of the extended Eclat. The algorithm relies on the vertical data representation, i.e., for each candidate item, it stores the set of indices of the transactions (TIDs) in which this item occurs (Line 4). Eclat starts by determining the frequent items and ordering them by frequency, ascending. It explores the search space in a depth-first manner, where each branch corresponds to (ordered) itemsets that share a prefix. The core operation is referred to as processing an equivalence class of itemsets (EqClass). For each prefix, Eclat maintains a set of candidate suffixes, i.e., items that follow the last item of the prefix in the item order and are frequent. The frequency of a candidate suffix, given the prefix, is computed by intersecting its TID set with the TID set of the prefix (Lines 9, 15, and 22).

We extend Eclat with XOR constraint handling (Lines 16-22). Variable updates stem from Eclat extending the prefix and removing infrequent suffixes (Line 16). XOR propagation can result in extending the prefix or removing candidate suffixes as well (Line 19).
Algorithm 1: Eclat augmented with XOR constraint propagation (Lines 16-22)

Input: Dataset D over items I, min. frequency θ, XOR matrix M
Assumes: Item order ≺_I by ascending frequency

 1: function EclatXOR(D, θ, M)    ▷ Mine all frequent patterns that satisfy the XOR constraints encoded by M
 2:   Frequent items FI = ∅
 3:   for item i ∈ I do
 4:     TID_i = {transaction index t ∈ T | D_ti = 1}
 5:     if |TID_i| ≥ θ then    ▷ Item is frequent
 6:       FI ← FI ∪ {(i, TID_i)}
 7:   Sort(FI, ≺_I)
 8:   for i ∈ FI do
 9:     Candidate suffixes CS = {i′ ∈ FI \ i | i′ ≻_I i}
10:     EqClass({i}, CS, M)

11: function EqClass(prefix P, candidate suffixes CS ≠ ∅, M)    ▷ Mine all patterns that start with P
12:   if CheckConstraints(P, M) then
13:     return P    ▷ Return the prefix if it satisfies the XORs
14:   for candidate suffix s ∈ CS do
15:     P′ = P ∪ s; frequent suffixes FS = {f ∈ CS \ s | f ≻_I s ∧ |f.TID ∩ s.TID| ≥ θ}
        ▷ Propagate XOR constraints:
16:     U_1 = {s}, U_0 = CS \ FS    ▷ Variable updates
17:     M′ = UpdateAndEchelonize(M, U_1, U_0)
18:     (A_1, A_0) = Propagate(M′)    ▷ Item variables assigned value 1 or 0 by propagation
19:     FS′ = FS \ (A_1 ∪ A_0)
20:     if A_1 ≠ ∅ then    ▷ If the prefix was extended, update TIDs and check support
21:       P′ ← P′ ∪ A_1; ΔTID = ∩_{f ∈ A_1} f.TID
22:       FS′ ← {f′ ∈ FS′ : |f′.TID ∩ ΔTID| ≥ θ}
23:     if |P′.TID| ≥ θ ∧ FS′ ≠ ∅ then
24:       EqClass(P′, FS′, M′)

Furthermore, if the prefix has been extended, the TIDs of the candidate suffixes need to be updated, with some of them possibly becoming infrequent, leading to further propagation (Lines 19-22). If the prefix becomes infrequent, the search branch terminates.
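The vertical representation and the equivalence-class recursion at the core of Algorithm 1 can be sketched as follows. This is the plain Eclat loop only, without the XOR propagation of Lines 16-22; all names are illustrative and not taken from the paper's implementation.

```python
# Minimal sketch of Eclat's vertical mining loop (no XOR handling):
# each item maps to its TID set, and the equivalence class of a prefix
# is explored depth-first by intersecting TID sets.

def eclat(transactions, theta):
    """Return {itemset (frozenset): support} for all itemsets with
    support >= theta; `transactions` is a list of item collections."""
    # Vertical representation: item -> set of transaction indices (TIDs).
    tids = {}
    for t, items in enumerate(transactions):
        for i in items:
            tids.setdefault(i, set()).add(t)
    # Frequent singletons, ordered by ascending frequency (as in Eclat).
    order = sorted((i for i in tids if len(tids[i]) >= theta),
                   key=lambda i: len(tids[i]))
    result = {}

    def eqclass(prefix, prefix_tid, suffixes):
        # Each suffix extends the shared prefix; the later suffixes form
        # the candidate set of the recursive call (fixed variable order).
        for k, s in enumerate(suffixes):
            tid = prefix_tid & tids[s]
            if len(tid) >= theta:
                itemset = prefix | {s}
                result[frozenset(itemset)] = len(tid)
                eqclass(itemset, tid, suffixes[k + 1:])

    eqclass(set(), set(range(len(transactions))), order)
    return result
```

Because the item order is fixed before the search starts, every itemset is visited exactly once; this fixed variable order is also what makes the XOR matrices well-behaved, as discussed next.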
Fixed variable-order search, like Eclat's, is an advantageous case for Gaussian elimination [30]: non-zero elements are restricted to the right region of the matrix, hence Gaussian elimination only needs to consider a contiguous, progressively shrinking subset of columns. The total memory overhead of EclatXOR compared to plain Eclat is O(d × |F| × N_XOR + pivot × r), where d denotes the maximal search depth, |F| the number of frequent singletons (columns of the matrix), and N_XOR the number of XOR constraints (rows of the matrix). The first term refers to the set of XOR matrices in unexplored search branches, whereas the second term refers to storing itemsets in a cell (Line 19 in Algorithm 2 in Appendix A).

6 Pattern set sampling

We highlight the flexibility of Flexics by introducing and tackling the novel task of sampling sets of patterns. For the purposes of sampling, a set of patterns is essentially treated as a composite pattern. Typically, constituent patterns are required to be different from each other. The quality (and hence, the sampling probability) of a pattern set depends on collective properties of its constituent patterns. These characteristics, coupled with the immense size of the pattern set search space, make sampling even more challenging.

To develop a sampler, we extend GFlexics with the CSP formulation of the k-pattern set mining task [25], which in turn builds upon the formulation of the itemset mining task described in Section 4. Recall that a CSP is defined by a set of variables and constraints over these variables. Each constituent pattern is modeled with distinct item and transaction variables, i.e., I_ik and T_tk for the kth pattern p_k. Note that this increases the length of the XOR constraints, which poses an additional challenge from the sampling perspective. Any single-pattern constraint can be enforced for a constituent pattern, e.g., minfreq(θ), closed, or minlen(λ).
A common pattern set-specific constraint is no_overlap, which enforces that neither the itemsets (1), nor the sets of transactions that they cover (2), overlap:

(1) ∀i ∈ I: Σ_k I_ik ≤ 1        (2) ∀t ∈ T: Σ_k T_tk ≤ 1

Furthermore, there is typically a symmetry-breaking constraint that requires that the set of transaction indices of p_i lexicographically precedes those of {p_j | j > i}. This approach allows modeling a wide range of pattern set sampling tasks, e.g., sampling k-term DNFs, conceptual clusterings, redescriptions, and others.

In this paper, we use the problem of tiling datasets [31] as an example. The main aim of tiling is to cover a large number of 1s in a binary 0/1 dataset with a given number of patterns. Thus, a tiling is essentially a set of itemsets that together describe as many item occurrences as possible. Without loss of generality, we describe the task of sampling non-overlapping 2-tilings (k = 2). Let p_1 and p_2 denote the constituent patterns of a 2-tiling. The quality of a tiling is equal to its area, i.e., the number of 1s that it covers:

area({p_1, p_2}) = freq(p_1) × |p_1| + freq(p_2) × |p_2|

The scaling constant for area is C = Σ D_ti, i.e., the total number of 1s in the dataset. The tilt bound is r̂ = Σ D_ti / (2 × (|D| × θ) × λ), where the denominator is the smallest possible area of a 2-tiling given the constraints.

7 Experiments

The experimental evaluation focuses on the accuracy, scalability, and flexibility of the proposed sampler. The research questions are as follows:

Q1 How close is the empirical sampling distribution to the target distribution?
Q2 How does Flexics compare to the specialized alternatives?
Q3 Does Flexics scale to large datasets?
Q4 How flexible is Flexics, i.e., can it be used for new pattern sampling tasks?
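For concreteness, the 2-tiling quality measure and tilt bound defined in Section 6 above (and used later in the Q4 experiments) can be sketched in a few lines. The function names freq, area, and tilt_bound are our own illustrative choices, operating on a 0/1 dataset given as a list of rows; tiles are assumed non-overlapping, as the constraints enforce.

```python
# Illustrative sketch of the 2-tiling quality from Section 6:
# area = number of 1s covered, i.e., freq(p) * |p| summed over the tiles.

def freq(dataset, itemset):
    """Number of rows (transactions) that contain all items of `itemset`."""
    return sum(all(row[i] for i in itemset) for row in dataset)

def area(dataset, tiling):
    """Area of a set of non-overlapping tiles: sum of freq(p) * |p|."""
    return sum(freq(dataset, p) * len(p) for p in tiling)

def tilt_bound(dataset, theta, lam, k=2):
    """Upper bound r^ on the tilt: total number of 1s divided by the
    smallest possible area of a k-tiling under minfreq(theta) (relative
    threshold) and minlen(lam)."""
    total_ones = sum(map(sum, dataset))
    return total_ones / (k * (len(dataset) * theta) * lam)
```

For example, on the dataset [[1,1,0],[1,1,1],[0,1,1]], the tile {0,1} has freq 2 and the 2-tiling {{0,1}, {2}} has area 2·2 + 2·1 = 6.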
The implementations of GFlexics and EFlexics (available at https://bitbucket.org/wxd/flexics) are based on cp4im (https://dtai.cs.kuleuven.be/CP4IM) and a custom implementation of Eclat, respectively. Both are augmented with a propagator for a system of XOR constraints based on the implementation of Gaussian elimination in the m4ri library (https://bitbucket.org/malb/m4ri/) [32]. All experiments were run on a Linux machine with an Intel Xeon CPU @ 3.2 GHz and 32 GB of RAM.

Q1: Sampling accuracy. We study the sampling accuracy of GFlexics in settings with tight constraints, which yield a relatively low number of solutions. This allows us to compute the exact statistical distance between the empirical sampling distribution and the target distribution. We investigate settings with various quality measures and constraint sets, as well as the effect of the tolerance parameter κ.

We select several datasets from the CP4IM repository (https://dtai.cs.kuleuven.be/CP4IM/datasets/) in the following way. For each dataset, we construct two constraint sets (see Table 3). We choose a value of θ such that there are approximately 60,000 frequent patterns. Given θ, we choose a value of λ ≥ 2 such that there are at least 15,000 closed patterns that satisfy the minlen constraint. In order to obtain sufficiently challenging sampling tasks, we omit the datasets where the latter condition does not hold (i.e., there are too few closed "long" patterns). Combining two constraint sets with three quality measures yields six experimental settings per dataset. Table 5 shows dataset statistics and parameter values. For each κ ∈ {0.1, 0.5, 0.9}, we request 900,000 samples.

Table 3: Combinations of two constraint sets and three quality measures yield six experimental settings per dataset for the sampling accuracy experiments; see Section 4 for definitions.
Constraints C                                  Itemsets per dataset
F    minFreq(θ)                                ∼60,000
FCL  minFreq(θ) ∧ closed ∧ minLen(λ)           ≥15,000

Quality measure ϕ       Tilt bound r̂
uniform (ϕ ≡ 1)         1
purity                  2
freq                    θ^(-1)

Let T denote the set of all itemsets that satisfy the constraints, E denote the multiset of all samples, and 1_E its multiplicity function. For a given quality measure ϕ, the target and empirical probabilities of sampling an itemset p are respectively defined as

P_T(p) = ϕ(p) / Σ_{p′ ∈ T} ϕ(p′)    and    P_E(p) = 1_E(p) / |E|.

We use Jensen-Shannon (JS) divergence to quantify the statistical distance between P_T and P_E. Let D_KL(P_1 ∥ P_2) denote the well-known Kullback-Leibler divergence between distributions P_1 and P_2. The JS-divergence D_JS is defined as follows:

D_JS(P_T ∥ P_E) = 0.5 × (D_KL(P_T ∥ P_M) + D_KL(P_E ∥ P_M)),    where P_M = 0.5 × (P_T + P_E).

JS-divergence ranges from 0 to 1 and, unlike KL-divergence, does not require that P_T(p) > 0 ⇒ P_E(p) > 0, i.e., that each solution is sampled at least once, which does not always hold in sampling experiments. We compare the D_JS attained by our sampler with that of the ideal sampler, which materializes all itemsets satisfying the constraints, computes their qualities, and uses these to sample directly from the target distribution.

A characteristic experiment in detail. Our experiments show that the results are consistent across the various datasets. Therefore, we first study the results on the vote dataset in detail. Table 4 shows that the theoretical error tolerance parameter κ has no considerable effect on the practical performance of the algorithm, except for runtime, which we evaluate in subsequent experiments. One possible explanation is the high quality of the output of the estimation phase, which thus alleviates the theoretical risks that have to be accounted for in the general case (see below for a numerical characterization).
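The accuracy metric above is straightforward to compute; the following sketch (our own illustrative code, using log base 2 so that the result lies in [0, 1]) takes both distributions as dicts mapping pattern to probability.

```python
# Sketch of the Jensen-Shannon divergence between a target distribution
# P_T and an empirical sampling distribution P_E (illustrative code).
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits; 0*log(0) := 0."""
    return sum(pi * math.log2(pi / q[x]) for x, pi in p.items() if pi > 0)

def js_divergence(p_t, p_e):
    """D_JS = 0.5 * (D_KL(P_T || P_M) + D_KL(P_E || P_M)), with P_M the
    equal mixture. Unlike plain KL, this stays finite even when some
    solution was never sampled (P_E(p) = 0)."""
    support = set(p_t) | set(p_e)
    p_m = {x: 0.5 * (p_t.get(x, 0.0) + p_e.get(x, 0.0)) for x in support}
    return 0.5 * (kl(p_t, p_m) + kl(p_e, p_m))
```

Identical distributions give 0, disjoint supports give 1, and a solution that is never sampled only contributes a finite penalty through the mixture P_M.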
Hence, in the following experiments we use κ = 0.9 unless noted otherwise.

The JS-divergences for the different quality measures and constraint sets are impressively low, equivalent to the highest possible sampling accuracy attainable with the ideal sampler. Figure 2 illustrates this for minfreq(0.09) ∧ closed ∧ minlen(7), ϕ = freq, and κ = 0.9 (D_JS = 0.004): the sampling frequency of an average itemset is close to its target probability. For at least 90% of the patterns, the sampling error does not exceed a factor of 2.

[Figure 2 (vote, minfreq(0.09) ∧ closed ∧ minlen(7), ϕ = freq, κ = 0.9 / ε(κ) = 100.38; D_JS = 0.004): Empirical sampling frequencies of itemsets that share the same target probability, i.e., have the same quality. On average, frequencies are close to the target probabilities. 90% of the frequencies are well within a factor 2 of the target, which is considerably lower than the theoretical factor of 100.38. The dots show the tails of the empirical probability distribution for a given target probability; the lower right box shows theoretical bounds and empirical frequencies on a log scale.]

Table 5 shows that similar conclusions hold for several other datasets. Over all experimental settings, the error of the estimate of the total weight of all solutions, which is used to derive the number of XOR constraints for the sampling phase, never exceeds 10%, whereas the bounds allow for errors of 45 to 80%. This helps explain why the practical errors are considerably lower than the theoretical bounds. In line with theoretical expectations (see Section 5), the splice dataset proves the most challenging due to its large number of items (variables in XOR constraints).
As a result, GFlexics does not generate the requested number of samples within the 24-hour timeout. We study the runtime in the following experiment.

Q2: Comparison with alternative pattern samplers. We compare Flexics to ACFI [7] and TS [11], the alternative samplers described in Section 3, in the settings that they are tailored for. (The code was provided by their respective authors. We also obtained the "unmaintained" code for the uniform LRW sampler (personal communication), but were unable to make it run on our machines. The code for the FCA sampler was not available (personal communication).) ACFI only supports the setting with a single minfreq(θ) constraint and ϕ = uniform. It is run with a burn-in of 100,000 steps and uses a built-in heuristic to determine the number of steps between consecutive samples. TS is evaluated in the settings with ϕ = freq and both constraint sets from the previous experiments. It samples from two of the distributions it supports, freq and freq^4; samples that do not satisfy the constraints are rejected. Both samplers are requested to generate 900,000 samples and are allowed to run for up to 24 hours. Datasets and parameters are identical to the previous experiments. Table 6 shows the accuracy of the samplers.

Table 4: Sampling accuracy of Flexics (here GFlexics) is consistently high across quality measures, constraint sets (minFreq(0.09) vs. minFreq(0.09) ∧ Closed ∧ minLen(7)), and error tolerance κ. The JS-divergence is impressively low, equivalent to that of the ideal sampler.

vote dataset, JS-divergence from target
               Uniform (r̂ = 1)    Purity (r̂ = 2)    Frequency (r̂ = 11)
κ              F      FCL          F      FCL          F      FCL
0.9            0.013  0.004        0.013  0.004        0.013  0.004
0.5            0.013  0.004        0.013  0.004        0.013  0.004
0.1            0.013  0.004        0.013  0.004        0.013  0.004
Ideal sampler  0.013  0.004        0.013  0.004        0.013  0.004
The performance of Flexics is on par with the specialized samplers. That is, in uniform frequent itemset sampling, the accuracy of both Flexics and ACFI is equivalent to that of the ideal sampler and can therefore not be improved. When sampling proportional to frequency, it is equivalent to the accuracy of the exact two-step sampler TS∼freq. However, the latter does not directly take constraints into account, which poses considerable problems on most datasets. For example, on the heart dataset, TS fails to generate a single accepted sample, despite generating 2 billion unconstrained candidates. This issue is not solved by increasing the bias towards more frequent itemsets by sampling proportional to freq^4; furthermore, doing so substantially decreases accuracy, as seen on primary and vote.

Table 7 shows the runtimes for frequent itemset sampling (i.e., only the minfreq constraint). In most settings, EFlexics provides runtime benefits over GFlexics. The splice dataset is the most challenging due to its large number of items; it highlights the importance of an efficient constraint oracle. Accordingly, the specialized sampler ACFI is 6 to 22 milliseconds faster than the faster variant of Flexics in uniform sampling (excluding splice). In frequency-weighted sampling, Flexics is considerably faster in the settings with tighter constraints, where the two-step sampler is slow to generate accepted samples. This illustrates both the overhead and the benefits of the flexibility of the proposed approach. Furthermore, in these settings there are at most 66,000 patterns, which is too low to suggest the need for pattern sampling (recall that the primary goal of these experiments was to evaluate and compare sampling accuracy) and does not allow the overhead to be amortized. We therefore tackle settings with a much larger number of patterns in the following experiments.
Table 5: Dataset statistics, parameter values, and results of the sampling accuracy experiments. Even with a high error tolerance of κ = 0.9, the JS-divergence of Flexics (here GFlexics) is consistently low across datasets, quality measures, and constraint sets. (On the splice dataset, GFlexics generates fewer than 900,000 samples before the timeout; see also Table 7.)

                                                     JS-divergence, κ = 0.9
                                                 Uniform       Purity        Frequency
           |D|   |I|  Density  θ           λ     F     FCL     F     FCL     F     FCL
german     1000  112  34%      0.35 (349)  2     0.012 0.003   0.013 0.003   0.013 0.003
heart      296   95   47%      0.43 (127)  2     0.012 0.003   0.012 0.003   0.012 0.003
hepatitis  137   68   50%      0.39 (53)   5     0.013 0.004   0.014 0.004   0.013 0.004
kr-vs-kp   3196  74   49%      0.69 (2190) 6     0.013 0.005   0.013 0.005   0.013 0.005
primary    336   31   48%      0.09 (30)   7     0.013 0.004   0.013 0.004   0.013 0.004
splice     3190  287  21%      0.04 (122)  3     −     −       −     −       −     −
vote       435   48   33%      0.09 (40)   7     0.013 0.004   0.013 0.004   0.013 0.004

Table 6: The accuracy of Flexics (here GFlexics) is consistent across settings. In uniform frequent itemset sampling, the performance of Flexics as well as of ACFI is equivalent to that of the ideal sampler (not shown). In frequency-weighted sampling, it is comparable to the exact two-step sampler (TS∼freq) with rejection. However, the latter suffers from low acceptance rates, which, for the settings marked with '−', is not improved by increasing the bias (TS∼freq^4). On splice, neither TS nor Flexics generates 900,000 samples before the timeout; see also Table 7.

                       JS-divergence (for TS, acceptance rate)
           Uniform, F   Frequency, F                        Frequency, FCL
           GF    ACFI   GF    TS∼freq      TS∼freq^4        GF    TS∼freq      TS∼freq^4
german     0.01  0.01   0.01  − (9·10⁻⁸)   − (0.02)         0.00  − (5·10⁻⁸)   − (0.06)
heart      0.01  0.01   0.01  − (4·10⁻¹⁰)  − (0)            0.00  − (0)        − (3·10⁻³)
hepatitis  0.01  0.01   0.01  − (2·10⁻⁶)   − (0.01)         0.00  − (1·10⁻⁶)   − (0.01)
kr-vs-kp   0.01  0.01   0.01  − (7·10⁻⁷)   − (0.01)         0.01  − (4·10⁻⁷)   − (4·10⁻³)
primary    0.01  0.01   0.01  0.01 (0.30)  0.40 (0.99)      0.01  0.01 (0.13)  0.27 (0.10)
splice     0.01  −      −     − (0)        − (0)            −     − (0)        − (0)
vote       0.01  0.01   0.01  0.01 (0.13)  0.23 (0.94)      0.00  0.01 (0.05)  0.14 (0.22)

Table 7: Runtime in milliseconds required to sample a frequent itemset, including pre-processing (estimation or burn-in) amortized over 1000 samples. Both variants of Flexics are suitable for anytime exploration, although slower than the specialized samplers. The two-step sampler is the fastest in the task it is tailored for, but fails in the settings with tighter constraints. EFlexics provides runtime benefits compared to GFlexics.

           ϕ = uniform, C = F            ϕ = freq, C = F
           GFlexics  EFlexics  ACFI     GFlexics  EFlexics  TS∼freq
german     110       25        39       133       34        58540
heart      60        45        24       73        44        −
hepatitis  23        33        11       30        45        2632
kr-vs-kp   59        9         6        59        10        8731
primary    10        10        4        27        25        0.10
splice     170360    1376      580      −         1095      −
vote       25        19        8        46        28        0.03

Q3: Scalability. To study the scalability of the proposed sampler, we compare its runtime costs with those required to construct an ideal sampler with lcm (http://research.nii.ac.jp/~uno/codes.htm, ver. 3), an efficient frequent itemset miner [33]. To this end, we estimate the costs of completing the following scenario: pre-processing (estimation or counting), followed by sampling 100 itemsets in two batches of 50. We use non-synthetic datasets from the FIMI repository (http://fimi.ua.ac.be/data/) that have fewer than one million transactions, and select θ such that there are more than one billion frequent itemsets (see Table 8).

A characteristic experiment in detail. We use the accidents dataset (469 items, 340,183 transactions) and θ = 0.009 (3000 transactions), which results in a staggering number of 5.37 billion frequent itemsets. We run WeightGen with values of κ ∈ {0.1, 0.5, 0.9}.
(Note that the estimation phase is identical for all three cases.) The baseline sampler is constructed as follows. lcm is first run in counting mode, which only returns the total number of itemsets. Then, for each batch, 50 random line numbers are drawn, and the corresponding itemsets are printed while lcm is enumerating the solutions; this latter phase is implemented with the standard Unix utility awk. (Storing all itemsets on disk provides no benefits: it increases the mining runtime to 23 minutes and results in a file of 215 GB; simply counting its lines with 'wc -l' takes 25 minutes.)

Figure 3 illustrates the results. The counting mode of lcm is roughly 4.5 minutes faster than the estimation phase of EFlexics. Generating samples from the output of lcm, on the other hand, is considerably slower: it takes approximately 35 s to sample one itemset, whereas EFlexics takes from 10 s to 27 s per sample, depending on the error tolerance κ. As a result, EFlexics samples two batches faster than lcm regardless of its parameter values. Moreover, with κ = 0.9 it samples all 100 itemsets even before the first batch is returned by lcm.

[Figure 3 (accidents, minfreq(0.009), uniform): a) Sampling runtime comparison: EFlexics generates two batches of 50 samples faster than a sampler derived from lcm, regardless of error tolerance (time per sample: κ = 0.9: 10.3 s; κ = 0.5: 18.1 s; κ = 0.1: 26.5 s; lcm: 34.8 s). b) Estimation accuracy: EFlexics with the uniform quality converges to a high-quality estimate of the total number of itemsets (true count: 5.37 billion) within a small number of iterations (three different random seeds shown).]
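The line-picking step of the baseline can be sketched as follows: since the total solution count is known from lcm's counting mode, k positions are drawn uniformly at random (with replacement) and only those lines are kept while the enumerator's output streams by, without materializing it on disk. The function name and the injectable randrange are our own illustrative choices, standing in for the awk one-liner used in the experiments.

```python
# Sketch of the lcm-based baseline: pick pre-drawn random line numbers
# out of a streamed enumeration (illustrative, not the paper's script).
import random

def sample_lines(stream, total, k, randrange=random.randrange):
    """Return the lines of `stream` found at k positions drawn uniformly
    from range(total), in stream order; stops once all are collected."""
    wanted = sorted(randrange(total) for _ in range(k))
    out, w = [], 0
    for idx, line in enumerate(stream):
        # A position may be drawn more than once (with replacement).
        while w < len(wanted) and wanted[w] == idx:
            out.append(line)
            w += 1
        if w == len(wanted):
            break  # no need to enumerate the remaining solutions
    return out
```

Note that this sampler must still enumerate up to the largest drawn position, which is why its per-sample cost grows with the solution count, and why the resulting samples are not exchangeable (they arrive in the miner's search order).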
The practical error of the estimation phase is substantially lower than the theoretical bounds, which indirectly signals high sampling accuracy.

Thus, the proposed sampler outperforms a sampler derived from an efficient itemset miner, even though the experimental setup favors the latter. First, non-uniform weighted sampling would require more advanced computations with itemsets, which would increase the costs of both counting and sampling with lcm. Second, EFlexics could also benefit from the exact count obtained by lcm and start sampling after 1.5 minutes. Third, the individual itemsets sampled from the output of an algorithm based on deterministic search are not exchangeable. Figure 4 illustrates this: due to lcm's search order, certain items only occur at the beginning of batches, while for EFlexics, the order within a batch is random.

The accuracy of Flexics in this scenario can be evaluated indirectly, by comparing the estimate of the total number of itemsets obtained in the estimation phase with the actual number. The error tolerance of the estimation phase is ε_est = 0.8 (see Appendix A for details). Figure 3b demonstrates that, in practice, the error is substantially lower than the theoretical bound. Furthermore, 3 to 9 iterations suffice to obtain an accurate estimate. Similar to the previous experiments, accurate input from the estimation phase alleviates the theoretical risks and is expected to enable accurate sampling.

Table 8 summarizes the results. On three out of four datasets, lcm is faster at counting itemsets, but considerably slower at generating individual samples, which is even more pronounced on connect and pumsb than on accidents. The results are the opposite on the kosarak dataset, which is in line with the theoretical expectations (see Section 5): the large number of items and the sparsity of the dataset sharply increase the costs of XOR constraint propagation.
As a result, enumeration with Eclat within EFlexics becomes considerably slower than with lcm (augmenting lcm to handle XOR constraints might provide a solution, but is challenging from an implementation perspective).

[Figure 4 (accidents, minfreq(0.009), uniform): The probability of observing a given item at a certain position in a batch produced by EFlexics is close to the expected probability of observing this item in a random itemset, which indicates high sampling accuracy. The samples of the lcm-based sampler are not exchangeable, i.e., certain items are under- or oversampled at certain positions in a batch, depending on their position in lcm's search order.]

Table 8: EFlexics generates individual samples considerably faster than lcm, although it is slower at counting. The kosarak dataset poses a significant challenge to EFlexics due to its number of items and sparsity, which complicate the propagation of the XOR constraints.

                                          Itemsets,  Counting, min      Sampling, s
           |D|     |I|    Density  θ      bln.       lcm   EFlexics     lcm    EFlexics
accidents  340183  469    7.21%    0.009  5.37       1.55  6.48         33.77  10.30
connect    67557   130    33.08%   0.178  16.88      0.01  0.38         59.00  0.37
kosarak    990002  41271  0.02%    0.042  10.93      4.87  456.30       73.04  294.89
pumsb      49046   7117   1.04%    0.145  1.11       0.09  1.19         18.14  0.75

Q4: Pattern set sampling. In order to demonstrate the flexibility of our approach and the promised benefits of weighted constrained pattern sampling, i.e., 1) diversity and quality of results, 2) utility of constraints, and 3) the potential for anytime exploration, we here address the problem of sampling non-overlapping 2-tilings as introduced in Section 6. We re-use the implementation of GFlexics from the itemset sampling experiments, only modifying the declarative specification of the CSP.
Likewise, we impose the FCL constraints on the constituent patterns. Table 9 shows parameters and runtimes for sampling 2-tilings proportional to area. The time needed to sample a single 2-tiling is suitable for pattern-based data exploration, where tilings are inspected by a human user, as it exceeds 5 s only on the german dataset.

Table 9: The time required to sample a 2-tiling is approximately 4 s, which is suitable for anytime exploration. The runtime benefits of the sampling procedure are largest for the settings with the largest tiling counts (kr-vs-kp, primary, and vote).

                            Tilt       Tilings,  Enumeration,  Sampling with GFlexics
               θ     λ      bound r̂   mln.      min           Estimation, min  Per sample, s
german-credit  0.22  3      25.4       11.2      8.2           12.6             15.3
heart          0.30  5      13.3       2.2       1.0           3.3              3.9
hepatitis      0.26  5      12.4       7.2       1.9           2.6              3.6
kr-vs-kp       0.31  4      13.1       20.3      18.5          3.5              5.1
primary        0.03  5      50.3       24.9      5.5           4.0              4.5
vote           0.10  5      15.3       170.1     37.0          2.9              4.4

For several settings, the estimation phase runtime slightly exceeds the runtime of enumerating all solutions. However, for the settings with a large number of pattern sets, which are arguably the primary target of pattern samplers, the opposite is true. For example, in the vote experiment with 170 million tilings, the estimation phase runtime only amounts to 8% of the complete enumeration runtime, which demonstrates the benefits of the proposed approach.

The left part of Figure 5 shows six random 2-tilings sampled from the vote dataset. Constraints ensure that the individual tiles comprising each 2-tiling do not overlap, simplifying interpretation. Moreover, the set of tilings is diverse, i.e., the tilings are dissimilar to each other. They cover different regions in the data, revealing alternative structural regularities. The right part of Figure 5 shows the area distribution of all 2-tilings that satisfy the constraints, obtained by complete enumeration.
The qualities of 5 out of 6 tilings fall in the dense region between the 25th and 75th percentiles, indicating high sampling accuracy. This is entirely as expected given the problem statement. In practice, pattern quality measures like area are only an approximation of application-specific pattern interestingness, so diversity of results is a desirable characteristic of a pattern sampler, as long as the quality of the individual patterns is sufficiently high. To sample patterns from the right tail (i.e., with exceptionally high qualities) more frequently, the sampling task could be changed, e.g., either by choosing another sampling distribution or by enforcing constraints on area.

[Figure 5 (vote, minfreq(0.1) ∧ closed ∧ minlen(5)): Left: Six 2-tilings sampled consecutively from the vote dataset, with areas 1314, 966, 941, 878, 799, and 765. The tilings are diverse, i.e., cover different regions in the data, a property essential for pattern-based data exploration. (Note that while the sampled tilings are fair random draws, the images are not random: the tilings were sorted by descending area, and items and transactions were re-arranged so that the cells covered by tilings with larger area are as close to each other as possible.) Right: The area distribution of all non-overlapping 2-tilings (minimum 440, median 828, maximum 1758); the qualities (areas) of the samples, indicated by vertical bars, tend towards the dense region between the 25th and 75th percentiles.]

8 Discussion

The experiments demonstrate that Flexics delivers the promised benefits: 1) it is flexible in that it supports a wide range of pattern constraints and sampling distributions in itemset mining as well as the novel pattern set sampling task;
2) it is anytime in that the time it takes to generate random patterns is suitable for online data exploration, including settings with large datasets or large solution spaces; and 3) by virtue of its high sampling accuracy in all supported settings, the sampled patterns are diverse, i.e., originate from different regions of the solution space. The theoretical guarantees ensure that these empirical observations extend reliably beyond the studied settings. Furthermore, the practical accuracy is substantially higher than the theory guarantees. The results confirm that pattern mining can benefit from the latest advances in AI, particularly in weighted constrained sampling for SAT. In this section, we discuss potential applications, advantages, and limitations of the proposed approach.

The primary application of pattern sampling involves showing sampled patterns directly to the user. In exploratory data analysis, the mining task is often ill-defined, i.e., the quality measure and the constraints reflect the application-specific pattern interestingness only approximately [34]. Owing to its flexibility, Flexics allows experimenting with various task formulations using the same algorithm. Pattern sampling allows obtaining diverse and representative sets of patterns in an anytime manner. These properties are particularly important in interactive mining systems, which aim at returning patterns that are subjectively interesting to the current user. Boley et al. [35] used two-step samplers in such a system, while Dzyuba and van Leeuwen [36] proposed to learn low-tilt subjective quality measures specifically for sampling with Flexics.
Furthermore, the theoretical guarantees enable applications beyond displaying the sampled patterns: Flexics can be plugged into algorithms that use patterns as building blocks for pattern-based models, yielding anytime versions thereof with (ε, δ)-approximation guarantees of their own, derived from Flexics' guarantees. Example approaches include community detection with Eclat [37] and outlier detection with two-step sampling [38]. The authors note that the formulation of the mining task has a strong influence on the results in the respective applications. Flexics allows the algorithm designer to experiment with these choices and thus to obtain variants of these approaches, perhaps with better application performance.

The flexibility also provides algorithmic advantages. In addition to being agnostic of the quality measure ϕ and the constraint set C, Flexics is also agnostic of the underlying solution space and the oracle, as long as 1) solutions can be encoded with binary variables and 2) the oracle supports XOR constraints. Thus, Flexics provides a principled method to convert a pattern enumeration algorithm into a sampling algorithm, which amounts to implementing the mechanism to handle XOR constraints. This allows re-using algorithmic advances in pattern mining for developing pattern samplers, which we accomplished with cp4im and Eclat. Most importantly, Flexics' black-box nature simplifies extensions to new pattern languages. For example, possible extensions of GFlexics cover a variety of the pattern set languages in Guns et al. [25], e.g., conceptual clustering. EFlexics can be extended to sample other binary pattern languages, e.g., association rules [1] or redescriptions [39].
In contrast, MCMC algorithms, like LRW, are based on local neighbourhood enumeration, which is uncommon in traditional pattern mining techniques, and thus require distinctive design and implementation principles for novel problems.

On the other hand, Flexics only supports pattern languages that can be compactly represented with binary variables, such as the itemsets and pattern sets studied in this paper. This essentially limits it to propositional discrete (binary, categorical, or discretized numeric) data. While in principle structured pattern languages, e.g., sequences or graphs, could also be modeled using this framework, the number of variables would rise sharply, which would negatively affect performance. Devising hashing-based sampling algorithms for non-binary domains is an open problem. In particular, sequence mining can be encoded with integer variables [26]; generalized XOR constraints [29] are one possible research direction. Alternatively, as the m4ri library [32] that we base our implementation on is optimized for dense F₂ matrices, certain performance issues may be addressed with Gaussian elimination algorithms optimized for sparse matrices [40].

Another limitation concerns the bounded tilt assumption regarding sampling distributions: many common quality measures, e.g., χ², information gain [41], or weighted relative accuracy [42], have high or even effectively infinite tilts (if ϕ can be arbitrarily close to 0). Such quality measures could be tackled with divide-and-conquer approaches [16, Section 6] or alternative estimation techniques [43]. This requires the capacity to efficiently handle constraints of the form a ≤ ϕ(p) ≤ b, which is possible for a number of quality measures, including the ones listed above.

9 Conclusion

We proposed Flexics, a flexible pattern sampler with theoretical guarantees regarding sampling accuracy.
We leveraged the perspective on pattern mining as a constraint satisfaction problem and developed the first pattern sampling algorithm that builds upon the latest advances in sampling solutions in SAT. Experiments show that Flexics delivers the promised benefits regarding flexibility, efficiency, and sampling accuracy in itemset mining as well as in the novel task of pattern set sampling, and that it is competitive with state-of-the-art alternatives.

Directions for future work include extensions to richer pattern languages and relaxing assumptions regarding sampling distributions (see Section 8 for a discussion). Specializing the sampling procedure towards typical mining scenarios may allow for deriving tighter theoretical bounds and improving the practical performance; examples include specific constraint types (e.g., anti-/monotone), shapes of sampling distributions (e.g., right-peaked distributions, similar to Figure 5), and iterative mining. Following future developments in weighted constrained sampling in AI may provide insights for improving various aspects of Flexics or pattern sampling in general.

Acknowledgements

The authors would like to thank Guy Van den Broeck for useful discussions and Martin Albrecht for the support with the m4ri library. Vladimir Dzyuba is supported by FWO-Vlaanderen.

References

[1] Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. Advances in Knowledge Discovery and Data Mining, chapter Fast Discovery of Association Rules, pages 307–328. 1996.

[2] Charu C Aggarwal and Jiawei Han, editors. Frequent pattern mining. Springer International Publishing, 2014.

[3] Toon Calders, Christophe Rigotti, and Jean-François Boulicaut. A survey on condensed representations for frequent sets.
In Jean-François Boulicaut, Luc De Raedt, and Heikki Mannila, editors, Constraint-Based Mining and Inductive Databases, pages 64–80. Springer Berlin Heidelberg, 2006.

[4] Albrecht Zimmermann and Siegfried Nijssen. Supervised pattern mining and applications to classification. In Charu C. Aggarwal and Jiawei Han, editors, Frequent Pattern Mining, chapter 17, pages 425–442. Springer International Publishing, 2014.

[5] Siegfried Nijssen and Albrecht Zimmermann. Constraint-based pattern mining. In Charu C. Aggarwal and Jiawei Han, editors, Frequent Pattern Mining, chapter 7, pages 147–163. Springer International Publishing, 2014.

[6] Björn Bringmann, Siegfried Nijssen, Nikolaj Tatti, Jilles Vreeken, and Albrecht Zimmermann. Mining sets of patterns. Tutorial at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery (ECML/PKDD '10), 2010.

[7] Mario Boley and Henrik Grosskreutz. Approximating the number of frequent sets in dense data. Knowledge and Information Systems, 21(1):65–89, 2009.

[8] Mohammad Al Hasan and Mohammed J. Zaki. Output space sampling for graph patterns. Proceedings of the VLDB Endowment, 2(1):730–741, August 2009.

[9] Mario Boley, Thomas Gärtner, and Henrik Grosskreutz. Formal concept sampling for counting and threshold-free local pattern mining. In Proceedings of the 10th SIAM International Conference on Data Mining (SDM '10), pages 177–188, 2010.

[10] Mario Boley, Claudio Lucchese, Daniel Paurat, and Thomas Gärtner. Direct local pattern sampling by efficient two-step random procedures. In Proceedings of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '11), pages 582–590, 2011.

[11] Mario Boley, Sandy Moens, and Thomas Gärtner. Linear space direct pattern sampling using coupling from the past.
In Proceedings of the 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '12), pages 69–77, 2012.

[12] Carla P Gomes, Ashish Sabharwal, and Bart Selman. Near-uniform sampling of combinatorial spaces using XOR constraints. In Advances in Neural Information Processing Systems 19, pages 481–488. 2007.

[13] Supratik Chakraborty, Kuldeep S Meel, and Moshe Y Vardi. A scalable and nearly uniform generator of SAT witnesses. In Proceedings of the 25th International Conference on Computer-Aided Verification (CAV '13), pages 608–623, 2013.

[14] Stefano Ermon, Carla P Gomes, Ashish Sabharwal, and Bart Selman. Embed and project: Discrete sampling with universal hashing. In Advances in Neural Information Processing Systems 26, pages 2085–2093. 2013.

[15] Kuldeep Meel, Moshe Vardi, Supratik Chakraborty, Daniel Fremont, Sanjit Seshia, Dror Fried, Alexander Ivrii, and Sharad Malik. Constrained sampling and counting: Universal hashing meets SAT solving. In Proceedings of the Beyond NP AAAI Workshop, 2016.

[16] Supratik Chakraborty, Daniel J Fremont, Kuldeep S Meel, and Moshe Y Vardi. Distribution-aware sampling and weighted model counting for SAT. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI '14), pages 1722–1730, 2014.

[17] Tias Guns, Siegfried Nijssen, and Luc De Raedt. Itemset mining: A constraint programming perspective. Artificial Intelligence, 175(12-13):1951–1983, August 2011.

[18] Mohammed J. Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, and Wei Li. New algorithms for fast discovery of association rules. In Proceedings of the 3rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '97), pages 283–296, 1997.

[19] Jian Pei and Jiawei Han. Can we push more constraints into frequent pattern mining?
In Proceedings of the 6th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '00), pages 350–354, 2000.

[20] Arno Knobbe and Eric Ho. Pattern teams. In Proceedings of the 10th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD '06), pages 577–584, 2006.

[21] Luc De Raedt and Albrecht Zimmermann. Constraint-based pattern set mining. In Proceedings of the 7th SIAM International Conference on Data Mining (SDM '07), pages 237–248, 2007.

[22] Cristian Bucilă, Johannes Gehrke, Daniel Kifer, and Walker White. DualMiner: A dual-pruning algorithm for itemsets with constraints. Data Mining and Knowledge Discovery, 7(3):241–272, 2003.

[23] Francesco Bonchi, Fosca Giannotti, Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Roberto Trasarti. A constraint-based querying system for exploratory pattern discovery. Information Systems, 34(1):3–27, 2009.

[24] Mehdi Khiari, Patrice Boizumault, and Bruno Crémilleux. Constraint programming for mining n-ary patterns. In Proceedings of the 16th International Conference on Principles and Practice of Constraint Programming (CP '10), pages 552–567, 2010.

[25] Tias Guns, Siegfried Nijssen, and Luc De Raedt. k-pattern set mining under constraints. IEEE Transactions on Knowledge and Data Engineering, 25(2):402–418, 2013.

[26] A Kemmar, W Ugarte, S Loudni, T Charnois, Y Lebbah, P Boizumault, and B Crémilleux. Mining relevant sequence patterns with CP-based framework. In Proceedings of the 26th IEEE International Conference on Tools with Artificial Intelligence (ICTAI '14), pages 552–559, 2014.

[27] Sergey Paramonov, Matthijs van Leeuwen, Marc Denecker, and Luc De Raedt. An exercise in declarative modeling for relational query mining. In Proceedings of the 25th International Conference on Inductive Logic Programming (ILP '15), 2015.
[28] Nino Shervashidze, SVN Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten M Borgwardt. Efficient graphlet kernels for large graph comparison. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS '09), pages 488–495, 2009.

[29] Carla P Gomes, Willem-Jan van Hoeve, Ashish Sabharwal, and Bart Selman. Counting CSP solutions using generalized XOR constraints. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence (AAAI '07), pages 204–209, 2007.

[30] Mate Soos. Enhanced Gaussian elimination in DPLL-based SAT solvers. In Proceedings of the Pragmatics of SAT Workshop (POS '10), pages 2–14, 2010.

[31] Floris Geerts, Bart Goethals, and T Mielikäinen. Tiling databases. In Proceedings of the 7th International Conference on Discovery Science (DS '04), pages 278–289, 2004.

[32] Martin Albrecht and Gregory Bard. The M4RI Library. The M4RI Team, 2012.

[33] Takeaki Uno, Masashi Kiyomi, and Hiroki Arimura. LCM ver. 3: Collaboration of array, bitmap and prefix tree for frequent itemset mining. In Proceedings of the 1st International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations (OSDM '05), pages 77–86, 2005.

[34] Deborah R Carvalho, Alex A Freitas, and Nelson Ebecken. Evaluating the correlation between objective rule interestingness measures and real human interest. In Proceedings of the 9th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD '05), pages 453–461, 2005.

[35] Mario Boley, Michael Mampaey, Bo Kang, Pavel Tokmakov, and Stefan Wrobel. One Click Mining – interactive local pattern discovery through implicit preference and performance learning. In Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics (IDEA '13), pages 28–36, 2013.

[36] Vladimir Dzyuba and Matthijs van Leeuwen.
Learning what matters – sampling interesting patterns. In Proceedings of the 21st Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD '17), 2017. In press.

[37] Michele Berlingerio, Fabio Pinelli, and Francesco Calabrese. ABACUS: Frequent pattern mining-based community discovery in multidimensional networks. Data Mining and Knowledge Discovery, 27(3):294–320, 2013.

[38] Arnaud Giacometti and Arnaud Soulet. Anytime algorithm for frequent pattern outlier detection. International Journal of Data Science and Analytics, 2(3):119–130, 2016.

[39] Naren Ramakrishnan, Deept Kumar, Bud Mishra, Malcolm Potts, and Richard Helm. Turning CARTwheels: An alternating algorithm for mining redescriptions. In Proceedings of the 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '04), pages 266–275, 2004.

[40] Charles Bouillaguet and Claire Delaplace. Sparse Gaussian elimination modulo p: An update. In Proceedings of the 18th International Workshop on Computer Algebra in Scientific Computing (CASC '16), pages 101–116, 2016.

[41] Siegfried Nijssen, Tias Guns, and Luc De Raedt. Correlated itemset mining in ROC space: A constraint programming approach. In Proceedings of the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '09), pages 647–655, 2009.

[42] Florian Lemmerich, Martin Becker, and Frank Puppe. Difference-based estimates for generalization-aware subgroup discovery. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery (ECML/PKDD '13), pages 288–303, 2013.

[43] Stefano Ermon, Carla P. Gomes, Ashish Sabharwal, and Bart Selman. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML '13), pages 334–342, 2013.

[44] Supratik Chakraborty, Daniel J.
Fremont, Kuldeep S. Meel, Sanjit A. Seshia, and Moshe Y. Vardi. On parallel scalable uniform SAT witness generation. In Proceedings of the 21st International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS '15), volume 9035, pages 304–319, 2015.

A WeightGen

In this section, we present an extended technical description of the WeightGen algorithm, which closely follows Sections 3 and 4 in [16], whereas the pseudocode in Algorithm 2 is structured similarly to that of UniGen2, a close cousin of WeightGen [44]. Lines 1-3 correspond to the estimation phase and Lines 4-8 correspond to the sampling phase. SolveBounded stands for the bounded enumeration oracle.

The parameters of the estimation phase are fixed to particular theoretically motivated values. pivot_est denotes the maximal weight of a cell at the estimation phase; pivot_est = 46 corresponds to estimation error tolerance ε_est = 0.8 (Line 10). If the total weight of solutions in a given cell exceeds pivot_est, a new random XOR constraint is added in order to eliminate a number of solutions. Repeating the process for a number of iterations increases the confidence of the estimate, e.g., 17 iterations result in 1 − δ_est = 0.8 (Line 1). Note that Estimate essentially estimates the total weight of all solutions, from which N_XOR, the initial number of XOR constraints for the sampling phase, is derived (Line 4).

A similar procedure is employed at the sampling phase. It starts with N_XOR constraints and adds at most three extra constraints. The user-chosen error tolerance parameter κ determines the range [loThresh, hiThresh], within which the total weight of a suitable cell should lie (Line 5). For example, κ = 0.9 corresponds to the range [6.7, 49.4]. If a suitable cell can be obtained, a solution is sampled exactly from all solutions in the cell; otherwise, no sample is returned.
Requiring the total cell weight to exceed a particular value ensures the lower bound on the sampling accuracy. The preceding presentation makes two simplifying assumptions: (1) all weights lie in [1/r̂, 1]; (2) adding XOR constraints never results in unsatisfiable subproblems (empty cells). The former is relaxed by multiplying pivots by ŵ_max = ŵ_min × r̂ < 1, where ŵ_min is the smallest weight observed so far. The latter is solved by simply restarting an iteration with a newly generated set of constraints. See Chakraborty et al. [16] for the full explanation, including the precise formulae to compute all parameters.

Implementation details. Following the suggestions of Chakraborty et al. [44], we implement leapfrogging, a technique that improves the performance of the umbrella sampling procedure and thus benefits both GFlexics and EFlexics. First, after three iterations of the estimation phase, we initialize the following iterations with a number of XOR constraints that is equal to the smallest number returned in the previous iterations (rather than with zero XORs). Second, in the sampling phase, we start with one XOR constraint more than the number suggested by theory. If the cell is too small, we remove one constraint; if it is too large, we proceed to add (at most two) constraints. Both modifications are based on the observation that theoretical parameter values address hypothetical corner cases that rarely occur in practice. Finally, we only run the estimation phase until the initial number of XOR constraints, which only depends on the median of total weight estimates, converges. For example, if the estimation phase is supposed to run for 17 iterations, the convergence can happen as early as after 9 iterations.
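The second leapfrogging modification can be sketched as follows (a minimal illustration with hypothetical names and a toy uniform-weight oracle, not our actual implementation): start with one extra XOR constraint, allow one removal if the cell is too small, and at most two additions if it is too large.

```python
import random

def generate_with_leapfrog(solve_bounded, make_xor, n_xor, lo, hi):
    """Leapfrogging sketch: begin the sampling phase with n_xor + 1
    constraints; remove at most one if the cell is too small, add at
    most two if it is too large, otherwise give up."""
    xors = [make_xor() for _ in range(n_xor + 1)]
    removed = added = 0
    while True:
        cell = solve_bounded(xors, hi)
        size = len(cell)            # uniform weights: cell weight == size
        if lo <= size <= hi:        # suitable cell: sample exactly
            return random.choice(cell)
        if size < lo and removed == 0:
            xors.pop()              # cell too small: remove one XOR
            removed += 1
        elif size > hi and added < 2:
            xors.append(make_xor()) # cell too large: add an XOR
            added += 1
        else:
            return None             # no suitable cell for this round

# Toy oracle over all bit-vectors of length 6, uniform weights.
rng = random.Random(1)
n = 6
space = [tuple((v >> i) & 1 for i in range(n)) for v in range(2 ** n)]

def make_xor():
    return [i for i in range(n) if rng.random() < 0.5], rng.randrange(2)

def solve_bounded(xors, bound):
    sat = [s for s in space
           if all(sum(s[i] for i in sub) % 2 == par for sub, par in xors)]
    return sat[:bound + 1]          # bounded enumeration stops past bound

sample = generate_with_leapfrog(solve_bounded, make_xor, 2, 4, 16)
```

The counters bound the number of oracle calls per round, which is the point of the heuristic: the theoretically safe search over constraint counts is replaced by a short walk around the value that practice suggests is almost always right.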
Algorithm 2 WeightGen [16]
Input: Boolean formula F, weight w, tilt bound r̂, sampling error tolerance parameter κ
Assumes: w(·) ∈ [1/r̂, 1], bounded enumeration algorithm SolveBounded

1:  for 17 iterations do                          ▷ Corresponds to δ_est = 0.2
2:      WeightEstimates.Add ← Estimate(∅)
3:  TotalWeight = Median(WeightEstimates)
4:  N_XOR = O(log₂ TotalWeight / (1 + κ⁻¹))
5:  loThresh ∝ (1 + κ)/κ², hiThresh ∝ (1 + κ)³/κ²
6:  for N_samples times do
7:      InitXORs = {RandomXOR() × N_XOR times}
8:      Generate(κ, [loThresh, hiThresh], InitXORs, 3)

9:  function Estimate(XORs)          ▷ Returns an estimate of the total weight of all solutions
10:     pivot_est = 46                            ▷ Corresponds to ε_est = 0.8
11:     Sols ← SolveBounded(F, XORs, pivot_est)
12:     CellWeight ← Σ_{s ∈ Sols} w(s)
13:     if CellWeight ≤ pivot_est then            ▷ Cell of the "right" size
14:         return CellWeight × 2^|XORs|
15:     else                                      ▷ Shrink cell by adding an XOR constraint
16:         Estimate(XORs ∪ RandomXOR())

17: function Generate(κ, [lT, hT], XORs, i)       ▷ Returns a random solution of F
18:     Sols ← SolveBounded(F, XORs, hT)
19:     CellWeight ← Σ_{s ∈ Sols} w(s)
20:     if CellWeight ∈ [lT, hT] then             ▷ Cell of the "right" size
21:         return SampleExactly(Sols, w)
22:     else if CellWeight > lT ∧ i > 0 then      ▷ Cell is too large
23:         Generate(κ, [lT, hT], XORs ∪ RandomXOR(), i − 1)
24:     else                                      ▷ Cell is too small
25:         return ⊥
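For readers who prefer an executable form, the following Python sketch mirrors the structure of Algorithm 2 on a tiny toy solution space with uniform weights. The helper names, the toy oracle, and the hard-coded thresholds are illustrative only; the real WeightGen operates on CNF formulas through a SAT solver with XOR support.

```python
import random
import statistics

rng = random.Random(42)
N = 8
SPACE = [tuple((v >> i) & 1 for i in range(N)) for v in range(2 ** N)]
W = {s: 1.0 for s in SPACE}     # toy weights: uniform
PIVOT_EST = 46                  # Algorithm 2, Line 10 (eps_est = 0.8)

def random_xor():
    return [i for i in range(N) if rng.random() < 0.5], rng.randrange(2)

def solve_bounded(xors, bound):
    """Bounded enumeration oracle: return at most bound + 1 solutions
    of the cell defined by the XOR constraints."""
    out = []
    for s in SPACE:
        if all(sum(s[i] for i in sub) % 2 == par for sub, par in xors):
            out.append(s)
            if len(out) > bound:
                break
    return out

def estimate(xors=()):
    """Lines 9-16: shrink the cell until its weight fits, then scale up."""
    sols = solve_bounded(xors, PIVOT_EST)
    cell_weight = sum(W[s] for s in sols)
    if cell_weight <= PIVOT_EST:
        return cell_weight * 2 ** len(xors)
    return estimate(tuple(xors) + (random_xor(),))

def generate(lo, hi, xors, budget):
    """Lines 17-25: sample exactly within a 'right-sized' cell."""
    sols = solve_bounded(xors, int(hi))
    cell_weight = sum(W[s] for s in sols)
    if lo <= cell_weight <= hi:
        return rng.choices(sols, weights=[W[s] for s in sols])[0]
    if cell_weight > lo and budget > 0:      # cell too large: add an XOR
        return generate(lo, hi, tuple(xors) + (random_xor(),), budget - 1)
    return None                              # cell too small: no sample

total = statistics.median(estimate() for _ in range(17))
sample = generate(6.7, 49.4, tuple(random_xor() for _ in range(3)), 3)
```

On this uniform toy space each estimate should come out near 2**N, since a cell that fits under the pivot is scaled back up by 2 per constraint; the thresholds [6.7, 49.4] correspond to the κ = 0.9 example from the text.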
