While there has been a lot of work on finding frequent itemsets in transaction data streams, none of these solve the problem of finding similar pairs according to standard similarity measures. This paper is a first attempt at dealing with this, arguably more important, problem. We start out with a negative result that also explains the lack of theoretical upper bounds on the space usage of data mining algorithms for finding frequent itemsets: Any algorithm that (even only approximately and with a chance of error) finds the most frequent k-itemset must use space Ω(min{mb, n^k, (mb/ϕ)^k}) bits, where mb is the number of items in the stream so far, n is the number of distinct items, and ϕ is a support threshold. To achieve any non-trivial space upper bound we must thus abandon a worst-case assumption on the data stream. We work under the model that the transactions come in random order, and show that, surprisingly, not only is small-space similarity mining possible for the most common similarity measures, but the mining accuracy improves with the length of the stream for any fixed support threshold.
Imagine that we have a set of m sets ("transactions"), each a subset of {1, ..., n}, and that we want to find interesting associations among items in these transactions. This problem is often framed in a "market basket" model where we are interested in finding those pairs of items that are frequently bought together. Whether a pattern is really interesting or not is a problem-dependent question, and for this reason various similarity measures other than the number of co-occurrences have been introduced. Some of the most common measures are Jaccard [7], cosine, and all-confidence [17,19]. Besides these measures we are also interested in association rules, which are intimately related to the overlap coefficient similarity measure. See [12, Chapter 5] for background and discussion of similarity measures.
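To make these measures concrete, here is a minimal Python sketch (ours, not taken from the paper) computing each of them for a pair of items a, b from the individual supports and the co-occurrence count; the function and variable names are our own.

import math

# supp_a  = number of transactions containing a
# supp_b  = number of transactions containing b
# supp_ab = number of transactions containing both a and b

def jaccard(supp_a, supp_b, supp_ab):
    # |A ∩ B| / |A ∪ B|
    return supp_ab / (supp_a + supp_b - supp_ab)

def cosine(supp_a, supp_b, supp_ab):
    # |A ∩ B| / sqrt(|A| · |B|)
    return supp_ab / math.sqrt(supp_a * supp_b)

def all_confidence(supp_a, supp_b, supp_ab):
    # |A ∩ B| / max(|A|, |B|)
    return supp_ab / max(supp_a, supp_b)

def overlap(supp_a, supp_b, supp_ab):
    # |A ∩ B| / min(|A|, |B|); equals the larger of the confidences of the
    # two association rules a -> b and b -> a
    return supp_ab / min(supp_a, supp_b)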
We initiate the study of this problem in the streaming model where transactions arrive one by one, and we are allowed limited time per transaction and very small space. The latter constraint implies we cannot hope to store much information regarding pairs that are not similar and, moreover, we cannot store the input. In particular, classical frequent itemset algorithms such as Apriori [1] and FP-growth [13] that work in several passes over the data cannot be used. The survey of Jiang and Gruenwald [14] gives a good overview of the challenges in data stream association mining.
Previous works on transaction data streams have focused on finding frequent itemsets, and can be classified in the following way [22]:
Landmark model: The frequent itemsets are searched for in the whole stream, so that itemsets that appeared in the far past have the same importance as recent ones;
Damped model: This model is also called time-fading. Recent transactions have a higher weight than older ones, so more recent itemsets are considered more interesting than older ones (a small sketch of this weighting is given after this list);
Sliding window: Only the part of the stream falling within a sliding window is considered at any given time. This implies storing information about the transactions within the window, since whenever a transaction falls out of the window span, it has to be removed from the counts of the itemsets.
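As a concrete illustration of the damped model, the following Python sketch (our own, not taken from the cited works) maintains exponentially time-faded pair counts; the decay parameter is chosen purely for illustration.

from itertools import combinations

def faded_pair_counts(transactions, decay=0.99):
    # Time-fading co-occurrence counts: every stored count is multiplied by
    # `decay` before a new transaction is absorbed, so old pairs fade away.
    counts = {}
    for t in transactions:
        for pair in counts:
            counts[pair] *= decay
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] = counts.get(pair, 0.0) + 1.0
    return counts

A practical implementation would additionally prune counts that drop below a small threshold to bound the space usage; the sketch keeps all pairs and is only meant to show the weighting.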
The last two models make the problem of achieving low space usage simpler, since most of the information in the stream has little or no effect on the mining result. The challenge is instead to handle the real-time requirements of data stream settings.
All these approaches look for frequent itemsets and do not try to compute any similarity, relying on the tacit assumption that whatever is frequent is automatically interesting. This assumption is not always true:
Example Suppose we have item 1 appearing in 20% of transactions, item 2 appearing in 20% of transactions, and the pair {1, 2} appearing in 10% of transactions. Suppose moreover that the pair {3, 4} appears in only 5% of transactions and that these transactions are the only ones in which 3 and 4 appear. The set {1, 2} thus has twice the frequency of {3, 4}. But looking at the cosine similarity, which divides the co-occurrence count by the geometric mean of the individual supports, we see that s(1, 2) = 10/20 = 0.5 while s(3, 4) = 5/5 = 1. If we base the idea of similarity only on frequencies, we are likely to miss the pair {3, 4}, which has a much higher similarity than the more frequent pair {1, 2}.
Notice also that {3, 4} has a higher similarity under all the measures we are addressing, so the example shows that frequencies alone do not suffice to infer similarity properties of pairs.
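Assuming, for concreteness, a stream of m = 100 transactions (our choice, purely for illustration), the example can be checked numerically for cosine and Jaccard:

import math

m = 100                     # hypothetical number of transactions
s1 = s2 = 0.20 * m          # items 1 and 2 each appear in 20% of transactions
s12 = 0.10 * m              # the pair {1, 2} appears in 10% of transactions
s3 = s4 = s34 = 0.05 * m    # items 3 and 4 appear only together, in 5%

print(s12 / math.sqrt(s1 * s2))   # cosine(1, 2)  = 10/20 = 0.5
print(s34 / math.sqrt(s3 * s4))   # cosine(3, 4)  = 5/5   = 1.0
print(s12 / (s1 + s2 - s12))      # Jaccard(1, 2) = 10/30 ≈ 0.33
print(s34 / (s3 + s4 - s34))      # Jaccard(3, 4) = 5/5   = 1.0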
In this paper we address the problem of finding similar pairs in a stream of transactions. We first show a negative result, which is that a worst-case stream does not allow solutions with non-trivial space usage: To approximate even the simplest similarity measure one essentially needs space that would be sufficient to store either the number of occurrences of all pairs or the contents of the stream itself. Imposing a minimum support ϕ for the items we are interested in alleviates the problem only when ϕ is close to the number of transactions.
Theorem 1 Given a constant k > 0, and integers m, n, ϕ, consider inputs of m transactions of total size mk with n distinct items. Let s_max denote the highest support among k-itemsets where each item has support ϕ or more. Any algorithm that makes a single pass over the transactions and estimates s_max within a factor α < 2 with error probability δ < 1/2 must use space Ω(min(m, n^k, (m/ϕ)^k)) bits in expectation on a worst-case input distribution.
This lower bound extends and strengthens a lower bound for single-item streams presented in [8].
Of course, many data streams may not exhibit worst-case behavior. Several papers have considered models of data streams where the items are supposed to be independently chosen from some distribution, or presented in random order [5,9,11,21]. We present an upper bound that works for a worst-case set of transactions under the condition that it is presented in random order,