Leveraging Discarded Samples for Tighter Estimation of Multiple-Set Aggregates

Reading time: 6 minutes

📝 Original Info

  • Title: Leveraging Discarded Samples for Tighter Estimation of Multiple-Set Aggregates
  • ArXiv ID: 0903.0625
  • Date: 2009-03-05
  • Authors: Edith Cohen, Haim Kaplan

📝 Abstract

Many datasets, such as market basket data, text or hypertext documents, and sensor observations recorded in different locations or time periods, are modeled as a collection of sets over a ground set of keys. We are interested in basic aggregates such as the weight or selectivity of keys that satisfy some selection predicate defined over keys' attributes and membership in particular sets. This general formulation includes basic aggregates such as the Jaccard coefficient, Hamming distance, and association rules. On massive data sets, exact computation can be inefficient or infeasible. Sketches based on coordinated random samples are classic summaries that support approximate query processing. Queries are resolved by generating a sketch (sample) of the union of the sets used in the predicate from the sketches of these sets, and then applying an estimator to this union-sketch. We derive novel tighter (unbiased) estimators that leverage sampled keys that are present in the union of the applicable sketches but excluded from the union-sketch. We establish analytically that our estimators dominate estimators applied to the union-sketch for *all queries and data sets*. Empirical evaluation on synthetic and real data reveals that on typical applications we can expect a 25% to 4-fold reduction in estimation error.
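The union-sketch pipeline described above can be illustrated with coordinated bottom-k samples. This is a minimal sketch under our own assumptions (function names and the (k-1)/r_k distinct-count estimator are standard for bottom-k sketches, but this is not the paper's tighter estimator):

```python
import hashlib

def rank(key):
    """Shared pseudorandom rank in (0, 1]; using the same hash for every
    set is what makes the per-set samples 'coordinated'."""
    h = hashlib.sha256(str(key).encode()).digest()
    return (int.from_bytes(h[:8], "big") + 1) / 2.0**64

def bottom_k(keys, k):
    """Bottom-k sketch of a set: its k keys of smallest rank."""
    return sorted(keys, key=rank)[:k]

def union_sketch(sketches, k):
    """Because the samples are coordinated, the bottom-k of the merged
    per-set sketches equals the bottom-k sketch of the union itself."""
    return sorted(set().union(*sketches), key=rank)[:k]

def dv_estimate(sketch, k):
    """Classic (k-1)/r_k distinct-count estimator applied to a full
    bottom-k sketch; assumes the underlying union has at least k keys."""
    return (k - 1) / rank(sketch[-1])
```

Keys that appear in some per-set sketch but are excluded from the union-sketch are exactly the "discarded samples" whose information the paper's estimators recover; the simple estimator above ignores them.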

💡 Deep Analysis

Deep Dive into Leveraging Discarded Samples for Tighter Estimation of Multiple-Set Aggregates.


📄 Full Content

We consider datasets modeled as a collection S of (possibly intersecting) sets, defined over a ground set I of (possibly weighted) keys. A classic example is documents as sets over features or terms, according to presence in the document.

Basic aggregates over such data are the weight and selectivity of subpopulations of keys. A query specifies a subpopulation of I by a selection predicate. The weight aggregate is the sum of the weights of the keys that satisfy the predicate. If keys have uniform weights, the weight aggregate is known as a DV (distinct values) count. An example of a weight query is the number of terms that are present in both document A and document B and are at least 5 characters long. Selectivity queries are defined with respect to some (sub)collection of sets: the result is the ratio of the sum of the weights of the keys in the union of these sets for which the predicate holds to the total weight of the union of these sets. An important selectivity aggregate is the Jaccard coefficient of A and B, defined as |A ∩ B|/|A ∪ B|, which measures the similarity between A and B. A common technique to enhance this similarity metric is to assign larger weights to features/terms that are less frequent in the corpus. For weighted keys, the Jaccard coefficient generalizes to w(A ∩ B)/w(A ∪ B) (the ratio of the weight of the intersection to the weight of the union).
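The weighted generalization of the Jaccard coefficient can be stated directly in code (a toy illustration; the sets and weights here are invented):

```python
def weight(keys, w):
    # total weight of a set of keys under weight map w
    return sum(w[x] for x in keys)

def weighted_jaccard(A, B, w):
    # w(A ∩ B) / w(A ∪ B); with unit weights this is |A ∩ B| / |A ∪ B|
    return weight(A & B, w) / weight(A | B, w)
```

Up-weighting rare terms (larger w for infrequent features) shifts the score toward agreement on distinctive features, as the text suggests.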

Basic (approximate) weight aggregates are also used to compute more complex (approximate) aggregates, such as variance [15] of a subpopulation of keys or ratio of the weights of two subpopulations of keys.

The selection predicates that specify subpopulations are defined using conditions on keys’ attributes and on keys’ memberships in the different sets. We distinguish between attribute-based conditions, which are based on properties available through the identifier of the key (length, origin, or frequency of a term; type of feature), and membership-based conditions, which are based on the key’s set memberships. For example, terms common to two documents A, B are specified using the predicate with membership-based conditions “in A and in B”. The predicate “in A and not in B and length ≥ 5” has both attribute-based (length of a term) and membership-based conditions.

We list additional datasets that fall in this framework.

• Sensor nodes recording daily vehicle traffic in different locations in a city: Keys are distinct vehicles (license plate numbers) and sets are location-date pairs (all vehicles observed at that location on that date). Example queries with membership-based conditions: “number of distinct vehicles which operated in Manhattan on election day, 2008” (size of the union of all Manhattan locations on election day); “number of distinct vehicles operated in Tribeca on both Sunday and Monday of election week” (size of the intersection of the unions of Tribeca locations on Sunday and on Monday); “number of vehicles that crossed both the Hudson and the East River on Independence day 2008” (size of the intersection of the union of bridges/tunnels across the Hudson and the union of bridges/tunnels across the East River); etc. Queries can be restricted to particular classes of vehicles (e.g., taxi cabs or heavy trucks) by adding attribute-based conditions. Such queries can be used for planning purposes.

• Market-basket dataset: Keys are goods, each with an associated marketing cost (these are the weights). Each customer (basket) defines a set, which is the set of goods she purchased. An example query is “the total marketing cost of baby products purchased by male customers from Union county.” This predicate has an attribute-based condition (product type) and membership-based conditions (specification of the customer segment as a union of sets).

• “Inverted” market-basket dataset: Keys are baskets (customers) and sets are goods (all baskets containing that particular good). A query that asks “what is the likelihood that a certain item is purchased given that another item is purchased” (this is an “association rule” [1,42]) can be expressed using a predicate with membership-based conditions. If A is the set of customers purchasing, say, beer, and B is the set of customers purchasing diapers, then the selectivity of A ∩ B with respect to B is just the likelihood that a person purchases beer given that she/he purchased diapers. This query can be narrowed down to a particular customer segment (e.g., by zip code or gender) if we add an attribute-based condition to the predicate.

• Hyperlinked documents: Sets and keys are documents, where the set for document A includes all documents with hyperlinks to document A. Documents may be weighted by access data or PageRank. An example query is “the total weight of documents referencing at least 5 out of the 10 documents in Q.” This predicate has membership-based conditions.

• P2P network: Keys are files and sets are all neighborhoods of all peers (sets of files shared by peers in that neighborhood). Example queries are “the weight of

…(Full text truncated)…
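Two of the membership-based predicates above can be made concrete: the association-rule likelihood (the selectivity of A ∩ B with respect to B) and the “in at least t of the sets in Q” condition from the hyperlinked-documents example. A small unit-weight sketch with invented data:

```python
def confidence(A, B):
    # selectivity of A ∩ B with respect to B: P(key in A | key in B)
    return len(A & B) / len(B)

def at_least_t_of(Q, t):
    # predicate: key is a member of at least t of the sets in Q
    return lambda key: sum(key in S for S in Q) >= t
```

With weighted keys, `len` would be replaced by a weight sum, exactly as in the selectivity definition above.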


Reference

This content is AI-processed based on ArXiv data.
