Probabilistic Search for Structured Data via Probabilistic Programming and Nonparametric Bayes


Authors: Feras Saad, Leonardo Casarsa, Vikash Mansinghka

Probabilistic Computing Project, Massachusetts Institute of Technology

Abstract

Databases are widespread, yet extracting relevant data can be difficult. Without substantial domain knowledge, multivariate search queries often return sparse or uninformative results. This paper introduces an approach for searching structured data based on probabilistic programming and nonparametric Bayes. Users specify queries in a probabilistic language that combines standard SQL database search operators with an information-theoretic ranking function called predictive relevance. Predictive relevance can be calculated by a fast sparse matrix algorithm based on posterior samples from CrossCat, a nonparametric Bayesian model for high-dimensional, heterogeneously-typed data tables. The result is a flexible search technique that applies to a broad class of information retrieval problems, which we integrate into BayesDB, a probabilistic programming platform for probabilistic data analysis. This paper demonstrates applications to databases of US colleges, global macroeconomic indicators of public health, and classic cars. We found that human evaluators often prefer the results from probabilistic search to results from a standard baseline.

1 Introduction

We are surrounded by multivariate data, yet it is difficult to search. Consider the problem of finding a university with a city campus, low student debt, high investment in student instruction, and tuition fees within a certain budget. The US College Scorecard dataset (Council of Economic Advisers, 2015) contains these variables plus hundreds of others. However, choosing thresholds for the quantitative variables (debt, investment, tuition, etc.) requires domain knowledge. Furthermore, results grow sparse as more constraints are added.
Figure 1a shows results from an SQL SELECT query with plausible thresholds for this question that yields only a single match.

This paper shows how to formulate a broad class of probabilistic search queries on structured data using probabilistic programming and information theory. The core technical idea combines SQL search operators with a ranking function called predictive relevance, which assesses the relevance of database records to some set of query records, in a context defined by a variable of interest. Figures 1b and 1c show two examples, expanding and then refining the result from Figure 1a by combining predictive relevance with SQL. Predictive relevance is the probability that a candidate record is informative about the answers to a specific class of predictive queries about unknown fields in the query records.

The paper presents an efficient implementation applying a simple sparse matrix algorithm to the results of inference in CrossCat (Mansinghka et al., 2016). The result is a scalable, domain-general search technique for sparse, multivariate, structured data that combines the strengths of SQL search with probabilistic approaches to information retrieval. Users can query by example, using real records in the database if they are familiar with the domain, or partially-specified hypothetical records if they are less familiar. Users can then narrow search results by adding Boolean filters, and by including multiple records in the query set rather than a single record. An overview of the technique and its integration into BayesDB (Mansinghka et al., 2015) is shown in Figure 3.

We demonstrate the proposed technique with databases of (i) US colleges, (ii) public health and macroeconomic indicators, and (iii) cars from the late 1980s. The paper empirically confirms the scalability of the technique and shows that human evaluators often prefer results from the proposed technique to results from a standard baseline.
%bql SELECT "institute", "median_sat_math", "admit_rate", "tuition",
...   "median_student_debt", "instructional_invest", "locale"
... FROM college_scorecard
... WHERE "locale" LIKE '%City%'
...   AND "tuition" < 50000
...   AND "median_student_debt" < 10000
...   AND "instructional_invest" > 50000
... LIMIT 10

(a) Standard SQL. Using a SQL WHERE clause to search for a university with a city campus, low student debt (at most $10K), high investment in student instruction (at least $50K), and a tuition within their budget (at most $50K). Due to sparsity in the dataset for the chosen thresholds, the Boolean conditions in the clause have only a single matching result, shown in the table below. The user needs to iteratively adjust the thresholds in order to obtain more results which match the search query.

institute        admit  sat  tuition  debt   investment  locale
Duke University  11%    745  47,243   7,500  50,756      Midsize City

%bql SELECT "institute", "admit_rate", "median_sat_math", "tuition",
...   "median_student_debt", "instructional_invest", "locale"
... FROM college_scorecard
... ORDER BY
...   RELEVANCE PROBABILITY
...   TO HYPOTHETICAL ROW ((
...     "locale" = 'Midsize City',
...     "tuition" = 50000,
...     "median_student_debt" = 10000,
...     "instructional_invest" = 50000
...   ))
...   IN THE CONTEXT OF "instructional_invest"
...   DESC
... LIMIT 10

(b) Relevance to hypothetical record. If the search query is instead specified as a hypothetical record in a BQL RELEVANCE PROBABILITY query, then ORDER BY can give the top-10 ranked matches. The results are all top-tier schools with high teaching investment, a city or large suburban campus, and low student debt. However, the user is surprised by the highly stringent admission rates at these colleges, which are mostly below 10%.
institute              admit  sat  tuition  debt    investment  locale
Duke University        11%    745  47,243    7,500   50,756     Midsize City
Princeton University    8%    755  41,820    7,500   52,224     Large Suburb
Harvard University      6%    755  43,938    6,500   49,500     Midsize City
Univ of Chicago         8%    758  49,380   12,500   83,779     Large City
Mass Inst Technology    8%    770  45,016   14,990   62,770     Midsize City
Calif Inst Technology   8%    785  43,362   11,812   92,590     Midsize City
Stanford University     5%    745  45,195   12,782   93,146     Large Suburb
Yale University         6%    750  45,800   13,774  107,982     Midsize City
Columbia University     7%    745  51,008   23,000   80,944     Large City
University of Penn.    10%    735  47,668   21,500   49,018     Large City

%bql SELECT "institute", "admit_rate", "median_sat_math", "tuition",
...   "median_student_debt", "instructional_invest", "locale"
... FROM college_scorecard
... WHERE "admit_rate" > 0.10
...   AND "locale" LIKE '%City%'
... ORDER BY
...   RELEVANCE PROBABILITY
...   TO EXISTING ROWS IN (
...     'Duke University',
...     'Harvard University',
...     'Mass Inst Technology',
...     'Yale University'
...   )
...   IN THE CONTEXT OF "instructional_invest"
...   DESC
... LIMIT 10

(c) Relevance to observed records combined with SQL. Combining BQL and SQL to search for colleges which are most relevant to the schools from (b) in the context of "instructional investment", but that must have (i) less stringent admissions (at least 10%) and (ii) city campuses only. The quantitative search metrics of interest for the colleges in the result set are all significantly better than the national average, but they are mostly below the more selective schools in (b).

institute            admit  sat  tuition  debt    investment  locale
Duke University      11%    745  47,243    7,500  50,756      Midsize City
Georgetown Univ      17%    710  46,744   17,000  31,102      Midsize City
Johns Hopkins Univ   16%    730  47,060   16,250  77,339      Midsize City
Vanderbilt Univ      13%    760  43,838   13,000  79,372      Large City
University of Penn.  10%    735  47,668   21,500  49,018      Large City
Carnegie Mellon      24%    750  49,022   25,250  31,807      Midsize City
Rice University      15%    750  40,566    9,642  40,056      Midsize City
Univ Southern Calif  18%    710  48,280   21,500  43,170      Midsize City
Cooper Union         15%    710  41,400   18,250  21,635      Large City
New York University  35%    685  46,170   23,300  30,237      Large City

[Boxplots comparing Admit Rate, SAT Math, Tuition, Student Debt, and Investment for nationwide colleges versus the result sets of (b) and (c).]

Figure 1: Combining predictive relevance probability in the Bayesian Query Language (BQL) with standard techniques in SQL to search the US College Scorecard dataset. The full data contains over 7000 colleges and 1700 variables, and is available for download at collegescorecard.ed.gov/data.

2 Establishing an information theoretic definition of context-specific predictive relevance

In this section, we outline the basic setup and notation for the database search problem, and establish a formal definition of the probability of "predictive relevance" between records in the database.

2.1 Finding predictively relevant records

Suppose we are given a sparse dataset $\mathcal{D} = \{x_1, x_2, \ldots, x_N\}$ containing $N$ records, where each $x_r = (x_{[r,1]}, \ldots, x_{[r,p]})$ is an instantiation of a $p$-dimensional random vector, possibly with missing values. For notational convenience, we refer to arbitrary collections of observations using sets as indices, so that $x_{[R,C]} \equiv \{x_{[r,c]} : r \in R, c \in C\}$. Bold-face symbols denote multivariate entities, and variables are capitalized as $X_{[r,c]}$ when they are unobserved (i.e. random). Let $\mathcal{Q} \subset [N]$ index a small collection of "query records" $x_{\mathcal{Q}} = \{x_q : q \in \mathcal{Q}\}$.
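As a concrete illustration of this indexing notation (a sketch with made-up values, not data from the paper), a sparse dataset can be stored as a NumPy array with NaN marking missing entries, and $x_{[R,C]}$ recovered by indexing with an index grid:

```python
import numpy as np

# A sparse dataset of N = 4 records in p = 3 dimensions; np.nan marks a
# missing value. The values are illustrative only.
X = np.array([
    [19.0,   np.nan, 1.0],
    [145.0,  1.3,    np.nan],
    [np.nan, 110.0,  2.9],
    [31.0,   197.0,  2.9],
])

# x_[R,C]: the observations at rows R and columns C, via an index grid.
R, C = [0, 3], [0, 2]
x_RC = X[np.ix_(R, C)]
```

Here `np.ix_` builds the Cartesian product of the row and column index sets, matching the set-valued subscripts used in the text.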
Our objective is to rank each item $x_i \in \mathcal{D}$ by how relevant it is for formulating predictions about values of $x_{\mathcal{Q}}$, "in the context" of a particular dimension $c$. We formally define the context of $c$ as a subset of dimensions $V \subseteq [p]$ such that for an arbitrary record $r^*$ and each $v \in V$, the random variable $X_{[r^*,v]}$ is statistically dependent with $X_{[r^*,c]}$.¹ In other words, we are searching for records $i$ where knowledge of $x_{[i,V]}$ is useful for predicting $x_{[\mathcal{Q},V]}$, had we not known the values of these observations.

2.2 Defining context-specific predictive relevance using mutual information

We now formalize the intuition from the previous section more precisely. Let $\mathcal{R}_c(\mathcal{Q}, r)$ denote the probability that $r$ is predictively relevant to $\mathcal{Q}$, in the context of $c$. Furthermore, let $c^*$ denote the index of a new dimension in the length-$p$ random vectors, which is statistically dependent on dimension $c$ (i.e. is in its context) but is not one of the $p$ existing variables in the database. Since $c^*$ indexes a novel variable, its value for each row $r$ is itself a random variable, which we denote $X_{[r,c^*]}$. We now define the probability that $r$ is predictively relevant to $\mathcal{Q}$ in the context of $c$ as the posterior probability that the mutual information of $X_{[r,c^*]}$ and each query record $X_{[q,c^*]}$ is non-zero:

$$\mathcal{R}_c(\mathcal{Q}, r) = P\left[\,\bigcap_{q \in \mathcal{Q}} \left\{ I(X_{[q,c^*]} : X_{[r,c^*]}) > 0 \right\} \;\middle|\; \lambda_{c^*}, \alpha, \mathcal{D} \right]. \quad (1)$$

¹ A general definition of statistical dependence is having non-zero mutual information with the context variable. However, the method for detecting dependence to find variables in the context can be arbitrary, e.g., using linear statistics such as Pearson-R, directly estimating mutual information, or others.
The symbol $\lambda_{c^*}$ refers to an arbitrary set of hyperparameters which govern the distribution of dimension $c^*$, and $\alpha$ is a context-specific hyperparameter which controls the prior on structural dependencies between the random variables $\{X_{[r,c^*]} : r \in [N]\}$. Moreover, the mutual information $I$, a well-established measure of the strength of predictive relationships between random variables (Cover and Thomas, 2012), is defined in the usual way:

$$I(X_{[q,c^*]} : X_{[r,c^*]} \mid \lambda_{c^*}, \alpha, \mathcal{D}) = \mathbb{E}\left[ \log \frac{p(X_{[q,c^*]}, X_{[r,c^*]} \mid \lambda_{c^*}, \alpha, \mathcal{D})}{p(X_{[q,c^*]} \mid \lambda_{c^*}, \alpha, \mathcal{D})\, p(X_{[r,c^*]} \mid \lambda_{c^*}, \alpha, \mathcal{D})} \right]. \quad (2)$$

Figure 2 illustrates the predictive relevance probability in terms of a hypothesis test on two competing graphical models, where the mutual information is non-zero in panel (a), indicating predictive relevance, and zero in panel (b), indicating predictive irrelevance.

2.3 Related Work

Our formulation of predictive relevance in terms of mutual information between new variables $X_{[r,c^*]}$ is related to the idea of "property induction" from the cognitive science literature (Rips, 1975; Osherson et al., 1990; Shafto et al., 2008), where subjects are asked to predict whether an entity has a property, given that some other entity has that property; e.g. how likely are cats to have some new disease, given that mice are known to have the disease?

It is also informative to consider the relationship between the predictive relevance $\mathcal{R}_c(\mathcal{Q}, r)$ in Eq (1) and the Bayesian Sets ranking function from the statistical modeling literature (Ghahramani and Heller, 2005):

$$\mathrm{score}_{\text{Bayes-Sets}}(\mathcal{Q}, r) = \frac{p(x_r \mid x_{\mathcal{Q}})}{p(x_r)}. \quad (3)$$

Bayes Sets defines a Bayes factor, or ratio of marginal likelihoods, which is used for hypothesis testing without assuming a structure prior.
On the other hand, predictive relevance defines a posterior probability, whose value is between 0 and 1, and therefore requires a prior over dependence structure between records (our approach, outlined in Section 3, is based on nonparametric Bayes). While Bayes Sets draws inferences using only the query and candidate rows without considering the rest of the data, predictive relevance probabilities are necessarily conditioned on $\mathcal{D}$ as in Eq (1). Finally, Bayes Sets considers the entire data vectors for scoring, whereas predictive relevance considers only dimensions which are in the context of a variable $c$, making it possible for two records to be predictively relevant in some context but probably predictively irrelevant in another.

[Graphical models: (a) the same generative process for $x_{\mathcal{Q}}$ and $x_r$; (b) different generative processes for $x_{\mathcal{Q}}$ and $x_r$.]

Figure 2: The predictive relevance of a collection of query records $\mathcal{Q}$ to a candidate record $r$, in the context of variable $c$, computes the probability that $x_{[\mathcal{Q},c]}$ and $x_{[r,c]}$ are drawn from (a) the same generative process, versus (b) different generative processes. The latent variables $z_0$ and $z_1$ are indicators for the generative process of the records; and $\theta_c^0$ (resp. $\theta_c^1$) are distributional parameters of data under model $z_0$ (resp. $z_1$) for variable $c$. Hyperparameter $\alpha$ dictates the prior on $z$, and $\lambda$ dictates the prior on distributional parameters $\theta$. The symbol $c^*$ denotes a new dimension which is statistically dependent on $c$, and for which no values are observed for either $\mathcal{Q}$ or $r$. Conditioned on hyperparameters, knowing $X_{[r,c^*]}$ in (a) carries information about the unknown values $X_{[\mathcal{Q},c^*]}$, whereas in (b) it does not.
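For comparison, the Bayes Sets score of Eq (3) has a closed form in conjugate settings; the sketch below assumes independent Beta-Bernoulli models per binary dimension, one of the settings treated by Ghahramani and Heller (2005). It is an illustrative toy, not the paper's code, and the function name is hypothetical.

```python
import numpy as np

def bayes_sets_score(x_r, X_Q, alpha=1.0, beta=1.0):
    """Bayes Sets score p(x_r | x_Q) / p(x_r) for binary features under
    an independent Beta(alpha, beta)-Bernoulli model per dimension."""
    X_Q = np.atleast_2d(X_Q)
    N = X_Q.shape[0]
    ones = X_Q.sum(axis=0)
    a_post = alpha + ones              # posterior Beta parameters given
    b_post = beta + N - ones           # the query set, per dimension
    # Predictive Bernoulli probabilities, prior vs. posterior.
    p_prior = alpha / (alpha + beta)
    p_post = a_post / (a_post + b_post)
    x_r = np.asarray(x_r)
    lik_post = np.where(x_r == 1, p_post, 1 - p_post)
    lik_prior = np.where(x_r == 1, p_prior, 1 - p_prior)
    return float(np.prod(lik_post / lik_prior))

# Query set: three binary records in which feature 0 is always on.
X_Q = np.array([[1, 1], [1, 1], [1, 0]])
```

A candidate resembling the query set (e.g. `[1, 1]`) scores above 1, while a dissimilar candidate (e.g. `[0, 0]`) scores below 1, matching the Bayes-factor reading of Eq (3).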
3 Computing the probability of predictive relevance using nonparametric Bayes

This section describes the cross-categorization prior (CrossCat; Mansinghka et al., 2016) and outlines algorithms which use CrossCat to efficiently estimate the predictive relevance probabilities of Eq (1) for sparse, high-dimensional, and heterogeneously-typed data tables.

CrossCat is a nonparametric Bayesian model which learns the full joint distribution of $p$ variables using structure learning and divide-and-conquer. The generative model begins by partitioning the set of $p$ variables into blocks using a Chinese restaurant process. This step is CrossCat's "outer" clustering, since it partitions the columns of a data table, where variables correspond to columns and records correspond to rows. Let $\pi$ denote the partition of $[p]$ whose $k$-th block is $V_k \subseteq [p]$: for $j \neq k$, all variables in $V_k$ are mutually (marginally and conditionally) independent of all variables in $V_j$. Within block $k$, the variables $x_{[r,V_k]}$ follow a Dirichlet process mixture model (Escobar and West, 1995), where we focus on the case where the joint distribution factorizes given the latent cluster assignment $z_r^k$. This step is an "inner" clustering in CrossCat, since it specifies a cluster assignment for each row in block $k$. CrossCat's combinatorial structure requires detailed notation to track the latent variables and dependencies between them. The generative process for an exchangeable sequence $(X_1, \ldots, X_N)$ of $N$ random vectors is summarized below.
Table 1: Symbols used to describe the CrossCat prior

Symbol       Description
$\alpha_0$   Concentration hyperparameter of column CRP
$\alpha_1$   Concentration hyperparameter of row CRP
$v_c$        Index of variable $c$ in column partition
$V_k$        List of variables in block $k$ of column partition
$z_r^k$      Cluster index of $r$ in row partition of block $k$
$C_y^k$      List of rows in cluster $y$ of block $k$
$M_c$        Joint distribution of data for variable $c$
$\lambda_c$  Hyperparameters of $M_c$
$X_{[r,c]}$  $r$-th observation of variable $c$
SET($l$)     Unique items in list $l$

CROSSCAT PRIOR

1. Sample column partition into blocks.
   $v = (v_1, \ldots, v_p) \sim \mathrm{CRP}(\cdot \mid \alpha_0)$
   $V_k \leftarrow \{c \in [p] : v_c = k\}$  for each $k \in \mathrm{SET}(v)$

2. Sample row partitions within each block.
   $z^k = (z_1^k, \ldots, z_N^k) \sim \mathrm{CRP}(\cdot \mid \alpha_1)$  for each $k \in \mathrm{SET}(v)$
   $C_y^k \leftarrow \{r \in [N] : z_r^k = y\}$  for each $k \in \mathrm{SET}(v)$, $y \in \mathrm{SET}(z^k)$

3. Sample data jointly within each row cluster.
   $\{X_{[r,c]} : r \in C_y^k\} \sim M_c(\cdot \mid \lambda_c)$  for each $k \in \mathrm{SET}(v)$, $y \in \mathrm{SET}(z^k)$, $c \in V_k$

[Figure 3 panels: a sparse tabular database of countries (variables: oil, hdi, snow, government); BayesDB modeling yields posterior CrossCat structures $\hat\phi_1, \hat\phi_2, \hat\phi_3$; a BQL predictive relevance query

%bql SELECT "country", "oil", "hdi"
... FROM population
... WHERE "government" IS NOT 'monarchy'
... ORDER BY
...   RELEVANCE PROBABILITY
...   TO HYPOTHETICAL ROW WITH VALUES
...   (("oil" = 27, "snow" = 0.2, "hdi" = 180))
...   IN THE CONTEXT OF "hdi"

and its ranked query results.]
[Figure 3 panels, continued: CROSSCAT-INCORPORATE-RECORD (Algorithm 3) and CROSSCAT-PREDICTIVE-RELEVANCE (Algorithm 1) run in the BayesDB query engine, producing per-sample relevance probabilities ($\hat\phi_1, \hat\phi_2, \hat\phi_3$ and their average per country) that are then sorted with SQL.]

Figure 3: BayesDB workflow for computing context-specific predictive relevance between database records. Modeling and inference in BayesDB produces an ensemble of posterior CrossCat model structures. Each structure specifies (i) a column partition for the factorization of the joint distribution of all variables in the database, using a Chinese restaurant process; and (ii) a separate row partition within each block of variables, using a Dirichlet process mixture. The column partition clusters variables into different "contexts", where all variables in a context are probably dependent on one another. Within each context, the row partition clusters records which are probably informative of one another. End-user queries for predictive relevance are expressed in the Bayesian Query Language. The BQL interpreter aggregates relevance probabilities across the ensemble, and can use them as a ranking function in a probabilistic ORDER BY query.

The representation of CrossCat in this paper assumes that data within a cluster is sampled jointly (step 3), marginalizing over cluster-specific distributional parameters:

$$M_c(x_{[C_y^k, c]} \mid \lambda_c) = \int_\theta \prod_{r \in C_y^k} p(x_{[r,c]} \mid \theta)\, p(\theta \mid \lambda_c)\, d\theta.$$

This assumption suffices for our development of predictive relevance, and is applicable to a broad class of statistical data types (Saad and Mansinghka, 2016) with conjugate prior-likelihood representations such as Beta-Bernoulli for binary data, Dirichlet-Multinomial for categorical data, Normal-Inverse-Gamma-Normal for real values, and Gamma-Poisson for counts.

Given dataset $\mathcal{D}$, we refer to Obermeyer et al.
(2014) and Mansinghka et al. (2016) for scalable algorithms for posterior inference in CrossCat, and assume we have access to an ensemble of $H$ posterior samples $\{\hat\phi_1, \ldots, \hat\phi_H\}$ where each $\hat\phi_h$ is a realization of all variables in Table 1.

3.1 Estimating predictive relevance using CrossCat

We now describe how to use posterior samples of CrossCat to efficiently estimate the predictive relevance probability $\mathcal{R}_c(\mathcal{Q}, r)$ from Eq (1). Letting $c$ denote the context variable, we formalize the novel variable $c^*$ as a fresh column in the tabular population which is assigned to the same block $k$ as $c$ (i.e. $k = v_c = v_{c^*}$). As shown by Saad and Mansinghka (2017), structural dependencies induced by CrossCat's variable partition are related to an upper bound on the probability that there exists a statistical dependence between $c$ and $c^*$. To estimate Eq (1), we first treat the mutual information between $X_{[q,c^*]}$ and $X_{[r,c^*]}$ as a derived random variable, which is a function of their random cluster assignments $z_q^k$ and $z_r^k$:

$$(z_q^k, z_r^k) \mapsto I(X_{[q,c^*]} : X_{[r,c^*]} \mid z_q^k, z_r^k, \alpha_1, \lambda_{c^*}). \quad (4)$$

The key insight, implied by step 3 of the CrossCat prior, is that, conditioned on their assignments, rows from different clusters are sampled independently, which gives

$$z_q^k \neq z_r^k \iff p(x_{[q,c^*]}, x_{[r,c^*]} \mid z_q^k, z_r^k, \lambda_{c^*}, \alpha_1, \mathcal{D}) = p(x_{[q,c^*]} \mid z_q^k, \lambda_{c^*}, \alpha_1, \mathcal{D})\, p(x_{[r,c^*]} \mid z_r^k, \lambda_{c^*}, \alpha_1, \mathcal{D}) \iff I(X_{[q,c^*]} : X_{[r,c^*]} \mid z_q^k, z_r^k, \alpha_1, \lambda_{c^*}) = 0, \quad (5)$$

where the final implication follows directly from the definition of mutual information in Eq (2). Note that Eq (5) does not depend on the particular choice of $\lambda_{c^*}$, and indeed this hyperparameter is never represented explicitly. Moreover, hyperparameter $\alpha_1$ (corresponding to $\alpha$ in Figure 2) is the concentration of the Dirichlet process for CrossCat row partitions.
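For intuition about the latent structure these cluster assignments live in, the two-level partition of the CrossCat prior (steps 1 and 2 above) can be sketched with a simple Chinese restaurant process sampler. This is an illustrative toy, not the paper's inference code, and the function names are hypothetical; step 3 (sampling the data) is omitted.

```python
import numpy as np

def sample_crp(n, alpha, rng):
    """Chinese restaurant process: partition n items with concentration alpha."""
    assignments = [0]
    counts = [1]                             # customers per table
    for _ in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        z = int(rng.choice(len(probs), p=probs))
        if z == len(counts):
            counts.append(1)                 # open a new table (cluster)
        else:
            counts[z] += 1
        assignments.append(z)
    return assignments

def sample_crosscat_structure(n_rows, n_cols, alpha0, alpha1, rng):
    """Steps 1-2 of the CrossCat prior: a column partition, then an
    independent row partition inside each block."""
    v = sample_crp(n_cols, alpha0, rng)                       # column partition
    z = {k: sample_crp(n_rows, alpha1, rng) for k in set(v)}  # row partitions
    return v, z

rng = np.random.default_rng(0)
v, z = sample_crosscat_structure(n_rows=50, n_cols=8, alpha0=1.0, alpha1=1.0, rng=rng)
```

Each block of the column partition carries its own row partition, which is exactly the structure that the relevance estimator below reads cluster assignments from.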
Eq (5) implies that we can estimate the probability of non-zero mutual information between $X_{[r,c^*]}$ and each $X_{[q,c^*]}$ for $q \in \mathcal{Q}$ by forming a Monte Carlo estimate from the ensemble of posterior CrossCat samples:

$$\mathcal{R}_c(\mathcal{Q}, r) = P\left[\,\bigcap_{q \in \mathcal{Q}} \left\{ I(X_{[q,c^*]} : X_{[r,c^*]}) > 0 \right\} \;\middle|\; \lambda_{c^*}, \alpha_1, \mathcal{D} \right] = P\left[\,\bigcap_{q \in \mathcal{Q}} \left\{ z_q^{v_c} = z_r^{v_c} \right\} \;\middle|\; \alpha_1, \mathcal{D} \right] \approx \frac{1}{H} \sum_{h=1}^{H} \mathbb{I}\left[\,\bigcap_{q \in \mathcal{Q}} \left\{ \hat z_q^{\hat v_c^h, h} = \hat z_r^{\hat v_c^h, h} \right\}\right], \quad (6)$$

where $\hat v_c^h$ indexes the context block, and $\hat z_r^{\hat v_c^h, h}$ denotes the cluster assignment of $r$ in the row partition of $\hat v_c^h$, according to the sample $\hat\phi_h$. Algorithm 1 outlines a procedure (used by the BayesDB query engine from Figure 3) for formulating a Monte Carlo estimator for a predictive relevance query using CrossCat.

Algorithm 1 CROSSCAT-PREDICTIVE-RELEVANCE
Require: CrossCat samples $\hat\phi_h$ for $h = 1, \ldots, H$; query rows $\mathcal{Q} = \{q_i : 1 \le i \le |\mathcal{Q}|\}$; context variable $c$
Ensure: predictive relevance of each existing row in $\mathcal{D}$ to $\mathcal{Q}$
 1: for $r = 1, \ldots, N$ do                              ▷ for each existing row
 2:   for $h = 1, \ldots, H$ do                            ▷ for each CrossCat sample
 3:     $k \leftarrow \hat v_c^h$                          ▷ retrieve the context block
 4:     for $q \in \mathcal{Q}$ do                         ▷ for each query row
 5:       if $\hat z_q^{k,h} \neq \hat z_r^{k,h}$ then     ▷ r and q are in different clusters
 6:         $\mathcal{R}_c^h(\mathcal{Q}, r) \leftarrow 0$ ▷ r irrelevant to some q
 7:         break
 8:       else                                             ▷ r in same cluster as all q ∈ Q
 9:         $\mathcal{R}_c^h(\mathcal{Q}, r) \leftarrow 1$ ▷ r relevant to all q
10:   $\mathcal{R}_c(\mathcal{Q}, r) \leftarrow \frac{1}{H}\sum_{h=1}^{H} \mathcal{R}_c^h(\mathcal{Q}, r)$  ▷ average relevances
11: return $\{\mathcal{R}_c(\mathcal{Q}, r) : 1 \le r \le N\}$

3.2 Optimizing the estimator using a sparse matrix-vector multiplication

In this section, we show how to greatly optimize the naive, nested for-loop implementation in Algorithm 1 by instead computing the predictive relevance for all $r$ through a single matrix-vector multiplication.
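A direct transcription of Algorithm 1 into Python might look as follows; the dictionary representation of a posterior sample is an assumption of this sketch, not the paper's actual data structures.

```python
import numpy as np

def predictive_relevance(samples, query_rows, context, n_rows):
    """Naive Algorithm 1: Monte Carlo predictive relevance of every row.

    Each posterior sample is assumed (for this sketch) to be a dict with
    sample["v"][c] the block of variable c, and sample["z"][k][r] the
    cluster of row r in block k's row partition.
    """
    H = len(samples)
    rel = np.zeros(n_rows)
    for sample in samples:
        k = sample["v"][context]          # retrieve the context block
        z = sample["z"][k]
        for r in range(n_rows):
            # Row r is relevant under this sample iff it shares a
            # cluster with every query row (the indicator in Eq 6).
            rel[r] += all(z[q] == z[r] for q in query_rows)
    return rel / H                        # average over the ensemble

# Toy ensemble: two samples over 6 rows; context variable 0 is in block 0.
samples = [
    {"v": {0: 0}, "z": {0: [0, 0, 1, 1, 0, 2]}},
    {"v": {0: 0}, "z": {0: [0, 0, 0, 1, 1, 2]}},
]
rel = predictive_relevance(samples, query_rows=[0], context=0, n_rows=6)
```

Rows that share the query row's cluster in both samples get relevance 1.0; rows that share it in one sample get 0.5.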
Define the pairwise cluster co-occurrence matrix $S^{k,h}$ for block $k$ of CrossCat sample $\hat\phi_h$ to have binary entries $S^{k,h}_{i,j} = \mathbb{I}[\hat z_i^{k,h} = \hat z_j^{k,h}]$. Furthermore, let $\mathbf{1}_{\mathcal{Q}}$ denote a length-$N$ vector with a 1 at indexes $q \in \mathcal{Q}$ and 0 otherwise. We vectorize $\mathcal{R}_c(\mathcal{Q}, r)$ across $r \in [N]$ by:

$$u^h = \frac{1}{|\mathcal{Q}|} S^{k,h} \mathbf{1}_{\mathcal{Q}}, \qquad h = 1, \ldots, H, \quad (7)$$

$$\mathcal{R}_c(\mathcal{Q}, \cdot) = \frac{1}{H} \sum_{h=1}^{H} u^h. \quad (8)$$

The resulting length-$N$ vector $u^h$ in Eq (7) satisfies $u_r^h = 1$ if and only if $\hat z_r^{k,h} = \hat z_q^{k,h}$ for all $q \in \mathcal{Q}$, which we identify as the argument of the indicator function in Eq (6). Finally, by averaging $u^h$ across the $H$ samples in Eq (8), we arrive at the vector of relevance probabilities.

For large datasets, constructing the $N \times N$ matrix $S^{k,h}$ using $\Theta(N^2)$ operations is prohibitively expensive. Algorithm 2 describes an efficient procedure that exploits CrossCat's sparsity to build $S^{k,h}$ in expected time $o(N^2)$ by using (i) a sparse matrix representation, and (ii) CrossCat's partition data structures to avoid considering all pairs of rows. This fast construction means that Eq (7) is practical to implement for large data tables. The algorithm's running time depends on (i) the number of clusters $|\mathrm{SET}(\hat z^k)|$ in line 1; (ii) the average number of rows per cluster $|\hat C_y^k|$ in line 2; and (iii) the data structures used to represent $S^{k,h}$ in line 3. Under the CRP prior, the expected number of clusters is $O(\alpha_1 \log N)$, which implies an average occupancy of $O(N / (\alpha_1 \log N))$ rows per cluster. If the sparse binary matrix is stored with a list-of-lists representation, then the update in line 3 requires $O(1)$ time. Furthermore, we emphasize that since $S^{k,h}$ does not depend on $\mathcal{Q}$, its cost of construction is amortized over an arbitrary number of queries.

Algorithm 2 CROSSCAT-CO-OCCURRENCE-MATRIX
Require: CrossCat sample $\hat\phi_h$; block index $k$.
Ensure: pairwise co-occurrence matrix $S^{k,h}$
 1: for $y \in \mathrm{SET}(\hat z^k)$ do     ▷ for each cluster in block k
 2:   for $r \in \hat C_y^k$ do               ▷ for each row in the cluster
 3:     set $S^{k,h}_{r,j} = 1$ for each $j \in \hat C_y^k$   ▷ update the matrix
 4: return $S^{k,h}$

3.3 Computing predictive relevance probabilities for query records that are not in the database

We have so far assumed that the query records must consist of items that already exist in the database. This section relaxes this restrictive assumption by illustrating how to compute relevance probabilities for search records which do not exist in $\mathcal{D}$, and are instead specified by the user on a per-query basis (refer to the BQL query in Figure 3 for an example of a hypothetical query record). The key idea is to (i) incorporate the new records into each CrossCat sample $\hat\phi_h$ by using a Gibbs step to sample cluster assignments from the joint posterior (Neal, 2000); (ii) compute Eq (7) on the updated samples; and (iii) unincorporate the records, leaving the original samples unmutated.

Letting $\{x_{[N+i]} : 1 \le i \le t\}$ denote $t$ (partially observed) new rows and $\mathcal{Q} = \{N+1, \ldots, N+t\}$ the query, we compute $\mathcal{R}_c(\mathcal{Q}, r)$ for all $r$ by first applying CROSSCAT-INCORPORATE-RECORD (Algorithm 3) to each $q \in \mathcal{Q}$ sequentially. Sequential incorporation corresponds to sampling from the sequence of predictive distributions, which, by exchangeability, ensures that each updated $\hat\phi_h$ contains a sample of cluster assignments from the joint distribution, guaranteeing correctness of the Monte Carlo estimator in Eq (6). Note that since CrossCat specifies a nonparametric mixture, the proposal clusters include all existing clusters, plus one singleton cluster $\max(z^k) + 1$. We next update the co-occurrence matrices in time linear in the size of the sampled cluster and then evaluate Eqs (7) and (8). To unincorporate, we reverse lines 7-9 and restore the co-occurrence matrices.
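The matrix-vector estimator of Eqs (7)-(8) can be sketched without ever materializing $S^{k,h}$: since $(S^{k,h}\mathbf{1}_{\mathcal{Q}})_r$ simply counts the query rows in row $r$'s cluster, the product can be accumulated per cluster in linear time, capturing the same sparsity that Algorithm 2 exploits. The list-of-label-arrays representation of the samples is an assumption of this sketch, and the final indicator applies the event from Eq (6).

```python
import numpy as np

def relevance_vectorized(z_blocks, query_rows, n_rows):
    """Eqs (7)-(8) without materializing the N x N matrix S.

    z_blocks: one length-N cluster-label array per posterior sample,
    for the row partition of the context block (sketch representation).
    """
    H, Q = len(z_blocks), len(query_rows)
    rel = np.zeros(n_rows)
    for z in z_blocks:
        z = np.asarray(z)
        # counts[y] = number of query rows assigned to cluster y.
        counts = np.bincount(z[query_rows], minlength=z.max() + 1)
        u = counts[z] / Q            # (1/|Q|) S 1_Q, computed rowwise
        rel += (u == 1.0)            # indicator event from Eq (6)
    return rel / H                   # average over the ensemble

z_blocks = [[0, 0, 1, 1, 0, 2], [0, 0, 0, 1, 1, 2]]
rel = relevance_vectorized(z_blocks, query_rows=np.array([0, 1]), n_rows=6)
```

Because the per-cluster counts are independent of the query, they can be cached across queries, mirroring the amortization argument for $S^{k,h}$ above.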
Figure 4 confirms that the runtime scaling is asymptotically linear, varying (i) the number of new rows, (ii) the fraction of variables specified for the new rows that are in the context block (i.e. query sparsity), (iii) the number of clusters in the context block, and (iv) the number of variables in the context block.

Algorithm 3 CROSSCAT-INCORPORATE-RECORD
Require: CrossCat sample $\phi$; context $c$; new row $x_{N+1}$
Ensure: updated CrossCat sample $\phi'$
 1: $k \leftarrow v_c$                          ▷ retrieve block of context variable
 2: $Y \leftarrow \max(z^k) + 1$                ▷ retrieve proposal clusters
 3: for $y = 1, \ldots, Y$ do                   ▷ compute cluster probabilities
 4:   $n_y \leftarrow \begin{cases} |C_y^k| & \text{if } y \in z^k \\ \alpha_1 & \text{if } y = \max(z^k) + 1 \end{cases}$
 5:   $l_y \leftarrow \left[\prod_{c \in V_k} M_c(x_{[N+1,c]} \mid x_{[C_y^k, c]}, \lambda_c)\right] n_y$
 6: $z_{N+1}^k \sim \mathrm{CATEGORICAL}(l_1, \ldots, l_Y)$       ▷ sample cluster
 7: $z'^k \leftarrow z^k \cup \{z_{N+1}^k\}$                      ▷ append cluster assignment
 8: $C'^k_{z_{N+1}^k} \leftarrow C^k_{z_{N+1}^k} \cup \{N+1\}$    ▷ append row to cluster
 9: $\mathcal{D}' \leftarrow \mathcal{D} \cup \{x_{[N+1, V_k]}\}$ ▷ append record to database
10: return $\phi'$                              ▷ return the updated sample

[Runtime plots: runtime (ms) versus row sparsity (top row) and versus the number of clusters in the context (bottom row); color indicates the number of variables in the context (5-45).]

Figure 4: Empirical measurements of the asymptotic scaling of CROSSCAT-INCORPORATE-RECORD (Algorithm 3) on the Gapminder dataset (Section 4). The color of each measurement indicates the number of variables in the block of the context variable; each column shows a different number of records (1, 2, 4, and 8) incorporated by the algorithm.
The top panels show that, for a fixed number of variables in the context, the runtime (in milliseconds) decays linearly with the sparsity of the hypothetical records (dimensions which are not in the same block as the context variable are ignored). The lower panels show the runtime increasing linearly with the number of clusters in the context; the number of variables in the context dictates the slope of the curve.

4 Applications

This section illustrates the efficacy of predictive relevance in BayesDB by applying the technique to several search problems in real-world, sparse, and high-dimensional datasets of public interest.²

4.1 College Scorecard

The College Scorecard (Council of Economic Advisers, 2015) is a federal dataset consisting of over 7000 colleges and 1700 variables, and is used to measure and improve the performance of US institutions of higher education. These variables include a broad set of categories

² Appendix D contains a further application to a dataset of classic cars from 1987. Appendix A formally describes the integration of RELEVANCE PROBABILITY into BayesDB as an expression in the Bayesian Query Language (Figure 3).
[Heatmaps: pairwise CrossCat predictive relevances in different contexts, (a) CrossCat (life expectancy) and (b) CrossCat (exports, % gdp); pairwise cosine similarities in different contexts, (c) Cosine (life expectancy) and (d) Cosine (exports, % gdp).]

Concept                Representative Countries in the Concept
Low-Income Nations     Burundi, Ethiopia, Uganda, Benin, Malawi, Rwanda, Togo, Guinea, Senegal, Afghanistan, Malawi
Post-Soviet Nations    Russia, Ukraine, Bulgaria, Belarus, Slovakia, Serbia, Croatia, Poland, Hungary, Romania, Latvia
Western Democracies    France, Britain, Germany, Netherlands, Italy, Denmark, Finland, Sweden, Norway, Australia, Japan
Small Wealthy Nations  Qatar, Bahrain, Kuwait, Emirates, Singapore, Israel, Gibraltar, Bermuda, Jersey, Cayman Islands

(e) Countries which are mutually predictive in the context of "life expectancy" according to CrossCat's relevance matrix (a).

Figure 5: (a)-(d) Pairwise heatmaps of countries from the Gapminder dataset in the contexts of "life expectancy at birth" and "exports of goods and services (% of gdp)", using CrossCat predictive relevance and cosine similarity. Each row and column in a matrix is a country, and a cell value (between 0 and 1) indicates the strength of match between those two countries. (e) CrossCat learns a sparse set of relevances; for "life expectancy", these broadly correspond to common-sense taxonomies of countries based on shared geographic, political, and macroeconomic characteristics. These concepts were manually labeled by inspecting clusters of countries in matrix (a); the colors in the matrix correspond to countries in the table which belong to the concept of that color. Note that the relevance structure differs significantly when ranking in the context of "exports, % gdp", as shown by the colors in matrix (b), where the clusters of mutually relevant countries form a different pattern than in (a).
Cosine similarity learns dense, noisy sets of spuriously high-ranking countries with coarser structure, as shown in (c) and (d). Refer to Appendix C for more baselines.

such as campus characteristics, academic programs, student debt, tuition fees, admission rates, instructional investments, ethnic distributions, and completion rates. We analyzed a subset of 2000 schools (four-year institutions) and 100 variables from the categories listed above.

Suppose a student is interested in attending a city university with a set of desired specifications. Starting with a standard SQL Boolean search in Figure 1a (on p. 2), they find only one matching record, and must iteratively rewrite the search conditions to retrieve more results. Figure 1b instead expresses the search query as a hypothetical row in a BQL PREDICTIVE RELEVANCE query (which invokes the technique in Section 3.3). The top-ranking records contain first-rate schools, but their admission rates are much too stringent. In Figure 1c, the user re-expresses the BQL query to rank schools by predictive relevance, in the context of instructional investment, to a subset of the first-rate schools discovered in 1b. Combining ORDER BY PREDICTIVE RELEVANCE with Boolean conditions in the WHERE clause returns another set of top-quality schools with city campuses that are less competitive than those in 1b, but have quantitative metrics that are much better than national averages.

4.2 Gapminder

Gapminder (Rosling, 2008) is an extensive longitudinal dataset of roughly 320 global macroeconomic variables covering population growth, education, climate, trade, welfare, and health for 225 countries. Our experiments are based on a cross-section of the data from the year 2002. The data is sparse, with 35% of the values missing. Figure 5 shows heatmaps of the pairwise predictive relevances for all countries in the dataset under different contexts, and compares the results to cosine similarity.
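The cosine-similarity baseline can be pictured with a small sketch. The snippet below is an illustrative reconstruction of the procedure described later in this section (restrict country vectors to context variables, fill missing values with per-variable medians, compare by cosine similarity), not the paper's evaluation code; the function names and toy data are ours.

```python
import math

def median(xs):
    """Median of a nonempty list of numbers."""
    xs = sorted(xs)
    n = len(xs)
    return xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2

def impute_medians(rows):
    """Replace missing entries (None) with the per-dimension median of the
    observed values, as in the median-imputation baseline."""
    dims = len(rows[0])
    meds = [median([r[d] for r in rows if r[d] is not None]) for d in range(dims)]
    return [[meds[d] if r[d] is None else r[d] for d in range(dims)] for r in rows]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy country vectors over three context variables, with missing values.
rows = [
    [80.1, None, 4.2],   # country A
    [79.5, 30.0, None],  # country B
    [52.0, 110.0, 9.9],  # country C
]
filled = impute_medians(rows)
# After imputation, A's profile is closer to B's than to C's under cosine.
print(cosine(filled[0], filled[1]) > cosine(filled[0], filled[2]))  # True
```

Note that imputing with a single median per dimension collapses all missing entries to one value, which is part of why these baselines produce the dense, noisy structure seen in panels (c) and (d).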
Clusters of predictively relevant countries form common-sense taxonomies; refer to the caption for further discussion.

Figure 6 finds the top-15 countries in the dataset ordered by their predictive relevance to the United States, in the context of "life expectancy at birth". Table 6b shows representative variables which are in the context; these variables have the highest dependence probability with the context variable, according to a Monte Carlo estimate using 64 posterior CrossCat samples. The countries in Figure 6a are all rich, Western democracies with highly developed economies and advanced healthcare systems.

To quantitatively evaluate the quality of top-ranked countries returned by predictive relevance, we ran the technique on 10 representative search queries (varying the country and context variable) and obtained the top 10 results for each query. Figure 7 shows the queries, and human preferences for the results from predictive relevance versus results from cosine similarity between the country vectors. We defined the context for cosine similarity by reducing the 320-dimensional vectors down to 10 dimensions, selecting the variables which are most dependent with the context variable according to CrossCat's dependence probabilities. To deal with sparsity, which cosine similarity cannot handle natively, we imputed missing values using sample medians; imputation techniques like MICE (Buuren and Groothuis-Oudshoorn, 2011) resulted in little difference (Appendix C).

%bql .barplot ESTIMATE "country",
...   RELEVANCE PROBABILITY
...     TO EXISTING ROWS IN ('United States')
...     IN THE CONTEXT OF "life expectancy at birth"
...     AS "rel_us_lifexp"
...   FROM gapminder
...   ORDER BY "rel_us_lifexp" DESC
...   LIMIT 15

(a) Relevance to USA in the context of "life expectancy" (top-15 bar plot): Denmark, Germany, Switzerland, Canada, Belgium, Austria, San Marino, Norway, Cyprus, New Zealand, Andorra, Ireland, Australia, Iceland, United States.

(b) Variables in the context of "life expectancy at birth": measles, mumps, & rubella vaccines (% population); under-5 mortality rate; dead children per woman; access to improved sanitation facilities (% population); access to improved drinking water sources (% population); human development index; body mass index (kg/m2); murder rate (per 100,000); food supply (kilocalories per person); contraceptive prevalence (% women ages 15-49); alcohol consumption (liters per adult); prevalence of tobacco use among adults (% population).

Figure 6: Using BQL to search for the top 15 countries in the Gapminder dataset ranked by their relevance to the United States in the context of "life expectancy at birth" finds rich, Western democracies with advanced healthcare systems.
Figure 7: Comparing human preferences for the top-ranked countries returned by cosine similarity versus CrossCat predictive relevance, in 10 representative search queries (shown on the y-axis): Singapore-Urban Population, Hong Kong-Urban Population, United Kingdom-Life Expectancy, Qatar-Life Expectancy, Japan-Life Expectancy, Bulgaria-Life Expectancy, Bangladesh-Life Expectancy, Australia-Life Expectancy, United States-Democracy Score, and Saudi Arabia-Democracy Score. For each query, human subjects were given the top 10 most relevant countries, according to both cosine and CrossCat, and then asked to choose which results they preferred, if any. We scored the responses in the following way: "countries returned by cosine are more relevant" (score = -1); "countries returned by CrossCat are more relevant" (score = +1); "both results are equally relevant" (score = 0). The x-axis shows the scores averaged across 70 humans, surveyed on the cloud through crowdflower.com. Error bars represent one standard error of the mean. For most of the queries, human preferences are biased in favor of CrossCat's rankings. Further details on the experimental design and results are given in Appendix B.

5 Discussion

This paper has shown how to perform probabilistic searches of structured data by combining ideas from probabilistic programming, information theory, and nonparametric Bayes. The demonstrations suggest the technique can be effective on sparse, real-world databases from multiple domains and produce results that human evaluators often preferred to a standard baseline.

More empirical evaluation is clearly needed, ideally including tests of hundreds or thousands of queries, more complex query types, and comparisons with query results manually provided by human domain experts.
In fact, search via predictive relevance in the context of variables drawn from learned representations of data could potentially provide a meaningful way to compare representation learning techniques. It also may be fruitful to build a distributed implementation suitable for database representations of web-scale data, including photos, social network users, and web pages.

Relatively unstructured probabilistic models, such as topic models, proved sufficient for making unstructured text data far more accessible and useful. We hope this paper helps illustrate the potential for structured probabilistic models to improve the accessibility and usefulness of structured data.

Acknowledgments

The authors wish to acknowledge Ryan Rifkin, Anna Comerford, Marie Huber, and Richard Tibbetts for helpful comments on early drafts. This research was supported by DARPA (PPAML program, contract number FA8750-14-2-0004), IARPA (under research contract 2015-15061000003), the Office of Naval Research (under research contract N000141310333), the Army Research Office (under agreement number W911NF-13-1-0212), and gifts from Analog Devices and Google.

References

Stef Buuren and Karin Groothuis-Oudshoorn. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 2011.

Council of Economic Advisers. Using federal data to measure and improve the performance of U.S. institutions of higher education. Technical report, Executive Office of the President of the United States, 2015.

Thomas Cover and Joy Thomas. Elements of Information Theory. Wiley Series in Telecommunications and Signal Processing. Wiley, 2012.

Michael Escobar and Mike West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577–588, 1995.

Zoubin Ghahramani and Katherine A. Heller. Bayesian sets.
In Proceedings of the 18th International Conference on Neural Information Processing Systems, pages 435–442. MIT Press, 2005.

Dennis Kibler, David W. Aha, and Marc K. Albert. Instance-based prediction of real-valued attributes. Computational Intelligence, 5(2):51–57, 1989.

Vikash Mansinghka, Richard Tibbetts, Jay Baxter, Pat Shafto, and Baxter Eaves. BayesDB: A probabilistic programming system for querying the probable implications of data. CoRR, abs/1512.05006, 2015.

Vikash Mansinghka, Patrick Shafto, Eric Jonas, Cap Petschulat, Max Gasner, and Joshua B. Tenenbaum. CrossCat: A fully Bayesian nonparametric method for analyzing heterogeneous, high dimensional data. Journal of Machine Learning Research, 17(138):1–49, 2016.

Radford M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.

Fritz Obermeyer, Jonathan Glidden, and Eric Jonas. Scaling nonparametric Bayesian inference via subsample-annealing. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pages 696–705. JMLR.org, 2014.

Daniel N. Osherson, Edward E. Smith, Ormond Wilkie, Alejandro Lopez, and Eldar Shafir. Category-based induction. Psychological Review, 97(2):185, 1990.

Lance J. Rips. Inductive judgments about natural categories. Journal of Verbal Learning and Verbal Behavior, 14(6):665–681, 1975.

Hans Rosling. Gapminder: Unveiling the beauty of statistics for a fact based world view, 2008. URL https://www.gapminder.org/data/.

Feras Saad and Vikash Mansinghka. Probabilistic data analysis with probabilistic programming. CoRR, abs/1608.05347, 2016.

Feras Saad and Vikash Mansinghka. Detecting dependencies in sparse, multivariate databases using probabilistic programming and non-parametric Bayes. In Proceedings of the Twentieth International Conference on Artificial Intelligence and Statistics. JMLR.org, 2017.
Patrick Shafto, Charles Kemp, Elizabeth Bonawitz, John Coley, and Joshua Tenenbaum. Inductive reasoning about causally transmitted properties. Cognition, 109(2):175–192, 2008.

Appendices

A Integrating predictive relevance as a ranking function in BayesDB

This section describes the integration of predictive relevance into BayesDB (Mansinghka et al., 2015; Saad and Mansinghka, 2016), a probabilistic programming platform for probabilistic data analysis.

New syntaxes in the Bayesian Query Language (BQL) allow a user to express predictive relevance queries where the query set can be an arbitrary combination of existing and hypothetical records. We implement predictive relevance in BQL as an expression with the following syntaxes, depending on the specification of the query records.

• Query records are existing rows.

  RELEVANCE PROBABILITY
    TO EXISTING ROWS IN <rows>
    IN THE CONTEXT OF <context>

• Query records are hypothetical rows.

  RELEVANCE PROBABILITY
    TO HYPOTHETICAL ROWS WITH VALUES (<values>)
    IN THE CONTEXT OF <context>

• Query records are existing and hypothetical rows.

  RELEVANCE PROBABILITY
    TO EXISTING ROWS IN <rows>
    AND HYPOTHETICAL ROWS WITH VALUES (<values>)
    IN THE CONTEXT OF <context>

The expression is formally implemented as a 1-row BQL estimand, which specifies a map r ↦ R_c(Q, r) for each record r in the table. As shown in the expressions above, query records are specified by the user in two ways: (i) by giving a collection of EXISTING ROWS, whose primary key indexes are either specified manually or retrieved using an arbitrary BQL query; (ii) by specifying one or more HYPOTHETICAL RECORDS with their <values> as a list of column-value pairs. These new rows are first incorporated using Algorithm 3 from Section 3.3, and they are then unincorporated after the query is finished. The <context> can be any variable in the tabular population. As a 1-row function in the structured query language, the RELEVANCE PROBABILITY expression can be used in a variety of settings.
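The estimand r ↦ R_c(Q, r) can be pictured with a small Monte Carlo sketch. The snippet below is an illustrative simplification of the idea, not BayesDB's Algorithm 3: it assumes each posterior CrossCat sample has been summarized as a mapping from row id to cluster assignment within the view containing the context variable, and scores a target row by the fraction of samples in which it shares a cluster with every query row. All names and the toy posterior are ours.

```python
def relevance_probability(target, query_rows, samples):
    """Monte Carlo estimate of the relevance of `target` to `query_rows`:
    the fraction of posterior samples in which `target` shares a cluster
    with every query row, within the view containing the context variable."""
    hits = sum(
        1 for assignment in samples
        if all(assignment[q] == assignment[target] for q in query_rows)
    )
    return hits / len(samples)

# Toy posterior: four CrossCat samples over five rows. Each dict maps a
# row id to its cluster assignment in the context variable's view.
samples = [
    {"us": 0, "canada": 0, "france": 0, "chad": 1, "mali": 1},
    {"us": 0, "canada": 0, "france": 1, "chad": 2, "mali": 2},
    {"us": 0, "canada": 0, "france": 0, "chad": 1, "mali": 1},
    {"us": 0, "canada": 1, "france": 0, "chad": 2, "mali": 2},
]
print(relevance_probability("canada", ["us"], samples))  # 0.75
print(relevance_probability("chad", ["us"], samples))    # 0.0
```

Because only cluster co-assignments within one view are inspected, dimensions outside the context variable's block are ignored, which is the source of the sparsity exploited by the fast matrix algorithm described in the main text.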
Some typical use-cases are shown in the following examples, where we use only existing query rows for simplicity.

• As a column in an ESTIMATE query.

  ESTIMATE "rowid",
    RELEVANCE PROBABILITY
      TO EXISTING ROWS IN <rows>
      IN THE CONTEXT OF <context>
  FROM <table>

• As a filter in a WHERE clause.

  ESTIMATE "rowid" FROM <table>
  WHERE (
      RELEVANCE PROBABILITY
        TO EXISTING ROWS IN <rows>
        IN THE CONTEXT OF <context>
    ) > 0.5

• As a comparator in an ORDER BY clause.

  ESTIMATE "rowid" FROM <table>
  ORDER BY
    RELEVANCE PROBABILITY
      TO EXISTING ROWS IN <rows>
      IN THE CONTEXT OF <context>
    [ ASC | DESC ]

It is also possible to perform arithmetic operations and Boolean comparisons on relevance probabilities.

• Finding the mean relevance probability for a set of rowids of interest.

  ESTIMATE AVG (
      RELEVANCE PROBABILITY
        TO EXISTING ROWS IN <rows>
        IN THE CONTEXT OF <context>
    )
  FROM <table>
  WHERE "rowid" IN <rows>

• Finding rows which are more relevant in some context c0 than in another context c1.

  ESTIMATE "rowid" FROM <table>
  WHERE (
      RELEVANCE PROBABILITY
        TO EXISTING ROWS IN <rows>
        IN THE CONTEXT OF <c0>
    ) > (
      RELEVANCE PROBABILITY
        TO EXISTING ROWS IN <rows>
        IN THE CONTEXT OF <c1>
    )

B Predictive relevance and cosine similarity on Gapminder human evaluation queries

Saudi Arabia, Democracy
  A: Saudi, Oman, Libya, Kuwait, W. Sahara, Qatar, Bahrain, Algeria, Iraq, Emirates, Bhutan
  B: Saudi, Venezuela, Israel, Trdad & Tob, Malta, Puerto Rico, Oman, Spain, Canada, Japan, Argentina

United States, Democracy
  A: USA, France, Finland, Norway, UK, Sweden, Estonia, Denmark, Australia, Switzerland, Germany
  B: USA, Australia, Ireland, Canada, UK, Iceland, Netherlands, Austria, Denmark, Japan, New Zealand

Australia, Life Expectancy
  A: Australia, Ireland, Iceland, Andorra, United States, New Zealand, Austria, Belgium, Canada, Switzerland, Cyprus
  B: Australia, Israel, Germany, Canada, Iceland, Malta, Ireland, Finland, United States, Luxembourg, UK

Bangladesh, Life Expectancy
  A: Bangladesh, Bhutan, Papua NG, India, Gambia, Uganda, Nepal, Timor-Leste, Pakistan, Mauritania, Indonesia
  B: Bangladesh, India, Bhutan, Myanmar, Indonesia, Philippines, Nepal, Pakistan, Mongolia, Viet Nam, Kyrgyzstan

Bulgaria, Life Expectancy
  A: Bulgaria, Estonia, Portugal, Macedonia, Kuwait, Bosnia, Hungary, Croatia, Spain, Japan, Poland
  B: Bulgaria, Croatia, Poland, Serbia, Hungary, Slovakia, Bosnia, Belarus, Montenegro, Estonia, Montserrat

Japan, Life Expectancy
  A: Japan, Hungary, Portugal, Spain, Slovakia, Greece, Kuwait, Slovenia, Emirates, Poland, Ireland
  B: Japan, Austria, Belgium, Canada, Switzerland, Germany, Denmark, Finland, France, UK, Netherlands

Qatar, Life Expectancy
  A: Qatar, Emirates, Kuwait, Bahrain, Turks Isld, Cayman Isld, Guernsey, Bermuda, Jersey, Israel, Singapore
  B: Qatar, Serbia, Bosnia, Belarus, Croatia, Montenegro, Estonia, Bulgaria, Lithuania, Latvia, Saudi Arabia

UK, Life Expectancy
  A: UK, Belgium, France, Luxembourg, Slovenia, Germany, Malta, Canada, Finland, Ireland, Czechia
  B: UK, Austria, Belgium, Canada, Switzerland, Germany, Denmark, Finland, France, Japan, Netherlands

Hong Kong, Urban Pop
  A: Hong Kong, Italy, Mexico, Finland, Bulgaria, Belgium, Lithuania, Slovakia, Poland, Lebanon, Panama
  B: Hong Kong, Singapore, Austria, Canada, Greenland, Netherlands, Andorra, Switzerland, Ireland, Iceland, Denmark

Singapore, Urban Pop
  A: Singapore, Barbados, Oman, Norway, Romania, Libya, Algeria, Palau, Gabon, Cuba, Switzerland
  B: Singapore, Hong Kong, Gibraltar, Andorra, Monaco, United States, San Marino, Luxembourg, Norway, Austria, Australia

Figure 8: The top-10 ranking countries returned by predictive relevance and cosine similarity for each of the 10 queries used for the human evaluation in Figure 7. For each country-context search query, we showed seventy subjects (surveyed on the crowdsourcing platform crowdflower.com) a pair of tables. We then asked each subject to select the table which contains more relevant results for the search query, or report that both tables contain equally relevant results. The tables above show the top-ranked countries using CrossCat predictive relevance and cosine similarity, with a histogram of the human responses. The caption of Figure 7 describes how we converted these raw histograms into scores between -1 and 1 that are displayed in the main text. The tables showing countries ranked using CrossCat predictive relevance are: Saudi Arabia (A); United States (B); Australia (A); Bangladesh (B); Bulgaria (B); Japan (B); Qatar (A); UK (B); Hong Kong (B); Singapore (B).
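The conversion from raw preference histograms to the scores plotted in Figure 7 can be sketched directly from the scoring rule stated in its caption (cosine = -1, CrossCat = +1, equal = 0, averaged over subjects, with the standard error of the mean for the error bars). This is an illustrative reconstruction; the function and label names are ours, not the evaluation code.

```python
import math

SCORE = {"prefer_cosine": -1, "prefer_crosscat": +1, "equal": 0}

def score_query(responses):
    """Mean preference score and standard error of the mean for one query."""
    scores = [SCORE[r] for r in responses]
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    return mean, math.sqrt(var / n)  # SEM for the error bar

# Toy responses from ten subjects for one country-context query.
responses = ["prefer_crosscat"] * 6 + ["prefer_cosine"] * 2 + ["equal"] * 2
mean, sem = score_query(responses)
print(mean)  # 0.4: preferences biased toward CrossCat for this query
```

A positive mean indicates that subjects preferred CrossCat's ranking on balance, matching the interpretation of the x-axis in Figure 7.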
C Pairwise heatmaps on Gapminder countries using baseline methods

[Figure 9 panels: pairwise heatmaps computed with cosine similarity, the Bray-Curtis coefficient, and Euclidean distance; for each measure, panels use median imputation and MICE imputation with 5, 10, 15, and 20 context variables.]

Figure 9: Pairwise heatmaps of countries in the Gapminder dataset in the context of "life expectancy at birth", using various distance and similarity measures on the country vectors. Each heatmap is labeled with the imputation technique (median or MICE (Buuren and Groothuis-Oudshoorn, 2011)) and the number of variables in the context (i.e., the dimensionality of the vectors). These techniques struggle with sparsity, and their structures are much noisier than the results of relevance probability shown in Figure 5a and Table 5e.

D Application to a dataset of 1987 cars

%bql CREATE TABLE cars_1987_raw
...   FROM 'cars_1987.csv'

%bql SELECT
...     "make",
...     "price",
...     "wheels",
...     "doors",
...     "engine",
...     "horsepower",
...     "body"
...   FROM cars_1987_raw
...   WHERE "price" < 45000
...     AND "wheels" = 'rear'
...     AND "doors" = 'four'
...     AND "engine" >= 250
...     AND "horsepower" > 180
...     AND "body" = 'sedan'

(a) Suppose a customer wishes to purchase a classic car from 1987 with a budget of $45,000 and a desired set of technical specifications.
They first load a csv file of 200 cars with 26 variables into a BayesDB table, and then specify the search conditions as Boolean filters in a SQL WHERE clause. Due to sparsity in the table, only one record is returned. To obtain more relevant results, the user needs to broaden the specifications in the query.

  make      price   wheels  doors  engine  horsepower  body
  mercedes  40,960  rear    four   308     184         sedan

%mml CREATE POPULATION cars_1987
...   FOR cars_1987_raw
...   WITH SCHEMA (
...     GUESS STATISTICAL TYPES FOR (*);
...   )

%mml CREATE METAMODEL m FOR cars_1987
...   WITH BASELINE crosscat;

%mml INITIALIZE 100 MODELS FOR m;

%mml ANALYZE m FOR 1 MINUTE;

%bql .heatmap ESTIMATE
...     DEPENDENCE PROBABILITY
...   FROM PAIRWISE VARIABLES
...   OF cars_1987

%bql SELECT
...     "make",
...     "price",
...     "wheels",
...     "doors",
...     "engine-size",
...     "horsepower",
...     "style"
...   FROM cars_1987
...   ORDER BY
...     RELEVANCE PROBABILITY
...     TO HYPOTHETICAL ROW ((
...       "price" = 42000,
...       "wheels" = 'rear',
...       "doors" = 'four',
...       "engine" = 250,
...       "horsepower" = 180,
...       "body" = 'sedan'
...     ))
...     IN THE CONTEXT OF "price"
...   LIMIT 10

(b) Building CrossCat models in BayesDB for the cars_1987 population learns a full joint probabilistic model over all variables. The ESTIMATE DEPENDENCE PROBABILITY query allows the user to plot a heatmap of probable dependencies between car characteristics. The context of "price" probably contains the majority of the other variables in the search query.
[Figure 10c heatmap residue: row and column labels are the 26 car variables, including engine-location, fuel-system, fuel-type, aspiration, compression-ratio, engine-type, num-of-cylinders, width, length, wheel-base, horsepower, city-mpg, highway-mpg, engine-size, price, curb-weight, drive-wheels, peak-rpm, bore, make, stroke, normalized-losses, body-style, symboling, height, and num-of-doors; cell values range from 0.0 to 1.0.]

(c) Using ORDER BY RELEVANCE PROBABILITY in BQL ranks each car in the table by its relevance to the user's specifications, which are expressed as a hypothetical row. The top-10 cars ranked by probability of relevance to the search query, in the context of "price", are shown in the table below. The user can now inspect further characteristics of this subset of cars, to find the ones they like best.

  make      price   wheels  doors  engine  horsepower  body
  jaguar    35,550  rear    four   258     176         sedan
  jaguar    32,250  rear    four   258     176         sedan
  mercedes  40,960  rear    four   308     184         sedan
  mercedes  45,400  rear    two    304     184         hardtop
  mercedes  34,184  rear    four   234     155         sedan
  mercedes  35,056  rear    two    234     155         convertible
  bmw       36,880  rear    four   209     182         sedan
  bmw       41,315  rear    two    209     182         sedan
  bmw       30,760  rear    four   209     182         sedan
  jaguar    36,000  rear    two    326     262         sedan

Figure 10: A session in BayesDB for probabilistic model building and search in the cars dataset (Kibler et al., 1989).
