Answering Table Queries on the Web using Column Keywords
We present the design of a structured search engine that returns a multi-column table in response to a query consisting of keywords describing each of its columns. We answer such queries by exploiting the millions of tables on the Web, because these are much richer sources of structured knowledge than free-format text. However, a corpus of tables harvested from arbitrary HTML web pages presents huge challenges of diversity and redundancy not seen in centrally edited knowledge bases. We concentrate on one concrete task in this paper: given a set of Web tables T1, . . ., Tn, and a query Q with q sets of keywords Q1, . . ., Qq, decide for each Ti whether it is relevant to Q and, if so, identify the mapping between the columns of Ti and the query columns. We represent this task as a graphical model that jointly maps all tables by incorporating diverse sources of clues spanning matches in different parts of the table, corpus-wide co-occurrence statistics, and content overlap across table columns. We define a novel query segmentation model for matching keywords to table columns, and a robust mechanism for exploiting content overlap across table columns. We design efficient inference algorithms based on bipartite matching and constrained graph cuts to solve the joint labeling task. Experiments on a workload of 59 queries over a 25 million web table corpus show a significant boost in accuracy over baseline IR methods.


💡 Research Summary

The paper introduces a structured search engine that answers “column‑keyword” queries by returning multi‑column tables harvested from the Web. A user supplies a set of keyword groups Q₁,…,Q_q, each describing the desired content of a column (for example, “city”, “population”, “year”). The system must (1) decide which tables in a massive web‑derived corpus are relevant to the query and (2) map each column of a relevant table to one of the query’s keyword groups. The authors argue that web tables contain far richer, more up‑to‑date structured knowledge than free‑text documents or centrally curated knowledge bases, but they also pose unique challenges: heterogeneous HTML layouts, missing or noisy headers, massive redundancy, and the sheer scale of the data (tens of millions of tables).

To address these challenges, the authors formulate the problem as a joint labeling task over a graphical model. For each table T_i with m_i columns they introduce a set of latent variables X_i = {x₁,…,x_{m_i}} where each x_j ∈ {1,…,q} ∪ {null}. The model incorporates three complementary sources of evidence: (a) Header‑Keyword Matching – string similarity, word‑embedding cosine similarity, and character n‑gram overlap between the column header (or caption) and the query keywords; (b) Cell‑Content Matching – TF‑IDF weighted similarity between the set of cell values in a column and the keyword group, augmented by corpus‑wide co‑occurrence statistics that capture how often a particular keyword appears together with certain cell values across all tables; (c) Global Co‑Occurrence Prior – a corpus‑level distribution over (column type, keyword) pairs learned from the entire table collection, which provides a prior probability that a column of a certain semantic type (e.g., dates, locations, monetary amounts) matches a given keyword group. In addition, the model adds a Content‑Overlap Penalty that discourages assigning the same keyword to two columns of the same table when the columns share a large fraction of their cell values, thereby enforcing column distinctness.
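A minimal sketch of how the unary evidence (a)–(c) might be combined for a single column and keyword group. The similarity functions below are simple stand-ins for the paper's string/embedding and TF-IDF measures, and the weights and prior are illustrative assumptions, not the authors' actual parameterization:

```python
def header_score(header, keywords):
    # Token-overlap (Jaccard) between the column header and the keyword
    # group -- a stand-in for the paper's string/embedding similarities.
    h = set(header.lower().split())
    k = set(w.lower() for w in keywords)
    return len(h & k) / len(h | k) if h | k else 0.0

def cell_score(cells, keywords):
    # Fraction of cells containing any query keyword -- a stand-in for
    # the TF-IDF weighted cell-content similarity.
    kws = [w.lower() for w in keywords]
    hits = sum(1 for c in cells if any(k in c.lower() for k in kws))
    return hits / len(cells) if cells else 0.0

def unary_score(header, cells, keywords, prior=1.0,
                w_header=0.5, w_cells=0.5):
    # Combine header evidence (a), cell evidence (b), and the
    # corpus-level co-occurrence prior (c); weights are assumed.
    return prior * (w_header * header_score(header, keywords)
                    + w_cells * cell_score(cells, keywords))
```

In the full model this score would feed the unary potential for assigning label x_j = k to column j; the pairwise overlap penalty is added on top of these per-column terms.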

The overall energy function is the sum of unary potentials (derived from (a)–(c)) and pairwise potentials (the overlap penalty). Exact inference is intractable because of the combinatorial number of possible labelings across millions of tables. The authors therefore propose a two‑stage approximate inference scheme. First, each table is processed independently as a bipartite matching problem between its columns and the query keyword groups. The matching cost for edge (column j, keyword k) is the negative log‑probability from the unary potentials. They solve this assignment using a variant of the Hungarian algorithm, which runs in O(m_i·q) time per table. Second, to enforce consistency across tables, they construct a global graph where nodes represent column‑keyword assignments and edges encode the pairwise overlap penalties. They then apply a constrained graph‑cut algorithm (equivalent to a min‑cut on a suitably constructed flow network) to obtain a globally consistent labeling that respects the pairwise constraints while staying close to the locally optimal matchings.
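The per-table step can be illustrated with a brute-force bipartite assignment, which is feasible here because queries contain only 2–5 keyword groups (the paper instead uses a Hungarian-algorithm variant). The null-label cost is an assumed constant, not a value from the paper:

```python
from itertools import permutations

def best_assignment(cost, num_cols, num_groups, null_cost=1.0):
    # cost[(j, k)]: negative log-probability (from the unary potentials)
    # of mapping column j to keyword group k. Each group may be used at
    # most once; columns may also take the null label (None) at a fixed,
    # assumed cost. Brute force over injective label tuples.
    labels = list(range(num_groups)) + [None] * num_cols
    best, best_map = float("inf"), None
    for perm in set(permutations(labels, num_cols)):
        total = sum(null_cost if k is None else cost[(j, k)]
                    for j, k in enumerate(perm))
        if total < best:
            best, best_map = total, perm
    return best_map, best
```

For a two-column table with costs cost[(0,0)]=0.1, cost[(0,1)]=0.9, cost[(1,0)]=0.8, cost[(1,1)]=0.2, the minimizer maps column 0 to group 0 and column 1 to group 1 at total cost 0.3.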

The system was evaluated on a corpus of roughly 25 million tables extracted from the public Web. A benchmark set of 59 multi‑column queries (each containing 2–5 keyword groups) was assembled. Baselines included a standard information‑retrieval pipeline that retrieves tables based on keyword matching over the entire document and a naïve per‑column matching method that ignores cross‑column evidence. The proposed joint model achieved an average precision of 0.84, recall of 0.78, and F1‑score of 0.81, compared with 0.62/0.59/0.60 for the IR baseline. Error analysis revealed that failures were most common when column headers were missing or extremely ambiguous, or when multiple columns contained highly overlapping value sets (e.g., “city” and “capital” columns). The authors suggest that future work could incorporate entity linking and semantic clustering of cell values to further disambiguate such cases.
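The reported F1 scores can be checked against the stated precision and recall; note that the harmonic mean of the baseline's 0.62 precision and 0.59 recall works out to roughly 0.60:

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Joint model: P = 0.84, R = 0.78  ->  F1 ~= 0.81
# IR baseline: P = 0.62, R = 0.59  ->  F1 ~= 0.60
```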

In summary, the paper makes three key contributions: (1) a formal definition of the column‑keyword query task over web tables; (2) a novel graphical model that jointly leverages header similarity, cell‑content similarity, and corpus‑wide co‑occurrence statistics, together with a robust mechanism for exploiting content overlap across columns; and (3) efficient inference algorithms based on bipartite matching and constrained graph cuts that scale to tens of millions of tables. The experimental results demonstrate a substantial improvement over traditional IR approaches, indicating that web‑derived tables can be turned into a practical, large‑scale structured knowledge source for complex multi‑column queries.