Learning with Scope, with Application to Information Extraction and Classification
In probabilistic approaches to classification and information extraction, one typically builds a statistical model of words under the assumption that future data will exhibit the same regularities as the training data. In many data sets, however, there are scope-limited features whose predictive power is only applicable to a certain subset of the data. For example, in information extraction from web pages, word formatting may be indicative of extraction category in different ways on different web pages. The difficulty with using such features is capturing and exploiting the new regularities encountered in previously unseen data. In this paper, we propose a hierarchical probabilistic model that uses both local/scope-limited features, such as word formatting, and global features, such as word content. The local regularities are modeled as an unobserved random parameter which is drawn once for each local data set. This random parameter is estimated during the inference process and then used to perform classification with both the local and global features, a procedure which is akin to automatically retuning the classifier to the local regularities on each newly encountered web page. Exact inference is intractable and we present approximations via point estimates and variational methods. Empirical results on large collections of web data demonstrate that this method significantly improves performance over traditional models of global features alone.
💡 Research Summary
In this paper the authors address a fundamental limitation of most probabilistic text‑classification and information‑extraction systems: they assume that the statistical regularities learned from the training corpus will hold uniformly for any future data. In many real‑world scenarios—particularly on the Web—this assumption is violated because certain predictive cues are “scope‑limited”: they are only reliable within a specific document or site. For example, the way a word is bolded, colored, or placed inside a particular HTML tag may indicate a named‑entity class on one page but be meaningless on another. Traditional models that treat all features globally either ignore such cues or over‑fit them, leading to sub‑optimal performance on unseen pages.
To overcome this, the authors propose a hierarchical probabilistic model that explicitly separates global and local (scope‑limited) information. The global component, parameterised by θ, captures the usual word‑content distribution p(w | y, θ) and a prior over class labels p(y | θ). The local component introduces, for each document d (or any defined “local” data set), an unobserved random variable φ_d. This φ_d governs the conditional distribution of formatting or structural features f given the class label y, i.e., p(f | y, φ_d). In the generative story, θ is drawn once for the whole corpus, then for each document a φ_d is sampled, after which each token’s label y is generated from θ, its word w from p(w | y, θ), and its formatting f from p(f | y, φ_d). By tying the label y to both global and local parameters, the model can automatically “retune” itself to the idiosyncratic regularities of each new document.
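To make the generative story concrete, here is a minimal sketch with a tiny discrete instantiation. All label, word, and formatting vocabularies below are hypothetical, and θ is hand-set rather than learned; the point is only the structure: θ is shared across the corpus, while φ_d is drawn fresh for every document.

```python
import random

# Hypothetical global parameters theta: label prior p(y) and word model p(w | y).
LABELS = ["PERSON", "OTHER"]
FORMATS = ["bold", "plain"]

theta = {
    "p_y": {"PERSON": 0.3, "OTHER": 0.7},
    "p_w_given_y": {
        "PERSON": {"alice": 0.8, "acme": 0.1, "the": 0.1},
        "OTHER": {"alice": 0.1, "acme": 0.3, "the": 0.6},
    },
}

def sample(rng, dist):
    """Draw one outcome from a dict mapping outcomes to probabilities."""
    r, acc = rng.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome  # guard against floating-point rounding

def generate_document(rng, theta, n_tokens):
    # phi_d is the local formatting model p(f | y), resampled once per
    # document; the paper places a prior on it, here we simply draw a
    # random Bernoulli parameter per label.
    phi_d = {}
    for y in LABELS:
        b = rng.random()
        phi_d[y] = {"bold": b, "plain": 1.0 - b}
    tokens = []
    for _ in range(n_tokens):
        y = sample(rng, theta["p_y"])             # label from the global prior
        w = sample(rng, theta["p_w_given_y"][y])  # word from the global model
        f = sample(rng, phi_d[y])                 # formatting from the local model
        tokens.append((w, f, y))
    return phi_d, tokens
```

Sampling two documents under the same θ yields two different φ_d, which is exactly the scope-limited variation the model is built to absorb.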
Exact Bayesian inference in this model is intractable because the posterior p(θ, {φ_d} | data) requires integrating over a large number of latent variables. The authors therefore develop two practical approximation schemes:
- Maximum‑a‑Posteriori (MAP) point estimation via EM – In the E‑step the expected class assignments are computed given current estimates of φ_d; in the M‑step the global parameters θ and each φ_d are updated to maximise the posterior. This yields a fast, deterministic algorithm that can be applied online: when a new page arrives, its φ_d is estimated from the observed formatting and the current global model, and then classification proceeds using both sources of evidence.
- Variational Bayesian inference – A mean‑field factorisation q(θ) ∏_d q(φ_d) is introduced, and the Evidence Lower Bound (ELBO) is maximised iteratively. This approach retains a full posterior distribution over both global and local parameters, providing uncertainty estimates and often slightly better predictive performance at the cost of higher computational overhead.
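For the variational scheme, the bound being maximised has the familiar mean-field form. The following is a generic sketch in the notation of the summary above (not the paper's exact formulation), writing the label/word/formatting triple for token i of document d as (y_{di}, w_{di}, f_{di}):

```latex
\log p(\mathcal{D}) \;\ge\;
\mathbb{E}_{q}\!\Bigl[\log p(\theta)
  + \sum_{d}\log p(\phi_d)
  + \sum_{d}\sum_{i}\log p(y_{di}\mid\theta)\,
      p(w_{di}\mid y_{di},\theta)\,
      p(f_{di}\mid y_{di},\phi_d)\Bigr]
\;-\;
\mathbb{E}_{q}\!\Bigl[\log q(\theta) + \sum_{d}\log q(\phi_d)\Bigr]
```

Under the factorisation q(θ) ∏_d q(φ_d), each factor can be updated in turn while holding the others fixed, which is what the iterative ELBO maximisation amounts to.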
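The per-page MAP adaptation can be sketched as a small EM loop. This is an illustrative toy, not the paper's actual implementation: the global model θ is held fixed and hand-set, the vocabularies and the smoothing constant `alpha` (standing in for the prior on φ_d) are hypothetical, and only the local formatting model is re-estimated.

```python
LABELS = ["PERSON", "OTHER"]
FORMATS = ["bold", "plain"]

# Hypothetical fixed global model: label prior p(y) and word model p(w | y).
theta = {
    "p_y": {"PERSON": 0.5, "OTHER": 0.5},
    "p_w_given_y": {
        "PERSON": {"smith": 0.9, "the": 0.1},
        "OTHER": {"smith": 0.3, "the": 0.7},
    },
}

def adapt_and_classify(tokens, theta, n_iter=10, alpha=1.0):
    """EM-style sketch: estimate the page-local formatting model phi_d from
    one page of (word, formatting) tokens, then label the tokens using both
    global (word) and local (formatting) evidence."""
    # Start from a uniform local formatting model p(f | y).
    phi = {y: {f: 1.0 / len(FORMATS) for f in FORMATS} for y in LABELS}
    for _ in range(n_iter):
        # E-step: responsibilities p(y | w, f) under the current phi.
        resp = []
        for w, f in tokens:
            scores = {y: theta["p_y"][y] * theta["p_w_given_y"][y][w] * phi[y][f]
                      for y in LABELS}
            z = sum(scores.values())
            resp.append({y: s / z for y, s in scores.items()})
        # M-step: re-estimate phi from expected counts; the Dirichlet-style
        # smoothing alpha plays the role of the prior on phi_d.
        counts = {y: {f: alpha for f in FORMATS} for y in LABELS}
        for (w, f), r in zip(tokens, resp):
            for y in LABELS:
                counts[y][f] += r[y]
        for y in LABELS:
            total = sum(counts[y].values())
            phi[y] = {f: counts[y][f] / total for f in FORMATS}
    labels = [max(r, key=r.get) for r in resp]
    return phi, labels
```

On a page where an ambiguous word such as "smith" happens to be bolded whenever it names a person, the estimated φ pulls the bold tokens toward PERSON — the "retuning to local regularities" described above.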
The experimental evaluation uses a massive web‑scale collection comprising millions of tokens extracted from hundreds of thousands of pages. Two tasks are considered: (a) entity extraction, where each token must be labelled (e.g., PERSON, ORGANIZATION) and formatting cues such as boldness, font size, and HTML tags are available; (b) page‑level classification, where each whole page is assigned to a category (news, e‑commerce, blog, etc.) using both lexical content and structural cues (header hierarchy, navigation menus, banner placement). Baselines include a naïve Bayes classifier using only word content, a Conditional Random Field that concatenates formatting features globally, and a naïve per‑page model that trains a separate classifier for each page (which is infeasible in practice).
Results show that the hierarchical model with locally estimated φ_d consistently outperforms the baselines. In entity extraction the MAP version improves F1 by roughly 6.3 percentage points over the global‑only model, while the variational version adds an extra ~1 point at the expense of roughly double the training time. For page classification, accuracy gains of about 5.8 percentage points are reported. The improvements are most pronounced on pages where formatting varies dramatically, confirming that the model successfully captures scope‑limited regularities. Moreover, inspection of the learned φ_d parameters reveals interpretable patterns (e.g., “on this page, bold text strongly predicts PERSON labels”), aligning with human intuition.
The paper situates its contribution relative to prior work on domain adaptation, multi‑task learning, and meta‑learning. While those approaches also aim to transfer knowledge across domains, they typically require a predefined set of domains or auxiliary adaptation data. In contrast, the proposed method treats each document as its own “domain” and learns the adaptation parameters on the fly, without any extra supervision. This makes it especially suitable for open‑world web mining where new sites appear continuously.
In conclusion, the authors demonstrate that a simple hierarchical Bayesian framework—global word model plus per‑document latent formatting model—can automatically adjust to new, unseen formatting conventions and substantially boost extraction and classification performance. Future directions include extending the hierarchy to multiple levels (e.g., site‑level, section‑level), developing online variational updates for streaming scenarios, and integrating non‑textual modalities such as images or video thumbnails into the local parameterisation.