PROBER: Ad-Hoc Debugging of Extraction and Integration
Pipelines
Anish Das Sarma
Yahoo, CA, USA
anishdas@yahoo-inc.com
Alpa Jain
Yahoo, CA, USA
alpa@yahoo-inc.com
Philip Bohannon
Yahoo, CA, USA
plb@yahoo-inc.com
ABSTRACT
Complex information extraction (IE) pipelines assembled by plumb-
ing together off-the-shelf operators, specially customized opera-
tors, and operators re-used from other text processing pipelines are
becoming an integral component of most text processing frame-
works. A critical task faced by the IE pipeline user is to run a
post-mortem analysis on the output. Due to the diverse nature of
extraction operators (often implemented by independent groups), it
is time consuming and error-prone to describe operator semantics
formally or operationally to a provenance system.
We introduce the first system that helps IE users analyze pipeline
semantics and infer provenance interactively while debugging. This
allows the effort to be proportional to the need, and to focus on the
portions of the pipeline under the greatest suspicion. We present
a generic debugger for running post-execution analysis of any IE
pipeline consisting of arbitrary types of operators. We propose an
effective provenance model for IE pipelines that captures a variety
of operator types, ranging from those with full specifications to
those with none. We present a suite of algorithms to
effectively build provenance and facilitate debugging. Finally, we
present an extensive experimental study on large-scale real-world
extractions from an index of ∼500 million Web documents.
1. INTRODUCTION
Growing amounts of knowledge are being made available in the
form of unstructured text documents such as web pages, email,
news articles, etc. Information extraction (IE) systems identify
structured information (e.g., people names, relations between com-
panies, people, locations, etc.) and, not surprisingly, IE systems are
becoming a critical first-class operator in a large number of text-
processing frameworks. As a concrete example, search engines are
moving beyond a “keyword in, document out” paradigm to provid-
ing structured information relevant to users’ queries (e.g., provid-
ing contact information for businesses when user queries involve
business names). For this, search engines typically rely on hav-
ing available large repositories of structured information generated
from web pages or query logs using IE systems. With the increas-
ing complexity of IE pipelines, a critical exercise for IE developers
and even users is to debug, i.e., perform a thorough post-mortem
analysis of the output generated by running an entire or partial ex-
traction pipeline. Despite the popularity of IE pipelines, very little
attention has been given to building effective ways to trace the con-
trol or data flow through an extraction pipeline.
EXAMPLE 1.1. Consider an IE pipeline for extracting contact
information for businesses, namely, business name, address (one
or many), phone number (one or many), from a set of web pages.
The pipeline, among others, consists of operators (a) to clean
and parse html web pages, (b) to classify ‘blocks’ of text in a web
page as being useful or not for this task, (c) to extract business
names, and (d) to extract address(es). (We discuss this real-world
pipeline in detail later in Section 2.) Two interesting points to note here: First, in
practice, such complex pipelines may be put together using off-the-
shelf operators (e.g., html parsers or segmenters) along with some
newly designed as well as some re-usable operators from other
systems. Second, IE is an error-prone process and, oftentimes, output
from an IE pipeline may miss some information (e.g., a record
where contact information is present but business name is absent)
or may generate unexpected output (e.g., associate a fax number
with a business instead of phone number).
Say a user of this IE pipeline processes a batch of web pages and
generates a set of (partial, complete, or incorrect) output records.
Given the output, the user may be interested in understanding why
certain incorrect records were generated to identify and eliminate
their ‘sources’; similarly, the user may also be interested in under-
standing why certain records were missing attributes in the output
to identify the ‘restrictive’ operators in the pipeline. ∎
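To make the structure of such a pipeline concrete, the following is a minimal sketch (not PROBER's actual API; all operator names and record fields are hypothetical) of composable IE operators in which every output record carries the list of operators that produced it — the kind of record-level lineage a debugger needs to trace why a record was generated.

```python
# Hypothetical sketch of an IE pipeline built from composable operators,
# mirroring steps (a)-(d) in Example 1.1. Each record remembers the
# sequence of operators that produced it (a simple form of lineage).

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Record:
    """One extraction record, carrying the operators that produced it."""
    fields: dict
    lineage: List[str] = field(default_factory=list)

def make_operator(name: str, fn: Callable[[Record], List[Record]]):
    """Wrap an operator so every output record is tagged with this step."""
    def op(records: List[Record]) -> List[Record]:
        out = []
        for r in records:
            for produced in fn(r):
                produced.lineage = r.lineage + [name]
                out.append(produced)
        return out
    return op

# Two toy operators: strip raw html text, then take the leading segment
# as the business name. Real operators would be far more involved.
clean_html = make_operator("clean_html",
    lambda r: [Record({**r.fields, "text": r.fields["html"].strip()})])
extract_name = make_operator("extract_name",
    lambda r: [Record({**r.fields, "business": r.fields["text"].split(",")[0]})])

pipeline = [clean_html, extract_name]
records = [Record({"html": "  Acme Corp, 1 Main St  "})]
for op in pipeline:
    records = op(records)

print(records[0].fields["business"])   # Acme Corp
print(records[0].lineage)              # ['clean_html', 'extract_name']
```

When an output record is wrong or missing an attribute, its lineage list points at the suspect operators — the kind of question the user in Example 1.1 wants answered.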
To date, there have been two main approaches for understanding
the output from an IE system, but neither fully addresses the problem of
debugging arbitrary IE pipelines. The first approach is to build sta-
tistical models to predict the output quality of an IE system [6, 9].
However, these models address the more modest goal of assessing
the overall output quality and lack the intuitive interaction neces-
sary for building debuggers to trace the generation of an output
record. The second approach involves using complete knowledge
of how each operator functions. As highlighted by the above exam-
ple, prior information regarding the specifications of th
…(Full text truncated)…