Is This a Good Title?
Martin Klein
Department of Computer Science
Old Dominion University
Norfolk, VA, 23529
mklein@cs.odu.edu
Jeffery Shipman
Department of Computer Science
Old Dominion University
Norfolk, VA, 23529
jshipman@cs.odu.edu
Michael L. Nelson
Department of Computer Science
Old Dominion University
Norfolk, VA, 23529
mln@cs.odu.edu
ABSTRACT
Missing web pages, URIs that return the 404 “Page Not Found” error or the HTTP response code 200 but dereference unexpected content, are ubiquitous in today’s browsing experience. We use Internet search engines to relocate such missing pages and provide means that help automate the rediscovery process. We propose querying web pages’ titles against search engines. We investigate the retrieval performance of titles and compare them to lexical signatures, which are derived from the pages’ content. Since titles naturally represent the content of a document, they intuitively change over time. We measure the edit distance between current titles and titles of copies of the same pages obtained from the Internet Archive and display their evolution. We further investigate the correlation between title changes and content modifications of a web page over time. Lastly, we provide a predictive model for the quality of any given web page title in terms of its discovery performance. Our results show that titles return more than 60% of URIs top ranked, with further relevant content returned in the top 10 results. We show that titles decay slowly but are far more stable than the pages’ content. We further distill stop titles that can help identify insufficiently performing search engine queries.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval
General Terms
Measurement, Performance, Design
Keywords
Web Page Titles, Web Page Discovery, Digital Preservation
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
HT’10, June 13–16, 2010, Toronto, Ontario, Canada.
Copyright 2010 ACM 978-1-4503-0041-4/10/06 ...$10.00.
1. INTRODUCTION
Inaccessible web pages and “404 Page Not Found” responses are part of the web browsing experience. Despite guidance for how to create “Cool URIs” that do not change [6], there are many reasons why URIs or even entire websites break [29].
Since web users frequently re-visit web pages [1], a 404 response constitutes a detriment to their browsing experience. However, we claim that information on the web is rarely completely lost; it is just missing. In whole or in part, content is often just moving from one URI to another. Figure 1 graphically explains this URI content mapping problem, showing four scenarios with URIs (U) mapping to the same and to different content (C) over time. Furthermore, Figure 2 shows an example of a web page whose content has moved within a three-year period. Figure 2(a) shows the content of the original URI of the Hypertext 2006 conference as displayed in 8/2009. The original URI clearly does not hold conference-related content anymore. Our suspicion is that the website administrators did not renew the domain registration, thereby enabling someone else to take over. However, the content is not lost. The title of the original web page was ACM Hypertext 2006. Querying it against today’s search engines results in discovering the content at its new URI. Yahoo and Bing return the new page top ranked and Google returns it ranked fourth. Figure 2(b) shows the content which is now available at a new URI.
Figure 1: The URI Content Mapping Problem
It is our intuition that major search engines like Google, Yahoo and MSN Live (now Bing), as members of what we call the Web Infrastructure (WI), likely have crawled the content and possibly even stored a copy in their cache. Therefore the content is not lost; it “just” needs to be rediscovered. The WI, explored in detail in [21, 30, 31], also includes non-profit archives such as the Internet Archive (IA) or the European Archive as well as large-scale academic digital data preservation projects, e.g., CiteSeer and NSDL.
arXiv:1004.2719v1 [cs.IR] 15 Apr 2010
It is our goal to utilize the WI for digital preservation and in particular for the rediscovery of missing web pages. Therefore we need to explore the notion of the “aboutness” of the missing pages. Lexical signatures (LSs) have been shown to be suitable for this purpose [24, 26, 33, 34], but they are expensive to generate since the inverse document frequency (IDF) value needs to be acquired for each candidate term, for example by querying search engines. In the worst case the cost is one query for each term. In this paper
…(Full text truncated)…
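To make the cost argument about lexical signatures from the introduction concrete, the following is a minimal sketch, not the authors' implementation. It assumes a common TF-IDF-style scoring (term frequency times log of corpus size over document frequency), which may differ from the formulation in the cited work. The names `lexical_signature` and `CountingLookup`, the toy document-frequency table, and the corpus size of 1000 are all hypothetical; in practice the lookup would be backed by a search engine's estimated hit counts.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List


def lexical_signature(text: str,
                      doc_frequency: Callable[[str], int],
                      corpus_size: int,
                      k: int = 5) -> List[str]:
    """Return the k terms with the highest TF-IDF score.

    doc_frequency is a hypothetical lookup (for example a search
    engine's estimated hit count); it is called once per distinct term.
    """
    terms = re.findall(r"[a-z0-9]+", text.lower())
    tf = Counter(terms)
    scores: Dict[str, float] = {}
    for term, freq in tf.items():
        df = max(1, doc_frequency(term))  # one lookup per candidate term
        scores[term] = freq * math.log(corpus_size / df)
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


class CountingLookup:
    """Toy stand-in for a search engine that counts its queries."""

    def __init__(self, table: Dict[str, int], default: int = 1) -> None:
        self.table = table
        self.default = default
        self.queries = 0

    def __call__(self, term: str) -> int:
        self.queries += 1
        return self.table.get(term, self.default)


# Assumed document frequencies in a made-up corpus of 1000 pages.
toy_df = {"the": 900, "of": 950, "conference": 40,
          "hypertext": 3, "acm": 10, "2006": 50}
lookup = CountingLookup(toy_df)
page = ("The ACM Hypertext 2006 conference: "
        "the conference of the hypertext community")
signature = lexical_signature(page, lookup, corpus_size=1000, k=3)
print(signature, lookup.queries)
```

The counter shows why this is expensive: the toy page has seven distinct terms, so seven lookups are issued before a three-term signature can be chosen, whereas a title can be submitted as a query without any such preprocessing.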