Is This a Good Title?

📝 Original Info

  • Title: Is This a Good Title?
  • ArXiv ID: 1004.2719
  • Date: 2010-04-19
  • Authors: Martin Klein (Old Dominion University), Jeffery Shipman (Old Dominion University), Michael L. Nelson (Old Dominion University)

📝 Abstract

Missing web pages, URIs that return the 404 "Page Not Found" error or the HTTP response code 200 but dereference unexpected content, are ubiquitous in today's browsing experience. We use Internet search engines to relocate such missing pages and provide means that help automate the rediscovery process. We propose querying web pages' titles against search engines. We investigate the retrieval performance of titles and compare them to lexical signatures which are derived from the pages' content. Since titles naturally represent the content of a document they intuitively change over time. We measure the edit distance between current titles and titles of copies of the same pages obtained from the Internet Archive and display their evolution. We further investigate the correlation between title changes and content modifications of a web page over time. Lastly we provide a predictive model for the quality of any given web page title in terms of its discovery performance. Our results show that titles return more than 60% of URIs top ranked, with further relevant content returned in the top 10 results. We show that titles decay slowly but are far more stable than the pages' content. We further distill stop titles that can help identify insufficiently performing search engine queries.
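
The abstract mentions measuring the edit distance between a page's current title and the titles of its archived copies. The excerpt does not say which edit distance is used or how it is normalized, so the following is only a minimal sketch that assumes the classic Levenshtein distance and a normalization by the longer title's length. The function names and the example of a later title are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    # Keep the shorter string in the inner loop to limit row length.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized_title_distance(old_title: str, new_title: str) -> float:
    """Assumed normalization: distance divided by the longer title's
    length, giving 0.0 for identical titles and 1.0 for a full rewrite."""
    longest = max(len(old_title), len(new_title))
    if longest == 0:
        return 0.0
    return levenshtein(old_title, new_title) / longest


# "ACM Hypertext 2006" is the title quoted in the paper; the second
# title is a made-up later version used purely for illustration.
archived_title = "ACM Hypertext 2006"
current_title = "Hypertext 2006"
score = normalized_title_distance(archived_title, current_title)
```

Comparing each archived copy's title against the current one in this way would yield one score per snapshot, which is one plausible way to plot how a title evolves over time.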

📄 Full Content

Is This a Good Title?

Martin Klein, Department of Computer Science, Old Dominion University, Norfolk, VA, 23529, mklein@cs.odu.edu
Jeffery Shipman, Department of Computer Science, Old Dominion University, Norfolk, VA, 23529, jshipman@cs.odu.edu
Michael L. Nelson, Department of Computer Science, Old Dominion University, Norfolk, VA, 23529, mln@cs.odu.edu

ABSTRACT

Missing web pages, URIs that return the 404 “Page Not Found” error or the HTTP response code 200 but dereference unexpected content, are ubiquitous in today’s browsing experience. We use Internet search engines to relocate such missing pages and provide means that help automate the rediscovery process. We propose querying web pages’ titles against search engines. We investigate the retrieval performance of titles and compare them to lexical signatures which are derived from the pages’ content. Since titles naturally represent the content of a document they intuitively change over time. We measure the edit distance between current titles and titles of copies of the same pages obtained from the Internet Archive and display their evolution. We further investigate the correlation between title changes and content modifications of a web page over time. Lastly we provide a predictive model for the quality of any given web page title in terms of its discovery performance. Our results show that titles return more than 60% of URIs top ranked, with further relevant content returned in the top 10 results. We show that titles decay slowly but are far more stable than the pages’ content. We further distill stop titles that can help identify insufficiently performing search engine queries.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms

Measurement, Performance, Design

Keywords

Web Page Titles, Web Page Discovery, Digital Preservation

1. INTRODUCTION

Inaccessible web pages and “404 Page Not Found” responses are part of the web browsing experience.
Despite guidance for how to create “Cool URIs” that do not change [6], there are many reasons why URIs or even entire websites break [29]. Since web users frequently revisit web pages [1], a 404 response constitutes a detriment to their browsing experience. However, we claim that information on the web is rarely completely lost; it is just missing. In whole or in part, content is often just moving from one URI to another. Figure 1 graphically explains this URI content mapping problem, showing four scenarios with URIs (U) mapping to the same and to different content (C) over time.

Figure 1: The URI Content Mapping Problem

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. HT’10, June 13–16, 2010, Toronto, Ontario, Canada. Copyright 2010 ACM 978-1-4503-0041-4/10/06 ...$10.00.

Furthermore, Figure 2 shows an example of a web page whose content has moved within a three-year period. Figure 2(a) shows the content of the original URI of the Hypertext 2006 conference as displayed in 8/2009. The original URI clearly does not hold conference-related content anymore. Our suspicion is that the website administrators did not renew the domain registration, thereby enabling someone else to take over. However, the content is not lost. The title of the original web page was ACM Hypertext 2006. Querying it against today’s search engines results in discovering the content at its new URI. Yahoo and Bing return the new page top ranked and Google returns it ranked fourth. Figure 2(b) shows the content which is now available at a new URI.
It is our intuition that major search engines like Google, Yahoo and MSN Live (now Bing), as members of what we call the Web Infrastructure (WI), likely have crawled the content and possibly even stored a copy in their cache. Therefore the content is not lost, it “just” needs to be rediscovered. The WI, explored in detail in [21, 30, 31], also includes non-profit archives such as the Internet Archive (IA) or the European Archive as well as large-scale academic digital data preservation projects, e.g., CiteSeer and NSDL.

arXiv:1004.2719v1 [cs.IR] 15 Apr 2010

It is our goal to utilize the WI for digital preservation and in particular for the rediscovery of missing web pages. Therefore we need to explore the notion of the “aboutness” of the missing pages. Lexical signatures (LSs) have been shown to be suitable for this purpose [24, 26, 33, 34] but they are expensive to generate since the inverse document frequency (IDF) value needs to be acquired for each candidate term, for example by querying search engines. In the worst case the cost is one query for each term. In this paper

…(Full text truncated)…
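
The introduction notes that lexical signatures are expensive because an IDF value must be acquired for each candidate term, in the worst case at the cost of one query per term. The sketch below illustrates that cost under stated assumptions: terms are ranked by a plain TF-IDF score, the IDF formula `log(N / (1 + df))` is one common variant rather than the one the paper necessarily uses, and the signature length `k` is arbitrary. The `document_frequency` callable is a hypothetical stand-in for a search-engine lookup; here it is stubbed with a toy dictionary, and all numbers are made up.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def lexical_signature(
    text: str,
    document_frequency: Callable[[str], int],
    corpus_size: int,
    k: int = 5,
) -> Tuple[List[str], int]:
    """Return the k highest-scoring TF-IDF terms and the number of
    document-frequency lookups that were needed to rank them."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    term_counts = Counter(terms)
    lookups = 0
    scored = []
    for term, count in term_counts.items():
        df = document_frequency(term)  # one lookup per distinct term
        lookups += 1
        idf = math.log(corpus_size / (1 + df))  # assumed IDF variant
        scored.append((count * idf, term))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]], lookups


# Toy stand-in for a search engine's document counts (invented values).
TOY_DOCUMENT_FREQUENCIES = {"the": 900, "conference": 50, "acm": 20, "hypertext": 5}
TOY_CORPUS_SIZE = 1000


def toy_document_frequency(term: str) -> int:
    return TOY_DOCUMENT_FREQUENCIES.get(term, 0)


SAMPLE_TEXT = "The Hypertext conference: the ACM Hypertext"
signature, lookups = lexical_signature(
    SAMPLE_TEXT, toy_document_frequency, TOY_CORPUS_SIZE, k=2
)
```

The lookup counter grows with the number of distinct terms on the page, whereas a title-based query needs no such lookups at all, which is the contrast the introduction draws.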
