Agents, Bookmarks and Clicks: A topical model of Web traffic
Analysis of aggregate and individual Web traffic has shown that PageRank is a poor model of how people navigate the Web. Using the empirical traffic patterns generated by a thousand users, we characterize several properties of Web traffic that cannot be reproduced by Markovian models. We examine both aggregate statistics capturing collective behavior, such as page and link traffic, and individual statistics, such as entropy and session size. No model currently explains all of these empirical observations simultaneously. We show that all of these traffic patterns can be explained by an agent-based model that takes into account several realistic browsing behaviors. First, agents maintain individual lists of bookmarks (a non-Markovian memory mechanism) that are used as teleportation targets. Second, agents can retreat along visited links, a branching mechanism that also allows us to reproduce behaviors such as the use of a back button and tabbed browsing. Finally, agents are sustained by visiting novel pages of topical interest, with adjacent pages being more topically related to each other than distant ones. This modulates the probability that an agent continues to browse or starts a new session, allowing us to recreate heterogeneous session lengths. The resulting model is capable of reproducing the collective and individual behaviors we observe in the empirical data, reconciling the narrowly focused browsing patterns of individual users with the extreme heterogeneity of aggregate traffic measurements. This result allows us to identify a few salient features that are necessary and sufficient to interpret the browsing patterns observed in our data. In addition to the descriptive and explanatory power of such a model, our results may lead the way to more sophisticated, realistic, and effective ranking and crawling algorithms.
💡 Research Summary
The paper investigates why traditional Markovian models of web navigation, most notably PageRank, fail to capture the richness of real user behavior observed in large‑scale click data. Using a two‑month capture of HTTP traffic from a dormitory network at Indiana University, the authors collected over 408 million requests from 1,083 unique MAC addresses. After filtering out non‑page requests, anonymizing query strings, and discarding automated traffic, the final dataset comprises roughly 29.5 million page requests from 967 active users, organized into more than 11 million logical sessions. Sessions are defined not by idle timeouts but by constructing trees from referrer information, which naturally accommodates tabbed browsing and back‑button usage.
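The referrer-based session definition can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes each request is reduced to a `(referrer, target)` pair for a single user, that an empty referrer starts a new session, and that a request whose referrer was never seen also starts one (the summary does not say how that case is handled). The function name `build_sessions` and the dictionary layout are hypothetical.

```python
def build_sessions(requests):
    """Group one user's time-ordered requests into referrer trees.

    `requests` yields (referrer, target) pairs; referrer is None when empty.
    Assumption: an unseen referrer starts a new session, like an empty one.
    """
    session_of = {}  # page -> index of the session that most recently included it
    sessions = []    # each session: root page, list of edges, depth of each page

    for referrer, target in requests:
        if referrer is None or referrer not in session_of:
            # No usable referrer: this request becomes the root of a new tree.
            sessions.append({"root": target, "edges": [], "depth": {target: 0}})
            sid = len(sessions) - 1
        else:
            # Attach the request to the session that most recently saw the referrer.
            sid = session_of[referrer]
            session = sessions[sid]
            session["edges"].append((referrer, target))
            if target not in session["depth"]:
                session["depth"][target] = session["depth"][referrer] + 1
        session_of[target] = sid
    return sessions


def session_size(session):
    """Number of distinct pages in the tree."""
    return len(session["depth"])


def session_depth(session):
    """Largest distance from the root to any page in the tree."""
    return max(session["depth"].values())


if __name__ == "__main__":
    clicks = [(None, "a"), ("a", "b"), ("a", "c"), ("c", "d"), (None, "x"), ("x", "y")]
    for s in build_sessions(clicks):
        print(s["root"], session_size(s), session_depth(s))
```

Because a request is attached to whichever session contains its referrer, two clicks from the same page (for example, links opened in separate tabs) become sibling branches of one tree rather than separate sessions, and no idle timeout is needed.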
Statistical analysis of this dataset reveals several discrepancies with PageRank: (1) the distribution of page and link traffic is far more heterogeneous than the uniform random walk predicts; (2) the popularity of session‑starting pages is highly skewed; (3) individual users exhibit low Shannon entropy (i.e., they repeatedly visit a limited set of pages) while the aggregate population shows high entropy; and (4) session size and depth follow heavy‑tailed distributions that cannot be reproduced by a simple random walk with uniform teleportation. The authors also evaluate BookRank, a previous extension of PageRank that adds a memory mechanism via ranked bookmarks. BookRank improves the fit for aggregate traffic patterns but still cannot reproduce the observed entropy and session‑size distributions.
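The entropy contrast in point (3) is easy to see on toy data. The sketch below assumes entropy is computed over the empirical distribution of pages visited, in bits; the exact quantity the authors measure may differ, and the helper name `shannon_entropy` and the visit lists are made up for illustration.

```python
import math
from collections import Counter


def shannon_entropy(visits):
    """Shannon entropy (in bits) of the empirical distribution of visited pages."""
    counts = Counter(visits)
    total = sum(counts.values())
    # p * log2(1/p) summed over pages; an empty list yields 0.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    # Two hypothetical users, each focused on a small, different set of pages.
    user_a = ["mail", "mail", "mail", "news"]
    user_b = ["wiki", "wiki", "wiki", "blog"]
    print(shannon_entropy(user_a), shannon_entropy(user_b))
    print(shannon_entropy(user_a + user_b))
```

Each user's traffic is concentrated on a few pages, so individual entropy is low, while pooling users with different favorites spreads the aggregate distribution over more pages and raises its entropy.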
To address these shortcomings, the authors propose a new agent‑based model called ABC (Agents, Bookmarks, Clicks). ABC incorporates three realistic browsing ingredients:
- Bookmark Memory: Each agent maintains a personal list of bookmarked pages, ranked by visitation frequency. When a new session begins, the agent “teleports” to a bookmark chosen probabilistically from this list, rather than to a uniformly random page. This mechanism reproduces the observed diversity of session‑starting pages and the non‑uniform teleportation pattern seen in real data.
- Back‑Button and Tab‑Browsing: Agents can retreat along previously visited links, mimicking the back button, and can also open new branches in the session tree, representing tabbed browsing. This branching process generates the heavy‑tailed distribution of session depth and size, matching empirical observations.
- Topical Locality and Interest‑Driven Continuation: Pages carry topical content, and pages adjacent in the link graph are more topically related to each other than distant ones. Agents are sustained by visiting novel pages that match their interests, and this modulates the probability that an agent continues to browse or starts a new session from a bookmark. The result is heterogeneous session lengths, with agents tending to stay within topical “neighborhoods” of the Web.
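The three ingredients above can be combined in a toy simulation. This is a rough sketch under stated assumptions, not the authors' implementation: the energy bookkeeping, the rank-based choice with probability proportional to `rank ** -beta`, the rule that every visited page joins the bookmark list, and all parameter names and defaults (`beta`, `p_back`, `cost`, `reward`, `start_energy`) are illustrative choices. Topical locality is assumed to be encoded in the supplied `relevance` scores, with linked pages given similar values.

```python
import random


def pick_by_rank(ranked, beta, rng):
    """Pick an item with probability proportional to rank ** -beta (rank 1 first)."""
    weights = [(r + 1) ** -beta for r in range(len(ranked))]
    return rng.choices(ranked, weights=weights, k=1)[0]


def simulate_agent(links, relevance, bookmarks, steps, rng,
                   beta=1.0, p_back=0.3, cost=1.0, reward=2.0, start_energy=3.0):
    """Return a list of sessions, each a list of pages in visit order.

    All parameters are hypothetical; `links` maps a page to a list of pages.
    """
    visits = {page: 0 for page in bookmarks}  # visit counts rank the bookmarks
    seen = set()                              # pages that are no longer novel
    sessions, trail = [], []
    energy = 0.0                              # empty store forces an initial teleport

    for _ in range(steps):
        if energy <= 0 or not trail:
            # New session: teleport to a bookmark chosen by visit-frequency rank.
            ranked = sorted(visits, key=visits.get, reverse=True)
            page = pick_by_rank(ranked, beta, rng)
            trail = [page]
            sessions.append([page])
            energy = start_energy
        elif len(trail) > 1 and rng.random() < p_back:
            # Back button: retreat one step along the current path.
            trail.pop()
            page = trail[-1]
            sessions[-1].append(page)
        elif links.get(trail[-1]):
            # Forward click on a random outgoing link.
            page = rng.choice(links[trail[-1]])
            trail.append(page)
            sessions[-1].append(page)
        else:
            # Dead end: give up and start a new session on the next step.
            energy = 0.0
            continue

        energy -= cost                        # every click costs energy
        if page not in seen:                  # only novel pages are rewarding
            energy += reward * relevance.get(page, 0.0)
            seen.add(page)
        visits[page] = visits.get(page, 0) + 1  # simplification: every page is bookmarked
    return sessions


if __name__ == "__main__":
    links = {"home": ["a", "b"], "a": ["c"], "b": ["c", "home"], "c": []}
    relevance = {"home": 0.2, "a": 0.9, "b": 0.4, "c": 0.8}
    for s in simulate_agent(links, relevance, ["home"], 40, random.Random(1)):
        print(s)
```

In this sketch, session length is not fixed in advance: an agent that keeps finding novel, relevant pages replenishes its energy and browses longer, while one that revisits known or irrelevant pages runs out quickly and teleports to a bookmark.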
Simulations of ABC, calibrated on the empirical data, demonstrate that the model simultaneously reproduces: (a) the empirical power‑law distributions of page and link traffic; (b) the skewed popularity of session‑starting pages; (c) the heavy‑tailed distributions of session size and depth; and (d) the low per‑user entropy together with high aggregate entropy. Quantitatively, ABC outperforms both PageRank and BookRank across all measured metrics.
Beyond the immediate empirical fit, the authors discuss broader implications. A more accurate navigation model can improve search‑engine ranking algorithms by accounting for user memory and topical drift, guide intelligent web crawlers to prioritize pages that are likely to be revisited, and enhance advertising revenue forecasts by providing realistic traffic predictions.
In summary, the paper makes three major contributions: (1) a comprehensive empirical characterization of both collective and individual web‑traffic patterns using a uniquely rich dataset; (2) a critical evaluation of existing Markovian models, highlighting their inability to capture memory, backtracking, and topicality; and (3) the introduction of the ABC model, which shows that a combination of bookmark‑based teleportation, back‑button/branching behavior, and topical interest is both necessary and sufficient to explain the full spectrum of observed web‑navigation phenomena. This work thus bridges the gap between narrowly focused individual browsing behavior and the extreme heterogeneity seen in aggregate web traffic, offering a solid foundation for future research and practical applications in web analytics, search, and crawling.