Remembering what we like: Toward an agent-based model of Web traffic
Analysis of aggregate Web traffic has shown that PageRank is a poor model of how people actually navigate the Web. Using the empirical traffic patterns generated by a thousand users over the course of two months, we characterize the properties of Web traffic that cannot be reproduced by Markovian models, in which destinations are independent of past decisions. In particular, we show that the diversity of sites visited by individual users is smaller and more broadly distributed than predicted by the PageRank model; that link traffic is more broadly distributed than predicted; and that the time between consecutive visits to the same site by a user is less broadly distributed than predicted. To account for these discrepancies, we introduce a more realistic navigation model in which agents maintain individual lists of bookmarks that are used as teleportation targets. The model can also account for branching, a traffic property caused by browser features such as tabs and the back button. The model reproduces aggregate traffic patterns such as site popularity, while also generating more accurate predictions of diversity, link traffic, and return time distributions. This model for the first time allows us to capture the extreme heterogeneity of aggregate traffic measurements while explaining the more narrowly focused browsing patterns of individual users.
💡 Research Summary
The paper confronts a long‑standing assumption in web‑traffic modeling: that user navigation can be captured by a Markovian process such as PageRank, where each click depends only on the current page’s outgoing links and a small, uniformly random “teleportation” probability. Using a rich dataset collected from 1,000 volunteers over two months, the authors demonstrate that three aggregate properties of real traffic are systematically mis‑predicted by the PageRank model. First, the diversity of sites visited by an individual user is far lower than PageRank would suggest, and the distribution of this diversity across users is highly skewed. Second, the distribution of traffic across hyperlinks (link traffic) is far broader than the power‑law shape implied by a pure random‑walk model. Third, the inter‑visit time to the same site (return time) is less dispersed; users tend to revisit a site after short intervals, a pattern that PageRank’s memoryless assumption cannot reproduce.
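The memoryless baseline that the paper argues against can be sketched in a few lines of Python. The toy graph, teleportation probability, and step count below are illustrative choices, not values from the paper; the point is only that each move depends solely on the current page:

```python
import random

def pagerank_surfer(graph, steps, teleport_p=0.15, seed=0):
    """Simulate a memoryless PageRank-style random surfer.

    graph: dict mapping each page to a list of outgoing links.
    At each step, with probability teleport_p (or from a dangling page)
    the surfer jumps to a uniformly random page; otherwise it follows
    a uniformly random outgoing link. No past decision matters.
    """
    rng = random.Random(seed)
    pages = list(graph)
    current = rng.choice(pages)
    visits = {p: 0 for p in pages}
    for _ in range(steps):
        if rng.random() < teleport_p or not graph[current]:
            current = rng.choice(pages)            # uniform teleportation
        else:
            current = rng.choice(graph[current])   # follow a random link
        visits[current] += 1
    return visits

# Toy graph: three pages in a cycle plus one dangling page.
toy = {"a": ["b"], "b": ["c"], "c": ["a"], "d": []}
counts = pagerank_surfer(toy, steps=10_000)
```

The visit counts approximate PageRank scores on the toy graph; because the walk is Markovian, the same stationary behavior emerges regardless of any individual's history, which is exactly the assumption the empirical data contradicts.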
To explain these discrepancies, the authors propose an agent‑based model that augments the random‑walk with two realistic behavioral mechanisms. Each simulated agent maintains a personal bookmark list; when a teleportation event occurs, the destination is drawn from this list rather than from a uniform distribution over all pages. This captures the empirical observation that a substantial fraction of jumps are driven by direct URL entry, bookmarks, or other user‑specific shortcuts. In addition, the model incorporates a “branching” process that mimics browser features such as opening new tabs or using the back button. With a certain probability, an agent clones its current navigation path and continues along a new link, thereby creating parallel browsing streams that later converge.
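The two mechanisms can be illustrated with a minimal agent sketch. Everything here is a simplifying assumption for exposition: the parameter names (`p_teleport`, `p_branch`, `p_back`), the uniform draw from the bookmark list, and the stack-of-tabs representation are not the paper's exact specification (the actual model may, for example, weight bookmarks by visit frequency):

```python
import random

def bookmark_branching_agent(graph, steps, p_teleport=0.15,
                             p_branch=0.05, p_back=0.10, seed=1):
    """Illustrative sketch of an agent with memory and parallelism.

    - Teleportation jumps land on a page drawn from the agent's own
      growing bookmark list, not on a uniform random page (memory).
    - With probability p_branch a followed link opens a "new tab"
      (push on a stack); with probability p_back the agent returns
      to an earlier stream (pop), mimicking tabs / the back button.
    """
    rng = random.Random(seed)
    pages = list(graph)
    bookmarks = [rng.choice(pages)]
    stack = [bookmarks[0]]            # parallel browsing streams
    trace = []
    for _ in range(steps):
        current = stack[-1]
        trace.append(current)
        bookmarks.append(current)     # remember what we visited
        if rng.random() < p_teleport:
            stack[-1] = rng.choice(bookmarks)   # jump to a bookmark
        elif len(stack) > 1 and rng.random() < p_back:
            stack.pop()                         # back button / close tab
        elif graph[current]:
            nxt = rng.choice(graph[current])
            if rng.random() < p_branch:
                stack.append(nxt)               # branch: open a new tab
            else:
                stack[-1] = nxt                 # follow link in place
        else:
            stack[-1] = rng.choice(bookmarks)   # dangling page: jump
    return trace

toy = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"], "d": ["a"]}
trace = bookmark_branching_agent(toy, steps=5_000)
```

Because revisited pages are appended to the bookmark list repeatedly, frequently visited sites become more likely teleport targets, which narrows each agent's diversity and shortens return times relative to the uniform-teleport surfer.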
The model’s parameters—bookmark‑teleport probability, branching probability, and the size distribution of bookmark lists—are calibrated against the empirical data. Simulations show that the model reproduces the overall site‑popularity distribution (the same heavy‑tailed pattern that PageRank captures) while simultaneously matching the observed distributions of user diversity, link traffic, and return times. Notably, the bookmark‑driven teleportation accounts for roughly 20–30% of all jumps in the simulated traffic, aligning with the measured proportion in the real logs. The branching mechanism is essential for reproducing the short‑interval return‑time peak, as it creates rapid revisits that would otherwise be impossible in a pure Markov chain.
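The calibration targets named above (per-user return times and diversity) are straightforward to compute from a click trace. The helper names and the tiny example trace below are illustrative, not from the paper:

```python
def return_times(trace):
    """Return-time distribution: for each site, the gap (in clicks)
    between consecutive visits within one user's click trace."""
    last_seen = {}
    gaps = []
    for t, site in enumerate(trace):
        if site in last_seen:
            gaps.append(t - last_seen[site])
        last_seen[site] = t
    return gaps

def diversity(trace):
    """Diversity of a user's traffic: number of distinct sites visited."""
    return len(set(trace))

clicks = ["a", "b", "a", "c", "a", "b"]
print(return_times(clicks))   # gaps for the repeat visits: [2, 2, 4]
print(diversity(clicks))      # 3 distinct sites
```

Comparing the distributions of these two quantities between real logs and simulated traces is one simple way to check whether a model's parameters have been tuned to reproduce the observed short-interval return-time peak and narrow per-user diversity.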
Beyond fitting the data, the authors discuss practical implications. Search engines and advertising platforms that rely solely on PageRank‑style link analysis may overlook a large, user‑specific component of traffic, leading to sub‑optimal ranking and targeting. Incorporating personalized teleportation sources (e.g., inferred bookmark lists or frequent direct‑entry URLs) could improve relevance predictions. Network operators could use the branching insight to better anticipate bursty traffic caused by tabbed browsing, thereby optimizing caching and load‑balancing strategies. Finally, web designers might deliberately shape the user interface (e.g., tab management, back‑button behavior) to influence traffic flow in predictable ways.
In summary, the paper provides the first comprehensive, agent‑based framework that captures both the extreme heterogeneity observed in aggregate web‑traffic measurements and the comparatively focused browsing patterns of individual users. By explicitly modeling memory (through bookmarks) and parallelism (through branching), the authors bridge the gap between theoretical random‑walk models and the nuanced reality of human web navigation, opening new avenues for research in web analytics, personalization, and network optimization.