Ergodic Control and Polyhedral Approaches to PageRank Optimization

Reading time: 6 minutes

📝 Original Info

  • Title: Ergodic Control and Polyhedral Approaches to PageRank Optimization
  • ArXiv ID: 1011.2348
  • Date: 2012-09-15
  • Authors: Olivier Fercoq, Marianne Akian, Mustapha Bouhtou, and Stéphane Gaubert

📝 Abstract

We study a general class of PageRank optimization problems which consist in finding an optimal outlink strategy for a web site subject to design constraints. We consider both a continuous problem, in which one can choose the intensity of a link, and a discrete one, in which in each page, there are obligatory links, facultative links and forbidden links. We show that the continuous problem, as well as its discrete variant when there are no constraints coupling different pages, can both be modeled by constrained Markov decision processes with ergodic reward, in which the webmaster determines the transition probabilities of websurfers. Although the number of actions turns out to be exponential, we show that an associated polytope of transition measures has a concise representation, from which we deduce that the continuous problem is solvable in polynomial time, and that the same is true for the discrete problem when there are no coupling constraints. We also provide efficient algorithms, adapted to very large networks. Then, we investigate the qualitative features of optimal outlink strategies, and identify in particular assumptions under which there exists a "master" page to which all controlled pages should point. We report numerical results on fragments of the real web graph.


📄 Full Content

The PageRank introduced by Brin and Page [1] is defined as the invariant measure of a walk made by a random surfer on the web graph. When reading a given page, the surfer either selects a link of the current page (with uniform probability) and moves to the page pointed to by that link, or interrupts the current search and moves to an arbitrary page, selected according to given "zapping" probabilities. The rank of a page is defined as its frequency of visit by the random surfer.
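As a concrete illustration of this walk, the surfer's transition matrix can be assembled from the web graph and the zapping probabilities. The sketch below is a minimal Python rendering under common assumptions (uniform choice among outlinks, a damping factor of 0.85 as popularized by Brin and Page, dangling pages resolved by zapping); the function and variable names are ours, not taken from the paper.

```python
import numpy as np

def google_matrix(adjacency, zapping, damping=0.85):
    """Random-surfer transition matrix (illustrative sketch).

    adjacency[i][j] = 1 if page i links to page j; `zapping` is the
    probability vector used when the surfer interrupts the search.
    """
    A = np.asarray(adjacency, dtype=float)
    z = np.asarray(zapping, dtype=float)
    n = A.shape[0]
    out_degree = A.sum(axis=1)
    P = np.zeros((n, n))
    for i in range(n):
        if out_degree[i] > 0:
            # with probability `damping`, follow a uniformly chosen link;
            # otherwise zap to an arbitrary page
            P[i] = damping * A[i] / out_degree[i] + (1 - damping) * z
        else:
            # dangling page: the surfer always zaps
            P[i] = z
    return P
```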

The interest of the PageRank algorithm is to give each page of the web a measure of its popularity. It is a link-based measure, meaning that it only takes into account the hyperlinks between web pages, and not their content. It is combined in practice with content-dependent measures, taking into account the relevance of the text of the page to the query of the user, in order to determine the order in which the answer pages will be shown by the search engine. This leads to a family of search methods the details of which may vary (and are often not publicly known). However, a general feature of these methods is that among the pages with a comparable relevance to a query, the ones with the highest PageRank will appear first.

The importance of optimizing the PageRank, especially for e-business purposes, has led to the development of a number of companies offering Search Engine Optimization services. We refer the reader in particular to [2] for a discussion of the PageRank optimization methods used in practice. Understanding PageRank optimization is also useful to fight malicious behaviors like link spamming, which aims to artificially increase the PageRank of a web page [3], [4].

The PageRank has motivated a number of works, dealing in particular with computational issues. Classically, the PageRank vector is computed by the power algorithm [1]. There has been considerable work on designing new, more efficient approaches for its computation [5,6]: the Gauss-Seidel method [7], aggregation/disaggregation [6], and distributed randomized algorithms [8,9]. Other active fields are the development of new ranking algorithms [10] and the study of the web graph [11].
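For reference, the power algorithm mentioned above is just the fixed-point iteration of the surfer's walk; a minimal sketch, reusing the hypothetical `google_matrix` from the previous snippet, follows. Convergence is geometric, at a rate governed by the damping factor, which is what makes the power method practical on web-scale graphs.

```python
import numpy as np

def pagerank(P, tol=1e-10, max_iter=1000):
    """Power iteration: the PageRank vector is the fixed point pi = pi @ P."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)      # start from the uniform distribution
    for _ in range(max_iter):
        new_pi = pi @ P           # one step of the random walk
        if np.abs(new_pi - pi).sum() < tol:
            break
        pi = new_pi
    return pi

# Toy usage on a 3-page graph; page 2 is dangling.
adjacency = [[0, 1, 1],
             [1, 0, 0],
             [0, 0, 0]]
zapping = np.full(3, 1 / 3)
pi = pagerank(google_matrix(adjacency, zapping))
print(pi)  # frequencies of visit of each page
```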

The optimization of PageRank has been studied by several authors. Avrachenkov and Litvak analyzed in [12] the case of a single controlled page and determined an optimal strategy. In [13], Mathieu and Viennot established several bounds indicating to what extent the rank of the pages of a (multi-page) website can be changed, and derived an optimal referencing strategy in a special unconstrained case: if the webmaster can arbitrarily fix the hyperlinks in a web site, then it is optimal to delete every link pointing outside the web site. To avoid such degenerate strategies, De Kerchove, Ninove and van Dooren [14] studied the problem of maximizing the sum of the PageRank coordinates in a web site, provided that from each page, there is at least one path consisting of hyperlinks and leading to an external page. They gave a necessary structural condition satisfied by an optimal outlink strategy. In [15], Ninove developed a heuristic based on these theoretical results, which was experimentally shown to be efficient. In [16], Ishii and Tempo investigated the sensitivity of the PageRank to fragile (i.e. erroneous or imperfectly known) web data, including fragile links (servers not responding, links to deleted pages, etc.). They gave bounds on the possible variation of PageRank and introduced an approximate PageRank optimization problem, which they showed to be equivalent to a linear program. In [17] (see also [18] for more details), Csáji, Jungers and Blondel thought of fragile links as controlled links and gave an algorithm to optimize in polynomial time the PageRank of a single page.

In the present paper, we study a more general PageRank optimization problem, in which a webmaster, controlling a set of pages (her web site), wishes to maximize a utility function depending on the PageRank or, more generally, on the associated occupation measure (the frequencies of visit of every link, which are more informative). For instance, the webmaster might wish to maximize the number of clicks per time unit on a certain hyperlink bringing an income, or the rank of the most visible page of her site, or the sum of the ranks of the pages of this site, etc. We consider specifically two versions of the PageRank optimization problem.
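Both kinds of objectives are simple functions of the quantities computed in the sketches above. The snippet below is illustrative only (the names are ours, not the paper's); it uses the standard fact that, for an ergodic chain, the long-run frequency of traversing the link (i, j) is pi[i] * P[i, j].

```python
def link_click_rate(pi, P, i, j):
    """Occupation measure of one link: long-run fraction of steps in which
    the surfer sits on page i and follows the hyperlink to page j."""
    return pi[i] * P[i, j]

def site_pagerank(pi, site_pages):
    """Sum of the PageRank coordinates of the pages the webmaster controls."""
    return sum(pi[p] for p in site_pages)
```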

We first study a continuous version of the problem in which the set of actions of the webmaster is the set of admissible transition probabilities of websurfers. This means that the webmaster, by choosing the importance of the hyperlinks of the pages she controls (size of font, color, position of the link within the page), determines a continuum of possible transition probabilities. Although this model has already been proposed by Nemirovsky and Avrachenkov [19], its optimization does not seem to have been considered previously. This continuous version incl…
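As an illustration of this action space, here is a hedged sketch of how chosen link intensities could translate into one row of the transition matrix. The normalization rule and the 0.85 damping factor are our assumptions for the example; the paper works with a general polytope of admissible transition probabilities.

```python
import numpy as np

def controlled_row(intensities, zapping, damping=0.85):
    """Transition probabilities out of one controlled page, given the chosen
    intensities (font size, color, position, ...) of its hyperlinks."""
    w = np.asarray(intensities, dtype=float)
    z = np.asarray(zapping, dtype=float)
    return damping * w / w.sum() + (1 - damping) * z

# Doubling the intensity of the first link raises the probability that
# the surfer follows it, at the expense of the other links.
print(controlled_row([1.0, 1.0], np.full(2, 0.5)))  # -> [0.5    0.5   ]
print(controlled_row([2.0, 1.0], np.full(2, 0.5)))  # -> roughly [0.642 0.358]
```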


Reference

This content is AI-processed based on open access ArXiv data.
