Power-law Distributions in Information Science - Making the Case for Logarithmic Binning

Reading time: 5 minutes

📝 Original Info

  • Title: Power-law Distributions in Information Science - Making the Case for Logarithmic Binning
  • ArXiv ID: 1011.1533
  • Date: 2012-04-03
  • Authors: Staša Milojević

📝 Abstract

We suggest partial logarithmic binning as the method of choice for uncovering the nature of many distributions encountered in information science (IS). Logarithmic binning retrieves information and trends "not visible" in noisy power-law tails. We also argue that obtaining the exponent from logarithmically binned data using a simple least square method is in some cases warranted in addition to methods such as the maximum likelihood. We also show why often used cumulative distributions can make it difficult to distinguish noise from genuine features, and make it difficult to obtain an accurate power-law exponent of the underlying distribution. The treatment is non-technical, aimed at IS researchers with little or no background in mathematics.

📄 Full Content

Information science (IS) is replete with distributions that can be characterized as a power law. Examples include author productivity (Egghe, 2005; Lotka, 1926; Pao, 1986), citations received by papers (Price, 1965; Redner, 1998), scattering of scientific literature (Bradford, 1934; Nicolaisen & Hjørland, 2007), and collaborative tagging behavior (Golder & Huberman, 2006). These constitute only a subset of empirically found distributions that follow a power law (an excellent overview of power laws and of processes that can lead to them is given by Newman (2005)).

Here we focus on some aspects of power law functions that are relevant to IS researchers. Mostly, we want to provide a method, logarithmic binning, that will help researchers recognize the presence or absence of power laws in their data. Descriptions will refrain from using mathematical formalism in order to make it useful for those who do not have mathematical or physical sciences training.

While detailed technical reviews of power laws exist in recent literature (e.g., Clauset, Shalizi, & Newman (2009) and Newman (2005)), these do not devote much attention to the logarithmic binning method. Binning is simply a procedure of averaging the data that fall in certain ranges of values (bins), and here we use it to “beat” the statistical noise and thus reveal the trends in the data. Binning is logarithmic, meaning that the bins have equal sizes in logarithms, which is, as we will see, a natural choice for a power law. Logarithmic binning is given some consideration (especially as a vast improvement over unbinned representations) in Adamic’s (n.a.) online tutorial, but in it most attention is given to Pareto’s cumulative distribution, which, we will argue, is not always a better alternative.
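As a minimal sketch of the binning procedure described above (not the author's own code; the sample, the number of bins, and the use of NumPy are illustrative assumptions), logarithmically binning a heavy-tailed sample and normalizing by bin width might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heavy-tailed sample: integer values >= 1 with a power-law
# tail (exponent ~2), generated from numpy's Pareto (Lomax) sampler.
data = np.floor(rng.pareto(1.0, size=10_000) + 1).astype(int)

# Logarithmic bins: edges equally spaced in log space, so each bin spans
# the same factor in x -- the natural choice for a power law.
edges = np.logspace(0, np.log10(data.max() + 1), num=15)

counts, edges = np.histogram(data, bins=edges)
widths = np.diff(edges)

# Divide by bin width so that wide high-x bins are comparable to narrow
# low-x bins; this averages away the noise in the sparse tail.
density = counts / widths

# The geometric mean of the edges is a natural "center" for a log bin.
centers = np.sqrt(edges[:-1] * edges[1:])
```

Plotting `density` against `centers` on log-log axes would then show the power-law trend as an approximately straight line, with far less tail noise than the unbinned data.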

Power-law distributions can be represented mathematically by power-law functions of the form

y = c · x^(-a),

where a is the power-law exponent and c an overall scale, or normalization. Power-law functions are monotonic, which means that as x changes, y either only decreases or only increases. When power laws are used to describe distributions, the exponent a is typically positive, meaning that as x increases, y decreases. Qualitatively, this means that objects or events with a high value of some quantity are typically rare (there are few very prolific authors, very large cities, etc.). Power-law distributions lead to phenomena such as the 80:20 rule. This rule, also known as the Pareto principle, was conceptualized by J. M. Juran, and states that 20% of causes lead to 80% of phenomena. It should be noted that this exact ratio (80:20) corresponds to only one specific value of the power-law exponent, a = 2.16 (calculated from Newman (2005), eq. 29).
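The a = 2.16 figure can be checked directly from Newman's relation (his eq. 29), which gives the share W of phenomena produced by the top fraction P of causes under a power law with exponent a. A short computation (the function name is ours, for illustration):

```python
# Newman (2005), eq. 29: share of "phenomena" (e.g. papers) accounted
# for by the top fraction P of "causes" (e.g. authors) under a power
# law with exponent a:  W = P ** ((a - 2) / (a - 1)).
def top_share(P, a):
    return P ** ((a - 2) / (a - 1))

# With a = 2.16, the top 20% account for ~80% -- the 80:20 rule.
print(round(top_share(0.2, 2.16), 2))  # → 0.8
```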

Sometimes, certain distributions are described as scale-free in addition to being power law. This means that multiplying x by a given factor changes y by the same factor, no matter what value of x one starts from; the curve has no characteristic scale. However, as Newman (2005) showed, the power law is the only scale-free distribution, so the two terms are synonymous, and we use only "power law" in this paper.
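A quick numerical check makes the scale-free property concrete (an illustrative sketch; the specific values of a, b, and x are arbitrary): for a power law, rescaling x by a factor b always multiplies y by the same constant b^(-a), whereas for a non-power-law function, such as an exponential, the ratio depends on x.

```python
import math

# Power law y = c * x**(-a): rescaling x by b multiplies y by b**(-a),
# independent of x -- the scale-free property.
def f(x, c=1.0, a=2.0):
    return c * x ** (-a)

b = 3.0
ratios = [f(b * x) / f(x) for x in (1.0, 10.0, 250.0)]
# Every ratio equals b**(-a) = 1/9, regardless of x.

# Contrast: an exponential has a characteristic scale, so the same
# rescaling produces an x-dependent ratio.
exp_ratios = [math.exp(-b * x) / math.exp(-x) for x in (1.0, 10.0)]
```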

Historically, different IS phenomena have been described using the mathematical formalism of the power law (Bookstein, 1976; de Bellis, 2009; Egghe, 1985). These power-law distributions have been given different names, although they are all related (Bookstein, 1990).

Lotka’s distribution (law). Lotka’s law is one of the best-known examples of a power-law distribution in IS, though it is not widely known outside the field. Its original formulation (Lotka, 1926) can be described in the following way: a large number of authors (y value) produce a small number of papers (x value), while very few produce many. Such a description, of course, is not precise, since many distributions (not necessarily power laws) exhibit this property. More specifically, the power-law nature of Lotka’s law can be illustrated as follows. Take a power-law exponent of 2, as suggested by Lotka; then, if there are 100 authors with one article, there should be 25 with two, 11 with three, and so on. Lotka’s law is an example of a size-frequency distribution, which describes the number of sources with a certain number of items.
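The counts in the illustration above follow directly from y = C / x^2 with C = 100 single-paper authors (rounding to whole authors is our illustrative choice):

```python
# Lotka's size-frequency law with exponent 2: the number of authors
# with x papers falls off as C / x**2. Starting from 100 authors
# with one paper each:
C = 100
authors_with = {x: round(C / x ** 2) for x in range(1, 6)}
# → {1: 100, 2: 25, 3: 11, 4: 6, 5: 4}
```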

Zipf’s distribution (law). Zipf’s law comes from linguistics (Zipf, 1949). It is a rank-frequency distribution, which describes the number of items in a source, where sources are ranked in decreasing order of frequency. Zipf’s law originally applied to the number of times certain words were used in a text, from the most frequent to the least frequent. This distribution is again a power law. We note that it is possible to construct the word-occurrence distribution in the size-frequency manner as well. Thus, the Lotka “version” of the word-frequency “law” would be that there are many words that appear rarely and few that are common, and that this distribution is a power law. So, words such as “the” or “a” would have large values of x (number of occurrences).
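The two ways of tabulating the same word counts can be sketched side by side (the toy corpus is our own illustrative example):

```python
from collections import Counter

# Toy corpus; the counts stand in for word frequencies in a real text.
tokens = "the cat sat on the mat and the dog sat on a log".split()
freq = Counter(tokens)

# Zipf (rank-frequency): frequencies listed from the most common word
# ("the", 3 occurrences) down to the least common.
rank_freq = sorted(freq.values(), reverse=True)

# Lotka "version" (size-frequency): how many distinct words occur
# exactly x times -- many rare words, few common ones.
size_freq = Counter(freq.values())
```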


Reference

This content is AI-processed based on open access ArXiv data.
