Classification and Powerlaws: The Logarithmic Transformation
Logarithmic transformation of the data has been recommended by the literature in the case of highly skewed distributions such as those commonly found in information science. The purpose of the transformation is to make the data conform to the lognormal law of error for inferential purposes. How does this transformation affect the analysis? We factor analyze and visualize the citation environment of the Journal of the American Chemical Society (JACS) before and after a logarithmic transformation. The transformation strongly reduces the variance necessary for classificatory purposes and therefore is counterproductive to the purposes of the descriptive statistics. We recommend against the logarithmic transformation when sets cannot be defined unambiguously. The intellectual organization of the sciences is reflected in the curvilinear parts of the citation distributions, while negative powerlaws fit excellently to the tails of the distributions.
💡 Research Summary
The paper investigates the consequences of applying a logarithmic transformation to highly skewed citation data, using the Journal of the American Chemical Society (JACS) as a case study. The authors first collect raw citation counts for the journals that cite JACS and the journals cited by JACS from major bibliographic databases. They then run two parallel analyses: one on the original counts and one on the same counts after a logarithmic transformation (natural or base-10). Factor analysis and multidimensional scaling are used to uncover the underlying structure of the citation environment and to visualize it.
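The choice between natural and base-10 logarithms should not matter for correlation-based methods such as factor analysis, because the two transformations differ only by a constant factor and Pearson correlations are invariant to rescaling. A minimal sketch of that point follows; the counts are hypothetical, and the log(x + 1) offset for handling zero counts is an assumption, not something taken from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def log_transform(counts, base):
    # log(x + 1) keeps zero counts defined; the +1 offset is an assumption.
    return [math.log(c + 1, base) for c in counts]

# Hypothetical citation counts from two citing journals to six cited journals.
journal_a = [1200, 340, 85, 40, 12, 3]
journal_b = [900, 410, 60, 55, 9, 5]

r_raw = pearson(journal_a, journal_b)
r_ln = pearson(log_transform(journal_a, math.e), log_transform(journal_b, math.e))
r_log10 = pearson(log_transform(journal_a, 10), log_transform(journal_b, 10))

print(f"raw:   {r_raw:.4f}")
print(f"ln:    {r_ln:.4f}")
print(f"log10: {r_log10:.4f}")  # matches the ln value up to rounding error
```

The raw and logged correlations generally differ, but the two logged versions agree, so the base can be chosen for readability alone.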
In the untransformed data, the distribution is extremely asymmetric: a small number of journals account for a large share of citations, producing a heavy‑tailed, log‑normal‑like shape. Factor analysis yields well‑separated components that correspond to distinct disciplinary clusters (e.g., core chemistry, adjacent physics and biology). The visualizations show clear boundaries between these clusters, reflecting the intellectual organization of the sciences.
When the logarithmic transformation is applied, the variance of the citation counts collapses dramatically. The factor structure becomes blurred: eigenvalues lose their spread, and the previously distinct components merge. Visualizations reveal an inflated central cluster and the disappearance of peripheral clusters, indicating that the transformation erodes the discriminative power needed for classification. The authors argue that this reduction of variance is counter‑productive for descriptive statistics that aim to map scientific fields.
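The variance collapse described here can be illustrated on synthetic data. The sketch below draws lognormal "citation counts" (the distribution parameters, sample size, and the helper `top_share` are all assumptions for illustration, not the JACS data or the authors' procedure) and compares the spread before and after taking logarithms.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic, highly skewed counts: lognormal draws rounded up to integers >= 1.
counts = [math.ceil(random.lognormvariate(3.0, 1.5)) for _ in range(500)]
logged = [math.log(c) for c in counts]  # every count is >= 1, so log is defined

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (a scale-free measure of spread)."""
    return statistics.pstdev(values) / statistics.fmean(values)

def top_share(values, k=10):
    """Share of the total squared deviation from the mean that is
    contributed by the k largest squared deviations."""
    mean = statistics.fmean(values)
    squared = sorted(((v - mean) ** 2 for v in values), reverse=True)
    return sum(squared[:k]) / sum(squared)

cv_raw = coefficient_of_variation(counts)
cv_log = coefficient_of_variation(logged)
share_raw = top_share(counts)
share_log = top_share(logged)

print(f"coefficient of variation: raw={cv_raw:.2f}, log={cv_log:.2f}")
print(f"variance share of 10 largest deviations: raw={share_raw:.2f}, log={share_log:.2f}")
```

In the raw counts a handful of very large values carries most of the variance, which is the signal a classification can exploit; after the transformation that contribution is spread far more evenly across observations.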
Beyond classification, the authors dissect the citation distribution itself. The central, curvilinear portion of the distribution mirrors the disciplinary organization and is best captured by non-linear clustering techniques. The tail, however, follows a negative power law, P(x) ∝ x^(−α), with an exponent α around 2.1–2.3, confirming earlier findings that low-citation journals obey a scale-free pattern. This dual behavior suggests that while the bulk of citations encodes the structured knowledge network, the extreme tail reflects a universal scaling law.
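One common way to estimate a power-law exponent for a tail is the continuous maximum-likelihood estimator, α̂ = 1 + n / Σ ln(x_i / x_min). The sketch below applies it to synthetic Pareto-distributed data. It is not necessarily the fitting method the authors used, and `x_min`, the sample size, and the true exponent of 2.2 (picked from inside the range quoted above) are assumptions for illustration.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible

TRUE_ALPHA = 2.2  # assumed exponent, chosen inside the 2.1-2.3 range quoted above
X_MIN = 1.0       # assumed lower cutoff where the power-law tail begins
N = 5000

def sample_power_law(alpha, x_min):
    """Inverse-transform sample from the density p(x) ∝ x^(-alpha), x >= x_min."""
    u = random.random()  # uniform on [0, 1), so 1 - u lies in (0, 1]
    return x_min * (1.0 - u) ** (-1.0 / (alpha - 1.0))

def estimate_alpha(values, x_min):
    """Continuous maximum-likelihood estimate of the exponent for values >= x_min."""
    tail = [v for v in values if v >= x_min]
    log_sum = sum(math.log(v / x_min) for v in tail)
    return 1.0 + len(tail) / log_sum

data = [sample_power_law(TRUE_ALPHA, X_MIN) for _ in range(N)]
alpha_hat = estimate_alpha(data, X_MIN)

# Approximate standard error of the estimate: (alpha_hat - 1) / sqrt(n).
std_err = (alpha_hat - 1.0) / math.sqrt(N)

print(f"estimated alpha = {alpha_hat:.3f} ± {std_err:.3f}")
```

On real citation data the result depends heavily on where the tail is taken to begin, so `x_min` would need to be chosen and justified rather than fixed in advance as it is here.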
The central conclusion is that logarithmic transformation, while useful for meeting normality assumptions in inferential statistics, undermines the very structure researchers seek to uncover in citation networks. When sets cannot be unambiguously defined, the transformation leads to loss of critical information and hampers the identification of intellectual domains. Consequently, the authors recommend retaining the raw, skewed data for descriptive and classificatory purposes and employing power‑law modeling for the tail, thereby preserving both the nuanced organization of science and the universal scaling properties of citation counts.