Neuropsychological constraints to human data production on a global scale
What are the factors underlying human information production on a global level? To gain insight into this question we study a corpus of 252–633 million publicly available data files on the Internet, corresponding to an overall storage volume of 284–675 Terabytes. Analyzing the file size distribution for several distinct data types, we find indications that the neuropsychological capacity of the human brain to process and record information may constitute the dominant limiting factor for the overall growth of globally stored information, with real-world economic constraints having only a negligible influence. This supposition draws support from the observation that the file size distributions follow a power law for data without a time component, such as images, and a log-normal distribution for multimedia files, for which time is a defining quality.
💡 Research Summary
The paper asks a broad question: what limits the amount of information that humans generate and store on a global scale? To address this, the authors collected a massive corpus of publicly available files from the Internet, using the FindFiles.net spider to follow outgoing links from Wikipedia and DMOZ. In total they indexed between 252 million and 633 million files, occupying roughly 284–675 TB of storage. The files were classified by MIME type; images dominate (≈58 % of files), followed by applications (≈33 %), text (≈5 %), audio (≈2.9 %) and video (≈0.7 %).
The central empirical observation is that the size distribution of files differs systematically across content categories. For “static” data types that lack an explicit time dimension—images and text—the distribution follows a power‑law: the probability density P(s) scales as s⁻ᵞ over several orders of magnitude, with exponents that vary (e.g., JPEG files show a slope change from –2 to –4 around 4 MB, which the authors attribute to a transition from amateur to professional production). In contrast, multimedia files that contain a temporal component—audio and video—exhibit a log‑normal distribution. In log‑log plots the tails of these distributions are well described by a quadratic function of log s, i.e., log P(s) ≈ α log s – β (log s)², fitting over more than five decades of file size.
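The distinction between the two tail shapes is easy to see on a log-log plot. A minimal sketch with illustrative (not fitted) parameters: for a power law, log P is linear in log s, so its second differences on a logarithmic grid vanish; for the log-normal quadratic form log P(s) ≈ α log s − β (log s)², they are a negative constant.

```python
import numpy as np

# Hypothetical illustration (not the paper's data): the two candidate
# tail forms, evaluated on a logarithmic grid of file sizes.
s = np.logspace(3, 9, 7)          # 1 KB .. 1 GB, one point per decade

# Power law: log P is linear in log s with slope -gamma.
gamma = 2.0
log_p_power = -gamma * np.log10(s)

# Log-normal tail: log P is quadratic in log s,
# log P(s) ~ alpha*log s - beta*(log s)^2  (alpha, beta illustrative).
alpha, beta = 4.0, 0.5
log_p_lognorm = alpha * np.log10(s) - beta * np.log10(s) ** 2

# Second differences on the log-log grid distinguish the two forms:
# zero for a power law, constant -2*beta*(step)^2 for a log-normal.
print(np.round(np.diff(log_p_power, 2), 6))    # all 0
print(np.round(np.diff(log_p_lognorm, 2), 6))  # all -1
```

In practice this curvature test is what makes the "more than five decades" fitting range meaningful: over a short range the quadratic term is hard to distinguish from noise.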
To explain these patterns the authors invoke an information‑theoretic framework. They define the Shannon entropy of the size distribution, H = –∑ P(s) log P(s), and introduce a cost function c(s) that captures constraints on data production. The cost is modeled as a linear term a·s (representing economic costs such as storage, bandwidth, and hardware) plus logarithmic terms b log s and c (log s)², which they argue arise from the Weber‑Fechner law describing how human perception of stimulus intensity, number of objects, and duration scales logarithmically in the brain. Maximizing entropy under the constraint ⟨c(s)⟩ = constant yields a distribution P(s) ∝ exp(–λ c(s)). If the logarithmic term dominates, the solution reduces to a power law, P(s) ∝ s^(–λb); if the quadratic logarithmic term also contributes (as would be the case when two independent perceptual dimensions are involved, e.g., spatial resolution and duration), the solution becomes log‑normal; a dominant linear term would instead yield an exponential tail. Thus, the observed dichotomy between static and temporal media is interpreted as a manifestation of the dimensionality of the underlying neuro‑psychological cost.
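The maximum-entropy step can be checked numerically. The sketch below (illustrative λ, b, c values, not the paper's fitted parameters) builds P(s) ∝ exp(–λ c(s)) on a size grid for the purely logarithmic cost and confirms that it is an exact power law, i.e., log P vs. log s has constant slope –λb:

```python
import numpy as np

# Illustrative parameters for the maximum-entropy solution
# P(s) ∝ exp(-lam * c(s)); none of these values are fitted to data.
s = np.logspace(0, 6, 200)        # file-size grid (arbitrary units)
lam = 1.0

def maxent_pdf(cost):
    """Normalize exp(-lam * cost) on the grid (discrete sum suffices here)."""
    p = np.exp(-lam * cost)
    return p / p.sum()

# One perceptual dimension: c(s) = b*log s  ->  P(s) ∝ s^(-lam*b),
# a power law (log P linear in log s with slope -lam*b = -2.5).
b = 2.5
p_power = maxent_pdf(b * np.log(s))

# Two perceptual dimensions: c(s) = b*log s + c2*(log s)^2
# ->  log P quadratic in log s, i.e. the log-normal form.
c2 = 0.1
p_lognorm = maxent_pdf(b * np.log(s) + c2 * np.log(s) ** 2)

# A purely linear (economic) cost c(s) = a*s would instead give an
# exponential tail exp(-lam*a*s), which the authors report is absent.
slope = np.gradient(np.log(p_power), np.log(s))
print(np.round(slope[:3], 6))     # constant -2.5
```

The same construction with the (log s)² term added reproduces the log-normal curvature, which is the whole content of the one- vs. two-dimensional cost argument.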
Statistically, the authors fit both candidate models to the empirical tails using maximum‑likelihood estimation (MLE) and evaluate goodness of fit via the residual sum of squares (RSS) between the empirical complementary cumulative distribution function (CCDF) and the model CCDF. For audio and video, the log‑normal model yields RSS values an order of magnitude lower than the power law, indicating a superior fit. For images and applications, the power law fits over a broader range of sizes and yields a lower RSS, while for text the two models perform comparably. The analysis excludes files larger than 10 GB due to scarcity, and the lower cutoff k_min is selected by minimizing the RSS across a range of plausible thresholds.
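The fitting procedure described above can be sketched on synthetic data. The following is a minimal illustration, not the authors' code: a Pareto-tailed sample stands in for file sizes, the tail exponent is estimated by continuous MLE (the Hill estimator), and k_min is chosen by minimizing the RSS between the empirical and model CCDFs; the candidate thresholds are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for file sizes: Pareto tail above k_min_true
# with CCDF (s/k_min)^(1-gamma), sampled by inverse transform.
k_min_true, gamma_true = 1e3, 2.5
sizes = k_min_true * (1 - rng.random(50_000)) ** (-1 / (gamma_true - 1))

def fit_power_law(x, k_min):
    """Continuous MLE (Hill estimator) for the exponent of the tail >= k_min."""
    tail = x[x >= k_min]
    return 1 + len(tail) / np.sum(np.log(tail / k_min))

def ccdf_rss(x, k_min, gamma):
    """RSS between empirical and model CCDFs of the tail, on log scale."""
    tail = np.sort(x[x >= k_min])
    emp = 1 - np.arange(len(tail)) / len(tail)   # empirical CCDF, in (0, 1]
    model = (tail / k_min) ** (1 - gamma)        # Pareto CCDF
    return np.sum((np.log(emp) - np.log(model)) ** 2)

# Select k_min by minimizing RSS over candidate cutoffs, as in the
# procedure summarized above (the candidate range is illustrative).
candidates = np.logspace(3, 4, 10)
best = min(candidates, key=lambda k: ccdf_rss(sizes, k, fit_power_law(sizes, k)))
print(round(fit_power_law(sizes, best), 2))  # close to 2.5 on this sample
```

Comparing the same RSS statistic for a log-normal model fitted by MLE would then reproduce the model-selection step applied per content category.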
The paper also examines the relationship between the number of files hosted on a domain and the domain’s in‑degree (the number of inbound links). High‑in‑degree domains (e.g., Twitter) host relatively few public files, whereas many low‑in‑degree domains host large numbers of files, suggesting that file availability is not driven by popularity but by other factors such as personal data sharing. The in‑degree distribution itself follows a power‑law with exponent ≈ –2.2, consistent with prior studies of the Web’s link topology.
In the discussion, the authors argue that economic constraints (hardware cost, storage capacity) are insufficient to explain the observed distributions because exponential tails—predicted by models where cost scales linearly with size—are absent. Instead, the neuro‑psychological constraints embodied in the Weber‑Fechner law provide a parsimonious explanation for both power‑law and log‑normal regimes. They note that for images the production cost depends primarily on resolution (a single dimension), whereas for audio/video it depends on both resolution per frame and total duration (two dimensions), leading to the different statistical forms.
While the hypothesis is intriguing, several limitations deserve attention. First, the data collection is biased toward sites linked from Wikipedia and DMOZ, potentially under‑representing large commercial or cloud‑based repositories where storage economics may differ. Second, the cost function treats economic factors as a simple linear term, ignoring non‑linear pricing models, bandwidth caps, and energy consumption that could affect file‑size decisions. Third, compression algorithms and format choices (e.g., JPEG vs. PNG, MP3 vs. FLAC) introduce additional dimensions not captured by the simple one‑ or two‑dimensional cost model. Fourth, a growing share of data is generated automatically (e.g., sensor streams, AI‑generated content) where human perceptual limits may play a reduced role, challenging the universality of the neuro‑psychological argument.
Overall, the paper provides a novel perspective linking human perceptual scaling laws to macroscopic patterns in Internet file sizes. The empirical evidence for distinct power‑law and log‑normal regimes is solid, and the information‑theoretic derivation offers a coherent theoretical framework. However, a more comprehensive model that integrates economic, technological, and automated data‑generation factors, as well as broader sampling of the Web, would be needed to confirm that neuro‑psychological constraints are indeed the dominant driver of global data production.