Temporal Limits of Privacy in Human Behavior
Large-scale collection of human behavioral data by companies raises serious privacy concerns. We show that behavior captured in the form of application usage data collected from smartphones is highly unique even in very large datasets encompassing millions of individuals. This makes behavior-based re-identification of users across datasets possible. We study 12 months of data from 3.5 million users and show that four apps are enough to uniquely re-identify 91.2% of users using a simple strategy based on public information. Furthermore, we show that there is seasonal variability in uniqueness and that application usage fingerprints drift over time at an average constant rate.
💡 Research Summary
The paper investigates how uniquely smartphone application‑usage data can identify individuals, even when the data set contains millions of users. Using a 12‑month longitudinal data set from February 2016 to January 2017, the authors analyze the behavior of 3.56 million Android users, covering 1.1 million distinct apps available on Google Play. For each user and each month, a binary vector (the “app‑fingerprint”) records, for every app, whether the user used it during that month; on average users engage with 23 apps per month and 76 apps over the whole year.
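The monthly fingerprint described above can be sketched in a few lines. This is only an illustration: the `(user_id, month, app_id)` record layout, the function names, and the toy records are assumptions, not the authors' pipeline. Because each user touches only a few dozen of the roughly one million apps, a set of app ids is a natural sparse form of the binary vector.

```python
from collections import defaultdict


def build_fingerprints(records):
    """Map (user_id, month) -> set of apps used at least once that month.

    `records` is an iterable of (user_id, month, app_id) tuples; this schema
    is assumed for illustration and is not taken from the paper.
    """
    fingerprints = defaultdict(set)
    for user_id, month, app_id in records:
        # Repeated use of the same app in a month collapses to one entry,
        # which is what makes the fingerprint binary.
        fingerprints[(user_id, month)].add(app_id)
    return dict(fingerprints)


def to_binary_vector(apps_used, app_index):
    """Dense 0/1 vector over a fixed ordering of apps."""
    return [1 if app in apps_used else 0 for app in app_index]


# Toy records, invented for the example.
records = [
    ("u1", "2016-02", "maps"),
    ("u1", "2016-02", "chat"),
    ("u1", "2016-02", "maps"),
    ("u2", "2016-02", "chat"),
    ("u1", "2016-03", "chess"),
]
fingerprints = build_fingerprints(records)
app_index = ["chat", "chess", "maps"]
vector_u1_feb = to_binary_vector(fingerprints[("u1", "2016-02")], app_index)
```

In practice the dense vector would rarely be materialised; the set form is enough for the subset tests used when measuring uniqueness.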
Uniqueness (or “unicity”) is defined as the proportion of users whose fingerprint is unique given a selected set of apps. Two selection strategies are compared. The first selects apps at random; the second, which the authors call the “popularity strategy,” exploits publicly available popularity information (download counts on Google Play) and starts with the least popular apps. With the random approach, four apps uniquely identify 21.8 % of users, a modest but non‑trivial figure given the binary nature of the data. The popularity strategy dramatically improves performance: four carefully chosen low‑popularity apps uniquely re‑identify 91.2 % of the population. This demonstrates that the heavy‑tailed distribution of app usage makes rare apps powerful identifiers.
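One way to read the two strategies is sketched below. It assumes the attacker knows n apps drawn from the target's own fingerprint and counts the target as re‑identified when no other user has all of them; that reading, the helper names, and the toy popularity counts are assumptions made for illustration, not the authors' code. The brute‑force comparison is quadratic in the number of users and would need an index at the scale of the study.

```python
import random


def pick_random(apps, n, rng):
    """Random strategy: n apps drawn uniformly from the user's own apps."""
    ordered = sorted(apps)  # fixed order so a seeded rng is reproducible
    return set(rng.sample(ordered, min(n, len(ordered))))


def pick_least_popular(apps, n, popularity):
    """Popularity strategy: the user's n least popular apps."""
    ranked = sorted(apps, key=lambda app: (popularity[app], app))
    return set(ranked[:n])


def unicity(fingerprints, pick):
    """Fraction of users whose picked apps are matched by no other user."""
    unique = 0
    for apps in fingerprints.values():
        known = pick(apps)
        # The user always matches themselves, so one match means unique.
        matches = sum(1 for other in fingerprints.values() if known <= other)
        if matches == 1:
            unique += 1
    return unique / len(fingerprints)


# Toy data, invented for the example.
fingerprints = {
    "u1": {"chat", "maps", "chess"},
    "u2": {"chat", "maps", "birding"},
    "u3": {"chat", "maps", "mail"},
    "u4": {"chat", "mail", "tides"},
}
popularity = {"chat": 1000, "maps": 800, "mail": 500,
              "chess": 20, "birding": 5, "tides": 3}

rng = random.Random(0)
random_rate = unicity(fingerprints, lambda apps: pick_random(apps, 1, rng))
popular_rate = unicity(
    fingerprints, lambda apps: pick_least_popular(apps, 1, popularity)
)
```

In this toy population a single rare app already singles out three of the four users, while a randomly chosen app can do no better, which mirrors the direction of the gap reported in the summary.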
Temporal analysis reveals seasonal variation in uniqueness. When fingerprints are constructed on a monthly basis, the re‑identification rate peaks during June, July, and August. The authors attribute this to increased usage of travel, weather, sports, and health‑and‑fitness apps during vacation months, while education and business apps decline. Thus, changes in physical routines (e.g., traveling) are reflected in digital behavior, making fingerprints temporarily more distinctive.
The impact of sample size is also examined. By subsampling the full data set (from 100 k to 3.5 M users), the authors find that random selection suffers a noticeable drop in uniqueness as the population grows (e.g., for five apps, re‑identification falls from 45.9 % to 32.1 %). In contrast, the popularity strategy is far less sensitive to scale, decreasing only from 96.6 % to 92.7 % across the same range. Extrapolating using power‑law, exponential, and stretched‑exponential models suggests that even with ten‑fold larger populations (≈35 M users) five apps would still uniquely identify 75–80 % of individuals. Consequently, “hiding in the crowd” via sheer population size is ineffective for this type of high‑dimensional behavioral data.
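The subsampling experiment can be sketched as follows, under the same assumed reading of uniqueness (n apps known from the target's own fingerprint, random selection). The synthetic population, its heavy‑tailed weights, and all names are invented for illustration; the sketch shows only the procedure, not the reported figures or the extrapolation models.

```python
import random


def unicity(fingerprints, n, rng):
    """Share of users for whom n randomly known apps match no one else."""
    users = sorted(fingerprints)
    unique = 0
    for user in users:
        apps = sorted(fingerprints[user])
        known = set(rng.sample(apps, min(n, len(apps))))
        matches = sum(1 for other in users if known <= fingerprints[other])
        if matches == 1:
            unique += 1
    return unique / len(users)


def unicity_by_sample_size(fingerprints, sizes, n, seed=0):
    """Measure unicity on random subsamples of increasing size."""
    rng = random.Random(seed)
    users = sorted(fingerprints)
    results = {}
    for size in sizes:
        sample = rng.sample(users, size)
        subset = {user: fingerprints[user] for user in sample}
        results[size] = unicity(subset, n, rng)
    return results


# Synthetic population: a few popular apps and a long tail of rare ones.
gen = random.Random(42)
catalogue = [f"app{i}" for i in range(300)]
weights = [1.0 / (rank + 1) for rank in range(len(catalogue))]
population = {
    f"user{i}": set(gen.choices(catalogue, weights=weights, k=8))
    for i in range(200)
}

results = unicity_by_sample_size(population, sizes=[1, 50, 200], n=2)
```

A sample of one user is trivially unique; the question the study asks is how quickly that share erodes as the sample grows.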
The study concludes that app‑usage metadata is intrinsically high‑dimensional and sparse, allowing a small set of rarely used apps to serve as quasi‑fingerprints. Because such metadata can be obtained from public sources (e.g., Google Play) or purchased from data brokers, the risk of cross‑dataset linkage and privacy breach is substantial. The authors recommend that data‑collecting entities adopt stricter minimisation and retention policies, and that users become aware of the limited control they have over usage traces. Potential technical mitigations include differential privacy, stronger anonymisation techniques, and possibly limiting the granularity of app‑usage logs shared with third parties. The work underscores the urgent need for policy and technical safeguards in the era of pervasive mobile data collection.