"All of Me": Mining Users' Attributes from their Public Spotify Playlists
In the age of digital music streaming, playlists on platforms like Spotify have become an integral part of individuals’ musical experiences. People create and publicly share their own playlists to express their musical tastes, promote the discovery of their favorite artists, and foster social connections. In this work, we aim to address the question: can we infer users’ private attributes from their public Spotify playlists? To this end, we conducted an online survey involving 739 Spotify users, resulting in a dataset of 10,286 publicly shared playlists comprising over 200,000 unique songs and 55,000 artists. Then, we utilize statistical analyses and machine learning algorithms to build accurate predictive models for users’ attributes.
💡 Research Summary
The paper “All of Me: Mining Users’ Attributes from their Public Spotify Playlists” investigates whether private user attributes can be inferred from publicly shared Spotify playlists. The authors conducted an online survey between May and September 2022, recruiting 739 participants from 76 countries. Participants provided their Spotify IDs and self‑reported information on 16 attributes spanning demographics (gender, age, country, relationship status, living alone, occupation, economic status), habits (sports participation, smoking, alcohol consumption, Spotify Premium subscription), and personality (the five OCEAN traits).
Using the Spotify API, the researchers collected 10,286 public playlists belonging to these users, comprising over 200,000 unique songs and 55,000 artists. For each playlist they extracted 111 features, including song‑level audio attributes (danceability, acousticness, etc.), track metadata (release year, popularity), artist popularity and follower counts, genre proportions across 30 popular genres, and miscellaneous statistics such as playlist length, follower count, album diversity, and temporal patterns of song addition. Numeric features were aggregated per playlist using mean, standard deviation, min, and max.
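The per-playlist aggregation step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `aggregate_playlist`, the dictionary layout of a track, and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def aggregate_playlist(tracks, feature_names):
    """Collapse per-track numeric features into playlist-level statistics.

    `tracks` is a list of dicts, one per song (hypothetical layout).
    Each numeric feature yields four values: mean, std, min, max.
    """
    summary = {}
    for name in feature_names:
        # Skip tracks where the feature is missing.
        values = [t[name] for t in tracks if t.get(name) is not None]
        if not values:
            continue
        summary[f"{name}_mean"] = mean(values)
        # Assumption: population std; the summary does not say which variant was used.
        summary[f"{name}_std"] = pstdev(values)
        summary[f"{name}_min"] = min(values)
        summary[f"{name}_max"] = max(values)
    return summary


# Toy playlist with two of the song-level features named in the text.
playlist = [
    {"danceability": 0.5, "popularity": 40},
    {"danceability": 0.7, "popularity": 60},
    {"danceability": 0.9, "popularity": 80},
]
features = aggregate_playlist(playlist, ["danceability", "popularity"])
```

Each numeric feature expands into four columns, which is one way a modest set of raw attributes can grow into a feature vector of the size reported above.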
Statistical analysis was performed by treating each feature as a dependent variable and each user attribute as a grouping factor. For binary attributes, independent‑samples t‑tests were applied; for multi‑class attributes, one‑way ANOVA was used. Tests were conducted at the user level by aggregating a user’s playlists to avoid bias from users with many playlists. The results showed that different feature families (Misc, Artists, Songs, Genres) discriminate different attributes. Gender, for example, is most distinguishable via Artists, Songs, and Genres, while Age and Occupation are mainly reflected in Misc features. Habits such as alcohol consumption, smoking, and Premium subscription exhibit significant differences across all feature families. Among personality traits, Openness, Conscientiousness, and Neuroticism show the strongest feature‑level distinctions.
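The user-level testing procedure for a binary attribute might look like the sketch below. It is an illustration under assumptions: the helper names, the toy data, and the choice of the pooled-variance (Student) form of the t statistic are not taken from the paper. Only the statistic and degrees of freedom are computed here; in practice a library routine such as `scipy.stats.ttest_ind` (or `scipy.stats.f_oneway` for multi-class attributes) would also supply the p-value.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, variance


def user_level_means(playlist_rows):
    """Average one feature over each user's playlists: one value per user."""
    per_user = defaultdict(list)
    for user_id, value in playlist_rows:
        per_user[user_id].append(value)
    return {user: mean(vals) for user, vals in per_user.items()}


def student_t(group_a, group_b):
    """Pooled-variance t statistic and degrees of freedom for two independent samples."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / n_a + 1 / n_b))
    return t, df


# Hypothetical data: (user id, playlist length) per playlist, plus a binary attribute.
rows = [("u1", 10), ("u1", 14), ("u2", 16), ("u3", 14),
        ("u4", 6), ("u4", 10), ("u5", 4), ("u6", 6)]
smokes = {"u1": True, "u2": True, "u3": True,
          "u4": False, "u5": False, "u6": False}

per_user = user_level_means(rows)
group_yes = [v for u, v in per_user.items() if smokes[u]]
group_no = [v for u, v in per_user.items() if not smokes[u]]
t_stat, dof = student_t(group_yes, group_no)
```

Aggregating first means that users u1 and u4, who each have two playlists, contribute a single observation, which is the bias-avoidance step described above.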
For predictive modeling, the dataset was split stratified by user into 70 % training, 10 % validation, and 20 % test sets, ensuring no user appears in multiple splits. Five classifiers were evaluated: Logistic Regression, Decision Tree, Random Forest, k‑Nearest Neighbours, and a Multi‑Layer Perceptron (MLP). Each model processes a single playlist at a time, producing class probabilities for each attribute; the final user‑level prediction is obtained by averaging probabilities across all of a user’s playlists.
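The two user-level mechanics described above, a split in which no user crosses sets and the averaging of playlist-level probabilities, can be sketched as follows. The function names and toy data are hypothetical, and the split shown is a plain seeded shuffle: it does not reproduce whatever stratification the authors applied.

```python
import random
from collections import defaultdict


def split_users(user_ids, seed=0, train=0.7, val=0.1):
    """Shuffle unique users and cut into train/val/test so no user spans two sets."""
    users = sorted(set(user_ids))
    random.Random(seed).shuffle(users)
    n_train = round(train * len(users))
    n_val = round(val * len(users))
    return (set(users[:n_train]),
            set(users[n_train:n_train + n_val]),
            set(users[n_train + n_val:]))


def user_predictions(playlist_probs):
    """Average per-playlist class probabilities per user, then pick the top class."""
    grouped = defaultdict(list)
    for user_id, probs in playlist_probs:
        grouped[user_id].append(probs)
    result = {}
    for user_id, rows in grouped.items():
        # Column-wise mean over all of this user's playlists.
        avg = [sum(col) / len(rows) for col in zip(*rows)]
        result[user_id] = max(range(len(avg)), key=avg.__getitem__)
    return result


train_users, val_users, test_users = split_users([f"u{i}" for i in range(10)])

# Hypothetical classifier outputs: (user id, [P(class 0), P(class 1)]) per playlist.
playlist_probs = [
    ("u1", [0.9, 0.1]),
    ("u1", [0.4, 0.6]),
    ("u1", [0.3, 0.7]),
    ("u2", [0.2, 0.8]),
]
preds = user_predictions(playlist_probs)
```

In the toy data, two of u1's three playlists lean toward class 1, yet the averaged probabilities favour class 0 because one playlist is highly confident. Averaging probabilities therefore behaves differently from a majority vote over playlists.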
Performance results indicate that several attributes can be predicted with appreciable accuracy. Gender prediction reaches up to 71 % accuracy (MLP), while age, country, and economic status achieve 60–70 % accuracy. Habit‑related attributes (Alcohol, Smoking, Premium) attain around 80 % accuracy. For every attribute except “Live Alone,” the models outperform a random‑guess baseline with statistical significance (p < 0.05). The authors also report standard deviations across multiple runs as an indication of how stable the results are.
Ethical considerations are explicitly addressed: participants gave informed consent, data were anonymized, and a mechanism for data removal was provided. The authors acknowledge limitations such as the modest sample size, potential self‑selection bias, and the fact that publicly shared playlists may not fully represent a user’s listening behaviour.
The study concludes that publicly available music‑playlist data contain enough signal to infer sensitive personal attributes, raising privacy concerns for music‑streaming platforms. It recommends that service providers evaluate privacy risks when exposing playlist data via APIs and consider mitigation strategies (e.g., data minimization, access controls). Suggested future work includes expanding the dataset across more cultures, incorporating longitudinal listening logs, and exploring privacy‑preserving machine‑learning techniques such as differential privacy.