How Unique and Traceable are Usernames?

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Suppose you find the same username on different online services: what is the probability that these usernames refer to the same physical person? This work addresses what appears to be a fairly simple question, which has many implications for anonymity and privacy on the Internet. One possible way of estimating this probability would be to look at the public information associated with the two accounts and try to match it. However, for most services, this information is chosen by the users themselves and is often very heterogeneous, possibly false, and difficult to collect. Furthermore, several websites do not disclose any additional public information about users apart from their usernames (e.g., discussion forums or blog comments); nonetheless, they might contain sensitive information about users. This paper explores the possibility of linking users' profiles only by looking at their usernames. The intuition is that the probability that two usernames refer to the same physical person strongly depends on the “entropy” of the username string itself. Our experiments, based on crawls of real web services, show that a significant portion of the users’ profiles can be linked using their usernames. To the best of our knowledge, this is the first time that usernames are considered as a source of information when profiling users on the Internet.


💡 Research Summary

The paper tackles a deceptively simple yet highly consequential question: when the same username appears on different online services, what is the probability that those accounts belong to the same physical person? While most prior work on online profiling relies on rich metadata such as real names, email addresses, or social‑graph structures, many services expose only the username. This scarcity of auxiliary data makes traditional linking techniques cumbersome, error‑prone, and often infeasible at scale.

To address this gap, the authors propose a methodology that treats the username string itself as a source of identifying information. The central concept is “information surprisal” (also called self‑information), defined as I(u) = ‑log₂ P(u), where P(u) is the probability of observing username u in the overall population. A higher surprisal value indicates that the string carries more bits of identifying information and is therefore more likely to be unique. The challenge lies in estimating P(u) for any possible username, including those never seen in the training data.
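As a small illustration of the definition above, the following Python sketch computes surprisal in bits from a probability. The function name `surprisal_bits` and the two example probabilities are assumptions made for illustration only; they are not values from the paper.

```python
import math


def surprisal_bits(p: float) -> float:
    """Self-information I(u) = -log2 P(u), in bits."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(p)


# Hypothetical probabilities, chosen only to show the scale of the measure.
common_name = surprisal_bits(1 / 2**10)  # a frequent string: few bits
rare_name = surprisal_bits(1 / 2**20)    # a rare string: many bits
print(common_name, rare_name)
```

Each additional bit of surprisal halves the probability of seeing the string, which is why rare usernames carry more identifying information.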

A naïve maximum‑likelihood estimate (count(u)/N) would assign zero probability to unseen names and produce a coarse model. Instead, the authors train a character‑level 5‑gram Markov chain on a massive corpus of roughly 10 million usernames collected from public Google profiles and eBay accounts. For each character position i, the model estimates the conditional probability P(cᵢ | cᵢ₋₄…cᵢ₋₁) by counting occurrences of 5‑grams in the training set. The probability of an entire username is then the product of these conditional probabilities, allowing the computation of surprisal for any arbitrary string.
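The idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `CharNgramModel`, the padding symbols, and the add-alpha smoothing used to keep unseen n-grams from receiving zero probability are all assumptions of this sketch, since the summary does not say how the paper handles unseen contexts.

```python
import math
from collections import defaultdict

# Padding symbols marking the start and end of a username (an assumption).
START, END = "\x02", "\x03"


class CharNgramModel:
    """Character-level n-gram Markov model with add-alpha smoothing."""

    def __init__(self, n=5, alpha=1.0):
        self.n = n          # n-gram size: each character depends on n-1 predecessors
        self.alpha = alpha  # smoothing constant (hypothetical choice)
        self.counts = defaultdict(lambda: defaultdict(int))
        self.alphabet = {END}

    def _pad(self, name):
        return START * (self.n - 1) + name + END

    def fit(self, usernames):
        for name in usernames:
            padded = self._pad(name)
            self.alphabet.update(name)
            for i in range(self.n - 1, len(padded)):
                context = padded[i - self.n + 1:i]
                self.counts[context][padded[i]] += 1
        return self

    def log2_prob(self, name):
        """Sum of log2 conditional probabilities, i.e. log2 of their product."""
        padded = self._pad(name)
        vocab = len(self.alphabet) + 1  # +1 reserves mass for unseen characters
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            seen = self.counts.get(context, {})
            numerator = seen.get(padded[i], 0) + self.alpha
            denominator = sum(seen.values()) + self.alpha * vocab
            total += math.log2(numerator / denominator)
        return total

    def surprisal(self, name):
        return -self.log2_prob(name)


# Toy training set; the paper trains on roughly 10 million usernames.
model = CharNgramModel().fit(["john", "john", "johnny", "sara", "admin"])
print(model.surprisal("john"), model.surprisal("xq7zk"))
```

Working in log space avoids numerical underflow when many small conditional probabilities are multiplied. On this toy data the familiar string gets a lower surprisal than the random-looking one, which is the qualitative behaviour the method relies on.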

With surprisal values in hand, the authors define a theoretical uniqueness threshold: if I(u) exceeds log₂ W (where W is the total number of users in the target population), the expected number of users sharing u, W · P(u), falls below one, so the username is likely, though not strictly guaranteed, to be unique within that population. Empirical analysis shows that about 30 % of the 10 million usernames have surprisal above 20 bits (since 2²⁰ ≈ 1 million, such a name is expected to occur about once per million users), while common short names such as “john” or “admin” fall below 10 bits and are expected to be shared by many individuals.
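The threshold can be checked with simple arithmetic: under the model, a username with surprisal I(u) is expected to be held by W · 2^(−I(u)) users. The sketch below assumes users pick names independently from the modelled distribution; the function names `expected_holders` and `likely_unique` are hypothetical.

```python
import math


def expected_holders(surprisal_bits: float, population: int) -> float:
    """Expected number of users holding a name: W * P(u) = W * 2**(-I(u))."""
    return population * 2.0 ** (-surprisal_bits)


def likely_unique(surprisal_bits: float, population: int) -> bool:
    """True when I(u) > log2(W), i.e. fewer than one expected holder."""
    return surprisal_bits > math.log2(population)


W = 2**20  # about one million users (illustrative population size)
print(expected_holders(10, W))  # a 10-bit name: many expected holders
print(expected_holders(20, W))  # a 20-bit name: about one expected holder
print(likely_unique(25, 1_000_000), likely_unique(10, 1_000_000))
```

The comparison is a statement about expectations, so a name above the threshold can still collide by chance.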

Beyond single‑name uniqueness, the paper addresses the more realistic scenario where a person uses slightly different usernames across services (e.g., “sara123” vs. “sara_123”). The authors construct a Bayesian matching model that combines the surprisal of each name, the edit distance between them, and the length of common substrings to compute the posterior probability that two usernames belong to the same person. They validate this model using ground‑truth links extracted from Google profiles, where users voluntarily list their other online accounts. The matching algorithm achieves over 85 % accuracy and an F1 score of 0.78, outperforming traditional record‑linkage techniques that require richer attribute sets.
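To make the ingredients concrete, the Python sketch below computes the two string-similarity signals mentioned above (edit distance and longest common substring) and applies Bayes' rule in odds form. It is not the paper's model: the summary does not describe how features are mapped to likelihoods, so the `likelihood_ratio` argument is a placeholder supplied by the caller, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous block shared by both strings."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def posterior_same_person(prior: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    odds = prior / (1.0 - prior) * likelihood_ratio
    return odds / (1.0 + odds)


a, b = "sara123", "sara_123"
print(levenshtein(a, b), longest_common_substring(a, b))
# Placeholder numbers: a 1% prior and evidence 99 times likelier under "same person".
print(posterior_same_person(0.01, 99.0))
```

In a full model, the likelihood ratio would presumably grow with the surprisal of the shared parts and shrink with the edit distance, but the exact form is left open here.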

The experimental dataset is diverse: (a) 3.5 M usernames from Google profiles, (b) 6.5 M from eBay, (c) 16 k from an institutional LDAP, (d) a “Finnish” leak containing 79 k usernames, emails, and passwords, and (e) a MySpace phishing dump with 30 k usernames. This breadth ensures that the model captures a wide range of naming conventions, language influences, and cultural patterns. The authors also note that 85 % of the collected usernames are purely alphanumeric, simplifying the character set for the Markov model.

In the discussion, the authors highlight the privacy implications of their findings. Even when users deliberately avoid providing personal details, the mere choice of a username can leak a substantial amount of identifying information. They propose countermeasures such as encouraging users to select higher‑entropy usernames, adding random characters, or service providers offering “uniqueness scores” at registration time. An online tool (hosted at https://planete.inrialpes.fr/projects/how-unique-are-your-usernames) is made publicly available, allowing individuals to test the traceability of their chosen handles.

The paper concludes that usernames, despite being a minimal data point, can serve as a powerful identifier when analyzed with proper probabilistic models. Future work is suggested in extending the approach to multilingual usernames, handling emojis and Unicode symbols, and modeling temporal changes in naming patterns. Overall, the study provides a rigorous, scalable, and practical framework for assessing and mitigating the privacy risks associated with username reuse across the Internet.

