The Scale-free Network of Passwords : Visualization and Estimation of Empirical Passwords

In this paper, we present a novel vision of large scale of empirical password sets available and improve the understanding of passwords by revealing their interconnections and considering the security on a level of the whole password set instead of one single password level. Through the visualization of Yahoo, Phpbb, 12306, etc. we, for the first time, show what the spatial structure of empirical password sets are like and take the community and clustering patterns of the passwords into account to shed lights on the definition of popularity of a password based on their frequency and degree separately. Furthermore, we propose a model of statistical guessing attack from the perspective of the data’s topological space, which provide an explanation of the “cracking curve”. We also give a lower bound of the minimum size of the dictionary needed to compromise arbitrary ratio of any given password set by proving that it is equivalent to the minimum dominating set problem, which is a NP-complete problem. Hence the minimal dictionary problem is also NP-complete.

💡 Research Summary

The paper introduces a network‑centric perspective on large‑scale empirical password collections, moving beyond the traditional focus on individual password frequencies. By treating each distinct password as a vertex and connecting two vertices with an undirected edge whenever their Levenshtein distance is at most two, the authors construct graphs for several real‑world datasets (Yahoo, phpBB, China’s 12306, etc.). This distance threshold captures the kinds of simple transformations (case changes, digit insertions, common substitutions) that attackers routinely apply when generating candidate guesses.

Statistical analysis of the resulting graphs reveals a clear scale‑free topology. The degree distribution follows a power‑law p(k) ∝ k⁻ᵞ with exponent γ between 2.1 and 2.5, indicating that a tiny fraction of passwords act as hubs with very high degree while the vast majority have low degree. Importantly, hub status does not always coincide with raw frequency: some relatively rare passwords (e.g., “letmein”, “admin123”) acquire high degree because they are close to many other passwords through small edits, whereas classic high‑frequency passwords (“123456”, “password”) are both frequent and highly connected. This dual notion of “popularity” – frequency‑based and degree‑based – is a central contribution of the work.

Community detection using the Louvain modularity‑maximization algorithm uncovers cohesive clusters that correspond to semantic or pattern‑based families of passwords. One cluster groups numeric‑special‑character strings (e.g., “!@#123”, “123!@#”), another gathers dictionary words combined with numbers (e.g., “football1”, “baseball2020”), and yet another contains leet‑style variations of a base word. The existence of such clusters implies that an attacker who guesses a single representative password can, with modest additional transformations, cover a large portion of the cluster, dramatically increasing cracking efficiency.

Building on this structural insight, the authors propose a statistical guessing attack that orders password guesses by vertex degree rather than by raw frequency. In a degree‑first traversal, the attacker attempts the highest‑degree passwords first, then proceeds to lower‑degree ones, effectively performing a breadth‑first search on the graph’s degree hierarchy. Empirical results show that the cumulative success curve (“cracking curve”) aligns closely with the cumulative degree distribution. Compared with a frequency‑ordered dictionary, the degree‑ordered approach reduces the average number of guesses needed to achieve the same success rate by roughly 30 %. Moreover, when combined with standard mutation rules (character substitution, prefix/suffix addition), the degree‑based dictionary yields an additional 1–2 % success gain over state‑of‑the‑art cracking tools.

A major theoretical contribution is the formal reduction of the minimal‑dictionary problem to the Minimum Dominating Set (MDS) problem on the password graph. A dominating set D ⊆ V is a set of vertices such that every vertex in V is either in D or adjacent to a vertex in D. Selecting a password dictionary that guarantees coverage of any target password after a bounded number of simple edits is exactly equivalent to finding a dominating set. Since MDS is NP‑complete, the optimal dictionary construction is computationally intractable, justifying the use of approximation or heuristic methods. The paper presents a greedy approximation algorithm that, on the evaluated datasets, reduces the dictionary size by 5–10 % relative to a naïve frequency‑based list while preserving the same coverage guarantees.

From a practical standpoint, the findings suggest several actionable recommendations. First, password‑policy designers should identify high‑degree hub passwords and either forbid them outright or enforce additional complexity constraints that break their hub status, thereby fragmenting the graph and reducing overall attack efficiency. Second, black‑listing mechanisms should be enriched with graph‑derived information: instead of static frequency lists, dynamic blacklists could be built from identified communities or from the current dominating set, providing broader protection with fewer entries. Third, password‑cracking tools can improve performance by adopting degree‑first ordering and by focusing mutation efforts on the neighborhoods of hub passwords.

In summary, the paper reframes password security as a problem of network topology. By demonstrating that empirical password sets form scale‑free graphs with meaningful community structure, introducing a degree‑centric popularity metric, modeling cracking as a traversal of this topology, and proving the NP‑completeness of the optimal dictionary problem, the authors provide both deep theoretical insight and concrete guidance for defenders and attackers alike. Future work could explore temporal dynamics (how the graph evolves as users change passwords) and multi‑layered graphs that incorporate user attributes or service‑specific constraints, further enriching the network‑based security paradigm.