Measuring Password Strength: An Empirical Analysis
We present an in-depth analysis of the strength of almost 10,000 passwords from users of an instant messaging server in Italy. We estimate the strength of those passwords and compare the effectiveness of state-of-the-art attack methods such as dictionaries and Markov chain-based techniques. We show that the strength of passwords chosen by users varies enormously, and that the cost of attacks based on password strength grows very quickly when the attacker wants to obtain a higher success percentage. In accordance with existing studies, we observe that, in the absence of measures for enforcing password strength, weak passwords are common. On the other hand, we discover that there will always be a subset of users with extremely strong passwords that are very unlikely to be broken. The results of our study will help in evaluating the security of password-based authentication means, and they provide important insights for inspiring new and better proactive password checkers and password recovery tools.
💡 Research Summary
The paper presents an empirical study of password strength using a unique dataset of 9,317 plaintext passwords collected from an Italian instant‑messaging server that stores passwords in clear text due to its use of the CRAM‑MD5 authentication mechanism. Because the passwords are available in their original form, the authors can evaluate the true security of user‑chosen passwords without the ambiguities introduced by hashing or salting.
The authors define password strength as the size of the search space that an attacker must explore to guess a given password, i.e., the expected number of attempts required under a specific cracking strategy. They examine two state‑of‑the‑art cracking techniques: dictionary attacks and Markov‑chain‑based candidate generation.
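The search-space definition can be illustrated with a minimal Python sketch. It assumes the attacker's strategy is simply an ordered stream of guesses; the function name `guess_number` and the toy guess list are hypothetical and not taken from the paper.

```python
from typing import Iterable, Optional


def guess_number(target: str, candidates: Iterable[str]) -> Optional[int]:
    """Return the 1-based position of `target` in the attacker's guess order.

    Returns None if the strategy never produces the password, i.e. the
    password lies outside the search space of this strategy.
    """
    seen = set()
    for guess in candidates:
        if guess in seen:  # a sensible attacker never repeats a guess
            continue
        seen.add(guess)
        if guess == target:
            return len(seen)
    return None


# Hypothetical guess order, for illustration only.
toy_order = ["123456", "password", "123456", "juventus", "ciao"]

print(guess_number("juventus", toy_order))  # 3 (the repeated guess is skipped)
print(guess_number("x9$Kq", toy_order))     # None (never guessed)
```

Under this view the same password can be weak against one strategy and strong against another, which is why strength is always stated relative to a specific attack.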
For dictionary attacks, they employ the John the Ripper (JtR) suite, which includes 21 language‑specific wordlists and a large “mangling” rule set that produces variations such as appending digits or concatenating words. The combined dictionaries contain roughly four million distinct entries, while an expanded set with mangling reaches about 40 million candidates. Using these dictionaries, the authors recover 25 % of the passwords. Notably, increasing the dictionary size by a factor of 300 yields only a modest rise in success rate, from roughly 7 % to 30 %, illustrating a classic diminishing‑returns effect.
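As a rough illustration of a dictionary attack with mangling, the Python sketch below expands each word with two toy rules (capitalization and a single appended digit) and measures the fraction of a password list that the expanded dictionary covers. The rules, the function names, and the sample data are assumptions made for illustration; they are not John the Ripper's rule syntax or its actual rule set.

```python
from typing import Iterable, Iterator, List


def mangle(word: str) -> Iterator[str]:
    """Yield toy variations of one dictionary word (illustrative rules only)."""
    yield word
    yield word.capitalize()
    for digit in "0123456789":
        yield word + digit


def dictionary_candidates(wordlist: Iterable[str]) -> Iterator[str]:
    """Yield each distinct candidate once, in wordlist order."""
    seen = set()
    for word in wordlist:
        for candidate in mangle(word):
            if candidate not in seen:
                seen.add(candidate)
                yield candidate


def crack_rate(passwords: List[str], wordlist: Iterable[str]) -> float:
    """Fraction of `passwords` that appear among the mangled candidates."""
    guesses = set(dictionary_candidates(wordlist))
    hits = sum(1 for pw in passwords if pw in guesses)
    return hits / len(passwords)


# Hypothetical wordlist and passwords, for illustration only.
words = ["roma", "milano"]
sample = ["roma1", "Milano", "x9$Kq"]
print(crack_rate(sample, words))
```

Each rule multiplies the number of candidates, so the dictionary grows much faster than the set of passwords it newly covers, which is the diminishing-returns pattern described above.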
The second technique builds a Markov model of character transitions from the password corpus itself. By estimating the probability of each character given its predecessor(s), the model generates password candidates in order of decreasing likelihood. When applied to the passwords that survived the dictionary stage, the model shows that only about 10 % of all passwords reside in a relatively small search space (10⁸–10⁹ guesses). The remaining 90 % require on the order of 10¹² or more guesses, making them infeasible to crack with current computational resources.
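The scoring side of such a model can be sketched in a few lines of Python. This is a first-order model (each character conditioned on one predecessor) with no smoothing, trained on a hypothetical toy corpus; the paper's model order, smoothing, and its procedure for enumerating candidates in decreasing likelihood are not reproduced here.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable

START, END = "\x02", "\x03"  # sentinel symbols marking password boundaries


def train(passwords: Iterable[str]) -> Dict[str, Counter]:
    """Count character transitions, including start and end transitions."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for pw in passwords:
        prev = START
        for ch in pw + END:
            counts[prev][ch] += 1
            prev = ch
    return counts


def log_prob(pw: str, counts: Dict[str, Counter]) -> float:
    """Natural-log probability of `pw` under the first-order model.

    Returns -inf when the password uses a transition never seen in training
    (this sketch applies no smoothing).
    """
    total = 0.0
    prev = START
    for ch in pw + END:
        seen = counts.get(prev)
        if not seen or seen[ch] == 0:
            return float("-inf")
        total += math.log(seen[ch] / sum(seen.values()))
        prev = ch
    return total


# Hypothetical training corpus, for illustration only.
model = train(["abc", "abd", "abc"])

# Rank a few candidates from most to least likely under the model.
ranked = sorted(["abd", "abc"], key=lambda pw: log_prob(pw, model), reverse=True)
print(ranked)
```

An attacker who tries candidates in this order reaches high-probability passwords first, so a password's position in the ordering serves as an estimate of its search-space size.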
Beyond raw success rates, the paper introduces a cost‑benefit framework. The cost of a single guess is modeled as the computational, time, and infrastructure expense incurred by the attacker. Multiplying this per‑guess cost by the estimated search‑space size yields the total expected cost of compromising a particular password. The analysis reveals that low‑strength passwords (small search spaces) offer a high return on investment for attackers, whereas passwords with moderate to high strength quickly become uneconomical to target. This observation supports the idea that password policies should enforce a minimum strength threshold beyond which additional complexity provides little extra security relative to the user‑experience cost.
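The multiplication described here is simple enough to write down directly. In the Python sketch below, the per-guess cost, the payoff, and the function names are hypothetical placeholders in arbitrary units; only the search-space magnitudes (10⁹ and 10¹²) echo figures mentioned in this summary.

```python
def attack_cost(search_space: float, cost_per_guess: float) -> float:
    """Total expected cost: per-guess cost times the number of guesses needed."""
    return search_space * cost_per_guess


def worth_attacking(search_space: float, cost_per_guess: float, payoff: float) -> bool:
    """True when the expected cost stays below what the attacker gains."""
    return attack_cost(search_space, cost_per_guess) < payoff


# Hypothetical figures in arbitrary cost units, for illustration only.
COST_PER_GUESS = 1e-9
PAYOFF = 10.0

for space in (1e9, 1e12):
    cost = attack_cost(space, COST_PER_GUESS)
    print(f"search space {space:.0e}: cost {cost:.1f}, "
          f"worth it: {worth_attacking(space, COST_PER_GUESS, PAYOFF)}")
```

Because the cost is linear in the search-space size, moving from 10⁹ to 10¹² guesses multiplies the attacker's bill by a thousand, whatever the actual per-guess cost is.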
The authors also compare their search‑space metric with the simplistic “bit‑strength” measures commonly used in password checkers (which count length and character class diversity). They argue that the search‑space approach captures the real difficulty of guessing a password under realistic attack models, and therefore can guide the design of more effective proactive password checkers that give users actionable feedback about the actual effort an attacker would need.
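To make the contrast concrete, here is a minimal Python sketch of one common formulation of such a "bit-strength" estimate: password length times log2 of the alphabet implied by the character classes present. It is an assumed, generic formula and not the scoring rule of any particular checker.

```python
import math
import string


def naive_bits(pw: str) -> float:
    """Length times log2(alphabet size), where the alphabet is inferred
    from which ASCII character classes appear in the password.

    Characters outside these four classes are ignored when sizing the alphabet.
    """
    alphabet = 0
    if any(c in string.ascii_lowercase for c in pw):
        alphabet += 26
    if any(c in string.ascii_uppercase for c in pw):
        alphabet += 26
    if any(c in string.digits for c in pw):
        alphabet += 10
    if any(c in string.punctuation for c in pw):
        alphabet += len(string.punctuation)
    return len(pw) * math.log2(alphabet) if alphabet else 0.0


# A dictionary word with a capital and a digit outscores a random-looking
# lowercase string, although it would fall to simple mangling rules.
print(naive_bits("Password1") > naive_bits("qzvkxwpt"))
```

The estimate rewards length and class diversity but is blind to guess order, so it can rate a password near the front of an attacker's dictionary above one that no dictionary would produce.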
In conclusion, the study finds a wide variance in password strength among the user base: roughly half of the passwords are trivially weak, while a small but significant subset are extremely strong and would resist even sophisticated attacks. Dictionary attacks alone recover only a quarter of the passwords, and even when combined with advanced Markov‑chain generation, the cost to an attacker escalates sharply after the weakest passwords are exhausted. These empirical results provide concrete data for system administrators and security designers to calibrate password policies, estimate the economic feasibility of offline cracking attempts, and develop user‑friendly yet secure password‑creation tools.