Student uses power of grammar to crack passwords

Ph.D. student Ashwini Rao utilized grammar to crack particularly long passwords.  (credit: Kate Groschner/Photo Editor) Ph.D. student Ashwini Rao utilized grammar to crack particularly long passwords. (credit: Kate Groschner/Photo Editor)

Ashwini Rao, a software engineering Ph.D. student in the Institute for Software Research, recently made important discoveries in the field of password security. She determined ways that make cracking a long password easier; in doing so, she identified grammatical structures that may add further security to longer passwords.

Generally, there are two types of password attacks: online and offline.

An online attack occurs when the attacker sits at a computer terminal, has a web page open (such as a bank site), and tries to guess a user’s password. What usually happens is that, once the attacker makes two or three guesses, the account is locked and the attacker is unsuccessful.

Offline attacks typically deal with hashed passwords used in most websites. Each plain-text password has a generated number associated with it — known as a hash — and that number is the one that is stored. When this data set of hashed passwords is leaked, like what happened to professional networking website LinkedIn recently, attackers can use an unlimited amount of computing power. Hackers will generate a huge number of plain-text guesses, hash them, and compare them to the leaked hashed passwords. If it is a match, they have found the plain-text passwords.

Rao’s research has focused on this kind of offline attack for long passwords (16 characters or more.) “Currently there is a trend toward longer passwords and those are usually called pass phrases, where instead of choosing a word with lots of special characters and numbers you tend to just increase the length of the password and try to decrease the number of restrictions you put on a user,” she said.

Rao and the other students on her team looked at a data set of about 1,500 passwords. When users were asked to come up with long passwords without any restrictions, 18 percent of those had phrases in them, or contained postal addresses or email addresses. The structure of the passwords influenced their security. When current password crackers — which are algorithms used to determine the hashed passwords — tried to figure out these codes, only 6 percent were cracked. The password crackers didn’t work because they were not meant for long ones and did not take into account the possible grammatical structure of passwords.

For example, the passwords “Shehas3cats” and “Danhas3cats” have the same length, but the second one is more secure. This is because the security of long passwords is influenced by their component parts of speech. The noun “Dan” is more difficult to guess than the pronoun “she.”

Rao understood that current password crackers don’t look at structure. “They are for conventional passwords, shorter passwords, maybe 12 characters or less. They are not built for these kind of [long] passwords,” she said. She tried to assist the security of these passwords by improving the algorithms that are used to crack them. The idea, in a sense, is to remain one step ahead of hackers. She improved the crackers in two ways.

First, she built an improved password cracker. Current crackers are dictionary based: they have a dictionary of words called “base words” (taken from a thesaurus, short phrases from Wikipedia, movie titles, music lyrics, etc.) and “mangling” rules. Such mangling rules could be “take a word from a dictionary and capitalize the first letter” or “concatenate the same word twice.” These rules are limited to working with maximum two words. The problem was that, in the long password dataset, people were using three, four, or five words.

Rao and her team wanted to build a new password cracker that could “come up with phrases — longer sequences of words,” she said. For this, they defined grammar to be a sequence of parts of speech such as nouns, verbs, and adjectives. Next, they wanted to construct a phrase automatically using the Brown corpus — a famous collection of annotated words. They took the 1,000 most common sequences of parts of speech and assigned concrete words for each part of speech. Thus they ended up with phrases, or long passwords. For each part of speech, they created a dictionary. For example, they created a noun dictionary that included all of the nouns from the Brown corpus.

Their new cracker takes the sequence of parts of speech, a dictionary for each part of speech, and the maximum guesses the cracker can make as inputs. It will generate a number of sequences and will answer how many of the passwords from a test data set it can crack.

For example, for the sequence of parts of speech “noun, adjective” — and a dictionary containing 20,000 nouns and a different dictionary containing 1,000 adjectives — the cracker will generate 20 million sequences of words. If the maximum number of guesses is only 10,000, it will take the first 10,000 sequences of words, in order by the length of the sequences, and try to see how many of those sequences match the passwords from the data set it is trying to guess.

The second improvement to the current password crackers was using the Google corpus instead of the current dictionaries. The Google corpus contains frequently used sequences of words from web pages on the Internet. By using this corpus instead of the current dictionaries, Rao’s team was able to crack 20 percent of longer passwords, much more than the predicted 6 percent.

They were able to crack an additional 10 percent of the passwords by combining the Google corpus with the new and improved cracker.

“This is the most important result. We were able to crack new passwords that otherwise couldn’t be cracked,” Rao said.

Rao will present the findings on Feb. 20 at the Association for Computing Machinery’s Conference on Data and Application Security and Privacy (CODASPY 2013) in San Antonio, Texas.