My thoughts as an enterprise Java developer.

Tuesday, August 04, 2009

Defeating Character Frequency Analysis of Substitution Ciphertext

(This is a paper I wrote in the fall of 2008 for CS461: Computer Security I at UIUC. If you are interest in the code leave a comment and contact info.)


Character frequency analysis is a common way to crack substitution ciphertext because of the uneven distribution of letters in languages. In order to defeat that analysis, the techniques considered were: synonyms, dropping letters or words, adding extra letters or words, alternate spelling (Leet, spam, etc), and Two cipher letters per plaintext letter. Most of these techniques would be applied before a standard substitution cipher encryption. To evaluate the techniques, test data was obtained from Project Gutenberg( and consisted of the texts of Psalms (Ps), Pride and Prejudice (P&P), Alice in Wonderland (AiW), and Sherlock Holmes (SH). After applying the technique, the output character frequency was compared to the standard character frequency ( and
) and then the standard deviation of the proportion of the changed frequency for each letter compared to the standard character frequency. i.e. If the letter "e" occurred 10% of the time in the changed text then the proportion for "e" is .787 (.10/.12702). That gives an overall measure of how close the changed text character frequency is to the standard character frequency. A standard deviation of 0% would mean that it exactly matches the standard frequency. Initial standard deviations are: Ps: 25%, P&P: 28%, AiW: 25%, SH: 14%. For comparison, an equal distribution produces a standard deviation of 1,439%. In there are programs for most of these options and for computing the standard deviation.


Replacing words with synonyms that have a higher standard deviation can help defeat character frequency analysis. The thesaurus used has synonyms in categories, so when a word was looked up, the first category was arbitrarily used. Within the first category, the word that had the largest standard deviation from the normal character frequency was chosen. The chosen work often didn't fit grammatically, so the results are skewed, but this is somewhat offset by only looking in the first category. To make this usable, grammar checking would have to be used. It would probably need a UI that would suggest alternatives (based on grammar rules and the standard deviation). Attacks against this technique could include creating a list of the best synonyms, creating a new standard character frequency distribution for those words, and therefore largely defeat the benefit of this technique. Final standard deviation achieved: Ps: 182%, P&P: 180%, AiW: 135%, SH: 149%. The increase in standard deviation was very significant, but not enough to make a large impact on frequency analysis.

Dropping letters or words

Randomly dropping letters or words could easily hide the meaning from the intended recipient so it would be hard to keep the meaning while still randomly dropping enough to make an impact. An attempt to always drop vowels and rare letters (a,e,i,o,u,x,q) and words (I, the, is, was, of, a, & an) made an insignificant difference (Drop Words Ps: 8%, P&P: 6%, AiW: 8%, SH: 8%; Drop Letters: Ps: 73%, P&P: 91%, AiW: 69%, SH: 70%). Also, some of each letter would need to be dropped to prevent the non-dropped letters from serving as landmarks.

Adding extra letters or words

To be effective, extra letters and words must be randomly placed. They could be chosen based on the inverse of distribution. However, additions that have a significant impact on the standard deviation may garble the message too much. Add letters results(based on inverse of distribution): P&P: 2,154%, SH: 2,795%. The result is readable but takes significantly more work to understand the message.

Alternate spelling (Leet, spam, etc)

Alternate spelling includes l33t ( and spam spelling ( and is similar to synonyms, in that the possibility of a new new standard character frequency has to be taken into account. This technique was not implemented or programatically evaluated because it was non-obvious what to use as the standard distribution.

Two cipher letters per plaintext letter

Using two letters in the cipher text for each letter in the plaintext can be a good way to create a flat character distribution. The algorithm is to partition the 676 2-letter combinations based on the standard character frequency. i.e. if the standard frequency for a letter is 5% then it will get 5% of the 2-letter combinations (randomly selected). This doubles the size of the data, could include spaces & punctuation, and makes a much larger key. Note that some letters may get dropped because they occur less than 1/676 (0.15%) of the time. Both 1-gram and 2-gram frequency analysis produce a nearly uniform histogram (variation appears to only be caused by rounding). Two-gram results: P&P=5,117%; SH=5,013%. Therefore this technique was extremely effective with no obvious weaknesses.


Changing the plaintext proved problematic because of inadvertent changes to the meaning, but it did make a significant impact on standard deviation even though it probably wasn't significant enough. Applying all of the plaintext changes to a short message while using safeguards to protect the meaning would probably be effective. Two cipher letters per plaintext letter appears to be the easiest way to defeat character frequency analysis.

No comments: