### Making an unbreakable code doesn't seem that hard.

4/07/2014 05:09:00 PM
Tweetable
I've been learning some programming languages in recent months, and one fun little exercise to help learn the languages is building a code/decode program to encrypt a hypothetical secret message. Of course, the "codes" (really just cyphers) involved in those exercises would be fairly easy to crack, even as an amateur with limited computational resources. But it doesn't strike me as all that hard to produce an unbreakable code.

The English language is one of the largest in the world, with hundreds of thousands of distinct words. That said, chances are that every word you or I have ever heard of probably exists in this database of 90,000 words. Of course, the number of words we are likely to use in our secret message is considerably fewer--a typical person only uses a few hundred words on a regular basis. Still, lets go with that 90,000 figure, bearing in mind that we will also treat spaces and punctuation as words, as well as having a "word" for each possible character (so that we can represent words that happen not to be on that list).

I presume that we want to transmit this code using standard keyboard symbols for easy typing, and I count 95 distinct characters on my keyboard, so our task is to represent approximately 90,000 words with 95 symbols. The number of characters per "word" we need is given by n such that 95^n>90,000. The minimum number of characters, then, is 3, which produces a whopping 857,375 possible combinations, well in excess of the 90,000 or so we needed. So we start constructing our code by assigning each of those 90,000 possible words and characters a unique 3-digit character combination selected at random.

At this point, a CIA cryptographer who intercepted our message would see a page of jumbled characters with no discernible spacing, paragraph breaks or obvious structure of any kind. Also, note that the vast majority of possible 3-digit combinations have no meaning in our code. But it would still be a relatively simple matter to break our code using what I'll call the "distributional method," whereby the cryptographer compares the distribution of 3-digit sequences in our code to the distribution of words in typical usage of the English language. This is a relatively simple maximum likelihood procedure that hardly needs an actual cryptographer at all.

But, fortunately for us, our code does not stop there. We have 767,375 remaining unassigned 3-digit combinations, and can easily obtain an unlimited quantity more by simply going from 3-digit combinations to 4 or more digits. With this surplus, what we do is make tons of duplicates, so that the code is no longer one-to-one--that is, each word would have more than one 3-digit code assigned to it, and our encryption software would be programmed to select one of several alternatives at random each time it encounters that word during the encryption process. But here's the trick: the number of alternative specifications for each word would be proportional to the frequency with which that word is used, so that the resulting document, on average, uses each of those 857,375 (or more) possible code combinations with equal likelihoods. That is, our code transforms the distribution of the English language into a uniform distribution. Now, that CIA cryptographer is staring at a transmission consisting of a jumble of characters with no discernible structure, and where the frequency of characters used are indistinguishable from a document generated at random. And what's nice about this is that as the code maker, you don't actually require any special knowledge of the English language at all--you can just use the empirical distributions of documents you have written as the basis for the code--whereas the code-breaker will have a much harder time determining what distribution to maximize in their procedure.

At this point, we've made decryption very hard. The cryptographer's best tool, the distributional method, has been defeated. But decryption is still not quite impossible, because the cryptographer can still use what I'll call the "positional method." Similar to how we use certain words with certain predictable frequencies, it is also true that we tend to use certain words in certain positions within our documents--for example, messages do not end in the words "the" or "a," and neither do sentences. So one way of trying to decode our message would be to compare the distribution of the locations--as opposed to the frequencies used above--of code combinations to the distribution that's typical in the English language. Fortunately, we can obfuscate this effort as well, by scrambling the message after it has been encrypted. Scrambling is a two step procedure. The first step is to insert words in obviously inappropriate locations such as, for example, appending "the" and "a" to the ends of some sentences. Because of their obviously inappropriate location, no clarification is needed when the message is decoded--the reader, being familiar with the code, will know which words to ignore, and which words were part of the original message--but the key is that this will only be apparent after the message is decoded. In addition to that, we could pre-specify that certain positions in our message will be reserved for randomized characters, so that if you don't know which positions are random in advance, you have no hope of knowing where one word ends and the next word begins (or for that matter, how many characters there are per word). The second step is to literally scramble the characters in our message according to a predetermined pattern, so that again if you do not know in advance the order in which the characters are meant to be read, you have essentially no way to decrypt the message.

At this point, CIA cryptographers really have only two things they can do: they can either write a computer program to just guess and check every possible combination, and they can use what I'll call the "contextual method" where they use their knowledge obtained elsewhere to attempt to guess what to look for in the message (for example, if you recently sent an unencrypted message talking about Cincinnati, then they might guess that the encrypted message contains the word "Cincinnati" somewhere). The former is just a matter of computational power--eventually, the computer will try all possible combinations, one of which will be the answer key. There is, however, no limit to how expensive we can make this for them--how long it takes to decrypt is increases exponentially with the number of characters we use per word: at 3 characters per word there are 857,375 possible combinations, while at 4 characters there are over 81.5 million possibilities, and at 5 characters there are 7.7 billion possibilities. There's no limit to the possibilities here.

And here's the thing: none of what I discussed above is particularly difficult to implement. Even as an amateur programmer, I could write such a program. So what I mean to say in all this is that the fact that such a thing as a "cryptographer" exists--that anyone bothers to try to decode messages at all--is a testament to the shear stupidity of code-makers. It is not, in my opinion, particularly hard to make a code that is unbreakable.
Max 4/07/2014 09:10:00 PM