CFS Travails
travis+web@subspacefield.org

This is a story about a mysteriously corrupted encrypted file system and my attempts to recover the data. Forgive me if my narration of it is a bit rough, as I am not exactly sure what happened; all I can do is make observations. This is my first attempt to turn a chronological log of observations into a good narrative. In doing so I am gaining a new appreciation for the work of investigative journalists, who must go through a similar process to turn their notes into a compelling story. This is a work in progress. If you have solutions to any problems I mention herein, please email them to me.

The encrypting file system under observation is CFS, the "Cryptographic File System" by Matt Blaze. Parts of it date back to 1987, making it a rather venerable piece of code. Matt Blaze is well known in the computer security community, and the code has been available for scrutiny for some time. The code is fairly portable and may be the most popular encrypting filesystem for Unix. If we ignore the fact that very few Unix users use encryption to protect their files, one might be tempted to assume that CFS should be relatively secure and bug-free, but alas, this is not the case. CFS is available here:

First, a little background on how CFS works. CFS stores each encrypted file as a separate entity in the file system (it is not an encrypted disk device, in other words). For each file, CFS also stores an initialization vector (IV). At first CFS stored the IV in the inode number, then in the GID field, and finally Matt decided to use a separate symlink (with the prefix ".pvect_") to hold it (by pointing to it). In CFS, the IV is used to make files with identical contents encrypt to different data streams, to prevent snoops from noticing correlations between files.

It's hard to know exactly when the corruption occurred or why. My logs say I had some disk problems around 2 May 2001. I believe some of the encrypted files were moved to lost+found. My notes say that I restored a copy of the encrypted directory hierarchy to make comparisons and moved some files around. I was not aware of the significance of the symlink IVs at that time and probably made some mistakes that de-synchronized them. I did make some edits to sensitive files on 15 Dec 2001, and apparently the files I was working on were not corrupted. Then followed a period of hassles getting CFS to co-exist with my OS, which had a new v3 RPC mechanism. I played around with recompilations of CFS around this time, and I was also using an OS release with a buggy compiler (egcs-1.1.1). This compiler is known to generate incorrect code, and could be the source of some of the data corruption.

On 6 Dec 2003, decrypting certain files yielded corruption. That is, ASCII text files suddenly weren't. How can I quantify this corruption? I ran a tool called "ent" on the files, and it yielded some interesting results:

    $ ent corrupted_text_file
    Entropy = 5.558949 bits per byte.
    Optimum compression would reduce the size of this 6588 byte file by 30 percent.
    Chi square distribution for 6588 samples is 40067.26, and randomly would exceed this value 0.01 percent of the times.
    Arithmetic mean value of data bytes is 63.7025 (127.5 = random).
    Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
    Serial correlation coefficient is -0.087718 (totally uncorrelated = 0.0).

The entropy is one measure of the amount of [unpredictable] information in the file.
For totally random files, it should be 8 bits per byte, and for a file full of nulls, it should be 0 bits per byte. The optimum compression measurement merely restates the entropy as the percentage of the file that is redundant (predictable). Chi square is a statistical test that involves putting things in "bins" (probably 256 bins in this case, one for each possible byte value), squaring the difference between the actual bin contents and the predicted values, then dividing by the expected count for each bin (here N/256, since all byte values are equally likely under the null hypothesis). Thus, the more a file deviates from random, the higher the chi-square score, and the lower the percentage of cases that would exceed it. The arithmetic mean is obvious. The Monte Carlo approximation involves constructing a virtual 1x1 dart board with a circle of diameter one inside it, using the samples as coordinates of darts, and dividing the number that land inside the circle by the total number of darts; that fraction approximates pi/4, so multiplying it by four gives an estimate of pi (it converges rather slowly). Finally, the serial correlation merely describes how much each sample depends on the prior sample.

Normally when encryption goes awry (due to using the incorrect key, for example), you get something indistinguishable from random bits. As you can see, the files are definitely not random. Random looks like this:

    $ ent random_file
    Entropy = 7.970002 bits per byte.
    Optimum compression would reduce the size of this 6144 byte file by 0 percent.
    Chi square distribution for 6144 samples is 250.58, and randomly would exceed this value 50.00 percent of the times.
    Arithmetic mean value of data bytes is 126.5417 (127.5 = random).
    Monte Carlo value for Pi is 3.152343750 (error 0.34 percent).
    Serial correlation coefficient is -0.009299 (totally uncorrelated = 0.0).

On the other hand, they are more random than ordinary text files:

    $ ent uncorrupted_text_file
    Entropy = 4.436244 bits per byte.
    Optimum compression would reduce the size of this 5134 byte file by 44 percent.
    Chi square distribution for 5134 samples is 76869.97, and randomly would exceed this value 0.01 percent of the times.
    Arithmetic mean value of data bytes is 92.6139 (127.5 = random).
    Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
    Serial correlation coefficient is -0.008061 (totally uncorrelated = 0.0).

To do some more tests, I wrote two programs, "freqcount.pl" and "freqgraph.pl". The former counts the frequency of each byte, and the latter represents the counts graphically (in text mode) as horizontal bars of asterisks, so you can compare two files in side-by-side windows. What I'm seeing appears to be disruption of the large frequency spike of the space character; however, it is definitely not uniform. It looks as though the text has gone through some polyalphabetic substitution with a periodicity of four or so.

My first reaction was to go to backup tapes. However, my backup tapes for 15 Oct 2003 are no good; I ran out of tape and didn't notice. And although I have many backups, this corruption went unnoticed for so long that I don't have a backup set old enough.

My next reaction was to see if cfsd, the NFS-like daemon that serves up the files, was at fault. I decided to try ccat (a stand-alone utility) on some of the files to eliminate cfsd as a source of errors. Unfortunately it warns that it only works on old-format CFS files. You see, CFS went through several stages of evolution. I won't bore you with the details.
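To make the byte-frequency idea concrete, here is a minimal sketch along the lines of freqcount.pl and freqgraph.pl. It is a reconstruction for illustration, not the scripts themselves:

    #!/usr/bin/perl
    # Count byte frequencies in a file and draw each count as a horizontal bar
    # of asterisks, so two files can be compared in side-by-side windows.
    # A sketch in the spirit of freqcount.pl/freqgraph.pl, not the originals.
    use strict;
    use warnings;

    my @count = (0) x 256;
    open my $fh, '<', $ARGV[0] or die "$ARGV[0]: $!";
    binmode $fh;
    while (read($fh, my $buf, 4096)) {
        $count[$_]++ for unpack 'C*', $buf;
    }
    close $fh;

    my ($max) = sort { $b <=> $a } @count;      # tallest bar sets the scale
    for my $byte (0 .. 255) {
        next unless $count[$byte];
        my $bar = '*' x int(60 * $count[$byte] / $max);
        printf "%02x %6d %s\n", $byte, $count[$byte], $bar;
    }

Comparing the bars for a corrupted file and a clean one in side-by-side windows is what makes the smeared-out space spike visible.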
The stand-alone tools like ccat do not share the decryption routines with cfsd, and they had not been kept up to date with the various evolutionary changes. I sent an email to Matt Blaze to get some help, but got no reply.

I read Matt Blaze's notes.ms and read through some of his code, particularly cmkdir.c. I am summarizing my findings in a README.internals document I am creating. I found a few errors in there that are security-relevant; it makes me question how much trust I put into this software without doing even a cursory review of the code. To summarize my findings here:

1) The type of cipher is included, apparently erroneously, in some key manipulations (the bcopy from &k starts at the cipher field, so the cipher identifier gets mixed into the key material):

    struct cfs_admkey {
        ciphers cipher;
        union {
            cfs_adm_deskey deskey;
            cfs_adm_3deskey des3key;
            ...
        } cfs_admkey_u;
    };
    cfs_admkey k;
    ...
    /* now we xor in some truerand bytes for good measure */
    bcopy(&k, ekey, 32);   /* assumes key material < 32 bytes */
    for (i = 0; i < 32; i++) {
        ekey[i] ^= randbyte();
    }
    encrypt_key(&k, ekey);
    bcopy(ekey, ek1, 32);
    decrypt_key(&k, ek1);  /* new &k is our real key */

2) A file called "..." is created. Apparently this is to provide a somewhat-known plaintext so that the program knows whether you have given the correct key. He attempts to make this 8-byte value half random, but due to shifting in the wrong direction, the attacker knows 7 of the 8 bytes, almost giving him a full known plaintext (the left shift pushes the random bits out of the low byte, so str[5] through str[7] always end up zero and only str[4] is actually random). By contrast, PGP uses two bytes of known plaintext to determine if you have the correct key (meaning that it accepts the wrong key 1 in 2^16 times).

    char str[8];
    ...
    strcpy(str, "qua!");
    /* now randomize the end of str.. */
    r = trand32();
    for (i = 0; i < 4; i++)
        str[i+4] = (r << (i*8)) & 0377;

At this point I thought that the IV was stored in the inode numbers. CFS used to do this, and I almost certainly changed the inode numbers on several files during my restoration efforts. There are 2^32 possible inodes, and I would have to search for the correct one for each file. This could have taken a while; I would have had to be clever and crafty to finish this project in my lifetime. But perhaps it was not as impossible as it seemed: the stream generated from the inode number is XORed with the data at one point, so perhaps I wouldn't have had to try all of them in a brute-force method. Also, the file system may not have 2^32 inodes to try out --- perhaps far fewer. Yes, this might be possible.

I contacted the TCFS people, who based their work on CFS. I did this in part to warn them about some of the vulnerabilities I had found, and also to solicit help in understanding CFS. I noticed that many of the links and mirrors of their web page are broken. Their last announcement was over a year ago. And when I sent them email, it bounced. Sigh, more abandonware.

I thought my problem was that there's a different inode used to seed a PRNG of some kind, creating a stream which is being XORed with my program's text. To extract the original text, I need to know the right inode number. The size of the inode number tells me its maximum theoretical value; if the file system it is on has only ever grown, its maximum inode value gives me an even lower upper bound; and the inodes of nearby files might give me a good place to start searching. I will also need to recognize when I've gotten the inode right. This will probably be some statistical test on the contents of the file (remember, I have the key already, so the file "..." is no help).
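Such a statistical test might be as simple as scoring how plausible a candidate decryption looks as ASCII text. Here is a minimal sketch of the kind of scorer I have in mind (an illustration only; the thresholds are guesses, not anything taken from CFS):

    #!/usr/bin/perl
    # Score a candidate decryption by how much it looks like ASCII text:
    # mostly printable bytes, with a healthy share of spaces.  The 95% and 5%
    # thresholds are guesses for illustration.
    use strict;
    use warnings;

    sub looks_like_text {
        my ($buf) = @_;
        my $len = length($buf) or return 0;
        my ($printable, $spaces) = (0, 0);
        for my $c (split //, $buf) {
            my $o = ord($c);
            $printable++ if ($o >= 0x20 && $o < 0x7f) || $o == 0x09 || $o == 0x0a;
            $spaces++    if $o == 0x20;
        }
        return ($printable / $len > 0.95) && ($spaces / $len > 0.05);
    }

    # feed it a candidate plaintext on stdin; keep the guess if it passes
    my $candidate = do { local $/; <STDIN> };
    $candidate = '' unless defined $candidate;
    print looks_like_text($candidate) ? "plausible\n" : "junk\n";

Anything that passes the test would go on a short list for manual inspection.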
On top of all this, these files' contents are sensitive (that's why they're encrypted), so I won't be able to distribute any brute-forcing if I have to examine the file contents to determine a match. I will either have to use his code or develop my own, and test it to make sure it is doing exactly the same thing as his.

I think this is a turning point in my level of expertise. First I found that I couldn't get advice from family, people I met at social events, etc. In the second stage, I found that no close friend could give me advice. Increasingly I had to go beyond my circle of friends to seek out experts. And now, with the few experts unreachable or unwilling to help, I find that I must solve the problems myself. I suppose in the last stage, I won't be able to find anybody more expert.

At this point I decided to print out all the papers on CFS and dive deep, really deep, into the belly of the beast. I really needed to understand the encryption that CFS is using. Perhaps I can be clever and find an algorithmic shortcut, or a known-plaintext situation.

On 6 Jan 2004, I found there are "only" about 280,000 inodes on that file system. That means 280,000 trials. This is getting reasonable. Checking each decryption for the bytewise frequency count of the space character could do a 1:256 winnowing. I'd need about another 1:100 reduction of those candidates to get down to manual inspection (I figure 10 manual inspections per file is my upper limit).

On 7 Jan 2004, I think I may have found my answer, thanks to the equations in Matt's papers, which I quote in README.internals:

    D_p = DES^-1(K_2, E_p xor DES^1(K_1, g(p mod m))) xor DES^1(K_1, f(p mod m)) xor i

All the data are 8 bytes wide, except perhaps the keys. I think in this case, everything is known except i, which "is a bit representation of a unique file identifier derived from the Unix inode number and creation time of the encrypted file". His reference to a creation time is a little odd, as UFS inodes only store access, modification, and change times. Still, in the equation above, i is the only variable that is unknown to me. Since it is not used to seed any RNGs or to key any ciphers, I am in luck. Looking at the decryption equation, if i were wrong, it would give me the correct data but XORed with some constant. That would be consistent with the entropy measurements I have made, which indicate that the file is definitely non-random but does not have the strong frequency spike of the space character found in English prose.

Let me put this another way. If you plugged the wrong value of i (let's call it i') into the above equation, you'd get this:

    D_p' = D_p xor (i' xor i)

That's what is happening to me. Each 8-byte chunk of data that I see is the original plaintext, XORed with some 8-byte constant value, which is the error (XOR) of i' and i. Statistically, if I know the most common D_p', and I know the most common D_p, and I know i', I can solve for i:

    i = D_p' xor D_p xor i'

Now, in that equation, each variable is an 8-byte array. At first, I thought i was four bytes long and was simply concatenated with itself to form an 8-byte value (later I learned that i was indeed an 8-byte value). Each byte of i can be solved independently:

    i[0] = D_p'[0] xor D_p[0] xor i'[0]
    i[1] = D_p'[1] xor D_p[1] xor i'[1]
    ...
    i[3] = D_p'[3] xor D_p[3] xor i'[3]
    i[0] = D_p'[4] xor D_p[4] xor i'[0]
    ...
    i[3] = D_p'[7] xor D_p[7] xor i'[3]

In other words, I essentially have a polyalphabetic substitution cipher, with 4 substitution tables which are all built on bitwise XOR with a constant. For text files, I may even be able to solve each byte of the value i independently, suggesting only 4*256 = 1024 guesses! This is workable! In fact I can vary each byte of i at the same time, meaning only 256 decryptions!

On closer reading of the release notes (notes.ms), I find that CFS now stores the IV for a file named filename in a symbolic link named .pvect_filename. Amazingly, I have traced the corruption to exactly those files missing their symbolic links! Hooray! Now I know exactly what is missing, which files are corrupted, and how to fix it! I have verified that these .pvect_ files are missing on my oldest backup too.

This new development means I must guess 8 hex digits, which is 32 bits of randomness. Quite an improvement, although I must read the source to see exactly how the IV is used and whether I can search each byte independently. I suspect that I will be able to. I can test my theories by creating a known plaintext file, noting and then removing the .pvect_whatever file, and then conducting a search for the proper (known) value. This might be faster than understanding CFS's code.

On 9 Jan 2004, I wrote a little program to find files without .pvect_ files. There are 19, and there have been exactly 19 for a while. That's not too bad; overall I have about 500 files in CFS. I also modified ccat to the extent necessary to work with new-style directories. It's not pretty, but it works. Huzzah!

Then I modified ccat to accept both an IV and a passphrase. I wrote a program that tries IVs with each byte ranging from 0-255. For each of the 256 possibilities, it runs ccat with an IV of that value, repeated (all bytes equal). Then it does a simple frequency count for offsets of 0, 1, 2 and 3 into the file. After doing this it dumps all the information using Perl's Data::Dumper. My plan was to analyze this output later using statistical tests. Estimates of the time to do this were around 38 minutes, which turned out to be surprisingly accurate.

Thinking back, I do think I recall seeing symlinks pointing to garbage in the lost+found; I must have decided they were deletable and didn't restore them. All this work due to that miscalculation! Oh well, I have learned quite a bit about CFS...

At this point I noticed that the IV is 8 bytes of ASCII-encoded hexadecimal, which is a convenient way to store a 32-bit value. However, it is not converted into binary before use; the ASCII values are used instead! So basically you have a 64-bit IV that can only hold 2^32 different values. Thus, my program should really have calculated frequency counts for offsets of 0-7 into the file (not 0-3):

    i[0] = D_p'[0] xor D_p[0] xor i'[0]
    i[1] = D_p'[1] xor D_p[1] xor i'[1]
    ...
    i[7] = D_p'[7] xor D_p[7] xor i'[7]

It's worth noting here, as a sidebar, that the more I know about the file in question, the better I can narrow down the possibilities for the IV. If I know nothing about the file (that is, if it is truly random data), I cannot tell whether one IV is better than another. The more I know about it, the better I can weed out all the irrelevant IVs. This knowledge comes from knowing the names of the files, remembering their contents, that kind of stuff.

One thing about the IVs is that they don't affect the high bits of the file. That is, since the IV is all in the ASCII range 0-127, XORing it into the file will not change the high bits.
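To make that concrete: the IV characters are ASCII hex digits, so there are only 16 possible values per byte position, none of which has the high bit set, and the XOR difference between any two such characters stays below 0x80 as well. A quick Perl check (assuming lowercase hex digits, which is a guess on my part):

    #!/usr/bin/perl
    # Every IV byte is an ASCII hex digit; none of them has the high bit (0x80)
    # set, so no XOR difference between two IVs can have it set either.
    # (Lowercase hex digits are an assumption here.)
    use strict;
    use warnings;

    my @digits = ('0' .. '9', 'a' .. 'f');
    for my $d (@digits) {
        printf "'%s' = 0x%02x, high bit %s\n", $d, ord($d),
            (ord($d) & 0x80) ? "set" : "clear";
    }

    # and the pairwise XOR differences stay below 0x80 as well
    my $max = 0;
    for my $a (@digits) {
        for my $b (@digits) {
            my $x = ord($a) ^ ord($b);
            $max = $x if $x > $max;
        }
    }
    printf "largest pairwise XOR difference: 0x%02x\n", $max;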
That is unfortunate, as the files typically don't have the high bit set anywhere. However, this makes sense from a cryptographic perspective of trying to hide similarities between files; you are modifying bits that people care about, instead of ones that are typically not used.

Taking all this into consideration, I wrote a program (smart_iv) which tells you what the missing IV is very quickly. The basic idea is that you run ccat once, get the output, and break it down by which byte of the IV it is being XORed against (forming 8 bins). Then you count the frequencies of the characters in each bin. The whole data structure is essentially an 8x256 array. Now I want to test for the IV that gives me the greatest number of space characters. One could remap the frequency counts based on the IV being tested, but I found it easier to go the other way; that is, to XOR the space character with the IV in question and look at the frequency of that entry in the frequency chart. This program was capable of recovering the longer files of English text, probably about 5 of the 19. I then renamed this program (find_blank) because I had some smarter ideas.

In my call to ccat, I have changed it to accept an IV, but what IV do I specify? Ideally I would use a null vector. That is, if I let i'=0, then it drops out of the equations mentioned above, because XOR with 0 is an identity operation. In fact, cfsd probably assumes i'=0 when there is no .pvect_ file. However, ccat uses C string functions to parse the IV from the command line, and eight nulls would be read as a zero-length string. So instead I specify "00000000" and have to deal with the XOR deltas between that and the actual IV instead of with absolute values (that is, they are relative to 0x3030303030303030, since ASCII "0" is 0x30).

Around this time I noticed that there were other files which had .pvect_ files but were also corrupted. Is there no end to this pain? I will have to examine every file on the encrypted file system, by hand, to determine whether it looks reasonable. Argh! I will put this off until I have completed recovering the 19 files without .pvect_ symlinks.

11 Jan 2004

I noticed that there was a class of files which were not English prose, and therefore did not have the prevalence of the space character in their frequency profiles. So I wrote another program, called maximize_printables, which found the IVs that maximized the number of printable characters. This program did not work as well as I expected, since some IVs which maximized printables also produced junk like vertical tabs (!), which virtually never appear in ASCII files. So the next logical step was another program, called minimize_nonprintables, which did what you'd expect. However, unlike the space-maximizing heuristic, which tended to pick only a handful of IVs, minimize_nonprintables generally printed out several (on the order of 16 or so). I found that since it usually got half of the bytes in the IV correct, I could try one possibility with ccat, see halves of words that I recognized, and then exploit these inter-byte dependencies to infer the correct values of the currently incorrect IV bytes. I was able to recover around 5 more of the 19 files without .pvect_ files.

[NB: theoretically, minimize_nonprintables and maximize_printables should give the same results, as they are complementary. However, my implementation of one or the other must have been in error, since I think I noticed different results.]
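To make the space-counting approach concrete, here is a stripped-down sketch in the spirit of find_blank. It is a reconstruction rather than the original, and it assumes the input is ccat output produced with the reference IV "00000000" and that the IV characters are lowercase hex digits:

    #!/usr/bin/perl
    # For each of the 8 IV byte positions, pick the IV character that would
    # turn the most common byte in that position's bin back into a space.
    # A sketch in the spirit of find_blank, not the original; assumes the
    # input file is ccat output made with the reference IV "00000000".
    use strict;
    use warnings;

    my @freq = map { [ (0) x 256 ] } 0 .. 7;    # 8 x 256 frequency table
    open my $fh, '<', $ARGV[0] or die "$ARGV[0]: $!";
    binmode $fh;
    my $data = do { local $/; <$fh> };
    close $fh;

    my $pos = 0;
    for my $byte (unpack 'C*', $data) {
        $freq[$pos % 8][$byte]++;
        $pos++;
    }

    my $iv = '';
    for my $slot (0 .. 7) {
        my ($best_char, $best_count) = ('?', -1);
        for my $char ('0' .. '9', 'a' .. 'f') {
            # if the true IV byte is $char, a plaintext space shows up in the
            # ccat output as 0x20 xor $char xor "0"
            my $seen = 0x20 ^ ord($char) ^ ord('0');
            if ($freq[$slot][$seen] > $best_count) {
                ($best_char, $best_count) = ($char, $freq[$slot][$seen]);
            }
        }
        $iv .= $best_char;
    }
    print "best guess for the IV: $iv\n";

Swapping the scoring step for a count of printable (or non-printable) characters gives maximize_printables or minimize_nonprintables in the same mold.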
I should mention here some of the other tools I use. I use bc quite a bit, even though it has the annoying property of not accepting lowercase letters as hex digits. It also does not have a bitwise XOR operator, so I often end up using Perl; if you try stuff like this, you'll need the chr and ord functions. I also refer to an ASCII chart quite a bit.

So finding those halves of words in the plaintext gave me an idea. Perhaps I should write another program, for files where I might know a string that occurs somewhere in them. For example, an RCS file has the words "head", "access", "symbols" --- surely this knowledge could be helpful in determining an IV! Again, the more I know about the plaintext, the easier it is to recover an IV.

For simplification, let me consider the case of an RCS file. In this case, I know it begins with "head\t". In most cases I never increment the major number, so the next characters are usually "1.", followed by one or more digits. If I know that a fixed string of length l occurs at a particular offset into the file, I can easily compute l bytes of the IV (up to all 8), starting at that offset modulo 8 and wrapping around in the obvious fashion.

12 Jan 2004

It's worth noting that this applies to other types of files too; for scripts, I start with "#! /". The extra space is there for portability, because Dynix uses a four-byte "magic number" to identify scripts. I never expect to see Dynix, but it's only one byte, and portability is not a bad habit to have. In many cases the following letters are "usr/" too. I like to start from the simple and specific cases, and enhance scripts to handle the general cases.

So the next program I wrote, known_header, cracks an IV based on known plaintext at the beginning of the file. In this case, I simply XOR "0", the known byte, and the byte from the decrypted file, and that gives me the IV. Simplicity itself! This was capable of recovering about 5 of the 19 files. The rest of the 19 were automatically generated files which I simply removed, so I consider these techniques moderately successful.

Now on to something more difficult. Let's say I'm looking for a known string of some length l in the plaintext, at an unknown location. If the IV were fixed, I would look for one of eight patterns, with the pattern depending on my offset into the file (modulo 8, of course). At each step I might have to look ahead as far as l-1 bytes to see if I'm at the beginning of a match. That might be acceptable, but the IV isn't fixed. What I'm really doing is trying to find the IV that induces the most occurrences of that string, and I'm trying to do it without iterating over all possible IV values (2^32 iterations over the length of the file sounds like too much computation). Surely there has to be a better way!

14 Jan 2004

At this point, to count strings of length l, I'm thinking of an (l+1)-state DFA with multiple "tokens" that move around to keep track of the state. Specifically, there will be 2^32 logical tokens, but in the implementation I'll consolidate them into l+1 tokens, one for each state. Every time a logical token hits the final state, it increments a counter associated with its IV. In actuality, there may be multiple logical tokens for a given IV. For example, if the IVs were only 2 bytes long and the string we want to find is ABABC, then after reading ABAB there will be two tokens, one on the first B and one on the second. Note that the string must be longer than the IV for a given IV to have two logical tokens in the works.
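Backing up for a moment to known_header: the per-byte recovery really is just the XOR identity from earlier, applied at a known offset. A minimal sketch (a reconstruction, not the original script; the second argument is assumed to be ccat output made with the reference IV "00000000"):

    #!/usr/bin/perl
    # Recover IV bytes from known plaintext at the start of a file:
    #   i[j] = D_p'[j] xor D_p[j] xor i'[j],  with i'[j] = "0" = 0x30.
    # A sketch in the spirit of known_header, not the original script.
    use strict;
    use warnings;

    my ($known, $outfile) = @ARGV;
    die "usage: $0 known-prefix ccat-output\n"
        unless defined $outfile and length $known;

    my $n = length($known) < 8 ? length($known) : 8;   # at most 8 IV bytes
    open my $fh, '<', $outfile or die "$outfile: $!";
    binmode $fh;
    my $got = read($fh, my $out, $n);
    die "ccat output shorter than the known prefix\n"
        unless defined $got and $got == $n;
    close $fh;

    my $iv = '';
    for my $j (0 .. $n - 1) {
        $iv .= chr(ord(substr($out, $j, 1)) ^ ord(substr($known, $j, 1)) ^ 0x30);
    }
    printf "recovered IV bytes 0..%d: %s\n", $n - 1, $iv;

With the RCS header as the known prefix this yields the first few IV bytes directly, and the rest can be guessed from context as before.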
First, to simplify the matter, I think I want to modify ccat to accept a null IV. I'll just test to see if it's a null string, and as a special case if it is, accept it as the null vector (which would be impossible to pass on a command line anyway). Okay, that cleans things up a bit.

17 Jan 2004

It's interesting to note that in some cases I may know two different kinds of things about a file; for example, that space is its most frequent character, and that it contains a given string. Each of these individually induces a distribution on the IV. However, it is more difficult to represent and combine these distributions across two programs than it would be to achieve the same level of confidence using one technique alone. That is, it is easier to print out the most probable IV than it is to communicate the top ten IVs, and that is easier than communicating all IVs and their associated probabilities.

This has implications for designing a cryptographic system. For example, you can split up the plaintext under numerous keys (perhaps using the unicity distance as a guide to how much per key) and encrypt those keys with a "metakey". This would make things more difficult for the cryptanalyst, since if your keys are perfectly random he cannot attack the metakey directly. Instead he must combine limited information about a number of keys, and the encryption of those keys under the metakey induces probabilities in the metakey that are complex by design. This would make encryption slower, due to the need to rekey, but would require more sophisticated cryptanalytic software.

Are such gains against cryptanalysis valuable? Stated another way, this question asks whether it is valuable to make attacks against you more expensive but not impossible. The answer depends on the distribution of attackers over resources. If your only concern is attackers with infinite resources, perhaps not. I believe that in most cases, "raising the bar" is valuable, but one should decide on a strategy based on a scientific cost-benefit appraisal. If nothing else, use of metakeying would retain the benefits of using a proven cipher while separating you from the crowd. Attacking your system would require custom modification of existing cryptanalytic software. This means expert manpower, and that directly translates to money.

19 Jan 2004

I'm writing known_strings. I start by examining all the relevant finite-state automata (FSA) classes on CPAN; none really fit the bill. Next I write the main body of the program, pretending as though I have already written any relevant classes. I know that I'll need some kind of custom FSA and that I'll need some structure in which to store results.

Representing the results of the search poses an interesting challenge. Do I want to allocate a Perl structure with 2^32 entries? Even at a byte per entry, that's the entire address space of the machine! I could do this with a large file, but it's inelegant, and if the known string is short, there would be many matching IVs and it would be quite slow. One alternative would be to store only non-zero entries; this is only useful if the resultant array would be sparse, and its sparseness depends on the length of the string I'm looking for and the content of the file. Another alternative would be to represent the partial matches of IVs and, at the end of parsing the file, loop through all IVs and see how many of these partial matches a specific IV actually matches. Put another way, my results may be "any IV that starts with A" and "any IV that starts with AB"; I would then enumerate all IVs, starting with AAAAAAAA, noting that it matches once, so I output "AAAAAAAA 1". When I get to ABAAAAAA, it will match twice, so I output "ABAAAAAA 2". Thus, instead of storing the results array in memory, I am actually storing it temporally, by walking through the IVs and dynamically generating the values the array would have held.
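For what it's worth, a Perl hash makes the "only non-zero entries" alternative nearly free, provided the candidates come from somewhere naturally sparse, such as the offsets of the file itself: every offset where the known string could sit implies at most one candidate IV. A rough sketch of that idea (an illustration, not the actual known_strings; it assumes ccat output produced with the reference IV "00000000"):

    #!/usr/bin/perl
    # For each offset where the known string might start, derive the IV that
    # would have to be in effect, and tally votes in a hash; the hash never
    # holds more entries than there are offsets.  A sketch, not known_strings.
    use strict;
    use warnings;

    my ($string, $outfile) = @ARGV;
    die "usage: $0 known-string ccat-output\n"
        unless defined $outfile and length $string;

    open my $fh, '<', $outfile or die "$outfile: $!";
    binmode $fh;
    my $out = do { local $/; <$fh> };
    close $fh;

    my %votes;    # candidate IV => number of offsets that imply it
    OFFSET: for my $t (0 .. length($out) - length($string)) {
        my @iv = (undef) x 8;
        for my $k (0 .. length($string) - 1) {
            my $slot  = ($t + $k) % 8;
            my $guess = chr(ord(substr($out, $t + $k, 1))
                          ^ ord(substr($string, $k, 1)) ^ 0x30);
            # a string longer than 8 bytes must imply a self-consistent IV
            next OFFSET if defined $iv[$slot] && $iv[$slot] ne $guess;
            $iv[$slot] = $guess;
        }
        # real IVs are ASCII hex digits, so implausible candidates can be dropped
        next OFFSET if grep { defined $_ && $_ !~ /^[0-9a-fA-F]$/ } @iv;
        my $candidate = join '', map { defined $_ ? $_ : '.' } @iv;  # '.' = undetermined
        $votes{$candidate}++;
    }

    for my $iv (sort { $votes{$b} <=> $votes{$a} } keys %votes) {
        printf "%-8s %d\n", $iv, $votes{$iv};
    }

The counts that come out are exactly the "number of occurrences each candidate IV induces" that I was after, without ever touching a full 2^32-entry array.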
Of course, if I spent 1 ms on every IV, iterating through them all would take 49 days. In this light, perhaps it is better to iterate over the file than to deal with the whole IV space. Most of my files are small text files. In fact, most Unix files are small; for this reason the file system is optimized for dealing with relatively small files.

You've reached the end of this file, but not the end of this story. My work continues to this day. I took a little detour investigating alternatives to CFS. More later...

NB: The author is currently in search of employment. Interested employers please email travis+web@subspacefield.org.