CFS Travails
travis+web@subspacefield.org

This is a story about a mysteriously corrupted encrypted file system and my attempts to recover the data. Forgive me if my narration of it is a bit rough, as I am not exactly sure what happened; all I can do is make observations. This is my first attempt to turn a chronological log of observations into a good narrative. In doing so I am gaining a new appreciation for the work of investigative journalists, who must go through a similar process to turn their notes into a compelling story. This is a work in progress. If you have solutions to any problems I mention herein, please email them to me.

The encrypting file system under observation is CFS, the "Cryptographic File System" by Matt Blaze. Parts of it date back to 1987, making it a rather venerable piece of code. Matt Blaze is well known in the computer security community, and the code has been available for scrutiny for some time. The code is fairly portable and may be the most popular encrypting filesystem for Unix. If we ignore the fact that very few Unix users use encryption to protect their files, one might be tempted to assume that CFS should be relatively secure and bug-free, but alas, this is not the case. CFS is available here:

First, a little background on how CFS works. CFS stores each encrypted file as a separate entity in the file system (it is not an encrypted disk device, in other words). For each file, CFS also stores an initialization vector (IV). At first CFS stored the IV in the inode number, then in the GID field, and finally Matt decided to use a separate symlink (with the prefix ".pvect_") to hold it (by pointing to it). In CFS, the IV is used to make files with identical contents encrypt to different data streams, to prevent snoops from noticing correlations between files.

It's hard to know exactly when the corruption occurred or why. My logs say I had some disk problems around 2 May 2001. I believe some of the encrypted files were moved to lost+found. My notes say that I restored a copy of the encrypted directory hierarchy to make comparisons and moved some files around. I was not aware of the significance of the symlink IVs at that time and probably made some mistakes that de-synchronized them. I did make some edits to sensitive files on 15 Dec 2001, and apparently the files I was working on were not corrupted. Then followed a period of hassles getting CFS to co-exist with my OS, which had a new v3 RPC mechanism. I played around with recompilations of CFS around this time, and I was also using an OS release with a buggy compiler (egcs-1.1.1). This compiler is known to generate incorrect code, and could be the source of some of the data corruption.

On 6 Dec 2003, decrypting certain files yielded corruption. That is, ASCII text files suddenly weren't. How can I quantify this corruption? I ran a tool called "ent" on the files, and it yielded some interesting results:

    $ ent corrupted_text_file
    Entropy = 5.558949 bits per byte.
    Optimum compression would reduce the size of this 6588 byte file by 30 percent.
    Chi square distribution for 6588 samples is 40067.26, and randomly would exceed this value 0.01 percent of the times.
    Arithmetic mean value of data bytes is 63.7025 (127.5 = random).
    Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
    Serial correlation coefficient is -0.087718 (totally uncorrelated = 0.0).

The entropy is one measure of the amount of [unpredictable] information in the file.
For totally random files, it should be 8 bits per byte, and for a file full of nulls, it should be 0 bits per byte. The optimum compression measurement merely restates the entropy as the percentage of the file that is redundant (predictable). Chi square is a statistical test that involves putting things in "bins" (probably 256 bins in this case, one for each possible byte value), squaring the difference between the actual bin contents and the predicted values, then dividing by the expected count for each bin (here N/256, since all byte values are equally likely under the null hypothesis). Thus, the more a file deviates from random, the higher the chi-square score, and the lower the percentage of cases that would exceed it. The arithmetic mean is obvious. The Monte Carlo approximation involves constructing a virtual 1x1 dart board with a circle of diameter one inside it, using the samples as coordinates of darts, and dividing the number that land inside the circle by the total number of darts; that fraction approximates pi/4, so multiplying it by four gives an estimate of pi (it converges rather slowly). Finally, the serial correlation merely describes how much each sample depends on the prior sample.

Normally when encryption goes awry (due to using the incorrect key, for example), you get something indistinguishable from random bits. As you can see, the files are definitely not random. Random looks like this:

    $ ent random_file
    Entropy = 7.970002 bits per byte.
    Optimum compression would reduce the size of this 6144 byte file by 0 percent.
    Chi square distribution for 6144 samples is 250.58, and randomly would exceed this value 50.00 percent of the times.
    Arithmetic mean value of data bytes is 126.5417 (127.5 = random).
    Monte Carlo value for Pi is 3.152343750 (error 0.34 percent).
    Serial correlation coefficient is -0.009299 (totally uncorrelated = 0.0).

On the other hand, they are more random than ordinary text files:

    $ ent uncorrupted_text_file
    Entropy = 4.436244 bits per byte.
    Optimum compression would reduce the size of this 5134 byte file by 44 percent.
    Chi square distribution for 5134 samples is 76869.97, and randomly would exceed this value 0.01 percent of the times.
    Arithmetic mean value of data bytes is 92.6139 (127.5 = random).
    Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
    Serial correlation coefficient is -0.008061 (totally uncorrelated = 0.0).

To do some more tests, I wrote two programs, "freqcount.pl" and "freqgraph.pl". The former counts the frequency of each byte, and the latter represents the counts graphically (in text mode) as horizontal bars of asterisks, so you can compare two files in side-by-side windows. What I'm seeing appears to be disruption of the large frequency spike of the space character; however, it is definitely not uniform. It looks as though the text has gone through some polyalphabetic substitution with a periodicity of four or so.

My first reaction was to go to backup tapes. However, my backup tapes for 15 Oct 2003 are no good; I ran out of tape and didn't notice. And although I have many backups, this corruption went unnoticed for so long that I don't have a backup set old enough.

My next reaction was to see if cfsd, the NFS-like daemon that serves up the files, was at fault. I decided to try ccat (a stand-alone utility) on some of the files to eliminate cfsd as a source of errors. Unfortunately it warns that it only works on old-format CFS files. You see, CFS went through several stages of evolution. I won't bore you with the details.
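To make the byte-frequency idea concrete, here is a minimal sketch along the lines of freqcount.pl and freqgraph.pl. It is a reconstruction for illustration, not the scripts themselves:

    #!/usr/bin/perl
    # Count byte frequencies in a file and draw each count as a horizontal bar
    # of asterisks, so two files can be compared in side-by-side windows.
    # A sketch in the spirit of freqcount.pl/freqgraph.pl, not the originals.
    use strict;
    use warnings;

    my @count = (0) x 256;
    open my $fh, '<', $ARGV[0] or die "$ARGV[0]: $!";
    binmode $fh;
    while (read($fh, my $buf, 4096)) {
        $count[$_]++ for unpack 'C*', $buf;
    }
    close $fh;

    my ($max) = sort { $b <=> $a } @count;      # tallest bar sets the scale
    for my $byte (0 .. 255) {
        next unless $count[$byte];
        my $bar = '*' x int(60 * $count[$byte] / $max);
        printf "%02x %6d %s\n", $byte, $count[$byte], $bar;
    }

Comparing the bars for a corrupted file and a clean one in side-by-side windows is what makes the smeared-out space spike visible.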
The stand-alone tools like ccat do not share the decryption routines with cfsd, and they had not been kept up to date with the various evolutionary changes. I sent an email to Matt Blaze to get some help, but got no reply.

I read Matt Blaze's notes.ms and read through some of his code, particularly cmkdir.c. I am summarizing my findings in a README.internals document I am creating. I found a few errors in there that are security-relevant; it makes me question how much trust I put into this software without doing even a cursory review of the code. To summarize my findings here:

1) The type of cipher is included, apparently erroneously, in some key manipulations (the bcopy from &k starts at the cipher field, so the cipher identifier gets mixed into the key material):

    struct cfs_admkey {
        ciphers cipher;
        union {
            cfs_adm_deskey deskey;
            cfs_adm_3deskey des3key;
            ...
        } cfs_admkey_u;
    };
    cfs_admkey k;
    ...
    /* now we xor in some truerand bytes for good measure */
    bcopy(&k, ekey, 32);   /* assumes key material < 32 bytes */
    for (i = 0; i < 32; i++) {
        ekey[i] ^= randbyte();
    }
    encrypt_key(&k, ekey);
    bcopy(ekey, ek1, 32);
    decrypt_key(&k, ek1);  /* new &k is our real key */

2) A file called "..." is created. Apparently this is to provide a somewhat-known plaintext so that the program knows whether you have given the correct key. He attempts to make this 8-byte value half random, but due to shifting in the wrong direction, the attacker knows 7 of the 8 bytes, almost giving him a full known plaintext (the left shift pushes the random bits out of the low byte, so str[5] through str[7] always end up zero and only str[4] is actually random). By contrast, PGP uses two bytes of known plaintext to determine if you have the correct key (meaning that it accepts the wrong key 1 in 2^16 times).

    char str[8];
    ...
    strcpy(str, "qua!");
    /* now randomize the end of str.. */
    r = trand32();
    for (i = 0; i < 4; i++)
        str[i+4] = (r << (i*8)) & 0377;

At this point I thought that the IV was stored in the inode numbers. CFS used to do this, and I almost certainly changed the inode numbers on several files during my restoration efforts. There are 2^32 possible inodes, and I would have to search for the correct one for each file. This could have taken a while; I would have had to be clever and crafty to finish this project in my lifetime. But perhaps it was not as impossible as it seemed: the stream generated from the inode number is XORed with the data at one point, so perhaps I wouldn't have had to try all of them in a brute-force method. Also, the file system may not have 2^32 inodes to try out --- perhaps far fewer. Yes, this might be possible.

I contacted the TCFS people, who based their work on CFS. I did this in part to warn them about some of the vulnerabilities I had found, and also to solicit help in understanding CFS. I noticed that many of the links and mirrors of their web page are broken. Their last announcement was over a year ago. And when I sent them email, it bounced. Sigh, more abandonware.

I thought my problem was that there's a different inode used to seed a PRNG of some kind, creating a stream which is being XORed with my program's text. To extract the original text, I need to know the right inode number. The size of the inode number tells me its maximum theoretical value; if the file system it is on has only ever grown, its maximum inode value gives me an even lower upper bound; and the inodes of nearby files might give me a good place to start searching. I will also need to recognize when I've gotten the inode right. This will probably be some statistical test on the contents of the file (remember, I have the key already, so the file "..." is no help).
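Such a statistical test might be as simple as scoring how plausible a candidate decryption looks as ASCII text. Here is a minimal sketch of the kind of scorer I have in mind (an illustration only; the thresholds are guesses, not anything taken from CFS):

    #!/usr/bin/perl
    # Score a candidate decryption by how much it looks like ASCII text:
    # mostly printable bytes, with a healthy share of spaces.  The 95% and 5%
    # thresholds are guesses for illustration.
    use strict;
    use warnings;

    sub looks_like_text {
        my ($buf) = @_;
        my $len = length($buf) or return 0;
        my ($printable, $spaces) = (0, 0);
        for my $c (split //, $buf) {
            my $o = ord($c);
            $printable++ if ($o >= 0x20 && $o < 0x7f) || $o == 0x09 || $o == 0x0a;
            $spaces++    if $o == 0x20;
        }
        return ($printable / $len > 0.95) && ($spaces / $len > 0.05);
    }

    # feed it a candidate plaintext on stdin; keep the guess if it passes
    my $candidate = do { local $/; <STDIN> };
    $candidate = '' unless defined $candidate;
    print looks_like_text($candidate) ? "plausible\n" : "junk\n";

Anything that passes the test would go on a short list for manual inspection.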
On top of all this, these files' contents are sensitive (that's why they're encrypted), so I won't be able to distribute any brute-forcing if I have to examine the file contents to determine a match. I will either have to use his code or develop my own, and test it to make sure it is doing exactly the same thing as his.

I think this is a turning point in my level of expertise. First I found that I couldn't get advice from family, people I met at social events, etc. In the second stage, I found that no close friend could give me advice. Increasingly I had to go beyond my circle of friends to seek out experts. And now, with the few experts unreachable or unwilling to help, I find that I must solve the problems myself. I suppose in the last stage, I won't be able to find anybody more expert.

At this point I decided to print out all the papers on CFS and dive deep, really deep, into the belly of the beast. I really needed to understand the encryption that CFS is using. Perhaps I can be clever and find an algorithmic shortcut, or a known-plaintext situation.

On 6 Jan 2004, I found there are "only" about 280,000 inodes on that file system. That means 280,000 trials. This is getting reasonable. Checking each decryption for the bytewise frequency count of the space character could do a 1:256 winnowing. I'd need about another 1:100 reduction of those candidates to get down to manual inspection (I figure 10 manual inspections per file is my upper limit).

On 7 Jan 2004, I think I may have found my answer, thanks to the equations in Matt's papers, which I quote in README.internals:

    D_p = DES^-1(K_2, E_p xor DES^1(K_1, g(p mod m))) xor DES^1(K_1, f(p mod m)) xor i

All the data are 8 bytes wide, except perhaps the keys. I think in this case, everything is known except i, which "is a bit representation of a unique file identifier derived from the Unix inode number and creation time of the encrypted file". His reference to a creation time is a little odd, as UFS inodes only store access, modification, and change times. Still, in the equation above, i is the only variable that is unknown to me. Since it is not used to seed any RNGs or to key any ciphers, I am in luck. Looking at the decryption equation, if i were wrong, it would give me the correct data but XORed with some constant. That would be consistent with the entropy measurements I have made, which indicate that the file is definitely non-random but does not have the strong frequency spike of the space character found in English prose.

Let me put this another way. If you plugged the wrong value of i (let's call it i') into the above equation, you'd get this:

    D_p' = D_p xor (i' xor i)

That's what is happening to me. Each 8-byte chunk of data that I see is the original plaintext, XORed with some 8-byte constant value, which is the error (XOR) of i' and i. Statistically, if I know the most common D_p', and I know the most common D_p, and I know i', I can solve for i:

    i = D_p' xor D_p xor i'

Now, in that equation, each variable is an 8-byte array. At first, I thought i was four bytes long and was simply concatenated with itself to form an 8-byte value (later I learned that i was indeed an 8-byte value). Each byte of i can be solved independently:

    i[0] = D_p'[0] xor D_p[0] xor i'[0]
    i[1] = D_p'[1] xor D_p[1] xor i'[1]
    ...
    i[3] = D_p'[3] xor D_p[3] xor i'[3]
    i[0] = D_p'[4] xor D_p[4] xor i'[0]
    ...
    i[3] = D_p'[7] xor D_p[7] xor i'[3]

In other words, I essentially have a polyalphabetic substitution cipher, with 4 substitution tables which are all built on bitwise XOR with a constant. For text files, I may even be able to solve each byte of the value i independently, suggesting only 4*256 = 1024 guesses! This is workable! In fact I can vary each byte of i at the same time, meaning only 256 decryptions!

On closer reading of the release notes (notes.ms), I find that CFS now stores the IV for a file named filename in a symbolic link named .pvect_filename. Amazingly, I have traced the corruption to exactly those files missing their symbolic links! Hooray! Now I know exactly what is missing, which files are corrupted, and how to fix it! I have verified that these .pvect_ files are missing on my oldest backup too.

This new development means I must guess 8 hex digits, which is 32 bits of randomness. Quite an improvement, although I must read the source to see exactly how the IV is used and whether I can search each byte independently. I suspect that I will be able to. I can test my theories by creating a known plaintext file, noting and then removing the .pvect_whatever file, and then conducting a search for the proper (known) value. This might be faster than understanding CFS's code.

On 9 Jan 2004, I wrote a little program to find files without .pvect_ files. There are 19, and there have been exactly 19 for a while. That's not too bad; overall I have about 500 files in CFS. I also modified ccat to the extent necessary to work with new-style directories. It's not pretty, but it works. Huzzah!

Then I modified ccat to accept both an IV and a passphrase. I wrote a program that tries IVs with each byte ranging from 0-255. For each of the 256 possibilities, it runs ccat with an IV of that value, repeated (all bytes equal). Then it does a simple frequency count for offsets of 0, 1, 2 and 3 into the file. After doing this it dumps all the information using Perl's Data::Dumper. My plan was to analyze this output later using statistical tests. Estimates of the time to do this were around 38 minutes, which turned out to be surprisingly accurate.

Thinking back, I do think I recall seeing symlinks pointing to garbage in the lost+found; I must have decided they were deletable and didn't restore them. All this work due to that miscalculation! Oh well, I have learned quite a bit about CFS...

At this point I noticed that the IV is 8 bytes of ASCII-encoded hexadecimal, which is a convenient way to store a 32-bit value. However, it is not converted into binary before use; the ASCII values are used instead! So basically you have a 64-bit IV that can only hold 2^32 different values. Thus, my program should really have calculated frequency counts for offsets of 0-7 into the file (not 0-3):

    i[0] = D_p'[0] xor D_p[0] xor i'[0]
    i[1] = D_p'[1] xor D_p[1] xor i'[1]
    ...
    i[7] = D_p'[7] xor D_p[7] xor i'[7]

It's worth noting here, as a sidebar, that the more I know about the file in question, the better I can narrow down the possibilities for the IV. If I know nothing about the file (that is, if it is truly random data), I cannot tell whether one IV is better than another. The more I know about it, the better I can weed out all the irrelevant IVs. This knowledge comes from knowing the names of the files, remembering their contents, that kind of stuff.

One thing about the IVs is that they don't affect the high bits of the file. That is, since the IV is all in the ASCII range 0-127, XORing it into the file will not change the high bits.
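To make that concrete: the IV characters are ASCII hex digits, so there are only 16 possible values per byte position, none of which has the high bit set, and the XOR difference between any two such characters stays below 0x80 as well. A quick Perl check (assuming lowercase hex digits, which is a guess on my part):

    #!/usr/bin/perl
    # Every IV byte is an ASCII hex digit; none of them has the high bit (0x80)
    # set, so no XOR difference between two IVs can have it set either.
    # (Lowercase hex digits are an assumption here.)
    use strict;
    use warnings;

    my @digits = ('0' .. '9', 'a' .. 'f');
    for my $d (@digits) {
        printf "'%s' = 0x%02x, high bit %s\n", $d, ord($d),
            (ord($d) & 0x80) ? "set" : "clear";
    }

    # and the pairwise XOR differences stay below 0x80 as well
    my $max = 0;
    for my $a (@digits) {
        for my $b (@digits) {
            my $x = ord($a) ^ ord($b);
            $max = $x if $x > $max;
        }
    }
    printf "largest pairwise XOR difference: 0x%02x\n", $max;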
That is unfortunate, as the files typically don't have the high bit set anywhere. However, this makes sense from a cryptographic perspective of trying to hide similarities between files; you are modifying bits that people care about, instead of ones that are typically not used.

Taking all this into consideration, I wrote a program (smart_iv) which tells you what the missing IV is very quickly. The basic idea is that you run ccat once, get the output, and break it down by which byte of the IV it is being XORed against (forming 8 bins). Then you count the frequencies of the characters in each bin. The whole data structure is essentially an 8x256 array. Now I want to test for the IV that gives me the greatest number of space characters. One could remap the frequency counts based on the IV being tested, but I found it easier to go the other way; that is, to XOR the space character with the IV in question and look at the frequency of that entry in the frequency chart. This program was capable of recovering the longer files of English text, probably about 5 of the 19. I then renamed this program (find_blank) because I had some smarter ideas.

In my call to ccat, I have changed it to accept an IV, but what IV do I specify? Ideally I would use a null vector. That is, if I let i'=0, then it drops out of the equations mentioned above, because XOR with 0 is an identity operation. In fact, cfsd probably assumes i'=0 when there is no .pvect_ file. However, ccat uses C string functions to parse the IV from the command line, and eight nulls would be read as a zero-length string. So instead I specify "00000000" and have to deal with the XOR deltas between that and the actual IV instead of with absolute values (that is, they are relative to 0x3030303030303030, since ASCII "0" is 0x30).

Around this time I noticed that there were other files which had .pvect_ files but were also corrupted. Is there no end to this pain? I will have to examine every file on the encrypted file system, by hand, to determine whether it looks reasonable. Argh! I will put this off until I have completed recovering the 19 files without .pvect_ symlinks.

11 Jan 2004

I noticed that there was a class of files which were not English prose, and therefore did not have the prevalence of the space character in their frequency profiles. So I wrote another program, called maximize_printables, which found the IVs that maximized the number of printable characters. This program did not work as well as I expected, since some IVs which maximized printables also produced junk like vertical tabs (!), which virtually never appear in ASCII files. So the next logical step was another program, called minimize_nonprintables, which did what you'd expect. However, unlike the space-maximizing heuristic, which tended to pick only a handful of IVs, minimize_nonprintables generally printed out several (on the order of 16 or so). I found that since it usually got half of the bytes in the IV correct, I could try one possibility with ccat, see halves of words that I recognized, and then exploit these inter-byte dependencies to infer the correct values of the currently incorrect IV bytes. I was able to recover around 5 more of the 19 files without .pvect_ files.

[NB: theoretically, minimize_nonprintables and maximize_printables should give the same results, as they are complementary. However, my implementation of one or the other must have been in error, since I think I noticed different results.]
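To make the space-counting approach concrete, here is a stripped-down sketch in the spirit of find_blank. It is a reconstruction rather than the original, and it assumes the input is ccat output produced with the reference IV "00000000" and that the IV characters are lowercase hex digits:

    #!/usr/bin/perl
    # For each of the 8 IV byte positions, pick the IV character that would
    # turn the most common byte in that position's bin back into a space.
    # A sketch in the spirit of find_blank, not the original; assumes the
    # input file is ccat output made with the reference IV "00000000".
    use strict;
    use warnings;

    my @freq = map { [ (0) x 256 ] } 0 .. 7;    # 8 x 256 frequency table
    open my $fh, '<', $ARGV[0] or die "$ARGV[0]: $!";
    binmode $fh;
    my $data = do { local $/; <$fh> };
    close $fh;

    my $pos = 0;
    for my $byte (unpack 'C*', $data) {
        $freq[$pos % 8][$byte]++;
        $pos++;
    }

    my $iv = '';
    for my $slot (0 .. 7) {
        my ($best_char, $best_count) = ('?', -1);
        for my $char ('0' .. '9', 'a' .. 'f') {
            # if the true IV byte is $char, a plaintext space shows up in the
            # ccat output as 0x20 xor $char xor "0"
            my $seen = 0x20 ^ ord($char) ^ ord('0');
            if ($freq[$slot][$seen] > $best_count) {
                ($best_char, $best_count) = ($char, $freq[$slot][$seen]);
            }
        }
        $iv .= $best_char;
    }
    print "best guess for the IV: $iv\n";

Swapping the scoring step for a count of printable (or non-printable) characters gives maximize_printables or minimize_nonprintables in the same mold.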
I should mention here some of the other tools I use. I use bc quite a bit, even though it has the annoying property of not accepting lowercase letters as hex digits. It also does not have a bitwise XOR operator, so I often end up using Perl; if you try stuff like this, you'll need the chr and ord functions. I also refer to an ASCII chart quite a bit.

So finding those halves of words in the plaintext gave me an idea. Perhaps I should write another program, for files where I might know a string that occurs somewhere in them. For example, an RCS file has the words "head", "access", "symbols" --- surely this knowledge could be helpful in determining an IV! Again, the more I know about the plaintext, the easier it is to recover an IV.

For simplification, let me consider the case of an RCS file. In this case, I know it begins with "head\t". In most cases I never increment the major number, so the next characters are usually "1.", followed by one or more digits. If I know that a fixed string of length l occurs at a particular offset into the file, I can easily compute l bytes of the IV (up to all 8), starting at that offset modulo 8 and wrapping around in the obvious fashion.

12 Jan 2004

It's worth noting that this applies to other types of files too; for scripts, I start with "#! /". The extra space is there for portability, because Dynix uses a four-byte "magic number" to identify scripts. I never expect to see Dynix, but it's only one byte, and portability is not a bad habit to have. In many cases the following letters are "usr/" too. I like to start from the simple and specific cases, and enhance scripts to handle the general cases.

So the next program I wrote, known_header, cracks an IV based on known plaintext at the beginning of the file. In this case, I simply XOR "0", the known byte, and the byte from the decrypted file, and that gives me the IV. Simplicity itself! This was capable of recovering about 5 of the 19 files. The rest of the 19 were automatically generated files which I simply removed, so I consider these techniques moderately successful.

Now on to something more difficult. Let's say I'm looking for a known string of some length l in the plaintext, at an unknown location. If the IV were fixed, I would look for one of eight patterns, with the pattern depending on my offset into the file (modulo 8, of course). At each step I might have to look ahead as far as l-1 bytes to see if I'm at the beginning of a match. That might be acceptable, but the IV isn't fixed. What I'm really doing is trying to find the IV that induces the most occurrences of that string, and I'm trying to do it without iterating over all possible IV values (2^32 iterations over the length of the file sounds like too much computation). Surely there has to be a better way!

14 Jan 2004

At this point, to count strings of length l, I'm thinking of an (l+1)-state DFA with multiple "tokens" that move around to keep track of the state. Specifically, there will be 2^32 logical tokens, but in the implementation I'll consolidate them into l+1 tokens, one for each state. Every time a logical token hits the final state, it increments a counter associated with its IV. In actuality, there may be multiple logical tokens for a given IV. For example, if the IVs were only 2 bytes long and the string we want to find is ABABC, then after reading ABAB there will be two tokens, one on the first B and one on the second. Note that the string must be longer than the IV for a given IV to have two logical tokens in the works.
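Backing up for a moment to known_header: the per-byte recovery really is just the XOR identity from earlier, applied at a known offset. A minimal sketch (a reconstruction, not the original script; the second argument is assumed to be ccat output made with the reference IV "00000000"):

    #!/usr/bin/perl
    # Recover IV bytes from known plaintext at the start of a file:
    #   i[j] = D_p'[j] xor D_p[j] xor i'[j],  with i'[j] = "0" = 0x30.
    # A sketch in the spirit of known_header, not the original script.
    use strict;
    use warnings;

    my ($known, $outfile) = @ARGV;
    die "usage: $0 known-prefix ccat-output\n"
        unless defined $outfile and length $known;

    my $n = length($known) < 8 ? length($known) : 8;   # at most 8 IV bytes
    open my $fh, '<', $outfile or die "$outfile: $!";
    binmode $fh;
    my $got = read($fh, my $out, $n);
    die "ccat output shorter than the known prefix\n"
        unless defined $got and $got == $n;
    close $fh;

    my $iv = '';
    for my $j (0 .. $n - 1) {
        $iv .= chr(ord(substr($out, $j, 1)) ^ ord(substr($known, $j, 1)) ^ 0x30);
    }
    printf "recovered IV bytes 0..%d: %s\n", $n - 1, $iv;

With the RCS header as the known prefix this yields the first few IV bytes directly, and the rest can be guessed from context as before.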
First, to simplify the matter, I think I want to modify ccat to accept a null IV. I'll just test to see if it's a null string, and as a special case if it is, accept it as the null vector (which would be impossible to pass on a command line anyway). Okay, that cleans things up a bit.

17 Jan 2004

It's interesting to note that in some cases I may know two different kinds of things about a file; for example, that space is its most frequent character, and that it contains a given string. Each of these individually induces a distribution on the IV. However, it is more difficult to represent and combine these distributions across two programs than it would be to achieve the same level of confidence using one technique alone. That is, it is easier to print out the most probable IV than it is to communicate the top ten IVs, and that is easier than communicating all IVs and their associated probabilities.

This has implications for designing a cryptographic system. For example, you can split up the plaintext under numerous keys (perhaps using the unicity distance as a guide to how much per key) and encrypt those keys with a "metakey". This would make things more difficult for the cryptanalyst, since if your keys are perfectly random he cannot attack the metakey directly. Instead he must combine limited information about a number of keys, and the encryption of those keys under the metakey induces probabilities in the metakey that are complex by design. This would make encryption slower, due to the need to rekey, but would require more sophisticated cryptanalytic software.

Are such gains against cryptanalysis valuable? Stated another way, this question asks whether it is valuable to make attacks against you more expensive but not impossible. The answer depends on the distribution of attackers over resources. If your only concern is attackers with infinite resources, perhaps not. I believe that in most cases, "raising the bar" is valuable, but one should decide on a strategy based on a scientific cost-benefit appraisal. If nothing else, use of metakeying would retain the benefits of using a proven cipher while separating you from the crowd. Attacking your system would require custom modification of existing cryptanalytic software. This means expert manpower, and that directly translates to money.

19 Jan 2004

I'm writing known_strings. I start by examining all the relevant finite-state automata (FSA) classes on CPAN; none really fit the bill. Next I write the main body of the program, pretending as though I have already written any relevant classes. I know that I'll need some kind of custom FSA and that I'll need some structure in which to store results.

Representing the results of the search poses an interesting challenge. Do I want to allocate a Perl structure with 2^32 entries? Even at a byte per entry, that's the entire address space of the machine! I could do this with a large file, but it's inelegant, and if the known string is short, there would be many matching IVs and it would be quite slow. One alternative would be to store only non-zero entries; this is only useful if the resultant array would be sparse, and its sparseness depends on the length of the string I'm looking for and the content of the file. Another alternative would be to represent the partial matches of IVs and, at the end of parsing the file, loop through all IVs and see how many of these partial matches a specific IV actually matches. Put another way, my results may be "any IV that starts with A" and "any IV that starts with AB"; I would then enumerate all IVs, starting with AAAAAAAA, noting that it matches once, so I output "AAAAAAAA 1". When I get to ABAAAAAA, it will match twice, so I output "ABAAAAAA 2". Thus, instead of storing the results array in memory, I am actually storing it temporally, by walking through the IVs and dynamically generating the values the array would have held.
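For what it's worth, a Perl hash makes the "only non-zero entries" alternative nearly free, provided the candidates come from somewhere naturally sparse, such as the offsets of the file itself: every offset where the known string could sit implies at most one candidate IV. A rough sketch of that idea (an illustration, not the actual known_strings; it assumes ccat output produced with the reference IV "00000000"):

    #!/usr/bin/perl
    # For each offset where the known string might start, derive the IV that
    # would have to be in effect, and tally votes in a hash; the hash never
    # holds more entries than there are offsets.  A sketch, not known_strings.
    use strict;
    use warnings;

    my ($string, $outfile) = @ARGV;
    die "usage: $0 known-string ccat-output\n"
        unless defined $outfile and length $string;

    open my $fh, '<', $outfile or die "$outfile: $!";
    binmode $fh;
    my $out = do { local $/; <$fh> };
    close $fh;

    my %votes;    # candidate IV => number of offsets that imply it
    OFFSET: for my $t (0 .. length($out) - length($string)) {
        my @iv = (undef) x 8;
        for my $k (0 .. length($string) - 1) {
            my $slot  = ($t + $k) % 8;
            my $guess = chr(ord(substr($out, $t + $k, 1))
                          ^ ord(substr($string, $k, 1)) ^ 0x30);
            # a string longer than 8 bytes must imply a self-consistent IV
            next OFFSET if defined $iv[$slot] && $iv[$slot] ne $guess;
            $iv[$slot] = $guess;
        }
        # real IVs are ASCII hex digits, so implausible candidates can be dropped
        next OFFSET if grep { defined $_ && $_ !~ /^[0-9a-fA-F]$/ } @iv;
        my $candidate = join '', map { defined $_ ? $_ : '.' } @iv;  # '.' = undetermined
        $votes{$candidate}++;
    }

    for my $iv (sort { $votes{$b} <=> $votes{$a} } keys %votes) {
        printf "%-8s %d\n", $iv, $votes{$iv};
    }

The counts that come out are exactly the "number of occurrences each candidate IV induces" that I was after, without ever touching a full 2^32-entry array.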
Of course, if I spent 1 ms on every IV, iterating through them all would take 49 days. In this light, perhaps it is better to iterate over the file than to deal with the whole IV space. Most of my files are small text files. In fact, most Unix files are small; for this reason the file system is optimized for dealing with relatively small files.

You've reached the end of this file, but not the end of this story. My work continues to this day. I took a little detour investigating alternatives to CFS. More later...

NB: The author is currently in search of employment. Interested employers please email travis+web@subspacefield.org.