Hard Drives of Doom

[email protected]

1 Introduction

Sometimes you go through life feeling like you haven’t gotten anywhere near your full potential, and that if someone would only give you a chance, you’d be able to do something impressive. Maybe you know it, and your friends know it, but how in the world do you explain it to someone without sounding like a pompous fool? You could talk about the thousands of little things that come up, but by themselves, they’re probably unimpressive. What you really need is an anecdote that exemplifies your potential. After some thinking, I realized that I have one of these: something I’ve told in bits and pieces, and which has been buried in my homepage, but which has never been told in its entirety, in part because hard drive encryption wasn’t considered prudent back then; it was considered outright paranoid. Time has shown the wisdom of my approach, and now hard drive encryption is common enough that this tale can be told in full, in the incremental, investigative way that it occurred.

2 The Enclosure

The time is sometime around 1995. The Internet, though open to the public, is not a household word. “The Web” is two years old, and the main browser is Mosaic. Netscape, whose code would later become Mozilla Firefox, is in beta. Almost nobody has a “home page”. I am working for a company that sells database migration tools. The largest SCSI drives hold single-digit gigabytes. There is one AT&T “Teradata” machine in the building, the size and shape of a full-sized refrigerator. The name suggests that it actually holds a terabyte of data and is meant to impress, though at the time I was a bit skeptical.
Now, for those of you who don’t remember SCSI, let’s have a history lesson. At the time, most computers had ATA/IDE drives. These went two to a bus: one had to be master and one had to be slave, and the slave suffered serious performance penalties. In most computers, this limited you to two hard drives; you sometimes hooked up CD-ROM drives as the slaves so the capacity wasn’t wasted. Since most hard drives were only a few gigabytes, and since ATA was slow, if you were serious, you forked out bigger bucks for SCSI. SCSI allowed you to “daisy chain” up to 7 hard drives on each controller, though it was hard to fit that many in your case.
Around this time, the company is getting rid of some older parts. One of them is a large external RAID storage array, apparently custom-made by a company named “Eagle Storage” or something like that, which holds eight enterprise-class, full-height, 5.25” 2.5GB “wide” differential SCSI drives. It is the size of a dorm-room refrigerator, has two built-in 300W power supplies and 80mm fans (one in either half of the unit), is awkward to carry, and weighs about 100 pounds when loaded with those drives.
Now, this got my attention! I was young, still poor, and here was the chance to upgrade to 20GB of space. Not only that, but I’d have differential SCSI, which allows you to run much longer cables to the drives (allowing for external enclosures), and wide SCSI, which allows up to 15 drives on a chain. The only problem was that I had to buy a relatively rare high-voltage differential (HVD) SCSI controller. I was able to find one on the Internet; it cost several hundred dollars, but I was very happy to have it all. The drives were built like tanks, probably several pounds each. Larger-capacity drives were on the market, but not “enterprise class” ones. So I had about 16-20GB of storage in one computer, and I was quite happy with it.

3 Seduction by Pure Evil

Now, over the next year or two, some new 4.5GB drives came on the market. Specifically, Micropolis released some 7200RPM 4.5GB HVD SCSI drives, and I stumbled across a wonderful deal on used ones. I don’t have the prices now, but earning $35-40k/yr I could easily buy several, so it couldn’t have been very much; perhaps $50-100 per drive.
I assumed at the time that the resale market for wide, HVD SCSI drives was simply too small; it was a rare technology, and so trying to sell them was a bit like trying to sell Betamax tapes when VHS dominated. But I should not have accepted such a glib explanation from myself; I should have dug further, because this is where I became ensnared. I bought four.

4 Death’s Icy Embrace

For a while, they seemed to be working fine. Then one day the computer really slowed down. It was still running, but several processes had simply frozen in the D (disk wait) state. I tried to kill the stuck processes. Instead, my virtual terminal stopped responding. Now this was strange! I switched to another virtual terminal and did the kill again. That virtual terminal stopped responding too.
I stopped to think about this riddle. I opened up another virtual terminal, and looked at the process list. The shells which I had been running before were in this “disk wait” state. That would explain why the console had stopped responding. Now why could this be?
I knew that kill was a shell built-in (so that root could still kill processes when the process table was full), so I deliberately invoked /bin/kill on yet another virtual terminal - and it hung. But when I hit ctrl-Z, the shell responded, and I was able to put kill in the background. I knew better than to try and kill it.
So now I realized that this “infection” was spreading by touch; anything which touched a process infected by it would itself be infected. I had a zombie outbreak on my hands. Time to start containment procedures.
I shut down the system gracefully. Or rather, I tried. The computer froze up completely, and so a hard reboot was necessary.
Now, the whole point of doing a clean shutdown is to avoid the possible data corruption from a hard reboot, or “power cycle”. So it was no surprise to find some corruption on the first boot and fsck. Files ended up in lost+found, stripped of their filenames and identified only by inode number. This is what happens when a Unix filesystem has a directory that gets damaged: the filenames are stored in the directory, which points to an inode holding the contents, which is how the same file can appear in multiple places (“hard links”).
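To see why files keep their contents but lose their names, recall how hard links work; here’s a quick hypothetical shell demonstration (not from the original incident):

touch report.txt
ln report.txt copy.txt          # a second directory entry for the same inode
ls -i report.txt copy.txt       # both names print the same inode number

The data lives in the inode; the names live in directories. Smash a directory, and fsck can still find the inode, it just no longer knows what to call it.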

5 The Constant Gardener

I spent some time moving them back to their original places. There were two types of numbered entries in lost+found. The first was a directory full of named files; for those, I was aided by the fact that a security subsystem I had kept a copy of the entire filesystem tree. Based on the names of the files inside, I could “graft” the directory back into the right place with a rename. Bare files were more difficult, but file(1) would recognize them as binary or text. Text files were often recognizable; source to my operating system and to my 200+ projects in CVS had filenames in the comments. Binaries were harder, but I could run “strings” on them and usually get a good idea of what they were. A few required me to run “nm” to extract their symbols, because they were object files or libraries. By and large it wasn’t that difficult; I knew my system like an old friend. A friend I was reassembling, piece by piece.
Later, the system started to do its slow-motion train wreck again. This time, I noticed that the SCSI bus light was on (sadly, nobody was home; I suppose they were all stuck on the bus). Eventually, the system would slowly freeze “to death”, and I’d have to do a hard reboot. Nothing could be saved after the first frost hit, so I quickly realized that a hard reboot was the only useful move at that point.
Over time, this happened more and more. Memory corruption? Bad RAM? Device driver problems, probably. It was hard to tell, but the open-source OS developers were writing their own drivers for various chipsets from the specs, and bugs were common. I decided to blame the OS. My device driver skills were primitive, so I bought some books on the subject from a local used book store, but it was hard to know where to start.
The grafting process was a tedious, manual task, so I had to automate it. I wrote a tool that would examine what was in lost+found, particularly when it was a directory with files and subdirectories, and automatically try to find where to reattach it to the main filesystem tree. It was as if you had a picture of a tree, and every day you found pieces of branches lying at its feet, and you had to graft them back. If my tool found multiple valid locations for a directory to be grafted, it would notify me for manual conflict resolution.
Then I realized the inode numbers were not just arbitrary; I could run a find command with the “-ls” flag and output the inode numbers along with the filenames, sizes, etc., to create a database of every file and directory I had. This was awesome! It allowed me to graft numbered entries under lost+found back to their original places, even fixing up files with multiple hard links. I automated this process, making the gardening much simpler and less stressful. Finally, I had some relief.
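That tool is long gone, but the core of the idea is easy to sketch in shell (the paths and database location are illustrative, and this naive version assumes no spaces in filenames):

# while the system is healthy, map inode numbers to pathnames
find / -xdev -ls | awk '{ print $1, $NF }' > /var/db/inode-map

# after fsck, graft each lost+found entry (named after its inode) back
for f in /lost+found/#*; do
    ino=${f##*#}
    dest=$(awk -v i="$ino" '$1 == i { print $2 }' /var/db/inode-map)
    [ -n "$dest" ] && mv "$f" "$dest"
done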
Then I started noticing sectors of NULs in the middle of certain files. This was a seriously insidious kind of corruption. It wasn’t just an annoyance now; the contents of my files were being silently corrupted. I had never lost significant data in my life, and this computer held everything, including all my class assignments from college. It was like a scrapbook and diary for me, and something was eating away at it, silently. I had no idea how extensive the damage was, since I didn’t keep checksums of every file (indeed, that would be silly, since certain files are supposed to change contents).

6 The Security Landscape

Living in Austin as I was, the home of Steve Jackson Games, I was familiar with the 1990 Operation Sundevil, which was chronicled in The Hacker Crackdown. I met “Erik Bloodaxe” of LoD, who later became an editor of Phrack, as well as “Minor Threat”, who had a few years earlier finished “ToneLoc”, which is (as of 2010) still a widely-used wardialing program.
The first edition of Applied Cryptography had come out in 1994, just a few years earlier. It was widely regarded as a watershed event; it came out just after the Clipper Chip, during a time when you could not export encryption technology, under the same laws that prevented exporting sophisticated military weaponry. Although the point is accepted now, the specialists who saw encryption as necessary for the security of the Internet were then having a hard time getting it across. By publishing Applied Cryptography, Bruce Schneier was a modern Prometheus, stealing fire and putting knowledge of it in the minds of the people.
Of course, this watershed moment had precedents; David Kahn’s book “The Codebreakers”, while largely historical, did give out some information on how to apply cryptography. But Applied Cryptography was the first popular description of modern algorithms like DES and, better yet, IDEA.
Were there encryption programs at this time? Well, yes: Phil Zimmermann wrote Pretty Good Privacy (PGP) in 1991 (http://www.philzimmermann.com/EN/essays/WhyIWrotePGP.html), literally putting working software in your hands, and he is perhaps more Promethean than Bruce Schneier. Prometheus was chained to a rock, with an eagle tearing out his liver each day, only for it to grow back. Phil was hounded by the government after PGP’s worldwide spread; he was forced to defend himself under those arms-trafficking regulations, trying to prove that he didn’t export it himself. That legal hounding, to prove a negative, cost him quite a bit of money ($1M?), and so he was forced to start a business around the program; a business which is being purchased by Symantec at this very moment, some 20 years after he wrote it!
There was also a Norton disk encryption product called Diskreet, but it only ran in DOS, and by this time I had totally moved over to NetBSD as my main operating system (Linux, NetBSD and FreeBSD were the available competitors at the time). It was known to some of us that Diskreet only did DES encryption, which I considered too weak, and that it didn’t handle the keyspace properly, making it a joke to break. In fact, the DESCHALL project broke a DES-encrypted message in June 1997.
So, to summarize: there were threats against young people and publishers with computers; the information was out there; most of the tools were unsuitable, because they used DES or simply weren’t designed right; and there was one good program, PGP, which did a lot of what you wanted.

7 My Crypto Contribution (idea_filter)

So, at this time, I was backing up hard drives to tapes. I wanted to be able to encrypt things as a Unix filter, so that I could encrypt tapes with something like this:
dump -cvf /dev/stdout / | gzip -9c | idea_filter | dd of=/dev/rmt8 bs=8k
I may not have gotten the flags to dump right, but you get the idea: dump the filesystem, compress it heavily, encrypt it, and then stream it out to tape. DES was dead, and with no obvious successor to choose, I took Schneier’s suggestion and picked IDEA; it was also being used in PGP. I used CFB mode because it seemed simple and straightforward; as my first piece of serious crypto, I wanted to avoid worrying about padding to a block boundary. Apart from a prepended IV, the ciphertext was the same length as the original data.
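For flavor, here is a rough modern stand-in for that pipeline, with OpenSSL’s IDEA-CFB playing the role of idea_filter (this assumes an OpenSSL build with IDEA enabled; the key, IV, and dump flags are placeholders):

KEY=00112233445566778899aabbccddeeff    # 128-bit IDEA key (placeholder)
IV=0011223344556677                     # 64-bit IV, one 8-byte IDEA block
dump -0f - / | gzip -9c | openssl enc -idea-cfb -K "$KEY" -iv "$IV" | dd of=/dev/rmt8 bs=8k

One difference: given an explicit -iv, openssl does not prepend the IV to its output the way idea_filter did, so you must remember it separately.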
So, as of 1997, I was encrypting my backup tapes, and I could therefore leave them in an untrusted place, and nobody who found them would be able to do much with them. Companies were paying Iron Mountain to truck them off and store them deep in guarded vaults; I kept mine at my friend’s house. This is the power that crypto gives you; you no longer need to care (much) about the confidentiality of the data.
In my first pass I had written this as a “streaming filter” that read STDIN. After some consideration, I decided to try doing it a different way. I’d take a filename as an argument, and I’d encrypt that file in place, overwriting the old data. Since I could no longer prepend an IV, I had to pass one in hex as an argument. Encrypting in-place was nice, because if you were encrypting a file, you typically didn’t want the plaintext still lying around. Not even PGP could do this. And the way I did it was interesting; I’d read a sector, encrypt it, back up a sector (using lseek), and then write it. So the file was incrementally encrypted from beginning to end, and in addition to the key you’d also need the correct initialization vector.
This seems relatively mundane, but consider that in Unix, disk devices are just files, so instead of just encrypting files, you can encrypt any block device, like a disk partition, or the entire hard drive!
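As a concrete illustration (a hypothetical stand-in, not the original tool), the same trick can be approximated with OpenSSL and dd; because a CFB stream is length-preserving, the writer in this pipeline can never overtake the reader:

# encrypt a partition in place; the device name, KEY, and IV are placeholders
openssl enc -idea-cfb -K "$KEY" -iv "$IV" -in /dev/sd1c | dd of=/dev/sd1c conv=notrunc bs=512

The real idea_filter did this with an explicit read/lseek/write loop, staying exactly one sector behind its read position. Either way, if the process is interrupted partway through, the disk is left half ciphertext and half plaintext; remember that detail.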

8 The Boot Disks

A big part of security is being prepared: thinking through your problems before they actually happen. Back then, I used to worry about not being able to boot my system (disks were less reliable then). I handled this in part by mirroring the first slice of the disk (including the MBR, boot sector, and /) to another identical drive, so if one wouldn’t boot I could swap in the other. I had a little script which would do the mirroring, and then fix up the names of the drives so that the mount points (which referred to most of these disks) all still worked.
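That script is lost to time, but its heart would have been something like this sketch (the NetBSD-style device names and the fstab fix-up are my guesses at one plausible arrangement):

dd if=/dev/rsd0d of=/dev/rsd1d bs=512 count=1   # MBR and partition table
dd if=/dev/rsd0a of=/dev/rsd1a bs=64k           # boot blocks and root fs
mount /dev/sd1a /mnt                            # fix the spare's fstab so
sed 's,sd0,sd1,g' /etc/fstab > /mnt/etc/fstab   # its mounts point at itself
umount /mnt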
Another part had to do with getting a minimal operating system onto a floppy. At that time, I could build a kernel containing only the drivers I needed (using BSD’s config(8) system). But userland was a bigger problem, especially since most programs were statically linked (dynamic linking hadn’t spread to us yet). It turned out there was a program called crunchgen, which would compile several programs together so that they shared the same library functions, producing one huge executable you could link under several names; when called by a given name, it behaved as that program. This worked wonderfully, so I included idea_filter in the list of programs and made “rescue” floppies which could be used to repair a broken system.
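A crunchgen configuration is just a little spec file; a minimal hypothetical one for such a rescue floppy might look like:

# rescue.conf (illustrative): one crunched binary, many names
srcdirs /usr/src/bin /usr/src/sbin /home/me/src
progs sh ls mount umount fsck dd restore idea_filter
ln sh -sh                       # extra link name for a login shell
libs -lutil

Running crunchgen over this emits a Makefile that builds the single binary; hard links named sh, fsck, idea_filter, and so on then dispatch on argv[0].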

9 The Big Con

It’s 1998 and I’m preparing for the trip to San Antonio for the 7th USENIX Security Symposium. This is one of my first conferences since HoHoCon, one of the first hacker conferences held in Houston by the Cult of the Dead Cow.
Well, time to put this thing to work! I boot off the floppies and start idea_filter encrypting in-place on the raw disk devices. There are nine drives total, and nine idea_filters running in parallel, at surprisingly different speeds. It takes longer than I thought; it’s almost time to leave and they aren’t even close to done! I can’t wait, and I can’t stop the process, so I leave with the drives still being encrypted.
Hopefully the process will finish in an hour or two, and nobody will ever figure out how to use the floppy anyway. Beyond the complete bespoke obscurity of what I was doing, an attacker would still have to guess the passphrase; you might have noticed I didn’t mention passing the passphrase on the command line - basic security practice, since command-line arguments are visible to other processes. Also, I had recently read Stevens’ “Advanced Programming in the UNIX Environment”, and so knew enough terminal-handling ioctls to avoid echoing it to the screen.
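The shell-level equivalent of those ioctls is stty; a minimal sketch of prompting without echo:

stty -echo                          # disable terminal echo
printf 'Passphrase: ' > /dev/tty
read -r passphrase
stty echo                           # restore echo
printf '\n' > /dev/tty

The real tool used the termios calls described in Stevens, but the effect is the same: the passphrase never appears on the screen or in the process list.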

10 The Abomination

I came back and - guess what, the SCSI bus was hung. Of all the bad luck!
I could tell that not all the drives had finished encrypting, so now I had nine drives, all with their beginnings encrypted, and the ends still plaintext.
This was terrible. I sat and thought about what I could do.

11 A Spark of Light

It occurred to me that encrypted data has nearly maximal entropy (close to 8 bits per octet), and that disks write whole sectors at a time (or not at all), so the boundary between encrypted and unencrypted data falls on a sector boundary. It’s also possible to distinguish a sector of plaintext from a sector of encrypted data by entropy, even if the plaintext is part of a binary, and maybe even if it’s part of an mp3 or compressed file. So I could do a binary search on the disk: test one sector’s entropy, decide it’s encrypted or unencrypted, and make my next test further ahead or further back, respectively.
I found source code to “ent” online, which performs a wonderful suite of entropy tests. Using a script I cobbled together with “dd”, I extracted a sector at a time and decided with ent whether it was encrypted. Of the various tests ent conducts, the chi-square test is the most common in the academic literature; it tells you how often a truly random stream (e.g. encrypted data) would exceed the computed value. However, I found it rather useless for this purpose. Instead, I used the Shannon information-theoretic entropy measurement (with a threshold of about 7.5 bits per octet, determined empirically) and performed a binary search. Once I found the boundary, I ran a few tests on either side of it to make sure it wasn’t some local aberration. It was surprising how reliable an entropy test on 512 octets was; I found no aberrations.
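A rough reconstruction of such a script (the device name and sector count are placeholders; ent reads standard input when given no file):

dev=/dev/sd1c           # drive being probed (placeholder)
total=8789062           # total sectors on the drive (~4.5GB / 512; placeholder)

# Shannon entropy of one sector, in bits per octet, as reported by ent
entropy() {
    dd if="$dev" bs=512 skip="$1" count=1 2>/dev/null | ent | awk '/Entropy/ { print $3 }'
}

# invariant: sector $lo is ciphertext, sector $hi is plaintext
lo=0
hi=$((total - 1))
while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    if awk -v e="$(entropy "$mid")" 'BEGIN { exit !(e > 7.5) }'; then
        lo=$mid         # high entropy: still encrypted, search later sectors
    else
        hi=$mid         # low entropy: plaintext, search earlier sectors
    fi
done
echo "boundary between sectors $lo and $hi"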
But when I got to the second drive, I tested its boundary with ent just to make sure. It wasn’t at the same offset.
I puzzled over this a bit. It turned out that the drives had actually encrypted at wildly different paces, so the boundary was completely different on every drive. Fortunately, the binary search was easily scripted up for automation.
On most of the drives, I found the boundary just fine.
But on one, the one which had encrypted the furthest, there was a hitch.

12 Eureka!

No, dd didn’t crash; the hard drive crashed, and it hung the SCSI bus just like what had been going on all along.
I rebooted and tested again. Sure enough, it crashed again. I rebooted and tested again. Yep.
There was something really weird about this. I was looking at the offset (in sectors), and it looked somewhat familiar.
I quickly did the math, and found that if I multiplied the sector offset by 512, I came up with 2 gigabytes. By this time my spider sense was tingling. I did some more experiments, doing reads of different sizes around this area, and discovered I could trigger this with a single sector read that hit this offset.
Put simply: if I performed a disk operation that spanned the 2^31 byte barrier, the SCSI bus hung.
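The arithmetic checks out:

echo $(( 2147483648 / 512 ))    # 2^31 bytes = sector 4194304; any access
                                # crossing it pushed a signed 32-bit byte
                                # offset negative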
Further tests showed that all of the Micropolis 4.5GB HVD SCSI drives had this behavior; the reason why only one hung was that it had encrypted the fastest; once it hit this barrier, it hung the bus, and the host computer could no longer continue to operate on the other drives, so they stalled at some point earlier on.
Looking into the kernel source, and thinking about the bus failure, I realized it wasn’t my tools at all, or my operating system’s device driver.
It turns out that the hard drive itself had an unhandled signed integer overflow.

13 The Fix is In

Now, it turns out that hard drives aren’t just dumb mechanisms. They contain little microprocessors and related electronics, and these run software (called firmware) that interprets the SCSI commands and drives the servo motors. This firmware is just software, written in assembly language.
I procured the tools that could download and upload the firmware to the drive over the SCSI bus. This was proprietary DOS software written by Micropolis, so I dual-booted into DOS, extracted the firmware images, and started doing research. I was in luck! The microprocessor was a common model (an 80186, if I recall correctly), and so with some disassembly and cooperation with people on the Internet (mostly a guy in Germany), I was able to fix the firmware. The drives never crashed that way again, and some lasted at least ten years.
Incidentally, the software used to reprogram the drives did not require any jumpers to be set on the disks. Thus, the possibility remains that some malware could actually reprogram the drives without the user knowing, turning them into “hot bricks” (as one friend put it). Of course, that’s evil, so I’m glad people with those kinds of skills have better things to do with their time.

14 Coda

My best guess is that they reused the firmware from their 2GB models in their 4.5GB models, where the larger capacity suddenly exposed this bug. Coincidentally, Micropolis went out of business around this time. Too bad they hadn’t hired me to help them; I would have been worth several times what they would have had to pay me.

15 Addendum in 2015

In 2003, Simson Garfinkel published an interesting paper on data remanence that mentions drive firmware:
http://simson.net/clips/academic/2003.IEEE.DiskDriveForensics.pdf
And it turns out that, nearly twenty years later, people are rediscovering the joys of buggy firmware, this time on SATA SSDs. Well, you might be interested to know that the SATA API is basically a simplified and updated version of the SCSI command set.
http://www.reddit.com/r/programming/comments/3a0f3f/when_solid_state_drives_are_not_that_solid/
Earlier this year, Kaspersky reported that an organization they called the “Equation Group” had developed hard drive firmware rewriting in order to implement secret storage and backdoors, and that this software could date back to around the same time frame.
http://www.wired.com/2015/02/nsa-firmware-hacking/
This is a weird coincidence! However, you can tell it is merely that, because I posted this document here and gave it as a five-minute “lightning talk” at the Noisebridge hackerspace and the BayThreat conference prior to the public revelation of those techniques. If I had been involved with that malware in any way, talking about even the basic idea of hard drive firmware reflashing would have decreased its value to the organization paying for it.