HDB History - Evolution of a Program

Introduction

This page summarizes the evolution of a program as it is in the source of development.

I am writing this in part to document my process, to get help on OOP/OOD ideas, for my own edification (via post-mortem) and hopefully so that a post-mortem will be valuable to students learning the craft of computer programming.

As a programmer of some 30 years, I have never come across a document describing the actual process of programming. I recall hearing once that looking at a finished mathematical proof leads you to believe the mathematician was a genius, and also gives no clue as to how to actually create a proof. It is therefore most instructive to look at how they actually develop their proofs so that one gets a feel for the process (and realize that it is not as intimidating as it seems, but does require a lot of hard work, just like programming).

I have a good deal of OOP/OOD experience, and now that I have a firm grasp of the fundamentals, I may be in a position to describe it sufficiently to help a new programmer understand the process.

Background

I detail why I wrote HDB on its own homepage.

Suffice it to say here that I chose Ruby instead of my favored language, Python, in order to stretch my abilities as a programmer. It has been said that each programming language you learn teaches you something new about how to think about programs, and I've found that each one learned makes learning a new one that much easier.

I encourage programmers to learn new languages, but probably not on the job; the process is too time-consuming, and sometimes you end up with products that are needlessly incompatible (for example, when someone chooses to use a technology that is not as universal as another alternative) just because the programmer was, essentially, bored. In other cases, the chosen language turns out to be unsuitable for the job. Regardless, the process is often more time-consuming than you would want in a commercial environment.

You'll note initial progress is slow and later progress is bursty, progressing quickly when I had a pressing need, such as my file server getting full.

I have ignored all edits related to unit tests, which I tried to maintain with each change. I also don't force you to read comments about my comments, and ignore minor changes such as renamings which aren't necessary to follow the changes.

Captain's Log

2009-07-14 16:48:31 -0700 (Tue, 14 Jul 2009)

Initial checkin, just learning ruby, simple FileList class which has an array as a member which holds each filename, and a hash as a member, which holds each filename as a key and 1 as a value. The general idea is that the array holds the bulk of the data, in sorted order, and the hash is a fast way to lookup individual items.

Have a "without" method which returns another FileList lacking the given files, and a "without!" method which destructively removes them.

Manually update the hash using an update_hash method.

Awful design, but at this point I wasn't sure if I was going to look up individual items a lot (in which case a hash makes sense) or need to go through them in sorted order a lot (suggesting an array), so I ended up using both data structures in this first draft.

2009-07-15 09:16:32 -0700 (Wed, 15 Jul 2009)

Renamed "without" to subtract, added an add and add! method.

2009-07-15 09:38:58 -0700 (Wed, 15 Jul 2009)

Removed accessor method for hash, instead created a method that returns the results in sorted order.

2009-07-15 09:48:01 -0700 (Wed, 15 Jul 2009)

When returning files, after sorting, run them through uniq to ignore duplicates.

2009-07-15 09:49:33 -0700 (Wed, 15 Jul 2009)

Removed the redundant hash class; it probably uses up a lot of memory and I'm not using it for anything.

2009-07-15 09:51:13 -0700 (Wed, 15 Jul 2009)

Do in-place sort! and uniq! rather than returning temporary values which are sorted and uniqified.

2009-07-15 11:35:41 -0700 (Wed, 15 Jul 2009)

Learned about default parameters; if we pass nil in as the list of files to the constructor, create an empty array.

This was kind of dumb because I could just pass in [], or make [] the default parameter value, but you make these kind of mistakes when learning a new language.

2009-07-15 11:57:14 -0700 (Wed, 15 Jul 2009)

Created a method that reads a FileList from a file, creating one entry for each line that I read

2009-07-15 12:02:38 -0700 (Wed, 15 Jul 2009)

Created a method to write the FileList to a file, one filename per line.

2009-07-15 12:06:31 -0700 (Wed, 15 Jul 2009)

Renamed write to write_file, created a write routine that accepts a file descriptor instead of a filename. Have write_file call write.

This is probably a violation of the YAGNI principle - I don't need it right now, but when I wrote it I must have had something in mind (see later entry for how it is eventually used).

2009-07-15 12:09:15 -0700 (Wed, 15 Jul 2009)

Did same with read!, renaming to read_file!, and refactoring to create a read! method which accepts a file object.

2009-07-15 12:21:12 -0700 (Wed, 15 Jul 2009)

Created FileListContainer class, with the idea that I'll end up encapsulating FileList in another class. In retrospect, I probably should have used inheritance, since there's a one-to-one correspondence. As of writing this, I still haven't fixed it.

In the constructor, I can call FileList.read with a file descriptor I opened based on the file name so that I can read the FileList from an already-opened file. My idea here is that I'm going to read a few lines of header, and then pass the file object to read! which will read the rest of the lines into the FileList.

2009-07-15 12:55:51 -0700 (Wed, 15 Jul 2009)

Refactored FileListContainer to call a new read! method which reads a saved FileListContainer from disk.

2009-07-15 14:23:58 -0700 (Wed, 15 Jul 2009)

Implemented a FileListContainer.make! method which makes a FLC from a directory by running a Find.find across the contents and adding each file it finds.

2009-07-15 15:16:24 -0700 (Wed, 15 Jul 2009)

Save the directory we used to "make!" a FileListContainer as a member variable. Mentally noted that I want to convert it to an absolute pathname at some point, in case the user passed a relative pathname to the method; relative pathnames are only meaningful relative to the present working directory, which isn't interesting in itself.

Create a FileListContainer.subtract! method which simply removes files from the FileList member variable.

2009-07-15 15:17:44 -0700 (Wed, 15 Jul 2009)

Renamed FLC to FileSet - name was too long.

2009-07-15 15:23:27 -0700 (Wed, 15 Jul 2009)

Created FileSet.add_files! which will add a list of files to the FileList member.

2009-07-16 11:29:57 -0700 (Thu, 16 Jul 2009)

Save the hostname we created the FileSet on by getting it during make! and saving it in a member variable.

2009-07-16 11:49:12 -0700 (Thu, 16 Jul 2009)

Learned about the ruby Pathname class when trying to figure out how to convert from relative to absolute pathnames. Now the FileSet.dir member is a Pathname object, which I convert to an absolute pathname (lacking symlinks) via realpath.

2009-07-16 12:35:52 -0700 (Thu, 16 Jul 2009)

Now I make the changes I anticipated by creating seperate FileList and FileSet classes; FileSet.write method writes out a header containing the FileSet metadata (directory and hostname), and read! reads them in.

2009-07-16 12:57:38 -0700 (Thu, 16 Jul 2009)

Learned about setter methods, created "dir=" method that lets you assign a relative pathname and have it create an absolute one for you automagically. Call it from make!.

2009-11-17 13:01:33 -0800 (Tue, 17 Nov 2009)

Don't bother to initialize @files to [] in read_file!, since we'll reset it implicitly when we call @files.read(..)

2009-11-18 09:42:19 -0800 (Wed, 18 Nov 2009)

Added a @host member variable to FileSetGroup, initialize to Socket.gethostname

2009-11-18 09:44:15 -0800 (Wed, 18 Nov 2009)

Instead of always initializing to Socket.gethostname, allow user to specify, or leave off (default value nil means gethostname)

2009-11-18 10:13:41 -0800 (Wed, 18 Nov 2009)

Allow FileSetGroup.initialize(groupdir) argument to be nil, in which case it defaults to $HOME/.hdb

2009-11-18 13:48:52 -0800 (Wed, 18 Nov 2009)

Made FileSet.dir a string instead of Pathname object

This makes it easier to read from a file, especially if the pathname no longer exists on the hard disk. This also means I can no longer use FileSet.dir= to assign to it.

This is a general strategy I'll do later; make sure that all members that can be saved to disk can be reconstructed later from that serialized data. In some cases I can make them from strings, in other cases I cannot (easily), and need to make them string objects.

Created an eql? method to compare for identical content; == only tests that it's the same object in ruby. By contrast, two objects with the same contents must be compared with eql?.

Created a FileSetGroup.fs member which is an array of FileSet objects.

Created FileSetGroup.scan which will scan the FSG dir and create FileSet objects for every file it finds in there, and store those in the FSG.fs array.

2009-11-18 15:49:39 -0800 (Wed, 18 Nov 2009)

There will be cases where I'm only interested in the FileSets made on a particular host and of a particular directory. At this time, when you call scan, it will ignore any FileSets that were created on a different host. It has to do this by creating FileSet objects for everything on disk, and then using @fs.select { ... } to select out only those FileSet objects created on the host, since we don't know what host they were on prior to reading the FileSet (which is done by creating the object).

2009-11-19 09:20:47 -0800 (Thu, 19 Nov 2009)

At this point, I decided the best way to limit the FileSetGroup to a certain directory is to associate it with the FileSetGroup at initialization time.

(Later, I discovered this was a bad idea, and did it differently)

So, now you pass a directory in, and it converts it to a "real path" (no symlinks) at initialization time.

2009-11-19 10:06:56 -0800 (Thu, 19 Nov 2009)

Prior to this point, FileList was created from an array of filenames.

Now, I changed initialize() to take an open file (object), or nil, and if not nil, to read filenames from it via read!

This required several changes in the set arithmetic operations; if you wanted to subtract from a FileList, you created a new FileList object and called add! with the non-subtracted elements. If you wanted to create a FileList by adding some new filenames (these are later Metadata objects), you create a new FileList and call add! with the old values plus the new.

2009-11-19 10:23:15 -0800 (Thu, 19 Nov 2009)

At this point I turned around and undid the last change; I assume I must have realized there had to be a better way to do it.

2009-11-19 12:45:54 -0800 (Thu, 19 Nov 2009)

At this point, I realized that if you add a bunch of files to a FileSet, and then subtract out files you had already backed up in a previous FileSet, you would end up removing the parent directories as well, so you'd end up with a FileSet with a bunch of "leaf node" files with no parent directories. This could cause problems when attempting to copy a FileSet, so I decided to create two methods; one, called restore_leading_dirs!, would get all the leading directories for all the files and make sure they exist in the FileSet. The other, get_leading_dirs, would take a filename and return a list of all the leading directories.

This was such a complicated change that I didn't commit a working version, but rather stopped what I was doing and committed what I had, which didn't work with absolute pathnames, since it would not include the root directory. It also didn't appear to work right if you specified a relative pathname as "./dir" instead of "dir" when creating a FileSet.

In retrospect, I ended up removing these routines, which leads me to the following rule of thumb: if you can't commit a working version, it's a sign that you've probably picked a poor solution to your problem.

It is also the case that I didn't think too carefully about performance at this point, and left comments in there saying things like "this is likely to be slow", since many files in the FileSet probably share the same parent directories, but I think that's okay, since you shouldn't prematurely optimize your code; do it the simple and fast way, and then go back and optimize it. However, when you're writing unit tests in parallel with your methods, and you have to optimize the way you do it, then your unit tests will undergo a lot of change too, making you waste some time.

2009-11-19 13:44:43 -0800 (Thu, 19 Nov 2009)

At this point, I finally got around to the feature I wanted; to be able to remove from a FileSet any files which had been backed up in other FileSets and saved to the FileSetGroup directory.

I did this by writing a FSG.filter method which took a new FileSet, and iterated through each of its saved FileSets (in the @fs member) and removed them from the FileSet object we passed in via subtract!.

This technique turned out to have some weird implications later... but I'll get to that.

2009-11-19 13:53:00 -0800 (Thu, 19 Nov 2009)

In the FSG.filter routine, after removing files from the FileSet object we passed in, I call restore_leading_dirs! to restore all the leading directories in the FileSet.

2009-11-19 14:25:04 -0800 (Thu, 19 Nov 2009)

At this point, I started to work on the FileSet.copy method. Specifically, this method, the meat of the backup routine, copies every file in the FileSet to a destination directory (the backup medium).

I did this using the ruby FileUtils class, and discovered it's a little simple-minded; I had to call different routines to make directories and copy files. This means that for every kind of "file" on Unix (devices, FIFOs, Unix domain sockets, files, symlinks, etc.) I'll have to have some logic to handle them.

2009-11-19 14:31:55 -0800 (Thu, 19 Nov 2009)

At this point, I realized FileUtils.copy_entry exists, and it can handle files, directories, and symlinks for me, so I use that.

2009-11-19 14:55:37 -0800 (Thu, 19 Nov 2009)

Made FileList.files an attribute which can be externally read.

Created a normalize! routine which sorts and uniqs the files member.

Call normalize! at end of every add! or subtract! invocation; this removes the need for a files accessor method which does it.

...

Fri Aug 20 05:28:15 PDT 2010

A few days ago I realized that I had originally designed the system to accept a volume serial number on the command line, and at the beginning of the program it opened it, and at the end it ejected it.

I realize now that there was a significant implicit assumption here; that the backup would only require one backup medium.

Since the user doesn't necessarily know how many media it will take, I cannot have the user enter volume serial numbers except in an interactive manner.

I am spending some time thinking about how to do this.

Sun Aug 22 12:06:15 PDT 2010

Yesterday I noticed a backup of 700,000 files had been running for 10 days or more, and was only halfway done creating metadata for the files. This was unacceptable so I stopped it.

Rather than start fixing things, I decided to do some profiling to determine what was the source of the slowdown.

I did some searching of the web to figure out how to profile in ruby (I have hardly ever done this before). Then I did a quick backup with nearly no files, and found this usage:

%self     total     self     wait    child    calls  name
23.48     42.24    19.00     0.00    23.24  1176295  HDB::Metadata#create_from_string (/home/vax/svn/hdb/hdb.rb:105}
22.17     17.94    17.94     0.00     0.00  1176329  String#chomp (ruby_runtime:0}
20.35     16.47    16.47     0.00     0.00  1176329  String#split (ruby_runtime:0}
12.86     10.41    10.41     0.00     0.00  1176335  IO#gets (ruby_runtime:0}
...

Next I let it run on some files for a while, and got this:

%self     total     self     wait    child    calls  name
83.45    756.05   756.05     0.00     0.00   533316  IO#read (ruby_runtime:0}
 2.47     22.38    22.38     0.00     0.00   443434  Digest::Base#<< (ruby_runtime:0}
 2.16     45.62    19.53     0.00    26.09  1176295  HDB::Metadata#create_from_string (/home/vax/svn/hdb/hdb.rb:105}
 2.15     19.51    19.51     0.00     0.00  1176329  String#split (ruby_runtime:0}
 1.71     15.48    15.48     0.00     0.00  1176335  IO#gets (ruby_runtime:0}
 1.33     12.07    12.07     0.00     0.00   273325  IO#write (ruby_runtime:0}
 1.17     10.56    10.56     0.00     0.00  1176329  String#chomp (ruby_runtime:0}
 ...

Then I let it run and re-attempt the large backup, and I got:

%self     total     self     wait    child    calls  name
41.79  33163.35 13911.43     0.00 19251.92     8106  Array#select (ruby_runtime:0}
37.32  19251.92 12424.70     0.00  6827.22 1857813457  HDB::Metadata#match? (/home/vax/svn/hdb/hdb.rb:153}
12.18   4056.53  4056.53     0.00     0.00 1861019236  String#== (ruby_runtime:0}
 8.32   2771.25  2771.25     0.00     0.00 1857813468  String#eql? (ruby_runtime:0}
 0.07     22.07    22.07     0.00     0.00  1176321  String#split (ruby_runtime:0}
 0.04     14.37    14.37     0.00     0.00  1176321  String#chomp (ruby_runtime:0}

It appears that my Array#select routine, which is used to select matching metadata from a FileSet/List, is using far more time than I expected. In fact, it's slower than just reading the file and generating the metadata!

This is because the FileSet/List is based around an array, and what we really want is a fast hash lookup. This brings up some interesting design issues. First, what would we look up? We are trying to get the hash value, so we cannot use that as a key. Perhaps we could look up based on filename.

Also, if we start maintaining a parallel data structure - a hash in addition to array - we have to update the hash every time we add or remove something. This is what a novice programmer would do.

Since the FileSet/List tends to be used to either accumulate files, or (once archived) to do lookups, then the thing that immediately jumps to mind is that we could generate a hash lazily (that is, on first lookup), and simply invalidate it on the off chance something is added or deleted.

Alternately, we could have an abstract base class, with derived classes for the "accumulating" and "frozen" versions of the FileList/Sets.

Instead of doing either, I'm going to let my subconscious work on it, and in the mean time I added a command line option to turn on metadata lookups; otherwise we simply generate them as normal.

I ran it again without trying to avoid metadata recomputation and I got:

%self     total     self     wait    child    calls  name
94.55  54000.96 54000.96     0.00     0.00 128010322  IO#read (ruby_runtime:0}
 1.60    916.05   916.05     0.00     0.00 89988103  IO#write (ruby_runtime:0}
 1.55    882.67   882.67     0.00     0.00 42692163  Digest::Base#<< (ruby_runtime:0}
 0.59    455.07   335.24     0.00   119.83  8127372  Pathname#chop_basename (/usr/lib/ruby/1.8/pathname.rb:264}

That actually looks pretty good; I'm spending virtually no time in my own code.

Actually, I suppose one can't just look at a routine in which you're spending a lot of time and say "I need to optimize that"; what you really should be doing is avoiding calling it in the first place, if you can.

That is, a good high-level optimization won't improve the runtime of some chunk of code, but it will avoid calling it as often. A micro-optimization will improve the amount of time spent in it per call.

Mon Feb 21 18:10:17 PST 2011

After a long hiatus, I find myself out of disk space and needing to back up and make space.

I decided to make some changes, to allow multiple volumes per backup.

This required a great rethink, since previously the volume ID was a command line argument.

Now, I want interactive prompting.

First, I created a routine to prompt for a volume, and if you enter zero, to have it search for the first available. This is exactly how the earlier command line argument was processed.

Next, I made that a get_volume function in the BackupVolume class, changed BackupVolume's initialize to take distinct arguments, one of which may be volid (or it might default to nil). Then I call get_volume from initialize if the volid is not passed (nil).

Then, I go back to the BackupRoutine.main and alter it to create a BackupVolume (the class was previously unusued), mkfs and mount it (it previously did this all in-line).

When I go to CryptedBackupVolume, I see that nothing really need be done to this class.

However, when I go to CryptedBackupRoutine, I find an unexpected complication. Previously, I had prompted for a passphrase, run cryptsetup to format it in-line, run cryptsetup to open it in-line, and then called BackupRoutine.main (the superclass) with a different destination device. But now, we cannot call the BackupRoutine.main() without again prompting for a volume, and so on.

This is a pretty common example of the complexities you find when coding. You just didn't expect them to interact this way, and there's little way you could have known unless you held the whole system in your head. In this case, it has been months, so I've forgotten even some of what I knew about ruby :-)

The solution, in this case, is to refactor. Well, actually, the first step is to test my change, before doing anything else; if the underlying (unencrypted) routines don't work, then the encrypted ones won't either.

I ran it and found that there was an unbound variable fs in the BackupVolume.mkfs function. The question is, do I pass in a fstype as an option to BackupVolume and have BackupVolume construct the FileSystem object, or construct it outside and pass the object in? I decided on the latter. I had to not print out the type of filesystem in mkfs since it's no longer available as text to me, but I think it won't matter much, since the mkfs binary will print it out.

As a general rule, I don't like to have a lot of information about how to construct objects inside of other objects if I can pass them in from outside instead; it generates too much dependency between classes. In this case, since the filesystem type won't change from volume to volume (it is a command-line argument), I'll just construct one FileSystem object and pass that to each BackupVolume.

At this point, it goes to back up, and crashes. From the context, I figure out that it's crashing when trying to remove the old FileSetGroup file. To allow for the old behavior, I have to make an attr_reader of the volid member so that we can query it from BackupVolume.main. I wonder at this point that the volume ID is really not so much a property of the volume itself, or the FileSetGroup - in fact, we have to pass the FSG in to the BackupVolume routine to find the next unused volume ID, so perhaps it should be the other way around, with the "get volume ID" functionality being in the FSG. It's not really clear to me at this point; I give up on being perfect, and simply move forward. This happens a lot when coding; the implications aren't clear until your subconscious has a chance to work on it, or you have a chance to work out the ramifications through further coding.

My first tendency here would be to refactor. I would extract out the code which creates the backup volume picks a volume ID, and separate it from the code which formats and copies data onto the drive. But that's the entire routine! Next, I consider passing in a volume ID as a parameter to main(), so that the BackupVolume initializer will see a volume ID (if called from CryptedBackupRoutine) and use it without asking, or not (if called from the script directly) and prompt. However, this is thwarted by both a handy output message saying what volume we're crypto-formatting, as well as the fact that any prompting in BackupVolume.main() about wanting to overwrite the volume will come too late, as it will have already been crypto-formatted.

Thus, it's back to refactoring. I start refactoring, separting the creation of the BackupVolume into its own routine (get_new_volume), and the rest into a new method backup. However, I find that my current structure is simply unsuitable. I prompt the user if he wants to overwrite the volume right after prompting him for the volume, and then I mkfs it. But unfortunately, for crypted volumes, I need to do something in between; I need to luksFormat it.

Along the way, I notice that there's some extra cruft in get_new_volume(); I can move the creation of the FileSystem object into the caller, since it won't change, and pass it into main.

Then, I notice that BackupVolume.get_new_volume() explicitly creates a BackupVolume object; that's no good, since the CryptedBackupRoutine must create a CryptBackupVolume object, and so we can't use this routine directly from the subclass as-is.

An important point here is that after prompting for a volume number, and asking if they are sure, in the unencrypted case, I make a file system and mount it. However, in the crypted case, I must first do a luksFormat, then make a file system and mount it. This means, for code reuse, that I must break them into two methods, so that from CryptedBackupRoutine I can call the first (to get a volume ID), do a luksFormat, then call the second (to make the fs and mount it). I name the second prepare_new_volume.

This seemed to work, finally.

Sat Feb 26 19:33:36 PST 2011

I mentioned to a friend wanting to re-implement the rsync algorithm in pure python.

He pointed me to pysync, which seems to be related, but hasn't been updated in a while.

I got to thinking about it, and for me, the delta stuff isn't as useful as synchronizing directory structures with lots of data very quickly.

My application is special, since if a file won't fit, I want to queue it up for the next backup volume.

However, there's also rsync.py, and Abo's pysync.

I wonder which I should start with.

Wed Jul 6 12:58:24 PDT 2011

Today I ran the program without specifying -l LABEL, and it gave a really obscure error message.

Turns out, after trying -d (debug) and -v (verbose) options, that it was calling cryptsetup luksOpen /dev/whatever

This fails - it needs a name.

When I went to troubleshoot, I thought that CryptedBackupVolume was not reading the instance var from superclass BackupVolume.

Noticed today that if I interrupt the program during making a file system, it leaves a cryptsetup device in /dev/mapper/label. This will cause future calls to fail when using the same label. So I need to trap interrupts and do a cryptsetup luksClose label. Ah, isn't persistent state wonderful? Welcome to systems programming!

At this point, I noticed that I make a file system, and then mount it, in the same method. The problem with this is that if I get an interrupt during the mkfs, I merely want to close the crypto device, and if I get an interrupt after mounting, I want to unmount the file system just before that. I cannot create such logic with them both in one method.