A Web-Based File System and Browser

[email protected]

1 Motivation

You have a large reference library, and you wish to be able to access it easily from more than one computer. You want to be able to search through it, and view the contents with just a click of a mouse button. The files themselves come in a variety of formats, and they may not be suitable for browsing without translation (e.g. CHM files converted to HTML). They may also be compressed to save on storage space, and you want them to “just work”. You may also want to sort by media type (audio, video, text), by media format (plain text, HTML, RTF), by author, by name, etc.

2 Design Ideas

Use the web - solve your accessibility issues by serving the library with Apache.
Solve your “one client to view them all” problem by using a fully-featured web browser
By letting the user pick with a mouse, you avoid having to type lots of commands to find what you want
Have a search engine that helps you locate what you want quickly
Solve the translation/compression/transcoding problem by browsing with an application on the server side which does it automagically
Employ a virtual file system that allows people to navigate into CHM or ZIP or TAR files
Also employ the virtual file system so that one can browse by media type, format, author, subject, etc.
We can also do automatic de-duplication or sharing of files with identical contents
Allow the user to download files, directory hierarchies, or collections of unrelated files, either in bulk download or as some kind of archive (tar, zip, rar)
Automagically generate thumbnails, download album artwork, link to IMDB entries, Amazon.com entries, etc.
This can be implemented with a regular file system and an RDBMS; it may not be the most efficient way, but it leverages existing APIs and knowledge

3 Thoughts

This is a move towards a web-based file system, which would be far more flexible than a regular file system in certain ways. File system development in UNIX has been hampered by having to do all the work in kernel space, with the exception of tools like FUSE. However, a web server already runs in userland, so doing the work there does not add the kernel-to-userland overhead that FUSE incurs (it’s actually possible to run a web server in the kernel, just not common). That means that you can reuse all kinds of complex userland translation programs on the data.

Traditional file systems also require random access to the files; you have to be able to, for example, seek to a certain point in the file. That makes it very difficult to implement compressed file systems, because there’s no longer a 1-to-1 correspondence between on-disk storage and what is read by the user’s program. Similarly, encrypting file systems such as TCFS (as opposed to block device encryption systems) never really caught on; I myself actually attempted to use TCFS, and found it very buggy. When I wrote some unit test suites for it, I discovered many latent bugs, and it became clear why: the Linux VFS API was very complicated, and there were many mismatches between what was on disk and the logical representation.
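To make the offset mismatch concrete, here is a minimal sketch using Python's standard gzip module. The helper name read_at is made up for this illustration. Seeking to a logical offset in a gzip stream is only emulated: the library decompresses from the beginning and throws away everything before the requested position, which is exactly the cost a compressed file system would have to hide.

```python
import gzip
import io


def read_at(compressed, offset, length):
    """Read `length` logical bytes at logical `offset` from gzip-compressed bytes.

    The logical offset has no fixed relationship to a position in `compressed`;
    GzipFile.seek() gets there by decompressing and discarding the bytes before it.
    """
    with gzip.GzipFile(fileobj=io.BytesIO(compressed), mode="rb") as gz:
        gz.seek(offset)
        return gz.read(length)


if __name__ == "__main__":
    data = b"abc" * 1000 + b"NEEDLE" + b"xyz" * 1000
    packed = gzip.compress(data)
    print(len(data), "logical bytes stored in", len(packed), "bytes")
    print(read_at(packed, 3000, 6))
```

A web server that streams whole files, or decompresses on request, never has to promise cheap seeks in the first place.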

The virtual file system is neat for many reasons. We already see them popping up in commercial applications, where a music playing program may allow you to group your music into virtual “crates”. Experience with Library Science and online “directories” like Yahoo! shows that it is difficult to make certain titles fit in a strict hierarchy; for example, where exactly would one put the book “Gödel, Escher, Bach”? It’s hard enough for the author to even describe what it is about. And anyone who has created a sufficiently large taxonomy finds problematic cases; where, for example, to put the duck-billed platypus, or a book that touches on several subjects.

The simple answer to this, with file systems, is to use links to place the object in multiple places at once. The most efficient method of doing this, on file systems that support it, is through the use of hard links; essentially, the same file contents have two names in the file system. But hard links cannot span file systems, making it difficult to extend the library across multiple storage devices. And directories cannot be hard linked, so adding a file in one location does not make it appear in the other location. The work-around, symbolic links (symlinks, or shortcuts), has problems of its own; certain system calls operate on the link itself, while others operate on the referent (target), and it is easy to forget to update a symlink if the referent is moved.
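The difference between the two kinds of link is easy to see in a small experiment. The sketch below assumes a POSIX system (on Windows, creating symlinks may need extra privileges); the function name link_demo and the file names are made up for this illustration.

```python
import os
import tempfile


def link_demo(workdir):
    """Give one file a hard link and a symlink, move the original, report what survives."""
    original = os.path.join(workdir, "geb.txt")
    hard = os.path.join(workdir, "math-geb.txt")
    soft = os.path.join(workdir, "music-geb.txt")
    with open(original, "w", encoding="utf-8") as f:
        f.write("Gödel, Escher, Bach")

    os.link(original, hard)      # a second name for the same inode
    os.symlink(original, soft)   # a separate object that merely stores a path
    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino

    # Move the referent: the hard link is unaffected, the symlink now dangles.
    os.rename(original, os.path.join(workdir, "moved.txt"))
    return {
        "same_inode": same_inode,
        "hard_link_readable": os.path.exists(hard),
        "symlink_readable": os.path.exists(soft),        # follows the link
        "symlink_still_present": os.path.lexists(soft),  # looks at the link itself
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(link_demo(d))
```

The last two entries also show the split described above: os.path.exists follows the symlink to its referent, while os.path.lexists examines the link itself.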

By embracing virtual file systems (more precisely, virtual hierarchies), we can view the files sorted by any criterion we desire.

But why stop there? We can go further, and have even more alternate views of the file system; for example, we could have Bob’s favorite book collection, or Amy’s favorite movies. Or Joe’s list of books related to math. And we can do queries that return collections; for example, all the conference papers about computer networks published between 1995 and 2000.
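One way such views and queries might look on top of an RDBMS is sketched below with SQLite. Everything in it is an assumption made for illustration: the two-table schema, the tag kinds ("subject", "favorite-of"), the function names, and the two paper titles, which are placeholders rather than real publications.

```python
import sqlite3


def build_library():
    """Create an in-memory catalog: items, plus free-form (kind, value) tags."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
        CREATE TABLE tag (item_id INTEGER REFERENCES item(id), kind TEXT, value TEXT);
    """)
    db.executemany("INSERT INTO item VALUES (?, ?, ?)", [
        (1, "Gödel, Escher, Bach", 1979),
        (2, "Placeholder Paper on Congestion Control", 1997),
        (3, "Placeholder Paper on Routing", 2003),
    ])
    db.executemany("INSERT INTO tag VALUES (?, ?, ?)", [
        (1, "subject", "math"), (1, "subject", "music"), (1, "subject", "art"),
        (1, "favorite-of", "Bob"),
        (2, "subject", "networks"), (3, "subject", "networks"),
    ])
    return db


def view(db, kind):
    """Return {tag value: [titles]} for one kind of tag, i.e. one virtual hierarchy."""
    rows = db.execute(
        "SELECT tag.value, item.title FROM tag JOIN item ON item.id = tag.item_id "
        "WHERE tag.kind = ? ORDER BY tag.value, item.title", (kind,))
    tree = {}
    for value, title in rows:
        tree.setdefault(value, []).append(title)
    return tree


def items_between(db, subject, first_year, last_year):
    """Return titles tagged with `subject` and dated within the inclusive year range."""
    rows = db.execute(
        "SELECT item.title FROM item JOIN tag ON tag.item_id = item.id "
        "WHERE tag.kind = 'subject' AND tag.value = ? AND item.year BETWEEN ? AND ? "
        "ORDER BY item.title", (subject, first_year, last_year))
    return [title for (title,) in rows]


if __name__ == "__main__":
    db = build_library()
    print(view(db, "subject"))
    print(items_between(db, "networks", 1995, 2000))
```

Note that the same book appears under three subjects and in Bob's favorites without being stored more than once; a "directory" is just the result of a query.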

And we also have much more powerful notions of access control. We can tag certain subsets of files as being visible to a particular user, or an arbitrary group of users. Native file system ACLs never really caught on, in part because the file systems were designed with much simpler ownership models in mind in the first place, and many backup and copying tools don’t know how to manipulate them properly. Group management, too, is rather limited; we usually can’t have groups of groups, etc.
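As a rough sketch of what "groups of groups" could mean in practice, the snippet below resolves nested group membership recursively. The data layout (a dict from group name to a list of members, where a member is either a user or another group) and the function names are assumptions made for this example, not a description of any existing system.

```python
def expand_members(group, groups, _seen=None):
    """Return the set of users in `group`, following nested groups.

    `groups` maps a group name to its members; a member that is itself a key
    of `groups` is treated as a nested group. `_seen` guards against cycles.
    """
    seen = set() if _seen is None else _seen
    if group in seen:
        return set()
    seen.add(group)
    users = set()
    for member in groups.get(group, ()):
        if member in groups:
            users |= expand_members(member, groups, seen)
        else:
            users.add(member)
    return users


def can_view(user, item_acl, groups):
    """True if `user` is named in `item_acl` directly or through any listed group."""
    return any(
        user == principal or user in expand_members(principal, groups)
        for principal in item_acl
    )


if __name__ == "__main__":
    groups = {
        "staff": ["amy", "bob"],
        "guests": ["joe"],
        "everyone": ["staff", "guests"],
    }
    print(sorted(expand_members("everyone", groups)))
    print(can_view("joe", ["staff"], groups))
```

Because the access list is just data in the database, it can name any mix of users and groups, and nothing in the underlying file system needs to understand it.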

Once we liberate ourselves from the idea of a file name being related to its organization, we can easily do deduplication as well; in fact, one often finds that large collections of things may have duplicate icons, for example, in different works by the same author or publisher. We can store all the files in one place, and have virtual names that point to them from all the works. In fact, on the underlying data store, the file names may not be meaningful at all; we could name each file by the hash of its contents, for example.
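Naming files by the hash of their contents might look something like the following sketch. The choice of SHA-256, the flat store directory, the plain dict used as the name table, and all function names are assumptions for illustration; a real system would more likely keep the names in the database.

```python
import hashlib
import os
import tempfile


def store_blob(store_dir, data):
    """Write `data` under its SHA-256 digest, unless that content is already stored."""
    digest = hashlib.sha256(data).hexdigest()
    path = os.path.join(store_dir, digest)
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(data)
    return digest


def add_file(store_dir, names, virtual_name, data):
    """Record a virtual name that points at the stored content."""
    names[virtual_name] = store_blob(store_dir, data)


def read_file(store_dir, names, virtual_name):
    """Look up a virtual name and return the content it points at."""
    with open(os.path.join(store_dir, names[virtual_name]), "rb") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as store:
        names = {}
        icon = b"pretend this is a publisher's icon"
        add_file(store, names, "book-one/icon.png", icon)
        add_file(store, names, "book-two/icon.png", icon)
        print(len(names), "names,", len(os.listdir(store)), "stored file(s)")
```

Two works that ship the same icon end up with two names but only one stored file, and the de-duplication falls out of the naming scheme rather than needing a separate pass.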

“But how will we ever administer such collections?” you may be asking. “Surely a directory full of meaningless hashes is an administrator’s worst nightmare!” There is some truth to this, but it may be possible to export some subset of the normal UNIX FS API to the user via FUSE or something similar. Then an admin could mount a file system that corresponds to one view of the database, say, a hierarchy of topics, and perform operations on those files, such as deleting certain ones.

Importing a collection of files would become quite simple; the tool would scan their names, gather their contents, and insert them into the database, naturally de-duplicating files with identical contents, and then add their names to the various views. The matter of normalizing names still remains. Anyone who has attempted to manage a large collection of files from various sources knows that this is tedious to do from the command line (I personally still use Midnight Commander), and a web application seems like a perfect way to prompt the user to enter missing data or to correct misspellings and make other minor changes.
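A first cut of such an import tool could be as small as the sketch below. It only catalogs files (digest, size, and original relative path) in SQLite and does not copy contents into a store or normalize names; the table layout and function names are assumptions made for this example.

```python
import hashlib
import os
import sqlite3


def open_catalog(path=":memory:"):
    """Open (or create) a catalog with one row per distinct content and one per name."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS blob (digest TEXT PRIMARY KEY, size INTEGER);
        CREATE TABLE IF NOT EXISTS name (path TEXT PRIMARY KEY,
                                         digest TEXT REFERENCES blob(digest));
    """)
    return db


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def import_tree(db, root):
    """Walk `root`, record every file name, and store each distinct content once."""
    imported = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            full = os.path.join(dirpath, filename)
            digest = file_digest(full)
            # Identical contents hash to the same digest, so the second copy is ignored.
            db.execute("INSERT OR IGNORE INTO blob VALUES (?, ?)",
                       (digest, os.path.getsize(full)))
            db.execute("INSERT OR REPLACE INTO name VALUES (?, ?)",
                       (os.path.relpath(full, root), digest))
            imported += 1
    db.commit()
    return imported


if __name__ == "__main__":
    import sys
    catalog = open_catalog()
    print(import_tree(catalog, sys.argv[1] if len(sys.argv) > 1 else "."), "files imported")
```

The name table is where the tedious part would happen later: a web form could list the imported paths and ask the user to fix titles and fill in authors.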

Coming up with the perfect system will require some work, but I am excited by the possibilities.

TODO: This is a lot of text. I really need some diagrams.

4 Related Work

Need to investigate these.

WebDAV (http://www.webdav.org/)
WinFS (http://en.wikipedia.org/wiki/WinFS)
FUSE (http://fuse.sourceforge.net/)
Links and links and links (http://www.google.com/search?q=web-based+file+%28system+OR+browser+OR+manager%29)
Ghost: Share and Collaborate in the Cloud (http://ghost.cc/)

I NEED YOUR HELP: If you know of similar projects or people with similar ideas, please let me know. If you know of people who might know, please pass this along to them, as I’m not exactly sure where to ask about this kind of stuff.