A Web-Based File System and Browser
travis+web@subspacefield.org
1 Motivation
You have a large reference library, and you wish to be able to access
it easily from more than one computer. You want to be able to search
through it, and view the contents with just a click of a mouse button.
The files themselves come in a variety of formats, and they may not
be suitable for browsing without translation (e.g. CHM files converted
to HTML). They may also be compressed to save on storage space, and
you want them to "just work". Also, people may want to sort
by media type (audio, video, text), media format (plain text, HTML,
RTF), by author, by name, etc.
2 Design Ideas
- Use the web - solve your accessibility issues by serving the library
with Apache.
- Solve your "one client to view them all" problem by using a
fully-featured web browser
- By letting the user pick with a mouse, you avoid having to type lots
of commands to find what you want
- Have a search engine that helps you locate what you want quickly
- Solve the translation/compression/transcoding problem by browsing
with an application on the server side which does it automagically
- Employ a virtual file system that allows people to navigate into CHM
or ZIP or TAR files
- Also employ the virtual file system so that one can browse by media
type, format, author, subject, etc.
- We can also do automatic de-duplication or sharing of files with identical
contents
- Allow the user to download files, directory hierarchies, or collections
of unrelated files, either in bulk download or as some kind of archive
(tar, zip, rar)
- Automagically generate thumbnails, download album artwork, link to
IMDB entries, Amazon.com entries, etc.
- Can be implemented with a regular file system and RDBMS - may not
be the most efficient way, but leverages existing APIs and knowledge
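As a rough illustration of the "navigate into ZIP files" idea above, here is a minimal sketch in Python. The function name read_virtual_path and the layout of the library directory are my own assumptions for the example, not part of any existing tool; it only handles ZIP (not CHM or TAR) and does not recurse into archives nested inside archives.

```python
import zipfile
from pathlib import Path


def read_virtual_path(root: Path, virtual_path: str) -> bytes:
    """Return the bytes behind a virtual path that may descend into a ZIP.

    Hypothetical helper: "book.zip/docs/intro.txt" is treated as the member
    "docs/intro.txt" inside the archive "book.zip" under `root`.
    """
    parts = [p for p in virtual_path.split("/") if p]
    if ".." in parts:
        # Refuse to climb out of the library root.
        raise ValueError("parent directory references are not allowed")

    current = root
    for i, part in enumerate(parts):
        current = current / part
        more_to_come = i + 1 < len(parts)
        if more_to_come and current.is_file() and zipfile.is_zipfile(current):
            # The rest of the path names a member inside the archive.
            inner = "/".join(parts[i + 1:])
            with zipfile.ZipFile(current) as archive:
                return archive.read(inner)
    # No archive on the way: this is an ordinary file.
    return current.read_bytes()
```

A web handler could call something like this for every request, so that the browser never needs to know whether a path component was a real directory or an archive.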
3 Thoughts
This is a move towards a web-based file system, which would be far
more flexible than a regular file system in certain ways. File system
development in UNIX has been hampered by having to do all the work
in kernel space, with the exception of tools like FUSE. However, a
web server already runs in userland (running one in the kernel is
possible, just not common), so doing your work there carries no
additional performance penalty, as it would with FUSE. That means that
you can reuse all kinds of complex userland translation programs on
the data.
Traditional file systems also require random access to the files;
you have to be able to, for example, seek to a certain point in the
file. That makes it very difficult to implement compressed file systems,
because there's no longer a 1-to-1 correspondence between on-disk
storage and what is read by the user's program. Similarly, encrypting
file systems such as TCFS (as opposed to block device encryption systems)
never really caught on; I myself actually attempted to use TCFS, and
found it very buggy. When I wrote some unit test suites for it, I
discovered many latent bugs, and it became clear that the Linux VFS
API was very complicated and that there were many mismatches between
what was on disk and the logical representation.
The virtual file system is neat for many reasons. We already see them
popping up in commercial applications, where a music playing program
may allow you to group your music into virtual "crates". Experience
with Library Science and online "directories" like Yahoo! shows
that it is difficult to make certain titles fit in a strict hierarchy;
for example, where exactly would one put the book "Gödel, Escher,
Bach"? It's hard enough for the author to even describe what it
is about. And anyone who has created a large and extensive enough
taxonomy finds problematic cases; where, for example, to put the duck-billed
platypus, or a book that touches on several subjects.
The simple answer to this, with file systems, is to use links to place
the object in multiple places at once. The most efficient method of
doing this, on file systems that support it, is through the use of
hard links; essentially, the same file contents have two names in the
file system. But hard links cannot span file systems, making it difficult
to extend the library across multiple storage devices. And directories
cannot be hard linked, so adding a file in one location does not make
it appear in the other location. The work-around, symbolic links
(symlinks, or shortcuts), has problems of its own; certain system calls
operate on the link itself, while others operate on the referent
(target), and it is easy to forget to update a symlink if the referent
is moved.
By embracing virtual file systems (more precisely, virtual hierarchies),
we can view the files sorted by any criteria we so desire.
But why stop there? We can go further, and have even more alternate
views of the filesystem; for example, we could have Bob's favorite
book collection, or Amy's favorite movies. Or Joe's list of books
related to math. And we can do queries that return collections, such
as all the conference papers about computer networks published between
1995 and 2000.
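To make the idea of views and query-based collections a little more concrete, here is a minimal sketch using SQLite from Python. The table layout, the tag names, and the sample titles are all hypothetical, chosen only for illustration; a real schema would need much more metadata.

```python
import sqlite3


def build_catalog() -> sqlite3.Connection:
    """Create a tiny in-memory catalog with made-up sample entries."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT,
                            kind TEXT, year INTEGER);
        CREATE TABLE tags (item_id INTEGER REFERENCES items(id), tag TEXT);
    """)
    db.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", [
        (1, "Hypothetical Routing Paper", "conference paper", 1997),
        (2, "Hypothetical Congestion Paper", "conference paper", 2003),
        (3, "Hypothetical Math Book", "book", 1999),
    ])
    db.executemany("INSERT INTO tags VALUES (?, ?)", [
        (1, "computer networks"), (2, "computer networks"),
        (3, "math"), (3, "joe/math-books"),
    ])
    return db


def collection(db, tag, kind=None, first_year=None, last_year=None):
    """Return the titles in one virtual view, optionally narrowed down."""
    sql = ("SELECT i.title FROM items i JOIN tags t ON t.item_id = i.id "
           "WHERE t.tag = ?")
    params = [tag]
    if kind is not None:
        sql += " AND i.kind = ?"
        params.append(kind)
    if first_year is not None:
        sql += " AND i.year >= ?"
        params.append(first_year)
    if last_year is not None:
        sql += " AND i.year <= ?"
        params.append(last_year)
    sql += " ORDER BY i.title"
    return [row[0] for row in db.execute(sql, params)]
```

With this layout, a personal list such as "joe/math-books" and a query such as "conference papers on computer networks, 1995 to 2000" are both just calls to collection(); the same item can appear in any number of views without being copied.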
And we also have much more powerful notions of access control. We
can tag certain subsets of files as being visible to a particular
user, or an arbitrary group of users. Native file system ACLs never
really caught on, in part because the file systems were designed with
much simpler ownership models in mind in the first place, and many
backup and copying tools don't know how to manipulate them properly.
Group management, too, is rather limited; we usually can't have groups
of groups, etc.
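Groups of groups are easy to express once access control lives in the application rather than in the file system. The following is only a sketch under my own assumptions: groups are a plain dictionary mapping a group name to its members, an ACL maps a tag to the users or groups allowed to see it, and the names expand_members and can_view are invented for the example.

```python
def expand_members(group, groups, _seen=None):
    """Return every user in `group`, following nested groups.

    `groups` maps a group name to a list of members, each of which is
    either a user name or the name of another group. Cycles are tolerated.
    """
    seen = set() if _seen is None else _seen
    if group in seen:
        return set()
    seen.add(group)
    users = set()
    for member in groups.get(group, ()):
        if member in groups:
            # The member is itself a group: recurse into it.
            users |= expand_members(member, groups, seen)
        else:
            users.add(member)
    return users


def can_view(user, tag, acl, groups):
    """True if `user` is listed for `tag` directly or through any group."""
    return any(
        user == principal or user in expand_members(principal, groups)
        for principal in acl.get(tag, ())
    )
```

For example, if "staff" contains "amy" and the group "librarians", and "librarians" contains "bob", then granting "staff" access to a tag makes it visible to both amy and bob.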
Once we liberate ourselves from the idea of a file name being related
to its organization, we can easily do deduplication as well; in fact,
one often finds that large collections of things may have duplicate
icons, for example, in different works by the same author or publisher.
We can store all the files in one place, and have virtual names that
point to them from all the works. In fact, on the underlying data
store, the file names may not be meaningful at all; we could name
each file by the hash of its contents, for example.
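Here is a minimal sketch of what naming files by the hash of their contents might look like. The class name ContentStore is hypothetical, SHA-256 is simply my choice of hash for the example, and the virtual names are kept in an in-memory dictionary where a real system would use the database.

```python
import hashlib
from pathlib import Path


class ContentStore:
    """Store each distinct file once, named by the hash of its contents."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)
        self.names = {}  # virtual name -> hex digest of the contents

    def add(self, virtual_name: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        blob = self.root / digest
        if not blob.exists():
            # First time we see these contents: write them exactly once.
            blob.write_bytes(data)
        self.names[virtual_name] = digest
        return digest

    def read(self, virtual_name: str) -> bytes:
        return (self.root / self.names[virtual_name]).read_bytes()
```

Adding the same publisher icon under two different virtual names leaves a single file on disk, and both names resolve to it.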
"But how will we ever administer such collections?" you may be
asking. "Surely a directory full of meaningless hashes is an
administrator's worst nightmare!" There is some truth to this, but it
may be possible to export some subset of the normal UNIX FS API
to the user via FUSE or something similar. Then an admin could mount
a file system that corresponds to one view of the database (say, a
hierarchy of topics) and perform some operations on those files, such
as deleting certain ones.
Importing a collection of files would become quite simple; the tool
would scan their names, gather their contents, and insert them into
the database, naturally de-duplicating files with identical contents,
and then adding their names into the various views. The matter of
normalizing names still remains. Anyone who has attempted to manage
a large collection of files from various sources knows that this is
a tedious thing to do from the command line (I personally still use
Midnight Commander), and a web application seems like a perfect way
to prompt the user to enter missing data, correct misspellings, and
make other minor changes.
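The import step described above might look roughly like the following. This is a sketch under several assumptions of mine: file contents are stored as BLOBs in SQLite (a real system might keep them on disk instead), SHA-256 is the content hash, and the table names, the view parameter, and the function names are all made up for the example. Name normalization is left out entirely.

```python
import hashlib
import sqlite3
from pathlib import Path


def open_library(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical library database."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS blobs (
            digest TEXT PRIMARY KEY, data BLOB NOT NULL);
        CREATE TABLE IF NOT EXISTS names (
            view TEXT, name TEXT, digest TEXT REFERENCES blobs(digest),
            PRIMARY KEY (view, name));
    """)
    return db


def import_tree(db: sqlite3.Connection, source: Path, view: str) -> int:
    """Import every file under `source` into `view`.

    Returns the number of new, previously unseen file contents stored.
    """
    new_blobs = 0
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        data = path.read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        known = db.execute(
            "SELECT 1 FROM blobs WHERE digest = ?", (digest,)).fetchone()
        if known is None:
            # Contents not seen before: store them once.
            db.execute("INSERT INTO blobs VALUES (?, ?)", (digest, data))
            new_blobs += 1
        # Record the file's name in the requested view either way.
        name = path.relative_to(source).as_posix()
        db.execute("INSERT OR REPLACE INTO names VALUES (?, ?, ?)",
                   (view, name, digest))
    db.commit()
    return new_blobs
```

The web application could then list the rows of names that still need attention and let the user fix them one at a time.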
Coming up with the perfect system will require some work, but I am
excited by the possibilities.
TODO: This is a lot of text. I really need some diagrams.
4 Related Work
Need to investigate these.
I NEED YOUR HELP: If you know of similar projects or people with similar
ideas, please let me know. If you know of people who might know, please
pass this along to them, as I'm not exactly sure where to ask about
this kind of stuff.
File translated from TeX by TTH, version 3.85, on 1 Sep 2010, 02:31.