TFS

Directory tree cloning modeled after the unix "tee" command and RAID1
mirroring: an "event-driven rsync".  TFS monitors a filesystem tree and
makes all write instructions available as a command stream (journal)
which can be used to reconstruct the tree remotely.  Multiple clients
can hook into the command stream.

  tfs_journal    FUSE filesystem; outputs the journal stream on stdout.
  tfs_broadcast  UNIX socket broadcast server.
  tfs_mirror     transforms the journal stream into local filesystem ops.

Entry: initial commit
Date: Sun Mar 15 10:52:01 CET 2009

Starting from the FUSE tutorial:
http://apps.sourceforge.net/mediawiki/fuse/index.php?title=Hello_World

Next: Q: how to map a fuse_file_info to an object of our choice, i.e. a
struct representing an open file?

http://apps.sourceforge.net/mediawiki/fuse/index.php?title=FAQ#Is_it_possible_to_store_a_pointer_to_private_data_in_the_fuse_file_info_structure.3F

  Is it possible to store a pointer to private data in the
  fuse_file_info structure?

  Yes, the 'fh' field is for this purpose.  This field may be set in
  the open() and create() methods, and is available in all other
  methods having a struct fuse_file_info parameter.  Note that
  changing the value of 'fh' in any method other than open() or
  create() will have no effect.

  Since the type of 'fh' is unsigned long, you need to use casts when
  storing and retrieving a pointer.  Under Linux (and most other
  architectures) an unsigned long will be able to hold a pointer.
  This could have been done with a union of 'void *' and 'unsigned
  long', but that would not have been any more type safe than having
  to use explicit casts.  The recommended type safe solution is to
  write a small inline function that retrieves the pointer from the
  fuse_file_info structure.

Q: what other file systems to start from (merge/switch/..)?  Maybe
it's easiest to just write a simple passthrough filesystem and
"observe" it, spitting out some highlevel code (i.e. readable scheme
code).  This seems to work quite well..
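Following the FAQ's recommendation, a small pair of inline helpers keeps
all the casts in one place.  A sketch: struct tfs_file and the helper
names are mine, and the fuse_file_info stand-in exists only to make the
snippet self-contained (real code would #include <fuse.h> instead).

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for libfuse's struct fuse_file_info, so this sketch compiles
   without libfuse; only the 'fh' field matters here.  Real code would
   #include <fuse.h> and drop this. */
struct fuse_file_info { uint64_t fh; };

/* Hypothetical per-open-file state. */
struct tfs_file { int fd; };

/* Type-safe helpers, as the FAQ recommends: the casts live here and
   nowhere else. */
static inline void tfs_file_set(struct fuse_file_info *fi, struct tfs_file *f)
{
    fi->fh = (uintptr_t)f;   /* pointer fits in the integer field */
}

static inline struct tfs_file *tfs_file_get(struct fuse_file_info *fi)
{
    return (struct tfs_file *)(uintptr_t)fi->fh;
}
```

tfs_file_set() would run in open()/create(); every later callback that
receives the fuse_file_info recovers the object with tfs_file_get().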
First stopper:

  zzz:/data/safe/tom/darcs/tfs# echo boo >mnt/hello
  bash: mnt/hello: Function not implemented

  unique: 21, opcode: LOOKUP (1), nodeid: 1, insize: 46
  LOOKUP /hello
  NODEID: 2
  unique: 21, error: 0 (Success), outsize: 136
  unique: 22, opcode: SETATTR (4), nodeid: 2, insize: 128
  unique: 22, error: -38 (Function not implemented), outsize: 16

http://osdir.com/ml/file-systems.fuse.devel/2005-09/msg00056.html

  SETATTR is actually 4 different methods rolled into one: chmod,
  chown, truncate and utime.  You need to implement at least
  truncate() for open(... O_TRUNC) to work.

Entry: journal
Date: Sun Mar 15 14:36:21 CET 2009

The simplest way to get something to work is to _only_ transport
write(), and forget about everything else.  Open/close/create can be
automated on the receive end.  remove() is probably also not
necessary..  mkdir() might also be done automatically.

Entry: protocol
Date: Sun Mar 15 14:59:30 CET 2009

2 commands:

  * associate
  * write

Everything else is derived, assumed, or arbitrarily filled in.  Stick
to binary: i'd like the "write" to have as little overhead as
possible.  Communication is message based: a 32 bit size followed by
the message, which contains a 32 bit type tag.

EDIT: 2 commands is not a good idea.  Replaced with "lazy open", which
means that when the file is clobbered in a meaningful way (only write
atm), the open command is sent.

Entry: done?
Date: Sun Mar 15 17:37:18 CET 2009

Looks like basic functionality is there.  More:

  * How to buffer the journal stream?  Maybe write a small program
    that pipes I->O through a disk buffer?
  * Make sure that the local writes make it through even if the pipe
    is broken or journal transfer isn't fast enough.
  * Check if this works with the current vserver setup.
  * Maybe split read/write into 2 different programs: one that creates
    journal files (without write data) and one that replays them with
    write data?

Entry: journal format
Date: Sun Mar 15 18:09:01 CET 2009

OK.
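The framing from the protocol entry above (a 32 bit size prefix, then a
message that starts with a 32 bit type tag) can be sketched like this.
The function name and tag values are my own, host byte order is assumed,
and whether the size prefix counts itself is a design choice (here it
doesn't); the real tfs layout may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type tags; the real tfs values are not shown in the log. */
enum { TFS_OPEN = 1, TFS_WRITE = 2 };

/* Frame one message into buf: a 32-bit size prefix, then the message,
   which starts with a 32-bit type tag.  The size counts the message
   itself (tag + payload), not the prefix.  Returns total bytes used.
   Host byte order; a real wire protocol would pin endianness. */
static size_t msg_pack(uint8_t *buf, uint32_t type,
                       const void *payload, uint32_t len)
{
    uint32_t size = 4 + len;           /* tag + payload  */
    memcpy(buf, &size, 4);             /* size prefix    */
    memcpy(buf + 4, &type, 4);         /* type tag       */
    memcpy(buf + 8, payload, len);     /* payload bytes  */
    return 4 + size;
}
```

A receiver reverses this: read 4 bytes, then read exactly that many more
bytes and dispatch on the tag.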
Got the command struct trimmed down to 16 bytes per command, with
payloads rounded up to the next 16 bytes.

Now, how to implement the buffers?  The simplest way is to chop the
journal into manageable chunks that allow transfer restarts (i.e. 1MB
of writes).  Then the transfer daemon could consume these chunks, make
sure they got transferred, and delete the metadata.

Entry: vserver
Date: Sun Mar 15 21:50:37 CET 2009

It's working in vserver after creating /dev/fuse and this:

  /etc/vservers/test01/ccapabilities
    SECURE_MOUNT
    SECURE_REMOUNT
    BINARY_MOUNT
  /etc/vservers/test01/bcapabilities
    SYS_ADMIN

Now i'd like to make it work for a certain user:

  # cat start
  export TFS_MASTER=/home/tom/tfs/store
  bin/tfs_server -f -o uid=33,gid=33 incoming

but as uid 33 i can't see or write the dir..  It does work fine over
ssh:

  ssh tom@xxx "cd tfs ; sudo ./start" | TFS_SLAVE=mirror ./tfs_client

Fixed read() restarts: see safe_read().

Next bug: apparently fuse is multithreaded, so we need to lock the
journal output.  Fixed by making it single-threaded.

Next bug: can't access the filesystem as www-data, even if it's
mounted as such.  Fixed: -o allow_other.  Now it seems to work..

Running test:

  $ cat start
  umount incoming
  export TFS_MASTER=/home/tom/tfs/store
  bin/tfs_server -f -o uid=33,gid=33,allow_other incoming

It looks like the master side's writes simply get throttled if the
M->S bandwidth is the limiting factor.  Perfect!

When using multiple files however, things go wrong..  Could be just
mkdir though.  To fix:

  tom@zzz:~/tfs$ cat ~/bin/rsync.inplace
  rsync \
      --archive \
      --verbose \
      --recursive \
      --inplace \
      --progress \
      --copy-links \
      "$@"
  # i don't remember why, probably renaming large files?
  # --copy-links \

mkdir wasn't implemented indeed..  Fixed.

Now, when the fs crashes, it's quite a pain to clean up the mess, so
it can't crash.  Let's turn this into a connection where multiple
clients can hook into the transfer stream.
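Letting multiple clients hook into the transfer stream amounts to
fanning each journal message out to every connected socket and dropping
clients whose connection broke.  A minimal sketch of that fan-out step
(my reconstruction, not the actual tfs_broadcast code; a real server
would also loop on short writes and ignore SIGPIPE):

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Write one message to every client fd.  Any fd whose write fails is
   closed and compacted out of the array.  Returns the new client
   count. */
static int fanout(int *fds, int nfds, const void *buf, size_t len)
{
    int kept = 0;
    for (int i = 0; i < nfds; i++) {
        if (write(fds[i], buf, len) == (ssize_t)len)
            fds[kept++] = fds[i];   /* still alive: keep */
        else
            close(fds[i]);          /* broken: drop      */
    }
    return kept;
}
```

With blocking fds the sender runs at the speed of the slowest surviving
client; a client that cannot keep up stalls the stream (or, with
non-blocking fds, would have to be buffered or dropped).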
Entry: avoiding fs crashes at all cost
Date: Mon Mar 16 11:01:24 CET 2009

The only way to do this is to separate the fs daemon and the network
frontend:

  [ FS ] <-> [ NET ]

  * It should be possible to restart the net frontend during
    development, i.e. SIGHUP will reload the frontend.
  * FS will keep running _even when_ it loses its head.  In this fatal
    condition the FS needs to be restarted when all clients are done
    accessing the filesystem.

What about a simple "hook-in" server?  It should be possible to run an
"inplace" rsync parallel to a running update: the only thing that can
happen is double transfers.  The server would then run at the slowest
client's speed, or at full speed when there is no client connected.

Hook-in:

  * start with a "silent" fs.
  * start and run a "bootstrap" rsync.

Problem: a select-based server would be problematic unless i can get
at the /dev/fuse events..  Maybe a socat solution is enough really?
Is it possible to use socat as a 1 -> many distribution server?
Hmm..  socat using TCP4-SERVER only serves one connection, so it looks
like some special purpose juggling is necessary.

Ok, wrote tfs_socket.c, which broadcasts the received messages to all
its connections.  Looks like the problem is solved now and i can move
on to fine tuning.

Entry: example start file
Date: Wed Mar 18 11:03:30 CET 2009

  #!/bin/sh
  STORE=store
  INCOMING=incoming
  SOCKET=/tmp/tfs

  # cleanup
  umount $INCOMING 2>/dev/null
  rm -f $SOCKET

  # go
  (bin/tfs_journal $STORE -f -o uid=33,gid=33,allow_other $INCOMING \
      | bin/tfs_broadcast $SOCKET) &

Entry: fine tuning
Date: Thu Mar 19 16:02:18 CET 2009

  * Turn tfs_broadcast into a generic in-order message
    forwarding/duplication system (payload size including header, so
    the router can be agnostic about message type).  OK
  * Using that same protocol, create a disk buffer application that
    can be plugged in between the filesystem daemon and the client.
  * Create a call-by-value binary protocol synthesizer.
  * Connection resets: add a client which blocks/buffers until a
    connection is restored.

Entry: reconnect
Date: Sat Mar 21 12:26:50 CET 2009

Problem: TCP connections break - unknown cause.  Probably some
firewall timeout.  Simple solutions:

  - Minimal nb of clients.  OK
  - (guess) Add zero-size keepalive messages?  NO

I'm not adding keepalive messages to the broadcast server: it should
not add extra semantics.  Doing this in the filesystem server otoh
will be problematic with the current callback-based implementation.

So..  what is necessary to ensure delivery of messages?  Maybe this
just needs to be solved at the tunnel level: require pipes to stay
connected, possibly run them over a vpn so IP addresses stay the same.

Another one: apparently it's not possible to detect a disconnect until
a write (or read) is attempted.  This is not so surprising really:
checking if the other end is awake requires data to be sent..  Maybe
we should use some ACK mechanism?

Entry: single-threaded implementation
Date: Sun Mar 29 13:00:31 CEST 2009

To make this better behaved, and to prepare for a spinoff that exports
Scheme datastructures through a filesystem API, it is necessary to
move to a select() based single-threaded implementation.

The file descriptor to select() on is in (struct fuse_chan).fd.  Get
at this using:

  fuse_chan_fd :: struct fuse_chan* -> int

This can be reached through (simplified code):

  fuse_session_loop(struct fuse_session *se) {
      struct fuse_chan *ch = fuse_session_next_chan(se);
      while (!fuse_session_exited(se)) {
          fuse_chan_recv(&ch);
          fuse_session_process(se);
      }
  }

Let's see if there's another entry point.
The default receive function seems to be fuse_kern_chan_receive() in
fuse_kern_chan.c.

Looks like I just need to unroll fuse_session_loop(), which is called
through fuse_loop() in tfs_journal.c: main().

fuse.c:

  int fuse_loop(struct fuse *f)
  {
      if (f)
          return fuse_session_loop(f->se);
      else
          return -1;
  }

fuse_loop.c:

  int fuse_session_loop(struct fuse_session *se)
  {
      int res = 0;
      struct fuse_chan *ch = fuse_session_next_chan(se, NULL);
      size_t bufsize = fuse_chan_bufsize(ch);
      char *buf = (char *) malloc(bufsize);
      if (!buf) {
          fprintf(stderr, "fuse: failed to allocate read buffer\n");
          return -1;
      }

      while (!fuse_session_exited(se)) {
          struct fuse_chan *tmpch = ch;
          res = fuse_chan_recv(&tmpch, buf, bufsize);
          if (res == -EINTR)
              continue;
          if (res <= 0)
              break;
          fuse_session_process(se, buf, res, tmpch);
      }

      free(buf);
      fuse_session_reset(se);
      return res < 0 ? -1 : 0;
  }

Trying to expand this code in tfs_journal.c breaks because the
datastructures are internal and not exported in the API headers.  I
did read somewhere that select() on the channel fd is possible.  But
how?  And in which version?

Found this:
http://www.nabble.com/-PATCH--async-support-in-FUSE-Python-td21109865.html

Replace f->se with fuse_get_session(f).  Next:

  tfs_journal.c:251: warning: implicit declaration of function ‘fuse_session_next_chan’
  tfs_journal.c:252: warning: implicit declaration of function ‘fuse_chan_bufsize’
  tfs_journal.c:259: warning: implicit declaration of function ‘fuse_session_exited’
  tfs_journal.c:261: warning: implicit declaration of function ‘fuse_chan_recv’
  tfs_journal.c:266: warning: implicit declaration of function ‘fuse_session_process’
  tfs_journal.c:270: warning: implicit declaration of function ‘fuse_session_reset’

This needs an extra #include.  OK.  Tested.

Now, turn the mu_fuse_loop() function into a state machine.
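Once the loop is unrolled, the missing piece is waiting for the channel
fd to become readable so that other descriptors (or idle work) can be
multiplexed in.  That part is plain POSIX and can be sketched without
libfuse; the fuse_* calls from the loop above appear only in a comment,
since they need the lowlevel headers.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Block until fd is readable or timeout_ms elapses.
   Returns 1 if readable, 0 on timeout, -1 on error. */
static int wait_readable(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv = { timeout_ms / 1000, (timeout_ms % 1000) * 1000 };
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    int r = select(fd + 1, &rfds, NULL, NULL, &tv);
    return r < 0 ? -1 : (r > 0 ? 1 : 0);
}

/* Intended use in the unrolled loop (sketch; needs the lowlevel API):
 *
 *   int fd = fuse_chan_fd(ch);
 *   while (!fuse_session_exited(se)) {
 *       if (wait_readable(fd, 100) != 1)
 *           continue;                      // idle: do other work here
 *       struct fuse_chan *tmpch = ch;
 *       int res = fuse_chan_recv(&tmpch, buf, bufsize);
 *       if (res > 0)
 *           fuse_session_process(se, buf, res, tmpch);
 *   }
 */
```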
Entry: batches vs. events
Date: Sun Jul 19 16:13:24 CEST 2009

I recently read somewhere some advice like this:

  Make sure your different steps are all processed by cron jobs, where
  one job moves data from one directory to the next every x minutes.
  Don't use any fancy triggering: it will break.

I'm wondering if it might be a good idea to start ignoring this advice
by focusing on proper triggering so it won't fail.  The main problem
seems to be persistence in the presence of failure.  In practice, the
system breaks down whenever a connection breaks.

How to properly abstract the queue so it is guaranteed that a restart
is possible?  The main idea here seems to be a proper queue mechanism.
How do you store a queue on disk effectively?  It's easy to store a
stack (append to a file).  A queue can be implemented using two stacks
and a locking mechanism that isolates the dumpover.
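The two-stack construction works because pushes only ever touch the
in-stack and pops only the out-stack; the dump-over reverses the
in-stack exactly when the out-stack runs dry, so elements come out in
FIFO order.  An in-memory sketch (on disk, each stack would be an
append-only file, with the dump-over done under the lock):

```c
#include <assert.h>

/* Fixed-capacity two-stack queue; illustration only, no bounds checks. */
#define QCAP 64

struct queue {
    int in[QCAP], out[QCAP];
    int n_in, n_out;
};

/* Push: append to the in-stack (on disk: append to a file). */
static void q_push(struct queue *q, int v)
{
    q->in[q->n_in++] = v;
}

/* Pop: take from the out-stack; when it is empty, dump the in-stack
   over in reverse.  Caller must ensure the queue is non-empty. */
static int q_pop(struct queue *q)
{
    if (q->n_out == 0)
        while (q->n_in > 0)
            q->out[q->n_out++] = q->in[--q->n_in];
    return q->out[--q->n_out];
}
```

Note that a dump-over only happens on an empty out-stack, so each
element is moved exactly once and pops are O(1) amortized.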