TFS

Directory tree cloning modeled after the unix "tee" command and RAID1
mirroring: an "event-driven rsync".  TFS monitors a filesystem tree and
makes all write instructions available as a command stream (journal)
which can be used to reconstruct the tree remotely.  Multiple clients
can hook into the command stream.

  tfs_journal    FUSE filesystem; outputs the journal stream on stdout.
  tfs_broadcast  UNIX socket broadcast server.
  tfs_mirror     transforms the journal stream into local filesystem ops.

Entry: initial commit
Date: Sun Mar 15 10:52:01 CET 2009

Starting from the FUSE tutorial:
http://apps.sourceforge.net/mediawiki/fuse/index.php?title=Hello_World

Next: Q: how to map a fuse_file_info to an object of our choice, i.e. a
struct representing an open file?

http://apps.sourceforge.net/mediawiki/fuse/index.php?title=FAQ#Is_it_possible_to_store_a_pointer_to_private_data_in_the_fuse_file_info_structure.3F

  Is it possible to store a pointer to private data in the
  fuse_file_info structure?

  Yes, the 'fh' field is for this purpose.  This field may be set in
  the open() and create() methods, and is available in all other
  methods having a struct fuse_file_info parameter.  Note that
  changing the value of 'fh' in any method other than open() or
  create() will have no effect.

  Since the type of 'fh' is unsigned long, you need to use casts when
  storing and retrieving a pointer.  Under Linux (and most other
  architectures) an unsigned long will be able to hold a pointer.
  This could have been done with a union of 'void *' and 'unsigned
  long', but that would not have been any more type safe than having
  to use explicit casts.  The recommended type safe solution is to
  write a small inline function that retrieves the pointer from the
  fuse_file_info structure.

Q: what other file systems to start from (merge/switch/..)?  Maybe
it's easiest to just write a simple passthrough filesystem and
"observe" it, spitting out some highlevel code (i.e. readable scheme
code).  This seems to work quite well..
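Following the FAQ's recommendation, a small pair of inline helpers keeps
all the casts in one place.  A sketch: struct tfs_file and the helper
names are mine, and the fuse_file_info stand-in exists only to make the
snippet self-contained (real code would #include <fuse.h> instead).

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for libfuse's struct fuse_file_info, so this sketch compiles
   without libfuse; only the 'fh' field matters here.  Real code would
   #include <fuse.h> and drop this. */
struct fuse_file_info { uint64_t fh; };

/* Hypothetical per-open-file state. */
struct tfs_file { int fd; };

/* Type-safe helpers, as the FAQ recommends: the casts live here and
   nowhere else. */
static inline void tfs_file_set(struct fuse_file_info *fi, struct tfs_file *f)
{
    fi->fh = (uintptr_t)f;   /* pointer fits in the integer field */
}

static inline struct tfs_file *tfs_file_get(struct fuse_file_info *fi)
{
    return (struct tfs_file *)(uintptr_t)fi->fh;
}
```

tfs_file_set() would run in open()/create(); every later callback that
receives the fuse_file_info recovers the object with tfs_file_get().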
First stopper:

  zzz:/data/safe/tom/darcs/tfs# echo boo >mnt/hello
  bash: mnt/hello: Function not implemented

  unique: 21, opcode: LOOKUP (1), nodeid: 1, insize: 46
  LOOKUP /hello
  NODEID: 2
  unique: 21, error: 0 (Success), outsize: 136
  unique: 22, opcode: SETATTR (4), nodeid: 2, insize: 128
  unique: 22, error: -38 (Function not implemented), outsize: 16

http://osdir.com/ml/file-systems.fuse.devel/2005-09/msg00056.html

  SETATTR is actually 4 different methods rolled into one: chmod,
  chown, truncate and utime.  You need to implement at least
  truncate() for open(... O_TRUNC) to work.

Entry: journal
Date: Sun Mar 15 14:36:21 CET 2009

The simplest way to get something to work is to _only_ transport
write(), and forget about everything else.  Open/close/create can be
automated on the receive end.  remove() is probably also not
necessary..  mkdir() might also be done automatically.

Entry: protocol
Date: Sun Mar 15 14:59:30 CET 2009

2 commands:

  * associate
  * write

Everything else is derived, assumed, or arbitrarily filled in.  Stick
to binary: i'd like the "write" to have as little overhead as
possible.  Communication is message based: a 32 bit size followed by
the message, which contains a 32 bit type tag.

EDIT: 2 commands is not a good idea.  Replaced with "lazy open", which
means that when the file is clobbered in a meaningful way (only write
atm), the open command is sent.

Entry: done?
Date: Sun Mar 15 17:37:18 CET 2009

Looks like basic functionality is there.  More:

  * How to buffer the journal stream?  Maybe write a small program
    that pipes I->O through a disk buffer?
  * Make sure that the local writes make it through even if the pipe
    is broken or journal transfer isn't fast enough.
  * Check if this works with the current vserver setup.
  * Maybe split read/write into 2 different programs: one that creates
    journal files (without write data) and one that replays them with
    write data?

Entry: journal format
Date: Sun Mar 15 18:09:01 CET 2009

OK.
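The framing from the protocol entry above (a 32 bit size prefix, then a
message that starts with a 32 bit type tag) can be sketched like this.
The function name and tag values are my own, host byte order is assumed,
and whether the size prefix counts itself is a design choice (here it
doesn't); the real tfs layout may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type tags; the real tfs values are not shown in the log. */
enum { TFS_OPEN = 1, TFS_WRITE = 2 };

/* Frame one message into buf: a 32-bit size prefix, then the message,
   which starts with a 32-bit type tag.  The size counts the message
   itself (tag + payload), not the prefix.  Returns total bytes used.
   Host byte order; a real wire protocol would pin endianness. */
static size_t msg_pack(uint8_t *buf, uint32_t type,
                       const void *payload, uint32_t len)
{
    uint32_t size = 4 + len;           /* tag + payload  */
    memcpy(buf, &size, 4);             /* size prefix    */
    memcpy(buf + 4, &type, 4);         /* type tag       */
    memcpy(buf + 8, payload, len);     /* payload bytes  */
    return 4 + size;
}
```

A receiver reverses this: read 4 bytes, then read exactly that many more
bytes and dispatch on the tag.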
Got the command struct trimmed down to 16 bytes per command, with
payloads rounded up to the next 16 bytes.

Now, how to implement the buffers?  The simplest way is to chop the
journal into manageable chunks that allow transfer restarts (i.e. 1MB
of writes).  Then the transfer daemon could consume these chunks, make
sure they got transferred, and delete the metadata.

Entry: vserver
Date: Sun Mar 15 21:50:37 CET 2009

It's working in vserver after creating /dev/fuse and this:

  /etc/vservers/test01/ccapabilities
    SECURE_MOUNT
    SECURE_REMOUNT
    BINARY_MOUNT
  /etc/vservers/test01/bcapabilities
    SYS_ADMIN

Now i'd like to make it work for a certain user:

  # cat start
  export TFS_MASTER=/home/tom/tfs/store
  bin/tfs_server -f -o uid=33,gid=33 incoming

but as uid 33 i can't see or write the dir..  It does work fine over
ssh:

  ssh tom@xxx "cd tfs ; sudo ./start" | TFS_SLAVE=mirror ./tfs_client

Fixed read() restarts: see safe_read().

Next bug: apparently fuse is multithreaded, so we need to lock the
journal output.  Fixed by making it single-threaded.

Next bug: can't access the filesystem as www-data, even if it's
mounted as such.  Fixed: -o allow_other.  Now it seems to work..

Running test:

  $ cat start
  umount incoming
  export TFS_MASTER=/home/tom/tfs/store
  bin/tfs_server -f -o uid=33,gid=33,allow_other incoming

It looks like the master side's writes simply get throttled if the
M->S bandwidth is the limiting factor.  Perfect!

When using multiple files however, things go wrong..  Could be just
mkdir though.  To fix:

  tom@zzz:~/tfs$ cat ~/bin/rsync.inplace
  rsync \
      --archive \
      --verbose \
      --recursive \
      --inplace \
      --progress \
      --copy-links \
      "$@"
  # i don't remember why, probably renaming large files?
  # --copy-links \

mkdir wasn't implemented indeed..  Fixed.

Now, when the fs crashes, it's quite a pain to clean up the mess, so
it can't crash.  Let's turn this into a connection where multiple
clients can hook into the transfer stream.
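Letting multiple clients hook into the transfer stream amounts to
fanning each journal message out to every connected socket and dropping
clients whose connection broke.  A minimal sketch of that fan-out step
(my reconstruction, not the actual tfs_broadcast code; a real server
would also loop on short writes and ignore SIGPIPE):

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Write one message to every client fd.  Any fd whose write fails is
   closed and compacted out of the array.  Returns the new client
   count. */
static int fanout(int *fds, int nfds, const void *buf, size_t len)
{
    int kept = 0;
    for (int i = 0; i < nfds; i++) {
        if (write(fds[i], buf, len) == (ssize_t)len)
            fds[kept++] = fds[i];   /* still alive: keep */
        else
            close(fds[i]);          /* broken: drop      */
    }
    return kept;
}
```

With blocking fds the sender runs at the speed of the slowest surviving
client; a client that cannot keep up stalls the stream (or, with
non-blocking fds, would have to be buffered or dropped).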
Entry: avoiding fs crashes at all cost
Date: Mon Mar 16 11:01:24 CET 2009

The only way to do this is to separate the fs daemon and the network
frontend:

  [ FS ] <-> [ NET ]

  * It should be possible to restart the net frontend during
    development, i.e. SIGHUP will reload the frontend.
  * FS will keep running _even when_ it loses its head.  In this fatal
    condition the FS needs to be restarted when all clients are done
    accessing the filesystem.

What about a simple "hook-in" server?  It should be possible to run an
"inplace" rsync parallel to a running update: the only thing that can
happen is double transfers.  The server would then run at the slowest
client's speed, or at full speed when there is no client connected.

Hook-in:

  * start with a "silent" fs.
  * start and run a "bootstrap" rsync.

Problem: a select-based server would be problematic unless i can get
at the /dev/fuse events..  Maybe a socat solution is enough really?
Is it possible to use socat as a 1 -> many distribution server?
Hmm..  socat using TCP4-SERVER only serves one connection, so it looks
like some special purpose juggling is necessary.

Ok, wrote tfs_socket.c, which broadcasts the received messages to all
its connections.  Looks like the problem is solved now and i can move
on to fine tuning.

Entry: example start file
Date: Wed Mar 18 11:03:30 CET 2009

  #!/bin/sh
  STORE=store
  INCOMING=incoming
  SOCKET=/tmp/tfs

  # cleanup
  umount $INCOMING 2>/dev/null
  rm -f $SOCKET

  # go
  (bin/tfs_journal $STORE -f -o uid=33,gid=33,allow_other $INCOMING \
      | bin/tfs_broadcast $SOCKET) &

Entry: fine tuning
Date: Thu Mar 19 16:02:18 CET 2009

  * Turn tfs_broadcast into a generic in-order message
    forwarding/duplication system (payload size including header, so
    the router can be agnostic about message type).  OK
  * Using that same protocol, create a disk buffer application that
    can be plugged in between the filesystem daemon and the client.
  * Create a call-by-value binary protocol synthesizer.
  * Connection resets: add a client which blocks/buffers until a
    connection is restored.

Entry: reconnect
Date: Sat Mar 21 12:26:50 CET 2009

Problem: TCP connections break - unknown cause.  Probably some
firewall timeout.  Simple solutions:

  - Minimal nb of clients.  OK
  - (guess) Add zero-size keepalive messages?  NO

I'm not adding keepalive messages to the broadcast server: it should
not add extra semantics.  Doing this in the filesystem server otoh
will be problematic with the current callback-based implementation.

So..  what is necessary to ensure delivery of messages?  Maybe this
just needs to be solved at the tunnel level: require pipes to stay
connected, possibly run them over a vpn so IP addresses stay the same.

Another one: apparently it's not possible to detect a disconnect until
a write (or read) is attempted.  This is not so surprising really:
checking if the other end is awake requires data to be sent..  Maybe
we should use some ACK mechanism?

Entry: single-threaded implementation
Date: Sun Mar 29 13:00:31 CEST 2009

To make this better behaved, and to prepare for a spinoff that exports
Scheme datastructures through a filesystem API, it is necessary to
move to a select() based single-threaded implementation.

The file descriptor to select() on is in (struct fuse_chan).fd.  Get
at this using:

  fuse_chan_fd :: struct fuse_chan* -> int

This can be reached through (simplified code):

  fuse_session_loop(struct fuse_session *se) {
      struct fuse_chan *ch = fuse_session_next_chan(se);
      while (!fuse_session_exited(se)) {
          fuse_chan_recv(&ch);
          fuse_session_process(se);
      }
  }

Let's see if there's another entry point.
The default receive function seems to be fuse_kern_chan_receive() in
fuse_kern_chan.c.

Looks like I just need to unroll fuse_session_loop(), which is called
through fuse_loop() in tfs_journal.c: main().

fuse.c:

  int fuse_loop(struct fuse *f)
  {
      if (f)
          return fuse_session_loop(f->se);
      else
          return -1;
  }

fuse_loop.c:

  int fuse_session_loop(struct fuse_session *se)
  {
      int res = 0;
      struct fuse_chan *ch = fuse_session_next_chan(se, NULL);
      size_t bufsize = fuse_chan_bufsize(ch);
      char *buf = (char *) malloc(bufsize);
      if (!buf) {
          fprintf(stderr, "fuse: failed to allocate read buffer\n");
          return -1;
      }

      while (!fuse_session_exited(se)) {
          struct fuse_chan *tmpch = ch;
          res = fuse_chan_recv(&tmpch, buf, bufsize);
          if (res == -EINTR)
              continue;
          if (res <= 0)
              break;
          fuse_session_process(se, buf, res, tmpch);
      }

      free(buf);
      fuse_session_reset(se);
      return res < 0 ? -1 : 0;
  }

Trying to expand this code in tfs_journal.c breaks because the
datastructures are internal and not exported in the API headers.  I
did read somewhere that select() on the channel fd is possible.  But
how?  And in which version?

Found this:
http://www.nabble.com/-PATCH--async-support-in-FUSE-Python-td21109865.html

Replace f->se with fuse_get_session(f).  Next:

  tfs_journal.c:251: warning: implicit declaration of function ‘fuse_session_next_chan’
  tfs_journal.c:252: warning: implicit declaration of function ‘fuse_chan_bufsize’
  tfs_journal.c:259: warning: implicit declaration of function ‘fuse_session_exited’
  tfs_journal.c:261: warning: implicit declaration of function ‘fuse_chan_recv’
  tfs_journal.c:266: warning: implicit declaration of function ‘fuse_session_process’
  tfs_journal.c:270: warning: implicit declaration of function ‘fuse_session_reset’

This needs an extra #include.  OK.  Tested.

Now, turn the mu_fuse_loop() function into a state machine.
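Once the loop is unrolled, the missing piece is waiting for the channel
fd to become readable so that other descriptors (or idle work) can be
multiplexed in.  That part is plain POSIX and can be sketched without
libfuse; the fuse_* calls from the loop above appear only in a comment,
since they need the lowlevel headers.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Block until fd is readable or timeout_ms elapses.
   Returns 1 if readable, 0 on timeout, -1 on error. */
static int wait_readable(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv = { timeout_ms / 1000, (timeout_ms % 1000) * 1000 };
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    int r = select(fd + 1, &rfds, NULL, NULL, &tv);
    return r < 0 ? -1 : (r > 0 ? 1 : 0);
}

/* Intended use in the unrolled loop (sketch; needs the lowlevel API):
 *
 *   int fd = fuse_chan_fd(ch);
 *   while (!fuse_session_exited(se)) {
 *       if (wait_readable(fd, 100) != 1)
 *           continue;                      // idle: do other work here
 *       struct fuse_chan *tmpch = ch;
 *       int res = fuse_chan_recv(&tmpch, buf, bufsize);
 *       if (res > 0)
 *           fuse_session_process(se, buf, res, tmpch);
 *   }
 */
```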
Entry: batches vs. events
Date: Sun Jul 19 16:13:24 CEST 2009

I recently read somewhere some advice like this:

  Make sure your different steps are all processed by cron jobs, where
  one job moves data from one directory to the next every x minutes.
  Don't use any fancy triggering: it will break.

I'm wondering if it might be a good idea to start ignoring this advice
by focusing on proper triggering so it won't fail.  The main problem
seems to be persistence in the presence of failure.  In practice, the
system breaks down whenever a connection breaks.

How to properly abstract the queue so it is guaranteed that a restart
is possible?  The main idea here seems to be a proper queue mechanism.
How do you store a queue on disk effectively?  It's easy to store a
stack (append to a file).  A queue can be implemented using two stacks
and a locking mechanism that isolates the dumpover.
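The two-stack construction works because pushes only ever touch the
in-stack and pops only the out-stack; the dump-over reverses the
in-stack exactly when the out-stack runs dry, so elements come out in
FIFO order.  An in-memory sketch (on disk, each stack would be an
append-only file, with the dump-over done under the lock):

```c
#include <assert.h>

/* Fixed-capacity two-stack queue; illustration only, no bounds checks. */
#define QCAP 64

struct queue {
    int in[QCAP], out[QCAP];
    int n_in, n_out;
};

/* Push: append to the in-stack (on disk: append to a file). */
static void q_push(struct queue *q, int v)
{
    q->in[q->n_in++] = v;
}

/* Pop: take from the out-stack; when it is empty, dump the in-stack
   over in reverse.  Caller must ensure the queue is non-empty. */
static int q_pop(struct queue *q)
{
    if (q->n_out == 0)
        while (q->n_in > 0)
            q->out[q->n_out++] = q->in[--q->n_in];
    return q->out[--q->n_out];
}
```

Note that a dump-over only happens on an empty out-stack, so each
element is moved exactly once and pops are O(1) amortized.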