Simple content-addressed storage for managed (linux) tree

Create a .storage directory containing:

- a list of files / directories to manage (to regenerate)
- a database with hash <-> file location (to allow old version recovery)
- hard links from hash <-> file data (to prevent delete: FIXME doesn't work)

Entry: Host to prevent delete?
Date: Sun Sep 7 12:49:26 CEST 2014

Hard linking doesn't work for "live" data, since an in-place update
will change the data.  It makes sense for backup storage as long as
the sync method DELETES files before updating them.  RSYNC does this
by default:

  --delete-before   receiver deletes before transfer (default)

True copy on write is necessary for live data.  Let's solve that later.

Entry: Three-stage
Date: Sun Sep 7 12:58:32 CEST 2014

1. Live
2. Backup
3. Duplication

2. contains older versions of 1.
2. and 3. are clones.

If there is no in-place editing, the Live stage can be removed,
i.e. any user tree can be annotated.

Entry: Performance
Date: Sun Sep 7 13:11:08 CEST 2014

This doesn't work well if there are a lot of small files.

Entry: Why is this problem so hard?
Date: Sun Sep 7 13:19:28 CEST 2014

The thing is: version control systems like git and darcs do solve the
problem, but they are cumbersome to use and have performance issues
for large binary files.  Essentially, the problem is performance
(CPU use and storage size).

Entry: Copy on write
Date: Sun Sep 7 13:28:02 CEST 2014

So hard links do not solve the problem.  Can the features of a
copy-on-write / snapshotting filesystem be used?

Entry: Simpler
Date: Sun Sep 7 13:30:01 CEST 2014

I want a solution now.  Python script to copy a specified tree into:

  db.sqlite3          : storage = [(md5,path)]
  .md5/xx/yy/zzzzz... : content-addressed

This is enough to solve the current pressing problems:

- indefinite storage in backup
- deduplication in backup

Entry: Finding duplicates
Date: Sun Sep 7 15:55:55 CEST 2014

Going through pama storage, I'm finding many duplicates.

  select md5,count(*) from storage group by md5 order by count(*);

Entry: Store filesize in metadata
Date: Sun Sep 7 16:00:35 CEST 2014

Makes sense in queries.  Duplicate files matter less when they are
small, and small duplicates occur a lot in practice.

Entry: FAM + BATCH
Date: Sun Sep 7 16:04:10 CEST 2014

Using a FAM (file alteration monitor) it's possible to keep the
read-only backup up-to-date by copying a file every time it is
modified.  Then once a day, go over the whole store to make sure
nothing was missed.

Create a directory of links into the md5 store to provide read-only
access to files and possibly older versions.

Entry: Problems to solve?
Date: Sun Sep 7 16:16:51 CEST 2014

- data corruption (through hashing)
- versions (through hashing + time stamps)
- data loss independent of backup time (i.e. delete + mirror doesn't
  protect against accidental deletes)
- read-only access to backed-up data
- deduplication (data size reduction for backups)

Entry: Do not touch user tree
Date: Sun Sep 7 16:28:26 CEST 2014

The working tree will have duplicates.  The backup maintainer doesn't
care.  The backup maintainer might tell the user, but in general the
cost of storage on the user side approaches 0.

Practical deduplication (du -sh):

  pama: 9.7G
  md5:  7.6G

Entry: Recreating links
Date: Sun Sep 7 16:43:20 CEST 2014

To expose the backups / archive, create a directory with hard links,
with file permissions set to readonly.  (The hash-named files retain
the original permissions.)
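A minimal sketch of that link step, assuming the storage(md5, path)
table in db.sqlite3 and a two-level .md5/xx/yyyy... layout (as seen in
the corrupt-file paths below); the expose() name and link_root argument
are made up, and multiple versions of the same path are not handled:

  # Rebuild a read-only hard-link tree from db.sqlite3 and the .md5 store.
  # Assumed schema: storage(md5 TEXT, path TEXT); layout .md5/<md5[:2]>/<md5[2:]>.
  import os, sqlite3, stat

  def expose(storage_dir, link_root):
      db = sqlite3.connect(os.path.join(storage_dir, "db.sqlite3"))
      for md5, path in db.execute("SELECT md5, path FROM storage"):
          blob = os.path.join(storage_dir, ".md5", md5[:2], md5[2:])
          dest = os.path.join(link_root, path.lstrip("/"))
          os.makedirs(os.path.dirname(dest), exist_ok=True)
          if not os.path.exists(dest):
              os.link(blob, dest)  # hard link, so no extra space is used
              # chmod acts on the shared inode: the store copy goes read-only too
              os.chmod(dest, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

Because hard links share an inode, stripping the write bits here also
makes the store copy read-only, which fits the "read-only, store-forever"
idea below.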
Entry: Concepts: archives and backups
Date: Sun Sep 7 16:58:34 CEST 2014

Backups are not the same as archives.

Archives:
- Read-only, store-forever.

Backups:
- Provide a means to restore a working tree in the face of
  corruption, theft, and human error.

Backups can be treated as archives once a copy-on-write mechanism is
implemented, i.e. never delete anything.  In practice, this requires
solving efficiency problems due to duplication.  So backup = archive
+ metadata tracking over time.

Entry: TODO
Date: Sun Sep 7 17:38:58 CEST 2014

- Recreate the directory of readonly hard links from db.sqlite3

Entry: Linked directories.
Date: Sun Sep 7 18:24:51 CEST 2014

For proper archive use, the hash storage could be synced from the
linked storage.

  Backup  = work tree     -> hashed tree
  Expose  = hash tree     -> readonly tree
  Archive = readonly tree -> hash tree

Entry: Deleting from storage
Date: Sun Sep 7 18:30:00 CEST 2014

Hashes of deleted files need to be recorded so they will not be
restored by syncing from other archives.

Entry: Unmanaged files
Date: Wed Sep 10 17:28:44 CEST 2014

Entries in storage whose md5 does not appear in the backup table:

  select * from storage natural join
    (select md5 from storage except select md5 from backup);

Entry: Distributed storage
Date: Wed Sep 10 19:11:43 CEST 2014

Interesting problem.  Learned so far:

- The static representation is key.
- Journals are key.

Files are created, then destroyed.  I.e. there are 3 phases:
nonexistent, created (linked), deleted/overwritten (unlinked).  It
seems that "splitting up" set/reset over time keeps the store
constant and trivial to synchronize.  It seems the only "fuzz" is
the exact creation/delete times of files.

Keeping the DB file separate per location allows a local view of the
store.  Pulling in those DBs allows comparing the remote state.

Entry: Manage journal in git?
Date: Wed Sep 10 19:20:52 CEST 2014

It would be nice to use git to record the file history journal and
database.  Making sure files are sorted according to host would allow
git merging to work properly.

Entry: Growing tables
Date: Wed Sep 10 19:26:58 CEST 2014

  # Create new column
  ALTER TABLE storage ADD host TEXT;

  # Set defaults
  UPDATE storage SET host="zni";
  UPDATE storage SET host="zoo";

Entry: Finding corrupt files
Date: Thu Sep 11 13:00:41 CEST 2014

Already have some; the seagate hd differs:

  98/26be087a7275cf3613ac5172937380
  71/4161e3aabf06eb43dadb03a8977c7b

This was a >4G file truncated on FAT32.

Entry: extra field: modification time
Date: Thu Sep 11 16:12:31 CEST 2014

It's probably OK to check the modification time when updating
(hourly?), then do a full checksum sync once a week or so.

  int(os.path.getmtime("/home/tom/.bashrc"))

  ALTER TABLE storage ADD mtime INTEGER;
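A sketch of such an mtime-gated pass, assuming a storage(md5, path,
mtime) table and the same .md5/xx/yyyy... store; update() and md5sum()
are made-up names, and the weekly full-checksum sync would be a
separate pass that simply ignores mtime:

  # Hash and store only files whose mtime changed since the last recorded version.
  import hashlib, os, shutil, sqlite3

  def md5sum(path, bufsize=1 << 20):
      h = hashlib.md5()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(bufsize), b""):
              h.update(chunk)
      return h.hexdigest()

  def update(storage_dir, tree):
      db = sqlite3.connect(os.path.join(storage_dir, "db.sqlite3"))
      # last known mtime per path; new or changed files get re-hashed
      known = dict(db.execute("SELECT path, MAX(mtime) FROM storage GROUP BY path"))
      for root, _, files in os.walk(tree):
          for name in files:
              path = os.path.join(root, name)
              mtime = int(os.path.getmtime(path))
              if known.get(path) == mtime:
                  continue  # assume unchanged, skip the expensive hash
              md5 = md5sum(path)
              blob = os.path.join(storage_dir, ".md5", md5[:2], md5[2:])
              if not os.path.exists(blob):
                  os.makedirs(os.path.dirname(blob), exist_ok=True)
                  shutil.copy2(path, blob)  # new content enters the store
              db.execute("INSERT INTO storage (md5, path, mtime) VALUES (?,?,?)",
                         (md5, path, mtime))
      db.commit()

Deletions are not handled here; as noted in "Deleting from storage",
those would need their own record so a sync doesn't resurrect them.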