Simple content-addressed storage for managed (linux) tree

Create a .storage directory containing:

- a list of files / directories to manage (to regenerate)
- a database with hash <-> file location (to allow old version recovery)
- hard links from hash <-> file data (to prevent delete: FIXME doesn't work)

Entry: Host to prevent delete?
Date: Sun Sep 7 12:49:26 CEST 2014

Hard linking doesn't work for "live" data, since an in-place update
will change the data.  It makes sense for backup storage as long as
the sync method DELETES files before updating them.  RSYNC does this
by default:

  --delete-before   receiver deletes before transfer (default)

True copy on write is necessary for live data.  Let's solve that later.

Entry: Three-stage
Date: Sun Sep 7 12:58:32 CEST 2014

1. Live
2. Backup
3. Duplication

2. contains older versions of 1.
2. and 3. are clones.

If there is no in-place editing, the Live stage can be removed,
i.e. any user tree can be annotated.

Entry: Performance
Date: Sun Sep 7 13:11:08 CEST 2014

This doesn't work well if there are a lot of small files.

Entry: Why is this problem so hard?
Date: Sun Sep 7 13:19:28 CEST 2014

The thing is: version control systems like git and darcs do solve the
problem, but they are cumbersome to use and have performance issues
for large binary files.  Essentially, the problem is performance
(CPU use and storage size).

Entry: Copy on write
Date: Sun Sep 7 13:28:02 CEST 2014

So hard links do not solve the problem.  Can the features of a
copy-on-write / snapshotting filesystem be used?

Entry: Simpler
Date: Sun Sep 7 13:30:01 CEST 2014

I want a solution now.  Python script to copy a specified tree into:

  db.sqlite3          : storage = [(md5,path)]
  .md5/xx/yy/zzzzz... : content-addressed

This is enough to solve the current pressing problems:

- indefinite storage in backup
- deduplication in backup

Entry: Finding duplicates
Date: Sun Sep 7 15:55:55 CEST 2014

Going through pama storage, I'm finding many duplicates.

  select md5,count(*) from storage group by md5 order by count(*);

Entry: Store filesize in metadata
Date: Sun Sep 7 16:00:35 CEST 2014

Makes sense in queries.  Duplicate files matter less when they are
small, and small duplicates occur a lot in practice.

Entry: FAM + BATCH
Date: Sun Sep 7 16:04:10 CEST 2014

Using a FAM (file alteration monitor) it's possible to keep the
read-only backup up-to-date by copying a file every time it is
modified.  Then once a day, go over the whole store to make sure
nothing was missed.

Create a directory of links into the md5 store to provide read-only
access to files and possibly older versions.

Entry: Problems to solve?
Date: Sun Sep 7 16:16:51 CEST 2014

- data corruption (through hashing)
- versions (through hashing + time stamps)
- data loss independent of backup time (i.e. delete + mirror doesn't
  protect against accidental deletes)
- read-only access to backed-up data
- deduplication (data size reduction for backups)

Entry: Do not touch user tree
Date: Sun Sep 7 16:28:26 CEST 2014

The working tree will have duplicates.  The backup maintainer doesn't
care.  The backup maintainer might tell the user, but in general the
cost of storage on the user side approaches 0.

Practical deduplication (du -sh):

  pama: 9.7G
  md5:  7.6G

Entry: Recreating links
Date: Sun Sep 7 16:43:20 CEST 2014

To expose the backups / archive, create a directory with hard links,
with file permissions set to readonly.  (The hash-named files retain
the original permissions.)
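A minimal sketch of that link step, assuming the storage(md5, path)
table in db.sqlite3 and a two-level .md5/xx/yyyy... layout (as seen in
the corrupt-file paths below); the expose() name and link_root argument
are made up, and multiple versions of the same path are not handled:

  # Rebuild a read-only hard-link tree from db.sqlite3 and the .md5 store.
  # Assumed schema: storage(md5 TEXT, path TEXT); layout .md5/<md5[:2]>/<md5[2:]>.
  import os, sqlite3, stat

  def expose(storage_dir, link_root):
      db = sqlite3.connect(os.path.join(storage_dir, "db.sqlite3"))
      for md5, path in db.execute("SELECT md5, path FROM storage"):
          blob = os.path.join(storage_dir, ".md5", md5[:2], md5[2:])
          dest = os.path.join(link_root, path.lstrip("/"))
          os.makedirs(os.path.dirname(dest), exist_ok=True)
          if not os.path.exists(dest):
              os.link(blob, dest)  # hard link, so no extra space is used
              # chmod acts on the shared inode: the store copy goes read-only too
              os.chmod(dest, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

Because hard links share an inode, stripping the write bits here also
makes the store copy read-only, which fits the "read-only, store-forever"
idea below.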
Entry: Concepts: archives and backups
Date: Sun Sep 7 16:58:34 CEST 2014

Backups are not the same as archives.

Archives:
- Read-only, store-forever.

Backups:
- Provide a means to restore a working tree in the face of
  corruption, theft, and human error.

Backups can be treated as archives once a copy-on-write mechanism is
implemented, i.e. never delete anything.  In practice, this requires
solving efficiency problems due to duplication.  So backup = archive
+ metadata tracking over time.

Entry: TODO
Date: Sun Sep 7 17:38:58 CEST 2014

- Recreate the directory of readonly hard links from db.sqlite3

Entry: Linked directories.
Date: Sun Sep 7 18:24:51 CEST 2014

For proper archive use, the hash storage could be synced from the
linked storage.

  Backup  = work tree     -> hashed tree
  Expose  = hash tree     -> readonly tree
  Archive = readonly tree -> hash tree

Entry: Deleting from storage
Date: Sun Sep 7 18:30:00 CEST 2014

Hashes of deleted files need to be recorded so they will not be
restored by syncing from other archives.

Entry: Unmanaged files
Date: Wed Sep 10 17:28:44 CEST 2014

Entries in storage whose md5 does not appear in the backup table:

  select * from storage natural join
    (select md5 from storage except select md5 from backup);

Entry: Distributed storage
Date: Wed Sep 10 19:11:43 CEST 2014

Interesting problem.  Learned so far:

- The static representation is key.
- Journals are key.

Files are created, then destroyed.  I.e. there are 3 phases:
nonexistent, created (linked), deleted/overwritten (unlinked).  It
seems that "splitting up" set/reset over time keeps the store
constant and trivial to synchronize.  It seems the only "fuzz" is
the exact creation/delete times of files.

Keeping the DB file separate per location allows a local view of the
store.  Pulling in those DBs allows comparing the remote state.

Entry: Manage journal in git?
Date: Wed Sep 10 19:20:52 CEST 2014

It would be nice to use git to record the file history journal and
database.  Making sure files are sorted according to host would allow
git merging to work properly.

Entry: Growing tables
Date: Wed Sep 10 19:26:58 CEST 2014

  # Create new column
  ALTER TABLE storage ADD host TEXT;

  # Set defaults
  UPDATE storage SET host="zni";
  UPDATE storage SET host="zoo";

Entry: Finding corrupt files
Date: Thu Sep 11 13:00:41 CEST 2014

Already have some; the seagate hd differs:

  98/26be087a7275cf3613ac5172937380
  71/4161e3aabf06eb43dadb03a8977c7b

This was a >4G file truncated on FAT32.

Entry: extra field: modification time
Date: Thu Sep 11 16:12:31 CEST 2014

It's probably OK to check the modification time when updating
(hourly?), then do a full checksum sync once a week or so.

  int(os.path.getmtime("/home/tom/.bashrc"))

  ALTER TABLE storage ADD mtime INTEGER;
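A sketch of such an mtime-gated pass, assuming a storage(md5, path,
mtime) table and the same .md5/xx/yyyy... store; update() and md5sum()
are made-up names, and the weekly full-checksum sync would be a
separate pass that simply ignores mtime:

  # Hash and store only files whose mtime changed since the last recorded version.
  import hashlib, os, shutil, sqlite3

  def md5sum(path, bufsize=1 << 20):
      h = hashlib.md5()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(bufsize), b""):
              h.update(chunk)
      return h.hexdigest()

  def update(storage_dir, tree):
      db = sqlite3.connect(os.path.join(storage_dir, "db.sqlite3"))
      # last known mtime per path; new or changed files get re-hashed
      known = dict(db.execute("SELECT path, MAX(mtime) FROM storage GROUP BY path"))
      for root, _, files in os.walk(tree):
          for name in files:
              path = os.path.join(root, name)
              mtime = int(os.path.getmtime(path))
              if known.get(path) == mtime:
                  continue  # assume unchanged, skip the expensive hash
              md5 = md5sum(path)
              blob = os.path.join(storage_dir, ".md5", md5[:2], md5[2:])
              if not os.path.exists(blob):
                  os.makedirs(os.path.dirname(blob), exist_ok=True)
                  shutil.copy2(path, blob)  # new content enters the store
              db.execute("INSERT INTO storage (md5, path, mtime) VALUES (?,?,?)",
                         (md5, path, mtime))
      db.commit()

Deletions are not handled here; as noted in "Deleting from storage",
those would need their own record so a sync doesn't resurrect them.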