Sun Apr 24 21:59:35 EDT 2011

Robust filesystem

I'm burning my fingers on a current project.  Some things that went
wrong:

  - No specification, evolutionary what-can-we-get-away-with-cheaply
    design.  Considering the application, that was actually not such a
    bad approach: functional requirements were very simple and
    straightforward.  Eventually there was one ill-specified
    requirement that caused a bit of complexity.

  - Complete underestimation of non-functional requirements:

  - Difficult refactoring: merging two subsystems that were "almost
    the same" caused many headaches when they were actually placed on
    top of the same abstraction.

  - Splitting another part of the code into separate modules proved
    difficult due to insufficient understanding of the coupling.

  - Premature optimization.  Not for speed, but for memory usage, in
    this case disk buffers.  This led to an implementation that was
    very hard to change.

  - Inadequate test suite: the stateful nature of the system, and the
    nature of the errors that should be recovered (lots of invalid
    intermediate state) make it very hard to test.

If I can name one major point, it is state.  The system as it is at
this moment has too many degrees of freedom.  This makes many things
very difficult:

 * Testing: almost impossible to cover all corner cases.

 * Change: the highly sequential nature of operation makes it very
   difficult to separate responsibilities over multiple objects, or
   perform simple, incremental changes.

 * Ownership: at least one bad factorization is due to unclear
   ownership of data structures.

 * Temporary storage management: A non-functional requirement is to
   use a minimal memory footprint.

If state is the problem, the solution that immediately comes to mind
is to switch to a transaction-based approach where pre- and
postconditions can be expressed clearly (even if only in the test suite).
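
To make the idea concrete, here is a minimal sketch in Python of what
"transaction with pre- and postconditions" could look like.  None of
this is from the actual project: run_transaction, the dictionary-based
state and the append_block example are all hypothetical, and copying
the whole state is only acceptable for illustration (it ignores the
memory footprint requirement).

```python
import copy


class TransactionError(Exception):
    """Raised when a pre- or postcondition fails; the old state is untouched."""


def run_transaction(state, operation, pre, post):
    """Apply `operation` to a private copy of `state`.

    The new state is returned only if `pre` holds on the old state and
    `post` holds on the (old, new) pair.  `state` itself is never modified.
    """
    if not pre(state):
        raise TransactionError("precondition failed")
    work = copy.deepcopy(state)   # all mutation happens on the copy
    operation(work)
    if not post(state, work):
        raise TransactionError("postcondition failed")
    return work                   # the "commit": caller adopts the new state


# Hypothetical example: appending one block to a file's block list.
def append_block(s):
    s["blocks"].append(s["next_free"])
    s["next_free"] += 1


def has_free_space(s):
    return s["next_free"] < s["capacity"]


def one_block_added(old, new):
    return (len(new["blocks"]) == len(old["blocks"]) + 1
            and new["next_free"] == old["next_free"] + 1)


if __name__ == "__main__":
    s0 = {"blocks": [], "next_free": 0, "capacity": 2}
    s1 = run_transaction(s0, append_block, has_free_space, one_block_added)
    print(s0, s1)
```

The point of the shape is that the conditions are ordinary functions,
so the same has_free_space and one_block_added can be reused verbatim
in a test suite even if the production code never checks them.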

In a fully transaction-based approach, there is a very simple, even
_stupid simple_ way of handling errors: whenever it fails, retry.
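
A sketch of that retry rule in Python, under the assumption that a
failed attempt can only damage a private working copy and never the
committed state.  The names retry, TransientError and
make_flaky_append are made up for this illustration.

```python
import copy


class TransientError(Exception):
    """Hypothetical stand-in for a recoverable failure, e.g. a failed write."""


def retry(state, operation, max_attempts=3):
    """Run `operation` on a fresh copy of `state` until it succeeds.

    A failed attempt only damages its private copy, so every new attempt
    starts again from the same consistent state.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        work = copy.deepcopy(state)
        try:
            operation(work)
        except TransientError as e:
            last_error = e        # discard the half-updated copy and retry
            continue
        return work
    raise last_error


def make_flaky_append(failures):
    """Return an operation that fails `failures` times, halfway through."""
    remaining = [failures]

    def op(s):
        s["blocks"].append(s["next_free"])   # first half of the update
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientError("simulated write failure")
        s["next_free"] += 1                  # second half of the update

    return op


if __name__ == "__main__":
    s0 = {"blocks": [], "next_free": 0}
    s1 = retry(s0, make_flaky_append(2))
    print(s0, s1)
```

Note that the simulated failure happens between the two halves of the
update, which is exactly the invalid intermediate state that is hard
to recover from when the update is done in place.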

In the current implementation that approach doesn't work completely:
physical errors are system state changes: the system is changed from a
consistent to an inconsistent state.  The difficulty is in recovering
from that.