Fri Nov 25 14:16:42 EST 2011


Main issues are:

- Transient vs. permanent errors.

Transient errors do not corrupt system state; they are a consequence
of communication errors.  The criterion is that they can usually be
handled by detecting them (using data redundancy) and then simply
retrying the operation.
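
A minimal sketch of the detect-and-retry idea, assuming the redundancy
is a CRC32 appended to each frame.  The names encode, decode and
receive_with_retry are hypothetical, as is the `receive` callable
standing in for whatever communication channel is in use.

```python
import zlib


class TransientError(Exception):
    """Raised when a received frame fails its integrity check."""


def encode(payload: bytes) -> bytes:
    # Append a CRC32 of the payload as 4 bytes of redundancy.
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def decode(frame: bytes) -> bytes:
    # Split off the trailing checksum and compare it to a fresh one.
    payload, crc = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != crc:
        raise TransientError("checksum mismatch")
    return payload


def receive_with_retry(receive, max_attempts=3):
    # receive: hypothetical zero-argument callable returning one frame.
    for _ in range(max_attempts):
        try:
            return decode(receive())
        except TransientError:
            continue  # nothing was damaged locally, so just try again
    raise TransientError("giving up after %d attempts" % max_attempts)
```

The retry is safe only because a failed attempt leaves no trace in
the system state; that is what makes the error transient.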

Permanent errors are those that leave invariants violated until they
are explicitly repaired.  The issue here is to identify which
invariants are broken, and how to restore them by throwing away part
of the information while causing minimal damage.
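
As an illustration only, assume a store made of a `records` table and
an `index` that maps keys to record ids, with the invariant that every
index entry refers to an existing record.  Both the data layout and
the function names are hypothetical.  The sketch first identifies the
violations, then discards only the offending entries.

```python
def find_violations(records, index):
    """Return the index keys whose record id is missing from records."""
    return [key for key, rid in index.items() if rid not in records]


def repair(records, index):
    """Restore the invariant by dropping only the dangling index entries.

    Returns the dropped keys so the caller can report what was lost.
    """
    dropped = find_violations(records, index)
    for key in dropped:
        del index[key]
    return dropped
```

Usage, under the same assumptions:

```python
records = {1: "a", 2: "b"}
index = {"x": 1, "y": 3, "z": 2}   # "y" points at a missing record
lost = repair(records, index)      # only "y" is thrown away
```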

- Local or global recovery.

It is hard to implement robust code that is also modular.  There
always seems to be a tension between abstraction (solving problems
locally) and the need for broad-ranging information (solving problems
globally).
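
One way to picture the tension, as a sketch with hypothetical names:
the low-level read_block only sees one block and cannot know that a
replica exists, so it reports the problem upward; the caller, which
has the broader view, does the actual recovery.

```python
class RecoveryNeeded(Exception):
    """Raised by a module that detects a problem it cannot fix itself."""

    def __init__(self, component, detail):
        super().__init__(f"{component}: {detail}")
        self.component = component
        self.detail = detail


def read_block(storage, block_id):
    # Local view: this function only knows about the one store it is given.
    if block_id not in storage:
        raise RecoveryNeeded("storage", block_id)
    return storage[block_id]


def read_with_recovery(storage, replica, block_id):
    # Global view: the caller knows a replica exists and restores from it.
    try:
        return read_block(storage, block_id)
    except RecoveryNeeded as err:
        # Raises KeyError if the replica lacks the block as well.
        storage[err.detail] = replica[err.detail]
        return storage[err.detail]
```

The cost is visible in the signature: read_with_recovery has to know
about both stores, which is exactly the information the abstraction
was meant to hide.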