Sat Jul 3 11:04:25 CEST 2010

Implementing Z

As mentioned before[1], the Z operator is the main source of
complexity, as it can (and should) be implemented in different ways:

  - load/store of local variables (registers) inside loop + load/store of state memory outside
  - load/store of state memory inside loop
  - load/store of circular delay lines for Z^n, n >> 1
  - load of indexed input + load/store of block-based state memory outside of loop

The task is now to find a way to represent these different
implementations, and parameterize the compilation step.

The choice seems to be about the "type" of memory used: do we
represent this explicitly or not?  Inside the loop we use fast memory,
while state data that bridges different loop iterations can be in
slower memory.  Should this kind of distinction be left over to the
low-level compiler, or do we model it explicitly?

Looks like explicit modeling has more benefits.  Let's stick to the
slogan: "don't trust automatic caches".  This way we can compile to
simpler hardware.

Main idea: compiling Z is about managing memory.

Each loop has two state buffers: fast and slow for sample and
block-based access.  Transfer between the two is explicit at the start
and end of the loop.  For the C implementation, fast memory are local
variables and slow memory are pointer indexed structs.  This solves
the problem for output state.

Input state has an alternative implementation through indexing.  In
practice this is going to occur a lot for block-based processing,
which is essentially a whole different problem and might need a
special approach to manage the required overlap.

Maybe I should take a hint here.  The structure of the implementation
is mostly determined by the shape of the inner loop; the rest doesn't
matter so much as the more flexible pointer-indexed approaches can be
used.  Essentially there are then only two kinds of loops:

   - Block based (FIR)
   - Sample recursive (IIR)

For IIR loops we're kind of stuck: the recursive structure requires a
certain approach (keeping state in fast memory + streaming input and

For FIR loops many loop transformations are possible.

Hierarchical combination of FIR and IIR such as the Kalman filter,
where the inner loop is FIR (matrix ops), and the outer loops are IIR,
have FIR-freedom in the "kernel" problem, and a fixed IIR structure on
the higher level.

So it seems this IIR/FIR problem is what I need to look at.  Once in
the FIR domain, the problem becomes more elegant and there is much
literature available.

My contribution should then be mostly in how to deal with
closely-looped IIR structures, and how to transform between
sample-based IIR and block-based IIR structures so it is simpler to
combine a-symmetric IIR structures with more symmetric FIR structures.

So, the simplest get-it-running compilation is then to use explicit
slow/fast state memory with pre/post transfer and no input indexing
(pure streaming).

[1] entry://20100511-195936