[<<][staapl][>>][..]
Sun Sep 27 10:42:38 CEST 2009

Stream Combinators : Z

The problem that needs to be solved for the image-processing loop
transforms is finding a way to move `Z', the single-element delay,
into and out of a loop.  The general idea:

  Z (delay) should be a high-level operation.  In all implementations
  I've found, Z is implemented explicitly with respect to the
  representation of streams / sequences, leading to an encoding of
  knowledge that is both complicated (too much detail) and inflexible
  (too specific).  DSP code should be expressed in terms of
  polynomials.  The mapping of this representation to code should be
  automated, possibly parameterized.

Essentially, translate a Z on the inputs into a Z of the fold state.
This is probably related to paramorphisms.

I'm looking for a _form_ where the operation of moving the effect of Z
inside a loop is clear.

Suppose s is an infinite list.  In scheme notation I'm looking for the
transformation between:

  (lambda (s)
    (let ((s1 s)
          (s2 (cdr s)))
      (map + s1 s2)))

and

  (lambda (s)
    (let filter ((s1 (car s))
                 (s  (cdr s)))
      (let ((s2 (car s)))
        (cons (+ s1 s2)
              (filter s2 (cdr s)))))))

The first is a map, while the second is a kind of fold (it uses
threaded state).
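The two forms can be sketched in Python using generators (my own
translation of the Scheme above; the names `zip_map' and `fold_form'
are introduced here for illustration):

```python
from itertools import tee


def zip_map(s):
    """Map form: combine the stream with its one-element delay (cdr)."""
    s1, s2 = tee(s)
    next(s2)                 # s2 is the stream advanced by one element
    return (a + b for a, b in zip(s1, s2))


def fold_form(s):
    """Fold form: the delayed sample is threaded as loop state."""
    it = iter(s)
    s1 = next(it)            # prime the accumulator with (car s)
    for s2 in it:
        yield s1 + s2        # loop kernel: previous + current
        s1 = s2              # memoized `cdr': state kept in a variable
```

Both compute the same stream; only the second makes the threaded state
(the memoized previous element) explicit.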

The right framework for seeing this is probably memoization / dynamic
programming.  The delay operation `cdr' here is memoized, which in
practice means that an expensive memory fetch is cached by keeping the
variable in a register.

A paramorphism does something similar: its binary operation takes a
pair: the input and output of the iterated function.  From[1]:

A paramorphism for the natural numbers:

   h 0       = b
   h (1 + n) = n @ h n

For lists:

   h Nil            = b
   h (Cons (a, as)) = a @ (as, h as)

Here `@' is the binary function to be iterated.  The next step seems
to be to look into Meertens' idea[2] and find a way to express it for
Z-based computations.
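A minimal sketch of the list paramorphism in Python (my own
translation, not taken from [1]; `para' and the `tails' example are
illustrative names):

```python
def para(op, b, xs):
    """List paramorphism: op receives the head a, the tail, and the
    result of the recursive call on the tail -- note the reverse data
    dependency on that recursive result."""
    if not xs:
        return b
    a, rest = xs[0], xs[1:]
    return op(a, rest, para(op, b, rest))


# Classic example: the list of all suffixes (`tails'), which needs
# both the tail itself and the recursive result.
def tails(xs):
    return para(lambda a, rest, r: [[a] + rest] + r, [[]], xs)
```

Because op sees both `rest' and `para(op, b, rest)', this is strictly
more general than a fold, at the cost of a truly recursive call tree.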


However, note that a paramorphism has a truly recursive call tree:
there is a reverse data dependency: the binary operation @ receives
the _result_ of the recursion.  What I'm looking for is tail recursion
with all memoized state / accumulators passed in as extra arguments.

Indeed the problem is different: the specification might deal with
more general recursion patterns on recursive data types, but in the
end one needs to construct loop kernels with local state connected to
a data access pattern.  Maybe the confusion comes from the difference
between an abstract definition using a low-arity function and the
``splatted'' nature of pipelined loop kernels for RISC/VLIW machines
with many registers.

Maybe the question can be reformulated as: find a transformation that
memoizes accesses.  In [3], elimination of redundant loads is
mentioned as one of the optimizations that promote memory accesses to
registers.  If I'm about to move the abstraction level up, this kind
of optimization becomes very important.  What are the other patterns?

   - elimination of redundant loads (i.e. offset indexing implemented
     using register shuffling)

   - loop unrolling enables array -> variable translations.

   - data prefetch and proper allocation in memory hierarchy: spatial
     vs. temporal locality, i.e. FIR input/output vs. FIR coefs.
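The first pattern can be illustrated on a 3-tap FIR (a hypothetical
Python sketch; on a real RISC/VLIW target the locals z1, z2 would be
registers):

```python
def fir3(x, c):
    """3-tap FIR.  The delay line lives in the local variables z1, z2
    instead of being re-loaded from the input array each iteration
    (redundant-load elimination).  The coefficients c have temporal
    locality (reused every step); the signal x has spatial locality
    (streamed through once)."""
    z1 = z2 = 0
    out = []
    for xn in x:
        out.append(c[0] * xn + c[1] * z1 + c[2] * z2)
        z2, z1 = z1, xn    # rotate the delay line: register shuffling
    return out
```

Feeding in a unit impulse recovers the coefficients, which is a quick
way to check the delay-line rotation is correct.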


[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.125
[2] http://www.springerlink.com/content/h1547h551422462u/
[3] isbn://1558607668


