[<<][staapl][>>][..]Sun Sep 27 10:42:38 CEST 2009

The problem that needs to be solved for the image processing loop transforms is a way to move `Z', the single-element delay, inside and outside a loop.

The general idea: Z (delay) should be a high-level operation. In all implementations I've found, Z is explicitly implemented wrt. the representation of streams / sequences, leading to a complicated (too much detail) and inflexible (too specific) encoding of knowledge. DSP code should be expressed in terms of polynomials. The mapping of this representation to code should be automated, possibly parameterized. Essentially: translate a Z on the inputs into a Z of the fold state. This is probably related to paramorphisms.

I'm looking for a _form_ in which the operation of moving the effect of Z inside a loop is clear. Suppose s is an infinite list. In Scheme notation I'm looking for the transformation between:

  (lambda (s)
    (let ((s1 s)
          (s2 (cdr s)))
      (map + s1 s2)))

and

  (lambda (s)
    (let filter ((s1 (car s))
                 (s  (cdr s)))
      (let ((s2 (car s)))
        (cons (+ s1 s2)
              (filter s2 (cdr s))))))

The first is a map, while the second is a kind of fold (it uses threaded state). The right framework to see this is probably memoization / dynamic programming. The delay operation `cdr' here is memoized, which in practice means that an expensive memory fetch is cached by keeping the value in a register.

A paramorphism does something similar: its binary operation takes a pair consisting of the input and the output of the iterated function. From [1], a paramorphism for the natural numbers:

  h 0       = b
  h (1 + n) = n @ (h n)

For lists:

  h Nil            = b
  h (Cons (a, as)) = a @ (as, h as)

Here `@' is the binary function to be iterated. The next step seems to be to look into Meertens' idea [2] and find a way to express it for Z-based computations. However, note that a paramorphism has a truly recursive call tree: there is a reverse data dependency, since the binary operation @ receives the _result_ of the recursion.
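To make the map/fold transformation concrete, here is a minimal Python transliteration on finite lists (all function names are mine, not from the text; the index slice xs[1:] plays the role of `cdr'). A small paramorphism is included to show where the (rest, result) pair comes in.

```python
def pairwise_sum_map(xs):
    """Map form: combine the signal with its one-sample delay.
    xs[1:] plays the role of (cdr s)."""
    return [a + b for a, b in zip(xs, xs[1:])]

def pairwise_sum_fold(xs):
    """Fold form: the delay Z has been moved into loop state.
    s1 caches the previous sample, so each input element is
    loaded only once -- the 'memoized cdr'."""
    out = []
    s1 = xs[0]
    for s2 in xs[1:]:
        out.append(s1 + s2)
        s1 = s2          # shift the delay register
    return out

def para(b, step, xs):
    """Paramorphism for lists, following the definition above:
    h Nil = b;  h (Cons (a, as)) = step(a, (as, h as)).
    Note the reverse data dependency: step receives the result
    of the recursion, so this is not tail recursive."""
    if not xs:
        return b
    a, rest = xs[0], xs[1:]
    return step(a, (rest, para(b, step, rest)))

print(pairwise_sum_map([1, 2, 3, 4]))   # [3, 5, 7]
print(pairwise_sum_fold([1, 2, 3, 4]))  # [3, 5, 7]

# A genuinely paramorphic example (it inspects the tail, not just
# the result): drop the last element of a list.
init = lambda a, p: [a] + p[1] if p[0] else []
print(para([], init, [1, 2, 3]))        # [1, 2]
```

The fold form is the one that maps onto a loop kernel: the delay has become an accumulator argument rather than an operation on the stream representation.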
What I'm looking for is tail recursion with all memoized state / accumulators passed in as extra arguments. Indeed the problem is different: the specification might deal with more general recursion patterns on recursive data types, but in the end one needs to construct loop kernels with local state connected to a data access pattern. Maybe the source of the confusion is the difference between an abstract definition using low-arity functions and the ``splatted'' nature of pipelined loop kernels for RISC/VLIW machines with lots of registers.

Maybe the question can be reformulated as: find a transformation that memoizes accesses. In [3], elimination of redundant loads is mentioned as one of the possible optimizations that bring accesses to registers. If I'm about to move the abstraction level up, this kind of optimization becomes very important. What are the other patterns?

  - elimination of redundant loads (i.e. offset indexing implemented
    using register shuffling)

  - loop unrolling, which enables array -> variable translation

  - data prefetch and proper allocation in the memory hierarchy:
    spatial vs. temporal locality, i.e. FIR input/output vs. FIR
    coefficients

[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.125
[2] http://www.springerlink.com/content/h1547h551422462u/
[3] isbn://1558607668
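The redundant-load pattern above can be sketched as follows, again as a Python transliteration (names are mine): a 3-tap FIR sum where the offset loads x[i-1], x[i-2] are replaced by rotating local variables, i.e. the delay line Z, Z^2 lives in loop state.

```python
def fir3_naive(x):
    """Naive form: three offset loads per output sample."""
    return [x[i] + x[i-1] + x[i-2] for i in range(2, len(x))]

def fir3_registers(x):
    """Register form: one load per iteration, plus register
    shuffling that implements the offset indexing."""
    out = []
    x2, x1 = x[0], x[1]       # delay registers holding x[i-2], x[i-1]
    for x0 in x[2:]:          # single load per iteration
        out.append(x0 + x1 + x2)
        x2, x1 = x1, x0       # shuffle registers instead of reloading
    return out

print(fir3_naive([1, 2, 3, 4, 5]))      # [6, 9, 12]
print(fir3_registers([1, 2, 3, 4, 5]))  # [6, 9, 12]
```

In a compiler for a register-rich target the shuffle itself can often be eliminated by unrolling the loop 3x and rotating the roles of the variables instead of their contents.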
