Fast Image Processing Prototyping

This project is about MetaOcaml / PLT typed Scheme experiments with C code
generation for image processing applications.  This is a hands-on exercise to
bring together the following elements:

- deforestation of image processing operations
- data structure traversal as delimited continuation

Practical goals:

- integration in GIMP (still images) and Packet Forth (video)

Image processing is used instead of audio processing because of its
multi-dimensional nature, which makes sequence iteration non-trivial.

This project is a side track of Staapl.  While Staapl aims mostly at
metaprogramming event/control applications, this project aims at dataflow/DSP
applications.  These two domains are too different to tackle in the same
framework.  Rest assured, IP doesn't use the Forth language ;)

Entry: The practical goal
Date: Mon Mar 30 15:42:54 CEST 2009

A project I'm involved in uses a core DSP routine based on edge detection and
the Hough transform.  I'd like to play with these algorithms to get a hands-on
feel for how to tackle memoization and deforestation of composed operations.

Entry: Deforestation
Date: Thu Apr 2 14:28:02 CEST 2009

Trying to find some intuitive ground for the business of eliminating
intermediate tree representations.  In the original Wadler paper [1] the
essential steps are, respectively, the elimination of a matching clause
together with a data constructor, and of a function definition together with
its application.

Now, the question is: does it make sense to define a language that has no
algebraic data types and no first class functions, but allows meta forms that
support these abstractions, as long as they can be eliminated at compile time?
From experience, there are a lot of programs in the (deeply) embedded control
field that don't need such data types for a concise description of a solution.
Otoh, using data structures to describe program components is convenient.
What if programming could then be the composition of higher level functions
and the specialization of initial data types (i.e. representing I/O data
streams as their iterator coroutines)?

[1] http://homepages.inf.ed.ac.uk/wadler/papers/deforest/deforest.ps

Entry: stuck
Date: Fri Apr 3 15:04:15 CEST 2009

I'm stuck.  Time to aim for questions or simple tasks.

Task: create a simple first order language with 'let' and basic math
operations, plus an interpreter.  Now, what I'm really looking for is a data
type to represent code.  Why not stick to s-expressions?  Is it easy to
transform an interpreter back to explicit structs?

The whole idea is: eventually I need to transform the interpreter, so the
interpreter itself needs to be written in a form that can be easily
manipulated.  Now: should I be using Redex?

Entry: edge detector
Date: Fri Apr 3 15:58:04 CEST 2009

So, let's try to stay closer to the concrete goal: figure out a way to perform
deforestation on image processing operations.  The operation I'm trying is
Sobel edge detection:

  I -> (I - X(I))^2 + (I - Y(I))^2

Here the binary operations + and - and the unary operations X and Y (shift)
all operate on whole images.  The goal is to:

- turn this into one simple loop over the image
- solve corner cases
- allow tiling transformations

What complicates this is that there isn't a single kind of data traversal:
there is a whole family, related by all kinds of loop transformations.
Somehow there is a gem hidden here: how to transform/view data structure
iterators as stream processors with the proper amount of memoized data
exposed.  It looks like this is really more about memoization than
deforestation.
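To make the last two entries a bit more concrete, here is a minimal OCaml
sketch of both forms, under the assumption that a grayscale image is a flat
int array in row-major order with known width w, and that the border is
handled by clamping (the corner cases are left open above, so clamping is just
one arbitrary choice).  All names (lift2, shift_x, sobel_composed,
sobel_fused) are made up for the example.

  (* Composed form: every operator allocates a full intermediate image,
     the "forest" that deforestation is supposed to remove. *)
  let lift2 f a b = Array.init (Array.length a) (fun i -> f a.(i) b.(i))
  let shift_x w a =
    Array.init (Array.length a)
      (fun i -> if i mod w = 0 then a.(i) else a.(i - 1))
  let shift_y w a =
    Array.init (Array.length a)
      (fun i -> if i < w then a.(i) else a.(i - w))

  let sobel_composed w img =
    let dx = lift2 (-) img (shift_x w img) in        (* I - X(I) *)
    let dy = lift2 (-) img (shift_y w img) in        (* I - Y(I) *)
    lift2 (+) (lift2 ( * ) dx dx) (lift2 ( * ) dy dy)

  (* Fused form: a single loop over the image, no intermediate arrays. *)
  let sobel_fused w img =
    Array.init (Array.length img) (fun i ->
      let left = if i mod w = 0 then img.(i) else img.(i - 1) in
      let up   = if i < w       then img.(i) else img.(i - w) in
      let dx = img.(i) - left and dy = img.(i) - up in
      dx * dx + dy * dy)

The open problem from the entries above is how to get from sobel_composed to
sobel_fused mechanically, for a whole family of such compositions, rather than
by hand.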
Entry: every data structure is a function
Date: Fri Apr 3 17:31:01 CEST 2009

What about this: make sure that eventually, all constructors get replaced by
function abstractions.  This thing is really about "connecting time": replace
instruction scheduling with data structure connectivity, turning something
static (a data structure) into something dynamic (a sequencing of machine
instructions).

Maybe the important observation to make is to leave declarative programming
(expressions) and move towards dataflow programming.  Why?  Because this is a
better model for a lot of operations (multiple outputs).  But doesn't that
discard the "sequential time" property (time as a proper tree -> time as an
acyclic graph)?

Again I find it hard to clearly express the ideas.  I need more basic
knowledge.  So..

* Given a composition of operations and combinators (taking scalar operations
  to collection operations), construct a data flow graph of the entire
  computation.  (Completely declarative.)
* Then "compress" this graph into streamed iterated machines with limited
  internal storage for both code and data.  The assumption is that there is
  always only one time dimension along which the algorithm needs to be
  serialized.

The questions: how to encode this, i.e. how to parameterize the possible
permutations, and how to manage the search space for optimal algorithms?  It
looks like the object of study is parameterized data flow graphs.

Entry: to simplify: reduce dimensionality
Date: Fri Apr 3 18:13:44 CEST 2009

What about simplifying by reducing the dimensionality of the problem?
Currently I'm thinking about images over time (video), which has 2 spatial
dimensions (3 with color) and one time dimension.  Let's stick to one spatial
dimension and one time dimension, i.e. the evolution of sound spectra over
time.  This makes drawing graph operations more manageable.  Once understood,
move to more spatial dimensions.

The basic operations then are:

* 1->1, 2->1 and 2->2 scalar functions + association/commutation laws
* shift (Z+, Z-)

Simplified to:

* 2->2 + association + commutation laws
* stubs

which can encode 1->1, 2->1 and shift by including appropriate stubs (zero
inputs / outputs = constrained transforms).  This leaves only a single
operation: the 2->2 transform mapped over a vector, a localized overlapping
mixing operation (a matrix with band = 3).

On this set of objects:

* parameterize equivalent structures based on primitive equivalence-preserving
  operations;
* once parameterized, attribute an external cost function that allows
  selection of a representative.

Entry: image processing in general
Date: Fri Apr 3 18:41:55 CEST 2009

It is important to distinguish two classes of operations:

* local operations (e.g. derivatives)
* global operations (e.g. integrals)

They both need completely different approaches for mapping to concrete
machines.  In IP it's probably best to stick to the local operations.

Entry: practical: Ocaml
Date: Wed Apr 8 19:06:05 CEST 2009

Getting to know Ocaml.  Next: figure out how to use the module system: write a
module, use it, instantiate it, parameterize it.
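As a first exercise along those lines, here is a minimal functor sketch, tied
to the edge-detector kernel from earlier: a signature for pixels, a module
parameterized over it, and an instantiation.  The names (PIXEL, Kernel,
IntPixel) are made up for the example.

  (* A signature describing what the kernel needs from a pixel type. *)
  module type PIXEL = sig
    type t
    val add : t -> t -> t
    val sub : t -> t -> t
    val mul : t -> t -> t
  end

  (* A functor: a module parameterized by any implementation of PIXEL. *)
  module Kernel (P : PIXEL) = struct
    (* the scalar core of the edge detector: (a - b)^2 + (a - c)^2 *)
    let edge a b c =
      let dx = P.sub a b and dy = P.sub a c in
      P.add (P.mul dx dx) (P.mul dy dy)
  end

  (* An instantiation with plain integers as pixels. *)
  module IntPixel = struct
    type t = int
    let add = (+)
    let sub = (-)
    let mul = ( * )
  end

  module IntKernel = Kernel (IntPixel)

  (* IntKernel.edge 10 7 6  evaluates to  (10-7)^2 + (10-6)^2 = 25 *)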
Entry: A different machine
Date: Tue Jun 2 14:31:51 CEST 2009

Switching to the other side again..  The thing is, I know and love Scheme.
While MetaOcaml has interesting static properties, I'm not sure if that will
weigh against the familiarity I now have with Scheme and Scheme macros.  It
might be more interesting to work from dynamic -> static (gradual typing) than
to try to fit problems into a static framework in an alien language..

More concretely, I'm thinking about LLVM.  Maybe this is more something for
Staapl..  To get to know LLVM it might be interesting to build an LLVM target.

Entry: Loop transformation: compression of grid descriptions
Date: Fri Jul 17 10:06:57 CEST 2009

Is it possible to compile a bunch of operations to some description of a grid
processor and then "compress" this into a different representation?

Entry: cache line management
Date: Sun Jul 19 08:33:40 CEST 2009

What about making sure that the program's memory blocks are actually cache
lines, so that fast memory becomes, in effect, directly managed?

Associativity [1]:

* fully associative: a cache line can go anywhere
* direct mapped
* N-way set associative

Looking at the K8 core (Athlon 64) on that page:

- the TLB is split in two levels
- L1 instruction / data caches are 64KB each, 2-way

I've tried to dabble with this before.  My conclusion is the same however:
this is too complicated to constructively try to appease.  It seems that
simply being careful not to do anything stupid is more important than trying
to predict where something is going.

Cache lines are 64 bytes long:

- L1: 2 x 64KB, 2-way
- L2: 1024KB, 16-way

This suggests an approach like this:

* use chunks the size of cache lines
* keep the bulk working set in the L2 (chunk = raw data)
* keep index sets and immediate local state in L1
* split data/control in two so the 2-way associativity doesn't get in the way

[1] http://en.wikipedia.org/wiki/Cpu_cache#Associativity
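As a very rough sketch of the "chunk = cache line" point, this is what the
traversal shape could look like in OCaml over a Bigarray of 8-bit pixels.
Note that OCaml gives no control over where the buffer sits relative to
cache-line boundaries, so the 64-byte figure is copied from the notes above
rather than enforced; iter_chunks and line_bytes are names made up for the
example.

  open Bigarray

  let line_bytes = 64      (* cache line size from the notes above *)

  (* Walk a 1-byte-per-pixel image in cache-line-sized slices and hand each
     slice to f.  Array1.sub does not copy; it just re-windows the buffer. *)
  let iter_chunks f (img : (int, int8_unsigned_elt, c_layout) Array1.t) =
    let n = Array1.dim img in
    let pos = ref 0 in
    while !pos < n do
      let len = min line_bytes (n - !pos) in
      f (Array1.sub img !pos len);
      pos := !pos + len
    done

  (* Example use: count bright pixels one cache-line-sized chunk at a time.
     let bright = ref 0 in
     iter_chunks
       (fun chunk ->
         for i = 0 to Array1.dim chunk - 1 do
           if Array1.get chunk i > 128 then incr bright
         done)
       img *)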