Fast Image Processing Prototyping

This project is about MetaOcaml / PLT typed Scheme experiments with C code
generation for image processing applications.  This is a hands-on exercise to
bring together the following elements:

- deforestation of image processing operations
- data structure traversal as delimited continuation

Practical goals:

- integration in GIMP (still images) and Packet Forth (video)

Image processing is used instead of audio processing because of its
multi-dimensional nature, which makes sequence iteration non-trivial.

This project is a side track of Staapl.  While Staapl aims mostly at
metaprogramming event/control applications, this project aims at dataflow/DSP
applications.  These two domains are too different to tackle in the same
framework.  Rest assured, IP doesn't use the Forth language ;)

Entry: The practical goal
Date: Mon Mar 30 15:42:54 CEST 2009

A project I'm involved in uses a core DSP routine based on edge detection and
the Hough transform.  I'd like to play with these algorithms to get a hands-on
feel for how to tackle memoization and deforestation of composed operations.

Entry: Deforestation
Date: Thu Apr 2 14:28:02 CEST 2009

Trying to find some intuitive ground for the business of eliminating
intermediate tree representations.  In the original Wadler paper [1] the
essential steps are, respectively, the elimination of a matching clause
together with a data constructor, and of a function definition together with
its application.

Now, the question is: does it make sense to define a language that has no
algebraic data types and no first class functions, but allows meta forms that
support these abstractions, as long as they can be eliminated at compile time?
From experience, there are a lot of programs in the (deeply) embedded control
field that don't need such data types for a concise description of a solution.
Otoh, using data structures to describe program components is convenient.
What if programming could then be the composition of higher level functions
and the specialization of initial data types (i.e. representing I/O data
streams as their iterator coroutines)?

[1] http://homepages.inf.ed.ac.uk/wadler/papers/deforest/deforest.ps

Entry: stuck
Date: Fri Apr 3 15:04:15 CEST 2009

I'm stuck.  Time to aim for questions or simple tasks.

Task: create a simple first order language with 'let' and basic math
operations, plus an interpreter.  Now, what I'm really looking for is a data
type to represent code.  Why not stick to s-expressions?  Is it easy to
transform an interpreter back to explicit structs?

The whole idea is: eventually I need to transform the interpreter, so the
interpreter itself needs to be written in a form that can be easily
manipulated.  Now: should I be using Redex?

Entry: edge detector
Date: Fri Apr 3 15:58:04 CEST 2009

So, let's try to stay closer to the concrete goal: figure out a way to perform
deforestation on image processing operations.  The operation I'm trying is
Sobel edge detection:

  I -> (I - X(I))^2 + (I - Y(I))^2

Here the binary operations + and - and the unary operations X and Y (shift)
all operate on whole images.  The goal is to:

- turn this into one simple loop over the image
- solve corner cases
- allow tiling transformations

What complicates this is that there isn't a single kind of data traversal:
there is a whole family, related by all kinds of loop transformations.
Somehow there is a gem hidden here: how to transform/view data structure
iterators as stream processors with the proper amount of memoized data
exposed.  It looks like this is really more about memoization than
deforestation.
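To make the last two entries a bit more concrete, here is a minimal OCaml
sketch of both forms, under the assumption that a grayscale image is a flat
int array in row-major order with known width w, and that the border is
handled by clamping (the corner cases are left open above, so clamping is just
one arbitrary choice).  All names (lift2, shift_x, sobel_composed,
sobel_fused) are made up for the example.

  (* Composed form: every operator allocates a full intermediate image,
     the "forest" that deforestation is supposed to remove. *)
  let lift2 f a b = Array.init (Array.length a) (fun i -> f a.(i) b.(i))
  let shift_x w a =
    Array.init (Array.length a)
      (fun i -> if i mod w = 0 then a.(i) else a.(i - 1))
  let shift_y w a =
    Array.init (Array.length a)
      (fun i -> if i < w then a.(i) else a.(i - w))

  let sobel_composed w img =
    let dx = lift2 (-) img (shift_x w img) in        (* I - X(I) *)
    let dy = lift2 (-) img (shift_y w img) in        (* I - Y(I) *)
    lift2 (+) (lift2 ( * ) dx dx) (lift2 ( * ) dy dy)

  (* Fused form: a single loop over the image, no intermediate arrays. *)
  let sobel_fused w img =
    Array.init (Array.length img) (fun i ->
      let left = if i mod w = 0 then img.(i) else img.(i - 1) in
      let up   = if i < w       then img.(i) else img.(i - w) in
      let dx = img.(i) - left and dy = img.(i) - up in
      dx * dx + dy * dy)

The open problem from the entries above is how to get from sobel_composed to
sobel_fused mechanically, for a whole family of such compositions, rather than
by hand.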
Entry: every data structure is a function
Date: Fri Apr 3 17:31:01 CEST 2009

What about this: make sure that eventually, all constructors get replaced by
function abstractions.  This thing is really about "connecting time": replace
instruction scheduling with data structure connectivity, turning something
static (a data structure) into something dynamic (a sequencing of machine
instructions).

Maybe the important observation to make is to leave declarative programming
(expressions) and move towards dataflow programming.  Why?  Because this is a
better model for a lot of operations (multiple outputs).  But doesn't that
discard the "sequential time" property (time as a proper tree -> time as an
acyclic graph)?

Again I find it hard to clearly express the ideas.  I need more basic
knowledge.  So..

* Given a composition of operations and combinators (taking scalar operations
  to collection operations), construct a data flow graph of the entire
  computation.  (Completely declarative.)
* Then "compress" this graph into streamed iterated machines with limited
  internal storage for both code and data.  The assumption is that there is
  always only one time dimension along which the algorithm needs to be
  serialized.

The questions: how to encode this, i.e. how to parameterize the possible
permutations, and how to manage the search space for optimal algorithms?  It
looks like the object of study is parameterized data flow graphs.

Entry: to simplify: reduce dimensionality
Date: Fri Apr 3 18:13:44 CEST 2009

What about simplifying by reducing the dimensionality of the problem?
Currently I'm thinking about images over time (video), which has 2 spatial
dimensions (3 with color) and one time dimension.  Let's stick to one spatial
dimension and one time dimension, i.e. the evolution of sound spectra over
time.  This makes drawing graph operations more manageable.  Once understood,
move to more spatial dimensions.

The basic operations then are:

* 1->1, 2->1 and 2->2 scalar functions + association/commutation laws
* shift (Z+, Z-)

Simplified to:

* 2->2 + association + commutation laws
* stubs

which can encode 1->1, 2->1 and shift by including appropriate stubs (zero
inputs / outputs = constrained transforms).  This leaves only a single
operation: the 2->2 transform mapped over a vector, a localized overlapping
mixing operation (a matrix with band = 3).

On this set of objects:

* parameterize equivalent structures based on primitive equivalence-preserving
  operations;
* once parameterized, attribute an external cost function that allows
  selection of a representative.

Entry: image processing in general
Date: Fri Apr 3 18:41:55 CEST 2009

It is important to distinguish two classes of operations:

* local operations (e.g. derivatives)
* global operations (e.g. integrals)

They both need completely different approaches for mapping to concrete
machines.  In IP it's probably best to stick to the local operations.

Entry: practical: Ocaml
Date: Wed Apr 8 19:06:05 CEST 2009

Getting to know Ocaml.  Next: figure out how to use the module system: write a
module, use it, instantiate it, parameterize it.
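As a first exercise along those lines, here is a minimal functor sketch, tied
to the edge-detector kernel from earlier: a signature for pixels, a module
parameterized over it, and an instantiation.  The names (PIXEL, Kernel,
IntPixel) are made up for the example.

  (* A signature describing what the kernel needs from a pixel type. *)
  module type PIXEL = sig
    type t
    val add : t -> t -> t
    val sub : t -> t -> t
    val mul : t -> t -> t
  end

  (* A functor: a module parameterized by any implementation of PIXEL. *)
  module Kernel (P : PIXEL) = struct
    (* the scalar core of the edge detector: (a - b)^2 + (a - c)^2 *)
    let edge a b c =
      let dx = P.sub a b and dy = P.sub a c in
      P.add (P.mul dx dx) (P.mul dy dy)
  end

  (* An instantiation with plain integers as pixels. *)
  module IntPixel = struct
    type t = int
    let add = (+)
    let sub = (-)
    let mul = ( * )
  end

  module IntKernel = Kernel (IntPixel)

  (* IntKernel.edge 10 7 6  evaluates to  (10-7)^2 + (10-6)^2 = 25 *)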
Entry: A different machine
Date: Tue Jun 2 14:31:51 CEST 2009

Switching to the other side again..  The thing is, I know and love Scheme.
While MetaOcaml has interesting static properties, I'm not sure if that will
weigh against the familiarity I now have with Scheme and Scheme macros.  It
might be more interesting to work from dynamic -> static (gradual typing) than
to try to fit problems into a static framework in an alien language..

More concretely, I'm thinking about LLVM.  Maybe this is more something for
Staapl..  To get to know LLVM it might be interesting to build an LLVM target.

Entry: Loop transformation: compression of grid descriptions
Date: Fri Jul 17 10:06:57 CEST 2009

Is it possible to compile a bunch of operations to some description of a grid
processor and then "compress" this into a different representation?

Entry: cache line management
Date: Sun Jul 19 08:33:40 CEST 2009

What about making sure that the program's memory blocks are actually cache
lines, so that fast memory becomes, in effect, directly managed?

Associativity [1]:

* fully associative: a cache line can go anywhere
* direct mapped
* N-way set associative

Looking at the K8 core (Athlon 64) on that page:

- the TLB is split in two levels
- L1 instruction / data caches are 64KB each, 2-way

I've tried to dabble with this before.  My conclusion is the same however:
this is too complicated to constructively try to appease.  It seems that
simply being careful not to do anything stupid is more important than trying
to predict where something is going.

Cache lines are 64 bytes long:

- L1: 2 x 64KB, 2-way
- L2: 1024KB, 16-way

This suggests an approach like this:

* use chunks the size of cache lines
* keep the bulk working set in the L2 (chunk = raw data)
* keep index sets and immediate local state in L1
* split data/control in two so the 2-way associativity doesn't get in the way

[1] http://en.wikipedia.org/wiki/Cpu_cache#Associativity
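As a very rough sketch of the "chunk = cache line" point, this is what the
traversal shape could look like in OCaml over a Bigarray of 8-bit pixels.
Note that OCaml gives no control over where the buffer sits relative to
cache-line boundaries, so the 64-byte figure is copied from the notes above
rather than enforced; iter_chunks and line_bytes are names made up for the
example.

  open Bigarray

  let line_bytes = 64      (* cache line size from the notes above *)

  (* Walk a 1-byte-per-pixel image in cache-line-sized slices and hand each
     slice to f.  Array1.sub does not copy; it just re-windows the buffer. *)
  let iter_chunks f (img : (int, int8_unsigned_elt, c_layout) Array1.t) =
    let n = Array1.dim img in
    let pos = ref 0 in
    while !pos < n do
      let len = min line_bytes (n - !pos) in
      f (Array1.sub img !pos len);
      pos := !pos + len
    done

  (* Example use: count bright pixels one cache-line-sized chunk at a time.
     let bright = ref 0 in
     iter_chunks
       (fun chunk ->
         for i = 0 to Array1.dim chunk - 1 do
           if Array1.get chunk i > 128 then incr bright
         done)
       img *)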