State Machines

After an excursion into writing manual state machines I'm again
veering towards just writing parallel code with blocking constructs.
The new realization is that there doesn't have to be an RTOS.  It's
likely possible to generate C state machines from a HLL.


Entry: Inverting state machine control (futures)
Date: Sat Jan  4 19:47:02 CET 2020

https://www.youtube.com/watch?v=9HspeHGBg-Q&t=933s

Futures are sugar.
Did this whole thing start in C# ?

If the async library in closure is just sugar, then I can do the same
for small uC projects.

Rich mentions they do not provide unlimited buffering so it allows for
actual backpressure.  Also maps better to embedded.


Entry: New sm.txt notes
Date: Sat Jan  4 20:31:33 CET 2020

This probably needs a separate blog thread.  I'm shuffling around many
ideas, but I really want to move this into an actual system.  I'm
tired of writing C for microcontrollers, and Rust is not the solution
in my case: I want something more model-based, macro-like.

I still want to "keep Haskell out of embedded".  Only as compiler,
where it should be the main tool.

- sm.h

  - limited nesting by making stack frames explicit

- A way to generate a static C program from a higher level functional
  description.  Something that can do:

  - Automatic async-style conversion

  - Smart re-use of closures (i.e. flat state structure where state
    updates are mutations, not copying in a new activation record).

- I want lisp-like syntax and constructs that can be easily
  implemented in C, but explicitly no arbitrary closures or
  continuations.

- Some way to formulate this so it makes sense on FPGA as well.  Where

  - messages are either actual bulk data lines, registers, or
    references to data stored in memory

  - tasks are state machines, or stored programs on a CPU

- Erlang as test target.  Either Core erlang, or just an interpreter
  for something simpler, written in Erlang proper.


Entry: CSP in C
Date: Sat Jan  4 21:49:25 CET 2020

Rendezvous is too hard to implement, so use buffers.

See uc_tools/cbuf.h, optionally with both ends of the channel exposed
as separate types to make it clear if a channel is input or output for
the current "task".

A "task" is a C struct containing state variables and channel pointers.

Scheduling can usually be done in an ad-hoc fashion, using a
round-robin scheduler where each task defines its own poll routine.
sm.h style state machines could be used.

EDIT: Don't call this CSP if it's not rendez-vous!


Entry: Byrd miniKanren
Date: Sun Jan  5 22:59:42 CET 2020

https://www.youtube.com/watch?v=RVDCRlW1f1Y


Entry: Capture subset of C in s-expressions
Date: Tue Jan  7 11:52:37 CET 2020

;; Varible declarations.  All variables are typed and explicitly
;; initialized.  Const is default.  Non const is mut.  It's an
;; expression language with statements inside let blocks.  A let block
;; returns the value of the last expression.

(let ((:: (uint32_t)     a) 123)
     ((:: (mut uint32_t) b) 456)
  (set! b (+ a b))
  b)

;; There is no goto.  We'll implement tail calls instead.  That leaves
;; just loop and conditionals.

(while <cond> <expr> ...)

(if <cond> <e_true> <e_false>)

;; The language allows the usual abstraction/application, with some
;; restrictions on closures that cannot be implemented statically.
;; Function calls that do not block are treated as primitives.  Any
;; other applications will cause creation of continuations.

;; Continuations are represented using nested C structs/unions
;; following the sm.h model.  It seems important to keep continuations
;; explicit to allow for exact memory allocation, e.g. not using
;; stacks.  This restricts control flow but in practice is likely not
;; an issue.

;; It appears that most of this language is trivial apart from the
;; representation of continuations, so start there.  Ignoring type
;; annotation, the simplest blocking program would be a copy process
;; like:

(while $t
   (let ((line (read-line)))
     (print line)))

;; where read-line is the blocking function, and optionally print is
;; treated as rendez-vous, but for now we can just assume buffered
;; channels.


Entry: Continuations
Date: Tue Jan  7 18:51:19 CET 2020

Some reflections after working with sm.h.

I do not want nested polling as is used in sm.h's CALL mechanism.  A
state machine should be compiled to:

  - one function per continuation

  - single flat function with switch() or computed goto

The latter two are essentially the same.  The former is also
essentially the same if there is no need for shared context that could
be implemented in the body of the function.

Let's assume individual functions.

It's probably beneficial to use ANF, i.e. to have explicit stack
frames.  Stack frames map directly to nested unions.

E.g.

The "root" union is the collection of resume points in the main
function.  Suppose 3 resume points A,B,C this then maps to the
following nesting.  Suppose each of these has 2,1,3 resume points
respectively:

union k {
   union {
      struct { ... } frameAA;
      struct { ... } frameAB;
   } frameA;
   union {
      struct { ... } frameBA;
   } frameB;
   union {
      struct { ... } frameCA;
      struct { ... } frameCB;
      struct { ... } frameCC;
   } frameC;
};

Something like that.  Should tail recursion be supported?  Probably
not.  Loops are likely simpler to work with.  This will be a language
that supports macros, so much can be built on top.

To make this more understandable, create some example programs and
print stack frames at suspension points.  Basically abstract
interpretation will need to be used to enumerate all the suspension
points.  Only programs that have a finite number of suspension points
will be representable.


Entry: Abstract interpretation
Date: Tue Jan  7 19:10:45 CET 2020

To perform control flow analysis, abstract interpretation will be
essential.


Entry: Manually
Date: Thu Jan  9 09:36:36 CET 2020

Go through the steps manually.  There are two problems: collecting the
current stack frame, and resuming it.

Resuming should probably be implemented by letting the code do
explicit pointer chasing.  E.g.  e->var1, e->var2, ....

This should probably be extended to extending environment for closures.
E.g 

e->var     is current stack frame, while
e->p->var  is a variable in the parent frame

So from the point of application it is straightforward: just use
nested C structures/unions.

The trick is then to convert suspension points into nested unions.

The difference between doing this statically and dynamically is that
all continuations need to be known at compile time.  This is a serious
control flow analysis problem, and for sure doesn't work for all
programs.  What are the restrictions?

( Note: only blocking functions need to be considered here.
Non-blocking functions are treated as primitives. )

The algorithm is straightforward

- enumerate all call sites
- for each call site:
  - collect stack frame
  - recurse

This way it is ensured that recursion cannot represent loops.

Note that nested unions/structs aren't necessary because memory is
stored linearly.  The only time two pointers are needed is for
(downward) closures, but we are probably not going to use those much
because iterators probably eliminate them completely.


Entry: Downward closures or iterators?
Date: Thu Jan  9 09:56:14 CET 2020

E.g. Backus' functional forms.

Does it matter?  Maybe they are really the same in most use cases?
The core question is: does a closure need to be treated as a value, or
is it always immediately "inlined".

I don't think closures will be that hard to implement, as they just
have a pointer to an extra stack frame.  It's also only just one stack
frame.  Note that it is the _thread_ that has multiple frames (the
continuation is compisitional).  Closures close only over their local
frame.


Entry: Representing the stack
Date: Thu Jan  9 10:04:03 CET 2020

There's no reason not to grow downward when all frames are known.


Entry: Dedicated HW
Date: Fri Jan 10 12:31:54 CET 2020

So full circle: create a Forth CSP machine?

Core idea here is to create a simple task switcher.

I like the idea of using Forth as a low-resource implementation
language, but use a specification layer on top of it to express
semantics.


Entry: Cheap rendez-vous
Date: Fri Jan 10 13:00:40 CET 2020

How to do that in scheme?  Basically, eliminate context switches.
Implement context switches as jumps or calls.  If the overhead is
gone, a lot of buffering isn't needed, and the implementation probably
comes closer to what can be done in hardware using dedicated machines.

Try this!  It's an important bit.


Entry: Summary
Date: Fri Jan 10 13:01:57 CET 2020

- represent continuations explicitly as C structures, might lead to
  readable generated code even.

- cheap rendez-vous and task switching


Entry: Cheap task switching
Date: Fri Jan 10 13:27:53 CET 2020

I wonder how many programs revert to "loop over buffer element"
patterns because task switching is perceived to be inefficient?  If
task switching is made cheap, i.e. just switching a context pointer,
how would that affect programming style?


Entry: Continuation
Date: Fri Jan 10 13:59:33 CET 2020

So it's clear: the core abstraction is the continuation.


Entry: Manual example
Date: Fri Jan 10 14:16:52 CET 2020

Create a manually compiled example, then refine it to something that
can be automatically compiled.

The DMX example is actually quite nice, as it has 4 event kinds: USB
input, DMX output, DMX input, USB output.  This is going to come back
in a lot of data converter tasks.


Example of code generator output.  Work bottom up: start by manually
creating some C code, then model the compiler to produce that kind
of output.

The task is a CSP copy task:

  (let ((a0 123))  ;; some constant present in all frames
    (loop
      (let* ((a (recv c0)))  ;; a is not present in recv k
             (a1 (+ a 1)))
        (send c1 a1)
        (send c2 a1)))

- This has 3 continuations for the three channel rendez-vous points.
  One of the sends can be left out to simplify.

- This already exposes ownership and lifetime of data, which is a
  general problem.  We solve it by storing the data to be transferred
  in the sender's stack frame, and letting the receiver copy the data.

- There is also a tail call here at the end of the loop, which
  destroys the stack frame.  Maybe it is better to model that
  explicitly?  No let's keep the loop as a primitive.  It will make
  the reuse of stack frames simpler, allowing first class function to
  be avoided.

- To represent in C: I'm thinking it might be simplest to generate one
  struct per continuation type.  Otherwise the dereference notation is
  going to be too hard to read in the generated C code.

- How to re-write things to not pollute continuations?  E.g. if
  variables are short-lived they can just go on the C stack.  Does it
  actually matter that we keep junk in the suspended contination?
  E.g. for the send continuations, the variable a is sill visible but
  no longer needed so could actually just be removed. Maybe it is best
  to not make the distinction at all in the beginning.  Just allocate
  everything in the struct, then later when it is all working, add an
  optimization that allocates intermediate variables on the stack if
  they are not used after a point.

- Simplification: program will need to be in ANF such that the value
  passed into a continuation has a well known place at the top of the
  stack.

- C representation: incoming and outgoing continuation do not have any
  required relation, but in practice they will share the bottom of the
  stack.  I would like to avoid copying.  However, there might be many
  intermediate forms.  This is going to take bookkeeping, so maybe
  local variables that do not make it into the closure should just not
  be used.

- The type of the output stack depends on the code path!  So how do we
  know what the continuation output type is?  From the function that
  interprets it.  So continuation struct types are always associated
  to the associated function, and only that function knows how to
  interpret the data.

- Growing upward is going to be simplest.  Fits best in C struct view.

- Let's assume that the continuation can be modified in-place, and it
  is the responsibility of the function to do this.

- About output: it's really not clear how to anticipate the shape of
  the output continuation.  The more I think about this, the more I
  believe that explicit stacks might no be the way to go.  Otoh static
  guarantees are nice.

- Loops and conditionals need to be handled somehow.  These could be
  expressed as continuations, but there's a distinction between
  non-blocking and blocking continuations. I ran into this before
  doing the DMX converter box: the main loop structure doesn't
  correspond to blocking points.  This is exactly why you would want
  to automatically transform to CPS, partially.
 
- Harping on that: maybe compiling to flat form is better.  That way
  'goto' can be used and we can have tail calls.  And suspension is
  just taking the label of that goto.  Let's rework the example.

- If the program is in ANF form between control points, then it would
  probably be simpler to manage stack frames because there is only on
  C struct per "chain", that could be partially filled.

- Recursive calls can also have copy optimization to re-use the
  context.


Was a good exercise.  The conclusion is that it isn't nearly as simple
as anticipated, but some constraints got clarified in the process.
Important conclusions:

- use ANF, with the exception that non-block C functions can be
  collapsed into expressions.

- suspensions and control structs (loop + conditional) are very
  different from scheduling point, but should be put on the same
  footing in the language.  a single C struct per machine makes this
  possible.  that also makes the generated code a bit more modular at
  the C level

- the scheduler doesn't need to know the type of the contination.
  however a continuation's code pointer does identify the type of the
  data associated.  obvious in retrospect, but wasn't starting out.

- grow upward. easier for C structs.


Entry: Summary exok.c and discovery of the core issue
Date: Sat Jan 11 15:30:50 CET 2020

This is actually a simple and straightforward way to write CSP
directly in C.  It seems that it is only necessary to create the
proper overlay of the continuation.

struct env {
    uint32_t a;
    uint32_t b;
    uint32_t c;
};


#define COP(var,klabel,typ) \
    k->pval = &(var);       \
    k->next = &&klabel;     \
    k->op_type = typ;       \
    return;                 \
klabel:

#define SND(var,klabel) COP(var,klabel,1)
#define RCV(var,klabel) COP(var,klabel,2)

void task(void *next, void *ctx, struct k *k) {
    struct env *e = ctx; // FIXME: not possible in general
    if (next) goto *next;

    e->a = 123;

  again:
    RCV(e->b, k1);
    e->c = e->a + e->b;
    SND(e->c, k2);
    SND(e->c, k3);
    goto again;
}


The thing that is "almost right" here is the assumption that we can
use the same struct for the input and the output environment.

That is where the core of the problem lies to convert this to C.

The rest is really straightforward standard ANF/CPS stuff.


In the example above, the stack is only growing, so the unused part of
the C struct is simply ignored, but no notational overhead is created.

In a more elaborate example, the C stack will "fork".  It's not quite
clear how to handle that.  One trick could be to use inline struct
definitions.  However in general it might be necessary to throw an
optimizer at the problem.

What is quite surprising is that this might actually work as a guide
to write CSP in C.

I.e. it is important to see that the conversion from task to state
machine is entirely on the inside of the tasks.  The scheduler doesn't
need to know.  It treats the task as a state machine.

That means that I can take my time to design the taks compiler, but I
can already start working on and using the scheduler.


Entry: summary
Date: Sat Jan 11 15:51:01 CET 2020

Summary of summary of summary.  Whatever, I'm learning.

- computed goto has all the advantages

- this works for all kinds of state machines, not only sequential programs

- an ANF state machine is just a special case of what this can host.
  manually written machines work just as well


Entry: scheduler
Date: Sat Jan 11 16:00:19 CET 2020

So how do you write a CSP scheduler?

This is essentially wingo's talk, but much simpler.


Simplify:

- Tasks can only suspend on channel read write.

- When a task suspends, if it has a partner, resume the partner, then
  resume the task.  Otherwise, add to wait queue.


There is one complication: select needs to be able to wait on multiple
channels.


Entry: Is select the same as non-deterministic choice?
Date: Sat Jan 11 17:11:56 CET 2020

Look this up.  I understand Select, but they might be different than
the other options.


Entry: Implementing scheduler
Date: Sat Jan 11 18:03:58 CET 2020

Does it needs queues?
I don't thing we really care about sched order.

Is there a simpler DS that can mark a channel for a waiting task?
Yes, but it also then needs to undo the wait in case of select.

Might be simpler if the number of channels is small.

So: fun from chan -> (ks,kr)

I'm getting too tired for data structures.

Actually, can scheduling be done using a stack instead?  Or maybe
implementing a queue using two stacks.

EDIT: Revisit from perspective of suspend.

Basically, it is always possible to start from a state where
everything blocks, because the input of the network can be monitored.
I.e. channels that represent timers or I/O events.

So whenever something happens, it is time to start scheduling.  This
can be done depth-first:

- event happens on channel c0, e.g. from timer or external something.

- there will always be a reader on this channel because the system is
  in a "relaxed" state.  execute it.

- it will produce a new continuation.  if it is a send, there will be
  a corresponding receive.  why?  same case as before: there is no
  other way the system can advance.

- if it is a receive, check the queues for a send.

Maybe it's not great to think about this in a tired state.

EDIT: It's wrong.  Reads and writes seem to be really symmetric.  It
doesn't really seem to matter what happens between the two tasks.
It's only important that they meet.  Wingo was really stressing this,
and it seems to be true from the scheduler's point.  The reason to
have two queues is just a filter.  Let's assume schedulable tasks are
a set.


Entry: A scheduler, again.
Date: Sat Jan 11 19:03:49 CET 2020

Blocked tasks are a set.

We have a way to partition it into reading and writing, which is only
necessary to limit the search, but it might be actually simpler to use
a single data structure and iterate around it.


What I'm looking for is a stop condition.

1. Precondition: everything is blocking.

2. Some external read or write happens.

3. Find a corresponding blocked task

   3a.  There is no task.  Add it to the set and go to 1.

   3b.  There is a corresponding task.  Remove it from the set.  Run
        both tasks and execute 2. for each task.


It is 3b that looks like a binary tree recursion.  I.e. the
continuations that just were created will need to be checked if they
unblock anything, because nothing else will cause an unblock.

So the "hot" list of unchecked suspended tasks has a different meaning
than the "cold" list of suspended tasks.  It can just as well be
implemented as a stack.  It doesn't seem to matter in which order the
tasks are executed.

( I would have had a different life if I understood this earlier.
This is really magnificient. )


So let's make some assumptions:

- The "hot" list can be a stack.

- The "cold" list is fairly implicit.  We never need to enumerate.  We
  just need two operations:

  - Given channel id, check if there is a blocked task

  - Given channel id with no blocked task, add a task


Because of the "hot" list, there is no need to ever have a channel
data structure that has two blocking tasks.


So the data structure seems quite straightforward:

A channel is a pointer to a blocked task, or NULL.

It is actually possible that there is more than one blocked task,
e.g. a channel can have multiple writers (or multiple readers, but
that only makes sense for a multiprocessor).

To implement multiple blocked tasks on a channel, it seems best to add
a next_k pointer to the k list.


EDIT: I think I have an implementation for just send and receive.
What about select?

Select is a read that can rendez-vous with a whole lot of channels.

Maybe it's simplest to implement only single write and multiple
receive?


Entry: Multiple readers?
Date: Sat Jan 11 20:34:02 CET 2020

So it is actually meaningful to have multiple readers, because each of
those might actually be blocked on some output task.

I wonder if it is then needed to have the dual of select: write to
multiple channels, picking the first that is available.

Good question.

Let's assume this is not the case for now.

AHA. This is like CML.  Because a write can be an event that can be
syncrhonized on.

Maybe CML is just the realization that the duality of the rendez-vous
is just a side effect of it not being general enough?


Entry: Select
Date: Sat Jan 11 20:41:03 CET 2020

Going over csp.c

If a hot task is a select.  It will need to scan the compatible
channels until something is found.

EDIT: Implementing select, it is clear that it's not possible to use
the channel array trick because a task can appear in more than one
queue, and I don't want to start messing with containers.  Use a
traversal instead.

It will be slightly different:

while hot tasks:
  for all cold tasks:
     rendez-vous?
     yes ->
       remove from both queues
       copy data
       resume
       requeue hot

This requires minor changes to the loop function.

An extension of is_rendez-vous, as it needs to check if a sender's
channel is part of a select's channel.

But the thing I don't really want to do now is to implement the queue
remove.


Note that the prime objective is simplicity.  Speed is nice but not
essential.

EDIT: It's also clear now that making the choice between send and
receive early leads to simpler code when matching.


Entry: Remove element from list
Date: Sat Jan 11 21:46:57 CET 2020

This requires iterating using a **.  It's really just a pop in the
middle of the list.


Entry: Draft
Date: Sat Jan 11 22:50:00 CET 2020

Current csp.c seems to be quite close to the initial vague idea.


Entry: Manual tasks
Date: Sat Jan 11 23:43:22 CET 2020

So basically, add a forth!


Entry: Sending something from outside
Date: Sun Jan 12 00:02:32 CET 2020

Is not easy, because the value that is contained needs to be held in a
task's data structure until it is removed.

Getting something out of the system is simple: it's just a side
effect.  Getting something in needs:
- a task that halts after sending the data
- a call to the scheduler

Though we do not have a way to guarantee that the task will block.

To overcome this, use a buffer structure and a task that reads from
that buffer.  Whenever an element is added to the buffer, restart the
buffer reader task.

You really switch worlds this way.  Different rules apply on the
inside and the outside!


Entry: Task switch efficiency
Date: Sun Jan 12 08:42:01 CET 2020

Currently I have a fairly dumb search, but it seems in general that it
is not simple to create good datastructures.

The case is complicated by select.  In practice, multiple listeners on
the same channel will be fairly rare, but the possibility does make
direct mapping from channel to waiters a bit awkward.


Entry: Generalize
Date: Sun Jan 12 11:08:58 CET 2020

I'm almost immediately running into the case where I need to select on
a read and a write.

csp_send() only works when there is a reader already blocking, so to
implement buffered channels, the only way is to use csp_send() to send
an "interrupt" to a task that reads the buffer and sends it out.

However, it might be possible that the send blocks, but that a new
interrupt arrives.  This means I need to select on both send and
receive, and it seems that this is also what CSP can do.


This messes up the data structure again, because a task can now be
blocked both on send and receive.

Also, select is over ops, e.g. receive channel 3, send channel 2.

What needs to change?

- Go back to only a single list.  That should be straightforward.

- Every operation is implicitly a select.  Same as in CSP btw!

- A task will need to refer to a list of channel operations.  Those do
  not need to be part of the process struct, but will have to be
  stored in the process state.

- The is_rendezvous will become more complex, because all combinations
  will need to be checked.

- One way to quickly do this is to split send and receive operations
  in the select.


Alright. Learning!


Entry: Rework
Date: Sun Jan 12 20:16:22 CET 2020

Some C bs.  Code pointer leaking into.

(gdb) p/x ((struct env1*)s)->op[0]
$26 = {type_chan = 0x5102, msg_len = 0x5555, msg_buf = 0x0}

Actually, the ops do not need to contain op type: it is known by the
grouping whether it's in or out.


EDIT: Works now


Entry: select
Date: Sun Jan 12 21:46:17 CET 2020

So next step is to actually use select for the buffer drainer.

I want something like:

switch(select(SEND(chan2,e->data),
              RECV(chan1,e->int))) {
case 1: ...
case 2: ...
}
       

EDIT: Added a draft. Needs sugar.


Entry: CSP vs. reactive
Date: Sun Jan 12 23:31:43 CET 2020

This is actually more flexible than a reactive system, and can be used
to implement one, where each task is a many->many function that
re-evaluates when any of its inputs change.

Probably not the best way to implement it though, but who cares for
low update stuff.


Entry: What does this open up?
Date: Mon Jan 13 09:54:55 CET 2020

- Reactive system

- Allow decoupling of HW and logic: just replace the events that
  trigger evolution.


Entry: forking lexical contexts
Date: Mon Jan 13 13:42:41 CET 2020

I've removed most continuation compiler notes from csp_test.c
A summary:

- computed goto is a good compiler target, unifying control structures
  (implemented as plain goto) and suspension (computed goto).

- the next step is to resolve the "forking stack struct" problem.
  e.g. how to bridge between C-style "flat struct" state approach, and
  a real continuation compiler which has a separate continuation type
  for each resume point.  there is a lot of overlap but this is
  entirely coincidental.  how to make that explicit?


Entry: More efficient?
Date: Mon Jan 13 16:06:53 CET 2020

Since there is no direct map from channel to tasks, this isn't
particularly efficient.

To make it more efficient:

- When a task suspends, the select structure should be checked against
  the channel->task maps.

- If there is a hit, the task should be removed from all other channel
  lists.

- If there is no hit, the task should be added to the channel->task
  map for each task.

Changes needed:

- Tasks are no longer just in one list, so lists should be implemented
  separately.

- Lists can still be stacks.

- Lists:
  - hot, cold (mutually exclusive)
  - channel's read/write list

It seems to become a lot more complex this way, but linear searches
will be eliminated, so it will be more efficient when there are a
large number of tasks.


How much more memory is needed?
Per channel: + a stack of tasks, one for read and one for write
Per task: - link pointer

There is probably a bound on the number of list elements in total, as
a task will be:
- either hot or cold (one cell)
- part of n channel lists, where n is the select count

So the upper bound is the sum of max select count per task.  That is
statically known.

The hot/cold list can be kept the same, as this is only one byte per
task.

This makes the channel list an "accelerator structure".  I.e. it can
be optionally included to trade memory vs. speed.

The pointers can be shrunk to 8 bit if the number of tasks is smaller
or equal to 255, with one value reserved for null.


Entry: Accelerator
Date: Mon Jan 13 16:45:42 CET 2020


The code above uses a fairly dumb data structure which is memory
efficient but makes scheduling slow.  To accelerate, an index can be
built for the channel->task map.

For now, assume that the hot/cold list structure based on the
next_task pointer can be kept.

What is needed?

- For each channel:
  - a list of blocked readers
  - a list of blocked writers

  It is generally not straighforward to determine the size of this
  list.  The most typical case however is 1, so optimize for that.

- A way to easily remove a task from the this.  This can be done by
  following the task->channel map, and then searching the channel
  lists.

- The cold list can probably be removed.  Only a hot list is
  necessary to keep track of newly blocked tasks.


The sizes of these lists are not easily determined, so it seems
best to place an upper bound on the number of list slots, and use a
shared list cell pool.

Also, if the number of tasks is relatively small, indirection could
be used.  Since this requres a base table, it seems to add a lot of
complexity for not a whole lot of gain.  Revisit the idea later.

The upper bound on the number of cells needed is determined by the
maximum select count for each task: for each select entry, the task
is added to the channel's list.

So total number of extra bytes needed:

- Tasks * MaxSelect * 8
- Channels * 2 * 4

It does seem to create quite a bit of overhead.  Suppose we can use
IDs for task lists and tasks, the count then becomes:

- Tasks * 5  (id -> pointer table + backpointer in task struct)
- Channels * 2 * 1

This requres that the total number of active selects also fits in 8
bit.

Re-arranging tasks so they are in a single array avoids the back
pointer, but requres a secondary pointer.

Basic conclusion I come to is that keeping it simple is proably going
to be the way to go.  Pick speed or pick memory efficiency.

Add a cell pool that just has pointers.


Entry: csp indexed scheduler
Date: Tue Jan 14 17:04:00 CET 2020

It's much more direct now.  Compared to the original search, this
should be a lot faster.  I don't think I need to test speed to know
that I don't need the old implementation any more.

This will be a lot faster in the common case where there are a lot of
blocked tasks and each scheduler run only schedules a couple of tasks.

It does need more memory.  It's itching to optimize this but maybe
measure memory usage first, then see what makes most sense in
practical scenarios.

Also needs an allocator for channels, which is usuful in its own
right.

The hot list can still use the pointer inside the task.


Entry: External receive
Date: Thu Jan 16 20:04:02 CET 2020

This could implement a "pull" network as well.  It's completely
symmetric.  The counterpart read to synchronous send is obvious.

That also means there should be a conterpart to the buffered input,
but this is less obvious.  Maybe it's just a poll of a buffer?  Maybe
execute a network until a buffer is full?

A typical use case is an interrupt that will pick up something from a
buffer.  The interrupt should perform the read and signal the network
to fill the buffer again.

In general it seems something needs to be buffered, but it is not
really clear what.

Both input and output buffering might be necessary due to deadlines,
so a third element (WFI main loop) is actually essential if both need
buffering, otherwise advance the network from input or output
interrupts.

 
Entry: SELECT macro
Date: Thu Jan 16 20:34:19 CET 2020

Something like this:

        switch(SELECT(b,
                      {EVT(b->c_data, b->token),
                       EVT(...)},
                      {EVT(b->c_int,  b->nb_bytes)}))

First clause is send, second is receive.

This doesn't work well with macros.
Maybe use a begin/end construct?

SELECT_SEND(s,2)
  EVT(s, 0, chan1, var1)
  EVT(s, 1, chan2, var2)
SELECT_RECV(s,1)
  EVT(s, 2, chan3, var3)
SELECT_END {
case 0: ...
case 1: ...
case 2: ...
}

These seem possible, but awkward.  Maybe just stick to raw?

Quirky idomatic, but quite clear actually.

        CSP_EVT(b, 0, b->c_data, b->token);     /* SND */
        CSP_EVT(b, 1, b->c_int,  b->nb_bytes);  /* RCV */
        CSP_SEL(b, 1/*nb_send*/, 1/*nb_recv*/);

I can' figure out how to turn it into an abstraction, so use
switch(task->selected) {}.


Entry: priorities
Date: Fri Jan 17 14:56:16 CET 2020

Implementing priorities requires some extra machinery:

- Push hot tasks by priority

- For each channel, push cold tasks by priority

However, without pre-emption, this doesn't seem to be very useful.


Entry: CSP
Date: Fri Jan 17 15:00:07 CET 2020

Basically, this seems done.  I don't think I need to add priorities.
If that is necessary, use an existing RTOS.

I think this mostly needs a break.  It was fun, but it is pretty much
finished if I want to keep it simple.

What remains is syntax:
- Make select prettier
- Convert continuations to state machines


Entry: The always waiting task
Date: Sat Jan 18 06:57:18 CET 2020

Instead of using a buffer, it's also possible to use a task that is
always waiting for a buffer write.  Basically, a blocking task behaves
as a 1-slot buffer per channel it listens on.

Buffering seems to be a system design issue.  Once you know the entire
network, it's possible to say a couple of things about how data will
propagate.  See next post.


Entry: Relation between CSP and dataflow
Date: Sat Jan 18 06:59:50 CET 2020

This is more about the guarantee not to block.  E.g. suppose data is
coming in over USB.  Take the recent DMX box example.  I want to
guarantee that an external CSP send will succeed.  Is there a way to
express that in CSP algebra terms?

In the example, a buffer is needed in beteen USB input and DMX output.
It doesn't matter where it goes, but it has to be there.


Entry: Reactive systems
Date: Sat Jan 18 15:18:45 CET 2020

A reactive system can be something like this:
- select on input
- update value
- select on outputs

Maybe the semantics isn't really perfect, since simultaneity isn't
handled well, and neither is fanout.  For a reactive system we really
want continuous time semantics, which would not create multiple events
if all input values changes at once.

It seems reactivity is really more about time, while CSP is timeless.

It might be more useful to think about time provision.

EDIT: reactive dataflow = CSP + memoization?


Entry: An OS is CSP + time base?
Date: Sat Jan 18 15:24:08 CET 2020

CSP gets time from the outside.  We're going to want to introduce time
in almost every application.  So maybe create a wrapper for it?

Do this more general: input events are uC interrupts.  This will
always work.


Entry: CSP and interrupts: two levels are enough
Date: Sat Jan 18 15:28:17 CET 2020

To fit in the current setup, i'm tempted to run the CSP machine from
interrupt, but that really doesn't work due to the need for
pre-emption.

The case of a intterrupt-per-char on uart is actually quite common,
and frequently processing will be irregular: the character that
completes a message will take longer to process than the one that is
just buffered.

So general rule:
- intterupt data goes into buffers
- main loop uses WFI to wake up

The same issue happens on the FPGA.  The problem is TDM: one
representation of the data requires frequent small operations, another
requires infrequent large operations.  This decoupling is essential in
computing in general: the "time warping" is what makes things
practical.

So I've identified two cases: character vs. data packet.  This can
likely grow larger as well, involving processes that operate on many
data packets.  But these do not seem to need special requirements.

Why does the interrupt need special attention?

Because it has a time base that doesn't stretch!

The interrupt is tied to real time.  Once buffered, time becomes
relative, i.e. at the fine grain, only causality matters, and the
relation to real time is there at a much coarser level.

So one thing to take away is that the need for priorities ALWAYS has
to do with some exeternally imposed deadline.

Reading that back it is rather obvious.  I just made a weird detour to
get to a very simple conclusion.


Entry: Priorities
Date: Sat Jan 18 15:43:04 CET 2020

So can this be used to introduce priorities in the CSP scheduler?

Suppose there is a chain of events that has a time constraint.  This
chain goes through a number of processes.  Likely, each process in
turn will unblock the next one.  What is the scheduling decision that
drives this?

There are only three points where a priority decision can be made:

- Take next element from hot list.  This can grow large if a single
  event has a cascading effect.

- For a particular channel, pick an element from the cold list.  In
  most cases this is just one, so this is not the most likely
  contributer.

- When waiting for multiple events, pick the highest priority one.


The first two are trivial: use ordered insert.

The latter is simple to implement: instead of picking first event in
select list, pick the corresponding task with highest priority.  It
involves a search.  Is it possible to re-arrange things such that this
can also be just an ordered insert?


Context: this still doesn't have pre-emption, but it could probably
accomodate "hot inserts", meaning that if an interrupt occurs, the
corresponding event should be inserted as soon as possible.

WFI->scheduler is not enough.  It needs to also perform the event.
For each buffer, it needs to perform a poll and add to the hot list.
Poll also needs to be repeated inside the scheduler in case of
priorities.

If we need to poll on every event it becomes expensive.  It seems best
to use a single interrupt "stream", which is then treated as high
priority.


So the general story isn't simple.  Data structures will need to be
changed to allow priority, and interrupts need to be mapped to hot
tasks.


Another: this isn't enough: we need to also pick a hot task based on
what task it will unlock, which is essentially priority inversion.
The priority of a hot task is the max between itself and the tasks it
will unlock, and all the way transitively.


Entry: Priority inheritence, opaque tasks
Date: Sat Jan 18 16:07:32 CET 2020

Suppose the highest priority task blocks.  Its events now become high
priority, and anything that can potentiall unlock it should become
high priority.

What we want to do is to solve the transitive problem: there might be
a low priority task that potentially unlocks the high pri one through
a chain of dependencies.  But there doesn't seem to be a way to do
this.

E.g   P1 -> P2 -> P3
            P2'

Where P3 is high pri, P1 is low pri, and P2' is mid pri.  If it
doesn't know how to promote the chain, it will run P2' first.

This should read like this: P1 can be scheduled, which would unlock
P2, which then performs a write which will unlock P3.

The scheduler cannot see this because tasks are opaque.

There doesn't seem to be a way to promote P1.

Maybe it should be done the other way around: maybe events should have
priorities.  E.g. suppose a high priority write (e.g an interrupt)
unlocks P1, any subsequent actions should be treated as high priority
and executed first.

Maybe it is even enough to make the scheduler greedy?  E.g. suppose an
interrupt creates a hot task.  This is scheduled, possibly unlocking
more.  Those unlocked tasks should be treated as high priority.  But
not all of them would be.
 
I can't wrap my head around it..


Entry: Priority inside the network
Date: Sat Jan 18 16:10:36 CET 2020

Note that inside the CSP network, priorities do not make any sense.
Tasks do not know about time.

Priorities only relate to I/O timing behavior: i.e. if the CSP network
somehow ties to I/O with timing properties, then we can schedule tasks
in a particular order such that we influence the real time spent
between tasks.


Entry: Topics to expand on
Date: Sat Jan 18 17:54:33 CET 2020

- priority queues
- priority inheritance ?
- external interrupts

The latter is straightforward.  If there is no danger for hardware
buffer overflow, run it from ISR.  Otherwise run it using two priority
levels: pre-emptive interrupts into buffers + main loop that polls
buffers on interrupt.


Entry: Priority levels
Date: Sat Jan 18 18:09:57 CET 2020

Priorities do not seem make sense without pre-emption.

Without pre-emption, there is only external in->out delays, which in
theory could be sped up by not running low priority event paths.

Is it so that a high-priority task can only become runnable through an
external interrupt?  Seems so.

How to make scheduler pre-emptable?  All list updates need to be
atomic.  That's about it.  The procesor's interrupt priority can then
be used to re-enter the scheduler.  However, this does not solve
priority inversion.

Hairy problem.  I likely won't need this in regular data-shuffling
applications.  I can see it being important in hard real-time
scenarios.  Maybe just postpone?


Entry: Next?
Date: Sat Jan 18 18:49:18 CET 2020

1. Forget priorities. Really.

2. Implement an example, e.g. DHT11 driver, RS485 sensor network.

3. Create a syntax layer

Of these, the latter will be the most useful.


Entry: Syntax layer
Date: Sat Jan 18 18:50:43 CET 2020

+ Would allow a decent select macro.

- Requires finding a C subset.  This will be awkward, and should
  probably be forked off as a separate "C lisp" project.

- Requires resolving an impedance mismatch between flat C structs and
  continuations.


Entry: Representing continuations : union of maximum frames
Date: Sat Jan 18 18:52:28 CET 2020

The main point is to avoid copies.  It's probably ok to create a
separate C structure for each continuation point.

The stack can be split into 3 pieces:
   shared - old
   shared - new

The issue is in overwriting old.  In a standard implementation this
happens naturally.  Can it be done naturally in a similar way?

There is no real difference, other than that there are many transient
forms of stack frames, and not all of them are relevant.

How to split this up?  Generate all the stack frames: every ANF
operation has an associated frame.  If the operation is blocking, the
frame is reified.

Or.. The store is just there, but there is a different interpretation
at each stack frame.  This is likely the simplest form, but requires a
lot of structures.  Using short typedef names for the structs:
s1,s2,.... and a short name for the stack, the ANF sequence looks
like:

((f1*)s)->a = 123;
((f2*)s)->b = ((f1*)s)->a + 1;

It could be implemented using a union.

union {
  struct {
    int a;
  } f1;
  struct {
    int a;
    int b;
  } f2;
}

s->f1.a = 123;
s->f2.b = s->f1.a + 1;

That can then be further simplified by reducing the number of
structures by using e.g. f2 where f1 would suffice:
 
s->f2.a = 123;
s->f2.b = s->f2.a + 1;

That would avoid quadradic clutter in the generated C code.  There
will be no issues with undefined variables as definition will be
ensured by code gen from proper ANF.

This actually seems quite straightforward.  

Maybe later find a way to reduce notational clutter, but for now
readability isn't a big issue.  As long as names are preserved it
would be fine.


Summary:

- Each operation has its associated environment.  All forms are
  'let*', meaning the environment grows, and previous variables are
  accessible.

- Casts are done for each reference.  Casts can be implemented using
  unions.

- Map each frame to its "maximum", and only generate and use those in
  C code.  This removes quadratic explosion.


Simple, but the combination is what makes it workable.

When this is represented in Haskell, it's probably possible to
parameterize and infer types, so instead of starting out with a data
type, maybe start out with a finally tagless form?

EDIT: Simple, but optimizer can't do much.  In between resume points,
we want to allow for optimization.  So when taking a continuation, or
when performing a tail call, it is probably best to copy any
additional variables.  But this then immediately opens up issues that
previously frozen stack frames no longer have the same size.  It's
hairy: gets very difficult very quickly.


Entry: CSP and FRP
Date: Sat Jan 18 20:20:25 CET 2020

So it works as long as full propagation is used: every recompute on
any of the inputs, cache the inputs and push to all outputs.  This
does not work for "simultaneous" events, but naive push-style FRP also
doesn't.

Note that this is a real problem if there is any fanout.


Entry: Seq or new ADT?
Date: Sat Jan 18 20:29:00 CET 2020

Seq C is not complete.  There are no loop constructs.

This is almost completely complementary.  I don't care about the
primitives, but I do care about loops and continuations.

Let's first do it separately.


Entry: Semantics
Date: Sat Jan 18 20:58:48 CET 2020

I get stuck very quickly.  Too many issues at once.
Some notes:

- I can't see the consequences of implementing higher order functions,
  so implement only higher order "forms", such as iteration.  This can
  be just goto.

- However, I still want function calls to properly abstract over
  blocking operations.


Can I do this instead with abstract inerpretation?

Execute the program, produce a trace, and from the trace produce the
stack data structures.

Insisting on ANF and closures will make it too difficult to keep C
semantics.

Just a stack.  Nothing fancy.

Start with what can be done using a stack:

- Non tail call function application (append to stack, hide).

- Non tail call "form inline" (append to stack, keep context)

- Tail call: replace part of stack

- Tail call "form inline" (same, keeping context).

These are 4 operations.

Summarized:
- call
- form inline
- tail call
- 

Observation:

- "hiding" is only necessary when it is possible to have name clashes.

- inlined forms are interesting, elaborate on that


Entry: Inlined forms
Date: Sat Jan 18 21:12:08 CET 2020

This is a trick to use something that looks like the application of a
function, but is easy to implement.

Given a form like this:

(let ((a 123))
   (map (lambda (x) (+ x a)) '(1 2 3)))

Where a closure is consumed immediately by a higher order function.

This can be re-arranged by inlining the 'map'.

Basically flipping things around: the stack frame used to implement
map will be hidden to the rest of the function.

So what does this actually do, when we're restricted to stacks?

1. It inserts the stack frame of map (e.g. iteration control
   variables) in a way that is not accessible to the body of the
   lambda.

2. It inserts the stack frame of the lambda.

so we have the following stacking:

a=123
(iter control, invisible)
x=1,2,3,...

So basically I should define a machine that can do this kind of stack
frame interleaving, then optionally "hide" frames.


EDIT: Is an inlined form the same as a downward closure?  The DC would
be explicit.  Rust solves it by moving ownership, and is much more
flexible.


Entry: those 4 stack-only control structures, revisited
Date: Sat Jan 18 21:23:10 CET 2020

- function call:
  - hide current frame (essential!)
  - append new frame (args & params)

- tail call:
  - remove current frame
  - replace with new ...

- inlined form
  - keep current frame
  - hide form's control frame
  - append new frame

- same, tail call version, might not be necessary?


Hiding the caller's frame is essential, because the function can be
used at any point.  The invisibility is an enabling property!

I re-invented the frame pointer.


To represent a function call: store return point and previous stack
frame.  This is exactly what happens in C function calls.

Tail call will likely need a memcpy to prepare the new frame on top of
the old one (or the C stack), and put it in place.  Also not really
rocket science.

The interesting part is the inlined forms.

This is what is "new" in the language.

But.. the only one that I'm going to need is a Rust-like "for".


The more I think about this, the more it is obvious that I should not
write the closure compiler, but use Rust instead.

Keep the CSP at what it is: something simple.

Still.  It's a nice exercise.


Entry: Spawning a temporary task with static memory
Date: Sat Jan 18 21:55:04 CET 2020

It just uccured to me that tasks can easily be nested.  As long as
lifetime is properly managed, one task could host the storage of
another task and spawn and kill it.  E.g. to implement a sequence.

This thing is a real eye opener.

Once a scheduler is in place, anything that would ordinarily be a
state machine can just be a task.

Maybe work on the memory usage, because this is just too useful.


Entry: Scheduler structure: less memory
Date: Sat Jan 18 21:59:35 CET 2020

Currently:

csp_task: 12 + n_ev * csp_evt:
csp_evt:   8

csp_chan_to_evt:  n_chan * 8
csp_evt_list:     n_ev_tot * 12

Most of these can be halved by using 16-bit packed pointers.

The expensive parts are the events: 20 bytes/ev.

The smallest controller I run this on has 20kB.
This really isn't all that bad.

It can be halved without too much effort, by using encoded pointers
(RAM_BASE + 4*i).  Works for up to 256kBytes.  For controllers that
have more memory it problably isn't necessary.

It will make it a lot more awkward.

So don't optimize.


Entry: A single interrupt queue
Date: Sat Jan 18 22:17:02 CET 2020

So, if there are no priorities, it's ok to use a single queue for
interrupts as well.  Make sure they are all at the same hardware
interrupt level, and use an application main loop (bottom half) that
does:

    loop
        WFI
        while read_evt
        external send evt

The interrupt event queue then should have:
- channel (2)
- len (2)
- payload (n)


Entry: Wakup ideas
Date: Sun Jan 19 11:05:41 CET 2020

Probably nothing.

- Try to put everything inside the task struct.
- Store channel data only once?

Let's look at the event.  How to turn it more into a synchronized
function call?  Sender provides a callback.

In the cold list there is only ever one pointer to a particular tasks'
event list.

EDIT: The interesting relation is that a task's event is pointed to
only by the channel, but a single channel can point to more than one
task.

One thing though: a channel can never have readers and writers at the
same time, so a single list would suffice.


Entry: CSP on HW
Date: Tue Jan 21 04:57:24 EST 2020

The big lesson from implememnting this in C, is that the inside of a
task really doesn't matter.  It is about messages and syncrhonization.
The rendezvous works particularly well in hardware.

I still miss a good way to go from parallell to sequential.  The core
idea is to parameterize a "CPU stripper".


Entry: Tasks
Date: Wed Jan 22 07:15:47 EST 2020

Dreaming about this.  It is very flexible.  E.g. it's possible to
emulate (blocking) function calls with tasks.

Many->One : gather results
One->Many : dispatch worker

Note a misconception: a worker isn't necessary about a CPU resource.
It can be just as wel be managing/using some other resource, e.g. some
process that evolves over time.


Entry: Decoupling through channels
Date: Thu Jan 23 08:32:12 EST 2020

You know this really solves the HAL problem.  I didn't consider that
at first.  It's quite amazing how deep this thing goes.


Entry: Reiterate FPGA bits
Date: Fri Jan 24 06:59:40 EST 2020

So thinking in CSP style cuncurrency, what can be done on the FPGA?
What is missing in my understanding to make this completely natural?

E.g. writing state machines as communicating processes is now obvious.
The same could be true for FPGA.

Maybe the secret is in select?

1. Receiver being able to handle a number of things at once

2. Sender blocking until receiver has message

Up to now I've been really focused on sender controlling the transfer:
outputting a sync pulse, requiring receiver to be listening.  This is
not necessarily true.

Ok that's clear: focus on rendez-vous.


Entry: Rendez-vous in hardware
Date: Fri Jan 24 07:04:48 EST 2020

Simplificatation:
- Direction is hard-coded
- One reader, one writer
- Synchronous logic only

Eiter one can arrive first.

Sender:
- Sets output signals, raises write
- Waits for rendezvous signal
- Lowers write, continues

Receiver:
- Raises read
- Waits for rendezvoud signal
- Samples, lowers read


If the reader is first, to the writer this should look the same as
sending out a non-synchronized pulsed write.  This way we can still
use non-synchronized feeders whenever we can guarantee that the reader
is always ready.  Sync can be removed as an optimization.

Googling.  Not directly related, this is for async:
http://async.usc.edu/jungfrau_web/htdocs/new/research/current/verilogcsp/VeriloglCSP.pdf


Entry: Representing state machines
Date: Fri Jan 24 07:17:23 EST 2020

Create a modeling language for state machines that is optimized for
code transformations that can pipeline the transition machine.

The real problem I keep facing is that bit-level state machines are
just too low level, and manual pipelining is a real pain to do, but at
the same time it is itching: if this is represented better,
sequentializing the transition machine should be straightforward.


Entry: Pipeliner: dataflow vs. state machine
Date: Fri Jan 24 07:23:24 EST 2020

So it seems quite obvious.  Pipelining is simple: take a data flow
graph and start adding delays.  It's probably best to formulate it as
a problem that can be solved using a SMT solver.

There is one caveat: feedback.

So for state machines it is really not all that obvious.

It might be good to implement a simple machine like that and see how
it goes.  E.g. a pipelined counter, where the adder is pipelined.


The ground rule is: a pipelined state machine will have a slower state
update.

It also seems that utilization will be low if functional units do not
share a datapath.

Conclusion: pipelined state machines need a programmable datapath.


To re-iterate: this is an important insight.  Easy to see when two
regimes are put next to each other:

- Feed-forward dataflow can be pipelined without consequences: all
  function segments are utilized at max rate, providing throughput,
  but at the expense of some I/O delay which is often not an issue.

- If such a functional network is used as the update function of a
  state machine, the delay is very significiant: for an N-stage
  pipeline, the utilization is only 1/N.  In this case it is probably
  more effective to collapse the different pipeline stages into one
  programmable datapath.  The total machine's update rate will still
  be N times slower, but silicon usage might be better due to reuse.

A CPU is the natural end stage of a pipelined state machine, where we
create a programmable datapath that is universal.


Entry: Programmable data path
Date: Fri Jan 24 07:28:06 EST 2020

See last post.  Pipelined state machines will want to be programmable
datapaths, while pipelined feedforward networks have single-use
stages.

These are two "architecture islands" that have fairly little to do
with each other.

Conclusiong to make:

- If speed is of the essence, try to get rid of state machines and
  turn it into a feedforward network.

- An intermediate stage: each pipeline stage itself could use
  feedback, so purity is not really essential.

- If speed is not essential, programmable datapaths are the way to go.


Question: is it always possible to avoid state machines?  Not really.
The typical example is an integrator.  However it does seem possible
to try to make feedback loops smaller.


Entry: What is human and what is machine?
Date: Fri Jan 24 07:41:28 EST 2020

So what do I do with these insights?

First conclusion is that it might be good to create a datapath generator.

Another conclusion I had before is that unrolling is the easy part.
Condensing onto a programmable datapath is more difficult as it _adds_
information.

So human inginuity should be used to create a representation as a
"maximally compacted" sequential program.  Then an optimizer could be
used to unroll.

This brings us back to Forth.  I don't know of a better sequential
program encoder, as long as stack shuffling is avoided, but it seems
that adding registers in an ad-hoc way will solve that.


Entry: Programmable datapath
Date: Fri Jan 24 09:28:49 EST 2020

Assuming:
- code sequence is an abstract stream (RAM or other SM)
- stack can be unfolded to registers
- primitives can be expanded

Forth really is a good way to turn program transformation into a
simple algebra.  And when the search space isn't too large, discrete
solvers can be used to perform mapping.

I know there is something here, but how to uncover it?


Entry: Unrolling/unfolding
Date: Fri Jan 24 09:34:18 EST 2020

Stick to that basic idea that unfolding is easier than folding.

Unfolding fort to ANF:

- creating an output register for each ALU operation
- treating the stack as individual (named) registers

From programming in Forth, stack shuffling can often be avoided by
using:
- "channels", or read/write pointers
- the R stack for tucking away temporaries


I think it's still a bit out of reach until I find some intermediate
form.  Maybe first finish an actual stack processor?


Entry: Remove Time
Date: Sat Jan 25 05:03:56 EST 2020

Create a programmable datapath, driven from UART or SPI.  This would
allow easy stepping.

The core idea is to remove time from the equation.  Use only event
causality.  That will make it possible to slow everything down, and
also emulate it on a different substrate that is much easier to
instrument.

I don't think it is that hard to introduce rendez-vous into an
existing unidirectional sync system.  See above.


Entry: CSP: it would be nice to avoid the copy
Date: Sat Jan 25 12:43:47 EST 2020

It is possible to do that as long as tasks know that messages are only
valid up to the next scheduling point.

But let's not do this.  It's probably better to create a buffer
management scheme, and pass buffer references around.

EDIT: It can be quite simple.  E.g. interrupt uses 2 buffers and is
responsible for swapping them


Entry: Network sends...
Date: Sat Jan 25 19:02:35 EST 2020

cannot be used for synchronization, unless a ping-pong mechanism is
employed.


Entry: FPGA
Date: Sun Jan 26 05:36:17 EST 2020

What am I missing?

First, the CSP style concurrency is fundamentally different from
synchronous logic.  Maybe work a bit to bridge the gap.

Say I want to write a SLIP decoder as a sequential program, and
compile it down to a state machine.  This is non-trivial

loop:
  read
  - case ESC:
      read
      case ESC_ESC
      case ESC_END
      default
  - case END
      send PACKET
  - default
   
     
Yeah I'm really stuck at things like this.  What does it require?

- Map events to signaling conditions

- Change tasks to update functions / state machines

That should be it: no transition when there is no event, and if there
is one or multiple events, perform a tie-breaking operation.


Entry: Tie breaking allows use of sequential select
Date: Sun Jan 26 05:52:47 EST 2020

This is a new idea: mimick the behavior of select in hardware, which
requires tie breaking when events arrive at the same time.

Assuming it is best to do only one event at a time.  If not, split it
up into two different machines.

When all transition functions are pure, they can be implemented either
as single-cycle, or using a programmable datapath.

So the core ideas are:

- use CSP style rendez-vous with tie-breaking select

- implementation of transition functions can then be decoupled


In othe words: CSP-style rendez-vous allows for machine
synchronization, effectively removing real-time from the picture: we
can reason at the event level.  So DECOUPLE TIME FIRST.

Once time is decoupled, transition functions can be implemented in a
single cycle, or can be done over many cycles.  This essentially
isolates the problem of function computation.


I am going to need a compiler and a universal state machine language.


Entry: CSP language
Date: Sun Jan 26 06:07:32 EST 2020

1. Assume task to state machine compiler.

2. The code between two selects is a function.

Implementing those functions could be done with human assist.

E.g. once the compiler has split the code into transition functions,
it could probably be converted to C.  But it could just as well be
compiled to a hardware state machine.

That compilation step could in theory be done by hand, e.g. if the
compiler can generate the transition function specificiations (the
part that is hard to do by hand), they could be implemented manually
against a reference implementation.


Entry: Target mapping
Date: Sun Jan 26 06:22:38 EST 2020

Could be just implementation of functions.  See previous post.


Entry: Summary
Date: Sun Jan 26 06:26:34 EST 2020

Too many ideas and too little focus.  What should be done next?
Creating a CSP language frontent seems to be an essential part.
Targeting can then be solved separately.

The interesting thing is that FPGAs are in reach as well through the
combination of rendez-vous implemented as a hand-shake in hardware,
and making select-to-select transition functions explicit, which
allows for manual optimization if needed.


Entry: Implementing compiler exploration
Date: Sun Jan 26 06:50:33 EST 2020

It needs to be make concrete soon.  Other things will surface during
implementation.


Entry: CSP lib: why?
Date: Sun Jan 26 10:45:35 EST 2020

1. Started after writing yet another ad-hoc state machine sequencing
program.  CSP is much more uniform to write OS-less code.

2. This is a small library and composes.  Not a framework dependency.


Entry: Haskell QC test bench for C CSP programs
Date: Mon Jan 27 08:26:04 EST 2020

Maybe simplest to decouple through packet_bridge.  That way the
interface is platform-independent and can be used in Erlang also.


Entry: Rust async block_on
Date: Mon Jan 27 19:44:11 EST 2020

https://stjepang.github.io/2020/01/25/build-your-own-block-on.html

maybe not what i think it is?


Entry: Ceu
Date: Thu Feb  6 16:12:30 EST 2020

http://ceu-lang.org/publications.html

Not general purpose, but it contains some nice ideas.


Entry: Redo
Date: Fri Feb  7 14:36:22 EST 2020

I wrote an Erlang implementation of redo, and I'm currently using it
to write an Erlang implementation of make as well.

See uc_tools_gdb_do.erl

Redo fits in the general exploration of event-driven schedulers.


Entry: switch reactive direction
Date: Sat Feb  8 14:32:32 EST 2020

So a highly available distributed system acts as a reactive network
where the causality switches direction.  Each of the nodes can
initiate a state change with all of the other ones following.

I think that is the main idea between lamport timestamps and vector
clocks: focus on causality.

This still doesn't quite fix the "crossing paths" problem. So how is
that usually tackled?

Crossing paths is actually quite common in the following case:

- system A and B both have a reference to some lazy resource C.

- system A and B split

- resource C disappears, but A and B don't notice it

- both A and B have a need for resource C, see it is not there, and
  both pick a replacement: A picks D, and B picks E.

- on merge, which should become the consensus?

Maybe this is more convoluted than I set out.

If they both implement the same interface, it probably shouldn't
matter.


Entry: CSP inside an Erlang process?
Date: Wed Feb 12 08:08:59 EST 2020

So, going from composing machine functions into a single process, what
about running a CSP inside Erlang as a scheduler for such state
machines?

This would be a nice playground.

Main benefit is that it is contained in a single process and would be
easier to manage and compose.

CSP maps a collection of state machines to a state machine.

Same for reactive networks btw.

Maybe redo.erl can be abstracted like that as well?  Possible to run
inside a single process, or to be forked off into separate workers for
parallellism.


EDIT:

Erlang would be a good testing ground for building synchronous state
machines (each process has its own CSP scheduler), linked together
over communication links.

It's also a good compiler target to try out the closure compiler first
in a simpler setting.


So how to express rendez-vous like that?  Inputs are straightforward.
Outputs are too, actually.  Looks like it is just defining select.

Anyways. Might be best to go on top of Erlang processes anyway...  The
single-threaded thing is really a special case.


Entry: redo: providing temp space to imperative code
Date: Thu Feb 13 06:33:39 EST 2020

The thing that really strikes in redo is the way in which an
imperative update function is given the "ownership" of the output
node.

Basically, it can do what it wants with that piece of storage while it
is running.  It is very nice to have this high level guard around a
dirty, in-place imperative update.

Write a C version?
Since I have the example, what would that look like?


Entry: distcc-style network?
Date: Thu Feb 13 06:36:44 EST 2020

missing ingredients:
- all nodes running the same code (use nix for this)
- decent local cache for "constant" data
- intermediate products should be kept local (scheduler optimizer)


Entry: Making a redo evaluator in C
Date: Thu Feb 13 06:45:39 EST 2020

Maybe the problem domain is not specified well: in practical
situations, I think that dynamic dependencies might not be such a big
deal.  I.e. we will know the dependency network up front.

So a better way to do this is to us a pure dataflow structure with an
explicit memoized push/pull walker: start at changed input(s), and
recursively pull any of the affected outputs.  It will be necessary to
have both o->i and i->o lists available.

It doesn't really look like rocket science. Question is more: do I
have a use for it?

I think atm a more explicit CSP style approach is appropriate on
uC/FPGA.

It would be nice to reframe reactive code as CSP.  Basically it is
just adding memoization: each process is responsible for its own
output node.  Once it has evaluated, it will keep sending out the same
value until a global signal resets the "phase" of the algorithm.  It
does seem rather inefficient to do it like that.


Entry: danielg on #rust about async compiler
Date: Mon Apr 20 13:23:58 EDT 2020


12:16 < doelie> Hi. Quick question: who implemented the gnarly details of the async->state machine conversion? I'm trying to gauge how hard it would be to write a stand-alone
                tool that could do the same for C.
12:17 < doelie> I'm guessing some of the rust code would be reusable.
12:17 < doelie> Or if not, at least some of the struct packing heuristics.

12:50 < danieldg> doelie: it's actually pretty simple in theory.  Make a struct that contains all local variables and use that, along with one "state" variable.  Yield is
                  "struct->state = N; return YIELDED"; the start of the function contains a switch/goto on state.
12:51 < danieldg> rust uses an (unnamed) enum to hold the local variables so that it can overlap variables that are never in scope at the same time, and uses the real stack for
                  variables not live across a yield. But those are just optimizations.
12:52 < danieldg> the result is that the struct packing is not specific to the async code at all

12:59 < danieldg> doelie: imo it would be easiest to do it as part of an LLVM pass instead of a standalone C->C tool

13:10 < doelie> danieldg:
13:10 < doelie> danieldg: the llvm pass might indeed be a good thing to consider
13:11 < doelie> however i want to see how much extra work i need to do to just do it in c.
13:11 < danieldg> look for existing tools to create generators in C, it's the same thing
13:12 < doelie> any pointers?
13:12 < danieldg> no, just giving you google keywords :D
13:12 < doelie> yeah i did some looking already :)
13:13 < danieldg> I've only dealt with higher-level tools (matlab) that generate state machines and then C code from it

13:13 < doelie> i've not found a tool that can just do a C->C translation
13:15 < doelie> anways this is just to see if i'm not missing anything obvious. i did some thinking about the struct overlap to encode the continuations, and it looks quite
                similar to what rust is doing.
13:15 < danieldg> one important thing to consider is that an async rust function is actually two real functions: one setup that creates the state struct, and one that is called
                  by poll() on the struct.  Splitting functions like that in C is not as trivial.
13:15 < danieldg> unless you want to do heap allocation and function pointers all the time
13:15 < danieldg> I think C++ has some support for generators in its drafts, btw
13:16 < danieldg> might look at that
13:18 < doelie> no i want to use static/stack allocation. i've used something similar in the past, but it's quite heavy on the C macros and also doesn't do any compilation to
                state machine, just dispatch: https://github.com/zwizwa/uc_tools/blob/master/sm.h

13:20 < doelie> danieldg: i think that the main issue is going to be to find a good C parsing/printing library.  i guess llvm can be made to spit out C code again.
13:20 < doelie> but i'd like to do this such that the resulting c code remains readable.
13:20 < j`ey> clang has rewriting tools
13:26 < doelie> j`ey: thanks. i found this post: https://eli.thegreenplace.net/2014/05/01/modern-source-to-source-transformation-with-clang-and-libtooling
13:26 < j`ey> > moder
13:26 < j`ey> n
13:26 < j`ey> >2014
13:26 < j`ey> :P
13:26 < doelie> yea good ideas dont age
13:27 < doelie> but it's enough to get unstuck for me. thanks you both.