State Machines After an excursion into writing manual state machines I'm again veering towards just writing parallel code with blocking constructs. The new realization is that there doesn't have to be an RTOS. It's likely possible to generate C state machines from a HLL. Entry: Inverting state machine control (futures) Date: Sat Jan 4 19:47:02 CET 2020 https://www.youtube.com/watch?v=9HspeHGBg-Q&t=933s Futures are sugar. Did this whole thing start in C# ? If the async library in closure is just sugar, then I can do the same for small uC projects. Rich mentions they do not provide unlimited buffering so it allows for actual backpressure. Also maps better to embedded. Entry: New sm.txt notes Date: Sat Jan 4 20:31:33 CET 2020 This probably needs a separate blog thread. I'm shuffling around many ideas, but I really want to move this into an actual system. I'm tired of writing C for microcontrollers, and Rust is not the solution in my case: I want something more model-based, macro-like. I still want to "keep Haskell out of embedded". Only as compiler, where it should be the main tool. - sm.h - limited nesting by making stack frames explicit - A way to generate a static C program from a higher level functional description. Something that can do: - Automatic async-style conversion - Smart re-use of closures (i.e. flat state structure where state updates are mutations, not copying in a new activation record). - I want lisp-like syntax and constructs that can be easily implemented in C, but explicitly no arbitrary closures or continuations. - Some way to formulate this so it makes sense on FPGA as well. Where - messages are either actual bulk data lines, registers, or references to data stored in memory - tasks are state machines, or stored programs on a CPU - Erlang as test target. Either Core erlang, or just an interpreter for something simpler, written in Erlang proper. Entry: CSP in C Date: Sat Jan 4 21:49:25 CET 2020 Rendezvous is too hard to implement, so use buffers. See uc_tools/cbuf.h, optionally with both ends of the channel exposed as separate types to make it clear if a channel is input or output for the current "task". A "task" is a C struct containing state variables and channel pointers. Scheduling can usually be done in an ad-hoc fashion, using a round-robin scheduler where each task defines its own poll routine. sm.h style state machines could be used. EDIT: Don't call this CSP if it's not rendez-vous! Entry: Byrd miniKanren Date: Sun Jan 5 22:59:42 CET 2020 https://www.youtube.com/watch?v=RVDCRlW1f1Y Entry: Capture subset of C in s-expressions Date: Tue Jan 7 11:52:37 CET 2020 ;; Varible declarations. All variables are typed and explicitly ;; initialized. Const is default. Non const is mut. It's an ;; expression language with statements inside let blocks. A let block ;; returns the value of the last expression. (let ((:: (uint32_t) a) 123) ((:: (mut uint32_t) b) 456) (set! b (+ a b)) b) ;; There is no goto. We'll implement tail calls instead. That leaves ;; just loop and conditionals. (while ...) (if ) ;; The language allows the usual abstraction/application, with some ;; restrictions on closures that cannot be implemented statically. ;; Function calls that do not block are treated as primitives. Any ;; other applications will cause creation of continuations. ;; Continuations are represented using nested C structs/unions ;; following the sm.h model. It seems important to keep continuations ;; explicit to allow for exact memory allocation, e.g. not using ;; stacks. This restricts control flow but in practice is likely not ;; an issue. ;; It appears that most of this language is trivial apart from the ;; representation of continuations, so start there. Ignoring type ;; annotation, the simplest blocking program would be a copy process ;; like: (while $t (let ((line (read-line))) (print line))) ;; where read-line is the blocking function, and optionally print is ;; treated as rendez-vous, but for now we can just assume buffered ;; channels. Entry: Continuations Date: Tue Jan 7 18:51:19 CET 2020 Some reflections after working with sm.h. I do not want nested polling as is used in sm.h's CALL mechanism. A state machine should be compiled to: - one function per continuation - single flat function with switch() or computed goto The latter two are essentially the same. The former is also essentially the same if there is no need for shared context that could be implemented in the body of the function. Let's assume individual functions. It's probably beneficial to use ANF, i.e. to have explicit stack frames. Stack frames map directly to nested unions. E.g. The "root" union is the collection of resume points in the main function. Suppose 3 resume points A,B,C this then maps to the following nesting. Suppose each of these has 2,1,3 resume points respectively: union k { union { struct { ... } frameAA; struct { ... } frameAB; } frameA; union { struct { ... } frameBA; } frameB; union { struct { ... } frameCA; struct { ... } frameCB; struct { ... } frameCC; } frameC; }; Something like that. Should tail recursion be supported? Probably not. Loops are likely simpler to work with. This will be a language that supports macros, so much can be built on top. To make this more understandable, create some example programs and print stack frames at suspension points. Basically abstract interpretation will need to be used to enumerate all the suspension points. Only programs that have a finite number of suspension points will be representable. Entry: Abstract interpretation Date: Tue Jan 7 19:10:45 CET 2020 To perform control flow analysis, abstract interpretation will be essential. Entry: Manually Date: Thu Jan 9 09:36:36 CET 2020 Go through the steps manually. There are two problems: collecting the current stack frame, and resuming it. Resuming should probably be implemented by letting the code do explicit pointer chasing. E.g. e->var1, e->var2, .... This should probably be extended to extending environment for closures. E.g e->var is current stack frame, while e->p->var is a variable in the parent frame So from the point of application it is straightforward: just use nested C structures/unions. The trick is then to convert suspension points into nested unions. The difference between doing this statically and dynamically is that all continuations need to be known at compile time. This is a serious control flow analysis problem, and for sure doesn't work for all programs. What are the restrictions? ( Note: only blocking functions need to be considered here. Non-blocking functions are treated as primitives. ) The algorithm is straightforward - enumerate all call sites - for each call site: - collect stack frame - recurse This way it is ensured that recursion cannot represent loops. Note that nested unions/structs aren't necessary because memory is stored linearly. The only time two pointers are needed is for (downward) closures, but we are probably not going to use those much because iterators probably eliminate them completely. Entry: Downward closures or iterators? Date: Thu Jan 9 09:56:14 CET 2020 E.g. Backus' functional forms. Does it matter? Maybe they are really the same in most use cases? The core question is: does a closure need to be treated as a value, or is it always immediately "inlined". I don't think closures will be that hard to implement, as they just have a pointer to an extra stack frame. It's also only just one stack frame. Note that it is the _thread_ that has multiple frames (the continuation is compisitional). Closures close only over their local frame. Entry: Representing the stack Date: Thu Jan 9 10:04:03 CET 2020 There's no reason not to grow downward when all frames are known. Entry: Dedicated HW Date: Fri Jan 10 12:31:54 CET 2020 So full circle: create a Forth CSP machine? Core idea here is to create a simple task switcher. I like the idea of using Forth as a low-resource implementation language, but use a specification layer on top of it to express semantics. Entry: Cheap rendez-vous Date: Fri Jan 10 13:00:40 CET 2020 How to do that in scheme? Basically, eliminate context switches. Implement context switches as jumps or calls. If the overhead is gone, a lot of buffering isn't needed, and the implementation probably comes closer to what can be done in hardware using dedicated machines. Try this! It's an important bit. Entry: Summary Date: Fri Jan 10 13:01:57 CET 2020 - represent continuations explicitly as C structures, might lead to readable generated code even. - cheap rendez-vous and task switching Entry: Cheap task switching Date: Fri Jan 10 13:27:53 CET 2020 I wonder how many programs revert to "loop over buffer element" patterns because task switching is perceived to be inefficient? If task switching is made cheap, i.e. just switching a context pointer, how would that affect programming style? Entry: Continuation Date: Fri Jan 10 13:59:33 CET 2020 So it's clear: the core abstraction is the continuation. Entry: Manual example Date: Fri Jan 10 14:16:52 CET 2020 Create a manually compiled example, then refine it to something that can be automatically compiled. The DMX example is actually quite nice, as it has 4 event kinds: USB input, DMX output, DMX input, USB output. This is going to come back in a lot of data converter tasks. Example of code generator output. Work bottom up: start by manually creating some C code, then model the compiler to produce that kind of output. The task is a CSP copy task: (let ((a0 123)) ;; some constant present in all frames (loop (let* ((a (recv c0))) ;; a is not present in recv k (a1 (+ a 1))) (send c1 a1) (send c2 a1))) - This has 3 continuations for the three channel rendez-vous points. One of the sends can be left out to simplify. - This already exposes ownership and lifetime of data, which is a general problem. We solve it by storing the data to be transferred in the sender's stack frame, and letting the receiver copy the data. - There is also a tail call here at the end of the loop, which destroys the stack frame. Maybe it is better to model that explicitly? No let's keep the loop as a primitive. It will make the reuse of stack frames simpler, allowing first class function to be avoided. - To represent in C: I'm thinking it might be simplest to generate one struct per continuation type. Otherwise the dereference notation is going to be too hard to read in the generated C code. - How to re-write things to not pollute continuations? E.g. if variables are short-lived they can just go on the C stack. Does it actually matter that we keep junk in the suspended contination? E.g. for the send continuations, the variable a is sill visible but no longer needed so could actually just be removed. Maybe it is best to not make the distinction at all in the beginning. Just allocate everything in the struct, then later when it is all working, add an optimization that allocates intermediate variables on the stack if they are not used after a point. - Simplification: program will need to be in ANF such that the value passed into a continuation has a well known place at the top of the stack. - C representation: incoming and outgoing continuation do not have any required relation, but in practice they will share the bottom of the stack. I would like to avoid copying. However, there might be many intermediate forms. This is going to take bookkeeping, so maybe local variables that do not make it into the closure should just not be used. - The type of the output stack depends on the code path! So how do we know what the continuation output type is? From the function that interprets it. So continuation struct types are always associated to the associated function, and only that function knows how to interpret the data. - Growing upward is going to be simplest. Fits best in C struct view. - Let's assume that the continuation can be modified in-place, and it is the responsibility of the function to do this. - About output: it's really not clear how to anticipate the shape of the output continuation. The more I think about this, the more I believe that explicit stacks might no be the way to go. Otoh static guarantees are nice. - Loops and conditionals need to be handled somehow. These could be expressed as continuations, but there's a distinction between non-blocking and blocking continuations. I ran into this before doing the DMX converter box: the main loop structure doesn't correspond to blocking points. This is exactly why you would want to automatically transform to CPS, partially. - Harping on that: maybe compiling to flat form is better. That way 'goto' can be used and we can have tail calls. And suspension is just taking the label of that goto. Let's rework the example. - If the program is in ANF form between control points, then it would probably be simpler to manage stack frames because there is only on C struct per "chain", that could be partially filled. - Recursive calls can also have copy optimization to re-use the context. Was a good exercise. The conclusion is that it isn't nearly as simple as anticipated, but some constraints got clarified in the process. Important conclusions: - use ANF, with the exception that non-block C functions can be collapsed into expressions. - suspensions and control structs (loop + conditional) are very different from scheduling point, but should be put on the same footing in the language. a single C struct per machine makes this possible. that also makes the generated code a bit more modular at the C level - the scheduler doesn't need to know the type of the contination. however a continuation's code pointer does identify the type of the data associated. obvious in retrospect, but wasn't starting out. - grow upward. easier for C structs. Entry: Summary exok.c and discovery of the core issue Date: Sat Jan 11 15:30:50 CET 2020 This is actually a simple and straightforward way to write CSP directly in C. It seems that it is only necessary to create the proper overlay of the continuation. struct env { uint32_t a; uint32_t b; uint32_t c; }; #define COP(var,klabel,typ) \ k->pval = &(var); \ k->next = &&klabel; \ k->op_type = typ; \ return; \ klabel: #define SND(var,klabel) COP(var,klabel,1) #define RCV(var,klabel) COP(var,klabel,2) void task(void *next, void *ctx, struct k *k) { struct env *e = ctx; // FIXME: not possible in general if (next) goto *next; e->a = 123; again: RCV(e->b, k1); e->c = e->a + e->b; SND(e->c, k2); SND(e->c, k3); goto again; } The thing that is "almost right" here is the assumption that we can use the same struct for the input and the output environment. That is where the core of the problem lies to convert this to C. The rest is really straightforward standard ANF/CPS stuff. In the example above, the stack is only growing, so the unused part of the C struct is simply ignored, but no notational overhead is created. In a more elaborate example, the C stack will "fork". It's not quite clear how to handle that. One trick could be to use inline struct definitions. However in general it might be necessary to throw an optimizer at the problem. What is quite surprising is that this might actually work as a guide to write CSP in C. I.e. it is important to see that the conversion from task to state machine is entirely on the inside of the tasks. The scheduler doesn't need to know. It treats the task as a state machine. That means that I can take my time to design the taks compiler, but I can already start working on and using the scheduler. Entry: summary Date: Sat Jan 11 15:51:01 CET 2020 Summary of summary of summary. Whatever, I'm learning. - computed goto has all the advantages - this works for all kinds of state machines, not only sequential programs - an ANF state machine is just a special case of what this can host. manually written machines work just as well Entry: scheduler Date: Sat Jan 11 16:00:19 CET 2020 So how do you write a CSP scheduler? This is essentially wingo's talk, but much simpler. Simplify: - Tasks can only suspend on channel read write. - When a task suspends, if it has a partner, resume the partner, then resume the task. Otherwise, add to wait queue. There is one complication: select needs to be able to wait on multiple channels. Entry: Is select the same as non-deterministic choice? Date: Sat Jan 11 17:11:56 CET 2020 Look this up. I understand Select, but they might be different than the other options. Entry: Implementing scheduler Date: Sat Jan 11 18:03:58 CET 2020 Does it needs queues? I don't thing we really care about sched order. Is there a simpler DS that can mark a channel for a waiting task? Yes, but it also then needs to undo the wait in case of select. Might be simpler if the number of channels is small. So: fun from chan -> (ks,kr) I'm getting too tired for data structures. Actually, can scheduling be done using a stack instead? Or maybe implementing a queue using two stacks. EDIT: Revisit from perspective of suspend. Basically, it is always possible to start from a state where everything blocks, because the input of the network can be monitored. I.e. channels that represent timers or I/O events. So whenever something happens, it is time to start scheduling. This can be done depth-first: - event happens on channel c0, e.g. from timer or external something. - there will always be a reader on this channel because the system is in a "relaxed" state. execute it. - it will produce a new continuation. if it is a send, there will be a corresponding receive. why? same case as before: there is no other way the system can advance. - if it is a receive, check the queues for a send. Maybe it's not great to think about this in a tired state. EDIT: It's wrong. Reads and writes seem to be really symmetric. It doesn't really seem to matter what happens between the two tasks. It's only important that they meet. Wingo was really stressing this, and it seems to be true from the scheduler's point. The reason to have two queues is just a filter. Let's assume schedulable tasks are a set. Entry: A scheduler, again. Date: Sat Jan 11 19:03:49 CET 2020 Blocked tasks are a set. We have a way to partition it into reading and writing, which is only necessary to limit the search, but it might be actually simpler to use a single data structure and iterate around it. What I'm looking for is a stop condition. 1. Precondition: everything is blocking. 2. Some external read or write happens. 3. Find a corresponding blocked task 3a. There is no task. Add it to the set and go to 1. 3b. There is a corresponding task. Remove it from the set. Run both tasks and execute 2. for each task. It is 3b that looks like a binary tree recursion. I.e. the continuations that just were created will need to be checked if they unblock anything, because nothing else will cause an unblock. So the "hot" list of unchecked suspended tasks has a different meaning than the "cold" list of suspended tasks. It can just as well be implemented as a stack. It doesn't seem to matter in which order the tasks are executed. ( I would have had a different life if I understood this earlier. This is really magnificient. ) So let's make some assumptions: - The "hot" list can be a stack. - The "cold" list is fairly implicit. We never need to enumerate. We just need two operations: - Given channel id, check if there is a blocked task - Given channel id with no blocked task, add a task Because of the "hot" list, there is no need to ever have a channel data structure that has two blocking tasks. So the data structure seems quite straightforward: A channel is a pointer to a blocked task, or NULL. It is actually possible that there is more than one blocked task, e.g. a channel can have multiple writers (or multiple readers, but that only makes sense for a multiprocessor). To implement multiple blocked tasks on a channel, it seems best to add a next_k pointer to the k list. EDIT: I think I have an implementation for just send and receive. What about select? Select is a read that can rendez-vous with a whole lot of channels. Maybe it's simplest to implement only single write and multiple receive? Entry: Multiple readers? Date: Sat Jan 11 20:34:02 CET 2020 So it is actually meaningful to have multiple readers, because each of those might actually be blocked on some output task. I wonder if it is then needed to have the dual of select: write to multiple channels, picking the first that is available. Good question. Let's assume this is not the case for now. AHA. This is like CML. Because a write can be an event that can be syncrhonized on. Maybe CML is just the realization that the duality of the rendez-vous is just a side effect of it not being general enough? Entry: Select Date: Sat Jan 11 20:41:03 CET 2020 Going over csp.c If a hot task is a select. It will need to scan the compatible channels until something is found. EDIT: Implementing select, it is clear that it's not possible to use the channel array trick because a task can appear in more than one queue, and I don't want to start messing with containers. Use a traversal instead. It will be slightly different: while hot tasks: for all cold tasks: rendez-vous? yes -> remove from both queues copy data resume requeue hot This requires minor changes to the loop function. An extension of is_rendez-vous, as it needs to check if a sender's channel is part of a select's channel. But the thing I don't really want to do now is to implement the queue remove. Note that the prime objective is simplicity. Speed is nice but not essential. EDIT: It's also clear now that making the choice between send and receive early leads to simpler code when matching. Entry: Remove element from list Date: Sat Jan 11 21:46:57 CET 2020 This requires iterating using a **. It's really just a pop in the middle of the list. Entry: Draft Date: Sat Jan 11 22:50:00 CET 2020 Current csp.c seems to be quite close to the initial vague idea. Entry: Manual tasks Date: Sat Jan 11 23:43:22 CET 2020 So basically, add a forth! Entry: Sending something from outside Date: Sun Jan 12 00:02:32 CET 2020 Is not easy, because the value that is contained needs to be held in a task's data structure until it is removed. Getting something out of the system is simple: it's just a side effect. Getting something in needs: - a task that halts after sending the data - a call to the scheduler Though we do not have a way to guarantee that the task will block. To overcome this, use a buffer structure and a task that reads from that buffer. Whenever an element is added to the buffer, restart the buffer reader task. You really switch worlds this way. Different rules apply on the inside and the outside! Entry: Task switch efficiency Date: Sun Jan 12 08:42:01 CET 2020 Currently I have a fairly dumb search, but it seems in general that it is not simple to create good datastructures. The case is complicated by select. In practice, multiple listeners on the same channel will be fairly rare, but the possibility does make direct mapping from channel to waiters a bit awkward. Entry: Generalize Date: Sun Jan 12 11:08:58 CET 2020 I'm almost immediately running into the case where I need to select on a read and a write. csp_send() only works when there is a reader already blocking, so to implement buffered channels, the only way is to use csp_send() to send an "interrupt" to a task that reads the buffer and sends it out. However, it might be possible that the send blocks, but that a new interrupt arrives. This means I need to select on both send and receive, and it seems that this is also what CSP can do. This messes up the data structure again, because a task can now be blocked both on send and receive. Also, select is over ops, e.g. receive channel 3, send channel 2. What needs to change? - Go back to only a single list. That should be straightforward. - Every operation is implicitly a select. Same as in CSP btw! - A task will need to refer to a list of channel operations. Those do not need to be part of the process struct, but will have to be stored in the process state. - The is_rendezvous will become more complex, because all combinations will need to be checked. - One way to quickly do this is to split send and receive operations in the select. Alright. Learning! Entry: Rework Date: Sun Jan 12 20:16:22 CET 2020 Some C bs. Code pointer leaking into. (gdb) p/x ((struct env1*)s)->op[0] $26 = {type_chan = 0x5102, msg_len = 0x5555, msg_buf = 0x0} Actually, the ops do not need to contain op type: it is known by the grouping whether it's in or out. EDIT: Works now Entry: select Date: Sun Jan 12 21:46:17 CET 2020 So next step is to actually use select for the buffer drainer. I want something like: switch(select(SEND(chan2,e->data), RECV(chan1,e->int))) { case 1: ... case 2: ... } EDIT: Added a draft. Needs sugar. Entry: CSP vs. reactive Date: Sun Jan 12 23:31:43 CET 2020 This is actually more flexible than a reactive system, and can be used to implement one, where each task is a many->many function that re-evaluates when any of its inputs change. Probably not the best way to implement it though, but who cares for low update stuff. Entry: What does this open up? Date: Mon Jan 13 09:54:55 CET 2020 - Reactive system - Allow decoupling of HW and logic: just replace the events that trigger evolution. Entry: forking lexical contexts Date: Mon Jan 13 13:42:41 CET 2020 I've removed most continuation compiler notes from csp_test.c A summary: - computed goto is a good compiler target, unifying control structures (implemented as plain goto) and suspension (computed goto). - the next step is to resolve the "forking stack struct" problem. e.g. how to bridge between C-style "flat struct" state approach, and a real continuation compiler which has a separate continuation type for each resume point. there is a lot of overlap but this is entirely coincidental. how to make that explicit? Entry: More efficient? Date: Mon Jan 13 16:06:53 CET 2020 Since there is no direct map from channel to tasks, this isn't particularly efficient. To make it more efficient: - When a task suspends, the select structure should be checked against the channel->task maps. - If there is a hit, the task should be removed from all other channel lists. - If there is no hit, the task should be added to the channel->task map for each task. Changes needed: - Tasks are no longer just in one list, so lists should be implemented separately. - Lists can still be stacks. - Lists: - hot, cold (mutually exclusive) - channel's read/write list It seems to become a lot more complex this way, but linear searches will be eliminated, so it will be more efficient when there are a large number of tasks. How much more memory is needed? Per channel: + a stack of tasks, one for read and one for write Per task: - link pointer There is probably a bound on the number of list elements in total, as a task will be: - either hot or cold (one cell) - part of n channel lists, where n is the select count So the upper bound is the sum of max select count per task. That is statically known. The hot/cold list can be kept the same, as this is only one byte per task. This makes the channel list an "accelerator structure". I.e. it can be optionally included to trade memory vs. speed. The pointers can be shrunk to 8 bit if the number of tasks is smaller or equal to 255, with one value reserved for null. Entry: Accelerator Date: Mon Jan 13 16:45:42 CET 2020 The code above uses a fairly dumb data structure which is memory efficient but makes scheduling slow. To accelerate, an index can be built for the channel->task map. For now, assume that the hot/cold list structure based on the next_task pointer can be kept. What is needed? - For each channel: - a list of blocked readers - a list of blocked writers It is generally not straighforward to determine the size of this list. The most typical case however is 1, so optimize for that. - A way to easily remove a task from the this. This can be done by following the task->channel map, and then searching the channel lists. - The cold list can probably be removed. Only a hot list is necessary to keep track of newly blocked tasks. The sizes of these lists are not easily determined, so it seems best to place an upper bound on the number of list slots, and use a shared list cell pool. Also, if the number of tasks is relatively small, indirection could be used. Since this requres a base table, it seems to add a lot of complexity for not a whole lot of gain. Revisit the idea later. The upper bound on the number of cells needed is determined by the maximum select count for each task: for each select entry, the task is added to the channel's list. So total number of extra bytes needed: - Tasks * MaxSelect * 8 - Channels * 2 * 4 It does seem to create quite a bit of overhead. Suppose we can use IDs for task lists and tasks, the count then becomes: - Tasks * 5 (id -> pointer table + backpointer in task struct) - Channels * 2 * 1 This requres that the total number of active selects also fits in 8 bit. Re-arranging tasks so they are in a single array avoids the back pointer, but requres a secondary pointer. Basic conclusion I come to is that keeping it simple is proably going to be the way to go. Pick speed or pick memory efficiency. Add a cell pool that just has pointers. Entry: csp indexed scheduler Date: Tue Jan 14 17:04:00 CET 2020 It's much more direct now. Compared to the original search, this should be a lot faster. I don't think I need to test speed to know that I don't need the old implementation any more. This will be a lot faster in the common case where there are a lot of blocked tasks and each scheduler run only schedules a couple of tasks. It does need more memory. It's itching to optimize this but maybe measure memory usage first, then see what makes most sense in practical scenarios. Also needs an allocator for channels, which is usuful in its own right. The hot list can still use the pointer inside the task. Entry: External receive Date: Thu Jan 16 20:04:02 CET 2020 This could implement a "pull" network as well. It's completely symmetric. The counterpart read to synchronous send is obvious. That also means there should be a conterpart to the buffered input, but this is less obvious. Maybe it's just a poll of a buffer? Maybe execute a network until a buffer is full? A typical use case is an interrupt that will pick up something from a buffer. The interrupt should perform the read and signal the network to fill the buffer again. In general it seems something needs to be buffered, but it is not really clear what. Both input and output buffering might be necessary due to deadlines, so a third element (WFI main loop) is actually essential if both need buffering, otherwise advance the network from input or output interrupts. Entry: SELECT macro Date: Thu Jan 16 20:34:19 CET 2020 Something like this: switch(SELECT(b, {EVT(b->c_data, b->token), EVT(...)}, {EVT(b->c_int, b->nb_bytes)})) First clause is send, second is receive. This doesn't work well with macros. Maybe use a begin/end construct? SELECT_SEND(s,2) EVT(s, 0, chan1, var1) EVT(s, 1, chan2, var2) SELECT_RECV(s,1) EVT(s, 2, chan3, var3) SELECT_END { case 0: ... case 1: ... case 2: ... } These seem possible, but awkward. Maybe just stick to raw? Quirky idomatic, but quite clear actually. CSP_EVT(b, 0, b->c_data, b->token); /* SND */ CSP_EVT(b, 1, b->c_int, b->nb_bytes); /* RCV */ CSP_SEL(b, 1/*nb_send*/, 1/*nb_recv*/); I can' figure out how to turn it into an abstraction, so use switch(task->selected) {}. Entry: priorities Date: Fri Jan 17 14:56:16 CET 2020 Implementing priorities requires some extra machinery: - Push hot tasks by priority - For each channel, push cold tasks by priority However, without pre-emption, this doesn't seem to be very useful. Entry: CSP Date: Fri Jan 17 15:00:07 CET 2020 Basically, this seems done. I don't think I need to add priorities. If that is necessary, use an existing RTOS. I think this mostly needs a break. It was fun, but it is pretty much finished if I want to keep it simple. What remains is syntax: - Make select prettier - Convert continuations to state machines Entry: The always waiting task Date: Sat Jan 18 06:57:18 CET 2020 Instead of using a buffer, it's also possible to use a task that is always waiting for a buffer write. Basically, a blocking task behaves as a 1-slot buffer per channel it listens on. Buffering seems to be a system design issue. Once you know the entire network, it's possible to say a couple of things about how data will propagate. See next post. Entry: Relation between CSP and dataflow Date: Sat Jan 18 06:59:50 CET 2020 This is more about the guarantee not to block. E.g. suppose data is coming in over USB. Take the recent DMX box example. I want to guarantee that an external CSP send will succeed. Is there a way to express that in CSP algebra terms? In the example, a buffer is needed in beteen USB input and DMX output. It doesn't matter where it goes, but it has to be there. Entry: Reactive systems Date: Sat Jan 18 15:18:45 CET 2020 A reactive system can be something like this: - select on input - update value - select on outputs Maybe the semantics isn't really perfect, since simultaneity isn't handled well, and neither is fanout. For a reactive system we really want continuous time semantics, which would not create multiple events if all input values changes at once. It seems reactivity is really more about time, while CSP is timeless. It might be more useful to think about time provision. EDIT: reactive dataflow = CSP + memoization? Entry: An OS is CSP + time base? Date: Sat Jan 18 15:24:08 CET 2020 CSP gets time from the outside. We're going to want to introduce time in almost every application. So maybe create a wrapper for it? Do this more general: input events are uC interrupts. This will always work. Entry: CSP and interrupts: two levels are enough Date: Sat Jan 18 15:28:17 CET 2020 To fit in the current setup, i'm tempted to run the CSP machine from interrupt, but that really doesn't work due to the need for pre-emption. The case of a intterrupt-per-char on uart is actually quite common, and frequently processing will be irregular: the character that completes a message will take longer to process than the one that is just buffered. So general rule: - intterupt data goes into buffers - main loop uses WFI to wake up The same issue happens on the FPGA. The problem is TDM: one representation of the data requires frequent small operations, another requires infrequent large operations. This decoupling is essential in computing in general: the "time warping" is what makes things practical. So I've identified two cases: character vs. data packet. This can likely grow larger as well, involving processes that operate on many data packets. But these do not seem to need special requirements. Why does the interrupt need special attention? Because it has a time base that doesn't stretch! The interrupt is tied to real time. Once buffered, time becomes relative, i.e. at the fine grain, only causality matters, and the relation to real time is there at a much coarser level. So one thing to take away is that the need for priorities ALWAYS has to do with some exeternally imposed deadline. Reading that back it is rather obvious. I just made a weird detour to get to a very simple conclusion. Entry: Priorities Date: Sat Jan 18 15:43:04 CET 2020 So can this be used to introduce priorities in the CSP scheduler? Suppose there is a chain of events that has a time constraint. This chain goes through a number of processes. Likely, each process in turn will unblock the next one. What is the scheduling decision that drives this? There are only three points where a priority decision can be made: - Take next element from hot list. This can grow large if a single event has a cascading effect. - For a particular channel, pick an element from the cold list. In most cases this is just one, so this is not the most likely contributer. - When waiting for multiple events, pick the highest priority one. The first two are trivial: use ordered insert. The latter is simple to implement: instead of picking first event in select list, pick the corresponding task with highest priority. It involves a search. Is it possible to re-arrange things such that this can also be just an ordered insert? Context: this still doesn't have pre-emption, but it could probably accomodate "hot inserts", meaning that if an interrupt occurs, the corresponding event should be inserted as soon as possible. WFI->scheduler is not enough. It needs to also perform the event. For each buffer, it needs to perform a poll and add to the hot list. Poll also needs to be repeated inside the scheduler in case of priorities. If we need to poll on every event it becomes expensive. It seems best to use a single interrupt "stream", which is then treated as high priority. So the general story isn't simple. Data structures will need to be changed to allow priority, and interrupts need to be mapped to hot tasks. Another: this isn't enough: we need to also pick a hot task based on what task it will unlock, which is essentially priority inversion. The priority of a hot task is the max between itself and the tasks it will unlock, and all the way transitively. Entry: Priority inheritence, opaque tasks Date: Sat Jan 18 16:07:32 CET 2020 Suppose the highest priority task blocks. Its events now become high priority, and anything that can potentiall unlock it should become high priority. What we want to do is to solve the transitive problem: there might be a low priority task that potentially unlocks the high pri one through a chain of dependencies. But there doesn't seem to be a way to do this. E.g P1 -> P2 -> P3 P2' Where P3 is high pri, P1 is low pri, and P2' is mid pri. If it doesn't know how to promote the chain, it will run P2' first. This should read like this: P1 can be scheduled, which would unlock P2, which then performs a write which will unlock P3. The scheduler cannot see this because tasks are opaque. There doesn't seem to be a way to promote P1. Maybe it should be done the other way around: maybe events should have priorities. E.g. suppose a high priority write (e.g an interrupt) unlocks P1, any subsequent actions should be treated as high priority and executed first. Maybe it is even enough to make the scheduler greedy? E.g. suppose an interrupt creates a hot task. This is scheduled, possibly unlocking more. Those unlocked tasks should be treated as high priority. But not all of them would be. I can't wrap my head around it.. Entry: Priority inside the network Date: Sat Jan 18 16:10:36 CET 2020 Note that inside the CSP network, priorities do not make any sense. Tasks do not know about time. Priorities only relate to I/O timing behavior: i.e. if the CSP network somehow ties to I/O with timing properties, then we can schedule tasks in a particular order such that we influence the real time spent between tasks. Entry: Topics to expand on Date: Sat Jan 18 17:54:33 CET 2020 - priority queues - priority inheritance ? - external interrupts The latter is straightforward. If there is no danger for hardware buffer overflow, run it from ISR. Otherwise run it using two priority levels: pre-emptive interrupts into buffers + main loop that polls buffers on interrupt. Entry: Priority levels Date: Sat Jan 18 18:09:57 CET 2020 Priorities do not seem make sense without pre-emption. Without pre-emption, there is only external in->out delays, which in theory could be sped up by not running low priority event paths. Is it so that a high-priority task can only become runnable through an external interrupt? Seems so. How to make scheduler pre-emptable? All list updates need to be atomic. That's about it. The procesor's interrupt priority can then be used to re-enter the scheduler. However, this does not solve priority inversion. Hairy problem. I likely won't need this in regular data-shuffling applications. I can see it being important in hard real-time scenarios. Maybe just postpone? Entry: Next? Date: Sat Jan 18 18:49:18 CET 2020 1. Forget priorities. Really. 2. Implement an example, e.g. DHT11 driver, RS485 sensor network. 3. Create a syntax layer Of these, the latter will be the most useful. Entry: Syntax layer Date: Sat Jan 18 18:50:43 CET 2020 + Would allow a decent select macro. - Requires finding a C subset. This will be awkward, and should probably be forked off as a separate "C lisp" project. - Requires resolving an impedance mismatch between flat C structs and continuations. Entry: Representing continuations : union of maximum frames Date: Sat Jan 18 18:52:28 CET 2020 The main point is to avoid copies. It's probably ok to create a separate C structure for each continuation point. The stack can be split into 3 pieces: shared - old shared - new The issue is in overwriting old. In a standard implementation this happens naturally. Can it be done naturally in a similar way? There is no real difference, other than that there are many transient forms of stack frames, and not all of them are relevant. How to split this up? Generate all the stack frames: every ANF operation has an associated frame. If the operation is blocking, the frame is reified. Or.. The store is just there, but there is a different interpretation at each stack frame. This is likely the simplest form, but requires a lot of structures. Using short typedef names for the structs: s1,s2,.... and a short name for the stack, the ANF sequence looks like: ((f1*)s)->a = 123; ((f2*)s)->b = ((f1*)s)->a + 1; It could be implemented using a union. union { struct { int a; } f1; struct { int a; int b; } f2; } s->f1.a = 123; s->f2.b = s->f1.a + 1; That can then be further simplified by reducing the number of structures by using e.g. f2 where f1 would suffice: s->f2.a = 123; s->f2.b = s->f2.a + 1; That would avoid quadradic clutter in the generated C code. There will be no issues with undefined variables as definition will be ensured by code gen from proper ANF. This actually seems quite straightforward. Maybe later find a way to reduce notational clutter, but for now readability isn't a big issue. As long as names are preserved it would be fine. Summary: - Each operation has its associated environment. All forms are 'let*', meaning the environment grows, and previous variables are accessible. - Casts are done for each reference. Casts can be implemented using unions. - Map each frame to its "maximum", and only generate and use those in C code. This removes quadratic explosion. Simple, but the combination is what makes it workable. When this is represented in Haskell, it's probably possible to parameterize and infer types, so instead of starting out with a data type, maybe start out with a finally tagless form? EDIT: Simple, but optimizer can't do much. In between resume points, we want to allow for optimization. So when taking a continuation, or when performing a tail call, it is probably best to copy any additional variables. But this then immediately opens up issues that previously frozen stack frames no longer have the same size. It's hairy: gets very difficult very quickly. Entry: CSP and FRP Date: Sat Jan 18 20:20:25 CET 2020 So it works as long as full propagation is used: every recompute on any of the inputs, cache the inputs and push to all outputs. This does not work for "simultaneous" events, but naive push-style FRP also doesn't. Note that this is a real problem if there is any fanout. Entry: Seq or new ADT? Date: Sat Jan 18 20:29:00 CET 2020 Seq C is not complete. There are no loop constructs. This is almost completely complementary. I don't care about the primitives, but I do care about loops and continuations. Let's first do it separately. Entry: Semantics Date: Sat Jan 18 20:58:48 CET 2020 I get stuck very quickly. Too many issues at once. Some notes: - I can't see the consequences of implementing higher order functions, so implement only higher order "forms", such as iteration. This can be just goto. - However, I still want function calls to properly abstract over blocking operations. Can I do this instead with abstract inerpretation? Execute the program, produce a trace, and from the trace produce the stack data structures. Insisting on ANF and closures will make it too difficult to keep C semantics. Just a stack. Nothing fancy. Start with what can be done using a stack: - Non tail call function application (append to stack, hide). - Non tail call "form inline" (append to stack, keep context) - Tail call: replace part of stack - Tail call "form inline" (same, keeping context). These are 4 operations. Summarized: - call - form inline - tail call - Observation: - "hiding" is only necessary when it is possible to have name clashes. - inlined forms are interesting, elaborate on that Entry: Inlined forms Date: Sat Jan 18 21:12:08 CET 2020 This is a trick to use something that looks like the application of a function, but is easy to implement. Given a form like this: (let ((a 123)) (map (lambda (x) (+ x a)) '(1 2 3))) Where a closure is consumed immediately by a higher order function. This can be re-arranged by inlining the 'map'. Basically flipping things around: the stack frame used to implement map will be hidden to the rest of the function. So what does this actually do, when we're restricted to stacks? 1. It inserts the stack frame of map (e.g. iteration control variables) in a way that is not accessible to the body of the lambda. 2. It inserts the stack frame of the lambda. so we have the following stacking: a=123 (iter control, invisible) x=1,2,3,... So basically I should define a machine that can do this kind of stack frame interleaving, then optionally "hide" frames. EDIT: Is an inlined form the same as a downward closure? The DC would be explicit. Rust solves it by moving ownership, and is much more flexible. Entry: those 4 stack-only control structures, revisited Date: Sat Jan 18 21:23:10 CET 2020 - function call: - hide current frame (essential!) - append new frame (args & params) - tail call: - remove current frame - replace with new ... - inlined form - keep current frame - hide form's control frame - append new frame - same, tail call version, might not be necessary? Hiding the caller's frame is essential, because the function can be used at any point. The invisibility is an enabling property! I re-invented the frame pointer. To represent a function call: store return point and previous stack frame. This is exactly what happens in C function calls. Tail call will likely need a memcpy to prepare the new frame on top of the old one (or the C stack), and put it in place. Also not really rocket science. The interesting part is the inlined forms. This is what is "new" in the language. But.. the only one that I'm going to need is a Rust-like "for". The more I think about this, the more it is obvious that I should not write the closure compiler, but use Rust instead. Keep the CSP at what it is: something simple. Still. It's a nice exercise. Entry: Spawning a temporary task with static memory Date: Sat Jan 18 21:55:04 CET 2020 It just uccured to me that tasks can easily be nested. As long as lifetime is properly managed, one task could host the storage of another task and spawn and kill it. E.g. to implement a sequence. This thing is a real eye opener. Once a scheduler is in place, anything that would ordinarily be a state machine can just be a task. Maybe work on the memory usage, because this is just too useful. Entry: Scheduler structure: less memory Date: Sat Jan 18 21:59:35 CET 2020 Currently: csp_task: 12 + n_ev * csp_evt: csp_evt: 8 csp_chan_to_evt: n_chan * 8 csp_evt_list: n_ev_tot * 12 Most of these can be halved by using 16-bit packed pointers. The expensive parts are the events: 20 bytes/ev. The smallest controller I run this on has 20kB. This really isn't all that bad. It can be halved without too much effort, by using encoded pointers (RAM_BASE + 4*i). Works for up to 256kBytes. For controllers that have more memory it problably isn't necessary. It will make it a lot more awkward. So don't optimize. Entry: A single interrupt queue Date: Sat Jan 18 22:17:02 CET 2020 So, if there are no priorities, it's ok to use a single queue for interrupts as well. Make sure they are all at the same hardware interrupt level, and use an application main loop (bottom half) that does: loop WFI while read_evt external send evt The interrupt event queue then should have: - channel (2) - len (2) - payload (n) Entry: Wakup ideas Date: Sun Jan 19 11:05:41 CET 2020 Probably nothing. - Try to put everything inside the task struct. - Store channel data only once? Let's look at the event. How to turn it more into a synchronized function call? Sender provides a callback. In the cold list there is only ever one pointer to a particular tasks' event list. EDIT: The interesting relation is that a task's event is pointed to only by the channel, but a single channel can point to more than one task. One thing though: a channel can never have readers and writers at the same time, so a single list would suffice. Entry: CSP on HW Date: Tue Jan 21 04:57:24 EST 2020 The big lesson from implememnting this in C, is that the inside of a task really doesn't matter. It is about messages and syncrhonization. The rendezvous works particularly well in hardware. I still miss a good way to go from parallell to sequential. The core idea is to parameterize a "CPU stripper". Entry: Tasks Date: Wed Jan 22 07:15:47 EST 2020 Dreaming about this. It is very flexible. E.g. it's possible to emulate (blocking) function calls with tasks. Many->One : gather results One->Many : dispatch worker Note a misconception: a worker isn't necessary about a CPU resource. It can be just as wel be managing/using some other resource, e.g. some process that evolves over time. Entry: Decoupling through channels Date: Thu Jan 23 08:32:12 EST 2020 You know this really solves the HAL problem. I didn't consider that at first. It's quite amazing how deep this thing goes. Entry: Reiterate FPGA bits Date: Fri Jan 24 06:59:40 EST 2020 So thinking in CSP style cuncurrency, what can be done on the FPGA? What is missing in my understanding to make this completely natural? E.g. writing state machines as communicating processes is now obvious. The same could be true for FPGA. Maybe the secret is in select? 1. Receiver being able to handle a number of things at once 2. Sender blocking until receiver has message Up to now I've been really focused on sender controlling the transfer: outputting a sync pulse, requiring receiver to be listening. This is not necessarily true. Ok that's clear: focus on rendez-vous. Entry: Rendez-vous in hardware Date: Fri Jan 24 07:04:48 EST 2020 Simplificatation: - Direction is hard-coded - One reader, one writer - Synchronous logic only Eiter one can arrive first. Sender: - Sets output signals, raises write - Waits for rendezvous signal - Lowers write, continues Receiver: - Raises read - Waits for rendezvoud signal - Samples, lowers read If the reader is first, to the writer this should look the same as sending out a non-synchronized pulsed write. This way we can still use non-synchronized feeders whenever we can guarantee that the reader is always ready. Sync can be removed as an optimization. Googling. Not directly related, this is for async: http://async.usc.edu/jungfrau_web/htdocs/new/research/current/verilogcsp/VeriloglCSP.pdf Entry: Representing state machines Date: Fri Jan 24 07:17:23 EST 2020 Create a modeling language for state machines that is optimized for code transformations that can pipeline the transition machine. The real problem I keep facing is that bit-level state machines are just too low level, and manual pipelining is a real pain to do, but at the same time it is itching: if this is represented better, sequentializing the transition machine should be straightforward. Entry: Pipeliner: dataflow vs. state machine Date: Fri Jan 24 07:23:24 EST 2020 So it seems quite obvious. Pipelining is simple: take a data flow graph and start adding delays. It's probably best to formulate it as a problem that can be solved using a SMT solver. There is one caveat: feedback. So for state machines it is really not all that obvious. It might be good to implement a simple machine like that and see how it goes. E.g. a pipelined counter, where the adder is pipelined. The ground rule is: a pipelined state machine will have a slower state update. It also seems that utilization will be low if functional units do not share a datapath. Conclusion: pipelined state machines need a programmable datapath. To re-iterate: this is an important insight. Easy to see when two regimes are put next to each other: - Feed-forward dataflow can be pipelined without consequences: all function segments are utilized at max rate, providing throughput, but at the expense of some I/O delay which is often not an issue. - If such a functional network is used as the update function of a state machine, the delay is very significiant: for an N-stage pipeline, the utilization is only 1/N. In this case it is probably more effective to collapse the different pipeline stages into one programmable datapath. The total machine's update rate will still be N times slower, but silicon usage might be better due to reuse. A CPU is the natural end stage of a pipelined state machine, where we create a programmable datapath that is universal. Entry: Programmable data path Date: Fri Jan 24 07:28:06 EST 2020 See last post. Pipelined state machines will want to be programmable datapaths, while pipelined feedforward networks have single-use stages. These are two "architecture islands" that have fairly little to do with each other. Conclusiong to make: - If speed is of the essence, try to get rid of state machines and turn it into a feedforward network. - An intermediate stage: each pipeline stage itself could use feedback, so purity is not really essential. - If speed is not essential, programmable datapaths are the way to go. Question: is it always possible to avoid state machines? Not really. The typical example is an integrator. However it does seem possible to try to make feedback loops smaller. Entry: What is human and what is machine? Date: Fri Jan 24 07:41:28 EST 2020 So what do I do with these insights? First conclusion is that it might be good to create a datapath generator. Another conclusion I had before is that unrolling is the easy part. Condensing onto a programmable datapath is more difficult as it _adds_ information. So human inginuity should be used to create a representation as a "maximally compacted" sequential program. Then an optimizer could be used to unroll. This brings us back to Forth. I don't know of a better sequential program encoder, as long as stack shuffling is avoided, but it seems that adding registers in an ad-hoc way will solve that. Entry: Programmable datapath Date: Fri Jan 24 09:28:49 EST 2020 Assuming: - code sequence is an abstract stream (RAM or other SM) - stack can be unfolded to registers - primitives can be expanded Forth really is a good way to turn program transformation into a simple algebra. And when the search space isn't too large, discrete solvers can be used to perform mapping. I know there is something here, but how to uncover it? Entry: Unrolling/unfolding Date: Fri Jan 24 09:34:18 EST 2020 Stick to that basic idea that unfolding is easier than folding. Unfolding fort to ANF: - creating an output register for each ALU operation - treating the stack as individual (named) registers From programming in Forth, stack shuffling can often be avoided by using: - "channels", or read/write pointers - the R stack for tucking away temporaries I think it's still a bit out of reach until I find some intermediate form. Maybe first finish an actual stack processor? Entry: Remove Time Date: Sat Jan 25 05:03:56 EST 2020 Create a programmable datapath, driven from UART or SPI. This would allow easy stepping. The core idea is to remove time from the equation. Use only event causality. That will make it possible to slow everything down, and also emulate it on a different substrate that is much easier to instrument. I don't think it is that hard to introduce rendez-vous into an existing unidirectional sync system. See above. Entry: CSP: it would be nice to avoid the copy Date: Sat Jan 25 12:43:47 EST 2020 It is possible to do that as long as tasks know that messages are only valid up to the next scheduling point. But let's not do this. It's probably better to create a buffer management scheme, and pass buffer references around. EDIT: It can be quite simple. E.g. interrupt uses 2 buffers and is responsible for swapping them Entry: Network sends... Date: Sat Jan 25 19:02:35 EST 2020 cannot be used for synchronization, unless a ping-pong mechanism is employed. Entry: FPGA Date: Sun Jan 26 05:36:17 EST 2020 What am I missing? First, the CSP style concurrency is fundamentally different from synchronous logic. Maybe work a bit to bridge the gap. Say I want to write a SLIP decoder as a sequential program, and compile it down to a state machine. This is non-trivial loop: read - case ESC: read case ESC_ESC case ESC_END default - case END send PACKET - default Yeah I'm really stuck at things like this. What does it require? - Map events to signaling conditions - Change tasks to update functions / state machines That should be it: no transition when there is no event, and if there is one or multiple events, perform a tie-breaking operation. Entry: Tie breaking allows use of sequential select Date: Sun Jan 26 05:52:47 EST 2020 This is a new idea: mimick the behavior of select in hardware, which requires tie breaking when events arrive at the same time. Assuming it is best to do only one event at a time. If not, split it up into two different machines. When all transition functions are pure, they can be implemented either as single-cycle, or using a programmable datapath. So the core ideas are: - use CSP style rendez-vous with tie-breaking select - implementation of transition functions can then be decoupled In othe words: CSP-style rendez-vous allows for machine synchronization, effectively removing real-time from the picture: we can reason at the event level. So DECOUPLE TIME FIRST. Once time is decoupled, transition functions can be implemented in a single cycle, or can be done over many cycles. This essentially isolates the problem of function computation. I am going to need a compiler and a universal state machine language. Entry: CSP language Date: Sun Jan 26 06:07:32 EST 2020 1. Assume task to state machine compiler. 2. The code between two selects is a function. Implementing those functions could be done with human assist. E.g. once the compiler has split the code into transition functions, it could probably be converted to C. But it could just as well be compiled to a hardware state machine. That compilation step could in theory be done by hand, e.g. if the compiler can generate the transition function specificiations (the part that is hard to do by hand), they could be implemented manually against a reference implementation. Entry: Target mapping Date: Sun Jan 26 06:22:38 EST 2020 Could be just implementation of functions. See previous post. Entry: Summary Date: Sun Jan 26 06:26:34 EST 2020 Too many ideas and too little focus. What should be done next? Creating a CSP language frontent seems to be an essential part. Targeting can then be solved separately. The interesting thing is that FPGAs are in reach as well through the combination of rendez-vous implemented as a hand-shake in hardware, and making select-to-select transition functions explicit, which allows for manual optimization if needed. Entry: Implementing compiler exploration Date: Sun Jan 26 06:50:33 EST 2020 It needs to be make concrete soon. Other things will surface during implementation. Entry: CSP lib: why? Date: Sun Jan 26 10:45:35 EST 2020 1. Started after writing yet another ad-hoc state machine sequencing program. CSP is much more uniform to write OS-less code. 2. This is a small library and composes. Not a framework dependency. Entry: Haskell QC test bench for C CSP programs Date: Mon Jan 27 08:26:04 EST 2020 Maybe simplest to decouple through packet_bridge. That way the interface is platform-independent and can be used in Erlang also. Entry: Rust async block_on Date: Mon Jan 27 19:44:11 EST 2020 https://stjepang.github.io/2020/01/25/build-your-own-block-on.html maybe not what i think it is? Entry: Ceu Date: Thu Feb 6 16:12:30 EST 2020 http://ceu-lang.org/publications.html Not general purpose, but it contains some nice ideas. Entry: Redo Date: Fri Feb 7 14:36:22 EST 2020 I wrote an Erlang implementation of redo, and I'm currently using it to write an Erlang implementation of make as well. See uc_tools_gdb_do.erl Redo fits in the general exploration of event-driven schedulers. Entry: switch reactive direction Date: Sat Feb 8 14:32:32 EST 2020 So a highly available distributed system acts as a reactive network where the causality switches direction. Each of the nodes can initiate a state change with all of the other ones following. I think that is the main idea between lamport timestamps and vector clocks: focus on causality. This still doesn't quite fix the "crossing paths" problem. So how is that usually tackled? Crossing paths is actually quite common in the following case: - system A and B both have a reference to some lazy resource C. - system A and B split - resource C disappears, but A and B don't notice it - both A and B have a need for resource C, see it is not there, and both pick a replacement: A picks D, and B picks E. - on merge, which should become the consensus? Maybe this is more convoluted than I set out. If they both implement the same interface, it probably shouldn't matter. Entry: CSP inside an Erlang process? Date: Wed Feb 12 08:08:59 EST 2020 So, going from composing machine functions into a single process, what about running a CSP inside Erlang as a scheduler for such state machines? This would be a nice playground. Main benefit is that it is contained in a single process and would be easier to manage and compose. CSP maps a collection of state machines to a state machine. Same for reactive networks btw. Maybe redo.erl can be abstracted like that as well? Possible to run inside a single process, or to be forked off into separate workers for parallellism. EDIT: Erlang would be a good testing ground for building synchronous state machines (each process has its own CSP scheduler), linked together over communication links. It's also a good compiler target to try out the closure compiler first in a simpler setting. So how to express rendez-vous like that? Inputs are straightforward. Outputs are too, actually. Looks like it is just defining select. Anyways. Might be best to go on top of Erlang processes anyway... The single-threaded thing is really a special case. Entry: redo: providing temp space to imperative code Date: Thu Feb 13 06:33:39 EST 2020 The thing that really strikes in redo is the way in which an imperative update function is given the "ownership" of the output node. Basically, it can do what it wants with that piece of storage while it is running. It is very nice to have this high level guard around a dirty, in-place imperative update. Write a C version? Since I have the example, what would that look like? Entry: distcc-style network? Date: Thu Feb 13 06:36:44 EST 2020 missing ingredients: - all nodes running the same code (use nix for this) - decent local cache for "constant" data - intermediate products should be kept local (scheduler optimizer) Entry: Making a redo evaluator in C Date: Thu Feb 13 06:45:39 EST 2020 Maybe the problem domain is not specified well: in practical situations, I think that dynamic dependencies might not be such a big deal. I.e. we will know the dependency network up front. So a better way to do this is to us a pure dataflow structure with an explicit memoized push/pull walker: start at changed input(s), and recursively pull any of the affected outputs. It will be necessary to have both o->i and i->o lists available. It doesn't really look like rocket science. Question is more: do I have a use for it? I think atm a more explicit CSP style approach is appropriate on uC/FPGA. It would be nice to reframe reactive code as CSP. Basically it is just adding memoization: each process is responsible for its own output node. Once it has evaluated, it will keep sending out the same value until a global signal resets the "phase" of the algorithm. It does seem rather inefficient to do it like that. Entry: danielg on #rust about async compiler Date: Mon Apr 20 13:23:58 EDT 2020 12:16 < doelie> Hi. Quick question: who implemented the gnarly details of the async->state machine conversion? I'm trying to gauge how hard it would be to write a stand-alone tool that could do the same for C. 12:17 < doelie> I'm guessing some of the rust code would be reusable. 12:17 < doelie> Or if not, at least some of the struct packing heuristics. 12:50 < danieldg> doelie: it's actually pretty simple in theory. Make a struct that contains all local variables and use that, along with one "state" variable. Yield is "struct->state = N; return YIELDED"; the start of the function contains a switch/goto on state. 12:51 < danieldg> rust uses an (unnamed) enum to hold the local variables so that it can overlap variables that are never in scope at the same time, and uses the real stack for variables not live across a yield. But those are just optimizations. 12:52 < danieldg> the result is that the struct packing is not specific to the async code at all 12:59 < danieldg> doelie: imo it would be easiest to do it as part of an LLVM pass instead of a standalone C->C tool 13:10 < doelie> danieldg: 13:10 < doelie> danieldg: the llvm pass might indeed be a good thing to consider 13:11 < doelie> however i want to see how much extra work i need to do to just do it in c. 13:11 < danieldg> look for existing tools to create generators in C, it's the same thing 13:12 < doelie> any pointers? 13:12 < danieldg> no, just giving you google keywords :D 13:12 < doelie> yeah i did some looking already :) 13:13 < danieldg> I've only dealt with higher-level tools (matlab) that generate state machines and then C code from it 13:13 < doelie> i've not found a tool that can just do a C->C translation 13:15 < doelie> anways this is just to see if i'm not missing anything obvious. i did some thinking about the struct overlap to encode the continuations, and it looks quite similar to what rust is doing. 13:15 < danieldg> one important thing to consider is that an async rust function is actually two real functions: one setup that creates the state struct, and one that is called by poll() on the struct. Splitting functions like that in C is not as trivial. 13:15 < danieldg> unless you want to do heap allocation and function pointers all the time 13:15 < danieldg> I think C++ has some support for generators in its drafts, btw 13:16 < danieldg> might look at that 13:18 < doelie> no i want to use static/stack allocation. i've used something similar in the past, but it's quite heavy on the C macros and also doesn't do any compilation to state machine, just dispatch: https://github.com/zwizwa/uc_tools/blob/master/sm.h 13:20 < doelie> danieldg: i think that the main issue is going to be to find a good C parsing/printing library. i guess llvm can be made to spit out C code again. 13:20 < doelie> but i'd like to do this such that the resulting c code remains readable. 13:20 < j`ey> clang has rewriting tools 13:26 < doelie> j`ey: thanks. i found this post: https://eli.thegreenplace.net/2014/05/01/modern-source-to-source-transformation-with-clang-and-libtooling 13:26 < j`ey> > moder 13:26 < j`ey> n 13:26 < j`ey> >2014 13:26 < j`ey> :P 13:26 < doelie> yea good ideas dont age 13:27 < doelie> but it's enough to get unstuck for me. thanks you both.