Haskell DSL for synchronous circuits

Entry: Basic idea
Date: Fri May 25 14:10:30 EDT 2018

Is it possible to capture enough of a synchronous state machine to be able to do the same thing as with Pru.hs?

In general, an HDL represents a discrete event simulator. For clocked circuits, the simulation becomes a lot simpler: at every tick, register inputs are read, and an update function is computed for each register. So the basic unit to work with is the register.

For now, assume MyHDL as a target. The idea is to produce blocks that look like:

    @always_seq(CLK.posedge, reset=None)
    def counter():
        count.next = count + 1

Abstract the CLK and reset completely. How to construct an embedded language around this idea? There are essentially two elements: 1) combinatorial functions, 2) registers. A register is directly tied to the function that computes its next state, so it makes sense to use a Map for this.

MyHDL doesn't use registers per se, but uses signals. If a signal's .next is written to, it behaves as a register. In other cases it is possible that a signal is just a wire. I find this very confusing.

Entry: Signals
Date: Fri May 25 14:11:39 EDT 2018

To make this work, it is necessary to first understand what Signals are. Stick to MyHDL as basic semantics. Somehow I don't really see why signals need to be so complex. Maybe just stick to what MyHDL does? At the very least, I need to distinguish between intermediates (wires) and registers.

I think I understand. The difference is whether something appears in a combinatorial or a sequential block. So we have some context in which we can do bindings. Semantics of signals: 1) exactly one driver, 2) from comb creates a wire that cannot have loops, 3) from seq creates a register which can create loops across clock ticks.

Let's try to make this basic program work:

    counter a b = do
      comb $ do a <-- not b
      seq  $ do b <-- b `add` 1

Two blocks, one combinatorial and one sequential, using a unary and a binary operation.

Maybe signals can be implicit? The do notation's <- can be used to create combinatorial signals. Then a 'set' operation can assign a combinatorial signal to a register. Following through, this code

    counter b = do
      a  <- inv b
      b' <- add b $ L 1
      set b b'

    main = do
      printl $ mapToList $ compile $ signal >>= counter

leads to the following network structure, where signal n is driven by driver d, shown as (n,d):

    (0,Reg (Signal 2))
    (1,Comb1 INV (Signal 0))
    (2,Comb2 ADD (Signal 0) (L 1))

The combinatorial ones are straightforward. The first one says that signal 0 is driven by a register, whose input is driven by signal 2.

I'm going to rename Reg to Delay, and separate out constants so they are explicit drivers of signals:

    (0,Delay 3)
    (1,Comb1 INV 0)
    (2,Const 1)
    (3,Comb2 ADD 0 2)

Making a next iteration where signals are explicitly driven, allowing both explicit combinatorial and sequential drive.

Entry: fix?
Date: Fri May 25 19:10:21 EDT 2018

Added a fatal error for signals driven more than once. Now, how to avoid this from happening by using a functional representation? Basically, create some kind of fix operator. Again, a counter.

    -- A counter from a register fixed point operator
    counter' :: forall m r. RTL m r => r Sig -> m ()
    counter' = regFix inc

    regFix :: RTL m r => (r Sig -> m (r Sig)) -> r Sig -> m ()
    regFix f r = f r >>= next r

So that's straightforward. To create state machines it is still necessary to have a naked 'signal' that can set up the register in the first place.

Ok... so what's next? I will have to represent state machines as functions with I/O. E.g.
two input, two output:

    r Sig -> r Sig -> m (r Sig, r Sig)

For the code generator, open functions are definitely necessary, but for the emulator it is ok to just work with traces for now. A test bench doesn't need an input.

Entry: Generate signals
Date: Fri May 25 20:27:24 EDT 2018

    --- test_edge
    (0,Const 4)
    (1,Delay 3)
    (2,Const 1)
    (3,Comb2 ADD 1 2)
    (4,Comb2 SLL 0 1)
    (5,Delay 4)
    (6,Comb2 XOR 4 5)

Maybe a good exercise to write this as an interpreter. I can see this problem of turning a network into a function reoccur. In this case, assume we know 6 is the output. What do we need to know?

- collect all registers
- compute output function, stop at registers
- compute update function for each register

Entry: Abstracting fix?
Date: Fri May 25 20:54:19 EDT 2018

This can be done using arrows. What I want is something that looks closed, but can be opened up again. I've been here before. Then I used existential types. Having an open representation doesn't seem to be necessary as long as state traces are made available. So maybe the first thing to do is to compile to traces?

Entry: Arrow / Category
Date: Fri May 25 22:30:13 EDT 2018

So what about instead of modeling as functions, we model as a category?

Entry: RTLEmu
Date: Fri May 25 22:36:48 EDT 2018

Just continue filling things out. Start by only collecting register signals. But how to represent a hole? Basically, op2 ADD needs to apply a function to something. What about inserting an environment into the monad, and using circular programming? This is a new idea: compile an entire network to a function taking register state as input.

Done. Also managed to compute register init using circular programming. What is missing is a good way to produce an output. Currently, it needs machinery to untag the Emu.R wrapping and perform register dereference in case the return value is Emu.R (Reg Int). Unfortunate, but it seems a special "output" command will need to be added to evaluate source code. If it needs to be added, then make it into a list so it is generic. OK, done.

Entry: It is very annoying not to have constants
Date: Sat May 26 01:51:08 EDT 2018

EDIT: Seems really hard to fix.

Entry: Remove combinatorial drive?
Date: Sat May 26 02:05:28 EDT 2018

Seems like it's not needed.

Entry: fix instead of next?
Date: Sat May 26 08:56:51 EDT 2018

If it is removed from the interface, there are no longer issues with 0 or >1 assignments to registers.

Entry: Is it necessary to model wide signals?
Date: Sat May 26 09:04:57 EDT 2018

It seems the metalanguage is enough. Generated HDL can be "flat". Likely, the compiler will pick up the pieces. It would be nice to be able to embed signal types in Haskell types, but it seems more trouble than it's worth. Overall it seems too much trouble to use lists of bits. Use sized integers instead. This is a compiler, so there is the extra compile-time function evaluation to resolve these issues.

Entry: Reset value
Date: Sat May 26 10:33:22 EDT 2018

I don't really like this, but I guess it can work... Signals can have default values. If a signal is not driven, its default value should be used. Constants are currently implemented as undriven signals. EDIT: No: make ints explicit.

Entry: Wire transposition
Date: Sat May 26 11:16:16 EDT 2018

Unpack a signal into components. This really makes me want to represent signals as bit vectors.
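A sketch of what such unpacking could look like, written against the slice primitive that shows up later in this log (the signature and the index convention are assumptions, so treat this as a sketch rather than the implementation):

    -- Unpack an n-bit signal into n 1-bit signals, LSB first.
    -- Assumes slice w (Just hi) lo extracts bits [hi-1 .. lo].
    unpackBits :: Seq m r => Int -> r S -> m [r S]
    unpackBits n w = sequence [slice w (Just (i+1)) i | i <- [0 .. n-1]]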
Entry: Export a module
Date: Sat May 26 11:29:01 EDT 2018

Because we only do sequential logic, it is essential that we are able to export standard (MyHDL) modules that can be included in a simulation that also handles asynchronous logic. The basic idea is that:

- async logic is sometimes necessary for interfacing, but is too low-level to be used for the bulk of a circuit
- synchronous logic is simpler, and this simplicity can be used to express it in a simpler language

Entry: Renaming RTL to Seq
Date: Sat May 26 11:31:06 EDT 2018

It is clear now what the purpose is: to create a language to describe sequential processing. RTL is too confusing. It might mean the same in spirit, but is a loaded term otherwise.

Entry: Conditionals
Date: Sat May 26 13:27:51 EDT 2018

Or, conditional assignment. Can I get away with implementing this as a primitive? It will be implemented as a multiplexer, so the signals will be there already.

Entry: Traces that do not have generators
Date: Sat May 26 16:06:30 EDT 2018

Trying to get an input sequence fed into a trace. Getting into a knot with that.. Maybe it's easier to embed a sequence right into the fabric, as a special kind of update equation around a register? Basically, I want to use the second operand of 'next' to magically produce a value such that the output register state contains information to compute the next value, and 'val' produces the current one.

    -- register drive
    next (R (Reg sz _ a)) (R b) = do
      vb <- val b
      let ifConflict _ old = error $ "Register conflict: " ++ show (a,old,vb)
      modify $ appOut $ insertWith ifConflict a vb

Can't wrap my head around it. Seems inconsistent. What is the real problem? To distill the initial state of registers. Is there a more direct way to compile this to (s, s->s)?

Entry: (s,s->s)
Date: Sat May 26 16:23:24 EDT 2018

First, start by storing the register defaults in the state map. Currently they are in the instructions.

EDIT: So I have something that works, but it's not pretty. The problem is really that register types are deeply embedded inside the code and there's no good way to get to them other than just executing with dummy inputs. A proper way would be to provide actual types for those dummy inputs, but here the 0 value will do.

Entry: Inputs?
Date: Sat May 26 18:27:19 EDT 2018

Still didn't get very far. A similar probing approach could be used. Ideally, this is encoded in the types, but I'm not going to put in the effort. OK, done. Wasn't that hard in the end. Just confused, I guess.

Entry: RAM
Date: Sat May 26 20:20:26 EDT 2018

What I need most to make an application is emulation for RAM. The RAM in iCE40 is not clocked.

Entry: MyHDL
Date: Sun May 27 00:25:01 EDT 2018

Basic print seems to be working. How to re-enable expressions? Or does it not matter? Likely, MyHDL will flatten expressions to internal nodes. Or not? Look at the HDL output maybe? In any case, it will make the output more readable.

So how to decide to inline a node? When it has only one user and it is not a Delay node. This isn't so hard to compute. The user list for each node can then be rendered such that:

- signal definitions and .next= lines can be skipped
- code is recursively inlined when printing the expression

Entry: fanout
Date: Sun May 27 01:01:14 EDT 2018

To compute fanout, first make an iterator over all references. A good excuse to finally understand Foldable and Traversable. Foldable is enough, likely.
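The shape of that computation, as a minimal sketch; the Term constructors are taken from the netlists above, and the derived Foldable instance does the iteration over references:

    {-# LANGUAGE DeriveFunctor, DeriveFoldable, DeriveTraversable #-}
    import qualified Data.Foldable as Foldable
    import qualified Data.Map as Map

    data Op1 = INV
    data Op2 = ADD | SLL | XOR

    -- Node references sit in the functor slot, so folding over a Term
    -- yields exactly the references it contains.
    data Term n = Const Int | Delay n | Comb1 Op1 n | Comb2 Op2 n n
      deriving (Functor, Foldable, Traversable)

    -- Fanout: how many times each node is referenced across all bindings.
    fanout :: Ord n => Map.Map Int (Term n) -> Map.Map n Int
    fanout bindings = Map.fromListWith (+)
      [(n, 1) | term <- Map.elems bindings, n <- Foldable.toList term]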
Funny how implementing Foldable immediately required the generalization to Term t, which paves the way for nested data structures, by changing the type from Term Node to Term (Either Node (Term Node)).

Entry: Free Monad
Date: Sun May 27 08:38:41 EDT 2018

The above smells like the Free Monad: Free Term Node. See http://hackage.haskell.org/package/free-5.0.2/docs/Control-Monad-Free.html

So inlining is some form of fmap. Here's what worked. I had to really do it step by step to figure out the types of holes.

    inline :: (Int -> Term Node) -> Int -> Exp Node
    inline ref = inl where
      inl :: Int -> Exp Node
      inl n = f $ ref n where
        f (Delay _) = Pure n
        f term = Free $ fmap inl term

Parameterized node type:

    -- Inlining, terminated on Delay to avoid cycles.
    inline :: (t -> Term t) -> t -> Exp t
    inline ref = inl where
      inl n = f $ ref n where
        f (Delay _) = Pure n
        f term = Free $ fmap inl term

Generalized to a predicate:

    -- Inlining, terminated on Delay to avoid cycles.
    inline' :: (Term t -> Bool) -> (t -> Term t) -> t -> Exp t
    inline' p ref = inl where
      inl n = f $ ref n where
        f term = case p term of
          False -> Pure n
          True  -> Free $ fmap inl term

    inline = inline' p where
      p (Delay _) = False
      p _ = True

Now I want to extend this to keep track of whether a node was inlined or not. But first, can this be generalized further? Is there another case where a Bool is used to pick two alternatives? Yes, that's just if.

    inline' :: (Term t -> Bool) -> (t -> Term t) -> t -> Exp t
    inline' p ref = inl where
      inl n = f $ ref n where
        f term = if' (p term) (Free $ fmap inl term) (Pure n)

EDIT: Continued a bit. Straightforward in the end.

Entry: Encode signal types as types?
Date: Sun May 27 20:42:54 EDT 2018

This would avoid needing SType values, but would require either dependent types or some nasty tuple shit. Maybe type families?

Entry: Memory emulation
Date: Sun May 27 22:46:02 EDT 2018

I'm not going to do this using registers. Also, I need something that works synchronously. It seems best to do this as a sort of coroutine that sits in between two state updates:

- reading read_addr, write_addr, read_data
- writing write_addr

Entry: Patch two "coroutines"
Date: Mon May 28 11:47:37 EDT 2018

    (R S -> M (R S)) -> (R S -> M (R S)) -> M ()

This is a special case of reg. EDIT: changed regs to regFix, using a functor.

Entry: signal and next
Date: Mon May 28 14:28:09 EDT 2018

Itching to remove it, and replace it with regFix. But that's not really so important. Some implementations might be simpler that way.

Entry: Memories
Date: Mon May 28 14:32:50 EDT 2018

So the basic structure is there. How to actually use it? Memories are an implementation feature. Generally we will provide code that is parameterized by the memory's register interfaces. Test it with a dummy read/write. So it's the same as the general register fix: close over the memory interface.

    -- Dummy memory-using operation. For testing memFix.
    dummy_mem rd = do      -- mem reg in
      z <- int 0
      return ((z, z, z),   -- mem regs out
              [z])         -- test program output

Maybe this can be added to Seq? It's very useful to have without the need of pushing memFix in as a parameter. So I have a test, but no generic way to do this. The problem is that the interface changes with all this threading going on. It would be more convenient to tuck it into the monad. I want a "makeMemory" function.
    dummy_mem rd = do      -- mem reg in
      z <- int 0
      return ((z, z, z),   -- mem regs out
              [z])         -- test program output

    dummy_mem2 makeMem = do
      memFix <- makeMem
      [o] <- memFix (t, t) dummy_mem
      return $ o

How to do this without parameterizing the class? Memory is a general case of external I/O. I want a generic way to embed abstract state threading. There are two ways:

- Find a way to hide it in the main monad
- Use custom trace functions

The latter really doesn't seem like a good idea. I think this is a job for existential types.

Entry: State, revisited
Date: Mon May 28 15:46:37 EDT 2018

So the question: include more state in the state monad by parameterizing it, or use explicit state in the trace (run) function. Start from the use case again. I'm writing HDL code that is somehow parameterized by a number of memories. The implementation of these memories should probably be abstract, such that it can be filled in during test time. With this code, I want to:

- generate it as HDL, where it will be combined with an HDL stub to patch it into the memories.
- generate it as a test function that can provide the memories and produce an output sequence.

My "main" program can just be parameterized by the memory interfaces, collected in a functor.

Entry: Allow non-monadic Seq constants
Date: Tue May 29 12:29:06 EDT 2018

Functional dependencies should be able to constrain r -> m. This is a deep change. Will take a bit of time.

Entry: I don't understand Free
Date: Tue May 29 14:59:03 EDT 2018

I made a small change: Term n -> Term (Op n), and now I cannot fix the inliner. Need to redo it from scratch. Start by writing a template that just uses return to produce the data type, then refine. This already exposes the bulk of the wrapping problems. Is there a way to implement the behavior of free without putting in all the wrapping?

EDIT: Ok, I get it. I discovered unfold, and wrapped the Free monad inside a WriterT . ReaderT to perform String rendering with indentation.

EDIT: I didn't get it. Then discovered liftF.

    -- exprDef = inlineNode ref                                -- 0 levels (Delay cuts off)
    -- exprDef n = liftF $ Compose $ ref n                     -- 1 level
    exprDef n = (liftF $ Compose $ ref n) >>= inlineNode ref   -- 1 level + inline

Entry: regFix bug
Date: Wed May 30 03:09:19 EDT 2018

    test_regfix = SeqEmu.trace' $ do
      let t = SInt Nothing 0
      regFix [t,t] $ \[a,b] -> do
        a' <- add a 2
        b' <- add b 3
        return ([a',b'],[a,b])

    --- test_regfix
    [[0,0],[3,3],[6,6],[9,9],[12,12],[15,15],[18,18],[21,21],[24,24],[27,27]]

So it seems the problem is here:

    regFix :: forall f m r o. (Applicative f, Traversable f, Seq m r) =>
      f SType -> (f (r S) -> m (f (r S), o)) -> m o
    regFix ts f = do
      rs <- sequence $ fmap signal ts
      (rs', o) <- f rs
      sequence_ $ liftA2 next rs rs'
      return o

    --- test_regfix
    [[0,0],[3,3],[6,6],[9,9],[12,12],[15,15],[18,18],[21,21],[24,24],[27,27]]
    -- bindings:
    (2,Comb2 ADD (Node 0) (Const 2))
    (3,Comb2 ADD (Node 1) (Const 3))
    (0,Delay (Node 3))
    (1,Delay (Node 3))
    -- output:
    [Node 0,Node 1]
    -- inlined:
    2 <- (ADD (NODE 0) (CONST 2))
    3 <- (ADD (NODE 1) (CONST 3))
    0 <- (DELAY (NODE 3))
    1 <- (DELAY (NODE 3))

So why is node 3 bound twice? Both interpretations appear to do the same thing. The nodes get created, and they get bound, but not to the correct thing? Very strange. There you have it:

    *Main> sequence $ Applicative.liftA2 (\a b -> return (a,b)) [1,2] [3,4] :: Maybe [(Int,Int)]
    Just [(1,3),(1,4),(2,3),(2,4)]

It will compute the outer product. So lists can't be used! A ZipList wrapper is needed. Is there a way to express that two functors have the same structure?
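For comparison, the ZipList applicative gives the elementwise pairing; a quick check in plain Haskell:

    import Control.Applicative (ZipList(..), liftA2)

    -- The list Applicative takes the outer product;
    -- ZipList pairs elements positionally.
    outer, zipped :: [(Int, Int)]
    outer  = liftA2 (,) [1,2] [3,4]                                   -- [(1,3),(1,4),(2,3),(2,4)]
    zipped = getZipList (liftA2 (,) (ZipList [1,2]) (ZipList [3,4]))  -- [(1,3),(2,4)]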
I guess converting to ZipList would work. EDIT: Used explicit zipWith next + toList in regFix. Then converted all other lists to ZipList. Maybe best to use dedicated functors.

Entry: Why is there a pipeline delay?
Date: Wed May 30 11:13:51 EDT 2018

Maybe before figuring this out, simplify the interface to the monad. It's not necessary to abstract 'val'. The user can easily do this in the testbench stub. Maybe today is not a good day for refactoring. Get a nap.

However, it seems that I'm introducing a delay by introducing those registers as actual registers, on top of the delay created by feeding back the memory's I/O state.

EDIT: Damn I'm tired but I can't let go of this. So this is how it should work:

- User stub should use registers
- But memory stub should use the constant output before it is fed back.

I think this can work with the existing setup. Basically, only the rData register is an actual register. The memory is a combinatorial function from the control inputs to rData, decoupled by that register. EDIT: Yes that was it. Makes the code a lot simpler too.

Entry: ZipList
Date: Wed May 30 13:27:19 EDT 2018

Put an error message in SeqEmu such that a missed ZipList case shows up in the evaluation of 'next'. Then figure out why I'm actually running into this problem.

Entry: CPU, sequencing
Date: Wed May 30 18:31:27 EDT 2018

I need a break from the more abstract stuff. What to do to make this usable?

Entry: myhdl interface
Date: Mon Jun 4 15:47:06 EDT 2018

Set up basic test bench -> vcd compilation in hatd project. The question is though: who generates the signals? I don't think it's really necessary to let the MyHDL side do this.

Entry: concat
Date: Sun Jun 10 10:28:50 EDT 2018

"concat" is a MyHDL function: http://docs.myhdl.org/en/stable/manual/reference.html

    concat(base[, arg ...])
    Returns an intbv object formed by concatenating the arguments.

This is one of the big differences between logic and CPU instructions: wires are arbitrary. It's probably time to use MyHDL style integers: intbv vs. modbv. About that: modbv seems to make most sense as a default, but I do want to leave open the possibility. Maybe make an explicit type?

Entry: SeqTerm SType
Date: Sun Jun 10 11:49:44 EDT 2018

Where does SType go? Ultimately, the dictionary should have type annotation, because the dictionary contents are used to define the MyHDL signals. But the straightforward way to do that introduces duplication: both Term and Node would have type annotation. Maybe that's just the way to go without a lot of restructuring, because SeqTerm uses a writer, not a state monad? Yes, let's just do duplication.

Entry: Trigger filter
Date: Sun Jun 10 14:50:05 EDT 2018

The circuit is simple, but requires a lot of primitives that are not yet implemented:

- just did "concat"
- bit vector indexing
- equality

It makes sense to implement the shift register separately, and also have a look at what the HDL output is for the MyHDL code. EDIT: Create the primitives needed to build a shift register first.

Entry: MyHDL output
Date: Mon Jun 11 10:33:03 EDT 2018

It's probably possible to join all combinatorial assignments in the first block, and put the sequential ones in the second block.
From:

    from myhdl import *
    def module(CLK, RST, s0, s1):
        s4 = Signal(modbv(0)[8:])
        s2 = Signal(modbv(0)[8:])
        s10 = Signal(modbv(0)[8:])
        s6 = Signal(modbv(0)[8:])
        # s1 is an input
        @always_comb
        def blk1():
            s4.next = (concat((s2[7:0]), s1))
        @always_seq(CLK.posedge, reset=RST)
        def blk2():
            s2.next = (s4)
        @always_comb
        def blk3():
            s10.next = (0 if (s4 == 0) else (1 if ((~s4) == 0) else s6))
        @always_seq(CLK.posedge, reset=RST)
        def blk4():
            s6.next = (s10)
        @always_comb
        def blk5():
            s0.next = (s10)
        return [blk1, blk2, blk3, blk4, blk5]

to:

    from myhdl import *
    def module(CLK, RST, s0, s1):
        s4 = Signal(modbv(0)[8:])
        s2 = Signal(modbv(0)[8:])
        s10 = Signal(modbv(0)[8:])
        s6 = Signal(modbv(0)[8:])
        # s1 is an input
        @always_comb
        def blk1():
            s4.next = (concat((s2[7:0]), s1))
            s10.next = (0 if (s4 == 0) else (1 if ((~s4) == 0) else s6))
            s0.next = (s10)
        @always_seq(CLK.posedge, reset=RST)
        def blk2():
            s2.next = (s4)
            s6.next = (s10)
        return [blk1, blk2]

Just keep the order? Maybe not... I wonder if this actually works as expected, or if each combinatorial value really needs to have its own block. Or maybe a Signal is not required? Damn, I ran out of steam before it's finished...

It seems that it is possible to use "naked" intbv, e.g. http://www.antfarm.org/blog/aaronf/2008/03/myhdl_example_avalonst_error_a.html

    # (j is large enough to hold any index into intermediate.)
    j = intbv(0, min=0, max=2 + len(i_err))
    for i in range(len(outputMapping)):
        j[:] = outputMapping[i]
        o_err.next[i] = intermediate[int(j)]

What question am I supposed to ask? What is the difference between Signal(intbv(...)) and intbv(...)? EDIT: Difference between cell and value.

For combinatorial networks, it seems that this is not allowed:

    @always_comb
    def f_bc():
        b.next = a
        c.next = b

The following should always work:

    @always_comb
    def f_b():
        b.next = a

    @always_comb
    def f_c():
        c.next = b

For combinatorial code, the used values are in the sensitivity list. Suppose a is the output of a register. At a clock edge that updates a, f_b will run, which updates b, then f_c will run, which updates c. It will not be so hard to make a test case for this.

Entry: Make a test bench for semantics
Date: Mon Jun 11 12:12:17 EDT 2018

Basically, compare the python simulation output with the Seq output. The simplest way to do this seems to be to generate a python data structure and have the python script verify it.

Entry: parsec + template haskell?
Date: Mon Jun 11 12:16:40 EDT 2018

To make some notational abstraction over "do"?

Entry: Haskell / Python bindings
Date: Thu Jun 14 21:06:56 EDT 2018

https://john-millikin.com/software/haskell-cpython

Entry: Preinc/postinc access
Date: Sat Jun 16 17:19:58 EDT 2018

Trying to get the combinatorial / register split right for pre/post inc/dec memory read/write. For some reason, there is something that isn't quite clicking about how this is supposed to work. I guess I want to really see a single cycle instruction memory work on the FPGA. The read and write happen on rising read/write clocks. This is the same behavior as a register. I guess what I'm looking for is a more detailed description of an SRAM. To believe it, I guess.. I don't exactly see where the feedback "fix" is located when it comes to a clocked memory. That is the main issue here. For a register it is simple: one in / one out. So what is it like for a memory? There must be some latch in there somewhere. When in doubt, look at the simulation. It captures semantics, obviously..
    @always(write_clock.posedge)
    def rtlwr():
        if write_enable:
            memory[write_addr].next = write_data

    @always(read_clock.posedge)
    def rtlrd():
        read_data.next = memory[read_addr]

So the memory does behave as a register. fixMem is correct for the readAddr -> readReg part, but I'm not sure about the write. EDIT: It seems OK, but I want to see it. If readAddr == writeAddr, there should be a two cycle delay. This is something that isn't hard to test on actual hardware. Make some tests for Seq for these corner cases.

    test_mem_delay = SeqEmu.traceState ([empty]) m where
      t = SInt Nothing 0
      m = SeqEmu.fixMem [t] $ \[rd] -> do
        c <- counter $ SInt (Just 8) 0
        return ([(1, 0, c, 0)], [c, rd])

    --- test_mem_delay
    [[0,0],[1,0],[2,0],[3,1],[4,2],[5,3],[6,4],[7,5],[8,6],[9,7]]

So indeed, the delay is two, passing the two registers (read register, and internal SRAM modeled as a register). How to test this on a scope? If the counter is 8 steps, and bit 2 is put out, it will show up as quadrature delay.

Entry: latches / registers and pure functions
Date: Sat Jun 16 17:59:00 EDT 2018

https://forums.xilinx.com/t5/Implementation/why-latches-are-considered-bad/td-p/200291

    The output of a combinational circuit is a function of input only
    and the circuit should not contain any internal state (i.e.,
    memory). One common error with an always block is the inference of
    unintended memory in a combinational circuit. The Verilog standard
    specifies that a variable will keep its previous value if it is
    not assigned a value in an always block. During synthesis, this
    infers an internal state (via a closed feedback loop) or a memory
    element (such as a latch).

Entry: Push a stack
Date: Sat Jun 16 19:10:34 EDT 2018

So, the memory has a 2 cycle delay. I wonder if it is necessary to make a 2-step processor? What is the simplest thing to do? Speed is not an issue for the task at hand. Simplicity is more important. That means no pipelining: results of the previous instruction step should be available in the next.

Assume:

- stack architecture (working reg + top of data stack = ALU input)
- no need for pipeline delays

Inputs:

- instruction word
- working reg
- stacks: read register

Outputs:

- instruction pointer
- working reg
- stacks: write address + data

Assume a combinatorial path between those. What can be implemented without delays?

- jump
- alu (top of data stack + working reg -> working reg)
- stack pointer (e.g. inc / dec)

Lost it... Start simpler. Start with:

- ip, wreg
- jump
- conditional jump
- inc / dec / load

So... this looks really interesting, but DO NOT do this for work. Keep the FPGA circuits simple, and put the control logic in the CPU.

Entry: Arrows?
Date: Sun Jun 17 01:41:02 EDT 2018

It would be just Kleisli. But basically, after Conal indoctrination, I really like the applicative version more. Functions, be it monadic functions, seem to make more sense. But maybe just try it?

https://www.reddit.com/r/haskell/comments/4fkkzo/when_does_one_consider_using_arrows/

    While the exact mathematics doesn't seem to have been worked out
    exactly yet, it is well known that Applicative+Category has "about
    the same" expressiveness as Arrows. Using Strong from profunctors,
    you can prove Strong + Category has exactly the same expressiveness.
    http://www.fceia.unr.edu.ar/~mauro/pubs/Notions_of_Computation_as_Monoids.pdf
    gives most of the story and
    http://www-kb.is.s.u-tokyo.ac.jp/~asada/papers/arrStrMnd.pdf gives
    the rest, but you need to read between the lines and see that using
    Category gives you a way to model monads in Prof.
    class (Strong p, Category p) => Arrow p
    instance (Strong p, Category p) => Arrow p

There. I fixed it. So basically, forget arrows.

Entry: Kleisli arrows
Date: Sun Jun 17 02:38:30 EDT 2018

Maybe this is a way to write more oneliners?

    a <- add x <=< add y z

Entry: MonadFix?
Date: Sun Jun 17 03:07:04 EDT 2018

ArrowLoop is defined for Kleisli if MonadFix works. Won't work for fixReg, because it's still an actual fixed point operator.

http://hackage.haskell.org/package/base-4.11.1.0/docs/Control-Monad-Fix.html

    purity: mfix (return . h) = return (fix h), if h :: a -> a

Entry: Applicative
Date: Sun Jun 17 03:59:01 EDT 2018

The main reason not to use them is that Kleisli composition is enough. However, applicatives do start making sense when state machines are lifted to simulations, because the values become "pure" in some sense.

    (a -> m b) -> Stream a -> Stream b

Also

    (a, b) -> m c
    a -> b -> m c
    Stream a -> Stream b -> Stream c

are all kind of the same thing. So it should be possible to define an Applicative instance.

    (a -> b -> c) -> m a -> m b -> m c

So this brings me to the question: is (m a -> m b) the same as (a -> m b)? For sure there is:

    t :: (m a -> m b) -> a -> m b
    t f = f . return

    t' :: (a -> m b) -> m a -> m b
    t' f m = m >>= f

So the conversion is generic and seems to be unique, so these seem to be isomorphic. Then for the multi argument function, it does seem that the order becomes important:

    t'' :: (a -> b -> m c) -> (m a -> m b -> m c)
    t'' f ma mb = do
      a <- ma
      b <- mb
      f a b

So there is some form of arbitrariness here. The problem goes away in the uncurried version:

    ((a,b) -> m c) -> (m (a,b) -> m c)

What does this all mean? It seems that currying/uncurrying Kleisli arrows is not unique. This points at the "default order" for Applicative functors: left to right.

    ap :: m (a -> b) -> m a -> m b

Entry: Practical
Date: Sun Jun 17 14:53:55 EDT 2018

Anything practical that still needs to happen? SeqEmu.hs can be simplified. I especially do not like the clumsy external state threading approach, and how the state monad is not really computing the register state update. It computes an update function. But maybe that is enough. So basically, memories work, and inputs work, but I haven't used them together yet. Anyway, today is not an insight day..

Entry: Stacking monads
Date: Mon Jun 18 10:23:50 EDT 2018

As a user interface, it is probably possible to stack another StateT. But how to do that without interference? It's not a problem to stack multiple, I believe. I think it will just pick the first one it finds. So that would be from the user's perspective. But how to "tag" the inner state monad used in the emulator? To make this work properly, it is probably best to put it all into one monad. But how to solve the problem of multiple isolated states? This has to be a problem people have run into before.

http://blog.ezyang.com/2013/09/if-youre-using-lift-youre-doing-it-wrong-probably/

    As everyone is well aware, when a monad transformer shows up
    multiple times in the monad stack, the automatic type class
    resolution mechanism doesn't work

EDIT: Maybe time to simplify. The reader monad isn't really necessary, so take the current parameterization to be an extra component to the state monad. EDIT: Looks like the writer monad is already gone.

How does this compose? I want to be able to add an effect transparently. It seems the only way to really do that is to use existential types. Say there is a "fixReg" operator for each kind of state, which will add a state element to a list and record a way to access it.
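A hypothetical shape for such a list element, with the state type hidden behind an existential quantifier (a sketch; the names are invented):

    {-# LANGUAGE ExistentialQuantification #-}

    -- Each element carries its own state type s, invisible from the
    -- outside; all that can be done with it is run one update step.
    data SomeState m = forall s. SomeState s (s -> m s)

    stepAll :: Monad m => [SomeState m] -> m [SomeState m]
    stepAll = mapM $ \(SomeState s u) -> do
      s' <- u s
      return $ SomeState s' u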
Added a list of existential state types. Next is to find a way to separate the "allocate" and "use" phases, in a way that is similar to a register. Maybe this thing could be just an extension of a register? That actually makes more sense. EDIT: So I made some room for this. Now what? EDIT: Can't be made of just registers, because those get initialized on every tick, so it is definitely different from a register. The interface should be "a special kind of register". That way, I/O can be modeled as registers as well. Now fixMem can be implemented using an analogue to "signal", e.g. "memory", to create a register that holds e.g. a Map, or some other state structure.

Entry: ExtStates
Date: Tue Jun 19 12:39:53 EDT 2018

Currently, only memories are ExtStates. Wrappers are in place. Now put this behind an existential interface. EDIT: So the interface should provide a fix function:

    f ExtState -> M (f ExtState, o)

Note that it will be hard to commute the f with the existential state, unless f goes inside the existential type. Existential types are really hard to use.. So an abstract state from this perspective is a fix function. Start with the original fixMem:

    fixMem :: (Zip f, Traversable f) =>
      f SType -> (f (R S) -> M (f (R S, R S, R S, R S), o))
      -> f MemState -> M (f MemState, o)

What the state _does_ is to implement an interaction that is only visible at the register level. So the interaction interface needs to be generalized. Let's use lists for now, then generalize.

    fixExtState :: ([R S] -> M [([R S], o)]) -> s -> M (s, o)

Basically, the abstract fix function takes an abstract register interaction, its own hidden state, and produces the result of that interaction together with the updated hidden state. There are some complications indeed with the interaction of f and the state. Ok, it is getting a little complicated to express in the current form, but the basic idea is simple: external state interaction is represented as a register interaction.

Maybe just build it from the ground up to get to a simpler version that is working (one memory), then generalize from there?

    (i (R S) -> M (o (R S), t)) -> s -> M (s, t)

Currently I don't know what to do with those i,o parameters, so really just use a flat list. I don't think at that point the structure of the I/O is really important. EDIT: Painted into a corner.. I need something to close this loop. Doesn't really matter what. Some essential element is missing.

Entry: Got it: existential + dynamic types
Date: Tue Jun 19 22:30:15 EDT 2018

    data type:          forall s. (s, s -> M (s, Dynamic))
    interface function: Typeable o => (s -> M (s, o)) -> s -> M o

So dynamic types are not visible outside of the implementation. I'm happy with it. The result is simple, though it was _not_ easy to write this! A couple of things need to come together without interfering.

Entry: PruEmu
Date: Fri Jul 6 12:06:52 EDT 2018

    type EmuState = Map EmuVar Int
    data EmuVar
      = File Int   -- register file
      | CFlag      -- carry flag
      | PCounter   -- program counter
      | Time       -- instruction counter
      deriving (Eq,Ord,Show)

How to extend this to arbitrary state? It no longer is a map to Int. It's too much work at this point. I just need something simple and move on.

Entry: MyHDL sim test
Date: Sat Jul 7 08:23:29 EDT 2018

1. Create the Seq circuit with its own test bench.
2. Generate a MyHDL module that can compile to FPGA code, together with some MyHDL wrapper code.
3. Same for a test bench.

EDIT: After a bit of doodling I end up with this:

- A testbench module has CLK, RST and a number of outputs.
- It is a python module with a function named "module"
- The module can contain a list named "output"
- If so, it is verified against the simulation output

TODO: Wrap this up. Ok. One thing that I'd like to figure out is how to specify output types for a module. Currently it is left to the instantiator to provide I/O signals, but how does it know what to instantiate? Maybe generate an instantiator in the .py file. EDIT: Anyways, not a huge problem. Still need to verify if the output can actually be synthesized to VHDL or Verilog. EDIT: I really miss having actual signal names. EDIT: Ok, I have a small TH hack that can be used to resolve this.

Entry: Named ports
Date: Sat Jul 7 15:54:25 EDT 2018

myhdl is parameterized by ports. The node type is opaque, so a name could be attached that way.

    myhdl :: (Eq n, Show n) => [Op n] -> [(n, Expr n)] -> MyHDL
    myhdl ports bindings = _

ports comes from SeqTerm.compile. So at any point where just the abstract syntax is provided, a list of port names using the TH hack can be provided as well. Solve this through a type class. EDIT: Ok, got it.

Entry: put this to the test?
Date: Sat Jul 7 21:49:13 EDT 2018

So here I am with quite a bit of sophistication in the tool, but not really the application that merits it. I should have probably just continued in MyHDL. But then again, from just messing with Python a little, I am quite happy with Haskell.

Entry: DSL, sharing and monad laws
Date: Sun Jul 8 08:47:08 EDT 2018

This keeps coming back, so I put it in SeqLib.hs. Typically, in a monadic language with internal nodes, when the same monadic value is passed to a 2-argument function like this:

    f ma mb = do
      a <- ma
      b <- mb
      return $ add a b

the monad is evaluated twice. There seems to be no simple way to avoid this. Period. If sharing is needed, use do notation. If there is no fanout, it's ok to use applicative notation to build nested expressions. Note that in an expression, there can never be any sharing, except for sharing introduced by combinators, and those could implement it correctly. E.g.

    dup m = do a <- m; return (a, a)

So the point about sharing is moot, really. The real issue is that a couple of interfaces are needed:

    a -> a -> m a
    a -> m a -> m a
    m a -> a -> m a
    m a -> m a -> m a

I see no good way to do this that is easy to use, apart from introducing a syntax manipulation step.

Entry: nailing down semantics
Date: Sat Jul 14 12:11:20 EDT 2018

Something isn't right with conc and slice. Make some tests? The problem here is clear: I want to look at the output of conc and slice for a variety of things, but I have no clear way to do that at the command line. The need to resort to the command line is a consequence of dynamic typing, i.e. the semantics is not completely captured in the type. Actually, this is readily available. Here "Trigger" is a module that has all relevant pieces imported, and ghci is entered through the nix/cabal setup.

    Prelude Trigger> take 10 $ SeqEmu.trace $ return [1]
    [[1],[1],[1],[1],[1],[1],[1],[1],[1],[1]]
    Prelude Trigger> take 10 $ SeqEmu.trace $ let v = Seq.constant (Seq.SInt (Just 2) 1) in do v' <- Seq.conc v v ; return [v']
    [[5],[5],[5],[5],[5],[5],[5],[5],[5],[5]]

So that's not it. EDIT: The error was in slice; shiftR had its args swapped. But I think I have at least some better way to look at tests now.

Entry: Seq, what does it do?
Date: Sat Jul 14 18:08:05 EDT 2018

- "applicative" vs "network" notation, i.e. no explicit assignment/binding
- state hiding
- implicit clocks
- pure modeling, easy to compose and test

The first is probably the most important one.

Entry: Practical stuff
Date: Mon Jul 16 10:41:57 EDT 2018

UART: transfer bytes into memory.

Entry: Clocks, RTL
Date: Mon Jul 16 10:47:01 EDT 2018

So there are two ways to look at digital circuits:

1) Go from event-driven to sequential.
2) Still, in the sequential domain, there is a need to represent signals that are somehow "clocked". The way to do this is to synchronize to the sampling clock, and add an enable bit.

The basic intuition here is that for FPGA design, signal flow needs to be directional, which meshes very well with functional programming.

Entry: UART and sub-clocking
Date: Mon Jul 16 11:32:54 EDT 2018

Problem factorization.

- Async receiver is sample pulse generator and shift register
- Async transmitter is the same, a function of sample pulse to shift output

The main abstraction here when using a single clock domain is to perform a subsampling operation. Given an operation, run it only when an enable bit is set.

Entry: subsample, when
Date: Mon Jul 16 11:42:14 EDT 2018

This is not trivial. To make it trivial, support for single-branch if is needed. It is clear now that this feature of HDL is essential! It's also clear why: in some cases, making the non-active path explicit will be quite an ordeal.

EDIT: So this is not an expression. It is something completely different. Do I really want to go that way? Can the enable just be pushed deeper?

EDIT: I don't like this construct, as it has far reaching consequences: the language is no longer just dataflow. But it seems to be essential to be able to express conditional register updates. Basically, the register "close" operation needs to be augmented with the enable. Is there another way? Yes there is. Make the "close" explicitly use an enable. One way to make this implicit is to use an environment. So what is worse: introducing dynamic scope, or rendering the language imperative? I'd say the latter. So:

1) you can always use explicit close + enable
2) dynamic scope is terribly convenient

Entry: Environment
Date: Mon Jul 16 12:17:35 EDT 2018

How to add an environment to Seq? The thing to keep in mind is to make this extensible. The consumer will be the close operation, the producer will be user code. Let's call these "withEnable" and "enable". Then allow room for extension later. No, this seems wrong. What am I missing? SeqEmu has an environment for the register contents. SeqTerm doesn't. Ok, added a slot in the environment monads, and parameterized "closeReg".

Entry: UART
Date: Mon Jul 16 13:51:03 EDT 2018

This should now be simple:

- detect start bit, go to "on" state and produce a number of pulses
- wait for stop bit and produce a parallel output strobe

Entry: Conditionals
Date: Mon Jul 16 15:44:39 EDT 2018

One of the hardest things to unlearn is that conditional branches do not "save work". Everything is instantiated. The structure is fixed. The only way to save work as can be done in a CPU through a conditional branch is to implement a (reduced) CPU with conditional branches.

Entry: functional dependencies
Date: Mon Jul 16 16:08:25 EDT 2018

    class (Monad m, Num (r S)) => Seq m r | r -> m where

Entry: Conditionals, again
Date: Mon Jul 16 16:26:43 EDT 2018

if' works only on signals, not on containers of signals! To make it work on containers, it needs to be lifted:

    async_receiver :: forall m r.
      Seq m r => SType -> r S -> m (r S)
    async_receiver t i = closeReg [bit,t] update where
      update [is_on, n] = do
        on  <- return [i, is_on, n] -- dummy
        off <- return [i, is_on, n] -- dummy
        (o:s') <- sequence $ zipWith (if' is_on) on off
        return (s',o)

Note that this is quite different from how MyHDL can have if branches that contain assignments. So I am _intentionally_ building a dataflow subset. The parallel nature of if muxes is quite explicit that way. Does this need special care for non-binary muxes?

Entry: VHDL and synthesis
Date: Mon Jul 16 17:48:24 EDT 2018

https://web.ewu.edu/groups/technology/Claudio/ee430/Lectures/vhdl-guidelines.pdf

Mostly boils down to:

- do not create latches
- beware of duplication
- some things are not synthesizable

I'm encouraged in the approach:

- to make registers explicit
- to model combinatorial networks as functions

Entry: Slowing down state machines
Date: Mon Jul 16 18:48:17 EDT 2018

A simple way seems to be to gate the inputs to all the registers. A disadvantage is that it is no longer possible to create 1-cycle pulses to be used elsewhere. A solution to that is to use edge detectors at places where the pulses are used, and encode the pulses differentially. But that gives problems with spurious pulses at reset. It doesn't seem to be such a great idea... Let's build a single async receiver and use an explicit "clock" input.

Entry: Enabled streams
Date: Tue Jul 17 08:40:25 EDT 2018

There must be something beautiful hidden in all this. I just can't fish it out yet. What is a stream? Data + enable. At the event level, it is data + clock. What am I missing? It's the misconception that a sequential machine is clockless. That only happens when all the timing domains are the same. To transition "clock domains", the only thing that's needed is to AND the outputs with the enable pulses.

So a "stream" consists of some data and an enable signal. It can be made a convention that the current context contains this enable signal. It is then the responsibility of the implementation to use that properly. I.e. if there are any "output clocks", they should at least be masked by the enable.

Entry: RTL misconceptions
Date: Tue Jul 17 08:44:57 EDT 2018

1) As long as there are no propagation delay timing problems, or external clock synchronization issues, there are no synchronization issues _inside_ the sequential representation.

2) There are still clocks, but they are of a different nature. They represent subsequences: points in time where signals are valid on a conceptual level. (They will always be valid on an electrical level.)

3) Conditionals are muxes, and are inherently parallel. It is a good thing to make this explicit. Conditionals do not "save work" as compared to conditional jumps on a CPU.

EDIT: About conditionals: it seems that state machine transitions can be factored into a bunch of circuits that appear "hoisted out" of the conditional, and some simple muxing.

Entry: Testing tagged streams
Date: Tue Jul 17 10:12:03 EDT 2018

Today's task is to find a good representation for testing tagged stream operators. The abstraction seems sound and high-leverage. EDIT: Got it. Added some QC tests.

Entry: QC generators
Date: Tue Jul 17 17:15:40 EDT 2018

Create some generators for:

- sequences of ints of limited bit length

http://hackage.haskell.org/package/QuickCheck-2.4.1.1/docs/Test-QuickCheck.html#v%3aforAll

Entry: Shifts
Date: Tue Jul 17 18:11:02 EDT 2018

SPI and UART use different shift registers, so the circuit cannot be reused unless a reverse is somehow implemented.
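One possible sketch, reusing the slice/conc primitives; the exact signatures and the bit ordering of conc are assumptions here, and this says nothing about what the synthesizer makes of it:

    import Control.Monad (foldM)

    -- Reverse the bit order of a known-width word: peel off the bits
    -- LSB first, then re-concatenate so that bit 0 lands in the MSB.
    reverseBits :: Seq m r => Int -> r S -> m (r S)
    reverseBits n w = do
      bs <- sequence [slice w (Just (i+1)) i | i <- [0 .. n-1]]
      foldM conc (head bs) (tail bs)  -- conc keeps the accumulator in the high bits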
Not clear what is best here, or if it is even necessary.

Entry: UART
Date: Tue Jul 17 19:22:53 EDT 2018

A UART is a counter that is started by a 1->0 transition, and performs sampling based on that counter. I find this hard to express without making a drawing. Here's an example: https://www.nandland.com/vhdl/modules/module-uart-serial-port-rs232.html

This points at one difficult issue: for some algorithms, assignment is a sparser way to encode. Another: https://www.fpga4fun.com/SerialInterface4.html

Entry: State machines
Date: Tue Jul 17 22:10:08 EDT 2018

I'm disappointed that this is so hard to express. I really want the convenience of multiple assignments. It's still possible, but requires boilerplate. So first, make a real state machine. Something simpler than the UART. Then iron out the language. EDIT: I have a basic skeleton for the UART. Needs some cleanup, but basically it seems ok.

Entry: MyHDL case statements
Date: Wed Jul 18 00:00:24 EDT 2018

http://docs.myhdl.org/en/stable/manual/conversion.html

    If-then-else structures may be mapped to case statements

    Python does not provide a case statement. However, the converter
    recognizes if-then-else structures in which a variable is
    sequentially compared to items of an enumeration type, and maps
    such a structure to a Verilog or VHDL case statement with the
    appropriate synthesis attributes.

I might have to do this differently. Currently, conditionals are duplicated at the meta level. It might be good to re-arrange that into imperative if-then-else. They get inferred differently: priority routing network vs. single mux. https://electronics.stackexchange.com/questions/73387/difference-between-if-else-and-case-statement-in-vhdl

I do wonder: in the case where the inputs are exclusive, does the combinatorial optimization figure this out?

Entry: Logic simplification
Date: Wed Jul 18 08:48:25 EDT 2018

I would like to know how Yosys does this.

Entry: UART debugging
Date: Wed Jul 18 08:57:13 EDT 2018

    [1,0,0,3] [1,0,0,3] [1,0,0,3] [1,0,0,3]
    [0,0,1,3] [0,0,1,2] [0,0,1,1] [0,0,1,0]
    [0,0,2,63] [0,0,2,62] [0,0,2,61] [0,0,2,60]
    [0,1,2,59] [0,0,2,58] [0,0,2,57] [0,0,2,56]
    [0,0,2,55] [0,0,2,54] [1,0,2,53]

Entry: more UART
Date: Thu Jul 19 15:08:27 EDT 2018

Sending out 5. Why does it see 5<<1 | 1? The sample pulses are fine. But it is clocking in the bit when the bit changes. It works when the register output is taken. EDIT: Important: do not propagate the _input_ of registers that use clock enable.
    -- 48
    [9,1,0,0,0,3] [9,1,0,0,0,3] [9,1,0,0,0,3] [9,1,0,0,0,3]
    [9,1,0,0,0,3] [9,1,0,0,0,3] [9,1,0,0,0,3] [9,1,0,0,0,3]
    -- 49
    [8,0,0,0,1,3] [8,0,0,0,1,2] [8,0,0,0,1,1] [8,0,0,0,1,0]
    [8,0,0,0,2,63] [8,0,0,0,2,62] [8,0,0,0,2,61] [8,0,0,0,2,60]
    -- 50
    [8,0,0,0,2,59] [8,0,0,0,2,58] [8,0,0,0,2,57] [8,0,0,0,2,56]
    [8,0,1,0,2,55] [16,0,0,0,2,54] [16,0,0,0,2,53] [16,0,0,0,2,52]
    -- 51
    [16,0,0,0,2,51] [16,0,0,0,2,50] [16,0,0,0,2,49] [16,0,0,0,2,48]
    [16,0,1,0,2,47] [32,0,0,0,2,46] [32,0,0,0,2,45] [32,0,0,0,2,44]
    -- 52
    [32,0,0,0,2,43] [32,0,0,0,2,42] [32,0,0,0,2,41] [32,0,0,0,2,40]
    [32,0,1,0,2,39] [64,0,0,0,2,38] [64,0,0,0,2,37] [64,0,0,0,2,36]
    -- 53
    [64,0,0,0,2,35] [64,0,0,0,2,34] [64,0,0,0,2,33] [64,0,0,0,2,32]
    [64,0,1,0,2,31] [128,0,0,0,2,30] [128,0,0,0,2,29] [128,0,0,0,2,28]
    -- 54
    [128,0,0,0,2,27] [128,0,0,0,2,26] [128,0,0,0,2,25] [128,0,0,0,2,24]
    [128,0,1,0,2,23] [0,0,0,0,2,22] [0,0,0,0,2,21] [0,0,0,0,2,20]
    -- 55
    [1,1,0,0,2,19] [1,1,0,0,2,18] [1,1,0,0,2,17] [1,1,0,0,2,16]
    [1,1,1,0,2,15] [3,1,0,0,2,14] [3,1,0,0,2,13] [3,1,0,0,2,12]
    -- 56
    [2,0,0,0,2,11] [2,0,0,0,2,10] [2,0,0,0,2,9] [2,0,0,0,2,8]
    [2,0,1,0,2,7] [4,0,0,0,2,6] [4,0,0,0,2,5] [4,0,0,0,2,4]
    -- 57
    [5,1,0,0,2,3] [5,1,0,0,2,2] [5,1,0,0,2,1] [5,1,0,0,2,0]
    [5,1,1,0,3,7] [11,1,0,0,3,6] [11,1,0,0,3,5] [11,1,0,0,3,4]
    -- 58
    [11,1,0,0,3,3] [11,1,0,0,3,2] [11,1,0,0,3,1] [11,1,0,0,3,0]
    [11,1,0,1,0,0] [11,1,0,0,0,3] [11,1,0,0,0,3] [11,1,0,0,0,3]
    -- 59
    [11,1,0,0,0,3] [11,1,0,0,0,3] [11,1,0,0,0,3] [11,1,0,0,0,3]
    [11,1,0,0,0,3] [11,1,0,0,0,3] [11,1,0,0,0,3] [11,1,0,0,0,3]

Entry: Bit sizes as types
Date: Thu Jul 19 22:02:48 EDT 2018

It's probably good to leave SType at the value level. It allows the language to grow slowly. But at the same time, evaluate DataKinds to allow "restricted phantom types". https://lexi-lambda.github.io/blog/2016/06/12/four-months-with-haskell/

Entry: FIFOs
Date: Fri Jul 20 23:37:38 EDT 2018

The problem with FIFOs is not so much the FIFO itself, but how to integrate memories. It should probably be done at the Seq language level, because you really do want to tuck them in. A FIFO is something that takes a memory, and produces a reader and a writer. This doesn't seem possible to express though. Let's write it out in woven form first and see how it can be rearranged.

Entry: Memories..
Date: Sat Jul 21 00:32:08 EDT 2018

Something really strange is going on with memories. There seems to be some interference between the two feedback parts that do trace input and memories. I don't understand, so I wonder, why does it need to be so opaque? Can other kinds of feedback be just the same as closeReg? The reason I didn't do this is because registers get re-initialized completely on every round. But that isn't much different from being otherwise updated..

closeProcess has this:

    -- Compute update using the update function bundled with current
    -- state. Repack with update function to do the same next time.
    o <- modifyProcess r p0 $ \(Process (s, u)) -> do
      (s', o) <- u s
      return (Process (s', u), o)

I think what happens is that the u in there calls modifyProcess again. But it seems fine. There was a bug in here before.. The inner routine can call modifyProcess, as long as it is a different r. Can the r be the same?

    modifyProcess r def f = do
      ps <- getProcesses
      let p = Map.findWithDefault def r ps
      (p', o) <- f p
      modify $ appProcesses $ insert r p'
      return o

Here's the routine. It appears as if the closeMem captures the first instance of the input.
    -- This does something really strange
    x_mem_bad = do
      let writes = [[0,1,x,x+20] | x <- [1..10]]
          reads  = [[x,0,0,0]    | x <- [1..10]]
          outs   = t_mem $ writes ++ reads
          t_mem  = trace [8,1,8,8] $ \i@[ra,we,wa,wd] -> do
            t <- stype wd
            SeqEmu.closeMem [t] $ \[rd] ->
              return ([(we, wa, wd, ra)], (rd:i))
      putStrLn "-- x_mem_bad rd,ra,we,wa,wd"
      printL outs

Why does it capture the first value? I think I'm assuming that the update equation is constant. Yes, it's passed from call to call. That's the problem. The trouble is that once the equation is tucked away, I can't use it any more. So the solution is to make the state typable. EDIT: OK, works.

Entry: The mapped if'
Date: Sat Jul 21 08:53:49 EDT 2018

I'm a little worried about that one. How hard is it to bundle these into imperative statements? Recursively:

- find all if nodes that have the same condition
- group together the equations

What happens when there are ANF terms in between the conditionals? It makes sense to write this as a test. EDIT: Here's an example.

    -- sequenced if' : does it need to be bundled?
    x_ifs = do
      putStrLn "--- x_ifs"
      print_hdl $ do
        io@[c,i1,i2,o1,o2] <- SeqTerm.io $ replicate 5 bit
        os' <- ifs c [i1,i2] [0,0]
        sequence $ zipWith connect [o1,o2] os'
        return io

    --- x_ifs
    -- ports: [Node (SInt (Just 1) 0) 0, Node (SInt (Just 1) 0) 1, Node (SInt (Just 1) 0) 2, Node (SInt (Just 1) 0) 3, Node (SInt (Just 1) 0) 4]
    -- bindings:
    (0,Input (SInt (Just 1) 0))
    (1,Input (SInt (Just 1) 0))
    (2,Input (SInt (Just 1) 0))
    (5,Comb3 (SInt (Just 1) 0) IF (Node (SInt (Just 1) 0) 0) (Node (SInt (Just 1) 0) 1) (Const (SInt Nothing 0)))
    (6,Comb3 (SInt (Just 1) 0) IF (Node (SInt (Just 1) 0) 0) (Node (SInt (Just 1) 0) 2) (Const (SInt Nothing 0)))
    (3,Connect (SInt (Just 1) 0) (Node (SInt (Just 1) 0) 5))
    (4,Connect (SInt (Just 1) 0) (Node (SInt (Just 1) 0) 6))

    0 <- (INPUT)
    1 <- (INPUT)
    2 <- (INPUT)
    3 <- (CONNECT (IF (NODE 0) (NODE 1) (CONST SInt Nothing 0)))
    4 <- (CONNECT (IF (NODE 0) (NODE 2) (CONST SInt Nothing 0)))

    -- MyHDL:
    from myhdl import *
    def module(CLK, RST, s0, s1, s2, s3, s4):
        # s0 is an input
        # s1 is an input
        # s2 is an input
        @always_comb
        def blk1():
            s3.next = ((s1 if s0 else 0))
        @always_comb
        def blk2():
            s4.next = ((s2 if s0 else 0))
        return [blk1, blk2]

Looking at

    cond ((mcond, whenTrue):clauses) dflt = do
      c <- mcond
      t <- whenTrue
      f <- cond clauses dflt
      ifs c t f

it appears that conditions will always be completely evaluated, so they appear as successive if nodes. It doesn't seem too hard to make this work. So this is an operation on bindings. Maybe best to do it in a couple of steps:

- translate current expression if to a statement if-else
- chain else-if to elif
- bundle

Note that seq can also be bundled. What are the core operations?

- Conditionally bundle static assignments:
  - Seq
  - If (unless it is chained)

One thing that is implicit is whether a set of equations is independent. When are equations independent?

- Sequential
- Combinatorial, but no (mutual) recursive references

Those can be bundled. Before doing any of this, it is necessary to test if it is really needed. I'm assuming the synthesizer already does a lot of rearranging. Figure out how Yosys works. Another reason to do this is to make the output code more readable.

Entry: signal/next and memory/nextMemory
Date: Sat Jul 21 10:57:54 EDT 2018

For Seq, does it make sense to get rid of signal and next? No. Keep it. So for generalizing closeProcess, is it possible to implement a signal/next pair for that?
Here's an idea:

- only add memory/nextMemory to Seq
- for emu, keep the more general process/nextProcess

In the current implementation, it is not necessary to initialize the processes. So let's propagate that. Ok, this is a bit simpler. Next: what should it look like on the MyHDL side? A memory can be modeled as a register with a specific update equation. I think this would be it:

    (m, rd) <- memory
    updateMemory m (we, wa, wd, ra)

Entry: Arrow
Date: Sat Jul 21 18:28:02 EDT 2018

Is it applicative plus what? This is about sharing, tuples, that sort of thing.

Entry: Kleisli arrows
Date: Sun Jul 22 10:18:14 EDT 2018

Trying to put this to rest. There is only one sane thing to do: provide SeqKleisli.hs which wraps all functions as Kleisli arrows, and represents binary and ternary functions as uncurried. Then see if this actually gets used.

Entry: Applicative
Date: Sun Jul 22 19:27:09 EDT 2018

I'm going to delete the file. The whole idea sucks. EDIT: Keep the lifted versions, but remove the autoconversion.

Entry: Blink-a-led working on breakout
Date: Sun Jul 22 21:02:41 EDT 2018

Using RST tied to GND.

Entry: Fix emulation inefficiency
Date: Mon Jul 23 09:44:07 EDT 2018

This boils down to splitting evaluation in two phases:

- Probe -> initial state, update function
- Update

Actually I don't see how to do this. What should the function look like? It will always perform some kind of computation on register values and constants, so it is basically the same as the output of SeqTerm.hs. The output is a serial program. So to implement: operations should return only register references, not values. Put it behind the (r,f) interface first, then change the implementation. The core change is here:

    styp = (fmap fst) . sints
    sval = (fmap snd) . sints
    val v = do SInt _ v' <- sval v ; return v'

    sints (Val i@(SInt sz v)) = do
      return $ (i, i)
    sints (Reg r) = do
      v <- asks $ \(_,regs) -> regs r
      ts <- getTypes
      let IntReg v0@(SInt sz _) = ts ! r
      return $ (v0, SInt sz v)

The point is to compile this straight into a function to give the implementation as much freedom as possible to reduce it. Is that possible? Probably overkill. Actually it seems easier to use an interpreter to implement SeqEmu. Maybe take that approach directly? Today is not a big insight day... ANF nodes and registers are not the same thing (Val | Reg). But ANF nodes need to be modeled because of fanout. Here's an idea: use Template Haskell. Map it straight to a Haskell program.

Entry: TH
Date: Mon Jul 23 11:20:32 EDT 2018

http://okmij.org/ftp/tagless-final/index.html

So let's just start over instead of trying to retrofit SeqEmu.hs.

Entry: pure / applicative
Date: Tue Jul 24 12:21:24 EDT 2018

So here's a thing. Apart from state feedback, the language is pure. Is it possible to rephrase it such that all primitives are pure, and the state feedback is hidden somewhere else? Basically, express it as a z transform? So why can't I have a pure DSL? The only reason is sharing. Can applicative express sharing?

Entry: Sharing
Date: Tue Jul 24 17:37:01 EDT 2018

The canonical, meaningful example is to compute the square. But any binary function works. Suppose s is a sequence. How to express square?

    s -> (s,s) -> s^2

So what's missing is shuffling, as mentioned before. Essentially, this is fmap (,) / snd / fst. Any kind of threading can be implemented using these operators. It's not pretty, but it works.

    t_app_square :: Seq m r => m (r S) -> m (r S)
    t_app_square = (SeqApp.uncurry SeqApp.mul) .
                 (fmap (\x->(x,x)))

x_app_share = do
  putStrLn "--- x_app_share"
  let c@(outputs, bindings) = SeqTerm.compile m
      m = do
        a <- inc 1
        b <- t_app_square $ return a
        return [b]
  print outputs
  sequence $ map print bindings

--- x_app_share
[Node (SInt Nothing 0) 1]
(0,Comb2 (SInt Nothing 0) ADD (Const (SInt Nothing 1)) (Const (SInt Nothing 1)))
(1,Comb2 (SInt Nothing 0) MUL (Node (SInt Nothing 0) 0) (Node (SInt Nothing 0) 0))

My question is then, is there any notion of a pure "+" in there? Hidden in the semantics, maybe, but it only appears operating on m (r S).

Entry: Upside-down : the dual implementation in terms of type families
Date: Tue Jul 24 18:21:49 EDT 2018

EDIT: Not so sure any more...

Turning it upside down, maybe. Can this be done by representing the idea of a sequence as an applicative functor of a generic data type? Yes. But how to perform feedback? This requires explicit implementation of "delay". Let's try that out.

class Functor f => Delay f where
  delay :: f t -> f t
instance Delay [] where
  delay (a:as) = as

Then, using type families, it might be possible to write alternative implementations that generate code. I'm still not comfortable using circular programming, so I'm inclined to use a "close" operator. Essentially it is something that turns a pure function into a sequence. But just for the fun of it, can it be done to use delay? I don't see it and it really complicates things. The proper way is to make "close" explicit.

module Causal where
class Functor f => Causal f where
  close :: (a -> (a, b)) -> a -> f b
instance Causal [] where
  close u s0 = (o:os) where
    (s, o) = u s0
    os = close u s
integral = close $ \s -> (s + 1, s)

But this is just unfold. No. It is special in that what it is unfolded into is not necessarily a list. I think this is going to work.. The key phrase in last week's reading was that type families are sort of dual to type classes.

https://wiki.haskell.org/GHC/Type_families

EDIT: Some inspection is needed to fish out the input value, so close is not enough.

closei :: (s -> i -> (s, o)) -> s -> f i -> f o

It surprised me before that I could write this as applicative, but I think that used a representation of signals as (s -> (s, o)). Actually that still works here, but requires that particular constraint on the representation. We don't need that. We just need close over i/o and state. Ok I think I got it:

Seq.hs defines Seq, a composition of m and r.
SeqApp.hs exports primitives lifted to that representation.

A "close" Causal class is implemented for each of the m.r compositions. The work here is in the individual implementations. I don't think that type families are needed. The behavior is in the functor. And it is important to compose m and r! So the insight is that these should be functors. That is not trivial in itself. And actually, doesn't work in general. Nope.... m is the functor. So this whole thing doesn't work because there are no pure r S functions, so close doesn't make sense. So the "f" cannot be the M of the implementation. The wrapping needs to be doubled-up:

F (M (R S))
F t

So it looks like this just doesn't work in the current setting. Something else is needed. Deleting it. There is no point.

Entry: The point
Date: Tue Jul 24 20:48:13 EDT 2018

is this:

integral :: (Num t, Causal f) => t -> f t -> f t
integral = close $ \s i -> (s + i, s)
counter s = integral s (pure 1)

class Applicative f => Causal f where
  close :: (s -> i -> (s, o)) -> s -> f i -> f o

and implement it in a way that all the machinery goes into f. It doesn't look like this is possible.
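For what it's worth, the list instance of that last Causal class does work — a minimal sketch in plain Haskell; it's only the code-generating m/r implementations that don't fit:

-- Refers to the class declared just above.
instance Causal [] where
  close u = go where
    go _ []     = []
    go s (i:is) = o : go s' is where (s', o) = u s i

-- e.g. close (\s i -> (s + i, s)) 0 [1,1,1,1] == [0,1,2,3]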
So I'm removing it. This is an entirely different problem. That monad is in the way.

Entry: Applicative sharing
Date: Tue Jul 24 21:06:52 EDT 2018

What works: in AppSeq.hs, see "dup". So given that, is it possible to write a kind of "let"? Yes.

square x' = var x' $ \x -> x `mul` x
var mv f = mv >>= \v -> f $ return v

So there...

Entry: Monad laws
Date: Tue Jul 24 21:36:18 EDT 2018

Simplest in terms of Kleisli arrows:
- Composition is associative
- return is left and right inverse

Entry: SeqTH / SeqPrim
Date: Wed Jul 25 19:43:44 EDT 2018

Boring and straightforward to write, but I think I got it. Very simple in structure, lots of different types and conversions. Next is to put this in the test script. Done. Next is to implement all the primitives. EDIT: Got prims for uart test, but don't get the correct result. Trying some ad-hoc tests and they seem fine:

m3 = do
  -- Some ad-hoc tests for SeqTH,SeqPrim combo.
  let test p = print $ SeqTH.run p $ map (:[]) [0..9]
  test $(SeqTH.compile [1] $ \[i] -> do c <- counter $ bits 3 ; return [c])
  test $(SeqTH.compile [4] $ \[i] -> do c <- integral i ; return [c])
  test $(SeqTH.compile [4] $ \[i] -> do c <- conc i (constant $ bits 1) ; return [c])
  test $(SeqTH.compile [4] $ \[i] -> do c <- slice i (Just 4) 1 ; return [c])

EDIT: Following the types too much. I think the inputs are not in the correct order? EDIT: That doesn't seem to be the problem. Running x_async_receiver_sample' it seems that the internal state machine is fine, going through the proper bit framing, but the sample pulse is missing so the shift register never triggers. Here's a discrepancy:

r5 = seqSLICE (seqInt 6) r3 (seqInt 0);
r7 = seqEQU (seqInt 1) r5 (seqInt 0);

I believe r5 is from:

phase <- slice count (Just 3) 0

The bit size is not correct. The bug is likely in SeqTerm:

slice (R a) b c = fmap R $ driven $ Slice (combTypes [a]) a b c

Yes. combTypes is not correct here. Also for CONC the sizes are different. EDIT: I think I got it now. Doing SeqTH is a good thing: it allows SeqTerm to be debugged without doing this through MyHDL tests. One more issue:

test-seq: Seq.sizeError: (IF,Just 2,Just 2,Just 2)
CallStack (from HasCallStack):
  error, called at ./SeqTerm.hs:169:23 in main:SeqTerm
Test suite test-seq: FAIL
Test suite logged to: /dev/stdout

Just a bug in the test. EDIT: uart quickcheck passes

Entry: Next?
Date: Thu Jul 26 12:27:34 EDT 2018

It's gotten quite far. Finalize quickcheck? Test the FIFO. OK

Entry: Next?
Date: Thu Jul 26 20:11:03 EDT 2018

I think I have pretty much everything needed. Complete the CPU? It's clear that most circuits are about decoders. So find a good way to express that. Can haskell serialization be used to encode types?

Entry: Why can't primitives be pure?
Date: Sat Jul 28 15:01:26 EDT 2018

Sharing. I'm going to have to keep doing that! It is very hard to internalize...

Entry: Next
Date: Sat Jul 28 17:48:28 EDT 2018

It's been quite a trip and I'm a little depleted at this point. Is there any application that would be interesting to do, besides a CPU? That will likely resume once brain is back online. EDIT: The next steps are about abstracting designs. The low level language seems done.

Entry: uart-controlled clock enable
Date: Sun Jul 29 00:39:23 EDT 2018

http://zipcpu.com/blog/2017/05/26/simpledbg.html

Entry: Logic analyzer
Date: Tue Jul 31 09:50:33 EDT 2018

TH can also be used to create logic analyzers. Likely we'll be fast enough. Just need to read from stdin to fit into the current saleae code from lars.
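A minimal sketch of that stdin glue, assuming SeqTH.run has the shape used in m3 above (compiled program + per-tick input lists); everything else here is made up:

import qualified Data.ByteString as B

-- 'prog' stands for a $(SeqTH.compile ...) splice; one byte from
-- stdin becomes one single-channel input sample per tick.
analyzerMain prog = do
  bs <- B.getContents
  let ins = map ((:[]) . fromIntegral) (B.unpack bs)
  mapM_ print $ SeqTH.run prog ins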
Entry: LLVM
Date: Tue Jul 31 21:22:09 EDT 2018

http://www.stephendiehl.com/llvm/
Implementing a JIT Compiled Language with Haskell and LLVM

Entry: Interesting avenues
Date: Tue Jul 31 21:49:33 EDT 2018

- dump to C or LLVM
- logic analyzers
- abstractions for streams
- CPU
- data kinds

Entry: Logic analyzers
Date: Tue Jul 31 21:51:19 EDT 2018

https://www.sump.org/projects/analyzer/
http://dangerousprototypes.com/docs/Open_Bench_Logic_Sniffer
Spartan 3E 250

Entry: custom quasiquoter
Date: Wed Aug 1 23:53:57 EDT 2018

https://wiki.haskell.org/Quasiquotation
So this is for String

quoteExprExp :: String -> TH.ExpQ
quoteExprPat :: String -> TH.PatQ

expr :: QuasiQuoter
expr = QuasiQuoter { quoteExp = quoteExprExp,
                     quotePat = quoteExprPat
                     -- with ghc >= 7.4, you could also
                     -- define quoteType and quoteDec for
                     -- quasiquotes in those places too
                   }

Entry: parallel if
Date: Thu Aug 2 08:50:48 EDT 2018

1. is it necessary? no
Again, learn how yosys works first.

Entry: A driver...
Date: Thu Aug 2 09:49:22 EDT 2018

Something I actually care about.

Entry: Behavioral vs. RTL
Date: Thu Aug 2 11:36:20 EDT 2018

I've never really understood the distinction here, and it appears that there is no clear-cut distinction.
http://www.clifford.at/yosys/files/yosys_manual.pdf
Reading the Yosys manual, the distinction is made between
Behavioral:
- register update
- if / case
- unrollable loop
RTL:
- combinatorial networks
- registers
So the distinction is really not that important for the Seq approach. Behavioral bits are implemented as macros.

... modern logic synthesis tools utilize much more complicated multi-level logic synthesis algorithms. Most of these algorithms convert the logic function to a Binary-Decision-Diagram (BDD) or And-Inverter-Graph (AIG) and work from that representation.

Yosys uses ABC
https://people.eecs.berkeley.edu/~alanmi/abc/
RTLIL is used for internal optimizations. It does seem to perform some high-level optimizations and recognition steps, so maybe it does make sense to keep the input syntax in some particular form. It's not quite clear if that yosys-specific optimization stuff is really needed if ABC is used. The way to find out is to have it dump some internal representations.

Entry: mode 0 spi
Date: Thu Aug 2 13:22:23 EDT 2018

logic analyzer:
- spi mode 0: clock=0, samples on 0->1
- something that's guaranteed to be fast (mutable arrays?)

Entry: arrays
Date: Thu Aug 2 13:32:46 EDT 2018

While the language is pure, it makes sense to compile TH to a monad instead. It can still be implemented as the identity monad and still be fast.
https://wiki.haskell.org/Arrays#Welcome_to_the_machine:_Array.23.2C_MutableArray.23.2C_ByteArray.23.2C_MutableByteArray.23.2C_pinned_and_moveable_byte_arrays
This needs some thought. I'm making several type errors in my head. Basically, the monad should contain the memories.
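Before redoing that, a warm-up sketch of fast mutable memory in ST, using the standard Data.Array.ST API (the names below are mine):

import Control.Monad.ST
import Data.Array.ST

-- A 256-entry word memory: allocate once, then read/write imperatively.
memDemo :: Int
memDemo = runST $ do
  arr <- newArray (0, 255) 0 :: ST s (STUArray s Int Int)
  writeArray arr 42 123   -- wAddr=42, wData=123
  readArray arr 42        -- rAddr=42
-- memDemo == 123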
First, learn how to use fast mutable memory in haskell, then redo this: update = LamE [TupP [memIn, stateIn, inputs]] $ DoE $ bindings' ++ [NoBindS $ AppE (VarE $ mkName "return") (TupE [memOut, stateOut, outputs'])] bindings' = [BindS (nodeNumPat n) (termExp e) | (n, e) <- partition E] op1 :: (Int -> Int) -> Int -> Int -> M Int op2 :: (Int -> Int -> Int) -> Int -> Int -> Int -> M Int op3 :: (Int -> Int -> Int -> Int) -> Int -> Int -> Int -> Int -> M Int run :: ((m, r, [Int]) -> SeqPrim.M (m, r, [Int]), (m, r)) -> [[Int]] -> SeqPrim.M [[Int]] run (mf, (m0, r0)) is = u m0 r0 is where u _ _ [] = [] u m r (i:is) = do (m',r',o) <- mf (m,r,i) os <- u m' r' is return (o : os) EDIT: I think it's doable to fix this using STUArray http://hackage.haskell.org/package/array-0.5.2.0/docs/Data-Array-ST.html But I've already reverted it.. Let's first build up some examples so it's clear that this works. Entry: seqInitMem Date: Fri Aug 3 08:13:13 EDT 2018 seqRun (\((r2, r1), (r3, r5, r7, r9), [r0]) -> do {r4 <- seqADD (seqInt 1) r0 r3; r6 <- seqADD (seqInt 4) r5 (seqInt 1); r8 <- seqADD (seqInt 5) r7 (seqInt 1); r10 <- seqADD (seqInt 6) r9 (seqInt 1); return ((r4, r6, r8, r10), [r3])}) ((seqInt 0, seqInitMem), (seqInt 0, seqInt 0, seqInt 0, seqInt 0)) The memories need to be initialized before the loop runs. Here's the diff. I'm going to revert it. tom@panda:~/asm_tools$ git diff diff --git a/SeqPrim.hs b/SeqPrim.hs index 669c340..0e4826f 100644 --- a/SeqPrim.hs +++ b/SeqPrim.hs @@ -7,7 +7,7 @@ module SeqPrim( seqADD, seqSUB, seqAND, seqEQU, seqIF, seqCONC, seqSLICE, - seqInt, seqInitMem, seqUpdateMem, + seqInt, seqInitMem, seqMemRd, seqMemWr, --seqUpdateMem, seqRun ) where import Data.IntMap.Strict @@ -51,15 +51,23 @@ seqEQU = op2 $ \a b -> if a == b then 1 else 0 seqCONC = op3 $ \bs a b -> (a `shiftL` bs) .|. b seqSLICE = op2 $ shiftR -seqInitMem :: IntMap Int -seqInitMem = empty +type Mem s = STUArray s Int Int +seqInitMem :: ST s (Mem s) +seqInitMem = newArray (0, 256) 0 -- FIXME: size! + +seqUpdateMem :: ((Int, Int, Int, Int), Mem s) -> ST s (Int, Mem s) +seqUpdateMem (args@(wEn,wAddr,wData,rAddr), arr) = do + rData <- readArray arr rAddr + case wEn of + 0 -> return () + 1 -> writeArray arr wAddr wData + _ -> error $ "seqUpdateMem: " ++ show args + return (rData, arr) + + +seqMemRd :: ((Int, Int, Int, Int), Mem s) -> ST s (Int, Mem s) +seqMemWr :: ((Int, Int, Int, Int), Mem s) -> ST s (Int, Mem s) -seqUpdateMem :: ((Int, Int, Int, Int), IntMap Int) -> (Int, IntMap Int) -seqUpdateMem ((wEn,wAddr,wData,rAddr), mem) = (rData, mem') where - rData = findWithDefault 0 rAddr mem - mem' = case wEn == 0 of - True -> mem - False -> insert wAddr wData mem seqInt :: Integer -> Int seqInt = fromIntegral @@ -70,37 +78,24 @@ seqInt = fromIntegral -- r: register state (tuple of Int) -- i/o is collected in a concrete [] type to make it easier to handle. -seqRun :: ((a, r, [Int]) -> forall s. ST s (r, [Int])) -> (a, r) -> [[Int]] -> [[Int]] -seqRun f (a,r0) i = runST $ u r0 i where - u _ [] = return [] - u r (i:is) = do - (r',o) <- f (a, r,i) - os <- u r' is - return (o:os) - - --- seqRun' :: --- (forall s. (m,r,[Int]) -> ST s (m, r, [Int]) --- ,(m,r)) -> [[Int]] -> [[Int]] --- seqRun' (f, (m0, r0)) is = runST $ u m0 r0 is where - --- u _ _ [] = return [] --- u m r (i:is) = do --- (m',r',o) <- f' (m,r,i) --- os <- (u m' r' is) --- return (o:os) - - -seqRun' :: ((m,r,[Int]) -> forall s. 
ST s (m, r, [Int])) -> (m, r) -> [[Int]] -> [[Int]] -seqRun' f (m0, r0) is = runST $ u m0 r0 is where - u _ _ [] = return [] - u m r (i:is) = do - (m',r',o) <- f (m,r,i) - os <- (u m' r' is) - return (o:os) - - --- seqRun = undefined +seqRun :: + ((a, r, [Int]) -> forall s. ST s (r, [Int])) + -> (forall s. ST s a) + -> r + -> [[Int]] -> [[Int]] +seqRun f ma r0 i = runST m where + m = do + -- Initialize mutable state (e.g. arrays) + a <- ma + -- Run loop + let u _ [] = return [] + u r (i:is) = do + (r',o) <- f (a, r,i) + os <- u r' is + return (o:os) + u r0 i + + -- For ST, it is important to understand which s parameters are -- specific, and which are generic. diff --git a/SeqTH.hs b/SeqTH.hs index ca56074..fa19181 100644 --- a/SeqTH.hs +++ b/SeqTH.hs @@ -46,18 +46,17 @@ toExp (outputs, bindings) = exp where -- trying to make the loop function and initial state explicit, -- which I don't want to understand yet. It seems best to just -- generate a closed expression. + exp = app3 (seqVar "Run") update memInit stateInit - - exp = app2 (seqVar "Run") update init - - init = TupE [memInit, stateInit] update = - LamE [TupP [memIn, stateIn, inputs]] $ + LamE [TupP [memRefs, stateIn, inputs]] $ DoE $ + memRead ++ bindings' ++ - [NoBindS $ AppE - (VarE $ mkName "return") - (TupE [stateOut, outputs'])] + memWrite ++ + (return' $ TupE [stateOut, outputs']) + + return' e = [NoBindS $ AppE (VarE $ mkName "return") e] partition t = map snd $ filter ((t ==) . fst) tagged tagged = map p' bindings @@ -69,8 +68,8 @@ toExp (outputs, bindings) = exp where p _ = E bindings' = - [BindS (nodeNumPat n) (termExp e) - | (n, e) <- partition E] + [BindS (nodeNumPat n) (termExp e) | + (n, e) <- partition E] -- I/O is more conveniently exposed as lists, which would be the -- same interface as the source code. State can use tuples: it will @@ -84,17 +83,43 @@ toExp (outputs, bindings) = exp where stateIn = tupP' $ map (nodeNumPat . fst) $ ds stateOut = tupE' [nodeExp n | (_, (Delay _ n)) <- ds] - mrs = partition MR - mi _ = tupE' $ [int 0, seqVar "InitMem"] - mr (rd, MemRd _ (MemNode mem)) = - tupP' [nodeNumPat rd, nodeNumPat mem] - memInit = tupE' $ map mi mrs - memIn = tupP' $ map mr mrs - memOut = - tupE' [AppE (seqVar "UpdateMem") $ - TupE [tupE' $ map nodeExp [a,b,c,d], - nodeNumExp n] - | (n, (MemWr (a,b,c,d))) <- partition MW] + + + -- For ST, memories are different. The question is whether to + -- implement it in the generated code, or to use functions. It + -- seems possible to implement MemRd and MemWr directly as monadic + -- operators. The use of tuples makes it hard to "fmap". + + -- Here's the current strategy: + -- . Create an initializer that produces a tuple of arrays + -- . Create imperative memory read functions, inserted at the start of the loop + -- . Same for write, at the end + + memRefs = tupP' $ map (nodeNumPat . fst) $ partition MR + memInit = tupE' [] + memRead = [BindS (nodeNumPat rData) () + | (rData, (MemRd td arr)) <- partition MR] + memWrite = [BindS _ _ | (_, (MemWr (_,_,_,_))) <- partition MW] + + + -- -- Memories need to be instantiated before the loop starts. 
+ -- mrs = partition MR
+ -- mi _ = tupE' $ [int 0, seqVar "InitMem"]
+ -- mr (rd, MemRd _ (MemNode mem)) =
+ --   tupP' [nodeNumPat rd, nodeNumPat mem]
+ -- memInit =
+ --   DoE $
+ --   [BindS (VarP $ mkName $ "m" ++ show n) (mi mr) | (mi,n) <- zip mrs [0..]]
+ --   ++ (return' $ tupE' [VarE $ mkName $ "m" ++ show n | (_,n) <- zip mrs [0..]])
+ --   -- tupE' $ map mi mrs
+ -- memIn = tupP' $ map mr mrs
+ -- memOut =
+ --   tupE' [AppE (seqVar "UpdateMem") $
+ --          TupE [tupE' $ map nodeExp [a,b,c,d],
+ --                nodeNumExp n]
+ --         | (n, (MemWr (a,b,c,d))) <- partition MW]

 -- FIXME: Use nested tuples for the state, memory collections.
@@ -112,7 +137,6 @@ opVar :: Show t => t -> Exp
 opVar opc = seqVar $ show opc
 seqVar str = VarE $ mkName $ "seq" ++ str
-
 termExp :: T -> Exp
 -- Special cases
tom@panda:~/asm_tools$

The main change needed is to inline the "memUpdate" as NoBindS, and provide an initializer. EDIT: Done. Was straightforward.

Entry: Unify
Date: Sun Aug 5 09:38:00 EDT 2018

- PRU
- Staapl macro forth
- CPU
- RTL
- DSP language
Is there a good way to model something at different levels of abstraction that fits into Haskell? It seems to be just tagless final + nested type classes. I think today it's time for the CPU. This will enable testing a couple of things:
- code gen for large decoders
- forth-like macro language
- efficiency of emulation
- compare pattern generated by loop between sim and fpga

Entry: Building a CPU
Date: Sun Aug 5 11:13:03 EDT 2018

The outer part of a CPU is the update of the instruction pointer in terms of the current instruction word.

closeIW :: Seq m r => SType -> SType -> (r S -> m ((r S, r S), o)) -> m o
closeIW tw ta f = do
  closeMem [tw] $ \[iw] -> do
    (ip', out) <- closeReg [ta] $ \[ip] -> do
      ((jmp, dst), out) <- f iw
      ip1 <- inc ip
      ip' <- if' jmp dst ip1
      return ([ip'], (ip',out))    -- comb out
    return ([(0, 0, 0, ip')], out) -- imem is read-only

This raises the question: what is the output of a CPU? The central idea is to perform composition of multiple of these "close" operations. The input/output of a cpu:
- instruction memory writes
- GPIn, GPOut

Entry: Hardware is annoyingly pure
Date: Sun Aug 5 11:18:04 EDT 2018

You can't just "assign" things! Everything needs to be bound just once. Approach digital design as gradually "closing" local feedback. Note that this imposes arbitrary hierarchy. Can that be avoided? Maybe it is even a good thing. EDIT: I think there is something to learn here about the way these compositions "commute". One key element seems to be to keep the input/output relation abstract. A "close" operation takes some state context and buries it, leaving (i -> m o). EDIT: Abstracted closeIMem with some data types in the interface. It's all quite straightforward, but testing this will require some infrastructure:
- make it possible to initialize memories
- have the test transfer data into the memory
The latter seems most useful, but I think the former is easier to do. EDIT: I have a CPU with a jump instruction

Entry: Memory init
Date: Sun Aug 5 13:47:24 EDT 2018

Using FPGA memory as ROM, what is the output of rData just out of reset? Is this an actual combinatorial readout? Look at the simulator spec. It doesn't specify a reset value.
https://www.latticesemi.com/-/media/LatticeSemi/Documents/ApplicationNotes/MO/MemoryUsageGuideforiCE40Devices.ashx?document_id=47775
Figure 3. EBR Module Timing Diagram 1
So it clearly says the data is only valid after the first read address has been clocked.
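Concretely, the implied startup timing (my reading of that diagram; imem stands for the block ram used as instruction memory):

  cycle 0: RADDR=0 clocked in; RDATA still undefined
  cycle 1: RDATA = imem[0]; RADDR=1 clocked in
  cycle 2: RDATA = imem[1]; ...

So right out of reset there is one clock cycle without a valid instruction word.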
To use this as instruction memory, either:
- ignore the first read (e.g. reset value = 0)
- use a run enable signal to ensure the first instruction is read
Practically, the CPU will be loaded by another one, so I'm assuming the run signal is explicit. Otherwise, use a single delay to start it. See the memory usage guide for ice40. To push this to verilog, have MyHDL insert something like:

defparam ram256x16_inst.INIT_0 = 256'h0000000000000000000000000000000000000000000000000000000000000000;
...

https://sourceforge.net/p/myhdl/mailman/message/33001534/

Entry: Sequencer
Date: Sun Aug 5 16:34:22 EDT 2018

So this really is pretty much it. The rest is instruction decoder and memory/stack access. Those are fairly involved by themselves but there doesn't seem to be any magic. E.g. a pattern sequencer needs:
- set io/o
- set loop count
- decrement and branch
This is best done using a driver application. It might be better to implement function calls first. For the application I don't yet need other kinds of conditional jumps, just nested Forth for..next loops. How to begin? One bit in the instruction can be used for:
- if zero, pop stack and continue
- if nonzero, decrement and jump
- push number to stack

Entry: Here's a big lesson
Date: Sun Aug 5 17:38:36 EDT 2018

Full-branch conditionals are too hard to use. It is often the case that a register only needs to be updated in a very specific case. It is too awkward to always have to specify the non-update case explicitly. But then again, it might be due to lack of good abstraction. The specific case I'm looking at is stack push/pop. Also, it might be possible to write this as a generic Seq extension, instead of a core function. EDIT: MyHDL and Verilog both support successive assignments in a single "process". Syntactically it is not a problem in Seq, but is the semantics correct? I really don't like this though.. Let's try to work without it for a while.

Entry: FIFO / Stack using grey code
Date: Sun Aug 5 17:51:00 EDT 2018

https://www.edaboard.com/showthread.php?157611-why-FIFO-design-using-grey-code
ice40 has carry logic, so maybe not necessary.

Entry: Write-through delay for FIFO and stack
Date: Sun Aug 5 18:26:24 EDT 2018

Here's a stack implementation:
https://github.com/jandecaluwe/myhdl-examples/blob/master/ChessPlayingFPGA/stack/stack_myhdl.py
For my use case, is it ok if the data is only available on the next cycle?

1. pop (computes new rAddr)
2. use rData

1. push
2. use rData <- has previous result
3. use rData <- has pushed result

So it seems that write-to-read delay is important. So I don't actually know if it is an issue. But if it is, it can likely be solved later by cleaning up a NOP workaround.

Entry: CPU design
Date: Sun Aug 5 19:33:34 EDT 2018

It's starting to get more clear that there are a lot of design choices!
1. multi-clock instruction cycle (e.g. PIC)
2. hazard-mitigation through NOP
3. stall
4. pipelining
What is my goal? To keep it simple, and to have deterministic execution. Leaving in the hazards seems best to get a first working design.

Entry: Stacks
Date: Sun Aug 5 19:43:00 EDT 2018

Looking at swapforth.
https://github.com/jamesbowman/swapforth/blob/master/j1a/verilog/stack2.v
It seems that the stacks are implemented as registers. This would avoid the delay issue. Study that some more, and see if it's possible to actually express this in Seq.

Entry: Pru emulator
Date: Mon Aug 6 12:03:48 EDT 2018

Basically I want to clock a state machine off of the GPIOs. Every time the clock switches, outputs get updated.
So this is essentially a map from GPO -> GPI.

Entry: Keep pseudo ops
Date: Mon Aug 6 13:38:40 EDT 2018

This should be a hook in the main class that is ignored for code generation, but gets executed in the emulator.

-test_int_logger = do
-  putStrLn "--- test_int_logger"
-  let log = do
-        t <- loadm Time
-        tell $ [t]
-      sample' = map (pseudo log >>) (sample :: [Src [Int]])
-      src = do
-        initRegs
-        bl_weave sample'
-        return ()
-      tick = compile src
-      [t1,t2] = take 2 $ logTrace tick (machineInit' 123 [10,11])
-  putStrLn $ "period: " ++ show (t2-t1)

Keep it specialized. I can't see a simple way to generalize it to nops in the target compiler.

Entry: closeMem
Date: Wed Aug 8 14:15:40 EDT 2018

Here's why closeMem is awkward: FIFOs decouple parts of the circuit, so you really want the two ends to be fairly separate entities. E.g. I have a circuit with 16 FIFOs. The reader end is a circuit that talks to all 16 FIFOs, and the writer end are 16 identical circuits, speaking to one FIFO each. This will need a closeMem operation that's quite high up the hierarchy. There is a more general point to make: sometimes there is a lot of criss-cross going on that is easy to solve when resources are named and binding (single assignment) is explicit. There is a really big tension between applicative style and "netlist style" or explicit single-assignment binding.

Entry: Indexed muxes
Date: Wed Aug 8 14:42:05 EDT 2018

Now, how to express larger multiplexers? These could be expressed as nested if statements, but MyHDL will generate a case statement if it sees successive ifs. Maybe it is time to handle nested ifs differently. EDIT: See SeqLib.index. Straightforward with zero-extended array, and recursive expansion based on bits of the indexing word (a reconstruction sketch follows a couple of entries down).

Entry: applicative style and instantiating spaghetti networks
Date: Wed Aug 8 18:15:10 EDT 2018

There is something to learn here.
1. Why is it hard to express this applicatively?
It might not be really that hard, but it's hard to write it down all at once. It is likely that arguments need to be added to functions to make this work. So here is the insight: It's not that different from closing over any other state such as registers. A closing function creates a more abstract I/O relation from subcircuits. The trick is to understand what the desired I/O lines are for the more abstract circuit. Then write those down as

f i = do
  ...
  o <- ...
  ...
  return o

If all the other pieces are written in applicative style, they will just fall into place.

Entry: CPU, sequencer
Date: Thu Aug 9 17:56:01 EDT 2018

closeIMem's Control struct now has "loop", which enables waiting for signals. I think I have enough now to perform basic sequencing using a single programmable sequencer + some simpler state machines that can be started in parallel. This is really nitty-gritty though. I noticed that last time as well. Hard to imagine without actually getting into it.
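The index sketch promised above — my own reconstruction, not the actual SeqLib.index source. It assumes 'slice s (Just n) lo' takes n bits starting at bit lo, "if'" is the mux primitive, and the signal list has already been zero-extended to a power-of-two length:

-- Select the i-th element by recursing on the bits of the index word.
index :: Seq m r => r S -> [r S] -> m (r S)
index _ [a] = return a
index i as = do
  b  <- slice i (Just 1) 0   -- low bit: even or odd position
  i' <- slice i Nothing 1    -- remaining index bits
  e  <- index i' evens
  o  <- index i' odds
  if' b o e                  -- b=1 selects the odd-position element
  where
    (evens, odds) = deal as
    deal (x:y:zs) = let (es, os) = deal zs in (x:es, y:os)
    deal zs       = (zs, zs)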
Entry: Reset is not implemented properly for SeqTH
Date: Thu Aug 9 19:59:56 EDT 2018

seqRun
  (\([], (), r6, [r0, r2, r4]) -> do
     {r1 <- seqSLICE (seqInt 1) r0 (seqInt 0);
      r3 <- seqSLICE (seqInt 1) r2 (seqInt 0);
      r5 <- seqSLICE (seqInt 8) r4 (seqInt 0);
      r7 <- seqCONC (seqInt 9) (seqInt 1) r5 (seqInt 0);
      r8 <- seqIF (seqInt 9) r3 r7 r6;
      r9 <- seqSLICE (seqInt 1) r8 (seqInt 0);
      r10 <- seqSLICE (seqInt 8) r8 (seqInt 1);
      r11 <- seqCONC (seqInt 9) (seqInt 8) (seqInt 1) r10;
      r12 <- seqIF (seqInt 9) r1 r11 r8;
      return ((), r12, [r9, r3, r1, r12])})
  [] ((), seqInt 0)

The problem was in SeqTerm.

https://github.com/YosysHQ/yosys/issues/103
The lattice chips don't have support for initial values on registers. Use a reset generator or something. Clifford mentions to use the LOCK signal of PLL out. Or something like this:

reg [7:0] resetn_counter = 0;
assign resetn = &resetn_counter;
always @(posedge clk) begin
  if (!resetn) resetn_counter <= resetn_counter + 1;
end

Entry: async tx
Date: Thu Aug 9 21:05:43 EDT 2018

This needs work. Revisit tomorrow.

Entry: bundling case
Date: Thu Aug 9 22:50:25 EDT 2018

Make a test case first:

(0,Comb3 (SInt Nothing 0) IF (Const (SInt Nothing 1)) (Const (SInt Nothing 1)) (Const (SInt Nothing 4)))
(1,Comb3 (SInt Nothing 0) IF (Const (SInt Nothing 1)) (Const (SInt Nothing 2)) (Const (SInt Nothing 5)))
(2,Comb3 (SInt Nothing 0) IF (Const (SInt Nothing 1)) (Const (SInt Nothing 3)) (Const (SInt Nothing 6)))

How to tackle this? Grouping will be easy, so the problem is creating the data structure. Grouping will be recursive. What about adding it straight to Term?

Entry: bit-serial architecture
Date: Thu Aug 9 23:21:34 EDT 2018

I've always found this very cool. So let's give it a try. See electronics.txt

Entry: Display
Date: Thu Aug 9 23:39:05 EDT 2018

I think it's time for something else. There is the beaglebone + FPGA + PRU idea. I have a bunch of tools now. What can be done with it? First thing is to hook this all up.

Entry: Blocks
Date: Fri Aug 10 08:37:25 EDT 2018

Imperative conditionals with explicit assignment are really convenient when working with state machines. Is there a way to express them? Maybe by making the default assignment explicit? The problem is that this needs local names in a way that I don't see work other than using something like lenses. It's probably easiest to use 'update' explicitly. Yea, the whole thing just doesn't fit well. It's either/or.

Entry: async_transmit
Date: Fri Aug 10 08:55:37 EDT 2018

Looks like it works now

[1,1,0,0,511,0]
[1,1,0,0,511,0]
[1,1,1,1,180,10]
[0,0,0,0,180,10]
[0,0,0,1,346,9]
[0,0,0,0,346,9]
[0,0,0,1,429,8]
[1,0,0,0,429,8]
[1,0,0,1,470,7]
[0,0,0,0,470,7]
[0,0,0,1,491,6]
[1,0,0,0,491,6]
[1,0,0,1,501,5]
[1,0,0,0,501,5]
[1,0,0,1,506,4]
[0,0,0,0,506,4]
[0,0,0,1,509,3]
[1,0,0,0,509,3]
[1,0,0,1,510,2]
[0,0,0,0,510,2]
[0,0,0,1,511,1]
[1,0,0,0,511,1]
[1,0,0,1,511,0]
[1,1,0,0,511,0]
[1,1,0,1,511,15]

Note that this can also be used

Entry: Figure out conversion to nested if elif else
Date: Fri Aug 10 09:15:53 EDT 2018

First, define an intermediate language derived from SeqExpr that can represent this.
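Perhaps something in this direction — entirely my own sketch, the names are not from SeqExpr:

-- Expressions stay in ANF; assignments can now live inside a
-- conditional block, so same-condition ifs can be bundled and
-- chains can later be printed as if/elif/else.
data Stmt n e
  = Assign n e                       -- n <- e
  | Cond e (Block n e) (Block n e)   -- if e then ... else ...
type Block n e = [Stmt n e]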
Here's the uart receiver after unsharing unique nodes, indented for clarity:

0 <- (INPUT)
3 <- (SUB (NODE 2) (CONST SInt Nothing 1))
5 <- (EQU (NODE 2) (CONST SInt Nothing 0))
6 <- (EQU (SLICE (NODE 2) 3 0) (CONST SInt Nothing 0))
7 <- (EQU (NODE 1) (CONST SInt Nothing 0))
9 <- (EQU (NODE 1) (CONST SInt Nothing 1))
12 <- (EQU (NODE 1) (CONST SInt Nothing 2))
15 <- (EQU (NODE 1) (CONST SInt Nothing 3))
32 <- (IF (NODE 7) (CONST SInt Nothing 0)
        (IF (NODE 9) (CONST SInt Nothing 0)
          (IF (NODE 12) (CONST SInt Nothing 0)
            (IF (NODE 15) (AND (NODE 6) (NODE 0)) (CONST SInt Nothing 0)))))
1 <- (DELAY (IF (NODE 7) (IF (NODE 0) (CONST SInt Nothing 0) (CONST SInt Nothing 1))
        (IF (NODE 9) (IF (NODE 5) (CONST SInt Nothing 2) (NODE 1))
          (IF (NODE 12) (IF (NODE 5) (CONST SInt Nothing 3) (NODE 1))
            (IF (NODE 15) (IF (NODE 5) (CONST SInt Nothing 0) (NODE 1)) (CONST SInt Nothing 0))))))
2 <- (DELAY (IF (NODE 7) (CONST SInt Nothing 3)
        (IF (NODE 9) (IF (NODE 5) (CONST SInt Nothing 63) (NODE 3))
          (IF (NODE 12) (IF (NODE 5) (CONST SInt Nothing 7) (NODE 3))
            (IF (NODE 15) (IF (NODE 5) (CONST SInt Nothing 0) (NODE 3)) (CONST SInt Nothing 0))))))
35 <- (DELAY (IF (IF (NODE 7) (CONST SInt Nothing 0)
                   (IF (NODE 9) (CONST SInt Nothing 0)
                     (IF (NODE 12) (NODE 6)
                       (IF (NODE 15) (CONST SInt Nothing 0) (CONST SInt Nothing 0)))))
               (CONC (NODE 0) (SLICE (NODE 35) 8 1)) (NODE 35)))

Yeah this isn't very useful... Don't do this on SeqExpr. Do it on SeqTerm first, then perform Expr on subexpressions.

Entry: SeqTerm and blocks
Date: Fri Aug 10 09:55:53 EDT 2018

Blocks are always guarded by conditionals. It seems simplest to get rid of the ternary if expression, and convert everything to nested block form. Can this be done using Free? First, make s-expression printout:

0 <- (INPUT)
3 <- (SUB (NODE 2) (CONST _:1))
4 <- (SLICE (NODE 2) 3 0)
5 <- (EQU (NODE 2) (CONST _:0))
6 <- (EQU (NODE 4) (CONST _:0))
7 <- (EQU (NODE 1) (CONST _:0))
8 <- (IF (NODE 0) (CONST _:0) (CONST _:1))
9 <- (EQU (NODE 1) (CONST _:1))
10 <- (IF (NODE 5) (CONST _:2) (NODE 1))
11 <- (IF (NODE 5) (CONST _:63) (NODE 3))
12 <- (EQU (NODE 1) (CONST _:2))
13 <- (IF (NODE 5) (CONST _:3) (NODE 1))
14 <- (IF (NODE 5) (CONST _:7) (NODE 3))
15 <- (EQU (NODE 1) (CONST _:3))
16 <- (AND (NODE 6) (NODE 0))
17 <- (IF (NODE 5) (CONST _:0) (NODE 1))
18 <- (IF (NODE 5) (CONST _:0) (NODE 3))
19 <- (IF (NODE 15) (CONST _:0) (CONST _:0))
20 <- (IF (NODE 15) (NODE 16) (CONST _:0))
21 <- (IF (NODE 15) (NODE 17) (CONST _:0))
22 <- (IF (NODE 15) (NODE 18) (CONST _:0))
23 <- (IF (NODE 12) (NODE 6) (NODE 19))
24 <- (IF (NODE 12) (CONST _:0) (NODE 20))
25 <- (IF (NODE 12) (NODE 13) (NODE 21))
26 <- (IF (NODE 12) (NODE 14) (NODE 22))
27 <- (IF (NODE 9) (CONST _:0) (NODE 23))
28 <- (IF (NODE 9) (CONST _:0) (NODE 24))
29 <- (IF (NODE 9) (NODE 10) (NODE 25))
30 <- (IF (NODE 9) (NODE 11) (NODE 26))
31 <- (IF (NODE 7) (CONST _:0) (NODE 27))
32 <- (IF (NODE 7) (CONST _:0) (NODE 28))
33 <- (IF (NODE 7) (NODE 8) (NODE 29))
34 <- (IF (NODE 7) (CONST _:3) (NODE 30))
1 <- (DELAY (NODE 33))
2 <- (DELAY (NODE 34))
36 <- (SLICE (NODE 35) 8 1)
37 <- (CONC (NODE 0) (NODE 36))
38 <- (IF (NODE 31) (NODE 37) (NODE 35))
35 <- (DELAY (NODE 38))

Then perform hoisting, something like:

19 <- (IF (NODE 15) (CONST _:0) (CONST _:0))
20 <- (IF (NODE 15) (NODE 16) (CONST _:0))
21 <- (IF (NODE 15) (NODE 17) (CONST _:0))
22 <- (IF (NODE 15) (NODE 18) (CONST _:0))

[19,20,21,22] <- IFS (NODE 15)
  [19 <- (CONST _:0), 20 <- (NODE 16), 21 <- (NODE 17), 22 <- (NODE 18)],
  [19 <- (CONST _:0), 20 <- (CONST _:0), 21 <- (CONST _:0), 22 <- (CONST _:0)]
It's not actually that easy. To make this readable, the nodes that are not shared elsewhere should also be moved inside the blocks, so they can later be eliminated. I don't see how to express this easily. Let it rest a bit.

Entry: Now here's another idea.
Date: Fri Aug 10 11:00:53 EDT 2018

What about making a language that is more like Verilog, and plug it into Seq? I can probably use Yosys to generate a netlist, then compile that into Seq for unit testing. Simulation in Haskell is nice, but expressing the primitive state machines is a bit of a pain.

Entry: Check yosys output
Date: Fri Aug 10 11:04:36 EDT 2018

This is actually more important. So the thing to do is to get everything to run on FPGA and see if Yosys can optimize what Seq produces. What to use? Maybe the UART makes sense? That's something I can just monitor easily. EDIT: I have no way to judge this output.. EDIT: Was wrong - was doing just logic optimization. Doing the ice40_synth does give something that doesn't look too bad. PNR phase says:

After packing:
IOs          4 / 206
GBs          0 / 8
  GB_IOs     0 / 8
LCs          27 / 7680
  DFF        11
  CARRY      5
  CARRY, DFF 0
  DFF PASS   0
  CARRY PASS 2
BRAMs        0 / 32
WARMBOOTs    0 / 1
PLLs         0 / 2

EDIT: It optimized out the shift register because I'm using only one bit. Here's with all 8 outputs routed out of the FPGA:

After packing:
IOs          11 / 206
GBs          0 / 8
  GB_IOs     0 / 8
LCs          34 / 7680
  DFF        18
  CARRY      5
  CARRY, DFF 0
  DFF PASS   7
  CARRY PASS 2
BRAMs        0 / 32
WARMBOOTs    0 / 1
PLLs         0 / 2

Which added 7 more logic cells. This doesn't look too bad really.

Entry: Variable names
Date: Fri Aug 10 12:02:09 EDT 2018

I need variable names. Otherwise this is going to be too hard to figure out. Maybe implement it using closeReg?

Entry: Next?
Date: Fri Aug 10 18:31:48 EDT 2018

I'm not sure if it's really necessary to do the if..else thing.

Entry: output to input dependency?
Date: Fri Aug 10 18:39:09 EDT 2018

Is it possible to compute the output based on the input in the ST here? Let's give it a try with a test case.

seqRun ::
  (forall s. ([Mem s], rd, r, [Int]) -> ST s (rd, r, [Int]))
  -> [Int] -> (rd, r) -> [Int -> Int] -> [[Int]] -> [[Int]]
seqRun f memBits (rd0, r0) memInits i = runST $ do
  a <- sequence $ zipWith seqMemInit memBits memInits
  let u _ _ [] = return []
      u rd r (i:is) = do
        (rd',r',o) <- f (a, rd, r, i)
        os <- u rd' r' is
        return (o:os)
  u rd0 r0 i

This produces <<loop>>:

l_st = seq where
  seq = ([0] : seq')
  seq' = seqRun (\([], (), (), [i]) -> return ((), (), [i])) [] ((), ()) [] seq

So it needs a lazy state implementation.

Entry: Assignment language
Date: Fri Aug 10 22:43:54 EDT 2018

So is it possible to do this actually?

if foo
  a <- 1
else
  a <- 2

No. Those are still bindings. It would need to be:

a <- signal
if foo
  connect a 1
else
  connect a 2

So I am really stuck with the pure approach. Or not? Here's an idea: when entering diverging blocks, a local context can be built up that keeps track of the guards of the conditional expression. Once the full expression is evaluated, all the guards can be gathered and an update network can be constructed for a particular variable. Still I wonder if it is really that useful. It seems like a lot of cruft. Maybe there is something to say for better factoring? EDIT: Do not waste time on this.

Entry: Testing uart transmit
Date: Sat Aug 11 07:54:20 EDT 2018

Do this using the receiver. It is currently not possible to have SeqTH code use feedback at the delay level due to strictness of ST.

Entry: Focus more on composition
Date: Sat Aug 11 08:03:08 EDT 2018

I now have fifo interfaces and ser/deser.
How to compose these on a higher level? The problem is still sequencers. How to abstract those better? Maybe start from the top this time. Build the loop controller for the application and see where that gets stuck. So here's a nice goal for today: get a basic CPU-like sequencer running on FPGA, producing an output loop. Maybe what I've been overlooking is the explicit construction of a bus, so let's start there.

Entry: CPU bus
Date: Sat Aug 11 08:21:25 EDT 2018

Read, write, address, data. A bus is quite similar to a memory. Practically, what is needed:
- write uart byte
- read + wait for uart done
- read byte from fifo
For the application I can avoid conditional jumps if there is a wait instruction. The wait instruction selects from one of a number of flags. So where to start? Once the basic structure is there, it is easy to modify and abstract. I feel dread starting this, so likely it is going to be more complicated than I thought.

Entry: Busses, cont
Date: Sat Aug 11 11:19:41 EDT 2018

With a bus, these are the basic instructions:
- jmp
- read (bus->reg)
- write (reg->bus)
- im (ins->reg)
(A hypothetical decoder sketch for these follows a few entries down.) But there still is no synchronization. I need a blocking read. Instead of using a register interface, what about using a port interface? But a register read is already needed. What I miss is a direct flag input. This could still be part of the bus. The missing bit is really the read operation. I'm thinking of this as one thing, but it is actually always split over two cycles:
- issue read
- wait for read to be stable on the bus
This could be the next instruction, but it might be a later one as well. With this in mind, look at an example bus: Wishbone
https://en.wikipedia.org/wiki/Wishbone_(computer_bus)
This seems overkill. What I need is just a read ready signal such that the cpu can read from a stream. Another important realization is that a bus is simpler when it has a master and a slave side, and needs special attention when these roles need to switch.

Entry: Read ready, streams
Date: Sat Aug 11 11:58:40 EDT 2018

Can they both be the same? The thing is to make sure that the CPU doesn't miss it, or to add a S/R + a latch. The primitive representation really should be the 1-cycle enable sample pulse. Then if a different interface is needed, add the latch + stable ready signal.

Entry: MyHDL memories
Date: Sat Aug 11 12:43:50 EDT 2018

todo

Entry: CPU hierarchy
Date: Sat Aug 11 13:31:30 EDT 2018

I've got the basic idea patched up. Decomposed into a couple of levels:
- top level io
- bus peripherals on bus
- cpu sequencer + memory
- instruction decoder
Ha.. that's why they are called peripherals! They sit between the CPU and the outside world. So this is great. Once that basic structure is there, the rest is incremental. Next: make a uart read/write test, or just the 2 main loops of the application. Uart write and some i/o and timing control.

Entry: Probe
Date: Sat Aug 11 18:24:51 EDT 2018

So I'm moving to an explicit probe functionality as part of Seq. This way there is no clutter, and everything can be easily probed by adding lines to the source code. EDIT: Propagated it to SeqTH. TODO:
- create bindings (always)
- allow parameter to pick out probes, reify to var names
EDIT: Fully implemented. Actually it is a lot better than creating d_ routines.

Entry: Next?
Date: Sun Aug 12 10:09:04 EDT 2018

Tests, mostly. The rest seems quite incremental. Focus on building a single reference design for a stack processor, then specialize it to the application.
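To make the jmp/read/write/im list above concrete, here is a minimal decoder sketch. The field layout is entirely made up, and the slice signature ('slice s (Just n) lo' = n bits starting at bit lo) is only assumed from its uses elsewhere in this log:

-- Hypothetical 16-bit instruction word: [opc:2][reg:6][arg:8].
decode :: Seq m r => r S -> m (r S, r S, r S)
decode iw = do
  opc <- slice iw (Just 2) 14   -- 0=jmp, 1=read, 2=write, 3=im (made up)
  reg <- slice iw (Just 6) 8    -- bus/register address
  arg <- slice iw (Just 8) 0    -- immediate / jump target
  return (opc, reg, arg)

None of this replaces actually running it, of course.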
It is time though to make tests on hardware; get the test circuitry set up. EDIT: With probes in place, continue with peripherals.

Entry: UART out on bus
Date: Sun Aug 12 12:36:01 EDT 2018

Write strobe doesn't work yet. Or does it? EDIT: works. Currently using blocking read for UART.

Entry: Path to FPGA
Date: Sun Aug 12 13:42:53 EDT 2018

- instantiate (initialized) memory in myhdl
- run program on FPGA to output on scope or LEDs
- hook up the programmer

Entry: MyHDL RAM
Date: Sun Aug 12 14:25:41 EDT 2018

Two things:
- Generate a code stub for the memory instructions in SeqTerm
- Possibly instantiate directly for synthesis?
The main question is: do I want a ROM? If I do, I'll need to do a whole lot more work. There is a tool to program just the ram:
https://stackoverflow.com/questions/36852808/modify-ice40-bitstream-to-load-new-block-ram-content
https://github.com/cliffordwolf/icestorm/tree/master/icebram
Basically, I will need access to the block ram over SPI anyway, so maybe best to not do this?

Entry: Synchronous SPI
Date: Sun Aug 12 14:36:17 EDT 2018

Previously, I've used async SPI. Since speed is not really a big problem, it might be easier to use synchronous SPI. It makes it easier to test inside Seq. Just curious: googling for SPI FPGA, what do we get? Sync or async? Sync.. I guess the reason is that it is easier to have a serial signal cross clock domains. What I did before was a hack.

Entry: Yosys FSM Detection
Date: Sun Aug 12 15:28:13 EDT 2018

http://www.clifford.at/yosys/files/yosys_manual.pdf
8.2.1 The fsm_detect pass identifies FSM state registers. It sets the \fsm_encoding= "auto" attribute on any (multi-bit) wire that matches the following description:
• Does not already have the \fsm_encoding attribute.
• Is not an output of the containing module.
• Is driven by single $dff or $adff cell.
• The \D -Input of this $dff or $adff cell is driven by a multiplexer tree that only has constants or the old state value on its leaves.
• The state value is only used in the said multiplexer tree or by simple relational cells that compare the state value to a constant (usually $eq cells).
How to see what it detects?

Entry: CPU startup
Date: Sun Aug 12 17:17:39 EDT 2018

- boot fpga
- transfer into memory
- start processor after cs:0->1

Entry: Next?
Date: Sun Aug 12 17:21:36 EDT 2018

Tired and bored, so I'll need something new. EDIT: a forth "next".

Entry: Loops
Date: Sun Aug 12 20:06:00 EDT 2018

if loop count is zero -> drop and continue
else -> decrement and jump
EDIT: ok works
Next for cpu? Some more tests that nail down the semantics. Figure out why suddenly a nop is needed for uart read.

Entry: State machine vs. instruction sequencer
Date: Sun Aug 12 23:44:51 EDT 2018

There is definitely a tipping point somewhere, so how to determine it?
https://en.wikipedia.org/wiki/Microsequencer
https://www.arl.wustl.edu/~jst/cse/260/ddc.pdf

Entry: Into the real world?
Date: Mon Aug 13 09:08:30 EDT 2018

To make this work for real, it is necessary to integrate into an existing verilog simulator, and/or have a verilog parser frontend.
http://hackage.haskell.org/package/verilog
https://github.com/tomahawkins/verilog

Entry: Memories
Date: Mon Aug 13 09:17:29 EDT 2018

Move it forward.
- MyHDL memories
- MyHDL simulation
- FPGA through FTDI upload
- FPGA wire-up
EDIT: I got it up to this point:

instantiation (TODO)
[s14_rd, s14_we, s14_wa, s14_wd, s14_ra] = s14

from mem bus:
@always_comb
def blk2():
    s17.next = ((s14_rd) if s13 else 0)

to mem bus:
@always_comb
def blk16():
    s14_we.next = 0
    s14_wa.next = 0
    s14_wd.next = 0
    s14_ra.next = s59

What's needed is some types to plug into the instantiation, and possibly a decoupling through passing env? E.g.

[s14_rd, s14_we, s14_wa, s14_wd, s14_ra] = env.memory("s14", 8, 16)

This way the instance itself doesn't need to be handled inside the generated routine. Then do the same for signal? EDIT: Probably best to create the instance and pass it on. TODO:
- Find addr,data sizes + add to instances.
- Renames from probes

Entry: MyHDL concat
Date: Mon Aug 13 12:11:46 EDT 2018

This can't take constants.

c_1_0 = Signal(modbv(0)[1:0])
c_1_1 = Signal(modbv(1)[1:0])

http://discourse.myhdl.org/t/constant-bit-vectors-in-a-concat-expression/121
A binary string fixes it:

def blk20():
    s63.next = (((concat(s18, "0")) if s62 else ((concat("1", (s63[9:1]))) if 1 else s63)))

EDIT: Done

Entry: MyHDL test bench
Date: Mon Aug 13 13:46:44 EDT 2018

I want to keep this in haskell. So find a way to run Python code and return the result.

Entry: Memory instantiation hierarchy problem
Date: Mon Aug 13 16:16:04 EDT 2018

I don't understand what's going wrong here.

Traceback (most recent call last):
  File "run_myhdl.py", line 140, in <module>
    load_and_run(sys.argv[1], sys.argv[2])
  File "run_myhdl.py", line 134, in load_and_run
    toVerilog(hdl_fun, env, *signals)
  File "/home/tom/priv/git-private/humanetics/gw_src/deps/asm_tools/myhdl/myhdl/conversion/_toVerilog.py", line 176, in __call__
    siglist, memlist = _analyzeSigs(h.hierarchy)
  File "/home/tom/priv/git-private/humanetics/gw_src/deps/asm_tools/myhdl/myhdl/conversion/_analyze.py", line 92, in _analyzeSigs
    assert(delta >= -1)
AssertionError

I think I understand. Signals need to be passed in from the top or created at the same level. They cannot be instantiated at lower levels and bubbled up. Trying to fix it:

s14_rd = Signal(modbv(0)[1:0])
s14_we = Signal(modbv(0)[1:0])
s14_wa = Signal(modbv(0)[1:0])
s14_wd = Signal(modbv(0)[1:0])
s14_ra = Signal(modbv(0)[1:0])
inst1 = ram.ram(CLK, s14_wa, s14_wd, s14_we, CLK, s14_ra, s14_rd)

s14_rd = env.sig(16, 0)
s14_we = env.sig(1, 0)
s14_wa = env.sig(8, 0)
s14_wd = env.sig(16, 0)
s14_ra = env.sig(8, 0)
inst1 = env.ram(CLK, s14_wa, s14_wd, s14_we, CLK, s14_ra, s14_rd)

Entry: MyHDL
Date: Tue Aug 14 09:27:04 EDT 2018

Make a CPython interface. Now it needs to be tested. Note that it is no longer necessary to use the macro to get to variable names. They can be added through the "probe" mechanism. Let's push that through first. EDIT: I/O names are still necessary.

Entry: MyHDL code gen cleanup
Date: Tue Aug 14 10:09:55 EDT 2018

Create a single generator. Currently there are 3: toPy, testbench, fpga. I need one interface. The problem is that there are a couple:

m [r S]            passed into compileTerm
[r S] -> m ()      mirrors MyHDL port api
[r S] -> m [r S]   mirrors Seq processor

What is needed?
- FPGA code generation
- Test benches
The problem is with the latter. So let's focus on code gen first, then figure out a way to implement i->o test benches. EDIT: got it sorted out + can execute from haskell, though it needs intermediate files.

Entry: next
Date: Tue Aug 14 15:40:27 EDT 2018

Everything is there to create a test bench for the soc to run in MyHDL, and to then run it on the hardware. What should this do?
Likely best to take the application mainloop. Currently what's in the way is the ability to upload SPI code to the running FPGA. I need to really go ahead and do this. And I really don't want to? Why is that? Because it is all so hard to debug. Can I make that easier? Can I make an always-on logic analyzer? Let's set that up instead. It's the software. I don't want to run a daemon and I don't want to have it streaming data constantly. What I want is the following:
- Send status once per second or per request
- Send status on change
Do this for all GPIOs. Use a dedicated blue pill board. EDIT: This is madness. Decision fatigue. I put everything together but stopped. I could just solder up the board and do it manually. Why is it so hard to start doing that? I guess I'm just tired.

Entry: Low hanging fruit?
Date: Tue Aug 14 17:06:37 EDT 2018

Or something interesting. I've not solved timing yet for the CPU. How to do this? One thing I need is

Entry: Instruction Sequencer vs. State Machine
Date: Tue Aug 14 17:41:45 EDT 2018

Here's the thing: a CPU, you only need to build once. A state machine is custom every time, because it is really hard to modularize.

Entry: A Forth
Date: Tue Aug 14 19:30:21 EDT 2018

So it's time for a mini Forth. Basically only to create loops. A monad with a compilation stack.

Entry: Tests
Date: Tue Aug 14 21:30:27 EDT 2018

- create a FPGA .v module to see what the resource usage is
- make a testbench, run it in myhdl
- move to FPGA

Entry: Bad hex value 1ff
Date: Tue Aug 14 22:34:27 EDT 2018

In _toVerilog.py:

def _convertInitVal(reg, init):
    if isinstance(reg, _Signal):
        tipe = reg._type
    else:
        assert isinstance(reg, intbv)
        tipe = intbv
    if tipe is bool:
        v = '1' if init else '0'
    elif tipe is intbv:
        v = "%s" % init if init is not None else "'bz"
    else:
        assert isinstance(init, EnumItemType)
        v = init._toVerilog()
    return v

init is originally set to the string "1ff". type(init) is <class 'str'>, and converted to string it is 1ff. I guess that's the problem? Implementation of this is in _intbv.py EDIT: fixed it by

- v = "%s" % init if init is not None else "'bz"
+ v = "h%s" % init if init is not None else "'bz"

Entry: Sizes of things
Date: Tue Aug 14 23:15:48 EDT 2018

140 LUTs for UART transmit and CPU. That's not a whole lot. I wonder what a good ratio is for LUT to transistor count. The 4004 had 2300 transistors. A FF is about 10 transistors. A LUT is about 20 transistors maybe? Accounting for low resource use? This says one LUT is 6 NAND gate equivalent. A CMOS NAND gate is 4 transistors. Which is 24 transistors per LUT.
https://blogs.synopsys.com/breakingthethreelaws/2015/02/how-many-asic-gates-does-it-take-to-fill-an-fpga/
So the CPU is about (* 24 140) 3360 transistor equivalent. The original Z80 was 8500 NMOS. Compare this to a PIC. PIC12C508: 8-bit, 1200 nanometre process.
https://en.wikipedia.org/wiki/PIC_microcontrollers
That's quite a big feature size. 1985 technology.
https://en.wikipedia.org/wiki/Microprocessor_chronology
https://www.pcmag.com/encyclopedia/term/49759/process-technology

Year  Nanometers (nm)  Micrometers (µm)
1957  120,000          120.0
1963  30,000           30.0
1971  10,000           10.0
1974  6,000            6.0
1976  3,000            3.0
1982  1,500            1.5
1985  1,300            1.3
1989  1,000            1.0
1993  600              0.6
1996  350              0.35
1998  250              0.25
1999  180              0.18
2001  130              0.13
2003  90               0.09
2005  65               0.065
2008  45               0.045
2010  32               0.032
2012  22               0.022
2014  14               0.014
2017  10               0.010
2018  7                0.007
??
      5                0.005

Entry: FPGA reverse engineering
Date: Tue Aug 14 23:58:00 EDT 2018

https://hackaday.com/2018/01/17/34c3-reverse-engineering-fpgas/

Entry: critical path
Date: Wed Aug 15 08:12:41 EDT 2018

Use icetime. Later, upgrade tools and have it do timing-based PNR. It's about 2x the 36MHz. If this gives issues, it is always possible to create two clock domains, where clock recovery and FIFO write is done in the 36MHz domain, and the CPU and control run at a lower rate.

tom@panda:~/asm_tools$ make x_soc_fpga.ct256.time
icetime -p x_soc_fpga.pcf -o x_soc_fpga.ct256.nl.v -P ct256 -d hx8k -t x_soc_fpga.ct256.asc
// Reading input .pcf file..
// Reading input .asc file..
// Reading 8k chipdb file..
// Creating timing netlist..
icetime topological timing analysis report
==========================================
Warning: This timing analysis report is an estimate!

Report for critical path:
-------------------------
ram_25_15 (SB_RAM40_4K) [clk] -> RDATA[13]: 2.246 ns
  2.246 ns net_99005 (s14_rd[13])
  t521 (LocalMux) I -> O: 0.330 ns
  inmux_24_16_99177_99230 (InMux) I -> O: 0.260 ns
  lc40_24_16_5 (LogicCell40) in0 -> lcout: 0.449 ns
  3.284 ns net_95054 ($abc$1273$n154_1)
  t473 (LocalMux) I -> O: 0.330 ns
  inmux_23_16_95119_95168 (InMux) I -> O: 0.260 ns
  lc40_23_16_7 (LogicCell40) in3 -> lcout: 0.316 ns
  4.189 ns net_90979 ($abc$1273$n5)
  odrv_23_16_90979_50336 (Odrv12) I -> O: 0.540 ns
  t393 (Sp12to4) I -> O: 0.449 ns
  t395 (Span4Mux_v3) I -> O: 0.337 ns
  t394 (LocalMux) I -> O: 0.330 ns
  inmux_18_19_75079_75125 (InMux) I -> O: 0.260 ns
  lc40_18_19_3 (LogicCell40) in0 -> lcout: 0.449 ns
  6.552 ns net_70960 ($abc$1273$n7)
  odrv_18_19_70960_75043 (Odrv12) I -> O: 0.540 ns
  t223 (Sp12to4) I -> O: 0.449 ns
  t222 (Span4Mux_v2) I -> O: 0.252 ns
  t221 (IoSpan4Mux) I -> O: 0.323 ns
  t220 (LocalMux) I -> O: 0.330 ns
  t219 (IoInMux) I -> O: 0.260 ns
  t218 (ICE_GB) USERSIGNALTOGLOBALBUFFER -> GLOBALBUFFEROUTPUT: 0.617 ns
  t217 (gio2CtrlBuf) I -> O: 0.000 ns
  t216 (GlobalMux) I -> O: 0.154 ns
  t215 (INTERCONN) I -> O: 0.000 ns
  t214 (LocalMux) I -> O: 0.330 ns
  inmux_16_15_66433_66498 (InMux) I -> O: 0.260 ns
  lc40_16_15_6 (LogicCell40) in0 -> lcout: 0.449 ns
  10.515 ns net_62318 ($abc$1273$n48)
  odrv_16_15_62318_65784 (Odrv12) I -> O: 0.540 ns
  t168 (Span12Mux_v9) I -> O: 0.421 ns
  t167 (LocalMux) I -> O: 0.330 ns
  t166 (IoInMux) I -> O: 0.260 ns
  t165 (ICE_GB) USERSIGNALTOGLOBALBUFFER -> GLOBALBUFFEROUTPUT: 0.617 ns
  t164 (gio2CtrlBuf) I -> O: 0.000 ns
  t163 (GlobalMux) I -> O: 0.154 ns
  12.836 ns seg_18_16_glb_netwk_5_6

Resolvable net names on path:
  2.246 ns ..  2.835 ns s14_rd[13]
  3.284 ns ..  3.873 ns $abc$1273$n154_1
  4.189 ns ..  6.104 ns $abc$1273$n5
  6.552 ns ..  6.552 ns $abc$1273$n7
 10.066 ns .. 10.066 ns $abc$1273$n7$2
 10.515 ns .. 10.515 ns $abc$1273$n48

Total number of logic levels: 5
Total path delay: 12.84 ns (77.90 MHz)

Entry: rle
Date: Wed Aug 15 08:20:54 EDT 2018

The simplest way is to write down explicitly what is expected:
- remove preamble
- generate the periodic rle signal a couple of times
- compare
It's no longer possible to run lazily with SeqTH, so maybe create a workaround for that?

Entry: MyHDL is picky
Date: Wed Aug 15 10:19:18 EDT 2018

Top level circuit cannot use *args, as argument names are recovered. So I'm including the reset generator instantiation in the generated module.

rst = ice40_reset(CLK, RST)

This means reset should be an output also.

Entry: Build cleanup
Date: Wed Aug 15 11:06:55 EDT 2018

One Haskell file per FPGA project?

Entry: Yosys ram init
Date: Wed Aug 15 12:02:21 EDT 2018

blif below,
indented for readability. It should be possible to identify these based on any of the names of the attached ports. How does the ram write tool work?
https://github.com/cliffordwolf/icestorm/tree/master/icebram
It seems there are 3 options:
- manually instantiate SB_RAM40_4K and provide init params
- edit the .blif
- edit the .asc
I believe the init parameters are bit vectors, so the least significant word is to the right.

.gate SB_RAM40_4K
  MASK[0]=$true MASK[1]=$true MASK[2]=$true MASK[3]=$true
  MASK[4]=$true MASK[5]=$true MASK[6]=$true MASK[7]=$true
  MASK[8]=$true MASK[9]=$true MASK[10]=$true MASK[11]=$true
  MASK[12]=$true MASK[13]=$true MASK[14]=$true MASK[15]=$true
  RADDR[0]=s60_ra[0] RADDR[1]=s60_ra[1] RADDR[2]=s60_ra[2] RADDR[3]=s60_ra[3]
  RADDR[4]=s60_ra[4] RADDR[5]=s60_ra[5] RADDR[6]=s60_ra[6] RADDR[7]=s60_ra[7]
  RADDR[8]=$false RADDR[9]=$false RADDR[10]=$false
  RCLK=CLK RCLKE=$true
  RDATA[0]=s60_rd[0] RDATA[1]=s60_rd[1] RDATA[2]=s60_rd[2] RDATA[3]=s60_rd[3]
  RDATA[4]=s60_rd[4] RDATA[5]=s60_rd[5] RDATA[6]=s60_rd[6] RDATA[7]=s60_rd[7]
  RDATA[8]=s60_rd[8] RDATA[9]=s60_rd[9] RDATA[10]=s60_rd[10] RDATA[11]=s60_rd[11]
  RDATA[12]=s60_rd[12] RDATA[13]=s60_rd[13] RDATA[14]=s60_rd[14] RDATA[15]=s60_rd[15]
  RE=$true
  WADDR[0]=$false WADDR[1]=$false WADDR[2]=$false WADDR[3]=$false
  WADDR[4]=$false WADDR[5]=$false WADDR[6]=$false WADDR[7]=$false
  WADDR[8]=$false WADDR[9]=$false WADDR[10]=$false
  WCLK=$false WCLKE=$false
  WDATA[0]=$false WDATA[1]=$false WDATA[2]=$false WDATA[3]=$false
  WDATA[4]=$false WDATA[5]=$false WDATA[6]=$false WDATA[7]=$false
  WDATA[8]=$false WDATA[9]=$false WDATA[10]=$false WDATA[11]=$false
  WDATA[12]=$false WDATA[13]=$false WDATA[14]=$false WDATA[15]=$false
  WE=$true
.attr src "/usr/local/bin/../share/yosys/ice40/brams_map.v:191|/usr/local/bin/../share/yosys/ice40/brams_map.v:35"
.param INIT_0 through INIT_F: each a string of 256 'x' bits (uninitialized).
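Assuming that ordering (least significant word rightmost, i.e. 16 words of 16 bits per 256-bit INIT_x), packing memory contents into one param string would look roughly like this — a sketch, not icebram's actual code:

import Data.Bits (testBit)

-- One INIT_x value as a plain binary digit string, word 0 rightmost.
initParam :: [Int] -> String
initParam ws = concatMap bits (reverse ws) where
  bits w = [ if testBit w i then '1' else '0' | i <- [15,14..0] ]

-- e.g. initParam (1 : replicate 15 0) ends in ...0000000000000001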
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx .param INIT_7 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx .param INIT_8 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx .param INIT_9 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx .param INIT_A xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx .param INIT_B xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx .param INIT_C xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx .param INIT_D xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx .param INIT_E xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx .param INIT_F xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Entry: FPGA load program Date: Wed Aug 15 12:19:50 EDT 2018 First, see if there is a sign of life when just modifying the blif manually. Extend the iceprog program to send bytes to SPI. EDIT: Done. Now, how to bootstrap into this? Make a SPI command to set the other 7 LEDs. Temp setup on tp: cd /i/tom/asm_tools export PATH=/i/tom/git/icestorm/iceprog/:$PATH make f_blink.ct256.iceprog dd if=/dev/urandom of=/tmp/test.bin bs=1 count=1 ; iceprog -x /tmp/test.bin echo -ne 'A' >/tmp/test.bin ; iceprog -x /tmp/test.bin Entry: PLL setup Date: Wed Aug 15 13:17:31 EDT 2018 https://github.com/YosysHQ/yosys/issues/107 Entry: iceprog Date: Wed Aug 15 14:15:43 EDT 2018 Code says 6MHz, but board runs at 12MHz, so this might not be sampled correctly. Does the 6MHz reflect the data rate or the actual clock signal frequency? E.g.
if the actual clock frequency is 3MHz then we're good. BITMODE_MPSSE void send_spi(uint8_t *data, int n) { if (n < 1) return; send_byte(0x11); send_byte(n-1); send_byte((n-1) >> 8); int rc = ftdi_write_data(&ftdic, data, n); if (rc != n) { fprintf(stderr, "Write error (chunk, rc=%d, expected %d).\n", rc, n); error(); } } http://www.ftdichip.com/Support/Documents/AppNotes/AN_135_MPSSE_Basics.pdf Ok, I'm not going to find this by staring at code and nonexistent documentation. https://learn.adafruit.com/adafruit-ft232h-breakout/mpsse-setup https://github.com/devttys0/libmpsse Use this code as documentation: https://github.com/devttys0/libmpsse.git From this: This command sets div x5, generating 12MHz from 60MHz TCK_X5 = 0x8A, Then putting system_clock=12MHz in here, the divisor for freq=6MHz is 0. /* Convert a frequency to a clock divisor */ uint16_t freq2div(uint32_t system_clock, uint32_t freq) { return (((system_clock / freq) / 2) - 1); } factor reg=factor-1 6MHz 1 0 3MHz 2 1 2MHz 3 2 ... So I can just change that. Big or little endian? From the other examples it seems to be little endian. send_byte(0x86); - send_byte(0x00); + send_byte(0x01); send_byte(0x00); Entry: Next Date: Wed Aug 15 15:02:30 EDT 2018 SPI seems to be working. Now: - create flow to push program out to .bin file - create test file that emulates spi upload at 3MHz +- some fractional error - have it put a pattern on the LEDs, test with rle sim EDIT: Running into a problem. Instead of nesting all these things, it might be better to perform the composition inside one top-level function. Something like "make_soc", where all the components are passed in. Components: - CPU - instruction sequencer - boot (memory write) - bus - reset circuitry Entry: FPGA board test Date: Wed Aug 15 19:11:43 EDT 2018 The next thing to do is to create a program that writes 'U' to the LEDs. From there on, a couple of different programs could be tried. Entry: No sign of life for CPU Date: Wed Aug 15 22:02:47 EDT 2018 Step is too big. Reduce: bring ip out. Can this be done with probes? EDIT: Adding environment-based debug cross-cuts. It runs somewhat; I'm seeing some patterns but nothing completely consistent. tom@tp:/i/tom/asm_tools$ echo 'UUUU' >/tmp/test.bin tom@tp:/i/tom/asm_tools$ iceprog -x /tmp/test.bin This gives 10100101 So it's not clocking in properly. One problem: the fifo pointer isn't resetting. On the first load, I see "02". I wonder if spi resets properly as well.. Entry: State can linger. Date: Wed Aug 15 23:04:09 EDT 2018 When creating tests: make sure to run them multiple times in a row to make sure devices reset properly. Entry: deser Date: Thu Aug 16 08:16:56 EDT 2018 There is something wrong with the deser that is not obvious in the simulation. Write one from scratch and look at the difference? There should be a simpler way to keep this under control. Basically, I don't see what's happening. Maybe it's time to take a break from this. Entry: Debug probes Date: Thu Aug 16 09:01:41 EDT 2018 Is it enough to use the environment-based probing? At the bottom level, probes are just ignored if they are not defined. At the top level, probes can be pulled out if they are defined, and raise an error if they are not. This seems better altogether. Maybe a combination of both will do? Use the first probe mechanism to collect names into the meta level, then use the former environment mechanism to teleport signals.
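Sketching the collection half of that, stripped to its essence. This is a hedged sketch, not the actual Seq code: the names probe and withProbes are made up, and real Seq signals carry more structure than a bare type parameter.

import Control.Monad.Writer

type Probes s = [(String, s)]

-- Bottom level: tag a signal with a name and pass it through.
-- Code containing probes does not change meaning; the tags just
-- accumulate on the side, and are ignored if nobody collects them.
probe :: MonadWriter (Probes s) m => String -> s -> m s
probe name sig = do
  tell [(name, sig)]
  return sig

-- Top level: run a circuit fragment and pull out the tagged signals,
-- e.g. to hand them to a test bench or a tracer.
withProbes :: Writer (Probes s) a -> (a, Probes s)
withProbes = runWriter

The environment half would then consume the collected name->signal pairs and route them to wherever the debug logic lives.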
Entry: Smaller step Date: Thu Aug 16 10:30:49 EDT 2018 SPI works, but writing to buffer, then executing, doesn't work. What about making this simpler: add readout first, and see what that gives. Another thing is to clock the CPU from SPI, and pull out debug information that way. Reading is not expensive: it is just routing. Writing is expensive: it requires multiplexing at the input of a register. So what about adding a tracer for registers? I have the probe mechanism, so a state machine could just go through all of this. Entry: State machines Date: Thu Aug 16 10:36:39 EDT 2018 So what is my problem with state machines? They are hard to write. But, I have a macro mechanism at my disposal, so why not expand it from a higher level description? Basically, a lot of state machines have a structure that is: - act, wait, act, wait, ... Where each step is a state. Entry: Buffer Date: Thu Aug 16 10:43:20 EDT 2018 - Create an abstraction that splits a memory into two parts: one that can read over SPI, one that can write. - Abstract SPI cs,sck into bc,rst. The latter is maybe the most important part: external data representation vs. internal. For SPI it is ok to assume that cs=rst. Entry: What do you call a clock enable pulse? Date: Thu Aug 16 10:45:15 EDT 2018 Calling it clock enable (as is done on a fifo) is cumbersome, so call it clock instead. Added a note to SeqLib.hs. Entry: FPGA SPI bug Date: Thu Aug 16 11:35:36 EDT 2018 Here's the theory: the sampling edge is 0->1, but the initial clock phase is 1. The FPGA is immune to this, but my edge detector is not. Let's just look at the scope. Nah, I need to solder... So let's look at the iceprog code. This is how it starts a transfer when in MPSSE modes. send_byte(0x11); send_byte(n-1); send_byte((n-1) >> 8); The code here is more readable: https://github.com/devttys0/libmpsse https://www.intra2net.com/en/developer/libftdi/documentation/ftdi_8h.html #define MPSSE_WRITE_NEG 0x01 /* Write TDI/DO on negative TCK/SK edge*/ #define MPSSE_DO_WRITE 0x10 /* Write TDI/DO */ So it writes the data on the negative edge. Which means the initial polarity is 1. Now, make a test for SPI phase. Basically, specify this properly. Entry: I can't slow it down? Date: Thu Aug 16 15:35:05 EDT 2018 It's working, but I still can't slow it down. Only works with this: // clock divide send_byte(0x86); send_byte(0x00); send_byte(0x00); Filling in any other div kills both the IRAM upload and the main FPGA image upload. Or maybe I understand this badly. Entry: Generate verilog directly Date: Thu Aug 16 16:30:26 EDT 2018 This could just be straight up RTL. module ( ... ) { input ; ... output ; ... wire [:0] ; reg [:0] ; assign = ; } Then the sequential bit could be one big block. Entry: Why do FPGAs use LUTs? Date: Thu Aug 16 17:11:54 EDT 2018 I don't know, but a guess is that a LUT doesn't introduce any bias against certain functions. Each n-ary function is implemented with the same cost. As such, LUTs are universal. They are probably also easy to implement as compared to other structures. The question is more: why 3 to 1, and not any other n to m configuration? There does seem to be some variability here. 3-1, 4-1, even mult-out? This says 4-1 is best for size, 5-1 is best for performance. https://ieeexplore.ieee.org/document/1410611/ Entry: So CPU works Date: Thu Aug 16 17:47:40 EDT 2018 Next: don't use it! Seriously though, there is a middle ground here: generating state machines from static programs. Here's the outline of a practical program I need to write.
The input/output are abstracted, but essentially they perform two kinds of actions: - write to a register - send a start pulse - wait for a done pulse e.g. - write - wait - n times: - m times: - write - wait - wait - n times - write, pulse - wait - pulse - wait The essence of this structure is the nesting. To flatten this requires flattening out the nesting. What makes a CPU interesting is that it can do loops. A stack then adds nested loops. To translate all that into a state machine requires state for each level + possibly repurposing the state for the next loop. I don't even have to work this out to know that the CPU approach is _much_ simpler once there is any form of nesting. Note that a UART already performs nesting: - top level states: idle->start->data->stop->idle - inside data there are N data states So to create a UART, one could also just create a CPU. Let's look at that in the next post. Entry: Bitbang UART vs. hardware Date: Thu Aug 16 18:06:34 EDT 2018 How much more complicated is a CPU implementing a UART as a program, compared to a hardware state machine? - wait start - delay 1/2 - n times - delay 1 - sample - delay 1 - check stop - delay 1/2 Another question: since bit size will be known, why not a flat state machine? Here's an idea. What is easy to do? - flat state machines - flat + shared counter in each stage or generalized: any shared abstract state machine It seems that the important idea is to be able to share resources in the several sub states. If that is possible, or even natural (a counter for instance), a state machine will be ok. OTOH if there are states where resource sharing is not obvious, a sequential program will be more appropriate. Entry: Lessons Learned Date: Thu Aug 16 18:52:47 EDT 2018 1) You can't escape the fact that you're dealing with circuits. 2) Resource sharing is about multiplexing. 3) Generated Verilog can be simple (just RTL). 4) Monadic do notation is still awkward, and there do not seem to be any obvious workarounds. Entry: CPU resource use Date: Thu Aug 16 19:04:16 EDT 2018 After packing: IOs 15 / 206 GBs 0 / 8 GB_IOs 0 / 8 LCs 238 / 7680 DFF 104 CARRY 37 CARRY, DFF 5 DFF PASS 44 CARRY PASS 14 BRAMs 1 / 32 WARMBOOTs 0 / 1 PLLs 0 / 2 Entry: Better instruction packing Date: Thu Aug 16 19:07:23 EDT 2018 Currently mostly unused: 3 bits 5 bits reserved 8 bits argument If it is ok to have all zero operand instructions, something like this could work: H 7 8 - literal L 15 - packed Where 7 and 15 bits contain packed instructions. Anyways a lot is possible still. This isn't even a 2-stack machine. Here's an approach: 1 + 3*5, which can pack 5 instructions. The other branch can then be used for literals and jumps. Anyways this all needs to be evaluated relative to an application, because there are otherwise too many variables to even begin to define "optimal". In my case, "optimal" means simple. Before going anywhere, base it on Moore or Koopman: https://users.ece.cmu.edu/~koopman/stack_computers/sec6_3.html Entry: soc spi Date: Thu Aug 16 19:51:37 EDT 2018 Now that boot works, maybe add a SPI command mode? Entry: Subroutine call / return Date: Thu Aug 16 20:51:46 EDT 2018 I don't care so much about using more instructions inside the subroutine (return), but the call site should be small. Call could then load the instruction pointer on the stack, and return pops it. Actually, the next pointer is already available. Entry: single stack is enough for a control processor?
Date: Thu Aug 16 20:59:31 EDT 2018 Basically, if a subroutine does not need to return an argument? Currently if I want to pass a single argument to a routine, it would be: push ; call swap ; drop I.e. the subroutine would start with swap to move the return address to the side. Entry: variables Date: Thu Aug 16 21:02:53 EDT 2018 Use the other memories for variables? Or just use them as stacks? That way no logic is needed besides the pointers. Entry: CPU designs Date: Thu Aug 16 21:04:35 EDT 2018 - single stack control computer - dual stack Forth machine Entry: A CPU: It's inverted Date: Thu Aug 16 21:45:35 EDT 2018 Signals are computed from their inputs, but for some reason this seems inverted. I.e. it's more natural to think of a big OR statement as "also set if". In my first circuit I found this was exactly the problem to get around. Entry: How expensive is the stack? Date: Thu Aug 16 22:42:04 EDT 2018 3 x 8 After packing: IOs 15 / 206 GBs 0 / 8 GB_IOs 0 / 8 LCs 302 / 7680 DFF 120 CARRY 37 CARRY, DFF 5 DFF PASS 52 CARRY PASS 14 BRAMs 1 / 32 WARMBOOTs 0 / 1 PLLs 0 / 2 4 x 8 After packing: IOs 15 / 206 GBs 0 / 8 GB_IOs 0 / 8 LCs 310 / 7680 DFF 128 CARRY 37 CARRY, DFF 5 DFF PASS 52 CARRY PASS 14 BRAMs 1 / 32 WARMBOOTs 0 / 1 PLLs 0 / 2 So it's just the FFs. Basically, stack depth is cheap. What about width? Going from 4x8 -> 4x12: that makes a bigger difference. But it also adds logic for the memories. After packing: IOs 15 / 206 GBs 0 / 8 GB_IOs 0 / 8 LCs 391 / 7680 DFF 158 CARRY 49 CARRY, DFF 5 DFF PASS 58 CARRY PASS 14 BRAMs 16 / 32 WARMBOOTs 0 / 1 PLLs 0 / 2 Entry: Removed old notes from CPU.hs Date: Thu Aug 16 23:49:53 EDT 2018 -- Original notes on stack machines: -- Memory seems to be the most important component. I'm going to -- target the iCE40, which has a bunch of individual memories, -- allowing separate busses for instruction, data and return stack, -- and the rest bundled as data memory. -- At every clock, each of the 4 memories has a word sitting on its -- read port: -- i: current instruction -- d: 2nd on stack (top is in a register) -- r: return address (ip is the instruction memory's write port) -- The instruction word drives the decoder, which drives all the -- muxes. -- It seems that reading out instructions is the most useful thing to -- start with. This could be used for specialized sequencers that are -- not necessarily general purpose CPUs. This can then be gradually -- extended to more abstract operations. -- The main problem for building a CPU is to properly decompose the -- decoder. I'm not sure how to do this exactly, so just start in an -- ad-hoc way. -- There is some arbitrariness here: a hierarchy is created in the -- nesting of the "close" operations. The guideline is to abstract -- away a register as soon as possible, i.e. move it to the inner part -- of the hierarchy. -- At the very top, there is: -- . instruction memory access: -- . read: program sequencing -- . write: bootloader -- . BUS I/O (i.e. containing GPIO) -- Each hierarchy level is an adaptation. closeIW will abstract the -- inner decoder as an iw -> jump operation, and insert the necessary -- logic to either just advance to the next instruction, or perform a -- jump. -- The original problem that drove this exploration is meanwhile -- implemented on PRU.
These were the instructions needed: -- -- a) loop n times -- b) write UART byte, wait until done -- c) wait -- d) set I/O -- e) read I/O into memory and advance pointer -- To implement loops, it would be useful to have a stack to be able -- to have nested loop counters. This would mean fewer registers. I'm -- not going to be able to make this simpler than making a small Forth -- machine.. This way: -- UART out can be bit-banged. -- Multiple counters not needed for timing control. -- No "wait" instruction needed: instruction counting suffices. -- Add a data stack when needed. Probably a single top register is enough. -- The basic instructions seem straightforward. This is just a -- decoder that fans out into mux controls. The unknown part to me is -- the call/return. -- Call: move IP+1 -> rtop write port -- inc rpointer -- set ip from instruction word -- Ret: dec rpointer -- move rtop -> IP -- This could also be microcoded: -- a) load literal into rdata -- b) increment rstack -- c) unconditional jump -- The operations that can be reused are: -- write, postinc (stacks + buffers) -- read, predec -- So there is a clear tradeoff between the complexity of the -- instruction decoder, and the number of instructions needed. -- Where to start? Conditional memory write. -- So for unidirectional flow, this is easy. For bi-directional such -- as a stack, two pointers need to be maintained. It might be -- simplest to initialize them such that the write/read operation can -- happen immediately? Both will have individual adders. Maybe not a -- good idea? -- Perform an operation and wait for it to finish. -- Let's keep the operation abstract, so what this does is: -- -- . first time the instruction is executed, the sub-machine is -- enabled. the sequencer will wait until the machine provides a -- "done" flag, which will advance the instruction pointer. -- -- . it seems simpler to split this into "start" and "wait" -- instructions. -- -- Each instruction can have push/pop/write/nop wrt imm? It seems -- possible that stack can be manipulated in parallel with bus -- transfer. -- But let's not make this too complicated. Some observations: -- . This is for very low level, specialized code. It will never be -- necessary to manipulate addresses as data, so address for read, -- write can always come from the immediate word. The data itself -- might be manipulated. -- It's probably ok to instantiate it fully even if certain -- instructions are not used. Yosys/abc removes unused logic. -- Still, this can use some decomposition. For now, because there are -- not many instructions, use one-hot encoding to keep the logic -- simple. Entry: Next Date: Fri Aug 17 08:39:05 EDT 2018 - Direct Verilog RTL code gen. - Type-level additions - Syntax preprocessor - Get rid of (SInt (Just n) _) and use e.g. Reg t v , Sig t , Const v. None of these are essential. SInt can be left if replaced by a wrapper like sbits. EDIT: Did that. Simplifies things without making a huge change. Type directed stuff: I'm asking for help on Twitter. Syntax preproc. Seems to be more trouble than it's worth because this needs "escapes" for generic meta-level stuff. It seems that just sticking to "do" for now is the best option. Maybe some TH can help as a middle ground? Verilog seems most useful to cut out MyHDL entirely. It's nice, but not really necessary. And I'm not using its main purpose: to be able to use Python as a macro and test bench language. EDIT: Started putting in the boilerplate for Verilog.hs. It seems quite straightforward, boring.
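Boring, because the skeleton really is small. A minimal sketch of the shape I have in mind, with made-up constructor names (the real netlist type carries more information than a name, a width and an expression string):

data Bind = Wire String Int String  -- name, bit width, rhs expression
          | Reg  String Int String  -- name, bit width, next-state expression

vModule :: String -> [Bind] -> String
vModule name bs = unlines $
     ["module " ++ name ++ "(input clk);"]
  ++ map decl bs
  ++ ["assign " ++ n ++ " = " ++ e ++ ";" | Wire n _ e <- bs]
  ++ ["always @(posedge clk) begin"]
  ++ ["  " ++ n ++ " <= " ++ e ++ ";" | Reg n _ e <- bs]
  ++ ["end", "endmodule"]
  where decl (Wire n w _) = "wire [" ++ show (w-1) ++ ":0] " ++ n ++ ";"
        decl (Reg  n w _) = "reg [" ++ show (w-1) ++ ":0] " ++ n ++ ";"

E.g. putStr $ vModule "counter" [Reg "c" 8 "c + 1"] prints a complete 8-bit counter module. Ports, memories, reset and instantiation are the parts that need real work around this skeleton.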
Entry: Forward declarations Date: Fri Aug 17 11:13:43 EDT 2018 Likely will be necessary eventually, but can it be hacked for now? E.g. write a program as a set of routines, and tell the compiler which is the first one? This also avoids infinite loops. Entry: Bit-serial CPU, PLC Date: Fri Aug 17 12:18:43 EDT 2018 I wonder if that makes sense. It seems that the bit-width significantly impacts the resource use. https://en.wikipedia.org/wiki/Serial_computer What I'm building is a PLC? https://en.wikipedia.org/wiki/PDP-14 https://en.wikipedia.org/wiki/Programmable_logic_controller http://fpga-guru.com/files/supercn.pdf Entry: Bit serial architecture. Date: Fri Aug 17 12:34:54 EDT 2018 These could be cycled through a RAM, giving 16 registers of 256 bits deep. I'm actually really intrigued by this. Seems like a perfect task for a good macro language! But, let's cut it short: before building something like this, build a fixed datapath DSP. A FIR or a biquad. Once it is clear how to abstract it, only then move on to CPU design. EDIT: The instruction word and decoder will have to be parallel. The argument could be streamed in. Next instruction address can be streamed out as part of the ALU. It seems that the main design challenge is to design the ALU. Routing needs to be determined from the instruction word. Operations like conditional jumps need an extra cycle to decide between continue and jump. Probably simplest to do it in two macro cycles: compute condition and jump in the next cycle. Entry: CPU/PLC vs state machine Date: Fri Aug 17 12:44:57 EDT 2018 If what you're doing is sequential in time, needs strict timing but has a time scale that is far below the clock rate, multiplex and sequence it on a CPU. If you need something simple, something parallel, something fast, use a state machine. Entry: Better testing of CPU Date: Fri Aug 17 12:52:52 EDT 2018 Create some quickcheck I/O functions to constrain the behavior before adding new instructions and peripherals. Simple program -> debug sequence is probably ok. Entry: Verilog Date: Fri Aug 17 16:17:14 EDT 2018 One reason to generate Verilog is to make it easier to do direct instantiation. EDIT: I'm thinking that the other direction might actually be a lot more important: take a verilog module, and instantiate it into Seq. The simplest way is maybe to use yosys or icarus to translate it to a netlist. This can probably be done lazily, once there is a need. EDIT: Verilog generation is complete. Needs some testing still. Import, I'd like to see if I can do something with blif. http://www.clifford.at/yosys/files/yosys_appnote_010_verilog_to_blif.pdf Looking at what kinds of gates are in the blif: tom@panda:~/asm_tools$ cat f_soc.blif |grep ^.gate | awk '{print $2}'|sort|uniq SB_CARRY SB_DFF SB_DFFE SB_DFFER SB_DFFESR SB_DFFESS SB_DFFR SB_DFFSR SB_DFFSS SB_LUT4 SB_RAM40_4K Entry: STUArray Date: Fri Aug 17 16:31:40 EDT 2018 Figure out how to chunk it. runSTUArray doesn't have the correct type. I also want the other parameters. So looks like unsafeFreeze is needed. EDIT: see strictToLazyST. It seems possible, but I got to really understand it first. Not for now. Entry: RTL Date: Fri Aug 17 22:01:21 EDT 2018 Yosys manual: 2.1.4 Register-Transfer Level (RTL) Many optimizations and analyses can be performed best at the RTL level. Examples include FSM detection and optimization, identification of memories or other larger building blocks and identification of shareable resources. 
multi-level logic synthesis - Binary-Decision-Diagram (BDD) has a unique normal form. https://en.wikipedia.org/wiki/Binary_decision_diagram This is nested IF / 2-1 multiplexers. - And-Inverter-Graph (AIG) better worst case performance. ABC uses this. https://en.wikipedia.org/wiki/And-inverter_graph Entry: yosys log Date: Sat Aug 18 01:32:52 EDT 2018 12 bit: Number of wires: 277 Number of wire bits: 920 Number of public wires: 58 Number of public wire bits: 303 Number of memories: 0 Number of memory bits: 0 Number of processes: 0 Number of cells: 547 SB_CARRY 47 SB_DFF 14 SB_DFFE 13 SB_DFFER 73 SB_DFFESR 3 SB_DFFESS 1 SB_DFFR 61 SB_DFFSR 5 SB_DFFSS 2 SB_LUT4 312 SB_RAM40_4K 16 4 bit: Number of wires: 160 Number of wire bits: 387 Number of public wires: 61 Number of public wire bits: 216 Number of memories: 0 Number of memory bits: 0 Number of processes: 0 Number of cells: 290 SB_CARRY 22 SB_DFF 13 SB_DFFE 13 SB_DFFER 29 SB_DFFESR 3 SB_DFFESS 1 SB_DFFR 36 SB_DFFSR 5 SB_DFFSS 2 SB_LUT4 165 SB_RAM40_4K 1 Entry: Clash, Lava Date: Sat Aug 18 02:13:57 EDT 2018 Clash uses a static analysis approach. ( Implemented as plugin? ) http://hackage.haskell.org/package/clash-prelude-0.99.3/docs/Clash-Tutorial.html Different from eDSL like the Lava languages. http://projects.haskell.org/chalmers-lava2000/Doc/tutorial.pdf http://hackage.haskell.org/package/chalmers-lava2000 I see no monads.. Brings me to sharing. http://www.ittc.ku.edu/~andygill/talks/20090903-hask.pdf How do we turn a Lava program into a Graph? Use a Monad – Xilinx Lava does this. Tag every binding with a unique tag – Hydra does this. Automate the introduction of unique tags. Destroys referential transparency. Chalmers Lava does this. Not a big problem in practice. Use a function that can see certain classes of sharing. Like a reflective parser. Kansas Lava does this. Like Chalmers Lava, can be unsafe. Data.Reify uses StableNames. Seems like dirty hacks galore.. Stick to Monad. Fewer surprises. Some more on Chalmers Lava: http://www.cse.chalmers.se/edu/year/2012/course/TDA956/Slides/Lava111.pdf Entry: Xilinx lava Date: Sat Aug 18 03:07:25 EDT 2018 http://hackage.haskell.org/package/xilinx-lava http://hackage.haskell.org/package/xilinx-lava-5.0.1.9/docs/Lava.html type Out a = State Netlist a Entry: stripping bare a 6809 Date: Sat Aug 18 03:21:07 EDT 2018 Same story: need a CPU, don't have much room. https://www.edn.com/design/integrated-circuit-design/4460471/Afternoon-diversion--Design-your-own-microprocessor Entry: Avenues Date: Sat Aug 18 03:32:58 EDT 2018 - split the design: - rename CPU.hs to PLC.hs - create a real Forth processor with 2 stacks from 2 memory banks - move Staapl pic compiler to Haskell. It's only the peephole optimizer, the rest can be done in Haskell incrementally. Entry: PLL Date: Sat Aug 18 08:56:43 EDT 2018 Before declaring victory, run it at a higher clock frequency. Do this using Verilog, direct instantiation. But first, see if the verilog actually works. Entry: Verilog debugging Date: Sat Aug 18 09:34:02 EDT 2018 Trying to generate f_soc with Verilog.hs. Running into things like this: wire [99:0] s144; // "s144" <- (IF "rx_in" (CONST _:0) (CONST _:1)) Solution? This needs to be constant-folded. For now it's probably OK to leave it in as yosys will optimize it out. The real issue here is types: there is no unification going on in the other direction.
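Before reaching for a full unification engine, the missing pass can probably be much dumber. A sketch, assuming the netlist can be reduced to groups of nodes that must share a bit width (IF branches and their output, the operands and result of ADD/XOR/..., a register and its next-state input); concat, whose output width is the sum of the input widths, is deliberately left out, and all names here are made up:

import qualified Data.Map as Map
import           Data.Map (Map)

type Node = Int

-- For each node, the nodes that must have the same bit width.
type SameWidth = Map Node [Node]

-- One pass: every group adopts the first known width found in it.
-- (A real version should also check for conflicting known widths.)
pass :: SameWidth -> Map Node (Maybe Int) -> Map Node (Maybe Int)
pass net ws = Map.foldrWithKey share ws net where
  share n ins m =
    case [w | k <- n : ins, Just (Just w) <- [Map.lookup k m]] of
      []      -> m
      (w : _) -> foldr (\k -> Map.insert k (Just w)) m (n : ins)

-- Iterate to a fixed point; anything still Nothing afterwards is
-- genuinely unconstrained.
propagate :: SameWidth -> Map Node (Maybe Int) -> Map Node (Maybe Int)
propagate net ws = let ws' = pass net ws
                   in if ws' == ws then ws else propagate net ws'

Quadratic in the worst case, but the typed constants and register reads provide the seed widths, and the networks are small.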
I'm shuffling things around so a workaround can be added to f_soc.hs. EDIT: still there f_soc: fix_binding: ("s144",Comb3 (SInt Nothing 0) IF (Node (SInt (Just 1) 0) "rx_in") (Const (SInt Nothing 0)) (Const (SInt Nothing 1))) Hack: it's possible that these disappear when bundling expressions. But the best solution is to perform a unification pass on all the bit types in SeqTerm. Entry: Failing tests Date: Sat Aug 18 09:53:45 EDT 2018 -- p_deser *** Failed! Falsifiable (after 2 tests): (7,(3,[1])) (6,(13,[8191])) -- p_soc_fun *** Failed! Falsifiable (after 2 tests): (255,255) (253,254) EDIT: p_soc_fun was due to bus size set to 4. The other still fails: -- p_deser *** Failed! Falsifiable (after 2 tests): (4,(9,[0])) It doesn't seem to work at all. Ha. cs==1. Still a failing case: (4,(10,[0,1023,1018,0,5,4])) Entry: Unification Date: Sat Aug 18 10:09:04 EDT 2018 How to express this? It is a set of integer equations, but using a solver seems overkill. This is a core issue. In general I want to have equations between types, not functions. Here's a simple solution: - for each binding, propagate to output and inputs until there is no more change - another way, because these are trees the following terminates: pick a node with unknown type and re-root the tree, going downward. But actually I'm too tired to think now so let's try a library. http://hackage.haskell.org/package/monad-unify-0.2.2/docs/Control-Monad-Unify.html This should have everything. Actually I already have a unification algorithm: the netlist execution. No, it's only halfway: i/o function, no arbitrary relations. EDIT: monad-unify doesn't build: Configuring monad-unify-0.2.2... Setup: Encountered missing dependencies: base >=4.5 && <4.8 http://dev.stephendiehl.com/fun/006_hindley_milner.html What I'm trying to do is simpler because of the way that register types are always defined explicitly in the code using functional dependencies. So I should be able to just push it through that way. Entry: Verilog FPGA test Date: Sat Aug 18 10:32:28 EDT 2018 f_soc doesn't work while MyHDL still works. Likely my fixed-size workaround is bad: it won't work for concatenation. Entry: Next? Date: Sat Aug 18 11:01:17 EDT 2018 - Verilog.hs <- SeqTerm.hs unification - deser has a failing test case Entry: Behavioral Date: Sat Aug 18 14:31:06 EDT 2018 Look at Chalmers Lava paper, it has a "behavioral" language embedded that is then compiled to a "full tree" register update. http://projects.haskell.org/chalmers-lava2000/Doc/tutorial.pdf http://hackage.haskell.org/package/chalmers-lava2000 Entry: Term type cleanup Date: Sat Aug 18 16:08:24 EDT 2018 This requires a lot of code changes, so might not be the thing to do at this moment. Basically, I want to get to this: data Form n = Const | Comb1 Seq.Op1 n | Comb2 Seq.Op2 n n | Comb3 Seq.Op3 n n n | Slice n Seq.SSize Seq.NbBits | Delay n | Connect n | Mem (n,n,n,n) | Input -- Externally driven node deriving (Show, Functor, Foldable) The changes to the original Term: - Flatten Term (Op n) to Form n. This is already implemented. It seems necessary to be able to keep the non-monadic constants. - Change the memory representation such that memory is just an i/o function. The read data register can then be added as a normal register. - This then allows SType to be factored out of the Form type. Entry: Spiral after focus Date: Sat Aug 18 18:19:40 EDT 2018 I guess I'm looking for that rush of getting the processor to work, only to discover there is a lot of work left, and it's "work". I.e.
refactorings that are not so easy to perform. Mostly, this needs some rest. Entry: Unification Date: Sat Aug 18 22:27:06 EDT 2018 I'm running into a unification problem, and it seems to be slightly different from HM. What I want to know is how my problem is different. My problem, written as bit-width equations: concat: c = a + b; any other 2-op: c = a, c = b. Forget about the other cases for now. During execution, I can go from leaves up to result nodes, but some information travels the other way: from delay / connect nodes down to the leaves. Is it enough to just propagate that information in one way? That's a lot simpler than full unification. Yes, there are only two "entry points" where type information is present: typed constants and register reads at the leaves, and typed register/connect at the root. It should be enough to split the nodes into typed and untyped, iterate over all the roots, go down the full trees, and fill in information as it becomes available. Keep going until stable or done. Descending trees might not even be necessary. As long as there is reduction, it's ok to continue. If nothing changes in one pass it's time to give up. Is it a problem if this is quadratic? I guess not for any practical problem. So just scan all the nodes every time some new information is available. To optimize, use a map from nodes to nodes that depend on it. Actually, just a map is useful for other things, so maybe just build it? Entry: SeqNetList Date: Mon Aug 20 15:29:52 EDT 2018 Cleaned it up to use Data.Graph, which makes a lot of code simpler. Also have expression inlining. The question is now: is it still necessary to unify? The inlining might already fix that, leaving type (bitwidth) reconstruction of intermediates to the target language. EDIT: In the soc example, this seems to be the case. One simplification is to always use Vertex for the structural algorithms. When node representation needs to be changed, the functor property can be used. Entry: Next? Verilog Date: Mon Aug 20 15:40:01 EDT 2018 Verilog gen so PLL can be tested + mem init. Some other things need cleanup before that. And now is not the time. Too much "work-like". Is output propagation for "Connect" still necessary? One way to do this though is to create Verilog2 as a print of [(Vertex, (SSize, Expr Vertex))]. Then use postprocessing later to fix things. EDIT: This is still a big step. I need an analysis routine to filter out memories, which are modeled as combinatorial network + delay. EDIT: Currently, type information of constants is lost. So keep tracking that. EDIT: Fanout of Memory node is []? First, clean up types a bit. This is the graph. (35,TypedForm {typedFormType = Just 16, typedFormForm = Memory 26 29 28 95},[]) (36,TypedForm {typedFormType = Just 16, typedFormForm = Delay 35 0},[38]) Fanout is 0 because delays are not counted. Is it possible to keep the full graph representation, and turn it into a DAG when needed? EDIT: Created se printer, then disabled that as default. Running into this error: EDIT: Memory -> Delay fanout is now accessible again: (35,TypedForm {typedFormType = Just 16, typedFormForm = Memory 26 29 28 95},[36]) (36,TypedForm {typedFormType = Just 16, typedFormForm = Delay 35 0},[38]) memory_decl: ((2,TypedForm {typedFormType = Just 16, typedFormForm = Memory 33 34 35 4}), (3,TypedForm {typedFormType = Just 16, typedFormForm = Delay 2 0})) Entry: Next? Verilog done. Date: Tue Aug 21 23:08:10 EDT 2018 Instantiation of external modules, testbenches. If Verilog is generated, it needs to be validated.
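The obvious shape for that validation is a property that runs the same input trace through both semantics. A sketch, where both trace functions are hypothetical (a pure one from the Seq emulator, an IO one wrapping an iverilog run of the generated .v), and the input is assumed to be two 8-bit signals:

import Test.QuickCheck
import Test.QuickCheck.Monadic

prop_cosim :: ([[Int]] -> [[Int]]) -> ([[Int]] -> IO [[Int]]) -> Property
prop_cosim seqTrace verilogTrace =
  forAll (listOf (vectorOf 2 (choose (0, 255)))) $ \ins ->
    monadicIO $ do
      outs <- run (verilogTrace ins)
      assert (seqTrace ins == outs)

Random networks would be junk, but random input traces against a fixed circuit are exactly what the existing QC tests already do; this just adds the Verilog side as a second oracle.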
Entry: Verilog Date: Wed Aug 22 12:35:36 EDT 2018 Easy to do huh? Quite a detour to make SeqNetList.hs and still not generating the f_soc example properly. It works for f_blink. I suspect it's memory. That involves some manual steps. Inspection didn't yield any obvious mistakes. I don't have a good strategy to test this apart from going over every feature and writing or generating a testbench. Actually, quickcheck.. Does it make sense to generate a random network? Most would likely produce junk. It needs to be fed with random data. Or a noise generator. Entry: Something else Date: Wed Aug 22 14:46:08 EDT 2018 Not much sleep. Bored. What else? EDIT: Playing a bit with verilog testbenches. Entry: Macro languages / Simulation Date: Wed Aug 22 23:45:21 EDT 2018 Maybe there is some room for improvement? I guess I can continue with this, find abstractions that work. Comments, follow-ups: - multi-level sim (rtl, gate-level) - asic: problem is verification (design for verification) - RTL synthesis - Event-driven approach is necessary (multiple clock domains) - Seq test benches should be able to drive a verilog simulator (cosim) - MyHDL cosim with icarus https://news.ycombinator.com/item?id=9516217 "If you have worked in ASIC designs that you are almost certainly used to not work with Verilog directly, but with an ad-hoc macro language. I've seen them all, but most commonly Perl is used. It is especially for circuits that we desperately need better tools and abstractions." > They are all toys aimed at newbies. They address problems that are no problem at all to a qualified engineer in the profession. I think this kind of attitude is one of the main problems of the hardware design industry (the other one is the conservative mindset). The bugs are not due to language generally. If software guys want to write a new language for HW engineers, they need to ask HW engineers what the issues are rather then go off on a tangent. As it is they don't even know what the workflow is like and where the real pain lies. They are just writing tools for themselves. The problem is their applications are not commercial. Imperative synchronous languages. https://news.ycombinator.com/item?id=9948906 Exactly. It's crazy to even think, outside of analog, that someone would want to build a modern SOC with RTL directly. Even the full-custom folks like Intel, correct me if I'm wrong, essentially use a hardware approach where they can abstract away from and use EDA to synthesize to the stuff they hand-craft. Most SOC builders simply don't have the labor to waste working at RTL for a complex design & its inevitable problems. It's why they pay so much to the Big Three for synthesis tools. The real difference is that in event-driven languages, clock events are explicit, instead of implicit like in Chisel and the whole array of dead HDLs that preceded it. So if history is any guide, Chisel is dead upon arrival. Essentially once you translate your Chisel design to verilog you basically can't reuse your verification environment for the RTL simulation or the gate-level. How are you going to check timing if you wrote all your tests in Chisel? So much more logic is added to chips during/after synthesis these days and you have no way to get any of your tests running on that net list. productivity bottleneck in hardware is not in design but verification and specifically sequential verification. Combinational verification is quote easy, because modern SAT solvers can prove almost all combinational properties you can throw at them.
Lots of progress is happening though: Aaron Bradley's PDR (http://theory.stanford.edu/~arbrad/) and Arie Gurfinkel and Yakir Vizel's AVY (http://arieg.bitbucket.org/avy/) have made pretty major breakthroughs in proving a lot of complex sequential properties and these algorithms have made their way into industrial formal tools as well. (on clash) And you're still describing the transfer function between states; CλaSH does not seem to allow you to describe your program in a structured way (such as loop until x becomes true, wait for 3 cycles, read z, while z > 0 decrement z, etc.) Entry: cosim Date: Thu Aug 23 00:55:22 EDT 2018 https://github.com/jgjverheij/clash-cosim http://docs.myhdl.org/en/stable/manual/cosimulation.html Entry: structured programming Date: Thu Aug 23 01:19:42 EDT 2018 See two posts back: " CλaSH does not seem to allow you to describe your program in a structured way (such as loop until x becomes true, wait for 3 cycles, read z, while z > 0 decrement z, etc.)" https://news.ycombinator.com/item?id=9516217 How would you do this in Verilog? Entry: Chisel HDL: the latest instance of a flawed approach Date: Thu Aug 23 01:33:43 EDT 2018 http://www.jandecaluwe.com/blog/chisel-flawed-approach.html Jan disagrees: Since we are constructing a digital circuit, the notion of reassignment does not make much sense since connections between circuit nodes only need to be specified once. This is a puzzling statement. It would suggest that Chisel reduces the goal of an HDL to structurally constructing a circuit. What about describing behavior, which is the main reason why industrial HDL designers use VHDL, Verilog, SystemC and MyHDL? So I'm making the same mistake. Jan mentions again that merely constructing circuits is a mistake, and that behavioral modeling is the important property. I've actually run into this, to be honest. Why am I building a CPU? To be able to do sequential programming. In the comments, Jan comments on blocking assignments: Entry: Procedural / behavioral modeling Date: Thu Aug 23 01:47:19 EDT 2018 So what is it? What are the constructs? Is this just single-ended ifs? ( Because unrolled macros are trivial ). http://www.jandecaluwe.com/hdldesign/thinking-software-rtl.html Coding at the RTL level means describing the hardware behavior in a single clock cycle. If an algorithm requires a number of steps, you have to code it as an FSM so that the behavior depends on the state you are in. The resulting code is fairly low level. For example, a loop is emulated by explicit transitions between states. However, there is an interesting special case. When the algorithm is simple enough to complete within a single clock cycle, you can use higher level features such as a for loop to describe the behavior in a straightforward way. So as long as the loop unrolls into a logic network, it is synthesizable. Now what is the equivalent of that, Haskellized? It's OK if I generate straight RTL... That gray counter is a good example to duplicate. Some more here: http://www.jandecaluwe.com/hdldesign/the-case-for-a-better-hdl.html http://www.jandecaluwe.com/hdldesign/signal-assignments.html Trying to summarize the points: - Consecutive assignments can be useful to factor out complex conditions - Loops are useful to express networks algorithmically. For Seq: the latter is just macros. The former needs a special trick, such as register assignments or lenses. Entry: Record assignments Date: Thu Aug 23 02:25:46 EDT 2018 Haskell does have this record assignment syntax..
https://www.reddit.com/r/haskell/comments/6v9ezi/does_it_bother_anyone_else_that_record_assignment/ So this can be used for state updates, in case the state gets large. Also: lens. Entry: Update: behavioral Date: Thu Aug 23 10:09:27 EDT 2018 (I shouldn't do this stuff right before bed...) So I think the main idea is that the behavioral approach can make control decisions based on signal values, and unrolls this into a circuit. At first glance, this seems to need target HDL support, but because of the result being a static circuit, it should be able to be expressed as a macro. Just maybe not an imperative one. http://www.jandecaluwe.com/hdldesign/thinking-software-rtl.html The key to the gray counter example is the "found" variable. It appears to make a structural decision. However, unrolled, the only thing that happens is that the ith iteration of the loop has a different input. So this is a chain of combinatorial circuits. This means that "for" can be a macro, and variables can just be signals. Generalize this to fold. This is an important insight. Maybe the one-legged if can also be implemented this way: a fold? Entry: Processor as a macro Date: Thu Aug 23 10:32:53 EDT 2018 So back to the main question: is there an intermediate between creating a program on a processor, and a raw state machine? I.e. I want an imperative / stack language that generates a state machine. Say start without a stack: just use loop counters. Every instruction word is a state, with a next state equation. This should be automatic. These are all just functions! What about decoupling it as a type class, and providing two different interpretations? There are a couple of trade-offs: loop counters with individual adders, or a stack or ALU-like structure. It seems that only the instruction memory can be eliminated this way. The other logic will remain the same. Also obvious in retrospect. Entry: concat Date: Thu Aug 23 12:50:27 EDT 2018 Maybe today is a day to evaluate concat. https://github.com/conal/concat Before that, I need to solve some build issues. Means nix. Entry: Verilog next Date: Fri Aug 24 11:19:31 EDT 2018 Something to test memories. Make a universal test bench? I.e. start with output-only, single output, then change once it works. Entry: Sequencers: state machines vs. CPUs Date: Sun Aug 26 11:05:29 EDT 2018 This seems to be the main thing. Also that it is just the imem+iword infrastructure: for loops, a stack or multiple loop counts are still needed. Entry: Verilog cosimulation Date: Sun Aug 26 11:20:09 EDT 2018 I started some basic testbench work. Ultimately, I want cosimulation. IVerilog can do that. So maybe go for it straight away? How is the interface built up? root@zoo:~# apt-file find iverilog ... iverilog: /usr/include/iverilog/_pli_types.h iverilog: /usr/include/iverilog/acc_user.h iverilog: /usr/include/iverilog/ivl_target.h iverilog: /usr/include/iverilog/sv_vpi_user.h iverilog: /usr/include/iverilog/veriuser.h iverilog: /usr/include/iverilog/vpi_user.h ... iverilog has vpi_user.h which refers to this standard: https://en.wikipedia.org/wiki/Verilog_Procedural_Interface What I need is either message passing, or calls from iverilog into Haskell. As a direct FFI, calls from Haskell into iverilog would be more appropriate, but likely a bit awkward. An example would be nice. http://www.asic-world.com/verilog/pli6.html What is a system task? It seems to be a function that is called inside a begin/end block. How do I get at wire/reg values?
I want to simulate something where the i/o relation is implemented on the Haskell side. It seems that just calling a function from verilog is enough. Is this a "system task"? Something that can take arguments and return them? Entry: simple tests Date: Sun Aug 26 12:04:45 EDT 2018 But first, some simple tests. E.g.: a test that writes to memory, reads it out. Basically it would be nice if this could be expressed such that it can be plugged directly into the existing QC functionality, while also allowing inspection of the generated verilog. EDIT: Postponed to later. This is too much "work". The MyHDL path works for the CPU, so use that for now. Work on this in the background. Entry: Behavioral is NOT state machine synthesis Date: Sun Aug 26 14:21:10 EDT 2018 I think that was the core of my misunderstanding. If "behavioral" is just an imperative, iterative overlay to generate combinatorial networks, I don't think I'm really interested. Can be done better using explicit macro code. Entry: Cleaned up module structure Date: Sun Aug 26 15:43:02 EDT 2018 Apparently the biggest misunderstanding was that the module 'asm-tools' can just be put in the dependency list of the tests. This way, they do not completely rebuild everything. TODO: Integrate hatd code. Entry: Generic assembler Date: Sun Aug 26 15:44:20 EDT 2018 I was able to avoid this by using non-recursive functions. But in general, it is probably best to wrap up the knot-tying assembler behavior in a monad (transformer). Entry: iCE40UP5K Date: Sun Aug 26 19:16:29 EDT 2018 https://github.com/icebreaker-fpga/icebreaker 8 DSP cores (16x16 MAC) There is support since January http://www.clifford.at/icestorm/ Entry: Publish? Date: Sun Aug 26 23:50:17 EDT 2018 The most important property that distinguishes this from a toy is cosimulation. If this tool can be used to create components as part of a larger design, it will be useful. If not, it is a toy. For my own purpose, I control everything. There are no hard or soft cores apart from memory. Entry: post synthesis changes Date: Sun Aug 26 23:52:32 EDT 2018 I read something about changes being made post synthesis, supposedly because in large designs, synthesis is such an expensive procedure. I'm not sure if this means change then resynth, or just synth + tweak and don't resynth. Entry: Timing Date: Mon Aug 27 00:31:11 EDT 2018 Maybe more important than verilog is to get a better handle on timing constraints. E.g. how fast does the CPU run? Is it fast enough for the application, or do we need multiple clock domains? As I mentioned before, I find it strange that I don't run into timing issues (yet), while this is the main thing in all other CPU designs! Path: verilog -> pll instantiation -> CPU test Entry: iCE40 VGA demo? Date: Mon Aug 27 01:14:54 EDT 2018 https://imgur.com/t/fpga/71Ap9jT Entry: Zip CPU Date: Mon Aug 27 01:55:23 EDT 2018 If you bring me a design that doesn't work, my first two comments will be: 1. Do you have `default_nettype none as your first line? 2. Does it pass verilator -Wall -cc toplevel.v quietly? Entry: Scan Date: Tue Aug 28 04:06:25 EDT 2018 Scan: One extra multiplexer + string all flipflops as a shift register. http://pages.hmc.edu/harris/cmosvlsi/4e/lect/lect12.pdf Entry: Verilog gen: names as strings Date: Tue Aug 28 10:40:47 EDT 2018 Why bother with Template Haskell? FPGA gen needs custom things anyway, so use the existing probe mechanism to set signal names?
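A sketch of that, assuming probes come out as a node-to-name map; both the function and the map are hypothetical here. The only real work is sanitizing, since a probe string is not necessarily a valid Verilog identifier:

import qualified Data.Map as Map
import Data.Char (isAlphaNum)

verilogName :: Map.Map Int String -> Int -> String
verilogName probes n =
  maybe ("s" ++ show n) sanitize (Map.lookup n probes)
  where
    -- Replace anything outside [A-Za-z0-9_]; a leading digit would
    -- still need handling in a real version.
    sanitize = map (\c -> if isAlphaNum c || c == '_' then c else '_')

Unnamed nodes keep the generated sNNN form, so output stays deterministic.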
Entry: Direction: work with existing Verilog code Date: Thu Aug 30 22:29:45 EDT 2018 This means: focus on tying into an existing simulator. The power in the Haskell approach is testability, composability. For code gen however, it will be mostly "glue" in any practical design that has reuse. Entry: tristate logic vs. multiplexer trees Date: Fri Aug 31 18:09:45 EDT 2018 From one of the comments: https://electronics.stackexchange.com/questions/289878/fpga-memory-vs-registers "Using registers as memory also requires huge multiplexer trees, since there is no tri-state logic in routing." Entry: Verilator or Icarus? Date: Fri Aug 31 21:26:32 EDT 2018 MyHDL uses Icarus, so that's what it will be. Just do what MyHDL does: http://docs.myhdl.org/en/stable/manual/cosimulation.html module dut_bin2gray; reg [`width-1:0] B; wire [`width-1:0] G; initial begin $from_myhdl(B); $to_myhdl(G); end bin2gray dut (.B(B), .G(G)); defparam dut.width = `width; endmodule And use the "probe" mechanism to send the probe signals back using to_x, and push in the test signals using from_x. http://iverilog.wikia.com/wiki/Using_VPI So there is a straightforward path with MyHDL as an example. It involves a lot of reading, so not for now. Entry: Low-hanging fruit Date: Fri Aug 31 21:51:40 EDT 2018 I need something simple to keep myself occupied. EDIT: Sleep! Entry: Verilog Date: Sat Sep 1 08:47:42 EDT 2018 Maybe leave the cosim out of the picture, but make a working verilog RTL gen. What is missing currently? Likely the problem is the memory. EDIT: What's needed is a cosim function, that returns the Seq emulation, but also the Verilog one. This seems like a lot of work, really. How to split it up into manageable parts? Entry: Abstract submodules Date: Sun Sep 2 09:41:23 EDT 2018 Instead of generating verilog code for memories and reset generator, write those as verilog modules, and instantiate them. Entry: VPI Date: Mon Sep 3 10:05:21 EDT 2018 So I have some cargo-culted code that compiles and runs. Now I want to know how exactly I get "events". What I miss is the shape of the context these functions are supposed to run in. This is quite typical: the API is too granular to reconstruct this kind of context. Examples are essential to get the big picture. I find this in the myhdl.c code: vpi_register_cb(&cb_data_s); Googling for that, I find: http://www.asic-world.com/systemc/hdl8.html Some remarks: - userdata has nothing to do with the argument to a function. it seems to be just a user state pointer. - everything is dynamically typed, and referenced using vpiHandle - some calls allocate objects that need to be freed, e.g. vpi_iterate() (EDIT: this caused a crash... reference manual?) https://books.google.com/books?id=3pEMBwAAQBAJ&pg=PA213&lpg=PA213&dq=vpi+posedge&source=bl&ots=AYdk0XJym9&sig=3W4FA1Xe5LP4zUGPSm7QQv6O2z4&hl=en&sa=X&ved=2ahUKEwjnttT_lJ_dAhXDIzQIHSPLDhUQ6AEwB3oECAQQAQ#v=onepage&q=vpi%20posedge&f=false Now, how to put it all together? Instead of syncing on clock inside the module, what about calling it like this: always @(posedge clock) $seq_tick This could then read out all the registers that were tagged in $to_seq(), and write to all the registers tagged in $from_seq(). This seems straightforward. The problem is then how to define the protocol between the Haskell side and the simulation? EDIT: This seems to work. Proof of concept, clean up later. Next: generic way to get an I/O pipe into a C program from Haskell. TCP? Named pipe? 
The most straightforward way I can think of: - Have Haskell set up the pipes - Put pipe locations in environment - Let module open the pipes EDIT: How to set this up? There is this, which is a higher level interface: http://hackage.haskell.org/package/pipes-4.3.9/docs/Pipes-Tutorial.html Seems overkill. I just need the system calls. EDIT: Trying named pipes, but this is really hacky. Unix socket? EDIT: Basic unix socket infra is in place. How to (de)serialize? I don't understand how to link the Char8 to the Data.Binary.Get • Couldn't match expected type ‘Data.ByteString.Lazy.Internal.ByteString’ with actual type ‘Data.ByteString.Internal.ByteString’ (Data.Binary.Get wants a lazy ByteString, while the socket read returns a strict one; Data.ByteString.Lazy.fromStrict converts.) Entry: cauterize Date: Mon Sep 3 12:37:23 EDT 2018 Serialization between emulator core and Haskell? This is a job for cauterize: https://github.com/cauterize-tools/cauterize EDIT: Stick to u32 for now. Entry: bug? Date: Mon Sep 3 21:33:41 EDT 2018 reg [7:0] p0; // (0,Free (TypedForm {typedFormType = Just 8, typedFormForm = Input})) reg [7:0] p1; // (1,Free (TypedForm {typedFormType = Just 8, typedFormForm = Connect (Pure 2)})) wire [7:0] s2; // (2,Free (TypedForm {typedFormType = Just 8, typedFormForm = Comb2 ADD (Pure 0) (Pure 2)})) assign s2 = (p0 + s2); // (2,Free (TypedForm {typedFormType = Just 8, typedFormForm = Comb2 ADD (Pure 0) (Pure 2)})) assign p1 = s2; // (1,Free (TypedForm {typedFormType = Just 8, typedFormForm = Connect (Pure 2)})) That 4th line is wrong: it looks like the constants start counting at the wrong offset. Yes... should be 1 + maximum. It's likely that this is the bug I've been looking for. Nasty enough to mess up a network. EDIT: .v code gen works on FPGA now. Entry: cosim Date: Tue Sep 4 10:25:56 EDT 2018 I have something working with .v gen. Questions: - from_seq uses registers. Does it need a special update? vpiNoDelay is fine. - to_seq uses wires. Should be fine, see https://verificationhack.com/ Entry: cosim next? Date: Tue Sep 4 10:50:33 EDT 2018 So this works. There is no need to continue using the stdin/stdout approach. EDIT: Cleaned up. Good enough for now. Entry: Next Date: Tue Sep 4 12:30:02 EDT 2018 - Test CPU Verilog on FPGA - Use Verilog as top-level composition. Do this for reset gen, PLL, ram init. Most of this is "boring" work. I need to be more careful spending the will power to do those things. Lots of paid work to do as well.. EDIT: I'm stuck with just boring problems. I need something meaningful. Or somehow awaken the flame. Entry: PLL Date: Wed Sep 5 11:22:26 EDT 2018 So work towards getting the PLL up. Clean up .v modules first. - how to start yosys with library routines? OK - same for icarus - how to add instantiation to Seq? Entry: Next Date: Wed Sep 5 12:22:53 EDT 2018 I don't want to go through debugging PLL stuff when I'm not 100% online. Something simpler? Or something more exciting? Entry: CPU Emulation Date: Wed Sep 5 13:06:19 EDT 2018 With Seq, CPU emulation is at the level of sequential logic. Is there a way to relate this level to a higher one? E.g. one thing I'd like to do is to create a PIC18 code generator / emulator. Should I emulate the PIC18 at a hardware level? That seems like a lot of work. The main question: how to relate different semantic levels? This does not seem trivial at all! The main trick seems to be invariance of properties, done at the problem level. Entry: PRU revisited Date: Thu Sep 6 15:06:34 EDT 2018 I'm starting to think that this might not be such a good idea. Reality doesn't correspond to the simulation. It seems stuck in some loop.
So I'm looking at a sim error. QBNE is prime suspect. Entry: Code gen or scaffolding? Date: Thu Sep 6 19:03:14 EDT 2018 There is always this tension: is it _really necessary_ to generate code, or can low level code be written by hand and loaded into the test framework for validation? When to reverse the arrow? The cost is significant: it is no longer possible to read the code without understanding the metaprogramming framework. Maybe it is time to create a subset of C that can be used to express algorithms, but is also easy to use as a frontend to abstract interpretation. Entry: Next Date: Fri Sep 7 11:49:34 EDT 2018 PLL: I would really like to know if it works or not. But this kind of work needs strong attention to detail, and I'm in the mud again. So what is missing, currently? Maybe clean up the Verilog generator a bit. Entry: I'm a little stuck Date: Fri Sep 7 23:22:56 EDT 2018 Let's use this tool for a bit and let the application drive progress. I think I'm done with just making stuff up. Considering switching to "exo" as the main side project. Entry: Emulator / Assembler Date: Sat Sep 8 12:03:59 EDT 2018 I'd like to abstract the 2-phase structure used in Pru: - compile to program - run program This requires two monads: a compile time monad and a run time monad. To customize, maybe write these as monad transformers? Entry: Instruction scheduling using SAT solver Date: Thu Sep 13 21:04:58 EDT 2018 It might be interesting. Entry: So, what about just using assembly? Date: Fri Sep 14 22:10:02 EDT 2018 Using a Forth for the part that is not time critical, and assembly for the part that is. Together with emulators. So the basic problem becomes: how to quickly write a CPU emulator? Also, how to quickly iterate? Haskell is REALLY slow to compile. Entry: PRUs and instruction counting Date: Sun Sep 16 10:05:15 EDT 2018 So I got something to work, but what I miss is a good way to specify it. I think this is getting closer to SAT solver territory. The problem there is how to encode the problem. SMT seems more convenient than SAT. Entry: SAT/SMT: stick to z3 for now Date: Sun Sep 16 10:11:53 EDT 2018 ( Getting more into how other people make monad wrappers. This one sure looks ugly, but is straightforward, to the point. ) Notes: mkFreshIntVar: create a variable mkInteger: create an integer constant assert: add assertion to Z3 monad connectives: mkLe mkAnd mkIte mkEq mkSub mkUnaryMinus withModel: provides context such that evalInt works evalInt: evaluate variable Entry: dsPIC Date: Sun Sep 16 20:37:39 EDT 2018 - No USB needed. This is for deeply embedded bits, solve interfacing elsewhere. UART is fine. - Using e.g. an STM32 for this, it is also possible to put a PIC programmer in there. The programming of the devices isn't really that difficult given bit-bang possibility on the STM32. - Time for some signal processing. I have a bunch of these dsPIC DIP packages, so why not use them? - dsPICs are simple. Also, it will likely be straightforward to move this to a softcore written around some DSP primitives on an FPGA. - I think I'm ready with the Haskell work: - PRU/Emu.hs can be transformed into a universal emulator / assembler - Haskell as a meta language is enough to implement Forth-like target languages - More complex things are often not needed: use a different chip with a "real" language like C or Rust. What is needed? - The tagless-final language can be built incrementally on top of an assembler. I would like to use an existing assembler.
I don't think the Staapl approach of redoing everything makes much sense. Otoh I don't immediately find something, so maybe let's do this anyway? It shouldn't be that hard. It could be on an as-needed basis.

Entry: Architecture
Date: Sun Sep 16 21:01:51 EDT 2018

Interface chip: blue pill. There is no need for Rust here. Keep it simple. This doesn't need to do anything complicated. So it looks like it is necessary to generate instructions for the ICSP anyway, so let's just do that. Hook up the naked dsPIC to the blue pill. I have a dsPIC30F4013 on a board. There are the GP chips from a previous project also. Does it make a difference for programming?

Programming: dsPIC 30F and 33F can both contain a "programming executive", but it seems this first needs to be uploaded. Maybe simpler to use ICSP mode. Definitely, this will need to be limited to low voltage programming. I'm not going to build a charge pump. This means only dsPIC33F.

EDIT: This is not simple. Do I really want to get into this, or is it just a beer idea?

EDIT: Sobered up. The point here is to be able to do circuits without the need for breakout boards. I believe the dsPIC33F can be programmed with pickit2. Yes, see /home/tom/bin/pk2cmd.33FJ128GP802.write

Entry: State machines vs. processor
Date: Fri Sep 28 13:32:10 CDT 2018

Disadvantage of a processor is the amount of multiplexing that is needed. Advantage: that multiplexing does buy something: flexibility in accessing everything from a single point. What I really want: a simple way to synthesize loops into hardware. Can it be done cheaper? A stack of counters isn't all that expensive. The problem is more that if repurposing is needed, a processor + bus is simpler.

Entry: State machines
Date: Mon Oct 15 10:11:14 CEST 2018

I need a good syntax to implement state machines, but above all, a better way to think about implementing them. Often, state machines are too low-level. Nested loops provide a better mechanism. I want a way to relate the two. In Seq, a syntax transformation would be easy to implement, so why can't I just write this down?

- loops: remain in a state as long as the condition is not met
- re-use counters, and possibly counter conditions, between states

The trick is in the reuse of resources between states. Having to do this explicitly is also what makes manual state machines hard to develop. Using a CPU, the shared resources are the CPU resources (registers, ALU, instruction sequencer). So it is the reuse of resources between states that makes things complicated. Also, state components used in one branch, and not used in another, still need to be initialized. It is exactly this that a more "imperative" language would solve: by default, keep value. It appears it is this property of needing to specify unused values that gets in the way.

Entry: Imperative vs. functional
Date: Mon Oct 15 11:25:10 CEST 2018

I'm starting to think that the deeper problem in my reasoning is holding on to a "functional" way of thinking. An imperative language might actually be more appropriate for this kind of work: single (or no) assignment, or even repeated assignment (though that feels like a mistake). Is this just beginner misunderstanding, or am I just not indoctrinated enough by the standard way? If that is sub-optimal, I have an opportunity to think differently. What am I missing? One way to look at the missing link is partial evaluation of a CPU + program into a state machine. There is some element of the construction there that I do not understand. Is it just partial evaluation of the decoder + ALU?
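A toy version of that construction, to make the question concrete. Everything below is made up (a two-instruction machine), but it shows the mechanism: specialize the interpreter's step function over a constant program, and the fetch/decode disappears, leaving a state machine over (pc, acc):

  -- Toy ISA, purely for illustration.
  data Ins = Inc | JnzBack     -- increment acc; jump to 0 if acc /= 0

  -- The interpreter: decoder + ALU, parameterized by the program.
  step :: [Ins] -> (Int, Int) -> (Int, Int)
  step prog (pc, acc) = case prog !! pc of
    Inc     -> (pc + 1, acc + 1)
    JnzBack -> (if acc /= 0 then 0 else pc + 1, acc)

  -- Partially evaluating 'step [Inc, JnzBack]' unrolls the decoder
  -- into a case over pc alone: the next-state function of a state
  -- machine, with pc as the (now small, known-range) state register.
  step' :: (Int, Int) -> (Int, Int)
  step' (0, acc) = (1, acc + 1)
  step' (1, acc) = (if acc /= 0 then 0 else 2, acc)
  step' st       = st   -- pc = 2: halt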
Entry: Behavioral
Date: Mon Oct 15 11:55:02 CEST 2018

Create a behavioral language. Essentially, this separates variable declaration and assignment. It can then still be determined whether to use single or multiple assignment. No assignment means to keep the existing variable. Is it possible to replace clock tick by "goto"? This way, everything executed between these is combinatorial. That is the backbone. A state machine is something that can change the value of outputs (registers) between gotos, using the new values of inputs (other registers).

The main "programming principle": In this particular case, it is more convenient to think in terms of and express changes in state than the recomputation of state. This model is not a perfect equivalence, i.e. it is not possible to reuse logic in the same clock cycle. Early exits should be implemented, i.e. multiple incoming branches. If partial updates are so important, are clock enables on the registers then used to implement this? Do I need to worry about this in how translation to VHDL will work? Is it possible to use behavioral description as the bottom abstraction, and implement the "functional" logic style on top of it?

Entry: Verilog
Date: Mon Oct 15 14:08:23 CEST 2018

I'm starting to worry about striding too far away from the status quo, especially with recent insights about the behavioral approach.

Entry: State machine
Date: Tue Oct 16 11:07:25 CEST 2018

I'm stuck. I have no behavioral language to express this, can't quickly build one, so will need to continue with fully explicit "functional" updates.

  _ = do
    (carry,dec') <- carry dec cnt
    [state',cnt] <- switch state
      [ (wait,  do ifs i     [wait,0] [count,half]),
        (count, do ifs carry [idle,0] [count,dec']),
        (idle,  do ifs en    [wait,0] [idle,0]) ]

So why does this happen? Many state machines do a lot of waiting, which doesn't change the state. You want this to be easily expressed. Really, though, am I just looking for excuses? It works, it's just not easy to use when there is a lot of state. But maybe proper factoring can solve that problem independently? I.e. define sub-machines, and provide control lines. ( Still, it itches. There must be some simple transform where variables are moved into dynamic context. )

Entry: The solution is factoring
Date: Tue Oct 16 11:47:27 CEST 2018

It always is. The sub-machine creates an enable signal. It behaves a bit like an S/R flipflop, with set triggered by a first data transition. The other machine does just clock recovery, expecting a data and enable signal. But that's not enough maybe? The first machine also needs to have a clock start on the first edge. So it seems that an enable/reset is actually not a bad way to do this.

Entry: Bit sync
Date: Thu Oct 18 11:03:49 CEST 2018

So do it using the principles above:

- factor out the start/stop
- start it at 1/2 bit

This should really be done on paper... But, if it can be done on paper, it can also be done as letrec.

  edge'      <- edge i
  negedge    <- e & ~i
  bit_enable <- set_reset negedge count_carry

Why is this so difficult to express? Maybe it is essential to put it on paper first.

Entry: Cont...
Date: Thu Oct 18 15:28:23 CEST 2018

I have all the components.

- Sub clock counter -> bit clock out
- SR + addr gen + RAM

The problem is reset/enable or start/stop: mostly about how to define the protocol. These are roughly equivalent, but require some small state machines to convert:

- Single enable/nrst
- Start/Stop single clock pulses
- Start/Stop stretched pulses

A polarized edge detect can convert: stretched -> single, en/nrst -> single (see next post).
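For the stretched -> single conversion, a polarized edge detector is tiny. A hedged sketch in Seq style, where delay, inv and band stand in for whatever the actual primitive names are:

  -- High for exactly one cycle on a 0->1 transition of i.
  posedge i = do
    i' <- delay i      -- previous value
    n  <- inv i'
    band i n

  -- The 1->0 polarity (stop pulse) is the mirror image.
  negedge i = do
    i' <- delay i
    n  <- inv i
    band i' n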
So, for bit sync, which is easiest: start/stop or enable? Start/Stop is not enough. States are:

- start, but wait for first edge
- ...

So basically, there are 2 start signals:

- frame start
- first edge after frame start

To solve:

- first start pulse -> enable (or use enable in the first place)
- edge pulse masked by enable, then "debounce" start pulses

It seems simplest to use a single "frame" or "chip select" style input to the circuit, and have the end reset everything. Bits can be clocked in as long as frame is high: it will just be an extra couple of words. So, general arch:

- 2-stage start/stop (frame start, then first edge)
- frame stop = reset

It seems that the 2-phase start is really the problem. The inner circuit should be written to take a single start pulse, and start counting right away.

Entry: Premature opti
Date: Thu Oct 18 15:45:53 CEST 2018

I want to use abstractions, and not worry about efficiency at the same time. Trust the compiler to do common subexpression elimination, even for state machines. If it is possible to duplicate functionality at the state machine level, it is OK to _actually use_ abstractions that convert between different representations. A good example is the dual signal start+stop protocol vs. the single line enable/nrst. If the optimizer can eliminate dual instantiations of the same edge detector, it is no longer necessary to spend mental cycles manually performing the factorization. This premature optimization is a serious hangup otherwise.

TODO: Validate if this is what happens! Otherwise, add a pass in Seq.

EDIT: I really worry about this. Easy enough to test: create 2 edge detectors from the same pin to 2 different output pins, and see if it shares.

EDIT: I don't think it eliminates delays, at least I do not see it doing this anywhere.

Entry: Pipelining CPU
Date: Sat Oct 20 15:06:24 CEST 2018

Just fetch. Essentially this delays the address register. Everything stays the same, apart from jumps going into effect one instruction late. This then creates a branch slot. This then takes the RAM access time out of the timing loop.

EDIT: Looking at this again, the RAM access is only a small part. Bulk comes from logic levels (17 deep!).

Entry: Update tools?
Date: Sun Oct 21 15:20:35 CEST 2018

Maybe good to update the tools?

  tom@wanda:~/asm_tools$ make f_soc.ct256.time
  icetime -p f_soc.pcf -o f_soc.ct256.nl.v -P ct256 -d hx8k -t f_soc.ct256.asc
  // Reading input .pcf file..
  // Reading input .asc file..
  // Reading 8k chipdb file..
  // Creating timing netlist..

  icetime topological timing analysis report
  ==========================================

  Warning: This timing analysis report is an estimate!
  Report for critical path:
  -------------------------
  lc40_18_19_4 (LogicCell40) [clk] -> lcout: 0.640 ns
  0.640 ns net_70961 (tx_in_s60[1])
  odrv_18_19_70961_75164 (Odrv12) I -> O: 0.540 ns
  t2617 (Sp12to4) I -> O: 0.449 ns
  t2616 (Span4Mux_v1) I -> O: 0.203 ns
  t2615 (LocalMux) I -> O: 0.330 ns
  inmux_16_20_67064_67084 (InMux) I -> O: 0.260 ns
  lc40_16_20_1 (LogicCell40) in1 -> carryout: 0.260 ns
  2.681 ns net_67082 ($auto$alumacc.cc:470:replace_alu$232.C[2])
  lc40_16_20_2 (LogicCell40) carryin -> carryout: 0.126 ns
  2.807 ns net_67088 ($auto$alumacc.cc:470:replace_alu$232.C[3])
  lc40_16_20_3 (LogicCell40) carryin -> carryout: 0.126 ns
  2.933 ns net_67094 ($auto$alumacc.cc:470:replace_alu$232.C[4])
  lc40_16_20_4 (LogicCell40) carryin -> carryout: 0.126 ns
  3.060 ns net_67100 ($auto$alumacc.cc:470:replace_alu$232.C[5])
  lc40_16_20_5 (LogicCell40) carryin -> carryout: 0.126 ns
  3.186 ns net_67106 ($auto$alumacc.cc:470:replace_alu$232.C[6])
  lc40_16_20_6 (LogicCell40) carryin -> carryout: 0.126 ns
  3.312 ns net_67112 ($auto$alumacc.cc:470:replace_alu$232.C[7])
  lc40_16_20_7 (LogicCell40) carryin -> carryout: 0.126 ns
  3.438 ns net_67118 ($auto$alumacc.cc:470:replace_alu$232.C[8])
  t316 (ICE_CARRY_IN_MUX) carryinitin -> carryinitout: 0.196 ns
  lc40_16_21_0 (LogicCell40) carryin -> carryout: 0.126 ns
  3.761 ns net_67199 ($auto$alumacc.cc:470:replace_alu$232.C[9])
  lc40_16_21_1 (LogicCell40) carryin -> carryout: 0.126 ns
  3.887 ns net_67205 ($auto$alumacc.cc:470:replace_alu$232.C[10])
  lc40_16_21_2 (LogicCell40) carryin -> carryout: 0.126 ns
  4.014 ns net_67211 ($auto$alumacc.cc:470:replace_alu$232.C[11])
  lc40_16_21_3 (LogicCell40) carryin -> carryout: 0.126 ns
  4.140 ns net_67217 ($auto$alumacc.cc:470:replace_alu$232.C[12]$2)
  inmux_16_21_67217_67227 (InMux) I -> O: 0.260 ns
  lc40_16_21_4 (LogicCell40) in3 -> lcout: 0.316 ns
  4.715 ns net_63054 ($auto$alumacc.cc:470:replace_alu$232.C[12])
  odrv_16_21_63054_67278 (Odrv4) I -> O: 0.372 ns
  t2115 (Span4Mux_h1) I -> O: 0.175 ns
  t2114 (LocalMux) I -> O: 0.330 ns
  inmux_18_20_75201_75263 (InMux) I -> O: 0.260 ns
  lc40_18_20_5 (LogicCell40) in3 -> lcout: 0.316 ns
  6.167 ns net_71085 (c_s77)
  odrv_18_20_71085_70869 (Odrv4) I -> O: 0.372 ns
  t2651 (LocalMux) I -> O: 0.330 ns
  inmux_18_18_74969_75003 (InMux) I -> O: 0.260 ns
  lc40_18_18_3 (LogicCell40) in1 -> lcout: 0.400 ns
  7.527 ns net_70837 ($abc$2791$n472_1)
  odrv_18_18_70837_46505 (Odrv12) I -> O: 0.540 ns
  t2485 (Span12Mux_h1) I -> O: 0.133 ns
  t2491 (Sp12to4) I -> O: 0.449 ns
  t2490 (Span4Mux_v0) I -> O: 0.203 ns
  t2489 (Span4Mux_v0) I -> O: 0.203 ns
  t2488 (Span4Mux_v0) I -> O: 0.203 ns
  t2487 (Span4Mux_v1) I -> O: 0.203 ns
  t2486 (LocalMux) I -> O: 0.330 ns
  inmux_12_19_50638_50667 (InMux) I -> O: 0.260 ns
  t229 (CascadeMux) I -> O: 0.000 ns
  lc40_12_19_3 (LogicCell40) in2 -> lcout: 0.379 ns
  10.431 ns net_46499 ($abc$2791$n517_1)
  t1354 (LocalMux) I -> O: 0.330 ns
  inmux_12_19_50617_50677 (InMux) I -> O: 0.260 ns
  lc40_12_19_5 (LogicCell40) in0 -> lcout: 0.449 ns
  11.469 ns net_46501 ($abc$2791$n626)
  odrv_12_19_46501_14685 (Odrv12) I -> O: 0.540 ns
  t1403 (Span12Mux_h11) I -> O: 0.526 ns
  t1402 (Sp12to4) I -> O: 0.449 ns
  t1401 (Span4Mux_v0) I -> O: 0.203 ns
  t1400 (Span4Mux_v4) I -> O: 0.372 ns
  t1399 (Span4Mux_v4) I -> O: 0.372 ns
  t1398 (Span4Mux_v2) I -> O: 0.252 ns
  t1397 (LocalMux) I -> O: 0.330 ns
  inmux_25_9_102231_102273 (CEMux) I -> O: 0.603 ns
  15.116 ns net_102273 ($abc$2791$n626)
  ram_25_9 (SB_RAM40_4K) RCLKE [setup]: 0.267 ns

  Resolvable net names on path:
  0.640 ns ..  2.421 ns tx_in_s60[1]
  2.681 ns ..  2.681 ns $auto$alumacc.cc:470:replace_alu$232.C[2]
  2.807 ns ..  2.807 ns $auto$alumacc.cc:470:replace_alu$232.C[3]
  2.933 ns ..  2.933 ns $auto$alumacc.cc:470:replace_alu$232.C[4]
  3.060 ns ..  3.060 ns $auto$alumacc.cc:470:replace_alu$232.C[5]
  3.186 ns ..  3.186 ns $auto$alumacc.cc:470:replace_alu$232.C[6]
  3.312 ns ..  3.312 ns $auto$alumacc.cc:470:replace_alu$232.C[7]
  3.438 ns ..  3.635 ns $auto$alumacc.cc:470:replace_alu$232.C[8]
  3.761 ns ..  3.761 ns $auto$alumacc.cc:470:replace_alu$232.C[9]
  3.887 ns ..  3.887 ns $auto$alumacc.cc:470:replace_alu$232.C[10]
  4.014 ns ..  4.014 ns $auto$alumacc.cc:470:replace_alu$232.C[11]
  4.140 ns ..  4.399 ns $auto$alumacc.cc:470:replace_alu$232.C[12]$2
  4.715 ns ..  5.851 ns $auto$alumacc.cc:470:replace_alu$232.C[12]
  6.167 ns ..  7.127 ns c_s77
  7.527 ns .. 10.052 ns $abc$2791$n472_1
  10.431 ns .. 11.020 ns $abc$2791$n517_1
  11.469 ns .. 15.116 ns $abc$2791$n626
  RDATA[11] -> $abc$2791$n272
  RDATA[3] -> $abc$2791$n269

  Total number of logic levels: 17
  Total path delay: 15.38 ns (65.01 MHz)

After pipelining the jump instruction:

  Total number of logic levels: 6
  Total path delay: 14.74 ns (67.86 MHz)

So it looks like pipelining fetch is not going to solve much. With new version:

  icetime -p f_soc.pcf -o f_soc.ct256.nl.v -P ct256 -d hx8k -t f_soc.ct256.asc
  // Reading input .pcf file..
  // Reading input .asc file..
  // Reading 8k chipdb file..
  // Creating timing netlist..

  icetime topological timing analysis report
  ==========================================

  Report for critical path:
  -------------------------
  lc40_28_11_7 (LogicCell40) [clk] -> lcout: 0.640 ns
  0.640 ns net_110076 (s20)
  odrv_28_11_110076_70105 (Odrv12) I -> O: 0.540 ns
  t3051 (Sp12to4) I -> O: 0.449 ns
  t3056 (Span4Mux_v3) I -> O: 0.337 ns
  t3055 (LocalMux) I -> O: 0.330 ns
  inmux_20_14_82634_82652 (InMux) I -> O: 0.260 ns
  lc40_20_14_1 (LogicCell40) in0 -> lcout: 0.449 ns
  3.004 ns net_78496 ($abc$2868$n379_1)
  odrv_20_14_78496_62321 (Odrv12) I -> O: 0.540 ns
  t2288 (Span12Mux_v6) I -> O: 0.288 ns
  t2287 (Sp12to4) I -> O: 0.449 ns
  t2310 (Span4Mux_h0) I -> O: 0.147 ns
  t2323 (Span4Mux_h4) I -> O: 0.316 ns
  t2322 (Span4Mux_h3) I -> O: 0.231 ns
  t2321 (LocalMux) I -> O: 0.330 ns
  inmux_22_21_91648_91680 (InMux) I -> O: 0.260 ns
  lc40_22_21_3 (LogicCell40) in1 -> lcout: 0.400 ns
  5.963 ns net_87513 ($abc$2868$n387)
  odrv_22_21_87513_87648 (Odrv4) I -> O: 0.372 ns
  t2686 (Span4Mux_h4) I -> O: 0.316 ns
  t2685 (LocalMux) I -> O: 0.330 ns
  inmux_17_21_71246_71313 (InMux) I -> O: 0.260 ns
  lc40_17_21_6 (LogicCell40) in0 -> lcout: 0.449 ns
  7.689 ns net_67132 ($abc$2868$n385)
  odrv_17_21_67132_31238 (Odrv12) I -> O: 0.540 ns
  t1996 (Sp12to4) I -> O: 0.449 ns
  t1998 (Span4Mux_v2) I -> O: 0.252 ns
  t1997 (LocalMux) I -> O: 0.330 ns
  inmux_16_19_66938_66962 (InMux) I -> O: 0.260 ns
  t317 (CascadeMux) I -> O: 0.000 ns
  lc40_16_19_1 (LogicCell40) in2 -> lcout: 0.379 ns
  9.898 ns net_62805 ($abc$2868$n506_1)
  odrv_16_19_62805_46628 (Odrv12) I -> O: 0.540 ns
  t1781 (LocalMux) I -> O: 0.330 ns
  inmux_16_19_66931_66979 (InMux) I -> O: 0.260 ns
  lc40_16_19_4 (LogicCell40) in1 -> lcout: 0.400 ns
  11.427 ns net_62808 ($abc$2868$n505_1)
  t1775 (LocalMux) I -> O: 0.330 ns
  inmux_15_19_62873_62879 (InMux) I -> O: 0.260 ns
  lc40_15_19_0 (LogicCell40) in1 -> lcout: 0.400 ns
  12.416 ns net_58727 (s115[6])
  odrv_15_19_58727_62446 (Odrv12) I -> O: 0.540 ns
  t1672 (Span12Mux_h8) I -> O: 0.386 ns
  t1671 (Sp12to4) I -> O: 0.449 ns
  t1679 (Span4Mux_v0) I -> O: 0.203 ns
  t1682 (Span4Mux_v4) I -> O: 0.372 ns
  t1687 (Span4Mux_v4) I -> O: 0.372 ns
  t1689 (Span4Mux_v2) I -> O: 0.252 ns
  t1688 (LocalMux) I -> O: 0.330 ns
  inmux_25_25_103850_103900 (InMux) I -> O: 0.260 ns
  t585 (CascadeMux) I -> O: 0.000 ns
  15.579 ns net_103900_cascademuxed
  ram_25_25 (SB_RAM40_4K) RADDR[6] [setup]: 0.203 ns
  15.782 ns dangling_wire_337

  Resolvable net names on path:
  0.640 ns ..  2.555 ns s20
  3.004 ns ..  5.563 ns $abc$2868$n379_1
  5.963 ns ..  7.240 ns $abc$2868$n387
  7.689 ns ..  9.519 ns $abc$2868$n385
  9.898 ns .. 11.027 ns $abc$2868$n506_1
  11.427 ns .. 12.016 ns $abc$2868$n505_1
  12.416 ns .. 15.579 ns s115[6]
  RDATA[11] -> $abc$2868$n296
  RDATA[3] -> $abc$2868$n293

  Total number of logic levels: 7
  Total path delay: 15.78 ns (63.36 MHz)

Entry: wanda update
Date: Sun Oct 21 15:35:47 CEST 2018

Let's update tools. I'm going to have to redo this on panda.

  icestorm:
  commit a1fd644f383e77d4db71ce4654fdf15b940b53a6
  Merge: 6178dfb 3587093
  Author: Clifford Wolf
  Date: Mon Mar 28 16:51:44 2016 +0200

  arachne-pnr:
  commit f808b8e64a9df67ce87d0ecb6e0eee8f3215d7f3
  Merge: 1a4fdf9 7380afe
  Author: cseed
  Date: Tue Mar 29 11:36:24 2016 -0400

  yosys:
  commit dcf576641b4a9b476d51fbe1b0cdfb57d02a76e6
  Author: Clifford Wolf
  Date: Fri Jun 3 11:38:31 2016 +0200

Entry: uart bugs
Date: Tue Oct 23 11:46:18 CEST 2018

- uart_done doesn't work properly
- machine doesn't align to baud clock properly. should the baud timer be reset by the uart?

Entry: bus reads
Date: Tue Oct 23 16:06:45 CEST 2018

Important to realize that there is a 1 cycle delay between bus reads and values being ready.

  read:4 1 0 1 1 (1636x)
  read:4 1 1 1 1
  read:4 1 0 1 1

The delay comes from:

  -- Couple bus master and slave through bus registers.
  closeReg [bit, bits imem_bits] $ \[rStrobe,rData] -> do
    bus_wr <- bus_master (BusRd rStrobe rData)
    (BusRd rStrobe' rData', soc_output) <- bus [rx, tx_bc] bus_wr
    return ([rStrobe', rData'], soc_output)

I did this by accident. But is it really necessary to have this delay? Removing it will make the CPU critical path longer, but has the advantage that reads take only one cycle instead of two. It doesn't seem to be an issue at this point, but good to keep in mind.

Entry: uart TX
Date: Tue Oct 23 16:57:23 CEST 2018

Something is not right with tx_done. Note: read has a wait state, but tx_out should not be 0.

  iw       ip tick_20khz tx_done tx_out
  -------------------------------------
  read:2    8 0 0 0 (4x)
  read:2    8 0 0 1 (12x)
  read:2    8 0 0 0 (96x)
  read:2    8 0 1 0 (2x)
  drop      9 0 1 0
  push:150 10 0 1 0
  loop:11  11 0 1 0 (32x)
  loop:11  11 0 1 1 (119x)
  push:17  12 0 1 1

Instead of trying to debug this, rewrite it. Use a proper state machine. It's ok to use an external bit clock, but it is necessary to wait. Maybe it is the stop bit? Because IF we wait for the next pulse to send the start bit, it is ok to set done high once the stop bit is sent out, giving some time during that bit to send a consecutive one.

EDIT: Test with /12 baud rate sending out 8 bit 0 bytes.

  (+ 108 2 1 1 1 1) ; 114
  (* 9 12)          ; 108

  uart_tx:
  iw      ip tx_done tx_out
  -------------------------
  drop     0 1 1 (3x)
  write:2  1 1 1
  drop     2 0 1
  read:2   3 0 1 (9x)
  read:2   3 0 0 (108x)
  read:2   3 1 0 (2x)
  drop     4 1 0
  drop     5 1 0
  drop     0 1 0
  write:2  1 1 0
  drop     2 0 1
  read:2   3 0 1 (5x)
  read:2   3 0 0 (108x)
  read:2   3 1 0 (2x)
  drop     4 1 0
  drop     5 1 0
  drop     0 1 0
  write:2  1 1 0
  drop     2 0 1
  read:2   3 0 1 (5x)
  read:2   3 0 0 (108x)
  read:2   3 1 0 (2x)
  drop     4 1 0
  drop     5 1 0
  drop     0 1 0
  write:2  1 1 0
  drop     2 0 1
  read:2   3 0 1 (5x)
  read:2   3 0 0 (26x)

EDIT: I need a test for UART, so for the example soc in CPU.hs, add a baud clock. Dammit, I can't figure it out. Too complex.

EDIT: Bus size was 12 bit, which caused the uart to be 8 bit as well.
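Rather than squinting at the trace, a throwaway reference model helps: the expected TX line level per clock cycle for one frame. Plain Haskell, not Seq; the 8N1 framing (start bit low, 8 data bits LSB first, stop bit high) is standard UART, and the /12 divider comes from the test above:

  import Data.Bits (shiftR, (.&.))
  import Data.Word (Word8)

  -- One 8N1 frame at a /12 baud divider: start bit (0), 8 data bits
  -- LSB first, stop bit (1); 10 bits * 12 cycles = 120 cycles total.
  txFrame :: Word8 -> [Int]
  txFrame b = concatMap (replicate 12) (0 : dataBits ++ [1])
    where dataBits = [ fromIntegral (b `shiftR` i) .&. 1 | i <- [0..7] ]

Comparing this cycle list against the iw/ip trace makes it easier to see whether tx_done rises during the stop bit or after it.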
Entry: Memories
Date: Thu Oct 25 12:03:19 CEST 2018

The problem that keeps coming back: how to "route" memory read and write ports properly? This is an integration question, so essentially, the core routines should not perform any binding to memories. Closing over memory is then done at the point where both ports are accessible, e.g. when instantiating a bus.

Entry: Test with memory contents
Date: Thu Oct 25 13:57:22 CEST 2018

EDIT: Changed API to also return frozen memory contents.

Haskell array with negative bound?

  Ix{Int}.index: Index (0) out of range ((0,-9223372036854775808))

Because of unspecified array address word size, it defaults to 64 bit, which wreaks havoc when it is used as a Haskell array size. Maybe fix that?

EDIT: Got a test: words, serialized, going into ram.

Entry: Simulating protocols
Date: Fri Oct 26 15:33:49 CEST 2018

The only thing needed here is feedback. What about providing a stub that is called inside the main loop? It can just be an ST operation.

EDIT: I have a first step: an interface that will take other kinds of test bench data apart from just a list of inputs. Now to actually implement it. Next: make a test that actually uses a state machine, or better still, that also implements an input list as a state machine. This needs an ST variable to store the current list.

Entry: ports vs. bindings
Date: Wed Oct 31 09:30:59 CET 2018

There is still an awkward point in how abstraction works: I would like to bundle inputs and outputs (e.g. an RS485 line with controls), but how to do this if I and O are always split as arguments to submodules, and bindings of return values? Maybe the solution is the same as ever: abstract things as functions. To make it more concrete, it might help to write an actual flat FPGA toplevel using the same API as a nesting of component instantiations.

Entry: Verilog frontend
Date: Wed Oct 31 09:38:39 CET 2018

To make this evolve and integrate into other tools, it will be necessary to expose modules as readable verilog. From the Haskell side, it might mean that there is a need for a syntax frontend, to at least be able to carry node and register names into verilog, and also have something resembling design hierarchy.

Entry: Tradeoff: blocking read vs. explicit bit wait
Date: Sat Nov 3 10:12:20 CET 2018

Either reserve an address for every bit, or add a new instruction to wait for a specific bit. This is just a multiplexer. The return value is wasted, but does it matter? Otherwise 2-op is needed, which adds more wires. Not worth it.

Entry: Setting up dev env for CPU tests
Date: Wed Nov 28 12:15:42 EST 2018

- standardize boot. basically, I want a kind of DHCP for these boards: plug in the board and have it load code.
- once booted, the main interface is spi. is there a way to standardize this to TCP?

EDIT: See exo.txt. This is really a naming problem: some things have only local names, but they can be embedded in some kind of wrapper that implements a global name. So essentially such resources have 2 names: a host name and a local resource name.

EDIT: Going through the motions

  tom@panda:~/asm_tools$ make f_soc.ct256.bin
  tom@zoe:/i/panda/home/tom/asm_tools$ make f_soc.ct256.iceprog

I forgot about how to upload the ram. Ok got it. Added to makefile and iceprog.sh script.

Entry: Next
Date: Sun Dec 2 00:28:22 EST 2018

Got through a couple of practical issues, mostly related to incremental builds. But I'm ready to move on to UART control of the CPU. Some UART code would be nice on the breakout board. I can use the FTDI for that.
FTDI bus B is connected to:

  PIO0_13 B12 RS232_Tx_TTL
  PIO0_14 B10 RS232_Rx_TTL

Maybe do this in asm_tools first.

Entry: bitserial CPU : SERV RISCV
Date: Sat Dec 15 18:41:22 EST 2018

https://twitter.com/OlofKindgren
https://www.businesswire.com/news/home/20181206005747/en/RISC-V-SoftCPU-Contest-Winners-Demonstrate-Cutting-Edge-RISC-V
https://github.com/olofk/serv

Also look at: https://github.com/olofk/fusesoc

Entry: pizza dinner hdl
Date: Fri Dec 28 21:51:39 CET 2018

- asic: 50% RTL design, 50% implementation
- Write RTL s.t. test insertion automation works
- 10% of area is scan/test code
- statistical design: variability per gate is large
- extra gates to allow metal-layer corrections

Entry: reset behavior
Date: Wed Jan 9 11:32:50 CET 2019

Always a problem.. Example: an edge detector that self-initializes, using the first signal as an initialization. This seems to require an extra flip-flop. Let's write it compositionally: edge detector, but kill the first output. Easily done using a 0->1 transition delay.

Entry: Off-by-one
Date: Wed Jan 9 11:42:07 CET 2019

It remains a real pain to navigate pre/post delay signals. It seems that a good way to go is to assume this will be wrong, and always use some kind of redundant assert. Digital design is really about pipelining. If it were just combinatorial logic, it would be easy.

Entry: Think compositionally
Date: Wed Jan 9 11:46:15 CET 2019

So if the point is to just express the circuit without regard to things that could be optimized by re-arranging combinatorial logic, it seems that the core idea behind my approach is to push 'compositional thinking' much further into the core of the circuits. If I look at Verilog or VHDL code, I see a lot of raw state machines. I don't like this approach. It seems to make more sense to create a couple of primitive state machines, and solve problems using composition, which is easier to get right. Basically, this is the Forth or generic FP idea.

Entry: CPU
Date: Fri Jan 11 21:16:46 CET 2019

Might have some time for hacking on the CPU. To try:

- better timing
- 2-stack machine

Other Seq things:

- Verilog cosim
- Import verilog modules (via yosys netlist)

Entry: ecosystem integration
Date: Tue Jan 15 04:42:31 CST 2019

So, very important: find a way to integrate into a larger ecosystem. It is already possible to create Verilog modules, but it will also be necessary to go the other way: import and instantiate verilog modules.

Entry: Syntax frontend
Date: Tue Feb 5 10:32:24 EST 2019

I really want a syntax frontend. Monadic notation is too cumbersome. There are two ways I see:

- Template Haskell + S-expressions
- Conal's CCC

I think the latter is too big a risk. The rough edges are likely going to need advanced insights to resolve. So go with s-expressions.

Entry: SERV
Date: Wed Mar 20 18:27:35 EDT 2019

https://github.com/olofk/serv

SERV was written with manual logic optimization using Karnaugh maps.

Entry: Higher level abstractions
Date: Wed Mar 27 10:41:33 EDT 2019

Instead of trying to battle the bottom layers through local reasoning, use a higher level abstraction. Looking into statecharts and decision tables (via Hillel Wayne).

TODO:

- Take these abstractions and compile them down to a flat state machine
- Keep states abstract to allow different encodings. Aim to synthesize such that tools can recognize the state machines and recode them.

Entry: Sequencers
Date: Fri Apr 5 14:32:20 EDT 2019

How to make design more modular.
I have an application that would be not too hard to implement on a CPU, but I'm likely going to need state machines to implement the stages. I am still very much not into designing fucking state machines. Let's think about that for a bit. Why is doing things sequentially so much easier? Let's take a look at some examples:

- uart SLIP packet framing?
- length framing + checksum
- CRC computation

What I need is a good way to represent this. A framework in which it is easy to embed a state machine. I'm going to need a good idea to push through this fatigue. One problem I have is connecting modules together. I don't seem to get very far other than designing everything with the CPU embedded.

Entry: Look at the geometry of the problem
Date: Fri Apr 5 16:30:43 EDT 2019

For a simple state machine, that's a graph. For a push-down automaton, that's something else. There are actually two things that make a CPU approach simple:

- sequential programming
- subroutines (parameterized repetition)

Subroutines are mostly useful as a compression step. It is part of the incremental solving of a problem, adding to the solution when there is already something there. Maybe leave this as step two? So what is step one?

- Write some pseudocode.
- Coalesce everything that can be parallelized
- Write an explicit state diagram for this (in Haskell)
- Compile that down to a bit-level implementation.

So start by encoding the program and the opcodes as a data type. I need an example state machine.

Entry: RAM as decoder
Date: Fri Apr 5 16:35:50 EDT 2019

A decoder is typically an "expansion", e.g. a small word fans out to a lot of control lines. A RAM already has that structure: 8 address bits to 16 data bits. Maybe that is done intentionally? I.e. it is wide by design? Can it be made wider? Yes, by chaining 2 RAMs. Can it be made wider using only a single RAM? Yes, by multiplexing. E.g. two 7->16 bit maps evaluated using two clocks. When is this more efficient than using LUTs?

Entry: State machine composition
Date: Fri Apr 5 16:47:06 EDT 2019

This is the holy grail, really. If one machine can reuse and restart another, that is efficient use of resources. This is pretty much what a CPU + peripheral combo would do. However, it does not make a lot of sense to build a machine and have it sit idle for most of the time. So one machine restarting another is a special case of a PRODUCT type. One machine running after another is a SUM type. Note that reuse in the case of a SUM type is only possible if the different machine modes share some logic.

EDIT: What Axel said: the hardware is going to be there anyway, so why not pipeline?

Entry: Process: why is this still hard?
Date: Fri Apr 5 20:56:17 EDT 2019

1. I don't want to look at (complex) previous solutions.
2. Maybe it's just too ad-hoc? Hard to compose?

Entry: Rekindle the fire
Date: Fri Apr 5 21:11:46 EDT 2019

Why did I lose intrinsic interest? Probably because of getting disappointed that the link to Verilog isn't really useful yet, so this does not yet integrate into a larger world. Maybe something to work on? Anyway, I forgot most of this. Just getting into it will likely revive it again.

Entry: alright, revolution!
Date: Sat Apr 6 07:58:20 EDT 2019

How to take advantage of the fresh start, without falling back into old pits. It is all about feedforward data processors with hidden internal state. Where does my mind want to go? Instead of where it is forced to go? The gap between dream -- what attracts me to this -- and application is too large. So what is the real problem?
This morning, I felt energized again by fixing the build system to create immediate feedback. The problem is really the resistance to reload all that scaffolding context. To solve this:

1. Make sure it is encoded in the build system in a very granular input-output form. E.g. one test should generate one report file.
2. The granularity is important for reconnecting to the problem.

Entry: Compile to ST without TH?
Date: Sat Apr 6 11:16:46 EDT 2019

I really don't like the limitations that TH imposes. Maybe it is best to compile to something else, like LLVM, and load the code dynamically that way.

Entry: Designing hardware with a control CPU
Date: Sat Apr 6 15:46:58 EDT 2019

1. The tradeoff is to put high level control logic in the CPU, and low level (i.e. fast-switching) control logic in the peripherals.
2. The interface between CPU and peripherals can be optimized: there is no need to make the peripherals nor the CPU general purpose. This translates to CPUs with a limited instruction set, and peripherals with a tiny (single) register interface.
3. This allows the CPU and peripheral sides to be disentangled and tested individually using non-implementable code.

Entry: How to get used to large I->O functions
Date: Sun Apr 7 10:32:47 EDT 2019

I think I needed an explanation about why some things seem messy. Here's the thing: a lot of circuits have a criss-cross nature. This is an inherent feature of parallel circuits and is what carries their usefulness. However, wiring all this up is not easy. So, a good approach seems to be the idea of a configurable mux, where a central object takes control words, and connects things together. Two observations:

- This object abstracts a "web" behind a simple interface.
- The instantiation of this object necessarily takes this plethora of I/O lines.

So it is ok to bundle up all these lines into a single large collection object (i.e. an environment). I've been struggling with internalizing this. Probably because I am used to thinking in terms of "create connection", as opposed to having everything be an I->O function.

So where does this bad intuition come from? It comes from the representation of circuit diagrams containing boxes connected with lines. But these "lines" are always directional! And in a functional representation, the line is represented by the definition/use relation. Why is this such an easy error to make? Because those lines are _physical_. They are wires that can be seen on the circuit board as well. But what we don't see is that each wire has a direction. So in this case, the idea of a symmetrical wire is just plain wrong. When superimposing the direction, the diagram becomes a whole lot more "messy", and it is clear that the boxes are just some arbitrary grouping of things into large I to large O functions. It is important to realize that a full circuit, e.g. an FPGA config, is essentially a function that takes a large number of inputs and produces a large number of outputs. The composition then looks like taking the large I, splitting it up, feeding it into a large number of circuits, collecting the outputs of the large number of circuits and pushing it out as a large O. Summarized: there is a tension between:

- The desire for a simple I->O functional representation
- The boxes + wires view of circuits

The latter gets in the way. It is intuition that has to be unlearned.
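In code the shape is simple. A hedged sketch, where the record types and the soc function are hypothetical; only the I -> m O shape matters:

  -- A toplevel circuit is one big function from an input bundle to an
  -- output bundle, inside the Seq monad.
  data Ins  r = Ins  { pinRx :: r S, pinBtn :: r S }
  data Outs r = Outs { pinTx :: r S, pinLed :: r S }

  top :: Seq m r => Ins r -> m (Outs r)
  top (Ins rx btn) = do
    (tx, led) <- soc rx btn   -- split the large I, collect the large O
    return $ Outs tx led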
Entry: Tree vs. Path representation
Date: Sun Apr 7 11:12:52 EDT 2019

Because of the flat nature of the I/O field, it seems simplest to introduce hierarchy in the form of a path encoding, as opposed to nested structures. This relates to a pattern I've seen pop up in other places: if hierarchy gets too arbitrary, it is often more convenient to switch to a flat environment, and put the hierarchy in the keys instead. Of course, both tree and path representations are isomorphic. It is merely a matter of convenience for the packing/unpacking code.

Entry: The low level monad
Date: Sun Apr 7 11:58:09 EDT 2019

Basically, it is a machine language. It should have a layer on top of it, but I do not know how to do that apart from very ad-hoc constructs, and it is not really essential. Just ugly. Ok to just live with it until inspiration hits.

Entry: contravariance
Date: Mon Apr 8 08:35:02 EDT 2019

So here's something to think about. Contravariance is a very important concept. ( See above, regarding the confusion about dropping the direction of signal lines. ) A circuit is represented by I -> m O, where m is the DSL monad. A consequence of this is that it is not possible for I and O types to appear in the same (covariant) product type. Basically, it makes no sense to have a pair (I,O) appear anywhere. The O is always contravariant to I. ( I wish I had a better way to explain this without hand-waving. ) Summary: this is a roundabout way of saying: keep input and output structures separated, and make them make sense at the lowest levels. The rest will fall out by itself.

Entry: Naming: enable / strobe / clock?
Date: Mon Apr 8 08:55:52 EDT 2019

Events on a dedicated bus are always represented as a pair consisting of a control bit carrying time information, and a data word carrying payload. It seems best to call the clock bit "strobe", because:

- enable: refers to something being on/off for a longer time
- clock: refers to register clocks, and it is NOT visible in RTL

Because strobes are so common, name them with an _s postfix? Actually, just name it clock. There is no possible confusion, because in RTL, the "square wave" master clock is not visible as a signal. So just use all 3, with this meaning:

- enable: on/off for a long duration
- clock: data streams
- strobe: command streams

So clock/strobe is arbitrary to some extent. Feel it out.

Entry: Dynamic probe names
Date: Mon Apr 8 10:26:34 EDT 2019

Extend the probe names with environment nesting that is derived from the instantiation.

Entry: Test circuit
Date: Mon Apr 8 10:51:54 EDT 2019

I have that delta1010 box. It shouldn't be too hard to synthesize an interface for DAC only. Then move on to ADC.

Entry: Multiplexers
Date: Tue Apr 9 09:46:10 EDT 2019

Here's the thing: digital circuit design is mostly about multiplexers. So Seq should have really good abstractions for that. Basically, this is the 'case' or 'switch' statement. Currently I have this working for lists, but it should work for arbitrary collection types. Is there a way to do this "flattening" operation better? Write it as a fold? What I want is an algebraic data type instead of a case or switch statement. As I've recently learned, it should be possible to write it as a fold instead. Summary: multiplexers are the most important circuit, and I've made it very difficult to express them. A few elements:

- monadic notation is cumbersome
- grouping only supports lists
- defaults are quite useful
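To make the pain concrete, this is roughly what the list-based form looks like in use today. This is extrapolated from the switch/ifs examples above, with all signal names hypothetical:

  -- Two outputs, selected on 'state'; the last argument is the default.
  [out0, out1] <- switch state
    [ (idle, return [low,  low])
    , (busy, return [high, dat]) ]
    (return [low, dat])

Every branch has to spell out every output, in the same order, as a bare list: no field names, no "keep previous value" shorthand.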
Entry: Imperative construction of circuits
Date: Tue Apr 9 09:59:41 EDT 2019

This is the thing: the "imperative metaprogramming" of standard HDLs isn't all that bad. It is optimized for registers keeping their value. How hard is it to introduce that functionality in Seq?

Entry: Traversable, Zip
Date: Tue Apr 9 10:03:27 EDT 2019

Actually, this already works:

  switch :: (Seq m r, Traversable f, Zip f) =>
    r S -> [(r S, m (f (r S)))] -> m (f (r S)) -> m (f (r S))

I just miss the instances. So the idea is that the data structure nesting is just "skin-deep". I need to get deeper into Haskell stuff before I can wiggle out of this. The core problem seems to be that I really want to encode signals at the type level to make them easier to work with, but that would be a really big change. It would be a different language. It's possible to do gradually, but will require building Seq on top of Seq2, or vice versa, to make it practical (i.e. not horribly break things).

Entry: Parameterized decision table
Date: Tue Apr 9 10:10:33 EDT 2019

So what I have is quite a clear-cut decision table, but the outputs are multiple. To fix this, split it up a bit. I.e. I have 16 busses to control, but I'm never controlling them individually. There are only two modes of control:

- all of them do the same in parallel, or
- only one is active and the rest is idle

This is just 2 cases. That switch can be solved separately.

Entry: Don't try to abstract muxes
Date: Tue Apr 9 10:21:55 EDT 2019

I keep wanting to express muxes differently, but I always come back to the only thing that makes sense: describe what happens for each output separately. This makes the code simpler to read. Don't try to group it too much.

Entry: Registers and commands
Date: Wed Apr 10 08:01:32 EDT 2019

Here's something that is annoying me: It's a lot of hassle to separate register and control word into two separate names. Why not flatten the namespace and always use (addr,data) pairs?

Entry: Make probe names hierarchical
Date: Thu Apr 11 07:55:12 EDT 2019

Should be as simple as changing the type from String to [String], and adding an environment variable that names the current context. So let's try the former first and propagate the refactoring.

Entry: Combine Seq and EDSP?
Date: Fri Apr 19 09:15:52 EDT 2019

It all needs to be one language. The real problem is "unrolling". Maybe it is possible to create a program, and then specify how it should be time-multiplexed? I have a driver program: the filter for radiopos.

Entry: Better signal type
Date: Fri Apr 19 09:21:53 EDT 2019

This can probably be done gradually. But really, this seems to be the whole idea: A "language" should not just be a bottom-up stack. It should have its primitives defined as well. The point is to increase the granularity of the type classes, such that algorithms can be very generic, and instantiation can be very specific. I.e. if there is a multiply, it could be expanded to a multiplier expressed in Seq. Seq needs to be recursive.

Entry: A driver application: quadrature tuner
Date: Fri Apr 19 09:25:56 EDT 2019

See radiopos for the high-level idea.

Entry: Avoid "inferring"
Date: Fri Apr 19 12:13:06 EDT 2019

In hardware mapping it is typical to infer subcircuits. If tools support it, go ahead, but in general it seems best to just be explicit about things.

EDIT: This is an important insight. Why have a "base line interface" anyway?
It is what separates the programmable system from the hard-wired system. So "Seq" is some kind of model that works in most practical cases.

Entry: Extend types gradually
Date: Sat Apr 20 08:31:45 EDT 2019

There are now at least two things that would benefit from parameterization in the type system:

- Bit types
- Substrate (e.g. mul or no mul)

These can probably be introduced gradually by writing the old in terms of the new, and then moving the library written in terms of old to new. This in itself is not the issue right now: bit types are values passed in at instantiation time, and substrate is just Seq, but with some parts eliminated (exposed as errors). Seq is dataflow + sequential feedback. It cannot by itself express other types of hierarchy, such as running a program on a CPU. This needs to be captured by "recursive" Seq.

Entry: The DSP language: combinators and algebraic substrate
Date: Sat Apr 20 08:43:45 EDT 2019

Two problems need to be solved:

- how to combine iteration patterns into higher level operations (hide iteration)
- how to express algorithms in a way that algorithm analysis is possible. Basically, define a (hierarchical?) set of classes that can represent matrices, autodiff, etc.

Entry: Start with Ring
Date: Sat Apr 20 09:02:59 EDT 2019

- Basic arithmetic is expressed in a Ring.
- The ring can be "parameterized", which is mostly there to parameterize constants to be able to do autodiff.

EDIT: I have Ring and complex numbers over a ring. The goal now is to express a complete Seq system using only these abstractions, and have it generate C code. I will need that first actual implementation to then generalize to more complex things, to see how it all fits together in detail. I need a little break before continuing, but here is the basic idea: Create a C program that generates a damped complex oscillation. Basically the generalized counter "hello world" program. It is very important to be able to do things like this:

- have a class-level language that has a notion of exponential function
- implement it using a polynomial or an update equation

I just want very extreme modularity. Haskell is a great substrate for that.

Entry: Ring vs. System
Date: Sat Apr 20 11:41:53 EDT 2019

Dataflow operations are simple. They are expressions in the Ring. However, I will want to build systems before I plug them into analysis.

Entry: Parameterized Z-transform
Date: Sat Apr 20 11:45:57 EDT 2019

This is the important concept. It is the output of a derivative of a non-linear system. Organizing this might take a couple of iterations. But it seems there is a tremendous amount of leverage to be exposed. I know something is there but I don't see the path. And I'm afraid I'm going to dull out before I am able to write this down. It takes quite a bit of time to just load that context in my head, though there is a lot of echo from the past just loading it this morning...

Entry: An example
Date: Sat Apr 20 11:52:21 EDT 2019

A complex 1-pole AR process. The intermediate abstraction is a system. That is what should be translated to Seq. Is this possible? Or does it need some commuted version? Where Seq implements the feedback operator. It will have to.

Entry: The problem is commutation.
Date: Sat Apr 20 11:57:43 EDT 2019

Think of it at this higher level, and things will open up. Haskell is not a good substrate because it doesn't allow to express this. You'll need to find a way to represent the higher level, and then compile it to Haskell. A meta step is essential here. Dependent types will help but they won't be accessible to you yet. The next step is to just write down that damn exponential and stare at it.
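Written down, it is small. A hedged plain-Haskell sketch, with complex multiplication spelled out instead of going through the Ring class (the C constructor matches the one in Algebra.hs):

  -- z(k+1) = a * z(k), with a = r * e^(jw), r < 1:
  -- a damped complex oscillation, the generalized counter "hello world".
  data C t = C t t deriving Show   -- (re, im)

  cmul :: Num t => C t -> C t -> C t
  cmul (C ar ai) (C br bi) = C (ar*br - ai*bi) (ar*bi + ai*br)

  osc :: Double -> Double -> [C Double]
  osc r w = iterate (cmul a) (C 1 0)
    where a = C (r * cos w) (r * sin w)

take 10 (osc 0.99 0.1) spirals in toward the origin; the generated C program would be the same update equation with the iteration made explicit.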
EDIT: I have to learn how to organize type class hierarchies, i.e. how they are derived. I've just winged it up to now. I need to push through to really understand it before I get anywhere. Here's an example of a clash I did not expect. Matching instances:

  instance Seq m r => Ring m (r S)
    -- Defined at ../deps/asm_tools/asm-tools-seq/Language/Seq/DSP.hs:55:10
  instance [safe] Ring m t => Ring m (C t)
    -- Defined at ../deps/asm_tools/asm-tools-seq/Language/Seq/Algebra.hs:96:10

The latter seems fine. Is the former actually correct? Maybe it is the other way around? Head starts hurting now...

Entry: Why is this unintuitive?
Date: Sat Apr 20 12:26:27 EDT 2019

This is clear:

  Ring m t -> Ring m (C t)

Why should this be different:

  Seq m r -> Ring m (r t)

The former means: given that Ring operations are defined for t, here's a way to define Ring operations for C t. The latter means: given that there is an implementation for Seq, here's a way to implement Ring. There isn't anything wrong with this, but what is clear is that it should not be the only instance, maybe? Maybe r just needs to be made explicit. This smells like a detail. Something not really that important.

EDIT: Yes, it's a detail. It's probably possible to remove it using some more "pinning" of type classes, but basically the problem is that the compiler doesn't know that the C type is not used as a representation. In practice it seems that m and t are never generic when this class is instantiated, so it is not an issue.

EDIT: Looks like this is going to happen a lot.

Entry: Bridging the two systems views
Date: Sat Apr 20 13:59:56 EDT 2019

In Seq.Algebra I need a representation of a system to be able to compute a Z transform (or generate a Seq function that computes the Z transform, which might be useful for things like graphical display). How to match that with the requirement of Seq, to implement closeReg? Actually, the Z transform is just function evaluation. It is probably also not necessary to distinguish between linear and non-linear, because it will simply not be defined for nonlinear functions.

Entry: ST
Date: Sat Apr 20 14:10:10 EDT 2019

Why can't the Seq monad just be ST? Why is the intermediate TH step necessary?

Entry: Lists and Functor, Traversable, Zip
Date: Sat Apr 20 16:37:36 EDT 2019

I'm going to need structured arrays, e.g. matrices of complex numbers, that will have to flatten down to just lists of things. Is there a simple way to create a data structure such that it automatically has Functor, Traversable and Zip? I've been here before. What is a representable functor?
http://hackage.haskell.org/package/representable-functors-3.2.0.2/docs/Data-Functor-Representable.html
Not for now. I need to focus on the application.

  data C t = C t t deriving
    (Eq, Show, Functor, Foldable, Traversable)

  instance Zip C where
    zip (C ar ai) (C br bi) = C (ar,br) (ai,bi)

  c2 f a b = sequence $ zipWith f a b

Entry: Z transforms
Date: Sat Apr 20 17:56:52 EDT 2019

I can't figure it out. Something is missing.

EDIT: I have some rules figured out, but the beef is still in applying this to systems. I don't see where to go next... say I have:

  s -> i -> m (s, o)

To create the z-transform, the state needs to be eliminated by using up the expression for delay-as-phase-shift. I need to do this on paper first to trigger muscle memory, then it will become obvious.
EDIT: The way to look at this is to look at a linear autoregressive process. Using the notation I'm used to, where x is the state vector, A is the system matrix, B is the input-to-state matrix, and i is the input:

  z x(z) = A x(z) + B i(z)
  =>
  x(z) = (zI - A)^-1 B i(z)

To construct an evaluator for x, it might not be necessary to compute the inverse. Given z and i, this reduces to a system from which x can be solved. So what do we actually have? The system for which we're computing the z transform is not the original system: it is the linearized version. A is a matrix of partial derivatives. These are still non-linear in the parameters of the system, but we can actually compute the numerical values using autodiff, then solve the system to produce an evaluator for the z transform. I'd like to implement this so it can all fold in on itself to keep all components reifiable and composable.

EDIT: Yep. Doing this on paper first helped a lot to figure out what exactly I was looking for. Basically the entire thing I did today around Z-transforms is quite meaningless.

Entry: Z-transform, summary
Date: Sat Apr 20 22:02:59 EDT 2019

This will only work on linearized systems, so first create an autodiff instance to be able to compute partial differentials. So to implement autodiff, implement normal numbers.

EDIT: Almost done with that. Rest is easy to fill in.

EDIT: I've re-invented rai/ai-freq.rkt

  ;; Basic strategy:
  ;; - normal eval: eval parent semantics lifted over small signals (linear approx)
  ;; - feedback:
  ;;   - split variables in parameter / signal
  ;;   - compute output offsets
  ;;   - compute differential matrix of update function
  ;;   - return (memoized) z-dependent signals
  ;;   - compute z-dependent transfer function from z, matrix
  ;;   - apply transfer function
  ;;
  ;; Obtain the frequency response of a linear system by first
  ;; probing the function for the linear system matrix, and then
  ;; compute the effect of feedback through matrix inversion
  ;; (solve linear system).

Entry: System solving
Date: Sat Apr 20 23:01:11 EDT 2019

Could also be implemented in Seq. It's probably useful to have as a library operation. Do it first without pivoting. Later, pivoting might be necessary. Not ideal:

- Cramer's rule (< 4 variables ok?)
- Gaussian elimination

Better:

- LDU factorization?

RAI computes m_inv.

Entry: Next
Date: Sun Apr 21 07:49:24 EDT 2019

It feels like the z-transform is distracting from the problem at hand, which is to get some actual code going to do phase demodulation. Otoh, it would require solving a couple of things. What I miss is insight and basic structure. The split of Ring vs. Function seems to be a good idea. Should div be part of Ring?

Entry: Base ring/field
Date: Sun Apr 21 08:12:06 EDT 2019

I think it makes sense to add the base field to the definition of Ring. Essentially, I'm trying to incorporate the idea of a vector space as well. Maybe that should be kept separate.

EDIT: The idea here is indeed to define

  class (Ring m t, Traversable f, Zip f) => Vector m f t

I.e. it is really just a constraint on a functor.

Entry: Abstract representations of vectors
Date: Sun Apr 21 08:42:09 EDT 2019

Eventually, I want all loops to be implemented target-side, so there will need to be some notion of fold and zip that are implicit enough to push through the representation monad. This is the next tough problem. It cannot be captured in Seq. It is a different type class: Loop? What about this: C and D are always inlined.
There doesn't seem to be a good reason not to, but matrices and polynomials are implemented in terms of abstract iteration and storage patterns.

Entry: Test case for implementable vectors as loops
Date: Sun Apr 21 08:54:29 EDT 2019

Basically, compute the norm of a vector, but leave the vector representation abstract. C.hs and Term.hs will need to be extended to support two ideas: loops and arrays. The biggest win would be if this can be written in a way that a low level compiler can eliminate intermediate storage, i.e. perform loop fusion. In RAI I do this manually. E.g. instead of

  f3 a = let b = fmap f1 a
         in  fmap f2 b

where b is an intermediate vector, it would be possible to map (f2 . f1) over a and produce the result directly, where the 'b' values are only scalar inside the loop. I believe this should be easy enough to do as long as the storage allocation of b is visible only to the function and not outside. I need to ask somebody with good knowledge of LLVM. Or, just make a test. I believe the terms are "fusion" and "deforestation".

  void f(const float *src, float *dst) {
      float tmp[10];
      for (int i=0; i<10; i++) { tmp[i] = sin(src[i]); }
      for (int i=0; i<10; i++) { dst[i] = sin(tmp[i]); }
  }

  void f(const float *src, float *dst) {
      for (int i=0; i<10; i++) {
          float tmp = sin(src[i]);
          dst[i] = sin(tmp);
      }
  }

I think it's ok to assume that this will work out fine. So to implement the loop operations, it should be enough to just separate declaration and binding.

Entry: Target vectors
Date: Sun Apr 21 09:37:51 EDT 2019

So it seems almost trivial as long as the interfaces are there. But how to provide that interface? Step one: implement some code that is abstract in the vector type. This is going to be a bit of work.

Entry: Loops are environments
Date: Sun Apr 21 09:53:06 EDT 2019

( Relation to Representable functors? ) Luckily, I just need vectors at this time, so it can be quite concrete. Vectors will be the base for all Functor, Traversable, Foldable, Zip behavior. There are only two operations that are important at this time:

- fold (accumulation, vector to scalar reduction)
- zip (vector to vector maps)

Is there a natural way to express these? Yes, by assuming each loop has a single structure that is natural to C-like code:

- input / output vectors
- accumulators / state
- nested versions of these

Then implement the standard classes in terms of these primitives. How do I start this? Probably best by extending Term.hs such that C.hs can generate it. It doesn't feel like today is the day for it though.

- Declare and initialize accumulators
- Declare vectors
- Insert the loop head
- Insert the loop body

EDIT: Some key insight is missing. This is a tangle, and I need to find a starting point to then see the loose ends.

Entry: Necessarily Meta
Date: Sun Apr 21 15:23:00 EDT 2019

Because their main point is to say something about the code that is completely lost when it is compiled.

Entry: C.hs
Date: Mon Apr 22 07:58:21 EDT 2019

Start by creating a new kind of binding: a loop. A loop has two main parts:

- Output array
- Output accumulator

To simplify, these are always there, and there is just one of them for now. Extend to multiple arrays or accumulators once the basic structure is ready. So it is clear now where Term needs to be extended. The question now is where to add the concept of collection. Should "node" be extended to mean a combination of accumulators and arrays? Maybe the first change to make is the ability to have multiple outputs from a primitive, because loops will likely have multiple outputs.
It seems safest to extend Term into something that can express this structure. Maybe TermLoop.hs? It might be possible to extend Term.hs in-place, but I worry about breaking things, so let's just decouple it for now and create an intermediate form. Or, just go for it. I believe the only necessary change is in binding. Note that the node type is abstract in Term.hs. Let's first find out the concrete type of C.compile:

  compile :: Show n => String -> CompileResult n -> String

So it doesn't care about the node type. Let's add another constraint on n that embeds the looping:

  class Loop n
  instance Loop NodeNum

  compile :: (Loop n, Show n) => String -> CompileResult n -> String

This class then could expose the required nesting. ( This is the first time I think about using type classes to express folds to extend data types. Is this the right way to go? This is always possible, I just never realized it was an option.. ) So the current approach is to just stick to Term, and extend it recursively using a type class.

Entry: Primitives with multiple return values
Date: Mon Apr 22 08:22:31 EDT 2019

I think I miss this feature to be able to abstract a loop as something that returns a set of arrays and a set of accumulators. Implementing the loop body is easy: it just contains nodes that are parameterized by the current loop count. The idea is to extend the node type to also express array references. This should work as well. Basically a loop is a siso applied to an array. Where to introduce the recursion? It should be something like SeqLoop?

Entry: Summary: SeqLoop
Date: Mon Apr 22 08:40:54 EDT 2019

The extension needs to happen as:

  class Seq m r => SeqLoop m v r where
    zipfold :: ([r t] -> [r t] -> ([r t], [r t]))
            -> r [t] -> r [v t] -> r ([t], [v t])
    -- zipfold body initStates inputVectors = (outStates, outputVectors)

Or something like that. The problem this solves is to unpack the representations of arrays: inside the body function, there are only scalar representations. This is going to take some shuffling to get right. But it seems that once expressed properly, the construction of instances is going to be straightforward. Generalized:

  class (Seq m r,
         -- Generalize [] grouping functors to a,i,o
         Zip a, Traversable a,  -- accumulators
         Zip i, Traversable i,  -- inputs
         Zip o, Traversable o   -- outputs
        ) => SeqLoop m r a i o where

    -- This implements the typical "tagless-final" style where a
    -- combinator flips the nesting of representation (r) and collection
    -- (a,i,o) type constructors.
    zipfold :: (a (r t) -> i (r t) -> (a (r t), o (r t)))
            -> (r (a t) -> r (i t) -> (r (a t), r (o t)))
    -- zipfold loopBody initAccus inputVectors = (outAccus, outputVectors)

I'm pretty sure that's it. The rest is implementation, which can now be done in a type-driven fashion. That's for tomorrow morning as I don't think it's going to be a small change. Probably will need a new data type.

Entry: zipfold
Date: Sat Apr 27 09:16:02 EDT 2019

What's next? Multiple outputs / bindings. The reason for multiple outputs is sharing: there is likely some intermediate value that then forks into two. It doesn't seem possible to express this as single bindings. Maybe this isn't so simple. At least not today. This might need some brightness to resolve... Yeah this is not going to work today.

Entry: zipfold
Date: Sat May 4 07:37:42 EDT 2019

I feel ready to tackle this. First thing: multi-output bindings. Seq is only single-output, so a simpler way would probably be to create bundling of bindings instead?

  data Binding n = Binding n (Term (Op n))
                 | Probe (Op n) [String]
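One obvious variant, just to see the shape; a hypothetical constructor, not implemented anywhere:

  -- Let one Term bind several nodes at once, e.g. a loop producing
  -- both arrays and accumulators.
  data Binding n = Binding n (Term (Op n))
                 | MultiBinding [n] (Term (Op n))   -- hypothetical
                 | Probe (Op n) [String]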
data Binding n = Binding n (Term (Op n)) | Probe (Op n) [String]

So if n were a collection, this would be solved? I feel the solution to this is simple, but I don't see it, and I'd have to first build a couple of things that don't work before the proper way becomes obvious. Let's just clone Term.hs and make it multi-binding. This is going to be more difficult. Let's just do it in-place. I want this to be a simple change. What do I really need? A construct that introduces a loop variable, but otherwise does nothing special. I come back to the same architecture as RAI.

Entry: Go back to RAI
Date: Sat May 4 08:03:42 EDT 2019

RAI uses the observation that loops just introduce context = loop variables, and arrays are just index-parameterized variables. Single assignment is preserved, except for accumulators. So arrays themselves are easy: just add some annotation. Assume that loop indices are just scalar base types. Accumulators are the sore point. Can they be written as single-assignment as well? E.g. they are just arrays, but written in such a way that there are no back-references. What about generalizing this: an accumulator is really a "triangle". Allow for back-references, but optimize them out in the case of an accumulator.

Entry: Two-pass full array + triangle feedback
Date: Sat May 4 08:09:59 EDT 2019

So the idea is to embrace the array nature and allow for "full history" in the loops, then optimize out:
- intermediate arrays that are only used element-wise
- translate triangles to accumulators
Do this in two passes: one that generates only the normal form, and one that can perform the optimizations. This means the zipfold form has to change, as back-references of outputs are allowed, and full references of inputs are allowed as well. So the language I'm writing is an array language with "triangle feedback" of the outputs. This way there needs to be no distinction between accumulators and outputs, as each output can be used as an accumulator, as long as the index referencing stays within the triangle. It seems that representation of this at the Term level would be straightforward. It needs only arrays and loops that range over an index. The output of a single iteration is a collection of scalars. The input is a collection of arrays.
- Some of these are original inputs, and are full scale.
- Some are outputs computed in the previous iteration.
So this needs the representation of an array, probably as a function. Some cases:
- one loop incrementally computes an array
- another loop can be run to perform a computation on that whole array
So it can't be just "global" variables. Scope really needs to be local. Storage can be reused though, as long as outputs are not propagated. This isn't a simple problem. But it is good to realize that a more general view (triangles instead of accumulators) is better. So again:
- at any "step", you can compute one scalar value from a number of other scalar values (n-op)
- the place where this value is allocated, i.e. the hole, is what we are managing.
- at each "step", we also know exactly the valid range of the input arrays.
- based on context, holes can be re-used in subsequent iterations, but this is an optimization. In the first iteration, allocate everything for single-assignment. Then later on, "project" the variables, where each projection eliminates a dimension.
- start with a setup where all loops have a fixed size, then generalize to computed fixed sizes (e.g.
allocate intermediates on stack), then generalize to data-based iteration sizes with just an upper bound on the allocation.

The idea is still the same as RAI, only RAI was too eager with optimizing out intermediates. Representing this is going to be a challenge. Term can probably be used because it has parameterized nodes. What we add is:
- loop variable contexts
- context dependent nodes for value binding
- primitives for dereferencing
So, where to start? Structurally, the first thing that needs to happen is to introduce context nesting. Once that is possible, the rest will become clear.

Entry: holes
Date: Sat May 4 08:48:55 EDT 2019

It seems simpler to do this by making holes explicit. There are two ways of looking at it:
holes:
- put hole allocation code in the output
- parameterize the remaining code generator with the hole
outputs:
- run the code generator
- from the outputs it produces, insert the hole allocation before the generated code
So they are pretty much the same thing apart from some juggling. Using outputs would also allow local variables to remain local, i.e. escape analysis will become simpler: if a value that is computed inside a loop does not survive it, it can be allocated inside the loop without reference to the current iteration variable. So each variable would have its full dimensionality, plus an annotation indicating at which loop exit it is "dropped". This needs to ferment.

Entry: Map/Fold
Date: Wed May 8 07:30:29 EDT 2019

It is important to look at this the right way. Ideally, I would like to have only one looping construct: the one that includes output feedback. This encodes a mapping between the multi-dimensional object, and a sequential encoding of the dependencies. Note that mathematically, there might be many forms of evaluation. So ideally I want to express only the dependencies. But in practice, I will need a space filling curve, a sequence. Does it make sense to express things in that higher abstraction? Or is it better to stick to the reality of the loop? In general it is better to do the specific case first, to at least have an example of what to generalize.

What is an array?

a :: i -> v

With the caveat that i is contained in an interval, and for a fold, the interval grows during the loop. In this context, map looks like:

map :: (a -> r) -> (i -> a) -> (i -> r)

In practice, we want access to the current index:

imap :: (i -> a -> r) -> (i -> a) -> (i -> r)

The multivariate version has the element types be tuples. This is no essential difference, so let's ignore it for now. The version where random access is possible:

irmap :: (i -> (i -> a) -> r) -> (i -> a) -> (i -> r)

Note that (i -> a) now can be factored out completely into a global environment. Maybe this is important?

(i -> a) -> (i -> r)

What does this actually mean? That a constant input is just a function that has no influence on the shape of things. Let's move on to folds first. A fold looks like:

ifold :: (i -> s -> a -> s) -> s -> (i -> a) -> s

Generalizing the loop state to an array, and allowing access to the previous results inside the loop gives:

iafold :: (i -> (i -> s) -> s) -> (i -> s) -> (i -> s)

This seems to be the essential component, because all of these i->s are different things. The induction step is to take

i -> s where i \in [0,n[

to

i -> s where i \in [0,n]

It is a transformation between types. I don't think I can express this in the current encoding.
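As a sanity check, here is a concrete executable model of iafold, with arrays modeled as lists instead of functions (a sketch, not the IR encoding):

-- The body may read the whole prefix constructed so far (the "triangle").
iafold :: Int -> (Int -> (Int -> s) -> s) -> [s]
iafold n body = go 0 []
  where
    go i prefix
      | i == n    = prefix
      | otherwise = go (i + 1) (prefix ++ [body i (prefix !!)])

-- A plain accumulator fold (prefix sums) is the special case where
-- step i only reads element i-1.
psums :: [Int] -> [Int]
psums xs = iafold (length xs) step
  where
    step 0 _    = head xs
    step i prev = prev (i - 1) + xs !! i
-- psums [1,2,3,4] == [1,3,6,10]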
To read iafold's type: take a function that grows an array by one element, and use it to transform an array of size 0 into an array of size N (concrete). Separately, provide some context such that the array growing function can also reference arrays that are not being constructed. So basically, I'm building a sequential array constructor language. Even simplified: the input i->s is actually just () if it is an empty array. So the type can be even simpler:

iafold' :: i -> (i -> (i -> s) -> s) -> (i -> s)

Take the size of an array, a body that has access to the current size and the array constructed in the last step, and produce a new array. Actually, the body needs access to the total size as well, but this can be hidden in the closures because it is a constant. So it appears that the idea of "environment" is very important. Also, there is some need to turn this into a type family, where the induction step can be expressed at the type level. Starting from the main iteration:
- map fits in here by never using the input state
- fold fits by only using the last state (reusable state variable)
- more complex siso machines can use multiple delay elements
It seems that focusing on the one true abstraction makes sense, because that will be the only one that needs to be implemented. In a second step, constraints can be added at the type level for more constrained iterators, and these type annotations could then be propagated to the intermediate language to be used in non-local optimizations, i.e. optimizations that will need to fold over the entire IR.

Summary:
- make indices explicit
- explicit indices allow constant arrays to be hidden in an environment
- the essential "step" is appending one element to an array, where the whole previous array can be used as input.
- todo: find some way of encoding the n->n+1 size extension induction in the type
- additional patterns could be expressed as constraints on the main pattern: map, fold, and combinations with more complex state.

Entry: Context
Date: Wed May 8 11:26:16 EDT 2019

One of the central ideas in the previous section is to simplify by abstracting away the context. This will require some support for higher order functions in the representation, which will mostly be just arrays defined at particular levels. This will be simple nesting without context escapes, so there should be no generic lambda+app that can take a context and "move" it somewhere else. EDIT: But still, if the interface to this mechanism is just lambda, then the capturing and application need to somehow be constrained. Maybe in a first iteration, an explicit approach is better? I.e. an environment is an explicit collection of arrays?

Entry: Random access
Date: Thu May 9 07:26:12 EDT 2019

The point is that the primitives should support random access. This needs some kind of dependent typing trick to allow encoding of bounds. If indices are not processed, this is no big deal, but I also want to have a way to compute indices.

Entry: Lambda
Date: Thu May 9 07:31:11 EDT 2019

So let's try to get some constraints on abstraction and application.
- When creating a closure, it needs to contain a type marker that allows it to be inserted deeper into a loop, but not higher up. So there is a type family of loop nesting as well.
- The arrays that are contained in the context, together with the current loop indices, will always be valid if we just go deeper. So these can just be tracked. The index into the type family is the loop nesting.
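A minimal sketch of what that could look like, using a type-level Nat as the nesting depth (hypothetical encoding, invented names):

{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}

data Nat = Z | S Nat

-- An array representation indexed by the nesting depth at which it
-- was defined: each surrounding loop adds one 'S'.
data Arr (d :: Nat) a where
  Scalar :: a -> Arr 'Z a
  Nest   :: (Int -> Arr d a) -> Arr ('S d) a

-- Going one loop deeper is always allowed: a value defined outside
-- a loop is still valid inside it.
deepen :: Arr d a -> Arr ('S d) a
deepen v = Nest (const v)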
EDIT: The environment can be encoded as a nested type-level construct using positional encoding. Each level contains an array and some information describing the range of the array.

Entry: Next?
Date: Thu May 9 14:22:01 EDT 2019

I need to understand more about type level programming to be able to make good decisions. Go over this one: https://www.parsonsmatt.org/2017/04/26/basic_type_level_programming_in_haskell.html

Entry: Type level stuff
Date: Fri May 10 10:09:08 EDT 2019

First, maybe write a wrapper around Seq that allows fixed size bit vectors.

Entry: Make arrays total
Date: Fri May 10 12:58:25 EDT 2019

Make the programming model such that the arrays are total functions, e.g. just using wraparound, but create some kind of type-level tracking to remove bounds mapping when it is not necessary.

Entry: Implement some loop
Date: Sat May 11 08:18:38 EDT 2019

I'm quite stuck, so let's start with something. The last practical hurdle was not having multiple return values in a binding statement, preventing a recursion point. This can be solved by introducing collection nodes, and collection dereferences.

Entry: Moving forward on the loops
Date: Sun May 12 02:07:15 EDT 2019

It is important to see that all this fancy type stuff is just notation. It makes it possible to fit the idea into a larger framework without having to do a lot of manual "matching". Or at least, without having to do manual verification of that matching. The idea should stand on its own. If it is too hard to express, find a different way. Either untyped, or more concretely typed for a particular example.

Entry: Feldspar and fused representations
Date: Sun May 12 08:56:45 EDT 2019

There was a trick there. Some earlier RAI notes mentioned that I need a loop transformation algebra. Maybe start there?
1. It should be simple to create a generic unfused version, where all arrays are explicit, and time feedback is not in the picture.
2. Perform fusion on that
3. Introduce state on top of it?
So the new focus is to separate feedback from vectors. The core of the target language doesn't have anything to do with feedback at the inner loops.

Entry: Grid
Date: Sun May 12 10:29:44 EDT 2019

I started doing this in a separate module. Two important insights:
- Grid needs to be separated from Seq. Basically:
  - Seq = Expr x Time (implicit)
  - Grid = Expr x Space (arrays)
- Grid will provide "intermediate storage holes" for Expr.
So the interface between Grid and Expr is at the level of ALLOCATION. Practically, when a Grid compiler passes control to an Expr compiler, it will need to provide the mechanism to create a variable in a context. Maybe I should stop here to not get too confused.

EDIT: Continued a bit, distinguishing a number of entities into separate types. It can now represent:

T c[100];
T d[100];
for (int i = 0; i < 100; i++) {
  c[i] = op(a[i], b[i]);
  d[i] = op(c[i], c[i]);
}

EDIT: This is still not correct, as arrays can span multiple loops. So next is to encode nested loops.

EDIT: Will it be possible to mix different "types" of expressions inside loops? No. This will have to be recovered in post-processing. Let's try to express at least 2 levels. In this light, there is a difference between
- the loop being expressed abstractly (iterate over i,j)
- the actual concrete nesting order
The loop order will be important for implementations, as it will allow for certain kinds of optimizations. But. If everything is inside a single loop, then nothing is inside a single loop. The product _doesn't add any structure_.
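A toy model of the structure being distinguished here: allocations and (possibly nested) loops containing per-cell operations. Hypothetical constructors, not the actual Grid module types; the op name is elided.

type Array = String
type Index = String

data Stmt = Alloc Array [Int]                       -- T c[100];
          | Loop Index Int [Stmt]                   -- for (i) { ... }
          | Set (Array, [Index]) [(Array, [Index])] -- c[i] = op(a[i], b[i])

-- The single-level example from above:
example :: [Stmt]
example =
  [ Alloc "c" [100]
  , Alloc "d" [100]
  , Loop "i" 100
      [ Set ("c", ["i"]) [("a", ["i"]), ("b", ["i"])]
      , Set ("d", ["i"]) [("c", ["i"]), ("c", ["i"])] ] ]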
So it appears that this entire exercise is about identifying reuse patterns in nested loops. And I am very naturally arriving at the same approach as RAI, though with a separate optimization step.

Entry: Escape analysis
Date: Sun May 12 14:59:03 EDT 2019

Whether a value escapes will determine whether it is temporary.

Entry: Identifying reuse patterns
Date: Sun May 12 15:13:03 EDT 2019

This needs a simplified notation. Basically, map the full target language representation onto something simpler, define operations on the simpler form to play with them, and then lift them to the target language representation. To make these transformations work, they need to be expressed in an algebra. Starting from where we were:

T c[100];
T d[100];
for (int i = 0; i < 100; i++) {
  c[i] = op(a[i], b[i]);
  d[i] = op(c[i], c[i]);
}

This can be stripped to

(def c 100)
(def d 100)
(loop i 100
  ((c i) (op (a i) (b i)))
  ((d i) (op (c i) (c i))))

Removing the definitions:

(loop i 100
  ((c i) (op (a i) (b i)))
  ((d i) (op (c i) (c i))))

Ignoring the loop ranges: we know there is some range, but its extent is non-essential.

(loop i
  ((c i) (op (a i) (b i)))
  ((d i) (op (c i) (c i))))

The transformation is then between the above and

(loop i
  (c (op (a i) (b i)))
  ((d i) (op c c)))

The general rule: a dimension can be removed if a variable does not escape a context. ( Or if only the last variable does. ) Then loop order for multiply-nested loops could be determined by checking each order, and comparing the optimizations. Removing mention of "op":

(loop i
  ((c i) (a i) (b i))
  ((d i) (c i) (c i)))

The transformation is then between the above and

(loop i
  (c (a i) (b i))
  ((d i) (c c)))

Getting rid of parentheses, where <- means binding + some abstracted primitive operation:

i: c(i) <- a(i) b(i)
   d(i) <- c(i) c(i)

i: c() <- a(i) b(i)
   d(i) <- c() c()

Then get rid of operations and use upper case for array names, and lower case for indices:

i: Ci <- Ai Bi
   Di <- Ci Ci

i: C <- Ai Bi
   Di <- C C

Entry: Construction of loops
Date: Sun May 12 15:25:53 EDT 2019

Basically, a loop lifts an expression over an array. It is essentially "map". If there is back-reference, it is a "fold". If there is random access back-reference, it is a "triangle fold". It seems that this problem can be done quite simply by using full grid notation. It would only need each "map" or "fold" to add an extra dimension. It's important to realize that this is not just for scalars! Each loop is a +1 in the number of grid parameters. Because the difference between map, fold and triangle fold is just array access, it might be good enough to focus on map.

Entry: Loop algebra
Date: Sun May 12 15:33:17 EDT 2019

What are the operations?
- fuse / split
- project / inject (eliminate)
- hoisting
- interchange
Where deforestation is a combination of fusion and elimination. I think that's pretty much it. Here are some more: https://en.wikipedia.org/wiki/Loop_optimization

Here are the (bi-directional) operations in the notation developed above, in "reductive" order.

- FUSE
  i: Bi <- Ai
  i: Ci <- Bi
  =>
  i: Bi <- Ai
     Ci <- Bi

- ELIMINATE
  i: Ci <- Ai Bi
     Di <- Ci Ci
  =>
  i: C <- Ai Bi
     Di <- C C

- HOIST
  i: C <- A B
     Ei <- C Di
  =>
  C <- A B
  i: Ei <- C Di

- INTERCHANGE
  i: j: Cij <- Aij Bij
  =>
  j: i: Cij <- Aij Bij

Entry: summary
Date: Sun May 12 16:10:31 EDT 2019

- to define an algebra, first define it on a simple, concrete language (i.e. create a notation first), and then generalize it to a language with practical annotations (e.g. something that maps to C loops and arrays).
- the RAI idea isn't bad. It was just missing a split between generation of the full grid intermediate, and subsequent optimization based on a loop algebra.
- the algebra is "large", i.e. there are a lot of degrees of motion inside the algebra. Any optimization will likely need heuristics. It's not just straightforward reduction.
- higher order functions just add a dimension to the grid. They are not just scalar->vector.

Entry: LTA
Date: Tue May 14 08:54:00 EDT 2019

So I have a language form and a prettyprinter. Next is to define the operations, and define some folds. The next organizational step is to define a monad-parameterized language such that creating terms would be easier, and use it to build a prettyprinter. EDIT: Do a concrete version of this first. E.g.:

test_val = Program $
  [loop' i [loop' j [let' (c i j) (a i j) (b i j)]],
   loop' i [loop' j [let' (d i j) (a i j) (c i j)]]]
  where [a,b,c,d] = map a2 ["A","B","C","D"]

This should expose an even more concise representation.

Entry: Expressing the operations
Date: Wed May 15 08:40:04 EDT 2019

There's something not quite right, because it is very hard to express the transformations without a lot of conditions. I want to split this up:
- Find a way to represent the structure, zoomed in at a particular site
- In this zoomed in state, the transformation itself will be trivial to express
For fuse it is already quite simple: zoom in on two adjacent loops. For interchange it is simple as well. It seems that the search is really about representation spaces. About notation. Maybe start by condensing some of the constructors. EDIT: Done: there are just 2 now:
- Program = collection of Form
- Form = branch of LetLoop and LetPrim
I think the insight is that these transformations are not local. They all essentially consist of an iteration pattern and a decision rule that uses non-local information. Start by splitting fuse into two components:
- an iterator that goes through the list of terms, pairwise, and
- an inner routine that Maybe produces a fused element
So many insights pop up while doing this, but it is so hard to record or remember them.

Entry: About folds
Date: Wed May 15 09:41:17 EDT 2019

Why do I have trouble writing a fold for this?

data Form = LetPrim Cell [Cell]
          | LetLoop Index [Form]

Maybe because the constructors are not primitive? While 'Program' is just a wrapper around [], the [] introduces a substructure. The question is then, should the constructors of this substructure be exposed? Let's just try the dumb thing first. EDIT: It's easier to understand when making things a little more general, and looking at Form as being parameterized by the [] type. This then has a direct correspondence in foldForm being parameterized by foldList. The flattened foldr then has all the constructors involved in the two mutually recursive types: letPrim, letLoop, cons, nil, with the remark that there are now 2 types of "accumulators" for the two mutually recursive types. EDIT: The mutual recursion is now expressed properly in two separate legs, and a combined fold that treats all levels the same.

Entry: Next
Date: Wed May 15 11:31:26 EDT 2019

Rewrite fusion in terms of a fold.

Entry: Use the fold
Date: Thu May 16 08:34:28 EDT 2019

Curious, because I have never done it like this. It looks reasonable though.
Start with a no-op:

fuse' p = foldProgram letPrim letLoop cons nil where
  nil = []
  cons a b = a:b
  letPrim c cs = LetPrim c cs
  letLoop i fs = LetLoop i fs

To do the fuse, it can be done in two spots: either as part of cons on a per element basis, or as part of letLoop, operating on the whole list. Let's try cons first. I got to this, which only does one layer. Why is that? Aha, because it uses a non-recursive cons.

fuse' p = Program $ foldProgram LetPrim LetLoop cons [] p where
  cons h@(LetLoop i as) t@((LetLoop i0 bs):t') =
    case i == i0 of
      True  -> (LetLoop i (as ++ bs)) : t'
      False -> h:t
  cons a b = a:b

This is a lot more tricky than I thought. The fully recursive routine needs a cons that is non-recursive. OK I see now: it works inside out, and fusion is an outside-in operation: when the outside is fused, it exposes fusable insides. Running the operation multiple times does produce the correct result. So, is it possible to create a fold that is top-down? That is a breadth-first iteration pattern.

Entry: Local context
Date: Thu May 16 09:16:20 EDT 2019

So I need some iteration pattern that can associate a primitive term to some context. To use Traversable, there needs to be a Functor structure. What is the contained type in this case? I'd say the basic element would be the primitive expression. We want to modify the indexing there. First, make it such that Form can take a single binding parameter. EDIT: Done. It was easy, and instances can be derived automatically. This is awesome. So it should be easy now to create a primitive transformation operation based on context. Basically, traverse, with a custom monad. Looks like there are two different ways to look at this:
- Custom, using a monadic iteration based on the generalized fold
- Flattened, using standard iteration patterns.
It appears that there needs to be some routine that moves information from the structure that is hidden to the functor (loop indices) into the contained values. EDIT: I've created the function below, which exposes the path to the functor structure.

annotate :: Program b -> Program ([Index], b)
annotate (Program fs) = Program $ forms [] fs where
  forms path fs = map (form path) fs
  form path (LetPrim b)    = LetPrim (path, b)
  form path (LetLoop i fs) = LetLoop i $ forms (i:path) fs

Now that could be tucked away in a Monad. Is it necessary though? I kind of like this explicit structure.

Entry: Escape analysis
Date: Thu May 16 10:24:42 EDT 2019

Now for the core issue: construct a list of variables that are only defined inside a loop. Note that we actually have that information in the original form: we know the return values of the form. So it is likely safe to assume the intermediate form will have an explicit list of arrays that will be visible outside of its scope. There's a problem: when two loops are fused, the outputs might no longer escape. So it seems that the analysis needs to be performed anyway. What does it mean for a variable to not escape? That it is no longer referenced after the loop has finished. So given a loop segment, we need to isolate the segments that come after it. This needs to be done for each binding. It's not entirely clear how to mix primitives and loops inside a forms list. So let's just start building this and see where it ends. I'm starting to run out of steam. This stuff is exhausting. Ok, resume. I do not know what to do with irregular nesting, but I do know that the end of a loop is a splitting point. EDIT: I'm missing an insight, an angle. I need to let this pop up by itself.
EDIT: Ok I tried several times. It's not working. I can't retain enough context.

EDIT: Took a much longer break. Let's create a simpler zipper. Follow the Haskell tutorial first to refresh intuition. So, really, it is just a stack. I just need a stack, because only the future matters. It's not that I haven't done that before! Here's just the iteration pattern, doing nothing but linearly traversing and keeping a context.

data Ctx b = Ctx { ctxStack :: [[Form b]], ctxCode :: [Form b] }

zip_next (Ctx (fs:fss) [])              = zip_next (Ctx fss fs)        -- pop context
zip_next (Ctx [] [])                    = ()                           -- end
zip_next (Ctx fss ((LetPrim b):fs))     = zip_next (Ctx fss fs)        -- skip
zip_next (Ctx fss ((LetLoop i fs'):fs)) = zip_next (Ctx (fs:fss) fs')  -- push

Note that this does just the other part of the index generation. So what about changing that code to do full zipper/future annotation? I think I understand: perform a traversal in a state monad, and annotate each node with the zipper. This can probably be generalized to do all kinds of things. Ok, so generalize annotate to monadic form and remove all explicit passing:

annotate' :: Monad m => Program b -> m (Program ((), b))
annotate' (Program fs) = fmap Program $ forms fs where
  forms fs = traverse form fs
  form (LetPrim b)    = return $ LetPrim ((), b)
  form (LetLoop i fs) = fmap (LetLoop i) $ forms fs

Now m can be made to be the state monad. EDIT: I have it split up, but this really feels like doing double work. It actually is, because there is an actual recursion and the updating of a datatype that represents the recursion.

Entry: Escape analysis
Date: Fri May 17 08:33:03 EDT 2019

See LTA.hs. I now have:

escapes :: Context Let -> Array -> Bool

So the coordinate transformation should be straightforward to implement. If an array does not escape, remove all loop indices. I'm not actually sure that is the whole picture, so let's implement it first. Ok, the basic idea works! But one thing I missed: it is not just a local substitution, because all references need to be substituted. EDIT: Doing it in two steps: make a list of intermediate (non-escaping) variables, and use it to perform a substitution.

-- LTA
i: j: C   <- Aij Bij
      Dij <- C Aij
i: j: E   <- Aij Dij

Together with fusion, this is the basic thing I need. I don't think that hoistable things would show up in generated code, and interchange, well, I don't really see a way to add a cost function. It's time to start making some examples that use intermediate arrays, which is the whole point of this. This will only make sense when there are folds involved, because otherwise it would be possible to fuse. So I need another operator that uses "triangular access" and "feedback access", and show that this cannot be fused. These two seem different. Do "feedback access" first.

Entry: Different intermediates
Date: Fri May 17 17:05:06 EDT 2019

In RAI, all intermediates were local and did not survive to the next loop. I remember this being a problem for the FDN implementation. EDIT: I believe it was the (sparse) matrix multiplication. I think the central point is to have one loop compute something, and have another loop use that value. The problem in RAI is that multi-pass structures are not supported. So start with something simple. Multipass is necessary when there is some kind of global -> local data dependency. Anything goes really, so let's pick vector normalization: sum squares -> 1/sqrt^2 -> scale. The LTA language cannot yet express accumulators.
i: Ai <- A(i-1) + Bi

Can I express something that would need multipass without using accumulators? It doesn't seem so. Accumulation is key. Some remarks:
- Any kind of triangle feedback would be allowed in a first iteration. Only the regular ones should be replaced by accumulator variables.
- This "allowable range" idea should be extended to generic grids as well: if a grid was computed in a previous loop, it can be used entirely. Otherwise only the local part is accessible. Keep this open and add it once an example pops up.

Entry: accumulation
Date: Fri May 17 17:20:16 EDT 2019

i: Ai <- A(i-1) + Bi

This notation abstracts the need to initialize the accumulator before the loop starts. Representing accumulators is important, so maybe it is possible to generalize:

i: A_i <- A_f(i) + B_i

Then based on what A_f(i) is, we can implement it as a single accumulator, a finite set of accumulators (a shift register), or a full array. So that is all fluff. Let's use i' as the previous index, with implicit initialization. Actually I don't have any way to use the two types of references: the previous accumulator value, and the final value after the loop.

i: A  <- A Bi
i: Di <- A Ci

So it appears that accumulators need some special notation. To make it work for triangle patterns, use the triangle notation:

i: Ai <- Ai' Bi
i: Di <- AI Ci

Where I is the last element that was stored in the array in its constructing loop.

Entry: LLVM loop optimizations
Date: Fri May 17 19:14:40 EDT 2019

https://www.youtube.com/watch?v=QpvZt9w-Jik
- Tensor Comprehensions https://research.fb.com/announcing-tensor-comprehensions/
- Halide https://halide-lang.org/

Entry: Accumulators
Date: Sat May 18 08:41:23 EDT 2019

So it's already established that:
- all accumulator references are "full triangle".
- replacing triangles with single (or multiple) accumulators is an optimization that can be derived directly from the usage pattern, e.g. if 1) inside the loop only the last acc value is referenced and 2) outside the loop only the last element is referenced.
So the only necessary bit here is to make them representable. Instead of transforming indices, transform the arrays. This popped out through a notational shortcut. There are a couple of inconsistencies. During construction, there is a loop index. However, after construction is finished (e.g. in the next loop, or when just using an input array), it is not clear how to refer to particular elements in the arrays. What is clear is that arrays have definite types. Some rules:
- iteration ranges are always derived from the dimensions of the arrays that are being constructed in a loop. All of these arrays are necessarily the same size.
- they are not necessarily related to arrays that are used as input to a loop. This needs to be represented somehow.
- some dimensions are only there in a virtual sense.
What about leaving all the indices as they are in the code, but keeping track of which dimensions are actually implemented, based on the access patterns? Basically, abstract the array access. This keeps the original semantics. So the next step is to represent abstract array access. There are two "stages":
- array is defined: random access is allowed
- array is being defined: only back-referencing access is allowed
( Note: it is important to be able to nest triangles, i.e. one loop builds up an array one element at a time, while allowing a sub-loop to run over all the elements that have been computed up to that time. )
Based on the kind of access, the arrays can be thinly provisioned.
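A small sketch pinning down the two stages as Haskell types (hypothetical names, just to fix the semantics):

-- While an array is being defined at index n, only elements below n
-- (the triangle) may be read; once defined, access is random.
newtype Defined a = Defined (Int -> a)

data Building a = Building
  { current :: Int        -- next index to be written
  , sofar   :: Int -> a   -- valid only below 'current'
  }

backref :: Building a -> Int -> Maybe a
backref (Building n f) i
  | i >= 0 && i < n = Just (f i)   -- inside the triangle
  | otherwise       = Nothing     -- not yet constructed

-- Finishing a loop turns the partial array into a total one.
freeze :: Building a -> Defined a
freeze (Building _ f) = Defined f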
Entry: Triangles
Date: Sat May 18 09:53:49 EDT 2019

This idea really attached itself. Why is it so important? Because it is a _physical_ property that cannot be disputed. I.e. it is not a leaky abstraction: an algorithm running on a CPU is necessarily sequential, but while going through the steps, it can refer to all its input, and all its output. Capturing this in the core language is very important.

Entry: Fix the representation of definition and use
Date: Sat May 18 09:59:39 EDT 2019

The point is to create a model of the dynamic execution of the combination of sequencing (re-using) and looping (construction). At any point in the dynamic evolution, it should be 100% clear how the access patterns work. Summarized: array construction should be self-referential. Note that this does not allow for mutation. The model remains purely functional. So in a strong sense there is a limit. Maybe this can be extended later on. For now, let's stick to this idea because it allows mapping to functional languages. Also it keeps the writable and readable sections separate, which might be useful later on.

Entry: Ranges : loop indices are really just sizes
Date: Sat May 18 10:06:00 EDT 2019

What about this: each array has an associated defined range. Once a full loop has executed, this defined range is the size, and is an explicitly defined parameter that has the same standing as any other loop index. I.e. A uses index i during definition:

i: Ai <- Bi

But once the loop has executed, i could be left to represent the size of the array. This unifies two ideas: access to the size of an array, and the "current index" during array construction. So the index parameter is really a size parameter. Let this sink in for a bit. One thing can already be changed: loop indices should not be re-used, as they retain a value after a loop has finished. E.g. the following is valid: the first loop defines the accumulator (i' is i-1), the second loop uses its last value.

-- LTA
i: Ai <- Ai' Bi
j: Dj <- Ai Cj

Entry: Next
Date: Sat May 18 10:19:07 EDT 2019

So there is a representation that likely needs a couple more annotations, but should be able to cover things. Start generalizing it to something that will perform elimination while allowing accumulator reuse. E.g. one thing is definitely the use of multi-dimensional accumulators:

i: j: Aij <- Ai'j Bij
k: Dk <- Aik Ck

That might be the most basic example that is actually useful, to figure out how to optimize using only two optimizations: intermediate elimination for:
- independent references
- accumulators
The first one is already implemented: a complete dimension can be eliminated if the array is deemed intermediate. The second can be done as an extension: if the definition only uses backreference, and the use only references the last element, the dimension can be collapsed into a single value. Implementing the machinery for that will likely expose possibilities to generalize. So. In the example above, there is only the accumulation dimension that can be eliminated. How to write a matcher for that?

Entry: accu matcher
Date: Sun May 19 08:25:08 EDT 2019

- only backreference in defining loop
- only last reference outside
Should be straightforward. However, this needs to be tracked for each dimension separately.
The information that needs to be tracked is whether that dimension:
- is a local variable
- is an escaping accumulator
To tackle this, first fix escape analysis to return a data structure.

Entry: I'm going to need register allocation as well
Date: Sun May 19 08:50:35 EDT 2019

If an array is used as an intermediate value between two loops but no longer used after that, it should probably be reused. This is similar to what Pd does. In fact, the LTA should be able to represent the "block based" approach just fine.

Entry: How to modify escape analysis?
Date: Sun May 19 09:11:26 EDT 2019

I'm not seeing things clearly this morning. Maybe not a day to make changes.

Entry: grounding
Date: Sun May 19 12:51:22 EDT 2019

Make it practical first. Create the FDN in a language that can actually render to C, then create the mapping.

Entry:
Date: Sun May 19 16:44:04 EDT 2019

E.g. i,j: if it doesn't escape the inner loop, all the indices can be removed. If it does escape j but not i, j can .... ( I'm looking at this upside down. ) There is the case where the inner escapes, but the outer doesn't:

i: j: Bij <- Aij
   k: Cij <- Bik
... (no reference of B)

In this case, the dimension associated to i can be removed:

i: j: Bj <- Aij
   k: Cij <- Bk

It's time to start collecting all these special cases that pin down semantics.

Entry: Revisit
Date: Sun May 19 19:33:02 EDT 2019

Starting out with the original "zipfold", the current idea is similar, but just represented differently, and also allowing for some more freedom (triangle feedback). Maybe it's time to start working on a representation that can then use standard map and fold to end up with a loop? That can be done in LTA already. It will give more of an idea of how things are formed. Then continue with single-dim eliminate and accumulator detection.

Entry: A monadic test language
Date: Mon May 20 06:40:52 EDT 2019

p a b = do
  c <- op [a, b]
  d <- op [a, c]
  return [d]

Currently there is nothing to represent return, which I knew already. The most appropriate structure is the state continuation monad. Maybe this time, write it as a transformer? The question is then, what is the order of transformation? Alright... Because of the lack of nesting in the previous Seq and PRU languages, I've been able to avoid this one. I don't remember how it works, or where I have the code parked... I found something here: ~/darcs/meta/

sharing/monadic_sharing.hs
dspm/StateCont.hs

Let's just copy the latter. EDIT: Added Functor and Applicative instances, and created a basic 'op' primitive. Because this needs allocation, it's time to move from String to Int-indexed variables. Ok I remember: CPS is used just to be able to do the state threading. State is just the variable count.
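For reference, a minimal reconstruction of that state-continuation shape (an assumed sketch, not the actual dspm/StateCont.hs code): the state is the variable counter plus collected bindings, and CPS threads it through 'op'.

{-# LANGUAGE RankNTypes #-}

newtype SC s t = SC { runSC :: forall r. s -> (s -> t -> r) -> r }

instance Functor (SC s) where
  fmap f (SC m) = SC $ \s k -> m s (\s' a -> k s' (f a))

instance Applicative (SC s) where
  pure a = SC $ \s k -> k s a
  SC mf <*> SC ma =
    SC $ \s k -> mf s (\s' f -> ma s' (\s'' a -> k s'' (f a)))

instance Monad (SC s) where
  SC m >>= f = SC $ \s k -> m s (\s' a -> runSC (f a) s' k)

-- State: next variable number, and bindings in reverse order.
type Binding = (Int, [Int])        -- var := op(args), op name elided
type M = SC (Int, [Binding])

op :: [Int] -> M Int
op args = SC $ \(n, bs) k -> k (n + 1, (n, args) : bs) n

compile :: M [Int] -> ([Int], [Binding])
compile (SC m) = m (0, []) (\(_, bs) outs -> (outs, reverse bs))

-- The p above, with pretend input nodes already numbered:
test :: ([Int], [Binding])
test = compile $ do
  let a = 100
      b = 101
  c <- op [a, b]
  d <- op [a, c]
  return [d]
-- => ([1], [(0,[100,101]), (1,[100,0])])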
Entry: All .hs code in ~/darcs/meta Date: Mon May 20 07:04:44 EDT 2019 tom@panda:~/darcs/meta$ find -name '*.hs' ./Applicative/ApplicativeAlgebra.hs ./Applicative/ApplicativeNum.hs ./Applicative/Doodle.hs ./ArrowStack/ArrowStack.hs ./ai/Ai.hs ./ai/Armax.hs ./ai/Block.hs ./ai/Cas.hs ./ai/Connect.hs ./ai/DFL.hs ./ai/Flatten.hs ./ai/FlattenMonad.hs ./ai/Function.hs ./ai/Indexed.hs ./ai/Plot.hs ./ai/Procedure.hs ./ai/RatFunc.hs ./ai/SSA.hs ./ai/Shared.hs ./ai/Term.hs ./ai/test-Ai.hs ./ai/testFlatten.hs ./atom/scratch.hs ./closed_sm/ClosedSM.hs ./clua/old/CAnalyze.hs ./clua/BStruct.hs ./clua/CAnalyze.hs ./clua/PrintLua.hs ./clua/Setup.hs ./clua/clua.hs ./dspm/dist/build/autogen/Paths_dspm.hs ./dspm/SSM.hs ./dspm/0broken_Sys.hs ./dspm/0test_Loop.hs ./dspm/0test_Pd.hs ./dspm/0test_PrettyC.hs ./dspm/0test_TML.hs ./dspm/0test_integration.hs ./dspm/Array.hs ./dspm/Code.hs ./dspm/Control.hs ./dspm/Data.hs ./dspm/Lambda.hs ./dspm/LetRec.hs ./dspm/Lib.hs ./dspm/Pd.hs ./dspm/PrettyC.hs ./dspm/SArray.hs ./dspm/Struct.hs ./dspm/SysFold.hs ./dspm/Term.hs ./dspm/TermC.hs ./dspm/Type.hs ./dspm/Value.hs ./dspm/doodle.hs ./dspm/Sys.hs ./dspm/CSSM.hs ./dspm/Sys_.hs ./dspm/StateCont.hs ./haskell/asm/Logic.hs ./haskell/asm/TLEnv.hs ./haskell/doodle/TaggedList.hs ./haskell/doodle/affine.hs ./haskell/doodle/ainum.hs ./haskell/doodle/applicative.hs ./haskell/doodle/code.hs ./haskell/doodle/commterm.hs ./haskell/doodle/flatten.hs ./haskell/doodle/mapfold.hs ./haskell/doodle/memolet.hs ./haskell/doodle/nodes.hs ./haskell/doodle/onezero.hs ./haskell/doodle/sk.hs ./haskell/doodle/stack.hs ./haskell/doodle/stackoverflow.hs ./haskell/doodle/staged-in-unstaged.hs ./haskell/doodle/stream.hs ./haskell/doodle/tagless.hs ./haskell/doodle/tensor.hs ./haskell/doodle/typelist.hs ./haskell/exist_monad/exist.hs ./haskell/exist_monad/generality.hs ./haskell/exist_monad/iso.hs ./haskell/exist_monad/iso2.hs ./haskell/exist_monad/iso3.hs ./haskell/exist_monad/iso4.hs ./haskell/exist_monad/iso5.hs ./haskell/exist_monad/iso6.hs ./haskell/exist_monad/iso7.hs ./haskell/exist_monad/istream.hs ./haskell/old_Sym/Sym.hs ./haskell/old_Sym/SymAsm.hs ./haskell/old_Sym/SymEval.hs ./haskell/old_Sym/SymExpr.hs ./haskell/old_Sym/SymLLVM.hs ./haskell/old_Sym/SymTest.hs ./haskell/ssm/SigApp.hs ./haskell/ssm/SigBind.hs ./haskell/ssm/SigJoin.hs ./haskell/ssm/StateSpace.hs ./haskell/SigOp.hs ./llvm/llvm.hs ./pulse/FFT.hs ./pulse/constant.hs ./pulse/pulse.hs ./sharing/monadic_sharing.hs ./sharing/sharing_problem.hs ./sm/sm_test.hs ./staapl/staapl.hs ./z/Z.hs ./siso/Data.hs ./siso/dist/build/autogen/Paths_siso.hs ./siso/Vec.hs ./siso/Signal.hs ./siso/StateCont.hs ./siso/Eval.hs ./siso/Code.hs ./siso/RSignal.hs ./siso/Type.hs ./siso/Lib.hs ./siso/RSig.hs ./siso/Signal_r.hs ./siso/Test.hs ./siso/llvm-general/Shake.hs ./siso/llvm-general/llvm-general-pure/Setup.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/AddrSpace.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Attribute.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/COMDAT.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/CallingConvention.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Constant.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/DLL.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/DataLayout.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Float.hs 
./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/FloatingPointPredicate.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/FunctionAttribute.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Global.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/InlineAssembly.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Instruction.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/IntegerPredicate.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Linkage.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Name.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Operand.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/ParameterAttribute.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/RMWOperation.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/ThreadLocalStorage.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Type.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Visibility.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/DataLayout.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/Internal/PrettyPrint.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/Prelude.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/PrettyPrint.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/TH.hs ./siso/llvm-general/llvm-general-pure/test/LLVM/General/Test/DataLayout.hs ./siso/llvm-general/llvm-general-pure/test/LLVM/General/Test/PrettyPrint.hs ./siso/llvm-general/llvm-general-pure/test/LLVM/General/Test/Tests.hs ./siso/llvm-general/llvm-general-pure/test/Test.hs ./siso/llvm-general/llvm-general-pure/dist/dist-sandbox-ae49e09/build/autogen/Paths_llvm_general_pure.hs ./siso/llvm-general/llvm-general/Setup.hs ./siso/llvm-general/llvm-general/src/Control/Monad/AnyCont.hs ./siso/llvm-general/llvm-general/src/Control/Monad/AnyCont/Class.hs ./siso/llvm-general/llvm-general/src/Control/Monad/Trans/AnyCont.hs ./siso/llvm-general/llvm-general/src/LLVM/General.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Analysis.hs ./siso/llvm-general/llvm-general/src/LLVM/General/CodeGenOpt.hs ./siso/llvm-general/llvm-general/src/LLVM/General/CodeModel.hs ./siso/llvm-general/llvm-general/src/LLVM/General/CommandLine.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Context.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Diagnostic.hs ./siso/llvm-general/llvm-general/src/LLVM/General/ExecutionEngine.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Analysis.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Atomicity.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Attribute.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/BasicBlock.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/CallingConvention.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Coding.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/CommandLine.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Constant.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Context.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/DataLayout.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/DecodeAST.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Diagnostic.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/EncodeAST.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/ExecutionEngine.hs 
./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Analysis.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Assembly.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Attribute.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/BasicBlock.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/BinaryOperator.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Bitcode.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Builder.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/ByteRangeCallback.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Cleanup.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/CommandLine.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Constant.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Context.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/DataLayout.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/ExecutionEngine.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Function.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/GlobalAlias.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/GlobalValue.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/GlobalVariable.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/InlineAssembly.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Instruction.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Iterate.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/LibFunc.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/MemoryBuffer.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Metadata.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Module.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/PassManager.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/PtrHierarchy.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/RawOStream.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/SMDiagnostic.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Target.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Threading.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Transforms.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Type.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/User.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Value.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FastMathFlags.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FloatingPointPredicate.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Function.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Global.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Inject.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/InlineAssembly.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Instruction.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/InstructionDefs.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/IntegerPredicate.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/MemoryBuffer.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Metadata.hs 
./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Module.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Operand.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/PassManager.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/RMWOperation.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/RawOStream.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/String.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/TailCallKind.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Target.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Threading.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Type.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Value.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Module.hs ./siso/llvm-general/llvm-general/src/LLVM/General/PassManager.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Relocation.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Target.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Target/LibraryFunction.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Target/Options.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Threading.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Transforms.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Analysis.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/CallingConvention.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Constants.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/DataLayout.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/ExecutionEngine.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Global.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/InlineAssembly.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Instructions.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Instrumentation.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Linking.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Metadata.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Module.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Optimization.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Support.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Target.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Tests.hs ./siso/llvm-general/llvm-general/test/Test.hs ./siso/llvm-general/llvm-general/dist/dist-sandbox-ae49e09/setup/setup.hs ./siso/llvm-general/llvm-general/dist/dist-sandbox-ae49e09/build/autogen/Paths_llvm_general.hs ./siso/llvm-general/llvm-general/dist/dist-sandbox-ae49e09/build/LLVM/General/Internal/LibraryFunction.hs ./siso/llvm-general/llvm-general/dist/dist-sandbox-ae49e09/build/LLVM/General/Internal/FFI/InstructionDefs.hs ./siso/llvm-general/llvm-general/dist/dist-sandbox-ae49e09/build/LLVM/General/Internal/FFI/LLVMCTypes.hs ./siso/Gen.hs ./siso/llvm-tutorial-standalone/src/Codegen.hs ./siso/llvm-tutorial-standalone/src/FFI.hs ./siso/llvm-tutorial-standalone/src/JIT.hs ./siso/llvm-tutorial-standalone/src/Main.hs Entry: State Continuation Date: Mon May 20 08:49:35 EDT 2019 Memory is coming back, painfully :) Cliche, but this is more difficult than I thought it would be. It's near obvious what it _should_ be when going through the motions, but performing the assembly requires a lot of small refinement steps. So let insertion works. Next is loop nesting. This will require some modification. loop $ \i -> do ... 
I think the state continuation monad is too simple: it does not return the final value of the state, and it appears as if an entire computation needs to be run as a subprogram. I wonder if the trick is to just embed the state in the return value of the expression? No, this is actually not possible because "return" cannot escape the context. So it really needs to be part of the type. Let's look at the original paper to see what the type is. This was a problem before:

-- Note: state will be forked! The Let insertion mechanism doesn't
-- allow recovery of state.
mBlock :: MCode (Code t') -> MCode (Code t)
mBlock sub = SC main where
  main s k = k s $ Code $ (subTerm s sub)

Ok, so I need a version of the SC monad that threads state through the entire computation. Maybe that isn't possible? Actually it is possible to get the state by just tagging it, but it will dump it deeply nested inside the data structure. I'm missing an essential insight. So let's implement it first with a state fork, then see what would be necessary to patch it up.

Entry: Pause
Date: Mon May 20 10:59:15 EDT 2019

It was very important to do the monadic language, because it is already quite clear there is something not right with the way bindings and loops interact. About the SC monad: maybe it just isn't the right abstraction? A free monad might work better. Maybe let insertion is just different: because of scoping rules, the forking is not an issue. So there are a couple of conclusions:
- Each sequence of bindings has a clear return value.
- A sequence of loops should also return something. Currently that can't be expressed.
I need to find a bridge between values and assignments. The return value of a sequence of primitives would be a list of cells. The return value of a loop is an array slice. prim / loop. The conceptual error is that a loop binds an array slice, just like a primitive binds a cell. So the IR tree is structured in a bad way. Fixing that is a big change.

A Form is Let | Ret
A Binding is either a prim or a loop

I do wonder though, for all of this, whether the transformations are really necessary, because the language will be able to represent fused functions. Split it off in a separate structure, or fix? There is a strong tension between:
- nesting data structures
- adding an index
Note that loop will need to be a "map style" operation. It needs to map something that is scalar to something that is vector, and vector to matrix, etc..

Entry: State Continuation Threading
Date: Mon May 20 12:09:49 EDT 2019

Something I still do not understand is why state can't bubble upwards. No, it's really not possible, because of the nested construction. The way to get it is to put it in the data structure and fish it out. Woah, I'm really stuck at this... I think I need to read the original paper. Because if this doesn't work, then there is no point in using anything but nested state monads, without the whole continuation business. So: give up. Use a more traditional writer/reader/state stack, and re-thread on every block.

Entry: State + Reader
Date: Tue May 21 08:03:20 EDT 2019

So kick out the let-insertion monad. It's a neat trick, and it works if state forks are not an issue. For my use case however, a more plain environment + state approach seems appropriate. To collect bindings, should I use a writer? Or should the bindings go in the state? EDIT: This works a lot better. loop can now explicitly run the inner forms, threading environment and state.
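A minimal sketch of that reader + state shape, with hypothetical simplified types (environment = loop indices in scope, state = fresh counter plus collected forms):

import Control.Monad.Reader
import Control.Monad.State

type Index = String
data Form = LetPrim Int String [Int]   -- var := op(args)
          | LetLoop Index [Form]
          deriving Show

type M = ReaderT [Index] (State (Int, [Form]))

op :: String -> [Int] -> M Int
op name args = do
  (n, fs) <- get
  put (n + 1, LetPrim n name args : fs)
  return n

-- 'loop' explicitly runs the inner forms: it swaps in a fresh
-- binding collector, runs the body under an extended environment,
-- and re-threads the state afterwards.
loop :: Index -> M a -> M a
loop i body = do
  (n, outer) <- get
  put (n, [])
  a <- local (i :) body
  (n', inner) <- get
  put (n', LetLoop i (reverse inner) : outer)
  return a

run :: M a -> (a, [Form])
run m = let (a, (_, fs)) = runState (runReaderT m []) (0, [])
        in (a, reverse fs)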
Entry: Indices
Date: Tue May 21 09:04:22 EDT 2019

Binding is easy: it always uses the current environment. Reference is more general. The core issue is in Let: binding and reference need to be split there. Ok, so I have something halfway meaningful. Below:
- i,j: input dimensions
- k,l loop: refs to ij are arbitrary, possibly a function of k,l
- m,n loop: refs to ijkl are arbitrary, possibly a function of m,n

-- LTA
k: l: Ckl <- Aij Bij
      Dkl <- Aij Ckl
m: n: Emn <- Dkl
      Fmn <- Aij Emn

Next: make the distinction between local loop variables and size variables apparent. Maybe track the dimensions in an environment through the compilation? Additionally, it should be impossible to make array dimension mistakes.

Entry: LLVM diff
Date: Tue May 21 10:22:23 EDT 2019

I've deleted the Language.EDSP.LLVM module, but it contained some valuable findings about how to map between a simple monadic language and LLVM. Pasted diff here. See asm_tools b0e69809a161f59da345bc8bab131c5b8b999564

diff --git a/asm-tools-axbc/Language/AXBC/LLVM.hs b/asm-tools-axbc/Language/AXBC/LLVM.hs
deleted file mode 100644
index 1964733..0000000
--- a/asm-tools-axbc/Language/AXBC/LLVM.hs
+++ /dev/null
@@ -1,80 +0,0 @@
--- TODO: This will need some focus on the structure of the monads. I
--- want to have my own stack to be able to implement some EDSP
--- language context. It seems that what is needed is to mix it with
--- the IRBuilderT transformer.
-
-{-# LANGUAGE CPP, OverloadedStrings #-}
-{-# LANGUAGE OverloadedStrings #-}
-{-# LANGUAGE RecursiveDo #-}
-{-# LANGUAGE DeriveFunctor #-}
-{-# LANGUAGE GeneralizedNewtypeDeriving #-}
-
-
-module Language.EDSP.LLVM where
-
-
-
--- LLVM
-import Data.Text.Lazy(Text)
-import Data.Text.Lazy.IO as T
-import Data.Text.Lazy.Encoding
-
-import LLVM.Pretty -- from the llvm-hs-pretty package
-import LLVM.AST hiding (function)
-import LLVM.AST.Type as AST
-import qualified LLVM.AST.Float as F
-import qualified LLVM.AST.Constant as C
-import LLVM.IRBuilder.Module
-import LLVM.IRBuilder.Monad
-import LLVM.IRBuilder.Instruction as I
-
-
--- EDSP
-import Language.EDSP
-
--- Implementation
-import Control.Monad.State
-
--- Start from the llvm_simple example. Create a tagless final wrapper
--- as soon as possible. It's going to be necessary.
-
--- Create the monad. For now it can just be the LLVM monad.
-
--- Split up in the different parts, exposing only the "pure" core on
--- the inside.
-
-llvm_simple :: Text
-llvm_simple = ppllvm $ llvm_module
-
-llvm_module :: Module
-llvm_module = buildModule "exampleModule" $ llvm_function
-
-llvm_function :: ModuleBuilder Operand
-llvm_function = mdo
-  function "add" [(i32, "a"), (i32, "b")] i32 llvm_entry
-
-llvm_entry :: Monad m => [Operand] -> IRBuilderT m ()
-llvm_entry [a, b] = mdo
-  entry <- block `named` "entry"; do
-    c <- llvm_pure [a, b]
-    ret c
-
--- The bridge to the LLVM monadic representation is the class
--- MonadIRBuilder, but it seems we can make it a little less abstract
--- by focusing only on.
-llvm_pure :: MonadIRBuilder m => [Operand] -> m Operand
-llvm_pure [a, b] = do
-  c <- I.add a b
-  return c
-
-
--- Use a custom monad to compile to LLVM.
-newtype M t = M { unM :: State String t } deriving
-  (Functor, Applicative, Monad)
--- instance MonadIRBuilder M where
-
-
-
-run = do
-  T.writeFile "/tmp/test.ll" $ llvm_simple
-  T.putStrLn $ llvm_simple

Entry: Next?
Date: Wed May 22 22:53:28 EDT 2019

Give a proper semantics to indices. EDIT: There is Ref and Def. Make that visible in the printout.
EDIT: What I want is a variable that contains a size, but also the static notion of size. There are two cases:

- during definition, size is a function of the loop indices
- after definition, size is a fixed number known at compile time

How to separate those phases clearly? Let's give the latter a capital letter in the print rep. OK: once a loop is finished, the array's size is static. EDIT: This feels like stuckness. I'm focusing on a detail that does not matter. Basically, if all "defining" indices are the same, whenever an array is referenced using anything other than its defining indices, it will be complete.

Entry: What do I need?
Date: Thu May 23 10:02:17 EDT 2019

I need this language to actually generate code! So maybe focus on that first. Make a code generator, and just have it be inefficient in the first iteration. Things will become more apparent once there are concrete examples to constrain the general pattern.

Entry: Move forward
Date: Sat May 25 08:33:26 EDT 2019

I'm a bit stuck on the way indices and sizes are represented. So what is an index? It always represents the coordinates of cells that are bound in the current loop block.

Entry: References
Date: Mon May 27 08:52:05 EDT 2019

I guess it's just hard, because it's not just falling out. Maybe change the monadic form first to support references. Trying this:

  p a b = do
    [d] <- loop $ \i -> do
      loop $ \j -> do
        c <- op "mul" [a i j, b i j]
        d <- op "mul" [a i j, c i j]
        return [d]
    loop $ \i -> do
      loop $ \j -> do
        c <- op "mul" [d i j]
        d <- op "mul" [a i j, c i j]
        return [d]

Here references are always explicit, and definitions are always implicit. Basically what this means:

- assignment here is (incremental) definition of a finite function
- since we know the dependency graph of computations leading from constants and local indices to array references, we should be able to create a proof that the references are valid.

Maybe the real challenge here is to define what the binding means?

  c <- op _ $ d i j

There are two levels to look at this:

- element-wise: c(i,j) is made of tx of d(i,j), and we're simply not writing down the indices at the LHS.
- whole array: c is made of tx of d as a whole, and there are some restrictions on how i and j can be used.

EDIT: This looks OK, but how to express it such that type inference works?

Entry: Bounds checking
Date: Mon May 27 12:16:38 EDT 2019

Also, if the referencing computation cannot be performed at compile time, some kind of mechanism needs to be inserted to ensure that errors get caught. E.g. generate code that has bounds checking, so that during quickcheck these asserts can be included; for production code they can be left out. I.e. if there is no actual proof, there could be a statistical proof. This way the prover could be written incrementally.

Entry: Next
Date: Tue May 28 09:12:40 EDT 2019

I will need to dedicate some bulk time to work through this. Fragmented attention won't cut it.

Entry: c <- a i j
Date: Tue May 28 09:30:43 EDT 2019

Doesn't work, because <- will have different types. I'm looking for a morphism between:

- the element-wise assignment
- the abstract array operation

Entry: Am I looking at this the wrong way?
Date: Tue May 28 18:22:41 EDT 2019

Maybe it is enough to do manual fusion?

Entry: Next
Date: Wed May 29 11:02:12 EDT 2019

I'm stuck. Ideas aren't flowing. The problem is this:

  c <- op "mul" [a i j, b i j]

What I want instead is:

  c <- f (a,b,i,j)

There are two things to relate to each other:

- an array is derived from other arrays,
  though this by itself is not that important.
- an array is constructed element-wise

So in a sense, it really doesn't matter what is on the RHS. Those really are atomic values. The only thing that matters is that:

- "c <- _" denotes the definition of an array
- its dimensions are fully determined by the dimensions of the loops it is in.

So I could start out by creating a "constant definer", and annotate the types that way. Maybe the most important part is that what is returned in a loop is always a function. Currying is natural here. The problem is to define the type of loop. This needs to be polymorphic. Something like

  loop :: (Index -> M [Index -> t]) -> M [Index -> t]

'M' needs to be on the outside: it is a representation of an array. Yeah, I really don't have the intuition for this. No guiding principle. I'm stuck. EDIT: Not quite sure why. The initial analysis phase was simple. Synthesis and type encoding seem to not go very well. It could be that I'm just too tired and dull to see the path.

Entry: It's time to crack the nut. What is loop?
Date: Thu May 30 16:31:11 EDT 2019

  loop :: (Index -> M [Index -> t]) -> M [Index -> t]

t can be Index -> t' or Atom. If arrays are represented as functions, then what is inside a loop construct is also a function.

  c <- op2 a b

I keep going back to these two views: the entire array vs. operations on atomic values.

Entry: Can it just be a functor?
Date: Thu May 30 16:46:23 EDT 2019

Until that is resolved, there is not a good way to work with this. Maybe take the view that these arrays are Functors, such that any operations on elements can be mapped over arrays? When in doubt, make it a functor... Start there. That should bring us full circle: back to Feldspar. Loops then come from fmap. This can then be generalized to fold. Any other context-aware operations will be generalizations of that. Do I just need to start from scratch, ensuring a proper interface? Is traverse the same as map, but for "monadic primitives"?

  map:      (a -> b)   -> t a -> t b
  traverse: (a -> f b) -> t a -> f (t b)

Yes. So next: build this from the ground up. Start from traverse, then add all the extensions such as parameterized indices.

  traverse: (a -> M b) -> A a -> M (A b)

where M is the compilation/interpretation monad, and A is the array representation, e.g.

  type A = (Index ->)

Then generalize to expose the index (and size):

  traverse1: (Index -> a -> M b)         -> A a -> M (A b)
  traverse2: (Index -> Size -> a -> M b) -> A a -> M (A b)

This is really a different view from not expressing inputs explicitly. Maybe that is the core of the issue? I've been focusing on construction, with any inputs abstract, which removes the 'a' parameter:

  loop: (Index -> Size -> M b) -> M (A b)

Entry: feldspar
Date: Thu May 30 17:01:40 EDT 2019

I don't think there is a whole lot to be found there. Let's just stick to the current setting.

Entry: traverse vs. construction?
Date: Fri May 31 08:46:21 EDT 2019

It is really about the difference between these two:

  traverse: (Index -> Size -> a -> M b) -> A a -> M (A b)
  loop:     (Index -> Size -> M b)      -> M (A b)

The reason to go for the latter is that only construction is element-wise. Reference can be random access, with some restrictions for feedback configurations. This is the MAIN IDEA.

Entry: loop :: (i -> M t) -> M (A t)
Date: Sat Jun 1 15:45:50 EDT 2019

Literally: transform a loop body that produces elements into an array. This can't be too hard. EDIT: Types work out. Arrays are constructed one dimension at a time.
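A literal, interpreter-style reading of that type checks out. The sketch below adds an explicit size argument and takes A to be a function from Index; these choices are mine, for illustration only:

  type Index = Int
  newtype A t = A { at :: Index -> t }   -- array as a finite function

  -- Turn a loop body that produces elements into an array, one
  -- dimension per nesting level.  Bounds handling is elided.
  loop :: Monad m => Index -> (Index -> m t) -> m (A t)
  loop n body = do
    ts <- mapM body [0 .. n - 1]
    return $ A (ts !!)

  -- Nested use constructs a matrix; note the type m (A (A Int)).
  table :: Monad m => Index -> Index -> m (A (A Int))
  table n k = loop n $ \i -> loop k $ \j -> return (i * j)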
EDIT: So here is what I find out: I don't have intrinsic motivation to finish this. It is REALLY in the way of making progress, but my mind is on finding a new client. EDIT: OK, some ideas. Dimensionality is in the type, so it should be reflected in reference. EDIT: Conceptual problem? This whole thing of treating everything as a grid is not going to work. Because of the signature of loop, there will be a concept of "scalar".

Entry: Structure of the language
Date: Sun Jun 2 09:04:36 EDT 2019

So this is pretty much Seq + array construction and referencing. I don't think there is a whole lot to be added apart from that. How do I separate these? Maybe it's not necessary. The missing step was the separation of array dereferencing. Basically, a new grid would be created with the indices. Then all primitives are just mappings of n-aries. OK, another basic idea:

- referencing creates a new grid of indices. This factors out anything special in the access pattern.
- all other operations can be factored into an fmap / liftA2 / ..., making grid operations and scalar operations isomorphic.

Conclusion: EXPORT DEREFERENCING AS A PRIMITIVE

Entry: Cleaning up the language
Date: Sun Jun 2 09:27:14 EDT 2019

So there is still a bit of work to fit the data language to the monadic representation. First: dereferencing is always explicit. I think I need to start over. The LTA form is not what I'm looking for. EDIT: So I've split up LTA (which now has a simple data structure print statement to serve as an example), and Loop, which is built around being able to represent this thing:

  p :: ArrayZero t => (Array (Array t)) -> (Array (Array t)) -> M (Array (Array t))
  p a b = do
    d <- loop $ \i -> do
      loop $ \j -> do
        aij <- ref2 a i j
        bij <- ref2 b i j
        c   <- op2 "mul" aij bij
        d   <- op2 "mul" aij c
        return d
    loop $ \i -> do
      loop $ \j -> do
        dij <- ref2 d i j
        c   <- op2 "mul" dij dij
        e   <- op2 "mul" c c
        return e

I'm going to need a break. So there is a pattern: I often want a language with a particular control structure around an otherwise ordinary form. How to do that more efficiently? It looks like the inconsistency comes from the inability to represent the referencing properly in the data type. The final embedding has no issues with nested types, but the data type itself can't do that, so it likely needs to use a flat encoding. OK, I am just not seeing the big picture. There is a real issue in mixing partial application and loop nesting (on the embedding side) and the need for an uncurried representation in the main language. It seems best to treat arrays as a special kind of function that has "abstraction" (loop) and "application" (ref). Here's an idea:

- pretend that the language is higher order. Variables can contain partially applied array references. Those are just compile-time entities, because they will need to resolve to scalars when operations are involved.

OK, this is actual progress.

Entry: partial application of grids / grids as finite functions
Date: Sun Jun 2 16:00:24 EDT 2019

This seems to be the important idea to be able to manage nesting.

1. Nesting is necessary due to loops.
2. That then reflects into the data type as well.

So the basic language is a typed lambda calculus where abstraction is array construction, and application is array reference. (A small sketch of this grids-as-functions view follows below, after the next entry.)

Entry: Meta-programmed Erlang node
Date: Tue Jul 30 09:36:00 EDT 2019

I want something that behaves as an Erlang node, but is actually static state machines described by a Haskell-embedded language.
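Picking up the grids-as-finite-functions sketch promised above (one dimension, invented names): scalar operations lift pointwise through Functor/Applicative, and dereferencing is the one primitive that is special:

  import Control.Applicative (liftA2)

  newtype Grid t = Grid { unGrid :: Int -> t }

  instance Functor Grid where
    fmap f (Grid g) = Grid (f . g)

  instance Applicative Grid where
    pure              = Grid . const
    Grid f <*> Grid g = Grid (\i -> f i (g i))

  -- "Application": indexing a grid with a grid of indices yields a
  -- new grid.  This is the EXPORT DEREFERENCING AS A PRIMITIVE idea.
  deref :: Grid t -> Grid Int -> Grid t
  deref (Grid a) (Grid ix) = Grid (a . ix)

  -- Everything else is just lifted scalar ops.
  mul :: Num t => Grid t -> Grid t -> Grid t
  mul = liftA2 (*)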
Entry: Compositional state machines
Date: Tue Jul 30 09:37:20 EDT 2019

Essentially I want something that runs on an FPGA, where the compiler can figure out what to implement as state machines, and what to multiplex on a CPU or state machine execution engine. The basic substrate is a state machine. The basic decision is to time-multiplex or not.

Entry: Interface to Verilog
Date: Tue Aug 13 09:13:33 EDT 2019

1. Have yosys compile it to something that can run on the Haskell substrate.
2. Generate better Verilog code.

Entry: State machines
Date: Tue Aug 13 09:36:24 EDT 2019

I need a different way to express state machines. Maybe start looking into formal methods? EDIT: What is the real problem, actually? I like the Erlang transactional model: a message comes in and it changes state. The message can be anything, the state can be anything. Does this work for synchronous state machines as well? A message in this case is a clock edge. Its contents are all the inputs at the clock edge. That is not very useful. So where do these two differ, really? It seems that the problem with clocked state machines is that the input is everything and the state is everything. This is just too unstructured to say anything meaningful. So how to abstract it? It would be great to create logic state machines in the same terms. EDIT: I wonder if the missing link is just missing knowledge, and not so much bad integration of knowledge. Maybe have a look at TLA+:

https://www.apress.com/us/book/9781484238288
Practical TLA+: Planning Driven Development, Hillel Wayne

Entry: Make FPGAs and CPUs the same
Date: Wed Aug 14 17:41:46 EDT 2019

I need to find a way to translate high-level event-driven "transactional programs" into gateware. That would be quite revolutionary. This can be done by enable signals, or "transaction busy" signals. Prototypical examples are the 1-cycle "event" pulse, and things like SPI CS. Maybe look at other transactional problems in the circuit domain, e.g. busses. Busses tend to evolve into packet systems, essentially. The problem isn't that this is hard to encode, but that there are many, many different encodings that all have different efficiency tradeoffs. I guess there are many interpretations of:

- receive message
- update state

Entry: state should be local in both place and time
Date: Wed Aug 14 18:47:20 EDT 2019

- Use a lot of encapsulation: compose state machines based on the protocols they receive and produce, not on what state they are in.
- Keep state short-lived, i.e. reset to a known state often and predictably. (Rationale: long-lived state "residue" is what makes things hard to test.)

These two rules already eliminate a lot of issues. They treat state as an implementation detail that is not observable at higher protocol units. (Cfr. implementing an FP language in an IP language.) Try to design in protocols, data flow. Might be specific to my applications, but it seems to be the way to go. Protocol-oriented programming: a protocol and the state machine that parses / generates it are two sides of the same coin. Protocols are state machine traces. Maybe related: Quviq QuickCheck state machine analysis? Is there a Haskell variant that can do state machines?

http://hackage.haskell.org/package/quickcheck-state-machine

Entry: Let's start with a state machine synthesizer
Date: Wed Aug 14 18:56:31 EDT 2019

- how does encoding actually affect gate usage in FPGAs?
- see how yosys does state machine transformations
- can I make state machines abstract and convert them to states directly? I.e. can states be fully abstract? (see the sketch below)
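On that last point, a minimal sketch of the direction, with invented state names: write the transition function against an abstract sum type, and treat the bit-level encoding as a separate, swappable map.

  import Data.Bits (shiftL)

  -- Abstract state space: the machine logic never mentions bits.
  data St = Idle | Start | Shift | Stop
    deriving (Show, Eq, Enum, Bounded)

  step :: St -> Bool -> St
  step Idle  go   = if go then Start else Idle
  step Start _    = Shift
  step Shift done = if done then Stop else Shift
  step Stop  _    = Idle

  -- Encodings are a later choice; gate usage can be compared per map.
  binary, oneHot :: St -> Int
  binary = fromEnum
  oneHot = shiftL 1 . fromEnum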
Entry: Notes
Date: Wed Aug 14 20:40:51 EDT 2019

- Nothing needs to be bi-directional. That is a complete artefact of communication constraints (wires). Always think of everything as unidirectional.
- This also makes it functional: the output sequence is a pure function of the input sequence and the reset state.
- There has to be an equivalent of the Fourier/Laplace transform for boolean machines. Is there? Those transforms rely on linearity and the concept of exponentials or orthogonal functions. Maybe this isn't all that useful though...

Entry: The CPU attractor
Date: Thu Aug 15 18:44:14 EDT 2019

What is a better use of a state machine fabric than to implement a CPU in it? Why is linear / nested execution so useful? What about this: start by thinking of every machine as a CPU executing code. Then implement it by stripping away functionality. What about flattening programs into linear execution states? Maybe most of this is about hierarchy of state representation.

Entry: Protocol splitters
Date: Thu Aug 15 19:19:14 EDT 2019

Think a bit more about that. The basic idea is: a single packet comes in on one bus and gets split into multiple busses.

Entry: Transactions vs. Flip-Flops
Date: Sat Aug 17 09:10:33 EDT 2019

Wait a minute. I can create a yosys target to run C code. I just need to re-interpret what a flip-flop means, and what a critical path means. Basically, a FF is a transaction unit.

Entry: Pipelining is the real problem
Date: Sat Aug 17 09:12:56 EDT 2019

For feedforward systems, pipelining is always possible. So pipelining needs to be solved first.

Entry: The higher level language
Date: Sat Aug 17 09:14:55 EDT 2019

It is time to start inventing structured macros on top of this low-level language.

Entry: Common subexpressions
Date: Sat Aug 17 09:16:06 EDT 2019

So is it actually necessary to factor out a logic function? I think yosys ABC takes a global approach anyway. I.e. it would be fine to define boolean transition functions directly. Maybe what I need to do next is to make a layer that is pure and expresses only boolean functions. Then build state machines on top of that.

Entry: Summary of ideas
Date: Sat Aug 17 09:17:51 EDT 2019

- focus on the transition functions
- figure out how to auto-pipeline
- compiling to C is re-defining what a transition means (a sequential program will execute a transaction, instead of a logic update function).

The key is really in bridging these two worlds: expressing a machine such that it can be compiled to C and run slower, or be compiled to (pipelined) logic and run faster.

Entry: Figure out how yosys passes things to ABC
Date: Sat Aug 17 09:20:17 EDT 2019

The _key_ element is whether I can represent boolean functions as unstructured flat tables, or whether I need to represent them as DAGs.

Entry: FSM extraction and boolean function optimization
Date: Sat Aug 17 09:31:23 EDT 2019

If yosys does FSM extraction, that should be a good hint to represent states abstractly in an abstracted layer on top of Seq. I need an example. One of the state machines I want to create is an Ethernet to I2S or S/PDIF converter. Let's factor it into Ethernet to SPI and SPI to S/PDIF.

Entry: A generic data converter architecture
Date: Sat Aug 17 09:37:47 EDT 2019

Note that almost all code I write can be implemented as feedforward chained state machines with some buffers in between. E.g.:

  I -> b -> C1 -> b -> C2 -> b -> O

- where each b is just a dumb circular buffer
- Cx is a converter state machine (parser + printer)

What happens where is not important.
Some C's could be on microcontrollers implemented in C, others could be FPGAs. But the general idea is that _testing_ should be completely contained. Also, some buffers are not there at all. Also, buffers can be abstract, i.e. they can be made to contain tokens, not bytes. That way the representation can either make smart writes or smart reads.

Entry: State update representation
Date: Sat Aug 17 11:13:33 EDT 2019

Another thing that has been bothering me for a while: it is easier to use an imperative-style state update instead of a fully explicit constructor. This is a hard pill to swallow because it really brings forward the whole FP vs. IP debate. Sometimes one is better than the other, so maybe use some kind of lens-like API?

Entry: Try out ABC: give it a LUT
Date: Sat Aug 17 11:36:27 EDT 2019

So given RTL, represent the update functions as LUTs, push it into ABC and see if that initial LUT form has an effect on what comes out. What is the primitive here? An N->1 boolean function. Describing such a thing requires 2^N bits. But that isn't really the core representation, because it does not allow for sharing. ABC likely needs to be presented with N->M boolean functions, represented in some kind of logic gate format that does not cause table explosion. So I probably still want to hold on to some shared DAG, just to not explode the representation. Processing of such graphs is likely iterative, i.e. it needs a seed of _some_ structured/shared representation as an AIG before it can do optimizations in AIG space. A full LUT description seems too unstructured. Cfr. Karnaugh maps: the whole idea is to identify these islands == term pruning.

https://people.eecs.berkeley.edu/~alanmi/abc/abc.htm
https://en.wikipedia.org/wiki/And-inverter_graph

Entry: Coroutines
Date: Sun Aug 18 09:39:49 EDT 2019

Doing some manual buffer management, it seems a good idea to make this automatic. I.e. given a high-level coroutine structure, split things up such that flow control and buffer structure can be parameterized. They depend on each other quite a bit.

Entry: 8 instructions
Date: Sat Aug 24 23:48:19 EDT 2019

TTL logic kit: https://www.youtube.com/watch?v=_2uXqTi42LI
PDP8

Entry: Pipelining
Date: Fri Aug 30 08:16:52 EDT 2019

So it's quite clear: my problem with digital logic is the pipelining. There has to be a way to express things such that the feedforward part can be separated from:

- decoupling/pipelining delays
- simple state machines

It's because there are two elements here. I've already noticed that factoring state machines makes them easier to understand, as this allows concentration on stream processing. But pipelining delays are a real pain. It already starts with defining small state machines. Should the output be registered or not? The answer is that it depends. For high-speed logic it is usually yes. For low-speed, efficient use of gates, it is usually no, or mostly not until timing gets violated. A thing to keep in mind is to write systems in a way that makes it easy to add delays in the path.

Entry: Shared substrate
Date: Sun Sep 22 10:59:30 CEST 2019

It's an interesting problem: how to express a program in such a way that it can be mapped to time-sliced programs and parallel state machines. EDIT: This is a very important insight for building applications that can use resources properly. Separate data processing (dependencies) from the sequential/parallel implementation questions. It is too easy to get absorbed by the attractors associated with each programming paradigm.
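A minimal sketch of that separation, with all names invented: the dataflow graph records only dependencies, and "parallel" (combinational) vs. "sequential" (scheduled) are two interpretations of the same structure:

  import           Data.Map (Map, (!))
  import qualified Data.Map as Map

  -- A node is a pure function of other nodes; no evaluation order.
  data Node  = In String | Op ([Int] -> Int) [Int]
  type Graph = Map Int Node

  -- Parallel reading: a node's value is a pure function of the inputs.
  eval :: Map String Int -> Graph -> Int -> Int
  eval env g n = case g ! n of
    In name   -> env ! name
    Op f args -> f (map (eval env g) args)

  -- Sequential reading: execute nodes in id order into a "register
  -- file", assuming ids are already topologically sorted.
  run :: Map String Int -> Graph -> Map Int Int
  run env = Map.foldlWithKey step Map.empty
    where
      step regs i (In name)   = Map.insert i (env ! name) regs
      step regs i (Op f args) = Map.insert i (f (map (regs !) args)) regs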
Entry: Dataflow
Date: Fri Oct 18 07:50:27 EDT 2019

Thinking in dataflow is still quite difficult for me, while thinking in sequential programs operating on buffers is second nature. Maybe that is the problem that needs to be solved first: write systems in that "CPU form", then gradually work towards factoring out steps that can go into hardware. I need to model the _reduction_ that happens there. So let's spell it out.

- all problems (for now) are modeled as single input to single output data streams.
- streams get "parsed" into some internal state, then "printed" out again.
- splitting the problem up into buffers + sequential packet processing code is a natural thing to do (at least for a programmer).
- so start out that way, and gradually eliminate the need for general purpose CPUs, by:
  - simplifying the high speed parts so buffers can be eliminated, and the processing cores can be simple state machines.
  - leaving the low speed parts to be handled by a single CPU executing an event loop

Most of this is known stuff, apart from the "simplify to non-buffered stream processor". The transition there is very large, as it consists of:

- eliminating data memory accesses
- eliminating instruction sequencing

Is there a method that can be employed? E.g. write out the entire program as a sequence of data processing loops to expose the ones that need to be turned into hardware, then merge the remaining code into CPU code that operates on a minimal set of buffers. I wonder if there is some more elegant way to say that both are the same. The CPU is an artifact. It is NOT the most natural way to do things. The buffer is an artifact that is a side effect of the TDM nature of a task-switching CPU. Basically, this representation is too biased. A CPU is essentially too powerful, such that not every algorithm written against it can be mapped back to a simpler architecture. A more restricted abstraction is necessary. Essentially, work backwards. What I want is to split the problem into:

1) An executable specification (no buffers, all parallel processing).
2) An explicit mapping from parallel to TDM, either to simple sequencers or to full-fledged CPUs and buffer memories.

How do you go from knowing that in principle this is possible, to finding a practical approach that can be implemented in a reasonable amount of time? I.e. solve the core subproblem first. I need a toy problem to tackle this. The typical one is a logic / protocol analyzer. Let's write that DHT11 driver with this idea in mind. And let's write it in Haskell right away.

Entry: Practically, what does that substrate look like?
Date: Fri Oct 18 08:27:18 EDT 2019

1. It has to be an event-based language with abstract events and abstract state. The program is (I,S) -> (O,S).
2. Blocking sequential program style will need to be implemented as a language preprocessing step. (Is this just async/await?)
3. Time has to be explicit. Do not rely on a fixed clock. If this is necessary, maybe add time stamps to the events instead.

The last one is an important constraint, because it is often implicit in synchronous systems, where cycle counters can be used as part of the state machine. However, this trick cannot be used on a uC: all timers need to be external inputs.

Entry: DHT11 state machine
Date: Sat Nov 9 05:55:01 EST 2019

I want a pure event-driven parser with two kinds of events: transitions and timeouts.
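A hedged sketch of that shape, before the protocol details below. The state names, the 50 us bit threshold, and the Action type are all invented for illustration, not the actual driver:

  data Level  = Lo | Hi deriving (Eq, Show)
  type Micros = Int
  data Event  = Edge Level Micros   -- transition + time since last edge
              | Timeout
  data Action = EmitBit Bool | Fail String deriving Show
  data St     = WaitRise | WaitFall deriving Show

  -- Pure: the host feeds events in, the machine answers with actions.
  step :: St -> Event -> (St, [Action])
  step _        Timeout     = (WaitRise, [Fail "timeout"])
  step WaitRise (Edge Hi _) = (WaitFall, [])                  -- pulse start
  step WaitFall (Edge Lo t) = (WaitRise, [EmitBit (t > 50)])  -- pulse width
  step s        _           = (s, [])                         -- ignore rest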
See electronics.txt.

MCU:
- assert 0
- wait 18000
- release (1)
- start receiver

DHT11:
- assert 0
- wait ACK:80 / BIT:80
- release (1)
- wait ACK:80 / BIT0:25 / BIT1:70

The end pulse should probably time out.

Events (interrupts):
- line change:
  - save state + elapsed time
  - reset timer
- timer:
  - disable timer

Now I want to write this in a high-level language such that I just need to fill in the platform-dependent details. Turn this around, because it is simpler to think of a data bit as:

- wait 25 / 70
- send 0
- wait 50
- send 1

This means the DHT init sequence is:

- send 0
- wait 80
- send 1
- wait 80
- send 0
- wait 50
- send 1

To parse:

- reset the timer at the last 0->1 transition
- at a 1->0 transition, queue the time delay

The machine is initialized after the request sequence is sent out. This means the first two measurements that come out can be ignored, as they are:

- the initial response time (20-40)
- the ack pulse (80)

EDIT: So I don't really need a code generator. I just need some "scrap paper" to properly define the state machine, the initial state, and possible post-processing. Because once understood, state machines (in my case) are very simple. How does this generalize to more complex machines? I want something that works both on a uC and on an FPGA. The main difference is that on a uC, events can be modeled as procedure calls. On an FPGA they are state update functions evaluated at each clock. These need to be merged into the same representation somehow.

Entry: Events on FPGA
Date: Sat Nov 9 07:50:49 EST 2019

The core is always just this:

- on a CPU, events are essentially procedure calls
- on an FPGA, events are much more implicit: state transitions caused by certain combinations of inputs

So to put both on the same footing, it is the FPGA representation level that needs to be abstracted a bit, because the CPU cannot do the "polling" that is essentially done by the synchronous logic at every clock cycle. So how do you map events onto synchronous logic? Assume the presence of an event is represented by a 1 bit somewhere. The core is then state updates conditional on that bit being 1. But that does not solve the case where there can be multiple events at once. Maybe this is not really an issue? Suppose that the input to the state transition function is a mutually exclusive bit vector, where each bit represents an event. I already ran into this representing data streams. It seems quite natural. Let's keep this in mind and design a machine. The core idea seems to be that synchronous logic is too low-level for all but the most essential components (e.g. an edge detector, a counter). This points in this direction:

- Create a library of low-level components that turn "continuous monitoring" into event streams, i.e. data + sync bit.
- Express all processing as (state,{event}) -> (state,{event}), where {event} denotes a set of events.

The difficulty seems to be events that are concurrent. Is it too much of a restriction to impose on each "process" that it has only one "message queue"? This brings to mind the distinction between the channel approach and the mailbox approach. But that is not really what it is about, since a sequential program gets to choose from which channel it reads next, and can impose time order that way. The core idea for this to work is that each machine takes only a single event stream, and can handle a single event at each clock tick. The mailbox model works best for that. This also makes it possible to express stream processors as functions, e.g.
giving a sequential composition structure without fan-in. Fan-in should be handled by dedicated mergers. Alright, there are a lot of corner cases, that is clear. E.g. it might be possible to implement a machine sequentially, but then it needs to have a limited event input rate.

Entry: Abstract DHT11 driver
Date: Sat Nov 9 09:59:54 EST 2019

So I wrote it out completely. Some notes:

- This can be done as inline functions, where the "host" is supposed to implement basic functionality.
- A single start function + a single event handler seems to make the most sense. This is also the structure that is used in Erlang.
- The ping-pong between state machine and host consists of two kinds of functions:
  - Simple library functions that perform uC state change and readout, e.g. get/set a pin, read a timer, start a timer.
  - Some of these are expected to produce events, which will then be passed on to the abstract event handler.

EDIT: Now this is a "sequential" event handling state machine, i.e. it knows exactly what the next resume point is at the time of yielding, so it can be represented in sequential, blocking form and auto-generated from such a form.

Entry: parallel, time
Date: Mon Nov 11 11:27:26 EST 2019

What struck me writing the DHT11 driver is that even for this simple example, there are essentially two execution traces: the sequence of pin transitions, and the global timeout. There is something extra about parallelism that is impossible to express in straight-line code, and you get there quite quickly. Actually it's very straightforward: often, a real-world interaction system needs the concept of time, not just of events. Events have only an _order_ property. Time is something else; there are two flavors:

- timestamps: map 2 events to 1 positive/negative number.
- alarms: map 1 positive number to the occurrence of a future event.

Entry: a state machine language
Date: Mon Nov 11 20:55:30 EST 2019

Two important conclusions:

1. Clocked logic is NOT good as a base substrate. It needs at least one higher abstraction, which is events.
2. This means the implementation language that sits below the functional specification layer can be anything. Which also means Seq is going to be good enough, and it can grow and change as long as the interface is kept.

So how to move this forward? Create a mapper. I have a real-world state machine that I can use to test this: the DHT11 driver. I can't "design" this, so I'll have to evolve it. Start with writing out the state machine in the most straightforward Haskell form. Then rethink from that vantage point.

Entry: state machine language
Date: Fri Nov 15 08:57:55 EST 2019

Get past procrastination here... What is the actual problem that needs to be solved? Mapping a description to code. Start with the prototypical state machine: a counter. Work backwards? EDIT: I need to somehow inch myself into this. The substrate should probably be Seq, and Seq's C output is where to focus. I'm not going to get into LLVM at this point. So how to create a C program in Seq? EDIT: See exo/ghcid/ExoSP.hs

  -- ExoSP
  static inline void fun(seq_t o[1], seq_t r0) {
    seq_t r1 = seqMUL(r0, r0);
    o[0] = r1;
  }

EDIT: It seems that this is really enough to work incrementally in very small steps. Wait until inspiration hits. EDIT: Some essential structural / representational component is missing. Seq is made for sequential processing, not for event handling. How to add this?

Entry: An s-expression front-end
Date: Fri Nov 15 09:32:45 EST 2019

Intermezzo. I really want this.
Monadic notation is too cumbersome, and it is really just a syntactic problem. I don't think it would be necessary to create a Template Haskell S-expression front-end. It seems that Seq programs are untyped enough to be "interpreted". This actually turned out to be a feature! This is actually something that can be thought of as a core design element, and it also makes RAI and Seq quite close. If more type encoding is necessary, it is possible to just create a "phantom wrapper" in Haskell to express higher-level ideas. Let's give this a try. There is a parser in asm_tools/asm-tools/Data/AsmTools/SE.hs EDIT: I already started this in asm_tools/asm-tools-seq/Language/Seq/Syntax.hs EDIT: The jump is too big. Reduce scope. Stick to just doing C gen.

Entry: I need a hello world
Date: Fri Nov 15 10:26:50 EST 2019

Something essential is missing to make this event-driven, and I can't put my finger on it. The product should be a collection of event handlers that each perform a state update and possibly generate/enqueue more events.

Entry: DSP
Date: Thu Nov 28 05:29:32 EST 2019

Another stab. The problem I'm trying to solve is:

- to be able to reuse intermediate buffers easily, e.g. declare them in C inside a limited scope, and
- to do buffer dimension reduction

Is there a simpler way to look at this problem? Is this just deforestation / loop fusion? Or are there other things at play? Needs some new insight. What is the core problem? To reuse buffers such that they are cache-hot.

Entry: Sequential programs are the norm
Date: Sat Nov 30 16:02:47 EST 2019

Why would I ever need sequential code if I have unlimited dataflow? Is it possible to express a state machine as pure dataflow, and then add a specification for the dataflow? The problem is with loops. You almost always want sequential processing of data stored in a memory, because the access to that memory is going to be sequential. So the idea is this: the default is sequential programs. What they run on doesn't really matter: some form of datapath and control logic. There is a lot of freedom there that does not really matter much. Parallel dataflow is actually very rare and only needed for the most low-level components, such as e.g. bus interfaces. All the other logic will be simplest to express as sequential code. So to write state machines on an FPGA, write everything as code, and when it gets too inefficient, create a low-level state machine.

Entry: buffers
Date: Sat Dec 7 10:46:43 EST 2019

So I have "clocked words": signals that have a "valid" signal associated with them to indicate events. How to extend this to buffers? Buffering seems unavoidable, so make sure there is a good strategy. This is the "control plane vs. data plane" pattern that always comes up in implementation. The data plane can just be memories. So, conclusion: it is sometimes hard to see in the overall design which streaming connections should be buffered. For non-random access (FIFOs only), this is an implementation detail that has to do with the sequential decomposition of some processor. If sequential decompositions do not line up, buffering is needed to "deform time". So can it be kept abstract? Communicate tokens?

Entry: logic implementation is about deforming time
Date: Sat Dec 7 11:07:29 EST 2019

Basically: buffering, sequentialization and pipelining. Note that this is really about dataflow, which is the only thing I ever need an FPGA for: communication, and possibly some stream processing. To make this work, set up some interfaces.
The main missing component is a control flow mechanism for FIFOs. Ideally, manage the "coarse" time scale of the FIFO in some other way. This is not typically called a FIFO, but a buffer: one machine fills it, and only when full is the buffer transferred. However, it is still possible to use only streaming access on the buffer.

Entry: buffers over fifos
Date: Sat Dec 7 12:02:34 EST 2019

There is a good reason to use buffers: it is often possible to put a hard constraint on message size. It is much harder to constrain queues of multiple objects. This is an interesting tension: while it might be more "natural" not to use a chunking mechanism (grouping many atoms into an arbitrary chunk), chunking does seem to allow for better memory management. FIFOs are probably OK as long as there is some kind of high-level rate constraint, i.e. some "coarse level" information: only if FIFOs do bounded time warping, where the maximum time delay is known.

Entry: Transformation of parallel to serial?
Date: Sat Dec 7 12:24:42 EST 2019

Can it be that simple? Start with actual "large" objects, then transform the data connections into sequential ones, and refactor the processing in the same way. It seems that the important bit is to decide where to make those time/space tradeoffs. It also seems simpler to go from parallel data to serial data, i.e. you "add" stuff. Going from parallel compute to serial compute is also straightforward, but it seems backwards in some way, because we all get so used to sequential programs. However, the intuitive sequential programs might hide some other sequentialization. It seems better to make resource allocation very explicit: start with pure data flow, then braid it into sequential machines. EDIT: Actually this goes back to things I remember from very early on: splitting a filter into a sequential program using a simpler datapath.

Entry: Sequential programs are not natural
Date: Sat Dec 7 12:31:52 EST 2019

So it is not that sequential programs are more intuitive; it is that a particular CPU or programming language is a "known substrate", where it becomes easier to be expressive as a designer. What is more natural is to think in terms of events that cause state changes, and then to implement the data transfer and compute onto the sequential machine substrate.

Entry: Method: project model onto specification
Date: Sat Dec 7 12:33:46 EST 2019

So how to translate that into a method? Does it make sense to implement the behavior at a higher level, and then use it as a template? Actually, very much yes. It really seems that "adding" information to a specification to then arrive at an implementation is the wrong way to look at it. Work the other way around: "remove" information from an implementation to end up with the specification. This is textbook abstract interpretation. It is very important to understand that an implementation in the first place does _more_ than the specification. Its substrate mapping has many more observable effects than are necessary to satisfy the specification. It also might do _less_, in that certain corner cases are not handled. The tragedy is that this is usually what people (me, really) focus on, forgetting the part where an implementation is actually strictly _larger_ than a specification. This also makes it obvious that generating an implementation isn't always possible. There might simply be too many degrees of freedom that really do not compose in a nice way. I.e.
changing a dataflow program into a sequential or pipelined program introduces a practically significant shift in timing characteristics, something the specification might not care about at all.

Entry: Implementation / Specification codesign
Date: Sat Dec 7 12:41:36 EST 2019

So let's make this concrete by inventing an event + state machine language that can be compared with an implementation, with maps going from implementation to specification, but not backwards. When there are backwards maps, great! Those are modules for which code generation can be used. But this is _never_ guaranteed. So a model-based approach should really take that structure: semantics can always be assigned by projecting an implementation into model space, and the reverse is optional but never guaranteed.

Entry: How to actually do that? Phantom types.
Date: Sat Dec 7 12:51:43 EST 2019

It seems like a good idea. Abstract enough to be correct, but also completely unusable without some concrete modeling ideas. The essence is state machines. So start there, maybe? A state machine is something that transforms (event,state) into ([out_event],state). Stick to single input, many output machines. Capture multi-input in the state. EDIT: Focus on composition? Given a low-level implementation / specification pair, construct a way to compose them. Also, the specifications give phantom types to the implementations. The more I think about it this way, the easier it is to be OK with the fairly simple "boring Haskell" approach for Seq. It really is just an implementation language, and a phantom layer is all that's needed. EDIT: The model doesn't need to be actual. It can just be a repackaging of input streams, where model "fit" is just the ability to express pack/unpack of data?

Entry: Example?
Date: Sat Dec 7 13:10:00 EST 2019

The obvious path is to take some state machine from the Ethernet MAC. EDIT: I'm still missing an element. Can the model-wrapping be _only_ phantom types? EDIT: The frontend transforms the data ready + bit pair into a word stream that can go into a FIFO. How to type those? It doesn't make a whole lot of sense putting it that way. The core issue here is that one type of event stream is transformed into another, and this is done multiple times until anything like a "packet" emerges. But still, it does make sense to somehow type the I/O. This composition is actual, so it makes sense to reflect it in the types. So:

  rmii -> word -> packet

EDIT: I'm on the wrong track. Just start implementing; then, once the composition is there, start typing it. A glimpse: type it without mentioning state. It is understood that things are stateful at a local level.

Entry: Move structure into names
Date: Sun Dec 8 18:25:48 EST 2019

I ran into two cases where there is a benefit to implementing grouping based on names, and not on hierarchical structures. They are isomorphic (paths vs. nesting), but names seem to often be much easier to manipulate. This has to be a generic pattern.

Entry: Buffer reuse
Date: Mon Dec 9 06:59:38 EST 2019

It is essentially about the lifetime of variables.

Entry: C-like lang
Date: Fri Dec 13 20:07:32 EST 2019

Either do this with a subset of C as a frontend, or do something in Haskell on top of Seq.

Entry: Next state spec
Date: Sun Dec 22 00:15:18 CET 2019

So Mr. Lamport agrees: explicitly stating what stays the same is a good idea.
https://lamport.azurewebsites.net/video/video4.html

Entry: Compiling CPS state machines
Date: Sun Dec 22 13:11:14 CET 2019

This is very straightforward once the realization is made that a reified continuation is a sum type, with one clause for each state. In C, using a big ball-of-mud state struct, this idea is often obscured, because there is usually a lot of sharing between states. I.e. it is not always clear whether a particular struct member is valid in a particular state. Modeling the continuation as a sum type gets rid of this confusion. The only downside is then to map this idea efficiently onto a C struct or union. Rust will probably map very well to this idea. So essentially there are two parts:

- Convert a blocking task to CPS form
- Represent the continuation sum type efficiently

How close am I getting to async/.await?

https://rust-lang.github.io/async-book/01_getting_started/04_async_await_primer.html

That is already performing the CPS transform. I think I have the perfect example to try this out on. And also the test: can "loops" and "recursions" be expressed in async/.await?

Entry: Erlang as CSP substrate
Date: Sat Jan 4 15:56:06 CET 2020

This is the missing link I've been looking for, the connection between:

- Abstract state machine work in Haskell
- More practice with code gen & macro languages
- The Emacs/Erlang exploratory exo system

EDIT: CSP is synchronous and requires a layer over Erlang to make it work. Compilation to C might be more appropriate. There is a library now.

Entry: Modify deser to be able to do 2-bit sequences
Date: Mon Jan 27 06:06:05 EST 2020

It might already work. Currently, the deser core routine is:

  shiftUpdate dir sr b

where b could just as well be 2 bits.

Entry: channels
Date: Mon Jan 27 08:59:03 EST 2020

Some interesting cases pop up for a generic synchronizer: a UART memory reader. Basically, DMA. So let's continue down this path. Create a channel compositor. The synchronization mechanism is easy enough: AND together the sender and receiver signals; but expressing it in an expression language requires some awkward shuffling. It's the old read/write problem: a read is the input of a function, and a write is the output. Composition is then done on the outside. The mem->UART DMA is a nice example. Mem read can continue once the UART has acknowledged it has sampled the channel. Then mem read can obtain the next byte and block until the UART is ready sending. Express that as a channel operation. Can this be expressed as a select? The important part is the handshake.

SM1: raises write ready flag
SM2: raises read ready flag

Optimization: if it is known that the reader is fast enough that it will sit there waiting, it's OK to just pulse. This is what I did before in the DTI sequencer.

Entry: Memory reader
Date: Mon Jan 27 09:43:48 EST 2020

I.e. a "monitor" for a memory. This is a task that "selects" on a number of channels, and performs a read whenever a channel gets ready. This rendez-vous is a very powerful abstraction.

Entry: Synchronization
Date: Mon Jan 27 09:50:10 EST 2020

Let's distinguish two mechanisms:

- Sender pulses, reader waits for pulse.
- Sender raises and waits until reader is raised.

In the latter case it's necessary to be careful that reads and the next write do not overlap.

Entry: integrate with fusesoc
Date: Mon Jan 27 16:22:30 EST 2020

https://github.com/olofk/fusesoc

Wrap it in exo nix. https://nixos.wiki/wiki/Python EDIT: Yeah, once these things are getting bigger, reuse is becoming important. Seq is not going to be for system-level integration.
It's only a module generator.

Entry: Synchronization
Date: Tue Jan 28 05:58:35 EST 2020

Actually this is a game changer. 2-way synchronization is more abstract: it is no longer necessary to prove that one machine is in time to pick up the output of another machine. So nothing new, really, but I do have a way to think about things in the abstract. This is very different. That said, I do need practice. I can't just "see" it at the state machine level. EDIT: Here's a thing: "select" is the same as "priority cond". E.g. in async_transmit: wordClock is responded to first. So let's abstract a synchronous send as an asynchronous send combined with a synchronous receive of the ack. If it is easier to have a level transition for the ack, do so, but add an edge detector somewhere.

Entry: DMA
Date: Tue Jan 28 06:18:06 EST 2020

So how to actually do this instead of gloating about insights? First iteration: use pulses in two directions. Build intuition about that first. Yeah, why is this so hard? Still not done with the context switching. Do something else first until fully awake.

Entry: Handshake
Date: Tue Jan 28 06:35:48 EST 2020

I need some basic design principles. Requirements:

1. Allow both pulsed and level (ready/busy) signals
2. "cond" is your friend

Entry: Pulse vs. level
Date: Tue Jan 28 06:38:12 EST 2020

Level is easier to do because it's a single time instance (change polarity). A pulse always requires two events: one to turn it on, and one to turn it off. The good part is that level can be transformed into pulse using just an edge detector. Important here is that it allows FACTORIZATION OF STATE MACHINES. I strongly believe it is better to have two independent transition functions, as compared to a single one. Now, there is a problem: we can't implement waiting if a pulse will be missed. That is why level-triggering is necessary. So let's look at this a bit closer. Suppose we have a series of signals that are used for synchronization. If all signals are low, we wait in the next cycle. If one signal goes high, that one is used to cause a transition. Ties are broken by using a priority select. That part is straightforward. Pulses can be seen only when they are not obscured by other, higher-priority pulses occurring simultaneously. A "case" study: async_transmit has a priority select on wordClock, which means it will ignore bitClock if it occurs at the same time as wordClock.

  [shiftReg', cnt'] <- cond
    [(wordClock, [newframe, cbits n' (n + 2)]),
     (bitClock,  [shifted, cntDec])]
    [shiftReg, cnt]

What about this:

- Use pulses for async one-way communication. This already works, and in most cases it is appropriate.
- Find a new mechanism for rendez-vous synchronization. This seems to need level triggering with acknowledgement.

Entry: Rendez-vous: ready + ack
Date: Tue Jan 28 06:51:27 EST 2020

This is necessarily two-way. Let's break the asymmetry for now by requiring that reads will wait for writes. Then later it is probably clear how to restore the symmetry. Essentially this is about agreeing on a time event. That statement actually contains the solution. The output of the synchronizer is a single cycle pulse. Can this be constructed from a reader and a writer level signal? Let's go back to the asymmetric case:

1. The writer signals that data is ready by creating a 0->1 transition and holding it there.
2. When the reader is ready to perform the read, it will generate a 0->1 transition, and at the same time sample the value.

The action of 2.
should acknowledge to 1 that it can continue, but also remove the condition immediately, such that the reader state machine doesn't read again in the next cycle. That bit seems to be the essence. I think I'm re-discovering interrupts. Essentially, this is a counter. If the reader sees the interrupt, it will immediately (and only in that state!) output a received pulse. The writer will be waiting for this pulse. If it is seen, it should immediately turn off its enable signal. This 2-way thing is tricky. Let's put in some requirements.

  W: . . x x x x . . .
  R: . . . . . x . . .
  S: . . . . . x . . .

I think the key element is that the ANDed signal is not registered: information flows in the two directions in a single cycle. I.e. this makes it easy to generate by the receiver as a side effect of the "cond" case that sees the input signal high, and the writer can see the pulse immediately and lower its write signal for the next cycle. So, summarized:

- the writer has a wait state where it has the output ready signal raised
- the reader has a state that sets the ack pin NON-REGISTERED
- in the writer's wait state, the ack pin is used to transition to lower the ready signal

This should work for a continuous stream of readies. E.g. if the writer doesn't lower the ready signal, the reader will treat it as the next sync. Summarized even more: the Ready -> Ack path is COMBINATORIAL. If it is combinatorial, single cycle transfers are possible. If the signal is pipelined, some more work is necessary to avoid duplicates. So this case is definitely simpler. Also, if single cycle transfers are possible, this is exactly THE mechanism by which to factor machines into a composition of smaller ones.

Entry: Justify "applicative" structure
Date: Tue Jan 28 07:23:07 EST 2020

While some things are hard to write this way, e.g. when there is some "crossing" of signals, the desired end result of abstraction is almost always feedforward structures, and those map very well to applicative structure. I.e. data processors with some internal state.

Entry: Practical example
Date: Tue Jan 28 07:28:34 EST 2020

I don't think this can be abstracted away, as it is a core part of how the two machines perform transitions in the presence of other things going on. So the transaction is an essential part of the machine's main transition function. An example. Two machines:

1. the writer is a counter that writes out the next state
2. the reader is synchronized to an external pulse

OK, that's some setup to start with.

Entry: A good book on circuits
Date: Tue Jan 28 07:34:46 EST 2020

Maybe this one?

Computer Architecture: A Quantitative Approach (The Morgan Kaufmann Series in Computer Architecture and Design)
https://www.amazon.com/Computer-Architecture-Quantitative-Approach-Kaufmann/dp/0128119055
Computer.Architecture-.A.Quantitative.Approach.-.4ed.pdf
md5://808a7562b705ed1cf6a3deb9b9370d98

Entry: Synchronization example
Date: Tue Jan 28 08:59:45 EST 2020

OK, so there is a catch: this cross-coupling introduces a loop, and I don't think I have a way to decouple it. This is why I couldn't just write it down. The idea has a flaw. Interesting. Why is there this limitation? So it seems that if I implement this using a delay in the feedback path, it will be straightforward to do. Otherwise it is a recursion scheme that is not currently possible, and requires a new language construct. Maybe not. What is necessary is the "open" version of the two transition functions. They should be merged into a single function and then closed over feedback registers.
I'm too dumb for this shit rn. Note that there will be no loops: there is only an apparent delay between rdy and ack. Maybe the trick is to expose these two signals (ready in next) and then close them explicitly? This is the same thing that happened for memory reads. And indeed! This is also a read/write pair! EDIT: OK, was able to code it up. Now create a test. EDIT: OK, have a test, but things aren't correct yet. This is again an off-by-one thing that is hard to see. Things like that are just horrible to get right! Almost there, just too stupid atm.

Entry: How to make this understandable?
Date: Tue Jan 28 17:09:23 EST 2020

Draw it out on paper again? One thing that makes it difficult is not distinguishing between this state and next state in the printed tables. Fix that first.

Entry: rendez-vous
Date: Wed Jan 29 05:56:22 EST 2020

Basic idea: this is a cross-wiring of RDY and ACK, and one delay is necessary to make the loop work. Where does the delay go? One way to think about this, and maybe good to create a test case:

- Reader responds with an ACK pulse as soon as it can handle RDY
- Writer uses that ACK to turn off the RDY signal

In the end I want to be able to do this continuously, e.g. be able to transfer data at clock rate, but to see what actually happens, turning it off might be a good idea. So make a new test circuit that does only the handshake. EDIT: OK, I think I got it. The trick is to use combinatorial output on RDY and ACK, and to delay the ACK going into the writer.

Entry: testing rendez-vous
Date: Thu Jan 30 07:39:47 EST 2020

I have a test case, but QC found a problem when pulse sep is 0. OK, the problem is in the writer. I need a general principle to do this. Suppose ready is always high. What is the output?

- If not acknowledged, repeat the last one
- If acknowledged, use the next

Yes, this is tricky because of the dependencies on current and last state. I think it's best to express the separate cases explicitly.

Entry: Truth tables
Date: Thu Jan 30 08:18:40 EST 2020

So I end up very naturally at truth tables. In some cases it is just very hard to express a transition function in terms of manually factored binary and, or, not. EDIT: Since there is currently no direct way to implement truth tables, I'm resorting to manual implementation. I do wonder: Verilog has don't-care matching, right? https://embdev.net/topic/276558 Yes, casez. I probably should implement this. EDIT: It's not really necessary atm. Factoring can actually be beneficial for understanding. E.g. try to identify local signals that are meaningful enough to give a proper name, and include them in the truth table.

Entry: Change of style
Date: Thu Jan 30 10:35:20 EST 2020

Use the d_ prefix for state machine inputs. The "default time" should be the current time instance. This makes much more intuitive sense than thinking about the "future". FIXME: word this better.

Entry: DMA
Date: Thu Jan 30 11:01:39 EST 2020

I'm having real trouble with those off-by-one errors. Continue writing explicit truth tables for the transition function. EDIT: OK, I got something working: send a counter to the UART. Then replace the counter with an arbitrary pulsed state machine. EDIT: Running into this cross-pattern again. I guess it is universal. So let's use the convention that the library exposes the open machine, and we close it on use. Naming will be clear after it's done.

Entry: UART revisit
Date: Thu Jan 30 14:44:13 EST 2020

Maybe good to test it out better.
TODO:

- merge the two test cases into one
- fix the stop bit issue
- make the done bit combinatorial
- generalize the sync write/read

This is going to take some time to all work out.

Entry: Port transaction symmetry
Date: Sun Feb 2 19:55:41 EST 2020

I factored out the read and the state machine. The SM now takes a 'cont' and produces a 'have'. This is essentially read and ack. So it appears I'm re-inventing bus transactions.

1. The core idea is that a read (write) command is an interplay between two signals:
   - the read end will issue a 'req'
   - the write end will respond with an 'ack'
2. The req->ack path is combinatorial at the writer end.
3. The ack->req path is combinatorial at the reader.
4. The composition of the two inserts a delay on the ack to break the loop.

This doesn't seem too hard. Now why is it asymmetric? Can we have a writer sending the req, and the reader sending the ack? The insight is that this is already symmetric. The asymmetry is just in the names! This means that this could be a 2-way read/write as well. The handshake just defines a moment in time when both are watching the I/O. So what to do with this? Make some drawings on paper... The symmetry is important. And the fact that the delay breaks the symmetry is also important. I'm tempted to split that delay in half! Go into this: there are also combinatorial signals at play for maximum performance.

https://en.wikipedia.org/wiki/Wishbone_(computer_bus)

EDIT: So how do I test this?

Entry: More handshake examples
Date: Mon Feb 3 05:48:16 EST 2020

So let's do:

- a byte producer (that 4,5,6,7,12,13,14,15,... counter)
- a consumer (the UART)

Also let's name things properly:

- read/write strobe
- ack

The ack is not an ack. The OR of the in and out strobes is the ack. So a machine looks like:

- some data that is exchanged (doesn't really matter!)
- input: indication that the peer is ready
- output: ready indication
- ack = wire OR of the two

TODO:

- put this behind a standard interface with some constraints.
- create a standard way of gluing two ports

Entry: Asymmetry in read/write
Date: Mon Feb 3 06:04:09 EST 2020

So what I really want is to split delays in half, but that doesn't work. So the solution is to make the interface asymmetric, and let one of the ends assume a delayed input. Who should this be? EDIT: To avoid duplication of registers, make sure that the reader is the frontend.

Entry: Channel, final word?
Date: Mon Feb 3 07:55:35 EST 2020

  -- We break symmetry based on the requirement to not have a data delay
  -- as part of the loop closing operation. This makes frontend=write,
  -- backend=read.
  --
  --   /-------------------<----------------------\
  --   \-->--[D]-->--[f:write]---->--[b:read]-->--/
  --                \--------->-----/
  --
  -- This brings us to the following implementation.
  closeChannel writer reader = do
    closeReg [bits 1] $ \[d_rd_sync] -> do
      (wr_sync, wr_data, wr_out) <- writer d_rd_sync
      (rd_sync, rd_out)          <- reader wr_sync wr_data
      "d_rd_sync" <-- d_rd_sync
      "rd_sync"   <-- rd_sync
      "wr_sync"   <-- wr_sync
      "wr_data"   <-- wr_data
      -- wr_out, rd_out: other state machine outputs not necessarily
      -- related to channel communication.
      return ([rd_sync],(wr_out,rd_out))

Entry: General remark about 'close'
Date: Mon Feb 3 08:05:14 EST 2020

When closing multiple operations at once, it becomes a bit arbitrary in which order these are performed, and also there is quite a bit of shuffling needed to bring outputs out of the circuit being closed.
Entry: General remark about 'close'
Date: Mon Feb 3 08:05:14 EST 2020

When closing multiple operations at once, it becomes a bit arbitrary in which order these are performed, and there is also quite a bit of shuffling needed to bring outputs out of the circuit being closed. In Verilog this is a lot easier: just define a bunch of registers and use assignment to perform the cross-wiring. So is it worth it? This seems to be the price paid to keep an applicative interface. I'm going to assume for now that it is worth it, because it makes abstraction cheaper. Let's put it in the README.

Entry: channel: examples
Date: Mon Feb 3 11:42:37 EST 2020

These are canonical examples:

- readers
  - chan->uart_tx
  - chan->memory
- writers
  - uart_rx->chan
  - memory->chan

Entry: general remarks
Date: Mon Feb 3 11:52:56 EST 2020

I'm on the right path, but this will need a whole lot of work to find good factorization primitives, and to maybe also create a processor in such a factored way.

EDIT: A PLC would probably be possible:

1. Factor out the instruction sequencing: instruction memory interface, loops and call/return.
2. Allow the user to plug in conditional jumps (input->control path) and instruction decoders (instruction->output path).

Entry: Make tests easier
Date: Wed Feb 5 06:48:36 EST 2020

- only use probes
- why is TH necessary?

Entry: Why is TH necessary?
Date: Wed Feb 5 07:25:27 EST 2020

Interesting question. I think I did this because of a lack of strict typing?

Entry: Why am I not using other tools?
Date: Thu Feb 6 05:03:03 EST 2020

First, I really want to know how to do this bottom up. That is very important. Once I've learned, I might want to switch to different back-ends.

Entry: Sharing
Date: Thu Feb 6 05:16:32 EST 2020

See haskell.txt. There are a lot of interesting points, but it is probably just a distraction at this point. For now, stick to monads.

Entry: Accelerate
Date: Thu Feb 6 05:23:39 EST 2020

Make it easier to get to hardware quicker. What is missing?

Entry: Bit-serial Forth processor?
Date: Thu Feb 6 08:18:42 EST 2020

Yeah, why not. Probably best to start with a QSPI flash chip board.

Entry: Substrate independence
Date: Fri Feb 7 06:36:51 EST 2020

Entry: ISA generation
Date: Sat Feb 8 13:12:33 EST 2020

So instead of designing an instruction set, what about having it generated? Basically, what you want is a "wide" instruction set with datapath control signals. This then needs to be compressed into something that is a tradeoff between:

- not being too complex to decode
- not being too wide to store

Since I am already writing abstract code inside Haskell, I never actually want to see the instruction encoding. How about generating it? I only need the instructions themselves (abstract tags + concrete payloads), then find an encoding that maps the abstract tags to instructions.
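A hedged sketch of what "generating the encoding" could look like: walk a program, collect the abstract tags that are actually used, and assign them dense opcodes. Plain Haskell with made-up names; a real version would also weigh decode complexity against width, as noted above.

  import qualified Data.Map as Map
  import Data.List (nub)

  -- An instruction is an abstract tag plus a concrete payload.
  data Ins t p = Ins t p

  -- Assign opcodes 0,1,2,... to tags in order of first use, then emit
  -- the program in encoded form together with the opcode table.
  encode :: Ord t => [Ins t p] -> (Map.Map t Int, [(Int, p)])
  encode prog = (table, [ (table Map.! t, p) | Ins t p <- prog ])
    where table = Map.fromList (zip (nub [ t | Ins t _ <- prog ]) [0 ..])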
Entry: Testing memory readers
Date: Mon Feb 10 06:56:42 EST 2020

This is also a roadblock, but it appears the main roadblock is to work on the HW, so fix that first.

Entry: clock domain crossing
Date: Wed Feb 12 15:23:42 EST 2020

Dual-ported memory + signal? Yes, FIFOs:
https://filebox.ece.vt.edu/~athanas/4514/ledadoc/html/pol_cdc.html

Entry: Programmable datapath
Date: Thu Feb 13 07:49:50 EST 2020

How to manage a single DSP core for low-rate processing? It needs to be time-sliced. The idea is to feed it via a register file. Dual-read-port memory would be nice for operands, but it is probably ok to go through fetch phases for multiple operands.

Encoding the instructions: probably not too many different things, mostly DSP. Basically, DSP and control processors are really different. DSP is mostly about feeding data into the MAC; control is mostly about responding to events and sending out events, possibly with subprograms. So control should focus on a Forth-style approach, while DSP would focus on a register architecture. DSP would be compiled by mapping a dataflow network onto a program that executes it.

Entry: Substrate nesting
Date: Thu Feb 13 07:55:05 EST 2020

Maybe time to start doing this for DSP algos. It's probably ok to pass through Term, e.g. have a complete compiler behind a mapping.

Entry: applicative notation
Date: Fri Feb 14 16:43:04 EST 2020

Did I completely miss this?

  f <*> a <*> b

  Prelude> :t (<*>)
  (<*>) :: Applicative f => f (a -> b) -> f a -> f b

The problem is that I have a -> m b instead of m (a -> b). Is it possible to change one into the other? Not directly: <*> only applies when the function itself is wrapped. For f :: a -> b -> m c, what I have is actually a chain of binds, ma >>= \a -> mb >>= f a, i.e. join (f <$> ma <*> mb) -- Kleisli-style application.

Entry: next
Date: Sun Feb 16 18:39:48 EST 2020

To/from memory. Make those tests straightforward.

Entry: mini cpu
Date: Thu Mar 19 21:32:48 EDT 2020

http://bleyer.org/pacoblaze/picoblaze.pdf

Entry: z transform
Date: Fri Jun 5 23:15:34 EDT 2020

I already have an algorithm for this in rai. The core problem is to linearize the update equation. Once there is a set of linear equations, the rest is straightforward with a library like hmatrix. In the problem I need to solve first, the update equation is linear already, so that step can be skipped.
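A scalar sketch of the already-linear case, using only Data.Complex; hmatrix only becomes necessary once the state is a vector and a matrix (zI - A) has to be inverted. The one-pole update and its 0.9 coefficient are made up for illustration.

  import Data.Complex

  -- For the linear update y[n] = a*y[n-1] + x[n], the z transform gives
  -- the transfer function H(z) = 1 / (1 - a/z).  Evaluating it on the
  -- unit circle z = exp(i*w) gives the frequency response.
  h :: Complex Double -> Complex Double
  h z = 1 / (1 - (0.9 :+ 0) / z)

  response :: Double -> Double   -- magnitude response at frequency w
  response w = magnitude (h (cis w))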