Haskell DSL for synchronous circuits

Entry: Basic idea
Date: Fri May 25 14:10:30 EDT 2018

Is it possible to capture enough of a synchronous state machine to be able to do the same thing as with Pru.hs?

In general, an HDL represents a discrete event simulator. For clocked circuits, the simulation becomes a lot simpler: at every tick, register inputs are read, and an update function is computed for each register. So the basic unit to work with is the register.

For now, assume MyHDL as a target. The idea is to produce blocks that look like:

    @always_seq(CLK.posedge, reset=None)
    def counter():
        count.next = count + 1

Abstract the CLK and reset completely. How to construct an embedded language around this idea? There are essentially two elements: 1) combinatorial functions, 2) registers. A register is directly tied to the function that computes its next state, so it makes sense to use a Map for this.

MyHDL doesn't use registers per se, but uses signals. If a signal's .next is written to, it behaves as a register. In other cases it is possible that a signal is just a wire. I find this very confusing.

Entry: Signals
Date: Fri May 25 14:11:39 EDT 2018

To make this work, it is necessary to first understand what Signals are. Stick to MyHDL as basic semantics. Somehow I don't really see why signals need to be so complex. Maybe just stick to what MyHDL does? At the very least, I need to distinguish between intermediates (wires) and registers.

I think I understand. The difference is whether something appears in a combinatorial or a sequential block. So we have some context in which we can do bindings. Semantics of signals: 1) exactly one driver, 2) from comb creates a wire that cannot have loops, 3) from seq creates a register which can create loops across clock ticks.

Let's try to make this basic program work:

    counter a b = do
      comb $ do a <-- not b
      seq  $ do b <-- b `add` 1

Two blocks, one combinatorial and one sequential, using a unary and a binary operation.

Maybe signals can be implicit? The do notation's <- can be used to create combinatorial signals. Then a 'set' operation can assign a combinatorial signal to a register. Following through, this code

    counter b = do
      a  <- inv b
      b' <- add b $ L 1
      set b b'

    main = do
      printl $ mapToList $ compile $ signal >>= counter

leads to the following network structure, where signal n is driven by driver d, shown as (n,d):

    (0,Reg (Signal 2))
    (1,Comb1 INV (Signal 0))
    (2,Comb2 ADD (Signal 0) (L 1))

The combinatorial ones are straightforward. The first one says that signal 0 is driven by a register, whose input is driven by signal 2.

I'm going to rename Reg to Delay, and separate out constants so they are explicit drivers of signals:

    (0,Delay 3)
    (1,Comb1 INV 0)
    (2,Const 1)
    (3,Comb2 ADD 0 2)

Making a next iteration where signals are explicitly driven, allowing both explicit combinatorial and sequential drive.

Entry: fix?
Date: Fri May 25 19:10:21 EDT 2018

Added a fatal error for signals driven more than once. Now, how to avoid this from happening by using a functional representation? Basically, create some kind of fix operator. Again, a counter.

    -- A counter from a register fixed point operator
    counter' :: forall m r. RTL m r => r Sig -> m ()
    counter' = regFix inc

    regFix :: RTL m r => (r Sig -> m (r Sig)) -> r Sig -> m ()
    regFix f r = f r >>= next r

So that's straightforward. To create state machines it is still necessary to have a naked 'signal' that can set up the register in the first place.

Ok... so what's next? I will have to represent state machines as functions with I/O. E.g.
two input, two output:

    r Sig -> r Sig -> m (r Sig, r Sig)

For the code generator, open functions are definitely necessary, but for the emulator it is ok to just work with traces for now. A test bench doesn't need an input.

Entry: Generate signals
Date: Fri May 25 20:27:24 EDT 2018

    --- test_edge
    (0,Const 4)
    (1,Delay 3)
    (2,Const 1)
    (3,Comb2 ADD 1 2)
    (4,Comb2 SLL 0 1)
    (5,Delay 4)
    (6,Comb2 XOR 4 5)

Maybe a good exercise to write this as an interpreter. I can see this problem of turning a network into a function reoccur. In this case, assume we know 6 is the output. What do we need to know?

- collect all registers
- compute output function, stop at registers
- compute update function for each register

Entry: Abstracting fix?
Date: Fri May 25 20:54:19 EDT 2018

This can be done using arrows. What I want is something that looks closed, but can be opened up again. I've been here before. Then I used existential types. Having an open representation doesn't seem to be necessary as long as state traces are made available. So maybe the first thing to do is to compile to traces?

Entry: Arrow / Category
Date: Fri May 25 22:30:13 EDT 2018

So what about instead of modeling as functions, we model as a category?

Entry: RTLEmu
Date: Fri May 25 22:36:48 EDT 2018

Just continue filling things out. Start by only collecting register signals. But how to represent a hole? Basically, op2 ADD needs to apply a function to something. What about inserting an environment into the monad, and using circular programming? This is a new idea: compile an entire network to a function taking register state as input.

Done. Also managed to compute register init using circular programming. What is missing is a good way to produce an output. Currently, it needs machinery to untag the Emu.R wrapping and perform register dereference in case the return value is Emu.R (Reg Int). Unfortunate, but it seems a special "output" command will need to be added to evaluate source code. If it needs to be added, then make it into a list so it is generic. OK, done.

Entry: It is very annoying not to have constants
Date: Sat May 26 01:51:08 EDT 2018

EDIT: Seems really hard to fix.

Entry: Remove combinatorial drive?
Date: Sat May 26 02:05:28 EDT 2018

Seems like it's not needed.

Entry: fix instead of next?
Date: Sat May 26 08:56:51 EDT 2018

If it is removed from the interface, there are no longer issues with 0 or >1 assignments to registers.

Entry: Is it necessary to model wide signals?
Date: Sat May 26 09:04:57 EDT 2018

It seems the metalanguage is enough. Generated HDL can be "flat". Likely, the compiler will pick up the pieces. It would be nice to be able to embed signal types in Haskell types, but it seems more trouble than it's worth. Overall it seems too much trouble to use lists of bits. Use sized integers instead. This is a compiler, so there is the extra compile-time function evaluation to resolve these issues.

Entry: Reset value
Date: Sat May 26 10:33:22 EDT 2018

I don't really like this, but I guess it can work... Signals can have default values. If a signal is not driven, its default value should be used. Constants are currently implemented as undriven signals. EDIT: No: make ints explicit.

Entry: Wire transposition
Date: Sat May 26 11:16:16 EDT 2018

Unpack a signal into components. This really makes me want to represent signals as bit vectors.
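A sketch of what such unpacking could look like, written against the slice primitive that shows up later in this log (the signature and the index convention are assumptions, so treat this as a sketch rather than the implementation):

    -- Unpack an n-bit signal into n 1-bit signals, LSB first.
    -- Assumes slice w (Just hi) lo extracts bits [hi-1 .. lo].
    unpackBits :: Seq m r => Int -> r S -> m [r S]
    unpackBits n w = sequence [slice w (Just (i+1)) i | i <- [0 .. n-1]]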
Entry: Export a module
Date: Sat May 26 11:29:01 EDT 2018

Because we only do sequential logic, it is essential that we are able to export standard (MyHDL) modules that can be included in a simulation that also handles asynchronous logic. The basic idea is that:

- async logic is sometimes necessary for interfacing, but is too low-level to be used for the bulk of a circuit
- synchronous logic is simpler, and this simplicity can be used to express it in a simpler language

Entry: Renaming RTL to Seq
Date: Sat May 26 11:31:06 EDT 2018

It is clear now what the purpose is: to create a language to describe sequential processing. RTL is too confusing. It might mean the same in spirit, but is a loaded term otherwise.

Entry: Conditionals
Date: Sat May 26 13:27:51 EDT 2018

Or, conditional assignment. Can I get away with implementing this as a primitive? It will be implemented as a multiplexer, so the signals will be there already.

Entry: Traces that do not have generators
Date: Sat May 26 16:06:30 EDT 2018

Trying to get an input sequence fed into a trace. Getting into a knot with that.. Maybe it's easier to embed a sequence right into the fabric, as a special kind of update equation around a register? Basically, I want to use the second operand of 'next' to magically produce a value such that the output register state contains information to compute the next value, and 'val' produces the current one.

    -- register drive
    next (R (Reg sz _ a)) (R b) = do
      vb <- val b
      let ifConflict _ old = error $ "Register conflict: " ++ show (a,old,vb)
      modify $ appOut $ insertWith ifConflict a vb

Can't wrap my head around it. Seems inconsistent. What is the real problem? To distill the initial state of registers. Is there a more direct way to compile this to (s, s->s)?

Entry: (s,s->s)
Date: Sat May 26 16:23:24 EDT 2018

First, start by storing the register defaults in the state map. Currently they are in the instructions.

EDIT: So I have something that works, but it's not pretty. The problem is really that register types are deeply embedded inside the code and there's no good way to get to them other than just executing with dummy inputs. A proper way would be to provide actual types for those dummy inputs, but here the 0 value will do.

Entry: Inputs?
Date: Sat May 26 18:27:19 EDT 2018

Still didn't get very far. A similar probing approach could be used. Ideally, this is encoded in the types, but I'm not going to put in the effort. OK, done. Wasn't that hard in the end. Just confused, I guess.

Entry: RAM
Date: Sat May 26 20:20:26 EDT 2018

What I need most to make an application is emulation for RAM. The RAM in iCE40 is not clocked.

Entry: MyHDL
Date: Sun May 27 00:25:01 EDT 2018

Basic print seems to be working. How to re-enable expressions? Or does it not matter? Likely, MyHDL will flatten expressions to internal nodes. Or not? Look at the HDL output maybe? In any case, it will make the output more readable.

So how to decide to inline a node? When it has only one user and it is not a Delay node. This isn't so hard to compute. The user list for each node can then be rendered such that:

- signal definitions and .next= lines can be skipped
- code is recursively inlined when printing the expression

Entry: fanout
Date: Sun May 27 01:01:14 EDT 2018

To compute fanout, first make an iterator over all references. A good excuse to finally understand Foldable and Traversable. Foldable is enough, likely.
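The shape of that computation, as a minimal sketch; the Term constructors are taken from the netlists above, and the derived Foldable instance does the iteration over references:

    {-# LANGUAGE DeriveFunctor, DeriveFoldable, DeriveTraversable #-}
    import qualified Data.Foldable as Foldable
    import qualified Data.Map as Map

    data Op1 = INV
    data Op2 = ADD | SLL | XOR

    -- Node references sit in the functor slot, so folding over a Term
    -- yields exactly the references it contains.
    data Term n = Const Int | Delay n | Comb1 Op1 n | Comb2 Op2 n n
      deriving (Functor, Foldable, Traversable)

    -- Fanout: how many times each node is referenced across all bindings.
    fanout :: Ord n => Map.Map Int (Term n) -> Map.Map n Int
    fanout bindings = Map.fromListWith (+)
      [(n, 1) | term <- Map.elems bindings, n <- Foldable.toList term]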
Funny how implementing Foldable immediately required the generalization to Term t, which paves the way for nested data structures, by changing the type from Term Node to Term (Either Node (Term Node)).

Entry: Free Monad
Date: Sun May 27 08:38:41 EDT 2018

The above smells like the Free Monad: Free Term Node. See http://hackage.haskell.org/package/free-5.0.2/docs/Control-Monad-Free.html

So inlining is some form of fmap. Here's what worked. I had to really do it step by step to figure out the types of holes.

    inline :: (Int -> Term Node) -> Int -> Exp Node
    inline ref = inl where
      inl :: Int -> Exp Node
      inl n = f $ ref n where
        f (Delay _) = Pure n
        f term = Free $ fmap inl term

Parameterized node type:

    -- Inlining, terminated on Delay to avoid cycles.
    inline :: (t -> Term t) -> t -> Exp t
    inline ref = inl where
      inl n = f $ ref n where
        f (Delay _) = Pure n
        f term = Free $ fmap inl term

Generalized to a predicate:

    -- Inlining, terminated on Delay to avoid cycles.
    inline' :: (Term t -> Bool) -> (t -> Term t) -> t -> Exp t
    inline' p ref = inl where
      inl n = f $ ref n where
        f term = case p term of
          False -> Pure n
          True  -> Free $ fmap inl term

    inline = inline' p where
      p (Delay _) = False
      p _ = True

Now I want to extend this to keep track of whether a node was inlined or not. But first, can this be generalized further? Is there another case where a Bool is used to pick two alternatives? Yes, that's just if.

    inline' :: (Term t -> Bool) -> (t -> Term t) -> t -> Exp t
    inline' p ref = inl where
      inl n = f $ ref n where
        f term = if' (p term) (Free $ fmap inl term) (Pure n)

EDIT: Continued a bit. Straightforward in the end.

Entry: Encode signal types as types?
Date: Sun May 27 20:42:54 EDT 2018

This would avoid needing SType values, but would require either dependent types or some nasty tuple shit. Maybe type families?

Entry: Memory emulation
Date: Sun May 27 22:46:02 EDT 2018

I'm not going to do this using registers. Also, I need something that works synchronously. It seems best to do this as a sort of coroutine that sits in between two state updates:

- reading read_addr, write_addr, read_data
- writing write_addr

Entry: Patch two "coroutines"
Date: Mon May 28 11:47:37 EDT 2018

    (R S -> M (R S)) -> (R S -> M (R S)) -> M ()

This is a special case of reg. EDIT: changed regs to regFix, using a functor.

Entry: signal and next
Date: Mon May 28 14:28:09 EDT 2018

Itching to remove it, and replace it with regFix. But that's not really so important. Some implementations might be simpler that way.

Entry: Memories
Date: Mon May 28 14:32:50 EDT 2018

So the basic structure is there. How to actually use it? Memories are an implementation feature. Generally we will provide code that is parameterized by the memory's register interfaces. Test it with a dummy read/write. So it's the same as the general register fix: close over the memory interface.

    -- Dummy memory-using operation. For testing memFix.
    dummy_mem rd = do      -- mem reg in
      z <- int 0
      return ((z, z, z),   -- mem regs out
              [z])         -- test program output

Maybe this can be added to Seq? It's very useful to have without the need of pushing memFix in as a parameter. So I have a test, but no generic way to do this. The problem is that the interface changes with all this threading going on. It would be more convenient to tuck it into the monad. I want a "makeMemory" function.
    dummy_mem rd = do      -- mem reg in
      z <- int 0
      return ((z, z, z),   -- mem regs out
              [z])         -- test program output

    dummy_mem2 makeMem = do
      memFix <- makeMem
      [o] <- memFix (t, t) dummy_mem
      return $ o

How to do this without parameterizing the class? Memory is a general case of external I/O. I want a generic way to embed abstract state threading. There are two ways:

- Find a way to hide it in the main monad
- Use custom trace functions

The latter really doesn't seem like a good idea. I think this is a job for existential types.

Entry: State, revisited
Date: Mon May 28 15:46:37 EDT 2018

So the question: include more state in the state monad by parameterizing it, or use explicit state in the trace (run) function. Start from the use case again. I'm writing HDL code that is somehow parameterized by a number of memories. The implementation of these memories should probably be abstract, such that it can be filled in during test time. With this code, I want to:

- generate it as HDL, where it will be combined with an HDL stub to patch it into the memories.
- generate it as a test function that can provide the memories and produce an output sequence.

My "main" program can just be parameterized by the memory interfaces, collected in a functor.

Entry: Allow non-monadic Seq constants
Date: Tue May 29 12:29:06 EDT 2018

Functional dependencies should be able to constrain r -> m. This is a deep change. Will take a bit of time.

Entry: I don't understand Free
Date: Tue May 29 14:59:03 EDT 2018

I made a small change: Term n -> Term (Op n), and now I cannot fix the inliner. Need to redo it from scratch. Start by writing a template that just uses return to produce the data type, then refine. This already exposes the bulk of the wrapping problems. Is there a way to implement the behavior of free without putting in all the wrapping?

EDIT: Ok, I get it. I discovered unfold, and wrapped the Free monad inside a WriterT . ReaderT to perform String rendering with indentation.

EDIT: I didn't get it. Then discovered liftF.

    -- exprDef = inlineNode ref                                -- 0 levels (Delay cuts off)
    -- exprDef n = liftF $ Compose $ ref n                     -- 1 level
    exprDef n = (liftF $ Compose $ ref n) >>= inlineNode ref   -- 1 level + inline

Entry: regFix bug
Date: Wed May 30 03:09:19 EDT 2018

    test_regfix = SeqEmu.trace' $ do
      let t = SInt Nothing 0
      regFix [t,t] $ \[a,b] -> do
        a' <- add a 2
        b' <- add b 3
        return ([a',b'],[a,b])

    --- test_regfix
    [[0,0],[3,3],[6,6],[9,9],[12,12],[15,15],[18,18],[21,21],[24,24],[27,27]]

So it seems the problem is here:

    regFix :: forall f m r o. (Applicative f, Traversable f, Seq m r) =>
      f SType -> (f (r S) -> m (f (r S), o)) -> m o
    regFix ts f = do
      rs <- sequence $ fmap signal ts
      (rs', o) <- f rs
      sequence_ $ liftA2 next rs rs'
      return o

    --- test_regfix
    [[0,0],[3,3],[6,6],[9,9],[12,12],[15,15],[18,18],[21,21],[24,24],[27,27]]
    -- bindings:
    (2,Comb2 ADD (Node 0) (Const 2))
    (3,Comb2 ADD (Node 1) (Const 3))
    (0,Delay (Node 3))
    (1,Delay (Node 3))
    -- output:
    [Node 0,Node 1]
    -- inlined:
    2 <- (ADD (NODE 0) (CONST 2))
    3 <- (ADD (NODE 1) (CONST 3))
    0 <- (DELAY (NODE 3))
    1 <- (DELAY (NODE 3))

So why is node 3 bound twice? Both interpretations appear to do the same thing. The nodes get created, and they get bound, but not to the correct thing? Very strange. There you have it:

    *Main> sequence $ Applicative.liftA2 (\a b -> return (a,b)) [1,2] [3,4] :: Maybe [(Int,Int)]
    Just [(1,3),(1,4),(2,3),(2,4)]

It will compute the outer product. So lists can't be used! A ZipList wrapper is needed. Is there a way to express that two functors have the same structure?
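For comparison, the ZipList applicative gives the elementwise pairing; a quick check in plain Haskell:

    import Control.Applicative (ZipList(..), liftA2)

    -- The list Applicative takes the outer product;
    -- ZipList pairs elements positionally.
    outer, zipped :: [(Int, Int)]
    outer  = liftA2 (,) [1,2] [3,4]                                   -- [(1,3),(1,4),(2,3),(2,4)]
    zipped = getZipList (liftA2 (,) (ZipList [1,2]) (ZipList [3,4]))  -- [(1,3),(2,4)]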
I guess converting to ZipList would work. EDIT: Used explicit zipWith next + toList in regFix. Then converted all other lists to ZipList. Maybe best to use dedicated functors.

Entry: Why is there a pipeline delay?
Date: Wed May 30 11:13:51 EDT 2018

Maybe before figuring this out, simplify the interface to the monad. It's not necessary to abstract 'val'. The user can easily do this in the testbench stub. Maybe today is not a good day for refactoring. Get a nap.

However, it seems that I'm introducing a delay by introducing those registers as actual registers, on top of the delay created by feeding back the memory's I/O state.

EDIT: Damn I'm tired but I can't let go of this. So this is how it should work:

- User stub should use registers
- But memory stub should use the constant output before it is fed back.

I think this can work with the existing setup. Basically, only the rData register is an actual register. The memory is a combinatorial function from the control inputs to rData, decoupled by that register. EDIT: Yes that was it. Makes the code a lot simpler too.

Entry: ZipList
Date: Wed May 30 13:27:19 EDT 2018

Put an error message in SeqEmu such that a missed ZipList case shows up in the evaluation of 'next'. Then figure out why I'm actually running into this problem.

Entry: CPU, sequencing
Date: Wed May 30 18:31:27 EDT 2018

I need a break from the more abstract stuff. What to do to make this usable?

Entry: myhdl interface
Date: Mon Jun 4 15:47:06 EDT 2018

Set up basic test bench -> vcd compilation in hatd project. The question is though: who generates the signals? I don't think it's really necessary to let the MyHDL side do this.

Entry: concat
Date: Sun Jun 10 10:28:50 EDT 2018

"concat" is a MyHDL function: http://docs.myhdl.org/en/stable/manual/reference.html

    concat(base[, arg ...])
    Returns an intbv object formed by concatenating the arguments.

This is one of the big differences between logic and CPU instructions: wires are arbitrary. It's probably time to use MyHDL style integers: intbv vs. modbv. About that: modbv seems to make most sense as a default, but I do want to leave open the possibility. Maybe make an explicit type?

Entry: SeqTerm SType
Date: Sun Jun 10 11:49:44 EDT 2018

Where does SType go? Ultimately, the dictionary should have type annotation, because the dictionary contents are used to define the MyHDL signals. But the straightforward way to do that introduces duplication: both Term and Node would have type annotation. Maybe that's just the way to go without a lot of restructuring, because SeqTerm uses a writer, not a state monad? Yes, let's just do duplication.

Entry: Trigger filter
Date: Sun Jun 10 14:50:05 EDT 2018

The circuit is simple, but requires a lot of primitives that are not yet implemented:

- just did "concat"
- bit vector indexing
- equality

It makes sense to implement the shift register separately, and also have a look at what the HDL output is for the MyHDL code. EDIT: Create the primitives needed to build a shift register first.

Entry: MyHDL output
Date: Mon Jun 11 10:33:03 EDT 2018

It's probably possible to join all combinatorial assignments in the first block, and put the sequential ones in the second block.
From:

    from myhdl import *
    def module(CLK, RST, s0, s1):
        s4 = Signal(modbv(0)[8:])
        s2 = Signal(modbv(0)[8:])
        s10 = Signal(modbv(0)[8:])
        s6 = Signal(modbv(0)[8:])
        # s1 is an input
        @always_comb
        def blk1():
            s4.next = (concat((s2[7:0]), s1))
        @always_seq(CLK.posedge, reset=RST)
        def blk2():
            s2.next = (s4)
        @always_comb
        def blk3():
            s10.next = (0 if (s4 == 0) else (1 if ((~s4) == 0) else s6))
        @always_seq(CLK.posedge, reset=RST)
        def blk4():
            s6.next = (s10)
        @always_comb
        def blk5():
            s0.next = (s10)
        return [blk1, blk2, blk3, blk4, blk5]

to:

    from myhdl import *
    def module(CLK, RST, s0, s1):
        s4 = Signal(modbv(0)[8:])
        s2 = Signal(modbv(0)[8:])
        s10 = Signal(modbv(0)[8:])
        s6 = Signal(modbv(0)[8:])
        # s1 is an input
        @always_comb
        def blk1():
            s4.next = (concat((s2[7:0]), s1))
            s10.next = (0 if (s4 == 0) else (1 if ((~s4) == 0) else s6))
            s0.next = (s10)
        @always_seq(CLK.posedge, reset=RST)
        def blk2():
            s2.next = (s4)
            s6.next = (s10)
        return [blk1, blk2]

Just keep the order? Maybe not... I wonder if this actually works as expected, or if each combinatorial value really needs to have its own block. Or maybe a Signal is not required? Damn, I ran out of steam before it's finished...

It seems that it is possible to use "naked" intbv, e.g. http://www.antfarm.org/blog/aaronf/2008/03/myhdl_example_avalonst_error_a.html

    # (j is large enough to hold any index into intermediate.)
    j = intbv(0, min=0, max=2 + len(i_err))
    for i in range(len(outputMapping)):
        j[:] = outputMapping[i]
        o_err.next[i] = intermediate[int(j)]

What question am I supposed to ask? What is the difference between Signal(intbv(...)) and intbv(...)? EDIT: Difference between cell and value.

For combinatorial networks, it seems that this is not allowed:

    @always_comb
    def f_bc():
        b.next = a
        c.next = b

The following should always work:

    @always_comb
    def f_b():
        b.next = a

    @always_comb
    def f_c():
        c.next = b

For combinatorial code, the used values are in the sensitivity list. Suppose a is the output of a register. At a clock edge that updates a, f_b will run, which updates b, then f_c will run, which updates c. It will not be so hard to make a test case for this.

Entry: Make a test bench for semantics
Date: Mon Jun 11 12:12:17 EDT 2018

Basically, compare the python simulation output with the Seq output. The simplest way to do this seems to be to generate a python data structure and have the python script verify it.

Entry: parsec + template haskell?
Date: Mon Jun 11 12:16:40 EDT 2018

To make some notational abstraction over "do"?

Entry: Haskell / Python bindings
Date: Thu Jun 14 21:06:56 EDT 2018

https://john-millikin.com/software/haskell-cpython

Entry: Preinc/postinc access
Date: Sat Jun 16 17:19:58 EDT 2018

Trying to get the combinatorial / register split right for pre/post inc/dec memory read/write. For some reason, there is something that isn't quite clicking about how this is supposed to work. I guess I want to really see a single cycle instruction memory work on the FPGA. The read and write happen on rising read/write clocks. This is the same behavior as a register. I guess what I'm looking for is a more detailed description of an SRAM. To believe it, I guess.. I don't exactly see where the feedback "fix" is located when it comes to a clocked memory. That is the main issue here. For a register it is simple: one in / one out. So what is it like for a memory? There must be some latch in there somewhere. When in doubt, look at the simulation. It captures semantics, obviously..
    @always(write_clock.posedge)
    def rtlwr():
        if write_enable:
            memory[write_addr].next = write_data

    @always(read_clock.posedge)
    def rtlrd():
        read_data.next = memory[read_addr]

So the memory does behave as a register. fixMem is correct for the readAddr -> readReg part, but I'm not sure about the write. EDIT: It seems OK, but I want to see it. If readAddr == writeAddr, there should be a two cycle delay. This is something that isn't hard to test on actual hardware. Make some tests for Seq for these corner cases.

    test_mem_delay = SeqEmu.traceState ([empty]) m where
      t = SInt Nothing 0
      m = SeqEmu.fixMem [t] $ \[rd] -> do
        c <- counter $ SInt (Just 8) 0
        return ([(1, 0, c, 0)], [c, rd])

    --- test_mem_delay
    [[0,0],[1,0],[2,0],[3,1],[4,2],[5,3],[6,4],[7,5],[8,6],[9,7]]

So indeed, the delay is two, passing the two registers (read register, and internal SRAM modeled as a register). How to test this on a scope? If the counter is 8 steps, and bit 2 is put out, it will show up as quadrature delay.

Entry: latches / registers and pure functions
Date: Sat Jun 16 17:59:00 EDT 2018

https://forums.xilinx.com/t5/Implementation/why-latches-are-considered-bad/td-p/200291

    The output of a combinational circuit is a function of input only
    and the circuit should not contain any internal state (i.e.,
    memory). One common error with an always block is the inference of
    unintended memory in a combinational circuit. The Verilog standard
    specifies that a variable will keep its previous value if it is
    not assigned a value in an always block. During synthesis, this
    infers an internal state (via a closed feedback loop) or a memory
    element (such as a latch).

Entry: Push a stack
Date: Sat Jun 16 19:10:34 EDT 2018

So, the memory has a 2 cycle delay. I wonder if it is necessary to make a 2-step processor? What is the simplest thing to do? Speed is not an issue for the task at hand. Simplicity is more important. That means no pipelining: results of the previous instruction step should be available in the next.

Assume:

- stack architecture (working reg + top of data stack = ALU input)
- no need for pipeline delays

Inputs:

- instruction word
- working reg
- stacks: read register

Outputs:

- instruction pointer
- working reg
- stacks: write address + data

Assume a combinatorial path between those. What can be implemented without delays?

- jump
- alu (top of data stack + working reg -> working reg)
- stack pointer (e.g. inc / dec)

Lost it... Start simpler. Start with:

- ip, wreg
- jump
- conditional jump
- inc / dec / load

So... this looks really interesting, but DO NOT do this for work. Keep the FPGA circuits simple, and put the control logic in the CPU.

Entry: Arrows?
Date: Sun Jun 17 01:41:02 EDT 2018

It would be just Kleisli. But basically, after Conal indoctrination, I really like the applicative version more. Functions, be it monadic functions, seem to make more sense. But maybe just try it?

https://www.reddit.com/r/haskell/comments/4fkkzo/when_does_one_consider_using_arrows/

    While the exact mathematics doesn't seem to have been worked out
    exactly yet, it is well known that Applicative+Category has "about
    the same" expressiveness as Arrows. Using Strong from profunctors,
    you can prove Strong + Category has exactly the same expressiveness.
    http://www.fceia.unr.edu.ar/~mauro/pubs/Notions_of_Computation_as_Monoids.pdf
    gives most of the story and
    http://www-kb.is.s.u-tokyo.ac.jp/~asada/papers/arrStrMnd.pdf gives
    the rest, but you need to read between the lines and see that using
    Category gives you a way to model monads in Prof.
    class (Strong p, Category p) => Arrow p
    instance (Strong p, Category p) => Arrow p

There. I fixed it. So basically, forget arrows.

Entry: Kleisli arrows
Date: Sun Jun 17 02:38:30 EDT 2018

Maybe this is a way to write more oneliners?

    a <- add x <=< add y z

Entry: MonadFix?
Date: Sun Jun 17 03:07:04 EDT 2018

ArrowLoop is defined for Kleisli if MonadFix works. Won't work for fixReg, because it's still an actual fixed point operator.

http://hackage.haskell.org/package/base-4.11.1.0/docs/Control-Monad-Fix.html

    purity: mfix (return . h) = return (fix h), if h :: a -> a

Entry: Applicative
Date: Sun Jun 17 03:59:01 EDT 2018

The main reason not to use them is that Kleisli composition is enough. However, applicatives do start making sense when state machines are lifted to simulations, because the values become "pure" in some sense.

    (a -> m b) -> Stream a -> Stream b

Also

    (a, b) -> m c
    a -> b -> m c
    Stream a -> Stream b -> Stream c

are all kind of the same thing. So it should be possible to define an Applicative instance.

    (a -> b -> c) -> m a -> m b -> m c

So this brings me to the question: is (m a -> m b) the same as (a -> m b)? For sure there is:

    t :: (m a -> m b) -> a -> m b
    t f = f . return

    t' :: (a -> m b) -> m a -> m b
    t' f m = m >>= f

So the conversion is generic and seems to be unique, so these seem to be isomorphic. Then for the multi argument function, it does seem that the order becomes important:

    t'' :: (a -> b -> m c) -> (m a -> m b -> m c)
    t'' f ma mb = do
      a <- ma
      b <- mb
      f a b

So there is some form of arbitrariness here. The problem goes away in the uncurried version:

    ((a,b) -> m c) -> (m (a,b) -> m c)

What does this all mean? It seems that currying/uncurrying Kleisli arrows is not unique. This points at the "default order" for Applicative functors: left to right.

    ap :: m (a -> b) -> m a -> m b

Entry: Practical
Date: Sun Jun 17 14:53:55 EDT 2018

Anything practical that still needs to happen? SeqEmu.hs can be simplified. I especially do not like the clumsy external state threading approach, and how the state monad is not really computing the register state update. It computes an update function. But maybe that is enough. So basically, memories work, and inputs work, but I haven't used them together yet. Anyway, today is not an insight day..

Entry: Stacking monads
Date: Mon Jun 18 10:23:50 EDT 2018

As a user interface, it is probably possible to stack another StateT. But how to do that without interference? It's not a problem to stack multiple, I believe. I think it will just pick the first one it finds. So that would be from the user's perspective. But how to "tag" the inner state monad used in the emulator? To make this work properly, it is probably best to put it all into one monad. But how to solve the problem of multiple isolated states? This has to be a problem people have run into before.

http://blog.ezyang.com/2013/09/if-youre-using-lift-youre-doing-it-wrong-probably/

    As everyone is well aware, when a monad transformer shows up
    multiple times in the monad stack, the automatic type class
    resolution mechanism doesn't work

EDIT: Maybe time to simplify. The reader monad isn't really necessary, so take the current parameterization to be an extra component to the state monad. EDIT: Looks like the writer monad is already gone.

How does this compose? I want to be able to add an effect transparently. It seems the only way to really do that is to use existential types. Say there is a "fixReg" operator for each kind of state, which will add a state element to a list and record a way to access it.
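A hypothetical shape for such a list element, with the state type hidden behind an existential quantifier (a sketch; the names are invented):

    {-# LANGUAGE ExistentialQuantification #-}

    -- Each element carries its own state type s, invisible from the
    -- outside; all that can be done with it is run one update step.
    data SomeState m = forall s. SomeState s (s -> m s)

    stepAll :: Monad m => [SomeState m] -> m [SomeState m]
    stepAll = mapM $ \(SomeState s u) -> do
      s' <- u s
      return $ SomeState s' u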
Added a list of existential state types. Next is to find a way to separate the "allocate" and "use" phases, in a way that is similar to a register. Maybe this thing could be just an extension of a register? That actually makes more sense. EDIT: So I made some room for this. Now what? EDIT: Can't be made of just registers, because those get initialized on every tick, so it is definitely different from a register. The interface should be "a special kind of register". That way, I/O can be modeled as registers as well. Now fixMem can be implemented using an analogue to "signal", e.g. "memory", to create a register that holds e.g. a Map, or some other state structure.

Entry: ExtStates
Date: Tue Jun 19 12:39:53 EDT 2018

Currently, only memories are ExtStates. Wrappers are in place. Now put this behind an existential interface. EDIT: So the interface should provide a fix function:

    f ExtState -> M (f ExtState, o)

Note that it will be hard to commute the f with the existential state, unless f goes inside the existential type. Existential types are really hard to use.. So an abstract state from this perspective is a fix function. Start with the original fixMem:

    fixMem :: (Zip f, Traversable f) =>
      f SType -> (f (R S) -> M (f (R S, R S, R S, R S), o))
      -> f MemState -> M (f MemState, o)

What the state _does_ is to implement an interaction that is only visible at the register level. So the interaction interface needs to be generalized. Let's use lists for now, then generalize.

    fixExtState :: ([R S] -> M [([R S], o)]) -> s -> M (s, o)

Basically, the abstract fix function takes an abstract register interaction, its own hidden state, and produces the result of that interaction together with the updated hidden state. There are some complications indeed with the interaction of f and the state. Ok, it is getting a little complicated to express in the current form, but the basic idea is simple: external state interaction is represented as a register interaction.

Maybe just build it from the ground up to get to a simpler version that is working (one memory), then generalize from there?

    (i (R S) -> M (o (R S), t)) -> s -> M (s, t)

Currently I don't know what to do with those i,o parameters, so really just use a flat list. I don't think at that point the structure of the I/O is really important. EDIT: Painted into a corner.. I need something to close this loop. Doesn't really matter what. Some essential element is missing.

Entry: Got it: existential + dynamic types
Date: Tue Jun 19 22:30:15 EDT 2018

    data type:          forall s. (s, s -> M (s, Dynamic))
    interface function: Typeable o => (s -> M (s, o)) -> s -> M o

So dynamic types are not visible outside of the implementation. I'm happy with it. The result is simple, though it was _not_ easy to write this! A couple of things need to come together without interfering.

Entry: PruEmu
Date: Fri Jul 6 12:06:52 EDT 2018

    type EmuState = Map EmuVar Int
    data EmuVar
      = File Int   -- register file
      | CFlag      -- carry flag
      | PCounter   -- program counter
      | Time       -- instruction counter
      deriving (Eq,Ord,Show)

How to extend this to arbitrary state? It no longer is a map to Int. It's too much work at this point. I just need something simple and move on.

Entry: MyHDL sim test
Date: Sat Jul 7 08:23:29 EDT 2018

1. Create the Seq circuit with its own test bench.
2. Generate a MyHDL module that can compile to FPGA code, together with some MyHDL wrapper code.
3. Same for a test bench.

EDIT: After a bit of doodling I end up with this:

- A testbench module has CLK, RST and a number of outputs.
- It is a python module with a function named "module"
- The module can contain a list named "output"
- If so, it is verified against the simulation output

TODO: Wrap this up. Ok. One thing that I'd like to figure out is how to specify output types for a module. Currently it is left to the instantiator to provide I/O signals, but how does it know what to instantiate? Maybe generate an instantiator in the .py file. EDIT: Anyways, not a huge problem. Still need to verify if the output can actually be synthesized to VHDL or Verilog. EDIT: I really miss having actual signal names. EDIT: Ok, I have a small TH hack that can be used to resolve this.

Entry: Named ports
Date: Sat Jul 7 15:54:25 EDT 2018

myhdl is parameterized by ports. The node type is opaque, so a name could be attached that way.

    myhdl :: (Eq n, Show n) => [Op n] -> [(n, Expr n)] -> MyHDL
    myhdl ports bindings = _

ports comes from SeqTerm.compile. So at any point where just the abstract syntax is provided, a list of port names using the TH hack can be provided as well. Solve this through a type class. EDIT: Ok, got it.

Entry: put this to the test?
Date: Sat Jul 7 21:49:13 EDT 2018

So here I am with quite a bit of sophistication in the tool, but not really the application that merits it. I should have probably just continued in MyHDL. But then again, from just messing with Python a little, I am quite happy with Haskell.

Entry: DSL, sharing and monad laws
Date: Sun Jul 8 08:47:08 EDT 2018

This keeps coming back, so I put it in SeqLib.hs. Typically, in a monadic language with internal nodes, when the same monadic value is passed to a 2-argument function like this:

    f ma mb = do
      a <- ma
      b <- mb
      return $ add a b

the monad is evaluated twice. There seems to be no simple way to avoid this. Period. If sharing is needed, use do notation. If there is no fanout, it's ok to use applicative notation to build nested expressions. Note that in an expression, there can never be any sharing, except for sharing introduced by combinators, and those could implement it correctly. E.g.

    dup m = do a <- m; return (a, a)

So the point about sharing is moot, really. The real issue is that a couple of interfaces are needed:

    a -> a -> m a
    a -> m a -> m a
    m a -> a -> m a
    m a -> m a -> m a

I see no good way to do this that is easy to use, apart from introducing a syntax manipulation step.

Entry: nailing down semantics
Date: Sat Jul 14 12:11:20 EDT 2018

Something isn't right with conc and slice. Make some tests? The problem here is clear: I want to look at the output of conc and slice for a variety of things, but I have no clear way to do that at the command line. The need to resort to the command line is a consequence of dynamic typing, i.e. the semantics is not completely captured in the type. Actually, this is readily available. Here "Trigger" is a module that has all relevant pieces imported, and ghci is entered through the nix/cabal setup.

    Prelude Trigger> take 10 $ SeqEmu.trace $ return [1]
    [[1],[1],[1],[1],[1],[1],[1],[1],[1],[1]]
    Prelude Trigger> take 10 $ SeqEmu.trace $ let v = Seq.constant (Seq.SInt (Just 2) 1) in do v' <- Seq.conc v v ; return [v']
    [[5],[5],[5],[5],[5],[5],[5],[5],[5],[5]]

So that's not it. EDIT: The error was in slice; shiftR had its args swapped. But I think I have at least some better way to look at tests now.

Entry: Seq, what does it do?
Date: Sat Jul 14 18:08:05 EDT 2018

- "applicative" vs "network" notation, i.e. no explicit assignment/binding
- state hiding
- implicit clocks
- pure modeling, easy to compose and test

The first is probably the most important one.

Entry: Practical stuff
Date: Mon Jul 16 10:41:57 EDT 2018

UART: transfer bytes into memory.

Entry: Clocks, RTL
Date: Mon Jul 16 10:47:01 EDT 2018

So there are two ways to look at digital circuits:

1) Go from event-driven to sequential.
2) Still, in the sequential domain, there is a need to represent signals that are somehow "clocked". The way to do this is to synchronize to the sampling clock, and add an enable bit.

The basic intuition here is that for FPGA design, signal flow needs to be directional, which meshes very well with functional programming.

Entry: UART and sub-clocking
Date: Mon Jul 16 11:32:54 EDT 2018

Problem factorization.

- Async receiver is sample pulse generator and shift register
- Async transmitter is the same, a function of sample pulse to shift output

The main abstraction here when using a single clock domain is to perform a subsampling operation. Given an operation, run it only when an enable bit is set.

Entry: subsample, when
Date: Mon Jul 16 11:42:14 EDT 2018

This is not trivial. To make it trivial, support for single-branch if is needed. It is clear now that this feature of HDL is essential! It's also clear why: in some cases, making the non-active path explicit will be quite an ordeal.

EDIT: So this is not an expression. It is something completely different. Do I really want to go that way? Can the enable just be pushed deeper?

EDIT: I don't like this construct, as it has far reaching consequences: the language is no longer just dataflow. But it seems to be essential to be able to express conditional register updates. Basically, the register "close" operation needs to be augmented with the enable. Is there another way? Yes there is. Make the "close" explicitly use an enable. One way to make this implicit is to use an environment. So what is worse: introducing dynamic scope, or rendering the language imperative? I'd say the latter. So:

1) you can always use explicit close + enable
2) dynamic scope is terribly convenient

Entry: Environment
Date: Mon Jul 16 12:17:35 EDT 2018

How to add an environment to Seq? The thing to keep in mind is to make this extensible. The consumer will be the close operation, the producer will be user code. Let's call these "withEnable" and "enable". Then allow room for extension later. No, this seems wrong. What am I missing? SeqEmu has an environment for the register contents. SeqTerm doesn't. Ok, added a slot in the environment monads, and parameterized "closeReg".

Entry: UART
Date: Mon Jul 16 13:51:03 EDT 2018

This should now be simple:

- detect start bit, go to "on" state and produce a number of pulses
- wait for stop bit and produce a parallel output strobe

Entry: Conditionals
Date: Mon Jul 16 15:44:39 EDT 2018

One of the hardest things to unlearn is that conditional branches do not "save work". Everything is instantiated. The structure is fixed. The only way to save work as can be done in a CPU through a conditional branch is to implement a (reduced) CPU with conditional branches.

Entry: functional dependencies
Date: Mon Jul 16 16:08:25 EDT 2018

    class (Monad m, Num (r S)) => Seq m r | r -> m where

Entry: Conditionals, again
Date: Mon Jul 16 16:26:43 EDT 2018

if' works only on signals, not on containers of signals! To make it work on containers, it needs to be lifted:

    async_receiver :: forall m r.
      Seq m r => SType -> r S -> m (r S)
    async_receiver t i = closeReg [bit,t] update where
      update [is_on, n] = do
        on  <- return [i, is_on, n] -- dummy
        off <- return [i, is_on, n] -- dummy
        (o:s') <- sequence $ zipWith (if' is_on) on off
        return (s',o)

Note that this is quite different from how MyHDL can have if branches that contain assignments. So I am _intentionally_ building a dataflow subset. The parallel nature of if muxes is quite explicit that way. Does this need special care for non-binary muxes?

Entry: VHDL and synthesis
Date: Mon Jul 16 17:48:24 EDT 2018

https://web.ewu.edu/groups/technology/Claudio/ee430/Lectures/vhdl-guidelines.pdf

Mostly boils down to:

- do not create latches
- beware of duplication
- some things are not synthesizable

I'm encouraged in the approach:

- to make registers explicit
- to model combinatorial networks as functions

Entry: Slowing down state machines
Date: Mon Jul 16 18:48:17 EDT 2018

A simple way seems to be to gate the inputs to all the registers. A disadvantage is that it is no longer possible to create 1-cycle pulses to be used elsewhere. A solution to that is to use edge detectors at places where the pulses are used, and encode the pulses differentially. But that gives problems with spurious pulses at reset. It doesn't seem to be such a great idea... Let's build a single async receiver and use an explicit "clock" input.

Entry: Enabled streams
Date: Tue Jul 17 08:40:25 EDT 2018

There must be something beautiful hidden in all this. I just can't fish it out yet. What is a stream? Data + enable. At the event level, it is data + clock. What am I missing? It's the misconception that a sequential machine is clockless. That only happens when all the timing domains are the same. To transition "clock domains", the only thing that's needed is to AND the outputs with the enable pulses.

So a "stream" consists of some data and an enable signal. It can be made a convention that the current context contains this enable signal. It is then the responsibility of the implementation to use that properly. I.e. if there are any "output clocks", they should at least be masked by the enable.

Entry: RTL misconceptions
Date: Tue Jul 17 08:44:57 EDT 2018

1) As long as there are no propagation delay timing problems, or external clock synchronization issues, there are no synchronization issues _inside_ the sequential representation.

2) There are still clocks, but they are of a different nature. They represent subsequences: points in time where signals are valid on a conceptual level. (They will always be valid on an electrical level.)

3) Conditionals are muxes, and are inherently parallel. It is a good thing to make this explicit. Conditionals do not "save work" as compared to conditional jumps on a CPU.

EDIT: About conditionals: it seems that state machine transitions can be factored into a bunch of circuits that appear "hoisted out" of the conditional, and some simple muxing.

Entry: Testing tagged streams
Date: Tue Jul 17 10:12:03 EDT 2018

Today's task is to find a good representation for testing tagged stream operators. The abstraction seems sound and high-leverage. EDIT: Got it. Added some QC tests.

Entry: QC generators
Date: Tue Jul 17 17:15:40 EDT 2018

Create some generators for:

- sequences of ints of limited bit length

http://hackage.haskell.org/package/QuickCheck-2.4.1.1/docs/Test-QuickCheck.html#v%3aforAll

Entry: Shifts
Date: Tue Jul 17 18:11:02 EDT 2018

SPI and UART use different shift registers, so the circuit cannot be reused unless a reverse is somehow implemented.
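One possible sketch, reusing the slice/conc primitives; the exact signatures and the bit ordering of conc are assumptions here, and this says nothing about what the synthesizer makes of it:

    import Control.Monad (foldM)

    -- Reverse the bit order of a known-width word: peel off the bits
    -- LSB first, then re-concatenate so that bit 0 lands in the MSB.
    reverseBits :: Seq m r => Int -> r S -> m (r S)
    reverseBits n w = do
      bs <- sequence [slice w (Just (i+1)) i | i <- [0 .. n-1]]
      foldM conc (head bs) (tail bs)  -- conc keeps the accumulator in the high bits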
Not clear what is best here, or if it is even necessary.

Entry: UART
Date: Tue Jul 17 19:22:53 EDT 2018

A UART is a counter that is started by a 1->0 transition, and performs sampling based on that counter. I find this hard to express without making a drawing. Here's an example: https://www.nandland.com/vhdl/modules/module-uart-serial-port-rs232.html

This points at one difficult issue: for some algorithms, assignment is a sparser way to encode. Another: https://www.fpga4fun.com/SerialInterface4.html

Entry: State machines
Date: Tue Jul 17 22:10:08 EDT 2018

I'm disappointed that this is so hard to express. I really want the convenience of multiple assignments. It's still possible, but requires boilerplate. So first, make a real state machine. Something simpler than the UART. Then iron out the language. EDIT: I have a basic skeleton for the UART. Needs some cleanup, but basically it seems ok.

Entry: MyHDL case statements
Date: Wed Jul 18 00:00:24 EDT 2018

http://docs.myhdl.org/en/stable/manual/conversion.html

    If-then-else structures may be mapped to case statements

    Python does not provide a case statement. However, the converter
    recognizes if-then-else structures in which a variable is
    sequentially compared to items of an enumeration type, and maps
    such a structure to a Verilog or VHDL case statement with the
    appropriate synthesis attributes.

I might have to do this differently. Currently, conditionals are duplicated at the meta level. It might be good to re-arrange that into imperative if-then-else. They get inferred differently: priority routing network vs. single mux. https://electronics.stackexchange.com/questions/73387/difference-between-if-else-and-case-statement-in-vhdl

I do wonder: in the case where the inputs are exclusive, does the combinatorial optimization figure this out?

Entry: Logic simplification
Date: Wed Jul 18 08:48:25 EDT 2018

I would like to know how Yosys does this.

Entry: UART debugging
Date: Wed Jul 18 08:57:13 EDT 2018

    [1,0,0,3] [1,0,0,3] [1,0,0,3] [1,0,0,3]
    [0,0,1,3] [0,0,1,2] [0,0,1,1] [0,0,1,0]
    [0,0,2,63] [0,0,2,62] [0,0,2,61] [0,0,2,60]
    [0,1,2,59] [0,0,2,58] [0,0,2,57] [0,0,2,56]
    [0,0,2,55] [0,0,2,54] [1,0,2,53]

Entry: more UART
Date: Thu Jul 19 15:08:27 EDT 2018

Sending out 5. Why does it see 5<<1 | 1? The sample pulses are fine. But it is clocking in the bit when the bit changes. It works when the register output is taken. EDIT: Important: do not propagate the _input_ of registers that use clock enable.
    -- 48
    [9,1,0,0,0,3] [9,1,0,0,0,3] [9,1,0,0,0,3] [9,1,0,0,0,3]
    [9,1,0,0,0,3] [9,1,0,0,0,3] [9,1,0,0,0,3] [9,1,0,0,0,3]
    -- 49
    [8,0,0,0,1,3] [8,0,0,0,1,2] [8,0,0,0,1,1] [8,0,0,0,1,0]
    [8,0,0,0,2,63] [8,0,0,0,2,62] [8,0,0,0,2,61] [8,0,0,0,2,60]
    -- 50
    [8,0,0,0,2,59] [8,0,0,0,2,58] [8,0,0,0,2,57] [8,0,0,0,2,56]
    [8,0,1,0,2,55] [16,0,0,0,2,54] [16,0,0,0,2,53] [16,0,0,0,2,52]
    -- 51
    [16,0,0,0,2,51] [16,0,0,0,2,50] [16,0,0,0,2,49] [16,0,0,0,2,48]
    [16,0,1,0,2,47] [32,0,0,0,2,46] [32,0,0,0,2,45] [32,0,0,0,2,44]
    -- 52
    [32,0,0,0,2,43] [32,0,0,0,2,42] [32,0,0,0,2,41] [32,0,0,0,2,40]
    [32,0,1,0,2,39] [64,0,0,0,2,38] [64,0,0,0,2,37] [64,0,0,0,2,36]
    -- 53
    [64,0,0,0,2,35] [64,0,0,0,2,34] [64,0,0,0,2,33] [64,0,0,0,2,32]
    [64,0,1,0,2,31] [128,0,0,0,2,30] [128,0,0,0,2,29] [128,0,0,0,2,28]
    -- 54
    [128,0,0,0,2,27] [128,0,0,0,2,26] [128,0,0,0,2,25] [128,0,0,0,2,24]
    [128,0,1,0,2,23] [0,0,0,0,2,22] [0,0,0,0,2,21] [0,0,0,0,2,20]
    -- 55
    [1,1,0,0,2,19] [1,1,0,0,2,18] [1,1,0,0,2,17] [1,1,0,0,2,16]
    [1,1,1,0,2,15] [3,1,0,0,2,14] [3,1,0,0,2,13] [3,1,0,0,2,12]
    -- 56
    [2,0,0,0,2,11] [2,0,0,0,2,10] [2,0,0,0,2,9] [2,0,0,0,2,8]
    [2,0,1,0,2,7] [4,0,0,0,2,6] [4,0,0,0,2,5] [4,0,0,0,2,4]
    -- 57
    [5,1,0,0,2,3] [5,1,0,0,2,2] [5,1,0,0,2,1] [5,1,0,0,2,0]
    [5,1,1,0,3,7] [11,1,0,0,3,6] [11,1,0,0,3,5] [11,1,0,0,3,4]
    -- 58
    [11,1,0,0,3,3] [11,1,0,0,3,2] [11,1,0,0,3,1] [11,1,0,0,3,0]
    [11,1,0,1,0,0] [11,1,0,0,0,3] [11,1,0,0,0,3] [11,1,0,0,0,3]
    -- 59
    [11,1,0,0,0,3] [11,1,0,0,0,3] [11,1,0,0,0,3] [11,1,0,0,0,3]
    [11,1,0,0,0,3] [11,1,0,0,0,3] [11,1,0,0,0,3] [11,1,0,0,0,3]

Entry: Bit sizes as types
Date: Thu Jul 19 22:02:48 EDT 2018

It's probably good to leave SType at the value level. It allows the language to grow slowly. But at the same time, evaluate DataKinds to allow "restricted phantom types". https://lexi-lambda.github.io/blog/2016/06/12/four-months-with-haskell/

Entry: FIFOs
Date: Fri Jul 20 23:37:38 EDT 2018

The problem with FIFOs is not so much the FIFO itself, but how to integrate memories. It should probably be done at the Seq language level, because you really do want to tuck them in. A FIFO is something that takes a memory, and produces a reader and a writer. This doesn't seem possible to express though. Let's write it out in woven form first and see how it can be rearranged.

Entry: Memories..
Date: Sat Jul 21 00:32:08 EDT 2018

Something really strange is going on with memories. There seems to be some interference between the two feedback parts that do trace input and memories. I don't understand, so I wonder, why does it need to be so opaque? Can other kinds of feedback be just the same as closeReg? The reason I didn't do this is because registers get re-initialized completely on every round. But that isn't much different from being otherwise updated..

closeProcess has this:

    -- Compute update using the update function bundled with current
    -- state. Repack with update function to do the same next time.
    o <- modifyProcess r p0 $ \(Process (s, u)) -> do
      (s', o) <- u s
      return (Process (s', u), o)

I think what happens is that the u in there calls modifyProcess again. But it seems fine. There was a bug in here before.. The inner routine can call modifyProcess, as long as it is a different r. Can the r be the same?

    modifyProcess r def f = do
      ps <- getProcesses
      let p = Map.findWithDefault def r ps
      (p', o) <- f p
      modify $ appProcesses $ insert r p'
      return o

Here's the routine. It appears as if the closeMem captures the first instance of the input.
    -- This does something really strange
    x_mem_bad = do
      let writes = [[0,1,x,x+20] | x <- [1..10]]
          reads  = [[x,0,0,0]    | x <- [1..10]]
          outs   = t_mem $ writes ++ reads
          t_mem  = trace [8,1,8,8] $ \i@[ra,we,wa,wd] -> do
            t <- stype wd
            SeqEmu.closeMem [t] $ \[rd] ->
              return ([(we, wa, wd, ra)], (rd:i))
      putStrLn "-- x_mem_bad rd,ra,we,wa,wd"
      printL outs

Why does it capture the first value? I think I'm assuming that the update equation is constant. Yes, it's passed from call to call. That's the problem. The trouble is that once the equation is tucked away, I can't use it any more. So the solution is to make the state typable. EDIT: OK, works.

Entry: The mapped if'
Date: Sat Jul 21 08:53:49 EDT 2018

I'm a little worried about that one. How hard is it to bundle these into imperative statements? Recursively:

- find all if nodes that have the same condition
- group together the equations

What happens when there are ANF terms in between the conditionals? It makes sense to write this as a test. EDIT: Here's an example.

    -- sequenced if' : does it need to be bundled?
    x_ifs = do
      putStrLn "--- x_ifs"
      print_hdl $ do
        io@[c,i1,i2,o1,o2] <- SeqTerm.io $ replicate 5 bit
        os' <- ifs c [i1,i2] [0,0]
        sequence $ zipWith connect [o1,o2] os'
        return io

    --- x_ifs
    -- ports: [Node (SInt (Just 1) 0) 0, Node (SInt (Just 1) 0) 1, Node (SInt (Just 1) 0) 2, Node (SInt (Just 1) 0) 3, Node (SInt (Just 1) 0) 4]
    -- bindings:
    (0,Input (SInt (Just 1) 0))
    (1,Input (SInt (Just 1) 0))
    (2,Input (SInt (Just 1) 0))
    (5,Comb3 (SInt (Just 1) 0) IF (Node (SInt (Just 1) 0) 0) (Node (SInt (Just 1) 0) 1) (Const (SInt Nothing 0)))
    (6,Comb3 (SInt (Just 1) 0) IF (Node (SInt (Just 1) 0) 0) (Node (SInt (Just 1) 0) 2) (Const (SInt Nothing 0)))
    (3,Connect (SInt (Just 1) 0) (Node (SInt (Just 1) 0) 5))
    (4,Connect (SInt (Just 1) 0) (Node (SInt (Just 1) 0) 6))

    0 <- (INPUT)
    1 <- (INPUT)
    2 <- (INPUT)
    3 <- (CONNECT (IF (NODE 0) (NODE 1) (CONST SInt Nothing 0)))
    4 <- (CONNECT (IF (NODE 0) (NODE 2) (CONST SInt Nothing 0)))

    -- MyHDL:
    from myhdl import *
    def module(CLK, RST, s0, s1, s2, s3, s4):
        # s0 is an input
        # s1 is an input
        # s2 is an input
        @always_comb
        def blk1():
            s3.next = ((s1 if s0 else 0))
        @always_comb
        def blk2():
            s4.next = ((s2 if s0 else 0))
        return [blk1, blk2]

Looking at

    cond ((mcond, whenTrue):clauses) dflt = do
      c <- mcond
      t <- whenTrue
      f <- cond clauses dflt
      ifs c t f

it appears that conditions will always be completely evaluated, so they appear as successive if nodes. It doesn't seem too hard to make this work. So this is an operation on bindings. Maybe best to do it in a couple of steps:

- translate current expression if to a statement if-else
- chain else-if to elif
- bundle

Note that seq can also be bundled. What are the core operations?

- Conditionally bundle static assignments:
  - Seq
  - If (unless it is chained)

One thing that is implicit is whether a set of equations is independent. When are equations independent?

- Sequential
- Combinatorial, but no (mutual) recursive references

Those can be bundled. Before doing any of this, it is necessary to test if it is really needed. I'm assuming the synthesizer already does a lot of rearranging. Figure out how Yosys works. Another reason to do this is to make the output code more readable.

Entry: signal/next and memory/nextMemory
Date: Sat Jul 21 10:57:54 EDT 2018

For Seq, does it make sense to get rid of signal and next? No. Keep it. So for generalizing closeProcess, is it possible to implement a signal/next pair for that?
Here's an idea:

- only add memory/nextMemory to Seq
- for emu, keep the more general process/nextProcess

In the current implementation, it is not necessary to initialize the processes. So let's propagate that. Ok, this is a bit simpler. Next: what should it look like on the MyHDL side? A memory can be modeled as a register with a specific update equation. I think this would be it:

    (m, rd) <- memory
    updateMemory m (we, wa, wd, ra)

Entry: Arrow
Date: Sat Jul 21 18:28:02 EDT 2018

Is it applicative plus what? This is about sharing, tuples, that sort of thing.

Entry: Kleisli arrows
Date: Sun Jul 22 10:18:14 EDT 2018

Trying to put this to rest. There is only one sane thing to do: provide SeqKleisli.hs which wraps all functions as Kleisli arrows, and represents binary and ternary functions as uncurried. Then see if this actually gets used.

Entry: Applicative
Date: Sun Jul 22 19:27:09 EDT 2018

I'm going to delete the file. The whole idea sucks. EDIT: Keep the lifted versions, but remove the autoconversion.

Entry: Blink-a-led working on breakout
Date: Sun Jul 22 21:02:41 EDT 2018

Using RST tied to GND.

Entry: Fix emulation inefficiency
Date: Mon Jul 23 09:44:07 EDT 2018

This boils down to splitting evaluation in two phases:

- Probe -> initial state, update function
- Update

Actually I don't see how to do this. What should the function look like? It will always perform some kind of computation on register values and constants, so it is basically the same as the output of SeqTerm.hs. The output is a serial program. So to implement: operations should return only register references, not values. Put it behind the (r,f) interface first, then change the implementation. The core change is here:

    styp = (fmap fst) . sints
    sval = (fmap snd) . sints
    val v = do SInt _ v' <- sval v ; return v'

    sints (Val i@(SInt sz v)) = do
      return $ (i, i)
    sints (Reg r) = do
      v <- asks $ \(_,regs) -> regs r
      ts <- getTypes
      let IntReg v0@(SInt sz _) = ts ! r
      return $ (v0, SInt sz v)

The point is to compile this straight into a function to give the implementation as much freedom as possible to reduce it. Is that possible? Probably overkill. Actually it seems easier to use an interpreter to implement SeqEmu. Maybe take that approach directly? Today is not a big insight day... ANF nodes and registers are not the same thing (Val | Reg). But ANF nodes need to be modeled because of fanout. Here's an idea: use Template Haskell. Map it straight to a Haskell program.

Entry: TH
Date: Mon Jul 23 11:20:32 EDT 2018

http://okmij.org/ftp/tagless-final/index.html

So let's just start over instead of trying to retrofit SeqEmu.hs.

Entry: pure / applicative
Date: Tue Jul 24 12:21:24 EDT 2018

So here's a thing. Apart from state feedback, the language is pure. Is it possible to rephrase it such that all primitives are pure, and the state feedback is hidden somewhere else? Basically, express it as a z transform? So why can't I have a pure DSL? The only reason is sharing. Can applicative express sharing?

Entry: Sharing
Date: Tue Jul 24 17:37:01 EDT 2018

The canonical, meaningful example is to compute the square. But any binary function works. Suppose s is a sequence. How to express square?

    s -> (s,s) -> s^2

So what's missing is shuffling, as mentioned before. Essentially, this is fmap (,) / snd / fst. Any kind of threading can be implemented using these operators. It's not pretty, but it works.

    t_app_square :: Seq m r => m (r S) -> m (r S)
    t_app_square = (SeqApp.uncurry SeqApp.mul) .
                 (fmap (\x->(x,x)))

x_app_share = do
  putStrLn "--- x_app_share"
  let c@(outputs, bindings) = SeqTerm.compile m
      m = do
        a <- inc 1
        b <- t_app_square $ return a
        return [b]
  print outputs
  sequence $ map print bindings

--- x_app_share
[Node (SInt Nothing 0) 1]
(0,Comb2 (SInt Nothing 0) ADD (Const (SInt Nothing 1)) (Const (SInt Nothing 1)))
(1,Comb2 (SInt Nothing 0) MUL (Node (SInt Nothing 0) 0) (Node (SInt Nothing 0) 0))

My question is then, is there any notion of a pure "+" in there? Hidden in the semantics, maybe, but it only appears operating on m (r S).

Entry: Upside-down : the dual implementation in terms of type families
Date: Tue Jul 24 18:21:49 EDT 2018

EDIT: Not so sure any more...

Turning it upside down, maybe. Can this be done by representing the idea of a sequence as an applicative functor of a generic data type? Yes. But how to perform feedback? This requires explicit implementation of "delay". Let's try that out.

class Functor f => Delay f where
  delay :: f t -> f t
instance Delay [] where
  delay (a:as) = as

Then, using type families, it might be possible to write alternative implementations that generate code. I'm still not comfortable using circular programming, so I'm inclined to use a "close" operator. Essentially it is something that turns a pure function into a sequence. But just for the fun of it, can it be done to use delay? I don't see it and it really complicates things. The proper way is to make "close" explicit.

module Causal where
class Functor f => Causal f where
  close :: (a -> (a, b)) -> a -> f b
instance Causal [] where
  close u s0 = (o:os) where
    (s, o) = u s0
    os = close u s
integral = close $ \s -> (s + 1, s)

But this is just unfold. No. It is special in that what it is unfolded into is not necessarily a list. I think this is going to work.. The key phrase in last week's reading was that type families are sort of dual to type classes.

https://wiki.haskell.org/GHC/Type_families

EDIT: Some inspection is needed to fish out the input value, so close is not enough.

closei :: (s -> i -> (s, o)) -> s -> f i -> f o

It surprised me before that I could write this as applicative, but I think that used a representation of signals as (s -> (s, o)). Actually that still works here, but requires that particular constraint on the representation. We don't need that. We just need close over i/o and state. Ok I think I got it:

Seq.hs defines Seq, a composition of m and r.
SeqApp.hs exports primitives lifted to that representation.

A "close" Causal class is implemented for each of the m.r compositions. The work here is in the individual implementations. I don't think that type families are needed. The behavior is in the functor. And it is important to compose m and r! So the insight is that these should be functors. That is not trivial in itself. And actually, doesn't work in general. Nope.... m is the functor. So this whole thing doesn't work because there are no pure r S functions, so close doesn't make sense. So the "f" cannot be the M of the implementation. The wrapping needs to be doubled-up:

F (M (R S))
F t

So it looks like this just doesn't work in the current setting. Something else is needed. Deleting it. There is no point.

Entry: The point
Date: Tue Jul 24 20:48:13 EDT 2018

is this:

integral :: (Num t, Causal f) => t -> f t -> f t
integral = close $ \s i -> (s + i, s)
counter s = integral s (pure 1)

class Applicative f => Causal f where
  close :: (s -> i -> (s, o)) -> s -> f i -> f o

and implement it in a way that all the machinery goes into f. It doesn't look like this is possible.
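For what it's worth, the list instance of that last Causal class does work — a minimal sketch in plain Haskell; it's only the code-generating m/r implementations that don't fit:

-- Refers to the class declared just above.
instance Causal [] where
  close u = go where
    go _ []     = []
    go s (i:is) = o : go s' is where (s', o) = u s i

-- e.g. close (\s i -> (s + i, s)) 0 [1,1,1,1] == [0,1,2,3]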
So I'm removing it. This is an entirely different problem. That monad is in the way.

Entry: Applicative sharing
Date: Tue Jul 24 21:06:52 EDT 2018

What works: in AppSeq.hs, see "dup". So given that, is it possible to write a kind of "let"? Yes.

square x' = var x' $ \x -> x `mul` x
var mv f = mv >>= \v -> f $ return v

So there...

Entry: Monad laws
Date: Tue Jul 24 21:36:18 EDT 2018

Simplest in terms of Kleisli arrows:
- Composition is associative
- return is left and right inverse

Entry: SeqTH / SeqPrim
Date: Wed Jul 25 19:43:44 EDT 2018

Boring and straightforward to write, but I think I got it. Very simple in structure, lots of different types and conversions. Next is to put this in the test script. Done. Next is to implement all the primitives. EDIT: Got prims for uart test, but don't get the correct result. Trying some ad-hoc tests and they seem fine:

m3 = do
  -- Some ad-hoc tests for SeqTH,SeqPrim combo.
  let test p = print $ SeqTH.run p $ map (:[]) [0..9]
  test $(SeqTH.compile [1] $ \[i] -> do c <- counter $ bits 3 ; return [c])
  test $(SeqTH.compile [4] $ \[i] -> do c <- integral i ; return [c])
  test $(SeqTH.compile [4] $ \[i] -> do c <- conc i (constant $ bits 1) ; return [c])
  test $(SeqTH.compile [4] $ \[i] -> do c <- slice i (Just 4) 1 ; return [c])

EDIT: Following the types too much. I think the inputs are not in the correct order? EDIT: That doesn't seem to be the problem. Running x_async_receiver_sample' it seems that the internal state machine is fine, going through the proper bit framing, but the sample pulse is missing so the shift register never triggers. Here's a discrepancy:

r5 = seqSLICE (seqInt 6) r3 (seqInt 0);
r7 = seqEQU (seqInt 1) r5 (seqInt 0);

I believe r5 is from:

phase <- slice count (Just 3) 0

The bit size is not correct. The bug is likely in SeqTerm:

slice (R a) b c = fmap R $ driven $ Slice (combTypes [a]) a b c

Yes. combTypes is not correct here. Also for CONC the sizes are different. EDIT: I think I got it now. Doing SeqTH is a good thing: it allows SeqTerm to be debugged without doing this through MyHDL tests. One more issue:

test-seq: Seq.sizeError: (IF,Just 2,Just 2,Just 2)
CallStack (from HasCallStack):
  error, called at ./SeqTerm.hs:169:23 in main:SeqTerm
Test suite test-seq: FAIL
Test suite logged to: /dev/stdout

Just a bug in the test. EDIT: uart quickcheck passes

Entry: Next?
Date: Thu Jul 26 12:27:34 EDT 2018

It's gotten quite far. Finalize quickcheck? Test the FIFO. OK

Entry: Next?
Date: Thu Jul 26 20:11:03 EDT 2018

I think I have pretty much everything needed. Complete the CPU? It's clear that most circuits are about decoders. So find a good way to express that. Can haskell serialization be used to encode types?

Entry: Why can't primitives be pure?
Date: Sat Jul 28 15:01:26 EDT 2018

Sharing. I'm going to have to keep doing that! It is very hard to internalize...

Entry: Next
Date: Sat Jul 28 17:48:28 EDT 2018

It's been quite a trip and I'm a little depleted at this point. Is there any application that would be interesting to do, besides a CPU? That will likely resume once brain is back online. EDIT: The next steps are about abstracting designs. The low level language seems done.

Entry: uart-controlled clock enable
Date: Sun Jul 29 00:39:23 EDT 2018

http://zipcpu.com/blog/2017/05/26/simpledbg.html

Entry: Logic analyzer
Date: Tue Jul 31 09:50:33 EDT 2018

TH can also be used to create logic analyzers. Likely we'll be fast enough. Just need to read from stdin to fit into the current saleae code from lars.
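A minimal sketch of that stdin glue, assuming SeqTH.run has the shape used in m3 above (compiled program + per-tick input lists); everything else here is made up:

import qualified Data.ByteString as B

-- 'prog' stands for a $(SeqTH.compile ...) splice; one byte from
-- stdin becomes one single-channel input sample per tick.
analyzerMain prog = do
  bs <- B.getContents
  let ins = map ((:[]) . fromIntegral) (B.unpack bs)
  mapM_ print $ SeqTH.run prog ins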
Entry: LLVM
Date: Tue Jul 31 21:22:09 EDT 2018

http://www.stephendiehl.com/llvm/
Implementing a JIT Compiled Language with Haskell and LLVM

Entry: Interesting avenues
Date: Tue Jul 31 21:49:33 EDT 2018

- dump to C or LLVM
- logic analyzers
- abstractions for streams
- CPU
- data kinds

Entry: Logic analyzers
Date: Tue Jul 31 21:51:19 EDT 2018

https://www.sump.org/projects/analyzer/
http://dangerousprototypes.com/docs/Open_Bench_Logic_Sniffer
Spartan 3E 250

Entry: custom quasiquoter
Date: Wed Aug 1 23:53:57 EDT 2018

https://wiki.haskell.org/Quasiquotation
So this is for String

quoteExprExp :: String -> TH.ExpQ
quoteExprPat :: String -> TH.PatQ

expr :: QuasiQuoter
expr = QuasiQuoter { quoteExp = quoteExprExp,
                     quotePat = quoteExprPat
                     -- with ghc >= 7.4, you could also
                     -- define quoteType and quoteDec for
                     -- quasiquotes in those places too
                   }

Entry: parallel if
Date: Thu Aug 2 08:50:48 EDT 2018

1. is it necessary? no
Again, learn how yosys works first.

Entry: A driver...
Date: Thu Aug 2 09:49:22 EDT 2018

Something I actually care about.

Entry: Behavioral vs. RTL
Date: Thu Aug 2 11:36:20 EDT 2018

I've never really understood the distinction here, and it appears that there is no clear-cut distinction.
http://www.clifford.at/yosys/files/yosys_manual.pdf
Reading the Yosys manual, the distinction is made between
Behavioral:
- register update
- if / case
- unrollable loop
RTL:
- combinatorial networks
- registers
So the distinction is really not that important for the Seq approach. Behavioral bits are implemented as macros.

... modern logic synthesis tools utilize much more complicated multi-level logic synthesis algorithms. Most of these algorithms convert the logic function to a Binary-Decision-Diagram (BDD) or And-Inverter-Graph (AIG) and work from that representation.

Yosys uses ABC
https://people.eecs.berkeley.edu/~alanmi/abc/
RTLIL is used for internal optimizations. It does seem to perform some high-level optimizations and recognition steps, so maybe it does make sense to keep the input syntax in some particular form. It's not quite clear if that yosys-specific optimization stuff is really needed if ABC is used. The way to find out is to have it dump some internal representations.

Entry: mode 0 spi
Date: Thu Aug 2 13:22:23 EDT 2018

logic analyzer:
- spi mode 0: clock=0, samples on 0->1
- something that's guaranteed to be fast (mutable arrays?)

Entry: arrays
Date: Thu Aug 2 13:32:46 EDT 2018

While the language is pure, it makes sense to compile TH to a monad instead. It can still be implemented as the identity monad and still be fast.
https://wiki.haskell.org/Arrays#Welcome_to_the_machine:_Array.23.2C_MutableArray.23.2C_ByteArray.23.2C_MutableByteArray.23.2C_pinned_and_moveable_byte_arrays
This needs some thought. I'm making several type errors in my head. Basically, the monad should contain the memories.
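Before redoing that, a warm-up sketch of fast mutable memory in ST, using the standard Data.Array.ST API (the names below are mine):

import Control.Monad.ST
import Data.Array.ST

-- A 256-entry word memory: allocate once, then read/write imperatively.
memDemo :: Int
memDemo = runST $ do
  arr <- newArray (0, 255) 0 :: ST s (STUArray s Int Int)
  writeArray arr 42 123   -- wAddr=42, wData=123
  readArray arr 42        -- rAddr=42
-- memDemo == 123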
First, learn how to use fast mutable memory in haskell, then redo this: update = LamE [TupP [memIn, stateIn, inputs]] $ DoE $ bindings' ++ [NoBindS $ AppE (VarE $ mkName "return") (TupE [memOut, stateOut, outputs'])] bindings' = [BindS (nodeNumPat n) (termExp e) | (n, e) <- partition E] op1 :: (Int -> Int) -> Int -> Int -> M Int op2 :: (Int -> Int -> Int) -> Int -> Int -> Int -> M Int op3 :: (Int -> Int -> Int -> Int) -> Int -> Int -> Int -> Int -> M Int run :: ((m, r, [Int]) -> SeqPrim.M (m, r, [Int]), (m, r)) -> [[Int]] -> SeqPrim.M [[Int]] run (mf, (m0, r0)) is = u m0 r0 is where u _ _ [] = [] u m r (i:is) = do (m',r',o) <- mf (m,r,i) os <- u m' r' is return (o : os) EDIT: I think it's doable to fix this using STUArray http://hackage.haskell.org/package/array-0.5.2.0/docs/Data-Array-ST.html But I've already reverted it.. Let's first build up some examples so it's clear that this works. Entry: seqInitMem Date: Fri Aug 3 08:13:13 EDT 2018 seqRun (\((r2, r1), (r3, r5, r7, r9), [r0]) -> do {r4 <- seqADD (seqInt 1) r0 r3; r6 <- seqADD (seqInt 4) r5 (seqInt 1); r8 <- seqADD (seqInt 5) r7 (seqInt 1); r10 <- seqADD (seqInt 6) r9 (seqInt 1); return ((r4, r6, r8, r10), [r3])}) ((seqInt 0, seqInitMem), (seqInt 0, seqInt 0, seqInt 0, seqInt 0)) The memories need to be initialized before the loop runs. Here's the diff. I'm going to revert it. tom@panda:~/asm_tools$ git diff diff --git a/SeqPrim.hs b/SeqPrim.hs index 669c340..0e4826f 100644 --- a/SeqPrim.hs +++ b/SeqPrim.hs @@ -7,7 +7,7 @@ module SeqPrim( seqADD, seqSUB, seqAND, seqEQU, seqIF, seqCONC, seqSLICE, - seqInt, seqInitMem, seqUpdateMem, + seqInt, seqInitMem, seqMemRd, seqMemWr, --seqUpdateMem, seqRun ) where import Data.IntMap.Strict @@ -51,15 +51,23 @@ seqEQU = op2 $ \a b -> if a == b then 1 else 0 seqCONC = op3 $ \bs a b -> (a `shiftL` bs) .|. b seqSLICE = op2 $ shiftR -seqInitMem :: IntMap Int -seqInitMem = empty +type Mem s = STUArray s Int Int +seqInitMem :: ST s (Mem s) +seqInitMem = newArray (0, 256) 0 -- FIXME: size! + +seqUpdateMem :: ((Int, Int, Int, Int), Mem s) -> ST s (Int, Mem s) +seqUpdateMem (args@(wEn,wAddr,wData,rAddr), arr) = do + rData <- readArray arr rAddr + case wEn of + 0 -> return () + 1 -> writeArray arr wAddr wData + _ -> error $ "seqUpdateMem: " ++ show args + return (rData, arr) + + +seqMemRd :: ((Int, Int, Int, Int), Mem s) -> ST s (Int, Mem s) +seqMemWr :: ((Int, Int, Int, Int), Mem s) -> ST s (Int, Mem s) -seqUpdateMem :: ((Int, Int, Int, Int), IntMap Int) -> (Int, IntMap Int) -seqUpdateMem ((wEn,wAddr,wData,rAddr), mem) = (rData, mem') where - rData = findWithDefault 0 rAddr mem - mem' = case wEn == 0 of - True -> mem - False -> insert wAddr wData mem seqInt :: Integer -> Int seqInt = fromIntegral @@ -70,37 +78,24 @@ seqInt = fromIntegral -- r: register state (tuple of Int) -- i/o is collected in a concrete [] type to make it easier to handle. -seqRun :: ((a, r, [Int]) -> forall s. ST s (r, [Int])) -> (a, r) -> [[Int]] -> [[Int]] -seqRun f (a,r0) i = runST $ u r0 i where - u _ [] = return [] - u r (i:is) = do - (r',o) <- f (a, r,i) - os <- u r' is - return (o:os) - - --- seqRun' :: --- (forall s. (m,r,[Int]) -> ST s (m, r, [Int]) --- ,(m,r)) -> [[Int]] -> [[Int]] --- seqRun' (f, (m0, r0)) is = runST $ u m0 r0 is where - --- u _ _ [] = return [] --- u m r (i:is) = do --- (m',r',o) <- f' (m,r,i) --- os <- (u m' r' is) --- return (o:os) - - -seqRun' :: ((m,r,[Int]) -> forall s. 
ST s (m, r, [Int])) -> (m, r) -> [[Int]] -> [[Int]] -seqRun' f (m0, r0) is = runST $ u m0 r0 is where - u _ _ [] = return [] - u m r (i:is) = do - (m',r',o) <- f (m,r,i) - os <- (u m' r' is) - return (o:os) - - --- seqRun = undefined +seqRun :: + ((a, r, [Int]) -> forall s. ST s (r, [Int])) + -> (forall s. ST s a) + -> r + -> [[Int]] -> [[Int]] +seqRun f ma r0 i = runST m where + m = do + -- Initialize mutable state (e.g. arrays) + a <- ma + -- Run loop + let u _ [] = return [] + u r (i:is) = do + (r',o) <- f (a, r,i) + os <- u r' is + return (o:os) + u r0 i + + -- For ST, it is important to understand which s parameters are -- specific, and which are generic. diff --git a/SeqTH.hs b/SeqTH.hs index ca56074..fa19181 100644 --- a/SeqTH.hs +++ b/SeqTH.hs @@ -46,18 +46,17 @@ toExp (outputs, bindings) = exp where -- trying to make the loop function and initial state explicit, -- which I don't want to understand yet. It seems best to just -- generate a closed expression. + exp = app3 (seqVar "Run") update memInit stateInit - - exp = app2 (seqVar "Run") update init - - init = TupE [memInit, stateInit] update = - LamE [TupP [memIn, stateIn, inputs]] $ + LamE [TupP [memRefs, stateIn, inputs]] $ DoE $ + memRead ++ bindings' ++ - [NoBindS $ AppE - (VarE $ mkName "return") - (TupE [stateOut, outputs'])] + memWrite ++ + (return' $ TupE [stateOut, outputs']) + + return' e = [NoBindS $ AppE (VarE $ mkName "return") e] partition t = map snd $ filter ((t ==) . fst) tagged tagged = map p' bindings @@ -69,8 +68,8 @@ toExp (outputs, bindings) = exp where p _ = E bindings' = - [BindS (nodeNumPat n) (termExp e) - | (n, e) <- partition E] + [BindS (nodeNumPat n) (termExp e) | + (n, e) <- partition E] -- I/O is more conveniently exposed as lists, which would be the -- same interface as the source code. State can use tuples: it will @@ -84,17 +83,43 @@ toExp (outputs, bindings) = exp where stateIn = tupP' $ map (nodeNumPat . fst) $ ds stateOut = tupE' [nodeExp n | (_, (Delay _ n)) <- ds] - mrs = partition MR - mi _ = tupE' $ [int 0, seqVar "InitMem"] - mr (rd, MemRd _ (MemNode mem)) = - tupP' [nodeNumPat rd, nodeNumPat mem] - memInit = tupE' $ map mi mrs - memIn = tupP' $ map mr mrs - memOut = - tupE' [AppE (seqVar "UpdateMem") $ - TupE [tupE' $ map nodeExp [a,b,c,d], - nodeNumExp n] - | (n, (MemWr (a,b,c,d))) <- partition MW] + + + -- For ST, memories are different. The question is whether to + -- implement it in the generated code, or to use functions. It + -- seems possible to implement MemRd and MemWr directly as monadic + -- operators. The use of tuples makes it hard to "fmap". + + -- Here's the current strategy: + -- . Create an initializer that produces a tuple of arrays + -- . Create imperative memory read functions, inserted at the start of the loop + -- . Same for write, at the end + + memRefs = tupP' $ map (nodeNumPat . fst) $ partition MR + memInit = tupE' [] + memRead = [BindS (nodeNumPat rData) () + | (rData, (MemRd td arr)) <- partition MR] + memWrite = [BindS _ _ | (_, (MemWr (_,_,_,_))) <- partition MW] + + + -- -- Memories need to be instantiated before the loop starts. 
+ -- mrs = partition MR
+ -- mi _ = tupE' $ [int 0, seqVar "InitMem"]
+ -- mr (rd, MemRd _ (MemNode mem)) =
+ --   tupP' [nodeNumPat rd, nodeNumPat mem]
+ -- memInit =
+ --   DoE $
+ --   [BindS (VarP $ mkName $ "m" ++ show n) (mi mr) | (mi,n) <- zip mrs [0..]]
+ --   ++ (return' $ tupE' [VarE $ mkName $ "m" ++ show n | (_,n) <- zip mrs [0..]])
+ --   -- tupE' $ map mi mrs
+ -- memIn = tupP' $ map mr mrs
+ -- memOut =
+ --   tupE' [AppE (seqVar "UpdateMem") $
+ --          TupE [tupE' $ map nodeExp [a,b,c,d],
+ --                nodeNumExp n]
+ --         | (n, (MemWr (a,b,c,d))) <- partition MW]

 -- FIXME: Use nested tuples for the state, memory collections.
@@ -112,7 +137,6 @@ opVar :: Show t => t -> Exp
 opVar opc = seqVar $ show opc
 seqVar str = VarE $ mkName $ "seq" ++ str
-
 termExp :: T -> Exp
 -- Special cases
tom@panda:~/asm_tools$

The main change needed is to inline the "memUpdate" as NoBindS, and provide an initializer. EDIT: Done. Was straightforward.

Entry: Unify
Date: Sun Aug 5 09:38:00 EDT 2018

- PRU
- Staapl macro forth
- CPU
- RTL
- DSP language
Is there a good way to model something at different levels of abstraction that fits into Haskell? It seems to be just tagless final + nested type classes. I think today it's time for the CPU. This will enable testing a couple of things:
- code gen for large decoders
- forth-like macro language
- efficiency of emulation
- compare pattern generated by loop between sim and fpga

Entry: Building a CPU
Date: Sun Aug 5 11:13:03 EDT 2018

The outer part of a CPU is the update of the instruction pointer in terms of the current instruction word.

closeIW :: Seq m r => SType -> SType -> (r S -> m ((r S, r S), o)) -> m o
closeIW tw ta f = do
  closeMem [tw] $ \[iw] -> do
    (ip', out) <- closeReg [ta] $ \[ip] -> do
      ((jmp, dst), out) <- f iw
      ip1 <- inc ip
      ip' <- if' jmp dst ip1
      return ([ip'], (ip',out))    -- comb out
    return ([(0, 0, 0, ip')], out) -- imem is read-only

This raises the question: what is the output of a CPU? The central idea is to perform composition of multiple of these "close" operations. The input/output of a cpu:
- instruction memory writes
- GPIn, GPOut

Entry: Hardware is annoyingly pure
Date: Sun Aug 5 11:18:04 EDT 2018

You can't just "assign" things! Everything needs to be bound just once. Approach digital design as gradually "closing" local feedback. Note that this imposes arbitrary hierarchy. Can that be avoided? Maybe it is even a good thing. EDIT: I think there is something to learn here about the way these compositions "commute". One key element seems to be to keep the input/output relation abstract. A "close" operation takes some state context and buries it, leaving (i -> m o). EDIT: Abstracted closeIMem with some data types in the interface. It's all quite straightforward, but testing this will require some infrastructure:
- make it possible to initialize memories
- have the test transfer data into the memory
The latter seems most useful, but I think the former is easier to do. EDIT: I have a CPU with a jump instruction

Entry: Memory init
Date: Sun Aug 5 13:47:24 EDT 2018

Using FPGA memory as ROM, what is the output of rData just out of reset? Is this an actual combinatorial readout? Look at the simulator spec. It doesn't specify a reset value.
https://www.latticesemi.com/-/media/LatticeSemi/Documents/ApplicationNotes/MO/MemoryUsageGuideforiCE40Devices.ashx?document_id=47775
Figure 3. EBR Module Timing Diagram 1
So it clearly says the data is only valid after the first read address has been clocked.
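Concretely, the implied startup timing (my reading of that diagram; imem stands for the block ram used as instruction memory):

  cycle 0: RADDR=0 clocked in; RDATA still undefined
  cycle 1: RDATA = imem[0]; RADDR=1 clocked in
  cycle 2: RDATA = imem[1]; ...

So right out of reset there is one clock cycle without a valid instruction word.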
To use this as instruction memory, either:
- ignore the first read (e.g. reset value = 0)
- use a run enable signal to ensure the first instruction is read
Practically, the CPU will be loaded by another one, so I'm assuming the run signal is explicit. Otherwise, use a single delay to start it. See the memory usage guide for ice40. To push this to verilog, have MyHDL insert something like:

defparam ram256x16_inst.INIT_0 = 256'h0000000000000000000000000000000000000000000000000000000000000000;
...

https://sourceforge.net/p/myhdl/mailman/message/33001534/

Entry: Sequencer
Date: Sun Aug 5 16:34:22 EDT 2018

So this really is pretty much it. The rest is instruction decoder and memory/stack access. Those are fairly involved by themselves but there doesn't seem to be any magic. E.g. a pattern sequencer needs:
- set io/o
- set loop count
- decrement and branch
This is best done using a driver application. It might be better to implement function calls first. For the application I don't yet need other kinds of conditional jumps, just nested Forth for..next loops. How to begin? One bit in the instruction can be used for:
- if zero, pop stack and continue
- if nonzero, decrement and jump
- push number to stack

Entry: Here's a big lesson
Date: Sun Aug 5 17:38:36 EDT 2018

Full-branch conditionals are too hard to use. It is often the case that a register only needs to be updated in a very specific case. It is too awkward to always have to specify the non-update case explicitly. But then again, it might be due to lack of good abstraction. The specific case I'm looking at is stack push/pop. Also, it might be possible to write this as a generic Seq extension, instead of a core function. EDIT: MyHDL and Verilog both support successive assignments in a single "process". Syntactically it is not a problem in Seq, but is the semantics correct? I really don't like this though.. Let's try to work without it for a while.

Entry: FIFO / Stack using grey code
Date: Sun Aug 5 17:51:00 EDT 2018

https://www.edaboard.com/showthread.php?157611-why-FIFO-design-using-grey-code
ice40 has carry logic, so maybe not necessary.

Entry: Write-through delay for FIFO and stack
Date: Sun Aug 5 18:26:24 EDT 2018

Here's a stack implementation:
https://github.com/jandecaluwe/myhdl-examples/blob/master/ChessPlayingFPGA/stack/stack_myhdl.py
For my use case, is it ok if the data is only available on the next cycle?

1. pop (computes new rAddr)
2. use rData

1. push
2. use rData <- has previous result
3. use rData <- has pushed result

So it seems that write-to-read delay is important. So I don't actually know if it is an issue. But if it is, it can likely be solved later by cleaning up a NOP workaround.

Entry: CPU design
Date: Sun Aug 5 19:33:34 EDT 2018

It's starting to get more clear that there are a lot of design choices!
1. multi-clock instruction cycle (e.g. PIC)
2. hazard-mitigation through NOP
3. stall
4. pipelining
What is my goal? To keep it simple, and to have deterministic execution. Leaving in the hazards seems best to get a first working design.

Entry: Stacks
Date: Sun Aug 5 19:43:00 EDT 2018

Looking at swapforth.
https://github.com/jamesbowman/swapforth/blob/master/j1a/verilog/stack2.v
It seems that the stacks are implemented as registers. This would avoid the delay issue. Study that some more, and see if it's possible to actually express this in Seq.

Entry: Pru emulator
Date: Mon Aug 6 12:03:48 EDT 2018

Basically I want to clock a state machine off of the GPIOs. Every time the clock switches, outputs get updated.
So this is essentially a map from GPO -> GPI.

Entry: Keep pseudo ops
Date: Mon Aug 6 13:38:40 EDT 2018

This should be a hook in the main class that is ignored for code generation, but gets executed in the emulator.

-test_int_logger = do
-  putStrLn "--- test_int_logger"
-  let log = do
-        t <- loadm Time
-        tell $ [t]
-      sample' = map (pseudo log >>) (sample :: [Src [Int]])
-      src = do
-        initRegs
-        bl_weave sample'
-        return ()
-      tick = compile src
-      [t1,t2] = take 2 $ logTrace tick (machineInit' 123 [10,11])
-  putStrLn $ "period: " ++ show (t2-t1)

Keep it specialized. I can't see a simple way to generalize it to nops in the target compiler.

Entry: closeMem
Date: Wed Aug 8 14:15:40 EDT 2018

Here's why closeMem is awkward: FIFOs decouple parts of the circuit, so you really want the two ends to be fairly separate entities. E.g. I have a circuit with 16 FIFOs. The reader end is a circuit that talks to all 16 FIFOs, and the writer end are 16 identical circuits, speaking to one FIFO each. This will need a closeMem operation that's quite high up the hierarchy. There is a more general point to make: sometimes there is a lot of criss-cross going on that is easy to solve when resources are named and binding (single assignment) is explicit. There is a really big tension between applicative style and "netlist style" or explicit single-assignment binding.

Entry: Indexed muxes
Date: Wed Aug 8 14:42:05 EDT 2018

Now, how to express larger multiplexers? These could be expressed as nested if statements, but MyHDL will generate a case statement if it sees successive ifs. Maybe it is time to handle nested ifs differently. EDIT: See SeqLib.index. Straightforward with zero-extended array, and recursive expansion based on bits of the indexing word (a reconstruction sketch follows a couple of entries down).

Entry: applicative style and instantiating spaghetti networks
Date: Wed Aug 8 18:15:10 EDT 2018

There is something to learn here.
1. Why is it hard to express this applicatively?
It might not be really that hard, but it's hard to write it down all at once. It is likely that arguments need to be added to functions to make this work. So here is the insight: It's not that different from closing over any other state such as registers. A closing function creates a more abstract I/O relation from subcircuits. The trick is to understand what the desired I/O lines are for the more abstract circuit. Then write those down as

f i = do
  ...
  o <- ...
  ...
  return o

If all the other pieces are written in applicative style, they will just fall into place.

Entry: CPU, sequencer
Date: Thu Aug 9 17:56:01 EDT 2018

closeIMem's Control struct now has "loop", which enables waiting for signals. I think I have enough now to perform basic sequencing using a single programmable sequencer + some simpler state machines that can be started in parallel. This is really nitty-gritty though. I noticed that last time as well. Hard to imagine without actually getting into it.
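The index sketch promised above — my own reconstruction, not the actual SeqLib.index source. It assumes 'slice s (Just n) lo' takes n bits starting at bit lo, "if'" is the mux primitive, and the signal list has already been zero-extended to a power-of-two length:

-- Select the i-th element by recursing on the bits of the index word.
index :: Seq m r => r S -> [r S] -> m (r S)
index _ [a] = return a
index i as = do
  b  <- slice i (Just 1) 0   -- low bit: even or odd position
  i' <- slice i Nothing 1    -- remaining index bits
  e  <- index i' evens
  o  <- index i' odds
  if' b o e                  -- b=1 selects the odd-position element
  where
    (evens, odds) = deal as
    deal (x:y:zs) = let (es, os) = deal zs in (x:es, y:os)
    deal zs       = (zs, zs)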
Entry: Reset is not implemented properly for SeqTH
Date: Thu Aug 9 19:59:56 EDT 2018

seqRun
  (\([], (), r6, [r0, r2, r4]) -> do
     {r1 <- seqSLICE (seqInt 1) r0 (seqInt 0);
      r3 <- seqSLICE (seqInt 1) r2 (seqInt 0);
      r5 <- seqSLICE (seqInt 8) r4 (seqInt 0);
      r7 <- seqCONC (seqInt 9) (seqInt 1) r5 (seqInt 0);
      r8 <- seqIF (seqInt 9) r3 r7 r6;
      r9 <- seqSLICE (seqInt 1) r8 (seqInt 0);
      r10 <- seqSLICE (seqInt 8) r8 (seqInt 1);
      r11 <- seqCONC (seqInt 9) (seqInt 8) (seqInt 1) r10;
      r12 <- seqIF (seqInt 9) r1 r11 r8;
      return ((), r12, [r9, r3, r1, r12])})
  [] ((), seqInt 0)

The problem was in SeqTerm.

https://github.com/YosysHQ/yosys/issues/103
The lattice chips don't have support for initial values on registers. Use a reset generator or something. Clifford mentions to use the LOCK signal of PLL out. Or something like this:

reg [7:0] resetn_counter = 0;
assign resetn = &resetn_counter;
always @(posedge clk) begin
  if (!resetn) resetn_counter <= resetn_counter + 1;
end

Entry: async tx
Date: Thu Aug 9 21:05:43 EDT 2018

This needs work. Revisit tomorrow.

Entry: bundling case
Date: Thu Aug 9 22:50:25 EDT 2018

Make a test case first:

(0,Comb3 (SInt Nothing 0) IF (Const (SInt Nothing 1)) (Const (SInt Nothing 1)) (Const (SInt Nothing 4)))
(1,Comb3 (SInt Nothing 0) IF (Const (SInt Nothing 1)) (Const (SInt Nothing 2)) (Const (SInt Nothing 5)))
(2,Comb3 (SInt Nothing 0) IF (Const (SInt Nothing 1)) (Const (SInt Nothing 3)) (Const (SInt Nothing 6)))

How to tackle this? Grouping will be easy, so the problem is creating the data structure. Grouping will be recursive. What about adding it straight to Term?

Entry: bit-serial architecture
Date: Thu Aug 9 23:21:34 EDT 2018

I've always found this very cool. So let's give it a try. See electronics.txt

Entry: Display
Date: Thu Aug 9 23:39:05 EDT 2018

I think it's time for something else. There is the beaglebone + FPGA + PRU idea. I have a bunch of tools now. What can be done with it? First thing is to hook this all up.

Entry: Blocks
Date: Fri Aug 10 08:37:25 EDT 2018

Imperative conditionals with explicit assignment are really convenient when working with state machines. Is there a way to express them? Maybe by making the default assignment explicit? The problem is that this needs local names in a way that I don't see work other than using something like lenses. It's probably easiest to use 'update' explicitly. Yea, the whole thing just doesn't fit well. It's either/or.

Entry: async_transmit
Date: Fri Aug 10 08:55:37 EDT 2018

Looks like it works now

[1,1,0,0,511,0]
[1,1,0,0,511,0]
[1,1,1,1,180,10]
[0,0,0,0,180,10]
[0,0,0,1,346,9]
[0,0,0,0,346,9]
[0,0,0,1,429,8]
[1,0,0,0,429,8]
[1,0,0,1,470,7]
[0,0,0,0,470,7]
[0,0,0,1,491,6]
[1,0,0,0,491,6]
[1,0,0,1,501,5]
[1,0,0,0,501,5]
[1,0,0,1,506,4]
[0,0,0,0,506,4]
[0,0,0,1,509,3]
[1,0,0,0,509,3]
[1,0,0,1,510,2]
[0,0,0,0,510,2]
[0,0,0,1,511,1]
[1,0,0,0,511,1]
[1,0,0,1,511,0]
[1,1,0,0,511,0]
[1,1,0,1,511,15]

Note that this can also be used

Entry: Figure out conversion to nested if elif else
Date: Fri Aug 10 09:15:53 EDT 2018

First, define an intermediate language derived from SeqExpr that can represent this.
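Perhaps something in this direction — entirely my own sketch, the names are not from SeqExpr:

-- Expressions stay in ANF; assignments can now live inside a
-- conditional block, so same-condition ifs can be bundled and
-- chains can later be printed as if/elif/else.
data Stmt n e
  = Assign n e                       -- n <- e
  | Cond e (Block n e) (Block n e)   -- if e then ... else ...
type Block n e = [Stmt n e]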
Here's the uart receiver after unsharing unique nodes, indented for clarity:

0 <- (INPUT)
3 <- (SUB (NODE 2) (CONST SInt Nothing 1))
5 <- (EQU (NODE 2) (CONST SInt Nothing 0))
6 <- (EQU (SLICE (NODE 2) 3 0) (CONST SInt Nothing 0))
7 <- (EQU (NODE 1) (CONST SInt Nothing 0))
9 <- (EQU (NODE 1) (CONST SInt Nothing 1))
12 <- (EQU (NODE 1) (CONST SInt Nothing 2))
15 <- (EQU (NODE 1) (CONST SInt Nothing 3))
32 <- (IF (NODE 7) (CONST SInt Nothing 0)
        (IF (NODE 9) (CONST SInt Nothing 0)
          (IF (NODE 12) (CONST SInt Nothing 0)
            (IF (NODE 15) (AND (NODE 6) (NODE 0)) (CONST SInt Nothing 0)))))
1 <- (DELAY (IF (NODE 7) (IF (NODE 0) (CONST SInt Nothing 0) (CONST SInt Nothing 1))
        (IF (NODE 9) (IF (NODE 5) (CONST SInt Nothing 2) (NODE 1))
          (IF (NODE 12) (IF (NODE 5) (CONST SInt Nothing 3) (NODE 1))
            (IF (NODE 15) (IF (NODE 5) (CONST SInt Nothing 0) (NODE 1)) (CONST SInt Nothing 0))))))
2 <- (DELAY (IF (NODE 7) (CONST SInt Nothing 3)
        (IF (NODE 9) (IF (NODE 5) (CONST SInt Nothing 63) (NODE 3))
          (IF (NODE 12) (IF (NODE 5) (CONST SInt Nothing 7) (NODE 3))
            (IF (NODE 15) (IF (NODE 5) (CONST SInt Nothing 0) (NODE 3)) (CONST SInt Nothing 0))))))
35 <- (DELAY (IF (IF (NODE 7) (CONST SInt Nothing 0)
                   (IF (NODE 9) (CONST SInt Nothing 0)
                     (IF (NODE 12) (NODE 6)
                       (IF (NODE 15) (CONST SInt Nothing 0) (CONST SInt Nothing 0)))))
               (CONC (NODE 0) (SLICE (NODE 35) 8 1)) (NODE 35)))

Yeah this isn't very useful... Don't do this on SeqExpr. Do it on SeqTerm first, then perform Expr on subexpressions.

Entry: SeqTerm and blocks
Date: Fri Aug 10 09:55:53 EDT 2018

Blocks are always guarded by conditionals. It seems simplest to get rid of the ternary if expression, and convert everything to nested block form. Can this be done using Free? First, make s-expression printout:

0 <- (INPUT)
3 <- (SUB (NODE 2) (CONST _:1))
4 <- (SLICE (NODE 2) 3 0)
5 <- (EQU (NODE 2) (CONST _:0))
6 <- (EQU (NODE 4) (CONST _:0))
7 <- (EQU (NODE 1) (CONST _:0))
8 <- (IF (NODE 0) (CONST _:0) (CONST _:1))
9 <- (EQU (NODE 1) (CONST _:1))
10 <- (IF (NODE 5) (CONST _:2) (NODE 1))
11 <- (IF (NODE 5) (CONST _:63) (NODE 3))
12 <- (EQU (NODE 1) (CONST _:2))
13 <- (IF (NODE 5) (CONST _:3) (NODE 1))
14 <- (IF (NODE 5) (CONST _:7) (NODE 3))
15 <- (EQU (NODE 1) (CONST _:3))
16 <- (AND (NODE 6) (NODE 0))
17 <- (IF (NODE 5) (CONST _:0) (NODE 1))
18 <- (IF (NODE 5) (CONST _:0) (NODE 3))
19 <- (IF (NODE 15) (CONST _:0) (CONST _:0))
20 <- (IF (NODE 15) (NODE 16) (CONST _:0))
21 <- (IF (NODE 15) (NODE 17) (CONST _:0))
22 <- (IF (NODE 15) (NODE 18) (CONST _:0))
23 <- (IF (NODE 12) (NODE 6) (NODE 19))
24 <- (IF (NODE 12) (CONST _:0) (NODE 20))
25 <- (IF (NODE 12) (NODE 13) (NODE 21))
26 <- (IF (NODE 12) (NODE 14) (NODE 22))
27 <- (IF (NODE 9) (CONST _:0) (NODE 23))
28 <- (IF (NODE 9) (CONST _:0) (NODE 24))
29 <- (IF (NODE 9) (NODE 10) (NODE 25))
30 <- (IF (NODE 9) (NODE 11) (NODE 26))
31 <- (IF (NODE 7) (CONST _:0) (NODE 27))
32 <- (IF (NODE 7) (CONST _:0) (NODE 28))
33 <- (IF (NODE 7) (NODE 8) (NODE 29))
34 <- (IF (NODE 7) (CONST _:3) (NODE 30))
1 <- (DELAY (NODE 33))
2 <- (DELAY (NODE 34))
36 <- (SLICE (NODE 35) 8 1)
37 <- (CONC (NODE 0) (NODE 36))
38 <- (IF (NODE 31) (NODE 37) (NODE 35))
35 <- (DELAY (NODE 38))

Then perform hoisting, something like:

19 <- (IF (NODE 15) (CONST _:0) (CONST _:0))
20 <- (IF (NODE 15) (NODE 16) (CONST _:0))
21 <- (IF (NODE 15) (NODE 17) (CONST _:0))
22 <- (IF (NODE 15) (NODE 18) (CONST _:0))

[19,20,21,22] <- IFS (NODE 15)
  [19 <- (CONST _:0), 20 <- (NODE 16), 21 <- (NODE 17), 22 <- (NODE 18)],
  [19 <- (CONST _:0), 20 <- (CONST _:0), 21 <- (CONST _:0), 22 <- (CONST _:0)]
It's not actually that easy. To make this readable, the nodes that are not shared elsewhere should also be moved inside the blocks, so they can later be eliminated. I don't see how to express this easily. Let it rest a bit.

Entry: Now here's another idea.
Date: Fri Aug 10 11:00:53 EDT 2018

What about making a language that is more like Verilog, and plug it into Seq? I can probably use Yosys to generate a netlist, then compile that into Seq for unit testing. Simulation in Haskell is nice, but expressing the primitive state machines is a bit of a pain.

Entry: Check yosys output
Date: Fri Aug 10 11:04:36 EDT 2018

This is actually more important. So the thing to do is to get everything to run on FPGA and see if Yosys can optimize what Seq produces. What to use? Maybe the UART makes sense? That's something I can just monitor easily. EDIT: I have no way to judge this output.. EDIT: Was wrong - was doing just logic optimization. Doing the ice40_synth does give something that doesn't look too bad. PNR phase says:

After packing:
IOs          4 / 206
GBs          0 / 8
  GB_IOs     0 / 8
LCs          27 / 7680
  DFF        11
  CARRY      5
  CARRY, DFF 0
  DFF PASS   0
  CARRY PASS 2
BRAMs        0 / 32
WARMBOOTs    0 / 1
PLLs         0 / 2

EDIT: It optimized out the shift register because I'm using only one bit. Here's with all 8 outputs routed out of the FPGA:

After packing:
IOs          11 / 206
GBs          0 / 8
  GB_IOs     0 / 8
LCs          34 / 7680
  DFF        18
  CARRY      5
  CARRY, DFF 0
  DFF PASS   7
  CARRY PASS 2
BRAMs        0 / 32
WARMBOOTs    0 / 1
PLLs         0 / 2

Which added 7 more logic cells. This doesn't look too bad really.

Entry: Variable names
Date: Fri Aug 10 12:02:09 EDT 2018

I need variable names. Otherwise this is going to be too hard to figure out. Maybe implement it using closeReg?

Entry: Next?
Date: Fri Aug 10 18:31:48 EDT 2018

I'm not sure if it's really necessary to do the if..else thing.

Entry: output to input dependency?
Date: Fri Aug 10 18:39:09 EDT 2018

Is it possible to compute the output based on the input in the ST here? Let's give it a try with a test case.

seqRun ::
  (forall s. ([Mem s], rd, r, [Int]) -> ST s (rd, r, [Int]))
  -> [Int] -> (rd, r) -> [Int -> Int] -> [[Int]] -> [[Int]]
seqRun f memBits (rd0, r0) memInits i = runST $ do
  a <- sequence $ zipWith seqMemInit memBits memInits
  let u _ _ [] = return []
      u rd r (i:is) = do
        (rd',r',o) <- f (a, rd, r, i)
        os <- u rd' r' is
        return (o:os)
  u rd0 r0 i

This produces <<loop>>:

l_st = seq where
  seq = ([0] : seq')
  seq' = seqRun (\([], (), (), [i]) -> return ((), (), [i])) [] ((), ()) [] seq

So it needs a lazy state implementation.

Entry: Assignment language
Date: Fri Aug 10 22:43:54 EDT 2018

So is it possible to do this actually?

if foo
  a <- 1
else
  a <- 2

No. Those are still bindings. It would need to be:

a <- signal
if foo
  connect a 1
else
  connect a 2

So I am really stuck with the pure approach. Or not? Here's an idea: when entering diverging blocks, a local context can be built up that keeps track of the guards of the conditional expression. Once the full expression is evaluated, all the guards can be gathered and an update network can be constructed for a particular variable. Still I wonder if it is really that useful. It seems like a lot of cruft. Maybe there is something to say for better factoring? EDIT: Do not waste time on this.

Entry: Testing uart transmit
Date: Sat Aug 11 07:54:20 EDT 2018

Do this using the receiver. It is currently not possible to have SeqTH code use feedback at the delay level due to strictness of ST.

Entry: Focus more on composition
Date: Sat Aug 11 08:03:08 EDT 2018

I now have fifo interfaces and ser/deser.
How to compose these on a higher level? The problem is still sequencers. How to abstract those better? Maybe start from the top this time. Build the loop controller for the application and see where that gets stuck. So here's a nice goal for today: get a basic CPU-like sequencer running on FPGA, producing an output loop. Maybe what I've been overlooking is the explicit construction of a bus, so let's start there.

Entry: CPU bus
Date: Sat Aug 11 08:21:25 EDT 2018

Read, write, address, data. A bus is quite similar to a memory. Practically, what is needed:
- write uart byte
- read + wait for uart done
- read byte from fifo
For the application I can avoid conditional jumps if there is a wait instruction. The wait instruction selects from one of a number of flags. So where to start? Once the basic structure is there, it is easy to modify and abstract. I feel dread starting this, so likely it is going to be more complicated than I thought.

Entry: Busses, cont
Date: Sat Aug 11 11:19:41 EDT 2018

With a bus, these are the basic instructions:
- jmp
- read (bus->reg)
- write (reg->bus)
- im (ins->reg)
(A hypothetical decoder sketch for these follows a few entries down.) But there still is no synchronization. I need a blocking read. Instead of using a register interface, what about using a port interface? But a register read is already needed. What I miss is a direct flag input. This could still be part of the bus. The missing bit is really the read operation. I'm thinking of this as one thing, but it is actually always split over two cycles:
- issue read
- wait for read to be stable on the bus
This could be the next instruction, but it might be a later one as well. With this in mind, look at an example bus: Wishbone
https://en.wikipedia.org/wiki/Wishbone_(computer_bus)
This seems overkill. What I need is just a read ready signal such that the cpu can read from a stream. Another important realization is that a bus is simpler when it has a master and a slave side, and needs special attention when these roles need to switch.

Entry: Read ready, streams
Date: Sat Aug 11 11:58:40 EDT 2018

Can they both be the same? The thing is to make sure that the CPU doesn't miss it, or to add a S/R + a latch. The primitive representation really should be the 1-cycle enable sample pulse. Then if a different interface is needed, add the latch + stable ready signal.

Entry: MyHDL memories
Date: Sat Aug 11 12:43:50 EDT 2018

todo

Entry: CPU hierarchy
Date: Sat Aug 11 13:31:30 EDT 2018

I've got the basic idea patched up. Decomposed into a couple of levels:
- top level io
- bus peripherals on bus
- cpu sequencer + memory
- instruction decoder
Ha.. that's why they are called peripherals! They sit between the CPU and the outside world. So this is great. Once that basic structure is there, the rest is incremental. Next: make a uart read/write test, or just the 2 main loops of the application. Uart write and some i/o and timing control.

Entry: Probe
Date: Sat Aug 11 18:24:51 EDT 2018

So I'm moving to an explicit probe functionality as part of Seq. This way there is no clutter, and everything can be easily probed by adding lines to the source code. EDIT: Propagated it to SeqTH. TODO:
- create bindings (always)
- allow parameter to pick out probes, reify to var names
EDIT: Fully implemented. Actually it is a lot better than creating d_ routines.

Entry: Next?
Date: Sun Aug 12 10:09:04 EDT 2018

Tests, mostly. The rest seems quite incremental. Focus on building a single reference design for a stack processor, then specialize it to the application.
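To make the jmp/read/write/im list above concrete, here is a minimal decoder sketch. The field layout is entirely made up, and the slice signature ('slice s (Just n) lo' = n bits starting at bit lo) is only assumed from its uses elsewhere in this log:

-- Hypothetical 16-bit instruction word: [opc:2][reg:6][arg:8].
decode :: Seq m r => r S -> m (r S, r S, r S)
decode iw = do
  opc <- slice iw (Just 2) 14   -- 0=jmp, 1=read, 2=write, 3=im (made up)
  reg <- slice iw (Just 6) 8    -- bus/register address
  arg <- slice iw (Just 8) 0    -- immediate / jump target
  return (opc, reg, arg)

None of this replaces actually running it, of course.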
It is time though to make tests on hardware; get the test circuitry set up. EDIT: With probes in place, continue with peripherals.

Entry: UART out on bus
Date: Sun Aug 12 12:36:01 EDT 2018

Write strobe doesn't work yet. Or does it? EDIT: works. Currently using blocking read for UART.

Entry: Path to FPGA
Date: Sun Aug 12 13:42:53 EDT 2018

- instantiate (initialized) memory in myhdl
- run program on FPGA to output on scope or LEDs
- hook up the programmer

Entry: MyHDL RAM
Date: Sun Aug 12 14:25:41 EDT 2018

Two things:
- Generate a code stub for the memory instructions in SeqTerm
- Possibly instantiate directly for synthesis?
The main question is: do I want a ROM? If I do, I'll need to do a whole lot more work. There is a tool to program just the ram:
https://stackoverflow.com/questions/36852808/modify-ice40-bitstream-to-load-new-block-ram-content
https://github.com/cliffordwolf/icestorm/tree/master/icebram
Basically, I will need access to the block ram over SPI anyway, so maybe best to not do this?

Entry: Synchronous SPI
Date: Sun Aug 12 14:36:17 EDT 2018

Previously, I've used async SPI. Since speed is not really a big problem, it might be easier to use synchronous SPI. It makes it easier to test inside Seq. Just curious: googling for SPI FPGA, what do we get? Sync or async? Sync.. I guess the reason is that it is easier to have a serial signal cross clock domains. What I did before was a hack.

Entry: Yosys FSM Detection
Date: Sun Aug 12 15:28:13 EDT 2018

http://www.clifford.at/yosys/files/yosys_manual.pdf
8.2.1 The fsm_detect pass identifies FSM state registers. It sets the \fsm_encoding= "auto" attribute on any (multi-bit) wire that matches the following description:
• Does not already have the \fsm_encoding attribute.
• Is not an output of the containing module.
• Is driven by single $dff or $adff cell.
• The \D -Input of this $dff or $adff cell is driven by a multiplexer tree that only has constants or the old state value on its leaves.
• The state value is only used in the said multiplexer tree or by simple relational cells that compare the state value to a constant (usually $eq cells).
How to see what it detects?

Entry: CPU startup
Date: Sun Aug 12 17:17:39 EDT 2018

- boot fpga
- transfer into memory
- start processor after cs:0->1

Entry: Next?
Date: Sun Aug 12 17:21:36 EDT 2018

Tired and bored, so I'll need something new. EDIT: a forth "next".

Entry: Loops
Date: Sun Aug 12 20:06:00 EDT 2018

if loop count is zero -> drop and continue
else -> decrement and jump
EDIT: ok works
Next for cpu? Some more tests that nail down the semantics. Figure out why suddenly a nop is needed for uart read.

Entry: State machine vs. instruction sequencer
Date: Sun Aug 12 23:44:51 EDT 2018

There is definitely a tipping point somewhere, so how to determine it?
https://en.wikipedia.org/wiki/Microsequencer
https://www.arl.wustl.edu/~jst/cse/260/ddc.pdf

Entry: Into the real world?
Date: Mon Aug 13 09:08:30 EDT 2018

To make this work for real, it is necessary to integrate into an existing verilog simulator, and/or have a verilog parser frontend.
http://hackage.haskell.org/package/verilog
https://github.com/tomahawkins/verilog

Entry: Memories
Date: Mon Aug 13 09:17:29 EDT 2018

Move it forward.
- MyHDL memories
- MyHDL simulation
- FPGA through FTDI upload
- FPGA wire-up
EDIT: I got it up to this point:

instantiation (TODO)
[s14_rd, s14_we, s14_wa, s14_wd, s14_ra] = s14

from mem bus:
@always_comb
def blk2():
    s17.next = ((s14_rd) if s13 else 0)

to mem bus:
@always_comb
def blk16():
    s14_we.next = 0
    s14_wa.next = 0
    s14_wd.next = 0
    s14_ra.next = s59

What's needed is some types to plug into the instantiation, and possibly a decoupling through passing env? E.g.

[s14_rd, s14_we, s14_wa, s14_wd, s14_ra] = env.memory("s14", 8, 16)

This way the instance itself doesn't need to be handled inside the generated routine. Then do the same for signal? EDIT: Probably best to create the instance and pass it on. TODO:
- Find addr,data sizes + add to instances.
- Renames from probes

Entry: MyHDL concat
Date: Mon Aug 13 12:11:46 EDT 2018

This can't take constants.

c_1_0 = Signal(modbv(0)[1:0])
c_1_1 = Signal(modbv(1)[1:0])

http://discourse.myhdl.org/t/constant-bit-vectors-in-a-concat-expression/121
A binary string fixes it:

def blk20():
    s63.next = (((concat(s18, "0")) if s62 else ((concat("1", (s63[9:1]))) if 1 else s63)))

EDIT: Done

Entry: MyHDL test bench
Date: Mon Aug 13 13:46:44 EDT 2018

I want to keep this in haskell. So find a way to run Python code and return the result.

Entry: Memory instantiation hierarchy problem
Date: Mon Aug 13 16:16:04 EDT 2018

I don't understand what's going wrong here.

Traceback (most recent call last):
  File "run_myhdl.py", line 140, in <module>
    load_and_run(sys.argv[1], sys.argv[2])
  File "run_myhdl.py", line 134, in load_and_run
    toVerilog(hdl_fun, env, *signals)
  File "/home/tom/priv/git-private/humanetics/gw_src/deps/asm_tools/myhdl/myhdl/conversion/_toVerilog.py", line 176, in __call__
    siglist, memlist = _analyzeSigs(h.hierarchy)
  File "/home/tom/priv/git-private/humanetics/gw_src/deps/asm_tools/myhdl/myhdl/conversion/_analyze.py", line 92, in _analyzeSigs
    assert(delta >= -1)
AssertionError

I think I understand. Signals need to be passed in from the top or created at the same level. They cannot be instantiated at lower levels and bubbled up. Trying to fix it:

s14_rd = Signal(modbv(0)[1:0])
s14_we = Signal(modbv(0)[1:0])
s14_wa = Signal(modbv(0)[1:0])
s14_wd = Signal(modbv(0)[1:0])
s14_ra = Signal(modbv(0)[1:0])
inst1 = ram.ram(CLK, s14_wa, s14_wd, s14_we, CLK, s14_ra, s14_rd)

s14_rd = env.sig(16, 0)
s14_we = env.sig(1, 0)
s14_wa = env.sig(8, 0)
s14_wd = env.sig(16, 0)
s14_ra = env.sig(8, 0)
inst1 = env.ram(CLK, s14_wa, s14_wd, s14_we, CLK, s14_ra, s14_rd)

Entry: MyHDL
Date: Tue Aug 14 09:27:04 EDT 2018

Make a CPython interface. Now it needs to be tested. Note that it is no longer necessary to use the macro to get to variable names. They can be added through the "probe" mechanism. Let's push that through first. EDIT: I/O names are still necessary.

Entry: MyHDL code gen cleanup
Date: Tue Aug 14 10:09:55 EDT 2018

Create a single generator. Currently there are 3: toPy, testbench, fpga. I need one interface. The problem is that there are a couple:

m [r S]            passed into compileTerm
[r S] -> m ()      mirrors MyHDL port api
[r S] -> m [r S]   mirrors Seq processor

What is needed?
- FPGA code generation
- Test benches
The problem is with the latter. So let's focus on code gen first, then figure out a way to implement i->o test benches. EDIT: got it sorted out + can execute from haskell, though it needs intermediate files.

Entry: next
Date: Tue Aug 14 15:40:27 EDT 2018

Everything is there to create a test bench for the soc to run in MyHDL, and to then run it on the hardware. What should this do?
Likely best to take the application mainloop. Currently what's in the way is the ability to upload SPI code to the running FPGA. I need to really go ahead and do this. And I really don't want to? Why is that? Because it is all so hard to debug. Can I make that easier? Can I make an always-on logic analyzer? Let's set that up instead. It's the software. I don't want to run a daemon and I don't want to have it streaming data constantly. What I want is the following:
- Send status once per second or per request
- Send status on change
Do this for all GPIOs. Use a dedicated blue pill board. EDIT: This is madness. Decision fatigue. I put everything together but stopped. I could just solder up the board and do it manually. Why is it so hard to start doing that? I guess I'm just tired.

Entry: Low hanging fruit?
Date: Tue Aug 14 17:06:37 EDT 2018

Or something interesting. I've not solved timing yet for the CPU. How to do this? One thing I need is

Entry: Instruction Sequencer vs. State Machine
Date: Tue Aug 14 17:41:45 EDT 2018

Here's the thing: a CPU, you only need to build once. A state machine is custom every time, because it is really hard to modularize.

Entry: A Forth
Date: Tue Aug 14 19:30:21 EDT 2018

So it's time for a mini Forth. Basically only to create loops. A monad with a compilation stack.

Entry: Tests
Date: Tue Aug 14 21:30:27 EDT 2018

- create a FPGA .v module to see what the resource usage is
- make a testbench, run it in myhdl
- move to FPGA

Entry: Bad hex value 1ff
Date: Tue Aug 14 22:34:27 EDT 2018

In _toVerilog.py:

def _convertInitVal(reg, init):
    if isinstance(reg, _Signal):
        tipe = reg._type
    else:
        assert isinstance(reg, intbv)
        tipe = intbv
    if tipe is bool:
        v = '1' if init else '0'
    elif tipe is intbv:
        v = "%s" % init if init is not None else "'bz"
    else:
        assert isinstance(init, EnumItemType)
        v = init._toVerilog()
    return v

init is originally set to the string "1ff". type(init) is <class 'str'>, and converted to string it is 1ff. I guess that's the problem? Implementation of this is in _intbv.py EDIT: fixed it by

- v = "%s" % init if init is not None else "'bz"
+ v = "h%s" % init if init is not None else "'bz"

Entry: Sizes of things
Date: Tue Aug 14 23:15:48 EDT 2018

140 LUTs for UART transmit and CPU. That's not a whole lot. I wonder what a good ratio is for LUT to transistor count. The 4004 had 2300 transistors. A FF is about 10 transistors. A LUT is about 20 transistors maybe? Accounting for low resource use? This says one LUT is 6 NAND gate equivalent. A CMOS NAND gate is 4 transistors. Which is 24 transistors per LUT.
https://blogs.synopsys.com/breakingthethreelaws/2015/02/how-many-asic-gates-does-it-take-to-fill-an-fpga/
So the CPU is about (* 24 140) 3360 transistor equivalent. The original Z80 was 8500 NMOS. Compare this to a PIC. PIC12C508: 8-bit, 1200 nanometre process.
https://en.wikipedia.org/wiki/PIC_microcontrollers
That's quite a big feature size. 1985 technology.
https://en.wikipedia.org/wiki/Microprocessor_chronology
https://www.pcmag.com/encyclopedia/term/49759/process-technology

Year  Nanometers (nm)  Micrometers (µm)
1957  120,000          120.0
1963  30,000           30.0
1971  10,000           10.0
1974  6,000            6.0
1976  3,000            3.0
1982  1,500            1.5
1985  1,300            1.3
1989  1,000            1.0
1993  600              0.6
1996  350              0.35
1998  250              0.25
1999  180              0.18
2001  130              0.13
2003  90               0.09
2005  65               0.065
2008  45               0.045
2010  32               0.032
2012  22               0.022
2014  14               0.014
2017  10               0.010
2018  7                0.007
??
      5                0.005

Entry: FPGA reverse engineering
Date: Tue Aug 14 23:58:00 EDT 2018

https://hackaday.com/2018/01/17/34c3-reverse-engineering-fpgas/

Entry: critical path
Date: Wed Aug 15 08:12:41 EDT 2018

Use icetime. Later, upgrade tools and have it do timing-based PNR. It's about 2x the 36MHz. If this gives issues, it is always possible to create two clock domains, where clock recovery and FIFO write is done in the 36MHz domain, and the CPU and control run at a lower rate.

tom@panda:~/asm_tools$ make x_soc_fpga.ct256.time
icetime -p x_soc_fpga.pcf -o x_soc_fpga.ct256.nl.v -P ct256 -d hx8k -t x_soc_fpga.ct256.asc
// Reading input .pcf file..
// Reading input .asc file..
// Reading 8k chipdb file..
// Creating timing netlist..
icetime topological timing analysis report
==========================================
Warning: This timing analysis report is an estimate!

Report for critical path:
-------------------------
ram_25_15 (SB_RAM40_4K) [clk] -> RDATA[13]: 2.246 ns
  2.246 ns net_99005 (s14_rd[13])
  t521 (LocalMux) I -> O: 0.330 ns
  inmux_24_16_99177_99230 (InMux) I -> O: 0.260 ns
  lc40_24_16_5 (LogicCell40) in0 -> lcout: 0.449 ns
  3.284 ns net_95054 ($abc$1273$n154_1)
  t473 (LocalMux) I -> O: 0.330 ns
  inmux_23_16_95119_95168 (InMux) I -> O: 0.260 ns
  lc40_23_16_7 (LogicCell40) in3 -> lcout: 0.316 ns
  4.189 ns net_90979 ($abc$1273$n5)
  odrv_23_16_90979_50336 (Odrv12) I -> O: 0.540 ns
  t393 (Sp12to4) I -> O: 0.449 ns
  t395 (Span4Mux_v3) I -> O: 0.337 ns
  t394 (LocalMux) I -> O: 0.330 ns
  inmux_18_19_75079_75125 (InMux) I -> O: 0.260 ns
  lc40_18_19_3 (LogicCell40) in0 -> lcout: 0.449 ns
  6.552 ns net_70960 ($abc$1273$n7)
  odrv_18_19_70960_75043 (Odrv12) I -> O: 0.540 ns
  t223 (Sp12to4) I -> O: 0.449 ns
  t222 (Span4Mux_v2) I -> O: 0.252 ns
  t221 (IoSpan4Mux) I -> O: 0.323 ns
  t220 (LocalMux) I -> O: 0.330 ns
  t219 (IoInMux) I -> O: 0.260 ns
  t218 (ICE_GB) USERSIGNALTOGLOBALBUFFER -> GLOBALBUFFEROUTPUT: 0.617 ns
  t217 (gio2CtrlBuf) I -> O: 0.000 ns
  t216 (GlobalMux) I -> O: 0.154 ns
  t215 (INTERCONN) I -> O: 0.000 ns
  t214 (LocalMux) I -> O: 0.330 ns
  inmux_16_15_66433_66498 (InMux) I -> O: 0.260 ns
  lc40_16_15_6 (LogicCell40) in0 -> lcout: 0.449 ns
  10.515 ns net_62318 ($abc$1273$n48)
  odrv_16_15_62318_65784 (Odrv12) I -> O: 0.540 ns
  t168 (Span12Mux_v9) I -> O: 0.421 ns
  t167 (LocalMux) I -> O: 0.330 ns
  t166 (IoInMux) I -> O: 0.260 ns
  t165 (ICE_GB) USERSIGNALTOGLOBALBUFFER -> GLOBALBUFFEROUTPUT: 0.617 ns
  t164 (gio2CtrlBuf) I -> O: 0.000 ns
  t163 (GlobalMux) I -> O: 0.154 ns
  12.836 ns seg_18_16_glb_netwk_5_6

Resolvable net names on path:
  2.246 ns ..  2.835 ns s14_rd[13]
  3.284 ns ..  3.873 ns $abc$1273$n154_1
  4.189 ns ..  6.104 ns $abc$1273$n5
  6.552 ns ..  6.552 ns $abc$1273$n7
 10.066 ns .. 10.066 ns $abc$1273$n7$2
 10.515 ns .. 10.515 ns $abc$1273$n48

Total number of logic levels: 5
Total path delay: 12.84 ns (77.90 MHz)

Entry: rle
Date: Wed Aug 15 08:20:54 EDT 2018

The simplest way is to write down explicitly what is expected:
- remove preamble
- generate the periodic rle signal a couple of times
- compare
It's no longer possible to run lazily with SeqTH, so maybe create a workaround for that?

Entry: MyHDL is picky
Date: Wed Aug 15 10:19:18 EDT 2018

Top level circuit cannot use *args, as argument names are recovered. So I'm including the reset generator instantiation in the generated module.

rst = ice40_reset(CLK, RST)

This means reset should be an output also.

Entry: Build cleanup
Date: Wed Aug 15 11:06:55 EDT 2018

One Haskell file per FPGA project?

Entry: Yosys ram init
Date: Wed Aug 15 12:02:21 EDT 2018

blif below,
indented for readability. It should be possible to identify these based on any of the names of the attached ports. How does the ram write tool work?
https://github.com/cliffordwolf/icestorm/tree/master/icebram
It seems there are 3 options:
- manually instantiate SB_RAM40_4K and provide init params
- edit the .blif
- edit the .asc
I believe the init parameters are bit vectors, so the least significant word is to the right.

.gate SB_RAM40_4K
  MASK[0]=$true MASK[1]=$true MASK[2]=$true MASK[3]=$true
  MASK[4]=$true MASK[5]=$true MASK[6]=$true MASK[7]=$true
  MASK[8]=$true MASK[9]=$true MASK[10]=$true MASK[11]=$true
  MASK[12]=$true MASK[13]=$true MASK[14]=$true MASK[15]=$true
  RADDR[0]=s60_ra[0] RADDR[1]=s60_ra[1] RADDR[2]=s60_ra[2] RADDR[3]=s60_ra[3]
  RADDR[4]=s60_ra[4] RADDR[5]=s60_ra[5] RADDR[6]=s60_ra[6] RADDR[7]=s60_ra[7]
  RADDR[8]=$false RADDR[9]=$false RADDR[10]=$false
  RCLK=CLK RCLKE=$true
  RDATA[0]=s60_rd[0] RDATA[1]=s60_rd[1] RDATA[2]=s60_rd[2] RDATA[3]=s60_rd[3]
  RDATA[4]=s60_rd[4] RDATA[5]=s60_rd[5] RDATA[6]=s60_rd[6] RDATA[7]=s60_rd[7]
  RDATA[8]=s60_rd[8] RDATA[9]=s60_rd[9] RDATA[10]=s60_rd[10] RDATA[11]=s60_rd[11]
  RDATA[12]=s60_rd[12] RDATA[13]=s60_rd[13] RDATA[14]=s60_rd[14] RDATA[15]=s60_rd[15]
  RE=$true
  WADDR[0]=$false WADDR[1]=$false WADDR[2]=$false WADDR[3]=$false
  WADDR[4]=$false WADDR[5]=$false WADDR[6]=$false WADDR[7]=$false
  WADDR[8]=$false WADDR[9]=$false WADDR[10]=$false
  WCLK=$false WCLKE=$false
  WDATA[0]=$false WDATA[1]=$false WDATA[2]=$false WDATA[3]=$false
  WDATA[4]=$false WDATA[5]=$false WDATA[6]=$false WDATA[7]=$false
  WDATA[8]=$false WDATA[9]=$false WDATA[10]=$false WDATA[11]=$false
  WDATA[12]=$false WDATA[13]=$false WDATA[14]=$false WDATA[15]=$false
  WE=$true
.attr src "/usr/local/bin/../share/yosys/ice40/brams_map.v:191|/usr/local/bin/../share/yosys/ice40/brams_map.v:35"
.param INIT_0 through INIT_F: each a string of 256 'x' bits (uninitialized).
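Assuming that ordering (least significant word rightmost, i.e. 16 words of 16 bits per 256-bit INIT_x), packing memory contents into one param string would look roughly like this — a sketch, not icebram's actual code:

import Data.Bits (testBit)

-- One INIT_x value as a plain binary digit string, word 0 rightmost.
initParam :: [Int] -> String
initParam ws = concatMap bits (reverse ws) where
  bits w = [ if testBit w i then '1' else '0' | i <- [15,14..0] ]

-- e.g. initParam (1 : replicate 15 0) ends in ...0000000000000001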
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx .param INIT_7 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx .param INIT_8 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx .param INIT_9 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx .param INIT_A xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx .param INIT_B xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx .param INIT_C xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx .param INIT_D xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx .param INIT_E xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx .param INIT_F xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Entry: FPGA load program Date: Wed Aug 15 12:19:50 EDT 2018 First, see if there is a sign of life when just modifying the blif manually. Extend the iceprog program to send bytes to SPI. EDIT: Done. Now, how to bootstrap into this? Make a SPI command to set the other 7 LEDs. Temp setup on tp: cd /i/tom/asm_tools export PATH=/i/tom/git/icestorm/iceprog/:$PATH make f_blink.ct256.iceprog dd if=/dev/urandom of=/tmp/test.bin bs=1 count=1 ; iceprog -x /tmp/test.bin echo -ne 'A' >/tmp/test.bin ; iceprog -x /tmp/test.bin Entry: PLL setup Date: Wed Aug 15 13:17:31 EDT 2018 https://github.com/YosysHQ/yosys/issues/107 Entry: iceprog Date: Wed Aug 15 14:15:43 EDT 2018 Code says 6MHz, but board runs at 12MHz, so this might not be sampled correctly. Does the 6MHz reflect the data rate or the actual clock signal frequency? E.g.
if the actual clock frequency is 3MHz then we're good. BITMODE_MPSSE void send_spi(uint8_t *data, int n) { if (n < 1) return; send_byte(0x11); send_byte(n-1); send_byte((n-1) >> 8); int rc = ftdi_write_data(&ftdic, data, n); if (rc != n) { fprintf(stderr, "Write error (chunk, rc=%d, expected %d).\n", rc, n); error(); } } http://www.ftdichip.com/Support/Documents/AppNotes/AN_135_MPSSE_Basics.pdf Ok, I'm not going to find this by staring at code and nonexistent documentation. https://learn.adafruit.com/adafruit-ft232h-breakout/mpsse-setup https://github.com/devttys0/libmpsse Use this code as documentation: https://github.com/devttys0/libmpsse.git From this: This command sets div x5, generating 12MHz from 60MHz TCK_X5 = 0x8A, Then putting system_clock=12MHz in here, the divisor for freq=6MHz is 0. /* Convert a frequency to a clock divisor */ uint16_t freq2div(uint32_t system_clock, uint32_t freq) { return (((system_clock / freq) / 2) - 1); } factor reg=factor-1 6MHz 1 0 3MHz 2 1 2MHz 3 2 ... So I can just change that. Big or little endian? From the other examples it seems to be little endian. send_byte(0x86); - send_byte(0x00); + send_byte(0x01); send_byte(0x00); Entry: Next Date: Wed Aug 15 15:02:30 EDT 2018 SPI seems to be working. Now: - create flow to push program out to .bin file - create test file that emulates spi upload at 3MHz +- some fractional error - have it put a pattern on the LEDs, test with rle sim EDIT: Running into a problem. Instead of nesting all these things, it might be better to perform the composition inside one top-level function. Something like "make_soc", where all the components are passed in. Components: - CPU - instruction sequencer - boot (memory write) - bus - reset circuitry Entry: FPGA board test Date: Wed Aug 15 19:11:43 EDT 2018 The next thing to do is to create a program that writes 'U' to the LEDs. From there on, a couple of different programs could be tried. Entry: No sign of life for CPU Date: Wed Aug 15 22:02:47 EDT 2018 Step is too big. Reduce: bring ip out. Can this be done with probes? EDIT: Adding environment-based debug cross-cuts. It runs somewhat; I'm seeing some patterns but nothing completely consistent. tom@tp:/i/tom/asm_tools$ echo 'UUUU' >/tmp/test.bin tom@tp:/i/tom/asm_tools$ iceprog -x /tmp/test.bin This gives 10100101 So it's not clocking in properly. One problem: the fifo pointer isn't resetting. On the first load, I see "02". I wonder if spi resets properly as well.. Entry: State can linger. Date: Wed Aug 15 23:04:09 EDT 2018 When creating tests: make sure to run them multiple times in a row to make sure devices reset properly. Entry: deser Date: Thu Aug 16 08:16:56 EDT 2018 There is something wrong with the deser that is not obvious in the simulation. Write one from scratch and look at the difference? There should be a simpler way to keep this under control. Basically, I don't see what's happening. Maybe it's time to take a break from this. Entry: Debug probes Date: Thu Aug 16 09:01:41 EDT 2018 Is it enough to use the environment-based probing? At the bottom level, probes are just ignored if they are not defined. At the top level, probes can be pulled out if they are defined, and raise an error if they are not. This seems better altogether. Maybe a combination of both will do? Use the first probe mechanism to collect names into the meta level, then use the former environment mechanism to teleport signals.
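Sketching the collection half of that, stripped to its essence. This is a hedged sketch, not the actual Seq code: the names probe and withProbes are made up, and real Seq signals carry more structure than a bare type parameter.

import Control.Monad.Writer

type Probes s = [(String, s)]

-- Bottom level: tag a signal with a name and pass it through.
-- Code containing probes does not change meaning; the tags just
-- accumulate on the side, and are ignored if nobody collects them.
probe :: MonadWriter (Probes s) m => String -> s -> m s
probe name sig = do
  tell [(name, sig)]
  return sig

-- Top level: run a circuit fragment and pull out the tagged signals,
-- e.g. to hand them to a test bench or a tracer.
withProbes :: Writer (Probes s) a -> (a, Probes s)
withProbes = runWriter

The environment half would then consume the collected name->signal pairs and route them to wherever the debug logic lives.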
Entry: Smaller step Date: Thu Aug 16 10:30:49 EDT 2018 SPI works, but writing to buffer, then executing, doesn't work. What about making this simpler: add readout first, and see what that gives. Another thing is to clock the CPU from SPI, and pull out debug information that way. Reading is not expensive: it is just routing. Writing is expensive: it requires multiplexing at the input of a register. So what about adding a tracer for registers? I have the probe mechanism, so a state machine could just go through all of this. Entry: State machines Date: Thu Aug 16 10:36:39 EDT 2018 So what is my problem with state machines? They are hard to write. But, I have a macro mechanism at my disposal, so why not expand it from a higher level description? Basically, a lot of state machines have a structure that is: - act, wait, act, wait, ... Where each step is a state. Entry: Buffer Date: Thu Aug 16 10:43:20 EDT 2018 - Create an abstraction that splits a memory into two parts: one that can read over SPI, one that can write. - Abstract SPI cs,sck into bc,rst. The latter is maybe the most important part: external data representation vs. internal. For SPI it is ok to assume that cs=rst. Entry: What do you call a clock enable pulse? Date: Thu Aug 16 10:45:15 EDT 2018 Calling it clock enable (as is done on a fifo) is cumbersome, so call it clock instead. Added a note to SeqLib.hs. Entry: FPGA SPI bug Date: Thu Aug 16 11:35:36 EDT 2018 Here's the theory: the sampling edge is 0->1, but the initial clock phase is 1. The FPGA is immune to this, but my edge detector is not. Let's just look at the scope. Nah, I need to solder... So let's look at the iceprog code. This is how it starts a transfer when in MPSSE modes. send_byte(0x11); send_byte(n-1); send_byte((n-1) >> 8); The code here is more readable: https://github.com/devttys0/libmpsse https://www.intra2net.com/en/developer/libftdi/documentation/ftdi_8h.html #define MPSSE_WRITE_NEG 0x01 /* Write TDI/DO on negative TCK/SK edge*/ #define MPSSE_DO_WRITE 0x10 /* Write TDI/DO */ So it writes the data on the negative edge. Which means the initial polarity is 1. Now, make a test for SPI phase. Basically, specify this properly. Entry: I can't slow it down? Date: Thu Aug 16 15:35:05 EDT 2018 It's working, but I still can't slow it down. Only works with this: // clock divide send_byte(0x86); send_byte(0x00); send_byte(0x00); Filling in any other div kills both the IRAM upload and the main FPGA image upload. Or maybe I understand this badly. Entry: Generate verilog directly Date: Thu Aug 16 16:30:26 EDT 2018 This could just be straight up RTL. module ( ... ) { input ; ... output ; ... wire [:0] ; reg [:0] ; assign = ; } Then the sequential bit could be one big block. Entry: Why do FPGAs use LUTs? Date: Thu Aug 16 17:11:54 EDT 2018 I don't know, but a guess is that a LUT doesn't introduce any bias against certain functions. Each n-ary function is implemented with the same cost. As such, LUTs are universal. They are probably also easy to implement as compared to other structures. The question is more: why 3 to 1, and not any other n to m configuration? There does seem to be some variability here. 3-1, 4-1, even mult-out? This says 4-1 is best for size, 5-1 is best for performance. https://ieeexplore.ieee.org/document/1410611/ Entry: So CPU works Date: Thu Aug 16 17:47:40 EDT 2018 Next: don't use it! Seriously though, there is a middle ground here: generating state machines from static programs. Here's the outline of a practical program I need to write.
The input/output are abstracted, but essentially they perform two kinds of actions: - write to a register - send a start pulse - wait for a done pulse e.g. - write - wait - n times: - m times: - write - wait - wait - n times - write, pulse - wait - pulse - wait The essence of this structure is the nesting. To flatten this requires flattening out the nesting. What makes a CPU interesting is that it can do loops. A stack then adds nested loops. To translate all that into a state machine requires state for each level + possibly repurposing the state for the next loop. I don't even have to work this out to know that the CPU approach is _much_ simpler once there is any form of nesting. Note that a UART already performs nesting: - top level states: idle->start->data->stop->idle - inside data there are N data states So to create a UART, one could also just create a CPU. Let's look at that in the next post. Entry: Bitbang UART vs. hardware Date: Thu Aug 16 18:06:34 EDT 2018 How much more complicated is a CPU implementing a UART as a program, compared to a hardware state machine? - wait start - delay 1/2 - n times - delay 1 - sample - delay 1 - check stop - delay 1/2 Another question: since bit size will be known, why not a flat state machine? Here's an idea. What is easy to do? - flat state machines - flat + shared counter in each stage or generalized: any shared abstract state machine It seems that the important idea is to be able to share resources in the several sub states. If that is possible, or even natural (a counter for instance), a state machine will be ok. OTOH if there are states where resource sharing is not obvious, a sequential program will be more appropriate. Entry: Lessons Learned Date: Thu Aug 16 18:52:47 EDT 2018 1) You can't escape the fact that you're dealing with circuits. 2) Resource sharing is about multiplexing. 3) Generated Verilog can be simple (just RTL). 4) Monadic do notation is still awkward, and there do not seem to be any obvious workarounds. Entry: CPU resource use Date: Thu Aug 16 19:04:16 EDT 2018 After packing: IOs 15 / 206 GBs 0 / 8 GB_IOs 0 / 8 LCs 238 / 7680 DFF 104 CARRY 37 CARRY, DFF 5 DFF PASS 44 CARRY PASS 14 BRAMs 1 / 32 WARMBOOTs 0 / 1 PLLs 0 / 2 Entry: Better instruction packing Date: Thu Aug 16 19:07:23 EDT 2018 Currently mostly unused: 3 bits 5 bits reserved 8 bits argument If it is ok to have all zero operand instructions, something like this could work: H 7 8 - literal L 15 - packed Where 7 and 15 bits contain packed instructions. Anyways a lot is possible still. This isn't even a 2-stack machine. Here's an approach: 1 + 3*5, which can pack 5 instructions. The other branch can then be used for literals and jumps. Anyways this all needs to be evaluated relative to an application, because there are otherwise too many variables to even begin to define "optimal". In my case, "optimal" means simple. Before going anywhere, base it on Moore or Koopman: https://users.ece.cmu.edu/~koopman/stack_computers/sec6_3.html Entry: soc spi Date: Thu Aug 16 19:51:37 EDT 2018 Now that boot works, maybe add a SPI command mode? Entry: Subroutine call / return Date: Thu Aug 16 20:51:46 EDT 2018 I don't care so much about using more instructions inside the subroutine (return), but the call site should be small. Call could then load the instruction pointer on the stack, and return pops it. Actually, the next pointer is already available. Entry: single stack is enough for a control processor?
Date: Thu Aug 16 20:59:31 EDT 2018 Basically, if a subroutine does not need to return an argument? Currently if I want to pass a single argument to a routine, it would be: push ; call swap ; drop I.e. the subroutine would start with swap to move the return address to the side. Entry: variables Date: Thu Aug 16 21:02:53 EDT 2018 Use the other memories for variables? Or just use them as stacks? That way no logic is needed besides the pointers. Entry: CPU designs Date: Thu Aug 16 21:04:35 EDT 2018 - single stack control computer - dual stack Forth machine Entry: A CPU: It's inverted Date: Thu Aug 16 21:45:35 EDT 2018 Signals are computed from their inputs, but for some reason this seems inverted. I.e. it's more natural to think of a big OR statement as "also set if". In my first circuit I found this was exactly the problem to get around. Entry: How expensive is the stack? Date: Thu Aug 16 22:42:04 EDT 2018 3 x 8 After packing: IOs 15 / 206 GBs 0 / 8 GB_IOs 0 / 8 LCs 302 / 7680 DFF 120 CARRY 37 CARRY, DFF 5 DFF PASS 52 CARRY PASS 14 BRAMs 1 / 32 WARMBOOTs 0 / 1 PLLs 0 / 2 4 x 8 After packing: IOs 15 / 206 GBs 0 / 8 GB_IOs 0 / 8 LCs 310 / 7680 DFF 128 CARRY 37 CARRY, DFF 5 DFF PASS 52 CARRY PASS 14 BRAMs 1 / 32 WARMBOOTs 0 / 1 PLLs 0 / 2 So it's just the FFs. Basically, stack depth is cheap. What about width? Going from 4x8 -> 4x12: that makes a bigger difference. But it also adds logic for the memories. After packing: IOs 15 / 206 GBs 0 / 8 GB_IOs 0 / 8 LCs 391 / 7680 DFF 158 CARRY 49 CARRY, DFF 5 DFF PASS 58 CARRY PASS 14 BRAMs 16 / 32 WARMBOOTs 0 / 1 PLLs 0 / 2 Entry: Removed old notes from CPU.hs Date: Thu Aug 16 23:49:53 EDT 2018 -- Original notes on stack machines: -- Memory seems to be the most important component. I'm going to -- target the iCE40, which has a bunch of individual memories, -- allowing separate busses for instruction, data and return stack, -- and the rest bundled as data memory. -- At every clock, each of the 4 memories has a word sitting on its -- read port: -- i: current instruction -- d: 2nd on stack (top is in a register) -- r: return address (ip is the instruction memory's write port) -- The instruction word drives the decoder, which drives all the -- muxes. -- It seems that reading out instructions is the most useful thing to -- start with. This could be used for specialized sequencers that are -- not necessarily general purpose CPUs. This can then be gradually -- extended to more abstract operations. -- The main problem for building a CPU is to properly decompose the -- decoder. I'm not sure how to do this exactly, so just start in an -- ad-hoc way. -- There is some arbitrariness here: a hierarchy is created in the -- nesting of the "close" operations. The guideline is to abstract -- away a register as soon as possible, i.e. move it to the inner part -- of the hierarchy. -- At the very top, there is: -- . instruction memory access: -- . read: program sequencing -- . write: bootloader -- . BUS I/O (i.e. containing GPIO) -- Each hierarchy level is an adaptation. closeIW will abstract the -- inner decoder as an iw -> jump operation, and insert the necessary -- logic to either just advance to the next instruction, or perform a -- jump. -- The original problem that drove this exploration is meanwhile -- implemented on PRU.
These were the instructions needed: -- -- a) loop n times -- b) write UART byte, wait until done -- c) wait -- d) set I/O -- e) read I/O into memory and advance pointer -- To implement loops, it would be useful to have a stack to be able -- to have nested loop counters. This would mean fewer registers. I'm -- not going to be able to make this simpler than making a small Forth -- machine.. This way: -- UART out can be bit-banged. -- Multiple counters not needed for timing control. -- No "wait" instruction needed: instruction counting suffices. -- Add a data stack when needed. Probably a single top register is enough. -- The basic instructions seem straightforward. This is just a -- decoder that fans out into mux controls. The unknown part to me is -- the call/return. -- Call: move IP+1 -> rtop write port -- inc rpointer -- set ip from instruction word -- Ret: dec rpointer -- move rtop -> IP -- This could also be microcoded: -- a) load literal into rdata -- b) increment rstack -- c) unconditional jump -- The operations that can be reused are: -- write, postinc (stacks + buffers) -- read, predec -- So there is a clear tradeoff between the complexity of the -- instruction decoder, and the number of instructions needed. -- Where to start? Conditional memory write. -- So for unidirectional flow, this is easy. For bi-directional such -- as a stack, two pointers need to be maintained. It might be -- simplest to initialize them such that the write/read operation can -- happen immediately? Both will have individual adders. Maybe not a -- good idea? -- Perform an operation and wait for it to finish. -- Let's keep the operation abstract, so what this does is: -- -- . first time the instruction is executed, the sub-machine is -- enabled. the sequencer will wait until the machine provides a -- "done" flag, which will advance the instruction pointer. -- -- . it seems simpler to split this into "start" and "wait" -- instructions. -- -- Each instruction can have push/pop/write/nop wrt imm? It seems -- possible that stack can be manipulated in parallel with bus -- transfer. -- But let's not make this too complicated. Some observations: -- . This is for very low level, specialized code. It will never be -- necessary to manipulate addresses as data, so address for read, -- write can always come from the immediate word. The data itself -- might be manipulated. -- It's probably ok to instantiate it fully even if certain -- instructions are not used. Yosys/abc removes unused logic. -- Still, this can use some decomposition. For now, because there are -- not many instructions, use one-hot encoding to keep the logic -- simple. Entry: Next Date: Fri Aug 17 08:39:05 EDT 2018 - Direct Verilog RTL code gen. - Type-level additions - Syntax preprocessor - Get rid of (SInt (Just n) _) and use e.g. Reg t v , Sig t , Const v. None of these are essential. SInt can be left if replaced by a wrapper like sbits. EDIT: Did that. Simplifies things without making a huge change. Type directed stuff: I'm asking for help on Twitter. Syntax preproc. Seems to be more trouble than it's worth because this needs "escapes" for generic meta-level stuff. It seems that just sticking to "do" for now is the best option. Maybe some TH can help as a middle ground? Verilog seems most useful to cut out MyHDL entirely. It's nice, but not really necessary. And I'm not using its main purpose: to be able to use Python as a macro and test bench language. EDIT: Started putting in the boilerplate for Verilog.hs. It seems quite straightforward, boring.
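Boring, because the skeleton really is small. A minimal sketch of the shape I have in mind, with made-up constructor names (the real netlist type carries more information than a name, a width and an expression string):

data Bind = Wire String Int String  -- name, bit width, rhs expression
          | Reg  String Int String  -- name, bit width, next-state expression

vModule :: String -> [Bind] -> String
vModule name bs = unlines $
     ["module " ++ name ++ "(input clk);"]
  ++ map decl bs
  ++ ["assign " ++ n ++ " = " ++ e ++ ";" | Wire n _ e <- bs]
  ++ ["always @(posedge clk) begin"]
  ++ ["  " ++ n ++ " <= " ++ e ++ ";" | Reg n _ e <- bs]
  ++ ["end", "endmodule"]
  where decl (Wire n w _) = "wire [" ++ show (w-1) ++ ":0] " ++ n ++ ";"
        decl (Reg  n w _) = "reg [" ++ show (w-1) ++ ":0] " ++ n ++ ";"

E.g. putStr $ vModule "counter" [Reg "c" 8 "c + 1"] prints a complete 8-bit counter module. Ports, memories, reset and instantiation are the parts that need real work around this skeleton.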
Entry: Forward declarations Date: Fri Aug 17 11:13:43 EDT 2018 Likely will be necessary eventually, but can it be hacked for now? E.g. write a program as a set of routines, and tell the compiler which is the first one? This also avoids infinite loops. Entry: Bit-serial CPU, PLC Date: Fri Aug 17 12:18:43 EDT 2018 I wonder if that makes sense. It seems that the bit-width significantly impacts the resource use. https://en.wikipedia.org/wiki/Serial_computer What I'm building is a PLC? https://en.wikipedia.org/wiki/PDP-14 https://en.wikipedia.org/wiki/Programmable_logic_controller http://fpga-guru.com/files/supercn.pdf Entry: Bit serial architecture. Date: Fri Aug 17 12:34:54 EDT 2018 These could be cycled through a RAM, giving 16 registers of 256 bits deep. I'm actually really intrigued by this. Seems like a perfect task for a good macro language! But, let's cut it short: before building something like this, build a fixed datapath DSP. A FIR or a biquad. Once it is clear how to abstract it, only then move on to CPU design. EDIT: The instruction word and decoder will have to be parallel. The argument could be streamed in. Next instruction address can be streamed out as part of the ALU. It seems that the main design challenge is to design the ALU. Routing needs to be determined from the instruction word. Operations like conditional jumps need an extra cycle to decide between continue and jump. Probably simplest to do it in two macro cycles: compute condition and jump in the next cycle. Entry: CPU/PLC vs state machine Date: Fri Aug 17 12:44:57 EDT 2018 If what you're doing is sequential in time, needs strict timing but has a time scale that is far below the clock rate, multiplex and sequence it on a CPU. If you need something simple, something parallel, something fast, use a state machine. Entry: Better testing of CPU Date: Fri Aug 17 12:52:52 EDT 2018 Create some quickcheck I/O functions to constrain the behavior before adding new instructions and peripherals. Simple program -> debug sequence is probably ok. Entry: Verilog Date: Fri Aug 17 16:17:14 EDT 2018 One reason to generate Verilog is to make it easier to do direct instantiation. EDIT: I'm thinking that the other direction might actually be a lot more important: take a verilog module, and instantiate it into Seq. The simplest way is maybe to use yosys or icarus to translate it to a netlist. This can probably be done lazily, once there is a need. EDIT: Verilog generation is complete. Needs some testing still. Import, I'd like to see if I can do something with blif. http://www.clifford.at/yosys/files/yosys_appnote_010_verilog_to_blif.pdf Looking at what kinds of gates are in the blif: tom@panda:~/asm_tools$ cat f_soc.blif |grep ^.gate | awk '{print $2}'|sort|uniq SB_CARRY SB_DFF SB_DFFE SB_DFFER SB_DFFESR SB_DFFESS SB_DFFR SB_DFFSR SB_DFFSS SB_LUT4 SB_RAM40_4K Entry: STUArray Date: Fri Aug 17 16:31:40 EDT 2018 Figure out how to chunk it. runSTUArray doesn't have the correct type. I also want the other parameters. So looks like unsafeFreeze is needed. EDIT: see strictToLazyST. It seems possible, but I got to really understand it first. Not for now. Entry: RTL Date: Fri Aug 17 22:01:21 EDT 2018 Yosys manual: 2.1.4 Register-Transfer Level (RTL) Many optimizations and analyses can be performed best at the RTL level. Examples include FSM detection and optimization, identification of memories or other larger building blocks and identification of shareable resources. 
multi-level logic synthesis - Binary-Decision-Diagram (BDD) has a unique normal form. https://en.wikipedia.org/wiki/Binary_decision_diagram This is nested IF / 2-1 multiplexers. - And-Inverter-Graph (AIG) better worst case performance. ABC uses this. https://en.wikipedia.org/wiki/And-inverter_graph Entry: yosys log Date: Sat Aug 18 01:32:52 EDT 2018 12 bit: Number of wires: 277 Number of wire bits: 920 Number of public wires: 58 Number of public wire bits: 303 Number of memories: 0 Number of memory bits: 0 Number of processes: 0 Number of cells: 547 SB_CARRY 47 SB_DFF 14 SB_DFFE 13 SB_DFFER 73 SB_DFFESR 3 SB_DFFESS 1 SB_DFFR 61 SB_DFFSR 5 SB_DFFSS 2 SB_LUT4 312 SB_RAM40_4K 16 4 bit: Number of wires: 160 Number of wire bits: 387 Number of public wires: 61 Number of public wire bits: 216 Number of memories: 0 Number of memory bits: 0 Number of processes: 0 Number of cells: 290 SB_CARRY 22 SB_DFF 13 SB_DFFE 13 SB_DFFER 29 SB_DFFESR 3 SB_DFFESS 1 SB_DFFR 36 SB_DFFSR 5 SB_DFFSS 2 SB_LUT4 165 SB_RAM40_4K 1 Entry: Clash, Lava Date: Sat Aug 18 02:13:57 EDT 2018 Clash uses a static analysis approach. ( Implemented as plugin? ) http://hackage.haskell.org/package/clash-prelude-0.99.3/docs/Clash-Tutorial.html Different from eDSL like the Lava languages. http://projects.haskell.org/chalmers-lava2000/Doc/tutorial.pdf http://hackage.haskell.org/package/chalmers-lava2000 I see no monads.. Brings me to sharing. http://www.ittc.ku.edu/~andygill/talks/20090903-hask.pdf How do we turn a Lava program into a Graph? Use a Monad – Xilinx Lava does this. Tag every binding with a unique tag – Hydra does this. Automate the introduction of unique tags. Destroys referential transparency. Chalmers Lava does this. Not a big problem in practice. Use a function that can see certain classes of sharing. Like a reflective parser. Kansas Lava does this. Like Chalmers Lava, can be unsafe. Data.Reify uses StableNames. Seems like dirty hacks galore.. Stick to Monad. Fewer surprises. Some more on Chalmers Lava: http://www.cse.chalmers.se/edu/year/2012/course/TDA956/Slides/Lava111.pdf Entry: Xilinx lava Date: Sat Aug 18 03:07:25 EDT 2018 http://hackage.haskell.org/package/xilinx-lava http://hackage.haskell.org/package/xilinx-lava-5.0.1.9/docs/Lava.html type Out a = State Netlist a Entry: stripping bare a 6809 Date: Sat Aug 18 03:21:07 EDT 2018 Same story: need a CPU, don't have much room. https://www.edn.com/design/integrated-circuit-design/4460471/Afternoon-diversion--Design-your-own-microprocessor Entry: Avenues Date: Sat Aug 18 03:32:58 EDT 2018 - split the design: - rename CPU.hs to PLC.hs - create a real Forth processor with 2 stacks from 2 memory banks - move Staapl pic compiler to Haskell. It's only the peephole optimizer, the rest can be done in Haskell incrementally. Entry: PLL Date: Sat Aug 18 08:56:43 EDT 2018 Before declaring victory, run it at a higher clock frequency. Do this using Verilog, direct instantiation. But first, see if the verilog actually works. Entry: Verilog debugging Date: Sat Aug 18 09:34:02 EDT 2018 Trying to generate f_soc with Verilog.hs. Running into things like this: wire [99:0] s144; // "s144" <- (IF "rx_in" (CONST _:0) (CONST _:1)) Solution? This needs to be constant-folded. For now it's probably OK to leave it in as yosys will optimize it out. The real issue here is types: there is no unification going on in the other direction.
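Before reaching for a full unification engine, the missing pass can probably be much dumber. A sketch, assuming the netlist can be reduced to groups of nodes that must share a bit width (IF branches and their output, the operands and result of ADD/XOR/..., a register and its next-state input); concat, whose output width is the sum of the input widths, is deliberately left out, and all names here are made up:

import qualified Data.Map as Map
import           Data.Map (Map)

type Node = Int

-- For each node, the nodes that must have the same bit width.
type SameWidth = Map Node [Node]

-- One pass: every group adopts the first known width found in it.
-- (A real version should also check for conflicting known widths.)
pass :: SameWidth -> Map Node (Maybe Int) -> Map Node (Maybe Int)
pass net ws = Map.foldrWithKey share ws net where
  share n ins m =
    case [w | k <- n : ins, Just (Just w) <- [Map.lookup k m]] of
      []      -> m
      (w : _) -> foldr (\k -> Map.insert k (Just w)) m (n : ins)

-- Iterate to a fixed point; anything still Nothing afterwards is
-- genuinely unconstrained.
propagate :: SameWidth -> Map Node (Maybe Int) -> Map Node (Maybe Int)
propagate net ws = let ws' = pass net ws
                   in if ws' == ws then ws else propagate net ws'

Quadratic in the worst case, but the typed constants and register reads provide the seed widths, and the networks are small.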
I'm shuffling things around so a workaround can be added to f_soc.hs. EDIT: still there f_soc: fix_binding: ("s144",Comb3 (SInt Nothing 0) IF (Node (SInt (Just 1) 0) "rx_in") (Const (SInt Nothing 0)) (Const (SInt Nothing 1))) Hack: it's possible that these disappear when bundling expressions. But the best solution is to perform a unification pass on all the bit types in SeqTerm. Entry: Failing tests Date: Sat Aug 18 09:53:45 EDT 2018 -- p_deser *** Failed! Falsifiable (after 2 tests): (7,(3,[1])) (6,(13,[8191])) -- p_soc_fun *** Failed! Falsifiable (after 2 tests): (255,255) (253,254) EDIT: p_soc_fun was due to bus size set to 4. The other still fails: -- p_deser *** Failed! Falsifiable (after 2 tests): (4,(9,[0])) It doesn't seem to work at all. Ha. cs==1. Still a failing case: (4,(10,[0,1023,1018,0,5,4])) Entry: Unification Date: Sat Aug 18 10:09:04 EDT 2018 How to express this? It is a set of integer equations, but using a solver seems overkill. This is a core issue. In general I want to have equations between types, not functions. Here's a simple solution: - for each binding, propagate to output and inputs until there is no more change - another way, because these are trees the following terminates: pick a node with unknown type and re-root the tree, going downward. But actually I'm too tired to think now so let's try a library. http://hackage.haskell.org/package/monad-unify-0.2.2/docs/Control-Monad-Unify.html This should have everything. Actually I already have a unification algorithm: the netlist execution. No, it's only halfway: i/o function, no arbitrary relations. EDIT: monad-unify doesn't build: Configuring monad-unify-0.2.2... Setup: Encountered missing dependencies: base >=4.5 && <4.8 http://dev.stephendiehl.com/fun/006_hindley_milner.html What I'm trying to do is simpler because of the way that register types are always defined explicitly in the code using functional dependencies. So I should be able to just push it through that way. Entry: Verilog FPGA test Date: Sat Aug 18 10:32:28 EDT 2018 f_soc doesn't work while MyHDL still works. Likely my fixed-size workaround is bad: it won't work for concatenation. Entry: Next? Date: Sat Aug 18 11:01:17 EDT 2018 - Verilog.hs <- SeqTerm.hs unification - deser has a failing test case Entry: Behavioral Date: Sat Aug 18 14:31:06 EDT 2018 Look at Chalmers Lava paper, it has a "behavioral" language embedded that is then compiled to a "full tree" register update. http://projects.haskell.org/chalmers-lava2000/Doc/tutorial.pdf http://hackage.haskell.org/package/chalmers-lava2000 Entry: Term type cleanup Date: Sat Aug 18 16:08:24 EDT 2018 This requires a lot of code changes, so might not be the thing to do at this moment. Basically, I want to get to this: data Form n = Const | Comb1 Seq.Op1 n | Comb2 Seq.Op2 n n | Comb3 Seq.Op3 n n n | Slice n Seq.SSize Seq.NbBits | Delay n | Connect n | Mem (n,n,n,n) | Input -- Externally driven node deriving (Show, Functor, Foldable) The changes to the original Term: - Flatten Term (Op n) to Form n. This is already implemented. It seems necessary to be able to keep the non-monadic constants. - Change the memory representation such that memory is just an i/o function. The read data register can then be added as a normal register. - This then allows SType to be factored out of the Form type. Entry: Spiral after focus Date: Sat Aug 18 18:19:40 EDT 2018 I guess I'm looking for that rush of getting the processor to work, only to discover there is a lot of work left, and it's "work". I.e.
refactorings that are not so easy to perform. Mostly, this needs some rest. Entry: Unification Date: Sat Aug 18 22:27:06 EDT 2018 I'm running into a unification problem, and it seems to be slightly different from HM. What I want to know is how my problem is different. My problem, written as bit-width equations: concat: c = a + b; any other 2-op: c = a, c = b. Forget about the other cases for now. During execution, I can go from leaves up to result nodes, but some information travels the other way: from delay / connect nodes down to the leaves. Is it enough to just propagate that information in one way? That's a lot simpler than full unification. Yes, there are only two "entry points" where type information is present: typed constants and register reads at the leaves, and typed register/connect at the root. It should be enough to split the nodes into typed and untyped, iterate over all the roots, go down the full trees, and fill in information as it becomes available. Keep going until stable or done. Descending trees might not even be necessary. As long as there is reduction, it's ok to continue. If nothing changes in one pass it's time to give up. Is it a problem if this is quadratic? I guess not for any practical problem. So just scan all the nodes every time some new information is available. To optimize, use a map from nodes to nodes that depend on it. Actually, just a map is useful for other things, so maybe just build it? Entry: SeqNetList Date: Mon Aug 20 15:29:52 EDT 2018 Cleaned it up to use Data.Graph, which makes a lot of code simpler. Also have expression inlining. The question is now: is it still necessary to unify? The inlining might already fix that, leaving type (bitwidth) reconstruction of intermediates to the target language. EDIT: In the soc example, this seems to be the case. One simplification is to always use Vertex for the structural algorithms. When node representation needs to be changed, the functor property can be used. Entry: Next? Verilog Date: Mon Aug 20 15:40:01 EDT 2018 Verilog gen so PLL can be tested + mem init. Some other things need cleanup before that. And now is not the time. Too much "work-like". Is output propagation for "Connect" still necessary? One way to do this though is to create Verilog2 as a print of [(Vertex, (SSize, Expr Vertex))]. Then use postprocessing later to fix things. EDIT: This is still a big step. I need an analysis routine to filter out memories, which are modeled as combinatorial network + delay. EDIT: Currently, type information of constants is lost. So keep tracking that. EDIT: Fanout of Memory node is []? First, clean up types a bit. This is the graph. (35,TypedForm {typedFormType = Just 16, typedFormForm = Memory 26 29 28 95},[]) (36,TypedForm {typedFormType = Just 16, typedFormForm = Delay 35 0},[38]) Fanout is 0 because delays are not counted. Is it possible to keep the full graph representation, and turn it into a DAG when needed? EDIT: Created se printer, then disabled that as default. Running into this error: EDIT: Memory -> Delay fanout is now accessible again: (35,TypedForm {typedFormType = Just 16, typedFormForm = Memory 26 29 28 95},[36]) (36,TypedForm {typedFormType = Just 16, typedFormForm = Delay 35 0},[38]) memory_decl: ((2,TypedForm {typedFormType = Just 16, typedFormForm = Memory 33 34 35 4}), (3,TypedForm {typedFormType = Just 16, typedFormForm = Delay 2 0})) Entry: Next? Verilog done. Date: Tue Aug 21 23:08:10 EDT 2018 Instantiation of external modules, testbenches. If Verilog is generated, it needs to be validated.
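The obvious shape for that validation is a property that runs the same input trace through both semantics. A sketch, where both trace functions are hypothetical (a pure one from the Seq emulator, an IO one wrapping an iverilog run of the generated .v), and the input is assumed to be two 8-bit signals:

import Test.QuickCheck
import Test.QuickCheck.Monadic

prop_cosim :: ([[Int]] -> [[Int]]) -> ([[Int]] -> IO [[Int]]) -> Property
prop_cosim seqTrace verilogTrace =
  forAll (listOf (vectorOf 2 (choose (0, 255)))) $ \ins ->
    monadicIO $ do
      outs <- run (verilogTrace ins)
      assert (seqTrace ins == outs)

Random networks would be junk, but random input traces against a fixed circuit are exactly what the existing QC tests already do; this just adds the Verilog side as a second oracle.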
Entry: Verilog Date: Wed Aug 22 12:35:36 EDT 2018 Easy to do huh? Quite a detour to make SeqNetList.hs and still not generating the f_soc example properly. It works for f_blink. I suspect it's memory. That involves some manual steps. Inspection didn't yield any obvious mistakes. I don't have a good strategy to test this apart from going over every feature and writing or generating a testbench. Actually, quickcheck.. Does it make sense to generate a random network? Most would likely produce junk. It needs to be fed with random data. Or a noise generator. Entry: Something else Date: Wed Aug 22 14:46:08 EDT 2018 Not much sleep. Bored. What else? EDIT: Playing a bit with verilog testbenches. Entry: Macro languages / Simulation Date: Wed Aug 22 23:45:21 EDT 2018 Maybe there is some room for improvement? I guess I can continue with this, find abstractions that work. Comments, follow-ups: - multi-level sim (rtl, gate-level) - asic: problem is verification (design for verification) - RTL synthesis - Event-driven approach is necessary (multiple clock domains) - Seq test benches should be able to drive a verilog simulator (cosim) - MyHDL cosim with icarus https://news.ycombinator.com/item?id=9516217 "If you have worked in ASIC designs that you are almost certainly used to not work with Verilog directly, but with an ad-hoc macro language. I've seen them all, but most commonly Perl is used. It is especially for circuits that we desperately need better tools and abstractions." > They are all toys aimed at newbies. They address problems that are no problem at all to a qualified engineer in the profession. I think this kind of attitude is one of the main problems of the hardware design industry (the other one is the conservative mindset). The bugs are not due to language generally. If software guys want to write a new language for HW engineers, they need to ask HW engineers what the issues are rather then go off on a tangent. As it is they don't even know what the workflow is like and where the real pain lies. They are just writing tools for themselves. The problem is their applications are not commercial. Imperative synchronous languages. https://news.ycombinator.com/item?id=9948906 Exactly. It's crazy to even think, outside of analog, that someone would want to build a modern SOC with RTL directly. Even the full-custom folks like Intel, correct me if I'm wrong, essentially use a hardware approach where they can abstract away from and use EDA to synthesize to the stuff they hand-craft. Most SOC builders simply don't have the labor to waste working at RTL for a complex design & its inevitable problems. It's why they pay so much to the Big Three for synthesis tools. The real difference is that in event-driven languages, clock events are explicit, instead of implicit like in Chisel and the whole array of dead HDLs that preceded it. So if history is any guide, Chisel is dead upon arrival. Essentially once you translate your Chisel design to verilog you basically can't reuse your verification environment for the RTL simulation or the gate-level. How are you going to check timing if you wrote all your tests in Chisel? So much more logic is added to chips during/after synthesis these days and you have no way to get any of your tests running on that net list. productivity bottleneck in hardware is not in design but verification and specifically sequential verification. Combinational verification is quote easy, because modern SAT solvers can prove almost all combinational properties you can throw at them.
Lots of progress is happening though: Aaron Bradley's PDR (http://theory.stanford.edu/~arbrad/) and Arie Gurfinkel and Yakir Vizel's AVY (http://arieg.bitbucket.org/avy/) have made pretty major breakthroughs in proving a lot of complex sequential properties and these algorithms have made their way into industrial formal tools as well. (on clash) And you're still describing the transfer function between states; CλaSH does not seem to allow you to describe your program in a structured way (such as loop until x becomes true, wait for 3 cycles, read z, while z > 0 decrement z, etc.) Entry: cosim Date: Thu Aug 23 00:55:22 EDT 2018 https://github.com/jgjverheij/clash-cosim http://docs.myhdl.org/en/stable/manual/cosimulation.html Entry: structured programming Date: Thu Aug 23 01:19:42 EDT 2018 See two posts back: " CλaSH does not seem to allow you to describe your program in a structured way (such as loop until x becomes true, wait for 3 cycles, read z, while z > 0 decrement z, etc.)" https://news.ycombinator.com/item?id=9516217 How would you do this in Verilog? Entry: Chisel HDL: the latest instance of a flawed approach Date: Thu Aug 23 01:33:43 EDT 2018 http://www.jandecaluwe.com/blog/chisel-flawed-approach.html Jan disagrees: Since we are constructing a digital circuit, the notion of reassignment does not make much sense since connections between circuit nodes only need to be specified once. This is a puzzling statement. It would suggest that Chisel reduces the goal of an HDL to structurally constructing a circuit. What about describing behavior, which is the main reason why industrial HDL designers use VHDL, Verilog, SystemC and MyHDL? So I'm making the same mistake. Jan mentions again that merely constructing circuits is a mistake, and that behavioral modeling is the important property. I've actually run into this, to be honest. Why am I building a CPU? To be able to do sequential programming. In the comments, Jan comments on blocking assignments: Entry: Procedural / behavioral modeling Date: Thu Aug 23 01:47:19 EDT 2018 So what is it? What are the constructs? Is this just single-ended ifs? ( Because unrolled macros are trivial ). http://www.jandecaluwe.com/hdldesign/thinking-software-rtl.html Coding at the RTL level means describing the hardware behavior in a single clock cycle. If an algorithm requires a number of steps, you have to code it as an FSM so that the behavior depends on the state you are in. The resulting code is fairly low level. For example, a loop is emulated by explicit transitions between states. However, there is an interesting special case. When the algorithm is simple enough to complete within a single clock cycle, you can use higher level features such as a for loop to describe the behavior in a straightforward way. So as long as the loop unrolls into a logic network, it is synthesizable. Now what is the equivalent of that, Haskellized? It's OK if I generate straight RTL... That gray counter is a good example to duplicate. Some more here: http://www.jandecaluwe.com/hdldesign/the-case-for-a-better-hdl.html http://www.jandecaluwe.com/hdldesign/signal-assignments.html Trying to summarize the points: - Consecutive assignments can be useful to factor out complex conditions - Loops are useful to express networks algorithmically. For Seq: the latter is just macros. The former needs a special trick, such as register assignments or lenses. Entry: Record assignments Date: Thu Aug 23 02:25:46 EDT 2018 Haskell does have this record assignment syntax..
https://www.reddit.com/r/haskell/comments/6v9ezi/does_it_bother_anyone_else_that_record_assignment/ So this can be used for state updates, in case the state gets large. Also: lens. Entry: Update: behavioral Date: Thu Aug 23 10:09:27 EDT 2018 (I shouldn't do this stuff right before bed...) So I think the main idea is that the behavioral approach can make control decisions based on signal values, and unrolls this into a circuit. At first glance, this seems to need target HDL support, but because of the result being a static circuit, it should be able to be expressed as a macro. Just maybe not an imperative one. http://www.jandecaluwe.com/hdldesign/thinking-software-rtl.html The key to the gray counter example is the "found" variable. It appears to make a structural decision. However, unrolled, the only thing that happens is that the ith iteration of the loop has a different input. So this is a chain of combinatorial circuits. This means that "for" can be a macro, and variables can just be signals. Generalize this to fold. This is an important insight. Maybe the one-legged if can also be implemented this way: a fold? Entry: Processor as a macro Date: Thu Aug 23 10:32:53 EDT 2018 So back to the main question: is there an intermediate between creating a program on a processor, and a raw state machine? I.e. I want an imperative / stack language that generates a state machine. Say start without a stack: just use loop counters. Every instruction word is a state, with a next state equation. This should be automatic. These are all just functions! What about decoupling it as a type class, and providing two different interpretations? There are a couple of trade-offs: loop counters with individual adders, or a stack or ALU-like structure. It seems that only the instruction memory can be eliminated this way. The other logic will remain the same. Also obvious in retrospect. Entry: concat Date: Thu Aug 23 12:50:27 EDT 2018 Maybe today is a day to evaluate concat. https://github.com/conal/concat Before that, I need to solve some build issues. Means nix. Entry: Verilog next Date: Fri Aug 24 11:19:31 EDT 2018 Something to test memories. Make a universal test bench? I.e. start with output-only, single output, then change once it works. Entry: Sequencers: state machines vs. CPUs Date: Sun Aug 26 11:05:29 EDT 2018 This seems to be the main thing. Also that it is just the imem+iword infrastructure: for loops, a stack or multiple loop counts are still needed. Entry: Verilog cosimulation Date: Sun Aug 26 11:20:09 EDT 2018 I started some basic testbench work. Ultimately, I want cosimulation. IVerilog can do that. So maybe go for it straight away? How is the interface built up? root@zoo:~# apt-file find iverilog ... iverilog: /usr/include/iverilog/_pli_types.h iverilog: /usr/include/iverilog/acc_user.h iverilog: /usr/include/iverilog/ivl_target.h iverilog: /usr/include/iverilog/sv_vpi_user.h iverilog: /usr/include/iverilog/veriuser.h iverilog: /usr/include/iverilog/vpi_user.h ... iverilog has vpi_user.h which refers to this standard: https://en.wikipedia.org/wiki/Verilog_Procedural_Interface What I need is either message passing, or calls from iverilog into Haskell. As a direct FFI, calls from Haskell into iverilog would be more appropriate, but likely a bit awkward. An example would be nice. http://www.asic-world.com/verilog/pli6.html What is a system task? It seems to be a function that is called inside a begin/end block. How do I get at wire/reg values?
I want to simulate something where the i/o relation is implemented on the Haskell side. It seems that just calling a function from verilog is enough. Is this a "system task"? Something that can take arguments and return them? Entry: simple tests Date: Sun Aug 26 12:04:45 EDT 2018 But first, some simple tests. E.g.: a test that writes to memory, reads it out. Basically it would be nice if this could be expressed such that it can be plugged directly into the existing QC functionality, while also allowing inspection of the generated verilog. EDIT: Postponed to later. This is too much "work". The MyHDL path works for the CPU, so use that for now. Work on this in the background. Entry: Behavioral is NOT state machine synthesis Date: Sun Aug 26 14:21:10 EDT 2018 I think that was the core of my misunderstanding. If "behavioral" is just an imperative, iterative overlay to generate combinatorial networks, I don't think I'm really interested. Can be done better using explicit macro code. Entry: Cleaned up module structure Date: Sun Aug 26 15:43:02 EDT 2018 Apparently the biggest misunderstanding was that the module 'asm-tools' can just be put in the dependency list of the tests. This way, they do not completely rebuild everything. TODO: Integrate hatd code. Entry: Generic assembler Date: Sun Aug 26 15:44:20 EDT 2018 I was able to avoid this by using non-recursive functions. But in general, it is probably best to wrap up the knot-tying assembler behavior in a monad (transformer). Entry: iCE40UP5K Date: Sun Aug 26 19:16:29 EDT 2018 https://github.com/icebreaker-fpga/icebreaker 8 DSP cores (16x16 MAC) There is support since January http://www.clifford.at/icestorm/ Entry: Publish? Date: Sun Aug 26 23:50:17 EDT 2018 The most important property that distinguishes this from a toy is cosimulation. If this tool can be used to create components as part of a larger design, it will be useful. If not, it is a toy. For my own purpose, I control everything. There are no hard or soft cores apart from memory. Entry: post synthesis changes Date: Sun Aug 26 23:52:32 EDT 2018 I read something about changes being made post synthesis, supposedly because in large designs, synthesis is such an expensive procedure. I'm not sure if this means change then resynth, or just synth + tweak and don't resynth. Entry: Timing Date: Mon Aug 27 00:31:11 EDT 2018 Maybe more important than verilog is to get a better handle on timing constraints. E.g. how fast does the CPU run? Is it fast enough for the application, or do we need multiple clock domains? As I mentioned before, I find it strange that I don't run into timing issues (yet), while this is the main thing in all other CPU designs! Path: verilog -> pll instantiation -> CPU test Entry: iCE40 VGA demo? Date: Mon Aug 27 01:14:54 EDT 2018 https://imgur.com/t/fpga/71Ap9jT Entry: Zip CPU Date: Mon Aug 27 01:55:23 EDT 2018 If you bring me a design that doesn't work, my first two comments will be: 1. Do you have `default_nettype none as your first line? 2. Does it pass verilator -Wall -cc toplevel.v quietly? Entry: Scan Date: Tue Aug 28 04:06:25 EDT 2018 Scan: One extra multiplexer + string all flipflops as a shift register. http://pages.hmc.edu/harris/cmosvlsi/4e/lect/lect12.pdf Entry: Verilog gen: names as strings Date: Tue Aug 28 10:40:47 EDT 2018 Why bother with Template Haskell? FPGA gen needs custom things anyway, so use the existing probe mechanism to set signal names?
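A sketch of that, assuming probes come out as a node-to-name map; both the function and the map are hypothetical here. The only real work is sanitizing, since a probe string is not necessarily a valid Verilog identifier:

import qualified Data.Map as Map
import Data.Char (isAlphaNum)

verilogName :: Map.Map Int String -> Int -> String
verilogName probes n =
  maybe ("s" ++ show n) sanitize (Map.lookup n probes)
  where
    -- Replace anything outside [A-Za-z0-9_]; a leading digit would
    -- still need handling in a real version.
    sanitize = map (\c -> if isAlphaNum c || c == '_' then c else '_')

Unnamed nodes keep the generated sNNN form, so output stays deterministic.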
Entry: Direction: work with existing Verilog code Date: Thu Aug 30 22:29:45 EDT 2018 This means: focus on tying into an existing simulator. The power in the Haskell approach is testability, composability. For code gen however, it will be mostly "glue" in any practical design that has reuse. Entry: tristate logic vs. multiplexer trees Date: Fri Aug 31 18:09:45 EDT 2018 From one of the comments: https://electronics.stackexchange.com/questions/289878/fpga-memory-vs-registers "Using registers as memory also requires huge multiplexer trees, since there is no tri-state logic in routing." Entry: Verilator or Icarus? Date: Fri Aug 31 21:26:32 EDT 2018 MyHDL uses Icarus, so that's what it will be. Just do what MyHDL does: http://docs.myhdl.org/en/stable/manual/cosimulation.html module dut_bin2gray; reg [`width-1:0] B; wire [`width-1:0] G; initial begin $from_myhdl(B); $to_myhdl(G); end bin2gray dut (.B(B), .G(G)); defparam dut.width = `width; endmodule And use the "probe" mechanism to send the probe signals back using to_x, and push in the test signals using from_x. http://iverilog.wikia.com/wiki/Using_VPI So there is a straightforward path with MyHDL as an example. It involves a lot of reading, so not for now. Entry: Low-hanging fruit Date: Fri Aug 31 21:51:40 EDT 2018 I need something simple to keep myself occupied. EDIT: Sleep! Entry: Verilog Date: Sat Sep 1 08:47:42 EDT 2018 Maybe leave the cosim out of the picture, but make a working verilog RTL gen. What is missing currently? Likely the problem is the memory. EDIT: What's needed is a cosim function, that returns the Seq emulation, but also the Verilog one. This seems like a lot of work, really. How to split it up into manageable parts? Entry: Abstract submodules Date: Sun Sep 2 09:41:23 EDT 2018 Instead of generating verilog code for memories and reset generator, write those as verilog modules, and instantiate them. Entry: VPI Date: Mon Sep 3 10:05:21 EDT 2018 So I have some cargo-culted code that compiles and runs. Now I want to know how exactly I get "events". What I miss is the shape of the context these functions are supposed to run in. This is quite typical: the API is too granular to reconstruct this kind of context. Examples are essential to get the big picture. I find this in the myhdl.c code: vpi_register_cb(&cb_data_s); Googling for that, I find: http://www.asic-world.com/systemc/hdl8.html Some remarks: - userdata has nothing to do with the argument to a function. it seems to be just a user state pointer. - everything is dynamically typed, and referenced using vpiHandle - some calls allocate objects that need to be freed, e.g. vpi_iterate() (EDIT: this caused a crash... reference manual?) https://books.google.com/books?id=3pEMBwAAQBAJ&pg=PA213&lpg=PA213&dq=vpi+posedge&source=bl&ots=AYdk0XJym9&sig=3W4FA1Xe5LP4zUGPSm7QQv6O2z4&hl=en&sa=X&ved=2ahUKEwjnttT_lJ_dAhXDIzQIHSPLDhUQ6AEwB3oECAQQAQ#v=onepage&q=vpi%20posedge&f=false Now, how to put it all together? Instead of syncing on clock inside the module, what about calling it like this: always @(posedge clock) $seq_tick This could then read out all the registers that were tagged in $to_seq(), and write to all the registers tagged in $from_seq(). This seems straightforward. The problem is then how to define the protocol between the Haskell side and the simulation? EDIT: This seems to work. Proof of concept, clean up later. Next: generic way to get an I/O pipe into a C program from Haskell. TCP? Named pipe? 
The most straightforward way I can think of: - Have Haskell set up the pipes - Put pipe locations in environment - Let module open the pipes EDIT: How to set this up? There is this, which is a higher level interface: http://hackage.haskell.org/package/pipes-4.3.9/docs/Pipes-Tutorial.html Seems overkill. I just need the system calls. EDIT: Trying named pipes, but this is really hacky. Unix socket? EDIT: Basic unix socket infra is in place. How to (de)serialize? I don't understand how to link the Char8 to the Data.Binary.Get • Couldn't match expected type ‘Data.ByteString.Lazy.Internal.ByteString’ with actual type ‘Data.ByteString.Internal.ByteString’ (Data.Binary.Get wants a lazy ByteString, while the socket read returns a strict one; Data.ByteString.Lazy.fromStrict converts.) Entry: cauterize Date: Mon Sep 3 12:37:23 EDT 2018 Serialization between emulator core and Haskell? This is a job for cauterize: https://github.com/cauterize-tools/cauterize EDIT: Stick to u32 for now. Entry: bug? Date: Mon Sep 3 21:33:41 EDT 2018 reg [7:0] p0; // (0,Free (TypedForm {typedFormType = Just 8, typedFormForm = Input})) reg [7:0] p1; // (1,Free (TypedForm {typedFormType = Just 8, typedFormForm = Connect (Pure 2)})) wire [7:0] s2; // (2,Free (TypedForm {typedFormType = Just 8, typedFormForm = Comb2 ADD (Pure 0) (Pure 2)})) assign s2 = (p0 + s2); // (2,Free (TypedForm {typedFormType = Just 8, typedFormForm = Comb2 ADD (Pure 0) (Pure 2)})) assign p1 = s2; // (1,Free (TypedForm {typedFormType = Just 8, typedFormForm = Connect (Pure 2)})) That 4th line is wrong: it looks like the constants start counting at the wrong offset. Yes... should be 1 + maximum. It's likely that this is the bug I've been looking for. Nasty enough to mess up a network. EDIT: .v code gen works on FPGA now. Entry: cosim Date: Tue Sep 4 10:25:56 EDT 2018 I have something working with .v gen. Questions: - from_seq uses registers. Does it need a special update? vpiNoDelay is fine. - to_seq uses wires. Should be fine, see https://verificationhack.com/ Entry: cosim next? Date: Tue Sep 4 10:50:33 EDT 2018 So this works. There is no need to continue using the stdin/stdout approach. EDIT: Cleaned up. Good enough for now. Entry: Next Date: Tue Sep 4 12:30:02 EDT 2018 - Test CPU Verilog on FPGA - Use Verilog as top-level composition. Do this for reset gen, PLL, ram init. Most of this is "boring" work. I need to be more careful spending the will power to do those things. Lots of paid work to do as well.. EDIT: I'm stuck with just boring problems. I need something meaningful. Or somehow awaken the flame. Entry: PLL Date: Wed Sep 5 11:22:26 EDT 2018 So work towards getting the PLL up. Clean up .v modules first. - how to start yosys with library routines? OK - same for icarus - how to add instantiation to Seq? Entry: Next Date: Wed Sep 5 12:22:53 EDT 2018 I don't want to go through debugging PLL stuff when I'm not 100% online. Something simpler? Or something more exciting? Entry: CPU Emulation Date: Wed Sep 5 13:06:19 EDT 2018 With Seq, CPU emulation is at the level of sequential logic. Is there a way to relate this level to a higher one? E.g. one thing I'd like to do is to create a PIC18 code generator / emulator. Should I emulate the PIC18 at a hardware level? That seems like a lot of work. The main question: how to relate different semantic levels? This does not seem trivial at all! The main trick seems to be invariance of properties, done at the problem level. Entry: PRU revisited Date: Thu Sep 6 15:06:34 EDT 2018 I'm starting to think that this might not be such a good idea. Reality doesn't correspond to the simulation. It seems stuck in some loop.
So I'm looking at a sim error. QBNE is prime suspect. Entry: Code gen or scaffolding? Date: Thu Sep 6 19:03:14 EDT 2018 There is always this tension: is it _really necessary_ to generate code, or can low level code be written by hand and loaded into the test framework for validation? When to reverse the arrow? The cost is significant: it is no longer possible to read the code without understanding the metaprogramming framework. Maybe it is time to create a subset of C that can be used to express algorithms, but is also easy to use as a frontend to abstract interpretation. Entry: Next Date: Fri Sep 7 11:49:34 EDT 2018 PLL: I would really like to know if it works or not. But this kind of work needs strong attention to detail, and I'm in the mud again. So what is missing, currently? Maybe clean up the Verilog generator a bit. Entry: I'm a little stuck Date: Fri Sep 7 23:22:56 EDT 2018 Let's use this tool for a bit and let the application drive progress. I think I'm done with just making stuff up. Considering switching to "exo" as the main side project. Entry: Emulator / Assembler Date: Sat Sep 8 12:03:59 EDT 2018 I'd like to abstract the 2-phase structure used in Pru: - compile to program - run program This requires two monads: a compile time monad and a run time monad. To customize, maybe write these as monad transformers? Entry: Instruction scheduling using SAT solver Date: Thu Sep 13 21:04:58 EDT 2018 It might be interesting. Entry: So, what about just using assembly? Date: Fri Sep 14 22:10:02 EDT 2018 Using a Forth for the part that is not time critical, and assembly for the part that is. Together with emulators. So the basic problem becomes: how to quickly write a CPU emulator? Also, how to quickly iterate? Haskell is REALLY slow to compile. Entry: PRUs and instruction counting Date: Sun Sep 16 10:05:15 EDT 2018 So I got something to work, but what I miss is a good way to specify it. I think this is getting closer to SAT solver territory. The problem there is how to encode the problem. SMT seems more convenient than SAT. Entry: SAT/SMT: stick to z3 for now Date: Sun Sep 16 10:11:53 EDT 2018 ( Getting more into how other people make monad wrappers. This one sure looks ugly, but is straightforward, to the point. ) Notes: mkFreshIntVar: create a variable mkInteger: create an integer constant assert: add assertion to Z3 monad connectives: mkLe mkAnd mkIte mkEq mkSub mkUnaryMinus withModel: provides context such that evalInt works evalInt: evaluate variable Entry: dsPIC Date: Sun Sep 16 20:37:39 EDT 2018 - No USB needed. This is for deeply embedded bits, solve interfacing elsewhere. UART is fine. - Using e.g. an STM32 for this, it is also possible to put a PIC programmer in there. The programming of the devices isn't really that difficult given bit-bang possibility on the STM32. - Time for some signal processing. I have a bunch of these dsPIC DIP packages, so why not use them? - dsPICs are simple. Also, it will likely be straightforward to move this to a softcore written around some DSP primitives on an FPGA. - I think I'm ready with the Haskell work: - PRU/Emu.hs can be transformed into a universal emulator / assembler - Haskell as a meta language is enough to implement Forth-like target languages - More complex things are often not needed: use a different chip with a "real" language like C or Rust. What is needed? - The tagless-final language can be built incrementally on top of an assembler. I would like to use an existing assembler.
I don't think the Staapl approach of redoing everything makes much sense. Otoh I don't immediately find something, so maybe let's do this anyway? It shouldn't be that hard. It could be on an as-needed basis.

Entry: Architecture
Date: Sun Sep 16 21:01:51 EDT 2018

Interface chip: blue pill. There is no need for Rust here. Keep it simple. This doesn't need to do anything complicated. So it looks like it is necessary to generate instructions for the ICSP anyway, so let's just do that. Hook up the naked dsPIC to the blue pill. I have a dsPIC30F4013 on a board. There are the GP chips from a previous project also. Does it make a difference for programming?

Programming: dsPIC 30F and 33F can both contain a "programming executive", but it seems this first needs to be uploaded. Maybe simpler to use ICSP mode. Definitely, this will need to be limited to low voltage programming. I'm not going to build a charge pump. This means only dsPIC33F.

EDIT: This is not simple. Do I really want to get into this, or is it just a beer idea?

EDIT: Sobered up. The point here is to be able to do circuits without the need for breakout boards. I believe the dsPIC33F can be programmed with pickit2. Yes, see /home/tom/bin/pk2cmd.33FJ128GP802.write

Entry: State machines vs. processor
Date: Fri Sep 28 13:32:10 CDT 2018

Disadvantage of a processor is the amount of multiplexing that is needed. Advantage: that multiplexing does buy something: flexibility in accessing everything from a single point. What I really want: a simple way to synthesize loops into hardware. Can it be done cheaper? A stack of counters isn't all that expensive. The problem is more that if repurposing is needed, a processor + bus is simpler.

Entry: State machines
Date: Mon Oct 15 10:11:14 CEST 2018

I need a good syntax to implement state machines, but above all, a better way to think about implementing them. Often, state machines are too low-level. Nested loops provide a better mechanism. I want a way to relate the two. In Seq, a syntax transformation would be easy to implement, so why can't I just write this down?

- loops: remain in a state as long as the condition is not met
- re-use counters, and possibly counter conditions, between states

The trick is in the reuse of resources between states. Having to do this explicitly is also what makes manual state machines hard to develop. Using a CPU, the shared resources are the CPU resources (registers, ALU, instruction sequencer). So it is the reuse of resources between states that makes things complicated. Also, state components used in one branch, and not used in another, still need to be initialized. It is exactly this that a more "imperative" language would solve: by default, keep value. It appears it is this property of needing to specify unused values that gets in the way.

Entry: Imperative vs. functional
Date: Mon Oct 15 11:25:10 CEST 2018

I'm starting to think that the deeper problem in my reasoning is holding on to a "functional" way of thinking. An imperative language might actually be more appropriate for this kind of work: single (or no) assignment, or even repeated assignment (though that feels like a mistake). Is this just beginner misunderstanding, or am I just not indoctrinated enough by the standard way? If that is sub-optimal, I have an opportunity to think differently. What am I missing? One way to look at the missing link is partial evaluation of a CPU + program into a state machine. There is some element of the construction there that I do not understand. Is it just partial evaluation of the decoder + ALU?
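A toy version of that construction, to make the question concrete. Everything below is made up (a two-instruction machine), but it shows the mechanism: specialize the interpreter's step function over a constant program, and the fetch/decode disappears, leaving a state machine over (pc, acc):

  -- Toy ISA, purely for illustration.
  data Ins = Inc | JnzBack     -- increment acc; jump to 0 if acc /= 0

  -- The interpreter: decoder + ALU, parameterized by the program.
  step :: [Ins] -> (Int, Int) -> (Int, Int)
  step prog (pc, acc) = case prog !! pc of
    Inc     -> (pc + 1, acc + 1)
    JnzBack -> (if acc /= 0 then 0 else pc + 1, acc)

  -- Partially evaluating 'step [Inc, JnzBack]' unrolls the decoder
  -- into a case over pc alone: the next-state function of a state
  -- machine, with pc as the (now small, known-range) state register.
  step' :: (Int, Int) -> (Int, Int)
  step' (0, acc) = (1, acc + 1)
  step' (1, acc) = (if acc /= 0 then 0 else 2, acc)
  step' st       = st   -- pc = 2: halt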
Entry: Behavioral
Date: Mon Oct 15 11:55:02 CEST 2018

Create a behavioral language. Essentially, this separates variable declaration and assignment. It can then still be determined whether to use single or multiple assignment. No assignment means to keep the existing variable. Is it possible to replace clock tick by "goto"? This way, everything executed between these is combinatorial. That is the backbone. A state machine is something that can change the value of outputs (registers) between gotos, using the new values of inputs (other registers).

The main "programming principle": In this particular case, it is more convenient to think in terms of and express changes in state than the recomputation of state. This model is not a perfect equivalence, i.e. it is not possible to reuse logic in the same clock cycle. Early exits should be implemented, i.e. multiple incoming branches. If partial updates are so important, are clock enables on the registers then used to implement this? Do I need to worry about this in how translation to VHDL will work? Is it possible to use behavioral description as the bottom abstraction, and implement the "functional" logic style on top of it?

Entry: Verilog
Date: Mon Oct 15 14:08:23 CEST 2018

I'm starting to worry about striding too far away from the status quo, especially with recent insights about the behavioral approach.

Entry: State machine
Date: Tue Oct 16 11:07:25 CEST 2018

I'm stuck. I have no behavioral language to express this, can't quickly build one, so will need to continue with fully explicit "functional" updates.

  _ = do
    (carry,dec') <- carry dec cnt
    [state',cnt] <- switch state
      [ (wait,  do ifs i     [wait,0] [count,half]),
        (count, do ifs carry [idle,0] [count,dec']),
        (idle,  do ifs en    [wait,0] [idle,0]) ]

So why does this happen? Many state machines do a lot of waiting, which doesn't change the state. You want this to be easily expressed. Really, though, am I just looking for excuses? It works, it's just not easy to use when there is a lot of state. But maybe proper factoring can solve that problem independently? I.e. define sub-machines, and provide control lines. ( Still, it itches. There must be some simple transform where variables are moved into dynamic context. )

Entry: The solution is factoring
Date: Tue Oct 16 11:47:27 CEST 2018

It always is. The sub-machine creates an enable signal. It behaves a bit like an S/R flipflop, with set triggered by a first data transition. The other machine does just clock recovery, expecting a data and enable signal. But that's not enough maybe? The first machine also needs to have a clock start on the first edge. So it seems that an enable/reset is actually not a bad way to do this.

Entry: Bit sync
Date: Thu Oct 18 11:03:49 CEST 2018

So do it using the principles above:

- factor out the start/stop
- start it at 1/2 bit

This should really be done on paper... But, if it can be done on paper, it can also be done as letrec.

  edge'      <- edge i
  negedge    <- e & ~i
  bit_enable <- set_reset negedge count_carry

Why is this so difficult to express? Maybe it is essential to put it on paper first.

Entry: Cont...
Date: Thu Oct 18 15:28:23 CEST 2018

I have all the components.

- Sub clock counter -> bit clock out
- SR + addr gen + RAM

The problem is reset/enable or start/stop: mostly about how to define the protocol. These are roughly equivalent, but require some small state machines to convert:

- Single enable/nrst
- Start/Stop single clock pulses
- Start/Stop stretched pulses

A polarized edge detect can convert: stretched -> single, en/nrst -> single (see next post).
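For the stretched -> single conversion, a polarized edge detector is tiny. A hedged sketch in Seq style, where delay, inv and band stand in for whatever the actual primitive names are:

  -- High for exactly one cycle on a 0->1 transition of i.
  posedge i = do
    i' <- delay i      -- previous value
    n  <- inv i'
    band i n

  -- The 1->0 polarity (stop pulse) is the mirror image.
  negedge i = do
    i' <- delay i
    n  <- inv i
    band i' n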
So, for bit sync, which is easiest: start/stop or enable? Start/Stop is not enough. States are:

- start, but wait for first edge
- ...

So basically, there are 2 start signals:

- frame start
- first edge after frame start

To solve:

- first start pulse -> enable (or use enable in the first place)
- edge pulse masked by enable, then "debounce" start pulses

It seems simplest to use a single "frame" or "chip select" style input to the circuit, and have the end reset everything. Bits can be clocked in as long as frame is high: it will just be an extra couple of words. So, general arch:

- 2-stage start/stop (frame start, then first edge)
- frame stop = reset

It seems that the 2-phase start is really the problem. The inner circuit should be written to take a single start pulse, and start counting right away.

Entry: Premature opti
Date: Thu Oct 18 15:45:53 CEST 2018

I want to use abstractions, and not worry about efficiency at the same time. Trust the compiler to do common subexpression elimination, even for state machines. If it is possible to duplicate functionality at the state machine level, it is OK to _actually use_ abstractions that convert between different representations. A good example is the dual signal start+stop protocol vs. the single line enable/nrst. If the optimizer can eliminate dual instantiations of the same edge detector, it is no longer necessary to spend mental cycles manually performing the factorization. This premature optimization is a serious hangup otherwise.

TODO: Validate if this is what happens! Otherwise, add a pass in Seq.

EDIT: I really worry about this. Easy enough to test: create 2 edge detectors from the same pin to 2 different output pins, and see if it shares.

EDIT: I don't think it eliminates delays, at least I do not see it doing this anywhere.

Entry: Pipelining CPU
Date: Sat Oct 20 15:06:24 CEST 2018

Just fetch. Essentially this delays the address register. Everything stays the same, apart from jumps going into effect one instruction late. This then creates a branch slot. This then takes the RAM access time out of the timing loop.

EDIT: Looking at this again, the RAM access is only a small part. Bulk comes from logic levels (17 deep!).

Entry: Update tools?
Date: Sun Oct 21 15:20:35 CEST 2018

Maybe good to update the tools?

  tom@wanda:~/asm_tools$ make f_soc.ct256.time
  icetime -p f_soc.pcf -o f_soc.ct256.nl.v -P ct256 -d hx8k -t f_soc.ct256.asc
  // Reading input .pcf file..
  // Reading input .asc file..
  // Reading 8k chipdb file..
  // Creating timing netlist..

  icetime topological timing analysis report
  ==========================================

  Warning: This timing analysis report is an estimate!
  Report for critical path:
  -------------------------
  lc40_18_19_4 (LogicCell40) [clk] -> lcout: 0.640 ns
  0.640 ns net_70961 (tx_in_s60[1])
  odrv_18_19_70961_75164 (Odrv12) I -> O: 0.540 ns
  t2617 (Sp12to4) I -> O: 0.449 ns
  t2616 (Span4Mux_v1) I -> O: 0.203 ns
  t2615 (LocalMux) I -> O: 0.330 ns
  inmux_16_20_67064_67084 (InMux) I -> O: 0.260 ns
  lc40_16_20_1 (LogicCell40) in1 -> carryout: 0.260 ns
  2.681 ns net_67082 ($auto$alumacc.cc:470:replace_alu$232.C[2])
  lc40_16_20_2 (LogicCell40) carryin -> carryout: 0.126 ns
  2.807 ns net_67088 ($auto$alumacc.cc:470:replace_alu$232.C[3])
  lc40_16_20_3 (LogicCell40) carryin -> carryout: 0.126 ns
  2.933 ns net_67094 ($auto$alumacc.cc:470:replace_alu$232.C[4])
  lc40_16_20_4 (LogicCell40) carryin -> carryout: 0.126 ns
  3.060 ns net_67100 ($auto$alumacc.cc:470:replace_alu$232.C[5])
  lc40_16_20_5 (LogicCell40) carryin -> carryout: 0.126 ns
  3.186 ns net_67106 ($auto$alumacc.cc:470:replace_alu$232.C[6])
  lc40_16_20_6 (LogicCell40) carryin -> carryout: 0.126 ns
  3.312 ns net_67112 ($auto$alumacc.cc:470:replace_alu$232.C[7])
  lc40_16_20_7 (LogicCell40) carryin -> carryout: 0.126 ns
  3.438 ns net_67118 ($auto$alumacc.cc:470:replace_alu$232.C[8])
  t316 (ICE_CARRY_IN_MUX) carryinitin -> carryinitout: 0.196 ns
  lc40_16_21_0 (LogicCell40) carryin -> carryout: 0.126 ns
  3.761 ns net_67199 ($auto$alumacc.cc:470:replace_alu$232.C[9])
  lc40_16_21_1 (LogicCell40) carryin -> carryout: 0.126 ns
  3.887 ns net_67205 ($auto$alumacc.cc:470:replace_alu$232.C[10])
  lc40_16_21_2 (LogicCell40) carryin -> carryout: 0.126 ns
  4.014 ns net_67211 ($auto$alumacc.cc:470:replace_alu$232.C[11])
  lc40_16_21_3 (LogicCell40) carryin -> carryout: 0.126 ns
  4.140 ns net_67217 ($auto$alumacc.cc:470:replace_alu$232.C[12]$2)
  inmux_16_21_67217_67227 (InMux) I -> O: 0.260 ns
  lc40_16_21_4 (LogicCell40) in3 -> lcout: 0.316 ns
  4.715 ns net_63054 ($auto$alumacc.cc:470:replace_alu$232.C[12])
  odrv_16_21_63054_67278 (Odrv4) I -> O: 0.372 ns
  t2115 (Span4Mux_h1) I -> O: 0.175 ns
  t2114 (LocalMux) I -> O: 0.330 ns
  inmux_18_20_75201_75263 (InMux) I -> O: 0.260 ns
  lc40_18_20_5 (LogicCell40) in3 -> lcout: 0.316 ns
  6.167 ns net_71085 (c_s77)
  odrv_18_20_71085_70869 (Odrv4) I -> O: 0.372 ns
  t2651 (LocalMux) I -> O: 0.330 ns
  inmux_18_18_74969_75003 (InMux) I -> O: 0.260 ns
  lc40_18_18_3 (LogicCell40) in1 -> lcout: 0.400 ns
  7.527 ns net_70837 ($abc$2791$n472_1)
  odrv_18_18_70837_46505 (Odrv12) I -> O: 0.540 ns
  t2485 (Span12Mux_h1) I -> O: 0.133 ns
  t2491 (Sp12to4) I -> O: 0.449 ns
  t2490 (Span4Mux_v0) I -> O: 0.203 ns
  t2489 (Span4Mux_v0) I -> O: 0.203 ns
  t2488 (Span4Mux_v0) I -> O: 0.203 ns
  t2487 (Span4Mux_v1) I -> O: 0.203 ns
  t2486 (LocalMux) I -> O: 0.330 ns
  inmux_12_19_50638_50667 (InMux) I -> O: 0.260 ns
  t229 (CascadeMux) I -> O: 0.000 ns
  lc40_12_19_3 (LogicCell40) in2 -> lcout: 0.379 ns
  10.431 ns net_46499 ($abc$2791$n517_1)
  t1354 (LocalMux) I -> O: 0.330 ns
  inmux_12_19_50617_50677 (InMux) I -> O: 0.260 ns
  lc40_12_19_5 (LogicCell40) in0 -> lcout: 0.449 ns
  11.469 ns net_46501 ($abc$2791$n626)
  odrv_12_19_46501_14685 (Odrv12) I -> O: 0.540 ns
  t1403 (Span12Mux_h11) I -> O: 0.526 ns
  t1402 (Sp12to4) I -> O: 0.449 ns
  t1401 (Span4Mux_v0) I -> O: 0.203 ns
  t1400 (Span4Mux_v4) I -> O: 0.372 ns
  t1399 (Span4Mux_v4) I -> O: 0.372 ns
  t1398 (Span4Mux_v2) I -> O: 0.252 ns
  t1397 (LocalMux) I -> O: 0.330 ns
  inmux_25_9_102231_102273 (CEMux) I -> O: 0.603 ns
  15.116 ns net_102273 ($abc$2791$n626)
  ram_25_9 (SB_RAM40_4K) RCLKE [setup]: 0.267 ns

  Resolvable net names on path:
  0.640 ns ..  2.421 ns tx_in_s60[1]
  2.681 ns ..  2.681 ns $auto$alumacc.cc:470:replace_alu$232.C[2]
  2.807 ns ..  2.807 ns $auto$alumacc.cc:470:replace_alu$232.C[3]
  2.933 ns ..  2.933 ns $auto$alumacc.cc:470:replace_alu$232.C[4]
  3.060 ns ..  3.060 ns $auto$alumacc.cc:470:replace_alu$232.C[5]
  3.186 ns ..  3.186 ns $auto$alumacc.cc:470:replace_alu$232.C[6]
  3.312 ns ..  3.312 ns $auto$alumacc.cc:470:replace_alu$232.C[7]
  3.438 ns ..  3.635 ns $auto$alumacc.cc:470:replace_alu$232.C[8]
  3.761 ns ..  3.761 ns $auto$alumacc.cc:470:replace_alu$232.C[9]
  3.887 ns ..  3.887 ns $auto$alumacc.cc:470:replace_alu$232.C[10]
  4.014 ns ..  4.014 ns $auto$alumacc.cc:470:replace_alu$232.C[11]
  4.140 ns ..  4.399 ns $auto$alumacc.cc:470:replace_alu$232.C[12]$2
  4.715 ns ..  5.851 ns $auto$alumacc.cc:470:replace_alu$232.C[12]
  6.167 ns ..  7.127 ns c_s77
  7.527 ns .. 10.052 ns $abc$2791$n472_1
  10.431 ns .. 11.020 ns $abc$2791$n517_1
  11.469 ns .. 15.116 ns $abc$2791$n626
  RDATA[11] -> $abc$2791$n272
  RDATA[3] -> $abc$2791$n269

  Total number of logic levels: 17
  Total path delay: 15.38 ns (65.01 MHz)

After pipelining the jump instruction:

  Total number of logic levels: 6
  Total path delay: 14.74 ns (67.86 MHz)

So it looks like pipelining fetch is not going to solve much. With new version:

  icetime -p f_soc.pcf -o f_soc.ct256.nl.v -P ct256 -d hx8k -t f_soc.ct256.asc
  // Reading input .pcf file..
  // Reading input .asc file..
  // Reading 8k chipdb file..
  // Creating timing netlist..

  icetime topological timing analysis report
  ==========================================

  Report for critical path:
  -------------------------
  lc40_28_11_7 (LogicCell40) [clk] -> lcout: 0.640 ns
  0.640 ns net_110076 (s20)
  odrv_28_11_110076_70105 (Odrv12) I -> O: 0.540 ns
  t3051 (Sp12to4) I -> O: 0.449 ns
  t3056 (Span4Mux_v3) I -> O: 0.337 ns
  t3055 (LocalMux) I -> O: 0.330 ns
  inmux_20_14_82634_82652 (InMux) I -> O: 0.260 ns
  lc40_20_14_1 (LogicCell40) in0 -> lcout: 0.449 ns
  3.004 ns net_78496 ($abc$2868$n379_1)
  odrv_20_14_78496_62321 (Odrv12) I -> O: 0.540 ns
  t2288 (Span12Mux_v6) I -> O: 0.288 ns
  t2287 (Sp12to4) I -> O: 0.449 ns
  t2310 (Span4Mux_h0) I -> O: 0.147 ns
  t2323 (Span4Mux_h4) I -> O: 0.316 ns
  t2322 (Span4Mux_h3) I -> O: 0.231 ns
  t2321 (LocalMux) I -> O: 0.330 ns
  inmux_22_21_91648_91680 (InMux) I -> O: 0.260 ns
  lc40_22_21_3 (LogicCell40) in1 -> lcout: 0.400 ns
  5.963 ns net_87513 ($abc$2868$n387)
  odrv_22_21_87513_87648 (Odrv4) I -> O: 0.372 ns
  t2686 (Span4Mux_h4) I -> O: 0.316 ns
  t2685 (LocalMux) I -> O: 0.330 ns
  inmux_17_21_71246_71313 (InMux) I -> O: 0.260 ns
  lc40_17_21_6 (LogicCell40) in0 -> lcout: 0.449 ns
  7.689 ns net_67132 ($abc$2868$n385)
  odrv_17_21_67132_31238 (Odrv12) I -> O: 0.540 ns
  t1996 (Sp12to4) I -> O: 0.449 ns
  t1998 (Span4Mux_v2) I -> O: 0.252 ns
  t1997 (LocalMux) I -> O: 0.330 ns
  inmux_16_19_66938_66962 (InMux) I -> O: 0.260 ns
  t317 (CascadeMux) I -> O: 0.000 ns
  lc40_16_19_1 (LogicCell40) in2 -> lcout: 0.379 ns
  9.898 ns net_62805 ($abc$2868$n506_1)
  odrv_16_19_62805_46628 (Odrv12) I -> O: 0.540 ns
  t1781 (LocalMux) I -> O: 0.330 ns
  inmux_16_19_66931_66979 (InMux) I -> O: 0.260 ns
  lc40_16_19_4 (LogicCell40) in1 -> lcout: 0.400 ns
  11.427 ns net_62808 ($abc$2868$n505_1)
  t1775 (LocalMux) I -> O: 0.330 ns
  inmux_15_19_62873_62879 (InMux) I -> O: 0.260 ns
  lc40_15_19_0 (LogicCell40) in1 -> lcout: 0.400 ns
  12.416 ns net_58727 (s115[6])
  odrv_15_19_58727_62446 (Odrv12) I -> O: 0.540 ns
  t1672 (Span12Mux_h8) I -> O: 0.386 ns
  t1671 (Sp12to4) I -> O: 0.449 ns
  t1679 (Span4Mux_v0) I -> O: 0.203 ns
  t1682 (Span4Mux_v4) I -> O: 0.372 ns
  t1687 (Span4Mux_v4) I -> O: 0.372 ns
  t1689 (Span4Mux_v2) I -> O: 0.252 ns
  t1688 (LocalMux) I -> O: 0.330 ns
  inmux_25_25_103850_103900 (InMux) I -> O: 0.260 ns
  t585 (CascadeMux) I -> O: 0.000 ns
  15.579 ns net_103900_cascademuxed
  ram_25_25 (SB_RAM40_4K) RADDR[6] [setup]: 0.203 ns
  15.782 ns dangling_wire_337

  Resolvable net names on path:
  0.640 ns ..  2.555 ns s20
  3.004 ns ..  5.563 ns $abc$2868$n379_1
  5.963 ns ..  7.240 ns $abc$2868$n387
  7.689 ns ..  9.519 ns $abc$2868$n385
  9.898 ns .. 11.027 ns $abc$2868$n506_1
  11.427 ns .. 12.016 ns $abc$2868$n505_1
  12.416 ns .. 15.579 ns s115[6]
  RDATA[11] -> $abc$2868$n296
  RDATA[3] -> $abc$2868$n293

  Total number of logic levels: 7
  Total path delay: 15.78 ns (63.36 MHz)

Entry: wanda update
Date: Sun Oct 21 15:35:47 CEST 2018

Let's update tools. I'm going to have to redo this on panda.

  icestorm:
  commit a1fd644f383e77d4db71ce4654fdf15b940b53a6
  Merge: 6178dfb 3587093
  Author: Clifford Wolf
  Date: Mon Mar 28 16:51:44 2016 +0200

  arachne-pnr:
  commit f808b8e64a9df67ce87d0ecb6e0eee8f3215d7f3
  Merge: 1a4fdf9 7380afe
  Author: cseed
  Date: Tue Mar 29 11:36:24 2016 -0400

  yosys:
  commit dcf576641b4a9b476d51fbe1b0cdfb57d02a76e6
  Author: Clifford Wolf
  Date: Fri Jun 3 11:38:31 2016 +0200

Entry: uart bugs
Date: Tue Oct 23 11:46:18 CEST 2018

- uart_done doesn't work properly
- machine doesn't align to baud clock properly. should the baud timer be reset by the uart?

Entry: bus reads
Date: Tue Oct 23 16:06:45 CEST 2018

Important to realize that there is a 1 cycle delay between bus reads and values being ready.

  read:4 1 0 1 1 (1636x)
  read:4 1 1 1 1
  read:4 1 0 1 1

The delay comes from:

  -- Couple bus master and slave through bus registers.
  closeReg [bit, bits imem_bits] $ \[rStrobe,rData] -> do
    bus_wr <- bus_master (BusRd rStrobe rData)
    (BusRd rStrobe' rData', soc_output) <- bus [rx, tx_bc] bus_wr
    return ([rStrobe', rData'], soc_output)

I did this by accident. But is it really necessary to have this delay? Removing it will make the CPU critical path longer, but has the advantage that reads take only one cycle instead of two. It doesn't seem to be an issue at this point, but good to keep in mind.

Entry: uart TX
Date: Tue Oct 23 16:57:23 CEST 2018

Something is not right with tx_done. Note: read has a wait state, but tx_out should not be 0.

  iw       ip tick_20khz tx_done tx_out
  -------------------------------------
  read:2    8 0 0 0 (4x)
  read:2    8 0 0 1 (12x)
  read:2    8 0 0 0 (96x)
  read:2    8 0 1 0 (2x)
  drop      9 0 1 0
  push:150 10 0 1 0
  loop:11  11 0 1 0 (32x)
  loop:11  11 0 1 1 (119x)
  push:17  12 0 1 1

Instead of trying to debug this, rewrite it. Use a proper state machine. It's ok to use an external bit clock, but it is necessary to wait. Maybe it is the stop bit? Because IF we wait for the next pulse to send the start bit, it is ok to set done high once the stop bit is sent out, giving some time during that bit to send a consecutive one.

EDIT: Test with /12 baud rate sending out 8 bit 0 bytes.

  (+ 108 2 1 1 1 1) ; 114
  (* 9 12)          ; 108

  uart_tx:
  iw      ip tx_done tx_out
  -------------------------
  drop     0 1 1 (3x)
  write:2  1 1 1
  drop     2 0 1
  read:2   3 0 1 (9x)
  read:2   3 0 0 (108x)
  read:2   3 1 0 (2x)
  drop     4 1 0
  drop     5 1 0
  drop     0 1 0
  write:2  1 1 0
  drop     2 0 1
  read:2   3 0 1 (5x)
  read:2   3 0 0 (108x)
  read:2   3 1 0 (2x)
  drop     4 1 0
  drop     5 1 0
  drop     0 1 0
  write:2  1 1 0
  drop     2 0 1
  read:2   3 0 1 (5x)
  read:2   3 0 0 (108x)
  read:2   3 1 0 (2x)
  drop     4 1 0
  drop     5 1 0
  drop     0 1 0
  write:2  1 1 0
  drop     2 0 1
  read:2   3 0 1 (5x)
  read:2   3 0 0 (26x)

EDIT: I need a test for UART, so for the example soc in CPU.hs, add a baud clock. Dammit, I can't figure it out. Too complex.

EDIT: Bus size was 12 bit, which caused the uart to be 8 bit as well.
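Rather than squinting at the trace, a throwaway reference model helps: the expected TX line level per clock cycle for one frame. Plain Haskell, not Seq; the 8N1 framing (start bit low, 8 data bits LSB first, stop bit high) is standard UART, and the /12 divider comes from the test above:

  import Data.Bits (shiftR, (.&.))
  import Data.Word (Word8)

  -- One 8N1 frame at a /12 baud divider: start bit (0), 8 data bits
  -- LSB first, stop bit (1); 10 bits * 12 cycles = 120 cycles total.
  txFrame :: Word8 -> [Int]
  txFrame b = concatMap (replicate 12) (0 : dataBits ++ [1])
    where dataBits = [ fromIntegral (b `shiftR` i) .&. 1 | i <- [0..7] ]

Comparing this cycle list against the iw/ip trace makes it easier to see whether tx_done rises during the stop bit or after it.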
Entry: Memories
Date: Thu Oct 25 12:03:19 CEST 2018

The problem that keeps coming back: how to "route" memory read and write ports properly? This is an integration question, so essentially, the core routines should not perform any binding to memories. Closing over memory is then done at the point where both ports are accessible, e.g. when instantiating a bus.

Entry: Test with memory contents
Date: Thu Oct 25 13:57:22 CEST 2018

EDIT: Changed API to also return frozen memory contents.

Haskell array with negative bound?

  Ix{Int}.index: Index (0) out of range ((0,-9223372036854775808))

Because of unspecified array address word size, it defaults to 64 bit, which wreaks havoc when it is used as a Haskell array size. Maybe fix that?

EDIT: Got a test: words, serialized, going into ram.

Entry: Simulating protocols
Date: Fri Oct 26 15:33:49 CEST 2018

The only thing needed here is feedback. What about providing a stub that is called inside the main loop? It can just be an ST operation.

EDIT: I have a first step: an interface that will take other kinds of test bench data apart from just a list of inputs. Now to actually implement it. Next: make a test that actually uses a state machine, or better still, that also implements an input list as a state machine. This needs an ST variable to store the current list.

Entry: ports vs. bindings
Date: Wed Oct 31 09:30:59 CET 2018

There is still an awkward point in how abstraction works: I would like to bundle inputs and outputs (e.g. an RS485 line with controls), but how to do this if I and O are always split as arguments to submodules, and bindings of return values? Maybe the solution is the same as ever: abstract things as functions. To make it more concrete, it might help to write an actual flat FPGA toplevel using the same API as a nesting of component instantiations.

Entry: Verilog frontend
Date: Wed Oct 31 09:38:39 CET 2018

To make this evolve and integrate into other tools, it will be necessary to expose modules as readable verilog. From the Haskell side, it might mean that there is a need for a syntax frontend, to at least be able to carry node and register names into verilog, and also have something resembling design hierarchy.

Entry: Tradeoff: blocking read vs. explicit bit wait
Date: Sat Nov 3 10:12:20 CET 2018

Either reserve an address for every bit, or add a new instruction to wait for a specific bit. This is just a multiplexer. The return value is wasted, but does it matter? Otherwise 2-op is needed, which adds more wires. Not worth it.

Entry: Setting up dev env for CPU tests
Date: Wed Nov 28 12:15:42 EST 2018

- standardize boot. basically, I want a kind of DHCP for these boards: plug in the board and have it load code.
- once booted, the main interface is spi. is there a way to standardize this to TCP?

EDIT: See exo.txt. This is really a naming problem: some things have only local names, but they can be embedded in some kind of wrapper that implements a global name. So essentially such resources have 2 names: a host name and a local resource name.

EDIT: Going through the motions

  tom@panda:~/asm_tools$ make f_soc.ct256.bin
  tom@zoe:/i/panda/home/tom/asm_tools$ make f_soc.ct256.iceprog

I forgot about how to upload the ram. Ok got it. Added to makefile and iceprog.sh script.

Entry: Next
Date: Sun Dec 2 00:28:22 EST 2018

Got through a couple of practical issues, mostly related to incremental builds. But I'm ready to move on to UART control of the CPU. Some UART code would be nice on the breakout board. I can use the FTDI for that.
FTDI bus B is connected to:

  PIO0_13 B12 RS232_Tx_TTL
  PIO0_14 B10 RS232_Rx_TTL

Maybe do this in asm_tools first.

Entry: bitserial CPU : SERV RISCV
Date: Sat Dec 15 18:41:22 EST 2018

https://twitter.com/OlofKindgren
https://www.businesswire.com/news/home/20181206005747/en/RISC-V-SoftCPU-Contest-Winners-Demonstrate-Cutting-Edge-RISC-V
https://github.com/olofk/serv

Also look at: https://github.com/olofk/fusesoc

Entry: pizza dinner hdl
Date: Fri Dec 28 21:51:39 CET 2018

- asic: 50% RTL design, 50% implementation
- Write RTL s.t. test insertion automation works
- 10% of area is scan/test code
- statistical design: variability per gate is large
- extra gates to allow metal-layer corrections

Entry: reset behavior
Date: Wed Jan 9 11:32:50 CET 2019

Always a problem.. Example: an edge detector that self-initializes, using the first signal as an initialization. This seems to require an extra flip-flop. Let's write it compositionally: edge detector, but kill the first output. Easily done using a 0->1 transition delay.

Entry: Off-by-one
Date: Wed Jan 9 11:42:07 CET 2019

It remains a real pain to navigate pre/post delay signals. It seems that a good way to go is to assume this will be wrong, and always use some kind of redundant assert. Digital design is really about pipelining. If it were just combinatorial logic, it would be easy.

Entry: Think compositionally
Date: Wed Jan 9 11:46:15 CET 2019

So if the point is to just express the circuit without regard to things that could be optimized by re-arranging combinatorial logic, it seems that the core idea behind my approach is to push 'compositional thinking' much further into the core of the circuits. If I look at Verilog or VHDL code, I see a lot of raw state machines. I don't like this approach. It seems to make more sense to create a couple of primitive state machines, and solve problems using composition, which is easier to get right. Basically, this is the Forth or generic FP idea.

Entry: CPU
Date: Fri Jan 11 21:16:46 CET 2019

Might have some time for hacking on the CPU. To try:

- better timing
- 2-stack machine

Other Seq things:

- Verilog cosim
- Import verilog modules (via yosys netlist)

Entry: ecosystem integration
Date: Tue Jan 15 04:42:31 CST 2019

So, very important: find a way to integrate into a larger ecosystem. It is already possible to create Verilog modules, but it will also be necessary to go the other way: import and instantiate verilog modules.

Entry: Syntax frontend
Date: Tue Feb 5 10:32:24 EST 2019

I really want a syntax frontend. Monadic notation is too cumbersome. There are two ways I see:

- Template Haskell + S-expressions
- Conal's CCC

I think the latter is too big a risk. The rough edges are likely going to need advanced insights to resolve. So go with s-expressions.

Entry: SERV
Date: Wed Mar 20 18:27:35 EDT 2019

https://github.com/olofk/serv

SERV was written with manual logic optimization using Karnaugh maps.

Entry: Higher level abstractions
Date: Wed Mar 27 10:41:33 EDT 2019

Instead of trying to battle the bottom layers through local reasoning, use a higher level abstraction. Looking into statecharts and decision tables (via Hillel Wayne).

TODO:

- Take these abstractions and compile them down to a flat state machine
- Keep states abstract to allow different encodings. Aim to synthesize such that tools can recognize the state machines and recode them.

Entry: Sequencers
Date: Fri Apr 5 14:32:20 EDT 2019

How to make design more modular.
I have an application that would be not too hard to implement on a CPU, but I'm likely going to need state machines to implement the stages. I am still very much not into designing fucking state machines. Let's think about that for a bit. Why is doing things sequentially so much easier? Let's take a look at some examples:

- uart SLIP packet framing?
- length framing + checksum
- CRC computation

What I need is a good way to represent this. A framework in which it is easy to embed a state machine. I'm going to need a good idea to push through this fatigue. One problem I have is connecting modules together. I don't seem to get very far other than designing everything with the CPU embedded.

Entry: Look at the geometry of the problem
Date: Fri Apr 5 16:30:43 EDT 2019

For a simple state machine, that's a graph. For a push-down automaton, that's something else. There are actually two things that make a CPU approach simple:

- sequential programming
- subroutines (parameterized repetition)

Subroutines are mostly useful as a compression step. It is part of the incremental solving of a problem, adding to the solution when there is already something there. Maybe leave this as step two? So what is step one?

- Write some pseudocode.
- Coalesce everything that can be parallelized
- Write an explicit state diagram for this (in Haskell)
- Compile that down to a bit-level implementation.

So start by encoding the program and the opcodes as a data type. I need an example state machine.

Entry: RAM as decoder
Date: Fri Apr 5 16:35:50 EDT 2019

A decoder is typically an "expansion", e.g. a small word fans out to a lot of control lines. A RAM already has that structure: 8 address bits to 16 data bits. Maybe that is done intentionally? I.e. it is wide by design? Can it be made wider? Yes, by chaining 2 RAMs. Can it be made wider using only a single RAM? Yes, by multiplexing. E.g. two 7->16 bit maps evaluated using two clocks. When is this more efficient than using LUTs?

Entry: State machine composition
Date: Fri Apr 5 16:47:06 EDT 2019

This is the holy grail, really. If one machine can reuse and restart another, that is efficient use of resources. This is pretty much what a CPU + peripheral combo would do. However, it does not make a lot of sense to build a machine and have it sit idle for most of the time. So one machine restarting another is a special case of a PRODUCT type. One machine running after another is a SUM type. Note that reuse in the case of a SUM type is only possible if the different machine modes share some logic.

EDIT: What Axel said: the hardware is going to be there anyway, so why not pipeline?

Entry: Process: why is this still hard?
Date: Fri Apr 5 20:56:17 EDT 2019

1. I don't want to look at (complex) previous solutions.
2. Maybe it's just too ad-hoc? Hard to compose?

Entry: Rekindle the fire
Date: Fri Apr 5 21:11:46 EDT 2019

Why did I lose intrinsic interest? Probably because of getting disappointed that the link to Verilog isn't really useful yet, so this does not yet integrate into a larger world. Maybe something to work on? Anyway, I forgot most of this. Just getting into it will likely revive it again.

Entry: alright, revolution!
Date: Sat Apr 6 07:58:20 EDT 2019

How to take advantage of the fresh start, without falling back into old pits. It is all about feedforward data processors with hidden internal state. Where does my mind want to go? Instead of where it is forced to go? The gap between dream -- what attracts me to this -- and application is too large. So what is the real problem?
This morning, I felt energized again by fixing the build system to create immediate feedback. The problem is really the resistance to reload all that scaffolding context. To solve this:

1. Make sure it is encoded in the build system in a very granular input-output form. E.g. one test should generate one report file.
2. The granularity is important for reconnecting to the problem.

Entry: Compile to ST without TH?
Date: Sat Apr 6 11:16:46 EDT 2019

I really don't like the limitations that TH imposes. Maybe it is best to compile to something else, like LLVM, and load the code dynamically that way.

Entry: Designing hardware with a control CPU
Date: Sat Apr 6 15:46:58 EDT 2019

1. The tradeoff is to put high level control logic in the CPU, and low level (i.e. fast-switching) control logic in the peripherals.
2. The interface between CPU and peripherals can be optimized: there is no need to make the peripherals nor the CPU general purpose. This translates to CPUs with a limited instruction set, and peripherals with a tiny (single) register interface.
3. This allows the CPU and peripheral sides to be disentangled and tested individually using non-implementable code.

Entry: How to get used to large I->O functions
Date: Sun Apr 7 10:32:47 EDT 2019

I think I needed an explanation about why some things seem messy. Here's the thing: a lot of circuits have a criss-cross nature. This is an inherent feature of parallel circuits and is what carries their usefulness. However, wiring all this up is not easy. So, a good approach seems to be the idea of a configurable mux, where a central object takes control words, and connects things together. Two observations:

- This object abstracts a "web" behind a simple interface.
- The instantiation of this object necessarily takes this plethora of I/O lines.

So it is ok to bundle up all these lines into a single large collection object (i.e. an environment). I've been struggling with internalizing this. Probably because I am used to thinking in terms of "create connection", as opposed to having everything be an I->O function.

So where does this bad intuition come from? It comes from the representation of circuit diagrams containing boxes connected with lines. But these "lines" are always directional! And in a functional representation, the line is represented by the definition/use relation. Why is this such an easy error to make? Because those lines are _physical_. They are wires that can be seen on the circuit board as well. But what we don't see is that each wire has a direction. So in this case, the idea of a symmetrical wire is just plain wrong. When superimposing the direction, the diagram becomes a whole lot more "messy", and it is clear that the boxes are just some arbitrary grouping of things into large I to large O functions. It is important to realize that a full circuit, e.g. an FPGA config, is essentially a function that takes a large number of inputs and produces a large number of outputs. The composition then looks like taking the large I, splitting it up, feeding it into a large number of circuits, collecting the outputs of the large number of circuits and pushing it out as a large O. Summarized: there is a tension between:

- The desire for a simple I->O functional representation
- The boxes + wires view of circuits

The latter gets in the way. It is intuition that has to be unlearned.
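In code the shape is simple. A hedged sketch, where the record types and the soc function are hypothetical; only the I -> m O shape matters:

  -- A toplevel circuit is one big function from an input bundle to an
  -- output bundle, inside the Seq monad.
  data Ins  r = Ins  { pinRx :: r S, pinBtn :: r S }
  data Outs r = Outs { pinTx :: r S, pinLed :: r S }

  top :: Seq m r => Ins r -> m (Outs r)
  top (Ins rx btn) = do
    (tx, led) <- soc rx btn   -- split the large I, collect the large O
    return $ Outs tx led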
Entry: Tree vs. Path representation
Date: Sun Apr 7 11:12:52 EDT 2019

Because of the flat nature of the I/O field, it seems simplest to introduce hierarchy in the form of a path encoding, as opposed to nested structures. This relates to a pattern I've seen pop up in other places: if hierarchy gets too arbitrary, it is often more convenient to switch to a flat environment, and put the hierarchy in the keys instead. Of course, both tree and path representations are isomorphic. It is merely a matter of convenience for the packing/unpacking code.

Entry: The low level monad
Date: Sun Apr 7 11:58:09 EDT 2019

Basically, it is a machine language. It should have a layer on top of it, but I do not know how to do that apart from very ad-hoc constructs, and it is not really essential. Just ugly. Ok to just live with it until inspiration hits.

Entry: contravariance
Date: Mon Apr 8 08:35:02 EDT 2019

So here's something to think about. Contravariance is a very important concept. ( See above, regarding the confusion about dropping the direction of signal lines. ) A circuit is represented by I -> m O, where m is the DSL monad. A consequence of this is that it is not possible for I and O types to appear in the same (covariant) product type. Basically, it makes no sense to have a pair (I,O) appear anywhere. The O is always contravariant to I. ( I wish I had a better way to explain this without hand-waving. ) Summary: this is a roundabout way of saying: keep input and output structures separated, and make them make sense at the lowest levels. The rest will fall out by itself.

Entry: Naming: enable / strobe / clock?
Date: Mon Apr 8 08:55:52 EDT 2019

Events on a dedicated bus are always represented as a pair consisting of a control bit carrying time information, and a data word carrying payload. It seems best to call the clock bit "strobe", because:

- enable: refers to something being on/off for a longer time
- clock: refers to register clocks, and it is NOT visible in RTL

Because strobes are so common, name them with an _s postfix? Actually, just name it clock. There is no possible confusion, because in RTL, the "square wave" master clock is not visible as a signal. So just use all 3, with this meaning:

- enable: on/off for a long duration
- clock: data streams
- strobe: command streams

So clock/strobe is arbitrary to some extent. Feel it out.

Entry: Dynamic probe names
Date: Mon Apr 8 10:26:34 EDT 2019

Extend the probe names with environment nesting that is derived from the instantiation.

Entry: Test circuit
Date: Mon Apr 8 10:51:54 EDT 2019

I have that delta1010 box. It shouldn't be too hard to synthesize an interface for DAC only. Then move on to ADC.

Entry: Multiplexers
Date: Tue Apr 9 09:46:10 EDT 2019

Here's the thing: digital circuit design is mostly about multiplexers. So Seq should have really good abstractions for that. Basically, this is the 'case' or 'switch' statement. Currently I have this working for lists, but it should work for arbitrary collection types. Is there a way to do this "flattening" operation better? Write it as a fold? What I want is an algebraic data type instead of a case or switch statement. As I've recently learned, it should be possible to write it as a fold instead. Summary: multiplexers are the most important circuit, and I've made it very difficult to express them. A few elements:

- monadic notation is cumbersome
- grouping only supports lists
- defaults are quite useful
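To make the pain concrete, this is roughly what the list-based form looks like in use today. This is extrapolated from the switch/ifs examples above, with all signal names hypothetical:

  -- Two outputs, selected on 'state'; the last argument is the default.
  [out0, out1] <- switch state
    [ (idle, return [low,  low])
    , (busy, return [high, dat]) ]
    (return [low, dat])

Every branch has to spell out every output, in the same order, as a bare list: no field names, no "keep previous value" shorthand.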
Entry: Imperative construction of circuits
Date: Tue Apr 9 09:59:41 EDT 2019

This is the thing: the "imperative metaprogramming" of standard HDLs isn't all that bad. It is optimized for registers keeping their value. How hard is it to introduce that functionality in Seq?

Entry: Traversable, Zip
Date: Tue Apr 9 10:03:27 EDT 2019

Actually, this already works:

  switch :: (Seq m r, Traversable f, Zip f) =>
    r S -> [(r S, m (f (r S)))] -> m (f (r S)) -> m (f (r S))

I just miss the instances. So the idea is that the data structure nesting is just "skin-deep". I need to get deeper into Haskell stuff before I can wiggle out of this. The core problem seems to be that I really want to encode signals at the type level to make them easier to work with, but that would be a really big change. It would be a different language. It's possible to do gradually, but will require building Seq on top of Seq2, or vice versa, to make it practical (i.e. not horribly break things).

Entry: Parameterized decision table
Date: Tue Apr 9 10:10:33 EDT 2019

So what I have is quite a clear-cut decision table, but the outputs are multiple. To fix this, split it up a bit. I.e. I have 16 busses to control, but I'm never controlling them individually. There are only two modes of control:

- all of them do the same in parallel, or
- only one is active and the rest is idle

This is just 2 cases. That switch can be solved separately.

Entry: Don't try to abstract muxes
Date: Tue Apr 9 10:21:55 EDT 2019

I keep wanting to express muxes differently, but I always come back to the only thing that makes sense: describe what happens for each output separately. This makes the code simpler to read. Don't try to group it too much.

Entry: Registers and commands
Date: Wed Apr 10 08:01:32 EDT 2019

Here's something that is annoying me: It's a lot of hassle to separate register and control word into two separate names. Why not flatten the namespace and always use (addr,data) pairs?

Entry: Make probe names hierarchical
Date: Thu Apr 11 07:55:12 EDT 2019

Should be as simple as changing the type from String to [String], and adding an environment variable that names the current context. So let's try the former first and propagate the refactoring.

Entry: Combine Seq and EDSP?
Date: Fri Apr 19 09:15:52 EDT 2019

It all needs to be one language. The real problem is "unrolling". Maybe it is possible to create a program, and then specify how it should be time-multiplexed? I have a driver program: the filter for radiopos.

Entry: Better signal type
Date: Fri Apr 19 09:21:53 EDT 2019

This can probably be done gradually. But really, this seems to be the whole idea: A "language" should not just be a bottom-up stack. It should have its primitives defined as well. The point is to increase the granularity of the type classes, such that algorithms can be very generic, and instantiation can be very specific. I.e. if there is a multiply, it could be expanded to a multiplier expressed in Seq. Seq needs to be recursive.

Entry: A driver application: quadrature tuner
Date: Fri Apr 19 09:25:56 EDT 2019

See radiopos for the high-level idea.

Entry: Avoid "inferring"
Date: Fri Apr 19 12:13:06 EDT 2019

In hardware mapping it is typical to infer subcircuits. If tools support it, go ahead, but in general it seems best to just be explicit about things.

EDIT: This is an important insight. Why have a "base line interface" anyway?
It is what separates the programmable system from the hard-wired system. So "Seq" is some kind of model that works in most practical cases.

Entry: Extend types gradually
Date: Sat Apr 20 08:31:45 EDT 2019

There are now at least two things that would benefit from parameterization in the type system:

- Bit types
- Substrate (e.g. mul or no mul)

These can probably be introduced gradually by writing the old in terms of the new, and then moving the library written in terms of old to new. This in itself is not the issue right now: bit types are values passed in at instantiation time, and substrate is just Seq, but with some parts eliminated (exposed as errors). Seq is dataflow + sequential feedback. It cannot by itself express other types of hierarchy, such as running a program on a CPU. This needs to be captured by "recursive" Seq.

Entry: The DSP language: combinators and algebraic substrate
Date: Sat Apr 20 08:43:45 EDT 2019

Two problems need to be solved:

- how to combine iteration patterns into higher level operations (hide iteration)
- how to express algorithms in a way that algorithm analysis is possible. Basically, define a (hierarchical?) set of classes that can represent matrices, autodiff, etc.

Entry: Start with Ring
Date: Sat Apr 20 09:02:59 EDT 2019

- Basic arithmetic is expressed in a Ring.
- The ring can be "parameterized", which is mostly there to parameterize constants to be able to do autodiff.

EDIT: I have Ring and complex numbers over a ring. The goal now is to express a complete Seq system using only these abstractions, and have it generate C code. I will need that first actual implementation to then generalize to more complex things, to see how it all fits together in detail. I need a little break before continuing, but here is the basic idea: Create a C program that generates a damped complex oscillation. Basically the generalized counter "hello world" program. It is very important to be able to do things like this:

- have a class-level language that has a notion of exponential function
- implement it using a polynomial or an update equation

I just want very extreme modularity. Haskell is a great substrate for that.

Entry: Ring vs. System
Date: Sat Apr 20 11:41:53 EDT 2019

Dataflow operations are simple. They are expressions in the Ring. However, I will want to build systems before I plug them into analysis.

Entry: Parameterized Z-transform
Date: Sat Apr 20 11:45:57 EDT 2019

This is the important concept. It is the output of a derivative of a non-linear system. Organizing this might take a couple of iterations. But it seems there is a tremendous amount of leverage to be exposed. I know something is there but I don't see the path. And I'm afraid I'm going to dull out before I am able to write this down. It takes quite a bit of time to just load that context in my head, though there is a lot of echo from the past just loading it this morning...

Entry: An example
Date: Sat Apr 20 11:52:21 EDT 2019

A complex 1-pole AR process. The intermediate abstraction is a system. That is what should be translated to Seq. Is this possible? Or does it need some commuted version? Where Seq implements the feedback operator. It will have to.

Entry: The problem is commutation.
Date: Sat Apr 20 11:57:43 EDT 2019

Think of it at this higher level, and things will open up. Haskell is not a good substrate because it doesn't allow to express this. You'll need to find a way to represent the higher level, and then compile it to Haskell. A meta step is essential here. Dependent types will help but they won't be accessible to you yet. The next step is to just write down that damn exponential and stare at it.
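Written down, it is small. A hedged plain-Haskell sketch, with complex multiplication spelled out instead of going through the Ring class (the C constructor matches the one in Algebra.hs):

  -- z(k+1) = a * z(k), with a = r * e^(jw), r < 1:
  -- a damped complex oscillation, the generalized counter "hello world".
  data C t = C t t deriving Show   -- (re, im)

  cmul :: Num t => C t -> C t -> C t
  cmul (C ar ai) (C br bi) = C (ar*br - ai*bi) (ar*bi + ai*br)

  osc :: Double -> Double -> [C Double]
  osc r w = iterate (cmul a) (C 1 0)
    where a = C (r * cos w) (r * sin w)

take 10 (osc 0.99 0.1) spirals in toward the origin; the generated C program would be the same update equation with the iteration made explicit.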
EDIT: I have to learn how to organize type class hierarchies, i.e. how they are derived. I've just winged it up to now. I need to push through to really understand it before I get anywhere. Here's an example of a clash I did not expect. Matching instances:

  instance Seq m r => Ring m (r S)
    -- Defined at ../deps/asm_tools/asm-tools-seq/Language/Seq/DSP.hs:55:10
  instance [safe] Ring m t => Ring m (C t)
    -- Defined at ../deps/asm_tools/asm-tools-seq/Language/Seq/Algebra.hs:96:10

The latter seems fine. Is the former actually correct? Maybe it is the other way around? Head starts hurting now...

Entry: Why is this unintuitive?
Date: Sat Apr 20 12:26:27 EDT 2019

This is clear:

  Ring m t -> Ring m (C t)

Why should this be different:

  Seq m r -> Ring m (r t)

The former means: given that Ring operations are defined for t, here's a way to define Ring operations for C t. The latter means: given that there is an implementation for Seq, here's a way to implement Ring. There isn't anything wrong with this, but what is clear is that it should not be the only instance, maybe? Maybe r just needs to be made explicit. This smells like a detail. Something not really that important.

EDIT: Yes, it's a detail. It's probably possible to remove it using some more "pinning" of type classes, but basically the problem is that the compiler doesn't know that the C type is not used as a representation. In practice it seems that m and t are never generic when this class is instantiated, so it is not an issue.

EDIT: Looks like this is going to happen a lot.

Entry: Bridging the two systems views
Date: Sat Apr 20 13:59:56 EDT 2019

In Seq.Algebra I need a representation of a system to be able to compute a Z transform (or generate a Seq function that computes the Z transform, which might be useful for things like graphical display). How to match that with the requirement of Seq, to implement closeReg? Actually, the Z transform is just function evaluation. It is probably also not necessary to distinguish between linear and non-linear, because it will simply not be defined for nonlinear functions.

Entry: ST
Date: Sat Apr 20 14:10:10 EDT 2019

Why can't the Seq monad just be ST? Why is the intermediate TH step necessary?

Entry: Lists and Functor, Traversable, Zip
Date: Sat Apr 20 16:37:36 EDT 2019

I'm going to need structured arrays, e.g. matrices of complex numbers, that will have to flatten down to just lists of things. Is there a simple way to create a data structure such that it automatically has Functor, Traversable and Zip? I've been here before. What is a representable functor?
http://hackage.haskell.org/package/representable-functors-3.2.0.2/docs/Data-Functor-Representable.html
Not for now. I need to focus on the application.

  data C t = C t t deriving
    (Eq, Show, Functor, Foldable, Traversable)

  instance Zip C where
    zip (C ar ai) (C br bi) = C (ar,br) (ai,bi)

  c2 f a b = sequence $ zipWith f a b

Entry: Z transforms
Date: Sat Apr 20 17:56:52 EDT 2019

I can't figure it out. Something is missing.

EDIT: I have some rules figured out, but the beef is still in applying this to systems. I don't see where to go next... say I have:

  s -> i -> m (s, o)

To create the z-transform, the state needs to be eliminated by using up the expression for delay-as-phase-shift. I need to do this on paper first to trigger muscle memory, then it will become obvious.
EDIT: The way to look at this is to look at a linear autoregressive process. Using the notation I'm used to, where x is the state vector, A is the system matrix, B is the input-to-state matrix, and i is the input:

  z x(z) = A x(z) + B i(z)
  =>
  x(z) = (zI - A)^-1 B i(z)

To construct an evaluator for x, it might not be necessary to compute the inverse. Given z and i, this reduces to a system from which x can be solved. So what do we actually have? The system for which we're computing the z transform is not the original system: it is the linearized version. A is a matrix of partial derivatives. These are still non-linear in the parameters of the system, but we can actually compute the numerical values using autodiff, then solve the system to produce an evaluator for the z transform. I'd like to implement this so it can all fold in on itself to keep all components reifiable and composable.

EDIT: Yep. Doing this on paper first helped a lot to figure out what exactly I was looking for. Basically the entire thing I did today around Z-transforms is quite meaningless.

Entry: Z-transform, summary
Date: Sat Apr 20 22:02:59 EDT 2019

This will only work on linearized systems, so first create an autodiff instance to be able to compute partial differentials. So to implement autodiff, implement normal numbers.

EDIT: Almost done with that. Rest is easy to fill in.

EDIT: I've re-invented rai/ai-freq.rkt

  ;; Basic strategy:
  ;; - normal eval: eval parent semantics lifted over small signals (linear approx)
  ;; - feedback:
  ;;   - split variables in parameter / signal
  ;;   - compute output offsets
  ;;   - compute differential matrix of update function
  ;;   - return (memoized) z-dependent signals
  ;;   - compute z-dependent transfer function from z, matrix
  ;;   - apply transfer function
  ;;
  ;; Obtain the frequency response of a linear system by first
  ;; probing the function for the linear system matrix, and then
  ;; compute the effect of feedback through matrix inversion
  ;; (solve linear system).

Entry: System solving
Date: Sat Apr 20 23:01:11 EDT 2019

Could also be implemented in Seq. It's probably useful to have as a library operation. Do it first without pivoting. Later, pivoting might be necessary. Not ideal:

- Cramer's rule (< 4 variables ok?)
- Gaussian elimination

Better:

- LDU factorization?

RAI computes m_inv.

Entry: Next
Date: Sun Apr 21 07:49:24 EDT 2019

It feels like the z-transform is distracting from the problem at hand, which is to get some actual code going to do phase demodulation. Otoh, it would require solving a couple of things. What I miss is insight and basic structure. The split of Ring vs. Function seems to be a good idea. Should div be part of Ring?

Entry: Base ring/field
Date: Sun Apr 21 08:12:06 EDT 2019

I think it makes sense to add the base field to the definition of Ring. Essentially, I'm trying to incorporate the idea of a vector space as well. Maybe that should be kept separate.

EDIT: The idea here is indeed to define

  class (Ring m t, Traversable f, Zip f) => Vector m f t

I.e. it is really just a constraint on a functor.

Entry: Abstract representations of vectors
Date: Sun Apr 21 08:42:09 EDT 2019

Eventually, I want all loops to be implemented target-side, so there will need to be some notion of fold and zip that are implicit enough to push through the representation monad. This is the next tough problem. It cannot be captured in Seq. It is a different type class: Loop? What about this: C and D are always inlined.
There doesn't seem to be a good reason not to, but matrices and polynomials are implemented in terms of abstract iteration and storage patterns.

Entry: Test case for implementable vectors as loops
Date: Sun Apr 21 08:54:29 EDT 2019

Basically, compute the norm of a vector, but leave the vector representation abstract. C.hs and Term.hs will need to be extended to support two ideas: loops and arrays. The biggest win would be if this can be written in a way that a low level compiler can eliminate intermediate storage, i.e. perform loop fusion. In RAI I do this manually. E.g. instead of

  f3 a = let b = fmap f1 a
         in  fmap f2 b

where b is an intermediate vector, it would be possible to map (f2 . f1) over a and produce the result directly, where the 'b' values are only scalar inside the loop. I believe this should be easy enough to do as long as the storage allocation of b is visible only to the function and not outside. I need to ask somebody with good knowledge of LLVM. Or, just make a test. I believe the terms are "fusion" and "deforestation".

  void f(const float *src, float *dst) {
      float tmp[10];
      for (int i=0; i<10; i++) { tmp[i] = sin(src[i]); }
      for (int i=0; i<10; i++) { dst[i] = sin(tmp[i]); }
  }

  void f(const float *src, float *dst) {
      for (int i=0; i<10; i++) {
          float tmp = sin(src[i]);
          dst[i] = sin(tmp);
      }
  }

I think it's ok to assume that this will work out fine. So to implement the loop operations, it should be enough to just separate declaration and binding.

Entry: Target vectors
Date: Sun Apr 21 09:37:51 EDT 2019

So it seems almost trivial as long as the interfaces are there. But how to provide that interface? Step one: implement some code that is abstract in the vector type. This is going to be a bit of work.

Entry: Loops are environments
Date: Sun Apr 21 09:53:06 EDT 2019

( Relation to Representable functors? ) Luckily, I just need vectors at this time, so it can be quite concrete. Vectors will be the base for all Functor, Traversable, Foldable, Zip behavior. There are only two operations that are important at this time:

- fold (accumulation, vector to scalar reduction)
- zip (vector to vector maps)

Is there a natural way to express these? Yes, by assuming each loop has a single structure that is natural to C-like code:

- input / output vectors
- accumulators / state
- nested versions of these

Then implement the standard classes in terms of these primitives. How do I start this? Probably best by extending Term.hs such that C.hs can generate it. It doesn't feel like today is the day for it though.

- Declare and initialize accumulators
- Declare vectors
- Insert the loop head
- Insert the loop body

EDIT: Some key insight is missing. This is a tangle, and I need to find a starting point to then see the loose ends.

Entry: Necessarily Meta
Date: Sun Apr 21 15:23:00 EDT 2019

Because their main point is to say something about the code that is completely lost when it is compiled.

Entry: C.hs
Date: Mon Apr 22 07:58:21 EDT 2019

Start by creating a new kind of binding: a loop. A loop has two main parts:

- Output array
- Output accumulator

To simplify, these are always there, and there is just one of them for now. Extend to multiple arrays or accumulators once the basic structure is ready. So it is clear now where Term needs to be extended. The question now is where to add the concept of collection. Should "node" be extended to mean a combination of accumulators and arrays? Maybe the first change to make is the ability to have multiple outputs from a primitive, because loops will likely have multiple outputs.
It seems safest to extend Term into something that can express this structure. Maybe TermLoop.hs? It might be possible to extend Term.hs in-place, but I worry about breaking things, so let's just decouple it for now and create an intermediate form. Or, just go for it. I believe the only necessary change is in binding. Note that the node type is abstract in Term.hs. Let's first find out the concrete type of C.compile:

  compile :: Show n => String -> CompileResult n -> String

So it doesn't care about the node type. Let's add another constraint on n that embeds the looping:

  class Loop n
  instance Loop NodeNum

  compile :: (Loop n, Show n) => String -> CompileResult n -> String

This class then could expose the required nesting. ( This is the first time I think about using type classes to express folds to extend data types. Is this the right way to go? This is always possible, I just never realized it was an option.. ) So the current approach is to just stick to Term, and extend it recursively using a type class.

Entry: Primitives with multiple return values
Date: Mon Apr 22 08:22:31 EDT 2019

I think I miss this feature to be able to abstract a loop as something that returns a set of arrays and a set of accumulators. Implementing the loop body is easy: it just contains nodes that are parameterized by the current loop count. The idea is to extend the node type to also express array references. This should work as well. Basically a loop is a siso applied to an array. Where to introduce the recursion? It should be something like SeqLoop?

Entry: Summary: SeqLoop
Date: Mon Apr 22 08:40:54 EDT 2019

The extension needs to happen as:

  class Seq m r => SeqLoop m v r where
    zipfold :: ([r t] -> [r t] -> ([r t], [r t]))
            -> r [t] -> r [v t] -> r ([t], [v t])
    -- zipfold body initStates inputVectors = (outStates, outputVectors)

Or something like that. The problem this solves is to unpack the representations of arrays: inside the body function, there are only scalar representations. This is going to take some shuffling to get right. But it seems that once expressed properly, the construction of instances is going to be straightforward. Generalized:

  class (Seq m r,
         -- Generalize [] grouping functors to a,i,o
         Zip a, Traversable a,  -- accumulators
         Zip i, Traversable i,  -- inputs
         Zip o, Traversable o   -- outputs
        ) => SeqLoop m r a i o where

    -- This implements the typical "tagless-final" style where a
    -- combinator flips the nesting of representation (r) and collection
    -- (a,i,o) type constructors.
    zipfold :: (a (r t) -> i (r t) -> (a (r t), o (r t)))
            -> (r (a t) -> r (i t) -> (r (a t), r (o t)))
    -- zipfold loopBody initAccus inputVectors = (outAccus, outputVectors)

I'm pretty sure that's it. The rest is implementation, which can now be done in a type-driven fashion. That's for tomorrow morning as I don't think it's going to be a small change. Probably will need a new data type.

Entry: zipfold
Date: Sat Apr 27 09:16:02 EDT 2019

What's next? Multiple outputs / bindings. The reason for multiple outputs is sharing: there is likely some intermediate value that then forks into two. It doesn't seem possible to express this as single bindings. Maybe this isn't so simple. At least not today. This might need some brightness to resolve... Yeah this is not going to work today.

Entry: zipfold
Date: Sat May 4 07:37:42 EDT 2019

I feel ready to tackle this. First thing: multi-output bindings. Seq is only single-output, so a simpler way would probably be to create bundling of bindings instead?

  data Binding n = Binding n (Term (Op n))
                 | Probe (Op n) [String]
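One obvious variant, just to see the shape; a hypothetical constructor, not implemented anywhere:

  -- Let one Term bind several nodes at once, e.g. a loop producing
  -- both arrays and accumulators.
  data Binding n = Binding n (Term (Op n))
                 | MultiBinding [n] (Term (Op n))   -- hypothetical
                 | Probe (Op n) [String]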
data Binding n = Binding n (Term (Op n)) | Probe (Op n) [String]

So if n were a collection, this would be solved? I feel the solution to this is simple, but I don't see it, and I'd have to first build a couple of things that don't work before the proper way becomes obvious. Let's just clone Term.hs and make it multi-binding. This is going to be more difficult. Let's just do it in-place. I want this to be a simple change. What do I really need? A construct that introduces a loop variable, but otherwise does nothing special. I come back to the same architecture as RAI.

Entry: Go back to RAI
Date: Sat May 4 08:03:42 EDT 2019

RAI uses the observation that loops just introduce context = loop variables, and arrays are just index-parameterized variables. Single assignment is preserved, except for accumulators. So arrays themselves are easy: just add some annotation. Assume that loop indices are just scalar base types. Accumulators are the sore point. Can they be written as single-assignment as well? E.g. they are just arrays, but written in such a way that there are no back-references. What about generalizing this: an accumulator is really a "triangle". Allow for back-references, but optimize them out in the case of an accumulator.

Entry: Two-pass full array + triangle feedback
Date: Sat May 4 08:09:59 EDT 2019

So the idea is to embrace the array nature and allow for "full history" in the loops, then optimize out:
- intermediate arrays that are only used element-wise
- translate triangles to accumulators
Do this in two passes: one that generates only the normal form, and one that can perform the optimizations. This means the zipfold form has to change, as back-references of outputs are allowed, and full references of inputs are allowed as well. So the language I'm writing is an array language with "triangle feedback" of the outputs. This way there needs to be no distinction between accumulators and outputs, as each output can be used as an accumulator, as long as the index referencing stays within the triangle. It seems that representation of this at the Term level would be straightforward. It needs only arrays and loops that range over an index. The output of a single iteration is a collection of scalars. The input is a collection of arrays.
- Some of these are original inputs, and are full scale.
- Some are outputs computed in the previous iteration.
So this needs the representation of an array, probably as a function. Some cases:
- one loop incrementally computes an array
- another loop can be run to perform a computation on that whole array
So it can't be just "global" variables. Scope really needs to be local. Storage can be reused though, as long as outputs are not propagated. This isn't a simple problem. But it is good to realize that a more general view (triangles instead of accumulators) is better. So again:
- at any "step", you can compute one scalar value from a number of other scalar values (n-op)
- the place where this value is allocated, i.e. the hole, is what we are managing.
- at each "step", we also know exactly the valid range of the input arrays.
- based on context, holes can be re-used in subsequent iterations, but this is an optimization. In the first iteration, allocate everything for single-assignment. Then later on, "project" the variables, where each projection eliminates a dimension.
- start with a setup where all loops have a fixed size, then generalize to computed fixed sizes (e.g.
allocate intermediates on stack), then generalize to data-based iteration sizes with just an upper bound on the allocation.

The idea is still the same as RAI, only RAI was too eager with optimizing out intermediates. Representing this is going to be a challenge. Term can probably be used because it has parameterized nodes. What we add is:
- loop variable contexts
- context dependent nodes for value binding
- primitives for dereferencing
So, where to start? Structurally, the first thing that needs to happen is to introduce context nesting. Once that is possible, the rest will become clear.

Entry: holes
Date: Sat May 4 08:48:55 EDT 2019

It seems simpler to do this by making holes explicit. There are two ways of looking at it:
holes:
- put hole allocation code in the output
- parameterize the remaining code generator with the hole
outputs:
- run the code generator
- from the outputs it produces, insert the hole allocation before the generated code
So they are pretty much the same thing apart from some juggling. Using outputs would also allow local variables to remain local, i.e. escape analysis will become simpler: if a value that is computed inside a loop does not survive it, it can be allocated inside the loop without reference to the current iteration variable. So each variable would have its full dimensionality, plus an annotation indicating at which loop exit it is "dropped". This needs to ferment.

Entry: Map/Fold
Date: Wed May 8 07:30:29 EDT 2019

It is important to look at this the right way. Ideally, I would like to have only one looping construct: the one that includes output feedback. This encodes a mapping between the multi-dimensional object, and a sequential encoding of the dependencies. Note that mathematically, there might be many forms of evaluation. So ideally I want to express only the dependencies. But in practice, I will need a space filling curve, a sequence. Does it make sense to express things in that higher abstraction? Or is it better to stick to the reality of the loop? In general it is better to do the specific case first, to at least have an example of what to generalize.

What is an array?

a :: i -> v

With the caveat that i is contained in an interval, and for a fold, the interval grows during the loop. In this context, map looks like:

map :: (a -> r) -> (i -> a) -> (i -> r)

In practice, we want access to the current index:

imap :: (i -> a -> r) -> (i -> a) -> (i -> r)

The multivariate version has the element types be tuples. This is no essential difference, so let's ignore it for now. The version where random access is possible:

irmap :: (i -> (i -> a) -> r) -> (i -> a) -> (i -> r)

Note that (i -> a) now can be factored out completely into a global environment. Maybe this is important?

(i -> a) -> (i -> r)

What does this actually mean? That a constant input is just a function that has no influence on the shape of things. Let's move on to folds first. A fold looks like:

ifold :: (i -> s -> a -> s) -> s -> (i -> a) -> s

Generalizing the loop state to an array, and allowing access to the previous results inside the loop gives:

iafold :: (i -> (i -> s) -> s) -> (i -> s) -> (i -> s)

This seems to be the essential component, because all of these i->s are different things. The induction step is to take

i -> s where i \in [0,n[

to

i -> s where i \in [0,n]

It is a transformation between types. I don't think I can express this in the current encoding.
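As a sanity check, here is a concrete executable model of iafold, with arrays modeled as lists instead of functions (a sketch, not the IR encoding):

-- The body may read the whole prefix constructed so far (the "triangle").
iafold :: Int -> (Int -> (Int -> s) -> s) -> [s]
iafold n body = go 0 []
  where
    go i prefix
      | i == n    = prefix
      | otherwise = go (i + 1) (prefix ++ [body i (prefix !!)])

-- A plain accumulator fold (prefix sums) is the special case where
-- step i only reads element i-1.
psums :: [Int] -> [Int]
psums xs = iafold (length xs) step
  where
    step 0 _    = head xs
    step i prev = prev (i - 1) + xs !! i
-- psums [1,2,3,4] == [1,3,6,10]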
To read iafold's type: take a function that grows an array by one element, and use it to transform an array of size 0 into an array of size N (concrete). Separately, provide some context such that the array growing function can also reference arrays that are not being constructed. So basically, I'm building a sequential array constructor language. Even simplified: the input i->s is actually just () if it is an empty array. So the type can be even simpler:

iafold' :: i -> (i -> (i -> s) -> s) -> (i -> s)

Take the size of an array, a body that has access to the current size and the array constructed in the last step, and produce a new array. Actually, the body needs access to the total size as well, but this can be hidden in the closures because it is a constant. So it appears that the idea of "environment" is very important. Also, there is some need to turn this into a type family, where the induction step can be expressed at the type level. Starting from the main iteration:
- map fits in here by never using the input state
- fold fits by only using the last state (reusable state variable)
- more complex siso machines can use multiple delay elements
It seems that focusing on the one true abstraction makes sense, because that will be the only one that needs to be implemented. In a second step, constraints can be added at the type level for more constrained iterators, and these type annotations could then be propagated to the intermediate language to be used in non-local optimizations, i.e. optimizations that will need to fold over the entire IR.

Summary:
- make indices explicit
- explicit indices allow constant arrays to be hidden in an environment
- the essential "step" is appending one element to an array, where the whole previous array can be used as input.
- todo: find some way of encoding the n->n+1 size extension induction in the type
- additional patterns could be expressed as constraints on the main pattern: map, fold, and combinations with more complex state.

Entry: Context
Date: Wed May 8 11:26:16 EDT 2019

One of the central ideas in the previous section is to simplify by abstracting away the context. This will require some support for higher order functions in the representation, which will mostly be just arrays defined at particular levels. This will be simple nesting without context escapes, so there should be no generic lambda+app that can take a context and "move" it somewhere else. EDIT: But still, if the interface to this mechanism is just lambda, then the capturing and application need to somehow be constrained. Maybe in a first iteration, an explicit approach is better? I.e. an environment is an explicit collection of arrays?

Entry: Random access
Date: Thu May 9 07:26:12 EDT 2019

The point is that the primitives should support random access. This needs some kind of dependent typing trick to allow encoding of bounds. If indices are not processed, this is no big deal, but I also want to have a way to compute indices.

Entry: Lambda
Date: Thu May 9 07:31:11 EDT 2019

So let's try to get some constraints on abstraction and application.
- When creating a closure, it needs to contain a type marker that allows it to be inserted deeper into a loop, but not higher up. So there is a type family of loop nesting as well.
- The arrays that are contained in the context, together with the current loop indices, will always be valid if we just go deeper. So these can just be tracked. The index into the type family is the loop nesting.
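A minimal sketch of what that could look like, using a type-level Nat as the nesting depth (hypothetical encoding, invented names):

{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}

data Nat = Z | S Nat

-- An array representation indexed by the nesting depth at which it
-- was defined: each surrounding loop adds one 'S'.
data Arr (d :: Nat) a where
  Scalar :: a -> Arr 'Z a
  Nest   :: (Int -> Arr d a) -> Arr ('S d) a

-- Going one loop deeper is always allowed: a value defined outside
-- a loop is still valid inside it.
deepen :: Arr d a -> Arr ('S d) a
deepen v = Nest (const v)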
EDIT: The environment can be encoded as a nested type-level construct using positional encoding. Each level contains an array and some information describing the range of the array.

Entry: Next?
Date: Thu May 9 14:22:01 EDT 2019

I need to understand more about type level programming to be able to make good decisions. Go over this one: https://www.parsonsmatt.org/2017/04/26/basic_type_level_programming_in_haskell.html

Entry: Type level stuff
Date: Fri May 10 10:09:08 EDT 2019

First, maybe write a wrapper around Seq that allows fixed size bit vectors.

Entry: Make arrays total
Date: Fri May 10 12:58:25 EDT 2019

Make the programming model such that the arrays are total functions, e.g. just using wraparound, but create some kind of type-level tracking to remove bounds mapping when it is not necessary.

Entry: Implement some loop
Date: Sat May 11 08:18:38 EDT 2019

I'm quite stuck, so let's start with something. The last practical hurdle was not having multiple return values in a binding statement, preventing a recursion point. This can be solved by introducing collection nodes, and collection dereferences.

Entry: Moving forward on the loops
Date: Sun May 12 02:07:15 EDT 2019

It is important to see that all this fancy type stuff is just notation. It makes it possible to fit the idea into a larger framework without having to do a lot of manual "matching". Or at least, without having to do manual verification of that matching. The idea should stand on its own. If it is too hard to express, find a different way. Either untyped, or more concretely typed for a particular example.

Entry: Feldspar and fused representations
Date: Sun May 12 08:56:45 EDT 2019

There was a trick there. Some earlier RAI notes mentioned that I need a loop transformation algebra. Maybe start there?
1. It should be simple to create a generic unfused version, where all arrays are explicit, and time feedback is not in the picture.
2. Perform fusion on that
3. Introduce state on top of it?
So the new focus is to separate feedback from vectors. The core of the target language doesn't have anything to do with feedback at the inner loops.

Entry: Grid
Date: Sun May 12 10:29:44 EDT 2019

I started doing this in a separate module. Two important insights:
- Grid needs to be separated from Seq. Basically:
  - Seq = Expr x Time (implicit)
  - Grid = Expr x Space (arrays)
- Grid will provide "intermediate storage holes" for Expr.
So the interface between Grid and Expr is at the level of ALLOCATION. Practically, when a Grid compiler passes control to an Expr compiler, it will need to provide the mechanism to create a variable in a context. Maybe I should stop here to not get too confused.

EDIT: Continued a bit, distinguishing a number of entities into separate types. It can now represent:

T c[100];
T d[100];
for (int i = 0; i < 100; i++) {
  c[i] = op(a[i], b[i]);
  d[i] = op(c[i], c[i]);
}

EDIT: This is still not correct, as arrays can span multiple loops. So next is to encode nested loops.

EDIT: Will it be possible to mix different "types" of expressions inside loops? No. This will have to be recovered in post-processing. Let's try to express at least 2 levels. In this light, there is a difference between
- the loop being expressed abstractly (iterate over i,j)
- the actual concrete nesting order
The loop order will be important for implementations, as it will allow for certain kinds of optimizations. But. If everything is inside a single loop, then nothing is inside a single loop. The product _doesn't add any structure_.
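A toy model of the structure being distinguished here: allocations and (possibly nested) loops containing per-cell operations. Hypothetical constructors, not the actual Grid module types; the op name is elided.

type Array = String
type Index = String

data Stmt = Alloc Array [Int]                       -- T c[100];
          | Loop Index Int [Stmt]                   -- for (i) { ... }
          | Set (Array, [Index]) [(Array, [Index])] -- c[i] = op(a[i], b[i])

-- The single-level example from above:
example :: [Stmt]
example =
  [ Alloc "c" [100]
  , Alloc "d" [100]
  , Loop "i" 100
      [ Set ("c", ["i"]) [("a", ["i"]), ("b", ["i"])]
      , Set ("d", ["i"]) [("c", ["i"]), ("c", ["i"])] ] ]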
So it appears that this entire exercise is about identifying reuse patterns in nested loops. And I am very naturally arriving at the same approach as RAI, though with a separate optimization step.

Entry: Escape analysis
Date: Sun May 12 14:59:03 EDT 2019

Whether a value escapes will determine whether it is temporary.

Entry: Identifying reuse patterns
Date: Sun May 12 15:13:03 EDT 2019

This needs a simplified notation. Basically, map the full target language representation onto something simpler, define operations on the simpler form to play with them, and then lift them to the target language representation. To make these transformations work, they need to be expressed in an algebra. Starting from where we were:

T c[100];
T d[100];
for (int i = 0; i < 100; i++) {
  c[i] = op(a[i], b[i]);
  d[i] = op(c[i], c[i]);
}

This can be stripped to

(def c 100)
(def d 100)
(loop i 100
  ((c i) (op (a i) (b i)))
  ((d i) (op (c i) (c i))))

Removing the definitions:

(loop i 100
  ((c i) (op (a i) (b i)))
  ((d i) (op (c i) (c i))))

Ignoring the loop ranges: we know there is some range, but its extent is non-essential.

(loop i
  ((c i) (op (a i) (b i)))
  ((d i) (op (c i) (c i))))

The transformation is then between the above and

(loop i
  (c (op (a i) (b i)))
  ((d i) (op c c)))

The general rule: a dimension can be removed if a variable does not escape a context. ( Or if only the last variable does. ) Then loop order for multiply-nested loops could be determined by checking each order, and comparing the optimizations. Removing mention of "op":

(loop i
  ((c i) (a i) (b i))
  ((d i) (c i) (c i)))

The transformation is then between the above and

(loop i
  (c (a i) (b i))
  ((d i) (c c)))

Getting rid of parentheses, where <- means binding + some abstracted primitive operation:

i: c(i) <- a(i) b(i)
   d(i) <- c(i) c(i)

i: c() <- a(i) b(i)
   d(i) <- c() c()

Then get rid of operations and use upper case for array names, and lower case for indices:

i: Ci <- Ai Bi
   Di <- Ci Ci

i: C <- Ai Bi
   Di <- C C

Entry: Construction of loops
Date: Sun May 12 15:25:53 EDT 2019

Basically, a loop lifts an expression over an array. It is essentially "map". If there is back-reference, it is a "fold". If there is random access back-reference, it is a "triangle fold". It seems that this problem can be done quite simply by using full grid notation. It would only need each "map" or "fold" to add an extra dimension. It's important to realize that this is not just for scalars! Each loop is a +1 in the number of grid parameters. Because the difference between map, fold and triangle fold is just array access, it might be good enough to focus on map.

Entry: Loop algebra
Date: Sun May 12 15:33:17 EDT 2019

What are the operations?
- fuse / split
- project / inject (eliminate)
- hoisting
- interchange
Where deforestation is a combination of fusion and elimination. I think that's pretty much it. Here are some more: https://en.wikipedia.org/wiki/Loop_optimization

Here are the (bi-directional) operations in the notation developed above, in "reductive" order.

- FUSE
  i: Bi <- Ai
  i: Ci <- Bi
  =>
  i: Bi <- Ai
     Ci <- Bi

- ELIMINATE
  i: Ci <- Ai Bi
     Di <- Ci Ci
  =>
  i: C <- Ai Bi
     Di <- C C

- HOIST
  i: C <- A B
     Ei <- C Di
  =>
  C <- A B
  i: Ei <- C Di

- INTERCHANGE
  i: j: Cij <- Aij Bij
  =>
  j: i: Cij <- Aij Bij

Entry: summary
Date: Sun May 12 16:10:31 EDT 2019

- to define an algebra, first define it on a simple, concrete language (i.e. create a notation first), and then generalize it to a language with practical annotations (e.g. something that maps to C loops and arrays).
- the RAI idea isn't bad. It was just missing a split between generation of the full grid intermediate, and subsequent optimization based on a loop algebra.
- the algebra is "large", i.e. there are a lot of degrees of motion inside the algebra. Any optimization will likely need heuristics. It's not just straightforward reduction.
- higher order functions just add a dimension to the grid. They are not just scalar->vector.

Entry: LTA
Date: Tue May 14 08:54:00 EDT 2019

So I have a language form and a prettyprinter. Next is to define the operations, and define some folds. The next organizational step is to define a monad-parameterized language such that creating terms would be easier, and use it to build a prettyprinter. EDIT: Do a concrete version of this first. E.g.:

test_val = Program $
  [loop' i [loop' j [let' (c i j) (a i j) (b i j)]],
   loop' i [loop' j [let' (d i j) (a i j) (c i j)]]]
  where [a,b,c,d] = map a2 ["A","B","C","D"]

This should expose an even more concise representation.

Entry: Expressing the operations
Date: Wed May 15 08:40:04 EDT 2019

There's something not quite right, because it is very hard to express the transformations without a lot of conditions. I want to split this up:
- Find a way to represent the structure, zoomed in at a particular site
- In this zoomed in state, the transformation itself will be trivial to express
For fuse it is already quite simple: zoom in on two adjacent loops. For interchange it is simple as well. It seems that the search is really about representation spaces. About notation. Maybe start by condensing some of the constructors. EDIT: Done: there are just 2 now:
- Program = collection of Form
- Form = branch of LetLoop and LetPrim
I think the insight is that these transformations are not local. They all essentially consist of an iteration pattern and a decision rule that uses non-local information. Start by splitting fuse into two components:
- an iterator that goes through the list of terms, pairwise, and
- an inner routine that Maybe produces a fused element
So many insights pop up while doing this, but it is so hard to record or remember them.

Entry: About folds
Date: Wed May 15 09:41:17 EDT 2019

Why do I have trouble writing a fold for this?

data Form = LetPrim Cell [Cell]
          | LetLoop Index [Form]

Maybe because the constructors are not primitive? While 'Program' is just a wrapper around [], the [] introduces a substructure. The question is then, should the constructors of this substructure be exposed? Let's just try the dumb thing first. EDIT: It's easier to understand when making things a little more general, and looking at Form as being parameterized by the [] type. This then has a direct correspondence in foldForm being parameterized by foldList. The flattened foldr then has all the constructors involved in the two mutually recursive types: letPrim, letLoop, cons, nil, with the remark that there are now 2 types of "accumulators" for the two mutually recursive types. EDIT: The mutual recursion is now expressed properly in two separate legs, and a combined fold that treats all levels the same.

Entry: Next
Date: Wed May 15 11:31:26 EDT 2019

Rewrite fusion in terms of a fold.

Entry: Use the fold
Date: Thu May 16 08:34:28 EDT 2019

Curious, because I have never done it like this. It looks reasonable though.
Start with a no-op:

fuse' p = foldProgram letPrim letLoop cons nil where
  nil = []
  cons a b = a:b
  letPrim c cs = LetPrim c cs
  letLoop i fs = LetLoop i fs

To do the fuse, it can be done in two spots: either as part of cons on a per element basis, or as part of letLoop, operating on the whole list. Let's try cons first. I got to this, which only does one layer. Why is that? Aha, because it uses a non-recursive cons.

fuse' p = Program $ foldProgram LetPrim LetLoop cons [] p where
  cons h@(LetLoop i as) t@((LetLoop i0 bs):t') =
    case i == i0 of
      True  -> (LetLoop i (as ++ bs)) : t'
      False -> h:t
  cons a b = a:b

This is a lot more tricky than I thought. The fully recursive routine needs a cons that is non-recursive. OK I see now: it works inside out, and fusion is an outside-in operation: when the outside is fused, it exposes fusable insides. Running the operation multiple times does produce the correct result. So, is it possible to create a fold that is top-down? That is a breadth-first iteration pattern.

Entry: Local context
Date: Thu May 16 09:16:20 EDT 2019

So I need some iteration pattern that can associate a primitive term to some context. To use Traversable, there needs to be a Functor structure. What is the contained type in this case? I'd say the basic element would be the primitive expression. We want to modify the indexing there. First, make it such that Form can take a single binding parameter. EDIT: Done. It was easy, and instances can be derived automatically. This is awesome. So it should be easy now to create a primitive transformation operation based on context. Basically, traverse, with a custom monad. Looks like there are two different ways to look at this:
- Custom, using a monadic iteration based on the generalized fold
- Flattened, using standard iteration patterns.
It appears that there needs to be some routine that moves information from the structure that is hidden to the functor (loop indices) into the contained values. EDIT: I've created the function below, which exposes the path to the functor structure.

annotate :: Program b -> Program ([Index], b)
annotate (Program fs) = Program $ forms [] fs where
  forms path fs = map (form path) fs
  form path (LetPrim b)    = LetPrim (path, b)
  form path (LetLoop i fs) = LetLoop i $ forms (i:path) fs

Now that could be tucked away in a Monad. Is it necessary though? I kind of like this explicit structure.

Entry: Escape analysis
Date: Thu May 16 10:24:42 EDT 2019

Now for the core issue: construct a list of variables that are only defined inside a loop. Note that we actually have that information in the original form: we know the return values of the form. So it is likely safe to assume the intermediate form will have an explicit list of arrays that will be visible outside of its scope. There's a problem: when two loops are fused, the outputs might no longer escape. So it seems that the analysis needs to be performed anyway. What does it mean for a variable to not escape? That it is no longer referenced after the loop has finished. So given a loop segment, we need to isolate the segments that come after it. This needs to be done for each binding. It's not entirely clear how to mix primitives and loops inside a forms list. So let's just start building this and see where it ends. I'm starting to run out of steam. This stuff is exhausting. Ok, resume. I do not know what to do with irregular nesting, but I do know that the end of a loop is a splitting point. EDIT: I'm missing an insight, an angle. I need to let this pop up by itself.
EDIT: Ok I tried several times. It's not working. I can't retain enough context.

EDIT: Took a much longer break. Let's create a simpler zipper. Follow the Haskell tutorial first to refresh intuition. So, really, it is just a stack. I just need a stack, because only the future matters. It's not that I haven't done that before! Here's just the iteration pattern, doing nothing but linearly traversing and keeping a context.

data Ctx b = Ctx { ctxStack :: [[Form b]], ctxCode :: [Form b] }

zip_next (Ctx (fs:fss) [])              = zip_next (Ctx fss fs)        -- pop context
zip_next (Ctx [] [])                    = ()                           -- end
zip_next (Ctx fss ((LetPrim b):fs))     = zip_next (Ctx fss fs)        -- skip
zip_next (Ctx fss ((LetLoop i fs'):fs)) = zip_next (Ctx (fs:fss) fs')  -- push

Note that this does just the other part of the index generation. So what about changing that code to do full zipper/future annotation? I think I understand: perform a traversal in a state monad, and annotate each node with the zipper. This can probably be generalized to do all kinds of things. Ok, so generalize annotate to monadic form and remove all explicit passing:

annotate' :: Monad m => Program b -> m (Program ((), b))
annotate' (Program fs) = fmap Program $ forms fs where
  forms fs = traverse form fs
  form (LetPrim b)    = return $ LetPrim ((), b)
  form (LetLoop i fs) = fmap (LetLoop i) $ forms fs

Now m can be made to be the state monad. EDIT: I have it split up, but this really feels like doing double work. It actually is, because there is an actual recursion and the updating of a datatype that represents the recursion.

Entry: Escape analysis
Date: Fri May 17 08:33:03 EDT 2019

See LTA.hs. I now have:

escapes :: Context Let -> Array -> Bool

So the coordinate transformation should be straightforward to implement. If an array does not escape, remove all loop indices. I'm not actually sure that is the whole picture, so let's implement it first. Ok, the basic idea works! But one thing I missed: it is not just a local substitution, because all references need to be substituted. EDIT: Doing it in two steps: make a list of intermediate (non-escaping) variables, and use it to perform a substitution.

-- LTA
i: j: C   <- Aij Bij
      Dij <- C Aij
i: j: E   <- Aij Dij

Together with fusion, this is the basic thing I need. I don't think that hoistable things would show up in generated code, and interchange, well, I don't really see a way to add a cost function. It's time to start making some examples that use intermediate arrays, which is the whole point of this. This will only make sense when there are folds involved, because otherwise it would be possible to fuse. So I need another operator that uses "triangular access" and "feedback access", and show that this cannot be fused. These two seem different. Do "feedback access" first.

Entry: Different intermediates
Date: Fri May 17 17:05:06 EDT 2019

In RAI, all intermediates were local and did not survive to the next loop. I remember this being a problem for the FDN implementation. EDIT: I believe it was the (sparse) matrix multiplication. I think the central point is to have one loop compute something, and have another loop use that value. The problem in RAI is that multi-pass structures are not supported. So start with something simple. Multipass is necessary when there is some kind of global -> local data dependency. Anything goes really, so let's pick vector normalization: sum squares -> 1/sqrt^2 -> scale. The LTA language cannot yet express accumulators.
i: Ai <- A(i-1) + Bi

Can I express something that would need multipass without using accumulators? It doesn't seem so. Accumulation is key. Some remarks:
- Any kind of triangle feedback would be allowed in a first iteration. Only the regular ones should be replaced by accumulator variables.
- This "allowable range" idea should be extended to generic grids as well: if a grid was computed in a previous loop, it can be used entirely. Otherwise only the local part is accessible. Keep this open and add it once an example pops up.

Entry: accumulation
Date: Fri May 17 17:20:16 EDT 2019

i: Ai <- A(i-1) + Bi

This notation abstracts the need to initialize the accumulator before the loop starts. Representing accumulators is important, so maybe it is possible to generalize:

i: A_i <- A_f(i) + B_i

Then based on what A_f(i) is, we can implement it as a single accumulator, a finite set of accumulators (a shift register), or a full array. So that is all fluff. Let's use i' as the previous index, with implicit initialization. Actually I don't have any way to use the two types of references: the previous accumulator value, and the final value after the loop.

i: A  <- A Bi
i: Di <- A Ci

So it appears that accumulators need some special notation. To make it work for triangle patterns, use the triangle notation:

i: Ai <- Ai' Bi
i: Di <- AI Ci

Where I is the last element that was stored in the array in its constructing loop.

Entry: LLVM loop optimizations
Date: Fri May 17 19:14:40 EDT 2019

https://www.youtube.com/watch?v=QpvZt9w-Jik
- Tensor Comprehensions https://research.fb.com/announcing-tensor-comprehensions/
- Halide https://halide-lang.org/

Entry: Accumulators
Date: Sat May 18 08:41:23 EDT 2019

So it's already established that:
- all accumulator references are "full triangle".
- replacing triangles with single (or multiple) accumulators is an optimization that can be derived directly from the usage pattern, e.g. if 1) inside the loop only the last acc value is referenced and 2) outside the loop only the last element is referenced.
So the only necessary bit here is to make them representable. Instead of transforming indices, transform the arrays. This popped out through a notational shortcut. There are a couple of inconsistencies. During construction, there is a loop index. However, after construction is finished (e.g. in the next loop, or when just using an input array), it is not clear how to refer to particular elements in the arrays. What is clear is that arrays have definite types. Some rules:
- iteration ranges are always derived from the dimensions of the arrays that are being constructed in a loop. All of these arrays are necessarily the same size.
- they are not necessarily related to arrays that are used as input to a loop. This needs to be represented somehow.
- some dimensions are only there in a virtual sense.
What about leaving all the indices as they are in the code, but keeping track of which dimensions are actually implemented, based on the access patterns? Basically, abstract the array access. This keeps the original semantics. So the next step is to represent abstract array access. There are two "stages":
- array is defined: random access is allowed
- array is being defined: only back-referencing access is allowed
( Note: it is important to be able to nest triangles, i.e. one loop builds up an array one element at a time, while allowing a sub-loop to run over all the elements that have been computed up to that time. )
Based on the kind of access, the arrays can be thinly provisioned.
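A small sketch pinning down the two stages as Haskell types (hypothetical names, just to fix the semantics):

-- While an array is being defined at index n, only elements below n
-- (the triangle) may be read; once defined, access is random.
newtype Defined a = Defined (Int -> a)

data Building a = Building
  { current :: Int        -- next index to be written
  , sofar   :: Int -> a   -- valid only below 'current'
  }

backref :: Building a -> Int -> Maybe a
backref (Building n f) i
  | i >= 0 && i < n = Just (f i)   -- inside the triangle
  | otherwise       = Nothing     -- not yet constructed

-- Finishing a loop turns the partial array into a total one.
freeze :: Building a -> Defined a
freeze (Building _ f) = Defined f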
Entry: Triangles
Date: Sat May 18 09:53:49 EDT 2019

This idea really attached itself. Why is it so important? Because it is a _physical_ property that cannot be disputed. I.e. it is not a leaky abstraction: an algorithm running on a CPU is necessarily sequential, but while going through the steps, it can refer to all its input, and all its output. Capturing this in the core language is very important.

Entry: Fix the representation of definition and use
Date: Sat May 18 09:59:39 EDT 2019

The point is to create a model of the dynamic execution of the combination of sequencing (re-using) and looping (construction). At any point in the dynamic evolution, it should be 100% clear how the access patterns work. Summarized: array construction should be self-referential. Note that this does not allow for mutation. The model remains purely functional. So in a strong sense there is a limit. Maybe this can be extended later on. For now, let's stick to this idea because it allows mapping to functional languages. Also it keeps the writable and readable sections separate, which might be useful later on.

Entry: Ranges : loop indices are really just sizes
Date: Sat May 18 10:06:00 EDT 2019

What about this: each array has an associated defined range. Once a full loop has executed, this defined range is the size, and is an explicitly defined parameter that has the same standing as any other loop index. I.e. A uses index i during definition:

i: Ai <- Bi

But once the loop has executed, i could be left to represent the size of the array. This unifies two ideas: access to the size of an array, and the "current index" during array construction. So the index parameter is really a size parameter. Let this sink in for a bit. One thing can already be changed: loop indices should not be re-used, as they retain a value after a loop has finished. E.g. the following is valid: the first loop defines the accumulator (i' is i-1), the second loop uses its last value.

-- LTA
i: Ai <- Ai' Bi
j: Dj <- Ai Cj

Entry: Next
Date: Sat May 18 10:19:07 EDT 2019

So there is a representation that likely needs a couple more annotations, but should be able to cover things. Start generalizing it to something that will perform elimination while allowing accumulator reuse. E.g. one thing is definitely the use of multi-dimensional accumulators:

i: j: Aij <- Ai'j Bij
k: Dk <- Aik Ck

That might be the most basic example that is actually useful, to figure out how to optimize using only two optimizations: intermediate elimination for:
- independent references
- accumulators
The first one is already implemented: a complete dimension can be eliminated if the array is deemed intermediate. The second can be done as an extension: if the definition only uses backreference, and the use only references the last element, the dimension can be collapsed into a single value. Implementing the machinery for that will likely expose possibilities to generalize. So. In the example above, there is only the accumulation dimension that can be eliminated. How to write a matcher for that?

Entry: accu matcher
Date: Sun May 19 08:25:08 EDT 2019

- only backreference in defining loop
- only last reference outside
Should be straightforward. However, this needs to be tracked for each dimension separately.
The information that needs to be tracked is whether that dimension:
- is a local variable
- is an escaping accumulator
To tackle this, first fix escape analysis to return a data structure.

Entry: I'm going to need register allocation as well
Date: Sun May 19 08:50:35 EDT 2019

If an array is used as an intermediate value between two loops but no longer used after that, it should probably be reused. This is similar to what Pd does. In fact, the LTA should be able to represent the "block based" approach just fine.

Entry: How to modify escape analysis?
Date: Sun May 19 09:11:26 EDT 2019

I'm not seeing things clearly this morning. Maybe not a day to make changes.

Entry: grounding
Date: Sun May 19 12:51:22 EDT 2019

Make it practical first. Create the FDN in a language that can actually render to C, then create the mapping.

Entry:
Date: Sun May 19 16:44:04 EDT 2019

E.g. i,j: if it doesn't escape the inner loop, all the indices can be removed. If it does escape j but not i, j can .... ( I'm looking at this upside down. ) There is the case where the inner escapes, but the outer doesn't:

i: j: Bij <- Aij
   k: Cij <- Bik
... (no reference of B)

In this case, the dimension associated to i can be removed:

i: j: Bj <- Aij
   k: Cij <- Bk

It's time to start collecting all these special cases that pin down semantics.

Entry: Revisit
Date: Sun May 19 19:33:02 EDT 2019

Starting out with the original "zipfold", the current idea is similar, but just represented differently, and also allowing for some more freedom (triangle feedback). Maybe it's time to start working on a representation that can then use standard map and fold to end up with a loop? That can be done in LTA already. It will give more of an idea of how things are formed. Then continue with single-dim eliminate and accumulator detection.

Entry: A monadic test language
Date: Mon May 20 06:40:52 EDT 2019

p a b = do
  c <- op [a, b]
  d <- op [a, c]
  return [d]

Currently there is nothing to represent return, which I knew already. The most appropriate structure is the state continuation monad. Maybe this time, write it as a transformer? The question is then, what is the order of transformation? Alright... Because of the lack of nesting in the previous Seq and PRU languages, I've been able to avoid this one. I don't remember how it works, or where I have the code parked... I found something here: ~/darcs/meta/

sharing/monadic_sharing.hs
dspm/StateCont.hs

Let's just copy the latter. EDIT: Added Functor and Applicative instances, and created a basic 'op' primitive. Because this needs allocation, it's time to move from String to Int-indexed variables. Ok I remember: CPS is used just to be able to do the state threading. State is just the variable count.
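For reference, a minimal reconstruction of that state-continuation shape (an assumed sketch, not the actual dspm/StateCont.hs code): the state is the variable counter plus collected bindings, and CPS threads it through 'op'.

{-# LANGUAGE RankNTypes #-}

newtype SC s t = SC { runSC :: forall r. s -> (s -> t -> r) -> r }

instance Functor (SC s) where
  fmap f (SC m) = SC $ \s k -> m s (\s' a -> k s' (f a))

instance Applicative (SC s) where
  pure a = SC $ \s k -> k s a
  SC mf <*> SC ma =
    SC $ \s k -> mf s (\s' f -> ma s' (\s'' a -> k s'' (f a)))

instance Monad (SC s) where
  SC m >>= f = SC $ \s k -> m s (\s' a -> runSC (f a) s' k)

-- State: next variable number, and bindings in reverse order.
type Binding = (Int, [Int])        -- var := op(args), op name elided
type M = SC (Int, [Binding])

op :: [Int] -> M Int
op args = SC $ \(n, bs) k -> k (n + 1, (n, args) : bs) n

compile :: M [Int] -> ([Int], [Binding])
compile (SC m) = m (0, []) (\(_, bs) outs -> (outs, reverse bs))

-- The p above, with pretend input nodes already numbered:
test :: ([Int], [Binding])
test = compile $ do
  let a = 100
      b = 101
  c <- op [a, b]
  d <- op [a, c]
  return [d]
-- => ([1], [(0,[100,101]), (1,[100,0])])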
Entry: All .hs code in ~/darcs/meta Date: Mon May 20 07:04:44 EDT 2019 tom@panda:~/darcs/meta$ find -name '*.hs' ./Applicative/ApplicativeAlgebra.hs ./Applicative/ApplicativeNum.hs ./Applicative/Doodle.hs ./ArrowStack/ArrowStack.hs ./ai/Ai.hs ./ai/Armax.hs ./ai/Block.hs ./ai/Cas.hs ./ai/Connect.hs ./ai/DFL.hs ./ai/Flatten.hs ./ai/FlattenMonad.hs ./ai/Function.hs ./ai/Indexed.hs ./ai/Plot.hs ./ai/Procedure.hs ./ai/RatFunc.hs ./ai/SSA.hs ./ai/Shared.hs ./ai/Term.hs ./ai/test-Ai.hs ./ai/testFlatten.hs ./atom/scratch.hs ./closed_sm/ClosedSM.hs ./clua/old/CAnalyze.hs ./clua/BStruct.hs ./clua/CAnalyze.hs ./clua/PrintLua.hs ./clua/Setup.hs ./clua/clua.hs ./dspm/dist/build/autogen/Paths_dspm.hs ./dspm/SSM.hs ./dspm/0broken_Sys.hs ./dspm/0test_Loop.hs ./dspm/0test_Pd.hs ./dspm/0test_PrettyC.hs ./dspm/0test_TML.hs ./dspm/0test_integration.hs ./dspm/Array.hs ./dspm/Code.hs ./dspm/Control.hs ./dspm/Data.hs ./dspm/Lambda.hs ./dspm/LetRec.hs ./dspm/Lib.hs ./dspm/Pd.hs ./dspm/PrettyC.hs ./dspm/SArray.hs ./dspm/Struct.hs ./dspm/SysFold.hs ./dspm/Term.hs ./dspm/TermC.hs ./dspm/Type.hs ./dspm/Value.hs ./dspm/doodle.hs ./dspm/Sys.hs ./dspm/CSSM.hs ./dspm/Sys_.hs ./dspm/StateCont.hs ./haskell/asm/Logic.hs ./haskell/asm/TLEnv.hs ./haskell/doodle/TaggedList.hs ./haskell/doodle/affine.hs ./haskell/doodle/ainum.hs ./haskell/doodle/applicative.hs ./haskell/doodle/code.hs ./haskell/doodle/commterm.hs ./haskell/doodle/flatten.hs ./haskell/doodle/mapfold.hs ./haskell/doodle/memolet.hs ./haskell/doodle/nodes.hs ./haskell/doodle/onezero.hs ./haskell/doodle/sk.hs ./haskell/doodle/stack.hs ./haskell/doodle/stackoverflow.hs ./haskell/doodle/staged-in-unstaged.hs ./haskell/doodle/stream.hs ./haskell/doodle/tagless.hs ./haskell/doodle/tensor.hs ./haskell/doodle/typelist.hs ./haskell/exist_monad/exist.hs ./haskell/exist_monad/generality.hs ./haskell/exist_monad/iso.hs ./haskell/exist_monad/iso2.hs ./haskell/exist_monad/iso3.hs ./haskell/exist_monad/iso4.hs ./haskell/exist_monad/iso5.hs ./haskell/exist_monad/iso6.hs ./haskell/exist_monad/iso7.hs ./haskell/exist_monad/istream.hs ./haskell/old_Sym/Sym.hs ./haskell/old_Sym/SymAsm.hs ./haskell/old_Sym/SymEval.hs ./haskell/old_Sym/SymExpr.hs ./haskell/old_Sym/SymLLVM.hs ./haskell/old_Sym/SymTest.hs ./haskell/ssm/SigApp.hs ./haskell/ssm/SigBind.hs ./haskell/ssm/SigJoin.hs ./haskell/ssm/StateSpace.hs ./haskell/SigOp.hs ./llvm/llvm.hs ./pulse/FFT.hs ./pulse/constant.hs ./pulse/pulse.hs ./sharing/monadic_sharing.hs ./sharing/sharing_problem.hs ./sm/sm_test.hs ./staapl/staapl.hs ./z/Z.hs ./siso/Data.hs ./siso/dist/build/autogen/Paths_siso.hs ./siso/Vec.hs ./siso/Signal.hs ./siso/StateCont.hs ./siso/Eval.hs ./siso/Code.hs ./siso/RSignal.hs ./siso/Type.hs ./siso/Lib.hs ./siso/RSig.hs ./siso/Signal_r.hs ./siso/Test.hs ./siso/llvm-general/Shake.hs ./siso/llvm-general/llvm-general-pure/Setup.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/AddrSpace.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Attribute.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/COMDAT.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/CallingConvention.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Constant.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/DLL.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/DataLayout.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Float.hs 
./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/FloatingPointPredicate.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/FunctionAttribute.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Global.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/InlineAssembly.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Instruction.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/IntegerPredicate.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Linkage.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Name.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Operand.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/ParameterAttribute.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/RMWOperation.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/ThreadLocalStorage.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Type.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/AST/Visibility.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/DataLayout.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/Internal/PrettyPrint.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/Prelude.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/PrettyPrint.hs ./siso/llvm-general/llvm-general-pure/src/LLVM/General/TH.hs ./siso/llvm-general/llvm-general-pure/test/LLVM/General/Test/DataLayout.hs ./siso/llvm-general/llvm-general-pure/test/LLVM/General/Test/PrettyPrint.hs ./siso/llvm-general/llvm-general-pure/test/LLVM/General/Test/Tests.hs ./siso/llvm-general/llvm-general-pure/test/Test.hs ./siso/llvm-general/llvm-general-pure/dist/dist-sandbox-ae49e09/build/autogen/Paths_llvm_general_pure.hs ./siso/llvm-general/llvm-general/Setup.hs ./siso/llvm-general/llvm-general/src/Control/Monad/AnyCont.hs ./siso/llvm-general/llvm-general/src/Control/Monad/AnyCont/Class.hs ./siso/llvm-general/llvm-general/src/Control/Monad/Trans/AnyCont.hs ./siso/llvm-general/llvm-general/src/LLVM/General.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Analysis.hs ./siso/llvm-general/llvm-general/src/LLVM/General/CodeGenOpt.hs ./siso/llvm-general/llvm-general/src/LLVM/General/CodeModel.hs ./siso/llvm-general/llvm-general/src/LLVM/General/CommandLine.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Context.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Diagnostic.hs ./siso/llvm-general/llvm-general/src/LLVM/General/ExecutionEngine.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Analysis.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Atomicity.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Attribute.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/BasicBlock.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/CallingConvention.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Coding.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/CommandLine.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Constant.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Context.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/DataLayout.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/DecodeAST.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Diagnostic.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/EncodeAST.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/ExecutionEngine.hs 
./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Analysis.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Assembly.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Attribute.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/BasicBlock.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/BinaryOperator.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Bitcode.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Builder.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/ByteRangeCallback.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Cleanup.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/CommandLine.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Constant.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Context.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/DataLayout.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/ExecutionEngine.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Function.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/GlobalAlias.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/GlobalValue.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/GlobalVariable.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/InlineAssembly.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Instruction.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Iterate.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/LibFunc.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/MemoryBuffer.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Metadata.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Module.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/PassManager.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/PtrHierarchy.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/RawOStream.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/SMDiagnostic.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Target.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Threading.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Transforms.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Type.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/User.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FFI/Value.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FastMathFlags.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/FloatingPointPredicate.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Function.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Global.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Inject.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/InlineAssembly.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Instruction.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/InstructionDefs.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/IntegerPredicate.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/MemoryBuffer.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Metadata.hs 
./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Module.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Operand.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/PassManager.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/RMWOperation.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/RawOStream.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/String.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/TailCallKind.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Target.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Threading.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Type.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Internal/Value.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Module.hs ./siso/llvm-general/llvm-general/src/LLVM/General/PassManager.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Relocation.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Target.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Target/LibraryFunction.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Target/Options.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Threading.hs ./siso/llvm-general/llvm-general/src/LLVM/General/Transforms.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Analysis.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/CallingConvention.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Constants.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/DataLayout.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/ExecutionEngine.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Global.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/InlineAssembly.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Instructions.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Instrumentation.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Linking.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Metadata.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Module.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Optimization.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Support.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Target.hs ./siso/llvm-general/llvm-general/test/LLVM/General/Test/Tests.hs ./siso/llvm-general/llvm-general/test/Test.hs ./siso/llvm-general/llvm-general/dist/dist-sandbox-ae49e09/setup/setup.hs ./siso/llvm-general/llvm-general/dist/dist-sandbox-ae49e09/build/autogen/Paths_llvm_general.hs ./siso/llvm-general/llvm-general/dist/dist-sandbox-ae49e09/build/LLVM/General/Internal/LibraryFunction.hs ./siso/llvm-general/llvm-general/dist/dist-sandbox-ae49e09/build/LLVM/General/Internal/FFI/InstructionDefs.hs ./siso/llvm-general/llvm-general/dist/dist-sandbox-ae49e09/build/LLVM/General/Internal/FFI/LLVMCTypes.hs ./siso/Gen.hs ./siso/llvm-tutorial-standalone/src/Codegen.hs ./siso/llvm-tutorial-standalone/src/FFI.hs ./siso/llvm-tutorial-standalone/src/JIT.hs ./siso/llvm-tutorial-standalone/src/Main.hs Entry: State Continuation Date: Mon May 20 08:49:35 EDT 2019 Memory is coming back, painfully :) Cliche, but this is more difficult than I thought it would be. It's near obvious what it _should_ be when going through the motions, but performing the assembly requires a lot of small refinement steps. So let insertion works. Next is loop nesting. This will require some modification. loop $ \i -> do ... 
I think the state continuation monad is too simple: it does not return the final value of the state, and it appears as if an entire computation needs to be run as a subprogram. I wonder if the trick is to just embed the state in the return value of the expression? No, this is actually not possible because "return" cannot escape the context. So it really needs to be part of the type. Let's look at the original paper to see what the type is. This was a problem before:

-- Note: state will be forked! The Let insertion mechanism doesn't
-- allow recovery of state.
mBlock :: MCode (Code t') -> MCode (Code t)
mBlock sub = SC main where
  main s k = k s $ Code $ (subTerm s sub)

Ok, so I need a version of the SC monad that threads state through the entire computation. Maybe that isn't possible? Actually it is possible to get the state by just tagging it, but it will dump it deeply nested inside the data structure. I'm missing an essential insight. So let's implement it first with a state fork, then see what would be necessary to patch it up.

Entry: Pause
Date: Mon May 20 10:59:15 EDT 2019

It was very important to do the monadic language, because it is already quite clear there is something not right with the way bindings and loops interact. About the SC monad: maybe it just isn't the right abstraction? A free monad might work better. Maybe let insertion is just different: because of scoping rules, the forking is not an issue. So there are a couple of conclusions:
- Each sequence of bindings has a clear return value.
- A sequence of loops should also return something. Currently that can't be expressed.
I need to find a bridge between values and assignments. The return value of a sequence of primitives would be a list of cells. The return value of a loop is an array slice. prim / loop. The conceptual error is that a loop binds an array slice, just like a primitive binds a cell. So the IR tree is structured in a bad way. Fixing that is a big change.

A Form is Let | Ret
A Binding is either a prim or a loop

I do wonder though, for all of this, whether the transformations are really necessary, because the language will be able to represent fused functions. Split it off in a separate structure, or fix? There is a strong tension between:
- nesting data structures
- adding an index
Note that loop will need to be a "map style" operation. It needs to map something that is scalar to something that is vector, and vector to matrix, etc..

Entry: State Continuation Threading
Date: Mon May 20 12:09:49 EDT 2019

Something I still do not understand is why state can't bubble upwards. No, it's really not possible, because of the nested construction. The way to get it is to put it in the data structure and fish it out. Woah, I'm really stuck at this... I think I need to read the original paper. Because if this doesn't work, then there is no point in using anything but nested state monads, without the whole continuation business. So: give up. Use a more traditional writer/reader/state stack, and re-thread on every block.

Entry: State + Reader
Date: Tue May 21 08:03:20 EDT 2019

So kick out the let-insertion monad. It's a neat trick, and it works if state forks are not an issue. For my use case however, a more plain environment + state approach seems appropriate. To collect bindings, should I use a writer? Or should the bindings go in the state? EDIT: This works a lot better. loop can now explicitly run the inner forms, threading environment and state.
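A minimal sketch of that reader + state shape, with hypothetical simplified types (environment = loop indices in scope, state = fresh counter plus collected forms):

import Control.Monad.Reader
import Control.Monad.State

type Index = String
data Form = LetPrim Int String [Int]   -- var := op(args)
          | LetLoop Index [Form]
          deriving Show

type M = ReaderT [Index] (State (Int, [Form]))

op :: String -> [Int] -> M Int
op name args = do
  (n, fs) <- get
  put (n + 1, LetPrim n name args : fs)
  return n

-- 'loop' explicitly runs the inner forms: it swaps in a fresh
-- binding collector, runs the body under an extended environment,
-- and re-threads the state afterwards.
loop :: Index -> M a -> M a
loop i body = do
  (n, outer) <- get
  put (n, [])
  a <- local (i :) body
  (n', inner) <- get
  put (n', LetLoop i (reverse inner) : outer)
  return a

run :: M a -> (a, [Form])
run m = let (a, (_, fs)) = runState (runReaderT m []) (0, [])
        in (a, reverse fs)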
Entry: Indices
Date: Tue May 21 09:04:22 EDT 2019

Binding is easy: it always uses the current environment. Reference is more general. The core issue is in Let: binding and reference need to be split there. Ok, so I have something halfway meaningful. Below:
- i,j: input dimensions
- k,l loop: refs to ij are arbitrary, possibly a function of k,l
- m,n loop: refs to ijkl are arbitrary, possibly a function of m,n

-- LTA
k: l: Ckl <- Aij Bij
      Dkl <- Aij Ckl
m: n: Emn <- Dkl
      Fmn <- Aij Emn

Next: make the distinction between local loop variables and size variables apparent. Maybe track the dimensions in an environment through the compilation? Additionally, it should be impossible to make array dimension mistakes.

Entry: LLVM diff
Date: Tue May 21 10:22:23 EDT 2019

I've deleted the Language.EDSP.LLVM module, but it contained some valuable findings about how to map between a simple monadic language and LLVM. Pasted diff here. See asm_tools b0e69809a161f59da345bc8bab131c5b8b999564

diff --git a/asm-tools-axbc/Language/AXBC/LLVM.hs b/asm-tools-axbc/Language/AXBC/LLVM.hs
deleted file mode 100644
index 1964733..0000000
--- a/asm-tools-axbc/Language/AXBC/LLVM.hs
+++ /dev/null
@@ -1,80 +0,0 @@
--- TODO: This will need some focus on the structure of the monads. I
--- want to have my own stack to be able to implement some EDSP
--- language context. It seems that what is needed is to mix it with
--- the IRBuilderT transformer.
-
-{-# LANGUAGE CPP, OverloadedStrings #-}
-{-# LANGUAGE OverloadedStrings #-}
-{-# LANGUAGE RecursiveDo #-}
-{-# LANGUAGE DeriveFunctor #-}
-{-# LANGUAGE GeneralizedNewtypeDeriving #-}
-
-
-module Language.EDSP.LLVM where
-
-
-
--- LLVM
-import Data.Text.Lazy(Text)
-import Data.Text.Lazy.IO as T
-import Data.Text.Lazy.Encoding
-
-import LLVM.Pretty -- from the llvm-hs-pretty package
-import LLVM.AST hiding (function)
-import LLVM.AST.Type as AST
-import qualified LLVM.AST.Float as F
-import qualified LLVM.AST.Constant as C
-import LLVM.IRBuilder.Module
-import LLVM.IRBuilder.Monad
-import LLVM.IRBuilder.Instruction as I
-
-
--- EDSP
-import Language.EDSP
-
--- Implementation
-import Control.Monad.State
-
--- Start from the llvm_simple example. Create a tagless final wrapper
--- as soon as possible. It's going to be necessary.
-
--- Create the monad. For now it can just be the LLVM monad.
-
--- Split up in the different parts, exposing only the "pure" core on
--- the inside.
-
-llvm_simple :: Text
-llvm_simple = ppllvm $ llvm_module
-
-llvm_module :: Module
-llvm_module = buildModule "exampleModule" $ llvm_function
-
-llvm_function :: ModuleBuilder Operand
-llvm_function = mdo
-  function "add" [(i32, "a"), (i32, "b")] i32 llvm_entry
-
-llvm_entry :: Monad m => [Operand] -> IRBuilderT m ()
-llvm_entry [a, b] = mdo
-  entry <- block `named` "entry"; do
-    c <- llvm_pure [a, b]
-    ret c
-
--- The bridge to the LLVM monadic representation is the class
--- MonadIRBuilder, but it seems we can make it a little less abstract
--- by focusing only on.
-llvm_pure :: MonadIRBuilder m => [Operand] -> m Operand
-llvm_pure [a, b] = do
-  c <- I.add a b
-  return c
-
-
--- Use a custom monad to compile to LLVM.
-newtype M t = M { unM :: State String t } deriving
-  (Functor, Applicative, Monad)
--- instance MonadIRBuilder M where
-
-
-
-run = do
-  T.writeFile "/tmp/test.ll" $ llvm_simple
-  T.putStrLn $ llvm_simple

Entry: Next?
Date: Wed May 22 22:53:28 EDT 2019

Give a proper semantics to indices. EDIT: There is Ref and Def. Make that visible in the printout.
EDIT: What I want is a variable that contains a size, but also the static notion of size. There are two cases:

- during definition, size is a function of the loop indices
- after definition, size is a fixed number known at compile time

How to separate those phases clearly? Let's give the latter a capital letter in the print rep. OK: once a loop is finished, the array's size is static. EDIT: This feels like stuckness. I'm focusing on a detail that does not matter. Basically, if all "defining" indices are the same, whenever an array is referenced using anything other than its defining indices, it will be complete.

Entry: What do I need?
Date: Thu May 23 10:02:17 EDT 2019

I need this language to actually generate code! So maybe focus on that first. Make a code generator, and just have it be inefficient in the first iteration. Things will become more apparent once there are concrete examples to constrain the general pattern.

Entry: Move forward
Date: Sat May 25 08:33:26 EDT 2019

I'm a bit stuck on the way indices and sizes are represented. So what is an index? It always represents the coordinates of cells that are bound in the current loop block.

Entry: References
Date: Mon May 27 08:52:05 EDT 2019

I guess it's just hard, because it's not just falling out. Maybe change the monadic form first to support references. Trying this:

  p a b = do
    [d] <- loop $ \i -> do
      loop $ \j -> do
        c <- op "mul" [a i j, b i j]
        d <- op "mul" [a i j, c i j]
        return [d]
    loop $ \i -> do
      loop $ \j -> do
        c <- op "mul" [d i j]
        d <- op "mul" [a i j, c i j]
        return [d]

Here references are always explicit, and definitions are always implicit. Basically what this means:

- assignment here is (incremental) definition of a finite function
- since we know the dependency graph of computations leading from constants and local indices to array references, we should be able to create a proof that the references are valid.

Maybe the real challenge here is to define what the binding means?

  c <- op _ $ d i j

There are two levels to look at this:

- element-wise: c(i,j) is made of tx of d(i,j), and we're simply not writing down the indices at the LHS.
- whole array: c is made of tx of d as a whole, and there are some restrictions on how i and j can be used.

EDIT: This looks OK, but how to express it such that type inference works?

Entry: Bounds checking
Date: Mon May 27 12:16:38 EDT 2019

Also, if the referencing computation cannot be performed at compile time, some kind of mechanism needs to be inserted to ensure that errors get caught. E.g. generate code that has bounds checking, so that during quickcheck these asserts can be included; for production code they can be left out. I.e. if there is no actual proof, there could be a statistical proof. This way the prover could be written incrementally.

Entry: Next
Date: Tue May 28 09:12:40 EDT 2019

I will need to dedicate some bulk time to work through this. Fragmented attention won't cut it.

Entry: c <- a i j
Date: Tue May 28 09:30:43 EDT 2019

Doesn't work, because <- will have different types. I'm looking for a morphism between:

- the element-wise assignment
- the abstract array operation

Entry: Am I looking at this the wrong way?
Date: Tue May 28 18:22:41 EDT 2019

Maybe it is enough to do manual fusion?

Entry: Next
Date: Wed May 29 11:02:12 EDT 2019

I'm stuck. Ideas aren't flowing. The problem is this:

  c <- op "mul" [a i j, b i j]

What I want instead is:

  c <- f (a,b,i,j)

There are two things to relate to each other:

- an array is derived from other arrays,
  though this by itself is not that important.
- an array is constructed element-wise

So in a sense, it really doesn't matter what is on the RHS. Those really are atomic values. The only thing that matters is that:

- "c <- _" denotes the definition of an array
- its dimensions are fully determined by the dimensions of the loops it is in.

So I could start out by creating a "constant definer", and annotate the types that way. Maybe the most important part is that what is returned in a loop is always a function. Currying is natural here. The problem is to define the type of loop. This needs to be polymorphic. Something like

  loop :: (Index -> M [Index -> t]) -> M [Index -> t]

'M' needs to be on the outside: it is a representation of an array. Yeah, I really don't have the intuition for this. No guiding principle. I'm stuck. EDIT: Not quite sure why. The initial analysis phase was simple. Synthesis and type encoding seem to not go very well. It could be that I'm just too tired and dull to see the path.

Entry: It's time to crack the nut. What is loop?
Date: Thu May 30 16:31:11 EDT 2019

  loop :: (Index -> M [Index -> t]) -> M [Index -> t]

t can be Index -> t' or Atom. If arrays are represented as functions, then what is inside a loop construct is also a function.

  c <- op2 a b

I keep going back to these two views: the entire array vs. operations on atomic values.

Entry: Can it just be a functor?
Date: Thu May 30 16:46:23 EDT 2019

Until that is resolved, there is not a good way to work with this. Maybe take the view that these arrays are Functors, such that any operations on elements can be mapped over arrays? When in doubt, make it a functor... Start there. That should bring us full circle: back to Feldspar. Loops then come from fmap. This can then be generalized to fold. Any other context-aware operations will be generalizations of that. Do I just need to start from scratch, ensuring a proper interface? Is traverse the same as map, but for "monadic primitives"?

  map:      (a -> b)   -> t a -> t b
  traverse: (a -> f b) -> t a -> f (t b)

Yes. So next: build this from the ground up. Start from traverse, then add all the extensions such as parameterized indices.

  traverse: (a -> M b) -> A a -> M (A b)

where M is the compilation/interpretation monad, and A is the array representation, e.g.

  type A = (Index ->)

Then generalize to expose the index (and size):

  traverse1: (Index -> a -> M b)         -> A a -> M (A b)
  traverse2: (Index -> Size -> a -> M b) -> A a -> M (A b)

This is really a different view from not expressing inputs explicitly. Maybe that is the core of the issue? I've been focusing on construction, with any inputs abstract, which removes the 'a' parameter:

  loop: (Index -> Size -> M b) -> M (A b)

Entry: feldspar
Date: Thu May 30 17:01:40 EDT 2019

I don't think there is a whole lot to be found there. Let's just stick to the current setting.

Entry: traverse vs. construction?
Date: Fri May 31 08:46:21 EDT 2019

It is really about the difference between these two:

  traverse: (Index -> Size -> a -> M b) -> A a -> M (A b)
  loop:     (Index -> Size -> M b)      -> M (A b)

The reason to go for the latter is that only construction is element-wise. Reference can be random access, with some restrictions for feedback configurations. This is the MAIN IDEA.

Entry: loop :: (i -> M t) -> M (A t)
Date: Sat Jun 1 15:45:50 EDT 2019

Literally: transform a loop body that produces elements into an array. This can't be too hard. EDIT: Types work out. Arrays are constructed one dimension at a time.
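A literal, interpreter-style reading of that type checks out. The sketch below adds an explicit size argument and takes A to be a function from Index; these choices are mine, for illustration only:

  type Index = Int
  newtype A t = A { at :: Index -> t }   -- array as a finite function

  -- Turn a loop body that produces elements into an array, one
  -- dimension per nesting level.  Bounds handling is elided.
  loop :: Monad m => Index -> (Index -> m t) -> m (A t)
  loop n body = do
    ts <- mapM body [0 .. n - 1]
    return $ A (ts !!)

  -- Nested use constructs a matrix; note the type m (A (A Int)).
  table :: Monad m => Index -> Index -> m (A (A Int))
  table n k = loop n $ \i -> loop k $ \j -> return (i * j)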
EDIT: So here is what I find out: I don't have intrinsic motivation to finish this. It is REALLY in the way of making progress, but my mind is on finding a new client. EDIT: OK, some ideas. Dimensionality is in the type, so it should be reflected in reference. EDIT: Conceptual problem? This whole thing of treating everything as a grid is not going to work. Because of the signature of loop, there will be a concept of "scalar".

Entry: Structure of the language
Date: Sun Jun 2 09:04:36 EDT 2019

So this is pretty much Seq + array construction and referencing. I don't think there is a whole lot to be added apart from that. How do I separate these? Maybe it's not necessary. The missing step was the separation of array dereferencing. Basically, a new grid would be created with the indices. Then all primitives are just mappings of n-aries. OK, another basic idea:

- referencing creates a new grid of indices. This factors out anything special in the access pattern.
- all other operations can be factored into an fmap / liftA2 / ..., making grid operations and scalar operations isomorphic.

Conclusion: EXPORT DEREFERENCING AS A PRIMITIVE

Entry: Cleaning up the language
Date: Sun Jun 2 09:27:14 EDT 2019

So there is still a bit of work to fit the data language to the monadic representation. First: dereferencing is always explicit. I think I need to start over. The LTA form is not what I'm looking for. EDIT: So I've split up LTA (which now has a simple data structure print statement to serve as an example), and Loop, which is built around being able to represent this thing:

  p :: ArrayZero t => (Array (Array t)) -> (Array (Array t)) -> M (Array (Array t))
  p a b = do
    d <- loop $ \i -> do
      loop $ \j -> do
        aij <- ref2 a i j
        bij <- ref2 b i j
        c   <- op2 "mul" aij bij
        d   <- op2 "mul" aij c
        return d
    loop $ \i -> do
      loop $ \j -> do
        dij <- ref2 d i j
        c   <- op2 "mul" dij dij
        e   <- op2 "mul" c c
        return e

I'm going to need a break. So there is a pattern: I often want a language with a particular control structure around an otherwise ordinary form. How to do that more efficiently? It looks like the inconsistency comes from the inability to represent the referencing properly in the data type. The final embedding has no issues with nested types, but the data type itself can't do that, so it likely needs to use a flat encoding. OK, I am just not seeing the big picture. There is a real issue in mixing partial application and loop nesting (on the embedding side) and the need for an uncurried representation in the main language. It seems best to treat arrays as a special kind of function that has "abstraction" (loop) and "application" (ref). Here's an idea:

- pretend that the language is higher order. Variables can contain partially applied array references. Those are just compile-time entities, because they will need to resolve to scalars when operations are involved.

OK, this is actual progress.

Entry: partial application of grids / grids as finite functions
Date: Sun Jun 2 16:00:24 EDT 2019

This seems to be the important idea to be able to manage nesting.

1. Nesting is necessary due to loops.
2. That then reflects into the data type as well.

So the basic language is a typed lambda calculus where abstraction is array construction, and application is array reference. (A small sketch of this grids-as-functions view follows below, after the next entry.)

Entry: Meta-programmed Erlang node
Date: Tue Jul 30 09:36:00 EDT 2019

I want something that behaves as an Erlang node, but is actually static state machines described by a Haskell-embedded language.
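Picking up the grids-as-finite-functions sketch promised above (one dimension, invented names): scalar operations lift pointwise through Functor/Applicative, and dereferencing is the one primitive that is special:

  import Control.Applicative (liftA2)

  newtype Grid t = Grid { unGrid :: Int -> t }

  instance Functor Grid where
    fmap f (Grid g) = Grid (f . g)

  instance Applicative Grid where
    pure              = Grid . const
    Grid f <*> Grid g = Grid (\i -> f i (g i))

  -- "Application": indexing a grid with a grid of indices yields a
  -- new grid.  This is the EXPORT DEREFERENCING AS A PRIMITIVE idea.
  deref :: Grid t -> Grid Int -> Grid t
  deref (Grid a) (Grid ix) = Grid (a . ix)

  -- Everything else is just lifted scalar ops.
  mul :: Num t => Grid t -> Grid t -> Grid t
  mul = liftA2 (*)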
Entry: Compositional state machines
Date: Tue Jul 30 09:37:20 EDT 2019

Essentially I want something that runs on an FPGA, where the compiler can figure out what to implement as state machines, and what to multiplex on a CPU or state machine execution engine. The basic substrate is a state machine. The basic decision is to time-multiplex or not.

Entry: Interface to Verilog
Date: Tue Aug 13 09:13:33 EDT 2019

1. Have yosys compile it to something that can run on the Haskell substrate.
2. Generate better Verilog code.

Entry: State machines
Date: Tue Aug 13 09:36:24 EDT 2019

I need a different way to express state machines. Maybe start looking into formal methods? EDIT: What is the real problem, actually? I like the Erlang transactional model: a message comes in and it changes state. The message can be anything, the state can be anything. Does this work for synchronous state machines as well? A message in this case is a clock edge. Its contents are all the inputs at the clock edge. That is not very useful. So where do these two differ, really? It seems that the problem with clocked state machines is that the input is everything and the state is everything. This is just too unstructured to say anything meaningful. So how to abstract it? It would be great to create logic state machines in the same terms. EDIT: I wonder if the missing link is just missing knowledge, and not so much bad integration of knowledge. Maybe have a look at TLA+:

https://www.apress.com/us/book/9781484238288
Practical TLA+: Planning Driven Development, Hillel Wayne

Entry: Make FPGAs and CPUs the same
Date: Wed Aug 14 17:41:46 EDT 2019

I need to find a way to translate high-level event-driven "transactional programs" into gateware. That would be quite revolutionary. This can be done by enable signals, or "transaction busy" signals. Prototypical examples are the 1-cycle "event" pulse, and things like SPI CS. Maybe look at other transactional problems in the circuit domain, e.g. busses. Busses tend to evolve into packet systems, essentially. The problem isn't that this is hard to encode, but that there are many, many different encodings that all have different efficiency tradeoffs. I guess there are many interpretations of:

- receive message
- update state

Entry: state should be local in both place and time
Date: Wed Aug 14 18:47:20 EDT 2019

- Use a lot of encapsulation: compose state machines based on the protocols they receive and produce, not on what state they are in.
- Keep state short-lived, i.e. reset to a known state often and predictably. (Rationale: long-lived state "residue" is what makes things hard to test.)

These two rules already eliminate a lot of issues. They treat state as an implementation detail that is not observable at higher protocol units. (Cfr. implementing an FP language in an IP language.) Try to design in protocols, data flow. Might be specific to my applications, but it seems to be the way to go. Protocol-oriented programming: a protocol and the state machine that parses / generates it are two sides of the same coin. Protocols are state machine traces. Maybe related: Quviq QuickCheck state machine analysis? Is there a Haskell variant that can do state machines?

http://hackage.haskell.org/package/quickcheck-state-machine

Entry: Let's start with a state machine synthesizer
Date: Wed Aug 14 18:56:31 EDT 2019

- how does encoding actually affect gate usage in FPGAs?
- see how yosys does state machine transformations
- can I make state machines abstract and convert them to states directly? I.e. can states be fully abstract? (see the sketch below)
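On that last point, a minimal sketch of the direction, with invented state names: write the transition function against an abstract sum type, and treat the bit-level encoding as a separate, swappable map.

  import Data.Bits (shiftL)

  -- Abstract state space: the machine logic never mentions bits.
  data St = Idle | Start | Shift | Stop
    deriving (Show, Eq, Enum, Bounded)

  step :: St -> Bool -> St
  step Idle  go   = if go then Start else Idle
  step Start _    = Shift
  step Shift done = if done then Stop else Shift
  step Stop  _    = Idle

  -- Encodings are a later choice; gate usage can be compared per map.
  binary, oneHot :: St -> Int
  binary = fromEnum
  oneHot = shiftL 1 . fromEnum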
Entry: Notes
Date: Wed Aug 14 20:40:51 EDT 2019

- Nothing needs to be bi-directional. That is a complete artefact of communication constraints (wires). Always think of everything as unidirectional.
- This also makes it functional: the output sequence is a pure function of the input sequence and the reset state.
- There has to be an equivalent of the Fourier/Laplace transform for boolean machines. Is there? Those transforms rely on linearity and the concept of exponentials or orthogonal functions. Maybe this isn't all that useful though...

Entry: The CPU attractor
Date: Thu Aug 15 18:44:14 EDT 2019

What is a better use of a state machine fabric than to implement a CPU in it? Why is linear / nested execution so useful? What about this: start by thinking of every machine as a CPU executing code. Then implement it by stripping away functionality. What about flattening programs into linear execution states? Maybe most of this is about hierarchy of state representation.

Entry: Protocol splitters
Date: Thu Aug 15 19:19:14 EDT 2019

Think a bit more about that. The basic idea is: a single packet comes in on one bus and gets split into multiple busses.

Entry: Transactions vs. Flip-Flops
Date: Sat Aug 17 09:10:33 EDT 2019

Wait a minute. I can create a yosys target to run C code. I just need to re-interpret what a flip-flop means, and what a critical path means. Basically, a FF is a transaction unit.

Entry: Pipelining is the real problem
Date: Sat Aug 17 09:12:56 EDT 2019

For feedforward systems, pipelining is always possible. So pipelining needs to be solved first.

Entry: The higher level language
Date: Sat Aug 17 09:14:55 EDT 2019

It is time to start inventing structured macros on top of this low-level language.

Entry: Common subexpressions
Date: Sat Aug 17 09:16:06 EDT 2019

So is it actually necessary to factor out a logic function? I think yosys ABC takes a global approach anyway. I.e. it would be fine to define boolean transition functions directly. Maybe what I need to do next is to make a layer that is pure and expresses only boolean functions. Then build state machines on top of that.

Entry: Summary of ideas
Date: Sat Aug 17 09:17:51 EDT 2019

- focus on the transition functions
- figure out how to auto-pipeline
- compiling to C is re-defining what a transition means (a sequential program will execute a transaction, instead of a logic update function).

The key is really in bridging these two worlds: expressing a machine such that it can be compiled to C and run slower, or be compiled to (pipelined) logic and run faster.

Entry: Figure out how yosys passes things to ABC
Date: Sat Aug 17 09:20:17 EDT 2019

The _key_ element is whether I can represent boolean functions as unstructured flat tables, or whether I need to represent them as DAGs.

Entry: FSM extraction and boolean function optimization
Date: Sat Aug 17 09:31:23 EDT 2019

If yosys does FSM extraction, that should be a good hint to represent states abstractly in an abstracted layer on top of Seq. I need an example. One of the state machines I want to create is an Ethernet to I2S or S/PDIF converter. Let's factor it into Ethernet to SPI and SPI to S/PDIF.

Entry: A generic data converter architecture
Date: Sat Aug 17 09:37:47 EDT 2019

Note that almost all code I write can be implemented as feedforward chained state machines with some buffers in between. E.g.:

  I -> b -> C1 -> b -> C2 -> b -> O

- where each b is just a dumb circular buffer
- Cx is a converter state machine (parser + printer)

What happens where is not important.
Some C's could be on microcontrollers implemented in C, others could be FPGAs. But the general idea is that _testing_ should be completely contained. Also, some buffers are not there at all. Also, buffers can be abstract, i.e. they can be made to contain tokens, not bytes. That way the representation can either make smart writes or smart reads.

Entry: State update representation
Date: Sat Aug 17 11:13:33 EDT 2019

Another thing that has been bothering me for a while: it is easier to use an imperative-style state update instead of a fully explicit constructor. This is a hard pill to swallow because it really brings forward the whole FP vs. IP debate. Sometimes one is better than the other, so maybe use some kind of lens-like API?

Entry: Try out ABC: give it a LUT
Date: Sat Aug 17 11:36:27 EDT 2019

So given RTL, represent the update functions as LUTs, push it into ABC and see if that initial LUT form has an effect on what comes out. What is the primitive here? An N->1 boolean function. Describing such a thing requires 2^N bits. But that isn't really the core representation, because it does not allow for sharing. ABC likely needs to be presented with N->M boolean functions, represented in some kind of logic gate format that does not cause table explosion. So I probably still want to hold on to some shared DAG, just to not explode the representation. Processing of such graphs is likely iterative, i.e. it needs a seed of _some_ structured/shared representation as an AIG before it can do optimizations in AIG space. A full LUT description seems too unstructured. Cfr. Karnaugh maps: the whole idea is to identify these islands == term pruning.

https://people.eecs.berkeley.edu/~alanmi/abc/abc.htm
https://en.wikipedia.org/wiki/And-inverter_graph

Entry: Coroutines
Date: Sun Aug 18 09:39:49 EDT 2019

Doing some manual buffer management, it seems a good idea to make this automatic. I.e. given a high-level coroutine structure, split things up such that flow control and buffer structure can be parameterized. They depend on each other quite a bit.

Entry: 8 instructions
Date: Sat Aug 24 23:48:19 EDT 2019

TTL logic kit: https://www.youtube.com/watch?v=_2uXqTi42LI
PDP8

Entry: Pipelining
Date: Fri Aug 30 08:16:52 EDT 2019

So it's quite clear: my problem with digital logic is the pipelining. There has to be a way to express things such that the feedforward part can be separated from:

- decoupling/pipelining delays
- simple state machines

It's because there are two elements here. I've already noticed that factoring state machines makes them easier to understand, as this allows concentration on stream processing. But pipelining delays are a real pain. It already starts with defining small state machines. Should the output be registered or not? The answer is that it depends. For high-speed logic it is usually yes. For low-speed, efficient use of gates, it is usually no, or mostly not until timing gets violated. A thing to keep in mind is to write systems in a way that makes it easy to add delays in the path.

Entry: Shared substrate
Date: Sun Sep 22 10:59:30 CEST 2019

It's an interesting problem: how to express a program in such a way that it can be mapped to time-sliced programs and parallel state machines. EDIT: This is a very important insight for building applications that can use resources properly. Separate data processing (dependencies) from the sequential/parallel implementation questions. It is too easy to get absorbed by the attractors associated with each programming paradigm.
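A minimal sketch of that separation, with all names invented: the dataflow graph records only dependencies, and "parallel" (combinational) vs. "sequential" (scheduled) are two interpretations of the same structure:

  import           Data.Map (Map, (!))
  import qualified Data.Map as Map

  -- A node is a pure function of other nodes; no evaluation order.
  data Node  = In String | Op ([Int] -> Int) [Int]
  type Graph = Map Int Node

  -- Parallel reading: a node's value is a pure function of the inputs.
  eval :: Map String Int -> Graph -> Int -> Int
  eval env g n = case g ! n of
    In name   -> env ! name
    Op f args -> f (map (eval env g) args)

  -- Sequential reading: execute nodes in id order into a "register
  -- file", assuming ids are already topologically sorted.
  run :: Map String Int -> Graph -> Map Int Int
  run env = Map.foldlWithKey step Map.empty
    where
      step regs i (In name)   = Map.insert i (env ! name) regs
      step regs i (Op f args) = Map.insert i (f (map (regs !) args)) regs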
Entry: Dataflow
Date: Fri Oct 18 07:50:27 EDT 2019

Thinking in dataflow is still quite difficult for me, while thinking in sequential programs operating on buffers is second nature. Maybe that is the problem that needs to be solved first: write systems in that "CPU form", then gradually work towards factoring out steps that can go into hardware. I need to model the _reduction_ that happens there. So let's spell it out.

- all problems (for now) are modeled as single input to single output data streams.
- streams get "parsed" into some internal state, then "printed" out again.
- splitting the problem up into buffers + sequential packet processing code is a natural thing to do (at least for a programmer).
- so start out that way, and gradually eliminate the need for general purpose CPUs, by:
  - simplifying the high speed parts so buffers can be eliminated, and the processing cores can be simple state machines.
  - leaving the low speed parts to be handled by a single CPU executing an event loop

Most of this is known stuff, apart from the "simplify to non-buffered stream processor". The transition there is very large, as it consists of:

- eliminating data memory accesses
- eliminating instruction sequencing

Is there a method that can be employed? E.g. write out the entire program as a sequence of data processing loops to expose the ones that need to be turned into hardware, then merge the remaining code into CPU code that operates on a minimal set of buffers. I wonder if there is some more elegant way to say that both are the same. The CPU is an artifact. It is NOT the most natural way to do things. The buffer is an artifact that is a side effect of the TDM nature of a task-switching CPU. Basically, this representation is too biased. A CPU is essentially too powerful, such that not every algorithm written against it can be mapped back to a simpler architecture. A more restricted abstraction is necessary. Essentially, work backwards. What I want is to split the problem into:

1) An executable specification (no buffers, all parallel processing).
2) An explicit mapping from parallel to TDM, either to simple sequencers or to full-fledged CPUs and buffer memories.

How do you go from knowing that in principle this is possible, to finding a practical approach that can be implemented in a reasonable amount of time? I.e. solve the core subproblem first. I need a toy problem to tackle this. The typical one is a logic / protocol analyzer. Let's write that DHT11 driver with this idea in mind. And let's write it in Haskell right away.

Entry: Practically, what does that substrate look like?
Date: Fri Oct 18 08:27:18 EDT 2019

1. It has to be an event-based language with abstract events and abstract state. The program is (I,S) -> (O,S).
2. Blocking sequential program style will need to be implemented as a language preprocessing step. (Is this just async/await?)
3. Time has to be explicit. Do not rely on a fixed clock. If this is necessary, maybe add time stamps to the events instead.

The last one is an important constraint, because it is often implicit in synchronous systems, where cycle counters can be used as part of the state machine. However, this trick cannot be used on a uC: all timers need to be external inputs.

Entry: DHT11 state machine
Date: Sat Nov 9 05:55:01 EST 2019

I want a pure event-driven parser with two kinds of events: transitions and timeouts.
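A hedged sketch of that shape, before the protocol details below. The state names, the 50 us bit threshold, and the Action type are all invented for illustration, not the actual driver:

  data Level  = Lo | Hi deriving (Eq, Show)
  type Micros = Int
  data Event  = Edge Level Micros   -- transition + time since last edge
              | Timeout
  data Action = EmitBit Bool | Fail String deriving Show
  data St     = WaitRise | WaitFall deriving Show

  -- Pure: the host feeds events in, the machine answers with actions.
  step :: St -> Event -> (St, [Action])
  step _        Timeout     = (WaitRise, [Fail "timeout"])
  step WaitRise (Edge Hi _) = (WaitFall, [])                  -- pulse start
  step WaitFall (Edge Lo t) = (WaitRise, [EmitBit (t > 50)])  -- pulse width
  step s        _           = (s, [])                         -- ignore rest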
See electronics.txt.

MCU:
- assert 0
- wait 18000
- release (1)
- start receiver

DHT11:
- assert 0
- wait ACK:80 / BIT:80
- release (1)
- wait ACK:80 / BIT0:25 / BIT1:70

The end pulse should probably time out.

Events (interrupts):
- line change:
  - save state + elapsed time
  - reset timer
- timer:
  - disable timer

Now I want to write this in a high-level language such that I just need to fill in the platform-dependent details. Turn this around, because it is simpler to think of a data bit as:

- wait 25 / 70
- send 0
- wait 50
- send 1

This means the DHT init sequence is:

- send 0
- wait 80
- send 1
- wait 80
- send 0
- wait 50
- send 1

To parse:

- reset the timer at the last 0->1 transition
- at a 1->0 transition, queue the time delay

The machine is initialized after the request sequence is sent out. This means the first two measurements that come out can be ignored, as they are:

- the initial response time (20-40)
- the ack pulse (80)

EDIT: So I don't really need a code generator. I just need some "scrap paper" to properly define the state machine, the initial state, and possible post-processing. Because once understood, state machines (in my case) are very simple. How does this generalize to more complex machines? I want something that works both on a uC and on an FPGA. The main difference is that on a uC, events can be modeled as procedure calls. On an FPGA they are state update functions evaluated at each clock. These need to be merged into the same representation somehow.

Entry: Events on FPGA
Date: Sat Nov 9 07:50:49 EST 2019

The core is always just this:

- on a CPU, events are essentially procedure calls
- on an FPGA, events are much more implicit: state transitions caused by certain combinations of inputs

So to put both on the same footing, it is the FPGA representation level that needs to be abstracted a bit, because the CPU cannot do the "polling" that is essentially done by the synchronous logic at every clock cycle. So how do you map events onto synchronous logic? Assume the presence of an event is represented by a 1 bit somewhere. The core is then state updates conditional on that bit being 1. But that does not solve the case where there can be multiple events at once. Maybe this is not really an issue? Suppose that the input to the state transition function is a mutually exclusive bit vector, where each bit represents an event. I already ran into this representing data streams. It seems quite natural. Let's keep this in mind and design a machine. The core idea seems to be that synchronous logic is too low-level for all but the most essential components (e.g. an edge detector, a counter). This points in this direction:

- Create a library of low-level components that turn "continuous monitoring" into event streams, i.e. data + sync bit.
- Express all processing as (state,{event}) -> (state,{event}), where {event} denotes a set of events.

The difficulty seems to be events that are concurrent. Is it too much of a restriction to impose on each "process" that it has only one "message queue"? This brings to mind the distinction between the channel approach and the mailbox approach. But that is not really what it is about, since a sequential program gets to choose from which channel it reads next, and can impose time order that way. The core idea for this to work is that each machine takes only a single event stream, and can handle a single event at each clock tick. The mailbox model works best for that. This also makes it possible to express stream processors as functions, e.g.
giving a sequential composition structure without fan-in. Fan-in should be handled by dedicated mergers. Alright, there are a lot of corner cases, that is clear. E.g. it might be possible to implement a machine sequentially, but then it needs to have a limited event input rate.

Entry: Abstract DHT11 driver
Date: Sat Nov 9 09:59:54 EST 2019

So I wrote it out completely. Some notes:

- This can be done as inline functions, where the "host" is supposed to implement basic functionality.
- A single start function + a single event handler seems to make the most sense. This is also the structure that is used in Erlang.
- The ping-pong between state machine and host consists of two kinds of functions:
  - Simple library functions that perform uC state change and readout, e.g. get/set a pin, read a timer, start a timer.
  - Some of these are expected to produce events, which will then be passed on to the abstract event handler.

EDIT: Now this is a "sequential" event handling state machine, i.e. it knows exactly what the next resume point is at the time of yielding, so it can be represented in sequential, blocking form and auto-generated from such a form.

Entry: parallel, time
Date: Mon Nov 11 11:27:26 EST 2019

What struck me writing the DHT11 driver is that even for this simple example, there are essentially two execution traces: the sequence of pin transitions, and the global timeout. There is something extra about parallelism that is impossible to express in straight-line code, and you get there quite quickly. Actually it's very straightforward: often, a real-world interaction system needs the concept of time, not just of events. Events have only an _order_ property. Time is something else; there are two flavors:

- timestamps: map 2 events to 1 positive/negative number.
- alarms: map 1 positive number to the occurrence of a future event.

Entry: a state machine language
Date: Mon Nov 11 20:55:30 EST 2019

Two important conclusions:

1. Clocked logic is NOT good as a base substrate. It needs at least one higher abstraction, which is events.
2. This means the implementation language that sits below the functional specification layer can be anything. Which also means Seq is going to be good enough, and it can grow and change as long as the interface is kept.

So how to move this forward? Create a mapper. I have a real-world state machine that I can use to test this: the DHT11 driver. I can't "design" this, so I'll have to evolve it. Start with writing out the state machine in the most straightforward Haskell form. Then rethink from that vantage point.

Entry: state machine language
Date: Fri Nov 15 08:57:55 EST 2019

Get past procrastination here... What is the actual problem that needs to be solved? Mapping a description to code. Start with the prototypical state machine: a counter. Work backwards? EDIT: I need to somehow inch myself into this. The substrate should probably be Seq, and Seq's C output is where to focus. I'm not going to get into LLVM at this point. So how to create a C program in Seq? EDIT: See exo/ghcid/ExoSP.hs

  -- ExoSP
  static inline void fun(seq_t o[1], seq_t r0) {
    seq_t r1 = seqMUL(r0, r0);
    o[0] = r1;
  }

EDIT: It seems that this is really enough to work incrementally in very small steps. Wait until inspiration hits. EDIT: Some essential structural / representational component is missing. Seq is made for sequential processing, not for event handling. How to add this?

Entry: An s-expression front-end
Date: Fri Nov 15 09:32:45 EST 2019

Intermezzo. I really want this.
Monadic notation is too cumbersome, and it is really just a syntactic problem. I don't think it would be necessary to create a Template Haskell S-expression front-end. It seems that Seq programs are untyped enough to be "interpreted". This actually turned out to be a feature! This is actually something that can be thought of as a core design element, and it also makes RAI and Seq quite close. If more type encoding is necessary, it is possible to just create a "phantom wrapper" in Haskell to express higher-level ideas. Let's give this a try. There is a parser in asm_tools/asm-tools/Data/AsmTools/SE.hs EDIT: I already started this in asm_tools/asm-tools-seq/Language/Seq/Syntax.hs EDIT: The jump is too big. Reduce scope. Stick to just doing C gen.

Entry: I need a hello world
Date: Fri Nov 15 10:26:50 EST 2019

Something essential is missing to make this event-driven, and I can't put my finger on it. The product should be a collection of event handlers that each perform a state update and possibly generate/enqueue more events.

Entry: DSP
Date: Thu Nov 28 05:29:32 EST 2019

Another stab. The problem I'm trying to solve is:

- to be able to reuse intermediate buffers easily, e.g. declare them in C inside a limited scope, and
- to do buffer dimension reduction

Is there a simpler way to look at this problem? Is this just deforestation / loop fusion? Or are there other things at play? Needs some new insight. What is the core problem? To reuse buffers such that they are cache-hot.

Entry: Sequential programs are the norm
Date: Sat Nov 30 16:02:47 EST 2019

Why would I ever need sequential code if I have unlimited dataflow? Is it possible to express a state machine as pure dataflow, and then add a specification for the dataflow? The problem is with loops. You almost always want sequential processing of data stored in a memory, because the access to that memory is going to be sequential. So the idea is this: the default is sequential programs. What they run on doesn't really matter: some form of datapath and control logic. There is a lot of freedom there that does not really matter much. Parallel dataflow is actually very rare and only needed for the most low-level components, such as e.g. bus interfaces. All the other logic will be simplest to express as sequential code. So to write state machines on an FPGA, write everything as code, and when it gets too inefficient, create a low-level state machine.

Entry: buffers
Date: Sat Dec 7 10:46:43 EST 2019

So I have "clocked words": signals that have a "valid" signal associated with them to indicate events. How to extend this to buffers? Buffering seems unavoidable, so make sure there is a good strategy. This is the "control plane vs. data plane" pattern that always comes up in implementation. The data plane can just be memories. So, conclusion: it is sometimes hard to see in the overall design which streaming connections should be buffered. For non-random access (FIFOs only), this is an implementation detail that has to do with the sequential decomposition of some processor. If sequential decompositions do not line up, buffering is needed to "deform time". So can it be kept abstract? Communicate tokens?

Entry: logic implementation is about deforming time
Date: Sat Dec 7 11:07:29 EST 2019

Basically: buffering, sequentialization and pipelining. Note that this is really about dataflow, which is the only thing I ever need an FPGA for: communication, and possibly some stream processing. To make this work, set up some interfaces.
The main missing component is a control flow mechanism for FIFOs. Ideally, manage the "coarse" time scale of the FIFO in some other way. This is not typically called a FIFO, but a buffer: one machine fills it, and only when full is the buffer transferred. However, it is still possible to use only streaming access on the buffer.

Entry: buffers over fifos
Date: Sat Dec 7 12:02:34 EST 2019

There is a good reason to use buffers: it is often possible to put a hard constraint on message size. It is much harder to constrain queues of multiple objects. This is an interesting tension: while it might be more "natural" not to use a chunking mechanism (grouping many atoms into an arbitrary chunk), chunking does seem to allow for better memory management. FIFOs are probably OK as long as there is some kind of high-level rate constraint, i.e. some "coarse level" information: only if FIFOs do bounded time warping, where the maximum time delay is known.

Entry: Transformation of parallel to serial?
Date: Sat Dec 7 12:24:42 EST 2019

Can it be that simple? Start with actual "large" objects, then transform the data connections into sequential ones, and refactor the processing in the same way. It seems that the important bit is to decide where to make those time/space tradeoffs. It also seems simpler to go from parallel data to serial data, i.e. you "add" stuff. Going from parallel compute to serial compute is also straightforward, but it seems backwards in some way, because we all get so used to sequential programs. However, the intuitive sequential programs might hide some other sequentialization. It seems better to make resource allocation very explicit: start with pure data flow, then braid it into sequential machines. EDIT: Actually this goes back to things I remember from very early on: splitting a filter into a sequential program using a simpler datapath.

Entry: Sequential programs are not natural
Date: Sat Dec 7 12:31:52 EST 2019

So it is not that sequential programs are more intuitive; it is that a particular CPU or programming language is a "known substrate", where it becomes easier to be expressive as a designer. What is more natural is to think in terms of events that cause state changes, and then to implement the data transfer and compute onto the sequential machine substrate.

Entry: Method: project model onto specification
Date: Sat Dec 7 12:33:46 EST 2019

So how to translate that into a method? Does it make sense to implement the behavior at a higher level, and then use it as a template? Actually, very much yes. It really seems that "adding" information to a specification to then arrive at an implementation is the wrong way to look at it. Work the other way around: "remove" information from an implementation to end up with the specification. This is textbook abstract interpretation. It is very important to understand that an implementation in the first place does _more_ than the specification. Its substrate mapping has many more observable effects than are necessary to satisfy the specification. It also might do _less_, in that certain corner cases are not handled. The tragedy is that this is usually what people (me, really) focus on, forgetting the part where an implementation is actually strictly _larger_ than a specification. This also makes it obvious that generating an implementation isn't always possible. There might simply be too many degrees of freedom that really do not compose in a nice way. I.e.
changing a dataflow program into a sequential or pipelined program introduces a practically significant shift in timing characteristics, something the specification might not care about at all.

Entry: Implementation / Specification codesign
Date: Sat Dec 7 12:41:36 EST 2019

So let's make this concrete by inventing an event + state machine language that can be compared with an implementation, with maps going from implementation to specification, but not backwards. When there are backwards maps, great! Those are modules for which code generation can be used. But this is _never_ guaranteed. So a model-based approach should really take that structure: semantics can always be assigned by projecting an implementation into model space, and the reverse is optional but never guaranteed.

Entry: How to actually do that? Phantom types.
Date: Sat Dec 7 12:51:43 EST 2019

It seems like a good idea. Abstract enough to be correct, but also completely unusable without some concrete modeling ideas. The essence is state machines. So start there, maybe? A state machine is something that transforms (event,state) into ([out_event],state). Stick to single input, many output machines. Capture multi-input in the state. EDIT: Focus on composition? Given a low-level implementation / specification pair, construct a way to compose them. Also, the specifications give phantom types to the implementations. The more I think about it this way, the easier it is to be OK with the fairly simple "boring Haskell" approach for Seq. It really is just an implementation language, and a phantom layer is all that's needed. EDIT: The model doesn't need to be actual. It can just be a repackaging of input streams, where model "fit" is just the ability to express pack/unpack of data?

Entry: Example?
Date: Sat Dec 7 13:10:00 EST 2019

The obvious path is to take some state machine from the Ethernet MAC. EDIT: I'm still missing an element. Can the model-wrapping be _only_ phantom types? EDIT: The frontend transforms the data ready + bit pair into a word stream that can go into a FIFO. How to type those? It doesn't make a whole lot of sense putting it that way. The core issue here is that one type of event stream is transformed into another, and this is done multiple times until anything like a "packet" emerges. But still, it does make sense to somehow type the I/O. This composition is actual, so it makes sense to reflect it in the types. So:

  rmii -> word -> packet

EDIT: I'm on the wrong track. Just start implementing; then, once the composition is there, start typing it. A glimpse: type it without mentioning state. It is understood that things are stateful at a local level.

Entry: Move structure into names
Date: Sun Dec 8 18:25:48 EST 2019

I ran into two cases where there is a benefit to implementing grouping based on names, and not on hierarchical structures. They are isomorphic (paths vs. nesting), but names seem to often be much easier to manipulate. This has to be a generic pattern.

Entry: Buffer reuse
Date: Mon Dec 9 06:59:38 EST 2019

It is essentially about the lifetime of variables.

Entry: C-like lang
Date: Fri Dec 13 20:07:32 EST 2019

Either do this with a subset of C as a frontend, or do something in Haskell on top of Seq.

Entry: Next state spec
Date: Sun Dec 22 00:15:18 CET 2019

So Mr. Lamport agrees: explicitly stating what stays the same is a good idea.
https://lamport.azurewebsites.net/video/video4.html

Entry: Compiling CPS state machines
Date: Sun Dec 22 13:11:14 CET 2019

This is very straightforward once the realization is made that a reified continuation is a sum type, with one clause for each state. In C, using a big ball-of-mud state struct, this idea is often obscured, because there is usually a lot of sharing between states. I.e. it is not always clear whether a particular struct member is valid in a particular state. Modeling the continuation as a sum type gets rid of this confusion. The only downside is then to map this idea efficiently onto a C struct or union. Rust will probably map very well to this idea. So essentially there are two parts:

- Convert a blocking task to CPS form
- Represent the continuation sum type efficiently

How close am I getting to async/.await?

https://rust-lang.github.io/async-book/01_getting_started/04_async_await_primer.html

That is already performing the CPS transform. I think I have the perfect example to try this out on. And also the test: can "loops" and "recursions" be expressed in async/.await?

Entry: Erlang as CSP substrate
Date: Sat Jan 4 15:56:06 CET 2020

This is the missing link I've been looking for, the connection between:

- Abstract state machine work in Haskell
- More practice with code gen & macro languages
- The Emacs/Erlang exploratory exo system

EDIT: CSP is synchronous and requires a layer over Erlang to make it work. Compilation to C might be more appropriate. There is a library now.

Entry: Modify deser to be able to do 2-bit sequences
Date: Mon Jan 27 06:06:05 EST 2020

It might already work. Currently, the deser core routine is:

  shiftUpdate dir sr b

where b could just as well be 2 bits.

Entry: channels
Date: Mon Jan 27 08:59:03 EST 2020

Some interesting cases pop up for a generic synchronizer: a UART memory reader. Basically, DMA. So let's continue down this path. Create a channel compositor. The synchronization mechanism is easy enough: AND together the sender and receiver signals; but expressing it in an expression language requires some awkward shuffling. It's the old read/write problem: a read is the input of a function, and a write is the output. Composition is then done on the outside. The mem->UART DMA is a nice example. Mem read can continue once the UART has acknowledged it has sampled the channel. Then mem read can obtain the next byte and block until the UART is ready sending. Express that as a channel operation. Can this be expressed as a select? The important part is the handshake.

SM1: raises write ready flag
SM2: raises read ready flag

Optimization: if it is known that the reader is fast enough that it will sit there waiting, it's OK to just pulse. This is what I did before in the DTI sequencer.

Entry: Memory reader
Date: Mon Jan 27 09:43:48 EST 2020

I.e. a "monitor" for a memory. This is a task that "selects" on a number of channels, and performs a read whenever a channel gets ready. This rendez-vous is a very powerful abstraction.

Entry: Synchronization
Date: Mon Jan 27 09:50:10 EST 2020

Let's distinguish two mechanisms:

- Sender pulses, reader waits for pulse.
- Sender raises and waits until reader is raised.

In the latter case it's necessary to be careful that reads and the next write do not overlap.

Entry: integrate with fusesoc
Date: Mon Jan 27 16:22:30 EST 2020

https://github.com/olofk/fusesoc

Wrap it in exo nix. https://nixos.wiki/wiki/Python EDIT: Yeah, once these things are getting bigger, reuse is becoming important. Seq is not going to be for system-level integration.
It's only a module generator.

Entry: Synchronization
Date: Tue Jan 28 05:58:35 EST 2020

Actually this is a game changer. 2-way synchronization is more abstract: it is no longer necessary to prove that one machine is in time to pick up the output of another machine. So nothing new, really, but I do have a way to think about things in the abstract. This is very different. That said, I do need practice. I can't just "see" it at the state machine level. EDIT: Here's a thing: "select" is the same as "priority cond". E.g. in async_transmit: wordClock is responded to first. So let's abstract a synchronous send as an asynchronous send combined with a synchronous receive of the ack. If it is easier to have a level transition for the ack, do so, but add an edge detector somewhere.

Entry: DMA
Date: Tue Jan 28 06:18:06 EST 2020

So how to actually do this instead of gloating about insights? First iteration: use pulses in two directions. Build intuition about that first. Yeah, why is this so hard? Still not done with the context switching. Do something else first until fully awake.

Entry: Handshake
Date: Tue Jan 28 06:35:48 EST 2020

I need some basic design principles. Requirements:

1. Allow both pulsed and level (ready/busy) signals
2. "cond" is your friend

Entry: Pulse vs. level
Date: Tue Jan 28 06:38:12 EST 2020

Level is easier to do because it's a single time instance (change polarity). A pulse always requires two events: one to turn it on, and one to turn it off. The good part is that level can be transformed into pulse using just an edge detector. Important here is that it allows FACTORIZATION OF STATE MACHINES. I strongly believe it is better to have two independent transition functions, as compared to a single one. Now, there is a problem: we can't implement waiting if a pulse will be missed. That is why level-triggering is necessary. So let's look at this a bit closer. Suppose we have a series of signals that are used for synchronization. If all signals are low, we wait in the next cycle. If one signal goes high, that one is used to cause a transition. Ties are broken by using a priority select. That part is straightforward. Pulses can be seen only when they are not obscured by other, higher-priority pulses occurring simultaneously. A "case" study: async_transmit has a priority select on wordClock, which means it will ignore bitClock if it occurs at the same time as wordClock.

  [shiftReg', cnt'] <- cond
    [(wordClock, [newframe, cbits n' (n + 2)]),
     (bitClock,  [shifted, cntDec])]
    [shiftReg, cnt]

What about this:

- Use pulses for async one-way communication. This already works, and in most cases it is appropriate.
- Find a new mechanism for rendez-vous synchronization. This seems to need level triggering with acknowledgement.

Entry: Rendez-vous: ready + ack
Date: Tue Jan 28 06:51:27 EST 2020

This is necessarily two-way. Let's break the asymmetry for now by requiring that reads will wait for writes. Then later it is probably clear how to restore the symmetry. Essentially this is about agreeing on a time event. That statement actually contains the solution. The output of the synchronizer is a single cycle pulse. Can this be constructed from a reader and a writer level signal? Let's go back to the asymmetric case:

1. The writer signals that data is ready by creating a 0->1 transition and holding it there.
2. When the reader is ready to perform the read, it will generate a 0->1 transition, and at the same time sample the value.

The action of 2.
should acknowledge to 1 that it can continue, but also remove the condition immediately, such that the reader state machine doesn't read again in the next cycle. That bit seems to be the essence. I think I'm re-discovering interrupts. Essentially, this is a counter. If the reader sees the interrupt, it will immediately (and only in that state!) output a received pulse. The writer will be waiting for this pulse. If it is seen, it should immediately turn off its enable signal. This 2-way thing is tricky. Let's put in some requirements.

  W: . . x x x x . . .
  R: . . . . . x . . .
  S: . . . . . x . . .

I think the key element is that the ANDed signal is not registered: information flows in the two directions in a single cycle. I.e. this makes it easy to generate by the receiver as a side effect of the "cond" case that sees the input signal high, and the writer can see the pulse immediately and lower its write signal for the next cycle. So, summarized:

- the writer has a wait state where it has the output ready signal raised
- the reader has a state that sets the ack pin NON-REGISTERED
- in the writer's wait state, the ack pin is used to transition to lower the ready signal

This should work for a continuous stream of readies. E.g. if the writer doesn't lower the ready signal, the reader will treat it as the next sync. Summarized even more: the Ready -> Ack path is COMBINATORIAL. If it is combinatorial, single cycle transfers are possible. If the signal is pipelined, some more work is necessary to avoid duplicates. So this case is definitely simpler. Also, if single cycle transfers are possible, this is exactly THE mechanism by which to factor machines into a composition of smaller ones.

Entry: Justify "applicative" structure
Date: Tue Jan 28 07:23:07 EST 2020

While some things are hard to write this way, e.g. when there is some "crossing" of signals, the desired end result of abstraction is almost always feedforward structures, and those map very well to applicative structure. I.e. data processors with some internal state.

Entry: Practical example
Date: Tue Jan 28 07:28:34 EST 2020

I don't think this can be abstracted away, as it is a core part of how the two machines perform transitions in the presence of other things going on. So the transaction is an essential part of the machine's main transition function. An example. Two machines:

1. the writer is a counter that writes out the next state
2. the reader is synchronized to an external pulse

OK, that's some setup to start with.

Entry: A good book on circuits
Date: Tue Jan 28 07:34:46 EST 2020

Maybe this one?

Computer Architecture: A Quantitative Approach (The Morgan Kaufmann Series in Computer Architecture and Design)
https://www.amazon.com/Computer-Architecture-Quantitative-Approach-Kaufmann/dp/0128119055
Computer.Architecture-.A.Quantitative.Approach.-.4ed.pdf
md5://808a7562b705ed1cf6a3deb9b9370d98

Entry: Synchronization example
Date: Tue Jan 28 08:59:45 EST 2020

OK, so there is a catch: this cross-coupling introduces a loop, and I don't think I have a way to decouple it. This is why I couldn't just write it down. The idea has a flaw. Interesting. Why is there this limitation? So it seems that if I implement this using a delay in the feedback path, it will be straightforward to do. Otherwise it is a recursion scheme that is not currently possible, and requires a new language construct. Maybe not. What is necessary is the "open" version of the two transition functions. They should be merged into a single function and then closed over feedback registers.
I'm too dumb for this shit rn. Note that there will be no loops: there is only an apparent delay between rdy and ack. Maybe the trick is to expose these two signals (ready in next) and then close them explicitly? This is the same thing that happened for memory reads. And indeed! This is also a read/write pair! EDIT: OK, was able to code it up. Now create a test. EDIT: OK, have a test, but things aren't correct yet. This is again an off-by-one thing that is hard to see. Things like that are just horrible to get right! Almost there, just too stupid atm.

Entry: How to make this understandable?
Date: Tue Jan 28 17:09:23 EST 2020

Draw it out on paper again? One thing that makes it difficult is not distinguishing between this state and next state in the printed tables. Fix that first.

Entry: rendez-vous
Date: Wed Jan 29 05:56:22 EST 2020

Basic idea: this is a cross-wiring of RDY and ACK, and one delay is necessary to make the loop work. Where does the delay go? One way to think about this, and maybe good to create a test case:

- Reader responds with an ACK pulse as soon as it can handle RDY
- Writer uses that ACK to turn off the RDY signal

In the end I want to be able to do this continuously, e.g. be able to transfer data at clock rate, but to see what actually happens, turning it off might be a good idea. So make a new test circuit that does only the handshake. EDIT: OK, I think I got it. The trick is to use combinatorial output on RDY and ACK, and to delay the ACK going into the writer.

Entry: testing rendez-vous
Date: Thu Jan 30 07:39:47 EST 2020

I have a test case, but QC found a problem when pulse sep is 0. OK, the problem is in the writer. I need a general principle to do this. Suppose ready is always high. What is the output?

- If not acknowledged, repeat the last one
- If acknowledged, use the next

Yes, this is tricky because of the dependencies on current and last state. I think it's best to express the separate cases explicitly.

Entry: Truth tables
Date: Thu Jan 30 08:18:40 EST 2020

So I end up very naturally at truth tables. In some cases it is just very hard to express a transition function in terms of manually factored binary and, or, not. EDIT: Since there is currently no direct way to implement truth tables, I'm resorting to manual implementation. I do wonder: Verilog has don't-care matching, right? https://embdev.net/topic/276558 Yes, casez. I probably should implement this. EDIT: It's not really necessary atm. Factoring can actually be beneficial for understanding. E.g. try to identify local signals that are meaningful enough to give a proper name, and include them in the truth table.

Entry: Change of style
Date: Thu Jan 30 10:35:20 EST 2020

Use the d_ prefix for state machine inputs. The "default time" should be the current time instance. This makes much more intuitive sense than thinking about the "future". FIXME: word this better.

Entry: DMA
Date: Thu Jan 30 11:01:39 EST 2020

I'm having real trouble with those off-by-one errors. Continue writing explicit truth tables for the transition function. EDIT: OK, I got something working: send a counter to the UART. Then replace the counter with an arbitrary pulsed state machine. EDIT: Running into this cross-pattern again. I guess it is universal. So let's use the convention that the library exposes the open machine, and we close it on use. Naming will be clear after it's done.

Entry: UART revisit
Date: Thu Jan 30 14:44:13 EST 2020

Maybe good to test it out better.
TODO:

- merge the two test cases into one
- fix the stop bit issue
- make the done bit combinatorial
- generalize the sync write/read

This is going to take some time to all work out.

Entry: Port transaction symmetry
Date: Sun Feb 2 19:55:41 EST 2020

I factored out the read and the state machine. The SM now takes a 'cont' and produces a 'have'. This is essentially read and ack. So it appears I'm re-inventing bus transactions.

1. The core idea is that a read (write) command is an interplay between two signals:
   - the read end will issue a 'req'
   - the write end will respond with an 'ack'
2. The req->ack path is combinatorial at the writer end.
3. The ack->req path is combinatorial at the reader.
4. The composition of the two inserts a delay on the ack to break the loop.

This doesn't seem too hard. Now why is it asymmetric? Can we have a writer sending the req, and the reader sending the ack? The insight is that this is already symmetric. The asymmetry is just in the names! This means that this could be a 2-way read/write as well. The handshake just defines a moment in time when both are watching the I/O. So what to do with this? Make some drawings on paper... The symmetry is important. And the fact that the delay breaks the symmetry is also important. I'm tempted to split that delay in half! Go into this: there are also combinatorial signals at play for maximum performance.

https://en.wikipedia.org/wiki/Wishbone_(computer_bus)

EDIT: So how do I test this?

Entry: More handshake examples
Date: Mon Feb 3 05:48:16 EST 2020

So let's do:

- a byte producer (that 4,5,6,7,12,13,14,15,... counter)
- a consumer (the UART)

Also let's name things properly:

- read/write strobe
- ack

The ack is not an ack. The OR of the in and out strobes is the ack. So a machine looks like:

- some data that is exchanged (doesn't really matter!)
- input: indication that the peer is ready
- output: ready indication
- ack = wire OR of the two

TODO:

- put this behind a standard interface with some constraints.
- create a standard way of gluing two ports

Entry: Asymmetry in read/write
Date: Mon Feb 3 06:04:09 EST 2020

So what I really want is to split delays in half, but that doesn't work. So the solution is to make the interface asymmetric, and let one of the ends assume a delayed input. Who should this be? EDIT: To avoid duplication of registers, make sure that the reader is the frontend.

Entry: Channel, final word?
Date: Mon Feb 3 07:55:35 EST 2020

  -- We break symmetry based on the requirement to not have a data delay
  -- as part of the loop closing operation. This makes frontend=write,
  -- backend=read.
  --
  --   /-------------------<----------------------\
  --   \-->--[D]-->--[f:write]---->--[b:read]-->--/
  --                \--------->-----/
  --
  -- This brings us to the following implementation.
  closeChannel writer reader = do
    closeReg [bits 1] $ \[d_rd_sync] -> do
      (wr_sync, wr_data, wr_out) <- writer d_rd_sync
      (rd_sync, rd_out)          <- reader wr_sync wr_data
      "d_rd_sync" <-- d_rd_sync
      "rd_sync"   <-- rd_sync
      "wr_sync"   <-- wr_sync
      "wr_data"   <-- wr_data
      -- wr_out, rd_out: other state machine outputs not necessarily
      -- related to channel communication.
      return ([rd_sync],(wr_out,rd_out))

Entry: General remark about 'close'
Date: Mon Feb 3 08:05:14 EST 2020

When closing multiple operations at once, it becomes a bit arbitrary in which order these are performed, and also there is quite a bit of shuffling needed to bring outputs out of the circuit being closed.
Entry: General remark about 'close'
Date: Mon Feb 3 08:05:14 EST 2020

When closing multiple operations at once, it becomes a bit arbitrary in which order these are performed, and there is also quite a bit of shuffling needed to bring outputs out of the circuit being closed. In Verilog this is a lot easier: just define a bunch of registers and use assignment to perform the cross-wiring. So is it worth it? This seems to be the price paid to keep an applicative interface. I'm going to assume for now that it is worth it, because it makes abstraction cheaper. Let's put it in the README.

Entry: channel: examples
Date: Mon Feb 3 11:42:37 EST 2020

These are canonical examples:

- readers
  - chan->uart_tx
  - chan->memory
- writers
  - uart_rx->chan
  - memory->chan

Entry: general remarks
Date: Mon Feb 3 11:52:56 EST 2020

I'm on the right path, but this will need a whole lot of work to find good factorization primitives, and to maybe also create a processor in such a factored way.

EDIT: A PLC would probably be possible:

1. Factor out the instruction sequencing: instruction memory interface, loops and call/return.
2. Allow the user to plug in conditional jumps (input->control path) and instruction decoders (instruction->output path).

Entry: Make tests easier
Date: Wed Feb 5 06:48:36 EST 2020

- only use probes
- why is TH necessary?

Entry: Why is TH necessary?
Date: Wed Feb 5 07:25:27 EST 2020

Interesting question. I think I did this because of a lack of strict typing?

Entry: Why am I not using other tools?
Date: Thu Feb 6 05:03:03 EST 2020

First, I really want to know how to do this bottom up. That is very important. Once I've learned, I might want to switch to different back-ends.

Entry: Sharing
Date: Thu Feb 6 05:16:32 EST 2020

See haskell.txt. There are a lot of interesting points, but it is probably just a distraction at this point. For now, stick to monads.

Entry: Accelerate
Date: Thu Feb 6 05:23:39 EST 2020

Make it easier to get to hardware quicker. What is missing?

Entry: Bit-serial Forth processor?
Date: Thu Feb 6 08:18:42 EST 2020

Yeah, why not. Probably best to start with a QSPI flash chip board.

Entry: Substrate independence
Date: Fri Feb 7 06:36:51 EST 2020

Entry: ISA generation
Date: Sat Feb 8 13:12:33 EST 2020

So instead of designing an instruction set, what about having it generated? Basically, what you want is a "wide" instruction set with datapath control signals. This then needs to be compressed into something that is a tradeoff between:

- not being too complex to decode
- not being too wide to store

Since I am already writing abstract code inside Haskell, I never actually want to see the instruction encoding. How about generating it? I only need the instructions themselves (abstract tags + concrete payloads), then find an encoding that maps the abstract tags to instructions.
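A hedged sketch of what "generating the encoding" could look like: walk a program, collect the abstract tags that are actually used, and assign them dense opcodes. Plain Haskell with made-up names; a real version would also weigh decode complexity against width, as noted above.

  import qualified Data.Map as Map
  import Data.List (nub)

  -- An instruction is an abstract tag plus a concrete payload.
  data Ins t p = Ins t p

  -- Assign opcodes 0,1,2,... to tags in order of first use, then emit
  -- the program in encoded form together with the opcode table.
  encode :: Ord t => [Ins t p] -> (Map.Map t Int, [(Int, p)])
  encode prog = (table, [ (table Map.! t, p) | Ins t p <- prog ])
    where table = Map.fromList (zip (nub [ t | Ins t _ <- prog ]) [0 ..])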
Entry: Testing memory readers
Date: Mon Feb 10 06:56:42 EST 2020

This is also a roadblock, but it appears the main roadblock is to work on the HW, so fix that first.

Entry: clock domain crossing
Date: Wed Feb 12 15:23:42 EST 2020

Dual-ported memory + signal? Yes, FIFOs:
https://filebox.ece.vt.edu/~athanas/4514/ledadoc/html/pol_cdc.html

Entry: Programmable datapath
Date: Thu Feb 13 07:49:50 EST 2020

How to manage a single DSP core for low-rate processing? It needs to be time-sliced. The idea is to feed it via a register file. Dual-read-port memory would be nice for operands, but it is probably ok to go through fetch phases for multiple operands.

Encoding the instructions: probably not too many different things, mostly DSP. Basically, DSP and control processors are really different. DSP is mostly about feeding data into the MAC; control is mostly about responding to events and sending out events, possibly with subprograms. So control should focus on a Forth-style approach, while DSP would focus on a register architecture. DSP would be compiled by mapping a dataflow network onto a program that executes it.

Entry: Substrate nesting
Date: Thu Feb 13 07:55:05 EST 2020

Maybe time to start doing this for DSP algos. It's probably ok to pass through Term, e.g. have a complete compiler behind a mapping.

Entry: applicative notation
Date: Fri Feb 14 16:43:04 EST 2020

Did I completely miss this?

  f <*> a <*> b

  Prelude> :t (<*>)
  (<*>) :: Applicative f => f (a -> b) -> f a -> f b

The problem is that I have a -> m b instead of m (a -> b). Is it possible to change one into the other? Not directly: <*> only applies when the function itself is wrapped. For f :: a -> b -> m c, what I have is actually a chain of binds, ma >>= \a -> mb >>= f a, i.e. join (f <$> ma <*> mb) -- Kleisli-style application.

Entry: next
Date: Sun Feb 16 18:39:48 EST 2020

To/from memory. Make those tests straightforward.

Entry: mini cpu
Date: Thu Mar 19 21:32:48 EDT 2020

http://bleyer.org/pacoblaze/picoblaze.pdf

Entry: z transform
Date: Fri Jun 5 23:15:34 EDT 2020

I already have an algorithm for this in rai. The core problem is to linearize the update equation. Once there is a set of linear equations, the rest is straightforward with a library like hmatrix. In the problem I need to solve first, the update equation is linear already, so that step can be skipped.
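A scalar sketch of the already-linear case, using only Data.Complex; hmatrix only becomes necessary once the state is a vector and a matrix (zI - A) has to be inverted. The one-pole update and its 0.9 coefficient are made up for illustration.

  import Data.Complex

  -- For the linear update y[n] = a*y[n-1] + x[n], the z transform gives
  -- the transfer function H(z) = 1 / (1 - a/z).  Evaluating it on the
  -- unit circle z = exp(i*w) gives the frequency response.
  h :: Complex Double -> Complex Double
  h z = 1 / (1 - (0.9 :+ 0) / z)

  response :: Double -> Double   -- magnitude response at frequency w
  response w = magnitude (h (cis w))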