Logic Analyzer in Rust

Entry: la.rs
Date: Sat Jan 31 11:40:42 EST 2015

So the basic ideas:

- Think of Rust as the implementation + scripting language
- Keep integration in sigrok in mind


This is built off of the ideas of pyla, a Python + C++ logic
analyzer.  It worked fine, but was still clumsy in its approach, so I
started thinking that Rust might be good as both a core implementation
language for the signal processors and the dataflow glue.


Entry: Synchronicity
Date: Sat Jan 31 11:43:28 EST 2015

Meaning: is it really necessary to process events from multiple
sources at the same time?  This is about the only obvious reason to
_not_ use a task/process abstraction for handling communication
protocols.


Entry: Threads
Date: Sat Jan 31 23:51:03 EST 2015

Solve the rest of the dataflow programming using concurrent
programming?  It seems the most natural approach.  A good trade-off
between speed and ease of use:

- front-end: processes a lot of data but likely produces very little.
  A synchronous state machine seems best here.

- back-end: processes little data but might need more elaborate
  code/data structure to do its job: higher abstraction seems best
  here.

The question is about task switch granularity.  For 20MHz data rate
there is no way that this can be anything other than a tight
single-task loop over a data array.

It would be nice though to abstract the connectivity.  I.e. if I want
to chain two processors, it will figure out how they pass data.
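
A minimal sketch of that split, assuming a hypothetical Edge event
type and made-up function names (find_edges, run).  The front-end is a
tight loop over a sample block that only emits changes; the back-end
runs in its own thread and receives the reduced data over a std
channel, so the task switch happens once per block, not per sample.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical event type: sample index where the input changed,
/// plus the new value.
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct Edge {
    pub index: usize,
    pub value: u8,
}

/// Front-end: tight single-task loop over one block of samples.
/// Consumes a lot of data, emits few events.  Returns the last value
/// seen so state carries over to the next block.
pub fn find_edges(samples: &[u8], mut last: u8, out: &mut Vec<Edge>) -> u8 {
    for (index, &value) in samples.iter().enumerate() {
        if value != last {
            out.push(Edge { index, value });
            last = value;
        }
    }
    last
}

/// Runs the front-end in the calling thread and a back-end in a
/// separate thread, connected by a channel.  Returns everything the
/// back-end collected.
pub fn run(blocks: Vec<Vec<u8>>) -> Vec<Edge> {
    let (tx, rx) = mpsc::channel::<Vec<Edge>>();

    // Back-end: low data rate, free to use higher-level abstractions.
    let backend = thread::spawn(move || {
        let mut all = Vec::new();
        for batch in rx {
            all.extend(batch);
        }
        all
    });

    let mut last = 0u8; // assumed initial line level
    let mut offset = 0usize;
    for block in &blocks {
        let mut edges = Vec::new();
        last = find_edges(block, last, &mut edges);
        // Make block-relative indices absolute.
        for e in edges.iter_mut() {
            e.index += offset;
        }
        offset += block.len();
        if !edges.is_empty() {
            tx.send(edges).expect("back-end hung up");
        }
    }
    drop(tx); // closing the channel ends the back-end loop
    backend.join().expect("back-end panicked")
}
```

The connectivity is still hand-wired here; abstracting it is the open
question.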

[1] http://doc.rust-lang.org/std/thread/


Entry: Iterators and buffers
Date: Sun Feb  1 15:00:12 EST 2015

Is it necessary to include the iterator in the "tick" method of the
trait that abstracts analyzer state machines?  I'd think that this can
all be optimized away.  Maybe take the plunge and look at generated code?


Entry: Input/Output
Date: Sun Feb  1 16:58:56 EST 2015

Problem: data types when connecting processing pipelines.  Either
avoid it by using byte streams and explicit protocols, or figure out a
way to encode it in the type system.

There seem to be too many variables here to find a good solution.

Attempt to simplify and fix the abstraction levels:

- parallel bit streams
- sequential byte streams
- packet streams
- high level data streams


Maybe abstract everything as Bus and provide some wrappers?

Parallelism is necessary: a bus has multiple channels with
time-correlated data.  A Bus can be a "packet bus" ?


Entry: Finding the right abstraction is hard
Date: Sun Feb  1 20:01:15 EST 2015

Trying too much at once.  It's probably best to give up on premature
optimization and figure out how to type things properly first.  A UART
is something that takes in a synchronous bit stream and produces a
(possibly time-tagged) byte stream.

There are two problems here:

- The types

- The I->O control flow.


There are three obvious ways of structuring:

- As a function (i->o) possibly buffered, leaving connectivity to a
  different layer.

- As a sink, abstracting composite sinks.

- As a source (generator), chaining generators.

In Pyla I used an i->o approach and added some composition laws to
build sinks.

DO IT WELL OR DON'T DO IT


So I have a design that I already thought about for a long time.  It
uses byte buffers to communicate, which makes it rather simple,
structurally.

If types are necessary, why not build those on top of things?
I.e. use types as "compile time blessing" like phantom types[1].


So let's do this: two layers:
- Implementation: unconstrained, uses byte buffers
- Phantom layer: adds type-blessed composition
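
A rough sketch of how the two layers could look.  All names here
(Bits, Bytes, Proc, bless, then, invert_bits, pack_bits) are made up
for illustration.  The raw processors only see byte buffers; the
PhantomData tags exist at compile time only and restrict which
processors can be chained.

```rust
use std::marker::PhantomData;

/// Marker types (hypothetical names).  They carry no data.
pub struct Bits;
pub struct Bytes;

/// Typed layer: a raw byte-buffer function, "blessed" with an input
/// tag I and an output tag O.
pub struct Proc<I, O> {
    raw: Box<dyn Fn(&[u8]) -> Vec<u8>>,
    _tag: PhantomData<(I, O)>,
}

impl<I, O> Proc<I, O> {
    /// Attach type tags to an untyped buffer function.
    pub fn bless<F>(raw: F) -> Self
    where
        F: Fn(&[u8]) -> Vec<u8> + 'static,
    {
        Proc { raw: Box::new(raw), _tag: PhantomData }
    }

    pub fn run(&self, input: &[u8]) -> Vec<u8> {
        (self.raw)(input)
    }

    /// Composition only type-checks if the output tag of `self`
    /// matches the input tag of `next`.
    pub fn then<P>(self, next: Proc<O, P>) -> Proc<I, P> {
        let (first, second) = (self.raw, next.raw);
        Proc {
            raw: Box::new(move |buf: &[u8]| second(first(buf).as_slice())),
            _tag: PhantomData,
        }
    }
}

/// Untyped layer: one bit per input byte, inverted.
pub fn invert_bits(bits: &[u8]) -> Vec<u8> {
    bits.iter().map(|&b| (b & 1) ^ 1).collect()
}

/// Untyped layer: pack groups of 8 bits (one per byte, LSB first)
/// into bytes.  An incomplete trailing group is dropped.
pub fn pack_bits(bits: &[u8]) -> Vec<u8> {
    bits.chunks_exact(8)
        .map(|c| {
            c.iter()
                .enumerate()
                .fold(0u8, |acc, (i, &b)| acc | ((b & 1) << i))
        })
        .collect()
}
```

With invert: Proc<Bits, Bits> and pack: Proc<Bits, Bytes>,
invert.then(pack) compiles, while pack.then(invert) is rejected by the
type checker even though both are just byte buffers underneath.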


[1] http://rustbyexample.com/generics/phantom.html


Entry: Iterators
Date: Sun Feb  1 22:39:36 EST 2015

Tired, but I really want to get to the bottom of this.  Apparently
passing around iterators doesn't work very well, maybe because they
are stateful, abstract objects?

So it seems better to pass something around that can create an
iterator.  Especially because I want the ability to provide fanout.
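
One way to express "something that can create an iterator" is a
closure that returns a fresh iterator on each call.  This is only a
sketch; the names fanout, count_edges and count_high are hypothetical
stand-ins for real analyzers.

```rust
/// Number of value changes between consecutive samples.
pub fn count_edges<I: Iterator<Item = u8>>(mut samples: I) -> usize {
    let mut last = match samples.next() {
        Some(s) => s,
        None => return 0,
    };
    let mut n = 0;
    for s in samples {
        if s != last {
            n += 1;
            last = s;
        }
    }
    n
}

/// Number of non-zero samples.
pub fn count_high<I: Iterator<Item = u8>>(samples: I) -> usize {
    samples.filter(|&s| s != 0).count()
}

/// Fanout: each consumer gets its own iterator from the factory, so
/// no iterator state is shared between them.
pub fn fanout<F, I>(make_iter: F) -> (usize, usize)
where
    F: Fn() -> I,
    I: Iterator<Item = u8>,
{
    (count_edges(make_iter()), count_high(make_iter()))
}
```

For a buffer buf: Vec<u8>, the call would be
fanout(|| buf.iter().copied()).  The closure only borrows the buffer,
so creating a second iterator does not copy any data.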


Entry: Push vs. pull
Date: Mon Feb  2 13:57:25 EST 2015


    // TL;DR: Call flow represents low level call flow.  High level
    // abstractions build on top of this.

    // From a first iteration in C++ (see zwizwa/pyla), the simplest
    // architecture seems to be a "push" approach, i.e. a data-driven
    // / reactive one where a function call corresponds to data being
    // available.

    // This corresponds best to the actual low-level structure when
    // this runs on a uC: a DMA transfer_complete interrupt.

    // It is opposed to a "pull" approach where a task blocks until
    // data is available, which always needs a scheduler.  The pull
    // approach works best on higher levels, i.e. when parsing
    // protocols.


So I took this out.  It's clear that the Iterator trait is the way to
go from the API side.
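
To make that concrete, here is a sketch of a UART decoder written as
an Iterator adapter: it pulls samples from an input iterator and
yields bytes.  It is not the code from this project.  The names (Uart,
FrameError, uart_frame) and the framing (8 data bits, LSB first, one
stop bit, idle high, samples are 0 or 1, fixed integer bit period) are
assumptions for the example.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct FrameError;

/// Pull-style UART decoder over an iterator of 0/1 samples.
pub struct Uart<I> {
    input: I,
    period: usize, // samples per bit
}

impl<I: Iterator<Item = u8>> Uart<I> {
    pub fn new(input: I, period: usize) -> Self {
        assert!(period > 0, "bit period must be at least one sample");
        Uart { input, period }
    }
}

impl<I: Iterator<Item = u8>> Iterator for Uart<I> {
    type Item = Result<u8, FrameError>;

    fn next(&mut self) -> Option<Self::Item> {
        // Consume samples until the line goes low: this is the first
        // sample of the start bit.  `?` ends the stream on exhaustion.
        while self.input.next()? != 0 {}

        // Skip the rest of the start bit plus half a bit, landing
        // mid-way into data bit 0.  After that, one period per bit.
        let mut skip = self.period + self.period / 2 - 1;
        let mut byte = 0u8;
        for bit in 0..8 {
            let sample = self.input.nth(skip)?;
            byte |= (sample & 1) << bit;
            skip = self.period - 1;
        }
        let stop = self.input.nth(skip)?;
        Some(if stop & 1 == 1 { Ok(byte) } else { Err(FrameError) })
    }
}

/// Test-signal generator: one frame, each bit repeated `period` times.
pub fn uart_frame(byte: u8, period: usize) -> Vec<u8> {
    let mut bits = vec![0u8]; // start bit
    bits.extend((0..8).map(|i| (byte >> i) & 1)); // data, LSB first
    bits.push(1); // stop bit
    bits.iter()
        .flat_map(|&b| std::iter::repeat(b).take(period))
        .collect()
}
```

Because Uart is itself an Iterator, it can be fed into the next stage
the same way, e.g. Uart::new(samples.into_iter(), 4) followed by any
byte-level parser that takes an iterator.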


Entry: Bit generators
Date: Fri Feb 13 22:00:25 EST 2015

Makes sense to also put the stream generators in there.  At least for
SPI it's hard enough to make a test sequence without writing it as a
state machine.


Entry: Only 80 - 100 MiB/sec for uart.elf
Date: Sat Feb 14 14:26:22 EST 2015

This is after adding some generic code.  Trying with previous.  Isn't
better..

EDIT: That's 20-40 cycles per sample.  Maybe not too bad?

Ok, fixed.  The iterator next() wasn't inlined.  Now gets 250 - 300
MB/sec.  Chased the .dasm by adding a "marker const".


Entry: Visualize
Date: Sat Feb 14 18:37:29 EST 2015

First, get rid of wiggles.  Encode as black/white/grey.
Allow very fast zoom.
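
A sketch of the black/white/grey encoding, assuming one channel per
bit of a byte sample and a zoom factor given as samples per pixel.
The names (Shade, shades) and the mapping low = black, high = white
are my own choices for the example.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Shade {
    Black, // all samples in the bucket are low
    White, // all samples in the bucket are high
    Grey,  // the bucket contains both levels
}

/// Reduce one channel (bit `channel` of every sample) to one shade
/// per bucket of `zoom` samples.  The last bucket may be shorter.
pub fn shades(samples: &[u8], channel: u32, zoom: usize) -> Vec<Shade> {
    assert!(channel < 8 && zoom > 0);
    let mask = 1u8 << channel;
    samples
        .chunks(zoom)
        .map(|bucket| {
            let any_high = bucket.iter().any(|&s| s & mask != 0);
            let any_low = bucket.iter().any(|&s| s & mask == 0);
            match (any_low, any_high) {
                (true, false) => Shade::Black,
                (false, true) => Shade::White,
                // Mixed bucket.  (false, false) cannot occur because
                // chunks() never yields an empty bucket.
                _ => Shade::Grey,
            }
        })
        .collect()
}
```

This rescans the raw samples for every redraw.  For very fast zoom it
would probably need precomputed reductions at a few zoom levels, which
this sketch does not attempt.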


Entry: RL UART
Date: Tue Feb 17 22:10:21 EST 2015

Sending 0xF0 bytes, it sees 0xE0 bytes and frame errors.

So wire: 0 0000 1111 1
It sees: 0 0000 0111 1

Which means it samples too early.  Adding a bit more delay helps.


Entry: UART sampling
Date: Fri Feb 20 02:28:08 EST 2015

dsPIC UART can use 16x oversampling, where majority voting is used
over mid-bit and one 16x clock to the left and right of mid-bit.

I'm still puzzled why mid-bit doesn't work on the current sniff setup.

This could also be rounding errors.  What about using a fractional
bit period?
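
A sketch of what a fractional bit period could look like, using fixed
point with 8 fractional bits.  The function names and the example
numbers (2 MHz sample rate, 115200 baud) are assumptions for
illustration, not the actual sniff setup.

```rust
/// Bit period in samples, as fixed point with 8 fractional bits,
/// rounded to nearest.
pub fn period_q8(sample_rate_hz: u32, baud: u32) -> u32 {
    let num = sample_rate_hz as u64 * 256 + baud as u64 / 2;
    (num / baud as u64) as u32
}

/// Mid-bit sample indices for `nbits` consecutive bits, counted from
/// the start-bit edge.  Bit 0 is the start bit itself.
pub fn sample_points(period_q8: u32, nbits: u32) -> Vec<u32> {
    (0..nbits)
        .map(|k| {
            // The middle of bit k lies (k + 1/2) periods after the edge.
            let t_q8 = period_q8 * (2 * k + 1) / 2;
            (t_q8 + 128) >> 8 // round to the nearest whole sample
        })
        .collect()
}
```

With these example numbers the period is about 17.36 samples.  The
fractional version puts the tenth sample point (the stop bit) at index
165, while a period truncated to 17 samples gives 162, so the integer
version ends up about three samples early by the end of the frame.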


Entry: BeagleBone Black
Date: Wed Feb 25 22:15:57 EST 2015

slip.elf can take about 26-27 MByte/sec from /dev/zero on BBB.

That's definitely usable.  It would be nice though to get that to the
full 100MByte/sec of the BeagleLogic.

A better way is probably to use RLE.  Reduce at the source.  This
requires the processor API to change.  PRU RLE does not work at 100MHz.

A simpler way is to code an RLE processor in assembler to read from
/dev/beaglelogic and feed the RLE into Rust code.

So a loop in C isn't any better.  This needs assembly.


Entry: beagle logic
Date: Thu Feb 26 11:56:00 EST 2015

Looking at the PRU source, the RLE is not done there.
A good place would be copy_to_user() in the driver.
Replace that with some ASM to do the encoding.


Entry: Fast-path RLE
Date: Fri Feb 27 02:36:32 EST 2015

The idea is the following:

- Fast path: if vector of 16 x 8 bit (8 x 16 bit) is the same as
  before (= pattern), increment counter by 16 (8).

- If not, use scalar ops to scan the vector, read the next vector and
  check if it's all the same.  If so, initialize the pattern.
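
A scalar Rust sketch of the same control flow, to pin down the logic
before writing any assembly.  The 16-byte chunk compare stands in for
the vector compare; whether the compiler vectorizes it is not
guaranteed.  Names (Run, rle, LANES) are made up, and the pattern is
simplified to "value of the last run", which is slightly different
from keeping an explicit pattern register.

```rust
/// One run: `count` consecutive samples with the same `value`.
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct Run {
    pub value: u8,
    pub count: u64,
}

const LANES: usize = 16; // 16 x 8 bit per vector

/// Extend the last run if the value matches, else start a new one.
fn push(runs: &mut Vec<Run>, value: u8, count: u64) {
    match runs.last_mut() {
        Some(r) if r.value == value => r.count += count,
        _ => runs.push(Run { value, count }),
    }
}

pub fn rle(samples: &[u8]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    let mut chunks = samples.chunks_exact(LANES);
    for chunk in &mut chunks {
        // Fast path: whole chunk equals the current pattern, so just
        // bump the counter by the chunk size.
        if let Some(p) = runs.last().map(|r| r.value) {
            if chunk.iter().all(|&s| s == p) {
                push(&mut runs, p, LANES as u64);
                continue;
            }
        }
        // Slow path: scan the chunk sample by sample.  The value of
        // the last run becomes the pattern for the next chunk.
        for &s in chunk {
            push(&mut runs, s, 1);
        }
    }
    // Tail shorter than one chunk.
    for &s in chunks.remainder() {
        push(&mut runs, s, 1);
    }
    runs
}
```

On mostly idle lines nearly every chunk takes the fast path, which is
where the reduction at the source would come from.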


Entry: zero-copy mmap
Date: Fri Feb 27 00:14:31 EST 2015

Trying on BeagleLogic.

1. Set size
IOCTL_BL_SET_BUFFER_SIZE
This allocates multiples of 4MB.

2. Mmap buffers; all buffers are mapped consecutively.  The driver uses
devm_kzalloc() to allocate buffers, which are automatically freed on
unload.

3. Use poll() to synchronize.

4. Use lseek(fd,0,SEEK_END) to find the size that's available.

5. Use NEON instructions to do the scanning.

[1] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/ch01s03s03.html
[2] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.subset.swdev.qrc/index.html
[3] https://groups.google.com/forum/#!topic/android-ndk/0jDCsFzbmu0


Entry: Shaders for waveform display
Date: Sat Feb 28 02:48:57 EST 2015

It should be possible to send raw data to the shaders and have it
render a waveform display, i.e. turning a 0/both/neither/1 input into
lines, blocks, ...

Yes, by feeding waveform data in a dynamic vertex array (1D) and
having a static array with x-axis values 0,1,2,... on the renderer.

Maybe shader can do logic shifts as well?  To pick out different bus
values.  Seems only in higher versions (1.5?)


Hmm... maybe this doesn't work.  All attributes go into one buffer,
and it is strided?  I.e. can't update just part of it?


EDIT: It's probably not worth the trouble, since it needs a geometry
shader: a logic edge turns one value into two vertices.