[<<][libprim][>>][..]
Sat Nov 7 11:30:02 CET 2009

Representing tokenized streams

Let's sum up the design decisions:

  * representing tokenized streams: prefixed or delimited

Prefix tokens are simpler for implementing token streams, as they
allow memory allocation in the _recognizer_ to happen before the
payload data is touched.  However, to do this completely general, some
form of delimited parsing needs to be used to represent the _size_
itself (if one doesn't want to introduce arbitrary size limits by
choosing fixed representations as in Quicktime/MPEG).  However, this
is not a problem: in practice a recogniser can set a limit (i.e. 64
bit sizes since it's limited anyway) while the format is in no way
limited.

  * quotation vs. construction

Since the base type is made of byte chunks, an extra tag is necessary
to indicate whether a list of bytes is a payload or a constructor
name.

  * bottom-up (postfix) or top-down (prefix)

This reflects the structure of the parser.  The same token stream
format can be used for both parsers.  I.e. the list (1 2 3) is
represented as

  RPN:  1 2 3 NIL CONS CONS CONS
  PN:   CONS 1 CONS 2 CONS 3 NIL

Note that for cons lists RPN isn't so great.  One could use XCONS to
build lists in reverse.

  RPN   NIL 3 XCONS 2 XCONS 1 XCONS

There is one important point: RPN isn't terminated, while for PN it's
always clear whether we've read all the data or not.  I.e. RPN needs
EOF.

  * arity-tagged constructors

Arity tagged constructors are important to be able to represent the
stream as a tree without meta-information about the constructor tags.

Is this actually useful?  It doesn't seem so...  Any kind of
less-structured format can always be embedded in cons-cell based
lists, and is maybe better represented as such.

So, for libprim:

  - prefixed tokens
  - 3 tags: 
      0x20 non-trailing-digit
      0x30 trailing-code (numerical digits)
      0x40 trailing-data (bytes)
  - polish (rpn = not delimited)



[Reply][About]
[<<][libprim][>>][..]