Tue Jan 1 15:48:20 CET 2008

regular expressions

the data is a stream of tokens, so regular expressions can be
constructed in terms of membership functions and modifiers like '*' or
'+'. symbols can be converted to membership functions.

that should be enough? not really. need some form of abstraction: a
pattern can be a composition of patterns.
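to make this concrete, a sketch in python (all names here -- sym, one,
seq, star, plus -- are invented): membership functions classify single
tokens, and patterns compose out of smaller patterns.

```python
# sketch: regular expressions over a token stream, built from
# membership functions and combinators. a pattern is a function
# (tokens, position) -> new position or None.

def sym(s):
    """convert a symbol to a membership function: match that token."""
    return lambda tok: tok == s

def one(pred):
    """lift a membership function to a pattern matching one token."""
    def match(toks, i):
        if i < len(toks) and pred(toks[i]):
            return i + 1
        return None
    return match

def seq(*pats):
    """abstraction: a pattern is a composition of patterns."""
    def match(toks, i):
        for p in pats:
            i = p(toks, i)
            if i is None:
                return None
        return i
    return match

def star(pat):
    """the '*' modifier: zero or more repetitions."""
    def match(toks, i):
        while True:
            j = pat(toks, i)
            if j is None or j == i:
                return i
            i = j
    return match

def plus(pat):
    """the '+' modifier: one or more repetitions."""
    return seq(pat, star(pat))
```

e.g. seq(one(sym(':')), plus(one(lambda t: t != ';')), one(sym(';')))
matches a colon definition as a token pattern.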

so maybe it is better to stick with the lexer language in mzscheme,
since what i am going to re-invent is ultimately a generic regexp
tool. EDIT: looks like it's really character-oriented. maybe it is a
good exercise to try to write a lexer generator? can't be that
hard... also, i run into this problem so often with low-level bit
protocols that it might be worth a closer look: white space is
essentially the 'stop bit' in async comm.

  which brings me to the question: i think i read on wikipedia (i'm
  offline now) that regular expressions and FSMs are somehow
  equivalent. how is this?
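(answering myself: yes, this is kleene's theorem -- regular
expressions and finite state machines describe exactly the same class
of languages. every regexp compiles to an automaton, e.g. via
thompson's construction, and every FSM's language can be written back
as a regexp by state elimination. hand-compiling 'ab*' to a transition
table makes it concrete; the table layout below is just one way to
write it down:)

```python
# the regexp ab* as an explicit finite state machine.
# state 0: expect 'a'; state 1: accepting, loops on 'b'.
TRANS = {
    (0, 'a'): 1,
    (1, 'b'): 1,
}
ACCEPT = {1}

def run(s):
    state = 0
    for c in s:
        state = TRANS.get((state, c))
        if state is None:
            return False      # no transition = dead state
    return state in ACCEPT
```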

how about forth-lex.ss: a specification not as production rules of a
regular language, but as regular matching patterns? what is the
problem i am trying to solve? find a function (or macro) that maps

  lex-compiler : language-spec -> token-reader

stream = token stream | EOF
token = word | comment | white
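a sketch of what that mapping could look like, with the token grammar
above. the spec format (an ordered list of (tag, match-function)
pairs) is my own invention, and the comment rule is a crude stand-in
for forth's '\' line comments:

```python
# sketch: lex-compiler : language-spec -> token-reader.
# the reader yields (tag, text) pairs and stops at EOF.

def lex_compiler(spec):
    def token_reader(src):
        i = 0
        while i < len(src):
            for tag, match in spec:
                j = match(src, i)
                if j is not None and j > i:
                    yield (tag, src[i:j])
                    i = j
                    break
            else:
                raise ValueError("no rule matches at %d" % i)
    return token_reader

def match_white(src, i):
    j = i
    while j < len(src) and src[j].isspace():
        j += 1
    return j if j > i else None

def match_comment(src, i):
    # crude '\' line comment: upto but excluding end of line
    if src[i] == '\\':
        j = src.find('\n', i)
        return len(src) if j < 0 else j
    return None

def match_word(src, i):
    j = i
    while j < len(src) and not src[j].isspace():
        j += 1
    return j if j > i else None

forth_spec = [('white', match_white),
              ('comment', match_comment),
              ('word', match_word)]
read_tokens = lex_compiler(forth_spec)
```

note this is segmentation-style: whitespace comes out as tokens
instead of being thrown away, which is exactly the property the first
stage tokenizer in brood needs.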

at the same time, i am trying to stay true to the forth syntax: simple
read-ahead. (keyword + fixed number of tokens) or delimited read
(keyword + tokens + terminator).
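the two read-ahead shapes can be written down as a dispatch table over
a token stream. the table entries and names below are invented for
illustration, not real forth semantics:

```python
# sketch: simple read-ahead over a token iterator.
# FIXED: keyword + fixed number of tokens.
# DELIMITED: keyword + tokens + terminator.

def parse(tokens):
    FIXED = {'variable': 1}                # hypothetical entries
    DELIMITED = {':': ';'}
    for tok in tokens:
        if tok in FIXED:
            yield (tok, [next(tokens) for _ in range(FIXED[tok])])
        elif tok in DELIMITED:
            term, body = DELIMITED[tok], []
            for t in tokens:
                if t == term:
                    break                  # terminator is consumed
                body.append(t)
            yield (tok, body)
        else:
            yield (tok, [])
```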

  note: there seems to be a difference between reading UPTO a token,
  or reading UPTO AND INCLUDING a token. is standard forth always of
  the latter form?

to answer the question partially: the current forth-lex.ss performs
segmentation, and thus is not of that form: it cuts IN BETWEEN tokens.
but forth is. can i learn something from this? yes: cutting AT a token
makes the automaton simpler, since it doesn't require peek. let's call
that 'delimited' until i know the technical term.
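the peek difference shows up directly in code: a delimited reader can
run on a bare iterator (next only), while segmentation has to look at
the next character without consuming it. a sketch, names made up:

```python
# cut AT the delimiter: consume upto and including it. works on a
# plain iterator -- no peek needed, the delimiter itself is the cut.
def read_delimited(it, delim):
    out = []
    for c in it:
        if c == delim:
            break                 # delimiter consumed, not returned
        out.append(c)
    return ''.join(out)

# cut IN BETWEEN tokens: stop before the delimiter, leaving it in the
# stream. this needs lookahead (here: indexing plays the role of peek).
def read_segmented(s, i, delim):
    j = i
    while j < len(s) and s[j] != delim:
        j += 1
    return s[i:j], j              # position still points AT the delimiter
```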

i think the important lesson is that:

  1) forth should be delimited: this simplifies on-target lexing
  2) exception: first stage tokenizer in brood = segmentation

the latter is an extension to make source processing in an editor
(like emacs) easier by preserving whitespace and delimiting
characters. BUT, it should not introduce structures that 1) can't

it looks to me that before fixing the higher level compiler and macro
stuff, the lexer should be fixed such that it can be replaced by a
simple, reflective state machine (true parsing words). looking at
forth, there are 2 reading modes:

   - read upto and excluding character
   - read next word (= upto and excluding whitespace)

by fixing some of the syntax (comments and strings) editor tools can
be made exact: a list of DELIMITED words will read upto and including
a delimiter.
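the two reading modes as methods on a minimal input stream model (not
any particular forth's implementation; names are mine):

```python
# sketch: forth's two reading modes over a string + position.

class Input:
    def __init__(self, src):
        self.src, self.pos = src, 0

    def read_upto(self, delim):
        """read upto and excluding delim; the delimiter itself is
        then skipped (PARSE-like)."""
        j = self.src.find(delim, self.pos)
        if j < 0:
            j = len(self.src)
        text = self.src[self.pos:j]
        self.pos = min(j + 1, len(self.src))   # consume the delimiter
        return text

    def next_word(self):
        """read the next word: skip leading whitespace, then read
        upto and excluding whitespace (WORD-like)."""
        while self.pos < len(self.src) and self.src[self.pos].isspace():
            self.pos += 1
        start = self.pos
        while self.pos < len(self.src) and not self.src[self.pos].isspace():
            self.pos += 1
        return self.src[start:self.pos]
```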