Mon Feb 16 11:21:54 CET 2009

Parsing and Automata

Before storming into a dark room writing yet another ad-hoc recursive
descent parser, it might be a good idea to turn on the lights and look
at languages first.


Automata are classified by the class of formal languages they can
recognize.  Types of Finite Automata[1] :

  DFA: deterministic finite: each state has a exactly one transition
       for every symbol in the alfabet.

  NFA: nondeterministic finite: .... zero or more transitions ...
       (symbol with no state: input is _rejected_)

It can be shown that these [D|N]FA can accept the same languages.
These languages are called _regular_.

Extensions to Finite Automata:

  PDA: push-down:  FA + stack.

NPDAs accept the _context free_ languages.

  LBA: linear bounded: a turing machine with a tape length
       proportional to the input string.

LBAs accept the _context sensitive_ languages.

  TM:  turing machine: equivalent to algorithms.

Turing machines decide/accept recursive languages and recognize the
recursively enumerable languages.


Let's stick to formal languages defined by a formal grammar[3].  A
grammar is a set of rules for transforming strings.  These rules are
called _productions_.  All the strings in the language are generated
by applying the grammar rules to a collection of _start symbols_.  If
there are multiple ways of generating the same string, then the
grammar is said to be _ambiguous_.

Note that from a practical point (parsing) this is upside down:

parsing consists of two steps:

  - validate a string (is it part of the language?)

  - as a side effect: find the exact production rule(s) to build a
    parse tree representation of the input (convert concrete syntax to
    abstract syntax)

Ambiguity is problematic for the latter part:  you really want a
single parse tree to which to attribute semantics later.

  Note: The production rules approach is very different from
        recognition-based PEG - parsing expression grammars, where the
        language is the set of inputs recognized by the parser
        expression (a formal representation of a recursive descent

A context-free language can be recognized in O(n3) time, but there are
a couple of subclasses for which linear algorithms exist:

LR      Left-to-rigth, Rightmost derivation
LALR    Lookahead LR
LL      Left-to-right, Leftmost derivation
LL(k)   LL with k lookahead, without backtracking

LL(k) Can be implemented with recursive decent parsers. Lisp is LL(1)

A derivation[4] is a convenient way to express how a particular input
string can be produced by fixing a replacement strategy (i.e. always
replace Leftmost or Rightmost non-terminal first) and listing the
rules applied using that strategy.  This is not unique for an
ambiguous grammar.

[1] http://en.wikipedia.org/wiki/Automata_theory
[2] http://en.wikipedia.org/wiki/Context-free_languages
[3] http://en.wikipedia.org/wiki/Formal_grammar
[4] http://en.wikipedia.org/wiki/Context-free_grammar#Derivations_and_syntax_trees