Mon Feb 16 11:21:54 CET 2009
Parsing and Automata
Before storming into a dark room writing yet another ad-hoc recursive
descent parser, it might be a good idea to turn on the lights and look
at languages first.
Automata are classified by the class of formal languages they can
recognize. Types of Finite Automata :
DFA: deterministic finite: each state has a exactly one transition
for every symbol in the alfabet.
NFA: nondeterministic finite: .... zero or more transitions ...
(symbol with no state: input is _rejected_)
It can be shown that these [D|N]FA can accept the same languages.
These languages are called _regular_.
Extensions to Finite Automata:
PDA: push-down: FA + stack.
NPDAs accept the _context free_ languages.
LBA: linear bounded: a turing machine with a tape length
proportional to the input string.
LBAs accept the _context sensitive_ languages.
TM: turing machine: equivalent to algorithms.
Turing machines decide/accept recursive languages and recognize the
recursively enumerable languages.
Let's stick to formal languages defined by a formal grammar. A
grammar is a set of rules for transforming strings. These rules are
called _productions_. All the strings in the language are generated
by applying the grammar rules to a collection of _start symbols_. If
there are multiple ways of generating the same string, then the
grammar is said to be _ambiguous_.
Note that from a practical point (parsing) this is upside down:
parsing consists of two steps:
- validate a string (is it part of the language?)
- as a side effect: find the exact production rule(s) to build a
parse tree representation of the input (convert concrete syntax to
Ambiguity is problematic for the latter part: you really want a
single parse tree to which to attribute semantics later.
Note: The production rules approach is very different from
recognition-based PEG - parsing expression grammars, where the
language is the set of inputs recognized by the parser
expression (a formal representation of a recursive descent
A context-free language can be recognized in O(n3) time, but there are
a couple of subclasses for which linear algorithms exist:
LR Left-to-rigth, Rightmost derivation
LALR Lookahead LR
LL Left-to-right, Leftmost derivation
LL(k) LL with k lookahead, without backtracking
LL(k) Can be implemented with recursive decent parsers. Lisp is LL(1)
A derivation is a convenient way to express how a particular input
string can be produced by fixing a replacement strategy (i.e. always
replace Leftmost or Rightmost non-terminal first) and listing the
rules applied using that strategy. This is not unique for an