Functional Design and Programming, Lecture 9: Lexical Analysis and Parsing
Post on 19-Dec-2015
Parsing/Unparsing
Purpose: Encoding/decoding structured data into flat (string) representations
Reasons:
Data is read (and written) using operating system routines (“read 25 bytes from file XYZ”).
Need for a universal format for all kinds of data; e.g., to allow editing with a text editor.
Language processor architecture
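Pipeline (figure): character stream -> scanner -> token stream -> parser -> abstract syntax tree -> transformer(s) -> abstract syntax tree -> unparser -> character stream
Example:
character stream: “<H1 > My title</ H1>”
token stream: [LANGLE, ID “H1”, RANGLE, ID “ My title”, LSLASH, ID “ H1”, RANGLE]
abstract syntax tree: element with children stag “H1”, contents “ My title”, etag “H1”
abstract syntax tree after transformation: contents “MY TITLE”
character stream: “<H1> MY TITLE </H1>”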
Lexical analysis (Scanning, lexing, tokenizing)
Purpose: Turning a character stream into a stream of tokens.
Reasons:
Making parsing easier by taking care of ‘low-level’ concerns such as eliminating whitespace.
Efficient preprocessing and compression of input to parser.
Unbounded lookahead into input stream (in contrast to most parsers).
Well-founded theoretical basis and tool support (regular expressions and finite state machines).
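A minimal hand-written scanner sketch for the arithmetic examples used later in the lecture. The token type (Id, Num, Key), the exception LexError and the helper span are assumptions made for illustration; they are not given on the slides.

```sml
(* Hypothetical token type: identifiers, integer constants, symbols. *)
datatype token = Id of string | Num of int | Key of string

exception LexError of char

(* span p xs: longest prefix of xs satisfying p, and the remainder. *)
fun span p [] = ([], [])
  | span p (x :: xs) =
      if p x then let val (ys, zs) = span p xs in (x :: ys, zs) end
      else ([], x :: xs)

(* Turn a character stream into a token stream, dropping whitespace. *)
fun scan (s : string) : token list =
  let
    fun go [] acc = List.rev acc
      | go (c :: cs) acc =
          if Char.isSpace c then go cs acc
          else if Char.isDigit c then
            let val (ds, rest) = span Char.isDigit (c :: cs)
            in go rest (Num (valOf (Int.fromString (implode ds))) :: acc) end
          else if Char.isAlpha c then
            let val (ls, rest) = span Char.isAlphaNum (c :: cs)
            in go rest (Id (implode ls) :: acc) end
          else if Char.contains "+-*/()" c then go cs (Key (str c) :: acc)
          else raise LexError c
  in
    go (explode s) []
  end
```

With these definitions, scan "x + y / 15 - x * x" evaluates to [Id "x", Key "+", Id "y", Key "/", Num 15, Key "-", Id "x", Key "*", Id "x"].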
Context-free Grammars (CFGs)
A context-free grammar G describes a language (set of strings).
G = (T, N, P, S) where
T: set of terminal symbols
N: set of nonterminal symbols
P: set of productions
S: start symbol (a particular nonterminal symbol)
CFGs: Example
T = { +, -, *, /, (, ), Var, Const }
N = { Exp, Term, Factor }
S = Exp
Exp ::= Exp + Term | Exp - Term | Term
Term ::= Term * Factor | Term / Factor | Factor
Factor ::= Var | Const | ( Exp )
[Var, +, Var, /, Const, -, Var, *, Var]
CFGs: Example...
“x + y / 15 - x * x”
Parse tree (figure), written in bracketed form:
Exp( Exp( Exp( Term( Factor x ) ) + Term( Term( Factor y ) / Factor 15 ) ) - Term( Term( Factor x ) * Factor x ) )
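The parse tree for “x + y / 15 - x * x” can be written down as a value. This is only a sketch: the generic tree type and the helper names (frontier, factor, term1) are assumptions, not part of the lecture. Reading the leaves from left to right gives back the token sequence.

```sml
(* Generic parse tree: a leaf holds a lexeme, a node holds a
   nonterminal name and its children (illustrative names). *)
datatype tree = Leaf of string | Node of string * tree list

(* The frontier of a parse tree is the string it derives. *)
fun frontier (Leaf t) = [t]
  | frontier (Node (_, kids)) = List.concat (map frontier kids)

(* Shorthands for the chains Factor -> lexeme and Term -> Factor. *)
fun factor v = Node ("Factor", [Leaf v])
fun term1 v = Node ("Term", [factor v])

(* Parse tree for x + y / 15 - x * x under the example grammar. *)
val t =
  Node ("Exp",
    [ Node ("Exp",
        [ Node ("Exp", [term1 "x"]),
          Leaf "+",
          Node ("Term", [term1 "y", Leaf "/", factor "15"]) ]),
      Leaf "-",
      Node ("Term", [term1 "x", Leaf "*", factor "x"]) ])
```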
Parsing
Purpose: Turning a stream of tokens into a tree structure as described by a grammar.
Reasons:
Checking that input is well-formed (according to given grammar).
Producing parse tree or abstract syntax tree to recover tree structure in input.
Processing parse tree according to grammar.
Parsing combinators
Idea: For each terminal or nonterminal M there is a function
fM : token list -> T * token list (= T phrase)
such that fM takes elements from its argument until it has reduced the elements to M and then produces a value of type T for it.
Parsing primitives
Terminals:
Var: string phrase
Const: int phrase
$: string -> string phrase (for keywords)
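One possible way to define the phrase type and the three primitives. The token type and the exception SyntaxErr are assumptions made for this sketch; the slides only give the types of Var, Const and $.

```sml
(* Hypothetical token type and error exception. *)
datatype token = Id of string | Num of int | Key of string
exception SyntaxErr of string

(* A phrase parser consumes a prefix of the token list and returns
   a value together with the unconsumed tokens. *)
type 'a phrase = token list -> 'a * token list

(* Var : string phrase, accepts one identifier token. *)
fun Var (Id x :: toks) = (x, toks)
  | Var _ = raise SyntaxErr "identifier expected"

(* Const : int phrase, accepts one integer constant. *)
fun Const (Num n :: toks) = (n, toks)
  | Const _ = raise SyntaxErr "constant expected"

(* $ : string -> string phrase, accepts exactly the keyword k. *)
fun $ k (Key k' :: toks) =
      if k = k' then (k, toks) else raise SyntaxErr ("expected " ^ k)
  | $ k _ = raise SyntaxErr ("expected " ^ k)
```

For example, Var [Id "x", Key "+"] evaluates to ("x", [Key "+"]), while Var [Num 1] raises SyntaxErr.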
Parsing primitives...
Parsing combinators:
empty: ('a list) phrase
||: 'a phrase * 'a phrase -> 'a phrase
--: 'a phrase * 'b phrase -> ('a * 'b) phrase
>>: 'a phrase * ('a -> 'b) -> 'b phrase
Derived combinators:
repeat: 'a phrase -> 'a list phrase
$--: 'a phrase * 'b phrase -> 'b phrase
--$: 'a phrase * 'b phrase -> 'a phrase
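A sketch of how the combinators could be implemented so that they have the types listed above. It assumes that a failing phrase raises an exception (here called SyntaxErr) and uses operator precedences in the style of Paulson's book; both are assumptions, as are the token type and the two primitives repeated here to keep the example self-contained.

```sml
infix 6 $--
infix 5 -- --$
infix 3 >>
infix 0 ||

datatype token = Id of string | Num of int | Key of string
exception SyntaxErr of string

fun Var (Id x :: toks) = (x, toks)
  | Var _ = raise SyntaxErr "identifier expected"
fun $ k (Key k' :: toks) =
      if k = k' then (k, toks) else raise SyntaxErr ("expected " ^ k)
  | $ k _ = raise SyntaxErr ("expected " ^ k)

(* empty: succeed without consuming input, yielding the empty list. *)
fun empty toks = ([], toks)

(* Alternative: try ph1; if it fails, try ph2 on the same input. *)
fun (ph1 || ph2) toks = ph1 toks handle SyntaxErr _ => ph2 toks

(* Sequencing: run ph1, then ph2 on the rest; pair the results. *)
fun (ph1 -- ph2) toks =
  let val (x, toks2) = ph1 toks
      val (y, toks3) = ph2 toks2
  in ((x, y), toks3) end

(* Apply f to the result of ph. *)
fun (ph >> f) toks =
  let val (x, toks2) = ph toks in (f x, toks2) end

(* Zero or more repetitions of ph. *)
fun repeat ph toks = (ph -- repeat ph >> (op ::) || empty) toks

(* Sequencing that keeps only the right or only the left result. *)
fun (ph1 $-- ph2) toks = (ph1 -- ph2 >> (fn (_, y) => y)) toks
fun (ph1 --$ ph2) toks = (ph1 -- ph2 >> (fn (x, _) => x)) toks
```

For example, repeat Var [Id "a", Id "b", Key "+"] evaluates to (["a", "b"], [Key "+"]).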
Problems with combinatory parsers
Left-recursion:
Problem: Left-recursive grammars make parsers go into an infinite loop.
Remedy: Transform grammar to eliminate left-recursion.
Mutual recursion:
Problem (SML-specific!): Cannot use val-declarations and combinator applications only.
Remedy: Use fun-declarations for mutually recursive parts of a grammar.
Parsing problems...
Example grammar is left-recursive:
Exp ::= Exp ‘+’ Term | Exp ‘-’ Term | Term
Term ::= Term ‘*’ Factor | Term ‘/’ Factor | Factor
Factor ::= Var | Const | ‘(’ Exp ‘)’
Eliminate left-recursion:
Binop1 ::= ‘+’ | ‘-’
Binop2 ::= ‘*’ | ‘/’
Factor ::= Var | Const | ‘(’ Exp ‘)’
Term ::= Factor (Binop2 Factor)*
Exp ::= Term (Binop1 Term)*
Data type for abstract syntax trees
type binop = string
datatype expAST =
EXP of termAST * (binop * termAST) list
and termAST =
TERM of factorAST * (binop * factorAST) list
and factorAST =
VAR of string
| CONST of int
| PARENEXP of expAST
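The list-based AST no longer shows the left-nested structure of the original grammar, but it can be recovered by folding from the left. The evaluator below is a sketch under assumptions: the function names, the environment as a function from variable names to integers, and integer division for ‘/’ are choices made here, not taken from the lecture.

```sml
type binop = string
datatype expAST = EXP of termAST * (binop * termAST) list
and termAST = TERM of factorAST * (binop * factorAST) list
and factorAST = VAR of string | CONST of int | PARENEXP of expAST

(* Meaning of the four operators on integers ("/" is div here). *)
fun apply ("+", a, b) = a + b
  | apply ("-", a, b) = a - b
  | apply ("*", a, b) = a * b
  | apply ("/", a, b) = a div b
  | apply (oper, _, _) = raise Fail ("unknown operator " ^ oper)

(* foldl combines operands left to right, so 10 - 3 - 2 is (10 - 3) - 2. *)
fun evalExp env (EXP (t, rest)) =
      foldl (fn ((oper, t'), acc) => apply (oper, acc, evalTerm env t'))
            (evalTerm env t) rest
and evalTerm env (TERM (f, rest)) =
      foldl (fn ((oper, f'), acc) => apply (oper, acc, evalFactor env f'))
            (evalFactor env f) rest
and evalFactor env (VAR x) = env x
  | evalFactor env (CONST n) = n
  | evalFactor env (PARENEXP e) = evalExp env e

(* AST for 10 - 3 - 2 *)
val sample =
  EXP (TERM (CONST 10, []), [("-", TERM (CONST 3, [])), ("-", TERM (CONST 2, []))])
```

Here evalExp (fn _ => 0) sample evaluates to 5, the left-associative reading.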
Parser: example (first try)
val binop1 = $"+" || $"-"
val binop2 = $"*" || $"/"
val factor = Var >> VAR || Const >> CONST || $"(" $-- exp --$ $")" >> PARENEXP
val term = factor -- repeat (binop2 -- factor) >> TERM
val exp = term -- repeat (binop1 -- term) >> EXP
PROBLEM: Doesn’t work! These definitions are intended to be mutually recursive, but are not: factor refers to exp before exp has been declared.
Parser: example (second try)
val binop1 = $"+" || $"-"
val binop2 = $"*" || $"/"
fun factor toks = ( Var >> VAR || Const >> CONST || $"(" $-- exp --$ $")" >> PARENEXP ) toks
and term toks = (factor -- repeat (binop2 -- factor) >> TERM) toks
and exp toks = (term -- repeat (binop1 -- term) >> EXP) toks
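Put together, the second try can be run as a complete program. The sketch below repeats the AST type and adds assumed definitions of the token type, the exception SyntaxErr, the primitives and the combinators (exception-based, with Paulson-style precedences); only the last three declarations correspond directly to the slide.

```sml
infix 6 $--
infix 5 -- --$
infix 3 >>
infix 0 ||

datatype token = Id of string | Num of int | Key of string
exception SyntaxErr of string

type binop = string
datatype expAST = EXP of termAST * (binop * termAST) list
and termAST = TERM of factorAST * (binop * factorAST) list
and factorAST = VAR of string | CONST of int | PARENEXP of expAST

(* Primitives (assumed definitions). *)
fun Var (Id x :: toks) = (x, toks)
  | Var _ = raise SyntaxErr "identifier expected"
fun Const (Num n :: toks) = (n, toks)
  | Const _ = raise SyntaxErr "constant expected"
fun $ k (Key k' :: toks) =
      if k = k' then (k, toks) else raise SyntaxErr ("expected " ^ k)
  | $ k _ = raise SyntaxErr ("expected " ^ k)

(* Combinators (assumed definitions). *)
fun empty toks = ([], toks)
fun (p || q) toks = p toks handle SyntaxErr _ => q toks
fun (p -- q) toks =
  let val (x, t1) = p toks
      val (y, t2) = q t1
  in ((x, y), t2) end
fun (p >> f) toks = let val (x, t1) = p toks in (f x, t1) end
fun repeat p toks = (p -- repeat p >> (op ::) || empty) toks
fun (p $-- q) toks = (p -- q >> (fn (_, y) => y)) toks
fun (p --$ q) toks = (p -- q >> (fn (x, _) => x)) toks

(* The grammar after eliminating left recursion. *)
val binop1 = $ "+" || $ "-"
val binop2 = $ "*" || $ "/"

fun factor toks =
      (Var >> VAR || Const >> CONST || $ "(" $-- exp --$ $ ")" >> PARENEXP) toks
and term toks = (factor -- repeat (binop2 -- factor) >> TERM) toks
and exp toks = (term -- repeat (binop1 -- term) >> EXP) toks
```

Applied to the tokens of “x + y / 15 - x * x”, exp returns EXP (TERM (VAR "x", []), [("+", TERM (VAR "y", [("/", CONST 15)])), ("-", TERM (VAR "x", [("*", VAR "x")]))]) together with an empty list of remaining tokens. With these exception-based combinators a trailing error such as “x +” is not reported by exp itself; it shows up as a non-empty remainder.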
Operator precedence parsing (overview)
When processing operator expressions, a parser has to decide whether to reduce (stop the current phrase parser and return its result) or shift (continue the current phrase parse).
Operator precedence parsing: Associate a precedence (binding strength) with each operator, remember the precedence of the last operator processed, and determine whether to reduce or shift depending on the precedence of the next operator.
See Paulson, pp. 364-366
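The shift/reduce decision can be illustrated with a small precedence-driven parser. This is a sketch in the “precedence climbing” style and not the code from Paulson's book; the token type, the binary AST, the precedence table prec and all function names are assumptions.

```sml
datatype token = Id of string | Num of int | Key of string
datatype ast = V of string | C of int | Bin of string * ast * ast
exception SyntaxErr of string

(* Precedence table: higher binds tighter; ~1 means "not an operator". *)
fun prec "+" = 1 | prec "-" = 1 | prec "*" = 2 | prec "/" = 2 | prec _ = ~1

(* An operand: variable, constant or parenthesised expression. *)
fun atom (Id x :: toks) = (V x, toks)
  | atom (Num n :: toks) = (C n, toks)
  | atom (Key "(" :: toks) =
      (case expr 0 toks of
           (e, Key ")" :: rest) => (e, rest)
         | _ => raise SyntaxErr "expected )")
  | atom _ = raise SyntaxErr "operand expected"

(* expr last: parse an operand followed by operators that bind more
   tightly than last, the precedence of the last operator processed. *)
and expr last toks =
  let val (lhs, rest) = atom toks in climb last lhs rest end

and climb last lhs (Key k :: toks) =
      if prec k > last then
        (* shift: the next operator binds tighter, continue this phrase *)
        let val (rhs, rest) = expr (prec k) toks
        in climb last (Bin (k, lhs, rhs)) rest end
      else
        (* reduce: return what has been built so far *)
        (lhs, Key k :: toks)
  | climb _ lhs toks = (lhs, toks)
```

Because an operator of equal precedence causes a reduce, operators come out left-associative: the tokens of 10 - 3 - 2 yield Bin ("-", Bin ("-", C 10, C 3), C 2).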
Backtracking parsing (overview)
There may be more than one way of parsing an expression.
Backtracking parsing: Construct a lazy list of all possible parses of a token stream. Continue parse with first of those and find a complete parse for the whole token stream; if that fails, backtrack to second in the list and repeat.
See Paulson, pp. 366-367
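A sketch of the idea using ordinary (eager) lists instead of lazy lists, which keeps the example short but computes all parses up front; Paulson's version uses lazy sequences. The type and function names are assumptions. A phrase returns every way of parsing a prefix of the input, and parseAll picks the first parse that consumes the whole token stream.

```sml
infix 5 --
infix 3 >>
infix 0 ||

datatype token = Id of string | Num of int | Key of string

(* A backtracking parser returns all (result, remaining tokens) pairs;
   the empty list means failure. *)
type 'a parser = token list -> ('a * token list) list

fun ident (Id x :: toks) = [(x, toks)]
  | ident _ = []

fun empty toks = [([], toks)]

(* Alternatives: all parses of p, followed by all parses of q. *)
fun (p || q) toks = p toks @ q toks

(* Sequencing: continue with q after every possible parse of p. *)
fun (p -- q) toks =
  List.concat
    (map (fn (x, t1) => map (fn (y, t2) => ((x, y), t2)) (q t1)) (p toks))

fun (p >> f) toks = map (fn (x, t) => (f x, t)) (p toks)

fun repeat p toks = (p -- repeat p >> (op ::) || empty) toks

(* First parse that consumes the whole token stream, if any. *)
fun parseAll p toks =
  case List.find (fn (_, rest) => null rest) (p toks) of
      SOME (x, _) => SOME x
    | NONE => NONE
```

The phrase repeat ident -- ident shows the effect: on three identifiers a, b, c the longest match of repeat ident leaves nothing for the final ident, so the parser falls back to the next alternative and parseAll returns SOME (["a", "b"], "c").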
Recursive-descent parsing (overview)
Write one parser for each grammatical category (as in combinatory parsing)
Process token stream as in combinatory parsers, except for alternatives.
Process alternatives as follows: Look at next token (first token of remaining token stream). Choose phrase parser on the basis of that token.
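A recursive-descent sketch for the expression grammar with left recursion eliminated. The token type, the binary AST and the helper chain are assumptions made for illustration. Each grammatical category has its own function, and every choice between alternatives is made by inspecting only the next token.

```sml
datatype token = Id of string | Num of int | Key of string
datatype ast = V of string | C of int | Bin of string * ast * ast
exception SyntaxErr of string

(* chain ops sub: parse  sub (op sub)*  for operators in ops,
   building a left-nested tree. The next token decides whether to
   continue the loop or to return. *)
fun chain ops sub toks =
  let
    fun loop (acc, Key k :: ts) =
          if List.exists (fn oper => oper = k) ops then
            let val (rhs, rest) = sub ts
            in loop (Bin (k, acc, rhs), rest) end
          else (acc, Key k :: ts)
      | loop (acc, ts) = (acc, ts)
  in
    loop (sub toks)
  end

(* Factor ::= Var | Const | ( Exp ), chosen by the first token. *)
fun factor (Id x :: toks) = (V x, toks)
  | factor (Num n :: toks) = (C n, toks)
  | factor (Key "(" :: toks) =
      (case exp toks of
           (e, Key ")" :: rest) => (e, rest)
         | _ => raise SyntaxErr "expected )")
  | factor _ = raise SyntaxErr "factor expected"

(* Term ::= Factor (Binop2 Factor)* *)
and term toks = chain ["*", "/"] factor toks

(* Exp ::= Term (Binop1 Term)* *)
and exp toks = chain ["+", "-"] term toks
```

No exception handling or backtracking is needed to choose between alternatives, which is the main difference from the combinator version.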
LL-parsing and LR-parsing (overview)
Use tools to generate parsers from grammar specifications.
The tool produces a table that guides a push-down automaton through parsing actions (“shift”, “reduce”).
LL-parsing: Predictive (basically recursive descent parsing in table-driven form)
LR-parsing (incl. SLR- and LALR-parsing): (Virtual) parallel execution of phrase parsers.
Problems: Lookahead bounded in practice, at times unwieldy.
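To make “table-driven” concrete, here is a hand-written LL(1) recognizer for the expression grammar. It is a sketch only: a generator tool would compute the table, whereas here it is written by hand as the function table, and the grammar is first rewritten into LL(1) form with the extra nonterminals E' and T'. All names are assumptions. Its actions are “predict” and “match”, the LL counterparts of the LR actions “shift” and “reduce”.

```sml
datatype token = Id of string | Num of int | Key of string

(* E ::= T E'    E' ::= (+|-) T E' | empty
   T ::= F T'    T' ::= (*|/) F T' | empty
   F ::= var | const | ( E ) *)
datatype nonterm = E | E' | T | T' | F
datatype sym = Term of string | Non of nonterm

(* Terminal class of a token; "eof" marks the end of the input. *)
fun kind (Id _) = "var" | kind (Num _) = "const" | kind (Key k) = k
fun lookahead [] = "eof"
  | lookahead (t :: _) = kind t

fun startsFactor la = (la = "var") orelse (la = "const") orelse (la = "(")

(* Parse table: (nonterminal, lookahead) -> right-hand side to predict. *)
fun table (E, la) = if startsFactor la then SOME [Non T, Non E'] else NONE
  | table (E', la) =
      if la = "+" orelse la = "-" then SOME [Term la, Non T, Non E']
      else if la = ")" orelse la = "eof" then SOME []
      else NONE
  | table (T, la) = if startsFactor la then SOME [Non F, Non T'] else NONE
  | table (T', la) =
      if la = "*" orelse la = "/" then SOME [Term la, Non F, Non T']
      else if la = "+" orelse la = "-" orelse la = ")" orelse la = "eof"
      then SOME []
      else NONE
  | table (F, "(") = SOME [Term "(", Non E, Term ")"]
  | table (F, la) =
      if la = "var" orelse la = "const" then SOME [Term la] else NONE

(* Push-down automaton: the first component is the stack. *)
fun run ([], []) = true
  | run ([], _) = false
  | run (Term a :: stack, toks) =
      (case toks of
           t :: rest => kind t = a andalso run (stack, rest)   (* match *)
         | [] => false)
  | run (Non nt :: stack, toks) =
      (case table (nt, lookahead toks) of
           SOME rhs => run (rhs @ stack, toks)                 (* predict *)
         | NONE => false)

fun accepts toks = run ([Non E], toks)
```

The recognizer only answers whether the input is well-formed; building a tree would require attaching actions to the productions.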