functional design and programming lecture 9: lexical analysis and parsing

Functional Design and Programming

Lecture 9: Lexical analysis and parsing

Literature

Paulson, chap. 9: Lexical analysis (9.1), functional parsing (9.2-9.4)

Exercises

Paulson, chap. 9: 9.1-9.2, 9.3-9.6, 9.8

Write a parser for XML elements (see home page)


Parsing/Unparsing

Purpose: Encoding structured data into flat (string) representations, and decoding it again.

Reasons:

Data is read (and written) using operating system routines (“read 25 bytes from file XYZ”).

Need for a universal format for all kinds of data; e.g., to allow editing with a text editor.

Language processor architecture

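character stream -> scanner -> token stream -> parser -> abstract syntax tree -> transformer(s) -> abstract syntax tree -> unparser -> character stream

Example:

character stream: “<H1 > My title</ H1>”

token stream: [LANGLE, ID “H1”, RANGLE, ID “ My title”, LSLASH, ID “ H1”, RANGLE]

abstract syntax tree: element with children stag (“H1”), contents (“ My title”), etag (“H1”)

transformed abstract syntax tree: contents become “MY TITLE”

character stream: “<H1> MY TITLE </H1>”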

Lexical analysis (Scanning, lexing, tokenizing)

Purpose: Turning a character stream into a stream of tokens.

Reasons:

Making parsing easier by taking care of ‘low-level’ concerns such as eliminating whitespace.

Efficient preprocessing and compression of input to the parser.

Unbounded lookahead into the input stream (in contrast to most parsers).

Well-founded theoretical basis and tool support (regular expressions and finite state machines).
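As an illustration of the purpose stated above, here is a minimal scanner sketch for arithmetic expressions. The token type, the names `span` and `scan`, and the exception `LexError` are assumptions made for this example; they are not taken from the slides or from Paulson.

```sml
(* Hypothetical token type; the slides do not define one. *)
datatype token = KeyTok of string | VarTok of string | ConstTok of int

exception LexError of string

(* Split off the longest prefix of cs whose characters satisfy p. *)
fun span p cs =
  let
    fun go (acc, c :: rest) =
          if p c then go (c :: acc, rest) else (rev acc, c :: rest)
      | go (acc, []) = (rev acc, [])
  in
    go ([], cs)
  end

(* Turn a character list into a token list, dropping whitespace. *)
fun scan [] = []
  | scan (c :: cs) =
      if Char.isSpace c then scan cs
      else if Char.isDigit c then
        let val (ds, rest) = span Char.isDigit (c :: cs)
        in ConstTok (valOf (Int.fromString (implode ds))) :: scan rest end
      else if Char.isAlpha c then
        let val (xs, rest) = span Char.isAlphaNum (c :: cs)
        in VarTok (implode xs) :: scan rest end
      else if Char.contains "+-*/()" c then KeyTok (str c) :: scan cs
      else raise LexError ("unexpected character: " ^ str c)

val tokens = scan (explode "x + y / 15 - x * x")
```

Whitespace never reaches the parser, and multi-character lexemes such as `15` arrive as a single token, which is the simplification the slide refers to.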

Context-free Grammars (CFGs)

A context-free grammar G describes a language (set of strings)

G = (T, N, P, S) where

T: set of terminal symbols

N: set of nonterminal symbols

P: set of productions

S: start symbol (a particular nonterminal symbol)

CFGs: Example

T = { +, -, *, /, (, ), Var, Const }

N = { Exp, Term, Factor }

S = Exp

Exp ::= Exp + Term | Exp - Term | Term

Term ::= Term * Factor | Term / Factor | Factor

Factor ::= Var | Const | ( Exp )

[Var, +, Var, /, Const, -, Var, *, Var]

CFGs: Example...

“x + y / 15 - x * x”

Parse tree:

Exp
  Exp
    Exp
      Term
        Factor: x
    +
    Term
      Term
        Factor: y
      /
      Factor: 15
  -
  Term
    Term
      Factor: x
    *
    Factor: x

Parsing

Purpose: Turning a stream of tokens into the tree structure described by a grammar.

Reasons:

Checking that the input is well-formed (according to the given grammar).

Producing a parse tree or abstract syntax tree to recover the tree structure in the input.

Processing the parse tree according to the grammar.

Parsing combinators

Idea: For each terminal or nonterminal M there is a function

fM : token list -> T * token list (= T phrase)

such that fM takes elements from the front of its argument until it has reduced them to M, and then produces a value of type T for it together with the remaining tokens.
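The idea can be sketched as a type abbreviation plus one hand-written phrase function. The token type, the exception `SyntaxError`, and the name `var` are assumptions for this sketch; the slides only give the shape `token list -> T * token list`.

```sml
(* Hypothetical token type; the slides leave it unspecified. *)
datatype token = KeyTok of string | VarTok of string | ConstTok of int

(* Raised when the front of the token list cannot be reduced to the phrase. *)
exception SyntaxError of string

(* A phrase function consumes a prefix of the tokens and returns a value
   together with the tokens it did not consume. *)
type 'a phrase = token list -> 'a * token list

(* Phrase function for a variable: succeeds only on a variable token. *)
val var : string phrase =
  fn (VarTok x :: toks) => (x, toks)
   | _ => raise SyntaxError "variable expected"

val (name, rest) = var [VarTok "x", KeyTok "+", VarTok "y"]
```

Here `name` is bound to the recognised variable and `rest` to the unconsumed tokens, so the next phrase function can continue where this one stopped.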

Parsing primitives

Terminals:

Var: string phrase

Const: int phrase

$: string -> string phrase (for keywords)

Parsing primitives...

Parsing combinators:

empty: 'a list phrase

||: 'a phrase * 'a phrase -> 'a phrase

--: 'a phrase * 'b phrase -> ('a * 'b) phrase

>>: 'a phrase * ('a -> 'b) -> 'b phrase

Derived combinators:

repeat: 'a phrase -> 'a list phrase

$--: 'a phrase * 'b phrase -> 'b phrase

--$: 'a phrase * 'b phrase -> 'a phrase
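One possible set of definitions matching the types listed above is sketched here. The token type and the exception `SyntaxError` are assumptions, and the bodies follow the usual exception-based style rather than quoting any particular library. The infix declarations come first so that the operators can be defined in infix form.

```sml
datatype token = KeyTok of string | VarTok of string | ConstTok of int
exception SyntaxError of string

infix 6 $-- --$
infix 5 --
infix 3 >>
infix 0 ||

(* Terminals. *)
fun Var (VarTok x :: toks) = (x, toks)
  | Var _ = raise SyntaxError "variable expected"
fun Const (ConstTok n :: toks) = (n, toks)
  | Const _ = raise SyntaxError "constant expected"
fun $ a (KeyTok b :: toks) =
      if a = b then (a, toks) else raise SyntaxError (a ^ " expected")
  | $ a _ = raise SyntaxError (a ^ " expected")

(* empty consumes nothing and yields the empty list. *)
fun empty toks = ([], toks)

(* Alternation: try ph1; if it fails, try ph2 on the same input. *)
fun (ph1 || ph2) toks = ph1 toks handle SyntaxError _ => ph2 toks

(* Sequencing: run ph1, then ph2 on the rest, and pair the results. *)
fun (ph1 -- ph2) toks =
  let
    val (x, toks1) = ph1 toks
    val (y, toks2) = ph2 toks1
  in
    ((x, y), toks2)
  end

(* Apply f to the result of ph. *)
fun (ph >> f) toks =
  let val (x, toks1) = ph toks in (f x, toks1) end

(* Derived combinators. *)
fun repeat ph toks = (ph -- repeat ph >> (op ::) || empty) toks
fun ph1 $-- ph2 = ph1 -- ph2 >> (fn (_, y) => y)
fun ph1 --$ ph2 = ph1 -- ph2 >> (fn (x, _) => x)

val (xs, rest) = repeat ($"a") [KeyTok "a", KeyTok "a", KeyTok "b"]
```

Note that `repeat` takes `toks` as an explicit argument; without it the recursive call `repeat ph` would be evaluated eagerly and never terminate.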

Parsing precedences

infix 6 $-- --$

infix 5 --

infix 3 >>

infix 0 ||

Problems with combinatory parsers

Left-recursion:

Problem: Left-recursive grammars make parsers go into an infinite loop.

Remedy: Transform the grammar to eliminate left-recursion.

Mutual recursion:

Problem (SML-specific!): Cannot use val-declarations and combinator applications only.

Remedy: Use fun-declarations for mutually recursive parts of a grammar.

Data type for abstract syntax trees

type binop = string

datatype expAST =

EXP of termAST * (binop * termAST) list

and termAST =

TERM of factorAST * (binop * factorAST) list

and factorAST =

VAR of string

| CONST of int

| PARENEXP of expAST

Parser: example (first try)

val binop1 = $"+" || $"-"

val binop2 = $"*" || $"/"

val factor = Var >> VAR || Const >> CONST || $"(" $-- exp --$ $")" >> PARENEXP

val term = factor -- repeat (binop2 -- factor) >> TERM

val exp = term -- repeat (binop1 -- term) >> EXP

PROBLEM: Doesn’t work! These definitions are intended to be mutually recursive, but are not: factor refers to exp before exp has been declared.

Parser: example (second try)

val binop1 = $"+" || $"-"

val binop2 = $"*" || $"/"

fun factor toks = ( Var >> VAR || Const >> CONST || $"(" $-- exp --$ $")" >> PARENEXP ) toks
and term toks = (factor -- repeat (binop2 -- factor) >> TERM) toks
and exp toks = (term -- repeat (binop1 -- term) >> EXP) toks
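To try the second version end to end, the following sketch puts the pieces into one file. The token type, the exception `SyntaxError`, and the combinator bodies are assumptions made so that the example is self-contained; the abstract syntax tree type and the parser follow the slides. The grammar actually parsed is the left-recursion-free form implied by the tree type: an expression is a term followed by zero or more (operator, term) pairs.

```sml
datatype token = KeyTok of string | VarTok of string | ConstTok of int
exception SyntaxError of string

infix 6 $-- --$
infix 5 --
infix 3 >>
infix 0 ||

(* Assumed primitives and combinators with the types given on the slides. *)
fun Var (VarTok x :: toks) = (x, toks)
  | Var _ = raise SyntaxError "variable expected"
fun Const (ConstTok n :: toks) = (n, toks)
  | Const _ = raise SyntaxError "constant expected"
fun $ a (KeyTok b :: toks) =
      if a = b then (a, toks) else raise SyntaxError (a ^ " expected")
  | $ a _ = raise SyntaxError (a ^ " expected")
fun empty toks = ([], toks)
fun (ph1 || ph2) toks = ph1 toks handle SyntaxError _ => ph2 toks
fun (ph1 -- ph2) toks =
  let
    val (x, toks1) = ph1 toks
    val (y, toks2) = ph2 toks1
  in
    ((x, y), toks2)
  end
fun (ph >> f) toks = let val (x, toks1) = ph toks in (f x, toks1) end
fun repeat ph toks = (ph -- repeat ph >> (op ::) || empty) toks
fun ph1 $-- ph2 = ph1 -- ph2 >> (fn (_, y) => y)
fun ph1 --$ ph2 = ph1 -- ph2 >> (fn (x, _) => x)

(* Abstract syntax trees, as on the slides. *)
type binop = string
datatype expAST = EXP of termAST * (binop * termAST) list
and termAST = TERM of factorAST * (binop * factorAST) list
and factorAST = VAR of string | CONST of int | PARENEXP of expAST

val binop1 = $"+" || $"-"
val binop2 = $"*" || $"/"

(* fun ... and ... makes the three phrase functions mutually recursive. *)
fun factor toks =
      (   Var >> VAR
       || Const >> CONST
       || $"(" $-- exp --$ $")" >> PARENEXP ) toks
and term toks = (factor -- repeat (binop2 -- factor) >> TERM) toks
and exp toks = (term -- repeat (binop1 -- term) >> EXP) toks

(* Tokens for "x + y / 15 - x * x". *)
val (tree, rest) =
  exp [VarTok "x", KeyTok "+", VarTok "y", KeyTok "/", ConstTok 15,
       KeyTok "-", VarTok "x", KeyTok "*", VarTok "x"]
```

The extra `toks` parameter on each function delays evaluation of the right-hand side until tokens are supplied, which is what allows `factor` to mention `exp` before `exp` is fully defined.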

Operator precedence parsing (overview)

When processing operator expressions, a parser has to decide whether to reduce (stop the current phrase parser and return its result) or shift (continue the current phrase parser).

Operator precedence parsing: Associate a precedence (binding strength) with each operator, remember the precedence of the last operator processed, and determine whether to reduce or shift depending on the precedence of the next operator.

See Paulson, pp. 364-366
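The shift/reduce decision can be sketched with a technique often called precedence climbing. This is one way to realise the idea and is not claimed to be the formulation in Paulson; the token type, the tree type `exp`, and the names `prec`, `atom` and `parseExp` are assumptions for this example.

```sml
datatype token = KeyTok of string | VarTok of string | ConstTok of int
exception SyntaxError of string

(* Hypothetical tree type for this sketch; operators are kept as strings. *)
datatype exp = V of string | C of int | Op of string * exp * exp

(* Binding strength of each binary operator; higher binds tighter.
   Anything that is not an operator gets a negative precedence. *)
fun prec "+" = 1
  | prec "-" = 1
  | prec "*" = 2
  | prec "/" = 2
  | prec _ = ~1

(* Operands: variables, constants and parenthesised expressions. *)
fun atom (VarTok x :: toks) = (V x, toks)
  | atom (ConstTok n :: toks) = (C n, toks)
  | atom (KeyTok "(" :: toks) =
      (case parseExp 0 toks of
         (e, KeyTok ")" :: rest) => (e, rest)
       | _ => raise SyntaxError ") expected")
  | atom _ = raise SyntaxError "operand expected"

(* Parse an expression whose operators all have precedence >= minPrec. *)
and parseExp minPrec toks =
  let
    fun loop (left, KeyTok oper :: rest) =
          if prec oper >= minPrec then
            (* shift: the next operator binds tightly enough to continue *)
            let val (right, rest') = parseExp (prec oper + 1) rest
            in loop (Op (oper, left, right), rest') end
          else
            (* reduce: return what has been built so far *)
            (left, KeyTok oper :: rest)
      | loop result = result
  in
    loop (atom toks)
  end

(* Tokens for "x + y / 15 - x * x". *)
val (tree, rest) =
  parseExp 0 [VarTok "x", KeyTok "+", VarTok "y", KeyTok "/", ConstTok 15,
              KeyTok "-", VarTok "x", KeyTok "*", VarTok "x"]
```

Passing `prec oper + 1` for the right operand makes operators of equal precedence associate to the left.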

Backtracking parsing (overview)

There may be more than one way of parsing an expression.

Backtracking parsing: Construct a lazy list of all possible parses of a token stream. Continue parse with first of those and find a complete parse for the whole token stream; if that fails, backtrack to second in the list and repeat.

See Paulson, pp. 366-367
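A minimal sketch of the idea follows. To keep it short, characters stand in for tokens, and the lazy list type `seq` and the names `lit`, `alt`, `thn` and `firstComplete` are assumptions for this example rather than Paulson's definitions.

```sml
(* A lazy list: the tail is computed only on demand. *)
datatype 'a seq = Nil | Cons of 'a * (unit -> 'a seq)

fun single x = Cons (x, fn () => Nil)

(* Lazily append: the second sequence is built only when the first runs out. *)
fun append (Nil, rest) = rest ()
  | append (Cons (x, xf), rest) = Cons (x, fn () => append (xf (), rest))

(* Lazily apply f to every element and concatenate the results. *)
fun flatMap f Nil = Nil
  | flatMap f (Cons (x, xf)) = append (f x, fn () => flatMap f (xf ()))

(* A backtracking parser returns every (result, remaining input) pair. *)
fun lit s cs =
  let
    val n = size s
  in
    if List.length cs >= n andalso implode (List.take (cs, n)) = s
    then single (s, List.drop (cs, n))
    else Nil
  end

(* Alternatives: all parses of p, followed lazily by all parses of q. *)
fun alt (p, q) cs = append (p cs, fn () => q cs)

(* Sequencing: continue every parse of p with every parse of q. *)
fun thn (p, q) cs =
  flatMap
    (fn (x, cs1) => flatMap (fn (y, cs2) => single ((x, y), cs2)) (q cs1))
    (p cs)

(* Return the first parse that consumes the whole input, if any. *)
fun firstComplete Nil = NONE
  | firstComplete (Cons ((x, []), _)) = SOME x
  | firstComplete (Cons (_, xf)) = firstComplete (xf ())

(* A ::= "aa" | "a", and the phrase to parse is A followed by "a". *)
val a = alt (lit "aa", lit "a")
val phrase = thn (a, lit "a")
val result = firstComplete (phrase (explode "aa"))
```

On the input `aa` the first alternative of `a` consumes both characters, so the following `lit "a"` fails; the parser then backtracks to the second alternative and succeeds.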

Recursive-descent parsing (overview)

Write one parser for each grammatical category (as in combinatory parsing).

Process the token stream as in combinatory parsers, except for alternatives.

Process alternatives as follows: Look at the next token (the first token of the remaining token stream). Choose the phrase parser on the basis of that token.
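The steps above can be sketched for the expression grammar as follows. The token type, the tree type `ast`, and the helper names `more` and `step` are assumptions for this example. Each alternative is selected by pattern matching on the first remaining token, so no alternative is ever tried and abandoned.

```sml
datatype token = KeyTok of string | VarTok of string | ConstTok of int
exception SyntaxError of string

(* Hypothetical tree type for this sketch; operators are kept as strings. *)
datatype ast = V of string | C of int | Op of string * ast * ast

(* Factor: the next token decides between variable, constant and
   parenthesised expression. *)
fun factor (VarTok x :: toks) = (V x, toks)
  | factor (ConstTok n :: toks) = (C n, toks)
  | factor (KeyTok "(" :: toks) =
      (case exp toks of
         (e, KeyTok ")" :: rest) => (e, rest)
       | _ => raise SyntaxError ") expected")
  | factor _ = raise SyntaxError "factor expected"

(* Term: a factor followed by zero or more multiplicative operators,
   each with a further factor. *)
and term toks =
  let
    fun more (left, KeyTok "*" :: rest) = step ("*", left, rest)
      | more (left, KeyTok "/" :: rest) = step ("/", left, rest)
      | more result = result
    and step (oper, left, rest) =
      let val (right, rest') = factor rest
      in more (Op (oper, left, right), rest') end
  in
    more (factor toks)
  end

(* Exp: a term followed by zero or more additive operators,
   each with a further term. *)
and exp toks =
  let
    fun more (left, KeyTok "+" :: rest) = step ("+", left, rest)
      | more (left, KeyTok "-" :: rest) = step ("-", left, rest)
      | more result = result
    and step (oper, left, rest) =
      let val (right, rest') = term rest
      in more (Op (oper, left, right), rest') end
  in
    more (term toks)
  end

(* Tokens for "x + y / 15 - x * x". *)
val (tree, rest) =
  exp [VarTok "x", KeyTok "+", VarTok "y", KeyTok "/", ConstTok 15,
       KeyTok "-", VarTok "x", KeyTok "*", VarTok "x"]
```

Because the choice depends on one token only, this style works when the alternatives of each category start with distinct tokens, as they do in this grammar.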

LL-parsing and LR-parsing (overview)

Use tools to generate parsers from grammar specifications.

The tool produces a table that guides a push-down automaton through parsing actions (“shift”, “reduce”).

LL-parsing: Predictive (basically recursive descent parsing in table-driven form)

LR-parsing (incl. SLR- and LALR-parsing): (Virtual) parallel execution of phrase parsers.

Problems: Lookahead bounded in practice, at times unwieldy.
