Chapter 4 Syntax 1

Post on 19-Dec-2015



1

Chapter 4 Syntax


2

Lexical Analysis: converting a sequence of characters into a sequence of tokens

• Why split lexical analysis from parsing?
– Simplifies design
• Parsers that must also handle whitespace and comments are more awkward
– Efficiency
• Only use the most powerful technique that works, and nothing more
• No parsing sledgehammers for lexical nuts
– Portability
• More modular code
• More code re-use


3

Source Code Characteristics
• Code
– Identifiers
• count, max, get_num
– Language keywords: reserved or predefined (reserved cannot be re-defined by the user)
• switch, if..then..else, printf, return, void
– Mathematical operators
• +, *, >> …
• <=, =, != …
– Literals
• “Hello World”  2.6  -14e65
• Comments – ignored by the compiler
• Whitespace


4

Reserved words versus predefined identifiers

• Reserved words cannot be used as the name of anything in a definition (i.e., as an identifier).

• Predefined identifiers have special meanings, but can be redefined (although they probably shouldn’t be).

• Examples of predefined identifiers in Ruby: ARGV, STDERR, TRUE, FALSE, NIL.

• Examples of reserved words in Ruby: alias and BEGIN begin break case class def defined? do else elsif END end ensure false for if in module next nil not or redo rescue retry return self super then true undef unless until when while yield

• So TRUE has the value true, but can be redefined; true cannot be redefined.


5

Terminology of Lexical Analysis

Tokens: the category, often represented as an enumerated type, e.g. IDENTIFIER (value 20)

Patterns: the regular expression which defines the token

Lexemes: the actual string matched (captured)
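The token/pattern/lexeme distinction can be sketched in a few lines of Python using the standard `re` module. The token names and patterns below are illustrative, not from the text:

```python
import re

# Each token category is defined by a pattern (a regular expression).
TOKEN_PATTERNS = [
    ("NUMBER", r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("ASSIGN", r"="),
]

def first_token(text):
    """Return (token, lexeme) for the start of `text`."""
    for token, pattern in TOKEN_PATTERNS:
        m = re.match(pattern, text)
        if m:
            # the lexeme is the actual string the pattern matched
            return token, m.group(0)
    raise ValueError("no token matches")

# IDENTIFIER is the token; [A-Za-z_]\w* is the pattern; "count" is the lexeme.
print(first_token("count = 10"))   # ('IDENTIFIER', 'count')
```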


6

Knowing Tokens are not enough…

• Clearly, if we replaced every occurrence of a variable with a token, we would lose other valuable information (value, name)

• Other data items are attributes of the tokens


7

Token delimiters

• When does a token/lexeme end?

e.g. xtemp=-ytemp


8

Ambiguity in identifying tokens

• A programming language definition will state how to resolve uncertain token assignment

• <> Is it 1 or 2 tokens?

• Reserved keywords (e.g. if) take precedence over identifiers (the matching rules are the same for both)

• Disambiguating rules state what to do

• ‘Principle of longest substring’: greedy match (xtemp == 64.23)


9

Lexical Analysis
• What is hard?
– How do I describe the rules?
– How do I find the tokens?
– What do I do with ambiguities: is ++ one symbol or two? We said two, but that depends on my language, right?
– Does it take lots of lookahead/backtracking? Ex: no spaces
• whileamerica>china
• whileamerica=china
When do I know how to interpret the statement?


10

Wouldn’t it be nice?
• Wouldn’t it be nice if we could specify how to do lexical analysis by saying:
• Here are the regular expressions which are associated with these token names. Please tell me the lexeme and token.
And by the way…
• Always match the longest string
• Among rules that match the same number of characters, use the one which I listed first.
This is what a program like lex/flex does.
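A minimal sketch of those two rules (longest match, then first-listed rule on ties) in Python; the rule set is illustrative, not a real lex specification:

```python
import re

# Rules in priority order: the keyword "if" is listed before the general
# identifier rule, exactly as one would in lex/flex.
RULES = [
    ("IF",         r"if"),
    ("IDENTIFIER", r"[a-z]+"),
    ("NUMBER",     r"[0-9]+"),
    ("EQ",         r"=="),
    ("ASSIGN",     r"="),
    ("SKIP",       r"\s+"),
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        # Keep the longest match; among equal lengths, the first-listed rule wins
        # (strict > below preserves the earlier rule on ties).
        best = None
        for name, pattern in RULES:
            m = re.match(pattern, text[pos:])
            if m and (best is None or len(m.group(0)) > len(best[1])):
                best = (name, m.group(0))
        if best is None:
            raise ValueError(f"illegal character {text[pos]!r}")
        name, lexeme = best
        if name != "SKIP":
            tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens

# "iffy" is one IDENTIFIER (longest match), not "if" + "fy";
# "==" is one EQ token, not two ASSIGNs; "if" alone is the keyword.
print(tokenize("iffy == 42"))
```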


11

SKIP :
{
  " "
| "\r"
| "\t"
}
TOKEN :
{
  < EOL: "\n" >
}
TOKEN : /* OPERATORS */
{
  < PLUS: "+" >
| < MINUS: "-" >
| < MULTIPLY: "*" >
| < DIVIDE: "/" >
}
TOKEN :
{
  < FLOAT: ( (<DIGIT>)+ "." ( <DIGIT> )* ) | ( "." ( <DIGIT> )+ ) >
| < INTEGER: (<DIGIT>)+ >
| < #DIGIT: ["0" - "9"] >
}
TOKEN :
{
  < TYPE: ("float"|"int") >
| < IDENTIFIER: ( (["a"-"z"]) | (["A"-"Z"]) ) ( (["a"-"z"]) | (["A"-"Z"]) | (["0"-"9"]) | "_" )* >
}

In JavaCC (a parser generator for Java), you specify your tokens using regular expressions: give a name and the regular expression. The system gives you a list of tokens/lexemes.


12

• Make a choice about what you intend to do

Attendance is required.Programs are required.


13

Matching Regular Expressions
• Important for lexical analysis
• Want to have immediate results
• Want to avoid backtracking

Match .*baba!
• Suppose the input is: animals in bandanas are barely babaababa!


14

Try it – write a recognizer for .*baba! or a quoted string (as you don’t know which comes next)
• We want a “graph” of nodes and edges
• Nodes represent “states” such as “I have read in ba”
• Arcs represent input which causes you to go from one state to the next
• Nodes denoted as “final states” or “accept states” mean the desired token is recognized
• Our states don’t really need to have names, but we might assign a name to them.


15

Deterministic Finite Automata (DFA)

• A recognizer determines if an input string is a sentence in a language
• Uses a regular expression
• Turn the regular expression into a finite automaton
• Could be deterministic or non-deterministic


16

Transition diagram for identifiers

Notice the accept state means you have matched
• RE – Identifier -> letter (letter | digit)*

[Diagram: start state 0; on a letter, go to state 1; state 1 loops on letter or digit; any other character goes to state 2, the accept state.]
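The diagram can be encoded directly in a few lines of Python. This sketch simplifies the "other" transition into whole-string recognition (accept if we end in state 1 having consumed everything), rather than stopping early as a real scanner would:

```python
# State 0 is the start; state 1 means "seen letter (letter|digit)*".
def is_identifier(s):
    state = 0
    for ch in s:
        if state == 0 and ch.isalpha():
            state = 1
        elif state == 1 and (ch.isalpha() or ch.isdigit()):
            state = 1
        else:
            return False          # the "other" transition: reject here
    return state == 1             # accept only if we saw letter (letter|digit)*

print(is_identifier("max2"))   # True
print(is_identifier("2max"))   # False
```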


17

Visit with your neighbor

• Write a finite state machine which reads in zeroes and ones and accepts the string if there are an odd number of zeroes.

• If you can build such a machine, we say it “recognizes” the language of strings of 0/1 such that there are an odd number of zeroes.


18

Note it does the minimal work: it doesn’t count, it just keeps track of odd and even.

[Diagram: two states, even (start) and odd (accept); reading a 0 switches between them, reading a 1 stays in the same state.]
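The two-state machine above, sketched in Python; the state names are taken straight from the diagram:

```python
# Two states suffice: we track only the parity of zeroes seen, never the count.
def odd_zeroes(s):
    state = "even"                                   # start state
    for ch in s:
        if ch == "0":
            state = "odd" if state == "even" else "even"
        # reading a "1" stays in the same state
    return state == "odd"                            # "odd" is the accept state

print(odd_zeroes("0"))     # True
print(odd_zeroes("00"))    # False
```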


19

Visit with your neighbor
• Write a finite state machine which reads in zeroes and ones and accepts the string if there are an odd number of zeroes
• Write a finite state machine which accepts a string with the same number of zeroes and ones
• Note the FS part of “FSA” is a finite number of states.


20

What does this match? How does this differ from the other kinds of FSA we have seen?


21

Thought question

• Does a non-deterministic FSA have more power than a deterministic FSA? In other words, are there languages (sets of strings) that an NFSA can “recognize” (say yes or no to) that a FSA can’t make a decision about?


22

• An NFA (nondeterministic finite automaton) is similar to a DFA, but it also permits multiple transitions over the same character and transitions over ε (the empty string). An ε-transition doesn’t consume any input characters, so you may jump to another state for free.

• When we are at this state and we read this character, we have more than one choice; the NFA succeeds if at least one of these choices succeeds.

• Clearly DFAs are a subset of NFAs. But it turns out that DFAs and NFAs have the same expressive power.


Automating Scanner (lexical analyzer) Construction

To convert a specification into code:
1. Write down the RE for the input language
2. Build a big NFA
3. Build the DFA that simulates the NFA
4. Systematically shrink the DFA
5. Turn it into code

Scanner generators
• Lex and Flex work along these lines
• The algorithms are well-known and well-understood


24

From a Regular Expression to an NFA

Thompson’s Construction: (a | b)* abb

[NFA diagram: start state 0, accept state 10; ε-edges (an ε match consumes no input) wire together the sub-NFAs for (a | b)*, then a, then b, then b.]


RE NFA using Thompson’s Construction

Key idea
• NFA pattern for each symbol & each operator
• Join them with ε moves in precedence order

[Diagrams: NFA for a (S0 –a→ S1); NFA for ab (the a-NFA joined by an ε-move to the b-NFA); NFA for a | b (a new start state with ε-moves into each branch, and ε-moves out of each branch to a new accept state); NFA for a* (new start and accept states with ε-moves allowing zero or more passes through the a-NFA).]

Ken Thompson, CACM, 1968
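A compact sketch of the construction in Python, assuming the usual fragment shapes (each operator yields a fragment with one start and one accept state; `None` labels an ε-edge). The state numbering is arbitrary, not the slides':

```python
# Global transition table: edges[state] is a list of (label, target);
# label None is an epsilon-move.
edges = []

def new_state():
    edges.append([])
    return len(edges) - 1

def symbol(c):                      # NFA for a single character
    s, t = new_state(), new_state()
    edges[s].append((c, t))
    return s, t

def concat(f, g):                   # NFA for fg: epsilon from f's accept to g's start
    edges[f[1]].append((None, g[0]))
    return f[0], g[1]

def alt(f, g):                      # NFA for f|g: new start/accept, epsilon fan-in/out
    s, t = new_state(), new_state()
    edges[s] += [(None, f[0]), (None, g[0])]
    edges[f[1]].append((None, t))
    edges[g[1]].append((None, t))
    return s, t

def star(f):                        # NFA for f*: loop back, or skip entirely
    s, t = new_state(), new_state()
    edges[s] += [(None, f[0]), (None, t)]
    edges[f[1]] += [(None, f[0]), (None, t)]
    return s, t

def eps_closure(states):
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for label, r in edges[q]:
            if label is None and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def matches(nfa, text):             # simulate the NFA on the whole input
    start, accept = nfa
    current = eps_closure({start})
    for ch in text:
        current = eps_closure({r for q in current
                               for label, r in edges[q] if label == ch})
    return accept in current

# (a|b)*abb, built bottom-up as in the slides
ab = alt(symbol("a"), symbol("b"))
nfa = concat(concat(concat(star(ab), symbol("a")), symbol("b")), symbol("b"))
print(matches(nfa, "aababb"))   # True
print(matches(nfa, "abab"))     # False
```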


Example of Thompson’s Construction. Let’s try a ( b | c )*

1. a, b, & c
2. b | c
3. ( b | c )*

[Diagrams of the NFA fragments built at each step.]


Example of Thompson’s Construction (con’t)

4. a ( b | c )*

5. Of course, a human would design something simpler: S0 –a→ S1 with a self-loop on b | c. But we can automate production of the more complex one.

[Diagram of the full automaton produced by the construction.]


28

[NFA diagram: state 0 loops on a and b, with 0 –a→ 1 –b→ 2 –b→ 3 (accept). Below it, the equivalent DFA with states 0, 01, 02, 03, where 03 is the accept state.]

Non-deterministic finite state automaton (NFA) and equivalent deterministic finite state automaton (DFA).

How would you use a NFA?


29

NFA -> DFA (subset construction)

• We can convert from an NFA to a DFA using subset construction.
• To perform this operation, let us define two functions:
• The ε-closure function takes a state and returns the set of states reachable from it based on (one or more) ε-transitions. Note that this will always include the state itself. We should be able to get from a state to any state in its ε-closure without consuming any input.
• The function move takes a state and a character, and returns the set of states reachable by one transition on this character.
• We can generalize both these functions to apply to sets of states by taking the union of the application to individual states.
• E.g. if A, B and C are states, move({A,B,C},‘a’) = move(A,‘a’) ∪ move(B,‘a’) ∪ move(C,‘a’).


30

NFA -> DFA (cont)

• The Subset Construction Algorithm
1. Create the start state of the DFA by taking the ε-closure of the start state of the NFA.
2. Perform the following for the new DFA state. For each possible input symbol:
– Apply move to the newly-created state and the input symbol; this will return a set of states.
– Apply the ε-closure to this set of states, possibly resulting in a new set.
3. This set of NFA states will be a single state in the DFA.
4. Each time we generate a new DFA state, we must apply step 2 to it. The process is complete when applying step 2 does not yield any new states.
5. The finish states of the DFA are those which contain any of the finish states of the NFA.
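The algorithm above can be sketched in Python on the slides' NFA for (a|b)*abb. The dictionary encodings are illustrative; this particular NFA happens to have no ε-transitions, but the ε-closure step is included for generality:

```python
# NFA for (a|b)*abb: state 0 loops on a and b; 0 -a-> 1 -b-> 2 -b-> 3 (accept).
nfa = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
eps = {}   # epsilon-transitions, empty for this NFA

def eps_closure(states):
    # states reachable through epsilon-moves alone; always includes the state itself
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in eps.get(q, ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return frozenset(seen)

def move(states, ch):
    # one transition on ch, generalized to a set of states by taking the union
    return {r for q in states for r in nfa.get((q, ch), ())}

def subset_construction(start, alphabet):
    start_set = eps_closure({start})               # step 1
    table, seen, worklist = {}, {start_set}, [start_set]
    while worklist:                                # step 4: repeat for each new state
        S = worklist.pop()
        for ch in alphabet:                        # step 2: move, then epsilon-close
            T = eps_closure(move(S, ch))
            table[(S, ch)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    return start_set, table

start_set, table = subset_construction(0, "ab")

def accepts(text):
    S = start_set
    for ch in text:
        S = table[(S, ch)]
    return 3 in S   # step 5: final iff it contains an NFA finish state

print(accepts("aababb"))   # True
print(accepts("abab"))     # False
```

Running it discovers exactly the four DFA states {0}, {0,1}, {0,2}, {0,3} shown on the next slide.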


31

[The same NFA for (a|b)*abb and its equivalent DFA with states 0, 01, 02, 03 as shown earlier.]

Use the algorithm to transform.


32

Transition Table (DFA)

State | a  | b
0     | 01 | 0
01    | 01 | 02
02    | 01 | 03
03    | 01 | 0
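Driving a scanner from this table is a dictionary lookup per input character. A minimal sketch, with the state names taken from the table:

```python
# The DFA transition table, encoded directly.
table = {
    ("0",  "a"): "01", ("0",  "b"): "0",
    ("01", "a"): "01", ("01", "b"): "02",
    ("02", "a"): "01", ("02", "b"): "03",
    ("03", "a"): "01", ("03", "b"): "0",
}

def accepts(text, start="0", final={"03"}):
    state = start
    for ch in text:
        state = table[(state, ch)]   # one lookup per character, no backtracking
    return state in final

print(accepts("aababb"))   # True: the input ends in abb
print(accepts("abba"))     # False
```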


33

Writing a lexical analyzer

• The DFA helps us to write the scanner so that we recognize a symbol as early as possible without much lookahead.

• Figure 4.1 in your text gives a good example of what a scanner might look like.


34

LEX (FLEX)

• Tool for generating programs which recognize lexical patterns in text

• Takes regular expressions and turns them into a program


35

Lexical Errors

• Only a small percentage of errors can be recognized during Lexical Analysis

Consider if (good == !“bad”)


36

Which of the following errors could be found during lexical analysis?

– Line ends inside literal string
– Illegal character in input file
– Calling an undefined method
– Missing operator
– Missing paren
– Unquoted string
– Using an unopened file


37

Limitations of REs
• REs can describe many languages, but not all
• In regular-expression lingo, a language is the set of strings which are acceptable; the alphabet is the set of tokens of the language.
• For example: alphabet = {a,b}; my language is the set of strings consisting of a single a surrounded by an equal number of b’s

S = {a, bab, bbabb, bbbabbb, …}

Can you represent the set of legal strings using a RE?


38

With your neighbor try

• With REs, can you match nested paired tags in HTML? Can you write the regular expression to match the language consisting of all nested tags (including user-defined tags)?

• <div id=“menu”> HI </div>

• <div id=“menu”><li><a href=“s.php">Officers</a></li> </div>


39

In general

• What does a lexical error mean?

• Strategies for dealing with errors:
– “Panic mode”: delete chars from input until something matches
– Inserting characters
– Re-ordering characters
– Replacing characters

• For an error like “illegal character” we should report it sensibly


40

Syntax Analysis

• Also known as parsing

• Grouping together tokens into larger structures

• Analogous to lexical analysis

• Input: tokens (output of the lexical analyzer)

• Output: a structured representation of the original program


41

A Context Free Grammar

• A grammar is a four-tuple (Σ, N, P, S) where

• Σ is the terminal alphabet
• N is the non-terminal alphabet
• P is the set of productions
• S is a designated start symbol in N


42

Parsing
• Need to express a series of added operands
• Basic need: a way to communicate what is allowed in a way that is easy to translate.
• Expression → number plus Expression | number
– Similar to regular definitions:
• Concatenation
• Choice
• No * – repetition by recursion


43

BNF Grammar – the way Algol was defined (and other languages since)

Meta-symbols: → and |

Expression → number Operator number
Operator → + | - | * | /

The structure on the left is defined to consist of the choices on the right-hand side.

Different conventions for writing BNF grammars:

<expression> ::= number <operator> number


44

Derivations
• Derivation:
– A sequence of replacements of structure names by choices on the RHS of grammar rules
– Begin: start symbol
– End: string of token symbols
– Each step, one replacement is made

Exp → Exp Op Exp | number
Op → + | - | * | /


45

Example Derivation

Note the different arrows:
⇒ Derivation: applies grammar rules
→ Used to define grammar rules

Non-terminals: Exp, Op. Terminals: number, *
Terminals: because they terminate the derivation

How do we get: 4*5*21?
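One possible leftmost derivation with the grammar above, grouping the left product first (the numerals stand for number tokens with those values):

```latex
\begin{align*}
Exp &\Rightarrow Exp\ Op\ Exp\\
    &\Rightarrow Exp\ Op\ Exp\ Op\ Exp\\
    &\Rightarrow 4\ Op\ Exp\ Op\ Exp\\
    &\Rightarrow 4 * Exp\ Op\ Exp\\
    &\Rightarrow 4 * 5\ Op\ Exp\\
    &\Rightarrow 4 * 5 * Exp\\
    &\Rightarrow 4 * 5 * 21
\end{align*}
```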


46

• E → ( E ) | a

• What sentences does this grammar generate?

An example derivation:
• E ⇒ ( E ) ⇒ ((E)) ⇒ ((a))
• Note that this is what we couldn’t achieve with regular definitions


47

Recursive Grammars

– At seats, try using a grammar to generate a^n b^n

– At seats, try using a grammar to generate ab^n

– At seats, try using a grammar to generate palindromes of Σ = {a,b}

abbaaaabba  aba  aabaa


48

• E → E α | β
– derives β, βα, βαα, βααα, …
– All strings beginning with β followed by zero or more repetitions of α
• βα*


49

Write a context free grammar for

• L = {a^m b^n c^k}, where k = m + n


50

Write a CFG for

• (a(ab)*b)*


51

Abstract Syntax Trees

[Figure: the parse tree for the token sequence 3 + 4 has interior nodes exp, op, exp with number leaves; the abstract syntax tree is simply + with children 3 and 4.]

The AST is all the information we actually need; parse trees contain surplus information.


52

An exercise
• Consider the grammar

S -> (L) | a
L -> L,S | S

(a) What are the terminals, non-terminals and start symbol?
(b) Find leftmost and rightmost derivations and parse trees for the following sentences:
i. (a,a)
ii. (a, (a,a))
iii. (a, ((a,a), (a,a)))


53

Parsing token sequence: id + id * id

E → E + E | E * E | ( E ) | - E | id

How many ways can you find a tree which matches the expression id * id + id ?


54

We want the structure of the parse tree to help us in evaluating

• We build the parse tree
• As we traverse the parse tree, we generate the corresponding code.


55

Ambiguity

• If the same sentence has two distinct parse trees, the grammar is ambiguous

• In English, the phrase “small dogs and cats” is ambiguous, as we aren’t sure if the cats are small or not.

• “I have a picture of Joe at my house” is also ambiguous

• A language is said to be ambiguous if no unambiguous grammar exists for it.


56

Ambiguous Grammars
• Problem – no unique structure is expressed
• A grammar that generates a string with 2 distinct parse trees is called an ambiguous grammar
– 2+3*4 = 2 + (3*4) = 14
– 2+3*4 = (2+3) * 4 = 20
• How does the grammar relate to meaning?
• Our experience of math says interpretation 1 is correct, but the grammar does not express this:
– E → E + E | E * E | ( E ) | - E | id
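The two interpretations correspond to two different trees for the same token sequence. A sketch in Python, representing each tree as nested (operator, left, right) tuples:

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

def evaluate(tree):
    # A leaf is a number; an interior node is (op, left, right).
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left), evaluate(right))

tree1 = ("+", 2, ("*", 3, 4))     # * grouped tighter: 2 + (3*4)
tree2 = ("*", ("+", 2, 3), 4)     # + grouped tighter: (2+3) * 4

print(evaluate(tree1))   # 14
print(evaluate(tree2))   # 20
```

Same tokens, different trees, different answers: that is exactly why ambiguity matters for meaning.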


57

Ambiguity is bad!!!!
• Goal – rewrite the grammar so you can never get two trees for the same input.
• If you can do that, PROBLEM SOLVED!!!
• We won’t spend time studying that, but realize that is usually the goal.


58

Show that this grammar is ambiguous.

• Consider the grammar
• S → aS | aSbS | ε

How do you do that?
• Think of ONE SENTENCE that you think is problematic.
• Show TWO different parse trees for the sentence.


59

Is this grammar ambiguous?

• S → A1B
• A → 0A | ε
• B → 0B | 1B | ε


60

Precedence

E → E + Term | E - Term | Term
Term → Term * Factor | Term / Factor | Factor
Factor → ( Exp ) | number | id
id → a | b | c | d

• Operators of equal precedence are grouped together at the same ‘level’ of the grammar: a ‘precedence cascade’
• The lowest-level operators have the highest precedence
(The first shall be last and the last shall be first.)


61

Precedence

• Using the previous grammar, show a parse tree for
– a+b*c (notice how precedence matters)
– a*b+c (notice how precedence matters)
– a-b-c (notice how associativity matters)


62

Associativity
• 45-10-5? 30 or 40? Subtraction is left associative, left to right (= 30)

• E → E + E | E - E | Term
Does not tell us how to split up 45-10-5 (ambiguity)

• E → E + Term | E – Term | Term
Forces left associativity via left recursion

• E → Term + E | Term – E | Term
Forces right associativity via right recursion

• Precedence & associativity remove the ambiguity of arithmetic expressions
– Which is what our math teachers spent years telling us!


63

Review: Precedence and Associativity

In the grammar below, where would you put unary negation and exponentiation? Precedence? Associativity?

E → E + Term | E - Term | Term
Term → Term * Factor | Term / Factor | Factor
Factor → ( Exp ) | number | id


64

Try parsing -2^2^3 * -(a+b)

Is this right?

E → E + Term | E - Term | Term
Term → Term * Factor | Term / Factor | Factor
Factor → -Factor | Item
Item → Part ^ Item | Part
Part → ( Exp ) | number | id


65

Extended BNF Notation
• Notation for repetition and optional features.
• {item} expresses repetition:
expr → expr + term | term   becomes   expr → term { + term }
• [item] expresses optional features:
if-stmt → if( expr ) stmt | if( expr ) stmt else stmt
becomes
if-stmt → if( expr ) stmt [ else stmt ]


66

Syntax (railroad) Diagrams
• An alternative to EBNF.
• Rarely seen any more: EBNF is much more compact.
• Example (if-statement, p. 101):

[Railroad diagram: if ( expression ) statement, with an optional else statement branch.]


68

Formal Methods of Describing Syntax

In the 1950s, Noam Chomsky (linguist) described generative devices which describe four classes of languages (in order of decreasing power):

1. Recursively enumerable: x → y, where x and y can be any string of nonterminals and terminals.
2. Context-sensitive: x → y, where x and y can be strings of terminals and non-terminals, but y ≥ x in length.
– Can recognize a^n b^n c^n
3. Context-free: nonterminals appear singly on the left side of productions. Any nonterminal can be replaced by its right-hand side regardless of the context it appears in.
– Ex: a condition in a while statement is treated the same as a condition in an assignment or in an if; this is context-free
4. Regular (only allows productions of the form N → aT or N → a)
– Can recognize a^n b^m

Chomsky was interested in the theoretic nature of natural languages.


69

Regular Grammar (like a regular expression)

• A CFG is called regular if every production has one of the forms below:

A → aB
A → ε


70

Rewrite the following regular expressions as regular grammars:

• a* b
• a+b*


71

Context Sensitive

• Allows the left-hand side to be more than just a single non-terminal. It allows for “context”.

• Context-sensitive means context dependent: rules have the form xYz -> xuz, with Y being a non-terminal and x, u, z being terminals or non-terminals.


72

Consider the following context-sensitive grammar G = (S, B, C, x, y, z, P, S), where the productions P are:
1. S → xSBC
2. S → xyC
3. CB → BC
4. yB → yy
5. yC → yz
6. zC → zz

At seats, what strings does this generate?


73

How is Parsing done?

1. Recursive descent (top down).
2. Bottom up – tries to match input with the right-hand side of a rule. Sometimes called shift-reduce parsers.


74

Predictive Parsing
• Which rule to use?
• I need to generate a symbol; which rule?
• Top-down parsing
• LL(1) parsing
• Table-driven predictive parsing (no recursion) versus recursive-descent parsing, where each nonterminal is associated with a procedure call
• No backtracking

E -> E + T | T
T -> T * F | F
F -> (E) | id


75

Two grammar problems
1. Eliminating left recursion (without changing associativity)
E -> E+a | a
I can’t tell which rule to use, E -> E+a or E -> a, as both generate the same first symbol.

2. Removing two rules with the same prefix
E -> aB | aC
I can’t tell which rule to use, as both generate the same symbol.


76

Removing Left Recursion

Before:
• A --> A x
• A --> y

After:
• A --> yB
• B --> x B
• B --> ε
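To see why the transformed grammar is the one a top-down parser can use, here is a tiny recursive-descent sketch of it, taking x and y as terminal characters (the left-recursive original, coded the same way, would call parse_A before consuming any input and recurse forever):

```python
# Transformed grammar:  A -> y B      B -> x B | epsilon
def parse_A(s, i=0):
    if i < len(s) and s[i] == "y":
        return parse_B(s, i + 1)
    return None                    # A must start with y

def parse_B(s, i):
    if i < len(s) and s[i] == "x":
        return parse_B(s, i + 1)   # B -> x B
    return i                       # B -> epsilon: succeed, consume nothing

def accepts(s):
    # accept iff A derives exactly s, i.e. s is in y x*
    return parse_A(s) == len(s)

print(accepts("yxxx"))   # True
print(accepts("xy"))     # False
```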


77

Removing common prefix (left factoring)

Stmt -> if Exp then Stmt else Stmt | if Exp then Stmt

Change so you don’t have to pick a rule until later.

Before:
A -> αβ1 | αβ2

After:
A -> αA’
A’ -> β1 | β2


78

Exercises

Eliminate left recursion from the following grammars.

a) S -> (L) | a
   L -> L,S | S

b) E -> E or T | T    # or is part of a boolean expression
   T -> T and G | G
   G -> not G | (E) | true | false


79

Table Driven Predictive Parsing

[Diagram: a predictive parsing program reads the input (e.g. a + b $, or id + id * id) and consults the parsing table; a stack holding X, Y, Z, $ records the partial derivation not yet matched; the output is the derivation.]


80

The idea behind Parsing
• You start out with your derivation and “someone smart” tells you which production you should use.
• You have a non-terminal E and you want to match an “e”. If only one rule from E produces something that starts with an “e”, you know which one to use.


81

Table Driven Predictive Parsing

Parse id + id * id

Leftmost derivation and parse tree using the grammar

E -> TP
P -> +TP | e
T -> FU
U -> *FU | e
F -> (E) | id
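Since each nonterminal becomes one procedure, the grammar translates almost mechanically into a recursive-descent recognizer. A sketch in Python (the helper names are illustrative; a real parser would also build the tree):

```python
# Recursive descent for:
#   E -> T P      P -> + T P | e      T -> F U
#   U -> * F U | e                    F -> ( E ) | id
def parse(tokens):
    pos = [0]
    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else "$"
    def eat(t):
        assert peek() == t, f"expected {t}, got {peek()}"
        pos[0] += 1
    def E():
        T(); P()
    def P():
        if peek() == "+":
            eat("+"); T(); P()        # P -> + T P
        # else P -> e: consume nothing
    def T():
        F(); U()
    def U():
        if peek() == "*":
            eat("*"); F(); U()        # U -> * F U
    def F():
        if peek() == "(":
            eat("("); E(); eat(")")   # F -> ( E )
        else:
            eat("id")                 # F -> id
    E()
    return peek() == "$"              # success iff all input consumed

print(parse(["id", "+", "id", "*", "id"]))   # True
```

One token of lookahead (`peek`) is always enough to pick the rule: that is what makes the grammar LL(1).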


82

Table Driven Predictive Parsing

NonTerminal | id    | +      | *      | (      | )    | $
E           | E->TP |        |        | E->TP  |      |
P           |       | P->+TP |        |        | P->e | P->e
T           | T->FU |        |        | T->FU  |      |
U           |       | U->e   | U->*FU |        | U->e | U->e
F           | F->id |        |        | F->(E) |      |

Use the parse table to derive: a*(a+a)*a
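The table-driven parser itself is a short loop over a stack: pop a terminal and match it, or pop a nonterminal and push the table's right-hand side. A sketch in Python, with the table entries written out as lists ("e" rows become empty lists) and the a's of the exercise read as id tokens:

```python
# Predictive parsing table for E -> TP, P -> +TP | e, T -> FU,
# U -> *FU | e, F -> (E) | id.  Missing entries are errors.
table = {
    ("E", "id"): ["T", "P"],      ("E", "("): ["T", "P"],
    ("P", "+"):  ["+", "T", "P"], ("P", ")"): [], ("P", "$"): [],
    ("T", "id"): ["F", "U"],      ("T", "("): ["F", "U"],
    ("U", "+"):  [], ("U", "*"): ["*", "F", "U"],
    ("U", ")"):  [], ("U", "$"): [],
    ("F", "id"): ["id"],          ("F", "("): ["(", "E", ")"],
}

def parse(tokens):
    stack = ["$", "E"]
    tokens = tokens + ["$"]
    i = 0
    while stack:
        top = stack.pop()
        if top == tokens[i]:                 # terminal on top: match and advance
            i += 1
        elif (top, tokens[i]) in table:      # nonterminal: expand via the table
            stack.extend(reversed(table[(top, tokens[i])]))
        else:
            return False                     # no table entry: syntax error
    return i == len(tokens)

# a*(a+a)*a with each a read as an id token
print(parse(["id", "*", "(", "id", "+", "id", ")", "*", "id"]))   # True
```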