28-jun-15 recognizers. 2 parsers and recognizers given a grammar (say, in bnf) and a string, a...

Apr 18, 2023

Recognizers

2

Parsers and recognizers Given a grammar (say, in BNF) and a string,

A recognizer will tell whether the string belongs to the language defined by the grammar

A parser will try to build a binary tree corresponding to the string, according to the rules of the grammar

Input string Recognizer result Parser result

2 + 3 * 4 true

2 + 3 * false Error

3

Building a recognizer

One way of building a recognizer from a grammar is called recursive descent

Recursive descent is pretty easy to implement, once you figure out the basic ideas Recursive descent is a great way to build a “quick and dirty”

recognizer or parser Production-quality parsers use much more sophisticated and

efficient techniques In the following slides, I’ll talk about how to do

recursive descent, and give some examples in Java

4

Recognizing simple alternatives, I

Consider the following BNF rule: <add_operator> ::= “+” | “-” That is, an add operator is a plus sign or a minus sign

To recognize an add operator, we need to get the next token, and test whether it is one of these characters

If it is a plus or a minus, we simply return true But what if it isn’t?

We not only need to return false, but we also need to put the token back because it doesn’t belong to us, and some other grammar rule probably wants it

Our tokenizer needs to be extended to take back characters We will make do with putting back only one token at a time

5

Recognizing simple alternatives, II

Our rule is <add_operator> ::= “+” | “-” Our method for recognizing an <add_operator>

(which we will simply call addOperator) looks like this: public boolean addOperator() {

Get the next token, call it t If t is a “+”, return true If t is a “-”, return true If t is anything else, put the token back return false}

6

Helper methods

We could turn the preceding pseudocode directly into Java code But we will be writing a lot of very similar code... ...and it won’t be very readable code We should write some auxiliary or “helper” methods to hide

some of the details for us First helper method:

private boolean symbol(String expectedSymbol) Gets the next token and tests whether it matches the

expectedSymbol If it matches, return true If it doesn’t match, put the symbol back and return false

We’ll look more closely at this method in a moment

7

Recognizing simple alternatives, III

Our rule is <add_operator> ::= “+” | “-” Our pseudocode is:

public boolean addOperator() { Get the next token, call it t If t is a “+”, return true If t is a “-”, return true If t is anything else, put the token back return false}

Thanks to our helper method, our actual Java code is: public boolean addOperator() {

return symbol("+") || symbol("-");}

8

Implementing symbol

symbol gets a token, makes sure it’s a symbol, compares it to the desired value, possibly puts the token back, and returns true or false We will want to do something similar for numbers, names,

end of lines, and maybe end of input It would be foolish to write and debug all of these separately Again, we should use an auxiliary method

private boolean symbol(String expectedSymbol) { return nextTokenMatches(Token.SYMBOL, expectedSymbol);}

9

nextTokenMatches #1 The nextTokenMatches method should:

Get a token Compare types and values Return true if the token is as expected Put the token back and return false if it doesn’t match

private boolean nextTokenMatches(int type, String value) { Token t = tokenizer.next(); if (type == t.getType() && value.equals(t.getValue())) return true; else tokenizer.pushBack(t); return false;}

Wait a minute—pushBack? This is a method that you will have to add to your Tokenizer class Real code changes!

10

nextTokenMatches #2

The previous method is fine for symbols, but what if we only care about the type?

For example, we want to get a number—any number We need to compare only type, not value private boolean nextTokenMatches(int type, String value) {

Token t = tokenizer.next(); omit this parameter if (type == t.getType() && value.equals(t.getValue())) return true; else tokenizer.pushBack(t); omit this test return false;}

The two versions of nextTokenMatches are difficult to combine and fairly small, so we won’t worry about the code duplication too much

11

addOperator reprise public boolean addOperator() {

return symbol("+") || symbol("-");}

private boolean symbol(String expectedSymbol) { return nextTokenMatches(Token.SYMBOL, expectedSymbol);}

private boolean nextTokenMatches(int type, String value) { Token t = tokenizer.next(); if (type == t.getType() && value.equals(t.getValue())) return true; else tokenizer.pushBack(t); return false;}

12

Sequences, I

Suppose we want to recognize a grammar rule in which one thing follows another, for example, <empty_list> ::= “[” “]”

The code for this would be fairly simple... public boolean emptyList() {

return symbol("[") && symbol("]");}

...except for one thing... What happens if we get a “[” and don’t get a “]”? The above method won’t work—why not?

Only the second call to symbol failed, and only one token gets pushed back

13

Sequences, II

The grammar rule is <empty_list> ::= “[” “]” And the token string contains [ 5 ]

Solution #1: Write a pushBack method that can keep track of more than one token at a time (say, in a Vector)

This will allow you to put the back both the “[” and the “5” The code gets pretty messy You have to be very careful of the order in which you return tokens

Solution #2: Call it an error You might be able to get away with this, depending on the grammar For example, for any reasonable grammar, (2 + 3 +) is clearly an error

Solution #3: Change the grammar Tricky, and may not be possible

Solution #4: Combine rules See the next slide

14

Sequences, III Suppose the grammar really says

<list> ::= “[” “]” | “[” <number> “]”

Now your pseudocode should look something like this: public boolean list() {

if first token is “[” { if second token is “]” return true else if second token is a number { if third token is “]” return true else error } else put back first token}

Revised grammar: <list> ::= “[” <rest_of_list>

<rest_of_list> ::= “]” | <number> “]”

15

Simple sequences in Java

Suppose you have this rule: <factor> ::= “(” <expression> “)”

A good way to do this is often to test whether the grammar rule is not met

public boolean factor() { if (symbol("(")) { if (!expression()) error("Error in parenthesized expression"); if (!symbol(")")) error("Unclosed parenthetical expression"); return true; } return false;}

16

Sequences and alternatives Here’s the real grammar rule for <factor>: <factor> ::= <name>

| <number> | “(” <expression> “)”

And here’s the actual code: public boolean factor() {

if (name()) return true; if (number()) return true; if (symbol("(")) { if (!expression()) error("Error in parenthesized expression"); if (!symbol(")")) error("Unclosed parenthetical expression"); return true; } return false;}

17

Recursion, I

Here’s an unfortunate (but legal!) grammar rule: <expression> ::= <expression> “+” <term>

Here’s some code for it: public boolean expression() {

if (!expression()) return false; if (!addOperator()) return true; if (!term()) error("Error in expression after '+' or '-'"); return true;}

Do you see the problem? We aren’t recurring with a simpler case, therefore, we have an

infinite recursion Our grammar rule is left recursive (the recursive part is the

leftmost thing in the definition)

18

Recursion, II

Here’s our unfortunate grammar rule again: <expression> ::= <expression> “+” <term>

Here’s an equivalent, right recursive rule: <expression> ::= <term> “+” <expression>

Here’s some (much happier!) code for it: public boolean expression() {

if (!term()) return false; if (!addOperator()) return true; if (!expression()) error("Error in expression after '+' or '-'"); return true;}

19

Extended BNF—optional parts

Extended BNF uses brackets to indicate optional parts of rules Example:

<if_statement> ::= “if” <condition> <statement> [ “else” <statement> ]

Pseudocode for example:public boolean ifStatement() { if you don’t see “if”, return false if you don’t see a condition, return an error if you don’t see a statement, return an error if you see an “else” { if you see a “statement”, return true else return an error } else return true;}

20

Extended BNF—zero or more

Extended BNF uses braces to indicate parts of a rule that can be repeated Example: <expression> ::= <term> { “+”

<term> } Note that this is not a good definition for an expression

Pseudocode for example: public boolean expression() {

if you don’t see a term, return false while you see a “+” { if you don’t see a term, return an error } return true}

21

Back to parsers

A parser is like a recognizer The difference is that, when a parser recognizes

something, it does something about it Usually, what a parser does is build a tree

If the thing that is being parsed is a program, then You can write another program that “walks” the tree and

executes the statements and expressions as it finds them Such a program is called an interpreter

You can write another program that “walks” the tree and produces code in some other language (usually assembly language) that does the same thing

Such a program is called a compiler

22

Conclusions

If you start with a BNF definition of a language, You can write a recursive descent recognizer to tell you

whether an input string “belongs to” that language (is a valid program in that language)

Writing such a recognizer is a “cookbook” exercise—you just follow the recipe and it works (hopefully)

You can write a recursive descent parser to create a parse tree representing the program

The parse tree can later be used to execute the program

BNF is purely syntactic BNF tells you what is legal, and how things are put together BNF has nothing to say about what things actually mean

23

The End

28-jun-15 recognizers. 2 parsers and recognizers given a grammar (say, in bnf) and a string, a...

Documents