chapter 2 :: programming language syntaxtadavis/cs312/ch02f.pdf · 2019. 1. 30. · ebnf to bnf...

Copyright © 2005 Elsevier

Chapter 2 ::

Programming Language Syntax

Programming Language Pragmatics

Michael L. Scott


Introduction

• programming languages need to be precise

– natural languages less so

– both form (syntax) and meaning (semantics)

must be unambiguous

– example: digits

digit → 0|1|2|3|4|5|6|7|8|9

– we need good notation (or a metalanguage) to

describe precise languages by recognizing tokens

• regular expressions

• context-free grammars


Tokens

• tokens are the building blocks of programs

– shortest strings with individual meaning

– examples

• keywords (type names, control structures)

• identifiers (variable names)

• symbols (mathematical operators)

• constants (literals)

– considerations

• case sensitivity

• international characters

• maximum lengths


Regular Expressions

• a regular expression is one of the following:

– a character

– the empty string, denoted by ϵ

– two regular expressions concatenated

– two regular expressions separated by | (i.e., or)

– a regular expression followed by the Kleene star

(concatenation of zero or more strings)

• these simple rules help us find tokens in the

programming language

• useful in unix/linux environments
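The rules above can be sketched with Python's `re` module. Python is an assumption here, since the slides name no implementation language, and the names `DIGIT`, `UNSIGNED_INT`, and `is_unsigned_int` are illustrative only.

```python
import re

# digit -> 0|1|2|...|9, written with explicit alternation (the | rule)
DIGIT = r"(?:0|1|2|3|4|5|6|7|8|9)"

# Concatenation plus Kleene star: one digit followed by zero or more digits
UNSIGNED_INT = re.compile(DIGIT + DIGIT + "*")


def is_unsigned_int(text: str) -> bool:
    """Return True if the whole string matches digit digit*."""
    return UNSIGNED_INT.fullmatch(text) is not None
```

For example, `is_unsigned_int("352")` is true, while the empty string and `"3a"` are rejected.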


Regular Expressions

• numerical literals in Pascal may be generated

by the following:

• arrow can be read as

– can be replaced by

– goes to


Context-Free Grammars

• the notation for context-free grammars (CFG)

is sometimes called Backus-Naur Form (BNF)

– necessary since regular expressions cannot specify nested

constructs

– used to define the syntax of a language

• with Kleene star and other facilitating

symbols, the notation is termed Extended BNF

(EBNF)


Source: Tucker & Noonan (2007)

Derivations

• example grammar

binaryDigit → 0

binaryDigit → 1

or equivalently

binaryDigit → 0 | 1


Derivations

• consider the grammar

Integer → Digit | Integer Digit

Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

we can derive any unsigned integer, like 352, from this

grammar:

Integer → Integer Digit

→ Integer 2

→ Integer Digit 2

→ Integer 5 2

→ Digit 5 2

→ 3 5 2


Derivations

– a different derivation of 352

Integer → Integer Digit

→ Integer Digit Digit

→ Digit Digit Digit

→ 3 Digit Digit

→ 3 5 Digit

→ 3 5 2

– this is called a leftmost derivation since at each step, the

leftmost nonterminal is replaced

– the previous derivation was a rightmost derivation
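The two derivation orders can be reproduced mechanically. The following is a minimal sketch in Python (an assumed language) that is hard-wired to the Integer grammar; the function names are hypothetical and it is not a general derivation engine.

```python
DIGITS = "0123456789"


def _check(number: str) -> None:
    if not number or any(ch not in DIGITS for ch in number):
        raise ValueError("expected a non-empty string of decimal digits")


def leftmost_derivation(number: str) -> list[str]:
    """Sentential forms when the leftmost nonterminal is always replaced."""
    _check(number)
    form = ["Integer"]
    steps = [" ".join(form)]
    # Integer -> Integer Digit, until there is one symbol per digit
    while len(form) < len(number):
        form = ["Integer", "Digit"] + form[1:]
        steps.append(" ".join(form))
    # Integer -> Digit
    form[0] = "Digit"
    steps.append(" ".join(form))
    # Digit -> d, working from left to right
    for i, ch in enumerate(number):
        form[i] = ch
        steps.append(" ".join(form))
    return steps


def rightmost_derivation(number: str) -> list[str]:
    """Sentential forms when the rightmost nonterminal is always replaced."""
    _check(number)
    form = ["Integer"]
    steps = [" ".join(form)]
    for ch in reversed(number[1:]):
        # Integer -> Integer Digit
        form = ["Integer", "Digit"] + form[1:]
        steps.append(" ".join(form))
        # The new Digit is now the rightmost nonterminal: Digit -> d
        form[1] = ch
        steps.append(" ".join(form))
    # Integer -> Digit, then Digit -> first digit
    form[0] = "Digit"
    steps.append(" ".join(form))
    form[0] = number[0]
    steps.append(" ".join(form))
    return steps
```

Calling either function on `"352"` yields the same seven sentential forms as the corresponding derivation above.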


Derivations

– notation for derivations

Integer →* 352

– meaning that 352 can be derived in a finite number of

steps using the grammar for Integer

352 ∈ L(G)

– meaning that 352 is a member of the language

defined by grammar G

L(G) = { ω ∈ T* | Integer →* ω }

– meaning that the language defined by grammar G is

the set of all symbol strings ω that can be derived as

an Integer



Grammars

• conventional in general discussions of grammars to

use

– lower case letters near the beginning of the alphabet for

terminals

– lower case letters near the end of the alphabet for strings of

terminals

– upper case letters near the beginning of the alphabet for

non-terminals

– upper case letters near the end of the alphabet for arbitrary

symbols

– greek letters for arbitrary strings of symbols

Parse Trees

• a parse tree is a graphical representation of a

derivation

– each internal node of the tree corresponds to a step in the

derivation

– the children of a node represent a right-hand side of a

production

– each leaf node represents a symbol of the derived string

reading from left to right


Parse Trees

• the step, Integer → Integer Digit appears in

the parse tree as


Parse Trees

• parse tree for 352 as in Integer




• expression grammar with precedence and

associativity



• parse tree for expression grammar (with precedence) for 3 + 4 * 5



• parse tree for expression grammar (with left associativity) for 10 - 4 - 3


• another grammar with precedence and

associativity

– + and – are left-associative operators in mathematics

– * and / have higher precedence than + and –

• Grammar G1



• parse tree for 4**2**3 + 5 * 6 + 7



• associativity and precedence shown in the

structure of the parse tree

– highest precedence at the bottom

– left-associativity on the left at each level
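To see how precedence and associativity shape a tree, here is a small recursive-descent sketch in Python. The grammar in the comments is an assumption in the style of these slides (the grammar figure itself is not reproduced here): + and - are left-associative, * and / bind tighter, and ** is right-associative and binds tightest. Trees are nested tuples `(operator, left, right)`.

```python
import re

# "**" is listed before the single-character operators so it is tried first
TOKEN = re.compile(r"\s*(\*\*|[0-9]+|[-+*/()])")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def expr():      # Expr -> Term { (+ | -) Term }, grouped to the left
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    def term():      # Term -> Factor { (* | /) Factor }, grouped to the left
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():    # Factor -> Primary [ ** Factor ], grouped to the right
        node = primary()
        if peek() == "**":
            advance()
            node = ("**", node, factor())
        return node

    def primary():   # Primary -> integer | ( Expr )
        tok = advance()
        if tok == "(":
            node = expr()
            if advance() != ")":
                raise SyntaxError("expected )")
            return node
        if tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree
```

With this sketch, `parse("3 + 4 * 5")` gives `('+', 3, ('*', 4, 5))` and `parse("10 - 4 - 3")` gives `('-', ('-', 10, 4), 3)`: the higher-precedence operator sits lower in the tree, and the left-associative operator nests on the left.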


Ambiguous Grammars

• a grammar is ambiguous if one of its strings

has two or more different parse trees

– grammar G1 above is unambiguous

• ambiguous expression grammar G2 equivalent

to G1

– fewer productions and nonterminals, but

ambiguous

Source: Tucker & Noonan (2007)

Ambiguous Grammars

• ambiguous parse of 5 – 4 + 3 using G2
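The ambiguity can be made concrete by enumerating every parse tree. This Python sketch assumes a simplified fragment of an ambiguous grammar, Expr → Expr Op Expr | integer with Op → + | -, which may differ from the G2 shown on the slide; `all_trees` and `evaluate` are hypothetical helper names.

```python
def all_trees(tokens):
    """Every parse tree for `tokens` under Expr -> Expr Op Expr | integer.

    `tokens` alternates integers and operator strings, e.g. [5, "-", 4, "+", 3].
    """
    if len(tokens) == 1:
        return [tokens[0]]
    trees = []
    # Any operator (odd index) may serve as the root of the tree
    for i in range(1, len(tokens), 2):
        for left in all_trees(tokens[:i]):
            for right in all_trees(tokens[i + 1:]):
                trees.append((tokens[i], left, right))
    return trees


def evaluate(tree):
    """Compute the value a parse tree implies."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a - b
```

For `[5, "-", 4, "+", 3]` this finds two trees: one groups as 5 - (4 + 3), which evaluates to -2, and the other as (5 - 4) + 3, which evaluates to 4. Two trees for one string is exactly what makes the grammar ambiguous.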


Abstract Syntax Tree

• the shape of a parse tree reveals the meaning

of the program

• we want a tree that removes its inefficiency,

but keeps its shape

– remove separator/punctuation terminal symbols

– remove all trivial root nonterminals

– replace remaining nonterminals with leaf terminals

• removes syntactic sugar and keeps essential elements

of a language
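As an illustration only, Python's standard `ast` module exposes the abstract syntax tree that Python builds for its own expressions. It stands in here for the language discussed on the slides. The helper names `shape` and `abstract_syntax` are hypothetical.

```python
import ast


def shape(node):
    """Reduce a Python expression AST to nested tuples of operators and constants."""
    if isinstance(node, ast.BinOp):
        return (type(node.op).__name__, shape(node.left), shape(node.right))
    if isinstance(node, ast.Constant):
        return node.value
    raise ValueError(f"unsupported node: {type(node).__name__}")


def abstract_syntax(text):
    """Parse a Python arithmetic expression and return its reduced shape."""
    return shape(ast.parse(text, mode="eval").body)
```

`abstract_syntax("(3 + 4) * 5")` returns `('Mult', ('Add', 3, 4), 5)`: the parentheses are gone as symbols, but their effect survives in the shape of the tree.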


Abstract Syntax Tree


Dangling Else

• with which if statement does the else associate?


Dangling Else Ambiguity


Dangling Else Solutions

• Algol 60, C, C++

– associate each else with the closest if

– use { } or begin/end to override

• Algol 68, Modula, Ada

– use explicit delimiter to end every conditional (e.g., if..fi)

• Java

– rewrite the grammar to limit what can appear in a conditional


Extended BNF (EBNF)

• BNF

– recursion for iteration

– nonterminals for grouping

• EBNF additional metacharacters

– { } for a series of zero or more

– ( ) for a list; must pick one

– [ ] for an optional list; pick none or one


EBNF Examples

• Expression is a list of Terms separated by

operators + and -


EBNF to BNF

• we can always rewrite an EBNF grammar as a

BNF grammar

can be rewritten as

• try rewriting EBNF rules with { } and ( )

• while EBNF is no more powerful than BNF,

its rules are often simpler and clearer
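One possible rewriting scheme, sketched in Python: a rule A → α { β } becomes the left-recursive BNF rule A → A β | α, and A → α [ β ] becomes A → α | α β. The function names and the nonterminal `AddOp` are hypothetical. A ( ) group would first be replaced by a fresh nonterminal such as `AddOp → + | -`.

```python
def expand_repetition(head: str, first: str, repeated: str) -> str:
    """Rewrite  head -> first { repeated }  using left recursion."""
    return f"{head} → {head} {repeated} | {first}"


def expand_optional(head: str, first: str, optional: str) -> str:
    """Rewrite  head -> first [ optional ]  as two alternatives."""
    return f"{head} → {first} | {first} {optional}"
```

For instance, `expand_repetition("Expression", "Term", "AddOp Term")` returns `"Expression → Expression AddOp Term | Term"`. Both forms generate a Term followed by zero or more `AddOp Term` pairs.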



Scanning

• recall that the scanner is responsible for

– tokenizing source

– removing comments

• may be difficult if nested

– (often) dealing with pragmas (i.e., significant

comments)

– saving text of identifiers, numbers, strings

– saving source locations (file, line, column) for

error messages


Scanning

• suppose we are building an ad-hoc (hand-written)

scanner for Pascal:

– we read the characters one at a time with look-ahead

• if it is one of the one-character tokens { ( ) [ ] < > , ; = + - etc }

we announce that token

• if it is a ., we look at the next character

– if that is a dot, we announce ..

– otherwise, we announce . and reuse the look-ahead


Scanning

• if it is a


Scanning

• if it is a digit, we keep reading until we find

a non-digit

– if that is not a . , we announce an integer

– otherwise, we keep looking for a real number

– if the character after the . is not a digit, we

announce an integer and reuse the . and the

look-ahead
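The cases described so far (one-character tokens, dots, and numbers) might be hand-coded roughly as follows. This is a sketch in Python rather than a full Pascal scanner: the token kind names are assumptions, the one-character set is the one listed on the slide, and identifiers, comments, and multi-character operators other than .. are left out.

```python
ONE_CHAR = set("()[]<>,;=+-")   # the one-character tokens listed on the slide
DIGITS = "0123456789"


def scan(text):
    """Return (kind, lexeme) pairs for a small Pascal-like token subset."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch in ONE_CHAR:
            tokens.append(("symbol", ch))
            i += 1
        elif ch == ".":
            # Look at the next character: ".." is one token, "." another
            if i + 1 < n and text[i + 1] == ".":
                tokens.append(("symbol", ".."))
                i += 2
            else:
                tokens.append(("symbol", "."))
                i += 1
        elif ch in DIGITS:
            start = i
            while i < n and text[i] in DIGITS:
                i += 1
            # Commit to a real only if a dot AND a digit follow
            if i + 1 < n and text[i] == "." and text[i + 1] in DIGITS:
                i += 1
                while i < n and text[i] in DIGITS:
                    i += 1
                tokens.append(("real", text[start:i]))
            else:
                tokens.append(("integer", text[start:i]))
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return tokens
```

Here `scan("3..5")` yields an integer, the `..` symbol, and another integer, while `scan("3.14")` yields a single real.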


Scanning

• pictorial representation of a Pascal scanner as a finite automaton


Scanning

• a scanner can be represented by a

deterministic finite automaton (DFA)

– lex, scangen, etc. build these things

automatically from a set of regular expressions

– specifically, they construct a machine that

accepts the language identifier | int const

| real const | comment | symbol |

...


Scanning

• we run the machine over and over to get one

token after another

– nearly universal rule

• always take the longest possible token from the

input

– thus foobar is foobar and never f or foo or foob

• more to the point, 3.14159 is a real const and

never 3, ., and 14159

• regular expressions "generate" a regular

language; DFAs "recognize" it
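The longest-possible-token rule can be sketched by trying every token pattern at the current position and keeping the longest match. The Python below is an illustration under assumed token definitions (`TOKEN_PATTERNS` is not taken from the slides), and it uses the `re` module rather than a generated DFA.

```python
import re

# Assumed token definitions, for illustration only
TOKEN_PATTERNS = [
    ("real_const", re.compile(r"[0-9]+\.[0-9]+")),
    ("int_const", re.compile(r"[0-9]+")),
    ("identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("symbol", re.compile(r"\.\.|[.+\-*/=<>(),;]")),
]


def next_token(text, pos):
    """Return (kind, lexeme) for the longest token starting at `pos`."""
    best = None
    for kind, pattern in TOKEN_PATTERNS:
        match = pattern.match(text, pos)
        # Strictly longer wins; on a tie the earlier pattern is kept
        if match and (best is None or len(match.group()) > len(best[1])):
            best = (kind, match.group())
    if best is None:
        raise ValueError(f"no token at position {pos}")
    return best


def tokenize(text):
    """Run next_token over and over to get one token after another."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        kind, lexeme = next_token(text, pos)
        tokens.append((kind, lexeme))
        pos += len(lexeme)
    return tokens
```

With these patterns, `tokenize("foobar")` returns one identifier and `tokenize("3.14159")` returns one real constant, never the shorter prefixes.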


Scanning

• scanners tend to be built three ways

– ad-hoc

– semi-mechanical pure DFA

(usually realized as nested case statements)

– table-driven DFA

• ad-hoc generally yields the fastest, most

compact code by doing lots of special-purpose

things, though good automatically-generated scanners come very close


Scanning

• writing a pure DFA as a set of nested case

statements is a surprisingly useful

programming technique

– though it's often easier to use perl, awk, sed

– for details see Figure 2.11

• table-driven DFA is what lex and scangen

produce

– lex (flex) in the form of C code

– scangen in the form of numeric tables and a

separate driver (for details see Figure 2.12)


Scanning

• note that the rule about longest-possible tokens means you return only when the next character can't be used to continue the current token

– the next character will generally need to be saved for the next token

• in some cases, you may need to peek at more than one character of look-ahead in order to know whether to proceed

– in Pascal, for example, when you have a 3 and you see a dot

• do you proceed (in hopes of getting 3.14)? or

• do you stop (in fear of getting 3..5)?


Scanning

• in messier cases, you may not be able to get

by with any fixed amount of look-ahead; in

Fortran, for example, we have

DO 5 I = 1,25    loop

DO 5 I = 1.25    assignment

• here, we need to remember we were in a

potentially final state, and save enough

information that we can back up to it, if we get stuck later
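The idea of remembering the last final state and backing up to it can be sketched with a small table-driven DFA. The Python below uses the simpler Pascal 3..5 case rather than Fortran. The state names, the transition table, and the token kinds are all assumptions made for illustration, and whitespace is not handled.

```python
def char_class(ch):
    if ch in "0123456789":
        return "digit"
    if ch == ".":
        return "dot"
    return "other"


# TRANSITIONS[state][character class] -> next state; a missing entry means "stuck"
TRANSITIONS = {
    "start": {"digit": "int", "dot": "dot"},
    "int": {"digit": "int", "dot": "int_dot"},
    "int_dot": {"digit": "real"},      # not final: "3." alone is not a token
    "real": {"digit": "real"},
    "dot": {"dot": "dotdot"},
    "dotdot": {},
}

# Final states and the token kind each one announces
ACCEPTING = {
    "int": "int_const",
    "real": "real_const",
    "dot": "symbol",
    "dotdot": "symbol",
}


def scan(text):
    tokens, pos = [], 0
    while pos < len(text):
        state, i = "start", pos
        last_accept = None  # (token kind, end position) of the latest final state
        while i < len(text):
            nxt = TRANSITIONS[state].get(char_class(text[i]))
            if nxt is None:
                break  # stuck: fall back to the last final state
            state, i = nxt, i + 1
            if state in ACCEPTING:
                last_accept = (ACCEPTING[state], i)
        if last_accept is None:
            raise ValueError(f"no token at position {pos}")
        kind, end = last_accept
        tokens.append((kind, text[pos:end]))
        pos = end  # characters read past `end` are scanned again
    return tokens
```

On `"3..5"` the machine reads `3.` and then gets stuck on the second dot, so it backs up to the final state reached after `3`, announces an integer, and rescans from the first dot.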


Parsing

• terminology:

– context-free grammar (CFG)

– symbols

• terminals (tokens)

• non-terminals

– production

– derivations (left-most and right-most - canonical)

– parse trees

– sentential form


Parsing

• by analogy to RE and DFAs, a context-free

grammar (CFG) is a generator for a

context-free language (CFL)

– a parser is a language recognizer

• there is an infinite number of grammars for

every context-free language

– not all grammars are created equal, however


Parsing

• it turns out that for any CFG we can create a

parser that runs in O(n^3) time

• there are two well-known parsing

algorithms that permit this

– Earley's algorithm

– Cocke-Younger-Kasami (CYK) algorithm

• O(n^3) time is clearly unacceptable for a

parser in a compiler - too slow
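For reference, a minimal CYK recognizer in Python shows where the cubic cost comes from: three nested loops over substring length, start position, and split point. CYK requires a grammar in Chomsky normal form, so the Integer grammar is restated below as an assumed CNF equivalent, with the unit production Integer → Digit replaced by Integer → 0 | ... | 9. The rule encoding is an illustrative choice.

```python
def cyk(word, terminal_rules, binary_rules, start):
    """Return True if `start` derives `word` under a CNF grammar.

    terminal_rules: list of (head, terminal)
    binary_rules:   list of (head, (left_nonterminal, right_nonterminal))
    """
    n = len(word)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive word[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        for head, terminal in terminal_rules:
            if terminal == ch:
                table[i][0].add(head)
    for length in range(2, n + 1):              # substring length
        for i in range(n - length + 1):         # start position
            for split in range(1, length):      # size of the left part
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for head, (b, c) in binary_rules:
                    if b in left and c in right:
                        table[i][length - 1].add(head)
    return start in table[0][n - 1]


DIGITS = "0123456789"
# Assumed CNF form of: Integer -> Digit | Integer Digit
TERMINAL_RULES = [("Integer", d) for d in DIGITS] + [("Digit", d) for d in DIGITS]
BINARY_RULES = [("Integer", ("Integer", "Digit"))]
```

`cyk("352", TERMINAL_RULES, BINARY_RULES, "Integer")` returns `True`, and a string such as `"3a2"` is rejected.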
