
  • Copyright © 2005 Elsevier

    Chapter 2 ::

    Programming Language Syntax

    Programming Language Pragmatics

    Michael L. Scott

  • Introduction

    • programming languages need to be precise

    – natural languages less so

    – both form (syntax) and meaning (semantics)

    must be unambiguous

    – example: digits

    digit → 0|1|2|3|4|5|6|7|8|9

    – we need good notation (or a metalanguage) to

    describe precise languages by recognizing tokens

    • regular expressions

    • context-free grammars

  • Tokens

    • tokens are the building blocks of programs

    – shortest strings with individual meaning

    – examples

    • keywords (type names, control structures)

    • identifiers (variable names)

    • symbols (mathematical operators)

    • constants (literals)

    – considerations

    • case sensitivity

    • international characters

    • maximum lengths

  • Regular Expressions

    • a regular expression is one of the following:

    – a character

    – the empty string, denoted by ε

    – two regular expressions concatenated

    – two regular expressions separated by | (i.e., or)

    – a regular expression followed by the Kleene star

    (concatenation of zero or more strings)

    • these simple rules help us find tokens in the

    programming language

    • useful in unix/linux environments

  • Regular Expressions

    • numerical literals in Pascal may be generated

    by the following:

    • the arrow (→) can be read as "can be replaced by" or "goes to"
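As a rough illustration of how concatenation, alternation, and the Kleene star combine, the sketch below writes one plausible pattern for unsigned numeric literals with Python's `re` module. The rule set (digits, an optional fraction, an optional exponent) and the names `UNSIGNED_NUMBER` and `is_unsigned_number` are assumptions made for this example; they are not necessarily the rules shown on the slide.

```python
import re

# Assumed rules, written with only concatenation, | and *:
#   digit            -> 0 | 1 | ... | 9
#   unsigned_integer -> digit digit*
#   unsigned_number  -> unsigned_integer ( . unsigned_integer | ε )
#                       ( (e | E) ( + | - | ε ) unsigned_integer | ε )
# An empty alternative such as "(x|)" plays the role of ε.
UNSIGNED_NUMBER = re.compile(
    r"[0-9][0-9]*"                    # unsigned_integer
    r"(\.[0-9][0-9]*|)"               # optional fraction
    r"([eE](\+|-|)[0-9][0-9]*|)"      # optional exponent
)

def is_unsigned_number(text):
    """Return True if the whole string matches the assumed literal rules."""
    return UNSIGNED_NUMBER.fullmatch(text) is not None

if __name__ == "__main__":
    for sample in ["352", "3.14", "6.02e23", "3.", ".5", "1e"]:
        print(sample, is_unsigned_number(sample))
```

Under these assumed rules a literal must start and end with a digit, so `3.` and `.5` are rejected.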

  • Context-Free Grammars

    • the notation for context-free grammars (CFG)

    is sometimes called Backus-Naur Form (BNF)

    – necessary since regular expressions cannot specify nested

    constructs

    – used to define the syntax of a language

    • with Kleene star and other facilitating

    symbols, the notation is termed Extended BNF

    (EBNF)

  • Context-Free Grammars

    Source: Tucker & Noonan (2007)

  • Derivations

    • example grammar

    binaryDigit → 0

    binaryDigit → 1

    or equivalently

    binaryDigit → 0 | 1

    Source: Tucker & Noonan (2007)

  • Derivations

    • consider the grammar

    Integer → Digit | Integer Digit

    Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

    we can derive any unsigned integer, like 352, from this

    grammar:

    Integer → Integer Digit

    → Integer 2

    → Integer Digit 2

    → Integer 5 2

    → Digit 5 2

    → 3 5 2

    Source: Tucker & Noonan (2007)

  • Derivations

    – a different derivation of 352

    Integer → Integer Digit

    → Integer Digit Digit

    → Digit Digit Digit

    → 3 Digit Digit

    → 3 5 Digit

    → 3 5 2

    – this is called a leftmost derivation since at each step, the

    leftmost nonterminal is replaced

    – the previous derivation was a rightmost derivation

    Source: Tucker & Noonan (2007)
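The two derivations of 352 can be reproduced mechanically. The sketch below is a minimal illustration for this one grammar only (Integer → Digit | Integer Digit); the function name `derivation` and the way it picks productions by comparing the sentential form's length with the target are assumptions of the example, not a general parsing method.

```python
NONTERMINALS = {"Integer", "Digit"}
DIGITS = "0123456789"

def derivation(digits, leftmost=True):
    """List the sentential forms in a leftmost or rightmost derivation."""
    if not digits or any(ch not in DIGITS for ch in digits):
        raise ValueError("expected a non-empty string of decimal digits")
    n = len(digits)
    form = ["Integer"]
    steps = [" ".join(form)]
    while any(sym in NONTERMINALS for sym in form):
        positions = [i for i, sym in enumerate(form) if sym in NONTERMINALS]
        k = positions[0] if leftmost else positions[-1]
        if form[k] == "Integer":
            # Grow the form until it has one symbol per digit, then stop.
            form[k:k + 1] = ["Integer", "Digit"] if len(form) < n else ["Digit"]
        else:
            # A Digit that is j symbols from the right end yields the
            # digit that is j characters from the right end of the target.
            form[k] = digits[n - (len(form) - k)]
        steps.append(" ".join(form))
    return steps

if __name__ == "__main__":
    print(" → ".join(derivation("352", leftmost=True)))
    print(" → ".join(derivation("352", leftmost=False)))
```

Both calls end in the same string, 3 5 2; only the order in which nonterminals are replaced differs.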

  • Derivations

    – notation for derivations

    Integer →* 352

    – meaning that 352 can be derived in a finite number of

    steps using the grammar for Integer

    352 ∈ L(G)

    – meaning that 352 is a member of the language

    defined by grammar G

    L(G) = { ω ∈ T* | Integer →* ω }

    – meaning that the language defined by grammar G is

    the set of all symbol strings ω that can be derived as

    an Integer

    Source: Tucker & Noonan (2007)

  • Grammars

    • conventional in general discussions of grammars to

    use

    – lower case letters near the beginning of the alphabet for

    terminals

    – lower case letters near the end of the alphabet for strings of

    terminals

    – upper case letters near the beginning of the alphabet for

    non-terminals

    – upper case letters near the end of the alphabet for arbitrary

    symbols

    – greek letters for arbitrary strings of symbols

  • Parse Trees

    • a parse tree is a graphical representation of a

    derivation

    – each internal node of the tree corresponds to a step in the

    derivation

    – the children of a node represent a right-hand side of a

    production

    – each leaf node represents a symbol of the derived string

    reading from left to right

    Source: Tucker & Noonan (2007)

  • Parse Trees

    • the step, Integer → Integer Digit appears in

    the parse tree as

    Source: Tucker & Noonan (2007)

  • Parse Trees

    • parse tree for 352 as in Integer

    Source: Tucker & Noonan (2007)
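Since the tree figures are not reproduced here, the following sketch builds the parse tree for 352 as nested Python tuples and reads the derived string back off the leaves. The `(name, children)` representation and the helper names `integer_tree` and `leaves` are choices made for this example.

```python
def integer_tree(digits):
    """Parse tree for a digit string under Integer -> Digit | Integer Digit."""
    last = ("Digit", [digits[-1]])
    if len(digits) == 1:
        return ("Integer", [last])                        # Integer -> Digit
    return ("Integer", [integer_tree(digits[:-1]), last])  # Integer -> Integer Digit

def leaves(tree):
    """Collect the leaf symbols from left to right."""
    if isinstance(tree, str):
        return [tree]
    _name, children = tree
    found = []
    for child in children:
        found.extend(leaves(child))
    return found

if __name__ == "__main__":
    tree = integer_tree("352")
    print(tree)
    print("".join(leaves(tree)))
```

Each tuple is an internal node whose children form the right-hand side of one production; because the grammar is left-recursive, the tree leans to the left.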

  • Context-Free Grammars

    • expression grammar with precedence and

    associativity

  • Context-Free Grammars

    • parse tree for expression grammar (with precedence) for 3 + 4 * 5

  • Context-Free Grammars

    • parse tree for expression grammar (with left associativity) for 10 - 4 - 3

  • Context-Free Grammars

    • another grammar with precedence and

    associativity

    – + and – are left-associative operators in mathematics

    – * and / have higher precedence than + and –

    • Grammar G1

    Source: Tucker & Noonan (2007)

  • Context-Free Grammars

    • parse tree for 4**2**3 + 5 * 6 + 7

    Source: Tucker & Noonan (2007)

  • Context-Free Grammars

    • associativity and precedence shown in the

    structure of the parse tree

    – highest precedence at the bottom

    – left-associativity on the left at each level

    Source: Tucker & Noonan (2007)
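To see how precedence and associativity show up in the tree's shape, the sketch below parses the three example expressions with a small hand-written parser. The grammar in the comments is an assumption (one nonterminal per precedence level, with ** right-associative); grammar G1 itself is not reproduced in this transcript, and all function names are hypothetical.

```python
import re

def tokenize(text):
    # Integers, the two-character ** operator, then one-character symbols.
    return re.findall(r"\d+|\*\*|[-+*/()]", text)

def parse_expr(tokens):
    # Expr -> Term { (+ | -) Term }; folding leftward gives left associativity.
    node = parse_term(tokens)
    while tokens and tokens[0] in ("+", "-"):
        op = tokens.pop(0)
        node = (op, node, parse_term(tokens))
    return node

def parse_term(tokens):
    # Term -> Factor { (* | /) Factor }; one level deeper, so it binds tighter.
    node = parse_factor(tokens)
    while tokens and tokens[0] in ("*", "/"):
        op = tokens.pop(0)
        node = (op, node, parse_factor(tokens))
    return node

def parse_factor(tokens):
    # Factor -> Primary [ ** Factor ]; right recursion gives right associativity.
    base = parse_primary(tokens)
    if tokens and tokens[0] == "**":
        tokens.pop(0)
        return ("**", base, parse_factor(tokens))
    return base

def parse_primary(tokens):
    # Primary -> integer | ( Expr )
    tok = tokens.pop(0)
    if tok == "(":
        node = parse_expr(tokens)
        tokens.pop(0)  # discard the closing parenthesis
        return node
    return int(tok)

def parse(text):
    return parse_expr(tokenize(text))

if __name__ == "__main__":
    for source in ["3 + 4 * 5", "10 - 4 - 3", "4**2**3 + 5 * 6 + 7"]:
        print(source, "=>", parse(source))
```

In the printed tuples, the operators with the highest precedence sit deepest, and a chain of + or - nests on the left.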

  • Ambiguous Grammars

    • a grammar is ambiguous if its language contains at

    least one string with two or more different parse trees

    – grammar G1 above is unambiguous

    • ambiguous expression grammar G2 equivalent

    to G1

    – fewer productions and nonterminals, but

    ambiguous

    Source: Tucker & Noonan (2007)

  • Ambiguous Grammars

    • ambiguous parse of 5 – 4 + 3 using G2

    Source: Tucker & Noonan (2007)
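The ambiguity can be made concrete by enumerating every parse tree. The sketch below assumes an ambiguous grammar of the form Expr → Expr op Expr | integer, which may differ in detail from G2 (not reproduced in this transcript); `all_trees` and `evaluate` are hypothetical helper names.

```python
def all_trees(tokens):
    """Every parse tree for an alternating operand/operator token list."""
    if len(tokens) == 1:
        return [tokens[0]]
    trees = []
    # Any operator may serve as the root: that freedom is the ambiguity.
    for i in range(1, len(tokens), 2):
        for left in all_trees(tokens[:i]):
            for right in all_trees(tokens[i + 1:]):
                trees.append((tokens[i], left, right))
    return trees

def evaluate(tree):
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a - b

if __name__ == "__main__":
    for tree in all_trees([5, "-", 4, "+", 3]):
        print(tree, "=", evaluate(tree))
```

The two trees for 5 – 4 + 3 evaluate to different values, which is why an ambiguous grammar is a problem for defining meaning and not only for parsing.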

  • Abstract Syntax Tree

    • the shape of a parse tree reveals the meaning

    of the program

    • we want a tree that removes its inefficiency,

    but keeps its shape

    – remove separator/punctuation terminal symbols

    – remove all trivial root nonterminals

    – replace remaining nonterminals with leaf terminals

    • removes syntactic sugar and keeps essential elements

    of a language

    Source: Tucker & Noonan (2007)

  • Abstract Syntax Tree

    Source: Tucker & Noonan (2007)
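One way to read the pruning steps is as a recursive rewrite of the parse tree. The sketch below is an interpretation under stated assumptions: parse trees are `(name, children)` tuples with terminals as strings, parentheses are the only punctuation, and every remaining interior node has the shape operand, operator, operand. The name `to_ast` is hypothetical.

```python
PUNCTUATION = {"(", ")"}

def to_ast(node):
    """Collapse a parse tree into an abstract syntax tree."""
    if isinstance(node, str):
        return node                                   # terminal leaf
    _name, children = node
    kids = [to_ast(child) for child in children
            if not (isinstance(child, str) and child in PUNCTUATION)]
    if len(kids) == 1:
        return kids[0]                                # trivial nonterminal: drop it
    left, op, right = kids                            # operator becomes the node
    return (op, left, right)

if __name__ == "__main__":
    # Parse tree for 3 + 4 with the chain Expr -> Term -> Factor -> digit.
    parse_tree = ("Expr", [
        ("Expr", [("Term", [("Factor", ["3"])])]),
        "+",
        ("Term", [("Factor", ["4"])]),
    ])
    print(to_ast(parse_tree))
```

The nine-node parse tree shrinks to a single operator node with two leaves, while the grouping of operands is unchanged.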

  • Dangling Else

    • with which if statement does the else associate?

    Source: Tucker & Noonan (2007)

  • Dangling Else Ambiguity

    Source: Tucker & Noonan (2007)

  • Dangling Else Solutions

    • Algol 60, C, C++

    – associate each else with closest if

    – use {} or begin/end to override

    • Algol 68, Modula, Ada

    – use explicit delimiter to end every conditional (e.g., if..fi)

    • Java

    – rewrite the grammar to limit what can appear in a conditional

    Source: Tucker & Noonan (2007)
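The "closest if" rule falls out naturally from a recursive-descent parser that checks for an else as soon as it finishes a then-branch. The sketch below assumes a toy token syntax (`if cond then stmt [else stmt]`, with any other single token treated as a simple statement); it is not the grammar of any of the languages listed.

```python
def parse_stmt(tokens):
    """Parse one statement, binding each else to the nearest unmatched if."""
    tok = tokens.pop(0)
    if tok != "if":
        return tok                           # simple statement such as "s1"
    cond = tokens.pop(0)
    if tokens.pop(0) != "then":
        raise SyntaxError("expected 'then'")
    then_part = parse_stmt(tokens)
    else_part = None
    # The innermost if still being parsed sees the else first and claims it.
    if tokens and tokens[0] == "else":
        tokens.pop(0)
        else_part = parse_stmt(tokens)
    return ("if", cond, then_part, else_part)

if __name__ == "__main__":
    print(parse_stmt("if a then if b then s1 else s2".split()))
```

The else ends up inside the inner conditional on `b`; the outer conditional on `a` has no else branch.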

  • Extended BNF (EBNF)

    • BNF

    – recursion for iteration

    – nonterminals for grouping

    • EBNF additional metacharacters

    – { } for a series of zero or more

    – ( ) for a list; must pick one

    – [ ] for an optional list; pick none or one

    Source: Tucker & Noonan (2007)

  • EBNF Examples

    • Expression is a list of Terms separated by

    operators + and -

    Source: Tucker & Noonan (2007)

  • EBNF to BNF

    • we can always rewrite an EBNF grammar as a

    BNF grammar

    can be rewritten as

    • try rewriting EBNF rules with { } and ( )

    • while EBNF is no more powerful than BNF,

    its rules are often simpler and clearer

    Source: Tucker & Noonan (2007)

  • Scanning

    • recall that the scanner is responsible for

    – tokenizing source

    – removing comments

    • may be difficult if nested

    – (often) dealing with pragmas (i.e., significant

    comments)

    – saving text of identifiers, numbers, strings

    – saving source locations (file, line, column) for

    error messages

  • Scanning

    • suppose we are building an ad-hoc (hand-written)

    scanner for Pascal:

    – we read the characters one at a time with look-ahead

    • if it is one of the one-character tokens { ( ) [ ] < > , ; = + - etc }

    we announce that token

    • if it is a ., we look at the next character

    – if that is a dot, we announce ..

    – otherwise, we announce . and reuse the look-ahead

  • Scanning

    • if it is a

  • Scanning

    • if it is a digit, we keep reading until we find

    a non-digit

    – if that is not a . , we announce an integer

    – otherwise, we keep looking for a real number

    – if the character after the . is not a digit, we

    announce an integer and reuse the . and the

    look-ahead
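The dot and digit rules above can be sketched as a small hand-written scanner. This is a partial illustration only: it handles whitespace, a handful of one-character symbols, `.`, `..`, integers, and reals of the form digits.digits, and it leaves out identifiers, comments, and exponents. The token names and the function `scan` are assumptions of the example.

```python
DIGITS = "0123456789"
SYMBOLS = "()[],;=+-"

def scan(text):
    """Return a list of (kind, lexeme) pairs for a Pascal-like fragment."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch == ".":
            # Look at the next character to tell "." from "..".
            if i + 1 < n and text[i + 1] == ".":
                tokens.append(("dotdot", ".."))
                i += 2
            else:
                tokens.append(("dot", "."))
                i += 1
        elif ch in DIGITS:
            start = i
            while i < n and text[i] in DIGITS:
                i += 1
            # Continue into a real only if the dot is followed by a digit;
            # otherwise the dot is left in the input for the next token.
            if i + 1 < n and text[i] == "." and text[i + 1] in DIGITS:
                i += 1
                while i < n and text[i] in DIGITS:
                    i += 1
                tokens.append(("real", text[start:i]))
            else:
                tokens.append(("int", text[start:i]))
        elif ch in SYMBOLS:
            tokens.append(("symbol", ch))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return tokens

if __name__ == "__main__":
    print(scan("3.14"))
    print(scan("3..5"))
```

Deciding between `3.14` and `3..5` requires peeking two characters past the last digit, which is why the real-number branch tests both `text[i]` and `text[i + 1]` before consuming anything.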

  • Scanning

    • pictorial representation of a Pascal scanner as a finite automaton

  • Scanning

    • a scanner can be represented by a

    deterministic finite automaton (DFA)

    – lex, scangen, etc. build these things

    automatically from a set of regular expressions

    – specifically, they construct a machine that

    accepts the language identifier | int const

    | real const | comment | symbol |

    ...

  • Scanning

    • we run the machine over and over to get one

    token after another

    – nearly universal rule

    • always take the longest possible token from the

    input

    – thus foobar is foobar and never f or foo or foob

    • more to the point, 3.14159 is a real const and

    never 3, ., and 14159

    • regular expressions "generate" a regular

    language; DFAs "recognize" it

  • Scanning

    • scanners tend to be built three ways

    – ad-hoc

    – semi-mechanical pure DFA

    (usually realized as nested case statements)

    – table-driven DFA

    • ad-hoc generally yields the fastest, most

    compact code by doing lots of special-purpose

    things, though good automatically-generated scanners come very close

  • Scanning

    • writing a pure DFA as a set of nested case

    statements is a surprisingly useful

    programming technique

    – though it's often easier to use perl, awk, sed

    – for details see Figure 2.11

    • table-driven DFA is what lex and scangen

    produce

    – lex (flex) in the form of C code

    – scangen in the form of numeric tables and a

    separate driver (for details see Figure 2.12)

  • Scanning

    • note that the rule about longest-possible tokens means you return only when the next character can't be used to continue the current token

    – the next character will generally need to be saved for the next token

    • in some cases, you may need to peek at more than one character of look-ahead in order to know whether to proceed

    – in Pascal, for example, when you have a 3 and you see a dot

    • do you proceed (in hopes of getting 3.14)? or

    • do you stop (in fear of getting 3..5)?

  • Scanning

    • in messier cases, you may not be able to get

    by with any fixed amount of look-ahead; in

    Fortran, for example, we have

    DO 5 I = 1,25 (loop)

    DO 5 I = 1.25 (assignment)

    • here, we need to remember we were in a

    potentially final state, and save enough

    information that we can back up to it, if we get stuck later
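The idea of remembering the last final state can be sketched with a tiny table-driven DFA. The transition table below covers only integers, reals, `.` and `..`; the state names, the table, and the driver functions are assumptions made for the example and do not reflect the tables that lex or scangen actually generate.

```python
def char_class(ch):
    if ch in "0123456789":
        return "digit"
    if ch == ".":
        return "dot"
    return "other"

# (state, character class) -> next state; a missing entry means "stuck".
TRANSITIONS = {
    ("start", "digit"): "int",
    ("start", "dot"): "dot",
    ("int", "digit"): "int",
    ("int", "dot"): "int_dot",      # not final: "3." alone is not a token
    ("int_dot", "digit"): "real",
    ("real", "digit"): "real",
    ("dot", "dot"): "dotdot",
}
FINAL = {"int", "real", "dot", "dotdot"}

def next_token(text, pos):
    """Run the DFA from pos and return (kind, lexeme, new_pos)."""
    state, i, last_final = "start", pos, None
    while i < len(text):
        state = TRANSITIONS.get((state, char_class(text[i])))
        if state is None:
            break                    # stuck: fall back to the last final state
        i += 1
        if state in FINAL:
            last_final = (state, i)  # longest token seen so far
    if last_final is None:
        raise ValueError(f"no token starts at position {pos}")
    kind, end = last_final
    return kind, text[pos:end], end

def scan_all(text):
    pos, tokens = 0, []
    while pos < len(text):
        kind, lexeme, pos = next_token(text, pos)
        tokens.append((kind, lexeme))
    return tokens

if __name__ == "__main__":
    print(scan_all("3.14"))
    print(scan_all("3..5"))
```

On `3..5` the machine passes through the non-final state `int_dot`, gets stuck on the second dot, and backs up to the position recorded when it was last in a final state, so the first token is the integer `3`.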

  • Parsing

    • terminology:

    – context-free grammar (CFG)

    – symbols

    • terminals (tokens)

    • non-terminals

    – production

    – derivations (left-most and right-most - canonical)

    – parse trees

    – sentential form

  • Parsing

    • by analogy to RE and DFAs, a context-free

    grammar (CFG) is a generator for a

    context-free language (CFL)

    – a parser is a language recognizer

    • there is an infinite number of grammars for

    every context-free language

    – not all grammars are created equal, however

  • Parsing

    • it turns out that for any CFG we can create a

    parser that runs in O(n^3) time

    • there are two well-known parsing

    algorithms that permit this

    – Earley's algorithm

    – Cocke-Younger-Kasami (CYK) algorithm

    • O(n^3) time is clearly unacceptable for a

    parser in a compiler - too slow
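For a sense of where the cubic cost comes from, here is a minimal sketch of a CYK recognizer. It assumes the grammar is already in Chomsky normal form; the example grammar for balanced parentheses and the names `cyk`, `TERMINAL_RULES`, and `BINARY_RULES` are made up for the illustration.

```python
# Balanced parentheses in Chomsky normal form (an assumed example grammar):
#   S -> S S | L R | L X      X -> S R      L -> (      R -> )
TERMINAL_RULES = [("L", "("), ("R", ")")]
BINARY_RULES = [
    ("S", ("S", "S")),
    ("S", ("L", "R")),
    ("S", ("L", "X")),
    ("X", ("S", "R")),
]

def cyk(word, terminal_rules=TERMINAL_RULES, binary_rules=BINARY_RULES, start="S"):
    """Return True if the start symbol derives word."""
    n = len(word)
    if n == 0:
        return False
    # table[i][k] holds the nonterminals that derive word[i : i + k + 1].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][0] = {lhs for lhs, terminal in terminal_rules if terminal == ch}
    # Three nested loops over the input length give the O(n^3) behaviour.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lhs, (first, second) in binary_rules:
                    if first in left and second in right:
                        table[i][length - 1].add(lhs)
    return start in table[0][n - 1]

if __name__ == "__main__":
    for sample in ["()", "(())", "()()", "(()"]:
        print(sample, cyk(sample))
```

The loops over substring length, start position, and split point each range over the input length, and that triple nesting is the source of the O(n^3) bound; balanced parentheses are also a nested construct that regular expressions cannot describe.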