Structure of Programming Languages – Lecture 4

CSCI 6636 – 4536

September, 2018

CSCI 6636 – 4536 Lecture 4. . . 1/34 September, 2018 1 / 34


Outline

1 Syntax and its Specification
    Context-Free Languages
    Syntax Diagrams

2 The Definition of Pascal

3 Parsing
    Ad-hoc Parsing
    LL Parsers
    LR Parsers

4 Homework


Part 1

1. Syntax and its Specification

Context-Free Languages

Extended Backus-Naur Form

Syntax Diagrams


Context-Free Languages

Formally, almost all programming languages belong to the category called “context-free languages”. That is, the syntax of the language (excluding the type matching rules) can be described by a context-free grammar.

The set of all context-free languages is identical to the set of languages accepted by a finite-state machine that uses a stack for temporary storage.

We call such a machine a pushdown automaton.

Last week we studied regular languages. Context-free languages are more powerful because they are able to describe matching of parentheses and other paired symbols, to arbitrary nesting depth.

A context-free grammar provides a simple and precise mechanism for describing the way phrases in a language are built from smaller blocks.
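The role of the stack can be illustrated with a short sketch. It is not a full pushdown automaton, only a checker for matched and nested bracket pairs; the function name is_balanced and the choice of three bracket kinds are assumptions made for illustration.

```python
def is_balanced(text: str) -> bool:
    """Accept text whose (), [] and {} pairs are matched and properly nested."""
    closers = {")": "(", "]": "[", "}": "{"}
    openers = set(closers.values())
    stack = []
    for ch in text:
        if ch in openers:
            stack.append(ch)              # push: remember the open symbol
        elif ch in closers:
            # pop: the most recent open symbol must be the matching one
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack                      # accept only if nothing is left open


print(is_balanced("((x)x(x))"))   # True
print(is_balanced("((x)"))        # False
```

A finite-state machine without the stack could only count nesting up to some fixed depth, which is why regular languages cannot describe this kind of matching.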


A context-free grammar G consists of:

A finite set of nonterminal symbols, V (for vocabulary), each one representing a different syntactic category in the language.

A finite set of keywords and punctuation, Σ (for symbol). These are called terminal symbols.

A finite set, R, of rules or productions of the grammar.

There must be at least one rule for every nonterminal symbol.

The starting symbol, S, is used to represent the whole sentence or program. It must be an element of V.
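As a rough sketch, the four parts of a grammar can be held in a small data structure that also checks the two conditions just stated. The class name Grammar, its field names, and the sample grammar called pairs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    nonterminals: frozenset   # V
    terminals: frozenset      # Σ
    rules: tuple              # R: pairs (left-hand nonterminal, tuple of right-hand symbols)
    start: str                # S

    def problems(self) -> list:
        """Return a list of violated conditions; an empty list means none were found."""
        found = []
        if self.start not in self.nonterminals:
            found.append(f"start symbol {self.start} is not in V")
        defined = {lhs for lhs, _ in self.rules}
        for nt in sorted(self.nonterminals - defined):
            found.append(f"nonterminal {nt} has no rule")
        for lhs, rhs in self.rules:
            if lhs not in self.nonterminals:
                found.append(f"left side {lhs} is not a nonterminal")
            for sym in rhs:
                if sym not in self.nonterminals and sym not in self.terminals:
                    found.append(f"unknown symbol {sym} in a rule for {lhs}")
        return found


# A made-up grammar for the strings ab, aabb, aaabbb, ...
pairs = Grammar(
    nonterminals=frozenset({"S"}),
    terminals=frozenset({"a", "b"}),
    rules=(("S", ("a", "S", "b")), ("S", ("a", "b"))),
    start="S",
)
print(pairs.problems())   # []
```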


History: Describing Programming Language Syntax.

First, context-free grammars were invented.

Soon afterward, they were applied to the definition of programming languages and were crucial in the development of the first compiler-generator.

For this purpose a new notation was developed, better adapted to the character set supported by computers.

The notation was called Backus-Naur Form (BNF).

Later, it was extended to make it easier to use. The extended version is called EBNF and is the notation commonly used today.


Example: A context-free grammar and EBNF notation.

This grammar defines a nonsense language called Nested x’s.

V is {S, A} and the starting symbol is S.

Σ is { x, (, ) }

R is:

    Context-free Grammar      EBNF Grammar
    1. S → A                  1. S ::= A .
    2. A → ASA                2. A ::= ASA .
    3. A → ( S )              3. A ::= ( S ) .
    4. A → x                  4. A ::= x .

Rules 2, 3, and 4 of the EBNF grammar can be consolidated to:

    A ::= ASA | ( S ) | x .


The syntax for EBNF itself

EBNF is a notation for writing context-free grammars. We say it is a metalanguage, that is, a language for describing languages.

Nonterminal symbols will be written in non-bold type and/or enclosed in < . . . >.

Terminal symbols will be written in boldface and/or enclosed in ‘single quotes’.

Production rules. The nonterminal being defined is written at the left, followed by a “::=” sign (which we will pronounce as “becomes”). After this is a set of options, which define how the nonterminal can be expanded. The rule extends up to but does not include the “.” at the end.

When a nonterminal is expanded it is replaced by one of the options from its definition.

Blank spaces between the “::=” and the “.” are ignored.


Syntax for EBNF Production Rules

Alternatives are separated by vertical bars. This indicates that an ‘s’ may be replaced by an ‘a’ or a ‘bc’:

    s ::= a | bc .

Parentheses may be used to indicate grouping. For example, this indicates that an ‘s’ may be replaced by an ‘ad’ or a ‘bcd’:

    s ::= ( a | bc ) d .

Something enclosed in square brackets is optional. For example, this rule says that an ‘s’ may be replaced by an ‘ad’ or simply by a ‘d’:

    s ::= [a] d .

Zero or more repetitions of a unit is indicated by enclosing the unit in curly braces. This rule says that an ‘s’ may be replaced by a ‘db’, an ‘adb’, an ‘aadb’, or in general by any number of ‘a’s followed by a single ‘d’ and one or more ‘b’s:

    s ::= {a} d b {b} .
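Because none of these four sample rules is recursive, each one can be transcribed into a regular expression, which gives a quick way to test which strings a rule accepts. This is only a sketch; the table RULES and the helper matches are names invented here.

```python
import re

# Each sample EBNF rule, rewritten by hand as a regular expression:
# alternatives stay as |, [a] becomes a?, {a} becomes a*, and b {b} becomes b+.
RULES = {
    "s ::= a | bc .":       r"a|bc",
    "s ::= ( a | bc ) d .": r"(a|bc)d",
    "s ::= [a] d .":        r"a?d",
    "s ::= {a} d b {b} .":  r"a*db+",
}


def matches(rule: str, text: str) -> bool:
    """True if the whole of text can be produced from s under the given rule."""
    return re.fullmatch(RULES[rule], text) is not None


print(matches("s ::= [a] d .", "d"))            # True
print(matches("s ::= {a} d b {b} .", "aadbb"))  # True
print(matches("s ::= {a} d b {b} .", "ad"))     # False: at least one b is required
```

This shortcut stops working as soon as a rule refers to itself, as the Nested x’s grammar does.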


Example: An EBNF Grammar for Nested x’s.

Earlier, we showed the boldface version of this EBNF grammar. Here we show the machine-compatible version that uses quotes and angle brackets.

The starting symbol is S.

Nonterminal symbols are: S, A

Terminal symbols are: ‘x’ ‘(’ ‘)’

Productions:

    <S> ::= <A> .
    <A> ::= <A> <S> <A> | ‘(’ <S> ‘)’ | ‘x’ .

We use this grammar to generate some nested-x sentences. Each line is one derivation, read from left to right:

    S, A, x
    S, A, ASA, AAA, xxx
    S, A, ( S ), ( A ), ( ( S ) ), ( ( x ) )


Describing Programming Language Syntax

This grammar illustrates how matched and nested symbols are generated.

Start a derivation by writing down the starting symbol.

Apply rules to nonterminal symbols, in any order, to reach your goal.

Stop when all the nonterminals are gone.

Any rule that introduces a left-paren must also introduce a matching right-paren.

The grammar is recursive so that parenthesized units can be produced inside other pairs of parentheses.


Using the Grammar for Nested x’s

This language consists of strings of x’s in which zero or more matching and properly nested pairs of parentheses surround parts of the string (or all of it). The derivation below produces a sentence that starts with an x and ends with an x.

    S                        Starting symbol
    A                        Rule 1
    ASA                      Rule 2
    AAA                      Rule 1
    A( S )A                  Rule 3
    A( A )A                  Rule 1
    A( ASA )A                Rule 2
    A( ( S ) S ( S ) )A      Rule 3, twice
    A( ( A ) A ( A ) )A      Rule 1, three times
    x( ( x ) x ( x ) )x      Rule 4, five times
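The derivation can be replayed mechanically. The sketch below stores the four numbered rules and applies them one at a time to a list of symbols; the names RULES, apply_rule, STEPS, and derive are invented for illustration, and each step names the rule number and the list position of the nonterminal being rewritten.

```python
RULES = {
    1: ("S", ["A"]),
    2: ("A", ["A", "S", "A"]),
    3: ("A", ["(", "S", ")"]),
    4: ("A", ["x"]),
}


def apply_rule(form, rule_no, index):
    """Rewrite the nonterminal at position index using the numbered rule."""
    lhs, rhs = RULES[rule_no]
    if form[index] != lhs:
        raise ValueError(f"rule {rule_no} rewrites {lhs}, but found {form[index]}")
    return form[:index] + rhs + form[index + 1:]


# (rule number, position) pairs that reproduce the derivation step by step.
STEPS = [
    (1, 0), (2, 0), (1, 1), (3, 1), (1, 2), (2, 2),
    (3, 2), (3, 6),                              # Rule 3, twice
    (1, 3), (1, 5), (1, 7),                      # Rule 1, three times
    (4, 0), (4, 3), (4, 5), (4, 7), (4, 10),     # Rule 4, five times
]


def derive(steps):
    form = ["S"]
    history = [form]
    for rule_no, index in steps:
        form = apply_rule(form, rule_no, index)
        history.append(form)
    return history


for form in derive(STEPS):
    print(" ".join(form))
```

The last line printed is the finished sentence; the derivation stops there because no nonterminals remain.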


Example: An EBNF Grammar for Nonsense.

This grammar includes a loop and an optional element.

The starting symbol is S .

Nonterminal symbols are: S, stop

Terminal symbols are: A B C D E -

Productions:

    S ::= E { - E } B stop
    S ::= [ stop ] A stop
    stop ::= C | D

We use this grammar to generate four Nonsense sentences. Each line is one derivation, read from left to right:

    S, A stop, A D
    S, stop A stop, D A D
    S, E B stop, E B C
    S, E - E - E B stop, E - E - E B D
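A small hand-written recognizer shows how the loop and the optional element translate into code: the braces become a while loop and the square brackets become a call whose failure is ignored. This is a sketch under the assumption that tokens are separated by blanks; the function name is_nonsense and its helpers are hypothetical.

```python
def is_nonsense(sentence: str) -> bool:
    """Recognize sentences of the Nonsense grammar; tokens are blank-separated."""
    tokens = sentence.split()
    pos = 0

    def take(expected: str) -> bool:
        """Consume the next token if it equals expected."""
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == expected:
            pos += 1
            return True
        return False

    def stop() -> bool:                  # stop ::= C | D
        return take("C") or take("D")

    def s() -> bool:
        if take("E"):                    # S ::= E { - E } B stop
            while take("-"):             # the braces: zero or more repetitions
                if not take("E"):
                    return False
            return take("B") and stop()
        stop()                           # S ::= [ stop ] A stop: the leading stop is optional
        return take("A") and stop()

    return s() and pos == len(tokens)


for text in ["A D", "D A D", "E B C", "E - E - E B D", "E - B C"]:
    print(text, "->", is_nonsense(text))
```

The first four strings are the sentences derived above; the fifth is missing an E after the dash and is rejected.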


Syntax Diagrams

An alternative formal definition metalanguage was developed for Pascal; it is often called “railroad diagrams”. It has the same elements as EBNF, but they are presented in a 2D graphic format:

Terminal symbols are boldface and enclosed in ovals. Nonterminal symbols are written in non-bold type.

Production rules: the nonterminal being defined is written at the left, followed by an arrow.

Alternatives are shown by branches in the arrow.

To expand a nonterminal, follow some branch of the arrow to its end at the right.

An optional element is handled by an empty arrow branching around it.

Repetitions of a unit are shown by the arrow looping back on itself.


Nested x’s in Syntax Diagrams

We have an alternative and a recursive rule.

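[Figure: syntax diagrams for S and A. S follows a single path through A. A branches three ways: the sequence A S A, a parenthesized ( S ), or the terminal x.]

Sample derivations traced on the slide include:

    S, A, x
    S, A, ( S ), ( A ), ( x )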


Nonsense in Syntax Diagrams

Here is a looping rule and an optional element.

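[Figure: syntax diagrams for S and stop. S has two paths: E followed by zero or more repetitions of - E, then B and stop; or an optional stop, then A and stop. stop branches to C or D.]

Sample derivations traced on the slide:

    S, A stop, A D
    S, E B stop, E B C
    S, stop A stop, C A D
    S, E-E-E B stop, E-E-E B D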


The Definition of Pascal

Part 2

Here are large parts of the definition of Pascal. Productions involving type declarations have been omitted.

EBNF definition of a Pascal program

Syntax Diagrams for Pascal expressions


The Syntax for part of Pascal.

program ::= <program-heading> ‘;’ <program-block> ‘.’ .

program-heading ::= ‘program’ <identifier> [ ‘(’ <program-parameters> ‘)’ ] .

program-parameters ::= <identifier-list> .

identifier-list ::= <identifier> { ‘,’ <identifier> } .

program-block ::= <block> .

block ::= <label-declaration-part> <constant-declaration-part> <type-declaration-part> <variable-declaration-part> <procedure-and-function-declaration-part> <statement-part> .

variable-declaration-part ::= [ ‘var’ { <identifier-list> ‘:’ <typename> ‘;’ } ] .


Continuing with Pascal.

statement-part ::= <compound-statement> .

compound-statement ::= ‘begin’ <statement-sequence> ‘end’ .

statement-sequence ::= <statement> { ‘;’ <statement> } .

statement ::= [ <label> ‘:’ ] ( <simple-statement> | <structured-statement> ) .

simple-statement ::= <empty-statement> | <assignment-statement> | <procedure-call-statement> | <goto-statement> .

structured-statement ::= <compound-statement> | <conditional-statement> | <repetitive-statement> | <with-statement> .


Simple Statements in Pascal.

empty-statement ::= .

assignment-statement ::= ( <variable-reference> | <function-name> ) ‘:=’ <expression> .

procedure-call-statement ::= <IO-procedure-statement> | <procedure-identifier> [ ‘(’ <actual-parameter-list> ‘)’ ] .

IO-procedure-statement ::= ‘read’ <read-parameter-list> | ‘readln’ <readln-parameter-list> | ‘write’ <write-parameter-list> | ‘writeln’ <writeln-parameter-list> .

goto-statement ::= ‘goto’ <label> .

label-declaration ::= [ ‘label’ <label> { ‘,’ <label> } ] .

label ::= <digit-sequence> .


Conditionals in Pascal.

compound-statement ::= ‘begin’ <statement> { ‘;’ <statement> } ‘end’ .

conditional-statement ::= <if-statement> | <case-statement> .

if-statement ::= ‘if’ <boolean-expression> ‘then’ <statement> [ <else-part> ] .

else-part ::= ‘else’ <statement> .

case-statement ::= ‘case’ <case-index> ‘of’ <case-list-element> { ‘;’ <case-list-element> } [ ‘;’ ] ‘end’ .

case-list-element ::= <case-constant-list> ‘:’ <statement> .

case-constant-list ::= <case-constant> { ‘,’ <case-constant> } .

case-constant ::= <constant> .


Loops and With in Pascal.

repetitive-statement ::= <repeat-statement> | <while-statement> | <for-statement> .

repeat-statement ::= ‘repeat’ <statement-sequence> ‘until’ <boolean-expression> .

while-statement ::= ‘while’ <boolean-expression> ‘do’ <statement> .

for-statement ::= ‘for’ <control-variable> ‘:=’ <initial-value> ( ‘to’ | ‘downto’ ) <final-value> ‘do’ <statement> .

with-statement ::= ‘with’ <record-variable-list> ‘do’ <statement> .


Pascal Expressions

[Figure: syntax diagram for expression. A simple expression, optionally followed by one of the relational operators =, <, <=, <>, >=, >, in and a second simple expression.]

[Figure: syntax diagram for simple expression. A term, optionally preceded by a sign such as +, followed by zero or more further terms, each joined by an operator such as + or or.]

[Figure: syntax diagram for term. A factor followed by zero or more further factors, each joined by one of *, /, div, mod, and.]

[Figure: syntax diagram for function designator. A function identifier, optionally followed by a parenthesized list of actual parameters separated by commas.]

[Figure: syntax diagram for factor. One of: an unsigned constant, a variable, a function designator, a parenthesized expression, not followed by a factor, or a set value.]


Parsing

Ad-hoc Parsing

Parsing Based on EBNF


Old Languages were Parsed Ad-Hoc

These comments reflect FORTRAN-IV.

The language itself was created by collecting a lot of features. Everything about it was non-uniform and full of special cases. For example, there were half a dozen ways to punctuate a series of items. Syntax diagrams occupied 40 pages, versus 6 for Pascal.

Everything was made more difficult because the language definition said that spaces were ignored.

A FORTRAN-IV parser was basically hand-built. It would look at the next source-code character and try to figure out what it might be, given the current context.

This is a famous FORTRAN parsing problem that illustrates what is wrong with ad-hoc design:

    DO 200 I=1,10,2

Since DO200I is a legal variable name, we can’t know whether this is an assignment statement or a DO loop until the first comma-token is found.
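The decision can be sketched in a few lines. This is a deliberately simplified illustration, not how any real FORTRAN compiler was written: the function name classify is invented, and the sketch ignores labels, comments, and character constants. It removes the blanks, then scans the text after the equals sign for a comma that is not inside parentheses.

```python
def classify(statement: str) -> str:
    """Guess whether a FORTRAN-style statement is a DO loop or an assignment."""
    text = statement.replace(" ", "").upper()    # blanks are not significant
    if "=" not in text:
        return "other"
    right_side = text.split("=", 1)[1]
    depth = 0
    for ch in right_side:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0 and text.startswith("DO"):
            return "DO loop"                     # a top-level comma settles it
    return "assignment"


print(classify("DO 200 I=1,10,2"))       # DO loop
print(classify("DO 200 I=1.10"))         # assignment (to the variable DO200I)
print(classify("DO 200 I = MAX(1,N)"))   # assignment: the comma is inside parentheses
```

The second call shows why the problem is famous: replacing one comma with a period changes the kind of statement, and the parser cannot tell until it has read that far.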


Ad-Hoc Languages Today

We can list several current languages with no rhyme or reason in the design:

The C-shell, bash, tcsh, and other UNIX shell languages and scripts.

Perl

TeX and LaTeX

These are hard to learn and hard to write correctly. They are parsed and interpreted in an ad-hoc manner. Often the semantics are complicated and hard to understand.


Recursive Descent Parsing: LL(k) languages

A recursive descent parser is a top-down parser built from a set of mutually-recursive and/or non-recursive procedures.

Each procedure implements one of the rules of the grammar. Thus the structure of the parser closely mirrors that of the grammar it recognizes.

A linear-time parser can be built for any language in which a look-ahead of k input symbols allows the parser to decide which production to use next. (k is a non-negative integer constant.)

An ambiguous grammar cannot be parsed this way.

Also, the grammar cannot contain left-recursive rules, of the form expr ::= expr + term. However, right-recursive rules, of the form expr ::= term + expr, are not a problem.
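A procedure written directly from expr ::= expr + term would call itself before consuming any input and so would never return. One common workaround, sketched below, is to rewrite the rule with EBNF repetition as expr ::= term { + term } and implement the braces as a loop. The function name parse_expr is hypothetical, and a term is assumed to be any single token other than +.

```python
def parse_expr(tokens):
    """Parse expr ::= term { '+' term } and return a nested tuple as the parse tree."""
    pos = 0

    def term():
        nonlocal pos
        if pos < len(tokens) and tokens[pos] != "+":
            pos += 1
            return tokens[pos - 1]
        raise SyntaxError(f"expected a term at token {pos}")

    tree = term()
    while pos < len(tokens) and tokens[pos] == "+":
        pos += 1                         # consume the '+'
        tree = (tree, "+", term())       # group to the left, as the left-recursive rule would
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r} at position {pos}")
    return tree


print(parse_expr(["a", "+", "b", "+", "c"]))   # (('a', '+', 'b'), '+', 'c')
```

The loop keeps the left-to-right grouping that the left-recursive rule expresses, which the right-recursive form expr ::= term + expr would not.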


Recursive Descent Parsing is top down

The recursive descent parser starts with the starting symbol of the grammar and the beginning of the tokenized source-code file.

It then attempts to find a match for the left end of one of the possible definitions of the starting symbol.

If the left end is found, it calls itself recursively, with the rest of the source code, to find a match for the next part of the production.

This process works its way through the source code and down the list of productions. It will terminate successfully when the inner, recursive calls have all terminated and a match is found for the rightmost element in the original starting production.

If it fails at any point, it has recognized a syntax error.
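The process can be sketched for a cut-down version of the Pascal compound statement. The subset is an assumption made for brevity: a statement is either a nested compound statement or an assignment of one name to another, the empty statement is left out, and tokens are separated by blanks. The class Parser and the function check are hypothetical names.

```python
class Parser:
    """One method per grammar rule; each method consumes the tokens it matches."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, found {self.peek()!r}")
        self.pos += 1

    def compound_statement(self):        # 'begin' statement-sequence 'end'
        self.expect("begin")
        self.statement_sequence()
        self.expect("end")

    def statement_sequence(self):        # statement { ';' statement }
        self.statement()
        while self.peek() == ";":
            self.expect(";")
            self.statement()

    def statement(self):                 # compound-statement | assignment
        if self.peek() == "begin":       # one token of look-ahead picks the rule
            self.compound_statement()
        else:
            self.assignment()

    def assignment(self):                # name ':=' name
        self.name()
        self.expect(":=")
        self.name()

    def name(self):
        token = self.peek()
        if token is None or not token.isalnum() or token in ("begin", "end"):
            raise SyntaxError(f"expected a name, found {token!r}")
        self.pos += 1


def check(source: str) -> bool:
    parser = Parser(source.split())
    try:
        parser.compound_statement()      # start with the starting symbol
    except SyntaxError:
        return False                     # a failure anywhere is a syntax error
    return parser.pos == len(parser.tokens)


print(check("begin a := b ; begin c := d end end"))   # True
print(check("begin a := b"))                          # False: 'end' is missing
```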


Recursive Descent Parsing with Backtracking

A less-efficient top-down method exists for grammars that do not meet the criteria above.

The parser works as above, but if it fails at any point, it will backtrack and try another option from the current production.

This process will terminate when it succeeds or when all possibilities have been attempted.

Parsers that use recursive descent with backtracking may require exponential time.
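Backtracking can be sketched with a made-up grammar, S ::= a S a | a a, for which no fixed look-ahead tells the parser which option to choose. In the sketch below the function match_s yields every position at which a match of S could end, so that a caller whose next step fails can simply ask for the next possibility; both function names are hypothetical.

```python
def match_s(text: str, start: int):
    """Yield every position where a match of S beginning at start could end."""
    # Option 1: S ::= 'a' S 'a'
    if text.startswith("a", start):
        for middle_end in match_s(text, start + 1):
            if text.startswith("a", middle_end):
                yield middle_end + 1
            # otherwise: backtrack and try the next way of matching the inner S
    # Option 2: S ::= 'a' 'a'
    if text.startswith("aa", start):
        yield start + 2


def accepts(text: str) -> bool:
    """Succeed if any combination of options consumes the whole input."""
    return any(end == len(text) for end in match_s(text, 0))


print(accepts("aaaa"))   # True
print(accepts("aaa"))    # False
```

Every failed attempt re-examines input that was already scanned, which is where the extra running time comes from.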


LR Parsers are bottom up

An LR(k) parser analyzes the source code from left to right with a look-ahead of k input tokens.

An LR parser starts with the leaves of the parse tree (the tokens) and attempts to build up from there to the starting symbol.

It detects a syntactic error when the input does not conform to the grammar.

The syntax of many programming languages can be defined by a grammar that is LR(1), or close to being so, and for this reason LR parsers are often used by compilers to perform syntax analysis of source code.

LR parsers are difficult to produce by hand and they are usually constructed by a parser generator or a compiler-compiler.
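The bottom-up idea can be sketched with a hand-coded shift-reduce loop. This is not a table-driven LR parser of the kind a generator produces; it is a simplified illustration for a made-up grammar, E ::= E + T | T and T ::= ( E ) | x, that reduces whenever the top of the stack matches the right side of a rule. The function name shift_reduce is hypothetical.

```python
def shift_reduce(tokens) -> bool:
    """Recognize E ::= E '+' T | T ;  T ::= '(' E ')' | 'x'  by shifting and reducing."""
    stack = []
    for token in list(tokens) + [None]:          # None marks the end of the input
        while True:                              # reduce as far as possible
            if stack[-1:] == ["x"]:
                stack[-1:] = ["T"]               # T ::= x
            elif stack[-3:] == ["(", "E", ")"]:
                stack[-3:] = ["T"]               # T ::= ( E )
            elif stack[-3:] == ["E", "+", "T"]:
                stack[-3:] = ["E"]               # E ::= E + T
            elif stack[-1:] == ["T"]:
                stack[-1:] = ["E"]               # E ::= T
            else:
                break
        if token is not None:
            stack.append(token)                  # shift the next token
    return stack == ["E"]                        # success: only the starting symbol remains


print(shift_reduce("x + ( x + x )".split()))   # True
print(shift_reduce("( x".split()))             # False
```

The rule E ::= E + T is left-recursive, which would rule out recursive descent, but it causes no difficulty when the tree is built from the leaves upward.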


Compiler Compilers

A compiler compiler is a program whose input is a description of the language and whose output is a compiler. Yacc is a well-known Unix compiler compiler. The GNU version is called Bison. The inputs are:

A formal definition of the language’s lexical structure (expressed in EBNF).

A formal definition of preprocessing directives, if any, and their corresponding actions.

The EBNF definition of the language syntax, given that tokens have already been identified.

The code to be generated for each fully-parsed nonterminal symbol in the grammar.

The compiler compiler produces the compiler that will build a parse tree (front end) and transmute the tree into the corresponding object code (back end).


Homework

Homework 4

1. Generate a sentence of Nested x’s (according to the Nested x’s grammar given earlier in this lecture) that is longer than 10 terminals.

2. Invent three Nested x sentences with syntax errors. Choose answers that are almost legal according to the same grammar, but have omissions or extra inclusions. Your answers should have different kinds of errors.

3. Write an EBNF rule that defines a FORTH infinite loop. This is a loop with no while inside it. Look up the precise definition in the FORTH reference spreadsheet. If you cannot find it, ask me.

4. Look up the syntax for the FORTH counted loop and draw a syntax diagram for it. There are two forms, one where you add to the loop variable and the other where you subtract from it. Handle both.

5. Write the following FORTH function that uses an if statement and a counted loop. Take one parameter (a number) off the stack. If it is less than 3, print an error comment. Otherwise, print the word "hooray" that many times.
