LANGUAGE TRANSLATORS: WEEK 14 LECTURE: REGULAR EXPRESSIONS FINITE STATE MACHINES LEXICAL ANALYSERS INTRO TO GRAMMAR THEORY TUTORIAL: CAPTURING LANGUAGES USING REGULAR EXPRESSIONS

Upload: lynne-casey

Post on 03-Jan-2016

LANGUAGE TRANSLATORS: WEEK 14

LECTURE:

REGULAR EXPRESSIONS

FINITE STATE MACHINES

LEXICAL ANALYSERS

INTRO TO GRAMMAR THEORY

TUTORIAL:

CAPTURING LANGUAGES USING REGULAR EXPRESSIONS

LEXICAL ANALYSIS

Lexical analysis is the first step in the translation/compilation process

input language ====> output language

It means grouping the raw characters of the input into TOKENS.

LEXICAL ANALYSIS PHASE

The language of TOKENS (e.g. identifiers) is always a regular language.

REGULAR EXPRESSIONS generate regular languages (as do Regular Grammars).

The tokens of languages are often specified by regular expressions.

Finite State Machines recognise (consume) regular languages.

REGULAR EXPRESSIONS

A one-line method of specifying a language

Equivalent to `type 3’ or regular grammars

Used to parameterize UNIX/LINUX file processing commands

REGULAR EXPRESSIONS - DEFINITION

EXAMPLE              DEFINITION

a | b                ‘|’ means choice

a | b | c = [abc]    ‘[..]’ is shorthand for multiple choice

ε                    ‘ε’ means the empty word

(abc)*               ‘*’ means repetition 0, 1 or more times

(abcd)+              ‘+’ means repetition 1 or more times

REGULAR EXPRESSIONS - EXAMPLES

[a-zA-Z][a-zA-Z0-9]* defines the language of IDENTIFIERS in some programming languages

(xyz)* defines the language { ε, xyz, xyzxyz, xyzxyzxyz, .. }, where ε is the empty word

[abcd]+ defines the language {a, b, c, d, aa, ab, ac, ad, ba, bb, bc, bd, ca, ..}

Putting choice and repetition together produces complicated regular languages
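
The three example expressions can be tried out directly. The sketch below assumes Java's java.util.regex syntax, which writes them almost exactly as on the slide; the class and method names are illustrative only.

```java
import java.util.regex.Pattern;

class RegexExamples {
    // The three expressions from the slide, in java.util.regex notation.
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");
    static final Pattern XYZ_STAR = Pattern.compile("(xyz)*");
    static final Pattern ABCD_PLUS = Pattern.compile("[abcd]+");

    // True when the WHOLE string belongs to the language, not just a prefix of it.
    static boolean inLanguage(Pattern language, String s) {
        return language.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(inLanguage(IDENTIFIER, "count1")); // true
        System.out.println(inLanguage(IDENTIFIER, "1count")); // false: starts with a digit
        System.out.println(inLanguage(XYZ_STAR, ""));         // true: zero repetitions
        System.out.println(inLanguage(XYZ_STAR, "xyzxy"));    // false: incomplete repetition
        System.out.println(inLanguage(ABCD_PLUS, "abba"));    // true
        System.out.println(inLanguage(ABCD_PLUS, ""));        // false: '+' needs at least one
    }
}
```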

Finite State Machines

Can be defined by annotated nodes and arcs.

Regular expressions can be translated into FSMs, but ERROR STATES must be added to the FSMs.
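
As a rough illustration of an FSM with an explicit error state, here is a table-driven machine for the identifier language (a letter followed by letters or digits). The state names, the three character classes and the table layout are choices made for this sketch, not something fixed by the slides.

```java
class IdentifierFsm {
    // Nodes of the machine. ERROR is the added error state: once entered, it is never left.
    static final int START = 0, IN_ID = 1, ERROR = 2;

    // Column of the transition table for a character: 0 = letter, 1 = digit, 2 = anything else.
    static int charClass(char c) {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) return 0;
        if (c >= '0' && c <= '9') return 1;
        return 2;
    }

    // NEXT[state][charClass] is the arc to follow. Every row is complete,
    // so the machine has a defined move for every possible input character.
    static final int[][] NEXT = {
        /* START */ { IN_ID, ERROR, ERROR },
        /* IN_ID */ { IN_ID, IN_ID, ERROR },
        /* ERROR */ { ERROR, ERROR, ERROR },
    };

    // One table lookup per input character; IN_ID is the only accepting state.
    static boolean accepts(String input) {
        int state = START;
        for (int i = 0; i < input.length(); i++) {
            state = NEXT[state][charClass(input.charAt(i))];
        }
        return state == IN_ID;
    }

    public static void main(String[] args) {
        System.out.println(accepts("x27"));  // true
        System.out.println(accepts("27x"));  // false
        System.out.println(accepts("a_b"));  // false
    }
}
```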

Regular Expression ==> NDFSM

[Diagrams: NDFSMs for the expressions ab, [ab] and a*, with arcs labelled a and b]

then NDFSM ==> FSM..

Example

Specify a language of alphabet {w, x, y, z} with the only restrictions being that

1. no strings contain both x and y, and
2. if there is a y and a w in a string, then the first w ALWAYS occurs before the first y

SOLUTION:

1. Write down examples and counter-examples
2. Decide on any ambiguities
3. Use Case Analysis to sub-divide the problem

language = (a) strings of {w, x, z} UNION (b) strings of {w, y, z} with restriction 2.

- Part (a): = [w x z]+
- Part (b): can assume y is always in a string = [y z]+ | z* w [w z]* y [w y z]*
- Put together: answer = [w x z]+ | [y z]+ | z* w [w z]* y [w y z]*
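
One way to gain confidence in an answer like this is to compare it against the restrictions by brute force. The sketch below is illustrative only: it takes the combined answer with the final class written as [w y z]*, so that no x can follow a y, and checks every non-empty string over {w, x, y, z} up to a chosen length. The empty string is left out because the '+' forms exclude it. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class CaseAnalysisCheck {
    // Combined answer: part (a) | part (b) without w | part (b) with w.
    static final Pattern ANSWER = Pattern.compile("[wxz]+|[yz]+|z*w[wz]*y[wyz]*");

    // The two restrictions, stated directly rather than as a regular expression.
    static boolean satisfiesRestrictions(String s) {
        int firstW = s.indexOf('w');
        int firstY = s.indexOf('y');
        boolean hasX = s.indexOf('x') >= 0;
        if (hasX && firstY >= 0) return false;                            // restriction 1
        if (firstW >= 0 && firstY >= 0 && firstY < firstW) return false;  // restriction 2
        return true;
    }

    // Counts the non-empty strings of length <= maxLen on which the pattern
    // and the restrictions give different verdicts.
    static int countDisagreements(Pattern pattern, int maxLen) {
        int disagreements = 0;
        List<String> current = new ArrayList<>();
        current.add("");
        for (int len = 1; len <= maxLen; len++) {
            List<String> longer = new ArrayList<>();
            for (String s : current) {
                for (char c : "wxyz".toCharArray()) {
                    String t = s + c;
                    longer.add(t);
                    if (pattern.matcher(t).matches() != satisfiesRestrictions(t)) {
                        disagreements++;
                    }
                }
            }
            current = longer;
        }
        return disagreements;
    }

    public static void main(String[] args) {
        System.out.println(countDisagreements(ANSWER, 6)); // prints 0
    }
}
```

A check like this only covers strings up to the chosen length, so it supports the case analysis rather than replacing it.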

A LEXICAL ANALYSER - GENERATOR (e.g. LEX, JLEX) - how they work

INPUT REGULAR EXPRESSIONS

TRANSLATE REGULAR EXPRESSION INTO NON-DETERMINISTIC FSM

TRANSLATE NON-DETERMINISTIC FSM INTO DETERMINISTIC FSM (which is easily described as a simple program)
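
The second translation step is commonly done with the subset construction, in which each state of the deterministic machine is a set of states of the non-deterministic one. The sketch below is a minimal version under stated assumptions: it is not how LEX or JLEX are actually implemented, the representation (nested maps, character 0 standing for an empty move) is chosen for brevity, and all names are hypothetical.

```java
import java.util.*;

class SubsetConstruction {
    static final char EPS = 0; // label used in this sketch for moves that read no input

    // nfa.get(state).get(label) = set of states reachable by one arc with that label.
    final Map<Integer, Map<Character, Set<Integer>>> nfa = new HashMap<>();

    void addArc(int from, char label, int to) {
        nfa.computeIfAbsent(from, k -> new HashMap<>())
           .computeIfAbsent(label, k -> new TreeSet<>())
           .add(to);
    }

    Set<Integer> targets(int state, char label) {
        return nfa.getOrDefault(state, Collections.emptyMap())
                  .getOrDefault(label, Collections.emptySet());
    }

    // All states reachable from the given ones using empty moves only.
    Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int t : targets(s, EPS)) {
                if (result.add(t)) work.push(t);
            }
        }
        return result;
    }

    // Deterministic machine: each key is a set of NFA states; the empty set acts as the error state.
    Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(int start, String alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> first = closure(Collections.singleton(start));
        dfa.put(first, new HashMap<>());
        work.push(first);
        while (!work.isEmpty()) {
            Set<Integer> current = work.pop();
            for (char c : alphabet.toCharArray()) {
                Set<Integer> moved = new TreeSet<>();
                for (int s : current) moved.addAll(targets(s, c));
                Set<Integer> next = closure(moved);
                if (!dfa.containsKey(next)) {
                    dfa.put(next, new HashMap<>());
                    work.push(next);
                }
                dfa.get(current).put(c, next);
            }
        }
        return dfa;
    }

    // Runs the deterministic machine: one lookup per input character.
    boolean accepts(int start, int accept, String alphabet, String input) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = toDfa(start, alphabet);
        Set<Integer> state = closure(Collections.singleton(start));
        for (char c : input.toCharArray()) {
            state = dfa.get(state).get(c);
            if (state == null) return false; // character outside the alphabet
        }
        return state.contains(accept);
    }

    public static void main(String[] args) {
        // NDFSM for a*b: 0 -a-> 0, 0 -(empty)-> 1, 1 -b-> 2, with 2 accepting.
        SubsetConstruction m = new SubsetConstruction();
        m.addArc(0, 'a', 0);
        m.addArc(0, EPS, 1);
        m.addArc(1, 'b', 2);
        System.out.println(m.toDfa(0, "ab").keySet());    // [[0, 1], [2], []]
        System.out.println(m.accepts(0, 2, "ab", "aab")); // true
        System.out.println(m.accepts(0, 2, "ab", "aba")); // false
    }
}
```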

EXAMPLE INPUT TO A LEXICAL ANALYSER - GENERATOR

%%
";" { return new Symbol(sym.SEMI); }
"+" { return new Symbol(sym.PLUS); }
"*" { return new Symbol(sym.TIMES); }
"(" { return new Symbol(sym.LPAREN); }
")" { return new Symbol(sym.RPAREN); }
[0-9]+ { return new Symbol(sym.NUMBER, new Integer(yytext())); }
[ \t\r\n\f] { /* ignore white space. */ }
. { System.err.println("Illegal character: "+yytext()); }

Example: if the string (231+3)*3 was input to the generated lexical analyser the output would be:

LPAREN (NUMBER,231) PLUS (NUMBER,3) RPAREN TIMES (NUMBER,3)
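
To make the behaviour of such a specification concrete, here is a small hand-written Java sketch that applies the same rules and prints the same token sequence for (231+3)*3. It is not the code a generator would produce; tokens are plain strings instead of Symbol objects, and the class name is made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

class TinyLexer {
    // Applies the rules of the specification by hand, one token per loop iteration.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c >= '0' && c <= '9') {
                // [0-9]+ : take the longest run of digits as one NUMBER token.
                int start = i;
                while (i < input.length() && input.charAt(i) >= '0' && input.charAt(i) <= '9') {
                    i++;
                }
                tokens.add("(NUMBER," + Integer.parseInt(input.substring(start, i)) + ")");
                continue;
            }
            switch (c) {
                case ';': tokens.add("SEMI"); break;
                case '+': tokens.add("PLUS"); break;
                case '*': tokens.add("TIMES"); break;
                case '(': tokens.add("LPAREN"); break;
                case ')': tokens.add("RPAREN"); break;
                case ' ': case '\t': case '\r': case '\n': case '\f':
                    break; // ignore white space
                default:
                    System.err.println("Illegal character: " + c);
            }
            i++;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints: LPAREN (NUMBER,231) PLUS (NUMBER,3) RPAREN TIMES (NUMBER,3)
        System.out.println(String.join(" ", tokenize("(231+3)*3")));
    }
}
```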

Simple Lexical Analyser

// Symbol and sym are not defined on this slide; they are assumed to be
// supplied by the parser side of the translator.
public class scanner {
    /* single lookahead character */
    protected static int next_char;

    /* read the next character of input into the lookahead */
    protected static void advance() throws java.io.IOException {
        next_char = System.in.read();
    }

    /* prime the lookahead before the first token is requested */
    public static void init() throws java.io.IOException {
        advance();
    }

    /* recognise and return the next complete token */
    public static Symbol next_token() throws java.io.IOException {
        for (;;) {
            switch (next_char) {
                case '0': case '1': case '2': case '3': case '4':
                case '5': case '6': case '7': case '8': case '9':
                    /* parse a decimal integer */
                    int i_val = 0;
                    do {
                        i_val = i_val * 10 + (next_char - '0');
                        advance();
                    } while (next_char >= '0' && next_char <= '9');
                    return new Symbol(sym.INT, Integer.valueOf(i_val));
                case 'p': advance(); return new Symbol(sym.PRINT);
                case 'r': advance(); return new Symbol(sym.REPEAT);
                case 'u': advance(); return new Symbol(sym.UNTIL);
                case '=': advance(); return new Symbol(sym.ASSIGNS);
                case ';': advance(); return new Symbol(sym.SEMI);
                case '+': advance(); return new Symbol(sym.PLUS);
                case '-': advance(); return new Symbol(sym.MINUS);
                case '(': advance(); return new Symbol(sym.LPAREN);
                case ')': advance(); return new Symbol(sym.RPAREN);
                case 'x': advance(); return new Symbol(sym.ID, "x");
                case 'y': advance(); return new Symbol(sym.ID, "y");
                case 'z': advance(); return new Symbol(sym.ID, "z");
                case -1: return new Symbol(sym.EOF);
                default:
                    /* skip any character that does not start a token */
                    advance();
                    break;
            }
        }
    }
}

Introduction to Grammar Theory

Grammars can be used to generate the syntax of all formal languages – the structural complexity of a language is determined by the simplest grammar that can generate it.

In order to create parsers, we are interested in “properties of grammars”. For example, the “first set” of a string w of terminals and non-terminals is the set of TERMINAL symbols (tokens) that may be at the front of ANY string derived from w using the grammar rules.
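
The first set can be computed by repeatedly applying the grammar rules until nothing new is added. The sketch below does this for a small arithmetic grammar invented for the illustration (E, E', T, T', F are hypothetical non-terminals). It also follows the common convention of adding a marker, here "<empty>", to the first set of anything that can derive the empty word; that marker is an addition to the definition given above.

```java
import java.util.*;

class FirstSets {
    static final String EMPTY = "<empty>"; // marker: the string can derive the empty word

    // Rules: non-terminal -> alternatives, each a list of symbols.
    // An empty alternative stands for the empty word. Any symbol without rules is a terminal.
    static final Map<String, List<List<String>>> RULES = new LinkedHashMap<>();
    static {
        RULES.put("E",  Arrays.asList(Arrays.asList("T", "E'")));
        RULES.put("E'", Arrays.asList(Arrays.asList("+", "T", "E'"), Collections.<String>emptyList()));
        RULES.put("T",  Arrays.asList(Arrays.asList("F", "T'")));
        RULES.put("T'", Arrays.asList(Arrays.asList("*", "F", "T'"), Collections.<String>emptyList()));
        RULES.put("F",  Arrays.asList(Arrays.asList("(", "E", ")"), Arrays.asList("NUMBER")));
    }

    // First set of a string of symbols, given the first sets of the non-terminals so far.
    static Set<String> firstOfString(List<String> symbols, Map<String, Set<String>> first) {
        Set<String> result = new TreeSet<>();
        for (String symbol : symbols) {
            if (!RULES.containsKey(symbol)) { // a terminal is always its own first symbol
                result.add(symbol);
                return result;
            }
            Set<String> f = first.get(symbol);
            for (String t : f) {
                if (!t.equals(EMPTY)) result.add(t);
            }
            if (!f.contains(EMPTY)) return result; // this symbol cannot vanish, so stop here
        }
        result.add(EMPTY); // every symbol in the string can vanish
        return result;
    }

    // Grows the first sets of all non-terminals until a full pass adds nothing.
    static Map<String, Set<String>> firstOfNonTerminals() {
        Map<String, Set<String>> first = new LinkedHashMap<>();
        for (String nonTerminal : RULES.keySet()) first.put(nonTerminal, new TreeSet<>());
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, List<List<String>>> rule : RULES.entrySet()) {
                for (List<String> alternative : rule.getValue()) {
                    if (first.get(rule.getKey()).addAll(firstOfString(alternative, first))) {
                        changed = true;
                    }
                }
            }
        }
        return first;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> first = firstOfNonTerminals();
        System.out.println(first.get("E"));  // [(, NUMBER]
        System.out.println(first.get("E'")); // [+, <empty>]
        System.out.println(firstOfString(Arrays.asList("E'", ")"), first)); // [), +]
    }
}
```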

Summary:

Regular expressions are a quick and easy way to specify simple forms of language. They can be easily translated into FSMs (which have nice properties, e.g. their running time is linear in the length of the input).

Tools such as JLEX take regular expressions as input and output a lexical analyser which recognises the language they define.