language translators: week 14 lecture: regular expressions finite state machines lexical analysers...

LANGUAGE TRANSLATORS: WEEK 14

LECTURE:

REGULAR EXPRESSIONS

FINITE STATE MACHINES

LEXICAL ANALYSERS

INTRO TO GRAMMAR THEORY

TUTORIAL:

CAPTURING LANGUAGES USING REGULAR EXPRESSIONS

LEXICAL ANALYSIS

Lexical analysis is the first step in the translation/compilation process

input language ====> output language

It means grouping the raw characters of the input into TOKENS.

LEXICAL ANALYSIS PHASE

The language of TOKENS (e.g. identifiers) is always a regular language.

REGULAR EXPRESSIONS generate regular languages (as do Regular Grammars). The tokens of languages are often specified by regular expressions.

Finite State Machines consume regular languages.

REGULAR EXPRESSIONS

A one-line method of specifying a language

Equivalent to 'type 3' or regular grammars

Used to parameterize UNIX/LINUX file processing commands

REGULAR EXPRESSIONS - DEFINITION

EXAMPLE               DEFINITION

a | b                 '|' means choice

a | b | c = [abc]     '[..]' is shorthand for multiple choice

''                    '' means the empty word

(abc)*                '*' means repetition 0, 1 or more times

(abcd)+               '+' means repetition 1 or more times

REGULAR EXPRESSIONS - EXAMPLES

[a-zA-Z][a-zA-Z0-9]* defines the language of IDENTIFIERS in some programming languages

(xyz)* defines the language { '', xyz, xyzxyz, xyzxyzxyz, .. } where '' is the empty word

[abcd]+ defines the language {a, b, c, d, aa, ab, ac, ad, ba, bb, bc, bd, ca, ..}

Putting choice and repetition together produces complicated regular languages
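The three example expressions can be tried out directly. The following is a minimal sketch using Java's standard java.util.regex package; the class name RegexExamples and the helper in() are made up for illustration and are not part of the lecture material.

```java
import java.util.regex.Pattern;

class RegexExamples {
    // The three expressions from the examples, written in java.util.regex syntax.
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");
    static final Pattern XYZ_STAR = Pattern.compile("(xyz)*");
    static final Pattern ABCD_PLUS = Pattern.compile("[abcd]+");

    // matches() succeeds only if the WHOLE word belongs to the language.
    static boolean in(Pattern language, String word) {
        return language.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(in(IDENTIFIER, "count1")); // true
        System.out.println(in(IDENTIFIER, "1count")); // false: starts with a digit
        System.out.println(in(XYZ_STAR, ""));         // true: zero repetitions
        System.out.println(in(XYZ_STAR, "xyzxy"));    // false: incomplete repetition
        System.out.println(in(ABCD_PLUS, "abba"));    // true
        System.out.println(in(ABCD_PLUS, ""));        // false: '+' needs at least one
    }
}
```

Note that matches() tests the whole word, which corresponds to membership of the language; find() would only look for a matching substring.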

Finite State Machines

Can be defined by annotated nodes and arcs.

Regular expressions can be translated into FSMs, but ERROR STATES must be added to the FSMs.

[Figure: Regular Expression ==> NDFSM, showing the non-deterministic FSMs for ab, [ab] and a*; then NDFSM ==> FSM, with arcs labelled a and b]
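To make the role of the error state concrete, here is a small sketch of a deterministic FSM for the identifier expression [a-zA-Z][a-zA-Z0-9]*, written by hand rather than produced by any tool. The state names START, IN_ID and ERROR and the class name IdentifierFsm are assumptions chosen for this illustration.

```java
class IdentifierFsm {
    // Nodes of the machine. IN_ID is the only accepting state.
    static final int START = 0, IN_ID = 1, ERROR = 2;

    static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // One arc of the machine: the state reached from 'state' on input 'c'.
    static int step(int state, char c) {
        switch (state) {
            case START: return isLetter(c) ? IN_ID : ERROR;
            case IN_ID: return (isLetter(c) || isDigit(c)) ? IN_ID : ERROR;
            default:    return ERROR; // the error state is a trap: no arc leaves it
        }
    }

    // Runs the machine over the whole input, one character per step.
    static boolean accepts(String input) {
        int state = START;
        for (int i = 0; i < input.length(); i++) {
            state = step(state, input.charAt(i));
        }
        return state == IN_ID;
    }

    public static void main(String[] args) {
        System.out.println(accepts("x1")); // true
        System.out.println(accepts("1x")); // false: first arc leads to ERROR
        System.out.println(accepts(""));   // false: START is not accepting
    }
}
```

Every character that has no arc in the diagram is sent to ERROR, which is what makes the machine total: it has exactly one next state for every state and input character.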

Example

Specify a language of alphabet {w, x, y, z} with the only restrictions being that

1. no strings contain both x and y, and

2. if there is a y and a w in a string, then the first w ALWAYS occurs before the first y

SOLUTION:

1. Write down examples and counter-examples

2. Decide on any ambiguities

3. Use Case Analysis to sub-divide the problem

language = (a) strings of {w, x, z} UNION (b) strings of {w, y, z} with restriction 2

- Part (a): = [wxz]+

- Part (b): can assume y is always in a string = [yz]+ | z* w [wz]* y [wyz]*

- Put together: answer = [wxz]+ | [yz]+ | z* w [wz]* y [wyz]*
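An answer of this kind can be checked mechanically by comparing the regular expression against a direct statement of the two restrictions on every short string. The sketch below does this for the expression [wxz]+ | [yz]+ | z* w [wz]* y [wyz]*. Note that the final character class has to be [wyz]: after the first y, further w's are allowed but an x is not. The class and method names are made up for illustration, and the empty string is assumed to be excluded, since every alternative uses '+' or contains a w and a y.

```java
import java.util.regex.Pattern;

class WxyzLanguage {
    static final Pattern ANSWER =
        Pattern.compile("[wxz]+|[yz]+|z*w[wz]*y[wyz]*");

    // Direct statement of the two restrictions, for non-empty strings.
    static boolean satisfiesRestrictions(String s) {
        if (s.isEmpty()) return false;          // assumption: empty string excluded
        int firstW = s.indexOf('w');
        int firstY = s.indexOf('y');
        boolean hasX = s.indexOf('x') >= 0;
        if (hasX && firstY >= 0) return false;  // restriction 1: not both x and y
        if (firstW >= 0 && firstY >= 0 && firstW > firstY) {
            return false;                       // restriction 2: first w before first y
        }
        return true;
    }

    // Counts strings of length 1..maxLen on which regex and predicate disagree.
    static int countDisagreements(int maxLen) {
        return count("", maxLen);
    }

    private static int count(String prefix, int remaining) {
        int bad = 0;
        if (!prefix.isEmpty()
                && ANSWER.matcher(prefix).matches() != satisfiesRestrictions(prefix)) {
            bad++;
        }
        if (remaining > 0) {
            for (char c : "wxyz".toCharArray()) {
                bad += count(prefix + c, remaining - 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println(countDisagreements(6)); // 0
    }
}
```

Exhaustive testing up to a fixed length is not a proof, but it catches most slips in a hand-written expression, and it doubles as step 1 of the method (examples and counter-examples).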

A LEXICAL ANALYSER - GENERATOR (e.g. LEX, JLEX) - how they work

INPUT REGULAR EXPRESSIONS

TRANSLATE REGULAR EXPRESSION INTO NON-DETERMINISTIC FSM

TRANSLATE NON-DETERMINISTIC FSM INTO DETERMINISTIC FSM (which is easily described as a simple program)

EXAMPLE INPUT TO A LEXICAL ANALYSER - GENERATOR

    %%
    ";" { return new Symbol(sym.SEMI); }
    "+" { return new Symbol(sym.PLUS); }
    "*" { return new Symbol(sym.TIMES); }
    "(" { return new Symbol(sym.LPAREN); }
    ")" { return new Symbol(sym.RPAREN); }
    [0-9]+ { return new Symbol(sym.NUMBER, new Integer(yytext())); }
    [ \t\r\n\f] { /* ignore white space. */ }
    . { System.err.println("Illegal character: "+yytext()); }

Example: if the string (231+3)*3 was input to the generated lexical analyser, the output would be:

    LPAREN (NUMBER,231) PLUS (NUMBER,3) RPAREN TIMES (NUMBER,3)
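The behaviour of the generated analyser on this input can be imitated without the generator. The following is only a sketch using java.util.regex with one named group per rule; it is not how LEX or JLEX work internally (they build a deterministic FSM), and the names TinyLexer, RULES and tokenize are assumptions made for this illustration. Java tries the alternatives in order rather than taking the longest match, which gives the same result for this particular rule set because no two rules can start with the same character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TinyLexer {
    // One named group per rule, in the same order as the rules in the specification.
    // DOTALL lets the final '.' rule catch every character the other rules miss.
    static final Pattern RULES = Pattern.compile(
        "(?<SEMI>;)|(?<PLUS>\\+)|(?<TIMES>\\*)|(?<LPAREN>\\()|(?<RPAREN>\\))"
        + "|(?<NUMBER>[0-9]+)|(?<WS>[ \\t\\r\\n\\f])|(?<ILLEGAL>.)",
        Pattern.DOTALL);

    static final String[] TOKEN_NAMES =
        {"SEMI", "PLUS", "TIMES", "LPAREN", "RPAREN", "NUMBER"};

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = RULES.matcher(input);
        while (m.find()) {
            if (m.group("WS") != null) continue; // ignore white space
            if (m.group("ILLEGAL") != null) {
                System.err.println("Illegal character: " + m.group());
                continue;
            }
            for (String name : TOKEN_NAMES) {
                if (m.group(name) != null) {
                    // NUMBER carries its value; the other tokens are just names.
                    tokens.add(name.equals("NUMBER")
                        ? "(NUMBER," + Integer.parseInt(m.group()) + ")"
                        : name);
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", tokenize("(231+3)*3")));
        // prints: LPAREN (NUMBER,231) PLUS (NUMBER,3) RPAREN TIMES (NUMBER,3)
    }
}
```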

Simple Lexical Analyser

    // Assumed: Symbol comes from the CUP runtime and sym is the
    // constants class generated by CUP for this grammar.
    import java_cup.runtime.Symbol;

    public class scanner {

        /* single lookahead character */
        protected static int next_char;

        /* advance input by one character */
        protected static void advance()
            throws java.io.IOException
        { next_char = System.in.read(); }

        /* initialize the scanner */
        public static void init()
            throws java.io.IOException
        { advance(); }

        /* recognize and return the next complete token */
        public static Symbol next_token()
            throws java.io.IOException
        {
            for (;;)
                switch (next_char) {
                    case '0': case '1': case '2': case '3': case '4':
                    case '5': case '6': case '7': case '8': case '9':
                        /* parse a decimal integer */
                        int i_val = 0;
                        do {
                            i_val = i_val * 10 + (next_char - '0');
                            advance();
                        } while (next_char >= '0' && next_char <= '9');
                        return new Symbol(sym.INT, new Integer(i_val));

                    /* single-character keywords and operators */
                    case 'p': advance(); return new Symbol(sym.PRINT);
                    case 'r': advance(); return new Symbol(sym.REPEAT);
                    case 'u': advance(); return new Symbol(sym.UNTIL);
                    case '=': advance(); return new Symbol(sym.ASSIGNS);
                    case ';': advance(); return new Symbol(sym.SEMI);
                    case '+': advance(); return new Symbol(sym.PLUS);
                    case '-': advance(); return new Symbol(sym.MINUS);
                    case '(': advance(); return new Symbol(sym.LPAREN);
                    case ')': advance(); return new Symbol(sym.RPAREN);

                    /* the only identifiers are x, y and z */
                    case 'x': advance(); return new Symbol(sym.ID, "x");
                    case 'y': advance(); return new Symbol(sym.ID, "y");
                    case 'z': advance(); return new Symbol(sym.ID, "z");

                    /* end of input */
                    case -1: return new Symbol(sym.EOF);

                    /* in this simple scanner we just ignore everything else */
                    default: advance(); break;
                }
        }
    }

Introduction to Grammar Theory

Grammars can be used to generate the syntax of all formal languages – the structural complexity of a language is determined by the simplest grammar that can generate it.

In order to create parsers, we are interested in “properties of grammars”. For example, the “first set” of a string w of terminals and non-terminals is the set of TERMINAL symbols (tokens) that may be at the front of ANY string derived from w using the grammar rules.
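As an illustration of the "first set" idea, the sketch below computes first sets for a small expression grammar by repeating the rules until nothing changes. The grammar (E -> T R, R -> + T R | empty, T -> ( E ) | num), the class name FirstSets and the representation of rules as lists of strings are all assumptions made for this example; they are not taken from the lecture.

```java
import java.util.*;

class FirstSets {
    // Hypothetical grammar: each non-terminal maps to its alternatives.
    // An empty alternative stands for the empty word.
    static final Map<String, List<List<String>>> GRAMMAR = new LinkedHashMap<>();
    static {
        GRAMMAR.put("E", List.of(List.of("T", "R")));
        GRAMMAR.put("R", List.of(List.of("+", "T", "R"), List.of()));
        GRAMMAR.put("T", List.of(List.of("(", "E", ")"), List.of("num")));
    }

    // Non-terminals that can derive the empty word, and first sets found so far.
    static final Set<String> NULLABLE = new HashSet<>();
    static final Map<String, Set<String>> FIRST = new HashMap<>();
    static {
        for (String nt : GRAMMAR.keySet()) FIRST.put(nt, new TreeSet<>());
        boolean changed = true;
        while (changed) { // iterate to a fixed point
            changed = false;
            for (Map.Entry<String, List<List<String>>> rule : GRAMMAR.entrySet()) {
                String lhs = rule.getKey();
                for (List<String> rhs : rule.getValue()) {
                    if (FIRST.get(lhs).addAll(first(rhs))) changed = true;
                    if (derivesEmpty(rhs) && NULLABLE.add(lhs)) changed = true;
                }
            }
        }
    }

    // True if every symbol of w can derive the empty word.
    static boolean derivesEmpty(List<String> w) {
        for (String symbol : w) {
            if (!NULLABLE.contains(symbol)) return false;
        }
        return true;
    }

    // Terminals that may be at the front of a string derived from w.
    static Set<String> first(List<String> w) {
        Set<String> result = new TreeSet<>();
        for (String symbol : w) {
            if (!GRAMMAR.containsKey(symbol)) { // terminal: it must come first
                result.add(symbol);
                return result;
            }
            result.addAll(FIRST.get(symbol));
            if (!NULLABLE.contains(symbol)) return result;
            // symbol can vanish, so the next symbol may also come first
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(FIRST.get("E"));           // [(, num]
        System.out.println(first(List.of("R", ")"))); // [), +]
    }
}
```

The second call shows why w is a string and not just a single symbol: because R can derive the empty word, the terminal ) that follows it can also appear at the front.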

Summary:

Regular expressions are a quick and easy way to specify simple forms of language. They can be easily translated into FSMs, which have nice properties, e.g. they execute in time linear in the length of the input.

There are tools (JLEX) which input regular expressions and output a lexical analyser which recognises the language they define.
