LANGUAGE TRANSLATORS: WEEK 14
LECTURE:
REGULAR EXPRESSIONS
FINITE STATE MACHINES
LEXICAL ANALYSERS
INTRO TO GRAMMAR THEORY
TUTORIAL:
CAPTURING LANGUAGES USING REGULAR EXPRESSIONS
LEXICAL ANALYSIS
Lexical analysis is the first step in the translation/compilation process:
input language ====> output language
It means grouping the raw characters of the input into TOKENS.
LEXICAL ANALYSIS PHASE
The language of TOKENS (e.g. identifiers) is always a regular language.
REGULAR EXPRESSIONS generate regular languages (as do Regular Grammars).
The tokens of languages are often specified by regular expressions.
Finite State Machines consume regular languages.
REGULAR EXPRESSIONS
A one-line method of specifying a language
Equivalent to `type 3’ or regular grammars
Used to parameterize UNIX/LINUX file processing commands
REGULAR EXPRESSIONS - DEFINITION
EXAMPLE              DEFINITION
a | b                ‘|’ means choice
a | b | c = [abc]    ‘[..]’ is shorthand for multiple choice
ε                    ‘ε’ means the empty word
(abc)*               ‘*’ means repetition 0 or more times
(abcd)+              ‘+’ means repetition 1 or more times
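The operators in the table can be tried out with Java's java.util.regex package, whose syntax for choice, [..], * and + agrees with the definitions here. This is a small sketch only; the class name RegexOperators and the helper inLanguage are illustrative assumptions, not part of the lecture.

```java
import java.util.regex.Pattern;

public class RegexOperators {
    // True when the WHOLE string s belongs to the language of the expression.
    static boolean inLanguage(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        System.out.println(inLanguage("a|b", "b"));         // choice: true
        System.out.println(inLanguage("[abc]", "c"));       // shorthand for a|b|c: true
        System.out.println(inLanguage("(abc)*", ""));       // zero repetitions, the empty word: true
        System.out.println(inLanguage("(abc)*", "abcabc")); // two repetitions: true
        System.out.println(inLanguage("(abcd)+", ""));      // '+' needs at least one repetition: false
    }
}
```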
REGULAR EXPRESSIONS - EXAMPLES
[a-zA-Z][a-zA-Z0-9]* defines the language of IDENTIFIERS in some programming languages
(xyz)* defines the language { ε, xyz, xyzxyz, xyzxyzxyz, ..}, where ε is the empty word
[abcd]+ defines the language {a, b, c, d, aa, ab, ac, ad, ba, bb, bc, bd, ca, ..}
Putting choice and repetition together produces complicated regular languages
Finite State Machines
Can be defined by annotated nodes and arcs.
Can translate Reg. Exps into FSMs but must add
ERROR STATES onto the FSMs
Regular Expression ==> NDFSM
[Diagrams: NDFSMs for the expressions ab, [ab] and a*, with arcs labelled a and b]
then NDFSM ==> FSM..
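A deterministic FSM with an explicit error state can be written directly as a small program. The following is a minimal sketch for the identifier expression [a-zA-Z][a-zA-Z0-9]*; the class name IdentifierFsm and the state names START, IN_ID and ERROR are illustrative assumptions.

```java
public class IdentifierFsm {
    // Nodes of the machine; ERROR is the added error (trap) state.
    enum State { START, IN_ID, ERROR }

    // One arc of the machine: the next state for a given state and input character.
    static State step(State s, char c) {
        boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        boolean digit = c >= '0' && c <= '9';
        switch (s) {
            case START: return letter ? State.IN_ID : State.ERROR;
            case IN_ID: return (letter || digit) ? State.IN_ID : State.ERROR;
            default:    return State.ERROR; // once in ERROR, stay there
        }
    }

    // Runs the machine over the whole input; IN_ID is the only accepting state.
    static boolean accepts(String input) {
        State s = State.START;
        for (int i = 0; i < input.length(); i++) {
            s = step(s, input.charAt(i));
        }
        return s == State.IN_ID;
    }

    public static void main(String[] args) {
        System.out.println(accepts("x1"));  // true
        System.out.println(accepts("1x"));  // false: starts with a digit
    }
}
```

Each input character causes exactly one state change, so the machine reads its input once from left to right.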
Example
Specify a language over the alphabet {w, x, y, z} with the only restrictions being that
1. no string contains both x and y, and
2. if there is a y and a w in a string, then the first w ALWAYS occurs before the first y.
SOLUTION:
1. Write down examples and counter-examples.
2. Decide on any ambiguities.
3. Use Case Analysis to sub-divide the problem:
language = (a) strings of {w, x, z} UNION (b) strings of {w, y, z} with restriction 2.
- Part (a): = [w x z]+
- Part (b): can assume y is always in a string = [y z]+ | z* w [w z]* y [w y z]*
- Put together: answer = [w x z]+ | [y z]+ | z* w [w z]* y [w y z]*
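An answer of this kind can be checked mechanically by comparing the expression with a direct statement of the two restrictions on every short string. The sketch below tests the expression [wxz]+ | [yz]+ | z* w [wz]* y [wyz]*, in which the part after the first y ranges over w, y and z (x is excluded by restriction 1). The class and method names are illustrative assumptions, and the empty string is left out because the expression uses + rather than *.

```java
import java.util.regex.Pattern;

public class WxyzLanguage {
    // The candidate answer, written without spaces for java.util.regex.
    static final Pattern ANSWER = Pattern.compile("[wxz]+|[yz]+|z*w[wz]*y[wyz]*");

    // Direct statement of restrictions 1 and 2.
    static boolean satisfiesRestrictions(String s) {
        int firstW = s.indexOf('w');
        int firstY = s.indexOf('y');
        boolean bothXandY = s.indexOf('x') >= 0 && firstY >= 0;
        boolean wBeforeY = firstW < 0 || firstY < 0 || firstW < firstY;
        return !bothXandY && wBeforeY;
    }

    // Counts the non-empty strings of length <= maxLen on which the
    // expression and the restrictions disagree.
    static int countDisagreements(int maxLen) {
        char[] alphabet = {'w', 'x', 'y', 'z'};
        int bad = 0;
        for (int len = 1; len <= maxLen; len++) {
            int total = 1 << (2 * len); // 4^len strings of this length
            for (int code = 0; code < total; code++) {
                StringBuilder sb = new StringBuilder();
                int c = code;
                for (int i = 0; i < len; i++) {
                    sb.append(alphabet[c & 3]); // two bits select one letter
                    c >>= 2;
                }
                String s = sb.toString();
                if (ANSWER.matcher(s).matches() != satisfiesRestrictions(s)) {
                    bad++;
                }
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println("disagreements up to length 6: " + countDisagreements(6));
    }
}
```

A count of zero on short strings does not prove the answer, but any non-zero count points straight at a counter-example worth writing down in step 1.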
A LEXICAL ANALYSER - GENERATOR (e.g. LEX, JLEX) - how they work
INPUT REGULAR EXPRESSIONS
TRANSLATE REGULAR EXPRESSION INTO NON-DETERMINISTIC FSM
TRANSLATE NON-DETERMINISTIC FSM INTO DETERMINISTIC FSM (which is easily described as a simple program)
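The NDFSM-to-FSM step is usually done with the subset construction, in which each state of the deterministic machine is a set of states of the non-deterministic one. The sketch below applies it to a small hand-built NDFSM for (a|b)*abb. The expression, the state numbering and all names are illustrative assumptions, not taken from LEX or JLEX, and this NDFSM has no empty-word moves, which keeps the code short.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SubsetConstruction {
    // Hypothetical NDFSM for (a|b)*abb, states 0..3, accepting state 3:
    //   0 -a-> 0, 0 -a-> 1, 0 -b-> 0, 1 -b-> 2, 2 -b-> 3
    static Set<Integer> move(Set<Integer> states, char c) {
        Set<Integer> out = new TreeSet<>();
        for (int s : states) {
            if (s == 0 && c == 'a') { out.add(0); out.add(1); }
            if (s == 0 && c == 'b') { out.add(0); }
            if (s == 1 && c == 'b') { out.add(2); }
            if (s == 2 && c == 'b') { out.add(3); }
        }
        return out;
    }

    // Builds the deterministic transition table, one row per set of NDFSM states.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> build() {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> start = new TreeSet<>(Collections.singleton(0));
        dfa.put(start, new HashMap<>());
        work.add(start);
        while (!work.isEmpty()) {
            Set<Integer> cur = work.remove();
            for (char c : new char[] {'a', 'b'}) {
                Set<Integer> next = move(cur, c);
                dfa.get(cur).put(c, next);
                if (!dfa.containsKey(next)) { // a set not seen before is a new state
                    dfa.put(next, new HashMap<>());
                    work.add(next);
                }
            }
        }
        return dfa;
    }

    // Runs the deterministic machine: one table lookup per input character.
    static boolean accepts(String input) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = build();
        Set<Integer> cur = new TreeSet<>(Collections.singleton(0));
        for (char c : input.toCharArray()) {
            Map<Character, Set<Integer>> row = dfa.get(cur);
            if (!row.containsKey(c)) return false; // character outside {a, b}
            cur = row.get(c);
        }
        return cur.contains(3);
    }

    public static void main(String[] args) {
        System.out.println(build().keySet()); // the deterministic states
        System.out.println(accepts("aabb"));
    }
}
```

For this NDFSM the construction reaches four sets of states, so the deterministic machine has four states.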
EXAMPLE INPUT TO A LEXICAL ANALYSER - GENERATOR
%%
";" { return new Symbol(sym.SEMI); }
"+" { return new Symbol(sym.PLUS); }
"*" { return new Symbol(sym.TIMES); }
"(" { return new Symbol(sym.LPAREN); }
")" { return new Symbol(sym.RPAREN); }
[0-9]+ { return new Symbol(sym.NUMBER, new Integer(yytext())); }
[ \t\r\n\f] { /* ignore white space. */ }
. { System.err.println("Illegal character: "+yytext()); }
Example: if the string (231+3)*3 was input to the generated lexical analyser, the output would be:
LPAREN (NUMBER,231) PLUS (NUMBER,3) RPAREN TIMES (NUMBER,3)
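The behaviour of such a generated analyser can be imitated in plain Java, without the generator or the Symbol and sym classes. The sketch below turns each rule into a named group of one java.util.regex pattern and prints tokens as strings. The class name SpecDrivenLexer and the group names are illustrative assumptions. Java tries the alternatives in order rather than picking the longest match, which gives the same result for this particular rule set but not for every set of rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpecDrivenLexer {
    // One named group per rule, in the same order as the specification.
    static final Pattern RULES = Pattern.compile(
          "(?<SEMI>;)|(?<PLUS>\\+)|(?<TIMES>\\*)|(?<LPAREN>\\()|(?<RPAREN>\\))"
        + "|(?<NUMBER>[0-9]+)|(?<WS>[ \\t\\r\\n\\f])|(?<ILLEGAL>.)");

    // Rules whose token carries no value.
    static final String[] SIMPLE = {"SEMI", "PLUS", "TIMES", "LPAREN", "RPAREN"};

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = RULES.matcher(input);
        while (m.find()) {
            if (m.group("NUMBER") != null) {
                out.add("(NUMBER," + Integer.parseInt(m.group("NUMBER")) + ")");
            } else if (m.group("ILLEGAL") != null) {
                System.err.println("Illegal character: " + m.group());
            } else if (m.group("WS") == null) { // white space produces no token
                for (String name : SIMPLE) {
                    if (m.group(name) != null) out.add(name);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints: LPAREN (NUMBER,231) PLUS (NUMBER,3) RPAREN TIMES (NUMBER,3)
        System.out.println(String.join(" ", tokenize("(231+3)*3")));
    }
}
```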
Simple Lexical Analyser
import java_cup.runtime.Symbol; // token class from the CUP runtime; sym is the CUP-generated constants class

public class scanner {
  /* single lookahead character */
  protected static int next_char;

  /* advance input by one character */
  protected static void advance()
    throws java.io.IOException
    { next_char = System.in.read(); }

  /* initialize the scanner */
  public static void init()
    throws java.io.IOException
    { advance(); }

  /* recognize and return the next complete token */
  public static Symbol next_token()
    throws java.io.IOException
    {
      for (;;)
        switch (next_char)
          {
            case '0': case '1': case '2': case '3': case '4':
            case '5': case '6': case '7': case '8': case '9':
              /* parse a decimal integer */
              int i_val = 0;
              do {
                i_val = i_val * 10 + (next_char - '0');
                advance();
              } while (next_char >= '0' && next_char <= '9');
              return new Symbol(sym.INT, Integer.valueOf(i_val));

            /* keywords are abbreviated to a single character */
            case 'p': advance(); return new Symbol(sym.PRINT);
            case 'r': advance(); return new Symbol(sym.REPEAT);
            case 'u': advance(); return new Symbol(sym.UNTIL);
            case '=': advance(); return new Symbol(sym.ASSIGNS);
            case ';': advance(); return new Symbol(sym.SEMI);
            case '+': advance(); return new Symbol(sym.PLUS);
            case '-': advance(); return new Symbol(sym.MINUS);
            case '(': advance(); return new Symbol(sym.LPAREN);
            case ')': advance(); return new Symbol(sym.RPAREN);

            /* the only identifiers are x, y and z */
            case 'x': advance(); return new Symbol(sym.ID,"x");
            case 'y': advance(); return new Symbol(sym.ID,"y");
            case 'z': advance(); return new Symbol(sym.ID,"z");

            /* end of input */
            case -1: return new Symbol(sym.EOF);

            default:
              /* in this simple scanner we just ignore everything else */
              advance();
              break;
          }
    }
}
Introduction to Grammar Theory
Grammars can be used to generate the syntax of all formal languages – the structural complexity of a language is determined by the simplest grammar that can generate it.
In order to create parsers, we are interested in “properties of grammars”. For example, the “first set” of a string w of terminals and non-terminals is the set of TERMINAL symbols (tokens) that may be at the front of ANY string derived from w using the grammar rules.
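First sets can be computed by repeatedly applying the grammar rules until nothing new is added. The sketch below does this for a small hypothetical expression grammar (E, E2 and T are assumed non-terminals; PLUS, NUMBER, LPAREN and RPAREN are assumed token names). It also records, with the marker "", whether a string can derive the empty word, which the definition above leaves aside.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class FirstSets {
    static final String EMPTY = ""; // marker: the string can derive the empty word

    // Hypothetical grammar:
    //   E  -> T E2
    //   E2 -> PLUS T E2 | (empty word)
    //   T  -> NUMBER | LPAREN E RPAREN
    static final Map<String, List<List<String>>> RULES = new LinkedHashMap<>();
    static {
        RULES.put("E", Arrays.asList(Arrays.asList("T", "E2")));
        RULES.put("E2", Arrays.asList(Arrays.asList("PLUS", "T", "E2"),
                                      Collections.<String>emptyList()));
        RULES.put("T", Arrays.asList(Arrays.asList("NUMBER"),
                                     Arrays.asList("LPAREN", "E", "RPAREN")));
    }

    // First set of a string of terminals and non-terminals, given the
    // first sets known so far for each non-terminal.
    static Set<String> firstOf(List<String> symbols, Map<String, Set<String>> first) {
        Set<String> result = new TreeSet<>();
        for (String symbol : symbols) {
            if (!RULES.containsKey(symbol)) { // a terminal starts every derived string
                result.add(symbol);
                return result;
            }
            Set<String> f = first.get(symbol);
            for (String t : f) {
                if (!t.equals(EMPTY)) result.add(t);
            }
            if (!f.contains(EMPTY)) return result; // symbol cannot vanish: stop here
        }
        result.add(EMPTY); // every symbol in the string can derive the empty word
        return result;
    }

    // Iterates over all rules until no first set grows any further.
    static Map<String, Set<String>> compute() {
        Map<String, Set<String>> first = new LinkedHashMap<>();
        for (String nonTerminal : RULES.keySet()) first.put(nonTerminal, new TreeSet<>());
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, List<List<String>>> rule : RULES.entrySet()) {
                for (List<String> rhs : rule.getValue()) {
                    if (first.get(rule.getKey()).addAll(firstOf(rhs, first))) changed = true;
                }
            }
        }
        return first;
    }

    public static void main(String[] args) {
        System.out.println(compute());
    }
}
```

For this grammar the first set of E is {LPAREN, NUMBER}, since every string derived from E starts with a number or an opening parenthesis.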
Summary:
Regular expressions are a quick and easy way to specify simple forms of language. They can be easily translated into FSMs, which have nice properties: for example, a deterministic FSM processes its input in time linear in the length of that input.
There are tools (e.g. JLEX) which take regular expressions as input and output a lexical analyser that recognises the language they define.