cpsc 388 – compiler design and construction scanners – regular expressions

20
CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

Upload: lizbeth-nelson

Post on 24-Dec-2015

222 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

CPSC 388 – Compiler Design and Construction

Scanners – Regular Expressions

Page 2: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

Announcements Last day to Add/Drop Sept 11 Wipe Down Computer Keyboards and Mice ACM programming contest Read Chapter 3 Homework 2 due this Friday PROG1 due this Friday FSA for Java string constants (Anybody

figure it out?)

Page 3: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

Homework 1 returned Summary – In addition to your summary please

answer the following questions in your report: Context – What is the context of the research

presented in the paper? What new ideas or concepts do you think were presented in the paper?

Evaluation – How was the research evaluated? What evaluation techniques would you like to see to compare the research to the state of the art before the paper?

Significance – What is the significance of the research? Do you feel the work is a minor improvement or a major step in the field?

Grammar / spelling / clarity / support for statements made

Page 4: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

Scanner Generator

Scanner Generator.Jlex fileContainingRegular Expressions

.java fileContainingScanner code

To understand Regular Expressionsyou need to understand Finite-State Automata

Page 5: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

FSA Formal Definition (5-tuple)

Q – a finite set of statesΣ – The alphabet of the automata

(finite set of characters to label edges)

δ – state transition functionδ(statei,character) statej

q – The start stateF – The set of final states

Page 6: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

Types of FSA Deterministic (DFA)

No State has more than one outgoing edge with the same label

Non-Deterministic (NFA) States may have more than one outgoing

edge with same label. Edges may be labeled with ε, the empty

string. The FSA can take an epsilon transition without looking at the current input character.

Page 7: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

Terms to Know Alphabet (Σ) – any finite set of

symbols e.g. binary, ASCII, Unicode String – finite sequence of symbols

e.g. 010001, banana, bãër Language – any countable set of

strings e.g.Empty setWell-formed C programsEnglish words

Page 8: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

Regular Expressions Easy way to express a language that is

accepted by FSA Rules:

ε is a regular expression Any symbol in Σ is a regular expressionIf r and s are any regular expressions then so is: r|s denotes union e.g. “r or s” rs denotes r followed by s (concatination) (r)* denotes concatination of r with itself zero or

more times (Kleene closer) () used for controlling order of operations

Page 9: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

Example Regular Expressions

Regular Expression Corresponding Languageε {“”}

a {“a”}

abc {“abc”}

a|b|c {“a”,”b”,”c”}

(a|b|c)* {“”,”a”,”b”,”c”,”aa”,”ab”,”ac”,”aaa”,…}

a|b|c|…|z|A|B|…|Z Any letter

0|1|2|…|9 Any digit

Page 10: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

Precedence in Regular Expressions

* has highest precedence, left associative

Concatenation has second highest precedence, left associative

| has lowest associative, left associative

Page 11: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

More Regular Expression Examples

Regular Expression Corresponding Languageε|a|b|ab* {“”, “a”, “b”, “ab”, “abb”, “abbb”,…}

ab*c {“ac”, “abc”, “abbc”,…}

ab*|a* {“”, “a”, “ab”, “aa”, “aaa”, “abb”,…}

a(b*|a*) {“a”, “ab”, “aa”, “abb”, “aaa”, …}

a(b|a)* {“a”, “ab”, “aa”, “aaa”, “aab”, “aba”,…}

Page 12: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

You Try What is the language described by

each Regular Expression?a*(a|b)*a|a*b(a|b)(a|b)aa|ab|ba|bb(+|-|ε)(0|1|2|3|4|5|6|7|8|9)*

Page 13: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

Regular Definitions

If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:

D1 → R1

D2 → R2

Dn → Rn

1. Each di is a new symbol not in Σ andnot the same as any other of the d’s.

2. Each ri is a regular expression over

Σ U (d1,d2,…,di-1)

Page 14: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

Regular Definitions Example

Example C identifiers:Σ = ASCII

letter_ → a|b|c|…|z|A|B|C|…|Z|_digit → 0|1|2|…|9id → letter_(letter_|digit)*

Page 15: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

Regular Definitions ExampleExample Unsigned Numbers (integer or float):Σ = ASCII

digit → 0|1|2|…|9digits → digit digit*optionalFraction → . digits | εoptionalExponent → (E(+|-| ε)digits)| εnumber → digits optionalFraction optionalExponent

Page 16: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

Special Characters in Reg. Exp.What does each of the following mean?*|()[]+?“”.\^$

– Kleene Closure– or – grouping – creates a character class– Positive Closure– zero or one instance– anything in quotes means itself, e.g. “*”– matches any single character (except newline)– used for escape characters (newline, tab, etc.)– matches beginning of a line– matches the end of a line

Page 17: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

Extensions to Regular Expressions

+ means one or more occurrence (positive closure)

? means zero or one occurrence Character classes

a|r|t can be written [art] a|b|…|z can be written [a-z]

As long as there is a clear ordering to characters

[^a-z] matches any character except a-z

Page 18: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

Example Using Character Classes

^[^aeiou]*$Matches any complete line that does not

contain a lowercase vowelHow do you tell which meaning of ^ is

intended?

Page 19: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

Try It Create Character Classes for:

First ten letters (up to “j”) Lowercase consonants Digits in hexadecimal

Create Regular Expressions for: Case Insensitive keyword such as

SELECT (or Select or SeLeCt) in SQL Java string constants Any string of whitespace characters

Page 20: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions

Creating a Scanner Create a set of regular expressions, one for

each token to be recognized Convert regular expressions into one

combined DFA Run DFA over input character stream

Longest matching regular expression is selected If a tie then use first matching regular

expression Attach code to run when a regular

expression matches