cpsc 388 – compiler design and construction scanners – regular expressions
TRANSCRIPT
![Page 1: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/1.jpg)
CPSC 388 – Compiler Design and Construction
Scanners – Regular Expressions
![Page 2: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/2.jpg)
Announcements Last day to Add/Drop Sept 11 Wipe Down Computer Keyboards and Mice ACM programming contest Read Chapter 3 Homework 2 due this Friday PROG1 due this Friday FSA for Java string constants (Anybody
figure it out?)
![Page 3: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/3.jpg)
Homework 1 returned Summary – In addition to your summary please
answer the following questions in your report: Context – What is the context of the research
presented in the paper? What new ideas or concepts do you think were presented in the paper?
Evaluation – How was the research evaluated? What evaluation techniques would you like to see to compare the research to the state of the art before the paper?
Significance – What is the significance of the research? Do you feel the work is a minor improvement or a major step in the field?
Grammar / spelling / clarity / support for statements made
![Page 4: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/4.jpg)
Scanner Generator
Scanner Generator.Jlex fileContainingRegular Expressions
.java fileContainingScanner code
To understand Regular Expressionsyou need to understand Finite-State Automata
![Page 5: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/5.jpg)
FSA Formal Definition (5-tuple)
Q – a finite set of statesΣ – The alphabet of the automata
(finite set of characters to label edges)
δ – state transition functionδ(statei,character) statej
q – The start stateF – The set of final states
![Page 6: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/6.jpg)
Types of FSA Deterministic (DFA)
No State has more than one outgoing edge with the same label
Non-Deterministic (NFA) States may have more than one outgoing
edge with same label. Edges may be labeled with ε, the empty
string. The FSA can take an epsilon transition without looking at the current input character.
![Page 7: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/7.jpg)
Terms to Know Alphabet (Σ) – any finite set of
symbols e.g. binary, ASCII, Unicode String – finite sequence of symbols
e.g. 010001, banana, bãër Language – any countable set of
strings e.g.Empty setWell-formed C programsEnglish words
![Page 8: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/8.jpg)
Regular Expressions Easy way to express a language that is
accepted by FSA Rules:
ε is a regular expression Any symbol in Σ is a regular expressionIf r and s are any regular expressions then so is: r|s denotes union e.g. “r or s” rs denotes r followed by s (concatination) (r)* denotes concatination of r with itself zero or
more times (Kleene closer) () used for controlling order of operations
![Page 9: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/9.jpg)
Example Regular Expressions
Regular Expression Corresponding Languageε {“”}
a {“a”}
abc {“abc”}
a|b|c {“a”,”b”,”c”}
(a|b|c)* {“”,”a”,”b”,”c”,”aa”,”ab”,”ac”,”aaa”,…}
a|b|c|…|z|A|B|…|Z Any letter
0|1|2|…|9 Any digit
![Page 10: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/10.jpg)
Precedence in Regular Expressions
* has highest precedence, left associative
Concatenation has second highest precedence, left associative
| has lowest associative, left associative
![Page 11: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/11.jpg)
More Regular Expression Examples
Regular Expression Corresponding Languageε|a|b|ab* {“”, “a”, “b”, “ab”, “abb”, “abbb”,…}
ab*c {“ac”, “abc”, “abbc”,…}
ab*|a* {“”, “a”, “ab”, “aa”, “aaa”, “abb”,…}
a(b*|a*) {“a”, “ab”, “aa”, “abb”, “aaa”, …}
a(b|a)* {“a”, “ab”, “aa”, “aaa”, “aab”, “aba”,…}
![Page 12: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/12.jpg)
You Try What is the language described by
each Regular Expression?a*(a|b)*a|a*b(a|b)(a|b)aa|ab|ba|bb(+|-|ε)(0|1|2|3|4|5|6|7|8|9)*
![Page 13: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/13.jpg)
Regular Definitions
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
D1 → R1
D2 → R2
…
Dn → Rn
1. Each di is a new symbol not in Σ andnot the same as any other of the d’s.
2. Each ri is a regular expression over
Σ U (d1,d2,…,di-1)
![Page 14: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/14.jpg)
Regular Definitions Example
Example C identifiers:Σ = ASCII
letter_ → a|b|c|…|z|A|B|C|…|Z|_digit → 0|1|2|…|9id → letter_(letter_|digit)*
![Page 15: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/15.jpg)
Regular Definitions ExampleExample Unsigned Numbers (integer or float):Σ = ASCII
digit → 0|1|2|…|9digits → digit digit*optionalFraction → . digits | εoptionalExponent → (E(+|-| ε)digits)| εnumber → digits optionalFraction optionalExponent
![Page 16: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/16.jpg)
Special Characters in Reg. Exp.What does each of the following mean?*|()[]+?“”.\^$
– Kleene Closure– or – grouping – creates a character class– Positive Closure– zero or one instance– anything in quotes means itself, e.g. “*”– matches any single character (except newline)– used for escape characters (newline, tab, etc.)– matches beginning of a line– matches the end of a line
![Page 17: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/17.jpg)
Extensions to Regular Expressions
+ means one or more occurrence (positive closure)
? means zero or one occurrence Character classes
a|r|t can be written [art] a|b|…|z can be written [a-z]
As long as there is a clear ordering to characters
[^a-z] matches any character except a-z
![Page 18: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/18.jpg)
Example Using Character Classes
^[^aeiou]*$Matches any complete line that does not
contain a lowercase vowelHow do you tell which meaning of ^ is
intended?
![Page 19: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/19.jpg)
Try It Create Character Classes for:
First ten letters (up to “j”) Lowercase consonants Digits in hexadecimal
Create Regular Expressions for: Case Insensitive keyword such as
SELECT (or Select or SeLeCt) in SQL Java string constants Any string of whitespace characters
![Page 20: CPSC 388 – Compiler Design and Construction Scanners – Regular Expressions](https://reader036.vdocument.in/reader036/viewer/2022082422/56649dc65503460f94aba106/html5/thumbnails/20.jpg)
Creating a Scanner Create a set of regular expressions, one for
each token to be recognized Convert regular expressions into one
combined DFA Run DFA over input character stream
Longest matching regular expression is selected If a tie then use first matching regular
expression Attach code to run when a regular
expression matches