Formal Grammars and Abstract Machines
Sahar Al Seesi
What are Formal Languages
• Describing the sentence structure of a language in a formal way
• Used in
  – Natural Language Processing applications (translators, grammar checking tools, etc.)
    • Languages: English, French, Spanish, Chinese, etc.
  – RNA/protein structure analysis
    • RNA in general, ribosomal RNA, protein, etc.
  – Compilers for programming languages
    • C, Java, Python, Linux shell script, Assembly, etc.
• To build a program for any of the above applications, the language rules must be described in a formal inclusive way.
Formal Languages and Grammars
• G={ Σ, R, S }
• Σ : Alphabet of non-terminals (NT) and terminals (T)
    Non-terminals (NT)             Terminals (T)
    {S, VERB, SUBJECT, OBJECT}     {children, sam, play, eat, ball}
    {S, A1, A2}                    {a, c, g, t} or {a, c, g, u}
• R : Production rules, e.g. {S → SUBJECT VERB OBJECT}
• S : Starting symbol
• L(G) : The language defined by G; a finite or infinite set of strings (words/sentences)
• L(G) ⊆ T*
Chomsky Hierarchy
Regular ⊂ Context-free ⊂ Context-sensitive ⊂ Unrestricted Grammars (Recursively Enumerable Languages)
[Figure: nested language classes; power of expression, rule complexity, and parsing time complexity increase from Regular to Unrestricted]
Parsing/Accepting Abstract Machines
Grammar                        Parsing Automaton
Regular grammars               Finite State Machine (FSM)
Context-free grammars          Push-Down Automaton (PDA)
Context-sensitive grammars     Linear-Bounded Automaton (LBA)
Unrestricted grammars          Turing Machine (TM)
Regular Languages & Regular Expressions
• A regular language can be represented by a regular expression
• Let Σ = {a,b}
• Let Lr be the language defined by regular expression r.
    r     Lr
    Σ*    the set of all strings over Σ of length 0 or more (includes the empty string, ε)
    Σ+    the set of all strings over Σ of length 1 or more (does not include ε)
    a+    the set of all strings of 1 or more a’s: {a, aa, aaa, …}
    b*    the set of all strings of 0 or more b’s: {ε, b, bb, bbb, …}
Combining Regular Languages
• Concatenation
– Let r and s be 2 regular expressions, rs corresponds to the language LrLs
– Example:
• r = a*, s = b+
• LrLs : the set of strings consisting of 0 or more a’s followed by 1 or more b’s
• {b, bb, ab, aabbbb} ⊂ LrLs
Combining Regular Languages
• Union
– Let r and s be 2 regular expressions, r+s corresponds to the language Lr∪Ls
– Example:
• r = a*, s = b+
• Lr ∪ Ls : the set containing every string of 0 or more a’s and every string of 1 or more b’s
• {b, bb, a, aa, bbbb} ⊂ Lr ∪ Ls
Combining Regular Languages
• Closure
– Let r be a regular expression, r* corresponds to the language Lr*
– Example:
• r = ab
• Lr* : the set of strings consisting of 0 or more “ab”s (ab)*
• {ε, ab, abab, ababab} ⊂ Lr*
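The three operations can be tried out with Python's re module. This is only a sketch: the slide notation has to be translated by hand (union "+" becomes "|", concatenation and "*" stay as they are), and the helper name in_language is made up for this illustration.

```python
import re

# Slide notation -> Python re syntax (translation done by hand):
#   concatenation rs -> juxtaposition, union r+s -> r|s, closure r* -> r*
CONCAT = re.compile(r"a*b+")    # r = a*, s = b+, language LrLs
UNION = re.compile(r"a*|b+")    # language Lr ∪ Ls
CLOSURE = re.compile(r"(ab)*")  # r = ab, language Lr*

def in_language(pattern, string):
    """Return True if the whole string belongs to the language of pattern."""
    return pattern.fullmatch(string) is not None

for name, pattern in [("concat", CONCAT), ("union", UNION), ("closure", CLOSURE)]:
    members = [w for w in ["", "b", "ab", "aa", "abab", "ba"] if in_language(pattern, w)]
    print(name, members)
```

fullmatch is used instead of search because membership in a language is a statement about the whole string, not about a substring.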
Example
• R = (a+c+t)yyk(p+q)*vdt(l+z+ε)pq
• Strings that belong to the language defined by R
ayykppvdtlpq
cyykpqppqvdtpq
tyykqvdtzpq
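A quick membership check for the three strings, as a sketch in Python's re syntax. It assumes the literal after the first symbol is yyk (as in the listed strings) and that the optional part is (l+z+ε), written here as [lz]?.

```python
import re

# Hand translation of R into Python re syntax (an assumption, see the note above):
#   (a+c+t) -> [act], (p+q)* -> [pq]*, (l+z+ε) -> [lz]?
R = re.compile(r"[act]yyk[pq]*vdt[lz]?pq")

for word in ["ayykppvdtlpq", "cyykpqppqvdtpq", "tyykqvdtzpq", "gyykvdtpq"]:
    print(word, R.fullmatch(word) is not None)
```

The last string starts with g, which is not in (a+c+t), so it is the only one rejected.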
Regular Grammars
• Can be represented by a regular expression
• Grammar rules are of the form NT → T NT or NT → T
• Example: The set of all DNA strings
• Regular Expression: {a,c,g,t}+
• G = {Σ, R, S}
• Σ = {S, a, c, g, t}
• R = {S → aS | cS | gS | tS | a | c | g | t}
Finite State Machine
• M = {Q, Σ, δ, q0, F}
• Q: Finite set of states
• Σ: Language alphabet
• δ: Transition function (Q × Σ → Q)
• q0: Starting state
• F: Set of final states
Finite State Machine Example
• M = {Q, Σ, δ, q0, F}
• A FSM for R = {S → aS | cS | gS | tS | a | c | g | t}
• Q = {1, 2}, Σ = {a, c, g, t}
• q0 = 1, F = {2}
• δ(1,a) = 2, δ(1,c) = 2, δ(1,g) = 2, δ(1,t) = 2
• δ(2,a) = 2, δ(2,c) = 2, δ(2,g) = 2, δ(2,t) = 2
[Figure: state 1 goes to state 2 on a,c,g,t; state 2 loops to itself on a,c,g,t]
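The machine above is small enough to simulate directly. A minimal sketch in Python follows; the function name run_dfa and the dictionary encoding of the transition function are choices made for this illustration.

```python
def run_dfa(delta, start, finals, string):
    """Run a deterministic FSM; reject if a transition is undefined."""
    state = start
    for symbol in string:
        if (state, symbol) not in delta:
            return False  # symbol outside the alphabet, or no move from this state
        state = delta[(state, symbol)]
    return state in finals

# Transition function from the slide: every symbol leads to state 2.
DELTA = {(q, x): 2 for q in (1, 2) for x in "acgt"}

print(run_dfa(DELTA, 1, {2}, "gattaca"))  # non-empty DNA string
print(run_dfa(DELTA, 1, {2}, ""))         # the empty string stays in state 1
```

The empty string is rejected because the machine never leaves the non-final state 1, which matches the regular expression {a,c,g,t}+.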
Another Example
L = The set of all strings in {0,1}* that either begin or end (or both) with 01
R = (01(0+1)*)+((0+1)*01)
[Figure: state diagram with states S, A, B, C, D, E; edges labelled 0, 1, and 0,1]
Input1: 0100
Input2: 1101
Input3: 11011
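The three inputs can be checked against R without the state diagram. The sketch below uses Python's re module rather than the slide's automaton, with R translated by hand (union "+" becomes "|", (0+1) becomes [01]).

```python
import re

# R = (01(0+1)*)+((0+1)*01) in Python re syntax
R = re.compile(r"01[01]*|[01]*01")

def accepts(word):
    """True if word begins or ends (or both) with 01."""
    return R.fullmatch(word) is not None

for word in ["0100", "1101", "11011"]:
    print(word, "accepted" if accepts(word) else "rejected")
```

0100 begins with 01 and 1101 ends with 01, so both are in L; 11011 does neither.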
Context Free Grammars (CFG) and Languages
• CFGs can represent nested pair-wise correlation between terminal symbols in the string
• Famous example: palindrome language (ww^R), e.g. a b a a a a b a
• Can you write a regular grammar for ww^R?
• Grammar rules are of the form NT → (T+NT)+
CFG and Push Down Automata
• M = {Q, Σ, Γ, δ, q0, F}
• Q: Finite set of states
• Σ: Language alphabet
• Γ: Stack alphabet
• δ: Transition function (Q × Σ × Γ → Q × Γ*)
• q0: Starting state
• F: Set of final states
http://epsilonvectorplusplus.wordpress.com
Grammar for ww^R
• G = {Σ, V, R, S}
• Σ = {a, b}, V = {S}
• R = {S → aSa | bSb | aa | bb}
Parse tree for string: abbbba
        S
      / | \
     a  S  a
      / | \
     b  S  b
       / \
      b   b
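The parse tree corresponds to a derivation that peels matching symbols off both ends. A small sketch that reproduces it for the rules S → aSa | bSb | aa | bb (the function name derivation is invented for this example):

```python
def derivation(string):
    """Return the sentential forms deriving string, or None if it is not in the language."""
    prefix, suffix, core = "", "", string
    forms = ["S"]
    while True:
        if core in ("aa", "bb"):              # S -> aa | bb
            forms.append(prefix + core + suffix)
            return forms
        if len(core) >= 4 and core[0] == core[-1] and core[0] in "ab":
            prefix += core[0]                 # S -> aSa | bSb
            suffix = core[-1] + suffix
            core = core[1:-1]
            forms.append(prefix + "S" + suffix)
        else:
            return None

print(derivation("abbbba"))
print(derivation("abab"))
```

The matching ends are exactly the nested pair-wise correlation that a finite state machine cannot track for strings of unbounded length.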
Context Free Grammar for an RNA stem loop
• Language: w v w^cr, where w is one strand of the stem, v is the loop, and w^cr is the reverse complement of w
• G = {Σ, R, S}
• Σ = {S, L, a, c, g, u}
• R = {S → aSu | uSa | gSc | cSg | L,
       L → aL | cL | gL | uL | a | c | g | u}
Durbin et al., Biological Sequence Analysis, adapted
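One way to see what the S rules do is to peel complementary pairs from both ends of an RNA string until only the loop (derived from L) is left. The sketch below is a greedy illustration of that idea, not a general RNA folding method; the name stem_and_loop is hypothetical.

```python
# Pairs generated by S -> aSu | uSa | gSc | cSg
PAIRS = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g")}

def stem_and_loop(rna):
    """Return (number of stem pairs, loop) by peeling complementary ends.

    Assumes rna is a non-empty string over {a, c, g, u}.
    """
    if not rna or any(base not in "acgu" for base in rna):
        raise ValueError("expected a non-empty string over a, c, g, u")
    i, j = 0, len(rna) - 1
    # Keep at least one base for the loop, since L cannot derive the empty string.
    while j - i >= 2 and (rna[i], rna[j]) in PAIRS:
        i += 1
        j -= 1
    return i, rna[i:j + 1]

print(stem_and_loop("gcaaagc"))
```

Note that the grammar itself accepts every non-empty RNA string, because S → L and L derives any string over {a, c, g, u}; the grammar is useful for the structure it assigns, not for membership.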
Context Sensitive Grammars and Languages
• Can represent crossing pair-wise correlation between terminal symbols in the string
• Famous example: copy language (ww), e.g. a a b b a a b b
• Grammar rules are of the form:
– (T+NT)* NT (T+NT)* → (T+NT)+
– |LHS| <= |RHS| (the sentential form cannot shrink from one production step to the next)
CSG and Linear Bounded Automata
(Deferred until after Turing machines; see "Back to Linear Bounded Automata".)
Non-deterministic and stochastic models
• A stochastic grammar has a probability associated with each rule in the grammar
• Similarly, in automata, a probability would be associated with each transition
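As a small illustration, rule probabilities can be attached to the regular DNA grammar S → aS | cS | gS | tS | a | c | g | t. The uniform value 1/8 per rule is an assumption made up for this sketch and does not come from the slides.

```python
from fractions import Fraction

# Hypothetical rule probabilities: uniform over the 8 rules for S (they sum to 1).
P = {("S", x + "S"): Fraction(1, 8) for x in "acgt"}
P.update({("S", x): Fraction(1, 8) for x in "acgt"})

def string_probability(string):
    """Probability of the single derivation of string under the stochastic grammar."""
    if not string or any(ch not in "acgt" for ch in string):
        return Fraction(0)
    prob = Fraction(1)
    for ch in string[:-1]:
        prob *= P[("S", ch + "S")]         # S -> xS for every symbol but the last
    return prob * P[("S", string[-1])]     # S -> x ends the derivation

print(string_probability("ac"))
```

Because this grammar gives each string exactly one derivation, the probability of the string is the product of the probabilities of the rules used.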
Unrestricted Grammars and Recursively Enumerable Languages
• Grammar rules are of the form:
- (T+NT)* NT (T+NT)* → (T+NT)*
- The only rule is that the left hand side must contain at least one non-terminal (variable)
• A recursively enumerable language is one that can be represented by an unrestricted grammar
Turing Machines
• M = {Q, Σ, Γ, δ, q0, B, F}
• Q: Finite set of states
• Σ: Language alphabet
• Γ: Tape alphabet (Σ ⊆ Γ)
• δ: Transition function (Q × Γ → Q × Γ × {L,R})
• q0: Starting state
• B: The blank symbol
• F: Set of final states
Example
[Figure: state diagram with states q0, q1, q2, q3, q4 and transition labels a/X,R; Y/Y,R; b/Y,L; a/a,R; a/a,L; Y/Y,L; X/X,R; #/#,R]
Tape contents while processing input aabb (repeated frames and head positions omitted):
# # # # a a b b # # # # # # #
# # # # X a b b # # # # # # #
# # # # X a Y b # # # # # # #
# # # # X X Y b # # # # # # #
# # # # X X Y Y # # # # # # #
What is the language this TM accepts?
Example -cont. (input 2)
# # # # a b a b # # # # # # #
# # # # X b a b # # # # # # #
# # # # X Y a b # # # # # # #
Language: a^n b^n
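The two runs can be replayed with a small simulator. The sketch below is an assumption in one important respect: the slide's diagram only survives as a list of labels, so the assignment of labels to state pairs follows the usual a^n b^n construction (mark an a as X, find the matching b and mark it Y, return, repeat). The function name run_tm is invented for this example.

```python
from collections import defaultdict

BLANK = "#"

# (state, read symbol) -> (next state, written symbol, head move)
# Assumed wiring of the slide's transition labels; q4 is the accepting state.
DELTA = {
    ("q0", "a"): ("q1", "X", "R"),
    ("q0", "Y"): ("q3", "Y", "R"),
    ("q1", "a"): ("q1", "a", "R"),
    ("q1", "Y"): ("q1", "Y", "R"),
    ("q1", "b"): ("q2", "Y", "L"),
    ("q2", "a"): ("q2", "a", "L"),
    ("q2", "Y"): ("q2", "Y", "L"),
    ("q2", "X"): ("q0", "X", "R"),
    ("q3", "Y"): ("q3", "Y", "R"),
    ("q3", BLANK): ("q4", BLANK, "R"),
}

def run_tm(word, delta=DELTA, start="q0", accept="q4", max_steps=10_000):
    """Return (accepted, final tape contents over the input cells)."""
    tape = defaultdict(lambda: BLANK, enumerate(word))
    state, head = start, 0
    for _ in range(max_steps):
        if state == accept or (state, tape[head]) not in delta:
            break  # halt: accepting state reached or no transition defined
        state, symbol, move = delta[(state, tape[head])]
        tape[head] = symbol
        head += 1 if move == "R" else -1
    contents = "".join(tape[i] for i in range(len(word)))
    return state == accept, contents

print(run_tm("aabb"))
print(run_tm("abab"))
```

Under this wiring the final tapes are XXYY for aabb and XYab for abab, the same tape contents as the last frames of the two walkthroughs.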
Computing with Turing Machines
Example: a TM that accepts a number x divisible by 3 in unary format and outputs the result of the computation x/3
[Figure: state diagram with states q0 to q8 and transition labels 1/X,R; 1/1,R; #/#,R; #/1,L; 1/1,L; #/#,L; X/X,R]
Divide by 3 TM
Tape contents while processing input 111111 (repeated frames and head positions omitted):
# # # # 111111 # # # # # # #
# # # # X11111 # # # # # # #
# # # # XX1111 # # # # # # #
# # # # XXX111 # # # # # # #
# # # # XXX111 # 1 # # # # #
# # # # XXXX11 # 1 # # # # #
# # # # XXXXX1 # 1 # # # # #
# # # # XXXXXX # 1 # # # # #
# # # # XXXXXX # 11 # # # #
Exercise: try to parse 1111
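The tape-level effect of this machine can be mimicked in a few lines. This is a sketch of the behaviour only: it does not reproduce the slide's states q0 to q8, the function name is hypothetical, and treating the empty input as rejected is an arbitrary choice made here.

```python
def divide_by_three_unary(word):
    """Mark the input three 1s at a time and write one output 1 per group.

    Returns the final tape as 'marked input # quotient', or None if rejected.
    """
    if not word or set(word) != {"1"}:
        return None                       # assumption: empty or non-unary input is rejected
    tape, quotient = list(word), ""
    for start in range(0, len(tape), 3):
        if len(tape) - start < 3:
            return None                   # leftover 1s: x is not divisible by 3
        tape[start:start + 3] = "XXX"     # like three 1/X,R moves
        quotient += "1"                   # like the #/1,L move past the separator
    return "".join(tape) + "#" + quotient

print(divide_by_three_unary("111111"))
print(divide_by_three_unary("1111"))
```

For 111111 the result has six X marks and the quotient 11, as in the last tape of the walkthrough; 1111 leaves one unmatched 1 and is rejected.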
More complex TM models
• Several tapes
• Several read/write heads
A Turing machine can simulate a computer.
Back to Linear Bounded Automata
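[Figure: a tape with the input string enclosed between two boundary markers, $ ... $, and a control unit holding the current state]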
• LBA is a TM whose read/write head never moves off the portion of the tape occupied by the input string