1 languages, grammars, and regular expressions ling 570 fei xia week 2: 10/03/07 texpoint fonts used...

Post on 21-Dec-2015

215 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

1

Languages, grammars, and regular expressions

LING 570

Fei Xia

Week 2: 10/03/07

2

Today

• Hw1 is due: 1% penalty per hour after 3pm.

• Probability Theory (from last time)

• Formal languages, formal grammars, and regular expression J&M 2

• Hw2

3

Coming up

• Next Monday– Finite-state automaton: J&M Chapter 2

• Next Wed– Finite-state transducer: J&M Section 3.4– Hw2 is due

4

Probability theory(from last time)

5

Common trick #1: Maximum likelihood estimation

• An example: toss a coin 3 times, and got two heads. What is the probability of getting a head with one toss?

• Maximum likelihood: (ML) * = arg max P(data | )

• In the example, – P(X=2) = 3 * p * p * (1-p) when p=2/3, P(X=2) reaches the maximum.

6

Common trick #2:Chain rule

)|(*)()|(*)(),( YXPYPXYPXPYXP

),...|(),...,( 111

1 ii

in XXXPXXP

7

Common trick #3:joint prob Marginal prob

y

yxPxP ),()(

nxx

nxxPxP,...,

11

2

),...,()(

8

Common trick #4:Bayes’ rule

)(

)()|(

)(

),()|(

xP

yPyxP

xP

yxPxyP

)()|(maxarg

)(

)()|(maxarg

)|(maxarg*

yPyxP

xP

yPyxP

xyPy

y

y

y

9

Independent random variables

• Two random variables X and Y are independent iff the value of X has no influence on the value of Y and vice versa.

• P(X,Y) = P(X) P(Y)

• P(Y|X) = P(Y)

• P(X|Y) = P(X)

• Our previous coin examples: P(X, C) != P(X) P(C)

10

Conditional independence

Once we know the value of Z, knowing the value of Y does not help us predict the value of X, and vice versa.

• P(X,Y | Z) = P(X|Z) P(Y|Z)

• P(X|Y,Z) = P(X | Z)

• P(Y|X, Z) = P(Y |Z)

11

Independence and conditional independence

• If A and B are independent, are they conditional independent?

• Example:– Burglar, Earthquake– Alarm

12

Common trick #5:Independence assumption

)|(

),...|(),...,(

11

111

1

ii

i

ii

in

xxP

xxxPxxP

13

An example

• P(w1 w2 … wn)

= P(w1) P(w2 | w1) P(w3 | w1 w2) * …

* P(wn | w1 …, wn-1)

¼ P(w1) P(w2 | w1) …. P(wn | wn-1)

• Why do we make independence assumption which we know are not true?

14

Summary of elementaryprobability theory

• Basic concepts: sample space, event space, random variable, random vector

• Joint / conditional /marginal probability

• Independence and conditional independence

• Five common tricks:– Max likelihood estimation– Chain rule– Calculating marginal probability from joint probability– Bayes’ rule– Independence assumption

15

Formal languages, formal grammars and

regular languages

16

Unit #1

• Formal grammar, language and regular expression

• Finite-state automaton (FSA)

• Finite-state transducer (FST)

• Morphological analysis using FST

17

Other units in ling570

• Unit #2: LM and smoothing

• Unit #3: HMM and POS tagging

• Unit #4: Classification and sequence labeling

• Unit #5: Introduction to other common tasks (e.g., IR/IE/WSD)

18

Regular expression

• Two concepts:– Regular expression in formal language theory– Regular expression (or pattern) in pattern matching: it

is a way of expressing a pattern for the purpose of matching a string

• Both concepts describe a set of strings.

• The two concepts are closely related, but the latter is often more expressive than the former.

19

Outline

• Formal language– Regular language

• Regular expression in formal language theory

• Formal grammars– Regular grammars

• Regular expression in pattern matching

20

Formal languages

21

Definition of formal language

• An alphabet is a finite set of symbols:– Ex: § = {a, b, c}

• A string is a finite sequence of symbols from a particular alphabet juxtaposed:– Ex: the string “baccab”– Ex: empty string ²

• A formal language is a set of strings defined over some alphabet.– Ex1: {aa, bb, cc, aaaa, abba, acca, baab, bbbb, ….}– Ex2: {an bn | n > 0} – Ex3: the empty set Á

22

Definition of regular languages

• The class of regular languages over an alphabet § is formally defined as:– The empty set, Á, is a regular language– 8 a 2 § [ ², {a} is a regular language.– If L1 and L2 are regular languages, then so are:

(a) L1 ² L2 = {xy | x 2 L1; y 2 L2} (concatenation)

(b) L1 L2 (union or disjunction)

(c) L1* = {x1 x2 …, xn | xi 2 L1 , n 2 N} (Kleene closure)

– There are no other regular languages

23

Kleene star

Another way to define L*:• L2 = L ² L• Ln = Ln-1 ² L• L* = { ²} L1 L2 …

Examples:• L = {a, bc}• L2 = {aa, abc, bca, bcbc}• L* = {abcbca, ….} = { (a|bc)*}

24

Properties

• Regular languages are closed under– Concatenation – Union– Kleene closure

• Regular languages are also closed under: – Intersection: L1 Å L2

– Difference: L1 – L2

– Complementation: §* - L1

– Reversal

25

Are the following languages regular?

• {a, aa, aaa, ….}• Any finite set of strings

• {xy | x2 §*, and y is the reverse of x}• {xx | x 2 §*}

• {an bn | n 2 N}• {an bn cn | n 2 N}

26

Regular expression

27

Definition of Regular expression(as in formal language theory)

• The set of regular expressions is defined as follows:(1) Every symbol of is a regular expression

(2) ² is a regular expression

(3) If r1 and r2 are regular expressions, so are (r1), r1 r2, r1 | r2 , r1*

(4) Nothing else is a regular expression.

§

28

Examples

• ab*c• a (0|1|2|..|9)* b• (CV | CCV)+ C?C?: C is a consonant, V is

a vowel

Other operations that we can use:• a+ = a a* • a? = (a | ²)

29

Relation between regular language and Regex

• With every regular expression we can associate a regular language.

• Conversely, every regular language can be obtained from a regular expression.

• Examples:– Regex = ab*c– Regular language = {ac, abc, abbc, ….}

30

Formal grammar

31

Definition of formal grammar

A formal grammar is a concise description of a formal language. It is a (N, §, P, S) tuple:

• A finite set N of nonterminal symbols

• A finite set Σ of terminal symbols that is disjoint from N

• A finite set P of production rules, each of the form: (§ N)* N (§ N)* ! (§ N)* • A distinguished symbol S 2 N that is the start symbol

32

Chomsky hierarchyThe left-hand side of a rule must contain at least one non-terminal. ®, ¯, ° 2 (N §)*, A,B 2 N, a 2 §

• Type 0: unrestricted grammar: no other constraints.

• Type 1: Context-sensitive grammar: The rules must be of the form: ® A ¯ ! ® ° ¯

• Type 2: Context-free grammar (CFGs): The rules must be of the form: A ! ®

• Type 3: Regular grammar: The rules are of the forms: right regular grammar: A ! a, A ! aB, or A ! ² left regular grammar: A ! a, A ! Ba, or A ! ²

Are there other kinds of grammars?

33

Strings generated from a grammar

• The rules are:S → x | y | z | S + S | S - S | S * S | S/S | (S)

• What strings can be generated?

• A grammar is ambiguous if there exists at least one string which has multiple parse trees.

• Is this grammar ambiguous?

34

Languages generated by grammars

• Given a grammar G, L(G) is the set of strings that can be generated from G.

• Ex: G = (N, §, P, S)

N = {S}, § = {a, b, c} P = { S ! a S b, S ! c }

What is L(G)?

35

The relation between regular grammars and regular languages

• The regular grammars describe exactly all regular languages.

• All the following are equivalent:– Regular languages– Regular grammars– Regular expression– Finite state automaton (FSA)

36

Relation between grammars and languages (from wikipedia page)

Chomskyhierarchy

Grammars Languages Minimalautomaton

Type-0 Unrestricted Recursively enumerable

Turing machine

n/a (no common name) Recursive Decider

Type-1 Context-sensitive Context-sensitive Linear-bounded

n/a Indexed Indexed Nested stack

n/a Tree-adjoining Mildly context-sensitive

Thread

Type-2 Context-free Context-free Nondeterministic pushdown

n/a Deterministic context-free

Deterministic context-free

Deterministic pushdown

Type-3 Regular Regular Finite state

37

How about human languages?

• Are they formal languages?– What is alphabet?– What is string?

• What type of formal languages are they?

38

Outline

• Formal language – Regular language

• Regular expression in formal language theory

• Formal grammar– Regular grammar

• Patterns in pattern matching J&M 2.1

39

Patterns in Perl

[ab] a|b. match any character^ the starting position in a string$ the ending position in a string(..) defines a marked subexpressiona* match “a” zero or more timesa+ match “a” one or more timea? match “a” zero or one timea{n,m} “a” appears n to m times

40

Special symbols in the patterns

\s match any whitespace char\d match any digit\w match any letter or digit

\S match any non-whitespace char…

\+, \-, \., \?, \*, …

41

Examples

Integer: (\+|\-)?\d+

Real: (\+|\-)?\d+\.\d+

Scientific notation: (\+|\-)? \d+ (\.\d+)?e (\+|\-)?\d+

Any of the three:

(\+|\-)? \d+ (\.\d+)? (e (\+|\-)?\d+)?

42

Patterns in Perl and Regex

/^(.*)\1$/ { xx | x 2 §*}

/^(.+)a(.+)\1\2$/ {xayxy | x, y 2 §*}

The extra power comes from the ability to refer to marked subexpression.

43

Outline

• Formal language – Regular language

• Regular expression in formal language theory

• Formal grammars– Regular grammars

• Regex patterns in pattern matching

44

Homework #2

45

• Part I: probability

• Part II: formal grammar

• Part III: regular expression

top related