Introduction
LING 572
Fei Xia
Week 1: 1/3/06
Outline
• Course overview
• Problems and methods
• Mathematical foundation
  – Probability theory
  – Information theory
Course overview
Course objective
• Focus on statistical methods that produce state-of-the-art results
• Questions to ask for each algorithm:
  – How does the algorithm work: input, output, steps?
  – What kinds of tasks can the algorithm be applied to?
  – How much data is needed?
    • Labeled data
    • Unlabeled data
General info
• Course website:
  – Syllabus (incl. slides and papers): updated every week
  – Message board
  – ESubmit
• Office hour: W: 3-5pm.
• Prerequisites:
  – Ling570 and Ling571
  – Programming: C, C++, or Java; Perl is a plus
  – Introduction to probability and statistics
Expectations
• Reading:
  – Papers are online: who doesn't have access to printers?
  – Reference book: Manning & Schutze (M&S)
  – Finish reading before class. Bring your questions to class.
• Grade:
  – Homework (3): 30%
  – Project (6 parts): 60%
  – Class participation: 10%
  – No quizzes or exams
Assignments
Hw1: FSA and HMM
Hw2: DT, DL, and TBL.
Hw3: Boosting
No coding. Bring the finished assignments to class.
Project
P1: Method 1 (baseline): trigram
P2: Method 2: TBL
P3: Method 3: MaxEnt
P4: Method 4: choose one of four tasks
P5: Presentation
P6: Final report
Methods 1-3 are supervised methods.
Method 4: bagging, boosting, semi-supervised learning, or system combination.
P1 is an individual task; P2-P6 are group tasks. A group should have no more than three people.
Use ESubmit. You will need to use others' code and to write your own code.
Summary of Ling570
• Overview: corpora, evaluation
• Tokenization
• Morphological analysis
• POS tagging
• Shallow parsing
• N-grams and smoothing
• WSD
• NE tagging
• HMM
Summary of Ling571
• Parsing
• Semantics
• Discourse
• Dialogue
• Natural language generation (NLG)
• Machine translation (MT)
570/571 vs. 572
• 572 focuses more on statistical approaches.
• 570/571 are organized by tasks; 572 is organized by learning methods.
• I assume that you know
  – The basics of each task: POS tagging, parsing, …
  – The basic concepts: PCFG, entropy, …
  – Some learning methods: HMM, FSA, …
An example
• 570/571:
  – POS tagging: HMM
  – Parsing: PCFG
  – MT: Model 1-4 training
• 572:
  – HMM: forward-backward algorithm
  – PCFG: inside-outside algorithm
  – MT: EM algorithm
  All are special cases of the EM algorithm, one method of unsupervised learning.
Course layout
• Supervised methods
  – Decision tree
  – Decision list
  – Transformation-based learning (TBL)
  – Bagging
  – Boosting
  – Maximum Entropy (MaxEnt)
Course layout (cont)
• Semi-supervised methods
  – Self-training
  – Co-training
• Unsupervised methods
  – EM algorithm
    • Forward-backward algorithm
    • Inside-outside algorithm
    • EM for PM models
Outline
• Course overview
• Problems and methods
• Mathematical foundation
  – Probability theory
  – Information theory
Problems and methods
Types of ML problems
• Classification problem
• Estimation problem
• Clustering
• Discovery
• …
A learning method can be applied to one or more types of ML problems.
We will focus on the classification problem.
Classification problem
• Given a set of classes and data x, decide which class x belongs to.
• Labeled data:
  – {(xi, yi)} is a set of labeled examples.
  – xi is a list of attribute values.
  – yi is a member of a pre-defined set of classes.
Examples of classification problem
• Disambiguation:
  – Document classification
  – POS tagging
  – WSD
  – PP attachment given a set of other phrases
• Segmentation:
  – Tokenization / word segmentation
  – NP chunking
Learning methods
• Modeling: represent the problem as a formula and decompose the formula into a function of parameters
• Training stage: estimate the parameters
• Test (decoding) stage: find the answer given the parameters
Modeling
• Joint vs. conditional models:
  – P(data, model)
  – P(model | data)
  – P(data | model)
• Decomposition:
  – Which variable conditions on which variable?
  – What independence assumptions?
An example of different modeling
P(A,B,C) = P(A) P(B|A) P(C|A,B) \approx P(A) P(B|A) P(C|B)

P(A,B,C) = P(B) P(C|B) P(A|B,C) \approx P(B) P(C|B) P(A|B)
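As a sanity check, here is a minimal Python sketch (the joint distribution and variable names are made-up toy values, not from the slides) verifying that both decompositions recover the same joint probability:

```python
# Minimal sketch (toy numbers for illustration): both chain-rule decompositions
# of P(A,B,C) give the same value on a small distribution over binary A, B, C.
joint = {(0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.20, (0, 1, 1): 0.15,
         (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.15, (1, 1, 1): 0.20}

def marg(**fixed):
    """Marginal probability of the fixed variables (names 'a', 'b', 'c')."""
    idx = {"a": 0, "b": 1, "c": 2}
    return sum(p for abc, p in joint.items()
               if all(abc[idx[k]] == v for k, v in fixed.items()))

a, b, c = 1, 0, 1
lhs = joint[(a, b, c)]
# P(A) * P(B|A) * P(C|A,B)
d1 = marg(a=a) * (marg(a=a, b=b) / marg(a=a)) * (joint[(a, b, c)] / marg(a=a, b=b))
# P(B) * P(C|B) * P(A|B,C)
d2 = marg(b=b) * (marg(b=b, c=c) / marg(b=b)) * (joint[(a, b, c)] / marg(b=b, c=c))
print(lhs, d1, d2)   # all three agree (up to floating-point error)
```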
Training
• Objective functions:
  – Maximize likelihood: \hat{\theta}_{ML} = \arg\max_{\theta} P(data | \theta)
  – Minimize error rate
  – Maximum entropy
  – …
• Supervised, semi-supervised, unsupervised:
  – Ex: maximize likelihood
    • Supervised: simple counting
    • Unsupervised: EM
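To make "supervised: simple counting" concrete, a minimal Python sketch (the tagged corpus below is made up for illustration) showing that maximum-likelihood estimation of P(tag | word) reduces to relative-frequency counts:

```python
# Minimal sketch (toy data): MLE of P(tag | word) from labeled data is
# just relative-frequency counting.
from collections import Counter, defaultdict

# A made-up tagged corpus of (word, tag) pairs.
data = [("the", "DT"), ("dog", "NN"), ("runs", "VBZ"),
        ("the", "DT"), ("run", "NN"), ("run", "VB")]

pair_counts = Counter(data)
word_counts = Counter(word for word, _ in data)

# MLE: P(tag | word) = count(word, tag) / count(word)
p_tag_given_word = defaultdict(dict)
for (word, tag), c in pair_counts.items():
    p_tag_given_word[word][tag] = c / word_counts[word]

print(p_tag_given_word["run"])   # {'NN': 0.5, 'VB': 0.5}
```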
Decoding
• DP algorithms:
  – CYK for PCFG
  – Viterbi for HMM
  – …
• Pruning:
  – TopN: keep the top N hypotheses at each node.
  – Beam: keep hypotheses whose weights >= beam * max_weight.
  – Threshold: keep hypotheses whose weights >= threshold.
  – …
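A minimal Python sketch of the three pruning strategies (the hypothesis list and parameter values are made up for illustration):

```python
# Minimal sketch (toy numbers): three ways to prune a list of
# (hypothesis, weight) pairs at one node of the search.
hyps = [("h1", 0.9), ("h2", 0.6), ("h3", 0.4), ("h4", 0.05)]

def prune_topn(hyps, n):
    """Keep the top-N hypotheses by weight."""
    return sorted(hyps, key=lambda h: h[1], reverse=True)[:n]

def prune_beam(hyps, beam):
    """Keep hypotheses whose weight >= beam * max_weight."""
    max_w = max(w for _, w in hyps)
    return [(h, w) for h, w in hyps if w >= beam * max_w]

def prune_threshold(hyps, threshold):
    """Keep hypotheses whose weight >= threshold."""
    return [(h, w) for h, w in hyps if w >= threshold]

print(prune_topn(hyps, 2))         # [('h1', 0.9), ('h2', 0.6)]
print(prune_beam(hyps, 0.5))       # [('h1', 0.9), ('h2', 0.6)]
print(prune_threshold(hyps, 0.3))  # [('h1', 0.9), ('h2', 0.6), ('h3', 0.4)]
```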
Outline
• Course overview
• Problems and methods
• Mathematical foundation
  – Probability theory
  – Information theory
Probability Theory
Probability theory
• Sample space, event, event space
• Random variable and random vector
• Conditional probability, joint probability, marginal probability (prior)
Sample space, event, event space
• Sample space (Ω): a collection of basic outcomes.
  – Ex: toss a coin twice: {HH, HT, TH, TT}
• Event: an event is a subset of Ω.
  – Ex: {HT, TH}
• Event space (2^Ω): the set of all possible events.
Random variable
• The outcome of an experiment need not be a number.
• We often want to represent outcomes as numbers.
• A random variable is a function that associates a unique numerical value with every outcome of an experiment.
• A random variable is a function X: Ω → R.
• Ex: toss a coin once: X(H) = 1, X(T) = 0
Two types of random variable
• Discrete random variable: X takes on only a countable number of distinct values.
  – Ex: toss a coin 10 times; X is the number of tails that are noted.
• Continuous random variable: X takes on an uncountable number of possible values.
  – Ex: X is the lifetime (in hours) of a light bulb.
Probability function
• The probability function of a discrete variable X gives the probability p(xi) that the random variable equals xi, i.e., p(xi) = P(X = xi).

0 \le p(x_i) \le 1

\sum_{x_i} p(x_i) = 1
Random vector
• A random vector is a finite-dimensional vector of random variables: X = [X_1, …, X_k].
• P(x) = P(x_1, x_2, …, x_n) = P(X_1 = x_1, …, X_n = x_n)
• Ex: P(w1, …, wn, t1, …, tn)
Three types of probability
• Joint prob: P(x,y)= prob of x and y happening together
• Conditional prob: P(x|y) = prob of x given a specific value of y
• Marginal prob: P(x) = prob of x, summed over all possible values of y
Common equations
P(A, B) = P(A) P(B|A) = P(B) P(A|B)

P(B|A) = \frac{P(A, B)}{P(A)}

P(A) = \sum_B P(A, B)
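A minimal Python sketch (toy numbers for illustration) computing marginal and conditional probabilities from a small joint table, matching the equations above:

```python
# Minimal sketch (toy numbers): joint, marginal, and conditional
# probabilities from a small table P(A, B).
joint = {("rain", "wet"): 0.3, ("rain", "dry"): 0.1,
         ("sun",  "wet"): 0.1, ("sun",  "dry"): 0.5}

# Marginal: P(A) = sum_B P(A, B)
p_a = {}
for (a, b), p in joint.items():
    p_a[a] = p_a.get(a, 0.0) + p

# Conditional: P(B | A) = P(A, B) / P(A)
p_b_given_a = {(b, a): joint[(a, b)] / p_a[a] for (a, b) in joint}

print(p_a["rain"])                    # ~0.4
print(p_b_given_a[("wet", "rain")])   # 0.3 / 0.4 = 0.75
# Check: P(A, B) = P(A) * P(B | A)
print(p_a["rain"] * p_b_given_a[("wet", "rain")])   # ~0.3
```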
More general cases
P(A_1, \ldots, A_n) = P(A_1) \prod_{i=2}^{n} P(A_i | A_1, \ldots, A_{i-1})

P(A_1) = \sum_{A_2, \ldots, A_n} P(A_1, \ldots, A_n)
Information Theory
Information theory
• It is the use of probability theory to quantify and measure “information”.
• Basic concepts:
  – Entropy
  – Joint entropy and conditional entropy
  – Cross entropy and relative entropy
  – Mutual information and perplexity
Entropy
• Entropy is a measure of the uncertainty associated with a distribution.
• The lower bound on the number of bits it takes to transmit messages.
• An example:
  – Display the results of horse races.
  – Goal: minimize the number of bits to encode the results.
H(X) = -\sum_x p(x) \log p(x)
An example
• Uniform distribution: pi=1/8.
• Non-uniform distribution: (1/2,1/4,1/8, 1/16, 1/64, 1/64, 1/64, 1/64)
H(X) = -8 \cdot \frac{1}{8} \log_2 \frac{1}{8} = 3 bits

H(X) = -(\frac{1}{2}\log\frac{1}{2} + \frac{1}{4}\log\frac{1}{4} + \frac{1}{8}\log\frac{1}{8} + \frac{1}{16}\log\frac{1}{16} + 4 \cdot \frac{1}{64}\log\frac{1}{64}) = 2 bits

Code: (0, 10, 110, 1110, 111100, 111101, 111110, 111111)
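A minimal Python sketch reproducing the two entropy values above:

```python
# Minimal sketch: entropy of the two horse-race distributions,
# H(X) = -sum_x p(x) log2 p(x).
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

uniform = [1/8] * 8
skewed  = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

print(entropy(uniform))  # 3.0 bits
print(entropy(skewed))   # 2.0 bits
```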
Entropy of a language
• The entropy of a language L:

H(L) = -\lim_{n \to \infty} \frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log p(x_{1n})

• If we make certain assumptions that the language is "nice", then the entropy can be calculated as:

H(L) = -\lim_{n \to \infty} \frac{1}{n} \log p(x_{1n})
Joint and conditional entropy
• Joint entropy:

H(X, Y) = -\sum_x \sum_y p(x, y) \log p(x, y)

• Conditional entropy:

H(Y|X) = -\sum_x \sum_y p(x, y) \log p(y|x) = H(X, Y) - H(X)
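A minimal Python sketch (toy joint distribution, made up for illustration) computing joint and conditional entropy and checking the identity H(Y|X) = H(X,Y) - H(X):

```python
# Minimal sketch (toy numbers): joint entropy H(X,Y), conditional entropy
# H(Y|X), and the identity H(Y|X) = H(X,Y) - H(X).
import math

joint = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.4, ("b", 1): 0.1}

p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

h_xy = -sum(p * math.log2(p) for p in joint.values())
h_x  = -sum(p * math.log2(p) for p in p_x.values())
h_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, y), p in joint.items())

print(h_xy, h_x, h_y_given_x)   # h_y_given_x equals h_xy - h_x
```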
Cross Entropy
• Entropy:

H(X) = -\sum_x p(x) \log p(x)

• Cross entropy:

H_c(X) = -\sum_x p(x) \log q(x)

• Cross entropy is a distance measure between p(x) and q(x): p(x) is the true probability; q(x) is our estimate of p(x).

H(X) \le H_c(X)
Cross entropy of a language
• The cross entropy of a language L:

H(L, q) = -\lim_{n \to \infty} \frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log q(x_{1n})

• If we make certain assumptions that the language is "nice", then the cross entropy can be calculated as:

H(L, q) = -\lim_{n \to \infty} \frac{1}{n} \log q(x_{1n})
Relative Entropy
• Also called Kullback-Leibler distance:

KL(p || q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)} = H_c(X) - H(X)

• Another distance measure between prob functions p and q.
• KL distance is asymmetric (not a true distance):

KL(p || q) \ne KL(q || p)
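A minimal Python sketch (toy distributions, made up for illustration) computing cross entropy and KL distance, showing KL(p||q) = H_c(X) - H(X) and the asymmetry:

```python
# Minimal sketch (toy numbers): cross entropy H_c, KL(p||q) = H_c - H,
# and the asymmetry KL(p||q) != KL(q||p).
import math

p = [0.5, 0.25, 0.25]   # "true" distribution
q = [0.4, 0.4, 0.2]     # our estimate of p

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(cross_entropy(p, q) - entropy(p))  # equals kl(p, q)
print(kl(p, q), kl(q, p))                # the two directions differ
```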
Relative entropy is non-negative
Since \log z \le z - 1 for z > 0:

-KL(p || q) = -\sum_x p(x) \log \frac{p(x)}{q(x)} = \sum_x p(x) \log \frac{q(x)}{p(x)} \le \sum_x p(x) (\frac{q(x)}{p(x)} - 1) = \sum_x q(x) - \sum_x p(x) = 0

Therefore KL(p || q) \ge 0.
Mutual information
• It measures how much is in common between X and Y:
• I(X;Y)=KL(p(x,y)||p(x)p(y))
I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x) p(y)} = H(X) + H(Y) - H(X, Y) = I(Y; X)
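A minimal Python sketch (toy joint distribution, made up for illustration) checking that the direct definition of I(X;Y) matches H(X) + H(Y) - H(X,Y):

```python
# Minimal sketch (toy numbers): mutual information
# I(X;Y) = sum_{x,y} p(x,y) log2[ p(x,y) / (p(x) p(y)) ] = H(X) + H(Y) - H(X,Y).
import math

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items())

def H(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(mi)                                                     # direct definition
print(H(p_x.values()) + H(p_y.values()) - H(joint.values()))  # same value
```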
Perplexity
• Perplexity is 2^H.
• Perplexity is the weighted average number of choices a random variable has to make.
Summary
• Course overview
• Problems and methods
• Mathematical foundation
  – Probability theory
  – Information theory
M&S Ch2
Next time
• FSA
• HMM: M&S Ch 9.1 and 9.2