CS546: Machine Learning and Natural Language, Lecture 7: Introduction to Classification (2009)


TRANSCRIPT

Page 1: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

1

CS546: Machine Learning and Natural Language

Lecture 7: Introduction to Classification

2009

Page 2: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

2

Class Presentations:

You do not need to (and in fact, cannot) present the whole paper.

Focus on what’s new, what’s interesting, what’s unique. Don’t re-do material presented in class; assume it’s known and mention it in passing if needed.

Notes

Page 3: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

3

Illinois’ bored of education [board]
Nissan Car and truck plant; plant and animal kingdom
(This Art) (can N) (will MD) (rust V) vs. V, N, N
The dog bit the kid. He was taken to a veterinarian; a hospital
Tiger was in Washington for the PGA Tour
Finance; Banking; World News; Sports

Important or not important; love or hate

Classification: Ambiguity Resolution

Page 4: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

4

The goal is to learn a function f: X → Y that maps observations in a domain to one of several categories.

Task: Decide which of {board, bored} is more likely in the given context:
X: some representation of "The Illinois’ _______ of education met yesterday…"
Y: {board, bored}

Typical learning protocol: Observe a collection of labeled examples (x, y) ∈ X × Y. Use it to learn a function f: X → Y that is consistent with the observed examples and (hopefully) performs well on new, previously unobserved examples.

Classification

Page 5: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

5

Theoretically: Generalization
It is possible to say something rigorous about the future behavior of learners.
We have a good understanding of the issues that affect the quality of generalization, e.g., how many examples one needs to see in order to guarantee good behavior on previously unobserved examples.

Algorithmically: good learning algorithms for linear representations.
They can deal with very high dimensionality (10^6 features), are very efficient in terms of computation and number of examples, and can be run on-line. There is an understanding that many algorithms behave about the same.

Key issues remaining:
Learning protocols: how to minimize interaction (supervision); how to map domain/task information to supervision; semi-supervised learning; active learning; ranking.
What are the features? No good theoretical understanding here.
Building systems that make use of multiple, possibly dependent classifiers.

Classification is Well Understood

Page 6: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

6

Classification model and justification (discriminative model)

Probabilistic models: relations to discriminative models; justification

Algorithms: linear classification (Perceptron, SVM, etc.), Max Entropy, features and kernels

Learning protocols: how to get supervision (semi-supervised; co-training; co-ranking)

Multi-class classification, structured prediction, etc.: as a generalization of Boolean classification models, and as a way to understand structured learning and inference

Global models and inference: joint training, modular training

Coarse Plan (not Chronological)

Page 7: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

7

Illinois’ bored of education. → board

We took a walk it the park two. → in, too

We fill it need no be this way → feel, not

The amount of chairs in the room is… → number

I’d like a peace of cake for desert → piece, dessert

Context Sensitive Text Correction

Page 8: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

8

Disambiguation Problems

Middle Eastern ____ are known for their sweetness
Task: Decide which of {deserts, desserts} is more likely in the given context.

Ambiguity: modeled as confusion sets (class labels C)

C = {deserts, desserts}

C = {Noun, Adj, …, Verb, …}

C = {topic=Finance, topic=Computing}

C = {NE=Person, NE=Location}

Page 9: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

9

Disambiguation Problems

Archetypical disambiguation problem

Data is available

In principle, a solved problem (Golding & Roth, Mangu & Brill, …)

But many issues are involved in making an “in principle” solution a realistic one.

Page 10: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

10

Learning to Disambiguate

Given a confusion set C = {deserts, desserts} and a sentence s: Middle Eastern ____ are known for their sweetness
Map s into a feature-based representation.
Learn a function F_C that determines which element of C = {deserts, desserts} is more likely in a given context.

Evaluate the function on future sentences.

Page 11: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

11

S = I don’t know whether to laugh or cry
Consider words, POS tags, and relative location within a window around the target word. Generate binary features representing the presence of:

a word/POS within a window around the target word: don’t within +/-3, know within +/-3, Verb at -1, to within +/-3, laugh within +/-3, to at +1

conjunctions of size 2, within a window of size 3: words: know__to, ___to laugh; POS+words: Verb__to, ____to Verb

Learning Approach: Representation
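A minimal sketch (Python, not part of the original slides) of the kind of window-based binary features described above. The feature names, the toy POS tags, and the window size are illustrative assumptions, and the size-2 conjunctions are simple adjacent-word pairs rather than the exact gapped conjunctions on the slide.

```python
# Sketch: binary window features for the {whether, weather} example (illustrative).

def window_features(words, tags, target, k=3):
    """Return the set of active binary features around position `target`."""
    feats = set()
    lo, hi = max(0, target - k), min(len(words), target + k + 1)
    for i in range(lo, hi):
        if i == target:
            continue
        offset = i - target
        feats.add(f"word={words[i]}@within±{k}")   # word anywhere in the window
        feats.add(f"pos={tags[i]}@{offset:+d}")    # POS tag at a fixed offset
    # conjunctions of size 2 (here: adjacent word pairs inside the window)
    for i in range(lo, hi - 1):
        if target in (i, i + 1):
            continue
        feats.add(f"collocation={words[i]}_{words[i + 1]}")
    return feats

words = "I don't know whether to laugh or cry".split()
tags = ["PRP", "VBP", "VB", "IN", "TO", "VB", "CC", "VB"]   # toy tags, assumed
print(sorted(window_features(words, tags, target=3)))
```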

Page 12: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

12

S = I don’t know whether to laugh or cry is represented as the set of its active features:
S = (don’t at -2, know within +/-3, …, ____to Verb, ...)
Label = the confusion set element that occurs in the text

Hope: S = I don’t care whether to laugh or cry has almost the same representation.

This representation can be used by any propositional learning algorithm (features, examples).

Previous works: TBL (Decision Lists), NB, SNoW, DT, ...

Learning Approach: Representation

Page 13: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

13

There is a huge number of potential features (~10^5). Out of these, only a small number is actually active in each example.

The representation can be significantly smaller if we list only the features that are active in each example.

Some algorithms can take this into account; some cannot (later).

Notes on Representation
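A short illustrative sketch of the “active features only” representation: feature strings are interned into integer ids on first sight, so the feature space never has to be enumerated (or even bounded) up front.

```python
# Sketch: storing only active features per example (sparse representation).
feature_ids = {}

def to_sparse(active_features):
    ids = []
    for f in active_features:
        if f not in feature_ids:
            feature_ids[f] = len(feature_ids)   # allocate a new id on first sight
        ids.append(feature_ids[f])
    return sorted(ids)

example = {"word=know@within±3", "pos=TO@+1", "collocation=know_to"}
print(to_sparse(example))   # only 3 of ~10^5 potential features are stored
```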

Page 14: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

14

Formally: a feature is a characteristic function over sentences, χ: S → {0,1}

When the number of features is fixed, the collection of all examples is {(χ1, χ2, …, χn)} = {0,1}^n

When we do not want to fix the number of features (very large number, on-line algorithms, …) we can work in the infinite attribute domain: {(χ1, χ2, …, χn, …)} = {0,1}^∞

Notes on Representation (2)

Page 15: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

15

Consider all training data S: {(l, f1, f2, …)}. Represent it as:
S = {(f, #(l=0), #(l=1))} for all features f

1. Choose the best feature f* (and the label it suggests)
2. S ← S \ {examples labeled in (1)}
3. Go to 1

An Algorithm

“best” can be defined in multiple ways. E.g., sort features by P(label | feature).

Page 16: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

16

If f1 then label1
Else, if f2 then label2
Else …
Else default label

A decision list.

Issues: How well will this do? We train on the training data; what about new data?

An Algorithm: Hypothesis
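The greedy procedure of the previous two slides could be sketched roughly as follows (illustrative Python; the “best feature” criterion and the stopping details are assumptions, since the slide leaves them open).

```python
from collections import Counter

def learn_decision_list(examples):
    """examples: list of (set_of_active_features, label) pairs."""
    rules, remaining = [], list(examples)
    while remaining:
        stats = {}                                  # feature -> Counter over labels
        for feats, label in remaining:
            for f in feats:
                stats.setdefault(f, Counter())[label] += 1
        if not stats:
            break
        # "best" feature: the one whose conditional label distribution is most skewed
        f_star = max(stats, key=lambda f: stats[f].most_common(1)[0][1] / sum(stats[f].values()))
        label_star = stats[f_star].most_common(1)[0][0]
        rules.append((f_star, label_star))
        remaining = [ex for ex in remaining if f_star not in ex[0]]   # drop covered examples
    default = Counter(label for _, label in examples).most_common(1)[0][0]
    return rules, default

def predict(rules, default, feats):
    for f, label in rules:                          # first matching rule wins
        if f in feats:
            return label
    return default
```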

Page 17: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

17

I saw the girl it the park
The from needs to be completed
I maybe there tomorrow

New sentences you have not seen before. Can you recognize and correct them this time?

Intuitively, there are some regularities in the language, “identified” from previous examples, which can be utilized on future examples.

Two technical ways to formalize this intuition

Generalization

Page 18: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

18

Model the problem of text correction as a problem of learning from examples.

Goal: learn directly how to make predictions.

PARADIGM
Look at many examples.
Discover some regularities in the data.
Use these to construct a prediction policy. A policy (a function, a predictor) needs to be specific.

[it/in] rule: if “the” occurs after the target, choose in. (In most cases, it won’t be that simple, though.)

1: Direct Learning (Distribution-Free)

Non-Parametric Induction; Empirical Risk Minimization; Induction principle: Large deviation probabilistic bounds.

Page 19: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

19

Model the problem of text correction as that of generating correct sentences.

Goal: learn a model of the language; use it to predict.

PARADIGM
Learn a probability distribution over all sentences.
Use it to estimate which sentence is more likely:
Pr(I saw the girl it the park) vs. Pr(I saw the girl in the park)
[In the same paradigm we sometimes learn a conditional probability distribution.]

In practice: make assumptions on the distribution’s type

In practice: a decision policy depends on the assumptions

2: Generative Model (or Prob models in general)

Parametric Inference: Induction principle: Maximum Likelihood

Page 20: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

20

Model 1: There are 5 characters: A, B, C, D, E. At any point the model can generate any of them, according to:
P(A) = 0.3; P(B) = 0.1; P(C) = 0.2; P(D) = 0.2; P(E) = 0.1; P(END) = 0.1

A sentence in the language: AAACCCDEABB.
A less likely sentence: DEEEEBBBBEEEEBBBBEEE

Instantiating the Generative Model:
Given the model, we can compute the probability of a sentence and decide which is more likely, or how to predict the next character.
Given a family of models, choose a specific model.

Example: Model of Language
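To make the model concrete, here is a small sketch that scores strings under Model 1 with the probabilities above; treating END as a single terminating event is an assumption about what the slide intends.

```python
# Sketch: scoring a string under Model 1 (characters generated independently,
# with an explicit END event), using the probabilities from the slide.
P = {"A": 0.3, "B": 0.1, "C": 0.2, "D": 0.2, "E": 0.1, "END": 0.1}

def sentence_prob(s):
    p = P["END"]                 # the sentence has to terminate at the end
    for ch in s:
        p *= P[ch]
    return p

print(sentence_prob("AAACCCDEABB"))           # the "likely" sentence from the slide
print(sentence_prob("DEEEEBBBBEEEEBBBBEEE"))  # much smaller probability
```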

Page 21: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

21

Model 2: A probabilistic finite state model.

Start: Ps(A)=0.4; Ps(B)=0.4; Ps(C)=0.2

From A: PA(A)=0.5; PA(B)=0.3; PA(C)=0.1; PA(S)=0.1

From B: PB(A)=0.1; PB(B)=0.4; PB(C)=0.4; PB(S)=0.1

From C: PC(A)=0.3; PC(B)=0.4; PC(C)=0.2 ; PC(S)=0.1

Practical issues:
What is the space over which we define the model? Characters? Words? Ideas?
How do we acquire the model? Estimation; smoothing.

Example: Model of Language
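A corresponding sketch for Model 2, treating it as a first-order Markov chain over the transition probabilities above; the example string and the interpretation of “S” as the stop event are assumptions.

```python
# Sketch: scoring a string under Model 2 (a probabilistic finite-state model),
# using the start and transition probabilities from the slide; "S" = stop.
START = {"A": 0.4, "B": 0.4, "C": 0.2}
TRANS = {
    "A": {"A": 0.5, "B": 0.3, "C": 0.1, "S": 0.1},
    "B": {"A": 0.1, "B": 0.4, "C": 0.4, "S": 0.1},
    "C": {"A": 0.3, "B": 0.4, "C": 0.2, "S": 0.1},
}

def sentence_prob(s):
    if not s:
        return 0.0
    p = START[s[0]]
    for prev, cur in zip(s, s[1:]):
        p *= TRANS[prev][cur]
    return p * TRANS[s[-1]]["S"]   # probability of stopping after the last symbol

print(sentence_prob("ABCA"))   # 0.4 * 0.3 * 0.4 * 0.3 * 0.1
```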

Page 22: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

22

The difference is not along probabilistic/deterministic or statistical/symbolic lines. Both paradigms can do both.

The difference is in the basic assumptions underlying the paradigms, and why they work.

1st: Distribution Free: uncover regularities in the past; hope they will be there in the future.

2nd: Know the (type of) probabilistic model of the language (target phenomenon). Use it.

Distr.-Free vs. Probabilistic/Generative: major philosophical debate in learning. Interesting computational issues too.

Learning Paradigms: Comments

Page 23: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

23

Goal: discover some regularities from examples and generalize to previously unseen data.

What are the examples we learn from? Instance space X: the space of all examples, X = {0,1}^n or {0,1}^∞

How do we represent our hypothesis? Hypothesis space H: a space of potential functions h: X → {0,1}

Goal: given training data S ⊆ X, find a good h ∈ H

Distribution Free Learning: Formalism

Page 24: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

24

Learning is impossible, unless… The outcome of learning cannot be trusted, unless… How can we quantify the expected generalization? Assume h is good on the training data; what can be said about h’s performance on previously unseen data?

These are some of the topics studied in Computational Learning Theory (COLT)

Why Does Learning Work?

Page 25: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

25

Given: Training examples (x, f(x)) of an unknown function f

Find: A good approximation to f

[Figure: an unknown function y = f(x1, x2, x3, x4), shown as a black box with inputs x1, x2, x3, x4]

Example | x1 x2 x3 x4 | y
   1    |  0  0  1  0 | 0
   2    |  0  1  0  0 | 0
   3    |  0  0  1  1 | 1
   4    |  1  0  0  1 | 1
   5    |  0  1  1  0 | 0
   6    |  1  1  0  0 | 0
   7    |  0  1  0  1 | 0

Learning is impossible, unless…

Page 26: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

26

Complete Ignorance: There are 2^16 = 65536 possible functions over four input features.

We can’t figure out which one is correct until we’ve seen every possible input-output pair.

Even after seven examples we still have 2^9 = 512 possibilities for f.

Is Learning Possible?

x1 x2 x3 x4 | y
 0  0  0  0 | ?
 0  0  0  1 | ?
 0  0  1  0 | 0
 0  0  1  1 | 1
 0  1  0  0 | 0
 0  1  0  1 | 0
 0  1  1  0 | 0
 0  1  1  1 | ?
 1  0  0  0 | ?
 1  0  0  1 | 1
 1  0  1  0 | ?
 1  0  1  1 | ?
 1  1  0  0 | 0
 1  1  0  1 | ?
 1  1  1  0 | ?
 1  1  1  1 | ?

Why Does Learning Work (2)?

Page 27: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

27

Simple Rules: There are only 16 simple conjunctive rules of the form y = xi ∧ xj ∧ xk.

Try to learn a function of this form that explains the data. (Try it: there isn’t one.)

m-of-n rules: There are 32 possible rules of the form
“y = 1 if and only if at least m of the following n variables are 1”

(Try it: there is one.)

Hypothesis Space

Page 28: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

28

Bias

Learning requires guessing a good, small hypothesis class.

We can start with a very small class and enlarge it until it contains a hypothesis that fits the data.

(model selection)

We could be wrong! There is a robustness issue here; why aren’t we wrong more often?

This also relates to the probabilistic view of prediction, and to the relation between the paradigms.

One way to think about this robustness is via the stability of our prediction.

Page 29: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

29

Can We Trust the Hypothesis?

There is a hidden conjunction the learner is to learn: f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

How many examples are needed to learn it? How?

Protocol: Some random source (e.g., Nature) provides training examples; a teacher (Nature) provides the labels f(x).

Not the only possible protocol (membership queries; teaching).

<(1,1,1,1,1,1,…,1,1), 1> <(1,1,1,0,0,0,…,0,0), 0> <(1,1,1,1,1,0,...0,1,1),1> <(1,0,1,1,1,0,...0,1,1), 0> <(1,1,1,1,1,0,...0,0,1),1> <(1,0,1,0,0,0,...0,1,1), 0> <(1,1,1,1,1,1,…,0,1), 1> <(0,1,0,1,0,0,...0,1,1), 0>

Page 30: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

30

Algorithm: Elimination
Start with the set of all literals as candidates.
Eliminate a literal if it is not active (0) in some positive example.

<(1,1,1,1,1,1,…,1,1), 1>  f = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ … ∧ x100
<(1,1,1,0,0,0,…,0,0), 0>  learned nothing
<(1,1,1,1,1,0,...,0,1,1), 1>  f = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x99 ∧ x100
<(1,0,1,1,0,0,...,0,0,1), 0>  learned nothing
<(1,1,1,1,1,0,...,0,0,1), 1>  f = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
<(1,0,1,0,0,0,...,0,1,1), 0>  <(1,1,1,1,1,1,…,0,1), 1>  <(0,1,0,1,0,0,...,0,1,1), 0>
Final: f = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

Learning Conjunction
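The elimination algorithm itself is only a few lines; a sketch, using a made-up 5-variable example rather than the 100-variable one from the slide:

```python
# Sketch of the elimination algorithm on slide 30: start with all literals,
# and drop any literal that is 0 in some positive example.
def learn_conjunction(examples, n):
    """examples: list of (bit_tuple, label); returns indices of surviving literals."""
    literals = set(range(n))
    for x, y in examples:
        if y == 1:                                   # negative examples are ignored
            literals = {i for i in literals if x[i] == 1}
    return sorted(literals)

def predict(literals, x):
    return int(all(x[i] == 1 for i in literals))     # conjunction of surviving literals

examples = [((1, 1, 1, 0, 1), 1), ((1, 0, 1, 1, 1), 1), ((0, 0, 1, 1, 1), 0)]
print(learn_conjunction(examples, 5))                # [0, 2, 4], i.e. x1 ∧ x3 ∧ x5 (1-based)
```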

Page 31: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

31

Instance space: X
Hypothesis space: H (set of possible hypotheses)
Training instances S: positive and negative examples of the target f, sampled according to a fixed, unknown probability distribution D over X
Determine: a hypothesis h ∈ H such that
h(x) = f(x) for all x ∈ S?
h(x) = f(x) for all x ∈ X?

Evaluated on future instances sampled according to D.

f = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

Prototypical Learning Scenario

Page 32: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

32

Have seen many examples (drawn according to D )

Since in all the positive examples x1 was active, it is likely to be active in future positive examples

Even if not, under D, x1 is inactive in only relatively few positive examples, so our error will be small.

[Figure: the region where f and h disagree, containing a few positive and negative examples]

Error_D(h) = Pr_{x~D}[ f(x) ≠ h(x) ]

The error can be bounded via Chernoff bounds: a distribution-free notion!

PAC Learning: Intuition

Page 33: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

33

Generalization for Consistent Learners

Claim: The probability that there exists a hypothesis h ∈ H that (1) is consistent with m examples and (2) satisfies err(h) > ε is less than |H|(1-ε)^m.

Equivalently: For any distribution D governing the IID generation of training and test instances, for all h ∈ H, for all 0 < ε, δ < 1, if
m > (ln|H| + ln(1/δ)) / ε
then, with probability at least 1-δ (over the choice of the training set of size m), err(h) < ε.
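Plugging numbers into this bound is a one-liner; a small illustrative helper, where the chosen hypothesis space (monotone conjunctions over 100 variables, so |H| = 2^100) is just an example and not from the slide:

```python
# Sketch: evaluate the consistent-learner sample size m > (ln|H| + ln(1/delta)) / epsilon.
import math

def sample_complexity(h_size, epsilon, delta):
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

print(sample_complexity(2**100, epsilon=0.1, delta=0.05))   # ~700 examples suffice
```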

Page 34: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

34

Generalization for Consistent Learners

Proof: Let h be such a bad hypothesis (as in (2)). The probability that h is consistent with one example of f is
Pr_{x~D}[ f(x) = h(x) ] < 1 - ε

Since the m examples are drawn independently of each other, the probability that h is consistent with m examples is less than (1-ε)^m.

The probability that some hypothesis in H is consistent with m examples is less than |H|(1-ε)^m.

Claim: The probability that there exists a hypothesis h ∈ H that (1) is consistent with m examples and (2) satisfies err(h) > ε is less than |H|(1-ε)^m.

Page 35: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

35

Generalization for Consistent Learners

We want this probability to be smaller than δ, that is:
|H|(1-ε)^m < δ
and, using (1-x) < e^{-x}:
ln|H| - εm < ln δ

For any distribution D governing the IID generation of training and test instances, for all h ∈ H, for all 0 < ε, δ < 1, if
m > (ln|H| + ln(1/δ)) / ε
then, with probability at least 1-δ (over the choice of the training set of size m), err(h) < ε.

What kind of hypothesis spaces do we want? Large? Small?

Do we want the smallest H possible?

Page 36: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

36

Generalization (Agnostic Learners)
In general, we try to learn a concept f using hypotheses in H, but f ∉ H.

Our goal should be to find a hypothesis h ∈ H with a small training error:
Err_TR(h) = Pr_{x ∈ S}[ f(x) ≠ h(x) ]

We want a guarantee that a hypothesis with a small training error will have good accuracy on unseen examples:
Err_D(h) = Pr_{x~D}[ f(x) ≠ h(x) ]

Hoeffding bounds characterize the deviation between the true probability of an event and its observed frequency over m independent trials:
Pr( p > E(p) + ε ) < exp{-2mε²}

(p is the underlying probability of the binary variable being 1)

Page 37: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

37

Generalization (Agnostic Learners)
Therefore, the probability that an element h ∈ H will have a training error which is off by more than ε can be bounded as follows:
Pr( Err_D(h) > Err_TR(h) + ε ) < exp{-2mε²}

As in the consistent case, use the union bound to get a uniform bound over all of H; setting |H| exp{-2mε²} < δ we get the following generalization bound, a bound on how much the true error can deviate from the observed error:

For any distribution D generating training and test instances, with probability at least 1-δ over the choice of the training set of size m (drawn IID), for all h ∈ H:
Err_D(h) ≤ Err_TR(h) + sqrt[ (log|H| + log(1/δ)) / (2m) ]
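The deviation term in this bound can likewise be evaluated numerically; a small sketch with a hypothetical finite hypothesis space:

```python
# Sketch: the deviation term sqrt((log|H| + log(1/delta)) / (2m)) from the agnostic bound.
import math

def generalization_gap(h_size, m, delta):
    return math.sqrt((math.log(h_size) + math.log(1.0 / delta)) / (2 * m))

print(generalization_gap(h_size=2**20, m=10_000, delta=0.05))   # roughly 0.03
```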

Page 38: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

38

Summary: Generalization

Learnability depends on the size of the hypothesis space.

In the case of a finite hypothesis space:
Err_D(h) ≤ Err_TR(h) + sqrt[ (log|H| + log(1/δ)) / (2m) ]

In the case of an infinite hypothesis space:
Err_D(h) ≤ Err_TR(h) + sqrt[ (k·VC(H) + log(1/δ)) / (2m) ]

where VC(H) is the Vapnik-Chervonenkis dimension of the hypothesis class, a combinatorial measure of its complexity, and k is a constant.

Later we will see that this understanding of generalization has also some immediate algorithmic implications (essentially, model selection implications).

Page 39: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

39

Learning Theory: Summary (1)

Labeled observations S = {(x_i, l_i)}, i = 1..m, sampled according to a distribution D on X × {0,1}.

Goal: to compute a hypothesis h ∈ H that performs well on future, unseen observations.

Assumption: test examples are also sampled according to D (label is not observed)

Page 40: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

40

Look for h ∈ H that minimizes the true error
Err_D(h) = Pr_{(x,l)~D}[ h(x) ≠ l ]

All we get to see is the empirical error
Err_S(h) = |{x ∈ S : h(x) ≠ l}| / |S|

Basic theorem: with probability at least 1-δ,
Err_D(h) ≤ Err_S(h) + sqrt[ k·(VC(H) + ln(1/δ)) / m ]

Learning Theory:Summary(2) [Why does it work?]

Page 41: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

41

Use a hypothesis space with small expressivity. E.g., prefer a function that is linear in the feature space over higher-order functions:
f(x) = Σ_i c_i x_i

The VC dimension of a linear function over N dimensions is N+1.

Data-dependent VC notions:
The VC dimension of the class of separators with margin γ grows as γ becomes smaller.
Sparsity: if there are at most k active features in each example, then the VC dimension is k+1.

Algorithmic issues: there are good algorithms for linear functions; learning higher-order functions is computationally hard.

This implies that we will make an effort to learn linear functions even when we know that our target function isn’t linear in the raw input.

Practical Lesson

Three directions we can go from here:

- Generalization

- Details of the direct learning paradigm

- The probabilistic paradigm

Page 42: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

42

VC-dimension-based bounds are unrealistic. Their value is mostly in providing a quantitative understanding of “why learning works” and of what the important complexity parameters are.

In recent years, this understanding has helped both to drive new algorithms and to develop new methods that can actually provide somewhat realistic generalization bounds.

PAC-Bayes Methods (McAllester, McAllester&Langford)

Random Projection/Weighted margin Methods (Garg, Har-Peled, Roth)

This method can be shown to have some algorithmic implications.

Advances in Theory of Generalization

Page 43: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

43

Model the problem of text correction as that of generating correct sentences.

Goal: learn a model of the language; use it to predict.

PARADIGM
Learn a probability distribution over all sentences.
Use it to estimate which sentence is more likely:
Pr(I saw the girl it the park) vs. Pr(I saw the girl in the park)
[In the same paradigm we sometimes learn a conditional probability distribution.]

In practice: make assumptions on the distribution’s type

In practice: a decision policy depends on the assumptions

2: Generative Model

Page 44: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

44

An Example

I don’t know {whether, weather} to laugh or cry

How can we make this a learning problem?

We will look for a function F: Sentences → {whether, weather}. We need to define the domain of this function better.

An option: for each word w in English, define a Boolean feature x_w:
[x_w = 1] iff w is in the sentence.
This maps a sentence to a point in {0,1}^50,000.

In this space: some points are whether points and some are weather points.

Page 45: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

45

What’s Good?

Learning problem: Find a function that best separates the data

What function? What’s best? How to find it?

A possibility: define the learning problem to be: find a (linear) function that best separates the data.

Page 46: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

46

Exclusive-OR (XOR)

In general: a parity function.

x_i ∈ {0,1}; f(x1, x2, …, xn) = 1 iff Σ_i x_i is even.

This function is not linearly separable.

[Figure: the XOR points plotted in the (x1, x2) plane]

Page 47: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

47

Functions Can be Made Linear

x1x2x4 + x2x4x5 + x1x3x7

Space: X = (x1, x2, …, xn)

Input transformation

New space: Y = {y1, y2, …} = {x_i, x_i x_j, x_i x_j x_k}

[Figure: “weather” vs. “whether” points, now linearly separated]

The new discriminator, y3 + y4 + y7, is functionally simpler.
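A sketch of this input transformation in Python: expanding a Boolean vector into all monomials of degree ≤ 3 makes the degree-3 function above a linear function (with 0/1 weights) of the new coordinates. Index tuples are 0-based, so x1x2x4 becomes the coordinate (0, 1, 3); everything else here is illustrative.

```python
# Sketch: making a non-linear Boolean function linear by adding monomial coordinates.
from itertools import combinations

def expand(x, degree=3):
    """Map a 0/1 vector x to all products of up to `degree` coordinates."""
    y = {}
    for d in range(1, degree + 1):
        for idx in combinations(range(len(x)), d):
            prod = 1
            for i in idx:
                prod *= x[i]
            y[idx] = prod
    return y

def f_linear(y):
    # x1*x2*x4 + x2*x4*x5 + x1*x3*x7 (1-based), now a sum of three new coordinates
    return y[(0, 1, 3)] + y[(1, 3, 4)] + y[(0, 2, 6)]

x = (1, 1, 0, 1, 0, 0, 1)
print(f_linear(expand(x)))   # 1: only the x1*x2*x4 monomial fires
```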

Page 48: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

48

Data are not separable in one dimension.
Not separable if you insist on using a specific class of functions.

[Figure: data points on a one-dimensional x axis]

Feature Space

Page 49: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

49

Blown Up Feature Space

Data are separable in <x, x2> space

[Figure: the same data plotted in the <x, x²> plane, now separable]

Key issue: what features to use.

Computationally, this can be done implicitly (kernels).

But there are warnings.

Page 50: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

50

A General Framework for Learning

Goal: predict an unobserved output value y based on an observed input vector x.

Estimate a functional relationship y ~ f(x) from a set {(x, y)_i}, i = 1..n.

Most relevant here is classification: y ∈ {0,1} (or y ∈ {1,2,…,k}). (But within the same framework one can also talk about regression or structured prediction.)

What do we want f(x) to satisfy?

We want to minimize the loss (risk): L(f) = E_{X,Y}[ [f(x) ≠ y] ], where E_{X,Y} denotes the expectation with respect to the true distribution.

Simply: the number of mistakes; […] is an indicator function.

Page 51: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

51

A General Framework for Learning (II)

We want to minimize the loss: L(f) = E_{X,Y}[ [f(X) ≠ Y] ], where E_{X,Y} denotes the expectation with respect to the true distribution.

We cannot do that. Instead, we try to minimize the empirical classification error. For a set of training examples {(X_i, Y_i)}, i = 1..n, try to minimize:
L'(f) = (1/n) Σ_i [f(X_i) ≠ Y_i]

This minimization problem is typically NP-hard. To alleviate this computational problem, minimize a new function: a convex upper bound of the classification error function
I(f(x), y) = [f(x) ≠ y] = {1 when f(x) ≠ y; 0 otherwise}
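To make the “convex upper bound” idea concrete, here is an illustrative comparison of the 0-1 empirical error with one common convex surrogate, the hinge loss, for a fixed linear classifier. The use of ±1 labels and the choice of the hinge loss are assumptions; the slides do not commit to a particular surrogate.

```python
# Sketch: 0-1 empirical error vs. a convex upper bound (hinge loss) for f(x) = sign(w.x).
def zero_one_loss(w, data):
    mistakes = 0
    for x, y in data:                               # y in {-1, +1}
        score = sum(wi * xi for wi, xi in zip(w, x))
        mistakes += 1 if y * score <= 0 else 0
    return mistakes / len(data)

def hinge_loss(w, data):
    total = 0.0
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x))
        total += max(0.0, 1.0 - y * score)          # convex, upper-bounds the 0-1 loss
    return total / len(data)

data = [((1.0, 0.0), +1), ((0.0, 1.0), -1), ((1.0, 1.0), +1)]
w = (0.5, -0.5)
print(zero_one_loss(w, data), hinge_loss(w, data))  # 0.33... vs. 0.66...
```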

Page 52: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

52

Learning as an Optimization Problem

A loss function L(f(x), y) measures the penalty incurred by a classifier f on example (x, y).

There are many different loss functions one could define:

Misclassification error: L(f(x), y) = 0 if f(x) = y; 1 otherwise

Squared loss: L(f(x), y) = (f(x) - y)²

Input-dependent loss: L(f(x), y) = 0 if f(x) = y; c(x) otherwise

A continuous convex loss function allows a simpler optimization algorithm.

[Figure: a loss L plotted as a function of f(x) - y]

Page 53: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

53

How to Learn?

Local search: Start with a linear threshold function. See how well you are doing. Correct. Repeat until you converge (??). (A sketch follows this slide.)

Optimization: Define an objective function such that its solution separates the data well.

Direct computation of hypothesis.
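One standard instantiation of the “local search” recipe is a perceptron-style update for a linear threshold function; the sketch below is illustrative and is not necessarily the specific algorithm the lecture has in mind.

```python
# Sketch: perceptron-style local search for a linear threshold function.
def perceptron(data, n_features, epochs=10, lr=1.0):
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):                     # "repeat until you converge (??)"
        errors = 0
        for x, y in data:                       # y in {-1, +1}
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                  # "see how well you are doing"
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]   # "correct"
                b += lr * y
                errors += 1
        if errors == 0:                         # converged on the training data
            break
    return w, b

data = [((1.0, 1.0), +1), ((0.0, 1.0), -1), ((1.0, 0.0), +1), ((0.0, 0.0), -1)]
print(perceptron(data, n_features=2))
```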

Page 54: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

54

Assignment: Verb Prediction
The goal is to predict a verb given a context.

Supervised learning, with examples like: She <<dropped>> the ball.
Given "She << ? >> the ball", a good answer is probably “drop” or “kick”.
Two versions: predict the verb form (“dropped”) or just the base (“drop”).

There will be two approaches to compare:

Language modeling: you will have to implement the model yourself (incl. smoothing).

Classification: use the available packages, FEX for feature extraction and SNoW for classification. You need to decide on features, algorithm, and parameters.

Original ideas: other datasets, web, clustering, etc.

Page 55: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

55

Assignment: Verb Prediction Additional annotation is available, example

sentence:

Page 56: CS546: Machine Learning and Natural Language Lecture 7: Introduction to Classification 2009

56

Assignment: Verb Prediction See the course web-page for:

Description of the assignment (see detailed description at the bottom)

Pointers to relevant papers, tutorials for SNoW and FEX

Teams:
Team 1: Sarah Borys, Maryam Karimzadehgan, Majid Kazemian
Team 2: Kavita Ganesan, Hyun Duk Kim, Parikshit Sondhi
Team 3: Ryan Cunningham, Gourab Kundu, Daniel Schreiber
Team 4: Oscar Sanchez Plazas, Mehwish Riaz, Scott Wegner

Feel free to swap; this arrangement is based on surveys (but no surveys from Kavita and Parikshit)

Due: Feb 25; next review due Mar 6.