
Page 1: Course Review #2 and Project Parts 3-6 LING 572 Fei Xia 02/14/06

Course Review #2 and Project Parts 3-6

LING 572
Fei Xia

02/14/06

Page 2:

Outline

• Supervised learning
  – Learning algorithms
  – Resampling: bootstrap
  – System combination
• Semi-supervised learning
• Unsupervised learning

Page 3:

Supervised Learning

Page 4:

Machine learning problems
• Input x: a sentence, a set of attributes, …
• Input domain X: the set of all possible inputs
• Output y: a class, a real number, a tag, a tag sequence, a parse tree, a cluster, …
• Output domain Y: the set of all possible outputs
• Training data t: a set of (x, y) pairs; in supervised learning, y is known.

Page 5:

Machine learning problems (cont)

• Predictor f: a function from X to Y.
• A learner L: a function from T to F
  – T: the set of all possible training data
  – F: the set of all possible predictors
• Types of ML problems:
  – Y is a finite set: classification
  – Y is R: regression
  – Y is of other types: parsing, clustering, …

Page 6:

The standard setting for binary classification problems

• Input x:
  – There is a finite set of attributes: a1, …, an
  – x is a vector: x = (x1, …, xn)
• Output y:
  – Binary-class: Y has only two members
  – Multi-class: Y has k members

Page 7:

Converting to the standard setting

• Multi-class → binary (boosting):
  – Train one classifier: (x, y) → ((x, 1), 0), …, ((x, y), 1), …, ((x, k), 0)
  – Train k classifiers, one for each class: for class j, (x, y) → (x, (y = j))
• Y is not a pre-defined finite set
  – Ex: POS tagging, parsing
  – Convert Y to a sequence of decisions.
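The one-vs-rest reduction above (train k binary classifiers, relabeling (x, y) as (x, y = j) for each class j) can be sketched as follows. The `binary_learner` interface, which takes relabeled pairs and returns a scoring function, is a hypothetical stand-in, not something from the slides:

```python
def train_one_vs_rest(data, classes, binary_learner):
    """data: list of (x, y) pairs. Returns {class j: binary scorer}.
    binary_learner takes (x, 0/1) pairs and returns a function
    x -> score (hypothetical interface)."""
    classifiers = {}
    for j in classes:
        # Relabel: y becomes 1 if y == j, else 0
        relabeled = [(x, 1 if y == j else 0) for x, y in data]
        classifiers[j] = binary_learner(relabeled)
    return classifiers

def predict(classifiers, x):
    # Pick the class whose binary classifier scores x highest.
    return max(classifiers, key=lambda j: classifiers[j](x))
```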

Page 8:

Converting to the standard setting (cont)

• x is not a vector (x1, …, xn)
  – Define a set of input attributes: a1, …, an
  – Convert x to a vector
  – Ex: use boosting for POS tagging

Page 9:

Classification algorithms

• DT, DL, TBL, Boosting, MaxEnt
• Comparison:
  – Representation
  – Training: iterative approach
    • Feature selection
    • Weight setting
    • Data processing
  – Decoding

Page 10:

Representation
• DT: a tree
  – Each internal node is a test on an input attribute
• DL: an ordered list of rules (fi, vi)
  – Each fi is a test on one or more attributes.
• TBL: an ordered list of transformations (fi, vi → vi’)
  – Each fi is a test on one or more attributes.
• Boosting: a list of weighted weak classifiers
  – Often a classifier tests one or more attributes.
• MaxEnt: a list of weighted features
  – A feature is a binary function: f(x, y) = 0 or 1
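As a sketch of the MaxEnt representation above: weighted binary features f(x, y) define a conditional distribution p(y | x) ∝ exp(Σ_i w_i · f_i(x, y)), the standard log-linear parameterization. The function below is illustrative, not the tagger's actual code:

```python
import math

def maxent_prob(features, weights, x, ys):
    """p(y | x) proportional to exp(sum_i w_i * f_i(x, y)),
    where each feature is a binary function f(x, y) -> 0 or 1."""
    scores = {y: math.exp(sum(w * f(x, y)
                              for f, w in zip(features, weights)))
              for y in ys}
    z = sum(scores.values())  # normalization constant
    return {y: s / z for y, s in scores.items()}
```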

Page 11:

Training: “feature” selection
• DT: the test with max entropy reduction
  – Test: attr == val
• DL: the decision rule with max entropy reduction
  – Rule: if (attr1=val1 && … && attr_i=val_i) then y=c
• TBL: the transformation with max error reduction
  – Transformation: if (attr1=val1 && … && attr_i=val_i) then y=c1 → y=c2
  – Equivalently: if (attr1=val1 && … && attr_i=val_i && y=c1) then y=c2
• Boosting: the classifier chosen by the weak learner
  – Classifier: if (attr1=val1 && … && attr_i=val_i) then y=c1 else y=not(c1)
• MaxEnt: the features with max increase in the log-likelihood of the training data
  – Feature: if (attr1=val1 && … && attr_i=val_i && y=c) then 1 else 0
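The DT criterion above (pick the test attr == val with max entropy reduction) can be computed as below; representing each x as a dict of attribute values is an assumption made for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def entropy_reduction(data, attr, val):
    """Entropy reduction from splitting on the test x[attr] == val.
    data: list of (x, y) pairs with x a dict of attribute values."""
    yes = [y for x, y in data if x.get(attr) == val]
    no = [y for x, y in data if x.get(attr) != val]
    n = len(data)
    # Weighted entropy of the two partitions after the split
    cond = (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)
    return entropy([y for _, y in data]) - cond
```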

Page 12:

Training: weight setting

• Boosting: weights that minimize the upper bound of the training error.
• MaxEnt: weights that maximize the entropy.

Page 13:

Training: data processing

• DT: split data
• DL: split data (optional)
• TBL: apply transformations to reset cur_y
  – Original data: (x, y)
  – Used data: ((x, cur_y), y)
• Boosting: re-weight the examples (x, y)
• MaxEnt: none

Page 14:

Decoding for static problems: a single decision

• DT: find the unique path from the root to a leaf node in the decision tree
• DL: find the 1st rule that fires
• TBL: find the sequence of rules that fire
• Boosting: sum up the weighted decisions by multiple classifiers
• MaxEnt: find the y that maximizes p(y | x)

f(x) = argmax_y p(y | x)

f(x) = sign(Σ_j α_j f_j(x))
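The boosting decision rule amounts to a weighted vote of the weak classifiers; a minimal sketch, assuming each weak classifier outputs a label in {-1, +1}:

```python
def boosted_decision(weak_classifiers, x):
    """Weighted vote: f(x) = sign(sum_j alpha_j * f_j(x)).
    weak_classifiers: list of (alpha_j, f_j) pairs, f_j(x) in {-1, +1}."""
    total = sum(alpha * f(x) for alpha, f in weak_classifiers)
    return 1 if total >= 0 else -1
```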

Page 15:

Decoding for dynamic problems: a decision sequence

• TBL: it can handle dynamic problems directly.
• Beam search:
  – Decode from left to right.
  – A feature should not refer to future decisions.
  – Keep the top N hypotheses at each position.
  – Easy to implement for MaxEnt.
  – Need to add weights (e.g., probabilities, costs, confidence scores) to DT, DL, TBL, and boosting.
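The beam search described above (left-to-right decoding, keeping the top N hypotheses at each position) can be sketched as follows. The `score` interface, returning a probability for each tag given the current word and the previous decisions, is a hypothetical stand-in for any of the weighted models:

```python
import math

def beam_search(words, tagset, score, beam_size=3):
    """Left-to-right beam decoding for tagging.
    score(word, prev_tags) -> {tag: prob} (hypothetical interface);
    it may look at past decisions but never at future ones."""
    beam = [([], 0.0)]  # list of (tag sequence, log-prob)
    for w in words:
        candidates = []
        for tags, logp in beam:
            probs = score(w, tags)
            for t in tagset:
                candidates.append((tags + [t], logp + math.log(probs[t])))
        # Keep only the top-N hypotheses at this position.
        beam = sorted(candidates, key=lambda c: c[1],
                      reverse=True)[:beam_size]
    return beam[0][0]
```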

Page 16:

Comparison of learners (columns: DT | DL | TBL | Boosting | MaxEnt)

• Probabilistic: SDT | SDL | TBL-DT | confidence | Y
• Parametric: N | N | N | N | Y
• Representation: tree | ordered list of rules | ordered list of transformations | list of weighted classifiers | list of weighted features
• Each iteration adds: attribute | rule | transformation | classifier & weight | feature & weight
• Data processing: split data | split data (optional) | change cur_y | reweight (x, y) | none
• Decoding: path | 1st rule | sequence of rules | calc f(x) | calc f(x)

Page 17:

Evaluation of learners
• Accuracy: F-measure, error rate, …
• Cost:
  – The types and amount of resources: tools and training data
  – The cost of errors
• Complexity:
  – Computational complexity of the algorithm (training time, decoding time)
  – Complexity of the model: # of parameters

• Stability

• Bias

Page 18:

Stability of a learner L

• Given two samples t1 and t2 from the same distribution D over X×Y, let f1 = L(t1) and f2 = L(t2). If L is stable, f1 and f2 should agree most of the time.

agreement(f1, f2) = P_{x ~ D_X}(f1(x) = f2(x))

stability(L) = E_{t1, t2 ~ D_{X×Y}}[agreement(f1, f2)]
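In practice, the agreement term can be estimated empirically on a held-out sample of inputs; a minimal sketch:

```python
def agreement(f1, f2, xs):
    """Empirical estimate of P(f1(x) == f2(x)) over a sample xs
    drawn from the input distribution D_X."""
    return sum(f1(x) == f2(x) for x in xs) / len(xs)
```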

Page 19:

Bias

• Utgoff (1986):
  – Strong/weak bias: one that focuses the learner on a relatively small (resp. large) number of hypotheses.
  – Correct/incorrect bias: one that allows (resp. does not allow) the learner to select the target concept.

Page 20:

Bias (cont)
• Rendell (1986): based on the learner’s behavior
  – Exclusive bias: the learner does not consider any of the candidates in a class.
  – Preferential bias: the learner prefers one class of concepts over another class.
• Others: based on the learner’s design
  – Representational bias: certain concepts cannot be considered because they cannot be expressed.
  – Procedural bias:
    • Ex: pruning in C4.5 is a procedural bias that results in a preference for smaller DTs.

Page 21:

Resampling

Page 22:

Bagging

[Figure: B samples are drawn from the training data with replacement; the learner (ML) is trained on each sample to produce predictors f1, f2, …, fB, which are combined into a single predictor f.]
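The bagging scheme in the figure (B bootstrap replicates, one predictor each, combined by voting) can be sketched as follows; the `learner` interface is a hypothetical stand-in for any of the taggers, and majority voting is one possible combination method:

```python
import random

def bootstrap(data, rng):
    """One bag: sample |data| examples with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def bagging(data, learner, B=10, seed=0):
    """Train B predictors on bootstrap replicates and combine
    them by majority vote."""
    rng = random.Random(seed)
    predictors = [learner(bootstrap(data, rng)) for _ in range(B)]
    def combined(x):
        votes = [f(x) for f in predictors]
        return max(set(votes), key=votes.count)  # majority vote
    return combined
```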

Page 23:

System combination

Page 24:

System combination

This can be seen as a special kind of ML problem, so we can use any learner.

[Figure: component predictors f1, f2, …, fB feed into a single combined predictor f.]

Page 25:

Methods

As an ML problem:
• Input: the attribute vector (f1(x), …, fn(x))
• The goal: predict f(x)

Strategies:
• Switching: for each x, f(x) is equal to some fi(x)
• Hybridization: create a new value.
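One simple switching strategy is majority voting over the component outputs, so that f(x) always equals some fi(x); a sketch:

```python
def majority_vote(outputs):
    """Return the value proposed by the most component systems
    (ties broken arbitrarily)."""
    return max(set(outputs), key=outputs.count)

def combine(systems, x):
    # Switching: f(x) equals some fi(x), here the most popular one.
    return majority_vote([f(x) for f in systems])
```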

Page 26:

Project Part 3

Page 27:

Tasks

• Understand the algorithm
• Run the tagger on four sets of training data (1K, 5K, 10K, 40K) and report, for each size:
  – Accuracy
  – Training time
  – # of features

Page 28:

The MaxEnt core
• What is the format of the training data?
• What is the format of the test data?
• How does GIS work?
• How does L-BFGS work?
• What is Gaussian prior smoothing? And how is it calculated?
• How are events and features represented internally?
• During the decoding stage, how does the code find the top-N classes for a new instance?

Page 29:

The MaxEnt tagger: features
• Where are feature templates defined?

• List the feature templates used by the tagger.

• If you want to add a new feature template, what do you need to do? Which piece of code do you need to modify?

• Given the feature templates, how are (instantiated) features selected and filtered?

Page 30:

The MaxEnt tagger: trainer
• What’s the format of the training sentences?

• How does the trainer convert a training sentence into a list of events?

• How does the trainer treat rare words? What additional features do rare words produce?

• How many files are created by the trainer in each experiment? How are they created? And what are they used for?

Page 31:

The MaxEnt tagger: decoder

• What’s the format of the test data?

• How are unknown words handled by the decoder?

• Which function performs the beam search? (Just provide the function name and file name.)

Page 32:

Project Part 4

Page 33:

Task 1: System combination

• Try three methods.
• The methods can come from existing work (e.g., (Henderson and Brill, 1999)) or be totally new.
• At least one of them must be trained:
  – Create training data:
    • Split S into (S1, S2)
    • Train each of the three POS taggers using S1
    • Tag instances in S2 (sys1, sys2, sys3, gold)
  – Train the combiner with the training data

Page 34:

Task 1 (cont)

Report a/b for each system (Trigram, TBL, MaxEnt, Comb1, Comb2, Comb3) at each training size (1K, 5K, 10K, 40K).

a: tagging result with the whole training data.
b: tagging result with part of the training data.

Page 35:

Task 2: bagging

• B = 10: use 10 bags
• Training data: 1K, 5K, and 10K; 40K is optional.
• One combination method.

Page 36:

Task 2 (cont)

Report a/b/c for each system (Trigram, TBL, MaxEnt, Comb1) at each training size (1K, 5K, 10K; 40K optional).

a: no bagging
b: one bag
c: 10 bags

Page 37:

Task 3: boosting

• Software: boostexter
• Main tasks:
  – Handling unknown words
  – Format conversion: pay attention to special characters, e.g., “,” in “2,300”
  – Feature templates
  – Choosing the number of rounds: N
  – Train and decode

Page 38:

Task 3 (cont)

Report a/b for each iteration number (num1 … num5) at each training size (1K, 5K, 10K; 40K optional).

a: true tags for neighboring words
b: most frequent tags for neighboring words

Page 39:

Task 4: semi-supervised learning

• Select one or more taggers.
• Choose the SSL method: self-training, co-training, or something else.
• Decide on strategies for adding data.
• Show the results with and without unlabeled data.
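The self-training option above can be sketched as follows. The `learner` interface, which returns a predictor that outputs a (label, confidence) pair, and the confidence-threshold strategy for adding data are assumptions made for illustration:

```python
def self_train(labeled, unlabeled, learner, rounds=3, threshold=0.9):
    """Self-training: repeatedly tag the unlabeled pool and move
    confidently tagged instances into the labeled data."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        f = learner(labeled)
        added, kept = [], []
        for x in pool:
            y, conf = f(x)  # predictor returns (label, confidence)
            (added if conf >= threshold else kept).append((x, y))
        if not added:       # nothing confident enough: stop early
            break
        labeled.extend(added)
        pool = [x for x, _ in kept]
    return learner(labeled)
```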

Page 40:

Task 4 (cont)

Report a/b for each amount of unlabeled data (none, 15K, 25K, 35K) with 1K and with 5K labeled data.

a: tagging accuracy
b: the number of sentences added to the labeled data.

Page 41:

Project Parts 5-6

Page 42:

Part 5: Presentation

• Presentation: 10 minutes + Q&A
• Email me the slides by 6am on 3/9 and bring a copy to class.
• Focus:
  – Tagging results: tables, figures
  – How TBL and MaxEnt work
  – Project Part 4

Page 43:

Part 6: Final report

• Email me the file by 6am on 3/14.

• It should include the major results and observations from Project Parts 1-5.

• Thoughts about ML algorithms

• Thoughts about the course, project, etc.

Page 44:

Due dates
• 6am on 3/7/06: Parts 3-4
  – ESubmit the following:
    • code for Part 4
    • reports for Parts 3 and 4
  – Bring a hardcopy of the report to class.
• 6am on 3/9/06: Part 5
  – Email me your presentation slides.
  – Bring a hardcopy of your slides to class (4 slides per page).
• 6am on 3/14/06: Part 6
  – Email me the final report.