Corpora and Statistical Methods, Lecture 13


Albert Gatt
Corpora and Statistical Methods
Lecture 13

Supervised methods (Part 2)

General characterisation
Training set:
- documents labeled with one or more classes
- encoded using some representation model

Typical data representation model:
- every document is represented as a vector of real-valued measurements, plus a class
- the vector may represent word counts
- the class is given, since this is supervised learning

Data sources: the Reuters collection
http://www.daviddlewis.com/resources/testcollections/reuters21578/
A large collection of Reuters newswire texts, categorised by topic. Topics include:
- earn(ings)
- grain
- wheat
- acq(uisitions)

Reuters dataset: two example texts.

Text 1:
17-MAR-1987 11:07:22.82  earn
AMRE INC 3RD QTR JAN 31 NET
DALLAS, March 17 -
Shr five cts vs one ct
Net 196,986 vs 37,966
Revs 15.5 mln vs 8,900,000
Nine mths
Shr 52 cts vs 22 cts
Net two mln vs 874,000
Revs 53.7 mln vs 28.6 mln
Reuter

Text 2:
17-MAR-1987 11:26:47.36  acq
DEVELOPMENT CORP OF AMERICA MERGED
HOLLYWOOD, Fla., March 17 - Development Corp of America said its merger with Lennar Corp was completed and its stock no longer existed. Development Corp of America, whose board approved the acquisition last November for 90 mln dlrs, said the merger was effective today and its stock now represents the right to receive 15 dlrs a share.
The American Stock Exchange said it would provide further details later.
Reuter

Representing documents: vector representations
Suppose we select k = 20 keywords that are diagnostic of the earnings category.
- This can be done using chi-square, topic signatures, etc.
Each document d is represented as a vector containing term weights for each of the k terms:

    s_{ij} = 10 \cdot \frac{1 + \log(\mathrm{tf}_{ij})}{1 + \log(l_j)}

where tf_{ij} is the number of times term i occurs in doc j, and l_j is the length of doc j.

Why use a log weighting scheme? A formula like 1 + log(tf) dampens the actual frequency.

Example: let d be a document of 89 words.
- profit occurs 6 times: tf(profit) = 6; 10 * (1 + log(6)) / (1 + log(89)) ≈ 6
- cts (cents) occurs 3 times: tf(cts) = 3; 10 * (1 + log(3)) / (1 + log(89)) ≈ 5

This way we avoid overestimating the importance of profit relative to cts: profit is more important than cts, but not twice as important.
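A minimal Python sketch of this weighting scheme; the base-10 logarithm and the rounding to an integer score are assumptions chosen so that the worked example above comes out as stated:

    import math

    def term_weight(tf: int, doc_length: int) -> int:
        """Log-dampened weight of a term in a document, scaled to a 0-10 score.

        tf: number of times the term occurs in the document
        doc_length: number of tokens in the document
        """
        if tf == 0:
            return 0
        return round(10 * (1 + math.log10(tf)) / (1 + math.log10(doc_length)))

    # Worked example from the slides: an 89-word document
    print(term_weight(6, 89))  # profit, 6 occurrences -> 6
    print(term_weight(3, 89))  # cts, 3 occurrences -> 5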

Log weighting schemes are common in information retrieval.

Decision trees

Form of a decision tree
Example: the probability of belonging to the category earnings, given that s(cts) < 2, is 0.116.

[Decision tree figure: node 1 (7681 items, p(c|n1) = 0.3) splits on cts at value 2. The cts < 2 branch leads to node 2 (5977 items, p(c|n2) = 0.116), which splits on net at value 1 into nodes 3 and 4. The cts >= 2 branch leads to node 5 (1704 items, p(c|n5) = 0.9), which splits on vs at value 2 into nodes 6 and 7.]

Form of a decision tree
A decision tree is equivalent to a formula in disjunctive normal form:

    (cts < 2 & net < 1 & ...) ∨ (cts ≥ 2 & net ≥ 1 & ...)

Each complete path through the tree is a conjunction of conditions.
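A small Python sketch of the tree in the figure, to make the path-as-conjunction idea concrete. The nested-dict encoding is illustrative; the leaf probabilities come from the leaf-node figure further down, and which probability belongs to which branch is an assumption (nodes 6 and 7 are not given probabilities in the slides):

    # Each internal node tests one feature against a threshold;
    # each leaf stores P(earnings | leaf).
    tree = {
        "feature": "cts", "value": 2,
        "left": {                      # cts < 2
            "feature": "net", "value": 1,
            "left":  {"p": 0.050},     # net < 1  (node 3)
            "right": {"p": 0.649},     # net >= 1 (node 4)
        },
        "right": {                     # cts >= 2
            "feature": "vs", "value": 2,
            "left":  {"p": None},      # vs < 2  (node 6, probability not given)
            "right": {"p": None},      # vs >= 2 (node 7, probability not given)
        },
    }

    def classify(doc, node=tree):
        """Follow one root-to-leaf path (a conjunction of tests) and return
        the leaf's P(earnings | leaf)."""
        while "feature" in node:
            branch = "left" if doc[node["feature"]] < node["value"] else "right"
            node = node[branch]
        return node["p"]

    print(classify({"cts": 1, "net": 3, "vs": 0}))  # cts < 2 and net >= 1 -> 0.649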

How to grow a decision tree
Typical procedure:
- grow a very large tree
- prune it

Pruning avoids overfitting the training data:
- a tree can contain several branches which are based on accidental properties of the training set
- e.g. only 1 document in the category earnings contains both dlrs and pct

Growing the tree
Splitting criterion: identifies a value of a feature a on which a node is split.

Stopping criterion: determines when to stop splitting
- e.g. stop splitting when all elements at a node have an identical representation (equal vectors for all keywords)

Growing the tree: splitting criterion
Information gain: do we reduce uncertainty if we split node n into two when attribute a has value y?
- let t be the distribution of n
- this is equivalent to comparing the entropy of t with the entropy of t given a, i.e. the entropy of t vs. the entropy of its child nodes if we split:

    G(a, y) = H(t) - \left( p_L \, H(t_L) + p_R \, H(t_R) \right)

i.e. the entropy of t minus the sum of the entropies of the child nodes, each weighted by the proportion p of the items from n that end up in that child (left and right).

Information gain example
- at node 1: P(c|n1) = 0.3, H ≈ 0.611
- at node 2: P(c|n2) = 0.116, H ≈ 0.35
- at node 5: P(c|n5) = 0.9, H ≈ 0.22
- weighted sum of the entropies of nodes 2 and 5 = 0.328
- gain = 0.611 - 0.328 = 0.283
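A Python sketch of this computation; using the natural logarithm for the entropy, and an unrounded P(c|n5) of roughly 0.943 in place of the rounded 0.9, reproduces the numbers quoted above:

    import math

    def entropy(p):
        """Binary entropy (in nats) of being in the earnings category."""
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log(p) + (1 - p) * math.log(1 - p))

    def information_gain(n_left, p_left, n_right, p_right, p_parent):
        """Parent entropy minus the size-weighted entropy of the two children."""
        n = n_left + n_right
        weighted = (n_left / n) * entropy(p_left) + (n_right / n) * entropy(p_right)
        return entropy(p_parent) - weighted

    # Node 1 (p = 0.3) splits into node 2 (5977 items, p = 0.116)
    # and node 5 (1704 items, p taken as 0.943; see note above).
    print(entropy(0.3))                                     # ~0.611
    print(information_gain(5977, 0.116, 1704, 0.943, 0.3))  # ~0.283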

Leaf nodes

Suppose n3 has:
- 1500 earnings docs
- other docs in other categories
Where do we classify a new doc d that reaches this node?
- e.g. use MLE with add-one smoothing
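One way to read "MLE with add-one smoothing" here (a sketch; smoothing over the K possible categories is an assumption): if a leaf node n contains N documents, N_c of which belong to category c, then

    P_{\mathrm{MLE}}(c \mid n) = \frac{N_c}{N}
    \qquad
    P_{\mathrm{add\text{-}1}}(c \mid n) = \frac{N_c + 1}{N + K}

and a new document d that reaches n is assigned to c if this probability is high enough (e.g. above 0.5).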

[The same decision tree figure, now with the leaf probabilities: node 3 (5436 items, p(c|n3) = 0.050) and node 4 (541 items, p(c|n4) = 0.649).]

Pruning the tree
Pruning proceeds by removing leaf nodes one by one, until the tree is empty.
At each step, remove the leaf node expected to be least helpful.

This needs a pruning criterion, i.e. a measure of confidence indicating what evidence we have that the node is useful.

Each pruning step gives us a new tree (the old tree minus one node), for a total of n trees if the original tree had n nodes.

Which of these trees do we select as our final classifier?

Pruning the tree: held-out data
To select the best tree, we can use held-out data.

At each pruning step, try resulting tree against held-out data, and check success rate.
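A sketch of this selection loop; prune_once and accuracy are hypothetical helpers standing in for the pruning criterion and the held-out evaluation:

    def select_best_tree(full_tree, held_out_docs, prune_once, accuracy):
        """Generate the sequence of pruned trees and keep the one that scores
        best on held-out data.  prune_once(tree) is assumed to return the tree
        with its least helpful leaf removed (or None once the tree is empty);
        accuracy(tree, docs) returns the success rate on those documents."""
        best_tree, best_score = full_tree, accuracy(full_tree, held_out_docs)
        tree = prune_once(full_tree)
        while tree is not None:
            score = accuracy(tree, held_out_docs)
            if score > best_score:
                best_tree, best_score = tree, score
            tree = prune_once(tree)
        return best_tree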

Since held-out data reduces the training set, it is better to perform cross-validation.

When are decision trees useful?
Some disadvantages:
- a decision tree is a complex classification device with many parameters
- it splits the training data into very small chunks
- small sets will display regularities that don't generalise (overfitting)

Main advantage: very easy to understand!

Maximum entropy for text classification

A reminder from lecture 9
The MaxEnt distribution is

a log-linear model:

The probability of a category c and document d is computed in terms of a weighted multiplication of feature values (normalised by a constant). Each feature imposes a constraint on the model: its expected value under the model must match its empirical (observed) expected value.

The MaxEnt principle dictates that we find the simplest model p* satisfying the constraints:

where P is the set of possible distributions consistent with the constraints; p* is unique and has the log-linear form given earlier.
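In symbols (a reconstruction consistent with the wording above; the exact notation of lecture 9 is assumed):

    P(c, d) = \frac{1}{Z} \prod_{i=1}^{K} \alpha_i^{f_i(d, c)}

with one weight \alpha_i per feature f_i and Z a normalising constant; each constraint has the form

    E_p[f_i] = E_{\tilde{p}}[f_i]

(the model's expectation of f_i equals its empirical expectation); and the MaxEnt principle selects

    p^* = \arg\max_{p \in P} H(p)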

Weights for features can be found using Generalised Iterative Scaling

Application to text categorisation
Example:

we've identified 20 keywords which are diagnostic of the earnings category in Reuters

each keyword is a feature

Earnings features (from M&S '99):

    f_j (word)    weight α_j    log α_j
    cts               12.303      2.51
    profit             9.701      2.272
    net                6.155      1.817
    loss               4.032      1.394
    dlrs               0.678     -0.388
    pct                0.590     -0.528
    is                 0.418     -0.871

The features at the top of the table (higher weights) are very salient/diagnostic; those at the bottom are less important.

Classifying with the MaxEnt model
Recall that:

As a decision criterion we can use: classify a new document as earnings if
P(earnings | d) > P(¬earnings | d)
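A minimal Python sketch of this decision rule using the weights in the table above. Treating each keyword as a binary feature that fires only for the earnings class is an assumption; under it the unnormalised non-earnings score is the empty product (log-score 0), and the normalising constant cancels from the comparison:

    import math

    # Weights alpha_j from the table above.
    weights = {
        "cts": 12.303, "profit": 9.701, "net": 6.155, "loss": 4.032,
        "dlrs": 0.678, "pct": 0.590, "is": 0.418,
    }

    def earnings_log_score(doc_tokens):
        """Unnormalised log-linear score for earnings: the sum of log(alpha_j)
        over keywords that occur in the document."""
        return sum(math.log(weights[w]) for w in set(doc_tokens) if w in weights)

    def classify(doc_tokens):
        """Classify as earnings iff the earnings log-score beats the (zero)
        non-earnings log-score, i.e. P(earnings | d) > P(not earnings | d)."""
        return "earnings" if earnings_log_score(doc_tokens) > 0 else "not earnings"

    print(classify("net profit of two mln dlrs".split()))  # -> earnings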

K nearest neighbour classification

Rationale
Simple nearest neighbour (1NN):
- Given: a new document d
- Find: the document in the training set that is most similar to d
- Classify d with the same category

Generalisation (kNN): compare d to its k nearest neighbours

The crucial thing is the similarity measure.

Example: 1NN + cosine similarity
Given: document d
Goal: categorise d based on training set T
Define: the similarity between two documents as the cosine of the angle between their term-weight vectors.

Find the subset T' of T containing the training document(s) most similar to d, i.e. those maximising the cosine similarity with d.
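A Python sketch of 1NN with cosine similarity over term-weight vectors; the toy vectors and helper names are illustrative:

    import math

    def cosine(u, v):
        """Cosine of the angle between two term-weight vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def nearest_neighbour(d, training_set):
        """training_set: list of (vector, category) pairs.
        Return the category of the single most similar training document."""
        best = max(training_set, key=lambda item: cosine(d, item[0]))
        return best[1]

    # Toy usage: two labelled training documents, one new document
    train = [([6, 5, 0], "earnings"), ([0, 1, 7], "acq")]
    print(nearest_neighbour([5, 4, 1], train))  # -> earnings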

Generalising to k > 1 neighbours
Choose the k nearest neighbours and weight them by similarity.

Repeat the method for each neighbour.

Decide on a classification based on the majority class for these neighbours.
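Extending the sketch above to k > 1 neighbours (reusing the cosine function from it); weighting each vote by the neighbour's cosine similarity to d is one common choice, assumed here:

    from collections import defaultdict

    def knn_classify(d, training_set, k=3):
        """Pick the k most similar training documents and let each vote for its
        category, weighted by its cosine similarity to d."""
        neighbours = sorted(training_set,
                            key=lambda item: cosine(d, item[0]),
                            reverse=True)[:k]
        votes = defaultdict(float)
        for vector, category in neighbours:
            votes[category] += cosine(d, vector)
        return max(votes, key=votes.get)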