Machine Learning Reading: Chapter 18.2. Text classification: is a text a finance news article?
Posted on 21-Dec-2015
Slide 3
20 attributes (word counts):
Investors 2, Dow 2, Jones 2, Industrial 1, Average 3, Percent 5, Gain 6, Trading 8, Broader 5, stock 5, Indicators 6, Standard 2, Rolling 1, Nasdaq 3, Early 10, Rest 12, More 13, first 11, Same 12, The 30
Slide 4
20 attributes
Men’s Basketball, Championship, UConn, Huskies, Georgia Tech, Women, Playing, Crown, Titles, Games, Rebounds, All-America, early, rolling, Celebrates, Rest, More, First, The, same
Example  stock  rolling  the  class
1        0      3        40   other
2        6      8        35   finance
3        7      7        25   other
4        5      7        14   other
5        8      2        20   finance
6        9      4        25   finance
7        5      6        20   finance
8        0      2        35   other
9        0     11        25   finance
10       0     15        28   other
Slide 6
Constructing the Decision Tree
Goal: find the smallest decision tree consistent with the examples.
- Find the attribute that best splits the examples.
- Form a tree with root = best attribute.
- For each value vi (or range) of the best attribute:
  - Select those examples with best = vi.
  - Construct subtree_i by recursively calling the decision-tree algorithm on that subset of examples, with all attributes except best.
  - Add a branch to the tree with label = vi and subtree = subtree_i.
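The recursive procedure above can be sketched in Python (a minimal sketch with hypothetical helper names; examples are assumed to be dicts with a "class" key, and tie-breaking and empty-subset handling are simplified):

```python
import math
from collections import Counter

def entropy(labels):
    """H = sum over classes of -p * log2(p)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(examples, attributes):
    """Pick the attribute whose split leaves the lowest expected entropy
    (equivalently, the highest information gain)."""
    def remainder(a):
        rem = 0.0
        for v in {e[a] for e in examples}:
            subset = [e["class"] for e in examples if e[a] == v]
            rem += len(subset) / len(examples) * entropy(subset)
        return rem
    return min(attributes, key=remainder)

def build_tree(examples, attributes):
    labels = [e["class"] for e in examples]
    if len(set(labels)) == 1:            # pure node: stop
        return labels[0]
    if not attributes:                   # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = best_attribute(examples, attributes)
    rest = [a for a in attributes if a != best]
    tree = {best: {}}
    for v in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == v]
        tree[best][v] = build_tree(subset, rest)
    return tree
```

The returned tree is a nested dict: a leaf is a class label, an internal node maps the chosen attribute to one subtree per value.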
Slide 7
Choosing the Best Attribute: Binary Classification
We want a formal measure that returns a maximum value when an attribute makes a perfect split and a minimum when it makes no distinction.
Information theory (Shannon and Weaver, 1949):
Entropy: a measure that characterizes the impurity of a collection of examples.
Information gain: the expected reduction in entropy caused by partitioning the examples according to this attribute.
Slide 8
Formula for Entropy

H(P(v1), …, P(vn)) = ∑_{i=1}^{n} −P(vi) log2 P(vi)

where P(vi) = probability of vi.

Examples:
Suppose we have a collection of 10 examples, 5 positive, 5 negative:
H(1/2, 1/2) = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1 bit
Suppose we have a collection of 100 examples, 1 positive and 99 negative:
H(1/100, 99/100) = −.01 log2 .01 − .99 log2 .99 = .08 bits
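A quick numerical check of the two examples, using only the standard library:

```python
import math

def entropy(probs):
    """H(p1,...,pn) = sum_i -p_i * log2(p_i); 0 * log2(0) is treated as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(round(entropy([0.5, 0.5]), 2))    # 1.0 bit for a 50/50 split
print(round(entropy([0.01, 0.99]), 2))  # 0.08 bits for a 1/99 split
```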
Slide 9
Choosing the Best Attribute: Information Gain
Information gain (from an attribute test) = the difference between the original information requirement and the new requirement:

Gain(A) = H(p/(p+n), n/(p+n)) − Remainder(A)

H = entropy. It is highest (value 1) when the set is equally divided between positive (p) and negative (n) examples, i.e. (.5, .5), and lower as the set becomes more unbalanced (e.g., (.9, .1)).
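The gain formula can be sketched directly from the definition (hypothetical function names; each attribute value's split is given as a (p_i, n_i) count pair):

```python
import math

def H(p, n):
    """Binary entropy of a set with p positive and n negative examples."""
    total = p + n
    return -sum(x / total * math.log2(x / total) for x in (p, n) if x > 0)

def gain(splits):
    """splits: list of (p_i, n_i) pairs, one per value of attribute A.
    Gain(A) = H(p, n) - sum_i (p_i + n_i)/(p + n) * H(p_i, n_i)."""
    p = sum(pi for pi, _ in splits)
    n = sum(ni for _, ni in splits)
    remainder = sum((pi + ni) / (p + n) * H(pi, ni) for pi, ni in splits)
    return H(p, n) - remainder
```

A perfect split recovers the full entropy of the set, while an uninformative split (each subset as mixed as the whole) has gain 0.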
Example  stock  rolling  the  class
1        0      3        40   other
2        6      8        35   finance
3        7      7        25   other
4        5      7        14   other
5        8      2        20   finance
6        9      4        25   finance
7        5      6        20   finance
8        0      2        35   other
9        0     11        25   finance
10       0     15        28   other
Slide 14
The algorithm as specified so far is designed for binary classification and attributes with discrete values.
Attributes:
- Outlook: sunny, overcast, rain
- Temperature: hot, mild, cool
- Humidity: normal, high
- Wind: weak, strong
Classification: PlayTennis? Yes, No
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
[Figure: entropy of each candidate root split. Full set: E = .940 (9/14 yes). Humidity — High: E = .985, Normal: E = .592. Wind — Weak: E = .811, Strong: E = 1.0. Outlook (Sunny/Overcast/Rain) and Temperature (Hot/Mild/Cool) are split the same way.]
Gain(S,Outlook)=.246, Gain(S,Humidity)=.151, Gain(S,Wind)=.048, Gain(S,Temperature)=.029
Outlook is selected because it has the highest gain.
Gain(humidity) = .940 − (7/14)(.985) − (7/14)(.592) = .151
Gain(wind) = .940 − (8/14)(.811) − (6/14)(1.0) = .048
Gain(outlook) = .940 − (4/14)(0) − (5/14)(.971) − (5/14)(.971) = .246
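These gains can be verified numerically from the PlayTennis table (a sketch; the computed values agree with the slide's up to rounding):

```python
import math

# PlayTennis training set: (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"), ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    """Binary entropy of the Yes/No labels in the given rows."""
    yes = sum(1 for r in rows if r[-1] == "Yes")
    total = len(rows)
    return -sum(p * math.log2(p) for p in (yes / total, 1 - yes / total) if p > 0)

def gain(attr):
    """Gain(S, attr) = E(S) - weighted entropy of the subsets per value."""
    col = ATTRS[attr]
    rem = sum(len(sub) / len(data) * entropy(sub)
              for v in {r[col] for r in data}
              for sub in [[r for r in data if r[col] == v]])
    return entropy(data) - rem

for a in ATTRS:                  # Outlook comes out highest (~ .246)
    print(a, round(gain(a), 3))
```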
Slide 19
Extending the Algorithm for Continuous-Valued Attributes
Dynamically define new discrete-valued attributes that partition the continuous attribute into a discrete set of intervals.
For continuous A, create Ac that is true if A < c, false otherwise.
How to select the best value for threshold c?
- Sort the examples by the continuous attribute.
- Identify adjacent examples that differ in target classification.
- Generate a set of candidate thresholds midway between the corresponding values of A.
- Choose the threshold c that maximizes information gain.
Slide 20
Example: Temperature as a Continuous Value

Temp:        40  48  60  72  80  90
PlayTennis?  No  No  Yes Yes Yes No

Two candidate thresholds: (48+60)/2 = 54 and (80+90)/2 = 85.
Information gain is greater for Temperature > 54 than for Temperature > 85.
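The threshold-selection procedure can be sketched as follows (hypothetical function names); on the table above it picks 54:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    counts = {}
    for l in labels:
        counts[l] = counts.get(l, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_threshold(values, labels):
    """Candidate thresholds sit midway between adjacent examples (sorted by
    value) whose labels differ; return the one maximizing information gain."""
    pairs = sorted(zip(values, labels))
    candidates = [(pairs[i][0] + pairs[i + 1][0]) / 2
                  for i in range(len(pairs) - 1)
                  if pairs[i][1] != pairs[i + 1][1]]
    def gain(c):
        left = [l for v, l in pairs if v < c]
        right = [l for v, l in pairs if v >= c]
        rem = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        return entropy(labels) - rem
    return max(candidates, key=gain)

temps = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, play))   # 54.0
```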
Slide 21
Other Cases
- What if the class is discrete-valued, not binary?
- What if an attribute has many values (e.g., one per instance)?
Slide 22
Training vs. Testing
A learning algorithm is good if it uses its learned hypothesis to make accurate predictions on unseen data
Collect a large set of examples (with classifications)
Divide into two disjoint sets: the training set and the test set
Apply the learning algorithm to the training set, generating hypothesis h
Measure the percentage of examples in the test set that are correctly classified by h
Repeat for different sizes of training sets and different randomly selected training sets of each size.
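The methodology above can be sketched as follows (a minimal sketch; `majority_learner` is a hypothetical stand-in learner that just predicts the training set's majority class):

```python
import random

def evaluate(examples, learn, train_frac=0.8, seed=0):
    """Hold out a test set, learn a hypothesis on the rest,
    and report the fraction of test examples it classifies correctly."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(train_frac * len(shuffled))
    train, test = shuffled[:cut], shuffled[cut:]
    h = learn(train)                 # hypothesis from the training set only
    correct = sum(1 for x, y in test if h(x) == y)
    return correct / len(test)

def majority_learner(train):
    """Hypothetical baseline: always predict the training majority class."""
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)
    return lambda x: top
```

Repeating this for different training-set sizes and different random splits gives the learning curve the slide describes.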
Slide 24
Overfitting
Learning algorithms may use irrelevant attributes to make decisions; for news, the day published and the newspaper.
When else can overfitting occur?
Solution #1: decision-tree pruning
- Prune away attributes with low information gain.
- Use statistical significance to test whether the gain is meaningful.
Slide 25
K-fold Cross-Validation
Solution #2 to reduce overfitting:
- Run k experiments.
- Use a different 1/k of the data for testing each time.
- Average the results.
Common choices: 5-fold, 10-fold, leave-one-out.
Slide 26
Cross-Validation
[Figure: 10-fold cross-validation. The labeled data (1566 examples) is split into 10 folds; each round trains a model on 9 folds (approx. 1409 examples) and evaluates it on the remaining fold (approx. 157). Lather, rinse, repeat (10 times), then report the average.]
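Generating the k folds can be sketched as follows (hypothetical function name; the fold sizes match the figure's numbers for 1566 examples and k = 10):

```python
def kfold_splits(n, k):
    """Yield (train_idx, test_idx) pairs: each fold of ~n/k examples is
    the test set exactly once; the other k-1 folds form the training set."""
    indices = list(range(n))
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# 10-fold split of 1566 labeled examples, as in the figure:
sizes = [(len(tr), len(te)) for tr, te in kfold_splits(1566, 10)]
print(sizes[0])   # (1409, 157) -- train on ~9/10, test on the held-out fold
```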
Slide 28
Ensemble Learning
Learn from a collection of hypotheses
Majority voting
Enlarges the hypothesis space
Slide 29
Boosting
- Uses a weighted training set: each example has an associated weight wj ≥ 0, and higher-weighted examples have higher importance.
- Initially, wj = 1 for all examples.
- Next round: increase the weights of misclassified examples and decrease the other weights; from the new weighted set, generate hypothesis h2.
- Continue until M hypotheses have been generated.
- Final ensemble hypothesis = weighted-majority combination of all M hypotheses, weighting each hypothesis according to how well it did on the training data.
Slide 30
AdaBoost
If the input learning algorithm L is a weak learning algorithm (L always returns a hypothesis with weighted error on the training set slightly better than random), AdaBoost returns a hypothesis that classifies the training data perfectly for large enough M.
It thus boosts the accuracy of the original learning algorithm on the training data.
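The boosting loop of the last two slides can be sketched as follows (a minimal AdaBoost sketch with a hypothetical one-dimensional decision-stump weak learner; labels are assumed to be ±1):

```python
import math

def adaboost(examples, weak_learn, M):
    """examples: list of (x, y) with y in {-1, +1}.
    weak_learn(examples, weights) -> hypothesis with weighted error < 1/2.
    Returns the weighted-majority ensemble of M hypotheses."""
    n = len(examples)
    w = [1.0 / n] * n                          # uniform initial weights
    hyps, alphas = [], []
    for _ in range(M):
        h = weak_learn(examples, w)
        err = sum(wi for wi, (x, y) in zip(w, examples) if h(x) != y)
        err = max(err, 1e-10)                  # avoid log(0) on a perfect round
        alpha = 0.5 * math.log((1 - err) / err)   # hypothesis weight
        # Increase weights of misclassified examples, decrease the rest.
        w = [wi * math.exp(alpha if h(x) != y else -alpha)
             for wi, (x, y) in zip(w, examples)]
        total = sum(w)
        w = [wi / total for wi in w]
        hyps.append(h)
        alphas.append(alpha)
    def ensemble(x):
        s = sum(a * h(x) for a, h in zip(alphas, hyps))
        return 1 if s >= 0 else -1
    return ensemble

def stump_learner(examples, weights):
    """Hypothetical weak learner: best single-threshold stump on scalar x."""
    best = None
    for c in sorted({x for x, _ in examples}):
        for sign in (1, -1):
            h = lambda x, c=c, s=sign: s if x >= c else -s
            err = sum(wi for wi, (x, y) in zip(weights, examples) if h(x) != y)
            if best is None or err < best[0]:
                best = (err, h)
    return best[1]
```

A single stump cannot separate labels like +, −, + along a line, but a weighted majority of a few stumps can, which is exactly the boost in training accuracy the slide describes.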