Machine Learning Sections 18.1 - 18.4

Page 1:

Machine Learning

Sections 18.1 - 18.4

Page 2:

What is learning?

“changes in a system that enable a system to do the same task more efficiently the next time” -- Herbert Simon

“constructing or modifying representations of what is being experienced” -- Ryszard Michalski

“making useful changes in our minds” -- Marvin Minsky

Page 3:

Why learn?

Understand and improve human learning
• learn to teach, CAI, CBT

Discover new things
• data mining

Fill in skeletal information about a domain
• incorporate new information in real time
• make systems less “finicky” or “brittle” by making them better able to generalize

Page 4:

Components of a learning system

Page 5:

Evaluating Performance

Several possible criteria:
• predictive accuracy of the classifier
• speed of the learner
• speed of the classifier
• space requirements

The most common criterion is predictive accuracy.

Page 6:

Major Paradigms of ML

Rote Learning
• memorize examples; association-based storage and retrieval

Induction
• learn from examples to reach general conclusions

Clustering

Analogy
• determine correspondence between representations

Page 7:

Major Paradigms (cont.)

Discovery
• unsupervised, no specific goal

Genetic Algorithms
• combine successful behaviors; only the fittest survive

Reinforcement
• feedback (reward) given at the end of a sequence of steps
• rewards are assigned by solving a credit assignment problem

Page 8:

Inductive Learning

Extrapolate from a given set of examples so that we can make accurate predictions about future examples.

Types:
• Supervised: a teacher tells us the answer, y = f(x)
• Unsupervised: predict a future value, then validate the prediction
• Concept Learning: given examples in a class, determine whether a test example is in the class (P) or not (N)

Page 9:

Supervised Concept Learning

Given a training set of positive and negative examples of a concept, construct a description that will accurately classify future examples.

That is, learn some good estimate of the function f given a training set

{(x1, y1), (x2, y2), ..., (xn, yn)}

where each yi is either + (positive) or - (negative).

Page 10:

Inductive Bias

Inductive learning generalizes from specific facts; its conclusions cannot be proven true, but they can be proven false (it is falsity preserving).

Learning can be viewed as searching a hypothesis space H of possible f functions.

A bias allows us to pick which hypothesis h is preferable, and defines a metric for comparing f functions to find the best one.

Page 11:

Inductive learning framework

Raw input is a feature vector, x, that describes the relevant attributes of an example. Each x is a list of n (attribute, value) pairs, e.g.

x = (person=Sue, major=CS, age=Young, Gender=F)

Attributes have discrete values, and all examples have all attributes. Each example is a point in n-dimensional feature space.

Page 12:

Case-based idea

Maintain a library of previous cases. When a new problem arises:
• find the most similar case(s) in the library
• adapt the similar cases to solve the current problem

Page 13:

Nearest Neighbor

Save each training example as a point in n-space. For each testing example, measure the “distance” to each training example, and classify the example the same as its nearest neighbor.

Suffers from the curse of high dimensionality, and doesn’t generalize well if the examples are not clustered tightly.

Page 14:

k-nearest neighbor

What should the value of k be? That is, how many “close” examples should the algorithm consider? This is problem dependent.

Using the k nearest neighbors, rather than just one, hopefully avoids the problem of noise in the data.
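A minimal sketch of the k-nearest-neighbor rule described above; the Euclidean distance and the majority vote are conventional choices assumed here, not mandated by the slides:

    import math
    from collections import Counter

    def knn_classify(train, query, k=3):
        """Classify `query` by majority vote among its k nearest
        training examples. `train` is a list of (vector, label) pairs."""
        nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    # Usage: points in 2-D feature space labeled + or -.
    train = [((1, 1), '+'), ((1, 2), '+'), ((5, 5), '-'), ((6, 5), '-')]
    print(knn_classify(train, (2, 1), k=3))   # -> '+'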

Page 15:

Nearest-neighbor problems

Storing a large number of examples
• need a strategy for deciding whether to keep or discard an example
• one idea (sketched below): store part of the training data, use the stored part to predict the rest of the training data, and if a prediction is a mistake, add that example to the stored set

Irrelevant features
• use a tuning set to add or remove features to/from the feature set
• distance function: how much should each dimension be weighted?
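The keep-or-discard idea above is essentially the classic condensed-nearest-neighbor loop; a sketch, assuming the knn_classify function from the previous slide with k=1:

    def condense(train):
        """Keep only the examples that the stored subset misclassifies."""
        stored = [train[0]]                        # seed with one example
        for x, y in train[1:]:
            if knn_classify(stored, x, k=1) != y:  # mistake: remember it
                stored.append((x, y))
        return stored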

Page 16:

Nearest-neighbor results

Testbed         Nearest-Neighbor   Decision Trees   Neural Nets
Wisc. Cancer    98%                95%              96%
Heart Disease   78%                76%              ?
Tumor           37%                38%              ?
Appendicitis    83%                85%              86%

Page 17:

Learning Decision Trees

Goal: Build a decision tree for classifying examples as positive or negative instances of a concept

Supervised; batch processing of training examples, using a preference bias.

Page 18:

Decision Tree Example

Page 19:

Building Decision Trees

The preference bias is Ockham’s Razor: the simplest explanation that is consistent with the observations is probably the best.

Finding the smallest decision tree is NP-hard, so we’ll settle for pretty small.

Page 20:

Construction Overview

Top-down and recursive (a sketch follows below):
• pick the “best” attribute for the current node
• generate child nodes, one for each possible value of the selected attribute
• partition the examples on that attribute and assign each subset to the child it goes with
• repeat for each child until its examples are homogeneous
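The overview maps almost directly onto a recursive procedure. A sketch, assuming each example is a dict of attribute values plus a 'class' label, with the “best attribute” choice deferred to a pluggable function (a gain-based chooser appears after the Information Gain slide):

    from collections import Counter

    def build_tree(examples, attributes, choose_attribute):
        labels = [ex['class'] for ex in examples]
        if len(set(labels)) == 1:          # homogeneous: make a leaf
            return labels[0]
        if not attributes:                 # nothing left to split on: majority leaf
            return Counter(labels).most_common(1)[0][0]
        best = choose_attribute(examples, attributes)
        children = {}
        for value in {ex[best] for ex in examples}:   # one child per observed value
            subset = [ex for ex in examples if ex[best] == value]
            rest = [a for a in attributes if a != best]
            children[value] = build_tree(subset, rest, choose_attribute)
        return (best, children)            # internal node: (attribute, branches)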

Page 21:

How to pick the “best” attribute?

Random
• just pick one

Least values
• narrowest branching of the tree

Most values
• shallowest tree (fewest levels)

Max-Gain
• largest expected information gain
• smallest expected size of subtrees

Page 22:

Max-Gain background

Use information theory. The expected work to guess whether an example x in a set S matches a concept is

log2 |S|

questions, since at each step we can ask a yes/no question that eliminates at most half of the remaining elements.

Page 23:

Expected questions remaining

Given S = P union N, with P and N disjoint:
• if x is in P, then log2 |P| questions are needed
• if x is in N, then log2 |N| questions are needed

So the expected number of questions remaining is

prob(x in P) * log2 |P| + prob(x in N) * log2 |N|

or, equivalently, with p = |P| and n = |N|,

(p / (p+n)) * log2 p + (n / (p+n)) * log2 n

Page 24:

Information Content

How many questions do we save by knowing if x is in P or N?

I(P,N) = log2 |S| - (|P|/|S| log2 |P|) - (|N|/|S| log2 |N|)

or, equivalently,

I(%P, %N) = -(%P * log2 %P) - (%N * log2 %N)

Note that 0 <= I(P,N) <= 1, where 0 means no information and 1 means maximum information.
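The second form translates directly into code; a small sketch using the usual convention that 0 * log2(0) = 0:

    import math

    def info(p_frac, n_frac):
        """I(%P, %N): expected bits needed to classify an example."""
        term = lambda f: -f * math.log2(f) if f > 0 else 0.0
        return term(p_frac) + term(n_frac)

    print(info(0.5, 0.5))   # 1.0: perfect balance (next slide)
    print(info(1.0, 0.0))   # 0.0: homogeneous (slide after that)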

Page 25:

Perfect Balance

I(1/2, 1/2) = -1/2 log2(1/2) - 1/2 log2(1/2)
            = -1/2 (log2 1 - log2 2) - 1/2 (log2 1 - log2 2)
            = -1/2 (0 - 1) - 1/2 (0 - 1)
            = 1/2 + 1/2
            = 1

The information content is at its maximum.

Page 26:

Example: Homogeneity

If all of the samples in S are positive and none are negative, the information content is low:

I(1, 0) = -1 log2(1) - 0 log2(0)
        = -0 - 0
        = 0

(using the convention that 0 log2 0 = 0)

Page 27:

Low Information Content

Low information content is desirable in order to make the smallest tree:
• most of the examples are classified the same
• the subtree under this node will probably be small

Page 28:

Information Gained

For a given attribute, measure the difference in information content after a node splits up the examples:
• measure the information content at each child
• weight the information by the proportion of examples that will go there

Page 29:

Max-Gain definitions

Si = subset of S with value i, for i = 1, ..., m
Pi = subset of Si that are +
Ni = subset of Si that are -
qi = |Si| / |S| = % of examples on branch i
%Pi = |Pi| / |Si| = % of + examples on branch i
%Ni = |Ni| / |Si| = % of - examples on branch i

Page 30:

Information remaining

Weighted sum of the information content of each child node generated by that attribute:

Remainder(A) = Σ_{i=1..m} qi * I(%Pi, %Ni)

Page 31:

Information Gain

Subtract expected information content after the node from the information content at the entrance to the node to get the gain at that node.

Gain(A) = I(%P,%N) - Remainder(A)
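Putting the last two slides together, a sketch of Remainder(A) and Gain(A) over a list of example dicts (reusing the info function above; the '+'/'-' class labels follow the representation assumed earlier):

    def remainder(examples, attr):
        """Weighted sum of the information content of each child node."""
        total, rem = len(examples), 0.0
        for value in {ex[attr] for ex in examples}:
            branch = [ex for ex in examples if ex[attr] == value]
            p = sum(ex['class'] == '+' for ex in branch) / len(branch)
            rem += (len(branch) / total) * info(p, 1 - p)
        return rem

    def gain(examples, attr):
        p = sum(ex['class'] == '+' for ex in examples) / len(examples)
        return info(p, 1 - p) - remainder(examples, attr)

    # A gain-based chooser for the build_tree sketch:
    choose = lambda exs, attrs: max(attrs, key=lambda a: gain(exs, a))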

Page 32:

Select the best attribute

Of all the remaining attributes, select the one with the highest gain for this location in the decision tree. Since the entrance information I(%P, %N) is the same for every attribute, this is equivalent to selecting the A with the minimum Remainder(A).

Page 33:

Example Data

Example   Color   Shape    Size    Class
1         red     square   big     +
2         blue    square   big     +
3         red     round    small   -
4         green   square   small   -
5         red     round    big     +
6         green   square   big     -

Page 34:

Remainder(Color)

3 of 6 are red; 2 of those 3 are +
• 3/6 * I(2/3, 1/3) = 0.5 * 0.918 = 0.459

1 of 6 is blue; it is +
• 1/6 * I(1/1, 0/1) = 0.000

2 of 6 are green; both are -
• 2/6 * I(0/2, 2/2) = 0.000

Remainder(Color) = 0.459 + 0.0 + 0.0 = 0.459

Page 35:

Gain Result

Attribute   Remainder   Gain
Color       0.459       0.541
Shape       1.000       0.000
Size        0.541       0.459
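These figures can be reproduced with the remainder/gain sketch above (note that I(2/3, 1/3) ≈ 0.918, so the numbers differ slightly from versions that round it to 0.914):

    data = [
        {'color': 'red',   'shape': 'square', 'size': 'big',   'class': '+'},
        {'color': 'blue',  'shape': 'square', 'size': 'big',   'class': '+'},
        {'color': 'red',   'shape': 'round',  'size': 'small', 'class': '-'},
        {'color': 'green', 'shape': 'square', 'size': 'small', 'class': '-'},
        {'color': 'red',   'shape': 'round',  'size': 'big',   'class': '+'},
        {'color': 'green', 'shape': 'square', 'size': 'big',   'class': '-'},
    ]
    for attr in ('color', 'shape', 'size'):
        print(attr, round(remainder(data, attr), 3), round(gain(data, attr), 3))
    # color 0.459 0.541
    # shape 1.0 0.0
    # size 0.541 0.459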

Page 36:

Final Decision Tree

Color?
  red   -> Size?
             big   -> +
             small -> -
  blue  -> +
  green -> -

Page 37:

Extensions

Real-valued data
• choose thresholds; each interval becomes a discrete value

Noisy data and overfitting
• two examples have identical evidence but different classifications
• some values are inaccurate (the teacher is wrong)
• some attributes are irrelevant

Page 38:

Pruning

To avoid overfitting:
• choose a threshold for information gain: if the best remaining attribute is not very good, prune here by making the node a leaf rather than generating children
• choose a depth limit
• use a tuning set

Page 39:

Generation of rules

For each path from the root to a leaf, translate it to a rule, e.g.

if color=red and size=big then +

The collection of rules for all paths from the root to the leaves is an interpretation of what the tree means.
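A sketch of the path-to-rule translation, assuming the (attribute, branches) node representation from the construction sketch earlier:

    def tree_to_rules(node, conditions=()):
        """Yield one 'if ... then ...' rule per root-to-leaf path."""
        if not isinstance(node, tuple):    # leaf: emit the finished rule
            yield f"if {' and '.join(conditions) or 'true'} then {node}"
            return
        attr, children = node
        for value, child in children.items():
            yield from tree_to_rules(child, conditions + (f"{attr}={value}",))

    # On the final tree above this yields, among others:
    #   if color=red and size=big then +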

Page 40:

Setting Parameters

Some algorithms require setting learning parameters, and the parameters must be set without looking at the test data. One method is to use a tuning set:
• partition the data into a train set and a tune set
• for each candidate parameter value, generate a decision tree using the train set
• use the tune set to evaluate the error rates and determine which parameter value is best
• compute a new decision tree, using the selected parameter value, from the entire training set
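The four steps above, as a sketch; train_tree(examples, param) and error_rate(tree, examples) are assumed helpers, not part of the slides:

    def pick_parameter(train_set, tune_set, candidates, train_tree, error_rate):
        """Choose the parameter whose tree does best on the tune set,
        then retrain on all of the data with that value."""
        best = min(candidates,
                   key=lambda v: error_rate(train_tree(train_set, v), tune_set))
        return train_tree(train_set + tune_set, best)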

Page 41:

Cross Validation

Divide all examples into N disjoint sets E = {E1, E2, E3, ..., EN}.

For each i = 1, ..., N:
• train set = E - {Ei}, test set = Ei
• compute a decision tree using the train set
• determine the performance accuracy Pi using the test set

The N-fold cross-validation estimate of performance is

(P1 + P2 + P3 + ... + PN) / N
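A sketch of the N-fold estimate; train_fn and accuracy are assumed helpers playing the roles of tree construction and Pi measurement:

    def cross_validate(examples, n, train_fn, accuracy):
        """N-fold cross-validation estimate of performance."""
        folds = [examples[i::n] for i in range(n)]        # n disjoint sets
        scores = []
        for i in range(n):
            test = folds[i]
            train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            scores.append(accuracy(train_fn(train), test))
        return sum(scores) / n                            # (P1 + ... + Pn) / n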

Page 42:

WillWait from 12 Examples

Page 43:

Increasing Training Set

Page 44:

Summary

Decision trees are widely used.

Strengths:
• easy to understand the rationale
• can outperform humans
• fast and simple to implement
• handle noisy data well

Weaknesses:
• univariate (uses only one variable at a time)
• batch (non-incremental)