Machine Learning Sections 18.1 - 18.4

Page 1:

Machine Learning

Sections 18.1 - 18.4

Page 2:

What is learning?

“changes in a system that enable a system to do the same task more efficiently the next time” -- Herbert Simon

“constructing or modifying representations of what is being experienced” -- Ryszard Michalski

“making useful changes in our minds” -- Marvin Minsky

Page 3:

Why learn?

Understand and improve human learning
• learn to teach, CAI, CBT

Discover new things
• data mining

Fill in skeletal information about a domain
• incorporate new information in real time
• make systems less “finicky” or “brittle” by making them better able to generalize

Page 4:

Components of a learning system

Page 5:

Evaluating Performance

Several possible criteria:
• predictive accuracy of the classifier
• speed of the learner
• speed of the classifier
• space requirements

The most common criterion is predictive accuracy.

Page 6:

Major Paradigms of ML

Rote Learning
• memorize examples; association-based storage and retrieval

Induction
• learn from examples to reach general conclusions

Clustering

Analogy
• determine correspondence between representations

Page 7:

Major Paradigms (cont.)

Discovery
• unsupervised, no specific goal

Genetic Algorithms
• combine successful behaviors; only the fittest survive

Reinforcement
• feedback (reward) given at the end of a sequence of steps
• rewards are assigned by solving a credit assignment problem

Page 8:

Inductive Learning

Extrapolate from a given set of examples so that we can make accurate predictions about future examples.

Types:
• Supervised: a teacher tells us the answer, y = f(x)
• Unsupervised: predict a future value, then validate the prediction
• Concept Learning: given examples in a class, determine whether a test example is in the class (P) or not (N)

Page 9:

Supervised Concept Learning

Given a training set of positive and negative examples of a concept, construct a description that will accurately classify future examples.

That is, learn some good estimate of the function f given a training set

{(x1, y1), (x2, y2), ..., (xn, yn)}

where each yi is either + (positive) or - (negative).

Page 10:

Inductive Bias

Inductive learning generalizes from specific facts; its conclusions cannot be proven true, but they can be proven false (it is falsity preserving).

Learning can be viewed as searching a hypothesis space H of possible f functions.

A bias allows us to pick which hypothesis h is preferable, and defines a metric for comparing f functions to find the best one.

Page 11:

Inductive learning framework

Raw input is a feature vector, x, that describes the relevant attributes of an example. Each x is a list of n (attribute, value) pairs, e.g.

x = (person=Sue, major=CS, age=Young, Gender=F)

Attributes have discrete values, and all examples have all attributes. Each example is a point in n-dimensional feature space.

Page 12:

Case-based idea

Maintain a library of previous cases. When a new problem arises:
• find the most similar case(s) in the library
• adapt the similar cases to solve the current problem

Page 13:

Nearest Neighbor

Save each training example as a point in n-space. For each testing example, measure the “distance” to each training example, and classify the example the same as its nearest neighbor.

Suffers from the curse of high dimensionality, and doesn’t generalize well if the examples are not clustered tightly.

Page 14:

k-nearest neighbor

What should the value of k be? That is, how many “close” examples should the algorithm consider? This is problem dependent.

Using the k nearest neighbors, rather than just one, hopefully avoids the problem of noise in the data.
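A minimal sketch of the k-nearest-neighbor rule described above; the Euclidean distance and the majority vote are conventional choices assumed here, not mandated by the slides:

    import math
    from collections import Counter

    def knn_classify(train, query, k=3):
        """Classify `query` by majority vote among its k nearest
        training examples. `train` is a list of (vector, label) pairs."""
        nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    # Usage: points in 2-D feature space labeled + or -.
    train = [((1, 1), '+'), ((1, 2), '+'), ((5, 5), '-'), ((6, 5), '-')]
    print(knn_classify(train, (2, 1), k=3))   # -> '+'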

Page 15:

Nearest-neighbor problems

Storing a large number of examples
• need a strategy for deciding whether to keep or discard an example
• one idea (sketched below): store part of the training data, use the stored part to predict the rest of the training data, and if a prediction is a mistake, add that example to the stored set

Irrelevant features
• use a tuning set to add or remove features to/from the feature set
• distance function: how much should each dimension be weighted?
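The keep-or-discard idea above is essentially the classic condensed-nearest-neighbor loop; a sketch, assuming the knn_classify function from the previous slide with k=1:

    def condense(train):
        """Keep only the examples that the stored subset misclassifies."""
        stored = [train[0]]                        # seed with one example
        for x, y in train[1:]:
            if knn_classify(stored, x, k=1) != y:  # mistake: remember it
                stored.append((x, y))
        return stored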

Page 16:

Nearest-neighbor results

Testbed         Nearest-Neighbor   Decision Trees   Neural Nets
Wisc. Cancer    98%                95%              96%
Heart Disease   78%                76%              ?
Tumor           37%                38%              ?
Appendicitis    83%                85%              86%

Page 17:

Learning Decision Trees

Goal: Build a decision tree for classifying examples as positive or negative instances of a concept

Supervised; batch processing of training examples, using a preference bias.

Page 18:

Decision Tree Example

Page 19:

Building Decision Trees

The preference bias is Ockham’s Razor: the simplest explanation that is consistent with the observations is probably the best.

Finding the smallest decision tree is NP-hard, so we’ll settle for pretty small.

Page 20:

Construction Overview

Top-down and recursive (a sketch follows below):
• pick the “best” attribute for the current node
• generate child nodes, one for each possible value of the selected attribute
• partition the examples on that attribute and assign each subset to the child it goes with
• repeat for each child until its examples are homogeneous
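The overview maps almost directly onto a recursive procedure. A sketch, assuming each example is a dict of attribute values plus a 'class' label, with the “best attribute” choice deferred to a pluggable function (a gain-based chooser appears after the Information Gain slide):

    from collections import Counter

    def build_tree(examples, attributes, choose_attribute):
        labels = [ex['class'] for ex in examples]
        if len(set(labels)) == 1:          # homogeneous: make a leaf
            return labels[0]
        if not attributes:                 # nothing left to split on: majority leaf
            return Counter(labels).most_common(1)[0][0]
        best = choose_attribute(examples, attributes)
        children = {}
        for value in {ex[best] for ex in examples}:   # one child per observed value
            subset = [ex for ex in examples if ex[best] == value]
            rest = [a for a in attributes if a != best]
            children[value] = build_tree(subset, rest, choose_attribute)
        return (best, children)            # internal node: (attribute, branches)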

Page 21:

How to pick the “best” attribute?

Random
• just pick one

Least values
• narrowest branching of the tree

Most values
• shallowest tree (fewest levels)

Max-Gain
• largest expected information gain
• smallest expected size of subtrees

Page 22:

Max-Gain background

Use information theory. The expected work to guess whether an example x in a set S matches a concept is

log2 |S|

questions, since at each step we can ask a yes/no question that eliminates at most half of the remaining elements.

Page 23:

Expected questions remaining

Given S = P union N, with P and N disjoint:
• if x is in P, then log2 |P| questions are needed
• if x is in N, then log2 |N| questions are needed

So the expected number of questions remaining is

prob(x in P) * log2 |P| + prob(x in N) * log2 |N|

or, equivalently, with p = |P| and n = |N|,

(p / (p+n)) * log2 p + (n / (p+n)) * log2 n

Page 24:

Information Content

How many questions do we save by knowing if x is in P or N?

I(P,N) = log2 |S| - (|P|/|S| log2 |P|) - (|N|/|S| log2 |N|)

or, equivalently,

I(%P, %N) = -(%P * log2 %P) - (%N * log2 %N)

Note that 0 <= I(P,N) <= 1, where 0 means no information and 1 means maximum information.
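The second form translates directly into code; a small sketch using the usual convention that 0 * log2(0) = 0:

    import math

    def info(p_frac, n_frac):
        """I(%P, %N): expected bits needed to classify an example."""
        term = lambda f: -f * math.log2(f) if f > 0 else 0.0
        return term(p_frac) + term(n_frac)

    print(info(0.5, 0.5))   # 1.0: perfect balance (next slide)
    print(info(1.0, 0.0))   # 0.0: homogeneous (slide after that)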

Page 25:

Perfect Balance

I(1/2, 1/2) = -1/2 log2(1/2) - 1/2 log2(1/2)
            = -1/2 (log2 1 - log2 2) - 1/2 (log2 1 - log2 2)
            = -1/2 (0 - 1) - 1/2 (0 - 1)
            = 1/2 + 1/2
            = 1

The information content is at its maximum.

Page 26:

Example: Homogeneity

If all of the samples in S are positive and none are negative, the information content is low:

I(1, 0) = -1 log2(1) - 0 log2(0)
        = -0 - 0
        = 0

(using the convention that 0 log2 0 = 0)

Page 27:

Low Information Content

Low information content is desirable in order to make the smallest tree:
• most of the examples are classified the same
• the subtree under this node will probably be small

Page 28:

Information Gained

For a given attribute, measure the difference in information content after a node splits up the examples:
• measure the information content at each child
• weight the information by the proportion of examples that will go there

Page 29:

Max-Gain definitions

Si = subset of S with value i, for i = 1, ..., m
Pi = subset of Si that are +
Ni = subset of Si that are -
qi = |Si| / |S| = % of examples on branch i
%Pi = |Pi| / |Si| = % of + examples on branch i
%Ni = |Ni| / |Si| = % of - examples on branch i

Page 30:

Information remaining

Weighted sum of the information content of each child node generated by that attribute:

Remainder(A) = Σ_{i=1..m} qi * I(%Pi, %Ni)

Page 31:

Information Gain

Subtract expected information content after the node from the information content at the entrance to the node to get the gain at that node.

Gain(A) = I(%P,%N) - Remainder(A)
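Putting the last two slides together, a sketch of Remainder(A) and Gain(A) over a list of example dicts (reusing the info function above; the '+'/'-' class labels follow the representation assumed earlier):

    def remainder(examples, attr):
        """Weighted sum of the information content of each child node."""
        total, rem = len(examples), 0.0
        for value in {ex[attr] for ex in examples}:
            branch = [ex for ex in examples if ex[attr] == value]
            p = sum(ex['class'] == '+' for ex in branch) / len(branch)
            rem += (len(branch) / total) * info(p, 1 - p)
        return rem

    def gain(examples, attr):
        p = sum(ex['class'] == '+' for ex in examples) / len(examples)
        return info(p, 1 - p) - remainder(examples, attr)

    # A gain-based chooser for the build_tree sketch:
    choose = lambda exs, attrs: max(attrs, key=lambda a: gain(exs, a))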

Page 32:

Select the best attribute

Of all the remaining attributes, select the one with the highest gain for this location in the decision tree. Since the entrance information I(%P, %N) is the same for every attribute, this is equivalent to selecting the A with the minimum Remainder(A).

Page 33:

Example Data

Example   Color   Shape    Size    Class
1         red     square   big     +
2         blue    square   big     +
3         red     round    small   -
4         green   square   small   -
5         red     round    big     +
6         green   square   big     -

Page 34:

Remainder(Color)

3 of 6 are red; 2 of those 3 are +
• 3/6 * I(2/3, 1/3) = 0.5 * 0.918 = 0.459

1 of 6 is blue; it is +
• 1/6 * I(1/1, 0/1) = 0.000

2 of 6 are green; both are -
• 2/6 * I(0/2, 2/2) = 0.000

Remainder(Color) = 0.459 + 0.0 + 0.0 = 0.459

Page 35:

Gain Result

Attribute   Remainder   Gain
Color       0.459       0.541
Shape       1.000       0.000
Size        0.541       0.459
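These figures can be reproduced with the remainder/gain sketch above (note that I(2/3, 1/3) ≈ 0.918, so the numbers differ slightly from versions that round it to 0.914):

    data = [
        {'color': 'red',   'shape': 'square', 'size': 'big',   'class': '+'},
        {'color': 'blue',  'shape': 'square', 'size': 'big',   'class': '+'},
        {'color': 'red',   'shape': 'round',  'size': 'small', 'class': '-'},
        {'color': 'green', 'shape': 'square', 'size': 'small', 'class': '-'},
        {'color': 'red',   'shape': 'round',  'size': 'big',   'class': '+'},
        {'color': 'green', 'shape': 'square', 'size': 'big',   'class': '-'},
    ]
    for attr in ('color', 'shape', 'size'):
        print(attr, round(remainder(data, attr), 3), round(gain(data, attr), 3))
    # color 0.459 0.541
    # shape 1.0 0.0
    # size 0.541 0.459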

Page 36:

Final Decision Tree

Color?
  red   -> Size?
             big   -> +
             small -> -
  blue  -> +
  green -> -

Page 37:

Extensions

Real-valued data
• choose thresholds; each interval becomes a discrete value

Noisy data and overfitting
• two examples have identical evidence but different classifications
• some values are inaccurate (the teacher is wrong)
• some attributes are irrelevant

Page 38:

Pruning

To avoid overfitting:
• choose a threshold for information gain: if the best remaining attribute is not very good, prune here by making the node a leaf rather than generating children
• choose a depth limit
• use a tuning set

Page 39:

Generation of rules

For each path from the root to a leaf, translate it to a rule, e.g.

if color=red and size=big then +

The collection of rules for all paths from the root to the leaves is an interpretation of what the tree means.
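A sketch of the path-to-rule translation, assuming the (attribute, branches) node representation from the construction sketch earlier:

    def tree_to_rules(node, conditions=()):
        """Yield one 'if ... then ...' rule per root-to-leaf path."""
        if not isinstance(node, tuple):    # leaf: emit the finished rule
            yield f"if {' and '.join(conditions) or 'true'} then {node}"
            return
        attr, children = node
        for value, child in children.items():
            yield from tree_to_rules(child, conditions + (f"{attr}={value}",))

    # On the final tree above this yields, among others:
    #   if color=red and size=big then +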

Page 40:

Setting Parameters

Some algorithms require setting learning parameters, and the parameters must be set without looking at the test data. One method is to use a tuning set:
• partition the data into a train set and a tune set
• for each candidate parameter value, generate a decision tree using the train set
• use the tune set to evaluate the error rates and determine which parameter value is best
• compute a new decision tree, using the selected parameter value, from the entire training set
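The four steps above, as a sketch; train_tree(examples, param) and error_rate(tree, examples) are assumed helpers, not part of the slides:

    def pick_parameter(train_set, tune_set, candidates, train_tree, error_rate):
        """Choose the parameter whose tree does best on the tune set,
        then retrain on all of the data with that value."""
        best = min(candidates,
                   key=lambda v: error_rate(train_tree(train_set, v), tune_set))
        return train_tree(train_set + tune_set, best)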

Page 41:

Cross Validation

Divide all examples into N disjoint sets E = {E1, E2, E3, ..., EN}.

For each i = 1, ..., N:
• train set = E - {Ei}, test set = Ei
• compute a decision tree using the train set
• determine the performance accuracy Pi using the test set

The N-fold cross-validation estimate of performance is

(P1 + P2 + P3 + ... + PN) / N
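A sketch of the N-fold estimate; train_fn and accuracy are assumed helpers playing the roles of tree construction and Pi measurement:

    def cross_validate(examples, n, train_fn, accuracy):
        """N-fold cross-validation estimate of performance."""
        folds = [examples[i::n] for i in range(n)]        # n disjoint sets
        scores = []
        for i in range(n):
            test = folds[i]
            train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            scores.append(accuracy(train_fn(train), test))
        return sum(scores) / n                            # (P1 + ... + Pn) / n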

Page 42:

WillWait from 12 Examples

Page 43:

Increasing Training Set

Page 44:

Summary

Decision trees are widely used.

Strengths:
• easy to understand the rationale
• can outperform humans
• fast and simple to implement
• handle noisy data well

Weaknesses:
• univariate (uses only one variable at a time)
• batch (non-incremental)