
Page 1: Classification and Prediction

Classification and Prediction

Bamshad Mobasher, DePaul University

Page 2: Classification and Prediction

What Is Classification?

- The goal of data classification is to organize and categorize data in distinct classes
  - A model is first created based on the data distribution
  - The model is then used to classify new data
  - Given the model, a class can be predicted for new data
- Classification = prediction for discrete and nominal values (e.g., class/category labels)
  - Also called "categorization"

Page 3: Classification and Prediction

Prediction, Clustering, Classification

- What is prediction/estimation?
  - The goal of prediction is to forecast or deduce the value of an attribute based on the values of other attributes
  - A model is first created based on the data distribution
  - The model is then used to predict future or unknown values
  - Most common approach: regression analysis
- Supervised vs. unsupervised classification
  - Supervised classification = classification
    - We know the class labels and the number of classes
  - Unsupervised classification = clustering
    - We do not know the class labels and may not know the number of classes

Page 4: Classification and Prediction

Classification Task

- Given:
  - A description of an instance, x ∈ X, where X is the instance language or instance/feature space.
    - Typically, x is a row in a table, with the instance/feature space described in terms of features or attributes.
  - A fixed set of class or category labels: C = {c1, c2, …, cn}
- The classification task is to determine:
  - The class/category of x: c(x) ∈ C, where c(x) is a function whose domain is X and whose range is C.

Page 5: Classification and Prediction

Learning for Classification

- A training example is an instance x ∈ X, paired with its correct class label c(x): <x, c(x)>, for an unknown classification function c.
- Given a set of training examples D:
  - Find a hypothesized classification function h(x) such that h(x) = c(x) for all training instances (i.e., for all <x, c(x)> in D). This is called consistency.

Page 6: Classification and Prediction

Example of Classification Learning

- Instance language: <size, color, shape>
  - size ∈ {small, medium, large}
  - color ∈ {red, blue, green}
  - shape ∈ {square, circle, triangle}
- C = {positive, negative}
- D:

      Example  Size   Color  Shape     Category
      1        small  red    circle    positive
      2        large  red    circle    positive
      3        small  red    triangle  negative
      4        large  blue   circle    negative

- Hypotheses? circle -> positive? red -> positive?

Page 7: Classification and Prediction

General Learning Issues (All Predictive Modeling Tasks)

- Many hypotheses can be consistent with the training data
- Bias: any criterion other than consistency with the training data that is used to select a hypothesis
- Classification accuracy (% of instances classified correctly)
  - Measured on independent test data
- Efficiency issues:
  - Training time (efficiency of the training algorithm)
  - Testing time (efficiency of subsequent classification)
- Generalization
  - Hypotheses must generalize to correctly classify instances not in the training data
  - Simply memorizing training examples is a consistent hypothesis that does not generalize
  - Occam's razor: finding a simple hypothesis helps ensure generalization
    - The simplest models tend to be the best models
    - The KISS principle

Page 8: Classification and Prediction

Classification: A 3-Step Process

1. Model construction (learning):
   - Each record (instance, example) is assumed to belong to a predefined class, as determined by one of the attributes
     - This attribute is called the target attribute
     - The values of the target attribute are the class labels
   - The set of all instances used for learning the model is called the training set
   - The model may be represented in many forms: decision trees, probabilities, neural networks, …
2. Model evaluation (accuracy):
   - Estimate the accuracy rate of the model based on a test set
   - The known labels of test instances are compared with the predicted classes from the model
   - The test set is independent of the training set; otherwise over-fitting will occur
3. Model use (classification):
   - The model is used to classify unseen instances (i.e., to predict the class labels for new unclassified instances)
   - Predict the value of an actual attribute
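As a rough illustration of the three steps, here is a minimal sketch in Python; scikit-learn, the decision tree learner, and the Iris data are illustrative choices and are not prescribed by the slides.

    # Minimal sketch of the 3-step process (library and dataset are illustrative).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)                       # instances and class labels
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier().fit(X_train, y_train)  # 1. model construction (learning)
    accuracy = model.score(X_test, y_test)                  # 2. model evaluation on an independent test set
    predictions = model.predict(X_test[:5])                 # 3. model use: classify unseen instances
    print(accuracy, predictions)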

Page 9: Classification and Prediction


Model Construction

Page 10: Classification and Prediction


Model Evaluation

Page 11: Classification and Prediction


Model Use: Classification

Page 12: Classification and Prediction

Classification Methods

- K-Nearest Neighbor
- Decision Tree Induction
- Bayesian Classification
- Neural Networks
- Support Vector Machines
- Association-Based Classification
- Genetic Algorithms
- Many more …
- Also: ensemble methods

Page 13: Classification and Prediction

Evaluating Models

- To train and evaluate models, data are often divided into three sets: the training set, the test set, and the evaluation set
- Training set
  - is used to build the initial model
  - may need to "enrich the data" to get enough of the special cases
- Test set
  - is used to adjust the initial model
  - models can be tweaked to be less idiosyncratic to the training data and adapted toward a more general model
  - the idea is to prevent "over-training" (i.e., finding patterns where none exist)
- Evaluation set
  - is used to evaluate model performance

Page 14: Classification and Prediction

Test and Evaluation Sets

- Reading too much into the training set (overfitting)
  - common problem with most data mining algorithms
  - resulting model works well on the training set but performs poorly on unseen data
  - test set can be used to "tweak" the initial model, and to remove unnecessary inputs or features
- Evaluation set is used for the final performance evaluation
- Insufficient data to divide into three disjoint sets?
  - In such cases, validation techniques can play a major role
    - Cross validation
    - Bootstrap validation

Page 15: Classification and Prediction

Cross Validation

- Cross validation is a heuristic that works as follows:
  - randomly divide the data into n folds, each with approximately the same number of records
  - create n models using the same algorithm and training parameters; each model is trained on n-1 folds of the data and tested on the remaining fold
  - can be used to find the best algorithm and its optimal training parameters
- Steps in cross validation (see the sketch below):
  1. Divide the available data into a training set and an evaluation set
  2. Split the training data into n folds
  3. Select an algorithm and training parameters
  4. Train and test n models using the n train-test splits
  5. Repeat steps 2 to 4 using different algorithms/parameters and compare model accuracies
  6. Select the best model
  7. Use all the training data to train the final model
  8. Assess the final model using the evaluation set
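A minimal n-fold cross-validation sketch follows; the 5-fold setting, the decision tree learner, and the Iris data are illustrative assumptions, not part of the slides.

    # n-fold cross validation: train on n-1 folds, test on the remaining fold, average the scores.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])  # train on n-1 folds
        scores.append(model.score(X[test_idx], y[test_idx]))              # test on the held-out fold
    print(np.mean(scores))  # average accuracy across the 5 folds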

Page 16: Classification and Prediction

Example – 5 Fold Cross Validation


Page 17: Classification and Prediction

Bootstrap Validation

- Based on the statistical procedure of sampling with replacement
  - a data set of n instances is sampled n times (with replacement) to give another data set of n instances
  - since some elements will be repeated, there will be elements in the original data set that are not picked
  - these remaining instances are used as the test set
- How many instances in the test set?
  - Probability of not getting picked in one sampling = 1 - 1/n
  - Pr(not getting picked in n samples) = (1 - 1/n)^n ≈ e^-1 ≈ 0.368
  - so, for a large data set, the test set will contain about 36.8% of the instances
  - to compensate for the smaller training sample (63.2%), the test set error rate is combined with the re-substitution error on the training set:

        e = (0.632 * e_test instances) + (0.368 * e_training instances)

- Bootstrap validation increases variance that can occur in each fold
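A quick numerical check of the 0.632/0.368 figures; numpy and the sample size are illustrative choices.

    # Sample n indices with replacement and measure the fraction of instances never picked.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    sample = rng.integers(0, n, size=n)        # bootstrap sample: n draws with replacement
    out_of_bag = n - len(np.unique(sample))    # instances never picked form the test set
    print(out_of_bag / n)                      # ~0.368, matching (1 - 1/n)^n ≈ e^-1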

Page 18: Classification and Prediction

Measuring Effectiveness of Classification Models

- When the output field is nominal (e.g., in two-class prediction), we use a confusion matrix to evaluate the resulting model
- Example (rows = actual class, columns = predicted class):

                    Predicted T   Predicted F   Total
      Actual T      18            2             20
      Actual F      3             15            18
      Total         21            17            38

  - Overall correct classification rate = (18 + 15) / 38 = 87%
  - Given T, correct classification rate = 18 / 20 = 90%
  - Given F, correct classification rate = 15 / 18 = 83%

Page 19: Classification and Prediction

Confusion Matrix & Accuracy Metrics

    Actual class \ Predicted class   C1                     ¬C1
    C1                               True Positives (TP)    False Negatives (FN)
    ¬C1                              False Positives (FP)   True Negatives (TN)

- Classifier accuracy, or recognition rate: percentage of test set instances that are correctly classified
  - Accuracy = (TP + TN) / All
  - Error rate = 1 - accuracy, or Error rate = (FP + FN) / All
- Class imbalance problem: one class may be rare, e.g., fraud or HIV-positive
  - Sensitivity: true positive recognition rate = TP/P
  - Specificity: true negative recognition rate = TN/N

Page 20: Classification and Prediction

Other Classifier Evaluation Metrics

- Precision
  - % of instances that the classifier predicted as positive that are actually positive
- Recall
  - % of positive instances that the classifier correctly predicted as positive
  - a.k.a. "completeness"
- The perfect score for both is 1.0, but there is often a trade-off between precision and recall
- F measure (F1 or F-score)
  - harmonic mean of precision and recall
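A small worked computation of these metrics, using the 2x2 example from the earlier confusion-matrix slide and treating T as the positive class (that choice is my assumption).

    # Metrics from the earlier 2x2 example (T taken as the positive class).
    TP, FN = 18, 2    # actual T: 20 instances
    FP, TN = 3, 15    # actual F: 18 instances

    accuracy    = (TP + TN) / (TP + TN + FP + FN)   # (18+15)/38 ≈ 0.87
    sensitivity = TP / (TP + FN)                    # recall / TP rate: 18/20 = 0.90
    specificity = TN / (TN + FP)                    # TN rate: 15/18 ≈ 0.83
    precision   = TP / (TP + FP)                    # 18/21 ≈ 0.86
    f1          = 2 * precision * sensitivity / (precision + sensitivity)
    print(accuracy, sensitivity, specificity, precision, f1)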

Page 21: Classification and Prediction

Decision Trees

- A decision tree is a flow-chart-like tree structure
  - Internal node denotes a test on an attribute (feature)
  - Branch represents an outcome of the test
  - All records in a branch have the same value for the tested attribute
  - Leaf node represents a class label or class label distribution

  Example tree (play golf):

      outlook
        sunny    -> humidity:  high -> N,   normal -> P
        overcast -> P
        rain     -> windy:     true -> N,   false -> P

Page 22: Classification and Prediction

Decision Trees

- Example: "Is it a good day to play golf?"
  - A set of attributes and their possible values:

        outlook      sunny, overcast, rain
        temperature  cool, mild, hot
        humidity     high, normal
        windy        true, false

  A particular instance in the training set might be:

      <overcast, hot, normal, false>: play

  In this case, the target class is a binary attribute, so each instance represents a positive or a negative example.

Page 23: Classification and Prediction

Using Decision Trees for Classification

- Examples can be classified as follows (see the sketch below):
  1. look at the example's value for the feature specified at the current node
  2. move along the edge labeled with this value
  3. if you reach a leaf, return the label of the leaf
  4. otherwise, repeat from step 1
- Example (a decision tree to decide whether to go on a picnic):

      outlook
        sunny    -> humidity:  high -> N,   normal -> P
        overcast -> P
        rain     -> windy:     true -> N,   false -> P

  So a new instance:

      <rainy, hot, normal, true>: ?

  will be classified as "no play"
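A minimal sketch of this classification procedure in Python; representing the picnic tree as nested dicts is my own illustrative choice, not something the slides specify.

    # The picnic tree as nested dicts, and the traversal procedure from the steps above.
    tree = {"outlook": {
        "sunny":    {"humidity": {"high": "N", "normal": "P"}},
        "overcast": "P",
        "rain":     {"windy": {"true": "N", "false": "P"}},
    }}

    def classify(node, instance):
        while isinstance(node, dict):                    # internal node: a test on one attribute
            attribute = next(iter(node))                 # e.g. "outlook"
            node = node[attribute][instance[attribute]]  # follow the edge labeled with this value
        return node                                      # leaf: return its class label

    print(classify(tree, {"outlook": "rain", "temp": "hot", "humidity": "normal", "windy": "true"}))  # -> "N"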

Page 24: Classification and Prediction

Decision Trees and Decision Rules

If attributes are continuous, internal nodes may test against a threshold:

    outlook
      sunny    -> humidity:  > 75% -> N,   <= 75% -> P
      overcast -> P
      rain     -> windy:     > 20  -> N,   <= 20  -> P

Each path in the tree represents a decision rule:

    Rule 1: If (outlook = "sunny") AND (humidity <= 0.75) Then (play = "yes")
    Rule 2: If (outlook = "rainy") AND (wind > 20)        Then (play = "no")
    Rule 3: If (outlook = "overcast")                     Then (play = "yes")
    . . .

Page 25: Classification and Prediction

Top-Down Decision Tree Generation

- The basic approach usually consists of two phases:
  - Tree construction
    - At the start, all the training instances are at the root
    - Partition instances recursively based on selected attributes
  - Tree pruning (to improve accuracy)
    - Remove tree branches that may reflect noise in the training data and lead to errors when classifying test data
- Basic steps in decision tree construction
  - The tree starts as a single node representing all the data
  - If the instances are all of the same class, then the node becomes a leaf labeled with that class label
  - Otherwise, select the feature that best distinguishes among the instances
    - Partition the data based on the values of the selected feature (with each branch representing one partition)
  - Recursion stops when:
    - instances in a node belong to the same class (or too few instances remain)
    - there are no remaining attributes on which to split

Page 26: Classification and Prediction

Tree Construction Algorithm (ID3)

- Decision tree learning method (ID3)
  - Input: a set of training instances S, a set of features F
  1. If every element of S has class value "yes", return "yes"; if every element of S has class value "no", return "no"
  2. Otherwise, choose the best feature f from F (if there are no features remaining, then return failure)
  3. Extend the tree from f by adding a new branch for each attribute value of f
     3.1. Set F' = F - {f}
  4. Distribute the training instances to the leaf nodes (so each leaf node n represents the subset Sn of examples of S with the corresponding attribute value)
  5. Repeat steps 1-5 for each leaf node n, with Sn as the new set of training instances and F' as the new set of features
- Main question:
  - How do we choose the best feature at each step?

Note: the ID3 algorithm only deals with categorical attributes, but can be extended (as in C4.5) to handle continuous attributes.

Page 27: Classification and Prediction

Choosing the "Best" Feature

- Use information gain to find the "best" (most discriminating) feature
- Assume there are two classes, P and N (e.g., P = "yes" and N = "no")
  - Let the set of instances S (training data) contain p elements of class P and n elements of class N
  - The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined in terms of entropy, I(p, n):

        I(p, n) = -Pr(P) log2 Pr(P) - Pr(N) log2 Pr(N)

  - Note that Pr(P) = p / (p+n) and Pr(N) = n / (p+n)
- In other words, the entropy of a set of instances S is a function of the probability distribution of classes among the instances in S.
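A minimal numeric check of the two-class entropy formula; Python and the example counts are illustrative.

    # Two-class entropy I(p, n), with 0·log(0) taken as 0.
    import math

    def entropy(p, n):
        total = p + n
        result = 0.0
        for count in (p, n):
            if count:
                pr = count / total
                result -= pr * math.log2(pr)
        return result

    print(entropy(9, 5))   # ≈ 0.940, the value used in the golf example later in the slides
    print(entropy(4, 0))   # 0.0: a pure set carries no information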

Page 28: Classification and Prediction

Entropy

- Entropy for a two-class variable

Page 29: Classification and Prediction

Entropy in Multi-Class Problems

- More generally, if we have m classes c1, c2, …, cm, with s1, s2, …, sm as the numbers of instances from S in each class, then the entropy is:

      I(s1, s2, …, sm) = - Σ_{i=1..m} pi · log2(pi)

- where pi = si / |S| is the probability that an arbitrary instance belongs to class ci.

Page 30: Classification and Prediction

Information Gain

- Now assume that, using attribute A, a set S of instances will be partitioned into sets S1, S2, …, Sv, each corresponding to a distinct value of attribute A.
  - If Si contains pi cases of P and ni cases of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is:

        E(A) = Σ_{i=1..v} Pr(Si) · I(pi, ni)

    where Pr(Si) = (pi + ni) / (p + n) is the probability that an arbitrary instance in S belongs to the partition Si

- The encoding information that would be gained by branching on A:

      Gain(A) = I(p, n) - E(A)

- At any point we want to branch using the attribute that provides the highest information gain.

Page 31: Classification and Prediction

Attribute Selection - Example

- The "Golf" example: what attribute should we choose as the root?

      Day  outlook   temp  humidity  wind    play
      D1   sunny     hot   high      weak    No
      D2   sunny     hot   high      strong  No
      D3   overcast  hot   high      weak    Yes
      D4   rain      mild  high      weak    Yes
      D5   rain      cool  normal    weak    Yes
      D6   rain      cool  normal    strong  No
      D7   overcast  cool  normal    strong  Yes
      D8   sunny     mild  high      weak    No
      D9   sunny     cool  normal    weak    Yes
      D10  rain      mild  normal    weak    Yes
      D11  sunny     mild  normal    strong  Yes
      D12  overcast  mild  high      strong  Yes
      D13  overcast  hot   normal    weak    Yes
      D14  rain      mild  high      strong  No

- Splitting on outlook: S: [9+,5-]  ->  sunny: [2+,3-],  overcast: [4+,0-],  rainy: [3+,2-]

      I(9,5) = -(9/14)·log(9/14) - (5/14)·log(5/14) = 0.94
      I(4,0) = -(4/4)·log(4/4) - (0/4)·log(0/4) = 0
      I(2,3) = -(2/5)·log(2/5) - (3/5)·log(3/5) = 0.97
      I(3,2) = -(3/5)·log(3/5) - (2/5)·log(2/5) = 0.97

      Gain(outlook) = .94 - (4/14)*0 - (5/14)*.97 - (5/14)*.97 = .24
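A short numeric check of Gain(outlook), with the per-value class counts read off the table above; the code is an illustrative sketch.

    # Information gain for outlook on the 14-instance golf data.
    import math

    def I(p, n):
        return sum(-c / (p + n) * math.log2(c / (p + n)) for c in (p, n) if c)

    # (positive, negative) counts per outlook value
    partitions = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}

    E_outlook = sum((p + n) / 14 * I(p, n) for p, n in partitions.values())
    print(I(9, 5) - E_outlook)   # Gain(outlook) ≈ 0.247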

Page 32: Classification and Prediction

Attribute Selection - Example (Cont.)

- Splitting on humidity: S: [9+,5-] (I = 0.940)  ->  high: [3+,4-] (I = 0.985),  normal: [6+,1-] (I = 0.592)

      Gain(humidity) = .940 - (7/14)*.985 - (7/14)*.592 = .151

- Splitting on wind: S: [9+,5-] (I = 0.940)  ->  weak: [6+,2-] (I = 0.811),  strong: [3+,3-] (I = 1.00)

      Gain(wind) = .940 - (8/14)*.811 - (6/14)*1.0 = .048

So, classifying examples by humidity provides more information gain than by wind. Similarly, we must find the information gain for "temp". In this case, however, you can verify that outlook has the largest information gain, so it will be selected as the root.

(Training data: the golf table on the previous slide.)

Page 33: Classification and Prediction

Attribute Selection - Example (Cont.)

- Partially learned decision tree: after splitting on Outlook at the root, S: [9+,5-] = {D1, D2, …, D14} is partitioned into:

      sunny:    [2+,3-]  {D1, D2, D8, D9, D11}   -> ?   (which attribute should be tested here?)
      overcast: [4+,0-]  {D3, D7, D12, D13}      -> yes
      rainy:    [3+,2-]  {D4, D5, D6, D10, D14}  -> ?

- For the sunny branch, Ssunny = {D1, D2, D8, D9, D11}:

      Gain(Ssunny, humidity) = .970 - (3/5)*0.0 - (2/5)*0.0 = .970
      Gain(Ssunny, temp)     = .970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = .570
      Gain(Ssunny, wind)     = .970 - (2/5)*1.0 - (3/5)*.918 = .019

Page 34: Classification and Prediction

Other Attribute Selection Measures

- Gain ratio:
  - The information gain measure tends to be biased in favor of attributes with a large number of values
  - Gain ratio normalizes the information gain with respect to the total entropy of all splits based on the values of an attribute
  - Used by C4.5 (the successor of ID3)
  - But tends to prefer unbalanced splits (one partition much smaller than the others)
- Gini index:
  - A measure of impurity (based on relative frequencies of classes in a set of instances)
    - The attribute that provides the smallest Gini index (or the largest reduction in impurity due to the split) is chosen to split the node
  - Possible problems:
    - Biased towards multivalued attributes, similar to information gain
    - Has difficulty when the number of classes is large

Page 35: Classification and Prediction

Overfitting and Tree Pruning

- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Some splits or leaf nodes may be the result of decisions based on very few instances, resulting in poor accuracy for unseen instances
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the error rate going above a pre-specified threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree
    - Get a sequence of progressively pruned trees
    - Use test data different from the training data to measure error rates
    - Select the "best pruned tree"

Page 36: Classification and Prediction

Enhancements to the Basic Decision Tree Learning Approach

- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication

Page 37: Classification and Prediction

Bayesian Methods

- Bayes's theorem plays a critical role in probabilistic learning and classification
  - Uses the prior probability of each class given no information about an item
  - Classification produces a posterior probability distribution over the possible classes given a description of an item
  - The models are incremental in the sense that each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Given a data instance X with an unknown class label, let H be the hypothesis that X belongs to a specific class C
  - The conditional probability of hypothesis H given observation X, Pr(H|X), follows Bayes's theorem:

        Pr(H|X) = Pr(X|H) · Pr(H) / Pr(X)

- Practical difficulty: requires initial knowledge of many probabilities

Page 38: Classification and Prediction

Axioms of Probability Theory

- All probabilities are between 0 and 1:

      0 ≤ P(A) ≤ 1

- A true proposition has probability 1, a false one has probability 0: P(true) = 1, P(false) = 0
- The probability of a disjunction is:

      P(A ∨ B) = P(A) + P(B) - P(A ∧ B)

Page 39: Classification and Prediction

Conditional Probability

- P(A | B) is the probability of A given B
- Assumes that B is all and only the information known
- Defined by:

      P(A | B) = P(A ∧ B) / P(B)

Page 40: Classification and Prediction

Independence

- A and B are independent iff:

      P(A | B) = P(A)
      P(B | A) = P(B)      (these two constraints are logically equivalent)

- Therefore, if A and B are independent:

      P(A | B) = P(A ∧ B) / P(B) = P(A)
      P(A ∧ B) = P(A) · P(B)

- Bayes's Rule:

      P(H | E) = P(E | H) · P(H) / P(E)

Page 41: Classification and Prediction

Bayesian Classification

- Let the set of classes be {c1, c2, …, cn}
- Let E be the description of an instance (e.g., a vector representation)
- Determine the class of E by computing, for each class ci:

      P(ci | E) = P(ci) · P(E | ci) / P(E)

- P(E) can be determined since the classes are complete and disjoint:

      Σ_{i=1..n} P(ci | E) = Σ_{i=1..n} P(ci) · P(E | ci) / P(E) = 1

      P(E) = Σ_{i=1..n} P(ci) · P(E | ci)

Page 42: Classification and Prediction

Bayesian Categorization (cont.)

- Need to know:
  - Priors P(ci) and conditionals P(E | ci)
- P(ci) is easily estimated from the data:
  - If ni of the examples in D are in ci, then P(ci) = ni / |D|
- Assume an instance is a conjunction of binary features/attributes:

      E = e1 ∧ e2 ∧ … ∧ em

      e.g.,  E = (Outlook=rain) ∧ (Temp=cool) ∧ (Humidity=normal) ∧ (Windy=true)

Page 43: Classification and Prediction

Naïve Bayesian Classification

- Problem: too many possible instances (exponential in m) to estimate all P(E | ci)
- If we assume the features/attributes of an instance are independent given the class ci (conditionally independent):

      P(E | ci) = P(e1 ∧ e2 ∧ … ∧ em | ci) = Π_{j=1..m} P(ej | ci)

- Therefore, we then only need to know P(ej | ci) for each feature and category

Page 44: Classification and Prediction

Estimating Probabilities

- Normally, probabilities are estimated based on observed frequencies in the training data.
- If D contains ni examples in class ci, and nij of these ni examples contain feature/attribute ej, then:

      P(ej | ci) = nij / ni

- If the feature is continuous-valued, P(ej | ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

      g(x, μ, σ) = (1 / (√(2π) σ)) · e^( -(x - μ)² / (2σ²) )

  and P(ej | ci) is

      P(ej | ci) = g(ej, μ_ci, σ_ci)

Page 45: Classification and Prediction

Smoothing

- Estimating probabilities from small training sets is error-prone:
  - If, due only to chance, a rare feature ek is always false in the training data, then ∀ci: P(ek | ci) = 0.
  - If ek then occurs in a test example E, the result is that ∀ci: P(E | ci) = 0 and ∀ci: P(ci | E) = 0
- To account for estimation from small samples, probability estimates are adjusted or smoothed
- Laplace smoothing using an m-estimate assumes that each feature is given a prior probability p that is assumed to have been previously observed in a "virtual" sample of size m:

      P(ej | ci) = (nij + m·p) / (ni + m)

- For binary features, p is simply assumed to be 0.5.

Page 46: Classification and Prediction

Naïve Bayesian Classifier - Example

- Here, we have two classes, C1 = "yes" (positive) and C2 = "no" (negative)
- Pr("yes") = instances with "yes" / all instances = 9/14
- If a new instance X has outlook="sunny", then Pr(outlook="sunny" | "yes") = 2/9 (since there are 9 instances with "yes" (or P), of which 2 have outlook="sunny")
- Similarly, for humidity="high", Pr(humidity="high" | "no") = 4/5
- And so on.

Page 47: Classification and Prediction

Naïve Bayes (Example Continued)

- Now, given the training set, we can compute all the probabilities
- Suppose we have a new instance X = <sunny, mild, high, true>. How should it be classified?

      Pr(X | "yes") = 2/9 · 4/9 · 3/9 · 3/9

- Similarly:

      Pr(X | "no") = 3/5 · 2/5 · 4/5 · 3/5

Page 48: Classification and Prediction

Naïve Bayes (Example Continued)

- To find out to which class X belongs, we need to maximize Pr(X | Ci)·Pr(Ci) for each class Ci (here "yes" and "no")

      X = <sunny, mild, high, true>

      Pr(X | "no") · Pr("no")   = (3/5 · 2/5 · 4/5 · 3/5) · 5/14 = 0.04
      Pr(X | "yes") · Pr("yes") = (2/9 · 4/9 · 3/9 · 3/9) · 9/14 = 0.007

- To convert these to probabilities, we can normalize by dividing each by the sum of the two:
  - Pr("no" | X) = 0.04 / (0.04 + 0.007) = 0.85
  - Pr("yes" | X) = 0.007 / (0.04 + 0.007) = 0.15
- Therefore the new instance X will be classified as "no".
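A quick check of the golf Naïve Bayes computation above; the fractions are read directly off the training table, and the short script is only an illustrative sketch.

    # Golf Naive Bayes check for X = <sunny, mild, high, true>.
    p_yes, p_no = 9/14, 5/14
    likelihood_yes = (2/9) * (4/9) * (3/9) * (3/9)   # attribute values given "yes"
    likelihood_no  = (3/5) * (2/5) * (4/5) * (3/5)   # attribute values given "no"

    score_yes = likelihood_yes * p_yes               # ≈ 0.007
    score_no  = likelihood_no * p_no                 # ≈ 0.04
    print(score_no / (score_no + score_yes))         # Pr("no" | X) ≈ 0.85 -> classify as "no"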

Page 49: Classification and Prediction

Text Naïve Bayes – Spam Example

Training data:

    Doc   t1  t2  t3  t4  t5  Spam
    D1    1   1   0   1   0   no
    D2    0   1   1   0   0   no
    D3    1   0   1   0   1   yes
    D4    1   1   1   1   0   yes
    D5    0   1   0   1   0   yes
    D6    0   0   0   1   1   no
    D7    0   1   0   0   0   yes
    D8    1   1   0   1   0   yes
    D9    0   0   1   1   1   no
    D10   1   0   1   0   1   yes

Estimated probabilities:

    Term  P(t|no)  P(t|yes)
    t1    1/4      4/6
    t2    2/4      4/6
    t3    2/4      3/6
    t4    3/4      3/6
    t5    2/4      2/6

    P(no) = 0.4    P(yes) = 0.6

New email x containing t1, t4, t5:  x = <1, 0, 0, 1, 1>

Should it be classified as spam = "yes" or spam = "no"? We need to find P(yes | x) and P(no | x) …

Page 50: Classification and Prediction

Text Naïve Bayes - Example

Using the probability table and priors from the previous slide, for the new email x containing t1, t4, t5, i.e., x = <1, 0, 0, 1, 1>:

    P(yes | x) = [4/6 * (1 - 4/6) * (1 - 3/6) * 3/6 * 2/6] * P(yes) / P(x)
               = [0.67 * 0.33 * 0.5 * 0.5 * 0.33] * 0.6 / P(x) = 0.011 / P(x)

    P(no | x)  = [1/4 * (1 - 2/4) * (1 - 2/4) * 3/4 * 2/4] * P(no) / P(x)
               = [0.25 * 0.5 * 0.5 * 0.75 * 0.5] * 0.4 / P(x) = 0.009 / P(x)

To get actual probabilities we need to normalize; note that P(yes | x) + P(no | x) must be 1:

    0.011 / P(x) + 0.009 / P(x) = 1   =>   P(x) = 0.011 + 0.009 = 0.020

So:

    P(yes | x) = 0.011 / 0.020 = 0.54
    P(no | x)  = 0.009 / 0.020 = 0.46

and x is classified as spam = "yes".
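To double-check the arithmetic above, here is a minimal sketch; the Bernoulli-style scoring (absent terms contribute 1 - P(t|c)) mirrors the hand computation, and the helper function is my own illustrative construction.

    # Naive Bayes score for x = <1, 0, 0, 1, 1> using the spam example's probability table.
    p_t_no  = [1/4, 2/4, 2/4, 3/4, 2/4]
    p_t_yes = [4/6, 4/6, 3/6, 3/6, 2/6]
    x = [1, 0, 0, 1, 1]

    def score(p_terms, prior):
        value = prior
        for p, present in zip(p_terms, x):
            value *= p if present else (1 - p)   # absent terms contribute (1 - P(t|c))
        return value

    s_yes, s_no = score(p_t_yes, 0.6), score(p_t_no, 0.4)
    print(s_yes / (s_yes + s_no))   # P(yes | x) ≈ 0.54, so the email is classified as spam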

Page 51: Classification and Prediction

What Is Prediction?

- (Numerical) prediction is similar to classification
  - construct a model
  - use the model to predict a continuous or ordered value for a given input
- Prediction is different from classification
  - Classification refers to predicting a categorical class label
  - Prediction models continuous-valued functions
- Major method for prediction: regression
  - model the relationship between one or more independent (predictor) variables and a dependent (response) variable
- Regression analysis
  - Linear and multiple regression
  - Non-linear regression
  - Other regression methods: generalized linear models, Poisson regression, log-linear models, regression trees

Page 52: Classification and Prediction

Linear Regression

- Linear regression: involves a response variable y and a single predictor variable x

      y = w0 + w1 x

  - w0 (y-intercept) and w1 (slope) are the regression coefficients
- Method of least squares: estimates the best-fitting straight line

      w1 = Σ_{i=1..|D|} (xi - x̄)(yi - ȳ) / Σ_{i=1..|D|} (xi - x̄)²        w0 = ȳ - w1 x̄

- Multiple linear regression: involves more than one predictor variable
  - Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
  - Ex.: for 2-D data, we may have y = w0 + w1 x1 + w2 x2
  - Solvable by an extension of the least squares method
  - Many nonlinear functions can be transformed into the above
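A minimal least-squares fit using the two formulas above; the toy data and numpy are illustrative choices.

    # Fit y = w0 + w1*x by the closed-form least-squares formulas (toy data).
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.9, 4.1, 6.0, 8.2, 9.8])          # roughly y = 2x

    w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    w0 = y.mean() - w1 * x.mean()
    print(w0, w1)                                    # intercept ≈ 0, slope ≈ 2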

Page 53: Classification and Prediction

Nonlinear Regression

- Some nonlinear models can be modeled by a polynomial function
- A polynomial regression model can be transformed into a linear regression model. For example,

      y = w0 + w1 x + w2 x² + w3 x³

  is convertible to linear form with the new variables x2 = x², x3 = x³:

      y = w0 + w1 x + w2 x2 + w3 x3

- Other functions, such as the power function, can also be transformed into a linear model
- Some models are intractably nonlinear (e.g., a sum of exponential terms)
  - it is possible to obtain least squares estimates through extensive computation on more complex functions

Page 54: Classification and Prediction

More on Linear Models

Population linear regression:

    y = β0 + β1 x + ε

(Figure: for a given xi, the observed value of y differs from the predicted value on the regression line by the random error εi; β0 is the intercept and β1 is the slope.)

Page 55: Classification and Prediction

Estimated Regression Model

The sample regression line provides an estimate of the population regression line:

    ŷi = w0 + w1 x

where ŷi is the estimated (or predicted) y value, w0 is the estimate of the regression intercept, w1 is the estimate of the regression slope, and x is the independent variable. The individual random error terms ei have a mean of zero.

Page 56: Classification and Prediction

Least Squares Criterion

- w0 and w1 are obtained by minimizing the sum of the squared residuals:

      Σ e² = Σ (y - ŷ)² = Σ (y - (w0 + w1 x))²

- The formulas for w1 and w0 are:

      w1 = Σ (x - x̄)(y - ȳ) / Σ (x - x̄)²        w0 = ȳ - w1 x̄

Page 57: Classification and Prediction

General Form of Linear Functions

Slide thanks to Greg Shakhnarovich (Brown Univ., 2006)

Page 58: Classification and Prediction

- Simple least squares:
  - Determine the linear coefficients α, β that minimize the sum of squared errors (SSE)
  - Use standard (multivariate) differential calculus:
    - differentiate SSE with respect to α, β
    - find the zeros of each partial derivative
    - solve for α, β
- One dimension:

      SSE = Σ_{j=1..N} (yj - (α + β xj))²,    N = number of samples

      β = cov[x, y] / var[x],    α = ȳ - β x̄    (x̄, ȳ = means of the training data)

      ŷt = α + β xt    for a test sample xt

Least Squares Generalization

Page 59: Classification and Prediction

- Multiple dimensions
  - To simplify notation and derivation, change α to β0 and add a new feature x0 = 1 to the feature vector x:

        ŷ = β0 · 1 + Σ_{i=1..d} βi xi = βᵀ x

  - Calculate SSE and determine β:

        SSE = Σ_{j=1..N} (yj - βᵀ xj)² = (y - Xβ)ᵀ (y - Xβ)

        where y = vector of all training responses, X = matrix of all training samples

        β = (XᵀX)⁻¹ Xᵀ y

        ŷt = βᵀ xt    for a test sample xt

Least Squares Generalization
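A short numerical sketch of the normal-equations solution β = (XᵀX)⁻¹ Xᵀ y; the synthetic data and numpy are illustrative assumptions.

    # Solve the normal equations on toy data, with the constant feature x0 = 1 prepended.
    import numpy as np

    rng = np.random.default_rng(0)
    X_raw = rng.normal(size=(50, 2))                       # 50 samples, 2 features
    y = 3.0 + 2.0 * X_raw[:, 0] - 1.0 * X_raw[:, 1] + 0.1 * rng.normal(size=50)

    X = np.column_stack([np.ones(len(X_raw)), X_raw])      # add x0 = 1
    beta = np.linalg.solve(X.T @ X, X.T @ y)               # solves (XᵀX) β = Xᵀ y
    print(beta)                                            # ≈ [3.0, 2.0, -1.0]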

Page 60: Classification and Prediction

Extending the Application of Linear Regression

- The inputs X for linear regression can be:
  - Original quantitative inputs
  - Transformations of quantitative inputs, e.g., log, exp, square root, square, etc.
  - Polynomial transformations
    - example: y = β0 + β1 x + β2 x² + β3 x³
  - Dummy coding of categorical inputs
  - Interactions between variables
    - example: x3 = x1 · x2
- This allows the use of linear regression techniques to fit much more complicated non-linear datasets.
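A small sketch of the polynomial-transformation idea: a cubic is fitted with an ordinary linear least-squares solver once x is expanded into the features 1, x, x², x³. The data and library are illustrative.

    # Fit a cubic with a linear model by expanding x into polynomial features.
    import numpy as np

    x = np.linspace(-2, 2, 40)
    y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.05 * np.random.default_rng(0).normal(size=x.size)

    X = np.column_stack([np.ones_like(x), x, x**2, x**3])   # features: 1, x, x², x³
    beta = np.linalg.lstsq(X, y, rcond=None)[0]             # ordinary least squares
    print(beta)                                             # ≈ [1.0, -2.0, 0.0, 0.5]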

Page 61: Classification and Prediction

Example of fitting a polynomial curve with a linear model

Page 62: Classification and Prediction

Regularization

- Complex models (lots of parameters) are often prone to overfitting
- Overfitting can be reduced by imposing a constraint on the overall magnitude of the parameters (i.e., by including the coefficients as part of the optimization process)
- Two common types of regularization in linear regression:
  - L2 regularization (a.k.a. ridge regression). Find the β which minimizes:

        Σ_{j=1..N} (yj - β0 - Σ_{i=1..d} βi xij)² + λ Σ_{i=1..d} βi²

    - λ is the regularization parameter: a bigger λ imposes more constraint
  - L1 regularization (a.k.a. lasso). Find the β which minimizes:

        Σ_{j=1..N} (yj - β0 - Σ_{i=1..d} βi xij)² + λ Σ_{i=1..d} |βi|
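A closed-form ridge sketch, β = (XᵀX + λI)⁻¹ Xᵀ y; leaving the intercept unpenalized by centering the data first is a common convention and my own assumption, as are the toy data.

    # Closed-form ridge regression on centered data (intercept recovered afterwards).
    import numpy as np

    def ridge(X, y, lam):
        X_c, y_c = X - X.mean(axis=0), y - y.mean()           # center so no intercept column is needed
        d = X.shape[1]
        beta = np.linalg.solve(X_c.T @ X_c + lam * np.eye(d), X_c.T @ y_c)
        intercept = y.mean() - X.mean(axis=0) @ beta
        return intercept, beta

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 3))
    y = 1.0 + X @ np.array([2.0, 0.0, -1.0]) + 0.1 * rng.normal(size=30)
    print(ridge(X, y, lam=1.0))    # a larger lam shrinks the coefficients toward zero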

Page 63: Classification and Prediction

Example: 1D Poly Fit

Page 64: Classification and Prediction


Page 65: Classification and Prediction

Example: 1D Poly Fit

Page 66: Classification and Prediction

Example: 1D Poly Fit

Page 67: Classification and Prediction

Example: 1D Poly Fit

Page 68: Classification and Prediction

Other Regression-Based Models

- Generalized linear models
  - The foundation on which linear regression can be applied to modeling categorical response variables
  - The variance of y is a function of the mean value of y, not a constant
  - Logistic regression: models the probability of some event occurring as a linear function of a set of predictor variables
  - Poisson regression: models data that exhibit a Poisson distribution
- Log-linear models (for categorical data)
  - Approximate discrete multidimensional probability distributions
  - Also useful for data compression and smoothing
- Regression trees and model trees
  - Trees to predict continuous values rather than class labels

Page 69: Classification and Prediction

Regression Trees and Model Trees

- Regression tree: proposed in the CART system (Breiman et al., 1984)
  - CART: Classification And Regression Trees
  - Each leaf stores a continuous-valued prediction
  - It is the average value of the predicted attribute for the training instances that reach the leaf
- Model tree: proposed by Quinlan (1992)
  - Each leaf holds a regression model, a multivariate linear equation for the predicted attribute
  - A more general case than the regression tree
- Regression and model trees tend to be more accurate than linear regression when instances are not represented well by simple linear models

Page 70: Classification and Prediction

Evaluating Numeric Prediction

- Prediction accuracy
  - Difference between the predicted scores and the actual results (from the evaluation set)
  - Typically the accuracy of the model is measured in terms of the average of the squared differences
- Common metrics (pi = predicted target value for test instance i, ai = actual target value for instance i)
  - Mean Absolute Error: average loss over the test set

        MAE = ( |p1 - a1| + … + |pn - an| ) / n

  - Root Mean Squared Error: the square root of the mean squared difference between predicted and actual values

        RMSE = sqrt( ( (p1 - a1)² + … + (pn - an)² ) / n )
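A minimal computation of the two metrics on a handful of predicted vs. actual values; the numbers are toy data chosen only for illustration.

    # MAE and RMSE for a small set of predicted vs. actual values.
    import math

    predicted = [2.5, 0.0, 2.1, 7.8]
    actual    = [3.0, -0.5, 2.0, 7.0]

    n = len(actual)
    mae  = sum(abs(p - a) for p, a in zip(predicted, actual)) / n
    rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)
    print(mae, rmse)   # RMSE penalizes large individual errors more heavily than MAE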