
Data Mining - Massey University 1

Chapter 3: Data Mining Concepts: Data Preparation, Model Evaluation

Credits: Padhraic Smyth notes; Cook and Swayne book

Data Mining - Massey University 2

Data Mining Tasks

• EDA/Exploration
  – visualization
• Predictive Modelling
  – the goal is to predict an answer
  – see how independent variables affect an outcome
  – predict that outcome for new cases
  – Inference
    • will this drug fight that disease?
• Descriptive Modelling
  – there is no specific outcome of interest
  – describe the data in qualitative ways
  – simple crosstabs of categorical data
  – clustering, density estimation, segmentation
  – includes pattern identification
    • frequent itemsets - beer and diapers?
    • anomaly detection - network intrusion
    • sports rules - when X is in the game, Y scores 30% more.

Data Mining - Massey University 3

Data Mining Tasks cont.

• Retrieval by content
  – the user supplies a pattern to a large dataset, which retrieves the most relevant answers
  – web search
  – image retrieval

Data Mining - Massey University 4

Data Preparation

• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  – Quality decisions must be based on quality data
  – A data warehouse needs consistent integration of quality data
  – Assessment of quality reflects on confidence in the results

Data Mining - Massey University 5

Preparing Data for Analysis

• Think about your data
  – how is it measured, what does it mean?
  – nominal or categorical
    • jersey numbers, ids, colors, simple labels
    • sometimes recoded into integers - careful!
  – ordinal
    • rank has meaning - the numeric value not necessarily
    • educational attainment, military rank
  – interval
    • distances between numeric values have meaning
    • temperature, time
  – ratio
    • the zero value has meaning - means that fractions and ratios are sensible
    • money, age, height
• It might seem obvious what a given data value is, but not always
  – pain index, movie ratings, etc.

Data Mining - Massey University 6

Investigate your data carefully!

• Example: lapsed donors to a charity (KDD Cup 1998)
  – Made their last donation to PVA 13 to 24 months prior to June 1997
  – 200,000 records (training and test sets)
  – Who should get the current mailing?
  – What is the cost-effective strategy?
  – "tcode" was an important variable…

Data Mining - Massey University 7-10

[Slides 7-10: figures; images not preserved in the transcript.]

Data Mining - Massey University 11

Mixed data

• Many real-world data sets have multiple types of variables,
  – e.g., medical diagnosis data on patients and controls
  – Categorical (Nominal): employment type, ethnic group
  – Ordinal: education level
  – Interval: body temperature, pixel values in a medical image
  – Ratio: income, age, response to drug
• Unfortunately, many data analysis algorithms are suited to only one type of data (e.g., interval)
  – Linear regression, neural networks, support vector machines, etc.
  – These models implicitly assume interval-scale data
• Exception: decision trees
  – Trees operate by subgrouping variable values at internal nodes
  – Can operate effectively on binary, nominal, ordinal, and interval data

Data Mining - Massey University 12

Tasks in Data Preprocessing

• Data cleaning
  – Check for data quality
  – Missing data
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtains a representation reduced in volume but producing the same or similar analytical results
• Data discretization
  – A combination of reduction and transformation, particularly important for numerical data

Data Mining - Massey University 13

Data Cleaning / Quality

• Individual measurements
  – Random noise in individual measurements
    • Outliers
    • Random data entry errors
    • Noise in label assignment (e.g., class labels in medical data sets)
    • can be corrected or smoothed out
  – Systematic errors
    • E.g., all ages > 99 recorded as 99
    • More individuals aged 20, 30, 40, etc. than expected
  – Missing information
    • Missing at random
      – Questions on a questionnaire that people randomly forget to fill in
    • Missing systematically
      – Questions that people don't want to answer
      – Patients who are too ill for a certain test

Data Mining - Massey University 14

Missing Data

• Data is not always available
  – E.g., many records have no recorded value for several attributes
    • survey respondents
    • disparate sources of data
• Missing data may be due to
  – equipment malfunction
  – values inconsistent with other recorded data and thus deleted
  – data not entered due to misunderstanding
  – certain data not considered important at the time of entry
  – history or changes of the data not being recorded

Data Mining - Massey University 15

How to Handle Missing Data?

• Ignore the tuple: not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
• Use the attribute mean to fill in the missing value
• Use imputation (see the sketch below)
  – nearest neighbour
  – model based (regression or Bayesian MC based)
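As a concrete illustration of the mean-fill and model-based options, here is a minimal R sketch (the toy data frame and its column names are made up): it fills missing values of one numeric attribute with the attribute mean, and alternatively with predictions from a regression on another attribute.

```r
# Hypothetical toy data: one numeric attribute ("age") with missing values
df <- data.frame(age    = c(23, 45, NA, 31, 52, NA, 38),
                 income = c(18, 52, 40, 35, 61, 47, 43))

# (1) Fill with the attribute mean
df$age_mean_imp <- ifelse(is.na(df$age), mean(df$age, na.rm = TRUE), df$age)

# (2) Model-based imputation: regress age on income over the complete cases,
#     then predict age where it is missing
fit <- lm(age ~ income, data = df[complete.cases(df[, c("age", "income")]), ])
df$age_reg_imp <- df$age
df$age_reg_imp[is.na(df$age)] <- predict(fit, newdata = df[is.na(df$age), ])
df
```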

Data Mining - Massey University 16

Missing Data

• What do I choose for a given situation?
• What you do depends on
  – the data - how much is missing? are they 'important' values?
  – the model - can it handle missing values?
  – there is no right answer!

Data Mining - Massey University 17

Noisy Data

• Noise: random error or variance in a measured variable
• Incorrect attribute values (outliers) may be due to
  – faulty data collection instruments
  – data entry problems
  – data transmission problems
  – technology limitations
  – inconsistency in naming conventions
• Other data problems which require data cleaning
  – duplicate records
  – incomplete data
  – inconsistent data

Data Mining - Massey University 18

How to Handle Noisy Data?

• Is the suspected outlier from
  – human error? or
  – real data?
• Err on the side of caution: if unsure, use methods that are robust to outliers

Data Mining - Massey University 19

Data Transformation

• Can help reduce the influence of extreme values
• Putting data on the real line is often convenient
• Variance reduction:
  – a log-transform is often used for incomes and other highly skewed variables
• Normalization: values scaled to fall within a small, specified range (see the sketch below)
  – min-max normalization
  – z-score normalization
  – normalization by decimal scaling
• Attribute/feature construction
  – New attributes constructed from the given ones

Data Mining - Massey University 20

Data Transformation: Standardization

We can standardize data by dividing by the sample standard deviation. This makes the variables all equally important.

The estimate for the standard deviation of x_k:

$$\hat{\sigma}_k = \left( \frac{1}{n} \sum_{i=1}^{n} \big( x_k(i) - \bar{x}_k \big)^2 \right)^{1/2}$$

where $\bar{x}_k$ is the sample mean:

$$\bar{x}_k = \frac{1}{n} \sum_{i=1}^{n} x_k(i)$$

(When might standardization *not* be such a good idea?)
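A minimal R sketch of standardization with base R's scale() (the small height/weight matrix is made up); note that scale() uses the n − 1 denominator for the standard deviation rather than the 1/n estimator above, which matters little for large n.

```r
# Made-up numeric matrix: two variables on very different scales
X <- matrix(c(170, 65,
              180, 80,
              160, 55,
              175, 90), ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("height_cm", "weight_kg")))

X_std <- scale(X, center = TRUE, scale = TRUE)  # subtract column means, divide by column sds
apply(X_std, 2, sd)                             # each column now has sd = 1
```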

Data Mining - Massey University 21

Dealing with massive data

• What if the data simply does not fit on my computer (or R crashes)?
  – Force the data into main memory
    • be careful, you need some overhead for modelling!
  – Use scalable algorithms
    • keep up on the literature, this keeps changing!
  – Use a database
    • MySQL is a good (and free) one
  – Investigate data reduction strategies

Data Mining - Massey University 22

Data Reduction: Strategies

• A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
• Data reduction
  – Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
  – Goal: reduce the dimensionality of the data from p to p'

Data Mining - Massey University 23

Data Reduction: Dimension Reduction

• In general, incurs a loss of information about x
• If the dimensionality p is very large (e.g., 1000's), representing the data in a lower-dimensional space may make learning more reliable,
  – e.g., clustering example
    • 100-dimensional data
    • if cluster structure is only present in 2 of the dimensions, the others are just noise
    • if the other 98 dimensions are just noise (relative to the cluster structure), then clusters will be much easier to discover if we just focus on the 2d space
• Dimension reduction can also provide interpretation/insight
  – e.g. for 2d visualization purposes

Data Mining - Massey University 24

Data Reduction: Dimension Reduction

• Feature selection (i.e., attribute subset selection):
  – Select a minimum set of features such that minimal signal is lost
• Heuristic methods (exhaustive search is infeasible except for small p):
  – step-wise forward selection
  – step-wise backward elimination
  – combining forward selection and backward elimination
  – decision-tree induction
• These can work well, but often get trapped in local minima
• They are often computationally expensive

Data Mining - Massey University 25

Data Reduction: Principal Components

• One of several projection methods
• Given N data vectors in k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
  – The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
• Each data vector is a linear combination of the c principal component vectors
• Works for numeric data only
• Used when the number of dimensions is large
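A minimal R sketch of PCA with base R's prcomp(), using the built-in iris measurements purely as a convenient numeric example and keeping the first two principal components.

```r
# PCA on the four numeric iris measurements (a convenient built-in example)
X   <- iris[, 1:4]
pca <- prcomp(X, center = TRUE, scale. = TRUE)  # centre and standardize first

summary(pca)                    # proportion of variance explained by each component
scores <- pca$x[, 1:2]          # N x 2 reduced representation (the first 2 PCs)
plot(scores, col = as.numeric(iris$Species),
     main = "First two principal components")
```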

Data Mining - Massey University 26

PCA Example

[Figure: scatterplot of x1 vs x2 showing the direction of the 1st principal component vector (the highest-variance projection).]

Data Mining - Massey University 27

PCA Example

[Figure: the same scatterplot with the direction of the 2nd principal component vector added.]

Data Mining - Massey University 28

Data Reduction: Multidimensional Scaling

• Say we have data in the form of an N x N matrix of dissimilarities
  – 0's on the diagonal
  – Symmetric
  – Could either be given data in this form, or create such a dissimilarity matrix from our data vectors
• Examples
  – Perceptual dissimilarity of N objects in cognitive science experiments
  – String-edit distance between N protein sequences
• MDS:
  – Find k-dimensional coordinates for each of the N objects such that Euclidean distances in the "embedded" space match the set of dissimilarities as closely as possible

Data Mining - Massey University 29

Multidimensional Scaling (MDS)

• MDS score function ("stress"):

$$S = \frac{\sum_{i,j} \big( d(i,j) - \delta(i,j) \big)^2}{\sum_{i,j} d(i,j)^2}$$

where $\delta(i,j)$ are the original dissimilarities and $d(i,j)$ are the Euclidean distances in the "embedded" k-dimensional space.

• Often used for visualization, e.g., k = 2 or 3
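A minimal R sketch using base R's cmdscale(), which performs classical (metric) MDS, on the built-in eurodist road distances; non-metric variants that minimize the stress criterion above are available elsewhere (e.g., MASS::isoMDS).

```r
# Classical (metric) MDS on the built-in eurodist dissimilarity matrix
d      <- eurodist                  # pairwise road distances between European cities
coords <- cmdscale(d, k = 2)        # N x 2 embedded coordinates

plot(coords[, 1], -coords[, 2], type = "n", xlab = "", ylab = "",
     main = "2d MDS embedding")
text(coords[, 1], -coords[, 2], labels = labels(eurodist), cex = 0.7)
```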

Data Mining - Massey University 30

MDS: Example data

Data Mining - Massey University 31

MDS: 2d embedding of face images

Data Mining - Massey University 32

Data Reduction: Sampling

• Don't forget about sampling!
• Choose a representative subset of the data
  – Simple random sampling may be OK, but beware of skewed variables
• Stratified sampling methods (see the sketch below)
  – Approximate the percentage of each class (or subpopulation of interest) in the overall database
  – Used in conjunction with skewed data
  – Propensity scores may be useful if the response is unbalanced
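A minimal R sketch contrasting a simple random sample with a stratified sample on a made-up, heavily skewed binary response (roughly 1% positives, as in the mailing example later in this chapter).

```r
set.seed(1)
# Made-up data with a heavily skewed binary response (~1% positives)
df <- data.frame(response = rbinom(10000, 1, 0.01), x = rnorm(10000))

# Simple random sample of 1000 rows: the positive rate can drift from 1%
srs <- df[sample(nrow(df), 1000), ]

# Stratified sample: take 10% of each response class separately
strat_idx <- unlist(lapply(split(seq_len(nrow(df)), df$response),
                           function(idx) sample(idx, round(0.10 * length(idx)))))
strat <- df[strat_idx, ]

c(overall = mean(df$response), srs = mean(srs$response), stratified = mean(strat$response))
```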

Data Mining - Massey University 33

Data Discretization

• Discretization:
  – divide the range of a continuous attribute into intervals
  – Some classification algorithms only accept categorical attributes
  – Reduce data size by discretization
  – Prepare for further analysis
• Reduces the information in the data, but sometimes surprisingly good!
  – Bagnall and Janacek - KDD 2004
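A minimal R sketch of discretization with base R's cut() on a made-up age vector: equal-width bins and hand-chosen break points.

```r
# Made-up continuous attribute
age <- c(17, 22, 35, 41, 58, 63, 79)

age_bins   <- cut(age, breaks = 4)                        # 4 equal-width intervals
age_groups <- cut(age, breaks = c(0, 25, 45, 65, Inf),    # hand-chosen break points
                  labels = c("young", "adult", "middle", "senior"))

table(age_groups)
```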

Data Mining - Massey University 34

Model Evaluation

Data Mining - Massey University 35

Model Evaluation

• e.g. pick the "best" a and b in Y = aX + b
• How do you define "best"?
• A big difference between data mining and statistics is the focus on predictive performance over finding the best estimators.
• The most obvious criterion for predictive performance:

Data Mining - Massey University 36

Predictive Model Scores

$$S_{SSE}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \big( \hat{f}(x(i); \theta) - y(i) \big)^2$$

• More generally:
  – Assumes all observations are equally important
  – Assumes all errors are treated equally
    • what about if recent cases are more important, or high-revenue customers, etc.?
  – Depends on differences rather than values
    • scale matters with squared error
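A minimal R sketch of this score on made-up data: fit Y = aX + b by least squares and compute the average squared error of the fitted values.

```r
set.seed(2)
# Made-up data for Y = aX + b plus noise
x <- runif(50, 0, 10)
y <- 2 * x + 1 + rnorm(50)

fit   <- lm(y ~ x)            # least squares picks a and b
y_hat <- predict(fit)

S_sse <- mean((y_hat - y)^2)  # (1/N) * sum of squared errors
S_sse
```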

Data Mining - Massey University 37

Descriptive Model Scores

• If your model assigns probabilities to classes, use a likelihood-based score:

$$L(\theta) = \prod_{i=1}^{N} \hat{p}(x(i); \theta)$$

• "Pick the model that assigns the highest probability to what actually happened"
• There are many different scoring rules for non-probabilistic models

• Many different scoring rules for non-probabilistic models

Data Mining - Massey University 38

Scoring Models with Different Complexities

score(model) = Goodness-of-fit - penalty for complexity

Classic bias/variance tradeoff

this is called “regularization” and is used to combat “overfitting”

complex models can fit data perfectly!!!!

Data Mining - Massey University 39

Using Hold-Out Data

• Instead of penalizing complexity, look at performance on hold-out data
• Using the same set of examples for training as well as for evaluation results in an overoptimistic assessment of model performance.
• We need to test performance on data not "seen" by the modelling algorithm, i.e., data that was not used for model building.

Data Mining - Massey University 40

Data Partition

• Randomly partition the data into a training set and a test set
• Training set - the data used to train/build the model
  – Estimate parameters (e.g., for a linear regression), build a decision tree, build an artificial neural network, etc.
• Test set - a set of examples not used for model induction. The model's performance is evaluated on unseen data. Also referred to as out-of-sample data.
• Generalization error: the model's error on the test data. (See the sketch below.)

[Diagram: the data is split into a set of training examples and a set of test examples.]
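A minimal R sketch of a random two-thirds/one-third partition on made-up data, with the generalization error estimated on the held-out test set.

```r
set.seed(4)
# Made-up data
n  <- 300
df <- data.frame(x = runif(n, 0, 10))
df$y <- 3 * df$x + 2 + rnorm(n, sd = 2)

train_idx <- sample(n, size = round(2 / 3 * n))   # two thirds for training
train <- df[train_idx, ]
test  <- df[-train_idx, ]

fit <- lm(y ~ x, data = train)                    # model built on training data only
test_mse <- mean((predict(fit, newdata = test) - test$y)^2)   # generalization error estimate
test_mse
```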

Data Mining - Massey University 41

Training Data: customer characteristics & cell phone usage behaviour

[Diagram: a model induction algorithm builds a model from the set of training examples; the model is then used to predict, for each consumer i in the set of test examples, whether customer i will terminate the contract (leave).]

Data Mining - Massey University 42

Holding out data

• The holdout method reserves a certain amount of the data for testing and uses the remainder for training
  – Usually: one third for testing, the rest for training
• For "unbalanced" datasets, random samples might not be representative
  – Few or no instances of some classes
• Stratified sample:
  – Make sure that each class is represented with approximately equal proportions in both subsets

Data Mining - Massey University 43

Evaluation on "small" data

• What if we have a small data set?
• The chosen 2/3 for training may not be representative.
• The chosen 1/3 for testing may not be representative.

Data Mining - Massey University 44

Repeated holdout method

• The holdout estimate can be made more reliable by repeating the process with different subsamples
  – In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
  – The error rates on the different iterations are averaged to yield an overall error rate
• This is called the repeated holdout method

Data Mining - Massey University 45

Cross-validation

• The most popular and effective type of repeated holdout is cross-validation
• Cross-validation avoids overlapping test sets
  – First step: the data is split into k subsets of equal size
  – Second step: each subset in turn is used for testing and the remainder for training
• This is called k-fold cross-validation (see the sketch below)
• Often the subsets are stratified before the cross-validation is performed
• The error estimates are averaged to yield an overall error estimate
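A minimal R sketch of k-fold cross-validation written out by hand on made-up data (k = 10); the fold error estimates are averaged to give the overall estimate.

```r
set.seed(5)
# Made-up data
n  <- 200
df <- data.frame(x = runif(n, 0, 10))
df$y <- 3 * df$x + 2 + rnorm(n, sd = 2)

k     <- 10
folds <- sample(rep(1:k, length.out = n))   # random fold assignment, 1..k

fold_mse <- sapply(1:k, function(f) {
  train <- df[folds != f, ]                 # k - 1 folds for training
  test  <- df[folds == f, ]                 # held-out fold for testing
  fit   <- lm(y ~ x, data = train)
  mean((predict(fit, newdata = test) - test$y)^2)
})

mean(fold_mse)                              # overall cross-validated error estimate
```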

Data Mining - Massey University 46

Cross-validation example:

— Break up the data into groups of the same size
— Hold aside one group for testing and use the rest to build the model
— Repeat, holding aside each group in turn

[Diagram: the held-out "Test" group rotates through the groups.]

Data Mining - Massey University 47

More on cross-validation

• Standard method for evaluation: stratified ten-fold cross-validation
• Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate
• Stratification reduces the estimate's variance
• Even better: repeated stratified cross-validation
  – E.g. ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance)

Data Mining - Massey University 48

Leave-One-Out cross-validation

• Leave-One-Out: a particular form of cross-validation:
  – Set the number of folds to the number of training instances
  – I.e., for n training instances, build the classifier n times
• Makes the best use of the data
• Involves no random subsampling
• Computationally expensive, but good performance

Data Mining - Massey University 49

Leave-One-Out-CV and stratification

• Disadvantage of Leave-One-Out-CV: stratification is not possible
  – It guarantees a non-stratified sample because there is only one instance in the test set!
• Extreme example: a random dataset split equally into two classes
  – The best model predicts the majority class
  – 50% accuracy on fresh data
  – The Leave-One-Out-CV estimate is 100% error!

Data Mining - Massey University 50

Model's Performance Evaluation

• Classification models predict which class an observation belongs to.
  – E.g., good vs. bad credit risk (credit), response vs. no response to a direct marketing campaign, etc.
• Classification accuracy rate
  – The proportion of accurate classifications of examples in the test set. E.g., the model predicts the correct class for 70% of test examples.

Data Mining - Massey University 51

Classification Accuracy Rate

Classification accuracy rate = S / N = the proportion of examples accurately classified by the model

S - number of examples accurately classified by the model
N - total number of examples

  Inputs                                   Output          Model's prediction   Correct/incorrect
  Single  No of cards  Age  Income>50K     Good/Bad risk   Good/Bad risk        prediction
  0       1            28   1              1               1                    :)
  1       2            56   0              0               0                    :)
  0       5            61   1              0               1                    :(
  0       1            28   1              1               1                    :)
  …       …            …    …              …               …                    …

Data Mining - Massey University 52

Consider the following…

• The response rate for a mailing campaign is 1%
• We build a classification model to predict whether or not a customer will respond
• The model's classification accuracy rate is 99%
• Is the model good?

[Figure: pie chart - 99% do not respond, 1% respond.]

Data Mining - Massey University 53

Confusion Matrix for Classification

Model's performance over the test set:

                                     Actual Class
                                     Respond    Do not respond
Predicted     Respond                0          0
Class         Do not respond         10         990

Diagonal (top left to bottom right): predicted class = actual class.
990/1000 (99%) accurate predictions.

Data Mining - Massey University 54

Evaluation for a Continuous Response

• In a logistic regression example, we predict probabilities that the response = 1
• Classification accuracy depends on the threshold

Data Mining - Massey University 55

Test Data

Suppose we use a cutoff of 0.5 on the predicted probabilities…

                          actual outcome
                          1       0
predicted        1        8       3
outcome          0        0       9

Data Mining - Massey University 56

More generally…

                          actual outcome
                          1       0
predicted        1        a       b
outcome          0        c       d

misclassification rate:  (b + c) / (a + b + c + d)

recall or sensitivity (how many of those that are really positive did you predict?):  a / (a + c)

specificity:  d / (b + d)

precision (how many of those predicted positive really are?):  a / (a + b)
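A minimal R sketch computing these quantities from made-up predicted and actual labels (the variable c_ is used so as not to mask R's c() function).

```r
# Made-up predicted and actual labels (1 = positive class)
actual    <- c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0)
predicted <- c(1, 1, 0, 0, 0, 1, 0, 1, 0, 0)

a  <- sum(predicted == 1 & actual == 1)   # true positives
b  <- sum(predicted == 1 & actual == 0)   # false positives
c_ <- sum(predicted == 0 & actual == 1)   # false negatives
d  <- sum(predicted == 0 & actual == 0)   # true negatives

c(misclassification = (b + c_) / (a + b + c_ + d),
  sensitivity       = a / (a + c_),
  specificity       = d / (b + d),
  precision         = a / (a + b))
```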

Data Mining - Massey University 57

Suppose we use a cutoff of 0.5…

                          actual outcome
                          1       0
predicted        1        8       3
outcome          0        0       9

sensitivity: 8 / (8 + 0) = 100%

specificity: 9 / (9 + 3) = 75%

We want both of these to be high.

Data Mining - Massey University 58

Suppose we use a cutoff of 0.8…

                          actual outcome
                          1       0
predicted        1        6       2
outcome          0        2       10

sensitivity: 6 / (6 + 2) = 75%

specificity: 10 / (10 + 2) = 83%

Data Mining - Massey University 59

• Note there are 20 possible thresholds
• ROC computes sensitivity and specificity for all possible thresholds and plots them (see the sketch below)
• If threshold = minimum: c = d = 0, so sens = 1 and spec = 0
• If threshold = maximum: a = b = 0, so sens = 0 and spec = 1

                          actual outcome
                          1       0
predicted        1        a       b
outcome          0        c       d
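A minimal R sketch that sweeps the threshold over made-up predicted scores, plots the resulting (false positive rate, true positive rate) pairs, and computes the area under the curve by the trapezoidal rule; in practice packages such as ROCR or pROC do this for you.

```r
set.seed(6)
# Made-up scores: positives tend to get higher predicted scores
actual <- rbinom(200, 1, 0.3)
score  <- ifelse(actual == 1, rnorm(200, mean = 1), rnorm(200, mean = 0))

thresholds <- sort(unique(score), decreasing = TRUE)
roc <- t(sapply(thresholds, function(th) {
  pred <- as.numeric(score >= th)
  c(fpr = sum(pred == 1 & actual == 0) / sum(actual == 0),   # 1 - specificity
    tpr = sum(pred == 1 & actual == 1) / sum(actual == 1))   # sensitivity
}))

plot(roc[, "fpr"], roc[, "tpr"], type = "l",
     xlab = "False Positive Rate (1 - specificity)",
     ylab = "True Positive Rate (sensitivity)")
abline(0, 1, lty = 2)                                        # the random-guess diagonal

# AUC by the trapezoidal rule, with (0,0) and (1,1) end points added
x <- c(0, roc[, "fpr"], 1)
y <- c(0, roc[, "tpr"], 1)
auc <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
auc
```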

Data Mining - Massey University 60

“Area under the curve” (AUC) is a common measure of predictive performance

Data Mining - Massey University 61

Another Look at ROC Curves

[Figure: distributions of the test result for patients with the disease and patients without the disease.]

Data Mining - Massey University 62

[Figure: the two test-result distributions with a threshold; patients below the threshold are called "negative", patients above it are called "positive".]

Data Mining - Massey University 63

Some definitions...

[Figure: patients with the disease whose test result falls above the threshold are True Positives.]

Data Mining - Massey University 64

[Figure: patients without the disease whose test result falls above the threshold are False Positives.]

Data Mining - Massey University 65

[Figure: patients without the disease whose test result falls below the threshold are True Negatives.]

Data Mining - Massey University 66

[Figure: patients with the disease whose test result falls below the threshold are False Negatives.]

Data Mining - Massey University 67

Moving the Threshold: right

[Figure: the threshold moved to the right, changing which patients are called "+" and "-".]

Data Mining - Massey University 68

Moving the Threshold: left

[Figure: the threshold moved to the left, changing which patients are called "+" and "-".]

Data Mining - Massey University 69

ROC curve

[Figure: the ROC curve plots the True Positive Rate (sensitivity), 0% to 100%, against the False Positive Rate (1 - specificity), 0% to 100%.]

Data Mining - Massey University 70

ROC curve comparison

[Figure: two ROC curves on True Positive Rate vs False Positive Rate axes - a good test bows toward the top-left corner, a poor test stays close to the diagonal.]

Data Mining - Massey University 71

ROC curve extremes

[Figure: the best test - the two distributions don't overlap at all, giving an ROC curve that passes through the top-left corner; the worst test - the distributions overlap completely, giving an ROC curve along the diagonal.]

Data Mining - Massey University 72

ROC Curve

[Figure: an actual ROC curve lying between the ideal ROC curve (AUC = 1) and the random ROC curve (AUC = 0.5), so 0 <= AUC <= 1. The vertical axis is the positive class success rate (hit rate, sensitivity); the horizontal axis is 1 - the negative class success rate (false alarm rate, 1 - specificity).]

For a given threshold on f(x), you get a point on the ROC curve.

Data Mining - Massey University 73

AUC for ROC curves

[Figure: four ROC curves on True Positive Rate vs False Positive Rate axes, with AUC = 100%, 90%, 65%, and 50% (the diagonal).]

Data Mining - Massey University 74

Interpretation of AUC

• AUC can be interpreted as the probability that the test result from a randomly chosen diseased individual is more indicative of disease than that from a randomly chosen non-diseased individual: P(X_i > X_j | D_i = 1, D_j = 0)
• So we can think of this as a nonparametric distance between the disease and non-disease test results
• It is equivalent to the Mann-Whitney U-statistic (a nonparametric test of the difference in location between two populations)
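A minimal R sketch of this equivalence on made-up test results: the W statistic from wilcox.test(), divided by n1 * n2, matches the direct estimate of P(X_i > X_j | D_i = 1, D_j = 0).

```r
set.seed(7)
# Made-up test results for diseased and non-diseased individuals
diseased    <- rnorm(50, mean = 1)
nondiseased <- rnorm(80, mean = 0)

# wilcox.test's W statistic counts the pairs where a diseased result exceeds a
# non-diseased one (ties counted 1/2); dividing by n1 * n2 estimates the AUC
U   <- as.numeric(wilcox.test(diseased, nondiseased)$statistic)
auc <- U / (length(diseased) * length(nondiseased))
auc

# Direct estimate of P(Xi > Xj | Di = 1, Dj = 0) for comparison
mean(outer(diseased, nondiseased, ">"))
```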

Data Mining - Massey University 75

Lab #3

• Missing data: using visualization to see if the data is missing at random, through "missing in the margins"
• Multidimensional scaling