David Corne, Heriot-Watt University - dwcorne@gmail.com
These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Data Mining (and machine learning)

DM Lecture 3: Basic Statistics and Coursework 1


Post on 31-Dec-2015


David Corne and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com. These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Communities and Crime

Here is an interesting dataset


-- state: US state (by number)
-- county: numeric code for county
-- community: numeric code for community
-- communityname: community name
-- fold: fold number for non-random 10 fold cross validation
-- population: population for community (numeric - decimal)
-- householdsize: mean people per household (numeric - decimal)
-- racepctblack: percentage of population that is african american (numeric - decimal)
-- racePctWhite: percentage of population that is caucasian (numeric - decimal)
-- racePctAsian: percentage of population that is of asian heritage (numeric - decimal)
-- racePctHisp: percentage of population that is of hispanic heritage (numeric - decimal)
-- agePct12t21: percentage of population that is 12-21 in age (numeric - decimal)
-- agePct12t29: percentage of population that is 12-29 in age (numeric - decimal)
-- agePct16t24: percentage of population that is 16-24 in age (numeric - decimal)
-- agePct65up: percentage of population that is 65 and over in age (numeric - decimal)
-- numbUrban: number of people living in areas classified as urban (numeric - decimal)
-- pctUrban: percentage of people living in areas classified as urban (numeric - decimal)
-- medIncome: median household income (numeric - decimal)
-- pctWWage: percentage of households with wage or salary income in 1989 (numeric - decimal)
-- pctWFarmSelf: percentage of households with farm or self employment income in 1989
[etc etc etc --- 128 fields altogether]
-- ViolentCrimesPerPop: total number of violent crimes per 100K population (numeric - decimal) [GOAL attribute (to be predicted)]


Mining the C&C data

Let’s do some basic preprocessing and mining of these data, to start to grasp whether we can find any patterns that will predict certain levels of violent crime


etc … about 2,000 instances


First: some sensible preprocessing

The first 5 fields are (probably) not useful for prediction – they are more like “ID” fields for the record. So, let’s remove them.

There are many cases of missing data here too – let’s remove any field which has any missing data in it at all. This is OK for the C&C data, still leaving 100 fields.


First: some sensible preprocessing

I downloaded the data. First I converted it to a space-separated form, rather than comma-separated, because I prefer it that way. I wrote an awk script to do this called cs2ss.awk, here:

http://www.macs.hw.ac.uk/~dwcorne/Teaching/DMML/cs2ss.awk

I did that with this command line on a unix machine:

awk -f cs2ss.awk < communities.data > commss.txt

Placing the new version in “commss.txt”

Then, I wanted to remove the first 5 fields, and remove any field in which any record contained missing values. I wrote an awk script for that too:

http://www.macs.hw.ac.uk/~dwcorne/Teaching/DMML/fixcommdata.awk

and did this:

awk -f fixcommdata.awk < commss.txt > commssfixed.txt
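For those who prefer Python to awk, the same preprocessing might be sketched as below. This is only a guess at what the two awk scripts do, not a copy of them: it assumes missing values are marked with "?", the function name fix_comm_data is made up, and the two sample records are invented for illustration.

```python
def fix_comm_data(lines, n_id_fields=5, missing="?"):
    """Drop the leading ID fields, then drop every field with a missing value in any record."""
    # Split the comma-separated records and discard the first n_id_fields columns.
    rows = [line.strip().split(",")[n_id_fields:] for line in lines if line.strip()]
    n_cols = len(rows[0])
    # Keep only the columns in which no record holds the missing-value marker.
    keep = [c for c in range(n_cols) if all(row[c] != missing for row in rows)]
    # Emit space-separated records.
    return [" ".join(row[c] for c in keep) for row in rows]


# Two invented records: 5 ID fields followed by 4 data fields.
sample = [
    "8,?,?,Lakewoodcity,1,0.19,0.33,?,0.2",
    "53,?,?,Tukwilacity,1,0.00,0.16,0.12,0.67",
]
for record in fix_comm_data(sample):
    print(record)
```

The third data field is dropped for both records because one record has it missing, which is the "remove the whole field" policy described above.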


Normalisation

The fields in these data happen to be already min-max normalised into [0,1]; I wondered whether it would also be good to z-normalise the fields. So I wrote an awk script for z-normalisation, and produced a version that had that done

http://www.macs.hw.ac.uk/~dwcorne/Teaching/DMML/znorm.awk

awk -f znorm.awk < commssfixed.txt > commssfixedznorm.txt

In these data, the class value is numeric, between 0 and 1, indicating (already normalized) the relative amount of violent crime in the community in question. To make it easier to find patterns and relationships, I produced new versions of each dataset where the class value was either 0 (low) or 1 (high) – 0 in the cases where it had been <= 0.4, and 1 otherwise. I used an awk script for that too, and did some renaming of files, and ended up with:

• commssfixed.txt two-class

• commssfixedznorm.txt two-class z-normalised
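The two-class conversion is simple enough to sketch in a few lines of Python. The awk script used for it is not shown, so this is an illustration only: it assumes the class value is the last field of each record, and the function names and the three sample records are made up.

```python
def to_two_class(value, threshold=0.4):
    """Return 0 (low) when value <= threshold, otherwise 1 (high)."""
    return 0 if value <= threshold else 1


def binarise_class_field(rows, threshold=0.4):
    """Replace the last field of every record with its 0/1 class label."""
    return [row[:-1] + [to_two_class(row[-1], threshold)] for row in rows]


# Invented records: two feature values followed by the numeric class value.
rows = [[0.19, 0.33, 0.20], [0.00, 0.16, 0.67], [0.45, 0.10, 0.40]]
print(binarise_class_field(rows))
```

Note that a class value of exactly 0.4 ends up in class 0, matching the "<= 0.4" rule.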


Now, I wonder: how good is 1-NN at predicting the class for these data?

If only using fields 20—30 to work out the distance values, the answer is:

Unchanged data (in this case, already min-max normalised to [0,1]): 81.1%

Z-normalised: 81.5%

But note that 81% of the data is class 0 – so if you always guess “0”, your accuracy will be 81.0%.
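A 1-NN experiment of this kind could be sketched as follows. The slides do not say how the accuracy was measured or which distance was used, so this sketch assumes leave-one-out evaluation and squared Euclidean distance; the function names and the toy data are invented.

```python
import math


def one_nn_accuracy(rows, feature_idx):
    """Leave-one-out 1-NN accuracy; the class label is the last field of each row."""
    correct = 0
    for i, query in enumerate(rows):
        best_dist, best_label = math.inf, None
        for j, other in enumerate(rows):
            if i == j:
                continue  # never use a record as its own neighbour
            # Squared Euclidean distance over the chosen fields only.
            d = sum((query[k] - other[k]) ** 2 for k in feature_idx)
            if d < best_dist:
                best_dist, best_label = d, other[-1]
        correct += best_label == query[-1]
    return correct / len(rows)


def majority_baseline(rows):
    """Accuracy obtained by always guessing the most common class."""
    labels = [row[-1] for row in rows]
    return max(labels.count(0), labels.count(1)) / len(labels)


# Invented toy data: two feature fields, then the 0/1 class.
toy = [
    [0.10, 0.20, 0],
    [0.15, 0.25, 0],
    [0.20, 0.10, 0],
    [0.80, 0.90, 1],
    [0.85, 0.80, 1],
]
print(one_nn_accuracy(toy, feature_idx=[0, 1]))
print(majority_baseline(toy))
```

For the real data, fields 20 to 30 (counting from 1) would correspond to feature_idx=range(19, 30). Always compare the 1-NN figure with the majority baseline before getting excited about it.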


Now let’s look at the data in more detail; some histograms of the fields

Here is the distribution of values in field 6 for class 0 – it is a “5-bin” distribution.

[Figure: 5-bin histogram of field 6 for class 0. Bins: 0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, 0.8-1.0; vertical axis from 0 to 0.9.]


Let’s look at the distributions of field 6 for class 0 and class 1 together (% of pop that is Hispanic)

[Figure: 5-bin histograms of field 6 for class 0 and class 1, side by side. Bins: 0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, 0.8-1.0; vertical axis from 0 to 0.9.]
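The numbers behind such a histogram can be produced with a short function. This is a sketch under the assumption that the field is already scaled to [0,1] and that the bars show the fraction of records per bin; the function name and the sample values are invented.

```python
def five_bin_distribution(values, n_bins=5):
    """Fraction of values falling in each of n_bins equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        # A value of exactly 1.0 is placed in the last bin.
        idx = min(int(v * n_bins), n_bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


# Invented field values for the two classes.
class0 = [0.05, 0.10, 0.15, 0.30, 0.90]
class1 = [0.10, 0.50, 0.55, 1.00]
print(five_bin_distribution(class0))
print(five_bin_distribution(class1))
```

Computing the distribution separately per class, as here, is what makes the two sets of bars comparable even though the classes have very different sizes.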

An aside: similar histograms with clearer messages …

Field 7 (% of pop aged 12--21)

[Figure: 5-bin histograms of field 7 for class 0 and class 1. Bins: 0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, 0.8-1.0; vertical axis from 0 to 0.7.]


Field 8 (% of pop aged 12—29)

[Figure: 5-bin histograms of field 8 for class 0 and class 1. Bins: 0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, 0.8-1.0; vertical axis from 0 to 0.8.]


Field 9 (% of pop aged 16—24)

[Figure: 5-bin histograms of field 9 for class 0 and class 1. Bins: 0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, 0.8-1.0; vertical axis from 0 to 0.8.]


Field 10 (% of pop aged >= 65)

[Figure: 5-bin histograms of field 10 for class 0 and class 1. Bins: 0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, 0.8-1.0; vertical axis from 0 to 0.5.]

[Figure: the five class 0 / class 1 histograms for fields 6 to 10, shown together.]

Which two fields seem most useful for discriminating between classes 0 and 1?


Fields 6 and 7

Maybe we will get better 1-NN results using only the important fields? 2 is (most often) too small a number of fields, but anyway …

I produced versions of the dataset that had only fields 6, 7 and 100 (these two, and the class field): I then calculated 1-NN accuracy for these. Results:

`Unchanged’ version: fields 6 and 7: 70.8% (was 81.1%)

Z-normalised: fields 6 and 7: 70.9% (was 81.5%)

Not very successful! But I didn’t expect that – working with several more of the important fields would quite possibly give better accuracies, but may take too much time to demonstrate, or do in your assignment.

Coursework 1

You will do what we just did, for two datasets:

• Spambase

• EEG Eye State

MSc/MEng yr 5 will, in addition, work with Urban Land Cover


For each dataset


Additionally, Level 11 students: Do some preparation work on the urban land cover dataset (see CW1 pdf), and then repeat steps 1—7.


How? What tools?

Whatever you prefer: C, Java, Excel, Matlab, …

Read what the handout says about ‘How’.

Basically:

• You don’t have to, but it’s a very good idea to learn to use scripting tools such as awk

• Once you get used to ‘awk’ (or perl, python, etc), you can do complex things with files (e.g. datasets) quickly and easily

• This includes things that may be too difficult in Excel, or may take you too long to program and debug in C, Java, etc…

You should know what Z-normalisation is, so here is a brief lecture on basic statistics, including that, and other wonders

Fundamental Statistics Definitions

• A Population is the total collection of all items/individuals/events under consideration

• A Sample is that part of a population which has been observed or selected for analysis

E.g. all students is a population. Students at HWU is a sample; this class is a sample, etc …

• A Statistic is a measure which can be computed to describe a characteristic of the sample (e.g. the sample mean)

The reason for doing this is almost always to estimate (i.e. make a good guess) things about that characteristic in the population


E.g.

• This class is a sample from the population of students at HWU (it can also be considered as a sample of other populations – like what?)

• One statistic of this sample is your mean weight. Suppose that is 65Kg. I.e. this is the sample mean.

• Is 65Kg a good estimate for the mean weight of the population?

•Another statistic: suppose 10% of you are married. Is this a good estimate for the proportion that are married in the population?

Some Simple Statistics

• The Mean (average) is the sum of the values in a sample divided by the number of values

• The Median is the midpoint of the values in a sample (50% above; 50% below) after they have been ordered (e.g. from the smallest to the largest)

• The Mode is the value that appears most frequently in a sample

• The Range is the difference between the smallest and largest values in a sample

• The Variance is a measure of the dispersion of the values in a sample – how closely the observations cluster around the mean of the sample

• The Standard Deviation is the square root of the variance of a sample
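All six of these statistics can be checked on a small example with Python's standard statistics module. The sample below is invented; pvariance and pstdev are used because they divide by the number of values, as in the simplest definition of variance.

```python
import statistics

sample = [2, 3, 3, 5, 7, 10]  # invented values

mean = statistics.mean(sample)            # sum of the values / number of values
median = statistics.median(sample)        # midpoint of the ordered values
mode = statistics.mode(sample)            # most frequent value
value_range = max(sample) - min(sample)   # largest minus smallest
variance = statistics.pvariance(sample)   # mean squared deviation from the mean
std = statistics.pstdev(sample)           # square root of the variance

print(mean, median, mode, value_range, variance, std)
```

Note that the mean (5) and the median (4) differ here: the one large value, 10, pulls the mean up but leaves the median alone.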

Distributions / Histograms (a way to ‘see’ these statistics)

A Normal (aka Gaussian) distribution (image from Mathworld)

Statistical moments

• The m-th moment about the mean (μ) of a sample is:

$$\frac{1}{n} \sum_{x \in \text{Sample}} (x - \mu)^m$$

where n is the number of items in the sample.

• The first moment (m = 1) is 0!

• The second moment (m = 2) is the variance

• (and: square root of the variance is the standard deviation)

• The third moment can be used in tests for skewness

• The fourth moment can be used in tests for kurtosis

Variation and Standard Deviation

• The variance of a sample is the 2nd moment:

$$\text{variance} = \frac{1}{n} \sum_{x \in \text{Sample}} (x - \mu)^2$$

where n is the number of items in the sample.

• (the square root of the variance is the standard deviation)
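The moment definition translates almost directly into code. A minimal sketch, with an invented sample and a made-up function name:

```python
def moment(sample, m):
    """m-th moment about the mean: (1/n) * sum of (x - mu)**m over the sample."""
    n = len(sample)
    mu = sum(sample) / n
    return sum((x - mu) ** m for x in sample) / n


sample = [2, 3, 3, 5, 7, 10]  # invented values

print(moment(sample, 1))         # first moment: always 0
print(moment(sample, 2))         # second moment: the variance
print(moment(sample, 2) ** 0.5)  # its square root: the standard deviation
```

The first moment is zero because the deviations above and below the mean cancel exactly; that is what makes the mean the "centre" of the sample.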


Distributions / Histograms

A Normal (aka Gaussian) distribution (image from Mathworld)


`Normal’ or Gaussian distributions …

• … tend to be everywhere

• Given a typical numeric field in a typical dataset, it is common that most values are centred around a particular value (the mean), and the proportion with larger or smaller values tends to tail off.


`Normal’ or Gaussian distributions …

• We just saw this – fields 7—10 were Normal-ish

• Heights, weights, times (e.g. for 100m sprint, for lengths of software projects), measurements (e.g. length of a leaf, waist measurement, coursework marks, level of protein A in a blood sample, …) all tend to be Normally distributed. Why??


Sometimes distributions are uniform

Uniform distributions. Every possible value tends to be equally likely


This figure is from: http://mathworld.wolfram.com/Dice.html

One die: uniform distribution of possible totals.

But look what happens as soon as the value is a sum of things: the more things, the more Gaussian the distribution.

Are measurements (etc.) usually the sum of many factors?
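You can reproduce the dice picture yourself by counting every possible roll. The sketch below computes the exact distribution of the total for fair six-sided dice and prints a crude text histogram; the function name is made up for this example.

```python
from collections import Counter
from itertools import product


def dice_total_distribution(n_dice, sides=6):
    """Exact probability of each possible total when rolling n_dice fair dice."""
    rolls = product(range(1, sides + 1), repeat=n_dice)
    totals = Counter(sum(roll) for roll in rolls)
    n_outcomes = sides ** n_dice
    return {total: count / n_outcomes for total, count in sorted(totals.items())}


for n in (1, 2, 4):
    print(f"{n} dice")
    for total, p in dice_total_distribution(n).items():
        # One '#' per percentage point of probability.
        print(f"  {total:2d} {'#' * round(p * 100)}")
```

With one die the bars are all the same length; with two they form a triangle peaking at 7; with four the bell shape is already visible.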

Probability Distributions

• If a population (e.g. field of a dataset) is expected to match a standard probability distribution then a wealth of statistical knowledge and results can be brought to bear on its analysis

• Many standard statistical techniques are based on the assumption that the underlying distribution of a population is Normal (Gaussian)

• Usually this assumption works fine

The power of assumptions…

You are a random sample of 30 HWU/Riccarton students

The mean height of this sample is 1.685m

There are 5,000 students in the population

With no more information, what can we say about the mean height of the population of 5,000 students?

A closer look at the normal distribution

This is the ND with mean mu and std sigma


Suppose the standard deviation of your sample is 0.12

Theory tells us that if a population is Normal, the sample std is a fairly good guess at the population std

More than just a pretty bell shape

So, the sample STD is a good estimate for the population STD.

So we can say, for example, that ~95% of the population of 5000 students (4750 students) will be within 0.24m of the population mean


Mean of our sample was 1.685

The Standard Error of the Mean is pop std / sqrt(sample size)

which we can approximate by:

sample std / sqrt(sample size)

… in our case this is 0.12/5.5 = 0.022

But what is the population mean?

This ‘standard error’ (SE) is actually the standard deviation of the distribution of sample means. We can use it to build a confidence interval for the actual population mean. Basically, we can be 95% sure that the pop mean is within 2 SEs of the sample mean …

The power of assumptions…

You are a random sample of 30 HWU/Riccarton students

The mean height of this sample is 1.685m

There are 5,000 students in the population

With no more information, what can we say about the mean height of the population of 5,000 students?

If we assume the population is normally distributed … our sample std (0.12) is a good estimate of the pop std … so, means of samples of size 30 will generally have their own std, of 0.022 (calculated on last slide) … so, we can be 95% confident that the pop mean is between 1.641 and 1.729 (2 SEs either side of the sample mean)
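The arithmetic behind that interval is short enough to check in Python. The numbers are the ones from the slides (heights taken to be in metres); the variable names are just for this sketch, and the "2 SEs" rule is the rough 95% rule used above.

```python
import math

n = 30               # sample size
sample_mean = 1.685  # mean height of the sample, in metres
sample_std = 0.12    # sample std, used as an estimate of the population std

# Standard error of the mean: std / sqrt(sample size).
se = sample_std / math.sqrt(n)

# Approximate 95% confidence interval: 2 SEs either side of the sample mean.
low = sample_mean - 2 * se
high = sample_mean + 2 * se

print(f"SE = {se:.3f}")
print(f"95% CI: {low:.3f} to {high:.3f}")
```

Try a sample size of 120 instead of 30: quadrupling the sample halves the standard error, and so halves the width of the interval.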


Z-normalisation (or z-score normalisation)

Given any collection of numbers (e.g. the values of a particular field in a dataset) we can work out the mean and the standard deviation.

Z-score normalisation means converting the numbers into units of standard deviation.


Simple z-normalisation example

values    Mean subtracted    In Z units
  2.8         -8.32            -1.403
 17.6          6.48             1.092
  4.1         -7.02            -1.18
 12.7          1.58             0.27
  3.5         -7.62            -1.28
 11.8          0.68             0.11
 12.2          1.08             0.18
 11.1         -0.02            -0.003
 15.8          4.68             0.79
 19.6          8.48             1.43

Mean: 11.12 STD: 5.93

Mean subtracted: subtract mean, so that these are centred around zero

In Z units: divide each value by the std; we now see how usual or unusual each value is
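The worked example can be reproduced in a few lines of Python. One detail worth noticing: the STD of 5.93 is what you get with the n - 1 divisor (statistics.stdev); dividing by n instead (statistics.pstdev) gives about 5.63. The sketch uses stdev so that the output matches the numbers above.

```python
import statistics

values = [2.8, 17.6, 4.1, 12.7, 3.5, 11.8, 12.2, 11.1, 15.8, 19.6]

mean = statistics.mean(values)
std = statistics.stdev(values)  # n - 1 divisor, which reproduces 5.93

# Subtract the mean, then divide by the std.
z_scores = [(v - mean) / std for v in values]

print(f"Mean: {mean:.2f} STD: {std:.2f}")
for v, z in zip(values, z_scores):
    print(f"{v:5.1f} {v - mean:6.2f} {z:7.3f}")
```

Whichever divisor you choose, use the same one for every field, so that the z units mean the same thing across the dataset.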


The take-home lesson (for those new to statistics)

You have data from a sample, and you have good reason to believe that the population is normally distributed.

Thanks to the Central Limit Theorem, and other marvels of statistical theory, you can:

– Make good estimates about the statistics of the population

– Make justified conclusions about two distributions being different (e.g. the distribution of field X for class 1, and the distribution of field X for class 2)

– Maybe find outliers and spot other problems in the data

Next week – a bit more stats: correlation / regression


The Central Limit Theorem is this:

As more and more samples are taken from a population the distribution of the sample means conforms to a normal distribution

• The average of the samples more and more closely approximates the average of the entire population

• A very powerful and useful theorem

• The normal distribution is such a common and useful distribution that additional statistics have been developed to measure how closely a population conforms to it and to test for divergence from it due to skewness and kurtosis
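A small simulation makes the theorem easier to believe. The sketch below draws many samples of size 30 from a population that is uniform on [0,1] (so definitely not Normal) and looks at the means of those samples. The sample size, number of samples and seed are arbitrary choices for this illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

sample_size = 30
n_samples = 5000

# Mean of each of n_samples random samples from a uniform [0, 1] population.
sample_means = [
    sum(random.random() for _ in range(sample_size)) / sample_size
    for _ in range(n_samples)
]

mean_of_means = statistics.mean(sample_means)
std_of_means = statistics.stdev(sample_means)

# Theory: population mean is 0.5 and population std is sqrt(1/12),
# so the standard error for samples of 30 is sqrt(1/12) / sqrt(30), about 0.053.
print(mean_of_means)
print(std_of_means)
```

A histogram of sample_means would show the familiar bell shape, even though the population itself is flat.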


Skew: how much

it’s like this:

See: http://mvpprograms.com/help/mvpstats/distributions/SkewnessKurtosis