School of Computing
FACULTY OF ENGINEERING
MJ11 (COMP1640)
Modelling, Analysis & Algorithm Design
Vania Dimitrova
Lecture 18
Statistical Data Analysis: Types of Data, Sampling Methods, Descriptive Statistics
November 2011
In the previous lectures
Mathematical Modelling
Identify Factors
Make assumptions
Formulate model
Examine behaviour
Assumptions are crucial
Estimated values for parameters
Assumed dependencies between variables
Check validity of assumptions
Grounded on data analysis
In this series of lectures
Analysis of data
How to collect data samples?
How to make estimations of values (point/interval)?
How to infer possible dependencies between variables?
How to check the validity of a hypothesis?
Do you agree with these statements?
Average earnings in the UK grow steadily.
There are more overseas visits to the UK than UK visits abroad.
House prices are dependent on average family income.
Young people prefer to shop online.
Advertisement improves sales figures.
TV advertisement is more powerful than Radio advertisement.
Women are more likely to buy computer games than men.
Men are more likely to buy cosmetics products than women.
Population
Large collection of objects or events which vary in respect of some characteristics
The whole set of measurements or counts about which we want to draw a conclusion
Characteristics of population:
height, age, reading abilities, fitness level
What is the population for each of the claims on the previous slide?
Sample
Subset of the population: a set of some of the measurements or characteristics of the population.
[Diagram labels: population, sample]
Measures describing population characteristics: PARAMETERS
Measures describing sample characteristics: STATISTICS
Statistics estimate the parameters
Systematic error – the sample is not representative of the population
Sampling error – influenced by the size of the sample and the variation in the population
Sampling methods
Random sampling – selecting members of the population in a random order
Pros & Cons
Systematic sampling – selecting members of the population in a systematic order (quasi-random)
Pros & Cons
Sampling methods (cont.)
Stratified sampling – dividing the population into homogeneous groups and selecting at random within each group
Pros & Cons
Cluster sampling – when the population is too big, we may select certain clusters (e.g. UK students)
Pros & Cons
Stage sampling – random selection of clusters
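The sampling methods above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the population of member ids, the two strata and the ten city clusters are all hypothetical, and `random_cluster_sample` follows the "random selection of clusters" idea noted under stage sampling.

```python
import random

def simple_random_sample(population, k, rng):
    """Pick k members without replacement, each equally likely."""
    return rng.sample(population, k)

def systematic_sample(population, k, rng):
    """Pick every step-th member after a random start (quasi-random)."""
    step = len(population) // k
    start = rng.randrange(step)
    return [population[start + i * step] for i in range(k)]

def stratified_sample(strata, fraction, rng):
    """Sample the same fraction at random from each homogeneous group."""
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

def random_cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep all of their members."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: 100 member ids, split two different ways.
population = list(range(100))
strata = {"under_30": list(range(0, 60)), "30_plus": list(range(60, 100))}
clusters = {f"city_{c}": list(range(c * 10, c * 10 + 10)) for c in range(10)}

print(simple_random_sample(population, 10, rng))
print(systematic_sample(population, 10, rng))
print(stratified_sample(strata, 0.1, rng))
print(random_cluster_sample(clusters, 2, rng))
```

Note that the stratified sample keeps the 60/40 split of the strata, while the cluster sample contains only members of the two chosen cities.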
Sample size
What precision do we want?
Increase the size to get better precision
What is the likely variability in the population?
Increase the size to account for higher variability
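A small simulation can illustrate both points. The sketch below assumes two hypothetical populations with the same centre but different variability, draws repeated random samples of each size, and measures how much the sample means scatter. The population values, the sample sizes and the number of repeats are all chosen for illustration only.

```python
import random
import statistics

def spread_of_sample_means(population, n, repeats, rng):
    """Standard deviation of the means of `repeats` random samples of size n."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.pstdev(means)

rng = random.Random(1)  # fixed seed so the run is repeatable

# Two hypothetical populations: same centre, different variability.
low_var = [rng.gauss(50, 5) for _ in range(10_000)]
high_var = [rng.gauss(50, 20) for _ in range(10_000)]

for n in (10, 100, 1000):
    low = spread_of_sample_means(low_var, n, 200, rng)
    high = spread_of_sample_means(high_var, n, 200, rng)
    print(f"n={n:5d}  low variability: {low:.2f}  high variability: {high:.2f}")
```

The scatter of the sample means should shrink as n grows, and for the same n it should be larger for the more variable population.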
Types of data
Nominal
Categories, classes
Ordinal
Nominal with order
Discrete
Numbers that are distinct points on a scale
Continuous
Can take any values between points on a scale
GIVE EXAMPLES
Descriptive Statistics
Mean – average score: $\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j$
Median – middle point on the scale of measurement; helpful for oddly shaped distributions
General description of the sample
2 3 5 7 9 10 12 13 14 16 18 20 21
3 4 4 5 5 5 5 6 7
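As a quick check of the two measures, the sketch below computes the mean and the median of the two rows of numbers shown on this slide, assuming they are two small samples of scores. The names `row_a` and `row_b` are introduced here for illustration.

```python
import statistics

# The two rows of scores from the slide (assumed to be two separate samples).
row_a = [2, 3, 5, 7, 9, 10, 12, 13, 14, 16, 18, 20, 21]
row_b = [3, 4, 4, 5, 5, 5, 5, 6, 7]

def mean(xs):
    """Average score: the sum of the values divided by how many there are."""
    return sum(xs) / len(xs)

for name, xs in (("row_a", row_a), ("row_b", row_b)):
    # statistics.median sorts the data and returns the middle value.
    print(f"{name}: mean = {mean(xs):.2f}, median = {statistics.median(xs)}")
```

Both rows have an odd number of values, so the median is simply the middle value of the sorted row.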
Distribution of scores
Standard deviation: $\sigma = \sqrt{\frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})^2}$
Variance: $\sigma^2$
Coefficient of variation: $V = \frac{\sigma}{\bar{x}} \times 100\%$
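The three measures of spread can be sketched as follows. This assumes the divide-by-n form of the variance; many textbooks divide by n - 1 when estimating from a sample, so treat the choice as an assumption. The `scores` list reuses the second row of numbers from the earlier slide.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Mean squared deviation from the mean (divides by n, an assumption)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def std_dev(xs):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(xs))

def coefficient_of_variation(xs):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev(xs) / mean(xs) * 100

scores = [3, 4, 4, 5, 5, 5, 5, 6, 7]

print(f"variance = {variance(scores):.3f}")
print(f"standard deviation = {std_dev(scores):.3f}")
print(f"coefficient of variation = {coefficient_of_variation(scores):.1f}%")
```

Because the coefficient of variation is relative to the mean, it allows the spread of samples measured on different scales to be compared.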
Example (EU-Area-Current-Accounts.xls)
http://epp.eurostat.ec.europa.eu/
Normal and Skewed Distribution
[Figure: skewed distribution and normal distribution curves plotted against x; source: Wikipedia]
Approximating Normal Distribution
As the sample size increases, the shape of the sampling distribution approaches the normal distribution (see also Central Limit Theorem)
http://www.statsoft.com/textbook/esc.html
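This behaviour can be observed in a small simulation. The sketch below assumes a strongly skewed (exponential) population, which is a choice made here for illustration, and uses skewness as a rough indicator of how far the distribution of sample means is from a symmetric, normal-like shape.

```python
import random
import statistics

def skewness(xs):
    """Mean cubed standardised deviation; 0 for a symmetric distribution."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean([((x - m) / s) ** 3 for x in xs])

rng = random.Random(42)  # fixed seed so the run is repeatable

def sample_means(n, repeats):
    """Means of `repeats` samples of size n from a skewed population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(repeats)
    ]

for n in (1, 5, 50):
    means = sample_means(n, 2000)
    print(f"n={n:3d}  skewness of the sample means: {skewness(means):.2f}")
```

The skewness should fall towards 0 as n grows, meaning the distribution of the sample means becomes more symmetric even though the population itself is skewed.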
Correlation between two variables
Measure of the relations between two or more variables
Correlation coefficient r
Negative correlation: r close to -1
Positive correlation: r close to 1
Different methods to calculate r
Simplest: based on deviations from the mean
$-1 \le r \le 1$
$r = \frac{\sum_{j=1}^{n}(x_j - \bar{x})(y_j - \bar{y})}{\sqrt{\sum_{j=1}^{n}(x_j - \bar{x})^2 \, \sum_{j=1}^{n}(y_j - \bar{y})^2}}$
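The deviation-based calculation of r can be sketched directly: sum the products of the deviations of x and y from their means, then divide by the square root of the product of the two sums of squared deviations. The data values below are hypothetical and are not the points plotted on the following slides.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired measurements.
x = [40, 50, 60, 70, 80]
y_up = [42, 49, 61, 68, 75]    # rises with x
y_down = [78, 70, 61, 49, 41]  # falls as x rises

print(f"r (rising)  = {correlation(x, y_up):.3f}")
print(f"r (falling) = {correlation(x, y_down):.3f}")
```

The rising series should give r close to 1 and the falling series r close to -1. A value of r near 0 would indicate little or no linear relation.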
Example: Positive Correlation
[Scatter plot: r = 0.998]
Example: Negative Correlation
[Scatter plot: r = -0.99]
Example: Limited (or No) Correlation
[Scatter plot: r = 0.179]
Summary
Types of data
Population vs Sample
Sampling methods
Descriptive statistics
Normal & Skewed distribution
Correlation between variables
References
Rees D.G., Essential Statistics, Chapman & Hall/CRC, 2000.
Cohen, L., Holliday, M., Practical Statistics for Students, Chapman, 1996.
http://www.statsoft.com/textbook/esc.html