Summarizing Data
Graphical Methods
Histogram
[Figure: histogram with classes 70 to 80, 80 to 90, 90 to 100, 100 to 110, 110 to 120, 120 to 130; vertical axis ticks 0 to 8]
Stem-Leaf Diagram
 8 | 0 2 4 6 6 9
 9 | 0 4 4 5 5 6 9 9
10 | 2 2 4 5 5 9
11 | 1 8 9
12 |
Grouped Freq Table
Class         Verbal IQ   Math IQ
70 to 80          1          1
80 to 90          6          2
90 to 100         7         11
100 to 110        6          4
110 to 120        3          4
120 to 130        0          1
Box-whisker Plot
[Figure: box-whisker plot]
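As a rough illustration of how a stem-leaf diagram and a grouped frequency table are obtained from raw scores, the Python sketch below rebuilds both from the Verbal IQ and Math IQ columns of Data Set #3 (tabulated later in this transcript). The function names are illustrative. The class convention is an assumption: each class is taken to exclude its lower endpoint and include its upper endpoint, so a score of 80 is counted in "70 to 80". The slide does not state this, but it is the convention under which the counts match the grouped frequency table.

```python
from collections import defaultdict

# Verbal IQ and Math IQ scores for the 23 students of Data Set #3.
verbal_iq = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
             82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]
math_iq = [94, 103, 92, 100, 115, 102, 87, 100, 96, 80, 87, 116,
           91, 93, 124, 119, 94, 117, 93, 110, 97, 104, 93]


def stem_leaf(values):
    """Return {stem: sorted leaves}, with stem = tens and leaf = units digit."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)


def grouped_counts(values, lower=70, width=10, n_classes=6):
    """Count the values falling in each class (low, low + width].

    Assumption: the upper endpoint belongs to the class, the lower does not.
    """
    counts = []
    for k in range(n_classes):
        low = lower + k * width
        counts.append(sum(1 for v in values if low < v <= low + width))
    return counts


verbal_stems = stem_leaf(verbal_iq)
for stem, leaves in verbal_stems.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")

verbal_counts = grouped_counts(verbal_iq)
math_counts = grouped_counts(math_iq)
for k, (v, m) in enumerate(zip(verbal_counts, math_counts)):
    low = 70 + 10 * k
    print(f"{low} to {low + 10}: Verbal IQ {v}, Math IQ {m}")
```

With a left-closed convention instead (80 counted in "80 to 90"), the Verbal IQ counts would differ from the table, which is why the convention is flagged here.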
Measure of Central Location
1. Mean
2. Median
Measure of Variability (Dispersion, Spread)
1. Range
2. Inter-Quartile Range
3. Variance, standard deviation
4. Pseudo-standard deviation
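The measures listed above can be computed with Python's standard `statistics` module. The sketch below does so for the Verbal IQ scores of Data Set #3 (tabulated later in this transcript). Two points are assumptions rather than statements from the slides: the quartiles use the module's default ("exclusive") method, and other quartile conventions give slightly different values; and the pseudo-standard deviation is taken to be IQR/1.35, its usual definition, chosen so that it is close to the standard deviation for normally distributed data.

```python
import statistics

# Verbal IQ scores for the 23 students of Data Set #3.
verbal_iq = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
             82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]

# Measures of central location
mean = statistics.mean(verbal_iq)
median = statistics.median(verbal_iq)

# Measures of variability
data_range = max(verbal_iq) - min(verbal_iq)
q1, _, q3 = statistics.quantiles(verbal_iq, n=4)   # default "exclusive" method
iqr = q3 - q1
variance = statistics.variance(verbal_iq)          # sample variance, divisor n - 1
std_dev = statistics.stdev(verbal_iq)              # square root of the variance
pseudo_sd = iqr / 1.35                             # assumed definition: IQR / 1.35

print(f"mean = {mean:.3f}, median = {median}")
print(f"range = {data_range}, IQR = {iqr}")
print(f"variance = {variance:.3f}, s = {std_dev:.3f}, pseudo-s = {pseudo_sd:.3f}")
```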
Descriptive techniques for Multivariate data
In most research situations data is collected on more than one variable (usually many variables)
Graphical Techniques
• The scatter plot
• The two dimensional Histogram
The Scatter Plot
For two variables X and Y we will have a measurement of each variable on each case:
$(x_i, y_i)$
where
$x_i$ = the value of X for case i
and
$y_i$ = the value of Y for case i.
To construct a scatter plot we plot the points
$(x_i, y_i)$
for each case on the X-Y plane.
[Figure: a single point $(x_i, y_i)$ plotted at horizontal position $x_i$ and vertical position $y_i$]
Data Set #3
The following table gives data on Verbal IQ, Math IQ, Initial Reading Achievement Score, and Final Reading Achievement Score for 23 students who have recently completed a reading improvement program.

Student   Verbal IQ   Math IQ   Initial Reading Achievement   Final Reading Achievement
   1          86         94               1.1                          1.7
   2         104        103               1.5                          1.7
   3          86         92               1.5                          1.9
   4         105        100               2.0                          2.0
   5         118        115               1.9                          3.5
   6          96        102               1.4                          2.4
   7          90         87               1.5                          1.8
   8          95        100               1.4                          2.0
   9         105         96               1.7                          1.7
  10          84         80               1.6                          1.7
  11          94         87               1.6                          1.7
  12         119        116               1.7                          3.1
  13          82         91               1.2                          1.8
  14          80         93               1.0                          1.7
  15         109        124               1.8                          2.5
  16         111        119               1.4                          3.0
  17          89         94               1.6                          1.8
  18          99        117               1.6                          2.6
  19          94         93               1.4                          1.4
  20          99        110               1.4                          2.0
  21          95         97               1.5                          1.3
  22         102        104               1.7                          3.1
  23         102         93               1.6                          1.9
Scatter Plot
[Figure: scatter plot of Math IQ (vertical axis) against Verbal IQ (horizontal axis), both axes running from 0 to 140]
[Figure: the same scatter plot with the point (84, 80) labelled]
[Figure: the same scatter plot with both axes rescaled to run from 60 to 130]
Some Scatter Patterns
[Figure: scatter plot; horizontal axis 40 to 140, vertical axis -100 to 250]
• Circular
• No relationship between X and Y
• Unable to predict Y from X
[Figure: scatter plot; horizontal axis 40 to 140, vertical axis 0 to 160]
• Ellipsoidal
• Positive relationship between X and Y
• Increases in X correspond to increases in Y (but not always)
• Major axis of the ellipse has positive slope
Example
Verbal IQ, Math IQ
[Figure: scatter plot of Math IQ (vertical axis) against Verbal IQ (horizontal axis), both axes running from 60 to 130]
Some More Patterns
[Figure: scatter plot; horizontal axis 40 to 140, vertical axis 0 to 140]
• Ellipsoidal (thinner ellipse)
• Stronger positive relationship between X and Y
• Increases in X correspond to increases in Y (more frequently)
• Major axis of the ellipse has positive slope
• Minor axis of the ellipse much smaller
[Figure: scatter plot; horizontal axis 40 to 140, vertical axis 0 to 140]
• Increased strength in the positive relationship between X and Y
• Increases in X correspond to increases in Y (almost always)
• Minor axis of the ellipse extremely small relative to the major axis of the ellipse.
[Figure: scatter plot; horizontal axis 40 to 140, vertical axis 0 to 140]
• Perfect positive relationship between X and Y
• Y perfectly predictable from X
• Data falls exactly along a straight line with positive slope
[Figure: scatter plot; horizontal axis 40 to 140, vertical axis 0 to 140]
• Ellipsoidal
• Negative relationship between X and Y
• Increases in X correspond to decreases in Y (but not always)
• Major axis of the ellipse has negative slope
• The strength of the relationship can increase until changes in Y can be perfectly predicted from X
[Figures: five further scatter plots; horizontal axes 40 to 140, vertical axes 0 to 140]
Some Non-Linear Patterns
[Figure: scatter plot of a non-linear pattern; horizontal axis -20 to 50, vertical axis 0 to 1200]
• In a linear pattern, Y increases with respect to X at a constant rate
• In a non-linear pattern, the rate at which Y increases with respect to X is variable
Growth Patterns
[Figures: three growth-pattern plots; horizontal axes 0 to 50]
• Growth patterns frequently follow a sigmoid curve
• Growth at the start is slow
• It then speeds up
• It slows down again as it approaches its limiting size
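As a rough numerical illustration of these bullets, the sketch below uses the logistic function, one common sigmoid curve. The choice of curve and the parameter values (limiting size 100, growth rate 0.3, midpoint 25) are assumptions made purely for illustration; they are not taken from the slides.

```python
import math


def logistic(t, limit=100.0, rate=0.3, midpoint=25.0):
    """Size at time t under a logistic (sigmoid) growth curve (assumed model)."""
    return limit / (1.0 + math.exp(-rate * (t - midpoint)))


times = list(range(0, 51, 5))                 # 0, 5, ..., 50
sizes = [logistic(t) for t in times]

# Growth over each 5-unit interval: small at first, largest near the
# midpoint, then small again as the size approaches its limit.
growth = [later - earlier for earlier, later in zip(sizes, sizes[1:])]

for t, g in zip(times, growth):
    print(f"t = {t:2d} to {t + 5:2d}: growth = {g:6.2f}")
```

The printed increments rise towards the middle of the time range and fall away again, which is the slow, fast, slow pattern the bullets describe.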
Measures of strength of a relationship (Correlation)
• Pearson’s correlation coefficient (r)
• Spearman’s rank correlation coefficient (rho, ρ)
Assume that we have collected data on two variables X and Y. Let
$(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$
denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).
From this data we can compute summary statistics for each variable.
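The means
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \quad \text{and} \quad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$
The standard deviations
$$s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \quad \text{and} \quad s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$$
These statistics ($\bar{x}$, $\bar{y}$, $s_x$, $s_y$):
• give information for each variable separately
but
• give no information about the relationship between the two variables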
Consider the statistics:
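$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$$
$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$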
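The first two statistics
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \quad \text{and} \quad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
• are used to measure variability in each variable
• are used to compute the sample standard deviations
$$s_x = \sqrt{\frac{S_{xx}}{n - 1}} \quad \text{and} \quad s_y = \sqrt{\frac{S_{yy}}{n - 1}}$$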
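The link between the sums of squares and the sample standard deviations can be checked numerically. The sketch below uses the Verbal IQ (X) and Math IQ (Y) columns of Data Set #3; the helper name `sum_sq_dev` is illustrative, and the comparison against `statistics.stdev` is only a cross-check, not part of the slides.

```python
import math
import statistics

# Verbal IQ (X) and Math IQ (Y) for the 23 students of Data Set #3.
verbal_iq = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
             82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]
math_iq = [94, 103, 92, 100, 115, 102, 87, 100, 96, 80, 87, 116,
           91, 93, 124, 119, 94, 117, 93, 110, 97, 104, 93]


def sum_sq_dev(values):
    """Sum of squared deviations from the mean (S_xx or S_yy)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


n = len(verbal_iq)
s_xx = sum_sq_dev(verbal_iq)
s_yy = sum_sq_dev(math_iq)

# Sample standard deviations: s = sqrt(S / (n - 1))
s_x = math.sqrt(s_xx / (n - 1))
s_y = math.sqrt(s_yy / (n - 1))

print(f"S_xx = {s_xx:.3f}, s_x = {s_x:.3f}")
print(f"S_yy = {s_yy:.3f}, s_y = {s_y:.3f}")
print("matches statistics.stdev:",
      math.isclose(s_x, statistics.stdev(verbal_iq)),
      math.isclose(s_y, statistics.stdev(math_iq)))
```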
The third statistic
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
• is used to measure correlation
• If two variables are positively related, the sign of $x_i - \bar{x}$ will usually agree with the sign of $y_i - \bar{y}$:
  • When $x_i - \bar{x}$ is positive, $y_i - \bar{y}$ will be positive (when $x_i$ is above its mean, $y_i$ will be above its mean).
  • When $x_i - \bar{x}$ is negative, $y_i - \bar{y}$ will be negative (when $x_i$ is below its mean, $y_i$ will be below its mean).
The product $(x_i - \bar{x})(y_i - \bar{y})$ will be positive for most cases.
This implies that the statistic $S_{xy}$
• will be positive
• Most of the terms in this sum will be positive
On the other hand
• If two variables are negatively related, the sign of $x_i - \bar{x}$ will usually be opposite to the sign of $y_i - \bar{y}$:
  • When $x_i - \bar{x}$ is positive, $y_i - \bar{y}$ will be negative (when $x_i$ is above its mean, $y_i$ will be below its mean).
  • When $x_i - \bar{x}$ is negative, $y_i - \bar{y}$ will be positive (when $x_i$ is below its mean, $y_i$ will be above its mean).
The product $(x_i - \bar{x})(y_i - \bar{y})$ will be negative for most cases.
This again implies that the statistic
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
• will be negative
• Most of the terms in this sum will be negative
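The sign argument above can be checked by counting the signs of the individual products. The sketch below does this for Verbal IQ (X) and Math IQ (Y) from Data Set #3. To illustrate the negative case it also uses an artificial variable, the negated Math IQ scores; that variable is a construction for this example only and does not appear in the slides.

```python
# Verbal IQ (X) and Math IQ (Y) for the 23 students of Data Set #3.
verbal_iq = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
             82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]
math_iq = [94, 103, 92, 100, 115, 102, 87, 100, 96, 80, 87, 116,
           91, 93, 124, 119, 94, 117, 93, 110, 97, 104, 93]


def deviation_products(xs, ys):
    """Return the list of products (x_i - x_bar) * (y_i - y_bar)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    return [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]


# Positively related pair: most products are positive, so S_xy > 0.
products = deviation_products(verbal_iq, math_iq)
n_positive = sum(p > 0 for p in products)
n_negative = sum(p < 0 for p in products)
s_xy = sum(products)
print(f"{n_positive} positive terms, {n_negative} negative terms, S_xy = {s_xy:.3f}")

# Artificial negatively related pair (X, -Y): every product changes sign.
flipped = deviation_products(verbal_iq, [-y for y in math_iq])
print(f"{sum(p < 0 for p in flipped)} negative terms, S_xy = {sum(flipped):.3f}")
```

For these data 19 of the 23 products are positive and 4 are negative, which is the "most cases" behaviour described above.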
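Pearson’s correlation coefficient is defined as:
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$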
The denominator:
$$\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}$$
is always positive (provided neither variable is constant).
The numerator:
$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
• is positive if there is a positive relationship between X and Y, and
• negative if there is a negative relationship between X and Y.
• This property carries over to Pearson’s correlation coefficient r
Properties of Pearson’s correlation coefficient r
1. The value of r is always between –1 and +1.
2. If the relationship between X and Y is positive, then r will be positive.
3. If the relationship between X and Y is negative, then r will be negative.
4. If there is no linear relationship between X and Y, then r will be zero (or, in a sample, close to zero).
5. The value of r will be +1 if the points $(x_i, y_i)$ lie on a straight line with positive slope.
6. The value of r will be –1 if the points $(x_i, y_i)$ lie on a straight line with negative slope.
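A few of these properties can be checked with a direct implementation of the definition. In the sketch below the function name `pearson_r` and all data values are hypothetical, chosen only to illustrate properties 5 and 6. The last example is an addition, not from the slides: a symmetric curved pattern can give r = 0 even though Y is completely determined by X, because r measures straight-line association.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r = S_xy / sqrt(S_xx * S_yy), computed from deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


xs = [40, 60, 80, 100, 120, 140]            # hypothetical X values

line_up = [0.5 * x + 10 for x in xs]        # straight line, positive slope
line_down = [150 - 0.75 * x for x in xs]    # straight line, negative slope
curved = [(x - 90) ** 2 for x in xs]        # symmetric curve about the mean of X

print(f"positive slope: r = {pearson_r(xs, line_up):+.3f}")
print(f"negative slope: r = {pearson_r(xs, line_down):+.3f}")
print(f"curved pattern: r = {pearson_r(xs, curved):+.3f}")
```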
[Figures: ten scatter plots illustrating r = 1, r = 0.95, r = 0.7, r = 0.4, r = 0, r = -0.4, r = -0.7, r = -0.8, r = -0.95 and r = -1]
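Computing formulae for the statistics:
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$
$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$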
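To compute $S_{xx}$, $S_{yy}$ and $S_{xy}$, first compute
$$A = \sum_{i=1}^{n} x_i, \quad B = \sum_{i=1}^{n} y_i, \quad C = \sum_{i=1}^{n} x_i^2, \quad D = \sum_{i=1}^{n} y_i^2, \quad E = \sum_{i=1}^{n} x_i y_i$$
Then
$$S_{xx} = C - \frac{A^2}{n}, \quad S_{yy} = D - \frac{B^2}{n}, \quad S_{xy} = E - \frac{A B}{n}$$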
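The sketch below checks, on a small made-up data set, that the computing formulae built from the totals A to E give the same values as the definitions in terms of deviations from the means. The function names and the four data pairs are hypothetical and serve only as an illustration.

```python
import math


def sums_via_totals(xs, ys):
    """S_xx, S_yy, S_xy from the five totals A, B, C, D, E."""
    n = len(xs)
    a = sum(xs)                                # A = sum of x_i
    b = sum(ys)                                # B = sum of y_i
    c = sum(x * x for x in xs)                 # C = sum of x_i^2
    d = sum(y * y for y in ys)                 # D = sum of y_i^2
    e = sum(x * y for x, y in zip(xs, ys))     # E = sum of x_i * y_i
    return c - a * a / n, d - b * b / n, e - a * b / n


def sums_via_deviations(xs, ys):
    """S_xx, S_yy, S_xy directly from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xx, s_yy, s_xy


xs = [1, 2, 4, 7]      # hypothetical data
ys = [3, 1, 6, 8]

print(sums_via_totals(xs, ys))       # (21.0, 29.0, 22.0)
print(sums_via_deviations(xs, ys))   # (21.0, 29.0, 22.0)
```

The totals form needs only one pass over the data, which is why it suits hand calculation. In floating-point arithmetic it can lose accuracy when the mean is large relative to the spread, so the deviation form is often preferred in software.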
Example
Verbal IQ, Math IQ (Data Set #3, tabulated earlier)
[Figure: scatter plot of Math IQ (vertical axis) against Verbal IQ (horizontal axis), both axes running from 60 to 130]
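Now
$$\sum_{i=1}^{n} x_i = 2244, \quad \sum_{i=1}^{n} y_i = 2307, \quad \sum_{i=1}^{n} x_i^2 = 221494, \quad \sum_{i=1}^{n} y_i^2 = 234363, \quad \sum_{i=1}^{n} x_i y_i = 227199$$
Hence
$$S_{xx} = 221494 - \frac{2244^2}{23} = 2557.652$$
$$S_{yy} = 234363 - \frac{2307^2}{23} = 2960.87$$
$$S_{xy} = 227199 - \frac{(2244)(2307)}{23} = 2116.043$$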
Thus Pearson’s correlation coefficient is:
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{2116.043}{\sqrt{(2557.652)(2960.87)}} = 0.769$$
Thus r = 0.769
• Verbal IQ and Math IQ are positively correlated.
• If Verbal IQ is above (below) the mean then for most cases Math IQ will also be above (below) the mean.
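The worked example can be reproduced from the raw scores. The sketch below recomputes the five totals, the three sums of squares and products, and r for Verbal IQ (X) and Math IQ (Y) from Data Set #3; the variable names are illustrative.

```python
import math

# Verbal IQ (X) and Math IQ (Y) for the 23 students of Data Set #3.
verbal_iq = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
             82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]
math_iq = [94, 103, 92, 100, 115, 102, 87, 100, 96, 80, 87, 116,
           91, 93, 124, 119, 94, 117, 93, 110, 97, 104, 93]

n = len(verbal_iq)

# The five totals used by the computing formulae
sum_x = sum(verbal_iq)
sum_y = sum(math_iq)
sum_x2 = sum(x * x for x in verbal_iq)
sum_y2 = sum(y * y for y in math_iq)
sum_xy = sum(x * y for x, y in zip(verbal_iq, math_iq))

# Sums of squares and products
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

# Pearson's correlation coefficient
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"totals: {sum_x}, {sum_y}, {sum_x2}, {sum_y2}, {sum_xy}")
print(f"S_xx = {s_xx:.3f}, S_yy = {s_yy:.3f}, S_xy = {s_xy:.3f}")
print(f"r = {r:.3f}")
```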
Is the improvement in reading achievement (RA) related to either Verbal IQ or Math IQ?
improvement in RA = Final RA – Initial RA
The Data

Student   Math IQ   Verbal IQ   Initial RA   Final RA   Imp RA
   1         86         94         1.1         1.7        0.6
   2        104        103         1.5         1.7        0.2
   3         86         92         1.5         1.9        0.4
   4        105        100         2.0         2.0        0.0
   5        118        115         1.9         3.5        1.6
   6         96        102         1.4         2.4        1.0
   7         90         87         1.5         1.8        0.3
   8         95        100         1.4         2.0        0.6
   9        105         96         1.7         1.7        0.0
  10         84         80         1.6         1.7        0.1
  11         94         87         1.6         1.7        0.1
  12        119        116         1.7         3.1        1.4
  13         82         91         1.2         1.8        0.6
  14         80         93         1.0         1.7        0.7
  15        109        124         1.8         2.5        0.7
  16        111        119         1.4         3.0        1.6
  17         89         94         1.6         1.8        0.2
  18         99        117         1.6         2.6        1.0
  19         94         93         1.4         1.4        0.0
  20         99        110         1.4         2.0        0.6
  21         95         97         1.5         1.3       -0.2
  22        102        104         1.7         3.1        1.4
  23        102         93         1.6         1.9        0.3
Correlation between Math IQ and RA Improvement
r = 0.48469
Correlation between Verbal IQ and RA Improvement
r = 0.68318
[Figure: Scatterplot: Math IQ vs RA Improvement, r = 0.48469; horizontal axis 70 to 120, vertical axis -0.4 to 1.6]
[Figure: Scatterplot: Verbal IQ vs RA Improvement, r = 0.68318; horizontal axis 70 to 130, vertical axis -0.4 to 1.6]
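The two correlations can be reproduced from the table of this section. In the sketch below the IQ columns carry the labels printed in that table; Data Set #3 earlier in the transcript lists the same two columns under the opposite labels, so which column is Math IQ and which is Verbal IQ is an assumption carried over from this table. The function name `pearson_r` is illustrative.

```python
import math

# IQ columns as labelled in the table of this section (assumption: the
# labels here are the intended ones; Data Set #3 labels them the other way).
math_iq = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
           82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]
verbal_iq = [94, 103, 92, 100, 115, 102, 87, 100, 96, 80, 87, 116,
             91, 93, 124, 119, 94, 117, 93, 110, 97, 104, 93]
initial_ra = [1.1, 1.5, 1.5, 2.0, 1.9, 1.4, 1.5, 1.4, 1.7, 1.6, 1.6, 1.7,
              1.2, 1.0, 1.8, 1.4, 1.6, 1.6, 1.4, 1.4, 1.5, 1.7, 1.6]
final_ra = [1.7, 1.7, 1.9, 2.0, 3.5, 2.4, 1.8, 2.0, 1.7, 1.7, 1.7, 3.1,
            1.8, 1.7, 2.5, 3.0, 1.8, 2.6, 1.4, 2.0, 1.3, 3.1, 1.9]

# Improvement in RA = Final RA - Initial RA (rounded to the table's precision)
improvement = [round(f - i, 1) for i, f in zip(initial_ra, final_ra)]


def pearson_r(xs, ys):
    """Pearson's r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r_math = pearson_r(math_iq, improvement)
r_verbal = pearson_r(verbal_iq, improvement)
print(f"Math IQ vs RA improvement:   r = {r_math:.5f}")
print(f"Verbal IQ vs RA improvement: r = {r_verbal:.5f}")
```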