Summarizing Data

Graphical Methods

Stem-Leaf Diagram (Verbal IQ)

8  | 0 2 4 6 6 9
9  | 0 4 4 5 5 6 9 9
10 | 2 2 4 5 5 9
11 | 1 8 9
12 |

Histogram

[Figure: histogram with classes 70 to 80, 80 to 90, 90 to 100, 100 to 110, 110 to 120 and 120 to 130 on the horizontal axis and frequency (0 to 8) on the vertical axis]

Grouped Freq Table

Class        Verbal IQ   Math IQ
70 to 80     1           1
80 to 90     6           2
90 to 100    7           11
100 to 110   6           4
110 to 120   3           4
120 to 130   0           1

Box-whisker Plot

Measure of Central Location

1. Mean

2. Median

Measure of Variability (Dispersion, Spread)

1. Range

2. Inter-Quartile Range

3. Variance, standard deviation

4. Pseudo-standard deviation
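As a rough illustration of the measures listed above, the sketch below computes each of them for the 23 Verbal IQ scores of Data Set #3 using Python's standard `statistics` module. The quartile convention (`method="inclusive"`) and the definition of the pseudo-standard deviation as IQR / 1.35 are assumptions; the slides do not say which conventions they use.

```python
import statistics

# Verbal IQ scores of the 23 students in Data Set #3, sorted.
verbal_iq = [80, 82, 84, 86, 86, 89,
             90, 94, 94, 95, 95, 96, 99, 99,
             102, 102, 104, 105, 105, 109,
             111, 118, 119]

# Measures of central location
mean = statistics.mean(verbal_iq)
median = statistics.median(verbal_iq)

# Measures of variability
data_range = max(verbal_iq) - min(verbal_iq)
q1, _, q3 = statistics.quantiles(verbal_iq, n=4, method="inclusive")
iqr = q3 - q1                                # inter-quartile range
variance = statistics.variance(verbal_iq)    # sample variance, divisor n - 1
std_dev = statistics.stdev(verbal_iq)        # sample standard deviation
pseudo_sd = iqr / 1.35                       # assumed definition: IQR / 1.35

print(f"mean={mean:.2f} median={median} range={data_range}")
print(f"IQR={iqr:.2f} variance={variance:.2f} sd={std_dev:.2f} pseudo-sd={pseudo_sd:.2f}")
```

Other quartile conventions give slightly different values for the IQR and hence for the pseudo-standard deviation.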

Descriptive techniques for Multivariate data

In most research situations data is collected on more than one variable (usually many variables)

Graphical Techniques

• The scatter plot

• The two dimensional Histogram

The Scatter Plot

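For two variables X and Y we will have a measurement for each variable on each case:

$x_i, y_i$

$x_i$ = the value of X for case i

and

$y_i$ = the value of Y for case i.

To construct a scatter plot we plot the points:

$(x_i, y_i)$

for each case on the X-Y plane.

[Figure: a single point $(x_i, y_i)$ plotted on the X-Y plane, with $x_i$ marked on the horizontal axis and $y_i$ on the vertical axis]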
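The construction can be sketched in a few lines of Python. The example pairs the two measurements for each case and marks the points on a coarse character grid. The helper `ascii_scatter` and the grid size are hypothetical choices made for illustration, and only six students from Data Set #3 (students 1, 2, 4, 5, 7 and 10) are used to keep the output small.

```python
def ascii_scatter(xs, ys, width=40, height=12):
    """Return a text scatter plot with one '*' per (x, y) point.

    Assumes xs and ys each contain at least two distinct values.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each coordinate to a column / row index on the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"   # row 0 is the top line, so flip y
    return "\n".join("".join(line) for line in grid)

# x_i = Verbal IQ and y_i = Math IQ for six of the students.
x = [86, 104, 105, 118, 90, 84]
y = [94, 103, 100, 115, 87, 80]

points = list(zip(x, y))   # the pairs (x_i, y_i), one per case
print(points)
print(ascii_scatter(x, y))
```

In practice a plotting library would draw the same pairs with proper axes; the grid here only shows that a scatter plot is nothing more than the points $(x_i, y_i)$ placed by their coordinates.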

Data Set #3

The following table gives data on Verbal IQ, Math IQ, Initial Reading Achievement Score, and Final Reading Achievement Score for 23 students who have recently completed a reading improvement program.

Student   Verbal IQ   Math IQ   Initial Reading   Final Reading
                                Achievement       Achievement
1         86          94        1.1               1.7
2         104         103       1.5               1.7
3         86          92        1.5               1.9
4         105         100       2.0               2.0
5         118         115       1.9               3.5
6         96          102       1.4               2.4
7         90          87        1.5               1.8
8         95          100       1.4               2.0
9         105         96        1.7               1.7
10        84          80        1.6               1.7
11        94          87        1.6               1.7
12        119         116       1.7               3.1
13        82          91        1.2               1.8
14        80          93        1.0               1.7
15        109         124       1.8               2.5
16        111         119       1.4               3.0
17        89          94        1.6               1.8
18        99          117       1.6               2.6
19        94          93        1.4               1.4
20        99          110       1.4               2.0
21        95          97        1.5               1.3
22        102         104       1.7               3.1
23        102         93        1.6               1.9

Scatter Plot

[Figure: scatter plot of Math IQ (vertical axis) against Verbal IQ (horizontal axis) for the 23 students, drawn first with both axes running from 0 to 140 and the point (84, 80) labeled, then with both axes rescaled to run from 60 to 130]

Some Scatter Patterns

• Circular

• No relationship between X and Y

• Unable to predict Y from X

• Ellipsoidal

• Positive relationship between X and Y

• Increases in X correspond to increases in Y (but not always)

• Major axis of the ellipse has positive slope

Example

Verbal IQ, Math IQ

[Figure: scatter plot of Math IQ (vertical axis) against Verbal IQ (horizontal axis), both axes from 60 to 130]

Some More Patterns

• Ellipsoidal (thinner ellipse)

• Stronger positive relationship between X and Y

• Increases in X correspond to increases in Y (more frequently)

• Major axis of the ellipse has positive slope

• Minor axis of the ellipse much smaller

• Increased strength in the positive relationship between X and Y

• Increases in X correspond to increases in Y (almost always)

• Minor axis of the ellipse extremely small relative to the major axis of the ellipse.

• Perfect positive relationship between X and Y

• Y perfectly predictable from X

• Data falls exactly along a straight line with positive slope

• Ellipsoidal

• Negative relationship between X and Y

• Increases in X correspond to decreases in Y (but not always)

• Major axis of the ellipse has negative slope

• The strength of the relationship can increase until changes in Y can be perfectly predicted from X

Some Non-Linear Patterns

• In a linear pattern Y increases with respect to X at a constant rate

• In a non-linear pattern the rate at which Y increases with respect to X is variable
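The distinction can be checked numerically by looking at the change in Y between equally spaced values of X. The two patterns below (a straight line and a quadratic) are hypothetical examples chosen for illustration.

```python
def first_differences(values):
    """Change in Y between consecutive, equally spaced X values."""
    return [b - a for a, b in zip(values, values[1:])]

xs = list(range(6))                  # X = 0, 1, ..., 5 (equal steps)
linear = [3 * x + 2 for x in xs]     # hypothetical linear pattern
nonlinear = [x ** 2 for x in xs]     # hypothetical non-linear pattern

print(first_differences(linear))     # [3, 3, 3, 3, 3]  constant rate
print(first_differences(nonlinear))  # [1, 3, 5, 7, 9]  rate changes with X
```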

Growth Patterns

• Growth patterns frequently follow a sigmoid curve

• Growth at the start is slow

• It then speeds up

• It slows down again as it reaches its limiting size
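One common sigmoid curve is the logistic function, used here only as an example of the shape described above. The limiting size, rate and midpoint in the sketch are made-up values, not parameters taken from the slides.

```python
import math

def logistic(t, limit=100.0, rate=0.3, midpoint=25.0):
    """Logistic curve: one common sigmoid model of growth toward a limiting size."""
    return limit / (1.0 + math.exp(-rate * (t - midpoint)))

times = [0, 10, 20, 30, 40, 50]
sizes = [logistic(t) for t in times]

# Growth over each 10-unit interval: small, then large, then small again.
growth = [later - earlier for earlier, later in zip(sizes, sizes[1:])]

for t, s in zip(times, sizes):
    print(f"t={t:2d}  size={s:6.2f}")
print([round(g, 2) for g in growth])
```

The printed increments rise toward the middle interval and then fall away as the size approaches the limit of 100.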

0

20

40

60

80

100

120

0 10 20 30 40 50

Measures of strength of a relationship (Correlation)

• Pearson’s correlation coefficient (r)

• Spearman’s rank correlation coefficient (rho, $\rho$)

Assume that we have collected data on two variables X and Y. Let

$(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$

denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).

From this data we can compute summary statistics for each variable.

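The means

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \quad \text{and} \quad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

The standard deviations

$$s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \quad \text{and} \quad s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$$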

These statistics ($\bar{x}$, $\bar{y}$, $s_x$, $s_y$):

• give information for each variable separately

but

• give no information about the relationship between the two variables
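A small made-up example makes the point concrete: the same Y values can be paired with X in two different orders, leaving every single-variable summary unchanged while the relationship between X and Y is reversed.

```python
import statistics

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]      # y rises with x
y_down = [10, 8, 6, 4, 2]    # the same y values, paired in reverse order

# The single-variable summaries are identical for both pairings ...
print(statistics.mean(y_up), statistics.stdev(y_up))
print(statistics.mean(y_down), statistics.stdev(y_down))

# ... but the (x, y) pairs describe opposite relationships.
print(list(zip(x, y_up)))
print(list(zip(x, y_down)))
```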

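Consider the statistics:

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$$

$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$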

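The first two statistics:

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \quad \text{and} \quad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

• are used to measure variability in each variable

• they are used to compute the sample standard deviations

$$s_x = \sqrt{\frac{S_{xx}}{n - 1}} \quad \text{and} \quad s_y = \sqrt{\frac{S_{yy}}{n - 1}}$$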

The third statistic:

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

• is used to measure correlation

• If two variables are positively related, the sign of $x_i - \bar{x}$ will tend to agree with the sign of $y_i - \bar{y}$.

• When $x_i - \bar{x}$ is positive, $y_i - \bar{y}$ will tend to be positive: when $x_i$ is above its mean, $y_i$ will tend to be above its mean.

• When $x_i - \bar{x}$ is negative, $y_i - \bar{y}$ will tend to be negative: when $x_i$ is below its mean, $y_i$ will tend to be below its mean.

The product $(x_i - \bar{x})(y_i - \bar{y})$ will be positive for most cases.

This implies that the statistic $S_{xy}$

• will be positive

• Most of the terms in this sum will be positive
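To see this with the slides' own data, the sketch below counts how many of the 23 products $(x_i - \bar{x})(y_i - \bar{y})$ are positive for the Verbal IQ (x) and Math IQ (y) scores of Data Set #3. The variable names are illustrative.

```python
# Verbal IQ (x) and Math IQ (y) for students 1..23 of Data Set #3.
verbal = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
          82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]
math_iq = [94, 103, 92, 100, 115, 102, 87, 100, 96, 80, 87, 116,
           91, 93, 124, 119, 94, 117, 93, 110, 97, 104, 93]

n = len(verbal)
x_bar = sum(verbal) / n
y_bar = sum(math_iq) / n

# One product (x_i - x_bar)(y_i - y_bar) per student.
products = [(x - x_bar) * (y - y_bar) for x, y in zip(verbal, math_iq)]
positive = sum(1 for p in products if p > 0)
s_xy = sum(products)

print(f"{positive} of {n} products are positive")
print(f"S_xy = {s_xy:.3f}")
```

Most of the products are positive, and so is their sum $S_{xy}$.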

On the other hand

• If two variables are negatively related, the sign of $x_i - \bar{x}$ will tend to be opposite to the sign of $y_i - \bar{y}$.

• When $x_i - \bar{x}$ is positive, $y_i - \bar{y}$ will tend to be negative: when $x_i$ is above its mean, $y_i$ will tend to be below its mean.

• When $x_i - \bar{x}$ is negative, $y_i - \bar{y}$ will tend to be positive: when $x_i$ is below its mean, $y_i$ will tend to be above its mean.

The product $(x_i - \bar{x})(y_i - \bar{y})$ will be negative for most cases.

Again this implies that the statistic

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

• will be negative

• Most of the terms in this sum will be negative

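Pearson’s correlation coefficient is defined as below:

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$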

The denominator:

$$\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}$$

is always positive (provided neither variable is constant)

The numerator:

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

• is positive if there is a positive relationship between X and Y and

• negative if there is a negative relationship between X and Y.

• This property carries over to Pearson’s correlation coefficient r

Properties of Pearson’s correlation coefficient r

1. The value of r is always between -1 and +1.

2. If the relationship between X and Y is positive, then r will be positive.

3. If the relationship between X and Y is negative, then r will be negative.

4. If there is no linear relationship between X and Y, then r will be zero (or close to zero in a sample).

5. The value of r will be +1 if the points $(x_i, y_i)$ lie on a straight line with positive slope.

6. The value of r will be -1 if the points $(x_i, y_i)$ lie on a straight line with negative slope.
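Some of these properties can be checked on small made-up data sets. The sketch below computes r from its definition; the function name `pearson_r` and the three patterns are illustrative. The third pattern is a symmetric curve, included to show that r measures linear association: Y is completely determined by X there, yet r is zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from the deviation form: S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [-2, -1, 0, 1, 2]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])      # line with positive slope
r_down = pearson_r(xs, [-3 * x + 4 for x in xs])   # line with negative slope
r_curve = pearson_r(xs, [x ** 2 for x in xs])      # symmetric curve, no linear trend

print(r_up, r_down, r_curve)
```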

[Figure: a series of scatter plots illustrating r = 1, r = 0.95, r = 0.7, r = 0.4, r = 0, r = -0.4, r = -0.7, r = -0.8, r = -0.95 and r = -1]

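Computing formulae for the statistics:

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$

$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$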

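To compute $S_{xx}$, $S_{yy}$ and $S_{xy}$, first compute

$$A = \sum_{i=1}^{n} x_i, \quad B = \sum_{i=1}^{n} y_i, \quad C = \sum_{i=1}^{n} x_i^2, \quad D = \sum_{i=1}^{n} y_i^2, \quad E = \sum_{i=1}^{n} x_i y_i$$

Then

$$S_{xx} = C - \frac{A^2}{n}, \quad S_{yy} = D - \frac{B^2}{n}, \quad S_{xy} = E - \frac{AB}{n}$$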
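A minimal Python sketch of this recipe follows. The function names and the five-point data set are made up for illustration; the second function evaluates the defining sums of squared deviations directly, so the two routes can be compared.

```python
def sums_by_computing_formulae(xs, ys):
    """Return (S_xx, S_yy, S_xy) from the five totals A, B, C, D and E."""
    n = len(xs)
    A = sum(xs)                               # sum of x_i
    B = sum(ys)                               # sum of y_i
    C = sum(x * x for x in xs)                # sum of x_i squared
    D = sum(y * y for y in ys)                # sum of y_i squared
    E = sum(x * y for x, y in zip(xs, ys))    # sum of x_i * y_i
    return C - A * A / n, D - B * B / n, E - A * B / n

def sums_by_definition(xs, ys):
    """Return (S_xx, S_yy, S_xy) from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xx, s_yy, s_xy

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(sums_by_computing_formulae(xs, ys))   # (10.0, 6.0, 6.0)
print(sums_by_definition(xs, ys))           # (10.0, 6.0, 6.0)
```

The computing formulae need only one pass over the data, which is why they suit hand calculation. With floating-point data whose mean is large relative to its spread, the deviation form is usually the numerically safer choice.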

Example

Verbal IQ, MathIQ

The example uses the Verbal IQ (x) and Math IQ (y) scores of the 23 students in Data Set #3.

[Figure: scatter plot of Math IQ (vertical axis) against Verbal IQ (horizontal axis), both axes from 60 to 130]

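Now

$$\sum_{i=1}^{n} x_i = 2244, \quad \sum_{i=1}^{n} y_i = 2307, \quad \sum_{i=1}^{n} x_i^2 = 221494, \quad \sum_{i=1}^{n} y_i^2 = 234363, \quad \sum_{i=1}^{n} x_i y_i = 227199$$

Hence

$$S_{xx} = 221494 - \frac{2244^2}{23} = 2557.652$$

$$S_{yy} = 234363 - \frac{2307^2}{23} = 2960.87$$

$$S_{xy} = 227199 - \frac{(2244)(2307)}{23} = 2116.043$$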

Thus Pearson’s correlation coefficient is:

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{2116.043}{\sqrt{(2557.652)(2960.87)}} = 0.769$$

Thus r = 0.769

• Verbal IQ and Math IQ are positively correlated.

• If Verbal IQ is above (below) the mean then for most cases Math IQ will also be above (below) the mean.
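The arithmetic of this example can be reproduced with a short script. It is a sketch that follows the computing formulae, with A to E named as in the slides; the list names are illustrative.

```python
import math

# Verbal IQ (x) and Math IQ (y) for students 1..23 of Data Set #3.
verbal = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
          82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]
math_iq = [94, 103, 92, 100, 115, 102, 87, 100, 96, 80, 87, 116,
           91, 93, 124, 119, 94, 117, 93, 110, 97, 104, 93]

n = len(verbal)
A = sum(verbal)                                   # sum of x_i
B = sum(math_iq)                                  # sum of y_i
C = sum(x * x for x in verbal)                    # sum of x_i squared
D = sum(y * y for y in math_iq)                   # sum of y_i squared
E = sum(x * y for x, y in zip(verbal, math_iq))   # sum of x_i * y_i

s_xx = C - A * A / n
s_yy = D - B * B / n
s_xy = E - A * B / n
r = s_xy / math.sqrt(s_xx * s_yy)

print(A, B, C, D, E)
print(f"S_xx={s_xx:.3f} S_yy={s_yy:.3f} S_xy={s_xy:.3f} r={r:.3f}")
```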

Is the improvement in reading achievement (RA) related to either Verbal IQ or Math IQ?

improvement in RA = Final RA – Initial RA

The Data

Student   Verbal IQ   Math IQ   Initial RA   Final RA   Imp RA
1         86          94        1.1          1.7        0.6
2         104         103       1.5          1.7        0.2
3         86          92        1.5          1.9        0.4
4         105         100       2.0          2.0        0.0
5         118         115       1.9          3.5        1.6
6         96          102       1.4          2.4        1.0
7         90          87        1.5          1.8        0.3
8         95          100       1.4          2.0        0.6
9         105         96        1.7          1.7        0.0
10        84          80        1.6          1.7        0.1
11        94          87        1.6          1.7        0.1
12        119         116       1.7          3.1        1.4
13        82          91        1.2          1.8        0.6
14        80          93        1.0          1.7        0.7
15        109         124       1.8          2.5        0.7
16        111         119       1.4          3.0        1.6
17        89          94        1.6          1.8        0.2
18        99          117       1.6          2.6        1.0
19        94          93        1.4          1.4        0.0
20        99          110       1.4          2.0        0.6
21        95          97        1.5          1.3        -0.2
22        102         104       1.7          3.1        1.4
23        102         93        1.6          1.9        0.3

Correlation between Verbal IQ and RA Improvement

r = 0.48469

Correlation between Math IQ and RA Improvement

r = 0.68318

[Figure: scatterplot of RA Improvement (vertical axis) against Verbal IQ (horizontal axis, 70 to 120), r = 0.48469]

[Figure: scatterplot of RA Improvement (vertical axis) against Math IQ (horizontal axis, 70 to 130), r = 0.68318]
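The two correlations can be recomputed from the data with the sketch below. It assumes the column order of Data Set #3 (Verbal IQ first, Math IQ second) and uses an illustrative helper, `pearson_r`, that evaluates r from its definition.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from the deviation form: S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)

# Columns of Data Set #3 for students 1..23.
verbal = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
          82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]
math_iq = [94, 103, 92, 100, 115, 102, 87, 100, 96, 80, 87, 116,
           91, 93, 124, 119, 94, 117, 93, 110, 97, 104, 93]
initial_ra = [1.1, 1.5, 1.5, 2.0, 1.9, 1.4, 1.5, 1.4, 1.7, 1.6, 1.6, 1.7,
              1.2, 1.0, 1.8, 1.4, 1.6, 1.6, 1.4, 1.4, 1.5, 1.7, 1.6]
final_ra = [1.7, 1.7, 1.9, 2.0, 3.5, 2.4, 1.8, 2.0, 1.7, 1.7, 1.7, 3.1,
            1.8, 1.7, 2.5, 3.0, 1.8, 2.6, 1.4, 2.0, 1.3, 3.1, 1.9]

# Improvement in RA = Final RA - Initial RA, one value per student.
improvement = [final - initial for initial, final in zip(initial_ra, final_ra)]

r_verbal = pearson_r(verbal, improvement)
r_math = pearson_r(math_iq, improvement)
print(f"Verbal IQ vs RA improvement: r = {r_verbal:.5f}")
print(f"Math IQ vs RA improvement:   r = {r_math:.5f}")
```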
