Week 4 Lecture Notes - Asal Aslemand (asalaslemand.weebly.com/uploads/3/1/3/1/31310805/week_4...)

2019-01-22

Week 4 Lecture Notes

PSYC2021: Winter 2019

Describing Quantitative Data

Descriptive statistics:

• Mean

• Median

• Mode

• Standard deviation

• IQR (interquartile range): Upper Quartile (Q3) – Lower Quartile (Q1)

Table:

• Summary statistics tables (e.g., mean, standard deviation)

Graphical Display:

• Histogram

• Stem-and-Leaf plot

• Boxplot

Example of Describing and Displaying Quantitative Data

The Organisation for Economic Co-operation and Development (OECD) gathers information on OECD countries and their partners to promote policies that aim to improve the economic and social well-being of people around the world (http://www.oecd.org/about/).

• One domain is “Social Protection and Well-being”, which includes a yearly data collection called the “Better Life Index”. These data can be retrieved from: http://stats.oecd.org

• From the “Better Life Index 2017” (BLI, 2017), the most recent data collected in this domain, we will analyze a quantitative variable named “Educational Attainment”.

• Information regarding this variable can be retrieved from: http://www.oecd.org/statistics/OECD-Better-Life-Index-2017-definitions.pdf

A Quantitative Variable: Educational Attainment Percentages (BLI, 2017)

Educational Attainment considers the number of adults aged 25 to 64 holding at least an upper secondary degree over the population of the same age, as defined by the ISCED classification (International Standard Classification of Education).

Unit of measurement: Percentage of adult population (aged 25 to 64)

Additional information: Gender Inequality

The number of OECD countries was n = 35.

• Information regarding this variable can be retrieved from: http://www.oecd.org/statistics/OECD-Better-Life-Index-2017-definitions.pdf

Read the Data File: Percentages of Educational Attainment (BLI, 2017) in R

Call the Data File “Educational Attainment” in R

Sort the Percentages of Educational Attainment in R
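
The R screenshots for reading, displaying, and sorting the data are not reproduced in this transcript. As a minimal sketch of those three steps, the code below writes a tiny made-up CSV file to a temporary location, reads it, prints it, and sorts one column. The file itself, the column names `Country` and `EducationalAttainment`, the object name `bli`, and the three values are all hypothetical placeholders, not the BLI 2017 data or the course's actual file.

```r
# Hypothetical stand-in for the course data file: three made-up rows,
# written to a temporary CSV so the sketch runs on its own.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Country,EducationalAttainment", "A,7", "B,2", "C,3"), csv_path)

# Read the file into a data frame and display ("call") it.
bli <- read.csv(csv_path)
print(bli)

# Sort the variable from smallest to largest.
sorted_values <- sort(bli$EducationalAttainment)
print(sorted_values)
```

With the real file, only the path passed to `read.csv()` and the column name would change.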

Centre and Central Tendency

Centre:

A typical value in the data.

Central tendency:

• It is a statistical measure (e.g., average, middle) that determines a single score defining the centre of a distribution.

• The goal of central tendency is to find the single score that is most typical or most representative of the entire group.

• The three measures of central tendency are Mean, Median, and Mode.

Mean of a Quantitative Variable

Think of the mean (average) as:

• The amount each individual receives when the total is divided equally among all individuals in the distribution.

• The mean is a balance point.

• The mean is located between the lowest and the highest data value.

• The total distance below the mean is the same as the total distance above the mean.

Finding the Mean or Average:

We sum all observations of the variable of interest and divide by the total number of cases for that variable.

Note:

• For (approximately) symmetric and bell-shaped distributions, we report the mean (average).

• The mean is influenced by extremely large or small observations (unusual, rare observations) in the data.

• The mean is not resistant (that is, it is sensitive) to extreme values in the data.

Mean of a Quantitative Variable

• The mean of a population is denoted by $\mu$.

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N} = \frac{\text{sum of all values in the population}}{\text{population size}}$$

The symbol $\Sigma$ (the Greek upper-case letter sigma) denotes a sum, taken over all values of $i$ from 1 to $N$.

• The mean of a sample is denoted by $\bar{y}$ (“y-bar”), or by $\bar{x}$ or $M$.

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{\text{sum of all data values in the sample}}{\text{sample size (number of cases in the sample)}}$$

Here the sum $\Sigma$ is taken over all values of $i$ from 1 to $n$.
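
As a small illustration of the sample-mean formula, the following R sketch computes the mean both by the formula and with the built-in `mean()`. The vector 2, 3, 7 is the toy data set these notes use later for the variance example; the object names are arbitrary.

```r
# Toy data: the three values used later in these notes.
y <- c(2, 3, 7)

# Sample mean by the formula: sum of the values divided by the number of cases.
n <- length(y)
y_bar_by_hand <- sum(y) / n

# Built-in equivalent.
y_bar <- mean(y)

print(y_bar_by_hand)
print(y_bar)
```

Both lines print 4, since (2 + 3 + 7)/3 = 4.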

Sample Mean: Percentages of Educational Attainment (BLI, 2017)

Estimated mean percentage of educational attainment:

Sample mean: $$\bar{y} = \frac{37 + 39 + \cdots + 94}{35} = 78.51$$

Interpretation:

On average across the 35 OECD countries, 78.51% of adults aged 25 to 64 hold at least an upper secondary degree, as defined by the ISCED classification.

Use Frequency Distribution Table to Find Mean

Compute the mean from a frequency distribution table:

$$\bar{y} = \frac{\sum (\text{frequency} \times \text{data value})}{n}$$

In our example of educational attainment percentages:

$$\bar{y} = \frac{(1 \times 37) + (1 \times 39) + \cdots + (1 \times 94)}{35} = 78.51$$
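
The frequency-table formula can be sketched in R as follows. The four data values are hypothetical (chosen so that one value repeats) and are not the BLI 2017 data; `table()` builds the frequency distribution table.

```r
# Hypothetical data with a repeated value (not the BLI 2017 data).
y <- c(2, 3, 3, 7)

# Frequency distribution table: one entry per distinct data value.
freq_table <- table(y)
values <- as.numeric(names(freq_table))
freqs <- as.vector(freq_table)

# Mean = sum(frequency * data value) / n, where n is the total frequency.
mean_from_table <- sum(freqs * values) / sum(freqs)

print(freq_table)
print(mean_from_table)
print(mean(y))  # same result from the raw data
```

Here (1 × 2 + 2 × 3 + 1 × 7)/4 = 15/4 = 3.75, which matches `mean(y)`.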

Median of a Quantitative Distribution

• For skewed distributions, we report the median.

• The median is the middle value when the data are sorted from smallest to largest.

• To find the median, list the data values from smallest to largest. The median is the midpoint (middle number) of the list.

• The median is the 50th percentile. It is a point on the measurement scale such that 50% of the data values are below it and 50% of the data values are above it.

Sample Median:

• With an odd number of observations: the value at position $\frac{n+1}{2}$ (the middle number in the ordered data).

• With an even number of observations: the average of the values at positions $\frac{n}{2}$ and $\frac{n}{2} + 1$ (the average of the two middle numbers in the ordered data).

Note:

• The median is resistant (not sensitive) to values that are extremely large or small, because it depends on the order of the data values and not on their actual magnitudes.
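
The two position rules for the sample median can be checked with a short R sketch. Both vectors are hypothetical toy data, and the by-hand results are compared with the built-in `median()`.

```r
# Hypothetical toy data (not the BLI 2017 data).
y_odd <- c(7, 2, 3)        # n = 3 (odd)
y_even <- c(7, 2, 3, 10)   # n = 4 (even)

# Odd n: the value at position (n + 1) / 2 of the sorted data.
n_odd <- length(y_odd)
median_odd <- sort(y_odd)[(n_odd + 1) / 2]

# Even n: the average of the values at positions n / 2 and n / 2 + 1.
n_even <- length(y_even)
sorted_even <- sort(y_even)
median_even <- (sorted_even[n_even / 2] + sorted_even[n_even / 2 + 1]) / 2

# The built-in median() applies the same rule.
print(c(median_odd, median(y_odd)))
print(c(median_even, median(y_even)))
```

For the odd case the sorted data are 2, 3, 7, so the median is 3; for the even case the sorted data are 2, 3, 7, 10, so the median is (3 + 7)/2 = 5.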

Sample Median: Percentages of Educational Attainment (BLI, 2017)

There are 35 OECD countries (n = 35), so this is an odd-numbered data set.

• The median is at position $\frac{n+1}{2} = \frac{35+1}{2} = 18$ (the middle number); the 18th ordered value is 82%.

Interpretation:

The median percentage of adults aged 25 to 64 holding at least an upper secondary degree, as defined by the ISCED classification, is 82%. That is, about 50% of the OECD countries had 82% or less of their adults aged 25 to 64 with at least an upper secondary degree, whereas about 50% of the OECD countries had more than 82% of their adults aged 25 to 64 with at least an upper secondary degree.

Use Frequency Distribution Table to Find Median

The median is the 50th percentile (it has 50% of the data values below it and 50% above it).

Compute the median from a frequency distribution table: Use the cumulative frequencies or cumulative percentages as a guide.

• There are n = 35 observations.

• The median is at the 18th position ((35 + 1)/2 = 18) in the ordered data set.

• Under cumulative frequency, the 18th position corresponds to 82% educational attainment. This is the median.

Another approach:

• Search for the 50% cumulative percentage.

• We find that the cumulative percentage reaches 48.57% at 81% educational attainment; that is, 48.57% of OECD countries have 81% or less.

• We need to borrow 50% – 48.57% = 1.43% of the data values from the next bin(s).

• The next bin brings the cumulative percentage to 51.43%, so this bin supplies the 1.43% needed to reach 50% of the data values. Therefore, the median is in this bin, at 82% educational attainment.
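
Both approaches (median position under cumulative frequency, and the first bin to reach 50% cumulative percentage) can be sketched in R. The five data values are hypothetical and the sample size is odd, as in the notes; with an even sample size that hits exactly 50% at a bin boundary, the two middle values would have to be averaged instead.

```r
# Hypothetical data with an odd number of cases (not the BLI 2017 data).
y <- c(2, 3, 3, 7, 7)

freq_table <- table(y)
values <- as.numeric(names(freq_table))
cum_freq <- cumsum(as.vector(freq_table))
cum_pct <- 100 * cum_freq / sum(freq_table)

# Approach 1: first value whose cumulative frequency reaches
# the median position (n + 1) / 2.
median_position <- (length(y) + 1) / 2
median_by_position <- values[which(cum_freq >= median_position)[1]]

# Approach 2: first value whose cumulative percentage reaches 50%.
median_by_percent <- values[which(cum_pct >= 50)[1]]

print(data.frame(value = values, cum_freq = cum_freq, cum_pct = cum_pct))
print(c(median_by_position, median_by_percent, median(y)))
```

The cumulative frequencies are 1, 3, 5 and the cumulative percentages are 20, 60, 100, so both approaches point to the value 3, in agreement with `median(y)`.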

Mode of a Variable

• The mode is the value that occurs most frequently in the frequency distribution.

• It is the location of the peak (or peaks) in the frequency distribution graph.

• The mode is typically used for categorical variables measured on a nominal scale.

• The mode is also used for discrete variables.

• Knowing the mean, median, and mode of the data gives us a better picture of the data.

Sample Mode: Percentages of Educational Attainment (BLI, 2017)

The most frequent data value is 87.

• 4 out of 35 OECD countries had 87% educational attainment.

Interpretation:

The most frequent percentage of adults aged 25 to 64 holding at least an upper secondary degree, as defined by the ISCED classification, is 87%. It occurs in 4 out of 35 OECD countries (about 11% of OECD countries).

Sample Mode: Percentages of Educational Attainment (BLI, 2017) in R

Install the “lsr” package in R (lsr: Learning Statistics with R).
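
The R output for this slide is not reproduced in the transcript. As a rough sketch of the idea, the code below finds the mode and its frequency with base R only, so it runs without installing any package; it is a stand-in for whatever the slide ran, not a reproduction of it. The data values are hypothetical.

```r
# Hypothetical data (not the BLI 2017 data).
y <- c(2, 3, 3, 7)

# Count how often each distinct value occurs.
freq_table <- table(y)

# The mode is the value (or values) with the highest frequency.
highest_freq <- max(freq_table)
is_mode <- as.vector(freq_table) == highest_freq
mode_values <- as.numeric(names(freq_table)[is_mode])

print(mode_values)
print(highest_freq)
```

Here the value 3 occurs twice and every other value once, so the mode is 3 with a frequency of 2. If several values tie for the highest frequency, `mode_values` contains all of them.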

Use Frequency Distribution Table to Find Mode

Compute the mode from a frequency distribution table:

• Mode is the most frequent observation in the table.

In our example of educational attainment percentages:

The mode is 87%, with the highest frequency of 4.

Understanding the Shape of a Distribution

• The pattern of variation of a variable is called its “distribution”.

• In any graph of data, look for the overall pattern and for striking deviations from that pattern.

• An important kind of deviation is an outlier: an individual value that falls outside the overall pattern.

• Outliers (potential outliers) are observations plotted at the lowest or the highest end of the data that appear to deviate from the rest of the data.

• They are also referred to as extreme values in the data.

• They tell us something interesting or exciting about the data.

• They can be very informative and can affect almost every statistical method.

• We need to investigate outlying points in the data.

• They could be a wrong input in the data; in that case, we can fix it (correct it with the right value).

• If they are correct values, we need to understand and explain them.

• We need to run the statistical analysis with and without the outliers and describe how they affect our data analysis.

Note: We do not simply remove outliers from any data set without investigating them.

Symmetrical Distribution

• Check whether the right-hand side of the graph is a mirror image of the left-hand side.

• The median is exactly at the centre.

• The mean is exactly at the centre.

• For a perfectly symmetrical distribution, the mean and median are the same.

• In an approximately symmetric distribution, the mean and the median will be close to each other.

• Unimodal distribution (with one mode in the data):

if the mean, median, and mode are the same, this suggests that the distribution is bell-shaped and symmetric.

• Bimodal distribution (with two modes in the data):

if the distribution has two peaks, the mean and median are the same, and the edges match closely when we fold the graph in the middle (where the mean and median lie), then the distribution is symmetric.

Skewed Distribution

• Skewness occurs when an observation (or a few observations) deviates from the overall pattern of the data.

• See whether the thinner ends of the distribution (the tails) pull the distribution toward their side.

• There is a tendency for the mean, median, and mode to be located in predictably different positions.

• The most probable order for the three measures of central tendency (from smallest to largest):

• Negative (left) skewed: Mean, Median, Mode

• Positive (right) skewed: Mode, Median, Mean

• In a skewed distribution:

If Mean < Median, the data are left skewed.

If Mean > Median, the data are right skewed.

In the example of percentage of educational attainment, the mean of 78.51% is less than the median of 82%. So, it is plausible that the distribution of percentage of educational attainment is skewed to the left. However, we should check its graphical display to be sure.
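
The mean-versus-median rule of thumb can be written as a short R sketch. The two numbers are the summary values reported in these notes; the variable names and the wording of the labels are arbitrary, and the result is only a hint that should still be checked against a graph.

```r
# Summary values reported in these notes for educational attainment.
mean_value <- 78.51
median_value <- 82

# Rule of thumb: compare the mean with the median.
if (mean_value < median_value) {
  skew_hint <- "left-skewed"
} else if (mean_value > median_value) {
  skew_hint <- "right-skewed"
} else {
  skew_hint <- "no skew suggested"
}

print(skew_hint)
```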

Shape of a Histogram

For the shape of a histogram, we need to detect:

Any peak in the graph:

Does it have a single peak (one hump/mode) or several peaks (humps/modes)?

• A histogram with one peak/mode is called “unimodal”.

• A histogram with two peaks is called “bimodal”.

• A histogram with three or more peaks is called “multimodal”.

• A histogram with no apparent mode, in which all the bars are about the same height, shows a uniform distribution.

Note: The value on the horizontal axis at which the histogram peaks is the mode.

Whether it is symmetric:

• Fold the histogram along a vertical line in the middle and see if the edges match closely.

Whether it is skewed to the right or to the left:

• Skewness occurs when an observation (or a few observations) deviates from the overall pattern of the data.

• See whether the thinner ends of the distribution (the tails) pull the distribution toward their side.

• If one tail stretches out farther than the other, then the histogram is skewed toward the side of the longer tail.

Visualizing Quantitative Data with a Histogram

A histogram displays the entire distribution of a quantitative variable. How is it constructed?

• Slice up all possible values into bins.

• Then count the number of cases that fall into each bin.

• The bars (bins) are adjacent and of equal width (interval scale).

• The bins, together with these counts, give the distribution of the quantitative variable.

• A gap in the histogram indicates that there is a bin with no cases in it.

• There are usually between 5 and 30 bins; very large data sets may need more.

• The number of bins depends on the sample size.

• Appropriate number of bins:

• not too few; not too many

• R, the statistical software, creates the bins automatically.
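
The construction steps above (slice into equal-width bins, then count the cases per bin) can be sketched with R's `hist()`. The five data values and the bin edges are hypothetical, chosen so that one bin ends up empty.

```r
# Hypothetical data (not the BLI 2017 data).
y <- c(1, 3, 3, 5, 9)

# Equal-width bins: [0,2], (2,4], (4,6], (6,8], (8,10].
bin_edges <- seq(0, 10, by = 2)

# plot = FALSE returns the bins and counts without drawing;
# leave it out to draw the histogram.
h <- hist(y, breaks = bin_edges, plot = FALSE)

print(h$breaks)
print(h$counts)  # the bin (6,8] has no cases, which would show as a gap
```

Calling `hist(y)` without `breaks` lets R choose the bins automatically, as the last bullet above notes.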

Common Shapes of Histograms

• Left skewed: tail to the left; most of the observations in the data are to the right.

• One peak (unimodal): approximately bell-shaped and symmetric.

• No peak: uniform distribution.

• Right skewed: tail to the right; most of the observations in the data are to the left.

Example of a Bimodal Distribution Depicted by a Histogram

Bimodal: We see two peaks in the plot (this suggests that there may be two groups in the data that need to be accounted for).

Examples of U-shaped Distributions Depicted by Histograms

• We would see two peaks.

• E.g., responses to a survey with a large number of “strongly disagree” (1) and “strongly agree” (5) responses.

Panels: U-shaped (non-symmetric); U-shaped (symmetric).

Histogram of Percentages of Educational Attainment in R

• What is the typical percentage of educational attainment in OECD countries?

• What is the shape of the distribution of percentage of educational attainment in OECD countries?

• What percentage of OECD countries had 50% or less of their adults holding at least an upper secondary degree?

Histogram of Percentages of Educational Attainment (BLI, 2017)

• What is the typical percentage of educational attainment in OECD countries? About 75%

• What is the shape of the distribution of percentage of educational attainment in OECD countries? Left-skewed (the tail of the histogram is pulled to the left); most of the observations (OECD countries) are located in the right part of the graph.

• What percentage of OECD countries had 50% or less of their adults holding at least an upper secondary degree? Count the number of observations (OECD countries) with 50% or less. These values are located in the first two bins: the first bin has 2 values in it and the second bin has 1 value in it, so the total number of data values in this case is 3 (3 out of 35 OECD countries).

• The relative frequency is: 3/35 ≈ 0.086 ≅ 0.09

• The percentage is: 0.09 x 100 = 9%

Stem-and-leaf Display

• The stem-and-leaf display was designed by John W. Tukey, one of the greatest statisticians of the twentieth century.

• A stem-and-leaf display is like a histogram, but it shows the raw (individual) data values in order from the smallest to the largest value. Tilt your head and squint to see the shape.

• Split each number into two parts: the stem and the leaf.

• The stem is the left part of the number (data value) and the leaf is the right part.

• The number of stems depends on the size of the data (e.g., sample size), just like the number of bins in a histogram.

• R, the statistical software, creates the number of stems automatically.

• Sometimes a stem value is repeated (stretched) in order to visualize the data better.
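
A stem-and-leaf display can be produced with base R's `stem()`. The sketch below uses hypothetical two-digit values (not the BLI 2017 data) and shows the default display next to a stretched one; how the stems are split depends on the data, so no output is shown here.

```r
# Hypothetical two-digit data (not the BLI 2017 data).
y <- c(12, 15, 21, 24, 24, 30, 33, 38, 41)

# Default display: R chooses the stems automatically.
stem(y)

# scale controls the length of the display; scale = 2 asks for a display
# roughly twice as long, which typically stretches the stems.
stem(y, scale = 2)
```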

Stem-and-leaf Display of Percentages of Educational Attainment in R

The first line: 3|7 means the number 37.

Stem-and-leaf Display of Percentages of Educational Attainment in R

A stem-and-leaf plot shows the raw data values in order.

• What is the minimum value?

• What is the maximum value?

• What is the shape of the distribution?

Stem-and-leaf Display of Percentages of Educational Attainment in R

• What is the minimum value?

37 x 1 (leaf unit) = 37

• What is the maximum value?

94 x 1 (leaf unit) = 94

• What is the shape of the distribution?

Left-skewed. Most of the observations are in the bottom part of the stem plot.

Tilt your head to the right to see the shape (from min to max values).

Measures of Skew and Kurtosis

Skewness is a measure of asymmetry.

• The (tedious) formula is omitted; we do not calculate it by hand.

• A skewness of 0 means there is no skew (the data are not skewed to either the right or the left).

• Skewness between -1 and +1 is OK; however, skewness above +1 or below -1 is a concern.

Kurtosis is a measure of pointiness.

• The (tedious) formula is omitted; we do not calculate it by hand.

• A kurtosis of 0 means we have an approximately normal curve (e.g., a bell-shaped symmetric distribution). We call this mesokurtic. The pointiness of a data set is assessed relative to this curve.

• Negative kurtosis is called platykurtic (not pointy enough): too flat.

• Positive kurtosis is called leptokurtic: too pointy.

Skew and Kurtosis of Percentage of Educational Attainment in R

Install the “psych” package in R.

The Spread of a Quantitative Distribution

How spread out are the data?

• Spread tells us how much the data vary around their centre.

• How far from the mean or the median do the observed values tend to be?

Variability describes the distribution.

• It provides a quantitative measure of the differences between scores in a distribution and describes the degree to which the scores are spread out or clustered together.

• It measures how well an individual score (or group of scores) represents the entire distribution.

• It indicates how much error to expect when using a sample to represent a population.

When we describe a distribution numerically, we always report a measure of its spread along with its centre.

• There are several measures of spread: range, interquartile range, variance, and standard deviation.

Sample Range: Educational Attainment % (BLI, 2017)

• Range is a simple measure of spread.

• It is calculated as Maximum value – Minimum value.

• In our example of percentage of educational attainment:

Range = 94 – 37 = 57

Interpretation: The percentage of educational attainment ranges from 37% to 94%.

Note: Range is affected by extremely low or extremely high values in the data.

Sample Range: Educational Attainment % (BLI, 2017) in R
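
The R output for this slide is not reproduced in the transcript. A minimal sketch of the range calculation is given below: the first part uses the minimum and maximum reported in these notes, and the second part shows the base R functions `range()` and `diff()` on a hypothetical three-value vector that shares only that minimum and maximum with the real data.

```r
# Minimum and maximum reported in these notes for educational attainment.
min_value <- 37
max_value <- 94
range_value <- max_value - min_value
print(range_value)

# With a data vector, range() returns c(min, max) and diff() subtracts them.
# The middle value here is made up; only 37 and 94 come from the notes.
y <- c(37, 60, 94)
print(range(y))
print(diff(range(y)))
```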

The Spread of a Distribution: How Far Is Each Value from the Mean?

Think about the distances between the observations and the mean.

• We want to find the distance of each observation from the sample mean.

• This difference is referred to as a “deviation”; the deviations reflect the spread of the data.

• Idea: Combine all deviations into one useful summary.

• Can we average the deviations to find a typical spread for a data set?

Example: Data: 2, 3, 7 (mean, $\bar{y}$ = 4)

• Some distances are below the mean (negative)

• Some distances are above the mean (positive)

2 – 4 = -2

3 – 4 = -1

7 – 4 = +3

• Add these distances: (-2) + (-1) + (+3) = 0 (this does not help us find a typical spread!)

The Spread of a Distribution: How Far Is Each Value from the Mean?

Think about the distances between the observations and the mean.

• Solution: Square the deviations. That is, find the sample variance.

• The sample variance is (almost) the average of the squared deviations of the observations from the sample mean.

• Note that the population variance is the average of the squared deviations of the observations from the population (true) mean.

• What would be the sample variance for the following data sets? Use your intuition.

Data set #1: 7, 7, 7

Data set #2: 2, 3, 7

Sample Variance: The Spread of a Distribution

Recall the example: Data: 2, 3, 7 (mean = 4)

• Borrowing from the Pythagorean theorem, we work with squared distances.

• Squared distances in the data (2, 3, 7):

$(2 - 4)^2 = 4$

$(3 - 4)^2 = 1$

$(7 - 4)^2 = 9$

• Add these squared distances: 4 + 1 + 9 = 14

• Take an “adjusted” average:

• This means we divide by n – 1 = 3 – 1 = 2 (because we estimated the mean, we subtract 1 from n; degrees of freedom = n – 1)

• In the data example above, the sample variance is: $\frac{14}{2} = 7$

• In general, the sample variance is: $\frac{\text{sum of squared distances}}{n - 1}$

• We denote the sample variance by $S^2$:

$$S^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$
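
The worked example with the data 2, 3, 7 can be reproduced step by step in R. The sketch below follows the hand calculation and then compares it with the built-in `var()`, which also divides by n – 1; the object names are arbitrary.

```r
# The example data from these notes.
y <- c(2, 3, 7)
n <- length(y)
y_bar <- mean(y)

# Deviations from the mean: -2, -1, 3. They always sum to zero.
deviations <- y - y_bar
sum_of_deviations <- sum(deviations)

# Squared deviations: 4, 1, 9. Their sum is 14.
squared_deviations <- deviations^2
sum_of_squares <- sum(squared_deviations)

# "Adjusted" average: divide by n - 1 rather than n.
sample_variance <- sum_of_squares / (n - 1)

print(sum_of_deviations)
print(sum_of_squares)
print(sample_variance)
print(var(y))  # built-in sample variance
```

For the other intuition example, `var(c(7, 7, 7))` is 0, because no value deviates from the mean.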

Sample Variability and Degrees of Freedom (n – 1)

• In a population, we find the deviation for each score by measuring its distance from the population mean $\mu$.

• With a sample, the value of $\mu$ is unknown, so we measure from the sample mean.

• Because the sample mean varies from sample to sample, we must first compute the sample mean before we compute the deviations.

• However, calculating the sample mean places a restriction on the variability of the scores.

• Consider our data set example with n = 3, $y_1 = 2$, $y_2 = 3$, $y_3 = ?$, with a mean of $\bar{y} = 4$.

• This means that the value of $y_3$ must be 7. This is why:

$$\bar{y} = 4 = \frac{3 + 2 + y_3}{3}, \quad\text{so}\quad 4 \times 3 = 12 = 3 + 2 + y_3 = 5 + y_3, \quad\text{and}\quad y_3 = 12 - 5 = 7$$

• The two values $y_1$ and $y_2$ were free to take any values. However, the third value $y_3$ depends on the values chosen for the first two.

In general:

For a sample of n scores, the first n – 1 scores are free to vary, but the final score is restricted. As a result, the sample is said to have n – 1 degrees of freedom (df).

• The df is the number of scores in the sample that are independent and free to vary.

Sample Standard Deviation: The Spread of a Distribution

• However, the sample variance is a “squared idea”: it is measured in squared units.

Recall:

the sample variance is: $\frac{\text{sum of squared distances}}{n - 1}$

• We take its square root in order to get the “typical spread” of the data.

• We call this typical spread of the data the sample standard deviation.

• We denote the sample standard deviation by S.

• $S = \sqrt{S^2}$

• For example, in our data, $S = \sqrt{7} \approx 2.65$

• S measures the spread about the sample mean $\bar{y}$.

• S can be 0 or positive.

• S is 0 when $S^2$ is 0

• e.g., quiz data for 3 students: 10, 10, 10; $\bar{y}$ = 10; $S^2$ = 0; S = 0

Note:

Variance and standard deviation are both affected by (not resistant to, sensitive to) extremely low or extremely high values in the data.
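
The two examples on this slide (the data 2, 3, 7 and the quiz scores 10, 10, 10) can be checked in R. The sketch takes the square root of the variance by hand and compares it with the built-in `sd()`; the object names are arbitrary.

```r
# The example data from these notes.
y <- c(2, 3, 7)

# Standard deviation as the square root of the sample variance.
s_by_hand <- sqrt(var(y))

# Built-in equivalent.
s <- sd(y)

print(round(s_by_hand, 2))
print(round(s, 2))

# Quiz data where every score is the same: no spread at all.
quiz <- c(10, 10, 10)
print(var(quiz))
print(sd(quiz))
```

The first two lines print 2.65 (the square root of 7, rounded), and the quiz data give a variance and standard deviation of 0.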

Standard Deviation of Percentage of Educational Attainment (BLI, 2017)

• Recall that the sample mean was 78.51%.

• Sample variance:

$$S^2 = \frac{(37 - 78.51)^2 + (39 - 78.51)^2 + \cdots + (94 - 78.51)^2}{35 - 1} = 209.55$$

• Sample standard deviation: $S = \sqrt{209.55} \approx 14.48$ (we take the positive square root)

Interpretation:

We can expect the percentage of educational attainment to differ from the mean, on average, by about 14.48 percentage points.

Summary Statistics of Percentage of Educational Attainment in R
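
The R output for this slide is not reproduced in the transcript. As a base R stand-in (not necessarily what the slide ran), `summary()` reports the minimum, quartiles, median, mean, and maximum in one call. The sketch uses the toy data 2, 3, 7 from these notes rather than the BLI 2017 data.

```r
# Toy data from these notes (not the BLI 2017 data).
y <- c(2, 3, 7)

# Six-number summary: Min., 1st Qu., Median, Mean, 3rd Qu., Max.
s <- summary(y)
print(s)

# Individual entries can be pulled out by name.
print(s[["Median"]])
print(s[["Mean"]])

# summary() does not report spread, so compute it separately.
print(sd(y))
```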

Summary Statistics of Percentage of Educational Attainment in R

Use the describe() function in the “psych” package.

Summary Statistics of Percentage of Educational Attainment in R

Install the “mosaic” package in R.

The Effect of Unusual Data Values (Potential Outliers) on Statistical Analysis

Let us compare the descriptive statistics for:

• n = 35: the complete data set

• n = 33: two low potential outliers were removed from the original data (37% and 39% data values are omitted)

The Effect of Unusual Data Values (Potential Outliers) on Statistical Analysis

Let us compare the sample standard deviation for:

• n = 35: the complete data set

• n = 33: a subset of the complete data set (the low potential outliers are omitted from the original data set)

❖ Compare mean, median, range, variance, and standard deviations between these two data sets.

The Effect of Unusual Data Values (Potential Outliers) on Statistical Analysis

• The data set with n = 33 observations has a larger mean but a smaller spread (smaller range, smaller standard deviation, and smaller variance) than the complete data set with n = 35 observations.

• Thus, in terms of the mean, standard deviation, range, and variance, the description of the distribution of educational attainment is affected by removing these two potential outliers.

• However, the median is less affected by removing these two low outliers from the data:

• The median values are about the same (82%, n = 35; 83%, n = 33)
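
The same with-and-without comparison can be sketched in R on a small made-up data set. The seven values below are hypothetical (two of them deliberately low) and are not the BLI 2017 data; the point is only to show the pattern described above.

```r
# Hypothetical data with two unusually low values (not the BLI 2017 data).
full <- c(10, 12, 50, 52, 54, 56, 58)
trimmed <- full[full > 12]   # drop the two low values

compare <- data.frame(
  n = c(length(full), length(trimmed)),
  mean = c(mean(full), mean(trimmed)),
  median = c(median(full), median(trimmed)),
  range = c(diff(range(full)), diff(range(trimmed))),
  variance = c(var(full), var(trimmed)),
  sd = c(sd(full), sd(trimmed)),
  row.names = c("complete", "low values removed")
)

print(round(compare, 2))
```

In this toy example the mean rises from about 41.71 to 54 when the two low values are dropped, while the median moves only from 52 to 54, and every measure of spread shrinks.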

Bell-shaped and Symmetric Distribution

• Mountain: One peak.

• Symmetric bell-shaped: the mean, median, and mode are the same (or very close to each other).

• Below is an illustration of a symmetric bell-shaped distribution.

As the sample size increases, the number of intervals (bins) in a histogram, for example, increases and their width narrows; we then approach a smooth curve (more about smooth curves later: Ch. 4).

Empirical Rule

If the histogram of the data is approximately bell-shaped and symmetric, then:

• About 68% of the observations in the data are within 1 standard deviation of the mean: $(\bar{y} - 1S,\ \bar{y} + 1S)$

• About 95% of the observations in the data are within 2 standard deviations of the mean: $(\bar{y} - 2S,\ \bar{y} + 2S)$

• All or nearly all of the observations in the data are within 3 standard deviations of the mean: $(\bar{y} - 3S,\ \bar{y} + 3S)$
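
One way to see the Empirical Rule at work is to simulate bell-shaped data and count. The sketch below draws 10,000 values from a normal distribution (the mean of 50, standard deviation of 10, and the seed are arbitrary choices) and uses a small helper, `within_k_sd()`, which is a hypothetical name introduced here, to compute the share of observations within k standard deviations of the mean.

```r
# Share of observations within k sample standard deviations of the sample mean.
within_k_sd <- function(y, k) {
  mean(abs(y - mean(y)) <= k * sd(y))
}

# Simulated bell-shaped data; the seed only makes the run repeatable.
set.seed(2021)
y <- rnorm(10000, mean = 50, sd = 10)

p1 <- within_k_sd(y, 1)
p2 <- within_k_sd(y, 2)
p3 <- within_k_sd(y, 3)

print(round(100 * c(p1, p2, p3), 1))
```

The three printed percentages should come out close to 68%, 95%, and nearly 100%. Applying `within_k_sd()` to a skewed data set would show how far that data set departs from the rule.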

Example of Checking Whether Empirical Rule Holds: Educational Attainment

Min = 37, Max = 94, Mean = 78.51, S = 14.48

$(\bar{y} - 1S,\ \bar{y} + 1S)$:

$(\bar{y} - 2S,\ \bar{y} + 2S)$:

$(\bar{y} - 3S,\ \bar{y} + 3S)$:

Example of Checking Whether Empirical Rule Holds: Educational Attainment

Min = 37, Max = 94, Mean = 78.51, S = 14.48

$(\bar{y} - 1S,\ \bar{y} + 1S)$: (78.51 – (1 × 14.48), 78.51 + (1 × 14.48)) = (64.03, 92.99), which covers the whole-number values [65, 92].

65 and 92 are values in the data; we count all values in this range.

Data count: 28/35 × 100 = 80%

$(\bar{y} - 2S,\ \bar{y} + 2S)$: (78.51 – (2 × 14.48), 78.51 + (2 × 14.48)) = (78.51 – 28.96, 78.51 + 28.96) = (49.55, 107.47), which covers the whole-number values [50, 107].

50 and 107 are NOT values in the data; we count from the first data value above 50, which is 58, to the largest data value below 107, which is 94 (max).

Data count: 32/35 × 100 ≅ 91.43%

$(\bar{y} - 3S,\ \bar{y} + 3S)$: (78.51 – (3 × 14.48), 78.51 + (3 × 14.48)) = (78.51 – 43.44, 78.51 + 43.44) = (35.07, 121.95), which covers the whole-number values [36, 121].

36 and 121 are NOT values in the data; we count from the first data value above 36, which is 37 (min), to the largest data value below 121, which is 94 (max).

Data count: 35/35 × 100 = 100%

The Empirical Rule does NOT apply to this distribution: 80% of the observations fall within 1 standard deviation of the mean (rather than about 68%), and 91.43% fall within 2 standard deviations (rather than about 95%).
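
The interval arithmetic and the percentages above can be reproduced in R from the reported summary values alone. The mean, standard deviation, sample size, and the three counts (28, 32, 35) are taken from these notes; the counts themselves cannot be recomputed here because the raw data are not part of this transcript.

```r
# Summary values reported in these notes.
y_bar <- 78.51
s <- 14.48
n <- 35

# Intervals of 1, 2, and 3 standard deviations around the mean.
k <- 1:3
intervals <- data.frame(
  k = k,
  lower = y_bar - k * s,
  upper = y_bar + k * s
)
print(intervals)

# Counts of countries inside each interval, as reported in the notes.
counts <- c(28, 32, 35)
observed_pct <- 100 * counts / n
print(round(observed_pct, 2))
```

This gives the intervals (64.03, 92.99), (49.55, 107.47), and (35.07, 121.95), and observed percentages of 80, 91.43, and 100, to be compared with the 68%, 95%, and nearly 100% expected under the Empirical Rule.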

Measure of Variability: Interquartile Range (IQR)

• The IQR is the range of the middle half of the data: 50% of the data fall in this range.

• To find the IQR, we divide the data into quartiles.

• We define the median of the data as the second quartile (Q2, 50th percentile). It has half (50%) of the data below it and half (50%) of the data above it.

• We define the median of the first (lower) half as the first quartile (Q1, 25th percentile). It has 1/4 (25%) of the observations below it and 3/4 (75%) of the observations above it.

• We define the median of the second (upper) half as the third quartile (Q3, 75th percentile). It has 3/4 (75%) of the observations below it and 1/4 (25%) of the observations above it.

• IQR = 75th percentile (Q3) – 25th percentile (Q1)

Note: The IQR is resistant to (not sensitive to, or at least much less affected by) outliers.

Quartiles in the Data

• The quartiles Q1 and Q3, together with the median (Q2), split the distribution into four parts.

• Each part contains one-fourth of the observations.


IQR (Interquartile Range): Educational Attainment (BLI, 2017)

• Recall that the median (Q2) is the 18th observation of the n = 35 ordered values: Q2 = 82%

• The first (or lower) quartile (Q1) is the average of the values in the 9th and 10th positions of the first half of the ordered data (the first 18 observations): (77% + 77%) / 2 = 77%

• The third (or upper) quartile (Q3) is the average of the values in the 9th and 10th positions of the second half of the ordered data (the last 18 observations): (87% + 88%) / 2 = 87.5%

• So, we can calculate the IQR = Q3 – Q1 = 87.5% – 77% = 10.5%

• Interpretation of the IQR within the context of this data:

The middle half of the OECD countries have educational attainment that extends across a range of 10.5%.
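The median-of-halves procedure used for Q1 and Q3 can be sketched in code. This is an illustrative Python version (the course uses R). The function name `quartiles_by_halves` and the toy data are hypothetical and are not the OECD values. The sketch assumes, as on the slide, that when n is odd the overall median is counted in both halves (n = 35 gives two halves of 18 observations).

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    For an odd number of observations the overall median is included in
    both halves, matching the slide (n = 35 gives two halves of 18 values).
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2          # size of each half: 18 when n = 35
    lower = data[:half]          # the smallest `half` observations
    upper = data[n - half:]      # the largest `half` observations
    return median(lower), median(data), median(upper)


# Hypothetical toy data (NOT the OECD values), n = 7.
toy = [3, 5, 7, 8, 12, 13, 14]
q1, q2, q3 = quartiles_by_halves(toy)
print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3, "IQR =", q3 - q1)
```

Statistical software offers several slightly different quartile definitions, so the quartiles reported by a given package may differ a little from this hand method.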

Summary Statistics for Educational Attainment (BLI, 2017) in R

Min: 37%

Max: 94%

Range: 57%

Q1: 77%

Median (Q2): 82%

Q3: 87.5%

IQR: 10.5%


Measure of Position

• A measure of position tells us the point below (or above) which a given percentage of the data fall.

• For example:

• Range uses two positions: Max – Min.

• Median (Q2, 50th percentile): Half of the data (50%) fall below it, and half (50%) fall above it.

• Lower (first) quartile (Q1, 25th percentile):

• It is the median of the observations that fall below the median (the bottom half of the data).

• One quarter (1/4; 25%) of the data fall below it, and 3/4 (75%) of the data fall above it.

• Upper (third) quartile (Q3, 75th percentile):

• It is the median of the observations that fall above the median (the upper half of the data).

• One quarter (1/4; 25%) of the data fall above it, and 3/4 (75%) of the data fall below it.

Example of Measure of Position: Educational Attainment (BLI, 2017)


Min: 37%

Max: 94%

Q1: 77%

Median (Q2): 82%

Q3: 87.5%

IQR: 10.5%


The Effect of Unusual Data Values (Potential Outliers) on Statistical Analysis

• The original data set has n = 35 observations.

• The subset of this data has n = 33 observations (the data values 37 and 39 were removed from the original data).

• From the R descriptive-statistics output for the two data sets, we notice that the IQR values are about the same.

• The IQR is barely affected by the two extremely low data values:

• IQR for n = 35 is: Q3 – Q1 = 87.5 – 77 = 10.5

• IQR for n = 33 is: Q3 – Q1 = 88 – 78 = 10
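The same with-and-without comparison can be sketched on a small example. The Python code below is only an illustration (the course output comes from R). The data are hypothetical toy percentages, not the OECD values, and `iqr_by_halves` is a hypothetical helper that uses the median-of-halves method.

```python
from statistics import mean, median, stdev


def iqr_by_halves(values):
    """IQR = Q3 - Q1, with quartiles taken as medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2
    return median(data[n - half:]) - median(data[:half])


def describe(values):
    """Collect the statistics compared in the sensitivity check."""
    return {
        "n": len(values),
        "mean": round(mean(values), 2),
        "sd": round(stdev(values), 2),
        "median": median(values),
        "iqr": iqr_by_halves(values),
    }


# Hypothetical toy percentages (NOT the OECD data): two very low values plus a tight cluster.
full = [37, 39, 75, 77, 78, 80, 82, 84, 85, 87, 88, 90]
trimmed = [v for v in full if v > 39]   # same data without the two lowest values

print("with low values:   ", describe(full))
print("without low values:", describe(trimmed))
```

In this toy example the mean and standard deviation change a great deal when the two low values are dropped, while the median and IQR move only slightly.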

5-number Summary Statistics: Nice Features for Data Visualization (Boxplot – next slide)

The 5-number summary statistics provide a simple description of the data. These are:

• Minimum: Smallest value in the data

• Lower quartile (Q1): 25th percentile

• Median (Q2): 50th percentile

• Upper quartile (Q3): 75th percentile

• Maximum: Largest value in the data

In our example of percentage of educational attainment in 35 OECD countries, the 5-number Summary Statistics are:

min=37, Q1=77, median=82, Q3=87.5, max=94 (note these values are all measured in percentage)


Box Plot: Graphing 5-Number Summary of Positions

• The boxplot was designed by John W. Tukey.

• A boxplot uses the 5-number summary statistics. For example: min=37, Q1=77, median=82, Q3=87.5, max=94

• It reflects the shape of the data.

• It summarizes both the centre and the spread.

• The box of the boxplot contains the middle 50% of the data, from Q1 to Q3.

• The median is marked by a line drawn within the box, but it is not necessarily in the middle of the box.

• The lines extending from the box are called whiskers. These extend to the minimum and the maximum data values, except for extreme values (potential outliers), which are marked separately with a circle on the plot.

Boxplot of Percentage of Educational Attainment (BLI, 2017)

min=37, Q1=77, median=82, Q3=87.5, max=94

Describe the shape (tilt your head to the left and scan from the min to the max of the data), the centre (median), and the spread (IQR) of this data. Also note any data points that are plotted individually. They may or may not be outliers; for now, refer to them as “potential” outliers: a potential low outlier if plotted in the lower part of the graph, or a potential high outlier if plotted in the upper part of the graph.

• Shape: The data are skewed to the left. The lower part of the graph shows 5 data points that are plotted individually.

• Centre: The median is almost in the middle of the box of the boxplot, at a value of 82%. 50% of the OECD countries have educational attainment of 82% or less, and 50% of them have more than 82%.

• Spread (IQR): The percentage of educational attainment ranges from 37 to 94. The middle half of this distribution (50% of the OECD countries) has educational attainment between 77 and 87.5 percent.

• Extreme value(s): Five data points are plotted individually, with educational attainment of 37, 39, 47, 58, and 60 percent.

Identify Outliers in the Boxplot using R


Drawing the Boxplot of Percentage of Educational Attainment

1. Locate the min and the max on a horizontal line.

2. Locate the median and draw a vertical line above the median (away from the horizontal line).

3. Locate Q1 and Q3, and draw vertical lines above these values.

4. Draw a rectangle using the vertical lines from Q1, Q2, and Q3.

5. Calculate IQR = Q3 – Q1 = 87.5 – 77 = 10.5

6. Calculate 1.5 IQR = 1.5 x 10.5 = 15.75

7. Calculate the inner fences:

• Lower inner fence = Q1 – 1.5(IQR) = 77 – 15.75 = 61.25

• Upper inner fence = Q3 + 1.5(IQR) = 87.5 + 15.75 = 103.25

8. Draw lines (whiskers) connecting the box to the most extreme values within the fences:

• From Q1, draw a line to the smallest data value that is ≥ 61.25 (greater than or equal to 61.25); in our example it is 65.

• From Q3, draw a line to the largest data value that is ≤ 103.25 (less than or equal to 103.25); in our example it is max = 94.

9. Plot values outside the fences individually. In our example, we plot 37, 39, 47, 58, and 60 individually.
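The fence calculation in the steps above can be sketched as follows. This Python snippet is illustrative only (the course draws the boxplot in R). It uses the quartiles and the handful of data values that are quoted on the slides, since the full data set is not listed here. The function name `inner_fences` is hypothetical.

```python
Q1, Q3 = 77.0, 87.5                       # quartiles from the slide
LOW_VALUES = [37, 39, 47, 58, 60, 65]     # low data values named on the slides
DATA_MAX = 94                             # maximum from the slide


def inner_fences(q1, q3, k=1.5):
    """Return (lower, upper) inner fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = inner_fences(Q1, Q3)

# Values outside the fences are plotted individually as potential outliers.
flagged = [v for v in LOW_VALUES + [DATA_MAX] if v < lower or v > upper]

# The lower whisker ends at the smallest value that is still inside the fences.
lower_whisker_end = min(v for v in LOW_VALUES if v >= lower)

print("fences:", lower, upper)
print("plotted individually:", flagged)
print("lower whisker ends at:", lower_whisker_end)
```

Because the maximum (94) is below the upper fence, the upper whisker simply ends at the maximum and no high values are plotted individually.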

Interpret the Summary Statistics of Percentage of Educational Attainment

Sample Mean ($\bar{y}$): 78.51, Sample Standard Deviation: S = 14.48

Sample Median: 82, Sample IQR: Q3 – Q1 = 87.50 – 77 = 10.5

Interpretation:

• The educational attainment in OECD countries ranges from 37% to 94%, and it is about 79% on average. The median educational attainment is 82%. That is, 50% of the OECD countries had at most 82% of their adults aged 25 to 64 with at least an upper secondary degree, whereas 50% of the OECD countries had more than 82% of their adults aged 25 to 64 with at least an upper secondary degree. The middle half of this distribution (50% of the OECD countries) has educational attainment between 77 and 87.5 percent. We can expect a country's percentage of educational attainment to differ from the mean, on average, by about 14.48%.


Z-Score: Measure of Position Standardizing Data

A z-score (also referred to as a standardized data value) is a measure of position that indicates the number of standard deviations ($s$) that an observation ($y$) falls away from the mean ($\bar{y}$).

• To standardize a value, we subtract the mean and then divide this difference by the standard deviation.

$$Z = \frac{\text{observation} - \text{mean}}{\text{standard deviation}}$$

In symbol format:

$$z = \frac{y - \bar{y}}{s}$$

• A z-score is expressed in units of standard deviation.

• Standardizing data into z-scores shifts the data by subtracting the mean and rescales the values by dividing by the standard deviation.

• Standardizing into z-scores does not change the shape of the distribution.

• Standardizing into z-scores changes the centre by making the mean 0.

• Standardizing into z-scores changes the spread by making the standard deviation 1.

• No matter what mean or standard deviation a distribution has, Z has mean 0 and standard deviation 1.
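To see the mean-0 and standard-deviation-1 properties in action, here is a minimal Python sketch (the course itself works in R). The function name `standardize` and the toy data are hypothetical and are not the OECD values. The sample standard deviation (dividing by n − 1) is assumed.

```python
from statistics import mean, stdev


def standardize(values):
    """Convert each value to a z-score: (value - sample mean) / sample SD."""
    m = mean(values)
    s = stdev(values)          # sample standard deviation (divides by n - 1)
    return [(v - m) / s for v in values]


# Hypothetical toy data (NOT the OECD values).
toy = [2, 4, 4, 5, 7, 8]
z = standardize(toy)

print([round(v, 2) for v in z])
print("mean of z-scores is 0 (up to rounding):", abs(mean(z)) < 1e-9)
print("SD of z-scores is 1 (up to rounding):  ", abs(stdev(z) - 1) < 1e-9)
```

The order of the values is preserved, which is why the shape of the distribution does not change.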

Examples 1 and 2: Shifting to Adjust the Centre

• Adding (or subtracting) a constant to each data value increases (or decreases) all measures of position (centre: mean, median; percentiles, e.g., Q1, Q3; min; max) by that same constant.

• Adding (or subtracting) a constant to each data value leaves measures of spread (range, IQR, standard deviation) unchanged.

Examples 1 and 2: Rescaling Data to Adjust the Scale

When we multiply (or divide) all the data values by a positive constant, all measures of position (such as the mean, median, and percentiles) and measures of spread (such as the range, the IQR, and the standard deviation) are multiplied (or divided) by that same constant.
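Both the shifting rule and the rescaling rule can be checked numerically. The Python sketch below is illustrative only. The toy data, the shift of 5, and the scale factor of 2 are hypothetical choices, and `summarize` is a hypothetical helper.

```python
from statistics import mean, median, stdev


def summarize(values):
    """Return a few measures of position and spread, rounded for comparison."""
    return {
        "min": min(values),
        "median": median(values),
        "mean": round(mean(values), 6),
        "max": max(values),
        "range": max(values) - min(values),
        "sd": round(stdev(values), 6),
    }


# Hypothetical toy data (NOT the OECD values).
toy = [10, 12, 15, 19, 24]
shifted = [v + 5 for v in toy]     # add a constant to every value
rescaled = [v * 2 for v in toy]    # multiply every value by a positive constant

for name, data in [("original", toy), ("shifted +5", shifted), ("rescaled x2", rescaled)]:
    print(name, summarize(data))
```

After the shift, the measures of position move by 5 while the range and standard deviation stay the same; after rescaling, every listed measure is doubled.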

Example: The Distribution of z-scores

Standardizing data into z-scores shifts the data by subtracting the mean and rescales the values by dividing by standard deviation.

Example: The Distribution of Original Data and z-scores

Both distributions have the same shape.

More about Z-Scores

• Data values below the mean have negative z-scores.

• Data values above the mean have positive z-scores.

• Regardless of the direction, the farther a data value is from the mean, the more unusual it is.

• For a bell-shaped symmetric distribution, it is unlikely (very rare) for an observation to fall more than 3 standard deviations from the mean.

• This means that if an observation has a z-score:

• less than -3 or more than +3 (larger than 3 in absolute value), it is unlikely or rare.

• between 2 and 3 in absolute value, it is a potential outlier.

• Z-scores give a basis for comparing data values from distributions with different means and standard deviations.


Example: Comparing Data Values


➢Which is the better exam score?

• 67 on an exam with mean 50 and a standard deviation of 10?

• 62 on an exam with mean 40 and a standard deviation of 12?

➢ Is it fair to say 67 is better than 62, because 67 > 62?

➢Or, is it fair to say 62 is better than 67, because it is 22 marks above the mean, whereas 67 is only 17 marks above the mean?

Example: Comparing z-Scores

So, which is the better exam score?

• Turn 67 into a z-score: $z = \frac{67 - 50}{10} = 1.70$

• Turn 62 into a z-score: $z = \frac{62 - 40}{12} \approx 1.83$

So, the exam mark of 62 is a (slightly) better performance relative to its mean and standard deviation.
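The comparison can be written as a short calculation. This Python sketch uses only the exam figures from the slide. The helper name `z_score` is hypothetical.

```python
def z_score(observation, mean, sd):
    """Number of standard deviations that `observation` lies from `mean`."""
    return (observation - mean) / sd


# Exam figures from the slide: (score, exam mean, exam SD).
exam_a = (67, 50, 10)
exam_b = (62, 40, 12)

z_a = z_score(*exam_a)
z_b = z_score(*exam_b)

print(f"z for 67: {z_a:.2f}")
print(f"z for 62: {z_b:.2f}")
print("Better relative performance:", 62 if z_b > z_a else 67)
```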


Find the Original Score from the z-Score

People with z-scores above 2.5 on an IQ test are sometimes classified as geniuses. If IQ scores have a mean of 100 and a standard deviation of 16 points, what IQ score do you need to be considered a genius?

Recall the formula: $z = \frac{y - \bar{y}}{s}$

Solving for the observation: $y = \bar{y} + (z \times s)$

Observation (IQ) = mean of IQ + (z-score x standard deviation of IQ)

IQ = 100 + (2.5 x 16) = 140

So, an IQ score above 140 is needed.
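Going from a z-score back to the original scale is the reverse of standardizing. The Python sketch below uses the IQ figures from the slide. The helper names `z_score` and `from_z` are hypothetical.

```python
def z_score(observation, mean, sd):
    """Standardize: how many SDs the observation is from the mean."""
    return (observation - mean) / sd


def from_z(z, mean, sd):
    """Undo the standardization: observation = mean + z * sd."""
    return mean + z * sd


iq_cutoff = from_z(2.5, mean=100, sd=16)
print("IQ cutoff:", iq_cutoff)

# Round trip: standardizing the cutoff returns the original z-score.
print("z-score of the cutoff:", z_score(iq_cutoff, 100, 16))
```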

Recall the Example Percentage of Educational Attainment: Mean = 78.51, SD = 14.48

• How many standard deviations is the country with 37% educational attainment away from the mean percentage of educational attainment?

$$z = \frac{37 - 78.51}{14.48} \approx -2.87$$

The OECD country (recall this was Mexico) with 37% educational attainment is 2.87 standard deviations below the mean. Because this z-score is between 2 and 3 in absolute value, the observation is a potential low outlier, but not necessarily an outlier.

• How many standard deviations is the country with 94% educational attainment away from the mean percentage of educational attainment?

$$z = \frac{94 - 78.51}{14.48} \approx +1.07$$

The OECD country with 94% educational attainment is 1.07 standard deviations above the mean.

Can you confirm this country’s name?


Recall the Example Percentage of Educational Attainment: Mean = 78.51, SD = 14.48

Which OECD country had 94% educational attainment?

• In RStudio’s Global Environment panel, click on the data frame “Edu_Attain”.

• You can sort the data by clicking on the cursor (e.g., click on the cursor below the variable “Educational_Attainment”).

• You may need to click on the cursor twice to show the data from high values to low values.

• Japan (case number: 18) had 94% of its adults with at least an upper secondary degree.

The Distribution of z-scores for Percentage of Educational Attainment

• The distribution of z-scores ($Z = \frac{\text{observation} - \text{mean}}{\text{standard deviation}}$) for the n = 35 observations:

$$\frac{37 - 78.51}{14.48} = -2.87,\quad \frac{39 - 78.51}{14.48} = -2.73,\quad \frac{47 - 78.51}{14.48} = -2.18,\ \ldots,\ \frac{94 - 78.51}{14.48} = 1.07$$

• Next, we find the mean and the standard deviation of the distribution of z-scores.

• We will also obtain graphical displays (histogram and boxplot) of the distribution of z-scores and compare the graphs with the original distribution of educational attainment percentages.
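The individual z-scores listed above can be reproduced directly from the mean and standard deviation. The Python sketch below covers only the data values that are quoted on the slides, because the full set of 35 values is not listed here. The labels returned by the hypothetical helper `label` follow the rule of thumb from the "More about Z-Scores" slide, and "not unusual" is an added label for everything else.

```python
MEAN, SD = 78.51, 14.48                   # summary statistics from the slides
KNOWN_VALUES = [37, 39, 47, 58, 60, 94]   # only the data values quoted on the slides


def z_score(y, mean=MEAN, sd=SD):
    """Number of standard deviations that y lies from the mean."""
    return (y - mean) / sd


def label(z):
    """Rule of thumb from the 'More about Z-Scores' slide."""
    size = abs(z)
    if size > 3:
        return "unlikely or rare"
    if size >= 2:
        return "potential outlier"
    return "not unusual"          # added label, not from the slides


for y in KNOWN_VALUES:
    z = z_score(y)
    print(f"{y}%: z = {z:+.2f} ({label(z)})")
```

Note that 58 and 60 are plotted individually in the boxplot (they fall below the lower inner fence of 61.25), yet their z-scores are smaller than 2 in absolute value, so the two rules of thumb do not always flag the same observations.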


Obtain Summary Statistics and Graphical Displays in R for: The Distribution of z-Scores of Percentage of Educational Attainment

The Distribution of z-Scores for Percentage of Educational Attainment

• The distribution of z-scores for educational attainment (n = 35 observations):

$$\frac{37 - 78.51}{14.48} = -2.87,\quad \frac{39 - 78.51}{14.48} = -2.73,\quad \frac{47 - 78.51}{14.48} = -2.18,\ \ldots,\ \frac{94 - 78.51}{14.48} = 1.07$$

So, we confirmed that the distribution of z-scores has mean 0 and standard deviation 1.

How could the standard deviation of z-scores be 1?

Note that the standard deviation is the square root of the variance, so the standard deviation is 1 when the variance is 1.

• So, how could the variance of z-scores be 1?

Recall the formula for variance (sample variance, actually):

$$\text{sample variance} = \frac{\text{sum of squared distances from the mean}}{\text{sample size} - 1}$$

We denote the sample variance by $S^2$:

$$S^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$


Variance of z-Scores ($S^2 = 1$)

Since z-scores have mean 0, the sample variance of the z-scores is:

$$S^2 = \frac{\sum_{i=1}^{n} (Z_i - \bar{Z})^2}{n - 1} = \frac{\sum_{i=1}^{n} (Z_i - 0)^2}{n - 1} = \frac{\sum_{i=1}^{n} Z_i^2}{n - 1}$$

So, in order for the sample variance of z-scores to be 1, the sum of all squared z-scores must be $n - 1$:

$$S^2 = \frac{\sum_{i=1}^{n} Z_i^2}{n - 1} = \frac{n - 1}{n - 1} = 1$$

• Let’s square the z-scores $Z_1, Z_2, \ldots, Z_n$. Note that in our example we have n = 35 observations, so the sum should be 34.

• Then we sum all the squared z-scores (all 35 z-scores). We will use R to do this.
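The reason the sum of squared z-scores always equals $n - 1$ can also be seen algebraically. The short derivation below is a sketch that substitutes the z-score definition used on these slides, $Z_i = (y_i - \bar{y})/S$, into the sum and then applies the sample variance formula; it assumes $S > 0$.

```latex
\sum_{i=1}^{n} Z_i^2
  = \sum_{i=1}^{n} \left( \frac{y_i - \bar{y}}{S} \right)^2
  = \frac{1}{S^2} \sum_{i=1}^{n} (y_i - \bar{y})^2
  = \frac{(n - 1)\, S^2}{S^2}
  = n - 1
```

The third step uses $\sum_{i=1}^{n} (y_i - \bar{y})^2 = (n - 1) S^2$, which is the sample variance formula rearranged.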

The Distribution of Z-scores of Data (Percentage of Educational Attainment) & The Distribution of Original Data (Percentage of Educational Attainment)

Both histograms have the same shape.


The Distribution of Z-scores of Data (Percentage of Educational Attainment) & The Distribution of Original Data (Percentage of Educational Attainment)

Both boxplots have the same shape.

For Bell-shaped and Symmetric Distributions: The Distance from the Mean to Either Quartile is about 2/3 of S

• For a distribution that is approximately bell-shaped and symmetric, $\frac{IQR}{S} \approx \frac{4}{3} \approx 1.33$

• That is, the IQR is roughly (4/3) S.


Check if Distribution of Percentage of Educational Attainment is Bell-Shaped and Symmetric

Recall that percentage of educational attainment had: IQR = 10.5 and S = 14.48

• $\frac{IQR}{S} = \frac{10.5}{14.48} \approx 0.73$, which is not approximately 1.33.

• The distribution of percentage of educational attainment is not bell-shaped and symmetric.
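The ratio check can be written as a small screening function. This Python sketch uses IQR = 10.5, as calculated on the earlier IQR slide, and S = 14.48. The function names and the tolerance of 0.25 around 4/3 are hypothetical, illustrative choices; the slides do not specify how close to 1.33 the ratio has to be.

```python
def iqr_to_sd_ratio(iqr, sd):
    """Ratio IQR / S, which is about 4/3 for a bell-shaped symmetric distribution."""
    return iqr / sd


def roughly_bell_shaped(iqr, sd, target=4 / 3, tolerance=0.25):
    """Screen: is IQR / S within `tolerance` of 4/3? The tolerance is an arbitrary choice."""
    return abs(iqr_to_sd_ratio(iqr, sd) - target) <= tolerance


ratio = iqr_to_sd_ratio(10.5, 14.48)
print("IQR / S =", round(ratio, 2))
print("close to 4/3?", roughly_bell_shaped(10.5, 14.48))
```

A ratio near 4/3 does not by itself prove that a distribution is bell-shaped; the check is most useful for ruling the shape out, as it does here.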

Summary

• We report the mean and standard deviation for approximately bell-shaped and symmetric distributions.

• We report the median and IQR for skewed distributions.

• The mean, range, SD, and variance are not resistant (are sensitive) to extremely low and high values in the data. They are heavily influenced by them.

• The median and IQR are resistant to (not sensitive to, not affected as much by) extremely low or high values in the data.

• We need to examine outlying points in the data.

• We need to conduct a sensitivity analysis to determine the effects of outliers on the reported statistics. That is, we need to conduct statistical analyses with and without the outlying points in the data.

• We also need to make sure that an outlying point is an accurate measurement (that is, we need to check whether an extreme measurement was recorded correctly).

• Calculate z-scores to find how many standard deviations an observation is away from the mean of all observations in the data.

• If an observation has a z-score of less than -3 or more than +3 (larger than 3 in absolute value), it is an unlikely or rare case.