chapter 5 · 2018. 9. 9. · •a bar graph is used to display categorical (nominal) information....


  • Chapter 5

  • Categorical

    • A general name for non-numerical data; the data is separated into

    categories of some kind.

    • Nominal data

    • Categorical data with no implied order.

    • Eg. Eye colours, favourite TV show, favourite music genre.

    • Ordinal data

    • Categorical data with some form of implied order.

    • Eg. A question where the respondent selects a level of satisfaction, i.e. very good – good – moderate – bad.

  • Numerical data

    • A general name for data that is numerical in its nature.

    • Discrete data

    • Numerical data where there is a countable number of values possible.

    • Eg. What year were you born?, How many pets do you own?

    • Continuous data

    • Numerical data where there is an infinite number of values possible; often associated with measurement of some kind. The values can be any (real) numbers within a particular range.

    • Eg. How tall are the players in the Melbourne Victory squad? (The values recorded vary depending on the accuracy requested or the degree of accuracy possible with our measuring instrument.)

  • • We make this distinction between discrete and continuous data

    because there are different things we can do with each type.

    However, we often treat continuous data as if it were discrete.

    This is because we must record the data in some way. With the

    height example previously, if the question said to record the

    data to the nearest cm, then we would have discrete

    possibilities.

  • • A frequency table is a structured method of recording statistical data.

    • If the range of values is too big, group the data into class

    intervals.

    No. of siblings    Tally marks    Frequency
    0                  III            3
    1                  IIII           4
    2                  II             2
    Total                             9

  • • We can also add in relative frequency and percentage

    frequency columns to compare values in different frequency

    tables where the total for each table is different.

    • Relative frequency

    • Divide the frequency by the total number of data values in the table.

    • The relative frequency column should sum to 1.

    • Percentage frequency

    • Multiply the relative frequency by 100.

    • The percentage frequency column should sum to 100.
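
The two extra columns can be checked with a few lines of Python. This is only a sketch using the "number of siblings" table above; the variable names are illustrative and not part of the original notes.

```python
# Frequency table from the "number of siblings" example.
frequencies = {0: 3, 1: 4, 2: 2}

total = sum(frequencies.values())  # total number of data values

# Relative frequency = frequency / total; percentage frequency = relative x 100.
relative = {value: f / total for value, f in frequencies.items()}
percentage = {value: 100 * r for value, r in relative.items()}

for value, f in frequencies.items():
    print(f"{value}: f={f}  relative={relative[value]:.3f}  percentage={percentage[value]:.1f}%")

# The columns should sum to 1 and 100 (up to floating-point rounding).
print(round(sum(relative.values()), 10), round(sum(percentage.values()), 10))
```

Rounded column entries may not add to exactly 1 or 100 on paper; small differences come from rounding.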

  • • Although a frequency table is a useful way of recording data, when we group data values we lose information. A stem plot preserves this lost information.

    • Stem plots are usually divided into intervals of 10.

    • Example:

    • 12, 15, 23, 45, 56, 18, 44, 33, 23, 19, 34, 52, 59, 41, 23, 13, 9, 11, 15, 18

    • Note: We can easily identify the highest

    and lowest score.

    • If we need to break the stem because of

    large numbers of leaf values, use a code

    such as 5L to represent the values in the

    lower half of the 50’s and 5U for the

    upper half

    STEM LEAF

    0 9

    1 1 2 3 5 5 8 8 9

    2 3 3 3

    3 3 4

    4 1 4 5

    5 2 6 9
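
As a cross-check of the stem plot above, here is a minimal Python sketch that groups the same data into stems of ten. The function name `stem_plot` is an illustrative choice, not something from the notes.

```python
from collections import defaultdict

data = [12, 15, 23, 45, 56, 18, 44, 33, 23, 19, 34, 52, 59, 41, 23, 13, 9, 11, 15, 18]

def stem_plot(values):
    """Return {stem: sorted leaves}, using the tens digit as the stem."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)

plot = stem_plot(data)
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Because the leaves are sorted, the lowest and highest scores are the first leaf of the first stem and the last leaf of the last stem.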

  • • A bar graph is used to display categorical (nominal) information.

    • Bar graphs can easily be constructed from data represented in a frequency

    table.

    • Leave gaps between the bars to show that we are not dealing with continuous data.

    • Example:

    Type of vehicle Frequency

    Sedan 16

    Ute 7

    Truck 8

    Stationwagon 6

    Motorbike 1

    Total 38

    [Bar graph: frequency (scale 0 to 20) by type of vehicle]

  • • A single rectangle is divided into pieces according to the contribution of the

    category to the whole.

    • An appropriate scale needs to be chosen to make the graph easier to draw.

    • Find the percentage contribution for each category and base the scale on

    those values.

    • Note: Number of grams

    involved = 78.6

    Component       Protein    Fat     Sat Fat    Carbs    Sugar
    No. of grams    17         18.2    6.3        34.5     2.6

    [Segmented bar chart: No. of grams (scale 0 to 100), divided into Protein, Fat, Sat Fat, Carbs and Sugar]
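
The percentage contribution of each category, which the scale is based on, could be worked out as in this short Python sketch (variable names are illustrative only):

```python
# Grams of each component, from the table above.
grams = {"Protein": 17, "Fat": 18.2, "Sat Fat": 6.3, "Carbs": 34.5, "Sugar": 2.6}

total = sum(grams.values())  # number of grams involved

# Percentage contribution of each component to the whole rectangle.
percentages = {name: 100 * g / total for name, g in grams.items()}

for name, pct in percentages.items():
    print(f"{name}: {pct:.1f}%")
```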

  • • When dealing with numerical data we should not really use a

    bar chart. However, the temptation is strong when we have

    discrete data

    • Uses strips instead of bars (to make it look different)

    • The graph does not have a title, but both axes need labels.

    • Gaps between strips to represent discrete data

  • • When dealing with continuous numerical data we use

    histograms.

    • Like a bar chart, but the bars are joined together. There are no gaps in the data values so we leave no gaps in the diagram.

    • Leave a half-column gap between the first column and the

    vertical axis.

    • Note: the break symbol on the axis indicates that part of the horizontal axis has been left out.

    • Make the intervals the same size.

  • • When we have a frequency table that has already been set up and the intervals are not the same size, the information can be misleading.

    • Looking at the histogram, it seems to indicate that the most dangerous age range is 30-

  • • To draw a histogram correctly when we have inconsistent intervals we need to add a percentage frequency column and a frequency density column.

    • Frequency density

    • $\text{Frequency density} = \dfrac{\text{percentage frequency}}{\text{class width}}$

    • Remember:

    • $\text{Percentage frequency} = \dfrac{\text{frequency}}{\text{total number of data values}} \times 100$

  • • Example:

    • $\text{Percentage frequency} = \dfrac{3}{299} \times 100 = 1.0\%$

    • $\text{Frequency density} = \dfrac{1.0}{5} = 0.2$

    Note: add the vertical axis title

    ‘frequency density’

  • • 5.3.4:

    • The frequency table gives information about the number of traffic fatalities for females in Victoria during 2002. Draw a histogram of the data, taking note of the inconsistent class intervals used.

  • • Add the percentage frequency column and a frequency density column.

    • $\text{Percentage frequency} = \dfrac{\text{frequency}}{\text{total number of data values}} \times 100$

    • $\text{Frequency density} = \dfrac{\text{percentage frequency}}{\text{class width}}$

    Age range   No. of deaths   Percentage frequency   Frequency density

    0- 1 (1 / 98) x 100 = 1.0 (1.0 / 5) = 0.2

    5- 3 (3 / 98) x 100 = 3.1 (3.1 / 8) = 0.3875

    13- 5 (5 / 98) x 100 = 5.1 (5.1 / 3) = 1.7

    16- 6 (6 / 98) x 100 = 6.1 (6.1 / 2) = 3.05

    18- 11 (11 / 98) x 100 = 11.2 (11.2 / 4) = 2.8

    22- 6 (6 / 98) x 100 = 6.1 (6.1 / 4) = 1.525

    26- 5 (5 / 98) x 100 = 5.1 (5.1 / 4) = 1.275

    30- 7 (7 / 98) x 100 = 7.1 (7.1 / 10) = 0.71

    40- 12 (12 / 98) x 100 = 12.2 (12.2 / 10) = 1.22

    50- 11 (11 / 98) x 100 = 11.2 (11.2 / 10) = 1.12

    60- 16 (16 / 98) x 100 = 16.3 (16.3 / 15) = 1.087

    75+ 15 (15 / 98) x 100 = 15.3 (15.3 / 15) = 1.02

    Σf=98

  • Age range   No. of deaths   Percentage frequency   Frequency density

    0- 1 1.0 0.204

    5- 3 3.1 0.3825

    13- 5 5.1 1.7

    16- 6 6.1 3.06

    18- 11 11.2 2.805

    22- 6 6.1 1.53

    26- 5 5.1 1.275

    30- 7 7.1 0.714

    40- 12 12.2 1.224

    50- 11 11.2 1.122

    60- 16 16.3 1.089

    75+ 15 15.3 1.021

    Σf=98
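
The two new columns can be reproduced with the following Python sketch. The class widths are those used in the working above, including the width of 15 that the table assumes for the open-ended 75+ class. Percentages are not rounded before dividing, so the densities match the second table.

```python
# (lower boundary, class width, number of deaths) from the table above.
# The width of 15 for the 75+ class is the value used in the worked table.
classes = [
    (0, 5, 1), (5, 8, 3), (13, 3, 5), (16, 2, 6), (18, 4, 11), (22, 4, 6),
    (26, 4, 5), (30, 10, 7), (40, 10, 12), (50, 10, 11), (60, 15, 16), (75, 15, 15),
]

total = sum(f for _, _, f in classes)  # sum of the frequencies

rows = []
for lower, width, f in classes:
    pct = 100 * f / total      # percentage frequency
    density = pct / width      # frequency density
    rows.append((lower, f, pct, density))
    print(f"{lower:>2}-  f={f:>2}  pct={pct:5.1f}  density={density:.3f}")

# The tallest column of the corrected histogram is the class with the largest density.
tallest = max(rows, key=lambda row: row[3])
```

On these figures the tallest column belongs to the 16- class, not to the class with the largest raw frequency.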

  • • When we have a continuous data set recorded in class intervals we can draw

    what is called a cumulative frequency diagram.

    • This shows the number of data values less than a particular value.

    • Add a column to the frequency table labelled ‘cumulative frequency’

    • Add a new first row to emphasise that we are finding the number of data values less than the given value.

    • The cumulative frequency column tells us how many values there are less than

    the right-hand end-point of the interval.

    • Eg.

  • • Example:

    Class interval

    (mass, kg)

    Frequency

    (number of students)

    10 -

  • • Once we have added the ‘cumulative frequency’ column we can

    now draw the diagram.

    [Cumulative frequency diagram: cumulative frequency (0 to 40) against mass (kg), 10 to 50]

  • • How to read the diagram

    • What percentage of people weigh less than 35 kg?

    • Find the cumulative frequency: 20 people

    • How many people in the data set: 34

    • $\dfrac{20}{34} \times 100 = 58.82\%$

    • About 59% of the people surveyed weigh less than 35 kg.

    [Cumulative frequency diagram: cumulative frequency (0 to 40) against mass (kg), 10 to 50]

  • • We can use our CAS to find percentiles, but only as approximate values. These arise from questions such as “Under what mass are 70% of the students?”, where you have also been given the distribution of the students’ weights.

    • We have the following information

    • We need to enter the cumulative frequency values as the second column in our list. As a check, the total frequency should be the value in the last row.

    Class interval

    (mass, kg)

    Frequency

    (Number of students)

    10 -

  • • 1. Enter the data in the Lists & spreadsheet

    application.

    • Call the first column x

    • Enter the right-hand boundary values of the class

    interval

    • Call the second column y

    • Include the point (10,0)

    Class interval

    (mass, kg)

    Frequency

    (Number of students)

    Cumulative

    frequency

  • • 2. Insert a new page and choose Data & Statistics. Label the axes.

    • 3. Join the dots by pressing Menu > Plot Type > XY Line Plot

  • • 4. To estimate the 70th percentile we need to know 70% of the total cumulative frequency. In this case this is 0.7 × 34 = 23.8. We now want to draw in the line y = 23.8. To do this press Menu > Analyse > Plot Function and fill in the dialog box. We can now approximate the x-value of the point of intersection by simply reading from the graph.

    • Therefore, the 70th percentile is approximately 37 kg.
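
Reading the intersection from the graph amounts to linear interpolation along the XY line plot. The Python sketch below illustrates this. The frequency table for this example did not survive in these notes, so the class boundaries and cumulative frequencies here are hypothetical (only the total of 34 is taken from the text), and the result will not match the 37 kg read from the original graph.

```python
def percentile_from_cumulative(boundaries, cumulative, percent):
    """Estimate a percentile by interpolating along the cumulative frequency line plot."""
    target = percent / 100 * cumulative[-1]  # e.g. 70% of the total frequency
    for i in range(1, len(boundaries)):
        low_c, high_c = cumulative[i - 1], cumulative[i]
        if low_c <= target <= high_c and high_c > low_c:
            fraction = (target - low_c) / (high_c - low_c)
            return boundaries[i - 1] + fraction * (boundaries[i] - boundaries[i - 1])
    raise ValueError("target lies outside the cumulative frequency range")

# HYPOTHETICAL data: right-hand boundaries (kg) and cumulative frequencies,
# starting with the point (10, 0) as in step 1.
boundaries = [10, 20, 30, 40, 50]
cumulative = [0, 3, 11, 25, 34]

p70 = percentile_from_cumulative(boundaries, cumulative, 70)
print(f"Estimated 70th percentile: {p70:.1f} kg")
```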

  • • Mean

    • The measure of central tendency found by adding together all of the data values and dividing by the number of data values.

    $\text{mean} = \dfrac{\text{sum of the data values}}{\text{number of data values}} = \dfrac{\Sigma x}{n} = \dfrac{\Sigma xf}{\Sigma f}$

    • Median

    • The measure of central tendency that is the (physical) middle value of the ordered data set.

    for odd $n$, the median is the $\left(\frac{n}{2} + 0.5\right)$th value.

    for even $n$, the median is the mean of the $\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th values.

    • Mode

    • The measure of central tendency that is the most frequently occurring data value.

  • • 1. Enter the data in Lists & Spreadsheets using the midpoint values to represent each group.

    • 2. Press Menu > Statistics > Stat Calculations > One-Variable Statistics to bring up this dialogue box.

  • • 3. In this case we have only one list of data values, as the second list is just the frequencies. So, select OK. This brings up another dialogue box, which needs to be filled in as shown.

    • (If you had just entered a list of ungrouped data values then you would leave Frequency List as ‘1’.)

  • • 4. Select OK again and we get the summary statistics included in our table. Arrow down to highlight the value beside $\bar{x}$ and read off the value of the mean from the entry line at the bottom. In this case it is 9.4375.

    • 5. Continue to arrow down to find the value beside MedianX, as this is the median. This tells us the median is 7, but as this is grouped data, we should really say the median occurs in the 5-9 group.

  • • Find the mode, median and mean of the following:

    • Mode: look in the frequency column for the highest frequency; the mode is the score with that frequency.

    • Mode = 5 and 7 (each occurs 23 times)

    Score Frequency

    4 15

    5 23

    6 14

    7 23

    8 17

    9 19

    10 20

  • • Median

    • Find Σf

    for odd $n$, the median is the $\left(\frac{n}{2} + 0.5\right)$th value.

    • Median = $\left(\frac{131}{2} + 0.5\right)$th value = 66th value = 7

    • Median = 7


    Score Frequency

    4 15

    5 23

    6 14

    7 23

    8 17

    9 19

    10 20

    Σ𝑓= 131

  • • Mean:

    • Add a new column to the frequency table and label it xf (x stands for the

    data value, f for the frequency), and fill it by multiplying the data values by

    the frequency for that value.

    • Mean = $\dfrac{\Sigma xf}{\Sigma f} = \dfrac{927}{131} = 7.08$ (2 dp)


    Score Frequency xf

    4 15 60

    5 23 115

    6 14 84

    7 23 161

    8 17 136

    9 19 171

    10 20 200

    Σ𝑓= 131 Σ𝑥𝑓 = 927
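
The three results for this table can be verified with a short Python sketch. The helper `value_at` is an illustrative name; it walks the cumulative frequencies to find the score at a given position in the ordered data.

```python
# Score -> frequency, from the table above.
table = {4: 15, 5: 23, 6: 14, 7: 23, 8: 17, 9: 19, 10: 20}

n = sum(table.values())  # sum of the frequencies

# Mode: every score that shares the highest frequency.
highest = max(table.values())
modes = [score for score, f in table.items() if f == highest]

# Mean: sum of xf divided by sum of f.
sum_xf = sum(score * f for score, f in table.items())
mean = sum_xf / n

def value_at(position):
    """Return the score at a 1-based position in the ordered data."""
    running = 0
    for score in sorted(table):
        running += table[score]
        if position <= running:
            return score
    raise IndexError("position exceeds the number of data values")

# Median: middle value for odd n, mean of the two middle values for even n.
if n % 2 == 1:
    median = value_at((n + 1) // 2)
else:
    median = (value_at(n // 2) + value_at(n // 2 + 1)) / 2

print(modes, median, round(mean, 2))
```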

  • • Find the modal class, median class and mean (2 dp) for the following

    grouped discrete data sets.

    Score Frequency

    10-19 15

    20-29 23

    30-39 17

    40-49 15

    50-59 13

    60-69 20

    70-79 5

    Modal class: Look in the frequency column for the highest frequency; the modal class is the class interval associated with it.

    Modal class = 20-29

  • • Median class

    • Find Σf

    for even $n$, the median is the mean of the $\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th values.

    • $\frac{108}{2}$th value and $\left(\frac{108}{2} + 1\right)$th value

    • 54th and 55th values

    • 54th and 55th values are in the 30-39 interval

    • Median class = 30-39

    Score Frequency

    10-19 15

    20-29 23

    30-39 17

    40-49 15

    50-59 13

    60-69 20

    70-79 5

    Σf= 108

  • • When we deal with grouped data we cannot find a single mode; instead we find the modal class. We can find a specific value for the mean, but we first need a single value to represent each class interval.

    • We use the middle value of the interval and call this the midpoint of the interval ($x_m$).

    • We add two new columns to the frequency table: $x_m$ and $x_m f$.

  • • Add the new columns 𝑥𝑚 and 𝑥𝑚𝑓

    • Find the mean:

    • Mean = $\dfrac{\Sigma x_m f}{\Sigma f} = \dfrac{4406}{108} = 40.80$


    Score Frequency 𝒙𝒎 𝒙𝒎𝒇

    10-19 15 14.5 217.5

    20-29 23 24.5 563.5

    30-39 17 34.5 586.5

    40-49 15 44.5 667.5

    50-59 13 54.5 708.5

    60-69 20 64.5 1290

    70-79 5 74.5 372.5

    Σ𝑓= 108 Σ𝑥𝑚𝑓 = 4406
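
A minimal Python sketch of the same calculation, building the midpoint and midpoint-times-frequency columns from the class limits (names are illustrative):

```python
# ((lower limit, upper limit), frequency) from the grouped table above.
grouped = [
    ((10, 19), 15), ((20, 29), 23), ((30, 39), 17), ((40, 49), 15),
    ((50, 59), 13), ((60, 69), 20), ((70, 79), 5),
]

rows = []
for (low, high), f in grouped:
    midpoint = (low + high) / 2          # the x_m column
    rows.append((midpoint, f, midpoint * f))  # last entry is the x_m * f column

sum_f = sum(f for _, f, _ in rows)
sum_xmf = sum(xmf for _, _, xmf in rows)
mean = sum_xmf / sum_f

# Modal class: the interval with the highest frequency.
modal_class = max(grouped, key=lambda item: item[1])[0]

print(sum_f, sum_xmf, round(mean, 2), modal_class)
```

The mean is an estimate, since every value in a class is treated as if it sat at the midpoint.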

  • • For the following data sets draw a cumulative frequency curve to find an

    estimate for the median.

    • Add a cumulative frequency column to the table. Note that

  • • Draw the cumulative frequency curve

    for even $n$, the median is the mean of the $\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th values.

    • Median position: $\frac{146}{2}$ and $\frac{146}{2} + 1$, so the median lies between the 73rd and 74th values

    • Median is approximately 156

    • Draw a line from the median position across to the curve and then draw a line from this point on the curve down to the horizontal axis.

    Score Frequency Cumulative 𝒇

  • • The median divides the data set into two equal pieces. We can

    extend this idea by dividing the data set into any number of

    equal-sized pieces. We call these pieces quantiles.

    • Some quantiles have special names.

    • 4 equal pieces = quartiles

    • 10 equal pieces = deciles

    • 100 equal pieces = percentiles

  • • Lower quartile (Q1 or QL)

    • Upper quartile (Q3 or QU)

    • We consider there to be approximately 25% of the data values

    in each of the quartiles.

  • • Sometimes we want to find how the data spreads out from the central values (mean, median, mode).

    • The range (maxX – minX) is the simplest measure of spread.

    • Sensitive to extreme values (outliers)

    • Example: 1, 1, 1, 4, 6, 8, 8, 9, 10, 13, 13, 14

    • Range = 14 – 1 = 13

    • The range represents the difference between the highest score and the lowest score.

    • Interquartile range (IQR) = upper quartile (Q3) – lower quartile (Q1)

    • The IQR represents the spread of the middle 50% of the data set.

  • • 5.5.1

    • Find the interquartile range for the following data set.

    2, 3, 5, 8, 9, 11, 15, 16, 18, 25, 36

    • IQR = Q3 – Q1

    • IQR = 18 – 5

    • IQR = 13

    [Diagram: the ordered data set with Q1, the median and Q3 marked]
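
The quartiles in this example come from splitting the ordered data at the median and taking the median of each half. A Python sketch of that convention follows; the function names are illustrative. Other software may use a different quartile convention and give slightly different values.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def five_figure_summary(values):
    """Return (min, Q1, median, Q3, max); for odd n the median is left out of both halves."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]
    upper_half = ordered[(n + 1) // 2 :]
    return ordered[0], median(lower_half), median(ordered), median(upper_half), ordered[-1]

data = [2, 3, 5, 8, 9, 11, 15, 16, 18, 25, 36]
low, q1, med, q3, high = five_figure_summary(data)
iqr = q3 - q1
print(q1, med, q3, iqr)
```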

  • • We use quartiles to draw up boxplots.

    • The key features of box plots are shown in the diagram.

    • The five values shown are sometimes referred to as the five-figure

    summary for a data set.

    • The box plot needs a scale line associated with it.

    • The ‘box’ is the central box representing the middle 50% of the data.

    • The ‘whiskers’ are the lines that go out to the extreme values.

  • • 5.5.3a

    • Find the interquartile range and draw a boxplot for the following data.

    • Find Σf

    • Σf = 24

    • Identify the median

    • Median = mean of the $\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th values.

    • Median is between the 12th and 13th values = 2

    • Identify Q1 and Q3

    • 12 values in each of the lower and upper halves

    • Q1 is between the 6th and 7th values = 1

    • Q3 is between the 18th and 19th values = 3

    • Calculate the IQR

    • IQR = 3 – 1 = 2

    Score Frequency

    0 3

    1 6

    2 7

    3 4

    4 3

    5 1

  • • Draw a box plot to represent the data

    [Box plot: Q1 = 1, Median = 2, Q3 = 3]

  • • 5.5.3c

    • Find the interquartile range and draw a boxplot for the following data.

    • Find Σf

    • Σf = 26

    • Identify the median

    • Median = mean of the $\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th values.

    • Median is between the 13th and 14th values = 30.5

    • Identify Q1 and Q3

    • 13 values in each of the lower and upper halves

    • Q1 is the 7th value = 15

    • Q3 is the 20th value = 42

    • Calculate the IQR

    • IQR = 42 – 15 = 27

    Stem Leaf

    1 1 2 3 3 4 4 5 8

    2 2 3 4 5

    3 0 1 1 2 6 6

    4 2 2 3 9

    5 0 9

    6 1 8

  • [Box plot: Q1 = 15, Median = 30.5, Q3 = 42]

  • • 5.5.5

    • Draw a box plot for the following data which is displayed in a percentage cumulative frequency graph

    • Draw a line from the 50% mark to estimate the median

    • Approx median = 18

    • Draw a line at the 25% mark and 75% mark to estimate Q1 and Q3

    • Approx Q1 = 12

    • Approx Q3 = 23.5

    • Calculate the IQR

    • IQR = 23.5 – 12 = 11.5

  • [Box plot: Q1 = 12, Median = 18, Q3 = 23.5]

  • • 1. Enter the data the normal way. As we will be drawing a graph you need to name the column. We will use x here. Then get the summary statistics screen by pressing Menu > Statistics > Stat Calculations > One-Variable Statistics. When you get to the dialog box you need to make X1 List x, as this is the name you gave the column with the data in it. Leave Frequency List as 1 since we entered each value individually. You will be asked about the x – say it is a Variable Reference.

  • • 2. Arrow down to get the five-figure summary on the screen and write the values into your exercise book.

    • 3. Insert a new page and choose Add Data & Statistics. Name the horizontal axis with the name given to column A, in our case x. Now press Menu > Plot Type > Box Plot and our boxplot appears.

    • Move your cursor around the plot to read off the values.

    • The dot represents an outlier or extreme value.

    • An outlier is a value that lies more than 1.5 x IQR away from the nearer of the upper or lower quartiles.

  • • The standard deviation, σ (sigma), is a measure of spread relating to how far the data deviates from the mean.

    • I.e. how far from the centre we might expect to still find data. The larger the number, the more spread out the data is.

    • Note:

    • Population: when we gather data from the whole group. E.g. the height of year 11s – every single year 11 is measured.

    • Mean of population: $\mu$ (mu)

    • Sample: when we gather data from a select group to represent the population. E.g. the height of year 11s – one class may be measured to represent the whole group.

    • Mean of sample: $\bar{x}$ (x bar)

    • $\mu$ and $\bar{x}$ are measures of central tendency that tell us where we might expect to find the centre of the data set.

  • • We define variance as the mean of the squared deviations from the mean and use the symbol $\sigma^2$ (sigma squared) to represent it.

    • $\text{Variance} = \dfrac{\text{sum of the squared deviations from the mean}}{\text{number of values}}$

    • $\sigma^2 = \dfrac{\Sigma (x - \mu)^2}{n}$

    • For grouped intervals

    • $\sigma^2 = \dfrac{\Sigma f (x - \mu)^2}{\Sigma f}$

  • • We squared the deviations to make them positive before adding. To take account of this we now find the positive square root of the variance and call this the standard deviation ( σ ) of the population.

    • So…

    • $\sigma = \sqrt{\dfrac{\Sigma f (x - \mu)^2}{\Sigma f}}$

  • • There are two standard deviations

    calculated. sx is the standard deviation if

    the data represents a sample whereas σx is the standard deviation if the data represents a population. In this case we are using all the members of a team so we would use the σx value.

    • 1. Find the midpoint for each class interval

    and add them to the table

    Distance covered (nearest km)   Number of runners   x

    21-30 3 25.5

    31-40 5 35.5

    41-50 4 45.5

    51-60 2 55.5

    61-70 1 65.5

    71-80 1 75.5

  • • 2. Enter the data and go to the One-Variable Statistics screen and make the entries shown. (It is slightly different to previously since we are not drawing a graph.)

    • 3. Move down to OK and confirm.

    • 4. The mean is $\bar{x}$ = 43 and the standard deviation we are interested in is σx = 13.92. Write these in your book.

    • Note: the calculator has given us the values of the mean ($\bar{x}$), the sum of the x-values ($\Sigma x$), and the sum of the squared values ($\Sigma x^2$). Note also that $\bar{x}$ is used for the mean whereas, strictly speaking, it should be $\mu$, as this was a population. However, there is no difference between the values of $\bar{x}$ and $\mu$.
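
The calculator's σx value can be reproduced from the grouped formula with this Python sketch, using the runners table above (variable names are illustrative):

```python
import math

# ((lower limit, upper limit), number of runners) from the table above.
runners = [((21, 30), 3), ((31, 40), 5), ((41, 50), 4), ((51, 60), 2), ((61, 70), 1), ((71, 80), 1)]

midpoints = [((low + high) / 2, f) for (low, high), f in runners]

sum_f = sum(f for _, f in midpoints)
mu = sum(x * f for x, f in midpoints) / sum_f          # population mean

# Population variance: mean of the squared deviations, weighted by frequency.
variance = sum(f * (x - mu) ** 2 for x, f in midpoints) / sum_f
sigma = math.sqrt(variance)                            # population standard deviation

print(mu, round(sigma, 2))
```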

  • • 5.5.8a

    • Find the standard deviation, correct to one decimal place, of the following

    sample data sets.

    • 3, 3, 4, 6, 2, 1, 3, 4, 6, 7, 5, 3, 2, 1, 7, 9

    • Enter the data in your CAS using ‘Lists & Spreadsheets’

    • Press Menu > Statistics > Stat Calculations > One-Variable Statistics.

    • When prompted choose what you labelled column A

    • sx = 2.3 (1 dp)
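
For comparison, Python's standard `statistics` module offers both versions of the standard deviation: `stdev` divides by n − 1 (the sample value, corresponding to sx here) and `pstdev` divides by n (the population value, corresponding to σx). A short sketch on the same data:

```python
import statistics

sample = [3, 3, 4, 6, 2, 1, 3, 4, 6, 7, 5, 3, 2, 1, 7, 9]

s_x = statistics.stdev(sample)       # sample standard deviation (divides by n - 1)
sigma_x = statistics.pstdev(sample)  # population standard deviation (divides by n)

print(round(s_x, 1), round(sigma_x, 2))
```

The sample value is always slightly larger than the population value for the same data.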

  • • Comparative analysis

    • Used when dealing with the analysis of more than one set of data

    • Absolute analysis

    • Used when dealing with the analysis of just one set of data

    • Outliers (Extreme values)

    • A value more than 1.5 x IQR away from the nearer of the upper and

    lower quartiles

  • • Skewness

    • A term used to describe data sets that are not symmetrical.

    • Boxplots:

    • Negative skew: indicates that the median is closer to the upper quartile

    • Symmetrical: indicates that the median lies in the middle

    • Positive skew: The median is closer to the lower quartile.

  • • Histograms

    • Positive skew: mode < median < mean

    • Negative skew: mean < median < mode

    • Degree of skewness

    • A numerical value to indicate the degree of skewness.

    $\text{Degree of skewness} = \dfrac{3(\text{mean} - \text{median})}{\text{standard deviation}}$

    • This value indicates the direction of the skew as well as its size.

    • The larger the value the greater the degree of skewness

  • • Skewed data sets are more likely to contain outliers

    • Use median and IQR as summary statistics

    • Mean and standard deviation are affected by extreme values

  • • 5.6.1 Describe the following data set, including a comment about its degree

    of skewness. The data set represents the score obtained out of 66 on a test.

    60, 51, 47, 42, 53, 34, 47, 39, 56, 63,

    35, 34, 50, 35, 41, 19, 48, 42, 37, 45, 29

    • Enter the data into the CAS and write down the summary statistics:

    • $\bar{x}$ (mean) = 43.1905

    • σx (st dev) = 10.35

    • minX = 19

    • Q1 = 35

    • Median = 42

    • Q3 = 50.5

    • MaxX = 63

  • • Calculate to determine if there are any outliers in the data set.

    • IQR = 50.5 – 35 = 15.5

    • 1.5 × IQR = 23.25

    • No outliers, because no value lies more than 1.5 × IQR beyond Q1 or Q3:

    • 50.5 + 23.25 = 73.75 (maxX = 63 is below this)

    • 35 – 23.25 = 11.75 (minX = 19 is above this)

    • Calculate the degree of skewness

    • $\dfrac{3(\text{mean} - \text{median})}{\text{standard deviation}} = \dfrac{3(43.19 - 42)}{10.35} = 0.34$

    • The data set contains no outliers, but is slightly positively skewed. This is confirmed by the median lying closer to the lower quartile and by the calculated value for the degree of skewness.
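
The outlier check and the degree of skewness for this data set can be reproduced with the Python sketch below. It assumes the median-of-halves convention for quartiles and the population standard deviation (σx), as in the working above; variable names are illustrative.

```python
import statistics

scores = [60, 51, 47, 42, 53, 34, 47, 39, 56, 63,
          35, 34, 50, 35, 41, 19, 48, 42, 37, 45, 29]

ordered = sorted(scores)
n = len(ordered)

# Quartiles: median of each half, leaving the median out when n is odd.
q1 = statistics.median(ordered[: n // 2])
q3 = statistics.median(ordered[(n + 1) // 2 :])
iqr = q3 - q1

# A value is an outlier if it lies more than 1.5 x IQR beyond the nearer quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

# Degree of skewness = 3(mean - median) / standard deviation.
mean = statistics.mean(scores)
med = statistics.median(scores)
sigma = statistics.pstdev(scores)
skewness = 3 * (mean - med) / sigma

print(q1, q3, iqr, lower_fence, upper_fence, outliers, round(skewness, 2))
```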

  • • Composite bar charts

    • Some can handle more than two categories

  • • Back-to-back stemplots

    • Can ONLY handle two categories

  • • Comparative boxplots

    • Several related boxplots using the SAME scale

    • Can be used for a number of categories.

  • • 1. In the Lists & Spreadsheets application, name column A ‘gender’ and column B ‘life’.

    • Enter the male data set in column B. In column A enter ‘male’, then choose Data > Fill Down from the menu and confirm. A dashed box will appear. Scroll down column A to the last piece of data and confirm again.

    • 2. Return to column B and, under the male life expectancies, fill in the female life expectancies. Give all these pieces of data the label ‘female’ in column A.

    Male 54 73 68 50 73 59 75 59 67 68

    Female 53 80 74 53 79 63 82 63 73 76

  • • To create the box plot, insert a new page and select Add Data & Statistics. Label the horizontal axis ‘Life’ and the vertical axis ‘Gender’, then press Menu > Plot Type > Box Plot.
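
The numbers behind the two boxplots can also be computed directly. This Python sketch uses the ten male and ten female values listed above and the median-of-halves convention for quartiles; the function name is illustrative, and a CAS using a different convention could give slightly different quartiles.

```python
import statistics

life = {
    "male": [54, 73, 68, 50, 73, 59, 75, 59, 67, 68],
    "female": [53, 80, 74, 53, 79, 63, 82, 63, 73, 76],
}

def five_figure_summary(values):
    """Return (min, Q1, median, Q3, max) using the median of each half for the quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = statistics.median(ordered[: n // 2])
    q3 = statistics.median(ordered[(n + 1) // 2 :])
    return ordered[0], q1, statistics.median(ordered), q3, ordered[-1]

summaries = {group: five_figure_summary(values) for group, values in life.items()}
for group, summary in summaries.items():
    print(group, summary)
```

Printing the summaries side by side supports the same comparisons that the boxplots on a shared scale make visually.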

  • • 5.6.2. The table below gives the life expectancy for males and females in

    the 40 most populated countries of the world. Draw comparative boxplots

    for males and females and discuss your findings.

    • Enter the data into your CAS to produce the boxplots and the five-figure summaries.

    • Both data sets are negatively skewed. Females outscore males on all of the five values. 50% of the female values exceed a value that fewer than 25% of the male values exceed. There are no outliers. The range of values and the IQR are greater for females.

  • • 5.6.5. The bar chart below shows the number of drivers killed in accidents that involved the vehicle ‘going off path on straight’ or ‘going off path on curve’ for the years 2002 and 2003. Compare the two years.

    • In both years many more males were killed than females.

    • In both years the youngest drivers killed were females. These drivers must have been driving illegally as they were under 16.

    • In 2002, three underage male drivers were killed.

    • More males died in 2002 than in 2003, except in the age class 18-21, where more males died in 2003 than in 2002.

    • Female figures are so low that useful comparisons cannot really be made.

  • • 5.6.8. The data below represents the height (cm) of the players for two clubs.

    Draw a back-to-back stemplot and use it to help make comparisons between

    the teams.

  • • Find the median and IQR of both teams.

    • WB median: $\frac{38}{2}$ and $\frac{38}{2} + 1$, i.e. between the 19th and 20th values = 186 cm

    • WB IQR = 193 – 181 = 12

    • Geelong median = 189.5

    • Geelong IQR = 193 – 184 = 9

    • Geelong appears to have the taller team; median height 189.5 cm

    compared with 186 cm.

    • Spread of height is slightly greater for WB with a larger IQR.

    • WB modal class is 18L and Geelong’s is 19L reinforcing that Geelong has the

    taller team.