describing data - dr. d. dambreville's math...



Post on 14-Mar-2020


Page 1: Describing Data - DR. D. Dambreville's Math Pageplanemath.weebly.com/.../algebra_1_describing_data.pdf · 2020-02-07 · employees a salary between $60,000 and $70,000, you could

6/19/2015 Algebra 1

http://cms.gavirtualschool.org/Shared/Math/Algebra_1_CCGPS/05_DescribingData/Algebra_DescribingData_SHARED_print.html 1/38

Algebra 1: Describing Data

Describing Data

Extra . . . Extra . . . Read all about it!!! Data is important in the world around us! Statistics are used every day in a variety of situations. USA Today includes charts, graphs, or tables on the bottom left of every section of each issue because its editors understand how difficult it is for people to make sense of statistics without visuals. When you were younger, you may have been asked to guess the number of items in a jar or container. Guesses like that are simple predictions, and it is important for us as individuals to have ways to predict things. Everything from the weather to the amount of money a college graduate will make can be predicted by using mathematical models. In this unit, we will represent sets of data in a variety of ways and discuss similarities and differences between data sets.

Essential Questions

How can we represent data on a real number line?
What do the characteristics of a bell curve and the empirical rule tell us about a normal distribution?
How can we make predictions with one-variable and two-variable data?

Module Minute

Mathematics allows us to use statistics to represent data in a variety of ways: dot plots, histograms, and box plots. Bell curves represent normal distributions. Bell curves of normal distributions are symmetric about the mean, with the highest point located at the mean. Measures of center (mean and median) focus on the average or middle term, while measures of spread (interquartile range, standard deviation) focus on how spread out or dispersed the data is. Standard deviation represents how spread out the numbers in a data set are. The closer together the data values are, the smaller the standard deviation. A line of best fit uses past data to predict future outcomes. Correlation describes the relationship (positive, negative, or none) between two variables, while causation explains how one variable causes another.

Key Terms

Measures of Central Tendency - Values that describe the center of a distribution. The mean, median, and mode are 3 measures of central tendency.
Mean - A measure of central tendency that is determined by dividing the sum of all values in a data set by the number of values.
Median - The value of the middle term in a set of organized data. For a set of data with an odd number of values, it is the value that has an equal number of data values before and after it, or the middle value. For a set of data with an even number of values, the median is the average of the 2 values in the middle positions.
Mode - The value or values that occur with the greatest frequency in a data set.
Outliers - Extreme values in a data set.
Broken Line Graph - A graph that is used when it is necessary to show change over time. A line is used to join the values, but the line has no defined slope.
Continuous Data - Data for which the plotted points can be joined.
Correlation - A statistical method used to determine whether or not there is a linear relationship between 2 variables.
Data Set - A collection of observations of a variable.
Dependent Variable - The variable represented by the values that are plotted on the y-axis.
Discrete Data - Data for which the plotted points cannot be joined.
Five-Number Summary - 5 values for a data set that include the smallest value, the lower quartile, the median, the upper quartile, and the largest value.
Frequency Distribution - A table that lists all of the classes and the number of data values that belong to each of the classes.
Histogram - A graph in which the classes are on the horizontal axis and the frequencies are plotted on the vertical axis. The frequencies are represented by vertical bars that are drawn adjacent to each other.
Independent Variable - The variable represented by the values that are plotted on the x-axis.
Interquartile Range (IQR) - The difference between the third quartile and the first quartile.
Left-Skewed Distribution - A distribution in which most of the data values are located to the right of the mean.
Line of Best Fit - A straight line drawn on a scatter plot such that the sums of the distances to points on either side of the line are approximately equal and such that there are an equal number of points above and below the line.
Midpoint - The value obtained by adding the lower and upper limits of a class and dividing the sum by 2.
Qualitative Variable - A variable that can be placed into specific categories according to some defined characteristic.
Quantitative Variable - A variable that is numerical in nature and that can be ordered.
Right-Skewed Distribution - A distribution in which most of the data values are located to the left of the mean.
Scatter Plot - A graph used to investigate whether or not there is a relationship between 2 sets of data. The data is plotted on a graph such that one quantity is plotted on the x-axis and one quantity is plotted on the y-axis.
Symmetric Histogram - A histogram for which the values of the mean, median, and mode are all the same and are all located at the center of the distribution.
Variable - A characteristic that is being studied.
Categorical Data - Data that are in categories and describe characteristics, or qualities, of a category.
Double Box-and-Whisker Plots - 2 box-and-whisker plots that are plotted on the same number line.
Numerical Data - Data that involves measuring or counting a numerical value.

To view the standards for this unit, please download the handout from the sidebar.

Measures of Central Tendency

Mean

The mean, often called the 'average' of a numerical set of data, is the sum of all of the numbers divided by the number of values in the data set. This value is the arithmetic mean, and it tells us what value we would have if all of the data were the same. The mean is the balance point of a distribution, and is one of the three measures of central tendency commonly used in statistics. The mean, the median, and the mode are all measures of central tendency. They all show where the center of a set of data "tends" to be. Each one is useful at different times.

The mean is a summary statistic that describes the entire data set, and it is especially useful with large data sets where you might not have the time to examine every single value. However, extreme values, or outliers, affect it. This means that when there are extreme values at one end of a data set, the mean is not a very good summary statistic.

For example, if you were employed by a company that paid all of its employees a salary between $60,000 and $70,000, you could probably estimate the mean salary to be about $65,000. However, if you had to add in the $150,000 salary of the CEO when calculating the mean, then the value of the mean would increase greatly. It would, in fact, be the mean of the employees' salaries, but it probably would not be a good measure of the central tendency of the salaries.
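The effect of the CEO's salary can be checked with a short Python sketch. The individual employee salaries below are hypothetical values chosen to fall between $60,000 and $70,000; only the $150,000 figure comes from the example above.

```python
from statistics import mean, median

# Hypothetical employee salaries (assumed for illustration, not from the text).
employee_salaries = [60_000, 62_000, 64_000, 65_000, 66_000, 68_000, 70_000]
ceo_salary = 150_000  # the CEO salary mentioned in the example

all_salaries = employee_salaries + [ceo_salary]

mean_without_ceo = mean(employee_salaries)  # mean of the employees only
mean_with_ceo = mean(all_salaries)          # one outlier pulls the mean upward
median_with_ceo = median(all_salaries)      # the median barely moves

print(mean_without_ceo, mean_with_ceo, median_with_ceo)
```

With these assumed salaries, adding the single outlier raises the mean by more than $10,000, while the median stays close to the middle of the employee salaries.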

All of the values for the means that you have calculated so far have been for ungrouped, or listed, data. A mean can also be determined for data that is grouped, or placed in intervals. Unlike listed data, the individual values for grouped data are not available, and you are not able to calculate their sum.

To calculate the mean of grouped data, the first step is to determine the midpoint of each interval, or class. These midpoints must then be multiplied by the frequencies of the corresponding classes. The sum of the products divided by the total number of values will be the value of the mean.

The following example will show how the mean value for grouped data can be calculated.

Example 1

The following table shows the frequency distribution of the number of hours spent per week texting messages on a cell phone by 60 tenth grade students at a local high school.

Time Per Week (Hours)    Number of Students (f)
0 to less than 5         8
5 to less than 10        11
10 to less than 15       15
15 to less than 20       12
20 to less than 25       9
25 to less than 30       5

Calculate the mean number of hours per week spent by each student texting messages on a cell phone. Hint: A table may be useful.

Solution:

The mean time spent per week by each student texting messages on a cell phone is 14 hours.

Now that you have created several distribution tables for grouped data, it's time to point out that the first column of the table can be represented in another way. As an alternative to writing the interval, or class, in words, it can be expressed as [# - #). The square bracket at the front closes the class, so the first number is included in the designated interval. The parenthesis at the end does not close the class, so the last number is not included in the designated interval. Keeping this in mind, the table can be presented as follows:

Time Per Week (Hours)    Number of Students (f)    Midpoint of Class (m)    Product (mf)
[0-5)                    8                         2.5                      20.0
[5-10)                   11                        7.5                      82.5
[10-15)                  15                        12.5                     187.5
[15-20)                  12                        17.5                     210.0
[20-25)                  9                         22.5                     202.5
[25-30)                  5                         27.5                     137.5
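The grouped-mean calculation in the table can be sketched in Python as follows. The classes and frequencies are taken from the texting example; the variable names are only illustrative.

```python
# Each class is (lower limit, upper limit, frequency), read as [lower, upper).
classes = [
    (0, 5, 8),
    (5, 10, 11),
    (10, 15, 15),
    (15, 20, 12),
    (20, 25, 9),
    (25, 30, 5),
]

total_students = 0
sum_of_products = 0.0
for lower, upper, frequency in classes:
    midpoint = (lower + upper) / 2           # m: midpoint of the class
    sum_of_products += midpoint * frequency  # running total of m * f
    total_students += frequency

# Mean of grouped data = (sum of m * f) / (total number of values).
grouped_mean = sum_of_products / total_students

print(total_students, sum_of_products, grouped_mean)
```

The products add up to 840 and there are 60 students, so the mean is 840 / 60 = 14 hours, which matches the solution stated above.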

Median

The median is the number in the middle position once the data has been organized. Organized data is simply the numbers arranged from smallest to largest or from largest to smallest. The median is the only number for which there are as many values above it as below it in the set of organized data, and it is referred to as the equal areas point. For an odd number of data values, the median is the value that is exactly in the middle of the ordered list; it divides the data into two halves. For an even number of data values, the median is the mean of the two values in the middle of the ordered list. The median is useful when there are a few extreme values that can affect the mean, because the middle number will stay in the middle. The median often gives a good impression of the center, because 50% of the values are above the median, 50% of the values are below the median, and it doesn't matter how big the biggest values are or how small the smallest values are.

Mode

The mode of a set of data is simply the number that appears most frequently in the set. There are no calculations required to find the mode of a data set. You simply need to look for it. However, be aware that it is common for a set of data to have no mode, one mode, two modes, or more than two modes. If there is more than one mode, simply list them all. And, if there is no mode, write 'no mode'. No matter how many modes there are, the same set of data will have only one mean and only one median. The mode is a measure of central tendency that is simple to locate but is not used much in practical applications. It is the only one of these three values that can be used for either categorical or numerical data. Remember the example regarding pets? The mode was 'dogs' because that was the most common response.

Range

The range of a data set describes how spread out the data is. To calculate the range, subtract the smallest value from the largest value (maximum value - minimum value = range). This value provides information about a data set that we cannot see from only the mean, median, or mode. For example, two students may both have a quiz average of 75%, but one of them may have scores ranging from 70% to 82% while the other may have scores ranging from 24% to 90%. In a case such as this, the mean would make the students appear to be achieving at the same level, when in reality one of them is much more consistent than the other.
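The two-student comparison can be illustrated with a small Python sketch. The individual quiz scores are hypothetical; they were chosen only so that both students average 75% with the lowest and highest scores mentioned above.

```python
from statistics import mean

# Hypothetical quiz scores (assumed for illustration, not from the text).
student_a = [70, 72, 74, 77, 82]  # scores from 70% to 82%
student_b = [24, 86, 87, 88, 90]  # scores from 24% to 90%


def data_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)


mean_a, mean_b = mean(student_a), mean(student_b)
range_a, range_b = data_range(student_a), data_range(student_b)

print(mean_a, range_a)
print(mean_b, range_b)
```

Both assumed score lists have the same mean, but the second range is more than five times the first, which is the inconsistency that the mean alone hides.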

Example 1

Stephen has been working at Wendy's for 15 months. The following numbers are the number of hours that Stephen worked at Wendy's during the past seven months:

24; 24; 31; 50; 53; 66; 78

What is the mean number of hours that Stephen worked per month?

Solution

Stephen has worked at Wendy's for 15 months, but the numbers given above are for seven months. Therefore, this set of data represents a sample of the population. The formula that is used to calculate the mean for a sample and for a population is the same. However, the symbols are different. The mean of a sample is denoted by x̄, which is called "x bar". The mean of an entire population is denoted by μ, which is the Greek letter "mu" (pronounced "myoo").

The number of data values in a sample is written as n. The following formula represents the steps that are involved in calculating the mean of a sample:

mean = (sum of the values) / (number of values)

This formula can now be written using symbols.

x̄ = (x1 + x2 + ... + xn) / n

You can now use the formula to calculate the mean of the hours that Stephen worked.

x̄ = (24 + 24 + 31 + 50 + 53 + 66 + 78) / 7 = 326 / 7 ≈ 46.57

The mean number of hours that Stephen worked during this time period was 47 hours per month.
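As a quick check of this result, the sample mean can be computed in Python. This is only a sketch; the names hours, n, and x_bar are illustrative choices that mirror the symbols used above.

```python
# Hours Stephen worked in each of the past seven months (from the example).
hours = [24, 24, 31, 50, 53, 66, 78]

n = len(hours)          # number of data values in the sample
x_bar = sum(hours) / n  # sample mean: sum of the values divided by n

print(sum(hours), n)
print(round(x_bar, 2))
print(round(x_bar))
```

The sum is 326 and n is 7, so the mean is about 46.57, which rounds to 47 hours per month.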

Example 2

The ages of several randomly selected customers at a coffee shop were recorded. Calculate the mean, median, mode, and range for this data.

23; 21; 29; 24; 31; 21; 27; 23; 24; 32; 33; 19

Solution

Mean:


(23 + 21 + 29 + 24 + 31 + 21 + 27 + 23 + 24 + 32 + 33 + 19) /12 = 307/12

307/12 ≈ 25.58

Median:

First, organize the ages in ascending order:

19; 21; 21; 23; 23; 24; 24; 27; 29; 31; 32; 33

Second, count in to find the middle value:

24; 24

The middle value will be halfway between these two values (or the average of these two values): (24 + 24) / 2 = 24

Mode:

Look for the value(s) that occur most frequently:

21; 23; 24

This data set has three modes.

Range:

Subtract the smallest value from the largest value (max - min = range):

33 - 19 = 14

Solution: Make your conclusion in context.

At this coffee shop, the mean age of people in this sample is 25.58 years old and the median age is 24 years old. There were three modes for age at 21, 23, and 24 years old, and the range for ages is 14 years.
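The four values in this example can be verified with Python's standard statistics module. This is a minimal sketch; multimode requires Python 3.8 or later, and the variable names are illustrative.

```python
from statistics import mean, median, multimode

# Ages of the coffee shop customers from Example 2.
ages = [23, 21, 29, 24, 31, 21, 27, 23, 24, 32, 33, 19]

mean_age = mean(ages)            # 307 / 12
median_age = median(ages)        # average of the two middle values
modes = sorted(multimode(ages))  # every value tied for the highest frequency
age_range = max(ages) - min(ages)

print(round(mean_age, 2), median_age, modes, age_range)
```

Using multimode rather than mode matters here, because mode returns only one value even when several values are tied for the highest frequency.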

Content from this page found at www.ck12.org and khanacademy.org CCBY

Dot Plots

One convenient way to organize numerical data is a dot plot. A dot plot is a simple display that places a dot (or X, or another symbol) above an axis for each datum value (datum is the singular of data). The axis should cover the entire range of the data. Even numbers that will have no data marked above them should be included, so that outliers and gaps are visible. There is a dot for each value, so values that occur more than once will be shown by stacked dots. Dot plots are especially useful when you are working with a small set of data across a reasonably small range of values. This type of graph gives a clear view of the shape, any mode(s), and the range of a set of data. The numbers are already in order, so finding the median is fairly quick. Any outliers are also quickly visible.

Look at the example below:

[Figure: dot plot of the ages of all of the sales people at Stinky's Car Dealership]

Shape

Once a graphical display is constructed, we can describe the distribution. When describing the distribution, we should be sure to address its shape. Although many graphs will not have a clear or exact shape, we can usually identify the shape as symmetrical or skewed. A symmetrical distribution will have a middle where we can draw an imaginary line through the center, and a fairly equal "look" on either side of that imaginary line. If you were to fold along the imaginary center line, the two sides would almost match up. Many symmetrical distributions are bell shaped: they are tall in the middle, with the two sides thinning out. The sides are referred to as tails. A skewed distribution is one in which the bulk of the data is concentrated on one end, with the other side being a longer tail. The direction of the longer tail is the direction of the skew.

Skewed right will have a longer tail to the right, or the higher values.
Skewed left will have a longer tail off to the left, or the lower values.

Other shapes that you might see are uniform (almost consistent height all the way across) and bimodal (having two peaks in the distribution).

Outliers

If there are any outliers, gaps, groupings, or other unusual features in the distribution, we should be sure to mention them. An outlier is a value that does not fit with the rest of the data. Some distributions will have several outliers, while others will not have any. We should always look for outliers because they can affect many of our statistics. Also, sometimes an outlier is actually an error that needs to be corrected. If you have ever 'bombed' one test in a class, you probably discovered that it had a big impact on your overall average in that class. This is because the mean is affected by an outlier: it is pulled toward it. This is another reason why we should be sure to look at the data, not just at the statistics about the data. When an outlier is part of the data and we do not realize it, we can be misled by the mean to believe that the numbers are higher or lower than they really are.

Center

The center of the distribution should always be included in the verbal analysis as well. People often wonder what the 'average' is. The measure of center can be reported as the median, the mean, or the mode. Even better, give more than one of these in your description. Remember that outliers affect the mean, but do not affect the median. For example, the median of a list of data will stay in the center even when the largest value increases tremendously, but such a change would affect the mean quite a bit.

Spread

Another thing to include in the description is the spread of the data. The spread is the specific range of the data. When analyzing a distribution, we don't want to simply say that the range is equal to some number. It is much more informative to say that the data ranges from _____ to _____ (minimum value to maximum value). For example, if the news reports that the temperature in St. Paul had a range of 20° during a given week, this could mean very different temperatures depending on the time of year. It would be more informative to say something specific, such as that the temperature in St. Paul ranged from 68° to 88° last week.

Example 1

An anthropology instructor at the community college is interested in analyzing the age distribution of her students. The students in her Anthropology 102 class are:

21, 23, 25, 26, 25, 24, 26, 19, 18, 19, 26, 28, 24, 22, 24, 19, 23, 24, 24, 21, 23, and 28 years old.

Organize the data in a dot plot. Calculate the mean, median, mode, and range for the distribution. Describe the distribution. Be sure to include the shape, outliers, center, context, and spread.

A) Construct a dot plot.



B) Calculate the mean, median, mode, and range for the distribution.


C) Describe the distribution. Be sure to include the shape, outliers, center, context, and spread. (This could be described as fairly symmetrical or slightly skewed to the left.)


The distribution of student ages in this Anthropology 102 class is _______________________.
The ages of students range from _________________.
The median and mode for age are _________________ and the mean is ________.
Thus, the typical student in this class is _____________.
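One way to check your work on parts A and B is a short Python sketch that prints a text version of the dot plot and computes the four statistics. The sideways layout (one row per age, one X per student) and the helper name dot_plot_rows are illustrative choices, not part of the original lesson.

```python
from collections import Counter
from statistics import mean, median, multimode

# Ages of the students in the Anthropology 102 class (from Example 1).
ages = [21, 23, 25, 26, 25, 24, 26, 19, 18, 19, 26,
        28, 24, 22, 24, 19, 23, 24, 24, 21, 23, 28]

counts = Counter(ages)


def dot_plot_rows(counts):
    """Return one text row per axis value, including values with no data."""
    low, high = min(counts), max(counts)
    return [f"{value:>3} | {'X' * counts[value]}" for value in range(low, high + 1)]


rows = dot_plot_rows(counts)
for row in rows:
    print(row)

mean_age = mean(ages)
median_age = median(ages)
modes = multimode(ages)
age_range = max(ages) - min(ages)
print(round(mean_age, 2), median_age, modes, age_range)
```

Ages 20 and 27 appear as empty rows, which follows the advice above to keep values with no data on the axis so that gaps stay visible.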

Content from this page found at www.ck12.org and khanacademy.org CCBY

Box Plots and Outliers

A box plot (also called a box-and-whisker plot) is another type of graph used to display data. A box plot divides a set of numerical data into quarters. It shows how the data are dispersed around a median, but does not show specific values in the data. It does not show a distribution in as much detail as does a stem plot or a histogram, but it clearly shows where the data is located. This type of graph is often used when the number of data values is large or when two or more data sets are being compared. The center and spread of the distribution are very obvious from the graph. It is easy to see the range of the values as well as how these values are distributed around the middle value. The smaller the box, the more consistent the data values are with the median of the data. The shape of the box plot will give you a general idea of the shape of the distribution, but a histogram or stem plot will do this more accurately. Any outliers will show up as long whiskers. The box in the box plot contains the middle 50% of the data, and each 'whisker' contains 25% of the data.

In order to divide into fourths, it is necessary to find five numbers. This list of five values is called the five number summary.

The numbers in the list are minimum value, Quartile 1, Median, Quartile 3, maximum value.

We have already learned how to find the median of a set of numbers (put in order and find the middle value), and the minimum and maximum are the smallest and largest numbers. Now we will learn how to find the quartiles.

Quartiles

The first step is to list all of the numbers in order from least to greatest. The minimum and maximum are now on the ends of the list and we can count in to find the median; circle these three values. Finding the quartiles is just like finding the median. Quartile 1 is the 'median' of all of the values to the left of the median (do NOT include the median itself). Quartile 3 is the 'median' of all of the values to the right of the median (do not include the median).

Constructing a Box Plot

Now list the five number summary in order (min, Q1, Med, Q3, max). The next step is to mark an axis that covers the entire range of the data. Mark the numbers along the axis before you make the box plot, so that the resulting plot shows the shape of the data. The last step is to place a dot above the axis for each of the 5 numbers from the five number summary, and then to make a 'box' through the second and fourth dots, mark a line through the middle dot to show the median, and mark 'whiskers' from the box out to the first and fifth dots.

Example 1

You have a summer job working at Paddy's Pond, which is a recreational fishing spot where children can go to catch salmon that have been raised in a nearby fish hatchery and then transferred into the pond. The cost of fishing depends upon the length of the fish caught ($0.75 per inch). Your job is to transfer 15 fish into the pond three times a day. But, before the fish are transferred, you must measure the length of each one and record the results.

Below are the lengths (in inches) of the first 15 fish you transferred to the pond. Calculate the five number summary, and construct a box plot for the lengths of these fish.

Length of Fish (in.)

13, 14, 6, 9, 10, 21, 17, 15, 15, 7, 10, 13, 13, 8, 11

Solution

Since box plots are based on the median and quartiles, the first step is to organize the data in order from smallest to largest.

6, 7, 8, 9, 10, 10, 11, 13, 13, 13, 14, 15, 15, 17, 21

The minimum is the smallest number (min = 6), and the maximum is the largest number (max = 21). Next, we need to find the median. This data set has an odd number of values, so the median of all the data is the value in the middle position (Med = 13). There are 7 numbers before and 7 numbers after 13.

The next step is to find the median of the first half of the data: the 7 numbers before the median, but not including the median. This is called the lower quartile since it marks the point above the first quarter of the data. On the graphing calculator this value is referred to as Q1.

Quartile 1 is the median of the lower half of the data (Q1 = 9).

This step must be repeated for the upper half of the data: the 7 numbers above the median of 13. This is called the upper quartile since it is the point that marks the third quarter of the data. On the graphing calculator this value is referred to as Q3.

Quartile 3 is the median of the upper half of the data (Q3 = 15).

Now that the five numbers have all been determined, it is time to construct the actual graph. The graph is drawn above a number line that includes all the values in the data set (graph paper works very well since the numbers can be placed evenly using the lines of the graph paper). For this example we will need to mark from at least 6 to at least 21.

Be sure to mark your axis before you start to construct the box plot. Next, represent the following values by placing dots above their corresponding values on the number line:

Minimum = 6
Quartile 1 = 9
Median = 13
Quartile 3 = 15
Maximum = 21

The five data values listed above are often called the five-number summary for the data set and are necessary to graph every box plot. Make the 'box' part around the Q1 and Q3 values, make 'whiskers' out to the min and max values, and make a vertical line to show the location of the median. This will complete the box plot.

Length of fish (in inches), five-number summary: 6, 9, 13, 15, 21
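The steps above can be sketched in a few lines of Python. This is only an illustration of the median-of-halves method described in this lesson (the middle value is left out of both halves when the number of values is odd); the function name five_number_summary is made up for this sketch, and a graphing calculator or other software may use a slightly different quartile rule.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method.

    Assumes at least 2 values. When the count is odd, the middle value
    is excluded from both the lower and the upper half.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # values before the middle position
    upper = data[-half:]   # values after the middle position
    return (data[0], median(lower), median(data), median(upper), data[-1])

fish_lengths = [6, 7, 8, 9, 10, 10, 11, 13, 13, 13, 14, 15, 15, 17, 21]
summary = five_number_summary(fish_lengths)
print(summary)  # (6, 9, 13, 15, 21)
```

The printed values match the five-number summary found by hand in the example.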

The five numbers divide the data into four equal parts. In other words:

One-quarter of the data values are located between 6 and 9
One-quarter of the data values are located between 9 and 13
One-quarter of the data values are located between 13 and 15
One-quarter of the data values are located between 15 and 21

The display of the five-number summary produces a box-and-whisker plot as shown below:

The above model of a box-and-whisker plot shows 2 horizontal lines (the whiskers) that each contain 25% of the data and are of the same length. In addition, it shows that the median of the data set is in the middle of the box, which contains 50% of the data. The lengths of the whiskers and the location of the median with respect to the center of the box are used to describe the distribution of the data. It's important to note that this is just an example. Not all box-and-whisker plots have the median in the middle of the box and whiskers of the same size. Information about the data set that can be determined from the box-and-whisker plot with respect to the location of the median includes the following:

a) If the median is located in the center or near the center of the box, the distribution is approximately symmetric.

b) If the median is located to the left of the center of the box, the distribution is positively skewed.

c) If the median is located to the right of the center of the box, the distribution is negatively skewed.

Information about the data set that can be determined from the box-and-whisker plot with respect to the length of the whiskers includes the following:

a) If the whiskers are the same or almost the same length, the distribution is approximately symmetric.

b) If the right whisker is longer than the left whisker, the distribution is positively skewed.

c) If the left whisker is longer than the right whisker, the distribution is negatively skewed.

The length of the whiskers also gives you information about how spread out the data is.

A box-and-whisker plot is often used when the number of data values is large. The center of the distribution, the nature of the distribution, and the range of the data are very obvious from the graph. The five-number summary divides the data into quarters by use of the medians of the upper and lower halves of the data. Remember that, unlike the mean, the median of the entire data set is not affected by outliers, so it is the measure of central tendency that is most often used in exploratory data analysis.
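A small Python sketch can illustrate why the median resists outliers while the mean does not. It reuses the fish lengths from the box plot example; the second list, in which the largest value is swapped for an extreme value of 210, is a hypothetical variation invented for this illustration.

```python
from statistics import mean, median

lengths = [6, 7, 8, 9, 10, 10, 11, 13, 13, 13, 14, 15, 15, 17, 21]

# Hypothetical variation: replace the largest value with an extreme one.
with_outlier = lengths[:-1] + [210]

print(median(lengths), median(with_outlier))                  # 13 13
print(round(mean(lengths), 2), round(mean(with_outlier), 2))  # 12.13 24.73
```

The median stays at 13 in both lists, while the mean roughly doubles because of the single extreme value.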

Content from this page found at www.ck12.org and khanacademy.org CCBY

More Measures of Spread

Range

We have already learned how to find the range of a set of data. The range represents the entire spread of all of the data.

The formula for calculating the range is:

max - min = range

Interquartile Range

The quartiles give us one more measure of spread called the interquartile range. The interquartile range (IQR) is the range between the lower and upper quartile. To find the IQR, subtract the quartile 1 value from the quartile 3 value (Q3 - Q1 = IQR). The IQR represents the spread, or range, of the middle 50% of the data. The IQR is a measure of spread that is used when the median is the measure of central tendency.

The formula for calculating the IQR is:

Q3 - Q1 = IQR

Outliers

We have been noticing some values that appear to be outliers, but have not defined a specific distance for a value to be considered an outlier. The common outlier test, used to determine whether or not any of the values are outliers, uses the IQR. This outlier test, often called the 1.5*(IQR) Criterion, says that any value that is more than one and one-half times the width of the IQR box away from the box is an outlier. In other words, a value is an outlier if it is less than Q1 - 1.5*(IQR) or greater than Q3 + 1.5*(IQR).
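As a rough illustration, the range, the IQR, and the 1.5*(IQR) Criterion can be applied to the fish lengths from the box plot example (Q1 = 9, Q3 = 15). The function name iqr_outliers and the names low_fence and high_fence are invented for this sketch and do not come from the lesson.

```python
def iqr_outliers(values, q1, q3):
    """Apply the 1.5*(IQR) Criterion to a list of values.

    Returns the IQR, the two cutoffs, and the list of values beyond them.
    """
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr    # anything below this is an outlier
    high_fence = q3 + 1.5 * iqr   # anything above this is an outlier
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return iqr, low_fence, high_fence, outliers

fish_lengths = [6, 7, 8, 9, 10, 10, 11, 13, 13, 13, 14, 15, 15, 17, 21]

data_range = max(fish_lengths) - min(fish_lengths)   # 21 - 6 = 15
iqr, low_fence, high_fence, outliers = iqr_outliers(fish_lengths, q1=9, q3=15)

print(data_range)             # 15
print(iqr)                    # 6
print(low_fence, high_fence)  # 0.0 24.0
print(outliers)               # []
```

Every fish length falls between 0 and 24, so by this test the data set has no outliers.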

Describing Data Question for Thought 1

The two main measures of spread are the standard deviation and the IQR. Define these measures and explain how they are useful in the field of statistics.

Once you've completed your responses, follow your teacher's instructions for submitting your work.

Describing Data Quiz 1

It is now time to complete the "Describing Data Quiz 1".

After you have completed all the work in this chapter and feel confident of your knowledge of the content, you'll be prepared for a test on this material.

Content from this page found at www.ck12.org and khanacademy.org CCBY

Comparing Data Sets

Parallel box plots (also called side-by-side box plots) are very useful when two or more numerical data sets need to be compared. The graphs of the parallel box plots are plotted, one parallel to the other, along the same number line. This can be done vertically or horizontally and for as many data sets as needed.

Example 1

The figure shows the distributions of the temperatures for three different cities. By graphing the three box plots along the same axis, it becomes very easy to compare the temperatures of the three cities. What are some conclusions that can be drawn about the temperatures in these three cities?

Solution

Here are some conclusions, based on the graphs, that might be made. Be sure to compare the distributions to one another, using statistics to support your observations.

Quartile 1 for City 2 is higher than quartile 3 in City 1 and the median in City 3. Also, the minimum temperature in City 2 is at about the median for the other two cities.
City 2 is generally warmer than both of the other cities. Cities 1 and 3 have nearly the same median temperature, around 60° to 63°, whereas the median temperature in City 2 is around 82°.
City 3 has a much larger range in temperatures (35° to 85°) than City 1 (45° to 75°) or City 2 (62° to 95°). City 1 has the smallest range, so its temperature is the most consistent of the three.
The temperature distributions in all three cities are fairly symmetrical and none have any outliers.

Example 2

The heights of a group of students are all included in the first histogram. The second histogram only contains the data from the male students, and the third is a graph of the heights of only the female students. Explain what these histograms show.

Solution

The range of heights of all students in this group is approximately 20 inches. However, the female heights only range about 11 inches and the male heights only range about 13 inches. The females' height distribution is the most symmetrical of all three. There is one male whose height is a high outlier, but there are no outliers among the females. The median height for the class is around 70 inches; for males it is slightly higher, around 72 inches, and for females it is around 65 inches. In general, the female students tend to be shorter than the male students.

Now You Try!

Use the data below to answer the matching questions:

From 1996 to 2004, the ages of the best actors at the time of winning the award were 45, 59, 45, 42, 35, 46, 28, 42, and 36. The ages of the best actresses at the time of winning the award for this same time period were 39, 33, 25, 24, 32, 32, 34, 27, and 30.
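One possible way to check your comparison of the two groups is to compute a measure of center and a measure of spread for each. The sketch below uses the median and the range; the function name center_and_range is invented here, and the choice of these two statistics is an assumption about what the matching questions ask.

```python
from statistics import median

actors = [45, 59, 45, 42, 35, 46, 28, 42, 36]
actresses = [39, 33, 25, 24, 32, 32, 34, 27, 30]

def center_and_range(ages):
    """Return (median, range) for a list of ages."""
    return median(ages), max(ages) - min(ages)

print(center_and_range(actors))     # (42, 31)
print(center_and_range(actresses))  # (32, 15)
```

For this time period, the actors have both a higher median age and a wider range of ages than the actresses.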

Content from this page found at www.ck12.org and khanacademy.org CCBY

Line Graphs and Scatter Plots

Before you continue to explore the concept of representing data graphically, it is very important to understand the meaning of some basic terms that will often be used in this lesson.

Variable

The first such definition is that of a variable. In statistics, a variable is simply a characteristic that is being studied. This characteristic assumes different values for different elements, or members, of the population, whether it is the entire population or a sample. The value of the variable is referred to as an observation, or a measurement. A collection of these observations of the variable is a data set. Variables can be quantitative or qualitative.

Quantitative Variable

A quantitative variable is one that can be measured numerically. Some examples of a quantitative variable are wages, prices, weights, numbers of vehicles, and numbers of goals. All of these examples can be expressed numerically. A quantitative variable can be classified as discrete or continuous.

A discrete variable is one whose values are all countable and does not include any values between 2 consecutive values of a data set. An example of a discrete variable is the number of goals scored by a team during a hockey game.
A continuous variable is one that can assume any countable value, as well as all the values between 2 consecutive numbers of a data set. An example of a continuous variable is the number of gallons of gasoline used during a trip to the beach.

Qualitative Variable

A qualitative variable is one that cannot be measured numerically but can be placed in a category. Some examples of a qualitative variable are months of the year, hair color, color of cars, a person's status, and favorite vacation spots. The following flow chart should help you to better understand the above terms.

Example 1

Select the best descriptions for the following variables and indicate your selections by marking an 'x' in the appropriate boxes.

Variable                          Quantitative   Qualitative   Discrete   Continuous
Number of members in a family
A person's marital status
Length of person's arm
Color of cars
Number of errors on math test

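Solution:

Applying the definitions above: the number of members in a family and the number of errors on a math test are quantitative and discrete, the length of a person's arm is quantitative and continuous, and a person's marital status and the color of cars are qualitative.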

Variables can also be classified as dependent or independent. When there is a linear relationship between 2 variables, the values of one variable depend upon the values of the other variable. In a linear relation, the values of y depend upon the values of x. Therefore, the dependent variable is represented by the values that are plotted on the y-axis, and the independent variable is represented by the values that are plotted on the x-axis.

Example 2

Sally works at the local ballpark stadium selling lemonade. She is paid $15.00 each time she works, plus $0.75 for each glass of lemonade she sells. Create a table of values to represent Sally's earnings if she sells 8 glasses of lemonade. Use this table of values to represent her earnings on a graph.

Solution:

The first step is to write an equation to represent her earnings and then to use this equation to create a table of values.

y = 0.75x + 15, where y represents her earnings and x represents the number of glasses of lemonade she sells.

Number of Glasses of Lemonade    Earnings
0                                $15.00
1                                $15.75
2                                $16.50
3                                $17.25
4                                $18.00
5                                $18.75
6                                $19.50
7                                $20.25
8                                $21.00

The dependent variable is the money earned, and the independent variable is the number of glasses of lemonade sold. Therefore, money is on the y-axis, and the number of glasses of lemonade is on the x-axis. From the table of values, Sally will earn $21.00 if she sells 8 glasses of lemonade.
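The table of values can also be generated directly from the equation y = 0.75x + 15. The following is a minimal Python sketch; the function name earnings is chosen only for this illustration.

```python
def earnings(glasses):
    """Sally's pay in dollars: $15.00 per shift plus $0.75 per glass sold."""
    return 0.75 * glasses + 15

# One (x, y) pair for each whole number of glasses from 0 to 8.
table = [(x, earnings(x)) for x in range(9)]

for glasses, pay in table:
    print(f"{glasses}  ${pay:.2f}")
```

Only whole numbers of glasses are used for x, because Sally cannot sell a fraction of a glass.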

Now that the points have been plotted, the decision has to be made as to whether or not to join them. Between every 2 points plotted on the graph are an infinite number of values. If these values are meaningful to the problem, then the plotted points can be joined. This type of data is called continuous data. If the values between the 2 plotted points are not meaningful to the problem, then the points should not be joined. This type of data is called discrete data. Since glasses of lemonade are represented by whole numbers, and since fractions or decimals are not appropriate values, the points between 2 consecutive values are not meaningful in this problem. Therefore, the points should not be joined. The data is discrete.

Linear Graphs

Linear graphs are important in statistics when several data sets are used to represent information about a single topic. An example would be data sets that represent different plans available for cell phone users. These data sets can be plotted on the same grid. The resulting graph will show intersection points for the plans. These intersection points indicate a coordinate where 2 plans are equal. An observer can easily interpret the graph to decide which plan is best, and when. If the observer is trying to choose a plan to use, the choice can be made easier by seeing a graphical representation of the data.

Example 3

The following graph represents 3 plans that are available to customers interested in hiring a maintenance company to tend to their lawn. Using the graph, explain when it would be best to use each plan for lawn maintenance.

Solution:

From the graph, the base fee that is charged for each plan is obvious. These values are found on the y-axis. Plan A charges a base fee of $200.00, Plan C charges a base fee of $100.00, and Plan B charges a base fee of $50.00. The cost per hour can be calculated by using the values of the intersection points and the base fee in the equation y = mx + b and solving for m. Plan B is the best plan to choose if the lawn maintenance takes less than 12.5 hours. At 12.5 hours, Plan B and Plan C both cost $150.00 for lawn maintenance. After 12.5 hours, Plan C is the best deal, until 50 hours of lawn maintenance is needed. At 50 hours, Plan A and Plan C both cost $300.00 for lawn maintenance. For more than 50 hours of lawn maintenance, Plan A is the best plan. All of the above information was obvious from the graph and would enhance the decision-making process for any interested client.

The above graphs represent linear functions, and are called linear (line) graphs. Each of these graphs has a defined slope that remains constant when the line is plotted. A variation of this graph is a broken-line graph. This type of line graph is used when it is necessary to show change over time. A line is used to join the values, but the line has no defined slope. However, the points are meaningful, and they all represent an important part of the graph. Usually a broken-line graph is given to you, and you must interpret the given information from the graph.
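Returning to the three lawn maintenance plans, the cost per hour for each plan can be worked out from the base fees and intersection points quoted in the solution by solving y = mx + b for m. The Python sketch below does this and then picks the cheapest plan for a given number of hours. The function names hourly_rate, cost, and cheapest are invented for this illustration, and the hourly rates are derived from the quoted values rather than read from the original graph.

```python
def hourly_rate(base_fee, hours, total_cost):
    """Solve y = m*x + b for m, given one known point (hours, total_cost)."""
    return (total_cost - base_fee) / hours

base = {"A": 200.0, "B": 50.0, "C": 100.0}

rate = {
    "A": hourly_rate(base["A"], 50.0, 300.0),   # Plans A and C meet at (50, 300)
    "B": hourly_rate(base["B"], 12.5, 150.0),   # Plans B and C meet at (12.5, 150)
    "C": hourly_rate(base["C"], 12.5, 150.0),
}

def cost(plan, hours):
    """Total cost of a plan: hourly rate times hours, plus the base fee."""
    return rate[plan] * hours + base[plan]

def cheapest(hours):
    """Name of the plan with the lowest total cost for the given hours."""
    return min(base, key=lambda plan: cost(plan, hours))

print(rate)  # {'A': 2.0, 'B': 8.0, 'C': 4.0}
print(cheapest(10), cheapest(30), cheapest(60))  # B C A
```

The results agree with the solution: Plan B is cheapest for short jobs, Plan C for jobs between 12.5 and 50 hours, and Plan A for longer jobs. As a consistency check, Plan C at 50 hours costs 4(50) + 100 = $300.00, which matches the second intersection point.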

Example 4

The following graph is an example of a broken-line graph, and it represents the time of a round-trip journey, driving from home to a popular campground and back.

Use the chart above to complete the matching activity below:

Scatter Plots

Often, when real-world data is plotted, the result is a linear pattern. The general direction of the data can be seen, but the data points do not all fall on a line. This type of graph is called a scatter plot. A scatter plot is often used to investigate whether or not there is a relationship or connection between 2 sets of data. The data is plotted on a graph such that one quantity is plotted on the x-axis and one quantity is plotted on the y-axis. The quantity that is plotted on the x-axis is the independent variable, and the quantity that is plotted on the y-axis is the dependent variable. If a relationship does exist between the 2 sets of data, it will be easy to see once the data is plotted on a scatter plot.

The following scatter plot shows the price of peaches and the number sold:

The connection is obvious: when the price of peaches was high, the sales were low, but when the price was low, the sales were high.

The following scatter plot shows the sales of a weekly newspaper and the temperature:

There is no connection between the number of newspapers sold and the temperature. Another term used to describe 2 sets of data that have a connection or a relationship is correlation. The correlation between 2 sets of data can be positive or negative, and it can be strong or weak. The following scatter plots will help to enhance this concept.

If you look at the 2 sketches that represent a positive correlation, you will notice that the points are around a line that slopes upward to the right. When the correlation is negative, the line slopes downward to the right. The 2 sketches that show a strong correlation have points that are bunched together and appear to be close to a line that is in the middle of the points. When the correlation is weak, the points are more scattered and not as concentrated. In the sales of newspapers and the temperature, there was no connection between the 2 data sets. The following sketches represent some other possible outcomes when there is no correlation between data sets:

Example 1

Plot the following points on a scatter plot, with m as the independent variable and n as the dependent variable. Number both axes from 0 to 20. If a correlation exists between the values of m and n, describe the correlation (strong negative, weak positive, etc.).

m: 4, 9, 13, 16, 17, 6, 7, 18, 10

n: 5, 3, 11, 18, 6, 11, 18, 12, 16

Solution:
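One way to back up a visual judgment with a number is Pearson's correlation coefficient r, which ranges from -1 (perfect negative) through 0 (no linear correlation) to +1 (perfect positive). This lesson has not defined r, so the Python sketch below is offered only as an optional check. It assumes the m and n values are paired in the order listed, and the function name pearson_r is made up for this illustration.

```python
from math import sqrt

m = [4, 9, 13, 16, 17, 6, 7, 18, 10]
n = [5, 3, 11, 18, 6, 11, 18, 12, 16]

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired lists of equal length."""
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(m, n)
print(round(r, 2))
```

Under these assumptions r comes out at roughly 0.16, a small positive value. On a scatter plot, points like these would look widely scattered, with at most a weak positive trend.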

Example 2

Describe the correlation, if any, in the following scatter plot:

Solution

In the above scatter plot, there is a strong positive correlation.

You now know that a scatter plot can have either a positive or a negative correlation. When this exists on a scatter plot, a line of best fit can be drawn on the graph. The line of best fit must be drawn so that the sums of the distances to the points on either side of the line are approximately equal and such that there are an equal number of points above and below the line. Using a clear plastic ruler makes it easier to meet all of these conditions when drawing the line. Another useful tool is a stick of spaghetti, since it can be easily rolled and moved on the graph until you are satisfied with its location. The edge of the spaghetti can be traced to produce the line of best fit. A line of best fit can be used to make estimations from the graph, but you must remember that the line of best fit is simply a sketch of where the line should appear on the graph. As a result, any values that you choose from this line are not very accurate; the values are more of a ballpark figure.

Example 3

The following table consists of the marks achieved by 9 students on chemistry and math tests:

Student            A    B    C    D    E    F    G    H    I
Chemistry Marks    49   46   35   58   51   56   54   46   53
Math Marks         29   23   10   41   38   36   31   24   ?

Plot the above marks on a scatter plot, with the chemistry marks on the x-axis and the math marks on the y-axis. Draw a line of best fit, and use this line to estimate the mark that Student I would have made in math had he or she taken the test.

Solution:

If Student I had taken the math test, his or her mark would have been between 32 and 37. Scatter plots and lines of best fit can also be drawn by using technology. The TI-83 is capable of both graphing a scatter plot and inserting the line of best fit onto the scatter plot.
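Technology usually finds the line of best fit with the least-squares method rather than by eye. The Python sketch below applies that method to the 8 students who took both tests and then estimates Student I's math mark from a chemistry mark of 53. Least squares is a substitute here for the hand-drawn line in the example, and the function name least_squares_line is invented for this sketch.

```python
chemistry = [49, 46, 35, 58, 51, 56, 54, 46]   # Students A to H
math_marks = [29, 23, 10, 41, 38, 36, 31, 24]

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = least_squares_line(chemistry, math_marks)
estimate = slope * 53 + intercept   # Student I scored 53 in chemistry

print(round(slope, 2), round(intercept, 2))
print(round(estimate, 1))
```

The estimate is a little under 34, which falls inside the range of 32 to 37 read from the hand-drawn line.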

Content from this page found at www.ck12.org and khanacademy.org CCBY

Histograms

The shape of a histogram can tell you a lot about the distribution of the data, as well as provide you with information about the mean, median, and mode of the data set. The following are some typical histograms, with a caption below each one explaining the distribution of the data, as well as the characteristics of the mean, median, and mode. Distributions can have other shapes besides the ones shown below, but these represent the most common ones that you will see when analyzing data. In each of the graphs below, the distributions are not perfectly shaped, but are shaped enough to identify an overall pattern.

The histogram above represents a bell-shaped distribution, which has a single peak and tapers off to both the left and the right of the peak. The shape appears to be symmetric about the center of the histogram. The single peak indicates that the distribution is unimodal. The highest peak of the histogram represents the location of the mode of the data set. The mode is the data value that occurs the most often in a data set. For a symmetric histogram, the values of the mean, median, and mode are all the same and are all located at the center of the distribution.

The histogram above represents a distribution that is approximately uniform and forms a rectangular, flat shape. The frequency of each class is approximately the same.

The histogram above represents a right-skewed distribution, which has a peak to the left of the distribution and data values that taper off to the right. This distribution has a single peak and is also unimodal. For a histogram that is skewed to the right, the mean is located to the right on the distribution and is the largest value of the measures of central tendency. The mean has the largest value because it is strongly affected by the outliers on the right tail that pull the mean to the right. The mode is the smallest value, and it is located to the left on the distribution. The mode always occurs at the highest point of the peak. The median is located between the mode and the mean.

The histogram above represents a left-skewed distribution, which has a peak to the right of the distribution and data values that taper off to the left. This distribution has a single peak and is also unimodal. For a histogram that is skewed to the left, the mean is located to the left on the distribution and is the smallest value of the measures of central tendency. The mean has the smallest value because it is strongly affected by the outliers on the left tail that pull the mean to the left. The median is located between the mode and the mean.
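The ordering of the mean, median, and mode in skewed data can be checked on a small example. Both data sets in the Python sketch below are hypothetical and were made up for this illustration; the second is simply a mirror image of the first.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data: most values are small, with a long right tail.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

# Mirror image of the data above, which gives a long left tail instead.
left_skewed = [16 - value for value in right_skewed]

print(mode(right_skewed), median(right_skewed), mean(right_skewed))  # 2 3.0 4.6
print(mean(left_skewed), median(left_skewed), mode(left_skewed))     # 11.4 13.0 14
```

In the right-skewed set the mode is smallest and the mean is largest. In the left-skewed set the order is reversed, and the median sits between the mode and the mean in both cases.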

Figure e has no shape that can be defined. The only defining characteristic about this distribution is that it has 2 peaks of the same height. This means that the distribution is bimodal.

Content from this page found at www.ck12.org and khanacademy.org CCBY

Double Box-and-Whisker Plots

Double box-and-whisker plots give you a quick visual comparison of 2 sets of data, as was also found with other double graph forms you learned about earlier in this chapter. The difference with double box-and-whisker plots is that you are also able to quickly compare the medians, the maximums (upper range), and the minimums (lower range) of the data.

Example 1

Emma and Daniel are surveying the times it takes students to arrive at school from home. There are 2 main groups of commuters who were in the survey. There were those who drove their own cars to school, and there were those who took the school bus. Emma and Daniel collected the following data:

Draw a box-and-whisker plot for both sets of data on the same number line. Use the double box-and-whisker plots to compare the times it takes for students to arrive at school either by car or by bus.

Solution:

When plotted, the box­and­whisker plots look like the following:

Using the medians, 50% of the students who drive arrive at school in 11 minutes or less, whereas 50% of the students who take the bus arrive in 17 minutes or less. The range for the car times is 17 - 8 = 9 minutes. For the bus times, the range is 32 - 12 = 20 minutes. Since the range for the driving times is smaller, the times to arrive by car are less spread out. This means that the car times are more predictable and reliable.

Content from this page found at www.ck12.org and khanacademy.org CCBY

Displaying Bivariate Data

Bivariate simply means two variables. Our previous work was with univariate, or single-variable, data. The goal of examining bivariate data is usually to show some sort of relationship or association between the two variables.

Scatterplots

Scatterplots are graphs that represent a relationship between two variables. Often we name these variables the explanatory variable and the response variable. The explanatory variable is the variable that we believe explains the relationship or the change in the other variable. The response variable is the variable we believe responds to the change in the explanatory variable. The response variable is often referred to as the dependent variable and the explanatory variable is referred to as the independent variable.

Example 1

A college professor emails all of his students, asking them how many hours they studied for their exam. After students take the exam, the professor looks at a graph of exam scores compared to time spent studying. In this situation, which variable do you think is the explanatory variable and which is the response variable?

Solution

The explanatory variable is the hours spent studying and the response variable is the score on the exam. It is reasonable to believe that the amount of studying does somehow have an effect on their exam score. Often, thinking in terms of a cause and effect relationship can identify the explanatory variable and the response variable. As a hint, the variable that comes first is often the explanatory variable. For example, the studying comes before the exam.

Suppose the professor made a scatterplot showing student scores versus hours studying. Each dot on the graph represents an individual student. Their horizontal position corresponds with the number of hours they spent studying, and their vertical position corresponds with their score on the exam. This series of disconnected points is referred to as a scatterplot.

As you can see, the explanatory variable, hours studying, is on the x-axis. Placing the explanatory variable on the x-axis is standard procedure for scatterplots. The response variable then goes on the y-axis.

Example 2

Looking at the scatterplot below, which variable is the explanatory variable?

Solution

It appears that IQ score has been chosen as the explanatory variable and height has been chosen as the response. However, the explanatory variable should help explain the response we see. It does not seem to be reasonable in this case. For example, we would not reasonably argue that a person's IQ score somehow has an effect on their height. As we will learn later in this chapter, our graph of that data shows us there really is no explanatory/response relationship between IQ and height. Our conclusion is that there is no explanatory variable in this situation.

Let's look at some more data. Below we have recycling rates for paper packaging and glass by individual countries. It would be interesting to see if there is a predictable relationship between the percentages of each material that a country recycles. Following is a data table that includes both percentages.

We will place the paper recycling rates on the horizontal axis, and the glass on the vertical axis. Next, plot a point that shows each country's rate of recycling for the two materials.

In univariate data, we initially characterize a data set by describing its form, center and spread. For bivariate data, we will also discuss three important characteristics: form, direction and strength to inform us about the association between the two variables. The easiest way to describe these traits for this scatterplot is to think of the data as a "cloud." If you draw an ellipse around the data, the general trend is that the ellipse is rising from left to right.

Data that is oriented in this manner is said to have a positive linear association. That is, as one variable increases, the other variable also increases. In this example, it is mostly true that countries with higher paper recycling rates have higher glass recycling rates. Lines that rise in this direction have a positive slope, and lines that trend downward from left to right have a negative slope. If the ellipse cloud were trending down in this manner, we would say the data had a negative linear association. For example, we might expect this type of relationship if we graphed a country's glass recycling rate against the percentage of glass that ends up in a landfill. As the recycling rate increases, the landfill percentage would have to decrease.

The ellipse cloud also gives us some information about the strength of the linear association. If there were a strong linear relationship between glass and paper recycling rates, the cloud of data would be much longer than it is wide. Long and narrow ellipses mean a strong linear association; shorter and wider ones show a weaker linear relationship. In this example, there are some countries in which the glass and paper recycling rates do not seem to be related.

New Zealand, Estonia, and Sweden (circled in yellow) have much lower paper recycling rates than their glass rates, and Austria (circled in green) is an example of a country with a much lower glass rate than its paper rate. These data points are spread away from the rest of the data enough to make the ellipse much wider, therefore weakening the association between the variables.

Here are some more examples to help you describe the form, direction, and strength of a bivariate relationship.

Form

Direction

Strength

Example 3

Describe the form, direction, and strength of the data in the scatterplot below.

Solution

Form: There appears to be a linear relationship between vehicle age and value.

Direction: The direction of the relationship appears to be negative because as the age of the vehicle goes up, the value goes down.

Strength: The strength of the relationship appears to be strong because the points lie close to a line.

Content from this page found at www.ck12.org and khanacademy.org CCBY

Bivariate Data, Correlation Between Values

What if we notice that two variables seem to be related to one another and we want to more accurately determine the strength of the relationship? We may notice that scores for two variables, such as verbal SAT score and GPA, are related and that students who have high scores on one appear to have high scores on the other (see table below).

Correlation is a number that measures the strength and direction of a linear relationship between the two variables in a bivariate data set. In our example above, we notice that there are two observations (verbal SAT score and GPA) for each student. Each student is represented by one point on the scatterplot below.

If we carefully examine the data in the example above, we notice that students with high SAT scores tend to have high GPAs and those with low SAT scores tend to have low GPAs.

Correlation Patterns in Scatterplot Graphs

When the points on a scatterplot graph produce a lower-left-to-upper-right pattern (see below), we say that there is a positive correlation between the two variables. This pattern means that when the score of one observation is high, we expect the score of the other observation to be high as well, and vice versa.

When the points on a scatterplot graph produce an upper-left-to-lower-right pattern (see below), we say that there is a negative correlation between the two variables. This pattern means that when the score of one observation is high, we expect the score of the other observation to be low, and vice versa.

When all of the points on a scatterplot lie exactly on a straight line, you have what is called a perfect correlation between the two variables (see below).

A scatterplot in which the points do not have a linear trend (either positive or negative) is said to show a zero or near-zero correlation (see below).

Correlation Coefficients

We use a statistic called the correlation coefficient to give us a more precise measurement of the relationship between two variables. The correlation coefficient is an index that describes the relationship between two variables and can take on values between -1.0 and +1.0, with a positive correlation coefficient indicating a positive correlation and a negative correlation coefficient indicating a negative correlation.

The absolute value of the coefficient indicates the magnitude, or strength, of the relationship. The closer the absolute value of the coefficient is to 1, the stronger the relationship. For example, a correlation coefficient of 0.20 indicates that there is a weak linear relationship between the variables, while a coefficient of -0.90 indicates that there is a strong linear relationship. The value of a perfect positive correlation is 1.0, while the value of a perfect negative correlation is -1.0.

When there is no linear relationship between two variables, the correlation coefficient is 0. It is important to remember that a correlation coefficient of 0 indicates only that there is no linear relationship. There may still be a strong relationship between the two variables. For example, there could be a quadratic relationship between them.

For a better understanding of correlation try the fun links in the sidebar!

Calculating r

The correlation coefficient is a statistic that is used to measure the strength and direction of a linear correlation. It is symbolized by the letter r. To calculate this coefficient, you will use either a calculator or software on a computer or the internet.
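
To make the idea of "software on a computer" concrete, here is a minimal Python sketch of one standard way to compute r: divide the sum of the products of the deviations from the means by the square root of the product of the summed squared deviations. The function name pearson_r and the five data pairs are illustrative assumptions, not data from this lesson.

```python
import math

def pearson_r(xs, ys):
    """Return the correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much each varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied and quiz score for five students.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
print(round(pearson_r(hours, scores), 2))  # 0.77
```

A value of about 0.77 would be read as a fairly strong positive linear relationship. A graphing calculator's linear regression feature should report the same r for the same pairs.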

Example 1

What is the correlation coefficient for these two variables?

Solution

r = 0.95

Looking at the scatterplot of the data first, one might estimate the value of r to be close to maybe 0.9.

There are several websites in addition to a calculator where you can enter in data points and find their correlation.

Common Errors in Correlation

The Properties and Common Errors of Correlation

When examining correlation, there are four things that could affect our results: outliers, linearity, homogeneity of the group and sample size.

Outliers

An outlier, or a data point that lies outside of our overall pattern, can have a great effect on correlation. How great an effect it has is determined by the sample size of the data and by how far the outlier lies outside of the pattern. As a general rule, an outlier that falls away from the overall pattern will bring the correlation closer to zero.
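
The effect of a single outlier can be checked numerically. The sketch below uses made-up data (six points that lie exactly on a line, plus one hypothetical point far below the pattern) and a small helper, pearson_r, written for this illustration only.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 12]            # every point lies on y = 2x
r_before = pearson_r(xs, ys)

# Add one hypothetical outlier at (7, 0), far below the pattern.
r_after = pearson_r(xs + [7], ys + [0])

print(round(r_before, 2), round(r_after, 2))  # 1.0 0.25
```

With only seven points, one outlier is enough to turn a perfect correlation into a weak one. In a much larger data set the same point would move r far less, which is why sample size matters here.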

Linearity

As mentioned, the correlation coefficient is the measure of the linear relationship between two variables. However, while many pairs of variables have a linear relationship, some do not. For example, let's consider performance anxiety. As a person's anxiety about performing increases, so does their performance up to a point (we sometimes call this 'good stress'). However, beyond that point, the increase in anxiety may cause their performance to go down. We call these non-linear relationships curvilinear relationships. We can identify curvilinear relationships by examining scatterplots (see below).

One may ask why curvilinear relationships pose a problem when calculating the correlation coefficient. The answer is that if we use the traditional formula to calculate these relationships, it will not be an accurate index and we will be underestimating the relationship between the variables. If we graphed performance against anxiety, we would see that anxiety has a strong effect on performance. However, if we calculated the correlation coefficient, we would arrive at a figure around zero. Therefore, the correlation coefficient is not always the best statistic to use to understand the relationship between variables.
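
A small numerical sketch shows how a strong curvilinear relationship can still give r near zero. The anxiety scale (0 through 10) and the rule performance = 25 - (anxiety - 5)^2 are invented for illustration; they are not data from this lesson.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical model: performance rises until anxiety reaches 5, then falls.
anxiety = list(range(11))                           # 0 through 10
performance = [25 - (a - 5) ** 2 for a in anxiety]

r = pearson_r(anxiety, performance)
print(abs(round(r, 2)))  # 0.0
```

Performance here is completely determined by anxiety, yet r is zero because the rising half of the curve cancels the falling half. A scatterplot would reveal the pattern immediately.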

Homogeneity of the Group

Another error we could encounter when calculating the correlation coefficient is homogeneity of the group. When a group is homogeneous, or possesses similar characteristics, the range of scores on either or both of the variables is restricted. For example, suppose we are interested in finding out the correlation between IQ and salary. If only members of Mensa (a society for people who score in the top 2 percent on an approved intelligence test) are sampled, we will most likely find a very low correlation between IQ and salary, since most members will have a consistently high IQ but their salaries will vary. This does not mean that there is not a relationship. It simply means that the restriction of the sample limited the magnitude of the correlation coefficient.
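
The restriction effect can be sketched with made-up numbers. Below, y follows x closely across the full range, but when only the highest x values are kept (a stand-in for a homogeneous group), the correlation nearly disappears. The data and the cutoff of 16 are assumptions chosen for illustration.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y equals x plus a fixed wobble of +3 or -3.
xs = list(range(20))
ys = [x + (3 if x % 2 == 0 else -3) for x in xs]
r_full = pearson_r(xs, ys)

# Keep only the top of the x range, mimicking a homogeneous sample.
top = [(x, y) for x, y in zip(xs, ys) if x >= 16]
r_top = pearson_r([x for x, _ in top], [y for _, y in top])

print(round(r_full, 2), round(r_top, 2))  # 0.88 -0.08
```

The underlying relationship is the same in both calculations. Only the range of x values in the sample changed.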

Sample Size

Finally, we should consider sample size. One may assume that the number of observations used in the calculation of the coefficient may influence the magnitude of the coefficient itself. However, this is not the case. While the sample size does not affect the coefficient, it may affect how accurately the coefficient reflects the relationship. The larger the sample, the more accurate a predictor the correlation coefficient will be of the relationship between the two variables.

Correlation does not mean Causation!

Oftentimes, studies that show a correlation between two variables will influence readers into thinking that one variable is the cause of the other.

A lighthearted example of this phenomenon can be found at this website, which discusses the evil of the pickle.

Hopefully you now understand that just because two variables show a strong relationship, this does not mean that one causes the response in the other. In truth, there are four possible explanations for why the relationship exists.

1. Direct Cause-and-Effect (Causation)

In this situation we would agree that one variable is in fact causing the response in another. Oftentimes these relationships have been studied by an experiment. For example, smoking is correlated with lung disease, and it is reasonable, based on scientific evidence, to say smoking causes lung disease.

2. Common Response

In this situation we have two variables that are correlated with each other, but the relationship can be explained by some other factor, called a lurking variable. For example, countries that have a high percentage of cell phone users also tend to have higher life expectancies. Do you believe it is safe to say that using a cell phone adds years to your life? Or can you think of a possible lurking variable that could explain the relationship?

Much like the teen texting example earlier in the lesson, we can come up with possible outside variables that could be at work. Possibly the best explanation is money. A country's wealth (sometimes expressed as GNP) could easily be used to explain both variables in this example. If a country is wealthy, it is much more likely to have citizens who own cell phones. Also, if a country is wealthy, it is much more likely to have good hospitals, roads, health education, and access to clean water and food, all things that contribute to longer life.

If you have not viewed the evil of the pickle website above, here is a direct quote from that page: "Of all the people who die from cancer, 99% have eaten pickles." Assuming the statement is factual, can you somehow relate that sentence to the idea of common response?

3. Confounding

In this situation we have two variables that again are correlated, but we are unsure of the exact cause of the relationship. A highly debated topic in social media and weblogs is global warming. Some people argue that human pollution is a major cause of the increase in CO2 and other greenhouse gases in the atmosphere, while others argue it is part of a natural cycle that has occurred throughout Earth's history. Still others may think both explanations are at work. This is an example of confounding because if we are uneducated about the issue we may be confused (or confounded) about the true cause of the problem.

4. Coincidence

In our final situation we see relationships that are occurring completely by chance. For example, if you researched divorce rates and gas prices over the past 50 years, you may note that both have gone up. Furthermore, if you found the correlation between gas prices and divorce rates, you may find a strong positive relationship. Does this mean increasing divorce rates are causing high gas prices?

Probably not, and it is also not likely that there exists a common response or some form of confounding. So when all else fails, we would say this is a relationship that is best explained by sheer coincidence. Here's another example for you to consider: over the past several decades the crime rate has increased in a particular country, and during the same period the profits of casinos have also steadily increased. Do you believe this situation to be coincidence?

Least-Squares Regression

In the last section we learned about the concept of correlation, which we defined as the measure of the linear relationship between two variables. As a reminder, when we have a strong positive correlation, we can expect that if the score on one variable is high, the score on the other variable will also most likely be high. With correlation, we are able to roughly predict the score of one variable when we have the other. Prediction is simply the process of estimating scores of one variable based on the scores of another variable. In the previous section we illustrated the concept of correlation through scatterplot graphs. We saw that when variables were correlated, the points on the graph tended to follow a straight line. If we could draw this straight line, it would, in theory, represent the change in one variable associated with the other. This line is called the least-squares or the linear regression line (see figure below).

Regression Lines, Slope, and Y-intercept

Linear regression involves using data to calculate a line that best fits the data and then using that line to predict scores. We are looking for a line of "best fit". There are many ways one could define this "best fit". Statisticians define this line to be the one that minimizes the sum of the squared vertical distances from the observed data to the line. In the example below, you can see the calculated distance from each of the observations to the regression line; these distances are called residual values. This method of fitting the line so that there is minimal difference between the observations and the line is called the method of least squares.
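
The least-squares idea can be sketched in a few lines of Python. The standard formulas used here (slope = sum of products of deviations divided by sum of squared x deviations, and intercept = mean of y minus slope times mean of x) are not stated in this lesson, and the five data points and function names are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_residuals(xs, ys, slope, intercept):
    """Add up the squared vertical distances from the points to a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b, a = least_squares_line(xs, ys)
print(round(b, 2), round(a, 2))                            # 0.6 2.2
print(round(sum_squared_residuals(xs, ys, b, a), 2))       # 2.4
print(round(sum_squared_residuals(xs, ys, 0.5, 2.5), 2))   # 2.5
```

The last two lines compare the fitted line with a nearby line chosen by eye (slope 0.5, intercept 2.5). The fitted line has the smaller sum of squared residuals, which is exactly what "best fit" means under this method.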

Below is data given by a canine expert. It relates a dog's age in years to what the expert believes the equivalent age in human years to be.

The equation for the regression line of the data is approximately

y = 4.64x + 7.79

Example 1

Interpret the meaning of 4.64 for the slope, and 7.79 for the y-intercept, in the context of the dog years data.

Solution

As you can see, the regression line is a straight line that expresses the relationship between two variables. When predicting one score by using another, we use an equation equivalent to the slope-intercept form of the equation for a straight line.

Y = bX + a

where:

Y = the score that we are trying to predict

X = the score that we are using to make the prediction

b = the slope of the line

a = the Y intercept (value of Y when X = 0)

The meaning of 4.64 for the slope is that for every year a dog ages, its equivalent human age increases by 4.64 years. This can be seen in the graph below: looking at the regression line, one can see that it goes up about 4.64 years for every year you move to the right. Having a y-intercept of 7.79 tells us that when the dog is born, the line predicts its equivalent human age to be 7.79 years old. This can also be seen on the graph below. Looking at the y-axis, one can see the line crossing it at approximately 7.79.

Predicting Values Using Scatterplot Data, Extrapolation

One of the uses of the regression line is to predict values. After calculating this line, we are able to predict values by simply substituting a value of a predictor variable X into the regression equation and solving the equation for the outcome variable Y. We are able to predict the values for Y for any value of X within a specified range. We say specified range because we wouldn't want to predict dog age or equivalent human age that falls outside of the range of our data.

What if we wanted to use this data to predict the equivalent age in human years of a dog that was 18 years old? Can you think of any potential problems with that? One problem would be that the data came from a canine expert only up to a dog of age 11. Therefore, we really can't make a prediction for a dog that is 18 years old because our data doesn't extend out to that age. We have no idea how fast or slow the expert would predict a dog to age when it gets that old. Predicting beyond the range of our data set is called extrapolating. Making decisions based on extrapolation can be dangerous, as we are coming to conclusions that are not backed up by data.
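
As a rough sketch, the prediction step and the warning about extrapolation can be combined in one small Python function. The slope 4.64, the intercept 7.79 and the upper limit of age 11 come from this lesson. The lower limit of age 1 and the function name are assumptions, because the smallest dog age in the data table is not stated here.

```python
SLOPE = 4.64        # from the regression equation y = 4.64x + 7.79
INTERCEPT = 7.79
MIN_AGE = 1         # assumed smallest dog age in the data (not stated in the lesson)
MAX_AGE = 11        # the lesson says the data only go up to age 11

def predict_human_age(dog_age):
    """Predict equivalent human age, refusing to extrapolate."""
    if not MIN_AGE <= dog_age <= MAX_AGE:
        raise ValueError(
            f"dog age {dog_age} is outside the data range "
            f"{MIN_AGE} to {MAX_AGE}; predicting here would be extrapolation"
        )
    return SLOPE * dog_age + INTERCEPT

print(round(predict_human_age(5), 2))  # 30.99
```

Calling predict_human_age(18) raises an error instead of returning 91.31, the value the equation alone would give. The equation will happily produce a number for any input; the range check is what reminds us that nothing in the data supports that number.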

Outliers and Influential Points

An outlier is an extreme observation that does not fit the general correlation or regression pattern (see figure below). An outlier is an unusual observation; therefore, the inclusion of this observation may affect the slope and the intercept of the regression line. When examining the scatterplot graph and calculating the regression equation, it is worth considering whether extreme observations should be included or not.

Let's use our example above to illustrate the effect of a single outlier. Say that we have a student who has a high GPA but suffered from test anxiety the morning of the SAT verbal test and scored a 410. Using our original regression equation, we would expect the student to have a GPA of 2.2. But in reality, the student has a GPA equal to 3.9. The inclusion of this value would change the slope of the regression equation from 0.0056 to 0.0032, which is quite a large difference.

There is no set rule when trying to decide whether or not to include an outlier in regression analysis. This decision depends on the sample size, how extreme the outlier is, and the normality of the distribution. As a general rule of thumb, we should consider values that are more than 1.5 times the interquartile range below the first quartile or above the third quartile as outliers. Extreme outliers are values that are more than 3.0 times the interquartile range below the first quartile or above the third quartile.
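
The 1.5 and 3.0 rules of thumb can be sketched in Python as follows. Quartiles are found here as the medians of the lower and upper halves of the sorted data, which is one common convention; calculators and software packages may use slightly different quartile rules. The data set and function names are invented for illustration.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # With an odd count, leave the middle value out of both halves.
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return median(lower), median(upper)

def find_outliers(values, multiplier=1.5):
    """Return values more than multiplier * IQR outside the quartiles."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical data set.
data = [2, 4, 5, 6, 7, 8, 9, 10, 19, 40]
print(find_outliers(data))        # [19, 40]
print(find_outliers(data, 3.0))   # [40]
```

Here Q1 = 5 and Q3 = 10, so the interquartile range is 5. Values above 17.5 count as outliers, and values above 25 count as extreme outliers.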

For an applet that will calculate correlation and the least squares regression line, see the link in the sidebar.

Section Summary

Bivariate data can be represented using a scatterplot to show what, if any, association there is between the two variables. Usually one of the variables, the explanatory (independent) variable, can be identified as having an impact on the value of the other variable, the response (dependent) variable. The explanatory variable should be placed on the horizontal axis, and the response variable should be placed on the vertical axis. Each point is plotted individually on a scatterplot. If there is an association between the two variables, it can be identified as being strong if the points form a very distinct form with little variation from that form in the individual points, or weak if the points appear more randomly scattered. If the values of the response variable generally increase as the values of the explanatory variable also increase, the data has a positive association. If the response variable generally decreases as the explanatory variable increases, the data has a negative association.

Correlation is a measure of the linear relationship between two variables; it does not necessarily state that one variable is caused by another. For example, a third variable or a combination of other things may be causing the two correlated variables to relate as they do. Therefore, it is important to remember that we are interpreting the variables and their correlation as not causal, but instead as relational. When calculating correlation, there are several things that could affect our computation, including outliers, curvilinear relationships, homogeneity of the group and the size of the group.

Prediction is simply the process of estimating scores on one variable based on the scores of another variable. We use the least-squares (also known as the linear) regression line to predict the value of a variable. Using this regression line, we are able to use the slope, y-intercept and the calculated regression coefficient to predict the scores of the response variable. Extrapolation is the process of making a prediction beyond the range of your data based on a regression line. Extrapolation can be dangerous as there is no data to fully support the prediction.

Describing Data Question for Thought 2

State the differences between correlation and causation and give an example of each.

Once you've completed your responses, follow your teacher's instructions for submitting your work.

Describing Data Quiz 2

It is now time to complete the "Describing Data Quiz 2".

After you have completed all the work in this chapter and feel confident of your knowledge of the content, you'll be prepared for a test over this material.

Describing Data Assignment

It is now time to complete the "Describing Data" assignment. Please download the assignment handout from the sidebar.

Once you've completed your responses, follow your teacher's instructions for submitting your work.

Module Wrap-Up

Now that you have finished the lessons and assignments, engage in the practice below and visit the extra resources in the sidebar. Then, continue to the next page to complete your final assessment.

Describing Data Final Module Test

It is now time to complete the "Describing Data" Test. Once you have completed all self-assessments, assignments, and the review items and feel confident in your understanding of this material, you may begin.

After you have completed all the work in this chapter and feel confident of your knowledge of the content, you'll be prepared for a test over this material.

Describing Data Project

This project will synthesize what you have learned in this unit and will allow you to design your own study. You will formulate research questions, use appropriate statistical methods to analyze the data, and develop and analyze predictions. A sample research study is provided to make sure you understand the standards to be addressed as you design your own study.

Materials Needed:

Pencil and (graphing) paper; graphing calculator or statistical software package; portfolio of work from the unit; possibly internet access to help brainstorm question design.

Once you've completed your responses, follow your teacher's instructions for submitting your work.