Trends in Data
Chapter 1.3 – Visualizing Trends
Mathematics of Data Management (Nelson)
MDM 4U
Variables
In computer science and mathematics, a variable is a symbol denoting a quantity or symbolic representation. In mathematics, a variable often represents an unknown quantity; in computer science, it represents a place where a quantity can be stored. Variables are often contrasted with constants, which are known and unchanging. (Wikipedia, 2004)
The Two Types of Variables
Independent Variable
a variable whose values are arbitrarily chosen
placed on the horizontal axis
time is always independent (why?)
Dependent Variable
a variable whose values depend on the independent variable
placed on the vertical axis
Scatter Plots
a graphical method of showing the joint distribution of two variables, where each point on the graph represents a pair of values (one for each variable)
may or may not show a trend
a trend indicates a correlation that may be strong or weak, positive or negative, linear or non-linear
What is a trend?
a pattern of average behavior that occurs over time
a general “direction” that something tends toward
for example, there has been a trend towards increasing costs in Canada
need two variables to exhibit a trend
An Example of a trend
U.S. population from 1780 to 1960
what is the trend?
is the trend linear?
[Figure: scatter plot of U.S. population (millions, 0 to 140) against year (1780 to 1960); data set labelled PearlReedandKish1940_USpopulationfrom17901940]
Line of Best Fit
the line of best fit is a line which best represents the trend in the data and is used for making predictions
these can be drawn by hand but there are also methods for mathematically calculating them (median-median and least squares methods are examples that we will study)
the line itself gives no indication of the strength of the trend (use the r or r^2 value for that)
An example of the line of best fit
this is temperature data from New York over time, with a median-median line added
what type of trend are we looking at?
see p35 for method for creating a median-median line
[Figure: scatter plot of mean temperature (Attr2_meantemp, 14 to 32) against year (1900 to 2000), from State of New York historical temperature data (winter season means), with the fitted line meantemp = 0.0230x - 21.4, where x is the horizontal-axis value]
Creating a Median-Median Line
Order the points by x and divide them into 3 symmetric groups
If there is 1 extra point, include it in the middle group
If there are 2 extra points, put one in each end group
Calculate the median x- and y-coordinates for each group and plot the median point (x, y)
If the median points are on a straight line, connect them
Otherwise, line up the two outer points, move 1/3 of the way to the middle point and draw a line of best fit
Median-Median Line (10 points)
Median-Median Line (14 points)
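The median-median steps can be sketched in Python as a rough check on a hand-drawn line. This is only a sketch under stated assumptions: the function name median_median_line is made up for illustration, "symmetric groups" is read as equal-sized end groups after sorting by x, and "move 1/3 of the way to the middle point" is read as shifting the line vertically by one third of the gap at the middle median point.

```python
from statistics import median


def median_median_line(points):
    """Return (slope, intercept) of a median-median line for (x, y) pairs.

    Hypothetical helper for illustration; assumes at least 3 points and
    that the two outer median points have different x-values.
    """
    pts = sorted(points)                      # order the points by x
    n = len(pts)
    if n < 3:
        raise ValueError("need at least 3 points")

    base, extra = divmod(n, 3)
    # 1 extra point goes to the middle group; 2 extra points go one to each end
    end_size = base + (1 if extra == 2 else 0)
    left = pts[:end_size]
    middle = pts[end_size:n - end_size]
    right = pts[n - end_size:]

    def median_point(group):
        # median x and median y are taken separately
        return median(x for x, _ in group), median(y for _, y in group)

    (x1, y1), (x2, y2), (x3, y3) = (median_point(g) for g in (left, middle, right))

    # line through the two outer median points
    slope = (y3 - y1) / (x3 - x1)
    outer_intercept = y1 - slope * x1

    # vertical gap between the middle median point and that line
    gap = y2 - (slope * x2 + outer_intercept)

    # move the line one third of the way toward the middle median point
    intercept = outer_intercept + gap / 3
    return slope, intercept


# Made-up points, for illustration only
sample = [(1, 1), (2, 2), (3, 3), (4, 7), (5, 8), (6, 9), (7, 7), (8, 8), (9, 9)]
slope, intercept = median_median_line(sample)
print(f"y = {slope}x + {intercept}")
```

If the three median points already lie on one line, the gap is zero and the result is simply the line through them, which matches the "connect them" case.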
Exercises
try page 37 #2, 3, 6, 8
Trends in Data Using Technology
Chapter 1.4 – Trends in Technology
Categories of Correlation
correlations shown in scatter plots can be positive or negative, strong or weak
try looking at the examples on this website to help you understand:
http://www.seeingstatistics.com/seeing1999/gallery/CorrelationPicture.html
[Figure: scatter plot (Collection 1) with vertical axis "temperature" (10 to 90) and horizontal axis "strongpositive" (1 to 9)]
Regression
a process of fitting a line or curve to a set of data
if a line is used, it is linear regression
if a curve is used, it may be quadratic regression, cubic regression, etc.
why do we do this?
what can we do with the resulting function?
http://www.seeingstatistics.com/seeing1999/gallery/CorrelationPicture.html
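One possible answer to "what can we do with the resulting function?" is prediction. The sketch below fits a least-squares line with plain Python and uses it to estimate y at a new x. The function name least_squares_line and the data values are made up for illustration; they do not come from the textbook.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data.

    Hypothetical helper for illustration; assumes the x-values are not all equal.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sums of squared and cross deviations from the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x   # the line passes through (mean_x, mean_y)
    return slope, intercept


# Made-up data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [52, 58, 61, 70, 74]

slope, intercept = least_squares_line(xs, ys)
predicted = slope * 6 + intercept         # use the fitted function to predict y at x = 6
print(f"y = {slope:.1f}x + {intercept:.1f}; prediction at x = 6: {predicted:.1f}")
```

Predicting just beyond the data, as here, is extrapolation, so it is only as trustworthy as the assumption that the linear trend continues.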
Correlation Coefficient
the correlation coefficient r is an indicator of the strength and direction of a linear relationship
r = 0 no linear relationship
r = 1 perfect positive correlation
r = -1 perfect negative correlation
r^2 is the coefficient of determination
if r^2 = 0.42, that means that 42% of the variation in y is explained by the linear relationship with x
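To make r and r^2 concrete, here is a minimal sketch that computes the correlation coefficient from its usual definition (the covariation of x and y divided by the product of their spreads). The function name correlation and the data are made up for illustration.

```python
from math import sqrt


def correlation(xs, ys):
    """Return the correlation coefficient r for paired data.

    Hypothetical helper for illustration; assumes neither list is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


# Made-up data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [52, 58, 61, 70, 74]

r = correlation(xs, ys)
r_squared = r ** 2                                 # coefficient of determination

perfect_up = correlation([1, 2, 3, 4], [2, 4, 6, 8])    # points exactly on a rising line
perfect_down = correlation([1, 2, 3, 4], [8, 6, 4, 2])  # points exactly on a falling line
print(round(r, 3), round(r_squared, 2), perfect_up, perfect_down)
```

For this made-up data set r^2 works out to 0.98, which would be read as 98% of the variation in y being explained by the linear relationship with x.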
Residuals
a residual is the vertical distance between a point and the line of best fit
if the model you are considering is a good fit, the residuals should be small and have no noticeable pattern
why?
[Figure: scatter plot (Collection 1) of y against x with the fitted line y = 0.0804x + 3.5, r^2 = 0.021, and a residual plot (residual against x) below it]
http://www.math.csusb.edu/faculty/stanton/m262/regress/regress.html
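Residuals are easy to compute once a line is chosen. In the sketch below the points are made up for illustration, and the slope and intercept are the least-squares line for those particular points; a residual is taken as observed y minus predicted y, so a positive value means the point lies above the line.

```python
def residuals(points, slope, intercept):
    """Return observed y minus predicted y for each (x, y) point."""
    return [y - (slope * x + intercept) for x, y in points]


# Made-up points, for illustration only
points = [(1, 3), (2, 5), (3, 4), (4, 8)]

# Least-squares line for these four points: y = 1.4x + 1.5
slope, intercept = 1.4, 1.5

res = residuals(points, slope, intercept)
for (x, y), e in zip(points, res):
    print(f"x = {x}, y = {y}, residual = {e:+.1f}")

# For a least-squares line the residuals balance out above and below the line
print(f"sum of residuals = {sum(res):.1f}")
```

Plotting these residuals against x, as in the residual plot on the slide, shows whether they are scattered without pattern or curve systematically, which would suggest a non-linear model.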
Creating a Median-Median Line Using Technology
Copy the following file to your M:\ drive:
N:\LIEFF\MDM4U\1.3 Best Fit Lines\armspan vs height.stu.ftm
Right-click the file | Open With | Choose Program | Browse
Program Files\Fathom\fathom.exe
Exercises
Page 51 #1-6, 7 b,c,d, 8
References
Wikipedia (2004). Online Encyclopedia. Retrieved September 1, 2004 from http://en.wikipedia.org/wiki/Main_Page
The Power of Data
Chapter 1.5 – The Media
There are 3 kinds of lies: lies, damn lies and statistics.
‘4 out of 5 dentists recommend Trident sugarless gum to their patients who chew gum’
In small groups, discuss how this statistical statement could be misleading
Trident conclusions
How many dentists did they ask? 5?
4 out of 5 is convincing but reasonable
5 out of 5 is preposterous
3 out of 5 is good but not great
Recommend Trident over what? Chewing sugared gum?
Is Trident the “best” sugarless gum?
What variables were considered?
What did the 5th dentist recommend?
“More people stay with [Bell Mobility] than any other provider.”
In small groups, discuss:
1) What variables would be recorded in this study?
2) How could the data be used to arrive at this conclusion falsely?
1) What variables would be recorded in this study?
Number of Bell Mobility subscribers
Number of renewed contracts
Contract renewed?
Time of Renewal (during contract / upon completion of contract)
Contract Length
Contract Type (business or home)
2) How could the data be used to arrive at this conclusion falsely?
Does not specify how many more customers stay with Bell.
e.g. Percentage of customers renewing their plan:
Bell: 30% Rogers: 29% Telus: 25% Fido: 28%
Did they only count totals?
What does it mean to “stay with Bell”? Honour entire contract? Renew contract at the end of a term?
Are early terminations factored in? If so, does Bell have a higher cost for early terminations?
Competitors’ renewal rates may have decreased due to family plans
Does the data include Private / Corporate plans?
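The "did they only count totals?" question can be illustrated with a small calculation. Everything in this sketch is hypothetical: the provider names, subscriber counts and renewal percentages are invented to show how a larger provider can have the most customers staying while having a lower renewal rate.

```python
# Hypothetical providers: (number of subscribers, percent who renew).
# These figures are invented for illustration and describe no real company.
providers = {
    "Provider A": (1_000_000, 25),
    "Provider B": (400_000, 30),
}

# Total number of customers who stay with each provider
stayers = {name: subs * pct // 100 for name, (subs, pct) in providers.items()}

most_by_total = max(stayers, key=stayers.get)
best_by_rate = max(providers, key=lambda name: providers[name][1])

for name, (subs, pct) in providers.items():
    print(f"{name}: {stayers[name]:,} of {subs:,} stay ({pct}%)")
print("Most people stay with:", most_by_total)
print("Highest renewal rate:", best_by_rate)
```

With these invented numbers, "more people stay with Provider A" is true as a count even though Provider B keeps a larger share of its customers, which is the kind of gap between totals and rates the discussion question is probing.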
How does the media use (misuse) data?
To inform the public about world events in an objective manner
It sometimes gives misleading or false impressions to sway the public or to increase ratings
It is important to:
Study statistics to understand how information is represented or misrepresented
Correctly interpret tables/charts presented by the media
Exercises
p. 60 #1-6
Final Project – Manipulating Data