Math 80 Lectures, Week 3, F19 (fredpark.com)
• Lecture 7
Math 80: Elementary Statistics, Lecture 7
Dr. Fred Park
Graphical Representation of Data Cont’d.
Stem and Leaf Plot: a quick way to look at small amounts of numerical data.
Example: Math 80 test grades (grades are a bit high), percentage grades of 25 students. Draw a stem and leaf plot.
Split each number so that the tens digit is the stem and the ones digit is the leaf: 62 --> 6|2. Place it on a vertical chart.
Stems are listed on the chart (high to low); the leaf 2 is placed to the right of the stem 6.
Place the remaining values the same way, e.g. 6|2, 8|7, and so on.
Sort the leaf values horizontally (low to high).
Now we can interpret the data:
• somewhat symmetric
• center roughly 70
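A minimal R sketch of the tens-digit/ones-digit split described above. The `grades` vector is hypothetical (the 25 class grades are not reproduced in these notes); base R's `stem()` draws a stem-and-leaf display directly, listing stems low to high.

```r
# Hypothetical grades for illustration only (not the class data)
grades <- c(62, 87, 71, 75, 68, 93, 79, 84, 70, 66)

stems  <- grades %/% 10   # tens digit is the stem
leaves <- grades %% 10    # ones digit is the leaf

# Group the leaves by stem and sort each row low to high
chart <- lapply(split(leaves, stems), sort)

# Print the stems high to low, as on the slides
for (s in rev(names(chart))) {
  cat(s, "|", chart[[s]], "\n")
}

# Base R's built-in stem-and-leaf display
stem(grades)
```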
Graphical Representation of Data Cont’d: Scatter Plots
Q: What if we want to see whether two different variables are related?
ex. Is there a relationship between elevation and temperature on a given day?
prelim: state the random variables
x = altitude
y = high temperature
plot x vs y
Scatter Plots in R
# get the input values
input <- mtcars[, c("wt", "mpg")]
# plot the chart for cars with weights between 2.5 and 5
# and mileage between 15 and 30 (no Prius)
plot(x = input$wt, y = input$mpg,
     xlab = "weight", ylab = "mileage",
     xlim = c(2.5, 5), ylim = c(15, 30),
     main = "Weight vs Mileage")
• Lecture 8
class exercise:
ex 1. The table contains the value of a house and the amount of rental income the house brings in per year. Create a scatter plot and state whether there is a relationship between the value of the house and its annual rental income.
Use R for this one!
ex 2.
do by hand!
HW#2: Due next Thursday, 10/3
Chap 2: 3, 8, 12, 21, 40, 41, 43, 44, 45, 49-55, 69-72.
Graphical Representation of Data Cont’d: Histograms
Q: What if we want to see the distribution of continuous numerical data?
A histogram consists of adjoining boxes where
• each height is the frequency or relative frequency, on one axis
• on the other axis are the binned values, or partition points between values
f = frequency
n = total # of data values
RF = relative frequency = f/n
ex. data = {1,2,2,3,3,3,4,4,5}
bins = {[0.5,1.5), [1.5,2.5), [2.5,3.5), [3.5,4.5), [4.5,5.5)} = {b1,b2,b3,b4,b5}
frequency of 1 in bin 1 = 1
frequency of 2 in bin 2 = 2
frequency of 3 in bin 3 = 3
frequency of 4 in bin 4 = 2
frequency of 5 in bin 5 = 1
R code:
# %r  (notebook cell marker from the original slides)
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)
bins <- seq(0.5, 5.5, by = 1)
hist(x, breaks = bins, col = "red")
rel freq of 1 = 1/9
rel freq of 2 = 2/9
rel freq of 3 = 3/9
rel freq of 4 = 2/9
rel freq of 5 = 1/9
R code:
# %r  (notebook cell marker from the original slides)
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)
bins <- seq(0.5, 5.5, by = 1)
hist(x, breaks = bins, freq = FALSE, col = "red")
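The relative frequencies RF = f/n can also be computed directly. The sketch below uses the same data and bins. One caveat worth hedging: with `freq = FALSE`, `hist()` plots a density (RF divided by bin width), which matches RF here only because every bin has width 1.

```r
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)
n <- length(x)            # total number of data values

f  <- table(x)            # frequency of each data value
rf <- f / n               # relative frequency = f / n
print(rf)

# hist() with freq = FALSE plots density = RF / bin width.
# The bins here have width 1, so density and RF coincide.
bins <- seq(0.5, 5.5, by = 1)
h <- hist(x, breaks = bins, freq = FALSE, col = "red")
print(h$density)
```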
Difference between a histogram and a bar chart?
Note: for bin boundaries you can use [ , ) or ( , ] or ( , ) or even [ , ].
The first two are preferable since a value on a shared boundary is counted exactly once (no double counting).
Which one is used depends on the book; software implementations use some combination of these boundaries.
We will use [ , ).
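A short sketch of how the boundary convention changes the counts. In base R, `cut()` and `hist()` default to right-closed bins ( , ]; passing `right = FALSE` gives the [ , ) convention used in these notes. The small vector `y` and its breaks are made up to put a value exactly on a boundary.

```r
x    <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)
bins <- seq(0.5, 5.5, by = 1)

# Left-closed bins [a, b): no data value sits on a boundary here,
# so the counts are the same under either convention
counts <- table(cut(x, breaks = bins, right = FALSE))
print(counts)

# Made-up example where the value 2 sits exactly on a boundary
y        <- c(1, 2, 3)
y_breaks <- c(0, 2, 4)
left_closed  <- table(cut(y, breaks = y_breaks, right = FALSE))  # [0,2), [2,4)
right_closed <- table(cut(y, breaks = y_breaks, right = TRUE))   # (0,2], (2,4]
print(left_closed)
print(right_closed)

hist(x, breaks = bins, right = FALSE, col = "red")
```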
• Lecture 9
Monthly rent data
Create a histogram by hand and in R
# %r  (notebook cell marker from the original slides)
x <- c(1500,1500,1250,900,1350,1150,600,800,350,1500,610,2550,
       1200,900,960,495,850,1400,890,1200,900,1100,1325,690)
xmin = min(x)
xmax = max(x)
sprintf("min = %2.1f, max = %2.1f", xmin, xmax)
L = xmax - xmin   # length of the data range
print(L)
nBins = 7
# del is the bin width here: round up to the next integer,
# even if L/nBins is already an integer
if (L %% nBins == 0) {
  del = L/nBins + 1
} else {
  del = ceiling(L/nBins)
}
sprintf("del = %2.0f", del)
bin_lim <- seq(xmin - 1/2, xmax + 1/2 + del, by = del)
cat("bin limits = ", bin_lim)
bins <- bin_lim - 0.5
cat("\nbins=", bins)
bin_mpt <- c(507, 822, 1137, 1452, 1767, 2082, 2397)  # bin midpoints (rounded)
hist(x, breaks = bins, freq = TRUE, col = "red")
# print(bins)
See the example from class on hand construction of a histogram of the data D = {1,2,2,3,3,3,4,4,5}
• Lecture 10
D = {1,2,2,3,3,3,4,4,5}
xmin = 1, xmax = 5
n = # of data points = 9
L = length of the data range = xmax-xmin = 5-1 = 4
nb = # of bins = 5 (usually about n^(1/2) for larger data sets)
bw = bin width: L/nb = 4/5 = 0.8, rounded up to bw = 1 (take the ceiling if decimal, else add 1 if already an integer)
del = partition offset = 0.5 (for integer-valued data; a free choice depending on the data)
break points = bp = bin boundaries
= {xmin-del, xmin-del+bw, xmin-del+2*bw, ..., xmin-del+nb*bw}
= {1-0.5, 1-0.5+1, 1-0.5+2, 1-0.5+3, 1-0.5+4, 1-0.5+5}
= {0.5, 1.5, 2.5, 3.5, 4.5, 5.5} (note: # of bp's = 6)
bins = {[0.5,1.5), [1.5,2.5), [2.5,3.5), [3.5,4.5), [4.5,5.5)} = {b1, b2, b3, b4, b5}
# of bins = # of bp's - 1 = 6-1 = 5
[Figure: number line showing the data values 1 2 3 4 5, the break points from xmin-del through xmin-del+5*bw, and the bin width bw]
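The hand construction above can be sketched in R as follows. The variable names mirror the notes (`L`, `nb`, `bw`, `del`, `bp`); the rounding rule is my reading of "take ceiling if decimal, else +1 if integer", so treat it as an assumption rather than the definitive rule.

```r
D <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)

xmin <- min(D)
xmax <- max(D)
L    <- xmax - xmin     # length of the data range = 4
nb   <- 5               # number of bins chosen in the notes
raw  <- L / nb          # 0.8

# Assumed rule: ceiling if decimal, else add 1 if already an integer
bw  <- if (raw == floor(raw)) raw + 1 else ceiling(raw)
del <- 0.5              # partition offset for integer-valued data

bp <- xmin - del + (0:nb) * bw   # break points
print(bp)
```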
bin         bin center   data value   frequency
[0.5,1.5)   1            1            1
[1.5,2.5)   2            2            2
[2.5,3.5)   3            3            3
[3.5,4.5)   4            4            2
[4.5,5.5)   5            5            1
note: bin boundaries vary with different software platforms, books, and interpretations,
e.g. they can be [ , ] or [ , ) or ( , ] or ( , ) depending on the author or implementation
bin center = (left bp + right bp)/2
e.g. for bin [0.5, 1.5), center = (0.5+1.5)/2 = 1
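A sketch that rebuilds the table above in R, assuming the [ , ) convention (`right = FALSE`). `hist(..., plot = FALSE)` returns the counts and midpoints without drawing anything.

```r
D  <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)
bp <- seq(0.5, 5.5, by = 1)          # break points from the hand construction

# Bin centers = (left bp + right bp) / 2
centers <- (head(bp, -1) + tail(bp, -1)) / 2

# Counts per bin using left-closed bins [a, b)
h <- hist(D, breaks = bp, right = FALSE, plot = FALSE)

tab <- data.frame(
  bin       = levels(cut(D, breaks = bp, right = FALSE)),
  center    = centers,
  frequency = h$counts
)
print(tab)
```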
stunning histogram!!
[Figure: histogram of D. Green: break points. Red: bin centers.]
Monthly rent data
Create a histogram (1) by hand and (2) in R
Graphical Representation of Data Cont’d: Time Series
Time Series plot: graph showing data measurements in chronological order
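A minimal sketch of a time series plot in base R. The monthly values below are hypothetical and only illustrate the idea of plotting measurements in chronological order.

```r
# Hypothetical monthly measurements, for illustration only
month <- 1:12
value <- c(12, 15, 14, 18, 21, 25, 27, 26, 22, 18, 14, 11)

# type = "o" draws points joined by lines, in the order given
plot(month, value, type = "o",
     xlab = "month", ylab = "measurement",
     main = "Time Series Plot")
```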
Class exercise: find 2 data sets
1. Data that you can create a histogram for
2. Data that you can create a time series for
Plot both in R
• Lecture 11
Measures of Center of Data
mode: the data value that occurs most frequently in the data
find it by looking for the data pt with the highest frequency
e.g. D = {1,2,2,3,3,3,3,4,5,8,8,8,8,8,9,10,11,2}, what’s the mode?
median: the data value in the middle of a sorted list
find it by sorting the data and taking the middle value
e.g. D = {2,1,5,3,4}, what’s the median? How about D = {4,2,3,1}?
mean: the arithmetic average of the numbers
find the mean of D = {1,2,3,4,5}, or of D = {1,2,1,1,3,4,6}
1. Find the mean, median and mode of the data:
D = {6.8, 8.2, 7.5, 9.4, 8.2}, weights of cats in lbs.
2. Find a data set that interests you and calculate the mean, median and mode
ex 1. Find the mean, median and mode of the data:
D = {6.8, 8.2, 7.5, 9.4, 8.2}, weights of cats in lbs.
variable x = weight of the cat
→ mean: (6.8+8.2+7.5+9.4+8.2)/5 = 40.1/5 = 8.02 lbs.
→ median: sort the list: 6.8 7.5 8.2 8.2 9.4
take the middle value: median = 8.2 lbs.
→ mode: most frequently occurring value = 8.2 lbs.
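The same three measures can be checked in R. `mean()` and `median()` are built in; base R has no built-in for the statistical mode (its `mode()` reports a storage type instead), so this sketch tabulates the values and takes the most frequent one.

```r
x <- c(6.8, 8.2, 7.5, 9.4, 8.2)   # weights of cats in lbs

x_bar <- mean(x)
x_med <- median(x)

# Tabulate the values and keep the one(s) with the highest frequency
freq   <- table(x)
x_mode <- as.numeric(names(freq)[freq == max(freq)])

sprintf("mean = %.2f, median = %.1f, mode = %.1f", x_bar, x_med, x_mode)
```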
ex 2. Find the mean, median and mode of the data:
D = {6.8, 8.2, 7.5, 9.4, 8.2, 6.3} (even number of pts), weights of cats in lbs.
variable x = weight of the cat
→ median: sort the list: 6.3 6.8 7.5 8.2 8.2 9.4
An even # of pts means there is no middle data value, so average the 2 middle pts:
median = (7.5+8.2)/2 = 7.85 lbs.
ex 3. Effect of Extreme Values on Mean and Median
D = {6.8, 7.5, 8.2, 8.2, 9.4, 22.1} (even number of pts), weights of cats in lbs.
variable x = weight of the cat
mean = (6.8+7.5+8.2+8.2+9.4+22.1)/6 = 62.2/6 = 10.37 lbs., median = (8.2+8.2)/2 = 8.2 lbs.
note: mean > median
The mean went from 8.02 to 10.37 but the median stayed the same (8.2): the mean is affected by extreme values much more than the median is.
fat cat brought the mean up! → outlier!
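A quick R check of the outlier effect, comparing the five-cat data with the same data plus the 22.1 lb cat. The variable names are made up for this sketch.

```r
cats     <- c(6.8, 7.5, 8.2, 8.2, 9.4)   # original weights in lbs
cats_fat <- c(cats, 22.1)                # same data plus the extreme value

before <- c(mean = mean(cats),     median = median(cats))
after  <- c(mean = mean(cats_fat), median = median(cats_fat))

print(round(before, 2))
print(round(after, 2))
```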
However, because the data is sampled, the mean is a more reliable (e.g. more consistent) measure of the center of the data.
See the different distributions of data below:
[Figure: three distributions, labeled "mean < median"; "mean, median, mode all centered"; "mean > median"]
Average vs Weighted Average?
Suppose you take 3 classes in Spring 2018:
Math 141A (5 units), grade = A-
COSC 220 (3 units), grade = B
Math 80 (3 units), grade = C
What is the avg GPA for that semester?
method 1, avg: (3.7 + 3 + 2)/3 = 2.9
method 2, weighted avg: (5*3.7 + 3*3 + 3*2)/(5+3+3) = 3.0455
Discrepancy!
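The two methods can be compared in R. `weighted.mean()` is part of base R (stats package); the vector names `units` and `points` are chosen here for illustration.

```r
units  <- c(5, 3, 3)         # Math 141A, COSC 220, Math 80
points <- c(3.7, 3.0, 2.0)   # A-, B, C

avg   <- mean(points)                          # method 1: plain average
w_avg <- sum(units * points) / sum(units)      # method 2: weighted average

sprintf("avg = %.4f, weighted avg = %.4f", avg, w_avg)

# Built-in equivalent of method 2
weighted.mean(points, units)
```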
Average vs Weighted Average?
Suppose you take 3 classes in Spring 2018:
Math 141A (3 units), grade = A-
COSC 220 (3 units), grade = B
Math 80 (3 units), grade = C
weighted avg GPA = (3*3.7 + 3*3 + 3*2)/(3+3+3) = 3*(3.7+3+2)/(3*(1+1+1)) = (3.7+3+2)/3 = avg GPA
note: if all weights are equal, avg = weighted average. Why?
ex. weighted average
Measures of Spread of Data
going back to the cat example:
mean = avg weight = 8.02 lbs
Were most of the weights close to this weight? How far off were they?
range of data = highest value - lowest value = max val - min val
cat example: D = {6.8, 8.2, 7.5, 9.4, 8.2}
variable x = weight of a cat
mean = 8.02
range of data = 9.4-6.8 = 2.6
look at the distance from each data value to the mean (x - mean): called the deviation
Sum all the deviations.
Why is the sum of the deviations = 0?
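One way to see why, written for n data values x_1, ..., x_n with mean x̄ (this notation is an assumption; the slides do not fix symbols):

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0,
\qquad \text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
```

The positive and negative deviations cancel exactly, which is why the plain sum cannot measure spread.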
Better: sum the squares of the deviations.
avg total of squared deviations (the variance): variance = (sum of squared deviations)/(n-1)
note: n-1 is 1 less than the # of data pts
standard deviation: standard deviation = sqrt(variance)
standard deviation: roughly the avg (mean) distance from a data pt. to the mean, i.e. how much a typical data pt. differs from the mean.
n-1 is used due to degrees of freedom; it makes the sample standard deviation a better approximation of the population standard deviation.
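A step-by-step sketch for the cat data: deviations, their sum, squared deviations, and the sample standard deviation with the n-1 divisor. The result is compared against base R's `sd()`, which also divides by n-1.

```r
x <- c(6.8, 8.2, 7.5, 9.4, 8.2)   # weights of cats in lbs
n <- length(x)

x_bar <- mean(x)
dev   <- x - x_bar                # deviations from the mean
print(sum(dev))                   # zero, up to floating-point rounding

ss <- sum(dev^2)                  # sum of squared deviations
v  <- ss / (n - 1)                # divide by 1 less than the # of data pts
s  <- sqrt(v)                     # sample standard deviation

sprintf("sum of squares = %.3f, variance = %.3f, sd = %.2f", ss, v, s)
sd(x)                             # built-in check
```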
• Lecture 12
Measures of Spread of Data
Recall:
standard deviation: roughly the avg (mean) distance from a data pt. to the mean, i.e. how much a typical data pt. differs from the mean.
[Tables: squared deviations for training 1; squared deviations for training 2]
R code:
# %r  (notebook cell marker from the original slides)
x1 <- c(56, 75, 48, 63, 59)
x2 <- c(60, 58, 66, 59, 58)
# sample means
x1_bar = sum(x1)/length(x1)
x2_bar = sum(x2)/length(x2)
sprintf("mean of data set #1 = %2.2f", x1_bar)
sprintf("mean of data set #2 = %2.2f", x2_bar)
# sample standard deviations (n-1 in the denominator)
sigma1 = sqrt(sum((x1 - x1_bar)^2)/(length(x1) - 1))
sigma2 = sqrt(sum((x2 - x2_bar)^2)/(length(x2) - 1))
sprintf("sigma1 = %2.2f", sigma1)
sprintf("sigma2 = %2.2f", sigma2)
output:
'mean of data set #1 = 60.20'
'mean of data set #2 = 60.20'
'sigma1 = 9.93'
'sigma2 = 3.35'
Ranking
A percentile is a measure of ranking
The kth percentile: data value that has k% of the data at or below that value
e.g. The median is the 50th percentile
If you are in the 90th percentile what does that mean? (no pun intended)
This means that 90% of the scores were at or below this score, so you did the same as or better than 90% of the test takers.
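A small sketch of the "k% at or below" definition. The scores are hypothetical; `pct_at_or_below` is a made-up helper name, not a built-in.

```r
# Hypothetical test scores, for illustration only
scores <- c(55, 61, 64, 70, 72, 75, 78, 81, 88, 95)

# Percent of the data at or below a given value
pct_at_or_below <- function(value, data) {
  100 * mean(data <= value)
}

p_score  <- pct_at_or_below(88, scores)               # a score of 88
p_median <- pct_at_or_below(median(scores), scores)   # the median

sprintf("88 is at percentile %.0f; the median is at percentile %.0f",
        p_score, p_median)
```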
Quartiles: split the data into fourths
Interquartile Range (IQR): IQR = Q3-Q1
[Figure: typical box plot]
example:
Total assets (in billions of AUD) of Australian Banks (2012)
2855 2862 2861 2884 3014 2965
2971 3002 3032 2950 2967 2964
Find the 5 number summary and the interquartile range (IQR).
variable x = total assets of Australian banks
sort the data
min = 2855 billion AUD
max = 3032 billion AUD
median = ?
sorted data: 2855 2861 2862 2884 2950 2964 | 2965 2967 2971 3002 3014 3032
median = (2964+2965)/2 = 2964.5 billion AUD
Q1? find the median of the 1st half of the list
Q1 = (2862+2884)/2 = 2873 billion AUD
Q3? find the median of the 2nd half of the list
Q3 = (2971+3002)/2 = 2986.5 billion AUD
five number summary (in billions of AUD):
min = 2855
Q1 = 2873
median = 2964.5
Q3 = 2986.5
max = 3032
IQR = Q3-Q1 = 2986.5-2873 = 113.5 billion AUD
→ the middle 50% of assets were within 113.5 billion AUD of each other
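The median-of-halves method above can be sketched in R. This assumes an even number of data points, as in the example. Base R's `fivenum()` gives the same values for this data, while `quantile()` and `IQR()` interpolate differently by default and can give slightly different quartiles.

```r
x <- sort(c(2855, 2862, 2861, 2884, 3014, 2965,
            2971, 3002, 3032, 2950, 2967, 2964))
n <- length(x)                    # 12, an even number of pts

lower <- x[1:(n / 2)]             # 1st half of the sorted list
upper <- x[(n / 2 + 1):n]         # 2nd half of the sorted list

five <- c(min = min(x), Q1 = median(lower), median = median(x),
          Q3 = median(upper), max = max(x))
iqr <- five[["Q3"]] - five[["Q1"]]

print(five)
print(iqr)

# Built-in five number summary for comparison
fivenum(x)
```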
Box-and-Whiskers Plot (Box Plot)
Box-and-Whiskers Plot of Total Assets of Aust. Banks in 2012
The distribution is skewed right because the right tail is longer.
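A sketch of drawing this box plot in base R. `boxplot()` returns the statistics it drew in `$stats` (lower whisker, lower hinge, median, upper hinge, upper whisker), so they can be compared with the five number summary found by hand.

```r
x <- c(2855, 2862, 2861, 2884, 3014, 2965,
       2971, 3002, 3032, 2950, 2967, 2964)

b <- boxplot(x, horizontal = TRUE,
             xlab = "total assets (billions of AUD)",
             main = "Total Assets of Australian Banks in 2012")

# Values used for the whiskers, box edges, and median line
print(b$stats)
```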
Create a Box-and-Whiskers Plot (Box Plot) for the following:
ex. The life expectancy for a person living in one of 11 countries in a region of South East Asia in 2012 is given below.
Find the 5 number summary of the data and the IQR, and draw a box-and-whiskers plot.
Starter:
variable x = life expectancy of a person
sort the list
calculate the appropriate medians to split the data into quartiles