
Math 80: Elementary Statistics

• Lecture 7

Dr. Fred Park

Graphical Representation of Data Cont’d.

Stem and Leaf Plot: quick way to look at small amounts of numerical data

Math 80 Test Grades example (grades are a bit high)
percentage grades of 25 students
Draw a stem and leaf plot

Divide each number so that the tens digit is the stem and the ones digit is the leaf: 62 --> 6|2, place on vertical chart.

Stems on chart below (high to low): 2 placed on right of 6


6|2 and 8|7 are placed; place the remaining grades the same way.

Then sort the leaf values horizontally (low to high).
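The steps above can be sketched in R. The 25 test grades from the slide are not reproduced in these notes, so the `grades` vector below is hypothetical data chosen only for illustration:

```r
# Hypothetical percentage grades (NOT the 25 grades from the slide)
grades <- c(62, 87, 71, 78, 65, 93, 74, 70, 88, 59)

stems  <- grades %/% 10   # tens digit is the stem
leaves <- grades %% 10    # ones digit is the leaf

# Group the leaves by stem and sort each row low to high
plot_rows <- lapply(split(leaves, stems), sort)

# Print the stems high to low, as on the slide
for (s in rev(names(plot_rows))) {
  cat(s, "|", paste(plot_rows[[s]], collapse = " "), "\n")
}

# Base R also has a built-in version (it lists the stems low to high)
stem(grades)
```

The hand-built version makes the stem/leaf split explicit; `stem()` is the quicker choice once the idea is clear.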


Now can interpret data:
• somewhat symmetric
• center roughly 70

Graphical Representation of Data Cont’d: Scatter Plots

Q: What if we want to see whether two different variables are related?

ex. Is there a relationship between elevation and temperature on a given day?

prelim: state random variables
x = altitude
y = high temperature
plot x vs y


Scatter Plots in R

# get the input values
input <- mtcars[, c("wt", "mpg")]

# plot the chart for cars w/ weights between 2.5 and 5
# and mileage between 15 and 30 (no prius)
plot(x = input$wt, y = input$mpg,
     xlab = "weight", ylab = "mileage",
     xlim = c(2.5, 5), ylim = c(15, 30),
     main = "Weight vs Mileage")

• Lecture 8



class exercise:

ex 1. The table contains the value of a house and the amount of rental income the house brings in per year. Create a scatter plot and state whether there is a relationship between the value of the house and the annual rental income.

Use R for this one!

class exercise:

ex 2.

do by hand!


HW#2: Due Next Thursday 10/3
Chap 2: 3,8,12,21,40,41,43,44,45,49-55, 69-72.

Graphical Representation of Data Cont’d: Histograms

Q: What if we want to see the distribution of continuous numerical data?

A histogram consists of adjoining boxes where
• each height is the frequency or relative frequency, on one axis
• on the other axis are the binned values, or the partition points between values

f = frequency
n = total # of data values
RF = relative frequency = f/n

ex. data = {1,2,2,3,3,3,4,4,5}
bins = {[0.5,1.5), [1.5,2.5), [2.5,3.5), [3.5,4.5), [4.5,5.5)} = {b1,b2,b3,b4,b5}

frequency of 1 in bin 1 = 1
frequency of 2 in bin 2 = 2
frequency of 3 in bin 3 = 3
frequency of 4 in bin 4 = 2
frequency of 5 in bin 5 = 1


R code:
%r
x <- c(1,2,2,3,3,3,4,4,5)
bins <- seq(0.5, 5.5, by = 1)
hist(x, breaks = bins, col = "red")


rel freq of 1 = 1/9
rel freq of 2 = 2/9
rel freq of 3 = 3/9
rel freq of 4 = 2/9
rel freq of 5 = 1/9

R code:
%r
x <- c(1,2,2,3,3,3,4,4,5)
bins <- seq(0.5, 5.5, by = 1)
# freq = FALSE plots densities; with bin width 1 these equal the relative frequencies
hist(x, breaks = bins, freq = FALSE, col = "red")


Difference between a histogram and a bar chart?

Note: for bin boundaries, you can use [ , ) or ( , ] or ( , ) or even [ , ].
The first two are preferable since you do not double count a value that falls on a boundary.
Whichever is used depends on the book.

Software implementations use some combination of these boundaries.
We will use [ , ).
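In R, `hist()` closes bins on the right, ( , ], by default; passing `right = FALSE` switches to the [ , ) convention used in these notes. A minimal sketch, using a small made-up data set and integer break points (both chosen only so that values land exactly on a boundary):

```r
# Made-up data whose values fall exactly on the break points
x <- c(1, 2, 2, 3)
breaks <- c(1, 2, 3, 4)

# Default: bins closed on the right, (a, b]; the first bin also includes its left end
counts_right <- hist(x, breaks = breaks, plot = FALSE)$counts

# right = FALSE: bins closed on the left, [a, b); the last bin also includes its right end
counts_left <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)$counts

print(counts_right)  # counts: 3 1 0
print(counts_left)   # counts: 1 2 1
```

With half-integer break points such as 0.5, 1.5, ... and integer data, no value lands on a boundary, so both conventions give the same counts.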

• Lecture 9


Monthly rent data
Create a histogram by hand and in R



Graphical Representation of Data Cont’d: Histograms

%r
# monthly rent data
x <- c(1500,1500,1250,900,1350,1150,600,800,350,1500,610,2550,
       1200,900,960,495,850,1400,890,1200,900,1100,1325,690)

xmin = min(x)
xmax = max(x)
sprintf("min = %2.1f, max = %2.1f", xmin, xmax)

L = xmax - xmin
print(L)
nBins = 7

# bin width: round up to next integer even if already an integer
if (L %% nBins == 0) {
  del = L/nBins + 1
} else {
  del = ceiling(L/nBins)
}
sprintf("del = %2.0f", del)

bin_lim <- seq(xmin - 1/2, xmax + 1/2 + del, by = del)
cat("bin limits = ", bin_lim)
bins <- bin_lim - 0.5
cat("\nbins=", bins)

# bin midpoints (rounded)
bin_mpt <- c(507,822,1137,1452,1767,2082,2397)
hist(x, breaks = bins, freq = TRUE, col = "red")
#print(bins)


See Example from Class
on hand construction of histogram of data
D = {1,2,2,3,3,3,4,4,5}

• Lecture 10


Graphical Representation of Data Cont’d: Histograms
D = {1,2,2,3,3,3,4,4,5}

xmin = 1, xmax = 5

n = # data points = 9

L = length of the data interval = xmax-xmin = 5-1 = 4

nb = # of bins = 5 (usually n^(1/2) for larger data sets)

bw = bin width: L/nb = 4/5 = 0.8 (take ceiling if decimal, else +1 if integer), so bw = 1

del = partition offset = 0.5 (if integer valued data. free choice dep. on data)


Graphical Representation of Data Cont’d: Histograms
D = {1,2,2,3,3,3,4,4,5}

break points = bp = bin boundaries
= {xmin-del, xmin-del+bw, xmin-del+2*bw, ..., xmin-del+nb*bw}
= {1-0.5, 1-0.5+1, 1-0.5+2, 1-0.5+3, 1-0.5+4, 1-0.5+5}
= {0.5, 1.5, 2.5, 3.5, 4.5, 5.5} (note: #bp’s = 6)

bins = {[0.5,1.5), [1.5,2.5), [2.5,3.5), [3.5,4.5), [4.5,5.5)}
= {b1, b2, b3, b4, b5}

#bins = #bp’s-1 = 6-1 = 5
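The hand construction can be checked with a short R sketch. It follows the recipe in these notes (bin width from L/nb with the ceiling/+1 rule, offset del = 0.5); the variable names are chosen here for illustration:

```r
D <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)

xmin <- min(D)
xmax <- max(D)
L    <- xmax - xmin          # length of the data interval: 4
nb   <- 5                    # number of bins
del  <- 0.5                  # partition offset for integer valued data

# bin width: take the ceiling if L/nb is a decimal, else add 1
ratio <- L / nb
bw <- if (ratio %% 1 == 0) ratio + 1 else ceiling(ratio)

# break points: xmin-del, xmin-del+bw, ..., xmin-del+nb*bw
bp <- xmin - del + (0:nb) * bw

# bin centers: midpoint of each pair of neighboring break points
centers <- (bp[-length(bp)] + bp[-1]) / 2

# frequencies with [ , ) bins
freq <- hist(D, breaks = bp, right = FALSE, plot = FALSE)$counts

print(bp)       # 0.5 1.5 2.5 3.5 4.5 5.5
print(centers)  # 1 2 3 4 5
print(freq)     # 1 2 3 2 1
```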

[Figure: number line with the data values 1 to 5; break points run from xmin-del to xmin-del+5*bw, each bin of width bw]


bin         bin center   data value   frequency
[0.5,1.5)   1            1            1
[1.5,2.5)   2            2            2
[2.5,3.5)   3            3            3
[3.5,4.5)   4            4            2
[4.5,5.5)   5            5            1

note: bin boundaries vary with different software platforms, books, interpretations
e.g. can be [ , ] or [ , ) or ( , ] or ( , ) depending on author or implementation

bin centers = (left bp + right bp)/2
e.g. for bin [0.5, 1.5), center = (1.5+0.5)/2 = 1


stunning histogram!!

[Figure: histogram of D; green: break pts, red: bin centers]



Monthly rent data
Create a histogram (1) by hand and (2) in R

Graphical Representation of Data Cont’d: Time Series

Time Series plot: graph showing data measurements in chronological order
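A minimal sketch of a time series plot in R, assuming the built-in `AirPassengers` data set (monthly airline passenger totals, 1949 to 1960) as a stand-in; it is not a data set from the lecture:

```r
# AirPassengers ships with base R (package "datasets") as a monthly ts object
passengers <- AirPassengers

# plot() on a ts object puts time on the x-axis and draws the values
# in chronological order
plot(passengers,
     xlab = "year", ylab = "passengers (thousands)",
     main = "Monthly airline passengers")
```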


Class exercise: find 2 data sets
1. Data that you can create a histogram for
2. Data that you can create a time series for

Plot both in R

• Lecture 11



Measures of Center of Data

mode: data value that occurs most frequently in the data
find it by looking at the data pt with the highest frequency
e.g. D = {1,2,2,3,3,3,3,4,5,8,8,8,8,8,9,10,11,2}
what’s the mode?

median: data value in the middle of a sorted list
find it by sorting the data and taking the middle value
e.g. D = {2,1,5,3,4} what’s the median?
How about D = {4,2,3,1}?

mean: arithmetic average of the numbers
find the mean of D = {1,2,3,4,5}
or D = {1,2,1,1,3,4,6}
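The three measures can be checked in R on the example sets above. Base R provides `median()` and `mean()`, but its `mode()` function reports a storage type rather than the statistical mode, so `stat_mode()` below is a small helper written only for this sketch:

```r
# Helper written for this sketch: value(s) with the highest frequency
stat_mode <- function(v) {
  counts <- table(v)                                 # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])   # keep the most frequent value(s)
}

D_mode <- c(1, 2, 2, 3, 3, 3, 3, 4, 5, 8, 8, 8, 8, 8, 9, 10, 11, 2)
print(stat_mode(D_mode))              # 8 (it occurs 5 times)

print(median(c(2, 1, 5, 3, 4)))       # 3: middle of 1 2 3 4 5
print(median(c(4, 2, 3, 1)))          # 2.5: average of the two middle values 2 and 3

print(mean(c(1, 2, 3, 4, 5)))         # 3
print(mean(c(1, 2, 1, 1, 3, 4, 6)))   # 18/7, about 2.57
```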



1. Find the mean, median and mode of the data:
D = {6.8, 8.2, 7.5, 9.4, 8.2}
weights of cats in lbs.

2. Find a data set that interests you and calculate the mean, median and mode

ex 1. Find the mean, median and mode of the data:
D = {6.8, 8.2, 7.5, 9.4, 8.2}
weights of cats in lbs.

variable x = weight of the cat

→ mean: (6.8 + 8.2 + 7.5 + 9.4 + 8.2)/5 = 40.1/5 = 8.02 lbs.

→ median: sort list
6.8 7.5 8.2 8.2 9.4
take middle value: median = 8.2 lbs.

→ mode: most freq’ly occurring value = 8.2 lbs.

ex 2. Find the mean, median and mode of the data:
D = {6.8, 8.2, 7.5, 9.4, 8.2, 6.3} #even number of pts
weights of cats in lbs.

→ median: sort list
6.3 6.8 7.5 8.2 8.2 9.4
take middle value

even # pts = no middle data value
so avg the 2 middle pts

median = (7.5+8.2)/2 = 7.85 lbs.

ex 3. Effect of Extreme Values on Mean and Median
D = {6.8, 7.5, 8.2, 8.2, 9.4, 22.1} #even number of pts
weights of cats in lbs.

note: mean > median

mean went from 8.02 to 10.37 but median stayed the same (8.2)
the mean is affected by extreme values; the median is not

fat cat brought the mean up! → outlier!
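A short R check of this effect, using the cat weights from ex 1 and ex 3 (the variable names are chosen here for illustration):

```r
cats         <- c(6.8, 8.2, 7.5, 9.4, 8.2)   # ex 1 weights (lbs)
cats_outlier <- c(cats, 22.1)                # ex 3: same cats plus the 22.1 lb cat

print(mean(cats))             # 8.02
print(median(cats))           # 8.2
print(mean(cats_outlier))     # about 10.37
print(median(cats_outlier))   # 8.2, unchanged by the outlier
```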


However, because the data are sampled, the mean is a more reliable (e.g. consistent) measure of the center of the data.

see different distributions of data below:

[Figure: three distributions, labeled "mean < median", "mean, median, mode all centered", and "mean > median"]

Average vs Weighted Average?
suppose you take 3 classes Spring 2018:
Math 141A (5 units), grade = A-
COSC 220 (3 units), grade = B
Math 80 (3 units), grade = C

what is the avg gpa for that semester?

method 1 avg: (3.7 + 3 + 2)/3 = 2.9

method 2 weighted avg: (5*3.7 + 3*3 + 3*2)/(5+3+3) = 3.0455

Discrepancy!
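Both methods can be reproduced in R. This sketch assumes the grade point values used on the slide (A- = 3.7, B = 3, C = 2) and uses base R's `weighted.mean()`:

```r
grade_points <- c(3.7, 3.0, 2.0)   # A-, B, C
units        <- c(5, 3, 3)         # Math 141A, COSC 220, Math 80

plain_avg    <- mean(grade_points)                       # method 1
weighted_avg <- weighted.mean(grade_points, w = units)   # method 2

# the same weighted average written out by hand
by_hand <- sum(units * grade_points) / sum(units)

print(plain_avg)               # 2.9
print(round(weighted_avg, 4))  # 3.0455
```

The weighted average is higher because the 5-unit class carries the best grade.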


Average vs Weighted Average?
suppose you take 3 classes Spring 2018:
Math 141A (3 units), grade = A-
COSC 220 (3 units), grade = B
Math 80 (3 units), grade = C

weighted avg gpa = (3*3.7 + 3*3 + 3*2)/(3+3+3) = 3(3.7+3+2)/(3(1+1+1)) = (3.7+3+2)/3 = avg gpa

note: if all weights are equal, avg = weighted average
why?
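One way to see why, for n data values that all share the same weight w (the symbols w, n and x_i are introduced here for the general case):

```latex
\[
\bar{x}_w
  = \frac{\sum_{i=1}^{n} w\,x_i}{\sum_{i=1}^{n} w}
  = \frac{w \sum_{i=1}^{n} x_i}{n\,w}
  = \frac{1}{n} \sum_{i=1}^{n} x_i
  = \bar{x}
\]
```

The common weight factors out of the numerator and cancels against the denominator, which is what happened with the factor 3 in the GPA example.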


ex. weighted average

Measures of Spread of Data

going back to cat example:
mean = avg weight = 8.02 lbs

were most of the weights close to this weight?
how far off were they?

range of data = highest value – lowest value = max val – min val

cat example: D = {6.8, 8.2, 7.5, 9.4, 8.2}
variable x = weight of a cat
mean = 8.02

range of data = 9.4-6.8 = 2.6

look at distance from data to mean: called deviation




sum all deviations

why sum of deviations = 0?
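A short derivation, writing the n data values as x_1, ..., x_n and the mean as x-bar (notation introduced here):

```latex
\[
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\,\bar{x}
  = n\,\bar{x} - n\,\bar{x}
  = 0,
\qquad \text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
\]
```

The positive and negative deviations always cancel, so their plain sum cannot measure spread.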



better: sum the squares of the deviations

avg total of squared deviations (variance): $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

note: the denominator n - 1 is 1 less than # data pts

standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

standard deviation: roughly the avg (mean) distance from a data pt. to the mean
how much a typ data pt differs from the mean.
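These steps can be traced in R on the cat weights. The sketch computes everything by hand and then compares with base R's `sd()`, which also divides by n - 1:

```r
x <- c(6.8, 8.2, 7.5, 9.4, 8.2)   # cat weights (lbs)
n <- length(x)

dev <- x - mean(x)       # deviations from the mean
print(sum(dev))          # 0 up to floating-point rounding

ss <- sum(dev^2)         # sum of squared deviations: 3.728
s2 <- ss / (n - 1)       # variance: 0.932
s  <- sqrt(s2)           # standard deviation: about 0.97 lbs

print(s)
print(sd(x))             # built-in sample standard deviation, same value
```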


n-1 is used due to degrees of freedom.
It makes the sample stdv a better approximation of the pop’n stdv.



• Lecture 12


Recall:
standard deviation: avg (mean) distance from a data pt. to the mean
how much a typ data pt differs from mean.

[Tables: squared deviations for training 1; squared deviations for training 2]

R code:

%r
x1 <- c(56,75,48,63,59)
x2 <- c(60,58,66,59,58)

# sample means
x1_bar = sum(x1)/length(x1)
x2_bar = sum(x2)/length(x2)

sprintf("mean of data set #1 = %2.2f", x1_bar)
sprintf("mean of data set #2 = %2.2f", x2_bar)

# sample standard deviations (divide by n-1)
sigma1 = sqrt(sum((x1-x1_bar)^2)/(length(x1)-1))
sigma2 = sqrt(sum((x2-x2_bar)^2)/(length(x2)-1))

sprintf("sigma1 = %2.2f", sigma1)
sprintf("sigma2 = %2.2f", sigma2)

output:
'mean of data set #1 = 60.20'
'mean of data set #2 = 60.20'
'sigma1 = 9.93'
'sigma2 = 3.35'



Ranking

A percentile is a measure of ranking

The kth percentile: data value that has k% of the data at or below that value

e.g. The median is the 50th percentile

If you are in the 90th percentile what does that mean? (no pun intended)



This means that 90% of the scores were at or below this score, so you did the same as or better than 90% of the test takers.


Quartiles: split the data into fourths

Interquartile Range (IQR): IQR = Q3-Q1

[Figure: typical box plot]

example:
Total assets (in billions of AUD) of Australian Banks (2012)

2855 2862 2861 2884 3014 2965

2971 3002 3032 2950 2967 2964

find the 5 number summary and interquartile range IQR

variable x = total assets of Austr. banks
sort the data

min = 2855 billion AUD
max = 3032 billion AUD

median = ?


sorted data:
2855 2861 2862 2884 2950 2964 2965 2967 2971 3002 3014 3032

median = (2964+2965)/2 = 2964.5 billion AUD

Q1? find median of 1st half of list

Q1 = (2862+2884)/2 = 2873 bill. AUD

Q3? find median of 2nd half of list

Q3 = (2971+3002)/2 = 2986.5 bill. AUD

five number summary (in billions of AUD):
min = 2855
Q1 = 2873
median = 2964.5
Q3 = 2986.5
max = 3032

IQR = Q3-Q1 = 2986.5-2873 = 113.5 billion AUD

→ middle 50% of assets were within 113.5 billion AUD of each other
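The median-of-halves method used above can be sketched in R. The variable names are chosen here for illustration; note that R's `quantile()` uses a different default rule and may report slightly different quartiles, while `fivenum()` agrees with the hand method for this data set:

```r
assets <- c(2855, 2862, 2861, 2884, 3014, 2965,
            2971, 3002, 3032, 2950, 2967, 2964)

xs <- sort(assets)
n  <- length(xs)

# split the sorted list into halves (the middle value is left out when n is odd)
lower <- xs[1:floor(n / 2)]
upper <- xs[(ceiling(n / 2) + 1):n]

five <- c(min = xs[1], Q1 = median(lower), median = median(xs),
          Q3 = median(upper), max = xs[n])
iqr <- five[["Q3"]] - five[["Q1"]]

print(five)             # 2855 2873 2964.5 2986.5 3032
print(iqr)              # 113.5
print(fivenum(assets))  # same five values for this data
```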



Box-and-Whiskers Plot (Box Plot)

Box-and-Whiskers Plot of Total Assets of Aust. Banks in 2012

distribution is skewed right bc right tail is longer
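A minimal sketch of drawing this box plot in R with the bank asset data from the example. R's `boxplot()` builds the box from hinges, which coincide with the median-of-halves quartiles for this data set; that agreement is a property of this data, not a general guarantee:

```r
assets <- c(2855, 2862, 2861, 2884, 3014, 2965,
            2971, 3002, 3032, 2950, 2967, 2964)

# the five values the box and whiskers are drawn from
# (no value here is far enough out to be flagged as an outlier)
stats <- boxplot.stats(assets)$stats
print(stats)   # 2855 2873 2964.5 2986.5 3032

boxplot(assets, horizontal = TRUE,
        xlab = "total assets (billions of AUD)",
        main = "Total Assets of Australian Banks in 2012")
```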


Create a Box-and-Whiskers Plot (Box Plot) for the following

ex. The life expectancy for a person living in one of 11 countries in a region of South East Asia in 2012 is given below

Find the 5 number summary of the data and the IQR, and draw a box-and-whiskers plot.

Starter:
variable x = life expectancy of a person
sort the list
calculate approp. medians to split the data into different quartiles
