Final Examination
• Thursday, April 30, 4:00 – 7:00
• Location: here, Hanes 120
• Suggested Study Strategy: Rework HW
• Out of Class Review Session?
• No, personal Q-A much better use of time
• Thus, instead offer extended Office Hours

Final Examination
Extended Office Hours:
• Monday, April 27, 8:00 – 11:00
• Tuesday, April 28, 12:00 – 2:30
• Wednesday, April 29, 1:00 – 5:00
• Thursday, April 30, 8:00 – 1:00
Last Time
• Comparing Scatterplots
• Measuring Strength of Relationship – Correlation
• Two Sample Inference
  – Paired Sampling
  – Independent Sampling
Reading In Textbook
Approximate Reading for Today’s Material:
Pages 110-135, 560-574
Approximate Reading for Next Class:
None, review only
2 Sample Measurement Error
Easy Case: Paired Differences
Have Treatment 1: $X_1, X_2, \ldots, X_n$
Treatment 2: $Y_1, Y_2, \ldots, Y_n$
Hard case: 2 different (unmatched) samples:
Treatment 1: $X_1, X_2, \ldots, X_{n_X}$
Treatment 2: $Y_1, Y_2, \ldots, Y_{n_Y}$
(sample sizes $n_X, n_Y$ different!)
2 Sample Measurement Error
Hard case: 2 different (unmatched) samples
Notes:
• There are several variations
• For Hypo. Testing, EXCEL works well
• Variations well labelled in TTEST
2 Sample Measurement Error
Hard case: 2 different (unmatched) samples
Main Ideas:
Data: $X_1, X_2, \ldots, X_{n_X}$ and $Y_1, Y_2, \ldots, Y_{n_Y}$
Sample Averages: $\bar{X} \sim N\left(\mu_X, \frac{\sigma_X}{\sqrt{n_X}}\right)$, $\bar{Y} \sim N\left(\mu_Y, \frac{\sigma_Y}{\sqrt{n_Y}}\right)$
2 Sample Measurement Error
Hard case: 2 different (unmatched) samples
Base inference on: $\bar{X} - \bar{Y}$
Probability Theory (can show):
$\bar{X} - \bar{Y} \sim N\left(\mu_X - \mu_Y,\ \sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}}\right)$
Assumptions:
• Xs & Ys Independent
• Otherwise based on Law of Averages
2 Sample Measurement Error
Step towards statistical inference: 2 sample Z statistic:
$Z = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}}} \sim N(0, 1)$
• Just do standardization (usual idea)
• Handle unknown s.d.s???
2 Sample Measurement Error
For unknown s.d.s, use the usual approximation: replace $\sigma_X, \sigma_Y$ with the sample s.d.s $s_X, s_Y$, for the 2 sample t statistic:
$t = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}}$
2 Sample Measurement Error
2 sample t statistic:
$t = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}}$
Probability Distribution:
• 2 sample version of t distribution
• Well modelled by EXCEL using TTEST
• Use this for Hypothesis Testing
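The unequal-variance t statistic above is easy to compute directly. A minimal Python sketch (standard library only; the two samples are made-up numbers, and the function name `welch_t` is ours, since this variation is also known as Welch's test):

```python
import math

def welch_t(xs, ys):
    """2 sample t statistic (unequal variance form):
    (Xbar - Ybar) / sqrt(sX^2/nX + sY^2/nY), testing H0: mu_X = mu_Y."""
    nx, ny = len(xs), len(ys)
    xbar = sum(xs) / nx
    ybar = sum(ys) / ny
    # sample variances (divide by n - 1)
    sx2 = sum((x - xbar) ** 2 for x in xs) / (nx - 1)
    sy2 = sum((y - ybar) ** 2 for y in ys) / (ny - 1)
    se = math.sqrt(sx2 / nx + sy2 / ny)
    return (xbar - ybar) / se

# hypothetical unmatched samples
t = welch_t([10, 12, 11, 13], [14, 15, 13, 16])
```

The P-value then comes from comparing t to the 2 sample version of the t distribution, which is what EXCEL's TTEST does internally.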
2 Sample Measurement Error
Variations on TTEST: Argument "Type"
1. Paired (simple case above)
2. Two sample, equal variance (studied below)
3. Two sample, unequal variance (version derived above)
2 Sample Measurement Error
Variations on TTEST: Argument "Type"
2. Two sample, equal variance
Main Idea: when $\sigma_X = \sigma_Y$:
• Can find an "improved estimate"
• By "pooling data"
• i.e. use a combined (pooled) s from the Xs & Ys
• Won't use in this class
2 Separate Samples
E.g. Old Textbook 7.32:
b. Do separate sample Hypo test,
Class Example 15, Part 3:
http://www.stat-or.unc.edu/webspace/courses/marron/UNCstor155-2009/ClassNotes/Stor155Eg15.xls
• Use type = 3 (don't know common variance)
• P-value = 3.95 x 10^-6
• Interpretation: very strong evidence (either yes-no or gray level)
2 Separate Samples
Suggested HW: 7.81, 7.82
2 Sample Hypo Testing
Comparison of Paired vs. Unmatched Cases
Notes:
• Can always use unmatched procedure
– Just ignore matching…
• Advantage to pairing???
2 Sample Hypo Testing
Comparison of Paired vs. Unmatched Cases
• Advantage to Pairing???
• Recall previous example:
Old Textbook 7.32
– Matched Paired P-value = 1.87 x 10^-5
– Unmatched P-value = 3.95 x 10^-6
• Unmatched better!?! (can happen)
2 Sample Hypo Testing
Comparison of Paired vs. Unmatched Cases
• Advantage to Pairing???
Happens when "variation of diff's", $\sigma_D$, is smaller than "full sample variation", i.e.
$\frac{\sigma_D}{\sqrt{n_D}} < \sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}}$
(whether this happens depends on data)
Paired vs. Unmatched Sampling
Class Example 29:
A new drug is being tested that should boost white blood cell count following chemotherapy. For a set of 4 patients, it was not administered (as a control) for the 1st round of chemotherapy, and then the new drug was tried after the 2nd round of chemotherapy. White blood cell counts were measured one week after each round of chemotherapy.
Paired vs. Unmatched Sampling
Class Example 29:
The resulting white blood cell counts were:
Patient   Without drug   With drug
1         33             35
2         26             27
3         36             39
4         28             30
Paired vs. Unmatched Sampling
Class Example 29:
Does the new drug seem to boost white blood cell counts well enough to be studied further?
• Seems to be some improvement
• But is it statistically significant?
• Only 4 patients…
Paired vs. Unmatched Sampling
Let: $\mu_X$ = Average Blood c'nts w/out drug
$\mu_Y$ = Average Blood c'nts with drug
Set up: $H_0: \mu_X = \mu_Y$ vs. $H_A: \mu_X < \mu_Y$
(want strong evidence of improvement)
Paired vs. Unmatched Sampling
Class Example 29:
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg29.xls
Results:
• Matched Pair P-val = 0.00813
– Very strong evidence of improvement
• Unmatched P-val = 0.295
– Not statistically significant

Paired vs. Unmatched Sampling
Conclusions:
• Paired Sampling can give better results
• When diff'ing reduces variation
• Often happens for careful matching
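The Example 29 comparison can be reproduced directly from the four patients' counts. A Python sketch (stdlib only; it computes just the two t statistics, while the P-values quoted in the results come from the corresponding t distributions):

```python
import math

no_drug = [33, 26, 36, 28]    # counts without drug (control round)
with_drug = [35, 27, 39, 30]  # counts with the new drug

def mean(v):
    return sum(v) / len(v)

def svar(v):
    """Sample variance (divide by n - 1)."""
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

# Paired: work with the differences, one sample of size n
diffs = [w - n for w, n in zip(with_drug, no_drug)]  # [2, 1, 3, 2]
t_paired = mean(diffs) / math.sqrt(svar(diffs) / len(diffs))

# Unmatched: ignore the pairing, 2 sample t statistic
t_unmatched = (mean(with_drug) - mean(no_drug)) / math.sqrt(
    svar(with_drug) / len(with_drug) + svar(no_drug) / len(no_drug))

# Pairing removes the patient-to-patient variation, so t_paired
# is much larger (about 4.90 vs. about 0.57)
```

The large gap between the two t statistics is exactly why the paired P-value (0.00813) is so much smaller than the unmatched one (0.295) here.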
Paired Sampling Visualization
2 Sample Proportions
In text Section 8.2:
• Skip this
• Ideas are only slight variation of above
• Basically mix & match of 2 sample ideas and proportion methods
• If you need it (later), pull out text
• Covered on exams to the extent it is in HW
Research Corner
Example of High Dimensional Visualization:
Microarray Analysis
For a biological tissue sample
(e.g. tumor from a cancer biopsy),
simultaneously measure "gene expression" (activity level)
over all human genes (~38,000)
Data set considered here:
Breast Cancer, ~2500 genes
Section 2.3: Linear Regression
Idea:
Fit a line to data in a scatterplot
• To learn about basic structure
• To model data
• To provide prediction of new values
Linear Regression
Recall some basic geometry:
A line is described by an equation:
y = mx + b
Really {(x, y) : y = mx + b}, "the set of all ordered pairs such that y = mx + b"
m = slope
b = y intercept
Varying m & b gives a family of lines, indexed by parameters m & b
Basics of Lines
Textbook's notation:
y = b0 + b1x = b1x + b0
b1 = m (above) = slope
b0 = b (above) = y-intercept
Basics of Lines
Suggested HW (to review line ideas):
C24: Fred keeps his savings in his mattress. He begins with $500 from his mother, and adds $100 each year. His total savings y, after x years, are given by the equation:
y = 500 + 100x
(a) Draw a graph of this equation.
(b) After 20 years, how much will Fred have? ($2500)
(c) If Fred adds $200 instead of $100 each year to his initial $500, what is the equation that describes his savings after x years? (y = 500 + 200x)
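Parts (b) and (c) of C24 amount to evaluating points on a line; a quick Python check (the function name `savings` is ours):

```python
def savings(x, rate=100, start=500):
    """Fred's savings after x years: y = start + rate * x."""
    return start + rate * x

# (b) after 20 years at $100/year: 500 + 100*20 = 2500
print(savings(20))            # 2500
# (c) at $200/year the line becomes y = 500 + 200x
print(savings(20, rate=200))  # 4500
```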
Linear Regression
Approach:
Given a scatterplot of data: $(x_1, y_1), \ldots, (x_n, y_n)$
Find b0 & b1 (i.e. choose a line) to best fit the data
Linear Regression - Approach
Given a line, y = b1x + b0, indexed by b0 & b1
Define residuals = data Y – Y on line
= $y_i - (b_1 x_i + b_0)$
Now choose b0 & b1 to make these "small"
(sketch: data points $(x_1, y_1), (x_2, y_2), (x_3, y_3)$ around a candidate line)
Linear Regression - Approach
Make Residuals > 0, by squaring
Least Squares: adjust b0 & b1 to minimize the Sum of Squared Errors:
$SSE = \sum_{i=1}^{n} \left( y_i - (b_1 x_i + b_0) \right)^2$
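The SSE criterion can be written out directly. A small Python sketch (the toy xs and ys are made-up points, roughly on the line y = 2x):

```python
def sse(xs, ys, b1, b0):
    """Sum of Squared Errors for the line y = b1*x + b0."""
    return sum((y - (b1 * x + b0)) ** 2 for x, y in zip(xs, ys))

# toy scatterplot (made-up points)
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

# Least squares picks the (b1, b0) pair with the smallest SSE;
# compare a sensible candidate line with a clearly worse one:
print(sse(xs, ys, 2.0, 0.0))  # small (points lie near y = 2x)
print(sse(xs, ys, 0.0, 5.0))  # much larger (flat line misses badly)
```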
Linear Regression - Approach
JAVA Demo, by David Lane at Rice U.:
http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html
• Applet gives us scatterplot with data (appears to be randomly generated)
• Try drawing lines (to min MSE)
• Experiment with intercepts, b0
• And slopes, b1
• Guess the correlation, r?

David Lane Demo Applet walkthrough:
• Raw data
• (Deliberately dumb) hand-drawn line: measure fit (bad) by Mean Square Error
• (Hopefully better) hand-drawn line: yes! Improved (smaller) MSE
• Best choice of b0? Try to vertically center, i.e. b0 = Avg(Ys)
• Manual attempt at b0 = Avg(Ys): as expected, improved MSE
• Next try to allow slope, while maintaining intercept (so go through center)
• Slope which direction? (apparent small downward trend?)
• Make an attempt: worse MSE!?! Perhaps too steep? Try less…
• Next attempt: improved MSE!
• Could try to fine tune more, but let's look at best possible next
• Our center point (intercept) was off (too high), but our slope looks pretty good
• Next try to guess correlation: answer
• Try another data set: clearly slopes upwards, apparently stronger correlation
Linear Regression - Approach
Make Residuals > 0, by squaring
Least Squares: adjust b0 & b1 to minimize the Sum of Squared Errors:
$SSE = \sum_{i=1}^{n} \left( y_i - (b_1 x_i + b_0) \right)^2$
(How to Compute?)
Least Squares
Can Show: (math beyond this course)
Least Squares Fit Line:
• Passes through the point $(\bar{x}, \bar{y})$
• Has Slope: $b_1 = r \frac{s_y}{s_x}$
(correction factor uses correlation, r)
(think r = 0, and r < 0)
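These two facts determine the fit line completely, which gives a direct way to compute it. A minimal Python sketch (the helper name `fit_line` is ours):

```python
import math

def fit_line(xs, ys):
    """Least squares fit via the formulas above:
    slope b1 = r * sy / sx, intercept from passing through (xbar, ybar)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    r = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b1 = r * sy / sx
    b0 = ybar - b1 * xbar  # line passes through (xbar, ybar)
    return b1, b0

# points exactly on y = 3x + 1 recover slope 3 and intercept 1
b1, b0 = fit_line([0, 1, 2, 3], [1, 4, 7, 10])
```

For perfectly linear data r = 1 and the slope is just sy/sx; with r = 0 the slope is 0 and the fit is the horizontal line through the average.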
Least Squares in Excel
• Could compute manually (using formulas for sX, sY & r)
• But EXCEL provides useful summaries:
– INTERCEPT (computes y-intercept b0)
– SLOPE (computes slope b1)
Suggested HW: 2.59 a, b

Least Squares in Excel
Additional trick: To draw overlay fit line (to existing data plot):
– Right click a data point
– Choose: "Add Trendline" from menu
Least Squares in Excel
Suggested HW: 2.59 c
And now for something completely different
Can you guess the phrase that these pictures intend to convey?
Requires: "Thinking Outside the Box"
Also Called: "Lateral Thinking"
Answers:
• Egg Plant
• Pool Table
• Hole Milk
• Tap Dancers
Linear Regression - Insight
Another Demo, by Charles Stanton, CSUSB:
http://www.math.csusb.edu/faculty/stanton/m262/regress
What University? California State University, San Bernardino
• Now we choose data
• Applet draws fit line
• Study quality of fit, using Residual Plot
Diagnostic for Linear Regression
Recall Normal Quantile plot shows "how well normal curve fits a data set"
A useful visual assessment of how well the regression line fits data is the:
Residual Plot
Just a plot of residuals (on Y axis), versus X's (on X axis)
Charles Stanton Demo Applet walkthrough:
• Add a point by clicking (no line yet…), and another: applet draws line, gives its equation, and plots residuals
• Now add another point (goal: very close to line): equation similar (but not exact); residuals now non-0, & magnify relative differences
• Now add more points along line: residuals magnify differences (note change of scale); major players clearly stand out
• Outliers have a drastic impact: poor fit to data along previous line (shows up clearly)
• Misfit shows up clearly, especially nonlinear relationships (even when hard to see in scatterplot)
Linear Regression - Insight
Another Demo, by Charles Stanton, CSUSB:
http://www.math.csusb.edu/faculty/stanton/m262/regress
• Now we choose data
• Applet draws fit line
• Study quality of fit, using Residual Plot
• Useful visual diagnostic (good at highlighting problems)
Residual Diagnostic Plot
Toy Examples:
http://www.stat-or.unc.edu/webspace/courses/marron/UNCstor155-2009/ClassNotes/Stor155Eg19.xls
1. Generate data to follow a line
• Residuals seem to be randomly distributed, with no apparent structure
• Suggests linear fit is a good model for data
2. Generate data to follow a parabola
• Shows systematic structure
• Pos. – Neg. – Pos. suggests data follow a curve (not linear)
• Suggests that line is a poor fit
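Toy example 2 can be reproduced in a few lines. A sketch (the parabola y = x² and its least squares line y = 4x − 2, worked out from the formulas above, are our own small example):

```python
def residuals(xs, ys, b1, b0):
    """Residuals: data y minus y on the line."""
    return [y - (b1 * x + b0) for x, y in zip(xs, ys)]

# data exactly on the parabola y = x^2, for x = 0..4
xs = list(range(5))
ys = [x ** 2 for x in xs]

# the least squares line for these points is y = 4x - 2
res = residuals(xs, ys, 4, -2)
signs = ["+" if r > 0 else "-" for r in res]
# Pos. - Neg. - Pos.: positive at the ends, negative in the middle,
# the systematic pattern that flags a curved relationship
print(signs)
```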
Residual Diagnostic Plot
Example from text: problem 2.74
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg15.xls
Study (for runners) how Stride Rate depends on Running Speed
(to run faster, need faster strides)
a. & b. Scatterplot & Fit line
c. & d. Residual Plot & Comment
Residual Diagnostic Plot E.g.
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg15.xls
a. & b. Scatterplot & Fit line
• Linear fit looks very good
• Backed up by correlation ≈ 1
• "Low noise" because data are averaged (over 21 runners)
Residual Diagnostic Plot E.g.
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg15.xls
c. & d. Residual Plot & Comment
• Systematic structure: Pos. – Neg. – Pos.
• Not random, but a systematic pattern
• Suggests line can be improved (as a model for these data)
• Residual plot provides a "zoomed in view" (can't see this in raw data)
Residual Diagnostic Plot
Suggested HW: 2.87
Effect of a Single Data Point
Suggested HW: 2.102
Least Squares Prediction
Idea: After finding b0 & b1 (i.e. the fit line),
for a new x, predict the new value of y using b1x + b0,
i.e. "predict by the point on the line"
Least Squares Prediction
EXCEL Prediction: revisit example:
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg14.xls
EXCEL offers two functions:
• TREND
• FORECAST
They work similarly; input raw x's and y's (careful about order!)
Least Squares Prediction
Caution: prediction outside range of data is called “extrapolation”
Dangerous, since small errors are magnified
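The predict-by-the-line idea, and why extrapolation is risky, can be sketched in Python (data and helper names are ours; TREND and FORECAST play the analogous role in EXCEL):

```python
def fit_line(xs, ys):
    """Least squares slope & intercept, minimal version."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    return b1, ybar - b1 * xbar

def predict(x_new, b1, b0):
    """Predict by the point on the line."""
    return b1 * x_new + b0

xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]  # made-up data, roughly y = 2x
b1, b0 = fit_line(xs, ys)

inside = predict(4.5, b1, b0)   # inside the data range: trustworthy
outside = predict(50, b1, b0)   # extrapolation: any small error in the
                                # slope is multiplied by the distance out
```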
Least Squares Prediction
Suggested HW:
2.67a, b,
2.75 (hint: use the Least Squares formula above, since you don't have raw data)
Interpretation of r squared
Recall correlation r measures "strength of linear relationship"
$r^2$ is the "fraction of variation explained by the line":
• $r^2 \approx 1$ for "good fit"
• $r^2 \approx 0$ for "very poor fit"
$r^2$ measures "signal to noise ratio"
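The quantity r² can be computed directly from the definition of correlation (a stdlib-only sketch; the test data, exactly on a line, are our own):

```python
def r_squared(xs, ys):
    """r^2, the fraction of variation explained by the least squares line.
    The (n-1) factors in r = cov/(sx*sy) cancel, leaving sxy^2/(sxx*syy)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# data exactly on the line y = 2x + 1: r^2 = 1 ("good fit")
print(r_squared([1, 2, 3, 4], [3, 5, 7, 9]))
```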
Interpretation of r squared
Revisit:
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg13.xls
(a, c, d) "data near line": high signal to noise ratio ($r^2$ near 1)
(b) "noisier data": low signal to noise ratio
(c) "almost pure noise": nearly no signal ($r^2$ near 0)
Interpretation of r squared
Suggested HW: 2.67 c
Statistical Inference
Show Excel Output