data analysis · p-value (observed significance level) • p-value - measure of the strength of...

Data Analysis Bijay Lal Pradhan, Ph.D. [email protected] http://bijaylalpradhan.com.np “ A Number speaks more than thousand words.”


Post on 25-Jun-2020


Data Analysis

Bijay Lal Pradhan, [email protected]

http://bijaylalpradhan.com.np

“ A Number speaks more than thousand words.”


Fundamentals of Hypothesis Testing

• There are two types of statistical inference: estimation and hypothesis testing

• Hypothesis Testing: A hypothesis is a claim (assumption)

about one or more population parameters.

– Average price of a lunch in Sauraha is μ = Rs 200

– The population mean monthly cell phone bill of this city

is: μ = Rs 125

– The average number of TV sets in homes is equal to three: μ = 3

Copyright @ Dr Bijay Lal Pradhan


• A hypothesis is always about a population parameter, not about a sample statistic

• Sample evidence is used to assess how plausible the claim about the population parameter is

A. It starts with the Null Hypothesis, H0

• We begin with the assumption that H0 is true and any

difference between the sample statistic and true population

parameter is due to chance and not a real (systematic)

difference.

• Always contains the “=”, “≤” or “≥” sign

• May or may not be rejected

H0: μ = 3 and X̄ = 2.79


B. Next we state the Alternative Hypothesis, H1

• Is the opposite of the null hypothesis

– e.g., The average number of TV sets in homes is not equal to 3 (H1: μ ≠ 3)

• Never contains the “=”, “≤” or “≥” sign

• May or may not be proven

• Is generally the hypothesis that the researcher is trying to prove. Evidence is always examined with respect to H1, never with respect to H0.

• We never “accept” H0, we either “reject” or “not reject” it


A. Rejection Region Method:

• Divide the distribution into rejection and non-rejection

regions

• Defines the unlikely values of the sample statistic if the

null hypothesis is true, the critical value(s)

– Defines rejection region of the sampling distribution

• The rejection region(s) is designated by α (the level of significance)

– Typical values are .01, .05, or .10

• α is selected by the researcher at the beginning

• α provides the critical value(s) of the test
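As a rough illustration of how α fixes the critical value(s), the sketch below assumes a Z test, so the sampling distribution is standard normal. The helper name `critical_values` and the `tail` labels are illustrative choices, not part of the slides.

```python
from statistics import NormalDist

def critical_values(alpha, tail="two"):
    """Return the z critical value(s) that bound the rejection region."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    if tail == "lower":
        return (z.inv_cdf(alpha),)           # all of alpha in the left tail
    if tail == "upper":
        return (z.inv_cdf(1 - alpha),)       # all of alpha in the right tail
    if tail == "two":
        half = alpha / 2                     # alpha is split between both tails
        return (z.inv_cdf(half), z.inv_cdf(1 - half))
    raise ValueError("tail must be 'lower', 'upper' or 'two'")

# The typical significance levels named on the slide
for alpha in (0.01, 0.05, 0.10):
    lo, hi = critical_values(alpha, "two")
    print(f"alpha={alpha:.2f}: reject H0 if Z < {lo:.3f} or Z > {hi:.3f}")
```

A smaller α pushes the critical values further into the tails, which shrinks the rejection region.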


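Rejection Region or Critical Value Approach:

Level of significance = α

• Lower-tail test: H0: μ ≥ 12, H1: μ < 12 (rejection region of area α in the lower tail)

• Upper-tail test: H0: μ ≤ 12, H1: μ > 12 (rejection region of area α in the upper tail)

• Two-tail test: H0: μ = 12, H1: μ ≠ 12 (rejection regions of area α/2 in each tail)

[Figure: three sampling distributions centered at 0. The rejection region is shaded, its boundary represents the critical value, and the remainder is the non-rejection region.]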


• P-Value Approach

• P-value = maximum probability of a Type I error, calculated from the sample.

• Given the sample information, what is the size of the blue area?

• Lower-tail test: H0: μ ≥ 12, H1: μ < 12

• Upper-tail test: H0: μ ≤ 12, H1: μ > 12

• Two-tail test: H0: μ = 12, H1: μ ≠ 12

[Figure: sampling distributions centered at 0 with the tail area beyond the observed statistic shown in blue]


Type I and II Errors:

• The size of α, the rejection region, affects the risk of making different types of incorrect decisions.

Type I Error

– Rejecting a true null hypothesis, one that should NOT be rejected

– Considered a serious type of error

– The probability of a Type I error is α

– α is also called the level of significance of the test

Type II Error

– Failing to reject a false null hypothesis, one that should have been rejected

– The probability of a Type II error is β


                 Truth
Decision         H0 true              H0 false
Retain H0        Correct retention    Type II error
Reject H0        Type I error         Correct rejection

α ≡ probability of a Type I error

β ≡ Probability of a Type II error

Two types of decision errors:

Type I error = erroneous rejection of true H0

Type II error = erroneous retention of false H0
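To make the meaning of α concrete, the following sketch simulates many samples from a population in which H0 is true and counts how often a two-tail Z test rejects it anyway. All numbers (μ0 = 12, σ = 0.5, n = 30, 20000 trials, the seed) are hypothetical and chosen only for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def type_i_error_rate(alpha=0.05, mu0=12.0, sigma=0.5, n=30, trials=20000, seed=1):
    """Estimate how often a two-tail z-test rejects H0 when H0 is actually true."""
    rng = random.Random(seed)                     # fixed seed so the run is repeatable
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tail critical value
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population where H0 (mu = mu0) really holds
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (sum(sample) / n - mu0) / (sigma / sqrt(n))
        if abs(z) > z_crit:                       # any rejection here is a Type I error
            rejections += 1
    return rejections / trials

rate = type_i_error_rate()
print(f"Estimated Type I error rate: {rate:.3f} (alpha = 0.05)")
```

The estimated rate should sit close to the chosen α, which is what "probability of a Type I error" means in the table above.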


• P-Value approach to Hypothesis Testing:

• That is to say, the p-value is the smallest value of α for which H0 can be rejected based on the sample information

• Convert the sample statistic (e.g., sample mean) to a test statistic (e.g., Z statistic)

• Obtain the p-value from a table or computer

• Compare the p-value with α

– If p-value < α, reject H0

– If p-value ≥ α, do not reject H0
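The comparison rule can be written as a one-line function. In this sketch the p-value 0.032 is a made-up number, used only to show how the same evidence leads to different decisions at different significance levels.

```python
def decide(p_value, alpha):
    """Apply the rule: reject H0 only when the p-value is below alpha."""
    return "reject H0" if p_value < alpha else "do not reject H0"

p_value = 0.032  # hypothetical p-value, not taken from the slides
for alpha in (0.01, 0.05, 0.10):
    print(f"alpha={alpha:.2f}: {decide(p_value, alpha)}")
```

With this p-value, H0 is rejected at α = .05 and .10 but not at .01.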


P-value (Observed Significance Level)

• P-value - Measure of the strength of evidence the sample data provides against the null hypothesis:

P(Evidence This strong or stronger against H0 | H0 is true)

P-val: P = P(Z ≥ z_obs)
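A minimal sketch of this calculation for a Z statistic follows. It assumes the test statistic is standard normal under H0; the function name `z_p_value` and the `tail` argument are illustrative, and the upper-tail branch corresponds to the probability of a value at least as large as z_obs.

```python
from statistics import NormalDist

def z_p_value(z_obs, tail="two"):
    """P(evidence this strong or stronger against H0 | H0 is true) for a Z test."""
    z = NormalDist()
    if tail == "lower":
        return z.cdf(z_obs)                   # area to the left of z_obs
    if tail == "upper":
        return 1 - z.cdf(z_obs)               # area to the right of z_obs
    if tail == "two":
        return 2 * (1 - z.cdf(abs(z_obs)))    # both tails beyond |z_obs|
    raise ValueError("tail must be 'lower', 'upper' or 'two'")

for z_obs in (1.0, 1.96, 2.58):
    print(f"z = {z_obs}: two-tail p-value = {z_p_value(z_obs):.4f}")
```

The further z_obs lies from 0, the smaller the p-value and the stronger the evidence against H0.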


Test of Hypothesis for the Mean

σ known. The test statistic is:

Z = (X̄ − μ) / (σ / √n)

σ unknown. The test statistic is:

t(n−1) = (X̄ − μ) / (S / √n)
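Both statistics can be sketched in a few lines. In the Z example, X̄ = 2.79 and μ = 3 come from the TV-set slide, while σ = 0.8 and n = 100 are assumed values that the slides do not give. The fill volumes in the t example are invented.

```python
from math import sqrt
from statistics import mean, stdev

def z_statistic(x_bar, mu0, sigma, n):
    """Z = (x_bar - mu0) / (sigma / sqrt(n)); used when sigma is known."""
    return (x_bar - mu0) / (sigma / sqrt(n))

def t_statistic(sample, mu0):
    """t with n-1 degrees of freedom; used when sigma is estimated by S."""
    n = len(sample)
    s = stdev(sample)  # sample standard deviation (divides by n - 1)
    return (mean(sample) - mu0) / (s / sqrt(n)), n - 1

# sigma = 0.8 and n = 100 are assumptions made for this illustration
z = z_statistic(x_bar=2.79, mu0=3, sigma=0.8, n=100)
print(f"Z = {z:.3f}")

# Hypothetical fill volumes in ounces, tested against mu0 = 12
t, df = t_statistic([11.8, 12.1, 11.9, 11.7, 12.0], mu0=12)
print(f"t = {t:.3f} with {df} degrees of freedom")
```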


Steps to Hypothesis Testing

1. State the H0 and H1 clearly

2. Identify the test statistic (two-tail, one-tail, and

type of test to be used)

3. Depending on the type of risk you are willing to take, specify the level of significance, α

4. Find the decision rule, critical values, and rejection regions. If –CV < actual value of the test statistic < +CV, then do not reject H0

5. Collect the data and do the calculation for the

actual values of the test statistic from the sample
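The steps above can be sketched for a two-tail Z test as follows. The hypotheses (H0: μ = 12, H1: μ ≠ 12) and every number passed in are hypothetical, and σ is assumed to be known.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_tail(x_bar, mu0, sigma, n, alpha=0.05):
    """Critical-value approach for H0: mu = mu0 against H1: mu != mu0."""
    # Steps 3 and 4: alpha fixes the critical values -CV and +CV
    cv = NormalDist().inv_cdf(1 - alpha / 2)
    # Step 5: compute the actual value of the test statistic from the sample
    z = (x_bar - mu0) / (sigma / sqrt(n))
    # Decision rule: do not reject H0 when -CV < z < +CV
    decision = "do not reject H0" if -cv < z < cv else "reject H0"
    return z, cv, decision

# Hypothetical sample summary: mean 11.85 from n = 36 with known sigma = 0.5
z, cv, decision = z_test_two_tail(x_bar=11.85, mu0=12, sigma=0.5, n=36)
print(f"z = {z:.2f}, CV = ±{cv:.2f}: {decision}")
```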


Steps to Hypothesis testing, continued

• Make the statistical decision

– Do not reject H0: conclude H0 may be true

– Reject H0: conclude H1 is “true” (there is sufficient evidence of H1)

• Make the management/business/administrative decision


When do we use a two-tail test? When do we use a one-tail test?

• The answer depends on the question you are trying to answer.

• A two-tail test is used when the researcher has no idea which direction the study will go and is interested in both directions.

• (Example: testing a new technique, a new product, or a new theory when we don’t know the direction.)

• A new machine is producing 12 fluid ounce cans of soft drink. The quality control manager is concerned with cans containing too much or too little, so the test is a two-tailed test. That is, the two rejection regions in the tails are the most likely (higher probability) to provide evidence of H1.

H0: μ = 12 oz

H1: μ ≠ 12 oz


• A one-tail test is used when the researcher is interested in one particular direction.

• Example: The soft-drink company puts a label on cans

claiming they contain 12 oz. A consumer advocate desires

to test this statement. She would assume that each can

contains at least 12 oz and tries to find evidence to the

contrary. That is, she examines the evidence for less than

12 oz.

• What tail of the distribution is the most logical (higher

probability) to find that evidence? The only way to reject

the claim is to get evidence of less than 12 oz, left tail.

H0: μ ≥ 12 oz

H1: μ < 12 oz

[Figure: sampling distribution with the rejection region in the left tail]


Type of Hypothesis

• What to test

• Significance of means

– Single mean test

– Double mean test
  • Dependent pairs
  • Independent pairs

– More than two mean test


Correlation and Regression


How do we measure association

between two variables?

1. For ordinal and nominal variables

• Odds Ratio (OR)

• Chi square test of independence of

attributes

2. For scale variables

• Correlation Coefficient R

• Coefficient of Determination (R-Square)


Example

• A researcher believes that there is a linear relationship between BMI (Kg/m2) of pregnant mothers and the birth-weight (BW in Kg) of their newborn

• The following data set provides information on 15 pregnant mothers who were contacted for this study


BMI (Kg/m2)    Birth-weight (Kg)
20             2.7
30             2.9
50             3.4
45             3.0
10             2.2
30             3.1
40             3.3
25             2.3
50             3.5
20             2.5
10             1.5
55             3.8
60             3.7
50             3.1
35             2.8


Scatter Diagram

• Scatter diagram is a graphical method to display the relationship between two variables

• Scatter diagram plots pairs of bivariate observations (x, y) on the X-Y plane

• Y is called the dependent variable

• X is called an independent variable


[Figure: Scatter diagram of BMI and Birthweight. Horizontal axis: BMI, 0 to 70. Vertical axis: birth-weight, 0 to 4.]


Is there a linear relationship

between BMI and BW?

• Scatter diagrams are important for

initial exploration of the relationship

between two quantitative variables

• In the above example, we may wish to

summarize this relationship by a

straight line drawn through the scatter

of points


Simple Linear Regression

• Although we could fit a line "by eye" e.g. using a transparent ruler, this would be a subjective approach and therefore unsatisfactory.

• An objective, and therefore better, way of determining the position of a straight line is to use the method of least squares.

• Using this method, we choose a line such that the sum of squares of vertical distances of all points from the line is minimized.


Least-squares or regression line

• These vertical distances, i.e., the distance

between y values and their corresponding

estimated values on the line are called

residuals

• The line which fits the best is called the

regression line or, sometimes, the least-squares line

• The line always passes through the point

defined by the mean of Y and the mean of

X
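The least-squares idea can be sketched directly from its definition. The 15 (BMI, BW) pairs below are read off the flattened data slide, which is an assumption about how that table splits into columns, so a fit on them need not reproduce the coefficients quoted elsewhere in the deck exactly. The function name `least_squares` is illustrative.

```python
# Assumed reading of the data slide: 15 (BMI, birth-weight) pairs
bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
bw = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]

def least_squares(x, y):
    """Return (intercept, slope) minimizing the sum of squared vertical distances."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar   # forces the line through (x_bar, y_bar)
    return intercept, slope

intercept, slope = least_squares(bmi, bw)

# Residuals: observed y minus the estimated value on the line
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(bmi, bw)]

print(f"fitted line: y = {intercept:.4f} + {slope:.4f} x")
print(f"sum of residuals: {sum(residuals):.2e}")
```

The residuals sum to zero up to rounding, which is another way of seeing that the line passes through the point of means.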


Linear Regression Model

• The method of least-squares is available in most of the statistical packages (and also on some calculators) and is usually referred to as linear regression

• Y is also known as an outcome variable

• X is also called a predictor


Estimated Regression Line

ŷ = β̂0 + β̂1 x = 1.775351 + 0.0330187 x

β̂0 = 1.775351 is called the y-intercept

β̂1 = 0.0330187 is called the slope


Application of Regression Line

This equation allows you to estimate BW

of other newborns when the BMI is

given.

e.g., for a mother who has BMI=40, i.e.

X = 40 we predict BW to be

ŷ = β̂0 + β̂1 x = 1.775351 + 0.0330187 (40) = 3.096
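The same prediction can be checked in code. The slope is taken as 0.0330187, the value that makes the slide's own result of 3.096 at BMI = 40 come out; the function name `predict_bw` is illustrative.

```python
INTERCEPT = 1.775351   # y-intercept reported on the slide
SLOPE = 0.0330187      # slope consistent with the slide's prediction at BMI = 40

def predict_bw(bmi):
    """Estimated birth-weight (Kg) for a given maternal BMI (Kg/m2)."""
    return INTERCEPT + SLOPE * bmi

print(round(predict_bw(40), 3))  # 3.096
```

Predictions are most trustworthy for BMI values inside the range covered by the sample.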


Correlation Coefficient, R

• R is a measure of strength of the linear association between two variables, x and y.

• Most statistical packages and some hand calculators can calculate R

• For the data in our Example R=0.94

• R has some unique characteristics


Correlation Coefficient, R

• R takes values between -1 and +1

• R = 0 represents no linear relationship between the two variables

• R > 0 implies a direct linear relationship

• R < 0 implies an inverse linear relationship

• The closer R comes to either +1 or -1, the stronger is the linear relationship
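These properties can be seen on tiny made-up data sets. The sketch below computes the Pearson correlation from its definition; the three y series are invented so that R comes out as +1, -1 and 0.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient computed from deviations about the means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # perfect direct linear relationship
print(pearson_r(x, [10, 8, 6, 4, 2]))   # perfect inverse linear relationship
print(pearson_r(x, [4, 1, 0, 1, 4]))    # exact curve y = (x - 3)^2, yet no linear trend
```

The last series follows an exact curve but still gives R = 0, because R only measures linear association.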


Coefficient of Determination

• R² is another important measure of linear association between x and y (0 ≤ R² ≤ 1)

• R² measures the proportion of the total variation in y which is explained by x

• For example, R² = 0.8751 indicates that 87.51% of the variation in BW is explained by the independent variable x (BMI).
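The "proportion of variation explained" reading can be checked numerically: R² equals 1 − SSE/SST for the least-squares line, and it equals the square of R. The data pairs are the assumed reading of the flattened data slide, so the value printed here may differ from the 0.8751 quoted above.

```python
from math import sqrt

# Assumed reading of the data slide: 15 (BMI, birth-weight) pairs
bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
bw = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]

n = len(bmi)
x_bar, y_bar = sum(bmi) / n, sum(bw) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(bmi, bw))
sxx = sum((x - x_bar) ** 2 for x in bmi)
sst = sum((y - y_bar) ** 2 for y in bw)          # total variation in y

slope = sxy / sxx
intercept = y_bar - slope * x_bar
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(bmi, bw))  # unexplained

r = sxy / sqrt(sxx * sst)       # correlation coefficient
r_squared = 1 - sse / sst       # proportion of variation explained by x

print(f"R = {r:.4f}, R^2 = {r_squared:.4f}, R*R = {r * r:.4f}")
```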


Difference between Correlation and

Regression

• Correlation Coefficient, R, measures

the strength of bivariate association

• The regression line is a prediction equation that estimates the values of

y for any given x


Limitations of the correlation

coefficient

• Though R measures how closely the two variables approximate a straight line, it does not validly measure the strength of a nonlinear relationship

• When the sample size, n, is small we also have to be careful with the reliability of the correlation

• Outliers could have a marked effect on R

• A strong R does not by itself establish a causal linear relationship


Regression Analysis

• Software generally gives the result like


Is the model significant?

• r² is the proportion of the variance in y that is explained by our regression model

• SE is another measure; significance is checked through the

• F-statistic:

F(df_ŷ, df_er) = s_ŷ² / s_er² = ... (complicated rearranging) ... = r² (n − 2) / (1 − r²)

• We should also know the significance of the regression coefficient:

t = b_yx / S.E.

If all of these are satisfied, then we can say the model is fit.
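For simple linear regression the F-statistic can be computed from r² and n alone, using the standard identity F = r²(n − 2)/(1 − r²) with 1 and n − 2 degrees of freedom. The sketch plugs in r² = 0.8751 and n = 15 from the BMI example; the function name `f_from_r2` is illustrative. It also uses the fact that, with a single predictor, the t statistic of the slope satisfies t² = F.

```python
from math import sqrt

def f_from_r2(r2, n):
    """F-statistic of a simple linear regression, on (1, n - 2) degrees of freedom."""
    return r2 * (n - 2) / (1 - r2)

r2, n = 0.8751, 15            # values quoted for the BMI and birth-weight example
f_stat = f_from_r2(r2, n)
t_slope = sqrt(f_stat)        # magnitude of t = b_yx / S.E. when there is one predictor

print(f"F(1, {n - 2}) = {f_stat:.2f}")
print(f"|t| for the slope = {t_slope:.2f}")
```

Both statistics are then compared with critical values, or converted to p-values, in the same way as in the hypothesis tests earlier in the deck.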


For further questions: [email protected] http://bijaylalpradhan.com.np
