bandit thinkhamrop, phd. (statistics) department of biostatistics and demography faculty of public...

Bandit Thinkhamrop, PhD (Statistics)
Department of Biostatistics and Demography
Faculty of Public Health
Khon Kaen University, THAILAND

Statistical inference revisited

Statistical inference uses data from samples to make inferences about a population.

1. Estimate the population parameter
   • Characterized by the confidence interval of the magnitude of the effect of interest

2. Test a hypothesis formulated before looking at the data
   • Characterized by the p-value

[Figure: a sample (n = 25, mean = 52, SD = 5) is drawn from the population; inference back to the population proceeds by parameter estimation (95% CI) and hypothesis testing (p-value).]

Parameter estimation

Sample: n = 25, mean = 52, SD = 5, SE = SD/√n = 5/√25 = 1

95% CI: 52 − 1.96(1) to 52 + 1.96(1) = 50.04 to 53.96

We are 95% confident that the population mean lies between 50.04 and 53.96.
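The interval above can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide: the summary statistics come from the example, and the variable names are illustrative.

```python
import math

# Summary statistics from the example (no raw data are given on the slide).
n, mean, sd = 25, 52.0, 5.0

se = sd / math.sqrt(n)      # standard error of the mean
z_crit = 1.96               # two-sided 95% critical value of the standard normal
ci_low = mean - z_crit * se
ci_high = mean + z_crit * se

print(f"SE = {se:.2f}, 95% CI = {ci_low:.2f} to {ci_high:.2f}")
```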

Critical values of Z: 2.58 (99% CI), 1.96 (95% CI), 1.64 (90% CI)

Hypothesis testing

Sample: n = 25, mean = 52, SD = 5, SE = 1

$H_0: \mu = 55$
$H_A: \mu \neq 55$

$Z = \frac{55 - 52}{1} = 3$

P-value = 1 − 0.9973 = 0.0027

If the true mean in the population is 55, the chance of obtaining a sample mean of 52 or more extreme is 0.0027.

[Figure: normal curve centered at 55 with the sample mean of 52 marked; the area beyond −3 SE and +3 SE is the p-value.]
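A minimal sketch of the same Z-test in Python, using only the standard library. The helper `normal_cdf` is an illustrative name, not part of the slides; it evaluates the standard normal distribution through the error function.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

sample_mean, mu0, se = 52.0, 55.0, 1.0

z = (sample_mean - mu0) / se                    # 3 SE below the hypothesized mean
p_two_sided = 2.0 * (1.0 - normal_cdf(abs(z)))  # area in both tails beyond |z|

print(f"z = {z:.2f}, two-sided p = {p_two_sided:.4f}")
```

The sign of z depends only on the order of subtraction; the two-sided p-value uses |z| and is the same either way.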

Calculation of the previous example based on the t-distribution

Stata command to find the probability:
. di (ttail(24, 3))*2
.00620574

Stata command to find the t value for the 95% confidence limits:
. di (invttail(24, 0.025))
2.0638986

Web-based statistical tables: http://vassarstats.net/tabs.html or www.stattrek.com

Revisit the example based on t-distribution (Stata output)

    Variable |     Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
             |      25          52           1        49.9361     54.0639

1. Estimate the population parameter

2. Test the hypothesis being formulated before looking at the data

One-sample t test
------------------------------------------------------------------------------
         |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |      25          52           1           5     49.9361     54.0639
------------------------------------------------------------------------------
    mean = mean(x)                                                t =  -3.0000
Ho: mean = 55                                    degrees of freedom =       24

    Ha: mean < 55               Ha: mean != 55                 Ha: mean > 55
 Pr(T < t) = 0.0031         Pr(|T| > |t|) = 0.0062          Pr(T > t) = 0.9969

Mean one group: T-test

1. Hypothesis
   $H_0: \mu = 0$
   $H_a: \mu \neq 0$

2. Data
   a: 1, 2, 2, 5, 5

3. Calculating for t-statistic
   $t = \frac{\bar{x} - 0}{SD/\sqrt{n}} = \frac{3}{1.87/\sqrt{5}} = 3.59$

4. Obtain p-value based on t-distribution
   P-value = 0.023
   Stata command:
   . di (ttail(4, 3.59))*2
   .02296182

5. Make a decision
   Reject the null hypothesis at the 0.05 level of significance. The mean of y is statistically significantly different from zero.
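The steps of the one-sample t-test can be sketched in Python with the five observations from the example. The critical value 2.776445 is the tabulated t value for 4 degrees of freedom at the two-sided 95% level; it is hard-coded here as an assumption, because the standard library has no t-distribution.

```python
import math
import statistics

a = [1, 2, 2, 5, 5]            # data from the example
n = len(a)

mean = statistics.mean(a)
sd = statistics.stdev(a)       # sample SD (divides by n - 1)
se = sd / math.sqrt(n)
t = (mean - 0) / se            # H0: population mean = 0
df = n - 1

t_crit = 2.776445              # assumed table value: t(0.975, df = 4)
ci_low = mean - t_crit * se
ci_high = mean + t_crit * se

print(f"mean = {mean}, SD = {sd:.6f}, SE = {se:.5f}, t = {t:.4f}, df = {df}")
print(f"95% CI = {ci_low:.4f} to {ci_high:.4f}")
```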

Mean one group: T-test (cont.)

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       y |       5           3      .83666    1.870829    .6770594    5.322941
------------------------------------------------------------------------------
    mean = mean(y)                                                t =   3.5857
Ho: mean = 0                                     degrees of freedom =        4

    Ha: mean < 0                 Ha: mean != 0                 Ha: mean > 0
 Pr(T < t) = 0.9885         Pr(|T| > |t|) = 0.0231          Pr(T > t) = 0.0115

One sample t-test using SPSS

Please do the data analysis using SPSS and paste the results here.

Comparing 2 means: T-test

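1. Hypothesis
   $H_0: \mu_A = \mu_B$
   $H_a: \mu_A \neq \mu_B$

2. Data
    a    b
    1    5
    2    9
    2    9
    5    8
    5    9

3. Calculating for t-statistic
   $t = \frac{\bar{x}_A - \bar{x}_B}{s_p\sqrt{1/n_A + 1/n_B}} = \frac{3 - 8}{1.14} = -4.39$, with 8 degrees of freedom

4. Obtain p-value based on t-distribution
   P-value = 0.002 (http://vassarstats.net/tabs.html)

5. Make a decision
   Reject the null hypothesis at the 0.05 level of significance. The mean of Group A is statistically significantly different from that of Group B.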

T-test

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       a |       5           3      .83666    1.870829    .6770594    5.322941
       b |       5           8    .7745967    1.732051    5.849375    10.15063
---------+--------------------------------------------------------------------
combined |      10         5.5    .9916317    3.135815    3.256773    7.743227
---------+--------------------------------------------------------------------
    diff |                  -5    1.140175               -7.629249   -2.370751
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =  -4.3853
Ho: diff = 0                                     degrees of freedom =        8

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0012         Pr(|T| > |t|) = 0.0023          Pr(T > t) = 0.9988
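The pooled-variance t statistic reported above can be reproduced from the raw data. The following is a sketch assuming equal variances, as in the Stata output; variable names are illustrative.

```python
import math
import statistics

a = [1, 2, 2, 5, 5]
b = [5, 9, 9, 8, 9]
na, nb = len(a), len(b)

var_a = statistics.variance(a)   # sample variance of group a
var_b = statistics.variance(b)   # sample variance of group b

# Pooled variance: weighted average of the two sample variances.
pooled_var = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
se_diff = math.sqrt(pooled_var * (1 / na + 1 / nb))

diff = statistics.mean(a) - statistics.mean(b)
t = diff / se_diff
df = na + nb - 2

print(f"diff = {diff}, SE = {se_diff:.6f}, t = {t:.4f}, df = {df}")
```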

Two independent sample t-test using SPSS


Mann-Whitney U test (Wilcoxon rank-sum test)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

       group |      obs    rank sum    expected
-------------+---------------------------------
           1 |        5          16        27.5
           2 |        5          39        27.5
-------------+---------------------------------
    combined |       10          55          55

unadjusted variance       22.92
adjustment for ties       -1.25
                     ----------
adjusted variance         21.67

Ho: y(group==1) = y(group==2)
             z =  -2.471
    Prob > |z| =   0.0135
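The rank sums, tie adjustment, and z statistic in this output can be traced by hand. The sketch below uses the standard normal-approximation formulas for the rank-sum test with a tie correction; `midranks` and `normal_cdf` are illustrative helper names, and this is not a description of Stata's internal code.

```python
import math
from collections import Counter

def midranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0          # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

a = [1, 2, 2, 5, 5]
b = [5, 9, 9, 8, 9]
n1, n2 = len(a), len(b)
N = n1 + n2

ranks = midranks(a + b)
rank_sum_a = sum(ranks[:n1])
rank_sum_b = sum(ranks[n1:])

expected = n1 * (N + 1) / 2.0
var_unadj = n1 * n2 * (N + 1) / 12.0
tie_adj = n1 * n2 * sum(t**3 - t for t in Counter(a + b).values()) / (12.0 * N * (N - 1))
var_adj = var_unadj - tie_adj

z = (rank_sum_a - expected) / math.sqrt(var_adj)
p = 2.0 * (1.0 - normal_cdf(abs(z)))

print(rank_sum_a, rank_sum_b, round(var_unadj, 2), round(tie_adj, 2))
print(f"z = {z:.3f}, p = {p:.4f}")
```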

Mann-Whitney U test (Wilcoxon rank-sum test) using SPSS


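Comparing 2 means: ANOVA

Mathematical model of ANOVA
$X = \mu + \tau + \varepsilon$
X = Grand mean + Treatment effect + Error
X = M + T + E

3. Calculating for F-statistic

4. Obtain p-value based on F-distribution

5. Make a decision
   Reject the null hypothesis at the 0.05 level of significance. The mean of Group A is statistically significantly different from that of Group B.

Decomposition of each observation: X = M + T + E
   Group means: 3 (A), 8 (B)
   M (grand mean): 5.5
   T (treatment effect, between groups): [3 − 5.5] for A, [8 − 5.5] for B; its sum of squares is SST
   E (error, within groups): observation minus its group mean; its sum of squares is SSE
   Degrees of freedom: M = 1, T = 1, E = 8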

ANOVA 2 groups

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups           62.5      1         62.5     19.23     0.0023
 Within groups             26      8         3.25
------------------------------------------------------------------------
    Total                88.5      9   9.83333333

Bartlett's test for equal variances:  chi2(1) =   0.0211  Prob>chi2 = 0.885

Comparing 3 means: ANOVA

1. Hypothesis
   $H_0: \mu_A = \mu_B = \mu_C$
   $H_a$: At least one mean is different

2. Data
    a    b    c
    1    5    4
    2    9    4
    2    9    6
    5    8    8
    5    9    4

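ANOVA 3 groups (cont.)

Mathematical model of ANOVA
$X = \mu + \tau + \varepsilon$
X = Grand mean + Treatment effect + Error
X = M + T + E

3. Calculating for F-statistic

4. Obtain p-value based on F-distribution

5. Make a decision
   Reject the null hypothesis at the 0.05 level of significance. At least one mean of the three groups is statistically significantly different from the others.

Decomposition of each observation: X = M + T + E
   Group means: 3 (A), 8 (B), 5.2 (C)
   M (grand mean): 5.4
   T (treatment effect, between groups): [3 − 5.4], [8 − 5.4], [5.2 − 5.4]; its sum of squares is SST
   E (error, within groups): observation minus its group mean; its sum of squares is SSE
   Degrees of freedom: X = 15, M = 1, T = 2, E = 12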

ANOVA 3 groups

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups           62.8      2         31.4      9.71     0.0031
 Within groups           38.8     12   3.23333333
------------------------------------------------------------------------
    Total               101.6     14   7.25714286

Bartlett's test for equal variances:  chi2(2) =   0.0217  Prob>chi2 = 0.989
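The sums of squares and F statistics in the ANOVA tables follow directly from the decomposition X = M + T + E. The sketch below computes them for both the two-group and the three-group examples; `one_way_anova` is an illustrative helper, and the p-value is left out because it needs the F-distribution, which the standard library does not provide.

```python
import statistics

def one_way_anova(groups):
    """Return (SS between, SS within, df between, df within, F)."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_values)
    # Between groups: treatment effect (group mean - grand mean), once per observation.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within groups: error (observation - its group mean).
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat

a = [1, 2, 2, 5, 5]
b = [5, 9, 9, 8, 9]
c = [4, 4, 6, 8, 4]

two = one_way_anova([a, b])
three = one_way_anova([a, b, c])

print("2 groups:", [round(v, 2) for v in two])
print("3 groups:", [round(v, 2) for v in three])
```

With two groups the F statistic equals the square of the pooled two-sample t statistic (4.3853² ≈ 19.23), which is why both tests give the same p-value of 0.0023.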

ANOVA 3 groups using SPSS


Kruskal-Wallis test

Kruskal-Wallis equality-of-populations rank test

  +------------------------+
  | group | Obs | Rank Sum |
  |-------+-----+----------|
  |     1 |   5 |    22.00 |
  |     2 |   5 |    61.50 |
  |     3 |   5 |    36.50 |
  +------------------------+

chi-squared =     7.985 with 2 d.f.
probability =     0.0185

chi-squared with ties =     8.190 with 2 d.f.
probability =     0.0167
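The Kruskal-Wallis statistic can be rebuilt from the rank sums. This sketch uses the textbook H formula and tie correction; `midranks` is an illustrative helper. Because the test has 2 degrees of freedom here, the chi-square upper-tail probability reduces to exp(−H/2), so no statistical library is needed.

```python
import math
from collections import Counter

def midranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

groups = [[1, 2, 2, 5, 5], [5, 9, 9, 8, 9], [4, 4, 6, 8, 4]]
pooled = [x for g in groups for x in g]
N = len(pooled)
ranks = midranks(pooled)

# Rank sum for each group, taken in the order the groups were pooled.
rank_sums, start = [], 0
for g in groups:
    rank_sums.append(sum(ranks[start:start + len(g)]))
    start += len(g)

h = 12.0 / (N * (N + 1)) * sum(r * r / len(g) for r, g in zip(rank_sums, groups)) - 3 * (N + 1)
tie_factor = 1.0 - sum(t**3 - t for t in Counter(pooled).values()) / (N**3 - N)
h_ties = h / tie_factor

# Valid only for df = 2: chi-square survival function is exp(-x / 2).
p = math.exp(-h / 2.0)
p_ties = math.exp(-h_ties / 2.0)

print(rank_sums)
print(f"H = {h:.3f} (p = {p:.4f}); with ties H = {h_ties:.3f} (p = {p_ties:.4f})")
```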

Kruskal-Wallis test using SPSS


Comparing 2 means: Regression

    a    b
    1    5
    2    9
    2    9
    5    8
    5    9

1. Data (group a coded as x = 1, group b coded as x = 2)

    x    y    (x − x̄)   (x − x̄)²   (y − ȳ)   (x − x̄)(y − ȳ)
    1    1     −0.5       0.25      −4.5         2.25
    1    2     −0.5       0.25      −3.5         1.75
    1    2     −0.5       0.25      −3.5         1.75
    1    5     −0.5       0.25      −0.5         0.25
    1    5     −0.5       0.25      −0.5         0.25
    2    5      0.5       0.25      −0.5        −0.25
    2    9      0.5       0.25       3.5         1.75
    2    9      0.5       0.25       3.5         1.75
    2    8      0.5       0.25       2.5         1.25
    2    9      0.5       0.25       3.5         1.75

 Mean   1.5   5.5
 Sum                      2.5                    12.5

y = a + bx, where b = 12.5/2.5 = 5; then 5.5 = a + 5(1.5), thus a = 5.5 − 7.5 = −2
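The slope and intercept can be verified with a short least-squares calculation. This is a sketch of the hand calculation in the table, with group a coded as x = 1 and group b as x = 2.

```python
x = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
y = [1, 2, 2, 5, 5, 5, 9, 9, 8, 9]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

sxx = sum((xi - mean_x) ** 2 for xi in x)                          # sum of (x - x̄)²
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))   # sum of cross-products

b = sxy / sxx             # slope: difference in mean y between x = 1 and x = 2
a = mean_y - b * mean_x   # intercept: the line passes through (x̄, ȳ)

print(f"Sxx = {sxx}, Sxy = {sxy}, y = {a} + {b}x")
print("predicted means:", a + b * 1, a + b * 2)
```

The predicted values at x = 1 and x = 2 are the two group means, 3 and 8, so the slope of 5 is the same mean difference tested by the two-sample t-test.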

Comparing 2 means: Regression (cont.)

[Figure: scatter plot of y (vertical axis, −2 to 10) against x, with group a plotted at x = 1 and group b at x = 2, and the fitted line y = a + bx.]

y = a + bx
y = −2 + 5x
   b = difference of y between x = 1 and x = 2
   a = value of y where the line crosses x = 0: y = −2 if x = 0
   y = 3 if x = 1
   y = 8 if x = 2
   The line passes through the means: y = 5.5 at x = 1.5

Regression model (2 means)

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =   19.23
       Model |       62.5     1        62.5            Prob > F      =  0.0023
    Residual |         26     8        3.25            R-squared     =  0.7062
-------------+------------------------------           Adj R-squared =  0.6695
       Total |       88.5     9  9.83333333            Root MSE      =  1.8028

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       group |          5   1.140175     4.39   0.002     2.370751    7.629249
       _cons |         -2   1.802776    -1.11   0.299    -6.157208    2.157208
------------------------------------------------------------------------------

Regression model (3 means)

i.group           _Igroup_1-3         (naturally coded; _Igroup_1 omitted)

      Source |       SS       df       MS              Number of obs =      15
-------------+------------------------------           F(  2,    12) =    9.71
       Model |       62.8     2        31.4            Prob > F      =  0.0031
    Residual |       38.8    12  3.23333333            R-squared     =  0.6181
-------------+------------------------------           Adj R-squared =  0.5545
       Total |      101.6    14  7.25714286            Root MSE      =  1.7981

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Igroup_2 |          5   1.137248     4.40   0.001     2.522149    7.477851
   _Igroup_3 |        2.2   1.137248     1.93   0.077    -.2778508    4.677851
       _cons |          3   .8041559     3.73   0.003     1.247895    4.752105
------------------------------------------------------------------------------

Correlation coefficient

• Pearson product moment correlation
  – Denoted by r (for the sample) or ρ (for the population)
  – Requires the bivariate normal distribution assumption
  – Requires a linear relationship

• Spearman rank correlation
  – For small samples; does not require the bivariate normal distribution assumption

Regression model using SPSS


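Pearson product moment correlation

$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$

or

$r = \frac{1}{n - 1} \sum \left(\frac{x - \bar{x}}{SD_x}\right)\left(\frac{y - \bar{y}}{SD_y}\right)$

In other words, it is the mean of the products of the standard scores (with n − 1 as the divisor).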

Scatter plot

[Figure: scatter plot of b (vertical axis, 0 to 10) against a (horizontal axis, 1 to 5) for the data below.]

    a    b
    1    5
    2    9
    2    9
    5    8
    5    9

Calculation for correlation coefficient (r)

    [1]    [2]    [3]            [4]            [3] × [4]
     x      y     (x − x̄)/SD     (y − ȳ)/SD
     1      5       −1.07          −1.73           1.85
     2      9       −0.53           0.58          −0.31
     2      9       −0.53           0.58          −0.31
     5      8        1.07           0.00           0.00
     5      9        1.07           0.58           0.62

 Sum                                               1.85
 Mean    3      8
 SD      1.87   1.73

r = 1.85/(n − 1) = 1.85/4 = 0.46
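The table can be reproduced by standardizing each variable and averaging the products of the standard scores with n − 1 as the divisor. A minimal sketch, with illustrative variable names:

```python
import statistics

x = [1, 2, 2, 5, 5]
y = [5, 9, 9, 8, 9]
n = len(x)

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
sd_x, sd_y = statistics.stdev(x), statistics.stdev(y)   # sample SDs (n - 1)

# Standard scores, columns [3] and [4] of the table.
zx = [(xi - mean_x) / sd_x for xi in x]
zy = [(yi - mean_y) / sd_y for yi in y]

products = [a * b for a, b in zip(zx, zy)]   # column [3] x [4]
r = sum(products) / (n - 1)

print([round(p, 2) for p in products])
print(f"sum = {sum(products):.2f}, r = {r:.4f}")
```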

Interpretation of correlation coefficient

Correlation    Negative            Positive
None           −0.09 to 0.00       0.00 to 0.09
Small          −0.30 to −0.10      0.10 to 0.30
Medium         −0.50 to −0.30      0.30 to 0.50
Strong         −1.00 to −0.50      0.50 to 1.00

These serve as a guide, not a strict rule. In fact, the interpretation of a correlation coefficient depends on the context and purposes.

From Wikipedia, the free encyclopedia

[Figure: several sets of (x, y) points with the correlation coefficient of each set.] The correlation coefficient reflects the strength and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). The figure in the center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero.

This is a file from the Wikimedia Commons.

Inference on correlation coefficient

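95% CI for r based on Fisher's z transformation: z = atanh(0.46) = 0.501, SE = 1/√(n − 3) = 1/√2 = 0.707, so the limits on the z scale are 0.501 ± 1.96(0.707) = −0.885 to 1.887. These are transformed back to the r scale with tanh.

Stata commands:
. di tanh(-0.885)
-.70891534
. di tanh(1.887)
.95511058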

Stata command

• ci2 x y, corr spearman

Confidence interval for Spearman's rank correlation of x and y, based on Fisher's transformation.

Correlation = 0.354 on 5 observations (95% CI: -0.768 to 0.942)

Warning: This method may not give valid results with small samples (n<= 10) for rank correlations.

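Inference on correlation coefficient

Test of $H_0: \rho = 0$: $t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} = 0.9$ with n − 2 = 3 degrees of freedom

Or use the Stata command:
. di (ttail(3, 0.9))*2
.43445103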
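Both pieces of inference on r, the Fisher-transformation confidence interval and the t statistic, can be sketched in Python from the five data pairs. The p-value itself is left to Stata's `ttail`, since the standard library has no t-distribution; variable names are illustrative.

```python
import math

x = [1, 2, 2, 5, 5]
y = [5, 9, 9, 8, 9]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)

# Fisher's z transformation: z is approximately normal with SE = 1 / sqrt(n - 3).
z = math.atanh(r)
se_z = 1.0 / math.sqrt(n - 3)
z_low, z_high = z - 1.96 * se_z, z + 1.96 * se_z
ci_low, ci_high = math.tanh(z_low), math.tanh(z_high)   # back to the r scale

# t statistic for H0: rho = 0, with n - 2 degrees of freedom.
t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

print(f"r = {r:.4f}, z limits = {z_low:.3f} to {z_high:.3f}")
print(f"95% CI for r = {ci_low:.3f} to {ci_high:.3f}, t = {t:.2f} on {n - 2} df")
```

With only five observations the interval is very wide, which matches the warning in the `ci2` output about small samples.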

Inference on correlation coefficient using SPSS


Inference on proportion

• One proportion
• Two proportions
• Three or more proportions

One proportion: Z-test

1. Hypothesis
   $H_0: \pi_1 = 0$
   $H_a: \pi_1 \neq 0$

2. Data
   y: 1, 0, 1, . . ., 0
   $n_y$ = 50, $p_y$ = 0.1

3. Calculating for z-statistic
   $z = \frac{p_y - 0}{\sqrt{p_y(1 - p_y)/n_y}} = \frac{0.1}{\sqrt{0.1(0.9)/50}} = 2.357$

4. Obtain p-value based on Z-distribution
   Stata command to get the p-value:
   . di (1-normal(2.357))*2
   .01842325

5. Make a decision
   Reject the null hypothesis at the 0.05 level of significance. The proportion of Y is statistically significantly different from zero.
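A short Python check of this one-proportion Z-test, using the sample-based standard error that reproduces z = 2.357. `normal_cdf` is an illustrative helper built on the error function.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, p_hat, p_null = 50, 0.1, 0.0

se = math.sqrt(p_hat * (1.0 - p_hat) / n)   # SE from the sample proportion
z = (p_hat - p_null) / se
p_value = 2.0 * (1.0 - normal_cdf(abs(z)))

print(f"SE = {se:.7f}, z = {z:.3f}, two-sided p = {p_value:.4f}")
```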

Comparing 2 proportions: Z-test

1. Hypothesis
   $H_0: \pi_1 = \pi_0$
   $H_a: \pi_1 \neq \pi_0$

2. Data
    x    y
    1    1
    1    0
    0    1
    .    .
    .    .
    .    .
    1    0

   $n_0$ = 50, $p_0$ = 0.1
   $n_1$ = 50, $p_1$ = 0.4

3. Calculating for z-statistic

4. Obtain p-value based on Z-distribution

5. Make a decision
   Reject the null hypothesis at the 0.05 level of significance. The proportions of Y in the two groups of x are statistically significantly different from each other.

              y
    x       0      1    Total
    0      45      5       50
    1      30     20       50
    Total  75     25      100

Z-test for two proportions

Two-sample test of proportions                     0: Number of obs =       50
                                                   1: Number of obs =       50
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           0 |         .1   .0424264                      .0168458    .1831542
           1 |         .4    .069282                      .2642097    .5357903
-------------+----------------------------------------------------------------
        diff |        -.3   .0812404                     -.4592282   -.1407718
             |  under Ho:   .0866025    -3.46   0.001
------------------------------------------------------------------------------
        diff = prop(0) - prop(1)                                  z =  -3.4641
    Ho: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.0003         Pr(|Z| > |z|) = 0.0005          Pr(Z > z) = 0.9997
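The two standard errors in this output serve different purposes: the unpooled one (.0812404) gives the confidence interval, and the pooled one under H0 (.0866025) gives the z statistic. A sketch of both in Python, with illustrative names:

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n0, p0 = 50, 0.1
n1, p1 = 50, 0.4
diff = p0 - p1

# Unpooled SE: used for the confidence interval of the difference.
se_unpooled = math.sqrt(p0 * (1 - p0) / n0 + p1 * (1 - p1) / n1)
ci_low = diff - 1.96 * se_unpooled
ci_high = diff + 1.96 * se_unpooled

# Pooled SE under H0 (both groups share one proportion): used for the test.
p_pooled = (n0 * p0 + n1 * p1) / (n0 + n1)
se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n0 + 1 / n1))
z = diff / se_pooled
p_value = 2.0 * (1.0 - normal_cdf(abs(z)))

print(f"diff = {diff:.1f}, 95% CI = {ci_low:.4f} to {ci_high:.4f}")
print(f"z = {z:.4f}, two-sided p = {p_value:.4f}")
```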

Comparing 2 proportions: Chi-square test

1. Hypothesis
   $H_0: \pi_{ij} = \pi_{i+}\pi_{+j}$, where i = 0, 1; j = 0, 1
   $H_a: \pi_{ij} \neq \pi_{i+}\pi_{+j}$

2. Data
              y
    x       0      1    Total
    0      45      5       50
    1      30     20       50
    Total  75     25      100

3. Calculating for $\chi^2$-statistic

     O     E                      (O − E)   (O − E)²   (O − E)²/E
    45     (75/100)50 = 37.50       7.50      56.25       1.50
     5     (25/100)50 = 12.50      −7.50      56.25       4.50
    30     (75/100)50 = 37.50      −7.50      56.25       1.50
    20     (25/100)50 = 12.50       7.50      56.25       4.50

    Chi-square (df = 1)                                   12.00

4. Obtain p-value based on chi-square distribution

5. Make a decision
   Reject the null hypothesis at the 0.05 level of significance. There is a statistically significant association between x and y.
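The expected counts and the chi-square statistic can be computed from the 2 × 2 table in a few lines. This is a sketch; for 1 degree of freedom the upper-tail probability of the chi-square distribution equals erfc(√(χ²/2)), which avoids the need for a statistical library.

```python
import math

# Observed counts: rows are x = 0, 1; columns are y = 0, 1.
observed = [[45, 5],
            [30, 20]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
expected = []
for i, row in enumerate(observed):
    expected_row = []
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total   # expected count under independence
        expected_row.append(e)
        chi2 += (o - e) ** 2 / e
    expected.append(expected_row)

p_value = math.erfc(math.sqrt(chi2 / 2.0))   # valid for df = 1 only

print(expected)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

For a 2 × 2 table the chi-square statistic is the square of the pooled two-proportion z statistic (3.4641² = 12.00), so the two tests agree.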

Comparing 2 proportions: Chi-square-test

           |           y
         x |         0          1 |     Total
-----------+----------------------+----------
         0 |        45          5 |        50
         1 |        30         20 |        50
-----------+----------------------+----------
     Total |        75         25 |       100

          Pearson chi2(1) =  12.0000   Pr = 0.001

. csi 20 5 30 45, or exact

                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
           Cases |        20           5  |         25
        Noncases |        30          45  |         75
-----------------+------------------------+------------
           Total |        50          50  |        100
                 |                        |
            Risk |        .4          .1  |        .25
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |               .3       |    .1407718    .4592282
      Risk ratio |                4       |     1.62926    9.820408
 Attr. frac. ex. |              .75       |    .3862245    .8981712
 Attr. frac. pop |               .6       |
      Odds ratio |                6       |    2.086602    17.09265 (Cornfield)
                 +-------------------------------------------------
                               1-sided Fisher's exact P = 0.0005
                               2-sided Fisher's exact P = 0.0010

Binomial regression

. binreg y x, rr

Generalized linear models                          No. of obs      =       100
Optimization     : MQL Fisher scoring              Residual df     =        98
                   (IRLS EIM)                      Scale parameter =         1
Deviance         =  99.80946404                    (1/df) Deviance =  1.018464
Pearson          =  99.99966753                    (1/df) Pearson  =  1.020405

Variance function: V(u) = u*(1-u)                  [Bernoulli]
Link function    : g(u) = ln(u)                    [Log]

                                                   BIC             = -351.4972

------------------------------------------------------------------------------
             |                 EIM
           y | Risk Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |          4   1.833024     3.03   0.002     1.629265    9.820377
       _cons |         .1   .0424262    -5.43   0.000     .0435379    .2296851
------------------------------------------------------------------------------

Logistic regression

. logistic y x

Logistic regression                               Number of obs   =        100
                                                  LR chi2(1)      =      12.66
                                                  Prob > chi2     =     0.0004
Log likelihood = -49.904732                       Pseudo R2       =     0.1125

------------------------------------------------------------------------------
           y | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |          6   3.316625     3.24   0.001     2.030635    17.72844
       _cons |   .1111111   .0523783    -4.66   0.000      .044106    .2799096
------------------------------------------------------------------------------
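The risk ratio from `binreg` and the odds ratio from `logistic` can be recovered from the 2 × 2 counts, together with Wald confidence intervals computed on the log scale. This is a sketch using the usual large-sample standard errors; the cell labels a, b, c, d are illustrative. The odds-ratio interval matches the logistic regression output rather than the Cornfield interval printed by `csi`, which uses a different method.

```python
import math

# 2 x 2 counts from the example.
a, b = 20, 5     # cases: exposed, unexposed
c, d = 30, 45    # noncases: exposed, unexposed
n1, n0 = a + c, b + d

risk1, risk0 = a / n1, b / n0
risk_diff = risk1 - risk0
risk_ratio = risk1 / risk0
odds_ratio = (a * d) / (b * c)

z = 1.959964     # two-sided 95% critical value of the standard normal

# Large-sample standard errors on the log scale.
se_ln_rr = math.sqrt(1 / a - 1 / n1 + 1 / b - 1 / n0)
se_ln_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

rr_ci = (math.exp(math.log(risk_ratio) - z * se_ln_rr),
         math.exp(math.log(risk_ratio) + z * se_ln_rr))
or_ci = (math.exp(math.log(odds_ratio) - z * se_ln_or),
         math.exp(math.log(odds_ratio) + z * se_ln_or))

print(f"RD = {risk_diff:.2f}")
print(f"RR = {risk_ratio:.2f}, 95% CI = {rr_ci[0]:.3f} to {rr_ci[1]:.3f}")
print(f"OR = {odds_ratio:.2f}, 95% CI = {or_ci[0]:.3f} to {or_ci[1]:.3f}")
```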
