
Page 1: Nonparametric Statistical Methods

1

Nonparametric Statistical Methods

Presented by Xiaojin Dong, Owen Gu, Sohail Khan, Hao Miao, Shaolan Xiang, Chunmin Han, Yinjin Wu, Jiayue Zhang, Yuanhao Zhang

Page 2: Nonparametric Statistical Methods

2

Outline
1. Wilcoxon signed rank test
2. Wilcoxon rank sum test
3. Kolmogorov-Smirnov test
4. Kruskal-Wallis test
5. Kendall's correlation coefficient
6. Spearman's rank correlation coefficient

Page 3: Nonparametric Statistical Methods

3

When to use nonparametric methods?
- The population is not normal.
- The sample size is very small.
- Some data are ordinal.
- The samples may be dependent or independent.

What to estimate? The population median. The median is a better measure of central tendency for non-normal populations, e.g. skewed distributions.

Page 4: Nonparametric Statistical Methods

4

1. Wilcoxon signed rank test

Page 5: Nonparametric Statistical Methods

Inventors

Frank Wilcoxon (1892-1965)

Henry Berthold Mann (1905-2000), Ohio State University; he was the dissertation advisor of Whitney.

Donald Ransom Whitney

Page 6: Nonparametric Statistical Methods

6

Hypothesis testing: H0: the population median equals μ0 vs. Ha: the population median differs from μ0 (or a one-sided alternative).
Compute the differences di = xi − μ0 and rank them in terms of absolute values; let ri be the rank of |di|.
Statistics:
  w+ = sum of the ranks of the positive differences
  w− = sum of the ranks of the negative differences
Rejection region: reject H0 if w+ is large or, equivalently, if w− is small, i.e. if w+ is at least the upper critical value from the signed rank table.

Page 7: Nonparametric Statistical Methods

7

Large sample approximation: H0: the population median equals μ0 vs. Ha: the population median differs from μ0 (or a one-sided alternative).
For large sample size n, under H0,
  E(w+) = n(n+1)/4 and Var(w+) = n(n+1)(2n+1)/24,
so z = [w+ − n(n+1)/4] / sqrt[n(n+1)(2n+1)/24] is approximately standard normal.

Rejection region: reject H0 if z ≥ zα for the one-sided test, or if |z| ≥ zα/2 for the two-sided test.

Page 8: Nonparametric Statistical Methods

8

Intuition and assumptions
If the positive differences are larger than the negative differences, they receive higher ranks and contribute to a larger value of w+; likewise, larger negative differences contribute to a larger value of w−.

Assumption: the population must be symmetric.

Reason: under the null hypothesis, a right-skewed population tends to produce a large w+ and a left-skewed population tends to produce a large w−, which would distort the test.

Page 9: Nonparametric Statistical Methods

9

Example 1.1: Test whether the median thermostat setting differs from the design setting of 200, using the thermostat setting data.

H0: the population median equals 200 vs. Ha: the population median differs from 200.

From the ranks below, w+ = 6+8+1+7+10+9+2+4 = 47 and w− = 5+3 = 8.

Conclusion: the population median differs from the design setting of 200 at α = .05 (the exact two-sided P-value is .048; see the SAS output below).

Table 1.1 Thermostat setting data

x     202.2 203.4 200.5 202.5 206.3 198.0 203.7 200.8 201.3 199.0
diff.   2.2   3.4   0.5   2.5   6.3  -2.0   3.7   0.8   1.3  -1.0
rank      6     8     1     7    10     5     9     2     4     3

Page 10: Nonparametric Statistical Methods

10

SAS code
DATA thermostat;
  INPUT temp;
  datalines;
202.2
203.4
…
;
PROC UNIVARIATE DATA=thermostat loccount mu0=200;
  TITLE "Wilcoxon signed rank test for the thermostat data";
  VAR temp;
RUN;

Page 11: Nonparametric Statistical Methods

11

SAS outputs (selected results)

Basic Statistical Measures

    Location                  Variability
Mean     201.7700    Std Deviation          2.41019
Median   201.7500    Variance               5.80900
Mode     .           Range                  8.30000
                     Interquartile Range    2.90000

Tests for Location: Mu0=200

Test           -Statistic-       -----p Value------
Student's t    t   2.322323      Pr > |t|     0.0453
Sign           M   3             Pr >= |M|    0.1094
Signed Rank    S   19.5          Pr >= |S|    0.048
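As a cross-check that is not part of the original slides, the same test can be sketched with Python's scipy.stats.wilcoxon (assuming SciPy is available); the reported statistic follows SciPy's convention rather than SAS's S, but the exact two-sided p-value agrees with the 0.048 above.

# Hypothetical SciPy cross-check of Example 1.1 (not from the slides).
from scipy import stats

temp = [202.2, 203.4, 200.5, 202.5, 206.3, 198.0, 203.7, 200.8, 201.3, 199.0]
diffs = [t - 200.0 for t in temp]              # differences from mu0 = 200

# Exact two-sided Wilcoxon signed rank test of H0: median = 200
res = stats.wilcoxon(diffs, alternative="two-sided")
print(res.statistic, res.pvalue)               # p-value is about 0.049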

Page 12: Nonparametric Statistical Methods

12

2. Wilcoxon rank sum test

Page 13: Nonparametric Statistical Methods

13

Wilcoxon rank sum test: introduction
The Wilcoxon-Mann-Whitney test is also called the Wilcoxon rank sum test and the Mann-Whitney U test.

It was proposed initially by the Irish-born US statistician Frank Wilcoxon in 1945, for equal sample sizes, and extended to arbitrary sample sizes and in other ways by the Austrian-born US mathematician Henry Berthold Mann and the US statistician Donald Ransom Whitney in 1947.

Page 14: Nonparametric Statistical Methods

14

Where to use
When we analyze two independent samples, there are times when the assumptions for using a t-test are not met (the data are not normally distributed, the sample size is small, etc.), or the data values may only represent ordered categories.

If the sample sizes are moderate, if there is a question concerning the distributions, or if the data are really ordinal, use the Wilcoxon rank sum test.

This test only assumes:
1. All the observations from both groups are independent of each other.
2. The responses are ordinal or continuous measurements.
3. The distributions of the two groups have similar shape.

Page 15: Nonparametric Statistical Methods

15

Calculation steps
H0: F1 = F2 (the distributions of both groups are equal) vs. Ha: F1 < F2 or Ha: F1 > F2 (one r.v. is stochastically larger than the other).

1. Put all the data from both groups in increasing order (with special provision for ties), retaining the group identity.
2. Compute the sum of ranks for each group and denote the sums by w1 and w2.
3. Compute u1 = w1 - n1(n1+1)/2 and u2 = w2 - n2(n2+1)/2.
4. Look up Table A.11 to decide whether to reject H0 at significance level α and to obtain the P-value.

Page 16: Nonparametric Statistical Methods

16

Special treatment

For large samples: when n1 > 10 and n2 > 10, U is approximately normally distributed with parameters µ = n1n2/2 and σ2 = n1n2(N+1)/12, where N = n1 + n2. Therefore a large sample z-test can be based on the statistic

  z = (u1 − n1n2/2) / sqrt[n1n2(N+1)/12].

For ties: use the midrank when an observation from one group is equal to an observation from the other group.

Page 17: Nonparametric Statistical Methods

17

Example
To determine whether the order of questions has a significant impact on students' performance in an exam, 20 students were randomly divided into two equal groups, A and B. Everyone answered an exam paper containing the same questions, but the questions were ordered from easy to hard for group A and from hard to easy for group B. The scores are as follows.

A: 83, 82, 84, 96, 90, 64, 91, 71, 75, 72
B: 42, 61, 52, 78, 69, 81, 75, 78, 78, 65

Page 18: Nonparametric Statistical Methods

18

Solution
H0: F1 = F2 vs. Ha: F1 > F2. Rank the scores of both groups in ascending order.

score  rank  group  |  score  rank  group
  42     1     B    |    78    12     B
  52     2     B    |    78    12     B
  61     3     B    |    78    12     B
  64     4     A    |    81    14     B
  65     5     B    |    82    15     A
  69     6     B    |    83    16     A
  71     7     A    |    84    17     A
  72     8     A    |    90    18     A
  75     9.5   A    |    91    19     A
  75     9.5   B    |    96    20     A

Page 19: Nonparametric Statistical Methods

19

The rank sums are
  w1 = 4+7+8+9.5+15+16+17+18+19+20 = 133.5
  w2 = 1+2+3+5+6+9.5+12+12+12+14 = 76.5
Therefore
  u1 = w1 - n1(n1+1)/2 = 133.5 - 10*11/2 = 78.5
  u2 = w2 - n2(n2+1)/2 = 76.5 - 10*11/2 = 21.5
Check that u1 + u2 = n1*n2 = 100.
From Table A.11 we find that the P-value is between 0.012 and 0.026.
To compare this with the large sample normal approximation, calculate (with a continuity correction of 1/2)

  z = (u1 − n1n2/2 − 1/2) / sqrt[n1n2(N+1)/12] = (78.5 − 50 − 0.5) / sqrt(100·21/12) = 2.12,

which yields the P-value ≈ Φ(-2.12) = 0.0170.

Page 20: Nonparametric Statistical Methods

20

SAS code and output
Data exam;
  Input group $ score @@;
Datalines;
A 83 A 82 A 84 A 96 A 90 A 64 A 91 A 71 A 75 A 72
B 42 B 61 B 52 B 78 B 69 B 81 B 75 B 78 B 78 B 65
;
Proc npar1way data=exam wilcoxon;
  Var score;
  Class group;
  exact wilcoxon;
Run;
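For comparison (not part of the original slides), a rough equivalent of this PROC NPAR1WAY step can be sketched with scipy.stats.mannwhitneyu; its statistic is the Mann-Whitney U for the first sample, i.e. u1 above, and because of the ties its p-value comes from a normal approximation.

# Hypothetical SciPy cross-check of the exam-scores example (not from the slides).
from scipy import stats

A = [83, 82, 84, 96, 90, 64, 91, 71, 75, 72]
B = [42, 61, 52, 78, 69, 81, 75, 78, 78, 65]

# One-sided test of Ha: scores in group A tend to be larger than in group B
res = stats.mannwhitneyu(A, B, alternative="greater")
print(res.statistic, res.pvalue)   # statistic = u1 = 78.5; p close to 0.017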

Page 21: Nonparametric Statistical Methods
Page 22: Nonparametric Statistical Methods

22

3. Kolmogorov-Smirnov Test

Page 23: Nonparametric Statistical Methods

"Every mathematician believes he is ahead over all others. The reason why they don't say this in public, is because they are intelligent people"

Andrey Kolmogorov (25 April 1903 – 20 October 1987)

Russian mathematician. Major advances in the fields of probability theory, topology, turbulence, classical mechanics and computational complexity. Gained international recognition in 1922 for constructing a Fourier series that diverges almost everywhere.

Page 24: Nonparametric Statistical Methods

Vladimir Ivanovich Smirnov (June 10, 1887 – February 11, 1974)

Significant contributions in both pure and applied mathematics, as well as the history of mathematics. Author of the five-volume book A Course in Higher Mathematics.

Page 25: Nonparametric Statistical Methods

25

KS-Test
- Tries to determine whether two datasets differ significantly by comparing their distributions.
- Makes no assumption about the distribution of the data.
- This generality comes at some cost: parametric tests, e.g. the t-test, may be more sensitive if the data meet the requirements of the test.

Page 26: Nonparametric Statistical Methods

26

Types of KS-tests
- One sample: sample vs. a reference probability distribution, e.g. a test for normality (empirical CDF vs. the standard normal CDF).
- Two sample: test whether two samples come from the same distribution.

Page 27: Nonparametric Statistical Methods

27

The one sample Kolmogorov-Smirnov (K-S) test is based on the empirical distribution function (ECDF). Given N ordered data points Y1, Y2, ..., YN, the ECDF is defined as

  E_N(Yi) = n(i)/N,

where n(i) is the number of points less than Yi. This is a step function that increases by 1/N at the value of each data point. We can graph the empirical distribution function together with the cumulative distribution function of a given distribution. The one sample K-S test is based on the maximum distance between these two curves, that is,

  D = sup over x of |E_N(x) − F(x)|,

where F is the theoretical cumulative distribution function.
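As an illustration of this definition (not from the slides), the statistic D can be computed directly from the ECDF; the helper below is a hypothetical sketch that assumes NumPy and SciPy are available.

# Hypothetical sketch: the one-sample K-S statistic straight from its definition.
import numpy as np
from scipy.stats import norm

def ks_statistic(sample, cdf=norm.cdf):
    """Maximum vertical distance between the ECDF of `sample` and a reference CDF."""
    y = np.sort(np.asarray(sample, dtype=float))
    n = len(y)
    ecdf_after = np.arange(1, n + 1) / n    # ECDF value just after each data point
    ecdf_before = np.arange(0, n) / n       # ECDF value just before each data point
    f = cdf(y)                              # reference CDF evaluated at the data points
    return max(np.max(ecdf_after - f), np.max(f - ecdf_before))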

Page 28: Nonparametric Statistical Methods

28

The two sample K-S test is a variation of this. It compares two empirical distribution functions,

  D = sup over x of |E1(x) − E2(x)|,

where E1 and E2 are the empirical distribution functions for the two samples.

More formally, the Kolmogorov-Smirnov two sample test can be stated as follows.
H0: The two samples come from a common distribution.
Ha: The two samples do not come from a common distribution.

Page 29: Nonparametric Statistical Methods

29

Test statistic: the Kolmogorov-Smirnov two sample test statistic is defined as

  D = sup over x of |E1(x) − E2(x)|,

where E1 and E2 are the empirical distribution functions for the two samples.

Critical region: the hypothesis regarding the distributional form is rejected if the test statistic D is greater than the critical value obtained from a table at significance level α.

The quantile-quantile plot, bihistogram, and Tukey mean-difference plot are graphical alternatives to the two sample K-S test.

Page 30: Nonparametric Statistical Methods

30

Application of the Kolmogorov-Smirnov test
The K-S goodness-of-fit test can be applied in the case of both one sample and two samples.

In the one-sample test, we compare the empirical distribution function of the sample data with the cumulative distribution function of the reference distribution, to determine whether the sample is drawn from the reference distribution, such as the standard normal, lognormal or exponential distribution.

In the two-sample test, we compare the empirical distribution functions of two sets of data to determine whether they come from the same distribution.

The following slides exemplify the application of the test in both cases using MATLAB.

Page 31: Nonparametric Statistical Methods

31

one sample K-S test• Hypothesis Testing: H0: The sample data follows the standard normal distribution (μ=0, σ2=1). Ha: The data does not follow the standard normal distribution.

• The sample data, extracted from the daily percentage change in the share price of company XXX, Inc. for the past 19 days, is listed as follows in an ascending order: -4.0% -3.5% -3.0% -2.5% -2.0% -1.5% -1.0% -0.5% 0.0% 0.5% 1.0% 1.5% 2.0% 2.5% 3.0% 3.5% 4.0% 4.5% 5.0%

• The test statistic is: max(|Fx-Gx|), where Fx is the empirical cdf and Gx is the standard normal cdf .

Page 32: Nonparametric Statistical Methods

32

one sample K-S test (cont'd)
• The MATLAB language syntax for the test is:
  x = -4:0.5:5;
  [h,p,k,c] = kstest(x, [], alpha, type)
where
(1) x is the sample data set, and the values increase from -4 to 5 in an even increment of 0.5;
(2) [] means the standard normal distribution is used;
(3) alpha is a double and represents the level of significance;
(4) type is a string and specifies whether the type of test for the alternative hypothesis is 'unequal', 'larger' or 'smaller', meaning whether the empirical cdf and the cdf of the specified distribution are unequal, the empirical cdf is larger, or the empirical cdf is smaller;

Page 33: Nonparametric Statistical Methods

33

one sample K-S test (cont'd)
(5) h = 0 if the test accepts the null hypothesis and 1 if the null hypothesis is rejected;
(6) p = the p-value of the test;
(7) k = the test statistic;
(8) c = the critical value, depending on alpha and the sample size.

We are testing under three different scenarios: a) alpha=0.1, b) alpha=0.05 and c) alpha=0.01. All three scenarios are under the assumption that type='unequal'.

Scenario #1: α=0.1
a) MATLAB code: [h,p,k,c]=kstest(x,[],0.1,'unequal');
b) Testing result: h=1, p=0.0122, k=0.3542, c=0.2714;
Since k>c (or h=1), we reject at the 10% level of significance the null hypothesis that the sample data follow the standard normal distribution.

Page 34: Nonparametric Statistical Methods

34

one sample K-S test (cont'd)
Scenario #2: α=0.05
a) MATLAB code: [h,p,k,c]=kstest(x,[],0.05,'unequal');
b) Testing result: h=1, p=0.0122, k=0.3542, c=0.3014;
Since k>c (or h=1), we also reject at the 5% level of significance the null hypothesis that the sample data follow the standard normal distribution.

Scenario #3: α=0.01
a) MATLAB code: [h,p,k,c]=kstest(x,[],0.01,'unequal');
b) Testing result: h=0, p=0.0122, k=0.3542, c=0.3612;
Since k<c (or h=0), we fail to reject at the 1% level of significance the null hypothesis that the sample data follow the standard normal distribution.
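The same one-sample test can also be run in Python with scipy.stats.kstest; this is a hypothetical alternative to the MATLAB calls above, not part of the slides, and its statistic matches k = 0.3542.

# Hypothetical SciPy version of the one-sample K-S scenarios (not from the slides).
import numpy as np
from scipy import stats

x = np.arange(-4, 5.01, 0.5)        # the 19 evenly spaced sample values from -4 to 5
res = stats.kstest(x, "norm")       # compare the ECDF with the standard normal CDF
print(res.statistic, res.pvalue)    # statistic ~ 0.3542; p-value close to 0.012

Comparing the p-value with α = 0.1, 0.05 and 0.01 reproduces the three reject/accept decisions above.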

Page 35: Nonparametric Statistical Methods

35

two sample K-S test• Hypothesis Testing: H0: The two sets of data have the same distribution. Ha: The two sets of data do not have the same distribution.

• The first sample data set X1 is evenly spaced with values ranging from -2.0 to 1.0, while the numbers in the second set X2 come from a function that generates standard normal random variables with μ=0, σ2=1. The sample sizes of both data sets are 16. The values of the two data sets are as follows:

X1: -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 X2: -0.1241 1.4897 1.4090 1.4172 0.6715 -1.2075 0.7172 1.6302 0.4889 1.0347 0.7269 -0.3034 0.2939 -0.7873 0.8884 -1.1471

Page 36: Nonparametric Statistical Methods

36

two sample K-S test (cont'd)
• The MATLAB language syntax for the test is:
  X1=-2:0.2:1, meaning the values of X1 go from -2 to 1 in even increments of 0.2;
  X2=randn(16,1), a function that generates 16 standard normal random variables;
  [h,p,k]=kstest2(X1,X2,alpha,type), where the definitions of h, p, k, alpha and type are the same as those described in the one-sample test.
We are testing under two different scenarios: a) alpha=0.05 and b) alpha=0.1. Both scenarios are under the assumption that type='unequal'.

Scenario #1: α=0.05
a) MATLAB code: [h,p,k]=kstest2(X1,X2,0.05,'unequal');
b) Testing result: h=0, p=0.0657, k=0.4375;

Page 37: Nonparametric Statistical Methods

37

two sample K-S test (cont'd)
The function kstest2 does not produce the critical value. However, since h=0, the test accepts the null hypothesis that the two sets of data come from the same distribution at the 5% level of significance.

Scenario #2: α=0.1
a) MATLAB code: [h,p,k]=kstest2(X1,X2,0.1,'unequal');
b) Testing result: h=1, p=0.0657, k=0.4375;
Since h=1 now, the test rejects the null hypothesis that the two sets of data come from the same distribution at the 10% level of significance.
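Again as a hypothetical SciPy counterpart (not from the slides), scipy.stats.ks_2samp gives the same statistic for the particular X2 draw listed on slide 35; its exact p-value may differ slightly from MATLAB's asymptotic 0.0657.

# Hypothetical SciPy version of the two-sample K-S example (not from the slides).
import numpy as np
from scipy import stats

X1 = np.linspace(-2.0, 1.0, 16)     # 16 evenly spaced values from -2.0 to 1.0
X2 = np.array([-0.1241, 1.4897, 1.4090, 1.4172, 0.6715, -1.2075, 0.7172, 1.6302,
               0.4889, 1.0347, 0.7269, -0.3034, 0.2939, -0.7873, 0.8884, -1.1471])

res = stats.ks_2samp(X1, X2)
print(res.statistic, res.pvalue)    # statistic = 0.4375, as k above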

Page 38: Nonparametric Statistical Methods

38

4. Inferences on Several Independent Samples—Kruskal-Wallis Test

Page 39: Nonparametric Statistical Methods

39

William Henry Kruskal (Oct 10, 1919 – Apr 21, 2005)

Born in New York City. Mathematician and statistician. President of the Institute of Mathematical Statistics (1971) and President of the American Statistical Association (1982).

Page 40: Nonparametric Statistical Methods

40

Wilson Allen Wallis (1912 – Oct 12, 1998)

B.A. in psychology, University of Minnesota. Economist and statistician. President of the University of Rochester (1962-1982). Under Secretary of State for Economic, Business, and Agricultural Affairs (1985-1989).

Page 41: Nonparametric Statistical Methods

41

Kruskal-Wallis test
Definition: a non-parametric (distribution-free) test used to compare three or more independent groups of sampled data.

Page 42: Nonparametric Statistical Methods

42

Hypotheses
Null hypothesis (H0): the samples come from identical populations.
Alternative hypothesis (Ha): the samples come from different populations.

Page 43: Nonparametric Statistical Methods

43

Steps
1. Arrange the data of all samples in a single series in ascending order. Note: if there are repeated values, assign ranks to them by averaging their rank positions.
2. The ranks of the different samples are separated and summed up as R1, R2, R3, ...

Page 44: Nonparametric Statistical Methods

44

Steps
3. Test statistic:

  H = [12 / (n(n+1))] * Σ (Ri² / ni) − 3(n+1),

where H = Kruskal-Wallis test statistic, n = total number of observations in all samples, ni = number of observations in the i-th sample, and Ri = the rank sum of the i-th sample.

4. Rejection region: reject H0 if H is greater than the chi-square table value (with k − 1 degrees of freedom for k samples) and conclude that the samples do not all come from the same population.

Page 45: Nonparametric Statistical Methods

45

Example 4.1An experiment was done to compare

four different methods of teaching the concept of percentage to sixth graders. Experimental units were 28 classes which were randomly assigned to the four methods, seven classes per method. A 45 item test was given to all classes. The average test scores of the classes are summarized in table 4.1. Apply the Kruskal-Wallis test to the test scores data.

Page 46: Nonparametric Statistical Methods

46

Table 4.1 Average Test Scores by Teaching Method

Case Method   Formula Method   Equation Method   Unitary Analysis Method
   14.59          20.27             27.82                33.16
   23.44          26.84             24.92                26.93
   25.43          14.71             28.68                30.43
   18.15          22.34             23.32                36.43
   20.82          19.49             32.85                37.04
   14.06          24.92             33.90                29.76
   14.26          20.20             23.42                33.88

The ranks of the data values are shown in the following table, where the two equal values (24.92) are assigned the midrank = (14+15)/2 = 14.5.

Page 47: Nonparametric Statistical Methods

Ranks of Average Test Scores by Teaching Method

            Case Method   Formula Method   Equation Method   Unitary Analysis Method
                  3              8              19                   24
                 13             17              14.5                 18
                 16              4              20                   22
                  5             10              11                   27
                  9              6              23                   28
                  1             14.5            26                   21
                  2              7              12                   25

Rank Sum         49             66.5           125.5                165

Page 48: Nonparametric Statistical Methods

48

The value of the Kruskal-Wallis test statistic equals

  kw = [12 / (N(N+1))] * Σ (ri² / ni) − 3(N+1)
     = [12 / (28·29)] * [(49)²/7 + (66.5)²/7 + (125.5)²/7 + (165)²/7] − 3(29)
     = 18.134.

Since kw = 18.134 > 12.837 = χ²(3, .005), the P-value < .005, from which we can conclude that there are significant differences among the four teaching methods.

Page 49: Nonparametric Statistical Methods

49

SAS code
data test;
  input methodname $ scores;
  cards;
case 14.59
formula 20.27
case 23.44
case 25.43
case 18.15
…
unitary 36.43
unitary 37.04
unitary 29.76
unitary 33.88
;
proc npar1way data=test wilcoxon;
  class methodname;
  var scores;
run;

Page 50: Nonparametric Statistical Methods

50

SAS output:

Wilcoxon Scores (Rank Sums) for Variable scores
Classified by Variable methodname

                        Sum of    Expected     Std Dev        Mean
methodname     N        Scores    Under H0    Under H0       Score
case           7         49.00      101.50    18.845498    7.000000
formula        7         66.50      101.50    18.845498    9.500000
equation       7        125.50      101.50    18.845498   17.928571
unitary        7        165.00      101.50    18.845498   23.571429

Average scores were used for ties.

Kruskal-Wallis Test

Chi-Square           18.1390
DF                         3
Pr > Chi-Square       0.0004
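A hypothetical SciPy cross-check (not from the slides): scipy.stats.kruskal applies the same tie correction as PROC NPAR1WAY, so it reproduces the chi-square 18.1390 rather than the uncorrected hand value 18.134.

# Hypothetical SciPy cross-check of Example 4.1 (not from the slides).
from scipy import stats

case     = [14.59, 23.44, 25.43, 18.15, 20.82, 14.06, 14.26]
formula  = [20.27, 26.84, 14.71, 22.34, 19.49, 24.92, 20.20]
equation = [27.82, 24.92, 28.68, 23.32, 32.85, 33.90, 23.42]
unitary  = [33.16, 26.93, 30.43, 36.43, 37.04, 29.76, 33.88]

H, p = stats.kruskal(case, formula, equation, unitary)
print(H, p)     # roughly 18.14 and 0.0004, matching the SAS output above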

Page 51: Nonparametric Statistical Methods

51

5. Kendall’s correlation coefficient

Page 52: Nonparametric Statistical Methods

52

Kendall: a history
Sir Maurice George Kendall (09/06/1907 - 03/29/1983) was a British statistician widely known for his contributions to statistics.

- Studied mathematics at St John's College, Cambridge.
- Began his work in agriculture and formulated a series of tests for statistical randomness.
- In 1948 wrote a monograph on rank correlation.

Page 53: Nonparametric Statistical Methods

53

Introduction to Kendall's correlation
- Compares the bivariate relationship between ranked pairs (xi, yi).
- For the xi ordered, if the y values have a similar order, there is a strong positive correlation between x and y.
- For the xi ordered, if the y values have the opposite order, there is a strong negative correlation between x and y.
- It is a measure of monotonic association (dependency), but it is NOT suitable for detecting non-monotonic dependency.

Page 54: Nonparametric Statistical Methods

54

Concordant and discordant
- The pair (i, j) is concordant if (xi − xj) and (yi − yj) have the same sign.
- The pair is discordant if (xi − xj) and (yi − yj) have opposite signs.
- The pair is a tie in X if xi = xj; the pair is a tie in Y if yi = yj.
- For n pairs there are n(n−1)/2 comparisons.

Page 55: Nonparametric Statistical Methods

55

Example of calculating concordant and discordant pairs

- To calculate concordant pairs, sort x in ascending order and pair the y-ranks with the x-ranks (Table 5.2).
- Inspect the y-ranks in Table 5.2.
- The first y-rank is 2, succeeded by 10 greater ranks, namely 3, 5, 4, 6, 7, 9, 12, 11, 10, 8, giving 10 concordances.
- Similarly, the next y-rank, 3, is followed by 9 greater ranks, which gives 9 concordances.
- Continuing down the list, the total number of concordances is nc = 10 + 9 + 7 + 7 + 6 + 5 + 3 + 0 + 0 + 0 + 0 = 47.

Table 5.1: A data set showing scores x, y for two examination questions, Q1 and Q2, on the same paper

Q1 x   1  3  4  5  6  8 10 11 13 14 16 17
Q2 y  13 15 18 16 23 31 39 56 45 43 37  0

Table 5.2: Paired ranks for the data in Table 5.1

Q1 x   1  2  3  4  5  6  7  8  9 10 11 12
Q2 y   2  3  5  4  6  7  9 12 11 10  8  1

Page 56: Nonparametric Statistical Methods

56

Example of calculating concordant and discordant pairs (cont'd)

- To calculate discordant pairs, again inspect the y-ranks in Table 5.2 (previous slide).
- The first y-rank, 2, is succeeded by 1 smaller rank, namely 1.
- Similarly, the next y-rank, 3, is followed by 1 smaller rank.
- Continuing down the list, the total number of discordances is nd = 1 + 1 + 2 + 1 + 1 + 1 + 2 + 4 + 3 + 2 + 1 = 19, or alternatively nd = n(n − 1)/2 − nc = 66 − 47 = 19.

Page 57: Nonparametric Statistical Methods

57

Kendall's τ with no ties
Tau statistic:

  tk = (nc − nd) / [n(n − 1)/2],

where nc is the number of concordant pairs, nd is the number of discordant pairs, and n is the sample size. n(n − 1)/2 is the total number of pairs, and tk lies between -1 and 1. The notation tk is used to distinguish the sample statistic from the population value τ.

Page 58: Nonparametric Statistical Methods

58

Kendall's τ with no ties
Intuitive explanations:
- If the rankings of x and y are unrelated, we should expect a roughly equal mix of concordances and discordances, i.e. tk ≈ 0.
- If all pairs are concordant, tk = 1, i.e. there is a monotonic increasing association between x and y.
- If all pairs are discordant, tk = -1, i.e. there is a monotonic decreasing association between x and y.

Page 59: Nonparametric Statistical Methods

59

Kendall's tau: example (the previous score example)

H0: τ = 0 (no association) vs. Ha: τ > 0 (positive association between x and y).
Statistic: tk = (nc − nd) / [n(n − 1)/2] = (47 − 19)/66 = 0.4242.
Rejection region: the 5 percent critical value for a one-tail test when n = 12 is 0.3929 (from the table of critical values for Kendall's tau).

Conclusion: since 0.4242 > 0.3929, there is moderately strong evidence against the hypothesis of no association.

Table 5.1: A data set showing scores x, y for two examination questions, Q1 and Q2, on the same paper

Q1 x   1  3  4  5  6  8 10 11 13 14 16 17
Q2 y  13 15 18 16 23 31 39 56 45 43 37  0
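As a hypothetical cross-check (not from the slides), scipy.stats.kendalltau gives the same tau for these scores; since there are no ties, its tau-b equals the plain tau used here.

# Hypothetical SciPy cross-check of the Q1/Q2 score example (not from the slides).
from scipy import stats

x = [1, 3, 4, 5, 6, 8, 10, 11, 13, 14, 16, 17]
y = [13, 15, 18, 16, 23, 31, 39, 56, 45, 43, 37, 0]

tau, p_two_sided = stats.kendalltau(x, y)
print(tau, p_two_sided)   # tau = 28/66 ~ 0.424; halve the p-value for a one-tail test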

Page 60: Nonparametric Statistical Methods

60

Exact distribution of Kendall's tau

For small samples we may calculate the exact distribution of tk. Let n = 4. If the x-ranks are in ascending order, there are n! = 24 equally likely permutations of the y-ranks under the assumption that x and y have no association.

From Table 5.3 we can see that there are 7 possible values of tau.

Table 5.3: The set of all possible rank orders for n = 4, along with their correlation (tau) with the 'canonical' order 1 2 3 4

Order:  1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24
        1    1    1    1    1    1    2    2    2    2    2    2    3    3    3    3    3    3    4    4    4    4    4    4
        2    2    3    3    4    4    1    1    3    3    4    4    1    1    2    2    4    4    1    1    2    2    3    3
        3    4    2    4    2    3    3    4    1    4    1    3    2    4    1    4    1    2    2    3    1    3    1    2
        4    3    4    2    3    2    4    3    4    1    3    1    4    2    4    1    2    1    3    2    3    1    2    1

tau:    1  2/3  2/3  1/3  1/3    0  2/3  1/3  1/3    0    0 -1/3  1/3    0    0 -1/3 -1/3 -2/3    0 -1/3 -1/3 -2/3 -2/3   -1

Page 61: Nonparametric Statistical Methods

61

Inference on Kendall's tau
The probability distribution of tau for n = 4 (Table 5.4):

tau            1    2/3   1/3    0   -1/3  -2/3   -1
Probability  1/24  3/24  5/24  6/24  5/24  3/24  1/24

The distribution of tk is symmetric. A P-value less than 0.05 is obtained by a one-tail test only for tk = 1, for which P(tk ≥ 1) = 1/24 ≈ 0.042.
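Table 5.4 can be verified by brute force; the sketch below (not from the slides, assuming standard Python only) enumerates all 4! rank orders and tallies the resulting values of tau.

# Hypothetical enumeration check of Tables 5.3 and 5.4 (not from the slides).
from itertools import permutations
from collections import Counter
from fractions import Fraction

n = 4
counts = Counter()
for perm in permutations(range(1, n + 1)):            # y-ranks against the x-order 1..n
    nc = sum(perm[j] > perm[i] for i in range(n) for j in range(i + 1, n))
    nd = n * (n - 1) // 2 - nc                        # no ties, so the rest are discordant
    counts[Fraction(nc - nd, n * (n - 1) // 2)] += 1

for tau in sorted(counts, reverse=True):
    print(f"tau = {tau}: probability {counts[tau]}/24")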

Page 62: Nonparametric Statistical Methods

62

Kendall's tau: large sample approximation
For large n, under H0: τ = 0, tk is approximately normal with mean 0 and variance 2(2n+5)/[9n(n−1)], so

  z = tk / sqrt{2(2n+5)/[9n(n−1)]}

can be referred to the standard normal table.

Page 63: Nonparametric Statistical Methods

63

Kendall's tau-b with ties

Review (no ties): tk = (nc − nd) / [n(n − 1)/2].

Kendall's tau-b:

  tb = (nc − nd) / sqrt[(D − U)(D − V)],

where D = n(n − 1)/2, U = Σ ui(ui − 1)/2, V = Σ vi(vi − 1)/2, and the ui and vi are the numbers of consecutive tied ranks within the x-ranks and the y-ranks respectively.

Page 64: Nonparametric Statistical Methods

64

Kendall's tau-b: example
Life expectancy showed a general tendency to increase during the nineteenth and twentieth centuries as standards of health care and hygiene improved. The extra life expectancy varies between countries, communities and even families. Table 5.5 gives the year of death and age at death for 13 males in a clan we call the McDeltas, buried in the Badenscallie burial ground in Wester Ross, Scotland. Is there evidence that life expectancy has been increasing in this clan in more recent years?

Page 65: Nonparametric Statistical Methods

65

Kendall's tau-b: example

Hypotheses: H0: τb = 0 (no increasing trend in life expectancy) vs. Ha: τb > 0 (life expectancy has increased).

With nc = 53, nd = 23, D = 78, U = 0 (no ties among the years) and V = 2 (two tied pairs of ages),

  tb = (53 − 23) / sqrt[(78 − 0)(78 − 2)] = 30 / sqrt(78·76) = 0.3896.

Table 5.5 Year of death and ages of 13 McDeltas

year 1827 1884 1895 1908 1914 1918 1924 1928 1936 1941 1965 1977
Age    13   18   83   34    1   11   16   13   74   87   65   83

Page 66: Nonparametric Statistical Methods

66

SAS code
DATA kendall;
  INPUT year age;
  datalines;
1827 13
1884 83
…
;
PROC CORR DATA=kendall KENDALL;
  TITLE "Kendall's tau for Life expectancy";
  VAR year age;
RUN;

Page 67: Nonparametric Statistical Methods

67

SAS output

Kendall's tau for Life expectancy

The CORR Procedure

2 Variables: year age

Simple Statistics

Variable    N        Mean     Std Dev      Median     Minimum     Maximum
year       13        1922     39.4832        1924        1827        1977
age        13    48.07692    33.44762    65.00000     1.00000    87.00000

Kendall Tau b Correlation Coefficients, N = 13
Prob > |tau| under H0: Tau=0

           year       age
year    1.00000   0.38964
                   0.0662
age     0.38964   1.00000
         0.0662

Page 68: Nonparametric Statistical Methods

68

6. Spearman rank correlation

6.1 From Pearson to Spearman
6.2 Hypothesis tests & examples

Page 69: Nonparametric Statistical Methods

6.1 From Pearson to Spearman

Pearson product-moment correlation: the population correlation coefficient is

  ρ = Cov(X, Y) / (σX σY),

with -1 ≤ ρ ≤ 1; at the extremes ρ = ±1 there is a perfect linear relationship between X and Y.

Karl Pearson (27 March 1857 – 27 April 1936)

Page 70: Nonparametric Statistical Methods

70

6.1 From Pearson to Spearman

Figure 6.1

X        0   2    3    4     6
Y        0  16   81  256  1296
Rank(X)  1   2    3    4     5
Rank(Y)  1   2    3    4     5

Here Y is a monotone but nonlinear function of X (Y = X⁴), so the Pearson correlation of the raw values is less than 1, while the ranks agree perfectly and the rank correlation equals 1.

Page 71: Nonparametric Statistical Methods

6.1 From Pearson to Spearman: Spearman rank correlation

Charles Edward Spearman (10 Sept. 1863 - 17 Sept. 1945)

Statistician and psychologist. Pioneer of factor analysis. Known for Spearman's rank correlation coefficient and for models of human intelligence.

Page 72: Nonparametric Statistical Methods

72

6.1 From Pearson to Spearman: Spearman rank correlation

• di = rank(Xi) − rank(Yi) are the differences in the ranks.
• n = number of pairs of X's and Y's.

The Spearman rank correlation is the Pearson correlation computed on the ranks ui = rank(Xi) and vi = rank(Yi):

  rs = Σ (ui − ū)(vi − v̄) / sqrt{ [Σ (ui − ū)²] [Σ (vi − v̄)²] }.

With no ties this simplifies to

  rs = 1 − 6 Σ di² / [n(n² − 1)].

Why? Because the ranks are just the integers 1, ..., n, so Σ (ui − ū)² = Σ (vi − v̄)² = n(n² − 1)/12, and expanding di² = (ui − vi)² links the cross-product term to Σ di².

Page 73: Nonparametric Statistical Methods

73

6.2 Hypothesis tests & examples
H0: X and Y are independent
H1: X and Y are associated

The null distribution of the test statistic rs can be derived from the fact that, for any fixed ordering of the Xi's, all orderings of the Yi's are equally likely under H0. Assuming there are no ties, the total number of orderings is n!, each with probability 1/n!. Under H0, rs should be fairly close to 0.

Page 74: Nonparametric Statistical Methods

74

6.2 Hypothesis tests & examples: A. Exact test

Small sample size (n ≤ 10):
Step 1: rank the X's and Y's.
Step 2: compute rs.
Step 3: obtain the p-value:
  upper-tail test: p-value = P(Rs ≥ rs);
  lower-tail test: p-value = P(Rs ≤ rs);
  two-sided test for independence: p-value = 2 P(Rs ≥ |rs|).
The probabilities are obtained from the table of upper-tail probabilities for the Spearman rank correlation (the constant c in Table 6.2).

Page 75: Nonparametric Statistical Methods

75

6.2 Hypothesis tests & examples: Example A

A researcher took blood samples from healthy rabbits and made counts of the heterophils (X) and lymphocytes (Y).

Table 6.1

Rabbit  Heterophils (X)  Lymphocytes (Y)  Rank of X  Rank of Y    d
   1         2276             2830             5          4       1
   2         3724             5488             8          9      -1
   3         2723             2904             7          5       2
   4         4020             5528            10         10       0
   5         4011             4966             9          8       1
   6         2035             3135             3          7      -4
   7         1540             2079             2          2       0
   8         1300             1755             1          1       0
   9         2240             3080             4          6      -2
  10         2467             2363             6          3       3

Page 76: Nonparametric Statistical Methods

6.2 Hypothesis tests & examples: Example A (cont'd)

H0: X and Y are independent vs. Ha: X and Y are correlated.

Calculate rs = 1 − 6 Σ di² / [n(n² − 1)] = 1 − 6(36)/[10(99)] = 0.782.

From Table 6.2, P(Rs ≥ 0.78) ≤ 0.005, so the two-sided P-value ≤ 0.01 and X and Y are significantly associated.

Table 6.2 Upper-tail probabilities for the Spearman rank correlation, n = 10

   c      Prob
 0.01    0.500
 0.05    0.446
 0.10    0.393
 0.15    0.341
 0.20    0.292
 0.25    0.246
  …        …
 0.62    0.030
 0.64    0.027
 0.65    0.024
 0.66    0.022
 0.67    0.019
 0.68    0.017
 0.70    0.015
  …        …
 0.78    0.005
>0.78   <0.005

Page 77: Nonparametric Statistical Methods

77

6.2 Hypothesis tests & examples: B. Large-sample approximation

For large sample size (n > 10), under H0, E(Rs) = 0 and Var(Rs) = 1/(n − 1). One common normal approximation for the Spearman coefficient is

  Z = rs sqrt(n − 1), approximately N(0, 1).

Another approach is to use the Fisher transformation F(rs) = [ln(1 + rs) − ln(1 − rs)]/2, which approximately follows a N(0, 1/(n − 3)) distribution.

Page 78: Nonparametric Statistical Methods

78

6.2 Hypothesis tests & examples: B. Large-sample approximation

Step 1: rank the X's and Y's.
Step 2: compute rs.
Step 3: compute the test statistic Z = rs sqrt(n − 1).
Step 4: compare Z with the critical value (or compute the P-value).

Page 79: Nonparametric Statistical Methods

79

6.2 Hypothesis tests & examples: Example B

Table 6.3

Rabbit  Heterophils (X)  Lymphocytes (Y)  Rank of X  Rank of Y    d
   1         2276             2830             6          5       1
   2         3724             5488            16         17      -1
   3         2723             2904             8          6       2
   4         4020             5528            18         18       0
   5         4011             4966            17         14       3
   6         2035             3135             4          8      -4
   7         1540             2079             3          2       1
   8         1300             1755             1          1       0
   9         2240             3080             5          7      -2
  10         2467             2363             7          3       4
  11         3700             5087            15         15       0
  12         1501             2821             2          4      -2
  13         2907             5130            13         16      -3
  14         2898             4830            12         13      -1
  15         2783             4690            10         12      -2
  16         2870             3570            11         10       1
  17         3263             3480            14          9       5
  18         2780             3823             9         11      -2

Page 80: Nonparametric Statistical Methods

80

6.2 Hypothesis tests & examples: Example B (cont'd)

H0: X and Y are independent vs. Ha: X and Y are correlated.

rs = 1 − 6 Σ di² / [n(n² − 1)] = 1 − 6(100)/[18(323)] = 0.897, so Z = rs sqrt(n − 1) = 0.897 sqrt(17) = 3.70.

The two-sided p-value for Z = 3.70 is 0.0002, so the two variables are significantly associated.

Page 81: Nonparametric Statistical Methods

6.2 Hypothesis tests & examples: Spearman rank correlation in SAS

data Rabbit;
  input x y;
  datalines;
…
;
run;
proc corr data=Rabbit spearman;
  var x y;
  title 'example';
run;
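A hypothetical Python counterpart of the PROC CORR step above (not from the slides), using the Example A data, is scipy.stats.spearmanr:

# Hypothetical SciPy cross-check of Example A (not from the slides).
from scipy import stats

X = [2276, 3724, 2723, 4020, 4011, 2035, 1540, 1300, 2240, 2467]   # heterophils
Y = [2830, 5488, 2904, 5528, 4966, 3135, 2079, 1755, 3080, 2363]   # lymphocytes

rho, p_two_sided = stats.spearmanr(X, Y)
print(rho, p_two_sided)   # rho ~ 0.782, in line with rs in Example A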

Page 82: Nonparametric Statistical Methods

Derivation sketch: write Rs in terms of the y-ranks v1, ..., vn paired with the x-ranks 1, ..., n. Under H0 the vector (v1, ..., vn) is a uniformly random permutation, so each vi has E(vi) = (n+1)/2 and Var(vi) = (n² − 1)/12.

Page 83: Nonparametric Statistical Methods

From E(Rs) = 0 we can get Var(Rs) = E(Rs²).

Page 84: Nonparametric Statistical Methods

Each event {vi = a} appears in (n − 1)! permutations and each pair of events {vi = a, vj = b} appears in (n − 2)! permutations, so P(vi = a) = 1/n and P(vi = a, vj = b) = 1/[n(n − 1)].

Substituting these results, we have Var(Rs) = 1/(n − 1).

Page 85: Nonparametric Statistical Methods

85

7. Summary

Page 86: Nonparametric Statistical Methods

Questions?

86