
Page 1: Nonparametric Statistical Methods

1

Nonparametric Statistical Methods

Presented by Xiaojin Dong, Owen Gu, Sohail Khan, Hao Miao, Shaolan Xiang, Chunmin Han, Yinjin Wu, Jiayue Zhang, Yuanhao Zhang

Page 2: Nonparametric Statistical Methods

2

Outline
1. Wilcoxon signed rank test
2. Wilcoxon rank sum test
3. Kolmogorov-Smirnov test
4. Kruskal-Wallis test
5. Kendall's correlation coefficient
6. Spearman's rank correlation coefficient

Page 3: Nonparametric Statistical Methods

3

When to use nonparametric methods?
- The population is not normal.
- The sample size is very small.
- Some data are ordinal.
- The samples may be dependent or independent.

What to estimate? The population median. The median is a better measure of central tendency for non-normal populations, e.g. skewed distributions.

Page 4: Nonparametric Statistical Methods

4

1. Wilcoxon signed rank test

Page 5: Nonparametric Statistical Methods

Inventors

Frank Wilcoxon (1892-1965)

Henry Berthold Mann (1905-2000), Ohio State University; he was the dissertation advisor of Whitney.

Donald Ransom Whitney

Page 6: Nonparametric Statistical Methods

6

Hypothesis testing: H0: the population median equals μ0 vs. Ha: the population median differs from μ0 (or a one-sided alternative).
Compute the differences di = xi − μ0 and rank them in terms of absolute values; let ri be the rank of |di|.
Statistics:
  w+ = sum of the ranks of the positive differences
  w− = sum of the ranks of the negative differences
Rejection region: reject H0 if w+ is large or, equivalently, if w− is small, i.e. if w+ is at least the upper critical value from the signed rank table.

Page 7: Nonparametric Statistical Methods

7

Large sample approximation: H0: the population median equals μ0 vs. Ha: the population median differs from μ0 (or a one-sided alternative).
For large sample size n, under H0,
  E(w+) = n(n+1)/4 and Var(w+) = n(n+1)(2n+1)/24,
so z = [w+ − n(n+1)/4] / sqrt[n(n+1)(2n+1)/24] is approximately standard normal.

Rejection region: reject H0 if z ≥ zα for the one-sided test, or if |z| ≥ zα/2 for the two-sided test.

Page 8: Nonparametric Statistical Methods

8

Intuition and assumptions
If the positive differences are larger than the negative differences, they receive higher ranks and contribute to a larger value of w+; likewise, larger negative differences contribute to a larger value of w−.

Assumption: the population must be symmetric.

Reason: under the null hypothesis, a right-skewed population tends to produce a large w+ and a left-skewed population tends to produce a large w−, which would distort the test.

Page 9: Nonparametric Statistical Methods

9

Example 1.1: Test whether the median thermostat setting differs from the design setting of 200, using the thermostat setting data.

H0: the population median equals 200 vs. Ha: the population median differs from 200.

From the ranks below, w+ = 6+8+1+7+10+9+2+4 = 47 and w− = 5+3 = 8.

Conclusion: the population median differs from the design setting of 200 at α = .05 (the exact two-sided P-value is .048; see the SAS output below).

Table 1.1 Thermostat setting data

x     202.2 203.4 200.5 202.5 206.3 198.0 203.7 200.8 201.3 199.0
diff.   2.2   3.4   0.5   2.5   6.3  -2.0   3.7   0.8   1.3  -1.0
rank      6     8     1     7    10     5     9     2     4     3

Page 10: Nonparametric Statistical Methods

10

SAS code
DATA thermostat;
  INPUT temp;
  datalines;
202.2
203.4
…
;
PROC UNIVARIATE DATA=thermostat loccount mu0=200;
  TITLE "Wilcoxon signed rank test for the thermostat data";
  VAR temp;
RUN;

Page 11: Nonparametric Statistical Methods

11

SAS outputs (selected results)

Basic Statistical Measures

    Location                  Variability
Mean     201.7700    Std Deviation          2.41019
Median   201.7500    Variance               5.80900
Mode     .           Range                  8.30000
                     Interquartile Range    2.90000

Tests for Location: Mu0=200

Test           -Statistic-       -----p Value------
Student's t    t   2.322323      Pr > |t|     0.0453
Sign           M   3             Pr >= |M|    0.1094
Signed Rank    S   19.5          Pr >= |S|    0.048
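As a cross-check that is not part of the original slides, the same test can be sketched with Python's scipy.stats.wilcoxon (assuming SciPy is available); the reported statistic follows SciPy's convention rather than SAS's S, but the exact two-sided p-value agrees with the 0.048 above.

# Hypothetical SciPy cross-check of Example 1.1 (not from the slides).
from scipy import stats

temp = [202.2, 203.4, 200.5, 202.5, 206.3, 198.0, 203.7, 200.8, 201.3, 199.0]
diffs = [t - 200.0 for t in temp]              # differences from mu0 = 200

# Exact two-sided Wilcoxon signed rank test of H0: median = 200
res = stats.wilcoxon(diffs, alternative="two-sided")
print(res.statistic, res.pvalue)               # p-value is about 0.049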

Page 12: Nonparametric Statistical Methods

12

2. Wilcoxon rank sum test

Page 13: Nonparametric Statistical Methods

13

Wilcoxon rank sum test: introduction
The Wilcoxon-Mann-Whitney test is also called the Wilcoxon rank sum test and the Mann-Whitney U test.

It was proposed initially by the Irish-born US statistician Frank Wilcoxon in 1945, for equal sample sizes, and extended to arbitrary sample sizes and in other ways by the Austrian-born US mathematician Henry Berthold Mann and the US statistician Donald Ransom Whitney in 1947.

Page 14: Nonparametric Statistical Methods

14

Where to use
When we analyze two independent samples, there are times when the assumptions for using a t-test are not met (the data are not normally distributed, the sample size is small, etc.), or the data values may only represent ordered categories.

If the sample sizes are moderate, if there is a question concerning the distributions, or if the data are really ordinal, use the Wilcoxon rank sum test.

This test only assumes:
1. All the observations from both groups are independent of each other.
2. The responses are ordinal or continuous measurements.
3. The distributions of the two groups have similar shape.

Page 15: Nonparametric Statistical Methods

15

Calculation steps
H0: F1 = F2 (the distributions of both groups are equal) vs. Ha: F1 < F2 or Ha: F1 > F2 (one r.v. is stochastically larger than the other).

1. Put all the data from both groups in increasing order (with special provision for ties), retaining the group identity.
2. Compute the sum of ranks for each group and denote the sums by w1 and w2.
3. Compute u1 = w1 - n1(n1+1)/2 and u2 = w2 - n2(n2+1)/2.
4. Look up Table A.11 to decide whether to reject H0 at significance level α and to obtain the P-value.

Page 16: Nonparametric Statistical Methods

16

Special treatment

For large samples: when n1 > 10 and n2 > 10, U is approximately normally distributed with parameters µ = n1n2/2 and σ2 = n1n2(N+1)/12, where N = n1 + n2. Therefore a large sample z-test can be based on the statistic

  z = (u1 − n1n2/2) / sqrt[n1n2(N+1)/12].

For ties: use the midrank when an observation from one group is equal to an observation from the other group.

Page 17: Nonparametric Statistical Methods

17

Example
To determine whether the order of questions has a significant impact on students' performance in an exam, 20 students were randomly divided into two equal groups, A and B. Everyone answered an exam paper containing the same questions, but the questions were ordered from easy to hard for group A and from hard to easy for group B. The scores are as follows.

A: 83, 82, 84, 96, 90, 64, 91, 71, 75, 72
B: 42, 61, 52, 78, 69, 81, 75, 78, 78, 65

Page 18: Nonparametric Statistical Methods

18

Solution
H0: F1 = F2 vs. Ha: F1 > F2. Rank the scores of both groups in ascending order.

score  rank  group  |  score  rank  group
  42     1     B    |    78    12     B
  52     2     B    |    78    12     B
  61     3     B    |    78    12     B
  64     4     A    |    81    14     B
  65     5     B    |    82    15     A
  69     6     B    |    83    16     A
  71     7     A    |    84    17     A
  72     8     A    |    90    18     A
  75     9.5   A    |    91    19     A
  75     9.5   B    |    96    20     A

Page 19: Nonparametric Statistical Methods

19

The rank sums are
  w1 = 4+7+8+9.5+15+16+17+18+19+20 = 133.5
  w2 = 1+2+3+5+6+9.5+12+12+12+14 = 76.5
Therefore
  u1 = w1 - n1(n1+1)/2 = 133.5 - 10*11/2 = 78.5
  u2 = w2 - n2(n2+1)/2 = 76.5 - 10*11/2 = 21.5
Check that u1 + u2 = n1*n2 = 100.
From Table A.11 we find that the P-value is between 0.012 and 0.026.
To compare this with the large sample normal approximation, calculate (with a continuity correction of 1/2)

  z = (u1 − n1n2/2 − 1/2) / sqrt[n1n2(N+1)/12] = (78.5 − 50 − 0.5) / sqrt(100·21/12) = 2.12,

which yields the P-value ≈ Φ(-2.12) = 0.0170.

Page 20: Nonparametric Statistical Methods

20

SAS code and output
Data exam;
  Input group $ score @@;
Datalines;
A 83 A 82 A 84 A 96 A 90 A 64 A 91 A 71 A 75 A 72
B 42 B 61 B 52 B 78 B 69 B 81 B 75 B 78 B 78 B 65
;
Proc npar1way data=exam wilcoxon;
  Var score;
  Class group;
  exact wilcoxon;
Run;
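For comparison (not part of the original slides), a rough equivalent of this PROC NPAR1WAY step can be sketched with scipy.stats.mannwhitneyu; its statistic is the Mann-Whitney U for the first sample, i.e. u1 above, and because of the ties its p-value comes from a normal approximation.

# Hypothetical SciPy cross-check of the exam-scores example (not from the slides).
from scipy import stats

A = [83, 82, 84, 96, 90, 64, 91, 71, 75, 72]
B = [42, 61, 52, 78, 69, 81, 75, 78, 78, 65]

# One-sided test of Ha: scores in group A tend to be larger than in group B
res = stats.mannwhitneyu(A, B, alternative="greater")
print(res.statistic, res.pvalue)   # statistic = u1 = 78.5; p close to 0.017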

Page 21: Nonparametric Statistical Methods
Page 22: Nonparametric Statistical Methods

22

3. Kolmogorov-Smirnov Test

Page 23: Nonparametric Statistical Methods

"Every mathematician believes he is ahead over all others. The reason why they don't say this in public, is because they are intelligent people"

Andrey Kolmogorov (25 April 1903 – 20 October 1987)

Russian mathematician. Major advances in the fields of probability theory, topology, turbulence, classical mechanics and computational complexity. Gained international recognition in 1922 for constructing a Fourier series that diverges almost everywhere.

Page 24: Nonparametric Statistical Methods

Vladimir Ivanovich Smirnov (June 10, 1887 – February 11, 1974)

Significant contributions in both pure and applied mathematics, as well as the history of mathematics. Author of the five-volume book A Course in Higher Mathematics.

Page 25: Nonparametric Statistical Methods

25

KS-Test
- Tries to determine whether two datasets differ significantly by comparing their distributions.
- Makes no assumption about the distribution of the data.
- This generality comes at some cost: parametric tests, e.g. the t-test, may be more sensitive if the data meet the requirements of the test.

Page 26: Nonparametric Statistical Methods

26

Types of KS-tests
- One sample: sample vs. a reference probability distribution, e.g. a test for normality (empirical CDF vs. the standard normal CDF).
- Two sample: test whether two samples come from the same distribution.

Page 27: Nonparametric Statistical Methods

27

The one sample Kolmogorov-Smirnov (K-S) test is based on the empirical distribution function (ECDF). Given N ordered data points Y1, Y2, ..., YN, the ECDF is defined as

  E_N(Yi) = n(i)/N,

where n(i) is the number of points less than Yi. This is a step function that increases by 1/N at the value of each data point. We can graph the empirical distribution function together with the cumulative distribution function of a given distribution. The one sample K-S test is based on the maximum distance between these two curves, that is,

  D = sup over x of |E_N(x) − F(x)|,

where F is the theoretical cumulative distribution function.
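As an illustration of this definition (not from the slides), the statistic D can be computed directly from the ECDF; the helper below is a hypothetical sketch that assumes NumPy and SciPy are available.

# Hypothetical sketch: the one-sample K-S statistic straight from its definition.
import numpy as np
from scipy.stats import norm

def ks_statistic(sample, cdf=norm.cdf):
    """Maximum vertical distance between the ECDF of `sample` and a reference CDF."""
    y = np.sort(np.asarray(sample, dtype=float))
    n = len(y)
    ecdf_after = np.arange(1, n + 1) / n    # ECDF value just after each data point
    ecdf_before = np.arange(0, n) / n       # ECDF value just before each data point
    f = cdf(y)                              # reference CDF evaluated at the data points
    return max(np.max(ecdf_after - f), np.max(f - ecdf_before))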

Page 28: Nonparametric Statistical Methods

28

The two sample K-S test is a variation of this. It compares two empirical distribution functions,

  D = sup over x of |E1(x) − E2(x)|,

where E1 and E2 are the empirical distribution functions for the two samples.

More formally, the Kolmogorov-Smirnov two sample test can be stated as follows.
H0: The two samples come from a common distribution.
Ha: The two samples do not come from a common distribution.

Page 29: Nonparametric Statistical Methods

29

Test statistic: the Kolmogorov-Smirnov two sample test statistic is defined as

  D = sup over x of |E1(x) − E2(x)|,

where E1 and E2 are the empirical distribution functions for the two samples.

Critical region: the hypothesis regarding the distributional form is rejected if the test statistic D is greater than the critical value obtained from a table at significance level α.

The quantile-quantile plot, bihistogram, and Tukey mean-difference plot are graphical alternatives to the two sample K-S test.

Page 30: Nonparametric Statistical Methods

30

Application of the Kolmogorov-Smirnov test
The K-S goodness-of-fit test can be applied in the case of both one sample and two samples.

In the one-sample test, we compare the empirical distribution function of the sample data with the cumulative distribution function of the reference distribution, to determine whether the sample is drawn from the reference distribution, such as the standard normal, lognormal or exponential distribution.

In the two-sample test, we compare the empirical distribution functions of two sets of data to determine whether they come from the same distribution.

The following slides exemplify the application of the test in both cases using MATLAB.

Page 31: Nonparametric Statistical Methods

31

one sample K-S test• Hypothesis Testing: H0: The sample data follows the standard normal distribution (μ=0, σ2=1). Ha: The data does not follow the standard normal distribution.

• The sample data, extracted from the daily percentage change in the share price of company XXX, Inc. for the past 19 days, is listed as follows in an ascending order: -4.0% -3.5% -3.0% -2.5% -2.0% -1.5% -1.0% -0.5% 0.0% 0.5% 1.0% 1.5% 2.0% 2.5% 3.0% 3.5% 4.0% 4.5% 5.0%

• The test statistic is: max(|Fx-Gx|), where Fx is the empirical cdf and Gx is the standard normal cdf .

Page 32: Nonparametric Statistical Methods

32

one sample K-S test (cont'd)
• The MATLAB language syntax for the test is:
  x = -4:0.5:5;
  [h,p,k,c] = kstest(x, [], alpha, type)
where
(1) x is the sample data set, and the values increase from -4 to 5 in an even increment of 0.5;
(2) [] means the standard normal distribution is used;
(3) alpha is a double and represents the level of significance;
(4) type is a string and specifies whether the type of test for the alternative hypothesis is 'unequal', 'larger' or 'smaller', meaning whether the empirical cdf and the cdf of the specified distribution are unequal, the empirical cdf is larger, or the empirical cdf is smaller;

Page 33: Nonparametric Statistical Methods

33

one sample K-S test (cont'd)
(5) h = 0 if the test accepts the null hypothesis and 1 if the null hypothesis is rejected;
(6) p = the p-value of the test;
(7) k = the test statistic;
(8) c = the critical value, depending on alpha and the sample size.

We are testing under three different scenarios: a) alpha=0.1, b) alpha=0.05 and c) alpha=0.01. All three scenarios are under the assumption that type='unequal'.

Scenario #1: α=0.1
a) MATLAB code: [h,p,k,c]=kstest(x,[],0.1,'unequal');
b) Testing result: h=1, p=0.0122, k=0.3542, c=0.2714;
Since k>c (or h=1), we reject at the 10% level of significance the null hypothesis that the sample data follow the standard normal distribution.

Page 34: Nonparametric Statistical Methods

34

one sample K-S test (cont'd)
Scenario #2: α=0.05
a) MATLAB code: [h,p,k,c]=kstest(x,[],0.05,'unequal');
b) Testing result: h=1, p=0.0122, k=0.3542, c=0.3014;
Since k>c (or h=1), we also reject at the 5% level of significance the null hypothesis that the sample data follow the standard normal distribution.

Scenario #3: α=0.01
a) MATLAB code: [h,p,k,c]=kstest(x,[],0.01,'unequal');
b) Testing result: h=0, p=0.0122, k=0.3542, c=0.3612;
Since k<c (or h=0), we fail to reject at the 1% level of significance the null hypothesis that the sample data follow the standard normal distribution.
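The same one-sample test can also be run in Python with scipy.stats.kstest; this is a hypothetical alternative to the MATLAB calls above, not part of the slides, and its statistic matches k = 0.3542.

# Hypothetical SciPy version of the one-sample K-S scenarios (not from the slides).
import numpy as np
from scipy import stats

x = np.arange(-4, 5.01, 0.5)        # the 19 evenly spaced sample values from -4 to 5
res = stats.kstest(x, "norm")       # compare the ECDF with the standard normal CDF
print(res.statistic, res.pvalue)    # statistic ~ 0.3542; p-value close to 0.012

Comparing the p-value with α = 0.1, 0.05 and 0.01 reproduces the three reject/accept decisions above.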

Page 35: Nonparametric Statistical Methods

35

two sample K-S test• Hypothesis Testing: H0: The two sets of data have the same distribution. Ha: The two sets of data do not have the same distribution.

• The first sample data set X1 is evenly spaced with values ranging from -2.0 to 1.0, while the numbers in the second set X2 come from a function that generates standard normal random variables with μ=0, σ2=1. The sample sizes of both data sets are 16. The values of the two data sets are as follows:

X1: -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 X2: -0.1241 1.4897 1.4090 1.4172 0.6715 -1.2075 0.7172 1.6302 0.4889 1.0347 0.7269 -0.3034 0.2939 -0.7873 0.8884 -1.1471

Page 36: Nonparametric Statistical Methods

36

two sample K-S test (cont'd)
• The MATLAB language syntax for the test is:
  X1=-2:0.2:1, meaning the values of X1 go from -2 to 1 in even increments of 0.2;
  X2=randn(16,1), a function that generates 16 standard normal random variables;
  [h,p,k]=kstest2(X1,X2,alpha,type), where the definitions of h, p, k, alpha and type are the same as those described in the one-sample test.
We are testing under two different scenarios: a) alpha=0.05 and b) alpha=0.1. Both scenarios are under the assumption that type='unequal'.

Scenario #1: α=0.05
a) MATLAB code: [h,p,k]=kstest2(X1,X2,0.05,'unequal');
b) Testing result: h=0, p=0.0657, k=0.4375;

Page 37: Nonparametric Statistical Methods

37

two sample K-S test (cont'd)
The function kstest2 does not produce the critical value. However, since h=0, the test accepts the null hypothesis that the two sets of data come from the same distribution at the 5% level of significance.

Scenario #2: α=0.1
a) MATLAB code: [h,p,k]=kstest2(X1,X2,0.1,'unequal');
b) Testing result: h=1, p=0.0657, k=0.4375;
Since h=1 now, the test rejects the null hypothesis that the two sets of data come from the same distribution at the 10% level of significance.
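Again as a hypothetical SciPy counterpart (not from the slides), scipy.stats.ks_2samp gives the same statistic for the particular X2 draw listed on slide 35; its exact p-value may differ slightly from MATLAB's asymptotic 0.0657.

# Hypothetical SciPy version of the two-sample K-S example (not from the slides).
import numpy as np
from scipy import stats

X1 = np.linspace(-2.0, 1.0, 16)     # 16 evenly spaced values from -2.0 to 1.0
X2 = np.array([-0.1241, 1.4897, 1.4090, 1.4172, 0.6715, -1.2075, 0.7172, 1.6302,
               0.4889, 1.0347, 0.7269, -0.3034, 0.2939, -0.7873, 0.8884, -1.1471])

res = stats.ks_2samp(X1, X2)
print(res.statistic, res.pvalue)    # statistic = 0.4375, as k above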

Page 38: Nonparametric Statistical Methods

38

4. Inferences on Several Independent Samples—Kruskal-Wallis Test

Page 39: Nonparametric Statistical Methods

39

William Henry Kruskal (Oct 10, 1919 – Apr 21, 2005)

Born in New York City. Mathematician and statistician. President of the Institute of Mathematical Statistics (1971) and President of the American Statistical Association (1982).

Page 40: Nonparametric Statistical Methods

40

Wilson Allen Wallis (1912 – Oct 12, 1998)

B.A. in psychology, University of Minnesota. Economist and statistician. President of the University of Rochester (1962-1982). Under Secretary of State for Economic, Business, and Agricultural Affairs (1985-1989).

Page 41: Nonparametric Statistical Methods

41

Kruskal-Wallis test
Definition: a non-parametric (distribution-free) test used to compare three or more independent groups of sampled data.

Page 42: Nonparametric Statistical Methods

42

Hypotheses
Null hypothesis (H0): the samples come from identical populations.
Alternative hypothesis (Ha): the samples come from different populations.

Page 43: Nonparametric Statistical Methods

43

Steps
1. Arrange the data of all samples in a single series in ascending order. Note: if there are repeated values, assign ranks to them by averaging their rank positions.
2. The ranks of the different samples are separated and summed up as R1, R2, R3, ...

Page 44: Nonparametric Statistical Methods

44

Steps
3. Test statistic:

  H = [12 / (n(n+1))] * Σ (Ri² / ni) − 3(n+1),

where H = Kruskal-Wallis test statistic, n = total number of observations in all samples, ni = number of observations in the i-th sample, and Ri = the rank sum of the i-th sample.

4. Rejection region: reject H0 if H is greater than the chi-square table value (with k − 1 degrees of freedom for k samples) and conclude that the samples do not all come from the same population.

Page 45: Nonparametric Statistical Methods

45

Example 4.1An experiment was done to compare

four different methods of teaching the concept of percentage to sixth graders. Experimental units were 28 classes which were randomly assigned to the four methods, seven classes per method. A 45 item test was given to all classes. The average test scores of the classes are summarized in table 4.1. Apply the Kruskal-Wallis test to the test scores data.

Page 46: Nonparametric Statistical Methods

46

Table 4.1 Average Test Scores by Teaching Method

Case Method   Formula Method   Equation Method   Unitary Analysis Method
   14.59          20.27             27.82                33.16
   23.44          26.84             24.92                26.93
   25.43          14.71             28.68                30.43
   18.15          22.34             23.32                36.43
   20.82          19.49             32.85                37.04
   14.06          24.92             33.90                29.76
   14.26          20.20             23.42                33.88

The ranks of the data values are shown in the following table, where the two equal values (24.92) are assigned the midrank = (14+15)/2 = 14.5.

Page 47: Nonparametric Statistical Methods

Ranks of Average Test Scores by Teaching Method

            Case Method   Formula Method   Equation Method   Unitary Analysis Method
                  3              8              19                   24
                 13             17              14.5                 18
                 16              4              20                   22
                  5             10              11                   27
                  9              6              23                   28
                  1             14.5            26                   21
                  2              7              12                   25

Rank Sum         49             66.5           125.5                165

Page 48: Nonparametric Statistical Methods

48

The value of the Kruskal-Wallis test statistic equals

  kw = [12 / (N(N+1))] * Σ (ri² / ni) − 3(N+1)
     = [12 / (28·29)] * [(49)²/7 + (66.5)²/7 + (125.5)²/7 + (165)²/7] − 3(29)
     = 18.134.

Since kw = 18.134 > 12.837 = χ²(3, .005), the P-value < .005, from which we can conclude that there are significant differences among the four teaching methods.

Page 49: Nonparametric Statistical Methods

49

SAS code
data test;
  input methodname $ scores;
  cards;
case 14.59
formula 20.27
case 23.44
case 25.43
case 18.15
…
unitary 36.43
unitary 37.04
unitary 29.76
unitary 33.88
;
proc npar1way data=test wilcoxon;
  class methodname;
  var scores;
run;

Page 50: Nonparametric Statistical Methods

50

SAS output:

Wilcoxon Scores (Rank Sums) for Variable scores
Classified by Variable methodname

                        Sum of    Expected     Std Dev        Mean
methodname     N        Scores    Under H0    Under H0       Score
case           7         49.00      101.50    18.845498    7.000000
formula        7         66.50      101.50    18.845498    9.500000
equation       7        125.50      101.50    18.845498   17.928571
unitary        7        165.00      101.50    18.845498   23.571429

Average scores were used for ties.

Kruskal-Wallis Test

Chi-Square           18.1390
DF                         3
Pr > Chi-Square       0.0004
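A hypothetical SciPy cross-check (not from the slides): scipy.stats.kruskal applies the same tie correction as PROC NPAR1WAY, so it reproduces the chi-square 18.1390 rather than the uncorrected hand value 18.134.

# Hypothetical SciPy cross-check of Example 4.1 (not from the slides).
from scipy import stats

case     = [14.59, 23.44, 25.43, 18.15, 20.82, 14.06, 14.26]
formula  = [20.27, 26.84, 14.71, 22.34, 19.49, 24.92, 20.20]
equation = [27.82, 24.92, 28.68, 23.32, 32.85, 33.90, 23.42]
unitary  = [33.16, 26.93, 30.43, 36.43, 37.04, 29.76, 33.88]

H, p = stats.kruskal(case, formula, equation, unitary)
print(H, p)     # roughly 18.14 and 0.0004, matching the SAS output above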

Page 51: Nonparametric Statistical Methods

51

5. Kendall’s correlation coefficient

Page 52: Nonparametric Statistical Methods

52

Kendall: a history
Sir Maurice George Kendall (09/06/1907 - 03/29/1983) was a British statistician widely known for his contributions to statistics.

- Studied mathematics at St John's College, Cambridge.
- Began his work in agriculture and formulated a series of tests for statistical randomness.
- In 1948 wrote a monograph on rank correlation.

Page 53: Nonparametric Statistical Methods

53

Introduction to Kendall's correlation
- Compares the bivariate relationship between ranked pairs (xi, yi).
- For the xi ordered, if the y values have a similar order, there is a strong positive correlation between x and y.
- For the xi ordered, if the y values have the opposite order, there is a strong negative correlation between x and y.
- It is a measure of monotonic association (dependency), but it is NOT suitable for detecting non-monotonic dependency.

Page 54: Nonparametric Statistical Methods

54

Concordant and discordant
- The pair (i, j) is concordant if (xi − xj) and (yi − yj) have the same sign.
- The pair is discordant if (xi − xj) and (yi − yj) have opposite signs.
- The pair is a tie in X if xi = xj; the pair is a tie in Y if yi = yj.
- For n pairs there are n(n−1)/2 comparisons.

Page 55: Nonparametric Statistical Methods

55

Example of calculating concordant and discordant pairs

- To calculate concordant pairs, sort x in ascending order and pair the y-ranks with the x-ranks (Table 5.2).
- Inspect the y-ranks in Table 5.2.
- The first y-rank is 2, succeeded by 10 greater ranks, namely 3, 5, 4, 6, 7, 9, 12, 11, 10, 8, giving 10 concordances.
- Similarly, the next y-rank, 3, is followed by 9 greater ranks, which gives 9 concordances.
- Continuing down the list, the total number of concordances is nc = 10 + 9 + 7 + 7 + 6 + 5 + 3 + 0 + 0 + 0 + 0 = 47.

Table 5.1: A data set showing scores x, y for two examination questions, Q1 and Q2, on the same paper

Q1 x   1  3  4  5  6  8 10 11 13 14 16 17
Q2 y  13 15 18 16 23 31 39 56 45 43 37  0

Table 5.2: Paired ranks for the data in Table 5.1

Q1 x   1  2  3  4  5  6  7  8  9 10 11 12
Q2 y   2  3  5  4  6  7  9 12 11 10  8  1

Page 56: Nonparametric Statistical Methods

56

Example of calculating concordant and discordant pairs (cont'd)

- To calculate discordant pairs, again inspect the y-ranks in Table 5.2 (previous slide).
- The first y-rank, 2, is succeeded by 1 smaller rank, namely 1.
- Similarly, the next y-rank, 3, is followed by 1 smaller rank.
- Continuing down the list, the total number of discordances is nd = 1 + 1 + 2 + 1 + 1 + 1 + 2 + 4 + 3 + 2 + 1 = 19, or alternatively nd = n(n − 1)/2 − nc = 66 − 47 = 19.

Page 57: Nonparametric Statistical Methods

57

Kendall's τ with no ties
Tau statistic:

  tk = (nc − nd) / [n(n − 1)/2],

where nc is the number of concordant pairs, nd is the number of discordant pairs, and n is the sample size. n(n − 1)/2 is the total number of pairs, and tk lies between -1 and 1. The notation tk is used to distinguish the sample statistic from the population value τ.

Page 58: Nonparametric Statistical Methods

58

Kendall's τ with no ties
Intuitive explanations:
- If the rankings of x and y are unrelated, we should expect a roughly equal mix of concordances and discordances, i.e. tk ≈ 0.
- If all pairs are concordant, tk = 1, i.e. there is a monotonic increasing association between x and y.
- If all pairs are discordant, tk = -1, i.e. there is a monotonic decreasing association between x and y.

Page 59: Nonparametric Statistical Methods

59

Kendall's tau: example (the previous score example)

H0: τ = 0 (no association) vs. Ha: τ > 0 (positive association between x and y).
Statistic: tk = (nc − nd) / [n(n − 1)/2] = (47 − 19)/66 = 0.4242.
Rejection region: the 5 percent critical value for a one-tail test when n = 12 is 0.3929 (from the table of critical values for Kendall's tau).

Conclusion: since 0.4242 > 0.3929, there is moderately strong evidence against the hypothesis of no association.

Table 5.1: A data set showing scores x, y for two examination questions, Q1 and Q2, on the same paper

Q1 x   1  3  4  5  6  8 10 11 13 14 16 17
Q2 y  13 15 18 16 23 31 39 56 45 43 37  0
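As a hypothetical cross-check (not from the slides), scipy.stats.kendalltau gives the same tau for these scores; since there are no ties, its tau-b equals the plain tau used here.

# Hypothetical SciPy cross-check of the Q1/Q2 score example (not from the slides).
from scipy import stats

x = [1, 3, 4, 5, 6, 8, 10, 11, 13, 14, 16, 17]
y = [13, 15, 18, 16, 23, 31, 39, 56, 45, 43, 37, 0]

tau, p_two_sided = stats.kendalltau(x, y)
print(tau, p_two_sided)   # tau = 28/66 ~ 0.424; halve the p-value for a one-tail test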

Page 60: Nonparametric Statistical Methods

60

Exact distribution of Kendall's tau

For small samples we may calculate the exact distribution of tk. Let n = 4. If the x-ranks are in ascending order, there are n! = 24 equally likely permutations of the y-ranks under the assumption that x and y have no association.

From Table 5.3 we can see that there are 7 possible values of tau.

Table 5.3: The set of all possible rank orders for n = 4, along with their correlation (tau) with the 'canonical' order 1 2 3 4

Order:  1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24
        1    1    1    1    1    1    2    2    2    2    2    2    3    3    3    3    3    3    4    4    4    4    4    4
        2    2    3    3    4    4    1    1    3    3    4    4    1    1    2    2    4    4    1    1    2    2    3    3
        3    4    2    4    2    3    3    4    1    4    1    3    2    4    1    4    1    2    2    3    1    3    1    2
        4    3    4    2    3    2    4    3    4    1    3    1    4    2    4    1    2    1    3    2    3    1    2    1

tau:    1  2/3  2/3  1/3  1/3    0  2/3  1/3  1/3    0    0 -1/3  1/3    0    0 -1/3 -1/3 -2/3    0 -1/3 -1/3 -2/3 -2/3   -1

Page 61: Nonparametric Statistical Methods

61

Inference on Kendall's tau
The probability distribution of tau for n = 4 (Table 5.4):

tau            1    2/3   1/3    0   -1/3  -2/3   -1
Probability  1/24  3/24  5/24  6/24  5/24  3/24  1/24

The distribution of tk is symmetric. A P-value less than 0.05 is obtained by a one-tail test only for tk = 1, for which P(tk ≥ 1) = 1/24 ≈ 0.042.
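Table 5.4 can be verified by brute force; the sketch below (not from the slides, assuming standard Python only) enumerates all 4! rank orders and tallies the resulting values of tau.

# Hypothetical enumeration check of Tables 5.3 and 5.4 (not from the slides).
from itertools import permutations
from collections import Counter
from fractions import Fraction

n = 4
counts = Counter()
for perm in permutations(range(1, n + 1)):            # y-ranks against the x-order 1..n
    nc = sum(perm[j] > perm[i] for i in range(n) for j in range(i + 1, n))
    nd = n * (n - 1) // 2 - nc                        # no ties, so the rest are discordant
    counts[Fraction(nc - nd, n * (n - 1) // 2)] += 1

for tau in sorted(counts, reverse=True):
    print(f"tau = {tau}: probability {counts[tau]}/24")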

Page 62: Nonparametric Statistical Methods

62

Kendall's tau: large sample approximation
For large n, under H0: τ = 0, tk is approximately normal with mean 0 and variance 2(2n+5)/[9n(n−1)], so

  z = tk / sqrt{2(2n+5)/[9n(n−1)]}

can be referred to the standard normal table.

Page 63: Nonparametric Statistical Methods

63

Kendall's tau-b with ties

Review (no ties): tk = (nc − nd) / [n(n − 1)/2].

Kendall's tau-b:

  tb = (nc − nd) / sqrt[(D − U)(D − V)],

where D = n(n − 1)/2, U = Σ ui(ui − 1)/2, V = Σ vi(vi − 1)/2, and the ui and vi are the numbers of consecutive tied ranks within the x-ranks and the y-ranks respectively.

Page 64: Nonparametric Statistical Methods

64

Kendall's tau-b: example
Life expectancy showed a general tendency to increase during the nineteenth and twentieth centuries as standards of health care and hygiene improved. The extra life expectancy varies between countries, communities and even families. Table 5.5 gives the year of death and age at death for 13 males in a clan we call the McDeltas, buried in the Badenscallie burial ground in Wester Ross, Scotland. Is there evidence that life expectancy has been increasing in this clan in more recent years?

Page 65: Nonparametric Statistical Methods

65

Kendall's tau-b: example

Hypotheses: H0: τb = 0 (no increasing trend in life expectancy) vs. Ha: τb > 0 (life expectancy has increased).

With nc = 53, nd = 23, D = 78, U = 0 (no ties among the years) and V = 2 (two tied pairs of ages),

  tb = (53 − 23) / sqrt[(78 − 0)(78 − 2)] = 30 / sqrt(78·76) = 0.3896.

Table 5.5 Year of death and ages of 13 McDeltas

year 1827 1884 1895 1908 1914 1918 1924 1928 1936 1941 1965 1977
Age    13   18   83   34    1   11   16   13   74   87   65   83

Page 66: Nonparametric Statistical Methods

66

SAS code
DATA kendall;
  INPUT year age;
  datalines;
1827 13
1884 83
…
;
PROC CORR DATA=kendall KENDALL;
  TITLE "Kendall's tau for Life expectancy";
  VAR year age;
RUN;

Page 67: Nonparametric Statistical Methods

67

SAS output

Kendall's tau for Life expectancy

The CORR Procedure

2 Variables: year age

Simple Statistics

Variable    N        Mean     Std Dev      Median     Minimum     Maximum
year       13        1922     39.4832        1924        1827        1977
age        13    48.07692    33.44762    65.00000     1.00000    87.00000

Kendall Tau b Correlation Coefficients, N = 13
Prob > |tau| under H0: Tau=0

           year       age
year    1.00000   0.38964
                   0.0662
age     0.38964   1.00000
         0.0662

Page 68: Nonparametric Statistical Methods

68

6. Spearman rank correlation

6.1 From Pearson to Spearman
6.2 Hypothesis tests & examples

Page 69: Nonparametric Statistical Methods

6.1 From Pearson to Spearman

Pearson product-moment correlation: the population correlation coefficient is

  ρ = Cov(X, Y) / (σX σY),

with -1 ≤ ρ ≤ 1; at the extremes ρ = ±1 there is a perfect linear relationship between X and Y.

Karl Pearson (27 March 1857 – 27 April 1936)

Page 70: Nonparametric Statistical Methods

70

6.1 From Pearson to Spearman

Figure 6.1

X        0   2    3    4     6
Y        0  16   81  256  1296
Rank(X)  1   2    3    4     5
Rank(Y)  1   2    3    4     5

Here Y is a monotone but nonlinear function of X (Y = X⁴), so the Pearson correlation of the raw values is less than 1, while the ranks agree perfectly and the rank correlation equals 1.

Page 71: Nonparametric Statistical Methods

6.1 From Pearson to Spearman: Spearman rank correlation

Charles Edward Spearman (10 Sept. 1863 - 17 Sept. 1945)

Statistician and psychologist. Pioneer of factor analysis. Known for Spearman's rank correlation coefficient and for models of human intelligence.

Page 72: Nonparametric Statistical Methods

72

6.1 From Pearson to Spearman: Spearman rank correlation

• di = rank(Xi) − rank(Yi) are the differences in the ranks.
• n = number of pairs of X's and Y's.

The Spearman rank correlation is the Pearson correlation computed on the ranks ui = rank(Xi) and vi = rank(Yi):

  rs = Σ (ui − ū)(vi − v̄) / sqrt{ [Σ (ui − ū)²] [Σ (vi − v̄)²] }.

With no ties this simplifies to

  rs = 1 − 6 Σ di² / [n(n² − 1)].

Why? Because the ranks are just the integers 1, ..., n, so Σ (ui − ū)² = Σ (vi − v̄)² = n(n² − 1)/12, and expanding di² = (ui − vi)² links the cross-product term to Σ di².

Page 73: Nonparametric Statistical Methods

73

6.2 Hypothesis tests & examples
H0: X and Y are independent
H1: X and Y are associated

The null distribution of the test statistic rs can be derived from the fact that, for any fixed ordering of the Xi's, all orderings of the Yi's are equally likely under H0. Assuming there are no ties, the total number of orderings is n!, each with probability 1/n!. Under H0, rs should be fairly close to 0.

Page 74: Nonparametric Statistical Methods

74

6.2 Hypothesis tests & examples: A. Exact test

Small sample size (n ≤ 10):
Step 1: rank the X's and Y's.
Step 2: compute rs.
Step 3: obtain the p-value:
  upper-tail test: p-value = P(Rs ≥ rs);
  lower-tail test: p-value = P(Rs ≤ rs);
  two-sided test for independence: p-value = 2 P(Rs ≥ |rs|).
The probabilities are obtained from the table of upper-tail probabilities for the Spearman rank correlation (the constant c in Table 6.2).

Page 75: Nonparametric Statistical Methods

75

6.2 Hypothesis tests & examples: Example A

A researcher took blood samples from healthy rabbits and made counts of the heterophils (X) and lymphocytes (Y).

Table 6.1

Rabbit  Heterophils (X)  Lymphocytes (Y)  Rank of X  Rank of Y    d
   1         2276             2830             5          4       1
   2         3724             5488             8          9      -1
   3         2723             2904             7          5       2
   4         4020             5528            10         10       0
   5         4011             4966             9          8       1
   6         2035             3135             3          7      -4
   7         1540             2079             2          2       0
   8         1300             1755             1          1       0
   9         2240             3080             4          6      -2
  10         2467             2363             6          3       3

Page 76: Nonparametric Statistical Methods

6.2 Hypothesis tests & examples: Example A (cont'd)

H0: X and Y are independent vs. Ha: X and Y are correlated.

Calculate rs = 1 − 6 Σ di² / [n(n² − 1)] = 1 − 6(36)/[10(99)] = 0.782.

From Table 6.2, P(Rs ≥ 0.78) ≤ 0.005, so the two-sided P-value ≤ 0.01 and X and Y are significantly associated.

Table 6.2 Upper-tail probabilities for the Spearman rank correlation, n = 10

   c      Prob
 0.01    0.500
 0.05    0.446
 0.10    0.393
 0.15    0.341
 0.20    0.292
 0.25    0.246
  …        …
 0.62    0.030
 0.64    0.027
 0.65    0.024
 0.66    0.022
 0.67    0.019
 0.68    0.017
 0.70    0.015
  …        …
 0.78    0.005
>0.78   <0.005

Page 77: Nonparametric Statistical Methods

77

6.2 Hypothesis tests & examples: B. Large-sample approximation

For large sample size (n > 10), under H0, E(Rs) = 0 and Var(Rs) = 1/(n − 1). One common normal approximation for the Spearman coefficient is

  Z = rs sqrt(n − 1), approximately N(0, 1).

Another approach is to use the Fisher transformation F(rs) = [ln(1 + rs) − ln(1 − rs)]/2, which approximately follows a N(0, 1/(n − 3)) distribution.

Page 78: Nonparametric Statistical Methods

78

6.2 Hypothesis tests & examples: B. Large-sample approximation

Step 1: rank the X's and Y's.
Step 2: compute rs.
Step 3: compute the test statistic Z = rs sqrt(n − 1).
Step 4: compare Z with the critical value (or compute the P-value).

Page 79: Nonparametric Statistical Methods

79

6.2 Hypothesis tests & examples: Example B

Table 6.3

Rabbit  Heterophils (X)  Lymphocytes (Y)  Rank of X  Rank of Y    d
   1         2276             2830             6          5       1
   2         3724             5488            16         17      -1
   3         2723             2904             8          6       2
   4         4020             5528            18         18       0
   5         4011             4966            17         14       3
   6         2035             3135             4          8      -4
   7         1540             2079             3          2       1
   8         1300             1755             1          1       0
   9         2240             3080             5          7      -2
  10         2467             2363             7          3       4
  11         3700             5087            15         15       0
  12         1501             2821             2          4      -2
  13         2907             5130            13         16      -3
  14         2898             4830            12         13      -1
  15         2783             4690            10         12      -2
  16         2870             3570            11         10       1
  17         3263             3480            14          9       5
  18         2780             3823             9         11      -2

Page 80: Nonparametric Statistical Methods

80

6.2 Hypothesis tests & examples: Example B (cont'd)

H0: X and Y are independent vs. Ha: X and Y are correlated.

rs = 1 − 6 Σ di² / [n(n² − 1)] = 1 − 6(100)/[18(323)] = 0.897, so Z = rs sqrt(n − 1) = 0.897 sqrt(17) = 3.70.

The two-sided p-value for Z = 3.70 is 0.0002, so the two variables are significantly associated.

Page 81: Nonparametric Statistical Methods

6.2 Hypothesis tests & examples: Spearman rank correlation in SAS

data Rabbit;
  input x y;
  datalines;
…
;
run;
proc corr data=Rabbit spearman;
  var x y;
  title 'example';
run;
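A hypothetical Python counterpart of the PROC CORR step above (not from the slides), using the Example A data, is scipy.stats.spearmanr:

# Hypothetical SciPy cross-check of Example A (not from the slides).
from scipy import stats

X = [2276, 3724, 2723, 4020, 4011, 2035, 1540, 1300, 2240, 2467]   # heterophils
Y = [2830, 5488, 2904, 5528, 4966, 3135, 2079, 1755, 3080, 2363]   # lymphocytes

rho, p_two_sided = stats.spearmanr(X, Y)
print(rho, p_two_sided)   # rho ~ 0.782, in line with rs in Example A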

Page 82: Nonparametric Statistical Methods

Derivation sketch: write Rs in terms of the y-ranks v1, ..., vn paired with the x-ranks 1, ..., n. Under H0 the vector (v1, ..., vn) is a uniformly random permutation, so each vi has E(vi) = (n+1)/2 and Var(vi) = (n² − 1)/12.

Page 83: Nonparametric Statistical Methods

From E(Rs) = 0 we can get Var(Rs) = E(Rs²).

Page 84: Nonparametric Statistical Methods

Each event {vi = a} appears in (n − 1)! permutations and each pair of events {vi = a, vj = b} appears in (n − 2)! permutations, so P(vi = a) = 1/n and P(vi = a, vj = b) = 1/[n(n − 1)].

Substituting these results, we have Var(Rs) = 1/(n − 1).

Page 85: Nonparametric Statistical Methods

85

7. Summary

Page 86: Nonparametric Statistical Methods

Questions?

86