Computer-Intensive Statistical Methods: Simulation, Randomization, Bootstrapping


Page 1:

Computer-Intensive Statistical Methods

• Simulation

• Randomization

• Bootstrapping

Page 2:

Simulation

A computer-intensive method used in hypothesis testing, where the major challenge is to determine the null distribution.

Simulation uses a computer to mimic, or “simulate,” sampling from a population under the null hypothesis.

The computer creates an imaginary population whose parameters are specified by a null hypothesis. The computer then samples the population and calculates a test statistic. This process is repeated a large number of times.

Page 3:

Example 19.1

350 people are asked to choose a 2-digit number.

Q: Are all 2-digit numbers selected with equal probability?

A: Histogram suggests not. Distribution mode in low 20s, highly skewed, few pick multiples of 10.

Page 4:

Example 19.1

How do we test this 2-digit hypothesis?

Ho: Two-digit numbers are chosen with equal probability.
Ha: Two-digit numbers are NOT chosen with equal probability.

The population distribution is unknown, but the chi-square goodness-of-fit test looks like a good approximation.

There are 90 categories of outcome (10-99), so the expected frequency for each category is EXP = 350/90 = 3.89.
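To see the arithmetic in R (a minimal sketch, not from the slides; the vector chosen, holding the 350 reported numbers, is hypothetical):

> exp.freq<-350/90      ## expected count per category under Ho
> exp.freq
[1] 3.888889
> counts<-table(factor(chosen, levels=10:99))   ## factor() keeps categories nobody picked
> chisq.obs<-sum((counts-exp.freq)^2/exp.freq)  ## observed chi-square statistic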

Page 5:

Example 19.1

VIOLATION! The problem is that with EXP = 3.89 in every category, we have violated the assumption that no more than 20% of categories have an expected frequency < 5.

The result is that the null distribution is probably not a true chi-square distribution, even though the chi-square appears to be a good first approximation.

How can we generate a null distribution, and from it an appropriate test statistic and P-value, to carry out a valid hypothesis test?

Use SIMULATION!

Page 6:

Example 19.1

Let's walk through the process first:

1) Use a computer to create and sample an imaginary population whose parameter values are those specified by the Ho.

2) Calculate the test statistic from the simulated data.

3) Repeat steps 1 & 2 thousands of times.

4) Gather up all the simulated values of the test statistic to form the null distribution.

5) Compare test stat from data to null distribution.

Page 7:

Example 19.1

Page 8:

Example 19.1

Page 9:

R has many built-in functions for doing computer-intensive resampling.

Example 19.1

> data<-seq(10,99,1)   # create the data set of 10 thru 99
> data
  [1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
 [19] 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
 [37] 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
 [55] 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
 [73] 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
> obs<-sample(data,350,replace=TRUE)
> obs
  [1] 97 65 28 26 57 80 43 77 25 55 60 58 66 51 17 31 70
 [18] 24 15 43 73 96 33 63 74 95 18 24 55 79 42 98 58 16
 [35] 52 10 25 61 93 64 11 77 59 90 56 15 22 52 40 61 27
 [52] 37 53 75 53 70 19 10 61 28 98 98 76 93 10 71 73 17
 [69] 39 48 59 70 72 88 84 24 92 26 40 97 89 60 16 30 79
 [86] 32 85 17 15 93 39 62 25 94 30 14 70 19 62 63 87 76
[103] 29 56 98 21 45 93 22 31 40 27 79 93 56 72 34 44 99
[120] 30 32 30 21 77 19 60 27 86 52 34 13 91 86 53 16 13
[137] 41 10 67 91 25 88 68 38 89 86 74 78 96 72 68 37 25
[154] 98 48 50 69 89 94 88 69 62 99 50 54 47 28 96 99 57
[171] 62 83 43 75 69 37 42 11 66 67 57 25 68 59 66 98 71
[188] 61 21 70 96 71 50 21 32 95 91 10 85 87 12 66 72 15
[205] 11 74 27 16 68 83 96 92 33 71 19 80 65 33 27 48 43
[222] 55 18 95 65 27 40 35 78 61 16 91 82 16 52 23 19 26
[239] 82 18 55 56 23 97 59 73 48 31 72 91 78 75 76 62 40
[256] 66 59 62 82 66 81 43 70 92 70 79 81 97 55 29 36 48
[273] 15 31 60 91 94 70 39 88 62 10 19 57 29 61 66 56 75
[290] 71 93 28 71 70 97 57 87 42 62 39 24 23 20 52 29 17
[307] 72 53 96 96 95 92 46 74 39 47 16 65 73 17 98 37 76
[324] 95 16 76 98 20 97 45 90 99 82 77 98 76 38 22 51 14
[341] 94 29 84 58 32 78 81 14 70 33

Page 10:

> resampled=replicate(100,(sample(data,350,replace=TRUE)))
> # Produces resamples by columns, each column being a new resample
> str(resampled)
 num [1:350, 1:100] 18 36 42 96 85 89 45 76 62 81 ...
> chi.function=function(x) {
+   freqs=table(x)                 # NB: table() drops any category with zero picks
+   sum((freqs-3.89)^2/3.89)}
> resampled.chi<-apply(resampled,2,chi.function)
> resampled.chi
  [1]  97.62388  92.48249  89.55285 106.36424  92.41458
  [6]  56.64797  82.13180  53.18483  65.01003  65.61144
 [11]  89.55285  71.84902  86.31283  98.29321  75.51591
 [16]  69.27833  98.80735  89.84388 102.02802  95.20838
 [21]  81.17144  89.84388  83.96524 117.96632  84.99352
 [26]  87.56422  68.47316  77.57247  71.62591 104.75391
 [31]  99.09838  90.13491  78.30972 105.78219  85.73077
 [36]  79.27008  91.74524  96.01355  83.67422  82.71386
 [41]  62.37141  85.21663  83.89733  78.30972  82.42283
 [46]  63.17658  96.08147  64.71900  64.94211  69.86039
 [51]  72.14005  80.58938  87.05008  91.96835  96.88663
 [56]  79.11488 102.40632  66.99866  68.83211  93.28766
 [61]  78.24180  80.94833  89.10663  61.41105  64.20486
 [66]  80.58938  81.68558  63.17658  98.36113  54.14519
 [71]  94.98527 105.55907  67.28969 117.38427  72.65419
 [76]  97.33285  73.97350  83.96524  99.90355  80.36627
 [81]  91.45422  88.88352  71.33488  84.99352  94.24802
 [86]  78.82386  94.47113  77.50455  69.86039  75.00177
 [91]  76.47627  79.40591  93.28766  59.28658  83.22799
 [96]  72.65419  75.00177  65.45625  80.65730 101.15494
> # Output is chi-square values for each simulation
> hist(resampled.chi)

Page 11:

This graph bears a very close resemblance to Fig. 19.1-2. We should probably extend to 1,000 or 10,000 resamples to get a better approximation. Our observed chi-square value was 1111.4, much higher than anything simulated here, suggesting that Ho should be rejected.
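One step left implicit here is converting that comparison into a P-value. A minimal sketch, using the resampled.chi values from the previous page and the observed value of 1111.4:

> chi.obs<-1111.4
> mean(resampled.chi>=chi.obs)   ## proportion of simulated values at least as extreme
[1] 0
> ## none were as extreme, so with 100 resamples we can only say P < 0.01;
> ## more resamples (e.g., 10,000) would pin the P-value down further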

Page 12:

Randomization

Used to test the hypothesis that two variables are associated:
- between two categorical variables (sensu chi-square)
- between a categorical variable and a numerical variable (sensu t-test)
- between two numerical variables (sensu correlation)

Use randomization procedures when the assumptions of the standard tests are in doubt.

Randomization tests have been shown to be more powerful than most nonparametric tests.

Page 13:

Randomization

In a randomization test, a test statistic is chosen that measures the appropriate type of association (e.g., χ2, t, r, etc.).

Assignment to groups (e.g., A vs. B) is scrambled to yield a randomized data set, so any association present at the beginning is lost (yet the data values themselves stay intact).

The randomization procedure is repeated many times to build a null distribution. If the observed value of the test statistic is unusual compared to this null distribution, reject Ho.
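To make the recipe concrete for the correlation case listed above, here is a minimal sketch (the vectors x and y are hypothetical placeholders for two numerical variables measured on the same individuals):

> r.obs<-cor(x,y)                             ## observed association
> r.null<-replicate(9999, cor(sample(x),y))   ## scramble one variable, recompute r
> mean(abs(r.null)>=abs(r.obs))               ## two-tailed permutation P-value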

Page 14:

Randomization

CAUTION:

Do NOT confuse "randomization tests" with "randomization."

The latter deals with the assignment of treatments to individuals as part of the experimental design process.

Page 15:

Example 19.2

Pseudoscorpions live in tropical forests. They feed on rotting figs. Females are promiscuous, and often mate multiple times. It is unclear what the advantage of this is.

One possibility is that it has to do with sperm compatibility. Multiple matings may increase chance of fertilization.

To test this hypothesis, experimenters assigned scorpions to two different groups: DM = mated to 2 different males, SM = mated to a single male. Then they compared the number of successful broods.

Page 16:

Example 19.2

Because the mean number of successful broods in each treatment group is compared, this is clearly a two-sample comparison of the t-test variety.

First, let's look at the raw data; then, as usual, let's plot the data to see what we might be working with and find a good starting point for the analysis.

Page 17:
Page 18:

Example 19.2

Page 19:

Example 19.2

How to proceed?

From the histograms, we can tell that the data are not normally distributed (though SM passes shapiro.test, it is clearly skewed to the right).

A Mann-Whitney U-test seems like a potentially safer route; however, it is also clear from the histograms that these two samples do not come from distributions of the same shape, which violates the nonparametric test's assumption.

A randomization test also assumes similar shapes, but it is much more robust to departures from this assumption.

Page 20:

Example 19.2

To carry out the randomization test, we need to decide on a test statistic. There are multiple ways to solve this problem using randomization procedures. The book suggests just using the difference between sample means:

From the data:

Ȳ_SM - Ȳ_DM = 2.2 - 3.625 = -1.425

My inclination would be to use a t-test procedure since this is a two-sample problem. We can make use of the R package coin (Conditional Inference Procedures in a Permutation Test Framework) to solve this simply.
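For comparison, the book's statistic can also be permuted directly in a few lines of base R. This is only a sketch, assuming the trt and broods columns from the pseudo.csv file introduced on the next page:

> d.obs<-diff(tapply(broods,trt,mean))      ## mean(SM) - mean(DM) = -1.425, since levels sort DM, SM
> d.null<-replicate(9999,
+   diff(tapply(broods,sample(trt),mean)))  ## scramble group labels, recompute the difference
> mean(abs(d.null)>=abs(d.obs))             ## two-tailed permutation P-value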

Page 21:

Example 19.2

First, organize your data into a CSV file with two columns. The first will be a factor variable (trt: DM or SM); the other will be numeric (broods: number of successful broods).

Next, download and install the coin package, along with its required dependencies.

Explore the coin package documentation. The oneway_test() function in coin provides a two- or K-sample permutation procedure, which is exactly what we are looking for.

Page 22:

> library(coin)
> data<-read.csv("pseudo.csv",header=TRUE)
> data
   trt broods
1   SM      4
2   SM      0
3   SM      3
4   SM      1
5   SM      2
6   SM      3
...

> attach(data)
> oneway_test(broods~trt, data=data,
+   distribution=approximate(B=9999))

Approximative 2-Sample Permutation Test

data:  broods by trt (DM, SM)
Z = 2.2481, p-value = 0.02550
alternative hypothesis: true mu is not equal to 0

Page 23:

By way of exploration, what would have happened if we had ignored the assumptions and done a straight-up two-sample t-test or ANOVA?

> t.test(broods~trt)

Welch Two Sample t-test

data:  broods by trt
t = 2.3424, df = 28.883, p-value = 0.02627
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1805508 2.6694492
sample estimates:
mean in group DM mean in group SM
           3.625            2.200

Example 19.2

Page 24:

Using the book's solution of the difference between means and knowledge of the normal distribution, they found the P-value to be approximately 0.0176 in one tail.

Therefore, for a two-tailed test, the P-value is approximately 2(0.0176) = 0.0352, and we reject the null hypothesis. Our permutation P-value using the coin procedure was 0.02550, a similar result (permutation P-values will always differ slightly between runs because the permutations are sampled at random).

So, female pseudoscorpions have more successful broods when they mate with different males (DM) than when they mate with the same male (SM) repeatedly. In this species, there is a reproductive advantage to females that are promiscuous and mate with more than one male.

Example 19.2

Page 25:

Bootstrapping

The bootstrap is another computer-intensive procedure, generally reserved for approximating the sampling distribution of an estimate.

Unlike simulation and randomization, the bootstrap is not directly intended for hypothesis testing. Instead, it is used to find the standard error or confidence interval for a parameter estimate.

The bootstrap is especially useful when (1) there is no known formula for the SE/CI, or (2) when the sampling distribution of interest is unknown.

Page 26:

Bootstrapping

Remember when we first introduced the concept of the standard error? It was presented as being the standard deviation of the sampling distribution.

In essence, the SE represents the SD of the estimates obtained from many samples taken from the population (we used the example of 20 samples at the time and introduced the notion of making a mistake 1 time in 20, or 0.05). Recall?

In reality, we can't really do this; sampling is limited by time, money, and effort. Bootstrapping replicates this general procedure using a form of repeated sampling: instead of drawing new individuals from the population, it repeatedly draws samples (with replacement) from the original sample. This is called resampling.

Page 27:

Bootstrapping

We can resample a small data set to provide an accurate SE and CI around a median.

Example 19.3 uses asymmetry scores in brain function of chimps to illustrate. We can easily calculate the median of this sample as 0.205 (see the R output below).

However, what would be of great use to us would be a determination of uncertainty of this estimate of the population median. Let's calculate the 95% CI around this median.

Page 28:

Example 19.3

> asym<-c(0.3,0.16,0.24,0.25,0.36,0.17,0.11,
+ 0.12,0.34,0.32,0.71,0.09,1.12,-0.22,1.19,
+ 0.01,-0.24,0.24,-0.30,-0.16)
> median(asym)   ## straight-up arithmetic median
[1] 0.205
> hist(asym, col="red")

Page 29:

Example 19.3

Let's look at a couple of different ways to solve this problem using various tools in R.

Perhaps the simplest way to start with understanding bootstrapping is to utilize the sample function, which takes the form:

sample(x, size, replace, prob)

If you haven't already done so, install the package boot and associated packages.

Page 30:

Example 19.3

> median(asym)
[1] 0.205
> sample(asym,10, replace=TRUE)
 [1]  0.25  0.24  0.34  0.24 -0.24  0.24 -0.22  0.11  1.12  0.11
> S1<-sample(asym,10,replace=TRUE)
> median(S1)
[1] 0.24
> S2<-sample(asym,10,replace=TRUE)
> median(S2)
[1] 0.215
> S3<-sample(asym,10,replace=TRUE)
> median(S3)
[1] 0.11
> S<-c(S1,S2,S3)
> median(S)
[1] 0.205   ## NB: Same as textbook result

See what's happening here? Let's start over and do this 9999 times!

Page 31:

Example 19.3

> resample<-lapply(1:9999, function(i)
+   sample(asym,10,replace=TRUE))
> asym.median<-sapply(resample,median)

> hist(asym.median, col="red")

R generates 9,999 medians from resamples of 10 values at a time.

To find the 95% CI is simple. In the rank-ordered sequence of the 9,999 values, the 250th and 9,750th values will be your lower and upper CI limits, respectively (i.e., the 2.5% and 97.5% points).

Take asym.median from your sapply() call, rank-order it, and identify those values:

> asym.median<-sort(asym.median)   ## rank order the bootstrap medians
> asym.median[250]
[1] -0.075
> asym.median[9750]
[1] 0.300
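Two side notes, offered as a sketch rather than as the book's method: quantile() pulls out the same percentile limits without manual indexing, and a conventional nonparametric bootstrap would resample all 20 observations (the original sample size) with replacement rather than 10 at a time:

> quantile(asym.median, c(0.025, 0.975))   ## same 2.5% and 97.5% limits as above
> full.medians<-replicate(9999,
+   median(sample(asym,length(asym),replace=TRUE)))   ## resample n = 20 each time
> quantile(full.medians, c(0.025, 0.975))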

Page 32:

Example 19.3

Another approach is to use the R packages bootstrap and boot. Download and install those, along with their supporting packages.

> library(boot)
> m.asym<-function(x,i) median(x[i])   ## statistic: median of the resampled values
> med.boot<-boot(data=asym,statistic=m.asym,R=999)
> med.boot

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = asym, statistic = m.asym, R = 999)

Bootstrap Statistics :
    original        bias    std. error
t1*    0.205  -0.005925926  0.05965346

Note the estimate of the median and its associated SE, consistent with the resampling approach above.
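To see where that std. error comes from: a boot object stores the 999 bootstrapped medians in med.boot$t and the original estimate in med.boot$t0, so (as a quick check, not part of the book's example) the reported SE and bias are essentially:

> sd(med.boot$t)                  ## should closely match the std. error above
> mean(med.boot$t)-med.boot$t0    ## and this is the reported bias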

Page 33:

Example 19.3

> boot.ci(med.boot)

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates

CALL :
boot.ci(boot.out = med.boot)

Intervals :
Level      Normal               Basic
95%   ( 0.0940,  0.3278 )   ( 0.1000,  0.3100 )

Level     Percentile            BCa
95%   ( 0.10,  0.31 )     ( 0.09,  0.30 )

Page 34:

The End!

Reminders:

TU, 23-APR-13, NOW: (1) Term paper due (LB)

TH, 25-APR-13, 0900-1100: (1) Review (2) LaTeX extra credit due (BCM)

TH, 2-MAY-13, 0900-1020: Final Exam. This room, same format.

Don't forget Electronic Class Evaluations!