hypothesis testing - simplifyingstats · in the box is exactly equal to the number of blue balls in...

Hypothesis Testing

Manuj Goel, Akshat Shankar

July 25, 2012


Contents

1 Introduction
  1.1 Inferential Statistics
  1.2 Hypothesis Testing: Motivation
  1.3 Statistical Hypothesis Testing

2 Hypothesis Testing: Procedure
  2.1 State the Hypothesis
  2.2 Set the Criteria for the Decision
    2.2.1 Type I and Type II Errors
    2.2.2 Decision Criteria
    2.2.3 Tradeoff between Type I Error and Type II Error
    2.2.4 Significance Level
  2.3 Compute the Test Statistics
    2.3.1 Test Statistics
    2.3.2 Critical Values
  2.4 Make a Decision
    2.4.1 Evaluating the Test Statistics
    2.4.2 P-Value
    2.4.3 Equivalence: Decision Rule of Test Statistic and Decision Rule of P-Value

3 One-Tailed vs Two Tailed Tests


1 Introduction

1.1 Inferential Statistics

In real life, one is rarely fortunate enough to have all the data that needs to be analyzed. For example, if the task is to find the average height of an Indian, it is practically impossible to collect data on everyone, so an estimate has to be made from a subset (sample) of the entire data (population). Inferential Statistics is the branch of Statistics which helps to make inferences about the population based on a random sample drawn from it. The inferences fall broadly into two categories: estimating the true value of a population parameter, or assessing the validity of some hypothesis about the population. Further, a population parameter can be estimated as a single value (Point Estimation) or as a range of plausible values (Interval Estimation).

The present chapter describes the Art and Science behind Hypothesis Testing. Specific examples of commonly used Hypothesis Tests have not been given prominence; instead, the focus is on the conceptual understanding of the technique.

Figure 1.1: Inferential Statistics mainly consists of three parts: Hypothesis Testing, Estimation Theory and Confidence Intervals.

1.2 Hypothesis Testing: Motivation

Recall the classical experiment from Physics famously known as ‘Galileo’s Leaning Tower of Pisa experiment’. More than two millennia ago, Aristotle had hypothesized that the speed of fall of an object is dependent on its weight. Surprisingly, this ‘hypothesis’ went largely unchallenged until Galileo Galilei thought of ‘testing’ it in the 16th century.

Let us think about how he ‘tested’ the ‘hypothesis’. Galileo wanted to disprove the hypothesis that ‘the speed of fall of an object is dependent on its weight’. He believed instead that ‘the speed of fall of an object is independent of its weight’. Galileo planned to test the hypothesis by dropping two balls of unequal weights from the Leaning Tower of Pisa.

Assuming Aristotle’s hypothesis to be true, the two balls would fall at different speeds. This implies that if the two balls are dropped from the top, the times they take to reach the ground would be different. Hence it is reasonable to adopt the decision rule that if the two balls take almost the same time, the hypothesis is rejected; otherwise Aristotle’s hypothesis continues to stand.

Figure 1.2: Galileo dropped a heavy and a light ball from the Leaning Tower of Pisa and found that both of them reached the ground at almost the same time.

When the two balls of unequal weight were actually dropped from the top, it was found that they reached the ground almost simultaneously. This implied that Aristotle’s hypothesis was rejected while Galileo’s hypothesis was accepted.

While this story has been retold in popular accounts, some historians now believe that it was a thought experiment which did not actually take place. Regardless of the veracity of the story, it will be seen in this chapter that the art of Hypothesis Testing follows almost the same steps.

Summarizing the process which Galileo followed, there are four basic steps in testing a hypothesis: ‘Stating the Hypothesis’, ‘Setting the Criteria for the Decision’, ‘Computing the output of the Experiment’ and ‘Making the Decision’.

The remainder of the chapter deals with applying the same idea in a statistical setting.

1.3 Statistical Hypothesis Testing

In the previous section, it was seen that ‘testing’ a ‘hypothesis’ basically involves comparing the hypothesis with reality. The difficulty with ‘Statistical Hypothesis Testing’ is that it is sometimes impossible to ascertain reality in its entirety.

To understand this point, suppose there is an opaque closed box which contains some red balls and some blue balls. Suppose someone hypothesizes that the number of red balls in the box is exactly equal to the number of blue balls in the box. The obvious way to test this hypothesis would be to open the box and count the red and blue balls. If the numbers are the same, the hypothesis has been proved; otherwise the hypothesis has to be rejected.

Now suppose the box contains 1 million balls (assume either the box is too big or the balls are too small!). In this case, it is practically impossible to count the number of balls of each color. Statistical Hypothesis Testing formulates a way to ‘reject’ or ‘fail to reject’ the hypothesis based on a random sample drawn from the entire population. A Statistical Hypothesis is never accepted; rather, it is ‘not rejected’. The legal world offers an analogy: a person is always assumed to be innocent until proven guilty. Either the person is found guilty, or the prosecution is unable to reject the hypothesis that the person is innocent. Hence, it is never the case that courts prove someone to be innocent.

Suppose 1000 balls are drawn from the box and 900 of them are blue while the remaining 100 are red. In this case, it would be logical to conclude that the hypothesis that the numbers of red and blue balls are equal stands rejected. On the other hand, if the number of blue balls in the sample of 1000 is 509, it would be difficult to reject the hypothesis with great confidence. The more important question is how to decide the exact decision criterion by which the hypothesis is rejected (i.e. do we start rejecting at 600, 550 or 501?).
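To make the contrast between 900 and 509 blue balls concrete, the following sketch computes how likely it would be to see a sample at least that far from an even split if the hypothesis were true. It assumes (this is not stated in the text) that the box is large enough for the 1000 draws to be treated as independent, each ball being blue with probability 1/2; the function name is purely illustrative.

```python
from math import comb

def two_sided_tail_fair(n, observed):
    """P(|X - n/2| >= |observed - n/2|) when X ~ Binomial(n, 1/2)."""
    # Work with 2*k - n so that the distance from n/2 stays an exact integer.
    distance = abs(2 * observed - n)
    favourable = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= distance)
    # Exact big-integer count divided by the 2**n equally likely colour sequences.
    return favourable / 2 ** n

for blue in (501, 509, 550, 600, 900):
    print(blue, two_sided_tail_fair(1000, blue))
```

Under these assumptions a count such as 509 is entirely unremarkable, while 900 is practically impossible, which is why only the latter is convincing evidence against the hypothesis.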

Statistical Hypothesis Testing provides the formal procedure by which a hypothesis is tested probabilistically (i.e. in a statistical setting). As described in the Galileo example, the procedure to test the hypothesis consists of four steps. The only major difference is that, rather than the actual outcome, a statistic of the sample (a function of the sample) is compared with the hypothesis.

Figure 1.3: FlowChart of the Hypothesis Testing Procedure.

2 Hypothesis Testing: Procedure

2.1 State the Hypothesis

The first step in the process of Statistical Hypothesis Testing is to identify the hypothesis which is being challenged. In each problem considered, the question of interest is simplified into two competing hypotheses.

• The Null Hypothesis, denoted H_0 or H_NULL, represents a theory that has been put forward but is still unproven. For example, in the Galileo example, the Null Hypothesis is that the speed of fall of an object is dependent on its weight.

• The Alternative Hypothesis, denoted H_1 or H_ALT, represents the negation of the null hypothesis H_0. For example, in the introductory example the Alternative Hypothesis would be that the speed of fall of an object is independent of its weight.

These two competing hypotheses are not, however, treated on an equal basis: special consideration is given to the Null Hypothesis. The experiment is carried out in an attempt to reject a particular hypothesis which is otherwise assumed to be true. Hence priority is given to the Null Hypothesis, so that it cannot be rejected unless the evidence against it is sufficiently strong. The way this is achieved is discussed in the next section.

2.2 Set the Criteria for the Decision

2.2.1 Type I and Type II Errors

Setting a criterion for testing a non-statistical hypothesis is not very difficult and is done using ‘Proof by Contradiction’. Assuming the hypothesis, it is checked whether a certain outcome contradicts the hypothesis or is consistent with it. Going back to the box example, if the hypothesis that the box contains an equal number of red and blue balls is true, then this should be evident when the box is opened and all the balls are counted. If the number of red balls is found not to equal the number of blue balls, the hypothesis stands rejected.

The problem with statistical hypothesis testing is that we are trying to draw an inference from a sample, as it is not possible to count the entire population. In the example of the box with colored balls, it may be that the box actually has an equal number of blue and red balls, but when 1000 balls were drawn from it, 700 of them happened to be red and 300 blue. Such an outcome is highly improbable, but its possibility cannot be ruled out. Rejecting the (true) hypothesis on the basis of such a sample is called a Type I Error. Similarly, it may be that the box contains more blue balls than red balls, but the sample drawn happens to contain 500 red balls and 500 blue balls. Failing to reject the (false) hypothesis on the basis of such a sample is called a Type II Error.

It can be seen that setting the criteria for the decision becomes slightly involved in Statistical Hypothesis Testing: Proof by Contradiction is not straightforward here, and we need to keep the probabilities of both types of error small.


Figure 2.1: Out of the four possibilities, two result in an error; these are named Type I Error and Type II Error.

• Type I Error: A Type I error occurs when the Null Hypothesis is rejected when it is actually true.

• Type II Error: A Type II error occurs when the Null Hypothesis is not rejected when it is actually not true.
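The two definitions can be turned into numbers for the box example. The sketch below is only an illustration under stated assumptions: the decision rule (‘reject when the blue count in a sample of 1000 is at least 40 away from 500’) and the alternative share of blue balls (55%) are hypothetical choices, not values from the text, and draws are treated as independent.

```python
from math import exp, lgamma, log

def binom_pmf(n, k, p):
    """P(X = k) for X ~ Binomial(n, p), evaluated in log space to avoid overflow."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

def prob_reject(n, cutoff, p):
    """Probability that the rule 'reject when |blue - n/2| >= cutoff' fires,
    given that the true share of blue balls is p."""
    return sum(binom_pmf(n, k, p) for k in range(n + 1) if abs(k - n / 2) >= cutoff)

n, cutoff = 1000, 40                       # hypothetical sample size and rule
type_1 = prob_reject(n, cutoff, 0.50)      # H0 is true, yet the rule rejects it
type_2 = 1 - prob_reject(n, cutoff, 0.55)  # H0 is false (share 0.55), yet not rejected

print(f"Pr(Type I)  = {type_1:.4f}")
print(f"Pr(Type II) = {type_2:.4f}")
```

Note that the Type II probability depends on which alternative is actually true; a different assumed share of blue balls would give a different value.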

As discussed previously, Hypothesis Testing does not treat the Null and Alternative Hypotheses even-handedly. The Null Hypothesis is the established position which one is trying to reject, so the onus lies on us to reject it convincingly. Rejecting a true Null Hypothesis is therefore a bigger sin than failing to reject a false one. For the same reason, a Type I error is often considered to be more serious than a Type II error.

2.2.2 Decision Criteria

It has been seen that in Statistical Hypothesis Testing, a random sample is drawn and, based on a decision rule, it is decided whether the hypothesis is to be rejected or not rejected. So the Decision Rule places each possible sample in either the Rejection Region (hypothesis rejected) or the Acceptance Region¹ (hypothesis not rejected). This means that the set of all possible samples is partitioned into two parts: the Rejection Region and the Acceptance Region. The challenge is to partition the set in such a way that the probabilities of Type I and Type II Errors are minimized.

¹ Acceptance Region is a misnomer, as the Null Hypothesis is never accepted but only ‘fails to be rejected’.


Figure 2.2: Based on the Decision Rule, some samples imply ‘Rejection of Hypothesis’ and others ‘Fail to Reject the Hypothesis’.

2.2.3 Tradeoff between Type I Error and Type II Error

For a fixed sample size, the probabilities of Type I and Type II errors are inversely related: the smaller the probability of one, the higher the probability of the other. Neither can be eliminated without inflating the other, and since a Type I error is considered more hazardous than a Type II error, the test procedure is adjusted so that there is a guaranteed low probability of wrongly rejecting the Null Hypothesis. Type I Error can be removed altogether only if the Decision Rule accepts (fails to reject) the hypothesis for every sample. The Null Hypothesis would then never be rejected, and the set of possible samples would consist only of the Acceptance Region. Even when the hypothesis is false, the Decision Rule would still accept it, so the probability of Type II Error would become 1 (every occurrence of a wrong hypothesis leads to ‘acceptance’). Similarly, if Type II Error is removed by rejecting the hypothesis for every sample, the probability of Type I Error becomes 1.

Figure 2.3: If the Decision Rule starts ‘Not Rejecting the Hypothesis’ for all the samples, Pr(Type I Error) would become 0 while Pr(Type II Error) would become 1.
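The tradeoff, including the two extreme decision rules, can be tabulated for the box example. This is a sketch under assumptions that are not in the text: the rule is ‘reject when the blue count is at least `cutoff` away from 500’, the sample size is 1000, and the Type II error is evaluated for a hypothetical true share of 55% blue balls.

```python
from math import exp, lgamma, log

def binom_pmf(n, k, p):
    """P(X = k) for X ~ Binomial(n, p), evaluated in log space to avoid overflow."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

def error_probabilities(cutoff, n=1000, p_null=0.5, p_alt=0.55):
    """Return (Pr(Type I), Pr(Type II)) for the rule: reject when |blue - n/2| >= cutoff."""
    rejected = [k for k in range(n + 1) if abs(k - n / 2) >= cutoff]
    type_1 = sum(binom_pmf(n, k, p_null) for k in rejected)
    # max() only guards against a tiny negative value caused by floating-point rounding.
    type_2 = max(0.0, 1 - sum(binom_pmf(n, k, p_alt) for k in rejected))
    return type_1, type_2

cutoffs = (0, 10, 20, 30, 40, 50, 501)  # 0 rejects every sample, 501 rejects none
table = [(c, *error_probabilities(c)) for c in cutoffs]
for c, type_1, type_2 in table:
    print(f"cutoff={c:3d}  Pr(Type I)={type_1:.4f}  Pr(Type II)={type_2:.4f}")
```

As the cutoff grows, the first column of probabilities falls while the second rises; the last row is the ‘never reject’ rule described above.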


2.2.4 Significance Level

The significance level of a statistical hypothesis test is the probability of Type I Error that the experimenter is willing to tolerate. That is, if one wants to be very sure before rejecting the Null Hypothesis, the significance level can be decreased, which decreases the probability of wrongly rejecting the Null Hypothesis. Similarly, if one is concerned about wrongly accepting (not rejecting) the Null Hypothesis, the Significance Level should be increased. The challenge is to come up with a procedure for drawing the Decision Boundary which ensures that the probability of Type I Error is exactly equal to the Significance Level.
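As an illustration of drawing the Decision Boundary from a Significance Level, the sketch below searches for the tightest rule of the form ‘reject when the blue count is at least `c` away from n/2’ in the box example. The rule shape and function names are assumptions made for the example. Because a count is discrete, the probability of Type I Error usually cannot hit the Significance Level exactly, so the sketch settles for the smallest cutoff whose Type I probability does not exceed it.

```python
from math import exp, lgamma, log

def binom_pmf(n, k, p):
    """P(X = k) for X ~ Binomial(n, p), evaluated in log space to avoid overflow."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

def smallest_cutoff(n, alpha):
    """Smallest integer c with P(|X - n/2| >= c) <= alpha when X ~ Binomial(n, 0.5).

    Returns the pair (c, achieved Type I probability)."""
    pmf = [binom_pmf(n, k, 0.5) for k in range(n + 1)]
    for c in range(n // 2 + 2):
        type_1 = sum(pk for k, pk in enumerate(pmf) if abs(k - n / 2) >= c)
        if type_1 <= alpha:
            return c, type_1
    return None  # not reached for alpha >= 0: the largest c tried rejects no sample

cutoff, achieved = smallest_cutoff(1000, 0.05)
print(cutoff, achieved)
```

Lowering the Significance Level pushes the cutoff outwards, which is the ‘be very sure before rejecting’ behaviour described above.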

2.3 Compute the Test Statistics

2.3.1 Test Statistics

In the previous section it was seen that the basic challenge in drawing the Decision Boundary is to ensure that the probability of a wrong rejection (Type I Error) is equal to the desired tolerance (Significance Level). For this it is necessary to know the probability distribution of the samples. Once the probability distribution is found, the next problem is to divide the set of samples into Acceptance and Rejection Regions. Given that the size of a sample is normally quite large (possibly in the 1000s), coming up with a decision rule for every such 1000-dimensional vector is a non-trivial task. Both of these problems are tackled by defining a function which maps each sample to a real number. The function is constructed such that, when the Null Hypothesis holds, the probability distribution of the resulting values is a standard distribution.

Figure 2.4: Test Statistic maps each sample to the Real Line and ensures that the resulting distribution is a standard distribution.

In the subject of Statistics, the term ‘statistic’ means any function of the sample data. A test statistic is a function of the sample data which is used to test the hypothesis. As discussed earlier, the Test Statistic is a real-valued number which follows some standard probability distribution. In fact, specific Hypothesis Tests are known by the name of the underlying distribution. For example, the name ‘T Test’ implies that the Test Statistic follows the T Distribution, the name ‘F Test’ implies that the Test Statistic follows the F Distribution, and so on.
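For the box example, one possible test statistic is the usual one-sample proportion z-statistic, which maps the whole sample of 1000 colors to a single real number. The choice of this statistic is an assumption made for illustration (the text does not name one); under the Null Hypothesis and for a large sample it is approximately standard normal.

```python
from math import sqrt

def z_statistic(blue, n, p0=0.5):
    """Proportion z-statistic: (sample share - hypothesized share) / standard error under H0."""
    p_hat = blue / n
    standard_error = sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / standard_error

for blue in (500, 509, 550, 900):
    print(blue, round(z_statistic(blue, 1000), 3))
```

A sample that matches the hypothesis exactly maps to 0, and samples further from an even split map to values further from 0 on the real line.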

2.3.2 Critical Values

The Test Statistic is a real-valued number (one dimensional), and the number line can be partitioned into two parts: the Acceptance Region and the Critical Region (Rejection Region).

• Acceptance Region: If the test statistic falls in the Acceptance Region then the Null Hypothesis cannot be rejected. It should be noted that the Null Hypothesis is either rejected or the evidence suggests that the hypothesis cannot be rejected.

• Rejection/Critical Region: If the test statistic falls in the Critical Region then the Null Hypothesis stands rejected.

The shape of the Critical and Acceptance Regions for the Test Statistic shown in the Figure is standard for a Null Hypothesis of the form H_0 : θ = θ_0. The reason is that the Null Hypothesis implies an expected value of the sample statistic. If the sample statistic obtained is much less than or much greater than that value, the hypothesis can be rejected. Hence the Critical Region lies at the extreme ends of the Real Line.²

Figure 2.5: Real Line should be divided into Critical Region and Acceptance Region.

The problem is to find the critical values on the real line, i.e. the values which define the decision boundary. As stated earlier, the sample statistic is chosen in such a way that its probability distribution is a standard distribution. Assuming the hypothesis to be correct, the sample statistic follows this standard probability distribution. Using this distribution, Critical Values are chosen such that the probability of the Critical Region (which is the probability of Type I Error) is equal to a predefined tolerance (Significance Level).

² For the same reason, tests of the form H_0 : θ = θ_0 are called ‘Two Tailed Tests’.


Figure 2.6: Using the Distribution, Critical Values are found such that the Probability of the Rejection Region is equal to a pre-defined tolerance level (Significance Level).
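When the Test Statistic is (approximately) standard normal, as assumed here for illustration, the two critical values can be read off the inverse of the distribution function. The sketch uses `NormalDist` from Python's standard `statistics` module and splits the Significance Level equally between the two tails.

```python
from statistics import NormalDist

def two_tailed_critical_values(alpha):
    """Left and right critical values of a standard normal test statistic,
    leaving probability alpha / 2 in each tail."""
    upper = NormalDist().inv_cdf(1 - alpha / 2)
    return -upper, upper

lower, upper = two_tailed_critical_values(0.05)
standard_normal = NormalDist()
# Probability of the Critical Region under H0: left tail plus right tail.
critical_region_probability = standard_normal.cdf(lower) + (1 - standard_normal.cdf(upper))
print(lower, upper, critical_region_probability)
```

For a Significance Level of 0.05 this gives the familiar values of about -1.96 and 1.96, and the probability of the Critical Region equals the Significance Level by construction.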

2.4 Make a Decision

2.4.1 Evaluating the Test Statistics

The Decision Making part is the easiest step in Hypothesis Testing. In the Galileo example, it was seen that when the results of the experiment contradicted what the hypothesis predicted, the hypothesis was rejected. The only difference in Statistical Hypothesis Testing is that, rather than the actual result, the Test Statistic is checked against the Decision Rule. If the Test Statistic belongs to the Critical Region (discussed in the previous section), the hypothesis is rejected; otherwise the test fails to reject the Null Hypothesis.

Figure 2.7: Decision Rule for the Test Statistic.
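Putting the pieces together for the box example, the decision step reduces to a comparison. The sketch below assumes the proportion z-statistic and standard normal critical values (both illustrative choices, not prescribed by the text) and a sample of 1000 balls.

```python
from math import sqrt
from statistics import NormalDist

def decide(blue, n, p0=0.5, alpha=0.05):
    """Two-tailed decision for H0: share of blue balls equals p0."""
    z = (blue / n - p0) / sqrt(p0 * (1 - p0) / n)    # test statistic
    upper = NormalDist().inv_cdf(1 - alpha / 2)      # right critical value
    lower = -upper                                   # left critical value
    in_critical_region = z <= lower or z >= upper
    return "reject H0" if in_critical_region else "fail to reject H0"

for blue in (509, 550, 900):
    print(blue, decide(blue, 1000))
```

Note that the function never returns ‘accept’: in line with the earlier discussion, the Null Hypothesis is either rejected or not rejected.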


2.4.2 P-Value

The decision about the Hypothesis can be made simply by comparing the Test Statistic with the left and right critical values. In a practical setting, this means computing the test statistic as well as computing (or looking up) the critical values for the desired significance level. Since the mathematical formulae for the distributions are normally quite involved, computing the critical values is a cumbersome job. To avoid this hassle, most statistical software outputs a number called the P-Value, which can be compared directly with the Significance Level.

Figure 2.8: Decision Rule based on the P Value.

Definition: If the Null Hypothesis is assumed to be true, the P-Value is the probability of getting an outcome as bad as, or worse than, the one actually obtained. ‘Worse’ is defined on the basis of how distant the outcome is from what the hypothesis predicts.

Figure 2.9: Assuming the Hypothesis to be true, P-Value is the probability of getting an outcome as bad as or worse than what we have got.
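For a two-tailed test with an (assumed) standard normal test statistic, the definition translates into the probability mass lying at least as far from zero as the observed statistic, counted on both sides. The sketch applies this to the box example; the z-statistic and the function name are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p_value(blue, n, p0=0.5):
    """Probability, under H0, of a z-statistic at least as far from 0 as the observed one."""
    z = (blue / n - p0) / sqrt(p0 * (1 - p0) / n)
    one_tail = 1 - NormalDist().cdf(abs(z))   # mass beyond |z| on one side
    return 2 * one_tail                       # 'worse' is counted on both sides

for blue in (500, 509, 550):
    print(blue, two_sided_p_value(blue, 1000))
```

The resulting number is compared directly with the Significance Level: the hypothesis is rejected when the P-Value does not exceed it.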

2.4.3 Equivalence: Decision Rule of Test Statistic and Decision Rule of P-Value

In this section, it will be seen that a hypothesis can be rejected either on the basis of the Test Statistic or on the basis of the P-Value, and that the two approaches are exactly equivalent.

Suppose the Test Statistic belongs to the Acceptance Region (to the left of the right critical value and to the right of the left critical value). As per the Test Statistic Decision Rule, the hypothesis would not be rejected.

If the Test Statistic is to the left of the right critical value, the probability on the right tail of getting an outcome worse than what we have got is more than the probability of wrongly rejecting on the right side. As ‘worse’ is defined symmetrically on both sides of the distribution, the probability of getting a worse outcome on the left side is likewise more than the probability of wrongly rejecting on the left side. The P-Value is the probability of getting a worse outcome on both sides, and the Significance Level is the probability of wrongly rejecting on both sides. So the P-Value is more than the Significance Level, and according to the P-Value Decision Criteria the hypothesis would not be rejected.

By a similar argument, if the Test Statistic belongs to the Critical Region, both Decision Rules lead to the same inference.

Figure 2.10: Equivalence: Decision Rule based on Test Statistic, Decision Rule based on P Value.
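The equivalence can also be checked numerically. The sketch below assumes a standard normal test statistic and a two-tailed test, evaluates both Decision Rules over a grid of statistic values and several Significance Levels, and collects any case in which they disagree. The grid and the levels are arbitrary illustrative choices.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def rejects_by_statistic(z, alpha):
    """Decision Rule 1: the statistic lies at or beyond a critical value."""
    upper = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    return abs(z) >= upper

def rejects_by_p_value(z, alpha):
    """Decision Rule 2: the two-sided P-Value does not exceed the Significance Level."""
    p_value = 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))
    return p_value <= alpha

grid = [i / 100 for i in range(-400, 401)]   # statistic values from -4.00 to 4.00
disagreements = [(z, alpha)
                 for alpha in (0.01, 0.05, 0.10)
                 for z in grid
                 if rejects_by_statistic(z, alpha) != rejects_by_p_value(z, alpha)]
print(len(disagreements))
```

An empty list of disagreements is what the argument above predicts; a statistic lying exactly on a critical value could in principle be affected by floating-point rounding, which is why the grid avoids such points.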

3 One-Tailed vs Two Tailed Tests

The above discussion focused on a specific type of hypothesis, H_NULL : θ = θ_0. It may be the case that one is not hypothesizing about the exact value of an unknown, but about whether the unknown is less than or greater than a certain quantity.

The Null Hypothesis takes the form

H_NULL : θ ≥ θ_0    (3.1)

The Alternate Hypothesis which is the complement of the Null Hypothesis becomes

H_ALT : θ < θ_0    (3.2)


Figure 3.1: Hypothesis is classified into two parts: Simple Hypothesis and Composite Hypothesis.

The general approach taken in testing a Simple Hypothesis was to assume the hypothesis to be correct and, on that basis, compute the distribution of the Test Statistic. In Composite Hypothesis Testing, assuming the Null Hypothesis to be correct implies a range of values for the unknown. Take the example of the box with colored balls, and suppose the hypothesis is that the box contains at least 50% blue balls. The Null Hypothesis then implies that θ = 50% or θ = 60% or θ = 99.24%, etc. Each such assumption leads to a different distribution of the Test Statistic. One way would be to reject the hypothesis only if it is rejected in every such case, but the number of possible cases is infinite, so it is practically impossible to test in this way.

Figure 3.2: The Critical Region comprises only the left tail. The Critical Value is the point for which the probability of getting a value less than it is equal to the Significance Level.

Thankfully, it can be shown that if the hypothesis can be rejected for the case θ = 50%, it can be rejected for all the other cases as well. The difference here is that the Rejection Region changes from both tails to only the tail which contradicts the hypothesis. For example, getting 990 blue balls should not reject the hypothesis that θ ≥ 50%, but getting 20 blue balls should reject it. Hence only the left tail forms the Rejection Region.
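The one-tailed rule and the role of the boundary case θ = 50% can be sketched for a sample of 1000 balls. The proportion z-statistic, the normal critical value and the sample size are assumptions made for illustration. The second function computes, from the exact binomial distribution, how often the rule would reject for different true values of θ inside the Null Hypothesis.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

def rejects_left_tailed(blue, n, p0=0.5, alpha=0.05):
    """Reject H0: theta >= p0 only when the z-statistic falls in the left tail."""
    z = (blue / n - p0) / sqrt(p0 * (1 - p0) / n)
    critical_value = NormalDist().inv_cdf(alpha)   # negative, e.g. about -1.645
    return z <= critical_value

def binom_pmf(n, k, p):
    """P(X = k) for X ~ Binomial(n, p), evaluated in log space to avoid overflow."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

def rejection_probability(theta, n=1000, p0=0.5, alpha=0.05):
    """Probability that the rule rejects when the true share of blue balls is theta."""
    return sum(binom_pmf(n, k, theta)
               for k in range(n + 1)
               if rejects_left_tailed(k, n, p0, alpha))

print(rejects_left_tailed(990, 1000), rejects_left_tailed(20, 1000))
for theta in (0.5, 0.6, 0.9924):
    print(theta, rejection_probability(theta))
```

Among the values of θ allowed by the Null Hypothesis, the probability of a wrong rejection is largest at the boundary θ = 50%, so controlling the Type I Error there controls it for every other case.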

