Statistical Power and its Subcomponents: Missing and Misunderstood Concepts in Software Engineering Empirical Research

By James Miller et al.

Presented by Siv Hilde Houmb, 1 November 2002

Outline

• Statistical Power Analysis

• Statistical significance testing

• Direction/Non-direction of statistical test

• Parametric and Non-parametric tests

• Calculating Statistical Power

• Normative and judgmental approaches to effect size estimation

Introduction

• The authors' informal review of the software engineering empirical literature found few articles that report the statistical power of the described experiment.

• Why do we need it? It gives us the power to identify a false null hypothesis.

Statistical Power Analysis

• Statistical power analysis is a method of increasing the probability that an effect is found in the empirical study

• Reject or accept a null hypothesis (denoted H0)

– H1: alternative hypothesis

• Statistical power is defined as the probability that a statistical test will correctly reject a false null hypothesis.

• Example: a power level of 0.4 means that if an experiment is run ten times, an existing effect will on average be discovered in only four of the ten runs.
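
The "four times out of ten" reading can be checked with a small simulation. The sketch below is not from the presentation: it assumes a two-group comparison with normally distributed scores, a known standard deviation of 1, a hypothetical true effect of 0.5 and 23 subjects per group, values chosen only because they give a power close to 0.4 at a two-tailed significance level of 0.05.

```python
import random
from statistics import NormalDist, fmean


def simulate_power(effect=0.5, n_per_group=23, alpha=0.05, runs=5000, seed=1):
    """Estimate power as the fraction of simulated experiments that reject H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    std_error = (2 / n_per_group) ** 0.5  # std. error of the mean difference, sigma = 1
    rejections = 0
    for _ in range(runs):
        # One simulated experiment: the effect really exists in group A.
        group_a = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        z = (fmean(group_a) - fmean(group_b)) / std_error
        if abs(z) > z_crit:
            rejections += 1
    return rejections / runs


estimated_power = simulate_power()
print(f"estimated power: {estimated_power:.2f}")
```

The printed fraction is the share of repeated experiments in which the existing effect was detected, which is what a power level describes.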

Statistical Power Analysis cont.

• High power means that if an effect exists there is a high probability that it will be found.

• And if no effect is found, you have a solid statistical argument for accepting the null hypothesis.

Statistical significance testing

• Concerned with controlling the role of chance (“fate/luck”) in the outcome

• An experiment should have a power level of 0.8

Direction/Non-direction of statistical test

• One tailed and two tailed tests

• Directional: one-tailed

– Example: the phenomenon exists if A is larger than B

• Non-directional: two-tailed

– Example: the phenomenon exists if A and B differ
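
As an illustration of how the choice of direction changes the outcome, the sketch below computes both p-values for the same test statistic. It assumes a z statistic that follows the standard normal distribution under H0; the function name and the value 1.8 are hypothetical.

```python
from statistics import NormalDist


def p_values(z):
    """Return (one_tailed, two_tailed) p-values for a z statistic.

    one_tailed tests the directional hypothesis "A is larger than B";
    two_tailed tests the non-directional hypothesis "A and B differ".
    """
    std_normal = NormalDist()
    one_tailed = 1 - std_normal.cdf(z)
    two_tailed = 2 * (1 - std_normal.cdf(abs(z)))
    return one_tailed, two_tailed


one, two = p_values(1.8)
print(f"one-tailed p = {one:.3f}, two-tailed p = {two:.3f}")
```

For z = 1.8 the one-tailed p-value falls below 0.05 while the two-tailed one does not, so the same data can lead to different decisions depending on whether a direction was hypothesized in advance.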

Parametric and Non-parametric tests

• A parametric statistical test requires the estimation of one or more population parameters

– Example: an estimate of the difference in the average between the first and the second populations

• A non-parametric test does not involve estimation of a specific parameter

– Example: it provides an estimate of P[X>Y], the probability that a randomly selected patient from the first population has a larger value than a randomly selected patient from the second population
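
The quantity P[X>Y] can be estimated directly by comparing every pair of observations, as in the minimal sketch below. The sample values are hypothetical, and counting ties as half a win is an assumed convention rather than something stated in the presentation.

```python
def prob_x_greater_than_y(xs, ys):
    """Estimate P[X > Y] as the share of (x, y) pairs with x > y; ties count half."""
    wins = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(xs) * len(ys))


first = [12, 15, 11, 18, 14]   # hypothetical measurements, first population
second = [10, 13, 9, 14, 12]   # hypothetical measurements, second population
estimate = prob_x_greater_than_y(first, second)
print(f"estimated P[X > Y] = {estimate:.2f}")
```

No mean or variance is estimated anywhere in the calculation, only the ordering of the values matters, which is why no assumption about the shape of the distribution is needed.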

Parametric and Non-parametric tests cont.

• Advantage of non-parametric tests

– They do not require the sample population to be normally distributed

• Disadvantage

– Non-parametric tests do not have the same statistical power as parametric tests

• Non-parametric tests should only be used when substantial non-normality of the sample is believed to exist, or when one wishes to be particularly conservative on the side of Type I errors.

Calculating Statistical Power

• The significance criterion (α): the chosen risk of committing a Type I error, that is, the probability of incorrectly rejecting the null hypothesis (H0) when performing significance testing.

• The sample size (N): number of subjects (as large as possible)

• The effect size (γ): the degree to which the phenomenon under study is present in the population.

Calculating Statistical Power cont.

• Significance criterion: α

• Sample size: N

• Effect size: γ

• For comparison of two means, the most frequently used form is $\delta = \gamma \, f(N)$, with $f(N) = \sqrt{N/2}$
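
To show how the three components combine into a power value, the sketch below uses a normal approximation. Several things here are assumptions and not taken from the presentation: that the two-means formula is delta = gamma * sqrt(N / 2), that N is the number of subjects in each group, that gamma is the mean difference divided by the standard deviation, and that power can be approximated by the probability that a standard normal variable shifted by delta exceeds the critical value.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(gamma, n_per_group, alpha=0.05, two_tailed=True):
    """Normal-approximation power for comparing two means.

    gamma: standardized effect size (mean difference / standard deviation).
    n_per_group: N, assumed here to be the number of subjects in each group.
    """
    delta = gamma * sqrt(n_per_group / 2)  # delta = gamma * f(N)
    tail = alpha / 2 if two_tailed else alpha
    z_crit = NormalDist().inv_cdf(1 - tail)
    # Probability that the test statistic lands beyond the critical value
    # in the direction of the effect (the opposite tail is ignored).
    return NormalDist().cdf(delta - z_crit)


print(f"two-tailed: {approx_power(0.5, 64):.2f}")
print(f"one-tailed: {approx_power(0.5, 64, two_tailed=False):.2f}")
```

With the other inputs fixed, a one-tailed test yields a higher power than a two-tailed one, because the whole significance criterion is spent on a single tail.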

The significance criterion

• Type I error: the probability of incorrectly rejecting the null hypothesis (H0), α

• Type II error: the probability of incorrectly accepting the null hypothesis (H0), β

• Probability of correctly rejecting the null hypothesis: 1 - β

• The relationship between α and β (Type I and Type II error) is β / α = x, which means that a false rejection of H0 is x times more serious than erroneously accepting it.
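
A short worked example of this ratio, writing α for the Type I error probability and β for the Type II error probability. The power level of 0.8 is the target named earlier in the presentation; the value α = 0.05 is an assumption, chosen because it is a commonly used significance criterion.

```latex
\beta = 1 - \text{power} = 1 - 0.8 = 0.2,
\qquad
\frac{\beta}{\alpha} = \frac{0.2}{0.05} = 4
```

Under these values, a false rejection of H0 is treated as four times more serious than erroneously accepting it.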

The sample size N

• With the effect size and the significance criterion held constant, the power level of the test depends directly on the sample size.

• As N increases, the probability of error decreases; the precision becomes greater and the chance of rejecting a false null hypothesis becomes higher.
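
The dependence on N can also be turned around to ask how many subjects a planned experiment needs. The sketch below is an assumption-laden illustration: it uses a two-tailed normal approximation for a comparison of two means, treats N as the number of subjects per group, and tries three hypothetical standardized effect sizes.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_group(gamma, power=0.8, alpha=0.05):
    """Smallest per-group N reaching the target power (normal approximation, two-tailed)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / gamma) ** 2)


for gamma in (0.2, 0.5, 0.8):  # hypothetical effect sizes
    print(f"effect size {gamma}: {required_n_per_group(gamma)} subjects per group")
```

The required N grows quickly as the effect size shrinks, which is why small experiments can only be expected to detect large effects.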

Effect size

• The effect size is the degree to which the phenomenon under study is present in the population.

• The larger the effect size, the more likely the phenomenon is to be detected and the null hypothesis to be rejected.
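
For a comparison of two means, one common way to put a number on the effect size is the difference between the group means divided by a pooled standard deviation. The sketch below computes that sample estimate; the choice of this particular index, the function name and the data are assumptions made for illustration.

```python
from math import sqrt
from statistics import fmean, variance


def standardized_effect_size(group_a, group_b):
    """Mean difference divided by the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (fmean(group_a) - fmean(group_b)) / sqrt(pooled_var)


# Hypothetical defect counts found with two inspection techniques.
technique_a = [14, 17, 15, 19, 16, 18]
technique_b = [12, 15, 13, 16, 14, 14]
effect = standardized_effect_size(technique_a, technique_b)
print(f"standardized effect size: {effect:.2f}")
```

Because the difference is expressed in units of standard deviation, the value can be compared across studies that measure different things.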

Evaluating effect size

• Normative: rely on other related empirical studies, or on the establishment of an empirical norm for the subject of the experiment

– The normative approach is used when conducting a replication study in ESE, since this is a fairly young research field

• Expert judgment: rely on experts providing the estimate

– Guesswork

Expert Judgment

• The judgmental approach to the estimation and evaluation of effect sizes can simply be regarded as a consensus opinion of experts within the field of experimentation

• The difficulty of this task within SE is that the experts will often not fully understand the concepts of significance and effect size and hence their opinion may only address these concepts in a relatively indirect manner

Expert Judgment cont.

• Experts also have problems providing quantitative opinions, so one then needs to translate the qualitative opinion into a quantitative value

• Opinion can be collected using

– Formal structured interviews

– Formal questionnaire surveys
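
One possible way to turn collected qualitative opinions into a single quantitative value is sketched below. The mapping is an assumption, not a method from the presentation: it uses the conventional small, medium and large standardized effect sizes of 0.2, 0.5 and 0.8, and takes the median of the answers as the consensus. The questionnaire answers are hypothetical.

```python
from statistics import median

# Assumed mapping: conventional small/medium/large standardized effect sizes.
LABEL_TO_EFFECT = {"small": 0.2, "medium": 0.5, "large": 0.8}


def consensus_effect_size(opinions):
    """Translate qualitative expert labels to numbers and return their median."""
    values = [LABEL_TO_EFFECT[label.lower()] for label in opinions]
    return median(values)


# Hypothetical answers from a questionnaire sent to five experts.
answers = ["medium", "large", "medium", "small", "medium"]
print(consensus_effect_size(answers))
```

The median is used here so that a single extreme answer does not move the consensus value much.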
