[notes]stat1 - elementary statistics

7/28/2019 [Notes]STAT1 - Elementary Statistics


STAT 1 Notes by F5XS Elementary Statistics


The term statistics came from the Latin phrase “ratio status”, which means the study of practical politics or the statesman’s art. In the middle of the 18th century, the term statistik was used, a German term defined as “the political science of several countries.” From statistik it became statistics, defined as a statement in figures and facts of the present condition of a state.

The word statistics can be viewed in two contexts. In the singular sense, statistics is a science concerned with the collection, organization, presentation, analysis, and interpretation of data. In the plural sense, statistics is a collection of facts and figures of processed data.

Two broad categories of statistics

descriptive statistics

used to describe a mass of data in a clear, concise, and informative way

deals with the methods of organizing, summarizing, and presenting data

inferential statistics

is concerned with making generalizations about the characteristics of a larger set where only a part is examined

Data

facts and figures that are collected, presented, and analyzed

can be numeric or non-numeric

must be contextualized

The universe is a collection or set of all individuals or entities whose characteristics are to be studied. A finite universe is present when the elements of the universe can be counted for a given time period, while an infinite universe is present when the number of elements of the universe is unlimited.

A variable is an attribute or characteristic of interest measurable on each and every unit of the universe.

A qualitative variable assumes values that are not numerical but can be categorized. Categories of a qualitative variable may be identified by either non-numerical descriptions or by numeric codes. A quantitative variable indicates the quantity or amount of a characteristic. The data are always numeric and can be discrete or continuous.

A discrete quantitative variable has a finite or countable number of possible values, while a continuous quantitative variable assumes any value in a given interval.

The population is the set of all possible values of the variable. A sample is a subset of the population or universe.

Data may be classified into four hierarchical levels of measurement:

nominal

data collected are labels, names, or categories

frequencies or counts of observations belonging to the same category can be obtained

lowest level of measurement

ordinal

data collected are labels with implied ordering





the difference between two data labels is meaningless

interval

data can be ordered or ranked

the difference between two data values is meaningful

data at this level may lack an absolute zero point

ratio

data have all the properties of the interval scale

the number zero indicates the absence of the characteristic being measured

it is the highest level of measurement

Methods of collecting data

objective method

the data are collected through measurement, counting, or by observation

this method requires the use of a measuring or counting instrument

subjective method

the information is provided by identified respondents

the instrument used to gather data may take the form of a questionnaire

the researcher collects data by:

conducting personal interviews either face-to-face or through telephones

gathering responses using mailed questionnaires

use of existing records

uses data which have been previously collected by another person or institution for some other purpose

Types of data

primary data are acquired directly from the source

secondary data are not acquired directly from the source

Methods of data presentation

textual

a narrative form of describing the characteristics of the universe or population based on the data collected and organized by giving highlights

tabular

data are organized into classes or categories by rows and/or columns, and appropriate pieces of information are found in the cells of the table

relatively more information can be presented, and trends can be easily seen

some details are lost when data are summarized in tabular form

graphical

provides visual presentation of the distributional properties of the data

most efficient way of presenting trends

some details are lost in using this type of presentation

examples: pie, bar, line, scatter plot

Parts of a statistical table

table heading – includes the table number and title

caption – designates the information contained in the columns

body – main part of the table containing the information or figures presented

stubs/classes – categories which describe the data, usually found at the left side of the table





Stem-and-leaf plot

presents data in ordered form and provides an idea of the shape of the distribution of a set of quantitative data

combines the grouping of a frequency distribution and the pictorial display of a histogram

best for smaller number of observations with values greater than zero

also called stemplot

Steps in constructing a stem-and-leaf plot

1] arrange the data in ascending or descending order

2] split each datum into a leaf value, which is the last digit, and a stem value, which consists of the remaining digits

3] list the stems vertically in increasing or decreasing order

4] draw a vertical line to the right of the stems

5] for each stem, write its leaves to the right of the vertical line in ascending order
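The construction steps can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the data are non-negative integers, the function name stem_and_leaf and the sample values are made up for the example, and stems with no leaves are simply skipped.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return {stem: [leaves]} for non-negative integers."""
    plot = defaultdict(list)
    for v in sorted(values):            # step 1: arrange the data
        stem, leaf = divmod(v, 10)      # step 2: the last digit is the leaf
        plot[stem].append(leaf)         # leaves arrive already sorted
    return dict(sorted(plot.items()))   # step 3: stems in increasing order


data = [23, 35, 21, 47, 35, 29, 41, 30]  # hypothetical observations
for stem, leaves in stem_and_leaf(data).items():
    # steps 4 and 5: a vertical bar, then the leaves in ascending order
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Reading the printed rows sideways gives the same picture as a histogram with classes 20-29, 30-39, and 40-49.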

Important facts of a stemplot

reveals the center of the distribution

illustrates the overall shape of the distribution (like symmetry and spread)

shows marked deviations from the overall shape

Descriptive Measures

quantities that are used to summarize the characteristics of a universe or population

some of these are measures of: location, dispersion, skewness, and kurtosis

Measures of location

summarizes a data set by giving a “typical value” within the range of the data values that describes its location relative to the entire data set

some common measures

minimum-maximum

minimum is the smallest value in the data set, denoted by MIN; maximum is the largest value in the data set, denoted by MAX

measures of central tendency

values about which the set of observations tend to cluster

also called an average

most commonly used average: mean, median, and mode

percentiles, deciles, quartiles (fractiles)

Mean

the sum of all observations in the data set divided by the total number of observations

$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$, where $X_i$ is the $i$th observation of the variable and $N$ is the total number of observations in the data set

o there is only one mean for a given data set

o it is defined only for quantitative data

o it reflects the magnitude of every observation

o it is easily affected by the presence of extreme values

o the sum of the deviations from the mean is equal to zero

o the means of different sets/groups of comparable data may be combined when properly weighted

Median

the middle value when the data values are arranged in ascending or descending order of magnitude

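$Md = X_{\left(\frac{N+1}{2}\right)}$ if $N$ is odd; $Md = \frac{X_{\left(\frac{N}{2}\right)} + X_{\left(\frac{N}{2}+1\right)}}{2}$ if $N$ is even, where $N$ is the number of observations and $X_{(j)}$ is the value in position $j$ of the ordered data

o there is only one median for a data set

o it is not amenable to further computation

o the sum of the absolute deviations of the observations from a value, say $c$, is smallest when $c$ is equal to the median; that is, $\sum_{i=1}^{N} |X_i - c|$ is minimum when $c = Md$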

Mode

the value in the data set which occurs most frequently

o it may or may not exist

o if it exists, there can be more than one mode for a given data set

o it is determined by the frequency and not by the values of the observations

o it is applicable for both quantitative and qualitative data
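The three averages can be computed directly from their definitions. The sketch below is illustrative only: the function names and the sample data are made up, and treating a data set in which every distinct value occurs equally often as having no mode is one common convention, assumed here.

```python
from collections import Counter


def mean(xs):
    """Sum of all observations divided by the number of observations."""
    return sum(xs) / len(xs)


def median(xs):
    """Middle value of the ordered data (average of the two middle values if N is even)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2


def modes(xs):
    """All most frequent values; an empty list when no value stands out (assumed convention)."""
    counts = Counter(xs)
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


data = [2, 3, 3, 5, 7, 10]  # hypothetical observations
print(mean(data), median(data), modes(data))
```

Because modes only counts frequencies, it also works on qualitative data such as a list of category labels.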

Fractiles (percentiles, in this scenario)

numerical measures that give the position of a data value relative to the entiredata set

the $k$th percentile, denoted as $P_k$, is the data value that separates the bottom $k\%$ of the data from the top $(100-k)\%$

finding the $k$th percentile

1] arrange the data values in ascending order to form an array

2] find the location of $P_k$ in the array by computing $L = \frac{k}{100} \times N$, where $N$ is the total number of data values and $k$ is the percentile of interest

3] a] if $L$ is a whole number, then $P_k$ is the mean of the data values in position $L$ and position $L + 1$

b] if $L$ is not a whole number, then $P_k$ is taken as the data value in the next higher whole number position
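The location rule for percentiles translates into a short function. This is a sketch under stated assumptions: the name percentile is made up, k is restricted to whole numbers from 1 to 99 so that positions L and L + 1 always exist, and integer arithmetic is used to decide whether L is a whole number.

```python
def percentile(xs, k):
    """k-th percentile (k = 1..99) using the location rule L = (k / 100) * N."""
    if not 1 <= k <= 99:
        raise ValueError("k must be a whole number from 1 to 99")
    s = sorted(xs)                          # step 1: ascending array
    whole, remainder = divmod(k * len(s), 100)  # step 2: L = whole + remainder / 100
    if remainder == 0:
        # step 3a: L is whole, so average positions L and L + 1 (1-based)
        return (s[whole - 1] + s[whole]) / 2
    # step 3b: L is not whole, so take the next higher position (1-based whole + 1)
    return s[whole]


data = [12, 15, 17, 20, 22, 25, 28, 30, 34, 40]  # hypothetical observations
print(percentile(data, 25))  # L = 2.5, so the value in position 3
print(percentile(data, 50))  # L = 5, so the mean of positions 5 and 6
```

Other textbooks and software packages use different interpolation rules, so results may differ slightly from library functions.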

Measure of Dispersion

a quantity that describes the spread or variability of the observations in a given data set

the higher the value, the greater the variability in the data set

absolute measures of dispersion: range, inter-quartile range, variance, standard deviation

relative measure of dispersion: coefficient of variation

Range ($R$)

the difference between the maximum and minimum values in a data set

$R = \mathrm{MAX} - \mathrm{MIN}$

quick and easy to understand





a rough measure of dispersion

usually reported together with the median

Inter-Quartile Range ($IQR$)

the difference between the third quartile and the first quartile

$IQR = Q_3 - Q_1$

not affected by the presence of extreme values

not as easy to calculate as the range

Variance ($\sigma^2$)

the average squared difference of the observations from the mean

$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$

one of the most useful measures of dispersion

all observations contribute in the computation

always non-negative

comes in the square of the unit of measure of the given set of values

Standard Deviation ($\sigma$)

the average deviation of the observations from the mean

the positive square root of the variance

$\sigma = \sqrt{\sigma^2}$

the unit of measure is the same as that of the observations

usually reported with the mean

Coefficient of Variation ($CV$)

a relative measure that indicates the magnitude of variation relative to the magnitude of the mean, expressed in percent

$CV = \frac{\sigma}{\mu} \times 100\%$

this measure of dispersion is unitless

used to compare dispersion of two or more data sets with the same or different units

the higher the $CV$, the more variable is the data set relative to its mean
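As a rough illustration, the range, variance, standard deviation, and coefficient of variation can be computed together. The sketch assumes the divide-by-N (population) form of the variance described in this section; the function name dispersion and the data are made up. The inter-quartile range is left out because it depends on the quartile rule chosen.

```python
import math


def dispersion(xs):
    """Range, population variance, standard deviation, and CV (in percent)."""
    n = len(xs)
    mu = sum(xs) / n
    variance = sum((x - mu) ** 2 for x in xs) / n  # average squared deviation
    sd = math.sqrt(variance)                       # positive square root
    return {
        "range": max(xs) - min(xs),
        "variance": variance,
        "sd": sd,
        "cv": sd / mu * 100,                       # unitless, in percent
    }


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations with mean 5
print(dispersion(data))
```

The variance is in squared units of the data, while the standard deviation is back in the original units, which is why the latter is usually the one reported with the mean.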

Chebyshev’s Rule

permits us to make statements about the percentage of observations that must be within a specified number of standard deviations from the mean

the proportion of any distribution that lies within $k$ standard deviations of the mean is at least $1 - \frac{1}{k^2}$, where $k$ is any positive number larger than 1

for any data set with mean $\mu$ and standard deviation $\sigma$, the following statements apply:

at least 75% of the observations are within $2\sigma$ of its mean

at least 89% of the observations are within $3\sigma$ of its mean
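The bound in Chebyshev's Rule is easy to evaluate for any k. The snippet below is a minimal sketch; the function name chebyshev_bound is made up for the example.

```python
def chebyshev_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be larger than 1")
    return 1 - 1 / k ** 2


for k in (2, 3, 4):
    print(f"within {k} standard deviations: at least {chebyshev_bound(k):.1%}")
```

For k = 2 the bound is 3/4 and for k = 3 it is 8/9, which is where the 75% and 89% statements come from.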

A distribution is said to be symmetric about the mean if the distribution to the left of the mean is the “mirror image” of the distribution to the right of the mean.

Measure of Skewness

describes the degree of departure of the distribution of the data from symmetry

$SK = \frac{3(\mu - Md)}{\sigma}$

a symmetric distribution has $SK = 0$ since its mean is equal to its median and its mode

Measure of Kurtosis

describes the extent of peakedness or flatness of the distribution of the data

$K = \frac{\sum_{i=1}^{N} (X_i - \mu)^4}{N \sigma^4} - 3$

Box-and-Whiskers Plot

indicates the symmetry of the distribution and incorporates measures of location to describe the variability of the observations

used for identifying outliers

the diagram is made up of a box which lies between the first and third quartiles

the whiskers are the straight lines extending from the ends of the box to the smallest and largest values that are not outliers, that is, the most extreme values lying within $Q_1 - 1.5\,IQR$ and $Q_3 + 1.5\,IQR$

the outliers are denoted by dots
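The whisker ends and the outliers of a box-and-whiskers plot follow from the quartiles. The sketch below assumes the quartiles have already been computed and are passed in; the function name box_plot_parts and the numbers are made up for illustration.

```python
def box_plot_parts(xs, q1, q3):
    """Return (lower whisker end, upper whisker end) and the list of outliers."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    outliers = [x for x in xs if x < lower_fence or x > upper_fence]
    # whiskers stop at the most extreme values that are not outliers
    return (min(inside), max(inside)), outliers


data = [1, 10, 12, 13, 15, 16, 18, 40]  # hypothetical observations
whiskers, outliers = box_plot_parts(data, q1=11, q3=17)  # quartiles assumed given
print(whiskers, outliers)
```

With these numbers the fences are 2 and 26, so the whiskers end at 10 and 18 and the values 1 and 40 are drawn as dots.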

A random experiment is a process of drawing observations capable of repetition under the same conditions with well-defined possible outcomes.

Sample space

set or collection of all possible outcomes of a random experiment

may either be finite or infinite

elements of the sample space are referred to as sample points

Event

a subset of the sample space

may either be simple or compound

observing an element of an event indicates the occurrence of the event

A probability is a numerical value ranging from 0 to 1 that measures the likelihood of an event occurring. There are three approaches to assigning probability:

a priori approach

utilizes an experimental model whose underlying assumptions are used to measure the likelihood of an event

assumptions are conditions on the likelihood of an event

for a random experiment with the assumption of an equally-likely sample space, the probability of event $A$, denoted as $P[A]$, is defined as $P[A] = \frac{n[A]}{n[S]}$, where $n[A]$ is the number of elements of event $A$ and $n[S]$ is the total number of possible outcomes of the random experiment

for a random experiment with the assumption of an unequally-likely sample space, the probability of event $A$, denoted as $P[A]$, is defined as $P[A] = \sum_{a_i \in A} P[\{a_i\}] = \sum_{a_i \in A} p_i$, where $P[\{a_i\}]$ or $p_i$ is the probability of the $i$th element of the event

a posteriori approach

utilizes the relative frequency of the occurrence of an event in repeated trials of the random experiment as the probability of the event

$P[A] = \frac{\text{number of occurrences of } A}{\text{number of trials}}$

subjective approach

utilizes one’s personal judgment and knowledge in assessing how likely an event will occur

Properties of $P[A]$

$0 \leq P[A] \leq 1$

for a random experiment with sample space $S$, $P[S] = 1$

if $A_1, A_2, A_3, \dots, A_k$ are mutually disjoint events in $S$, then $P[A_1 \cup A_2 \cup A_3 \cup \dots \cup A_k] = P[A_1] + P[A_2] + P[A_3] + \dots + P[A_k]$

Given the probability of an event $A$, we say that $A$ is a sure event if $P[A] = 1$, while it is an impossible event if $P[A] = 0$.

The complement of an event $A$ is the set of all outcomes in the sample space $S$ not in $A$, denoted by $A^c$. $P[A^c] = 1 - P[A]$.

Two events $A_1$ and $A_2$ are said to be mutually exclusive if they have no common element. Thus they cannot happen simultaneously.

Two events $A_1$ and $A_2$ are said to be independent if the likelihood of the occurrence of $A_1$ is not affected by the occurrence of $A_2$.

The union of two events consists of the elements of $A_1$ but not in $A_2$, the elements of $A_2$ but not in $A_1$, and the elements of both $A_1$ and $A_2$. The intersection of two events consists of the elements found in both $A_1$ and $A_2$. Observing the intersection of two events implies the simultaneous occurrence of the two events.

The probability of the union of two events $A_1$ and $A_2$ in $S$ is defined mathematically as $P[A_1 \cup A_2] = P[A_1] + P[A_2] - P[A_1 \cap A_2]$.

The conditional probability of $A_1$ given $A_2$ is defined mathematically as $P[A_1 \mid A_2] = \frac{P[A_1 \cap A_2]}{P[A_2]}$, $P[A_2] > 0$.
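The rule for the probability of a union and the definition of conditional probability can be checked on a small equally-likely sample space. The example below is a sketch using one roll of a fair die; the events and names are made up for illustration.

```python
from fractions import Fraction

S = set(range(1, 7))  # sample space: one roll of a fair die


def prob(event):
    """A priori probability under an equally-likely sample space."""
    return Fraction(len(event & S), len(S))


A1 = {2, 4, 6}  # an even number shows
A2 = {4, 5, 6}  # a number greater than 3 shows

union = prob(A1) + prob(A2) - prob(A1 & A2)  # probability of the union
conditional = prob(A1 & A2) / prob(A2)       # P[A1 | A2]

print(union, prob(A1 | A2))  # both routes to the union agree
print(conditional)
```

Here P[A1 | A2] is 2/3 while P[A1] is 1/2, so knowing that A2 occurred changes the likelihood of A1 and the two events are not independent.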

A random variable is a rule or function that assigns exactly one real number to every possible outcome of a random experiment. Discrete random variables take on a set of distinct possible values or a countably infinite number of possible values. Continuous random variables take on any value within a specified interval or continuum of values.

Probability of values of $X$ on a binomial experiment: $P[X = x] = \binom{n}{x} p^x (1-p)^{n-x}$

where $n$ is the number of trials, $x$ is the number of “successes”, and $p$ is the probability of getting a “success” in any one trial.

Probability of values of $X$ on a random experiment without replacement: $P[X = x] = \frac{\binom{k}{x} \binom{N-k}{n-x}}{\binom{N}{n}}$

where $n$ is the number of trials, $x$ is the number of “successes”, $k$ is the number of elements under “success”, and $N$ refers to the total number of elements.
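Both probability formulas can be evaluated with the standard library. The sketch below assumes the usual binomial and hypergeometric forms, with n trials, x successes, success probability p, k success elements, and N total elements; the function names are made up.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P[X = x] for n independent trials with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def hypergeometric_pmf(x, n, k, N):
    """P[X = x] when drawing n of N elements without replacement, k of which are successes."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)


# exactly 2 heads in 4 tosses of a fair coin
print(binomial_pmf(2, n=4, p=0.5))
# exactly 1 defective when 2 items are drawn from 10 items, 3 of which are defective
print(hypergeometric_pmf(1, n=2, k=3, N=10))
```

Summing either function over all possible values of x gives 1, which is a quick sanity check on any hand computation.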





Expected Value of a Random Variable

denoted by $\mu$ or $E[X]$

interpreted as the long-run average of a random variable

for a discrete random variable, it is computed as $\mu = E[X] = \sum_{\text{all } x} x \, P[X = x]$

Standard Deviation of a Random Variable

denoted by $\sigma$

measures the average deviation of the values of the random variable from its mean

for a discrete random variable, it is computed as $\sigma = \sqrt{\sum_{\text{all } x} (x - \mu)^2 \, P[X = x]}$
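For a discrete random variable given as a table of values and probabilities, both quantities are weighted sums. The sketch below stores the distribution as a dictionary {value: probability}, which is an assumption of this example; the function names are made up.

```python
import math


def expected_value(pmf):
    """Sum of x * P[X = x] over all values x."""
    return sum(x * p for x, p in pmf.items())


def standard_deviation(pmf):
    """Square root of the sum of (x - mu)^2 * P[X = x] over all values x."""
    mu = expected_value(pmf)
    return math.sqrt(sum((x - mu) ** 2 * p for x, p in pmf.items()))


# number of heads in two tosses of a fair coin
heads = {0: 0.25, 1: 0.50, 2: 0.25}
print(expected_value(heads), standard_deviation(heads))
```

The expected value of 1 head is the long-run average over many repetitions of the two tosses, not a value guaranteed on any single repetition.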

Probability distribution of a continuous random variable

the probability that a random variable takes on an exact value is zero, i.e. $P[X = x] = 0$, for a continuous random variable

the probability distribution is specified by a function from which probability statements are made

Properties of a probability density function

the total area under the curve is 1

the probability that the random variable will take on a value between two quantities $x_1$ and $x_2$ is given by the area under the curve bounded by the lines $X = x_1$ and $X = x_2$

A continuous random variable is said to be a normal random variable if it follows a normal probability distribution specified by $f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}$

Properties of a normal curve

it is bell-shaped and unimodal

it is symmetric at $x = \mu$

it is asymptotic to the $x$-axis

the total area under the curve is 1

it has a distribution with approximately

68% of the observations within $(\mu - \sigma, \mu + \sigma)$

95% of the observations within $(\mu - 2\sigma, \mu + 2\sigma)$

99.7% of the observations within $(\mu - 3\sigma, \mu + 3\sigma)$

Standard Normal Distribution

has a normal distribution with mean equal to 0 and variance equal to 1

the random variable which follows the standard normal distribution is referred to as the standard normal variate, denoted by $Z$

The standard normal table summarizes the cumulative probability distribution for $Z$.

Rules in computing probabilities

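$P[Z = z] = 0$, hence $P[Z \leq z] = P[Z < z]$

$P[Z \leq z]$ can be obtained directly from the table

$P[Z > z] = 1 - P[Z \leq z]$

$P[Z > -z] = P[Z < z]$

$P[Z < -z] = P[Z > z]$

$P[z_1 < Z < z_2] = P[Z < z_2] - P[Z < z_1]$

Transformation Theorem: Given a normal random variable $X$ with mean $\mu$ and variance $\sigma^2$, then $Z = \frac{X - \mu}{\sigma}$ follows the standard normal distribution.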
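In place of a printed table, the cumulative standard normal probability can be computed from the error function in the standard library. The sketch below combines that with the transformation theorem; the function names phi and normal_between and the exam-score numbers are made up for illustration.

```python
import math


def phi(z):
    """Cumulative probability P[Z <= z] for the standard normal variate."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def normal_between(a, b, mu, sigma):
    """P[a < X < b] for a normal X, via Z = (X - mu) / sigma."""
    z1 = (a - mu) / sigma
    z2 = (b - mu) / sigma
    return phi(z2) - phi(z1)


# hypothetical exam scores: normal with mean 70 and standard deviation 10
print(round(normal_between(60, 80, mu=70, sigma=10), 4))  # within one sigma
print(round(1 - phi(1.96), 4))                            # P[Z > 1.96]
```

The first value reproduces the "about 68% within one standard deviation" property of the normal curve.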

Methods of drawing conclusions

deductive method

draws conclusions from general to specific

assumes that any part of the universe will bear the observed characteristics of the universe

conclusions are stated with certainty

inductive method

draws conclusions from specific to general

assumes that the characteristics observed from a part of the universe are likely to hold true for the whole universe

conclusions are subject to uncertainty

Inferential statistics makes use of the inductive method of drawing conclusions.

Reasons why sampling is done:

reduced cost

greater speed or timeliness

greater efficiency and accuracy

greater scope

convenience

necessity

ethical considerations

Two types of samples

probability samples

samples are obtained using some objective chance mechanism, thus involving randomization

require the use of a complete listing of the elements of the universe called the sampling frame

probabilities of selection are known

generally referred to as random samples

allow drawing of valid generalizations about the universe/population

non-probability samples

samples are obtained haphazardly, selected purposively, or are taken as volunteers

the probabilities of selection are unknown

should not be used for statistical inference

result from the use of judgment sampling, accidental sampling, purposive sampling, and the like





Methods of probability sampling

simple random sampling

most basic method of drawing a probability sample

assigns equal probabilities of selection to each possible sample

results in a simple random sample

simple random sampling without replacement does not allow repetitions of selected units in the sample, while simple random sampling with replacement allows repetitions of selected units in the sample

stratified random sampling

the universe is divided into mutually exclusive sub-universes called strata

independent simple random samples are obtained from each stratum

advantages of stratification

gives a better cross-section of the population

simplifies the administration of the survey/data gathering

the nature of the population dictates some inherent stratification

allows one to draw inferences for various subdivisions of the population

increases the precision of the estimates generally

systematic random sampling

adopts a skipping pattern in the selection of sample units

gives a better cross-section if the listing is linear in trend but has high risk of bias if there is periodicity in the listing of units in the sampling frame

allows the simultaneous listing and selection of samples in one operation

cluster sampling

considers a universe divided into mutually exclusive sub-groups called clusters

a random sample of clusters is selected and their elements are completely enumerated

has simpler frame requirements

administratively convenient to implement

simple two-stage sampling

1] in the first stage, the units are grouped into sub-groups, called primary sampling units, and a simple random sample of primary sampling units is selected

2] in the second stage, from each of the primary sampling units selected, a simple random sample of its elements, called secondary sampling units, is obtained

Sampling is a process that…

can be repeatedly done under basically the same conditions

can lead to well-defined possible outcomes

is unpredictable

Stratified random sampling

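total number of units in the universe: $N = \sum_{h=1}^{L} N_h$, where $N_h$ is the number of units in stratum $h$ and $L$ is the number of strata

total number of units in the stratified sample: $n = \sum_{h=1}^{L} n_h$, where $n_h$ is the number of units sampled from stratum $h$

total number of units in a sample stratum with equal allocation: $n_h = \frac{n}{L}$

total number of units in a sample stratum with proportional allocation: $n_h = \frac{N_h}{N} \times n$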
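The two allocation schemes can be compared side by side. The sketch below assumes the usual forms, n divided by the number of strata for equal allocation and n times the stratum's share of the universe for proportional allocation; the function names and stratum sizes are made up, and in practice the results are rounded to whole units.

```python
def equal_allocation(stratum_sizes, n):
    """Same sample size for every stratum."""
    return [n / len(stratum_sizes)] * len(stratum_sizes)


def proportional_allocation(stratum_sizes, n):
    """Sample size proportional to each stratum's share of the universe."""
    total = sum(stratum_sizes)
    return [n * size / total for size in stratum_sizes]


strata = [500, 300, 200]  # hypothetical stratum sizes
print(equal_allocation(strata, n=100))
print(proportional_allocation(strata, n=100))
```

Proportional allocation keeps the sample a scaled-down copy of the universe, whereas equal allocation over-represents the smaller strata.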




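Sampling with replacement (SWR) versus sampling without replacement (SWOR), for samples of size $n$ from a universe of $N$ units:

Number of possible samples: SWR: $N^n$; SWOR: $\binom{N}{n}$

Probability of selecting each sample: SWR: $\frac{1}{N^n}$; SWOR: $\frac{1}{\binom{N}{n}}$

Mean and variance of the sampling distribution of the sample mean:

$\mu_{\bar{X}} = E[\bar{X}] = \sum_{\text{all } \bar{x}} \bar{x} \, P[\bar{X} = \bar{x}]$

$\sigma^2_{\bar{X}} = E[(\bar{X} - \mu_{\bar{X}})^2] = \sum_{\text{all } \bar{x}} (\bar{x} - \mu_{\bar{X}})^2 \, P[\bar{X} = \bar{x}]$

SWR: $\mu_{\bar{X}} = \mu$; $\sigma^2_{\bar{X}} = \frac{\sigma^2}{n}$

SWOR: $\mu_{\bar{X}} = \mu$; $\sigma^2_{\bar{X}} = \frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1}$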

The central limit theorem states that for any non-normal distribution with mean $\mu$ and variance $\sigma^2$, the distribution of the sample mean approaches the normal distribution with mean $\mu$ and variance $\frac{\sigma^2}{n}$ as the sample size $n$ increases (in these notes, a sample size of 25 is taken as large enough).

Estimation is concerned with finding a value or range of values for an unknown parameter.

an estimator of a parameter is a rule or a formula for computing the statistic using the sample data

usually denoted by a Greek letter with a “hat”, e.g. $\hat{\theta}$, and in other cases special symbols are used, like $\bar{X}$ for the sample mean as estimator of the population mean

an estimate is a numerical value of the estimator

Some desirable properties of an estimator:

an estimator must be accurate

accuracy measures the closeness of an estimate to the true value

the difference between the expected value of the estimator and the parameter measures the accuracy of an estimator and is referred to as the bias

$BIAS(\hat{\theta}, \theta) = E[\hat{\theta}] - \theta$

an estimator must be precise

precision measures the closeness of the different possible values of the estimator to each other

the precision of an estimator can be measured by its variance or by its standard error

$MSE(\hat{\theta}) = BIAS(\hat{\theta}, \theta)^2 + VAR(\hat{\theta})$

When estimating, the following factors (specifications) must be known to determine the appropriate sample size:

Level of confidence desired, $(1 - \alpha) \times 100\%$, also called the confidence coefficient, is the measure of confidence that the estimate obtained is near or the same as the true value of the parameter.

Variability of the population being studied, $\sigma^2$, is a measure of how dispersed the population observations are from each other. A large sample is needed when the population is widely dispersed. When the population variability is unknown, it is estimated using information on the same or a related variable from previous studies.

Maximum allowable error, $e$, also called the maximum tolerable error or margin of error, is the specified acceptable difference between the estimate and the parameter for a given level of confidence.

The sample size, $n$, necessary to meet the above specifications is $n \geq \frac{z_{\alpha/2}^2 \, \sigma^2}{e^2}$. However:

the formula assumes drawing a simple random sample of size $n$ from a population of size $N$

always adjust $n$ to $n^*$, where $n^* = \frac{nN}{n + N}$, as only a marginal difference will be noticed between $n$ and $n^*$ when $n$ is very, very small relative to $N$

when determining the sample size for estimating proportions, take $\sigma^2 = P(1 - P)$, where $P$ is the “best” educated guess of the population proportion, usually equal to 0.50 because it is with this value that the maximum sample size is obtained
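A sample-size calculation along these lines can be sketched as follows. Two assumptions are made and should be checked against the original notes: the base formula is taken as n = z² σ² / e², and the adjustment for a population of size N is taken as n* = nN / (n + N). The critical value z is read from a standard normal table and passed in, and the function name sample_size is made up.

```python
import math


def sample_size(z, sigma2, e, N=None):
    """Smallest whole sample size meeting the specifications.

    z      : critical value for the desired confidence level (from a table)
    sigma2 : population variance, or P * (1 - P) when estimating a proportion
    e      : maximum allowable error
    N      : population size; when given, the assumed adjustment n* = nN / (n + N) is applied
    """
    n = z ** 2 * sigma2 / e ** 2
    if N is not None:
        n = n * N / (n + N)
    return math.ceil(n)  # round up so the specifications are still met


# 95% confidence (z = 1.96), proportion guessed at 0.50, margin of error 0.05
print(sample_size(1.96, 0.5 * (1 - 0.5), 0.05))
print(sample_size(1.96, 0.5 * (1 - 0.5), 0.05, N=2000))
```

Using 0.50 as the guessed proportion gives the largest value of P(1 - P), so the resulting sample size is conservative.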

A point estimate is a single number computed from a random sample which represents a plausible value of the parameter. It pinpoints a location or a point in the distribution of possible values of the random variable.

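Point estimators under simple random sampling (SRS) and stratified random sampling (StRS), where $W_h = \frac{N_h}{N}$ is the weight of stratum $h$, $h = 1, \dots, L$:

Population Mean, $\mu$

SRS: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$

StRS: $\bar{X}_{st} = \sum_{h=1}^{L} W_h \bar{X}_h$

Population Proportion, $P$

SRS: $\hat{P} = \frac{\sum_{i=1}^{n} X_i}{n}$, where $X_i = 1$ if unit $i$ has the characteristic of interest and $X_i = 0$ otherwise

StRS: $\hat{P}_{st} = \sum_{h=1}^{L} W_h \hat{P}_h$

Population Variance, $\sigma^2$

SRS: $s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{n \sum_{i=1}^{n} X_i^2 - \left( \sum_{i=1}^{n} X_i \right)^2}{n(n - 1)}$

StRS: $s_h^2 = \frac{\sum_{j=1}^{n_h} (X_{hj} - \bar{X}_h)^2}{n_h - 1}$ for each stratum $h$

Variance of the Sample Mean (with replacement)

SRS: $s_{\bar{X}}^2 = \frac{s^2}{n}$

StRS: $s_{\bar{X}_{st}}^2 = \sum_{h=1}^{L} W_h^2 \frac{s_h^2}{n_h}$

Variance of the Sample Mean (without replacement)

SRS: $s_{\bar{X}}^2 = \frac{s^2}{n} \cdot \frac{N - n}{N - 1}$

StRS: $s_{\bar{X}_{st}}^2 = \sum_{h=1}^{L} W_h^2 \frac{s_h^2}{n_h} \cdot \frac{N_h - n_h}{N_h - 1}$

Standard Error of the Sample Mean (with replacement)

SRS: $s_{\bar{X}} = \frac{s}{\sqrt{n}}$

StRS: $s_{\bar{X}_{st}} = \sqrt{\sum_{h=1}^{L} W_h^2 \frac{s_h^2}{n_h}}$

Standard Error of the Sample Mean (without replacement)

SRS: $s_{\bar{X}} = \frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$

StRS: $s_{\bar{X}_{st}} = \sqrt{\sum_{h=1}^{L} W_h^2 \frac{s_h^2}{n_h} \cdot \frac{N_h - n_h}{N_h - 1}}$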

An interval estimate is a range of values computed from a random sample, which represents an interval of plausible values for the unknown value of the parameter of the population. When some measure of certainty or confidence is attached to the interval estimate, the interval is referred to as a confidence interval estimate. The measure of certainty or confidence, also called the confidence level or confidence coefficient, provides information on how “confident” the researcher is in stating that the interval estimate obtained from the random sample contains the true value of the parameter.

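Different cases of obtaining the confidence interval estimate of the population mean, $\mu$:

If a continuous random variable $X$ is normally distributed with known variance $\sigma^2$, then a $(1 - \alpha) \times 100\%$ confidence interval about $\mu$ is $\bar{X} \mp z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$.

If a continuous random variable $X$ is normally distributed with unknown mean $\mu$ and unknown variance $\sigma^2$, then a $(1 - \alpha) \times 100\%$ confidence interval about $\mu$ is $\bar{X} \mp t_{\alpha/2,(n-1)} \frac{s}{\sqrt{n}}$.

As the sample size $n$ becomes large ($n \geq 25$), a $(1 - \alpha) \times 100\%$ confidence interval about the mean $\mu$ of an approximately normally distributed random variable with unknown variance $\sigma^2$ is $\bar{X} \mp z_{\alpha/2} \frac{s}{\sqrt{n}}$.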

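Estimation of population proportion, $P$: A point estimator $\hat{P}$ of $P$ is given in the previous table, with estimated standard error $s_{\hat{P}} = \sqrt{\frac{\hat{P}(1 - \hat{P})}{n}}$. A $(1 - \alpha) \times 100\%$ confidence interval about $P$ is $\hat{P} \mp z_{\alpha/2} \sqrt{\frac{\hat{P}(1 - \hat{P})}{n}}$. For estimating the population proportion, the sample size is considered large enough when $n \hat{P}(1 - \hat{P}) > 3$.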
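Confidence intervals of the form "estimate minus or plus critical value times standard error" can be sketched as below. The critical values are assumed to be read from a z or t table and passed in, the standard errors use the usual forms s divided by the square root of n for a mean and the square root of p(1 - p)/n for a proportion, and the function names and numbers are made up.

```python
import math


def mean_interval(xbar, s, n, critical):
    """Interval about the mean: xbar -/+ critical * s / sqrt(n)."""
    half_width = critical * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


def proportion_interval(p_hat, n, z):
    """Interval about the proportion: p_hat -/+ z * sqrt(p_hat * (1 - p_hat) / n)."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# hypothetical large sample: n = 100, mean 50, standard deviation 10, 95% confidence
print(mean_interval(50, 10, 100, critical=1.96))
# hypothetical survey: 40% of 600 respondents, 95% confidence
print(proportion_interval(0.40, 600, z=1.96))
```

For a small sample from a normal population, the same mean_interval call applies with the t critical value for n - 1 degrees of freedom in place of 1.96.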

Hypothesis testing is a technique used to determine whether a specific conjecture about a parameter(s) of the population under study will be accepted or rejected.

A statistical hypothesis is an assertion about the value of the population parameter or the form of the distribution. In any test of hypothesis, there are two types of statistical hypotheses involved. The null hypothesis, denoted as Ho, is usually a statement of equality signifying no difference, no change, no relationship, or no effect. The alternative hypothesis, denoted by Ha, is a contrasting statement which is accepted when the sample data provide sufficient evidence against the null hypothesis.

A test statistic is chosen to decide which hypothesis to accept or reject. A decision rule specifies the range of values of the test statistic which leads to the rejection of the null hypothesis in favor of the alternative hypothesis.

The critical or rejection region defines the range of values of the test statistic that are very unlikely to be obtained when the null hypothesis is true and will result in the rejection of the null hypothesis.




Decision (based on sample evidence) versus true state of the population:

Reject Ho: Type I Error if Ho is true; Correct Decision if Ho is false

Fail to reject Ho: Correct Decision if Ho is true; Type II Error if Ho is false

The probability of obtaining a Type I error, denoted by $P[\text{Type I Error}] = P[\text{Reject Ho} \mid \text{Ho is true}] = \alpha$, often called the level of significance, measures the risk of rejecting a true null hypothesis. On the other hand, the probability of making a correct decision, computed as $P[\text{Accept Ho} \mid \text{Ho is true}] = 1 - \alpha$, is known as the level of confidence.

The probability of obtaining a Type II error, denoted by $P[\text{Type II Error}] = P[\text{Accept Ho} \mid \text{Ho is false}] = \beta$, measures the risk of accepting a false null hypothesis but can only be evaluated when the alternative hypothesis specifies an exact value. On the other hand, the probability of rejecting a false null hypothesis, computed as $P[\text{Reject Ho} \mid \text{Ho is false}] = 1 - \beta$, is known as the power of the test.

However, $\alpha + \beta \neq 1$ in general, i.e. Type I and Type II errors are not complementary events.

For the test on one population mean, the test statistic is given by $t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$, where $\mu_0$ is the hypothesized value of the mean, which is distributed as the Student’s $t$-distribution with $\nu = n - 1$ degrees of freedom.

Assumptions of the test on the population mean:

the sample is randomly taken from the population of interest

the population is normally distributed

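Decision Rule: At a specified $\alpha$, reject Ho if the condition listed for the stated Ha holds (the first condition applies when $n < 25$, the second when $n \geq 25$).

Ha: $\mu \neq \mu_0$; test procedure: two-tailed $t$-test; reject Ho if $|t| > t_{\alpha/2,(n-1)}$ (when $n < 25$) or $|t| > z_{\alpha/2}$ (when $n \geq 25$)

Ha: $\mu > \mu_0$; test procedure: one-tailed $t$-test; reject Ho if $t > t_{\alpha,(n-1)}$ (when $n < 25$) or $t > z_{\alpha}$ (when $n \geq 25$)

Ha: $\mu < \mu_0$; test procedure: one-tailed $t$-test; reject Ho if $t < -t_{\alpha,(n-1)}$ (when $n < 25$) or $t < -z_{\alpha}$ (when $n \geq 25$)

Otherwise, fail to reject Ho.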
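A two-tailed test on one population mean can be sketched as follows. The test statistic is assumed to be the usual one-sample form, the sample mean minus the hypothesized mean, divided by s over the square root of n. The critical value is read from a t table and passed in; the function names, the sample, and the hypothesized mean are made up.

```python
import math
import statistics


def t_statistic(sample, mu0):
    """One-sample t statistic with n - 1 degrees of freedom."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    return (xbar - mu0) / (s / math.sqrt(n))


def two_tailed_decision(t, critical):
    """Reject Ho when |t| exceeds the tabulated critical value."""
    return "reject Ho" if abs(t) > critical else "fail to reject Ho"


sample = [12, 14, 15, 15, 16, 18]  # hypothetical observations, n = 6
t = t_statistic(sample, mu0=13)
# tabulated critical value for alpha = 0.05, two-tailed, 5 degrees of freedom
print(round(t, 3), two_tailed_decision(t, critical=2.571))
```

Here the sample mean is 15 and s is 2, so t is the square root of 6, about 2.449, which falls just short of the critical value.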

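For the test on one population proportion, the test statistic is given by $z = \frac{\hat{P} - P_0}{\sqrt{\frac{P_0 (1 - P_0)}{n}}}$, where $P_0$ is the hypothesized value of the proportion, which is approximately distributed as standard normal.

Assumptions of the test on the population proportion:

the mean of $\hat{P}$ is $P$

the standard deviation of $\hat{P}$ is $\sigma_{\hat{P}} = \sqrt{\frac{P(1 - P)}{n}}$

the shape of the distribution is approximately normal for large samples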




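Decision Rule: At a specified $\alpha$, reject Ho if the condition listed for the stated Ha holds (when $n \geq 25$).

Ha: $P \neq P_0$; test procedure: two-tailed approximate $z$-test; reject Ho if $|z| > z_{\alpha/2}$

Ha: $P > P_0$; test procedure: one-tailed approximate $z$-test; reject Ho if $z > z_{\alpha}$

Ha: $P < P_0$; test procedure: one-tailed approximate $z$-test; reject Ho if $z < -z_{\alpha}$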


In the estimation and test of hypothesis on two population means, we have two cases of obtaining a test statistic, depending on the relation of the obtained samples.

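If samples are related to each other:

Point estimators

Mean difference, $\mu_d$: $\bar{d} = \frac{\sum_{i=1}^{n} d_i}{n}$, where $d_i$ is the difference between the paired observations for pair $i$

Standard deviation of the $d_i$’s: $s_d = \sqrt{\frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}}$

Standard error of $\bar{d}$: $s_{\bar{d}} = \frac{s_d}{\sqrt{n}}$

A $(1 - \alpha) \times 100\%$ confidence interval estimator

Mean difference, $\mu_d$: $\bar{d} \mp t_{\alpha/2,(n-1)} \, s_{\bar{d}}$ or $\bar{d} \mp z_{\alpha/2} \, s_{\bar{d}}$

Decision Rule: At a specified $\alpha$, reject Ho if the condition listed for the stated Ha holds (the first condition applies when $n < 25$, the second when $n \geq 25$), where $d_0$ is the hypothesized mean difference.

Ha: $\mu_d \neq d_0$; test procedure: two-tailed $t$-test on two population means using related samples; reject Ho if $|t| > t_{\alpha/2,(n-1)}$ or $|t| > z_{\alpha/2}$

Ha: $\mu_d > d_0$; test procedure: one-tailed $t$-test on two population means using related samples; reject Ho if $t > t_{\alpha,(n-1)}$ or $t > z_{\alpha}$

Ha: $\mu_d < d_0$; test procedure: one-tailed $t$-test on two population means using related samples; reject Ho if $t < -t_{\alpha,(n-1)}$ or $t < -z_{\alpha}$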


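If samples are independent of each other:

Point estimators

Mean difference, $\mu_1 - \mu_2$: $\bar{X}_1 - \bar{X}_2$

Pooled standard deviation: $s_p = \sqrt{\frac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{n_1 + n_2 - 2}}$

Standard error of $\bar{X}_1 - \bar{X}_2$: $s_{\bar{X}_1 - \bar{X}_2} = \sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}$

A $(1 - \alpha) \times 100\%$ confidence interval estimator

Mean difference, $\mu_1 - \mu_2$: $(\bar{X}_1 - \bar{X}_2) \mp t_{\alpha/2,(n_1 + n_2 - 2)} \, s_{\bar{X}_1 - \bar{X}_2}$ or $(\bar{X}_1 - \bar{X}_2) \mp z_{\alpha/2} \, s_{\bar{X}_1 - \bar{X}_2}$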

In the test of hypothesis on more than two population means, the appropriate test is the $F$-test, which can be performed using the analysis of variance (ANOVA) technique.

The Ho and Ha will always be the same in the ANOVA technique. For the Ho, it is hypothesized that there is no difference between the given population means. For the Ha, it is hypothesized that at least one of the population means differs.

There are three assumptions considered essential for the $F$-test to be valid: the samples should come from normally distributed populations; the variances of the populations should be equal, often referred to as homoscedasticity; and the errors must be mutually independent.

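ANOVA table for $k$ populations with sample sizes $n_i$, sample totals $T_i$, sample variances $s_i^2$, grand total $GT$, and $N = \sum_{i=1}^{k} n_i$ observations overall:

Source of Variation: Among Populations; Degrees of Freedom: $k - 1$; Sum of Squares: $ASS = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - \frac{GT^2}{N}$; Mean Squares: $AMS = \frac{ASS}{k - 1}$; $F = \frac{AMS}{WMS}$

Source of Variation: Within Populations; Degrees of Freedom: $N - k$; Sum of Squares: $WSS = \sum_{i=1}^{k} s_i^2 (n_i - 1)$; Mean Squares: $WMS = \frac{WSS}{N - k}$

Source of Variation: Total; Degrees of Freedom: $N - 1$; Sum of Squares: $ASS + WSS$

Decision Rule: At a specified $\alpha$, for Ha: $\mu_i \neq \mu_j$ for at least one pair $i, j \in \{1, \dots, k\}$ (one-way analysis of variance), reject Ho if $F > F_{\alpha,(k-1),(N-k)}$. Otherwise, fail to reject Ho.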

The Pearson’s correlation coefficient, denoted by $\rho$, is a parameter which gives a measure of linear relationship between two variables in the population and is estimated using $r = \frac{s_{XY}}{s_X s_Y}$. The covariance between $X$ and $Y$ measures the covariation between the two variables and is computed using:

$s_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n(n - 1)}$

The test statistic to be used to test the statistical significance of the sample correlation coefficient is: $t = r \sqrt{\frac{n - 2}{1 - r^2}}$
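The sample correlation coefficient and its test statistic can be computed from the definitions. The sketch below assumes the usual forms, r equal to the sample covariance over the product of the sample standard deviations and t equal to r times the square root of (n - 2)/(1 - r²); the function names and the paired data are made up.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)


def correlation_t(r, n):
    """Test statistic for the significance of r, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


xs = [1, 2, 3, 4, 5]  # hypothetical paired observations
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 4), round(correlation_t(r, len(xs)), 4))
```

The resulting t is then compared with the tabulated t value for n - 2 degrees of freedom, in the same way as in the other decision rules.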

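Decision Rule: At a specified $\alpha$, reject Ho if the condition listed for the stated Ha holds (the first condition applies when $n < 25$, the second when $n \geq 25$).

Ha: $\rho \neq 0$; test procedure: two-tailed $t$-test; reject Ho if $|t| > t_{\alpha/2,(n-2)}$ or $|t| > z_{\alpha/2}$

Ha: $\rho > 0$; test procedure: one-tailed $t$-test; reject Ho if $t > t_{\alpha,(n-2)}$ or $t > z_{\alpha}$

Ha: $\rho < 0$; test procedure: one-tailed $t$-test; reject Ho if $t < -t_{\alpha,(n-2)}$ or $t < -z_{\alpha}$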

The estimated line for a sample of $n$ pairs of observations is $\hat{Y}_i = b_0 + b_1 X_i$, where $b_0$ is the sample regression constant and $b_1$ is the sample regression coefficient.

A measure of the adequacy of the model is the coefficient of determination, which gives values on a scale of 0% to 100%. It is the proportion of the total variability in $Y$ that can be explained or accounted for by $Y$’s relationship with $X$. $R^2 = r^2 = \left( \frac{s_{XY}}{s_X s_Y} \right)^2$

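In the $\chi^2$ goodness-of-fit test, the null and alternative hypotheses remain the same for every sample:

Ho: The observed frequencies are in agreement with the expected frequencies

Ha: The observed frequencies are not in agreement with the expected frequencies

The test statistic follows the $\chi^2$ distribution with $k - 1$ degrees of freedom, where $k$ is the number of categories, and is computed as:

$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = \sum_{i=1}^{k} \frac{O_i^2}{E_i} - n$

where $O_i$ and $E_i$ are the observed and expected frequencies of category $i$ and $n$ is the sample size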
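The goodness-of-fit statistic is a single sum over the categories. The sketch below assumes the usual form, the sum of (observed minus expected) squared over expected, and also evaluates the shortcut form, the sum of observed squared over expected minus n, as a check. The critical value is read from a chi-square table and passed in; the function name and the die-rolling counts are made up.

```python
def chi_square_gof(observed, expected):
    """Sum over categories of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# hypothetical counts from 120 rolls of a die, tested against a fair die
observed = [18, 22, 20, 25, 15, 20]
expected = [20] * 6

statistic = chi_square_gof(observed, expected)
shortcut = sum(o ** 2 / e for o, e in zip(observed, expected)) - sum(observed)
print(statistic, shortcut)

# tabulated critical value for alpha = 0.05 with 6 - 1 = 5 degrees of freedom
critical = 11.070
print("reject Ho" if statistic > critical else "fail to reject Ho")
```

All expected frequencies here are at least 5, so the usual assumption on expected frequencies is satisfied.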

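In the $\chi^2$ test of independence, the null and alternative hypotheses remain the same for every sample:

Ho: The row variable and the column variable are not associated

Ha: The row variable and the column variable are associated

The test statistic follows the $\chi^2$ distribution with $(r - 1)(c - 1)$ degrees of freedom, where $r$ is the number of rows and $c$ is the number of columns, and is computed as:

$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{O_{ij}^2}{E_{ij}} - n$

Decision Rule: At a specified $\alpha$, with Ha as stated above for each test:

$\chi^2$ goodness-of-fit test: reject Ho if $\chi^2 > \chi^2_{\alpha,(k-1)}$

$\chi^2$ test of independence: reject Ho if $\chi^2 > \chi^2_{\alpha,(r-1)(c-1)}$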


$\chi^2$ tests always require the following assumptions:

a simple random sample of size $n$ must be drawn from the population

the categories are non-overlapping

an observation can belong to one and only one category

no more than 20% of the categories have expected frequencies less than 5, and no expected frequency is less than 1
