statistics in practice, part i


Upload: cris-ely

Post on 05-Dec-2015


Page 1: Statistics in Practice, Part I

STATISTICS IN PRACTICE

Part I

The use of statistics in business process improvement

All rights reserved. Nothing from this publication may be copied, stored in an authorised data

file, or made public in any form or any manner whether electronic, mechanical, photocopying,

photography or any other means, without the prior written consent of the author.


Contents

1 Background to statistics

1.1 What is statistics?

1.2 Research terminology

2 Descriptive statistics

2.1 Types of tables and graphs

2.2 Measuring levels

2.3 Measuring trends

2.3.1 Centre measures

2.3.2 Distribution measures

2.4 Trend measures per measure level

2.5 Normal distribution (1)

2.6 Questions and assignments

3 Calculating risk

3.1 Questions and assignments

4 Inductive statistics

4.1 The normal distribution (2)

4.2 Z-scores

Appendices:

• Table with z-values


1 Background to statistics

Imagine you own a computer shop and have signed a contract with IBMS: all students can
purchase a notebook from your shop at a discount. As a good businessman, you would like
an estimate of the number of purchasers this will bring you, if only because you have
to ensure you have adequate stock. Clearly an investigation is needed. But...

- How can you analyse the results from the questionnaire?

- How many students do you have to question to get a good picture?

- How can you present the results meaningfully and understandably?

- How great is the chance that the result from your survey is a good prediction of the final sales

figures?

To answer these questions, you need statistics. We understand statistics to mean ‘turning data

into information’, in order to be able to analyse better. The consistent methods used make an

analysis clearer, simpler and more efficient.


1.1 What is statistics?

It is usually impossible to involve all the people or objects you want to know something about
in an investigation, and it is also unnecessary. If the businessman from the example above
wants an indication of the number of computers he will sell to IBMS students as a result of
his discount offer, he does not need to ask all IBMS students. Taking a sample is sufficient.
If his sample is sufficiently large and forms a good reflection of the student group as a whole,
then the result will, in percentage terms, be a good match to what he would get if he had asked
all the students. On the basis of the sample's results, he can make a sound estimate of how
many computers he must have in stock.

Therefore, market research is about processing data at different levels. You are involved with the

results of surveys and also in the generalisation of the data to a larger group.

The analysis and description of random sample (survey) data is called descriptive statistics. This

is the simplest form and you are only involved with the data from the sample. You construct

tables, you calculate core numbers such as means, or you display graphically what you have

discovered.

However, we are not usually in this situation. With most surveys, we want to make statements

that go beyond this. For example, you would like to know how great the chance is that your

results from the sample are valid for the complete category from which the sample was taken, or

you want to compare your results with those of an earlier investigation. For these types of

questions, you need more complex calculations and considerably more knowledge. With all
calculations where you go beyond the sample results we talk about inductive statistics (also
known as inferential statistics). An example of this is when you try to forecast how many IBMS
students will buy one of your laptops, based on a sample of only 75 IBMS students.


1.2 Research terminology

Statistics is a discipline that is full of specialised terminology. The employment of the correct

concepts in the correct context is very important. By using uniform terms, you are in the position

to confer with colleagues in such a way that you understand each other and you can explain

simply to your customer what it means. But you also must have a command of the research

terminology in order to be able to understand statistics. Statistics is not about learning formulae in

isolation but about gaining fundamental insight into basic processes, so no misunderstandings

can exist about the meaning of the terms used. You must strictly adhere to this. To start with, we

make a differentiation between research units and research properties.

Research units and properties

We will stay with the example given earlier, a survey into students at IBMS, with the goal of

making predictions about all the students in the college. The IBMS students concerned are

therefore the research units of this investigation.

All the research units together we call the population of the investigation. You will make

predictions about the population based on the results of the survey after the completion of the

market investigation.

Research units are often people, but that is not always the case. If you want to investigate the

safety of pedestrian crossings, then the pedestrian crossings are your research units and if you

want to do an investigation into the percentage of accounts of Dutch companies that contain

errors, then the accounts of Dutch companies are your research units.

The part of the population from which you actually collect information is the sample. If
you question or observe a group of people who are part of the population, that group makes up
your sample. But if all Dutch pedestrian crossings or accounts make up your
population, you can also draw a sample from them.

A research unit in a sample is also called a record. If the research units are people that have

answered questions, then we usually speak of them as respondents.

Therefore, both your population and your random sample contain research units and one

research unit in your sample is called a respondent or a record.

What you find in the sample itself is, as a rule, not the end result of your research. Why
would our shopkeeper want to know that, for example, one hundred students questioned in the
sample said that they want to buy a computer? The only important thing is approximately how
many students he can expect as customers in the future, that is, which fraction of the
population. We call this a proportion of the population.

What we are doing in the research is determining which percentage of those questioned in the

sample are interested and we use this percentage as an indicator for the situation for all IBMS

students. Therefore, characteristic for market research is the collection of information from

research units that are a part of the sample with the aim of making predictions about a

population! This is an important understanding.

Obviously, IBMS students differ from each other. For example, some students are male, others

are female. Some live in Eindhoven, others live somewhere else. Additionally, there are students

who are satisfied with their studies, and others who do not find much in them. In other words, the

research units have properties. These properties or characteristics of the research units we

denote in research terminology as variables. Pedestrian crossings also have properties because

some have traffic lights and others not. Such differences can be responsible for variations in the

research results because the number of accidents, for example, can vary according to the

presence of traffic lights.


Study town, study course, gender, satisfaction and the presence of traffic lights are examples of

the characteristics of research units and thereby of variables in your investigation.

In our notebook example, a variable such as the income of students is important because it
could lead to differences in purchasing a computer. A student with a high income might be more
inclined to buy a computer than a student with less income.

We distinguish two types of properties. If the property of the research unit cannot change within

the proposed setting (or: conceptual model, see below), then we talk about an independent

variable.

Our example is directed to the question of whether the independent variable has a relationship

with the purchase or not of a computer. The latter, the purchase behaviour, we call a dependent

variable, because it is hypothesized to be dependent on the level of the independent variable(s).

Sometimes, the distinction between dependent and independent variables is difficult to draw

because both appear to be independent. In this type of situation, you can ask which of the two

precedes the other in time. The independent variable always precedes the dependent variable.

For an investigation into the relationship between intelligence and income, for example, it

appears that both characteristics are independent but a high IQ precedes a high income.

Back to the example investigation in which we wish to verify if there is a relationship between the

income of students (the independent variable) and the purchase behaviour (the dependent

variable). Reversing these is not possible because you will not get more income just by

buying a computer. The purchasing behaviour is something that can change; therefore we call this

type of data dependent.

It is normal to represent the relationship between an independent and dependent variable in a

diagram called the conceptual model. A conceptual model in its basic form looks like this:

study course              →    buying behaviour
(independent variable)         (dependent variable)

If you write a market research proposal or a report, a conceptual model makes it clear at a glance

what the focus of the research is to be. Therefore, it should seldom be absent.

Examples of independent variables are gender, education, shop outlets, hair colour, IQ, type of

car, the presence or absence of traffic lights, or departments in companies.

Examples of dependent variables are satisfaction, an opinion about a certain subject, an intention

to purchase and the number of accidents.

Representation

The core of a lot of market research is, as mentioned, that we collect data from a sample in order

to make predictions about the population. Naturally, this is only possible if the properties of the

research units of the sample have the same composition as the population. If 40% of the

population in the previous example consists of students with a high income, then this has to be

the case in the sample, otherwise we will get a distorted impression. The random sample must,

as it is called, be representative. Only if the sample is representative can we generalise the

results of the sample for the population without disaster.


Sample size

Whenever we research a sample, the results will generally not totally agree with the situation in

the population. For example, if we find that 15% of those questioned in our sample say they will

buy a computer soon, then we hope that this 15% gives a good indication as a prediction of the

purchasing behaviour of all the students, but we do not have that certainty. It is certain that the

more students we question – the greater the number of research units in our sample – the greater

the certainty that our 15% reflects the situation in the population.

In order to calculate how close our sample results are to the situation in the population, inductive

statistics become involved. The accuracy that we require and the extent to which we will accept

deviations are central to inductive statistics and are heavily involved with the theory of probability.

For example, we accept that the prediction will not deviate by more than 5% (called margin) with

a probability of 90% (accuracy). We will delve much deeper into this later. Let's start with

descriptive statistics.
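The margin and probability mentioned above can be illustrated with a minimal sketch. It assumes the common normal approximation for a sample proportion, which belongs to inductive statistics and is not derived here. The 15% share and the sample of 75 students are borrowed from the examples above purely for illustration, and the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def margin_of_error(p: float, n: int, confidence: float) -> float:
    """Half-width of the interval around a sample proportion p (normal approximation)."""
    # Two-sided z-value, roughly 1.645 for a probability of 90%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)


def required_sample_size(p: float, margin: float, confidence: float) -> int:
    """Smallest n for which the margin of error does not exceed `margin`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)


# Assumed inputs: 15% of a sample of 75 students say they will buy.
print(f"margin with n = 75: {margin_of_error(0.15, 75, 0.90):.3f}")
print("n needed for a margin of 5 percentage points:",
      required_sample_size(0.15, 0.05, 0.90))
```

The sketch ignores the size of the population itself, so it is best read as a rough indication of how the margin shrinks as the sample grows.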


2 Descriptive statistics

2.1 Types of tables and graphs

Two types of tables you will use in market research are frequency tables and crosstabulations
(also called cross-relationship tables). A frequency table is concerned with one variable. A
cross-relationship table is meant to display the relationship between two or more variables.

Frequency tables

Here is an example of the SPSS (Statistical Package for the Social Sciences) output of a frequency
table, also often called a straight count. It is the simplest and most common form of presentation.

Study course

                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Technical        230      57,2            57,2                 57,2
        Nursing          172      42,8            42,8                100,0
        Total            402     100,0           100,0

Of course, in a report for the customer you never present crude computer output; you always
process it first. It is not much trouble to rework the table above into the result below.

Table 1: Study courses

             Number   Percent
Technical       230      57,2
Nursing         172      42,8
Total           402     100,0

From the table you can see that 402 students were questioned for this investigation. These are all
the respondents, which sets the total at 100%. Of the 402 interviewed, 230 are following a
technical course (approximately 57%) and 172 a nursing course (43%).

Crosstabulations

We select a cross-relationship table if we want to know whether a part of the sample with a
particular property scores differently on a question from those with other properties. For
example, if you want to know from the sample whether pedestrian crossings with traffic lights
are safer than those without, or whether customers of a particular chain store are more
satisfied with one store than another, or whether women spend more money than men, then a
cross-relationship table is the correct tool.

The construction and interpretation of crosstabulations appear simple, but they are definitely not.
In day-to-day use, many errors are made in building them and in reading their output! To avoid
mistakes, it is strongly recommended to keep to a fixed procedure. Consistent adherence to the
procedure is essential, and it is important to keep the steps in mind when you construct or
interpret a cross-relationship table!

Basic rules for building a cross-relationship table

- The independent variable is always at the top.

- Percentages are always used in the cells (absolute numbers are optional).

- Calculating percentages is only done in the columns.

- Interpreting is only done by looking at the percentages.

For our example investigation, a cross-relationship table is constructed to see if participation
in the notebook project varies per study course. To keep it simple, we assume that the
university only has technical and nursing students. The table is based on computer output to
which several cosmetic improvements have been made; the same is expected from you when you
construct this type of table.

Cross-relationship table: Participation in notebook project by study course.

                                    Study course
                                Technical    Nursing      Total
Participate in          Yes           174        116        290
notebook project?                   77,0%      67,4%      72,9%
                        No             52         56        108
                                    23,0%      32,6%      27,1%
Total                                 226        172        398
                                   100,0%     100,0%     100,0%

In this case, the study course is the independent variable. This characteristic belongs at the top in

the table heading. The question in the investigation (“Will you or will you not participate in the

notebook project?”) is the dependent variable and therefore, by definition, is placed on the left of

the table.

In each of the cells, you can see the absolute numbers and the percentages. Only the

percentages are actually important! You ignore the absolute numbers in your reporting.

On the bottom row, you see 100% three times. This shows you that, according to the rule, the

percentages are calculated in columns because each column has a total of 100%. The number of

technical students questioned is 100% in total just as the number of nursing students and the

total number of respondents.

Only a table constructed in this way allows correct interpretation. In this case, the
interpretation is as follows: 73% of all the students questioned indicate that they will
participate in the notebook project. In the survey, there is a difference according to the
course: a higher percentage of technical students than nursing students will take part.
The percentages are 77% against 67%.

Whether or not we can conclude that the difference in the sample is also a difference in the

population is an important question which we cannot yet answer. For predictions about the

population we need inductive statistics. This will be covered later.
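The rule that percentages are calculated only within columns can be sketched in a few lines of Python. The counts are those of the notebook-project table above; the function name and data layout are assumptions made for this illustration, not SPSS output.

```python
# Counts per study course (the independent variable, one entry per column).
counts = {
    "Technical": {"Yes": 174, "No": 52},
    "Nursing": {"Yes": 116, "No": 56},
}


def column_percentages(table):
    """Return the percentage of each answer within every column, plus a Total column."""
    answers = list(next(iter(table.values())))
    # Build the Total column by summing each answer over all columns.
    with_total = dict(table)
    with_total["Total"] = {a: sum(col[a] for col in table.values()) for a in answers}
    result = {}
    for column, cells in with_total.items():
        column_total = sum(cells.values())  # this is the 100% of the column
        result[column] = {a: 100 * cells[a] / column_total for a in answers}
    return result


percentages = column_percentages(counts)
for column, cells in percentages.items():
    print(column, {answer: round(value, 1) for answer, value in cells.items()})
```

Each column is divided by its own total, so the technical and nursing groups can be compared even though they differ in size.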

Figures


The readability of a report benefits considerably when the data from the tables is turned into

figures. The two most common figures are histograms (or bar charts) and pie charts. To

construct bar and pie charts, most researchers use Excel in view of its ease of use and the large

number of layout possibilities.

2.2 Measuring levels

In a questionnaire, questions are included that can be roughly split into four categories. The

difference is in the way the answers can be processed statistically.

Look at the first four questions in the text box below and try to figure out what the differences for

processing the results are.

Questions with answers of various measuring levels

1 Please mark with a cross whether you are a man or a woman.
☐ Man
☐ Woman

2 Do you consider yourself a satisfied or dissatisfied customer of this store?
☐ Very dissatisfied
☐ Dissatisfied
☐ Satisfied
☐ Very satisfied

3 What time is it?
………………. o'clock

4 How old are you?
................. years

------------------------------------------------------------------------------------------------------------------------

5 How old are you?
☐ 19 or younger
☐ 20 to 29
☐ 30 to 39
☐ et cetera

The possible answers to the questions are each of a different order; each has its own measuring

level. It is important to know what the measuring level of a score is. The choice has

consequences for the statistical processing possibilities. Before a researcher develops a

questionnaire, this must be realised.

In the first question about gender, the respondent can select one of two options, in this case

without there being any mention of a value difference between either option. You are a man or a


woman. There is nothing between them and there is also no value difference. The only thing a

researcher can do is to determine how many people belong in the one category and how many

people belong in the other category. There is not a lot to calculate. We call this data at the

nominal level. All yes/no questions are also nominal data.

It is different with question 2 which, likewise, is a closed question but with a value differentiation

appearing. For each answer there is some indication of more or less, or better or worse. The

answers vary from high to low or from much to little. Therefore this is data at the ordinal level.

In question 3, respondents fill in the time. This is an example of data at the interval level.
The interval between 12.00h and 14.00h is just as long as the interval between 16.00h and 18.00h.
However, we cannot say that 14.00h is “two times as late” as 07.00h. Other examples of data at
the interval level include people’s IQ and the outside temperature.

Finally, in question 4, a number is filled in for the age. Age, at least when it is recorded as
an exact number rather than in classes, is ratio data. Ratio data allow the most calculation
possibilities because the zero point is meaningful: someone who is four years old has lived
twice as long as someone who is only two years old. Other examples of ratio scale data include
income and weight.

2.3 Measuring trends

Every market researcher will be curious to know whether particular trends in the results are true.

The two most important ways to represent numerical results are by centre and distribution. Or, in

other words, where is the highest concentration of numerical observations and in how far do

these observations vary?

Mean, median and mode show which values the data are grouped around. These are centre
measures. The range, the variance and the standard deviation are needed to determine how widely
or narrowly the observations are spread. Percentile scores say something about the position of
individual scores compared with the other scores.

Difference between centre and distribution measurements

Centre: where is the highest concentration of observations?

Distribution: how far do the observations vary?

Centre measurements are explained in section 2.3.1 and distribution measurements in section

2.3.2.


2.3.1 Centre measures

The most important centre measures are the arithmetical mean, the median and the mode.

Arithmetical mean

When five people questioned are 19, 25, 25, 21 and 35 years old, the mean age is 25 years

((19+25+25+21+35) / 5 = 25).

In calculating the arithmetic mean, you will meet two different symbols in the formulae. Population

and sample means have their own designations. They are as follows:

µ = Population mean
x̄ = Sample mean
N = The number of observations in the population
n = The number of observations in the sample
x_i = Individual score (i = 1 is the first score, i = 2 the second score, et cetera)
∑ = Sum of the observations (the subscript i runs from the first score, 1, up to and including the last, n)

The calculation of the arithmetic mean in a formula:

\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \]

The population mean µ is calculated in the same way, using all N observations in the population.
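As a quick check of the arithmetic, the mean of the five ages from the example can be computed directly. This is only an illustrative sketch; the variable names are chosen for this example.

```python
ages = [19, 25, 25, 21, 35]

# Sum of the observations divided by the number of observations n.
n = len(ages)
sample_mean = sum(ages) / n
print(sample_mean)  # 25.0
```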

Mode

The mode is the most frequent observation. Look at the following list:

19

25

25

21

35

There is one score occurring twice, 25. That is the mode.

Tip: You have most certainly heard the term ‘modal income’. Many people think that this means
the average income. This is not correct: it is the most common income.

We work with classes in many investigations. For example, respondents can indicate whether
their age falls in a certain class such as 10-19 years or 20-29 years. In this case, the most
prevalent class is the modal class.

Data grouped into classes also have a mode: it is taken to be the middle of the modal class.
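Finding the mode and the modal class can be sketched as follows. The list of ages is the one used above; the class counts are hypothetical numbers invented for this illustration only.

```python
from collections import Counter
from statistics import mode

ages = [19, 25, 25, 21, 35]
print(mode(ages))  # 25, the only score that occurs twice

# Hypothetical number of respondents per age class (not taken from any survey).
class_counts = Counter({"10-19": 4, "20-29": 9, "30-39": 5})
modal_class, frequency = class_counts.most_common(1)[0]
print(modal_class, frequency)  # 20-29 9
```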


Median

The median is the middle observation after all the observations have been sorted from low to

high. In other words, the median is the observation where 50% of the observations are above and

50% below. In order to determine the median, you first sort the observations from low to high (19,

21, 25, 25, 35). Subsequently, we look at which one has the middle score. In this case, it is 25

because there are two numbers below and two above.

Whenever the number of observations is even and therefore you have two middle observations,

the median is the arithmetical mean of these two scores. Imagine the list is 19, 21, 25, 27, 29, 35,

then there are two numbers in the middle namely 25 and 27. The median is 26.

You can use the following formula, to determine the location of the median in the set of

observations:

Lm = (n+1)/2

In which,

Lm = the location of the median

n = the number of observations

Using the five observations above (19, 21, 25, 25, 35), we determine Lm = (5 + 1)/2 = 3. Hence,

the median is located at the third observation. Be careful though, because 3 is not the median!

The median is equal to the value of the third observation, which is 25! Using the six observations
given above (19, 21, 25, 27, 29, 35), we determine Lm = (6 + 1)/2 = 3.5. The median in this set of
observations is the value of the 3.5th observation. The 3.5th observation is exactly in the middle
of the third and fourth observations (25 and 27 respectively). Therefore, we take the average of
these two observations: (25 + 27)/2 = 26.
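The location formula can be turned into a short sketch. The function name is hypothetical, and the code assumes a non-empty list of numbers.

```python
def median_by_location(observations):
    """Median via the location formula Lm = (n + 1) / 2 (positions count from 1)."""
    ordered = sorted(observations)
    location = (len(ordered) + 1) / 2
    lower = ordered[int(location) - 1]  # observation at the location, rounded down
    if location.is_integer():
        return lower  # odd n: one middle observation
    upper = ordered[int(location)]  # even n: the next observation
    return (lower + upper) / 2


print(median_by_location([19, 25, 25, 21, 35]))      # 25
print(median_by_location([19, 21, 25, 27, 29, 35]))  # 26.0
```

Note that the function returns the value found at the location, not the location itself.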

Examples of centre measures

Below you will find the frequency distribution according to the SPSS output, for which the mean,
median and mode are calculated.

As you can see, the mean score of these 13 observations is 27614 (rounded). The median is

25666. You can see for yourself where the median is if you look under 'Cumulative Percent'. 50%

of the observations are larger than 25666 and 50% are less. The values in the first column are

automatically sorted from low to high. It must be the 7th observation (middle) because there are

13.

Under 'Frequency', or ‘number’, you can see how often an observation appears. There is one

number that appears more often than the rest, 28950, and therefore that must be the mode.


N        Valid           13
         Missing          0
Mean               27613,92
Median             25666,00
Mode                  28950

                Frequency   Percent   Valid Percent   Cumulative Percent
Valid   22222           1       7,7             7,7                  7,7
        22333           1       7,7             7,7                 15,4
        23434           1       7,7             7,7                 23,1
        25050           2      15,4            15,4                 38,5
        25555           1       7,7             7,7                 46,2
        25666           1       7,7             7,7                 53,8
        28950           3      23,1            23,1                 76,9
        34244           2      15,4            15,4                 92,3
        34333           1       7,7             7,7                100,0
        Total          13     100,0           100,0
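The figures in this output can be reproduced with a small sketch. The values and frequencies are copied from the table; the code is an illustration in Python, not the SPSS procedure, and the variable names are assumptions.

```python
from statistics import mean, median, mode

# Value -> frequency, copied from the frequency table above.
frequencies = {
    22222: 1, 22333: 1, 23434: 1, 25050: 2, 25555: 1,
    25666: 1, 28950: 3, 34244: 2, 34333: 1,
}

# Expand the table back into the individual, sorted observations.
observations = [value for value, count in sorted(frequencies.items())
                for _ in range(count)]
n = len(observations)

# Rebuild the Percent and Cumulative Percent columns.
cumulative = 0
cumulative_percent = {}
for value, count in sorted(frequencies.items()):
    cumulative += count
    cumulative_percent[value] = 100 * cumulative / n
    print(f"{value:>6} {count:>3} {100 * count / n:6.1f} {cumulative_percent[value]:6.1f}")

print("mean:", round(mean(observations), 2))
print("median:", median(observations))
print("mode:", mode(observations))
```

The median is the first value at which the cumulative percentage reaches 50% or more, which matches the seventh of the 13 observations.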


2.6 Questions and assignments

In an investigation for the C-1000 supermarket chain, researchers were looking into whether

there was a difference in satisfaction between two stores.

Store A: Of the 100 people interviewed in this shop, 80 were satisfied with the shop.

Store B: Of the 150 people interviewed in this shop, 100 were satisfied with the shop.

1 What is the population of the investigation?

……………………………………………………………………………………..

2 What is the independent variable in this investigation?

……………………………………………………………………………………..

3 Fill in the absolute numbers in the table below:

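               ............   ............   total
............   ............   ............   ............
............   ............   ............   ............
total          ............   ............   250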

4 Fill in the percentages in the table below:

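               ............   ............   total
............   ............   ............   ............
............   ............   ............   ............
total          ............   ............   100%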

5 Interpret the results of this investigation

...........................................................................................................................

...........................................................................................................................

………………………………………………………………………………………….

Another investigation was performed by Nike. The goal of this research was to find out whether

there was a difference between young and older people concerning the judgement of the

‘coolness’ of the brand Nike. This trend-sensitive manufacturer of fashionable brand-name

articles has few of its own sales outlets. Therefore, customers are difficult to find for research.

From a previous survey it was found that there was high degree of correlation between the

customers of Foot Locker shops and the purchasers of Nike products, so with the permission of

Foot Locker, people were interviewed with questions about Nike outside the shops.

(This example is fictional.)

6 What is the population of the investigation?


……………………………………………………………………………………..

Nike feared that people over twenty found the brand cooler than youngsters under twenty which

would be a disaster for Nike. From the investigation, the following appeared:

Under 20: Of the 200 youngsters in the sample, 80 found Nike to be a cool brand.

20 or older: Of this group of 150, 100 found Nike to be a cool brand.

7 What is the independent variable in this investigation?

……………………………………………………………………………………..

8 Fill in the absolute numbers in the table below:

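               ............   ............   total
............   ............   ............   ............
............   ............   ............   ............
total          ............   ............   350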

9 Fill in the percentages in the table below:

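               ............   ............   total
............   ............   ............   ............
............   ............   ............   ............
total          ............   ............   100%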

10 Interpret the results

...........................................................................................................................

...........................................................................................................................

………………………………………………………………………………………….


Given the following investigation result (n = number of people interviewed):

Age         Visits a café weekly   Number interviewed (n)
15 to 24                      60                      200
25 to 34                      55                      200
35 to 44                      52                      200
45 to 54                      50                      200
55+                           75                      200

11. How great is the chance that a 23 year old will visit a café weekly?

……………………………………………………………………………………..

12. What is the median of the numbers in the column 'Visits a café weekly’?

……………………………………………………………………………………..

13. What is the arithmetic mean for the numbers in the column 'Visits a café weekly’?

……………………………………………………………………………………..

14. We can make a distinction between four different measuring levels. Explain for all

of the following data what measuring level is used.

a) Data about the weight of packages of coffee

………………………………………………………………………………………………………

b) Data about the country of origin of customers

………………………………………………………………………………………………………

c) Data about the intelligence quotient (IQ) of students

………………………………………………………………………………………………………

d) Data about the level of satisfaction on a five-point scale

………………………………………………………………………………………………………


15. Calculate the mean, mode and median for the following scores of the statistics

exam from last year:

7 5,5

4 2,5

1 6

8 7

6,5 5

6,5 4,5

5 5

3,5 8

4 9

10 2

Mean:

………………………………………………………………………………………………………..

Mode:

……………………………………………………………………………………………………….

Median:

……………………………………………………………………………………………………….