web viewan auditor determines the sample size of the book to be audited on ... does the question...


Upload: leduong

Post on 02-Feb-2018


INTRODUCTION TO STATISTICS

1.1 Definitions of frequently used terms

Statistics

Collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions.

Variable

Characteristic or attribute that can assume different values

Random Variable

A variable whose values are determined by chance.

Population

All subjects possessing a common characteristic that is being studied.

Sample

A subgroup or subset of the population.

Parameter

Characteristic or measure obtained from a population.

Statistic (not to be confused with Statistics)

Characteristic or measure obtained from a sample.

Descriptive Statistics

Collection, organization, summarization, and presentation of data.

Inferential Statistics

Generalizing from samples to populations using probabilities. Performing hypothesis testing, determining relationships between variables, and making predictions.

Qualitative Variables

Variables which assume non-numerical values.

Quantitative Variables

Variables which assume numerical values.

Discrete Variables

Variables which assume a finite or countable number of possible values. Usually obtained by counting.

Continuous Variables

Variables which can assume any value within an interval, so the number of possible values is uncountably infinite. Usually obtained by measurement.

Nominal Level

Level of measurement which classifies data into mutually exclusive, all inclusive categories in which no order or ranking can be imposed on the data.

Ordinal Level

Level of measurement which classifies data into categories that can be ranked. Precise differences between the ranks do not exist.

Interval Level

Level of measurement which classifies data that can be ranked and differences are meaningful. However, there is no meaningful zero, so ratios are meaningless.

Ratio Level

Level of measurement which classifies data that can be ranked, differences are meaningful, and there is a true zero. True ratios exist between the different units of measure.

Random Sampling

Sampling in which the data is collected using chance methods or random numbers.

Systematic Sampling

Sampling in which data is obtained by selecting every kth object.

Convenience Sampling

Sampling in which data that is readily available is used.

Stratified Sampling

Sampling in which the population is divided into groups (called strata) according to some characteristic. Each of these strata is then sampled using one of the other sampling techniques.

Cluster Sampling

Sampling in which the population is divided into groups (usually geographically). Some of these groups are randomly selected, and then all of the elements in those groups are selected.
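The four chance-based techniques defined above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the numbered population, and the fixed seeds are assumptions made for the example and do not come from the notes.

```python
import random

def simple_random_sample(population, n, seed=0):
    """Random sampling: every item has the same chance of being chosen."""
    rng = random.Random(seed)  # fixed seed only so the example is repeatable
    return rng.sample(population, n)

def systematic_sample(population, k, start=0):
    """Systematic sampling: take every kth item, beginning at index `start`."""
    return population[start::k]

def stratified_sample(population, stratum_of, n_per_stratum, seed=0):
    """Stratified sampling: split into strata, then randomly sample each one."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, seed=0):
    """Cluster sampling: randomly pick whole groups and keep all their items."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

# Hypothetical population: subjects numbered 1 to 20.
population = list(range(1, 21))
print(simple_random_sample(population, 5))
print(systematic_sample(population, 5))
print(stratified_sample(population, lambda x: "even" if x % 2 == 0 else "odd", 2))
print(cluster_sample({"north": [1, 2, 3], "south": [4, 5], "east": [6]}, 2))
```

Convenience sampling is left out on purpose: it uses whatever data is at hand, so there is no chance mechanism to code.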

1.2 ROLE OF STATISTICS

Statistics as a discipline is considered indispensable in almost all spheres of human knowledge. There is hardly any branch of study which does not use statistics. Scientific, social and economic studies use statistics in one form or another. These disciplines draw on observations, facts and figures, enquiries and experiments, all of which rely on statistics and statistical methods. Statistics touches almost every aspect of an enquiry. It mainly aims at simplifying the complexity of the information collected in an enquiry. It presents data in a simplified form so as to make them intelligible. It analyses data and facilitates the drawing of conclusions. Now let us briefly discuss some of the important functions of statistics.

a. Presents facts in simple form:

Statistics presents facts and figures in a definite form. That makes a statement more logical and convincing than mere description. It condenses the whole mass of figures into a single figure. This makes the problem intelligible.

b. Reduces the Complexity of data:

Statistics simplifies the complexity of data. The raw data are unintelligible. We make them simple and intelligible by using different statistical measures. Some commonly used measures and tools are graphs, averages, dispersion, skewness, kurtosis, correlation and regression. These measures help in interpreting data and drawing inferences. In this way, statistics enlarges the horizon of one's knowledge.

c. Facilitates comparison:

Comparison between different sets of observations is an important function of statistics. Comparison is necessary to draw conclusions. As Professor Boddington rightly points out, "the object of statistics is to enable comparison between past and present results to ascertain the reasons for changes, which have taken place and the effect of such changes in future." So, to determine the efficiency of any measure, comparison is necessary. Statistical devices like averages, ratios and coefficients are used for the purpose of comparison.

d. Testing hypothesis:

Formulating and testing hypotheses is an important function of statistics. This helps in developing new theories. Statistics thus checks whether a claim holds up against the data and helps in generating new ideas.

e. Formulation of Policies:

Statistics helps in formulating plans and policies in different fields. Statistical analysis of data forms the beginning of policy formulations. Hence, statistics is essential for planners, economists, scientists and administrators to prepare different plans and programmes.

f. Forecasting:

The future is uncertain. Statistics helps in forecasting the trend and tendencies. Statistical techniques are used for predicting the future values of a variable. For example a producer forecasts his future production on the basis of the present demand conditions and his past experiences. Similarly, the planners can forecast the future population etc. considering the present population trends.

g. Derives valid inferences:

Statistical methods mainly aim at deriving inferences from an enquiry. Statistical techniques are often used by scholars, planners and scientists to evaluate different projects. These techniques are also used to draw inferences regarding population parameters on the basis of sample information.

1.3 Importance of Statistics in Different Fields

Statistics plays a vital role in every field of human activity. Statistics has an important role in determining the existing position of per capita income, unemployment, population growth rate, housing, schooling, medical facilities, etc. in a country. Statistics now holds a central position in almost every field, such as Industry, Commerce, Trade, Physics, Chemistry, Economics, Mathematics, Biology, Botany, Psychology and Astronomy, so the application of statistics is very wide. We now discuss some important fields in which statistics is commonly applied.

(a) Business: Statistics plays an important role in business. A successful businessman must be very quick and accurate in decision making. He must know what his customers want, and therefore what to produce and sell and in what quantities. Statistics helps the businessman to plan production according to the tastes of the customers, and the quality of the products can also be checked more efficiently by using statistical methods. So all the activities of the businessman are based on statistical information. With it, he can make sound decisions about the location of the business, the marketing of the products, financial resources, etc.

(b) In Economics: Statistics plays an important role in economics. Economics largely depends upon statistics. National income accounts are multipurpose indicators for economists and administrators. Statistical methods are used for the preparation of these accounts. In economic research, statistical methods are used for collecting and analysing data and for testing hypotheses. The relationship between supply and demand is studied by statistical methods; imports and exports, the inflation rate and per capita income are problems which require a good knowledge of statistics.

(c) In Mathematics: Statistics plays a central role in almost all natural and social sciences. The methods of the natural sciences are the most reliable, but conclusions drawn from them are only probable, because they are based on incomplete evidence. Statistics helps in describing these measurements more precisely. Statistics is a branch of applied mathematics. A large number of statistical methods, such as probability, averages, dispersion and estimation, are used in mathematics, and different techniques of pure mathematics, such as integration, differentiation and algebra, are used in statistics.

(d) In Banking: Statistics plays an important role in banking. Banks make use of statistics for a number of purposes. Banks work on the principle that the people who deposit their money do not all withdraw it at the same time. The bank earns profits out of these deposits by lending to others at interest. Bankers use statistical approaches based on probability to estimate the number of depositors and their claims for a certain day.

(e) In State Management (Administration): Statistics is essential for a country. Different policies of the government are based on statistics. Statistical data are now widely used in taking all administrative decisions. Suppose the government wants to revise the pay scales of employees in view of an increase in the cost of living; statistical methods will be used to determine the rise in the cost of living. Preparation of federal and provincial government budgets mainly depends upon statistics because it helps in estimating the expected expenditures and revenue from different sources. So statistics are the eyes of the administration of the state.

(f) In Accounting and Auditing: Accounting is impossible without exactness. But for decision-making purposes, so much precision is not essential; the decision may be taken on the basis of approximation, known as statistics. The correction of the values of current assets is made on the basis of the purchasing power of money or its current value. In auditing, sampling techniques are commonly used. An auditor determines the sample size of the book to be audited on the basis of error.

(g) In Natural and Social Sciences: Statistics plays a vital role in almost all the natural and social sciences. Statistical methods are commonly used for analyzing the results of experiments and testing their significance in Biology, Physics, Chemistry, Mathematics, Meteorology, Research chambers of commerce, Sociology, Business, Public Administration, Communication and Information Technology, etc.

1.4 DESCRIPTIVE AND INFERENTIAL STATISTICS

Descriptive statistics include graphical and numerical procedures that summarize and process data and are used to transform data to information.

Inferential statistics provide the bases for predictions, forecasts, and estimates that are used to transform information to knowledge.

1.5 TYPES OF DATA

The data types discussed in this section are: Quantitative, Discrete, Continuous, Categorical.

Quantitative Data

Quantitative data (metric or continuous) is often referred to as the measurable data. This type of data allows statisticians to perform various arithmetic operations, such as addition and multiplication, to find parameters of a population like mean or variance. The observations represent counts or measurements, and thus all values are numerical.  Each observation represents a characteristic of the individuals in a population or a sample.

Example: A set containing annual salaries of all your family members, measured to the nearest thousand, contains quantitative data.  Take, for instance, family X.  Here is a possible data set for this family: mother $25,000, father $30,000, myself $35,000, my wife $32,000, Uncle Joe $20,000, etc 

Discrete Data

According to the New World Dictionary of the American Language, the definition of "discrete" is the following: separate and distinct; not attached to others; unrelated; made up of distinct parts; discontinuous.

Statistically speaking, discrete data result from either a finite or a countable infinity of possible options for the values present in a given discrete data set.  The values of this data type can constitute a sequence of isolated or separated points on the real number line. Each observation of this data type can therefore take on a value from a discrete list of options.

The discrete data type usually represents a count of something. Some examples of this type include the number of cars per family, the number of times a person yawns during a day, the number of defective light bulbs on a production line, and the number of tosses of a coin before a head appears (a process that could, in principle, go on indefinitely). A student's height is not a count; it qualifies as discrete only once it has been rounded, for example to the nearest inch.

Here are three kinds of discrete data:

1. Discrete numerical data is discrete data that consists of numerical measurements or counts.  A set that consists of discrete numerical data contains numbers. It is all quantitative data, which allows us to find population or sample parameters like mean, variance, and others.  Discrete numerical data sets do not always consist of whole-numbers (or integers), but they may also take on the values of fractions and decimals.

Example: A set containing the heights of students in your high school graduating class, rounded to the nearest inch, represents such data type. It is very important to remember here to round the numbers. If we do not, and we accept measurements such as 62.896 in., 63.277... in., 67.8435... in., we will be dealing with the continuous data type, described below. With the discrete data type, there is a countable number of possible values involved. For example, a set containing possible students' heights will consist of integers starting at 0 in and ending at perhaps 84 in, unless there are students over 7 feet tall, which is highly unlikely. Integers are countable and that is what makes this set discrete.

Also, the number of Farmland Dairies ultra-pasteurized whole milk bottles in different stores can result in values like 0, or 1, or 2, or 3, and so on, and that would also be considered discrete numerical data. We can count all possible values. A variation of this example is used later to describe the continuous data type.

2. Discrete ordinal data is discrete data that may be arranged in some order or succession, but differences between values either cannot be determined or are meaningless.    There are only relative comparisons made about the differences between the ordinal levels.

Example: Consider the following statement: "In a group of twenty workers, five are 'best,' ten are 'good,' and five 'need improvement.'" Although there are obvious differences between each category (best, good, and need improvement), and we can arrange them in order of worst to best or vice versa, there is not much more we can do to compare them. We do not know how much better "best" is than "good" or "good" than "need improvement." In this case, we could have also used numbers instead of words, e.g. 1 for best, 2 for good, and 3 for need improvement, and the data type would still be ordinal. The numbers still lack any computational significance.

3. Discrete qualitative (nominal) data is discrete data that cannot be arranged in any order. It can be represented by numbers, letters, words, and other forms of notation or symbolism, but there are no ranking differences to be determined.  Each category or group will certainly be different from the others, but it will be equally significant.  This data type will only consist of names, labels, or categories.

Example: Gender, political parties, or religions are just some of many qualitative sets that exist around us. Take, for example, these statements: 1. "In a group of twenty workers, there are ten women and ten men," or 2. "In a group of twenty workers, there are five Republicans, four Democrats, and one Independent." The categories such as women and men, or Republicans, Democrats, and Independents, can be talked about, described, and even criticized, but not officially ranked. There are no accepted schemes to put these categories in any meaningful order.

Continuous Data    

Continuous quantitative data result from infinitely many possible values that the observations in a set can take on. The term "infinitely," however, does not refer to the "countable" kind of infinity we have seen with discrete data types. Continuous data types involve the uncountable or non-denumerable kind of infinity, which is frequently described as the number of points on a number line (or an interval on a number line). In other words, the observations of this data type can be associated with points on a number line, where any observation can take on any real-number value within a certain range or interval.

Example: Temperature readings are one example of such a data set. Each reading can take on any real number value on a thermometer. If we agree that during a particular day the temperatures between 10am and 6pm will be somewhere between 32 and 100 degrees Fahrenheit, the truth is that these temperatures could take on any value in that range. For example, consider the following possible temperature readings given in degrees Fahrenheit: 90.333..., 75.324, 40.23..., 85, or 25 multiplied by pi (that is, 25 multiplied by 3.1415..., or about 78.54).

Another example will be a different approach to the Farmland Dairies ultra-pasteurized whole milk bottle example used with a description of the discrete numerical data.  If, instead of measuring the number of bottles in different stores, we measure the amount of milk in each one half gallon bottle in different stores, those values could, for instance, be 0.498 gallon, or 0.5025 gallon, or any value in between.  The observed values will be represented by real-line values, and there is an uncountable number of possibilities for that to occur.  

Categorical Data

Categorical data, also called qualitative or nominal, result from placing individuals into groups or categories.   The values of a categorical variable are labels for the categories. We have described both ordinal and qualitative categorical data types above.

1. Discrete ordinal data --described above

2. Discrete qualitative data--described above

1.6 POPULATION VS. SAMPLE

The population includes all objects of interest whereas the sample is only a portion of the population. Parameters are associated with populations and statistics with samples. Parameters are usually denoted using Greek letters (mu, sigma) while statistics are usually denoted using Roman letters (x-bar, s).

There are several reasons why we don't work with populations. They are usually large, and it is often impossible to get data for every object we're studying. Sampling does not usually occur without cost, and the more items surveyed, the larger the cost.

We compute statistics and use them to estimate parameters. The computation is the first part of the statistics course (Descriptive Statistics) and the estimation is the second part (Inferential Statistics).
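The distinction between a parameter and a statistic can be made concrete with a short Python sketch using the standard `statistics` module. The helper names and the simulated population of heights are assumptions made for illustration; only the idea of estimating mu and sigma from x-bar and s comes from the notes.

```python
import random
import statistics

def population_parameters(values):
    """Return (mu, sigma): mean and standard deviation of a whole population."""
    return statistics.mean(values), statistics.pstdev(values)

def sample_statistics(values):
    """Return (x_bar, s): mean and standard deviation computed from a sample.

    stdev() divides by n - 1, the usual choice when estimating sigma.
    """
    return statistics.mean(values), statistics.stdev(values)

# Hypothetical population: 10,000 simulated heights in centimetres.
rng = random.Random(42)
population = [rng.gauss(170, 10) for _ in range(10_000)]
sample = rng.sample(population, 100)

mu, sigma = population_parameters(population)
x_bar, s = sample_statistics(sample)
print(f"parameters: mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"statistics: x_bar = {x_bar:.2f}, s = {s:.2f}")
```

The sample values will generally be close to, but not equal to, the population values; quantifying that gap is the job of inferential statistics.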

2.0 COLLECTION AND PRESENTATION OF DATA

2.1 DISCRETE VS. CONTINUOUS

Discrete variables are usually obtained by counting. There are a finite or countable number of choices available with discrete data. You can't have 2.63 people in the room.

Continuous variables are usually obtained by measuring. Length, weight, and time are all examples of continuous variables. Since continuous variables are real numbers, we usually round them. This implies a boundary depending on the number of decimal places. For example: 64 is really anything 63.5 <= x < 64.5. Likewise, if there are two decimal places, then 64.03 is really anything 64.025 <= x < 64.035. Boundaries always have one more decimal place than the data and end in a 5.
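The boundary rule can be checked mechanically. The following is a minimal sketch, assuming the rounded value is supplied as a string so that its number of decimal places is preserved; the function name `rounding_boundaries` is hypothetical.

```python
from decimal import Decimal

def rounding_boundaries(value):
    """Return (lower, upper) with lower <= x < upper for a rounded value.

    `value` is a string such as "64" or "64.03"; the boundaries get one
    more decimal place than the data and end in a 5.
    """
    d = Decimal(value)
    places = -d.as_tuple().exponent                # decimal places in the data
    half_unit = Decimal(5).scaleb(-(places + 1))   # 0.5, 0.05, 0.005, ...
    return d - half_unit, d + half_unit

for reading in ("64", "64.03"):
    lower, upper = rounding_boundaries(reading)
    print(f"{reading} stands for {lower} <= x < {upper}")
```

`Decimal` is used instead of `float` so that values such as 64.025 are represented exactly rather than as binary approximations.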

2.2 SAMPLING TECHNIQUES

What is sampling? Imagine, for example, an experiment to test the effects of a new education technique on schoolchildren. It would be impossible to select the entire school age population of a country, divide them into groups and perform research. A research group sampling the diversity of flowers in the African savannah could not count every single flower, because it would take many years. This is where statistical sampling comes in, the idea of trying to take a representative section of the population,  perform the experiment and extrapolate it back to the population as a whole.

In the education example, the research group could test all of the schools in a city, or select one school in a few different cities. Of course, the process is not that easy, and the researchers must use a battery of statistical techniques, and a good research design, to ensure that this subset is as representative as possible. Failure to take into account all of the various experimental biases and errors that can creep into an experiment, if the sample group is chosen poorly, will inevitably lead to invalid results.

The basic question that a researcher should be asking when selecting a sample group is:

How many subjects will I need to complete a viable study, and how will I select them?

The Advantages of Sampling

It involves a smaller number of subjects, which reduces investment in time and money.

Sampling can actually be more accurate than studying an entire population, because it affords researchers a lot more control over the subjects. Large studies can bury interesting correlations amongst the ‘noise.'

Statistical manipulations are much easier with smaller data sets, and it is easier to avoid human error when inputting and analyzing the data.

The Disadvantages of Sampling

There is room for potential bias in the selection of suitable subjects for the research. This may be because the researcher selects subjects that are more likely to give the desired results, or that the subjects tend to select themselves.

For example, if an opinion poll company canvasses opinion by phoning people between 9am and 5pm, they are going to miss most people who are out working, totally invalidating their results. These are called determining factors, and also include poor experiment design, confounding variables and human error.

Sampling requires a knowledge of statistics, and the entire design of the experiment depends upon the exact sampling method required.

Selecting Sample Groups and Extrapolating Results

When sampling, a researcher has two distinct choices:

1. Ideally, they will take a representative sample of the whole population and use  randomization techniques to establish sample groups and controls.

2. In many cases, this is not possible, and the make-up of the groups has to be assigned.

For example, a study that needs to ask for volunteers is never representative of a population. In such cases, the researcher needs to be aware that they cannot extrapolate the findings to represent an entire population.

A study into heart disease that only looks at middle aged men, between 40 and 60, will say very little about heart disease in women or younger men, although it can always be a basis for future research involving other groups.

However robust the research design, there is always an inherent inaccuracy with any sample-based experiment, due to chance fluctuations and natural variety. Most statistical tests take this into account, and this is why results are judged to a significance level, or given a margin of error.
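As one concrete illustration of a margin of error, the sketch below uses the common normal-approximation formula for a sample proportion, z * sqrt(p(1 - p) / n). The notes do not name a specific formula, so this choice, the function name, and the poll figures are all assumptions; the formula also presumes a simple random sample that is reasonably large.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(p_hat, n, confidence=0.95):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample and the normal approximation.
    """
    if not 0 <= p_hat <= 1:
        raise ValueError("p_hat must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical poll: 50% support among 1000 respondents.
moe = margin_of_error(0.5, 1000)
print(f"50% plus or minus {moe * 100:.1f} percentage points")
```

Quadrupling the sample size only halves the margin, which is one reason the choice of sample size is a trade-off between cost and precision.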

Sampling is an essential part of most research, and researchers must know how to choose sample groups that are as free from bias as possible, and also be aware of the extent to which they can extrapolate their results back to the general population.

 

Statistical sampling is carried out when researchers aim to draw conclusions about an entire population after conducting a study on a sample taken from that population.

Concerns in Statistical Sampling

Representativeness

This is the primary concern in statistical sampling. The sample obtained from the population must be representative of the same population. This can be accomplished by using randomized statistical sampling techniques or probability sampling like cluster sampling and stratified sampling. The reason behind representativeness being the primary concern in statistical sampling is that it allows the researcher to draw conclusions for the entire population. If the sample is not representative of the population, conclusions cannot be drawn since the results that the researcher obtained from the sample will be different from the results if the entire population is to be tested.

Practicability

Practicability of statistical sampling techniques allows the researchers to estimate the possible number of subjects that can be included in the sample, the type of sampling technique, the duration of the study, the number of materials, ethical concerns, availability of the subjects/samples, the need for the study and the amount of workforce that the study demands. All these factors contribute to the decisions of the researcher regarding the study design.

Sampling Risks

There are two types of sampling risk: the first is the risk of incorrect acceptance of the research hypothesis, and the second is the risk of incorrect rejection. These risks pertain to the possibility that when a test is conducted on a sample, the results and conclusions may be different from those obtained when the test is conducted on the entire population. The risk of incorrect acceptance is the risk that the sample yields a conclusion that supports a theory about the population when the theory does not actually hold in the population. On the other hand, the risk of incorrect rejection is the risk that the sample yields a conclusion that rejects a theory about the population when, in fact, the theory holds true in the population. Comparing the two types of risk, researchers fear the risk of incorrect rejection more than the risk of incorrect acceptance. Consider this example: an experimental drug is tested for debilitating side effects, and the research hypothesis is that the drug has such side effects. With incorrect acceptance, the researcher concludes that the drug has negative side effects when in truth it does not. The entire population will then abstain from taking the drug. With incorrect rejection, the researcher concludes that the drug has no negative side effects when in fact it does. The entire population will then take the drug believing that it has no side effects, and all of them will suffer the consequences of the researcher's mistake.

2.1.1 SAMPLE GROUP

Ideally, this is a population at risk. The "study population" is the population from which the sample is to be drawn. Commonly, the population is very large, and in any research study, studying the whole population is often impractical or impossible. Therefore, a sample group gives researchers a manageable and representative subset of the population.

Sampling Frame > Sampling Unit > Sampling Fraction

Before a sample is taken, members of the study population need to be identified by constructing a list called a sampling frame. Each member of the sampling frame is called a sampling unit. For example, someone may want to know details about the shopping trends of people coming to a particular grocery store on Sundays. The people coming to that grocery store on Sunday form the sampling frame, and each customer is a sampling unit. The sampling fraction is the ratio of sample size to study population size. For example, if you choose 10 customers out of a total of 1000 coming to that grocery store, then the sampling fraction would be 1%. The sampling units may be individuals or they may be groups. For example, in a particular study involving animals, one can select individual animals or groups of animals such as herds, farms, or administrative regions.
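The grocery store figures can be reproduced with a short Python sketch. The function name `sampling_fraction` and the input checks are illustrative assumptions; the 10-out-of-1000 example is taken from the paragraph above.

```python
def sampling_fraction(sample_size, population_size):
    """Ratio of sample size to study population size."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not 0 <= sample_size <= population_size:
        raise ValueError("sample_size must lie between 0 and population_size")
    return sample_size / population_size

# 10 customers chosen out of 1000 Sunday shoppers.
fraction = sampling_fraction(10, 1000)
print(f"sampling fraction = {fraction:.0%}")
```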

Types of Sampling

How do we get our desired sample group? There are two types of sampling:

1. Non-Probability Sampling

In non-probability sampling, the choice of sample group is left to the researcher, and thus an element of bias always shows up in such studies.

2. Probability Sampling

In probability sampling, the selection of the sample is made using a deliberate, unbiased process, so that each sample unit in a group has an equal chance of being selected. This forms the basis of random sampling.

Probability sampling is most commonly used in experimental research. Randomization is performed to choose samples, giving each sample unit an equal chance of being selected and thus minimizing or eliminating bias altogether.

2.1.2 RESEARCH POPULATION

A research population is generally a large collection of individuals or objects that is the main focus of a scientific query. It is for the benefit of the population that research is done. However, due to the large sizes of populations, researchers often cannot test every individual in the population because it is too expensive and time-consuming. This is the reason why researchers rely on sampling techniques.

A research population is also known as a well-defined collection of individuals or objects known to have similar characteristics. All individuals or objects within a certain population usually have a common, binding characteristic or trait.

Usually, the description of the population and the common binding characteristic of its members are the same. "Government officials" is a well-defined group of individuals which can be considered as a population and all the members of this population are indeed officials of the government.

Relationship of Sample and Population in Research

A sample is simply a subset of the population. The concept of sample arises from the inability of the researchers to test all the individuals in a given population. The sample must be representative of the population from which it was drawn and it must have good size to warrant statistical analysis.

The main function of the sample is to allow the researchers to conduct the study on individuals from the population so that the results of their study can be used to derive conclusions that will apply to the entire population. It is much like a give-and-take process. The population “gives” the sample, and then it “takes” conclusions from the results obtained from the sample.

Two Types of Population in Research

Target Population

Target population refers to the ENTIRE group of individuals or objects to which researchers are interested in generalizing the conclusions. The target population usually has varying characteristics and it is also known as the theoretical population.

Accessible Population

The accessible population is the population in research to which the researchers can apply their conclusions. This population is a subset of the target population and is also known as the study population. It is from the accessible population that researchers draw their samples.

 

2.1.3 SAMPLE SIZE

The sample size is typically denoted by n and is always a positive integer. No single sample size fits every study; it varies across research settings. However, all else being equal, a larger sample leads to increased precision in estimates of the various properties of the population.

What Should Be the Sample Size?

Determining the sample size to be selected is an important step in any research study. For example let us suppose that some researcher wants to determine prevalence of eye problems in school children and wants to conduct a survey.

The important question that should be answered in all sample surveys is "How many participants should be chosen for a survey"? However, the answer cannot be given without considering the objectives and circumstances of investigations.

The choice of sample size depends on both non-statistical and statistical considerations. The non-statistical considerations may include availability of resources, manpower, budget, ethics, and the sampling frame. The statistical considerations include the desired precision of the estimate of prevalence and the expected prevalence of eye problems in school children.

The following three criteria need to be specified to determine the appropriate sample size:

1. The Level of Precision

Also called sampling error, the level of precision is the range in which the true value of the population is estimated to be. This range is often expressed in percentage points. Thus, if a researcher finds that 70% of farmers in the sample have adopted a recommended technology with a precision of ±5%, then the researcher can conclude that between 65% and 75% of farmers in the population have adopted the new technology.

2. The Confidence Level

The confidence level is the statistical measure of the number of times out of 100 that results can be expected to fall within a specified range.

For example, a confidence level of 90% means that results will fall within the specified range about 90% of the time.

The basic idea, which rests on the Central Limit Theorem, is that when a population is repeatedly sampled, the sample values of an attribute are approximately normally distributed around the true population value, and their average approaches that value. In other words, a confidence level of 95% means that about 95 out of 100 samples will have the true population value within the range of precision.
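The "95 out of 100 samples" reading can be illustrated with a small simulation. The sketch below is an illustration under stated assumptions, not a method from the notes: the true proportion (0.70), the sample size (400), the number of repeated samples (1000), and the normal-approximation interval with z = 1.96 are all choices made for the example.

```python
import math
import random


def coverage_rate(true_p=0.70, n=400, trials=1000, z=1.96, seed=1):
    """Fraction of repeated samples whose interval p_hat +/- margin contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n and count how many units have the attribute
        successes = sum(rng.random() < true_p for _ in range(n))
        p_hat = successes / n
        # Normal-approximation margin of error for a proportion
        margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= true_p <= p_hat + margin:
            hits += 1
    return hits / trials


rate = coverage_rate()
print(f"{rate:.1%}")  # expected to be close to 95%; the exact value depends on the seed
```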

3. Degree of Variability

Depending upon the target population and the attributes under consideration, the degree of variability varies considerably. The more heterogeneous a population is, the larger the sample size required to reach a given level of precision. Note that a proportion of 55% indicates a higher level of variability than either 10% or 80%. This is because 10% and 80% mean that a large majority does not or does, respectively, have the attribute under consideration.

There are a number of approaches to determining the sample size, including: using a census for smaller populations, using published tables, imitating the sample size of similar studies, and applying formulas to calculate a sample size.
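As an illustration of the formula approach, the sketch below combines the three criteria (precision, confidence level, variability) for a proportion. It assumes Cochran's formula, n0 = z^2 p(1 - p) / e^2, with an optional finite population correction; the notes do not name a specific formula, so treat this choice, the function name, and the example inputs as assumptions.

```python
import math


def sample_size_for_proportion(p, e, z=1.96, population=None):
    """Sample size for estimating a proportion (assumed: Cochran's formula).

    p: expected proportion (degree of variability)
    e: level of precision as a fraction, e.g. 0.05 for +/-5%
    z: z-value for the confidence level (1.96 corresponds to 95%)
    population: optional study population size for the finite population correction
    """
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


# Maximum variability (p = 0.5), +/-5% precision, 95% confidence
print(sample_size_for_proportion(0.5, 0.05))                   # 385
# Same criteria for a study population of 1000
print(sample_size_for_proportion(0.5, 0.05, population=1000))  # 278
```

A less variable attribute needs fewer participants under the same criteria, because p(1 - p) shrinks as p moves away from 0.5.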

Methods of Data Collection

The purpose of this section is to help you to learn how to collect data.

There are six major methods of data collection.

Tests  

Questionnaires

Interviews 

Focus groups 

Observation 

Existing or Secondary data (i.e., using data that are originally collected and then archived or any other kind of “data” that was simply left behind at an earlier time for some other purpose).   

Tests

Tests are commonly used in research to measure personality, aptitude, achievement, and performance. The last chapter discussed standardized tests; therefore, we only have a brief discussion in this chapter. Note that tests can also be used to complement other measures (following the fundamental principle of mixed research).

 

In addition to the tests discussed in the last chapter, note that sometimes a researcher must develop a new test to measure the specific knowledge, skills, behavior, or cognitive activity that is being studied. For example, a researcher might need to measure response time to a memory task using a mechanical apparatus or develop a test to measure a specific mental or cognitive activity (which obviously cannot be directly observed).

Strengths and Weaknesses of Tests

Strengths of tests

Can provide measures of many characteristics of people.

Often standardized (i.e., the same stimulus is provided to all participants).

Allows comparability of common measures across research populations.

Strong psychometric properties (high measurement validity).

Availability of reference group data.

Many tests can be administered to groups which saves time.

Can provide “hard,” quantitative data.

Tests are usually already developed.

A wide range of tests is available (most content can be tapped).

Response rate is high for group administered tests.

Ease of data analysis because of quantitative nature of data.       

Weaknesses of tests

Can be expensive if test must be purchased for each research participant.

Reactive effects such as social desirability can occur.

Test may not be appropriate for a local or unique population.

Open-ended questions and probing not available.

Tests are sometimes biased against certain groups of people.

Nonresponse to selected items on the test.

Some tests lack psychometric data.

 

 

Questionnaires

A questionnaire is a self-report data collection instrument that is filled out by research participants. Questionnaires are usually paper-and-pencil instruments, but they can also be placed on the web for participants to go to and “fill out.” Questionnaires are sometimes called survey instruments, which is fine, but the actual questionnaire should not be called “the survey.” The word “survey” refers to the process of using a questionnaire or interview protocol to collect data. For example, you might do a survey of teacher attitudes about inclusion; the instrument of data collection should be called the questionnaire or the survey instrument.

A questionnaire is composed of questions and/or statements.

Because one way to learn to write questionnaires is to look at other questionnaires, here is an example of a typical questionnaire that has mostly quantitative items.

For an example of a qualitative questionnaire.

When developing a questionnaire make sure that you follow the 15 Principles of Questionnaire Construction given below.

Principle 1: Make sure the questionnaire items match your research objectives.

Principle 2: Understand your research participants.

Your participants (not you!) will be filling out the questionnaire.

Consider the demographic and cultural characteristics of your potential participants so that you can make it understandable to them.  

Principle 3: Use natural and familiar language. Familiar language is comforting; jargon is not.  

Principle 4: Write items that are clear, precise, and relatively short.

If your participants don't understand the items, your data will be invalid (i.e., your research study will have the garbage in, garbage out, GIGO, syndrome).

Short items are more easily understood and less stressful than long items.

Principle 5: Do not use "leading" or "loaded" questions.

Leading questions lead the participant to where you want him or her to be.

Loaded questions include loaded words (i.e., words that create an emotional reaction or response by your participants).

Always remember that you do not want the participant's response to be the result of how you worded the question. Always use neutral wording.  

Principle 6: Avoid double-barreled questions.

A double-barreled question combines two or more issues in a single question (e.g., here is a double barreled question: “Do you elicit information from parents and other teachers?” It’s double barreled because if someone answered it, you would not know whether they were referring to parents or teachers or both).

Does the question include the word "and"? If yes, it might be a double-barreled question.

Answers to double-barreled questions are ambiguous because two or more ideas are confounded.  

Principle 7: Avoid double negatives.

Does the answer provided by the participant require combining two negatives? (e.g., "I disagree that teachers should not be required to supervise their students during library time"). If yes, rewrite it.

 

Principle 8: Determine whether an open-ended or a closed ended question is needed.

Open-ended questions provide qualitative data in the  participants' own words. Here is an open ended question: How can your principal improve the morale at your school? _______________________________________________

Closed-ended questions provide quantitative data based on the researcher's response categories. Here is an example of a closed-ended question:

How do you find statistics?

1. Difficult

2. Moderately difficult

3. Easy

4. Very easy

5. Don’t know

 

Open-ended questions are common in exploratory research and closed-ended questions are common in confirmatory research.  

Principle 9: Use mutually exclusive and exhaustive response categories for closed-ended questions.

Mutually exclusive categories do not overlap (e.g., ages 0-10, 10-20, 20-30 are NOT mutually exclusive and should be rewritten as less than 10, 10-19, 20-29, 30-39, ...).

Exhaustive categories include all possible responses (e.g., if you are doing a national survey of adult citizens (i.e., 18 or older), then the categories 18-19, 20-29, 30-39, 40-49, 50-59, 60-69 are NOT exhaustive because there is nowhere to put someone who is 70 years old or older).
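The two requirements of Principle 9 can be checked mechanically for integer age categories. This is a minimal sketch: the function name check_categories is illustrative, ranges are treated as inclusive, and the upper age of 100 used to test exhaustiveness is an assumption made only for the example.

```python
def check_categories(categories, lowest, highest):
    """Check inclusive integer ranges for overlaps and for uncovered values.

    Returns (overlaps, gaps): pairs of adjacent ranges that share a value,
    and the values between lowest and highest that no range covers.
    """
    ordered = sorted(categories)
    # Adjacent ranges overlap when the next one starts at or before the previous end
    overlaps = [(a, b) for a, b in zip(ordered, ordered[1:]) if b[0] <= a[1]]
    covered = set()
    for low, high in ordered:
        covered.update(range(low, high + 1))
    gaps = [v for v in range(lowest, highest + 1) if v not in covered]
    return overlaps, gaps


# Ages 0-10, 10-20, 20-30: a 10-year-old or a 20-year-old fits two categories
overlaps, gaps = check_categories([(0, 10), (10, 20), (20, 30)], 0, 30)
print(len(overlaps), gaps)  # 2 []

# Adult survey categories stop at 69, so ages 70 and above have no category
adult = [(18, 19), (20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]
adult_overlaps, adult_gaps = check_categories(adult, 18, 100)
print(adult_overlaps, adult_gaps[:3])  # [] [70, 71, 72]
```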

Principle 10: Consider the different types of response categories available for closed-ended questionnaire items.

Rating scales are the most commonly used, including:

 

o Numerical rating scales (where the endpoints are anchored; sometimes the center point or area is also labeled).

    1         2         3         4         5         6         7

Very Low                                                    Very High

 

o Fully anchored rating scales (where all the points on the scale are anchored).

    1            2            3            4            5

Strongly       Agree       Neutral     Disagree     Strongly

 Agree                                              Disagree

 

    1            2            3            4

Strongly       Agree      Disagree     Strongly

 Agree                                 Disagree

 

 

o Omitting the center point on a rating scale (e.g., using a 4-point rather than a 5-point rating scale) does not appreciably affect the response pattern. Some researchers prefer 5-point rating scales; other researchers prefer 4-point rating scales. Both generally work well.

 

o You should use somewhere from four to eleven points on your rating scale. Personally, I like the 4 and 5-point scales because all of the points are easily anchored.  

 

o I do not recommend a 1 to 10 scale because too many respondents mistakenly view the 5 as the center point. If you want to use a wide scale like this, use a 0 to 10 scale (where the 5 is the middle point) and label the 5 with the anchor “medium” or some other appropriate anchor.
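The point about the 1 to 10 scale is simple arithmetic: the midpoint of a scale is the average of its endpoints, and only a scale with an odd number of points has a whole-number center that can be labeled. The helper names in this short sketch are illustrative.

```python
def scale_midpoint(low, high):
    """Midpoint of a rating scale running from low to high."""
    return (low + high) / 2


def has_center_point(low, high):
    """True when the scale has an odd number of points, so one point sits in the middle."""
    return (high - low) % 2 == 0


print(scale_midpoint(1, 10), has_center_point(1, 10))  # 5.5 False (5 is not the center)
print(scale_midpoint(0, 10), has_center_point(0, 10))  # 5.0 True  (5 is the center)
print(scale_midpoint(1, 5), has_center_point(1, 5))    # 3.0 True
```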

 

Rankings (i.e., where participants put their responses into rank order, such as most important, second most important, and third most important).

 

Semantic differential (i.e., where one item stem and multiple scales that are anchored with polar opposites or antonyms are included and are rated by the participants).

 

Checklists (i.e., where participants "check all of the responses in a list that apply to them").   

Principle 11: Use multiple items to measure abstract constructs.

This is required if you want your measures to have high reliability and validity.

One approach is to use a summated rating scale (such as the Rosenberg Self-Esteem Scale that is composed of 10 items, with each item measuring self-esteem).

Another name for a summated rating scale is a Likert Scale because the summated rating scale was pretty much invented by the famous social psychologist named Rensis Likert.

Here is the Rosenberg Self-Esteem Scale, which is a summated rating scale:

 

 

Principle 12: Consider using multiple methods when measuring abstract constructs.

The idea here is that if you only use one method of measurement, then your measurement may be an artifact of that method of measurement.

On the other hand, if you use two or more methods of measurement you will be able to see whether the answers depend on the method (i.e., are the answers corroborated across the methods of measurement or do you get different answers for the different methods?). For example, you might measure students’ self-esteem via the Rosenberg Scale just shown (which is used in a self-report form) as well as using teachers’ ratings of the students’ self-esteem; you might even want to observe the students in situations that should provide indications of high and low self-esteem.

 

Principle 13: Use caution if you reverse the wording in some of the items to prevent response sets.   (A response set is the tendency of a participant to respond in a specific direction to items regardless of the item content.)

Reversing the wording of some items can help ensure that participants don't just "speed through" the instrument, checking "yes" or "strongly agree" for all the items.

On the other hand, you may want to avoid reverse wording if it creates a double negative.

Also, recent research suggests that the use of reverse wording reduces the reliability and validity of scales. Therefore, you should generally use reverse wording sparingly, if at all.  
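Scoring a summated rating scale (Principle 11) that contains reverse-worded items (Principle 13) can be sketched as follows. The five responses, the positions of the reverse-worded items, and the 1 to 4 response range are hypothetical; they are not the items or scoring key of any published scale.

```python
def summated_score(responses, reverse_items=(), low=1, high=4):
    """Sum item responses, flipping reverse-worded items first.

    responses: one rating per item, each between low and high
    reverse_items: zero-based positions of the reverse-worded items
    """
    total = 0
    for index, value in enumerate(responses):
        if not low <= value <= high:
            raise ValueError(f"item {index} has rating {value} outside {low}-{high}")
        if index in reverse_items:
            value = low + high - value  # on a 1-4 scale: 1 becomes 4, 2 becomes 3
        total += value
    return total


# Hypothetical participant: items at positions 2 and 3 are reverse-worded
responses = [4, 3, 1, 2, 4]
print(summated_score(responses, reverse_items={2, 3}))  # 4 + 3 + 4 + 3 + 4 = 18
```

Flipping the reverse-worded items before summing keeps a high total meaning the same thing for every item.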

 Principle 14: Develop a questionnaire that is easy for the participant to use.

The participant must not get confused or lost anywhere in the questionnaire.

Make sure that the directions are clear and that any filter questions used are easy to follow.  

 Principle 15: Always pilot test your questionnaire.

You will always find some problems that you have overlooked!

The best pilot tests are with people similar to the ones to be included in your research study.

After pilot testing your questionnaire, revise it and pilot test it again, until it works correctly.

Strengths of questionnaires

Good for measuring attitudes and eliciting other content from research participants.

Inexpensive (especially mail questionnaires and group administered questionnaires).

Can provide information about participants’ internal meanings and ways of thinking.

Can administer to probability samples.

Quick turnaround.

Can be administered to groups.

Perceived anonymity by respondent may be high.

Moderately high measurement validity (i.e., high reliability and validity) for well-constructed and validated questionnaires.

Closed-ended items can provide exact information needed by researcher.

Open-ended items can provide detailed information in respondents’ own words.

Ease of data analysis for closed-ended items.

Useful for exploration as well as confirmation.

Weaknesses of questionnaires

Usually must be kept short.

Reactive effects may occur (e.g., respondents may try to show only what is socially desirable).

Nonresponse to selected items.

People filling out questionnaires may not recall important information and may lack self-awareness.

Response rate may be low for mail and email questionnaires.

Open-ended items may reflect differences in verbal ability, obscuring the issues of interest.

Data analysis can be time consuming for open-ended items.

Measures need validation.

 

Interviews

In an interview, the interviewer asks the interviewee questions (in-person or over the telephone).

Trust and rapport are important.

Probing is available (unlike in paper-and-pencil questionnaires) and is used to reach clarity or gain additional information.

Here are some examples of standard probes: 

- Anything else?

- Any other reason?

- What do you mean?

Interviews may be quantitative or qualitative.

Quantitative interviews:

Are standardized (i.e., the same information is provided to everyone).

Use closed-ended questions.

Exhibit 6.3 has an example of an interview protocol. Note that it looks very much like a questionnaire! The key difference between an interview protocol and a questionnaire is that the interview protocol is read by the interviewer who also records the answers (you have probably participated in telephone surveys before...you were interviewed).

 

Qualitative interviews

They are based on open-ended questions.

There are three types of qualitative interviews.  

1) Informal Conversational Interview.

- It is spontaneous.

- It is loosely structured (i.e., no interview protocol is used).

 

2) Interview Guide Approach.

It is more structured than the informal conversational interview.

It includes an interview protocol listing the open-ended questions.

The questions can be asked in any order by the interviewer.

Question wording can be changed by the interviewer if it is deemed appropriate.  

3) Standardized Open-Ended Interview.

Open-ended questions are written on an interview protocol, and they are asked in the exact order given on the protocol.

The wording of the questions cannot be changed.  

Strengths of interviews

Good for measuring attitudes and most other content of interest.

Allows probing and posing of follow-up questions by the interviewer.

Can provide in-depth information.

Can provide information about participants’ internal meanings and ways of thinking.

Closed-ended interviews provide exact information needed by researcher.

Telephone and e-mail interviews provide very quick turnaround.

Moderately high measurement validity (i.e., high reliability and validity) for well-constructed and tested interview protocols.

Can use with probability samples.

Relatively high response rates are often attainable.

Useful for exploration as well as confirmation.

Weaknesses of interviews

In-person interviews usually are expensive and time consuming.

Reactive effects (e.g., interviewees may try to show only what is socially desirable).

Investigator effects may occur (e.g., untrained interviewers may distort data because of personal biases and poor interviewing skills).

Interviewees may not recall important information and may lack self-awareness.

Perceived anonymity by respondents may be low.

Data analysis can be time consuming for open-ended items.

Measures need validation.

Focus Groups

A focus group is a situation where a focus group moderator keeps a small and homogeneous group (of 6-12 people) focused on the discussion of a research topic or issue.

Focus group sessions generally last between one and three hours and they are recorded using audio and/or videotapes.

Focus groups are useful for exploring ideas and obtaining in-depth information about how people think about an issue.

Strengths of focus groups

Useful for exploring ideas and concepts.

Provides window into participants’ internal thinking.

Can obtain in-depth information.

Can examine how participants react to each other.

Allows probing.

Most content can be tapped.

Allows quick turnaround.

Weaknesses of focus groups

Sometimes expensive.

May be difficult to find a focus group moderator with good facilitative and rapport building skills.

Reactive and investigator effects may occur if participants feel they are being watched or studied.

May be dominated by one or two participants. 

Difficult to generalize results if small, unrepresentative samples of participants are used.

May include large amount of extra or unnecessary information.

Measurement validity may be low.

Usually should not be the only data collection method used in a study.

Data analysis can be time consuming because of the open-ended nature of the data.

Observation

In the method of data collection called observation, the researcher observes participants in natural and/or structured environments.

It is important to collect observational data (in addition to attitudinal data) because what people say is not always what they do!

 

Observation can be carried out in two types of environments:

Laboratory observation (which is done in a lab set up by the researcher).

Naturalistic observation (which is done in real-world settings).  

 There are two important forms of observation: quantitative observation and qualitative observation.  

1) Quantitative observation involves standardization procedures, and it produces quantitative data.

The following can be standardized:

- Who is observed?

- What is observed?

- When the observations are to take place.

- Where the observations are to take place.

- How the observations are to take place.

Standardized instruments (e.g., checklists) are often used in quantitative observation.

Sampling procedures are also often used in quantitative observation:

- Time-interval sampling (i.e., observing during time intervals, e.g., during the first minute of each 10-minute interval).

- Event sampling (i.e., observing after an event has taken place, e.g., observing after the teacher asks a question).
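Time-interval sampling amounts to a schedule of observation windows. The sketch below lists the windows for the "first minute of each 10-minute interval" example; the 30-minute session length and the function name are assumptions made for illustration.

```python
def time_interval_windows(session_minutes, interval=10, observe=1):
    """Return (start, end) minute pairs marking when the observer should watch."""
    return [
        (start, start + observe)
        for start in range(0, session_minutes, interval)
        if start + observe <= session_minutes  # skip a window that would run past the session
    ]


# Observe the first minute of each 10-minute interval in a 30-minute session
print(time_interval_windows(30))  # [(0, 1), (10, 11), (20, 21)]
```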

 

2) Qualitative observation is exploratory and open-ended, and the researcher takes extensive field notes.

The qualitative observer may take on four different roles that make up a continuum:

·        Complete participant (i.e., becoming a full member of the group and not informing the participants that you are studying them).

·        Participant-as-Observer (i.e., spending extensive time "inside" and informing the participants that you are studying them).

·        Observer-as-Participant (i.e., spending a limited amount of time "inside" and informing them that you are studying them).

·        Complete Observer (i.e., observing from the "outside" and not informing the participants that you are studying them).

 

Strengths of observational data

Allows one to directly see what people do without having to rely on what they say they do.

Provides firsthand experience, especially if the observer participates in activities.

Can provide relatively objective measurement of behavior (especially for standardized observations).

Observer can determine what does not occur.

Observer may see things that escape the awareness of people in the setting.

Excellent way to discover what is occurring in a setting.

Helps in understanding importance of contextual factors.

Can be used with participants with weak verbal skills.

May provide information on things people would otherwise be unwilling to talk about.

Observer may move beyond selective perceptions of people in the setting.

Good for description.

Provides moderate degree of realism (when done outside of the laboratory).

Weaknesses of observational data

Reasons for observed behavior may be unclear.

Reactive effects may occur when respondents know they are being observed (e.g., people being observed may behave in atypical ways).

Investigator effects may occur (e.g., personal biases and selective perception of observers).

Observer may “go native” (i.e., over-identifying with the group being studied).

Sampling of observed people and settings may be limited.

Cannot observe large or dispersed populations.

Some settings and content of interest cannot be observed.

Collection of unimportant material may be moderately high.

More expensive to conduct than questionnaires and tests.

Data analysis can be time consuming.

 

Secondary/Existing Data

Secondary data (i.e., data originally used for a different purpose) are contrasted with primary data (i.e., original data collected for the new research study).

 

The most commonly used secondary data are documents, physical data, and archived research data.

 

1.   Documents. There are two main kinds of documents.

·        Personal documents (i.e., things written or recorded for private purposes). Letters, diaries, family pictures.

·        Official documents (i.e., things written or recorded for public or private organizations). Newspapers, annual reports, yearbooks, minutes.

 

2.   Physical data (i.e., any material thing created or left by humans that might provide information about a phenomenon of interest to a researcher).

3.   Archived research data (i.e., research data collected by other researchers for other purposes; these data are often saved in tape or CD form so that others might later use them). For the biggest repository of archived research data,  

Strengths of documents and physical data:

Can provide insight into what people think and what they do.

Unobtrusive, making reactive and investigator effects very unlikely.

Can be collected for time periods occurring in the past (e.g., historical data).

Provides useful background and historical data on people, groups, and organizations.

Useful for corroboration.

Grounded in local setting.

Useful for exploration.

Strengths of archived research data:

Archived research data are available on a wide variety of topics.

Inexpensive.

Often are reliable and valid (high measurement validity).

Can study trends.

Ease of data analysis.

Often based on high quality or large probability samples.

Weaknesses of documents and physical data:

May be incomplete.

May be representative only of one perspective.

Access to some types of content is limited.

May not provide insight into participants’ personal thinking for physical data.

May not apply to general populations.

Weaknesses of archived research data:

May not be available for the population of interest to you.

May not be available for the research questions of interest to you.

Data may be dated.

Open-ended or qualitative data usually not available.

Many of the most important findings have already been mined from the data.

 

DATA PRESENTATION METHODS

Organizing data into frequency tables

i. class and frequency

ii. extended table includes relative frequency, cumulative frequency, and cumulative relative frequency, as well as class marks

Making charts or graphs

i. histogram and bar graphs

ii. frequency curve or polygon

iii. ogive

iv. box & whisker or boxplot

v. circle or pie graph

vi. stem & leaf

vii. pictographs

viii. scatter plots

ix. line plots

Organize data into frequency tables

Frequency Table

A frequency table is an excellent device for making larger collections of data much more intelligible. It is so named because it lists categories of scores along with their corresponding frequencies.

The frequency for a category or class is the number of original scores that fall into that class. The columns of an extended frequency table generate various graphs or charts.

Extended frequency tables therefore become important prerequisites for creating graphs and charts used in statistics.

Guidelines for frequency tables:

1. Class intervals should not overlap. Classes are mutually exclusive.

2. Classes should continue throughout the distribution with NO gaps. Include all classes.

3. All classes should have the same width.

4. Class widths should be “convenient” numbers.

5. Use 5-20 classes.

6. Make lower or upper limits multiples of the width.

An extended frequency table includes the following:

a. class intervals (lower and upper limits)

b. marks

c. frequency

d. cumulative frequency

e. relative frequency

f. cumulative relative frequency

Example Data Set: Dr. Chuma’s Exam Scores

55 88 67 58 66 78 84 77 69 78 66 57 64 77 87 83 81 82

82 68 78 87 76 74 77 76 73 69 64 76 68 76 76 87 87

Note: Typically, you will have to rank data first; data does not usually come ordered!

The first thing to do with numerical data is to organize it into a frequency table. Each column of a frequency table generates (is used to create) a particular graph or chart.

Extended Frequency Table of Dr. Chuma’s Exam Scores

Class  Freq  Cumulative freq (less than)  Cumulative freq (more than)  Relative freq  Cumulative relative freq  Midpoint or mark

55-60

60-65

65-70

70-75

75-80

80-85

85-90

The width of each class is 5 (size of each class).

The lower limits are the smaller numbers of each class (55, 60, 65, 70, 75, etc.)

The upper limits are the larger numbers of each class (60, 65, 70, 75, 80, etc.)

Note: the class limits (either lower or upper) should be a multiple of the width.

The mark is the midpoint of each class:

mark = (lower limit + upper limit) / 2

There should be no "gaps" in organizing classes.

There should be no "overlap" in class numbers.
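The columns of the extended frequency table can be computed directly from the raw scores. The following is a minimal Python sketch, not part of the original notes. It assumes that each class contains its lower limit but not its upper limit (so a score of 60 would be counted in 60-65), because the classes as listed share their limits. The function name extended_frequency_table and the dictionary keys are made up for illustration.

```python
# Dr. Chuma's exam scores, as listed in the notes.
scores = [55, 88, 67, 58, 66, 78, 84, 77, 69, 78, 66, 57, 64, 77, 87, 83, 81, 82,
          82, 68, 78, 87, 76, 74, 77, 76, 73, 69, 64, 76, 68, 76, 76, 87, 87]

def extended_frequency_table(data, start, width, n_classes):
    """Return one dict per class; a class holds values with lower <= x < upper
    (an assumed convention, chosen so that classes do not overlap)."""
    n = len(data)
    rows = []
    cum = 0
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width
        freq = sum(1 for x in data if lower <= x < upper)
        cum += freq
        rows.append({
            "class": f"{lower}-{upper}",
            "freq": freq,
            "cum_less": cum,               # number of scores below the upper limit
            "cum_more": n - (cum - freq),  # number of scores at or above the lower limit
            "rel": freq / n,
            "cum_rel": cum / n,
            "mark": (lower + upper) / 2,   # class midpoint
        })
    return rows

table = extended_frequency_table(scores, start=55, width=5, n_classes=7)
for row in table:
    print(row["class"], row["freq"], row["cum_less"], row["cum_more"],
          round(row["rel"], 3), round(row["cum_rel"], 3), row["mark"])
```

A different boundary convention (for example, counting a score of 60 in 55-60) would change only which class a score on a shared limit falls into; none of these scores lies on a shared limit, so the frequencies would be the same here.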

A. Making a Quantitative Frequency Distribution

To create a histogram, you first must make a quantitative frequency distribution. The following list of steps allows you to construct a perfect quantitative frequency distribution every time. Other methods may work sometimes, but they may not work every time.

1. Find the smallest data value (low score) and the largest data value (high score).

2. Select the number of classes you want. Usually, this number is between 3 and 7. (The number of classes may be given in the instructions to the problem.)

3. Determine the accuracy of the data. That is, look at the data to see how many places to the right of the decimal point are used.

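4. Compute the following two numbers:

lower bound = (high score - low score) / (number of classes)

upper bound = (high score - low score) / (number of classes - 1)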

5. The class width is now chosen to be any number greater than the lower bound, but not more than the upper bound. The class width may have more accuracy than the original data, but should be easy to use in calculations. Since there may be more than one possible class width, there can be many correct frequency distributions with the same number of classes.

6. Next you compute the lower class limits. Starting with the low score, repeatedly add the class width until - including the low score - you have one lower class limit for each class.

7. The upper class limit for the first class is the biggest number below the second lower class limit with the same accuracy as the class width. To obtain the other upper class limits, you repeatedly add the class width to the first upper class limit until - including the first upper class limit - you have one upper class limit for each class.


8. For each class, count the number of data values in the class. This is the class frequency. You can do this by going through the data values one by one and making a tally mark next to the class where the data value occurs. Counting up the tallies for each class gives the class frequency. The class frequencies should be recorded in their own column.

Tally marks are optional, but you must show the class frequencies. The frequencies of the first and last class must be greater than zero. The frequency of any other class may be zero. If you tallied correctly, the sum of all the frequencies should equal the total number of data values.

 

Example: The following data represents the actual liquid weight in 16 "twelve-ounce" cans. Construct a frequency distribution with four classes from this data.

11.95 11.91 11.86 11.94 12.00 11.93 12.00 11.94

12.10 11.95 11.99 11.94 11.89 12.01 11.99 11.94

Solution: First we use the steps listed above to construct the frequency distribution.

Step 1: low score = 11.86, high score = 12.10

Step 2: number of classes = 4 (given in problem)

Step 3: The accuracy is two decimal places.

Step 4: Compute the lower and upper bounds: lower bound = (12.10 - 11.86)/4 = 0.06 and upper bound = (12.10 - 11.86)/3 = 0.08.

Step 5: We can use any number bigger than 0.06, but not more than 0.08. If we restrict our attention to the simplest numbers, either 0.07 or 0.08 will work. I chose 0.08 because I think it is easier to work with than 0.07.

Step 6: By adding 0.08 to 11.86 repeatedly, we obtain the lower class limits: 11.86, 11.94, 12.02, 12.10. Notice there are 4 numbers because we want 4 classes.

Step 7: The first upper class limit is the largest number with the same accuracy as the data that is just below the second lower class limit. In this case, the number is 11.93. The other upper class limits are found by adding 0.08 repeatedly to 11.93, until there are 4 upper class limits.

Step 8: Next, for each member of the data set, we decide which class contains it and then put a tally mark by that class. The numbers corresponding to these tallies give us the class frequencies.

 

Class Tally Frequency

11.86-11.93 |||| 4

11.94-12.01 ||||  ||||  | 11

12.02-12.09   0


12.10-12.17 | 1

The tallies in the last step are optional, but the frequency column is required. Notice that the frequency of the third class is zero. Since this is not the first or last class, this is not a problem. Notice also that the sum of the frequencies is 16, which is the same as the number of data values.
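The eight steps can also be traced in code. The sketch below is an illustration only and makes two assumptions: the class-width bounds are taken to be the range divided by the number of classes and the range divided by one fewer than the number of classes (this reproduces the 0.06 and 0.08 used in Step 5), and the smallest unit of the data is 0.01. Decimal arithmetic is used so that values such as 11.94 are compared exactly.

```python
from decimal import Decimal

# Liquid weights of the 16 cans, as listed in the example.
weights = [Decimal(s) for s in
           "11.95 11.91 11.86 11.94 12.00 11.93 12.00 11.94 "
           "12.10 11.95 11.99 11.94 11.89 12.01 11.99 11.94".split()]

n_classes = 4                              # Step 2: given in the problem
low, high = min(weights), max(weights)     # Step 1: low and high scores
unit = Decimal("0.01")                     # Step 3: data are accurate to two decimals

# Step 4: bounds on the class width (assumed formulas, see the note above).
lower_bound = (high - low) / n_classes
upper_bound = (high - low) / (n_classes - 1)

# Step 5: pick a convenient width above the lower bound, at most the upper bound.
width = Decimal("0.08")
assert lower_bound < width <= upper_bound

# Step 6: lower class limits, starting from the low score.
lower_limits = [low + k * width for k in range(n_classes)]
# Step 7: first upper limit is one unit below the second lower limit.
upper_limits = [lower_limits[1] - unit + k * width for k in range(n_classes)]

# Step 8: count the data values in each class.
frequencies = [sum(1 for w in weights if lo <= w <= hi)
               for lo, hi in zip(lower_limits, upper_limits)]

for lo, hi, f in zip(lower_limits, upper_limits, frequencies):
    print(f"{lo}-{hi}  {f}")
```

Choosing width = Decimal("0.07") instead would also satisfy the check in Step 5 and would give a different, equally valid distribution.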

B. Making a Histogram from a Quantitative Frequency Distribution

To make a histogram, you must first create a quantitative frequency distribution. We will make a histogram from the quantitative frequency distribution constructed in part A, a copy of which is shown below.

Class Frequency

11.86-11.93 4

11.94-12.01 11

12.02-12.09 0

12.10-12.17 1

First, set up a coordinate system with a uniform scale on each axis (See Figure 1 below). The data axis is marked here with the lower class limits. Note that the last number is 12.10 + 0.08 = 12.18, which is not in the frequency table, but it keeps the scale uniform. You could also mark the data axis using the upper class limits, or you could mark it with the class midpoints. Whatever method you use, the data axis will always have a uniform numeric scale with tick marks at regular intervals and numbers next to the tick marks. 

If the data axis doesn’t look like a number line, then you don’t have a histogram.


Frequency scales always start at zero, so the frequency scale must extend from 0 to at least 11 in this case. I went up to 12, so that I could use multiples of 4 on the vertical axis. You could use multiples of three, multiples of two or mark each number from 0 to 11 on the frequency scale, but the smaller the multiple, the more work you will have to do. As with the data axis, the frequency scale should have tick marks at regular intervals and numbers next to the tick marks.

Once the scales are set up, you draw a bar for each class with a frequency greater than zero (See Figure 2 below). Each bar will cover the interval from its lower class limit to the next lower class limit to the right. The height of the bar is the same as the frequency for that class.

 

 

Finally add a label for each axis. The vertical axis can always be labeled "Frequency". The label on the horizontal axis just describes the original data set.

 

For more examples of making a histogram, go to the GeoGebra applet Histograms.

 

C. Good Histograms, Bad Histograms and Non-histograms

 

Figure 1: GOOD Histogram


This graph has all the characteristics of a good histogram. Both axes are labeled and a title is given. The data axis has a uniform number scale to label the bars. The scales on both the frequency and the data axes cover the data values and not much more. Finally, there are no gaps between the bars.


Figure 2: BAD Histogram

The problem with this histogram is that there are gaps between the bars. The gaps make it appear that some values like 44 and 45 or 53 and 54 never occur.

 

Figure 3: BAD Histogram

Here, the problem is that the frequency scale goes too high. One bar has a frequency of 5 and all the rest have frequencies of 3 or lower, so there is no reason to extend the frequency scale above 6. Notice that the bars in Figure 1 are twice as tall as the bars in Figure 3, even though Figure 1 is about the same size as Figure 3.


Figure 4: BAD Histogram

The data scale here extends too far to the left. The smallest data value is 27 and everything else is bigger, so there is no reason for the data scale to go below 25. Notice that the bars in Figure 1 are wider than those in Figure 4, even though Figure 1 is about the same size as Figure 4.

Figure 5: NOT a Histogram

This chart is NOT a histogram because there is not a uniform number scale on the data axis. The chart is best described as a quantitative bar chart. Without a uniform number scale on the data axis, reading the chart and analyzing the results become much more difficult.

D. Analyzing Histograms

The following are questions that a statistician should be able to answer about any histogram.

* What is the maximum data value as shown on the histogram? (What is the largest value on the data axis?)

* What is the minimum data value as shown on the histogram? (What is the smallest value on the data axis?)

* Is the histogram symmetric, skewed to the left, skewed to the right, bell-shaped, uniform or does it have no special shape? (Because real data rarely results in perfectly uniform, bell-shaped, or symmetric histograms, anything close to these shapes can be classified as such.)

* How many peaks does the histogram have, and where are they located? (Peaks are bars with shorter bars on each side. First bars that are taller than second bars or last bars that are taller than the preceding bar are also called peaks. Two or more adjacent bars of the same height with neighboring shorter bars - a plateau - would be considered one peak.)

* Does the histogram have any gaps, and if so, where are they located? (Gaps are empty classes with bars on both sides.)

* Does the histogram have any extreme values, and if so, where are they located? (An extreme value is a bar with a large gap - two or more classes - between it and the other bars.)

Notice that to answer all of these questions you only need to look at the numbers on the data axis of the histogram - not the frequency axis. The questions are not numbered - in fact they can be asked in any order - so placing a number or letter next to an answer does not identify the question. You should give enough information in your answer to a question so the reader does not have to even know there was a question. The eventual goal is for you to combine all the answers to these questions together in a paragraph.

Exercise: Answer the questions listed above for each of the histograms shown below.


What is the difference between a bar graph and a histogram?

There are two differences: one is in the type of data that is presented, and the other is in the way the graphs are drawn.

Bar graphs are usually used to display "categorical data", that is, data that fits into categories. For example, suppose that I offered to buy donuts for six people and three said they wanted chocolate covered, two said plain, and one said with icing sugar. I would present this in a bar graph as:

Histograms, on the other hand, are usually used to present "continuous data", that is, data that represents a measured quantity where, at least in theory, the numbers can take on any value in a certain range. A good example is weight. If you measure the weights of a group of adults, you might get any numbers between, say, 90 pounds and 240 pounds. We usually report our weights in pounds or to the nearest half pound, but we might do so to the nearest tenth of a pound or however accurate the scale is. The data would then be collected into categories to present a histogram. For example:

might be a histogram for weights (with the appropriate scale on the vertical axis). Here the data has been collected into categories of width 30 pounds.

The difference in the way that bar graphs and histograms are drawn is that the bars in bar graphs are usually separated, whereas in histograms the bars are adjacent to each other. This is not always true, however. Sometimes you see bar graphs with no spaces between the bars, but histograms are never drawn with spaces between the bars.

STEM AND LEAF DIAGRAM

A special table where each data value is split into a "leaf" (usually the last digit) and a "stem" (the other digits).

Key: 2/6 means 26; the stem is 2 and the leaf is 6.
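The split into stems and leaves is easy to mimic in code. The sketch below is an illustration only: it assumes non-negative whole-number data, uses the last digit as the leaf, and the function name stem_and_leaf is made up. The sample values are the first ten of Dr. Chuma's exam scores.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted list of leaves (last digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)   # e.g. 26 -> stem 2, leaf 6
    return dict(stems)

scores = [55, 88, 67, 58, 66, 78, 84, 77, 69, 78]
plot = stem_and_leaf(scores)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

Because the values are sorted before they are split, the leaves on each row come out in increasing order, which is the usual way the diagram is drawn.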

FREQUENCY POLYGON

Frequency polygons are a graphical device for understanding the shapes of distributions. They serve the same purpose as histograms, but are especially helpful for comparing sets of data. Frequency polygons are also a good choice for displaying cumulative frequency distributions.

To create a frequency polygon, start just as for histograms, by choosing a class interval. Then draw an X-axis representing the values of the scores in your data. Mark the middle of each class interval with a tick mark, and label it with the middle value represented by the class. Draw the Y-axis to indicate the frequency of each class. Place a point in the middle of each class interval at the height corresponding to its frequency. Finally, connect the points. You should include one class interval below the lowest value in your data and one above the highest value. The graph will then touch the X-axis on both sides.

A frequency polygon for 642 psychology test scores shown in Figure 1 was constructed from the frequency table shown in Table 1.

Table 1. Frequency Distribution of Psychology Test Scores.


The first label on the X-axis is 35. This represents an interval extending from 29.5 to 39.5. Since the lowest test score is 46, this interval has a frequency of 0. The point labeled 45 represents the interval from 39.5 to 49.5. There are three scores in this interval. There are 147 scores in the interval that surrounds 85.

You can easily discern the shape of the distribution from Figure 1. Most of the scores are between 65 and 115. It is clear that the distribution is not symmetric inasmuch as good scores (to the right) trail off more gradually than poor scores (to the left). In the terminology of Chapter 3 (where we will study shapes of distributions more systematically), the distribution is skewed.

Lower Limit  Upper Limit  Count  Cumulative Count

29.5 39.5 0 0

39.5 49.5 3 3

49.5 59.5 10 13

59.5 69.5 53 66

69.5 79.5 107 173

79.5 89.5 147 320

89.5 99.5 130 450

99.5 109.5 78 528

109.5 119.5 59 587

119.5 129.5 36 623

129.5 139.5 11 634

139.5 149.5 6 640

149.5 159.5 1 641

159.5 169.5 1 642

169.5 179.5 0 642


Figure 1. Frequency polygon for the psychology test scores.

A cumulative frequency polygon for the same test scores is shown in Figure 2. The graph is the same as before except that the Y value for each point is the number of students in the corresponding class interval plus all numbers in lower intervals. For example, there are no scores in the interval labeled "35," three in the interval "45," and 10 in the interval "55." Therefore, the Y value corresponding to "55" is 13. Since 642 students took the test, the cumulative frequency for the last interval is 642.
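The Y values of the cumulative frequency polygon are running totals of the counts in Table 1. A short Python sketch of that calculation follows; the variable names are illustrative, and the X-axis labels 35, 45, ... are the ones used in the text.

```python
from itertools import accumulate

# Counts from Table 1, for the intervals 29.5-39.5 up to 169.5-179.5.
counts = [0, 3, 10, 53, 107, 147, 130, 78, 59, 36, 11, 6, 1, 1, 0]

# X-axis labels used in the text for those intervals.
labels = [35 + 10 * k for k in range(len(counts))]

# Each cumulative value is the count for its interval plus all lower intervals.
cumulative = list(accumulate(counts))

for label, count, cum in zip(labels, counts, cumulative):
    print(label, count, cum)
```

Plotting labels against counts gives the points of the ordinary frequency polygon, and plotting labels against cumulative gives the points of the cumulative one.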

Figure 2. Cumulative frequency polygon for the psychology test scores.


Frequency polygons are useful for comparing distributions. This is achieved by overlaying the frequency polygons drawn for different data sets. Figure 3 provides an example. The data come from a task in which the goal is to move a computer mouse to a target on the screen as fast as possible. On 20 of the trials, the target was a small rectangle; on the other 20, the target was a large rectangle. Time to reach the target was recorded on each trial. The two distributions (one for each target) are plotted together in Figure 3. The figure shows that, although there is some overlap in times, it generally took longer to move the mouse to the small target than to the large one.

Figure 3. Overlaid frequency polygons.

It is also possible to plot two cumulative frequency distributions in the same graph. This is illustrated in Figure 4 using the same data from the mouse task. The difference in distributions for the two targets is again evident.


Figure 4. Overlaid cumulative frequency polygons.

PIE CHART

A special chart that uses "pie slices" to show relative sizes of data

BOXPLOT

A boxplot is a way of summarizing a set of data measured on an interval scale. It is often used in exploratory data analysis. It is a type of graph which is used to show the shape of the distribution, its central value, and variability. The picture produced consists of the most extreme values in the data set (maximum and minimum values), the lower and upper quartiles, and the median. (Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)

The median for each dataset is indicated by the black center line, and the first and third quartiles are the edges of the red area, which is known as the inter-quartile range (IQR). The extreme values (within 1.5 times the inter-quartile range from the upper or lower quartile) are the ends of the lines extending from the IQR. Points farther than 1.5 times the IQR from the nearer quartile are plotted individually as asterisks. These points represent potential outliers.

In this example, the three boxplots have nearly identical median values. The IQR is decreasing from one time period to the next, indicating reduced variability of payoffs in the second and third periods. In addition, the extreme values are closer to the median in the later time periods.

SCATTER PLOT

A scatterplot is a useful summary of a set of bivariate data (two variables), usually drawn before working out a linear correlation coefficient or fitting a regression line. It gives a good visual picture of the relationship between the two variables, and aids the interpretation of the correlation coefficient or regression model.

Each unit contributes one point to the scatterplot, on which points are plotted but not joined. The resulting pattern indicates the type and strength of the relationship between the two variables.(Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)

A scatterplot is often employed to identify potential associations between two variables, where one may be considered to be an explanatory variable (such as years of education) and another may be considered a response variable (such as annual income). A positive association between education and income would be indicated on a scatterplot by an upward trend (positive slope), where higher incomes correspond to higher education levels and lower incomes correspond to fewer years of education. A negative association would be indicated by the opposite effect (negative slope), where the most highly educated individuals would have lower incomes than the least educated individuals. Or, there might not be any notable association, in which case a scatterplot would not indicate any trends whatsoever. The following plots demonstrate the appearance of positively associated, negatively associated, and non-associated variables:


 

MEASURES OF CENTRAL TENDENCY AND DISPERSION

Mean, mode and median for ungrouped data


Mean

The sample mean, or average, of a group of values is calculated by taking the sum of all of the values and dividing by the total number of values. In other words, for n values x1, x2, x3, ... , xn, the mean x̄ = (x1 + x2 + x3 + ... + xn)/n, or x̄ = (1/n) Σ xi, with the sum taken over i = 1, ..., n.

Example

Suppose a group of 10 students have the following heights (in inches): 60, 72, 64, 67, 70, 68, 71, 68, 73, 59.

The mean height for this group is (1/10)*(60+72+64+67+70+68+71+68+73+59) = 672/10 = 67.2.

Median

The median of a group of values is the center, or midpoint, of the ordered values. The median is calculated by placing a group of values in ascending order and taking the center observation of the ordered list, such that there are an equal number of values above and below the median (for an even number of observations, one may take the average of the two center values).

Example

For the data in the previous example, the median is calculated as follows:

First order the data: 59, 60, 64, 67, 68, 68, 70, 71, 72, 73.

Since there are 10 observations, the median is the average of the 5th and 6th observations, which in this case are identical: 5th observation = 68, 6th observation = 68, median = 68.
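Both calculations can be checked with Python's standard statistics module. This is only a sketch for verification; the variable name heights is illustrative.

```python
from statistics import mean, median

# Heights (in inches) of the 10 students in the example.
heights = [60, 72, 64, 67, 70, 68, 71, 68, 73, 59]

print(sum(heights))     # total of the ten heights
print(mean(heights))    # total divided by the number of values
print(median(heights))  # average of the 5th and 6th ordered values
```

For an even number of observations, median averages the two center values, which matches the rule described above.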

Quartiles

The first quartile of a group of values is the value such that 25% of the values fall at or below this value. The third quartile of a group of values is the value such that 75% of the values fall at or below this value. The first quartile may be approximately calculated by placing a group of values in ascending order and determining the median of the values below the true median, and the third quartile is approximately calculated by determining the median of the values above the true median. For an odd number of observations, the median is excluded from the calculation of the first and third quartiles.

The distance between the first and third quartiles is known as the Inter-Quartile Range (IQR).

A useful graphical representation of a distribution including the quartiles is a boxplot.

Example

For the data in the previous example, the quartiles may be approximately calculated as follows:

First order the data: 59, 60, 64, 67, 68, 68, 70, 71, 72, 73.

Since there are an even number of observations (10), the first half of the data is considered in calculating the first quartile: 59, 60, 64, 67, 68. The median of these values is 64, so this is the first quartile.

The second half of the data is considered in calculating the third quartile: 68, 70, 71, 72, 73. The median of these values is 71, so this is the third quartile.

For this example, the Inter-Quartile Range is 71-64 = 7.
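The approximate method used in this example (the median of each half of the ordered data) can be sketched in Python as follows. The function name quartiles is made up for illustration, and other textbooks and software packages use slightly different quartile rules, so results from a library function may differ a little.

```python
from statistics import median

def quartiles(values):
    """Approximate Q1 and Q3 as the medians of the lower and upper halves
    of the ordered data; the middle value is left out when n is odd."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return median(lower), median(upper)

heights = [60, 72, 64, 67, 70, 68, 71, 68, 73, 59]
q1, q3 = quartiles(heights)
iqr = q3 - q1
print(q1, q3, iqr)
```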

Variance and Standard Deviation

The variance of a group of values measures the spread of the distribution. A large variance indicates a wide range of values, while a small variance indicates that the values lie close to their mean. The variance s² is calculated by summing the squared distances from each value to the mean of the values, then dividing by one fewer than the number of observations. The standard deviation s is the square root of the variance.

Example

The following calculation computes the variance for the student height data, where the mean was previously calculated to be 67.2:

s² = 1/9[(59-67.2)² + (60-67.2)² + (64-67.2)² + (67-67.2)² + .... + (73-67.2)²]

= 1/9[67.24 + 51.84 + 10.24 + 0.04 + .... + 33.64]

= 1/9[209.60]

= 23.3

The standard deviation is the square root: s = 4.8.
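The same calculation can be written out step by step in Python, which makes it easy to check each squared distance. This is a sketch for verification only; the last line compares the hand-built result with the standard library's sample variance, which also divides by one fewer than the number of observations.

```python
from math import sqrt
from statistics import mean, variance

heights = [60, 72, 64, 67, 70, 68, 71, 68, 73, 59]

m = mean(heights)
squared_distances = [(x - m) ** 2 for x in heights]

# Sample variance: sum of squared distances divided by n - 1.
s2 = sum(squared_distances) / (len(heights) - 1)
s = sqrt(s2)

print(round(sum(squared_distances), 2), round(s2, 1), round(s, 1))
print(abs(s2 - variance(heights)) < 1e-9)
```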

Grouped Data Problems

Find the mean and standard deviation of the following quantitative frequency distributions.

1) A sample of college students was asked how much they spent monthly on a cell phone plan (to the nearest dollar).

Monthly Cell Phone Plan Cost ($) Number of Students

10 – 20 8

20 – 30 16

30 – 40 21

40 – 50 11

50 – 60 4

2) The following data represent the difference in scores between the winning and losing teams in a sample of 15 college football bowl games from 2004-2005.

Point Difference Number of Bowl Games


1 – 5 8

5 - 10 0

10 - 15 2

15 - 20 3

20 - 25 1

25 - 30 0

30 - 35 1

3) The following table represents the distribution of the annual number of days over 100 degrees Fahrenheit for Dallas-Fort Worth for a sample of 80 years from 1905 to 2004.

Days Above 100 Degrees Number of Years

0 – 10 25

10 - 20 33

20 – 30 14

30 - 40 5

40 - 50 2

50 - 60 1

4) The following table shows the distribution of the number of hours worked each week (on average) for a sample of 100 community college students.

Hours Worked per Week Number of Students

0 – 10 24

10 – 20 14

20 – 30 39

30 – 40 18

40 – 49 5

 

5) The following data represents the age distribution of a sample of 100 people covered by health insurance (private or government). The sample was taken in 2003.

Age Number

25 - 35 23

35 - 45 29

45 - 55 28

55 - 65 20

6) The following data represent the high temperature distribution in degrees Fahrenheit for a sample of 40 days from the month of August in Chicago since 1872.

Temperature Days

60 - 70 3

70 - 80 15

80 - 90 17

90 – 100 5


7) The following data represent the annual rainfall distribution in St. Louis, Missouri, for a sample of 25 years from 1870 to 2004.

Rainfall (inches) Number of Years

20 - 25 1

25 - 30 3

30 - 35 5

35 - 40 8

40 - 45 5

45 - 50 2

50 - 55 0

55 - 60 1

8) The following data represent the age distribution of a sample of 70 women having multiple-delivery births in 2002.

Age Number

15 - 20 1

20 - 25 5

25 - 30 16

30 - 35 28

35 - 40 17

40 - 45 3

Answers:


1) mean = 32.2, std. dev. = 11.1

2) mean = 10.7, std. dev. = 9.6

3) mean = 15.6, std. dev. = 10.8

4) mean = 21.1, std. dev. = 11.7

5) mean = 44.0, std. dev. = 10.6

6) mean = 80.5, std. dev. = 8.1

7) mean = 36.8, std. dev. = 7.6

8) mean = 31.6, std. dev. = 5.2
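For grouped data, a common approach is to let every value in a class be represented by the class midpoint and then apply the usual mean and sample standard deviation formulas with the frequencies as weights. The sketch below follows that approach for problem 6. It assumes (this is not stated in the problem) that the classes are read as 60-69, 70-79, 80-89 and 90-99 for whole-degree temperatures; with that reading the result agrees with the listed answer. The function name grouped_mean_sd is made up for illustration.

```python
from math import sqrt

def grouped_mean_sd(classes, freqs):
    """classes: list of (lower limit, upper limit); freqs: class frequencies.
    Every value in a class is represented by the class midpoint."""
    marks = [(lo + hi) / 2 for lo, hi in classes]
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, marks)) / n
    # Sample variance: weighted squared distances divided by n - 1.
    var = sum(f * (x - mean) ** 2 for f, x in zip(freqs, marks)) / (n - 1)
    return mean, sqrt(var)

# Problem 6, with class limits read as 60-69, 70-79, ... (an assumption).
classes = [(60, 69), (70, 79), (80, 89), (90, 99)]
freqs = [3, 15, 17, 5]

mean6, sd6 = grouped_mean_sd(classes, freqs)
print(round(mean6, 1), round(sd6, 1))
```

Reading the classes exactly as printed (60-70, 70-80, and so on) would move every midpoint, and therefore the mean, up by 0.5, while leaving the standard deviation unchanged.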

 Summary of Basic Probability

In the following, the capital letters A, B, or C are used to represent some event like "it will rain tomorrow" or "a computer chip is defective". The notation P(A) then stands for the probability of event A.

An experiment is any process that yields a result or observation.

The classical or theoretical definition of probability assumes that there are a finite number of outcomes in a situation and all the outcomes are equally likely.

Classical Definition of Probability

P(A) = (number of outcomes in which event A occurs) / (total number of possible outcomes)

Though you probably have not seen this definition before, you probably have an inherent grasp of the concept. In other words, you could guess the probabilities without knowing the definition.

Cards and Dice The examples that follow require some knowledge of cards and dice. Here are the basic facts needed to compute probabilities concerning cards and dice.

A standard deck of cards has four suits: hearts, clubs, spades, diamonds. Each suit has thirteen cards: ace, 2, 3, 4, 5, 6, 7, 8, 9, 10, jack, queen and king. Thus the entire deck has 52 cards total.

When you are asked about the probability of choosing a certain card from a deck of cards, you assume that the cards have been well-shuffled, and that each card in the deck is visible, though face down, so you do not know what the suit or value of the card is.

A pair of dice consists of two cubes with dots on each side. One of the cubes is called a die, and each die has six sides. Each side of a die has a number of dots (1, 2, 3, 4, 5 or 6), and each number of dots appears only once.

Example 1 The probability of choosing a heart from a deck of cards is given by P(heart) = 13/52 = 1/4.

Example 2 The probability of choosing a three from a deck of cards is P(three) = 4/52 = 1/13.

Example 3 The probability of a two coming up after rolling a die (singular for dice) is P(two) = 1/6.

The classical definition works well in determining probabilities for games of chance like poker or roulette, because the stated assumptions readily apply in these cases. Unfortunately, if you wanted to find the probability of something like rain tomorrow or of a licensed driver in Louisiana being involved in an auto accident this year, the classical definition does not apply. Fortunately, there is another definition of probability to apply in these cases.
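Under the classical definition, a probability is just a count of favorable outcomes divided by a count of all equally likely outcomes. The Python sketch below does that counting for the card and die examples; the function name classical_probability and the way the deck is represented are illustrative choices, not part of the notes.

```python
from fractions import Fraction
from itertools import product

suits = ["hearts", "clubs", "spades", "diamonds"]
ranks = ["ace", "2", "3", "4", "5", "6", "7", "8", "9", "10", "jack", "queen", "king"]
deck = list(product(ranks, suits))   # 52 (rank, suit) pairs
die = list(range(1, 7))              # faces 1 through 6

def classical_probability(event, outcomes):
    """Favorable outcomes over total outcomes, all assumed equally likely."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))

p_heart = classical_probability(lambda card: card[1] == "hearts", deck)
p_three = classical_probability(lambda card: card[0] == "3", deck)
p_two = classical_probability(lambda face: face == 2, die)
print(p_heart, p_three, p_two)
```

Using Fraction keeps the answers as exact ratios rather than rounded decimals.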

Empirical Definition of Probability

The probability of event A is the number approached by

(number of recorded outcomes in which event A occurred) / (total number of recorded outcomes)

as the total number of recorded outcomes becomes "very large."

The idea that the fraction in the previous definition will approach a certain number as the total number of recorded outcomes becomes very large is called the Law of Large Numbers. Because of this law, when the Classical Definition applies to an event A, the probabilities found by either definition should be the same. In other words, if you keep rolling a die, the ratio of the total number of twos to the total number of rolls should approach one-sixth. Similarly, if you draw a card, record its number, return the card, shuffle the deck, and repeat the process; as the number of repetitions increases, the total number of threes over the total number of repetitions should approach 1/13 ≈ 0.0769.
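The die-rolling claim can be tried out with a short simulation. This is only a sketch: the function name `relative_frequency_of_two` and the fixed seed are assumptions made for repeatability, and a pseudo-random generator stands in for a physical die.

```python
import random

def relative_frequency_of_two(num_rolls, seed=0):
    """Roll a fair six-sided die num_rolls times and return the fraction of twos."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    twos = sum(1 for _ in range(num_rolls) if rng.randint(1, 6) == 2)
    return twos / num_rolls

# The printed ratios should drift toward 1/6 (about 0.1667) as the count grows.
for n in (100, 10_000, 200_000):
    print(n, relative_frequency_of_two(n))
```

For small numbers of rolls the ratio can be noticeably off; only the long-run behavior is covered by the Law of Large Numbers.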

In working with the empirical definition, most of the time you have to settle for an estimate of the probability involved. This estimate is thus called an empirical estimate.

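Example 4 To estimate the probability of a licensed driver in Louisiana being involved in an auto accident this year, you could use the ratio

$$\frac{\text{number of Louisiana licensed drivers involved in an auto accident last year}}{\text{total number of Louisiana licensed drivers last year}}$$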

To do better than that, you could use the number of accidents for the last five years and the total number of Louisiana drivers in the last five years. Or to do even better, use the numbers for the last ten years or, better yet, the last twenty years.

Example 5 Estimating the probability of rain tomorrow would be a little more difficult. You could note today's temperature, barometric pressure, prevailing wind direction, and whether or not there are rain clouds that could be blown into your area by tomorrow. Then you could find all days on record in the past with similar temperatures, pressures, wind directions, and clouds in the right location. Your rainfall estimate would then be the ratio

$$\frac{\text{number of those days that were followed by a rainy day}}{\text{total number of those days}}$$

To make your estimate better, you might want to add in humidity, wind speed, or season of the year. Or, if there seemed to be no relation between humidity levels and rainfall, you might want to add in the days that did not meet your humidity level requirements and thus increase the total number of days.

Example 6 If you want to estimate the probability that a dam will burst, or a bridge will collapse, or a skyscraper will topple, there is usually not much past data available. The next best thing is to do a computer simulation. Simulation results can be compiled a lot faster, with a lot less money and less loss of life, than actual events. The estimated probability of, say, a bridge collapsing would be given by the following fraction

$$\frac{\text{number of simulation runs in which the bridge collapsed}}{\text{total number of simulation runs}}$$

The more true to life the simulation is, the better the estimate will be.

Basic Probability Rules For either definition, the probability of an event A is always a number between zero and one, inclusive; i.e.

$$0 \le P(A) \le 1$$

Sometimes probability values are written using percentages, in which case the rule just given is written as follows

$$0\% \le P(A) \le 100\%$$

If the event A is not possible, then P(A) = 0 or P(A) = 0%. If event A is certain to occur, then P(A) = 1 or P(A) = 100%.

The sum of the probabilities for each possible outcome of an experiment is 1 or 100%. This is written mathematically as follows using the capital Greek letter sigma (Σ) to denote summation.

$$\sum P(A) = 1 \quad \text{or} \quad \sum P(A) = 100\%$$

Probability Scale* The best way to find out what the probability of an event means is to compute the probability of a number of events you are familiar with and consider how the probabilities you compute correspond to how frequently the events occur. Until you have computed a large number of probabilities and developed your own sense of what probabilities mean, you can use the following probability scale as a rough starting point. When you gain more experience with probabilities, you may want to change some terminology or move the boundaries of the different regions.

SOME BASIC DEFINITIONS

The sample space S for a probability model is the set of all possible outcomes.

For example, suppose there are 5 marbles in a bowl. One is red, one is blue, one is yellow, one is green, and one is purple. If one marble is to be picked at random from the bowl, the sample space is S = {red, blue, yellow, green, purple}. If 3 of the marbles are red and 2 are blue, then the sample space is S = {red, blue}, since only two color outcomes are possible. If, instead, two marbles are picked from a bowl with 3 red marbles and 2 blue marbles, then the sample space is S = {(2 red), (2 blue), (1 red and 1 blue)}, the set of all possible outcomes.

An event A is a subset of the sample space S.

Suppose there are 3 red marbles and 2 blue marbles in a bowl. If an individual picks three marbles, one at a time, from the bowl, the event "pick exactly 2 red marbles" can be achieved in 3 ways, so the set of outcomes is A = {(red, red, blue), (red, blue, red), (blue, red, red)}. The sample space for picking three marbles, one at a time, is the set of all possible ordered sequences of three marbles, S = {(red, red, red), (red, red, blue), (red, blue, red), (blue, red, red), (blue, blue, red), (blue, red, blue), (red, blue, blue)}. Since there are only 2 blue marbles, the outcome (blue, blue, blue) is impossible.
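The sample space and the event for the three-marble draw can also be listed mechanically. The sketch below assumes the marbles are distinguished only by color; the variable names are illustrative.

```python
from itertools import permutations

# Three red and two blue marbles, labeled only by color.
bowl = ["red", "red", "red", "blue", "blue"]

# Draw three marbles one at a time without replacement; storing the
# draws in a set leaves only the distinct ordered color sequences.
sample_space = set(permutations(bowl, 3))

# Event A: exactly two of the three marbles drawn are red.
event_a = {outcome for outcome in sample_space if outcome.count("red") == 2}

print(len(sample_space))  # 7
print(sorted(event_a))
```

Note that these seven color sequences are not equally likely, so counting them gives the sample space but not, by itself, the probabilities.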

A probability is a numerical value assigned to a given event A. The probability of an event is written P(A), and describes the long-run relative frequency of the event. The first two basic rules of probability are the following:

Rule 1: Any probability P(A) is a number between 0 and 1, inclusive (0 ≤ P(A) ≤ 1).

Rule 2: The probability of the sample space S is equal to 1 (P(S) = 1).

Suppose five marbles, each of a different color, are placed in a bowl. The sample space for choosing one marble, from above, is S = {red, blue, yellow, green, purple}. Since one of these must be selected, the probability of choosing some marble is equal to the probability of the sample space, P(S) = 1. Suppose the event of interest is choosing the purple marble, A = {purple}. If it is equally likely that any one marble will be selected, then the probability of choosing the purple marble is P(A) = 1/5. In general, the following formula describes the calculation of probabilities for equally likely outcomes:

If there are k possible outcomes for a phenomenon and each is equally likely, then each individual outcome has probability 1/k.

The probability of any event A is

$$P(A) = \frac{\text{count of outcomes in } A}{\text{count of outcomes in } S} = \frac{\text{count of outcomes in } A}{k}$$

If two events have no outcomes in common, then they are called disjoint. For example, the possible outcomes of picking a single marble are disjoint: only one color is possible on each pick. The addition of probabilities for disjoint events is the third basic rule of probability:

Rule 3: If two events A and B are disjoint, then the probability of either event is the sum of the probabilities of the two events: P(A or B) = P(A) + P(B).

The chance of any (one or more) of two or more events occurring is called the union of the events. The probability of the union of disjoint events is the sum of their individual probabilities.

For example, the probability of drawing either a purple, red, or green marble from a bowl of five differently colored marbles is the sum of the probabilities of drawing any of these marbles: 1/5 + 1/5 + 1/5 = 3/5.

If there are three red marbles and two blue marbles, then the probability of drawing any red marble is the number of outcomes in the event "red," which is equal to three, divided by the total number of outcomes, 5, or 3/5 = 0.6. The sample space in this case is {red, blue}, which must have total probability equal to 1, so the probability of drawing a blue marble is equal to 2/5 = 0.4. The event of drawing a blue marble occurs exactly when a red marble is not chosen, so if A = "red," the event "blue" is called the complement A^c of A. Since an event and its complement together form the entire sample space S, the probability of the complement A^c is equal to the probability of the sample space S minus the probability of A, as follows:

Rule 4: The probability that any event A does not occur is P(A^c) = 1 - P(A).

Venn Diagrams

A useful graphical tool for studying the complements, intersections, and unions of events within a sample space S is known as a Venn diagram. In such a diagram, events are drawn as regions that may or may not overlap.

In the Venn diagram to the right, events A and B are disjoint. Suppose, for example, event A is drawing a red marble from a bowl of five differently colored marbles, and event B is drawing a blue marble. These events cannot both occur, so there is no overlapping area.

In the Venn diagram on the left, events A and B are not disjoint. This means that it is possible for both events to occur, and the overlapping area represents this possibility. Suppose, for instance, there are 3 red marbles and 2 blue marbles in a bowl. Two marbles are to be drawn from the bowl, one after the other. After the first draw, the marble drawn is returned to the bowl. Define event A to be drawing a red marble from the bowl on the first draw, and define event B to be drawing a blue marble on the second draw. The occurrence of event A is represented by the red area, the occurrence of event B is represented by the blue area, the occurrence of both events is represented by the overlapping area (also known as the intersection of the two events), and the occurrence of either event is represented by the entire colored area (also known as the union of the two events).

Independence

Consider two events which might occur in succession, such as two flips of a coin. If the outcome of the first event has no effect on the probability of the second event, then the two events are called independent. For two coin flips, the probability of getting a "head" on either flip is 1/2, regardless of the result of the other flip. The fifth basic rule of probability is known as the multiplication rule, and applies only to independent events:

Rule 5: If two events A and B are independent, then the probability of both events is the product of the probabilities for each event: P(A and B) = P(A)P(B).

The chance of all of two or more events occurring is called the intersection of events. For independent events, the probability of the intersection of two or more events is the product of the probabilities.

In the case of two coin flips, for example, the probability of observing two heads is 1/2*1/2 = 1/4. Similarly, the probability of observing four heads on four coin flips is 1/2*1/2*1/2*1/2 = 1/16.

If two events A and B are not disjoint, then the probability of their union (the event that A or B occurs) is equal to the sum of their probabilities minus the probability of their intersection: P(A or B) = P(A) + P(B) - P(A and B).

In the example corresponding to the second Venn diagram above, we know that the probability of drawing a red marble on the first draw (event A) is equal to 3/5, and the probability of drawing a blue marble on the second draw (event B) is equal to 2/5. Since events A and B are independent, the probability of the intersection of A and B, P(A and B), equals the product P(A)*P(B) = 3/5*2/5 = 6/25. The probability of the union of A and B, P(A or B), is equal to P(A) + P(B) - P(A and B) = 3/5 + 2/5 - 6/25 = 1 - 6/25 = 19/25 = 0.76.

For another example, consider tossing two coins. The probability of a head on any toss is equal to 1/2. Since the tosses are independent, the probability of a head on both tosses (the intersection) is equal to 1/2*1/2 = 1/4. The probability of a head on either toss (the union) is equal to the sum of the probabilities of a head on each toss minus the probability of the intersection, 1/2 + 1/2 - 1/4 = 3/4.
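The marble figures for the intersection and the union can be confirmed by enumerating every pair of draws. This is a minimal sketch, assuming draws with replacement so that all 25 ordered pairs are equally likely; the helper `prob` is an illustrative name.

```python
from fractions import Fraction
from itertools import product

bowl = ["red"] * 3 + ["blue"] * 2

# Two draws with replacement: every ordered pair of marbles is equally likely.
draws = list(product(bowl, repeat=2))

def prob(event):
    """Probability of an event given as a predicate on (first, second)."""
    return Fraction(sum(1 for d in draws if event(d)), len(draws))

p_a = prob(lambda d: d[0] == "red")             # red on the first draw
p_b = prob(lambda d: d[1] == "blue")            # blue on the second draw
p_both = prob(lambda d: d[0] == "red" and d[1] == "blue")
p_either = prob(lambda d: d[0] == "red" or d[1] == "blue")

print(p_both, p_either)                # 6/25 19/25
print(p_both == p_a * p_b)             # True: multiplication rule
print(p_either == p_a + p_b - p_both)  # True: union rule
```

The same enumeration works for the two-coin example by replacing `bowl` with `["head", "tail"]`.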

Note: Disjoint events that each have nonzero probability are not independent. In the marble example, consider drawing one marble from the bowl of five, where each marble is a different color. Suppose the event of interest, event A, is drawing a blue marble. The probability of drawing this marble is 1/5. Suppose event B is drawing a green marble. These events are disjoint, since event B cannot occur if event A occurs. Obviously, they are not independent, since the outcome of event A directly affects the outcome of event B: once A occurs, the probability of B drops from 1/5 to 0. If, instead, two marbles were to be drawn from the bowl, with the first marble replaced before the second marble was drawn, then the event of drawing a blue marble on the first draw would not affect the outcome of the second draw. The event of drawing a green marble on the second draw would be independent of the event of drawing a blue marble on the first draw, so the probability of both events occurring would be the product of the probabilities of each event, 1/5*1/5 = 1/25.

Conditional Probability

The conditional probability of an event B is the probability that the event will occur given the knowledge that an event A has already occurred. This probability is written P(B|A), notation for the probability of B given A. In the case where events A and B are independent (where event A has no effect on the probability of event B), the conditional probability of event B given event A is simply the probability of event B, that is P(B).

If events A and B are not independent, then the probability of the intersection of A and B (the probability that both events occur) is defined by P(A and B) = P(A)P(B|A).

From this definition, the conditional probability P(B|A) is easily obtained by dividing by P(A):

$$P(B|A) = \frac{P(A \text{ and } B)}{P(A)}$$

Note: This expression is only valid when P(A) is greater than 0.

Examples

In a card game, suppose a player needs to draw two cards of the same suit in order to win. Of the 52 cards, there are 13 cards in each suit. Suppose first the player draws a heart. Now the player wishes to draw a second heart. Since one heart has already been chosen, there are now 12 hearts remaining in a deck of 51 cards. So the conditional probability P(Draw second heart|First card a heart) = 12/51.

Suppose an individual applying to a college determines that he has an 80% chance of being accepted, and he knows that dormitory housing will only be provided for 60% of all of the accepted students. The chance of the student being accepted and receiving dormitory housing is defined by P(Accepted and Dormitory Housing) = P(Dormitory Housing|Accepted)P(Accepted) = (0.60)*(0.80) = 0.48.
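The card example can be checked against the ratio P(A and B) / P(A) by counting ordered pairs of cards. The sketch below is an assumption-laden illustration: it tracks suits only and treats every ordered pair of distinct cards as equally likely.

```python
from fractions import Fraction
from itertools import permutations

# Suits are all that matter here: 13 cards in each of four suits.
deck = [suit for suit in ("hearts", "clubs", "spades", "diamonds") for _ in range(13)]

# Every ordered pair of distinct cards (two draws without replacement).
pairs = list(permutations(deck, 2))

first_heart = [p for p in pairs if p[0] == "hearts"]
both_hearts = [p for p in first_heart if p[1] == "hearts"]

p_a = Fraction(len(first_heart), len(pairs))        # P(first card a heart)
p_a_and_b = Fraction(len(both_hearts), len(pairs))  # P(both cards hearts)

print(p_a_and_b / p_a)  # 4/17, which equals 12/51
```

The result agrees with the direct count of 12 hearts among the 51 remaining cards.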

Measures of Skewness and Kurtosis

Skewness: indicator used in distribution analysis as a sign of asymmetry and deviation from a normal distribution. 

Interpretation: 

Skewness > 0 - Right skewed distribution - most values are concentrated on left of the mean, with extreme values to the right.

Skewness < 0 - Left skewed distribution - most values are concentrated on the right of the mean, with extreme values to the left.

Skewness = 0 - mean = median, the distribution is symmetrical around the mean.

Kurtosis - indicator used in distribution analysis as a sign of flattening or "peakedness" of a distribution. 

Interpretation: (Using the first formula given below)

Kurtosis > 3 - Leptokurtic distribution, sharper than a normal distribution, with values concentrated around the mean and thicker tails. This means high probability for extreme values.

Kurtosis < 3 - Platykurtic distribution, flatter than a normal distribution with a wider peak. The probability for extreme values is less than for a normal distribution, and the values are wider spread around the mean.

Kurtosis = 3 - Mesokurtic distribution - normal distribution for example.

Skewness and Kurtosis

A fundamental task in many statistical analyses is to characterize the location and variability of a data set. A further characterization of the data includes skewness and kurtosis.

Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.

Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak. A uniform distribution would be the extreme case.

The histogram is an effective graphical technique for showing both the skewness and kurtosis of a data set.

Definition of Skewness

For univariate data Y1, Y2, ..., YN, the formula for skewness is:

$$\text{skewness} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^3 / N}{s^3}$$

where $\bar{Y}$ is the mean, $s$ is the standard deviation, and N is the number of data points. The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero. Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right. By skewed left, we mean that the left tail is long relative to the right tail. Similarly, skewed right means that the right tail is long relative to the left tail. Some measurements have a lower bound and are skewed right. For example, in reliability studies, failure times cannot be negative.

Definition of Kurtosis

For univariate data Y1, Y2, ..., YN, the formula for kurtosis is:

$$\text{kurtosis} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^4 / N}{s^4}$$

where $\bar{Y}$ is the mean, $s$ is the standard deviation, and N is the number of data points.

Alternative Definition of Kurtosis

The kurtosis for a standard normal distribution is three. For this reason, some sources use the following definition of kurtosis (often referred to as "excess kurtosis"):

$$\text{excess kurtosis} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^4 / N}{s^4} - 3$$

This definition is used so that the standard normal distribution has a kurtosis of zero. In addition, with the second definition positive kurtosis indicates a "peaked" distribution and negative kurtosis indicates a "flat" distribution.

Which definition of kurtosis is used is a matter of convention (this handbook uses the original definition). When using software to compute the sample kurtosis, you need to be aware of which convention is being followed. Many sources use the term kurtosis when they are actually computing "excess kurtosis", so it may not always be clear.
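To make the two conventions concrete, here is a small sketch that computes skewness, kurtosis, and excess kurtosis from raw data. It assumes the divide-by-N form of the moments and of the standard deviation; other sources and software packages may use N - 1 or bias corrections, so results can differ slightly. The function name is illustrative.

```python
import math

def skewness_and_kurtosis(data):
    """Return (skewness, kurtosis, excess kurtosis) using divide-by-N moments."""
    n = len(data)
    mean = sum(data) / n
    # Standard deviation with N (not N - 1) in the denominator: an assumption.
    s = math.sqrt(sum((y - mean) ** 2 for y in data) / n)
    skew = sum((y - mean) ** 3 for y in data) / n / s ** 3
    kurt = sum((y - mean) ** 4 for y in data) / n / s ** 4
    return skew, kurt, kurt - 3

# A small symmetric data set: skewness is 0 and kurtosis is about 1.7.
skew, kurt, excess = skewness_and_kurtosis([1, 2, 3, 4, 5])
print(skew, kurt, excess)
```

When comparing against a statistics package, check whether it reports kurtosis or excess kurtosis before interpreting the number against the threshold of 3.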

Examples The following example shows histograms for 10,000 random numbers generated from a normal, a double exponential, a Cauchy, and a Weibull distribution.

Normal Distribution

The first histogram is a sample from a normal distribution. The normal distribution is a symmetric distribution with well-behaved tails. This is indicated by the skewness of 0.03. The kurtosis of 2.96 is near the expected value of 3. The histogram verifies the symmetry.

Double Exponential Distribution

The second histogram is a sample from a double exponential distribution. The double exponential is a symmetric distribution. Compared to the normal, it has a stronger peak, more rapid decay, and heavier tails. That is, we would expect a skewness near zero and a kurtosis higher than 3. The skewness is 0.06 and the kurtosis is 5.9.

Cauchy Distribution

The third histogram is a sample from a Cauchy distribution.

For better visual comparison with the other data sets, we restricted the histogram of the Cauchy distribution to values between -10 and 10. The full data set for the Cauchy data in fact has a minimum of approximately -29,000 and a maximum of approximately 89,000.

The Cauchy distribution is a symmetric distribution with heavy tails and a single peak at the center of the distribution. Since it is symmetric, we would expect a skewness near zero. Due to the heavier tails, we might expect the kurtosis to be larger than for a normal distribution. In fact the skewness is 69.99 and the kurtosis is 6,693. These extremely high values can be explained by the heavy tails. Just as the mean and standard deviation can be distorted by extreme values in the tails, so too can the skewness and kurtosis measures.

Weibull Distribution

The fourth histogram is a sample from a Weibull distribution with shape parameter 1.5. The Weibull distribution is a skewed distribution with the amount of skewness depending on the value of the shape parameter. The degree of decay as we move away from the center also depends on the value of the shape parameter. For this data set, the skewness is 1.08 and the kurtosis is 4.46, which indicates moderate skewness and kurtosis.

Probability Tree Diagrams

Here is a tree diagram for the toss of a coin:


There are two "branches" (Heads and Tails)

The probability of each branch is written on the branch

The outcome is written at the end of the branch

We can extend the tree diagram to two tosses of a coin:

How do you calculate the overall probabilities?

You multiply probabilities along the branches

You add probabilities down columns

Now we can see such things as:

The probability of "Head, Head" is 0.5×0.5 = 0.25

All probabilities add to 1.0 (which is always a good check)

The probability of getting at least one Head from two tosses is 0.25+0.25+0.25 = 0.75

... and more


Example: Soccer Game

You are off to soccer, and love being the Goalkeeper, but that depends who is the Coach today:

with Coach Sam the probability of being Goalkeeper is 0.5

with Coach Alex the probability of being Goalkeeper is 0.3

Sam is Coach more often ... about 6 out of every 10 games (a probability of 0.6).

So, what is the probability you will be a Goalkeeper today?

 

Let's build the tree diagram. First we show the two possible coaches: Sam or Alex:

The probability of getting Sam is 0.6, so the probability of Alex must be 0.4 (together the probability is 1)

Now, if you get Sam, there is 0.5 probability of being Goalie (and 0.5 of not being Goalie):

If you get Alex, there is 0.3 probability of being Goalie (and 0.7 not):

The tree diagram is complete, now let's calculate the overall probabilities. This is done by multiplying each probability along the "branches" of the tree.

Here is how to do it for the "Sam, Yes" branch:


(When we take the 0.6 chance of Sam being coach and include the 0.5 chance that Sam will let you be Goalkeeper, we end up with a 0.6 × 0.5 = 0.3 chance.)

But we are not done yet! We haven't included Alex as Coach:

A 0.4 chance of Alex as Coach, followed by a 0.3 chance of being Goalkeeper, gives 0.4 × 0.3 = 0.12.

Now we add the column:

0.3 + 0.12 = 0.42 probability of being a Goalkeeper today

(That is a 42% chance)

Check

One final step: complete the calculations and make sure they add to 1:

0.3 + 0.3 + 0.12 + 0.28 = 1
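The "multiply along branches, add down columns" procedure can be written out in a few lines. This is a sketch using the probabilities from the soccer example; the dictionary layout and variable names are illustrative assumptions.

```python
import math

# Each branch: (P(coach), P(goalkeeper | coach)), taken from the example.
tree = {
    "Sam": (0.6, 0.5),
    "Alex": (0.4, 0.3),
}

# Multiply along each branch to get the four leaf probabilities.
leaves = {}
for coach, (p_coach, p_goalie) in tree.items():
    leaves[(coach, "yes")] = p_coach * p_goalie
    leaves[(coach, "no")] = p_coach * (1 - p_goalie)

# Add down the "yes" column for the overall goalkeeper probability.
p_goalkeeper = sum(p for (coach, answer), p in leaves.items() if answer == "yes")

print(round(p_goalkeeper, 2))                   # 0.42
print(math.isclose(sum(leaves.values()), 1.0))  # True
```

The final line is the same check as in the text: the leaf probabilities of a complete tree must add to 1.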


Combinations and Permutations

What's the Difference?

In English we use the word "combination" loosely, without thinking about whether the order of things is important. In other words:

"My fruit salad is a combination of apples, grapes and bananas" We don't care what order the fruits are in; they could also be "bananas, grapes and apples" or "grapes, apples and bananas", it's the same fruit salad.

   

"The combination to the safe was 472". Now we do care about the order. "724" would not work, nor would "247". It has to be exactly 4-7-2.

So, in Mathematics we use more precise language:

If the order doesn't matter, it is a Combination.

If the order does matter it is a Permutation.

  So, we should really call this a "Permutation Lock"!

In other words:

A Permutation is an ordered Combination.

To help you to remember, think "Permutation ... Position"

Permutations

There are basically two types of permutation:

Repetition is Allowed: such as the lock above. It could be "333".

No Repetition: for example the first three people in a running race. You can't be first and second.

 


1. Permutations with Repetition

These are the easiest to calculate.

When you have n things to choose from ... you have n choices each time!

When choosing r of them, the permutations are:

n × n × ...  (r times)

(In other words, there are n possibilities for the first choice, THEN there are n possibilities for the second choice, and so on, multiplying each time.)

Which is easier to write down using an exponent of r:

n × n × ... (r times) = n^r

Example: in the lock above, there are 10 numbers to choose from (0, 1, ..., 9) and you choose 3 of them:

10 × 10 × ... (3 times) = 10^3 = 1,000 permutations

So, the formula is simply:

n^r

where n is the number of things to choose from, and you choose r of them (Repetition allowed, order matters)
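The lock example can be verified by listing every possible code. The sketch below uses `itertools.product` from the Python standard library; the variable names are illustrative.

```python
from itertools import product

digits = range(10)   # the 10 numbers 0 through 9
r = 3                # positions on the lock

# Every ordered choice of r digits, with repetition allowed.
codes = list(product(digits, repeat=r))

print(len(codes))             # 1000
print(len(codes) == 10 ** r)  # True
print((3, 3, 3) in codes)     # True: repetition is allowed
```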

 

2. Permutations without Repetition

In this case, you have to reduce the number of available choices each time.

For example, what order could 16 pool balls be in?

After choosing, say, number "14" you can't choose it again.

So, your first choice would have 16 possibilities, and your next choice would then have 15 possibilities, then 14, 13, etc. And the total permutations would be:

16 × 15 × 14 × 13 × ... = 20,922,789,888,000

But maybe you don't want to choose them all, just 3 of them, so that would be only:

16 × 15 × 14 = 3,360

In other words, there are 3,360 different ways that 3 pool balls could be selected out of 16 balls.


But how do we write that mathematically? Answer: we use the "factorial function"

The factorial function (symbol: !) just means to multiply a series of descending natural numbers. Examples:

4! = 4 × 3 × 2 × 1 = 24

7! = 7 × 6 × 5 × 4 × 3 × 2 × 1 = 5,040

1! = 1

Note: it is generally agreed that 0! = 1. It may seem funny that multiplying no numbers together gets you 1, but it helps simplify a lot of equations.

So, if you wanted to select all of the billiard balls the permutations would be:

16! = 20,922,789,888,000

But if you wanted to select just 3, then you have to stop the multiplying after 14. How do you do that? There is a neat trick ... you divide by 13! ...

$$\frac{16 \times 15 \times 14 \times 13 \times 12 \times \cdots}{13 \times 12 \times \cdots} = 16 \times 15 \times 14 = 3{,}360$$

Do you see? 16! / 13! = 16 × 15 × 14

The formula is written:

$$\frac{n!}{(n - r)!}$$

where n is the number of things to choose from, and you choose r of them (No repetition, order matters)

Examples:

Our "order of 3 out of 16 pool balls example" would be:

$$\frac{16!}{(16-3)!} = \frac{16!}{13!} = \frac{20{,}922{,}789{,}888{,}000}{6{,}227{,}020{,}800} = 3{,}360$$

(which is just the same as: 16 × 15 × 14 = 3,360)


How many ways can first and second place be awarded to 10 people?

$$\frac{10!}{(10-2)!} = \frac{10!}{8!} = \frac{3{,}628{,}800}{40{,}320} = 90$$

(which is just the same as: 10 × 9 = 90)

Notation

Instead of writing the whole formula, people use different notations such as these:

$$P(n, r) = {}^{n}P_{r} = {}_{n}P_{r} = \frac{n!}{(n - r)!}$$

Example: P(10,2) = 90
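The permutation counts in this section can be reproduced with factorials. This is a minimal sketch: `count_permutations` is an illustrative helper, while `math.factorial`, `math.perm` (Python 3.8 or later), and `itertools.permutations` are standard-library functions used here as cross-checks.

```python
import math
from itertools import permutations

def count_permutations(n, r):
    """Ordered selections of r items from n without repetition: n! / (n - r)!."""
    return math.factorial(n) // math.factorial(n - r)

print(count_permutations(16, 3))   # 3360
print(count_permutations(10, 2))   # 90
print(count_permutations(16, 16))  # 20922789888000

# Cross-check against the standard library and a brute-force listing.
print(math.perm(16, 3) == count_permutations(16, 3))              # True
print(len(list(permutations(range(16), 3))) == math.perm(16, 3))  # True
```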

Combinations

There are also two types of combinations (remember the order does not matter now):

Repetition is Allowed: such as coins in your pocket (5,5,5,10,10)

No Repetition: such as lottery numbers (2,14,15,27,30,33)

 

1. Combinations with Repetition

Actually, these are the hardest to explain, so I will come back to this later.

2. Combinations without Repetition

This is how lotteries work. The numbers are drawn one at a time, and if you have the lucky numbers (no matter what order) you win!

The easiest way to explain it is to:

assume that the order does matter (i.e. permutations),

then alter it so the order does not matter.

Going back to our pool ball example, let us say that you just want to know which 3 pool balls were chosen, not the order.

We already know that 3 out of 16 gave us 3,360 permutations.

But many of those will be the same to us now, because we don't care what order!

For example, let us say balls 1, 2 and 3 were chosen. These are the possibilities:

Order does matter: 1 2 3, 1 3 2, 2 1 3, 2 3 1, 3 1 2, 3 2 1

Order doesn't matter: 1 2 3

So, the permutations will have 6 times as many possibilities.

In fact there is an easy way to work out how many ways "1 2 3" could be placed in order, and we have already talked about it. The answer is:

3! = 3 × 2 × 1 = 6

(Another example: 4 things can be placed in 4! = 4 × 3 × 2 × 1 = 24 different ways, try it for yourself!)

So, all we need to do is adjust our permutations formula to reduce it by how many ways the objects could be in order (because we aren't interested in the order any more):

$$\frac{n!}{(n - r)!} \times \frac{1}{r!} = \frac{n!}{r!(n - r)!}$$

That formula is so important it is often just written in big parentheses like this:

$$\frac{n!}{r!(n - r)!} = \binom{n}{r}$$

where n is the number of things to choose from, and you choose r of them (No repetition, order doesn't matter)

It is often called "n choose r" (such as "16 choose 3")

And is also known as the "Binomial Coefficient"

Notation

As well as the "big parentheses", people also use these notations:

$$C(n, r) = {}^{n}C_{r} = {}_{n}C_{r} = \binom{n}{r} = \frac{n!}{r!(n - r)!}$$

Example

So, our pool ball example (now without order) is:

$$\frac{16!}{3!(16-3)!} = \frac{16!}{3! \times 13!} = \frac{20{,}922{,}789{,}888{,}000}{6 \times 6{,}227{,}020{,}800} = 560$$

Or you could do it this way:

$$\frac{16 \times 15 \times 14}{3 \times 2 \times 1} = \frac{3360}{6} = 560$$

So remember, do the permutation, then reduce by a further "r!"

... or better still ...

Remember the Formula!

It is interesting to also note how this formula is nice and symmetrical:

$$\binom{n}{r} = \binom{n}{n - r} = \frac{n!}{r!(n - r)!}$$

In other words choosing 3 balls out of 16, or choosing 13 balls out of 16 have the same number of combinations.

$$\frac{16!}{3!(16-3)!} = \frac{16!}{13!(16-13)!} = \frac{16!}{3! \times 13!} = 560$$
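The combination count and its symmetry can be checked the same way as the permutations. In this sketch `count_combinations` is an illustrative helper; `math.comb` (Python 3.8 or later) and `itertools.combinations` come from the standard library.

```python
import math
from itertools import combinations

def count_combinations(n, r):
    """Unordered selections of r items from n without repetition."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

print(count_combinations(16, 3))                       # 560
print(count_combinations(16, 3) == math.comb(16, 13))  # True (symmetry)
print(len(list(combinations(range(1, 17), 3))))        # 560
```

Dividing the 3,360 ordered selections by 3! = 6 gives the same 560, which is the "do the permutation, then reduce by r!" rule in code form.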

POISSON DISTRIBUTION

Examples

1. You have observed that hits to your website occur at a rate of 2 a day.

Let X be the number of hits in a day.

2. You record the number of telephone calls that arrive each day on your mobile phone over a period of a year, and note that the average is 3.

Let X be the number of calls that arrive in any one day.


3. Records show that the average rate of job submissions in a busy computer centre is 4 per minute.

Let X be the number of jobs arriving in any one minute.

4. Records indicate that messages arrive to a computer server at the rate of 6 per hour.

Let X be the number of messages arriving in any one hour.

Generally, X = number of events, distributed independently in time, occurring in a fixed time interval.

X is a Poisson variable with pdf:

$$P(X = x) = \frac{e^{-\lambda}\,\lambda^{x}}{x!}, \qquad x = 0, 1, 2, \ldots$$

where λ is the average number of events per interval.

Example:

Consider a computer system with Poisson job-arrival stream at an average of 2 per minute. Determine the probability that in any one-minute interval there will be

(i) 0 jobs;

(ii) exactly 2 jobs;

(iii) at most 3 arrivals.

(iv) What is the maximum number of jobs that should arrive in one minute with 90% certainty?

Solution: Job Arrivals with λ = 2

(i) No job arrivals:

$$P(X = 0) = e^{-2} = 0.1353$$

(ii) Exactly 2 job arrivals:

$$P(X = 2) = \frac{e^{-2}\,2^{2}}{2!} = 0.2707$$

(iii) At most 3 arrivals

P(X ≤ 3) = P(0) + P(1) + P(2) + P(3)

= 0.1353 + 0.2707 + 0.2707 + 0.1804

= 0.8571

more than 3 arrivals:

P(X > 3) = 1 − P(X ≤ 3)

= 1 − 0.8571

= 0.1429

(iv) Maximum arrivals with at least 90% certainty:

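i.e. the 90% quantile.

Choose the smallest k so that

P(X ≤ k) ≥ 0.90

Since P(X ≤ 3) = 0.8571 < 0.90 and P(X ≤ 4) = 0.8571 + 0.0902 = 0.9473 ≥ 0.90, we take k = 4. There is at least a 90% chance that the number of job submissions in any minute does not exceed 4. Equivalently, there is less than a 10% chance that there will be more than 4 job submissions in any one minute.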
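The job-arrival example can be reproduced numerically. The following is a minimal sketch using only the standard library; the function names `poisson_pmf`, `poisson_cdf`, and `poisson_quantile` are illustrative choices, not part of any library.

```python
import math

def poisson_pmf(x: int, lam: float) -> float:
    """P(X = x) for a Poisson variable with average lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k), summing the individual probabilities from 0 to k."""
    return sum(poisson_pmf(x, lam) for x in range(k + 1))

def poisson_quantile(prob: float, lam: float) -> int:
    """Smallest k with P(X <= k) >= prob."""
    k = 0
    while poisson_cdf(k, lam) < prob:
        k += 1
    return k

lam = 2.0                                  # average of 2 jobs per minute
p_zero = poisson_pmf(0, lam)               # about 0.1353
p_two = poisson_pmf(2, lam)                # about 0.2707
p_at_most_3 = poisson_cdf(3, lam)          # about 0.8571
p_more_than_3 = 1 - p_at_most_3            # about 0.1429
k90 = poisson_quantile(0.90, lam)          # 4

print(round(p_zero, 4), round(p_two, 4), round(p_at_most_3, 4), k90)
```

The quantile search simply adds probabilities until the running total first reaches 90%, which mirrors the hand calculation.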

BINOMIAL DISTRIBUTION

When you flip a coin, there are two possible outcomes: heads and tails. Each outcome has a fixed probability, the same from trial to trial. In the case of coins, heads and tails each have the same probability of 1/2. More generally, there are situations in which the coin is biased, so that heads and tails have different probabilities. In the present section, we consider probability distributions for which there are just two possible outcomes with fixed probabilities summing to one. These distributions are called binomial distributions.

A Simple Example

The four possible outcomes that could occur if you flipped a coin twice are listed below in Table 1. Note that the four outcomes are equally likely: each has probability 1/4. To see this, note that the tosses of the coin are independent (neither affects the other). Hence, the probability of a head on Flip 1 and a head on Flip 2 is the product of P(H) and P(H), which is 1/2 × 1/2 = 1/4. The same calculation applies to the probability of a head on Flip 1 and a tail on Flip 2. Each is 1/2 × 1/2 = 1/4.

Table 1. Four Possible Outcomes.

Outcome    First Flip    Second Flip
1          Heads         Heads
2          Heads         Tails
3          Tails         Heads
4          Tails         Tails

The four possible outcomes can be classified in terms of the number of heads that come up. The number could be two (Outcome 1), one (Outcomes 2 and 3) or 0 (Outcome 4). The probabilities of these possibilities are shown in Table 2 and in Figure 1. Since two of the outcomes represent the case in which just one head appears in the two tosses, the probability of this event is equal to 1/4 + 1/4 = 1/2. Table 2 summarizes the situation.

Table 2. Probabilities of Getting 0, 1, or 2 Heads.

Number of Heads    Probability
0                  1/4
1                  1/2
2                  1/4

Figure 1. Probabilities of 0, 1, and 2 heads.

Figure 1 is a discrete probability distribution: It shows the probability for each of the values on the X-axis. Defining a head as a "success," Figure 1 shows the probability of 0, 1, and 2 successes for two trials (flips) for an event that has a probability of 0.5 of being a success on each trial. This makes Figure 1 an example of a binomial distribution.

The Formula for Binomial Probabilities

The binomial distribution consists of the probabilities of each of the possible numbers of successes on N trials for independent events that each have a probability of π (the Greek letter pi) of occurring. For the coin flip example, N = 2 and π = 0.5. The formula for the binomial distribution is shown below:

$$P(x) = \frac{N!}{x!\,(N-x)!}\,\pi^{x}(1-\pi)^{N-x}$$

where P(x) is the probability of x successes out of N trials, N is the number of trials, and π is the probability of success on a given trial. Applying this to the coin flip example,

$$P(0) = \frac{2!}{0!\,(2-0)!}\,(0.5)^{0}(1-0.5)^{2-0} = 0.25$$

$$P(1) = \frac{2!}{1!\,(2-1)!}\,(0.5)^{1}(1-0.5)^{2-1} = 0.50$$

$$P(2) = \frac{2!}{2!\,(2-2)!}\,(0.5)^{2}(1-0.5)^{2-2} = 0.25$$

If you flip a coin twice, what is the probability of getting one or more heads? Since the probability of getting exactly one head is 0.50 and the probability of getting exactly two heads is 0.25, the probability of getting one or more heads is 0.50 + 0.25 = 0.75.

Now suppose that the coin is biased. The probability of heads is only 0.4. What is the probability of getting heads at least once in two tosses? Substituting into the general formula above, you should obtain P(1) + P(2) = 0.48 + 0.16 = 0.64.
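The binomial formula translates directly into code. Here is a small sketch, assuming Python 3.8 or later for `math.comb`; the function name `binomial_pmf` is a hypothetical helper chosen for this example.

```python
import math

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    # Number of arrangements, times the success part, times the failure part.
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

# Fair coin flipped twice: probabilities of 0, 1, and 2 heads.
fair = [binomial_pmf(x, 2, 0.5) for x in range(3)]
print(fair)                                   # [0.25, 0.5, 0.25]

# Biased coin (probability of heads 0.4): at least one head in two tosses.
biased_at_least_one = binomial_pmf(1, 2, 0.4) + binomial_pmf(2, 2, 0.4)
print(round(biased_at_least_one, 2))          # 0.64
```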

Cumulative Probabilities

We toss a coin 12 times. What is the probability that we get from 0 to 3 heads? The answer is found by computing the probability of exactly 0 heads, exactly 1 head, exactly 2 heads, and exactly 3 heads. The probability of getting from 0 to 3 heads is then the sum of these probabilities. The probabilities are: 0.0002, 0.0029, 0.0161, and 0.0537. The sum of the probabilities is 0.073. The calculation of cumulative binomial probabilities can be quite tedious, so a binomial calculator makes it much easier to compute these probabilities.
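As a rough stand-in for a binomial calculator, the cumulative probability can be summed in a short loop. This sketch assumes Python 3.8 or later; `binomial_cdf` is an illustrative name, not a library function.

```python
import math

def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k): sum of the probabilities of 0, 1, ..., k successes."""
    return sum(math.comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1))

# Individual probabilities of 0 to 3 heads in 12 tosses of a fair coin.
terms = [math.comb(12, x) * 0.5**12 for x in range(4)]
print([round(t, 4) for t in terms])        # [0.0002, 0.0029, 0.0161, 0.0537]

zero_to_three = binomial_cdf(3, 12, 0.5)
print(round(zero_to_three, 3))             # 0.073
```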

Mean and Standard Deviation of Binomial Distributions

Consider a coin-tossing experiment in which you tossed a coin 12 times and recorded the number of heads. If you performed this experiment over and over again, what would the mean number of heads be? On average, you would expect half the coin tosses to come up heads. Therefore the mean number of heads would be 6. In general, the mean of a binomial distribution with parameters N (the number of trials) and π (the probability of success on each trial) is:

μ = Nπ

where μ is the mean of the binomial distribution. The variance of the binomial distribution is:

σ² = Nπ(1 − π)

where σ² is the variance of the binomial distribution.

Let's return to the coin-tossing experiment. The coin was tossed 12 times, so N = 12. A coin has a probability of 0.5 of coming up heads. Therefore, π = 0.5. The mean and variance can therefore be computed as follows:

μ = Nπ = (12)(0.5) = 6

σ² = Nπ(1 − π) = (12)(0.5)(1.0 − 0.5) = 3.0

Naturally, the standard deviation (σ) is the square root of the variance (σ²), here √3.0 ≈ 1.73.
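The mean, variance, and standard deviation for the 12-toss experiment can be computed as follows. This is a minimal sketch; `binomial_mean_variance` is a hypothetical helper name.

```python
import math

def binomial_mean_variance(n: int, p: float) -> tuple[float, float]:
    """Return (mean, variance) of a binomial distribution with n trials."""
    mean = n * p
    variance = n * p * (1 - p)
    return mean, variance

mean, variance = binomial_mean_variance(12, 0.5)
std_dev = math.sqrt(variance)

print(mean, variance)          # 6.0 3.0
print(round(std_dev, 2))       # 1.73
```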

“60% of people who purchase sports cars are men.  If 10 sports car owners are randomly selected, find the probability that exactly 7 are men.”

Step 1: Identify ‘n’ and ‘X’ from the problem. Using our sample question, n (the number of randomly selected items, in this case sports car owners) is 10, and X (the number you are asked to “find the probability” for) is 7.

Step 2: Figure out the first part of the formula, which is:

n! / ((n − X)! × X!)

Substituting the variables:

10! / ((10 – 7)! × 7!)

Which equals 120. Set this number aside for a moment.

Step 3: Find “p”, the probability of success, and “q”, the probability of failure. We are given p = 60%, or 0.6. Therefore, the probability of failure is 1 − 0.6 = 0.4 (40%).

Step 4: Work the next part of the formula.

$$p^{X} = 0.6^{7} = 0.0279936$$

Set this number aside while you work the third part of the formula.

Step 5: Work the third part of the formula.

$$q^{\,n-X} = 0.4^{10-7} = 0.4^{3} = 0.064$$

Step 6: Multiply the three answers from steps 2, 4 and 5 together.

120 × 0.0279936 × 0.064 = 0.215.
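The six steps above can be sketched in Python, one variable per step. The variable names are illustrative only, and `math.comb` requires Python 3.8 or later.

```python
import math

# Step 1: n owners are selected and we want exactly X men.
n, X = 10, 7

# Step 2: number of ways to pick which 7 of the 10 are men.
ways = math.comb(n, X)               # 120

# Step 3: probability of success (p) and failure (q).
p = 0.6
q = 1 - p

# Steps 4 and 5: the success part and the failure part.
success_part = p**X                  # about 0.0279936
failure_part = q ** (n - X)          # about 0.064

# Step 6: multiply the three parts together.
probability = ways * success_part * failure_part
print(round(probability, 3))         # 0.215
```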

Q1

Select all that apply. Which of the following probabilities can be found using the binomial distribution?

 The probability that 3 out of 8 tosses of a coin will result in heads 

 The probability that Susan will beat Shannon in two of their three tennis matches

 The probability of rolling at least two 3's and two 4's out of twelve rolls of a die 

 The probability of getting a full house poker hand 

 The probability that all 5 of your randomly-chosen group members will have passed the midterm 

 The probability that a student blindly guessing will get at least 8 out of 10 multiple-choice questions correct

Solutions.

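A binomial distribution applies when each trial has only two possible outcomes, the trials are independent, and the probability of success is the same on every trial. You can think of the outcomes as successes and failures. The correct answers are the first, second, fifth, and sixth options; their successes are: a flip of heads, a win for Susan, a group member who has passed the midterm, and a correct answer on a multiple-choice question.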

Q2

You flip a fair coin 10 times. What is the probability of getting 8 or more heads?

Solution

You may use the Binomial Calculator (n = 10, p = 0.5, ≥ 8). Otherwise add up the probabilities of getting 8, 9, and 10 heads: 0.044 + 0.010 + 0.001 = 0.055.

Q3

The probability that you will win a certain game is 0.3. If you play the game 20 times, what is the probability that you will win at least 8 times? 

Solution

Use the Binomial Calculator (n = 20, p = 0.3, ≥ 8). P(X ≥ 8) = 0.23.

Q4

The manufacturer of a bag of sweets claims that there is a 90% chance that the bag contains some toffees. If 20 bags are chosen, what is the probability that

(i) all the bags contain toffees

(ii) more than 18 bags contain toffees
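Where no binomial calculator is at hand, Q2 to Q4 can be checked with a short script. This is a sketch under stated assumptions: `binomial_at_least` is a made-up helper, the bags in Q4 are treated as independent, and because no answer is given above for Q4, those two values are computed here rather than taken from the text.

```python
import math

def binomial_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for n independent trials with success probability p."""
    return sum(math.comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))

q2 = binomial_at_least(8, 10, 0.5)               # 8 or more heads in 10 flips
q3 = binomial_at_least(8, 20, 0.3)               # at least 8 wins in 20 games
q4_all = binomial_at_least(20, 20, 0.9)          # all 20 bags contain toffees
q4_more_than_18 = binomial_at_least(19, 20, 0.9) # 19 or 20 bags contain toffees

print(round(q2, 3))                # 0.055
print(round(q3, 2))                # 0.23
print(round(q4_all, 4))            # 0.1216
print(round(q4_more_than_18, 3))   # 0.392
```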

NORMAL DISTRIBUTION

Q1

The amount of mustard dispensed from a machine at The Hotdog Emporium is normally distributed with a mean of 0.9 ounce and a standard deviation of 0.1 ounce. If the machine is used 500 times, approximately how many times will it be expected to dispense 1 or more ounces of mustard?

Choose:

5    16    80    100

Q2

Professor Halen has 184 students in his college mathematics lecture class. The scores on the midterm exam are normally distributed with a mean of 72.3 and a standard deviation of 8.9. How many students in the class can be expected to receive a score between 82 and 90? Express answer to the nearest student.

Answer: 21 students
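Q1 and Q2 can be checked without a chart by using the normal distribution class in Python's standard library. This is a hedged sketch: `statistics.NormalDist` is available from Python 3.8, and the variable names are illustrative.

```python
from statistics import NormalDist

# Q1: mustard dispenser, mean 0.9 ounce, standard deviation 0.1 ounce.
mustard = NormalDist(mu=0.9, sigma=0.1)
p_one_or_more = 1 - mustard.cdf(1.0)          # about 0.159
expected_uses = 500 * p_one_or_more           # about 79, so 80 is the closest choice

# Q2: midterm scores, mean 72.3, standard deviation 8.9.
scores = NormalDist(mu=72.3, sigma=8.9)
p_82_to_90 = scores.cdf(90) - scores.cdf(82)
expected_students = round(184 * p_82_to_90)   # 21

print(round(expected_uses, 1), expected_students)
```

The exact area for Q1 gives a little under 80 uses; rounding the tail area to 16% first gives exactly 80.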

Q3

A machine is used to fill soda bottles. The amount of soda dispensed into each bottle varies slightly. Suppose the amount of soda dispensed into the bottles is normally distributed. If at least 99% of the bottles must have between 585 and 595 milliliters of soda, find the greatest standard deviation, to the nearest hundredth, that can be allowed.

Answer: 1.67 (the limits lie 5 milliliters either side of the mean of 590; taking that distance as 3 standard deviations gives 5/3 ≈ 1.67)

Q4

Battery lifetime is normally distributed for large samples. The mean lifetime is 500 days and the standard deviation is 61 days. To the nearest percent, what percent of batteries have lifetimes longer than 561 days?

Answer: 15.9%, which is 16% to the nearest percent

Q5

A shoe manufacturer collected data regarding men's shoe sizes and found that the distribution of sizes exactly fits the normal curve. If the mean shoe size is 11 and the standard deviation is 1.5, find:

a.  the probability that a man's shoe size is greater than or equal to 11.

b.  the probability that a man's shoe size is greater than or equal to 12.5.

c.  

Answer: Mean = 11 and standard deviation = 1.5

a.   50%  In a normal distribution, the mean divides the data into two equal areas.  Since 11 is the mean, 50% of the data is above 11 and 50% is below 11.

b.  12.5 is exactly one standard deviation above the mean. Examining the normal distribution chart shows that 15.9% will fall above one standard deviation. Probability is 0.159.

c.  12.5 and 8 are exactly one standard deviation above the mean and 2 standard deviations below the mean, respectively. Using the chart we know:

Q6

Five hundred values are normally distributed with a mean of 125 and a standard deviation of 10.

a.  What percent of the values lies in the interval 115 - 135, to the nearest percent?

b.  What percent of the values is in the interval 100 - 150, to the nearest percent?

c.  What interval about the mean includes 95% of the data?

d.  What interval about the mean includes 50% of the data?

Answer: 500 scores, mean 125, standard deviation 10

a.  What percent of the values is in the interval 115 - 135?

mean + one standard deviation = 135

mean - one standard deviation = 115

15% + 19.1% + 19.1% + 15% = 68.2% (from chart)

Percent within one standard deviation of the mean = 68.2% = 68%

b.  What percent of the values is in the interval 100 - 150?

mean + 2.5 standard deviations = 150

mean - 2.5 standard deviations = 100

2(1.7% + 4.4% + 9.2% + 15% + 19.1%) = 98.8% (from chart)

Percent within 2.5 standard deviations of the mean = 98.8% = 99%

c.  What interval about the mean includes approximately 95% of the data?

2 standard deviations about the mean, for a total interval size of 40 with the mean in the center.

mean + 2 standard deviations = 145

mean - 2 standard deviations = 105

Interval: [105, 145]

Q7

A group of 625 students has a mean age of 15.8 years with a standard deviation of 0.6 years. The ages are normally distributed. How many students are younger than 16.2 years? Express answer to the nearest student.

Answer: 625 students, mean age of 15.8 years, standard deviation of 0.6 years.

mean + 1 standard deviation = 16.4

mean + 0.5 standard deviations = 16.1

These values of 16.4 and 16.1, which could be determined by using the chart, are not exactly the 16.2 that we need to solve this problem. The most accurate answer will be found using the calculator.

74.7508% of the students will be less than 16.2 years old.

Answer:  467 students
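In place of the chart or a calculator, the answers to Q4, Q6, and Q7 can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is only a sketch; the variable names are illustrative, and exact areas may differ slightly from the chart values, which are rounded.

```python
from statistics import NormalDist

# Q4: battery lifetimes, mean 500 days, standard deviation 61 days.
batteries = NormalDist(mu=500, sigma=61)
q4 = 1 - batteries.cdf(561)                    # about 0.159

# Q6: 500 values, mean 125, standard deviation 10.
values = NormalDist(mu=125, sigma=10)
q6a = values.cdf(135) - values.cdf(115)        # about 0.68
q6b = values.cdf(150) - values.cdf(100)        # about 0.988

# Q7: student ages, mean 15.8 years, standard deviation 0.6 years.
ages = NormalDist(mu=15.8, sigma=0.6)
q7_fraction = ages.cdf(16.2)                   # about 0.7475
q7_students = round(625 * q7_fraction)         # 467

print(round(q4, 3), round(q6a, 2), round(q6b, 3), q7_students)
```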