Chapter 1. Introduction to Statistics

Post on 05-Feb-2021

  • Chapter 1. Introduction to Statistics

  • Outline

    Introduction

    Why Statistics?
        Separating sampling variation and a true difference
        Learning about a population from a sample

    Where is Statistics used?
        The Sally Clark Case

    Types of Data

    Collecting Data

  • Uncertainty

    Sterling’s slide has continued, with the pound falling close to $1.37... The pound also weakened against the euro, with the single currency now worth 94 pence.

    If I am planning a trip abroad this summer, is it better to change currency now or later?

    Is there evidence of global warming, or is it simply random fluctuation?

    How would the answer affect your way of living?

  • Decision making

    We follow many different routes, rational or irrational, to find an answer and to cope with such situations.

    Often it is useful to obtain some evidence in order to decide what the answer should be.

    What sort of evidence would be useful in answering such questions?

  • For the UK economy,

    ◮ we may look at exchange rates over the past few months to figure out a trend, if any.

    ◮ we may want to include other factors that may explain the trend, or study similar periods in the past.

    ◮ to determine such factors or variables, we may want to speak to economists.

    For global warming,

    ◮ we may want to study the pattern in temperature over the past years in England, Europe or around the world.

    ◮ there may be other variables of interest, for example an increasing number of floods or storms.

    ◮ discussion with a climatologist or hydrologist would be helpful in deciding which variables should be considered.

  • What is Data?

    In all of these situations, we need to collect some form of data to investigate further.

    ◮ Data refers to information that is collected from experiments, surveys or observational studies.

    ◮ For example, 4, 3.5, 3.2 on its own is not data, only a sequence of numbers.

    ◮ However, if we know that these numbers are measurements of new-born babies’ weights, then they become data.

    ◮ But does that mean that if we observe three more new-born babies, their weights will be among those numbers?

  • Probability and Statistics

    In Probability,

    ◮ we consider an experiment before it is performed.

    ◮ Numbers to be observed or calculated from observations are at that stage random variables.

    ◮ We deduce the probability of various outcomes of the experiment in terms of certain basic parameters.

    In Statistics,

    ◮ we have to infer things about the values of the parameters from the observed outcomes of an experiment already performed.

    ◮ We can decide whether or not operations on statistics are sensible only by considering probabilities associated with the observable random variables.

  • Is Friday 13th bad for your health?

    Consider for a moment the following claim:

    I’ve heard that Friday 13th is unlucky. Am I more likely to be involved in a car accident if I go out on Friday 13th than on any other day?

    What kind of evidence would be helpful?

  • Suppose that data are available on emergency admissions to hospitals in the Southwest Thames region due to transport accidents on six Friday 13ths, together with the corresponding admissions on the Friday 6th immediately before each Friday 13th.

    Number               1    2    3    4    5    6
    Accidents on 6th     9    6   11   11    3    5
    Accidents on 13th   13   12   14   10    4   12

    ◮ Does the data support the claim?

    ◮ How might we use this data to obtain some evidence that will help us answer the question?

  • We may compare the two days by working out the average (or mean) number of accidents per day for each:

    \[ \text{Average number of accidents} = \frac{\text{Total number of accidents}}{\text{Total number of days}} \]

    so that

    \[ \text{Average number of accidents on 6th} = \frac{9 + 6 + 11 + 11 + 3 + 5}{6} = 7.5 \]

    and

    \[ \text{Average number of accidents on 13th} = \frac{13 + 12 + 14 + 10 + 4 + 12}{6} \approx 10.8 \]

    There are more accidents on Friday 13th than on Friday 6th. Therefore I am more likely to be involved in a car accident if I go out on Friday 13th.

    Do you agree with what is being said?
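
    The averages can be checked with a few lines of Python. This is only a sketch of the arithmetic: the variable names are ours, and the data are the six pairs of counts from the table. The totals are 45 and 65, so the means are 45/6 = 7.5 and 65/6, which is about 10.83.

```python
from statistics import mean

# Emergency admissions due to transport accidents (from the table).
accidents_6th = [9, 6, 11, 11, 3, 5]
accidents_13th = [13, 12, 14, 10, 4, 12]

mean_6th = mean(accidents_6th)    # 45 / 6
mean_13th = mean(accidents_13th)  # 65 / 6

# Difference within each pair of dates (13th minus the preceding 6th).
differences = [b - a for a, b in zip(accidents_6th, accidents_13th)]

print(f"Mean on Friday 6th:  {mean_6th:.2f}")
print(f"Mean on Friday 13th: {mean_13th:.2f}")
print(f"Pairwise differences: {differences}")
```

    Looking at the pairwise differences as well as the two means shows that the 13th had more admissions in five of the six pairs, but not in all of them.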

  • Exercise 1.1.1

    Referring to the Friday 13th example,

    ◮ why have we chosen to make a comparison instead of focusing only on accidents on Friday 13ths?

    ◮ why have we chosen Friday 6th as the comparison day? Why not Thursday 12th, or any other day for that matter?

  • What is this course about

    ◮ illustrate scientific and mathematical contexts where statistical issues arise

    ◮ demonstrate where statistics can be useful, by showing the sort of questions it can answer and the situations in which it is used

    ◮ understand sampling variation and quantify uncertainty

    Specifically, we

    ◮ extend probability models to continuous random variables

    ◮ introduce various exploratory tools and summary statistics for data analysis

    ◮ introduce specific techniques in statistical modelling and inference

    ◮ apply these to real data examples

  • Why statistics

    ◮ Separating sampling variation and a true difference

    ◮ Learning about a population from a sample

  • Consider a simple example of tossing a coin. If you toss a coin 10 times, how many heads would you expect to see? Fill in the box with your outcomes.

    H, H, H, T, T, H, H, H, H, T

    ◮ Are you surprised that you didn’t get exactly 5 heads, half the number of tosses? Why or why not? Has the result changed your opinion about the coin?

    ◮ Are you surprised that your neighbors didn’t get exactly the same number of heads as you did? Why or why not?

    ◮ Do the same experiment another two times, with two further coins, and record the number of heads. Did you get the same number of heads each time?

    ◮ What would happen if you tossed the coin 20 times?
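
    To see how unsurprising it is to miss exactly 5 heads, we can compute the chance of each possible number of heads. The sketch below assumes a fair coin and independent tosses (the binomial model); the function name `prob_heads` is ours. Under that assumption, exactly 5 heads has probability 252/1024, about 0.246, and the 7 heads recorded above has probability 120/1024, about 0.117.

```python
from math import comb


def prob_heads(n_tosses: int, n_heads: int, p: float = 0.5) -> float:
    """Probability of exactly n_heads heads in n_tosses independent tosses.

    Assumes each toss lands heads with probability p (binomial model).
    """
    return comb(n_tosses, n_heads) * p**n_heads * (1 - p) ** (n_tosses - n_heads)


# The outcomes recorded on the slide.
observed = "HHHTTHHHHT"
n_heads = observed.count("H")

print(f"Observed heads: {n_heads} out of {len(observed)}")
for k in range(11):
    print(f"P({k:2d} heads in 10 tosses) = {prob_heads(10, k):.4f}")
```

    Changing the first argument to 20 gives the corresponding probabilities for 20 tosses.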

  • Sampling variation or true difference?

    Exercise 1.2.1

    Think back to the Friday 13th example. Do you have more chance of being in a car accident on Friday 13th, or is the difference in the average number of accidents down to sampling variation?

  • Exercise 1.2.2

    Suppose we collected the data for the Friday 13th example on some new dates. How sure are you that we would again see the average number of accidents on Friday 13th greater than the average number of accidents on Friday 6th?
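
    One rough way to explore these exercises, not taken from the slides, is to ask what the data would look like if the date made no difference. The sketch below assumes that, in that case, each pair of counts would be equally likely to differ in either direction, and it enumerates all 2^6 = 64 ways of assigning signs to the six observed differences. Only 3 of the 64 assignments give a total difference at least as large as the observed total of 20. With just six pairs this is a crude indication, not a verdict.

```python
from itertools import product

accidents_6th = [9, 6, 11, 11, 3, 5]
accidents_13th = [13, 12, 14, 10, 4, 12]

# Observed differences (13th minus 6th) and their total.
differences = [b - a for a, b in zip(accidents_6th, accidents_13th)]
observed_total = sum(differences)

# Assumption: if the date made no difference, each difference would be
# equally likely to be positive or negative. Try every sign pattern.
n_assignments = 0
n_as_extreme = 0
for signs in product([1, -1], repeat=len(differences)):
    total = sum(s * d for s, d in zip(signs, differences))
    n_assignments += 1
    if total >= observed_total:
        n_as_extreme += 1

proportion = n_as_extreme / n_assignments
print(f"{n_as_extreme} of {n_assignments} sign patterns "
      f"give a total of at least {observed_total} ({proportion:.3f})")
```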

  • Population and sample

    ◮ Population: a class of all individuals of interest

    ◮ Sample: a subset of the population

    For any sound analysis, we need to

    ◮ define exactly what population is being targeted;

    ◮ choose a sample that represents it well.

    Statistical inference: learning about a population through the behaviour of a sample.

  • Where is Statistics used? I

    Environmental monitoring: for the setting of regulatory standards and in deciding whether these are being met;

    Engineering: to gauge the quality of products used in manufacturing and building;

    Agriculture: to understand field trials of new varieties and choose the crops that will grow best in particular conditions;

    Economics: to describe unemployment and inflation, which are used by the government and by business to decide economic policies and form financial strategies;

    Finance: risk management, and prediction of the future behaviour of the markets;

    Pharmaceutical industry: to judge the clinical effectiveness and safety of new drugs before they can be licensed;

    Insurance: in setting premium sizes, to reflect the underlying risk of the events that are being insured against;

  • Where is Statistics used? II

    Medicine: to assess the reliability of clinical trials reported in journals, and choose the most effective treatment for patients;

    Ecology: to monitor population sizes and to model interactions between different species;

    Business: market research is used to plan sales strategies.

  • Sally Clark Case

    Sally Clark was a mother convicted of murder after two of her babies died of ‘Cot Death’, the name given to the unexplained death of a young infant. The paediatrician Sir Roy Meadow, acting as an expert witness for the prosecution in the case, famously claimed that the odds of two unexplained deaths in the same family were 1 in 73 million.

    Where does this figure come from?

  • First problem

    The odds of a single unexplained death in an affluent, non-smoking family are estimated as 1 in 8500. The figure of 73 million comes from multiplying these odds by themselves: 8500 × 8500 ≈ 72 million, in line with the quoted figure of about 73 million.

    Independence

    The first problem is that it is only appropriate to multiply these odds together if the second death is independent of the first.

    Is this really a reasonable assumption?
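
    The effect of the independence assumption can be seen in a short calculation. The first part below only repeats the multiplication from the slide: 1 in 8500 squared is 1 in 72,250,000. The second part uses a purely hypothetical value of 1 in 100 for the chance of a second death given a first one; that number is invented to show the size of the effect and is not an estimate from any study.

```python
from fractions import Fraction

# Chance of a single unexplained death, as quoted on the slide.
p_single = Fraction(1, 8500)

# If the two deaths were independent, the chances simply multiply.
p_double_if_independent = p_single * p_single
print(f"Independent: 1 in {p_double_if_independent.denominator:,}")

# HYPOTHETICAL: suppose shared family factors made a second death far more
# likely once a first has happened, say 1 in 100 (an invented value).
p_second_given_first = Fraction(1, 100)
p_double_if_dependent = p_single * p_second_given_first
print(f"Dependent (hypothetical): 1 in {p_double_if_dependent.denominator:,}")
```

    Under the invented dependence the chance of two deaths is 1 in 850,000, far more likely than the independent figure, which is why the assumption matters so much.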

  • Second problem

    The second problem is known as the ‘prosecutor’s fallacy’, which goes as follows:

    Chance and its realisation

    The chance of two unexplained deaths occurring in the same family by chance is 1 in 73 million. Therefore, the chance of Sally Clark being innocent is also 1 in 73 million.

    What is wrong with this argument? Can you spot the error in the reasoning?

  • The following analogy will help.

    Suppose you decide to play the British National Lottery. The idea behind the lottery is that 49 balls are placed in a machine, and 6 of them are drawn. Before the draw takes place, you pay 1 pound to place a guess on which six balls will be drawn. There is a prize of millions of pounds available if your guess turns out to be correct, but the chance of getting it right is 1 in 14 million.
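
    The figure of 1 in 14 million can be checked by counting the ways of choosing 6 balls from 49 when the order of drawing does not matter. The sketch below uses Python's `math.comb`; the variable name is ours. The count is 13,983,816, which rounds to 14 million.

```python
from math import comb

# Number of distinct sets of 6 balls that can be drawn from 49.
n_possible_draws = comb(49, 6)

print(f"Possible draws: {n_possible_draws:,}")
print(f"Chance that a single guess is correct: 1 in {n_possible_draws:,}")
```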

  • You decide to play, and, amazingly, all six of your numbers come up! You travel to the headquarters of the national lottery to claim your winnings, and instead you are arrested, accused of cheating! A few months later, you are in court, and the prosecuting lawyer makes the following argument:

    The chance of getting all six balls correct by chance is 1 in 14 million. Therefore, the chance of the defendant being innocent is also 1 in 14 million.

    How would you defend yourself against this argument?

  • Variables in the data

    ◮ We have already introduced data.

    ◮ In experiments or surveys, there may be specific attributes that we are interested in measuring for the subjects. These are called variables.

    ◮ For example, in the Friday 13th data, the variable we measure is Number of accidents.

    ◮ Because these variables are random, they are called random variables.

  • Types of Data

    Most random variables fall into one of the following two categories, depending partly on the nature of the characteristic of interest, and partly on how it is measured:

    Discrete. Variables taking values on countable sets, e.g. gender, eye color, college membership, exam grades (A, B, C, D, E), number of goals in a match, number of children in a family.

    Continuous. Variables taking values on some interval of the real line, e.g. height, weight, direction.

  • Collecting data

    Exercise 1.5.1

    Is there any limit to the amount of evidence that can be obtained from some given data? Think back to the data on Friday 13th: could we use it to decide whether car accidents were especially common on Fridays?

    So if the evidence available is limited by the data we have, it makes sense that we should think very carefully about how we collect the data.

    If you are not collecting the data yourself, it is always important to understand how the data were collected, so that you are aware of any limitations this may place on your analysis.

  • Scenario I

    Suppose you are interested in estimating how many hours students spend studying every week. So you write a survey and set out to find participants.

    Thinking about where a good place would be to find students to fill in your survey, you have a brilliant idea: the library! You sit outside the library and stop students as they leave to fill in your survey. After some time you have enough results, so you go home to do your analysis. You find that students spend, on average, 30 hours a week studying.

  • Collecting data

    Exercise 1.5.2

    ◮ What is wrong with the way in which the study has been carried out?

    ◮ If you had stopped students outside the University bar instead of the library, do you think you would have got similar results?

    ◮ Can you think of a better way to collect data for your survey?

    Although it sounds silly, this example highlights the importance of choosing your sample well.

    The key message is that the sample should be representative of the population.

    That’s why it is important to define the population first; otherwise, how can our sample be representative of it?

  • Representative sample

    Referring to Scenario I, there are many different populations that could be of interest here: students in a particular department, students in a particular faculty, all students at the University, all students in the UK, all students in the world...

    Exercise 1.5.3

    Could you use the same sample for each of these populations? Can you think of a population for which the sampling method of stopping people outside the library would be appropriate?

    A representative sample reflects the characteristics or nature of the population. If the sample is not representative, it introduces a systematic error called bias.

  • How large should a sample be?

    Another consideration is how large our sample should be. We usually use n to denote the number of subjects in our sample.

    There are practical as well as statistical considerations in choosing the size of the sample. On the practical side, financial constraints may mean that you cannot have a sample larger than n = 1000. Some statistical considerations will be discussed in Chapter 4.

  • Random sample

    The widely accepted way to avoid bias, and so obtain a representative sample of the population, is to take a random sample.

    A random sample of size n from a population is one in which each possible sample of that size has the same chance of being selected.

  • ◮ One method to ensure random sampling is to write the name of every member of the population on a slip of paper, place these slips into a hat, then draw out the required number for the sample.

    ◮ A more practical method uses a computer’s random number generator. For a pre-election poll, for example, we may need n = 1000 random numbers between 1 and 40 million, for a sample of size n = 1000 out of the 40 million eligible voters in the UK. If we have all the voters written in a list, we can then pick out the desired subjects for our sample.
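
    The pre-election poll example can be sketched with Python's standard `random` module. This is an illustration, not a polling procedure: it assumes the voters are numbered 1 to 40 million in a list, the seed value is arbitrary and only makes the sketch repeatable, and the variable names are ours. `sample` draws without replacement, so no voter is picked twice and every set of 1000 positions is equally likely.

```python
import random

N_VOTERS = 40_000_000  # eligible voters, numbered 1..N_VOTERS (from the slide)
SAMPLE_SIZE = 1000     # n in the slide's notation

rng = random.Random(2021)  # arbitrary seed, only for repeatability

# Draw n distinct positions in the voter list, each subset equally likely.
selected_positions = rng.sample(range(1, N_VOTERS + 1), SAMPLE_SIZE)

print(f"Selected {len(selected_positions)} voters")
print(f"First five positions in the list: {sorted(selected_positions)[:5]}")
```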

  • Other kinds of sampling

    It is not always feasible to carry out sampling in a truly random fashion. It can be very expensive to involve 1000 randomly chosen people in a pre-election poll, as some of them may be difficult to reach, and it may take a long time for all the surveys to be returned. For practical reasons we may have to resort to a sampling method that is not random. Provided we are careful, we can minimize the bias that this causes.

  • Exercise 1.5.4

    Suppose we go out on the streets in a city centre, and simply stop people in the street and ask who they are going to vote for in the next election. This is sometimes known as convenience sampling. What kinds of bias may be introduced? What steps could be taken to minimise this bias?

  • Does increasing the size of a sample decrease the bias?

    Exercise 1.5.5

    For the student study example,

    ◮ one survey collects 1000 responses by convenience sampling, where the interviewer stands outside the library and stops students on the way out to ask them the question.

    ◮ a second survey collects only 50 responses, by random sampling from the entire student population of the University.

    Which study should we believe more?

    It is almost always better to have a small, representative sample than a large, biased sample.
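
    A small simulation can make this concrete. Everything in the sketch below is invented for illustration. The population is 5000 hypothetical students whose weekly study hours are spread evenly between 0 and 40. We also assume that the chance of meeting a student outside the library is proportional to the hours they study. The sketch repeats both surveys many times and counts how often the random sample of 50 lands closer to the true mean than the library sample of 1000.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# HYPOTHETICAL population: weekly study hours for 5000 students,
# drawn uniformly between 0 and 40. These numbers are invented.
population = [rng.uniform(0, 40) for _ in range(5000)]
true_mean = mean(population)


def random_sample_mean(n: int) -> float:
    """Mean of a simple random sample of n students (no repeats)."""
    return mean(rng.sample(population, n))


def library_sample_mean(n: int) -> float:
    """Mean of n students stopped outside the library.

    Assumption: the chance of meeting a student is proportional to the
    hours they study; the same student may be stopped more than once.
    """
    return mean(rng.choices(population, weights=population, k=n))


n_trials = 200
random_closer = 0
for _ in range(n_trials):
    random_error = abs(random_sample_mean(50) - true_mean)
    library_error = abs(library_sample_mean(1000) - true_mean)
    if random_error < library_error:
        random_closer += 1

fraction_random_closer = random_closer / n_trials
print(f"True mean: {true_mean:.1f} hours per week")
print(f"Random sample of 50 was closer in {fraction_random_closer:.0%} of trials")
```

    Under these assumptions the library survey overestimates the mean by several hours however many students it stops, because a larger sample reduces sampling variation but does nothing about bias.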
