
Biostatistics Notes and Exercises

Jeff Longmate
Department of Biostatistics
ext. 62478

February 5, 2014


preface

0.1 Course Organization

These are lecture notes and problem sets for a seven-week course on biostatistics at the City of Hope's Irell and Manella Graduate School of Biological Science.

    Instructor: Jeff Longmate (ext 62478)

Meetings: Mondays, Wednesdays, and Fridays, 10:45 pm. Fridays will be used for problem set discussion, computing exercises, and exams.

Evaluations: 60% problem sets; 20% mid-term exam; 20% final exam.

Texts: 1. Lecture notes & handouts (all that's really needed).

2. Statistics, 3rd edition; Freedman, Pisani & Purves. On reserve at Lee Graff Library. Some optional readings from this book will be given.

Course Website: http://www.infosci.coh.org/jal/class/index.html

    Computing tools:

    1. Excel (or another spreadsheet)

    2. GraphPad/Prism (ITS can install)

    3. R & R Commander (to be self-installed in tutorial)


Contents

preface
  0.1 Course Organization

1 About Statistics
  1.1 Comments About Statistics
  1.2 First Example: A Bioassay for Stem Cells
    1.2.1 The Model
    1.2.2 Perspective
  1.3 The taxi problem
    1.3.1 Estimators
    1.3.2 Simulation of estimator performance
  1.4 Philosophical Bearings
    1.4.1 Science as selection
    1.4.2 Examples of hypothesis rejection
    1.4.3 A broad model
  1.5 The notion of population
  1.6 Homework: A Computing Tutorial
    1.6.1 Creating a Data File in Excel
    1.6.2 Installation of R
    1.6.3 Trying it out
    1.6.4 Data Analysis in R
    1.6.5 Documenting the analysis

2 Data Summary
  2.1 Summary Statistics
    2.1.1 Kinds of Variables
    2.1.2 The Mean
    2.1.3 Other Notions of Typical
    2.1.4 Measuring Variation
    2.1.5 Linear Transformation
    2.1.6 Quantiles & Percentiles
  2.2 Graphical Summaries
    2.2.1 Histograms
    2.2.2 Stem-and-leaf plots
    2.2.3 Boxplots
  2.3 Graphical Principles
  2.4 Logarithms
  2.5 Homework Exercises: Problem Set 1

3 Probability
  3.1 Example: Mendel's Peas
    3.1.1 The choice of characters
    3.1.2 Hybrids and their offspring
    3.1.3 Odds and Probabilities
    3.1.4 Subsequent generations
    3.1.5 An explanation
    3.1.6 Multiple Characters
  3.2 Probability formalism
    3.2.1 Conditional Probability
    3.2.2 Marginal, Joint and Conditional Probabilities
    3.2.3 Independence
    3.2.4 Complementary Events
  3.3 More Examples
  3.4 Bayes' Rule and Prediction
    3.4.1 A Diagnostic Test
    3.4.2 Bayes' Rule
    3.4.3 Example: The ELISA test for HIV
    3.4.4 Positive by Degrees
    3.4.5 Perspective
  3.5 Problem Set 2 (part 1 of 2)

4 Estimating and Testing a Probability
  4.1 Example: MEFV Gene and Fibromyalgia Syndrome
  4.2 The Binomial Distribution
    4.2.1 Calculating the tail probability
    4.2.2 Computing
  4.3 Estimating a probability
  4.4 Random Variables
    4.4.1 Expected Value
  4.5 The Law of Averages
    4.5.1 Mean and standard deviation of a binomial
  4.6 The Normal Distribution
    4.6.1 Standardized scale
    4.6.2 Some motivation
    4.6.3 Central Limit Theorem
    4.6.4 Standard error of the mean
    4.6.5 Areas under the normal curve
    4.6.6 Computing
  4.7 Summary
  4.8 Homework Exercises

5 Estimation & Testing using Student's t Distribution
  5.1 A Single Sample
  5.2 A Paired Experiment
  5.3 The t-distribution
  5.4 Two Independent Samples
    5.4.1 Standard Error of a Difference
    5.4.2 Pooled Standard Deviation
    5.4.3 Example: Diet Restriction
  5.5 One-sided versus two-sided
  5.6 Computing
  5.7 Exercises (not turned in)
  5.8 Homework Exercises (Problem Set 3)

6 Comparison Examples
  6.1 Example: Genetics of T-cell Development
    6.1.1 Computing in R Commander
    6.1.2 Interpretation
    6.1.3 Computing with Prism
  6.2 Tests in General
    6.2.1 A Lady Tasting Tea
    6.2.2 Tests and Confidence Intervals
  6.3 t Test
    6.3.1 Interpretation of
    6.3.2 Type I and Type II errors
  6.4 Example: Inference v. Prediction
  6.5 Assumptions
  6.6 Exercises

7 Contingency Tables
  7.1 Chi-square goodness-of-fit test
  7.2 Comparing 2 Groups: Testing independence of rows and columns
    7.2.1 One-sample versus two-sample chi-square tests
    7.2.2 Multiple testing
  7.3 A 2 by 2 Table
  7.4 Decomposing tables
  7.5 Fisher's exact test (small samples)
  7.6 Exercises

8 Power, Sample size, Non-parametric tests
  8.1 Sample Size
    8.1.1 Sample Size for Confidence Intervals
    8.1.2 Sample Size and the Power of a Test
    8.1.3 Computing
    8.1.4 Other Situations
  8.2 Paired Design
    8.2.1 Sign Test
    8.2.2 The Wilcoxon Signed-Rank Test
  8.3 Two Independent Groups
  8.4 Exercises

9 Correlation and Regression
  9.1 Example: Galton's height data
  9.2 Correlation Coefficient
  9.3 Regression
    9.3.1 Example: Spouse education
  9.4 Uses of Linear Regression
  9.5 The simple linear regression model
  9.6 Computing
    9.6.1 Transformed variables
  9.7 Exercises

10 Comparing several means
  10.1 Calorie Restriction and Longevity
    10.1.1 Global F-test
    10.1.2 Pairwise t-tests
  10.2 A Genetics Example
  10.3 Why ANOVA?
  10.4 Example: Non-transitive Comparisons
  10.5 Exercises

11 Context Issues
  11.1 Observational & Experimental studies
    11.1.1 Association versus Cause and Effect
    11.1.2 Randomization
    11.1.3 The Role of Randomized Assignment
    11.1.4 Simpson's Paradox
  11.2 Hypothesis-Driven Research v. High-Throughput Screening
    11.2.1 Testing One Hypothesis (Review)
    11.2.2 Multiple Testing Situations
    11.2.3 Type I Error Control
    11.2.4 Error Rate Definitions
    11.2.5 FWE
    11.2.6 FDR
    11.2.7 q-values
    11.2.8 Summary of multiple testing
  11.3 A Review Problem

Chapter 1

About Statistics

    1.1 Comments About Statistics

Statistics (plural) are summary numbers, like your grade-point average, or the population of the USA. The root of the word reflects its ancient connection with matters of state.

Statistics (singular) is a field of study, widely applicable to science and other rational endeavors, concerned with drawing conclusions from data that are subject to variation or uncertainty.

Data are numbers in context. Conceiving of data as numerical is not very limiting, as we can classify objects into categories and count them, digitize images, ask experts to score specimens, and so on. Context, however, is crucial. For example, data from an experimental intervention will often permit much stronger conclusions than numerically identical results from passive observation. The correct handling of data depends on the context.

Variation is ubiquitous. Any measurement has limited accuracy, and the limitations are often large enough to be important. Biological variation may be seen even when measurement errors are negligible. Populations of plants, animals, and even populations of cells within a tissue, all exhibit many kinds of variation. A genetic cross may produce many types of offspring. Cellular immunity in mammals involves the random rearrangement of receptors. Genes may be transcribed in bursts, leading to highly variable levels of transcripts in individual cells. A stem cell may shift between multiple expression profiles, without committing to differentiation. Genetically identical NOD mice, kept in specific antigen-free cages, will usually develop diabetes, but some will not, and for no discernible reason. All of these involve an element of randomness, or stochastic behavior.

Dealing with variation takes effort. It is human nature to do much of our thinking using examples and stereotypes. Statistics and biology both require population thinking, i.e. going beyond what is typical, and considering how individuals vary. This might be as simple as reporting a standard deviation in addition to an average, but it does take more effort (e.g. two numbers instead of one) to keep track of variation. On a deeper level, the prominent evolutionary biologist Ernst Mayr wrote that population thinking is essential to modern biological thinking, and that it was a relatively recent innovation in the history of ideas, explaining why nearly two centuries elapsed from the work of Newton to that of Darwin, even though the problem addressed by Newton seems more difficult.

Probability is the mathematical language for describing uncertainty. Simple probabilities describe the chances of discrete events that may or may not happen. Distributions of random variables describe the probabilities that a numerical value will fall in various intervals. The distribution of a random variable is a model for the process of sampling and observing an individual from a population, so we often speak of distributions and populations almost interchangeably.

Observation versus experiment is a major dichotomy in study design. Variation that is passively observed and variation in response to intervention are profoundly different things. Associations between variables can be observed without intervention, but establishing a cause-and-effect relationship requires something more: either experimental manipulation, or strong assumptions.

The broad objectives of statistics are to summarize, infer, and predict. Summary focuses on the data at hand. We may calculate summary statistics and graphical displays to better appreciate the information in a large dataset. Inference involves drawing conclusions about a whole population based on a limited sample from that population. Sometimes the sample is quite small. We calculate a statistic (e.g. the sample mean) from the available sample, in order to infer the approximate value of a parameter (e.g. the population mean), which is a characteristic of the entire population. This is a recurring pattern: using sample statistics to estimate the parameters that characterize a population. Prediction attempts something even more ambitious. Instead of trying to estimate, say, an average value for a large population, we attempt to predict the specific value for a given member of the population. This often involves observing additional variables for the individual of interest. While inference problems often yield to increasing amounts of data, some things are inherently unpredictable.
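The statistic-estimates-parameter pattern can be sketched in a few lines of R (the course's computing tool). The "population" below is simulated, purely for illustration; any real population of measurements would play the same role.

```r
set.seed(1)  # make the illustration reproducible

# A simulated "population" of 10,000 measurements (invented for
# illustration only)
population <- rnorm(10000, mean = 50, sd = 10)

# A limited sample of 25 individuals drawn from that population
smp <- sample(population, 25)

mean(smp)         # the statistic: our estimate
mean(population)  # the parameter: the target of inference
```

The sample mean will typically miss the population mean by a little; how far off it tends to be is exactly the kind of question taken up in later chapters.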

The rest of the lecture. Our first example will be a form of prediction (or calibration), called bioassay, in which we will estimate the number of stem cells in a specific culture, based on the engraftment success rate when the cells are used in a mouse model of bone marrow transplantation. We will then consider a toy problem of inference: given a very small sample of taxicab numbers, we will estimate the total number of taxis in a city. In the remaining time, we will discuss the relationship of statistics to science and mathematics, with a nod to a few famous philosophers. Having briefly illustrated inference and prediction, we will take up data summary in the following chapter.

1.2 First Example: A Bioassay for Stem Cells

Let's start with an example from City of Hope. Shih et al.¹ describe the expansion of transplantable human hematopoietic stem cells in ex vivo culture. Documenting this achievement to the satisfaction of referees, however, presented a problem. It was not possible to identify hematopoietic stem cells (HSCs) by direct observation. HSCs are defined by their capacity for both self-renewal and for differentiation into multiple lineages. While HSCs are found among cells expressing the CD34 and Thy-1 antigens, most cells expressing these markers are already committed to a specific lineage. The authors could show that they could expand a population of cells bearing markers associated with stem cells, and they could also show that the expanded culture still contained cells that could support engraftment in the SCID-hu mouse model of HSC transplantation. Referees, however, pointed out that the engraftment might be supported by a subpopulation of stem cells that were maintained in culture, but not expanded.

    The problem

What was needed was a demonstration that they had expanded the unseen subpopulation of cells that can support engraftment. A quantitative assessment of this sort, based on biological function as opposed to direct measurement, is called a bioassay.

    Bioassay (noun): Measurement of the concentration orpotency of a substance by its effect on living cells or tissues.

¹ Blood 1999, 94, 1623-1636


    The Result

The investigators did a dilution experiment using fresh CD34+ Thy-1+ cells in the SCID-hu mouse model. Four different cell doses (10,000, 3,000, 1,000, and 300 cells per graft) were evaluated. Each dose was used in 60 mice from each of two model systems (thymus/liver and bone model). For each model, the number of mice with long-term engraftment was observed to increase with the cell dose, and calibration curves were fit to the engraftment rates, as shown for the bone model in figure 1.1 below.


Figure 1.1: Calibration curve, bone model: 10,000 cultured cells are equivalent to approximately 16,000 fresh cells. A lower bound on this estimate is 7,900 fresh cells.

The calibration curves permitted the investigators to use the engraftment rate from cultured cells to estimate an equivalent dose of fresh cells. For the bone model, the result was that 10,000 cultured cells were equivalent to 16,350 fresh cells. Contrary to the worry that the culture would merely maintain stem cells, the stem cells seemed to increase somewhat faster than the culture as a whole.

A lower bound on the equivalent cell dose was also calculated. This involves two sources of variation. If we imagine that each dose of fresh or cultured cells has a true underlying engraftment rate, which we might discover if we could do a very large number of experiments, then our actual results with a modest number of animals will approximate the true rate with some error. To get a lower bound, we consider that the true engraftment rate with cultured cells might be smaller than the rate we observed, and that the true calibration curve might be somewhat further to the left than our estimated curve. We won't go into the details at this point, but the estimated lower bound was 7,900 for the bone model (fig 1.1). Because the original number of cells grown to 10,000 was much smaller than this, the implication is that the stem cells were indeed expanded in culture, even after allowing for experimental variation. However, our initial estimate that 10,000 cultured cells may be equivalent to more than 16,000 fresh cells must be tempered by realizing that they may also be equivalent to as few as 7,900 fresh cells. The expansion of stem cells in culture is established with high confidence. The initial impression of selective growth of stem cells versus non-stem cells in culture appears to be merely an impression, and not a reliable conclusion.

    1.2.1 The Model

The type of calibration curve that was fit to the reconstitution data is a standard model called a logistic regression. It would not make sense to fit a straight line to the data, because the reconstitution rate must always be between 0 and 1 (equivalently, 0% and 100%). Instead, we assume that the engraftment rate increases from zero to one as the cell dose increases, according to some relatively simple function. The function that was used is called the logit function, which is the logarithm of the odds of engraftment. If we let p be the proportion of mice that engraft, then p/(1 - p) is the odds of engraftment, and our model is

    log( p / (1 - p) ) = α + βx

where x is the dose, while α and β are parameters that we choose to make the curve as close as possible to the data. This model defines a family of S-shaped curves, one for each combination of α and β. Three members of this family of curves are depicted in figure 1.2. Changing β makes the curve


steeper or flatter, i.e. more or less responsive to dose. Changing α moves the curve to the left or right, changing the dose that yields 50% engraftment. We take "as close as possible to the data" to mean that the choice of α and β should maximize the likelihood of the data that were actually observed. The precise definition of likelihood is a theoretical matter that we won't take up here, but it is worth noting that maximizing a likelihood is a general principle, and there is a body of theory stating that estimation based on maximum likelihood delivers some desirable properties.

Figure 1.2: Logistic regression curves. The two curves with broken lines illustrate the effect of varying each of the two parameters. (Horizontal axis: dose; vertical axis: engraftment probability p, from 0.0 to 1.0.)
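A maximum-likelihood fit of this kind can be sketched in R with glm(). The four doses below are the ones used in the experiment, but the engraftment counts are invented for illustration; the actual counts are in the paper's tables.

```r
# Cell doses used in the dilution experiment (cells per graft)
dose <- c(300, 1000, 3000, 10000)

# Engraftment counts out of 60 mice per dose -- hypothetical numbers,
# for illustration only; the real counts appear in Shih et al.
engrafted <- c(6, 15, 33, 51)
n <- rep(60, 4)

# Logistic regression: the log odds of engraftment, log(p/(1 - p)),
# is modeled as alpha + beta * x; glm() finds the maximum-likelihood
# alpha and beta
fit <- glm(cbind(engrafted, n - engrafted) ~ dose, family = binomial)
coef(fit)  # alpha (intercept) and beta (slope)

# Dose estimated to give 50% engraftment: solve alpha + beta * x = 0
unname(-coef(fit)[1] / coef(fit)[2])
```

With real data one would also examine confidence intervals for the parameters (e.g. with confint(fit)), which is the ingredient behind the lower bound discussed above.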

    1.2.2 Perspective

    There are several points worth noting about this example.

• The bioassay allowed the investigators to study a class of cells that they could not observe directly.

• The need for a calibration curve required additional laboratory work.

• The estimate of an equivalent number of fresh cells came with a measure of uncertainty in the form of a lower bound.

• The uncertainty permitted a forceful conclusion about the main point. The numbers of cells supporting engraftment were clearly expanded in culture.

• The calibration also suggested that stem cells were preferentially expanded, but this was only a hint. The point estimate was consistent with faster growth for stem cells, but the lower bound indicates that the same, or slightly slower, growth cannot be ruled out.

• And finally, adding a small amount of statistical analysis addressed a major concern of referees, and got the paper through peer review.

The points above indicate why one would want to do the dilution experiment and the necessary statistical calculations. Actually doing the statistical work involves understanding a number of distinct concepts and tools. Among these are the following.

Probability. In the SCID-hu mouse model, transplants of 1000 cells per graft sometimes engrafted, but often did not. The tendency for engraftment to happen more reliably as the cell dose increases is the basis for asserting that the culture actually expanded hematopoietic stem cells. We suppose that the cell dose determines the probability of engraftment, but this probability is an unseen parameter that we can only estimate, using a finite (and rather limited) number of SCID-hu mice.

Logarithms. The logit function takes the unit interval (from zero to one), in which probabilities must lie, and maps it onto the entire real number line. The logit function is the (natural) logarithm of the odds of engraftment. Logarithms are often useful in statistics, and are worth at least some refresher-level attention.
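A quick R check of this mapping; qlogis() and plogis() are R's built-in logit and inverse-logit functions.

```r
p <- c(0.1, 0.5, 0.9)

# The logit: natural log of the odds p/(1 - p); probabilities near 0
# map far to the left, near 1 far to the right, and 0.5 maps to 0
log(p / (1 - p))   # approximately -2.197  0.000  2.197

# qlogis() computes the same thing; plogis() inverts it, mapping the
# whole real line back into (0, 1)
qlogis(p)
plogis(qlogis(p))  # recovers 0.1 0.5 0.9
```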

Linear Model. The logistic regression model relates the logit-transformed probability to two parameters, which can be thought of as a slope and intercept. Linear models, of the form y = α + βx, are extremely useful, despite their simplicity, and their usefulness can be extended further by replacing y with a logarithm, logit, or some other function.


Experimental Design. In order to be useful, the experiment needed to use enough SCID-hu mice, at enough different doses, spread across a big enough range. While there are some strategies and tools to help with the planning, there is also a lot of guesswork and biological intuition. In this example, essentially everything but the engraftment response was controlled by the experimenter. Studies that need to rely on naturally occurring variation in more variables can be much harder to design and analyze.

The experiment itself. The conclusions depend on maintaining well-defined experimental conditions and procedures. There is a substantial body of statistical methods for quality control and process optimization that are much used in industry, and sometimes used in laboratories.

    Model Fitting. The model supposes that probabilities of engraftment atthe four cell doses fall on a smooth, S-shaped curve, which isdetermined by just two parameters. This assumption may be wrongin detail, but it is probably good enough for our purposes.

        "All models are wrong. Some are useful." (George Box)

    We can use a computer program to find the pair of values of α and β that best fit the data. We can also find a range of values for α and β that adequately fit the data. Here "adequately" depends on how much chance of an error we are willing to tolerate.

    Computing. To actually get the computing done, we have to organize the data in a suitable form, and use a computer program. We will go through the details in a tutorial exercise.

    1.3 The taxi problem.

    Let's consider a simple problem that involves inferring something about a population from a small sample. This is a toy problem described by Gottfried Noether.²

    ² Introduction to Statistics: A Fresh Approach, Houghton Mifflin (1971)


    Dr. Noether was traveling, and trying to hail a taxicab. Several went by, but all were already hired. He started to wonder how many taxicabs operated in this city (clearly not enough). He noted the numbers on the taxi medallions that each cab displayed, which were

    97, 234, 166, 7, 65, 17, 4.

    These are the data. He formulated a simple model, in which the taxicabs were numbered sequentially, starting with 1, the highest number, say N, being the total number of taxicabs in the city. N is the parameter whose value we would like to know. It describes a feature of the population of taxicabs in the city.

    We need a model relating the parameter to the probability of observing different possible samples. If we assume that all cabs are equally likely to be in operation, then each number from 1 to N has an equal chance of showing up among the data. We can now think about ways to estimate N from the observed data.

    1.3.1 Estimators

    Under our assumptions, N must be at least as large as the largest observation, so we could simply use the largest observation to estimate N. Since this estimate is probably smaller than N, we might add some increment in the hope of getting closer, but how much should we add? We might suppose the largest falls short of N by about the same amount that the smallest exceeds 0. This suggests adding the smallest observation to the largest observation. Let's give these two estimators of N some names to distinguish them. Let's call the maximum N̂a, the hat designating an estimator of N, and the subscript indicating simply that it is our first idea for an estimator. Let's then use N̂b to denote the maximum-plus-minimum estimator.

    Can you think of other ideas for a good estimator?

    Here's one more idea. We might consider averaging all of the gaps between observed numbers to estimate the likely gap between the largest observation and N. So we let N̂c be the maximum plus the average gap. (With a little thought, we can come up with an easy way to calculate N̂c without actually calculating all the individual gaps.)
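    One way to see the shortcut (a side calculation, not spelled out in the notes): the gaps, taken from 1 up through the largest observation, telescope, so they sum to max − 1, and their average is (max − 1)/n. A quick check in R on the taxi data:

    ```r
    x = c(97, 234, 166, 7, 65, 17, 4)
    n = length(x)

    # average gap computed directly from all the individual gaps
    avg.gap = mean(diff(sort(c(1, x))))

    # the shortcut: the gaps telescope, summing to max(x) - 1
    shortcut = (max(x) - 1)/n

    c(avg.gap, shortcut)   # both 33.29 (to 2 places)
    max(x) + avg.gap       # the maximum-plus-average-gap estimate
    ```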


    How should we choose among possible estimates?

    Any one of them might turn out to be closest to N in a single sample, so let's think of applying our ideas to many future samples. Let's use the term estimator to refer to the method of computing an estimate, the latter term referring to the actual number we get when we apply the estimator to a particular set of observations. Since we don't know the value of the parameter N, we can't pick the best estimate with certainty, but if we study a set of hypothetical populations where we do know N, we may be able to identify the best estimator, i.e. the estimator that tends to be closer to N than the other estimators, on average in repeated use.

    We can easily use a computer to simulate samples from 1, . . . , N, for any value of N we like. We can then compare the results for the different estimators. We can summarize the simulations graphically, or we can use summary statistics. We'll do both. One particularly compelling summary is the mean square error between estimators and N. If we square the difference before averaging over simulations, we penalize large differences, which we would like to avoid. The mean square error can be decomposed into a measure of variation (or precision) and a measure of bias (or accuracy). The standard deviation is the root mean square deviation of an estimator from its mean (which might be different from N). This can be computed without knowing the target parameter, N, and it measures variation, without regard to accuracy. In analogy with archery, it would measure how tightly grouped the arrows are, without regard to the bull's-eye. Bias, on the other hand, is the difference between the average value of the estimator and the target, N. It is analogous to the distance from the middle of the cluster to the bull's-eye. These are related by

        MSE = SD² + Bias²

    which essentially says that standard deviation and bias are related to MSE by the Pythagorean theorem. To be a little more explicit, let's suppose we have n simulations, numbered 1, . . . , n, yielding estimates N̂₁, . . . , N̂ₙ. Note that we are now considering only one estimator, say the maximum plus minimum, and using the subscript to denote the result from different simulated samples. Let N̄ = (1/n) Σᵢ N̂ᵢ be the average value of our estimator over all of the simulations. Then we can write the decomposition of the MSE as

        (1/n) Σᵢ (N̂ᵢ − N)² = (1/n) Σᵢ (N̂ᵢ − N̄)² + (N̄ − N)²


    1.3.2 Simulation of estimator performance

    We will do some simulations in class and calculate these summaries. To do the simulations we need a computer, a programming language, and the actual program code. We will use the R programming language (http://cran.cnr.berkeley.edu/), which will also be encountered in the homework tutorial. Some program code for simulated sampling and calculation of estimators is shown below. R is an interpreted language, so we can simply start R and type in the commands, executing one line at a time, or, with this file displayed in Acrobat, we can copy and paste the code into R. This will be done in class.

    options(digits=3)

    # sample data:
    y = c(97, 234, 166, 7, 65, 17, 4)
    y
    # Note that in R, when you type the name of an object
    # (y, in this case) the object is printed.

    # Some estimators
    est1 = function(x) max(x)
    est2 = function(x) max(x) + min(x)
    est3 = function(x) max(x) + mean(diff(sort(c(1,x))))
    est4 = function(x) 2*mean(x)

    # package them together for convenience
    est.all = function(x) c(max=est1(x), maxmin=est2(x),
                            gap=est3(x), dblmean=est4(x))

    # try them all on the data
    round(est.all(y))

    # Which is best?
    # We decide by comparing performance

    # Hypothetical situation
    N = 240   # The number of taxis (which we want to learn from the data)
    n = 7     # the number of observations

    # A simulated sample
    y.eg = sample(1:N, n)
    y.eg
    round(est.all(y.eg))

    # Now repeat the simulation many times
    reps = 200
    yrep = matrix(data=sample(1:N, reps*n, replace=TRUE), nrow=reps, ncol=n)

    # Look at the first 20 simulations
    yrep[1:20,]

    # and apply all the estimators to each sample
    r = apply(yrep, 1, est.all)
    r = t(r)

    # Look at the estimators from the first 20 simulations
    r[1:20,]

    # Graphically examine the results for each estimator
    par(mfrow=c(2,2))
    for(i in 1:4){
      x = r[,i]                      # get data for one of the estimators
      x.name = dimnames(r)[[2]][i]   # get the name of the estimator
      hist(x, breaks=seq(0,450,by=20), main=x.name, xlab="")
    }

    # A different look using boxplots
    par(mfrow=c(1,1))
    boxplot(as.data.frame(r))
    abline(h=N, lty=3)

    # Summarize the performance of each
    rmse = apply(r, 2, function(x) sqrt( mean((x - N)^2) ))
    bias = apply(r, 2, function(x) mean(x) - N)
    sd   = apply(r, 2, function(x) sqrt( mean((x - mean(x))^2) ))
    tbl = rbind(rmse, bias, sd)
    tbl
    tbl^2

    1.4 Philosophical Bearings

    Why should a student of science, with much to learn and limited time, study statistics? The answer that seems obvious to some of us is that statistics is an important part of the scientific method. That is a position it shares with other things: much of science does not involve statistics, and statistical methods are often used in government, finance, and industry, as well as in science. In terms of methodology, however, statistics lies at the interface of mathematics and science, which employ quite distinct methods.

    Mathematics is deductive. We start with axioms, which are assumed to be true, and we deduce their consequences. These consequences may be proven, i.e. shown to be definitely true, without uncertainty. However, the whole enterprise is concerned with an idealized Platonic world, rather than the real world. Science moves in the opposite direction. We observe outcomes and try to infer the causes that gave rise to them. The conclusions cannot be proven with mathematical certainty, but they do deal directly with the real world.

    Statisticians use the methods of mathematics to help advance the methods of science. Mathematical statisticians propose methods for designing studies and analyzing data, and they work out the accuracy and vulnerabilities of these methods. Applied statisticians, or statistically knowledgeable scientists, apply these methods to draw conclusions. The application of statistical methods does not necessarily require great skill in mathematics, but it does require an awareness of the accuracy and vulnerabilities of the methods to be applied.

    The Speed Berkeley Research Group has a website about their work on statistical methods in functional genomics that carries this relevant statement:³

        What has statistics to offer? A tradition of dealing with variability and uncertainty; a framework for devising, studying and comparing approaches to questions whose answers involve data analysis. In statistics we have horses for courses, not one for all weathers. That is why guidance is needed.

    Because statistics courses seem to be one of the few places in science curricula where scientific methods get any sort of formal treatment, it seems fitting to start with a broad view of science, to get our bearings.

    1.4.1 Science as selection

    We can draw an analogy between science and natural selection. The analogy amounts to an extremely terse history of life on earth.

    1. Life: Information is passed between generations in genes.

    2. Evolution: Natural selection of genes generates diverse species adapted to their environments.

    3. Culture: Information is passed between generations in minds and literature.

    4. Science: Data-based selection of ideas increases understanding of the natural world.

    Aside from making science (and teaching) look remarkably important, this analogy has two major points:

    Science is a cultural activity. Science often involves selection among hypotheses.

    The fact that science is cultural means that different people will require different degrees of evidence for what they believe. This problem is greatly reduced by the notion of selection, i.e. by focusing on what can be ruled out with compelling force.

    ³ http://www.stat.berkeley.edu/~terry/Group/home.html


        Our belief in some hypotheses can have no stronger basis than our repeated unsuccessful critical attempts to refute it. (Karl Popper, 1961, The Logic of Scientific Discovery)

    According to Popper, we can never prove any hypothesis, but we make useful progress by trying to disprove them. The surviving hypotheses constitute a useful, but provisional, view of the world. This approach to science is sometimes called the hypothetico-deductive method. In plain words, this means "guess and test," but the guessing involves a lot of knowledge of one's subject.

    1.4.2 Examples of hypothesis rejection

    Discovery of viruses: Löffler and Frosch demonstrated the presence of ultra-microscopic infectious organisms, now known to be viruses. They passed lymph from an animal suffering from foot-and-mouth disease through a filter, which ruled out bacteria (of ordinary size) as the infectious agent. They also infected animals serially, which ruled out any non-replicating poison.

    Finding promoters: More recently, the regulatory regions of genes are frequently dissected by sequentially shortening the sequence under study and testing for expression, an approach known in some circles as "promoter bashing."

    Genetic exclusion mapping: In an experimental genetic cross involving a classic Mendelian phenotype, variation in the location of meiotic cross-overs in a backcross allows the investigator to rule out most of the genome, leaving a plausible interval that decreases in size as the data accumulate.

    Rejecting chance as an explanation:

    For complex (incompletely penetrant) genetic traits, failure to see a phenotype does not rule out the genotype as a contributing (non-sufficient) cause; it just makes it less likely. Consider a backcross experiment using the NOD mouse model of diabetes.


    1. We genotype the diabetic mice and look for regions of the genome that depart from Mendelian ratios. If the departure is big, we rule out chance variation and conclude that there is something to interpret.

    2. We include genotypes for non-diabetic mice to rule out a lethal recessive allele.

    Having ruled out both chance and lethal recessives as competing hypotheses, we can then conclude that the genetic pattern is related to diabetes.

    We will return to this example when we study specific methods for statistical hypothesis testing. At this point, the thing to notice is that the ability to reject chance as an explanation for patterns in our data can rescue the hypothesis-rejection strategy, allowing us to investigate situations involving noise and uncertainty. The bioassay described in chapter 1 enabled Dr. Shih to study stem cells quantitatively, despite the problem that they could not be identified by markers. Using breeding experiments, mouse geneticists have identified regions of the genome responsible for a disease even before any genes within those regions have been identified.

    The extension of hypothesis testing into noisy realms comes at a price, in that many observations are needed. The feasible size of the study may not permit rejecting the chance hypothesis to the satisfaction of everyone. The different degrees of evidence required by different individuals can re-enter the situation when statistical methods are needed. This is particularly true in medical research, where the accumulation of evidence that one therapy is superior to another may, at some point, preclude further research on the putatively inferior therapy, on ethical grounds.

    1.4.3 A broad model

    The foregoing are all examples where hypotheses are proposed and compared to data in a single project. Sometimes there is a protracted debate.

    Gregor Mendel proposed a particulate model of inheritance. This was in contrast to the notion, current in his day, that inheritance involved some sort of blending.


    Mendel started with true-breeding lines, e.g. a line that always produces yellow peas, and a line that always produces green peas. When he cross-fertilized these two lines, he did not obtain any blending of the colors. Instead, all of the first-generation offspring (the F1 generation in modern notation) had peas of the same color. When this generation was allowed to self-fertilize, the resulting generation produced both of the original colors of peas in a highly repeatable 3:1 ratio, but without any blending of the characters.

    Describing what we would now call F2 generations (the result of hybridization of two inbred lines followed by self-fertilization), Mendel wrote⁴:

        If now the results of the whole of the experiments be brought together, there is found, as between the number of forms with the dominant and recessive characters, an average ratio of 2.98:1, or 3:1.

    In passing from 2.98 to 3, he was asserting that his model fit the data, and, in a sense, rejecting the need for any further explanation of the variation.

    Mendel's choice of experimental plants was crucial. The seven traits he studied were each under the control of a single gene. Traits like height and weight, however, seemed more a matter of blending than a transfer of discrete "genes" (a word coined later). For more than a decade after the rediscovery of Mendel's paper, those who studied such traits regarded this as evidence against Mendel's model. These investigators were described as the "biometric" school, in contrast to the "Mendelian" school of thought. The tension was resolved, largely by Ronald Fisher, who explained the inheritance of quantitative traits as the result of contributions from many genes.

    This is an example of progress on the deductive part of the hypothetico-deductive method to explain existing data. It was not a revision of Mendel's notion of particulate inheritance; rather, it was an improved understanding of what it implied. Far-reaching ideas like Mendel's don't necessarily stand or fall on a single fact.

    ⁴ translated; see e.g. http://www.mendelweb.org/


    Description versus hypothesis testing

    Not everything in science fits easily into the hypothetico-deductive mold. Sequencing the human genome, for example, was done for much broader reasons than testing any specific hypothesis. A recent pair of opinion pieces in Nature⁵ debate the merits of a data-first approach versus putting hypotheses first. Of course, part of the debate is over what sort of science should be funded. (Almost all grant proposals need to couch their objectives in terms of hypotheses to be tested.) More constructive, perhaps, is a news article⁶ in the same issue of Nature, with the title "Life is Complicated," which addresses recent efforts to incorporate rapidly accumulating molecular information with more traditional hypothesis-driven research. Sequencing, and many other large-scale data collection activities, are essentially observational. But sometimes the data-first approach can involve experiments. The article cites the example of Eric Davidson's lab, at Caltech, which works on the sea urchin as a model system. The lab has been systematically knocking out transcription factors to build a map of how they work together in development. This approach combines assessment of the whole transcriptome with highly specific experimental manipulations, and it has uncovered a modular structure of regulatory networks.

    More examples from genetics

    Even though everything need not fit into a neat philosophy, having some philosophical bearings can be useful when things get difficult. In particular, the key feature that makes a hypothesis scientific is that it is potentially falsifiable, i.e. if it is wrong, it can be shown to be wrong. This principle is sometimes useful for identifying ill-posed questions. Sometimes, however, data precede any clear hypotheses.

    Exercise 1.4.1 Several brief descriptions of landmark results in genetics are given below. For each description, can you identify a hypothesis that has been rejected, or does it seem to be more a matter of observing and explaining?

    ⁵ Nature 2010, 464:678-679
    ⁶ Nature 2010, 464:664-667


    1. The early ideas about genes treated them as discrete and constant. In the 1930s, Hermann Joseph Muller showed that exposing fruit flies to X-rays produced flies with novel features that are inherited. This meant that genes could be altered.

    2. In the early 1940s, chromosomes had long been recognized as the cellular location of genes, but it was still unclear whether it was the proteins or the DNA in chromosomes that carried the information of heredity. Oswald Avery showed that a strain of an attenuated bacterium could be exposed to extracts from a virulent strain and recover virulence, which could then be passed on through cell divisions. If the DNA in the extract was destroyed, no virulence ensued. If the protein was destroyed, the bacteria continued to become virulent.

    3. In 1953, Watson and Crick published the structure of DNA based on the data of Franklin and Wilkins. They wrote,

        It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.

    4. Shortly after the structure of DNA had been published, there was speculation as to the nature of the genetic code for amino acid sequences. A particularly neat hypothesis was that the code was "comma-free" and would make sense in any reading frame. This reduced the 64 possible three-base codons to 20 feasible codons, exactly the right number for encoding the amino acids in proteins. This hypothesis was soon discredited by an experiment that fed poly-U RNA to the cellular translation machinery and obtained a monotonous peptide. UUU was one of the forbidden codons in the comma-free hypothesis.

    Paradigms

    While the hypothetico-deductive method holds a prominent place in science, there is no single scientific method. In 1962, Thomas Kuhn published a short but very influential book, The Structure of Scientific Revolutions⁷, that introduced the notion that scientists tend to follow established paradigms, which are previously successful approaches to an area of science. Ordinary science involves working within a paradigm. A scientific revolution is the replacement of a paradigm. And paradigms are replaced, not necessarily because they are contradicted by experiment, but when they cease to be productive.

    ⁷ see book review, Science Vol. 338, 16 November 2012

    A 50th anniversary review of Kuhn's book is among the handouts.

    One reason for taking notice of Kuhn is that, as a scientist, you will certainly encounter the term "paradigm," used in this sense. Another reason is that the field of statistics is somewhat unusual in having two paradigms that have long coexisted, without replacing each other. The frequentist paradigm focuses on the evaluation of statistical methods by their performance in repeated use. The simplest statistical methods tend to be frequentist procedures, hence the frequentist paradigm dominates elementary statistics textbooks. The Bayesian paradigm uses probabilities to simultaneously model both the distribution of data and the truth of inferential statements. It has a more coherent theory, and some advantages of interpretation, but it requires a little more expertise in theory and computing. It is particularly useful in problems that require combining data from multiple sources. You might encounter mention of Bayesian methods in statistical treatments of genetics and functional genomics.

    1.5 The notion of population

    In the bioassay example, the probability of engraftment was thought to be primarily determined by the number of stem cells injected into a mouse. This assumes that a lot of other things are fixed parts of the situation. We can think of the whole experimental set-up as generating a population of potential transplants, with the probability of engraftment being a property of this population.

    In the taxi problem, the population was more definite. We assumed a fixed number of taxicabs, N, working in the city. We sample from this population by trying to hail a cab and noting the medallion. We might worry that our sampling method leaves something to be desired, but the notion of population is pretty clear.

    The hypothetico-deductive method can be related to the notion of a population. In general, we can deduce what kind of sample to expect under each of several competing hypotheses, and then try to obtain enough data that we can reject some of the hypotheses. If we want to estimate a numerical quantity, like the speed of light, or the probability of a mutation, we can consider each possible numeric value as a distinct hypothetical value, characteristic of the population. Given a sample from the population, the set of hypothesized values that we cannot reject as a plausible source of the data will form an interval, which probably contains the actual value for the population.
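    As a concrete sketch of this idea (using R's binom.test; this calculation is not in the notes), consider the 1000-cell dose in the bioassay, where 4 of 60 mice engrafted. The set of engraftment probabilities that cannot be rejected at the 5% level forms an interval around 4/60:

    ```r
    # 4 engraftments in 60 mice at the 1000-cell dose
    ci = binom.test(4, 60)$conf.int
    ci   # the interval of plausible engraftment probabilities
    ```

    The reported interval is exactly the set of hypothesized probabilities that the data do not rule out, which is the interpretation described above.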

    In the reading assignment, Ernst Mayr notes that biological populations not only consist of non-identical individuals, but they may not be constant over time. The length of a rabbit's ears varies somewhat from rabbit to rabbit, but the typical length characteristic of a species has probably varied during the evolution of rabbits. We need to keep this sort of variation in mind when we consider whether a statistical inference problem is well-posed.

    Exercise 1.5.1 Figure 1.3 gives results from several studies, all attempting to measure the same physical quantity, the angular deflection of light due to the gravity of the sun. Taken as a whole, are the observations consistent with the general relativity value? How many experiments produced intervals that exclude the general relativity value of 1.0? What do you make of the intervals from these experiments?

    Exercise 1.5.2 Figure 1.4 gives results from several studies, all attempting to measure the risk of non-febrile seizures following febrile seizures. The studies fall into two classes. The clinic-based studies sampled patients who were seen at a variety of specialty clinics and hospitals. The population-based studies attempted to follow all reports of febrile seizures within a defined population. Are the clinic-based studies consistent with any single value for the risk of subsequent non-febrile seizures? Are the population-based studies consistent with a single numerical risk of subsequent non-febrile seizures applicable to all patients with febrile seizures?


    Figure 1.3: Multiple estimates of a physical constant.


    Figure 1.4: Multiple estimates based on samples from two different kinds of population.


    1.6 Homework: A Computing Tutorial

    In this exercise we will:

    1. enter data from Dr. Shih's bioassay experiment into a spreadsheet and save it as a comma-separated (.csv) file;

    2. download and install the R program, which we will use again later;

    3. read the data into R and produce a plot and related calculations;

    4. save our computing instructions in a file to document the calculations.

    The point of this tutorial is primarily to get a better idea of what is involved in carrying out the statistical analysis of an actual experiment, and to get the R program installed, so that we can use it for some simpler calculations later.

    1.6.1 Creating a Data File in Excel

    Let's use Microsoft Excel for entering the bone-system engraftment data into a file. Excel is a very common tool for organizing raw data. If you don't have a copy of Excel, you can use any plain text editor, such as Notepad, to create a file with columns of numbers separated by commas. Don't use commas for any other purpose, however; e.g. don't use commas within large numbers. Other alternatives, in the absence of Excel, are to download either Libre Office from http://www.libreoffice.org/ or Open Office from http://www.openoffice.org/.

    Create three columns of numbers, with a header row at the top to label the columns. The data should be organized as in Table 1.1.

    Aside from the header row, each cell should contain one number and nothing else. You should start at the top, and leave no blank rows, nor any blank fields within the rectangular table of data. The space between Cell and Dose in one of the variable names is OK. When we read the data into R, this will get converted into a period, yielding the name Cell.Dose.

    Save the table as a comma-separated (.csv) file. Let's assume the file name is bone.csv, and the file is in the folder C:/biostat.
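    For reference, the saved bone.csv file would look something like this when opened in a plain text editor (assuming the layout of Table 1.1):

    ```
    Cell Dose,r,n
    10000,53,60
    3000,12,60
    1000,4,60
    300,0,60
    ```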


    Cell Dose      r     n
        10000    53    60
         3000    12    60
         1000     4    60
          300     0    60

    Table 1.1: Spreadsheet layout for the bioassay data.

    The bone.csv file can be read by the R program, which can easily do further calculations. If we want to use Prism, instead of R, to plot the engraftment rates against the cell dose, we would need to calculate the rates outside of the Prism program, e.g. in Excel, so let's look at how that can be done.

    In Excel, label a fourth column (top row) as p (for proportion). Move the cursor to the second row, fourth column and type an equal sign. This tells the spreadsheet that a formula is coming. Then click on the cell with the 53. This will write the cell name in the formula. Follow that with the slash for division, and click on the 60 next to the 53 to put that cell name in the formula, and complete the formula by hitting the tab key. Now highlight the cell with your new formula, and the three cells below it. Ctrl-D will copy the formula down the column, adjusting the cell names in each formula to refer to the different rows. You now have a column with the calculated proportions.
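    The same proportions can also be computed in R, as a cross-check on the spreadsheet formula (here the Table 1.1 numbers are typed in directly, rather than read from bone.csv):

    ```r
    # the bioassay data from Table 1.1
    bone = data.frame(Cell.Dose = c(10000, 3000, 1000, 300),
                      r = c(53, 12, 4, 0),
                      n = c(60, 60, 60, 60))

    bone$p = bone$r / bone$n   # engraftment proportion at each dose
    bone
    ```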

    1.6.2 Installation of R

    The following assumes you are using Windows. Installation is similar under Mac OS X, but some of the link names will be different. It is also assumed that you have administrator privileges on the computer where you are installing R.

    1. Point your web browser at http://cran.cnr.berkeley.edu/ (or do a Google search on CRAN and use any mirror site you like).

    2. Download the installation file;

    Follow the Download R for Windows link.


    Select base. Select Download R 3.0.2 for Windows, or whatever newer version is there. Save it in any convenient folder on your machine, or just on the desktop.

    3. Start the installer program (by double-clicking it). This is a typical install wizard that will ask a few questions. You can take all of the defaults, but on Windows I like to install programs to my own progs folder, leaving Program Files for the things that ITS installs. When the installation is complete, there should be an R icon on your desktop and an R folder on your start button. On a Mac, there will be an R application in the Applications folder.

    1.6.3 Trying it out

    Click on the R icon. You should get an R Console window. In that window, try typing 2 + 3, then press return. Try sin(pi/2). You can use R as a scientific calculator.

    Try typing these lines, with a return after each line.

    F = c(-40, 0, 32, 68, 98.6, 212)
    F
    C = (F - 32)*5/9
    C
    cbind(F,C)

    The first line combines six numbers, representing Fahrenheit temperatures, into a vector, and assigns it to an object named F. This happens silently, i.e. without printing anything. In the second line, giving the name of the object, without an assignment, causes the object to be printed. The third line does a calculation on each element of F, converting the temperatures to Celsius, and assigning the result to an object named C. Typing C alone on the fourth line prints the Celsius values. The final line uses the cbind function to create a small table by binding two columns of numbers together. Because there is no assignment, the table is printed in the R Console window.

    To quit R, you can either


    1. close the R Console window,

    2. choose Quit R from the R pull-down menu, or

    3. type the command quit() at the command line, followed by return.

    R is a statistical programming language, as opposed to a menu-driven statistics and graphics application of the sort advertised in the pages of Science. A very large number of specialized statistical tools have been implemented in R by the scientific community. The Comprehensive R Archive Network (CRAN) has a large general collection, and the Bioconductor site has a more specialized collection, oriented towards molecular biology. But because it is a language, learning it takes some effort. Once one has a basic orientation, doing standard statistical calculations in R is really no more difficult than using a menu-based program. One simply looks through the help pages, instead of looking through the menus, to find the appropriate functions.

The help system in R is available from the Help pull-down. Choose Html help on a Windows system, or R Help on a Mac. The Search Engine & Keywords link will let you find relevant material by searching, or via a hierarchical list of topics. The Introduction link provides a manual of sorts, with a sample session in an appendix.

Help pages may actually be preferable to menus, as help pages are more informative, and often offer references to help the user understand what is being calculated. There is much more to data analysis than understanding how to use a computer program, and it is possible for statistical tools to be too handy. However, programming is more like playing the piano than like riding a bicycle, in that the skill fades quickly if you don't practice. For that reason, future computing exercises will use either Prism, which is a commercial menu-driven package, or an add-on package for R, called R Commander, that provides a graphical user interface (GUI) that makes R work more like a commercial statistics package. This tutorial, however, will use R in its basic form, which is how someone doing more involved work would likely use it.


    1.6.4 Data Analysis in R

This will necessarily be a rather mechanical exercise in copying commands to the R console window and executing them.

After starting R, we need to set the working directory to the folder where we saved the dataset. You can either use the pull-down menus (File / Change dir... under Windows, Misc / Change Working Directory under Mac OS X) and navigate to your folder, or you can type the path in a setwd command, as below.

    setwd("C:/biostat")

    This just lets the R program know where to look for things like data files.

    There is a counterpart to the setwd function, that you can use like so:

    getwd()

This will print the full path to the working directory, i.e. it tells you where you are. Note that you don't need to supply the getwd function with any argument, but you do need to type the parentheses that usually hold function arguments. The presence of parentheses tells R that you want to execute the function. If you just type the function name, without the parentheses, R tries to print the function object, just like any other object. This can be handy if you have written functions in R, and want to look at the code, but in the case of getwd, it only produces some uninteresting technical information.

    To read the data:

    bone = read.csv("bone.csv", header=TRUE)

This will be silent. To see the data, type bone (followed by return, of course). The object bone, created by the read.csv function, is something called a data frame. It is a rectangular array of variables, of possibly different types (e.g. numeric, logical, character, factor; we haven't yet defined these things), but all of the same length. Columns are variables, and rows are distinct individuals or observations.

Let's make a plot.


    plot(r/n ~ Cell.Dose, data=bone)

The ~ character is part of a statistical modeling notation. The command says to plot r/n as the response, on the vertical axis, and Cell.Dose as the predictor variable, on the horizontal axis. The variable names are inside the bone data frame object, so we have to tell the plot function where to find them with the data= argument. If you used different variable names when you made the data file, you will need to adjust the plot command to use those variable names.

The plot actually looks pretty straight. We might have gotten away with a simpler linear model, rather than the logit model, but the latter is more broadly applicable.

Let's repeat that plot, but with a logarithmic horizontal axis, covering a wider range of values.

    plot(r/n ~ Cell.Dose, data=bone,

    log="x", xlim=c(300,20000), ylim=c(0,1))

Don't worry about understanding the details. The point is that many variations on the plot are possible by specifying extra arguments to the plot function. Typing ?plot will bring up a help page, but it's not essential that you look at it right now.

Now let's fit the logistic regression model, which is the main event here.

    bone.fit = glm(r/n ~ log(Cell.Dose),

    data=bone, family=binomial, weights=n)

    bone.fit

Note that we split the long function call across two lines. If one line isn't complete, R will keep looking on the next line. Our two-line instruction here calls the glm (generalized linear models) function to fit the logistic regression model, and assigns the result to an object called bone.fit. The second instruction simply gives the name of the object, which prints the result, in a brief form. In some programs, a statistical procedure like this would spew a page or more of results. In R, it is typical for a statistical function to wrap its results up into an object which can then be interrogated by other functions to get what you need. Just typing the name


of the object usually produces a rather brief summary. Here we have fit a model of the form

    log( p / (1 − p) ) = α + βx

and the summary gives us estimates of α and β. It is common in statistics to adorn the parameter with a hat (or similar mark) to distinguish an estimate from the true value of the parameter. Following this convention, we can write

    α̂ = −19.357,  β̂ = 2.292.

These two parameters determine the calibration curve. We might like to look at the curve. Here's how.

    x = seq(from=300, to=20000, by=20)

    y = -19.357 + 2.292 * log(x)

    inv.logit = function(z){ exp(z)/(1 + exp(z))}

    logit = function(p) { log(p/(1 - p)) }

    lines(inv.logit(y) ~ x)

The first line creates a sequence of points along the horizontal axis, from 300 to 20000, one every 20 units. (We can see how many points we created by typing length(x).) The second line just applies the fitted model to each of the points we generated. The model, however, links a linear function of cell dose to the logit of the probability of engraftment, not to the engraftment rate directly. In order to convert the linear predictor, y, to a probability, we have to apply the inverse of the logit function. The third line defines the inverse logit as a function. The fourth line defines the logit function. This really isn't necessary, but it allows us to check that we really did get the inverse right, by a few calculations like these:

    > inv.logit(logit(.5))

    [1] 0.5

    > logit(inv.logit(2.5))

    [1] 2.5

Finally, the lines function was used to add a line to the plot. The line is really a bunch of segments connecting many dots, but it looks pretty smooth.


Let's use the calibration curve to evaluate the engraftment from cultured cells. According to table 4 of Shih et al., 52 out of 56 mice engrafted after receiving 10000 cultured cells. (This is a slightly different rate from that which led to the estimates quoted in the text, and above, but arguably as relevant.) If we draw a horizontal line at 52/56 ≈ 0.93 and note where that intersects the calibration curve, we can read off the equivalent dose of fresh cells.

    abline(h=52/56, lty=2)

The abline function is for drawing lines with intercept a and slope b, but here we use an extra argument, h, for specifying the height of a horizontal line. The lty=2 argument just specifies a broken line. (Note that arguments to R functions can be specified by position, useful for required arguments in the first positions, or by name, which is useful for skipping over optional arguments.)

We can find where the rate of 52/56 intersects the calibration curve using a linear interpolation function.

    approx(inv.logit(y), x, 52/56)

We see that this is at a cell dose of about 14250 fresh cells. (Slightly lower than the 16350 in figure 1.)

We can draw a line segment or arrow at that point with the following instruction. The arguments are, respectively, the x and y coordinates of the beginning and end of the arrow.

    arrows(14250, 52/56, 14250, 0)

The lower bound calculation involves consideration of several sources of error. For simplicity, we just consider one. The observation that 52 out of 56 animals engrafted may be accidentally optimistic. This is a very common situation, in which we have a number (56) of independent trials of an experiment yielding a binary result (engraftment or not) with the same probability of success on each trial. The total number of successes in such a situation is said to follow a binomial distribution. We can easily calculate a 95% lower confidence bound. This means that the method of calculating the bound will, in 19 out of 20 experiments, yield a bound that is in fact below the probability of engraftment.
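The same bound can be cross-checked outside the course software with nothing but the binomial formula. Here is a sketch in Python (not part of the course's R materials): it bisects for the success probability whose upper tail P(X ≥ 52) equals 0.05, which is the 95% lower confidence bound.

```python
from math import comb

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# The tail probability increases with p, so bisect for the p at which
# P(X >= 52 | n = 56, p) = 0.05; that p is the 95% lower bound.
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    if upper_tail(52, 56, mid) < 0.05:
        lo = mid  # tail too small: the bound is larger than mid
    else:
        hi = mid
lower_bound = (lo + hi) / 2  # about 0.844, agreeing with binom.test
```

This reproduces the Clopper-Pearson-style bound that R's binom.test reports, without any statistical library.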


    binom.test(52, 56, alt="greater")

The lower bound is approximately 0.844. The binom.test function reports a number of other things that we can ignore.

    We can do the same calibration exercise using this lower bound.

    abline(h=0.844, lty=2)

    approx(inv.logit(y), x, 0.844)

    arrows(9721, 0.844, 9721, 0)

Note that 93% engraftment is higher than any of the engraftment results used to make the calibration curve. This is extrapolation, and it is rather inaccurate due to the flatness of the curve in this region, and even dangerous, as it depends on the shape of the curve in unexplored territory. However, if we take a lower bound on a cultured cell engraftment rate, that will be within the range of the calibration data, where calibration is more reliable. Fortunately, the argument that culture expanded stem cells rests on concluding that the lower bound is high enough, and the calibration curve is adequate for this purpose.

    1.6.5 Documenting the analysis

One of the advantages of using a statistical programming language, as opposed to a typical statistics package, is that it is easy to document your calculations in a program. After an interactive session in which we explore the data, correct errors, and so forth, we can collect the key calculations as a list of R instructions, as shown below. The pound signs mark comments that might be useful later. Everything to the right of a pound sign is ignored by R.

    # Calibration calculations for determining equivalent fresh cells

    # from engraftment rates, using data from Shih et al., 1999.

    # Read the bone system data

    bone = read.csv("bone.csv", header=TRUE)

    # Fit a logistic regression

    bone.fit = glm(r/n ~ log(Cell.Dose),

    data=bone, family=binomial, weights=n)


    # Print the estimated model:

    print(bone.fit)

    # Start a plot

    plot(r/n ~ Cell.Dose, data=bone,

    log="x", xlim=c(300,20000), ylim=c(0,1))

    x = seq(from=300, to=20000, by=20)

    y = -19.357 + 2.292 * log(x) ### fit results are hard-coded here

    inv.logit = function(z){ exp(z)/(1 + exp(z))}

    logit = function(p) { log(p/(1 - p)) }

    lines(inv.logit(y) ~ x)

    # calibrate an engraftment rate

    abline(h=52/56, lty=2)

    approx(inv.logit(y), x, 52/56)

    arrows(14250, 52/56, 14250, 0) ### hard-coded calibration results

    # get a lower bound and calibrate that

    binom.test(52, 56, alt="greater")

    abline(h=0.844, lty=2) ### hard-coded lower-bound

    approx(inv.logit(y), x, 0.844)

    arrows(9721, 0.844, 9721, 0) ### hard-coded calibration results

These commands can be executed again. Paste these R instructions into a file, using a plain text editor, like Notepad. Save the file as, say, calibrate.R, in your working folder. In R, you can either go to the File pull-down, select Source File and navigate to calibrate.R, or you can simply type source("calibrate.R") on the command line. This will execute the instructions, which should produce a plot.

Major caveat: Documenting what you did with one set of data and writing a program to apply similar steps to new data are two different things. Note that several lines have been marked with a comment that something is hard-coded. This means that some result from earlier steps has been copied directly into the instruction. If one were to run these instructions with new data, these results would be incorrect. It is possible to turn a transcript of an interactive session into a re-usable computer program, but that involves more effort, additional programming techniques, judgement, and testing. The proper use of code that documents analysis steps is to read it, in order to understand any questions that arise about the details, and to re-use it only while reading it and thinking about it.


    Finishing up

Look up the title function in the R help pages. Use it to put a title on your plot, with your name, and turn it in.


Chapter 2

Data Summary


    Suggested reading: Samuels and Witmer (SW) Chapter 2.

    2.1 Summary Statistics

Summary statistics can be contrasted with inferential statistics. In summarizing a dataset, we are concerned with the data at hand. Inferential statistics arise when the data at hand are a sample of some larger population or process. We would like to have a summary of the population, but we have to settle for estimates based on a smaller sample.

    Statistics are numerical summaries of actual observations.

Parameters are features of an unobserved population. The population may be real and finite, or an indefinite number of potential results from some process.

We sometimes need to distinguish a sample mean from a population mean, or a sample standard deviation from a population standard deviation, but in this chapter, we will focus primarily on summaries of actual data.

    2.1.1 Kinds of Variables

We will use the term variable to refer to a well-defined measurement taken on each member of a sample of interest. A dataset is often represented in a computer file as a rectangular array, with columns representing variables and rows representing individuals that are observed or measured, and perhaps experimentally treated.

    An often-cited classification of variables is due to Stevens1:

    nominal, e.g. blood type (A, B, AB, O);

    ordinal, e.g. pathology scores (-, +, ++, +++);

interval, a variable with a well-defined unit, but without a well-defined origin, e.g. time, or Celsius temperature;

1. S.S. Stevens, "On the Theory of Scales of Measurement." Science, 1946, 103:677-680.


    ratio, a positive variable with a unit and an origin, e.g. weight or number.

We often speak of quantitative measurements, without distinguishing between interval and ratio scales, but some statistics only make sense for variables that are strictly positive, e.g. the coefficient of variation, which is a measure of variation expressed as a fraction of the mean. These require a ratio scale.

Discrete variables take values from a finite set. Continuous variables could, in principle, take a value between any other two values. In practice, there is always a limited resolution, and a finite scale. A categorical variable is a discrete variable that only takes a few possible values, so many observations will be in the same category. The pathology scores are an example of ordered categories. Sometimes observations can be ranked, i.e. put in order, so that there are few, if any, ties. There are special methods to deal with ranked data, and sometimes we decide to only pay attention to the rank order of quantitative data.

    2.1.2 The Mean

The mean (average, or arithmetic mean) is a common summary of quantitative measurements. It conveys an idea of what is typical, or central.

The mean of a set of n numbers is the total divided by n. This is usually what people mean when they say the average.

The mean is an equal share. If you split the bill at a restaurant equally, each person pays the mean cost.

Given a sample, x1, . . . , xn, the sample mean, usually denoted x̄, can be thought of as an equal share of the total,

    x̄ = (1/n) Σ xi,

or a weighted sum,

    x̄ = Σ wi xi,  where Σ wi = 1.


Example: Calculate the average number of alleles shared identical by descent at the DBQ1 locus in 278 pairs of sisters with cervical cancer, given the following data:

    xi    wi
    0     0.228 = 63/278
    1     0.457 = 127/278
    2     0.315 = 88/278

Here the xi are the three possible values for the number of shared alleles, and the wi are the fractions of the sample with the respective value of xi, so

    x̄ = Σ wi xi = (0.228)(0) + (0.457)(1) + (0.315)(2) = 1.087.

Expressing the mean as a weighted sum for grouped data like this amounts to using the distributive law to reduce our labor.
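The arithmetic can be checked in a few lines of code (Python here, rather than the course's R; note it uses the exact counts rather than the three-decimal weights, so the result, 303/278 ≈ 1.090, differs slightly in the third decimal from the 1.087 above).

```python
# Counts of sister pairs sharing 0, 1, or 2 alleles IBD, from the table.
counts = {0: 63, 1: 127, 2: 88}
n = sum(counts.values())                          # 278 pairs

# Grouped-data mean: total shared alleles divided by number of pairs.
mean = sum(x * c for x, c in counts.items()) / n

# Equivalently, a weighted sum with weights c/n that sum to 1.
weights = {x: c / n for x, c in counts.items()}
assert abs(sum(weights.values()) - 1) < 1e-12
assert abs(mean - sum(x * w for x, w in weights.items())) < 1e-12
```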

    Some things to notice about the mean:

A histogram balances when supported at the mean.

The sum of deviations from the mean is zero:

    Σ (xi − x̄) = 0.

The mean minimizes the sum of squared deviations, i.e.

    Σ (xi − k)²

is minimized by taking k = x̄.
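Both facts are easy to verify numerically. A quick check in Python (made-up numbers; any sample would do):

```python
xs = [2.0, 3.5, 7.0, 9.5]
xbar = sum(xs) / len(xs)

# The residuals (deviations from the mean) sum to zero.
assert abs(sum(x - xbar for x in xs)) < 1e-9

def ssd(k):
    """Sum of squared deviations around the point k."""
    return sum((x - k) ** 2 for x in xs)

# Nudging k away from xbar in either direction can only increase it.
assert all(ssd(xbar) <= ssd(xbar + d) for d in (-1.0, -0.1, 0.1, 1.0))
```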

The mean permits recovery of the total, if the number of observations is known. Given means and sizes of subgroups, a grand mean can be calculated.

The mean may fail to represent what is typical if there are a few extreme values. This can be a problem if a measuring instrument occasionally gives wild results, or if there are genuine but extreme values in the sample or population.


The mean of a zero-one indicator variable is the proportion of ones. [The notes show a figure here: a number line from 0 to 1, with the mean marked at the proportion of ones, 0.2.]

    Sample versus Population Mean

Let's revisit the allele-sharing example, but consider the weights to be the hypothetical neutral probabilities of sharing zero, one, or two alleles IBD.

    xi    wi
    0     1/4
    1     1/2
    2     1/4

It would be conventional to use the Greek letter μ to denote the mean, defined as

    μ = 0(1/4) + 1(1/2) + 2(1/4) = 1


because this is the mean of a hypothetical population of indefinite size, rather than the mean of a specific set of numbers. Sometimes this notion of a mean is called the expected value.
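The expected-value calculation has the same weighted-sum form, which a short Python sketch (using the neutral probabilities from the table) makes explicit:

```python
# Hypothetical neutral probabilities of sharing 0, 1, or 2 alleles IBD.
probs = {0: 0.25, 1: 0.5, 2: 0.25}

# The weights form a valid probability distribution: they sum to 1.
assert abs(sum(probs.values()) - 1) < 1e-12

# Population mean (expected value): the probability-weighted sum.
mu = sum(x * p for x, p in probs.items())
```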

    2.1.3 Other Notions of Typical

The median of a set of numbers has half of the numbers above and half below. If the number of observations is odd, it is the middle value. If there is an even number of observations, the convention is to take the average of the two middle values.

The median is resistant to perturbations. Several large observations can be made extremely large without affecting the median. The median is a good summary for things like individual or family incomes, because it remains representative of many actual values, even when a few values are extremely large.
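The convention is simple enough to state as code. A sketch in Python (the helper name median is ours; Python's statistics module provides an equivalent):

```python
def median(xs):
    """Middle value; average of the two middle values when n is even."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

assert median([3, 1, 2]) == 2             # odd n: the middle value
assert median([1, 2, 3, 100]) == 2.5      # even n: average of middle two

# Resistance: inflating the largest observation leaves the median alone.
assert median([1, 2, 3, 10**9]) == 2.5
```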

    Geometric mean

The geometric mean of two numbers, a and b, is √(ab). The geometric mean of three numbers, a, b, and c, is (abc)^(1/3).

Another way of thinking about these is that the geometric mean of a set of numbers is the mean of the logarithms of those numbers, transformed back to the original scale. In other words, the geometric mean is the antilog of the mean of logs.

For example, if a = 10, b = 100 and c = 1000, the mean of (common) logs is (1 + 2 + 3)/3 = 2, so the geometric mean is 10^2 = 100. Compare this to the arithmetic mean, (10 + 100 + 1000)/3 = 370. The geometric mean is less influenced by very large observations.

Note that the base of logarithms does not matter, so long as the same base is used for the anti-log. This is because logarithms to one base are constant multiples of logarithms to a different base, and averaging preserves that multiple.

Geometric means are common because analysis of data on logarithmic scales is often useful.
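The antilog-of-mean-of-logs definition, and the fact that the base does not matter, can be checked in a few lines of Python:

```python
import math

vals = [10, 100, 1000]

# Geometric mean via natural logs...
gm_e = math.exp(sum(math.log(v) for v in vals) / len(vals))
# ...and via common (base-10) logs: the same answer.
gm_10 = 10 ** (sum(math.log10(v) for v in vals) / len(vals))

assert abs(gm_e - 100) < 1e-9
assert abs(gm_10 - 100) < 1e-9

# The arithmetic mean, 370, is pulled much further toward the largest value.
am = sum(vals) / len(vals)
assert am == 370
```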


    Harmonic mean

The harmonic mean of a set of numbers is the reciprocal of the average of reciprocals.

If we have a fleet of three vehicles with respective gas mileages of 10, 20, and 30 miles per gallon, and we plan a trip of 30 miles, we expect to consume

    30/10 + 30/20 + 30/30 = 5.5

gallons of fuel. If all we knew about the fleet gas mileage was the arithmetic mean of 20 mpg, our estimate would be

    3 (30/20) = 4.5

gallons, which is too small. If, instead, we knew the harmonic mean, 3/(1/10 + 1/20 + 1/30) ≈ 16.36, we could calculate

    3 (30/16.36) = 5.50

gallons, which is as good as having the individual mileage numbers.

Both the harmonic mean and the geometric mean involve transforming data, calculating a mean, and transforming back to the original scale. However, the harmonic mean is probably not encountered as often as the geometric mean.
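The fuel example can be replayed in Python; the numbers below are the ones from the text:

```python
mpg = [10, 20, 30]   # fleet gas mileages, miles per gallon
trip = 30            # miles each vehicle drives

# Exact fuel consumption: each vehicle burns trip / mileage gallons.
fuel_exact = sum(trip / m for m in mpg)                    # 5.5 gallons

# Harmonic mean: reciprocal of the average of reciprocals.
hm = len(mpg) / sum(1 / m for m in mpg)                    # about 16.36 mpg

fuel_from_hm = len(mpg) * trip / hm                        # also 5.5 gallons
fuel_from_am = len(mpg) * trip / (sum(mpg) / len(mpg))     # 4.5: too small

assert abs(fuel_exact - 5.5) < 1e-12
assert abs(fuel_exact - fuel_from_hm) < 1e-12
assert abs(fuel_from_am - 4.5) < 1e-12
```

The harmonic mean reproduces the exact total because total fuel is a sum of reciprocals of mileage, which is exactly what the harmonic mean preserves.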

    Some exercises involving means

Exercise 2.1.1 (Restaurant Bill) Alice goes to dinner with three friends and orders an $18 meal. Her three friends each order the $14 special. Since Alice drove them all to the restaurant, one of Alice's friends proposes that they split the bill evenly into four equal shares. Suppose neither tip nor tax apply. (a) What is the average price of the four meals? (b) How much does each person contribute? (c) How much does Alice save by splitting the bill equally? (d) How much extra is it costing each of her friends?


Exercise 2.1.2 (Loaves) Here is a well-known2, but slightly more complex puzzle about sharing. Three travelers meet on a road and share a campfire. They also decide to share their evening meal. One of them has five small loaves of bread. The second has three similar loaves. The third has no food, but has eight coins. They agree to share the eight loaves equally, and the third traveler will pay the eight coins for his share of the bread. The second traveler (who had three loaves) suggests that he be paid three coins, and that the first traveler be paid five coins. The first traveler says that he should get more than five coins. Is he right? How should the money be divided up?

Exercise 2.1.3 (Clever Chemist) Long ago, a chemist wanted to weigh a sample as accurately as possible using an old twin-pan balance. With the sample in the left pan, the balancing weights in the right pan totaled A grams. With the sample in the right pan, the balancing weights in the left pan totaled B grams. Assuming the left and right lever arms of the balance are of lengths L and R, respectively, the balancing weights satisfy

    XL = AR

and

    XR = BL,

where X is the unknown weight of the sample. How should the two observations, A and B, be combined?

Exercise 2.1.4 (Sum of residuals, in general) Given sample measurements x1, x2, . . . , xn, their mean is

    x̄ = (1/n) Σ xi.

The sum of residuals (differences from the mean) is always zero, i.e.

    Σ (xi − x̄) = 0.

Give either an intuitive explanation, or a mathematical argument, or a graphical explanation for why this is so.

2. Jim Loy, http://www.jimloy.com/puzz/8loaves.htm


Exercise 2.1.5 (Rosner Table 2.4) The following table gives the distribution of minimal inhibitory concentrations (MIC) of penicillin G for N. gonorrhoeae.

    Concentration (µg/ml)        Frequency
    0.03125 = 2^0 (0.03125)      21
    0.0625  = 2^1 (0.03125)      6
    0.125   = 2^2 (0.03125)      8
    0.250   = 2^3 (0.03125)      19
    0.5     = 2^4 (0.03125)      17
    1.0     = 2^5 (0.03125)      3

Calculate the geometric mean. Why might the geometric mean be desirable, compared to the simple mean? Why might any mean be inadequate for summarizing these data?

    2.1.4 Measuring Variation

While the mean can convey an idea of a typical value, a second summary number is needed to provide an idea of the variation around that central number, and the standard deviation often serves this role.

The standard deviation is the root mean square of deviations around the mean.

This conceptual definition can be applied directly, if the list of numbers is the entire population of interest. We then call it the population standard deviation. However, if we are using a sample of n observations to estimate the amount of variation in a larger (perhaps infinite) population, we generally compute a sample standard deviation, which is inflated by a factor of √(n/(n − 1)). For large samples, this inflation factor makes very little difference.
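The two versions of the standard deviation, and the √(n/(n − 1)) factor relating them, can be checked directly. A small Python sketch with made-up data:

```python
xs = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(xs)
m = sum(xs) / n                                            # mean = 5

# Population SD: root mean square of deviations from the mean.
pop_sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5        # 2.0 here

# Sample SD: divide by n - 1 instead of n.
samp_sd = (sum((x - m) ** 2 for x in xs) / (n - 1)) ** 0.5

# The sample SD is the population SD inflated by sqrt(n / (n - 1)).
assert abs(samp_sd - pop_sd * (n / (n - 1)) ** 0.5) < 1e-12
```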

    Root Mean Square

As a preliminary, let's consider summarizing how large a set of numbers is, not in the usual sense where large means greater than zero, but in the sense of absolute value, where large means far from zero in either direction.

    The root mean square of a list of