ENGRD 2700 Lecture #1

ENGRD 2700 Engineering Probability and Statistics
Lecture 1: Introduction
David S. Matteson
School of Operations Research and Information Engineering
Rhodes Hall, Cornell University, Ithaca NY 14853 USA
[email protected]
January 20, 2009


DESCRIPTION

Lecture slides from ENGRD 2700 Basic Engineering Probability and Statistics

TRANSCRIPT



1. Why you can’t live without prob & stat

Basis for the scientific method, management science, intelligent empiricism.

Tidbits:

• Fact based decision making.

• Scientific decision making.

• In God we trust; everyone else bring data.

Conclusion: Hunches can be wrong, intuition can be fooled, instincts may be immature or just wrong.

Positive societal contribution; ubiquity:

• Drug testing: Drug companies are not allowed to market a drug until it is proven safe and effective via clinical trials. (Living link with history: SIR1 was part of the largest public health clinical trial ever, the Salk polio vaccine clinical trial.) The drug company employs statisticians to supervise the clinical trial; the FDA employs statisticians to review the drug company's results.

• Digital network traffic: Fundamentally different from POTS (plain old telephone) traffic. A prize-winning Bellcore study (1993) statistically analyzed data network traces and concluded that they exhibit characteristics not compatible with telephone models. Probability models were then constructed to explain the anomalies.


• Risk management: Governmental bodies set standards which control risk.

– Hydrology: 10,000 year flood. Build the dam so high that water will exceed the height on average only once per 10,000 years. (Small quantile estimation to the cognoscenti.)

– Finance: Value at risk: the bank's risk reserve needs to be high enough to cover a loss so big that it occurs with probability 1/10,000.

– Environment: A city is ruled out of compliance with EPA regulations if the pollutant concentration exceeds a specified level on more than 5% of the days in a year; that is, the standard is set so that the probability of exceeding it on a given day is 5%.

• Weather prediction: Chance of rain tomorrow is 53%.

– What does this mean?

But image problems persist:

• Pick your favorite quote:

– “There are lies, damn lies and statistics.” –Sir Winston Churchill

– “There are three kinds of lies: Lies, Damn Lies, and Statistics.” –Benjamin Disraeli

– “You can use statistics to prove anything.” –Homer Simpson

– “He uses statistics as a drunken man uses a lamppost - for support rather than illumination.” –Andrew Lang


– “Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital.” –Aaron Levenstein

– “It is easy to lie with statistics, but it is easier to lie without them.” –Frederick Mosteller

• Subjects perceived as dull. (Why?)

– Tell someone at a social gathering that you are studying statistics and watch their reaction. (Is the reaction the same if you say Engineering?)

– “A statistician is someone who is good with numbers, but lacks the personality to be an accountant.” (Or is it the other way round?)


• Statisticians are sometimes perceived as hired guns. (“My statistician can beat up your statistician.”) Statisticians are often used as

– expert witnesses in court.

– government witnesses in regulatory hearings.

Lowest blow

Best-selling book: How to Lie with Statistics

Figure 1: A best seller.


Figure 2: Blowup


2. What this course is about.

• Probability: Build models of random phenomena.

– Random phenomena: Phenomena whose outcomes cannot be predicted in advance.

* Outcome of the roll of a die.
* Outcome: tomorrow's stock price.
* Outcome: result of the upcoming presidential election.

– Caution: “All models are wrong; some models are useful.” –George Box

• Statistics: The science of organizing and summarizing data and using information in the data to draw conclusions; scientific extraction of information from data.

– Statisticians are consumers of probability.

– Statisticians fit parameters in probability models.

– Statisticians draw conclusions about a population based on the partial information contained in a sample. The manner in which the conclusions are drawn uses probability tools and reasoning. (The sampling error in the estimate of p is 5%.)


3. More Details


More Details (cont)

• Population: a well-defined set of items under attention and discussion. (It is clear how to decide if an item is or is not in the population.)

• Sample: a well-defined subset of the population that has been selected for study or measurement.

• Observation: an individual measurement from a sample.

Examples:

• Population: all US universities.
Sample: Ivy League universities.

• Population: all voters in the US.
Sample: all voters with incomes over $200,000.

• Population: Yearly best times in seconds (minus 3 minutes) in the mile run over 121 years (ending in mid-80's).
Sample: Yearly best times which were record times during this time period.


Figure 3: Time series plot of yearly best times in mile. (Horizontal axis: Time, 0 to 120; vertical axis: mile, 50 to 100.)


More on populations and samples:

Four Stages in statistical analysis:

• Precisely define the population and then formulate clear answerable questions about the population.

• Collect data to help answer the questions; requires sampling scheme and experimental design.

• Exploratory step: Describe and present data using the tools of

– graphics

– descriptive statistics such as numerical summaries of data: mean, median, variance, quantiles, . . . .

• Confirmatory step; formal inference: Analyze the data to draw conclusions about the population.
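As a rough illustration of the numerical summaries named in the exploratory step, the sketch below computes them for the 14 exam scores used later in this lecture. It is written in Python with the standard library `statistics` module rather than the R used in the slides. The variable names are illustrative, and `variance` here is the sample variance (divisor n - 1), which is an assumption about which variance is meant.

```python
import statistics

# Exam scores from the stem-and-leaf example in this lecture.
grades = [48, 63, 67, 69, 70, 73, 76, 79, 79, 80, 80, 83, 88, 95]

mean = statistics.mean(grades)          # arithmetic average
median = statistics.median(grades)      # middle value (average of the two middle values here)
variance = statistics.variance(grades)  # sample variance, divisor n - 1
# Quartiles; several conventions exist, this uses the module's default method.
quartiles = statistics.quantiles(grades, n=4)

print("mean:", mean)
print("median:", median)
print("sample variance:", round(variance, 2))
print("quartiles:", quartiles)
```

Different packages use slightly different quantile conventions, so quartiles from R or Minitab may not match these values exactly.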


More on populations and samples:

Types of populations and sets:

1. Large but finite population: Difficult to take a census (i.e., sample the whole population). Thus, typically only a relatively small number of items are sampled.

• Population of the US. Actual census only every 10 years.

• Even in the presidential elections, the sample consisting of eligible voters who vote is a relatively small percentage of the population.

• Only 53.9 percent of the voting-age population actually voted in 1980 when Ronald Reagan was elected.

• Pollsters predict outcomes based on much smaller samples of the order of hundreds.

2. Small and finite population: can be sampled in its entirety.

• Average age for ENGRD 2700 students this semester.

• Average number of publications of ORIE professors. (∞)


3. Abstracted population represented by an infinite set, say an interval of real numbers (or worse).

• Population representing time to failure of a machine component; population could be represented by the set of all nonnegative real numbers [0, ∞).

• Population representing the study of the point of maximum pollution concentration in a city; population could be represented by a two dimensional set

{(x, y) : 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}.


Types of samples:

• A simple random sample of size n: a sample of size n in which each subset of size n in the population has the same likelihood of being selected.

– Problem with definition:

* Likelihood is undefined.
* For populations from continuous infinite sets, the likelihood may be zero; e.g., the probability of drawing a sample of size 2 from [0, 1] and getting {1/4, 3/4} is 0.

• Stratified random sample: The population consists of sub-populations, and a random sample is drawn from each. E.g.: Take a sample of size 700 from the Cornell student population by sampling 100 from each college. (Useful to predict outcome of union vote?)

• Biased sample: The samples are clearly dependent and unrepresentative.

– Population: residents of Florida. Determine the average income of residents of Florida. Biased sample: NBA players living in Florida.
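The difference between a simple random sample and a stratified random sample can be sketched in a few lines of code. The example below is in Python rather than the R used in the slides. The population is entirely made up: seven hypothetical colleges of 1,000 students each, with invented labels. `random.Random.sample` draws without replacement so that every subset of the requested size is equally likely, which matches the finite-population definition above.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n items without replacement; every subset of size n is equally likely."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a separate simple random sample of fixed size from each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

rng = random.Random(2700)  # fixed seed only so the sketch is repeatable

# Hypothetical population: 7 colleges with 1,000 students each (labels are invented).
strata = {f"college_{k}": [f"c{k}_s{i}" for i in range(1000)] for k in range(1, 8)}
population = [s for members in strata.values() for s in members]

srs = simple_random_sample(population, 700, rng)  # college counts vary by chance
strat = stratified_sample(strata, 100, rng)       # exactly 100 from every college

print(len(srs), sum(len(v) for v in strat.values()))
```

Both samples have 700 students, but only the stratified one guarantees that each college contributes exactly 100.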

More later. Meanwhile, read the book. Carefully!


4. Need for descriptive statistics; exploratory data analysis.

Most (not all) populations are coded by numbers.

• Population of US Voters

– Vote to re-elect Mayor = 1

– Vote to elect challenger = 0

• Epidemiology

– Infected = 0

– Not infected = 1

• Failure times; population is all nonnegative real numbers.

• Data networks: Observe packet counts per unit time.

• Finance: Observe a stock index such as the Dow Jones (DJ).

Exceptions:

• Makes of cars sold in the US

{Buick Le Sabre, Volkswagen Passat, VW Golf, Honda Accord, Honda Civic, . . . }.

• Marketing: Brands of toothpaste {Colgate, Crest, Tom’s, . . . }.


Conclusion: Sampling often yields a set of numbers. Sometimes the set is large. We need to make sense of the set of numbers.

Pay tribute to the computer

The computer makes handling big data sets tolerable; 30 years ago statistics was frequently done

• By hand (pencil & paper, abacus, slide rule) or with a mechanical calculator. (Hence elementary stat texts had examples consisting of data sets of 5 points.)

Then

• Electronic hand calculator. (Tedious to enter data.)

Then

• Computer.


Large data sets becoming increasingly common.

• Walmart records and stores a record of each transaction. What to do with so much data? ⇒ emerging field of data mining.

• Internet joke (but true):
Question: What network does a research physicist use to ship his data from one facility to another?
Answer: UPS; they ship the hard drive from one location to another by UPS.

• In Internet studies, sniffers and other automatic data collection devices give arbitrarily large data sets.

• In finance, there is high resolution data (recordings at very small time intervals) and even tick-by-tick data.

Example: Trace is packet counts per 100 milliseconds = 1/10 second for Financial Company X's wide area network link including USA-UK traffic. Length of dataset = 288,009; 8 hours of collection from 9am–5pm. Top plot too muddy; bottom represents subsets sized 20,000.


Conclusion: How do we make sense of large (and small) amounts of data? ⇒ First step: descriptive statistics and exploratory data analysis.

Descriptive statistics: organization and summarization of large amounts of data for the purposes of drawing conclusions. Use

• Graphics

• Summary statistics (mean, median, variance, ...)

Use of descriptive statistics is often a first step in an exploratory data analysis (EDA). Somewhat informal, pictorial.

Formal inference: Draw scientific inferences about the population from the data. More formal methods. For example, we can formally test hypotheses. (Sometimes the hypotheses are suggested by the EDA.)

Most famous clinical trial: Salk vaccine given to a sample of kids in the early '50s, with a control group receiving a placebo. The formal hypothesis was that the Salk vaccine was more effective than chance alone could explain.


5. Describing sample data

Use of graphs and other descriptive statistics.

5.1. Stem-and-leaf plot

Older method originated for small univariate data sets when analysis was often by hand. This is just a clever arrangement of the data values to reflect the shape of the distribution.

• Advantage: simple, quick, easy to construct.

• Disadvantage: a little primitive; doesn't capitalize on graphics capabilities of packages and computers.

Procedure: given data x1, . . . , xn where each xi consists of (at least) 2 digits.

1. Split each xi into a stem of leading digits and a leaf of the remaining digits.

2. List the stem values in the left hand margin column and to the right list the leaves corresponding to each stem, listed in the order they are encountered in the data set.
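The two-step procedure above can be sketched directly in code. This is a minimal illustration in Python (not the R used in the slides), assuming nonnegative integer data with the last digit as the leaf. The function name `stem_and_leaf` is made up for this sketch. Stems with no leaves are still listed, so gaps in the data remain visible.

```python
def stem_and_leaf(data):
    """Return the lines of a stem-and-leaf display for nonnegative integers.

    The stem is every digit except the last; the leaf is the last digit.
    Leaves appear in the order the values are encountered in the data.
    """
    leaves = {}
    for x in data:
        stem, leaf = divmod(x, 10)  # step 1: split into stem and leaf
        leaves.setdefault(stem, []).append(leaf)

    lines = []
    # Step 2: list every stem from smallest to largest, including empty ones.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem} | {digits}".rstrip())
    return lines

grades = [48, 63, 67, 69, 70, 73, 76, 79, 79, 80, 80, 83, 88, 95]
for line in stem_and_leaf(grades):
    print(line)
```

For data with three or more varying digits, a rounding or truncation rule would be needed before splitting; that is left out here.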


A simple illustration: Suppose student scores on an exam are 48, 63, 67, 69, 70, 73, 76, 79, 79, 80, 80, 83, 88, 95. A stem-and-leaf plot is below:

9 | 5
8 | 0038
7 | 03699
6 | 379
5 |
4 | 8

Positive features:

• The entire data set can be read with ease from the display.

• Gives a clear indication of the shape of the distribution of data values.

R does this automatically with the following commands:

> grades <- c(48, 63, 67, 69, 70, 73, 76, 79, 79, 80, 80, 83, 88, 95)
> stem(grades)

  4 | 8
  5 |
  6 | 379
  7 | 03699
  8 | 0038
  9 | 5


Minitab output (check the drop down menu under GRAPH)

Stem-and-Leaf Display: C1

Stem-and-leaf of C1  N = 14
Leaf Unit = 1.0

  1   4  8
  1   5
  4   6  379
 (5)  7  03699
  5   8  0038
  1   9  5

Minitab output: extra column on the left.

Features:

• The number in parentheses gives the number of observations on the line that contains the median (or the middle value).

• The “4” in the row above that gives the total number of observations in the first three rows, i.e. there were 4 scores below 70.

• The “1” above that indicates that there was one score in the 50's or below.

• The “5” below the median line indicates that there are 5 scores in the 80's and above.

• The “1” on the last line indicates one score ≥ 90.
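The extra left-hand column can be reproduced from the leaf counts per row. The sketch below is written in Python, not Minitab or R, and the rule it encodes is inferred from the features listed above rather than taken from Minitab's documentation. A row gets its own count in parentheses when fewer than half the observations lie strictly above it and fewer than half strictly below it. Every other row gets the cumulative count from the nearer end. The function name `depth_column` is made up for this sketch.

```python
def depth_column(counts):
    """Left-hand column of a Minitab-style stem-and-leaf display.

    counts: number of leaves on each row, listed from the top row down.
    The rule is an assumption inferred from the lecture's description.
    """
    n = sum(counts)
    depths = []
    above = 0  # observations on rows before the current one
    for c in counts:
        below = n - above - c  # observations on rows after the current one
        if above < n / 2 and below < n / 2:
            depths.append(f"({c})")  # row containing the median
        else:
            # cumulative count from whichever end of the display is nearer
            depths.append(str(min(above + c, below + c)))
        above += c
    return depths

# Leaf counts for stems 4, 5, 6, 7, 8, 9 in the exam-score example.
print(depth_column([1, 0, 3, 5, 4, 1]))
```

If the median falls between two rows, this rule marks no row with parentheses.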


Note

• The help file gives a detailed explanation.

• More extensive data sets - particularly those in which three or more digits vary - are dealt with in a variety of ways, but all are similar to the simple case shown above.


Contents

The Greatest

Scope

More Details

EDA

Describing sample data