1 Statistical Distribution Fitting Dr. Jason Merrick

Upload: noah-hill

Post on 13-Jan-2016


Page 1: 1 Statistical Distribution Fitting Dr. Jason Merrick

1

Statistical Distribution Fitting

Dr. Jason Merrick

Page 2: 1 Statistical Distribution Fitting Dr. Jason Merrick

Simulation with Arena — Statistical Distribution Fitting C5/2

Some Issues in Fitting Input Distributions

• Not an exact science — no “right” answer

• Consider physical or logical process that generates the data

• Consider range of distribution

– Infinite both ways (e.g., normal)

– Positive (e.g., exponential, gamma)

– Bounded (e.g., beta, uniform)

• Consider ease of parameter manipulation to affect means, variances - decision variables

• Outliers, multimodal data

– Maybe split data set (see textbook for details)

– Consider theoretical vs. empirical

Page 3: 1 Statistical Distribution Fitting Dr. Jason Merrick


Eyeballing

• One way to see if a sample of data fits a distribution is to

– draw a frequency histogram

– estimate the parameters of the possible distribution

– draw the probability density function

– see if the two shapes are similar

[Figure: frequency histogram; x-axis: data values, y-axis: frequency]
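The eyeballing steps above can be sketched numerically. This is a minimal illustration, not part of the original slides: it assumes an exponential candidate distribution and synthetic data, and the function names are hypothetical. It prints the histogram density next to the fitted pdf at each bin centre so the two shapes can be compared.

```python
import math
import random

def histogram_density(data, num_bins):
    """Return (bin_centers, densities) for an equal-width histogram."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in data:
        # Clamp so that the maximum value lands in the last bin.
        idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    n = len(data)
    centers = [lo + (i + 0.5) * width for i in range(num_bins)]
    # Dividing by n * width scales the bars so they integrate to 1, like a pdf.
    densities = [c / (n * width) for c in counts]
    return centers, densities

def exponential_pdf(x, rate):
    """pdf of the exponential distribution with the given rate."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

random.seed(1)
sample = [random.expovariate(0.5) for _ in range(500)]  # synthetic data (assumption)
rate_hat = 1.0 / (sum(sample) / len(sample))            # estimate: 1 / sample mean

centers, densities = histogram_density(sample, 10)
for c, d in zip(centers, densities):
    print(f"x={c:6.2f}  histogram={d:.3f}  fitted pdf={exponential_pdf(c, rate_hat):.3f}")
```

If the two columns track each other closely, the candidate distribution is at least plausible; large systematic gaps suggest trying another family.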

Page 4: 1 Statistical Distribution Fitting Dr. Jason Merrick


Chi-Squared Test

• Formalizes this notion of distribution fit

– $O_i$ represents the number of observed data values in the i-th interval.

– $p_i$ is the probability of a data value falling in the i-th interval under the hypothesized distribution.

– So we would expect to observe $E_i = n p_i$, if we have $n$ observations

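[Figure: frequency histogram (frequency vs. data values) with the observed count $O_i$ marked, and the pdf (pdf vs. data values) with the interval probability $p_i$ marked]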

Page 5: 1 Statistical Distribution Fitting Dr. Jason Merrick


Chi-Squared Test

• So the chi-squared statistic is

$$\chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

• By assuming that the $O_i - E_i$ terms are normally distributed,

– it can be shown that the distribution of the statistic is approximately chi-squared with $k-s-1$ degrees of freedom

– $s$ is the number of parameters of the distribution

• Hint: consider

$$\chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - n p_i)^2}{n p_i}$$
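The statistic can be computed directly from the observed counts and the interval probabilities. The sketch below is illustrative only: the function name and the counts are made up, and $E_i = n p_i$ is taken from the definition on the previous slide.

```python
def chi_squared_statistic(observed, probabilities):
    """Sum of (O_i - E_i)^2 / E_i, with E_i = n * p_i."""
    n = sum(observed)
    stat = 0.0
    for o_i, p_i in zip(observed, probabilities):
        e_i = n * p_i  # expected count in interval i
        stat += (o_i - e_i) ** 2 / e_i
    return stat

# Made-up counts for k = 4 equal-probability intervals (n = 40).
observed = [12, 8, 11, 9]
probabilities = [0.25] * 4
print(chi_squared_statistic(observed, probabilities))
```

Each term is a squared deviation scaled by its expected count, so intervals with small $E_i$ can dominate the sum.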

Page 6: 1 Statistical Distribution Fitting Dr. Jason Merrick


Chi-Squared Test

• So the hypotheses are

– H0: the random variable, X, conforms to the distributional assumption with parameters given by the parameter estimates.

– H1: the random variable does not conform.

• The critical value is then $\chi^2_{\alpha,\,k-s-1}$, which is also the $100(1-\alpha)\%$-quantile of a gamma distribution with shape $(k-s-1)/2$ and rate $1/2$ (that is, scale 2).

– Reject if $\chi_0^2 > \chi^2_{\alpha,\,k-s-1}$

• This gives a test with significance level $\alpha$.

– But what about the power of the test?
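The rejection rule can be sketched without a statistics library in one special case. When the degrees of freedom $k-s-1$ are even, say $2m$, the chi-squared upper tail has the closed form $P(\chi^2_{2m} > x) = e^{-x/2} \sum_{j=0}^{m-1} (x/2)^j / j!$. Rejecting when this tail probability (the p-value) falls below $\alpha$ is equivalent to comparing the statistic with the critical value. The function names are hypothetical, and the even-degrees-of-freedom restriction is an assumption of this sketch, not of the test.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """P(chi-squared with df degrees of freedom > x), valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    m = df // 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(m))

def reject_fit(chi2_stat, k, s, alpha=0.05):
    """True if the hypothesized distribution is rejected at level alpha."""
    df = k - s - 1
    return chi2_upper_tail_even_df(chi2_stat, df) < alpha

# Example: k = 8 intervals and s = 1 estimated parameter give 6 degrees of freedom.
print(reject_fit(3.2, k=8, s=1), reject_fit(20.0, k=8, s=1))
```

For odd degrees of freedom, a table of critical values or a statistics package would be needed instead.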

Page 7: 1 Statistical Distribution Fitting Dr. Jason Merrick


Chi-Squared Test

• If the expected frequencies $E_i$ are too small, then the test statistic will not reflect the departure of the observed from the expected frequencies.

– The test can reject because of noise

– In practice a minimum of $E_i \ge 5$ is used

– If $E_i$ is too small for a given interval, then adjacent intervals can be combined

• For discrete distributions

– each possible discrete value can be a class interval

– combine adjacent values if the $E_i$'s are too small
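Combining adjacent intervals can be sketched as a single left-to-right pass. This is one simple merging strategy among several, with a hypothetical function name and made-up counts: intervals are accumulated until the running expected count reaches the minimum, and any leftover tail is folded into the last merged interval.

```python
def merge_small_intervals(observed, expected, min_expected=5.0):
    """Combine adjacent intervals until every merged E_i >= min_expected."""
    merged_obs, merged_exp = [], []
    acc_obs, acc_exp = 0, 0.0
    for o_i, e_i in zip(observed, expected):
        acc_obs += o_i
        acc_exp += e_i
        if acc_exp >= min_expected:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs, acc_exp = 0, 0.0
    if acc_exp > 0 and merged_exp:
        # Fold a leftover tail that is still too small into the last interval.
        merged_obs[-1] += acc_obs
        merged_exp[-1] += acc_exp
    elif acc_exp > 0:
        # Nothing reached the minimum: everything becomes one interval.
        merged_obs.append(acc_obs)
        merged_exp.append(acc_exp)
    return merged_obs, merged_exp

observed = [3, 9, 12, 4, 1, 1]               # made-up observed counts
expected = [2.5, 8.0, 11.0, 5.5, 2.0, 1.0]   # made-up expected counts
print(merge_small_intervals(observed, expected))
```

After merging, the number of intervals $k$ used for the degrees of freedom is the merged count, not the original one.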

Page 8: 1 Statistical Distribution Fitting Dr. Jason Merrick


Chi-Squared Test

• For continuous data

– intervals that give equal probabilities should be used, not equal length intervals

– this gives a better power for the test

• the power of a test is the probability of rejecting a false hypothesis

– it is not known what probability gives the highest power, but we want $E_i \ge 5$:

$$E_i = n p_i \ge 5, \qquad p_i = \frac{1}{k} \;\Rightarrow\; \frac{n}{k} \ge 5 \;\Rightarrow\; k \le \frac{n}{5}$$

Page 9: 1 Statistical Distribution Fitting Dr. Jason Merrick


Chi-Squared Test

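• Example: the exponential distribution

– Suppose that we have n observations, possibly exponential

– We estimate $\lambda$ using the data: $\hat{\lambda} = 1/\bar{X}$

– So we must use $k \le 10$ intervals (from $k \le n/5$, which corresponds to $n = 50$), so we choose 8 to get $p = 0.125$

– To find the endpoints of the i-th interval, $[a_{i-1}, a_i)$:

$$F(a_i) = 1 - e^{-\hat{\lambda} a_i} = i p$$

$$e^{-\hat{\lambda} a_i} = 1 - i p$$

$$a_i = -\frac{1}{\hat{\lambda}} \ln(1 - i p)$$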
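The endpoint formula translates into a few lines of code. The sketch below uses synthetic data (50 draws from an exponential with mean 10, an assumption made only for illustration) and hypothetical function names; it builds the $k = 8$ equal-probability intervals from $a_i = -\ln(1 - i p)/\hat{\lambda}$ and counts the observations in each.

```python
import math
import random

def exponential_endpoints(rate_hat, k):
    """Endpoints a_0..a_k of k equal-probability intervals; a_0 = 0, a_k = infinity."""
    p = 1.0 / k
    inner = [-math.log(1.0 - i * p) / rate_hat for i in range(1, k)]
    return [0.0] + inner + [math.inf]

def count_in_intervals(data, endpoints):
    """O_i: number of observations falling in [a_{i-1}, a_i)."""
    counts = [0] * (len(endpoints) - 1)
    for x in data:
        for i in range(len(counts)):
            if endpoints[i] <= x < endpoints[i + 1]:
                counts[i] += 1
                break
    return counts

rng = random.Random(7)
data = [rng.expovariate(0.1) for _ in range(50)]  # synthetic sample (assumption)
rate_hat = 1.0 / (sum(data) / len(data))          # lambda-hat = 1 / X-bar
endpoints = exponential_endpoints(rate_hat, k=8)
observed = count_in_intervals(data, endpoints)

print(", ".join(f"{a:.2f}" for a in endpoints))
print(observed, "expected per interval:", len(data) / 8)
```

With $n = 50$ and $k = 8$, every interval has expected count $n p = 6.25$, which satisfies the $E_i \ge 5$ rule.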

Page 10: 1 Statistical Distribution Fitting Dr. Jason Merrick


Eyeballing

• Another method of seeing if a distribution fits sample data is the q-q plot

– $x$ is the q-quantile of a random variable $X$ with cdf $F$ if $F(x) = q$ or $x = F^{-1}(q)$

– Take a data sample $\{x_1, \ldots, x_n\}$ and order them to get $y_1 \le y_2 \le \ldots \le y_n$

– $y_j$ is an estimate of the $(j - 0.5)/n$ quantile

– Plot $y_j$ versus $F^{-1}((j - 0.5)/n)$

– This should give a straight line

[Figure: q-q plot; x-axis: Order Statistics, y-axis: Exponential Quantile]
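The points of a q-q plot can be computed as follows. This sketch assumes an exponential candidate fitted by the sample mean, so that $F^{-1}(q) = -\bar{X} \ln(1 - q)$; the function name is hypothetical, and the ten data values are the order statistics tabulated on the Kolmogorov-Smirnov slide of this deck.

```python
import math

def exponential_qq_points(sample):
    """Pairs (y_j, F^{-1}((j - 0.5)/n)) for an exponential fitted by its mean."""
    ordered = sorted(sample)
    n = len(ordered)
    mean_hat = sum(ordered) / n
    points = []
    for j, y_j in enumerate(ordered, start=1):
        q = (j - 0.5) / n
        quantile = -mean_hat * math.log(1.0 - q)  # inverse exponential cdf
        points.append((y_j, quantile))
    return points

order_stats = [0.0307, 0.4838, 1.3364, 2.0778, 2.1446,
               2.7039, 3.0289, 4.4276, 4.9919, 5.5265]
for y_j, quantile in exponential_qq_points(order_stats):
    print(f"{y_j:8.4f}  {quantile:6.2f}")
```

Plotting the first column against the second and checking for an approximately straight line is the eyeball test described above.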

Page 11: 1 Statistical Distribution Fitting Dr. Jason Merrick


Eyeballing

• Note:

– Will never actually be a straight line

– Order statistics are not independent

– One point above line will likely be followed by another

– The variance at the extremes is larger

– So for exponential, you will likely see more discrepancy at larger values

[Figure: q-q plot; x-axis: Order Statistics, y-axis: Exponential Quantile]

Page 12: 1 Statistical Distribution Fitting Dr. Jason Merrick


Kolmogorov-Smirnov Test

• Formalizes the idea of a q-q plot

– The scales are changed by applying the CDF to each axis

– $D^+ = \max_j \{(j - 0.5)/n - F(y_j)\}$

– $D^- = \max_j \{F(y_j) - (j - 1 - 0.5)/n\}$

– Note that there are no $D^+$'s for some observations

– The test statistic is given by $D = \max\{D^+, D^-\}$

[Figure: plot of (j - 0.5)/n against F(Order Statistics); both axes run from 0 to 1]

j    Order Statistic  Exponential Quantile  F(Order Statistic)  (j-0.5)/n  D+     D-
1    0.0307           0.14                  0.01                0.05       0.04   -
2    0.4838           0.43                  0.17                0.15       -0.02  0.12
3    1.3364           0.77                  0.39                0.25       -0.14  0.24
4    2.0778           1.15                  0.54                0.35       -0.19  0.29
5    2.1446           1.60                  0.55                0.45       -0.10  0.20
6    2.7039           2.14                  0.64                0.55       -0.09  0.19
7    3.0289           2.81                  0.68                0.65       -0.03  0.13
8    4.4276           3.71                  0.81                0.75       -0.06  0.16
9    4.9919           5.08                  0.85                0.85       0.00   0.10
10   5.5265           8.01                  0.87                0.95       0.08   0.02
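The $D^+$ and $D^-$ formulas on this slide can be sketched as below. The function name is hypothetical, and the cdf is assumed to be exponential with mean equal to the sample mean. Note that this follows the slide's $(j - 0.5)/n$ convention; the usual textbook definition of the Kolmogorov-Smirnov statistic uses $j/n$ and $(j - 1)/n$ instead.

```python
import math

def ks_statistic(sample, cdf):
    """Return (D+, D-, D) using the (j - 0.5)/n positions from the slide."""
    ordered = sorted(sample)
    n = len(ordered)
    d_plus = max((j - 0.5) / n - cdf(y) for j, y in enumerate(ordered, start=1))
    # D- is skipped for j = 1, where (j - 1 - 0.5)/n would be negative.
    d_minus = max(
        (cdf(y) - (j - 1 - 0.5) / n for j, y in enumerate(ordered, start=1) if j > 1),
        default=0.0,
    )
    return d_plus, d_minus, max(d_plus, d_minus)

# Order statistics from the table on this slide.
order_stats = [0.0307, 0.4838, 1.3364, 2.0778, 2.1446,
               2.7039, 3.0289, 4.4276, 4.9919, 5.5265]
mean_hat = sum(order_stats) / len(order_stats)

def fitted_cdf(y):
    """Exponential cdf with mean equal to the sample mean (assumption)."""
    return 1.0 - math.exp(-y / mean_hat)

d_plus, d_minus, d = ks_statistic(order_stats, fitted_cdf)
print(f"D+ = {d_plus:.2f}, D- = {d_minus:.2f}, D = {d:.2f}")
```

The resulting values agree with the largest entries in the D+ and D- columns of the table, up to the table's two-decimal rounding.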

Page 13: 1 Statistical Distribution Fitting Dr. Jason Merrick


Comparing the Two Tests

• The Chi-Squared Test

– Not just a maximum deviation, but a sum of squared deviations

– Uses more of the information in the data

– So it needs more data to be accurate

– Is more accurate if it has enough data

• The Kolmogorov-Smirnov Test

– Just a maximum deviation

– Needs less data to be accurate

– Is less accurate with more data

Page 14: 1 Statistical Distribution Fitting Dr. Jason Merrick


Empirical Distribution

• “Fit” Empirical distribution (continuous or discrete): Fit/Empirical

– Can interpret results as a Discrete or Continuous distribution

• Discrete: get pairs (Cumulative Probability, Value)

• Continuous: Arena will linearly interpolate within the data range according to these pairs (so you can never generate values outside the range, which might be good or bad)

– Empirical distribution can be used when “theoretical” distributions fit poorly, or intentionally

– When sampling from the empirical distribution, you are just re-sampling from the data
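The two interpretations can be sketched as follows. This is an illustration of the general idea only, not Arena's implementation: the function names are hypothetical, the data are made up, and the continuous version assumes cumulative probabilities spaced evenly over the sorted observations, which may differ from the pairs Arena reports.

```python
import random

def sample_continuous_empirical(sorted_data, rng=random):
    """Inverse-transform sample, interpolating linearly between sorted values."""
    n = len(sorted_data)
    u = rng.random() * (n - 1)   # position along the n - 1 gaps between values
    i = int(u)                   # index of the lower neighbouring value
    frac = u - i
    if i >= n - 1:               # defensive guard for the upper end
        return sorted_data[-1]
    return sorted_data[i] + frac * (sorted_data[i + 1] - sorted_data[i])

def sample_discrete_empirical(data, rng=random):
    """Re-sample one of the observed values with equal probability."""
    return rng.choice(data)

observed = sorted([2.1, 0.4, 5.3, 1.7, 3.8, 0.9])  # made-up data
rng = random.Random(3)
print(sample_continuous_empirical(observed, rng))
print(sample_discrete_empirical(observed, rng))
```

Both samplers stay inside the observed range, which is the "good or bad" property noted above: no value smaller than the minimum or larger than the maximum is ever generated.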

Page 15: 1 Statistical Distribution Fitting Dr. Jason Merrick


No Data?

• Happens more often than you’d like

• No good solution; some (bad) options:

– Interview “experts”

• Min, Max: Uniform

• Avg., % error or absolute error: Uniform

• Min, Mode, Max: Triangular

– Mode can be different from Mean, which allows asymmetry

– Interarrivals - independent, stationary

• Exponential - still need some value for mean

– Number of “random” events in an interval: Poisson

– Sum of independent “pieces”: normal

– Product of independent “pieces”: lognormal
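Turning expert estimates into samples is straightforward with Python's standard `random` module. The numbers below are made-up expert answers used only for illustration, and `triangular_mean` is a hypothetical helper showing how the mode differs from the mean.

```python
import random

def triangular_mean(low, high, mode):
    """Mean of a triangular distribution: (low + mode + high) / 3."""
    return (low + mode + high) / 3.0

rng = random.Random(42)

# Expert gives only a minimum and a maximum: uniform.
service_uniform = rng.uniform(2.0, 6.0)

# Expert gives minimum, maximum and most likely value: triangular.
# Note the argument order of random.triangular is (low, high, mode).
service_triangular = rng.triangular(2.0, 6.0, 3.0)

# Independent, stationary interarrivals with an assumed mean of 4.0 time units.
interarrival = rng.expovariate(1.0 / 4.0)

print(service_uniform, service_triangular, interarrival)
print("triangular mean:", triangular_mean(2.0, 6.0, 3.0))
```

Here the mode is 3.0 while the mean is about 3.67, which is the asymmetry the triangular distribution allows.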

Page 16: 1 Statistical Distribution Fitting Dr. Jason Merrick


Multivariate and Correlated Input Data

• Usually we assume that all generated random observations across a simulation are independent (though from possibly different distributions)

• Sometimes this isn’t true:

– If a clerk starts to get long jobs, they may get tired and slow down

– A “difficult” part requires long processing in both the Prep and Sealer operations

• Ignoring such relations can invalidate model

Page 17: 1 Statistical Distribution Fitting Dr. Jason Merrick


Checking for Auto-Correlation

• Suppose we have a series of inter-arrival times

– What is the relationship between the j-th observation and the (j-1)st?

– What is the relationship between the j-th observation and the (j-2)nd?

• We are talking about auto-correlation as the series is correlated with itself

• How many steps back we are looking is called the lag

Page 18: 1 Statistical Distribution Fitting Dr. Jason Merrick


Auto-Correlation

Lag               1         2         3         4         5
Auto-correlation  0.161689  -0.11597  -0.1101   0.020206  0.140538

Obs.  Lag 0      Lag 1      Lag 2      Lag 3      Lag 4      Lag 5
1     7.636883   -          -          -          -          -
2     0.62654    7.636883   -          -          -          -
3     10.54177   0.62654    7.636883   -          -          -
4     4.25373    10.54177   0.62654    7.636883   -          -
5     6.015199   4.25373    10.54177   0.62654    7.636883   -
6     0.879388   6.015199   4.25373    10.54177   0.62654    7.636883
7     0.728755   0.879388   6.015199   4.25373    10.54177   0.62654
8     1.144225   0.728755   0.879388   6.015199   4.25373    10.54177
9     0.409323   1.144225   0.728755   0.879388   6.015199   4.25373
10    0.953624   0.409323   1.144225   0.728755   0.879388   6.015199
11    3.772148   0.953624   0.409323   1.144225   0.728755   0.879388
12    4.628748   3.772148   0.953624   0.409323   1.144225   0.728755
13    7.916579   4.628748   3.772148   0.953624   0.409323   1.144225
14    0.133024   7.916579   4.628748   3.772148   0.953624   0.409323
15    0.264536   0.133024   7.916579   4.628748   3.772148   0.953624
16    1.836931   0.264536   0.133024   7.916579   4.628748   3.772148
17    7.046523   1.836931   0.264536   0.133024   7.916579   4.628748
18    8.356191   7.046523   1.836931   0.264536   0.133024   7.916579
19    6.451392   8.356191   7.046523   1.836931   0.264536   0.133024
20    3.768201   6.451392   8.356191   7.046523   1.836931   0.264536

[Figure: Auto-Correlations chart; x-axis: Lag (1 to 5), y-axis: Autocorrelation (-1 to 1)]

Standard deviation of auto-correlation estimate is $1/\sqrt{\text{no. of observations}}$
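A lag-by-lag check can be sketched as follows. The estimator used here is the Pearson correlation between the series and its lagged copy over the overlapping pairs, which is an assumption: the slides do not say which estimator produced the tabulated values, so the printed numbers are not claimed to match them exactly. The function names and the two-standard-deviation screening rule are likewise illustrative choices.

```python
import math

def lag_autocorrelation(series, lag):
    """Pearson correlation between x_j and x_{j-lag}; lag must be at least 1."""
    x = series[lag:]
    y = series[:-lag]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def looks_uncorrelated(series, max_lag=5, num_sd=2.0):
    """True if every lag's estimate lies within num_sd / sqrt(n) of zero."""
    bound = num_sd / math.sqrt(len(series))
    return all(abs(lag_autocorrelation(series, k)) <= bound
               for k in range(1, max_lag + 1))

# The 20 inter-arrival times from the Lag 0 column of the table above.
times = [7.636883, 0.62654, 10.54177, 4.25373, 6.015199, 0.879388, 0.728755,
         1.144225, 0.409323, 0.953624, 3.772148, 4.628748, 7.916579, 0.133024,
         0.264536, 1.836931, 7.046523, 8.356191, 6.451392, 3.768201]
for k in range(1, 6):
    print(k, round(lag_autocorrelation(times, k), 4))
```

With 20 observations the standard deviation of each estimate is about $1/\sqrt{20} \approx 0.22$, so only fairly large autocorrelations would stand out from noise.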

Page 19: 1 Statistical Distribution Fitting Dr. Jason Merrick


Time Series Models

• If the auto-correlation calculations show a correlation, then you may have to use a time-series model

• Such models are auto-regression models and moving average models

• Using the auto-correlation and another concept called the partial auto-correlation, you can fit these models

• The details are too much for this course

Page 20: 1 Statistical Distribution Fitting Dr. Jason Merrick


Multivariate Input Data

• A “difficult” part requires long processing in both the Prep and Sealer operations

– The service times at the Prep and Sealer areas would be correlated

– Some multivariate models are quite easy, for instance the multivariate normal model

– You can also use the multiplication rule, to specify the marginal distribution of one time and then specify the other time conditional on the first time

$$f_{X,Y}(x, y) = f_{X|Y}(x \mid y)\, f_Y(y)$$
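The multiplication rule suggests a two-step sampler: draw one time from its marginal, then draw the other from the conditional distribution given the first. The sketch below assumes a bivariate normal model, for which the conditional of $X$ given $Y = y$ is normal with mean $\mu_X + \rho (\sigma_X / \sigma_Y)(y - \mu_Y)$ and standard deviation $\sigma_X \sqrt{1 - \rho^2}$. The function name and all parameter values are hypothetical, and a normal model can produce negative times, so a real model would need truncation or a different family.

```python
import math
import random

def sample_prep_and_sealer(rng, mean_x, sd_x, mean_y, sd_y, rho):
    """Draw (x, y): y from its marginal, then x from the conditional given y."""
    y = rng.gauss(mean_y, sd_y)                             # marginal f_Y
    cond_mean = mean_x + rho * sd_x / sd_y * (y - mean_y)   # mean of X given Y = y
    cond_sd = sd_x * math.sqrt(1.0 - rho ** 2)              # sd of X given Y = y
    x = rng.gauss(cond_mean, cond_sd)                       # conditional f_{X|Y}
    return x, y

rng = random.Random(0)
# Made-up parameters: Prep time X and Sealer time Y with correlation 0.8.
pairs = [sample_prep_and_sealer(rng, 5.0, 1.0, 8.0, 2.0, 0.8) for _ in range(5)]
for x, y in pairs:
    print(f"prep={x:.2f}  sealer={y:.2f}")
```

The same two-step structure works for non-normal choices, as long as the conditional distribution of the second time given the first can be sampled.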