
Post on 19-Dec-2015


Regression Analysis

Unscheduled Maintenance Issue:

36 flight squadrons

Each experiences unscheduled maintenance actions (UMAs)

Each UMA costs $1000 to repair, on average.

You’ve got the Data… Now What?

Sq Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

101 36 53 51 61 63 54 50 65 62 51 68 45

104 60 42 56 63 39 65 63 67 66 52 59 60

108 53 61 59 87 61 46 52 85 84 75 78 68

Unscheduled Maintenance Actions (UMAs)

What do you want to know?

How many UMAs will there be next month?

What is the average number of UMAs?

Sample Mean

$\bar{x} = \frac{\sum x_i}{n} = 60$

Sample Standard Deviation

$s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = 12.05$

UMA Sample Statistics

UMAs

Mean                     60
Standard Error of Mean   2.01
Median                   60.5
Mode                     61
Standard Deviation       12.05
Minimum                  36
Maximum                  87
Count                    36
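The sample statistics can be reproduced from the UMA table. The following is a minimal sketch using only the Python standard library; the list `umas` is the 36 monthly values transcribed from the table, and the variable names are illustrative rather than part of the original slides.

```python
import math
import statistics

# Monthly UMA counts transcribed from the table (12 months x 3 squadrons).
umas = [
    36, 53, 51, 61, 63, 54, 50, 65, 62, 51, 68, 45,  # Sq 101
    60, 42, 56, 63, 39, 65, 63, 67, 66, 52, 59, 60,  # Sq 104
    53, 61, 59, 87, 61, 46, 52, 85, 84, 75, 78, 68,  # Sq 108
]

n = len(umas)
mean = statistics.mean(umas)        # sample mean
std_dev = statistics.stdev(umas)    # sample standard deviation (n - 1 divisor)
std_err = std_dev / math.sqrt(n)    # standard error of the mean
median = statistics.median(umas)

print(f"n={n} mean={mean} sd={std_dev:.2f} se={std_err:.2f} median={median}")
print(f"min={min(umas)} max={max(umas)}")
```

The standard error is the standard deviation divided by the square root of the count, which is why it is so much smaller than the standard deviation itself.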

UMAs Next Month

95% Confidence Interval

$x = 60 \pm 2(12)$

$36 \le x \le 84$

Average UMAs

95% Confidence Interval

$60 \pm 2\,\frac{12}{\sqrt{36}}$

$56 \le \mu \le 64$
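The difference between the two intervals is only the spread that is used: the full standard deviation for a single month, and the standard error for the average. A small sketch, assuming the rounded values from the slide (mean 60, standard deviation 12, 36 observations) and the rule-of-thumb multiplier of 2 for about 95% coverage; the helper `interval` is a hypothetical name.

```python
import math

mean = 60      # sample mean of monthly UMAs
std_dev = 12   # sample standard deviation, rounded as on the slide
n = 36         # number of observations
z = 2          # roughly 95% coverage under a normal approximation

def interval(center, half_width):
    """Return (low, high) for center +/- half_width."""
    return center - half_width, center + half_width

# A single month's UMA count varies with the full standard deviation.
next_month = interval(mean, z * std_dev)

# The average is known more precisely: its spread is the standard error.
std_err = std_dev / math.sqrt(n)
average = interval(mean, z * std_err)

print("next month:", next_month)
print("average:   ", average)
```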

Model: Cost of UMAs for one squadron

If the cost per UMA = $1000, the

Expected cost for one squadron = $60,000

Model: Total Cost of UMAs

Expected Cost for all squadrons

= 60 * $1000 * 36 = $2,160,000

How confident are we about this estimate?

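[Figure: standard normal curve from -3 to 3 standard units. Area on each side of the mean: .3413 between 0 and 1, .1359 between 1 and 2, .0215 between 2 and 3. About 95% of the area lies within 2 standard units of the mean.]

mean = 60

standard error = 12/√36 = 2

[Figure: the same curve rescaled to UMAs, with 1 standard unit = 2: ~56, ~58, 60, ~62, ~64. About 95% of the area lies between ~56 and ~64.]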

95% Confidence Interval on our estimate of UMAs and costs

60 ± 2(2) = [56, 64]

low cost: 56 * $1000 * 36 = $2,016,000

high cost: 64 * $1000 * 36 = $2,304,000
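The cost range follows by pushing the interval endpoints through the cost model. A short sketch of that arithmetic; `total_cost` is an illustrative helper, not something defined in the slides.

```python
COST_PER_UMA = 1000   # dollars, average repair cost from the slides
SQUADRONS = 36        # number of squadrons in the cost model

def total_cost(umas_per_squadron):
    """Total expected cost across all squadrons for a given UMA count."""
    return umas_per_squadron * COST_PER_UMA * SQUADRONS

# Lower bound, point estimate, and upper bound of the 95% interval on mean UMAs.
low, expected, high = (total_cost(u) for u in (56, 60, 64))

print(f"${low:,} to ${high:,} (expected ${expected:,})")
```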

What do you want to know?

How many UMAs will there be next month?

What is the average number of UMAs?

Is there a relationship between UMAs and some other variable that may be used to predict UMAs?

What is that relationship?

Relationships

What might be related to UMAs?

Pilot experience?

Flight hours?

Sorties flown?

Mean time to failure (for specific parts)?

Number of landings / takeoffs?

Regression:

To estimate the expected or mean value of UMAs for next month:

look for a linear relationship between UMAs and a “predictive” variable

If a linear relationship exists, use regression analysis

Regression analysis:

describes and evaluates relationships between one variable (the dependent or explained variable) and one or more other variables (called the independent or explanatory variables).

What is a good estimating variable for UMAs?

quantifiable

predictable

logical relationship with dependent variable

must be a linear relationship: Y = a + bX

Sorties

Sq Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

101 100 120 114 132 146 124 110 138 140 114 157 106

104 130 106 124 140 100 146 142 141 148 118 128 130

108 122 134 126 190 136 110 120 196 184 154 172 157

Pilot Experience

Sq Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

101 6.06 2.81 3.37 3.87 4.22 6.67 2.61 1.96 2.96 2.45 3.29 3.73

104 4.61 2.45 4.65 5.71 7.23 3.01 2.53 1.54 4.49 1.73 4.81 5.17

108 1.11 5.75 4.9 3.59 6.88 1.17 2.59 5.87 7.28 7.79 5.87 2.47

Sample Statistics

                         Sorties   Exp
Mean                     135       4.06
Standard Error of Mean   3.99      0.31
Median                   131       3.80
Mode                     100       #N/A
Standard Deviation       23.92     1.84
Minimum                  100       1.11
Maximum                  196       7.79
Count                    36        36

Describing the Relationship

Is there a relationship?

Do the two variables (UMAs and sorties or experience) move together?

Do they move in the same direction or in opposite directions?

How strong is the relationship?

How closely do they move together?

[Figures: example scatter plots of Y against X, one per panel: Positive Relationship; Strong Positive Relationship; Negative Relationship; Strong Negative Relationship; No Relationship; Relationship?]

Correlation Coefficient

Statistical measure of how closely two variables are moving together in a coordinated fashion

Measures strength and direction

Value ranges from -1.0 to +1.0

+1.0 indicates “perfect” positive linear relation

-1.0 indicates “perfect” negative linear relation

0 indicates no linear relation between the two variables

Correlation Coefficient

$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}}$
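The formula translates directly into code. The sketch below implements it with plain Python sums and applies it to the sorties and UMA tables as transcribed here. The function name `correlation` is illustrative. The result should be close to the r reported on the slides, though it may differ slightly if the transcribed tables differ from the original spreadsheet.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient via the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Values transcribed from the Sorties and UMA tables (Sq 101, 104, 108).
sorties = [
    100, 120, 114, 132, 146, 124, 110, 138, 140, 114, 157, 106,
    130, 106, 124, 140, 100, 146, 142, 141, 148, 118, 128, 130,
    122, 134, 126, 190, 136, 110, 120, 196, 184, 154, 172, 157,
]
umas = [
    36, 53, 51, 61, 63, 54, 50, 65, 62, 51, 68, 45,
    60, 42, 56, 63, 39, 65, 63, 67, 66, 52, 59, 60,
    53, 61, 59, 87, 61, 46, 52, 85, 84, 75, 78, 68,
]

r = correlation(sorties, umas)
print(f"r = {r:.4f}")
```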

Sorties vs. UMAs

[Figure: scatter plot of UMAs (0 to 90) against Sorties (0 to 200).]

r = .9788

Experience vs. UMAs

[Figure: scatter plot of UMAs (0 to 90) against Pilot Experience (0.00 to 10.00).]

r = .1896

Correlation Matrix

Correlation   UMAs        Sorties    Exp
UMAs          1
Sorties       0.9787613   1
Exp           0.1895905   0.198641   1

A Word of Caution...

Correlation does NOT imply causation

It simply measures the coordinated movement of two variables

Variation in two variables may be due to a third common variable

The observed relationship may be due to chance alone

What is the Relationship?

In order to use the correlation information to help describe the relationship between two variables we need a model

The simplest one is a linear model:

Y = a + bX

Fitting a Line to the Data

[Figure: scatter plot of Y (0 to 10) against X (0 to 14).]

One Possibility

[Figure: the scatter plot with one candidate line.]

Sum of errors = 0

Another Possibility

[Figure: the scatter plot with a different candidate line.]

Sum of errors = 0

Which is Better?

Both have sum of errors = 0

Compare sum of absolute errors:

Y     Y1    Error   Abs err   Y2    Error   Abs err
8     6     2       2         2     6       6
1     5     -4      4         5     -4      4
6     4     2       2         8     -2      2
4     5.5   -1.5    1.5       3.5   0.5     0.5
6     4.5   1.5     1.5       6.5   -0.5    0.5
Sum         0       11              0       13

Fitting a Line to the Data

[Figure: scatter plot of Y (0 to 10) against X (0 to 12).]

One Possibility

[Figure: the scatter plot with one candidate line.]

Sum of absolute errors = 6

Another Possibility

[Figure: the scatter plot with a different candidate line.]

Sum of absolute errors = 6

Which is Better?

The sums of the absolute errors are equal

Compare sum of squared errors:

Y     Y1    Abs err   Sq err   Y2    Abs err   Sq err
4     4     0         0        5.6   1.6       2.56
7     3     4         16       3.8   3.2       10.24
2     2     0         0        2     0         0
5     3.5   1.5       2.25     4.7   0.3       0.09
2     2.5   0.5       0.25     2.9   0.9       0.81
Sum         6         18.5           6         13.7
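The comparison in the table can be checked in a few lines. This sketch uses the observed values and the two sets of fitted values from the table; the function names are illustrative.

```python
y = [4, 7, 2, 5, 2]                 # observed values (Y)
line_1 = [4, 3, 2, 3.5, 2.5]        # fitted values from the first line (Y1)
line_2 = [5.6, 3.8, 2, 4.7, 2.9]    # fitted values from the second line (Y2)

def sum_abs_errors(observed, fitted):
    """Sum of |observed - fitted|."""
    return sum(abs(o - f) for o, f in zip(observed, fitted))

def sum_sq_errors(observed, fitted):
    """Sum of (observed - fitted)^2."""
    return sum((o - f) ** 2 for o, f in zip(observed, fitted))

for name, fitted in (("Y1", line_1), ("Y2", line_2)):
    abs_total = sum_abs_errors(y, fitted)
    sq_total = sum_sq_errors(y, fitted)
    print(f"{name}: abs = {abs_total:.2f}, squared = {sq_total:.2f}")
```

Both lines tie on absolute error, but the first line concentrates its error in one large miss (4), which squaring penalizes heavily. That is why the second line wins on squared error.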

[Figure: scatter plot of Y (50 to 100) against X (100 to 130) with a line through the points.]

The Correct Relationship: Y = a + bX + U

systematic part: a + bX

random part: U

Least-Squares Method

Penalizes large absolute errors

Slope:

$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$

Y-intercept:

$a = \bar{Y} - b\bar{X}$
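The slope and intercept formulas can be applied directly. The sketch below is a plain-Python illustration, with `least_squares` as a hypothetical helper name, fitted to the sorties and UMA tables as transcribed here. The coefficients should land near those in the slides' regression output, but may differ slightly if the transcribed tables differ from the original spreadsheet.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)  # slope
    a = mean_y - b * mean_x                                          # intercept
    return a, b

# Values transcribed from the Sorties and UMA tables (Sq 101, 104, 108).
sorties = [
    100, 120, 114, 132, 146, 124, 110, 138, 140, 114, 157, 106,
    130, 106, 124, 140, 100, 146, 142, 141, 148, 118, 128, 130,
    122, 134, 126, 190, 136, 110, 120, 196, 184, 154, 172, 157,
]
umas = [
    36, 53, 51, 61, 63, 54, 50, 65, 62, 51, 68, 45,
    60, 42, 56, 63, 39, 65, 63, 67, 66, 52, 59, 60,
    53, 61, 59, 87, 61, 46, 52, 85, 84, 75, 78, 68,
]

a, b = least_squares(sorties, umas)
print(f"UMAs = {a:.2f} + {b:.3f} * Sorties")
```

Because the intercept is defined as the mean of Y minus b times the mean of X, the fitted line always passes through the point of means.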

Assumptions

Linear relationship: $Y = a + bX + U$

Errors are random and normally distributed with mean = 0 and variance = $\sigma^2$

Supported by Central Limit Theorem

Least Squares Regression for Sorties and UMAs

[Figure: UMAs (0 to 100) plotted against Sorties (0 to 200).]

Regression Calculations

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.978761339
R Square            0.957973758
Adjusted R Square   0.956737692
Standard Error      2.505836188
Observations        36

ANOVA
             df   SS           MS            F             Significance F
Regression   1    4866.50669   4866.50669    775.0183246   5.51636E-25
Residual     34   213.49331    6.279215001
Total        35   5080

            Coefficients   Standard Error   t Stat         P-value       Lower 95%     Upper 95%
Intercept   -6.542935597   2.426476306      -2.696476195   0.01082052    -11.4741255   -1.611745688
Sorties     0.492910634    0.017705663      27.83915093    5.51636E-25   0.456928421   0.528892848
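Several entries in the summary output are derived from others, which makes the table easier to read once the links are visible. The sketch below recomputes the standard error, F, and the t statistic for Sorties from the sums of squares and coefficient values printed in the output. Variable names are illustrative.

```python
import math

# Values copied from the summary output.
n_obs = 36
ss_regression = 4866.50669
ss_residual = 213.49331
coef_sorties = 0.492910634
se_sorties = 0.017705663

df_regression = 1              # one explanatory variable (Sorties)
df_residual = n_obs - 2        # two estimated parameters: intercept and slope

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

standard_error = math.sqrt(ms_residual)   # "Standard Error" under Regression Statistics
f_stat = ms_regression / ms_residual      # "F" in the ANOVA table
t_stat = coef_sorties / se_sorties        # "t Stat" for Sorties

print(f"standard error = {standard_error:.4f}")
print(f"F = {f_stat:.2f}, t = {t_stat:.3f}, t^2 = {t_stat ** 2:.2f}")
```

With a single explanatory variable, F equals the square of the slope's t statistic, so the two significance tests agree.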

Sorties vs. UMAs

[Figure: UMAs (0 to 100) plotted against Sorties (0 to 200), with the fitted regression line.]

$\hat{Y} = -6.54 + 0.49X$
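The fitted line gives a point estimate of UMAs for a planned number of sorties. The sketch below uses the unrounded coefficients from the summary output; the function `predict_umas`, the range check, and the input of 150 sorties are illustrative assumptions. The range limits come from the minimum and maximum in the sorties sample statistics.

```python
INTERCEPT = -6.542935597   # a, from the summary output
SLOPE = 0.492910634        # b, additional UMAs per additional sortie
X_MIN, X_MAX = 100, 196    # observed range of sorties in the sample

def predict_umas(sorties):
    """Point estimate of UMAs for a month with the given number of sorties."""
    if not X_MIN <= sorties <= X_MAX:
        raise ValueError("sorties outside the range of the fitted data")
    return INTERCEPT + SLOPE * sorties

planned_sorties = 150   # hypothetical value for illustration
estimate = predict_umas(planned_sorties)
print(f"{estimate:.1f} UMAs, about ${estimate * 1000:,.0f} at $1000 per UMA")
```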

Regression Calculations: Confidence in the predictions

Standard Error = 2.505836188 (Regression Statistics, 36 observations)

Confidence Interval for Estimate

[Figure: UMAs (30 to 100) plotted against Sorties (90 to 200).]

$\hat{Y} = (a + bX) \pm t_{\alpha/2}\, s_e$
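As a rough illustration of the interval formula, the sketch below evaluates it at a hypothetical 150 sorties using the coefficients and standard error from the summary output. The t value of 2.0322 (95%, two-sided, 34 degrees of freedom) is hard-coded as an assumption rather than looked up programmatically. This is the simplified band from the slide; a full prediction interval also widens as X moves away from its mean.

```python
A = -6.542935597     # intercept from the summary output
B = 0.492910634      # slope from the summary output
S_E = 2.505836188    # standard error of the regression
T_CRIT = 2.0322      # assumed t value: 95%, two-sided, 34 degrees of freedom

def estimate_interval(x):
    """Return (low, point, high) using Y-hat +/- t * s_e."""
    point = A + B * x
    margin = T_CRIT * S_E
    return point - margin, point, point + margin

low, point, high = estimate_interval(150)   # 150 sorties is a hypothetical input
print(f"{low:.1f} <= UMAs <= {high:.1f} (point estimate {point:.1f})")
```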

95% Confidence Interval for the model (b)

[Figure: Y plotted against X.]

Testing Model Parameters

How well does the model explain the variation in the dependent variable?

Does the independent variable really seem to matter?

Is the intercept constant statistically significant?

Variation

[Figure: UMAs (30 to 100) plotted against Sorties (90 to 200), with Y values marked to illustrate the variation.]

Coefficient of Determination

$R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}}$

Values between 0 and 1

$R^2 = 1$ when all data lie on the line (r = ±1)

$R^2 = 0$ when there is no correlation (r = 0)
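For the sorties model, the ratio can be computed from the sums of squares in the ANOVA table. A brief sketch with illustrative variable names; it also checks that, with one explanatory variable, R² is the square of the correlation coefficient.

```python
explained = 4866.50669    # regression sum of squares (ANOVA, Regression row)
unexplained = 213.49331   # residual sum of squares (ANOVA, Residual row)
total = explained + unexplained

r_squared = explained / total
multiple_r = 0.978761339  # "Multiple R" from the summary output

print(f"R^2 = {r_squared:.4f}")
print(f"r^2 = {multiple_r ** 2:.4f}")
```

About 96% of the variation in UMAs is explained by sorties; the remaining 4% is the residual variation around the line.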

Regression Calculations: How well does the model explain the variation?

R Square = 0.957973758 (Adjusted R Square = 0.956737692)

Does the Independent Variable Matter?

If sorties do not help predict UMAs we expect b = 0

If b is not 0, is it statistically significant?

$Y = a + bX$
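One way to answer the question is to ask how many standard errors the estimated slope lies from 0, and whether a 95% interval around it excludes 0. The sketch below uses the Sorties coefficient and its standard error from the summary output. The t value of 2.0322 (95%, two-sided, 34 degrees of freedom) is an assumed constant.

```python
b = 0.492910634      # estimated slope for Sorties
se_b = 0.017705663   # standard error of the slope
T_CRIT = 2.0322      # assumed t value: 95%, two-sided, 34 degrees of freedom

t_stat = b / se_b                  # distance of b from 0 in standard errors
low = b - T_CRIT * se_b            # lower 95% bound on the slope
high = b + T_CRIT * se_b           # upper 95% bound on the slope

# b is significant at the 5% level when 0 falls outside the interval.
significant = abs(t_stat) > T_CRIT

print(f"t = {t_stat:.2f}, interval = [{low:.4f}, {high:.4f}], significant = {significant}")
```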

Regression Calculations: Does the Independent Variable Matter?

Sorties: coefficient 0.492910634, standard error 0.017705663, t Stat 27.83915093, P-value 5.51636E-25, 95% interval 0.456928421 to 0.528892848

95% Confidence Interval for the slope (a)

[Figure: Y plotted against X, with the mean of X and the mean of Y marked.]

Confidence Interval for Slope

[Figure: UMAs (30 to 100) plotted against Sorties (90 to 200).]

Is the Intercept Statistically Significant?

Intercept: coefficient -6.542935597, standard error 2.426476306, t Stat -2.696476195, P-value 0.01082052, 95% interval -11.4741255 to -1.611745688

Confidence Interval for Y-intercept

[Figure: UMAs (30 to 100) plotted against Sorties (90 to 210).]

Basic Steps of Regression Analysis

Formulate the model

Plot scatter diagram for visual inspection

Compute correlation coefficient

Fit the regression line

Test the model

Factors affecting estimation accuracy

Sample size (larger is better)

Range of X values (wider is better)

Standard deviation of U (smaller is better)

Uses and Limitations of Regression Analysis

Identifying relationships
  Not necessarily cause
  May be due to chance only

Forecasting future outcomes
  Only valid over the range of the data
  Past may not be a good predictor of the future

Common pitfalls in regression

Failure to draw scatter diagrams

Omitting important variables from the model

The “two point” phenomenon

Unfounded claims of model sophistication

Insufficient attention to interval estimates and predictions

Predicting too far outside of known range

Lines can be deceiving...

[Figure: X Variable 1 Line Fit Plot. Y (0 to 14) against X Variable 1 (0 to 20).]

$R^2 = .6662$

Nonlinear Relationship

[Figure: Y (0 to 14) plotted against X (0 to 20) with a fitted quadratic curve.]

$y = -0.1267x^2 + 2.7808x - 5.9957$

$R^2 = 1$

Best fit?

[Figure: X Variable 1 Line Fit Plot. Y (0 to 14) against X Variable 1 (0 to 20).]

Misleading data

[Figure: X Variable 1 Line Fit Plot. Y (0 to 14) against X Variable 1 (0 to 20).]

Summary

Regression Analysis is a useful tool

Helps quantify relationships

But be careful

Does not imply cause and effect

Don’t go outside range of data

Check linearity assumptions

Use common sense!

[Figure: Cost (0 to 50) plotted against Output (0 to 20).]

r = 0.0

Non-linear relationship between output and cost
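A correlation of zero does not mean the variables are unrelated, only that there is no linear trend. The sketch below illustrates this with a hypothetical U-shaped cost curve (not the data behind the slide's figure): cost is completely determined by output, yet r is 0.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient via the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

output = list(range(0, 21))
# Hypothetical cost curve, symmetric around output = 10 (illustration only).
cost = [(x - 10) ** 2 / 2 for x in output]

r = correlation(output, cost)
print(f"r = {r:.4f}")
```

The falling half of the curve cancels the rising half exactly, which is why a scatter diagram should be drawn before trusting r.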
