Data Analysis Using SAS

Page 1: Data Analysis Using SAS

Data Analysis Using SAS

SAS Data Analysis - Final Project

University of California at Berkeley (Extension), December 2006

Instructor: Jianmin Liu, Ph.D.

SAS-X446 Project Team: Dan Brockman, Saranne Warner, Christine Iodice, Satish Prasad

Page 2: Data Analysis Using SAS


Introduction

Many Topics to Cover

A. BUSINESS OBJECTIVE

B. DATA SET

C. DATA EXPLORATION

D. OUTLIERS

E. MODEL SPECIFICATION

F. TOP CHOICE MODEL

H. MEAN VALUE OF CPM / 95% CI’S

I. REGRESSION DIAGNOSTICS (RESIDUALS)

J. ALTERNATIVE MODELS

K. FINAL THOUGHTS

Page 3: Data Analysis Using SAS

Introduction

• Focus of discussion will be on our Best Regression Model

Regression Model

Attribute | Description
Functional Form | CPM = 18.363 + 0.0002596 ASL – 1.534 LUTL – 3.3222 LFilled + 10.56 SPA
Dependent Variable | CPM
Independent Variables | Flight Length (ASL), Log of Utilization (LUTL), Log of Seats Filled (LFilled), # Seats (SPA)
RSquare | .685
Adj RSq | .64
T-Test < .01 | LUTL, LFilled, SPA
T-Test < .10 | ASL
Collinearity testing | No issues; VIFs all lower than 10

When we remove outlier CPM = 3.306, RSquare = .79, Adj RSq = .76

Page 4: Data Analysis Using SAS

Introduction

• We will also introduce a Short Range Model and a Long Range Model.
• Both with impressive R-Squared values!

Short Range vs Long Range Modeling

Type | Description | R squared
Short Range Model | CPM = −2.293 LFilled + 1.288 LEmpty − .83802 UTL | .89
Long Range Model | CPM = −1.98 LFilled + 1.66 LPsize | .76

Page 5: Data Analysis Using SAS

Business Objective

• Develop a Cost Model for airline service
  – Use regression modeling to estimate the Cost per Passenger Mile (CPM)
  – Identify the key factors that determine cost (statistically significant variables)
  – Identify the functional relationship between key factors and cost (functional form)
  – Identify the impact that each factor has on cost (Beta coefficients)
• Why Cost Models?
  – Profit optimization (cost versus revenues)
  – Long range planning
  – Cost sensitivity analysis
  – Operational budgeting
  – Just to name a few...

Page 6: Data Analysis Using SAS

The Data Set

• Data Source
  – Civil Aeronautics Board report: Aircraft Operating Cost and Performance Report (August 1972)
• Variables

Variables in Source Data Set

Variable Name | Variable Description
CPM | Cost per passenger mile (cents) **DEPENDENT VARIABLE**
UTL | Average hours per day use of aircraft (Utilization)
ASL | Average length of nonstop legs of flights
SPA | Average number of seats per aircraft
ALF | Average load factor (% of seats occupied by passengers)
TYPE | Binary variable: ASL binned into two groups. 0 = short flight plane (< 1200 miles on average); 1 = long flight plane (>= 1200 miles on average)
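The TYPE rule can be sketched in a few lines. This is a minimal Python illustration rather than the team's SAS code; the function name `flight_type` and the sample ASL values are assumptions made up for the example, while the 1200-mile threshold comes from the variable description.

```python
# Threshold taken from the TYPE description: 1200 miles average leg length.
LONG_RANGE_MILES = 1200

def flight_type(asl_miles):
    """Return 0 for a short range plane, 1 for a long range plane."""
    return 1 if asl_miles >= LONG_RANGE_MILES else 0

# Hypothetical ASL values, for illustration only (not rows from the data set).
for asl in (450, 1199.9, 1200, 2300):
    print(asl, flight_type(asl))
```

Note that a plane averaging exactly 1200 miles falls in the long range group, because the rule uses >=.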

Page 7: Data Analysis Using SAS

Exploratory look at the data

• Know thy data!
  – Learn about your variables
  – Seek out potential errors
  – Spot potential outliers
  – Check for normal distribution of variables
  – Jump start model specification by looking at correlations between dependent and independent variables
  – Identify potential correlation issues among independent variables
• Data Exploration Methods
  – Proc Contents / Proc Print
  – Proc Means
  – Proc Univariate / Histograms
  – Proc Corr
  – Scatterplots

Page 8: Data Analysis Using SAS

Data Exploration

• Proc Contents / Proc Print
  – 33 observations, all numeric variables
  – No missing values
  – 2 observations with CPM = 3.306 (potential error?)

Page 9: Data Analysis Using SAS

Data Exploration

• Proc Means
  – Low / High values
  – Spread of data

Page 10: Data Analysis Using SAS

Data Exploration

• Proc Means (by Type)
  – 33 total observations
    › 14 short range (42%)
    › 19 long range (58%)
  – Avg CPM: 3.11 cents for all flights, 3.34 cents for short range, 2.94 cents for long range
  – SD: .58 for all flights, .58 for short range, .53 for long range
• Average Utilization is not very different between short/long range planes
  – Short range planes are flying several legs a day
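As a quick consistency check on the Proc Means figures, the overall average CPM should equal the size-weighted average of the two group means. The sketch below is in Python rather than SAS and uses only the counts and means reported on this slide; the variable names are illustrative.

```python
# Group sizes and group means as reported on the Proc Means (by Type) slide.
n_short, mean_short = 14, 3.34
n_long, mean_long = 19, 2.94

n_total = n_short + n_long
overall_mean = (n_short * mean_short + n_long * mean_long) / n_total
share_short = n_short / n_total
share_long = n_long / n_total

print(n_total)                    # 33
print(round(overall_mean, 2))     # 3.11
print(round(100 * share_short))   # 42
print(round(100 * share_long))    # 58
```

The reported overall mean and the 42% / 58% split agree with the group-level numbers.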

Page 11: Data Analysis Using SAS

Data Exploration

• Proc Univariate
  – Focusing on extreme values and distribution of data
  – Extreme high value in CPM (outliers??)
  – Extreme low value in ALF

[Figures: CPM; ALF]

Page 12: Data Analysis Using SAS

Data Exploration

• More Histograms
  – Roughly 75% of observations have between 100 and 150 seats
  – Not a lot of variation in ASL among what are considered short range planes
  – Much bigger variation in long range planes

[Histograms: Seats; ASL: Length]

Page 13: Data Analysis Using SAS

Data Exploration

• Proc Corr
  – Inverse relationship between the independent variables and the dependent variable
  – UTL and ALF appear to have the strongest correlation with CPM
  – Multicollinearity potential

Page 14: Data Analysis Using SAS

Data Exploration

• Scatterplots
  – Data Screening / Seeking Outliers
  – Nonlinearities / Data transformations

Page 15: Data Analysis Using SAS

Data Exploration

• Clustering of data points indicates need for data transformation (Log form?)

Page 16: Data Analysis Using SAS


Data Exploration

• Scatterplot Matrix – Very Cool!

Page 17: Data Analysis Using SAS

Outliers

• Suspects
  – Extreme high values of CPM (4.737, 4.024)
  – Extreme low value of ALF (.287)
  – Same value of CPM (3.306) in two observations: coincidence or data error?
• Our Approach
  – Wait and See!
  – Even though CPM = 4.737 is more than 2 SDs from the mean, the other values of the observation may make it plausible
    › Low Seats (SPA) combined with low Load Factor (ALF) make for few passengers transported
    › Low Flight Length (ASL) combined with low Hrs of Use (UTL) make for few total miles travelled
    › ...which leads us to “What do we think causes high/low CPM? What are our underlying theories?”
• But first... one more note on outliers...
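The "more than 2 SDs from the mean" remark can be checked with a z-score. The sketch below is Python, not the team's SAS, and it assumes that the ".58 SD for all flights" reported under Proc Means is the standard deviation of CPM; the function name is illustrative.

```python
MEAN_CPM = 3.11  # average CPM for all flights (Proc Means slide)
SD_CPM = 0.58    # SD for all flights, ASSUMED here to be the SD of CPM

def z_score(value, mean=MEAN_CPM, sd=SD_CPM):
    """Number of standard deviations a value lies from the mean."""
    return (value - mean) / sd

for cpm in (4.737, 4.024):
    z = z_score(cpm)
    print(cpm, round(z, 2), abs(z) > 2)
# 4.737 2.81 True
# 4.024 1.58 False
```

Under that assumption only CPM = 4.737 sits beyond 2 SDs; CPM = 4.024 is high but within 2 SDs of the mean.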

Page 18: Data Analysis Using SAS


Outliers: Spotlight on Innovative Code

• If we had more observations, we might consider….

Page 19: Data Analysis Using SAS


Outliers: Spotlight on Innovative Code

• Identification of extreme values (<1% or >99%) using previous code would point to the following outliers in our data sample
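The team's code for this step is not part of the transcript. As a rough sketch of the idea only, the Python below flags values under the 1st percentile or over the 99th percentile using the standard library. The function name `flag_extremes` and the sample values are made up for illustration and are not the project data.

```python
from statistics import quantiles

def flag_extremes(values, low_pct=1, high_pct=99):
    """Return the values below the low percentile or above the high percentile."""
    # n=100 yields 99 cut points: cuts[0] is P1, cuts[98] is P99.
    cuts = quantiles(values, n=100, method="inclusive")
    low_cut, high_cut = cuts[low_pct - 1], cuts[high_pct - 1]
    return [v for v in values if v < low_cut or v > high_cut]

# Made-up sample of 33 values, for illustration only (NOT the project data).
sample = [2.0 + 0.05 * i for i in range(31)] + [4.024, 4.737]
print(flag_extremes(sample))
```

With only 33 observations, a 1% / 99% rule with interpolated percentiles tends to flag just the single smallest and largest values, which may be why the slides reserve the technique for larger samples.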

Page 20: Data Analysis Using SAS

Model Specification: Hypothesized Relationships

• So what are our “hunches”, “theories”, “hypotheses” regarding what drives Cost Per Passenger Mile (CPM)?
• Airline industry “produces” Passenger Miles (our unit)
• Large fixed costs of production
  – Cost of running a plane
    › Cost of plane itself, pilot/staff, related infrastructure (slips/spots at airports, etc.)
• Economies of Scale
  – An increase in passengers decreases cost per passenger (nonlinear)
  – An increase in miles decreases cost per mile (nonlinear)
    › Holding constant gas mileage and gas costs
• CPM = Function (Fixed Costs, Passengers, Miles)

Page 21: Data Analysis Using SAS

Model Specification: Independent Variables

• Variables listed below by category (Passenger, Miles, Fixed Costs)
• Note calculated variables and use of Binning method

Category | Type | Name | Formula | Interpretation
Passenger | Given | ALF: Load Factor | ALF | % Occupied
Passenger | Derived | AvgFilled | ALF * SPA | Filled Seats or Passengers
Passenger | Derived | AvgEmpty | (1-ALF) * SPA | Empty Seats or Missing Passengers
Miles | Given | ASL: Avg Length of Leg | ASL | Avg miles per flight
Miles | Given | Type | 0: < 1200 miles; 1: >= 1200 miles | Differentiates Short Range and Long Range planes
Miles | Given | UTL: Hrs in Use | UTL | Utilization / Avg hours per day of use. Can be thought of as a proxy for total miles. Expect that if the plane is in use, it is travelling miles. Short range planes may have more “down time” as they will have more take-offs and landings per day
Miles | Derived | Pdistance: Plane Distance | Created by binning ASL (see code) | Categorical variable to define avg miles traveled. Values range from 1 through 8
Fixed Costs | Given | SPA: Avg Seats | SPA | Average Seats per Aircraft; can be thought of as a proxy for size of plane
Fixed Costs | Derived | Psize: Plane Size | Created by binning SPA (see code) | Categorical variable to define plane size. Values range from 1 through 4
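The binning code referenced in the table is not included in the transcript, so the actual cut points are unknown. The Python sketch below only illustrates the technique: a continuous value is mapped to a 1-based category by its position among a list of edges. The edges in `PSIZE_EDGES` and the function name `bin_category` are hypothetical.

```python
from bisect import bisect_right

# HYPOTHETICAL cut points, for illustration only; the team's actual bin edges
# live in their SAS code, which is not part of this transcript.
PSIZE_EDGES = [100, 125, 150]   # 3 edges -> categories 1..4

def bin_category(value, edges):
    """Return a 1-based category: 1 below the first edge, len(edges)+1 at or above the last."""
    return bisect_right(edges, value) + 1

for seats in (90, 100, 140, 200):
    print(seats, bin_category(seats, PSIZE_EDGES))
```

Three edges give the four Psize categories; a Pdistance-style variable with values 1 through 8 would need seven edges.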

Page 22: Data Analysis Using SAS

Model Specification: Independent Variables

• Performed data transformations to explore different functional forms
  – Created log form of all variables: Lx = Log(x)
  – Created squared variables: Sx = x**2
• In sum, we created a slate of potential independent variables to explore
  – Variables calculated from others
  – Binning technique
  – Log and squared transformations
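To make the derived, log, and squared variables concrete, here is a minimal Python sketch (not the team's SAS data step). It follows the formulas in the slides (AvgFilled = ALF * SPA, AvgEmpty = (1-ALF) * SPA, Lx = Log(x), Sx = x**2) and uses the natural log, which is what the SAS LOG function computes. The function name `derive` and the sample observation are assumptions for illustration.

```python
import math

def derive(alf, spa):
    """Build derived, log (L*) and squared (S*) variables from one observation."""
    row = {"ALF": alf, "SPA": spa}
    row["AvgFilled"] = alf * spa                  # filled seats (passengers)
    row["AvgEmpty"] = (1 - alf) * spa             # empty seats
    row["LFilled"] = math.log(row["AvgFilled"])   # natural log, like SAS LOG()
    row["LEmpty"] = math.log(row["AvgEmpty"])
    row["Sfilled"] = row["AvgFilled"] ** 2
    row["Sempty"] = row["AvgEmpty"] ** 2
    return row

# Hypothetical observation (not from the data set): 120 seats, 55% load factor.
print(derive(alf=0.55, spa=120))
```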

Page 23: Data Analysis Using SAS

Model Specification: Regression Equations

• While we had our ideas... we explored many... many... many... models!
  – So many models that we needed to write code to compare them all

Page 24: Data Analysis Using SAS

Model Specification: Selected Models

• Explored many functional forms and independent variables

Top Regression Models

# | Model | Adj Rsq
1 | CPM = ASL LUTL LFilled SPA | .64
2 | CPM = ASL LUTL LFilled LEmpty | .62
3 | CPM = LFilled LPsize | .63
4 | CPM = IALF ISPA | .51
5 | CPM = LFilled LEmpty | .50

Other Models Explored

6. LCPM = UTL ASL SPA ALF
7. CPM = LUTL LASL LSPA LALF Type / selection = stepwise
8. CPM = ASL Psize AvgFilled AvgEmpty
9. CPM = LDistance Psize AvgFilled AvgEmpty
10. CPM = UTL ASL AvgFilled Sfilled
* CPM = UTL ASL SPA ALF TYPE AvgFilled AvgEmpty Psize Pdistance LUTL LASL LSPA LALF LFilled LEmpty LPsize LPdistance SUTL SASL SSPA SALF Sfilled Sempty / selection = stepwise

Page 25: Data Analysis Using SAS

Best Regression Model

• CPM = ASL LUTL LFilled SPA
  – Explains nearly 70% of the variance of CPM (Rsquare = .685, Adj Rsq = .64)
  – LUTL, LFilled, SPA are highly statistically significant at the .01 level
  – ASL is mildly significant at the .1 (10%) level
• CPM = 18.363 + 0.0002596 ASL – 1.534 LUTL – 3.3222 LFilled + 10.56 SPA

Term | Variable | Represents
18.363 | Intercept | Intercept
ASL | Avg Flight Length | Flight Length
LUTL | Log of UTL | Miles Traveled
LFilled | Log of Filled Seats | Passengers
SPA | Avg # of Seats | Plane Size / Fixed Cost

Page 26: Data Analysis Using SAS

Best Regression Model

CPM = 18.363 + 0.0003 ASL – 1.534 LUTL – 3.322 LFilled + 10.560 SPA

• Additional Explanation:
  – CPM increases as Flight Length (ASL) increases
    › Staffing costs, food costs
  – CPM decreases as Utilization (LUTL) increases
    › UTL is a proxy for Miles Traveled. Relationship is nonlinear.
  – CPM decreases as filled seats (LFilled) increases
    › Filled seats equates to # of passengers. Relationship is nonlinear.
  – CPM increases as seats per aircraft (SPA) increases
    › Seats per aircraft indicates plane size. Larger planes cost more to purchase and to operate (i.e. larger staffing, more gas)

Page 27: Data Analysis Using SAS

Best Regression Model

• Key Performance Statistics
  – Rsquare, Adj Rsq, T-tests
  – Variance Inflation Factor (VIF) measures multicollinearity. A value less than 10 is OK.

When we remove outlier CPM = 3.306, RSquare = .79, Adj RSq = .76
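Two of these statistics follow from simple formulas, sketched here in Python rather than SAS. The adjusted R-square formula assumes a model with an intercept; the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. Function names are illustrative; the inputs .685, 33 observations, and 4 predictors are taken from the slides.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-square for a model with an intercept, n observations, p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def vif(r2_j):
    """Variance inflation factor from the R-square of predictor j regressed on the others."""
    return 1 / (1 - r2_j)

# Best model: RSquare .685, 33 observations, 4 predictors.
print(round(adjusted_r2(0.685, n=33, p=4), 2))   # 0.64
# A VIF of 10 corresponds to a predictor that is 90% explained by the others.
print(round(vif(0.90), 2))                       # 10.0
```

The computed adjusted R-square reproduces the reported .64, and the VIF line shows what the "less than 10" rule of thumb implies.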

Page 28: Data Analysis Using SAS

Best Model: Coefficients Standardized

• What does a 3.32 change in LFilled really mean? How strong is that coefficient compared to others?
• Standardized values are used to compare the relative strength of variables
  – Standardized values (Beta coefficients) are measured in standard deviations instead of the units of the variables, thus they can be used to compare variables
• How to interpret
  – Raw Coefficient: A one unit increase in LFilled would yield a 3.32 unit decrease in CPM
  – Beta Coefficient: A one SD increase in LFilled would yield a 2.08 SD decrease in CPM
    › LFilled is our strongest predictor
    › Use STB option in Model statement to get Beta Coefficients
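The link between a raw and a standardized coefficient is beta = b * SD(x) / SD(y). The Python sketch below (not the team's SAS output) applies it to LFilled. It assumes the ".58 SD for all flights" from Proc Means is the SD of CPM; the SD of LFilled is not reported, so it is backed out from the two reported coefficients and should be read as an implied, approximate value. Function names are illustrative.

```python
def standardized_beta(raw_coef, sd_x, sd_y):
    """Convert a raw regression coefficient to a standardized (beta) coefficient."""
    return raw_coef * sd_x / sd_y

def implied_sd_x(raw_coef, beta, sd_y):
    """Back out the predictor SD implied by a raw coefficient and its beta."""
    return beta * sd_y / raw_coef

SD_CPM = 0.58            # SD for all flights, ASSUMED to be the SD of CPM
RAW_LFILLED = -3.3222    # raw coefficient of LFilled
BETA_LFILLED = -2.08     # reported beta coefficient of LFilled

sd_lfilled = implied_sd_x(RAW_LFILLED, BETA_LFILLED, SD_CPM)
print(round(sd_lfilled, 3))
print(round(standardized_beta(RAW_LFILLED, sd_lfilled, SD_CPM), 2))  # -2.08
```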

Page 29: Data Analysis Using SAS

Mean Value of CPM at 95% Confidence Level

• CPM = SPA
  – The 95% confidence interval shown in the plot indicates where the mean value of CPM is most likely to lie
  – Another, less often used, interval shows the range for individual data points (not shown)
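For reference, the two intervals differ by one term under the square root. The standard textbook formulas for a simple regression of CPM on SPA are written out below; the notation is an assumption of this note, not taken from the slides: $x_0$ is the SPA value of interest, $\hat{y}_0$ the fitted CPM there, $s$ the root mean squared error, and $n$ the number of observations.

```latex
% 95% confidence interval for the MEAN of CPM at SPA = x_0
\hat{y}_0 \;\pm\; t_{0.975,\,n-2}\; s\,
  \sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}

% 95% prediction interval for an INDIVIDUAL observation at SPA = x_0
\hat{y}_0 \;\pm\; t_{0.975,\,n-2}\; s\,
  \sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}
```

The extra 1 under the root is why the interval for individual points is always wider than the interval for the mean.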

Page 30: Data Analysis Using SAS

Regression Diagnostics

• Plotting Residuals
  – Check assumptions made in the modeling process by examining the residuals
  – Residuals are the difference between the observed and fitted values of the response variable (CPM)
• Three Types of Residual Plots
  – A plot of the residuals against predicted values of the response variable
  – A plot of the residuals against each explanatory variable in the model
  – A normal probability plot of the residuals

Page 31: Data Analysis Using SAS

Regression Diagnostics

• Plot of residuals against predicted values of CPM
  – One of the assumptions of linear regression is that the residuals are normally distributed. This assures that the p-values for the t-tests will be valid.
  – Should appear completely random
  – If the variance of the response appears to increase with the predicted value, a transformation of the response variable may be necessary
  – Looks good!

Page 32: Data Analysis Using SAS

Regression Diagnostics

• Plot of residuals against each explanatory variable
  – The presence of a curvilinear relationship, for example, would suggest that a higher-order term (quadratic) in the explanatory variable is needed in the model

[Residual plots: LUTL; LFilled]

Page 33: Data Analysis Using SAS

Regression Diagnostics

• Plot of residuals against each explanatory variable (continued)
  – Horseshoe shape: potential issue (?)
  – Should look at partial regression plots (?)

[Residual plots: ASL; SPA]

Page 34: Data Analysis Using SAS

Regression Diagnostics

• Normal probability plot (NPP) of the residuals
  – After all systematic variation has been removed from the data, the residuals should look like a sample from the normal distribution
  – The NPP is a graphical technique for assessing whether or not a data set is approximately normally distributed
  – The data are plotted against a theoretical distribution such that the points should form an approximate straight line
  – Departures from a straight line indicate departures from normality

Page 35: Data Analysis Using SAS

Regression Diagnostics

• Index plot of Cook’s Distance statistics
  – Cook’s D is a distance measure that helps us determine how strongly a particular data point affects the overall regression
  – Large values of D (2 or more) indicate possible problems with the model, or data points that require scrutiny
  – Largest value in our plot: only .6
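Cook's D for one observation combines its residual with its leverage. The Python sketch below writes out the standard formula; it is an illustration and not the team's SAS code, and the function name and the input values are hypothetical.

```python
def cooks_distance(residual, leverage, mse, n_params):
    """Cook's D for one observation.

    residual : observed minus fitted value
    leverage : hat-matrix diagonal h_ii (0 < h_ii < 1)
    mse      : mean squared error of the fitted model
    n_params : number of estimated coefficients, including the intercept
    """
    return (residual ** 2 / (n_params * mse)) * leverage / (1 - leverage) ** 2

# Hypothetical inputs, for illustration only (not values from the project).
print(cooks_distance(residual=1.0, leverage=0.5, mse=1.0, n_params=2))  # 1.0
```

A point gets a large D only when a sizeable residual coincides with high leverage. Commonly quoted rules of thumb for "large" include D > 1 and D > 4/n, so the cutoff used is a judgment call.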

Page 36: Data Analysis Using SAS

Alternative Models

• Short Range vs Long Range Modeling
  – Is the CPM function different for short range planes and long range planes?
  – A short range plane might be able to achieve an acceptable CPM running flights with fewer passengers on average (ALF * SPA), as long as it operates many flights (multiple legs) per day.
  – On the other hand, long range planes may NOT be able to maintain an acceptable CPM with fewer passengers because of the longer distances they fly.

Page 37: Data Analysis Using SAS

Alternative Models

• Regression Model for Short Range Airplanes
  – All observations included, Beta Coefficients displayed below

Regression Model for Short Range Airplanes

Attribute | Description
Functional Form | CPM = −2.293 LFilled + 1.288 LEmpty − .83802 UTL
Dependent Variable | CPM
Independent Variables | LFilled, LEmpty, UTL
RSquare | .89 ***WOW***
Adj RSq | .86
T-Test < .01 | LFilled, LEmpty, UTL (i.e. ALL variables)
T-Test < .10 | None
Collinearity testing | No collinearity issues. VIFs all below 2

Page 38: Data Analysis Using SAS


Alternative Models

• Regression for Short Range Planes: Performance

Page 39: Data Analysis Using SAS

Alternative Models

• Regression Model for Long Range Airplanes
  – All observations included, Beta Coefficients displayed below

Regression Model for Long Range Airplanes

Attribute | Description
Functional Form | CPM = −1.98 LFilled + 1.66 LPsize
Dependent Variable | CPM
Independent Variables | LFilled, LPsize (Log of Plane Size Category variable)
RSquare | .76 ***Pretty Good***
Adj RSq | .73
T-Test < .01 | LFilled, LPsize
T-Test < .10 | None
Collinearity testing | No collinearity issues. VIFs all below 5.5

Page 40: Data Analysis Using SAS


Alternative Models

• Regression for Long Range Planes: Performance

Page 41: Data Analysis Using SAS

Alternative Models

• Model with good Adj-R-Sq, robust regarding outliers
  – Model 4 (working name: im1)
  – Fitted equations:
    › CPM = -1.093 + 0.2003 ISPA + 1.265 IALF (33 observations, Adj-R-sq = 0.51)
    › CPM = -0.5231 + 0.1574 ISPA + 1.112 IALF (31 observations, Adj-R-sq = 0.49)
    › CPM = -0.9557 + 0.1470 ISPA + 1.350 IALF (27 observations, Adj-R-sq = 0.50)
  – ISPA = (1/SPA), IALF = (1/ALF)
  – Good t-scores, good VIF
  – Rationale
    › CPM = $ / (passengers * miles)
    › Perhaps Passengers = SPA * ALF * K (for some constant K)
    › Then CPM is roughly inversely proportional to SPA and ALF
    › Model 4 is roughly inversely proportional to SPA and ALF
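The rationale can be written out as a short derivation. The symbols $C$ (total cost in dollars), $P$ (passengers) and $M$ (miles) are notation assumed for this note; $K$ is the constant from the slide.

```latex
\mathrm{CPM} = \frac{C}{P \cdot M},
\qquad P = K \cdot \mathrm{SPA} \cdot \mathrm{ALF}
\;\;\Longrightarrow\;\;
\mathrm{CPM} = \frac{C}{K\,M} \cdot \frac{1}{\mathrm{SPA}} \cdot \frac{1}{\mathrm{ALF}}
             = \frac{C}{K\,M} \cdot \mathrm{ISPA} \cdot \mathrm{IALF}
```

Strictly, this argument gives a product of ISPA and IALF, whereas Model 4 is additive in the two inverse terms (CPM = b0 + b1 ISPA + b2 IALF). The fitted model is therefore best read as a simple linear stand-in for the inverse relationship, which matches the slide's "roughly".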

Page 42: Data Analysis Using SAS

Alternative Models

• Model with high Adj-R-Sq excluding certain outliers
  – Model 3 (working name: Long3)
  – CPM = 12.13 – 2.639 LFilled – 2.595 Lpsize (31 observations)
    › LFilled is log of number of seats filled
    › Lpsize is log of the plane’s size bucket
  – Adj R-Sq = 0.63, good t-scores, good VIF
  – 2 outliers removed by “star” method
• Model with good Adj-R-Sq, robust regarding outliers
  – Model 5 (working name: Long2)
  – CPM = 7.079 – 2.017 LFilled + 1.039 Lempty (33 observations, Adj-R-sq = 0.47)
  – CPM = 6.168 + 1.822 LFilled + 1.040 Lempty (31 observations, Adj-R-sq = 0.50)
  – Good t-scores, good VIF
• Serendipity
  – We found them while searching for specialized Long-Flight and Short-Flight models
  – Generally applicable to all flights

Page 43: Data Analysis Using SAS


Final Thoughts

• xxxxx– xxxxx

Page 44: Data Analysis Using SAS


Thank You

Questions?