project presentation slides

College Scorecard: Predicting Earnings To Debt Ratio. Emdadul Haque and Derek Atwood

Upload: emdadul-haque

Post on 15-Feb-2017


TRANSCRIPT

Page 1: Project presentation slides

College Scorecard

Predicting Earnings To Debt Ratio

Emdadul Haque and Derek Atwood

Page 2: Project presentation slides

Data Description

College Scorecard data: https://www.kaggle.com/kaggle/college-scorecard

●Data collected from 1996 - 2013

●2009 dataset chosen for completeness and recency

●7149 observations / 1484 features

●Each observation corresponds to a unique college

●Features related to demographics, cost of attendance, proportion of students receiving financial aid, earnings multiple years after matriculation, etc.

Page 3: Project presentation slides

Data Description

●Lots of missing data!

●Some information not reported by specific colleges

●Some information suppressed for privacy

Page 4: Project presentation slides

Data Processing

●Variables with >15% of observations missing were removed

●Response variable created as the ratio of median earnings six years after matriculation to median debt

●For each variable, missing values were replaced with the median of non-missing values

●Highly correlated and low variance variables were removed
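The first three processing steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the column names, the toy values, and the helper functions `drop_sparse_columns`, `impute_median`, and `earnings_to_debt` are all hypothetical, and missing values are assumed to be encoded as `None`.

```python
from statistics import median

def drop_sparse_columns(columns, max_missing=0.15):
    """Keep only columns whose share of missing (None) values is at most max_missing."""
    kept = {}
    for name, values in columns.items():
        missing_share = sum(v is None for v in values) / len(values)
        if missing_share <= max_missing:
            kept[name] = values
    return kept

def impute_median(columns):
    """Replace each None with the median of that column's non-missing values."""
    filled = {}
    for name, values in columns.items():
        fill = median(v for v in values if v is not None)
        filled[name] = [fill if v is None else v for v in values]
    return filled

def earnings_to_debt(median_earnings, median_debt):
    """Response variable: median earnings divided by median debt, per college."""
    return [e / d for e, d in zip(median_earnings, median_debt)]

# Hypothetical toy data (not from the College Scorecard files).
columns = {
    "cost": [10.0, None, 30.0, 20.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0],  # 10% missing
    "aid_share": [None, None, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],     # 20% missing
}
kept = drop_sparse_columns(columns)   # "aid_share" exceeds the 15% threshold
filled = impute_median(kept)          # the missing "cost" entry becomes the column median
ratio = earnings_to_debt([30000.0, 24000.0], [15000.0, 12000.0])
```

In this toy run only "cost" survives the 15% filter, and its one missing entry is filled with the median of the nine observed values.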

Page 5: Project presentation slides

Data Processing

●Outliers diagnosed and removed (~0.5% of response variable)

Page 6: Project presentation slides

Analysis

●Originally we intended to use data from 2009 to predict earnings to debt ratio for 2011

●Predictors with low amounts of missing values in 2009 had large amounts of missing values in 2011, and vice versa

●Final data consisted of 5130 observations and 223 predictors

●2009 data split into training (70%) and testing (30%) sets
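A 70/30 split like the one described above might look as follows. This is only a sketch: the slides do not say how the split was drawn, so the random shuffle, the seed, and the helper name `train_test_split` are assumptions, and integer row ids stand in for the real observations.

```python
import random

def train_test_split(rows, train_share=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = rows[:]             # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(train_share * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Row ids standing in for the 5130 observations mentioned above.
rows = list(range(5130))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Every row lands in exactly one of the two sets, so the test error is measured on colleges the model never saw during fitting.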

Page 7: Project presentation slides

Methodology

Linear Model:

●Poor performance (negative predicted ratios)

Lasso:

●Exploratory lasso model selected ~120-130 variables for various iterations

●Models resulted in MSE of ~0.45 (R² ~0.65)
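To illustrate why a lasso model "selects" variables, here is a minimal coordinate-descent sketch on toy data. It is not the authors' implementation (the slides do not say which software was used); the function names, the penalty value, and the data are hypothetical, and the predictors are assumed to be centered and non-constant.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; anything within [-penalty, penalty] becomes 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso_coordinate_descent(X, y, penalty, n_iter=100):
    """Minimize (1/2n) * ||y - Xw||^2 + penalty * ||w||_1 one coefficient at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: y with every other predictor's contribution removed.
            partial = [y[i] - sum(w[k] * X[i][k] for k in range(p) if k != j)
                       for i in range(n)]
            rho = sum(X[i][j] * partial[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n  # assumes a non-constant column
            w[j] = soft_threshold(rho, penalty) / z
    return w

# Toy data: y depends only on the first predictor; the second is irrelevant.
X = [[-1.5, 1.0], [-0.5, -1.0], [0.5, -1.0], [1.5, 1.0]]
y = [-3.0, -1.0, 1.0, 3.0]
weights = lasso_coordinate_descent(X, y, penalty=0.5)
```

On this toy problem the first coefficient is shrunk from 2 to about 1.6 and the second is set exactly to zero, which is the mechanism by which the lasso drops predictors.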

Principal Component Analysis

●No single predictor explained a significant percentage of variance

Page 8: Project presentation slides

Random Forest Explained

●Ensemble learning method that aggregates regression trees

●A subset of the total predictors is used to build each tree

●+ Handles large numbers of variables without deletion

●+ Runs efficiently on large data sets

●+ Inherently handles interactions between variables

●- Loss of interpretability
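The two ideas above (aggregating regression trees, each built on a subset of the predictors) can be sketched with a deliberately tiny forest. This is an illustration, not the model from the slides: each "tree" is simplified to a single split (a stump), each tree is fit on a bootstrap sample, the data are synthetic, and all function names are hypothetical. The defaults of 200 trees and one-half of the predictors per tree mirror the settings quoted later in the deck.

```python
import random
from statistics import mean

def fit_stump(X, y, features):
    """Fit a one-split regression tree using only the given feature indices."""
    best = None
    for j in features:
        for threshold in sorted({row[j] for row in X}):
            left = [t for row, t in zip(X, y) if row[j] <= threshold]
            right = [t for row, t in zip(X, y) if row[j] > threshold]
            if not left or not right:
                continue
            lm, rm = mean(left), mean(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, j, threshold, lm, rm)
    if best is None:               # no valid split: predict the mean everywhere
        m = mean(y)
        return (0, float("inf"), m, m)
    return best[1:]

def predict_stump(stump, row):
    j, threshold, left_mean, right_mean = stump
    return left_mean if row[j] <= threshold else right_mean

def fit_forest(X, y, n_trees=200, feature_share=0.5, seed=0):
    """Each tree sees a bootstrap sample of rows and a random subset of predictors."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    k = max(1, int(p * feature_share))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample (with replacement)
        features = rng.sample(range(p), k)           # predictor subset for this tree
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest

def predict_forest(forest, row):
    """Aggregate by averaging the individual tree predictions."""
    return mean(predict_stump(stump, row) for stump in forest)

# Synthetic data: four predictors, response driven by the first one only.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(4)] for _ in range(40)]
y = [3.0 * row[0] for row in X]
forest = fit_forest(X, y)
```

Trees that never see the informative predictor contribute little, but averaging over many trees still recovers the trend. The price is the loss of interpretability noted above: the prediction is an average of 200 separate rules rather than one readable model.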

Page 9: Project presentation slides

Random Forest

Page 10: Project presentation slides

Random Forest

Final Model:

One-half of the total predictors used per tree

Forest of 200 trees

MSE of ~0.3 (R² ~0.75)
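For reference, the two metrics quoted here can be computed as below. This is a generic sketch with hypothetical toy numbers, not the authors' evaluation code; it assumes R-squared is defined as one minus the ratio of residual to total sum of squares on the test set.

```python
def mse(y_true, y_pred):
    """Mean squared error: average squared gap between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Share of the response's variance explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical earnings-to-debt ratios and predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(mse(y_true, y_pred), r_squared(y_true, y_pred))
```

Because R-squared equals one minus MSE divided by the variance of the response, a lower MSE on the same test set always means a higher R-squared, which matches the direction of the lasso and random forest figures on these slides.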

Page 11: Project presentation slides
Page 12: Project presentation slides

Conclusion

●Missing data provided the greatest challenge to building an accurate model

●Data was decidedly unclean - redundant variables, missing factor levels, etc.

●Significant amount of data processing required (~¾ of time spent)

●Imputing missing data with median values increased model performance

●The large amount of missing data likely sets an upper bound on the performance of this model, but more data processing, feature engineering, and additional tuning of parameters could result in more robust performance.

Page 13: Project presentation slides

Questions?