project presentation slides

College Scorecard: Predicting Earnings To Debt Ratio

Emdadul Haque and Derek Atwood

Data Description

College Scorecard data: https://www.kaggle.com/kaggle/college-scorecard

●Data collected from 1996 - 2013

●2009 dataset chosen for completeness and recency

●7149 observations / 1484 features

●Each observation corresponds to a unique College

●Features related to demographics, cost of attendance, proportion of students receiving financial aid, earnings multiple years after matriculation, etc


Data Description

●Lots of missing data!

●Some information not reported by specific Colleges

●Some information suppressed for privacy

Data Processing

●Variables with >15% of observations missing were removed

●Response variable created as the ratio of median earnings six years after matriculation to median debt

●For each variable, missing values were replaced with the median of non-missing values

●Highly correlated and low variance variables were removed

Data Processing

●Outliers diagnosed and removed (~0.5% of response values)
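The processing steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the column names passed in, the variance and correlation cutoffs, and the choice to drop rows whose response cannot be computed are all assumptions; only the 15% missingness threshold and the median imputation come from the slides. Outlier removal is left out because the slides do not say how outliers were diagnosed.

```python
import numpy as np
import pandas as pd


def preprocess(df, earnings_col, debt_col, max_missing=0.15,
               min_variance=1e-8, max_corr=0.95):
    """Clean a College Scorecard-style table.

    min_variance and max_corr are assumed cutoffs; the slides give no values.
    """
    df = df.copy()

    # Response: median earnings divided by median debt (zero debt treated as missing).
    ratio = df[earnings_col] / df[debt_col].replace(0, np.nan)
    df = df.drop(columns=[earnings_col, debt_col])

    # Assumption: rows whose response cannot be computed are dropped.
    keep_rows = ratio.notna()
    df, ratio = df[keep_rows], ratio[keep_rows]

    # Work with numeric predictors only.
    X = df.select_dtypes(include="number")

    # Remove variables with more than 15% of observations missing.
    X = X.loc[:, X.isna().mean() <= max_missing]

    # Replace remaining missing values with each column's median.
    X = X.fillna(X.median())

    # Remove (near-)constant variables.
    X = X.loc[:, X.var() > min_variance]

    # For each highly correlated pair, drop the later of the two columns.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > max_corr).any()]
    X = X.drop(columns=to_drop)

    return X, ratio
```

Calling `preprocess(df, "median_earnings_6yr", "median_debt")` (hypothetical column names) returns the cleaned predictor table and the earnings-to-debt ratio aligned on the same rows.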

Analysis

●Originally we intended to use data from 2009 to predict earnings to debt ratio for 2011

●Predictors with few missing values in 2009 had many missing values in 2011, and vice versa, so the analysis was restricted to 2009

●Final data consisted of 5130 observations and 223 predictors

●2009 data split into training (70%) and testing (30%) sets
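A 70/30 split like the one described could be done as sketched below. The use of scikit-learn, the random seed, and the stand-in data are assumptions; only the 70/30 proportion and the 5130 x 223 shape come from the slides.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_70_30(X, y, seed=0):
    """Hold out 30% of the rows for testing; the seed is an arbitrary choice."""
    return train_test_split(X, y, test_size=0.30, random_state=seed)


# Stand-in data with the shape reported on the slides (values are random noise).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5130, 223))
y_demo = rng.normal(size=5130)

X_train, X_test, y_train, y_test = split_70_30(X_demo, y_demo)
```

Fixing the seed keeps the split reproducible, so different models are compared on the same held-out colleges.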

Methodology

Linear Model:

●Poor performance (negative predicted ratios)

Lasso:

●Exploratory lasso model selected ~120-130 variables for various iterations

●Models resulted in MSE of ~0.45 (R² ~0.65)
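A lasso fit of this kind, with the number of selected variables and the test MSE and R² reported, might look like the sketch below. It assumes scikit-learn, standardized predictors, and a penalty chosen by 5-fold cross-validation; the slides do not say how the penalty was chosen, and the data here is synthetic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_lasso(X_train, y_train, X_test, y_test, seed=0):
    """Fit a cross-validated lasso and report selected variables, MSE and R^2."""
    # Standardize so the penalty treats all predictors on the same scale.
    model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=seed))
    model.fit(X_train, y_train)
    pred = model.predict(X_test)

    # Variables with a nonzero coefficient are the ones the lasso kept.
    n_selected = int(np.sum(model[-1].coef_ != 0))
    return n_selected, mean_squared_error(y_test, pred), r2_score(y_test, pred)


# Synthetic stand-in data: 50 predictors, 10 of them informative.
X, y = make_regression(n_samples=400, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)
n_selected, mse, r2 = fit_lasso(X_tr, y_tr, X_te, y_te)
```

On a test set, R² equals 1 minus the MSE divided by the variance of the test responses, which is why the two figures move together when models are compared.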

Principal Component Analysis

●No single predictor explained a significant percentage of variance
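One way to check how much variance each principal component explains is sketched below. It assumes scikit-learn and standardized predictors, and it runs on random stand-in data rather than the Scorecard data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def explained_variance(X):
    """Return the share of total variance explained by each principal component."""
    # Standardize first so variables on large scales do not dominate.
    scaled = StandardScaler().fit_transform(X)
    return PCA().fit(scaled).explained_variance_ratio_


# Stand-in data: 20 independent predictors, so no component should dominate.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 20))

ratios = explained_variance(X_demo)
cumulative = np.cumsum(ratios)  # running total, useful for choosing a cutoff
```

If the first few entries of `ratios` are all small, as they are for this stand-in data, PCA offers little dimensionality reduction.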

Random Forest Explained

●Ensemble learning method that aggregates regression trees

●A random subset of the total predictors is used to build each tree (in the standard algorithm, a fresh subset is drawn at each split)

●+ Handles large numbers of variables without deletion

●+ Runs efficiently on large data sets

●+ Inherently handles interactions between variables

●- Loss of interpretability


Random Forest

Final Model:

●One-half of the total predictors used per tree

●Forest of 200 trees

●MSE of ~0.3 (R² ~0.75)
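A forest with these settings could be fit as sketched below. It assumes scikit-learn's `RandomForestRegressor`, which may not be the software the authors used, and synthetic data. In scikit-learn, `max_features=0.5` means half of the predictors are considered at each split, which is the closest available setting to "one-half of the total predictors".

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_forest(X_train, y_train, X_test, y_test, seed=0):
    """Fit a 200-tree forest using half of the predictors at each split."""
    forest = RandomForestRegressor(
        n_estimators=200,   # forest of 200 trees, as on the slide
        max_features=0.5,   # half of the predictors considered per split
        random_state=seed,  # arbitrary seed for reproducibility
    )
    forest.fit(X_train, y_train)
    pred = forest.predict(X_test)
    return forest, mean_squared_error(y_test, pred), r2_score(y_test, pred)


# Synthetic stand-in data with nonlinear effects and an interaction term.
X, y = make_friedman1(n_samples=600, n_features=10, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)
forest, mse, r2 = fit_forest(X_tr, y_tr, X_te, y_te)
```

The fitted model's `feature_importances_` attribute gives a rough ranking of predictors, which recovers some of the interpretability lost relative to a linear model.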

Conclusion

●Missing data posed the greatest challenge to building an accurate model

●Data was decidedly unclean: redundant variables, missing factor levels, etc.

●Significant amount of data processing required (~¾ of time spent)

●Imputing missing data with median values increased model performance

●The large amount of missing data likely sets an upper bound on the performance of this model, but more data processing, feature engineering, and additional tuning of parameters could result in more robust performance.

Questions?

