
Model Analysis

Data Cleaning

To run our models properly, we first needed to clean the data into a model-friendly format. This meant dropping unnecessary features, converting string features into ints and floats, imputing or dropping missing data, generating dummy variables, and standardizing the data.


To start, we wrote a function that reports the number of unique values in a given column. This helped us determine which features should become dummy variables and which should be dropped.


[Images: per-column unique value counts]
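The function itself survives only as a screenshot. A minimal sketch of what it likely looked like, assuming the data sits in a pandas DataFrame named loans (a hypothetical name):

import pandas as pd

def unique_counts(df: pd.DataFrame) -> pd.Series:
    """Number of unique values in each column, sorted descending."""
    return df.nunique().sort_values(ascending=False)

loans = pd.read_csv("lending_club.csv")  # hypothetical file name
print(unique_counts(loans))  # high-cardinality columns surface at the top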

We see that emp_title and title have too many unique values to create dummies for, so we drop them. We keep addr_state, since an applicant's state could be relevant. Any feature with exactly two unique values is converted into a binary feature; this includes term, since a loan is either short term or long term. Features with more unique values get dummy variables. The lists below indicate how each group of features needs to be cleaned and are followed by the cleaning functions themselves.

[Images: feature cleaning lists and cleaning functions]
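The cleaning functions also appear only as screenshots. A sketch of the two conversions described above, with hypothetical column lists:

import pandas as pd

def binarize(df, cols):
    """Map each two-valued column onto 0/1 (e.g. term: '36 months' -> 0, '60 months' -> 1)."""
    for col in cols:
        lo, hi = sorted(df[col].dropna().unique())
        df[col] = df[col].map({lo: 0, hi: 1})
    return df

def dummify(df, cols):
    """Replace multi-valued categorical columns with one-hot dummy columns."""
    return pd.get_dummies(df, columns=cols, drop_first=True)

loans = binarize(loans, ["term"])                         # two-valued features
loans = dummify(loans, ["addr_state", "home_ownership"])  # multi-valued features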

Helper Functions

We built helper functions to (1) explore a slice of a dataframe, (2) identify features with a high prevalence of NaNs, (3) calculate the accuracy of our predictions within a margin of error, and (4) convert sub-grades into numerical values that fit regression models.

[Images: helper function code]
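Only screenshots of these helpers survive. A sketch of the four, assuming sub-grades run A1 through G5 and that "within a letter grade" means within five sub-grade steps:

import numpy as np
import pandas as pd

def explore(df, start=0, stop=5):
    """Peek at a slice of the dataframe's rows."""
    return df.iloc[start:stop]

def nan_prevalence(df, threshold=0.5):
    """Return columns whose fraction of NaN values exceeds the threshold."""
    frac = df.isna().mean()
    return frac[frac > threshold].sort_values(ascending=False)

def accuracy_within(y_true, y_pred, margin=5):
    """Fraction of predictions within +/- margin of the truth; margin=5 is one letter grade."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)) <= margin)

def subgrade_to_num(sub_grade):
    """Map 'A1'..'G5' onto 0..34 so grades work as a regression target."""
    return "ABCDEFG".index(sub_grade[0]) * 5 + int(sub_grade[1]) - 1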

Feature Selection

To focus our model on how Lending Club appraises an application and assigns it a grade, we eliminated all data that would not be submitted with the initial application, which meant dropping 121 feature variables. The spreadsheet below enumerates the features we initially kept; those highlighted in red were retained only for data augmentation and were later dropped.

[Images: spreadsheet of retained features]
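The full list of dropped features lives in the spreadsheet; the drop itself is a one-liner. A sketch with a hypothetical subset of the kept columns:

# Hypothetical subset of the application-time columns we kept;
# the spreadsheet above enumerates the real list.
kept = ["loan_amnt", "term", "emp_length", "home_ownership", "annual_inc",
        "purpose", "addr_state", "dti", "fico_range_low", "fico_range_high",
        "earliest_cr_line", "sub_grade"]
loans = loans[kept]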

Data Augmentation

We created three new feature variables: fico_avg, secondary, and cr_line_hist. We created fico_avg because the dataset provides a FICO score range rather than a single score. We created secondary as a binary variable to account for the effect of a second signer on the application, which occurred around 4% of the time. Finally, we created cr_line_hist to turn the raw date in earliest_cr_line into a measure of credit-history length that is contextual and useful for our analysis. We also dropped all features with high NaN counts.

[Images: data augmentation code]
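A sketch of the three augmentations, assuming the standard Lending Club column names fico_range_low, fico_range_high, application_type, earliest_cr_line, and issue_d (our assumption; the original code is in the screenshots above):

import pandas as pd

# fico_avg: midpoint of the reported FICO range
loans["fico_avg"] = (loans["fico_range_low"] + loans["fico_range_high"]) / 2

# secondary: 1 when a second signer is on the application (~4% of loans)
loans["secondary"] = (loans["application_type"] != "Individual").astype(int)

# cr_line_hist: years between the earliest credit line and the loan's issue date
earliest = pd.to_datetime(loans["earliest_cr_line"], format="%b-%Y")
issued = pd.to_datetime(loans["issue_d"], format="%b-%Y")
loans["cr_line_hist"] = (issued - earliest).dt.days / 365.25

# drop the raw columns now that the engineered features exist
loans = loans.drop(columns=["fico_range_low", "fico_range_high",
                            "application_type", "earliest_cr_line", "issue_d"])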

Data Prep

We handled some last loose ends, such as dropping or imputing the remaining NaN observations, before splitting and standardizing our data. When we split our data into training and test sets, we stratified on sub_grade so that the same distribution of grades was present in both sets.

[Images: data prep code]
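A sketch of the split and standardization with scikit-learn, assuming the numeric sub-grade from subgrade_to_num is the target (the exact test fraction and seed are our assumptions):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = loans.drop(columns=["sub_grade"])
y = loans["sub_grade"].map(subgrade_to_num)

# stratify=y keeps the grade distribution identical across the two sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)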

Models

We used a simple OLS regression as our baseline model and followed up with LASSO and Ridge regressions, a Logistic regression, a Decision Tree, and a Random Forest model. Our LASSO regression and Decision Tree models proved particularly insightful. 
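The individual model sections below show results as screenshots. As a sketch of how the suite might be fit with scikit-learn (the hyperparameters here are placeholders, not the values we tuned):

from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

models = {
    "OLS (baseline)": LinearRegression(),
    "LASSO": LassoCV(cv=5),
    "Ridge": RidgeCV(),
    "Decision Tree": DecisionTreeRegressor(max_depth=10),
    "Random Forest": RandomForestRegressor(n_estimators=100),
    "Logistic": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train_s, y_train)
    preds = model.predict(X_test_s)
    print(f"{name}: within one letter grade {accuracy_within(y_test, preds):.3f}")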

Baseline Model

[Images: baseline OLS model results]

LASSO Model

[Images: LASSO model results]

LASSO regularization shrinks coefficient magnitudes and suppresses uninformative features entirely, which makes it ideal for exploring how features stack up against one another in influencing risk grade. We present a piece of that exploration below.

[Images: LASSO coefficient exploration]
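That kind of coefficient exploration might look like the following sketch, reusing the LASSO model fit above (again an assumption, not our original code):

import pandas as pd

lasso = models["LASSO"]
coefs = pd.Series(lasso.coef_, index=X.columns).sort_values()

print("suppressed features:", list(coefs[coefs == 0].index))
print(coefs.tail(10))   # strongest pushes toward riskier sub-grades
print(coefs.head(10))   # strongest pushes toward safer sub-grades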

Ridge Model

[Images: Ridge model results]

Decision Tree Model

[Images: Decision Tree model results]

Random Forest Model

[Images: Random Forest model results]

Logistic Regression

[Images: Logistic regression results]

Summary

The most accurate model we built was the Decision Tree, which predicted an application's risk grade to within one letter grade around 77% of the time. The linear regression with LASSO regularization performed only slightly worse, and it offers the benefit of regression coefficients, which can be used to explain the relative importance of variables to a layperson. We therefore chose the LASSO results for the conclusions on our website.
