
Exploratory Data Analysis

Data Description

Lending Club publicly provides historic data about the loan applications it has accepted and rejected, sorted into respective CSV files stretching back to 2007, as well as a data dictionary that describes the features included in those datasets. Unfortunately, that is about as kind as Lending Club gets.


Rather than one continuous historic dataset each for accepted and rejected applications, the data is split across multiple files, sometimes corresponding to different time frames. More critically, the accepted and rejected loan data have completely different feature compositions. While the rejected loan data has only nine feature variables per observation, the accepted loan data comes with well over 145 features, of which only seven mirror the feature variables of the rejected data (though we could approximate the remaining two). This disparity complicates reconciling the two types of datasets, and the dearth of rejected-loan features called into question our initial project goal of building a model to advise a comprehensive decision to accept or reject a loan application.
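The feature mismatch described above is easy to surface with a quick column comparison. This is a minimal sketch using invented column names, not Lending Club's exact headers:

```python
# Sketch: comparing the feature sets of the accepted and rejected files.
# Column names below are illustrative placeholders, not the real headers.
import pandas as pd

accepted = pd.DataFrame(columns=[
    "loan_amnt", "risk_score", "dti", "zip_code", "addr_state",
    "emp_length", "policy_code", "int_rate", "grade", "loan_status",
])
rejected = pd.DataFrame(columns=[
    "loan_amnt", "application_date", "loan_title", "risk_score",
    "dti", "zip_code", "addr_state", "emp_length", "policy_code",
])

# Features present in both files, and rejected-only features we would
# have to approximate from the accepted data.
shared = set(accepted.columns) & set(rejected.columns)
rejected_only = set(rejected.columns) - set(accepted.columns)
print(sorted(shared))         # the seven mirrored features
print(sorted(rejected_only))  # the two we could only approximate
```

In our case this kind of check made the disparity concrete: only a handful of columns line up, so any model spanning both datasets would be limited to that small intersection.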

 

Further complicating matters, the type of data described by some of the feature variables we do have changes with time while remaining under the same feature name. For example, ‘Risk Score’ represents a borrower’s FICO score for applications prior to Nov 5, 2013, but afterwards represents their VantageScore instead. Even worse, some features that are clearly categorical are instead free-form strings in earlier datasets: ‘Loan Title’ is one of a few predefined categories in 2018, but in 2007 includes entries such as “Cancer is Killing my Credit” and “Jaguar10301”.
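One hedged way to handle the ‘Risk Score’ changeover is to tag each row with the credit-score model it actually reflects, splitting on the Nov 5, 2013 date mentioned above. The column names here are assumptions about a cleaned version of the data:

```python
# Sketch: tagging which scoring model 'Risk Score' reflects, using the
# Nov 5, 2013 changeover. Column names are assumed, not the raw headers.
import pandas as pd

df = pd.DataFrame({
    "application_date": pd.to_datetime(["2012-06-01", "2014-01-15"]),
    "risk_score": [690.0, 702.0],
})

cutover = pd.Timestamp("2013-11-05")
# Rows before the cutover carry a FICO score; later rows a VantageScore.
df["score_model"] = (df["application_date"] < cutover).map(
    {True: "FICO", False: "VantageScore"}
)
```

Tagging rather than dropping keeps the observations usable while making the definitional break explicit for any downstream model.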

 

Finally, there is a high prevalence of junk data: feature columns populated with nothing but NaN values from top to bottom.
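Columns like these are straightforward to detect and drop in pandas. A minimal sketch, with a toy frame standing in for the real data:

```python
# Sketch: finding and dropping feature columns that are entirely NaN.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "loan_amnt": [1000, 2500, 5000],
    "junk_col": [np.nan, np.nan, np.nan],  # NaN all along the column
})

junk_cols = df.columns[df.isna().all()]
df = df.drop(columns=junk_cols)  # equivalently: df.dropna(axis=1, how="all")
```

The `isna().all()` mask makes the cleanup auditable: we can log exactly which columns were discarded before dropping them.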

 

Data Challenges Summarized:

  • Different data structure between accepted and rejected loan data

  • Lack of rejected loan data feature variables

  • Feature variables whose definitions do not remain consistent over time

  • Junk or missing data

 


[Figure: candidate response variables for each approach (variableEDA.jpeg)]

Because of the limitations of the rejected loan data, we decided to abandon application outcome (accept/reject) as our response variable and instead searched for a new response to model, focusing only on the accepted loan dataset. We explored two new approaches: helping lenders predict the likelihood that a loan applicant would default, or helping applicants build better applications to attract better interest rates. For each approach we examined the potential response variables shown in the table.

Default Risk EDA

This analysis revealed a rampant issue of NaN values. Moreover, it laid bare how different these indicators really were. Neither variable alone would be a comprehensive indicator of default risk, meaning we would have to create our own response variable combining them. The formula we would use to do so, however, would be inherently arbitrary, since we are not experts in default risk. These challenges dissuaded us from modeling default risk.

[Figure: default risk response variable EDA (DefaultEDA.jpeg)]

Application Attractiveness EDA

[Figures: interest rate over time by grade (AttractEDA.jpeg, GoodEDA.jpeg)]

In the Lending Club system, interest rates are calculated from the risk grade of an application, but our EDA shows that these two statistics are not interchangeable, as their relationship is not stable. By plotting interest rates over time for each grade, we can clearly see that interest rates change with time. In 2007, the least attractively graded applications (Grade G) received an interest rate comparable to Grade E applications only four years later. As a result, interest rate is not a reliable response variable for measuring how attractive an application is. Because interest rates are determined in part by market conditions at the time, and those forces are not included in our dataset, it makes more sense to use grade and sub_grade as our response variables: they are stable and interpretable.
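The grade-versus-rate check described above boils down to a pivot of mean interest rate by issue year and grade. The rows below are invented to keep the sketch self-contained; the real input would be the accepted-loan CSVs:

```python
# Sketch: mean interest rate per grade per issue year, the table behind
# the plot discussed above. The toy rows are illustrative, not real data.
import pandas as pd

loans = pd.DataFrame({
    "issue_year": [2007, 2007, 2011, 2011],
    "grade":      ["G",  "E",  "G",  "E"],
    "int_rate":   [13.8, 10.6, 21.5, 14.0],
})

# Rows = issue year, columns = grade, values = mean interest rate.
rate_by_grade = loans.pivot_table(
    index="issue_year", columns="grade", values="int_rate", aggfunc="mean"
)
```

Reading across the rows of a table like this is what exposed the drift: a given grade does not map to a fixed rate across years, so rate reflects market conditions as well as application quality.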
