

Exploratory Data Analysis
Data Description
Lending Club publicly provides historical data on the loan applications it has accepted and rejected, sorted into separate CSV files stretching back to 2007, as well as a data dictionary that describes the features included in those datasets. Unfortunately, that is about as kind as Lending Club gets.
Rather than one continuous historical dataset each for accepted and rejected applications, the data is split across multiple files, sometimes corresponding to different time frames. More critically, the accepted and rejected loan data have completely different feature compositions. While the rejected loan data has only nine feature variables per observation, the accepted loan data comes with well over 145 features, of which only seven mirror the feature variables of the rejected data (though we could approximate the remaining two). This disparity complicates reconciling the two datasets, and the dearth of rejected-loan features called into question our initial project goal of building a model to advise a comprehensive decision to accept or reject a loan application.
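As a minimal sketch of that mismatch, one can inventory the two files side by side with pandas. The file names and the column mapping below are illustrative assumptions, not the exact Lending Club schema, which varies between file vintages:

```python
import pandas as pd

# File names are hypothetical; Lending Club splits its data across many CSVs.
accepted = pd.read_csv("accepted_2018Q1.csv", low_memory=False)
rejected = pd.read_csv("rejected_2018Q1.csv", low_memory=False)

print(f"Accepted-loan features: {accepted.shape[1]}")   # well over 145
print(f"Rejected-loan features: {rejected.shape[1]}")   # only 9

# The two files use different naming conventions, so a raw column
# intersection understates the overlap; a hand-built mapping between
# rough counterparts is more honest. The mapping below is illustrative.
column_map = {
    "Amount Requested": "loan_amnt",
    "Loan Title": "title",
    "Debt-To-Income Ratio": "dti",
    "State": "addr_state",
    "Employment Length": "emp_length",
}
for rejected_col, accepted_col in column_map.items():
    present = rejected_col in rejected.columns and accepted_col in accepted.columns
    print(f"{rejected_col!r} ~ {accepted_col!r}: {present}")
```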
Further complicating matters, the meaning of some feature variables changes over time while the feature name stays the same. For example, ‘Risk Score’ represents a borrower’s FICO score for applications prior to Nov 5, 2013, but afterwards holds their VantageScore instead. Even worse, some features that are clearly meant to be categorical contain free-form strings in the earlier datasets: ‘Loan Title’ is one of a few predefined categories in 2018, but in 2007 includes entries such as “Cancer is Killing my Credit” and “Jaguar10301”.
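One way to cope with the shifting definition, sketched below under the assumption that the relevant columns are named 'Application Date' and 'Risk_Score' (actual names vary between file vintages), is to split the score into separate FICO and VantageScore columns rather than treating it as one comparable number:

```python
import pandas as pd

# Hypothetical file name; column names are assumptions.
rejected = pd.read_csv("rejected_2007_2018.csv", low_memory=False)
rejected["Application Date"] = pd.to_datetime(rejected["Application Date"])

cutoff = pd.Timestamp("2013-11-05")
before_cutoff = rejected["Application Date"] < cutoff

# FICO and VantageScore live on different scales, so keep them in
# separate columns instead of mixing them in one feature.
rejected["fico_score"] = rejected["Risk_Score"].where(before_cutoff)
rejected["vantage_score"] = rejected["Risk_Score"].where(~before_cutoff)
```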
Finally, there is a high prevalence of junk data: some feature columns are NaN for every observation.
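A quick missingness audit along these lines (file name hypothetical) is enough to surface those all-NaN columns:

```python
import pandas as pd

# Fraction of missing values per feature in the accepted-loan data.
accepted = pd.read_csv("accepted_2007_2018.csv", low_memory=False)

nan_share = accepted.isna().mean().sort_values(ascending=False)
print(nan_share.head(20))                                # most-missing features
print((nan_share == 1.0).sum(), "columns are entirely NaN")
```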
Data Challenges Summarized:
- Different data structure between accepted and rejected loan data
- Lack of rejected loan data feature variables
- Feature variables whose definitions do not remain consistent over time
- Junk or missing data

Because of the limitations of the rejected loan data, we decided to abandon application outcome (accept/reject) as our response variable and instead search for a new response to model, focusing only on the accepted loan dataset. We explored two new approaches: helping lenders predict the likelihood that a loan applicant would default, or helping applicants build stronger applications to attract more favorable interest rates. For each approach we examined the potential response variables in the table to the right.
Default Risk EDA
This analysis revealed a rampant issue of NaN values. Moreover, it laid bare how different these indicators really were: neither variable alone would be a comprehensive indicator of default risk, meaning we would have to create our own response variable by combining them. The formula we'd use to do so, however, would be inherently arbitrary, since we aren't experts on default risk. These challenges dissuaded us from modeling default risk.
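For illustration only, the kind of hand-rolled combination we were wary of might look like the sketch below; `indicator_a` and `indicator_b` stand in for the two candidate variables, and the weights and threshold are exactly the sort of arbitrary choices that gave us pause:

```python
import pandas as pd

def make_default_flag(df: pd.DataFrame) -> pd.Series:
    """Collapse two partial default indicators into one binary response.

    'indicator_a' and 'indicator_b' are placeholders for the two candidate
    columns; the equal weighting and zero threshold are arbitrary choices.
    """
    a = df["indicator_a"].fillna(0)
    b = df["indicator_b"].fillna(0)
    return ((0.5 * a + 0.5 * b) > 0).astype(int)
```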

Application Attractiveness EDA

