EDA of Lending Club Data - II

In the last post we looked at some initial cleanup of the data. We will start from there by loading the pickled dataframe.

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

df = pd.read_pickle("/home/ssingh/LendingClubData/Part1.pickle")

Let's first check which columns remain in our dataframe. Since there are still more than 100 variables left, we will initially focus on just the first 25.

In [2]:
print(df.columns)
print(df.columns.shape)

Index(['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate',
       'installment', 'grade', 'sub_grade', 'emp_length', 'home_ownership',
       ...
       'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'pct_tl_nvr_dlq',
       'percent_bc_gt_75', 'pub_rec_bankruptcies', 'tax_liens',
       'tot_hi_cred_lim', 'total_bal_ex_mort', 'total_bc_limit',
       'total_il_high_credit_limit'],
      dtype='object', length=111)
(111,)

From the data dictionary, we can see that funded_amnt is the total amount committed to the loan so far, and funded_amnt_inv is the portion funded by investors. It is difficult to imagine a direct correlation between the charged interest rate and the actual funded amount. However, this amount can give us a range of the risk one takes on when investing. Given that the two committed amounts are very similar, we will drop the “funded_amnt” column. The installment column gives us a feel for how much of a burden the loan will be on the borrower; however, it is a direct function of the term and rate of the loan and hence should be dropped from any further analysis. We will also drop the “pymnt_plan” flag here. The “grade” and “sub_grade” are LC-assigned grades for the loan. We can keep these as secondary variables to check the reliability of the models used by LC.
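
Before dropping funded_amnt, a quick check (just a sketch; both columns are numeric dollar amounts) can confirm that the two committed amounts rarely differ:

# distribution of the gap between total committed and investor-committed amounts
(df['funded_amnt'] - df['funded_amnt_inv']).abs().describe()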

In [3]:
df.drop(columns=['funded_amnt', 'installment', 'pymnt_plan'], inplace=True)

In [4]:
df.iloc[:5, 11:19]

Out[4]:
url desc purpose title zip_code addr_state dti delinq_2yrs
0 https://www.lendingclub.com/browse/loanDetail.... NaN debt_consolidation Debt consolidation 235xx VA 12.03 0
1 https://www.lendingclub.com/browse/loanDetail.... NaN credit_card Credit card refinancing 937xx CA 14.92 0
2 https://www.lendingclub.com/browse/loanDetail.... NaN debt_consolidation Debt consolidation 850xx AZ 34.81 0
3 https://www.lendingclub.com/browse/loanDetail.... NaN car Car financing 953xx CA 8.31 1
4 https://www.lendingclub.com/browse/loanDetail.... NaN debt_consolidation Debt consolidation 077xx NJ 25.81 0

For our purposes, we will not be doing any kind of natural language processing; hence, the desc and url variables are of no use to us.

In [5]:
df.drop(columns=['url', 'desc'], inplace=True)

Let us check which “purpose” values are typically used when requesting loans. We can view this as a count plot.

In [6]:
sns.set()
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
total = float(len(df.index))  # total number of loans, handy for percentage labels
ax = sns.countplot(x="purpose", data=df, palette="Set2")
ax.set(yscale="log")
plt.xticks(rotation=90)
plt.show()
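
Since we already have total, we can optionally annotate each bar with its share of all loans. This is a sketch; label placement on a log-scaled axis may need tweaking:

ax = sns.countplot(x="purpose", data=df, palette="Set2")
ax.set(yscale="log")
for p in ax.patches:
    # each bar's height is the raw count; convert it to a percentage of all loans
    ax.annotate("{:.1f}%".format(100 * p.get_height() / total),
                (p.get_x() + p.get_width() / 2, p.get_height()),
                ha="center", va="bottom")
plt.xticks(rotation=90)
plt.show()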

We can also look for a relationship between the purpose and the interest rate of the loan using a box plot. The clear differences across purposes suggest this variable could be useful for building our model!

In [7]:
sns.boxplot(x="purpose", y="int_rate", data=df)
plt.xticks(rotation=90)
plt.show()

Let us also look for correlations between the “employment length”, “rate”, and “status” of loans. Status here, if you remember from the previous post, refers to the risk level of the loan.

In [8]:
sns.set(style="ticks", color_codes=True)
sns.pairplot(df, vars=["int_rate", "emp_length"], hue="loan_status", diag_kind="kde")

Out[8]:
<seaborn.axisgrid.PairGrid at 0x7f2e52116278>

As expected, we find good loans to have longer employment lengths. Interestingly, the interest rate tends to be all over the place for high-risk loans. But if you think about it, that is exactly what we are trying to fix here!

Analyzing the title of loans could be tricky. Again, in the absence of any natural language processing, let us drop this column as well.

The location of borrowers can reveal interesting patterns in interest rates. The first three digits of the zip code give much finer-grained information than the state. However, where the zip code is missing, the state could provide a reasonable approximation. Let's check whether any zip codes are missing; if none are, we can simply drop the state information.

In [9]:
df['zip_code'] = df['zip_code'].str.replace('xx','')

In [10]:
df.drop(columns=['title'], inplace=True)

In [11]:
df.zip_code.isnull().sum()

Out[11]:
0

In [12]:
df.drop(columns=['addr_state'], inplace=True)

The “dti” column in the data dictionary is described as “A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income”. Based on this, the debt-to-income ratio is a direct measure of loan risk, so we keep it.
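
As a quick sketch, we can check how dti relates to the interest rate and the loan status:

# linear correlation between debt-to-income ratio and the charged rate
print(df[['dti', 'int_rate']].corr())
sns.boxplot(x="loan_status", y="dti", data=df)
plt.xticks(rotation=90)
plt.show()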

Let's check the effect of delinquencies over the last 2 years on the interest rate using a box plot:

In [13]:
sns.boxplot(x="delinq_2yrs", y="int_rate", data=df)

Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2e502c3a58>

We can now visualize the effect of delinquencies over the last 2 years. Let us bin this data into three categories - Low, Medium, and High - and then move on to the next set of columns.

In [14]:
df["delinq_2yrs"] = pd.cut(df.delinq_2yrs, bins=3, labels=["Low", "Medium", "High"], include_lowest = True)

In [15]:
df.iloc[:5, 15:23]

Out[15]:
earliest_cr_line fico_range_low fico_range_high inq_last_6mths mths_since_last_delinq mths_since_last_record open_acc pub_rec
0 Aug-1994 750 754 0 NaN NaN 6 0
1 Sep-1989 710 714 2 42.0 NaN 17 0
2 Aug-2002 685 689 1 NaN NaN 11 0
3 Oct-2000 665 669 0 17.0 NaN 8 0
4 Nov-1992 680 684 0 NaN NaN 12 0

The earliest credit line should play an important role in determining the rate. We will replace this column with something more quantitative: credit_age, the age of the credit line in months.

In [16]:
now = pd.Timestamp('20160501')
df["credit_age"] = pd.to_datetime(df.earliest_cr_line, format="%b-%Y")
df['credit_age'] = (now - df['credit_age']).dt.days.divide(30).astype("int64")  # age in months
df.drop(columns=['earliest_cr_line'], inplace=True)

Let us try to find a trend between the interest rate, the FICO ranges, and the loan status.

In [17]:
sns.pairplot(df, vars=["int_rate", "fico_range_low", "fico_range_high"], hue="loan_status", diag_kind="kde")

Out[17]:
<seaborn.axisgrid.PairGrid at 0x7f2e5239cef0>

We find the two FICO scores to be highly collinear. Further, high-risk loans skew toward much lower FICO scores. We can safely replace the pair with their mean.

In [18]:
df['fico'] = 0.5 * (df['fico_range_high'] + df['fico_range_low'])
df.drop(columns=['fico_range_high', 'fico_range_low'], inplace=True)

Similar to the 2-year delinquency, let us also look at the 6-month inquiry data. Other columns like mths_since_last_delinq and mths_since_last_record can be safely removed, as they will be correlated with the 2-year delinquency data (spot-checked after the next plot).

In [19]:
sns.boxplot(x="inq_last_6mths", y="int_rate", data=df)

Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2e50ff59b0>
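
Before dropping mths_since_last_delinq, we can spot-check its relationship with the binned delinq_2yrs (a sketch; delinq_2yrs is now categorical):

# distribution of months-since-last-delinquency within each delinquency bin
print(df.groupby('delinq_2yrs')['mths_since_last_delinq'].describe())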

Let us find correlations between many of these similar variables.

In [20]:
sns.pairplot(df, vars=["int_rate", "pub_rec", "open_acc", "inq_last_6mths"], hue="delinq_2yrs", diag_kind="kde")

Out[20]:
<seaborn.axisgrid.PairGrid at 0x7f2e50dea1d0>

Both open_acc and inq_last_6mths have a strong correlation with delinq_2yrs, and hence can be safely dropped. pub_rec too has a distinct shape for each level of delinq_2yrs, showing interdependence, so we can drop it as well.

In [21]:
df.drop(columns=['pub_rec', 'open_acc', 'inq_last_6mths',
                 'mths_since_last_delinq', 'mths_since_last_record'], inplace=True)

We will now move on to the next set of columns.

In [22]:
df.iloc[:5, 15:25]

Out[22]:
revol_bal revol_util total_acc initial_list_status out_prncp out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp total_rec_int
0 138008 29% 17 w 12484.99 12484.99 4364.64 4364.64 2515.01 1849.63
1 6133 31.6% 36 w 6892.58 6892.58 4163.94 4163.94 3507.42 656.52
2 16822 91.9% 20 f 0.00 0.00 2281.98 2281.98 704.38 339.61
3 5753 100.9% 13 w 10868.67 10868.67 4117.57 4117.57 1931.33 2186.24
4 16388 59.4% 44 f 0.00 0.00 9973.43 9973.43 9600.00 373.43

Revolving balance and revolving utilization measure “how leveraged your credit cards are”. revol_util provides a relative measure of leverage, whereas revol_bal provides an absolute one. Before we proceed, we need to convert the ‘%’ strings to fractions.

In [23]:
df.revol_util = df.revol_util.str.replace('%', '').astype(float) / 100
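
A quick look at the converted column (a sketch) confirms the values are now fractions, with some borrowers above 1.0 - over 100% utilization, as in row 3 above:

print(df.revol_util.describe())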

In [24]:
g = sns.pairplot(df, vars=["revol_bal", "revol_util", "total_acc"], hue="loan_status", diag_kind="kde")
for ax in g.axes.flat:
    plt.setp(ax.get_xticklabels(), rotation=90)

None of these variables shows a direct correlation with the risk level of the loan. Given their direct use in FICO score calculations, however, we will keep them in our analysis.

Let us take a look at the initial listing status of the loan and whether it correlates with the risk level.

In [25]:
sns.countplot(x="initial_list_status", hue="loan_status", data=df)

Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2e4fe23898>

For high-risk loans as well as low-risk ones, there does not seem to be any significant difference between the two types of initial listing, so we can drop this column.

In [26]:
df.drop(columns=['initial_list_status'], inplace=True)

The following variables all describe the current state of the loan and hence have no bearing on its initial risk level, so they should be dropped from our analysis. We will also not consider any joint-application data for this analysis.

In [27]:
df.drop(columns=['out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv',
                 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee',
                 'recoveries', 'collection_recovery_fee', 'last_pymnt_d',
                 'last_pymnt_amnt', 'next_pymnt_d', 'policy_code',
                 'application_type', 'annual_inc_joint', 'dti_joint',
                 'verification_status_joint'], inplace=True)

In [28]:
df.iloc[:5, 18:24]

Out[28]:
last_credit_pull_d last_fico_range_high last_fico_range_low collections_12_mths_ex_med mths_since_last_major_derog acc_now_delinq
0 Feb-2016 684 680 0 NaN 0
1 Feb-2016 679 675 0 59.0 0
2 Dec-2015 539 535 0 NaN 0
3 Feb-2016 704 700 0 36.0 0
4 Feb-2016 684 680 0 NaN 0

First, we need to convert the last credit pull date to a numeric value: months since the last credit pull. Let us first check whether there are any NA values.

In [29]:
print("No. of Data with NA values = {}".format(len(df.last_credit_pull_d) - df.last_credit_pull_d.count()))

No. of Data with NA values = 27

We will fill these NA values with a placeholder date older than any real pull, and then overwrite those rows with the borrower's credit_age, i.e., the start of their credit history.

In [30]:
df.last_credit_pull_d.fillna("Jan-1980", inplace=True)

In [31]:
df["last_credit_pull_d"] = pd.to_datetime(df.last_credit_pull_d, format="%b-%Y")
df['last_credit_pull_d'] = (now - df['last_credit_pull_d']).dt.days.divide(30).astype("int64")
df[df['last_credit_pull_d'] >= 7000].last_credit_pull_d = df[df['last_credit_pull_d'] >= 7000].credit_age

Let us compare the last FICO scores to the overall FICO score.

In [32]:
sns.pairplot(df, vars=["last_fico_range_high", "last_fico_range_low", "fico"], hue="loan_status")

Out[32]:
<seaborn.axisgrid.PairGrid at 0x7f2e5004d828>

As before, the last FICO high and low scores are correlated with each other, and also with the overall FICO score, and hence we can get rid of them.

In [33]:
df.drop(columns=['last_fico_range_high', 'last_fico_range_low'], inplace=True)

We can also get rid of the “collections_12_mths_ex_med” column, as it corresponds only to the current state of the loan. The other two variables, “mths_since_last_major_derog” and “acc_now_delinq”, should add no information beyond the ones already considered.

In [34]:
df.drop(columns=['collections_12_mths_ex_med', 'mths_since_last_major_derog',
                 'acc_now_delinq'], inplace=True)

In [35]:
df.iloc[:5, 19:29]

Out[35]:
tot_coll_amt tot_cur_bal open_acc_6m open_il_6m open_il_12m open_il_24m mths_since_rcnt_il total_bal_il il_util open_rv_12m
0 0 149140 NaN NaN NaN NaN NaN NaN NaN NaN
1 0 162110 NaN NaN NaN NaN NaN NaN NaN NaN
2 0 64426 NaN NaN NaN NaN NaN NaN NaN NaN
3 0 261815 NaN NaN NaN NaN NaN NaN NaN NaN
4 0 38566 NaN NaN NaN NaN NaN NaN NaN NaN

Again, we can go ahead and delete all the columns that relate only to the current state of the loan, including the ones with large amounts of missing data.

In [36]:
df.drop(columns=['tot_coll_amt', 'tot_cur_bal', 'open_acc_6m', 'open_il_6m',
                 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il',
                 'total_bal_il', 'il_util', 'open_rv_12m'], inplace=True)

In [37]:
df.iloc[:5, 19:29]

Out[37]:
open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi total_cu_tl inq_last_12m acc_open_past_24mths avg_cur_bal bc_open_to_buy
0 NaN NaN NaN 184500 NaN NaN NaN 5 29828.0 9525.0
1 NaN NaN NaN 19400 NaN NaN NaN 7 9536.0 7599.0
2 NaN NaN NaN 18300 NaN NaN NaN 6 5857.0 332.0
3 NaN NaN NaN 5700 NaN NaN NaN 2 32727.0 0.0
4 NaN NaN NaN 27600 NaN NaN NaN 8 3214.0 6494.0

Out of these variables, only “avg_cur_bal” is a viable additional feature for our model. We will also look at the distribution of the average current balance. However, in order to use it correctly, we need to check whether it contains any NA values and replace them appropriately.

In [38]:
df.drop(columns=['open_rv_24m', 'max_bal_bc', 'all_util', 'inq_fi',
                 'total_cu_tl', 'inq_last_12m', 'acc_open_past_24mths',
                 'bc_open_to_buy', 'total_rev_hi_lim'], inplace=True)

In [39]:
print("No. of Data with NA values = {}".format(len(df.avg_cur_bal) - df.avg_cur_bal.count()))

No. of Data with NA values = 6

In [40]:
df.avg_cur_bal.fillna(df.avg_cur_bal.min(), inplace=True)  # fill the few NAs with the minimum observed balance

In [41]:
g = sns.pairplot(df, vars=["avg_cur_bal", "int_rate"], hue="loan_status")
for ax in g.axes.flat:
    plt.setp(ax.get_xticklabels(), rotation=90)

In [42]:
df.iloc[:5, 20:28]

Out[42]:
bc_util chargeoff_within_12_mths delinq_amnt mo_sin_old_il_acct mo_sin_old_rev_tl_op mo_sin_rcnt_rev_tl_op mo_sin_rcnt_tl mort_acc
0 4.7 0 0 103.0 244 1 1 0
1 41.5 0 0 76.0 290 1 1 1
2 93.2 0 0 137.0 148 8 8 0
3 103.2 0 0 16.0 170 21 16 5
4 69.2 0 0 183.0 265 23 3 0

Similar to before, we can again get rid of the variables that will not make a significant impact on our model, and then look at the pair-wise effect of the rest. We will also fill the NAs in bc_util with the minimum value, as we did for avg_cur_bal.

In [43]:
df.drop(columns=['chargeoff_within_12_mths', 'delinq_amnt', 'mo_sin_old_il_acct',
                 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op',
                 'mo_sin_rcnt_tl'], inplace=True)

In [44]:
df.bc_util.fillna(df.bc_util.min(), inplace=True)

In [45]:
sns.pairplot(df, vars=["bc_util", "int_rate"], hue="loan_status")

Out[45]:
<seaborn.axisgrid.PairGrid at 0x7f2e4e1ab550>

In [46]:
df.iloc[:5, 21:27]

Out[46]:
mort_acc mths_since_recent_bc mths_since_recent_bc_dlq mths_since_recent_inq mths_since_recent_revol_delinq num_accts_ever_120_pd
0 0 47.0 NaN NaN NaN 0
1 1 5.0 42.0 1.0 42.0 4
2 0 17.0 NaN 3.0 NaN 0
3 5 21.0 17.0 1.0 17.0 1
4 0 24.0 NaN 17.0 NaN 0

In this list, the only variable of interest to us is the number of mortgage accounts (mort_acc).

In [47]:
sns.pairplot(df, vars=["mort_acc", "int_rate"], hue="loan_status")

Out[47]:
<seaborn.axisgrid.PairGrid at 0x7f2e4dd0ed30>

In [48]:
df.drop(columns=['mths_since_recent_bc', 'mths_since_recent_bc_dlq',
                 'mths_since_recent_inq', 'mths_since_recent_revol_delinq',
                 'num_accts_ever_120_pd'], inplace=True)

In [49]:
df.iloc[:5, 22:31]

Out[49]:
num_actv_bc_tl num_actv_rev_tl num_bc_sats num_bc_tl num_il_tl num_op_rev_tl num_rev_accts num_rev_tl_bal_gt_0 num_sats
0 1 4 1 2 8 5 9 4 6
1 6 9 7 18 2 14 32 9 17
2 1 4 1 4 12 4 8 4 11
3 3 5 3 5 1 5 7 5 8
4 4 7 5 16 17 8 26 7 12

All of these variables are counts of accounts of one kind or another. Let's take a look at their interdependence.

In [50]:
sns.pairplot(df, vars=["num_actv_bc_tl", "num_actv_rev_tl", "num_bc_sats", "num_bc_tl",
                       "num_op_rev_tl"], hue="loan_status")

Out[50]:
<seaborn.axisgrid.PairGrid at 0x7f2e4de95208>
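
To quantify the interdependence beyond the pair plot, a correlation matrix is handy (a sketch over the account-count columns shown above):

acct_cols = ["num_actv_bc_tl", "num_actv_rev_tl", "num_bc_sats", "num_bc_tl",
             "num_il_tl", "num_op_rev_tl", "num_rev_accts", "num_rev_tl_bal_gt_0",
             "num_sats"]
plt.figure(figsize=(8, 6))
sns.heatmap(df[acct_cols].corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()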