EDA of Lending Club Data

We will first look at various aspects of the LendingClub data using techniques of Exploratory Data Analysis (EDA). Please look at my past post for finding further details on EDA techniques. Different data files for this analysis have already been downloaded in the current folder.

In [4]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

Let’s also take a quick look at the data via shell scripts (file size, head, line count, column count).

In [2]:
!du -h /home/ssingh/LendingClubData/LoanStats3c_securev1.csv
#!tail -3 /home/ssingh/LendingClubData/LoanStats3c_securev1.csv
#!head -3 /home/ssingh/LendingClubData/LoanStats3c_securev1.csv

181M   /home/ssingh/LendingClubData/LoanStats3c_securev1.csv

Examining the data we see that most of feature names are intuitive. We can get the specifics from the provided data dictionary.

In [3]:
!wc -l < /home/ssingh/LendingClubData/LoanStats3c_securev1.csv
!head -2 /home/ssingh/LendingClubData/LoanStats3c_securev1.csv | sed 's/[^,]//g' | wc -c

235633
116

Based on the above analysis, we find that we have a total of 235633-2-2 = 235659 rows and 116 - 1 = 115 columns of data! Let us first look at the detailed description of columns from the dictionary of the data.

In [5]:
df = pd.read_csv("/home/ssingh/LendingClubData/LoanStatsDict.csv", sep=",", engine='c', encoding = "ISO-8859-1", na_filter=False)
df = df.ix[1:,0:2]
from IPython.display import HTML
HTML(df.to_html())

Out[5]:
LoanStatNew Description
1 acc_open_past_24mths Number of trades opened in past 24 months.
2 addr_state The state provided by the borrower in the loan...
3 all_util Balance to credit limit on all trades
4 annual_inc The self-reported annual income provided by th...
5 annual_inc_joint The combined self-reported annual income provi...
6 application_type Indicates whether the loan is an individual ap...
7 avg_cur_bal Average current balance of all accounts
8 bc_open_to_buy Total open to buy on revolving bankcards.
9 bc_util Ratio of total current balance to high credit/...
10 chargeoff_within_12_mths Number of charge-offs within 12 months
11 collection_recovery_fee post charge off collection fee
12 collections_12_mths_ex_med Number of collections in 12 months excluding m...
13 delinq_2yrs The number of 30+ days past-due incidences of ...
14 delinq_amnt The past-due amount owed for the accounts on w...
15 desc Loan description provided by the borrower
16 dti A ratio calculated using the borrowerÕs total ...
17 dti_joint A ratio calculated using the co-borrowers' tot...
18 earliest_cr_line The month the borrower's earliest reported cre...
19 emp_length Employment length in years. Possible values ar...
20 emp_title The job title supplied by the Borrower when ap...
21 fico_range_high The upper boundary range the borrowerÕs FICO a...
22 fico_range_low The lower boundary range the borrowerÕs FICO a...
23 funded_amnt The total amount committed to that loan at tha...
24 funded_amnt_inv The total amount committed by investors for th...
25 grade LC assigned loan grade
26 home_ownership The home ownership status provided by the borr...
27 id A unique LC assigned ID for the loan listing.
28 il_util Ratio of total current balance to high credit/...
29 initial_list_status The initial listing status of the loan. Possib...
30 inq_fi Number of personal finance inquiries
31 inq_last_12m Number of credit inquiries in past 12 months
32 inq_last_6mths The number of inquiries in past 6 months (excl...
33 installment The monthly payment owed by the borrower if th...
34 int_rate Interest Rate on the loan
35 issue_d The month which the loan was funded
36 last_credit_pull_d The most recent month LC pulled credit for thi...
37 last_fico_range_high The upper boundary range the borrowerÕs last F...
38 last_fico_range_low The lower boundary range the borrowerÕs last F...
39 last_pymnt_amnt Last total payment amount received
40 last_pymnt_d Last month payment was received
41 loan_amnt The listed amount of the loan applied for by t...
42 loan_status Current status of the loan
43 max_bal_bc Maximum current balance owed on all revolving ...
44 member_id A unique LC assigned Id for the borrower member.
45 mo_sin_old_il_acct Months since oldest bank installment account o...
46 mo_sin_old_rev_tl_op Months since oldest revolving account opened
47 mo_sin_rcnt_rev_tl_op Months since most recent revolving account opened
48 mo_sin_rcnt_tl Months since most recent account opened
49 mort_acc Number of mortgage accounts.
50 mths_since_last_delinq The number of months since the borrower's last...
51 mths_since_last_major_derog Months since most recent 90-day or worse rating
52 mths_since_last_record The number of months since the last public rec...
53 mths_since_rcnt_il Months since most recent installment accounts ...
54 mths_since_recent_bc Months since most recent bankcard account opened.
55 mths_since_recent_bc_dlq Months since most recent bankcard delinquency
56 mths_since_recent_inq Months since most recent inquiry.
57 mths_since_recent_revol_delinq Months since most recent revolving delinquency.
58 next_pymnt_d Next scheduled payment date
59 num_accts_ever_120_pd Number of accounts ever 120 or more days past due
60 num_actv_bc_tl Number of currently active bankcard accounts
61 num_actv_rev_tl Number of currently active revolving trades
62 num_bc_sats Number of satisfactory bankcard accounts
63 num_bc_tl Number of bankcard accounts
64 num_il_tl Number of installment accounts
65 num_op_rev_tl Number of open revolving accounts
66 num_rev_accts Number of revolving accounts
67 num_rev_tl_bal_gt_0 Number of revolving trades with balance >0
68 num_sats Number of satisfactory accounts
69 num_tl_120dpd_2m Number of accounts currently 120 days past due...
70 num_tl_30dpd Number of accounts currently 30 days past due ...
71 num_tl_90g_dpd_24m Number of accounts 90 or more days past due in...
72 num_tl_op_past_12m Number of accounts opened in past 12 months
73 open_acc The number of open credit lines in the borrowe...
74 open_acc_6m Number of open trades in last 6 months
75 open_il_12m Number of installment accounts opened in past ...
76 open_il_24m Number of installment accounts opened in past ...
77 open_il_6m Number of currently active installment trades
78 open_rv_12m Number of revolving trades opened in past 12 m...
79 open_rv_24m Number of revolving trades opened in past 24 m...
80 out_prncp Remaining outstanding principal for total amou...
81 out_prncp_inv Remaining outstanding principal for portion of...
82 pct_tl_nvr_dlq Percent of trades never delinquent
83 percent_bc_gt_75 Percentage of all bankcard accounts > 75% of l...
84 policy_code publicly available policy_code=1 new products ...
85 pub_rec Number of derogatory public records
86 pub_rec_bankruptcies Number of public record bankruptcies
87 purpose A category provided by the borrower for the lo...
88 pymnt_plan Indicates if a payment plan has been put in pl...
89 recoveries post charge off gross recovery
90 revol_bal Total credit revolving balance
91 revol_util Revolving line utilization rate, or the amount...
92 sub_grade LC assigned loan subgrade
93 tax_liens Number of tax liens
94 term The number of payments on the loan. Values are...
95 title The loan title provided by the borrower
96 tot_coll_amt Total collection amounts ever owed
97 tot_cur_bal Total current balance of all accounts
98 tot_hi_cred_lim Total high credit/credit limit
99 total_acc The total number of credit lines currently in ...
100 total_bal_ex_mort Total credit balance excluding mortgage
101 total_bal_il Total current balance of all installment accounts
102 total_bc_limit Total bankcard high credit/credit limit
103 total_cu_tl Number of finance trades
104 total_il_high_credit_limit Total installment high credit/credit limit
105 total_pymnt Payments received to date for total amount funded
106 total_pymnt_inv Payments received to date for portion of total...
107 total_rec_int Interest received to date
108 total_rec_late_fee Late fees received to date
109 total_rec_prncp Principal received to date
110 total_rev_hi_lim Ê Total revolving high credit/credit limit
111 url URL for the LC page with listing data.
112 verification_status Indicates if income was verified by LC, not ve...
113 verified_status_joint Indicates if the co-borrowers' joint income wa...
114 zip_code The first 3 numbers of the zip code provided b...

Lets choose some of the most important variables from these.

The Response Variable:

  • Interest Rate (int_rate)

And some of possible important factors are:

  1. Annual Income (annual_inc)
  2. State (addr_state)
  3. Purpose (purpose)
  4. Description for Loan (desc)
  5. Amount Requested (loan_amount)
  6. Amount Funded (funded_amnt)
  7. Loan Length (term)
  8. Debt Income Ratio (dti)
  9. Home Ownership status (home_ownership)
  10. FICO high (fico_range_high)
  11. FICO low (fico_range_low)
  12. Last FICO low (last_fico_range_low)
  13. Last FICO high (last_fico_range_high)
  14. Average current balance (avg_cur_bal)
  15. Charge Offs in last Year (chargeoff_within_12_mths)
  16. Number of 30+ days past-due incidences (delinq_2yrs)
  17. Employment Length (emp_length)
  18. No. of Credit Inquiries (inq_last_6mths)
  19. Maximum current balance owed on all revolving (max_bal_bc)
  20. Total credit revolving balance (revol_bal)
  21. LC Verification status (verification_status)
  22. Revolving line utilization rate (revol_util)
  23. Percentage of account never delinquent (pct_tl_nvr_dlq)
  24. Months since most recent 90-day or worse rating (mths_since_last_major_derog)
  25. Total Credit Balance (total_bal_ex_mort)

We will first look at effects of some of these variables using EDA.

Later, if we find any need to use some additional variables, we will revisit this list.

First, lets load our data as a Pandas data frame:

In [5]:
df = pd.read_csv("/home/ssingh/LendingClubData/LoanStats3c_securev1.csv", skiprows=1, skipfooter=2)
df.info(verbose = False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235629 entries, 0 to 235628
Columns: 115 entries, id to total_il_high_credit_limit
dtypes: float64(44), int64(47), object(24)
memory usage: 206.7+ MB

In [6]:
df.head(3)

Out[6]:
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade ... num_tl_90g_dpd_24m num_tl_op_past_12m pct_tl_nvr_dlq percent_bc_gt_75 pub_rec_bankruptcies tax_liens tot_hi_cred_lim total_bal_ex_mort total_bc_limit total_il_high_credit_limit
0 38098114 40860827 15000 15000 15000 60 months 12.39% 336.64 C C1 ... 0 4 100.0 0.0 0 0 196500 149140 10000 12000
1 36805548 39558264 10400 10400 10400 36 months 6.99% 321.08 A A3 ... 0 4 83.3 14.3 0 0 179407 15030 13000 11325
2 37662224 40425321 7650 7650 7650 36 months 13.66% 260.20 C C3 ... 0 2 100.0 100.0 0 0 82331 64426 4900 64031

3 rows × 115 columns

Lets us remove columns/data that is redundant for prediction of interest rates. For example, id, member_id for sure are of no importance to us. Let’s work through the columns in batches to keep the cognitive burden low:

In [7]:
# .ix[row slice, column slice] 
df.ix[:4,:7]

Out[7]:
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate
0 38098114 40860827 15000 15000 15000 60 months 12.39%
1 36805548 39558264 10400 10400 10400 36 months 6.99%
2 37662224 40425321 7650 7650 7650 36 months 13.66%
3 37612354 40375473 12800 12800 12800 60 months 17.14%
4 37822187 40585251 9600 9600 9600 36 months 13.66%

We won’t need id or member_id as it has no real predictive power so we can drop them from this table int_rate was loaded as an object data type instead of float due to the ‘%’ character. Let’s strip that out and convert the column type. And Also do similar transformation for term variable to get rid of months.

In [8]:
df.drop(['id','member_id'], 1, inplace=True)
df.int_rate = pd.Series(df.int_rate).str.replace('%', '').astype(float)
df['term'].replace(to_replace='[^0-9]+', value='', inplace=True, regex=True)
df['term'] = df['term'].convert_objects(convert_numeric=True)

Moving on to next columns:

In [9]:
df.ix[:4,5:15]

Out[9]:
installment grade sub_grade emp_title emp_length home_ownership annual_inc verification_status issue_d loan_status
0 336.64 C C1 MANAGEMENT 10+ years RENT 78000.0 Source Verified Dec-2014 Current
1 321.08 A A3 Truck Driver Delivery Personel 8 years MORTGAGE 58000.0 Not Verified Dec-2014 Current
2 260.20 C C3 Technical Specialist < 1 year RENT 50000.0 Source Verified Dec-2014 Charged Off
3 319.08 D D4 Senior Sales Professional 10+ years MORTGAGE 125000.0 Verified Dec-2014 Current
4 326.53 C C3 Admin Specialist 10+ years RENT 69000.0 Source Verified Dec-2014 Fully Paid

At first, it seems we employment title should be important. Let us first have a look at how many unique values we have for these. We would like to convert emp_length into an integer variable.

In [10]:
print(df.emp_title.value_counts().head())
print(df.emp_title.value_counts().tail())
df.emp_title.unique().shape

Teacher             4569
Manager             3772
Registered Nurse    1960
RN                  1816
Supervisor          1663
Name: emp_title, dtype: int64
Care Aid                         1
Deputy Probation                 1
Front office staff               1
factor                           1
Independent Contractor/Driver    1
Name: emp_title, dtype: int64
Out[10]:
(75353,)

This is just too many. Unless, we do some semantics based grouping of these titles, we would not be able to get any meaningful data out of this. If you think harder, this should be highly correlated with income. Just for the purpose of loan, it is highly unlikely (not impossible though!) that a high paying job title would be different than another!

In [11]:
df.drop(['emp_title'], 1, inplace=True)

Let first look at unique value of emp_legth variable.

In [12]:
df.emp_length.value_counts()

Out[12]:
10+ years    79505
2 years      20487
3 years      18267
< 1 year     17982
1 year       14593
4 years      13528
7 years      13099
5 years      13051
n/a          12019
8 years      11853
6 years      11821
9 years       9424
Name: emp_length, dtype: int64

Let us make this variable simple integers. We will replace na entries with 0, and all non numeric entries.

In [13]:
df.replace('n/a', np.nan, inplace=True)
df.emp_length.fillna(value=0,inplace=True)
df['emp_length'].replace(to_replace='^<', value=0.0, inplace=True, regex=True)
df['emp_length'].replace(to_replace='[^0-9]+', value='', inplace=True, regex=True)
df['emp_length'] = df['emp_length'].astype(int)

Lets us look at unique emp_length entries again:

In [14]:
df.emp_length.value_counts()

Out[14]:
10    79505
0     30001
2     20487
3     18267
1     14593
4     13528
7     13099
5     13051
8     11853
6     11821
9      9424
Name: emp_length, dtype: int64

We should convert verification status into three ordinal values. Lets see what are different possible values of the verification status. We will convert these to ordinals accordingly.

In [15]:
df.verification_status.value_counts()

Out[15]:
Source Verified    97741
Not Verified       70659
Verified           67229
Name: verification_status, dtype: int64

We will assign the lowest rating to Not Verified and the highest rating to Verified.

In [16]:
df["verification_status"] = df["verification_status"].astype('category')
df["verification_status"] = df["verification_status"].cat.set_categories(["Not Verified", "Source Verified", "Verified"], ordered = True)

Loan status and issue dates are not of any interest to us, as we only plan to use this data for making models that can predict interest rate. However, let us do some digging into this data to see some statistics of loans on Lending Club! Let us look at a histogram of different states of loans.

In [23]:
import seaborn as sns
sns.set()
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
total = float(len(df.index))
ax = sns.countplot(x="loan_status", data=df, palette="Set2");
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x(), height+18, '%2.2f'%(height*100/total)+"%")
plt.xticks(rotation=60)
plt.show()

We can also look at a histogram of number of loans issued based on the month of the year. Just for fun, we will also look for any correlation between the issue month of loans and their states.

In [18]:
df['issue_d'].replace(to_replace='[^A-Z,a-z]+', value='', inplace=True, regex=True)
ax = sns.countplot(x="issue_d", data=df, palette="Set2");
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x(), height+18, '%2.1f'%(height*100/total))
plt.xticks(rotation=60)
plt.show()

First, we find that more loans are issued during the holidays seasons: Oct and July! Nothing surprising there.

In [19]:
g = sns.factorplot("loan_status", col="issue_d", col_wrap=4, data=df,kind="count", aspect=1.25)
(g.set_axis_labels("", "")
 .set_titles("{col_name}")
 .set_xticklabels(rotation=60)
 .despine(left=True))

Out[19]:
<seaborn.axisgrid.FacetGrid at 0x7fe6351a72b0>

We do not find any obvious correlation between the issue date and their states at this time. Not interesting. Lets drop both issue date from our further analysis. Moving on to next set of columns.

We will also divide loan status into two categories: Low, and High risks.

In [20]:
defaultList = ["Default", "Charged Off", "Late (31-120 days)", "Late (16-30 days)", "In Grace Period"]
df.loc[df.loan_status.isin(defaultList), "loan_status"] = "High"
goodList = ["Current", "Fully Paid"]
df.loc[df.loan_status.isin(goodList), "loan_status"] = "Low"

In [21]:
df.drop(['issue_d'],1, inplace=True)

Let us stop here for this post. We will continue our EDA analysis in the next. We will save our pandas object as pickle and then catch up from there.

In [22]:
df.to_pickle("/home/ssingh/LendingClubData/Part1.pickle")

comments powered by Disqus