Portfolio

Miguel Bautista

This project was completed through my education at General Assembly. My assignment was to predict whether or not a person should be approved for a loan based on a variety of factors.

Predict Loan Funding

This challenge is a standard supervised learning problem. The dataset includes application data for every loan given over a 6 month period.

There are 4 tabs with data, labeled "trainX", "trainY", "testX" and "testY". The train tabs contain the first 4.5 months of loans, while the test tabs have loan data from the following 1.5 months.

trainY and testY contain the targets. They represent whether the loans were fully funded (TRUE values) or partially funded (FALSE values). The fully funded values are withheld from testY, and your job will be to fill them in.

Use trainX and trainY to build a model to predict whether or not future loans will be fully funded. You should then use your model on the data from testX to make new predictions. We’ll score those predictions against the true values of testY to see how well your model performs.

This is intended to be a fairly straightforward task. I didn’t intentionally include any big surprises or “gotchas”. I hope that your model performs well, but it’s even more important that your approach is sound and you avoid major mistakes.

Please put your predictions in column B of the "testY" tab. The predictions should be made such that higher values are more likely to be TRUE and lower values more likely FALSE.

Please include a short description of which evaluation metric you selected & why.

Here's the dataset

Data Dictionary

Variable Definition Type
customer_id unique customer id alphanumeric
status was customer approved or denied String
residence_rent_or_own customer is renting Boolean
monthly_rent_amount monthly rent amount Numeric
bank_account_direct_deposit customer signed up for direct deposit Boolean
application_when date when customer applied for a loan MM/DD/YY HH:MM
loan_duration term of loan Numeric
payment_ach has customer signed up for ACH payments Boolean
num_payments # of payments made by customer Numeric
address_zip customer resident zip code Numeric
bank_routing_number customer bank routing number Numeric
home_phone_type type of customer phone String
monthly_income_amount customer monthly income amount Numeric
raw_l2c_score Third party score Numeric
raw_FICO_telecom Third party score Numeric
raw_FICO_retail Third party score Numeric
raw_FICO_bank_card Third party score Numeric
raw_FICO_money Third party score Numeric
FullyFunded Fund customer Boolean

Good luck!

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


import random
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV


# Importing the various models

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Lasso, Ridge  
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
from sklearn.svm import SVC

# Ensemble model
from sklearn.ensemble import VotingClassifier


# Import sklearn
from sklearn import feature_selection, linear_model
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
/anaconda/envs/py35_ds_dt_16/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

Loading train and test data.

In [2]:
sheets = ['trainX','trainY','testX','testY']

# Note: current pandas spells this keyword sheet_name (the older sheetname was removed)
trainX = pd.read_excel("HW3Data.xlsx", sheet_name='trainX')
trainY = pd.read_excel("HW3Data.xlsx", sheet_name='trainY')
testX = pd.read_excel("HW3Data.xlsx", sheet_name='testX')
testY = pd.read_excel("HW3Data.xlsx", sheet_name='testY')

trainX.head()
Out[2]:
customer_id status residence_rent_or_own monthly_rent_amount bank_account_direct_deposit application_when loan_duration payment_ach num_payments address_zip bank_routing_number home_phone_type monthly_income_amount raw_l2c_score raw_FICO_telecom raw_FICO_retail raw_FICO_bank_card raw_FICO_money
0 9ece67d6c5 Approved True 0 True 2010-10-16 14:06:00 3 True 6 84118 124001545 Mobile 1560 614 574 600 656 561
1 5c2c402094 Approved True 0 False 2010-10-17 13:01:00 6 True 13 84062 124000054 Mobile 900 708 501 550 651 563
2 e6254cad30 Approved True 620 True 2010-10-17 19:52:00 6 True 13 84119 124001545 Mobile 1434 687 522 561 661 598
3 49fb42f51d Approved False 785 False 2010-10-18 07:05:00 4 True 8 84405 124002971 Mobile 1600 616 560 552 634 591
4 195fbe5739 Approved True 700 True 2010-10-19 10:58:00 4 True 8 84404 124002971 Mobile 1360 681 603 654 659 636
In [3]:
trainY.head()
Out[3]:
customer_id FullyFunded
0 9ece67d6c5 True
1 5c2c402094 True
2 e6254cad30 True
3 49fb42f51d True
4 195fbe5739 True
In [4]:
trainX.shape
print("Dataset shape: {}, {}".format(trainX.shape[0], trainX.shape[1]))
Dataset shape: 400, 18
In [5]:
# Info about data set, datatype
trainX.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 18 columns):
customer_id                    400 non-null object
status                         400 non-null object
residence_rent_or_own          400 non-null bool
monthly_rent_amount            400 non-null int64
bank_account_direct_deposit    400 non-null bool
application_when               400 non-null datetime64[ns]
loan_duration                  400 non-null int64
payment_ach                    400 non-null bool
num_payments                   400 non-null int64
address_zip                    400 non-null int64
bank_routing_number            400 non-null int64
home_phone_type                400 non-null object
monthly_income_amount          400 non-null int64
raw_l2c_score                  400 non-null int64
raw_FICO_telecom               400 non-null int64
raw_FICO_retail                400 non-null int64
raw_FICO_bank_card             400 non-null int64
raw_FICO_money                 400 non-null int64
dtypes: bool(3), datetime64[ns](1), int64(11), object(3)
memory usage: 48.1+ KB
In [6]:
# Checking for missing values
trainX.isnull().sum()
Out[6]:
customer_id                    0
status                         0
residence_rent_or_own          0
monthly_rent_amount            0
bank_account_direct_deposit    0
application_when               0
loan_duration                  0
payment_ach                    0
num_payments                   0
address_zip                    0
bank_routing_number            0
home_phone_type                0
monthly_income_amount          0
raw_l2c_score                  0
raw_FICO_telecom               0
raw_FICO_retail                0
raw_FICO_bank_card             0
raw_FICO_money                 0
dtype: int64

There are no missing values, so there is no need to impute data.

In [7]:
# Check customer id is same for features & target
print("Is customer ID same for feature & target? (training)", (pd.Series.equals(trainX.customer_id, trainY.customer_id)))
print("Is customer ID same for feature & target? (test)", (pd.Series.equals(testX.customer_id, testY.customer_id)))

# Check if columns are the same/same order
trainX.columns == testX.columns
Is customer ID same for feature & target? (training) True
Is customer ID same for feature & target? (test) True
Out[7]:
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)
In [8]:
# Checking for any quirks
trainX.describe()
Out[8]:
monthly_rent_amount loan_duration num_payments address_zip bank_routing_number monthly_income_amount raw_l2c_score raw_FICO_telecom raw_FICO_retail raw_FICO_bank_card raw_FICO_money
count 400.000000 400.000000 400.000000 400.000000 4.000000e+02 400.000000 400.00000 400.000000 400.000000 400.000000 400.000000
mean 594.317500 5.860000 10.800000 82409.300000 1.815323e+08 2645.565000 606.55000 566.180000 592.242500 664.005000 602.565000
std 435.895893 1.511991 4.485007 6506.387121 9.981754e+07 1651.009577 108.37745 43.278519 51.673388 40.141042 30.989865
min 0.000000 2.000000 3.000000 57003.000000 1.240000e+04 300.000000 50.00000 222.000000 222.000000 222.000000 222.000000
25% 250.000000 5.000000 8.000000 84010.000000 1.240001e+08 1557.500000 544.00000 535.000000 553.000000 648.000000 587.750000
50% 577.500000 5.000000 10.000000 84084.000000 1.240030e+08 2397.000000 600.50000 566.000000 590.000000 671.000000 605.500000
75% 873.750000 8.000000 13.000000 84120.000000 3.240796e+08 3200.000000 671.75000 593.250000 632.000000 680.000000 621.000000
max 2100.000000 8.000000 34.000000 84790.000000 5.113005e+08 19392.000000 808.00000 698.000000 756.000000 796.000000 662.000000
In [9]:
# Number of categorical columns
print("Number of categorical columns:", len(trainX.describe(include=['object']).columns))
trainX.describe(include=['object', 'datetime64'])
Number of categorical columns: 3
Out[9]:
customer_id status application_when home_phone_type
count 400 400 400 400
unique 400 1 398 3
top 24960509ce Approved 2011-03-01 12:48:00 Mobile
freq 1 400 2 334
first NaN NaN 2010-10-16 14:06:00 NaN
last NaN NaN 2011-03-01 14:34:00 NaN
In [10]:
# Plotting univariate distributions to check for skew, patterns, types of distribution
trainX.hist(layout = (3,6), figsize = (14,9), bins = 20);

Some histograms show a right skew (monthly_income_amount, monthly_rent_amount, num_payments). Others are effectively categorical, such as whether someone rents or owns, or take only a few distinct values, such as loan_duration.
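One common way to tame a right skew like this is a log transform. The sketch below uses synthetic data as a stand-in for a column such as monthly_income_amount (the distribution parameters are made up), so it only illustrates the technique. Tree-based models are unaffected by monotonic transforms like this one, so it would mostly matter for linear or distance-based models.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for a column such as monthly_income_amount.
# The lognormal parameters are invented for illustration only.
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=7.8, sigma=0.6, size=400), name="income")

# log1p computes log(1 + x), so it is defined at 0. That matters for columns
# such as monthly_rent_amount, which contain zeros.
income_log = np.log1p(income)

print("skew before:", round(income.skew(), 3))
print("skew after: ", round(income_log.skew(), 3))
```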

In [11]:
sns.pairplot(trainX)
plt.show()
In [12]:
trainX.plot(kind='density', subplots=True, layout=(3,6), figsize=(14,9), sharex=False);
In [13]:
# Looking for skew
trainX.skew()
Out[13]:
residence_rent_or_own          -0.678066
monthly_rent_amount             0.528339
bank_account_direct_deposit    -1.717120
loan_duration                   0.177955
payment_ach                   -20.000000
num_payments                    1.023847
address_zip                    -3.612955
bank_routing_number             0.701593
monthly_income_amount           3.640856
raw_l2c_score                  -1.149318
raw_FICO_telecom               -0.962150
raw_FICO_retail                -0.683382
raw_FICO_bank_card             -3.588302
raw_FICO_money                 -4.823780
dtype: float64
In [14]:
# Checking when these applications were submitted, find the time range
trainX.application_when.sort_values()
Out[14]:
0     2010-10-16 14:06:00
1     2010-10-17 13:01:00
2     2010-10-17 19:52:00
3     2010-10-18 07:05:00
4     2010-10-19 10:58:00
5     2010-10-21 21:21:00
6     2010-10-23 19:52:00
7     2010-10-24 16:05:00
8     2010-10-26 11:22:00
9     2010-10-27 14:12:00
10    2010-10-27 19:10:00
11    2010-10-28 07:50:00
12    2010-10-28 18:50:00
13    2010-10-30 11:53:00
14    2010-10-31 21:00:00
15    2010-10-31 21:43:00
16    2010-11-01 09:59:00
17    2010-11-01 11:39:00
18    2010-11-01 17:50:00
19    2010-11-01 19:08:00
20    2010-11-02 20:17:00
21    2010-11-03 08:03:00
22    2010-11-03 16:58:00
23    2010-11-04 09:23:00
24    2010-11-04 10:09:00
25    2010-11-04 10:24:00
26    2010-11-04 22:34:00
27    2010-11-06 00:32:00
28    2010-11-06 05:35:00
29    2010-11-06 23:29:00
              ...        
370   2011-02-24 07:31:00
371   2011-02-24 07:42:00
372   2011-02-24 10:01:00
373   2011-02-24 10:52:00
374   2011-02-24 11:00:00
375   2011-02-24 11:55:00
376   2011-02-24 12:15:00
377   2011-02-25 02:52:00
378   2011-02-25 10:59:00
379   2011-02-25 11:01:00
380   2011-02-25 17:05:00
381   2011-02-25 18:00:00
382   2011-02-25 18:23:00
383   2011-02-26 09:30:00
384   2011-02-26 10:16:00
385   2011-02-26 12:41:00
386   2011-02-26 16:43:00
387   2011-02-27 08:14:00
388   2011-02-27 13:16:00
389   2011-02-28 07:02:00
390   2011-02-28 08:43:00
391   2011-02-28 09:31:00
392   2011-02-28 11:40:00
393   2011-02-28 16:09:00
394   2011-03-01 09:01:00
395   2011-03-01 10:00:00
396   2011-03-01 12:48:00
397   2011-03-01 12:48:00
398   2011-03-01 12:59:00
399   2011-03-01 14:34:00
Name: application_when, Length: 400, dtype: datetime64[ns]
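Because the rows are in chronological order and the test set covers the following 1.5 months, one option worth considering is a cross-validation scheme that always validates on rows that come after the training rows. This is a minimal sketch using scikit-learn's TimeSeriesSplit on placeholder data; it is not what the rest of this notebook uses.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_rows = 400  # same number of rows as trainX
X_dummy = np.arange(n_rows).reshape(-1, 1)  # placeholder features, already in time order

tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X_dummy))

for fold, (train_idx, valid_idx) in enumerate(splits):
    # Every validation row comes strictly after every training row.
    print(fold, "train rows:", len(train_idx), "validation rows:", len(valid_idx))
```

The trade-off is that early folds train on few rows, which can make the scores noisier on a dataset this small.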
In [15]:
# Checking longest loan duration, appears to be 8 [years?] -- most common is 5 year loan
# trainX.loan_duration.sort_values(ascending=False)
trainX.loan_duration.value_counts()
Out[15]:
5    170
8    110
6     71
3     24
4     18
7      6
2      1
Name: loan_duration, dtype: int64

Idea: Is it possible to add years (loan duration) to the date loan was applied for? Maybe to see if you can weigh number of payments strongly.
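A rough sketch of that idea is below. It assumes loan_duration is measured in months, which the data dictionary does not confirm, and the column name expected_end is made up for illustration.

```python
import pandas as pd

# Two rows copied from the head of trainX.
loans = pd.DataFrame({
    "application_when": pd.to_datetime(["2010-10-16 14:06", "2010-10-17 13:01"]),
    "loan_duration": [3, 6],
})

# ASSUMPTION: loan_duration is in months. If it were in years, use
# pd.DateOffset(years=...) instead.
loans["expected_end"] = [
    start + pd.DateOffset(months=int(months))
    for start, months in zip(loans["application_when"], loans["loan_duration"])
]

print(loans)
```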

In [16]:
# Checking address_zip, see if there is an imbalance in where these people live
trainX['address_zip'].value_counts()
Out[16]:
84120    15
84118    13
84404    11
84119    10
84047    10
84067     9
84015     8
84123     8
84084     8
84010     8
84115     7
84088     7
84041     7
84660     6
84121     6
84058     6
84095     6
84003     6
83401     5
84065     5
84103     5
84655     5
83201     5
84321     5
84414     4
83404     4
84044     4
84005     4
84401     4
84116     4
         ..
57262     1
57274     1
83325     1
83402     1
83661     1
57325     1
57106     1
84752     1
57103     1
84741     1
84105     1
84109     1
84110     1
84124     1
84639     1
83617     1
83714     1
57013     1
84663     1
83642     1
83644     1
57018     1
57032     1
84526     1
57055     1
83713     1
84724     1
83703     1
83705     1
83607     1
Name: address_zip, Length: 146, dtype: int64

Shows 146 different address_zip values.

In [17]:
# Checking string and other variables.
print(trainX.status.value_counts())
print("******************")
trainX.bank_account_direct_deposit.value_counts()
Approved    400
Name: status, dtype: int64
******************
Out[17]:
True     330
False     70
Name: bank_account_direct_deposit, dtype: int64

All loans in trainX are approved, and I checked that the same is true for testX, so status carries no information for the model. More than four times as many people use direct deposit as do not (330 vs. 70).
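A column with a single value, like status, can be found programmatically instead of by eye. The helper names below (constant_columns, dominant_share) are hypothetical, and the toy frame only mimics the shape of the real data.

```python
import pandas as pd

def constant_columns(df):
    """Return the names of columns that hold a single distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

def dominant_share(series):
    """Share of rows taken by the most common value (1.0 means constant)."""
    return series.value_counts(normalize=True, dropna=False).iloc[0]

# Toy stand-in for trainX.
toy = pd.DataFrame({
    "status": ["Approved"] * 4,
    "payment_ach": [True, True, True, False],
    "loan_duration": [3, 6, 6, 4],
})

print(constant_columns(toy))
print(dominant_share(toy["payment_ach"]))
```

The second helper is useful for near-constant columns. The skew of -20 reported earlier for payment_ach hints that it may be one of those, although its value counts are not shown here.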

Checking for any correlations.

Example:

  • See if monthly income has any correlation with FICO score
  • Do number of payments correlate with FICO score?
  • Does loan duration correlate with number of payments?
In [18]:
# plot correlation matrix
corr = trainX.corr()
sns.heatmap(corr,
           xticklabels=corr.columns.values,
           yticklabels=corr.columns.values);

There are a lot of highly correlated variables, particularly when it comes to the credit scores.

So maybe need to drop some of them?
A possible approach would be to check their coefficients and p-values, and pick a FICO score.
Note: Averaging would probably remove the differences, so I will not be doing that. (From Aug. 27 office hours)
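To go from the heatmap to a concrete list of candidates for dropping, the correlated pairs can be extracted from the correlation matrix. This is a sketch on synthetic data; the function name correlated_pairs and the 0.8 threshold are assumptions, not something taken from the analysis above.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """List column pairs whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic scores: score_b is a noisy copy of score_a, unrelated is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "score_a": base,
    "score_b": base + rng.normal(scale=0.1, size=200),
    "unrelated": rng.normal(size=200),
})

result = correlated_pairs(toy)
print(result)
```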

Checking for "best" FICO score using the get_linear_model_metrics from the linear regression notebook.

In [19]:
def get_linear_model_metrics(X, y, algo):
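    # Get the p-value of each feature in X against y (univariate F-test). Ignore f-stat for now.
    pvals = feature_selection.f_regression(X, y)[1]
    # algo is an unfitted estimator (LogisticRegression in the loop below)
    # .fit() trains it on X and y
    algo.fit(X,y)
    # Cast to int so the subtraction also works when y and the predictions are boolean
    residuals = (y.astype(int) - algo.predict(X).astype(int)).values

    # Print the necessary values
    print('P Values:', pvals)
    print('Coefficients:', algo.coef_)
    print('y-intercept:', algo.intercept_)
    # Note: for a classifier such as LogisticRegression, .score() is mean accuracy, not R-squared
    print('R-Squared:', algo.score(X,y))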

    # Plot residuals
#     h = sns.residplot(X,y)
#     h.set(xlabel="ah")
    
    # Keep the model
    return algo

# lm = get_linear_model_metrics(X, y, lm)


# The set of variables I will be checking coefficient, p-value, r-squared values:
x_sets = (
    ['raw_l2c_score'],
    ['raw_FICO_telecom'],
    ['raw_FICO_retail'],
    ['raw_FICO_bank_card'],
    ['raw_FICO_money'],
    ['raw_l2c_score', 'raw_FICO_telecom'],
    ['raw_l2c_score', 'raw_FICO_retail'],
    ['raw_l2c_score', 'raw_FICO_bank_card'],
    ['raw_l2c_score', 'raw_FICO_money'],
    ['raw_FICO_money','raw_FICO_bank_card']
)

# Using logistic regression as the linear model
for x in x_sets:
    print(', '.join(x))
    get_linear_model_metrics(trainX[x], trainY.FullyFunded, linear_model.LogisticRegression())
    print()
raw_l2c_score
P Values: [ 0.00570359]
Coefficients: [[ 0.00216036]]
y-intercept: [-0.29147695]
R-Squared: 0.73

raw_FICO_telecom
P Values: [ 0.00297036]
Coefficients: [[ 0.00364486]]
y-intercept: [-1.05015373]
R-Squared: 0.7325

raw_FICO_retail
P Values: [ 0.06799307]
Coefficients: [[ 0.00169972]]
y-intercept: [  2.77129940e-06]
R-Squared: 0.73

raw_FICO_bank_card
P Values: [ 0.01775891]
Coefficients: [[ 0.00151773]]
y-intercept: [  2.20145093e-06]
R-Squared: 0.73

raw_FICO_money
P Values: [ 0.00098666]
Coefficients: [[ 0.00167863]]
y-intercept: [  2.65596142e-06]
R-Squared: 0.73

raw_l2c_score, raw_FICO_telecom
P Values: [ 0.00570359  0.00297036]
Coefficients: [[ 0.0015589   0.00018553]]
y-intercept: [-0.02477229]
R-Squared: 0.73

raw_l2c_score, raw_FICO_retail
P Values: [ 0.00570359  0.06799307]
Coefficients: [[ 0.00184632 -0.00016244]]
y-intercept: [-0.00130779]
R-Squared: 0.73

raw_l2c_score, raw_FICO_bank_card
P Values: [ 0.00570359  0.01775891]
Coefficients: [[ 0.00205132 -0.00033713]]
y-intercept: [ -3.90915802e-06]
R-Squared: 0.7275

raw_l2c_score, raw_FICO_money
P Values: [ 0.00570359  0.00098666]
Coefficients: [[ 0.00185728 -0.0001708 ]]
y-intercept: [-0.00131104]
R-Squared: 0.73

raw_FICO_money, raw_FICO_bank_card
P Values: [ 0.00098666  0.01775891]
Coefficients: [[ 0.00399904 -0.00210482]]
y-intercept: [ -5.06615670e-05]
R-Squared: 0.73

I opted to use raw_FICO_telecom since it accounts for utility and phone bills, which seems useful for people new to credit, and it returned the highest score (though only barely, at 0.7325 compared to 0.73). Note that the value printed as R-Squared is really mean accuracy, since that is what score() returns for LogisticRegression. Adding a second credit score did not improve it.

I will also try a ratio between raw_l2c_score and raw_FICO_telecom. From searching online, the l2c score appears to be a way to score people with little to no credit history.

Source: http://money.cnn.com/2015/04/02/pf/new-fico-credit-score/index.html
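Since score() on a classifier is accuracy, a useful sanity check is to compare it with a baseline that always predicts the most common class. The sketch below uses synthetic data; the 73% positive rate is a made-up value for illustration, and the real class balance of trainY is not shown above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(600, 50, size=(400, 1))  # synthetic stand-in for one credit score
y = rng.random(400) < 0.73              # made-up class balance, unrelated to X

clf = LogisticRegression().fit(X, y)
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)

# For classifiers, .score() is mean accuracy on the given data.
assert np.isclose(clf.score(X, y), accuracy_score(y, clf.predict(X)))

print("logistic regression accuracy:", clf.score(X, y))
print("majority-class baseline:     ", baseline.score(X, y))
```

If a model's accuracy sits at the baseline, it is probably predicting the majority class for almost every row, and a ranking metric such as AUC says more about it.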

Removing columns that may not have anything to do with whether a loan is fully funded.

Removing home_phone_type, bank routing number, customer id, status

In [20]:
# Home phone type
trainX.home_phone_type.value_counts()
# Bank routing number
# customer_id
# status - all loans are Approved in this dataset and the testX dataset.
Out[20]:
Mobile    334
Home       63
Work        3
Name: home_phone_type, dtype: int64

Going to use credit scores as primary predictors of whether or not they will be fully funded.

Checking for quirks in their distributions.

In [21]:
f, axarr = plt.subplots(3, 2, figsize=(14, 10));
# funded = trainY.FullyFunded.values

axarr[0,0].hist(trainX['raw_l2c_score'].values, bins = 20)
axarr[0,0].set_title('raw_l2c_score');

axarr[0,1].hist(trainX['raw_FICO_telecom'].values, bins = 20)
axarr[0,1].set_title('raw_FICO_telecom');

axarr[1,0].hist(trainX['raw_FICO_retail'].values, bins = 20)
axarr[1,0].set_title('raw_FICO_retail');

axarr[1,1].hist(trainX['raw_FICO_bank_card'].values, bins = 20)
axarr[1,1].set_title('raw_FICO_bank_card');

axarr[2,0].hist(trainX['raw_FICO_money'].values, bins = 20)
axarr[2,0].set_title('raw_FICO_money');

axarr[2,1].hist(trainX['monthly_income_amount'].values, bins = 20)
axarr[2,1].set_title('monthly_income_amount');
In [22]:
f, axarr = plt.subplots(3, 2, figsize=(10, 9))
funded = trainY.FullyFunded.values
axarr[0, 0].scatter(trainX.num_payments.values, funded)
axarr[0, 0].set_title('num_payments');
axarr[0, 1].scatter(trainX.raw_FICO_telecom.values, funded)
axarr[0, 1].set_title('raw_FICO_telecom');
axarr[1, 0].scatter(trainX.monthly_rent_amount.values, funded)
axarr[1, 0].set_title('monthly_rent_amount');
axarr[1, 1].scatter(trainX['raw_l2c_score'].values, funded)
axarr[1, 1].set_title('raw_l2c_score');
axarr[2, 0].scatter(trainX.raw_FICO_bank_card.values, funded)
axarr[2, 0].set_title('raw_FICO_bank_card');
axarr[2, 1].scatter(trainX.raw_FICO_retail.values, funded)
axarr[2, 1].set_title('raw_FICO_retail');

Perhaps people who consistently make payments (num_payments) are more likely to get funding? Could be a contributor to credit history. Wanted to see if there was anything with that, but nothing noticeable showed up in these plots.
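Scatterplots against a True/False target are hard to read because the points pile up on two horizontal lines. An alternative is to bin the feature and compare the funded rate per bin. The data below is synthetic (the relationship between num_payments and funding is invented), so this only shows the mechanics.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
num_payments = rng.integers(3, 35, size=400)
# Made-up target whose probability rises with num_payments, for illustration only.
funded = rng.random(400) < (0.4 + 0.015 * num_payments)

toy = pd.DataFrame({"num_payments": num_payments, "FullyFunded": funded})

# Split the feature into quartile bins, then compute the funded rate in each bin.
toy["payments_bin"] = pd.qcut(toy["num_payments"], q=4)
rate_by_bin = toy.groupby("payments_bin", observed=True)["FullyFunded"].agg(["mean", "size"])

print(rate_by_bin)
```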

What about categorizing into levels of FICO score? Using get dummies to assign 0 & 1 to the various levels.

Figuring out how to assign scores by credit level.

Creating tiers based off the original FICO Score

  • Exceptional, Very Good, Good, Fair, Very Poor

FICO Score

Assigning 5 levels based on the raw_FICO_telecom score:

  • Exceptional (800-850)
  • Very Good (740 to 799)
  • Good (670 to 739)
  • Fair (580 to 669)
  • Very Poor (579 and below)
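The tiers listed above could also be assigned in a single call with pd.cut. This is a sketch of an alternative, not the approach used in the notebook; the sample scores are arbitrary.

```python
import pandas as pd

# Upper edge of each tier. pd.cut uses right-closed intervals by default,
# so (579, 669] covers the integer scores 580 to 669.
bins = [0, 579, 669, 739, 799, 850]
labels = ["very poor", "fair", "good", "very good", "exceptional"]

scores = pd.Series([574, 501, 603, 693, 580, 579, 800])
tiers = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

print(tiers.tolist())

# The result is categorical, so get_dummies creates a column for every tier,
# including tiers that do not occur in the data.
tier_dummies = pd.get_dummies(tiers, prefix="credit")
print(list(tier_dummies.columns))
```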
In [23]:
train_X = trainX.copy() # train_X is the copy dataset!

def fico_telecom_level(df):
    df.loc[df['raw_FICO_telecom'].between(800,850), 'credit_level'] = 'exceptional'
    df.loc[df['raw_FICO_telecom'].between(740,799), 'credit_level'] = 'very good'
    df.loc[df['raw_FICO_telecom'].between(670,739), 'credit_level'] = 'good'
    df.loc[df['raw_FICO_telecom'].between(580,669), 'credit_level'] = 'fair'
    df.loc[df['raw_FICO_telecom'].between(0,579), 'credit_level'] = 'very poor'

    return df

def credit_dummies(df):
    credit_level_dummies = pd.get_dummies(df['credit_level'], prefix='credit')
    df = pd.concat([df, credit_level_dummies], axis=1)
    df.drop('credit_level', axis=1, inplace=True)
    
    return df

# Dropping features that don't seem to be very predictive
def drop_features(df):
    dropCols = ['bank_routing_number', 'customer_id', 'home_phone_type','status','address_zip',
               'application_when','bank_account_direct_deposit', 'payment_ach',
               'residence_rent_or_own']
    
    # inplace=True modifies df and returns None, so the result is not reassigned
    df.drop(dropCols, axis = 1, inplace = True)
    return df

# Dropping the credit score columns
def drop_credit(df):
    creditCols = ['raw_l2c_score','raw_FICO_retail', 'raw_FICO_bank_card','raw_FICO_money'] # no raw_FICO_telecom
    df.drop(creditCols, axis = 1, inplace = True)
    return df

# Checking to see if training data set copy is same as original
train_X.head()
Out[23]:
customer_id status residence_rent_or_own monthly_rent_amount bank_account_direct_deposit application_when loan_duration payment_ach num_payments address_zip bank_routing_number home_phone_type monthly_income_amount raw_l2c_score raw_FICO_telecom raw_FICO_retail raw_FICO_bank_card raw_FICO_money
0 9ece67d6c5 Approved True 0 True 2010-10-16 14:06:00 3 True 6 84118 124001545 Mobile 1560 614 574 600 656 561
1 5c2c402094 Approved True 0 False 2010-10-17 13:01:00 6 True 13 84062 124000054 Mobile 900 708 501 550 651 563
2 e6254cad30 Approved True 620 True 2010-10-17 19:52:00 6 True 13 84119 124001545 Mobile 1434 687 522 561 661 598
3 49fb42f51d Approved False 785 False 2010-10-18 07:05:00 4 True 8 84405 124002971 Mobile 1600 616 560 552 634 591
4 195fbe5739 Approved True 700 True 2010-10-19 10:58:00 4 True 8 84404 124002971 Mobile 1360 681 603 654 659 636
In [24]:
# Applying functions to training data set.
drop_features(train_X)
drop_credit(train_X)
train_X = credit_dummies(fico_telecom_level(train_X)) # needs to be assigned to save the dataframe result.

# Check to see if functions transformed dataset successfully.
train_X.head()
Out[24]:
monthly_rent_amount loan_duration num_payments monthly_income_amount raw_FICO_telecom credit_fair credit_good credit_very poor
0 0 3 6 1560 574 0 0 1
1 0 6 13 900 501 0 0 1
2 620 6 13 1434 522 0 0 1
3 785 4 8 1600 560 0 0 1
4 700 4 8 1360 603 1 0 0

At first get_dummies did not seem to work inside a function.
The result has to be assigned back (as in the cell above), because pd.concat returns a new DataFrame instead of modifying the original.
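One thing to watch with get_dummies is that the training data only produced three tier columns (fair, good, very poor). If the test data contains a different set of tiers, its dummy columns will not match. A possible guard, sketched on toy data, is to reindex the test dummies to the training columns.

```python
import pandas as pd

# Toy tier labels; the test set contains a tier the training set never saw.
train_levels = pd.Series(["very poor", "fair", "good", "very poor"], name="credit_level")
test_levels = pd.Series(["fair", "very good", "very poor"], name="credit_level")

train_dummies = pd.get_dummies(train_levels, prefix="credit")
test_dummies = pd.get_dummies(test_levels, prefix="credit")

# Force the test frame to have exactly the training columns, in the same order.
# Columns missing from the test set are filled with 0; unseen ones are dropped.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns))
```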

Trying out CART as my first model.

In [25]:
scoring = 'roc_auc'
num_folds = 5
seed = 10
kfold = KFold(n_splits = num_folds) # shuffle is off by default, so random_state has no effect (newer scikit-learn rejects it here)
y = trainY['FullyFunded'] # Extract target (FullyFunded), not customer_id.

model = DecisionTreeClassifier(random_state = seed)
# Fits the model
model.fit(train_X, y)

scores = cross_val_score(model, train_X, y, cv = kfold, scoring = scoring)
print('CV AUC {}, Avg. AUC {}'.format(scores, scores.mean()))
CV AUC [ 0.54615385  0.63034188  0.63696716  0.70519713  0.56974922], Avg. AUC 0.6176818462496471
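The roc_auc scoring used above suits the assignment's requirement that higher prediction values mean "more likely TRUE", because AUC depends only on how the scores rank the loans. Here is a minimal sketch on synthetic data: make_classification stands in for the loan features and LogisticRegression is just a placeholder model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the loan data.
X, y = make_classification(n_samples=400, n_features=5, random_state=10)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 of predict_proba is the probability of clf.classes_[1], the positive class.
scores = clf.predict_proba(X_test)[:, 1]
auc_value = roc_auc_score(y_test, scores)
print("AUC:", round(auc_value, 4))

# AUC depends only on the ranking, so a strictly increasing transform of the
# scores leaves it unchanged.
assert np.isclose(auc_value, roc_auc_score(y_test, 10 * scores + 3))
```

Values like these probabilities are what would go into column B of the testY tab.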

Tweaking the CART model parameters. I picked these values for max_depth and min_samples_leaf based on the class where we worked through the StumbleUpon data.

In [26]:
model = DecisionTreeClassifier(
                max_depth = 4,
                min_samples_leaf = 5, random_state = seed)

model.fit(train_X, y)
scores = cross_val_score(model, train_X, y, cv = kfold, scoring = scoring)
print('CV AUC {}, Avg. AUC {}'.format(scores, scores.mean()))
CV AUC [ 0.54410256  0.66381766  0.77952481  0.80241935  0.79663009], Avg. AUC 0.7172988969259038

Changing max_depth and min_samples_leaf raises the average AUC by about 0.10 (from 0.618 to 0.717), although the spread across folds also grows.
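Rather than borrowing parameter values from another dataset, they could be searched for with GridSearchCV, which is imported at the top but not used so far. This sketch runs on synthetic data, and the grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for train_X and y.
X, y = make_classification(n_samples=400, n_features=5, random_state=10)

# Arbitrary candidate values; None means the tree depth is unlimited.
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 10, 20],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=10),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=KFold(n_splits=5),
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 4))
```

With only 400 rows the best combination can change from fold to fold, so the winning score should be read as optimistic.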

In [27]:
features = train_X.columns
feature_importances = model.feature_importances_

features_df = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})
features_df.sort_values('Importance Score', inplace=True, ascending=False)

features_df.head(20)
Out[27]:
Features Importance Score
2 num_payments 0.402043
1 loan_duration 0.248258
4 raw_FICO_telecom 0.171630
0 monthly_rent_amount 0.120484
3 monthly_income_amount 0.057585
5 credit_fair 0.000000
6 credit_good 0.000000
7 credit_very poor 0.000000

The indicator columns I created turned out to have zero importance in the model.
Although the FICO telecom score shares the original FICO score range, the tiers did not add anything meaningful: a tree can already split on the raw score at any threshold.

Evaluating various models' performances

In [28]:
def model_performance(df):
    scoring = 'roc_auc'
    num_folds = 5
    seed = 10
    kfold = KFold(n_splits = num_folds) # no shuffling, so random_state is not needed

    results = []
    names = []
    models = []

    # Tweak CART ---- max_depth = 4, min_samples_leaf = 5,
    models.append( ('CART', DecisionTreeClassifier(random_state = seed)) )
    models.append( ('CART v2', DecisionTreeClassifier(random_state = seed, max_depth = 4, min_samples_leaf = 5)) )
    models.append( ('RandomForestClassifier', RandomForestClassifier(random_state = seed)) )
    models.append( ('LogisticRegression', LogisticRegression(random_state = seed)) )
    models.append( ('SVC', SVC(random_state = seed)) )
    models.append( ('GradientBoostingClassifier', GradientBoostingClassifier(random_state = seed)) )
    models.append( ('KNeighborsClassifier', KNeighborsClassifier()) ) 
    models.append( ('XGBClassifier', xgb.XGBClassifier(seed = seed)) )


    for name, model in models:
        kfold = KFold(n_splits = num_folds) # no shuffling, so random_state is not needed

        names.append(name)
        model.fit(df, trainY['FullyFunded'])

        scores = cross_val_score(model, df, trainY['FullyFunded'], cv = kfold, scoring = scoring)
        results.append(scores)

        msg = "{}: {} ({})".format(name, scores.mean(), scores.std())
        print(msg)
        print()
    return df
In [29]:
model_performance(train_X)
CART: 0.6176818462496471 (0.05586576194485695)

CART v2: 0.7172988969259038 (0.10026784846312621)

RandomForestClassifier: 0.7136026666627964 (0.06201602954831483)

LogisticRegression: 0.7001591302548761 (0.050607798742858014)

SVC: 0.507339361282107 (0.02479878072069761)

GradientBoostingClassifier: 0.71222368826686 (0.10678444093660712)

KNeighborsClassifier: 0.5684595632821964 (0.05419567007366402)

XGBClassifier: 0.7412029512374759 (0.07457911863731648)

Out[29]:
monthly_rent_amount loan_duration num_payments monthly_income_amount raw_FICO_telecom credit_fair credit_good credit_very poor
0 0 3 6 1560 574 0 0 1
1 0 6 13 900 501 0 0 1
2 620 6 13 1434 522 0 0 1
3 785 4 8 1600 560 0 0 1
4 700 4 8 1360 603 1 0 0
5 1600 4 4 4900 573 0 0 1
6 865 4 4 3200 580 1 0 0
7 1000 2 4 2200 537 0 0 1
8 169 4 4 1200 532 0 0 1
9 1265 6 13 4160 527 0 0 1
10 1175 4 8 5326 526 0 0 1
11 0 6 12 2200 501 0 0 1
12 720 4 17 2400 501 0 0 1
13 190 6 13 1200 585 1 0 0
14 0 4 17 1920 654 1 0 0
15 860 4 8 1200 501 0 0 1
16 300 6 12 1160 614 1 0 0
17 750 6 13 9000 592 1 0 0
18 650 6 13 600 501 0 0 1
19 635 6 26 3600 525 0 0 1
20 889 6 12 2580 559 0 0 1
21 0 6 13 429 501 0 0 1
22 650 6 12 4500 524 0 0 1
23 700 8 13 4200 573 0 0 1
24 0 7 15 1200 518 0 0 1
25 1100 5 5 3007 570 0 0 1
26 750 5 10 5260 556 0 0 1
27 0 5 10 2400 611 1 0 0
28 0 5 10 1200 534 0 0 1
29 89 5 10 1120 543 0 0 1
... ... ... ... ... ... ... ... ...
370 375 6 13 3600 549 0 0 1
371 800 5 10 3370 561 0 0 1
372 620 8 8 2600 635 1 0 0
373 0 6 13 3720 576 0 0 1
374 100 6 13 3100 602 1 0 0
375 1182 8 6 2600 539 0 0 1
376 400 5 5 1500 512 0 0 1
377 800 5 5 2998 535 0 0 1
378 500 5 5 3000 620 1 0 0
379 1326 5 5 3192 693 0 1 0
380 575 5 10 1600 587 1 0 0
381 250 6 6 3500 575 0 0 1
382 650 8 13 2000 571 0 0 1
383 375 6 13 1160 564 0 0 1
384 645 5 10 1266 578 0 0 1
385 0 6 6 1887 536 0 0 1
386 0 3 6 3440 602 1 0 0
387 400 5 10 2400 539 0 0 1
388 250 5 10 930 587 1 0 0
389 382 5 10 5600 583 1 0 0
390 485 5 5 1000 638 1 0 0
391 873 5 10 2466 559 0 0 1
392 945 5 10 1300 572 0 0 1
393 375 5 10 3850 501 0 0 1
394 1200 8 6 3133 516 0 0 1
395 0 5 5 1173 513 0 0 1
396 400 5 10 3200 513 0 0 1
397 0 7 13 1860 576 0 0 1
398 0 5 10 3000 511 0 0 1
399 400 5 10 1200 576 0 0 1

400 rows × 8 columns

In [30]:
scoring = 'roc_auc'
num_folds = 5
seed = 10
kfold = KFold(n_splits = num_folds) # no shuffling, so random_state is not needed


results = []
names = []
models = []

# CART v2 is the modified CART ---- max_depth = 4, min_samples_leaf = 5,
models.append( ('CART', DecisionTreeClassifier(random_state = seed)) )
models.append( ('CART v2', DecisionTreeClassifier(random_state = seed, max_depth = 4, min_samples_leaf = 5)) )
models.append( ('RandomForestClassifier', RandomForestClassifier(random_state = seed)) )
models.append( ('LogisticRegression', LogisticRegression(random_state = seed)) )
models.append( ('SVC', SVC(random_state = seed)) )
models.append( ('GradientBoostingClassifier', GradientBoostingClassifier(random_state = seed)) )
models.append( ('KNeighborsClassifier', KNeighborsClassifier()) ) 
models.append( ('XGBClassifier', xgb.XGBClassifier(seed = seed)) )


for name, model in models:
    kfold = KFold(n_splits = num_folds) # no shuffling, so random_state is not needed
    
    names.append(name)
    model.fit(train_X, trainY['FullyFunded'])
    
    scores = cross_val_score(model, train_X, trainY['FullyFunded'], cv = kfold, scoring = scoring)
    results.append(scores)
    
    msg = "{}: {} ({})".format(name, scores.mean(), scores.std())
    print(msg)
    print()
CART: 0.6176818462496471 (0.05586576194485695)

CART v2: 0.7172988969259038 (0.10026784846312621)

RandomForestClassifier: 0.7136026666627964 (0.06201602954831483)

LogisticRegression: 0.7001591302548761 (0.050607798742858014)

SVC: 0.507339361282107 (0.02479878072069761)

GradientBoostingClassifier: 0.71222368826686 (0.10678444093660712)

KNeighborsClassifier: 0.5684595632821964 (0.05419567007366402)

XGBClassifier: 0.7412029512374759 (0.07457911863731648)

Refinement Workflow Results - Trial 1

Model                       Description          Local CV AUC (Std Dev)
CART                        Baseline             0.6176 (0.0558)
CART v2                     Modified parameters  0.7172 (0.1002)
RandomForestClassifier      Default parameters   0.7136 (0.0620)
LogisticRegression          Default parameters   0.7001 (0.0506)
SVC                         Default parameters   0.5073 (0.0247)
GradientBoostingClassifier  Default parameters   0.7122 (0.1067)
KNeighborsClassifier        Default parameters   0.5684 (0.0541)
XGBClassifier               Default parameters   0.7412 (0.0745)

With a training dataset of only 8 columns, XGBClassifier scores the highest.
The engineered features in this first trial were one-hot encoded credit score tiers (only fair, good and very poor occur in the training data), and they did not end up contributing to the CART model.

Feature Engineering (FE)

  • Created a payment_loan_ratio feature that divides the number of payments by the loan duration. Making more payments may be indicative of getting a fully funded loan. In the scatterplot above, num_payments on its own did not show any correlation with FullyFunded, although making more payments over the course of a loan could build credit.
  • Created an income_rent_difference feature, which shows how much income a person has left over after subtracting their rent payment for the month. People who have more leftover money could be more likely to make their loan payments.
  • Created a telecom_l2c_ratio feature, the ratio of raw_FICO_telecom to raw_l2c_score, since both scores appear to measure people who are new to credit (meaning they have little to no credit history). Settled on dividing by raw_l2c_score since it had the higher max value in the dataset.
    • Used describe() on raw_l2c_score & raw_FICO_telecom to check.
In [31]:
# Feature Engineering (FE)
def first_order_FE(df):
    df["payment_loan_ratio"] = df.apply(payment_loan_ratio, axis=1)
    df["income_rent_difference"] = df.apply(income_rent_difference, axis=1)
    df["telecom_l2c_ratio"] = df.apply(telecom_l2c_ratio, axis=1)
    return df

def payment_loan_ratio(df):
    return df.num_payments / df.loan_duration

def income_rent_difference(df):
    return df.monthly_income_amount - df.monthly_rent_amount

def telecom_l2c_ratio(df):
    return df.raw_FICO_telecom / df.raw_l2c_score
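
The three features can also be computed column-wise instead of row by row with apply. The block below is a minimal sketch under some assumptions: the function name first_order_FE_vectorized is hypothetical, it returns a copy rather than modifying the frame in place, and the zero-denominator guard is an addition that the original functions do not have. The sample rows are the first three training rows displayed in this notebook.

```python
import numpy as np
import pandas as pd

def first_order_FE_vectorized(df):
    """Hypothetical column-wise variant of first_order_FE; returns a new frame."""
    out = df.copy()
    # Replace zero denominators with NaN so the ratios become NaN instead of inf.
    out["payment_loan_ratio"] = out["num_payments"] / out["loan_duration"].replace(0, np.nan)
    out["income_rent_difference"] = out["monthly_income_amount"] - out["monthly_rent_amount"]
    out["telecom_l2c_ratio"] = out["raw_FICO_telecom"] / out["raw_l2c_score"].replace(0, np.nan)
    return out

# First three rows of the training data as displayed in this notebook.
sample = pd.DataFrame({
    "monthly_rent_amount": [0, 0, 620],
    "loan_duration": [3, 6, 6],
    "num_payments": [6, 13, 13],
    "monthly_income_amount": [1560, 900, 1434],
    "raw_l2c_score": [614, 708, 687],
    "raw_FICO_telecom": [574, 501, 522],
})

features = first_order_FE_vectorized(sample)
print(features[["payment_loan_ratio", "income_rent_difference", "telecom_l2c_ratio"]])
```

Column arithmetic avoids calling a Python function once per row, which matters little at 400 rows but scales better on larger data.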

One additional feature engineering idea is to predict negative income trends by looking at how a person's monthly income changes from month to month. That data is not available, though.

  • A shortcoming of income_rent_difference is that we do not know the person's housing situation.
  • If they split rent with others, the reported rent can be higher than their income.
    • Also, the residence_rent_or_own variable is very strange.
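
One way to make the rent-above-income case visible to a model is an explicit indicator column. This is only a sketch: the rent_exceeds_income column is a hypothetical addition that was not part of the submitted model, and the three rows are rows 17 to 19 of the engineered training data displayed in this notebook.

```python
import pandas as pd

# Rows 17-19 of the engineered training data as displayed in this notebook.
rows = pd.DataFrame(
    {
        "monthly_rent_amount": [750, 650, 635],
        "monthly_income_amount": [9000, 600, 3600],
    },
    index=[17, 18, 19],
)

rows["income_rent_difference"] = rows["monthly_income_amount"] - rows["monthly_rent_amount"]

# Hypothetical indicator: True where the reported rent is larger than the reported income.
rows["rent_exceeds_income"] = rows["income_rent_difference"] < 0

print(rows)
```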

Transforming the dataset using feature engineering.

In [32]:
# Dropping different credit score columns
def drop_credit_FE(df):
    creditCols = ['raw_FICO_retail', 'raw_FICO_bank_card','raw_FICO_money'] # no raw_FICO_telecom or raw_l2c_score
    # drop(..., inplace = True) returns None, so the result must not be assigned back to df
    df.drop(creditCols, axis = 1, inplace = True)
    return df
In [33]:
train_X2 = trainX.copy()
In [34]:
first_order_FE(train_X2)
drop_features(train_X2)
drop_credit_FE(train_X2)
print("Data shape is:",train_X2.shape)
train_X2.head()
Data shape is: (400, 9)
Out[34]:
monthly_rent_amount loan_duration num_payments monthly_income_amount raw_l2c_score raw_FICO_telecom payment_loan_ratio income_rent_difference telecom_l2c_ratio
0 0 3 6 1560 614 574 2.000000 1560 0.934853
1 0 6 13 900 708 501 2.166667 900 0.707627
2 620 6 13 1434 687 522 2.166667 814 0.759825
3 785 4 8 1600 616 560 2.000000 815 0.909091
4 700 4 8 1360 681 603 2.000000 660 0.885463

Evaluating Model Performances (Test 2)

In [35]:
model_performance(train_X2)
CART: 0.6394799994283697 (0.060898134262455106)

CART v2: 0.719942773695182 (0.0771449229982496)

RandomForestClassifier: 0.7034117021847415 (0.04230854804834636)

LogisticRegression: 0.6996489925448727 (0.03263125738942909)

SVC: 0.5034252539912918 (0.004231061736970055)

GradientBoostingClassifier: 0.7310378055090209 (0.07371397740202118)

KNeighborsClassifier: 0.5228196330176098 (0.05548563404216828)

XGBClassifier: 0.74950630403108 (0.06869051525966671)

Out[35]:
monthly_rent_amount loan_duration num_payments monthly_income_amount raw_l2c_score raw_FICO_telecom payment_loan_ratio income_rent_difference telecom_l2c_ratio
0 0 3 6 1560 614 574 2.000000 1560 0.934853
1 0 6 13 900 708 501 2.166667 900 0.707627
2 620 6 13 1434 687 522 2.166667 814 0.759825
3 785 4 8 1600 616 560 2.000000 815 0.909091
4 700 4 8 1360 681 603 2.000000 660 0.885463
5 1600 4 4 4900 716 573 1.000000 3300 0.800279
6 865 4 4 3200 486 580 1.000000 2335 1.193416
7 1000 2 4 2200 530 537 2.000000 1200 1.013208
8 169 4 4 1200 645 532 1.000000 1031 0.824806
9 1265 6 13 4160 586 527 2.166667 2895 0.899317
10 1175 4 8 5326 564 526 2.000000 4151 0.932624
11 0 6 12 2200 663 501 2.000000 2200 0.755656
12 720 4 17 2400 523 501 4.250000 1680 0.957935
13 190 6 13 1200 712 585 2.166667 1010 0.821629
14 0 4 17 1920 765 654 4.250000 1920 0.854902
15 860 4 8 1200 252 501 2.000000 340 1.988095
16 300 6 12 1160 734 614 2.000000 860 0.836512
17 750 6 13 9000 537 592 2.166667 8250 1.102421
18 650 6 13 600 645 501 2.166667 -50 0.776744
19 635 6 26 3600 557 525 4.333333 2965 0.942549
20 889 6 12 2580 602 559 2.000000 1691 0.928571
21 0 6 13 429 616 501 2.166667 429 0.813312
22 650 6 12 4500 716 524 2.000000 3850 0.731844
23 700 8 13 4200 530 573 1.625000 3500 1.081132
24 0 7 15 1200 733 518 2.142857 1200 0.706685
25 1100 5 5 3007 771 570 1.000000 1907 0.739300
26 750 5 10 5260 645 556 2.000000 4510 0.862016
27 0 5 10 2400 544 611 2.000000 2400 1.123162
28 0 5 10 1200 638 534 2.000000 1200 0.836991
29 89 5 10 1120 762 543 2.000000 1031 0.712598
... ... ... ... ... ... ... ... ... ...
370 375 6 13 3600 602 549 2.166667 3225 0.911960
371 800 5 10 3370 716 561 2.000000 2570 0.783520
372 620 8 8 2600 638 635 1.000000 1980 0.995298
373 0 6 13 3720 550 576 2.166667 3720 1.047273
374 100 6 13 3100 593 602 2.166667 3000 1.015177
375 1182 8 6 2600 763 539 0.750000 1418 0.706422
376 400 5 5 1500 808 512 1.000000 1100 0.633663
377 800 5 5 2998 586 535 1.000000 2198 0.912969
378 500 5 5 3000 541 620 1.000000 2500 1.146026
379 1326 5 5 3192 666 693 1.000000 1866 1.040541
380 575 5 10 1600 231 587 2.000000 1025 2.541126
381 250 6 6 3500 645 575 1.000000 3250 0.891473
382 650 8 13 2000 712 571 1.625000 1350 0.801966
383 375 6 13 1160 492 564 2.166667 785 1.146341
384 645 5 10 1266 586 578 2.000000 621 0.986348
385 0 6 6 1887 748 536 1.000000 1887 0.716578
386 0 3 6 3440 546 602 2.000000 3440 1.102564
387 400 5 10 2400 525 539 2.000000 2000 1.026667
388 250 5 10 930 542 587 2.000000 680 1.083026
389 382 5 10 5600 492 583 2.000000 5218 1.184959
390 485 5 5 1000 762 638 1.000000 515 0.837270
391 873 5 10 2466 499 559 2.000000 1593 1.120240
392 945 5 10 1300 522 572 2.000000 355 1.095785
393 375 5 10 3850 536 501 2.000000 3475 0.934701
394 1200 8 6 3133 614 516 0.750000 1933 0.840391
395 0 5 5 1173 616 513 1.000000 1173 0.832792
396 400 5 10 3200 564 513 2.000000 2800 0.909574
397 0 7 13 1860 617 576 1.857143 1860 0.933549
398 0 5 10 3000 530 511 2.000000 3000 0.964151
399 400 5 10 1200 501 576 2.000000 800 1.149701

400 rows × 9 columns

Model Description              Performance 2 - Performance 1   Change in CV Score
Baseline: CART                 0.6394 - 0.6176                 0.0218
CART v2: Modified parameters   0.7199427737 - 0.7172           0.0027
RandomForestClassifier         0.7034117022 - 0.7136           -0.01018
Logistic Regression            0.6996489925 - 0.7001           -0.0004510
SVC                            0.503425254 - 0.5073            -0.003874
GradientBoostingClassifier     0.7310378055 - 0.7122           0.01883
KNeighborsClassifier           0.522819633 - 0.5684            -0.04558
XGBClassifier                  0.749506304 - 0.7412            0.008306

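Overall, feature engineering changed model performance only slightly. The baseline CART improved by 0.0218 (about 2 percentage points) and GradientBoostingClassifier by 0.0188, while the modified CART (CART v2) and XGBClassifier each improved by less than 0.01. The other four models scored slightly lower, with KNeighborsClassifier dropping the most (0.0456).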
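
The comparison above can be reproduced directly from the mean CV scores printed by the two evaluation runs. This is a small sketch, not part of the original notebook; the dictionary and column names are made up for the example. Because it subtracts the full-precision Test 1 scores rather than the truncated ones, a few changes differ from the table in the fourth decimal place.

```python
import pandas as pd

# Mean CV scores as printed by the two evaluation runs.
trial_1 = {
    "CART": 0.6176818462,
    "CART v2": 0.7172988969,
    "RandomForestClassifier": 0.7136026667,
    "LogisticRegression": 0.7001591303,
    "SVC": 0.5073393613,
    "GradientBoostingClassifier": 0.7122236883,
    "KNeighborsClassifier": 0.5684595633,
    "XGBClassifier": 0.7412029512,
}
trial_2 = {
    "CART": 0.6394799994,
    "CART v2": 0.7199427737,
    "RandomForestClassifier": 0.7034117022,
    "LogisticRegression": 0.6996489925,
    "SVC": 0.5034252540,
    "GradientBoostingClassifier": 0.7310378055,
    "KNeighborsClassifier": 0.5228196330,
    "XGBClassifier": 0.7495063040,
}

comparison = pd.DataFrame({"trial_1": trial_1, "trial_2": trial_2})
comparison["change"] = comparison["trial_2"] - comparison["trial_1"]
comparison = comparison.sort_values("change", ascending=False)
print(comparison.round(4))

# Models whose mean CV score went up after feature engineering.
improved = sorted(comparison.index[comparison["change"] > 0])
print("Improved:", improved)
```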

Next is model tuning!

I decided to tune GradientBoostingClassifier and XGBClassifier, the two highest-scoring models in Test 2.

Local CV performance function

In [36]:
def local_cv(model, params, printFeatureImportance=True):
    param_grid = params
    # random_state is omitted because KFold only uses it when shuffle=True
    kfold = KFold(n_splits = num_folds)
    grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
    grid_result = grid.fit(train_X2, trainY.FullyFunded)
    
    # Stays None when feature importances are not requested
    features_df = None
    
    # Plotting the Feature Importances
    if printFeatureImportance:
        predictors = [x for x in train_X2.columns] # Every x column
        feat_imp = pd.Series(grid_result.best_estimator_.feature_importances_, predictors).sort_values(ascending=False)
        feat_imp.plot(kind='bar', title='Feature Importances')
        plt.ylabel('Feature Importance Score')
        # Dataframe of features
        features_df = pd.DataFrame({'Features': predictors, 'Importance Score': grid_result.best_estimator_.feature_importances_})
        features_df.sort_values('Importance Score', inplace=True, ascending=False)
    
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    print()
    # cv_results_ replaces grid_scores_, which was removed in scikit-learn 0.20
    cv_results = grid_result.cv_results_
    for mean_score, std_score, param_set in zip(cv_results['mean_test_score'], cv_results['std_test_score'], cv_results['params']):
        print("%f (%f) with: %r" % (mean_score, std_score, param_set))
    
    return features_df

GradientBoostingClassifier

In [37]:
GradientBoostingClassifier().get_params().keys()
Out[37]:
dict_keys(['loss', 'min_samples_leaf', 'max_depth', 'min_impurity_split', 'learning_rate', 'init', 'max_features', 'n_estimators', 'verbose', 'min_weight_fraction_leaf', 'random_state', 'max_leaf_nodes', 'subsample', 'min_samples_split', 'criterion', 'warm_start', 'presort'])

Using default parameters for GradientBoostingClassifier.

In [38]:
params = {"learning_rate": [0.1],
          "n_estimators": [100],
          "min_samples_split": [2], 
          "min_samples_leaf": [1],
          "max_depth": [3]}
local_cv(GradientBoostingClassifier(random_state = seed), params)
Best: 0.731038 using {'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 3, 'learning_rate': 0.1, 'n_estimators': 100}

0.731038 (0.073714) with: {'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 3, 'learning_rate': 0.1, 'n_estimators': 100}
Out[38]:
Features Importance Score
8 telecom_l2c_ratio 0.227898
6 payment_loan_ratio 0.124212
7 income_rent_difference 0.124137
3 monthly_income_amount 0.118957
0 monthly_rent_amount 0.115138
5 raw_FICO_telecom 0.105550
4 raw_l2c_score 0.089340
2 num_payments 0.073142
1 loan_duration 0.021627

Using tuned parameters for GradientBoostingClassifier.

In [39]:
params = {"max_depth": [5],
          "n_estimators": [150],
          "min_samples_split": [2],
          "min_samples_leaf": [7],
          "learning_rate": np.arange(0.01, 0.101, 0.01),}

local_cv(GradientBoostingClassifier(random_state = seed), params)
Best: 0.776827 using {'min_samples_split': 2, 'min_samples_leaf': 7, 'max_depth': 5, 'learning_rate': 0.029999999999999999, 'n_estimators': 150}

0.762412 (0.098131) with: {'min_samples_split': 2, 'min_samples_leaf': 7, 'max_depth': 5, 'learning_rate': 0.01, 'n_estimators': 150}
0.764469 (0.087164) with: {'min_samples_split': 2, 'min_samples_leaf': 7, 'max_depth': 5, 'learning_rate': 0.02, 'n_estimators': 150}
0.776827 (0.084874) with: {'min_samples_split': 2, 'min_samples_leaf': 7, 'max_depth': 5, 'learning_rate': 0.029999999999999999, 'n_estimators': 150}
0.769877 (0.084030) with: {'min_samples_split': 2, 'min_samples_leaf': 7, 'max_depth': 5, 'learning_rate': 0.040000000000000001, 'n_estimators': 150}
0.767794 (0.077459) with: {'min_samples_split': 2, 'min_samples_leaf': 7, 'max_depth': 5, 'learning_rate': 0.050000000000000003, 'n_estimators': 150}
0.767196 (0.084415) with: {'min_samples_split': 2, 'min_samples_leaf': 7, 'max_depth': 5, 'learning_rate': 0.060000000000000005, 'n_estimators': 150}
0.766978 (0.071697) with: {'min_samples_split': 2, 'min_samples_leaf': 7, 'max_depth': 5, 'learning_rate': 0.069999999999999993, 'n_estimators': 150}
0.764184 (0.076127) with: {'min_samples_split': 2, 'min_samples_leaf': 7, 'max_depth': 5, 'learning_rate': 0.080000000000000002, 'n_estimators': 150}
0.763203 (0.079038) with: {'min_samples_split': 2, 'min_samples_leaf': 7, 'max_depth': 5, 'learning_rate': 0.089999999999999997, 'n_estimators': 150}
0.770267 (0.071361) with: {'min_samples_split': 2, 'min_samples_leaf': 7, 'max_depth': 5, 'learning_rate': 0.099999999999999992, 'n_estimators': 150}
Out[39]:
Features Importance Score
8 telecom_l2c_ratio 0.218291
6 payment_loan_ratio 0.157007
5 raw_FICO_telecom 0.113896
0 monthly_rent_amount 0.110226
2 num_payments 0.100089
7 income_rent_difference 0.097350
3 monthly_income_amount 0.096922
4 raw_l2c_score 0.081752
1 loan_duration 0.024466

Tuning the parameters raised the best CV score from 0.7310 to 0.7768, an improvement of about 0.046.

Comparing the two feature importance tables, payment_loan_ratio becomes more important in the tuned model (0.124 to 0.157), while the top feature, telecom_l2c_ratio, falls slightly (0.228 to 0.218), so the model relies a little less on a single feature.

Tweaking the learning rate.
Found that a learning rate of 0.0409999, rounded to 0.041, gives the best score.
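
The grid used for this finer search is not shown in the notebook. As a sketch only, a narrower grid around the earlier best value could be built as below; the range of 0.030 to 0.050 and the step of 0.001 are assumptions. Rounding the grid removes floating point artifacts such as 0.0409999 or 0.029999999999999999 from the printed parameters.

```python
import numpy as np

# Assumed range and step; the notebook does not show the grid that was searched.
fine_grid = np.round(np.linspace(0.030, 0.050, 21), 3)

# Same structure as the params dictionaries passed to local_cv.
params = {"max_depth": [5],
          "n_estimators": [150],
          "min_samples_split": [2],
          "min_samples_leaf": [7],
          "learning_rate": fine_grid.tolist()}

print(params["learning_rate"])
```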

In [40]:
params = {"max_depth": [5],
          "n_estimators": [150],
          "min_samples_split": [2],
          "min_samples_leaf": [7],
          "learning_rate": [0.041],
          "max_features": ['auto']}


local_cv(GradientBoostingClassifier(random_state = seed), params)
Best: 0.779260 using {'min_samples_leaf': 7, 'max_depth': 5, 'learning_rate': 0.041, 'n_estimators': 150, 'max_features': 'auto', 'min_samples_split': 2}

0.779260 (0.072950) with: {'min_samples_leaf': 7, 'max_depth': 5, 'learning_rate': 0.041, 'n_estimators': 150, 'max_features': 'auto', 'min_samples_split': 2}
Out[40]:
Features Importance Score
8 telecom_l2c_ratio 0.197587
6 payment_loan_ratio 0.146133
0 monthly_rent_amount 0.125543
5 raw_FICO_telecom 0.122390
7 income_rent_difference 0.120055
3 monthly_income_amount 0.096514
4 raw_l2c_score 0.080484
2 num_payments 0.073079
1 loan_duration 0.038215

Tweaking the learning rate spreads the importance scores more evenly across the features: telecom_l2c_ratio drops to 0.198 and loan_duration rises to 0.038.
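
How evenly the importance is spread can be put into numbers using the three GradientBoostingClassifier importance tables above. The sketch below is not part of the original analysis; the concentration helper is hypothetical. It reports the share held by the top feature and a normalized entropy, where 1.0 would mean perfectly even importances.

```python
import numpy as np

# Importance scores copied from the three GradientBoostingClassifier runs above.
importances = {
    "default": [0.227898, 0.124212, 0.124137, 0.118957, 0.115138,
                0.105550, 0.089340, 0.073142, 0.021627],
    "tuned (learning_rate 0.03)": [0.218291, 0.157007, 0.113896, 0.110226, 0.100089,
                                   0.097350, 0.096922, 0.081752, 0.024466],
    "tuned (learning_rate 0.041)": [0.197587, 0.146133, 0.125543, 0.122390, 0.120055,
                                    0.096514, 0.080484, 0.073079, 0.038215],
}

def concentration(scores):
    """Return (top share, normalized entropy) for a list of importance scores."""
    p = np.asarray(scores, dtype=float)
    p = p / p.sum()  # guard against rounding in the printed scores
    entropy = -(p * np.log(p)).sum() / np.log(len(p))
    return p.max(), entropy

summary = {name: concentration(vals) for name, vals in importances.items()}
for name, (top, ent) in summary.items():
    print("{}: top share {:.3f}, normalized entropy {:.3f}".format(name, top, ent))
```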

XGBClassifier

In [41]:
xgb.XGBClassifier().get_params().keys()
Out[41]:
dict_keys(['missing', 'learning_rate', 'min_child_weight', 'gamma', 'nthread', 'scale_pos_weight', 'reg_alpha', 'max_delta_step', 'silent', 'objective', 'n_estimators', 'subsample', 'base_score', 'max_depth', 'seed', 'colsample_bytree', 'colsample_bylevel', 'reg_lambda'])
In [42]:
params = {"max_depth": [3],
          "n_estimators": [100],
          "min_child_weight": [7]}

local_cv(xgb.XGBClassifier(seed = seed), params)
Best: 0.769292 using {'max_depth': 3, 'min_child_weight': 7, 'n_estimators': 100}

0.769292 (0.037175) with: {'max_depth': 3, 'min_child_weight': 7, 'n_estimators': 100}
Out[42]:
Features Importance Score
3 monthly_income_amount 0.178922
8 telecom_l2c_ratio 0.147059
5 raw_FICO_telecom 0.144608
7 income_rent_difference 0.115196
0 monthly_rent_amount 0.102941
6 payment_loan_ratio 0.098039
4 raw_l2c_score 0.090686
2 num_payments 0.075980
1 loan_duration 0.046569

Creating ensemble model and submitting prediction

First, the test dataset needs to be transformed with the same feature engineering functions that were applied to the training data.

In [43]:
first_order_FE(testX)
drop_features(testX)
drop_credit_FE(testX)
print("Data shape is:", testX.shape)
testX.head()
Data shape is: (247, 9)
Out[43]:
monthly_rent_amount loan_duration num_payments monthly_income_amount raw_l2c_score raw_FICO_telecom payment_loan_ratio income_rent_difference telecom_l2c_ratio
0 750 8 13 3500 718 594 1.625 2750 0.827298
1 525 5 10 670 719 571 2.000 145 0.794159
2 575 3 6 2200 719 647 2.000 1625 0.899861
3 400 8 6 800 616 526 0.750 400 0.853896
4 795 8 6 4141 705 513 0.750 3346 0.727660
In [44]:
# Create the sub models. Using XGBClassifier & GradientBoostingClassifier
estimators = []

model1 = xgb.XGBClassifier(n_estimators = 100, min_child_weight = 7, max_depth = 3)
estimators.append(('XGBClassifier', model1))

model2 = GradientBoostingClassifier(max_depth = 5, n_estimators = 150, 
                                    min_samples_split = 2, 
                                    min_samples_leaf = 7, 
                                    learning_rate = 0.041, 
                                    max_features = 'auto')
estimators.append(('GradientBoostingClassifier', model2))


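# Creating ensemble model
ensemble = VotingClassifier(estimators, voting='soft')
# No scoring argument is passed here, so cross_val_score falls back to the
# estimator's default score (accuracy for classifiers) instead of `scoring`.
results = cross_val_score(ensemble, train_X2, trainY.FullyFunded, cv = kfold) 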

# Printing results
print("CV Scores: {} // Mean Score: {} // Std Dev: {}".format(results, results.mean(), results.std()))
print()

# Fitting model then making a prediction
# (predict returns hard class labels; predict_proba would return scores instead)
ensemble_result = ensemble.fit(train_X2, trainY.FullyFunded)
ensemble_preds = ensemble_result.predict(testX)

print("Estimators variable:\n", estimators)
print()
print(ensemble)
CV Scores: [ 0.8     0.7375  0.8125  0.825   0.825 ] // Mean Score: 0.8 // Std Dev: 0.03259601202601321

Estimators variable:
 [('XGBClassifier', XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=7, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)), ('GradientBoostingClassifier', GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.041, loss='deviance', max_depth=5,
              max_features='auto', max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=7,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=150, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False))]

VotingClassifier(estimators=[('XGBClassifier', XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=7, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,...=150, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False))],
         n_jobs=1, voting='soft', weights=None)

Submission to CSV file

In [45]:
submission = pd.DataFrame({"customerId": testY.customer_id, "FullyFunded": ensemble_preds})
submission.to_csv("HW3_Final_Predictions.csv", index = False)
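
The assignment asks for predictions where higher values are more likely to be TRUE, while predict returns hard class labels. A soft-voting VotingClassifier also exposes predict_proba, which returns such scores. The block below is a minimal sketch on synthetic data, not the submitted model: the data comes from make_classification, RandomForestClassifier stands in for XGBClassifier so the example needs only scikit-learn, and all variable names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)

# Synthetic stand-in data; the real train_X2 / testX frames are not used here.
X, y = make_classification(n_samples=300, n_features=9, random_state=0)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

# RandomForestClassifier stands in for XGBClassifier in this sketch.
demo_ensemble = VotingClassifier(
    estimators=[
        ("gbc", GradientBoostingClassifier(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
demo_ensemble.fit(X_train, y_train)

# Find the positive-class column through classes_ rather than hard-coding it.
positive_col = list(demo_ensemble.classes_).index(1)
scores = demo_ensemble.predict_proba(X_test)[:, positive_col]

# Higher values mean the positive class is considered more likely.
demo_submission = pd.DataFrame({"row": range(len(scores)), "FullyFunded": scores})
print(demo_submission.head())
```

Submitting probabilities rather than labels also lets the predictions be scored with ranking metrics such as ROC AUC.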

Final Workflow Results

Workflow Results - Trial 1: One Hot Encoded Credit Score Levels

Model Description              Local CV (Std Dev)
Baseline: CART                 0.6176 (0.0558)
CART v2: Modified parameters   0.7172 (0.1002)
RandomForestClassifier         0.7136 (0.0620)
Logistic Regression            0.7001 (0.0506)
SVC                            0.5073 (0.0247)
GradientBoostingClassifier     0.7122 (0.1067)
KNeighborsClassifier           0.5684 (0.0541)
XGBClassifier                  0.7412 (0.0745)

With a training dataset of only 7 columns, XGBClassifier scores the highest (0.7412).
The engineered features in this first test were one-hot encoded credit score tiers (exceptional, very good, good, fair, very poor), and they did not end up contributing to the CART model when looking at feature importances.

Workflow Results - Trial 2: With Feature Engineering

Model Description              Performance 2 - Performance 1   Change in CV Score
Baseline: CART                 0.6394 - 0.6176                 0.0218
CART v2: Modified parameters   0.7199427737 - 0.7172           0.0027
RandomForestClassifier         0.7034117022 - 0.7136           -0.01018
Logistic Regression            0.6996489925 - 0.7001           -0.0004510
SVC                            0.503425254 - 0.5073            -0.003874
GradientBoostingClassifier     0.7310378055 - 0.7122           0.01883
KNeighborsClassifier           0.522819633 - 0.5684            -0.04558
XGBClassifier                  0.749506304 - 0.7412            0.008306

Workflow Results - Trial 3: Using GridSearch and Ensemble (w/ Feature Engineering)

Model Description                                                                        Local CV (Std Dev)
GradientBoostingClassifier (Default Parameters) + feature engineering                    0.731038 (0.073714)
GradientBoostingClassifier + feature engineering & tuning                                0.776827 (0.084874)
GradientBoostingClassifier + feature engineering & additional learning rate tuning       0.779260 (0.072950)
XGBClassifier + feature engineering & tuning                                             0.769292 (0.037175)
Ensemble XGBClassifier and GradientBoostingClassifier (feature engineering and tuning)   0.800000 (0.032596)

Note that the ensemble score was computed by cross_val_score without a scoring argument, so it is an accuracy score and is only directly comparable to the rows above if scoring is also set to accuracy.
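
The ensemble row can be checked against the five fold scores printed by the ensemble cell. This is a small verification sketch; the variable names are made up for the example.

```python
import numpy as np

# Fold scores as printed by the ensemble cell.
fold_scores = np.array([0.8, 0.7375, 0.8125, 0.825, 0.825])

ensemble_mean = fold_scores.mean()
ensemble_std = fold_scores.std()  # population standard deviation, NumPy's default

print("Mean: {:.4f} // Std Dev: {:.4f}".format(ensemble_mean, ensemble_std))
```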