This challenge is a standard supervised learning problem. The dataset includes application data for every loan given over a 6 month period.
There are 4 tabs with data, labeled "trainX", "trainY", "testX" and "testY". The train tabs contain the first 4.5 months of loans, while the test tabs have loan data from the following 1.5 months.
trainY and testY contain the targets. They represent whether the loans were fully funded (TRUE values) or partially funded (FALSE values). The fully funded values are withheld from testY, and your job will be to fill them in.
Use trainX and trainY to build a model to predict whether or not future loans will be fully funded. You should then use your model on the data from testX to make new predictions. We’ll score those predictions against the true values of testY to see how well your model performs.
This is intended to be a fairly straightforward task. I didn’t intentionally include any big surprises or “gotchas”. I hope that your model performs well, but it’s even more important that your approach is sound and you avoid major mistakes.
Please put your predictions in column B of the "testY" tab. The predictions should be made such that higher values are more likely to be TRUE and lower values more likely FALSE.
Please include a short description of which evaluation metric you selected and why.
Data Dictionary
Variable | Definition | Type |
---|---|---|
customer_id | unique customer id | alphanumeric |
status | was customer approved or denied | String |
residence_rent_or_own | customer is renting | Boolean |
monthly_rent_amount | monthly rent amount | Numeric |
bank_account_direct_deposit | customer signed up for direct deposit | Boolean |
application_when | date when customer applied for a loan | MM/DD/YY HH:MM |
loan_duration | term of loan | Numeric |
payment_ach | has customer signed up for ACH payments | Boolean |
num_payments | # of payments made by customer | Numeric |
address_zip | customer resident zip code | Numeric |
bank_routing_number | customer bank routing number | Numeric |
home_phone_type | type of customer phone | String |
monthly_income_amount | customer monthly income amount | Numeric |
raw_l2c_score | Third party score | Numeric |
raw_FICO_telecom | Third party score | Numeric |
raw_FICO_retail | Third party score | Numeric |
raw_FICO_bank_card | Third party score | Numeric |
raw_FICO_money | Third party score | Numeric |
FullyFunded | Fund customer | Boolean |
Good luck!
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import random
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
# Importing the various models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Lasso, Ridge
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
from sklearn.svm import SVC
# Ensemble model
from sklearn.ensemble import VotingClassifier
# Import sklearn
from sklearn import feature_selection, linear_model
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
sheets = ['trainX','trainY','testX','testY']
trainX = pd.read_excel("HW3Data.xlsx", sheetname='trainX')
trainY = pd.read_excel("HW3Data.xlsx", sheetname='trainY')
testX = pd.read_excel("HW3Data.xlsx", sheetname='testX')
testY = pd.read_excel("HW3Data.xlsx", sheetname='testY')
trainX.head()
trainY.head()
trainX.shape
print("Dataset shape: {}, {}".format(trainX.shape[0], trainX.shape[1]))
# Info about data set, datatype
trainX.info()
# Checking for missing values
trainX.isnull().sum()
There turn out to be no missing values, so no imputation is needed.
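For completeness, a minimal sketch of how numeric gaps could be filled if there had been any (the column choices here are just illustrative):
# Sketch only; not needed for this dataset, since isnull().sum() shows no gaps.
numeric_cols = ['monthly_rent_amount', 'monthly_income_amount']
trainX[numeric_cols] = trainX[numeric_cols].fillna(trainX[numeric_cols].median())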
# Check customer id is same for features & target
print("Is customer ID same for feature & target? (training)", (pd.Series.equals(trainX.customer_id, trainY.customer_id)))
print("Is customer ID same for feature & target? (test)", (pd.Series.equals(testX.customer_id, testY.customer_id)))
# Check if columns are the same/same order
trainX.columns == testX.columns
# Checking for any quirks
trainX.describe()
# Number of categorical columns
print("Number of categorical columns:", len(trainX.describe(include=['object']).columns))
trainX.describe(include=['object', 'datetime64'])
# Plotting univariate distributions to check for skew, patterns, types of distribution
trainX.hist(layout = (3,6), figsize = (14,9), bins = 20);
Some distributions show a right skew (monthly_income_amount, monthly_rent_amount, num_payments). Others are effectively categorical, such as residence_rent_or_own or loan_duration.
sns.pairplot(trainX)
plt.show()
trainX.plot(kind='density', subplots=True, layout=(3,6), figsize=(14,9), sharex=False);
# Looking for skew
trainX.skew()
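A log transform is one common way to tame the right skew in the dollar-amount columns if a model were sensitive to it; a quick sketch (not applied in the rest of this notebook):
# Sketch only: log1p compresses the right tail of the skewed columns.
skewed_cols = ['monthly_income_amount', 'monthly_rent_amount', 'num_payments']
print(trainX[skewed_cols].apply(np.log1p).skew())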
# Checking when these applications were submitted, find the time range
trainX.application_when.sort_values()
# Checking longest loan duration, appears to be 8 [years?] -- most common is 5 year loan
# trainX.loan_duration.sort_values(ascending=False)
trainX.loan_duration.value_counts()
# Checking address_zip, see if there is an imbalance in where these people live
trainX['address_zip'].value_counts()
Shows 146 different address_zip values.
# Checking string and other variables.
print(trainX.status.value_counts())
print("******************")
trainX.bank_account_direct_deposit.value_counts()
All loans in trainX are approved; I checked testX and it is the same. More than four times as many customers use direct deposit as not.
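Since status is "Approved" everywhere, it carries no signal. A quick sketch to flag constant columns programmatically:
# Columns with a single unique value carry no information for the model.
constant_cols = [col for col in trainX.columns if trainX[col].nunique() == 1]
print("Constant columns:", constant_cols)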
# Plot correlation matrix
corr = trainX.corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values);
There are a lot of highly correlated variables, particularly among the credit scores, so some of them may need to be dropped.
A possible approach is to check their coefficients and p-values and pick a single FICO score.
Note: averaging the scores would probably wash out their differences, so I will not do that. (From the Aug. 27 office hours.)
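To make the collinearity among the five third-party scores concrete before picking one, their pairwise correlations can be pulled out directly (a small sketch):
# Pairwise correlations among just the third-party score columns.
score_cols = ['raw_l2c_score', 'raw_FICO_telecom', 'raw_FICO_retail',
              'raw_FICO_bank_card', 'raw_FICO_money']
print(trainX[score_cols].corr().round(2))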
def get_linear_model_metrics(X, y, algo):
    # Get the p-value of each feature in X given y. Ignore the F-stat for now.
    pvals = feature_selection.f_regression(X, y)[1]
    # .fit() runs the model on X and y
    algo.fit(X, y)
    residuals = (y - algo.predict(X)).values
    # Print the necessary values
    print('P Values:', pvals)
    print('Coefficients:', algo.coef_)
    print('y-intercept:', algo.intercept_)
    # Note: for a classifier such as LogisticRegression, .score() is mean accuracy, not R-squared.
    print('Score:', algo.score(X, y))
    # Plot residuals
    # h = sns.residplot(X, y)
    # h.set(xlabel="ah")
    # Keep the model
    return algo

# lm = get_linear_model_metrics(X, y, lm)
# The sets of variables I will check coefficients, p-values, and model score for:
x_sets = (
    ['raw_l2c_score'],
    ['raw_FICO_telecom'],
    ['raw_FICO_retail'],
    ['raw_FICO_bank_card'],
    ['raw_FICO_money'],
    ['raw_l2c_score', 'raw_FICO_telecom'],
    ['raw_l2c_score', 'raw_FICO_retail'],
    ['raw_l2c_score', 'raw_FICO_bank_card'],
    ['raw_l2c_score', 'raw_FICO_money'],
    ['raw_FICO_money', 'raw_FICO_bank_card']
)
# Using logistic regression as the linear model
for x in x_sets:
    print(', '.join(x))
    get_linear_model_metrics(trainX[x], trainY.FullyFunded, linear_model.LogisticRegression())
    print()
I opted to use raw_FICO_telecom since it accounts for utility and phone bills, and it returned the highest model score (0.7325, barely ahead of the 0.73 runner-up; note that for LogisticRegression, .score() reports mean accuracy rather than R-squared). Adding a second credit score did not change the score. The telecom score also seems useful for people who are new to credit.
I will also try a ratio of raw_FICO_telecom to raw_l2c_score. From searching online, the L2C score appears to be a way to score people with little to no credit history.
Source: http://money.cnn.com/2015/04/02/pf/new-fico-credit-score/index.html
Removing home_phone_type, bank_routing_number, customer_id, and status.
# Home phone type
trainX.home_phone_type.value_counts()
# Bank routing number
# customer_id
# status - all loans are Approved in this dataset and the testX dataset.
Checking for quirks in their distributions.
f, axarr = plt.subplots(3, 2, figsize=(14, 10));
# funded = trainY.FullyFunded.values
axarr[0,0].hist(trainX['raw_l2c_score'].values, bins = 20)
axarr[0,0].set_title('raw_l2c_score');
axarr[0,1].hist(trainX['num_payments'].values, bins = 20)
axarr[0,1].set_title('num_payments');
axarr[1,0].hist(trainX['raw_FICO_retail'].values, bins = 20)
axarr[1,0].set_title('raw_FICO_retail');
axarr[1,1].hist(trainX['raw_FICO_bank_card'].values, bins = 20)
axarr[1,1].set_title('raw_FICO_bank_card');
axarr[2,0].hist(trainX['raw_FICO_money'].values, bins = 20)
axarr[2,0].set_title('raw_FICO_money');
axarr[2,1].hist(trainX['monthly_income_amount'].values, bins = 20)
axarr[2,1].set_title('monthly_income_amount');
f, axarr = plt.subplots(3, 2, figsize=(10, 9))
funded = trainY.FullyFunded.values
axarr[0, 0].scatter(trainX.num_payments.values, funded)
axarr[0, 0].set_title('num_payments');
axarr[0, 1].scatter(trainX.raw_FICO_telecom.values, funded)
axarr[0, 1].set_title('raw_FICO_telecom');
axarr[1, 0].scatter(trainX.monthly_rent_amount.values, funded)
axarr[1, 0].set_title('monthly_rent_amount');
axarr[1, 1].scatter(trainX['raw_l2c_score'].values, funded)
axarr[1, 1].set_title('raw_l2c_score');
axarr[2, 0].scatter(trainX.raw_FICO_bank_card.values, funded)
axarr[2, 0].set_title('raw_FICO_bank_card');
axarr[2, 1].scatter(trainX.raw_FICO_retail.values, funded)
axarr[2, 1].set_title('raw_FICO_retail');
Perhaps people who consistently make payments (num_payments) are more likely to be fully funded, since that could contribute to credit history. Nothing noticeable showed up in these plots, though.
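A scatter plot against a boolean target is hard to read, so as a complementary check, a sketch that bins num_payments and compares the funded rate in each bin:
# Sketch: fraction of fully funded loans per num_payments quartile.
payment_bins = pd.qcut(trainX['num_payments'], 4, duplicates='drop')
print(trainY['FullyFunded'].groupby(payment_bins).mean())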
Figuring out how to assign credit levels by score.
Creating tiers based on the standard FICO score ranges.
Assigning five credit levels:
train_X = trainX.copy() # train_X is the copy dataset!
def fico_telecom_level(df):
    df.loc[df['raw_FICO_telecom'].between(800, 850), 'credit_level'] = 'exceptional'
    df.loc[df['raw_FICO_telecom'].between(740, 799), 'credit_level'] = 'very good'
    df.loc[df['raw_FICO_telecom'].between(670, 739), 'credit_level'] = 'good'
    df.loc[df['raw_FICO_telecom'].between(580, 669), 'credit_level'] = 'fair'
    df.loc[df['raw_FICO_telecom'].between(0, 579), 'credit_level'] = 'very poor'
    return df
def credit_dummies(df):
    credit_level_dummies = pd.get_dummies(df['credit_level'], prefix='credit')
    df = pd.concat([df, credit_level_dummies], axis=1)
    df.drop('credit_level', axis=1, inplace=True)
    return df
# Dropping features that don't seem to be very predictive
def drop_features(df):
    dropCols = ['bank_routing_number', 'customer_id', 'home_phone_type', 'status', 'address_zip',
                'application_when', 'bank_account_direct_deposit', 'payment_ach',
                'residence_rent_or_own']
    df.drop(dropCols, axis = 1, inplace = True)  # modifies df in place
    return df
# Dropping the credit score columns
def drop_credit(df):
    creditCols = ['raw_l2c_score', 'raw_FICO_retail', 'raw_FICO_bank_card', 'raw_FICO_money']  # keep raw_FICO_telecom
    df.drop(creditCols, axis = 1, inplace = True)  # modifies df in place
    return df
# Checking to see if training data set copy is same as original
train_X.head()
# Applying functions to training data set.
drop_features(train_X)
drop_credit(train_X)
train_X = credit_dummies(fico_telecom_level(train_X)) # needs to be assigned to save the dataframe result.
# Check to see if functions transformed dataset successfully.
train_X.head()
pd.get_dummies and pd.concat build a new DataFrame rather than modifying the original in place, so the result of credit_dummies has to be assigned back (as done above) instead of being applied purely for side effects.
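For reference, a sketch of the same step written so it returns a new DataFrame for explicit reassignment, rather than relying on inplace=True:
# Sketch: a transform that returns a new DataFrame for explicit reassignment.
def add_credit_dummies(df):
    dummies = pd.get_dummies(df['credit_level'], prefix='credit')
    return pd.concat([df.drop('credit_level', axis=1), dummies], axis=1)
# Hypothetical usage, equivalent to the credit_dummies call above:
# train_X = add_credit_dummies(fico_telecom_level(train_X))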
scoring = 'roc_auc'
num_folds = 5
seed = 10
kfold = KFold(n_splits = num_folds, random_state = seed)
y = trainY['FullyFunded'] # Extract target (FullyFunded), not customer_id.
model = DecisionTreeClassifier(random_state = seed)
# Fits the model
model.fit(train_X, y)
scores = cross_val_score(model, train_X, y, cv = kfold, scoring = scoring)
print('CV AUC {}, Avg. AUC {}'.format(scores, scores.mean()))
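On the evaluation metric: ROC AUC ('roc_auc' above) is used throughout because the assignment asks for scores where higher values mean more likely TRUE, and AUC evaluates exactly that ranking without requiring a classification threshold. A minimal sketch of the metric itself, with made-up labels and scores:
# Hypothetical example values to illustrate the metric; not from the dataset.
from sklearn.metrics import roc_auc_score
example_true = [True, False, True, True, False]   # actual FullyFunded labels
example_scores = [0.9, 0.2, 0.65, 0.8, 0.4]       # higher = more likely TRUE
print("ROC AUC:", roc_auc_score(example_true, example_scores))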
Tweaking CART model parameters. I picked these values for max_depth & min_samples_leaf based on the class session where we worked through the StumbleUpon data.
model = DecisionTreeClassifier(
max_depth = 4,
min_samples_leaf = 5, random_state = seed)
model.fit(train_X, y)
scores = cross_val_score(model, train_X, y, cv = kfold, scoring = scoring)
print('CV AUC {}, Avg. AUC {}'.format(scores, scores.mean()))
Changing max_depth and min_samples_leaf increases the average AUC by about 0.10.
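Rather than copying max_depth and min_samples_leaf from the class example, a small grid search around those values (a sketch; the grid values are illustrative) could confirm they are near-optimal for this dataset:
# Sketch: search a small grid around the hand-picked tree parameters.
tree_grid = {'max_depth': [3, 4, 5, 6], 'min_samples_leaf': [1, 3, 5, 10]}
tree_search = GridSearchCV(DecisionTreeClassifier(random_state = seed),
                           tree_grid, scoring = scoring, cv = kfold)
tree_search.fit(train_X, y)
print(tree_search.best_score_, tree_search.best_params_)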
features = train_X.columns
feature_importances = model.feature_importances_
features_df = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})
features_df.sort_values('Importance Score', inplace=True, ascending=False)
features_df.head(20)
The indicator columns I created turned out to be nonfactors in the model. Although the tiers are based on the original FICO score ranges applied to raw_FICO_telecom, they did not contribute anything meaningful.
def model_performance(df):
    scoring = 'roc_auc'
    num_folds = 5
    seed = 10
    kfold = KFold(n_splits = num_folds, random_state = seed)
    results = []
    names = []
    models = []
    # CART v2 is the modified CART: max_depth = 4, min_samples_leaf = 5
    models.append(('CART', DecisionTreeClassifier(random_state = seed)))
    models.append(('CART v2', DecisionTreeClassifier(random_state = seed, max_depth = 4, min_samples_leaf = 5)))
    models.append(('RandomForestClassifier', RandomForestClassifier(random_state = seed)))
    models.append(('LogisticRegression', LogisticRegression(random_state = seed)))
    models.append(('SVC', SVC(random_state = seed)))
    models.append(('GradientBoostingClassifier', GradientBoostingClassifier(random_state = seed)))
    models.append(('KNeighborsClassifier', KNeighborsClassifier()))
    models.append(('XGBClassifier', xgb.XGBClassifier(seed = seed)))
    for name, model in models:
        kfold = KFold(n_splits = num_folds, random_state = seed)
        names.append(name)
        model.fit(df, trainY['FullyFunded'])
        scores = cross_val_score(model, df, trainY['FullyFunded'], cv = kfold, scoring = scoring)
        results.append(scores)
        msg = "{}: {} ({})".format(name, scores.mean(), scores.std())
        print(msg)
    print()
    return df
model_performance(train_X)
scoring = 'roc_auc'
num_folds = 5
seed = 10
kfold = KFold(n_splits = num_folds, random_state = seed)
results = []
names = []
models = []
# CART v2 is the modified CART ---- max_depth = 4, min_samples_leaf = 5,
models.append( ('CART', DecisionTreeClassifier(random_state = seed)) )
models.append( ('CART v2', DecisionTreeClassifier(random_state = seed, max_depth = 4, min_samples_leaf = 5)) )
models.append( ('RandomForestClassifier', RandomForestClassifier(random_state = seed)) )
models.append( ('LogisticRegression', LogisticRegression(random_state = seed)) )
models.append( ('SVC', SVC(random_state = seed)) )
models.append( ('GradientBoostingClassifier', GradientBoostingClassifier(random_state = seed)) )
models.append( ('KNeighborsClassifier', KNeighborsClassifier()) )
models.append( ('XGBClassifier', xgb.XGBClassifier(seed = seed)) )
for name, model in models:
    kfold = KFold(n_splits = num_folds, random_state = seed)
    names.append(name)
    model.fit(train_X, trainY['FullyFunded'])
    scores = cross_val_score(model, train_X, trainY['FullyFunded'], cv = kfold, scoring = scoring)
    results.append(scores)
    msg = "{}: {} ({})".format(name, scores.mean(), scores.std())
    print(msg)
print()
Refinement Workflow Results - Trial 1
Model Description | Local CV (Std Dev) |
---|---|
Baseline: CART | 0.6176 (0.0558) |
CART v2: Modified parameters | 0.7172 (0.1002) |
RandomForestClassifier | 0.7136 (0.0620) |
Logistic Regression | 0.7001 (0.0506) |
SVC | 0.5073 (0.0247) |
GradientBoostingClassifier | 0.7122 (0.1067) |
KNeighborsClassifier | 0.5684 (0.0541) |
XGBClassifier | 0.7412 (0.0745) |
With a training dataset of only 7 columns, XGBClassifier scores the highest.
The engineered features in this first trial were one-hot encoded credit score tiers (exceptional, very good, good, fair, very poor), and they did not end up contributing to the CART model.
# Feature Engineering (FE)
def first_order_FE(df):
    df["payment_loan_ratio"] = df.apply(payment_loan_ratio, axis=1)
    df["income_rent_difference"] = df.apply(income_rent_difference, axis=1)
    df["telecom_l2c_ratio"] = df.apply(telecom_l2c_ratio, axis=1)
    return df

def payment_loan_ratio(row):
    return row.num_payments / row.loan_duration

def income_rent_difference(row):
    return row.monthly_income_amount - row.monthly_rent_amount

def telecom_l2c_ratio(row):
    return row.raw_FICO_telecom / row.raw_l2c_score
One additional feature engineering idea would be to look for negative cash flow (how monthly income changes from month to month), but that data is not available.
# Dropping the remaining redundant credit score columns
def drop_credit_FE(df):
    creditCols = ['raw_FICO_retail', 'raw_FICO_bank_card', 'raw_FICO_money']  # keep raw_FICO_telecom and raw_l2c_score
    df.drop(creditCols, axis = 1, inplace = True)  # modifies df in place
    return df
train_X2 = trainX.copy()
first_order_FE(train_X2)
drop_features(train_X2)
drop_credit_FE(train_X2)
print("Data shape is:",train_X2.shape)
train_X2.head()
model_performance(train_X2)
Model Description | Trial 2 vs. Trial 1 CV | Change in CV Score |
---|---|---|
Baseline: CART | 0.6394 vs. 0.6176 | +0.0218 |
CART v2: Modified parameters | 0.7199 vs. 0.7172 | +0.0027 |
RandomForestClassifier | 0.7034 vs. 0.7136 | -0.0102 |
Logistic Regression | 0.6996 vs. 0.7001 | -0.0005 |
SVC | 0.5034 vs. 0.5073 | -0.0039 |
GradientBoostingClassifier | 0.7310 vs. 0.7122 | +0.0188 |
KNeighborsClassifier | 0.5228 vs. 0.5684 | -0.0456 |
XGBClassifier | 0.7495 vs. 0.7412 | +0.0083 |
Overall, the changes in model performance are slight: the baseline CART improved by about 2 points and GradientBoostingClassifier by about 1.9 points, while the modified CART (CART v2) and XGBClassifier improved by less than a hundredth.
Decided on tuning GradientBoostingClassifier & XGBClassifier!
def local_cv(model, params, printFeatureImportance=True):
    param_grid = params
    kfold = KFold(n_splits = num_folds, random_state = seed)
    grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
    grid_result = grid.fit(train_X2, trainY.FullyFunded)
    predictors = [x for x in train_X2.columns]  # every X column
    # Plotting the feature importances
    if printFeatureImportance:
        feat_imp = pd.Series(grid_result.best_estimator_.feature_importances_, predictors).sort_values(ascending=False)
        feat_imp.plot(kind='bar', title='Feature Importances')
        plt.ylabel('Feature Importance Score')
    # Dataframe of features
    features_df = pd.DataFrame({'Features': predictors, 'Importance Score': grid_result.best_estimator_.feature_importances_})
    features_df.sort_values('Importance Score', inplace=True, ascending=False)
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    print()
    # Note: grid_scores_ exists in older scikit-learn; newer versions expose cv_results_ instead.
    for params, mean_score, scores in grid_result.grid_scores_:
        print("%f (%f) with: %r" % (scores.mean(), scores.std(), params))
    return features_df
GradientBoostingClassifier().get_params().keys()
Using default parameters for GradientBoostingClassifier.
params = {"learning_rate": [0.1],
"n_estimators": [100],
"min_samples_split": [2],
"min_samples_leaf": [1],
"max_depth": [3]}
local_cv(GradientBoostingClassifier(random_state = seed), params)
Using tuned parameters for GradientBoostingClassifier.
params = {"max_depth": [5],
"n_estimators": [150],
"min_samples_split": [2],
"min_samples_leaf": [7],
"learning_rate": np.arange(0.01, 0.101, 0.01),}
local_cv(GradientBoostingClassifier(random_state = seed), params)
Comparing the two feature importance plots, the importance of payment_loan_ratio increases in the tuned model, but the tuned model also spreads importance more evenly across the other features, so it relies less on a single feature.
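To compare the two importance profiles directly rather than reading two separate plots, a sketch that fits the default and tuned models (tuned parameters taken from the grid above, learning rate left at its default here) and puts their importances side by side:
# Sketch: feature importances of the default vs. tuned GradientBoostingClassifier.
default_gbc = GradientBoostingClassifier(random_state = seed).fit(train_X2, trainY.FullyFunded)
tuned_gbc = GradientBoostingClassifier(random_state = seed, max_depth = 5, n_estimators = 150,
                                       min_samples_split = 2, min_samples_leaf = 7).fit(train_X2, trainY.FullyFunded)
importance_compare = pd.DataFrame({'default': default_gbc.feature_importances_,
                                   'tuned': tuned_gbc.feature_importances_},
                                  index = train_X2.columns)
print(importance_compare.sort_values('tuned', ascending = False))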
Tweaking the learning rate.
Found that a learning rate of about 0.041 (0.0409999 rounded) gives the best score.
params = {"max_depth": [5],
"n_estimators": [150],
"min_samples_split": [2],
"min_samples_leaf": [7],
"learning_rate": [0.041],
"max_features": ['auto']}
local_cv(GradientBoostingClassifier(random_state = seed), params)
Tweaking the learning rate also gives a more even distribution of importance scores across the features.
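To see how the cross-validated score varies across the learning-rate grid rather than only reading off the best value, a sketch that re-runs the grid and plots the mean scores:
# Sketch: mean CV AUC at each learning rate in the grid searched above.
lr_values = np.arange(0.01, 0.101, 0.01)
lr_scores = []
for lr in lr_values:
    gbc = GradientBoostingClassifier(random_state = seed, max_depth = 5, n_estimators = 150,
                                     min_samples_split = 2, min_samples_leaf = 7, learning_rate = lr)
    lr_scores.append(cross_val_score(gbc, train_X2, trainY.FullyFunded, cv = kfold, scoring = scoring).mean())
plt.plot(lr_values, lr_scores, marker = 'o')
plt.xlabel('learning_rate')
plt.ylabel('mean CV AUC');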
xgb.XGBClassifier().get_params().keys()
params = {"max_depth": [3],
"n_estimators": [100],
"min_child_weight": [7]}
local_cv(xgb.XGBClassifier(seed = seed), params)
First, the test dataset needs to be transformed using the same feature engineering and column-dropping functions created above.
first_order_FE(testX)
drop_features(testX)
drop_credit_FE(testX)
print("Data shape is:", testX.shape)
testX.head()
# Create the sub models using XGBClassifier & GradientBoostingClassifier with the tuned parameters
estimators = []
model1 = xgb.XGBClassifier(n_estimators = 100, min_child_weight = 7, max_depth = 3, seed = seed)
estimators.append(('XGBClassifier', model1))
model2 = GradientBoostingClassifier(max_depth = 5, n_estimators = 150,
                                    min_samples_split = 2,
                                    min_samples_leaf = 7,
                                    learning_rate = 0.041,
                                    max_features = 'auto',
                                    random_state = seed)
estimators.append(('GradientBoostingClassifier', model2))
# Creating the ensemble model (soft voting averages the sub-models' predicted probabilities)
ensemble = VotingClassifier(estimators, voting='soft')
# Note: no scoring argument here, so cross_val_score reports the default accuracy rather than ROC AUC.
results = cross_val_score(ensemble, train_X2, trainY.FullyFunded, cv = kfold)
# Printing results
print("CV Scores: {} // Mean Score: {} // Std Dev: {}".format(results, results.mean(), results.std()))
print()
# Fitting the model, then producing scores (higher = more likely TRUE, per the assignment instructions)
ensemble_result = ensemble.fit(train_X2, trainY.FullyFunded)
ensemble_preds = ensemble_result.predict_proba(testX)[:, 1]
print("Estimators variable:\n", estimators)
print()
print(ensemble)
submission = pd.DataFrame({"customer_id": testY.customer_id, "FullyFunded": ensemble_preds})
submission.to_csv("HW3_Final_Predictions.csv", index = False)
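The assignment asks for the predictions to go in column B of the testY tab, with higher values meaning more likely TRUE. A sketch of one way to write the scores out in that layout with pandas (this writes a new workbook; openpyxl or xlsxwriter must be installed for xlsx output):
# Sketch: customer_id plus the ensemble scores, laid out like the testY tab.
testY_out = testY.copy()
testY_out['FullyFunded'] = ensemble_preds   # column B: higher score = more likely TRUE
testY_out.to_excel("HW3_Final_Predictions.xlsx", sheet_name = 'testY', index = False)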
Workflow Results - Trial 1: One Hot Encoded Credit Score Levels
Model Description | Local CV (Std Dev) |
---|---|
Baseline: CART | 0.6176 (0.0558) |
CART v2: Modified parameters | 0.7172 (0.1002) |
RandomForestClassifier | 0.7136 (0.0620) |
Logistic Regression | 0.7001 (0.0506) |
SVC | 0.5073 (0.0247) |
GradientBoostingClassifier | 0.7122 (0.1067) |
KNeighborsClassifier | 0.5684 (0.0541) |
XGBClassifier | 0.7412 (0.0745) |
With a training dataset of only 7 columns, XGBClassifier scores the highest.
The engineered features in this first trial were one-hot encoded credit score tiers (exceptional, very good, good, fair, very poor), which did not contribute to the CART model according to the feature importances.
Workflow Results - Trial 2: With Feature Engineering
Model Description | Trial 2 vs. Trial 1 CV | Change in CV Score |
---|---|---|
Baseline: CART | 0.6394 vs. 0.6176 | +0.0218 |
CART v2: Modified parameters | 0.7199 vs. 0.7172 | +0.0027 |
RandomForestClassifier | 0.7034 vs. 0.7136 | -0.0102 |
Logistic Regression | 0.6996 vs. 0.7001 | -0.0005 |
SVC | 0.5034 vs. 0.5073 | -0.0039 |
GradientBoostingClassifier | 0.7310 vs. 0.7122 | +0.0188 |
KNeighborsClassifier | 0.5228 vs. 0.5684 | -0.0456 |
XGBClassifier | 0.7495 vs. 0.7412 | +0.0083 |
Workflow Results - Trial 3: Using GridSearch and Ensemble (w/ Feature Engineering)
Model Description | Local CV (Std Dev) |
---|---|
GradientBoostingClassifier (default parameters) + feature engineering | 0.7310 (0.0737) |
GradientBoostingClassifier + feature engineering & tuning | 0.7768 (0.0849) |
GradientBoostingClassifier + feature engineering & additional learning rate tuning | 0.7793 (0.0795) |
XGBClassifier + feature engineering & tuning | 0.7693 (0.0372) |
Ensemble of XGBClassifier and GradientBoostingClassifier (feature engineering and tuning) | 0.8025 (0.0326) |

Note: the ensemble row comes from cross_val_score without a scoring argument, so it reflects the default accuracy rather than the ROC AUC used for the other rows.