Dataset from Kaggle (House Prices: Advanced Regression Techniques)
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
Data Dictionary (Link)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
# Loading the housing data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
print("Training dataset shape:", train.shape) # Contains sell price ("SalePrice") as last column!
print("Test dataset shape:", test.shape) # Has no SalePrice
trainY = train[['SalePrice']] # Creating a trainY dataframe
train_IDs = train.Id
test_IDs = test.Id
# Dropping ID
train.drop('Id', axis=1, inplace=True)
test.drop('Id', axis = 1, inplace = True)
# Dropping SalePrice
# train.drop('SalePrice', axis=1, inplace=True)
print("New training dataset shape:", train.shape)
print("Are the columns the same in train and test?", (np.array_equal(train.columns, test.columns)))
# It is False because we have not dropped the SalePrice column yet; that happens in a later step, after dealing with outliers.
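# A quick check (minimal sketch): the only column that should differ between the two is SalePrice
print(set(train.columns) - set(test.columns))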
# Previewing dataset
Rows, Columns = train.shape # DataFrame.shape is (rows, columns)
pd.set_option('display.max_columns', Columns) # default is 20 columns
train.head()
The original training dataset has 1460 rows and 81 columns.
One column is the Id and another is SalePrice.
That leaves 79 features.
total_missing = train.isnull().sum().sort_values(ascending = False)
percent = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total_missing, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
# Plotting missing data
f, ax = plt.subplots(figsize=(12, 8))
plt.xticks(rotation=90)
sns.barplot(x = missing_data.index, y = missing_data['Percent'] * 100)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15);
PoolQC, MiscFeature, Alley, Fence, and FireplaceQu have an alarming number of missing values. Why?
According to the data documentation, NA for these features simply means the house does not have them: no pool, no miscellaneous feature, no alley access, no fence, no fireplace.
imputedColumns = ['PoolQC', 'MiscFeature', 'Alley', 'Fence',
                  'FireplaceQu', 'GarageType', 'GarageFinish',
                  'GarageQual', 'GarageCond', 'BsmtFinType2',
                  'BsmtExposure', 'BsmtFinType1', 'BsmtCond',
                  'BsmtQual', 'MasVnrType']

def impute_missing(df):
    # For these categorical features, NA means the feature is absent, so fill with the string 'None'
    for col in imputedColumns:
        df[col] = df[col].fillna('None')
    # Only a few missing values in the training set for these two columns.
    # Using 0, since no garage means no year built, and no masonry veneer means no area.
    df['GarageYrBlt'] = df['GarageYrBlt'].fillna(0)
    df['MasVnrArea'] = df['MasVnrArea'].fillna(0)
    # Only 1 missing value in the training set; impute the most common value for 'Electrical' (SBrkr)
    df['Electrical'] = df['Electrical'].fillna(df['Electrical'].mode()[0])
    return df
impute_missing(train)
train.head()
# Checking to see the percentages of missing data.
total_missing = train.isnull().sum().sort_values(ascending = False)
percent = (train.isnull().sum() / train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total_missing, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(10)
LotFrontage is the last feature in the training set that needs imputing.
Lot Frontage (Continuous): Linear feet of street connected to property
train.groupby("Neighborhood")["LotFrontage"].describe()
train['LotFrontage'] = train.groupby("Neighborhood")["LotFrontage"].transform(lambda x: x.fillna(x.median()))
train.isnull().sum().sort_values(ascending = False)[:5] # Confirm that there are no more missing values.
The test set also has some additional numeric columns that need imputing.
These will be filled with 0, since having no garage means there is no garage year built, no garage area, and no cars in the garage.
total_missing = test.isnull().sum().sort_values(ascending = False)
percent = (test.isnull().sum() / test.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total_missing, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(40)
The test dataset has more features to impute than the training dataset, but they can be handled accordingly. For columns such as MSZoning and KitchenQual, imputing the most common value makes sense.
impute_missing(test)
test.isnull().sum().sort_values(ascending = False)[:18]
zero_cols = ['GarageYrBlt', 'GarageArea', 'GarageCars',
'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath']
# Imputing zero for columns measured as numbers (such as basement square footage [BsmtFinSF])
for col in zero_cols:
    test[col] = test[col].fillna(0)
# Filling in missing LotFrontage (about 15% of the column is missing) with the neighborhood median
test['LotFrontage'] = test.groupby("Neighborhood")["LotFrontage"].transform(lambda x: x.fillna(x.median()))
modeCols = ['MSZoning','Exterior1st', 'Exterior2nd', 'SaleType', 'Utilities', 'KitchenQual']
# There are only a few missing values for these columns, so imputing with the mode
for col in modeCols:
    test[col] = test[col].fillna(test[col].mode()[0])
# NA probably means no subclass
test['MSSubClass'] = test['MSSubClass'].fillna("None")
# Functional - data dictionary says assume Typical
test['Functional'] = test['Functional'].fillna("Typ")
# Confirming that there are no more missing values
test.isnull().sum().sort_values(ascending = False)[:10]
# Note: Utilities has no variation in the test dataset
print("Train \n", train['Utilities'].value_counts())
print("Test \n", test['Utilities'].value_counts())
# There is some variation in the training set, but since the test set is constant the feature cannot help prediction, so it can be dropped.
train.drop('Utilities', axis=1, inplace=True)
test.drop('Utilities', axis=1, inplace=True)
# MSSubClass codes are categorical labels, not quantities, so convert them to strings
train['MSSubClass'] = train['MSSubClass'].astype(str)
test['MSSubClass'] = test['MSSubClass'].astype(str)
Looking into how the prices are distributed.
print(train.SalePrice.describe())
print()
print("Number of missing sale prices:", train.SalePrice.isnull().sum())
print("Skew: {}, Kurtosis: {}".format(train.SalePrice.skew(), train.SalePrice.kurtosis()))
plt.figure(figsize = (14, 8))
sns.distplot(train.SalePrice, bins = 20);
The SalePrice distribution is skewed to the right, as the skew statistic printed above also shows.
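A common response to this kind of right skew is a log transform of the target; a minimal check of the effect (np.log1p is not actually applied to the data in this notebook):
print("Skew before log1p:", train.SalePrice.skew())
print("Skew after log1p: ", np.log1p(train.SalePrice).skew())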
The goal is to check whether sale prices differ by neighborhood, number of bathrooms, year built, year sold, and condition.
plt.figure(figsize = (30, 8))
sns.boxplot(x = 'Neighborhood', y = 'SalePrice', data = train);
Sale price does vary by neighborhood.
# plt.subplots already sets the figure size, so no separate plt.figure call is needed
f, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, figsize=(30, 8))
sns.boxplot(x = train.OverallQual, y = train.SalePrice, ax = ax1);
sns.boxplot(x = train.OverallCond, y = train.SalePrice, ax = ax2);
sns.boxplot(x = train.MSZoning, y = train.SalePrice, ax = ax3);
sns.boxplot(x = train.BldgType, y = train.SalePrice, ax = ax4);
As overall quality increases, sale price increases; the relationship looks roughly linear and positively correlated. Overall condition, on the other hand, shows no clear relationship with sale price.
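To put numbers behind that impression, a quick correlation check (a minimal sketch):
# Correlation of overall quality/condition with sale price
print(train[['OverallQual', 'OverallCond']].corrwith(train['SalePrice']))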
plt.figure(figsize = (9, 5))
sns.stripplot(x = train.HouseStyle, y = train.SalePrice, jitter = True);
The strip plot shows that the majority of houses in the dataset are 1- and 2-story houses.
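The raw counts per HouseStyle back that up:
print(train['HouseStyle'].value_counts())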
# factorplot creates its own figure, so a separate plt.figure call has no effect; size/aspect control the figure instead
sns.factorplot(x = 'HouseStyle', y = 'SalePrice', hue = 'CentralAir', data = train, size = 6, aspect = 2);
Houses that have central air conditioning do sell for noticeably more.
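A quick numeric check (median sale price with and without central air):
print(train.groupby('CentralAir')['SalePrice'].median())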
# Boxplot by Year Built
plt.figure(figsize = (30, 8))
sns.boxplot(x = train.YearBuilt, y = train.SalePrice);
plt.xticks(rotation = 90);
The boxplot shows that prices tend to be higher for newer houses.
# Strip Plot by Year Built
plt.figure(figsize = (16,9))
sns.stripplot(x = train.YearBuilt, y = train.SalePrice, jitter = 0.04);
plt.xticks(fontsize = 8, rotation=90);
# Linear Regression Plot by Year Built
plt.figure(figsize = (16,9))
sns.regplot(x = train.YearBuilt, y = train.SalePrice);
plt.xticks(fontsize = 10, rotation = 45);
The strip plot and regression plot of SalePrice vs. YearBuilt suggest that newer houses tend to sell for more, and that a large share of the houses are relatively new.
The distribution of YearBuilt below confirms the latter.
sns.distplot(train.YearBuilt, bins = 20);
train['YearBuilt'].groupby([train.YearBuilt]).agg('count').sort_index(ascending = False)[:20]
"There are 5 observations that an instructor may wish to remove from the data set before giving it to students (a plot of SALE PRICE versus GR LIV AREA will indicate them quickly). Three of them are true outliers (Partial Sales that likely don’t represent actual market values) and two of them are simply unusual sales (very large houses priced relatively appropriately). I would recommend removing any houses with more than 4000 square feet from the data set (which eliminates these 5 unusual observations) before assigning it to students."
plt.figure(figsize = (10, 7))
plt.scatter(x = train.GrLivArea, y = trainY.SalePrice)
plt.ylabel('SalePrice', fontsize=12);
plt.xlabel('GrLivArea', fontsize=12);
# Deleting outliers
train = train.drop(train[(train['GrLivArea'] > 4000) & (train['SalePrice'] < 300000)].index)
# Check the graph again
plt.figure(figsize = (10, 7))
plt.scatter(x = train.GrLivArea, y = train.SalePrice)
plt.ylabel('SalePrice', fontsize=12);
plt.xlabel('GrLivArea', fontsize=12);
trainY = train[['SalePrice']] # Creating a trainY dataframe
train.drop('SalePrice', axis=1, inplace=True)
print("Are the columns the same in train and test?", (np.array_equal(train.columns, test.columns)))
trainY.head()
train.info()
One thing to note is that YearBuilt is still an int64. Should this be converted to a categorical since newer houses don't guarantee higher prices?
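One way to explore that question is to bin YearBuilt into decades and treat the bins as categories; a minimal sketch (the decade variable here is hypothetical and is not used in the modeling below):
# Hypothetical example: bin YearBuilt into decades
year_decade = (train['YearBuilt'] // 10 * 10).astype(str)
print(year_decade.value_counts().sort_index())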
plt.figure(figsize = (16,9));
sns.heatmap(data=train.corr());
# Scoring stuff
from sklearn.metrics import roc_curve, auc, mean_squared_error, r2_score
# Import model selection tools
from sklearn.model_selection import cross_val_score, KFold
from sklearn.model_selection import GridSearchCV
# Import models
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Lasso, Ridge, LinearRegression, ElasticNet
import xgboost as xgb # XGBRegressor
from sklearn.preprocessing import LabelEncoder
def R2(df):
    scoring = 'r2'
    num_folds = 5
    seed = 10
    # shuffle=True is needed for random_state to take effect when splitting
    kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
    results = []
    names = []
    models = []
    models.append( ('LinearRegression', LinearRegression()) )
    models.append( ('RandomForestRegressor', RandomForestRegressor(random_state = seed)) )
    models.append( ('GradientBoostingRegressor', GradientBoostingRegressor(random_state = seed)) )
    models.append( ('Lasso', Lasso(random_state = seed)) )
    models.append( ('Ridge', Ridge(random_state = seed)) )
    models.append( ('ElasticNet', ElasticNet(random_state = seed)) )
    models.append( ('XGBRegressor', xgb.XGBRegressor(seed = seed)) )
    for name, model in models:
        names.append(name)
        # cross_val_score clones and fits the model on each fold, so no separate fit is needed
        scores = cross_val_score(model, df, trainY.SalePrice, cv = kfold, scoring = scoring)
        results.append(scores)
        msg = "{}: R2: {} ({})".format(name, scores.mean(), scores.std())
        print(msg)
        print()
    return results
def RMSE(df):
    scoring = 'neg_mean_squared_error'
    num_folds = 5
    seed = 10
    kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
    results = []
    names = []
    models = []
    models.append( ('LinearRegression', LinearRegression()) )
    models.append( ('RandomForestRegressor', RandomForestRegressor(random_state = seed)) )
    models.append( ('GradientBoostingRegressor', GradientBoostingRegressor(random_state = seed)) )
    models.append( ('Lasso', Lasso(random_state = seed)) )
    models.append( ('Ridge', Ridge(random_state = seed)) )
    models.append( ('ElasticNet', ElasticNet(random_state = seed)) )
    models.append( ('XGBRegressor', xgb.XGBRegressor(seed = seed)) )
    for name, model in models:
        names.append(name)
        # Square root of the negated MSE gives the RMSE for each fold
        scores = np.sqrt(-cross_val_score(model, df, trainY.SalePrice, cv = kfold, scoring = scoring))
        results.append(scores)
        msg = "{}: RMSE: {} ({})".format(name, scores.mean(), scores.std())
        print(msg)
        print()
    return results
# RMSE
def rmse(model):
    scoring = 'neg_mean_squared_error'
    num_folds = 5
    seed = 10
    # Pass the KFold object itself as cv; calling get_n_splits() would reduce it to a plain integer and drop the shuffle/seed
    kf = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
    rmse = np.sqrt(-cross_val_score(model, train.values, trainY.SalePrice, scoring = scoring, cv = kf))
    return rmse
The categorical columns need to be converted to numbers. One option is pd.get_dummies (one-hot encoding); label encoding is what is actually used below.
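For reference, a pd.get_dummies sketch would look roughly like this (train and test concatenated first so both end up with the same dummy columns; these variables are illustrative only and are not used below):
# Hypothetical sketch: one-hot encode train and test together so their columns match
all_data = pd.concat([train, test], keys=['train', 'test'])
all_dummies = pd.get_dummies(all_data)
train_dummies = all_dummies.loc['train']
test_dummies = all_dummies.loc['test']
print(train_dummies.shape, test_dummies.shape)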
df_object = train.select_dtypes(include=['object'])
print("Number of categorical columns:", len(df_object.columns))
df_object.head(10)
df_object.describe()
Overall this looks fine. Some columns have heavy class imbalance (1,452 houses have paved streets, 6 do not), but none has zero variation.
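A quick way to quantify the imbalance is the share of the most frequent level in each categorical column (a minimal check):
# Share of the most common category per column; values near 1.0 indicate near-zero variation
freq_share = df_object.describe().loc['freq'].astype(float) / len(df_object)
print(freq_share.sort_values(ascending=False).head(10))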
categorical_Cols = list(df_object.columns) # Inputting the categorical columns into a list
train.shape
from sklearn.preprocessing import LabelEncoder
# Process columns: apply LabelEncoder to each categorical feature
for c in categorical_Cols:
    lbl = LabelEncoder()
    lbl.fit(list(train[c].values))
    train[c] = lbl.transform(list(train[c].values))
# shape
print('Shape of train: {}'.format(train.shape))
train.head()
The dataframe is now entirely numeric.
R2(train)
RMSE(train)
The Kaggle leaderboard is scored on the RMSE of the logarithm of SalePrice, which is why leaderboard values are generally below 0.2; the raw-dollar RMSE values above are not directly comparable. There is a lot of work to do.
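For a score that is roughly comparable to the leaderboard, the same cross-validation can be run against log-transformed prices; a minimal sketch using Ridge as an example:
# Hypothetical sketch: cross-validated RMSE on log1p(SalePrice), roughly the leaderboard metric
kf = KFold(n_splits = 5, shuffle = True, random_state = 10)
log_rmse = np.sqrt(-cross_val_score(Ridge(random_state = 10), train, np.log1p(trainY.SalePrice),
                                    cv = kf, scoring = 'neg_mean_squared_error'))
print("Ridge log-RMSE: {:.4f} ({:.4f})".format(log_rmse.mean(), log_rmse.std()))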
#Changing OverallCond into a categorical variable
train['OverallCond'] = train['OverallCond'].astype(str)
#Year and month sold are transformed into categorical features.
train['YrSold'] = train['YrSold'].astype(str)
train['MoSold'] = train['MoSold'].astype(str)
newCols = ['OverallCond', 'YrSold', 'MoSold']
for c in newCols:
    lbl = LabelEncoder()
    lbl.fit(list(train[c].values))
    train[c] = lbl.transform(list(train[c].values))
R2(train)
RMSE(train)
cols = ['Neighborhood', 'LotFrontage', 'YearBuilt', 'OverallQual', 'CentralAir', 'GrLivArea']
train_v2 = train[cols]
RMSE(train_v2)
plt.scatter(x = 'OverallQual', y = trainY.SalePrice, data = train);
plt.scatter(x = 'CentralAir', y = trainY.SalePrice, data = train);
plt.scatter(x = 'ExterQual', y = trainY.SalePrice, data = train);