Please fill out:
- Student name: Matthew Sparr
- Student pace: self paced
- Scheduled project review date/time:
- Instructor name: Eli
- Blog post URL:
For this project I chose a Kaggle dataset from an ongoing competition, which can be found at https://www.kaggle.com/c/petfinder-adoption-prediction. The competition involves predicting the speed of adoption for pets listed on a Malaysian pet adoption site. Provided are various data fields such as each pet's color, age, and breed.
Also provided are image metadata, produced by running the pets' uploaded photos through Google's Vision API, and sentiment data, produced by running each pet's description through Google's Natural Language API.
The goal of this project is to achieve a decent score on the Kaggle competition test data and hopefully place high on the leaderboard.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingClassifier,RandomForestClassifier,AdaBoostClassifier,VotingClassifier
from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
import os
import json
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import make_scorer
from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier
The provided test set does not include the 'AdoptionSpeed' target variable; it is used only to make predictions to submit to Kaggle for scoring.
train = pd.read_csv('train/train.csv')
test = pd.read_csv('test/test.csv')
The 'Name' and 'Description' columns are the only two columns with missing data in both the train and test sets. Since both fields are text, missing values will be filled with a blank space, ' '.
train.isna().sum()
Type 0
Name 1257
Age 0
Breed1 0
Breed2 0
Gender 0
Color1 0
Color2 0
Color3 0
MaturitySize 0
FurLength 0
Vaccinated 0
Dewormed 0
Sterilized 0
Health 0
Quantity 0
Fee 0
State 0
RescuerID 0
VideoAmt 0
Description 12
PetID 0
PhotoAmt 0
AdoptionSpeed 0
dtype: int64
test.isna().sum()
Type 0
Name 303
Age 0
Breed1 0
Breed2 0
Gender 0
Color1 0
Color2 0
Color3 0
MaturitySize 0
FurLength 0
Vaccinated 0
Dewormed 0
Sterilized 0
Health 0
Quantity 0
Fee 0
State 0
RescuerID 0
VideoAmt 0
Description 2
PetID 0
PhotoAmt 0
dtype: int64
train.Name.fillna(' ', inplace=True)
train.Description.fillna(' ', inplace=True)
test.Name.fillna(' ', inplace=True)
test.Description.fillna(' ', inplace=True)
Below is a basic histogram of all of the variables.
train.hist(figsize=(20,20))
Most of the variables are not normally distributed, which means we will probably want to standardize them later on. The target variable 'AdoptionSpeed' has a low count of '0' values, which could negatively impact training a classifier on this set.
We can also see that most pets have only one breed and one color, as there are many zero values for 'Breed2', 'Color2', and 'Color3'.
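As a minimal sketch of the standardization step alluded to above (the column list here is illustrative, not the exact feature set used later):

scaler = StandardScaler()
num_cols = ['Age', 'Quantity', 'Fee', 'VideoAmt', 'PhotoAmt']  # illustrative numeric columns
# Fit on the training data only, then apply the same transform to the test data
# so that no test-set statistics leak into training.
train_scaled, test_scaled = train.copy(), test.copy()
train_scaled[num_cols] = scaler.fit_transform(train[num_cols])
test_scaled[num_cols] = scaler.transform(test[num_cols])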
Now we can look at some of the value counts of various columns just to get a feel of the distribution of the pets.
train['Type'].value_counts().rename({1:'Dog',
2:'Cat'}).plot(kind='barh',
figsize=(15,6))
plt.yticks(fontsize='xx-large')
plt.title('Type Distribution', fontsize='xx-large')
Slightly more dogs than cats.
train['AdoptionSpeed'][train['Type'] == 1].value_counts().sort_index().plot(kind='barh',
figsize=(15,6))
plt.yticks(fontsize='xx-large')
plt.ylabel('Adoption Speed')
plt.title('Adoption Speed Distribution (Dogs)', fontsize='xx-large')
train['AdoptionSpeed'][train['Type'] == 2].value_counts().sort_index().plot(kind='barh',
figsize=(15,6))
plt.yticks(fontsize='xx-large')
plt.ylabel('Adoption Speed')
plt.title('Adoption Speed Distribution (Cats)', fontsize='xx-large')
pd.DataFrame([train['AdoptionSpeed'][train['Type'] == 1].mean(),train['AdoptionSpeed'][train['Type'] == 2].mean()]).rename({0:'Dogs',
1:'Cats'}).plot(kind='barh',
figsize=(15,6), legend=None)
plt.yticks(fontsize='xx-large')
plt.xlabel('Adoption Speed')
plt.title('Average Adoption Speed', fontsize='xx-large')
The most common outcome for dogs is not being adopted within 100 days of listing, whereas the most common outcome for cats is adoption within the first month of listing. On average, dogs take longer to be adopted than cats.
train['Breed1'].value_counts().head(10).plot(kind='barh',
figsize=(15,6))
plt.yticks(fontsize='xx-large')
plt.ylabel('Breed ID')
plt.title('Breed Distribution (Breed1)', fontsize='xx-large')
Breed 307, which signifies an unknown breed, is the most common primary breed, followed by breed 266, which is the domestic short hair cat.
train['Breed2'].value_counts().head(10).plot(kind='barh',
figsize=(15,6))
plt.yticks(fontsize='xx-large')
plt.ylabel('Breed ID')
plt.title('Breed Distribution(Breed2)', fontsize='xx-large')
Most pets do not have a second breed, but among those that do, an unknown second breed is the most common.
train['Gender'][(train['Gender'] == 1) | (train['Gender'] == 2)].value_counts().rename({1:'Male',
2:'Female'}).plot(kind='barh',
figsize=(15,6))
plt.yticks(fontsize='xx-large')
plt.title('Gender Distribution (excluding groups of pets)', fontsize='xx-large')
More pets are female.
train['PhotoAmt'].value_counts().sort_index().plot(kind='barh',
figsize=(20,15))
plt.yticks(fontsize='xx-large')
plt.ylabel('Number of Photos')
plt.title('Amount of Photos Distribution', fontsize='xx-large')
Most listings that have photos have only one to five of them.
The image metadata is given as a collection of JSON files with the 'PetID' of the corresponding pet in the file name. Some pets have multiple pictures, but I will initially use only the first three photos of each pet, where available, as these are likely the main photos seen by people browsing for pets and thus have the largest effect on drawing in a prospective adopter.
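For reference, the slice of each metadata file that the extraction code below reads looks roughly like this (the field names come from the code; the values are made up for illustration):

example_metadata = {  # hypothetical Vision API output for one photo
    'cropHintsAnnotation': {
        'cropHints': [{
            'boundingPoly': {'vertices': [{}, {}, {'x': 399, 'y': 399}, {}]},
            'confidence': 0.8,
            'importanceFraction': 1.0,
        }]
    },
    'imagePropertiesAnnotation': {
        'dominantColors': {
            'colors': [{
                'color': {'red': 120, 'green': 100, 'blue': 80},
                'pixelFraction': 0.25,
                'score': 0.3,
            }]
        }
    },
}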
for index, row in train.iterrows(): ## First photo
file = 'train_metadata/' + row['PetID'] + '-1.json'
if os.path.exists(file):
data = json.load(open(file, encoding="utf8"))
vertex_x = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['x']
vertex_y = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['y']
bounding_confidence = data['cropHintsAnnotation']['cropHints'][0]['confidence']
bounding_importance_frac = data['cropHintsAnnotation']['cropHints'][0].get('importanceFraction', -1)
dominant_blue = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['blue']
dominant_green = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['green']
dominant_red = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['red']
dominant_pixel_frac = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['pixelFraction']
dominant_score = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['score']
train.loc[index, 'vertex_x']= vertex_x
train.loc[index, 'vertex_y']= vertex_y
train.loc[index, 'bounding_conf']= bounding_confidence
train.loc[index, 'bounding_imp']= bounding_importance_frac
train.loc[index, 'dom_blue']= dominant_blue
train.loc[index, 'dom_green']= dominant_green
train.loc[index, 'dom_red']= dominant_red
train.loc[index, 'pixel_frac']= dominant_pixel_frac
train.loc[index, 'score']= dominant_score
else:
train.loc[index, 'vertex_x']= -1
train.loc[index, 'vertex_y']= -1
train.loc[index, 'bounding_conf']= -1
train.loc[index, 'bounding_imp']= -1
train.loc[index, 'dom_blue']= -1
train.loc[index, 'dom_green']= -1
train.loc[index, 'dom_red']= -1
train.loc[index, 'pixel_frac']= -1
train.loc[index, 'score']= -1
for index, row in train.iterrows(): ## Second photo
file = 'train_metadata/' + row['PetID'] + '-2.json'
if os.path.exists(file):
data = json.load(open(file, encoding="utf8"))
vertex_x = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['x']
vertex_y = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['y']
bounding_confidence = data['cropHintsAnnotation']['cropHints'][0]['confidence']
bounding_importance_frac = data['cropHintsAnnotation']['cropHints'][0].get('importanceFraction', -1)
try:
dominant_blue = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['blue']
except:
dominant_blue = -1
try:
dominant_green = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['green']
except:
dominant_green = -1
try:
dominant_red = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['red']
except:
dominant_red = -1
dominant_pixel_frac = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['pixelFraction']
dominant_score = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['score']
train.loc[index, 'vertex_x2']= vertex_x
train.loc[index, 'vertex_y2']= vertex_y
train.loc[index, 'bounding_conf2']= bounding_confidence
train.loc[index, 'bounding_imp2']= bounding_importance_frac
train.loc[index, 'dom_blue2']= dominant_blue
train.loc[index, 'dom_green2']= dominant_green
train.loc[index, 'dom_red2']= dominant_red
train.loc[index, 'pixel_frac2']= dominant_pixel_frac
train.loc[index, 'score2']= dominant_score
else:
train.loc[index, 'vertex_x2']= -1
train.loc[index, 'vertex_y2']= -1
train.loc[index, 'bounding_conf2']= -1
train.loc[index, 'bounding_imp2']= -1
train.loc[index, 'dom_blue2']= -1
train.loc[index, 'dom_green2']= -1
train.loc[index, 'dom_red2']= -1
train.loc[index, 'pixel_frac2']= -1
train.loc[index, 'score2']= -1
for index, row in train.iterrows(): ## Third photo
file = 'train_metadata/' + row['PetID'] + '-3.json'
if os.path.exists(file):
data = json.load(open(file, encoding="utf8"))
vertex_x = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['x']
vertex_y = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['y']
bounding_confidence = data['cropHintsAnnotation']['cropHints'][0]['confidence']
bounding_importance_frac = data['cropHintsAnnotation']['cropHints'][0].get('importanceFraction', -1)
try:
dominant_blue = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['blue']
except:
dominant_blue = -1
try:
dominant_green = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['green']
except:
dominant_green = -1
try:
dominant_red = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['red']
except:
dominant_red = -1
dominant_pixel_frac = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['pixelFraction']
dominant_score = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['score']
train.loc[index, 'vertex_x3']= vertex_x
train.loc[index, 'vertex_y3']= vertex_y
train.loc[index, 'bounding_conf3']= bounding_confidence
train.loc[index, 'bounding_imp3']= bounding_importance_frac
train.loc[index, 'dom_blue3']= dominant_blue
train.loc[index, 'dom_green3']= dominant_green
train.loc[index, 'dom_red3']= dominant_red
train.loc[index, 'pixel_frac3']= dominant_pixel_frac
train.loc[index, 'score3']= dominant_score
else:
train.loc[index, 'vertex_x3']= -1
train.loc[index, 'vertex_y3']= -1
train.loc[index, 'bounding_conf3']= -1
train.loc[index, 'bounding_imp3']= -1
train.loc[index, 'dom_blue3']= -1
train.loc[index, 'dom_green3']= -1
train.loc[index, 'dom_red3']= -1
train.loc[index, 'pixel_frac3']= -1
train.loc[index, 'score3']= -1
for index, row in test.iterrows(): # First photo
file = 'test_metadata/' + row['PetID'] + '-1.json'
if os.path.exists(file):
data = json.load(open(file, encoding="utf8"))
vertex_x = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['x']
vertex_y = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['y']
bounding_confidence = data['cropHintsAnnotation']['cropHints'][0]['confidence']
bounding_importance_frac = data['cropHintsAnnotation']['cropHints'][0].get('importanceFraction', -1)
dominant_blue = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['blue']
dominant_green = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['green']
dominant_red = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['red']
dominant_pixel_frac = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['pixelFraction']
dominant_score = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['score']
test.loc[index, 'vertex_x']= vertex_x
test.loc[index, 'vertex_y']= vertex_y
test.loc[index, 'bounding_conf']= bounding_confidence
test.loc[index, 'bounding_imp']= bounding_importance_frac
test.loc[index, 'dom_blue']= dominant_blue
test.loc[index, 'dom_green']= dominant_green
test.loc[index, 'dom_red']= dominant_red
test.loc[index, 'pixel_frac']= dominant_pixel_frac
test.loc[index, 'score']= dominant_score
else:
test.loc[index, 'vertex_x']= -1
test.loc[index, 'vertex_y']= -1
test.loc[index, 'bounding_conf']= -1
test.loc[index, 'bounding_imp']= -1
test.loc[index, 'dom_blue']= -1
test.loc[index, 'dom_green']= -1
test.loc[index, 'dom_red']= -1
test.loc[index, 'pixel_frac']= -1
test.loc[index, 'score']= -1
for index, row in test.iterrows(): # Second photo
file = 'test_metadata/' + row['PetID'] + '-2.json'
if os.path.exists(file):
data = json.load(open(file, encoding="utf8"))
vertex_x = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['x']
vertex_y = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['y']
bounding_confidence = data['cropHintsAnnotation']['cropHints'][0]['confidence']
bounding_importance_frac = data['cropHintsAnnotation']['cropHints'][0].get('importanceFraction', -1)
try:
dominant_blue = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['blue']
except:
dominant_blue = -1
try:
dominant_green = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['green']
except:
dominant_green = -1
try:
dominant_red = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['red']
except:
dominant_red = -1
dominant_pixel_frac = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['pixelFraction']
dominant_score = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['score']
test.loc[index, 'vertex_x2']= vertex_x
test.loc[index, 'vertex_y2']= vertex_y
test.loc[index, 'bounding_conf2']= bounding_confidence
test.loc[index, 'bounding_imp2']= bounding_importance_frac
test.loc[index, 'dom_blue2']= dominant_blue
test.loc[index, 'dom_green2']= dominant_green
test.loc[index, 'dom_red2']= dominant_red
test.loc[index, 'pixel_frac2']= dominant_pixel_frac
test.loc[index, 'score2']= dominant_score
else:
test.loc[index, 'vertex_x2']= -1
test.loc[index, 'vertex_y2']= -1
test.loc[index, 'bounding_conf2']= -1
test.loc[index, 'bounding_imp2']= -1
test.loc[index, 'dom_blue2']= -1
test.loc[index, 'dom_green2']= -1
test.loc[index, 'dom_red2']= -1
test.loc[index, 'pixel_frac2']= -1
test.loc[index, 'score2']= -1
for index, row in test.iterrows(): # Third photo
file = 'test_metadata/' + row['PetID'] + '-3.json'
if os.path.exists(file):
data = json.load(open(file, encoding="utf8"))
vertex_x = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['x']
vertex_y = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['y']
bounding_confidence = data['cropHintsAnnotation']['cropHints'][0]['confidence']
bounding_importance_frac = data['cropHintsAnnotation']['cropHints'][0].get('importanceFraction', -1)
try:
dominant_blue = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['blue']
except:
dominant_blue = -1
try:
dominant_green = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['green']
except:
dominant_green = -1
try:
dominant_red = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['red']
except:
dominant_red = -1
dominant_pixel_frac = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['pixelFraction']
dominant_score = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['score']
test.loc[index, 'vertex_x3']= vertex_x
test.loc[index, 'vertex_y3']= vertex_y
test.loc[index, 'bounding_conf3']= bounding_confidence
test.loc[index, 'bounding_imp3']= bounding_importance_frac
test.loc[index, 'dom_blue3']= dominant_blue
test.loc[index, 'dom_green3']= dominant_green
test.loc[index, 'dom_red3']= dominant_red
test.loc[index, 'pixel_frac3']= dominant_pixel_frac
test.loc[index, 'score3']= dominant_score
else:
test.loc[index, 'vertex_x3']= -1
test.loc[index, 'vertex_y3']= -1
test.loc[index, 'bounding_conf3']= -1
test.loc[index, 'bounding_imp3']= -1
test.loc[index, 'dom_blue3']= -1
test.loc[index, 'dom_green3']= -1
test.loc[index, 'dom_red3']= -1
test.loc[index, 'pixel_frac3']= -1
test.loc[index, 'score3']= -1
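The six loops above are nearly identical. As a sketch (not the code actually run for this project), they could be consolidated into one helper that handles any photo number and either dataset, keeping the -1 sentinel convention from above:

def add_photo_metadata(df, folder, photo_num):
    """Extract the Vision API fields for photo `photo_num` of each pet into `df`.

    Missing files or fields are encoded as -1, matching the loops above.
    """
    suffix = '' if photo_num == 1 else str(photo_num)
    cols = ['vertex_x', 'vertex_y', 'bounding_conf', 'bounding_imp',
            'dom_blue', 'dom_green', 'dom_red', 'pixel_frac', 'score']
    for index, row in df.iterrows():
        file = f"{folder}/{row['PetID']}-{photo_num}.json"
        values = [-1] * len(cols)
        if os.path.exists(file):
            data = json.load(open(file, encoding='utf8'))
            hint = data['cropHintsAnnotation']['cropHints'][0]
            values[0] = hint['boundingPoly']['vertices'][2].get('x', -1)
            values[1] = hint['boundingPoly']['vertices'][2].get('y', -1)
            values[2] = hint['confidence']
            values[3] = hint.get('importanceFraction', -1)
            try:
                color = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]
                values[4] = color['color'].get('blue', -1)
                values[5] = color['color'].get('green', -1)
                values[6] = color['color'].get('red', -1)
                values[7] = color['pixelFraction']
                values[8] = color['score']
            except (KeyError, IndexError):
                pass  # leave the -1 sentinels in place
        for col, val in zip(cols, values):
            df.loc[index, col + suffix] = val

# Equivalent to the six loops above:
for n in (1, 2, 3):
    add_photo_metadata(train, 'train_metadata', n)
    add_photo_metadata(test, 'test_metadata', n)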
The sentiment data, like the image metadata, is provided as JSON files named with the 'PetID' of the corresponding pet. The values I chose to keep from the sentiment data are the document-level magnitude and score.
for index, row in train.iterrows():
file = 'train_sentiment/' + row['PetID'] + '.json'
if os.path.exists(file):
data = json.load(open(file, encoding="utf8"))
mag = data['documentSentiment']['magnitude']
score = data['documentSentiment']['score']
train.loc[index, 'magnitude']= mag
train.loc[index, 'sentiment_score']= score
else:
train.loc[index, 'magnitude']= -1
train.loc[index, 'sentiment_score']= -1
for index, row in test.iterrows():
file = 'test_sentiment/' + row['PetID'] + '.json'
if os.path.exists(file):
data = json.load(open(file, encoding="utf8"))
mag = data['documentSentiment']['magnitude']
score = data['documentSentiment']['score']
test.loc[index, 'magnitude']= mag
test.loc[index, 'sentiment_score']= score
else:
test.loc[index, 'magnitude']= -1
test.loc[index, 'sentiment_score']= -1
I will be adding additional columns of data but wanted to save a copy of the train and test sets to compare with later on.
train.to_csv('pre_train.csv')
test.to_csv('pre_test.csv')
To squeeze a bit more data out of the 'Description' column and the otherwise unused 'Name' column, I added the length of each as a new column.
train['NameLength'] = train['Name'].map(lambda x: len(str(x))).astype('int')
train['DescLength'] = train['Description'].map(lambda x: len(str(x))).astype('int')
test['NameLength'] = test['Name'].map(lambda x: len(str(x))).astype('int')
test['DescLength'] = test['Description'].map(lambda x: len(str(x))).astype('int')
pd.DataFrame([train['DescLength'][train['AdoptionSpeed'] == 0].mean(),
train['DescLength'][train['AdoptionSpeed'] == 1].mean(),
train['DescLength'][train['AdoptionSpeed'] == 2].mean(),
train['DescLength'][train['AdoptionSpeed'] == 3].mean(),
train['DescLength'][train['AdoptionSpeed'] == 4].mean()]).plot(kind='barh',figsize=(16,5))
plt.yticks(fontsize='xx-large')
plt.ylabel('Adoption Speed')
plt.xlabel('Description Length')
plt.title('Average Description Length', fontsize='xx-large')
The average description length trends upward as the adoption speed window increases, until it hits level 4, where the average description length drops back down.
Using data from the AKC website as well as Wikipedia, I assigned a breed group to each dog breed, as I suspect adoptability differs among the dog breed groups. I added these breed groups in Microsoft Excel and generated the CSV file 'dog_breeds.csv' from the provided CSV of breed labels. Now I just have to add a new column 'Group' to the train and test sets. Since groups only exist for dogs, cats will be assigned the group 'Cat'.
dog_data = pd.read_csv('dog_breeds.csv')
dog_data.head()
| | Unnamed: 0 | BreedID | Type | BreedName | Group |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | Affenpinscher | Toy |
| 1 | 1 | 2 | 1 | Afghan Hound | Hound |
| 2 | 2 | 3 | 1 | Airedale Terrier | Terrier |
| 3 | 3 | 4 | 1 | Akbash | Working |
| 4 | 4 | 5 | 1 | Akita | Working |
# First attempt: nested loops matching each pet's primary breed to a dog group.
# Cats never match a BreedID in the dog table, so their 'Group' is left as NaN.
for index, row in train.iterrows():
    for i, r in dog_data.iterrows():
        if row['Breed1'] == r['BreedID']:
            train.at[index,'Group'] = r['Group']
            break
for index, row in test.iterrows():
    for i, r in dog_data.iterrows():
        if row['Breed1'] == r['BreedID']:
            test.at[index,'Group'] = r['Group']
            break
train.Group.isna().sum()
6853
The 6,853 rows with a missing 'Group' are the cats, which the loops above never match. Rather than patching them separately, I reassign the group for every row, falling back to 'Cat' when the breed is not found in the dog table.
for index, row in train.iterrows():
    try:
        breed = row['Breed1']
        group = dog_data[dog_data['BreedID'] == breed]['Group'].values[0]
    except:
        group = 'Cat'  # breed not in the dog table, so this is a cat
    train.loc[index,'Group'] = group
for index, row in test.iterrows():
    try:
        breed = row['Breed1']
        group = dog_data[dog_data['BreedID'] == breed]['Group'].values[0]
    except:
        group = 'Cat'
    test.loc[index,'Group'] = group
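As an aside, the same assignment can be done without row-wise loops using a pandas lookup; a sketch under the same fallback convention:

# Vectorized alternative: map Breed1 through a BreedID -> Group lookup,
# then fill non-matches (the cats) with 'Cat'.
breed_to_group = dog_data.set_index('BreedID')['Group']
train['Group'] = train['Breed1'].map(breed_to_group).fillna('Cat')
test['Group'] = test['Breed1'].map(breed_to_group).fillna('Cat')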
train['Group'][train['Group'] != 'Cat'].value_counts().sort_index().plot(kind='barh',
figsize=(20,15))
plt.yticks(fontsize='xx-large')
plt.title('Distribution of Dog Groups', fontsize='xx-large')
It seems that 'Misc' is by far the most common group assigned to the dogs.
train['Group'][(train['Group'] != 'Cat') & (train['Group'] != 'Misc')].value_counts().sort_index().plot(kind='barh',
figsize=(20,15))
plt.yticks(fontsize='xx-large')
plt.title('Distribution of Dog Groups', fontsize='xx-large')
Removing the 'Misc' group lets us see the distribution of the other groups much better. From this, 'Sporting' and 'Toy' are the most common, with 'Hunting' being the least common.
Using data from http://www.catbreedslist.com, I added two new variables for the cats in the dataset. The first is 'Hypo', which indicates whether the cat breed is hypoallergenic. The second is 'Cute', which is 1 if the cat breed is one of the site's top 10 cutest cat breeds.
cat_data = pd.read_csv('cat_info.csv')
cat_data.head()
| | BreedID | Type | BreedName | Cute | Hypo |
|---|---|---|---|---|---|
| 0 | 241 | 2 | Abyssinian | 0 | 0 |
| 1 | 242 | 2 | American Curl | 1 | 0 |
| 2 | 243 | 2 | American Shorthair | 1 | 0 |
| 3 | 244 | 2 | American Wirehair | 0 | 0 |
| 4 | 245 | 2 | Applehead Siamese | 0 | 0 |
for index, row in train.iterrows():
try:
breed = row['Breed1']
cute = cat_data[cat_data['BreedID'] == breed]['Cute'].values[0]
hypo = cat_data[cat_data['BreedID'] == breed]['Hypo'].values[0]
except:
cute = -1
hypo = -1
train.loc[index,'Cat_Cute'] = cute
train.loc[index,'Cat_Hypo'] = hypo
for index, row in test.iterrows():
try:
breed = row['Breed1']
cute = cat_data[cat_data['BreedID'] == breed]['Cute'].values[0]
hypo = cat_data[cat_data['BreedID'] == breed]['Hypo'].values[0]
except:
cute = -1
hypo = -1
test.loc[index,'Cat_Cute'] = cute
test.loc[index,'Cat_Hypo'] = hypo
pd.DataFrame([train['AdoptionSpeed'][train['Cat_Hypo'] == 0].mean(),train['AdoptionSpeed'][train['Cat_Hypo'] == 1].mean()]).rename({1:'Hypoallergenic', 0:'Non-hypoallergenic'}).plot(kind='barh',figsize=(16,5))
plt.yticks(fontsize='xx-large')
plt.title('Hypoallergenic Adoption Speeds', fontsize='xx-large')
It seems that hypoallergenic cat breeds are adopted more quickly on average than non-hypoallergenic cat breeds.
Using census data found on Wikipedia for the states in Malaysia, I added the population, percentage of urban environment, and population density for each state.
state_data = pd.read_csv('state_data.csv')
state_data.head()
| | State | Population | StateID | UrbanPercent | PopDensity |
|---|---|---|---|---|---|
| 0 | Kuala Lumpur | 1627172 | 41401 | 100.0 | 6891 |
| 1 | Labuan | 86908 | 41415 | 82.3 | 950 |
| 2 | Johor | 3348283 | 41336 | 71.9 | 174 |
| 3 | Kedah | 1890098 | 41325 | 64.6 | 199 |
| 4 | Kelantan | 1459994 | 41367 | 42.4 | 97 |
for index, row in train.iterrows():
state = row['State']
urban = state_data[state_data['StateID'] == state]['UrbanPercent'].values[0]
pop = state_data[state_data['StateID'] == state]['Population'].values[0]
pop_den = state_data[state_data['StateID'] == state]['PopDensity'].values[0]
train.loc[index,'UrbanPercent'] = urban
train.loc[index,'Population'] = pop
train.loc[index,'PopDensity'] = pop_den
for index, row in test.iterrows():
state = row['State']
urban = state_data[state_data['StateID'] == state]['UrbanPercent'].values[0]
pop = state_data[state_data['StateID'] == state]['Population'].values[0]
pop_den = state_data[state_data['StateID'] == state]['PopDensity'].values[0]
test.loc[index,'UrbanPercent'] = urban
test.loc[index,'Population'] = pop
test.loc[index,'PopDensity'] = pop_den
Saving the data at this step to avoid repeating the processing above in the future; the processed files are then reloaded.
#train.to_csv('processed_train.csv')
#test.to_csv('processed_test.csv')
train = pd.read_csv('processed_train.csv')
test = pd.read_csv('processed_test.csv')
train = pd.get_dummies(train, columns = ['Breed1', 'Breed2', 'Gender', 'Color1', 'Color2', 'Color3',
'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed', 'Sterilized', 'Health',
'State', 'Type', 'Group'
])
test = pd.get_dummies(test, columns = ['Breed1', 'Breed2', 'Gender', 'Color1', 'Color2', 'Color3',
'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed', 'Sterilized', 'Health',
'State', 'Type', 'Group'
])
Encoding variables like 'Breed1' creates many new columns, some of which may exist only in the training set or only in the test set. To remedy this, we make sure each dataset has the same columns, filling any missing column with 0.
diff_columns = set(train.columns).difference(set(test.columns))
for i in diff_columns:
    test[i] = 0  # column exists only in train; fill with zeros in test
diff_columns2 = set(test.columns).difference(set(train.columns))
for i in diff_columns2:
    train[i] = 0  # column exists only in test; fill with zeros in train
test = test[train.columns]  # put the test columns in the same order as train
train.shape
(14993, 453)
test.shape
(3948, 453)
Training set and test set now have the same number of columns.
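pandas can also perform this alignment in a single call; a sketch of the equivalent operation (not what was run above):

# DataFrame.align with an outer join on the column axis adds any column
# missing from either frame and fills it with 0.
train, test = train.align(test, join='outer', axis=1, fill_value=0)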
To deal with variables that may be highly correlated with each other, we can grab all pairs whose correlation exceeds a threshold of 0.85.
corr = train.corr()
indices = np.where(corr > 0.85)
indices = [(corr.index[x], corr.columns[y]) for x, y in zip(*indices)
if x != y and x < y]
indices
[('bounding_conf', 'bounding_imp'),
('bounding_conf', 'pixel_frac'),
('dom_blue', 'dom_green'),
('dom_green', 'dom_red'),
('vertex_y2', 'bounding_conf2'),
('vertex_y2', 'bounding_imp2'),
('vertex_y2', 'pixel_frac2'),
('vertex_y2', 'score2'),
('bounding_conf2', 'bounding_imp2'),
('bounding_conf2', 'pixel_frac2'),
('bounding_conf2', 'score2'),
('bounding_imp2', 'pixel_frac2'),
('bounding_imp2', 'score2'),
('dom_blue2', 'dom_green2'),
('dom_blue2', 'dom_red2'),
('dom_green2', 'dom_red2'),
('pixel_frac2', 'score2'),
('vertex_x3', 'vertex_y3'),
('vertex_x3', 'bounding_conf3'),
('vertex_x3', 'bounding_imp3'),
('vertex_x3', 'pixel_frac3'),
('vertex_x3', 'score3'),
('vertex_y3', 'bounding_conf3'),
('vertex_y3', 'bounding_imp3'),
('vertex_y3', 'pixel_frac3'),
('vertex_y3', 'score3'),
('bounding_conf3', 'bounding_imp3'),
('bounding_conf3', 'pixel_frac3'),
('bounding_conf3', 'score3'),
('bounding_imp3', 'pixel_frac3'),
('bounding_imp3', 'score3'),
('dom_blue3', 'dom_green3'),
('dom_blue3', 'dom_red3'),
('dom_green3', 'dom_red3'),
('pixel_frac3', 'score3'),
('Cat_Cute', 'Cat_Hypo'),
('Cat_Cute', 'Type_2'),
('Cat_Cute', 'Group_Cat'),
('Cat_Hypo', 'Type_2'),
('Cat_Hypo', 'Group_Cat'),
('Population', 'State_41326'),
('PopDensity', 'State_41401'),
('Breed1_143', 'Breed2_146'),
('Breed1_155', 'Breed2_155'),
('Breed1_307', 'Group_Misc'),
('Type_2', 'Group_Cat')]
Before immediately dropping one variable from each pair above, I looked at the list closely and decided that in some pairs, one variable is the better candidate to drop. These are 'Group_Cat' and 'Type_2', because the inclusion of 'Cat_Cute' and 'Cat_Hypo' makes them redundant, and 'State_41326' and 'State_41401', because they correlate highly with 'Population' and 'PopDensity' respectively, and the latter two are more important to keep in the dataset.
drop_list = ['Group_Cat', 'Type_2', 'State_41326', 'State_41401']
for i in drop_list:  # drop the manually chosen columns first
    train.drop(i, axis=1, inplace=True)
    test.drop(i, axis=1, inplace=True)
for i in indices:
    if (i[0] in drop_list) or (i[1] in drop_list):
        pass  # this pair is already handled by a manual drop
    else:
        try:
            train.drop(i[0], axis=1, inplace=True)
            test.drop(i[0], axis=1, inplace=True)
            drop_list.append(i[0])
        except:
            ## already dropped
            pass
Below are all the columns that were dropped to deal with multicollinearity.
drop_list
['Group_Cat',
'Type_2',
'State_41326',
'State_41401',
'bounding_conf',
'dom_blue',
'dom_green',
'vertex_y2',
'bounding_conf2',
'bounding_imp2',
'dom_blue2',
'dom_green2',
'pixel_frac2',
'vertex_x3',
'vertex_y3',
'bounding_conf3',
'bounding_imp3',
'dom_blue3',
'dom_green3',
'pixel_frac3',
'Cat_Cute',
'Breed1_143',
'Breed1_155',
'Breed1_307']
target = train['AdoptionSpeed'].astype('int')
Dropping unnecessary columns as well as the target column 'AdoptionSpeed'.
X = train.drop(['Name', 'RescuerID', 'Description', 'PetID', 'AdoptionSpeed', 'Unnamed: 0'], axis=1)
X_pred = test.drop(['Name', 'RescuerID', 'Description', 'Unnamed: 0'], axis=1)
According to the rules of the Kaggle competition, submissions are scored using the quadratic weighted kappa. I will use cohen_kappa_score from sklearn.metrics with weights set to 'quadratic' to evaluate my results.
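As a quick illustration of the metric on toy labels (not project data), quadratic weighting penalizes a prediction more the further it lands from the true ordinal class:

y_true = [0, 1, 2, 3, 4, 4]
close  = [0, 1, 2, 3, 3, 4]  # one prediction off by a single class
far    = [4, 1, 2, 3, 0, 4]  # two predictions off by four classes
print(cohen_kappa_score(y_true, close, weights='quadratic'))  # close to 1
print(cohen_kappa_score(y_true, far, weights='quadratic'))    # much lower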
Although the provided data is labeled 'train', we need to set aside a validation set to test classifier performance.
X_train, X_val, target_train, target_val = train_test_split(X,
target,
test_size=0.25,
random_state=47)
Now it is time to start testing some classifiers. I will grab a baseline score for a RandomForest, XGBoost, and AdaBoost classifier.
clf_rf = RandomForestClassifier()
clf_rf.fit(X_train, target_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
cohen_kappa_score(target_val, clf_rf.predict(X_val), weights='quadratic')
0.29257257000903714
feature_importances = pd.DataFrame(clf_rf.feature_importances_,
index = X_train.columns,
columns=['importance']).sort_values('importance', ascending=False)
feature_importances.head(25)
| | importance |
|---|---|
| score | 0.045825 |
| DescLength | 0.044196 |
| pixel_frac | 0.043875 |
| Age | 0.041879 |
| dom_red | 0.041559 |
| magnitude | 0.039455 |
| score2 | 0.036550 |
| NameLength | 0.034588 |
| dom_red2 | 0.033189 |
| sentiment_score | 0.033013 |
| vertex_x | 0.032567 |
| vertex_y | 0.030376 |
| dom_red3 | 0.028200 |
| score3 | 0.027632 |
| PhotoAmt | 0.025038 |
| vertex_x2 | 0.023397 |
| Population | 0.015160 |
| UrbanPercent | 0.014829 |
| Quantity | 0.013536 |
| PopDensity | 0.012904 |
| Fee | 0.012840 |
| Group_Misc | 0.010596 |
| Gender_2 | 0.010524 |
| Color1_1 | 0.010236 |
| Sterilized_2 | 0.010147 |
The top three most important features for the baseline RandomForest classifier are 'score', 'DescLength', and 'pixel_frac'.
clf_xgb = xgb.XGBClassifier()
clf_xgb.fit(X_train, target_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=1)
cohen_kappa_score(target_val, clf_xgb.predict(X_val), weights='quadratic')
0.36919141456485527
Using an XGBClassifier improved the quadratic kappa score significantly.
from xgboost import plot_importance
fig, ax = plt.subplots(figsize=(12,18))
plot_importance(clf_xgb, max_num_features=25, height=0.8, ax=ax)
plt.show()
From the feature importance chart it seems that 'Age', 'DescLength', and 'score' are the top 3 most important features.
clf_ada = AdaBoostClassifier()
clf_ada.fit(X_train, target_train)
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
learning_rate=1.0, n_estimators=50, random_state=None)
cohen_kappa_score(target_val, clf_ada.predict(X_val), weights='quadratic')
0.33028851731478026
feature_importances = pd.DataFrame(clf_ada.feature_importances_,
index = X_train.columns,
columns=['importance']).sort_values('importance', ascending=False)
feature_importances.head(25)
| | importance |
|---|---|
| Age | 0.06 |
| vertex_y | 0.06 |
| DescLength | 0.06 |
| Group_Misc | 0.04 |
| pixel_frac | 0.04 |
| magnitude | 0.04 |
| UrbanPercent | 0.04 |
| Color1_7 | 0.02 |
| Type_1 | 0.02 |
| State_41336 | 0.02 |
| Sterilized_3 | 0.02 |
| Sterilized_2 | 0.02 |
| Dewormed_2 | 0.02 |
| FurLength_3 | 0.02 |
| FurLength_1 | 0.02 |
| Color3_5 | 0.02 |
| Color1_1 | 0.02 |
| Breed1_179 | 0.02 |
| Breed1_11 | 0.02 |
| Gender_1 | 0.02 |
| Breed1_213 | 0.02 |
| Breed2_291 | 0.02 |
| Breed2_247 | 0.02 |
| Breed2_207 | 0.02 |
| Breed1_283 | 0.02 |
Now I will try to improve the baseline scores of the three classifiers by using GridSearchCV to find the optimal parameters for each classifier.
rf_params = {
'bootstrap': [True, False],
'max_depth': [25, 50, 75, 100],
'max_features': ['auto'],
'min_samples_leaf': [2, 3, 5, 10],
'min_samples_split': [5, 10, 15],
'n_jobs':[-1],
'n_estimators': [50, 100, 200, 300],
'random_state' : [47]
}
rf_gridsearch = GridSearchCV(estimator = clf_rf,
param_grid = rf_params,
cv = 3,
n_jobs = -1,
verbose = 1,
scoring=make_scorer(cohen_kappa_score,weights='quadratic'))
rf_gridsearch.fit(X_train, target_train)
Fitting 3 folds for each of 384 candidates, totalling 1152 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done 26 tasks | elapsed: 10.9s
[Parallel(n_jobs=-1)]: Done 176 tasks | elapsed: 1.1min
[Parallel(n_jobs=-1)]: Done 426 tasks | elapsed: 2.4min
[Parallel(n_jobs=-1)]: Done 776 tasks | elapsed: 5.0min
[Parallel(n_jobs=-1)]: Done 1152 out of 1152 | elapsed: 8.0min finished
GridSearchCV(cv=3, error_score='raise-deprecating',
estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
oob_score=False, random_state=None, verbose=0,
warm_start=False),
fit_params=None, iid='warn', n_jobs=-1,
param_grid={'n_estimators': [50, 100, 200, 300], 'bootstrap': [True, False], 'n_jobs': [-1], 'min_samples_leaf': [2, 3, 5, 10], 'max_features': ['auto'], 'max_depth': [25, 50, 75, 100], 'min_samples_split': [5, 10, 15], 'random_state': [47]},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=make_scorer(cohen_kappa_score, weights=quadratic),
verbose=1)
rf_gridsearch.best_params_
{'bootstrap': False,
'max_depth': 25,
'max_features': 'auto',
'min_samples_leaf': 2,
'min_samples_split': 5,
'n_estimators': 300,
'n_jobs': -1,
'random_state': 47}
rf_gridsearch.best_score_
0.343240804127584
clf_rf_best = RandomForestClassifier(bootstrap=False, max_depth=25, max_features='auto', min_samples_leaf=2,
min_samples_split=5,
n_estimators=300, n_jobs=-1, random_state=47)
clf_rf_best.fit(X_train, target_train)
RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=25, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=2, min_samples_split=5,
min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=-1,
oob_score=False, random_state=47, verbose=0, warm_start=False)
cohen_kappa_score(target_val, clf_rf_best.predict(X_val),weights='quadratic')
0.3582847219900983
xgb_params = {'objective' : ['multi:softmax'],
'eta' : [0.01],
'max_depth' : [3, 4, 6],
'min_child_weight' : [2, 3, 4],
}
xgb_gridsearch = GridSearchCV(estimator = clf_xgb,
param_grid = xgb_params,
cv = 3,
n_jobs = -1,
verbose = 1,
scoring=make_scorer(cohen_kappa_score,weights='quadratic'))
xgb_gridsearch.fit(X_train, target_train)
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done 27 out of 27 | elapsed: 4.3min finished
GridSearchCV(cv=3, error_score='raise-deprecating',
estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=1),
fit_params=None, iid='warn', n_jobs=-1,
param_grid={'min_child_weight': [2, 3, 4], 'objective': ['multi:softmax'], 'max_depth': [3, 4, 6], 'eta': [0.01]},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=make_scorer(cohen_kappa_score, weights=quadratic),
verbose=1)
xgb_gridsearch.best_params_
{'eta': 0.01,
'max_depth': 4,
'min_child_weight': 4,
'objective': 'multi:softmax'}
xgb_gridsearch.best_score_
0.34320068021975203
clf_xgb_best = xgb.XGBClassifier(eta = 0.01, max_depth = 4, min_child_weight = 4, objective = 'multi:softmax')
clf_xgb_best.fit(X_train, target_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, eta=0.01, gamma=0, learning_rate=0.1,
max_delta_step=0, max_depth=4, min_child_weight=4, missing=None,
n_estimators=100, n_jobs=1, nthread=None,
objective='multi:softprob', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
subsample=1)
cohen_kappa_score(target_val, clf_xgb_best.predict(X_val),weights='quadratic')
0.38268750072423174
ada_params = {'base_estimator': [None, DecisionTreeClassifier(max_depth=3), DecisionTreeClassifier(max_depth=5)],
'n_estimators': [50, 100, 200, 300]}
ada_gridsearch = GridSearchCV(estimator = clf_ada,
param_grid = ada_params,
cv = 3,
n_jobs = -1,
verbose = 1,
scoring=make_scorer(cohen_kappa_score,weights='quadratic'))
ada_gridsearch.fit(X_train, target_train)
Fitting 3 folds for each of 12 candidates, totalling 36 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done 36 out of 36 | elapsed: 57.9s finished
GridSearchCV(cv=3, error_score='raise-deprecating',
estimator=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
learning_rate=1.0, n_estimators=50, random_state=None),
fit_params=None, iid='warn', n_jobs=-1,
param_grid={'base_estimator': [None, DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weigh...resort=False, random_state=None,
splitter='best')], 'n_estimators': [50, 100, 200, 300]},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=make_scorer(cohen_kappa_score, weights=quadratic),
verbose=1)
ada_gridsearch.best_params_
{'base_estimator': None, 'n_estimators': 100}
ada_gridsearch.best_score_
0.30248923825346563
clf_ada_best = AdaBoostClassifier(base_estimator=None, n_estimators=100)
clf_ada_best.fit(X_train, target_train)
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
learning_rate=1.0, n_estimators=100, random_state=None)
cohen_kappa_score(target_val, clf_ada_best.predict(X_val),weights='quadratic')
0.3467641968995947
Since all three classifiers have decent, comparable performance, I will combine all three into one final ensemble using a VotingClassifier with soft voting, which averages the predicted class probabilities of the base classifiers.
clf_vot = VotingClassifier(estimators=[('RF',clf_rf_best),('XGB',clf_xgb_best),('ADA',clf_ada_best)],voting='soft')
clf_vot.fit(X_train, target_train)
VotingClassifier(estimators=[('RF', RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=25, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=2, min_samples_split=5,
min_wei...='SAMME.R', base_estimator=None,
learning_rate=1.0, n_estimators=100, random_state=None))],
flatten_transform=None, n_jobs=None, voting='soft', weights=None)
cohen_kappa_score(target_val, clf_vot.predict(X_val),weights='quadratic')
0.3863173663782099
To visualize the differences in predictions of the three base classifiers and the ensemble classifier, we can look at bar charts of each 'AdoptionSpeed' prediction for the classifiers below.
# Predicted classes (not probabilities) on the validation set for each classifier
preds = [c.fit(X_train, target_train).predict(X_val) for c in (clf_rf_best, clf_xgb_best, clf_ada_best, clf_vot)]
class_0 = list()
class_1 = list()
class_2 = list()
class_3 = list()
class_4 = list()
for p in preds:
    counts = np.unique(p, return_counts=True)[1]  # number of predictions per class
    class_0.append(counts[0])
    class_1.append(counts[1])
    class_2.append(counts[2])
    class_3.append(counts[3])
    class_4.append(counts[4])
N = 4 # number of groups
ind = np.arange(N) # group positions
width = 0.5 # bar width
ax1 = plt.subplot2grid((2,6), (0,0), colspan=2)
ax2 = plt.subplot2grid((2,6), (0,2), colspan=2)
ax3 = plt.subplot2grid((2,6), (0,4), colspan=2)
ax4 = plt.subplot2grid((2,6), (1,1), colspan=2)
ax5 = plt.subplot2grid((2,6), (1,3), colspan=2)
# bars for base classifiers
p0 = ax1.bar(ind + width, np.hstack(([class_0[:-1], [0]])), width, color='green', alpha=0.5, edgecolor='k')
p1 = ax2.bar(ind + width, np.hstack(([class_1[:-1], [0]])), width, color='blue',alpha=0.5, edgecolor='k')
p2 = ax3.bar(ind + width, np.hstack(([class_2[:-1], [0]])), width, color='red',alpha=0.5, edgecolor='k')
p3 = ax4.bar(ind + width, np.hstack(([class_3[:-1], [0]])), width, color='orange',alpha=0.5, edgecolor='k')
p4 = ax5.bar(ind + width, np.hstack(([class_4[:-1], [0]])), width, color='purple',alpha=0.5, edgecolor='k')
# bars for voting classifier
p5 = ax1.bar(ind + width, [0, 0, 0, class_0[-1]], width,color='green', edgecolor='k')
p6 = ax2.bar(ind + width, [0, 0, 0, class_1[-1]], width,color='blue', edgecolor='k')
p7 = ax3.bar(ind + width, [0, 0, 0, class_2[-1]], width,color='red', edgecolor='k')
p8 = ax4.bar(ind + width, [0, 0, 0, class_3[-1]], width,color='orange', edgecolor='k')
p9 = ax5.bar(ind + width, [0, 0, 0, class_4[-1]], width,color='purple', edgecolor='k')
# plot annotations
ax1.set_xticks(ind + width)
ax1.set_ylabel('Number of predictions')
ax1.set_xticklabels(['RandomForest',
'XGBoost',
'AdaBoost',
'VotingClassifier'],
rotation=40,
ha='right')
ax2.set_xticks(ind + width)
ax2.set_xticklabels(['RandomForest',
'XGBoost',
'AdaBoost',
'VotingClassifier'],
rotation=40,
ha='right')
ax3.set_xticks(ind + width)
ax3.set_xticklabels(['RandomForest',
'XGBoost',
'AdaBoost',
'VotingClassifier'],
rotation=40,
ha='right')
ax4.set_xticks(ind + width)
ax4.set_ylabel('Number of predictions')
ax4.set_xticklabels(['RandomForest',
'XGBoost',
'AdaBoost',
'VotingClassifier'],
rotation=40,
ha='right')
ax5.set_xticks(ind + width)
ax5.set_xticklabels(['RandomForest',
'XGBoost',
'AdaBoost',
'VotingClassifier'],
rotation=40,
ha='right')
ax1.set_title('Adoption Speed 0')
ax2.set_title('Adoption Speed 1')
ax3.set_title('Adoption Speed 2')
ax4.set_title('Adoption Speed 3')
ax5.set_title('Adoption Speed 4')
plt.rcParams["figure.figsize"] = [20,20]
plt.show()
From the above chart we can see how the voting classifier averages out the predictions of the three base classifiers to better predict 'AdoptionSpeed'. One thing to note is that for 'Adoption Speed 0', the RandomForest and XGBoost classifiers predicted far fewer cases than the AdaBoost classifier. This ultimately did not pull up the average produced by the VotingClassifier, but it is still a significant outlier when comparing the charts side by side.
Now that we have our final classifier, we can fit it to the entire training set.
clf_vot_final = VotingClassifier(estimators=[('RF',clf_rf_best),('XGB',clf_xgb_best),('ADA',clf_ada_best)],voting='soft')
clf_vot_final.fit(X, target)
VotingClassifier(estimators=[('RF', RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=25, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=2, min_samples_split=5,
min_wei...='SAMME.R', base_estimator=None,
learning_rate=1.0, n_estimators=100, random_state=None))],
flatten_transform=None, n_jobs=None, voting='soft', weights=None)
With the final classifier trained on all the data, we can now make predictions based on the given test data from the Kaggle competition.
test_pred = clf_vot_final.predict(X_pred.drop(['AdoptionSpeed','PetID'], axis=1))
Now we can view the distribution of predictions on the test data.
plt.rcParams["figure.figsize"] = [7,7]
pd.DataFrame(test_pred).hist()
plt.title('Adoption Speed Predictions on Test Data')
plt.ylabel('Number of prediction')
plt.xticks(np.arange(5))
pd.DataFrame(test_pred)[0].value_counts()
4 1788
2 1257
1 684
3 219
Name: 0, dtype: int64
Somewhat surprisingly, there are no predictions of an 'AdoptionSpeed' of 0 for any of the test data. The training data contained significantly fewer cases of the lowest 'AdoptionSpeed', which may be why the predictions contain none at all. It still seems an unusual imbalance and could be investigated further.
Saving the predictions to a separate CSV file allows me to upload them to the Kaggle competition for scoring.
pred = pd.DataFrame()  # build the submission frame expected by Kaggle
pred['PetID'] = X_pred['PetID']
pred['AdoptionSpeed'] = test_pred
pred.set_index('PetID').to_csv("submission.csv", index=True)
The prediction submitted received a score of 0.333 on the Kaggle competition. This placed us in about the 50th percentile of all of the competitors.
The current high score on the Kaggle competition is 0.452, so there is certainly room for improvement. Still, the final classifier scored decently for my first Kaggle competition. It was a definite improvement over both the baseline classifiers and the individually tuned classifiers, so ensembling them with a VotingClassifier was a good choice.
With more time I would focus on the following:
- Class imbalance
- The lowest 'AdoptionSpeed' had a very low occurrence in the training data and no occurrence in the test predictions. This seems unusual and could be investigated further. I would want to look at the confusion matrix of predictions on the training data to see how well or how poorly the classifier predicts an 'AdoptionSpeed' of 0.
- I would consider using SMOTE to better balance the classes of 'AdoptionSpeed' (see the sketch after this list).
- Removing data
- I added a few of my own columns to the dataset. In hindsight, some of this additional data could have just added noise to the dataset. I would test removing some of the added columns of data.
- I would also test not using as much of the image data as I did - maybe only the first photo for each pet.
- Further investigation of image and sentiment data
- For both the image and sentiment data, I simply used the variables provided without much research into the Google APIs behind them. I would like to learn more about how these APIs work and what the values they generate signify.
- I would also like to possibly try running either the images or descriptions through my own computer vision or NLP algorithm.
- Utilize other classifiers
- Most of the highest scoring kernels in the Kaggle competition use LightGBM for their predictions. I would have liked to learn how to utilize that as a classifier to improve my score.
- I could also include more base classifiers in my VotingClassifier as well as trying out other ensemble methods.
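A minimal sketch of the SMOTE idea mentioned above, using the imblearn import already at the top of the notebook (the random state is illustrative, and on older imblearn versions fit_resample is named fit_sample):

# Oversample only the training split so the validation data stays untouched.
sm = SMOTE(random_state=47)
X_train_res, target_train_res = sm.fit_resample(X_train, target_train)

clf_xgb_best.fit(X_train_res, target_train_res)
print(cohen_kappa_score(target_val, clf_xgb_best.predict(X_val), weights='quadratic'))
print(confusion_matrix(target_val, clf_xgb_best.predict(X_val)))  # inspect the class-0 row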