Please fill out:
- Student name: Matthew Sparr
- Student pace: self paced
- Scheduled project review date/time:
- Instructor name: Eli
- Blog post URL:
For this project I chose a Kaggle dataset from an ongoing competition, which can be found at https://www.kaggle.com/c/petfinder-adoption-prediction. The competition involves predicting the speed of adoption for pets listed on a Malaysian pet adoption site. Provided are various data fields such as each pet's color, age, and breed.
Also provided are image metadata, produced by running the pets' uploaded photos through Google's Vision API, and sentiment data, produced by running each pet's description through Google's Natural Language API.
The goal of this project is to achieve a decent score on the Kaggle competition test data and hopefully place high on the leaderboard.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingClassifier,RandomForestClassifier,AdaBoostClassifier,VotingClassifier
from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
import os
import json
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import make_scorer
from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier
The provided test set does not include the 'AdoptionSpeed' target variable; it is used only to make predictions to submit to Kaggle for scoring.
train = pd.read_csv('train/train.csv')
test = pd.read_csv('test/test.csv')
The 'Name' and 'Description' columns are the only two columns with missing data in both the train and test sets. Since both fields are text, missing values will be filled with a blank space, ' '.
train.isna().sum()
Type 0
Name 1257
Age 0
Breed1 0
Breed2 0
Gender 0
Color1 0
Color2 0
Color3 0
MaturitySize 0
FurLength 0
Vaccinated 0
Dewormed 0
Sterilized 0
Health 0
Quantity 0
Fee 0
State 0
RescuerID 0
VideoAmt 0
Description 12
PetID 0
PhotoAmt 0
AdoptionSpeed 0
dtype: int64
test.isna().sum()
Type 0
Name 303
Age 0
Breed1 0
Breed2 0
Gender 0
Color1 0
Color2 0
Color3 0
MaturitySize 0
FurLength 0
Vaccinated 0
Dewormed 0
Sterilized 0
Health 0
Quantity 0
Fee 0
State 0
RescuerID 0
VideoAmt 0
Description 2
PetID 0
PhotoAmt 0
dtype: int64
train.Name.fillna(' ', inplace=True)
train.Description.fillna(' ', inplace=True)
test.Name.fillna(' ', inplace=True)
test.Description.fillna(' ', inplace=True)
Below is a basic histogram of all of the variables.
train.hist(figsize=(20,20))
Most of the variables are not normally distributed, which means we will probably want to standardize them later on. The target variable 'AdoptionSpeed' has a low count of '0' values, which could negatively impact training a classifier on this set.
We can also see that most pets have only one breed and one color, as there are many zero values for 'Breed2', 'Color2', and 'Color3'.
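As a minimal sketch of the standardization step alluded to above (the column list here is illustrative, not the exact feature set used later):

scaler = StandardScaler()
num_cols = ['Age', 'Quantity', 'Fee', 'VideoAmt', 'PhotoAmt']  # illustrative numeric columns
# Fit on the training data only, then apply the same transform to the test data
# so that no test-set statistics leak into training.
train_scaled, test_scaled = train.copy(), test.copy()
train_scaled[num_cols] = scaler.fit_transform(train[num_cols])
test_scaled[num_cols] = scaler.transform(test[num_cols])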
Now we can look at some of the value counts of various columns just to get a feel of the distribution of the pets.
train['Type'].value_counts().rename({1:'Dog',
2:'Cat'}).plot(kind='barh',
figsize=(15,6))
plt.yticks(fontsize='xx-large')
plt.title('Type Distribution', fontsize='xx-large')
Slightly more dogs than cats.
train['AdoptionSpeed'][train['Type'] == 1].value_counts().sort_index().plot(kind='barh',
figsize=(15,6))
plt.yticks(fontsize='xx-large')
plt.ylabel('Adoption Speed')
plt.title('Adoption Speed Distribution (Dogs)', fontsize='xx-large')
train['AdoptionSpeed'][train['Type'] == 2].value_counts().sort_index().plot(kind='barh',
figsize=(15,6))
plt.yticks(fontsize='xx-large')
plt.ylabel('Adoption Speed')
plt.title('Adoption Speed Distribution (Cats)', fontsize='xx-large')
pd.DataFrame([train['AdoptionSpeed'][train['Type'] == 1].mean(),train['AdoptionSpeed'][train['Type'] == 2].mean()]).rename({0:'Dogs',
1:'Cats'}).plot(kind='barh',
figsize=(15,6), legend=None)
plt.yticks(fontsize='xx-large')
plt.xlabel('Adoption Speed')
plt.title('Average Adoption Speed', fontsize='xx-large')
The most common outcome for dogs is not being adopted within 100 days of listing, whereas the most common outcome for cats is adoption within the first month of listing. On average, dogs take longer to be adopted than cats.
train['Breed1'].value_counts().head(10).plot(kind='barh',
figsize=(15,6))
plt.yticks(fontsize='xx-large')
plt.ylabel('Breed ID')
plt.title('Breed Distribution (Breed1)', fontsize='xx-large')
Breed 307, which signifies an unknown breed, is the most common primary breed, followed by breed 266, which is the domestic short hair cat.
train['Breed2'].value_counts().head(10).plot(kind='barh',
figsize=(15,6))
plt.yticks(fontsize='xx-large')
plt.ylabel('Breed ID')
plt.title('Breed Distribution(Breed2)', fontsize='xx-large')
Most pets do not have a second breed, but among those that do, an unknown second breed is the most common.
train['Gender'][(train['Gender'] == 1) | (train['Gender'] == 2)].value_counts().rename({1:'Male',
2:'Female'}).plot(kind='barh',
figsize=(15,6))
plt.yticks(fontsize='xx-large')
plt.title('Gender Distribution (excluding groups of pets)', fontsize='xx-large')
More pets are female.
train['PhotoAmt'].value_counts().sort_index().plot(kind='barh',
figsize=(20,15))
plt.yticks(fontsize='xx-large')
plt.ylabel('Number of Photos')
plt.title('Amount of Photos Distribution', fontsize='xx-large')
Most listings that have photos have only one to five of them.
The image metadata is given as a collection of JSON files with the 'PetID' of the corresponding pet in the file name. Some pets have multiple pictures, but I will initially use only the first three photos of each pet, where available, as these are likely the main photos seen by people browsing for pets and thus have the largest effect on drawing in a prospective adopter.
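For reference, the slice of each metadata file that the extraction code below reads looks roughly like this (the field names come from the code; the values are made up for illustration):

example_metadata = {  # hypothetical Vision API output for one photo
    'cropHintsAnnotation': {
        'cropHints': [{
            'boundingPoly': {'vertices': [{}, {}, {'x': 399, 'y': 399}, {}]},
            'confidence': 0.8,
            'importanceFraction': 1.0,
        }]
    },
    'imagePropertiesAnnotation': {
        'dominantColors': {
            'colors': [{
                'color': {'red': 120, 'green': 100, 'blue': 80},
                'pixelFraction': 0.25,
                'score': 0.3,
            }]
        }
    },
}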
for index, row in train.iterrows(): ## First photo
file = 'train_metadata/' + row['PetID'] + '-1.json'
if os.path.exists(file):
data = json.load(open(file, encoding="utf8"))
vertex_x = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['x']
vertex_y = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['y']
bounding_confidence = data['cropHintsAnnotation']['cropHints'][0]['confidence']
bounding_importance_frac = data['cropHintsAnnotation']['cropHints'][0].get('importanceFraction', -1)
dominant_blue = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['blue']
dominant_green = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['green']
dominant_red = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['red']
dominant_pixel_frac = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['pixelFraction']
dominant_score = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['score']
train.loc[index, 'vertex_x']= vertex_x
train.loc[index, 'vertex_y']= vertex_y
train.loc[index, 'bounding_conf']= bounding_confidence
train.loc[index, 'bounding_imp']= bounding_importance_frac
train.loc[index, 'dom_blue']= dominant_blue
train.loc[index, 'dom_green']= dominant_green
train.loc[index, 'dom_red']= dominant_red
train.loc[index, 'pixel_frac']= dominant_pixel_frac
train.loc[index, 'score']= dominant_score
else:
train.loc[index, 'vertex_x']= -1
train.loc[index, 'vertex_y']= -1
train.loc[index, 'bounding_conf']= -1
train.loc[index, 'bounding_imp']= -1
train.loc[index, 'dom_blue']= -1
train.loc[index, 'dom_green']= -1
train.loc[index, 'dom_red']= -1
train.loc[index, 'pixel_frac']= -1
train.loc[index, 'score']= -1
for index, row in train.iterrows(): ## Second photo
file = 'train_metadata/' + row['PetID'] + '-2.json'
if os.path.exists(file):
data = json.load(open(file, encoding="utf8"))
vertex_x = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['x']
vertex_y = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['y']
bounding_confidence = data['cropHintsAnnotation']['cropHints'][0]['confidence']
bounding_importance_frac = data['cropHintsAnnotation']['cropHints'][0].get('importanceFraction', -1)
try:
dominant_blue = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['blue']
except:
dominant_blue = -1
try:
dominant_green = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['green']
except:
dominant_green = -1
try:
dominant_red = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['red']
except:
dominant_red = -1
dominant_pixel_frac = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['pixelFraction']
dominant_score = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['score']
train.loc[index, 'vertex_x2']= vertex_x
train.loc[index, 'vertex_y2']= vertex_y
train.loc[index, 'bounding_conf2']= bounding_confidence
train.loc[index, 'bounding_imp2']= bounding_importance_frac
train.loc[index, 'dom_blue2']= dominant_blue
train.loc[index, 'dom_green2']= dominant_green
train.loc[index, 'dom_red2']= dominant_red
train.loc[index, 'pixel_frac2']= dominant_pixel_frac
train.loc[index, 'score2']= dominant_score
else:
train.loc[index, 'vertex_x2']= -1
train.loc[index, 'vertex_y2']= -1
train.loc[index, 'bounding_conf2']= -1
train.loc[index, 'bounding_imp2']= -1
train.loc[index, 'dom_blue2']= -1
train.loc[index, 'dom_green2']= -1
train.loc[index, 'dom_red2']= -1
train.loc[index, 'pixel_frac2']= -1
train.loc[index, 'score2']= -1
for index, row in train.iterrows(): ## Third photo
file = 'train_metadata/' + row['PetID'] + '-3.json'
if os.path.exists(file):
data = json.load(open(file, encoding="utf8"))
vertex_x = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['x']
vertex_y = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['y']
bounding_confidence = data['cropHintsAnnotation']['cropHints'][0]['confidence']
bounding_importance_frac = data['cropHintsAnnotation']['cropHints'][0].get('importanceFraction', -1)
try:
dominant_blue = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['blue']
except:
dominant_blue = -1
try:
dominant_green = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['green']
except:
dominant_green = -1
try:
dominant_red = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['red']
except:
dominant_red = -1
dominant_pixel_frac = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['pixelFraction']
dominant_score = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['score']
train.loc[index, 'vertex_x3']= vertex_x
train.loc[index, 'vertex_y3']= vertex_y
train.loc[index, 'bounding_conf3']= bounding_confidence
train.loc[index, 'bounding_imp3']= bounding_importance_frac
train.loc[index, 'dom_blue3']= dominant_blue
train.loc[index, 'dom_green3']= dominant_green
train.loc[index, 'dom_red3']= dominant_red
train.loc[index, 'pixel_frac3']= dominant_pixel_frac
train.loc[index, 'score3']= dominant_score
else:
train.loc[index, 'vertex_x3']= -1
train.loc[index, 'vertex_y3']= -1
train.loc[index, 'bounding_conf3']= -1
train.loc[index, 'bounding_imp3']= -1
train.loc[index, 'dom_blue3']= -1
train.loc[index, 'dom_green3']= -1
train.loc[index, 'dom_red3']= -1
train.loc[index, 'pixel_frac3']= -1
train.loc[index, 'score3']= -1
for index, row in test.iterrows(): # First photo
file = 'test_metadata/' + row['PetID'] + '-1.json'
if os.path.exists(file):
data = json.load(open(file, encoding="utf8"))
vertex_x = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['x']
vertex_y = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['y']
bounding_confidence = data['cropHintsAnnotation']['cropHints'][0]['confidence']
bounding_importance_frac = data['cropHintsAnnotation']['cropHints'][0].get('importanceFraction', -1)
dominant_blue = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['blue']
dominant_green = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['green']
dominant_red = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['red']
dominant_pixel_frac = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['pixelFraction']
dominant_score = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['score']
test.loc[index, 'vertex_x']= vertex_x
test.loc[index, 'vertex_y']= vertex_y
test.loc[index, 'bounding_conf']= bounding_confidence
test.loc[index, 'bounding_imp']= bounding_importance_frac
test.loc[index, 'dom_blue']= dominant_blue
test.loc[index, 'dom_green']= dominant_green
test.loc[index, 'dom_red']= dominant_red
test.loc[index, 'pixel_frac']= dominant_pixel_frac
test.loc[index, 'score']= dominant_score
else:
test.loc[index, 'vertex_x']= -1
test.loc[index, 'vertex_y']= -1
test.loc[index, 'bounding_conf']= -1
test.loc[index, 'bounding_imp']= -1
test.loc[index, 'dom_blue']= -1
test.loc[index, 'dom_green']= -1
test.loc[index, 'dom_red']= -1
test.loc[index, 'pixel_frac']= -1
test.loc[index, 'score']= -1
for index, row in test.iterrows(): # Second photo
file = 'test_metadata/' + row['PetID'] + '-2.json'
if os.path.exists(file):
data = json.load(open(file, encoding="utf8"))
vertex_x = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['x']
vertex_y = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['y']
bounding_confidence = data['cropHintsAnnotation']['cropHints'][0]['confidence']
bounding_importance_frac = data['cropHintsAnnotation']['cropHints'][0].get('importanceFraction', -1)
try:
dominant_blue = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['blue']
except:
dominant_blue = -1
try:
dominant_green = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['green']
except:
dominant_green = -1
try:
dominant_red = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['red']
except:
dominant_red = -1
dominant_pixel_frac = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['pixelFraction']
dominant_score = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['score']
test.loc[index, 'vertex_x2']= vertex_x
test.loc[index, 'vertex_y2']= vertex_y
test.loc[index, 'bounding_conf2']= bounding_confidence
test.loc[index, 'bounding_imp2']= bounding_importance_frac
test.loc[index, 'dom_blue2']= dominant_blue
test.loc[index, 'dom_green2']= dominant_green
test.loc[index, 'dom_red2']= dominant_red
test.loc[index, 'pixel_frac2']= dominant_pixel_frac
test.loc[index, 'score2']= dominant_score
else:
test.loc[index, 'vertex_x2']= -1
test.loc[index, 'vertex_y2']= -1
test.loc[index, 'bounding_conf2']= -1
test.loc[index, 'bounding_imp2']= -1
test.loc[index, 'dom_blue2']= -1
test.loc[index, 'dom_green2']= -1
test.loc[index, 'dom_red2']= -1
test.loc[index, 'pixel_frac2']= -1
test.loc[index, 'score2']= -1
for index, row in test.iterrows(): # Third photo
file = 'test_metadata/' + row['PetID'] + '-3.json'
if os.path.exists(file):
data = json.load(open(file, encoding="utf8"))
vertex_x = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['x']
vertex_y = data['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][2]['y']
bounding_confidence = data['cropHintsAnnotation']['cropHints'][0]['confidence']
bounding_importance_frac = data['cropHintsAnnotation']['cropHints'][0].get('importanceFraction', -1)
try:
dominant_blue = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['blue']
except:
dominant_blue = -1
try:
dominant_green = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['green']
except:
dominant_green = -1
try:
dominant_red = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['red']
except:
dominant_red = -1
dominant_pixel_frac = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['pixelFraction']
dominant_score = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]['score']
test.loc[index, 'vertex_x3']= vertex_x
test.loc[index, 'vertex_y3']= vertex_y
test.loc[index, 'bounding_conf3']= bounding_confidence
test.loc[index, 'bounding_imp3']= bounding_importance_frac
test.loc[index, 'dom_blue3']= dominant_blue
test.loc[index, 'dom_green3']= dominant_green
test.loc[index, 'dom_red3']= dominant_red
test.loc[index, 'pixel_frac3']= dominant_pixel_frac
test.loc[index, 'score3']= dominant_score
else:
test.loc[index, 'vertex_x3']= -1
test.loc[index, 'vertex_y3']= -1
test.loc[index, 'bounding_conf3']= -1
test.loc[index, 'bounding_imp3']= -1
test.loc[index, 'dom_blue3']= -1
test.loc[index, 'dom_green3']= -1
test.loc[index, 'dom_red3']= -1
test.loc[index, 'pixel_frac3']= -1
test.loc[index, 'score3']= -1
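The six loops above are nearly identical. As a sketch (not the code actually run for this project), they could be consolidated into one helper that handles any photo number and either dataset, keeping the -1 sentinel convention from above:

def add_photo_metadata(df, folder, photo_num):
    """Extract the Vision API fields for photo `photo_num` of each pet into `df`.

    Missing files or fields are encoded as -1, matching the loops above.
    """
    suffix = '' if photo_num == 1 else str(photo_num)
    cols = ['vertex_x', 'vertex_y', 'bounding_conf', 'bounding_imp',
            'dom_blue', 'dom_green', 'dom_red', 'pixel_frac', 'score']
    for index, row in df.iterrows():
        file = f"{folder}/{row['PetID']}-{photo_num}.json"
        values = [-1] * len(cols)
        if os.path.exists(file):
            data = json.load(open(file, encoding='utf8'))
            hint = data['cropHintsAnnotation']['cropHints'][0]
            values[0] = hint['boundingPoly']['vertices'][2].get('x', -1)
            values[1] = hint['boundingPoly']['vertices'][2].get('y', -1)
            values[2] = hint['confidence']
            values[3] = hint.get('importanceFraction', -1)
            try:
                color = data['imagePropertiesAnnotation']['dominantColors']['colors'][0]
                values[4] = color['color'].get('blue', -1)
                values[5] = color['color'].get('green', -1)
                values[6] = color['color'].get('red', -1)
                values[7] = color['pixelFraction']
                values[8] = color['score']
            except (KeyError, IndexError):
                pass  # leave the -1 sentinels in place
        for col, val in zip(cols, values):
            df.loc[index, col + suffix] = val

# Equivalent to the six loops above:
for n in (1, 2, 3):
    add_photo_metadata(train, 'train_metadata', n)
    add_photo_metadata(test, 'test_metadata', n)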
The sentiment data, like the image metadata, is provided as JSON files named with the 'PetID' of the corresponding pet. The values I chose to keep from the sentiment data are the document-level magnitude and score.
for index, row in train.iterrows():
file = 'train_sentiment/' + row['PetID'] + '.json'
if os.path.exists(file):
data = json.load(open(file, encoding="utf8"))
mag = data['documentSentiment']['magnitude']
score = data['documentSentiment']['score']
train.loc[index, 'magnitude']= mag
train.loc[index, 'sentiment_score']= score
else:
train.loc[index, 'magnitude']= -1
train.loc[index, 'sentiment_score']= -1
for index, row in test.iterrows():
file = 'test_sentiment/' + row['PetID'] + '.json'
if os.path.exists(file):
data = json.load(open(file, encoding="utf8"))
mag = data['documentSentiment']['magnitude']
score = data['documentSentiment']['score']
test.loc[index, 'magnitude']= mag
test.loc[index, 'sentiment_score']= score
else:
test.loc[index, 'magnitude']= -1
test.loc[index, 'sentiment_score']= -1
I will be adding additional columns of data but wanted to save a copy of the train and test sets to compare with later on.
train.to_csv('pre_train.csv')
test.to_csv('pre_test.csv')
To squeeze a bit more data out of the 'Description' column and the otherwise unused 'Name' column, I added the length of each as a new column.
train['NameLength'] = train['Name'].map(lambda x: len(str(x))).astype('int')
train['DescLength'] = train['Description'].map(lambda x: len(str(x))).astype('int')
test['NameLength'] = test['Name'].map(lambda x: len(str(x))).astype('int')
test['DescLength'] = test['Description'].map(lambda x: len(str(x))).astype('int')
pd.DataFrame([train['DescLength'][train['AdoptionSpeed'] == 0].mean(),
train['DescLength'][train['AdoptionSpeed'] == 1].mean(),
train['DescLength'][train['AdoptionSpeed'] == 2].mean(),
train['DescLength'][train['AdoptionSpeed'] == 3].mean(),
train['DescLength'][train['AdoptionSpeed'] == 4].mean()]).plot(kind='barh',figsize=(16,5))
plt.yticks(fontsize='xx-large')
plt.ylabel('Adoption Speed')
plt.xlabel('Description Length')
plt.title('Average Description Length', fontsize='xx-large')
The average description length trends upward as the adoption speed window increases, until it hits level 4, where the average description length drops back down.
Using data from the AKC website as well as Wikipedia, I assigned a breed group to each dog breed, as I suspect adoptability differs among the dog breed groups. I added these breed groups in Microsoft Excel and generated the CSV file 'dog_breeds.csv' from the provided CSV of breed labels. Now I just have to add a new column 'Group' to the train and test sets. Since groups only exist for dogs, cats will be assigned the group 'Cat'.
dog_data = pd.read_csv('dog_breeds.csv')
dog_data.head()
| | Unnamed: 0 | BreedID | Type | BreedName | Group |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | Affenpinscher | Toy |
| 1 | 1 | 2 | 1 | Afghan Hound | Hound |
| 2 | 2 | 3 | 1 | Airedale Terrier | Terrier |
| 3 | 3 | 4 | 1 | Akbash | Working |
| 4 | 4 | 5 | 1 | Akita | Working |
# First attempt: nested loops matching each pet's primary breed to a dog group.
# Cats never match a BreedID in the dog table, so their 'Group' is left as NaN.
for index, row in train.iterrows():
    for i, r in dog_data.iterrows():
        if row['Breed1'] == r['BreedID']:
            train.at[index,'Group'] = r['Group']
            break
for index, row in test.iterrows():
    for i, r in dog_data.iterrows():
        if row['Breed1'] == r['BreedID']:
            test.at[index,'Group'] = r['Group']
            break
train.Group.isna().sum()
6853
The 6,853 rows with a missing 'Group' are the cats, which the loops above never match. Rather than patching them separately, I reassign the group for every row, falling back to 'Cat' when the breed is not found in the dog table.
for index, row in train.iterrows():
    try:
        breed = row['Breed1']
        group = dog_data[dog_data['BreedID'] == breed]['Group'].values[0]
    except:
        group = 'Cat'  # breed not in the dog table, so this is a cat
    train.loc[index,'Group'] = group
for index, row in test.iterrows():
    try:
        breed = row['Breed1']
        group = dog_data[dog_data['BreedID'] == breed]['Group'].values[0]
    except:
        group = 'Cat'
    test.loc[index,'Group'] = group
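As an aside, the same assignment can be done without row-wise loops using a pandas lookup; a sketch under the same fallback convention:

# Vectorized alternative: map Breed1 through a BreedID -> Group lookup,
# then fill non-matches (the cats) with 'Cat'.
breed_to_group = dog_data.set_index('BreedID')['Group']
train['Group'] = train['Breed1'].map(breed_to_group).fillna('Cat')
test['Group'] = test['Breed1'].map(breed_to_group).fillna('Cat')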
train['Group'][train['Group'] != 'Cat'].value_counts().sort_index().plot(kind='barh',
figsize=(20,15))
plt.yticks(fontsize='xx-large')
plt.title('Distribution of Dog Groups', fontsize='xx-large')
It seems that 'Misc' is by far the most common group assigned to the dogs.
train['Group'][(train['Group'] != 'Cat') & (train['Group'] != 'Misc')].value_counts().sort_index().plot(kind='barh',
figsize=(20,15))
plt.yticks(fontsize='xx-large')
plt.title('Distribution of Dog Groups', fontsize='xx-large')
Removing the 'Misc' group lets us see the distribution of the other groups much better. From this, 'Sporting' and 'Toy' are the most common, with 'Hunting' being the least common.
Using data from http://www.catbreedslist.com, I added two new variables for the cats in the dataset. The first is 'Hypo', which indicates whether the cat breed is hypoallergenic. The second is 'Cute', which is 1 if the cat breed is one of the site's top 10 cutest cat breeds.
cat_data = pd.read_csv('cat_info.csv')
cat_data.head()
| | BreedID | Type | BreedName | Cute | Hypo |
|---|---|---|---|---|---|
| 0 | 241 | 2 | Abyssinian | 0 | 0 |
| 1 | 242 | 2 | American Curl | 1 | 0 |
| 2 | 243 | 2 | American Shorthair | 1 | 0 |
| 3 | 244 | 2 | American Wirehair | 0 | 0 |
| 4 | 245 | 2 | Applehead Siamese | 0 | 0 |
for index, row in train.iterrows():
try:
breed = row['Breed1']
cute = cat_data[cat_data['BreedID'] == breed]['Cute'].values[0]
hypo = cat_data[cat_data['BreedID'] == breed]['Hypo'].values[0]
except:
cute = -1
hypo = -1
train.loc[index,'Cat_Cute'] = cute
train.loc[index,'Cat_Hypo'] = hypo
for index, row in test.iterrows():
try:
breed = row['Breed1']
cute = cat_data[cat_data['BreedID'] == breed]['Cute'].values[0]
hypo = cat_data[cat_data['BreedID'] == breed]['Hypo'].values[0]
except:
cute = -1
hypo = -1
test.loc[index,'Cat_Cute'] = cute
test.loc[index,'Cat_Hypo'] = hypo
pd.DataFrame([train['AdoptionSpeed'][train['Cat_Hypo'] == 0].mean(),train['AdoptionSpeed'][train['Cat_Hypo'] == 1].mean()]).rename({1:'Hypoallergenic', 0:'Non-hypoallergenic'}).plot(kind='barh',figsize=(16,5))
plt.yticks(fontsize='xx-large')
plt.title('Hypoallergenic Adoption Speeds', fontsize='xx-large')
It seems that hypoallergenic cat breeds are adopted more quickly on average than non-hypoallergenic cat breeds.
Using census data found on Wikipedia for the states in Malaysia, I added the population, percentage of urban environment, and population density for each state.
state_data = pd.read_csv('state_data.csv')
state_data.head()
| | State | Population | StateID | UrbanPercent | PopDensity |
|---|---|---|---|---|---|
| 0 | Kuala Lumpur | 1627172 | 41401 | 100.0 | 6891 |
| 1 | Labuan | 86908 | 41415 | 82.3 | 950 |
| 2 | Johor | 3348283 | 41336 | 71.9 | 174 |
| 3 | Kedah | 1890098 | 41325 | 64.6 | 199 |
| 4 | Kelantan | 1459994 | 41367 | 42.4 | 97 |
for index, row in train.iterrows():
state = row['State']
urban = state_data[state_data['StateID'] == state]['UrbanPercent'].values[0]
pop = state_data[state_data['StateID'] == state]['Population'].values[0]
pop_den = state_data[state_data['StateID'] == state]['PopDensity'].values[0]
train.loc[index,'UrbanPercent'] = urban
train.loc[index,'Population'] = pop
train.loc[index,'PopDensity'] = pop_den
for index, row in test.iterrows():
state = row['State']
urban = state_data[state_data['StateID'] == state]['UrbanPercent'].values[0]
pop = state_data[state_data['StateID'] == state]['Population'].values[0]
pop_den = state_data[state_data['StateID'] == state]['PopDensity'].values[0]
test.loc[index,'UrbanPercent'] = urban
test.loc[index,'Population'] = pop
test.loc[index,'PopDensity'] = pop_den
Saving the data at this step to avoid repeating the processing above in the future; the processed files are then reloaded.
#train.to_csv('processed_train.csv')
#test.to_csv('processed_test.csv')
train = pd.read_csv('processed_train.csv')
test = pd.read_csv('processed_test.csv')
train = pd.get_dummies(train, columns = ['Breed1', 'Breed2', 'Gender', 'Color1', 'Color2', 'Color3',
'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed', 'Sterilized', 'Health',
'State', 'Type', 'Group'
])
test = pd.get_dummies(test, columns = ['Breed1', 'Breed2', 'Gender', 'Color1', 'Color2', 'Color3',
'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed', 'Sterilized', 'Health',
'State', 'Type', 'Group'
])
Encoding variables like 'Breed1' creates many new columns, some of which may exist only in the training set or only in the test set. To remedy this, we make sure each dataset has the same columns, filling any missing column with 0.
diff_columns = set(train.columns).difference(set(test.columns))
for i in diff_columns:
    test[i] = 0  # column exists only in train; fill with zeros in test
diff_columns2 = set(test.columns).difference(set(train.columns))
for i in diff_columns2:
    train[i] = 0  # column exists only in test; fill with zeros in train
test = test[train.columns]  # put the test columns in the same order as train
train.shape
(14993, 453)
test.shape
(3948, 453)
Training set and test set now have the same number of columns.
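pandas can also perform this alignment in a single call; a sketch of the equivalent operation (not what was run above):

# DataFrame.align with an outer join on the column axis adds any column
# missing from either frame and fills it with 0.
train, test = train.align(test, join='outer', axis=1, fill_value=0)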
To deal with variables that may be highly correlated with each other, we can grab all pairs whose correlation exceeds a threshold of 0.85.
corr = train.corr()
indices = np.where(corr > 0.85)
indices = [(corr.index[x], corr.columns[y]) for x, y in zip(*indices)
if x != y and x < y]
indices
[('bounding_conf', 'bounding_imp'),
('bounding_conf', 'pixel_frac'),
('dom_blue', 'dom_green'),
('dom_green', 'dom_red'),
('vertex_y2', 'bounding_conf2'),
('vertex_y2', 'bounding_imp2'),
('vertex_y2', 'pixel_frac2'),
('vertex_y2', 'score2'),
('bounding_conf2', 'bounding_imp2'),
('bounding_conf2', 'pixel_frac2'),
('bounding_conf2', 'score2'),
('bounding_imp2', 'pixel_frac2'),
('bounding_imp2', 'score2'),
('dom_blue2', 'dom_green2'),
('dom_blue2', 'dom_red2'),
('dom_green2', 'dom_red2'),
('pixel_frac2', 'score2'),
('vertex_x3', 'vertex_y3'),
('vertex_x3', 'bounding_conf3'),
('vertex_x3', 'bounding_imp3'),
('vertex_x3', 'pixel_frac3'),
('vertex_x3', 'score3'),
('vertex_y3', 'bounding_conf3'),
('vertex_y3', 'bounding_imp3'),
('vertex_y3', 'pixel_frac3'),
('vertex_y3', 'score3'),
('bounding_conf3', 'bounding_imp3'),
('bounding_conf3', 'pixel_frac3'),
('bounding_conf3', 'score3'),
('bounding_imp3', 'pixel_frac3'),
('bounding_imp3', 'score3'),
('dom_blue3', 'dom_green3'),
('dom_blue3', 'dom_red3'),
('dom_green3', 'dom_red3'),
('pixel_frac3', 'score3'),
('Cat_Cute', 'Cat_Hypo'),
('Cat_Cute', 'Type_2'),
('Cat_Cute', 'Group_Cat'),
('Cat_Hypo', 'Type_2'),
('Cat_Hypo', 'Group_Cat'),
('Population', 'State_41326'),
('PopDensity', 'State_41401'),
('Breed1_143', 'Breed2_146'),
('Breed1_155', 'Breed2_155'),
('Breed1_307', 'Group_Misc'),
('Type_2', 'Group_Cat')]
Before immediately dropping one variable from each pair above, I looked at the list closely and decided that in some pairs, one variable is the better candidate to drop. These are 'Group_Cat' and 'Type_2', because the inclusion of 'Cat_Cute' and 'Cat_Hypo' makes them redundant, and 'State_41326' and 'State_41401', because they correlate highly with 'Population' and 'PopDensity' respectively, and the latter two are more important to keep in the dataset.
drop_list = ['Group_Cat', 'Type_2', 'State_41326', 'State_41401']
for i in drop_list:  # drop the manually chosen columns first
    train.drop(i, axis=1, inplace=True)
    test.drop(i, axis=1, inplace=True)
for i in indices:
    if (i[0] in drop_list) or (i[1] in drop_list):
        pass  # this pair is already handled by a manual drop
    else:
        try:
            train.drop(i[0], axis=1, inplace=True)
            test.drop(i[0], axis=1, inplace=True)
            drop_list.append(i[0])
        except:
            ## already dropped
            pass
Below are all the columns that were dropped to deal with multicollinearity.
drop_list
['Group_Cat',
'Type_2',
'State_41326',
'State_41401',
'bounding_conf',
'dom_blue',
'dom_green',
'vertex_y2',
'bounding_conf2',
'bounding_imp2',
'dom_blue2',
'dom_green2',
'pixel_frac2',
'vertex_x3',
'vertex_y3',
'bounding_conf3',
'bounding_imp3',
'dom_blue3',
'dom_green3',
'pixel_frac3',
'Cat_Cute',
'Breed1_143',
'Breed1_155',
'Breed1_307']
target = train['AdoptionSpeed'].astype('int')
Dropping unnecessary columns as well as the target column 'AdoptionSpeed'.
X = train.drop(['Name', 'RescuerID', 'Description', 'PetID', 'AdoptionSpeed', 'Unnamed: 0'], axis=1)
X_pred = test.drop(['Name', 'RescuerID', 'Description', 'Unnamed: 0'], axis=1)
According to the rules of the Kaggle competition, submissions are scored using the quadratic weighted kappa. I will use cohen_kappa_score from sklearn.metrics with weights set to 'quadratic' to evaluate my results.
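As a quick illustration of the metric on toy labels (not project data), quadratic weighting penalizes a prediction more the further it lands from the true ordinal class:

y_true = [0, 1, 2, 3, 4, 4]
close  = [0, 1, 2, 3, 3, 4]  # one prediction off by a single class
far    = [4, 1, 2, 3, 0, 4]  # two predictions off by four classes
print(cohen_kappa_score(y_true, close, weights='quadratic'))  # close to 1
print(cohen_kappa_score(y_true, far, weights='quadratic'))    # much lower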
Although the provided data is labeled 'train', we need to set aside a validation set to test classifier performance.
X_train, X_val, target_train, target_val = train_test_split(X,
target,
test_size=0.25,
random_state=47)
Now it is time to start testing some classifiers. I will grab a baseline score for a RandomForest, XGBoost, and AdaBoost classifier.
clf_rf = RandomForestClassifier()
clf_rf.fit(X_train, target_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
cohen_kappa_score(target_val, clf_rf.predict(X_val), weights='quadratic')
0.29257257000903714
feature_importances = pd.DataFrame(clf_rf.feature_importances_,
index = X_train.columns,
columns=['importance']).sort_values('importance', ascending=False)
feature_importances.head(25)
| | importance |
|---|---|
| score | 0.045825 |
| DescLength | 0.044196 |
| pixel_frac | 0.043875 |
| Age | 0.041879 |
| dom_red | 0.041559 |
| magnitude | 0.039455 |
| score2 | 0.036550 |
| NameLength | 0.034588 |
| dom_red2 | 0.033189 |
| sentiment_score | 0.033013 |
| vertex_x | 0.032567 |
| vertex_y | 0.030376 |
| dom_red3 | 0.028200 |
| score3 | 0.027632 |
| PhotoAmt | 0.025038 |
| vertex_x2 | 0.023397 |
| Population | 0.015160 |
| UrbanPercent | 0.014829 |
| Quantity | 0.013536 |
| PopDensity | 0.012904 |
| Fee | 0.012840 |
| Group_Misc | 0.010596 |
| Gender_2 | 0.010524 |
| Color1_1 | 0.010236 |
| Sterilized_2 | 0.010147 |
The top three most important features for the baseline RandomForest classifier are 'score', 'DescLength', and 'pixel_frac'.
clf_xgb = xgb.XGBClassifier()
clf_xgb.fit(X_train, target_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=1)
cohen_kappa_score(target_val, clf_xgb.predict(X_val), weights='quadratic')
0.36919141456485527
Using an XGBClassifier improved the quadratic kappa score significantly.
from xgboost import plot_importance
fig, ax = plt.subplots(figsize=(12,18))
plot_importance(clf_xgb, max_num_features=25, height=0.8, ax=ax)
plt.show()
From the feature importance chart it seems that 'Age', 'DescLength', and 'score' are the top 3 most important features.
clf_ada = AdaBoostClassifier()
clf_ada.fit(X_train, target_train)
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
learning_rate=1.0, n_estimators=50, random_state=None)
cohen_kappa_score(target_val, clf_ada.predict(X_val), weights='quadratic')
0.33028851731478026
feature_importances = pd.DataFrame(clf_ada.feature_importances_,
index = X_train.columns,
columns=['importance']).sort_values('importance', ascending=False)
feature_importances.head(25)
| | importance |
|---|---|
| Age | 0.06 |
| vertex_y | 0.06 |
| DescLength | 0.06 |
| Group_Misc | 0.04 |
| pixel_frac | 0.04 |
| magnitude | 0.04 |
| UrbanPercent | 0.04 |
| Color1_7 | 0.02 |
| Type_1 | 0.02 |
| State_41336 | 0.02 |
| Sterilized_3 | 0.02 |
| Sterilized_2 | 0.02 |
| Dewormed_2 | 0.02 |
| FurLength_3 | 0.02 |
| FurLength_1 | 0.02 |
| Color3_5 | 0.02 |
| Color1_1 | 0.02 |
| Breed1_179 | 0.02 |
| Breed1_11 | 0.02 |
| Gender_1 | 0.02 |
| Breed1_213 | 0.02 |
| Breed2_291 | 0.02 |
| Breed2_247 | 0.02 |
| Breed2_207 | 0.02 |
| Breed1_283 | 0.02 |
Now I will try to improve the baseline scores of the three classifiers by using GridSearchCV to find the optimal parameters for each classifier.
rf_params = {
'bootstrap': [True, False],
'max_depth': [25, 50, 75, 100],
'max_features': ['auto'],
'min_samples_leaf': [2, 3, 5, 10],
'min_samples_split': [5, 10, 15],
'n_jobs':[-1],
'n_estimators': [50, 100, 200, 300],
'random_state' : [47]
}
rf_gridsearch = GridSearchCV(estimator = clf_rf,
param_grid = rf_params,
cv = 3,
n_jobs = -1,
verbose = 1,
scoring=make_scorer(cohen_kappa_score,weights='quadratic'))
rf_gridsearch.fit(X_train, target_train)
Fitting 3 folds for each of 384 candidates, totalling 1152 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done 26 tasks | elapsed: 10.9s
[Parallel(n_jobs=-1)]: Done 176 tasks | elapsed: 1.1min
[Parallel(n_jobs=-1)]: Done 426 tasks | elapsed: 2.4min
[Parallel(n_jobs=-1)]: Done 776 tasks | elapsed: 5.0min
[Parallel(n_jobs=-1)]: Done 1152 out of 1152 | elapsed: 8.0min finished
GridSearchCV(cv=3, error_score='raise-deprecating',
estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
oob_score=False, random_state=None, verbose=0,
warm_start=False),
fit_params=None, iid='warn', n_jobs=-1,
param_grid={'n_estimators': [50, 100, 200, 300], 'bootstrap': [True, False], 'n_jobs': [-1], 'min_samples_leaf': [2, 3, 5, 10], 'max_features': ['auto'], 'max_depth': [25, 50, 75, 100], 'min_samples_split': [5, 10, 15], 'random_state': [47]},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=make_scorer(cohen_kappa_score, weights=quadratic),
verbose=1)
rf_gridsearch.best_params_
{'bootstrap': False,
'max_depth': 25,
'max_features': 'auto',
'min_samples_leaf': 2,
'min_samples_split': 5,
'n_estimators': 300,
'n_jobs': -1,
'random_state': 47}
rf_gridsearch.best_score_
0.343240804127584
clf_rf_best = RandomForestClassifier(bootstrap=False, max_depth=25, max_features='auto', min_samples_leaf=2,
min_samples_split=5,
n_estimators=300, n_jobs=-1, random_state=47)
clf_rf_best.fit(X_train, target_train)
RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=25, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=2, min_samples_split=5,
min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=-1,
oob_score=False, random_state=47, verbose=0, warm_start=False)
cohen_kappa_score(target_val, clf_rf_best.predict(X_val),weights='quadratic')
0.3582847219900983
xgb_params = {'objective' : ['multi:softmax'],
'eta' : [0.01],
'max_depth' : [3, 4, 6],
'min_child_weight' : [2, 3, 4],
}
xgb_gridsearch = GridSearchCV(estimator = clf_xgb,
param_grid = xgb_params,
cv = 3,
n_jobs = -1,
verbose = 1,
scoring=make_scorer(cohen_kappa_score,weights='quadratic'))
xgb_gridsearch.fit(X_train, target_train)
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done 27 out of 27 | elapsed: 4.3min finished
GridSearchCV(cv=3, error_score='raise-deprecating',
estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=1),
fit_params=None, iid='warn', n_jobs=-1,
param_grid={'min_child_weight': [2, 3, 4], 'objective': ['multi:softmax'], 'max_depth': [3, 4, 6], 'eta': [0.01]},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=make_scorer(cohen_kappa_score, weights=quadratic),
verbose=1)
xgb_gridsearch.best_params_
{'eta': 0.01,
'max_depth': 4,
'min_child_weight': 4,
'objective': 'multi:softmax'}
xgb_gridsearch.best_score_
0.34320068021975203
clf_xgb_best = xgb.XGBClassifier(eta = 0.01, max_depth = 4, min_child_weight = 4, objective = 'multi:softmax')
clf_xgb_best.fit(X_train, target_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, eta=0.01, gamma=0, learning_rate=0.1,
max_delta_step=0, max_depth=4, min_child_weight=4, missing=None,
n_estimators=100, n_jobs=1, nthread=None,
objective='multi:softprob', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
subsample=1)
cohen_kappa_score(target_val, clf_xgb_best.predict(X_val),weights='quadratic')
0.38268750072423174
ada_params = {'base_estimator': [None, DecisionTreeClassifier(max_depth=3), DecisionTreeClassifier(max_depth=5)],
'n_estimators': [50, 100, 200, 300]}
ada_gridsearch = GridSearchCV(estimator = clf_ada,
param_grid = ada_params,
cv = 3,
n_jobs = -1,
verbose = 1,
scoring=make_scorer(cohen_kappa_score,weights='quadratic'))
ada_gridsearch.fit(X_train, target_train)
Fitting 3 folds for each of 12 candidates, totalling 36 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done 36 out of 36 | elapsed: 57.9s finished
GridSearchCV(cv=3, error_score='raise-deprecating',
estimator=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
learning_rate=1.0, n_estimators=50, random_state=None),
fit_params=None, iid='warn', n_jobs=-1,
param_grid={'base_estimator': [None, DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weigh...resort=False, random_state=None,
splitter='best')], 'n_estimators': [50, 100, 200, 300]},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=make_scorer(cohen_kappa_score, weights=quadratic),
verbose=1)
ada_gridsearch.best_params_
{'base_estimator': None, 'n_estimators': 100}
ada_gridsearch.best_score_
0.30248923825346563
clf_ada_best = AdaBoostClassifier(base_estimator=None, n_estimators=100)
clf_ada_best.fit(X_train, target_train)
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
learning_rate=1.0, n_estimators=100, random_state=None)
cohen_kappa_score(target_val, clf_ada_best.predict(X_val),weights='quadratic')
0.3467641968995947
Since all three classifiers have decent, comparable performance, I will combine all three into one final ensemble using a VotingClassifier with soft voting, which averages the predicted class probabilities of the base classifiers.
clf_vot = VotingClassifier(estimators=[('RF',clf_rf_best),('XGB',clf_xgb_best),('ADA',clf_ada_best)],voting='soft')
clf_vot.fit(X_train, target_train)
VotingClassifier(estimators=[('RF', RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=25, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=2, min_samples_split=5,
min_wei...='SAMME.R', base_estimator=None,
learning_rate=1.0, n_estimators=100, random_state=None))],
flatten_transform=None, n_jobs=None, voting='soft', weights=None)
cohen_kappa_score(target_val, clf_vot.predict(X_val),weights='quadratic')
0.3863173663782099
To visualize the differences in predictions of the three base classifiers and the ensemble classifier, we can look at bar charts of each 'AdoptionSpeed' prediction for the classifiers below.
# Predicted classes (not probabilities) on the validation set for each classifier
preds = [c.fit(X_train, target_train).predict(X_val) for c in (clf_rf_best, clf_xgb_best, clf_ada_best, clf_vot)]
class_0 = list()
class_1 = list()
class_2 = list()
class_3 = list()
class_4 = list()
for p in preds:
    counts = np.unique(p, return_counts=True)[1]  # number of predictions per class
    class_0.append(counts[0])
    class_1.append(counts[1])
    class_2.append(counts[2])
    class_3.append(counts[3])
    class_4.append(counts[4])
N = 4 # number of groups
ind = np.arange(N) # group positions
width = 0.5 # bar width
ax1 = plt.subplot2grid((2,6), (0,0), colspan=2)
ax2 = plt.subplot2grid((2,6), (0,2), colspan=2)
ax3 = plt.subplot2grid((2,6), (0,4), colspan=2)
ax4 = plt.subplot2grid((2,6), (1,1), colspan=2)
ax5 = plt.subplot2grid((2,6), (1,3), colspan=2)
# bars for base classifiers
p0 = ax1.bar(ind + width, np.hstack(([class_0[:-1], [0]])), width, color='green', alpha=0.5, edgecolor='k')
p1 = ax2.bar(ind + width, np.hstack(([class_1[:-1], [0]])), width, color='blue',alpha=0.5, edgecolor='k')
p2 = ax3.bar(ind + width, np.hstack(([class_2[:-1], [0]])), width, color='red',alpha=0.5, edgecolor='k')
p3 = ax4.bar(ind + width, np.hstack(([class_3[:-1], [0]])), width, color='orange',alpha=0.5, edgecolor='k')
p4 = ax5.bar(ind + width, np.hstack(([class_4[:-1], [0]])), width, color='purple',alpha=0.5, edgecolor='k')
# bars for voting classifier
p5 = ax1.bar(ind + width, [0, 0, 0, class_0[-1]], width,color='green', edgecolor='k')
p6 = ax2.bar(ind + width, [0, 0, 0, class_1[-1]], width,color='blue', edgecolor='k')
p7 = ax3.bar(ind + width, [0, 0, 0, class_2[-1]], width,color='red', edgecolor='k')
p8 = ax4.bar(ind + width, [0, 0, 0, class_3[-1]], width,color='orange', edgecolor='k')
p9 = ax5.bar(ind + width, [0, 0, 0, class_4[-1]], width,color='purple', edgecolor='k')
# plot annotations
ax1.set_xticks(ind + width)
ax1.set_ylabel('Number of predictions')
ax1.set_xticklabels(['RandomForest',
'XGBoost',
'AdaBoost',
'VotingClassifier'],
rotation=40,
ha='right')
ax2.set_xticks(ind + width)
ax2.set_xticklabels(['RandomForest',
'XGBoost',
'AdaBoost',
'VotingClassifier'],
rotation=40,
ha='right')
ax3.set_xticks(ind + width)
ax3.set_xticklabels(['RandomForest',
'XGBoost',
'AdaBoost',
'VotingClassifier'],
rotation=40,
ha='right')
ax4.set_xticks(ind + width)
ax4.set_ylabel('Number of predictions')
ax4.set_xticklabels(['RandomForest',
'XGBoost',
'AdaBoost',
'VotingClassifier'],
rotation=40,
ha='right')
ax5.set_xticks(ind + width)
ax5.set_xticklabels(['RandomForest',
'XGBoost',
'AdaBoost',
'VotingClassifier'],
rotation=40,
ha='right')
ax1.set_title('Adoption Speed 0')
ax2.set_title('Adoption Speed 1')
ax3.set_title('Adoption Speed 2')
ax4.set_title('Adoption Speed 3')
ax5.set_title('Adoption Speed 4')
plt.rcParams["figure.figsize"] = [20,20]
plt.show()
From the above chart we can see how the voting classifier averages out the predictions of the three base classifiers to better predict 'AdoptionSpeed'. One thing to note is that for 'Adoption Speed 0', the RandomForest and XGBoost classifiers predicted far fewer cases than the AdaBoost classifier. This ultimately did not pull up the average produced by the VotingClassifier, but it is still a significant outlier when comparing the charts side by side.
Now that we have our final classifier, we can fit it to the entire training set.
clf_vot_final = VotingClassifier(estimators=[('RF',clf_rf_best),('XGB',clf_xgb_best),('ADA',clf_ada_best)],voting='soft')
clf_vot_final.fit(X, target)
VotingClassifier(estimators=[('RF', RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=25, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=2, min_samples_split=5,
min_wei...='SAMME.R', base_estimator=None,
learning_rate=1.0, n_estimators=100, random_state=None))],
flatten_transform=None, n_jobs=None, voting='soft', weights=None)
With the final classifier trained on all the data, we can now make predictions based on the given test data from the Kaggle competition.
test_pred = clf_vot_final.predict(X_pred.drop(['AdoptionSpeed','PetID'], axis=1))
Now we can view the distribution of predictions on the test data.
plt.rcParams["figure.figsize"] = [7,7]
pd.DataFrame(test_pred).hist()
plt.title('Adoption Speed Predictions on Test Data')
plt.ylabel('Number of prediction')
plt.xticks(np.arange(5))
pd.DataFrame(test_pred)[0].value_counts()
4 1788
2 1257
1 684
3 219
Name: 0, dtype: int64
Somewhat surprisingly, there are no predictions of an 'AdoptionSpeed' of 0 for any of the test data. The training data contained significantly fewer cases of the lowest 'AdoptionSpeed', which may be why the predictions contain none at all. It still seems an unusual imbalance and could be investigated further.
Saving the predictions to a separate CSV file allows me to upload them to the Kaggle competition for scoring.
pred = pd.DataFrame()  # build the submission frame expected by Kaggle
pred['PetID'] = X_pred['PetID']
pred['AdoptionSpeed'] = test_pred
pred.set_index('PetID').to_csv("submission.csv", index=True)
The prediction submitted received a score of 0.333 on the Kaggle competition. This placed us in about the 50th percentile of all of the competitors.
The current high score on the Kaggle competition is 0.452, so there is certainly room for improvement. Still, the final classifier scored decently for my first Kaggle competition. It was a definite improvement over both the baseline classifiers and the individually tuned classifiers, so ensembling them with a VotingClassifier was a good choice.
With more time I would focus on the following:
- Class imbalance
- The lowest 'AdoptionSpeed' had a very low occurrence in the training data and no occurrence in the test predictions. This seems unusual and could be investigated further. I would want to look at the confusion matrix of predictions on the training data to see how well or how poorly the classifier predicts an 'AdoptionSpeed' of 0.
- I would consider using SMOTE to better balance the classes of 'AdoptionSpeed' (see the sketch after this list).
- Removing data
- I added a few of my own columns to the dataset. In hindsight, some of this additional data could have just added noise to the dataset. I would test removing some of the added columns of data.
- I would also test not using as much of the image data as I did - maybe only the first photo for each pet.
- Further investigation of image and sentiment data
- For both the image and sentiment data, I simply used the variables provided without much research into the Google APIs behind them. I would like to learn more about how these APIs work and what the values they generate signify.
- I would also like to possibly try running either the images or descriptions through my own computer vision or NLP algorithm.
- Utilize other classifiers
- Most of the highest scoring kernels in the Kaggle competition use LightGBM for their predictions. I would have liked to learn how to utilize that as a classifier to improve my score.
- I could also include more base classifiers in my VotingClassifier as well as trying out other ensemble methods.
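A minimal sketch of the SMOTE idea mentioned above, using the imblearn import already at the top of the notebook (the random state is illustrative, and on older imblearn versions fit_resample is named fit_sample):

# Oversample only the training split so the validation data stays untouched.
sm = SMOTE(random_state=47)
X_train_res, target_train_res = sm.fit_resample(X_train, target_train)

clf_xgb_best.fit(X_train_res, target_train_res)
print(cohen_kappa_score(target_val, clf_xgb_best.predict(X_val), weights='quadratic'))
print(confusion_matrix(target_val, clf_xgb_best.predict(X_val)))  # inspect the class-0 row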