- Python 3.7
- Pandas 0.24.2
- scikit-learn 0.21.1
- SciPy 1.3.0
- NumPy 1.16.3
Given training data of about 3900 examples (training and validation), the task is to clean the data and build a model that classifies the validation data correctly.
First: Data Cleaning - data.py
I dealt with NaNs using two approaches (a sketch follows this list):
- For columns containing discrete values, imputing NaNs was not feasible, so I removed the rows containing them.
- For columns containing non-discrete values:
  - If the column had few NaNs (< 100), I removed it completely.
  - If the column had many NaNs (> 100), I replaced them with the mean value of that column.
- One column had so many NaNs (~2000) that it was not useful at all, so I dropped it completely.
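A minimal sketch of this NaN handling with pandas, following the write-up literally; the file and column names (`train.csv`, `mostly_empty_col`, `cat_a`, `cat_b`) are placeholders, not the real ones.

```python
import pandas as pd

df = pd.read_csv("train.csv")  # placeholder file name

# The almost-empty column (~2000 NaNs) is dropped outright.
df = df.drop(columns=["mostly_empty_col"])  # placeholder column name

# Discrete columns: imputing is not meaningful, so drop the rows with NaNs.
discrete_cols = ["cat_a", "cat_b"]  # placeholder column names
df = df.dropna(subset=discrete_cols)

# Non-discrete columns: the action depends on how many NaNs each one has.
for col in df.columns.difference(discrete_cols):
    n_missing = df[col].isna().sum()
    if 0 < n_missing < 100:
        df = df.drop(columns=[col])               # few NaNs: drop the column
    elif n_missing >= 100:
        df[col] = df[col].fillna(df[col].mean())  # many NaNs: mean-impute
```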
Next, I "hot encoded" columns with descrete values, extracting the dummies to new columns and deleting the old column. At that point, columns order got messed up, but since that wouldn't impact the model, I discarded reordering it for the sake of simplicity.
Finally, I normalized the dataset because its features were on different scales.
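The write-up does not say which normalization was used; a sketch assuming scikit-learn's `StandardScaler` and a placeholder target column named `label`:

```python
from sklearn.preprocessing import StandardScaler

# Scale every feature to zero mean and unit variance.
X = StandardScaler().fit_transform(df.drop(columns=["label"]))  # "label" is a placeholder
y = df["label"].values
```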
Second: Creating The Model - main.py
- This part was easier and took much less time compared to the first part.
- I started off with a simple Logistic Regression classifier, but it did not get me anywhere, so I opted for a neural net using `MLPClassifier`.
- Arbitrarily, I chose the default model with an architecture of four layers, three of them with 5 units and the last one with 2. Later on I tuned this but got nowhere better, so I went back to my initial settings.
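I read those four layers as the hidden layers, i.e. `hidden_layer_sizes=(5, 5, 5, 2)`; if the 2-unit layer was instead the output layer, it would be `(5, 5, 5)`. A sketch reusing `X` and `y` from the normalization sketch, with an illustrative train/validation split (the split ratio and `random_state` are assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Four hidden layers: three with 5 units each, the last with 2; defaults otherwise.
clf = MLPClassifier(hidden_layer_sizes=(5, 5, 5, 2), random_state=0)
clf.fit(X_train, y_train)
print("Validation accuracy:", clf.score(X_val, y_val))
```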
- My model was overfitting the dataset due to the large number of features relative to the amount of given data, so I tried feature selection. With over 50 features (after extracting dummies) and no clue what each feature represents, I randomly chose subsets of features and, by trial and error, selected the set with the highest accuracy (~86%); a sketch of this search follows below.
- All randomly chosen feature sets were saved in a text file named `accuracies_for_different_set_of_features.txt`.
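A sketch of that trial-and-error search; the number of trials, the subset size, and the logging format written to `accuracies_for_different_set_of_features.txt` are all assumptions, and `label` is again a placeholder target name.

```python
import random

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

feature_cols = list(df.drop(columns=["label"]).columns)
best_score, best_subset = 0.0, None

with open("accuracies_for_different_set_of_features.txt", "a") as log:
    for trial in range(50):                       # number of trials is arbitrary
        subset = random.sample(feature_cols, 20)  # subset size is arbitrary
        X_tr, X_va, y_tr, y_va = train_test_split(
            df[subset], df["label"], test_size=0.2, random_state=trial)
        clf = MLPClassifier(hidden_layer_sizes=(5, 5, 5, 2), random_state=0)
        clf.fit(X_tr, y_tr)
        score = clf.score(X_va, y_va)
        log.write("{:.4f}\t{}\n".format(score, subset))  # record every attempt
        if score > best_score:
            best_score, best_subset = score, subset
```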
- I then started tuning the regularization parameter and got the best results (~90% accuracy) at `alpha = 0.1`.
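The alpha sweep might have looked roughly like this; the grid of values is illustrative, and `X_train`/`X_val` reuse the split from the earlier sketch.

```python
from sklearn.neural_network import MLPClassifier

# Try a range of L2 regularization strengths and compare validation accuracy.
for alpha in [0.0001, 0.001, 0.01, 0.1, 1.0]:
    clf = MLPClassifier(hidden_layer_sizes=(5, 5, 5, 2), alpha=alpha, random_state=0)
    clf.fit(X_train, y_train)
    print("alpha={:<6} accuracy={:.3f}".format(alpha, clf.score(X_val, y_val)))
```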
- Finally, I tried out other solvers and activation functions but got nowhere better than I already was.
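One way to run that comparison is a small grid search; the parameter grid below is illustrative, not the exact set of values that was tried.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    "solver": ["adam", "lbfgs", "sgd"],
    "activation": ["relu", "tanh", "logistic"],
}
search = GridSearchCV(
    MLPClassifier(hidden_layer_sizes=(5, 5, 5, 2), alpha=0.1, random_state=0),
    param_grid,
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```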