This repository contains the code for a group project completed for the course CSE342 - Statistical Machine Learning. The goal of this project is to predict the future sales of items for a store using various data columns provided in the dataset.
The dataset contains information on items sold in a store. It has the following columns:
- Item_Identifier: Unique identifier for each item
- Item_Weight: Weight of the item (some values are missing)
- Item_Fat_Content: Fat content of the item
- Item_Visibility: Visibility of the item in the store
- Item_Type: Type of the item
- Item_MRP: Maximum Retail Price of the item
- Outlet_Identifier: Unique identifier for each outlet
- Outlet_Establishment_Year: Year of establishment of the outlet
- Outlet_Size: Size of the outlet (some values are missing)
- Outlet_Location_Type: Location type of the outlet
- Outlet_Type: Type of the outlet
- Item_Outlet_Sales: Sales of the item in the outlet
- Handling Missing Values: Missing values in the columns Item_Weight and Outlet_Size are imputed before modeling.
- Data Splitting: The data is split into an 80% training set and a 20% testing set. The training set is further split into two subsets.
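The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the tiny DataFrame is made up, mean and mode imputation are assumed choices, and the second split is assumed to be a train/validation split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the dataset (values are made up for illustration).
df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2, np.nan, 10.4, 13.6, 8.9, np.nan, 16.2],
    "Outlet_Size": ["Medium", "Small", None, "High", "Medium",
                    None, "Small", "Medium", "Medium", "Small"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9, 51.4, 57.7, 107.8, 96.9, 187.8],
    "Item_Outlet_Sales": [3735.1, 443.4, 2097.3, 732.4, 994.7,
                          556.6, 343.6, 4022.8, 1076.6, 4710.5],
})

# Assumption: numeric gaps are filled with the column mean,
# categorical gaps with the most frequent value (mode).
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())
df["Outlet_Size"] = df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0])

X = df.drop(columns="Item_Outlet_Sales")
y = df["Item_Outlet_Sales"]

# 80% training / 20% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumption: the further split of the training set is a train/validation split.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)
```

Fixing `random_state` keeps the splits reproducible between runs, which makes model comparisons easier to repeat.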
The following machine learning models are used for prediction:
- Decision Tree Regressor
- Random Forest
- Linear Regression
- AdaBoost
- Gradient Boosting
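As a rough sketch of how the five models listed above could be trained side by side with scikit-learn: the data here is synthetic (`make_regression`), and the hyperparameters are library defaults or illustrative assumptions, not the settings used in the project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the sales dataset (assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# One regressor per model family named in the list above.
models = {
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Linear Regression": LinearRegression(),
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model on the training split and store its test-set predictions.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
```

Keeping the models in a dictionary makes it straightforward to loop over them again later for scoring and plotting.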
Feature selection and feature scaling are then performed to improve the accuracy of the decision trees. This drastically affects the accuracy of all models:
it increases the accuracy of Decision Trees and Random Forests but decreases the accuracy of the remaining models.
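One possible way to combine scaling and selection is a scikit-learn pipeline. The sketch below is an assumption about how this could be done, not the project's actual method: `StandardScaler`, `SelectKBest` with `f_regression`, and `k=3` are illustrative choices, and the data is synthetic.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: 8 features, of which only 3 carry signal (assumption).
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale features, keep the k highest-scoring ones, then fit a decision tree.
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=3),
    DecisionTreeRegressor(random_state=0),
)
pipeline.fit(X_train, y_train)

# Boolean mask marking which of the original features were kept.
selected = pipeline.named_steps["selectkbest"].get_support()
test_predictions = pipeline.predict(X_test)
```

Wrapping the steps in a pipeline ensures the scaler and selector are fitted on the training split only, so no information from the test set leaks into preprocessing.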
- Plot of mean-replaced and median-replaced values against the original dataset.
- Plot of interpolation-replaced values against the original dataset.
- Scatter plots are generated to visualize the predictions of each model against true values.
- A correlation matrix and a feature importance graph are plotted.
- The error of each model is plotted on a histogram, using Root Mean Squared Error (RMSE) as the metric.
- Other evaluation metrics such as MAE and MSE are also used to compare the results.
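For reference, the three metrics named above can be computed as in this small sketch. The true and predicted values are made-up numbers chosen for easy arithmetic, not results from the project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true and predicted sales values (illustration only).
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 380.0])

# MAE: mean of absolute errors (10, 10, 30, 20).
mae = mean_absolute_error(y_true, y_pred)

# MSE: mean of squared errors (100, 100, 900, 400).
mse = mean_squared_error(y_true, y_pred)

# RMSE: square root of MSE, expressed in the same units as the target.
rmse = np.sqrt(mse)

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")
```

Because RMSE squares errors before averaging, it penalizes large misses more heavily than MAE, which is why the two can rank models differently.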
- Aditya Bagri - 2022029
- Suyash Kumar - 2021293