The primary goal of this project is to identify products that are predicted to generate the highest revenue based on their current features and characteristics. The project involves cleaning and preprocessing raw data, building an ETL pipeline, and developing a machine learning model to estimate product performance.
This project is structured to follow a data engineering and machine learning pipeline:
- Data Extraction: Download and store raw sales data from Kaggle in AWS S3.
- Data Cleaning and Transformation: Preprocess the data to handle missing values, invalid entries, and outliers. Engineer features relevant for sales prediction.
- ETL Pipeline: Build an Extract, Transform, and Load pipeline using Python and AWS S3 to automate data flow.
- Machine Learning Model: Train and evaluate a model to predict sales performance for individual products.
- Visualization and Reporting: Present insights using visualizations and document findings for decision-making.
- Predict product sales for the upcoming year using historical sales data.
- Evaluate the model's performance using metrics like RMSE or R².
- Dataset Source: Kaggle
- Example Data Fields:
  - Transaction ID
  - Customer ID
  - Product
  - Category
  - Price Per Unit
  - Quantity
  - Transaction Date
  - Location
  - Discount Applied
- Extract: Download the dataset from S3 using `boto3`.
- Transform: Clean and preprocess the dataset using Pandas:
- Validate fields like Transaction ID, Customer ID, and Prices.
- Handle missing or invalid values.
- Engineer features like seasonal trends and price-per-unit.
- Load: Save the cleaned data back into S3 under the `transformed/` and `preprocessed/` folders.
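The three ETL stages above can be sketched as follows. The bucket name, object keys, and the exact cleaning rules are illustrative assumptions based on the fields listed earlier, not the project's actual configuration:

```python
import io
import pandas as pd

BUCKET = "sales-forecasting-data"  # placeholder bucket name, not the real one

def extract(key: str) -> pd.DataFrame:
    """Download a CSV object from S3 into a DataFrame."""
    import boto3  # deferred so transform() can be used without AWS set up
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=BUCKET, Key=key)
    return pd.read_csv(io.BytesIO(obj["Body"].read()))

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing IDs or invalid prices, then engineer features."""
    df = df.dropna(subset=["Transaction ID", "Customer ID"]).copy()
    df = df[df["Price Per Unit"] > 0].copy()
    df["Transaction Date"] = pd.to_datetime(df["Transaction Date"])
    df["Month"] = df["Transaction Date"].dt.month          # seasonal signal
    df["Revenue"] = df["Price Per Unit"] * df["Quantity"]  # prediction target
    return df

def load(df: pd.DataFrame, key: str) -> None:
    """Write the cleaned DataFrame back to S3 as CSV."""
    import boto3
    s3 = boto3.client("s3")
    s3.put_object(Bucket=BUCKET, Key=key, Body=df.to_csv(index=False))
```

Keeping `transform` free of any S3 calls makes the cleaning logic easy to unit-test locally before wiring it into the pipeline.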
- Train a predictive model using algorithms such as Linear Regression or XGBoost.
- Evaluate the model using validation techniques and refine it based on metrics.
- Generate visual insights:
- Predicted vs. actual sales performance.
- Product rankings based on forecasted sales.
- Programming Language: Python
- Cloud Platform: AWS S3
- Libraries:
  - `pandas`, `numpy` for data manipulation.
  - `matplotlib`, `seaborn` for visualization (in progress).
  - `scikit-learn`, `xgboost` for machine learning.
  - `boto3` for AWS S3 integration.
  - `dotenv` for environment variable management.
- Clone the repository:
  ```bash
  git clone https://github.com/username/sales-forecasting.git
  cd sales-forecasting
  ```
- Install required Python packages:
  ```bash
  pip install -r requirements.txt
  ```
- Configure AWS credentials in a `.env` file:
  ```
  AWS_ACCESS_KEY=your_access_key
  AWS_SECRET_KEY=your_secret_key
  ```
- Run the ETL pipeline:
  ```bash
  python etl_pipeline.py
  ```
- Train and evaluate the machine learning model:
  ```bash
  python train_model.py
  ```
- Cleaned Dataset: Transformed sales data stored in AWS S3 under the `transformed/` folder.
- Trained Model: A model trained to forecast product performance.
- Predictions: Forecasted sales data for the upcoming year.
- Top Predicted Products:

  | Reconstructed_Item | Predicted Revenue |
  | --- | --- |
  | Item_Item_25_FUR | 24715.45 |
  | Item_Item_25_EHE | 23073.85 |
  | Item_Item_25_BUT | 22244.72 |
  | Item_Item_24_FUR | 20758.97 |
  | Item_Item_25_FOOD | 20270.84 |
  | Item_Item_22_BUT | 19938.79 |
  | Item_Item_23_BUT | 19286.40 |
  | Item_Item_19_MILK | 18953.87 |
  | Item_Item_20_BUT | 18836.56 |
  | Item_Item_23_PAT | 18439.58 |
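A ranking like the table above can be produced by aggregating per-product predictions with pandas; the column names mirror the table, but the helper itself is a sketch rather than the project's actual code:

```python
import pandas as pd

def top_products(pred_df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Sum predicted revenue per product and return the top-n rows."""
    return (
        pred_df.groupby("Reconstructed_Item")["Predicted Revenue"]
        .sum()
        .nlargest(n)
        .reset_index()
    )
```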
- Model Performance:
  - Mean Squared Error (MSE): 961.23
  - R² Score: 0.89
- Integrate time-series forecasting models for seasonal trends.
- Deploy the trained model using AWS Lambda for real-time predictions.
- Build a user-friendly dashboard for sales insights.
- Muntaqa Maahi
Data Engineer and Machine Learning Enthusiast