The primary goal of this project is to identify products that are predicted to generate the highest revenue based on their current features and characteristics. The project involves cleaning and preprocessing raw data, building an ETL pipeline, and developing a machine learning model to estimate product performance.
This project is structured to follow a data engineering and machine learning pipeline:
- Data Extraction: Download and store raw sales data from Kaggle in AWS S3.
- Data Cleaning and Transformation: Preprocess the data to handle missing values, invalid entries, and outliers. Engineer features relevant for sales prediction.
- ETL Pipeline: Build an Extract, Transform, and Load pipeline using Python and AWS S3 to automate data flow.
- Machine Learning Model: Train and evaluate a model to predict sales performance for individual products.
- Visualization and Reporting: Present insights using visualizations and document findings for decision-making.
- Predict product sales for the upcoming year using historical sales data.
- Evaluate the model's performance using metrics like RMSE or R².
- Dataset Source: Kaggle
- Example Data Fields:
  - Transaction ID
  - Customer ID
  - Product
  - Category
  - Price Per Unit
  - Quantity
  - Transaction Date
  - Location
  - Discount Applied
- Extract: Download the dataset from S3 using `boto3`.
- Transform: Clean and preprocess the dataset using Pandas:
- Validate fields like Transaction ID, Customer ID, and Prices.
- Handle missing or invalid values.
- Engineer features like seasonal trends and price-per-unit.
- Load: Save the cleaned data back into S3 under the `transformed/` and `preprocessed/` folders.
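The three ETL stages above can be sketched as follows. The bucket name, object keys, and the exact cleaning rules are illustrative assumptions based on the fields listed earlier, not the project's actual configuration:

```python
import io
import pandas as pd

BUCKET = "sales-forecasting-data"  # placeholder bucket name, not the real one

def extract(key: str) -> pd.DataFrame:
    """Download a CSV object from S3 into a DataFrame."""
    import boto3  # deferred so transform() can be used without AWS set up
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=BUCKET, Key=key)
    return pd.read_csv(io.BytesIO(obj["Body"].read()))

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing IDs or invalid prices, then engineer features."""
    df = df.dropna(subset=["Transaction ID", "Customer ID"]).copy()
    df = df[df["Price Per Unit"] > 0].copy()
    df["Transaction Date"] = pd.to_datetime(df["Transaction Date"])
    df["Month"] = df["Transaction Date"].dt.month          # seasonal signal
    df["Revenue"] = df["Price Per Unit"] * df["Quantity"]  # prediction target
    return df

def load(df: pd.DataFrame, key: str) -> None:
    """Write the cleaned DataFrame back to S3 as CSV."""
    import boto3
    s3 = boto3.client("s3")
    s3.put_object(Bucket=BUCKET, Key=key, Body=df.to_csv(index=False))
```

Keeping `transform` free of any S3 calls makes the cleaning logic easy to unit-test locally before wiring it into the pipeline.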
- Train a predictive model using algorithms such as Linear Regression or XGBoost.
- Evaluate the model using validation techniques and refine it based on metrics.
- Generate visual insights:
- Predicted vs. actual sales performance.
- Product rankings based on forecasted sales.
- Programming Language: Python
- Cloud Platform: AWS S3
- Libraries:
  - `pandas`, `numpy` for data manipulation.
  - `matplotlib`, `seaborn` for visualization (in progress).
  - `scikit-learn`, `xgboost` for machine learning.
  - `boto3` for AWS S3 integration.
  - `dotenv` for environment variable management.
- Clone the repository:
  ```bash
  git clone https://github.com/username/sales-forecasting.git
  cd sales-forecasting
  ```
- Install required Python packages:
  ```bash
  pip install -r requirements.txt
  ```
- Configure AWS credentials in a `.env` file:
  ```
  AWS_ACCESS_KEY=your_access_key
  AWS_SECRET_KEY=your_secret_key
  ```
- Run the ETL pipeline:
  ```bash
  python etl_pipeline.py
  ```
- Train and evaluate the machine learning model:
  ```bash
  python train_model.py
  ```
- Cleaned Dataset: Transformed sales data stored in AWS S3 under the `transformed/` folder.
- Trained Model: A model trained to forecast product performance.
- Predictions: Forecasted sales data for the upcoming year.
- Top Predicted Products:

  | Reconstructed_Item | Predicted Revenue |
  | --- | --- |
  | Item_Item_25_FUR | 24715.45 |
  | Item_Item_25_EHE | 23073.85 |
  | Item_Item_25_BUT | 22244.72 |
  | Item_Item_24_FUR | 20758.97 |
  | Item_Item_25_FOOD | 20270.84 |
  | Item_Item_22_BUT | 19938.79 |
  | Item_Item_23_BUT | 19286.40 |
  | Item_Item_19_MILK | 18953.87 |
  | Item_Item_20_BUT | 18836.56 |
  | Item_Item_23_PAT | 18439.58 |
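A ranking like the table above can be produced by aggregating per-product predictions with pandas; the column names mirror the table, but the helper itself is a sketch rather than the project's actual code:

```python
import pandas as pd

def top_products(pred_df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Sum predicted revenue per product and return the top-n rows."""
    return (
        pred_df.groupby("Reconstructed_Item")["Predicted Revenue"]
        .sum()
        .nlargest(n)
        .reset_index()
    )
```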
- Model Performance:
  - Mean Squared Error (MSE): 961.23
  - R² Score: 0.89
- Integrate time-series forecasting models for seasonal trends.
- Deploy the trained model using AWS Lambda for real-time predictions.
- Build a user-friendly dashboard for sales insights.
- Muntaqa Maahi
Data Engineer and Machine Learning Enthusiast