End-to-end solution built with Azure Databricks and other Azure services (Synapse, Azure Functions, Logic Apps, Power BI) to predict churn and help retain customers.
The KKbox datasets used are from the Kaggle challenge: https://www.kaggle.com/c/kkbox-churn-prediction-challenge/data.
KKbox is a music streaming service.
The machine learning models are from the Databricks blog post: https://databricks.com/blog/2020/08/24/profit-driven-retention-management-with-machine-learning.html
Data has been manually copied from Kaggle to Azure Data Lake Storage.
In a real scenario, data would be sourced from transactional systems.
Use a Power BI dashboard to get meaningful insights and prevent customer churn.
- Azure Subscription
- Terraform (for deployment)
- PowerShell (for deployment)
- Databricks PowerShell module (for the Databricks deployment part): https://github.com/gbrueckl/Databricks.API.PowerShell
- You might need to bypass the execution policy to install this module
- https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.security/set-executionpolicy?view=powershell-7.1
- Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
- Install-Module -Name DatabricksPS
Run the following commands from the automatic deployment directory (from a PowerShell session):
- terraform init
- terraform plan
- terraform apply -auto-approve
Deployment will create all the resources that are part of the architecture.
In addition to creating resources, the deployment will also:
- Azure Databricks
  - Upload the notebooks to the Databricks workspace
- Azure Synapse Analytics
  - Upload the SQL scripts to Azure Synapse Analytics
  - Create the Linked Services
  - Create the Datasets
  - Create the Pipelines
- Azure Data Lake Storage
  - Create the directories where raw data and predictions will be saved
The resource-group .tf file in the resource-group directory contains the name of the resource group that will be created. By default, the resource group name will start with e2e-churn-demo-
- Create a cluster: https://docs.microsoft.com/en-us/azure/databricks/clusters/create
- Mount the storage (created during the deployment) in Databricks (see the sketch after this list)
  - You can use the notebook "Churn 00_Mount Storage" (uploaded automatically during the deployment) as a template
  - The example relies on Azure Key Vault and Databricks secret scopes: https://docs.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes
  - https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sp-access
- Upload the KKbox dataset from Kaggle (https://www.kaggle.com/c/kkbox-churn-prediction-challenge/data), then run the notebook called "Load Data" (uploaded automatically during the deployment)
- Assign the Storage Blob Data Contributor role to the Synapse workspace (e.g. e2e-churn-demo-3v8nqm) on the storage account (e.g. e2echurndemostor3v8nqm); the random suffix will differ for each deployment
- Assign the Storage Blob Data Contributor role to the Databricks workspace (e.g. e2e-churn-demo-workspace-3v8nqm) on the same storage account
- For Databricks, you will have to create a service principal; it will be used to access the data lake
  - https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sp-access
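Below is a minimal sketch of the mount step in a Python notebook, following the ADLS Gen2 service principal approach from the documentation linked above. The secret scope name (churn-demo), secret names, tenant id, container and storage account names are placeholders for your own deployment; the "Churn 00_Mount Storage" notebook remains the reference.

```python
# Minimal sketch: mount ADLS Gen2 in Databricks with a service principal.
# Scope, secret, container, tenant and account names below are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id":
        dbutils.secrets.get(scope="churn-demo", key="sp-client-id"),
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="churn-demo", key="sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@e2echurndemostor<suffix>.dfs.core.windows.net/",
    mount_point="/mnt/churn",
    extra_configs=configs,
)

# Quick check: the KKbox files uploaded from Kaggle should show up here.
display(dbutils.fs.ls("/mnt/churn"))
```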
When deploying the solution, SQL scripts are uploaded automatically.
Please run at least these two scripts, in this order, on the serverless pool:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/author-sql-script#run-your-sql-script
- Create Database Demo
- Create Master Key
  - Choose a master key encryption password and run the script against the Demo database
- Run any remaining SQL scripts
Make sure to select the serverless pool.
When deploying the solution, Linked Services are created automatically.
Please verify the connection for the following Linked Services:
- Serverless Synapse
- Enter the password (check the synapse.tf file) and test the connection
- Databricks
- Get a Databricks token: https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/authentication#--generate-a-personal-access-token
- Create a cluster in Databricks that will be used to run the Databricks notebooks: https://docs.microsoft.com/en-us/azure/databricks/clusters/create
- Enter this information and test the connection
An Azure Function is created automatically once the deployment is complete.
You will have to implement the function code yourself; in this case, bind the function to a storage queue (a sketch is given below).
In this architecture, we send a message to a queue (predictionChurning) to trigger a Logic App workflow. The Logic App sends an email with a Power BI dashboard attached containing information about customers likely to churn.
https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-http-webhook
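A minimal sketch of what the function could look like in Python, assuming an HTTP trigger (as in the binding documentation linked above) and a storage queue output binding named msg declared in function.json and pointing at the predictionChurning queue; the binding name and message contents are illustrative:

```python
import azure.functions as func


def main(req: func.HttpRequest, msg: func.Out[str]) -> func.HttpResponse:
    # HTTP-triggered entry point (called, for example, from the Synapse
    # notification pipeline once the predictions have been written).
    # The "msg" queue output binding is assumed to be declared in
    # function.json and to target the predictionChurning queue.
    msg.set("churn predictions ready")

    # The message landing on the queue is what triggers the Logic App workflow.
    return func.HttpResponse("Message sent to predictionChurning", status_code=200)
```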
You can easily implement different kinds of logic with Logic Apps.
In this architecture, we use it to automatically refresh the Power BI report and then send an email with the dashboard attached.
https://docs.microsoft.com/en-us/azure/logic-apps/logic-apps-overview
You might need either Power BI Premium or Power BI Embedded for the Power BI part to work.
- You can create the model directly in Databricks or use the Churning Model Creation pipeline in Azure Synapse
- You will have to mount the storage and have the KKbox dataset in your data lake
- You can run the prediction notebook in Databricks or use the Churning predictions pipeline in Azure Synapse (a sketch of the prediction step follows this list)
- You can run the notification pipeline in Azure Synapse to trigger the Logic App workflow and receive the Power BI report automatically by email
- You can also visualize the report directly in Azure Synapse
- https://docs.microsoft.com/en-us/azure/synapse-analytics/quickstart-power-bi
- https://github.com/microsoft/MCW-Azure-Synapse-Analytics-and-AI
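As an illustration, here is a minimal sketch of the prediction step in a Databricks Python notebook, assuming the churn model has been logged to the MLflow Model Registry under the hypothetical name churn-model and that the data lake is mounted under /mnt/churn (the model name, stage, paths and column names are placeholders):

```python
import mlflow.pyfunc
from pyspark.sql import functions as F

# Load the trained churn model from the MLflow Model Registry as a Spark UDF.
# "churn-model" and the "Production" stage are placeholder names.
churn_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn-model/Production")

# Feature table prepared by the earlier notebooks (path is illustrative).
features = spark.read.format("delta").load("/mnt/churn/features")
feature_cols = [c for c in features.columns if c != "msno"]  # msno = KKbox user id

# Score every subscriber and keep the churn probability next to the id.
predictions = features.withColumn(
    "churn_probability", churn_udf(*[F.col(c) for c in feature_cols])
)

# Persist predictions so the Synapse serverless pool and Power BI can read them.
predictions.select("msno", "churn_probability") \
    .write.mode("overwrite").parquet("/mnt/churn/predictions")
```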
- https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
- https://databricks.com/blog/2020/08/24/profit-driven-retention-management-with-machine-learning.html
- https://docs.microsoft.com/en-us/azure/machine-learning/overview-what-is-azure-ml