This lab guides you through building an ETL pipeline with Azure Databricks.
Sign in to the Azure portal.
> **Note:** This tutorial cannot be carried out using an Azure Free Trial subscription. If you have a free account, go to your profile and change your subscription to pay-as-you-go. For more information, see Azure free account. Then remove the spending limit and request a quota increase for vCPUs in your region. When you create your Azure Databricks workspace, you can select the **Trial (Premium - 14-Days Free DBUs)** pricing tier to give the workspace access to free Premium Azure Databricks DBUs for 14 days.
In this section, you create an Azure Databricks workspace using the Azure portal.
1. In the Azure portal, select **Create a resource > Analytics > Azure Databricks**.
2. Under **Azure Databricks Service**, provide the following values to create a Databricks workspace:

   | Property | Description |
   |---|---|
   | **Workspace name** | Provide a name for your Databricks workspace. |
   | **Subscription** | From the drop-down, select your Azure subscription. |
   | **Resource group** | Specify whether you want to create a new resource group or use an existing one. A resource group is a container that holds related resources for an Azure solution. For more information, see Azure Resource Group overview. |
   | **Location** | Select **West US 2**. For other available regions, see Azure services available by region. |
   | **Pricing Tier** | Choose between **Standard**, **Premium**, or **Trial**. For more information on these tiers, see the Databricks pricing page. |

3. Select **Pin to dashboard** and then click **Create**.

The workspace creation takes a few minutes. During workspace creation, you can view the deployment status in **Notifications**.
Databricks® provides a notebook-oriented Apache Spark™ as-a-service workspace environment, making it easy to manage clusters and explore data interactively.
- Read and process huge files and data sets
- Query, explore, and visualize data sets
- Join disparate data sets found in data lakes
- Train and evaluate machine learning models
- Process live streams of data
- Perform analysis on large graph data sets and social networks
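In the lab notebook these tasks run as Spark DataFrame operations on a cluster. As a rough orientation before you get there, the extract-transform-load pattern itself can be sketched in plain Python (the region/amount schema below is illustrative only, not data from this lab):

```python
# A minimal extract-transform-load sketch in plain Python.
# In Databricks these steps would be Spark DataFrame operations;
# the columns "region" and "amount" are made up for illustration.
import csv
import io

raw = io.StringIO(
    "region,amount\n"
    "west,100\n"
    "east,250\n"
    "west,50\n"
)

# Extract: parse the raw CSV into dictionaries.
rows = list(csv.DictReader(raw))

# Transform: cast amounts to integers and aggregate by region.
totals = {}
for row in rows:
    totals[row["region"]] = totals.get(row["region"], 0) + int(row["amount"])

# Load: here we just print; in Databricks this would write to a table.
print(totals)  # {'west': 150, 'east': 250}
```

The same extract/transform/load boundaries reappear in the Spark version of the pipeline, just expressed as `spark.read`, DataFrame transformations, and a write to storage.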
**Hint:** This step requires you to navigate around Databricks during the lesson. We recommend opening a second browser window, so you can view these instructions in one window and navigate in the other.
Databricks notebooks are backed by clusters: networked computers that work together to process your data. Create a Spark cluster (if you already have a running cluster, skip to Step 2):

1. In your new window, click the **Clusters** button in the sidebar.
2. Click the **Create Cluster** button.
3. Name your cluster. Use your name or initials to easily differentiate your cluster from your coworkers'.
4. Select the cluster type. We recommend the latest runtime (**4.0** or newer) and Scala **2.11**.
5. Specify your cluster configuration.
   * For clusters created on a **Community Edition** shard, the default values are sufficient for the remaining fields.
   * For all other environments, refer to your company's policy on creating and using clusters.
**Hint:** Check with your local system administrator to see if there is a recommended default cluster at your company to use for the rest of the class. This could save you some money!
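The UI steps above have a programmatic equivalent: Databricks exposes a Clusters REST API (`POST /api/2.0/clusters/create`) that accepts the same settings as the Create Cluster form. A sketch of building that request follows; the workspace URL, token, and node type are placeholders, not values from this lab:

```python
import json
import urllib.request

# Sketch: create a cluster via the Databricks REST API instead of the UI.
# The workspace URL, token, and node type below are placeholders you would
# replace with your own values.
workspace_url = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

payload = {
    "cluster_name": "my-initials-cluster",
    "spark_version": "4.0.x-scala2.11",  # matches the runtime recommended above
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
}

request = urllib.request.Request(
    workspace_url + "/api/2.0/clusters/create",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": "Bearer " + token,
        "Content-Type": "application/json",
    },
)

# urllib.request.urlopen(request) would submit it; it is left out here so
# the sketch runs without a live workspace.
print(json.dumps(payload, indent=2))
```

For a class, the UI is the simpler path; the API becomes useful when you need to recreate clusters reproducibly.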
Import the `AzureAcademy-DataAnalyst-Day-II.dbc` file into your environment. Then open the lab notebook `02-ETL-Process-Overview` inside the Azure Academy lab folder you just imported.
© 2019 Databricks, Inc. All rights reserved.
Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.