- Can be used on multiple clouds
- Run big data pipelines using both batch and real-time data (see the sketch after this list)
    - Batch data from Data Factory
    - Real-time data from Event Hubs, IoT Hub
- Train machine-learning models
- Uses Spark
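A minimal PySpark sketch of both ingestion modes, assuming a Databricks notebook (where `spark` and `sc` are predefined) with the Azure Event Hubs Spark connector attached to the cluster; the storage path and connection string are placeholders:

```python
# Hypothetical batch + streaming reads in a Databricks notebook.
# Assumes: `spark`/`sc` are provided by the runtime, the Event Hubs
# Spark connector is attached, and the paths/secrets are placeholders.

# Batch: read files landed by a Data Factory pipeline (placeholder path)
batch_df = spark.read.parquet(
    "abfss://landing@dbstoragexxxxx.dfs.core.windows.net/sales/"
)

# Streaming: read events from an Event Hub (placeholder connection string)
conn_str = "Endpoint=sb://<namespace>.servicebus.windows.net/;EntityPath=<hub>;..."
eh_conf = {
    # The connector expects the connection string to be encrypted
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str),
}
stream_df = spark.readStream.format("eventhubs").options(**eh_conf).load()
```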
Networking diagram:
- Control plane in MSFT-managed subscription
- Data plane in a MSFT-managed VNet in customer subscription
    - It has a MSFT-managed NSG associated with its subnets
- There's a managed resource group. It contains:
    - A DBFS (Databricks File System) storage account `dbstoragexxxxx`
    - A user-assigned managed identity `dbmanagedidentity`, which is assigned to the cluster node VMs
    - Cluster nodes (as VMs) you create in the workspace
    - The managed VNet, if you are not using your own VNet
- Databricks adds a Deny Assignment to this resource group, so you can't change things, except adding a PEP (private endpoint) to the DBFS storage account
- For the workspace VNet, instead of using a MSFT-managed one, you could use your own custom VNet (VNet injection)
    - The VNet must have two subnets delegated to `Microsoft.Databricks/workspaces`:
        - a container subnet (private subnet)
        - a host subnet (public subnet)
    - The minimum size of each subnet is `/28` (see the sizing sketch after this list)
    - Each cluster node has two NICs, one in the container subnet and one in the host subnet
    - Subnets cannot be shared across workspaces or with other Azure resources
    - Azure Databricks auto-provisions and manages some rules in the NSG for these two subnets; you can't delete or update these rules
    - Some rules have `VirtualNetwork` as both source and destination, because Azure does not have a subnet-level service tag. Internally, all clusters are protected by a second layer of network policy, so cluster A cannot connect to cluster B in the same or another workspace.
    - If you have Azure resources in another subnet you want to protect, add an inbound deny rule
    - You might need to configure UDRs for these two subnets, to route via `Internet` for the following destinations:
        - Azure Databricks
        - Extended infrastructure IPs (standby Azure Databricks infrastructure that improves the stability of Databricks services)
        - Azure SQL (for the Azure Databricks metastore)
        - Azure Storage (for artifact Blob storage and log Blob storage)
        - Azure Event Hubs (for logging to Azure Event Hubs)
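As referenced above, a quick sanity check on the `/28` minimum, using only Python's standard library (the address ranges are arbitrary examples; the 5 reserved IPs per subnet are Azure's standard reservations):

```python
# Rough capacity math for Databricks subnets: Azure reserves 5 IPs per
# subnet, and every cluster node takes one IP in the container subnet AND
# one in the host subnet (two NICs per node).
import ipaddress

for cidr in ("/28", "/26", "/24"):
    subnet = ipaddress.ip_network(f"10.0.0.0{cidr}")
    usable = subnet.num_addresses - 5  # minus Azure's reserved addresses
    print(f"{cidr}: {usable} usable IPs -> at most {usable} cluster nodes")
```

So a pair of `/28` subnets caps a workspace at roughly 11 cluster nodes, which is why larger ranges are usually chosen.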
Private Link requirements:
- Workspace must be on premium tier
- Workspace must use VNet injection (even for front-end-only connection)
- You create a separate subnet for PEPs (could be `/27`)
Connection types:
- Front-end Private Link (user to control plane)
    - Target sub-resource: `databricks_ui_api`
    - Between users and the workspace (the control plane)
    - Used for connections to the ADB web application, REST API, and Databricks Connect API
    - Also used by JDBC/ODBC and Power BI integrations
- Back-end Private Link (data plane to control plane)
    - Target sub-resource: `databricks_ui_api`
    - From the clusters to the secure cluster connectivity relay endpoint and REST API endpoint
    - Secure cluster connectivity (SCC / No Public IP / NPIP) must be enabled
    - Data plane here refers to the classic data plane, the compute layer of Azure Databricks, NOT the serverless data plane that supports serverless SQL warehouses
- Web auth private connections
    - Target sub-resource: `browser_authentication`
    - The domain name for a region looks like `australiaeast.pl-auth.azuredatabricks.net` (the resolution check after this list shows how to verify DNS)
        - There might be more than one if there are multiple Azure Databricks control plane instances in the same region, like `australiaeast-c2.pl-auth.azuredatabricks.net`
    - Special configuration for SSO login callbacks to the Azure Databricks web application
        - Allows AAD to redirect users after login to the correct control plane instance
        - Not needed for REST API calls
    - Exactly one PEP is needed for all workspaces in the same region that share one private DNS zone
    - Strongly recommended to create a separate private web auth workspace for this
        - This workspace exists just for this web auth PEP
        - Don't put any workload in it
        - Don't configure it for user login
        - No need for a connection from data plane to control plane
        - Don't delete it
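One way to sanity-check the Private Link DNS setup is to confirm, from inside the client VNet, that these hostnames resolve to private IPs; a minimal sketch (both hostnames are placeholders for your own workspace URL and regional web-auth domain):

```python
# Check whether Private Link DNS is in effect: hostnames should resolve
# to private IPs from a VM inside the transit/workspace VNet.
import ipaddress
import socket

hosts = [
    "adb-1234567890123456.7.azuredatabricks.net",  # workspace (databricks_ui_api)
    "australiaeast.pl-auth.azuredatabricks.net",   # web auth (browser_authentication)
]

for host in hosts:
    try:
        ip = socket.gethostbyname(host)
        kind = "private" if ipaddress.ip_address(ip).is_private else "PUBLIC"
        print(f"{host} -> {ip} ({kind})")
    except socket.gaierror as e:
        print(f"{host} -> resolution failed: {e}")
```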
Two types of deployment:
- Standard (recommended):
    - Two PEPs for the workspace:
        - Back-end PEP in a separate subnet in the workspace VNet
        - Front-end PEP in a separate VNet
- Simplified:
    - A single PEP for both front-end and back-end connections
    - The PEP sits in a transit subnet in the workspace VNet
    - Can't be front-end only
Objects to create:
- Create a separate private web auth workspace per region for SSO login
- This workspace needs its own VNet for VNet injection (though you won't put anything in it), not shown in the diagram
- Set a lock on this workspace, so it won't be deleted
- You need two separate private DNS zones for the `databricks_ui_api` endpoint, one for the back end, one for the front end
Secure cluster connectivity (SCC):
- Also known as No Public IP or NPIP
- In the ARM template, set `enableNoPublicIp` to `true` (see the parameters sketch after this list)
- When enabled:
    - customer virtual networks have no open ports
    - data plane (Databricks Runtime cluster) nodes have no public IP addresses
    - both container and host subnets are private
- How it works:
    - Each cluster initiates a connection to the control plane's secure cluster connectivity relay during cluster creation, using port 443 (HTTPS) and a different IP address from the one used for the web application and REST API
    - When the control plane starts new Databricks Runtime jobs or performs other cluster management tasks, these requests are sent to the cluster through this tunnel
- All Azure Databricks network traffic between the data plane VNet and the Azure Databricks control plane goes across the Microsoft network backbone, not the public Internet. This is true even if secure cluster connectivity is disabled.
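A minimal sketch that generates an ARM parameters file combining SCC with VNet injection; the parameter names (`enableNoPublicIp`, `customVirtualNetworkId`, `customPublicSubnetName`, `customPrivateSubnetName`) come from the Azure Databricks workspace ARM template, while all resource IDs and names below are placeholders:

```python
# Emit an ARM parameters file enabling SCC (NPIP) with VNet injection.
# All IDs/names are placeholders for your own resources.
import json

params = {
    "enableNoPublicIp": {"value": True},  # SCC / No Public IP
    "customVirtualNetworkId": {"value": (
        "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
        "Microsoft.Network/virtualNetworks/<vnet-name>"
    )},
    "customPublicSubnetName": {"value": "host-subnet"},
    "customPrivateSubnetName": {"value": "container-subnet"},
}

with open("workspace.parameters.json", "w") as f:
    json.dump({
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
        "contentVersion": "1.0.0.0",
        "parameters": params,
    }, f, indent=2)
```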
Outbound (egress) scenarios:
- Managed VNet:
    - Azure Databricks automatically creates a NAT gateway for outbound traffic to the Azure backbone and public network
    - This NAT gateway is associated with both subnets
- VNet injection, three options:
    - Use an outbound/egress load balancer; its configuration is managed by Azure Databricks
    - Use an Azure NAT gateway
    - Use UDRs, pointing directly to the endpoints or through a firewall (see the route sketch after this list)
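If you go the UDR route, the rules can be expressed with Azure service tags as address prefixes; a hedged sketch that emits the `routes` array of an ARM route-table resource (service-tag UDRs are supported by Azure; route names are placeholders):

```python
# Hypothetical UDRs for the Databricks subnets: service tags as address
# prefixes, "Internet" as the next hop, matching the destinations listed
# in the custom-VNet section above.
import json

destinations = ["AzureDatabricks", "Sql", "Storage", "EventHub"]

routes = [
    {
        "name": f"to-{tag.lower()}",
        "properties": {"addressPrefix": tag, "nextHopType": "Internet"},
    }
    for tag in destinations
]
print(json.dumps(routes, indent=2))
```

Note that a next hop of `Internet` still keeps traffic to Microsoft services on the Microsoft network backbone.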
Two admin levels:
- Account level
    - At `https://accounts.azuredatabricks.net`
    - Manages:
        - Get SCIM user provisioning URL and token (see the SCIM sketch after this list)
        - Users and groups
        - IP access list
- Workspace level
    - At `https://adb-xxxxx.xx.azuredatabricks.net`
    - Manages:
        - Users and groups in the workspace
        - Workspace settings, e.g. access control, storage, clusters
        - SQL settings
        - SQL warehouse settings
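For example, account-level users can be listed through the account SCIM API; a hedged sketch, assuming you already have a valid AAD access token for the account console and your Databricks account ID (both values below are placeholders):

```python
# List account-level users via the account SCIM API.
# ACCOUNT_ID and TOKEN are placeholders you must supply.
import requests

ACCOUNT_ID = "<databricks-account-id>"
TOKEN = "<aad-access-token>"

resp = requests.get(
    f"https://accounts.azuredatabricks.net/api/2.0/accounts/{ACCOUNT_ID}/scim/v2/Users",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
for user in resp.json().get("Resources", []):
    print(user.get("userName"))
```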
Initial setup:
- An AAD Global Admin user logs in to the Azure Portal
- Find the Databricks resource, click on "Launch Workspace"
- This account is set up as an "Account admin" and can assign the "Account admin" role to other users
Unity Catalog hierarchy:
- Metastore
    - Catalog
        - Schema (database)
            - Tables/views
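This hierarchy surfaces in queries as a three-level namespace (`catalog.schema.table`); a minimal sketch, assuming a Databricks notebook with Unity Catalog enabled, `spark` provided by the runtime, and the necessary privileges (all names are placeholders):

```python
# Walk down the metastore hierarchy via the three-level namespace.
spark.sql("CREATE CATALOG IF NOT EXISTS demo_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS demo_catalog.demo_schema")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_catalog.demo_schema.events (
        id BIGINT,
        payload STRING
    )
""")
spark.sql("SELECT * FROM demo_catalog.demo_schema.events").show()
```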