diff --git a/source/cloud/azure/azureml.md b/source/cloud/azure/azureml.md
index 44fa887d..3c4f554b 100644
--- a/source/cloud/azure/azureml.md
+++ b/source/cloud/azure/azureml.md
@@ -16,7 +16,7 @@ Follow these high-level steps to get started:
**2. Workspace.** Within the Resource Group, create an Azure Machine Learning service Workspace.
-**3. Quota.** Check your Usage + Quota to ensure you have enough quota within your region to launch your desired cluster size.
+**3. Quota.** Check your subscription Usage + Quota to ensure you have enough quota within your region to launch your desired compute instance.
## Azure ML Compute instance
@@ -28,7 +28,7 @@ The compute instance provides an integrated Jupyter notebook service, JupyterLab
Sign in to [Azure Machine Learning Studio](https://ml.azure.com/) and navigate to your workspace on the left-side menu.
-Select **New** > **Compute instance** (Create compute instance) > choose a [RAPIDS compatible GPU](https://docs.rapids.ai/install/#system-req) VM size (e.g., `Standard_NC12s_v3`)
+Select **New** > **Compute instance** (Create compute instance) > choose an [Azure RAPIDS compatible GPU](https://docs.rapids.ai/deployment/stable/cloud/azure/) VM size (e.g., `Standard_NC12s_v3`)

@@ -40,20 +40,19 @@ Choose "Provision with a creation script" to install RAPIDS and dependencies.
Put the following in a local file called `rapids-azure-startup.sh`:
```bash
-#!/bin/bash
-
sudo -u azureuser -i <<'EOF'
source /anaconda/etc/profile.d/conda.sh
conda create -y -n rapids \
- {{ rapids_conda_channels }} \
+ {{ rapids_conda_channels }} \
-c microsoft \
- {{ rapids_conda_packages }} \
- 'azure-ai-ml>=2024.12' \
- 'azure-identity>=24.12' \
+ {{ rapids_conda_packages }} \
+ 'azure-identity>=1.19' \
ipykernel
conda activate rapids
+pip install 'azure-ai-ml>=1.24'
+
python -m ipykernel install --user --name rapids
echo "kernel install completed"
EOF
@@ -73,7 +72,7 @@ Once your Notebook Instance is `Running`, open "JupyterLab" and select the `rapi
## Azure ML Compute cluster
-Launch Azure's [ML Compute cluster](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-compute-cluster?tabs=python) to distribute your RAPIDS training jobs across a cluster of single or multi-GPU compute nodes.
+In this section, we will launch Azure's [ML Compute cluster](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-compute-cluster?tabs=python) to distribute your RAPIDS training jobs across a cluster of single or multi-GPU compute nodes.
The Compute cluster scales up automatically when a job is submitted, and executes in a containerized environment, packaging your model dependencies in a Docker container.
@@ -102,7 +101,9 @@ ml_client = MLClient.from_config(
You will need to create a [compute target](https://learn.microsoft.com/en-us/azure/machine-learning/concept-compute-target?view=azureml-api-2#azure-machine-learning-compute-managed) using Azure ML managed compute ([AmlCompute](https://azuresdkdocs.blob.core.windows.net/$web/python/azure-ai-ml/0.1.0b4/azure.ai.ml.entities.html)) for remote training.
-Note: Be sure to check limits within your available region.
+```{note}
+Be sure to check VM size availability and quota limits in the region where you created your compute instance.
+```
This [article](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-quotas?view=azureml-api-2#azure-machine-learning-compute) includes details on the default limits and how to request more quota.
@@ -121,7 +122,7 @@ from azure.ai.ml.entities import AmlCompute
gpu_compute = AmlCompute(
name="rapids-cluster",
type="amlcompute",
- size="Standard_NC12s_v3",
+ size="Standard_NC12s_v3", # this VM type needs to be available in your current region
max_instances=3,
idle_time_before_scale_down=300, # Seconds of idle time before scaling down
tier="low_priority", # optional
@@ -129,6 +130,9 @@ gpu_compute = AmlCompute(
ml_client.begin_create_or_update(gpu_compute).result()
```
+If you name your cluster `"rapids-cluster"`, you can visit [https://ml.azure.com/compute/rapids-cluster/details](https://ml.azure.com/compute/rapids-cluster/details)
+to see details about your cluster.
+
### Access Datastore URI
A [datastore URI](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data-interactive?tabs=adls&view=azureml-api-2#access-data-from-a-datastore-uri-like-a-filesystem-preview) is a reference to a blob storage location (path) on your Azure account. You can copy-and-paste the datastore URI from the AzureML Studio UI:
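A datastore URI follows a predictable structure, so you can also assemble one in code instead of copying it from the UI. A minimal sketch, where every value is a placeholder for illustration (not a real resource name):

```python
# Sketch: build a datastore URI from its components.
# All values below are hypothetical placeholders.
subscription_id = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
resource_group = "my-resource-group"
workspace = "my-workspace"
datastore = "workspaceblobstore"
path = "example-data/airline_small.parquet"

datastore_uri = (
    f"azureml://subscriptions/{subscription_id}"
    f"/resourcegroups/{resource_group}"
    f"/workspaces/{workspace}"
    f"/datastores/{datastore}"
    f"/paths/{path}"
)
print(datastore_uri)
```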
@@ -144,14 +148,27 @@ A [datastore URI](https://learn.microsoft.com/en-us/azure/machine-learning/how-t
To run an AzureML experiment, you must specify an [environment](https://learn.microsoft.com/en-us/azure/machine-learning/concept-environments?view=azureml-api-2) that contains all the necessary software dependencies to run the training script on distributed nodes.
You can define an environment from a [pre-built](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python&view=azureml-api-2#create-an-environment-from-a-docker-image) docker image or create-your-own from a [Dockerfile](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python&view=azureml-api-2#create-an-environment-from-a-docker-build-context) or [conda](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python&view=azureml-api-2#create-an-environment-from-a-conda-specification) specification file.
-Create your custom RAPIDS docker image using the example below, making sure to install additional packages needed for your workflows.
+In a notebook cell, run the following to copy the example code from this documentation into a new folder,
+and to create a Dockerfile that builds an image starting from a RAPIDS image and installs the additional
+packages needed for the workflow.
+
+```ipython
+%%bash
+mkdir -p ./training-code
+repo_url='https://raw.githubusercontent.com/rapidsai/deployment/refs/heads/main/source/examples'
-```dockerfile
-# Use latest rapids image with the necessary dependencies
+# download training scripts
+wget -O ./training-code/train_rapids.py "${repo_url}/rapids-azureml-hpo/train_rapids.py"
+wget -O ./training-code/rapids_csp_azure.py "${repo_url}/rapids-azureml-hpo/rapids_csp_azure.py"
+touch ./training-code/__init__.py
+
+# create a Dockerfile defining the image the code will run in
+cat > ./training-code/Dockerfile <<EOF
FROM {{ rapids_container }}

RUN pip install 'dask-ml>=2024.4.4' \
&& pip install azureml-mlflow
+EOF
```
Now create the Environment, making sure to label and provide a description:
@@ -162,7 +179,7 @@ from azure.ai.ml.entities import Environment, BuildContext
# NOTE: 'path' should be a filepath pointing to a directory containing a file named 'Dockerfile'
env_docker_image = Environment(
build=BuildContext(path="./training-code/"),
- name="rapids-mlflow",
+ name="rapids-mlflow", # label
description="RAPIDS environment with azureml-mlflow",
)
@@ -173,27 +190,6 @@ ml_client.environments.create_or_update(env_docker_image)
Now that we have our environment and custom logic, we can configure and run the `command` [class](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml?view=azure-python#azure-ai-ml-command) to submit training jobs.
-In a notebook cell, copy the example code from this documentation into a new folder.
-
-```ipython
-%%bash
-mkdir -p ./training-code
-repo_url='https://raw.githubusercontent.com/rapidsai/deployment/refs/heads/main/source/examples'
-
-# download training scripts
-wget -O ./training-code/train_rapids.py "${repo_url}/rapids-azureml-hpo/train_rapids.py"
-wget -O ./training-code/rapids_csp_azure.py "${repo_url}/rapids-azureml-hpo/rapids_csp_azure.py"
-touch ./training-code/__init__.py
-
-# create a Dockerfile defining the image the code will run in
-cat > ./training-code/Dockerfile <<EOF
-FROM {{ rapids_container }}
-
-RUN pip install 'dask-ml>=2024.4.4' \
- && pip install azureml-mlflow
-EOF
-```
-
`inputs` is a dictionary of command-line arguments to pass to the training script.
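As a rough illustration of the idea, each key in `inputs` can be referenced from the job's command string with a `${{inputs.<name>}}` placeholder. The argument names below are hypothetical, not taken from `train_rapids.py`:

```python
# Hypothetical inputs for a training job; keys become
# ${{inputs.<name>}} placeholders in the command string.
inputs = {
    "data_dir": "azureml://datastores/workspaceblobstore/paths/airline/",
    "n_bins": 32,
    "cv_folds": 5,
}

# The command string references them like this (kept literal here;
# Azure ML substitutes the values at job submission time):
command_string = (
    "python train_rapids.py "
    "--data_dir ${{inputs.data_dir}} --cv_folds ${{inputs.cv_folds}}"
)
print(command_string)
```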
```python
@@ -228,11 +224,19 @@ command_job = command(
# submit training job
returned_job = ml_client.jobs.create_or_update(command_job)
+returned_job # displays status and details page of the experiment
```
-After creating the job, go to [the "Experiments" page](https://ml.azure.com/experiments) to view logs, metrics, and outputs.
+After creating the job, open the details page linked in the output of `returned_job`, or go
+to [the "Experiments" page](https://ml.azure.com/experiments) to view logs, metrics, and outputs.
+
+```{note}
+For reference, this job took ~7 minutes to complete using `size="Standard_NC6s_v3"` in the `gpu_compute` creation.
+```
-Next, try performing a sweep over a set of hyperparameters.
+
+Next, we can perform a sweep over a set of hyperparameters.
```python
from azure.ai.ml.sweep import Choice, Uniform
@@ -252,10 +256,20 @@ sweep_job = command_job_for_sweep.sweep(
goal="Maximize",
)
+# set a very small number of trials for demo purposes
+sweep_job.set_limits(
+ max_total_trials=3, max_concurrent_trials=3, timeout=18000, trial_timeout=3600
+)
+
# submit job
returned_sweep_job = ml_client.create_or_update(sweep_job)
+returned_sweep_job
```
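As a mental model, `Choice` samples from a discrete set of values and `Uniform` samples a float from a continuous range. A rough pure-Python analogue of one trial's sampling (not the Azure ML API, and the parameter names are illustrative):

```python
import random

def sample_choice(values):
    # Discrete sampling, analogous to azure.ai.ml.sweep.Choice
    return random.choice(values)

def sample_uniform(low, high):
    # Continuous sampling, analogous to azure.ai.ml.sweep.Uniform
    return random.uniform(low, high)

random.seed(0)  # for a reproducible demo
trial = {
    "n_bins": sample_choice([16, 32, 64]),
    "learning_rate": sample_uniform(0.001, 0.1),
}
print(trial)
```

Each trial the sweep launches is one such draw from the search space, evaluated by running the training command with those values.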
+Once the job is created, open the details page linked in the output of `returned_sweep_job`, or go
+to [the "Experiments" page](https://ml.azure.com/experiments) to view logs, metrics, and outputs. The three trials
+set in `sweep_job.set_limits(...)` take between 20 and 40 minutes to complete when using `size="Standard_NC6s_v3"`.
+
### Clean Up
When you're done, remove the compute resources.
@@ -264,6 +278,9 @@ When you're done, remove the compute resources.
ml_client.compute.begin_delete(gpu_compute.name).wait()
```
+Then check [https://ml.azure.com/compute/list/instances](https://ml.azure.com/compute/list/instances) to make sure your compute instance
+is also stopped, and delete it if desired.
+
```{relatedexamples}
```
diff --git a/source/images/azureml_returned_job_completed.png b/source/images/azureml_returned_job_completed.png
new file mode 100644
index 00000000..3b98e0df
Binary files /dev/null and b/source/images/azureml_returned_job_completed.png differ