diff --git a/source/cloud/azure/azureml.md b/source/cloud/azure/azureml.md
index 44fa887d..3c4f554b 100644
--- a/source/cloud/azure/azureml.md
+++ b/source/cloud/azure/azureml.md
@@ -16,7 +16,7 @@ Follow these high-level steps to get started:
 
 **2. Workspace.** Within the Resource Group, create an Azure Machine Learning service Workspace.
 
-**3. Quota.** Check your Usage + Quota to ensure you have enough quota within your region to launch your desired cluster size.
+**3. Quota.** Check your subscription Usage + Quota to ensure you have enough quota within your region to launch your desired compute instance.
 
 ## Azure ML Compute instance
 
@@ -28,7 +28,7 @@ The compute instance provides an integrated Jupyter notebook service, JupyterLab
 
 Sign in to [Azure Machine Learning Studio](https://ml.azure.com/) and navigate to your workspace on the left-side menu.
 
-Select **New** > **Compute instance** (Create compute instance) > choose a [RAPIDS compatible GPU](https://docs.rapids.ai/install/#system-req) VM size (e.g., `Standard_NC12s_v3`)
+Select **New** > **Compute instance** (Create compute instance) > choose an [Azure RAPIDS compatible GPU](https://docs.rapids.ai/deployment/stable/cloud/azure/) VM size (e.g., `Standard_NC12s_v3`)
 
 ![Screenshot of create new notebook with a gpu-instance](../../images/azureml-create-notebook-instance.png)
 
@@ -40,20 +40,19 @@ Choose "Provision with a creation script" to install RAPIDS and dependencies.
 Put the following in a local file called `rapids-azure-startup.sh`:
 
 ```bash
-#!/bin/bash
-
 sudo -u azureuser -i <<'EOF'
 source /anaconda/etc/profile.d/conda.sh
 conda create -y -n rapids \
-    {{ rapids_conda_channels }} \
+    {{ rapids_conda_channels }} -c microsoft \
-    {{ rapids_conda_packages }} \
-    'azure-ai-ml>=2024.12' \
-    'azure-identity>=24.12' \
+    {{ rapids_conda_packages }} \
+    'azure-identity>=1.19' \
     ipykernel
 conda activate rapids
 
+pip install 'azure-ai-ml>=1.24'
+
 python -m ipykernel install --user --name rapids
 echo "kernel install completed"
 EOF
 ```
@@ -73,7 +72,7 @@ Once your Notebook Instance is `Running`, open "JupyterLab" and select the `rapi
 
 ## Azure ML Compute cluster
 
-Launch Azure's [ML Compute cluster](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-compute-cluster?tabs=python) to distribute your RAPIDS training jobs across a cluster of single or multi-GPU compute nodes.
+In the next section, we will launch Azure's [ML Compute cluster](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-compute-cluster?tabs=python) to distribute your RAPIDS training jobs across a cluster of single or multi-GPU compute nodes.
 
 The Compute cluster scales up automatically when a job is submitted, and executes in a containerized environment, packaging your model dependencies in a Docker container.
 
@@ -102,7 +101,9 @@ ml_client = MLClient.from_config(
 
 You will need to create a [compute target](https://learn.microsoft.com/en-us/azure/machine-learning/concept-compute-target?view=azureml-api-2#azure-machine-learning-compute-managed) using Azure ML managed compute ([AmlCompute](https://azuresdkdocs.blob.core.windows.net/$web/python/azure-ai-ml/0.1.0b4/azure.ai.ml.entities.html)) for remote training.
 
-Note: Be sure to check limits within your available region.
+```{note}
+Be sure to check instance availability and limits within the region where you created your compute instance.
+```
 
 This [article](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-quotas?view=azureml-api-2#azure-machine-learning-compute) includes details on the default limits and how to request more quota.
 
@@ -121,7 +122,7 @@ from azure.ai.ml.entities import AmlCompute
 gpu_compute = AmlCompute(
     name="rapids-cluster",
     type="amlcompute",
-    size="Standard_NC12s_v3",
+    size="Standard_NC12s_v3", # this VM type needs to be available in your current region
     max_instances=3,
     idle_time_before_scale_down=300, # Seconds of idle time before scaling down
     tier="low_priority", # optional
@@ -129,6 +130,9 @@ gpu_compute = AmlCompute(
 ml_client.begin_create_or_update(gpu_compute).result()
 ```
 
+If you name your cluster `"rapids-cluster"`, you can check [https://ml.azure.com/compute/rapids-cluster/details](https://ml.azure.com/compute/rapids-cluster/details)
+to see the details about your cluster.
+
 ### Access Datastore URI
 
 A [datastore URI](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data-interactive?tabs=adls&view=azureml-api-2#access-data-from-a-datastore-uri-like-a-filesystem-preview) is a reference to a blob storage location (path) on your Azure account. You can copy-and-paste the datastore URI from the AzureML Studio UI:
 
@@ -144,14 +148,27 @@ A [datastore URI](https://learn.microsoft.com/en-us/azure/machine-learning/how-t
 
 To run an AzureML experiment, you must specify an [environment](https://learn.microsoft.com/en-us/azure/machine-learning/concept-environments?view=azureml-api-2) that contains all the necessary software dependencies to run the training script on distributed nodes.
 You can define an environment from a [pre-built](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python&view=azureml-api-2#create-an-environment-from-a-docker-image) docker image or create-your-own from a [Dockerfile](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python&view=azureml-api-2#create-an-environment-from-a-docker-build-context) or [conda](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python&view=azureml-api-2#create-an-environment-from-a-conda-specification) specification file.
 
-Create your custom RAPIDS docker image using the example below, making sure to install additional packages needed for your workflows.
+In a notebook cell, run the following to copy the example code from this documentation into a new folder,
+and to create a Dockerfile to build an image that starts from a RAPIDS image and installs additional packages needed for the
+workflow.
+
+```ipython
+%%bash
+mkdir -p ./training-code
+repo_url='https://raw.githubusercontent.com/rapidsai/deployment/refs/heads/main/source/examples'
-```dockerfile
-# Use latest rapids image with the necessary dependencies
+# download training scripts
+wget -O ./training-code/train_rapids.py "${repo_url}/rapids-azureml-hpo/train_rapids.py"
+wget -O ./training-code/rapids_csp_azure.py "${repo_url}/rapids-azureml-hpo/rapids_csp_azure.py"
+touch ./training-code/__init__.py
+
+# create a Dockerfile defining the image the code will run in
+cat > ./training-code/Dockerfile <<EOF
 FROM {{ rapids_container }}
 
 RUN pip install 'dask-ml>=2024.4.4' \
     && pip install azureml-mlflow
+EOF
 ```
 
 Now create the Environment, making sure to label and provide a description:
 
@@ -162,7 +179,7 @@ from azure.ai.ml.entities import Environment, BuildContext
 # NOTE: 'path' should be a filepath pointing to a directory containing a file named 'Dockerfile'
 env_docker_image = Environment(
     build=BuildContext(path="./training-code/"),
-    name="rapids-mlflow",
+    name="rapids-mlflow", # label
     description="RAPIDS environment with azureml-mlflow",
 )
 
 ml_client.environments.create_or_update(env_docker_image)
 ```
 
@@ -173,27 +190,6 @@
 Now that we have our environment and custom logic, we can configure and run the `command` [class](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml?view=azure-python#azure-ai-ml-command) to submit training jobs.
 
-In a notebook cell, copy the example code from this documentation into a new folder.
-
-```ipython
-%%bash
-mkdir -p ./training-code
-repo_url='https://raw.githubusercontent.com/rapidsai/deployment/refs/heads/main/source/examples'
-
-# download training scripts
-wget -O ./training-code/train_rapids.py "${repo_url}/rapids-azureml-hpo/train_rapids.py"
-wget -O ./training-code/rapids_csp_azure.py "${repo_url}/rapids-azureml-hpo/rapids_csp_azure.py"
-touch ./training-code/__init__.py
-
-# create a Dockerfile defining the image the code will run in
-cat > ./training-code/Dockerfile <<EOF
-FROM {{ rapids_container }}
-
-RUN pip install 'dask-ml>=2024.4.4' \
-    && pip install azureml-mlflow
-EOF
-```
-
 `inputs` is a dictionary of command-line arguments to pass to the training script.
 
 ```python
@@ -228,11 +224,19 @@ command_job = command(
 
 # submit training job
 returned_job = ml_client.jobs.create_or_update(command_job)
+returned_job # displays status and details page of the experiment
 ```
 
-After creating the job, go to [the "Experiments" page](https://ml.azure.com/experiments) to view logs, metrics, and outputs.
+After creating the job, click on the details page provided in the output of `returned_job`, or go
+to [the "Experiments" page](https://ml.azure.com/experiments) to view logs, metrics, and outputs.
+
+```{note}
+For reference, this job took ~7 min using `size="Standard_NC6s_v3"` in the `gpu_compute` creation.
+```
 
-Next, try performing a sweep over a set of hyperparameters.
+![Screenshot of job under the test_rapids_mlflow experiment](../../images/azureml_returned_job_completed.png)
+
+Next, we can perform a sweep over a set of hyperparameters.
 ```python
 from azure.ai.ml.sweep import Choice, Uniform
 
@@ -252,10 +256,20 @@ sweep_job = command_job_for_sweep.sweep(
     goal="Maximize",
 )
 
+# setting a very small limit of trials for demo purposes
+sweep_job.set_limits(
+    max_total_trials=3, max_concurrent_trials=3, timeout=18000, trial_timeout=3600
+)
+
 # submit job
 returned_sweep_job = ml_client.create_or_update(sweep_job)
+returned_sweep_job
 ```
 
+Once the job is created, click on the details page provided in the output of `returned_sweep_job`, or go
+to [the "Experiments" page](https://ml.azure.com/experiments) to view logs, metrics, and outputs. The three trials
+set in `sweep_job.set_limits(...)` take between 20 and 40 min to complete when using `size="Standard_NC6s_v3"`.
+
 ### Clean Up
 
 When you're done, remove the compute resources.
 
@@ -264,6 +278,9 @@
 ml_client.compute.begin_delete(gpu_compute.name).wait()
 ```
 
+Then check [https://ml.azure.com/compute/list/instances](https://ml.azure.com/compute/list/instances) and make sure your compute instance
+is also stopped, and deleted if desired.
+
 ```{relatedexamples}
 
 ```
diff --git a/source/images/azureml_returned_job_completed.png b/source/images/azureml_returned_job_completed.png
new file mode 100644
index 00000000..3b98e0df
Binary files /dev/null and b/source/images/azureml_returned_job_completed.png differ
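
The `sweep_job.set_limits(...)` call added above caps the sweep at three trials with `goal="Maximize"`. As a rough plain-Python sketch of what a trial budget means for a random sweep (this is not the AzureML API; the function name, search-space values, and objective below are hypothetical, standing in for the real training script):

```python
import random


def run_sweep(objective, search_space, max_total_trials=3, seed=0):
    """Randomly sample hyperparameters until the trial budget is spent,
    then return the best trial (mimicking goal="Maximize")."""
    rng = random.Random(seed)
    trials = []
    for _ in range(max_total_trials):
        # one "trial": pick a value for each hyperparameter, then score it
        params = {name: rng.choice(values) for name, values in search_space.items()}
        trials.append((params, objective(params)))
    # keep the trial with the highest objective value
    return max(trials, key=lambda t: t[1])


# hypothetical search space, loosely mirroring Choice(...) values
space = {"n_estimators": [100, 200, 500], "max_depth": [6, 8, 10]}
best_params, best_score = run_sweep(lambda p: p["n_estimators"] / p["max_depth"], space)
```

With `max_total_trials=3`, only three parameter combinations are ever scored, which is why the demo sweep finishes quickly at the cost of a coarser search.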