Update AzureML docs #519
Conversation
source/cloud/azure/azureml.md
Outdated
@@ -146,7 +147,7 @@ You can define an environment from a [pre-built](https://learn.microsoft.com/en-

Create your custom RAPIDS docker image using the example below, making sure to install additional packages needed for your workflows.

-```dockerfile
+```Dockerfile
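The custom image itself isn't shown in this excerpt. As a rough sketch of what such a Dockerfile might look like (the base image tag and the extra packages here are illustrative placeholders, not the ones from the docs):

```Dockerfile
# Start from a RAPIDS base image (tag is illustrative; use a current release)
FROM rapidsai/rapidsai:23.02-cuda11.8-runtime-ubuntu22.04-py3.10

# Install additional packages needed for your workflows
# (azureml-mlflow is a plausible choice here given the MLflow experiment
# referenced later in this thread, but it is an assumption)
RUN pip install --no-cache-dir azureml-mlflow mlflow
```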
I modified this to leave a particular comment. This section ("Custom Rapids Environment") is very confusing about what we should do, in particular because in the following section ("Submit Rapids Training Jobs") we are creating the Dockerfile again in the %bash script.
I'm not super familiar with how this works; I think it would be good if someone who understands the flow could help clean this up or clarify what's needed.
There is definitely some duplication in here that will cause confusion. I don't think we need to create the `Dockerfile` in the bash section.
The steps for this kind of workflow are:

- Start from an interactive environment (could be your laptop or a compute instance)
- Create a compute cluster
- Create a software environment. This can be from a `Dockerfile`
- Submit a batch job that uses the environment to the compute cluster
- Wait for the results
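The steps above could be sketched with the v2 `azure-ai-ml` SDK roughly as follows. This is not the code from the docs: the workspace details, cluster name, environment name, and paths are all placeholders, and it cannot run without real Azure credentials (the `Standard_NC6s_v3` size is the one mentioned later in this thread):

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import AmlCompute, Environment, BuildContext

# Connect to the workspace (all three identifiers are placeholders)
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# 1. Create a GPU compute cluster
cluster = AmlCompute(
    name="rapids-cluster",    # hypothetical name
    size="Standard_NC6s_v3",  # GPU VM size mentioned in this thread
    min_instances=0,
    max_instances=3,
)
ml_client.compute.begin_create_or_update(cluster)

# 2. Create a software environment built from a Dockerfile
env = Environment(
    name="rapids-env",
    build=BuildContext(path="./docker"),  # folder containing the Dockerfile
)
ml_client.environments.create_or_update(env)

# 3. Submit a batch job that uses the environment on the cluster
job = command(
    code="./src",
    command="python train.py",
    environment="rapids-env@latest",
    compute="rapids-cluster",
)
returned_job = ml_client.jobs.create_or_update(job)

# 4. Wait for the results (stream logs until the job completes)
ml_client.jobs.stream(returned_job.name)
```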
![Screenshot of job under the test_rapids_mlflow experiment](../../images/azureml_returned_job_completed.png)

Next, we can perform a sweep over a set of hyperparameters.
So I think we need to modify the parameters here, because this sweep creates 1000 jobs; I was able to complete only ~5 in a span of 20 minutes using size="Standard_NC6s_v3". I'm not sure what a more reasonable sweep would be, but we should change this. Any thoughts?
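For context on where a number like 1000 can come from: the trial count of an exhaustive sweep grows multiplicatively with the search space. The space below is invented purely for illustration (the real search space isn't shown in this thread), but it shows why trimming each dimension, or capping total trials, matters:

```python
from itertools import product

# Hypothetical search space (NOT the one in the docs):
# 3 hyperparameters with 10 candidate values each
space = {
    "max_depth": list(range(3, 13)),                    # 10 values
    "n_estimators": [100 * i for i in range(1, 11)],    # 10 values
    "learning_rate": [0.01 * i for i in range(1, 11)],  # 10 values
}

# Exhaustive grid size is the product of the dimension sizes
total = len(list(product(*space.values())))
print(total)  # 10 * 10 * 10 = 1000 combinations
```

Capping the sweep (as the limit added later in this PR does) keeps the demo fast regardless of how large the space is.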
I think something small is totally fine. The goal here is to show how to dispatch the jobs, not how to do a good sweep. If you think it's best to do a small number of sweeps as an illustration then just make a note to say "this is a small example, in reality you would do more".
Yes, I think I'll modify this and make it run very quickly, so it's also easier for us to test whether things finish as expected.
This is ready for a final review. I left a comment about the sweep space, which we should definitely modify.
@@ -252,10 +256,20 @@ sweep_job = command_job_for_sweep.sweep(
    goal="Maximize",
)

+# setting a very small limit of trials for demo purposes
+sweep_job.set_limits(
@jacobtomlinson I added this to get only 3 trials, given that we have 3 GPU nodes in the cluster we created. Each job took between 20 and 40 minutes, but I think this depends on the characteristics of the job; sometimes it's 10-15 minutes. I also put a note below about the times.
If this is good, that's the last change I wanted to make. This should be good to go.
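Back-of-the-envelope math for the timing above: with 3 trials and 3 GPU nodes, all trials can run concurrently, so the sweep's wall-clock time is roughly one trial's duration. The 20-40 minute per-trial figures come from this thread; the arithmetic itself is just an illustration:

```python
import math

n_trials = 3                  # the limit set for demo purposes
n_nodes = 3                   # GPU nodes, i.e. max concurrent trials
per_trial_minutes = (20, 40)  # observed per-trial range from this thread

# Trials run in "waves" of up to n_nodes at a time
waves = math.ceil(n_trials / n_nodes)
wall_clock = tuple(waves * t for t in per_trial_minutes)
print(waves, wall_clock)  # 1 (20, 40): roughly 20-40 minutes end to end

# The same 3 trials on a single node would run back to back
serial = tuple(n_trials * t for t in per_trial_minutes)
print(serial)  # (60, 120): roughly 1-2 hours
```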
Great, thanks!
@jacobtomlinson this needs some input/answers, hence it's opened as a draft, but if you know the answer to any of these, feel free to either commit directly to the PR or leave a comment.