Update AzureML docs #519

Merged
merged 5 commits into from
Feb 13, 2025

Conversation

ncclementi
Contributor

@ncclementi ncclementi commented Feb 10, 2025

@jacobtomlinson this needs some input/answers, hence I opened it as a draft. If you know the answer to any of this, feel free to either commit directly to the PR or leave a comment.

@@ -146,7 +147,7 @@ You can define an environment from a [pre-built](https://learn.microsoft.com/en-

Create your custom RAPIDS docker image using the example below, making sure to install additional packages needed for your workflows.

-```dockerfile
+```Dockerfile
Contributor Author

I modified this line only to leave a comment here. This section ("Custom Rapids Environment") is very confusing about what we should do, in particular because in the following section ("Submit Rapids Training Jobs") we create the Dockerfile in the %bash script.

I'm not very familiar with how this works. I think it would be good if someone who understands the flow could help clean this up or clarify what's needed.

Member

There is definitely some duplication in here that will cause confusion. I don't think we need to create the Dockerfile in the bash section.

The steps for this kind of workflow are:

  • Start from an interactive environment (could be your laptop or a compute instance)
  • Create a compute cluster
  • Create a software environment. This can be from a Dockerfile
  • Submit a batch job that uses the environment to the compute cluster
  • Wait for the results
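The steps above can be sketched with the Azure ML Python SDK v2 (`azure-ai-ml`). This is a minimal sketch under stated assumptions, not the code from the docs under review: the names `rapids-cluster`, `rapids-env`, `train.py`, and the two directory arguments are hypothetical placeholders, while the VM size, the three-node limit, and the experiment name are taken from elsewhere in this conversation.

```python
def run_rapids_workflow(
    subscription_id: str,
    resource_group: str,
    workspace: str,
    docker_context_dir: str = "./docker",  # hypothetical folder containing the Dockerfile
    code_dir: str = "./src",               # hypothetical folder containing train.py
):
    """Sketch of the cluster -> environment -> job -> wait flow."""
    # Imports live inside the function so the sketch can be read and imported
    # without the Azure SDK; running it needs `azure-ai-ml` and `azure-identity`.
    from azure.ai.ml import MLClient, command
    from azure.ai.ml.entities import AmlCompute, BuildContext, Environment
    from azure.identity import DefaultAzureCredential

    # 1. Connect from the interactive environment (laptop or compute instance).
    ml_client = MLClient(
        DefaultAzureCredential(), subscription_id, resource_group, workspace
    )

    # 2. Create a GPU compute cluster that scales down to zero when idle.
    cluster = AmlCompute(
        name="rapids-cluster",  # hypothetical name
        size="Standard_NC6s_v3",
        min_instances=0,
        max_instances=3,
    )
    ml_client.compute.begin_create_or_update(cluster).result()

    # 3. Create the software environment from a Dockerfile build context.
    env = Environment(
        name="rapids-env",  # hypothetical name
        build=BuildContext(path=docker_context_dir),
    )
    env = ml_client.environments.create_or_update(env)

    # 4. Submit a batch job that runs in that environment on the cluster.
    job = command(
        code=code_dir,
        command="python train.py",  # hypothetical entry point
        environment=env,
        compute=cluster.name,
        experiment_name="test_rapids_mlflow",
    )
    returned_job = ml_client.jobs.create_or_update(job)

    # 5. Wait for the results by streaming the job logs until it finishes.
    ml_client.jobs.stream(returned_job.name)
    return returned_job
```

In this sketch the Dockerfile is consumed only in step 3, which matches the point above that it does not also need to be created in the bash section.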

@ncclementi ncclementi marked this pull request as ready for review February 12, 2025 02:03
@ncclementi ncclementi requested a review from a team as a code owner February 12, 2025 02:03

![Screenshot of job under the test_rapids_mlflow experiment](../../images/azureml_returned_job_completed.png)

Next, we can perform a sweep over a set of hyperparameters.
Contributor Author

I think we need to modify the parameters here, because this sweep creates 1000 jobs. I was able to complete ~5 in a span of 20 min using size="Standard_NC6s_v3". I'm not sure what a more reasonable sweep would be, but we should change this. Any thoughts?
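As a rough back-of-the-envelope check of those numbers, the snippet below extrapolates the total sweep time. It assumes throughput stays constant at the observed rate, which is only an approximation; the function name is hypothetical.

```python
def estimated_sweep_minutes(total_jobs: int, jobs_done: int, minutes_elapsed: float) -> float:
    """Extrapolate total sweep time from an observed completion rate."""
    jobs_per_minute = jobs_done / minutes_elapsed
    return total_jobs / jobs_per_minute


# Observed: about 5 jobs in 20 minutes, out of 1000 jobs in the sweep.
minutes = estimated_sweep_minutes(total_jobs=1000, jobs_done=5, minutes_elapsed=20)
print(f"{minutes:.0f} min, about {minutes / 60:.1f} h")  # 4000 min, about 66.7 h
```

At that rate the full sweep would take close to three days, which is why a much smaller trial count makes sense for a docs example.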

Member

I think something small is totally fine. The goal here is to show how to dispatch the jobs, not how to do a good sweep. If you think it's best to do a small number of sweeps as an illustration then just make a note to say "this is a small example, in reality you would do more".

Contributor Author

Yes, I think I'll modify this and test it very quickly so it's also easier for us to test if things finish as expected.

@ncclementi
Contributor Author

This is ready for a final review. I left a comment about the sweep space, which we should definitely modify.

source/cloud/azure/azureml.md
@@ -252,10 +256,20 @@ sweep_job = command_job_for_sweep.sweep(
goal="Maximize",
)

# setting a very small limit of trials for demo purposes
sweep_job.set_limits(
Contributor Author

@jacobtomlinson I added this to run only 3 trials, given that we have 3 GPU nodes in the cluster we created. Each job took between 20 and 40 min, but I think this depends on the characteristics of the job; sometimes it's 10-15 min. I also put a note below about the times.

If this is good, that's the last change I wanted to make. This should be good to go.
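For illustration, capping a sweep at one trial per GPU node could look like the sketch below. It uses the `set_limits` keyword arguments of the Azure ML SDK v2 sweep job (`max_total_trials`, `max_concurrent_trials`, `timeout` in seconds). The helper name and the 120-minute timeout are hypothetical choices, not the values merged in this PR.

```python
def apply_demo_limits(sweep_job, n_gpu_nodes: int = 3, timeout_minutes: int = 120):
    """Cap a sweep at one trial per GPU node so the demo finishes quickly.

    `sweep_job` is expected to be the object returned by
    `command_job_for_sweep.sweep(...)`; any object with a compatible
    `set_limits` method works. `timeout_minutes` is a hypothetical value.
    """
    sweep_job.set_limits(
        max_total_trials=n_gpu_nodes,       # small on purpose; real sweeps use more
        max_concurrent_trials=n_gpu_nodes,  # run one trial on each node at once
        timeout=timeout_minutes * 60,       # the SDK expects seconds
    )
    return sweep_job
```

Running as many concurrent trials as there are nodes means the whole demo sweep takes roughly as long as a single job.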

Member

@jacobtomlinson jacobtomlinson left a comment

Great thanks!

@jacobtomlinson jacobtomlinson merged commit 27deca6 into rapidsai:main Feb 13, 2025
3 checks passed
Development

Successfully merging this pull request may close these issues.

Bug in AzureML cloud docs
3 participants