Update AzureML docs #519
Conversation
source/cloud/azure/azureml.md
Outdated
@@ -146,7 +147,7 @@ You can define an environment from a [pre-built](https://learn.microsoft.com/en-

Create your custom RAPIDS docker image using the example below, making sure to install additional packages needed for your workflows.

-```dockerfile
+```Dockerfile
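The custom image itself isn't shown in this excerpt. As a rough sketch of what such a Dockerfile might look like (the base image tag and the extra packages here are illustrative placeholders, not the ones from the docs):

```Dockerfile
# Start from a RAPIDS base image (tag is illustrative; use a current release)
FROM rapidsai/rapidsai:23.02-cuda11.8-runtime-ubuntu22.04-py3.10

# Install additional packages needed for your workflows
# (azureml-mlflow is a plausible choice here given the MLflow experiment
# referenced later in this thread, but it is an assumption)
RUN pip install --no-cache-dir azureml-mlflow mlflow
```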
I modified this to leave a particular comment. This section ("Custom Rapids Environment") is very confusing about what we should do, in particular because in the following section ("Submit Rapids Training Jobs") we are creating the Dockerfile again in the %bash script.
I'm not super familiar with how this works; I think it would be good if someone who understands the flow could help clean this up or clarify what's needed.
There is definitely some duplication in here that will cause confusion. I don't think we need to create the `Dockerfile` in the bash section.
The steps for this kind of workflow are:

- Start from an interactive environment (could be your laptop or a compute instance)
- Create a compute cluster
- Create a software environment. This can be from a `Dockerfile`
- Submit a batch job that uses the environment to the compute cluster
- Wait for the results
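The steps above could be sketched with the v2 `azure-ai-ml` SDK roughly as follows. This is not the code from the docs: the workspace details, cluster name, environment name, and paths are all placeholders, and it cannot run without real Azure credentials (the `Standard_NC6s_v3` size is the one mentioned later in this thread):

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import AmlCompute, Environment, BuildContext

# Connect to the workspace (all three identifiers are placeholders)
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# 1. Create a GPU compute cluster
cluster = AmlCompute(
    name="rapids-cluster",    # hypothetical name
    size="Standard_NC6s_v3",  # GPU VM size mentioned in this thread
    min_instances=0,
    max_instances=3,
)
ml_client.compute.begin_create_or_update(cluster)

# 2. Create a software environment built from a Dockerfile
env = Environment(
    name="rapids-env",
    build=BuildContext(path="./docker"),  # folder containing the Dockerfile
)
ml_client.environments.create_or_update(env)

# 3. Submit a batch job that uses the environment on the cluster
job = command(
    code="./src",
    command="python train.py",
    environment="rapids-env@latest",
    compute="rapids-cluster",
)
returned_job = ml_client.jobs.create_or_update(job)

# 4. Wait for the results (stream logs until the job completes)
ml_client.jobs.stream(returned_job.name)
```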
![Screenshot of job under the test_rapids_mlflow experiment](../../images/azureml_returned_job_completed.png)

Next, we can perform a sweep over a set of hyperparameters.
So I think we need to modify the parameters here, because this sweep creates 1000 jobs; I was able to complete only ~5 in a span of 20 minutes using size="Standard_NC6s_v3". I'm not sure what a more reasonable sweep would be, but we should change this. Any thoughts?
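For context on where a number like 1000 can come from: the trial count of an exhaustive sweep grows multiplicatively with the search space. The space below is invented purely for illustration (the real search space isn't shown in this thread), but it shows why trimming each dimension, or capping total trials, matters:

```python
from itertools import product

# Hypothetical search space (NOT the one in the docs):
# 3 hyperparameters with 10 candidate values each
space = {
    "max_depth": list(range(3, 13)),                    # 10 values
    "n_estimators": [100 * i for i in range(1, 11)],    # 10 values
    "learning_rate": [0.01 * i for i in range(1, 11)],  # 10 values
}

# Exhaustive grid size is the product of the dimension sizes
total = len(list(product(*space.values())))
print(total)  # 10 * 10 * 10 = 1000 combinations
```

Capping the sweep (as the limit added later in this PR does) keeps the demo fast regardless of how large the space is.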
I think something small is totally fine. The goal here is to show how to dispatch the jobs, not how to do a good sweep. If you think it's best to do a small number of sweeps as an illustration then just make a note to say "this is a small example, in reality you would do more".
Yes, I think I'll modify this and make it run very quickly, so it's also easier for us to test whether things finish as expected.
This is ready for a final review. I left a comment about the sweep space, which we should definitely modify.
@@ -252,10 +256,20 @@ sweep_job = command_job_for_sweep.sweep(
    goal="Maximize",
)

+# setting a very small limit of trials for demo purposes
+sweep_job.set_limits(
@jacobtomlinson I added this to get only 3 trials, given that we have 3 GPU nodes in the cluster we created. Each job took between 20 and 40 minutes, but I think this depends on the characteristics of the job; sometimes it's 10-15 minutes. I also put a note below about the times.
If this is good, that's the last change I wanted to make. This should be good to go.
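Back-of-the-envelope math for the timing above: with 3 trials and 3 GPU nodes, all trials can run concurrently, so the sweep's wall-clock time is roughly one trial's duration. The 20-40 minute per-trial figures come from this thread; the arithmetic itself is just an illustration:

```python
import math

n_trials = 3                  # the limit set for demo purposes
n_nodes = 3                   # GPU nodes, i.e. max concurrent trials
per_trial_minutes = (20, 40)  # observed per-trial range from this thread

# Trials run in "waves" of up to n_nodes at a time
waves = math.ceil(n_trials / n_nodes)
wall_clock = tuple(waves * t for t in per_trial_minutes)
print(waves, wall_clock)  # 1 (20, 40): roughly 20-40 minutes end to end

# The same 3 trials on a single node would run back to back
serial = tuple(n_trials * t for t in per_trial_minutes)
print(serial)  # (60, 120): roughly 1-2 hours
```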
Great, thanks!
@jacobtomlinson this needs some input/answers, hence it's opened as a draft, but if you know the answer to any of these, feel free to either commit directly to the PR or leave a comment.