Access to current_training_job_name before .train() #5047

Open
discort opened this issue Feb 18, 2025 · 5 comments
Labels
component: training (Relates to the SageMaker Training Platform), type: feature request

Comments


discort commented Feb 18, 2025

Describe the feature you'd like
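I want to keep the training artifacts and TensorBoard logs for a training job in the same S3 folder. To build that path myself, I need access to the current training job name before calling .train().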

How would this feature be used? Please describe.
This feature would let me keep my artifacts and TensorBoard logs organized. For instance, I could easily find the logs for a run by its job name.

from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode, Compute, TensorBoardOutputConfig

image = "<image>"

source_code = SourceCode(
    source_dir="code",
    command="python train.py",
)
compute = Compute(
    instance_count=1,
    instance_type="ml.g5.8xlarge",
)

model_trainer = ModelTrainer(
    training_image=image,
    source_code=source_code,
    compute=compute,
)
# Desired: reuse the generated training job name (<base_job_name-timestamp>) in the path.
# That name is not available before .train() is called, hence this request.
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path="s3://<default_bucket>/<base_job_name>/<base_job_name-timestamp>/tensorboard",
    local_path="/opt/ml/output/tensorboard",
)
model_trainer = model_trainer.with_tensorboard_output_config(tensorboard_output_config)
model_trainer.train()

Desired result on S3:

- default_bucket
    - base_job_name
        - base_job_name-<timestamp>
            - artifacts
            - tensorboard

Describe alternatives you've considered
The only alternative that comes to mind is putting my own timestamp in base_job_name. The drawback is an unpleasant training job name like base_job_name-<my-timestamp>-<generated-timestamp>.
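For illustration, the workaround above might look roughly like the sketch below. Both helpers (make_base_job_name and tensorboard_path) are hypothetical names, not part of the SageMaker SDK, and the timestamp format is an arbitrary choice. The idea is that a caller-chosen timestamp makes base_job_name unique, so the TensorBoard path can be built before .train() is called.

```python
from datetime import datetime, timezone
from typing import Optional


def make_base_job_name(prefix: str, now: Optional[datetime] = None) -> str:
    """Hypothetical helper: append a caller-chosen UTC timestamp to the prefix."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}-{now.strftime('%Y%m%d-%H%M%S')}"


def tensorboard_path(bucket: str, base_job_name: str) -> str:
    """Hypothetical helper: TensorBoard folder under the unique base job name."""
    return f"s3://{bucket}/{base_job_name}/tensorboard"


# Fixed timestamp so the example is reproducible.
fixed = datetime(2025, 2, 18, 12, 0, 0, tzinfo=timezone.utc)
base_job_name = make_base_job_name("my-job", fixed)
tb_path = tensorboard_path("my-bucket", base_job_name)
print(base_job_name)
print(tb_path)
```

As described above, the SDK would still append its own generated timestamp to this base_job_name when creating the job, which is where the doubled suffix comes from.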

Additional context

Author

discort commented Feb 18, 2025

cc @benieric

rsareddy0329 added the component: model (Relates to SageMaker Model) label Feb 28, 2025
Contributor

benieric commented Feb 28, 2025

Hi @discort, I wonder if a better solution would be to make s3_output_path and local_path optional in TensorBoardOutputConfig.

By default, ModelTrainer could set s3_output_path to follow the same layout as the rest of the artifacts:

- default_bucket
    - base_job_name
        - base_job_name-<timestamp>
            - artifacts
            - tensorboard

The user could then pass a TensorBoardOutputConfig() directly, without having to set the paths explicitly.

model_trainer = ModelTrainer(
    training_image=image,
    source_code=source_code,
    compute=compute,
)

# Proposed usage: paths left unset, to be resolved by ModelTrainer at .train() time.
model_trainer = model_trainer.with_tensorboard_output_config(TensorBoardOutputConfig())
model_trainer.train()

This way the user would still be able to call .train() multiple times in a row. If we instead resolved the full unique training job name once, during initialization of ModelTrainer, .train() could only be called once, since every call would reuse the same job name.
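A minimal sketch of that idea, assuming the job name is resolved inside train() rather than in the constructor. TrainerSketch is a hypothetical stand-in and not the SDK's ModelTrainer; the suffix_fn parameter and the returned dictionary are assumptions made only to keep the example self-contained.

```python
import itertools
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List


def _timestamp_suffix() -> str:
    """Default suffix: a UTC timestamp (format chosen arbitrarily for this sketch)."""
    return datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S%f")


@dataclass
class TrainerSketch:
    """Hypothetical stand-in for ModelTrainer; not the SDK class."""

    bucket: str
    base_job_name: str
    suffix_fn: Callable[[], str] = _timestamp_suffix
    submitted: List[Dict[str, str]] = field(default_factory=list)

    def train(self) -> Dict[str, str]:
        # Resolve the unique job name per call, so repeated calls do not collide.
        job_name = f"{self.base_job_name}-{self.suffix_fn()}"
        root = f"s3://{self.bucket}/{self.base_job_name}/{job_name}"
        job = {
            "job_name": job_name,
            "artifacts": f"{root}/artifacts",
            "tensorboard": f"{root}/tensorboard",
        }
        self.submitted.append(job)
        return job


# A counter replaces the timestamp here so the example is deterministic.
counter = itertools.count(1)
trainer = TrainerSketch("my-bucket", "my-job", suffix_fn=lambda: f"{next(counter):03d}")
first = trainer.train()
second = trainer.train()
print(first["tensorboard"])
print(second["tensorboard"])
```

Each call gets its own job name and its own folder, which is the property that resolving the name in the constructor would lose.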

Author

discort commented Mar 3, 2025

Hi @benieric ,
Thanks for the response.

> I wonder if a better solution would be to make s3_output_path and local_path optional in TensorBoardOutputConfig.

I think that would cover the case I described for TensorBoard logs. But what if I want to apply the same rule to CheckpointConfig and OutputDataConfig? Could I have all the training artifacts under base_job_name-<timestamp>?

Contributor

benieric commented Mar 4, 2025

Yeah, if we went with a solution like this one, it would make sense to also make the S3 path optional for both OutputDataConfig and CheckpointConfig, and to have ModelTrainer resolve those paths under the same rules.
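One way the "optional path, resolved under the same rules" behavior could be sketched is shown below. PathConfigSketch and resolve are hypothetical and do not mirror the real OutputDataConfig, CheckpointConfig, or TensorBoardOutputConfig fields; the "checkpoints" subfolder name is likewise an assumption. An explicitly set path is kept as is, and only unset paths are filled in under the job's folder.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PathConfigSketch:
    """Hypothetical config with an optional S3 path; not an SDK class."""

    subfolder: str
    s3_path: Optional[str] = None


def resolve(config: PathConfigSketch, job_root: str) -> PathConfigSketch:
    """Fill in s3_path under job_root only when the user left it unset."""
    if config.s3_path is not None:
        return config
    return replace(config, s3_path=f"{job_root}/{config.subfolder}")


job_root = "s3://my-bucket/my-job/my-job-001"
output = resolve(PathConfigSketch("artifacts"), job_root)
checkpoints = resolve(PathConfigSketch("checkpoints"), job_root)
# An explicit path wins over the default.
custom = resolve(PathConfigSketch("tensorboard", s3_path="s3://other-bucket/tb"), job_root)
print(output.s3_path)
print(checkpoints.s3_path)
print(custom.s3_path)
```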

Author

discort commented Mar 6, 2025

That makes total sense to me.

benieric added the component: training (Relates to the SageMaker Training Platform) label and removed the component: model (Relates to SageMaker Model) label Mar 12, 2025