Access to current_training_job_name before .train() #5047

discort · 2025-02-18T10:03:50Z

Describe the feature you'd like
I want to keep training artifacts and TensorBoard logs for a training job in the same S3 folder.

How would this feature be used? Please describe.
This feature would let me keep my artifacts and TensorBoard logs organized. For instance, I could easily find my logs by job name.

from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode, Compute, TensorBoardOutputConfig

image = "<image>"

source_code = SourceCode(
    source_dir="code",
    command="python train.py",
)
compute = Compute(
    instance_count=1,
    instance_type="ml.g5.8xlarge",
)

model_trainer = ModelTrainer(
    training_image=image,
    source_code=source_code,
    compute=compute,
)
# Desired: this path needs the generated job name (<base_job_name-timestamp>),
# which is not available before .train() is called.
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path="s3://<default_bucket>/<base_job_name>/<base_job_name-timestamp>/tensorboard",
    local_path="/opt/ml/output/tensorboard",
)
model_trainer = model_trainer.with_tensorboard_output_config(tensorboard_output_config)
model_trainer.train()

Expected layout on S3:

- default_bucket
    - base_job_name
        - base_job_name-<timestamp>
            - artifacts
            - tensorboard

Describe alternatives you've considered
The only alternative that comes to mind is putting a timestamp in base_job_name. The drawback is an awkward training job name with two timestamps, such as base_job_name-<my-timestamp>-<generated-timestamp>.
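The naming drawback can be sketched in plain Python. This assumes the SDK appends its own timestamp suffix to base_job_name; the helper `with_generated_suffix`, the job name `my-job`, and the timestamp format are hypothetical stand-ins, not the SDK's actual logic.

```python
from datetime import datetime, timezone


def with_generated_suffix(base_job_name: str, now: datetime) -> str:
    """Hypothetical stand-in for the SDK appending a timestamp to base_job_name."""
    return f"{base_job_name}-{now.strftime('%Y%m%d%H%M%S')}"


# Workaround: bake a timestamp into base_job_name so the S3 prefix is known up front.
my_ts = datetime(2025, 2, 18, 10, 0, 0, tzinfo=timezone.utc)
base_job_name = f"my-job-{my_ts.strftime('%Y%m%d%H%M%S')}"

# A generated suffix is still appended at train time (assumed),
# so the final job name carries two timestamps.
generated_ts = datetime(2025, 2, 18, 10, 0, 5, tzinfo=timezone.utc)
job_name = with_generated_suffix(base_job_name, generated_ts)
print(job_name)
```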

Additional context

discort · 2025-02-18T10:04:57Z

cc @benieric

benieric · 2025-02-28T23:45:18Z

Hi @discort, I wonder if a better solution would be to make s3_output_path and local_path optional on TensorBoardOutputConfig.

By default, ModelTrainer could set s3_output_path to follow the same contract as the rest of the artifacts, like:

- default_bucket
    - base_job_name
        - base_job_name-<timestamp>
            - artifacts
            - tensorboard

So the user could pass a TensorBoardOutputConfig() directly, without having to set the paths explicitly.

model_trainer = ModelTrainer(
    training_image=image,
    source_code=source_code,
    compute=compute,
)

model_trainer = model_trainer.with_tensorboard_output_config(TensorBoardOutputConfig())
model_trainer.train()

This way the user would still be able to call .train() multiple times consecutively. If we resolved the full unique training job name once during initialization of ModelTrainer, .train() could only be called once.
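A minimal sketch of that idea, in plain Python rather than the SDK: the unique job name and any unset paths are resolved inside each train() call, so repeated calls get distinct prefixes. `TensorBoardConfigSketch` and `TrainerSketch` are hypothetical names, the counter stands in for the generated timestamp, and none of this is the actual ModelTrainer implementation.

```python
from dataclasses import dataclass, replace
from itertools import count
from typing import Optional, Tuple


@dataclass(frozen=True)
class TensorBoardConfigSketch:
    """Hypothetical config whose paths are optional, as proposed above."""
    s3_output_path: Optional[str] = None
    local_path: Optional[str] = None


class TrainerSketch:
    """Hypothetical trainer; not the real ModelTrainer."""

    def __init__(self, bucket: str, base_job_name: str,
                 tensorboard: Optional[TensorBoardConfigSketch] = None):
        self.bucket = bucket
        self.base_job_name = base_job_name
        self.tensorboard = tensorboard
        self._suffix = count(1)  # stands in for the generated timestamp

    def train(self) -> Tuple[str, Optional[TensorBoardConfigSketch]]:
        # Resolve the unique job name per call, not once at initialization.
        job_name = f"{self.base_job_name}-{next(self._suffix):04d}"
        prefix = f"s3://{self.bucket}/{self.base_job_name}/{job_name}"
        resolved = None
        if self.tensorboard is not None:
            # Fill in only the paths the user left unset; explicit values win.
            resolved = replace(
                self.tensorboard,
                s3_output_path=self.tensorboard.s3_output_path or f"{prefix}/tensorboard",
                local_path=self.tensorboard.local_path or "/opt/ml/output/tensorboard",
            )
        return job_name, resolved


trainer = TrainerSketch("my-bucket", "my-job", TensorBoardConfigSketch())
first_name, first_tb = trainer.train()
second_name, second_tb = trainer.train()
print(first_tb.s3_output_path)
print(second_tb.s3_output_path)
```

Because the user's config is copied rather than mutated, the second call is not stuck with the first call's resolved path.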

discort · 2025-03-03T21:59:24Z

Hi @benieric,
Thanks for the response.

> I wonder if a better solution would be to make s3_output_path and local_path optional on TensorBoardOutputConfig.

I think that could resolve the case I described for TensorBoard logs. But what if I want to apply the same rule to CheckpointConfig and OutputDataConfig? Could I have all the training artifacts under base_job_name-<timestamp>?

benieric · 2025-03-04T02:06:29Z

Yeah, so if we went with a solution like this one, it would make sense to also make the S3 path optional for both OutputDataConfig and CheckpointConfig, and have ModelTrainer resolve the paths under the same rules.
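As a rough illustration of "the same rules", the sketch below derives all three locations from one job-scoped prefix and lets explicit user paths override the defaults. The function `resolve_s3_paths`, its dictionary keys, and the `checkpoints` folder name are assumptions made for this example; only the `artifacts` and `tensorboard` folders come from the layout discussed above.

```python
from typing import Dict, Optional


def resolve_s3_paths(
    bucket: str,
    base_job_name: str,
    job_name: str,
    overrides: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Hypothetical resolver: defaults share one prefix, explicit paths win."""
    prefix = f"s3://{bucket}/{base_job_name}/{job_name}"
    paths = {
        "output_data": f"{prefix}/artifacts",
        "checkpoints": f"{prefix}/checkpoints",  # folder name is an assumption
        "tensorboard": f"{prefix}/tensorboard",
    }
    # Any path the user set explicitly replaces the computed default.
    paths.update(overrides or {})
    return paths


defaults = resolve_s3_paths("my-bucket", "my-job", "my-job-0001")
custom = resolve_s3_paths(
    "my-bucket", "my-job", "my-job-0001",
    overrides={"checkpoints": "s3://other-bucket/ckpt"},
)
print(defaults["tensorboard"])
print(custom["checkpoints"])
```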

discort · 2025-03-06T20:34:51Z

That makes total sense to me.

rsareddy0329 added the component: model Relates to SageMaker Model label Feb 28, 2025

benieric added component: training Relates to the SageMaker Training Platform and removed component: model Relates to SageMaker Model labels Mar 12, 2025

rsareddy0329 added the type: feature request label Mar 13, 2025
