Ability to ignore specific files/folders in ModelTrainer's script mode #5091

discort · 2025-03-18T09:45:35Z

Describe the feature you'd like
I do not want having .git, .env, .vscode, data, __pycache__ or any irrelevant files/folders to be uploaded to S3 artifacts when I use script mode of ModelTrainer in SourceCode. Moreover, coping source_dir may be time-consuming due to the large number of files, such as .git or/and .env.

How would this feature be used? Please describe.
During development/sanity-checing on local machine I have some unnecessary files/folders. The idea is to not upload them during using a script mode.
Let's say I have project structure:

tree
.
├── README.md
├── .git
├── pipeline1
│       ├── train.py
│       ├── data
│       ├── .env
│       ├── __pycache__
│       │       ├── __init__.cpython-310.pyc
├── pipeline2
│       ├── train.py
│       ├── data
│       ├── __pycache__
│       │       ├── __init__.cpython-310.pyc

from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode, Compute

image = "<image>"

source_code = SourceCode(
    source_dir="pipeline1",
    command="python train.py"
)

# or 
# source_code = SourceCode(
# source_dir=".",
# command="python -m pipeline1.train")

compute = Compute(
   instance_count=1,
   instance_type="ml.g5.8xlarge"
)

model_trainer = ModelTrainer(
    training_image=image,
    source_code=source_code,
     compute=compute,
)
model_trainer.train()

expected result:
s3://<default_bucket_path>/<base_job_name>/input/code/ w/o ignored files/folders

Describe alternatives you've considered

Create TempDir, copy script w/o unnecessary files and pass it as source_dir.
Re-design the project structure to have unwanted files/dirs outside the script you want to upload. (
Always use BYOC instead script mode (isn't practical for some cases, e.g. sanity-check)

Additional context
Add any other context or screenshots about the feature request here.

The text was updated successfully, but these errors were encountered:

benieric · 2025-03-19T23:27:52Z

Seems similar request to this issue we have received in the past - #2875

discort · 2025-03-20T14:51:47Z

Thanks for a response @benieric .

The proposal in #2875 looks fine to me. It seems that using a .sagemakerignore file - similar to a .gitignore or .dockerignore - could be an effective solution.

In my case, I rely exclusively on BYOC mode. I also was tested script mode, but I rejected it because it too long to execute. For example, every training run copies a large number of files, resulting in over 30 minutes to upload the necessary folder to S3. Just FYI what I have locally

> find .git -type f | wc -l

    3384

> find .env -type f | wc -l

   81047

discort · 2025-04-07T08:48:01Z

Hi @benieric ,
What do you think to apply this kind of solution? It should work. Any potential drawbacks?

ignore_patterns = ['.env', '.git', 'data', '__pycache__']
source_code = SourceCode(source_dir="source", entry_script="train.py", ignore_patterns=ignore_patterns)
...

if self.source_code.source_dir:
    source_code_channel = self.create_input_data_channel(
        channel_name=SM_CODE,
        data_source=self.source_code.source_dir,
        key_prefix=input_data_key_prefix,
        ignore_patterns=self.source_code.ignore_patterns,
    )

...
def create_input_data_channel(
        self,
        channel_name: str,
        data_source: DataSourceType,
        key_prefix: Optional[str] = None,
        ignore_patterns: Optional[List[str]] = None,
    ) -> Channel:
    ...
    tmp_dir = TemporaryDirectory()
    shutil.copytree(
        data_source,
        os.path.join(tmp_dir.name, data_source),
        dirs_exist_ok=True,
        ignore=shutil.ignore_patterns(*ignore_patterns)
    )
    s3_uri = self.sagemaker_session.upload_data(
        path=tmp_dir.name,
        bucket=self.sagemaker_session.default_bucket(),
        key_prefix=key_prefix,
    )

nargokul added type: feature request component: training Relates to the SageMaker Training Platform labels Mar 19, 2025

nargokul added the component: pysdk-team Related to SageMaker Python SDK Core Issues label Apr 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ability to ignore specific files/folders in ModelTrainer's script mode #5091

Ability to ignore specific files/folders in ModelTrainer's script mode #5091

discort commented Mar 18, 2025

benieric commented Mar 19, 2025

discort commented Mar 20, 2025

discort commented Apr 7, 2025

Ability to ignore specific files/folders in ModelTrainer's script mode #5091

Ability to ignore specific files/folders in ModelTrainer's script mode #5091

Comments

discort commented Mar 18, 2025

benieric commented Mar 19, 2025

discort commented Mar 20, 2025

discort commented Apr 7, 2025