Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to ignore specific files/folders in ModelTrainer's script mode #5091

Open
discort opened this issue Mar 18, 2025 · 3 comments
Open
Labels
component: pysdk-team Related to SageMaker Python SDK Core Issues component: training Relates to the SageMaker Training Platform type: feature request

Comments

@discort
Copy link

discort commented Mar 18, 2025

Describe the feature you'd like
I do not want having .git, .env, .vscode, data, __pycache__ or any irrelevant files/folders to be uploaded to S3 artifacts when I use script mode of ModelTrainer in SourceCode. Moreover, coping source_dir may be time-consuming due to the large number of files, such as .git or/and .env.

How would this feature be used? Please describe.
During development/sanity-checing on local machine I have some unnecessary files/folders. The idea is to not upload them during using a script mode.
Let's say I have project structure:

tree
.
├── README.md
├── .git
├── pipeline1
│       ├── train.py
│       ├── data
│       ├── .env
│       ├── __pycache__
│       │       ├── __init__.cpython-310.pyc
├── pipeline2
│       ├── train.py
│       ├── data
│       ├── __pycache__
│       │       ├── __init__.cpython-310.pyc
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode, Compute

image = "<image>"

source_code = SourceCode(
    source_dir="pipeline1",
    command="python train.py"
)

# or 
# source_code = SourceCode(
# source_dir=".",
# command="python -m pipeline1.train")

compute = Compute(
   instance_count=1,
   instance_type="ml.g5.8xlarge"
)

model_trainer = ModelTrainer(
    training_image=image,
    source_code=source_code,
     compute=compute,
)
model_trainer.train()

expected result:
s3://<default_bucket_path>/<base_job_name>/input/code/ w/o ignored files/folders

Describe alternatives you've considered

  1. Create TempDir, copy script w/o unnecessary files and pass it as source_dir.
  2. Re-design the project structure to have unwanted files/dirs outside the script you want to upload. (
  3. Always use BYOC instead script mode (isn't practical for some cases, e.g. sanity-check)

Additional context
Add any other context or screenshots about the feature request here.

@nargokul nargokul added type: feature request component: training Relates to the SageMaker Training Platform labels Mar 19, 2025
@benieric
Copy link
Contributor

Seems similar request to this issue we have received in the past - #2875

@discort
Copy link
Author

discort commented Mar 20, 2025

Thanks for a response @benieric .

The proposal in #2875 looks fine to me. It seems that using a .sagemakerignore file - similar to a .gitignore or .dockerignore - could be an effective solution.

In my case, I rely exclusively on BYOC mode. I also was tested script mode, but I rejected it because it too long to execute. For example, every training run copies a large number of files, resulting in over 30 minutes to upload the necessary folder to S3. Just FYI what I have locally

> find .git -type f | wc -l

    3384

> find .env -type f | wc -l

   81047

@discort
Copy link
Author

discort commented Apr 7, 2025

Hi @benieric ,
What do you think to apply this kind of solution? It should work. Any potential drawbacks?

ignore_patterns = ['.env', '.git', 'data', '__pycache__']
source_code = SourceCode(source_dir="source", entry_script="train.py", ignore_patterns=ignore_patterns)
...

if self.source_code.source_dir:
    source_code_channel = self.create_input_data_channel(
        channel_name=SM_CODE,
        data_source=self.source_code.source_dir,
        key_prefix=input_data_key_prefix,
        ignore_patterns=self.source_code.ignore_patterns,
    )

...
def create_input_data_channel(
        self,
        channel_name: str,
        data_source: DataSourceType,
        key_prefix: Optional[str] = None,
        ignore_patterns: Optional[List[str]] = None,
    ) -> Channel:
    ...
    tmp_dir = TemporaryDirectory()
    shutil.copytree(
        data_source,
        os.path.join(tmp_dir.name, data_source),
        dirs_exist_ok=True,
        ignore=shutil.ignore_patterns(*ignore_patterns)
    )
    s3_uri = self.sagemaker_session.upload_data(
        path=tmp_dir.name,
        bucket=self.sagemaker_session.default_bucket(),
        key_prefix=key_prefix,
    )

@nargokul nargokul added the component: pysdk-team Related to SageMaker Python SDK Core Issues label Apr 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: pysdk-team Related to SageMaker Python SDK Core Issues component: training Relates to the SageMaker Training Platform type: feature request
Projects
None yet
Development

No branches or pull requests

3 participants