Add Batch Sampling Pipeline with Integration Tests for GFS Data #57

Closed

siddharth7113
Contributor

Pull Request

Description

This PR introduces a new batch sampling pipeline for the GFS dataset. The changes include:

  • batch_samples.py
    Implements the logic to load and preprocess GFS data using GFSDataSampler, wraps the data in a PyTorch DataLoader with an identity collate function, and processes/saves batches using multiprocessing. This module ensures that data is processed in batches, facilitating downstream training or analysis tasks.

  • batch_utils.py
    Encapsulates the batch saving functionality with the BatchSaveFunc class and process_and_save_batches function. This makes the batch saving logic modular and reusable.

  • test_batch_samples.py
    Adds an integration test that runs the end-to-end batching process on a small subset of real data. The test verifies that the pipeline correctly processes and saves batches.

These changes help improve data processing efficiency and modularity in the repository. This PR fixes issue #8.
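
For readers unfamiliar with the pattern, the identity-collate and batch-saving steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ToySamples` is a hypothetical stand-in for GFSDataSampler, and `save_batches` is an assumed helper name rather than the actual BatchSaveFunc or process_and_save_batches.

```python
import tempfile
from pathlib import Path

import torch
from torch.utils.data import DataLoader, Dataset


class ToySamples(Dataset):
    """Hypothetical stand-in for GFSDataSampler: each item is a dict of tensors."""

    def __init__(self, n_samples: int = 10) -> None:
        self.n_samples = n_samples

    def __len__(self) -> int:
        return self.n_samples

    def __getitem__(self, idx: int) -> dict:
        return {"idx": torch.tensor(idx), "nwp": torch.full((2, 3), float(idx))}


def identity_collate(batch: list) -> list:
    """Return the list of samples unchanged instead of stacking them."""
    return batch


def save_batches(loader: DataLoader, out_dir: Path) -> list:
    """Write each batch to out_dir as batch_<8-digit index>.pt (assumed helper)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, batch in enumerate(loader):
        path = out_dir / f"batch_{i:08d}.pt"
        torch.save(batch, path)
        paths.append(path)
    return paths


if __name__ == "__main__":
    loader = DataLoader(ToySamples(10), batch_size=4, collate_fn=identity_collate)
    with tempfile.TemporaryDirectory() as tmp:
        for p in save_batches(loader, Path(tmp)):
            print(p.name, len(torch.load(p)))
```

With an identity collate function, each saved batch is a plain list of per-sample dicts, so any stacking or padding is left to whatever consumes the files.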

How Has This Been Tested?

  • Executed the integration test on a small subset of real public data (using a known public S3 path and ensuring anonymous access) to verify the end-to-end pipeline.
  • Verified that the output files are correctly named (e.g., batch_00000000.pt) and contain the expected batch data.
  • Performed a sanity check by manually inspecting log outputs and sample batch files.
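
The file-naming check above could be automated along these lines. This is only a sketch: the `check_batch_files` helper is hypothetical, and the rule that indices run contiguously from 0 is an assumption inferred from the `batch_00000000.pt` example, not something stated in this PR.

```python
import re
import tempfile
from pathlib import Path

# Assumed naming scheme: "batch_" + 8-digit zero-padded index + ".pt".
BATCH_NAME = re.compile(r"^batch_(\d{8})\.pt$")


def check_batch_files(out_dir: Path) -> list:
    """Return the sorted batch indices found in out_dir.

    Raises ValueError if a .pt file is misnamed or if the indices
    do not run contiguously from 0 (an assumed convention).
    """
    indices = []
    for path in sorted(out_dir.glob("*.pt")):
        match = BATCH_NAME.match(path.name)
        if match is None:
            raise ValueError(f"unexpected file name: {path.name}")
        indices.append(int(match.group(1)))
    if indices != list(range(len(indices))):
        raise ValueError(f"batch indices are not contiguous from 0: {indices}")
    return indices


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = Path(tmp)
        for i in range(3):
            (out_dir / f"batch_{i:08d}.pt").touch()  # empty placeholder files
        print(check_batch_files(out_dir))
```

This only validates names; checking the contents would additionally require loading each file and comparing it with the expected batch data.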


Added detailed documentation, improved logging, structured main block for better testing & debugging
- Implement batch_samples.py to load and preprocess GFS data via the GFSDataSampler,
  wrap it in a PyTorch DataLoader with a custom identity collate function,
  and process/save batches using multiprocessing (forkserver and file_system sharing).
- Implement batch_utils.py to encapsulate batch saving logic via BatchSaveFunc
  and process_and_save_batches for writing batches to disk.
- Add test_batch_samples.py with unit tests for the collate function and batch
  saving functions, plus an integration test that runs the end-to-end batching process
  on a small subset of real data.
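
As a rough illustration of the multiprocessing setup mentioned in the first bullet (forkserver start method and file_system sharing), a DataLoader could be configured as below. This is a sketch under assumptions, not the code in batch_samples.py: `make_loader` and its defaults are hypothetical, and the forkserver start method is only available on Unix-like systems.

```python
import multiprocessing

import torch
import torch.multiprocessing as torch_mp
from torch.utils.data import DataLoader


def identity_collate(batch: list) -> list:
    """Return the list of samples unchanged instead of stacking them."""
    return batch


def make_loader(dataset, batch_size: int = 4, num_workers: int = 2) -> DataLoader:
    """Build a multi-worker DataLoader (hypothetical helper)."""
    # Share tensors between processes through files rather than file
    # descriptors, which helps on systems with a low open-file limit.
    torch_mp.set_sharing_strategy("file_system")
    return DataLoader(
        dataset,
        batch_size=batch_size,
        collate_fn=identity_collate,
        num_workers=num_workers,  # must be > 0 when a context is given
        # Workers are started from a clean server process (Unix only).
        multiprocessing_context=multiprocessing.get_context("forkserver"),
    )
```

Worker processes are only started when the loader is iterated, so building the loader is cheap; the iteration itself should sit under an `if __name__ == "__main__":` guard when forkserver is used.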
@siddharth7113
Contributor Author

@peterdudfield I have tried to make a custom implementation, taking inspiration from the one here. Please review and suggest any changes if necessary.
Thanks.

@siddharth7113
Contributor Author

@peterdudfield,

Should I close the PR? Since our last conversation, I am guessing we could use ocf-data-sampler instead to convert these into the required formats.

@peterdudfield
Contributor

Yea, close it for now, and we can always reopen it if needed

@siddharth7113
Contributor Author

OK, sure. Also, if possible, could you please let me know whether there are any issues that need to be worked on in open-data-pvnet? I think @jcamier and @alirashidAR are working with Met Office data, and somebody is working on getting the ML models up. As of now, I can't see anything else that needs work. If there is, please add it to the issues.
Thanks,
Siddharth

@peterdudfield
Contributor

Do we have a working pipeline using ocf-data-sampler? Could that be made an issue?

@siddharth7113
Contributor Author

Sorry, I didn't understand what is meant by "pipeline". Could you please elaborate on it a little?

Does it mean something like this?

NWP data --> ocf-data-sampler --> data-sampler-format --> input to model training 

@peterdudfield
Contributor

Yea, we should use ocf-data-sampler to take NWP and PVLive data and make it into samples ready for ML training.

@siddharth7113
Contributor Author

Sorry, I might be a little confused here. In this PR, I was converting GFS Zarr to the sampler format using ocf-data-sampler for ML training, but from our meeting I inferred that, instead of a custom implementation, there would be a unified API in ocf-data-sampler, where we would simply write a script and it would convert any Zarr format into the sampler format without needing to implement normalization and other functions separately.

Isn't that the plan? I might have a flawed understanding of things here, so please correct me if I got anything wrong.

@peterdudfield
Contributor

Yea, that's the plan

@siddharth7113
Contributor Author

OK, nice. So I am guessing this would need to be worked on from the ocf-data-sampler side. I will head over there and see if I can contribute or help in some way!
