Add Batch Sampling Pipeline with Integration Tests for GFS Data #57

siddharth7113 · 2025-02-06T03:28:24Z

Pull Request

Description

This PR introduces a new batch sampling pipeline for the GFS dataset. The changes include:

- batch_samples.py: Implements the logic to load and preprocess GFS data using GFSDataSampler, wraps the data in a PyTorch DataLoader with an identity collate function, and processes and saves batches using multiprocessing. This module ensures that data is processed in batches, which facilitates downstream training or analysis tasks.
- batch_utils.py: Encapsulates the batch saving functionality in the BatchSaveFunc class and the process_and_save_batches function. This makes the batch saving logic modular and reusable.
- test_batch_samples.py: Adds an integration test that runs the end-to-end batching process on a small subset of real data. The test verifies that the pipeline correctly processes and saves batches.

These changes improve data processing efficiency and modularity in the repository. This PR fixes issue #8.
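
As a rough illustration of how the pieces described above might fit together, here is a minimal sketch. It is not the code in this PR: `ToyDataset` is a hypothetical stand-in for `GFSDataSampler`, and the constructor and call signatures of `BatchSaveFunc` and `process_and_save_batches` are assumptions made for the example. Only the names and the `batch_00000000.pt` naming pattern come from the description.

```python
from pathlib import Path

import torch
from torch.utils.data import DataLoader, Dataset


def identity_collate(batch):
    """Return the list of samples unchanged instead of stacking them."""
    return batch


class BatchSaveFunc:
    """Callable that writes one batch to disk (signature is an assumption)."""

    def __init__(self, out_dir):
        self.out_dir = Path(out_dir)
        self.out_dir.mkdir(parents=True, exist_ok=True)

    def __call__(self, batch, batch_num):
        # Zero-padded, 8-digit index, e.g. batch_00000000.pt
        path = self.out_dir / f"batch_{batch_num:08d}.pt"
        torch.save(batch, path)
        return path


def process_and_save_batches(loader, save_func, num_batches):
    """Save at most num_batches batches from the loader and return their paths."""
    paths = []
    for i, batch in enumerate(loader):
        if i >= num_batches:
            break
        paths.append(save_func(batch, i))
    return paths


class ToyDataset(Dataset):
    """Hypothetical stand-in for GFSDataSampler: four small dict samples."""

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return {"nwp": torch.full((2, 3), float(idx))}


if __name__ == "__main__":
    loader = DataLoader(ToyDataset(), batch_size=2, collate_fn=identity_collate)
    saved = process_and_save_batches(loader, BatchSaveFunc("batches"), num_batches=2)
    print([p.name for p in saved])
```

With the identity collate function, each saved file holds a plain list of sample dicts rather than stacked tensors, which keeps the saving step independent of the sample structure.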

How Has This Been Tested?

Executed the integration test on a small subset of real public data (using a known public S3 path and ensuring anonymous access) to verify the end-to-end pipeline.
Verified that the output files are correctly named (e.g., batch_00000000.pt, etc.) and contain the expected batch data.
Performed a sanity check by manually inspecting log outputs and sample batch files.
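
The file-name check described above could be automated along these lines. This is a sketch using only the standard library; the helper name `check_batch_files` is hypothetical, and it assumes batch numbering starts at 0 and has no gaps, which the description suggests but does not state.

```python
import re
from pathlib import Path

# Matches names such as batch_00000000.pt (8-digit, zero-padded index).
BATCH_NAME = re.compile(r"^batch_(\d{8})\.pt$")


def check_batch_files(out_dir):
    """Return the sorted batch indices found in out_dir.

    Raises ValueError if a file name does not match the expected pattern
    or if the indices are not contiguous starting from 0 (an assumption).
    """
    indices = []
    for path in Path(out_dir).iterdir():
        match = BATCH_NAME.match(path.name)
        if match is None:
            raise ValueError(f"unexpected file name: {path.name}")
        indices.append(int(match.group(1)))
    indices.sort()
    if indices != list(range(len(indices))):
        raise ValueError(f"batch indices are not contiguous from 0: {indices}")
    return indices
```

This only validates names; checking that each file contains the expected batch data would still require loading it.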

Checklist:

My code follows [OCF's coding style guidelines](https://github.com/openclimatefix/.github/blob/main/coding_style.md)
I have performed a self-review of my own code
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
I have checked my code and corrected any misspellings


Added detailed documentation, improved logging, structured main block for better testing & debugging

- Implement batch_samples.py to load and preprocess GFS data via the GFSDataSampler, wrap it in a PyTorch DataLoader with a custom identity collate function, and process/save batches using multiprocessing (forkserver and file_system sharing).
- Implement batch_utils.py to encapsulate batch saving logic via BatchSaveFunc and process_and_save_batches for writing batches to disk.
- Add test_batch_samples.py with unit tests for the collate function and batch saving functions, plus an integration test that runs the end-to-end batching process on a small subset of real data.
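
For readers unfamiliar with the "forkserver and file_system sharing" setup mentioned in the commit message, the configuration might look like the following sketch. The function name `configure_multiprocessing` is hypothetical, and where the PR actually makes these calls is an assumption. The forkserver start method is only available on Unix-like systems.

```python
import torch.multiprocessing as mp


def configure_multiprocessing():
    """Set the start method and tensor sharing strategy for worker processes."""
    # forkserver: new workers are forked from a dedicated server process
    # rather than from the (possibly multi-threaded) main process.
    mp.set_start_method("forkserver", force=True)
    # file_system: shared-memory regions are identified by file name rather
    # than by open file descriptors, which can help avoid "too many open
    # files" errors when many tensors pass between processes.
    mp.set_sharing_strategy("file_system")


if __name__ == "__main__":
    configure_multiprocessing()
    print(mp.get_sharing_strategy())
```

These calls would normally run once, before any DataLoader with `num_workers > 0` is created.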

siddharth7113 · 2025-02-06T03:32:46Z

@peterdudfield I have tried to make a custom implementation, taking inspiration from here. Please review and suggest any changes if necessary.
Thanks.

siddharth7113 · 2025-02-20T16:23:44Z

@peterdudfield,

Should I close this PR? From our last conversation, I am guessing we could use ocf-data-sampler instead to convert these into the required formats.

peterdudfield · 2025-02-20T16:30:48Z

Yea, close it for now, and we can always reopen it if needed

siddharth7113 · 2025-02-20T16:39:52Z

> Yea, close it for now, and we can always reopen it if needed

OK, sure. Also, if possible, could you please let me know if there are any issues that need to be worked on in open-data-pvnet? I think @jcamier and @alirashidAR are working with Met Office data, and somebody is working on getting the ML models up. As of now, I can't see anything else that needs work. If there is, please add it to the issues.
Thanks,
Siddharth

peterdudfield · 2025-02-20T16:59:58Z

> > Yea, close it for now, and we can always reopen it if needed
>
> OK, sure. Also, if possible, could you please let me know if there are any issues that need to be worked on in open-data-pvnet? I think @jcamier and @alirashidAR are working with Met Office data, and somebody is working on getting the ML models up. As of now, I can't see anything else that needs work. If there is, please add it to the issues. Thanks, Siddharth

Do we have a working pipeline using ocf-data-sampler? Could that be made an issue?

siddharth7113 · 2025-02-20T17:05:01Z

> > > Yea, close it for now, and we can always reopen it if needed
> >
> > OK, sure. Also, if possible, could you please let me know if there are any issues that need to be worked on in open-data-pvnet? I think @jcamier and @alirashidAR are working with Met Office data, and somebody is working on getting the ML models up. As of now, I can't see anything else that needs work. If there is, please add it to the issues. Thanks, Siddharth
>
> Do we have a working pipeline using ocf-data-sampler? Could that be made an issue?

Sorry, I didn't understand what is meant by pipeline. Could you please elaborate a little on it?

Does it mean something like this?

NWP data --> ocf-data-sampler --> data-sampler-format --> input to model training

peterdudfield · 2025-02-20T17:07:24Z

Yea, we should use ocf-data-sampler to take NWP and PVLive data and turn it into samples ready for ML training

siddharth7113 · 2025-02-20T17:16:21Z

> Yea, we should use ocf-data-sampler to take NWP and PVLive data and turn it into samples ready for ML training

Sorry, I might be a little confused here. In this PR, I was converting GFS Zarr data to the sampler format using ocf-data-sampler for ML training. But from our meeting, I inferred that instead of a custom implementation, there would be a unified API in ocf-data-sampler, where we would simply write a script and it would convert any Zarr format into the sampler format without needing to implement normalization and other functions separately.

Isn't that the plan? I might have a flawed understanding of things here; please correct me if I got anything wrong.

peterdudfield · 2025-02-20T18:12:08Z

Yea, that's the plan

siddharth7113 · 2025-02-20T18:19:20Z

> Yea, that's the plan

OK, nice. So I am guessing this would need to be worked on from the ocf-data-sampler side. I will head over there and see if I can contribute or help in some way!

siddharth7113 added 3 commits February 6, 2025 08:38

Refactor GFS Data Sampler

d3774ee

Added detailed documentation, improved logging, structured main block for better testing & debugging

Formatted

eac0918

siddharth7113 closed this Feb 20, 2025
