Add Batch Sampling Pipeline with Integration Tests for GFS Data #57
Conversation
Added detailed documentation, improved logging, structured main block for better testing & debugging
- Implement batch_samples.py to load and preprocess GFS data via the GFSDataSampler, wrap it in a PyTorch DataLoader with a custom identity collate function, and process/save batches using multiprocessing (forkserver and file_system sharing).
- Implement batch_utils.py to encapsulate batch saving logic via BatchSaveFunc and process_and_save_batches for writing batches to disk.
- Add test_batch_samples.py with unit tests for the collate function and batch saving functions, plus an integration test that runs the end-to-end batching process on a small subset of real data.
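A minimal sketch of the identity-collate DataLoader setup described above. The toy dataset here is a stand-in for GFSDataSampler (which is not shown in this thread), and num_workers=0 keeps the sketch simple; the multiprocessing settings from the PR appear only as comments:

```python
import torch
from torch.utils.data import DataLoader, Dataset

def identity_collate(batch):
    """Return the list of samples as-is instead of stacking them into tensors."""
    return batch

class _ToySampler(Dataset):
    """Stand-in for GFSDataSampler: yields dict samples like the real sampler."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return {"gfs": torch.zeros(2, 3), "idx": idx}

# The PR additionally configures worker processes before loading, roughly:
#   torch.multiprocessing.set_start_method("forkserver")
#   torch.multiprocessing.set_sharing_strategy("file_system")
loader = DataLoader(_ToySampler(), batch_size=4, num_workers=0,
                    collate_fn=identity_collate)
first_batch = next(iter(loader))
# With the identity collate, a batch is a plain list of sample dicts,
# not a dict of stacked tensors.
```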
@peterdudfield I have tried to make a custom implementation taking inspiration from here. Please review and suggest any changes if necessary.
Should I close the PR? Since our last conversation, I am guessing we could use ocf-data-sampler instead to convert these into the required formats.
Yea, close it for now, and we can always reopen it if needed.
OK, sure. Also, if possible, could you please update me if there are any issues that need to be worked on?
Do we have a working pipeline using ocf-data-sampler? Could that be made an issue?
Sorry, I didn't understand what is meant by "pipeline". Could you please elaborate a little bit on it? Does it mean something like this?
Yea, we should use ocf-data-sampler to take NWP and PVLive data and make it into samples ready for ML training.
Sorry, I might be a little confused here, but in this PR I was converting GFS zarr to sampler format. Isn't that the plan? I might have a flawed understanding of things here, please correct me if I got anything wrong.
Yea, that's the plan.
OK, nice. So I am guessing this is what would need to be worked on.
Pull Request
Description
This PR introduces a new batch sampling pipeline for the GFS dataset. The changes include:
batch_samples.py
Implements the logic to load and preprocess GFS data using GFSDataSampler, wraps the data in a PyTorch DataLoader with an identity collate function, and processes/saves batches using multiprocessing. This module ensures that data is processed in batches, facilitating downstream training or analysis tasks.
batch_utils.py
Encapsulates the batch saving functionality with the BatchSaveFunc class and the process_and_save_batches function. This makes the batch saving logic modular and reusable.
test_batch_samples.py
Adds an integration test that runs the end-to-end batching process on a small subset of real data, verifying that the pipeline correctly processes and saves batches.
These changes help improve data processing efficiency and modularity in the repository. This PR fixes issue #8
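The end-to-end check in test_batch_samples.py can be sketched along these lines. The helper below is a hypothetical stand-in for the PR's real pipeline (which goes through GFSDataSampler and a DataLoader); the shape of the assertion, run on a tiny input and check the written files, is the point:

```python
import os
import pickle
import tempfile

def _save_batches(batches, output_dir):
    """Stand-in for the PR's process_and_save_batches."""
    for i, batch in enumerate(batches):
        with open(os.path.join(output_dir, f"batch_{i:08d}.pt"), "wb") as f:
            pickle.dump(batch, f)

def test_end_to_end_batching():
    # Two tiny batches stand in for a small subset of real GFS samples.
    small_subset = [[{"x": 1}], [{"x": 2}]]
    with tempfile.TemporaryDirectory() as out_dir:
        _save_batches(small_subset, out_dir)
        written = sorted(os.listdir(out_dir))
        # Files are named batch_00000000.pt, batch_00000001.pt, ...
        assert written == ["batch_00000000.pt", "batch_00000001.pt"]
        with open(os.path.join(out_dir, written[0]), "rb") as f:
            assert pickle.load(f) == [{"x": 1}]

test_end_to_end_batching()
```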
How Has This Been Tested?
Ran the integration test on a small subset of real data and verified that the output files are created (batch_00000000.pt, etc.) and contain the expected batch data.
Checklist: