
Identify open source gridded NWP #3

Open
Tracked by #5
peterdudfield opened this issue Dec 4, 2024 · 17 comments

Comments

@peterdudfield
Contributor

peterdudfield commented Dec 4, 2024

Identify an open source gridded NWP that is already in Zarr format. We need to make sure it has enough variables for solar forecasts and enough years (>2) of data. Note that OCF publish satellite data already, which could be used. We try to use at least 1 year for training and 1 year for testing. It would be better to have more like 5 years of training, but let's see what we can do.

@peterdudfield peterdudfield changed the title to Identify open source gridded NWP Dec 4, 2024
@peterdudfield peterdudfield transferred this issue from openclimatefix/PVNet Dec 4, 2024
@peterdudfield peterdudfield moved this to Todo in Open Data PVNet Dec 5, 2024
@jacobbieker
Member

While reanalysis data is not ideal for this, there are now both the ARCO-ERA5 and UFS Replay datasets, both in Zarr, both going back quite a few years. A lot of weather forecasting models train on ERA5 and then get fine-tuned on the live ECMWF IFS model or other real-time NWPs, so that could be an option here. Otherwise, there is the OCF DWD archive (although we might want to move that somewhere easier to stream from, e.g. Source Cooperative).
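
For anyone who wants to poke at ARCO-ERA5, it streams straight from GCP with xarray. A minimal sketch, assuming the publicly documented store path below is still current and gcsfs is installed:

```python
import xarray as xr

# ARCO-ERA5 analysis-ready Zarr on GCP; the path may have been re-versioned,
# so check the ARCO-ERA5 README if this fails to open.
ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    storage_options={"token": "anon"},  # public bucket, no credentials needed
)
# Pull one timestep of one variable to sanity-check access.
print(ds["2m_temperature"].sel(time="2022-06-01T12:00").load())
```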

@jcamier
Collaborator

jcamier commented Dec 11, 2024

@peterdudfield this is great insight from @jacobbieker. What are your thoughts?

@peterdudfield
Contributor Author

Thanks @jacobbieker and @jcamier

I like the option of training on ERA5 and fine-tuning on ECMWF IFS. I also like the idea of keeping the process simple at the beginning, as in general this is quite complicated.

I would probably push for us either to use GFS (not that the forecasts are that good, but it's easy to get and not too big), or to use the DWD ICON data that OCF store here (there is also an EU-only one too).

One next step is to make sure we collect all the variables we want; of course they are all named slightly differently in each NWP, but I'm sure we can make some progress.

@jcamier
Collaborator

jcamier commented Dec 11, 2024

@peterdudfield so would this be the recommended architecture and approach for this project?

1. Start by training with ERA5.
2. If needed, fine-tune with ECMWF IFS.
   • More complex, but improves accuracy.
3. Alternatively, simplify even further by using either:
   • GFS: easier to access and use.
   • DWD ICON: a more advanced option that may offer better forecasts.

I recommend we get all these data sources, if possible, and put them on Hugging Face or Source Cooperative.

However, I would also put ERA5 in a standard AWS S3 bucket, as I believe this will give faster I/O reads/writes for initial training.

Then, we could have different volunteer teams work with different models to see what performs best, with fine-tuning on top of the foundational model and various data sources?

@Sukh-P @jacobbieker thoughts?

@jacobbieker
Member

jacobbieker commented Dec 11, 2024

So, the ERA5 dataset and UFS are both already streamable from GCP or AWS for free, and this would probably be as fast as copying them to an S3 bucket and training from there. The downside of just using the ready-made datasets is that they are global forecasts and models, so if we want to subset the data to make it faster to load, we would need to copy it to a different S3 bucket or HF.
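
As a rough sketch of that subsetting step (the destination bucket and the variable list here are placeholders, not decisions):

```python
import xarray as xr

ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    storage_options={"token": "anon"},
)

# ERA5 longitudes run 0..360, so the UK straddles the wrap point;
# shift to -180..180 before slicing out a window.
ds = ds.assign_coords(longitude=((ds.longitude + 180) % 360) - 180).sortby("longitude")

uk = ds[["2m_temperature", "surface_solar_radiation_downwards"]].sel(
    latitude=slice(61, 49),   # ERA5 latitude is stored descending
    longitude=slice(-9, 3),
)

# Hypothetical destination; needs s3fs and write credentials.
uk.to_zarr("s3://some-ocf-bucket/era5_uk.zarr", mode="w")
```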

GFS is easy to access on AWS, but it is mostly in GRIB files, so it would need converting and then putting somewhere. I know @devsjc has done that in the NWP consumer, so we could use that to convert the raw GFS to Zarr and put it in Source Cooperative, or a different public S3 bucket.
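
The NWP consumer is the real tool for this, but the core of the conversion is roughly the following (the file name and level filter are just illustrative; needs cfgrib/eccodes installed):

```python
import xarray as xr

# Open one raw GFS GRIB2 file, keeping only surface-level fields to avoid
# cfgrib's multi-level key conflicts, then write it back out as Zarr.
ds = xr.open_dataset(
    "gfs.t00z.pgrb2.0p25.f000",
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"typeOfLevel": "surface"}},
)
ds.to_zarr("gfs_f000.zarr", mode="w")
```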

ICON is potentially already a better choice than GFS: it's already Zarr and already on HF, with a few fewer training years than GFS, and it would be relatively straightforward (if slow) to put the data into one large Zarr on Source Cooperative or an S3 bucket for faster and easier training.
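
Consolidating could look something like this: open each per-run Zarr from HF and append it along the init-time dimension of one big store (the paths and the dimension name are assumptions; the actual ICON archive layout may differ):

```python
import xarray as xr

paths = [
    "icon_eu_2023-01-01T00.zarr",  # hypothetical per-run stores
    "icon_eu_2023-01-01T06.zarr",
]

for i, path in enumerate(paths):
    ds = xr.open_zarr(path)
    if i == 0:
        ds.to_zarr("icon_eu_all.zarr", mode="w")                # create the big store
    else:
        ds.to_zarr("icon_eu_all.zarr", append_dim="init_time")  # extend it
```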

@jcamier
Collaborator

jcamier commented Dec 11, 2024

@jacobbieker we could maybe create a dataset.md file. Something like this...

Datasets

Here are links to datasets that might be useful:

@jcamier
Collaborator

jcamier commented Dec 11, 2024

This would allow the volunteer teams to explore the data if they want, but my understanding was that, for the original scope, we ultimately want to limit this to UK data in a specific format (the curated data) for the models to consume, sitting in some S3 bucket for further processing. @Sukh-P is this correct?

@jacobbieker
Member

> @jacobbieker we could maybe create a dataset.md file. Something like this...
>
> Datasets
>
> Here are links to datasets that might be useful:

Yeah, I can write one up. There are some other open sources of data we could link to that could be helpful to people, but they aren't necessarily as ready to go as a processed dataset.

@jcamier
Collaborator

jcamier commented Dec 11, 2024

Great, thanks @jacobbieker!!!

@jcamier
Collaborator

jcamier commented Dec 11, 2024

@jacobbieker I just updated the getting_started doc for your new file. Here is the PR:

#24

Let me know if you approve.

@Sukh-P
Member

Sukh-P commented Dec 16, 2024

Thanks for the great input, all. I agree on starting simple and using what we already have almost ready to go for this (GFS & ICON). We even have some GFS in Zarr format locally at OCF, which we could potentially push to the S3 bucket. I also agree that copying the ICON Europe data on HF to an S3 bucket makes sense.

Another data source, which I think @jacobbieker mentioned before, is the Met Office UK deterministic NWP, a rolling 2-year dataset. I believe this is in GRIB format, but we could use @devsjc's NWP consumer (it may need some modifying, I'm unsure) to convert it into Zarr format.

@jacobbieker
Member

Yeah, and storing Zarrs from the rolling archive could end up being useful beyond just this project, providing another longer-term archive of operational NWP outputs.

@peterdudfield
Contributor Author

Yep, I agree rolling archives of Zarrs would be useful. However, I think for the moment we should focus on the first aim and see if it works: collect NWP data for the UK and train an ML model for solar generation using this data. Of course, later on we can focus on adding rolling Zarrs.

@peterdudfield
Contributor Author

peterdudfield commented Dec 23, 2024

I've uploaded GFS 2023 data into our S3 bucket at s3://ocf-open-data-pvnet/data/gfs/. Hopefully this can set an example of what the data should look like, and it unblocks any ML people who want to start training etc. Note that this data for 1 year is only 0.5 GB, so not too big.
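
For anyone wanting to start from that, something like the below should work. The exact store name under the prefix isn't given above, so "2023.zarr" is a guess; list the prefix first to find the real name (assumes the bucket allows anonymous reads and s3fs is installed):

```python
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
print(fs.ls("ocf-open-data-pvnet/data/gfs/"))  # find the real store name

# "2023.zarr" is hypothetical; substitute whatever the listing shows.
ds = xr.open_zarr(
    "s3://ocf-open-data-pvnet/data/gfs/2023.zarr",
    storage_options={"anon": True},
)
print(ds)
```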

@jcamier
Collaborator

jcamier commented Dec 23, 2024

Here is my first pass at some code I have been working on to fetch and convert NWP data for Met Office UK:
https://github.com/openclimatefix/open-data-pvnet/tree/docs/issue-28

The command line, for example, is `open-data-pvnet metoffice archive --year 2022 --month 12 --day 1 --hour 0 --region uk`

Not sure if you guys will like this approach, but I tried to follow similar command logic to nwp-consumer. I tried to make it extensible so it can eventually handle global Met Office, DWD, and GFS. Also, it only gets specific variables, mirroring the idea you gave me from
https://huggingface.co/openclimatefix/pvnet_uk_region/blob/0bc344fafb2232fb0b6bb0bf419f0449fe11c643/data_config.yaml

I tried to mirror what these variables are in the Met Office data; the mapping is basically a rename table like the sketch below.
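
Both sides of this table are placeholders (the real target names are in the linked data_config.yaml, and the Met Office source names depend on the product being pulled):

```python
import xarray as xr

# Hypothetical Met Office name -> PVNet short name.
RENAME = {
    "temperature_at_screen_level": "t",
    "downward_shortwave_radiation_flux": "dswrf",
    "low_cloud_amount": "lcc",
}

def normalise(ds: xr.Dataset) -> xr.Dataset:
    """Keep and rename only the variables the model config asks for."""
    present = {src: dst for src, dst in RENAME.items() if src in ds}
    return ds[list(present)].rename(present)
```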

I know you guys already have a lot of this data, but I believe this project was supposed to be the one repo that eventually gets the open data from the three main sources (Met Office, DWD, and GFS), converts it to Zarr, and does some machine learning training etc. My thought was that having one repo will make this a lot easier for volunteers to work with than going to several different repos.

This is not completed yet, as I still need to add functionality to write the converted Zarr files to Source Cooperative and then delete the tmp files that were written locally...

My goal is to get this completed by year end for Met Office UK; a stretch goal would be GFS. This is on a separate branch, so we can always take a different approach based on further requirements, which we can discuss in the new year.

@peterdudfield
Contributor Author

Thanks @jcamier, it's great you've been able to make a first pass at this. That's super useful!

Good luck with your goal! It's good to have something written down, so that others can jump in and help.

Thanks once again for this work, and we'll chat in the new year.

@jcamier
Collaborator

jcamier commented Jan 7, 2025

@peterdudfield I believe we can close this issue. @jacobbieker created the datasets.md file, which lists the available open source NWP datasets. The main branch now has this file in the repo.

https://github.com/openclimatefix/open-data-pvnet/blob/main/datasets.md
