Identify open source gridded NWP #3
Comments
While not ideal in that it is a reanalysis, there are now both the ARCO-ERA5 and the UFS Replay datasets, both Zarr, both going back quite a few years. A lot of the weather forecasting models train on ERA5, then get fine-tuned on the live ECMWF IFS model or other real-time NWPs. So that could be an option there. Otherwise, there is the OCF DWD archive (although we might want to move that somewhere easier to stream from, e.g. Source Cooperative).
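(As a rough illustration, not from the original thread: one way to stream a slice of ARCO-ERA5 lazily with xarray. The store path and variable name below are assumptions; check the ARCO-ERA5 documentation for the current Zarr version before relying on them.)

```python
import xarray as xr

# Open the analysis-ready ARCO-ERA5 Zarr store lazily from the public GCS bucket.
# The store path is an assumption -- confirm it against the ARCO-ERA5 repo.
ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    storage_options={"token": "anon"},  # public bucket, anonymous access
)

# Pull a small, solar-relevant subset. The grid uses 0-360 longitudes and
# descending latitudes, so the latitude slice is written high-to-low.
subset = ds["surface_solar_radiation_downwards"].sel(
    time=slice("2020-01-01", "2020-12-31"),
    latitude=slice(61, 49),
)
print(subset)
```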
@peterdudfield this is great insight from @jacobbieker. What are your thoughts?
Thanks @jacobbieker and @jcamier. I like the option of training on ERA5 and fine-tuning on ECMWF IFS. I also like the idea of trying to keep the process simple at the beginning, as in general it's quite complicated. I would probably push for us either to use GFS (the forecasts aren't that good, but it's easy to get and not too big), or to use the DWD ICON data that OCF store here (there is also a EU-only one too). One next step is to make sure we collect all the variables we want; of course they are all named slightly differently in each NWP, but I'm sure we can make some progress.
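(Illustrative sketch of the variable-naming point above, not from the thread: a small mapping that normalises names across NWPs. The provider-specific names are guesses and should be verified against each dataset's metadata.)

```python
import xarray as xr

# Canonical name -> per-provider variable names (illustrative guesses only).
CANONICAL_VARIABLES = {
    "downward_shortwave_radiation": {"gfs": "dswrf", "icon": "aswdifd_s", "ecmwf": "ssrd"},
    "temperature_2m": {"gfs": "t2m", "icon": "t_2m", "ecmwf": "t2m"},
    "total_cloud_cover": {"gfs": "tcc", "icon": "clct", "ecmwf": "tcc"},
}

def rename_to_canonical(ds: xr.Dataset, provider: str) -> xr.Dataset:
    """Rename one provider's variables to the shared canonical names."""
    mapping = {v[provider]: k for k, v in CANONICAL_VARIABLES.items() if provider in v}
    return ds.rename({old: new for old, new in mapping.items() if old in ds})
```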
@peterdudfield so would this be the recommended architecture and approach for this project?
I recommend we get all these data sources, if possible, and put them on Hugging Face or Source Cooperative. However, we should also put ERA5 in a standard AWS S3 bucket, as I believe this will have faster I/O reads/writes for initial training. Then we could have different volunteer teams work with different models, to see what performs better with fine-tuning on top of the foundational model and various data sources. @Sukh-P @jacobbieker thoughts?
So, the ERA5 and UFS datasets are both already streamable from either GCP or AWS for free, and that would probably be as fast as copying them to an S3 bucket and training from there. The downside of just using the ready-made datasets is that they are global forecasts and models, so if we want to subset the data to make it faster to load, we would need to copy it to a different S3 bucket or HF. GFS is easy to access on AWS, but it is mostly in GRIB files, so it would need conversion and then putting somewhere. I know @devsjc has done that in the NWP consumer, so we could use that to convert the raw GFS to Zarr and put it in Source Cooperative or a different public S3 bucket. ICON is potentially already better to use than GFS: it's already Zarr, with a few fewer training years than GFS, but it's already on HF, and it would be relatively straightforward, if it takes a while, to put the data in one large Zarr on Source Cooperative or an S3 bucket for faster and easier training.
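(Sketch of the GRIB-to-Zarr step described above; this is not the nwp-consumer implementation, and the filter_by_keys choice, variable names and file names are assumptions.)

```python
import xarray as xr

def grib_to_zarr(grib_path: str, zarr_path: str, variables: list[str]) -> None:
    """Convert one GFS GRIB2 file to a Zarr store, keeping only selected variables."""
    # GFS GRIB files mix levels and steps, so cfgrib usually needs filter_by_keys
    # to open a single consistent hypercube; the filter below is only a guess.
    ds = xr.open_dataset(
        grib_path,
        engine="cfgrib",
        backend_kwargs={"filter_by_keys": {"typeOfLevel": "surface"}},
    )
    ds[variables].to_zarr(zarr_path, mode="w")

# Hypothetical file and variable names, just to show the call shape.
grib_to_zarr("gfs.t00z.pgrb2.0p25.f000", "gfs_surface_subset.zarr", ["t", "dswrf"])
```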
@jacobbieker we could maybe create a dataset.md file. Something like this...

Datasets

Here are links to datasets that might be useful:
This would allow the volunteer teams to explore the data if they want, but ultimately I think we want to limit it to UK data for the original scope, in a specific format (the curated data) for the models to consume, which would sit in some S3 bucket for further processing; that was my understanding. @Sukh-P is this correct?
Yeah, I can write one up. There are some other open sources of data we could link to that could be helpful to people, but they aren't necessarily as ready to go as a processed dataset.
Great - thanks @jacobbieker!!!
@jacobbieker I just updated the getting_started for your new file. Here is the PR. Let me know if you approve.
Thanks for the great input, all. I agree on starting simple and using what we have already almost ready to go for this (GFS & ICON). We even have some GFS in Zarr format locally at OCF, which we could potentially push to the S3 bucket. I also agree that copying the ICON Europe data on HF to an S3 bucket makes sense. Another data source, which I think @jacobbieker mentioned before, is the Met Office UK deterministic NWP rolling 2-year dataset. This is in GRIB format I believe, but we could use @devsjc's NWP consumer (which may need some modifying, unsure) to convert it into Zarr format.
Yeah, also, storing Zarrs from the rolling archive could end up having a similar usefulness beyond just this project, providing another longer-term archive of NWP operational outputs.
Yep, I agree, rolling archives for Zarrs would be useful. However, I think for the moment we should focus on the first aim and see if this works.
I've uploaded GFS 2023 data into our S3 bucket s3://ocf-open-data-pvnet/data/gfs/. Hopefully it can set an example of what this should look like, and it unblocks some ML people if they want to start training etc. Note that this data for 1 year is only 0.5 GB, so not too big.
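(Minimal sketch of how a volunteer might open the uploaded GFS Zarr straight from S3 with xarray. The exact store key and the anonymous-access setting are assumptions; listing the bucket will show the real layout.)

```python
import xarray as xr

# The key under s3://ocf-open-data-pvnet/data/gfs/ is a guess -- list the bucket
# (e.g. `aws s3 ls --no-sign-request s3://ocf-open-data-pvnet/data/gfs/`) to find
# the real store name.
ds = xr.open_zarr(
    "s3://ocf-open-data-pvnet/data/gfs/2023.zarr",
    storage_options={"anon": True},  # assuming the bucket allows anonymous reads
)
print(ds.data_vars)
print(ds.sizes)
```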
Here is my first pass at some code I have been working on to fetch and convert NWP data for the Met Office UK. The command line, for example, is ... Not sure if you guys will like this approach, but I tried to follow similar command logic to nwp-consumer. I tried to make it extensible to eventually cover global Met Office, DWD and GFS. Also, it only gets specific variables that mirror what you gave me as an idea, and I tried to map what those variables are called in the Met Office data. I know you guys already have a lot of this data, but I believe this project was supposed to be the one repo that eventually gets the open data from the three main sources (Met Office, DWD and GFS), converts it to Zarr, and does some machine learning training etc. Having one repo for volunteers will make this a lot easier to work with than going to several different repos, was my thought. This is not completed yet, as I need to add functionality to write to Source Cooperative from the converted Zarr files and then delete the tmp files that were written locally. My goal is to get this completed by year end for Met Office UK, and a stretch goal would be GFS. This is on a separate branch, so we can always take different approaches based on further requirements we can discuss in the new year.
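(Hedged sketch of the remaining upload-and-cleanup step mentioned above, not the actual open-data-pvnet code; the function name, local path and bucket key are placeholders.)

```python
import shutil
import s3fs

def upload_zarr_and_cleanup(local_zarr: str, remote_uri: str) -> None:
    """Push a locally converted Zarr store to object storage, then remove the tmp copy."""
    fs = s3fs.S3FileSystem()                  # credentials come from the environment
    fs.put(local_zarr, remote_uri, recursive=True)
    shutil.rmtree(local_zarr)                 # drop the temporary local files

upload_zarr_and_cleanup(
    "/tmp/met_office_uk_2024-12-01.zarr",
    "s3://ocf-open-data-pvnet/data/met-office-uk/2024-12-01.zarr",
)
```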
Thanks @jcamier, great that you've been able to make a first pass at this. That's super useful! Good luck with your goal! It's good to have something written down, and then others can jump in and help. Thanks once again for this work, and we'll chat in the new year.
@peterdudfield I believe we can close this issue. @jacobbieker created the datasets.md file that contains the available open source NWP datasets. The main branch has this file in the repo: https://github.com/openclimatefix/open-data-pvnet/blob/main/datasets.md
Identify open source gridded NWP that is already in Zarr format. Need to make sure it has enough variables for solar forecasts and enough years (>2) of data. Note that OCF publish satellite data already, which could be used. We try to use at least 1 year for training and 1 year for testing. It would be better to have more like 5 years of training, but let's see what we can do.
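(A small sanity-check sketch for the requirements above: open a candidate Zarr and confirm it has the needed variables and more than two years of history. The variable names and store URI are placeholders.)

```python
import pandas as pd
import xarray as xr

REQUIRED_VARS = {"dswrf", "t2m", "tcc"}  # placeholder solar-relevant variables

def meets_requirements(zarr_uri: str, min_years: float = 2.0) -> bool:
    """Check a candidate gridded NWP Zarr for required variables and history length."""
    ds = xr.open_zarr(zarr_uri)
    times = pd.to_datetime(ds["time"].values)
    span_years = (times.max() - times.min()).days / 365.25
    return REQUIRED_VARS.issubset(set(ds.data_vars)) and span_years > min_years

# Hypothetical store, just to show usage.
print(meets_requirements("s3://example-bucket/candidate-nwp.zarr"))
```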