Identify open source gridded NWP #3
Comments
While not ideal in that it is a reanalysis, there are now both the ARCO-ERA5 and the UFS Replay datasets, both Zarr, both going back quite a few years. A lot of the weather forecasting models train on ERA5, then get fine-tuned on the live ECMWF IFS model or other real-time NWPs. So that could be an option there. Otherwise, there is the OCF DWD archive (although we might want to move that somewhere easier to stream from, e.g. Source Cooperative).
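(As a rough illustration, not from the original thread: one way to stream a slice of ARCO-ERA5 lazily with xarray. The store path and variable name below are assumptions; check the ARCO-ERA5 documentation for the current Zarr version before relying on them.)

```python
import xarray as xr

# Open the analysis-ready ARCO-ERA5 Zarr store lazily from the public GCS bucket.
# The store path is an assumption -- confirm it against the ARCO-ERA5 repo.
ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    storage_options={"token": "anon"},  # public bucket, anonymous access
)

# Pull a small, solar-relevant subset. The grid uses 0-360 longitudes and
# descending latitudes, so the latitude slice is written high-to-low.
subset = ds["surface_solar_radiation_downwards"].sel(
    time=slice("2020-01-01", "2020-12-31"),
    latitude=slice(61, 49),
)
print(subset)
```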
@peterdudfield this is great insight from @jacobbieker. What are your thoughts?
Thanks @jacobbieker and @jcamier. I like the option of training on ERA5 and fine-tuning on ECMWF IFS. I also like the idea of trying to keep the process simple at the beginning, as in general it's quite complicated. I would probably push for us either to use GFS (the forecasts aren't that good, but it's easy to get and not too big), or to use the DWD ICON data that OCF store here (there is also a EU-only one too). One next step is to make sure we collect all the variables we want; of course they are all named slightly differently in each NWP, but I'm sure we can make some progress.
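(Illustrative sketch of the variable-naming point above, not from the thread: a small mapping that normalises names across NWPs. The provider-specific names are guesses and should be verified against each dataset's metadata.)

```python
import xarray as xr

# Canonical name -> per-provider variable names (illustrative guesses only).
CANONICAL_VARIABLES = {
    "downward_shortwave_radiation": {"gfs": "dswrf", "icon": "aswdifd_s", "ecmwf": "ssrd"},
    "temperature_2m": {"gfs": "t2m", "icon": "t_2m", "ecmwf": "t2m"},
    "total_cloud_cover": {"gfs": "tcc", "icon": "clct", "ecmwf": "tcc"},
}

def rename_to_canonical(ds: xr.Dataset, provider: str) -> xr.Dataset:
    """Rename one provider's variables to the shared canonical names."""
    mapping = {v[provider]: k for k, v in CANONICAL_VARIABLES.items() if provider in v}
    return ds.rename({old: new for old, new in mapping.items() if old in ds})
```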
@peterdudfield so would this be the recommended architecture and approach for this project?
I recommend we get all these data sources, if possible, and put them on Hugging Face or Source Cooperative. However, we should also put ERA5 in a standard AWS S3 bucket, as I believe this will have faster I/O reads/writes for initial training. Then we could have different volunteer teams work with different models, to see what performs better with fine-tuning on top of the foundational model and various data sources. @Sukh-P @jacobbieker thoughts?
So, the ERA5 and UFS datasets are both already streamable from either GCP or AWS for free, and that would probably be as fast as copying them to an S3 bucket and training from there. The downside of just using the ready-made datasets is that they are global forecasts and models, so if we want to subset the data to make it faster to load, we would need to copy it to a different S3 bucket or HF. GFS is easy to access on AWS, but it is mostly in GRIB files, so it would need conversion and then putting somewhere. I know @devsjc has done that in the NWP consumer, so we could use that to convert the raw GFS to Zarr and put it in Source Cooperative or a different public S3 bucket. ICON is potentially already better to use than GFS: it's already Zarr, with a few fewer training years than GFS, but it's already on HF, and it would be relatively straightforward, if it takes a while, to put the data in one large Zarr on Source Cooperative or an S3 bucket for faster and easier training.
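(Sketch of the GRIB-to-Zarr step described above; this is not the nwp-consumer implementation, and the filter_by_keys choice, variable names and file names are assumptions.)

```python
import xarray as xr

def grib_to_zarr(grib_path: str, zarr_path: str, variables: list[str]) -> None:
    """Convert one GFS GRIB2 file to a Zarr store, keeping only selected variables."""
    # GFS GRIB files mix levels and steps, so cfgrib usually needs filter_by_keys
    # to open a single consistent hypercube; the filter below is only a guess.
    ds = xr.open_dataset(
        grib_path,
        engine="cfgrib",
        backend_kwargs={"filter_by_keys": {"typeOfLevel": "surface"}},
    )
    ds[variables].to_zarr(zarr_path, mode="w")

# Hypothetical file and variable names, just to show the call shape.
grib_to_zarr("gfs.t00z.pgrb2.0p25.f000", "gfs_surface_subset.zarr", ["t", "dswrf"])
```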
@jacobbieker we could maybe create a dataset.md file. Something like this...

Datasets

Here are links to datasets that might be useful:
This would allow the volunteer teams to explore the data if they want, but ultimately I think we want to limit it to UK data for the original scope, in a specific format (the curated data) for the models to consume, which would sit in some S3 bucket for further processing; that was my understanding. @Sukh-P is this correct?
Yeah, I can write one up. There are some other open sources of data we could link to that could be helpful to people, but they aren't necessarily as ready to go as a processed dataset.
Great - thanks @jacobbieker!!!
@jacobbieker I just updated the getting_started for your new file. Here is the PR. Let me know if you approve.
Thanks for the great input, all. I agree on starting simple and using what we have already almost ready to go for this (GFS & ICON). We even have some GFS in Zarr format locally at OCF, which we could potentially push to the S3 bucket. I also agree that copying the ICON Europe data on HF to an S3 bucket makes sense. Another data source, which I think @jacobbieker mentioned before, is the Met Office UK deterministic NWP rolling 2-year dataset. This is in GRIB format I believe, but we could use @devsjc's NWP consumer (which may need some modifying, unsure) to convert it into Zarr format.
Yeah, also, storing Zarrs from the rolling archive could end up having a similar usefulness beyond just this project, providing another longer-term archive of NWP operational outputs.
Yep, I agree, rolling archives for Zarrs would be useful. However, I think for the moment we should focus on the first aim and see if this works.
I've uploaded GFS 2023 data into our S3 bucket s3://ocf-open-data-pvnet/data/gfs/. Hopefully it can set an example of what this should look like, and it unblocks some ML people if they want to start training etc. Note that this data for 1 year is only 0.5 GB, so not too big.
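(Minimal sketch of how a volunteer might open the uploaded GFS Zarr straight from S3 with xarray. The exact store key and the anonymous-access setting are assumptions; listing the bucket will show the real layout.)

```python
import xarray as xr

# The key under s3://ocf-open-data-pvnet/data/gfs/ is a guess -- list the bucket
# (e.g. `aws s3 ls --no-sign-request s3://ocf-open-data-pvnet/data/gfs/`) to find
# the real store name.
ds = xr.open_zarr(
    "s3://ocf-open-data-pvnet/data/gfs/2023.zarr",
    storage_options={"anon": True},  # assuming the bucket allows anonymous reads
)
print(ds.data_vars)
print(ds.sizes)
```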
Here is my first pass at some code I have been working on to fetch and convert NWP data for the Met Office UK. The command line, for example, is ... Not sure if you guys will like this approach, but I tried to follow similar command logic to nwp-consumer. I tried to make it extensible to eventually cover global Met Office, DWD and GFS. Also, it only gets specific variables that mirror what you gave me as an idea, and I tried to map what those variables are called in the Met Office data. I know you guys already have a lot of this data, but I believe this project was supposed to be the one repo that eventually gets the open data from the three main sources (Met Office, DWD and GFS), converts it to Zarr, and does some machine learning training etc. Having one repo for volunteers will make this a lot easier to work with than going to several different repos, was my thought. This is not completed yet, as I need to add functionality to write to Source Cooperative from the converted Zarr files and then delete the tmp files that were written locally. My goal is to get this completed by year end for Met Office UK, and a stretch goal would be GFS. This is on a separate branch, so we can always take different approaches based on further requirements we can discuss in the new year.
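(Hedged sketch of the remaining upload-and-cleanup step mentioned above, not the actual open-data-pvnet code; the function name, local path and bucket key are placeholders.)

```python
import shutil
import s3fs

def upload_zarr_and_cleanup(local_zarr: str, remote_uri: str) -> None:
    """Push a locally converted Zarr store to object storage, then remove the tmp copy."""
    fs = s3fs.S3FileSystem()                  # credentials come from the environment
    fs.put(local_zarr, remote_uri, recursive=True)
    shutil.rmtree(local_zarr)                 # drop the temporary local files

upload_zarr_and_cleanup(
    "/tmp/met_office_uk_2024-12-01.zarr",
    "s3://ocf-open-data-pvnet/data/met-office-uk/2024-12-01.zarr",
)
```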
Thanks @jcamier, great that you've been able to make a first pass at this. That's super useful! Good luck with your goal! It's good to have something written down, and then others can jump in and help. Thanks once again for this work, and we'll chat in the new year.
@peterdudfield I believe we can close this issue. @jacobbieker created the datasets.md file that contains the available open source NWP datasets. The main branch has this file in the repo: https://github.com/openclimatefix/open-data-pvnet/blob/main/datasets.md
Identify open source gridded NWP that is already in Zarr format. Need to make sure it has enough variables for solar forecasts and enough years (>2) of data. Note that OCF publish satellite data already, which could be used. We try to use at least 1 year for training and 1 year for testing. It would be better to have more like 5 years of training, but let's see what we can do.
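(A small sanity-check sketch for the requirements above: open a candidate Zarr and confirm it has the needed variables and more than two years of history. The variable names and store URI are placeholders.)

```python
import pandas as pd
import xarray as xr

REQUIRED_VARS = {"dswrf", "t2m", "tcc"}  # placeholder solar-relevant variables

def meets_requirements(zarr_uri: str, min_years: float = 2.0) -> bool:
    """Check a candidate gridded NWP Zarr for required variables and history length."""
    ds = xr.open_zarr(zarr_uri)
    times = pd.to_datetime(ds["time"].values)
    span_years = (times.max() - times.min()).days / 365.25
    return REQUIRED_VARS.issubset(set(ds.data_vars)) and span_years > min_years

# Hypothetical store, just to show usage.
print(meets_requirements("s3://example-bucket/candidate-nwp.zarr"))
```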