Identify where we can store our data #11

Open
peterdudfield opened this issue Dec 5, 2024 · 15 comments

Comments

@peterdudfield
Contributor

Look at public storage options. Tentative storage will be 10 TB for the initial project, to hold several years' worth of data.

@peterdudfield peterdudfield converted this from a draft issue Dec 5, 2024
@peterdudfield peterdudfield moved this to In Progress in Open Data PVNet Dec 5, 2024
@peterdudfield
Contributor Author

By the way, for GFS at 1 degree with 15 variables, we think the data is only ~10 MB per init time for the UK. At 2 init times a day, that would only be ~6 GB per year. Suddenly we perhaps don't need a big data store.
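A quick back-of-the-envelope check of that figure (a sketch; the ~10 MB per init time and 2 init times per day are the assumptions stated above):

```python
# Rough yearly storage estimate for UK-cropped GFS data.
# Assumptions (from the comment above): ~10 MB per init time, 2 init times per day.
MB_PER_INIT = 10
INITS_PER_DAY = 2
DAYS_PER_YEAR = 365

gb_per_year = MB_PER_INIT * INITS_PER_DAY * DAYS_PER_YEAR / 1024
print(f"~{gb_per_year:.1f} GB per year")  # ~7.1 GB/year, the same ballpark as the ~6 GB above
```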

@jcamier
Collaborator

jcamier commented Dec 5, 2024

@peterdudfield @Sukh-P
I just looked at pricing on AWS S3 in the US and it is $0.023 per GB/month for standard S3 storage. 6 GB is only $0.14/month! Originally I was going to apply for a grant, but I'd rather sponsor $0.14/month, lol. Even if we did 1 TB, that would only be USD $23/month, which I would be able to sponsor as well. We could also use Glacier Deep Archive for some of the data, which is only $0.00099 per GB/month, as I don't think we will be training all the time with most of this data. For example, put UK data in standard S3 and the rest of the countries/world in Glacier. So, in theory, we keep 1 TB of UK data in standard S3 and the remaining 9 TB (10 TB - 1 TB) in Glacier; that 9 TB would only be about $8.91/month.

The key then would be to create an issue to automate this storage process to save on costs.

  1. Put a day of data in S3.
  2. Have a process that separates UK data from rest-of-world data.
  3. Transfer the rest-of-world data to Glacier for long-term storage and future training.

What do you guys think?
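One way to automate step 3 is an S3 lifecycle rule that transitions objects under a given prefix to Glacier Deep Archive, so nothing needs to be moved by hand (step 2 then reduces to writing UK and non-UK data under different key prefixes). A minimal boto3 sketch, assuming a hypothetical bucket name, a hypothetical `global/` prefix for the non-UK data, and a 30-day threshold:

```python
import boto3

# Sketch: transition non-UK ("global/") objects to Glacier Deep Archive after 30 days.
# Bucket name, prefix, and the 30-day threshold are placeholders.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="ocf-open-data-pvnet",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-global-gfs",
                "Filter": {"Prefix": "global/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "DEEP_ARCHIVE"}],
            }
        ]
    },
)
```

With a rule like this, new data lands in standard S3 and ages into Deep Archive automatically, so the transfer step doesn't need a separate job.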

@peterdudfield
Contributor Author

Yeah, I like the split of S3 and Glacier; it helps with the costs.

One thing we might want to do first is get an estimate of the global GFS data size for these variables. We might find we can just put it on HF, and then when people want to use it, they just download it from there onto their VM/local machine.
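If it does end up small enough for HF, the download step is a one-liner with `huggingface_hub` (the repo id below is a hypothetical placeholder):

```python
from huggingface_hub import snapshot_download

# Pull the whole dataset repo to a local cache and work from the returned path.
# "openclimatefix/gfs-open-data" is a hypothetical repo id for illustration.
local_path = snapshot_download(repo_id="openclimatefix/gfs-open-data", repo_type="dataset")
print(local_path)
```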

@jcamier
Collaborator

jcamier commented Dec 5, 2024

It seems like there can be a lot of ways to save on these storage costs. @Sukh-P, you mentioned slowness of Hugging Face for streaming data when training. So maybe the UK streaming data will be on standard S3 (1 TB of data) and the rest of the GFS data archives we could put on Glacier or Hugging Face? I like Hugging Face as it seems more open-source friendly and more accessible. However, I think Hugging Face has now implemented a 300 GB limit...
[Screenshot: Hugging Face repository size recommendations]

https://huggingface.co/docs/hub/en/repositories-recommendations

@peterdudfield
Contributor Author

Yeah, HF is definitely too slow to stream. However, if the data is small enough, we could ask users to just download it and use it locally.

@jcamier
Collaborator

jcamier commented Dec 7, 2024

@peterdudfield, since the storage costs don't seem to be a large factor at the moment, would you mind creating a public AWS S3 bucket that we can start collaborating on? It looks like the most we would use at this time would be 1 TB for UK data, and the rest of the world we could put in Glacier and Hugging Face. As I mentioned, HF has a 300 GB limit, hence why I am recommending AWS Glacier to host the other estimated 9 TB of GFS data.

My estimate for 1 TB would be $23 USD/month plus I/O costs. However, we won't really incur that monthly cost for a couple of months, as it will take time to acquire that much data. I can sponsor $20 USD/month if you want to start. However, I think it would be in everyone's best interest for OCF to create the public S3 bucket, as opposed to me creating one under my own account that I allow others to use, though I'm willing to do so if you would prefer. We could create IAM privileges with read access for most users and machines and write access for a few of us. Thoughts?
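For the read-for-everyone / write-for-a-few split, the read side could simply be a public bucket policy, with write access granted separately through IAM users or a group. A sketch, assuming a hypothetical bucket name (the account's Block Public Access settings would also need to allow this):

```python
import json

import boto3

# Sketch: allow anonymous read (GetObject/ListBucket) on a public data bucket.
# Write access is not in this policy; it would be granted via IAM users/groups.
BUCKET = "ocf-open-data-pvnet"  # placeholder name
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
        }
    ],
}
boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```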

@peterdudfield
Contributor Author

Thanks so much @jcamier. Let's start with an S3 bucket then. Yeah, we can make it public so it's read access for all, and then we can add a few write users.

@peterdudfield
Contributor Author

There is now an S3 bucket called ocf-open-data-pvnet, which should be private. @jcamier, we'll have to add you to an AWS user group so that you can write to the bucket.

@jacobbieker
Member

For Hugging Face, they do say you can ask them for a higher-than-300 GB limit for a repo, so that is one option. Or we could put it on Source Cooperative: it's S3-backed, but with public and free hosting, and geared towards geospatial data. OCF has an account there, and this seems exactly like the kind of thing they'd like.

@jcamier
Collaborator

jcamier commented Dec 11, 2024

Thanks @jacobbieker - this is great insight!
@peterdudfield Source Cooperative sounds like a great place to store much of the archive data, and the data in general. Do you know how much space they give us? We could use the other S3 bucket for potentially faster reads/writes when training the model. It would be a good experiment to measure the I/O read/write latency between Source Cooperative and AWS S3 buckets and then weigh the pros/cons of cost versus speed.

This could be a good home for it:
https://source.coop/
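A rough way to run that experiment would be to time the same test file fetched from each host; the URLs below are hypothetical placeholders for the two copies:

```python
import time
import urllib.request

# Hypothetical URLs pointing at the same test file hosted on each service.
CANDIDATES = {
    "source-coop": "https://data.source.coop/example-org/example-dataset/test_chunk.bin",
    "aws-s3": "https://ocf-open-data-pvnet.s3.amazonaws.com/test_chunk.bin",
}

for name, url in CANDIDATES.items():
    start = time.perf_counter()
    data = urllib.request.urlopen(url).read()  # simple single-shot download
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(data) / 1e6:.1f} MB in {elapsed:.1f} s "
          f"({len(data) / 1e6 / elapsed:.1f} MB/s)")
```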

@peterdudfield
Contributor Author

Yeah, there is something very nice about not hosting the data ourselves.

Sorry, I don't know the details of Source Cooperative; perhaps someone can find out?
Hugging Face is also an option. One important step is to download some data and identify how much the total size will be; this might help us choose.

@jacobbieker
Member

Source Cooperative doesn't have limits on storage; they have lots of multi-TB datasets stored there, and I've been putting a lot of data under my personal account there without issues. I think it's probably the best option for this kind of thing. dynamical.org has also been using it to store the GFS analysis, so this would fit that. I think their buckets are based in the US, though, so read/write speed may depend on where you're reading from.

@jcamier
Collaborator

jcamier commented Dec 11, 2024

@jacobbieker @peterdudfield I think Source Cooperative it is, then. Is everyone in agreement?

@jacobbieker, can you create some instructions for this for new volunteers who are not familiar with it? I'm hoping to make the onboarding process as easy as possible to accelerate development from the community.

@jcamier
Collaborator

jcamier commented Dec 18, 2024

GFS data is already on AWS. Just providing the link for easy access: https://registry.opendata.aws/noaa-gfs-bdp-pds/
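Since that bucket is public, one can size up a single forecast cycle with anonymous listing; the prefix layout below is an assumption, so check the registry page for the exact structure:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client for the public NOAA GFS bucket linked above.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Assumed prefix layout for one init time (date/cycle); verify against the registry docs.
prefix = "gfs.20241218/00/"
total_bytes = 0
for page in s3.get_paginator("list_objects_v2").paginate(
    Bucket="noaa-gfs-bdp-pds", Prefix=prefix
):
    total_bytes += sum(obj["Size"] for obj in page.get("Contents", []))

print(f"{total_bytes / 1e9:.1f} GB for cycle {prefix}")
```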

@clarkmaio

Hi, so what about GFS?
Will we use the GFS AWS bucket directly, or are you still planning to store GFS data in the openclimatefix bucket?
I think having the data in openclimatefix would give us more flexibility (e.g. data in the GFS bucket is organized as .nc files for each step and forecast run).

If the second option is adopted, I would like to help build the Zarr archive.
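For reference, a minimal xarray sketch of the kind of per-step-file-to-Zarr conversion mentioned here (the glob pattern, dimension names, and chunk sizes are illustrative placeholders, not a settled layout):

```python
import xarray as xr

# Open all per-step NetCDF files for one forecast run and write a single Zarr store.
# File pattern, dimension names, and chunking are illustrative placeholders.
ds = xr.open_mfdataset("gfs_20241218_00z_step*.nc", combine="by_coords")
ds = ds.chunk({"time": 1, "latitude": 181, "longitude": 360})
ds.to_zarr("gfs_20241218_00z.zarr", mode="w")
```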
