Identify where we can store our data #11

Open
peterdudfield opened this issue Dec 5, 2024 · 15 comments

Comments

@peterdudfield
Contributor

Look at public storage options. Tentative storage will be 10 TB for the initial project, to hold several years' worth of data.

@peterdudfield peterdudfield converted this from a draft issue Dec 5, 2024
@peterdudfield peterdudfield moved this to In Progress in Open Data PVNet Dec 5, 2024
@peterdudfield
Contributor Author

By the way, for GFS at 1 degree with 15 variables, we think the data is only ~10 MB per init time for the UK. At 2 init times a day, that would only be ~6 GB per year. Suddenly we perhaps don't need a big data store.
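A quick back-of-the-envelope check of that figure (a sketch; the ~10 MB per init time and 2 init times per day are the assumptions stated above):

```python
# Rough yearly storage estimate for UK-cropped GFS data.
# Assumptions (from the comment above): ~10 MB per init time, 2 init times per day.
MB_PER_INIT = 10
INITS_PER_DAY = 2
DAYS_PER_YEAR = 365

gb_per_year = MB_PER_INIT * INITS_PER_DAY * DAYS_PER_YEAR / 1024
print(f"~{gb_per_year:.1f} GB per year")  # ~7.1 GB/year, the same ballpark as the ~6 GB above
```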

@jcamier
Collaborator

jcamier commented Dec 5, 2024

@peterdudfield @Sukh-P
I just looked at pricing on AWS S3 in the US and it is $0.023 per GB/month for standard S3 storage. 6 GB is only $0.14/month! Originally I was going to apply for a grant, but I'd rather sponsor $0.14/month, lol. Even if we did 1 TB, that would only be USD $23/month, which I would be able to sponsor as well. We could also use Glacier Deep Archive for some of the data, which is only $0.00099 per GB/month, as I don't think we will be training all the time with most of this data. For example, put UK data in standard S3 and the rest of the countries/world in Glacier. So, in theory, we keep 1 TB of UK data in standard S3 and the remaining 9 TB (10 TB - 1 TB) in Glacier; that 9 TB would only be about $8.91/month.

The key then would be to create an issue to automate this storage process to save on costs.

  1. Put a day of data in S3.
  2. Have a process that separates UK data from rest-of-world data.
  3. Transfer the rest-of-world data to Glacier for long-term storage and future training.

What do you guys think?
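One way to automate step 3 is an S3 lifecycle rule that transitions objects under a given prefix to Glacier Deep Archive, so nothing needs to be moved by hand (step 2 then reduces to writing UK and non-UK data under different key prefixes). A minimal boto3 sketch, assuming a hypothetical bucket name, a hypothetical `global/` prefix for the non-UK data, and a 30-day threshold:

```python
import boto3

# Sketch: transition non-UK ("global/") objects to Glacier Deep Archive after 30 days.
# Bucket name, prefix, and the 30-day threshold are placeholders.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="ocf-open-data-pvnet",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-global-gfs",
                "Filter": {"Prefix": "global/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "DEEP_ARCHIVE"}],
            }
        ]
    },
)
```

With a rule like this, new data lands in standard S3 and ages into Deep Archive automatically, so the transfer step doesn't need a separate job.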

@peterdudfield
Contributor Author

Yeah, I like the split of S3 and Glacier; it helps with the costs.

One thing we might want to do first is get an estimate of the global GFS data size for these variables. We might find we can just put it on HF, and then when people want to use it, they just download it from there onto their VM/local machine.
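If it does end up small enough for HF, the download step is a one-liner with `huggingface_hub` (the repo id below is a hypothetical placeholder):

```python
from huggingface_hub import snapshot_download

# Pull the whole dataset repo to a local cache and work from the returned path.
# "openclimatefix/gfs-open-data" is a hypothetical repo id for illustration.
local_path = snapshot_download(repo_id="openclimatefix/gfs-open-data", repo_type="dataset")
print(local_path)
```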

@jcamier
Collaborator

jcamier commented Dec 5, 2024

It seems like there can be a lot of ways to save on these storage costs. @Sukh-P, you mentioned slowness of Hugging Face for streaming data when training. So maybe the UK streaming data will be on standard S3 (1 TB of data) and the rest of the GFS data archives we could put on Glacier or Hugging Face? I like Hugging Face as it seems more open-source friendly and more accessible. However, I think Hugging Face has now implemented a 300 GB limit...
[Screenshot: Hugging Face repository size recommendations]

https://huggingface.co/docs/hub/en/repositories-recommendations

@peterdudfield
Contributor Author

Yeah, HF is definitely too slow to stream. However, if the data is small enough, we could ask users to just download it and use it locally.

@jcamier
Collaborator

jcamier commented Dec 7, 2024

@peterdudfield, since the storage costs don't seem to be a large factor at the moment, would you mind creating a public AWS S3 bucket that we can start collaborating on? It looks like the most we would use at this time would be 1 TB for UK data, and the rest of the world we could put in Glacier and Hugging Face. As I mentioned, HF has a 300 GB limit, hence why I am recommending AWS Glacier to host the other estimated 9 TB of GFS data.

My estimate for 1 TB would be $23 USD/month plus I/O costs. However, we won't really incur that monthly cost for a couple of months, as it will take time to acquire that much data. I can sponsor $20 USD/month if you want to start. However, I think it would be in everyone's best interest for OCF to create the public S3 bucket, as opposed to me creating one under my own account that I allow others to use, though I'm willing to do so if you would prefer. We could create IAM privileges with read access for most users and machines and write access for a few of us. Thoughts?
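For the read-for-everyone / write-for-a-few split, the read side could simply be a public bucket policy, with write access granted separately through IAM users or a group. A sketch, assuming a hypothetical bucket name (the account's Block Public Access settings would also need to allow this):

```python
import json

import boto3

# Sketch: allow anonymous read (GetObject/ListBucket) on a public data bucket.
# Write access is not in this policy; it would be granted via IAM users/groups.
BUCKET = "ocf-open-data-pvnet"  # placeholder name
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
        }
    ],
}
boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```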

@peterdudfield
Contributor Author

Thanks so much @jcamier. Let's start with an S3 bucket then. Yeah, we can make it public so it's read access for all, and then we can add a few write users.

@peterdudfield
Contributor Author

There is now an S3 bucket called ocf-open-data-pvnet, which should be private. @jcamier, we'll have to add you to an AWS user group so that you can write to the bucket.

@jacobbieker
Member

For Hugging Face, they do say you can ask them for a higher-than-300 GB limit for a repo, so that is one option. Or we could put it on Source Cooperative: it's S3-backed, but with public and free hosting, and geared towards geospatial data. OCF has an account there, and this seems exactly like the kind of thing they'd like.

@jcamier
Collaborator

jcamier commented Dec 11, 2024

Thanks @jacobbieker - this is great insight!
@peterdudfield Source Cooperative sounds like a great place to store much of the archive data, and the data in general. Do you know how much space they give us? We could use the other S3 bucket for potentially faster reads/writes when training the model. It would be a good experiment to measure the I/O read/write latency between Source Cooperative and AWS S3 buckets and then weigh the pros/cons of cost versus speed.

This could be a good home for it:
https://source.coop/
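A rough way to run that experiment would be to time the same test file fetched from each host; the URLs below are hypothetical placeholders for the two copies:

```python
import time
import urllib.request

# Hypothetical URLs pointing at the same test file hosted on each service.
CANDIDATES = {
    "source-coop": "https://data.source.coop/example-org/example-dataset/test_chunk.bin",
    "aws-s3": "https://ocf-open-data-pvnet.s3.amazonaws.com/test_chunk.bin",
}

for name, url in CANDIDATES.items():
    start = time.perf_counter()
    data = urllib.request.urlopen(url).read()  # simple single-shot download
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(data) / 1e6:.1f} MB in {elapsed:.1f} s "
          f"({len(data) / 1e6 / elapsed:.1f} MB/s)")
```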

@peterdudfield
Contributor Author

Yeah, there is something very nice about not hosting the data ourselves.

Sorry, I don't know the details of Source Cooperative; perhaps someone can find out?
Hugging Face is also an option. One important step is to download some data and identify how much the total size will be; this might help us choose.

@jacobbieker
Member

Source Cooperative doesn't have limits on storage; they have lots of multi-TB datasets stored there, and I've been putting a lot of data under my personal account there without issues. I think it's probably the best option for this kind of thing. dynamical.org has also been using it to store the GFS analysis, so this would fit that. I think their buckets are based in the US, though, so read/write speed may depend on where you're reading from.

@jcamier
Collaborator

jcamier commented Dec 11, 2024

@jacobbieker @peterdudfield I think Source Cooperative it is, then. Is everyone in agreement?

@jacobbieker, can you create some instructions for this for new volunteers who are not familiar with it? I'm hoping to make the onboarding process as easy as possible to accelerate development from the community.

@jcamier
Collaborator

jcamier commented Dec 18, 2024

GFS data is already on AWS. Just providing the link for easy access: https://registry.opendata.aws/noaa-gfs-bdp-pds/
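Since that bucket is public, one can size up a single forecast cycle with anonymous listing; the prefix layout below is an assumption, so check the registry page for the exact structure:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client for the public NOAA GFS bucket linked above.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Assumed prefix layout for one init time (date/cycle); verify against the registry docs.
prefix = "gfs.20241218/00/"
total_bytes = 0
for page in s3.get_paginator("list_objects_v2").paginate(
    Bucket="noaa-gfs-bdp-pds", Prefix=prefix
):
    total_bytes += sum(obj["Size"] for obj in page.get("Contents", []))

print(f"{total_bytes / 1e9:.1f} GB for cycle {prefix}")
```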

@clarkmaio

Hi, so what about GFS?
Will we use the GFS AWS bucket directly, or are you still planning to store GFS data in the openclimatefix bucket?
I think having the data in openclimatefix would give us more flexibility (e.g. data in the GFS bucket is organized as .nc files for each step and forecast run).

If the second option is adopted, I would like to help build the Zarr archive.
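For reference, a minimal xarray sketch of the kind of per-step-file-to-Zarr conversion mentioned here (the glob pattern, dimension names, and chunk sizes are illustrative placeholders, not a settled layout):

```python
import xarray as xr

# Open all per-step NetCDF files for one forecast run and write a single Zarr store.
# File pattern, dimension names, and chunking are illustrative placeholders.
ds = xr.open_mfdataset("gfs_20241218_00z_step*.nc", combine="by_coords")
ds = ds.chunk({"time": 1, "latitude": 181, "longitude": 360})
ds.to_zarr("gfs_20241218_00z.zarr", mode="w")
```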
