Identify where we can store our data #11

Look at public storage options. Tentative storage will be 10 TB for the initial project, to hold several years' worth of data.

Comments
Btw, for GFS at 1 degree with 15 variables, we think the data is only ~10MB per init time for the UK. At 2 init times a day, that would only be ~6GB per year. Suddenly we perhaps don't need a big data store.
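As a quick sanity check on that arithmetic (the ~10MB per init time is the assumption here):

```python
# Back-of-the-envelope estimate for the UK GFS subset, using the
# figures from the comment above (~10 MB per init, 2 inits per day).
mb_per_init = 10
inits_per_day = 2
days_per_year = 365

gb_per_year = mb_per_init * inits_per_day * days_per_year / 1000
print(f"~{gb_per_year:.1f} GB per year")  # ~7.3 GB/year, same ballpark as the ~6 GB above
```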
@peterdudfield @Sukh-P The key then would be to create an issue to automate this storage process to save on costs.
What do you guys think?
Yea, I like the split of S3 and Glacier, helps with the costs. One thing we might want to do first is just get an estimate of the global GFS data for the variables. We might find we can just put it on HF, and then when people want to use it, they just download it from there onto their VM/local machine.
It seems like there can be a lot of ways to save on these storage costs. @Sukh-P you mentioned slowness of Hugging Face for streaming data when training. So maybe the UK streaming data will be on standard S3 (1 TB of data) and the rest of the GFS data archives we could put on Glacier or Hugging Face? I like Hugging Face as it seems more open source friendly and more accessible. However, I think Hugging Face has now implemented a 300 GB limit... https://huggingface.co/docs/hub/en/repositories-recommendations
Yea, HF is definitely too slow to stream. However, if the data is small enough, we could ask users to just download and use it locally.
@peterdudfield since the storage costs don't seem to be a large factor at the moment, would you mind creating a public AWS S3 bucket that we can start collaborating on? It looks like the most we would use at this time would be 1 TB for UK data, and the rest of the world we could put in Glacier and Hugging Face. As I mentioned, HF has a 300GB limit, hence why I am recommending AWS Glacier to host the other ~9TB estimated amount of GFS data. My estimate for 1TB would be $23 USD/month plus I/O time. However, we won't really incur that monthly cost for a couple of months, as it will take time to acquire that much data. I can sponsor $20 USD/month if you want to start. That said, I think it would be best for OCF to create the public S3 bucket, as opposed to me creating one under my own account that I allow others to use, though I'm willing to do that if you would prefer. We could create IAM privileges so most users and machines have read access and a few of us have write access. Thoughts?
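For what it's worth, the read-for-everyone / write-for-a-few split could be a bucket policy plus IAM groups for the writers. A minimal boto3 sketch, assuming a placeholder bucket name (and note the account's Block Public Access settings would need relaxing first):

```python
import json
import boto3

s3 = boto3.client("s3")

# Placeholder bucket name; write access would be granted separately
# to a small set of users via IAM, not in this policy.
bucket = "example-open-data-bucket"

# Public read: anyone can list the bucket and fetch objects.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
        }
    ],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```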
Thanks so much @jcamier. Let's start with an S3 bucket then. Yea, we can make it public so it's read access for all, and then we can make a few write users.
There is now an S3 bucket called ocf-open-data-pvnet that should be private. @jcamier we'll have to add you to an AWS user group so that you can write to the bucket.
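Once the bucket is opened up for public reads, anonymous access should look something like this (the key path below is hypothetical, just to show the unsigned-client pattern):

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client: no AWS credentials needed for public reads.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Hypothetical key layout; the actual structure is still to be decided.
s3.download_file(
    "ocf-open-data-pvnet",
    "gfs/uk/2024/01/01/00z.zarr.zip",
    "00z.zarr.zip",
)
```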
For Hugging Face, they do say you can ask them for a higher than 300GB limit for a repo, so that is one option. Or we could put it on Source Cooperative: it's S3, but public and free hosting, as well as geared towards geospatial data. OCF has an account there, and this seems exactly like the kind of thing they'd like.
Thanks @jacobbieker - this is great insight! This could be a good home for it: |
Yea, there is something very nice about not hosting the data ourselves. Sorry, I don't know the details of Source Cooperative, perhaps someone can find out?
Source Cooperative doesn't have limits on the storage, they have lots of multi-TB datasets stored there, and I've been putting a lot of data under my personal account there without issues. I think it's probably the best option for this kind of thing. dynamical.org has also been using it to store the GFS analysis, so this would fit right alongside that. I think their buckets are based in the US though, so read/write speed might depend on where you are relative to them.
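If it helps anyone evaluate it: reading from Source Cooperative should just be the standard S3 API against their endpoint. A sketch, assuming the data.source.coop endpoint and a hypothetical account/repository path:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Assumption: Source Cooperative exposes an S3-compatible endpoint at
# data.source.coop; the account and key below are hypothetical.
s3 = boto3.client(
    "s3",
    endpoint_url="https://data.source.coop",
    config=Config(signature_version=UNSIGNED),
)
s3.download_file("ocf", "open-data-pvnet/gfs/example.zarr.zip", "example.zarr.zip")
```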
@jacobbieker @peterdudfield I think Source Cooperative it is then. Everyone in agreement? @jacobbieker can you create some instructions for this for new volunteers that are not familiar with it? Hoping to make the onboarding process as easy as possible to accelerate development from the community.
GFS data is already on AWS. Just providing the link for easy access: https://registry.opendata.aws/noaa-gfs-bdp-pds/ |
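For anyone wanting to try it, a 1-degree file can be pulled with an unsigned client; the key below follows the layout documented on that registry page, but verify it for the date you want:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# NOAA GFS is a public bucket, so an anonymous client works.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# 1-degree analysis file (f000) for the 00z run on 2024-01-01;
# key layout per the open-data registry documentation.
s3.download_file(
    "noaa-gfs-bdp-pds",
    "gfs.20240101/00/atmos/gfs.t00z.pgrb2.1p00.f000",
    "gfs.t00z.pgrb2.1p00.f000",
)
```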
Hi, so what about GFS? If the second option is adopted, I would like to help build the Zarr archive.
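Not to pre-empt the design, but here is a minimal sketch of the GRIB-to-Zarr step, assuming xarray with the cfgrib engine (the level filter and chunk sizes are placeholders to tune later):

```python
import xarray as xr

# Open a downloaded GFS GRIB2 file with cfgrib; filter_by_keys is needed
# because GFS files mix several level types in one file.
ds = xr.open_dataset(
    "gfs.t00z.pgrb2.1p00.f000",
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"typeOfLevel": "surface"}},
)

# Chunk for streaming during training (placeholder sizes) and write Zarr.
ds = ds.chunk({"latitude": 90, "longitude": 90})
ds.to_zarr("gfs_surface.zarr", mode="w")
```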