Solr Restore Space Considerations in Kubernetes #726
(Hi @mchennupati - it looks like the formatting on your post mangled a few things, so apologies if I'm missing something.)

As far as I can tell, your question isn't necessarily related to using the operator for restores; it's really a question about the disk and network costs of restoring a Solr collection. Assuming I've got that right, a better place to ask in the future would be our project's "user" mailing list: [email protected]. Please subscribe and ask similar questions there going forward!

To your specific question: if you're restoring data to an existing collection, Solr will have each replica fetch data from the backup repository. (So if you have three replicas each fetching a 100GB index, you'll pull 300GB from GCS.) Restores to a new collection work slightly differently, with only one replica fetching the index and then distributing it within your Solr cluster as needed. So the network impact of restores can be tuned a little bit.

In terms of disk space, though, ultimately all replicas of a shard will need a full copy of that shard's data, which sounds like 665GB in your case.
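For reference, a restore into a new collection is just a Collections API RESTORE call whose target collection does not exist yet. A minimal sketch, assuming a repository named gcs-backups and a backup named mycoll-backup (the host, names, and location are placeholders, not taken from this thread):

# Restore into a NEW collection: one replica fetches the index from the
# backup repository, and the other replicas then copy it from inside the
# cluster rather than pulling it from GCS again.
curl "http://localhost:8983/solr/admin/collections?action=RESTORE&name=mycoll-backup&collection=mycoll-restored&repository=gcs-backups&location=/&async=restore-1"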
Thank you for your reply. Yes, I think I didn't quite understand how the restore worked, but I figured it out eventually.

One aspect of my question still remains; perhaps it's missing documentation. The Solr operator provides a CRD for doing a backup, but a similar restore CRD doesn't exist, or is missing from the docs?

Thanks!
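For anyone landing here later: the backup half does exist as a CRD. A minimal sketch of a SolrBackup, assuming a cloud named mycoll-solrcloud with a repository gcs-backups configured on it (the names are placeholders; the field names follow the solr-operator's SolrBackup CRD as I understand it):

kubectl apply -f - <<'EOF'
apiVersion: solr.apache.org/v1beta1
kind: SolrBackup
metadata:
  name: mycoll-backup
spec:
  # SolrCloud instance to back up
  solrCloud: mycoll-solrcloud
  # Backup repository defined on that SolrCloud
  repositoryName: gcs-backups
  # Collections to include in the backup
  collections:
    - mycoll
EOF

As of this writing there is no matching SolrRestore CRD; restores go through the Collections API directly.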
I am restoring a large index (655G) that is currently on Google Cloud Storage to a new SolrCloud-on-Kubernetes instance, and I am trying to understand how much space I need to allocate to each of my node PVCs. I am currently using the Collections API, with async, to restore a collection saved in GCS.

When I check my disk usage for /var/solr/data on each of the nodes, it looks like this, so each of them appears to be downloading the entire index. I initially allocated 500G to each of the PVCs, but that turned out to be too little; I am now doing it with 700G.

Is this expected behaviour, or am I doing something wrong? One would have expected the metadata to have enough information to download the index in parts and not do it 655G x 3. It has cost me a fair bit in network costs already as I retry :)

In general, how would one restore a large index? I did not find a SolrRestore similar to the SolrBackup in the Solr operator CRDs, so I ran an async job using the Solr Collections API.

Thanks!
/var/solr/data$ du
4 ./userfiles
4 ./backup-restore/gcs-backups/gcscredential/..2024_10_11_06_16_24.1266852566
4 ./backup-restore/gcs-backups/gcscredential
8 ./backup-restore/gcs-backups
12 ./backup-restore
4 ./filestore
4 ./mycoll_shard3_replica_n3/data/tlog
4 ./mycoll_shard3_replica_n3/data/snapshot_metadata
8 ./mycoll_shard3_replica_n3/data/index
85744132 ./mycoll_shard3_replica_n3/data/restore.20241011062904489
85744152 ./mycoll_shard3_replica_n3/data
85744160 ./mycoll_shard3_replica_n3
85744192 .
solr@mycoll-solrcloud-0:/var/solr/data$ du -sh
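For reference, with the async flow the restore's progress can be polled through the same Collections API. A sketch, assuming restore-1 was the id passed as the async= parameter when submitting the restore:

# Poll an async Collections API request until its state is "completed"
# (or "failed"); the id must match the async= value from the RESTORE call.
curl "http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=restore-1"

# Once it finishes, clear the stored status so the request id can be reused.
curl "http://localhost:8983/solr/admin/collections?action=DELETESTATUS&requestid=restore-1"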