Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[discussion] should we update our s3 sccache prefixes? #161

Open
gforsyth opened this issue Mar 13, 2025 · 5 comments
Open

[discussion] should we update our s3 sccache prefixes? #161

gforsyth opened this issue Mar 13, 2025 · 5 comments
Labels
question Further information is requested

Comments

@gforsyth
Copy link

@bdice pointed out in
rapidsai/cuvs#751 (comment) that we usually
key our S3_SCCACHE_KEY_PREFIX as ${library}-${arch}, but that this might
lead to us having a bunch of CUDA 11 artifacts available to CUDA 12 builds, and
vice versa.

Should we also add ${cuda_major} to the prefix so that we can separate out
those artifacts? Or is this a non-issue?

@gforsyth gforsyth added the question Further information is requested label Mar 13, 2025
@jameslamb
Copy link
Member

jameslamb commented Mar 13, 2025

To be clear, that SCCACHE_S3_KEY_PREFIX is NOT the same as the cache key used by sccache to try to find precompiled objects.

So I think the language "having a bunch of CUDA 11 artifacts available to CUDA 12 builds" is a bit misleading.

Omitting CUDA version from that prefix doesn't, for example, increase the risk of a CUDA 12 builds getting CUDA 11 objects. See the rules at https://github.com/mozilla/sccache/blob/main/docs/Caching.md for what goes into the cache key... there are several inputs where CUDA version will be differentiated there, including "hash of the compiler binary" and "environment variables".

Per https://github.com/mozilla/sccache/blob/main/docs/S3.md#s3

You can also define a prefix that will be prepended to the keys of all cache objects created and read within the S3 bucket, effectively creating a scope. To do that use the SCCACHE_S3_KEY_PREFIX environment variable. This can be useful when sharing a bucket with another application.


is this a non-issue?

I think "yes, non-issue".

But let's talk specifics. Here are some of the reasons I can think of for why you'd want to change SCCACHE_S3_KEY_PREFIX:

  • issuing different permissions for subsets of objects
    • e.g. "this role can read and write all objects with prefix cudf-arm64-cu11/, but not anything else"
  • using different lifecycle configurations for subsets of objects
  • any applications that need to filter down to a subset of objects efficiently
    • e.g. to support filtering in Apache Spark queries, using Hive-style partitioning (docs)
    • or for queries like "tell me how many cugraph arm64 CUDA 12 objects there are in the bucket"

If we don't have any of those needs, we shouldn't change the prefixes.

I wouldn't expect finer-grained prefixes to make sccache any more efficient, as I expect that it's constructing a key of the form {SCCACHE_S3_KEY_PREFIX}{key_from_compilation_inputs} and doing a direct lookup on that in the bucket.

And changing that prefix will change the keys which will impose the 1-time cost of needing full uncached rebuilds to repopulate the cache, which can be expensive.

@gforsyth
Copy link
Author

Thanks for all of that info @jameslamb !

Given all of that, it sounds like maybe including the ${arch} in the prefix key is also unnecessary, unless, as you note, we want to be able to distinguish those objects in the bucket somehow.

I don't know that it's worth the rebuild to change that cache-key everywhere, but as I port things to rattler, I have to rebuild the cache anyway, so I could remove the ${arch} bit of the key then.

@jameslamb
Copy link
Member

as I port things to rattler, I have to rebuild the cache anyway

Ah ok, I didn't know that.

If you're not worried about having to rebuild the cache because you have to anyway, and you do decide to change the cache prefixes, then I'd recommend making the prefixes MORE granular, not LESS, by doing what you proposed at the top of this issue and adding CUDA major version. That'd leave open the possibility of finer-grained access patterns like the ones I mentioned above in the future.

If you do that, I'd also take this opportunity to change the delimiter for keys from - to /. So like:

${library}/${arch}/${cuda_major}

That's often done by convention in S3 to make things look a little bit more like filepaths... the S3 UI will even render a directory-like structure if you use that delimiter. / is preferable to - because it's unlikely to otherwise occur in the things you might filter by... unlike -, which shows up in names like dask-cudf or nx-cugraph.

Docs on that: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html

@gforsyth
Copy link
Author

Ok, cool, I'm trying out the more granular prefix and the slash delimited in the cuml and cuvs rattler PRs, but I think this will be a good change to have!

@bdice
Copy link
Contributor

bdice commented Mar 17, 2025

@jameslamb Thanks for all the info! This helps. I agree with the decision to go with ${library}/${arch}/${cuda_major}. Concretely, something like cudf/amd64/cuda12. Appreciate both of your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants