-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add support for URL path rewrites #56
Conversation
@sisp thanks! I don't see any issues with this approach. The only potentially concerning thing is that we have regex now in config which Studio is dealing with. We'll be running this server side with an arbitrary values. Meaning there is a surface for DDoS, or some other attacks. |
I'm not sure if I want this to be in Also, using [core]
remote = default
['remote "default"']
url = ...
method = PUT
legacy_cache_location = "files"
new_cache_location = "files-md5" We could also make it generic, but it might be hard to maintain compatibility (if the format args, or structure changes): ['remote "default"']
url = ...
method = PUT
cache_format = "files-{hash}" # or, "files/{hash}" But I'd prefer to wait for @iterative/dvc to respond. |
@shcheklein Thanks for your feedback and good catch about potential attacks. @skshetry Thanks for your feedback! I like the generalized simpler config applicable to any remote. The way I understand the split file content hash (2 chars for the directory, the remaining chars for the filename), this is to optimize performance on e.g. Amazon S3, but it may not be necessary for all remotes including some custom HTTP remote. Could we make this configurable, too? Perhaps using a setting like |
For the record: we've discussed in the team earlier this week and we think that the best way to go about this is for you to create something like an fs (gitlabartifactfs? 🤷♂️ ) for this that would encode paths in a way that will work for gitlab. |
Thanks, @efiop! Just to clarify: It sounds like you mean creating a new |
@sisp Yes, we mean creating an fsspec-compatible fs and using it in a plugin, that's just the easiest way to go about it as a whole. Note that neither the fs nor the plugin need to be public, you could implement it and keep it yourself. Regarding code duplication, nothing really prevents you from just wrapping an httpfs, that's totally normal and we've done fs wraps ourselves in the past. Regarding mapping simplicity, you are correct that it is not complex right now, but I We've also considered creating a gitlab specific odb implementation, but that's not really general enough and will likely not suffice long term. An fs seems like the safest and simplest way, all things considered. Considering just how deep you are already, creating a good quality fsspec implementation for this should be a piece of cake (the only head scratching that I see is how to encode path separators the best way). We have an extremely limited capacity these days so I have to clarify that we won't be able to maintain that fs ourselves (or the plugin) and likely won't be able to help much if at all, but worst case scenario you will be 1-2 simple monkey patches away (just need to patch dvc config and fs registry) from making everything work, and best case scenario this will push us to finalize small things that are left to make our plugin mechanism fully operational so that you could properly plug in dvc-gitlab through official API. |
Thanks for the detailed clarification, @efiop! Sounds like a good plan! I'll pursue your suggested approach. Encoding URL paths to make GitLab's generic package registry fs-like is a good idea, I hadn't thought about that. 👌 I'll close this PR as it's clear this isn't the right approach. Thanks again for the fruitful discussion and guidance! 🙏 |
Just to make sure we are on the same page: please correct me if I'm wrong, but the only problem we are trying to get around here is encoding paths that are deeper than 3 dirs, so the simplest things that come to mind are just
|
@sisp Btw, does that gitlab artifact api allow listing? If so, then you should be able to implement |
I was thinking to encode all paths to namespace Yes, GitLab's generic package registry supports listing files in a package and even deleting files. |
@sisp Oh, I see what you mean. Yeah, that sounds like a plan. |
This is an attempt at resolving #50 (comment) which requires rewriting the remote URL path from
.../files/md5/xx/xxxx...
to, e.g.,.../files-md5/xx/xxxx...
(or even a more substantial rewrite to.../dvc/files-md5/xxxxxx...
) because our HTTP backend does not support arbitrary subpaths but is limited to at most 3 levels of nesting. Originally, we were considering a dedicated DVC plugin, but it seems to lead to significant copy & paste from this plugin as the main difference is the modified remote URL path.I've added a new config field
rewrites
which is a list of rewrite directives inspired by, e.g., NGINX rewrite rules which consists of a regex with capture groups matched against the source path and a replacement string containing capture group references. For instance, a DVC config containing a rewrite directive looks like this:[core] remote = default ['remote "default"'] url = ... method = PUT + rewrites = /files/md5/([^/]+)/(.+)$ /dvc/files-md5/\1\2,
Note the trailing comma, which turns this value into a list (see the
configobj
config file format spec).Because of iterative/dvc#9711, a PR submitted to https://github.com/iterative/dvc is necessary to extend the config schema, whose changes might look like this: iterative/dvc@main...sisp:dvc:feat/dvc-http-rewrites
Originally, I was attempting to subclass
aiohttp.ClientSession
, add URL path rewrites there, and pass that custom client session instance todvc_http.retry.ReadOnlyRetryClient
, but it turned out that subclassingaiohttp.ClientSession
is discouraged (and will be even forbidden inaiohttp
v4). See also aio-libs/aiohttp#2691. So, I decided to dump the subclassing idea and rather create a new custom client, compatible with the (implicitly) required client interface infsspec.implementations.http.HTTPFileSystem
, that is composed of theaiohttp_retry.RetryClient
, adds read-only retries (as before), and also supports URL path rewriting.Testing of this extension is currently a bit thin because I'm not sure where I'd add tests that actually check requests with URL rewrites.
WDYT about this extension?
/cc @shcheklein