Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for URL path rewrites #56

Closed
wants to merge 1 commit into from

Conversation

sisp
Copy link
Contributor

@sisp sisp commented Dec 15, 2023

This is an attempt at resolving #50 (comment) which requires rewriting the remote URL path from .../files/md5/xx/xxxx... to, e.g., .../files-md5/xx/xxxx... (or even a more substantial rewrite to .../dvc/files-md5/xxxxxx...) because our HTTP backend does not support arbitrary subpaths but is limited to at most 3 levels of nesting. Originally, we were considering a dedicated DVC plugin, but it seems to lead to significant copy & paste from this plugin as the main difference is the modified remote URL path.

I've added a new config field rewrites which is a list of rewrite directives inspired by, e.g., NGINX rewrite rules which consists of a regex with capture groups matched against the source path and a replacement string containing capture group references. For instance, a DVC config containing a rewrite directive looks like this:

 [core]
     remote = default
 ['remote "default"']
     url = ...
     method = PUT
+    rewrites = /files/md5/([^/]+)/(.+)$ /dvc/files-md5/\1\2,

Note the trailing comma, which turns this value into a list (see the configobj config file format spec).

Because of iterative/dvc#9711, a PR submitted to https://github.com/iterative/dvc is necessary to extend the config schema, whose changes might look like this: iterative/dvc@main...sisp:dvc:feat/dvc-http-rewrites

Originally, I was attempting to subclass aiohttp.ClientSession, add URL path rewrites there, and pass that custom client session instance to dvc_http.retry.ReadOnlyRetryClient, but it turned out that subclassing aiohttp.ClientSession is discouraged (and will be even forbidden in aiohttp v4). See also aio-libs/aiohttp#2691. So, I decided to dump the subclassing idea and rather create a new custom client, compatible with the (implicitly) required client interface in fsspec.implementations.http.HTTPFileSystem, that is composed of the aiohttp_retry.RetryClient, adds read-only retries (as before), and also supports URL path rewriting.

Testing of this extension is currently a bit thin because I'm not sure where I'd add tests that actually check requests with URL rewrites.

WDYT about this extension?

/cc @shcheklein

@dberenbaum dberenbaum requested review from a team and shcheklein December 21, 2023 18:59
@shcheklein
Copy link
Member

@sisp thanks!

I don't see any issues with this approach. The only potentially concerning thing is that we have regex now in config which Studio is dealing with. We'll be running this server side with an arbitrary values. Meaning there is a surface for DDoS, or some other attacks.

@skshetry
Copy link
Member

skshetry commented Dec 22, 2023

I'm not sure if I want this to be in http remote. I'd not do this just because there is going to be code duplication. I don't think that's a problem, but there are ways to fix that.

Also, using regex opens up dvc to ReDOS attack, and feels a very heavyweight solution.
We could introduce a simpler config to set a remote mapping, which could be applicable to any remotes.

 [core]
     remote = default
 ['remote "default"']
     url = ...
     method = PUT
     legacy_cache_location = "files"
     new_cache_location = "files-md5"

We could also make it generic, but it might be hard to maintain compatibility (if the format args, or structure changes):

 ['remote "default"']
     url = ...
     method = PUT
     cache_format = "files-{hash}" # or, "files/{hash}"

But I'd prefer to wait for @iterative/dvc to respond.

@sisp
Copy link
Contributor Author

sisp commented Dec 22, 2023

@shcheklein Thanks for your feedback and good catch about potential attacks.

@skshetry Thanks for your feedback! I like the generalized simpler config applicable to any remote. The way I understand the split file content hash (2 chars for the directory, the remaining chars for the filename), this is to optimize performance on e.g. Amazon S3, but it may not be necessary for all remotes including some custom HTTP remote. Could we make this configurable, too? Perhaps using a setting like cache_file_prefix = 2, which could be set to 0 for disabling the 2 chars subdirectory on the remote. And we should also add similar configs for the run-cache.

@efiop
Copy link
Contributor

efiop commented Dec 29, 2023

For the record: we've discussed in the team earlier this week and we think that the best way to go about this is for you to create something like an fs (gitlabartifactfs? 🤷‍♂️ ) for this that would encode paths in a way that will work for gitlab. dvc-http is meant to cover simple http servers really, and this goes way beyond its purpose.

@sisp
Copy link
Contributor Author

sisp commented Dec 29, 2023

Thanks, @efiop! Just to clarify: It sounds like you mean creating a new fsspec implementation (like gdrivefs) that would be used in a new DVC plugin (like dvc-gdrive). Right? If so, GitLab's generic package registry is at best a very limited filesystem-like storage because it has exactly 3 nesting levels, each of which actually has a meaning (although it can be ignored from a technical point of view). With DVC's content-adressable storage, a reasonable mapping like dvc/files-md5/<hash> is possible and isn't too hacky, but I don't quite see how an fsspec implementation would fit. A new DVC plugin that uses fsspec.implementation.http.HTTPFileSystem very similar to dvc-http would be possible, but there would be quite a bit of code duplication because it would essentially be dvc-http with a particular hardcoded path mapping. That's the reason why I created this PR and find #56 (comment) a good idea.

@efiop
Copy link
Contributor

efiop commented Dec 29, 2023

@sisp Yes, we mean creating an fsspec-compatible fs and using it in a plugin, that's just the easiest way to go about it as a whole. Note that neither the fs nor the plugin need to be public, you could implement it and keep it yourself.

Regarding code duplication, nothing really prevents you from just wrapping an httpfs, that's totally normal and we've done fs wraps ourselves in the past.

Regarding mapping simplicity, you are correct that it is not complex right now, but I files/md5 is just one type of objects and we currently have at least one more (runs/) and there will likely be even more in the future, which are not really limited to content-addressable storage (runs is not quite content-addressable btw). So it is in all of our interest for this implementation to be general enough to not require constant maintenance, like when in 3.0 our http implementation suddenly broke for you.

We've also considered creating a gitlab specific odb implementation, but that's not really general enough and will likely not suffice long term. An fs seems like the safest and simplest way, all things considered.

Considering just how deep you are already, creating a good quality fsspec implementation for this should be a piece of cake (the only head scratching that I see is how to encode path separators the best way). We have an extremely limited capacity these days so I have to clarify that we won't be able to maintain that fs ourselves (or the plugin) and likely won't be able to help much if at all, but worst case scenario you will be 1-2 simple monkey patches away (just need to patch dvc config and fs registry) from making everything work, and best case scenario this will push us to finalize small things that are left to make our plugin mechanism fully operational so that you could properly plug in dvc-gitlab through official API.

@sisp
Copy link
Contributor Author

sisp commented Dec 29, 2023

Thanks for the detailed clarification, @efiop! Sounds like a good plan! I'll pursue your suggested approach. Encoding URL paths to make GitLab's generic package registry fs-like is a good idea, I hadn't thought about that. 👌

I'll close this PR as it's clear this isn't the right approach. Thanks again for the fruitful discussion and guidance! 🙏

@sisp sisp closed this Dec 29, 2023
@efiop
Copy link
Contributor

efiop commented Dec 29, 2023

Encoding URL paths to make GitLab's generic package registry fs-like

Just to make sure we are on the same page: please correct me if I'm wrong, but the only problem we are trying to get around here is encoding paths that are deeper than 3 dirs, so the simplest things that come to mind are just .replace("/". "-") (or maybe use something more special like @ instead of -, since - is maybe a bit too generic) or maybe replace but starting from a certain depth (e.g. foo/bar/baz/qux/quux -> foo/bar/baz-qux-quux). If we are on the same page so far, then you'll need to do something really simple like:

class GitLabArtifactFileSystem(AbstractFileSystem):

    def __init__(self, ...):
        ...
        self._fs = HTTPFileSystem(...)

    def _get_url(self, path):
       return posixpath.join(self.url, path.replace("/", "@"))  # or whatever encoding logic makes sense to you
       
    def exists(self, path):
        return self._fs.exists(self._get_url(path))

   # and also decoding for ls and so on
    ...

@efiop
Copy link
Contributor

efiop commented Dec 29, 2023

@sisp Btw, does that gitlab artifact api allow listing? If so, then you should be able to implement ls, which dvc-http is missing. This should make it even more full-featured than our basic dvc-http implementation, which should be really nice (e.g. things like dvc gc should work here).

@sisp
Copy link
Contributor Author

sisp commented Dec 29, 2023

I was thinking to encode all paths to namespace fsspec-managed files in the generic package registry, e.g. under a package "fsspec" with version "0" or "files" or whatever. Perhaps a URL-safe base64 encoding could work (not sure about the possible = character though)? Thanks for the snippet!

Yes, GitLab's generic package registry supports listing files in a package and even deleting files.

@efiop
Copy link
Contributor

efiop commented Dec 29, 2023

@sisp Oh, I see what you mean. Yeah, that sounds like a plan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants