Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

httpcache vs. Conditional Request vs. X-Varied-Authorization header #437

Open
andygrunwald opened this issue Mar 7, 2025 · 4 comments
Open

Comments

@andygrunwald
Copy link
Contributor

andygrunwald commented Mar 7, 2025

Context

I run go-githubapp with github.com/gregjones/httpcache (github.com/gregjones/httpcache/diskcache in particular). I run it in an "Github App context", means: The user installs a Github app, based on this, I get permission to make request:

[...]
clientCreator, err := githubapp.NewDefaultCachingClientCreator(
	config.Github,
	githubapp.WithClientTimeout(60*time.Second),
	githubapp.WithClientCaching(false, func() httpcache.Cache { return httpTransportCache }),
	githubapp.WithClientMiddleware(
		githubapp.ClientLogging(zerolog.InfoLevel, githubapp.LogRateLimitInformation(&githubapp.RateLimitLoggingOption{
			Limit:     true,
			Remaining: true,
			Used:      true,
			Reset:     true,
			Resource:  true,
		})),
	),
)
[...]

followed by ...

installationClient, err := clientCreator.NewInstallationClient(<ID>)

followed by ...

issue, resp, err := installationClient.Issues.Get(context.Background(), "andygrunwald", "go-jira", 687)

This works great, as expected.
Caching is enabled and it writes and read the cache. However, I also activated logging via

loggerConsoleOutput := zerolog.ConsoleWriter{Out: os.Stdout, TimeFormat: time.RFC3339}
	logger := zerolog.New(loggerConsoleOutput).Level(zerolog.DebugLevel).With().Timestamp().Logger()
	zerolog.DefaultContextLogger = &logger

to get details about the usage of the cache:

2025-03-06T23:11:40+01:00 INF github_request cached=false elapsed=400.52325 method=GET path=https://api.github.com/repos/andygrunwald/go-jira/issues/687 ratelimit={"limit":7950,"remaining":7943,"reset":"2025-03-06T23:19:40+01:00","resource":"core","used":7} size=-1 status=200
2025-03-06T23:11:40+01:00 INF github_request cached=true elapsed=0.086375 method=GET path=https://api.github.com/repos/andygrunwald/go-jira/issues/687 ratelimit={"limit":7950,"remaining":7943,"reset":"2025-03-06T23:19:40+01:00","resource":"core","used":7} size=-1 status=200

I discovered that the cache is not working as I expected and as described in Use conditional requests if appropriate.

Caching deep dive

The caching entry of a request contains the response headers like (stripped version)

HTTP/2.0 200 OK
Cache-Control: private, max-age=60, s-maxage=60
Etag: W/"7640648674d954ec24d5ee4c94bae93674a9a85852d6f7192cbee73f83c78da0"
Last-Modified: Mon, 14 Oct 2024 01:13:57 GMT
Vary: Accept, Authorization, Cookie, X-GitHub-OTP,Accept-Encoding, Accept, X-Requested-With
X-Varied-Accept: application/vnd.github.squirrel-girl-preview
X-Varied-Authorization: token <SOME TOKEN>

httpcache is respecting (and comparing) all headers listed in Vary in https://github.com/gregjones/httpcache/blob/901d90724c7919163f472a9812253fb26761123d/httpcache.go#L160
This includes X-Varied-Authorization.
X-Varied-Authorization contains the token that has a limited lifetime.

Once the token lifetime is reached, it makes an uncached (not conditional) request, as X-Varied-Authorization varies, but the content itself does not change at all. In bigger repositories, when you aim to crawl those, you can hit the rate limit pretty quickly.

As far as I understand, this behavior is in line with RFC 7234.

Question on X-Varied-Authorization

In this usecase (crawling repositories by a Github app), I was wondering:

Is there any (negative) side affect to discard (read: Remove) the X-Varied-Authorization when writing the cache?

When the X-Varied-Authorization header is not compared in https://github.com/gregjones/httpcache/blob/901d90724c7919163f472a9812253fb26761123d/httpcache.go#L121, the cache hitrate would increase by a lot.

Another benefit: You would not cache an auth secret (here, I am not 100% sure if this is a real issue, as the token is not valid anymore).

Implementation

Next thought would be "How would this be implemented?".
The simplest solution would be "to hack this in" by forking httpcache and adding a piece of if-condition into the varyMatches or related.

Another option: Can we somehow "hook" into the process somewhere? I would like to avoid copying the diskcache part, as it receives only a []byte which would require a lot of parsing.

Small runnable example

If requested, I can provide a minimal code example to reproduce it.

Disclaimer

I do understand that this is not exactly a go-githubapp issue. Rather a caching issue. However, it is pretty much related to the usecase of go-githubapp.

I would appreciate your thoughts on this.

@bluekeyes
Copy link
Member

I'm not sure if this is something that can directly fix in go-githubapp. Including authentication material in the cache is important to avoid leaking data when sharing a cache between clients with different installation IDs. For instance, if installation ID 123 has access to private repositories, it shouldn't be possible for a client with installation ID 456 to request the same resource and get a response from the cache when a real response from GitHub would return an error.

Given the API we expose in go-githubapp, I think any direct options to ignore the authorization information when caching could make it too easy to do the wrong thing and leak information. But we might be able to expose more general functionality (like alternate middleware options) so that clients who are sure that caches will be isolated by authentication domain could customize the functionality.

As a starting point with current features, maybe try something like this:

clientCreator, err := githubapp.NewDefaultCachingClientCreator(
	config.Github,
	githubapp.WithClientTimeout(60*time.Second),
	githubapp.WithClientMiddleware(
		githubapp.ClientLogging(zerolog.InfoLevel, githubapp.LogRateLimitInformation(&githubapp.RateLimitLoggingOption{
			Limit:     true,
			Remaining: true,
			Used:      true,
			Reset:     true,
			Resource:  true,
		})),
		func(next http.RoundTripper) http.RoundTripper {
			return &httpcache.Transport{
				Transport:			 next,
				Cache:				 httpTransportCache,
				MarkCachedResponses: true,
			}
		},
	),
)

Note that this no longer uses WithClientCaching, but instead adds the cache as custom middleware. This means the cache sees requests before they have authentication, so it should be isolated from any changes to the authentication token.

That said, the go-github README now links to https://github.com/bored-engineer/github-conditional-http-transport, which claims that the ETag values GitHub returns also use the authentication information. So even if the cache is not directly using the authentication information, you may still not see the level of caching you expect.

@andygrunwald
Copy link
Contributor Author

Thanks for the hint. I digged a bit into it and also stumbled upon https://github.com/bored-engineer/github-conditional-http-transport. I read through it and you are right. If the claims are true, the "home grown middleware" might not be the full solution.

It really looks like a not solvable problem when you rely on Installation tokens and want to avoid any non-stable implementation.

About this issue: Not sure if it is worth to keep it open. Feel free to close it. Appreciated your response.

@andygrunwald
Copy link
Contributor Author

@bluekeyes I found another small possible idea: Maybe the etag is not the solution, but only the Last-Modified header. See bored-engineer/github-conditional-http-transport#1

@bluekeyes
Copy link
Member

Yeah, using Last-Modified makes sense for the resources that support it. Here's another (untested) suggestion for how to test this out without forking things:

type PreferLastModified struct {
	next http.RoundTripper
}

func (plm *PreferLastModified) RoundTrip(req *http.Request) (*http.Response, error) {
	res, err := plm.next.RoundTrip(req)
	if res != nil {
		// If the response includes a Last-Modified header, remove any ETag
		// header to force the cache to use time-based conditional requests
		if lastModified := res.Header.Get("last-modified"); lastModified != "" {
			res.Header.Del("etag")
		}
	}
	return res, err
}

// Later, when initializing the ClientCreator
clientCreator, err := githubapp.NewDefaultCachingClientCreator(
	config.Github,
	githubapp.WithClientTimeout(60*time.Second),
	githubapp.WithClientMiddleware(
		githubapp.ClientLogging(zerolog.InfoLevel, githubapp.LogRateLimitInformation(&githubapp.RateLimitLoggingOption{
			Limit:     true,
			Remaining: true,
			Used:      true,
			Reset:     true,
			Resource:  true,
		})),
		func(next http.RoundTripper) http.RoundTripper {
			return &httpcache.Transport{
				Transport:           &PreferLastModified{next},
				Cache:               httpTransportCache,
				MarkCachedResponses: true,
			}
		},
	),
)

Like my last suggestion, this puts the cache outside of the authentication to avoid seeing the Authorization header. However, now it also modifies the response from GitHub to remove the ETag header before the cache sees the response content. I think that should force the cache to prefer time-based requests.

If this all works, it might be something we could support natively in go-githubapp. I think it would need a new option function (WithClientCachingV2 or something), but because the library knows the authentication material (token or ID) used with each client, we can theoretically ensure that responses are segregated in the cache without relying on the Authorization header checks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants