Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle customer proxy re-auth response by retrying, not prompting user for different token #6652

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

sqs
Copy link
Member

@sqs sqs commented Jan 15, 2025

A customer has a proxy that enforces recent 2fa on their machines to access the Sourcegraph instance. After 24h, the user needs to tap their 2fa key to make the network connection to Sourcegraph be reauthed and start working again. The responses from the proxy after the 24h period expires are being treated as "invalid token" and causing users to need to re-auth into Cody. The response looks like:

{"auth_url":"https://example.com","error":"request not authenticated","requested_url":"example.com/.api/client-config"}

with an HTTP header of the form X-${CUSTOMER}-U2f-Challenge: true.

  • Send Accept: application/json headers to ensure we get that JSON response and not an HTTP redirect from the proxy when the 24h period expires
  • Interpret that (or that kind of) error response in our auth code as a temporary network error and do not tell the user that token is invalid and ask for another token.
  • In general, distinguish between ephemeral network/availability errors and invalid access token errors. Do not ask the user to provide a different token for the former kind of errors.
  • Improve look and feel of the UI authentication error banner, and add a Try Again button.
  • Automatically refresh periodically (currently every 2.5s) when authentication failed awaiting an auth challenge.

Fixed issues (by @pkukielka ):

  • failed autocomplete triggers change to the status bar; if user forces autocomplete popup is additionally show
  • periodic refresh causes UI jitter in the status bar fas been fixed
  • prompt library dropdown error handling has been fixed (by the way of general improvements to the error handling)
  • fixed cody ignore logic to not mask network or auth erros as policy errors
  • fixed some incorrect caching of the ignore API requests
  • completions can now report/reset connection errors shown in a cody status icon

Fixes https://linear.app/sourcegraph/issue/CODY-4695/handle-customer-proxy-re-auth-response-by-retrying-not-prompting-user

Screenshots

When the user needs to complete the auth challenge, they will see the following errors, depending on what they're doing.

If the extension hasn't successfully authenticated since the editor was restarted:

image

In chat:

image

After failed completion:
image

Test plan

TBD

Changelog

  • Cody better distinguishes between ephemeral network/availability errors and invalid access token errors. You are no longer asked to provide a different token if authentication fails due to a network error.
  • For customers using an authentication challenge proxy that wraps the Sourcegraph instance, if it returns HTTP 401 with an HTTP header whose name matches /^X-.*-U2f-Challenge$/i and whose value is true, the user is asked to complete the authentication challenge on their device ("Tap YubiKey to Authenticate...") and is not prompted for a different access token. When the device's authentication challenge is successful, the extension will automatically proceeed to sign the user in using their stored access token (if any).

@sqs sqs marked this pull request as draft January 15, 2025 18:35
@sqs sqs changed the title add narrower Accept: application/json header to outgoing HTTP requests Handle customer proxy re-auth response by retrying, not prompting user for different token Jan 15, 2025
@sqs sqs force-pushed the sqs/client-config-accept branch 5 times, most recently from c77e517 to e7ed90b Compare January 16, 2025 07:51
Comment on lines 1723 to 1728
///////////////////////////////////
// TODO!(sqs): remove
///////////////////////////////////
if (require('node:fs').existsSync('/tmp/is-custom-auth-challenge-error')) {
return true
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this is for local testing and will be removed before merging.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I added https://github.com/sourcegraph/cody/blob/main/agent/scripts/reverse-proxy.py which is very simplistic for now but could be easily extended to simulate that kind of proxy response.

@sqs sqs force-pushed the sqs/client-config-accept branch from e7ed90b to 1e632ce Compare January 16, 2025 08:16
@pkukielka pkukielka force-pushed the sqs/client-config-accept branch 4 times, most recently from 1daa94c to c08e156 Compare January 27, 2025 10:25
@pkukielka pkukielka force-pushed the sqs/client-config-accept branch from c08e156 to ae6f900 Compare February 7, 2025 14:58
Copy link
Contributor

@dominiccooney dominiccooney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs work, some ideas inline.

@@ -35,6 +35,17 @@ async def proxy_handler(request):
headers=headers,
data=await request.read()
) as response:
if random.random() < u2f_challenge_chance:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but it depends on the frequency of requests we make. Wouldn't it be useful to have a pattern of, for example, accept for n seconds, challenge for m seconds? That would be more like an expiry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to just kill it and then restart with whatever parameter I need when I need it.
But yea, we can definitely make it more useful and support more scenarios when someone else than we will start using it :)

@@ -1,9 +1,9 @@
#!/usr/bin/env python3

import random
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to test the accept/json change that @sqs mentioned by generating JSON here?


const uri = getTestURI({ repoName: 'whatever', filePath: 'foo/bar.ts' })
expect(await provider.isUriIgnored(uri)).toBe('has-ignore-everything-filters')
expect(await provider.isUriIgnored(uri)).toBe(error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to separate this PR into pieces so we have the retryable errors narrowly focused in one piece, the 2 factor challenges in another, etc.

I think this change is okay because Boolean(e) is true for errors e but it would be easier to review more focused, incremental changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but fixing error handling in network code should really be first PR, and when I inherited this PR I was hoping we will be able to finish it quickly.
Do you want to split it now?
It will definitely take some time to untangle things poperly.

}

this.contextFiltersSubscriber.notify(contextFilters)
this.contextFiltersSubscriber.notify(this.lastContextFiltersResponse)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is subtly different because of canonicalizeContextFilters. Should we start freezing everything coming out of there, because it is getting passed around more?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I know what you mean.

But in general I'm wondering if we need canonicalizeContextFilters at all. I understand we have it so we can relay on object equality later, but I somehow does not like that. Only place which uses it anyway is:

    private hasIgnoreEverythingFilters() {
        return this.lastContextFiltersResponse === EXCLUDE_EVERYTHING_CONTEXT_FILTERS
    }

And we could just do better check here. I may try to rework it a bit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We canonicalize so we can do === because it is fast. The field is set once but checked often.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I reverted that particular change

@@ -148,12 +150,14 @@ export class ContextFiltersProvider implements vscode.Disposable {
verbose: contextFilters,
})
}

const filters = isError(contextFilters) ? EXCLUDE_EVERYTHING_CONTEXT_FILTERS : contextFilters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be less error prone:

Suggested change
const filters = isError(contextFilters) ? EXCLUDE_EVERYTHING_CONTEXT_FILTERS : contextFilters
if (isError(contextFilters)) {
contextFilters = EXCLUDE_EVERYTHING_CONTEXT_FILTERS
}

This is better because after this point someone might accidentally use filters where they meant contextFilters or vice versa. Only the type system would flap if the point of use did not admit Error.

Copy link
Contributor

@pkukielka pkukielka Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it might be reworked to eliminate that error but I don't think overwriting contextFilters is actually good idea. Type of contextFilters is now ContextFilters | Error (for better or worse) and dropping that error on this var would IMHO be not correct.

I will rework it to make everything less error prone.

@@ -6,9 +6,9 @@ export async function isUriIgnoredByContextFilterWithNotification(
uri: vscode.Uri,
feature: CodyIgnoreFeature
): Promise<IsIgnored> {
const isIgnored = await contextFiltersProvider.isUriIgnored(uri)
const isIgnored = await contextFiltersProvider.isUriIgnored(uri, /* foreceFetch = */ true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spelling, but see other comments about this parameter


await showSignInMenu()
expect(mockShowInputBox).toHaveBeenCalled()
expect(mockShowErrorMessage).toHaveBeenCalledWith('The access token is invalid or has expired')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests need updates.

You are making life difficult for JetBrains by putting the title in the text like that (the 'Invalid Access Token:' part), it would be better to have titles separate so when we do these notifications we can conform to the JetBrains UI guidelines.

VSCode can have a thunk that does the string concatenation if it needs that.

Copy link
Contributor

@pkukielka pkukielka Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not have big choice, this is VSCode API:

/**
 * Show an error message.
 *
 * @see {@link window.showInformationMessage showInformationMessage}
 *
 * @param message The message to show.
 * @param items A set of items that will be rendered as actions in the message.
 * @return A thenable that resolves to the selected item or `undefined` when being dismissed.
 */
export function showErrorMessage<T extends string>(message: string, ...items: T[]): Thenable<T | undefined>;

private async fetchIfNeeded(): Promise<void> {
if (!this.fetchIntervalId && !this.isTesting) {
private async fetchIfNeeded(forceFetch = false): Promise<void> {
if (forceFetch || (!this.fetchIntervalId && !this.isTesting)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is rotten and we need to clean this up. Specifically:

  • startRefetchTimer's timeout calls setRefetchTimer. Thus it is like setInterval, effectively.
  • Nothing calls clearTimeout (except reset, which is only used by tests or dispose, so not relevant here.)
  • Thus multiple calls to startRefetchTimer will "pile up" the timeouts and hammer the network.

The above complaints have nothing to do with this PR, but we need to clean that up. The intent evidently was if fetchIntervalId is set, then to not call setTimeout again. But the await this.fetchContextFilters() here means any number of fetchIfNeeded can pile up in here before we reach startRefetchTimer and set the fetchIntervalId.

Basically we should put the brakes on multiple scheduled timeouts in startRefetchTimer. And we should make the context filters fetch de-dup multiple in flight requests given the same endpoint or whatever.

That out of the way, forceFetch makes this even more dangerous: It skips any attempt at the existing interval check. Seatbelts? Off. Let's roll. 💀

You should go with the grain of the design... it looks like fetchContextFilters can decide some errors are transient and return a lower refetch period with exponential backoff. Integrate the refetching behavior you want with that instead of papering over it like this.

@@ -226,11 +230,11 @@ export class ContextFiltersProvider implements vscode.Disposable {
return isIgnored
}

public async isUriIgnored(uri: vscode.Uri): Promise<IsIgnored> {
public async isUriIgnored(uri: vscode.Uri, forceFetch = false): Promise<IsIgnored> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this idea at all! Don't do it!

First, basic API design lesson, but boolean parameters are terrible because they give no semantic clue at the callsite. I guarantee if you ask three people what isUriIgnored(uri, false) does, they're not going to say "it means don't make a network request unless the context filters refresh timeout has passed."

Second, there is no brake on isUriIgnored(..., true) completely spamming us with requests. Today the callers are fine... some human triggered operations via the ...WithNotification method. But "drop everything and do it = true" this is simply the wrong level of abstraction.

The fetcher already has the concept of retrying transient errors like network errors more aggressively. Isn't tuning that enough? If not, let's make this semantic, like if we classified requests as being background/foreground, passive/active, something like that. With the idea being that if the user explicitly did something or is retrying something then you'd act differently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second, there is no brake on isUriIgnored(..., true) completely spamming us with requests. Today the callers are fine... some human triggered operations via the ...WithNotification method. But "drop everything and do it = true" this is simply the wrong level of abstraction.

I agree and disagree at the same time.
I think explicit human actions should always force fetch.
E.g. if I had network disabled, I enable it, and then do some Cody action, there should be no situation that some internal mechanism decides "I already tried that 3 times, now I will wait 1 minute before retrying, here is a cached failed result!".
Now, I agree we would need some general mechanism which would prevent e.g. such request being repeated in a loop when triggered just once. Or being triggered by no-human just because of some refactoring in the code.
But I see no easy way to do it and surely not without some big refactoring.

I think in this specific cause if we can somehow guarantee it is called only in isUriIgnoredByContextFilterWithNotification it will be pretty safe. If isUriIgnoredByContextFilterWithNotification would be called in the wrong context even without my changes, it would spam user with notifications. So we are already relying on it being called in a proper context.

@pkukielka pkukielka force-pushed the sqs/client-config-accept branch from ae6f900 to 84f8d4e Compare February 14, 2025 15:51
@pkukielka pkukielka marked this pull request as ready for review February 14, 2025 15:51
logIgnored(document.uri, 'context-filter', isManualCompletion)
const isIgnored = await contextFiltersProvider.isUriIgnored(
document.uri,
/* foreceFetch = */ isManualCompletion
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: spelling here too

onShow: () => {},
})
}

// TODO(philipp-spiess): Bring back this code once we have fewer uncaught errors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, is this comment related to all these changes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of, I'm bringing back some but not all :)

Copy link
Contributor

@dominiccooney dominiccooney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some feedback inline.

}

this.contextFiltersSubscriber.notify(contextFilters)
this.contextFiltersSubscriber.notify(this.lastContextFiltersResponse)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We canonicalize so we can do === because it is fast. The field is set once but checked often.

this.lastFetchDelay *= nextIntervalHint.backoff
} else {
this.lastFetchDelay = nextIntervalHint.initialInterval
console.log(this.lastFetchDelay)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this console.log

const intervalHint = await this.fetchContextFilters()
this.startRefetchTimer(intervalHint)
}
this.lastResultLifetime = this.lastResultLifetime.then(async intervalHint => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stacking up the Promises like this is nice and simple.

I assume the VM doesn't do anything weird like trying to keep a long chain of callstacks alive for debugging...

Copy link
Contributor

@pkukielka pkukielka Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I know it should be safe.

Something like this utilizes 100% of my CPU for a few minutes but finishes successfully:

let p = Promise.resolve(0)
for (let i = 0; i < 100000000; i++) {
    p = p.then(v => v + 1)
    const result = await p
    if (result % 1000000 === 0) console.log(result)
}

@@ -269,17 +268,13 @@ export class ContextFiltersProvider implements vscode.Disposable {
}

private reset(): void {
this.lastFetchTimestamp = Date.now()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to put this in the past so the first fetch happens immediately?

Copy link
Contributor

@pkukielka pkukielka Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, I changed it to be 0 👌

@@ -289,12 +284,12 @@ export class ContextFiltersProvider implements vscode.Disposable {
private hasAllowEverythingFilters(): boolean {
return (
isDotCom(currentAuthStatus()) ||
this.lastContextFiltersResponse === INCLUDE_EVERYTHING_CONTEXT_FILTERS
isEqual(this.lastContextFiltersResponse, INCLUDE_EVERYTHING_CONTEXT_FILTERS)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did canonicalization, === and hasAllowEverythingFilters because we wanted to not slow down the common case. If you measured and we don't need it, that is fine, but please measure if you did not already.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants