Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The CI docs skip hack is fundamentally broken WRT branch protection settings #10365

Closed
geekosaur opened this issue Sep 16, 2024 · 16 comments · Fixed by #10431
Closed

The CI docs skip hack is fundamentally broken WRT branch protection settings #10365

geekosaur opened this issue Sep 16, 2024 · 16 comments · Fixed by #10431

Comments

@geekosaur
Copy link
Collaborator

Describe the bug
See https://github.com/haskell/cabal/actions/runs/10888849265/job/30214209267.
TL;DR: the "validate skip" post job succeeded immediately, so branch protection considered "Validate post job" to have succeeded and the PR was merged as soon as the other required jobs completed.

It's not clear whether "Bootstrap post job" has the same problem: the post job also completed¸ but so also did all the actual bootstrap jobs. It's possible that the so-called "quick jobs" take long enough for it to finish (I regularly see that getting canceled when I push a fix to something revealed within the first few minutes of a PR build). Or some other required job; I'd have to remind myself what's actually in the branch protection requirements to make more than a slightly-informed guess.

This doesn't seem to happen on "pull_request" or "push" jobs. I suspect that's because branch protection doesn't come into play on those, suggesting that it's overactive branch protection checks that are the real problem (which is a GitHub issue).

@geekosaur
Copy link
Collaborator Author

geekosaur commented Sep 16, 2024

Okay, it's a little more complicated than that. The post job works as expected for pull_request and push, just not for the run before it's merged for some reason. I think the problem might instead be that Mergify takes over the branch protection rules (this is mentioned in the Mergify.io documentation), and doesn't evaluate them the same way GitHub Actions does.

In this case, the workaround might be to set the option in the branch protection rules that blocks administrators from overriding GitHub with respect to branch protection rules. But we would need to then toggle it if we need to merge something that doesn't pass CI as part of fixing a complex CI breakage, or else cherry-pick all fixes into a single PR so its CI passes (I did this as part of fixing the Windows CI breakage on 3.12).

@geekosaur
Copy link
Collaborator Author

Interestingly, #10357 ran the full set. But I had noticed that the marathon run of backports went in unexpectedly quickly, so now I need to check how the release branch rules differ from master. (I have worked on those, but only on stuff before it starts the merge.)

@geekosaur
Copy link
Collaborator Author

geekosaur commented Sep 17, 2024

The behavior is fairly inconsistent. I still think this is something about how Mergify evaluates the branch protection rules (which are the same between 3.* and master modulo fourmolu and hlint) and want to test whether removing Validate post job and Bootstrap post job from the required list and doing it in Mergify config will work.

That said, I also wonder why Validate skip is even running: supposedly it only triggers on docs, as configured. Supporting this is that it doesn't even show up in the CI summary. But there's still that Validate post job clearly was running exit 0 (from Validate skip) instead of the outcomes check in Validate.

@geekosaur
Copy link
Collaborator Author

Okay, looked over triggers for Validate skip. As written,

  • it runs only for PRs targeting master
  • it runs on any PR that touches documentation, even if it also touches code

So if we have a PR targeting 3.12/3.14 or that doesn't affect docs, the collision doesn't happen and CI runs fully. If it does affect docs, Validate skip finishes first and Mergify's interpretation of the branch protection rules is satisfied. I'm going to see if there's to make this "either/or" with Validate, and allow it to run on release branches as well. (Currently I think we get away with its configuration because both push (constrained) and pull_request (not constrained) are true, so it runs on release branches.)

@geekosaur
Copy link
Collaborator Author

Okay, so GHA controls aren't flexible enough. But Mergify can I think be told to accept "skipped" as "success" — or may even do so by default, unlike GHA. So we'd again just move that branch protection rule to Mergify config.

@geekosaur
Copy link
Collaborator Author

Hm. Actually, GHA claims that "skipped" counts as success, contrary to the claim in the comments for our docs hack. Is this a bad interaction with the if: always() in the post job, and if so can that be fixed (it's checking the outcomes of the individual steps, we can adjust that conditional)? always() is a really big hammer and might be knocking holes in things.

@geekosaur
Copy link
Collaborator Author

Another observation: our CI proceeds in weird steps that may be contributing here. For example, our validate post job checks all the preceding steps, but those steps don't exist as far as both GitHub Actions and Mergify are concerned while earlier steps are running; they're only populated when earlier steps are finished, for some reason.

That said, I have seen cases where the initial validate job is still running when something is merged, so this probably isn't (usually?) it. But I still wonder why those later jobs aren't listed as pending.

@geekosaur
Copy link
Collaborator Author

It's worse than I thought… and it's not Mergify that's at fault, it is definitely GitHub.

[06 14:52:06] <haskellbridge> <geekosaur> caught one in the act! and it looks like it's something with github
[06 14:54:33] <haskellbridge> <geekosaur> https://github.com/haskell/cabal/pull/10417 just got merged while CI was still running. it required a rebase beforehand, which happened as expected. GitHub showed that the PR wasn't mergeable for several minutes, then cleared that and Mergify immediately merged it. Notable: "Validate post job" shows as complete even though all the validate jobs are still running
[06 14:54:53] <haskellbridge> <geekosaur> (likewise with "Bootstrap post job")
[06 14:56:01] <haskellbridge> <geekosaur> _but_ this can't be the skip jobs _per se_ because they completed immediately.
[06 14:57:07] <haskellbridge> <geekosaur> …Mergify thinks it's still not merged!
[06 14:57:35] <haskellbridge> <geekosaur> I think it merged as soon as the so-called "quick jobs" finished
[06 15:02:21] <haskellbridge> <geekosaur> so. one of my working hypotheses was that GitHub and Mergify disagreed about the thing we've been relying on for the skip hack. But it was GitHub that called the job finished, not Mergify, so the whole basis for the skip hack is incorrect. GitHub (again, _not_ Mergify) considered the post job complete as soon as the skip job's version ran. This doesn't happen during normal PR checks because they're not checked...
[06 15:02:26] <haskellbridge> ... against the branch protection rules.
[06 15:04:25] <haskellbridge> <geekosaur> So now my top question is: why did GitHub auto-merge the PR insytead of waiting for Mergify to complete its checks?

@geekosaur
Copy link
Collaborator Author

[06 15:18:49] <haskellbridge> <geekosaur> hm, but github claims mergify committed it, while mergify says "The pull request embarked with master (b1e842d) will be merged soon"
[06 15:19:33] <haskellbridge> <geekosaur> mergify dashboard says nothing running…
[06 15:20:12] <haskellbridge> <geekosaur> methinks we have here a major disconnect between mergify and github, and both are complicit

@ulysses4ever
Copy link
Collaborator

I again propose to revert the hack and stop wasting the precious resource for investigating it any further.

Mergify can't be not a part of the equation because the hack is known to work for pure GitHub workflows?

@geekosaur
Copy link
Collaborator Author

geekosaur commented Oct 6, 2024

As far as I can determine, it's branch protection that's failing, not Mergify; Mergify is just obeying the branch protection rules, and the sequence of events when the PR was committed clearly indicate that it was GitHub, not Mergify, that went green on branch protection inappropriately. Possibly the dual intention of our post jobs (both reducing the checks in e.g. validate to a single mega-check and enabling the docs skip hack) is contributing, because branch protection no longer knows about the details?

I also think the reason GitHub's own merge queue isn't affected is because it waits for all checks to finish regardless. If they change that or anything related to it, it'll fail the same way because it goes by the branch protection rules and those are going green too early.

In any case, I now think #10431 will fix it because Mergify would no longer jump the gun when GitHub's branch protection does. This is pretty much the only way Mergify is involved in the situation; GitHub's own merge queue probably still waits for the job to finish.

@geekosaur
Copy link
Collaborator Author

Also, I think reverting the docs hack is itself a hackaround, because this will still be lying in wait to bite us in some other way.

@geekosaur
Copy link
Collaborator Author

I reported Mergify's jumping the gun on the checks as a bug; sadly, the only feedback I got was a "thank you" from the bug reporting page, not even a bug number.

@ulysses4ever
Copy link
Collaborator

Fair enough.

#10431 will fix it

🤞

reported Mergify's jumping the gun on the checks as a bug

Is there a public link?

@geekosaur
Copy link
Collaborator Author

Like I said, no feedback whatsoever aside from the "thank you". I had intended to put a link here, but apparently their bugs are not public.

@geekosaur
Copy link
Collaborator Author

The Mergify tweak didn't work; it again merged while CI was running. Mergify support asked for more information which I gave them, and I'm waiting to hear back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants