Replacing ofBorg with GitHub Actions #355847

Open
4 of 7 tasks
Mic92 opened this issue Nov 14, 2024 · 91 comments
Labels
5. scope: tracking Long-lived issue tracking long-term fixes or multiple sub-problems 6.topic: continuous integration Affects continuous integration (CI) in Nixpkgs, including Ofborg and GitHub Actions 6.topic: developer experience

Comments

@Mic92
Member

Mic92 commented Nov 14, 2024

This is one of the two plans to ensure that we can still perform evaluation checks on GitHub in the future.

See https://discourse.nixos.org/t/infrastructure-announcement-the-future-of-ofborg-your-help-needed/56025
for more information.

To replace OfBorg’s functions with GitHub Actions the following tasks need to be implemented:

  • Running evaluation checks on Nixpkgs
  • Evaluating NixOS options
  • Identifying package rebuilds and adding appropriate labels to the pull request
  • Notifying package maintainers
  • (Optional) Rebuilding selected packages for Linux/macOS
  • Building lib-tests if ./lib changes
  • Providing an equivalent of the performance report
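
As a rough illustration of the "identifying package rebuilds and adding appropriate labels" task, here is a minimal Python sketch that maps a rebuild count to a label bucket. The bucket boundaries, the label format and the function name are assumptions made for illustration; the real label scheme may differ.

```python
from typing import List, Tuple

# Hypothetical bucket boundaries (inclusive); not taken from the issue.
BUCKETS: List[Tuple[int, int, str]] = [
    (0, 0, "0"),
    (1, 10, "1-10"),
    (11, 100, "11-100"),
    (101, 500, "101-500"),
    (501, 1000, "501-1000"),
]


def rebuild_label(platform: str, count: int) -> str:
    """Return a label such as '10.rebuild-linux: 11-100' for a rebuild count."""
    if count < 0:
        raise ValueError("rebuild count cannot be negative")
    for low, high, name in BUCKETS:
        if low <= count <= high:
            return f"10.rebuild-{platform}: {name}"
    # Everything above the last bucket falls into one open-ended bucket.
    return f"10.rebuild-{platform}: 1001+"
```

A CI step could call this once per platform and attach the resulting label to the pull request.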

I already created a proof of concept pull request here: #352808

Update

We have our first Jitsi meeting to coordinate the migration on 14 November (today) at 17:00 UTC (18:00 Berlin time) at https://jitsi.lassul.us/nixos-infra

@Mic92 Mic92 added the 0.kind: bug Something is broken label Nov 14, 2024
@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/infrastructure-announcement-the-future-of-ofborg-your-help-needed/56025/2

@Bot-wxt1221
Member

Evaluation checks take a lot of resources. I'm worried about whether GitHub Actions runners can run them in a reasonable time.

@Mic92
Member Author

Mic92 commented Nov 14, 2024

@Bot-wxt1221 I managed to run it in 5 minutes for naive nix-env evaluation based on the default.nix entry point and 15 minutes using the same logic that ofborg uses: https://github.com/Mic92/nixpkgs/actions/workflows/eval.yml

Both already seem faster than the hours of waiting for the ofborg queue that we experience today.

Also, this is not yet the end of the line for optimizations. We still have https://github.com/Mic92/nixpkgs/blob/main/pkgs/top-level/release-attrpaths-superset.nix, which can be used to split evaluation into smaller parts that can even run in parallel.
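
To make the idea of splitting evaluation into smaller parts concrete, here is a minimal Python sketch that cuts a list of attribute paths into fixed-size chunks, each of which could be handed to a separate parallel job. The function name and the chunk size are hypothetical, and how release-attrpaths-superset.nix is actually consumed is not shown here.

```python
from typing import List, Sequence


def split_into_chunks(attr_paths: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split attribute paths into consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        list(attr_paths[start:start + chunk_size])
        for start in range(0, len(attr_paths), chunk_size)
    ]
```

Each chunk can then be evaluated independently and the results merged afterwards.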

@JohnRTitor
Contributor

Will PR commands like @ofborg build hello be supported with GitHub Actions?

@JohnRTitor
Contributor

#352808 (comment) and

I worry that bot accounts like r-ryantm can easily hit the limit of CI. CC @ryantm

@Mic92
Member Author

Mic92 commented Nov 14, 2024

@JohnRTitor

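Yes, it's possible:

name: Trigger on PR Comment

on:
  issue_comment:
    types: [created]

jobs:
  run-on-comment:
    # issue_comment also fires for plain issues, so only run for PR comments containing "/build".
    if: github.event.issue.pull_request != null && contains(github.event.comment.body, '/build')
    runs-on: ubuntu-latest
    steps:
      # Note: issue_comment workflows run on the default branch; checking out the PR head needs an explicit ref.
      - name: Check out code
        uses: actions/checkout@v4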
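
The workflow above only detects that a comment contains "/build". A job would still need to extract what to build. Here is a minimal Python sketch of that step, assuming a hypothetical command syntax of "/build attr1 attr2" on its own line; the syntax, the allowed character set and the function name are illustrative, not something the issue specifies.

```python
import re
from typing import List, Optional

# Hypothetical command syntax: "/build attr1 attr2" on a line of its own.
COMMAND_RE = re.compile(r"^/build[ \t]+(.+)$", re.MULTILINE)
# Conservative allow-list for attribute names, since comment text is untrusted input.
ATTR_RE = re.compile(r"[A-Za-z0-9_.+-]+")


def parse_build_command(comment_body: str) -> Optional[List[str]]:
    """Return the requested attributes, or None if there is no valid command."""
    match = COMMAND_RE.search(comment_body)
    if match is None:
        return None
    attrs = match.group(1).split()
    if not attrs or not all(ATTR_RE.fullmatch(attr) for attr in attrs):
        return None
    return attrs
```

Rejecting anything outside the allow-list avoids passing arbitrary comment text on to a shell.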

@FliegendeWurst FliegendeWurst added the 6.topic: continuous integration Affects continuous integration (CI) in Nixpkgs, including Ofborg and GitHub Actions label Nov 14, 2024
@Mic92
Member Author

Mic92 commented Nov 14, 2024

#352808 (comment) and

I worry that bot accounts like r-ryantm can easily hit the limit of CI. CC @ryantm

Well, we have to try and see. Right now it's speculation whether it works or not.

@JohnRTitor
Contributor

Good to know, though huge builds like the kernel and its modules, chromium and firefox will obviously not work. And we'll possibly have to set up a blacklist, or else even individual contributors will hit their limits.

@Bot-wxt1221
Member

According to the GitHub docs:

https://docs.github.com/en/billing/managing-billing-for-your-products/managing-billing-for-github-actions/about-billing-for-github-actions

GitHub Actions usage is free for standard GitHub-hosted runners in public repositories, and for self-hosted runners. For private repositories, each GitHub account receives a certain amount of free minutes and storage for use with GitHub-hosted runners, depending on the account's plan. Any usage beyond the included amounts is controlled by spending limits.

So maybe we don't need to worry about time?

@Mic92
Member Author

Mic92 commented Nov 14, 2024

Good to know, though huge builds like the kernel and its modules, chromium and firefox will obviously not work. And we'll possibly have to set up a blacklist, or else even individual contributors will hit their limits.

You can run builds for 12h. Obviously we should establish some reasonable timeouts to be a good citizen in the ecosystem.

@Mic92
Member Author

Mic92 commented Nov 14, 2024

Added a ^ meeting date for this.

@ibizaman

This may be of interest for this issue, at least as inspiration: I've also (ab)used GitHub Actions to build tests in my project using a dynamically generated matrix. My project uses flakes, but this should be adaptable to non-flakes: https://github.com/ibizaman/selfhostblocks/blob/main/.github/workflows/build.yaml
This matrix then produces a big list of jobs, one job per test: https://github.com/ibizaman/selfhostblocks/actions/runs/11502502422
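
The dynamically generated matrix described above can be sketched in a few lines of Python. This is not the code from the linked workflow; it only shows the general shape, assuming a hypothetical list of test names produced by an earlier step.

```python
import json
from typing import Dict, List


def build_matrix(test_names: List[str]) -> str:
    """Return a JSON string describing a job matrix with one entry per test."""
    matrix: Dict[str, List[Dict[str, str]]] = {
        "include": [{"test": name} for name in sorted(test_names)]
    }
    return json.dumps(matrix)
```

A workflow can write this string to a job output and read it back with fromJSON in strategy.matrix, which yields one job per entry.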

@Mic92
Member Author

Mic92 commented Nov 14, 2024

See the meeting notes for today's infra meeting where we mainly discussed the CI situation: https://github.com/NixOS/infra/blob/7688f20babbeb27a10e4d8669fffe4b0ed00e17f/docs/meeting-notes/2024-11-14.md

Here is the high-level plan:

  • Infinisil wants to take a look at evaluating Nixpkgs in GitHub Actions to compute the number of changed paths.
  • Independently, we will take a look at how we can build packages.
  • To begin with, we will just run GitHub Actions as designed, on the pull_request event. This is the most straightforward way, and we have not actually validated whether we can simply build everything fast enough without resorting to my initial strategy.
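
Computing the changed paths mentioned in the first item boils down to comparing two evaluation results. A minimal Python sketch, assuming each evaluation has already been reduced to a mapping from attribute name to output path (the mapping format and the function name are assumptions for illustration):

```python
from typing import Dict, Set


def changed_attrs(before: Dict[str, str], after: Dict[str, str]) -> Set[str]:
    """Attributes that are new or whose output path differs after the change."""
    return {attr for attr, out_path in after.items() if before.get(attr) != out_path}
```

The size of the returned set is the number of rebuilds the change causes on that platform.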

Independently of the meeting, we also have other discussions about how we can develop ofborg in the future. However, this might not happen before February, so we need some alternative solution in the meantime, if not longer.

@infinisil
Member

I've opened a draft PR here for evaluating Nixpkgs using GitHub Actions: #356023. For evaluation alone (which takes only 5 minutes on each arch), as opposed to also building, I don't think we need to do the running-on-forks dance. Building is harder to get, but it's arguably also less important (and very orthogonal to evaluation).

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/infrastructure-announcement-the-future-of-ofborg-your-help-needed/56025/27

@adisbladis
Member

One important aspect that ofborg currently provides, and that this issue doesn't mention, is the performance report.
This currently works by evaluating nixpkgs twice, once before the PR and once after.

For the majority of PRs the performance report is not important, but for work on lib & stdenv, it can be very important.

The report currently does not report the impact of checkMeta, something that has led to a less than stellar review experience, since contributors & reviewers don't actually understand the real performance impact.
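
The "evaluate twice and compare" approach can be sketched as follows. This is a minimal Python illustration, assuming the statistics of each evaluation are available as a flat mapping of metric name to number; the metric names in the usage example are placeholders, not a claim about what ofborg reports.

```python
from typing import Dict


def compare_stats(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Percentage change per metric that is present in both evaluations."""
    report: Dict[str, float] = {}
    for metric, old in before.items():
        # Skip metrics missing from the second run and avoid dividing by zero.
        if metric in after and old != 0:
            report[metric] = round((after[metric] - old) / old * 100, 2)
    return report
```

For example, `compare_stats({"cpuTime": 10.0}, {"cpuTime": 11.0})` reports a 10% increase for that metric.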

@JohnRTitor JohnRTitor removed the 0.kind: bug Something is broken label Nov 15, 2024
@JohnRTitor JohnRTitor pinned this issue Nov 15, 2024
@Mic92
Member Author

Mic92 commented Nov 15, 2024

One important aspect that ofborg currently provides, and that this issue doesn't mention, is the performance report. This currently works by evaluating nixpkgs twice, once before the PR and once after.

For the majority of PRs the performance report is not important, but for work on lib & stdenv, it can be very important.

The report currently does not report the impact of checkMeta, something that has led to a less than stellar review experience, since contributors & reviewers don't actually understand the real performance impact.

Could that be another on-demand GitHub Actions job? We could even run it automatically if certain paths have been changed.
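
Deciding whether to run the report automatically could be as simple as a prefix check on the changed files. A minimal Python sketch, where the list of prefixes is a hypothetical choice based on the "lib & stdenv" remark above:

```python
from typing import Iterable

# Hypothetical path prefixes for which a performance report is worth running.
PERF_SENSITIVE_PREFIXES = ("lib/", "pkgs/stdenv/")


def needs_perf_report(changed_files: Iterable[str]) -> bool:
    """True if any changed file lives under a performance-sensitive prefix."""
    return any(path.startswith(PERF_SENSITIVE_PREFIXES) for path in changed_files)
```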

@azuwis
Contributor

azuwis commented Nov 15, 2024

Good to know, though huge builds like the kernel and its modules, chromium and firefox will obviously not work. And we'll possibly have to set up a blacklist, or else even individual contributors will hit their limits.

Building the Linux kernel is fine on GitHub Actions: the CPU time is sufficient, building the Jovian-NixOS Linux kernel takes less than 2 hours, and GitHub Actions offers a maximum of 6 hours per run.

The only concern is disk space. Workarounds:

  1. Bind mount /mnt/nix to /nix; /mnt has 66G free by default.
  2. Set build-dir = /nix/var in nix.conf. By default Nix uses /tmp to hold /build in the sandbox, which takes up disk space in /, where only 20G is free; that is not enough for building the Linux kernel.
  3. Remove files we don't need: docker images, /usr/local, /usr/share/swift, etc. It's possible to get more than 63G of free disk space in / without affecting Nix.
  4. Use BTRFS RAID0 to combine / and /mnt and enable zstd compression. This makes it possible to get 126G of free disk space in total, which should be sufficient for most build tasks.

All of the above workarounds are implemented in https://github.com/azuwis/actions/blob/main/nix/prepare.sh.

Well, except for 2), which can be set by:

    - uses: cachix/install-nix-action@v30
      with:
        extra_nix_config: |
          build-dir = /nix/var
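
Since disk space is the limiting factor, a job could check the free space before starting a large build. This is a minimal Python sketch of such a pre-flight check; it is not part of the linked prepare.sh, and the threshold passed by the caller is an assumption.

```python
import shutil

GIB = 1024 ** 3


def free_gib(path: str) -> float:
    """Free space on the filesystem containing path, in GiB."""
    return shutil.disk_usage(path).free / GIB


def has_enough_space(path: str, required_gib: float) -> bool:
    """True if the filesystem containing path has at least required_gib free."""
    return free_gib(path) >= required_gib
```

A job could, for instance, call `has_enough_space("/nix", 60)` and skip or fail early when the result is false.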

@adisbladis
Member

One important aspect that ofborg currently provides, and that this issue doesn't mention, is the performance report. This currently works by evaluating nixpkgs twice, once before the PR and once after.
...

Could that be another on-demand GitHub Actions job? We could even run it automatically if certain paths have been changed.

Sounds good to me.

@JohnRTitor
Contributor

Building the Linux kernel is fine on GitHub Actions: the CPU time is sufficient, building the Jovian-NixOS Linux kernel takes less than 2 hours, and GitHub Actions offers a maximum of 6 hours per run.

I am concerned about building the kernel modules (both in tree and out of tree).

@Mic92
Member Author

Mic92 commented Nov 16, 2024

Building the Linux kernel is fine on GitHub Actions: the CPU time is sufficient, building the Jovian-NixOS Linux kernel takes less than 2 hours, and GitHub Actions offers a maximum of 6 hours per run.

I am concerned about building the kernel modules (both in tree and out of tree).

Well, once the source of truth lives in the repository, we should be able to quickly filter out and blacklist packages we don't want to build. Also, we can actually stop GitHub Actions runs, which was not possible with ofborg builds.
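
Filtering out packages that should not be built can be sketched with simple glob patterns. The patterns and names below are hypothetical examples drawn from the packages mentioned in this thread; the real list would live in the repository.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Sequence

# Hypothetical deny-list patterns for builds considered too expensive.
DENY_PATTERNS: Sequence[str] = ("linuxPackages*", "chromium", "firefox*")


def filter_buildable(attrs: Iterable[str], deny: Sequence[str] = DENY_PATTERNS) -> List[str]:
    """Keep only attributes that match none of the deny-list patterns."""
    return [
        attr for attr in attrs
        if not any(fnmatchcase(attr, pattern) for pattern in deny)
    ]
```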

@Kamillaova
Contributor

Kamillaova commented Nov 19, 2024

Maybe of interest for this issue, at least just for inspiration, but I've also (ab)used GitHub actions to build tests in my project using a dynamically generated matrix. My project uses flakes but this should be adaptable to non-flakes https://github.com/ibizaman/selfhostblocks/blob/main/.github/workflows/build.yaml This matrix then produces a big list of jobs, one job per test: https://github.com/ibizaman/selfhostblocks/actions/runs/11502502422

@ibizaman did you see this? https://github.com/thecaralice/flake-gha

@wolfgangwalther
Contributor

Edge case: When only changing the base branch without force pushing a rebase, eval will not run again, changed packages will not be updated, maintainers will not be requested for review.

Fix in #372475, running eval on base branch changes, too.

@wolfgangwalther
Contributor

Avoid requesting maintainer reviews in draft mode, similar to codeowners: #372479

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/prs-stuck-in-passthru-tests-for-darwin/59257/3

@doronbehar
Contributor

doronbehar commented Jan 25, 2025

I got this failure when pushing a commit to an r-ryantm PR (#376432).

Fetching the repository
  /usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=2 origin 9d4b004fbb88c54361ddebebb9dd9e89d36f5385
  Error: fatal: remote error: upload-pack: not our ref 9d4b004fbb88c54361ddebebb9dd9e89d36f5385
  The process '/usr/bin/git' failed with exit code 128
  Waiting 12 seconds before trying again
  /usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=2 origin 9d4b004fbb88c54361ddebebb9dd9e89d36f5385
  Error: fatal: remote error: upload-pack: not our ref 9d4b004fbb88c54361ddebebb9dd9e89d36f5385
  The process '/usr/bin/git' failed with exit code 128
  Waiting 10 seconds before trying again
  /usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=2 origin 9d4b004fbb88c54361ddebebb9dd9e89d36f5385
  Error: fatal: remote error: upload-pack: not our ref 9d4b004fbb88c54361ddebebb9dd9e89d36f5385
  Error: The process '/usr/bin/git' failed with exit code 128

Not bothering me too much, but thought people here would like to know...

@wolfgangwalther
Contributor

Eval / Process fails with error: stack overflow; max-call-depth exceeded for a treewide in https://github.com/NixOS/nixpkgs/actions/runs/12976078847/job/36187969238 - haven't seen that before.

@JohnRTitor
Contributor

That seems like a .nix issue.

Another thing: it would be nice if we could get a performance report like ofBorg did.

There was already an implementation in SQL by @paparodeo (#362844 (comment), example). Perhaps someone wants to reimplement it in Nix?

@infinisil
Member

Eval / Process fails with error: stack overflow; max-call-depth exceeded for a treewide in https://github.com/NixOS/nixpkgs/actions/runs/12976078847/job/36187969238 - haven't seen that before.

Addressed by #377434

@infinisil
Member

This needs to be fixed still: #371223

@wolfgangwalther
Contributor

wolfgangwalther commented Feb 2, 2025

When "Eval / Comparison" can't find the "run" on the target commit, it will silently pass the job, but "Eval / Tag" will be skipped. In this case, we don't get any rebuild labels, rebuild status, or maintainer pings.

Random example: https://github.com/NixOS/nixpkgs/actions/runs/12954374703/job/36136463837

I'd argue that we should fail in this case, because this is not expected.

Edit: Opened #378909 to do just that.

@doronbehar
Contributor

Can anyone help me please understand this eval error:

[Image: screenshot of the eval error]

🙏

@GaetanLepage
Contributor

Can anyone help me please understand this eval error:

I think this is just GHA being very flaky lately. Sometimes jobs fail for no reason and work fine when you restart them...
Not sure what the root cause is.

@doronbehar
Contributor

Can anyone help me please understand this eval error:

I think this is just GHA being very flaky lately. Sometimes jobs fail for no reason and work fine when you restart them... Not sure what the root cause is.

I restarted it and it failed in exactly the same way... After a rebase and force push it got fixed. Could it have been due to an eval error present on the master branch, from which my PR's branch descended?

@7c6f434c
Member

7c6f434c commented Feb 9, 2025

One thing that also happens, is GHA being flaky in a shard-dependent way where changing the commit hash rerolls the dice; typically this case doesn't look like your example, though. But maybe they have found new ways to be flaky.

@wolfgangwalther
Contributor

Can anyone help me please understand this eval error:

The eval error regarding "target run id" is this:

  • We eval the PR branch and then compare this against the PR's target branch (the branch it is merged into).
  • To do the comparison, we look for a previous CI run on the target branch - and we do so on the exact commit that GHA created the temporary merge commit on.
  • If GHA didn't succeed on this commit, for example due to the flakiness mentioned in #355847 (comment), then we can't get the results to compare against, and thus we error out.

When you force push your PR, GHA will create a new temporary merge commit. If something has happened on the target branch already, this might then be based off of a different commit on the target branch - one where CI was passing, thus fixing CI for you.

So the failure you are seeing is essentially a cascading effect of GHA flakiness...
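
The lookup described above can be sketched in Python as follows. This is an illustration of the logic, not the actual CI code: it assumes the previous runs are available as a list of dictionaries with id, head_sha and conclusion fields, and it raises an error instead of passing silently when no successful run exists for the target commit.

```python
from typing import Dict, List


class MissingTargetRun(Exception):
    """No successful eval run exists for the requested target commit."""


def find_target_run_id(runs: List[Dict[str, object]], target_sha: str) -> int:
    """Return the id of a successful run on target_sha, or raise if there is none."""
    for run in runs:
        if run["head_sha"] == target_sha and run["conclusion"] == "success":
            return int(run["id"])
    # Failing loudly makes the missing comparison visible instead of skipping it.
    raise MissingTargetRun(f"no successful eval run for {target_sha}")
```

A new temporary merge commit based on a target commit with a successful run would make the lookup succeed again, which matches the behaviour seen after a force push.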

@lucasew
Contributor

lucasew commented Feb 9, 2025

What if we start using impure derivations and do all of this composition using Nix itself?

GH Actions would only set up and run Nix once.

That way we could also run more extensive tests on packages.

@wolfgangwalther
Contributor

IIUC, you are suggesting to eval both the PR branch and the target branch in the PR's job. Eval already takes a lot of resources right now, and this would double the resources needed. Using the already evaluated results from the target branch does speed up CI and save resources significantly, if the target run succeeded.

@wolfgangwalther
Contributor

One thing that also happens, is GHA being flaky in a shard-dependent way where changing the commit hash rerolls the dice; typically this case doesn't look like your example, though. But maybe they have found new ways to be flaky.

I contacted GitHub support about those failures. Let's see whether we get any useful feedback that way.

@wolfgangwalther
Contributor

I contacted GitHub support about those failures.

Feedback from support is as follows:

The Actions activity throughout the NixOS/nixpkgs repo's large fork network is causing increased latency when Actions attempts to fetch workflow files for repos in the network. This is leading to startup failures, like the example you shared.

Our engineering team has been aware of this issue for a bit, and have already introduced some improvements to reduce the likeliness of it happening. They are still monitoring these errors scoped to the repo and fork network while working on further improvements.

I will check with this team to see if there's any recent updates. For now, if a run experiences this error, we recommend triggering a new workflow run, if possible.

So nothing we can do about that, really.

@pbsds
Member

pbsds commented Feb 11, 2025

After a rebase and force push it got fixed.

For me "Rerun all jobs" seems to work. Could it be that one of the previous jobs failed to upload some artifact?

@wolfgangwalther
Contributor

After a rebase and force push it got fixed.

For me "Rerun all jobs" seems to work. Could it be that one of the previous jobs failed to upload some artifact?

See #355847 (comment).

If there is enough time between the first run and your re-run, GHA will already have created a new temporary merge commit - so this can fix it as well.

@booxter
Contributor

booxter commented Feb 15, 2025

Let's say I hit the issue mentioned in this comment. How do I restart the failed job? Do I need some additional permissions to do so? (I don't see a Re-run button.) Do I have to rebase to re-trigger?

@GaetanLepage
Contributor

Let's say I hit the issue mentioned in this comment. How do I restart the failed job? Do I need some additional permissions to do so? (I don't see a Re-run button.) Do I have to rebase to re-trigger?

Yes, as a non-committer, this is your only option.
