
[RFE] docs: Add documentation on how to build Flatcar in the fastest and most energy efficient way possible #1550

Open
ader1990 opened this issue Sep 30, 2024 · 14 comments
Labels: kind/feature (a feature request)

ader1990 commented Sep 30, 2024

Current situation

A full Flatcar build, starting from a clean build environment and ending with a fully fledged bootable image, takes around 2 hours on a modern build machine and around 1.5 hours on state-of-the-art hardware.

Flatcar is a Gentoo-derived, source-based operating system. Its immutability matters here as well: one cannot build just a subset of Flatcar packages; all packages need to be built to create a production image.

Impact

First-time Flatcar builders, engineers, and the wider community do not have a clear view of the build times, the hardware needed, or the software setup required for the fastest build.

Ideal future situation

Document the best ways to achieve the fastest possible builds in terms of hardware, software and workflow.

Implementation options

Create a pull request to document the information.

ader1990 (Author) commented Sep 30, 2024

My personal experience with building Flatcar

Hardware requirements:

  • AMD64 box, 4-8 real cores at minimum.
    From experience, increasing the core count beyond 32 does not provide a big speedup.
  • NVMe storage is a must, at least PCIe 3.0.
    Vendors advertise various IO-related hardware specs, which should be taken with a grain of salt.
    A quick dd if=/dev/random of=test.img oflag=direct,sync bs=4k count=81920 status=progress can be used as a sanity check; if the result is around 20-30 MB/s, the environment is great.
    Depending on how many images and how many different builds are done, a Flatcar build requires at least 25-50 GB of free space.
  • RAMFS/TMPFS storage.
    This requires enough RAM to be used as storage; around 30-50 GB of free RAM on the system can be enough.
    In practice the box should have at least 64 GB of RAM, as the Flatcar build itself needs 6-8 GB of RAM, depending on how many cores are on the system.
    The same dd if=/dev/random of=test.img oflag=direct,sync bs=4k count=81920 status=progress should report around 200-300 MB/s on a RAMDISK.
  • Fast Internet connection.
    The Flatcar build relies on a dockerized environment, and the current SDK Docker image has a size of around 10 GB.
    In addition, during package building a few thousand Rust / Go sub-packages (crate and module source code) are downloaded from the source repositories.
  • CPU with virtualization enabled, for faster testing of Flatcar images

Software requirements:

  • Docker engine
    If a RAMDISK is used, the Docker data-root must be set to the RAMDISK mount point (see the daemon.json sketch after this list).
  • qemu-user-static, if ARM64 Flatcar builds are done (currently Flatcar builds for ARM64 can only be done on AMD64)
  • QEMU, if VMs are to be started for testing
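
A minimal sketch of pointing Docker at the RAMDISK via its data-root setting, assuming the tmpfs is mounted at /mnt/ramdisk (the path is a placeholder):

    # Keep Docker images/containers on the RAMDISK (hypothetical mount point /mnt/ramdisk)
    sudo mkdir -p /etc/docker /mnt/ramdisk/docker
    echo '{ "data-root": "/mnt/ramdisk/docker" }' | sudo tee /etc/docker/daemon.json
    sudo systemctl restart docker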

Workflow:

  • Create the RAMDISK for best speedups
  • Install the Docker engine
    Set the Docker data-root to the proper mount point if the RAMDISK is to be used
  • git clone https://github.com/flatcar/scripts
    For best speedups, clone the repo on the same mount point where the RAMDISK is mounted
  • ./run_sdk_container -t
    This command downloads the latest Docker image of the SDK container.
    The SDK Docker image is around 10 GB at the moment (9.78 GB as of 30 September 2024).
  • ./build_packages
  • ./build_image
  • ./image_to_vm.sh <params>
  • Create a qemu-kvm instance of the resulting image and test that the image boots correctly (a condensed end-to-end sketch of this workflow follows below)
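
A condensed sketch of the workflow above, assuming a 50 GB tmpfs at /mnt/ramdisk (the size, paths and image_to_vm.sh parameters are placeholders):

    # 1. Create the RAMDISK (tmpfs); adjust the size to the available RAM
    sudo mkdir -p /mnt/ramdisk
    sudo mount -t tmpfs -o size=50G tmpfs /mnt/ramdisk

    # 2. Clone the scripts repo onto the same mount point
    cd /mnt/ramdisk
    git clone https://github.com/flatcar/scripts
    cd scripts

    # 3. Enter the SDK container (the first run downloads the ~10 GB SDK image),
    #    then run the build steps from inside it
    ./run_sdk_container -t
    ./build_packages
    ./build_image
    ./image_to_vm.sh <params>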

chewi (Contributor) commented Sep 30, 2024

This is good, thank you.

Is -U needed with ./run_sdk_container to use binary packages? I've not done much with binary packages locally. I know we're talking about building the whole thing here, but people still want to know about this.

I don't have much to add. The main things I do to speed up Gentoo are tmpfs and distcc. You've already covered tmpfs. I haven't tried it with Flatcar because my laptop only has 16GB RAM. 😅 distcc might be worth documenting, but I haven't yet tried that with Flatcar either.

sys-fs/libeatmydata could possibly help in some places, although I don't think it helps with tmpfs. I've never tried it, but it should be simple enough.
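
For reference, libeatmydata works by LD_PRELOADing a small library that turns fsync() and friends into no-ops; a hypothetical invocation, untested with the Flatcar SDK and assuming the eatmydata wrapper is available inside the container:

    # Run the package build with fsync()/fdatasync() turned into no-ops
    eatmydata ./build_packages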

ader1990 (Author) commented:

I have found that one bottleneck of at least a few minutes is downloading a huge number of Rust crates and Go modules one by one from https://mirror.release.flatcar-linux.net/portage-stable/distfiles/. Does anyone have an idea how those files could be downloaded in parallel?
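
Something like the following rough sketch might work as a stopgap, assuming a urls.txt with one distfile URL per line and wget/xargs available:

    # Fetch up to 10 distfiles concurrently into the cache directory (path as seen inside the SDK)
    xargs -a urls.txt -n 1 -P 10 wget --no-verbose \
        --directory-prefix=/mnt/host/source/.cache/distfiles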

chewi (Contributor) commented Sep 30, 2024

Portage downloads files in the background while building, assuming that FEATURES="parallel-fetch" is on, so it shouldn't hold everything up. Downloading tons of small files like this obviously isn't what Portage was designed for though, so it's not great. It could probably benefit a lot from reusing HTTP connections. For that, Portage would need to pass URLs en masse to wget (or whatever, that's configurable). That's an interesting idea. I suspect it's not a trivial change, but I'd love to spend time on it.
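
For reference, the relevant knobs live in make.conf; a sketch of what is meant here (the FETCHCOMMAND shown is only an approximation of the stock single-URL invocation, not a batched one):

    # /etc/portage/make.conf (inside the SDK)
    FEATURES="parallel-fetch"    # fetch upcoming distfiles in the background while building
    # Portage runs one download per file through FETCHCOMMAND, roughly along the lines of:
    # FETCHCOMMAND='wget -t 3 -T 60 -O "${DISTDIR}/${FILE}" "${URI}"'
    # so HTTP connection reuse would mean passing many URLs to one wget invocation instead.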

ader1990 (Author) commented:

Portage downloads files in the background while building, assuming that FEATURES="parallel-fetch" is on, so it shouldn't hold everything up. Downloading tons of small files like this obviously isn't what Portage was designed for though, so it's not great. It could probably benefit a lot from reusing HTTP connections. For that, Portage would need to pass URLs en masse to wget (or whatever, that's configurable). That's an interesting idea. I suspect it's not a trivial change, but I'd love to spend time on it.

There are 500+ crates in there, all downloaded sequentially. HTTP connection reuse, or at least bulk downloads (say, 10 connections or a configurable amount), would be great.

ader1990 (Author) commented:

I made a fetch-only run (./build_packages --fetchonly); it took 24 minutes.

du -sh /mnt/host/source/.cache/distfiles
1.3G    /mnt/host/source/.cache/distfiles

ls -liath /mnt/host/source/.cache/distfiles/*.crate | wc -l
433

chewi (Contributor) commented Sep 30, 2024

Okay, but how much of that time is on packages with more than one file?

ader1990 (Author) commented:

Okay, but how much of that time is on packages with more than one file?

I can check on a clean build with no pre-fetch. At some point (actually twice) the build grinds to a halt in order to download Go modules and crates. I will report back when I have the information.

vielmetti commented:

My personal goal here is for a full build of Flatcar to run within 1 hour on a suitably sized Equinix Metal machine. That device should tick all the boxes for a "modern development environment" as described above, probably the m3 or a3 system.

Is the build timeline instrumented with something like OpenTelemetry or otel-cli? Getting some visibility into where the delays live would be useful.
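
If nothing is instrumented yet, even a coarse per-phase wrapper would give a first picture; a hedged sketch using otel-cli (the endpoint and service name are placeholders):

    # Emit one span per build phase to a local OTLP collector
    export OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
    otel-cli exec --service flatcar-build --name "build_packages" ./build_packages
    otel-cli exec --service flatcar-build --name "build_image" ./build_image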

t-lo (Member) commented Sep 30, 2024

A number of (raw) thoughts:

  • Using a recent nightly (4100.0.0 or newer) will cut the "cyclic dep build" time from build_packages, as the cyclic dependencies are now included in the SDK (as per SDK container: include circular dependency breaking packages in SDK container scripts#2340). This speeds up build_packages by about 30% at the cost of a larger SDK container.
  • Building from binary packages is likely not a good metric, as it strongly depends on internet (uplink) speed. Also, releases will build a lot faster than nightlies because release binpkgs are available via our CDN, while nightly binpkgs are only on a single server (bincache). If we want to benchmark I/O, build_image is more suitable. We could use ./build_packages --nousepkg --nogetbinpkg to always build from source when benchmarking.
  • If you don't want to mess around with your host system, you could also mount a tmpfs inside the SDK container at /build/amd64-usr to build on tmpfs (you'd need to make sure to move the original contents of /build/amd64-usr onto the tmpfs, of course; see the sketch after this list).
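
A rough sketch of the last point, run from inside the SDK container (the tmpfs size is a placeholder, and this assumes /build/amd64-usr already exists):

    # Move the board root aside, mount a tmpfs in its place, and copy the contents back
    sudo mv /build/amd64-usr /build/amd64-usr.orig
    sudo mkdir /build/amd64-usr
    sudo mount -t tmpfs -o size=20G tmpfs /build/amd64-usr
    sudo cp -a /build/amd64-usr.orig/. /build/amd64-usr/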

t-lo (Member) commented Oct 1, 2024

I ran some test builds overnight on my laptop (i9-13900HX, 24 cores / 32 threads); it takes a little less than 45 minutes to build all packages with the latest nightly (i.e. with circular deps already resolved).

I did not see any difference when using tmpfs over regular NVMe I/O.

Load is distributed unevenly over the build. From the build log we can see multiple batches of packages being built, with load saturating the host (as it should), separated by quiet periods (most notably 8:49:14 - 8:53:01, with load staying low until 8:56:00) where packages are installed that provide dependencies for successive builds.

Based on the build.log I'd say our current bottleneck is build dependencies, and it's tough to get rid of those. It would be interesting to see exactly which packages take this long to build / install, and then maybe do some clever build re-ordering to maximise host saturation. Is there any way I can get emerge to output package build stats (build times, dependencies, etc.) in a meaningful way?

krishjainx commented:

Based on the build.log I'd say our current bottleneck is build dependencies, and it's tough to get rid of those. It would be interesting to see exactly which packages take this long to build / install, and then maybe do some clever build re-ordering to maximise host saturation. Is there any way I can get emerge to output package build stats (build times, dependencies, etc.) in a meaningful way?

I haven't tested it with Flatcar, but I've heard of this one: Genlop is a powerful utility for extracting information about emerged packages from Portage log files. It can provide detailed build statistics and more. https://wiki.gentoo.org/wiki/Genlop
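
For example (untested with the Flatcar SDK, and the package name is only illustrative), genlop reads the emerge log and can report per-package times:

    # Merge-time history for one package (reads /var/log/emerge.log by default; -f selects another log)
    genlop -t dev-lang/rust
    # What is compiling right now, and for how long
    genlop -c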

vielmetti commented:

This PR from @krishjainx looks relevant for your review: flatcar/flatcar-website#364

ader1990 (Author) commented Oct 8, 2024

Genlop is not the tool for the job, as it only does log crawling.

equery is better, as it has a tree-like output. Combining the two tools might lead to better insights.

equery -N -C g --depth=0 coreos-base/coreos
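
For instance, a possible combination (untested in the SDK; the second package name is only illustrative):

    # Dependency graph of the image meta-package (the command above)...
    equery -N -C g --depth=0 coreos-base/coreos
    # ...then ask genlop for the build time of any package that looks expensive
    genlop -t dev-lang/go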
