sporadic compilation segfaults #121056
What is with the "compiler acts suspicious enough that memtest has to be used to try to rule that out, but the results come back clean for memtest, but we still can't really pin down a specific problem in rustc" issues lately? Perhaps I was too hasty in closing #119005... |
Just a hunch: what CPU do you have? |
Reminds me of #120955 |
@git-girl This is a weird request, but can you bootstrap a new copy of rustc and LLVM, on a host that builds with a matching glibc version, and then deploy it on this wonky-ass server, and see if that still has the same spotty build errors? You may have to refer to the rustc-dev-guide for assistance. My hypothesis is that maybe one of the post-build optimizations we apply, that doesn't happen to a freshly bootstrapped host, might be causing an issue in a rare system configuration we're not testing. |
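For reference, a minimal sketch of what such a fresh bootstrap could look like (assumptions: a Debian 11 environment so the glibc matches the target host, the rustc 1.76.0 source tarball used later in this thread, and the `llvm.download-ci-llvm` key name, which is worth double-checking against the rustc-dev-guide):

```bash
# Fetch the rustc sources and bootstrap them locally, building the vendored LLVM
# from source instead of downloading the prebuilt CI artifact.
curl -L https://static.rust-lang.org/dist/rustc-1.76.0-src.tar.gz | tar -xz
cd rustc-1.76.0-src
cp config.example.toml config.toml
./x.py build --stage 2 --set llvm.download-ci-llvm=false
```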
@rustbot label +S-needs-repro |
thanks so much for everyone's replies! i am still working on the reproduction with a docker and i think i got stuck (just to make sure @workingjubilee: i need to just follow the rustc build, and that build already builds all the llvm parts too, right?). @ds84182 here is the `lscpu`:

and the dockerfile i have atm:

```dockerfile
FROM debian:11
# RUN apt-get update && apt-get install -y libc6 git curl tar python3 build-essential g++ libssl-dev ninja-build
RUN apt-get update && apt-get install -y libc6 git curl tar python3 build-essential g++ libssl-dev # ninja-build
# sanity check: the container's glibc should match the target host's (2.31), otherwise abort the build
RUN bash -c 'if [[ "$(ldd --version | awk '\''NR==1{print $NF}'\'')" == "2.31" ]]; then echo "libc okay"; else exit 1; fi;'
# This was from my attempt to compile llvm because:
# build cmake from source as debian 11 cmake isnt new enough :{
# WORKDIR /home
# RUN curl -L https://github.com/Kitware/CMake/releases/download/v3.28.3/cmake-3.28.3.tar.gz -o cmake.tar.gz
# RUN tar -xzf cmake.tar.gz
# WORKDIR /home/cmake-3.28.3
# RUN ./bootstrap
# RUN gmake
# RUN gmake install
# Attempt to compile llvm from upstream against the rust cmake files
# LLVM version: 17.0.6
# WORKDIR /home
# # this repo seems to not be correct as building it with the cmake command below i run into the issue of the
# RUN curl -L https://github.com/llvm/llvm-project/archive/refs/tags/llvmorg-17.0.6.tar.gz -o llvm_src.tar.gz
# # this seems to be the wrong repo to build from as clang/ is missing from the root
# # RUN curl -L https://github.com/llvm/llvm-project/releases/download/llvmorg-17.0.6/llvm-17.0.6.src.tar.xz -o llvm_src.tar.xz
# RUN tar -xzf llvm_src.tar.gz
# RUN mkdir -p /home/llvm-project-llvmorg-17.0.6/builddir
# WORKDIR /home/llvm-project-llvmorg-17.0.6/builddir
# # RUN cmake -DCMAKE_BUILD_TYPE=Release /home/llvm-17.0.6.src
# RUN cmake -G Ninja -C /home/rustc-1.76.0-src/src/llvm-project/clang/cmake/caches/DistributionExample.cmake /home/llvm-project-llvmorg-17.0.6
# RUN ninja stage2-distribution
# RUN ninja stage2-install-distribution
# Attempt building LLVM from llvm upstream in isolation
# RUN cmake --build . --target install
# # to easier find this
# RUN cmake -DCMAKE_INSTALL_PREFIX=/home/llvm.result -P cmake_install.cmake
# build rust
WORKDIR /home
RUN curl -L https://static.rust-lang.org/dist/rustc-1.76.0-src.tar.gz -o rust.tar.gz
RUN tar -xzf rust.tar.gz
WORKDIR /home/rustc-1.76.0-src
RUN cp config.example.toml config.toml
RUN ./x.py build
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | bash -s -- -y
ADD bash_stuff.sh /root/bash_stuff.sh
ENTRYPOINT ["/bin/bash", "/root/bash_stuff.sh"]
# bash_stuff.sh contents:
# #!/usr/bin/env bash
#
# source "$HOME/.cargo/env"
# rustup toolchain link stage0 build/host/stage0-sysroot
# rustup toolchain link stage1 build/host/stage1
# rustup override set stage1
#
# How do i use llvm that was compiled by the previous build step for the stage2 build
# somewhere in the sysroot guy?
#
# CMD cargo version -Vv
```
|
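On the question in the Dockerfile comments about reusing a separately built LLVM: bootstrap can be pointed at an external `llvm-config` through a per-target entry in config.toml. A minimal sketch, assuming the `/home/llvm.result` install prefix from the commented-out steps above and an x86_64 linux-gnu host:

```bash
# Append a per-target llvm-config entry to config.toml, then rebuild;
# the stage builds will link against that LLVM instead of building their own.
cat >> config.toml <<'EOF'
[target.x86_64-unknown-linux-gnu]
llvm-config = "/home/llvm.result/bin/llvm-config"
EOF
./x.py build --stage 2
```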
Looking at the CPU errata for that CPU, I found: I checked the Linux kernel and didn't see anything mentioning it, and considering this bug I'm pretty sure it's unpatched. See also:
Just to confirm @git-girl can you run the linked reproducer to see if it (eventually) crashes on your system? |
@ds84182 oh no, i think you are right that it's a cpu issue. the reproduction is segfaulting. i can't test setting the msr bit because i'll need to make changes to grub and reboot, which i can't do remotely; i'll get to try that tomorrow morning though :)

```console
username@hostname:~/cpu.reproduction$ LANGUAGE=en_US strace ./a.out
execve("./a.out", ["./a.out"], 0x7fff5f10a220 /* 36 vars */) = 0
brk(NULL) = 0x55a06369d000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=66860, ...}) = 0
mmap(NULL, 66860, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f061e42f000
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@>\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1901536, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f061e42d000
mmap(NULL, 1914496, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f061e259000
mmap(0x7f061e27b000, 1413120, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0x7f061e27b000
mmap(0x7f061e3d4000, 323584, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17b000) = 0x7f061e3d4000
mmap(0x7f061e423000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c9000) = 0x7f061e423000
mmap(0x7f061e429000, 13952, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f061e429000
close(3) = 0
arch_prctl(ARCH_SET_FS, 0x7f061e42e540) = 0
mprotect(0x7f061e423000, 16384, PROT_READ) = 0
mprotect(0x55a062994000, 4096, PROT_READ) = 0
mprotect(0x7f061e46a000, 4096, PROT_READ) = 0
munmap(0x7f061e42f000, 66860) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
+++ killed by SIGSEGV +++
```

here is the output of me stepping through starting from main:
|
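As an aside on setting the MSR bit: besides a GRUB/boot-time change, MSRs can usually also be read and written at runtime with msr-tools. A rough sketch, with the register address and bit index as pure placeholders (the real values depend on the erratum in question):

```bash
# Read an MSR, set one bit, and write it back on all cores (placeholder values!).
sudo apt-get install -y msr-tools
sudo modprobe msr
MSR=0xC0000000   # placeholder: substitute the MSR named in the erratum
BIT=4            # placeholder: substitute the bit named in the erratum
old=$(sudo rdmsr -c -p 0 "$MSR")                 # read on core 0, C-style hex output
new=$(printf '0x%x' $(( old | (1 << BIT) )))
sudo wrmsr -a "$MSR" "$new"                      # write the new value on all cores
```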
Hmm, I think you may have copied the assembly snippet incorrectly. Can you verify that the movq line is |
ah omg yes sorry, still segfaults after a while though, which i guess means this is the issue. this is from
and a bt all from lldb:
|
what also worked was to bootstrap the compiler as you suggested @workingjubilee. i hope i did everything right while bootstrapping; i was a bit afraid that i messed up
i built it like this:

```bash
#!/usr/bin/env bash
source "$HOME/.cargo/env"
rustup toolchain link stage0 build/host/stage0-sysroot
rustup toolchain link stage1 build/host/stage1
rustup override set stage1
cd /home/rustc-1.76.0-src
./x.py build --stage 2
rustup toolchain link stage2 build/host/stage2
# i hate docker
bash
```

rustc versions:
|
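Once the toolchains are linked like this, the freshly built compilers can be selected per invocation with rustup's `+toolchain` syntax, e.g.:

```bash
rustc +stage1 --version --verbose   # run the locally built stage1 compiler
cargo +stage2 build                 # build a test crate with the stage2 toolchain
```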
sorry for the many comments. we are about to do a clean install on the machine, and i just checked that setting the msr bit didn't prevent the compilation segfaults @ds84182 |
Basically, yes, although IIRC the default config.toml you used would download its LLVM from CI. I don't think that matters too much, however. Knowing that the "default build" of rustc is seemingly free of these problems makes me wonder if this is BOLT or BOLT+strip or BOLT+PGO interactions again, as in:
Unfortunately I don't know an easy or even just tractable way to enable running a BOLT-optimized or PGO-optimized build so that we can recombine the various optimizations and see what breaks on your machine...
That is slightly concerning! Just to verify: How exactly did you set the MSR bit? And thank you for doing so much investigatory work! |
@git-girl If you are brave enough to try using PGO, this is the command we use in the middle of our distribution build's opt pipeline:
Apparently with a cwd of |
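Since the exact opt-dist command isn't reproduced above, here is only a rough sketch of how a two-phase PGO build of rustc might be wired up through bootstrap; the `rust.profile-generate`/`rust.profile-use` key names are assumptions to verify against the rustc-dev-guide, and `/tmp/rustc-pgo` is a placeholder directory:

```bash
# Phase 1: build an instrumented rustc that writes profiles into /tmp/rustc-pgo.
./x.py build --stage 2 --set rust.profile-generate=/tmp/rustc-pgo
# ...compile some representative crates with the instrumented stage2 rustc here...
# Merge the raw profiles, then rebuild using them.
llvm-profdata merge -o /tmp/rustc-pgo/rustc.profdata /tmp/rustc-pgo/*.profraw
./x.py build --stage 2 --set rust.profile-use=/tmp/rustc-pgo/rustc.profdata
```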
hey there :3
i went to deploy to a pretty wonky debian server yesterday, but the compiler segfaulted. at this point we are sure that this is triggered by the system being misconfigured in some way. we also found issues with memory slots, however after roughly testing the ram with memtest we still ran into the same segfaults. the compiler segfaults disappeared once we booted a different system on the hardware (this is to say that there might be hardware issues involved, but imo that's rather unlikely; more likely this is caused by frankensteinish system configs {however, one thing to note is that the rest of the services on that server have seemed healthy for quite a while}).
i hope all of this is put concretely enough; if anything is missing please let me know and i can provide more information. we do have backups of this system, but we are planning to do a clean install on friday, after which i don't know if i will realistically be able to reproduce this behavior.
a basic hello world segfaulted more often than not (but it might also compile fine). so, something like this:
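(a sketch of the kind of repro loop meant here, not the exact commands from the report: repeatedly build a hello-world crate until rustc crashes)

```bash
cargo new hello-segv && cd hello-segv   # the default `cargo new` template is already a hello world
for i in $(seq 1 100); do
    cargo clean
    if ! cargo build 2>err.log; then
        echo "build $i failed:"
        cat err.log           # on the affected machine this was typically a SIGSEGV from rustc
        break
    fi
done
```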
Meta

`rustc --version --verbose`:
(i also tried nightly but also ran into segfaults there.)

`uname -a`:
Linux hostname 4.19.0-25-amd64 #1 SMP Debian 4.19.289-2 (2023-08-08) x86_64 GNU/Linux
and
Linux hostname 5.10.0-28-amd64 #1 SMP Debian 5.10.209-2 (2024-01-31) x86_64 GNU/Linux
(i upgraded the system to no avail - one thing of note may be that i specifically could not find any reference to the syscall `futex_waitv`, which is 5.19 and later, but only `futex_wait`; a quick way to check the installed headers for this is sketched after this list)

`apt list libc6-dev` (as syscalls and futex timeouts were involved):
libc6-dev/oldstable,now 2.31-13+deb11u8 amd64 [installed]
and
`apt list libc6`:
libc6/oldstable,now 2.31-13+deb11u8 amd64 [installed]
(and i also installed all recommended or suggested libc packages from packages.debian.org)
Error output
i have quite a few more error logs, and i think the segfaults didn't always happen in the same way.
Another Error

i also found panics in raw vec because of capacity overflows (there are even more traces where i think the causes are different; i am happy to provide them and can also go through them on another day {right now i just need to go to bed 😬})

Raw Vec Panic

LLDB

LLDB
(obtained by lldb, `bt all`)

(note that these were technically two different rust projects failing here and, as said before, not all segfaults were alike)

best regards and thanks so much for all the work!