Getting segfaults, on dmesg #261

Open
alosarjos opened this issue Jan 18, 2025 · 17 comments

Comments

@alosarjos

Hi,

I've been trying to figure out this issue: when fossilize is triggered, I usually get some segfaults in dmesg, for example:

[ 1176.464051] fossilize_repla[6453]: segfault at 7d5d5e485551 ip 00007d5d5e485551 sp 00007ffde2bfb390 error 14 likely on CPU 7 (core 7, socket 0)
[ 1176.464060] Code: Unable to access opcode bytes at 0x7d5d5e485527.
[ 1176.464548] fossilize_repla[6451]: segfault at 7d5d5e485551 ip 00007d5d5e485551 sp 00007ffde2bfb390 error 14 likely on CPU 12 (core 4, socket 0)
[ 1176.464554] Code: Unable to access opcode bytes at 0x7d5d5e485527.
[ 1176.464686] fossilize_repla[6452]: segfault at 7d5d5e485551 ip 00007d5d5e485551 sp 00007ffde2bfb390 error 14 likely on CPU 5 (core 5, socket 0)
[ 1176.464692] Code: Unable to access opcode bytes at 0x7d5d5e485527.

At first I thought it could be related to the slight undervolt I've applied to the CPU, which hasn't caused any issues or crashed the system. I have been testing with Core Cycler, since it has stress tests for different CPU architectures that focus on a single core at a time. It passed 7 iterations in around 12 hours, so I'm not sure it's related to that.

Also checking for mce errors on the journal, all I see is:

ene 18 11:18:19 archlinux kernel: MCE: In-kernel MCE decoding enabled.

I'm not sure if I can get some more help figuring out what it could be here.

System Info:

Arch Linux (Updated)
Ryzen 7800X3D
32 GB of RAM
AMD 7800XT

Shader log in case it helps...

shader_log.txt

@alosarjos
Author

PS: I'm not sure if there is a manual way to trigger fossilize so I can test it better.

@alosarjos
Author

I have also run Memtest, and it passed.

@kakra
Contributor

kakra commented Jan 18, 2025

Also checking for mce errors on the journal, all I see is:

This is usually an indicator of a stressed CPU due to overclocking or too long power limits. But this is the first time I see this for AMD. Nevertheless, maybe check your BIOS settings if there are overly relaxed overclocking settings by default. Mainboard manufacturers often do this by default to win the benchmarks for their marketing - but those defaults are not always safe. For Intel, the tuning knob to fix this is usually the power limit and power limit duration. I cannot recommend anything for AMD here.

Other than that, I was also seeing the opcode errors a lot in the past. It's probably not much to worry about. But they seem mostly gone since I cleaned old and stale fossilize caches from my system, something like:

# fully stop Steam first
find ~/.steam/steam/steamapps/shadercache/* -type f -mtime +180 -delete

This removes all cache files not touched in the last 180 days. So if you didn't play a game for longer, it may completely remove caches for such games.
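
Before deleting anything, it may be worth previewing what the command would remove. A minimal sketch, assuming GNU find and the same shadercache path as in the command above; the function name list_stale_cache_files is only for illustration:

```shell
# Print cache files older than a given number of days without deleting them.
# Usage: list_stale_cache_files DIR [DAYS]   (DAYS defaults to 180)
list_stale_cache_files() {
    dir=$1
    days=${2:-180}
    # Same test as the delete command, but with -print instead of -delete.
    find "$dir" -type f -mtime +"$days" -print
}

# Path taken from the command above; adjust it if your Steam library lives elsewhere.
cache="$HOME/.steam/steam/steamapps/shadercache"
if [ -d "$cache" ]; then
    list_stale_cache_files "$cache" 180
fi
```

If the list looks right, replacing -print with -delete gives the original cleanup.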

@alosarjos
Author

alosarjos commented Jan 18, 2025

Replying to #261 (comment)

I have 0 overclock here supposedly.

The chip by default "overclocks" itself until it reaches 85 °C, with max frequency capped at 5 GHz. That's the default. The only thing I've done is a slight undervolt and capping the max temperature at 70 °C, so it shouldn't be stressed at all...

Will try cleaning the caches, but would love to get some kind of manual mechanism to trigger shader cache compilation to try after changing different BIOS settings

@kakra
Contributor

kakra commented Jan 18, 2025

The chip by default "overclocks" itself until it reaches 85 °C, with max frequency capped at 5 GHz. That's the default. The only thing I've done is a slight undervolt and capping the max temperature at 70 °C, so it shouldn't be stressed at all...

Then it should be fine. As said, I'm not familiar with AMD. There is a tool to decode MCEs, tho: mcelog. It can be started as a daemon or run in foreground to decode the errors.

Will try cleaning the caches, but would love to get some kind of manual mechanism to trigger shader cache compilation to try after changing different BIOS settings

Well, after cleaning, shader compilation should kick in automatically if you start Steam or a game. But since I do irregular cleanups, I'm no longer seeing an overly active fossilize process, nor do I get long compilation times when starting a game. Usually, there's only high activity after updating the graphics drivers and/or libs.

@alosarjos
Author

I closed Steam, cleared the whole shader cache directory (including games I'm playing nowadays, even this morning), restarted the computer into the BIOS, checked that all the ASUS enhancement options on the Overclock page were disabled (and yes, all were disabled), rebooted, and launched Steam, and fossilize is not kicking in. (No processes, no CPU load, and no subdirectories inside the shadercache dir.)

@kakra
Contributor

kakra commented Jan 18, 2025

Well, if this is different from previous behavior, things seemingly have improved. But of course that's completely unrelated to your original problem.

Maybe check if starting a game, playing for a little bit, then stop gaming, kicks in fossilize. You should try with a recent game and one that you didn't play for a long time.

I don't know of a manual way to start fossilize. In theory, it's possible from the cmdline but you should avoid trying that because Steam sets up a very tailored environment to run it, and running fossilize from outside this environment is not supported.

@kakra
Contributor

kakra commented Jan 18, 2025

cleared the whole shader cache directory

Oh, if you completely purged the directory, fossilize will only start pre-compiling caches after it has either downloaded its crowd-sourced caches or collected fresh caches from your gaming sessions. I may have misunderstood what you wrote.

@alosarjos
Author

I don't see it kicking in after playing a game, but it looks like it kicked in while playing the game: the shadercache dirs were created.

Image

Now, I see that in the mesa_shader_cache_sf dir there are 2 dirs, and by their names it looks like it's generating shader caches for both the dedicated GPU and the integrated GPU. Could it be something related to the integrated one?

@alosarjos
Author

So, I went to the storage settings in Steam, disabled downloading pre-compiled shaders and background processing, restarted Steam, and enabled both again. Fossilize_Replay started working. Single process. Very low CPU usage, and still getting a segfault?

I'm not sure it's related to overclocking or stress on the CPU: I've been watching the system monitor and dmesg, and I didn't see a peak in CPU usage when those two segfaults happened.

Now, after these, there are some CPU usage "spikes" (which I know are normal with Fossilize, since I usually see multiple processes), but the CPU is at most at 15% global usage.

System Monitor:

Image

And I just got these 2 messages:

[ 1628.556699] fossilize_repla[16326]: segfault at 7799df085551 ip 00007799df085551 sp 00007ffe0b5408d0 error 14 likely on CPU 8 (core 0, socket 0)
[ 1628.556707] Code: Unable to access opcode bytes at 0x7799df085527.
[ 1690.954052] fossilize_repla[16327]: segfault at 7799df085551 ip 00007799df085551 sp 00007ffe0b5408d0 error 14 likely on CPU 2 (core 2, socket 0)
[ 1690.954059] Code: Unable to access opcode bytes at 0x7799df085527.

@alosarjos
Author

I don't know if it means something that the message:

[ 1628.556707] Code: Unable to access opcode bytes at 0x7799df085527.

is always the same for all segfaults, referring to what I understand is the same memory address every time?

@kakra
Contributor

kakra commented Jan 18, 2025

Well, fossilize works in two stages: While playing the game, it will record the shader pipeline (in its GPU-agnostic raw format). Additionally, it may download crowd-sourced pipeline caches so you get the "source code" of shaders that you, while playing the game, have not encountered yet. This will actually not yet compile the shaders for your GPU, it only collects what is used and thus creates the directories (well, ofc, your driver still compiles these shaders and caches them, but fossilize does nothing at this point yet except recording the shader pipeline).

Then, when the system is idle or while starting a game, fossilize will start compiling those caches using your GPU drivers - for each GPU driver it encounters (this is the fossilize_replay process). This reduces or eliminates stutters in games because shader compilation on demand would block rendering. Your GPU drivers now create another set of files, similar in size to the pipeline caches previously created. These compiled pipelines are specific to your system (driver versions, GPU, libs) and need to be recompiled if any component changes.

If you see two GPUs, I would expect your shader cache size to triple, because you have the raw pipeline from the game, and two compiled pipelines for your GPUs.
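
To check how the cache is split up on disk, something like the following can list the size of each subdirectory, largest first. This is only a sketch: the shadercache path is the one from the cleanup command earlier in this thread, the exact directory layout underneath it is an assumption, and cache_dir_sizes is a made-up helper name.

```shell
# Print the size in KiB of each immediate subdirectory of DIR, largest first.
# Usage: cache_dir_sizes DIR
cache_dir_sizes() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -exec du -sk {} + | sort -rn
}

# Assumed location of the Steam shader caches; adjust for your library path.
cache="$HOME/.steam/steam/steamapps/shadercache"
if [ -d "$cache" ]; then
    cache_dir_sizes "$cache"
fi
```

Running it again on one of the listed directories shows the next level down, which is where the per-driver directories should appear.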

Even tho shaders are GPU code, they are usually compiled on your CPU. The errors you're seeing may come from your integrated GPU driver if it does not support all types of shaders. Or it may come from old buggy caches. Thus, it's probably harmless. The opcode errors do not indicate a hardware problem. It may just come from incomplete driver support. You could check if the error is gone if you disable your integrated GPU.

The memory address is virtually mapped, so it probably points to a different physical memory location each time the process starts. As said, I don't think this indicates a hardware problem. And you already memtested, so your memory is fine.
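
The "error" field in those dmesg lines can also be decoded. As far as I know, the kernel prints it in hexadecimal and, on x86, it is the page-fault error code bit mask. A small sketch of a decoder follows; decode_pf_error is a made-up helper name, and only the bits relevant here are covered:

```shell
# Decode the hexadecimal "error" field of an x86 segfault line from dmesg.
# Usage: decode_pf_error 14
decode_pf_error() {
    code=$((0x$1))
    # Bit 0: set = protection violation, clear = page not present.
    if [ $((code & 1)) -ne 0 ]; then echo "protection violation"; else echo "page not present"; fi
    # Bit 1: set = write access, clear = read access.
    if [ $((code & 2)) -ne 0 ]; then echo "write access"; else echo "read access"; fi
    # Bit 2: set = fault happened in user mode.
    if [ $((code & 4)) -ne 0 ]; then echo "user mode"; else echo "kernel mode"; fi
    # Bit 4: set = fault was caused by an instruction fetch.
    if [ $((code & 16)) -ne 0 ]; then echo "instruction fetch"; fi
}

decode_pf_error 14
```

For "error 14" this reports a user-mode instruction fetch from a page that is not present, which fits the log lines where the faulting address is the same as the ip value: the process tried to execute code at an address that is not mapped.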

@alosarjos
Author

alosarjos commented Jan 18, 2025

I haven't noticed any "real" issue when playing games, I was just trying to figure out this since it has always happened, and also because fossilize_replay is filling system coredumps:

Sat 2025-01-18 11:02:46 CET 22772 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                       50.8M
Sat 2025-01-18 11:03:05 CET 22773 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                         51M
Sat 2025-01-18 11:03:06 CET 22775 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                       48.8M
Sat 2025-01-18 11:03:14 CET 22774 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                         55M
Sat 2025-01-18 11:03:18 CET 24092 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                       49.8M
Sat 2025-01-18 11:03:18 CET 24091 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                         51M
Sat 2025-01-18 11:03:18 CET 24093 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                       54.2M
Sat 2025-01-18 11:03:18 CET 24094 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                       51.2M
Sat 2025-01-18 11:03:20 CET 24181 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                        6.2M
Sat 2025-01-18 11:03:20 CET 24178 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                        6.8M
Sat 2025-01-18 11:03:20 CET 24180 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                        7.8M
Sat 2025-01-18 11:03:20 CET 24179 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                        8.2M
Sat 2025-01-18 11:14:10 CET  4647 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                        3.7M
Sat 2025-01-18 11:14:10 CET  4640 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                        4.1M
Sat 2025-01-18 11:14:10 CET  4648 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                        4.1M
Sat 2025-01-18 11:14:10 CET  4643 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                        2.8M
Sat 2025-01-18 11:14:10 CET  4653 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                        4.2M
Sat 2025-01-18 11:14:10 CET  4645 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                          4M
Sat 2025-01-18 11:14:10 CET  4654 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                        3.9M
Sat 2025-01-18 11:14:10 CET  4644 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                        4.2M
Sat 2025-01-18 11:14:10 CET  4649 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                        4.5M
Sat 2025-01-18 11:14:10 CET  4641 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                        4.7M
Sat 2025-01-18 11:14:10 CET  4642 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                        3.2M
Sat 2025-01-18 11:14:10 CET  4651 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                        4.5M
Sat 2025-01-18 11:14:10 CET  4646 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                        4.1M
Sat 2025-01-18 11:14:10 CET  4652 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                        4.8M
Sat 2025-01-18 11:37:54 CET  6453 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                          1M
Sat 2025-01-18 11:37:54 CET  6451 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                      994.5K
Sat 2025-01-18 11:37:54 CET  6452 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                      994.6K
Sat 2025-01-18 11:37:56 CET  6645 1000 1000 SIGSYS  present  /home/alosarjos/.local/share/Steam/steamapps/common/Proton - Experimental/files/bin/wine64-preloader   8.3K
Sat 2025-01-18 12:07:20 CET 13141 1000 1000 SIGABRT present  /home/alosarjos/.local/share/spotify-launcher/install/usr/share/spotify/spotify                        5.7M
Sat 2025-01-18 15:01:18 CET 16326 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                       51.1M
Sat 2025-01-18 15:02:20 CET 16327 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                       46.1M
Sat 2025-01-18 15:03:18 CET 16328 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                       51.7M
Sat 2025-01-18 15:04:08 CET 16329 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                       40.1M
Sat 2025-01-18 15:21:39 CET 26780 1000 1000 SIGSEGV present  /home/alosarjos/.local/share/Steam/ubuntu12_64/fossilize_replay                                       46.8M

There are a couple unrelated to fossilize, but the number of coredumps is... yikes. I will try to see what happens if I disable the integrated GPU in the BIOS, but I hope that if that's the case, this can be fixed, since I use it for some stuff...
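
To get a quick count per executable instead of scrolling through the list, the output can be summarized with awk. This is a rough sketch that assumes the column layout shown above (signal in the 8th column, executable path after the COREFILE column, size last); summarize_coredumps is a made-up name:

```shell
# Count core dumps per executable from `coredumpctl list`-style text on stdin.
summarize_coredumps() {
    awk '$8 ~ /^SIG/ {
        # The executable path may contain spaces, so join every field
        # between the COREFILE column and the trailing SIZE column.
        exe = $10
        for (i = 11; i < NF; i++) exe = exe " " $i
        count[exe]++
    }
    END { for (exe in count) printf "%d %s\n", count[exe], exe }' | sort -rn
}

# Example usage on a live system:
#   coredumpctl list --no-pager | summarize_coredumps
```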

@alosarjos
Author

Just in case, I tried launching steam like:

FOSSILIZE_DUMP_SIGSEGV=1 FOSSILIZE_DUMP_PATH=/home/alosarjos/fos /usr/bin/steam-runtime

to see if that would dump some info. It triggered coredumps, but the DUMP_PATH is empty.

@alosarjos
Author

The issue still triggers with the integrated GPU disabled in the BIOS.

@kakra
Contributor

kakra commented Jan 18, 2025

because fossilize_replay is filling system coredumps

Yeah, these have been gone for me for some time now. But I also had a lot of them.

For me, the remaining coredumps are from /usr/bin/dash as part of the Steam runtime (I don't have dash installed in my system) - and that's probably because the runtime glibc and the Steam-included Chromium framework (or other sandboxed apps) do not properly match. This happens to other Chromium-based apps, too, as you can see (Spotify etc, everything using Electron). It's mainly an issue with glibc and sandboxing. Nothing to worry about except it dumps a lot of core files.

Maybe try launching Steam with /usr/bin/steam -no-cef-sandbox to see if this reduces the messages. This is one of the last changes I did to fix non-rendering Steam pages inside the Steam client. It looks like Steam is actually running inside its own Steam runtime (using SteamID=0), and that ships a version of glibc which is too new for the used Chromium framework.

Note: Be aware that disabling the CEF sandbox (for the embedded Chromium framework) may expose you to some risks if browsing websites inside the Steam client.

@alosarjos
Author

I'm still getting way too many of these messages. I'm not sure when this gets updated in Steam, and at this point "all my CPUs" have segfaulted, but I have no issues when stress testing the machine or doing any other kind of heavy task like compiling Rust or playing some CPU-demanding games...


2 participants