Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock at process shutdown #54918

Open
lpinca opened this issue Sep 13, 2024 · 8 comments
Open

Deadlock at process shutdown #54918

lpinca opened this issue Sep 13, 2024 · 8 comments
Labels
confirmed-bug Issues with confirmed bugs.

Comments

@lpinca
Copy link
Member

lpinca commented Sep 13, 2024

Version

v23.0.0-pre

Platform

Linux USR-PC2HD4AD 5.15.153.1-microsoft-standard-WSL2 #1 SMP Fri Mar 29 23:14:13 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

No response

What steps will reproduce the bug?

There is a deadlock that prevents the Node.js process from exiting. The issue is causing a lot (all?) of timeout failures in our CI. It can be reproduced by running a test in parallel with our test.py tool, for example

$ ./tools/test.py --repeat=10000 test/parallel/test-http2-large-write-multiple-requests.js

See also

How often does it reproduce? Is there a required condition?

Rarely, but often enough to be a serious issue for CI.

What is the expected behavior? Why is that the expected behavior?

The process exits.

What do you see instead?

The process does not exit.

Additional information

Attaching gdb to two of the hanging processes obtained from the command above, produces the following outputs:

$ gdb -p 15884
GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 15884
[New LWP 15953]
[New LWP 15952]
[New LWP 15951]
[New LWP 15950]
[New LWP 15900]
[New LWP 15889]
[New LWP 15888]
[New LWP 15887]
[New LWP 15886]
[New LWP 15885]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/futex-internal.c.
0x00007f432025fd61 in __futex_abstimed_wait_common64 (private=21855, cancel=true, abstime=0x0, op=393, expected=0,
    futex_word=0x555fa9fce8f0) at ./nptl/futex-internal.c:57

warning: 57     ./nptl/futex-internal.c: No such file or directory
(gdb) bt
#0  0x00007f432025fd61 in __futex_abstimed_wait_common64 (private=21855, cancel=true, abstime=0x0, op=393, expected=0,
    futex_word=0x555fa9fce8f0) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=21855, abstime=0x0, clockid=0, expected=0,
    futex_word=0x555fa9fce8f0) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x555fa9fce8f0, expected=expected@entry=0,
    clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007f43202627dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x555fa9fce870,
    cond=0x555fa9fce8c8) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x555fa9fce8c8, mutex=0x555fa9fce870) at ./nptl/pthread_cond_wait.c:627
#5  0x0000555fa3d30d7d in uv_cond_wait (cond=<optimized out>, mutex=<optimized out>)
    at ../deps/uv/src/unix/thread.c:814
#6  0x0000555fa2c8c063 in node::NodePlatform::DrainTasks(v8::Isolate*) ()
#7  0x0000555fa2ac528d in node::SpinEventLoopInternal(node::Environment*) ()
#8  0x0000555fa2c4abc4 in node::NodeMainInstance::Run() ()
#9  0x0000555fa2b9233a in node::Start(int, char**) ()
#10 0x00007f43201f11ca in __libc_start_call_main (main=main@entry=0x555fa2ababa0 <main>, argc=argc@entry=2,
    argv=argv@entry=0x7ffe74216608) at ../sysdeps/nptl/libc_start_call_main.h:58
#11 0x00007f43201f128b in __libc_start_main_impl (main=0x555fa2ababa0 <main>, argc=2, argv=0x7ffe74216608,
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe742165f8)
    at ../csu/libc-start.c:360
#12 0x0000555fa2ac06a5 in _start ()
(gdb)
$ gdb -p 15885
GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 15885
Reading symbols from /home/luigi/node/out/Release/node...
Reading symbols from /lib/x86_64-linux-gnu/libstdc++.so.6...
Reading symbols from /home/luigi/.cache/debuginfod_client/40b9b0d17fdeebfb57331304da2b7f85e1396ef2/debuginfo...
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...
Reading symbols from /usr/lib/debug/.build-id/90/32976b3ecc78d1362fedfcd88528562bbfb7e4.debug...
Reading symbols from /lib/x86_64-linux-gnu/libgcc_s.so.1...
Reading symbols from /home/luigi/.cache/debuginfod_client/92123f0e6223c77754bac47062c0b9713ed363df/debuginfo...
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...
Reading symbols from /usr/lib/debug/.build-id/6d/64b17fbac799e68da7ebd9985ddf9b5cb375e6.debug...
Reading symbols from /lib64/ld-linux-x86-64.so.2...
Reading symbols from /usr/lib/debug/.build-id/35/3e1b6cb0eebc08cf3ff812eae8a51b4efd684e.debug...
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Download failed: Invalid argument.  Continuing without source file ./misc/../sysdeps/unix/sysv/linux/epoll_pwait.c.
0x00007f43202f0ee0 in __GI_epoll_pwait (epfd=epfd@entry=9, events=events@entry=0x7f43201bdcb0,
    maxevents=maxevents@entry=1024, timeout=timeout@entry=-1, set=set@entry=0x0)
    at ../sysdeps/unix/sysv/linux/epoll_pwait.c:40

warning: 40     ../sysdeps/unix/sysv/linux/epoll_pwait.c: No such file or directory
(gdb) bt
#0  0x00007f43202f0ee0 in __GI_epoll_pwait (epfd=epfd@entry=9, events=events@entry=0x7f43201bdcb0,
    maxevents=maxevents@entry=1024, timeout=timeout@entry=-1, set=set@entry=0x0)
    at ../sysdeps/unix/sysv/linux/epoll_pwait.c:40
#1  0x0000555fa3d364e7 in uv__io_poll (loop=loop@entry=0x555fa9fcea88, timeout=<optimized out>)
    at ../deps/uv/src/unix/linux.c:1432
#2  0x0000555fa3d20c17 in uv_run (loop=0x555fa9fcea88, mode=UV_RUN_DEFAULT) at ../deps/uv/src/unix/core.c:448
#3  0x0000555fa2c8d4b1 in node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Start()::{lambda(void*)#1}::_FUN(void*)
    ()
#4  0x00007f4320263a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#5  0x00007f43202f0c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
(gdb)
@lpinca
Copy link
Member Author

lpinca commented Sep 13, 2024

As written in #52550 (comment), applying the patch from #47452 seems to fix the issue, but it is not a solution since that patch brings other issues.

@RedYetiDev
Copy link
Member

This can been seen in various CI runs, assuming confirmed-bug

@lpinca
Copy link
Member Author

lpinca commented Sep 13, 2024

The output of lldb on macOS:

$ lldb -p 5033
(lldb) process attach --pid 5033
Process 5033 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007ff802363c3a libsystem_kernel.dylib`__psynch_cvwait + 10
libsystem_kernel.dylib`:
->  0x7ff802363c3a <+10>: jae    0x7ff802363c44            ; <+20>
    0x7ff802363c3c <+12>: movq   %rax, %rdi
    0x7ff802363c3f <+15>: jmp    0x7ff8023617e0            ; cerror_nocancel
    0x7ff802363c44 <+20>: retq   
Target 0: (node) stopped.
Executable module set to "/Users/luigi/code/node/out/Release/node".
Architecture set to: x86_64h-apple-macosx-.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007ff802363c3a libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff8023a16f3 libsystem_pthread.dylib`_pthread_cond_wait + 1211
    frame #2: 0x000000010df93b23 node`uv_cond_wait(cond=0x00007f9a4d404108, mutex=0x00007f9a4d404098) at thread.c:798:7 [opt]
    frame #3: 0x000000010d0be52b node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::LibuvMutexTraits::cond_wait(cond=0x00007f9a4d404108, mutex=0x00007f9a4d404098) at node_mutex.h:175:5 [opt]
    frame #4: 0x000000010d0be511 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::ConditionVariableBase<node::LibuvMutexTraits>::Wait(this=0x00007f9a4d404108, scoped_lock=<unavailable>) at node_mutex.h:249:3 [opt]
    frame #5: 0x000000010d0be511 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::TaskQueue<v8::Task>::BlockingDrain(this=0x00007f9a4d404098) at node_platform.cc:640:20 [opt]
    frame #6: 0x000000010d0be4fb node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::WorkerThreadsTaskRunner::BlockingDrain(this=0x00007f9a4d404098) at node_platform.cc:214:25 [opt]
    frame #7: 0x000000010d0be4fb node`node::NodePlatform::DrainTasks(this=0x0000600002718000, isolate=<unavailable>) at node_platform.cc:467:33 [opt]
    frame #8: 0x000000010cf18a50 node`node::SpinEventLoopInternal(env=0x00007f9a2c814e00) at embed_helpers.cc:44:17 [opt]
    frame #9: 0x000000010d084083 node`node::NodeMainInstance::Run() [inlined] node::NodeMainInstance::Run(this=<unavailable>, exit_code=<unavailable>, env=0x00007f9a2c814e00) at node_main_instance.cc:111:9 [opt]
    frame #10: 0x000000010d083fe9 node`node::NodeMainInstance::Run(this=<unavailable>) at node_main_instance.cc:100:3 [opt]
    frame #11: 0x000000010cfe4159 node`node::Start(int, char**) [inlined] node::StartInternal(argc=<unavailable>, argv=<unavailable>) at node.cc:1489:24 [opt]
    frame #12: 0x000000010cfe3f54 node`node::Start(argc=<unavailable>, argv=<unavailable>) at node.cc:1496:27 [opt]
    frame #13: 0x00007ff802015345 dyld`start + 1909
(lldb)

@aduh95
Copy link
Contributor

aduh95 commented Sep 17, 2024

@lpinca could you try ab23463? I'm not able to reproduce the deadlock locally, so I can't confirm if it is effective.

@lpinca
Copy link
Member Author

lpinca commented Sep 17, 2024

@aduh95 see #54918 (comment). It seems the same patch.

@aduh95
Copy link
Contributor

aduh95 commented Sep 17, 2024

No it's not, it's based on it but it's meant to fix the test failures.

@targos
Copy link
Member

targos commented Sep 17, 2024

@lpinca
Copy link
Member Author

lpinca commented Sep 17, 2024

The issue persists 😞

gdb -p 18560
GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 18560
[New LWP 18621]
[New LWP 18620]
[New LWP 18619]
[New LWP 18618]
[New LWP 18574]
[New LWP 18565]
[New LWP 18564]
[New LWP 18563]
[New LWP 18562]
[New LWP 18561]
Downloading separate debug info for /lib/x86_64-linux-gnu/libstdc++.so.6
Downloading separate debug info for system-supplied DSO at 0x7ffdf50f4000
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/futex-internal.c.
0x00007fca5f88ad61 in __futex_abstimed_wait_common64 (private=21976, cancel=true, abstime=0x0, op=393, expected=0,
    futex_word=0x55d834ea10e0) at ./nptl/futex-internal.c:57

warning: 57     ./nptl/futex-internal.c: No such file or directory
(gdb) bt
#0  0x00007fca5f88ad61 in __futex_abstimed_wait_common64 (private=21976, cancel=true, abstime=0x0, op=393, expected=0,
    futex_word=0x55d834ea10e0) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=21976, abstime=0x0, clockid=0, expected=0,
    futex_word=0x55d834ea10e0) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x55d834ea10e0, expected=expected@entry=0,
    clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007fca5f88d7dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55d834ea1060,
    cond=0x55d834ea10b8) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x55d834ea10b8, mutex=0x55d834ea1060) at ./nptl/pthread_cond_wait.c:627
#5  0x000055d82f131fed in uv_cond_wait (cond=<optimized out>, mutex=<optimized out>)
    at ../deps/uv/src/unix/thread.c:814
#6  0x000055d82e08c853 in node::NodePlatform::DrainTasks(v8::Isolate*) ()
#7  0x000055d82dec528d in node::SpinEventLoopInternal(node::Environment*) ()
#8  0x000055d82e04b364 in node::NodeMainInstance::Run() ()
#9  0x000055d82df9276a in node::Start(int, char**) ()
#10 0x00007fca5f81c1ca in __libc_start_call_main (main=main@entry=0x55d82debac50 <main>, argc=argc@entry=2,
    argv=argv@entry=0x7ffdf50a9a48) at ../sysdeps/nptl/libc_start_call_main.h:58
#11 0x00007fca5f81c28b in __libc_start_main_impl (main=0x55d82debac50 <main>, argc=2, argv=0x7ffdf50a9a48,
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffdf50a9a38)
    at ../csu/libc-start.c:360
#12 0x000055d82dec0785 in _start ()
(gdb)

Also, test/wasi/test-wasi-initialize-validation.js and test/wasi/test-wasi-start-validation.js fail.

$ ./node test/wasi/test-wasi-initialize-validation.js
(node:13188) ExperimentalWarning: WASI is an experimental feature and might change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
Mismatched noop function calls. Expected exactly 1, actual 0.
    at Proxy.mustCall (/home/luigi/node/test/common/index.js:451:10)
    at Object.<anonymous> (/home/luigi/node/test/wasi/test-wasi-initialize-validation.js:154:18)
    at Module._compile (node:internal/modules/cjs/loader:1557:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1702:10)
    at Module.load (node:internal/modules/cjs/loader:1328:32)
    at Module._load (node:internal/modules/cjs/loader:1138:12)
    at TracingChannel.traceSync (node:diagnostics_channel:315:14)
    at wrapModuleLoad (node:internal/modules/cjs/loader:217:24)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:166:5)
$ ./node test/wasi/test-wasi-start-validation.js
(node:13199) ExperimentalWarning: WASI is an experimental feature and might change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
Mismatched noop function calls. Expected exactly 1, actual 0.
    at Proxy.mustCall (/home/luigi/node/test/common/index.js:451:10)
    at Object.<anonymous> (/home/luigi/node/test/wasi/test-wasi-start-validation.js:151:18)
    at Module._compile (node:internal/modules/cjs/loader:1557:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1702:10)
    at Module.load (node:internal/modules/cjs/loader:1328:32)
    at Module._load (node:internal/modules/cjs/loader:1138:12)
    at TracingChannel.traceSync (node:diagnostics_channel:315:14)
    at wrapModuleLoad (node:internal/modules/cjs/loader:217:24)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:166:5)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
confirmed-bug Issues with confirmed bugs.
Projects
None yet
Development

No branches or pull requests

4 participants