fix: Run dumb-init with --single-child for graceful termination #4913

meringu · 2024-09-10T23:50:42Z

what

By default, dumb init forward SIGTERM signals to all child processes. This causes Atlantis to shutdown ungracefully when running a Terraform command. The process tree looks as follows:

dumb-init -> atlantis -> sh -> terraform -> provider(s).

When the container runtime stops the container while Atlantis is running a Terraform command (E.g Kubernetes node scale down), dumb-init will forward the SIGTERM to all these processes in reverse order. Both Terraform and Atlantis will trap the signal and try to exit gracefully, however the provider will crash, resulting in Atlantis commenting the following on a PR:

Stopping operation...

Interrupt received.
Please wait for Terraform to exit or data loss may occur.
Gracefully shutting down...

╷
│ Error: execution halted
│ 
│ 
╵
╷
│ Error: execution halted
│ 
│ 
╵
╷
│ Error: Plugin did not respond
│ 
│   with time_sleep.wait_30_seconds,
│   on main.tf line 1, in resource "time_sleep" "wait_30_seconds":
│    1: resource "time_sleep" "wait_30_seconds" {
│ 
│ The plugin encountered an error, and failed to respond to the
│ plugin.(*GRPCProvider).ApplyResourceChange call. The plugin logs may
│ contain more details.
╵

This is consistently reproducible by using the time_sleep resource (as above) and stopping the container during an apply.

The impact of this behaviour can be partial applies that need manual clean up, particularly in the case that the provider has created an asset, but has not finished waiting for it to converge. It will exist, but not be tracked in the statefile. Some crashes, like resource updates can be resolved just by planning and applying again.

why

There are a few options here, we've set the DUMB_INIT_SETSID=0 environment variable to all of our Atlantis instances, which has resolved this issue for us. Atlantis will finish the Terraform command before finishing scale down. We could also set the --single-child flag in the entrypoint script (might be cleaner, but was not as easy to test as adding this env var in my environment). More details on how dumb init works here: https://github.com/Yelp/dumb-init?tab=readme-ov-file#what-dumb-init-does. Another option would be to remove it entirely, and just use #!/bin/sh as the entrypoint.

tests

Tested within Kubernetes. I reproduced it by deleting a pod during an apply. Adding this env var then trying again results in Atlantis finishing the command then scaling down.

references

None, I debugged this yesterday. I have some strace output if anyone is interested. It shows that PID 1 (dumb-init)
is sending SIGTERM to all the child PIDs (Atlantis, Terraform, provider etc...)

jamengual

Could you add something to the docs about this change?

meringu · 2024-09-11T01:06:10Z

I can do @jamengual. I couldn't see any specific docs about Docker, do you have any recommendations where this would best fit? Or would a code comment suffice?

I was thinking, I might quickly test with --single-child. As it is specific to the entrypoint script, it might be nice to contain this within there, rather than proliferate the env var to all child processes.

jamengual · 2024-09-11T03:41:30Z

we could try that in the entry point and maybe add a comment there is all works

meringu · 2024-09-12T01:34:31Z

I've ended up moving dumb-init from the shebang to the exec. Setting --single-child in the shebang is not viable as the container runtime will use sh to start the container process, which results in #!/usr/bin/dumb-init --single-child /bin/sh being interpreted as /usr/bin/dumb-init '--single-child /bin/sh'.

The side effect is that if any /docker-entrypoint.d/ scripts fork a process into the background, they won't be cleaned up by dumb init, but this would have been the effect of setting --single-child anyway I believe.

I've tested this to make sure the SIGTERM is caught correctly in my environment to no longer kill the plugin in the middle of an apply, the auto prefixing of the atlantis binary works, and execution of /docker-entrypoint.d/ scripts still function as I expect.

meringu · 2024-09-12T01:43:35Z

I'm not entirely sure if there is any value in keeping dumb-init with this change. I believe one would need to overwrite the entrypoint script to take advantage of other features like the signal rewriting. It doesn't hurt leaving it in however.

chenrui333 · 2024-09-12T19:59:31Z

Thanks @meringu!

meringu · 2024-10-30T19:57:57Z

Hey @chenrui333, sorry for the direct mention.

I was wondering what the process is to get this merged? I also had #4928 open, which would be a great help to avoid restarts and reduce downtime for our deployments too which I would be very interested in progressing also.

jamengual · 2024-10-31T04:36:43Z

@meringu We have a ton of PRs waiting on review, we will do our best to review it soon.

X-Guardian · 2024-11-02T20:08:01Z

This change can be made simpler by using env and the --split-string option in the shebang. See https://www.gnu.org/software/coreutils/manual/html_node/env-invocation.html#g_t_002dS_002f_002d_002dsplit_002dstring-usage-in-scripts

i.e.

#!/usr/bin/env --split-string dumb-init --single-child /bin/sh

meringu · 2024-11-04T00:34:12Z

Thanks for the tip @X-Guardian! TIL. This is much tidier. I managed to get this working with -S instead of --split-string. I was getting this error otherwise:

/usr/bin/env: unrecognized option '--split-string dumb-init --single-child /bin/sh'
Try '/usr/bin/env --help' for more information.

I tested the Debian variant in my environment to ensure graceful termination. I just did a quick test with the alpine container locally by running env directly from an interactive shell, -S also works as expected after installing coreutils-env.

Signed-off-by: Henry Muru Paenga <[email protected]>

X-Guardian · 2024-11-04T17:56:00Z

Looks good @meringu. Last thing, can you refine the PR title? It is used for the text entry on the changelog.

meringu · 2024-11-04T19:41:12Z

Thanks. I have updated it to better reflect the change.

X-Guardian · 2024-11-04T22:53:42Z

Thanks for this @meringu. You can test this using one of these container images: dev-alpine-2c15413 or dev-debian-2c15413

meringu · 2024-11-05T10:42:40Z

Thanks for merging @X-Guardian!

I have tested this in my environment with the dev-debian-2c15413 image. Atlantis now logs that it receives the interrupt, but now waits for the command to finish rather than the provider crashing like before.

meringu requested review from a team as code owners September 10, 2024 23:50

meringu requested review from chenrui333, lukemassa and X-Guardian and removed request for a team September 10, 2024 23:50

github-actions bot added the build Relating to how we build Atlantis label Sep 10, 2024

jamengual reviewed Sep 10, 2024

View reviewed changes

jamengual changed the title ~~Atlantis ungraceful termination~~ fix: Atlantis ungraceful termination Sep 10, 2024

meringu force-pushed the patch-1 branch 2 times, most recently from eb2d368 to b475a1d Compare September 12, 2024 01:25

chenrui333 previously approved these changes Sep 12, 2024

View reviewed changes

meringu force-pushed the patch-1 branch from 8173a80 to e05c267 Compare October 30, 2024 20:01

meringu dismissed chenrui333’s stale review via 7f4058e November 3, 2024 21:43

meringu force-pushed the patch-1 branch 3 times, most recently from 98803e6 to 0ed43c6 Compare November 4, 2024 00:14

Run dumb-init with --single-child

f66775b

Signed-off-by: Henry Muru Paenga <[email protected]>

meringu force-pushed the patch-1 branch from 0ed43c6 to f66775b Compare November 4, 2024 02:27

Merge branch 'main' into patch-1

218f4b1

meringu changed the title ~~fix: Atlantis ungraceful termination~~ fix: Run dumb-init with --single-child for graceful termination Nov 4, 2024

Merge branch 'main' into patch-1

cee1155

X-Guardian approved these changes Nov 4, 2024

View reviewed changes

dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 4, 2024

X-Guardian merged commit 2c15413 into runatlantis:main Nov 4, 2024
33 checks passed

meringu deleted the patch-1 branch November 5, 2024 10:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Run dumb-init with --single-child for graceful termination #4913

fix: Run dumb-init with --single-child for graceful termination #4913

meringu commented Sep 10, 2024

jamengual left a comment

meringu commented Sep 11, 2024

jamengual commented Sep 11, 2024

meringu commented Sep 12, 2024 •

edited

Loading

meringu commented Sep 12, 2024

chenrui333 commented Sep 12, 2024

meringu commented Oct 30, 2024

jamengual commented Oct 31, 2024

X-Guardian commented Nov 2, 2024

meringu commented Nov 4, 2024

X-Guardian commented Nov 4, 2024

meringu commented Nov 4, 2024

X-Guardian commented Nov 4, 2024

meringu commented Nov 5, 2024

fix: Run dumb-init with --single-child for graceful termination #4913

fix: Run dumb-init with --single-child for graceful termination #4913

Conversation

meringu commented Sep 10, 2024

what

why

tests

references

jamengual left a comment

Choose a reason for hiding this comment

meringu commented Sep 11, 2024

jamengual commented Sep 11, 2024

meringu commented Sep 12, 2024 • edited Loading

meringu commented Sep 12, 2024

chenrui333 commented Sep 12, 2024

meringu commented Oct 30, 2024

jamengual commented Oct 31, 2024

X-Guardian commented Nov 2, 2024

meringu commented Nov 4, 2024

X-Guardian commented Nov 4, 2024

meringu commented Nov 4, 2024

X-Guardian commented Nov 4, 2024

meringu commented Nov 5, 2024

meringu commented Sep 12, 2024 •

edited

Loading