-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Run dumb-init with --single-child for graceful termination #4913
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add something to the docs about this change?
I can do @jamengual. I couldn't see any specific docs about Docker, do you have any recommendations where this would best fit? Or would a code comment suffice? I was thinking, I might quickly test with |
we could try that in the entry point and maybe add a comment there is all works |
eb2d368
to
b475a1d
Compare
I've ended up moving dumb-init from the shebang to the The side effect is that if any I've tested this to make sure the |
I'm not entirely sure if there is any value in keeping dumb-init with this change. I believe one would need to overwrite the entrypoint script to take advantage of other features like the signal rewriting. It doesn't hurt leaving it in however. |
Thanks @meringu! |
Hey @chenrui333, sorry for the direct mention. I was wondering what the process is to get this merged? I also had #4928 open, which would be a great help to avoid restarts and reduce downtime for our deployments too which I would be very interested in progressing also. |
@meringu We have a ton of PRs waiting on review, we will do our best to review it soon. |
This change can be made simpler by using i.e. #!/usr/bin/env --split-string dumb-init --single-child /bin/sh |
98803e6
to
0ed43c6
Compare
Thanks for the tip @X-Guardian! TIL. This is much tidier. I managed to get this working with
I tested the Debian variant in my environment to ensure graceful termination. I just did a quick test with the alpine container locally by running env directly from an interactive shell, |
Signed-off-by: Henry Muru Paenga <[email protected]>
Looks good @meringu. Last thing, can you refine the PR title? It is used for the text entry on the changelog. |
Thanks. I have updated it to better reflect the change. |
Thanks for this @meringu. You can test this using one of these container images: dev-alpine-2c15413 or dev-debian-2c15413 |
Thanks for merging @X-Guardian! I have tested this in my environment with the dev-debian-2c15413 image. Atlantis now logs that it receives the interrupt, but now waits for the command to finish rather than the provider crashing like before. |
what
By default, dumb init forward SIGTERM signals to all child processes. This causes Atlantis to shutdown ungracefully when running a Terraform command. The process tree looks as follows:
When the container runtime stops the container while Atlantis is running a Terraform command (E.g Kubernetes node scale down), dumb-init will forward the SIGTERM to all these processes in reverse order. Both Terraform and Atlantis will trap the signal and try to exit gracefully, however the provider will crash, resulting in Atlantis commenting the following on a PR:
This is consistently reproducible by using the
time_sleep
resource (as above) and stopping the container during an apply.The impact of this behaviour can be partial applies that need manual clean up, particularly in the case that the provider has created an asset, but has not finished waiting for it to converge. It will exist, but not be tracked in the statefile. Some crashes, like resource updates can be resolved just by planning and applying again.
why
There are a few options here, we've set the
DUMB_INIT_SETSID=0
environment variable to all of our Atlantis instances, which has resolved this issue for us. Atlantis will finish the Terraform command before finishing scale down. We could also set the--single-child
flag in the entrypoint script (might be cleaner, but was not as easy to test as adding this env var in my environment). More details on how dumb init works here: https://github.com/Yelp/dumb-init?tab=readme-ov-file#what-dumb-init-does. Another option would be to remove it entirely, and just use#!/bin/sh
as the entrypoint.tests
Tested within Kubernetes. I reproduced it by deleting a pod during an apply. Adding this env var then trying again results in Atlantis finishing the command then scaling down.
references
None, I debugged this yesterday. I have some
strace
output if anyone is interested. It shows that PID 1 (dumb-init)is sending
SIGTERM
to all the child PIDs (Atlantis, Terraform, provider etc...)