Multiple events for the same node cause a crash #307
Comments
We've discussed a related situation previously in #272. I think #297 could be used to mitigate this, but we may not be able to run multiple replicated NTH pods reliably. I'd be interested to see whether using EC2 instance tags to turn NTH management on and off could help (although I'm not sure that would be the right use of it, or that it would work in every instance). I'm not opposed to removing the crash and just logging the event if the node isn't found. I don't think there's a possible workaround at this time. |
I'd prefer the crash removal if possible. I can understand why not being able to find a node might have been worthy of causing a crash during early design, but it seems increasingly that this is a common scenario. |
Agreed, we're facing the same issue. In our case we basically see the pod crashing all the time. |
I still observe this issue with 1.11 if I enable both ASG lifecycle events and EC2 state change events. |
@haugenj anything we can do to get a fix? Would you accept a PR? |
PRs are always welcome! If you could include a test to validate this scenario that would be 💯 |
@haugenj, this issue is fixed by #313. Background: I went all the way back and, instead of just removing the os.Exit(1), found that the nodeName was actually empty but never verified. The empty nodeName parsing derives from […], so I created a custom error that allows that case to be handled. |
Currently I'm sending both EC2 state change events and ASG lifecycle events to the NTH in queue processor mode. This sounds like it should be supported; however, in practice what seems to be happening is:
I'm not convinced that the NTH needs to crash if it can't find a node, but that aside, is there any other workaround for this? Or should I just pick EC2 state change events OR ASG lifecycle events to handle?