MachinePool not appropriately updating status #12036
Comments
This issue is currently awaiting triage. If CAPI contributors determine this is a relevant issue, they will accept it by applying the appropriate triage label.
This is also related to a still-existing bug whose ticket was closed:
Following the old links, could this be this CAPA bug (so not CAPI): kubernetes-sigs/cluster-api-provider-aws#4618?
Thanks for providing that link! The resources described in that PR are different, but it could potentially be the same issue. Digging into the CAPI code, the proximate cause is very clear here: https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/cluster/cluster_controller_phases.go#L65C1-L67C3. As long as those failure messages are not cleared, the status of the MachinePool will never change. I haven't had time yet to trace whether anything actually updates those fields once a downstream resource becomes healthy. I'll have to look deeper into your link; I see from the comments there were some thoughts that it might be a solved issue due to https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/5174/files, but I will have to look deeper.
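In pseudo-Go, the sticky behavior looks roughly like this (a simplified sketch of the logic at that link, not the actual CAPI source; the function and phase names here are illustrative):

```go
package main

import "fmt"

// derivePhase sketches the phase logic: if either failure field is set, the
// phase is pinned to "Failed" before any other state is considered, so a
// leftover failure message keeps the resource in Failed forever.
func derivePhase(failureReason, failureMessage string, infraReady bool) string {
	if failureReason != "" || failureMessage != "" {
		return "Failed" // sticky: nothing later in reconcile overrides this
	}
	if infraReady {
		return "Provisioned"
	}
	return "Provisioning"
}

func main() {
	// Even with healthy infrastructure, a stale failure message wins.
	fmt.Println(derivePhase("", "old launch template error", true)) // prints "Failed"
}
```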
OK, I had some time to look into it. Whether or not there is an issue at the AWS provider level, I think there is still a bug here in CAPI MachinePool handling. I noted the initial proximate cause above. Following that train of thought, I can see that the MachinePool status fields failureMessage and failureReason are set at https://github.com/kubernetes-sigs/cluster-api/blob/main/exp/internal/controllers/machinepool_controller_phases.go#L155-L169. The issue is that they are never cleared or set to empty when the external infrastructure resource later becomes healthy. I checked all the reconciliation methods. I would like to propose clearing these fields once reconciliation of the external resource succeeds.
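Roughly, the change I have in mind looks like this (a hypothetical sketch, not a tested patch against the real controller; all type and field names here are illustrative):

```go
package main

import "fmt"

// machinePoolStatus mimics the relevant status fields for illustration.
type machinePoolStatus struct {
	FailureReason  string
	FailureMessage string
	Phase          string
}

// reconcileExternal sketches the proposed fix: once the external
// infrastructure reconciles successfully, reset the failure fields so the
// phase can leave Failed. On an unhealthy result, leave everything as-is.
func reconcileExternal(s *machinePoolStatus, externalHealthy bool) {
	if externalHealthy {
		s.FailureReason = ""
		s.FailureMessage = ""
		s.Phase = "Running"
	}
}

func main() {
	s := machinePoolStatus{
		FailureReason:  "CreateError",
		FailureMessage: "bad launch template",
		Phase:          "Failed",
	}
	reconcileExternal(&s, true)
	fmt.Println(s.Phase, s.FailureMessage == "") // prints "Running true"
}
```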
@chrischdi Before I put effort into a pull request, can you review this approach and let me know if it makes sense to you?
Ah, I just recognized that failureMessage and failureReason are set. It works by design: if they are set, the controllers don't do anything anymore, because the fields are considered a terminal failure (at the awsmanagedmachinepool level). If these are not terminal failures, then whoever set them should not do so. Besides that: in the future we will remove both fields anyway, once the implementation of https://github.com/fabriziopandini/cluster-api/blob/main/docs/proposals/20240916-improve-status-in-CAPI-resources.md is done.
Sorry, I'm not really sure what direction you're suggesting. Are you saying a "fix" for this is not welcome because it is intended behavior? I'm also unclear on the terminal-failure part. These resources (MachinePool and AWSManagedMachinePool) are active, working, and modifiable (they can scale up or down). What is the intended path for someone in my situation, with MachinePools stuck in a Failed phase? When you say the failures are terminal: is that in and of itself the problem, or maybe I'm not understanding? Do you feel the source of the issue is still on the AWS provider side?
As Chris mentioned above, the CAPI contractual assumption is that failures signaled through those fields are terminal, so it is the provider implementer's task to honour that assumption. In this case, if the failure is indeed recoverable, I would expect the AWS implementation not to signal it via those fields.
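In other words, the contract looks something like this (an illustrative sketch of the convention described above, not actual CAPA code; the type and key names are made up):

```go
package main

import "fmt"

// reconcileError represents an error observed during provider reconciliation,
// tagged with whether the provider considers it terminal.
type reconcileError struct {
	msg      string
	terminal bool
}

// report routes only terminal errors into failureMessage; recoverable
// problems are surfaced through conditions, which are re-evaluated on
// every reconcile and can recover on their own.
func report(status map[string]string, err reconcileError) {
	if err.terminal {
		status["failureMessage"] = err.msg // resource is considered permanently failed
	} else {
		status["condition:Ready"] = "False: " + err.msg // transient, re-evaluated next reconcile
	}
}

func main() {
	st := map[string]string{}
	report(st, reconcileError{msg: "launch template invalid", terminal: false})
	fmt.Println(st["failureMessage"] == "", st["condition:Ready"])
	// prints "true False: launch template invalid"
}
```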
Thanks @enxebre @chrischdi! Appreciate the context and that all sounds reasonable to me. I'll take it up over there and link that over to this issue. |
What steps did you take and what happened?
I'm having an issue with some MachinePool resources reporting a failure state when that doesn't seem to be true downstream (a MachinePool on top of an AWSManagedMachinePool, where the latter reports healthy in both CAPA and the AWS UI).
Reproduce
To reproduce the bug, deploy a MachinePool that uses a misconfigured launch template or otherwise invalid config that produces the CAPI MachinePool error:
Then fix the cause of the error.
Information
Here is an example of an impacted MachinePool and AWSManagedMachinePool manifest.

MachinePool with the failure: (manifest elided)

The related AWSManagedMachinePool, working happily: (manifest elided)

Here is a snippet of the CAPI logs for a cluster that is currently reporting this issue:
The CAPA logs contain nothing useful because from the functional, AWS UI, and CAPA perspective everything is working perfectly.
Attempted trouble-shooting
Once failureReason and failureMessage are set, there is no code in the controller that I can find that explicitly clears these fields when subsequent reconciliations succeed. Even when the infrastructure resource is found and everything else is working correctly, the fields remain populated, forcing .status.phase to stay in the Failed state.
This is an issue for me because I have written automation that depends on these fields reporting an accurate status. Otherwise, the actual AWS side of things and the CAPI operations seem to be working fine.
What did you expect to happen?
I expected the status to reflect the accurate state of the cluster as AWS sees it.
Cluster API version
Latest (I even tried the new 1.10 beta)
Kubernetes version
1.30 - 1.32
Anything else you would like to add?
No response
Label(s) to be applied
/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.