
MachinePool not appropriately updating status #12036

Closed

jwitko opened this issue Mar 27, 2025 · 9 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
needs-priority: Indicates an issue lacks a `priority/foo` label and requires one.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@jwitko

jwitko commented Mar 27, 2025

What steps did you take and what happened?

I'm having an issue with some MachinePool resources reporting a failure state when that doesn't seem to be true downstream (MachinePool on top of AWSManagedMachinePool where latter reports healthy in both CAPA and AWS UI).

Reproduce
To reproduce the bug, you should be able to deploy a MachinePool that uses a misconfigured launch template or an otherwise invalid config that produces the CAPI MachinePool error:

status:
  failureMessage: MachinePool infrastructure resource infrastructure.cluster.x-k8s.io/v1beta2,
    Kind=AWSManagedMachinePool with name "my-dev-us-west-2-ecs2eks001-infra" has
    been deleted after being ready
  failureReason: InvalidConfiguration

Then fix the cause of the error.

Information
Here are example manifests for an impacted MachinePool and its AWSManagedMachinePool.
MachinePool with failure:

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: my-dev-us-west-2-ecs2eks001-infra-machine
  namespace: kube-clusters
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    name: my-dev-us-west-2-ecs2eks001
    uid: 160cd85c-a51a-4b54-b4ab-5155796cb732
spec:
  clusterName: my-dev-us-west-2-ecs2eks001
  minReadySeconds: 0
  providerIDList:
  - aws:///us-west-2a/i-asdf
  - aws:///us-west-2c/i-asdf
  - aws:///us-west-2b/i-asdf
  replicas: 3
  template:
    metadata: {}
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
          kind: EKSConfig
          name: my-dev-us-west-2-ecs2eks001-infra-eksconfig
          namespace: kube-clusters
        dataSecretName: my-dev-us-west-2-ecs2eks001-infra-eksconfig
      clusterName: my-dev-us-west-2-ecs2eks001
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSManagedMachinePool
        name: my-dev-us-west-2-ecs2eks001-infra
        namespace: kube-clusters
      nodeDeletionTimeout: 10s
      version: v1.32.0
status:
  availableReplicas: 3
  bootstrapReady: true
  conditions:
  - lastTransitionTime: "2025-03-20T17:54:00Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2025-02-27T22:30:20Z"
    status: "True"
    type: BootstrapReady
  - lastTransitionTime: "2025-02-27T22:49:55Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2025-03-20T17:54:00Z"
    status: "True"
    type: ReplicasReady
  failureMessage: MachinePool infrastructure resource infrastructure.cluster.x-k8s.io/v1beta2,
    Kind=AWSManagedMachinePool with name "my-dev-us-west-2-ecs2eks001-infra" has
    been deleted after being ready
  failureReason: InvalidConfiguration
  infrastructureReady: true
  nodeRefs:
  - apiVersion: v1
    kind: Node
    name: ip-10-0-225-182.us-west-2.compute.internal
    uid: 5aa877a4-d75d-4d6d-a99d-e47be6b11542
  - apiVersion: v1
    kind: Node
    name: ip-10-0-242-129.us-west-2.compute.internal
    uid: 2311958c-78b0-4724-a665-03843ed9198a
  - apiVersion: v1
    kind: Node
    name: ip-10-0-238-125.us-west-2.compute.internal
    uid: 6a55e395-5850-425a-98e8-1fdf4a010512
  observedGeneration: 10
  phase: Failed
  readyReplicas: 3
  replicas: 3
  v1beta2:
    conditions:
    - lastTransitionTime: "2025-03-26T04:14:44Z"
      message: ""
      observedGeneration: 10
      reason: NotPaused
      status: "False"
      type: Paused

The related AWSManagedMachinePool, which is working happily:

apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  name: my-dev-us-west-2-ecs2eks001-infra
  namespace: kube-clusters
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachinePool
    name: my-dev-us-west-2-ecs2eks001-infra-machine
    uid: a1a7c0bf-6365-4adf-8346-3e6e9b697242
  resourceVersion: "616612471"
  uid: 171f7627-5e70-4687-a7a0-9bdacf453827
spec:
  amiType: CUSTOM
  awsLaunchTemplate:
    ami: {}
    instanceMetadataOptions:
      httpEndpoint: enabled
      httpPutResponseHopLimit: 2
      httpTokens: required
      instanceMetadataTags: disabled
    instanceType: t3a.large
    name: infra-launch-template
    rootVolume:
      size: 20
      type: gp3
  capacityType: onDemand
  eksNodegroupName: infra
  labels:
    workload-type: infra
  providerIDList:
  - aws:///us-west-2a/i-zzzz
  - aws:///us-west-2c/i-yyyy
  - aws:///us-west-2b/i-xxx
  roleName: my-dev-us-west-2-ecs2eks001-node-additional
  scaling:
    maxSize: 5
    minSize: 2
  subnetIDs:
  - subnet-adsf
  - subnet-asdf
  - subnet-adsf
  taints:
  - effect: no-schedule
    key: workload-type
    value: infra
  updateConfig:
    maxUnavailable: 1
status:
  conditions:
  - lastTransitionTime: "2025-02-27T22:49:56Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2025-02-27T22:49:56Z"
    status: "True"
    type: EKSNodegroupReady
  - lastTransitionTime: "2025-02-27T22:47:22Z"
    status: "True"
    type: IAMNodegroupRolesReady
  - lastTransitionTime: "2025-02-27T22:47:22Z"
    status: "True"
    type: LaunchTemplateReady
  - lastTransitionTime: "2025-03-26T04:14:47Z"
    reason: NotPaused
    status: "False"
    type: Paused
  - lastTransitionTime: "2025-03-01T03:47:43Z"
    status: "True"
    type: PostLaunchTemplateUpdateOperationSuccess
  launchTemplateID: lt-asdf
  launchTemplateVersion: "4"
  ready: true
  replicas: 3

Here is a snippet of the CAPI logs for a cluster that is currently reporting this issue:

I0326 04:30:15.304866 1 cluster_accessor.go:445] "Creating watch machinepool-watchNodes for *v1.Node" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="kube-clusters/my-qa-us-west-2-ecs2eks001-lz-external-machine" namespace="kube-clusters" name="my-qa-us-west-2-ecs2eks001-lz-external-machine" reconcileID="8f341e0e-10c2-4a3d-b95e-b16b9cbbd04a" Cluster="kube-clusters/my-qa-us-west-2-ecs2eks001"
I0326 04:30:25.240989 1 cluster_accessor.go:315] "Disconnecting" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="kube-clusters/my-dev-us-west-2-ecs2eks001" namespace="kube-clusters" name="my-dev-us-west-2-ecs2eks001" reconcileID="7b8694ea-f113-4866-8c03-1f9cdc8e9bec"
I0326 04:30:25.241017 1 cluster_accessor.go:322] "Disconnected" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="kube-clusters/my-dev-us-west-2-ecs2eks001" namespace="kube-clusters" name="my-dev-us-west-2-ecs2eks001" reconcileID="7b8694ea-f113-4866-8c03-1f9cdc8e9bec"
E0326 04:30:25.241765 1 cluster_controller_status.go:838] "Failed to aggregate ControlPlane, MachinePool, MachineDeployment's RollingOut conditions" err="sourceObjs can't be empty" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="kube-clusters/my-dev-us-west-2-ecs2eks001" namespace="kube-clusters" name="my-dev-us-west-2-ecs2eks001" reconcileID="dbaba844-6420-4f5f-8468-2af957e8a166"
E0326 04:30:25.241806 1 cluster_controller_status.go:915] "Failed to aggregate ControlPlane, MachinePool, MachineDeployment, MachineSet's ScalingUp conditions" err="sourceObjs can't be empty" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="kube-clusters/my-dev-us-west-2-ecs2eks001" namespace="kube-clusters" name="my-dev-us-west-2-ecs2eks001" reconcileID="dbaba844-6420-4f5f-8468-2af957e8a166"
E0326 04:30:25.241831 1 cluster_controller_status.go:992] "Failed to aggregate ControlPlane, MachinePool, MachineDeployment, MachineSet's ScalingDown conditions" err="sourceObjs can't be empty" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="kube-clusters/my-dev-us-west-2-ecs2eks001" namespace="kube-clusters" name="my-dev-us-west-2-ecs2eks001" reconcileID="dbaba844-6420-4f5f-8468-2af957e8a166"
I0326 04:30:25.242228 1 cluster_accessor.go:252] "Connecting" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="kube-clusters/my-dev-us-west-2-ecs2eks001" namespace="kube-clusters" name="my-dev-us-west-2-ecs2eks001" reconcileID="f36ddad1-4511-426b-b7ae-c7b12cd004f0"
I0326 04:30:27.359987 1 cluster_accessor.go:271] "Connected" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="kube-clusters/my-dev-us-west-2-ecs2eks001" namespace="kube-clusters" name="my-dev-us-west-2-ecs2eks001" reconcileID="f36ddad1-4511-426b-b7ae-c7b12cd004f0"
E0326 04:30:27.360849 1 cluster_controller_status.go:838] "Failed to aggregate ControlPlane, MachinePool, MachineDeployment's RollingOut conditions" err="sourceObjs can't be empty" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="kube-clusters/my-dev-us-west-2-ecs2eks001" namespace="kube-clusters" name="my-dev-us-west-2-ecs2eks001" reconcileID="ca3fbe43-0de4-46f1-8538-e933f32bcf15"

The CAPA logs contain nothing useful because, from the functional, AWS UI, and CAPA perspectives, everything is working perfectly.

Attempted troubleshooting

  • I set the AWS launch template to an older version to try to "reset" the phase in the status output. The template was updated back to the current launch template version, but the MachinePool status was not updated to reflect success.
  • I was able to scale the node pool up successfully. The MachinePool resource had its replicas updated and it worked, but .status still reports failure.

Once failureReason and failureMessage are set, there is no code in the controller that I can find that explicitly clears these fields when subsequent reconciliations succeed. Even when the infrastructure resource is found and everything else is working correctly, these fields remain populated, forcing .status.phase to stay in the Failed state.
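
For reference, the phase handling appears to boil down to something like the following (my own paraphrase of what I read in the controller, not the verbatim CAPI source):

package sketch

import (
	expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
)

// Paraphrased sketch of how the MachinePool phase appears to be derived
// (not the actual CAPI code). Once either failure field is non-nil, the
// phase is pinned to Failed, and nothing resets the fields afterwards.
func reconcilePhase(mp *expv1.MachinePool) {
	if mp.Status.BootstrapReady && mp.Status.InfrastructureReady {
		mp.Status.SetTypedPhase(expv1.MachinePoolPhaseRunning)
	}

	// A non-nil failureReason/failureMessage always wins: the phase is
	// forced to Failed, and no reconciliation path clears these fields.
	if mp.Status.FailureReason != nil || mp.Status.FailureMessage != nil {
		mp.Status.SetTypedPhase(expv1.MachinePoolPhaseFailed)
	}
}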

This is an issue for me because I have written automation that depends on these fields reporting an accurate status. Otherwise, the actual AWS side of things and the CAPI operations seem to be working fine.

What did you expect to happen?

I expected the status to reflect the accurate state of the cluster as AWS sees it

Cluster API version

Latest (even tried new beta 1.10)

Kubernetes version

1.30 - 1.32

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug, needs-priority, and needs-triage labels Mar 27, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jwitko
Author

jwitko commented Mar 27, 2025

This is also related to a still-existing bug whose ticket was closed:
#10541

@chrischdi
Member

Following the old links, could this be this CAPA bug (so not CAPI): kubernetes-sigs/cluster-api-provider-aws#4618 ?

@jwitko
Author

jwitko commented Mar 27, 2025

Following the old links, could this be this CAPA bug (so not CAPI): kubernetes-sigs/cluster-api-provider-aws#4618 ?

Thanks for providing that link!

The resources described in that PR are different, but it could potentially be the same issue.

Digging into the CAPI code, the proximate cause is very clear here: https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/cluster/cluster_controller_phases.go#L65C1-L67C3

As long as those failure messages are not cleared, the status of the MachinePool will never change. I haven't had time yet to trace whether anything actually updates those once a downstream resource becomes healthy.

I'll have to look deeper into your link. I see from the comments there were some thoughts that it might be a solved issue due to https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/5174/files, but I'll need to dig further.

@jwitko
Author

jwitko commented Mar 28, 2025

OK, I had some time to look into it, and whether or not there is an issue at the AWS provider level, I think there is still a bug here in CAPI's MachinePool handling.

I noted the initial proximate cause above. Following that train of thought, I can see that failureMessage and failureReason on the MachinePool status are set at https://github.com/kubernetes-sigs/cluster-api/blob/main/exp/internal/controllers/machinepool_controller_phases.go#L155-L169.

The issue is that they are never cleared or set to empty once the external resource becomes healthy again. I checked all the reconciliation methods (reconcileBootstrap, reconcileInfrastructure, reconcileNodeRefs), and none of them reset these failure fields when they succeed after a previous failure.

I would like to clear these failure fields when a subsequent reconciliation succeeds, so the MachinePool can move back out of the Failed phase.

@chrischdi Before I put effort into a pull request, can you review this approach (rough sketch below) and let me know if it makes sense to you?
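
A minimal sketch of what I mean (hypothetical code, not an existing CAPI helper; the exact placement inside the reconciliation would need discussion):

package sketch

import (
	expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
)

// clearStaleFailure is a hypothetical helper, not existing CAPI code:
// once bootstrap and infrastructure are ready again, drop any previously
// recorded failureReason/failureMessage so the phase reconciliation can
// move the MachinePool back out of Failed.
func clearStaleFailure(mp *expv1.MachinePool) {
	if mp.Status.BootstrapReady && mp.Status.InfrastructureReady {
		mp.Status.FailureReason = nil
		mp.Status.FailureMessage = nil
	}
}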

@chrischdi
Member

chrischdi commented Mar 28, 2025

Ah just recognized that failureMessage and failureReason are set.

It works by design: if they are set, the controllers don't do anything anymore because they are considered a terminal failure (at the AWSManagedMachinePool level).

If these are not terminal failures, then whoever did set them should not do this.

Besides that: in the future we will remove both fields anyway, once the implementation of https://github.com/fabriziopandini/cluster-api/blob/main/docs/proposals/20240916-improve-status-in-CAPI-resources.md is done:

K8s resources do not have a concept similar to "terminal failure" in Cluster API resources, and users approaching the project are struggling with this idea. In some cases also provider's implementers are struggling with it. Accordingly, Cluster API resources are dropping FailureReason and FailureMessage fields. Like in K8s objects, "terminal failures" should be surfaced using conditions, with a well documented type/reason representing a "terminal failure"; it is up to consumers to treat them accordingly. There is no special treatment for these conditions within Cluster API.

@jwitko
Author

jwitko commented Mar 29, 2025

Sorry I'm not really sure what direction you're suggesting. Are you saying a "fix" for this is not welcome because it is intended behavior?

I'm also unclear on the terminal failure part. These resources (MachinePool and AWSManagedMachinePool) are active, working, and modifiable (I can scale them up or down).

What is the intended path for a person in my situation with machinepools in a stuck failed phase?

When you say "whoever did set them": my understanding is that it is the MachinePool controller setting the message. Of course it gets the failure reason from the downstream AWS resource, but the failure does not appear to be terminal from the AWS side.

Is that in and of itself the problem, or am I misunderstanding? Do you feel the source of the issue is still on the AWS provider side?

@enxebre
Member

enxebre commented Mar 31, 2025

As Chris mentioned above, the CAPI contractual assumption is that failures signaled through those fields are terminal. So it is a provider implementer's task to honour that assumption. In this case, if the failure is indeed recoverable, I would expect the AWS implementation not to signal it via those fields.
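
For illustration only, a provider could surface a recoverable problem through a condition instead, roughly like this (a minimal sketch; the condition type and reason names below are made up, not actual CAPA identifiers):

package sketch

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// Hypothetical condition type and reason, purely illustrative.
const (
	LaunchTemplateReadyCondition clusterv1.ConditionType = "LaunchTemplateReady"
	LaunchTemplateInvalidReason                          = "LaunchTemplateInvalid"
)

// markRecoverableError records a recoverable configuration problem as a
// condition, which can flip back to true once the problem is fixed,
// instead of setting the terminal failureReason/failureMessage fields.
func markRecoverableError(obj conditions.Setter, err error) {
	conditions.MarkFalse(obj, LaunchTemplateReadyCondition,
		LaunchTemplateInvalidReason, clusterv1.ConditionSeverityError,
		"%v", err)
}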

@jwitko
Author

jwitko commented Mar 31, 2025

Thanks @enxebre @chrischdi! Appreciate the context and that all sounds reasonable to me. I'll take it up over there and link that over to this issue.

@jwitko jwitko closed this as completed Mar 31, 2025