Karpenter and kube-scheduler Hangs for ~5m when Deleting a StatefulSet PVC #1029
Comments
Here's an example of the output:
➜ karpenter-provider-aws git:(main) k get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default local-volume-node-cleanup-controller-78767bc5c4-bmvwd 1/1 Running 0 5h7m
default release-name-local-static-provisioner-7qxq2 1/1 Running 0 2m13s
default release-name-local-static-provisioner-m2h5r 1/1 Running 0 2m56s
default web-0 0/1 Pending 0 100s
default web-1 1/1 Running 0 2m50s
➜ karpenter-provider-aws git:(main) k describe pod -n default web-0
Name: web-0
Namespace: default
Priority: 0
Service Account: default
Node: <none>
Labels: app=nginx
apps.kubernetes.io/pod-index=0
controller-revision-hash=web-5f69958999
statefulset.kubernetes.io/pod-name=web-0
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: StatefulSet/web
Containers:
nginx:
Image: registry.k8s.io/nginx-slim:0.8
Port: 80/TCP
Host Port: 0/TCP
Environment: <none>
Mounts:
/usr/share/nginx/html from www (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xt9rk (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
www:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: www-web-0
ReadOnly: false
kube-api-access-xt9rk:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 107s default-scheduler 0/16 nodes are available: 1 node(s) had untolerated taint {karpenter.sh/disruption: disrupting}. preemption: 0/16 nodes are available: 1 Preemption is not helpful for scheduling, 15 No preemption victims found for incoming pod..
Normal Nominated 106s karpenter Pod should schedule on: nodeclaim/default-897jf, node/ip-192-168-103-34.us-west-2.compute.internal
➜ karpenter-provider-aws git:(main) k get pvc -A
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
default www-web-0 Pending k8s-disks 117s
default www-web-1 Bound local-pv-ef11af16 279Gi RWO k8s-disks 3m9s
➜ karpenter-provider-aws git:(main) k get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
local-pv-6623f7d1 279Gi RWO Delete Available k8s-disks 3m57s
local-pv-71cc7caa 279Gi RWO Delete Available k8s-disks 3m57s
local-pv-97fa2c11 279Gi RWO Delete Available k8s-disks 4m40s
local-pv-ef11af16 279Gi RWO Delete Bound default/www-web-1 k8s-disks 4m40s
➜ karpenter-provider-aws git:(main) k get nodes -l karpenter.sh/nodepool=default
NAME STATUS ROLES AGE VERSION
ip-192-168-103-34.us-west-2.compute.internal Ready <none> 4m45s v1.28.5-eks-5e0fdde
ip-192-168-118-53.us-west-2.compute.internal Ready <none> 5m27s v1.28.5-eks-5e0fdde |
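For context: the k8s-disks StorageClass in the output above comes from the local static provisioner. A minimal sketch of what such a class typically looks like is below; the name and reclaim policy are taken from the output, while the provisioner and binding mode are the provisioner's usual configuration and are assumptions here.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: k8s-disks                            # matches the STORAGECLASS column in the output above
provisioner: kubernetes.io/no-provisioner    # static provisioning: local PVs are created out-of-band
volumeBindingMode: WaitForFirstConsumer      # bind a PVC only once a consuming pod is scheduled
reclaimPolicy: Delete                        # matches the RECLAIM POLICY column above
```

With WaitForFirstConsumer, the PVC www-web-0 cannot bind to one of the Available local PVs until the scheduler has placed web-0 on a node, which is why the PVC and the pod sit in Pending together.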
Looking deeper at the |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules: after 90d of inactivity the issue is marked stale, after a further 30d it is marked rotten, and after another 30d it is closed. You can mark this issue as fresh with /remove-lifecycle stale. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues, and this bot has marked the issue stale again after a further period of inactivity. You can mark it as fresh with /remove-lifecycle stale. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues, and this bot has now marked the issue rotten. You can mark it as fresh with /remove-lifecycle rotten. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs, so this bot is closing the issue. You can reopen it with /reopen. Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned |
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". |
Why the f is it closed??
It's a full stop issue! |
/reopen |
@jonathan-innis: Reopened this issue. |
@pkit Can you describe the specific case that you are running into? Like the step-by-step of what is happening? Karpenter is trying to roll the nodes with stateful sets and it's not able to reschedule the pods? |
@jonathan-innis yes. It tries to roll out a new SS (for example, on an image upgrade). It's kinda OK if the nodes in your SS are "replicated", but if the nodes are shards it leads to essentially 5 minutes of unavailability. |
/triage needs-information |
This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity. |
We notice this issue when we already have a few uneven nodes bootstrapped across three zones; apply the StatefulSet below to replicate it. I understand that all 10 StatefulSets share the same matching label for topologySpreadConstraints, but that should not affect the behavior.
NodePool for your NodeClass
Workload
Result
Also note that the StatefulSet spec uses the policy to retain the PVs. Karpenter has to take this into consideration when there is a pod rollout restart, to decide which zones it bootstraps the new nodes in: the nodes should be in the same zone where the PVs reside, otherwise the pod remains Pending due to an error (a minimal sketch of such a spec follows this comment) -
|
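To make the scenario above concrete, here is a minimal sketch of the kind of StatefulSet being described. The names, shared label, storage class, and sizes are illustrative assumptions rather than the commenter's actual manifest; the relevant parts are the topologySpreadConstraints label shared across the StatefulSets and the Retain PVC retention policy.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: shard-0                      # one of several StatefulSets sharing the same label (assumed name)
spec:
  replicas: 3
  serviceName: shard-0               # assumed headless service name
  selector:
    matchLabels:
      app: sharded-workload          # assumed label shared by all the StatefulSets
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain              # the "policy to retain the PVs" mentioned above
    whenScaled: Retain
  template:
    metadata:
      labels:
        app: sharded-workload
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: sharded-workload  # same matching label across all the StatefulSets
      containers:
        - name: app
          image: registry.k8s.io/nginx-slim:0.8
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3        # assumed zonal, EBS-backed class
        resources:
          requests:
            storage: 10Gi
```

Because the PVCs are zonal and retained, a replacement pod can only run in the zone of its existing PV, which is the constraint Karpenter needs to honor when it provisions replacement nodes.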
Description
Observed Behavior:
I'm testing the behavior of #1018 and validating how Karpenter handles node rolling when the Node Cleanup Controller is responsible for deleting PVCs and PVs from a node that is using NVMe storage. The behavior that I'm seeing is that Karpenter is rolling the pods onto the new nodes, but when the new nodes come up, it takes the node cleanup controller a little while to fully delete the PVCs/PVs from the old node (since it does the cleanup on a poll).
When it finally does delete the PVCs, Karpenter begins reporting the following error:
"ignoring pod, getting persistent volume claim \"www-web-1\", PersistentVolumeClaim \"www-web-1\" not found"
It does so correctly and should do so until the pod gets a replacement PVC from the StatefulSet controller. However, once the PVC is re-created by the StatefulSet controller, Karpenter still doesn't see that the pod is schedulable and waits around for ~5m before it begins launching a node for the pod. There is no log on Karpenter's side that indicates why it is ignoring the pod or why it is not launching a node.
Expected Behavior:
Karpenter should react and schedule the pod as soon as the pod becomes schedulable.
Reproduction Steps (Please include YAML):
nvme-testing.tar.gz
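The attached tarball is not reproduced here. For orientation only, the pod, PVC, image, and StorageClass names in the output above (web-0, www-web-0, nginx-slim:0.8, k8s-disks) are consistent with a StatefulSet along these lines; the headless service name and the requested storage size are assumptions:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
  namespace: default
spec:
  serviceName: nginx                 # assumed headless service
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx                   # matches the pod labels in the describe output above
    spec:
      containers:
        - name: nginx
          image: registry.k8s.io/nginx-slim:0.8
          ports:
            - containerPort: 80
              name: web
          volumeMounts:
            - name: www
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: www                    # produces PVCs named www-web-0, www-web-1, ...
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: k8s-disks  # the local (NVMe-backed) class from the output above
        resources:
          requests:
            storage: 100Gi           # assumed size; the local PVs above are 279Gi
```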
Versions:
Kubernetes Version (kubectl version): 1.29