Dependency between eks-pod-identity-agent and CNI #3260

Open
vietwow opened this issue Dec 28, 2024 · 18 comments


vietwow commented Dec 28, 2024

Hi,

I have just run into a situation where we want to create an EKS cluster with the managed eks-pod-identity-agent add-on and the VPC CNI add-on configured to use Pod Identity. My code looks like this:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.31.6"

  ...
  cluster_addons = {
    eks-pod-identity-agent = {
      most_recent = true
    }

    vpc-cni = {
      before_compute = true
      most_recent    = true
      pod_identity_role_arn = {
        role_arn        = local.aws_vpc_cni_pod_identity_role_arn
        service_account = local.aws_vpc_cni_serviceaccount_name
      }
    }

    ...[other add-ons]...
  }
  ...
}

The problem is that Terraform (non-deterministically) creates the vpc-cni add-on before eks-pod-identity-agent, which causes the VPC CNI add-on to crash.

I'm not sure how we can solve this dependency problem. Hope to get help on this. Thank you so much.

Best Regards,
VietNC

@bryantbiggs (Member)

This is not something that we can explicitly control within the module; we cannot control the specific ordering across an arbitrary set of add-ons.


vietwow commented Jan 6, 2025

@bryantbiggs However, as of now this feels like a circular dependency when using Pod Identity with the CNI in this module, because:

  • Terraform waits for the node groups to become healthy, which only happens when their status is "Ready". This means that the CNI must be set up properly first.
  • The nodes never become ready when this issue occurs.

@bryantbiggs (Member)

Why are you waiting on compute if you aren't configuring anything on the CNI?


vietwow commented Jan 7, 2025

It's not me but Terraform. Let me describe the issue again. I want to deploy an EKS cluster with the managed CNI using Pod Identity (as in the Terraform I posted above). The problem is:

  • The Terraform CLI creates the managed CNI add-on before the eks-pod-identity-agent add-on (note: the ordering is random), so the CNI crashes (because it requires eks-pod-identity-agent first).
  • The terraform-aws-eks module creates node groups in parallel with the CNI, and the Terraform CLI (not me) waits for the node groups to become healthy (which only happens when their status is "Ready"). Because the CNI crashed, the node groups never become healthy.

Not sure if there is something wrong with my understanding. Thanks

github-actions bot commented Feb 7, 2025

This issue has been automatically marked as stale because it has been open for 30 days
with no activity. Remove the stale label or comment, or this issue will be closed in 10 days

@github-actions github-actions bot added the stale label Feb 7, 2025

vietwow commented Feb 7, 2025

I have just destroyed the cluster and re-deployed a new cluster using the newest module version, and I can confirm that the issue still exists.

@github-actions github-actions bot removed the stale label Feb 8, 2025
adamwshero commented Feb 10, 2025

Also seeing this using release v20.33.1. The response from the API in this scenario is:

  │ Error: waiting for EKS Node Group (my-cluster-name:my-node-group-name-2025021022451790200000001c) create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: i-0e1221be7b827f1f6: NodeCreationFailure: Unhealthy nodes in the kubernetes cluster
  │
  │   with module.eks_managed_node_group["compute-heavy"].aws_eks_node_group.this[0],
  │   on modules/eks-managed-node-group/main.tf line 394, in resource "aws_eks_node_group" "this":
  │  394: resource "aws_eks_node_group" "this" {
  │

  exit status 1

After inspecting the nodes in the cluster you will see something like:

container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized


vietwow commented Feb 11, 2025

cc @bryantbiggs, so I think this issue should be marked as "bug" or "needs improvement" rather than "question"?

@bryantbiggs (Member)

No. It's not a supported use case


vietwow commented Feb 11, 2025

Oh, that's what I mean by "needs improvement": you did offer the "pod_identity_role_arn" mechanism, but it doesn't work correctly with a common use case (many people use the AWS VPC CNI), so it should be "not yet" rather than "not".

@bryantbiggs (Member)

It's not supported as in: the EKS APIs work as intended, and the module interacts with the EKS APIs as intended, but we don't have control over the ordering of add-on deployments. This is an issue in Kubernetes in general, and it is why projects like Argo created concepts like sync waves.

We match the EKS API by supporting the EKS addon resource, and users can specify the add-ons they wish to deploy (the module accepts a map of add-ons that should be created), but there is no dependency ordering between these add-ons because there isn't a point where that can be controlled.

So it is "not" supported; there aren't any changes we can make to support this add-on dependency ordering in a reasonable way. The alternative is to deploy the node IAM role with the permissions required by the VPC CNI, and then remove those on a subsequent apply once the VPC CNI and EKS Pod Identity Agent are up and running on the cluster.
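The temporary node-role workaround described above can be sketched roughly as follows (a sketch only; the `iam_role_additional_policies` input and the node group name are assumptions to verify against your module version):

```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.31.6"

  # ...

  eks_managed_node_groups = {
    default = {
      # Temporary: lets aws-node authenticate via the node role so the CNI
      # can start before Pod Identity is available. Remove this policy on a
      # subsequent apply once the VPC CNI and eks-pod-identity-agent
      # add-ons are running.
      iam_role_additional_policies = {
        AmazonEKS_CNI_Policy = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
      }
    }
  }
}
```

On the second apply, drop the `iam_role_additional_policies` entry so the CNI authenticates via the Pod Identity association instead.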

@sawicki-maciej
Maybe an alternative approach would be to use a separate module to manage add-ons (e.g. https://aws-ia.github.io/terraform-aws-eks-blueprints-addons/main/amazon-eks-addons/)?

@bryantbiggs (Member)

I manage that module as well and can tell you that it doesn't solve this problem either. Dependency ordering of applications within Kubernetes is a broad problem.

@sawicki-maciej
What will happen if we have two instances of the amazon-eks-addons module (the first with the EKS Pod Identity Agent and the second with the VPC CNI, where the second instance depends on the first)? If that works, it may be a cleaner solution (in my opinion) than the suggested temporary IAM role approach.

Also, I would suggest escalating this issue to AWS (I don't volunteer to do it myself, since I don't have the right points of contact at AWS). I would consider this an EKS bug, especially since the add-on crashes without any reasonable explanation. Even if we try to install it via the AWS web console, we get only a generic "vpc-cni could not be added to cluster X. Try again later." error, which is not helpful, and trying later doesn't solve the issue.
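One way to express that ordering outside the module is with plain `aws_eks_addon` resources and an explicit `depends_on` (a sketch; the `pod_identity_association` block requires a recent AWS provider version, and the `local` values are assumed from the configuration posted above):

```hcl
resource "aws_eks_addon" "pod_identity_agent" {
  cluster_name = module.eks.cluster_name
  addon_name   = "eks-pod-identity-agent"
}

resource "aws_eks_addon" "vpc_cni" {
  cluster_name = module.eks.cluster_name
  addon_name   = "vpc-cni"

  pod_identity_association {
    role_arn        = local.aws_vpc_cni_pod_identity_role_arn
    service_account = local.aws_vpc_cni_serviceaccount_name
  }

  # Explicit ordering: the agent add-on is created before the CNI.
  depends_on = [aws_eks_addon.pod_identity_agent]
}
```

The same `depends_on` pattern would apply between two instances of the add-ons module.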

@Jasonrve

I'm running into this issue as well and struggling to get around it.

@tropnikovvl

Have you tried adding the parameter before_compute = true to eks-pod-identity-agent and increasing the wait timeouts for deploying add-ons?
In theory, this should help.
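The suggestion above would look roughly like this in the `cluster_addons` map (a sketch; `before_compute` and per-addon `timeouts` are assumed to be supported by the module version in use):

```hcl
cluster_addons = {
  eks-pod-identity-agent = {
    before_compute = true   # create before node groups, alongside vpc-cni
    most_recent    = true
  }

  vpc-cni = {
    before_compute = true
    most_recent    = true
    timeouts = {
      create = "25m"   # allow more time for the add-on to become active
    }
  }
}
```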


vietwow commented Mar 18, 2025

In my case, I have enabled before_compute = true but still hit the problem.

@Jasonrve

Thanks @tropnikovvl

Even with the above approach I get the same issue.

Would love a workaround for this that doesn't require me to run terraform multiple times.

6 participants