custom_resources: Provider Lambda function is missing lambda:GetFunctionConfiguration #26838

Open
erwaxler opened this issue Aug 22, 2023 · 14 comments · May be fixed by #32904
Labels: @aws-cdk/custom-resources, bug, effort/medium, p1


@erwaxler

Describe the bug

The Landing Zone Accelerator solution leverages the custom_resources module to create service-linked roles via CDK custom resources. When this custom resource Lambda function is invoked several times in succession, users intermittently receive the following error:

Received response status [FAILED] from custom resource. Message returned: AccessDeniedException: Resource is not in the state functionActive

We believe this happens when incoming requests queue up and the role attached to the cdk.custom_resources.Provider function lacks the lambda:GetFunctionConfiguration permission.

Expected Behavior

The custom resource provider has the appropriate permissions and retries, so it executes successfully when invoked several times in succession.

Current Behavior

Transient failures:

Received response status [FAILED] from custom resource. Message returned: AccessDeniedException: Resource is not in the state functionActive

Reproduction Steps

Deploy v1.4.3 of the Landing Zone Accelerator on AWS.

For a smaller sample that can be extracted without deploying the entire LZA solution, you may use this custom resource construct that is used by LZA to create the service-linked roles:

https://github.com/awslabs/landing-zone-accelerator-on-aws/blob/1614a01824c5a43f97fadfb8ec0c3627a0f343dd/source/packages/%40aws-accelerator/constructs/lib/aws-iam/service-linked-role.ts#L87

Possible Solution

Add lambda:GetFunctionConfiguration permission to the provider Lambda function's IAM role.
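
A minimal sketch of the statement this would add, written as a plain IAM policy object so that it stands alone. The helper name `waiterPolicyStatement` is hypothetical, the ARN in the usage line is a placeholder, and the default action list is an assumption based on the permission named in this issue.

```typescript
// Shape of a single IAM policy statement (a subset of the IAM JSON policy grammar).
interface PolicyStatementJson {
  Effect: 'Allow' | 'Deny';
  Action: string[];
  Resource: string[];
}

// Hypothetical helper: builds the statement the provider framework role would
// need in order to poll the state of the user-defined handler function.
function waiterPolicyStatement(
  handlerArn: string,
  actions: string[] = ['lambda:GetFunctionConfiguration'],
): PolicyStatementJson {
  if (!handlerArn.startsWith('arn:')) {
    throw new Error(`Expected a Lambda function ARN, got: ${handlerArn}`);
  }
  return { Effect: 'Allow', Action: [...actions], Resource: [handlerArn] };
}

// Placeholder ARN, for illustration only.
const statement = waiterPolicyStatement(
  'arn:aws:lambda:us-east-1:123456789012:function:my-handler',
);
console.log(JSON.stringify(statement, null, 2));
```

In a CDK app, an object of this shape can be converted with `iam.PolicyStatement.fromJson(...)` and attached to the relevant role; which role that is depends on how the provider is constructed.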

Additional Information/Context

No response

CDK CLI Version

2.79

Framework Version

No response

Node.js Version

16.20.1

OS

Amazon Linux

Language

TypeScript

Language Version

No response

Other information

No response

@erwaxler erwaxler added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Aug 22, 2023
@github-actions github-actions bot added the @aws-cdk/custom-resources Related to AWS CDK Custom Resources label Aug 22, 2023
@pahud
Contributor

pahud commented Aug 22, 2023

this is probably related to #24358

The custom resource essentially checks the functionActive state before each invocation:

    /**
     * The status of the Lambda function is checked every second for up to 300 seconds.
     * Exits the loop on 'Active' state and throws an error on 'Inactive' or 'Failed'.
     *
     * And now we wait.
     *
     * Use functionActive instead of functionActiveV2, since functionActiveV2 is only
     * available on SDK 2.1080.0 and up, Lambda installs 2.1055.0 by default,
     * and we use the SDK version that Lambda includes by default.
     */
    await waitUntilFunctionActive({
      client: lambda,
      maxWaitTime: 60,
    }, {
      FunctionName: req.FunctionName,
    });
    return await lambda.invoke(req);
  }
}

But it should work as expected.
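
The waiter semantics described in the code comment above (poll, exit on 'Active', throw on 'Inactive' or 'Failed', give up once the time budget is spent) can be sketched without the SDK. `waitUntilActive`, `WaitOptions`, and the `getState` callback are hypothetical stand-ins for illustration, not the SDK's implementation.

```typescript
type FunctionState = 'Pending' | 'Active' | 'Inactive' | 'Failed';

interface WaitOptions {
  maxWaitTimeMs: number; // total time budget for the whole wait
  delayMs: number;       // pause between two polls
}

// Hypothetical stand-in for the SDK waiter: polls `getState` until the
// function is Active, fails fast on Inactive/Failed, and times out otherwise.
async function waitUntilActive(
  getState: () => Promise<FunctionState>,
  { maxWaitTimeMs, delayMs }: WaitOptions,
): Promise<void> {
  const deadline = Date.now() + maxWaitTimeMs;
  for (;;) {
    const state = await getState();
    if (state === 'Active') return;
    if (state === 'Inactive' || state === 'Failed') {
      throw new Error(`Function entered terminal state: ${state}`);
    }
    // Stop if the next poll would start after the deadline.
    if (Date.now() + delayMs > deadline) {
      throw new Error('Waiter has timed out');
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```

In the real waiter each poll is a Lambda API call, so the role running the waiter needs permission for that call in addition to permission to invoke the function.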

Which region did you deploy?

Instead of running the LZA, are you able to provide a minimal code snippet that reproduces this issue?

@pahud pahud added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. p2 effort/medium Medium work item – several days of effort and removed needs-triage This issue or PR still needs to be triaged. labels Aug 22, 2023
@erwaxler
Author

@pahud Agreed that this is likely related to #24358. The error has been seen in at least us-east-1 and ap-southeast-2, but we've heard reports from 5+ customers, so I believe it is region-agnostic. I'll work on a smaller snippet to reproduce the behavior more predictably.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Aug 22, 2023
@pahud pahud added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Aug 25, 2023
@pahud
Contributor

pahud commented Aug 25, 2023

@erwaxler
Author

@pahud Still working on a smaller reproducible snippet. LZA creates 6 individual custom resources to create the service-linked roles.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Aug 25, 2023
@aaronbrighton

I've added some more details below, please advise whether it's still necessary to provide a reproducible code-snippet.

This issue is exacerbated by logic that may run these custom resources on every pipeline run, for instance: awslabs/landing-zone-accelerator-on-aws#237

The following line and the surrounding retry logic require the lambda:GetFunctionConfiguration permission mentioned above in order to work: https://github.com/aws/aws-cdk/blob/c695b6004219426cf0e67cbb92d916a394ddd594/packages/aws-cdk-lib/custom-resources/lib/provider-framework/runtime/outbound.ts#L66C15-L66C15

However, only invokeFunction is granted on the onEventHandler Lambda function:

I believe the above provider.ts likely needs a fn.addToRolePolicy added after the above line granting "lambda:GetFunctionConfiguration" on this.onEventHandler.functionArn.

@yubingjiaocn
Contributor

Encountered the same problem with EKS Blueprint, which uses the kubectlProvider created in https://github.com/aws/aws-cdk/blob/v2.96.0/packages/aws-cdk-lib/aws-eks/lib/kubectl-provider.ts#L144.

The error message in CloudFormation said:

Custom::AWSCDK-EKS-KubernetesResource EKSStackAwsAuthmanifest75D20040 CREATE_FAILED Received response status [FAILED] from custom resource. Message returned: TimeoutError: {"state":"TIMEOUT","reason":"Waiter has timed out"} at checkExceptions ...

Also found the following error message in CloudTrail:

User: arn:aws:sts::[redacted]:assumed-role/EKSStack-awscdka-ProviderframeworkonEventS-duAmMCwNZO6z/EKSStack-awscdka-ProviderframeworkonEvent-4NeK4zj7Z6ab is not authorized to perform: lambda:GetFunctionConfiguration on resource: arn:aws:lambda:[redacted]:[redacted]:function:EKSStack-awscdkawseksKube-Handler886CB40B-3fOpwrZnomNI because no identity-based policy allows the lambda:GetFunctionConfiguration action

After adding the lambda:GetFunctionConfiguration permission to the IAM role of the function, the template can be deployed.

@ejt4x

ejt4x commented Feb 1, 2024

The waiter call changed in c3a4b7b from waitUntilFunctionActive to waitUntilFunctionActiveV2. This changed the required IAM permission from lambda:GetFunctionConfiguration to lambda:GetFunction.

    /**
     * The status of the Lambda function is checked every second for up to 300 seconds.
     * Exits the loop on 'Active' state and throws an error on 'Inactive' or 'Failed'.
     *
     * And now we wait.
     */
    await waitUntilFunctionActiveV2({
      client: lambda,
      maxWaitTime: 300,
    }, {
      FunctionName: req.FunctionName,
    });
    return await lambda.invoke(req);
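
To make the consequence of that change concrete, here is a small sketch that maps each waiter to the IAM action it needs and reports what a role is missing. The table reflects the mapping stated in this comment; `missingWaiterActions` is a hypothetical helper, and it compares action names literally (no wildcard or condition handling), which is a simplification.

```typescript
// Waiter name -> Lambda API it polls with -> IAM action the caller needs.
const WAITER_REQUIREMENTS = {
  waitUntilFunctionActive: {
    api: 'GetFunctionConfiguration',
    action: 'lambda:GetFunctionConfiguration',
  },
  waitUntilFunctionActiveV2: {
    api: 'GetFunction',
    action: 'lambda:GetFunction',
  },
} as const;

type WaiterName = keyof typeof WAITER_REQUIREMENTS;

// Hypothetical helper: returns the required actions that are absent from the
// list of actions a role allows. Exact string match only.
function missingWaiterActions(waiter: WaiterName, allowedActions: string[]): string[] {
  const required: string[] = ['lambda:InvokeFunction', WAITER_REQUIREMENTS[waiter].action];
  return required.filter((action) => !allowedActions.includes(action));
}

// A role that can only invoke the handler is missing the V2 waiter's action.
console.log(missingWaiterActions('waitUntilFunctionActiveV2', ['lambda:InvokeFunction']));
// -> [ 'lambda:GetFunction' ]
```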

@markhankins

@ejt4x It looks like both PRs failed to merge because of missing tests?

#27204
#27524

I couldn't see a passing PR for this?

@ejt4x

ejt4x commented Feb 8, 2024

@markhankins Yes, it looks like the original PR was abandoned. I am not prepared to create one anytime soon.

I was just commenting to inform any would-be submitters or reviewers that the API call and therefore required fix have changed somewhat since the original title and description of this issue were written.

@blinkdaffer

Is there any solution for this?

@pahud
Contributor

pahud commented Apr 24, 2024

probably related to #24358

@blinkdaffer @ejt4x @markhankins

Can you tell me in which region(s) you are seeing this error?

Are you able to provide a minimal CDK app that we can deploy in that region to reproduce this error?

@pahud pahud added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Apr 24, 2024
@ejt4x

ejt4x commented Apr 24, 2024

@pahud Here's my simplest way of producing this error.

const thisLambdaDoesNotExist = Function.fromFunctionName(this, 'NonExistentLambda', 'fakelambda');

const provider = new Provider(this, 'Provider', {
  onEventHandler: thisLambdaDoesNotExist,
});

new CustomResource(this, 'Resource1', { serviceToken: provider.serviceToken });

The actual exception (throttling, function does not exist, whatever) is swallowed by the try/catch block, leaving the following error in the CFN event log:

Resource1 | CREATE_FAILED | Received response status [FAILED] from custom resource. Message returned: TimeoutError: {"state":"TIMEOUT","reason":"Waiter has timed out"}
    at checkExceptions (/var/runtime/node_modules/@aws-sdk/node_modules/@smithy/util-waiter/dist-cjs/index.js:59:26)
    at waitUntilFunctionActiveV2 (/var/runtime/node_modules/@aws-sdk/client-lambda/dist-cjs/index.js:5933:49)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async defaultInvokeFunction (/var/task/outbound.js:1:875)
    at async invokeUserFunction (/var/task/framework.js:1:2192)
    at async onEvent (/var/task/framework.js:1:369)
    at async Runtime.handler (/var/task/cfn-response.js:1:1573)

If we go digging in CloudTrail, we find this IAM error:

   "errorMessage": "User: arn:aws:sts::[redacted]-ProviderframeworkonEvent-jKCdLDqBfAP0 is not authorized to perform: lambda:GetFunction on resource: arn:aws:lambda:us-west-2:redacted:function:fakelambda because no identity-based policy allows the lambda:GetFunction action",

All these errors mask the actual issue: the user Lambda invocation failed due to throttling, non-existence, or some other reason. The missing IAM permission prevents the user from discovering this.
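
Since the CloudFormation event only shows the waiter timeout, the denied call has to be read out of the CloudTrail message. A minimal sketch of a parser for messages shaped like the one above follows; `parseAccessDenied` is a hypothetical helper, the regular expression is an assumption based only on that one message, and the account ID and role name in the usage line are placeholders.

```typescript
interface DeniedCall {
  principal: string; // the assumed-role session that made the call
  action: string;    // the IAM action that was denied
  resource: string;  // the ARN the call targeted
}

// Hypothetical helper: extracts the denied action and resource from an IAM
// "is not authorized to perform" error message. Returns undefined when the
// message does not have that shape.
function parseAccessDenied(message: string): DeniedCall | undefined {
  const pattern = /^User: (\S+) is not authorized to perform: (\S+) on resource: (\S+)/;
  const match = pattern.exec(message.trim());
  if (!match) return undefined;
  const [, principal = '', action = '', resource = ''] = match;
  return { principal, action, resource };
}

// Placeholder account ID and role name, for illustration only.
const example = parseAccessDenied(
  'User: arn:aws:sts::123456789012:assumed-role/ExampleRole/ExampleSession ' +
    'is not authorized to perform: lambda:GetFunction on resource: ' +
    'arn:aws:lambda:us-west-2:123456789012:function:fakelambda ' +
    'because no identity-based policy allows the lambda:GetFunction action',
);
console.log(example?.action, example?.resource);
```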

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Apr 25, 2024
@pahud
Contributor

pahud commented Jan 6, 2025

Yes, when deploying

export class ProviderStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Import a non-existent Lambda function
    const thisLambdaDoesNotExist = lambda.Function.fromFunctionName(this, 'NonExistentLambda', 'fakelambda');

    // Create a custom resource provider using the non-existent Lambda
    const provider = new cr.Provider(this, 'Provider', {
      onEventHandler: thisLambdaDoesNotExist,
    });

    // Create a custom resource using the provider
    new CustomResource(this, 'Resource1', { 
      serviceToken: provider.serviceToken 
    });
  }
}

In 2.173.0 we get a "Waiter has timed out" error.

Failed resources:
DummyStack2 | 11:39:53 AM | CREATE_FAILED        | AWS::CloudFormation::CustomResource | Resource1/Default (Resource1) Received response status [FAILED] from custom resource. Message returned: TimeoutError: {"state":"TIMEOUT","reason":"Waiter has timed out"}
    at checkExceptions (/var/runtime/node_modules/@aws-sdk/node_modules/@smithy/util-waiter/dist-cjs/index.js:59:26)
    at waitUntilFunctionActiveV2 (/var/runtime/node_modules/@aws-sdk/client-lambda/dist-cjs/index.js:5820:49)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async defaultInvokeFunction (/var/task/outbound.js:1:1024)
    at async invokeUserFunction (/var/task/framework.js:1:2794)
    at async onEvent (/var/task/framework.js:1:369)
    at async Runtime.handler (/var/task/cfn-response.js:1:1837) (RequestId: 1645d85b-b74b-4638-9ea3-c0e68419e42a)
❌  DummyStack2 failed: Error: The stack named DummyStack2 failed creation, it may need to be manually deleted from the AWS console: ROLLBACK_COMPLETE: Received response status [FAILED] from custom resource. Message returned: TimeoutError: {"state":"TIMEOUT","reason":"Waiter has timed out"}
    at checkExceptions (/var/runtime/node_modules/@aws-sdk/node_modules/@smithy/util-waiter/dist-cjs/index.js:59:26)
    at waitUntilFunctionActiveV2 (/var/runtime/node_modules/@aws-sdk/client-lambda/dist-cjs/index.js:5820:49)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async defaultInvokeFunction (/var/task/outbound.js:1:1024)
    at async invokeUserFunction (/var/task/framework.js:1:2794)
    at async onEvent (/var/task/framework.js:1:369)
    at async Runtime.handler (/var/task/cfn-response.js:1:1837) (RequestId: 1645d85b-b74b-4638-9ea3-c0e68419e42a)

Looking at the provider framework log:


2025-01-06T16:39:52.381Z	4cc87c28-3877-4423-a03f-a3ae6eef055e	INFO	[provider-framework] CREATE failed, responding with a marker physical resource id so that the subsequent DELETE will be ignored
--
2025-01-06T16:39:52.388Z	4cc87c28-3877-4423-a03f-a3ae6eef055e	INFO	[provider-framework] submit response to cloudformation https://cloudformation-custom-resource-response-useast1.s3.amazonaws.com//arn%3Aaws%3Acloudformation%3Aus-east-1%3A903779448426%3Astack/DummyStack2/1a3311b0-cc4c-11ef-8680-0e501d1dc595%7CResource1%7C1645d85b-b74b-4638-9ea3-c0e68419e42a?*** {     "Status": "FAILED",     "Reason": "TimeoutError: {\"state\":\"TIMEOUT\",\"reason\":\"Waiter has timed out\"}\n    at checkExceptions (/var/runtime/node_modules/@aws-sdk/node_modules/@smithy/util-waiter/dist-cjs/index.js:59:26)\n    at waitUntilFunctionActiveV2 (/var/runtime/node_modules/@aws-sdk/client-lambda/dist-cjs/index.js:5820:49)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async defaultInvokeFunction (/var/task/outbound.js:1:1024)\n    at async invokeUserFunction (/var/task/framework.js:1:2794)\n    at async onEvent (/var/task/framework.js:1:369)\n    at async Runtime.handler (/var/task/cfn-response.js:1:1837)",     "StackId": "arn:aws:cloudformation:us-east-1:903779448426:stack/DummyStack2/1a3311b0-cc4c-11ef-8680-0e501d1dc595",     "RequestId": "1645d85b-b74b-4638-9ea3-c0e68419e42a",     "PhysicalResourceId": "AWSCDK::CustomResourceProviderFramework::CREATE_FAILED",     "LogicalResourceId": "Resource1" }

affected code:

My take:

  1. The custom resource does not check whether the provided Lambda ARN exists; it simply waits for the function state before invoking, which seems to always time out if the function does not actually exist.
  2. We should probably improve the way we invoke a function with a provided ARN.
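
One possible direction for point 2, sketched under stated assumptions: treat errors that cannot resolve by waiting as fatal and surface them immediately, instead of polling until the waiter times out. `invokeWhenActive`, `HandlerDeps`, and the set of fatal error names are all hypothetical; this is not the provider framework's code, and the callbacks stand in for the real Lambda API calls.

```typescript
// Error names treated as non-retryable in this sketch (an assumption).
const FATAL_ERROR_NAMES = new Set(['ResourceNotFoundException', 'AccessDeniedException']);

interface HandlerDeps {
  getState: () => Promise<string>; // stand-in for a call that reads the function state
  invoke: () => Promise<unknown>;  // stand-in for the actual invocation
}

async function invokeWhenActive(
  deps: HandlerDeps,
  maxAttempts: number,
  delayMs: number,
): Promise<unknown> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    let state = 'Unknown';
    try {
      state = await deps.getState();
    } catch (err) {
      const name = err instanceof Error ? err.name : 'UnknownError';
      const detail = err instanceof Error ? err.message : String(err);
      // Permanent failures are reported right away with their real cause.
      if (FATAL_ERROR_NAMES.has(name)) {
        throw new Error(`Cannot check handler state (${name}): ${detail}`);
      }
      // Anything else is treated as transient and polled again.
    }
    if (state === 'Active') return deps.invoke();
    if (state === 'Inactive' || state === 'Failed') {
      throw new Error(`Handler is in state ${state}`);
    }
    if (attempt < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error('Waiter has timed out');
}
```

With this shape, a non-existent function or a missing permission would show up in the CloudFormation event with its own error name after a single attempt, while genuinely pending functions are still polled up to `maxAttempts` times.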

Making this a p1.

@pahud pahud added p1 and removed p2 labels Jan 6, 2025
@pahud
Contributor

pahud commented Jan 6, 2025

internal: V1631223300
