Bug: Parser envelopes incomplete payload pre-parsing #3269

dreamorosi · 2024-10-29T17:40:58Z

Expected Behavior

When working with built-in envelopes, I should be able to work with the inner body and either provide a schema to parse it, or get it back decoded or as-is.

A few categories:

Requests

This category includes:

API Gateway REST (v1) - ApiGatewayEnvelope
API Gateway HTTP (v2) - ApiGatewayV2Envelope
Lambda Function URL - LambdaFunctionUrlEnvelope
VPC Lattice (v1) - VpcLatticeEnvelope
VPC Lattice (v2) - VpcLatticeV2Envelope

All the events in this category have a payloads that boil down to this:

{
  "body": "some string",
  "isBase64Encoded": false,
  "headers": {
    "content-type": "text/plain"
  }
}

For all of them, the body is always a string type however the content could be a number of things like a plain text string (like in the example), a JSON encoded string, a base64 encoded string, a base64 encoded binary, or a base64 encoded something else.

When using the envelope, I should be able to either: 1/ just get the body as-is so that I can handle the transform in my code, or 2/ specify a Zodiac schema that will be used to parse the body. In the latter case, the Parser envelope should try to pre-transform the body based on heuristic logic before applying the parsing.

Specifically, the envelopes for these events should be able to handle the following cases:

JSON stringified body

{
  "body": "{\"foo\": 1}",
  "isBase64Encoded": false,
  "headers": {
    "content-type": "application/json"
  }
}

This should work whenever the isBase64Encoded field is set to false AND content type header set to "application/json" or omitted, and allow this:

const MySchema = z.object({ foo: z.string() });

const body: z.infer<typeof MySchema> = BuiltInRequestEnvelope.parse(event, MySchema);

base64-encoded body

{
  "body": "aGVsbG8gd29ybGQ=",
  "isBase64Encoded": true,
  "headers": {
    "content-type": "application/base64"
  }
}

This should work whenever the isBase64Encoded field is set to true AND content type header is set to "application/base64", and allow this:

const body: string = BuiltInRequestEnvelope.parse(event);

For all other cases we return the body as is, see diagrams in the sections below.

base64 encoded streams

Kinesis Data Stream - KinesisEnvelope
Kinesis Firehose - KinesisFirehoseEnvelope
Kafka (MSK + Self managed) - KafkaEnvelope

These types of envelopes deal with payloads that always have their body partially or fully base64 encoded. Similar to what happens for the previous category (requests), customers might want to either: 1/ just get the body within the records, or 2/ get the body and parse it according to a Zod schema they specify.

For the first case, we should simply base 64 decode the encoded fields and return the body. For the latter instead, we try a further JSON parsing if the customer passes a complex schema (i.e. z.object({ ... })), before actually piping the body records through it.

Queues and topics

SNS - SnsEnvelope
SQS - SqsEnvelope

These are relatively simple and follow a similar logic to the previous two categories, except we don't have any way of telling whether the body of the messages are encoded in any way. The heuristic proposed below should work though.

No pre-parsing needed

EventBridge - EventBridgeEnvelope

This is self explanatory, but essentially it's the only category of envelopes that kind of works today since we don't need to do any pre-parsing and the inner body is always already an object.

Others

DynamoDB Stream DynamoDBStreamEnvelope
CloudWatch Logs - CloudWatchEnvelope

These envelopes are the odd-ones-out of the bunch. They require special handling because one requires unmarshalling from DynamoDB Attribute types to native JS objects, and the second one includes a compression step before the parsing.

For the purpose of this issue, we could move the logic within their respective envelopes and tackle them separately. Specifically the DynamoDB one requires further attention and considerations due to both missing event types (#3193) and requiring an extra dependency (#3194).

Current Behavior

Currently the envelopes don't support base64 encoded body fields at all, and in all cases assume the body is a JSON string. This is why I am marking this issue as a bug and not a feature request / improvement.

There are also a few other issues that are related and that will likely be resolved by addressing this one:

For now I'll leave them open mainly to make sure we address each and every one.

Code snippet

N/A

Steps to Reproduce

N/A

Possible Solution

The proposed approach for the parsing could be three:

Option 1

flowchart TB
    A[Parse using outer schema - i.e. VpcLatticeV2Schema & extend schema to enfoce body: z.string] -->|Ensure body, headers, and isBase64Encoded fields are present| B{isBase64Encoded === true}
    B -->|YES| C{decodeBase64}
    C -->|successful| D
    C -->|fail| E
    B -->|NO| D{Did customer pass a Zod schema for the body/payload?}
    D -->|NO| E[Return paylaod as is, aka a string]
    D -->|YES| F[JSON.parse]
    F -->|successful| G[Apply customer Zod schema]
    F -->|fail| H[Throw ParseError]

In this case, we throw only when the customer explicitly passes a Zod schema for the inner body. This can happen either as a result of a failed JSON parse, or as a result of a failed Zod parse.

We don't throw if the customer doesn't pass any schema, but we still attempt to base64 decode the body if the rest of the payload indicates that it's base64 encoded. However we don't throw if this fails and just return the original body as a string.

This option would cover most of the use cases with no changes to the API, while still being lax enough to allow customers to apply their own parsing (escape hatch) if our heuristics fail.

Option 2

The alternative approach, would be to be slightly more strict and throw even when the base64 decoding operation fails:

flowchart TB
    A[Parse using outer schema - i.e. VpcLatticeV2Schema & extend schema to enfoce body: z.string] -->|Ensure body, headers, and isBase64Encoded fields are present| B{isBase64Encoded === true}
    B -->|YES| C{decodeBase64}
    C -->|successful| D
    C -->|fail| H
    B -->|NO| D{Did customer pass a Zod schema for the body/payload?}
    D -->|NO| E[Return paylaod as is, aka a string]
    D -->|YES| F[JSON.parse]
    F -->|successful| G[Apply customer Zod schema]
    F -->|fail| H[Throw ParseError]

This would admittedly be less magic but would make it impossible for customers to apply their own transform whenever ours fail.

Option 3

This approach is a further refined version of the first one, in which customers can specify what I'm gonna refer to as primitive schemas for the payload. These are things like z.string(), z.number(), z.boolean(), but importantly not z.object().

This makes the diagram slightly uglier, but what it really means is that we avoid running JSON.parse() whenever the customer tells us (via the inner schema) that the body is not supposed to be a JSON:

flowchart TB
    A[Parse using outer schema - i.e. VpcLatticeV2Schema & extend schema to enfoce body: z.string] -->|Ensure body, headers, and isBase64Encoded fields are present| B{isBase64Encoded === true}
    B -->|YES| C{decodeBase64}
    C -->|successful| D
    C -->|fail| H
    B -->|NO| D{Did customer pass a Zod schema for the body/payload?}
    D -->|NO| E[Return paylaod as is, aka a string]
    D -->|YES| F{Is a primitive schema like z.string, z.boolean, etc}
    F -->|YES| H
    F -->|NO| G[JSON.parse]
    G -->|successful| H[Apply customer Zod schema]
    G -->|fail| I[Throw ParseError]

Note that I am not sure this is even possible, I don't know Zod enough to know if it allows introspection of this kind, but imo this has all the benefits of option 1, with the added benefit of handling even more cases.

Personally I am inclined to go for the third option if it's feasible, otherwise fall back on the first option.

Provided we explain the logic properly in our documentation, I believe it provides the most versatility while still allowing for extensibility. By not providing a schema for the inner body, we still attempt to decode the body whenever the request tells us so, but customers are responsible for their own further decoding and parsing.

Finally, as discussed during our brief, I think we'd be better served with having some helper pure functions and have the orchestration of how they're called in each envelope type/category for maximum extensibility.

Powertools for AWS Lambda (TypeScript) version

latest

AWS Lambda function runtime

20.x

Packaging format used

npm

Execution logs

No response

The text was updated successfully, but these errors were encountered:

am29d · 2024-11-13T13:02:54Z

I am going through the implementation and I made few observations that contradict our assumptions.

Schema is not optional for envelopes

The concept of envelope was created with the idea to extract and parse inner payload in one step. There is no point in using envelope without the schema, the use case for that is to use APIGatewayProxyEventSchema and then transform your payload manually. Reimplementing this in envelopes introduces redundancy and makes it challenging to explain when use one over the other.

Fall back to string returns

One of the key points outlined is to fall back and return the original event if something goes wrong, base64 decoding, JSON parse, or anything else. Here, we also have a separate use case for that, safeParse, that returns error and the original event. Having the same functionality in parse would introduce redundancy, because the user would still need to guess "what did parse with envelope returned?" and probe the answer with custom logic. I think parse should be strict and straightforward, you give me the schema and envelope, and will return an object (any transformation included), but I will throw an error if something goes wrong. For anything else, use safeParse.

This results into a much lighter implementation, mostly adding base64 use case. Using envelopes and custom schema also means that the payload is a JSON object (except EventBridge), whether it was base64 encoded, compressed or stringified. We certainly need to highlight this point in our docs, so it is clear when to use envelopes and when fallback to the custom schema (i.e. extending built-in schema).

dreamorosi added bug Something isn't working confirmed The scope is clear, ready for implementation parser This item relates to the Parser Utility labels Oct 29, 2024

dreamorosi assigned am29d Oct 29, 2024

dreamorosi added this to Powertools for AWS Lambda (TypeScript) Oct 29, 2024

github-project-automation bot moved this to Triage in Powertools for AWS Lambda (TypeScript) Oct 29, 2024

dreamorosi moved this from Triage to Backlog in Powertools for AWS Lambda (TypeScript) Oct 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: Parser envelopes incomplete payload pre-parsing #3269

Bug: Parser envelopes incomplete payload pre-parsing #3269

dreamorosi commented Oct 29, 2024 •

edited

Loading

am29d commented Nov 13, 2024

Bug: Parser envelopes incomplete payload pre-parsing #3269

Bug: Parser envelopes incomplete payload pre-parsing #3269

Comments

dreamorosi commented Oct 29, 2024 • edited Loading

Expected Behavior

Requests

JSON stringified body

base64-encoded body

base64 encoded streams

Queues and topics

No pre-parsing needed

Others

Current Behavior

Code snippet

Steps to Reproduce

Possible Solution

Option 1

Option 2

Option 3

Powertools for AWS Lambda (TypeScript) version

AWS Lambda function runtime

Packaging format used

Execution logs

am29d commented Nov 13, 2024

Schema is not optional for envelopes

Fall back to string returns

dreamorosi commented Oct 29, 2024 •

edited

Loading