Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Parser envelopes incomplete payload pre-parsing #3269

Open
dreamorosi opened this issue Oct 29, 2024 · 1 comment
Open

Bug: Parser envelopes incomplete payload pre-parsing #3269

dreamorosi opened this issue Oct 29, 2024 · 1 comment
Assignees
Labels
bug Something isn't working confirmed The scope is clear, ready for implementation parser This item relates to the Parser Utility

Comments

@dreamorosi
Copy link
Contributor

dreamorosi commented Oct 29, 2024

Expected Behavior

When working with built-in envelopes, I should be able to work with the inner body and either provide a schema to parse it, or get it back decoded or as-is.

A few categories:

Requests

This category includes:

  • API Gateway REST (v1) - ApiGatewayEnvelope
  • API Gateway HTTP (v2) - ApiGatewayV2Envelope
  • Lambda Function URL - LambdaFunctionUrlEnvelope
  • VPC Lattice (v1) - VpcLatticeEnvelope
  • VPC Lattice (v2) - VpcLatticeV2Envelope

All the events in this category have a payloads that boil down to this:

{
  "body": "some string",
  "isBase64Encoded": false,
  "headers": {
    "content-type": "text/plain"
  }
}

For all of them, the body is always a string type however the content could be a number of things like a plain text string (like in the example), a JSON encoded string, a base64 encoded string, a base64 encoded binary, or a base64 encoded something else.

When using the envelope, I should be able to either: 1/ just get the body as-is so that I can handle the transform in my code, or 2/ specify a Zodiac schema that will be used to parse the body. In the latter case, the Parser envelope should try to pre-transform the body based on heuristic logic before applying the parsing.

Specifically, the envelopes for these events should be able to handle the following cases:

JSON stringified body

{
  "body": "{\"foo\": 1}",
  "isBase64Encoded": false,
  "headers": {
    "content-type": "application/json"
  }
}

This should work whenever the isBase64Encoded field is set to false AND content type header set to "application/json" or omitted, and allow this:

const MySchema = z.object({ foo: z.string() });

const body: z.infer<typeof MySchema> = BuiltInRequestEnvelope.parse(event, MySchema);

base64-encoded body

{
  "body": "aGVsbG8gd29ybGQ=",
  "isBase64Encoded": true,
  "headers": {
    "content-type": "application/base64"
  }
}

This should work whenever the isBase64Encoded field is set to true AND content type header is set to "application/base64", and allow this:

const body: string = BuiltInRequestEnvelope.parse(event);

For all other cases we return the body as is, see diagrams in the sections below.

base64 encoded streams

  • Kinesis Data Stream - KinesisEnvelope
  • Kinesis Firehose - KinesisFirehoseEnvelope
  • Kafka (MSK + Self managed) - KafkaEnvelope

These types of envelopes deal with payloads that always have their body partially or fully base64 encoded. Similar to what happens for the previous category (requests), customers might want to either: 1/ just get the body within the records, or 2/ get the body and parse it according to a Zod schema they specify.

For the first case, we should simply base 64 decode the encoded fields and return the body. For the latter instead, we try a further JSON parsing if the customer passes a complex schema (i.e. z.object({ ... })), before actually piping the body records through it.

Queues and topics

  • SNS - SnsEnvelope
  • SQS - SqsEnvelope

These are relatively simple and follow a similar logic to the previous two categories, except we don't have any way of telling whether the body of the messages are encoded in any way. The heuristic proposed below should work though.

No pre-parsing needed

  • EventBridge - EventBridgeEnvelope

This is self explanatory, but essentially it's the only category of envelopes that kind of works today since we don't need to do any pre-parsing and the inner body is always already an object.

Others

  • DynamoDB Stream DynamoDBStreamEnvelope
  • CloudWatch Logs - CloudWatchEnvelope

These envelopes are the odd-ones-out of the bunch. They require special handling because one requires unmarshalling from DynamoDB Attribute types to native JS objects, and the second one includes a compression step before the parsing.

For the purpose of this issue, we could move the logic within their respective envelopes and tackle them separately. Specifically the DynamoDB one requires further attention and considerations due to both missing event types (#3193) and requiring an extra dependency (#3194).

Current Behavior

Currently the envelopes don't support base64 encoded body fields at all, and in all cases assume the body is a JSON string. This is why I am marking this issue as a bug and not a feature request / improvement.

There are also a few other issues that are related and that will likely be resolved by addressing this one:

For now I'll leave them open mainly to make sure we address each and every one.

Code snippet

N/A

Steps to Reproduce

N/A

Possible Solution

The proposed approach for the parsing could be three:

Option 1

flowchart TB
    A[Parse using outer schema - i.e. VpcLatticeV2Schema & extend schema to enfoce body: z.string] -->|Ensure body, headers, and isBase64Encoded fields are present| B{isBase64Encoded === true}
    B -->|YES| C{decodeBase64}
    C -->|successful| D
    C -->|fail| E
    B -->|NO| D{Did customer pass a Zod schema for the body/payload?}
    D -->|NO| E[Return paylaod as is, aka a string]
    D -->|YES| F[JSON.parse]
    F -->|successful| G[Apply customer Zod schema]
    F -->|fail| H[Throw ParseError]
Loading

In this case, we throw only when the customer explicitly passes a Zod schema for the inner body. This can happen either as a result of a failed JSON parse, or as a result of a failed Zod parse.

We don't throw if the customer doesn't pass any schema, but we still attempt to base64 decode the body if the rest of the payload indicates that it's base64 encoded. However we don't throw if this fails and just return the original body as a string.

This option would cover most of the use cases with no changes to the API, while still being lax enough to allow customers to apply their own parsing (escape hatch) if our heuristics fail.

Option 2

The alternative approach, would be to be slightly more strict and throw even when the base64 decoding operation fails:

flowchart TB
    A[Parse using outer schema - i.e. VpcLatticeV2Schema & extend schema to enfoce body: z.string] -->|Ensure body, headers, and isBase64Encoded fields are present| B{isBase64Encoded === true}
    B -->|YES| C{decodeBase64}
    C -->|successful| D
    C -->|fail| H
    B -->|NO| D{Did customer pass a Zod schema for the body/payload?}
    D -->|NO| E[Return paylaod as is, aka a string]
    D -->|YES| F[JSON.parse]
    F -->|successful| G[Apply customer Zod schema]
    F -->|fail| H[Throw ParseError]
Loading

This would admittedly be less magic but would make it impossible for customers to apply their own transform whenever ours fail.

Option 3

This approach is a further refined version of the first one, in which customers can specify what I'm gonna refer to as primitive schemas for the payload. These are things like z.string(), z.number(), z.boolean(), but importantly not z.object().

This makes the diagram slightly uglier, but what it really means is that we avoid running JSON.parse() whenever the customer tells us (via the inner schema) that the body is not supposed to be a JSON:

flowchart TB
    A[Parse using outer schema - i.e. VpcLatticeV2Schema & extend schema to enfoce body: z.string] -->|Ensure body, headers, and isBase64Encoded fields are present| B{isBase64Encoded === true}
    B -->|YES| C{decodeBase64}
    C -->|successful| D
    C -->|fail| H
    B -->|NO| D{Did customer pass a Zod schema for the body/payload?}
    D -->|NO| E[Return paylaod as is, aka a string]
    D -->|YES| F{Is a primitive schema like z.string, z.boolean, etc}
    F -->|YES| H
    F -->|NO| G[JSON.parse]
    G -->|successful| H[Apply customer Zod schema]
    G -->|fail| I[Throw ParseError]
Loading

Note that I am not sure this is even possible, I don't know Zod enough to know if it allows introspection of this kind, but imo this has all the benefits of option 1, with the added benefit of handling even more cases.


Personally I am inclined to go for the third option if it's feasible, otherwise fall back on the first option.

Provided we explain the logic properly in our documentation, I believe it provides the most versatility while still allowing for extensibility. By not providing a schema for the inner body, we still attempt to decode the body whenever the request tells us so, but customers are responsible for their own further decoding and parsing.

Finally, as discussed during our brief, I think we'd be better served with having some helper pure functions and have the orchestration of how they're called in each envelope type/category for maximum extensibility.

Powertools for AWS Lambda (TypeScript) version

latest

AWS Lambda function runtime

20.x

Packaging format used

npm

Execution logs

No response

@dreamorosi dreamorosi added bug Something isn't working confirmed The scope is clear, ready for implementation parser This item relates to the Parser Utility labels Oct 29, 2024
@dreamorosi dreamorosi moved this from Triage to Backlog in Powertools for AWS Lambda (TypeScript) Oct 29, 2024
@am29d
Copy link
Contributor

am29d commented Nov 13, 2024

I am going through the implementation and I made few observations that contradict our assumptions.

Schema is not optional for envelopes

The concept of envelope was created with the idea to extract and parse inner payload in one step. There is no point in using envelope without the schema, the use case for that is to use APIGatewayProxyEventSchema and then transform your payload manually. Reimplementing this in envelopes introduces redundancy and makes it challenging to explain when use one over the other.

Fall back to string returns

One of the key points outlined is to fall back and return the original event if something goes wrong, base64 decoding, JSON parse, or anything else. Here, we also have a separate use case for that, safeParse, that returns error and the original event. Having the same functionality in parse would introduce redundancy, because the user would still need to guess "what did parse with envelope returned?" and probe the answer with custom logic. I think parse should be strict and straightforward, you give me the schema and envelope, and will return an object (any transformation included), but I will throw an error if something goes wrong. For anything else, use safeParse.

This results into a much lighter implementation, mostly adding base64 use case. Using envelopes and custom schema also means that the payload is a JSON object (except EventBridge), whether it was base64 encoded, compressed or stringified. We certainly need to highlight this point in our docs, so it is clear when to use envelopes and when fallback to the custom schema (i.e. extending built-in schema).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working confirmed The scope is clear, ready for implementation parser This item relates to the Parser Utility
Projects
Development

No branches or pull requests

2 participants