out_stackdriver: does not batch output records properly if passed a large chunk of records and can drop a majority of records #9374

Open
ryanohnemus opened this issue Sep 10, 2024 · 2 comments

Comments

@ryanohnemus
Contributor

ryanohnemus commented Sep 10, 2024

Bug Report

Describe the bug
If you set a tail input with a large Buffer_Chunk_Size and Buffer_Max_Size, the chunks that are created and passed through fluentbit can exceed the 10485760-byte maximum. Cloud Logging rejects them, and the stackdriver output plugin drops the records with the following error:

 "error": {
    "code": 400,
    "message": "Request payload size exceeds the limit: 10485760 bytes.",
    "status": "INVALID_ARGUMENT"
    }

To Reproduce

  1. Use the following fluentbit input:
    [INPUT]
        name                tail
        read_from_head      false
        skip_long_lines     on
        path                /var/log/containers/*.log
        Tag                 kube.*
        buffer_chunk_size   5M
        buffer_max_size     10M
        Exclude_Path        /var/log/containers/*fluent*
        Refresh_Interval    1
        mem_buf_limit       50MB
        threaded            on
  • Have a high volume logging container running on the same node as fluentbit.
  • The fluentbit tail input successfully reads all messages from the container (this can be verified by checking the Prometheus metrics):
    • fluentbit_input_records_total{name="tail.0"} 125000002
  • out_stackdriver fails to create properly sized requests to cloud logging:

    fluentbit_stackdriver_proc_records_total{grpc_code="-1",status="400",name="stackdriver.0"} 12033778
    fluentbit_stackdriver_proc_records_total{grpc_code="0",status="200",name="stackdriver.0"} 466224
    

Most of the records here (12033778 of the 12500002 counted by out_stackdriver, about 96%) were dropped by the out_stackdriver plugin.

  • You will also see the error message above in the log.

(This can most likely happen in any situation where a fluentbit chunk is greater than 10485760 bytes. In fluentbit, chunks can be up to 2MB.)

Expected behavior
The out_stackdriver plugin should batch Cloud Logging payloads and not rely on the incoming chunk being below the 10485760-byte limit. I believe fluentbit chunks can be around 2MB, based on https://docs.fluentbit.io/manual/v/1.8/administration/buffering-and-storage#chunks
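To illustrate the kind of batching meant here, the following is a minimal Python sketch, not the plugin's actual C code. It assumes each request body is a JSON object of the form `{"entries": [...]}` and that size is measured on Python's default `json.dumps` output; the function name `batch_by_size` and the sample records are hypothetical.

```python
import json

# Limit quoted in the Cloud Logging error message above.
MAX_PAYLOAD_BYTES = 10485760


def batch_by_size(entries, limit=MAX_PAYLOAD_BYTES):
    """Yield lists of entries whose JSON-encoded request stays within limit.

    Assumption: the request body looks like {"entries": [...]} and is
    encoded with json.dumps defaults (", " between list items).
    """
    envelope = len(json.dumps({"entries": []}).encode("utf-8"))
    batch, size = [], envelope
    for entry in entries:
        entry_size = len(json.dumps(entry).encode("utf-8"))
        if envelope + entry_size > limit:
            raise ValueError("a single entry exceeds the payload limit")
        sep = 2 if batch else 0  # ", " separator between list items
        if size + sep + entry_size > limit:
            # Flush the current batch and start a new one with this entry.
            yield batch
            batch, size, sep = [], envelope, 0
        batch.append(entry)
        size += sep + entry_size
    if batch:
        yield batch


if __name__ == "__main__":
    records = [{"m": "x" * 10} for _ in range(5)]
    for request in batch_by_size(records, limit=60):
        body = json.dumps({"entries": request})
        print(len(request), "entries,", len(body), "bytes")
```

The point of the sketch is that the split is decided on the encoded size of the outgoing request, not on the size of the incoming chunk.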

Your Environment

  • Version used: fluentbit-3.1.5
  • Configuration:
  • Environment name and version (e.g. Kubernetes? What version?): GKE 1.29
  • Server type and version:
  • Operating System and version: ContainerOS
  • Filters and plugins:

Additional context

@ryanohnemus
Contributor Author

FYI - #1938 mentions a potential solution, but the work would need to be done at the base output plugin level if we didn't want to batch in out_stackdriver directly. This looks like an involved change, and that issue has been open for 4.5 years.

@braydonk
Contributor

This is a problem we have tried but failed to fix in the past. I believe it affects numerous other output plugins.

The root of the problem is that the size of a chunk doesn't equate to the size of the Cloud Logging payload, and we can't predict the payload size accurately enough to allow for any intelligent batching. A msgpack payload of a given size won't match the size of the same data once converted to JSON, since JSON is so much more expensive for representing the same thing.
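As a rough illustration of why the two sizes diverge, here is a standalone Python sketch (not Fluent Bit code). The helper `msgpack_size` is hypothetical and only computes the MessagePack-encoded size for a small subset of types; the JSON size uses Python's default `json.dumps` formatting, which is an assumption and not necessarily what the plugin emits.

```python
import json


def msgpack_size(obj):
    """Return the MessagePack-encoded size in bytes (subset of types only)."""
    if obj is None or isinstance(obj, bool):
        return 1  # nil / true / false
    if isinstance(obj, int):
        if -32 <= obj <= 127:
            return 1  # fixint
        for total, bits in ((2, 8), (3, 16), (5, 32), (9, 64)):
            if -(1 << (bits - 1)) <= obj < (1 << bits):
                return total  # type byte + integer payload
        raise ValueError("integer out of MessagePack range")
    if isinstance(obj, float):
        return 9  # float64
    if isinstance(obj, str):
        n = len(obj.encode("utf-8"))
        header = 1 if n < 32 else 2 if n < 256 else 3 if n < 65536 else 5
        return header + n
    if isinstance(obj, (list, dict)):
        n = len(obj)
        header = 1 if n < 16 else 3 if n < 65536 else 5
        if isinstance(obj, dict):
            items = [x for pair in obj.items() for x in pair]
        else:
            items = obj
        return header + sum(msgpack_size(x) for x in items)
    raise TypeError(f"unsupported type: {type(obj).__name__}")


def json_size(obj):
    """Size in bytes of the default json.dumps encoding."""
    return len(json.dumps(obj).encode("utf-8"))


if __name__ == "__main__":
    plain = {"log": "a" * 100}   # nothing to escape
    noisy = {"log": "\n" * 100}  # every character is escaped in JSON
    for name, rec in (("plain", plain), ("noisy", noisy)):
        print(name, "msgpack:", msgpack_size(rec), "json:", json_size(rec))
```

Both records have the same msgpack size, but the JSON size of the second is much larger because each newline is escaped as two characters. The ratio depends on the content of the logs, which is why chunk size alone is a poor predictor.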

The road I went down last time I tried to fix this was to come up with a rough heuristic for how big a chunk could be before it became too big for a Cloud Logging request payload. In that scenario, I would split the chunk in half and recursively do this on each half until we end up with a list of Cloud Logging requests that would make it through. This change is non-trivial; in particular, I remember that trying to split the event chunks in half was a rat's nest. (Maybe it would be easier with the log_event_decoder API that exists now 🤔)
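The recursive halving idea can be sketched in Python as follows. This is a simplified illustration, not the earlier attempt itself: it measures the encoded JSON size directly instead of using a heuristic, it operates on an already-decoded list of entries, and it assumes a `{"entries": [...]}` request shape. The name `split_until_fits` is hypothetical.

```python
import json


def split_until_fits(entries, limit):
    """Recursively halve entries until every request fits within limit.

    Returns a list of entry lists, in the original order. Assumption: the
    request body is {"entries": [...]} encoded with json.dumps defaults.
    """
    payload = json.dumps({"entries": entries}).encode("utf-8")
    if len(payload) <= limit:
        return [entries]
    if len(entries) <= 1:
        # Cannot split further; a single entry is too large on its own.
        raise ValueError("a single entry exceeds the payload limit")
    mid = len(entries) // 2
    return split_until_fits(entries[:mid], limit) + split_until_fits(entries[mid:], limit)


if __name__ == "__main__":
    records = [{"m": "x" * 10} for _ in range(5)]
    for request in split_until_fits(records, limit=60):
        print(len(request), "entries")
```

One trade-off of this approach is that every failed attempt encodes the payload again, so a chunk far above the limit is serialized several times before all pieces fit.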

The idea in the issue mentioned would probably be better. I'll see if I can engage the Fluent Bit maintainers to find out whether they have any ideas as well.
