out_stackdriver: does not batch output records properly if passed a large chunk of records and can drop a majority of records #9374

ryanohnemus · 2024-09-10T20:47:37Z

Bug Report

Describe the bug
If you configure a tail input with a large Buffer_Chunk_Size and Buffer_Max_Size, the chunks that are created and passed to fluentbit can exceed the maximum request size of 10485760 bytes. Those requests are rejected by Cloud Logging and the records are dropped by the stackdriver output plugin with the following error:

 "error": {
    "code": 400,
    "message": "Request payload size exceeds the limit: 10485760 bytes.",
    "status": "INVALID_ARGUMENT"
    }

To Reproduce

Use the following fluentbit input:

    [INPUT]
        name               tail
        read_from_head     false
        skip_long_lines    on
        path               /var/log/containers/*.log
        Tag                kube.*
        buffer_chunk_size  5M
        buffer_max_size    10M
        Exclude_Path       /var/log/containers/*fluent*
        Refresh_Interval   1
        mem_buf_limit      50MB
        threaded           on

Have a high-volume logging container running on the same node as fluentbit.
The fluentbit tail input successfully reads all messages from the container (this can be verified by checking the Prometheus metrics):
- fluentbit_input_records_total{name="tail.0"} 125000002

out_stackdriver fails to create properly sized requests to cloud logging:

fluentbit_stackdriver_proc_records_total{grpc_code="-1",status="400",name="stackdriver.0"} 12033778
fluentbit_stackdriver_proc_records_total{grpc_code="0",status="200",name="stackdriver.0"} 466224

Most of the records here (12033778 of 12500002, roughly 96%) have been dropped by the out_stackdriver plugin.

You will also see the error message shown above in the log.

(This can most likely happen in any situation where a fluentbit chunk is greater than 10485760 bytes. In fluentbit, chunks can be up to 2MB.)

Expected behavior
The out_stackdriver plugin should batch Cloud Logging payloads itself and not rely on the incoming chunk being below the 10485760-byte limit. I believe fluentbit chunks can be around 2MB based on https://docs.fluentbit.io/manual/v/1.8/administration/buffering-and-storage#chunks
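To illustrate the kind of batching described here, the following is a minimal Python sketch, not the plugin's actual code (out_stackdriver is written in C). It greedily groups records so that each serialized JSON request body stays within the limit. The `batch_by_size` and `payload_size` helpers and the bare `{"entries": [...]}` envelope are assumptions made for illustration; a real request carries more fields.

```python
import json

# Limit quoted in the Cloud Logging error message in this report.
MAX_PAYLOAD_BYTES = 10485760


def payload_size(batch):
    """Size in bytes of the simplified, compact JSON request body for a batch."""
    return len(json.dumps({"entries": batch}, separators=(",", ":")).encode("utf-8"))


def batch_by_size(records, limit=MAX_PAYLOAD_BYTES):
    """Greedily group records so each JSON request body stays within `limit`.

    Hypothetical helper: the envelope is a stand-in for the real request body.
    """
    envelope = payload_size([])  # bytes used by {"entries":[]}
    batches, current, size = [], [], envelope
    for record in records:
        encoded = len(json.dumps(record, separators=(",", ":")).encode("utf-8"))
        if envelope + encoded > limit:
            raise ValueError("a single record exceeds the payload limit")
        extra = encoded + (1 if current else 0)  # comma between entries
        if size + extra > limit:
            # Close the current batch and start a new one with this record.
            batches.append(current)
            current, size = [], envelope
            extra = encoded
        current.append(record)
        size += extra
    if current:
        batches.append(current)
    return batches
```

Calling `batch_by_size(records)` returns a list of record lists, each of which could be sent as its own request. A single pass like this measures each record once, at the cost of serializing every record before sending.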

Your Environment

Version used: fluentbit-3.1.5
Configuration:
Environment name and version (e.g. Kubernetes? What version?): GKE 1.29
Server type and version:
Operating System and version: ContainerOS
Filters and plugins:

Additional context


ryanohnemus · 2024-09-11T18:16:33Z

FYI - #1938 mentions a potential solution, but work would need to be done at the base output plugin level if we didn't want to batch in out_stackdriver directly. This looks like an involved change, and that issue has been open for 4.5 years.

braydonk · 2024-09-11T18:23:05Z

This is a problem we have tried but failed to fix in the past. I believe it affects numerous other output plugins.

The root of the problem is that the size of a chunk doesn't equate to the size of the Cloud Logging payload, and we can't predict the payload size accurately enough to allow for any intelligent batching. Getting a msgpack payload of some size doesn't mean the payload will be the same size once converted to JSON, since JSON needs more bytes to represent the same data.
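As a rough illustration of that size mismatch, here is a small Python sketch that compares the encoded size of one record in msgpack and in compact JSON. `msgpack_size` is a hypothetical helper that only covers strings, non-negative integers, and maps, following the header sizes in the msgpack specification; the sample record is made up, and a real Cloud Logging entry adds further fields on top of this.

```python
import json


def msgpack_size(value):
    """Encoded size in bytes for a small msgpack subset (str, int >= 0, dict)."""
    if isinstance(value, bool):
        raise TypeError("bool is outside this simplified subset")
    if isinstance(value, str):
        n = len(value.encode("utf-8"))
        # fixstr, str 8, str 16, str 32 headers
        header = 1 if n <= 31 else 2 if n <= 0xFF else 3 if n <= 0xFFFF else 5
        return header + n
    if isinstance(value, int) and value >= 0:
        # positive fixint, uint 8, uint 16, uint 32, uint 64
        if value <= 0x7F:
            return 1
        if value <= 0xFF:
            return 2
        if value <= 0xFFFF:
            return 3
        if value <= 0xFFFFFFFF:
            return 5
        return 9
    if isinstance(value, dict):
        n = len(value)
        # fixmap, map 16, map 32 headers
        header = 1 if n <= 15 else 3 if n <= 0xFFFF else 5
        return header + sum(msgpack_size(k) + msgpack_size(v) for k, v in value.items())
    raise TypeError(f"unsupported type: {type(value).__name__}")


def compare_sizes(record):
    """Return (msgpack_bytes, json_bytes) for the same record."""
    json_bytes = len(json.dumps(record, separators=(",", ":")).encode("utf-8"))
    return msgpack_size(record), json_bytes


# Made-up log line containing quotes and a newline, which JSON must escape.
sample = {"log": 'GET /index.html "200"\n', "stream": "stdout"}
packed, as_json = compare_sizes(sample)
print(f"msgpack: {packed} bytes, JSON: {as_json} bytes")
```

The gap depends on the data: quotes, newlines, and non-ASCII characters all need escaping in JSON, so the ratio varies from record to record, which is what makes a fixed conversion factor unreliable.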

The road I went down last time I tried to fix this was to come up with a rough heuristic for how big a chunk could be before it became too big for a Cloud Logging request payload. In that scenario, I would split the chunk in half, and recursively do this on each half of the payload until we end up with a list of Cloud Logging requests that would make it through. This change is non-trivial. In particular, I remember that trying to split the event chunks in half was a rat's nest. (Maybe it would be easier with the log_event_decoder API that exists now 🤔)
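The recursive halving idea can be sketched in Python as follows. This is only an illustration of the approach under simplifying assumptions, not the attempted C change: it measures the real serialized size instead of using a heuristic, `split_to_fit` and `payload_size` are hypothetical names, and the bare `{"entries": [...]}` envelope stands in for the actual request body.

```python
import json


def payload_size(records):
    """Size in bytes of the simplified, compact JSON request body."""
    return len(json.dumps({"entries": records}, separators=(",", ":")).encode("utf-8"))


def split_to_fit(records, limit):
    """Recursively halve `records` until every piece serializes within `limit`.

    Returns a list of record lists in the original order.
    """
    if not records:
        return []
    if payload_size(records) <= limit:
        return [records]
    if len(records) == 1:
        # Halving cannot help when one record alone is over the limit.
        raise ValueError("a single record exceeds the payload limit")
    mid = len(records) // 2
    return split_to_fit(records[:mid], limit) + split_to_fit(records[mid:], limit)
```

Note that this sketch re-serializes the records at every level of recursion, so oversized chunks cost extra CPU, and it still needs a policy for a single record that is over the limit on its own.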

The idea in the issue mentioned would probably be better. I'll see if I can engage the Fluent Bit maintainers and find out whether they have any ideas as well.

ryanohnemus added the status: waiting-for-triage label Sep 10, 2024

braydonk mentioned this issue Sep 13, 2024: [POC, Do not merge] input_chunk: split incoming buffer when it's too big #9385 (Open)

ryanohnemus mentioned this issue Sep 18, 2024: Performance Testing of Fluent-bit with several filters shows log processing falling < 5mb/s #9399 (Open)
