Fluent Bit cannot resume logs from file storage once the connection to the upstream is restored. #9339

MarkRushB · 2024-09-03T20:14:05Z

Bug Report

Describe the bug
Fluent Bit cannot resume logs from file storage once the connection to the upstream is restored.

To Reproduce

I used Helm to deploy Fluent Bit on my Kubernetes clusters.
I configured filesystem storage for Fluent Bit:

        [SERVICE]
            Daemon Off
            Flush {{ .Values.flush }}
            Log_Level {{ .Values.logLevel }}
            Parsers_File /fluent-bit/etc/parsers.conf
            Parsers_File /fluent-bit/etc/conf/custom_parsers.conf
            HTTP_Server On
            HTTP_Listen 0.0.0.0
            HTTP_Port {{ .Values.metricsPort }}
            Health_Check On

            # Persistent storage path for buffering
            storage.path /var/log/flb-storage/
            storage.sync normal
            storage.checksum off
            storage.backlog.mem_limit 5M
            storage.max_chunks_up 300

        [INPUT]
            Name tail
            Tag app_container
            Path /var/log/test.log
            # Path /var/log/*.log, /var/log/*/*.log, /var/log/*/*/*.log, /var/log/*/*/*/*.log, /var/log/*/*/*/*/*.log,
            # Exclude_Path /var/log/containers/nginx-ingress-*.log, /var/log/containers/fluent-bit-*.log, /var/log/containers/fluentbit-*.log, /var/log/pods/fluent-bit_*/*/*.log, /var/log/containers/cloudguard-*.log, /var/log/pods/checkpoint_cloudguard-*/*/*.log, /var/log/flb-storage/*
            Path_Key filename
            Parser cri
            DB /var/log/flb_kube.db
            storage.type filesystem
            Mem_Buf_Limit     5MB
            Buffer_Max_Size   1MB
            Skip_Long_Lines   Off
            Refresh_Interval 30
            Alias  app_log_file
        [OUTPUT]
            Name tcp
            Match app_container
            Host 107.162.208.134
            Port 20540
            Format json_lines
            Json_date_key false
            tls On
            tls.verify Off
            Alias sse-ingest
            storage.total_limit_size  500M

I used a script to generate dummy logs:

#!/bin/bash

# Define the log file path
log_file="/var/log/test.log"

# Create the log file if it doesn't exist
touch "$log_file"

echo "Starting to generate logs to $log_file"

# Initialize the counter
log_count=1

# Loop to generate logs
while true; do
  echo "$(date +'%Y-%m-%d %H:%M:%S') - Log entry $log_count: This is log message number $log_count" >> "$log_file"
  ((log_count++))  # Increment the counter
  sleep 1  # Generate a log entry every second
done

After running the dummy log script for a while, I manually changed the TCP host in the output to an unreachable one. This configuration change triggered a redeployment of the Fluent Bit pods. After that, logs were no longer pushed to the upstream because the connection was unavailable.
I observed that chunks had accumulated under the storage path I specified:

s.zhao@gke-dv1-gcp-csse1-us-n2s8-application-5a90eb1f-3slt /var/log/flb-storage/tail.1 $ ls
1-1725392991.728518370.flb  1-1725392993.737605434.flb  1-1725392995.746954773.flb  1-1725392997.756039616.flb  1-1725392999.765454780.flb  1-1725393001.775012872.flb
1-1725392992.732981341.flb  1-1725392994.742255505.flb  1-1725392996.751538449.flb  1-1725392998.760521464.flb  1-1725393000.770326169.flb
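
To make the listing above easier to correlate with the outage window, here is a minimal sketch that pulls the timestamp out of each chunk file name. It assumes the names follow the pattern `<prefix>-<epoch_seconds>.<nanoseconds>.flb`, which is inferred from the listing rather than taken from documentation. The helper names `chunk_epoch` and `chunk_span` are hypothetical.

```shell
#!/bin/bash
# Assumption: chunk files are named <prefix>-<epoch_seconds>.<nanoseconds>.flb,
# as the listing suggests. The function names below are illustrative only.

# chunk_epoch: print the epoch-seconds part of one chunk file name.
chunk_epoch() {
  local name="${1##*/}"        # strip any directory part
  name="${name#*-}"            # drop the "<prefix>-" part
  printf '%s\n' "${name%%.*}"  # keep everything before the first dot
}

# chunk_span: given chunk file names as arguments, print how many there are
# and the oldest and newest epoch seconds among them.
chunk_span() {
  local f ts count=0 oldest="" newest=""
  for f in "$@"; do
    ts="$(chunk_epoch "$f")"
    if [ -z "$oldest" ] || [ "$ts" -lt "$oldest" ]; then oldest="$ts"; fi
    if [ -z "$newest" ] || [ "$ts" -gt "$newest" ]; then newest="$ts"; fi
    count=$((count + 1))
  done
  printf '%d chunks, %s to %s\n' "$count" "$oldest" "$newest"
}
```

For example, `chunk_span /var/log/flb-storage/tail.1/*.flb` summarizes the directory shown above. Under the naming assumption, the first chunk (`1725392991`) corresponds to 2024-09-03 19:49:51 UTC; with GNU `date`, `date -u -d @1725392991` performs that conversion.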

Then I changed the host back to the correct one, and the pods restarted. In my upstream (ELK), the logs generated while the connection was unavailable appear to be lost.

In Kibana, the log entries jump from 215 to 363; entries 216 through 362 are missing.
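
The gap can also be confirmed mechanically instead of by eye. The sketch below reads log lines on stdin, extracts the counter from the `Log entry N:` text written by the generator script, and prints every missing range. The function name `find_gaps` and the idea of exporting the received messages to a text file are assumptions for illustration.

```shell
#!/bin/bash
# find_gaps: read log lines on stdin, extract N from "Log entry N",
# and print each missing range as "first-last (count missing)".
# Assumes the message format produced by the generator script.
find_gaps() {
  grep -oE 'Log entry [0-9]+' \
    | awk '{ print $3 }' \
    | sort -n -u \
    | awk 'NR > 1 && $1 > prev + 1 {
             printf "%d-%d (%d missing)\n", prev + 1, $1 - 1, $1 - prev - 1
           }
           { prev = $1 }'
}
```

With the received messages saved to a hypothetical file `received.log`, running `find_gaps < received.log` on the data described here would report `216-362 (147 missing)`.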

Your Environment

Version used: helm version: 0.38.0

Additional context

MarkRushB added the status: waiting-for-triage label Sep 3, 2024
