You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I've been debugging an issue with a python Cloud Run Job on GCP where the logs fail to appear in datadog even though I can see the logs appear in Cloud Logging.
I believe I've found a race in batch_strategy.go (more below).
Agent Environment
datadog/serverless-init:1.5.3 (not 100% what agent version this maps to)
Describe what happened:
The job runs every 5 minutes and is short-lived (several seconds runtime) if there's no work to do. It prints a couple of summary lines before it exits. 90%+ of the time the logs don't appear in datadog, even though I see them captured by Cloud Logging.
Describe what you expected:
100% of log lines to be captured.
Steps to reproduce the issue:
Run a short-lived python app as a Cloud Run job (see below for a trivial example)
INFO 2025-01-14T10:20:55.038304Z <my application logs of interest (2 lines)>
DEFAULT 2025-01-14T10:20:55.779608Z 2025-01-14 10:20:55 UTC | SERVERLESS_INIT | INFO | Triggering a flush in the logs-agent
DEFAULT 2025-01-14T10:20:55.779635Z 2025-01-14 10:20:55 UTC | SERVERLESS_INIT | DEBUG | Flush in the logs-agent done.
DEFAULT 2025-01-14T10:20:55.779643Z 2025-01-14 10:20:55 UTC | SERVERLESS_INIT | DEBUG | Send messages for pipeline logs (msg_count:2, content_size=1101, avg_msg_size=550.50)
Here's what I think is happening:
My process logs and exits. serverless-init/main.go calls lastFlush which in turn calls logsAgent.Flush().
Triggering a flush in the logs-agent / Flush in the logs-agent done. - this appears to be the logging from logAgent.Flush().
Send messages for pipeline logs - my logs messages appear! I think this is triggered by writing to flushChan in Pipeline.Flush(). This causes the goroutine running in batchStrategy.Start() to call flushBuffer(), which logs "Send messages for pipeline logs".
// Flush flushes synchronously the processor and sender managed by this pipeline.
func (p *Pipeline) Flush(ctx context.Context) {
p.flushChan <- struct{}{}
p.processor.Flush(ctx) // flush messages in the processor into the sender
if p.serverless {
// Wait for the logs sender to finish sending payloads to all destinations before allowing the flush to finish
p.flushWg.Wait()
}
}
flushChan triggers a flush, but there's no guarantee the goroutine in batchStrategy.Start() will complete its flush before the call to p.processor.Flush() starts.
That means batchStrategy ends up calling sendMessages() and enqueuing the batched messages after the pipeline flushed for the final time.
The text was updated successfully, but these errors were encountered:
I've been debugging an issue with a python Cloud Run Job on GCP where the logs fail to appear in datadog even though I can see the logs appear in Cloud Logging.
I believe I've found a race in batch_strategy.go (more below).
Agent Environment
datadog/serverless-init:1.5.3 (not 100% what agent version this maps to)
Describe what happened:
The job runs every 5 minutes and is short-lived (several seconds runtime) if there's no work to do. It prints a couple of summary lines before it exits. 90%+ of the time the logs don't appear in datadog, even though I see them captured by Cloud Logging.
Describe what you expected:
100% of log lines to be captured.
Steps to reproduce the issue:
Run a short-lived python app as a Cloud Run job (see below for a trivial example)
Additional environment details (Operating System, Cloud provider, etc):
GCP Cloud Run, python:3.12-slim-bookworm.
Example app:
Dockerfile:
Additional env vars:
Debugging
Here are the logs of interest:
Here's what I think is happening:
Triggering a flush in the logs-agent
/Flush in the logs-agent done.
- this appears to be the logging from logAgent.Flush().Send messages for pipeline logs
- my logs messages appear! I think this is triggered by writing to flushChan in Pipeline.Flush(). This causes the goroutine running in batchStrategy.Start() to call flushBuffer(), which logs "Send messages for pipeline logs".I believe the bug is here:
flushChan
triggers a flush, but there's no guarantee the goroutine in batchStrategy.Start() will complete its flush before the call top.processor.Flush()
starts.That means batchStrategy ends up calling sendMessages() and enqueuing the batched messages after the pipeline flushed for the final time.
The text was updated successfully, but these errors were encountered: