[exporter/stefexporter] Add basic STEF exporter implementation #37564

Open
tigrannajaryan wants to merge 1 commit into main from tigran/stefexporter

Conversation

tigrannajaryan (Member)

Description

Added STEF exporter implementation for metrics, sending data over a gRPC stream. For now only the queuing and retry exporter helpers are used. We will need to decide later if other helpers are needed for this exporter.

Testing

Unit tests that verify connecting, reconnecting, sending, and acking of data are included.

Documentation

Added to README.

Future Work

More extensive test coverage is desirable and will likely be added in the future.

We likely want to implement a STEF receiver and add STEF as a tested protocol to our testbed.

@bogdandrutu (Member) left a comment

I think this has lots of concurrency issues, and I would suggest simplifying the design, because otherwise this is impossible to get right.

Comment on lines +260 to +192
// stefWriter is not safe for concurrent writing, protect it.
s.stefWriterMutex.Lock()
defer s.stefWriterMutex.Unlock()
Member

Doing synchronous sending across cloud regions (since, if I understand correctly, that is what STEF is for) is a questionable design in my opinion. Should we at least have multiple "connections" that we use at the same time?

Member Author

I agree with you, this is not a good implementation for a streaming protocol. I will rework it once we discuss and agree on how we want streaming exporters like this to work.

@tigrannajaryan (Member Author)

I think this has lots of concurrency issues, and I would suggest simplifying the design, because otherwise this is impossible to get right.

@bogdandrutu I agree, I don't like the design myself. If we can find a simpler way I will be happier.

Here are my constraints:

Prefer single gRPC stream

Why? Two reasons:

  1. STEF stream compression ratio increases as more data is sent through it. This is because of dictionary encoding at two different levels (STEF itself and Zstd), which causes previously seen data to compress better. If we split the data that the exporter receives from the pipeline into multiple STEF streams, we will reduce the compression ratio.
  2. The STEF encoder and decoder keep dictionaries in memory. If we split into multiple STEF streams, all of these encoders and decoders will eventually contain virtually identical data, duplicated as many times as there are STEF streams, increasing memory usage by a factor of the number of streams. This is bad, especially for backends which expect millions of incoming streams.

Sync API of Exporter Helper

The current exporter helper design requires the exportMetrics() call to block synchronously until the sent data is confirmed to be delivered to the destination via ACK messages that the destination sends back on the same gRPC stream. When exportMetrics() returns, the metric data is removed from the queue (and is garbage collected). If we change exportMetrics() to return before data is confirmed to be delivered to the destination, without waiting for ACK messages, then there is a chance that the data will be lost if the gRPC connection breaks before the STEF data is actually delivered to the destination.

Furthermore, if exportMetrics() were to return immediately after encoding STEF data, that data most likely would not be written to the gRPC stream at all, since STEF encoders buffer data into fairly large frames before writing them to the gRPC stream. To guarantee that encoded data is sent over the gRPC stream, exportMetrics() has to issue a Flush() call to the STEF encoder. If this is done for every single exportMetrics() call, it can result in a very significant reduction in compression ratio, since there is typically a fixed overhead per STEF frame (Flush() sends the current frame and creates a new one). In experiments the difference is about 2x worse compression if you Flush() every time (on the datasets I have). This is unacceptable and defeats the purpose of STEF. Note that, as described above, even issuing a Flush() call every time does not guarantee delivery, so this is still not good enough for reliable delivery.
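
To make the constraint concrete, here is a minimal Go sketch of the synchronous pattern the current helper forces: encode, register an ack waiter, Flush() the frame, then block until the ack arrives or the context is cancelled. The names (stefWriter, ackWaiter, RecordID) are hypothetical stand-ins for illustration, not the actual STEF library API.

```go
// Hypothetical sketch only: stefWriter, ackWaiter and RecordID are
// illustrative stand-ins, not the actual STEF library API.
package stefexporter

import (
	"context"
	"fmt"
	"sync"
)

type RecordID uint64

// stefWriter stands in for the STEF encoder that writes into the gRPC stream.
type stefWriter interface {
	Write(encoded []byte) (RecordID, error) // buffer one batch, return its record id
	Flush() error                           // force the buffered frame onto the stream
}

// ackWaiter tracks per-record channels that are closed when the destination
// acks a record on the reverse direction of the stream.
type ackWaiter struct {
	mu      sync.Mutex
	pending map[RecordID]chan struct{}
}

func (w *ackWaiter) register(id RecordID) chan struct{} {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch := make(chan struct{})
	w.pending[id] = ch
	return ch
}

func (w *ackWaiter) onAck(id RecordID) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if ch, ok := w.pending[id]; ok {
		close(ch)
		delete(w.pending, id)
	}
}

// exportMetrics must not return until delivery is confirmed, so it flushes on
// every call (hurting compression) and then blocks waiting for the ack.
func exportMetrics(ctx context.Context, wr stefWriter, acks *ackWaiter, encoded []byte) error {
	id, err := wr.Write(encoded) // only buffers; nothing is on the wire yet
	if err != nil {
		return err
	}
	done := acks.register(id)          // register before Flush so the ack cannot be missed
	if err := wr.Flush(); err != nil { // per-call Flush costs roughly 2x in compression
		return err
	}
	select {
	case <-done: // block until the destination acks this record
		return nil
	case <-ctx.Done():
		return fmt.Errorf("context canceled before ack: %w", ctx.Err())
	}
}
```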

It would be ideal if the exporter helper design decoupled the act of consuming from the queue from the act of deleting from the queue. This would be perfect for asynchronous protocols like STEF. For example, a hypothetical async exporter design could look like this:

func exportMetrics(ctx context.Context, md pdata.Metrics, ack func(id SomeIDType)) (id SomeIDType, err error)

With this API we would implement the STEF exporter's exportMetrics() call to return immediately after encoding md into the STEF stream, returning the id of the written record. The STEF exporter would later asynchronously call the ack() func when it receives delivery confirmation from the destination. This would also allow a much, much simpler implementation of the STEF exporter; I would delete 90% of the code that you (and I) don't like.
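
For illustration, a rough Go sketch of how the exporter could look under such a hypothetical async API. SomeIDType, stefWriter, and the callback shape mirror the proposal above; none of this is an existing exporter helper interface, and acks are assumed to be cumulative for simplicity.

```go
// Hypothetical sketch only: mirrors the proposed async API above, not an
// existing exporter helper or STEF library interface.
package stefexporter

import (
	"context"
	"sync"
)

type SomeIDType uint64

// stefWriter stands in for the STEF encoder; Write only buffers into the
// current frame, so no per-call Flush is needed here.
type stefWriter interface {
	Write(encoded []byte) (SomeIDType, error)
}

type asyncExporter struct {
	writer stefWriter

	mu      sync.Mutex
	pending map[SomeIDType]func(SomeIDType) // ack callbacks keyed by record id
}

// exportMetrics returns as soon as the batch is encoded; the queue keeps the
// data until ack(id) is invoked, so nothing is lost if the stream breaks.
func (e *asyncExporter) exportMetrics(_ context.Context, encoded []byte, ack func(SomeIDType)) (SomeIDType, error) {
	id, err := e.writer.Write(encoded)
	if err != nil {
		return 0, err
	}
	e.mu.Lock()
	e.pending[id] = ack
	e.mu.Unlock()
	return id, nil
}

// onDeliveryConfirmed runs on the stream-reading goroutine when the destination
// acks all records up to and including ackedID (cumulative acks assumed).
func (e *asyncExporter) onDeliveryConfirmed(ackedID SomeIDType) {
	e.mu.Lock()
	defer e.mu.Unlock()
	for id, ack := range e.pending {
		if id <= ackedID {
			ack(id)
			delete(e.pending, id)
		}
	}
}
```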

I have briefly discussed this topic with @dmitryax but I think this is a much bigger effort and for now we have to work with the exporter helper API we have.

If you have any thoughts on how to simplify the design within the current constraints, or if you think there is a better way to handle asynchronous sending, please share them.

@jmacd (Contributor) commented Feb 3, 2025

@tigrannajaryan The OTel-Arrow exporter supports a limited number of streams, and I would agree, the concurrency wasn't easy to get right. In the experiment (https://opentelemetry.io/blog/2024/otel-arrow-production/), we found that a single stream would lead to problems associated with high latency: there's a point at which your batches are large enough to get the compression benefit you want, and at that point it's better to add a stream.

I've suggested it once, now twice: I think it will be worth the time and energy to generalize the otelarrow codebase to let it support multiple codecs for large-frame compression protocols like STEF and OTAP, so that most of the code between these two is shared. I suspect at least 90% of the exporter/receiver codebase is not directly concerned with the OTAP representation, because most of the challenge is handling gRPC streams and cancellation.

@tigrannajaryan (Member Author)

I've suggested it once, now twice: I think it will be worth the time and energy to generalize the otelarrow codebase to let it support multiple codecs for large-frame compression protocols like STEF and OTAP, so that most of the code between these two is shared. I suspect at least 90% of the exporter/receiver codebase is not directly concerned with the OTAP representation, because most of the challenge is handling gRPC streams and cancellation.

@jmacd I agree, it would be great to have a general solution for streaming exporters. I will see if I can find time to look at the OTAP codebase.

@tigrannajaryan (Member Author)

@bogdandrutu I simplified the implementation to a completely basic one, eliminating a significant portion of the concurrency control, so it should now be easier to reason about. Please take another look.

This implementation is extremely basic and is not what I would like to see in the production version. A proper version would not block waiting for acks. I think we should discuss how exactly we want exporters like this to be written, and perhaps have generic helpers for these use cases, as @jmacd suggested. Since that's a longer story, I think it is worth having the basic implementation as a reference and improving it after we decide on the direction.

tigrannajaryan reopened this Feb 6, 2025
tigrannajaryan force-pushed the tigran/stefexporter branch 2 times, most recently from 637de4b to 62bcc5a on February 6, 2025 19:55
Added STEF exporter implementation for metrics, sending data over a gRPC stream. For now only the timeout, queuing, and retry exporter helpers are used. We will need to decide later if other helpers are needed for this exporter.

A full-duplex implementation is desirable, which takes advantage of the streaming nature of the STEF protocol and does not block waiting for acks.