This repository was archived by the owner on Dec 6, 2024. It is now read-only.

Commit d79c75f

apply feedback; be specific
1 parent f23f8b0 commit d79c75f

1 file changed

+50
-10
lines changed

text/0257-utf8-handling.md

+50-10
Original file line number | Diff line number | Diff line change
@@ -124,12 +124,46 @@ different protocol buffer implementation would likely be the first to
124124
observe the invalid UTF-8, and by that time, a large batch of
125125
telemetry could fail as a result.
126126

127-
### No byte-slice valued attribute API
127+
### Responsibility to the SDK user
128+
129+
The existing specification [dictates a safety mechanism for exporting
130+
invalid string-valued attributes safely][SAFETY], however it only
131+
applies to attribute strings:
132+
133+
[SAFETY]: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/attribute-type-mapping.md#string-values
134+
135+
```
136+
## String Values
137+
String values which are valid UTF-8 sequences SHOULD be converted to AnyValue's string_value field.
138+
139+
String values which are not valid Unicode sequences SHOULD be converted to AnyValue's bytes_value with the bytes representing the string in the original order and format of the source string.
140+
```
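
As an illustration of the quoted mapping rule, here is a minimal Go sketch. The `AnyValue` struct is a hypothetical stand-in for the OTLP protobuf message (it is not the generated type), and `toAnyValue` is an illustrative name, not an existing API.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// AnyValue is a hypothetical, simplified stand-in for the OTLP AnyValue
// message; only the two fields relevant to the quoted rule are modeled.
type AnyValue struct {
	StringValue *string // set when the input is valid UTF-8
	BytesValue  []byte  // set when the input is not valid UTF-8
}

// toAnyValue applies the quoted rule: valid UTF-8 goes to string_value,
// anything else goes to bytes_value with the original bytes unchanged.
func toAnyValue(s string) AnyValue {
	if utf8.ValidString(s) {
		return AnyValue{StringValue: &s}
	}
	return AnyValue{BytesValue: []byte(s)}
}

func main() {
	for _, in := range []string{"hello", "bad\xffbytes"} {
		v := toAnyValue(in)
		fmt.Printf("input=%q isString=%t bytes=%d\n", in, v.StringValue != nil, len(v.BytesValue))
	}
}
```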
141+
142+
To satisfy the existing error-handling requirements, the OpenTelemetry
143+
SDK specifications will be modified for all signals with an opt-out
144+
validation feature.
145+
146+
#### Proposed SDK behavior change
147+
148+
The SDK SHOULD in its default configuration validate all string-valued
149+
telemetry data fields. Each run of invalid UTF-8 (i.e., any invalid
150+
UTF-8 sequences) will be replaced by a single Unicode replacement
151+
character, `` (U+FFFD). The exact behavior of this correction is
152+
undefined. When possible, SDKs SHOULD use a built-in library for this
153+
repair (for example, [Golang's
154+
`strings.ToValidUTF8()`](https://pkg.go.dev/strings#ToValidUTF8) or
155+
[Rust's
156+
`String::from_utf8_lossy()`](https://doc.rust-lang.org/std/string/struct.String.html#method.from_utf8_lossy)
157+
satisfy this requirement).
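
A minimal sketch of the repair described above, using Go's `strings.ToValidUTF8`. The helper name `repairUTF8` is hypothetical. The fast path through `utf8.ValidString` is an assumption about how an SDK might avoid work for already-valid strings.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// repairUTF8 (hypothetical name) returns s unchanged when it is valid UTF-8.
// Otherwise each run of invalid bytes is replaced by a single U+FFFD,
// which is the documented behavior of strings.ToValidUTF8.
func repairUTF8(s string) string {
	if utf8.ValidString(s) {
		return s
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	fmt.Printf("%q\n", repairUTF8("valid text"))
	// Two adjacent invalid bytes form one run and yield one replacement.
	fmt.Printf("%q\n", repairUTF8("bad\xff\xfeend"))
}
```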
158+
159+
#### No byte-slice valued attribute API
128160

129161
As a caveat, the OpenTelemetry project has previously debated and
130-
rejected the potential to support a byte-slice typed attribute. This
131-
potential feature was rejected because it allows API users a way to
132-
record a possibly uninterpretable sequence of bytes. Users with
162+
rejected the potential to support a byte-slice typed attribute in
163+
OpenTelemetry APIs.
164+
165+
This potential feature was rejected because it allows API users a way
166+
to record a possibly uninterpretable sequence of bytes. Users with
133167
invalid UTF-8 are left with a few options, for example:
134168

135169
- Base64-encode the invalid data wrapped in human-readable syntax
@@ -146,23 +180,29 @@ be lost, therefore it seems better for Collector pipelines to
146180
explicitly handle UTF-8 validation, rather than leave it to the
147181
protocol buffer library.
148182

149-
OpenTelemetry Collector SHOULD support automatic UTF-8 validation to
150-
protect users. There are several places this could be done:
183+
OpenTelemetry Collector should support automatic UTF-8 validation to
184+
protect users, however there are several places this could be done:
151185

152186
1. Following a receiver, with data coming from an external source,
153187
2. Following a mutating processor, with data modified by potentially
154188
untrustworthy code,
155189
3. Prior to an exporter, with data coming from either an external
156190
source and/or modified by potentially untrustworthy code.
157191

158-
Each of these approaches will take significant effort and cost the
159-
user at runtime, therefore:
192+
Each of these approaches will take significant effort and vary in cost
193+
at runtime.
194+
195+
#### Proposed Collector behavior change
196+
197+
To reduce the cost of UTF-8 validation to a minimum, we propose:
160198

161-
- UTF-8 validation SHOULD be enabled by default
199+
- UTF-8 validation SHOULD be enabled by default for all Receiver components
162200
- Users SHOULD be able to opt out of UTF-8 validation
163201
- A `receiverhelper` library SHOULD offer a function to correct
164202
invalid UTF-8 in-place, with two configurable outcomes (1) reject
165-
individual items containing invalid UTF-8, (2) repair invalid UTF-8.
203+
individual items containing invalid UTF-8, meaning to count them as
204+
rejected spans/points/logs, and (2) repair invalid UTF-8 as specified
205+
for SDKs above.
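
The two configurable outcomes listed above could be sketched as follows. Everything here is an assumption for illustration: the `Item` type stands in for a span, point, or log record, and `Policy`, `Reject`, `Repair`, and `ValidateItems` are hypothetical names, not an existing `receiverhelper` API.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Policy selects the outcome for items containing invalid UTF-8 (hypothetical).
type Policy int

const (
	Reject Policy = iota // drop the item and count it as rejected
	Repair               // replace each invalid run with U+FFFD
)

// Item is a simplified stand-in for a span, metric point, or log record.
type Item struct {
	Name  string
	Attrs map[string]string
}

func (it *Item) valid() bool {
	if !utf8.ValidString(it.Name) {
		return false
	}
	for k, v := range it.Attrs {
		if !utf8.ValidString(k) || !utf8.ValidString(v) {
			return false
		}
	}
	return true
}

// repair rewrites every string field. Note that two distinct invalid keys
// could collapse into one repaired key; this sketch keeps whichever is last.
func (it *Item) repair() {
	it.Name = strings.ToValidUTF8(it.Name, "\uFFFD")
	fixed := make(map[string]string, len(it.Attrs))
	for k, v := range it.Attrs {
		fixed[strings.ToValidUTF8(k, "\uFFFD")] = strings.ToValidUTF8(v, "\uFFFD")
	}
	it.Attrs = fixed
}

// ValidateItems filters or repairs items in place, reusing the backing array,
// and returns the surviving items plus the number of rejected ones.
func ValidateItems(items []Item, p Policy) ([]Item, int) {
	kept := items[:0]
	rejected := 0
	for i := range items {
		it := items[i]
		if !it.valid() {
			if p == Reject {
				rejected++
				continue
			}
			it.repair()
		}
		kept = append(kept, it)
	}
	return kept, rejected
}

func main() {
	batch := []Item{{Name: "ok"}, {Name: "bad\xff"}}
	kept, rejected := ValidateItems(batch, Reject)
	fmt.Println(len(kept), rejected)
}
```

The rejected count is returned so that a receiver can report it as rejected spans/points/logs, matching the first outcome in the list.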
166206

167207
When an OpenTelemetry collector receives telemetry data in any
168208
protocol, in cases where the underlying RPC or protocol buffer
