AWS v2.30 SDK InputStream behavior changes #5859
Taking a look.
Hi @rphilippen, thank you for your report and feedback! Let me first address the use of newStream() and how it should be closed. It is by design that newStream() may be invoked multiple times throughout the lifecycle of a request, for instance, first calculating content-length for S3 putObject if it's not already provided and then sending it over the wire, as you mentioned. Per the contract of ContentStreamProvider, the user is responsible for closing the streams unless it's the last attempt. This is because the user may intend to return the same stream that is resettable instead of a new stream every time newStream() is invoked. Copy-pasting the Javadoc for ContentStreamProvider here:
However, there are places where we can avoid calling newStream() more than once, and we will look into that.

Second, wrt buffering and memory usage, we understand buffering everything in memory is not a good practice, and in hindsight, we should not have supported streaming with unknown content length by buffering in the first place; unfortunately, we can't remove that support at this point. That is why we recommend that users provide a resettable stream and use RequestBody#fromInputStream for streaming with known content length. For streaming with unknown content length, we recommend following this Dev Guide. We will improve our documentation to make it more discoverable. On a side note, we have a plan to support streaming with unknown content length without buffering.

Finally, wrt your suggestions:
Unfortunately, we can't remove the existing APIs due to backwards compatibility reasons.
This is what we intend to do in the latest code, as you can see here. If an InputStream supports resetting, we don't buffer the content. #5850 is a bug and will be addressed. Please let us know if you have further questions or suggestions.
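To illustrate the recommendation above, here is a minimal sketch of streaming with a known content length and a resettable stream (the bucket, key, and file path are placeholders, and the plain S3Client.create() setup is just an assumption for the example):

import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class KnownLengthUploadSketch {
    public static void main(String[] args) throws Exception {
        Path file = Path.of("export.zip");        // placeholder path
        long contentLength = Files.size(file);    // known up front, so no sizing pass is needed

        try (S3Client s3 = S3Client.create();
             // BufferedInputStream supports mark()/reset(), which fromInputStream uses
             // (with a 128 KiB read limit) to rewind instead of buffering the payload.
             InputStream in = new BufferedInputStream(Files.newInputStream(file))) {

            s3.putObject(PutObjectRequest.builder()
                                         .bucket("my-bucket")   // placeholder
                                         .key("export.zip")     // placeholder
                                         .build(),
                         RequestBody.fromInputStream(in, contentLength));
        }
    }
}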
Only supporting
This is an extremely weird ownership set-up. It requires users of the AWS SDK, who provide the streams, to know when they can close them. It means all user code handing out streams needs to track them, so they can be closed when the AWS SDK asks for the next one. That seems error-prone to me, for something that I personally think the AWS SDK should take responsibility for: closing any streams it asks for.

Also note that if we follow this JavaDoc as to how the AWS SDK should behave regarding streams, there is no need to copy data from streams as was done in #5837 and #5841. Following that documented behavior, the AWS SDK should just call newStream() again whenever it needs to read the content another time.

I already alluded to this in my description, but we switched to a streaming set-up for file uploads, as our service producing the exports ran out of memory for large files. We've coded a byte-buffered output stream that shares its underlying buffer between the producer and the consumer.
I'm sorry, but this statement ("not a side effect of recent change") is just false. The issue report literally starts with "Starting with v2.30.9, …". Besides that, this issue is clearly introduced by #5841, and your fix in #5855 partially "reverts" that change, by avoiding the copy behavior for an internal use-case.

What worries me about the original change, combined with the subsequent fix you did, is that it still copies the stream for anything external to the AWS SDK. That means it practically breaks our export job, as we get back the risk of going Out-of-Memory, because the AWS SDK is copying our data. Currently, I know the last point is easily fixed, but then newer versions of the AWS SDK are still copying our 16 MB buffer.
I hear your concerns and feedback. Unfortunately, there is not a lot we can do due to backwards compatibility reasons. If users return a new stream every time newStream() is invoked, they are responsible for closing the earlier streams, per the existing contract.
Sorry, let me clarify, I meant to say it was not an intended side effect for SDK ContentStreamProvider implementations.

For your use-case, are you able to use RequestBody#fromInputStream with your resettable stream? We will look into optimizing the places where the SDK calls newStream() more often than necessary.

As a side note, #5866 has been submitted to avoid the extra newStream() call.
Let me start by saying: I realize the title of this issue
Yes, we can, our stream is resettable. The reason I went for fromContentProvider(ContentStreamProvider, long, String) is best explained by quoting the JavaDoc of both methods:

/**
* Creates a {@link RequestBody} from an input stream. {@value Header#CONTENT_LENGTH} must
* be provided so that the SDK does not have to make two passes of the data.
* <p>
 * The stream will not be closed by the SDK. It is up to the caller of this method to close the stream. The stream
* should not be read outside of the SDK (by another thread) as it will change the state of the {@link InputStream} and
* could tamper with the sending of the request.
* <p>
* To support resetting via {@link ContentStreamProvider}, this uses {@link InputStream#reset()} and uses a read limit of
* 128 KiB. If you need more control, use {@link #fromContentProvider(ContentStreamProvider, long, String)} or
* {@link #fromContentProvider(ContentStreamProvider, String)}.
*
* @param inputStream Input stream to send to the service. The stream will not be closed by the SDK.
* @param contentLength Content length of data in input stream.
* @return RequestBody instance.
*/
public static RequestBody fromInputStream(InputStream inputStream, long contentLength) {
IoUtils.markStreamWithMaxReadLimit(inputStream);
InputStream nonCloseable = nonCloseableInputStream(inputStream);
return fromContentProvider(() -> {
if (nonCloseable.markSupported()) {
invokeSafely(nonCloseable::reset);
}
return nonCloseable;
}, contentLength, Mimetype.MIMETYPE_OCTET_STREAM);
}
/**
* Creates a {@link RequestBody} from the given {@link ContentStreamProvider}.
*
* @param provider The content provider.
* @param contentLength The content length.
* @param mimeType The MIME type of the content.
*
* @return The created {@code RequestBody}.
*/
public static RequestBody fromContentProvider(ContentStreamProvider provider, long contentLength, String mimeType) {
return new RequestBody(provider, contentLength, mimeType);
}

(For reference, those are the links to RequestBody and ContentStreamProvider for the version we are using.) I overlooked the detail in the JavaDoc of fromInputStream that it supports resetting via InputStream#reset() with a 128 KiB read limit. So in practice the AWS SDK behaved exactly as desired, by reusing the resettable stream we hand it.
To give some more background: our export code runs using two threads: a producer and a consumer. The producer thread reads the input streams (from S3), and writes to a zip stream that wraps our custom byte-buffer output stream.
Of course, the whole synchronization depends on knowing when the consumer can release the shared buffer so the producer thread can write to it again. Why am I sharing these details? To establish that closing the InputStream is, in our set-up, exactly the signal that the consumer is done with the shared buffer. Versions up to 2.29 respected that, and our exports ran fine for years.

The problem the AWS SDK faces is that it often (but not always) needs to iterate content twice: optionally to determine the content length and/or a checksum of the data, and obviously it needs to actually stream the content into an HTTP request. For any stream that cannot be re-read cheaply, that is a real problem. I understand the way the AWS SDK tried to solve it before v2.30, although I also think it did so (and still does) with flawed behavior.
So what are typical use-cases for end-users? Roughly three: in-memory content (a byte[] or similar that is already on the heap), external content (data that lives elsewhere, such as a file or an S3 object, exposed as a stream), and generated content (data produced on the fly, like our zip stream).
Now let's see what options the AWS SDK provides us to deal with these scenarios. In-memory content is easy: the stream is resettable, and creating a new input stream on the same in-memory content is also cheap. For external and generated content, we always have the option to turn it into in-memory content (by reading everything into a byte[] first), but for large content that is exactly the Out-of-Memory trap we are trying to avoid.
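For illustration, that in-memory fallback amounts to something like the following sketch (the helper name is made up; it is fine for small payloads, and exactly the heap cost described above for large ones):

import java.io.IOException;
import java.io.InputStream;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.utils.IoUtils;

final class InMemoryFallback {
    /** Reads the whole stream onto the heap; acceptable for small payloads only. */
    static RequestBody toInMemoryBody(InputStream source) throws IOException {
        byte[] copy = IoUtils.toByteArray(source); // the entire payload ends up in memory
        return RequestBody.fromBytes(copy);        // in-memory content: resettable and cheap to re-read
    }
}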
With the recent change to always copy the provided stream into memory, that last bit of control is effectively taken away from us.

Of course it's easy to point out flaws, but how do you make the AWS SDK "well-behaved" in all use cases, without unnecessary copying of data? Here is my proposal for behavior that I think covers all use cases well, and makes the SDK predictable. Note that the SDK already has all the API definitions in place, it just needs to change its behavior:

- Close every stream obtained from a ContentStreamProvider, exactly once, as soon as the SDK is done with it.
- When content needs to be read more than once, prefer InputStream#reset() on streams that support it over asking for a new stream.
- Only copy content into an in-memory buffer as a last resort, for streams that do not support resetting.
This leaves end-users with guaranteed correct behavior, and the option to optimize memory usage where necessary: provide a resettable stream (or in-memory content) and the SDK never has to buffer anything.
If the AWS SDK wants to be mindful of memory usage, and thus only copy an InputStream when it does not support resetting, that is entirely possible with the behavior above.

As a bonus, if the SDK always closes every stream it asks for/receives, there are also numerous opportunities in the code base to use Java's try-with-resources feature, making it much easier to guarantee the AWS SDK doesn't leak resources, both on success and exception paths.

And yes, I realize this is a behavior-breaking change that may have consequences for code relying on the "the SDK won't close my stream" quirk. I'd like to point out that the 2.30.x releases already changed stream handling in breaking ways, as this issue and #5850 show.

Do with this feedback what you will. I've said my piece, I've made my arguments (hopefully clear). We can switch to the recommended resettable-stream approach on our side, but I hope the broader points are still useful.
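As a sketch of that try-with-resources pattern (the helper below is illustrative, not SDK code), every internal consumer of a provider could look like this:

import java.io.IOException;
import java.io.InputStream;

import software.amazon.awssdk.http.ContentStreamProvider;

final class TryWithResourcesExample {
    /** Reads a stream to the end; the stream is closed on success and exception paths alike. */
    static long drain(ContentStreamProvider provider) throws IOException {
        try (InputStream in = provider.newStream()) {
            long total = 0;
            byte[] chunk = new byte[8192];
            int read;
            while ((read = in.read(chunk)) != -1) {
                total += read;
            }
            return total;
        }
    }
}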
Am I understanding from this bug report that the latest version of the SDK now reads in and buffers all data provided by a content provider just so it can calculate a checksum? That is going to destroy memory management where we upload data directly from files so we can have multiple threads uploading multiple 64 MB blocks at a time. If we turn off this new checksum feature of #5801, will the buffering go away?
To provide an update: with the latest version, the SDK only buffers data if you use RequestBody#fromContentProvider without providing the content length. This behavior is not configurable. To upload objects with unknown content length, we recommend using the AWS CRT-based S3 client: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/crt-based-s3-client.html
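For example, a minimal sketch (with a toy in-memory provider) of the difference between the two overloads:

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.http.ContentStreamProvider;

public class FromContentProviderSketch {
    public static void main(String[] args) {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        ContentStreamProvider provider = () -> new ByteArrayInputStream(payload);

        // No content length given: the SDK buffers the content to determine its size.
        RequestBody buffered = RequestBody.fromContentProvider(provider, "application/octet-stream");

        // Explicit content length: no length-related buffering is needed.
        RequestBody streamed = RequestBody.fromContentProvider(provider, payload.length, "application/octet-stream");
    }
}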
@zoewangg thanks. Reviewed all our uses of fromContentProvider() and length is passed in. |
Describe the bug
Starting from AWS SDK version 2.30.0, the multipart upload calls ContentStreamProvider::newStream() twice on the provider passed to RequestBody::fromContentProvider(ContentStreamProvider provider, long contentLength, String mimeType). This leads to two InputStream instances being created, but only the second one gets closed. If code using the SDK creates an InputStream that is tied to an OS handle (like reading from a file), the OS handle leaks.

For quite a few streams in Java, it doesn't really matter whether you close them or not. Typically this goes for "in-memory" streams, such as ByteArrayInputStream; after all, there is no underlying handle to release. However, if you provide an InputStream that has a native OS handle attached to it (such as a file handle), the container may run out of file handles and crash.

For our use case, this issue is blocking us from upgrading to version 2.30, simply because it breaks our exports feature. We unload data (chunks) onto S3. We then load these chunks (one-by-one, in order) from our service to run them through a zip stream, and upload the result as parts. This process has been running fine for years with the AWS SDK up to version 2.29.
Regression Issue
Expected Behavior
The AWS SDK should close any InputStreams it asks for, preferably only once.

Preferably, the SDK should also only ask for the InputStream once, as producing the stream may be an expensive operation.
Current Behavior
The SDK invokes newStream() on the provided ContentStreamProvider twice, but only closes the second stream. This has been observed by adding logging in our application; the (heavily simplified) output showed two newStream() calls with only one matching close().

I'm pretty confident this issue was first introduced in #4492. Specifically, the AwsChunkedV4PayloadSigner::beforeSigning method calls newStream(), but does not close the returned stream. It doesn't close it because the SignerUtils::moveContentLength method takes an InputStream, but never closes it.

Reproduction Steps
Writing out a realistic example for a multipart upload is quite complicated, especially with the requirement that any non-last part must be larger than 5 MB. In short, to trace the issue: start a multipart upload (we use the synchronous S3Client), and upload a part using a ContentStreamProvider that has been patched to track "open" (ContentStreamProvider::newStream()) and "close" (InputStream::close()) actions. We're providing the content-length (as we know it), as well as a mime-type.
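A heavily simplified sketch of the kind of tracking provider described above (bucket, key, and part size are placeholders, completing or aborting the upload is omitted, and this is illustrative rather than our production code):

import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicInteger;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.http.ContentStreamProvider;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CreateMultipartUploadRequest;
import software.amazon.awssdk.services.s3.model.UploadPartRequest;

public class StreamTrackingRepro {

    /** Counts newStream() and close() calls so a mismatch becomes visible. */
    static final class TrackingProvider implements ContentStreamProvider {
        private final byte[] content;
        final AtomicInteger opened = new AtomicInteger();
        final AtomicInteger closed = new AtomicInteger();

        TrackingProvider(byte[] content) {
            this.content = content;
        }

        @Override
        public InputStream newStream() {
            opened.incrementAndGet();
            return new FilterInputStream(new ByteArrayInputStream(content)) {
                @Override
                public void close() throws IOException {
                    closed.incrementAndGet();
                    super.close();
                }
            };
        }
    }

    public static void main(String[] args) {
        byte[] part = new byte[6 * 1024 * 1024]; // > 5 MB, so it is a valid non-last part
        TrackingProvider provider = new TrackingProvider(part);

        try (S3Client s3 = S3Client.create()) {
            String uploadId = s3.createMultipartUpload(CreateMultipartUploadRequest.builder()
                                                                                   .bucket("my-bucket") // placeholder
                                                                                   .key("my-key")       // placeholder
                                                                                   .build())
                               .uploadId();

            s3.uploadPart(UploadPartRequest.builder()
                                           .bucket("my-bucket")
                                           .key("my-key")
                                           .uploadId(uploadId)
                                           .partNumber(1)
                                           .build(),
                          RequestBody.fromContentProvider(provider, part.length, "application/octet-stream"));
        }

        System.out.println("newStream() calls: " + provider.opened.get()
                           + ", close() calls: " + provider.closed.get());
    }
}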
Possible Solution
Obviously: close all user-provided streams the SDK asks for. There is some nuance to this though: the SDK should also be mindful of memory usage when working with data to upload.
The use case we are using the multipart upload with is basically:

- unload data in chunks onto S3;
- stream those chunks, one by one and in order, through a zip stream;
- upload the zipped output as parts of a multipart upload.
We used to have a naïve process that would keep all data in memory. It is probably no surprise that our service tended to go Out-of-Memory on larger exports (think 1 GB range). About two years ago, we rewrote the implementation to be fully streaming. The data is streamed directly from S3 (the getObject response) into the zip output stream. The output stream is a custom byte buffer that follows an "exclusive write, multiple read" pattern, and it enforces this: to write to the stream, no outstanding readers may be active, while reading multiple times from the same buffer is allowed. With this set-up, an export takes about 20 MB of RAM to process, regardless of how big the actual export is. There are some read buffers, and we have a 16 MB chunk buffer that is written to, and read from, repeatedly.
It was the sanity check of this shared buffer class that alerted us to the issue in SDK version 2.30.0 and upwards. Once the first part was done uploading to S3, the streaming zip operation requested write access to the buffer, but that flagged (via an exception) that there was still a reader open.
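A rough sketch of the kind of buffer described above (the names and details here are illustrative, not our actual implementation):

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * "Exclusive write, multiple read" chunk buffer: writes are rejected while any
 * reader is still open, and closing the last reader releases the buffer again.
 */
final class SharedChunkBuffer extends OutputStream {
    private final byte[] buffer;
    private int size;
    private final AtomicInteger openReaders = new AtomicInteger();

    SharedChunkBuffer(int capacity) {
        this.buffer = new byte[capacity];
    }

    @Override
    public synchronized void write(int b) {
        if (openReaders.get() > 0) {
            throw new IllegalStateException("Buffer still has open readers");
        }
        if (size == buffer.length) {
            throw new IllegalStateException("Buffer is full");
        }
        buffer[size++] = (byte) b;
    }

    /** Hands out a reader over the current contents; closing it releases the reader. */
    synchronized InputStream newReader() {
        openReaders.incrementAndGet();
        return new ByteArrayInputStream(buffer, 0, size) {
            @Override
            public void close() {
                openReaders.decrementAndGet();
            }
        };
    }

    /** Clears the buffer for the next chunk; only allowed once all readers are closed. */
    synchronized void resetForNextChunk() {
        if (openReaders.get() > 0) {
            throw new IllegalStateException("Buffer still has open readers");
        }
        size = 0;
    }
}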
I want to pinpoint a specific unfortunate abstraction. Look at StreamManagingStage::ClosingStreamProvider and ask yourself: what happens if newStream() is called multiple times before closeCurrentStream() is invoked? The answer is, of course: any non-latest currentStream value is overwritten, and those streams are leaked.
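A simplified sketch of that pattern (paraphrased for illustration, not the SDK's actual source) makes the leak easy to see:

import java.io.IOException;
import java.io.InputStream;

import software.amazon.awssdk.http.ContentStreamProvider;

// A wrapper that only remembers the most recent stream it handed out.
final class ClosingStreamProviderSketch implements ContentStreamProvider {
    private final ContentStreamProvider delegate;
    private InputStream currentStream;

    ClosingStreamProviderSketch(ContentStreamProvider delegate) {
        this.delegate = delegate;
    }

    @Override
    public InputStream newStream() {
        // If newStream() is called again before closeCurrentStream(), the previous
        // stream's reference is overwritten here and is never closed.
        currentStream = delegate.newStream();
        return currentStream;
    }

    void closeCurrentStream() {
        if (currentStream != null) {
            try {
                currentStream.close();
            } catch (IOException e) {
                // best-effort close
            }
            currentStream = null;
        }
    }
}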
Additional Information/Context

To add some unsolicited advice:
I think it's best to take a good look at the SDK's usage of streams, and revisit how input data is handled. This summary may be too simple (I may be missing something), but to my knowledge, the SDK needs to know:

- the content length of the data (or buffer the data to determine it);
- in some cases, a checksum of the data;
- and of course the content itself, to stream it into the HTTP request.
I want to clarify that I grasp the complexity of all the possible use-cases the SDK has to support. If your input is just an InputStream, you'll have to consume the stream at least twice: once to determine the length and checksum, and once more to actually stream the content to the HTTP client (during upload). I think the SDK could do a better job of abstracting the common issues away. But at its core, the problem is also fuzzy ownership (and Java doesn't help there). Once you take ownership of an InputStream, you need to close it. Sharing an InputStream between multiple readers is dangerous, and requires sanity checks to ensure the stream is not read concurrently by multiple parties.

What I think the SDK should do, to be both memory efficient and clear in its usage pattern, is either:
- Only supporting byte[] inputs (and clarifying that ownership of the byte[] is transferred to the SDK, e.g. no more writes should be performed).
- Accepting any InputStream: if it supports resetting, use resetting when needing to read multiple times, and only copy a stream's contents to an in-memory buffer if it doesn't support resetting (sketched below).

I notice attempts to avoid shared ownership issues were made in #5837 and #5841, by simply copying the data. Blatantly copying all data is an anti-pattern though. As I described in "Possible Solution", we've worked hard to make our export process memory efficient. With the AWS SDK just copying data "to be safe", you've basically taken away any control from end-users to avoid going Out-of-Memory on large uploads. I can't resist pointing out it didn't take long for that to happen, as evidenced by issue #5850.
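A sketch of what that second option could look like (the class name, the mark(Integer.MAX_VALUE) choice, and the buffering fallback are illustrative assumptions, not a proposed SDK implementation):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

import software.amazon.awssdk.http.ContentStreamProvider;
import software.amazon.awssdk.utils.IoUtils;

final class ResetOrBufferProvider implements ContentStreamProvider {
    private final InputStream original;
    private byte[] buffered;          // only populated for non-resettable streams
    private boolean handedOutBefore;

    ResetOrBufferProvider(InputStream original) {
        this.original = original;
    }

    @Override
    public InputStream newStream() {
        try {
            if (original.markSupported()) {
                if (handedOutBefore) {
                    original.reset();                  // cheap rewind, no copy
                } else {
                    original.mark(Integer.MAX_VALUE);
                }
                handedOutBefore = true;
                // Note: the same (shared) stream is returned; closing it between
                // attempts would break the reset-based rewind.
                return original;
            }
            if (buffered == null) {
                buffered = IoUtils.toByteArray(original); // last resort: copy once
            }
            return new ByteArrayInputStream(buffered);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}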
AWS Java SDK version used
2.30.0
JDK version used
openjdk 21.0.1 2023-10-17 LTS
Operating System and version
Windows 11 24H2, build 26100.2894