Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Need better docs and examples encouraging the use of socketTimeout in AWS clients for V3. SDK should warn or error when invalid V2 style params are used. #6763

Open
jeffbski-rga opened this issue Dec 30, 2024 · 7 comments
Assignees
Labels
documentation This is a problem with documentation. needs-triage This issue or PR still needs to be triaged.

Comments

@jeffbski-rga
Copy link

Describe the issue

Recently we spent several days trying to debug mysterious RequestTimeout errors that started showing up in one of our AWS environments. This was occurring with the S3 client but after understanding the issue it could easily occur for any SDK client.

We had converted from V2 to V3 months ago and did not notice any problems till recently.

After close inspection, we had missed the conversion of the V2 httpOptions.timeout to the new requestTimeout.socketTimeout in V3. Although it is not specified in the docs, the default socketTimeout must be shorter than 72s since we were hitting it often at that interval.

So it should be made much more obvious that this is an important value to increase since the default is no longer sufficient and can just start failing.

Another PR was just created with the same problem, so I will also link it here. #6762 (I commented on the solution to help the user). I guess this evidence that more docs are needed.

I would expect that the typscript docs would explain the importance of the socketTimeout and what the default value is when unset. It might also provide the type of error one might expect if this is too low (RequestTimeout). It should be made clear that this value should be set since the current default is not sufficient (at least for S3). Currently the description for this item is completely blank.

I see that you recently updated the migrating.md notable-changes section which helps if people were coming from V2 to V3. However I think it may still need some improvement to explain the importance of the socketTimeout value and what may happen if not set.

I believe that the examples provided and especially the best practices docs or examples should include this value by default to encourage its use since you really should not run with in production with the default.

It also would have been really helpful to have warnings or error messages when we create an AWS client using the old V2 parameters like httpOptions.timeout. This would have prevented us from introducing this regression bug with our code. Currently there does not appear to be any indication that you are using the old values and that they will no longer do anything. I know it's difficult to check all settings, but the ones that have clearly moved, should be errored or at least warned about so the code can be corrected.

I don't know if the other API's are as sensitive to timeouts as S3 is, but I would suspect it would be good practice to increase the socketTimeout for all of them if the default remains so low. The main one that we hit was when using lib-storage to do a multi-part upload.

The code examples for the various commands also do not show any best practices for setting the options. Many of the examples don't even show the options object at all. So while maybe it is overkill to have these details in every example, maybe you should have a comment by the options object linking to a nice doc that would spell out all the important client settings that one should typically use.

I think it would be smart for best practices to show all the details one needs to make your code production ready.

Another thing that would have been helpful for us is that when using @aws-sdk/lib-storage to do a multi-part upload, the error generated by the Upload object is just a basic AbortError with no indication of why it was aborted. So it wasn't until we turned on AWS SDK client logging that we could find the real error the RequestTimeout. So you might mention in the docs for Upload object that AbortError could be caused by a RequestTimeout and that enabling logging for errors is a smart thing. I think this tip alone deserves it's own issue so I will add one for SDK logging too.

Links

@kuhe
Copy link
Contributor

kuhe commented Jan 3, 2025

@jeffbski-rga
Copy link
Author

jeffbski-rga commented Jan 3, 2025

@kuhe Wow, thanks for mentioning this, we will update to requestTimeout. I was wondering why there were so many similar looking settings.

Regarding the issue we encountered, in the afternoon of 12/16/2024, our lambda code which peforms a streaming read from S3, transform records, and streaming write to S3 using the @aws-sdk/lib-storage Upload to do a multi-part upload, started encountering RequestTimeouts in 30-40% of our executions. Looking at the logs, the stream would often make progress for several minutes and then just timeout. The time between the last successful S3 call till the one that failed was about 72s in many cases. In the previous several months, logs only showed it happening maybe 3 times, but now it was happening regularly 1000s of times every hour. This code has been in place for years and worked well even after upgrading to AWS SDK V3 earlier in the year. The problem got persistently worse failing up to 50% of the time on 12/17/2024.

This problem persisted and we involved AWS support team (both lambda and S3 teams). After exhausting all ideas and suggestions, we enabled more logging and were able to zero in on the fact that it was a RequestTimeout that was the root cause.

After realizing the timeout property had changed to socketTimeout (from V2 to V3), we set this to 15m and everything started succeeding properly. So it appears that for whatever reason, there is a default s3 client socketTimeout/requestTimeout that is less than 72s. If this was due to something else like a Node.js timeout then I don't believe setting socketTimeout would have resolved it. I will be happy to discuss further.

@jeffbski-rga
Copy link
Author

jeffbski-rga commented Jan 3, 2025

Also note that another user also reported requestTimeout issue after they migrated to V3 (and did not have socketTimeout or requestTimeout set). So there appears to be some default timeout when it is not set. See #6762
Adding a value resolved their issue too.

@jeffbski-rga
Copy link
Author

We hit the RequestTimeout while using:
@aws-sdk/[email protected]
@aws-sdk/[email protected]
@smithy/[email protected]
in case the code has changed since then

@chbiel
Copy link

chbiel commented Jan 15, 2025

To add context from us:

We ran @aws-sdk/[email protected] => everything was good
After updating to @aws-sdk/[email protected] => the issue started to appear.
Downgrading solves it again.
My bet are the dependency updates in 3.709 but did not dive deeper yet.

It appears only on putObject calls. getObject stlll seem to work.
We don't use explicit Streaming but directly putObject strings.

Our application code (beside dependencies) did not change for over 6 months at that place (that was migrating to V3).

What we also see with that version update are following warnings:

@smithy/node-http-handler:WARN - socket usage at capacity=50 and 3085 additional requests are enqueued. See https://docs.aws.amazon.com/sdk-for-javascript/v3/developer-guide/node-configuring-maxsockets.html or increase socketAcquisitionWarningTimeout=(millis) in the NodeHttpHandler config.

Which rarely appeared before the update but now appear on most uploads every 20 seconds.
The notable thing here is, that the number of enqueued requests stays the same until the Lambda functions times out after 15 minutes.

We will also no try to update to requestTimeout and see if that solves the problem (we also still have httpOptions configured).

@jeffbski-rga
Copy link
Author

@chbiel Another thing I learned is that AWS SDK V3 sets a default maxSockets to 50 if you don't set it, so you can increase it so more requests can go in parallel.

const awsOptions = {
    ...
    requestHandler: {
        httpsAgent: { keepAlive: true, maxSockets: 5000 },
        requestTimeout: 300 * 1000 // 5m
      }
    }

I also flag keepAlive true just to ensure it was true, default might be fine.

Here's the article which mentions the default is 50 max sockets and how to increase it.
https://docs.aws.amazon.com/sdk-for-javascript/v3/developer-guide/node-configuring-maxsockets.html

@chbiel
Copy link

chbiel commented Jan 17, 2025

@jeffbski-rga thanks for the hint, added it :)

all in all using the requestHandler configuration solved our problems.

things I would like to see improved:

  1. the previously mentioned error message is very confusing in this context. There should be a more explicit error message instead of only showing those two kinds of error messages:
    Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
    and
@smithy/node-http-handler:WARN - socket usage at capacity=50 and 3085 additional requests are enqueued. See https://docs.aws.amazon.com/sdk-for-javascript/v3/developer-guide/node-configuring-maxsockets.html or increase socketAcquisitionWarningTimeout=(millis) in the NodeHttpHandler config.

But I also cannot tell what error message would be helpful here beside "You use a wrong configuration, please use requestHandler config"...

  1. The requestTimeout configuration in combination with keepAlive seems to behave differently then the previous httpOptions.timeout configuration. Setting it the the same value as before still leads to errors for bigger payload to put into S3.
    The only reason I can imagine right now is, that the new requestTimeout sets the timeout for the overall connection that is reused to send several payload. In comparison httpOptions.timeout seem to be used for every single request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation This is a problem with documentation. needs-triage This issue or PR still needs to be triaged.
Projects
None yet
Development

No branches or pull requests

4 participants