-
I feel like I already know the answer to this, and it's not the one I want, but is there any such thing as a Stable Version in this project? It just seems like a constant stream of new tags, every day or so, with no formal or consistent release notes and no real high-level indication of what's changed or whether a particular release is good or bad.

Our product has used this SDK for many years: 1.3.10, 1.4.76, 1.7.83, 1.7.123, 1.7.301, 1.9.335, and back to 1.7.301 for unclear reasons (the developer has since departed). I inherited the long-overdue dependency-update task. About six months ago I tried some newer versions, most recently 1.11.376, but there seems to be an S3 access performance cliff somewhere before that point, so I literally did a brute-force bisection and found a subjective ~25% performance drop between 1.11.211 and 1.11.212, although reading the (substantial) diff between those two versions gave no real insight into what might have caused it. Even 1.11.211 was noticeably slower than 1.7.301 in high-level profiling. Obviously I'm aware that any profiling of S3 access is subject to the whims of the Internet, but I tried multiple runs at different times of day and the mean difference was still clear.

There have been nearly 150 new versions since that point, now that I'm looking at it again. I'm about to build with the latest, 1.11.517, and compare the performance again with my most recent fave, 1.11.211, and with our current 1.7.301.

It's also not clear what version of curl we should be using underneath it, plus we have Apache Arrow in the mix too, which we want to update from 9 to at least 16. Our unit tests that go through Arrow didn't seem as affected by the later versions, so I tried stealing the endpoint provider cache implementation from Arrow for our own non-Arrow S3 access code paths, which caches …

Any and all suggestions warmly welcomed.
-
Hey @simoneves, thanks for reaching out, and sorry about any frustration you may have; let's see if we can't work through it.

The way versioning works on this project is that the current minor version, 1.11.x, receives daily updates: the latest 1.11.x patch gets model updates and bug fixes, and older versions do not. So if you want the freshest version of the SDK, the latest 1.11.x will have all AWS models and bug fixes. To answer your question: there is no "stable version", there is more of a "currently worked on version".

Each day we release new models for AWS services; these models add new functionality. For instance, last Friday's release added several features for different AWS services, which you can see reflected in that diff. That diff is not very helpful on its own, though, and we plan to move to something like how go-v2 does it, but I cannot give a time frame on when we will get to that. All releases should be "good" and go through daily CI.

The version you are asking about specifically introduced support for S3 Express, which we had to do some work to support. But we've also pushed bug fixes and several S3 features since, so really the best baseline is the current HEAD of the project. Also, you don't mention what the performance drop is in. What are you doing, ideally as a reproducible example, when you see the drop?

We don't have a recommended curl version; it's more or less bring whatever version your system supports. I would defer to the curl project to determine the best version. That said, there are some known buggy versions of curl on macOS specifically that we have called out in the past.

Are you creating and destroying lots of S3Clients?
-
No, you're fine! We love finding areas where we can help people, and if we're doing something bad we like hearing about it before we do more bad!

So this is something that is known, and pyarrow has actually called it out to us before in #2880. tl;dr: the logic for creating endpoint providers each time is fairly expensive, so avoid creating lots of clients. If they share an endpoint provider, as pyarrow does, this hit can be reduced to a single occurrence. One general piece of advice: if possible, avoid creating clients too many times. The client is thread safe, and one should prefer calling the async methods on a single client to trying to manage many clients.
Yeah, this might be tricky to find. If you can narrow down the usage that is slower, we would love to take a look and hopefully fix it. Where we spend a lot of time monitoring performance is the hot paths like GetObject and PutObject, where we should have caught a regression; but if we didn't, we're more than happy to sort it out. I know this complicates a simple update task, but if you want to try the S3 CRT client instead of the normal S3Client, you will almost certainly see a performance increase. It is essentially the same S3 client except that GetObject, PutObject, and CopyObject are supercharged through the CRT, which customers have usually found to be better. FWIW, the big change between 1.7/1.8 and 1.9 was the introduction of the CRT libraries into the SDK, which may be why the previous engineer downgraded: they can be tricky to work with, but should be a drop-in replacement.
-
So before going further: how are you actually building and distributing the old version? From our end it doesn't compile with warnings enabled on current platforms. How are you building, installing, and using it? i.e.

```dockerfile
FROM public.ecr.aws/amazonlinux/amazonlinux:2023
RUN yum groupinstall "Development Tools" -y
RUN yum install -y curl-devel openssl-devel ninja-build cmake3
# build and install SDK
RUN git clone -b 1.7.301 --recurse-submodules https://github.com/aws/aws-sdk-cpp && \
    cd aws-sdk-cpp && \
    mkdir build && \
    cd build && \
    cmake -DBUILD_ONLY="s3" -DCMAKE_INSTALL_PREFIX=/sdk-install -DAUTORUN_UNIT_TESTS=OFF .. && \
    cmake --build . && \
    cmake --install .
```

So I'm even having a hard time profiling a diff between the two. That aside, when you talk about measuring different things, it's really hard to know what you are measuring and how, so let's focus on one example and build from there. Here's a Google Benchmark test that profiles calling GetObject 20 times on a 100 MB file. What performance difference do you see using that?
CMakeLists.txt

```cmake
cmake_minimum_required(VERSION 3.13)
project(sdk_benchmark)

set(CMAKE_CXX_STANDARD 20)

include(FetchContent)
FetchContent_Declare(gbench
    GIT_REPOSITORY https://github.com/google/benchmark
    GIT_TAG v1.8.3
)
FetchContent_MakeAvailable(gbench)

find_package(AWSSDK REQUIRED COMPONENTS s3)

add_executable(${PROJECT_NAME} "src/main.cpp")
target_link_libraries(${PROJECT_NAME} benchmark::benchmark ${AWSSDK_LINK_LIBRARIES})
```

src/main.cpp

```cpp
#include <aws/core/Aws.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/ListBucketsRequest.h>
#include <aws/s3/model/ListObjectsV2Request.h>
#include <aws/s3/model/CreateBucketRequest.h>
#include <aws/s3/model/PutObjectRequest.h>
#include <aws/s3/model/GetObjectRequest.h>
#include <benchmark/benchmark.h>
#include <algorithm>  // std::ranges::find_if
#include <cassert>
#include <fstream>

using namespace Aws;
using namespace Aws::S3;
using namespace Aws::S3::Model;

static const char *BUCKET_NAME = "your-bucket-name-unique";
static const char *KEY = "key";
static const char *LOG_TAG = "s3benchmark";
static const int ITERATIONS = 20;
static SDKOptions s_options;

struct TestFilePair {
    const char *file_name;
    StorageClass storage_class;
};

static void CreateBucketUploadTestObjects() {
    S3Client client{};
    // Create the bucket if it doesn't exist
    auto list_buckets_outcome = client.ListBuckets(ListBucketsRequest());
    assert(list_buckets_outcome.IsSuccess());
    Vector<Bucket> buckets{};
    auto listed_buckets = list_buckets_outcome.GetResult().GetBuckets();
    buckets.insert(buckets.end(), listed_buckets.begin(), listed_buckets.end());
    while (!list_buckets_outcome.GetResult().GetContinuationToken().empty()) {
        list_buckets_outcome = client.ListBuckets(ListBucketsRequest().WithContinuationToken(list_buckets_outcome.GetResult().GetContinuationToken()));
        assert(list_buckets_outcome.IsSuccess());
        listed_buckets = list_buckets_outcome.GetResult().GetBuckets();
        buckets.insert(buckets.end(), listed_buckets.begin(), listed_buckets.end());
    }
    if (std::ranges::find_if(buckets, [](const Bucket &bucket) { return bucket.GetName() == BUCKET_NAME; }) == buckets.end()) {
        auto create_bucket_output = client.CreateBucket(CreateBucketRequest().WithBucket(BUCKET_NAME));
        assert(create_bucket_output.IsSuccess());
        benchmark::DoNotOptimize(create_bucket_output);
    }
    // Create objects if they don't exist
    auto list_objects_outcome = client.ListObjectsV2(ListObjectsV2Request().WithBucket(BUCKET_NAME));
    assert(list_objects_outcome.IsSuccess());
    Vector<Object> objects{};
    objects.insert(objects.end(), list_objects_outcome.GetResult().GetContents().begin(), list_objects_outcome.GetResult().GetContents().end());
    while (!list_objects_outcome.GetResult().GetNextContinuationToken().empty()) {
        list_objects_outcome = client.ListObjectsV2(ListObjectsV2Request().WithBucket(BUCKET_NAME).WithContinuationToken(list_objects_outcome.GetResult().GetNextContinuationToken()));
        assert(list_objects_outcome.IsSuccess());
        objects.insert(objects.end(), list_objects_outcome.GetResult().GetContents().begin(), list_objects_outcome.GetResult().GetContents().end());
    }
    if (std::ranges::find_if(objects, [](const Object &object) { return object.GetKey() == KEY; }) == objects.end()) {
        auto put_object_request = PutObjectRequest().WithBucket(BUCKET_NAME).WithKey(KEY);
        std::shared_ptr<IOStream> body = Aws::MakeShared<FStream>(LOG_TAG, "test-files/100mb.txt", std::ios::binary | std::ios::in);
        put_object_request.SetBody(body);
        auto put_object_response = client.PutObject(put_object_request);
        assert(put_object_response.IsSuccess());
        benchmark::DoNotOptimize(put_object_response);
    }
}

static void DoSetup(const benchmark::State &state) {
    InitAPI(s_options);
    CreateBucketUploadTestObjects();
}

static void DoTeardown(const benchmark::State &state) {
    ShutdownAPI(s_options);
}

static void BM_S3StandardTierFile(benchmark::State &state) {
    S3Client client{};
    for (auto _ : state) {
        auto get_object_output = client.GetObject(GetObjectRequest().WithBucket(BUCKET_NAME).WithKey(KEY));
        assert(get_object_output.IsSuccess());
        benchmark::DoNotOptimize(get_object_output);
    }
}

BENCHMARK(BM_S3StandardTierFile)
    ->Setup(DoSetup)
    ->Teardown(DoTeardown)
    ->MeasureProcessCPUTime()
    ->Iterations(ITERATIONS)
    ->Unit(benchmark::kMillisecond)
    ->UseRealTime();
BENCHMARK_MAIN();
```

Also, it goes without saying, but I appreciate you looking into it in depth.
-
@sbiscigl thank you for the above, but... I added some more profiling, and it seems that even just constructing a `ClientConfiguration` (or these days, an `S3ClientConfiguration`) was what was taking the two seconds. Literally a low-level timer around the constructor line showed that.

After some wailing and gnashing of teeth with my remote colleagues, one of them then came across this...

#1410

...and the related section of this...

https://github.com/aws/aws-sdk-cpp/wiki/What%E2%80%99s-New-in-AWS-SDK-for-CPP-Version-1.8

To cut a long story short, if one does indeed set `AWS_EC2_METADATA_DISABLED=true`, everything then runs blazing fast. Even the build with 1.7.301 runs WAY faster, presumably because the expensive client inits I was seeing even with that version are now also quick. The particular unit test which used to take 67s, and which increased to 122s with 1.11.211 and 215s with 1.11.212 and later, now runs in just under 6s. This is both excellent AND very frustrating, but oh well.

I guess my question now is, what would be the OFFICIAL C++ API way of disabling the automatic EC2 region lookup, rather than having to set that environment variable? It also begs the question of why the lookup behaves this way in the first place. But frankly, I don't care, because even the environment-variable workaround gets us where we need to be. I will report back again when I've done some more testing.