### Describe the bug
The following simple code, which starts a few async calls and then interrupts them, segfaults (randomly):
```cpp
#include <aws/core/Aws.h>
#include <aws/core/auth/AWSAuthSigner.h>
#include <aws/core/auth/AWSCredentialsProvider.h>
#include <aws/core/utils/Outcome.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/GetObjectRequest.h>
#include <aws/s3/model/HeadObjectRequest.h>

#include <algorithm>
#include <cstdint>
#include <deque>
#include <sstream>

const auto bucket = "test-bucket";
const auto key = "bigfile";

int main()
{
    Aws::SDKOptions options;
    Aws::InitAPI(options);

    auto prov = Aws::MakeShared<Aws::Auth::EnvironmentAWSCredentialsProvider>("EnvironmentAWSCredentialsProvider");
    Aws::Auth::AWSCredentials creds = prov->GetAWSCredentials();
    Aws::Client::ClientConfiguration config;
    auto client = Aws::MakeShared<Aws::S3::S3Client>(
        "S3Client", prov, config, Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy::Never, true);

    Aws::S3::Model::HeadObjectRequest hor;
    hor.WithBucket(bucket).WithKey(key);
    auto outcome = client->HeadObject(hor);
    if (!outcome.IsSuccess())
    {
        return -1;
    }

    uint64_t remainingFileSize = outcome.GetResult().GetContentLength();
    uint64_t readPosition = 0;
    const auto readAheadQueueSize = 15;
    const uint64_t readSectorSize = 8ULL * 1024ULL * 1024ULL;

    std::deque<Aws::S3::Model::GetObjectOutcomeCallable> queuedRequests;
    while (queuedRequests.size() < readAheadQueueSize)
    {
        uint64_t readSize = std::min(readSectorSize, remainingFileSize);
        std::ostringstream readRange;
        readRange << "bytes=" << readPosition << "-" << (readPosition + readSize - 1); // Inclusive range

        Aws::S3::Model::GetObjectRequest objectRequest;
        objectRequest.WithBucket(bucket).WithKey(key).WithRange(readRange.str());
        queuedRequests.push_back(client->GetObjectCallable(objectRequest));

        remainingFileSize -= readSize;
        readPosition += readSize;
    }

    // Interrupt the in-flight requests by destroying the client.
    client.reset();
    Aws::ShutdownAPI(options);
}
```
There are two reasons for the segfault. The first is the extended lifetime of the executor: `ClientConfiguration` creates a `shared_ptr` to the `Executor` and passes it on to the client, but because the configuration outlives the client, the `Executor` also lives longer than the client.
However, destroying the client carries an implicit assumption that the executor will run to completion (at least for already started `packaged_task`s) as part of its destructor. That wait is never triggered, since the `Executor` outlives the client. The most obvious workaround is to reset the configuration's copy (`config.executor.reset()`) right after creating the client, so the client is the sole owner of the `Executor`.
The root problem is using a `shared_ptr` while simultaneously assuming sole ownership, which cannot be guaranteed.
Secondly, there is another data race (also leading to a segfault fairly consistently) in `ClientWithAsyncTemplateMethods::ShutdownSdkClient`, which is called as part of `S3Client::~S3Client()`. There, `m_endpointProvider` is released first and `m_executor` only afterwards (and only if it is the sole owner). This is wrong, because `m_endpointProvider` can still be used by threads started by the `Executor`.
The fix here is simply to move the `m_endpointProvider.reset()` call to after `m_executor` is released:

```cpp
pClient->m_executor.reset();
pClient->m_clientConfiguration.executor.reset();
pClient->m_clientConfiguration.retryStrategy.reset();
pClient->m_endpointProvider.reset();
```
Running the original version under ThreadSanitizer yields more than 100 warnings, while the version with the above changes produces none for the attached sample code.
### Expected Behavior

No segfault and no data race.
### Current Behavior

The above code segfaults in various places, most likely on dereferencing the already released `EndpointProvider`, but also deep inside the `AttemptExhaustively` function on dereferencing the `RetryStrategy`.
### Reproduction Steps

Run the above code a few times and observe the segfaults.
### Possible Solution

No response

### Additional Information/Context

No response
### AWS CPP SDK version used

AWS SDK CPP, version 1.11.352

### Compiler and Version used

gcc 11.4.0, Visual Studio 17.11, clang 14.0

### Operating System and version

Ubuntu 22.04 and Windows 11