AWSClient segfaults after a data race #3084

k0zmo · 2024-08-21T21:21:38Z

Describe the bug

Following simple code that starts a few async calls and then interrupts them segfaults (randomly):

#include <aws/core/Aws.h>
#include <aws/core/auth/AWSAuthSigner.h>
#include <aws/core/auth/AWSCredentialsProvider.h>
#include <aws/core/utils/Outcome.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/GetObjectRequest.h>
#include <aws/s3/model/HeadObjectRequest.h>

#include <algorithm>

const auto bucket = "test-bucket";
const auto key = "bigfile";

int main()
{
    Aws::SDKOptions options;
    Aws::InitAPI(options);

    auto prov = Aws::MakeShared<Aws::Auth::EnvironmentAWSCredentialsProvider>("EnvironmentAWSCredentialsProvider");
    Aws::Auth::AWSCredentials creds = prov->GetAWSCredentials();
    Aws::Client::ClientConfiguration config;

    auto client = Aws::MakeShared<Aws::S3::S3Client>( "S3Client", prov, config, Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy::Never, true );

    Aws::S3::Model::HeadObjectRequest hor;
    hor.WithBucket(bucket).WithKey(key);
    auto outcome = client->HeadObject(hor);
    if (!outcome.IsSuccess())
    {
        return -1;
    }

    uint64_t remainingFileSize = outcome.GetResult().GetContentLength();
    uint64_t readPosition = 0;
    const auto readAheadQueueSize = 15;
    const auto readSectorSize = 8Ull * 1024Ull * 1024Ull;

    std::deque<Aws::S3::Model::GetObjectOutcomeCallable> queuedRequests;

    while (queuedRequests.size() < readAheadQueueSize)
    {
        uint64_t readSize = std::min(readSectorSize, remainingFileSize);

        std::ostringstream readRange;
        readRange << "bytes=" << readPosition << "-" << (readPosition + readSize - 1); // Inclusive range

        Aws::S3::Model::GetObjectRequest objectRequest;
        objectRequest.WithBucket(bucket).WithKey(key).WithRange(readRange.str());

        queuedRequests.push_back(client->GetObjectCallable(objectRequest));

        remainingFileSize -= readSize;
        readPosition += readSize;
    }

    client.reset();

    Aws::ShutdownAPI(options);
}

There are two reasons for the segfault. One is extended lifetime of executor - ClientConfiguration creates a shared_ptr to Executor, passes it onto the client but because its lifetime is longer than a client, it makes the Executor to live longer than the client.

However, there's an implicit assumption when destroying the client that the executor will run to its completion (at least for already started packaged_tasks) as part of its destructor. This obviously is not invoked as Executor outlives the client. The most obvious fix is to add client.executor.reset() after creating an instance of client so it is the only owner of Executor.

The problem stems from using shared_ptr and simultaneously assuming it is the only owner - which cannot be guaranteed.

Secondly, there's another data race (also leading to segfault pretty consistently) in ClientWithAsyncTemplateMethods::ShutdownSdkClient which is called as part of S3Client::~S3Client(). There, we first release the m_endpointProvider and then m_executor (but only if it's the only owner). This is wrong as m_endpointProvider can still be used by threads started by the Executor - such as:

The fix here is simply to move the m_endpointProvider.reset() call after m_executor is released:

            pClient->m_executor.reset();
            pClient->m_clientConfiguration.executor.reset();
            pClient->m_clientConfiguration.retryStrategy.reset();
            pClient->m_endpointProvider.reset();

Running original version with thread sanitizer yields more than 100 warnings while the above changes gives none for attached sample code

Expected Behavior

No segfault nor datarace

Current Behavior

Above code segfault in various places, most likely on dereferencing already released EndpointProvider but also deep inside the AttemptExhaustively function on dereferencing RetryStrategy

Reproduction Steps

Run above code few times and observe segfaults

Possible Solution

No response

Additional Information/Context

No response

AWS CPP SDK version used

AWS SDK CPP, version 1.11.352

Compiler and Version used

gcc 11.4.0, Visual Studio 17.11, clang 14.0

Operating System and version

Ubuntu 22.04 and Windows 11

The text was updated successfully, but these errors were encountered:

jmklix · 2024-08-23T22:54:43Z

Can you try adding brackets around the sdk code so that it is all in the same scope and is cleaned up correctly before Aws::ShutdownAPI is called? This is mentioned in our basic use here:

#include <aws/core/Aws.h>
int main(int argc, char** argv)
{
   Aws::SDKOptions options;
   Aws::InitAPI(options);
   {
      // make your SDK calls here.
   }
   Aws::ShutdownAPI(options);
   return 0;
}

…e shutdown behavior

…down behavior #3084

SergeyRyabinin · 2024-08-26T17:45:28Z

Hi @k0zmo ,

thank you for bringing and describing in details this issue.
Additional thank you for pointing to the ClientConfiguration object storing extra reference to the Executor.

I agree with @jmklix in referring to our basic-use instruction: this CPP SDK is designed in a such a way that it requires extra scope, and extra caution in tracking async running tasks from the user.

However, this is a recurring topic, so I submitted a PR #3087 having your suggestion to re-arrange the destruction order, as well as, in addition, some refactoring to use factories by default within the ClientConfiguration, and some other minor changes to track client "is initialized" checks. I hope we are going to merge this PR soon, while it is not going to solve all "Init-Shutdown-memory-allocator-async-tasks-tracking" kind of issues, it is still an improvement.

Best regards,
Sergey

k0zmo · 2024-08-26T18:46:14Z

@jmklix Limiting a scope for AWSClient like you shown doesn't make difference. The problem I reported comes from an entangled lifetime between ClientConfiguration and AWSClient, not AWSClient and the global SDK state.

…down behavior #3084

github-actions · 2024-08-27T17:12:45Z

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.

k0zmo added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Aug 21, 2024

jmklix added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. p3 This is a minor priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Aug 23, 2024

SergeyRyabinin added a commit that referenced this issue Aug 26, 2024

#3084 use factories for client utility classes initialization, improv…

11c124f

…e shutdown behavior

SergeyRyabinin added a commit that referenced this issue Aug 26, 2024

use factories for client utility classes initialization, improve shut…

c7a96b2

…down behavior #3084

jmklix added pending-release This issue will be fixed by an approved PR that hasn't been released yet. and removed response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. labels Aug 26, 2024

jmklix linked a pull request Aug 26, 2024 that will close this issue

#3084: use factories for client utility classes initialization, better shutdown #3087

Merged

11 tasks

SergeyRyabinin added a commit that referenced this issue Aug 26, 2024

use factories for client utility classes initialization, improve shut…

5bc5f83

…down behavior #3084

SergeyRyabinin added a commit that referenced this issue Aug 26, 2024

use factories for client utility classes initialization, improve shut…

1efcfdd

…down behavior #3084

SergeyRyabinin added a commit that referenced this issue Aug 26, 2024

use factories for client utility classes initialization, improve shut…

9d4e4ee

…down behavior #3084

SergeyRyabinin mentioned this issue Aug 26, 2024

#3084: use factories for client utility classes initialization, better shutdown #3087

Merged

11 tasks

SergeyRyabinin added a commit that referenced this issue Aug 27, 2024

use factories for client utility classes initialization, improve shut…

9e8e71d

…down behavior #3084

SergeyRyabinin closed this as completed in #3087 Aug 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AWSClient segfaults after a data race #3084

AWSClient segfaults after a data race #3084

k0zmo commented Aug 21, 2024

jmklix commented Aug 23, 2024

SergeyRyabinin commented Aug 26, 2024

k0zmo commented Aug 26, 2024

github-actions bot commented Aug 27, 2024