Skip to content

AWSClient segfaults after a data race #3084

Closed
@k0zmo

Description

@k0zmo

Describe the bug

Following simple code that starts a few async calls and then interrupts them segfaults (randomly):

#include <aws/core/Aws.h>
#include <aws/core/auth/AWSAuthSigner.h>
#include <aws/core/auth/AWSCredentialsProvider.h>
#include <aws/core/utils/Outcome.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/GetObjectRequest.h>
#include <aws/s3/model/HeadObjectRequest.h>

#include <algorithm>

const auto bucket = "test-bucket";
const auto key = "bigfile";

int main()
{
    Aws::SDKOptions options;
    Aws::InitAPI(options);

    auto prov = Aws::MakeShared<Aws::Auth::EnvironmentAWSCredentialsProvider>("EnvironmentAWSCredentialsProvider");
    Aws::Auth::AWSCredentials creds = prov->GetAWSCredentials();
    Aws::Client::ClientConfiguration config;

    auto client = Aws::MakeShared<Aws::S3::S3Client>( "S3Client", prov, config, Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy::Never, true );

    Aws::S3::Model::HeadObjectRequest hor;
    hor.WithBucket(bucket).WithKey(key);
    auto outcome = client->HeadObject(hor);
    if (!outcome.IsSuccess())
    {
        return -1;
    }

    uint64_t remainingFileSize = outcome.GetResult().GetContentLength();
    uint64_t readPosition = 0;
    const auto readAheadQueueSize = 15;
    const auto readSectorSize = 8Ull * 1024Ull * 1024Ull;

    std::deque<Aws::S3::Model::GetObjectOutcomeCallable> queuedRequests;

    while (queuedRequests.size() < readAheadQueueSize)
    {
        uint64_t readSize = std::min(readSectorSize, remainingFileSize);

        std::ostringstream readRange;
        readRange << "bytes=" << readPosition << "-" << (readPosition + readSize - 1); // Inclusive range

        Aws::S3::Model::GetObjectRequest objectRequest;
        objectRequest.WithBucket(bucket).WithKey(key).WithRange(readRange.str());

        queuedRequests.push_back(client->GetObjectCallable(objectRequest));

        remainingFileSize -= readSize;
        readPosition += readSize;
    }

    client.reset();

    Aws::ShutdownAPI(options);
}

There are two reasons for the segfault. One is extended lifetime of executor - ClientConfiguration creates a shared_ptr to Executor, passes it onto the client but because its lifetime is longer than a client, it makes the Executor to live longer than the client.

However, there's an implicit assumption when destroying the client that the executor will run to its completion (at least for already started packaged_tasks) as part of its destructor. This obviously is not invoked as Executor outlives the client. The most obvious fix is to add client.executor.reset() after creating an instance of client so it is the only owner of Executor.

The problem stems from using shared_ptr and simultaneously assuming it is the only owner - which cannot be guaranteed.

Secondly, there's another data race (also leading to segfault pretty consistently) in ClientWithAsyncTemplateMethods::ShutdownSdkClient which is called as part of S3Client::~S3Client(). There, we first release the m_endpointProvider and then m_executor (but only if it's the only owner). This is wrong as m_endpointProvider can still be used by threads started by the Executor - such as:
image
The fix here is simply to move the m_endpointProvider.reset() call after m_executor is released:

            pClient->m_executor.reset();
            pClient->m_clientConfiguration.executor.reset();
            pClient->m_clientConfiguration.retryStrategy.reset();
            pClient->m_endpointProvider.reset();

Running original version with thread sanitizer yields more than 100 warnings while the above changes gives none for attached sample code

Expected Behavior

No segfault nor datarace

Current Behavior

Above code segfault in various places, most likely on dereferencing already released EndpointProvider but also deep inside the AttemptExhaustively function on dereferencing RetryStrategy

Reproduction Steps

Run above code few times and observe segfaults

Possible Solution

No response

Additional Information/Context

No response

AWS CPP SDK version used

AWS SDK CPP, version 1.11.352

Compiler and Version used

gcc 11.4.0, Visual Studio 17.11, clang 14.0

Operating System and version

Ubuntu 22.04 and Windows 11

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugThis issue is a bug.p3This is a minor priority issuepending-releaseThis issue will be fixed by an approved PR that hasn't been released yet.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions