Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWSClient segfaults after a data race #3084

Closed
k0zmo opened this issue Aug 21, 2024 · 4 comments · Fixed by #3087
Closed

AWSClient segfaults after a data race #3084

k0zmo opened this issue Aug 21, 2024 · 4 comments · Fixed by #3087
Labels
bug This issue is a bug. p3 This is a minor priority issue pending-release This issue will be fixed by an approved PR that hasn't been released yet.

Comments

@k0zmo
Copy link

k0zmo commented Aug 21, 2024

Describe the bug

Following simple code that starts a few async calls and then interrupts them segfaults (randomly):

#include <aws/core/Aws.h>
#include <aws/core/auth/AWSAuthSigner.h>
#include <aws/core/auth/AWSCredentialsProvider.h>
#include <aws/core/utils/Outcome.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/GetObjectRequest.h>
#include <aws/s3/model/HeadObjectRequest.h>

#include <algorithm>

const auto bucket = "test-bucket";
const auto key = "bigfile";

int main()
{
    Aws::SDKOptions options;
    Aws::InitAPI(options);

    auto prov = Aws::MakeShared<Aws::Auth::EnvironmentAWSCredentialsProvider>("EnvironmentAWSCredentialsProvider");
    Aws::Auth::AWSCredentials creds = prov->GetAWSCredentials();
    Aws::Client::ClientConfiguration config;

    auto client = Aws::MakeShared<Aws::S3::S3Client>( "S3Client", prov, config, Aws::Client::AWSAuthV4Signer::PayloadSigningPolicy::Never, true );

    Aws::S3::Model::HeadObjectRequest hor;
    hor.WithBucket(bucket).WithKey(key);
    auto outcome = client->HeadObject(hor);
    if (!outcome.IsSuccess())
    {
        return -1;
    }

    uint64_t remainingFileSize = outcome.GetResult().GetContentLength();
    uint64_t readPosition = 0;
    const auto readAheadQueueSize = 15;
    const auto readSectorSize = 8Ull * 1024Ull * 1024Ull;

    std::deque<Aws::S3::Model::GetObjectOutcomeCallable> queuedRequests;

    while (queuedRequests.size() < readAheadQueueSize)
    {
        uint64_t readSize = std::min(readSectorSize, remainingFileSize);

        std::ostringstream readRange;
        readRange << "bytes=" << readPosition << "-" << (readPosition + readSize - 1); // Inclusive range

        Aws::S3::Model::GetObjectRequest objectRequest;
        objectRequest.WithBucket(bucket).WithKey(key).WithRange(readRange.str());

        queuedRequests.push_back(client->GetObjectCallable(objectRequest));

        remainingFileSize -= readSize;
        readPosition += readSize;
    }

    client.reset();

    Aws::ShutdownAPI(options);
}

There are two reasons for the segfault. One is extended lifetime of executor - ClientConfiguration creates a shared_ptr to Executor, passes it onto the client but because its lifetime is longer than a client, it makes the Executor to live longer than the client.

However, there's an implicit assumption when destroying the client that the executor will run to its completion (at least for already started packaged_tasks) as part of its destructor. This obviously is not invoked as Executor outlives the client. The most obvious fix is to add client.executor.reset() after creating an instance of client so it is the only owner of Executor.

The problem stems from using shared_ptr and simultaneously assuming it is the only owner - which cannot be guaranteed.

Secondly, there's another data race (also leading to segfault pretty consistently) in ClientWithAsyncTemplateMethods::ShutdownSdkClient which is called as part of S3Client::~S3Client(). There, we first release the m_endpointProvider and then m_executor (but only if it's the only owner). This is wrong as m_endpointProvider can still be used by threads started by the Executor - such as:
image
The fix here is simply to move the m_endpointProvider.reset() call after m_executor is released:

            pClient->m_executor.reset();
            pClient->m_clientConfiguration.executor.reset();
            pClient->m_clientConfiguration.retryStrategy.reset();
            pClient->m_endpointProvider.reset();

Running original version with thread sanitizer yields more than 100 warnings while the above changes gives none for attached sample code

Expected Behavior

No segfault nor datarace

Current Behavior

Above code segfault in various places, most likely on dereferencing already released EndpointProvider but also deep inside the AttemptExhaustively function on dereferencing RetryStrategy

Reproduction Steps

Run above code few times and observe segfaults

Possible Solution

No response

Additional Information/Context

No response

AWS CPP SDK version used

AWS SDK CPP, version 1.11.352

Compiler and Version used

gcc 11.4.0, Visual Studio 17.11, clang 14.0

Operating System and version

Ubuntu 22.04 and Windows 11

@k0zmo k0zmo added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Aug 21, 2024
@jmklix
Copy link
Member

jmklix commented Aug 23, 2024

Can you try adding brackets around the sdk code so that it is all in the same scope and is cleaned up correctly before Aws::ShutdownAPI is called? This is mentioned in our basic use here:

#include <aws/core/Aws.h>
int main(int argc, char** argv)
{
   Aws::SDKOptions options;
   Aws::InitAPI(options);
   {
      // make your SDK calls here.
   }
   Aws::ShutdownAPI(options);
   return 0;
}

@jmklix jmklix added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. p3 This is a minor priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Aug 23, 2024
SergeyRyabinin added a commit that referenced this issue Aug 26, 2024
@SergeyRyabinin
Copy link
Contributor

Hi @k0zmo ,

thank you for bringing and describing in details this issue.
Additional thank you for pointing to the ClientConfiguration object storing extra reference to the Executor.

I agree with @jmklix in referring to our basic-use instruction: this CPP SDK is designed in a such a way that it requires extra scope, and extra caution in tracking async running tasks from the user.

However, this is a recurring topic, so I submitted a PR #3087 having your suggestion to re-arrange the destruction order, as well as, in addition, some refactoring to use factories by default within the ClientConfiguration, and some other minor changes to track client "is initialized" checks. I hope we are going to merge this PR soon, while it is not going to solve all "Init-Shutdown-memory-allocator-async-tasks-tracking" kind of issues, it is still an improvement.

Best regards,
Sergey

@k0zmo
Copy link
Author

k0zmo commented Aug 26, 2024

@jmklix Limiting a scope for AWSClient like you shown doesn't make difference. The problem I reported comes from an entangled lifetime between ClientConfiguration and AWSClient, not AWSClient and the global SDK state.

@jmklix jmklix added pending-release This issue will be fixed by an approved PR that hasn't been released yet. and removed response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. labels Aug 26, 2024
Copy link

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. p3 This is a minor priority issue pending-release This issue will be fixed by an approved PR that hasn't been released yet.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants