Intermittently getting EndpointDiscoveryException's thrown when attempting to write records to timestream. #2940

Open
brysn opened this issue Jun 7, 2024 · 7 comments
Labels
investigating This issue is being investigated and/or work is in progress to resolve the issue. p2 This is a standard priority issue

Comments

@brysn

brysn commented Jun 7, 2024

Describe the bug

I have a webhook endpoint that consumes messages from a third party and writes records into AWS Timestream using the Aws\TimestreamWrite\TimestreamWriteClient::writeRecords method. It works for long periods, but intermittently it starts throwing exceptions. When this happens, roughly one third of requests fail with this exception while the remaining two thirds continue to work.

Here are the exception details:

| Property | Value |
| --- | --- |
| class | Aws\Exception\AwsException |
| awsErrorCode | EndpointDiscoveryException |
| awsErrorMessage | The endpoint required for this service is currently unable to be retrieved, and your request can not be fulfilled unless you manually specify an endpoint. |
| previous exception | Error executing "DescribeEndpoints" on "https://ingest.timestream.us-west-2.amazonaws.com"; AWS HTTP error: cURL error 6: getaddrinfo() thread failed to start (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for https://ingest.timestream.us-west-2.amazonaws.com |

Expected Behavior

According to https://curl.se/libcurl/c/libcurl-errors.html the curl error means: Could not resolve host. The given remote host was not resolved.

I expected the records to be written to Timestream.

Current Behavior

It appears that, intermittently, the endpoint discovery call fails because the host https://ingest.timestream.us-west-2.amazonaws.com cannot be resolved.

Reproduction Steps

<?php

use Aws\Credentials\Credentials;
use Aws\Exception\AwsException;
use Aws\Result;
use Aws\TimestreamWrite\Exception\TimestreamWriteException;
use Aws\TimestreamWrite\TimestreamWriteClient;

$awsKey = '{{ INSERT AWS KEY }}';
$awsSecret = '{{ INSERT AWS SECRET }}';
$awsTimestreamDatabase = '{{ INSERT TIMESTREAM DB }}';
$awsTimestreamTable = '{{ INSERT TIMESTREAM TABLE }}';
$awsTimestreamRecords = []; //INSERT RECORDS TO BE INSERTED

$client = new TimestreamWriteClient([
    'version' => 'latest',
    'region' => 'us-west-2',
    'credentials' => new Credentials($awsKey, $awsSecret),
]);

// This is what throws the exception intermittently
$response = $client->writeRecords([
    'DatabaseName' => $awsTimestreamDatabase,
    'Records' => $awsTimestreamRecords,
    'TableName' => $awsTimestreamTable,
]);

Possible Solution

I'm guessing there is some issue with the cURL setup here, as I doubt that the AWS endpoint itself is failing this often.

Additional Information/Context

This happens hundreds, if not thousands, of times per day. It seems to occur when more requests are being sent at the same time. For example, overnight there are far fewer requests and everything seems to work fine, but during the day, when the endpoint is flooded with requests, about one third of them fail with this exception.
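
Until the root cause is found, one possible stopgap is to retry the failing call. The following is a minimal sketch under stated assumptions: `retryWithBackoff` is a hypothetical helper, not part of the AWS SDK, and the attempt count and delays are arbitrary. It only masks the symptom; it does not explain why resolution fails.

```php
<?php

/**
 * Hypothetical helper (not part of the AWS SDK): run $operation and retry
 * with exponential backoff while $isRetryable($exception) returns true.
 */
function retryWithBackoff(
    callable $operation,
    callable $isRetryable,
    int $maxAttempts = 3,
    int $baseDelayMs = 100
): mixed {
    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation();
        } catch (\Throwable $e) {
            // Give up on the last attempt or on errors we do not want to retry.
            if ($attempt >= $maxAttempts || !$isRetryable($e)) {
                throw $e;
            }
            // Sleep baseDelay, 2x baseDelay, 4x baseDelay, ... between attempts.
            usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
        }
    }
}

// Possible usage with the client from the reproduction steps (untested sketch):
// $response = retryWithBackoff(
//     fn () => $client->writeRecords([
//         'DatabaseName' => $awsTimestreamDatabase,
//         'Records' => $awsTimestreamRecords,
//         'TableName' => $awsTimestreamTable,
//     ]),
//     fn (\Throwable $e) => $e instanceof \Aws\Exception\AwsException
//         && $e->getAwsErrorCode() === 'EndpointDiscoveryException'
// );
```

Restricting retries to the `EndpointDiscoveryException` error code keeps genuine request errors (for example rejected records) from being retried blindly.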

SDK version used

3.311.0

Environment details (Version of PHP (php -v)? OS name and version, etc.)

PHP 8.3.7, Ubuntu 22.04

@brysn brysn added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jun 7, 2024
@yenfryherrerafeliz yenfryherrerafeliz self-assigned this Jun 7, 2024
@yenfryherrerafeliz yenfryherrerafeliz added the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Jun 11, 2024
@yenfryherrerafeliz
Contributor

Hi @brysn, sorry to hear about your issues. This issue is hard to reproduce since it happens sporadically. I tried to reproduce it locally, but everything works fine on my end. Can you please confirm whether there is a proxy in the middle, or any other network limitation, for example one imposed by a firewall configuration?

Thanks!

@yenfryherrerafeliz yenfryherrerafeliz added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jun 19, 2024

This issue has not received a response in 1 week. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled.

@github-actions github-actions bot added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Jun 30, 2024
@brysn
Author

brysn commented Jul 1, 2024

Hi @yenfryherrerafeliz , thanks for looking into it. The application is on Heroku; there is no firewall, proxy in the middle, or any other network limitations.

It seems like it has to do with the endpoint discovery caching. Could the endpoint be changing on AWS before the cache expires for the endpoint?
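
One way to check this hypothesis might be to log what endpoint discovery returns over time and compare addresses and cache periods. The sketch below assumes the documented DescribeEndpoints response shape (`Endpoints`, each with `Address` and `CachePeriodInMinutes`); `formatDiscoveredEndpoints` is a hypothetical helper, not an SDK function.

```php
<?php

/**
 * Hypothetical helper: turn a DescribeEndpoints result into log lines.
 * Assumes the shape
 * ['Endpoints' => [['Address' => string, 'CachePeriodInMinutes' => int], ...]].
 */
function formatDiscoveredEndpoints(array|\ArrayAccess $result): array
{
    $lines = [];
    foreach ($result['Endpoints'] ?? [] as $endpoint) {
        $lines[] = sprintf(
            '%s (cache for %d min)',
            $endpoint['Address'] ?? '?',
            $endpoint['CachePeriodInMinutes'] ?? 0
        );
    }
    return $lines;
}

// Possible usage with the client from the reproduction steps (untested sketch):
// foreach (formatDiscoveredEndpoints($client->describeEndpoints()) as $line) {
//     error_log(date('c') . ' ' . $line);
// }
```

If the logged address changes well before the reported cache period has elapsed, that would support the caching theory; if it stays stable while the exceptions occur, the cause is more likely elsewhere.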

@github-actions github-actions bot removed closing-soon This issue will automatically close in 4 days unless further comments are made. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Jul 2, 2024
@RanVaknin RanVaknin added the p2 This is a standard priority issue label Aug 1, 2024
@RanVaknin
Contributor

Hi @brysn ,

AFAIK the SDK does not cache any endpoints. The SDK uses Guzzle, and PHP's standard curl client to make network calls. DNS resolution happens at the OS level and is not something that the SDK has visibility / control over.

If you want to root-cause this, you might want to use a network diagnostics tool like Wireshark to inspect what kind of networking events are happening on your machine. As a temporary measure, you might want to lower or completely disable the TTL on your DNS cache to see if it solves your issue.
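
Where a packet capture is not practical, a lighter-weight check is to resolve the host repeatedly from the affected environment and count failures. This is only a sketch: `countDnsFailures` is a hypothetical helper, and it uses PHP's own `gethostbynamel()` rather than libcurl's resolver, so a clean result would not rule out a problem on the cURL side (the error text in the report mentions a getaddrinfo() thread failing to start, which this probe does not exercise).

```php
<?php

/**
 * Hypothetical diagnostic: resolve $host $attempts times and count failures.
 * $resolver defaults to PHP's gethostbynamel(), which returns false on failure.
 * A custom resolver can be injected, e.g. for testing.
 */
function countDnsFailures(string $host, int $attempts = 100, ?callable $resolver = null): int
{
    $resolver ??= 'gethostbynamel';
    $failures = 0;
    for ($i = 0; $i < $attempts; $i++) {
        $addresses = $resolver($host);
        // Treat both "false" and an empty address list as a failed lookup.
        if ($addresses === false || $addresses === []) {
            $failures++;
        }
    }
    return $failures;
}

// Possible usage (host name taken from the exception message):
// printf(
//     "%d of 100 lookups failed\n",
//     countDnsFailures('ingest.timestream.us-west-2.amazonaws.com')
// );
```

Running it during a busy period and again overnight would show whether lookup failures correlate with load.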

This doesn't seem directly related to the SDK, but we will keep the issue open to see if this helps and maybe provide some other possible guidance.

Thanks,
Ran~

@RanVaknin RanVaknin added guidance Question that needs advice or information. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed investigating This issue is being investigated and/or work is in progress to resolve the issue. labels Aug 7, 2024
@brysn
Author

brysn commented Aug 12, 2024

@RanVaknin

Thanks for looking into this. The issue is with endpoint discovery and the SDK does indeed cache endpoints: Example

It's unlikely that AWS itself has an endpoint discovery bug. It's more likely that it's an issue with the SDK's endpoint discovery caching.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Aug 13, 2024
@brysn
Author

brysn commented Aug 31, 2024

@RanVaknin since this is a bug with the SDK, can this be looked into further?

@brysn
Author

brysn commented Oct 22, 2024

@RanVaknin @yenfryherrerafeliz Is there more information I need to provide? Why was this marked as "guidance" when it's a bug with the SDK's caching?

@RanVaknin RanVaknin self-assigned this Nov 4, 2024
@RanVaknin RanVaknin added investigating This issue is being investigated and/or work is in progress to resolve the issue. and removed guidance Question that needs advice or information. labels Dec 5, 2024
3 participants