Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests failing for @aws-sdk/client-sqs >=3.316 #1477

Closed
pichlermarc opened this issue Apr 25, 2023 · 14 comments · Fixed by #1847
Closed

Tests failing for @aws-sdk/client-sqs >=3.316 #1477

pichlermarc opened this issue Apr 25, 2023 · 14 comments · Fixed by #1847
Assignees
Labels
bug Something isn't working pkg:instrumentation-aws-sdk priority:p4 Bugs and spec inconsistencies which do not fall into a higher prioritization

Comments

@pichlermarc
Copy link
Member

What version of OpenTelemetry are you using?

Current revision on main

What version of Node are you using?

Node 16, 18

What did you do?

npm run test-all-versions in @opentelemetry/instrumentation-aws-sdk

What did you expect to see?

Tests pass.

What did you see instead?

Tests fail for @aws-sdk/client-sqs >=3.316

Additional context

Run for Node 16: https://github.com/open-telemetry/opentelemetry-js-contrib/actions/runs/4787435108/jobs/8512667684
Run for Node 18: https://github.com/open-telemetry/opentelemetry-js-contrib/actions/runs/4787435108/jobs/8512667928

@pichlermarc pichlermarc added bug Something isn't working pkg:instrumentation-aws-sdk labels Apr 25, 2023
@pichlermarc
Copy link
Member Author

pichlermarc commented Apr 25, 2023

cc @carolabadeer would you mind looking into this as you're listed as the code owner? 🙂 This could mean that the instrumentation is not producing the expected telemetry for the most recent versions of @aws-sdk/client-sqs

I'm able to produce this locally by installing either 3.316.0 or 3.319.0.
3.315.0 on the other hand works fine 🤔

@carolabadeer
Copy link
Contributor

carolabadeer commented Apr 26, 2023

Thanks for raising this @pichlermarc! I will take a look into what could be causing this issue

@dyladan dyladan added the priority:p4 Bugs and spec inconsistencies which do not fall into a higher prioritization label May 3, 2023
@dyladan
Copy link
Member

dyladan commented May 3, 2023

labeling p4 as this does not affect users

@pichlermarc
Copy link
Member Author

@carolabadeer any update?

@carolabadeer
Copy link
Contributor

carolabadeer commented May 16, 2023

Currently working with the AWS SDK for JS team to identify possible issues in the changes between @aws-sdk/client-sqs v3.315 and v3.316 that could be causing this test failure. Will post updates here on findings and progress

@carolabadeer
Copy link
Contributor

carolabadeer commented May 17, 2023

While waiting I also looked into the behavior in the unit tests and found that there are two spans being used (not sure if they are meant to be the same) - a test span and a callback span.

The callback span is the one whose attributes are being checked in the unit test that is failing. In that span, the test is checking for a SemanticAttributes.MESSAGING_OPERATION attribute but cannot find one. This span seems to be wrapping an S3 CreateBucket call. These are it's attributes:

{
  'aws.operation': 'createBucket',
  'aws.signature.version': 's3',
  'aws.service.api': 'S3',
  'aws.service.identifier': 's3',
  'aws.service.name': 'Amazon S3',
  'rpc.system': 'aws-api',
  'rpc.method': 'CreateBucket',
  'rpc.service': 'S3',
  'aws.region': 'us-east-1',
  'aws.request.id': '0000000000000',
  'http.status_code': 200
}

Since this is an S3 call, there shouldn't be a messaging.operation atttribute, which is the correct behavior.

The test span, on the other hand, is the one wrapping the SQS ReceiveMessage call. That span contains the correct messaging.operation attribute:

{
  'rpc.system': 'aws-api',
  'rpc.method': 'ReceiveMessage',
  'rpc.service': 'SQS',
  'messaging.system': 'aws.sqs',
  'messaging.destination_kind': 'queue',
  'messaging.destination': 'otel-demo-aws-sdk',
  'messaging.url': 'https://sqs.us-east-1.amazonaws.com/731241200085/otel-demo-aws-sdk',
  'messaging.operation': 'receive',
  'aws.region': 'us-east-1',
  'http.status_code': 200
}

These are my findings so far, but I am still curious about 2 questions:

  1. Why is the callback span only being used in this messaging.operation attribute test and not in any other place? I was able to solve this issue and get the test to pass by using
          // where span.attributes represents the attributes of the test span shown above 
          expect(span.attributes[SemanticAttributes.MESSAGING_OPERATION]).toMatch(
            MessagingOperationValues.RECEIVE
          );

instead of

          expect(attributes[SemanticAttributes.MESSAGING_OPERATION]).toMatch(
            MessagingOperationValues.RECEIVE
          );

@blumamir since you added this unit test, maybe you can give us some insight?

  1. Still unsure about how/why this issue did not present itself earlier. The implementations for adding the MESSAGING_OPERATION attribute and the unit tests have not had any recent changes, and I am not yet sure about the correlation between the @aws-sdk/client-sqs version bump and this test failure.

@github-actions
Copy link
Contributor

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@blumamir
Copy link
Member

blumamir commented Aug 9, 2023

This span seems to be wrapping an S3 CreateBucket call. These are it's attributes:

why are we seeing S3 CreateBucket span in this test?

  1. Why is the callback span only being used in this messaging.operation attribute test and not in any other place? I was able to solve this issue and get the test to pass by using

So what this part actually asserts is that the callback itself is being invoked with the receive span set as the active span for the callback invocation. The issue is that any operations triggered inside the callback, like calling deleteMessage or executing db queries - those will create new spans and the parent for them will be the span we get back from context.active() const receiveCallbackSpan = trace.getSpan(context.active());. If the context is ever not propagated to the callback correctly, the attributes for the active span would contain noop or some other span which is not the receive span, thus future child spans will be scoped improperly.

I hope it is clear, and feel free to refactor the test so it better communicate it's intent. We can maybe create a new span in the callback and assert child-parent relationships, or we can just compare the active span id in the callback with the test span? I hope I don't miss anything there.

2. Still unsure about how/why this issue did not present itself earlier. The implementations for adding the MESSAGING_OPERATION attribute and the unit tests have not had any recent changes, and I am not yet sure about the correlation between the @aws-sdk/client-sqs version bump and this test failure.

To me the immediate suspect is this S3 create bucket which seems unrelated. I would add a breakpoint and trace the stack to follow it's roots.

Feel free to ping me in slack to research together.

@pichlermarc
Copy link
Member Author

pichlermarc commented Oct 2, 2023

I opened #1711 to disable this test temporarily.
Otherwise, it's very hard to see if other test-all-versions scripts are failing.

If the PR merges, we'll need to remove the skip() once the problem is resovled.
Edit: actually, all three tests in custom service behavior are failing.

@trentm
Copy link
Contributor

trentm commented Nov 30, 2023

I have been looking at this a little bit. I don't have a fix yet, however.

Also, see #1838 (comment) about a different test failure with more recent versions of @aws-sdk/client-sqs.

-- required packages ["@aws-sdk/[email protected]"]
-- installing ["@aws-sdk/[email protected]"]
-- running test "npm run test" with @aws-sdk/client-sqs (env: {})
...
  3 failing

  1) instrumentation-aws-sdk-v3
       custom service behavior
         SQS
           sqs send add messaging attributes:
     SyntaxError: Unexpected token < in JSON at position 0
  Deserialization error: to see the raw response, inspect the hidden field {error}.$response on this object.
      at JSON.parse (<anonymous>)
      at /home/runner/work/opentelemetry-js-contrib/opentelemetry-js-contrib/node_modules/@aws-sdk/client-sqs/dist-cjs/protocols/Aws_json1_0.js:1980:21
      at processTicksAndRejections (internal/process/task_queues.js:95:5)

That failure, I believe, is because a recent-ish version of client-sqs changed the internal protocol used with the SQS service to one that expects JSON payload responses. A write-up of a similar failure against Elastic's APM agent tests here: elastic/apm-agent-nodejs#3324 (comment)

@trentm
Copy link
Contributor

trentm commented Dec 1, 2023

I believe I understand the issue with the v3.316.0 failure now, but I'm not sure how easily it can be fixed. I'll try to explain, and then I have a question about whether we even want to support this. (/cc @blumamir)

the feature that is broken

The aws-sdk instrumentation has special handling for SQS ReceiveMessage described here: https://github.com/open-telemetry/opentelemetry-js-contrib/blob/main/plugins/node/opentelemetry-instrumentation-aws-sdk/doc/sqs.md#receivemessage
One of the items from that doc:

Sets the inter process context correctly, so that additional spans created through the process will be linked to parent spans correctly.
This behavior is partially implemented, see discussion below.

What this means is that the instrumentation attempts to make it so:

sqsClient.receiveMessage(params).then(res => {
  // ...the current span context in here is the ReceiveMessage span.
});

The way that is done currently is by shimming the .then on some of the promises in the code path:

  • using bindPromise:
    return requestMetadata.isIncoming
    ? bindPromise(handlerPromise, activeContextWithSpan, 2)
    : handlerPromise;
  • bindPromise implementation:
    export const bindPromise = <T = unknown>(
    target: Promise<T>,
    contextForCallbacks: Context,
    rebindCount = 1
    ): Promise<T> => {
    const origThen = target.then;
    type PromiseThenParameters = Parameters<Promise<T>['then']>;
    target.then = function <TResult1 = T, TResult2 = never>(
    onFulfilled: PromiseThenParameters[0],
    onRejected: PromiseThenParameters[1]
    ): Promise<TResult1 | TResult2> {
    const newOnFulfilled = context.bind(contextForCallbacks, onFulfilled);
    const newOnRejected = context.bind(contextForCallbacks, onRejected);
    const patchedPromise = origThen.call<
    Promise<T>,
    any[],
    Promise<TResult1 | TResult2>
    >(this, newOnFulfilled, newOnRejected);
    return rebindCount > 1
    ? bindPromise(patchedPromise, contextForCallbacks, rebindCount - 1)
    : patchedPromise;
    };
    return target;
    };

the reason for breakage

This broke in @aws-sdk/[email protected] when the SQS class methods became async functions. Effectively the SQS class changed from (heavily simplified from aws/aws-sdk-js-v3#4639):

class SQS extends SQSClient {
  receiveMessage(args, optionsOrCb, cb) {
    const command = new ReceiveMessageCommand_1.ReceiveMessageCommand(args);
    return this.send(command, optionsOrCb);
  }

to this:

class SQS extends SQSClient {
  async receiveMessage(args, optionsOrCb, cb) {
    const command = new ReceiveMessageCommand_1.ReceiveMessageCommand(args);
    return this.send(command, optionsOrCb);
  }

The current bindPromise usage is not wrapping the return value of SQS.receiveMessage, but instead some internal promises in aws-sdk@3 "middleware stack". The additional promise from this now-async receiveMessage breaks that context binding.

possible fix

A possible fix might be (I haven't tried this) to explicitly wrap that sqsClient.receiveMessage() method, then shim the .then on its returned promise. ReceiveMessage is the only method with this special handling, so there would not be a need to wrap every API method on the SQS class.

However, IIUC, there will always be possible context loss around some async/await usage patterns. This is mentioned in the "sqs.md" doc:

  1. If async/await is used, context can be lost when returning data from async functions

Also, tests for this for aws-sdk v2 are skipped (and have always been, I believe):

it.skip('should set parent context in sqs receive after await', async () => {
const sqs = new AWS.SQS();
await sqs
.receiveMessage({
QueueUrl: 'queue/url/for/unittests',
})
.promise();
createReceiveChildSpan();
expectReceiverWithChildSpan(getTestSpans());
});
it.skip('should set parent context in sqs receive from async function', async () => {
const asycnReceive = async () => {
const sqs = new AWS.SQS();
return await sqs
.receiveMessage({
QueueUrl: 'queue/url/for/unittests',
})
.promise();
};
await asycnReceive();
createReceiveChildSpan();
expectReceiverWithChildSpan(getTestSpans());
});
});

Discussion: do we want to support this?

There are two special handling things being done by SQS ReceiveMessage instrumentation:

  1. attempting to propagate the span context to the message processing code (what this issue is about); and
  2. attempting to create "processing" spans for the handling of each message in a batch.

The latter is attempted by monkey-patching the forEach, map, and filter methods on the returned response.Messages array. So this code produces a "processing" span for each message:

sqs.receiveMessage(...).then(res => {
  res.Messages.forEach(msg => {
    console.log('one message is', msg);
  });
});

I say "attempted" because the instrumentation is necessarily limited. This does not produce processing spans:

sqs.receiveMessage(...).then(res => {
  for (let msg of res.Messages) {
    console.log('one message is', msg);
  };
});

There is already a (very old) discussion issue for this: #707
The current last comment mentions a bad user outcome when (2) works, but (1) fails:

I am using forEach and inside it, using a setTimeout. Its causing the spans to be nested with one root span having 10000+ spans for receiving messages.


I understand the desire to try to implement the OTel message spec guidelines to create processing spans automatically. However, I think the limited ability to do so in JavaScript results in surprising behaviour such that we might want to consider not doing this limited special handling at all. This would be a breaking change.

Actually, just looking at the OTel Messaging spec (https://opentelemetry.io/docs/specs/semconv/messaging/messaging-spans/) now, has guidance for "processing" spans been dropped? Compare https://opentelemetry.io/docs/specs/semconv/messaging/messaging-spans/#batch-receiving to the older v1.20.0 equivalent: https://github.com/open-telemetry/opentelemetry-specification/blob/v1.20.0/specification/trace/semantic_conventions/messaging.md#batch-processing
The current spec doesn't mention "processing" spans anymore.

@trentm
Copy link
Contributor

trentm commented Dec 1, 2023

I opened #1711 to disable this test temporarily.

The "fix" or removal of the feature (see previous comment) isn't going to be a quick thing. I think in the interim we should disable this particular test. I believe we can skip just this test block:

it('sqs receive add messaging attributes and context',

this would be slight change to #1711. @pichlermarc Would you object to a change pushed to your #1711 to disable just the smaller part of aws-sdk-v3.test.ts that is failing?

@trentm
Copy link
Contributor

trentm commented Dec 5, 2023

@pichlermarc Would you object to a change pushed to your #1711 to disable just the smaller part of aws-sdk-v3.test.ts that is failing?

I don't have permissions to push to your branch, so I created #1847 as an alternative.

@pichlermarc
Copy link
Member Author

@pichlermarc Would you object to a change pushed to your #1711 to disable just the smaller part of aws-sdk-v3.test.ts that is failing?

I don't have permissions to push to your branch, so I created #1847 as an alternative.

Ah, yes that's not possible with organization forks. :/
Looks like I was too late here - I'll close my old PR - splitting it and only skipping the part that's failing is better 🙂

trentm added a commit that referenced this issue Dec 11, 2023
…sqs ReceiveMessage context handling (#1847)

In `@aws-sdk/client-sqs` v3.316 the SQS client methods became async. This breaks
the `utils.bindPromise` usage that attempts to propagate the SQS ReceiveMessage
span context to the user's handler for the method's returned promise.
Fixing that propagation is for #707 or another issue.

This change is a workaround that skips the testing of that span context
propagation feature.

Fixes: #1477
Refs: #707
Co-authored-by: Marc Pichler <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working pkg:instrumentation-aws-sdk priority:p4 Bugs and spec inconsistencies which do not fall into a higher prioritization
Projects
None yet
5 participants