
Race condition with net_http instrumentation on JRuby causing intermittent errors #3021

Open
NC-piercej opened this issue Jan 16, 2025 · 8 comments

@NC-piercej

Description

We are seeing occasional errors logged in New Relic that appear to originate in the newrelic_rpm gem code. Here is a sample backtrace:

NoMethodError: undefined method `add_request_headers' for nil:NilClass
…ms/newrelic_rpm-9.16.1/lib/new_relic/agent/instrumentation/net_http/instrumentation.rb:24:in `request_with_tracing'
…undle/gems/newrelic_rpm-9.16.1/lib/new_relic/agent/instrumentation/net_http/prepend.rb:15:in `request'
…local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/connection_pool.rb:348:in `request'
    /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/handler.rb:80:in `block in transmit'
    /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/handler.rb:134:in `block in session'
…local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/connection_pool.rb:106:in `session_for'
    /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/handler.rb:129:in `session'
    /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/handler.rb:77:in `transmit'
    /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/handler.rb:46:in `block in call'
    /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/handler.rb:206:in `block in span_wrapper'
        /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/telemetry/no_op.rb:29:in `in_span'
    /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/handler.rb:202:in `span_wrapper'
    /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/handler.rb:45:in `call'
…r/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/plugins/content_length.rb:24:in `call'
…local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/plugins/request_callback.rb:118:in `call'
          /usr/local/bundle/gems/aws-sdk-s3-1.176.1/lib/aws-sdk-s3/plugins/s3_signer.rb:78:in `call'
         /usr/local/bundle/gems/aws-sdk-s3-1.176.1/lib/aws-sdk-s3/plugins/s3_host_id.rb:17:in `call'
    /usr/local/bundle/gems/aws-sdk-s3-1.176.1/lib/aws-sdk-s3/plugins/http_200_errors.rb:17:in `call'
      /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/xml/error_handler.rb:10:in `call'
           /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/sign.rb:53:in `call'
…r/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/transfer_encoding.rb:27:in `call'
…cal/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/helpful_socket_errors.rb:12:in `call'
          /usr/local/bundle/gems/aws-sdk-s3-1.176.1/lib/aws-sdk-s3/plugins/s3_signer.rb:53:in `call'
          /usr/local/bundle/gems/aws-sdk-s3-1.176.1/lib/aws-sdk-s3/plugins/redirects.rb:20:in `call'
     /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/user_agent.rb:69:in `call'
<truncated 26 additional frames>
…/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/plugins/raise_response_errors.rb:16:in `call'
            /usr/local/bundle/gems/aws-sdk-s3-1.176.1/lib/aws-sdk-s3/plugins/sse_cpk.rb:24:in `call'
          /usr/local/bundle/gems/aws-sdk-s3-1.176.1/lib/aws-sdk-s3/plugins/dualstack.rb:21:in `call'
         /usr/local/bundle/gems/aws-sdk-s3-1.176.1/lib/aws-sdk-s3/plugins/accelerate.rb:43:in `call'
…/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/checksum_algorithm.rb:111:in `call'
…local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/jsonvalue_converter.rb:16:in `call'
  /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/invocation_id.rb:16:in `call'
…r/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/idempotency_token.rb:19:in `call'
/usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/param_converter.rb:26:in `call'
…local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/plugins/request_callback.rb:89:in `call'
/usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/response_paging.rb:12:in `call'
…/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/plugins/response_target.rb:24:in `call'
      /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/telemetry.rb:39:in `block in call'
        /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/telemetry/no_op.rb:29:in `in_span'
      /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/telemetry.rb:53:in `span_wrapper'
      /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/telemetry.rb:39:in `call'
             /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/request.rb:72:in `send_request'
                     /usr/local/bundle/gems/aws-sdk-s3-1.176.1/lib/aws-sdk-s3/client.rb:17185:in `put_object'
                             /usr/src/app/app/services/logging_caching/logger_helper.rb:142:in `save_to_s3'

Specifically, it appears to be blowing up at instrumentation.rb:24 in request_with_tracing, meaning that NewRelic::Agent::Tracer.start_external_request_segment seems to be incorrectly returning nil.
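
For reference, the failing call is roughly of this shape (a simplified sketch based on the backtrace, not the gem's actual source; the local variables are hypothetical and only there for illustration):

```ruby
# Sketch of the failing sequence in request_with_tracing (instrumentation.rb:24).
# When start_external_request_segment unexpectedly returns nil, the next call
# raises NoMethodError: undefined method `add_request_headers' for nil:NilClass.
segment = NewRelic::Agent::Tracer.start_external_request_segment(
  library: 'Net::HTTP',
  uri: request_uri,          # hypothetical locals, for illustration only
  procedure: request_method
)
segment.add_request_headers(request) # segment is nil here when the error fires
```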

Expected Behavior

New Relic instrumentation works reliably when multiple requests are in flight at once on different threads on JRuby.

Steps to Reproduce

This appears to be a race condition. We see it quite frequently when multiple requests are running at once on different threads, but it is hard to reproduce deliberately.

Your Environment

JRuby 9.4.9.0 (jruby:9.4.9.0-jdk21 Docker image)
newrelic_rpm 9.16.1

@NC-piercej NC-piercej added the bug label Jan 16, 2025
@github-actions github-actions bot added the community label Jan 16, 2025
@tannalynn
Contributor

Thank you for letting us know about this issue.
We'll need to take some time to work on reproducing this issue to help us investigate the root cause of it, but this information is helpful.
If you find any other information related to this or errors in the agent logs, please let us know.

@chynesNR chynesNR moved this to In Quarter in Ruby Engineering Board Jan 17, 2025
@chynesNR chynesNR moved this from In Quarter to In Sprint in Ruby Engineering Board Jan 21, 2025
@tannalynn tannalynn self-assigned this Jan 27, 2025
@tannalynn
Contributor

Hello @NC-piercej

I'm going to make a change that will prevent this specific error from occurring, but I suspect it's a side effect of some other error, since start_external_request_segment generally should not return nil. I'd like to investigate the root cause of what's going on, but I haven't reproduced this error yet, so I'd appreciate a little more information if that's possible.

Are you seeing any other errors logged by the Ruby agent in the newrelic_agent.log file?
Specifically, I'm interested in an error that contains the text "Exception during Tracer.start_external_request_segment", so I can see what exception and stack trace are causing start_external_request_segment to return nil. If there are other errors, though, please share those as well in case they are related. It would be a big help, since I haven't gotten this happening locally yet.

@NC-piercej
Author

I am going to look into finding someone who has access to those logs. In the meantime, I do have some more details.

This error always triggers from the same place: logging the response asynchronously on a different thread after the server has already sent it to the client. It is possible there may be some weirdness around this. Basically, the sequence is (see the sketch after this list):

  • build the response
  • start a thread to log it (but do not wait for it)
  • return the response to the client
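
Roughly, in code (a hypothetical simplification of our flow; save_to_s3 is the method from the first backtrace, the other names are made up for illustration):

```ruby
# Hypothetical simplification of our request flow.
def handle_request(request)
  response = build_response(request)

  # Fire-and-forget logging: the request thread never joins this thread, so
  # save_to_s3 (and its instrumented HTTP call) can still be running after
  # the response has been sent and the original transaction has finished.
  Thread.new { save_to_s3(response) }

  response
end
```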

Second, we did find another race-condition-style error within the agent gem code that may be related:

RuntimeError: can't add a new key into hash during iteration
org/jruby/RubyHash.java:1191:in `[]='
/usr/local/bundle/gems/newrelic_rpm-9.16.1/lib/new_relic/agent/attributes.rb:119:in `add'
/usr/local/bundle/gems/newrelic_rpm-9.16.1/lib/new_relic/agent/attributes.rb:33:in `add_agent_attribute'
/usr/local/bundle/gems/newrelic_rpm-9.16.1/lib/new_relic/agent/transaction/segment.rb:28:in `add_agent_attribute'
/usr/local/bundle/gems/newrelic_rpm-9.16.1/lib/new_relic/agent/instrumentation/dynamodb/instrumentation.rb:43:in `instrument_method_with_new_relic'
/usr/local/bundle/gems/newrelic_rpm-9.16.1/lib/new_relic/agent/instrumentation/dynamodb/prepend.rb:11:in `block in Prepend'
/usr/src/app/app/services/logging_caching/logger_helper.rb:178:in `push_log_to_dynamodb'
/usr/src/app/app/services/logging_caching/logger_helper.rb:92:in `block in push_log_to_remote'
/usr/local/bundle/gems/newrelic_rpm-9.16.1/lib/new_relic/agent/tracer.rb:434:in `block in thread_block_with_current_transaction'
/usr/local/bundle/gems/newrelic_rpm-9.16.1/lib/new_relic/agent/tracer.rb:357:in `capture_segment_error'
/usr/local/bundle/gems/newrelic_rpm-9.16.1/lib/new_relic/agent/tracer.rb:433:in `block in thread_block_with_current_transaction'
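
For what it's worth, this class of error is easy to hit in isolation when one thread iterates a shared Hash while another thread inserts into it. A minimal standalone illustration of the mechanism (not the agent's code):

```ruby
# One thread iterates a shared Hash while another inserts into it.
# The *writing* thread is the one that raises, matching the RubyHash#[]= frame above.
shared = { 0 => 0 }

reader = Thread.new do
  10_000.times { shared.each { |k, _v| k } }
end

begin
  100_000.times { |i| shared[i] = i } # may raise "can't add a new key into hash during iteration"
rescue RuntimeError => e
  puts e.message
ensure
  reader.join
end
```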

@NC-piercej
Author

We're tweaking some logging settings and should hopefully have some data for you tomorrow.

@NC-piercej
Author

We are seeing agent logs, but nothing unusual when the race condition error occurs. Should we try a log level other than the default?

@tannalynn
Contributor

> We are seeing agent logs, but nothing unusual when the race condition error occurs. Should we try a log level other than the default?

Hmm, that's interesting that you aren't seeing anything related to the errors I was expecting. The error I was specifically interested in is logged at an error level, so it should be there with default settings if an error is happening in that area. No other errors either?

I was looking at the newer error you provided, the hash/iteration one. Does the original error you reported usually occur in the same transaction? I noticed the original stack trace also includes code in the AWS SDK, but that one seems to be using the S3 SDK. Just wanted to check, because we do have DynamoDB-specific instrumentation, but the S3 calls would only be picked up by the Net::HTTP instrumentation. Are these errors usually happening at the same time, or are they two separate things that each produce errors? Just want to get a better understanding.

Also, I did make a change that will prevent the original error from occurring when the segment is nil (#3046); this should be included in the next release that we do.

I'll start looking at that second error and see what's going on there.

@NC-piercej
Author

NC-piercej commented Feb 3, 2025

@tannalynn We took another look at the logs and we think we found a few things:

  • ** [NewRelic][2025-01-31 11:23:37 -0500 ip-172-19-14-100.ec2.internal (1)] ERROR : Exception during Transaction#finish
  • ** [NewRelic][2025-01-31 11:23:37 -0500 ip-172-19-14-100.ec2.internal (1)] ERROR : NoMethodError: undefined method `finalize' for nil:NilClass
  • ** [NewRelic][2025-01-31 11:23:49 -0500 ip-172-19-14-100.ec2.internal (1)] ERROR : Exception during Tracer.start_external_request_segment
  • ** [NewRelic][2025-01-31 11:23:49 -0500 ip-172-19-14-100.ec2.internal (1)] ERROR : NoMethodError: undefined method `starting_segment_key' for nil:NilClass
  • ** [NewRelic][2025-01-31 11:24:23 -0500 ip-172-19-12-14.ec2.internal (1)] ERROR : Exception during Transaction#finish
  • ** [NewRelic][2025-01-31 11:24:23 -0500 ip-172-19-12-14.ec2.internal (1)] ERROR : NoMethodError: undefined method `first' for nil:NilClass
  • ** [NewRelic][2025-01-31 16:32:47 -0500 ip-172-19-14-100.ec2.internal (1)] ERROR : Exception finishing segment: External/.amazonaws.com/Net::HTTP/PUT
  • ** [NewRelic][2025-01-31 16:32:47 -0500 ip-172-19-14-100.ec2.internal (1)] ERROR : TypeError: nil can't be coerced into Float

Looking at the code, I think this array mutation needs to be guarded by a mutex:

https://github.com/newrelic/newrelic-ruby-agent/blob/dev/lib/new_relic/agent/transaction/tracing.rb#L27

We have multiple services (S3, DynamoDB, etc.) being written to in parallel by our logging code. These writes would all be triggered from the same overall request/transaction.
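
A rough sketch of the kind of guard I have in mind (hypothetical class and variable names, not the agent's actual code):

```ruby
# Serialize appends to the shared segments array with a Mutex so concurrent
# segment creation from multiple threads in one transaction can't race.
class SegmentList
  def initialize
    @segments = []
    @lock = Mutex.new
  end

  def add_segment(segment)
    @lock.synchronize { @segments << segment }
  end

  def each_segment(&block)
    # Iterate over a snapshot so readers don't hold the lock while yielding.
    @lock.synchronize { @segments.dup }.each(&block)
  end
end
```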
