
Race condition with net_http instrumentation on JRuby causing intermittent errors #3021

Open
NC-piercej opened this issue Jan 16, 2025 · 8 comments

@NC-piercej

Description

We are seeing occasional errors logged in New Relic that appear to originate in the newrelic_rpm gem code. Here is a sample backtrace:

NoMethodError: undefined method `add_request_headers' for nil:NilClass
…ms/newrelic_rpm-9.16.1/lib/new_relic/agent/instrumentation/net_http/instrumentation.rb:24:in `request_with_tracing'
…undle/gems/newrelic_rpm-9.16.1/lib/new_relic/agent/instrumentation/net_http/prepend.rb:15:in `request'
…local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/connection_pool.rb:348:in `request'
    /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/handler.rb:80:in `block in transmit'
    /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/handler.rb:134:in `block in session'
…local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/connection_pool.rb:106:in `session_for'
    /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/handler.rb:129:in `session'
    /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/handler.rb:77:in `transmit'
    /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/handler.rb:46:in `block in call'
    /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/handler.rb:206:in `block in span_wrapper'
        /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/telemetry/no_op.rb:29:in `in_span'
    /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/handler.rb:202:in `span_wrapper'
    /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/net_http/handler.rb:45:in `call'
…r/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/plugins/content_length.rb:24:in `call'
…local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/plugins/request_callback.rb:118:in `call'
          /usr/local/bundle/gems/aws-sdk-s3-1.176.1/lib/aws-sdk-s3/plugins/s3_signer.rb:78:in `call'
         /usr/local/bundle/gems/aws-sdk-s3-1.176.1/lib/aws-sdk-s3/plugins/s3_host_id.rb:17:in `call'
    /usr/local/bundle/gems/aws-sdk-s3-1.176.1/lib/aws-sdk-s3/plugins/http_200_errors.rb:17:in `call'
      /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/xml/error_handler.rb:10:in `call'
           /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/sign.rb:53:in `call'
…r/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/transfer_encoding.rb:27:in `call'
…cal/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/helpful_socket_errors.rb:12:in `call'
          /usr/local/bundle/gems/aws-sdk-s3-1.176.1/lib/aws-sdk-s3/plugins/s3_signer.rb:53:in `call'
          /usr/local/bundle/gems/aws-sdk-s3-1.176.1/lib/aws-sdk-s3/plugins/redirects.rb:20:in `call'
     /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/user_agent.rb:69:in `call'
<truncated 26 additional frames>
…/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/plugins/raise_response_errors.rb:16:in `call'
            /usr/local/bundle/gems/aws-sdk-s3-1.176.1/lib/aws-sdk-s3/plugins/sse_cpk.rb:24:in `call'
          /usr/local/bundle/gems/aws-sdk-s3-1.176.1/lib/aws-sdk-s3/plugins/dualstack.rb:21:in `call'
         /usr/local/bundle/gems/aws-sdk-s3-1.176.1/lib/aws-sdk-s3/plugins/accelerate.rb:43:in `call'
…/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/checksum_algorithm.rb:111:in `call'
…local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/jsonvalue_converter.rb:16:in `call'
  /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/invocation_id.rb:16:in `call'
…r/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/idempotency_token.rb:19:in `call'
/usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/param_converter.rb:26:in `call'
…local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/plugins/request_callback.rb:89:in `call'
/usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/response_paging.rb:12:in `call'
…/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/plugins/response_target.rb:24:in `call'
      /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/telemetry.rb:39:in `block in call'
        /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/telemetry/no_op.rb:29:in `in_span'
      /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/telemetry.rb:53:in `span_wrapper'
      /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/telemetry.rb:39:in `call'
             /usr/local/bundle/gems/aws-sdk-core-3.214.0/lib/seahorse/client/request.rb:72:in `send_request'
                     /usr/local/bundle/gems/aws-sdk-s3-1.176.1/lib/aws-sdk-s3/client.rb:17185:in `put_object'
                             /usr/src/app/app/services/logging_caching/logger_helper.rb:142:in `save_to_s3'

Specifically, it appears to be blowing up at instrumentation.rb:24 in request_with_tracing, meaning that NewRelic::Agent::Tracer.start_external_request_segment seems to be incorrectly returning nil.
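
For reference, the failing call is roughly of this shape (a simplified sketch based on the backtrace, not the gem's actual source; the local variables are hypothetical and only there for illustration):

```ruby
# Sketch of the failing sequence in request_with_tracing (instrumentation.rb:24).
# When start_external_request_segment unexpectedly returns nil, the next call
# raises NoMethodError: undefined method `add_request_headers' for nil:NilClass.
segment = NewRelic::Agent::Tracer.start_external_request_segment(
  library: 'Net::HTTP',
  uri: request_uri,          # hypothetical locals, for illustration only
  procedure: request_method
)
segment.add_request_headers(request) # segment is nil here when the error fires
```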

Expected Behavior

New Relic instrumentation works reliably when multiple requests are in flight at once on different threads on JRuby.

Steps to Reproduce

This appears to be a race condition. We see it quite frequently when multiple requests are running at once on different threads, but it is hard to reproduce deliberately.

Your Environment

JRuby 9.4.9.0 (jruby:9.4.9.0-jdk21 Docker image)
newrelic_rpm 9.16.1

@NC-piercej NC-piercej added the bug label Jan 16, 2025
@github-actions github-actions bot added the community label Jan 16, 2025
@tannalynn
Contributor

Thank you for letting us know about this issue.
We'll need to take some time to work on reproducing this issue to help us investigate the root cause of it, but this information is helpful.
If you find any other information related to this or errors in the agent logs, please let us know.

@chynesNR chynesNR moved this to In Quarter in Ruby Engineering Board Jan 17, 2025
@chynesNR chynesNR moved this from In Quarter to In Sprint in Ruby Engineering Board Jan 21, 2025
@tannalynn tannalynn self-assigned this Jan 27, 2025
@tannalynn
Contributor

Hello @NC-piercej

I'm going to make a change that will prevent this specific error from occurring, but I suspect it's a side effect of some other error, since start_external_request_segment generally should not return nil. I'd like to investigate the root cause of what's going on, but I haven't reproduced this error yet, so I'd appreciate a little more information if that's possible.

Are you seeing any other errors logged by the Ruby agent in the newrelic_agent.log file?
Specifically, I'm interested in an error that contains the text "Exception during Tracer.start_external_request_segment", so I can see what exception and stack trace are causing start_external_request_segment to return nil. If there are other errors, though, please share those as well in case they are related. It would be a big help, since I haven't gotten this happening locally yet.

@NC-piercej
Author

I am going to look into finding someone who has access to those logs. In the meantime, I do have some more details.

This error always triggers from the same place: logging the response asynchronously on a different thread after the server has already sent it to the client. It is possible there may be some weirdness around this. Basically, the sequence is (see the sketch after this list):

  • build the response
  • start a thread to log it (but do not wait for it)
  • return the response to the client
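
Roughly, in code (a hypothetical simplification of our flow; save_to_s3 is the method from the first backtrace, the other names are made up for illustration):

```ruby
# Hypothetical simplification of our request flow.
def handle_request(request)
  response = build_response(request)

  # Fire-and-forget logging: the request thread never joins this thread, so
  # save_to_s3 (and its instrumented HTTP call) can still be running after
  # the response has been sent and the original transaction has finished.
  Thread.new { save_to_s3(response) }

  response
end
```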

Second, we did find another race-condition-style error within the agent gem code that may be related:

RuntimeError: can't add a new key into hash during iteration
org/jruby/RubyHash.java:1191:in `[]='
/usr/local/bundle/gems/newrelic_rpm-9.16.1/lib/new_relic/agent/attributes.rb:119:in `add'
/usr/local/bundle/gems/newrelic_rpm-9.16.1/lib/new_relic/agent/attributes.rb:33:in `add_agent_attribute'
/usr/local/bundle/gems/newrelic_rpm-9.16.1/lib/new_relic/agent/transaction/segment.rb:28:in `add_agent_attribute'
/usr/local/bundle/gems/newrelic_rpm-9.16.1/lib/new_relic/agent/instrumentation/dynamodb/instrumentation.rb:43:in `instrument_method_with_new_relic'
/usr/local/bundle/gems/newrelic_rpm-9.16.1/lib/new_relic/agent/instrumentation/dynamodb/prepend.rb:11:in `block in Prepend'
/usr/src/app/app/services/logging_caching/logger_helper.rb:178:in `push_log_to_dynamodb'
/usr/src/app/app/services/logging_caching/logger_helper.rb:92:in `block in push_log_to_remote'
/usr/local/bundle/gems/newrelic_rpm-9.16.1/lib/new_relic/agent/tracer.rb:434:in `block in thread_block_with_current_transaction'
/usr/local/bundle/gems/newrelic_rpm-9.16.1/lib/new_relic/agent/tracer.rb:357:in `capture_segment_error'
/usr/local/bundle/gems/newrelic_rpm-9.16.1/lib/new_relic/agent/tracer.rb:433:in `block in thread_block_with_current_transaction'
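
For what it's worth, this class of error is easy to hit in isolation when one thread iterates a shared Hash while another thread inserts into it. A minimal standalone illustration of the mechanism (not the agent's code):

```ruby
# One thread iterates a shared Hash while another inserts into it.
# The *writing* thread is the one that raises, matching the RubyHash#[]= frame above.
shared = { 0 => 0 }

reader = Thread.new do
  10_000.times { shared.each { |k, _v| k } }
end

begin
  100_000.times { |i| shared[i] = i } # may raise "can't add a new key into hash during iteration"
rescue RuntimeError => e
  puts e.message
ensure
  reader.join
end
```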

@NC-piercej
Author

We're tweaking some logging settings and should hopefully have some data for you tomorrow.

@NC-piercej
Author

We are seeing agent logs, but nothing unusual when the race condition error occurs. Should we try a log level other than the default?

@tannalynn
Contributor

> We are seeing agent logs, but nothing unusual when the race condition error occurs. Should we try a log level other than the default?

Hmm, that's interesting that you aren't seeing anything related to the errors I was expecting. The error I was specifically interested in is logged at an error level, so it should be there with default settings if an error is happening in that area. No other errors either?

I was looking at the newer error you provided, the hash/iteration one. Does the original error you reported usually occur in the same transaction? I noticed the original stack trace also includes code in the AWS SDK, but that one seems to be using the S3 SDK. Just wanted to check, because we do have DynamoDB-specific instrumentation, but the S3 calls would only be picked up by the Net::HTTP instrumentation. Are these errors usually happening at the same time, or are they two separate things that each produce errors? Just want to get a better understanding.

Also, I did make a change that will prevent the original error from occurring when the segment is nil (#3046); this should be included in the next release that we do.

I'll start looking at that second error and see what's going on there.

@NC-piercej
Author

NC-piercej commented Feb 3, 2025

@tannalynn We took another look at the logs and we think we found a few things:

  • ** [NewRelic][2025-01-31 11:23:37 -0500 ip-172-19-14-100.ec2.internal (1)] ERROR : Exception during Transaction#finish
  • ** [NewRelic][2025-01-31 11:23:37 -0500 ip-172-19-14-100.ec2.internal (1)] ERROR : NoMethodError: undefined method `finalize' for nil:NilClass
  • ** [NewRelic][2025-01-31 11:23:49 -0500 ip-172-19-14-100.ec2.internal (1)] ERROR : Exception during Tracer.start_external_request_segment
  • ** [NewRelic][2025-01-31 11:23:49 -0500 ip-172-19-14-100.ec2.internal (1)] ERROR : NoMethodError: undefined method `starting_segment_key' for nil:NilClass
  • ** [NewRelic][2025-01-31 11:24:23 -0500 ip-172-19-12-14.ec2.internal (1)] ERROR : Exception during Transaction#finish
  • ** [NewRelic][2025-01-31 11:24:23 -0500 ip-172-19-12-14.ec2.internal (1)] ERROR : NoMethodError: undefined method `first' for nil:NilClass
  • ** [NewRelic][2025-01-31 16:32:47 -0500 ip-172-19-14-100.ec2.internal (1)] ERROR : Exception finishing segment: External/.amazonaws.com/Net::HTTP/PUT
  • ** [NewRelic][2025-01-31 16:32:47 -0500 ip-172-19-14-100.ec2.internal (1)] ERROR : TypeError: nil can't be coerced into Float

Looking at the code, I think this array mutation needs to be guarded by a mutex:

https://github.com/newrelic/newrelic-ruby-agent/blob/dev/lib/new_relic/agent/transaction/tracing.rb#L27

We have multiple services (S3, DynamoDB, etc.) being written to in parallel by our logging code. These writes would all be triggered from the same overall request/transaction.
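
A rough sketch of the kind of guard I have in mind (hypothetical class and variable names, not the agent's actual code):

```ruby
# Serialize appends to the shared segments array with a Mutex so concurrent
# segment creation from multiple threads in one transaction can't race.
class SegmentList
  def initialize
    @segments = []
    @lock = Mutex.new
  end

  def add_segment(segment)
    @lock.synchronize { @segments << segment }
  end

  def each_segment(&block)
    # Iterate over a snapshot so readers don't hold the lock while yielding.
    @lock.synchronize { @segments.dup }.each(&block)
  end
end
```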
