KMS: terminate called after throwing an instance of '__gnu_cxx::recursive_init_error' #27727

Open
ball-hayden opened this issue Dec 5, 2024 · 7 comments
Labels
api: cloudkms Issues related to the Cloud Key Management Service API.

Comments

@ball-hayden

ball-hayden commented Dec 5, 2024

We're seeing an intermittent Ruby crash after calling the KMS service.

Please see example logs below:

insertId timestamp severity textPayload
pj9tfbnor8azuu2d 2024-12-04T09:47:53.799319508Z ERROR D, [2024-12-04T09:47:53.798983 #1] DEBUG -- : calling cloudkms.googleapis.com:/google.cloud.kms.v1.KeyManagementService/Decrypt
sulu03yt34keqh8x 2024-12-04T09:47:53.804834188Z ERROR D, [2024-12-04T09:47:53.804635 #1] DEBUG -- : calling cloudkms.googleapis.com:/google.cloud.kms.v1.KeyManagementService/Decrypt
jndib3wyowfxlhm7 2024-12-04T09:47:53.807942181Z ERROR D, [2024-12-04T09:47:53.807789 #1] DEBUG -- : calling cloudkms.googleapis.com:/google.cloud.kms.v1.KeyManagementService/Decrypt
8fvkl75ex7wsqcc6 2024-12-04T09:47:53.808815177Z ERROR D, [2024-12-04T09:47:53.806478 #1] DEBUG -- : calling cloudkms.googleapis.com:/google.cloud.kms.v1.KeyManagementService/Decrypt
6r4avcnryq03n560 2024-12-04T09:47:53.827669009Z ERROR terminate called after throwing an instance of '__gnu_cxx::recursive_init_error'
8ra2pqo4hjbxirgx 2024-12-04T09:47:53.827707624Z ERROR what():  std::exception

gRPC appears to be the only C++ native gem we use, which is used exclusively by Google Cloud KMS:

$ bundle why grpc
google-cloud-kms -> google-cloud-kms-v1 -> google-cloud-location -> gapic-common -> googleapis-common-protos -> grpc
google-cloud-kms -> google-cloud-kms-v1 -> google-iam-v1 -> gapic-common -> googleapis-common-protos -> grpc
google-cloud-kms -> google-cloud-kms-v1 -> gapic-common -> googleapis-common-protos -> grpc
google-cloud-kms -> google-cloud-kms-v1 -> google-iam-v1 -> grpc-google-iam-v1 -> googleapis-common-protos -> grpc
google-cloud-kms -> google-cloud-kms-v1 -> google-cloud-location -> gapic-common -> grpc
google-cloud-kms -> google-cloud-kms-v1 -> google-iam-v1 -> gapic-common -> grpc
google-cloud-kms -> google-cloud-kms-v1 -> gapic-common -> grpc
google-cloud-kms -> google-cloud-kms-v1 -> google-iam-v1 -> grpc-google-iam-v1 -> grpc

Environment details

Steps to reproduce

  1. Perform multiple decrypt operations
  2. Observe intermittent failure

Code example

require "logger"

module MyLogger
  LOGGER = Logger.new $stderr, level: Logger::DEBUG
  def logger
    LOGGER
  end
end

# Define a gRPC module-level logger method before grpc/logconfig.rb loads.
module GRPC
  extend MyLogger
end

require "google/cloud/kms"

client = Google::Cloud::Kms.key_management_service do |config|
  config.timeout = 2
end

client.decrypt(name: key_id, ciphertext: encrypted_data_key).plaintext

Full backtrace

terminate called after throwing an instance of '__gnu_cxx::recursive_init_error'
  what():  std::exception

(Yes, that really is it. There's no Ruby backtrace. I've nothing else to go on.)

@product-auto-label product-auto-label bot added the api: cloudkms Issues related to the Cloud Key Management Service API. label Dec 5, 2024
@dazuma
Member

dazuma commented Dec 5, 2024

Can you provide the following additional information:

  • Exact version and architecture (such as 1.68.1 x86_64-linux) of both the grpc and google-protobuf gems? You can find these in your Gemfile.lock, or, if you're not using bundler, by running gem list. In particular, if these two gems are not the latest, please try updating them (e.g. with bundle update) and see if the problem persists.
  • Does the problem persist if you remove the config.timeout = 2 code?
  • Roughly "how" intermittent is the issue as you have observed it? Does it happen most of the time? About 1 in 10? 1 in 1000? 1 in 1,000,000?
  • When the code does not crash, does the call itself succeed and return valid results?
  • Does your application ever fork? (For example, are you doing any preforking using a webserver like Puma?)
  • Is there anything we should know about regarding your network setup that might complicate reproducing this? (e.g. are you going through a network proxy?)

It's likely this will need to be reported upstream to the gRPC team (https://github.com/grpc/grpc) as that's where the assertion failure appears to be coming from, but it would be good to get some of this Ruby-side runtime context in a report. Thanks!

@ball-hayden
Author

Thanks for your quick reply @dazuma.

Exact version and architecture...

grpc (1.67.0-x86_64-linux)
google-protobuf (4.29.0-x86_64-linux)

Also probably worth noting that the Alpine image we're running is musl-based (rather than glibc-based).

I've just set off a test run with the gems updated (apologies - they've been updated within the last week) to see if the issue is still present.

Does the problem persist if you remove the config.timeout = 2 code?

This is from within a Gem - I'll try monkeypatching it out and get back to you.

Roughly "how" intermittent is the issue as you have observed it?
When the code does not crash, does the call itself succeed and return valid results?

From my last test run, 1824 calls were successful and returned correct results. 1 failed.
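To put that in the "1 in N" terms asked about above, here is a quick back-of-the-envelope calculation. It uses only the two counts from this test run; a single failure is far too small a sample to treat the result as a reliable rate.

```ruby
# Counts taken from the test run described above.
successes = 1824
failures = 1

total = successes + failures
failure_rate = failures.fdiv(total) # float division

puts format("failure rate: %.3f%% (about 1 in %d calls)",
            failure_rate * 100, total / failures)
```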

Does your application ever fork?

We're running Puma, but with a multi-thread configuration rather than multi-process.
So no, we shouldn't be forking.

There could potentially be a threading issue?
I'll see if I can patch in a Mutex.

Is there anything we should know about regarding your network setup that might complicate reproducing this?

Possibly?
The application is hosted in GKE behind nginx-ingress. We're using Workload Identity for authentication, so there will be some talking to the metadata server going on (although I'd assume at a higher level?)

@ball-hayden
Author

I've run some more tests.

Upgrading the gems to the latest (compatible) versions (noting [email protected] is released, but not accepted by google-protobuf) does not resolve the issue.

Removing the config.timeout line also didn't appear to make any difference.

I did, however, make the following change (appreciating that this is a toy example):

require "google/cloud/kms"

kms_mutex = Mutex.new

client = Google::Cloud::Kms.key_management_service do |config|
  config.timeout = 2
end

kms_mutex.synchronize do
  client.decrypt(name: key_id, ciphertext: encrypted_data_key).plaintext
end

This does appear to have resolved the issue.

I'm proposing that there is some sort of multi-threading issue at play here.
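For anyone wanting to try the same workaround across several threads, a slightly fuller sketch follows. Everything in it is an assumption for illustration: FakeKmsClient is a hypothetical stand-in for the real client (it just reverses the ciphertext and records how many calls overlap), and SerializedDecryptor is a hypothetical wrapper, not part of google-cloud-kms. It only shows that one shared Mutex keeps decrypt calls from overlapping; it says nothing about the root cause.

```ruby
# Hypothetical stand-in for the KMS client. It tracks how many decrypt
# calls are in flight at once so the effect of the wrapper is visible.
class FakeKmsClient
  attr_reader :max_concurrent

  def initialize
    @guard = Mutex.new
    @active = 0
    @max_concurrent = 0
  end

  def decrypt(name:, ciphertext:)
    @guard.synchronize do
      @active += 1
      @max_concurrent = [@max_concurrent, @active].max
    end
    sleep 0.001 # simulate the network round trip
    ciphertext.reverse
  ensure
    @guard.synchronize { @active -= 1 }
  end
end

# Hypothetical wrapper: allows only one decrypt call at a time.
class SerializedDecryptor
  def initialize(client)
    @client = client
    @mutex = Mutex.new
  end

  def decrypt(**kwargs)
    @mutex.synchronize { @client.decrypt(**kwargs) }
  end
end

fake = FakeKmsClient.new
decryptor = SerializedDecryptor.new(fake)

# Eight threads share one decryptor, much like Puma worker threads would.
results = Array.new(8) do |i|
  Thread.new { decryptor.decrypt(name: "hypothetical-key", ciphertext: "data#{i}") }
end.map(&:value)

puts "calls completed: #{results.length}, max concurrent: #{fake.max_concurrent}"
```

With the real client, the wrapper would take the object returned by Google::Cloud::Kms.key_management_service in place of the fake. The trade-off is that all decrypt calls in the process are serialized, so throughput under load drops.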

@dazuma
Member

dazuma commented Dec 10, 2024

Your threading issue theory does seem to be supported by the text of the exception coming out of grpc ("__gnu_cxx::recursive_init_error"). "recursive init" sounds to me like a reentrancy issue. And if it's coming from C++, we wouldn't have much visibility from the Ruby side.

One other thing I notice, though. My understanding is that the x86_64-linux gems are built against glibc and won't work when run with musl. (That said, I've never heard of that issue failing in this way.) So another thing I suggest trying is to install the "ruby" platform versions (rather than the x86_64-linux platform versions) of both the google-protobuf and grpc gems. That install will unfortunately take a while because it will have to compile grpc from C source. But it also might make a difference. Additionally, do you know if you have some kind of musl-glibc emulation/compatibility library installed? I wonder if it might be what is failing.

@ball-hayden
Author

do you know if you have some kind of musl-glibc emulation/compatibility library installed?

Yeah - we've gcompat available in the environment, so I'd guess that's what's going on here.

I'll see if I can convince bundler to not use the pre-compiled version.
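As a way to confirm which variant of each gem actually gets loaded, something like the following could be run inside the app (for example from a console after the gems are required). It is a rough sketch: the helper names are made up for illustration, and the musl check is only a heuristic that assumes the Ruby build reports "musl" in its host_os, which is usual on Alpine but not guaranteed.

```ruby
require "rubygems"
require "rbconfig"

# Hypothetical helper: true when the platform names a precompiled (binary)
# gem rather than the source "ruby" platform.
def precompiled_platform?(platform)
  platform.to_s != Gem::Platform::RUBY
end

# Hypothetical helper: heuristic check for a musl-based Ruby build.
def musl_host?(host_os = RbConfig::CONFIG["host_os"])
  host_os.to_s.include?("musl")
end

%w[grpc google-protobuf].each do |name|
  spec = Gem.loaded_specs[name] # nil if the gem has not been activated
  if spec.nil?
    puts "#{name}: not loaded"
  else
    kind = precompiled_platform?(spec.platform) ? "precompiled" : "built from source"
    puts "#{name} #{spec.version} (#{spec.platform}): #{kind}"
  end
end

puts "musl host: #{musl_host?}"
```

If both gems report a precompiled x86_64-linux platform while the host looks like musl, that would match the mismatch described above.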

@ball-hayden
Author

Okay. Building from source also seems to have made the problem go away.

Gemfile snippet for completeness:

gem "google-cloud-kms"

# For CloudKMS
gem "google-protobuf", force_ruby_platform: true
gem "grpc", force_ruby_platform: true

You weren't wrong, though - this added about 10 minutes to our build.

I guess we could look into pre-building gRPC / google-protobuf for the correct platform to mitigate this.

Really odd that this is the way it's failing, though, and that the mutex also appears to have made the problem go away.

@dazuma
Member

dazuma commented Dec 12, 2024

Really odd that this is the way it's failing, though, and that the mutex also appears to have made the problem go away.

My guess, from what we're seeing so far, is that there's a thread safety issue in the allocator in gcompat.

It would of course be better if we could get musl-based binary gem releases of google-protobuf and grpc. Not sure how feasible that will be, but we can look into it.
