Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guidance on implementing async timeout #64

Open
tboyko opened this issue Apr 2, 2018 · 10 comments
Open

Guidance on implementing async timeout #64

tboyko opened this issue Apr 2, 2018 · 10 comments

Comments

@tboyko
Copy link

tboyko commented Apr 2, 2018

I would like to include handling for timeout situations when using .push_async and would appreciate some guidance on how best to accomplish this.

To provide some context, we first discovered timeout-related issues in our own project upon using Apnotic's connection.join. The join was never returning and there were also notifications that had been queued via .push_async that had not run their on(:response) yet. Our knee jerk solution is to wrap .join in a timeout block and run .close shortly after, whether a timeout occurs or not.

This got me to thinking... if two notifications are sent with the async method, are they always delivered in sequence? If there is some sort of delay with sending the first one, does the second one stay stuck? My guess is that it depends upon how the http2 streams multiplex the notifications. Is this accurate?

Ultimately, I'm hoping to find a robust way to time out these problems and retry but would greatly benefit from some insights on the best way to approach it with this library.

@maciejkowalski
Copy link

maciejkowalski commented Jul 17, 2018

I am also experiencing hangouts inside Sidekiq worker.

I was able to get stacktrace for current threads via

kill -TTIN [worker_pid]

src: https://github.com/mperham/sidekiq/wiki/Signals

Stacktrace:

2018-07-16T23:55:05.127Z 7571 TID-bubnn WARN: Thread TID-wd31v processor
2018-07-16T23:55:05.127Z 7571 TID-bubnn WARN: /home/ubuntu/.rvm/gems/ruby-2.4.0/gems/apnotic-1.3.1/lib/apnotic/connection.rb:83:in `sleep'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/apnotic-1.3.1/lib/apnotic/connection.rb:83:in `delayed_push_async'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/apnotic-1.3.1/lib/apnotic/connection.rb:49:in `push_async'
/home/ubuntu/abokifx_com/app/workers/abstract_send_mobile_notification_worker.rb:46:in `block in send_push_iphone'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:65:in `block (2 levels) in with'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:64:in `handle_interrupt'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:64:in `block in with'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:61:in `handle_interrupt'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:61:in `with'
/home/ubuntu/abokifx_com/app/workers/abstract_send_mobile_notification_worker.rb:23:in `send_push_iphone'

It points to:
https://github.com/ostinelli/apnotic/blob/1.3.1/lib/apnotic/connection.rb#L83

@ostinelli
Copy link
Owner

Can you please check if this clarifies it for you?
#65

@maciejkowalski
Copy link

maciejkowalski commented Jul 18, 2018

Hey @ostinelli

TL;DR;
Unfortunately not.

LONG STORY:
We are sending a lot of notifications (~30000 iphone users * 10-30 notifications daily).

We have 1 worker for 1 "batch notification".

Code looks like:

# connection pool - declared as CONSTANT inside class (not method) in AbstractWorker
  APNOTIC_POOL = ConnectionPool.new(size: 5) do
    connection = Apnotic::Connection.new(
      cert_path: Rails.root.join(ENV.fetch('IOS_PUSH_APN_PEM_CERT'))
    )
    connection.on(:error) do |exception|
      Sidekiq.logger.error "Exception has been raised on APNS socket: #{exception.inspect}"
    end
    connection
  end

# taken from SendMarketRateNotificationWorker#perform
      ActiveRecord::Base.uncached do
        user_scope(type, rate.currency_name).find_in_batches.with_index do |group, index|
          device_tokens = group.pluck(:device_token)

          device_tokens.each do |device_token|
            send_push_iphone(device_token, notification_message, type)
          end
        end
      end
# [...]

  def send_push_iphone(device_token, notification_message, type)
    APNOTIC_POOL.with do |connection|
      notification = Apnotic::Notification.new(device_token)

      notification.alert = notification_message
      notification.sound = 'default'
      notification.topic = # ...
      notification.custom_payload = {
        others: {
          device_token: device_token,
          push_type: type
        }
      }

      push = connection.prepare_push(notification)
      connection.push_async(push)
    end

{ We don't aim at 100% accuracy, I am happy if 99.9% notifications are sent, so no retry for rare failed notifications }

In my experience, if connection is broken e.g.

Socket was remotely closed
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/net-http2-0.18.0/lib/net-http2/client.rb:130:in `callback_or_raise'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/net-http2-0.18.0/lib/net-http2/client.rb:115:in `rescue in block (2 levels) in ensure_open'/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/net-http2-0.18.0/lib/net-http2/client.rb:109:in `block (2 levels) in ensure_open'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.1/lib/active_support/xml_mini.rb:51: warning: constant ::Fixnum is deprecated
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.1/lib/active_support/xml_mini.rb:52: warning: constant ::Bignum is deprecated
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.1/lib/active_support/core_ext/numeric/conversions.rb:138: warning: constant ::Fixnum is deprecated
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activejob-5.0.1/lib/active_job/arguments.rb:38: warning: constant ::Fixnum is deprecated
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activejob-5.0.1/lib/active_job/arguments.rb:38: warning: constant ::Bignum is deprecated

This connection is still inside ConnectionPool, but it is corrupted and if worker pickes-up this connection, then it hangs-out
zrzut ekranu 2018-07-18 02 51 15

{ 6 godzin temu means == 6 hours ago -> this job is stuck, iphone worker takes 2-5mins }

This result in worker waiting forever (sleep)
{ I posted stacktrace before }

2018-07-16T23:55:05.127Z 7571 TID-bubnn WARN: Thread TID-wd31v processor
2018-07-16T23:55:05.127Z 7571 TID-bubnn WARN: /home/ubuntu/.rvm/gems/ruby-2.4.0/gems/apnotic-1.3.1/lib/apnotic/connection.rb:83:in `sleep'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/apnotic-1.3.1/lib/apnotic/connection.rb:83:in `delayed_push_async'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/apnotic-1.3.1/lib/apnotic/connection.rb:49:in `push_async'
/home/ubuntu/abokifx_com/app/workers/abstract_send_mobile_notification_worker.rb:46:in `block in send_push_iphone'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:65:in `block (2 levels) in with'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:64:in `handle_interrupt'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:64:in `block in with'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:61:in `handle_interrupt'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:61:in `with'
/home/ubuntu/abokifx_com/app/workers/abstract_send_mobile_notification_worker.rb:23:in `send_push_iphone'

@ostinelli
Copy link
Owner

ostinelli commented Jul 18, 2018

I don't understand how you can get this error being raised if you define a connection callback:

Socket was remotely closed
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/net-http2-0.18.0/lib/net-http2/client.rb:130:in `callback_or_raise'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/net-http2-0.18.0/lib/net-http2/client.rb:115:in `rescue in block (2 levels) in ensure_open'/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/net-http2-0.18.0/lib/net-http2/client.rb:109:in `block (2 levels) in ensure_open'

This should take care of not raising an error. Any ideas why you are seeing this?

On a side note, instead of your current custom way of declaring a connection you can now use:

APNOTIC_POOL = Apnotic::ConnectionPool.new({
  cert_path: Rails.root.join("config", "certs", "apns_certificate.pem"),
  cert_pass: "mypass"
}, size: 5) do |connection|
  connection.on(:error) { |exception| puts "Exception has been raised: #{exception}" }
end

@maciejkowalski
Copy link

maciejkowalski commented Jul 18, 2018

Hey @ostinelli
I pasted old stacktrace, I already have connection callback before:

  # https://github.com/ostinelli/apnotic/issues/48#issuecomment-399776823
  APNOTIC_POOL = ConnectionPool.new(size: 5) do
    connection = Apnotic::Connection.new(
      cert_path: Rails.root.join(ENV.fetch('IOS_PUSH_APN_PEM_CERT'))
    )
    connection.on(:error) do |exception|
      Sidekiq.logger.error "Exception has been raised on APNS socket: #{exception.inspect}"
    end
    connection
  end

Right now I switched from ConnectionPool to 1 connection per worker and it works.

I switched from ConnectionPool to 1 connection per worker (create new connection, use it to send notifiactions, close it).

It seems to me like after connection is broken
see #73
then it doesn't reinitialize/repair/heal the connection inside pool and when worker tries to use it, it hangs forever.

@ostinelli
Copy link
Owner

#73 has nothing to do with this. What do you mean "I pasted old stacktrace"?

@maciejkowalski
Copy link

What do you mean "I pasted old stacktrace"?

That means this stacktrace:

Socket was remotely closed
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/net-http2-0.18.0/lib/net-http2/client.rb:130:in `callback_or_raise'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/net-http2-0.18.0/lib/net-http2/client.rb:115:in `rescue in block (2 levels) in ensure_open'/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/net-http2-0.18.0/lib/net-http2/client.rb:109:in `block (2 levels) in ensure_open'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.1/lib/active_support/xml_mini.rb:51: warning: constant ::Fixnum is deprecated
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.1/lib/active_support/xml_mini.rb:52: warning: constant ::Bignum is deprecated
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.1/lib/active_support/core_ext/numeric/conversions.rb:138: warning: constant ::Fixnum is deprecated
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activejob-5.0.1/lib/active_job/arguments.rb:38: warning: constant ::Fixnum is deprecated
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activejob-5.0.1/lib/active_job/arguments.rb:38: warning: constant ::Bignum is deprecated

was when, I had ConnectionPool without callback (and Sidekiq worker was crashing)

class AbstractSendMobileNotificationWorker
  include Sidekiq::Worker
  sidekiq_options retry: 0

  APNOTIC_POOL = ConnectionPool.new(size: 5) do
    connection = Apnotic::Connection.new(
      cert_path: Rails.root.join(ENV.fetch('IOS_PUSH_APN_PEM_CERT'))
    )
    connection
  end
# [...]

After that (found info here) I added callback:

class AbstractSendMobileNotificationWorker
  include Sidekiq::Worker
  sidekiq_options retry: 0

  # https://github.com/ostinelli/apnotic/issues/48#issuecomment-399776823
  APNOTIC_POOL = ConnectionPool.new(size: 5) do
    connection = Apnotic::Connection.new(
      cert_path: Rails.root.join(ENV.fetch('IOS_PUSH_APN_PEM_CERT'))
    )
    connection.on(:error) do |exception|
      Sidekiq.logger.error "Exception has been raised on APNS socket: #{exception.inspect}"
    end
    connection
  end
# [...]

Sidekiq stopped crashing when connection was closed, but I experienced issues with hanging sidekiq jobs.

#73 has nothing to do with this.

In my experience (YMMV) if you got 2 SocketErrors, 2 sidekiq workers were executing forever.
So I suspect it has to do with corrupted connection inside ConnectionPool

@ostinelli
Copy link
Owner

Please see if latest 1.6.0 solves your issue.

@wogg
Copy link

wogg commented Oct 14, 2019

I'm posting this here, because it seems the same as the original poster's issue, however I am in running using a single connection in a rake task. It happens inconsistently. This is the stack that is dumped, when I interrupt a stuck process using sig TERM:

versions: (ruby 2.3.8)

   apnotic (1.6.0)
      connection_pool (~> 2)
      net-http2 (>= 0.18, < 2)

trace reported when interrupted:

.../vendor/ruby/2.3.0/gems/net-http2-0.18.2/lib/net-http2/client.rb:59:in `sleep'
.../vendor/ruby/2.3.0/gems/net-http2-0.18.2/lib/net-http2/client.rb:59:in `join'
.../vendor/ruby/2.3.0/gems/apnotic-1.6.0/lib/apnotic/connection.rb:76:in `join'
.../lib/tasks/apple_notifications.rake:73:in `block (3 levels) in <top (required)>'
.../lib/tasks/apple_notifications.rake:89:in `block in device_selector'
.../vendor/ruby/2.3.0/gems/activerecord-4.2.5.1/lib/active_record/relation/batches.rb:124:in `find_in_batches'
.../lib/tasks/apple_notifications.rake:89:in `device_selector'
.../lib/tasks/apple_notifications.rake:43:in `block (2 levels) in <top (required)>'
.../vendor/ruby/2.3.0/gems/rake-12.3.1/exe/rake:27:in `<top (required)>'
.../vendor/ruby/2.3.0/bin/ruby_executable_hooks:24:in `eval'
.../vendor/ruby/2.3.0/bin/ruby_executable_hooks:24:in `<main>'

it's a basic async_push:

      pushme.on(:response) do |result|
          begin 
            handle_result_for_device(device.id, result, true)
          rescue Exception => e
            puts("unknown exception %s" % e)
            puts(e.backtrace)
          end
      end

# (...) and here is the async push ("selected_pusher" is an apnotic connection)

        begin
          selected_pusher.push_async(pushme)
        rescue Errno::ECONNRESET, HTTP2::Error, Errno::EPIPE => e
          puts("connection error device %s skipping and trying next: %s" % [
            device.id, e.inspect])
          next
        rescue Exception => e
          puts("unknown exception %s" % e)
          puts(e.backtrace)
        end

In the attempt to troubleshoot, I have reduced a batch of around 30k pushes on a connection, to doing batches of 200, and re-building the connection for each batch. However, I still run into this "sleep forever" behavior on .join in about 1/4 of the runs of the job.

@pbhogan
Copy link

pbhogan commented Oct 29, 2019

I'm having a similar hang on sends (not async, and using individual connections, not connection pool, although the problem happens with the pool also). It just seems to get stuck on the send forever.

So far, adding an explicit timeout to the send call seems to be working, but it remains to be seen if this is a long term fix.

notification = Apnotic::Notification.new token
# ...
response = connection.push notification, timeout: 5
connection.close

Does the internal HTTP2 call without a specified timeout just wait indefinitely and that's what's causing these hangs?

response = @client.call(:post, request.path,
body: request.body,
headers: request.headers,
timeout: options[:timeout]
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants