Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hotshots client not writing to telegraf server in process.on('exit') blocks, so I can't send exit codes as metrics #239

Open
meyerkev opened this issue Dec 12, 2022 · 2 comments

Comments

@meyerkev
Copy link

meyerkev commented Dec 12, 2022

Update: This is because our socket calls aren't calling their callbacks.

I stick a console.log('Handling Callback') into

  const handleCallback = (err, bytes) => {
    console.log("Handling callback")

in lib/statsd.js

Before process.on('exit')

iccservice                  | Trace: socket send
iccservice                  |     at Object.send (/src/node_modules/hot-shots/lib/transport.js:81:17)
iccservice                  |     at Client.sendMessage (/src/node_modules/hot-shots/lib/statsd.js:399:17)
iccservice                  |     at Client._send (/src/node_modules/hot-shots/lib/statsd.js:298:12)
iccservice                  |     at Client.send (/src/node_modules/hot-shots/lib/statsd.js:263:8)
iccservice                  |     at Client.sendStat (/src/node_modules/hot-shots/lib/statsd.js:239:8)
iccservice                  |     at Client.sendAll (/src/node_modules/hot-shots/lib/statsd.js:215:10)
iccservice                  |     at Client.gauge (/src/node_modules/hot-shots/lib/statsFunctions.js:200:10)
iccservice                  |     at Timeout.statsdDevices [as _onTimeout] (/src/src/message/metrics.js:39:16)
iccservice                  |     at listOnTimeout (node:internal/timers:559:17)
iccservice                  |     at processTimers (node:internal/timers:502:7)
iccservice                  | Handling callback
iccservice                  | Trace: socket send
iccservice                  |     at Object.send (/src/node_modules/hot-shots/lib/transport.js:81:17)
iccservice                  |     at Client.sendMessage (/src/node_modules/hot-shots/lib/statsd.js:398:17)
iccservice                  |     at Client._send (/src/node_modules/hot-shots/lib/statsd.js:298:12)
iccservice                  |     at Client.send (/src/node_modules/hot-shots/lib/statsd.js:263:8)
iccservice                  |     at Client.sendStat (/src/node_modules/hot-shots/lib/statsd.js:239:8)
iccservice                  |     at Client.sendAll (/src/node_modules/hot-shots/lib/statsd.js:215:10)
iccservice                  |     at Client.increment (/src/node_modules/hot-shots/lib/statsFunctions.js:138:10)
iccservice                  |     at process.<anonymous> (file:///src/src/index.mjs:152:33)
iccservice                  |     at process.emit (node:events:513:28)
iccservice                  |     at process.emit (node:domain:489:12)
iccservice                  |     at process._fatalException (node:internal/process/execution:156:19)

(Ignore the line numbers, I'm doing printf debugging + breakpoints)

So what I think is happening is that the socket call doesn't work because you're never allowed to go async in process.on('exit')

I don't know for sure, but I do know that that socket call doesn't call its callback because I put a breakpoint on it and it fires outside of process.on('exit'), but not in it. So once you've caught an unhandled exception, there's no good way to send metrics from your application and we can't containerize/orchestrate this, so I guess we're going to run a 3rd-party application that does nothing but run a StatsD call, wait for the callbacks, then quit.

Original Post:

When I write metrics synchronously, I get metrics out

When I write metrics from process.on('uncaughtExceptionMonitor') or process.on('exit') blocks, the log messages all show up in my stderr, none of my metrics send. What am I doing wrong on hotshots 9.3.0?

//None of this works
process.on('unhandledRejection', (reason, p) => {
    Logger.error(`Unhandled Rejection at: Promise ${p} reason: ${reason.stack}`, reason)
    if (manualDI) {
        // One day, we will get a real monitoring system in place that comes with events and therefore dict tagging
        // Until then, we'll just use statsD
        Logger.debug('Incrementing unhandledRejection metric')
        manualDI.statsD.increment('iccservice.unhandledRejection', [`instance-os.hostname()`])
        Logger.debug('Incremented unhandledRejection metric')
    }
})

// TODO: Because we catch unhandedRejection, we could change this to uncaughtException
//       Though if we did that, we'd need to explicitly crash as per docs
process.on('uncaughtExceptionMonitor', (err, origin) => {
    Logger.error('Uncaught exception, process will terminate.', err)
    if (manualDI) {
        // One day, we will get a real monitoring system in place that comes with events and therefore dict tagging
        // Until then, we'll just use statsD
        Logger.debug('Incrementing uncaughtException metric')
        manualDI.statsD.increment('iccservice.uncaughtException', [`instance-${os.hostname()}`])
        Logger.debug('Incremented uncaughtException metric')
    }
})

process.on('exit', (code) => {
    Logger.debug(`Process is exiting with code: ${code}`)
    if (manualDI && manualDI.statsD) {
        if (code !== 0) {
            manualDI.statsD.increment('iccservice.exit_code.nonzero_exit', /*[`code-${code}`, `instance-${os.hostname()}`]*/)
            Logger.debug('Incremented nonzero_exit metric')
        } else {
            manualDI.statsD.increment('iccservice.exit_code.zero_exit' [`instance-${os.hostname()}`])
            Logger.debug('Incremented zero_exit metric')
        }
        manualDI.statsD.close((err) => {
            Logger.Error('The StatsD close did not work quite right: ', err);
        })
    }
    
})

//But this works
manualDI.statsD.increment('iccservice.startup', [`success`, `instance-${os.hostname()}`])

setTimeout(() => {  throw Error('test error'); }, 10000);
@meyerkev meyerkev changed the title Hotshots client not writing to telegraf server in process.on('exit') blocks Hotshots client not writing to telegraf server in process.on('exit') blocks, so I can't send exit codes as metrics Dec 13, 2022
@meyerkev
Copy link
Author

meyerkev commented Dec 13, 2022

So I did some code searching in a forked repo and I think what's happening is that on the on-close when we try to flush, we do a non-synchronous socket send and then there's no callback handler to let us cycle until we've finished flushing. So the on(exit) just kills the process before we can send our message.

And this is very very awkward.

iccservice                  | {"date":"2022-12-13T22:04:21.067Z","level":"DEBUG","message":"Checking for new ISOXML files since: Tue Dec 13 2022 21:04:11 GMT+0000"}
iccservice                  | Sending all stats iccservice.requestPerMs.max 0 g undefined undefined undefined
iccservice                  | Sending Message iccservice.requestPerMs.max:0|g of length 31
iccservice                  | 31 bytes sent
iccservice                  | Sending all stats iccservice.devices.total 0 g undefined undefined undefined
iccservice                  | Sending Message iccservice.devices.total:0|g of length 28
iccservice                  | 28 bytes sent
iccservice                  | Sending all stats iccservice.health.rss 115191808 g undefined undefined undefined
iccservice                  | Sending Message iccservice.health.rss:115191808|g of length 33
iccservice                  | Sending all stats iccservice.health.heapTotal 60923904 g undefined undefined undefined
iccservice                  | Sending Message iccservice.health.heapTotal:60923904|g of length 38
iccservice                  | Sending all stats iccservice.health.heapUsed 55299696 g undefined undefined undefined
iccservice                  | Sending Message iccservice.health.heapUsed:55299696|g of length 37
iccservice                  | Sending all stats iccservice.health.external 1466484 g undefined undefined undefined
iccservice                  | Sending Message iccservice.health.external:1466484|g of length 36
iccservice                  | Sending all stats iccservice.health.arrayBuffers 107684 g undefined undefined undefined
iccservice                  | Sending Message iccservice.health.arrayBuffers:107684|g of length 39
iccservice                  | Sending all stats iccservice.health.eventLoopLag 5.8031289875507355 g undefined undefined undefined
iccservice                  | Sending Message iccservice.health.eventLoopLag:5.8031289875507355|g of length 51
iccservice                  | Sending all stats iccservice.dbpool.connectionQueue 0 g { poolName: 'writer' } undefined undefined
iccservice                  | Sending Message iccservice.dbpool.connectionQueue,poolName=writer:0|g of length 53
iccservice                  | Sending all stats iccservice.dbpool.connectionsInUse 2 g { poolName: 'writer' } undefined undefined
iccservice                  | Sending Message iccservice.dbpool.connectionsInUse,poolName=writer:2|g of length 54
iccservice                  | Sending all stats iccservice.dbpool.connectionQueue 0 g { poolName: 'reader' } undefined undefined
iccservice                  | Sending Message iccservice.dbpool.connectionQueue,poolName=reader:0|g of length 53
iccservice                  | Sending all stats iccservice.dbpool.connectionsInUse 3 g { poolName: 'reader' } undefined undefined
iccservice                  | Sending Message iccservice.dbpool.connectionsInUse,poolName=reader:3|g of length 54
iccservice                  | {"date":"2022-12-13T22:04:21.142Z","level":"ERROR","message":"Uncaught exception, process will terminate.","stack":"Error: test error\n    at Timeout._onTimeout (file:///src/src/index.mjs:166:27)\n    at listOnTimeout (node:internal/timers:559:17)\n    at processTimers (node:internal/timers:502:7)","error":"test error"}
iccservice                  | Sending all stats iccservice.uncaughtException 1 c undefined undefined undefined
iccservice                  | Sending Message iccservice.uncaughtException:1|c of length 32
iccservice                  | file:///src/src/index.mjs:166
iccservice                  | setTimeout(() => {  throw Error('test error'); }, 10000);
iccservice                  |                     ^
iccservice                  |
iccservice                  | Error: test error
iccservice                  |     at Timeout._onTimeout (file:///src/src/index.mjs:166:27)
iccservice                  |     at listOnTimeout (node:internal/timers:559:17)
iccservice                  |     at processTimers (node:internal/timers:502:7)
iccservice                  | [nodemon] app crashed - waiting for file changes before starting...
iccservice-serverlist-1     | {"date":"2022-12-13T22:04:21.188Z","level":"INFO","message":"192.168.16.3:34854/m2m-relay cancel"}
iccservice-serverlist-1     | {"date":"2022-12-13T22:04:21.190Z","level":"INFO","message":"192.168.16.3:34854/m2m-relay finish"}
iccservice-serverlist-1     | {"date":"2022-12-13T22:04:21.192Z","level":"INFO","message":"192.168.16.3:34854/m2m-relay close"}
iccservice-telegraf-1       | iccservice.devices.total,host=34413a96c884,metric_type=gauge value=0 1670969060000000000
iccservice-telegraf-1       | iccservice.requestPerMs.max,host=34413a96c884,metric_type=gauge value=0 1670969060000000000
iccservice-telegraf-1       | iccservice.startup,host=34413a96c884,metric_type=counter,status=success value=1i 1670969060000000000
iccservice-telegraf-1       | 2022-12-13T22:04:26Z D! [outputs.file] Wrote batch of 3 metrics in 144.5µs
iccservice-telegraf-1       | 2022-12-13T22:04:26Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
iccservice-telegraf-1       | 2022-12-13T22:04:36Z D! [outputs.file] Wrote batch of 12 metrics in 395.167µs
iccservice-telegraf-1       | 2022-12-13T22:04:36Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
iccservice-telegraf-1       | iccservice.kinesis.socketsInUse,host=34413a96c884,metric_type=gauge,streamName=proactive-service-telemetry value=0 1670969070000000000
iccservice-telegraf-1       | iccservice.requestPerMs.max,host=34413a96c884,metric_type=gauge value=0 1670969070000000000
iccservice-telegraf-1       | serverlist.health.heapTotal,host=34413a96c884,metric_type=gauge value=35401728 1670969070000000000
iccservice-telegraf-1       | serverlist.health.external,host=34413a96c884,metric_type=gauge value=1078178 1670969070000000000
iccservice-telegraf-1       | iccservice.kinesis.freeSockets,host=34413a96c884,metric_type=gauge,streamName=rdaf-raw value=0 1670969070000000000
iccservice-telegraf-1       | serverlist.health.eventLoopLag,host=34413a96c884,metric_type=gauge value=9.170170992612839 1670969070000000000
iccservice-telegraf-1       | iccservice.kinesis.socketsInUse,host=34413a96c884,metric_type=gauge,streamName=rdaf-raw value=0 1670969070000000000
iccservice-telegraf-1       | iccservice.kinesis.freeSockets,host=34413a96c884,metric_type=gauge,streamName=proactive-service-telemetry value=0 1670969070000000000
iccservice-telegraf-1       | iccservice.devices.total,host=34413a96c884,metric_type=gauge value=0 1670969070000000000
iccservice-telegraf-1       | serverlist.health.rss,host=34413a96c884,metric_type=gauge value=0 1670969070000000000
iccservice-telegraf-1       | serverlist.health.heapUsed,host=34413a96c884,metric_type=gauge value=18517456 1670969070000000000
iccservice-telegraf-1       | serverlist.health.arrayBuffers,host=34413a96c884,metric_type=gauge value=34056 1670969070000000000
iccservice-telegraf-1       | serverlist.health.rss,host=34413a96c884,metric_type=gauge value=0 1670969080000000000
iccservice-telegraf-1       | serverlist.health.heapTotal,host=34413a96c884,metric_type=gauge value=35401728 1670969080000000000
iccservice-telegraf-1       | serverlist.health.heapUsed,host=34413a96c884,metric_type=gauge value=18635752 1670969080000000000
iccservice-telegraf-1       | serverlist.health.external,host=34413a96c884,metric_type=gauge value=1078218 1670969080000000000
iccservice-telegraf-1       | serverlist.health.arrayBuffers,host=34413a96c884,metric_type=gauge value=34056 1670969080000000000
iccservice-telegraf-1       | serverlist.health.eventLoopLag,host=34413a96c884,metric_type=gauge value=4.767379015684128 1670969080000000000
iccservice-telegraf-1       | 2022-12-13T22:04:46Z D! [outputs.file] Wrote batch of 6 metrics in 258.958µs
iccservice-telegraf-1       | 2022-12-13T22:04:46Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics

So there's no on-send, and there's no receipt

@meyerkev
Copy link
Author

meyerkev commented Dec 14, 2022

OK, so what I think is happening even more is that we're not tracking socket sends so if we do a sudden shutdown, there's no finally block saying "Hey, wait up, you're in the middle of sending 3 metrics"

I have ripped libs/statsd.js to shreds adding callbacks and synchronous infinite while loops until callback done and since we're not tracking these metrics, we're just in callback hell.

Also, there's no way to get a callback on SUCCESS of a close. So there's no way in process.on('exit') to wait on our callbacks until the flush and close has completed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant