scalyr agent self-telemetry #1016

Open
cskinfill opened this issue Oct 18, 2022 · 6 comments

Comments

@cskinfill

Does the Scalyr agent expose any internal metrics? I'd like to set up some alerts via Prometheus based on internal metrics scraped from the Scalyr agent pods. Is that possible?

@weilliu

weilliu commented Dec 21, 2022

@cskinfill Sorry for the late response. Scalyr K8s agent does collect metrics by default. You can find the list of metrics here in our documentation.

We don't have a public integration with Prometheus at this time. That said, I did find the K8s explorer, which engineering has been working on and which might work for you:
https://app.scalyr.com/help/scalyr-agent-k8s-explore

The feature is currently in preview mode and I don't have any additional information on that besides what's available in the doc. If you are interested in it, I can check with engineering to get more information for you.

@tr3mor

tr3mor commented Oct 26, 2023

Sorry for the late response. Scalyr K8s agent does collect metrics by default. You can find the list of metrics here in our documentation.

Hey @weilliu.
I am pretty sure that is not what was requested here.
We would also like to monitor Scalyr agent performance (agent health, the number of logs it failed to send, current throughput, the number of pods being processed by this agent, and more), but without the agent exposing this information as metrics, it is really hard to observe and maintain the agent.
Here is an example of such a feature in fluentd: https://github.com/fluent/fluent-plugin-prometheus

@weilliu

weilliu commented Oct 26, 2023

Most of the metrics mentioned (agent health, number of pods processed, failed sent bytes) are available in the agent status output (kubectl exec {scalyr-agent-2-pod} -- /usr/sbin/scalyr-agent-2 status -v). One way to get those data into the account is to create a cronjob to run the command periodically and send the results back to DataSet.
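
A rough sketch of the collection half of such a cronjob, in Python. It only wraps the command quoted above. The `scalyr` namespace default, the function names, and the injectable `run` parameter are hypothetical choices for illustration, and forwarding the output to DataSet is left out.

```python
import subprocess

# Command path and flags as quoted in the comment above.
STATUS_CMD = ["/usr/sbin/scalyr-agent-2", "status", "-v"]


def build_status_command(pod, namespace="scalyr"):
    """Return the kubectl argv that runs the status command inside `pod`.

    The default namespace is an assumption; use whatever your DaemonSet runs in.
    """
    return ["kubectl", "exec", "-n", namespace, pod, "--"] + STATUS_CMD


def collect_status(pod, namespace="scalyr", timeout=30, run=subprocess.run):
    """Run the status command once and return its stdout, or None on failure."""
    try:
        result = run(
            build_status_command(pod, namespace),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # kubectl missing, or the command did not finish within `timeout`.
        return None
    if result.returncode != 0:
        return None
    return result.stdout
```

Returning `None` on a timeout or non-zero exit lets the caller treat "no status" as a signal in its own right rather than silently skipping the sample.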

@tr3mor

tr3mor commented Oct 27, 2023

Let's say my pods can't communicate with the DataSet platform for some reason. In that case this job will not be able to ingest data either, and I will have no data to alert on.
Also, I am unsure why I need to create a cronjob, grant it permission to discover agent pods and exec into them, parse the output, and ingest it, instead of the agent just exposing its already existing metrics in Prometheus format. That would help integrate the Scalyr agent into a quite common observability stack and give better visibility into agent performance.
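
To make the request concrete, here is a minimal sketch of what "exposing metrics in Prometheus format" amounts to: a `/metrics` HTTP endpoint returning the plain-text exposition format. This is not how the agent works today. The `scalyr_agent_` prefix, the metric names, and the `get_metrics` callback are all hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_prometheus(metrics, prefix="scalyr_agent_"):
    """Render {name: (type, value)} as Prometheus text exposition format."""
    lines = []
    for name, (metric_type, value) in sorted(metrics.items()):
        full_name = prefix + name
        lines.append(f"# TYPE {full_name} {metric_type}")
        lines.append(f"{full_name} {value}")
    return "\n".join(lines) + "\n"


def make_handler(get_metrics):
    """Build a request handler that serves get_metrics() at /metrics."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = render_prometheus(get_metrics()).encode("utf-8")
            self.send_response(200)
            self.send_header(
                "Content-Type", "text/plain; version=0.0.4; charset=utf-8"
            )
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet; a real exporter would log requests

    return MetricsHandler
```

It could be served with `HTTPServer(("", 9100), make_handler(get_metrics)).serve_forever()`, where port 9100 is an arbitrary choice. Because Prometheus pulls from the endpoint, alerting keeps working even when the pods cannot reach DataSet.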

@tr3mor

tr3mor commented Oct 31, 2023

Okay, here is another problem with the /usr/sbin/scalyr-agent-2 status -v approach.
After the command has been called repeatedly for some time, the agent just gets stuck and can't return its status anymore.
It can be reproduced with something like:

while true; do /usr/sbin/scalyr-agent-2 status -v; sleep 60; done

For me, it takes around 1 hour of running this before it gets stuck:

Failed to get status within 5 seconds.  Giving up.  The agent process is possibly stuck.  See /var/log/scalyr-agent-2/agent.log for more details

And there is nothing in the logs:

tail /var/log/scalyr-agent-2/agent.log
2023-10-31 19:55:25.039Z INFO [core] [scalyr_agent.agent_main:1993] agent_requests requests_sent=131404 requests_failed=0 bytes_sent=1319251600 compressed_bytes_sent=300587792 bytes_received=4861948 request_latency_secs=1743.936202 connections_created=6
2023-10-31 19:56:25.151Z INFO [core] [scalyr_agent.agent_main:1993] agent_requests requests_sent=131476 requests_failed=0 bytes_sent=1319976838 compressed_bytes_sent=300740446 bytes_received=4864612 request_latency_secs=1745.733488 connections_created=6
2023-10-31 19:57:25.323Z INFO [core] [scalyr_agent.agent_main:1993] agent_requests requests_sent=131548 requests_failed=0 bytes_sent=1320670194 compressed_bytes_sent=300887376 bytes_received=4867276 request_latency_secs=1747.247307 connections_created=6
2023-10-31 19:58:25.454Z INFO [core] [scalyr_agent.agent_main:1993] agent_requests requests_sent=131622 requests_failed=0 bytes_sent=1321374104 compressed_bytes_sent=301036938 bytes_received=4870014 request_latency_secs=1748.840341 connections_created=6
2023-10-31 19:59:25.619Z INFO [core] [scalyr_agent.agent_main:1993] agent_requests requests_sent=131694 requests_failed=0 bytes_sent=1322059558 compressed_bytes_sent=301180778 bytes_received=4872678 request_latency_secs=1750.451800 connections_created=6
2023-10-31 19:59:25.626Z INFO [core] [scalyr_agent.agent_main:2019] copy_manager_status num_worker_sessions=3 total_scan_iterations=21396 total_read_time=88.982987 total_compression_time=13.323100 total_waiting_time=328378.671414 total_blocking_response_time=754.491354 total_request_time=876.629165 total_pipelined_requests=65847 avg_bytes_produced_rate=2249.366667 avg_bytes_copied_rate=2249.366667
2023-10-31 20:00:25.756Z INFO [core] [scalyr_agent.agent_main:1993] agent_requests requests_sent=131766 requests_failed=0 bytes_sent=1322836992 compressed_bytes_sent=301335892 bytes_received=4875342 request_latency_secs=1752.343806 connections_created=6
2023-10-31 20:01:26.000Z INFO [core] [scalyr_agent.agent_main:1993] agent_requests requests_sent=131838 requests_failed=0 bytes_sent=1323658314 compressed_bytes_sent=301502124 bytes_received=4878006 request_latency_secs=1754.616942 connections_created=6
2023-10-31 20:02:26.300Z INFO [core] [scalyr_agent.agent_main:1993] agent_requests requests_sent=131910 requests_failed=0 bytes_sent=1324332140 compressed_bytes_sent=301645218 bytes_received=4880670 request_latency_secs=1757.999244 connections_created=6
2023-10-31 20:03:26.504Z INFO [core] [scalyr_agent.agent_main:1993] agent_requests requests_sent=131982 requests_failed=0 bytes_sent=1325011808 compressed_bytes_sent=301789054 bytes_received=4883334 request_latency_secs=1759.914332 connections_created=6
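
As a stopgap, the `agent_requests` and `copy_manager_status` lines above already carry counters such as `requests_failed` in `key=value` form, so they can be read straight from the log file. A minimal Python sketch, assuming the line layout shown above stays stable; the regex and the function name are illustrative and not part of the agent.

```python
import re

# Expected layout, taken from the log excerpt above:
# <date> <time> <LEVEL> [<component>] [<module>:<line>] <kind> key=value ...
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) \w+ \[\w+\] \[[^\]]+\] (?P<kind>\w+) (?P<fields>.+)$"
)


def parse_status_line(line):
    """Parse one status line into (timestamp, kind, fields), or None.

    Values are converted to int or float where possible and kept as
    strings otherwise. Lines without any key=value pair return None.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = {}
    for pair in match.group("fields").split():
        key, sep, raw = pair.partition("=")
        if not sep:
            continue  # not a key=value token
        try:
            fields[key] = float(raw) if "." in raw else int(raw)
        except ValueError:
            fields[key] = raw
    if not fields:
        return None
    return match.group("ts"), match.group("kind"), fields
```

Comparing `requests_failed` between two consecutive `agent_requests` lines gives the number of failed requests in that minute, which is the kind of value one would alert on.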

@weilliu

weilliu commented Nov 3, 2023

@tr3mor I see your point. Let me create a feature request for the product to review.
