Question about agent liveness detection #1

Closed
ZeaLoVe opened this issue Aug 31, 2015 · 2 comments

Comments

ZeaLoVe commented Aug 31, 2015

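The self-monitoring docs only cover checking the components, so I'd like to ask: how is agent liveness detection done? (There is an agent.alive metric, but it can't trigger an alert once an agent dies. Configuring agent.alive != 1 is useless too, because no data ever reaches Judge.) Is it also handled by another external component similar to Anteye? I have an idea: since hbs receives heartbeat data periodically, with the same period as the system metric uploads, why not let hbs take over reporting all agent.alive metrics? For every endpoint in the host list, hbs would set agent.alive to 1 if it received a heartbeat and to 0 otherwise, then push the value to the agent on its own machine.
That way the agent.alive metric would clearly show real-time online status, and alert templates could be configured on it, which brings agent checking into the system as a whole.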
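To make the proposal concrete, here is a minimal sketch of the decision hbs would make each period. It is not hbs code: `build_alive_metrics`, the `last_heartbeat` mapping, and the choice of window are all hypothetical, and the item fields (`endpoint`, `metric`, `timestamp`, `step`, `value`, `counterType`, `tags`) are my assumption of what the agent's push interface expects.

```python
def build_alive_metrics(hosts, last_heartbeat, now, step=60):
    """Build one agent.alive item per endpoint in the host list.

    hosts          -- endpoints known to hbs (hypothetical input)
    last_heartbeat -- endpoint -> unix time of the last heartbeat seen
    now            -- current unix time, passed in so the function is testable
    step           -- reporting period in seconds; also used as the window
    """
    items = []
    for endpoint in hosts:
        seen = last_heartbeat.get(endpoint)
        # Alive only if a heartbeat arrived within the last `step` seconds.
        alive = 1 if seen is not None and now - seen <= step else 0
        items.append({
            "endpoint": endpoint,
            "metric": "agent.alive",
            "timestamp": now,
            "step": step,
            "value": alive,
            "counterType": "GAUGE",  # assumed field name and value
            "tags": "",
        })
    return items
```

The resulting list would then be serialized and pushed to the local agent. Since agents do not all report at the same moment in the period, a window somewhat larger than one `step` would probably be needed to avoid false zeros.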

ZeaLoVe commented Aug 31, 2015

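Considering that each agent reports its heartbeat at a different time, this change isn't such a good one after all. I'd like to hear your thoughts first.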

itxx00 commented Sep 2, 2015

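Xiaomi is currently developing something called nodata. Roughly, you configure metrics such as agent.alive so that an alert fires when no data is reported. nodata hasn't been released yet, though, so I took a different approach: I wrote a script that uses the query component's last-data interface to fetch the time each agent.alive was last reported. Once an agent's last report is older than a threshold, the script sets that agent's agent.alive to 0. This is basically good enough as long as the number of machines isn't very large.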
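The staleness check in such a script could look roughly like the sketch below. This is not the actual script: `find_stale_agents` is a hypothetical name, the 180-second threshold is arbitrary, and the response shape (a list of objects with `endpoint` and a `value` holding `timestamp`) is an assumption about what the last-data interface returns.

```python
def find_stale_agents(last_points, now, threshold=180):
    """Return endpoints whose last agent.alive report is older than `threshold` seconds.

    last_points -- assumed response shape, one dict per endpoint:
                   {"endpoint": "host-1", "value": {"timestamp": 1441180800, "value": 1}}
    now         -- current unix time, passed in so the function is testable
    """
    stale = []
    for point in last_points:
        value = point.get("value") or {}  # treat a missing/null value as "never reported"
        ts = value.get("timestamp")
        if ts is None or now - ts > threshold:
            stale.append(point["endpoint"])
    return sorted(stale)
```

For every endpoint returned, the script would then push agent.alive = 0 on that agent's behalf, so that an ordinary agent.alive != 1 alert strategy has a data point to match.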

ZeaLoVe closed this as completed Jun 20, 2022