[BUG] carbon-cache always returns 0 data points for cached metric, but writes the metric to disk regularly #940
Comments
Looks like a bug indeed, but not sure how it can be investigated externally. :(
Thanks for your reply, I will provide both configurations. A representative carbon-cache config:

graphite-web config:

I understand this would be hard to diagnose remotely. I'm happy to run any tests that would help. Thank you.
@percygrunwald : yes, config looks perfectly fine, indeed. Will think about how better to debug that. Maybe we have some hashing issue and graphite-web is checking a different carbonlink instance instead of the proper one for some metrics...
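One way to check that hypothesis from the graphite-web side is to reproduce the carbonlink instance selection by hand. A minimal sketch, assuming graphite-web's `ConsistentHashRing` is importable and using hypothetical `CARBONLINK_HOSTS` entries of the form `server:port:instance`:

```python
# Reproduce graphite-web's carbonlink instance selection for a metric name.
# The host list below is an assumption; substitute your CARBONLINK_HOSTS.
from graphite.render.hashing import ConsistentHashRing

carbonlink_hosts = ["127.0.0.1:7002:a", "127.0.0.1:7102:b"]

nodes = []
for entry in carbonlink_hosts:
    server, _port, instance = entry.split(":")
    nodes.append((server, instance))

ring = ConsistentHashRing(nodes)
for metric in ("statsd.host123-primary.some_metric.count",
               "statsd.host124-primary.some_metric.count"):
    # get_node returns the (server, instance) tuple the metric hashes to;
    # carbon-c-relay's carbon_ch routing is meant to agree with this choice.
    print(metric, "->", ring.get_node(metric))
```

If the relay's routing and this selection disagree for the affected metric names, that would explain reads going to a cache instance that never saw the writes.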
@deniszh thank you for your feedback.
In my testing,
This is only the write side, but it's the correct port for
Unless the log is wrong, it appears as though the write and read are happening to the same carbon-cache instance (
Thank you for the suggestion. We make use of go-carbon in our new architecture and it has worked very well as a carbon replacement.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Looks like the bug really exists, but I'm still very confused by it.
Hello Deniszh, I think this is the same issue I'm experiencing on my setup. Is there any way to check the carbon daemons directly to see what data they're currently holding?
Sorry, could you please elaborate on what you mean?
@deniszh I'm seeing the same error @percygrunwald is seeing.
I created a question on graphite-web last year. Is there any way to query the carbon-cache daemons directly and see if the metrics do exist in the caches?
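For reference, graphite-web asks carbon-cache for in-memory points over the carbonlink protocol: a pickled `cache-query` dict framed by a 4-byte big-endian length. A minimal sketch of issuing the same query by hand, assuming a carbon-cache instance with `CACHE_QUERY_PORT = 7002` on localhost:

```python
import pickle
import socket
import struct

def recv_exactly(sock, num_bytes):
    """Read exactly num_bytes from the socket."""
    data = b""
    while len(data) < num_bytes:
        chunk = sock.recv(num_bytes - len(data))
        if not chunk:
            raise ConnectionError("connection closed mid-response")
        data += chunk
    return data

def cache_query(host, port, metric):
    # Request and response are pickled dicts, each prefixed with a
    # 4-byte big-endian length (the framing graphite-web itself uses).
    # Protocol 2 keeps the payload readable by a Python 2 carbon-cache.
    request = pickle.dumps({"type": "cache-query", "metric": metric}, protocol=2)
    with socket.create_connection((host, port)) as sock:
        sock.sendall(struct.pack("!L", len(request)) + request)
        length = struct.unpack("!L", recv_exactly(sock, 4))[0]
        response = pickle.loads(recv_exactly(sock, length))
    return response.get("datapoints", [])

# Hypothetical port and metric; point this at the instance the relay writes to.
print(cache_query("127.0.0.1", 7002, "statsd.host123-primary.some_metric.count"))
```

A non-empty list here for a metric that graphite-web reports as 0 cached datapoints would point at the query path rather than the cache itself.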
I added some lines to my carbonlink.py file to try to narrow down the issue. Here's a snippet:
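As a hypothetical illustration of that kind of instrumentation (not the actual snippet), a debug line inside `recv_response()` in graphite-web's carbonlink.py could dump the raw pickled payload; `struct`, `recv_exactly`, `unpickle` and `log` are already available in that module:

```python
def recv_response(self, conn):
    len_prefix = recv_exactly(conn, 4)
    body_size = struct.unpack("!L", len_prefix)[0]
    body = recv_exactly(conn, body_size)
    # Added for debugging: log the raw pickled body so that empty
    # responses can be compared byte-for-byte with successful ones.
    log.cache("CarbonLink raw response (%d bytes): %r" % (body_size, body))
    return unpickle.loads(body)
```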
Something I'm consistently noticing in the logs is that any datapoint that comes back empty has a byte string that ends in:

whereas successfully retrieved bytes always look like:
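Those byte strings are pickled response dicts, so unpickling a captured payload makes the difference concrete. A small sketch with illustrative values rather than the actual captured bytes:

```python
import pickle

# An empty cache-query response unpickles to a dict with no datapoints,
# while a successful one carries (timestamp, value) pairs.
empty = pickle.dumps({"datapoints": []}, protocol=2)
full = pickle.dumps({"datapoints": [(1690000000, 42.0)]}, protocol=2)

print(pickle.loads(empty))  # {'datapoints': []}
print(pickle.loads(full))   # {'datapoints': [(1690000000, 42.0)]}
```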
Describe the bug
We have a graphite cluster with approximately 130 million metrics spread across 3 storage nodes running graphite-web and 16 instances of carbon-cache. For a small proportion of our metrics (I'm unsure about the exact percentage, but from my experience it seems like <1%), carbon-cache always returns 0 data points. We know for a fact that carbon-cache is receiving the metrics in question though, because the whisper file is regularly updated. The way this problem manifests is that for certain metrics, the latest data point can be quite far away from "now", periodically "catching up" when carbon-cache flushes its data points to the whisper file.
I have verified with `tcpdump` that carbon-c-relay is sending the metrics to the same carbon-cache instance that graphite-web tries to query for the cached data points. I feel like this might have something to do with hashing, since the same metric name exhibits the same problem in both clusters that receive the metric independently, and metrics one character off (e.g. the same metric from `host124` instead of `host123`) do not exhibit the problem. We have at least 3 instances of the same metric on different hosts exhibiting the problem (out of around 500 hosts reporting the metric), which gives me further cause to believe the issue lies in carbon rather than elsewhere.

The logs we're seeing:
Metric exhibiting problem:
Metric of same name from other host (path differing by one character):
Note that I have changed the exact metric name so as not to expose it publicly, but I'm happy to provide the real metric name privately for debugging (in case the issue is related to the exact metric name).
To Reproduce
We are able to replicate the behavior by querying a metric of the same name for different hosts, e.g. `statsd.host123-primary.some_metric.count`. The metric returns data, but is lagging far behind. The graphite-web cache logs show that the metric always returns 0 data points. Querying for the same metric where the path differs by one character, e.g. `statsd.host124-primary.some_metric.count` or `statsd.host125-primary.some_metric.count`, does not exhibit the problem.

Expected behavior
carbon-cache should return data points for the metrics in question, given that it appears to have the data in memory (evidenced by the fact that it's writing data to disk at regular intervals).
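The lag itself is straightforward to observe from the outside by comparing recent points for an affected metric and an unaffected neighboring metric through the render API. A sketch, assuming graphite-web answers on localhost:8080 and the hypothetical metric names above:

```python
import json
import urllib.request

BASE = "http://localhost:8080/render?target=%s&from=-10min&format=json"

for metric in ("statsd.host123-primary.some_metric.count",
               "statsd.host124-primary.some_metric.count"):
    with urllib.request.urlopen(BASE % metric) as resp:
        series = json.load(resp)
    # Count non-null datapoints over the last 10 minutes; the affected
    # metric stays near zero until carbon-cache flushes to disk.
    points = [p for p in series[0]["datapoints"] if p[0] is not None]
    print(metric, "non-null points:", len(points))
```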
Environment (please complete the following information):