GCP PubSub Source: messages stop being picked up and do not resume until restart #19418
Comments
#12608 is a related issue but that one was ostensibly fixed. |
I don't think this is fixed; I'm still having the issue on my end. There are messages sitting in the topic, but I'm getting:
|
Hi! I'm running Vector 0.36.0 and still have the same issue with the Pub/Sub source. @jszwedko do you need more info to reproduce? |
@jszwedko I think the issue is that this error comes from https://github.com/vectordotdev/vector/blob/master/src/sources/gcp_pubsub.rs#L719, which shouldn't be raised as an error, as is done here: https://github.com/vectordotdev/vector/blob/master/src/sources/gcp_pubsub.rs#L717. In my view, there should be some configurable backoff before it is raised as an actual error. |
I'm not sure I see what you are saying, @alexandrst88. The code you are pointing at will result in a retry in either case, but stream errors are retried immediately to reduce interruption. Unfortunately, we haven't been able to dig into this one more yet. |
@jszwedko my point is that those errors are flooding the Vector logs. From my point of view, I would implement logic such as retry_errors_amount: 20, so that if there is still an issue with GCP after 20 retries, a warning is raised in case messages have been successfully fetched. |
Ah I see, so the issue is just the warning logs when retries happen? |
For me, yes. |
Makes sense. We have had complaints about retries being logged at the |
I went down a huge rabbit hole here, but it turns out this error gets thrown if the subscription has no events left to pull. It's extremely confusing to see an error message say I don't know if updated pubsub libraries have addressed this. This seems like a relevant issue: googleapis/google-cloud-dotnet#1505. Happy to provide any logs that help debug or troubleshoot this. At the very least, these should be moved to debug, if for no other reason than that they're incredibly misleading. |
Aha, interesting. Nice find. Agreed then, these log messages could be moved to debug to avoid confusion. Happy to see a PR for that if anyone is so motivated 🙏 |
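To make the two ideas from this thread concrete (logging retries at debug, and only escalating after a number of consecutive failures), here is a minimal sketch. It is written in Python for brevity, whereas Vector itself is Rust, and it is not Vector's implementation. The name retry_errors_amount comes from the comment above and is not an existing Vector option; the RetryLogEscalator class and the choice to reset the count after a successful fetch are assumptions made for illustration.

```python
import logging

logger = logging.getLogger("pubsub_retry_sketch")


class RetryLogEscalator:
    """Hypothetical helper: log retries at DEBUG until too many fail in a row."""

    def __init__(self, retry_errors_amount: int = 20) -> None:
        # Threshold name taken from the thread; not a real Vector setting.
        self.retry_errors_amount = retry_errors_amount
        self.consecutive_failures = 0

    def record_failure(self, error: str) -> int:
        """Count a failed fetch, log it, and return the log level used."""
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.retry_errors_amount:
            level = logging.WARNING
        else:
            level = logging.DEBUG
        logger.log(
            level,
            "fetch failed (%d in a row): %s",
            self.consecutive_failures,
            error,
        )
        return level

    def record_success(self) -> None:
        """Assumption: a successful fetch resets the failure streak."""
        self.consecutive_failures = 0
```

With this shape, an idle subscription that triggers an occasional stream error stays out of the warning logs, while a source that keeps failing without ever fetching anything still surfaces after the threshold is reached.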
We are seeing something very similar in one of our Pub/Sub sources, which pulls from a subscription that receives messages in bursts and then none for long periods. The error is slightly different, though:
As stated in @clong-msec's comment, things start to work again when the service is restarted. ++ @bruceg |
Yeah this is super frustrating. We ended up modifying the Vector service file to just restart every hour as a workaround, but something is definitely wrong with the PubSub source when pulling from bursty topics |
Is there any progress on the investigation? Or has anyone managed to reproduce this? |
Problem
We are using Vector to send events from our control plane through a queue (SQS for AWS and Pub/Sub for GCP), where they go through a few transforms before going to a ClickHouse sink. On startup, and for some time after, messages are picked up and sent as expected. However, after some time Vector stops processing new messages. It stays in this state until it's restarted, at which point it goes through the whole cycle again.
Vector is running in Kubernetes and is deployed using the Helm chart.
I ran the following to help debug:
vector tap --inputs-of "clickhouse" --outputs-of "metrics_events_queue" --interval 1 --limit 1500
When it's working I can see events come through as expected (though due to how tap works I might miss one or two).
In internal_metrics there is a single error, component_errors: {error_code: failed_fetching_events, error_type: request_failed}, that shows up, and from this point Vector does not seem to process anything from Pub/Sub until it's restarted. There is a corresponding error in the logs, which I've included below.
Error:
While nothing is processed again after this, I've included the bit of the log after the error, which shows that it appears to have started pulling again at the very end. Despite this, there are no further messages read from Pub/Sub, but also no further errors. In fact, there are debug log lines showing a token generation / stream pull restarting, but after the "The service was unable to fulfill your request. Please try again. [code=8a75]" error above there are no further occurrences of the token / restarting stream messages in the logs until Vector is restarted (which was 8 hours later in this particular case).
Are there any additional ways to get more debug information out, or some other metric that can help explain whether this is an issue inside Vector or something on our side?
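One possible way to tell a stalled source apart from a subscription that is simply idle is to sample two numbers over time: a cumulative count of events the source has received, and the number of messages still waiting in the subscription. The sketch below is only an illustration in Python under those assumptions. The Sample type and is_stalled function are hypothetical, and where the two numbers come from is left to the reader (for instance, a received-events counter from Vector's internal metrics and a backlog figure from Cloud Monitoring).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    """One observation; both fields are assumed to be collected externally."""

    received_total: int  # cumulative events the source has received so far
    backlog: int  # messages still waiting in the subscription


def is_stalled(samples: Sequence[Sample], min_samples: int = 3) -> bool:
    """Return True if the latest samples show a backlog but no progress."""
    if len(samples) < min_samples:
        # Not enough history to decide either way.
        return False
    window = samples[-min_samples:]
    # No progress: the cumulative counter did not move across the window.
    no_progress = window[-1].received_total == window[0].received_total
    # Backlog waiting: every sample in the window saw undelivered messages.
    backlog_waiting = all(s.backlog > 0 for s in window)
    return no_progress and backlog_waiting
```

A bursty topic with an empty subscription would not be flagged, because the backlog is zero; a source that has stopped pulling while messages pile up would be flagged once min_samples observations agree.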
Configuration
Version
0.34.1
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response