[BUG] event router stops responding after a couple of days #1054
Comments
Howdy 🖐 jm66 ! Thank you for your interest in this project. We value your feedback and will respond soon. |
Hi @jm66 - When you say the Also, if you're on VEBA Slack channel, we can also chat further to diagnose the issue |
Our env is around 3-4K VMs. The pod looks healthy, but no events are shown in the pod logs or the /events resource. I tried joining months ago but had no luck. Would you mind sending an invite for the Slack channel to Jm.Lopez (at) utoronto.ca? |
OK, let's take a look at the setup to see what's going on. I've added your email to the invite list; you should get an email to complete the signup, and then you can join https://vmwarecode.slack.com/archives/CQLT9B5AA |
Just joined. Thanks. Any particular channel? |
Yes, the one I linked above (that should take you to our VEBA channel) |
I wonder if it could be related to this? #809 |
Most likely 👍🏻 |
@embano1 That's a good point! I was actually thinking it could be that but forgot I had a blog post about it. @jm66 Could you check https://williamlam.com/2022/07/heads-up-potential-missing-vcenter-server-events-due-to-sequence-id-overflow.html and see if this is what you're observing? |
Hey @lamw I checked the post and the
Is it possible we have experienced the issue in the past and vCenter self heals? |
vCenter doesn't "heal", but this |
@embano1 yeah, actually to workaround this we cron'd this Got it. Enabled |
@embano1 the
Last event |
Is the from the DEBUG log? Anything else in there? |
Nothing relevant to my eyes, just the last event, but I could share the logs if you'd like. |
|
Hey @embano1 , got this after submitting
|
That error is fine because during a rollout the current instance is terminated and the code reacts gracefully to it ( Even before the shutdown it was processing an event so I was wondering whether the code/event stream works as expected then? |
Hey, In our test environment everything has been working for months. If I activate info mode, it just stops logging anything:
In debug mode we now only see these entries:
A short pod deletion fixes the problem and everything is back to normal for at least a few minutes, sometimes hours...
|
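The symptom described here (the pod looks healthy but no more events are logged until the pod is deleted) could in principle be caught with a simple staleness check. Below is a minimal sketch, assuming you can obtain the timestamp of the last processed event from somewhere such as the logs or a metric. The `is_stalled` helper and the 15-minute threshold are hypothetical and not part of VEBA.

```python
from datetime import datetime, timedelta, timezone


def is_stalled(last_event_at: datetime, now: datetime, max_silence: timedelta) -> bool:
    """Return True if no event has been seen for longer than max_silence."""
    return now - last_event_at > max_silence


# Hypothetical values for illustration only.
max_silence = timedelta(minutes=15)
now = datetime(2023, 1, 10, 12, 0, tzinfo=timezone.utc)

recent = datetime(2023, 1, 10, 11, 55, tzinfo=timezone.utc)  # 5 minutes ago
old = datetime(2023, 1, 10, 9, 0, tzinfo=timezone.utc)       # 3 hours ago

print(is_stalled(recent, now, max_silence))  # False
print(is_stalled(old, now, max_silence))     # True
```

In a quiet environment a long silence can be perfectly normal, so the threshold would have to be tuned to the expected event volume before acting on it (for example, before deleting the pod).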
@embano1 would you also recommend folks here give Tanzu Sources for Knative a try to see if that helps? |
You mean regarding the VC DB overflow issue? Sources won't help here bc it's using the same |
No, I meant for @jm66's issue #1054 (comment): are we saying this is related to VC DB overflow? There's also @laugrean's report, and it isn't clear to me whether that is VC DB overflow either, unless you're referring to this issue? |
See my response above: #1054 (comment) The initial issue description though seems to be a deadlock in the
IIRC, this is also deadlock in |
@embano1 OK, since both of the reported issues seem to be a deadlock in the router ... wouldn't my initial suggestion of testing Tanzu Sources on VEBA at least show whether that helps with the issue? If so, I'll put something quick/dirty together that'll undeploy the router and set up Sources ... |
Yup, we should definitely cross-check with Sources. There's lots of code both share, hoping it's related to event processing/invocation where code differs. |
@jm66 @laugrean To summarize the next steps: we would like to try un-deploying the event-router from within VEBA and deploying Tanzu Sources for Knative, which includes vSphere as a source (similar code that was ported from VEBA to Tanzu Sources), to see if this issue is resolved. The instructions below assume the VEBA appliance can go outbound to pull down some additional packages, and a required change is needed for your
Step 0 - SSH to VEBA appliance
Step 1 - Undeploy Event Router
Step 2 - Install Tanzu Sources for Knative
Step 3 - Install Knative CLI & vSphere Sources
Step 4 - Export VC Creds
Step 5 - Create vSphere Secret
Step 6 - Create vSphere Source
If everything was setup successfully, you should see the following pods running:
Specifically, we want to make sure vSphere Sources Adapter is running as shown in example below:
We can check the logs of vSphere Sources and ensure that the very last line or so states that it was able to log in by retrieving the VC time:
At this point, you should also be seeing events flow into Sockeye by opening browser to Lastly, to deploy or re-deploy your functions, you need to edit the
Let me know if you have any questions and hopefully this yields better results ... JFYI - Tanzu Sources does not log all events at its default INFO logging level; you'll have to enable DEBUG if you wish to see vSphere events in the Tanzu Sources logs, but I think for now, let's see if this resolves the issue with the setup |
Hey, In my scenario VEBA is running as a Helm chart deployment on Red Hat OpenShift. The installation comes directly from your GitHub repo: Should I uninstall the Helm chart in step 1? Is the release.yaml mentioned in step 2 based on Tanzu specifics, or does it work on any Kubernetes? |
Without touching your prod environment, you could also deploy a separate VEBA instance/Kubernetes environment and just deploy the sources to a broker without additional triggers to see if the sources continue to run when you see the But you can also run sources in parallel to your existing setup by installing as described above w/out having to uninstall your router. Depending on your configured triggers, this can lead to duplicate events though. So just be careful. |
Did it on my test environment.
It's connected, but I cannot see any events within sockeye. |
To keep things simple for now, try enabling |
Finally I got it working; it did not work with the sink-uri, but it did with the sink-name:
Now the vsphere event source I'll let you know when I see the event router problem again, and if the vcsa will survive. |
Both routers connected to my vCenter got stuck again at the same point in time, but the vsca still receives events |
Wait, |
The source is still working. My setup is: I've set up two routers to check whether they get stuck at the same time -> Yes, they do. |
What do these
This is good news! |
I just forgot to remove the bold code style at the end. Logs:
Last entry for the info node, after that nothing else
source logs
During a short investigation I could see this problem in AriaLogs; not sure if this is related.
last time we could see:
|
Good news: we don't have a deadlock :) My suspicion here is that, given |
Unfortunately our kubernetes admins declined the above mentioned workaround for our production environments. Currently we have installed everything within one namespace, but there are several ClusterRoleBindings needed e.g.:
The problem is global permissions like:
Any idea if this can be changed to dedicated namespaces only? |
@gabo1208 See above ^ |
I now have the same problem again with the vcsa-source-adapter.
When I kill the vca-source-adapter-pod I see this message:
And everything is back to normal... for at least a few days |
IIUC, when you enable BTW: seems you're running |
Hey, Sorry for the late reply. I now had the following scenario:
It was running for only a few seconds... I've updated kn-vsphere to the latest release and will give it another try |
@laugrean any updates on this? We had a user recently report that after introducing the |
@rguske Finally got a chance to deploy the latest version after a few issues during the Something I noticed is that I cannot see the following events: Also the filtering |
Hi @jm66, thanks for your answer.
You can see all available events here: EVENTS
Reference: Docs

apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: veba-ps-echo-trigger
  labels:
    app: veba-ui
spec:
  broker: default
  filter:
    attributes:
      type: com.vmware.vsphere.VmPoweredOffEvent.v0
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: kn-ps-echo

Looking forward to reading your updates. |
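As a rough illustration of how a Trigger filter like the one above is evaluated: as far as I understand Knative Eventing, `filter.attributes` is an exact match on CloudEvent context attributes, and every listed attribute has to match. The sketch below models that in Python. The `matches` helper, the sample event dicts, the `source` URL, and the `VmPoweredOnEvent` type string (built by analogy with the one in the Trigger) are made up for illustration; this is not Knative code.

```python
def matches(filter_attributes: dict[str, str], event_attributes: dict[str, str]) -> bool:
    """Return True if every filter attribute equals the event's attribute exactly."""
    return all(
        event_attributes.get(name) == expected
        for name, expected in filter_attributes.items()
    )


# Filter taken from the Trigger above.
trigger_filter = {"type": "com.vmware.vsphere.VmPoweredOffEvent.v0"}

# Hypothetical CloudEvent context attributes.
powered_off = {
    "type": "com.vmware.vsphere.VmPoweredOffEvent.v0",
    "source": "https://vcenter.example.com/sdk",
}
powered_on = {
    "type": "com.vmware.vsphere.VmPoweredOnEvent.v0",
    "source": "https://vcenter.example.com/sdk",
}

print(matches(trigger_filter, powered_off))  # True
print(matches(trigger_filter, powered_on))   # False
```

Because the match is exact, a trigger that still filters on an old attribute value will silently receive nothing, which is worth checking when events show up in Sockeye but a function is never invoked.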
Thanks. Yes, I see the event listed in the ref docs, but it is not showing in Sockeye at all. In v0.7.5, type was something like event or eventx, and subject was the actual event type. Does type now contain what used to be the v0.7.5 subject attribute value, with a .v0 suffix? |
Got it working. Key is that all events need to use the format
I stand corrected. I was meaning to refer to All my functions have been migrated to the new spec and apparently are working. |
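For anyone migrating filters from the old spec, here is a small sketch of the naming change discussed above. It assumes every event type follows the pattern of the single example in this thread (`com.vmware.vsphere.VmPoweredOffEvent.v0`), that is, the old `subject` value wrapped in a fixed prefix and a `.v0` suffix. The helper names are hypothetical, so please verify against the events reference before relying on it.

```python
# Prefix and suffix inferred from the Trigger example in this thread (assumption).
PREFIX = "com.vmware.vsphere."
SUFFIX = ".v0"


def subject_to_type(subject: str) -> str:
    """Build the new-style `type` from an old-style `subject`, e.g. VmPoweredOffEvent."""
    return f"{PREFIX}{subject}{SUFFIX}"


def type_to_subject(event_type: str) -> str:
    """Recover the plain event name from a new-style `type`."""
    if not (event_type.startswith(PREFIX) and event_type.endswith(SUFFIX)):
        raise ValueError(f"unexpected event type format: {event_type!r}")
    return event_type[len(PREFIX):-len(SUFFIX)]


print(subject_to_type("VmPoweredOffEvent"))
# com.vmware.vsphere.VmPoweredOffEvent.v0
print(type_to_subject("com.vmware.vsphere.VmPoweredOffEvent.v0"))
# VmPoweredOffEvent
```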
Great news @jm66. Do you consider this issue as solved? If so, feel free to close it. |
Describe the bug
vmware-event-router stops working after two days. No errors registered in logs.
To Reproduce
N/A
Expected behavior
vmware-event-router to continuously run without interruption.
Screenshots
N/A
Version (please complete the following information):
Additional context
This behaviour only shows up in our prod instance, whose event volume is greater than that of our testing instance.