
[BUG] event router stops responding after a couple of days #1054

Closed

jm66 opened this issue May 15, 2023 · 51 comments
Labels
bug Something isn't working

Comments

@jm66

jm66 commented May 15, 2023

Describe the bug

vmware-event-router stops working after two days. No errors registered in logs.

To Reproduce
N/A

Expected behavior
vmware-event-router to continuously run without interruption.

Screenshots
N/A

Version (please complete the following information):

  • VEBA Form Factor: Appliance
  • VEBA Version: v0.7.5

Additional context

This behaviour only shows in our prod instance, whose event volume is greater than that of our testing instance.

@jm66 jm66 added the bug Something isn't working label May 15, 2023
@github-actions

Howdy 🖐   jm66 ! Thank you for your interest in this project. We value your feedback and will respond soon.

@lamw
Contributor

lamw commented May 15, 2023

Hi @jm66 - When you say the vmware-event-router stops working, could you elaborate on what you're seeing? Is the pod running, or has it gone into a crash/restart state? Does this happen immediately after deployment or after a period of time? How large is your vCenter inventory? We'll need a bit more information to understand the issue.

Also, if you're on the VEBA Slack channel, we can chat further to diagnose the issue.

@jm66
Author

jm66 commented May 15, 2023

Our env is around 3-4K VMs. The pod looks healthy, but no events are shown in the pod logs or the /events resource. I tried joining months ago but had no luck. Would you mind sending an invite for the Slack channel to Jm.Lopez (at) utoronto.ca?

@lamw
Contributor

lamw commented May 15, 2023

OK, let's take a look at the setup to see what's going on. I've added your email to the invite list; you should get an email to complete the signup, and then you can join https://vmwarecode.slack.com/archives/CQLT9B5AA

@jm66
Author

jm66 commented May 15, 2023

Just joined. Thanks. Any particular channel?

@lamw
Contributor

lamw commented May 16, 2023

Yes, the one I linked above (that should take you to our VEBA channel)

@embano1
Collaborator

embano1 commented May 16, 2023

I wonder if it could be related to this? #809

@rguske
Contributor

rguske commented May 16, 2023

I wonder if it could be related to this? #809

Most likely 👍🏻

@lamw
Contributor

lamw commented May 16, 2023

@embano1 That's a good point! I was actually thinking it could be that but forgot I had a blog post about it.

@jm66 Could you check https://williamlam.com/2022/07/heads-up-potential-missing-vcenter-server-events-due-to-sequence-id-overflow.html and see if this is what you're observing?

@jm66
Author

jm66 commented May 16, 2023

Hey @lamw, I checked the post and the chainId is positive:

chainId int 310662297

Is it possible we experienced the issue in the past and vCenter self-heals?

@embano1
Collaborator

embano1 commented May 16, 2023

vCenter doesn't "heal", but this int32 wraps over time, so it could be that the underlying event collector used by the router is stuck. Could you please try restarting the router/VEBA? Enabling DEBUG logging can also reveal some insights. All assuming the router is actually running correctly.
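
For illustration (a standalone sketch, not router code): event keys and chain IDs are int32 values, so once the counter passes the int32 maximum it wraps to a large negative number, which is what the linked blog post says to check for:

package main

import (
	"fmt"
	"math"
)

func main() {
	// vCenter event keys/chainIds are int32; Go wraps signed overflow
	// deterministically, so incrementing past the maximum goes negative.
	var key int32 = math.MaxInt32 // 2147483647
	key++                         // wraps around
	fmt.Println(key)              // prints -2147483648
}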

@jm66
Author

jm66 commented May 16, 2023

@embano1 yeah, actually to work around this we cron'd /usr/bin/kubectl rollout restart deployment vmware-event-router-vcenter -n vmware-system to run daily. So far the router hasn't gotten stuck.

Got it. Enabled debug and disabled the cron job.
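
For reference, the workaround above amounts to a root crontab entry along these lines (the 03:00 schedule is an assumed example; the command is the one quoted above):

0 3 * * * /usr/bin/kubectl rollout restart deployment vmware-event-router-vcenter -n vmware-system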

@jm66
Author

jm66 commented May 18, 2023

@embano1 the event-router is stuck now:

kubectl get pods -n vmware-system
NAME                                          READY   STATUS    RESTARTS        AGE
cadvisor-rhdw8                                1/1     Running   9 (7d1h ago)    31d
fluent-bit-rzbd4                              1/1     Running   15 (7d1h ago)   31d
tinywww-669b487769-pr76s                      1/1     Running   6 (7d1h ago)    31d
veba-rabbit-server-0                          1/1     Running   9 (7d1h ago)    31d
veba-ui-846bb59f69-n7ln7                      1/1     Running   6 (7d1h ago)    31d
vmware-event-router-vcenter-ffb58667c-pwj9z   1/1     Running   0               2d
vmware-event-router-webhook-755654855-dw7mc   1/1     Running   0               2d

Last event UserLogoutSessionEvent with key 310792526 was processed at 2023-05-18T17:15:14.667Z.

@embano1
Collaborator

embano1 commented May 18, 2023

Is that from the DEBUG log? Anything else in there?

@jm66
Author

jm66 commented May 18, 2023

Nothing relevant to my eyes, just the last event, but I could share the logs if you'd like.

@jm66
Author

jm66 commented May 19, 2023

2023-05-18T17:15:14.667Z   DEBUG [KNATIVE]   knative/knative.go:184  got response   {"eventID": "1f98ca23-2fe3-4376-8606-3c1c965b9a23", "response": "202: "}
2023-05-18T17:15:14.667Z   INFO  [KNATIVE]   knative/knative.go:194  successfully sent event {"eventID": "1f98ca23-2fe3-4376-8606-3c1c965b9a23"}
2023-05-18T17:15:14.667Z   INFO  [VCENTER]   vcenter/vcenter.go:343  invoking processor   {"eventID": "2971841d-73cd-48b0-a68e-bd7065b59dc9"}
2023-05-18T17:15:14.667Z   DEBUG [KNATIVE]   knative/knative.go:170  processing event  {"eventID": "2971841d-73cd-48b0-a68e-bd7065b59dc9", "event": "Context Attributes,\n  specversion: 1.0\n  type: com.vmware.event.router/event\n  source: https://vcdomain/sdk\n  subject: UserLogoutSessionEvent\n  id: 2971841d-73cd-48b0-a68e-bd7065b59dc9\n  time: 2023-05-18T17:15:13.103999Z\n  datacontenttype: application/json\nExtensions,\n  vsphereapiversion: 7.0.3.0\nData,\n  {\n    \"Key\": 310792526,\n    \"ChainId\": 310792526,\n    \"CreatedTime\": \"2023-05-18T17:15:13.103999Z\",\n    \"UserName\": \"VSKEY5\\\\vsswebservices\",\n    \"Datacenter\": null,\n    \"ComputeResource\": null,\n    \"Host\": null,\n    \"Vm\": null,\n    \"Ds\": null,\n    \"Net\": null,\n    \"Dvs\": null,\n    \"FullFormattedMessage\": \"User ****** logged out (login time: Thursday, 18 May, 2023 05:15:01 PM, number of API invocations: 4, user agent: Go-http-client/1.1)\",\n    \"ChangeTag\": \"\",\n    \"IpAddress\": \"*****\",\n    \"UserAgent\": \"Go-http-client/1.1\",\n    \"CallCount\": 4,\n    \"SessionId\": \"5219944b-3cd6-66e9-4ff8-4e662962c304\",\n    \"LoginTime\": \"2023-05-18T17:15:01.773485Z\"\n  }\n"}
2023-05-18T17:15:14.667Z   INFO  [KNATIVE]   knative/knative.go:182  sending event  {"eventID": "2971841d-73cd-48b0-a68e-bd7065b59dc9", "subject": "UserLogoutSessionEvent"}

@jm66
Author

jm66 commented May 19, 2023

Hey @embano1, got this after submitting kubectl rollout restart deployment vmware-event-router-vcenter -n vmware-system:

root@veba [ ~ ]# 2023-05-19T15:47:25.339Z	INFO	[MAIN]	router/main.go:212	initiating shutdown	{"commit": "73f72694", "version": "v0.7.5"}
2023-05-19T15:47:25.339Z	INFO	[VCENTER]	vcenter/vcenter.go:370	attempting graceful shutdown
2023-05-19T15:47:25.339Z	DEBUG	[KNATIVE]	knative/knative.go:184	got response	{"eventID": "2971841d-73cd-48b0-a68e-bd7065b59dc9", "response": "Post \"http://default-broker-ingress.vmware-functions.svc.cluster.local/\": context canceled"}
2023-05-19T15:47:25.339Z	ERROR	[VCENTER]	vcenter/vcenter.go:347	could not process event	{"event": "Context Attributes,\n  specversion: 1.0\n  type: com.vmware.event.router/event\n  source: https://****/sdk\n  subject: UserLogoutSessionEvent\n  id: 2971841d-73cd-48b0-a68e-bd7065b59dc9\n  time: 2023-05-18T17:15:13.103999Z\n  datacontenttype: application/json\nExtensions,\n  vsphereapiversion: 7.0.3.0\nData,\n  {\n    \"Key\": 310792526,\n    \"ChainId\": 310792526,\n    \"CreatedTime\": \"2023-05-18T17:15:13.103999Z\",\n    \"UserName\": \"********\",\n    \"Datacenter\": null,\n    \"ComputeResource\": null,\n    \"Host\": null,\n    \"Vm\": null,\n    \"Ds\": null,\n    \"Net\": null,\n    \"Dvs\": null,\n    \"FullFormattedMessage\": \"User ******* logged out (login time: Thursday, 18 May, 2023 05:15:01 PM, number of API invocations: 4, user agent: Go-http-client/1.1)\",\n    \"ChangeTag\": \"\",\n    \"IpAddress\": \"******\",\n    \"UserAgent\": \"Go-http-client/1.1\",\n    \"CallCount\": 4,\n    \"SessionId\": \"5219944b-3cd6-66e9-4ff8-4e662962c304\",\n    \"LoginTime\": \"2023-05-18T17:15:01.773485Z\"\n  }\n", "error": "knative: send event 2971841d-73cd-48b0-a68e-bd7065b59dc9: Post \"http://default-broker-ingress.vmware-functions.svc.cluster.local/\": context canceled"}
github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/internal/provider/vcenter.(*EventStream).processEvents
	github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/internal/provider/vcenter/vcenter.go:347
github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/internal/provider/vcenter.(*EventStream).stream
	github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/internal/provider/vcenter/vcenter.go:318
github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/internal/provider/vcenter.(*EventStream).Stream
	github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/internal/provider/vcenter/vcenter.go:231
main.main.func3
	github.com/vmware-samples/vcenter-event-broker-appliance/vmware-event-router/cmd/router/main.go:206
golang.org/x/sync/errgroup.(*Group).Go.func1
	golang.org/x/[email protected]/errgroup/errgroup.go:75

@embano1
Collaborator

embano1 commented May 20, 2023

That error is fine: during a rollout the current instance is terminated, and the code reacts gracefully to it (context canceled). The current event in flight is then not correctly processed from the source (vCenter) and, depending on whether checkpointing is enabled, the code will resume from the last checkpoint or from "now" when it starts.

Even right before the shutdown it was processing an event, so I'm wondering whether the code/event stream was actually working as expected?
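
For context on what such a checkpoint carries, here is a hedged Go sketch of the payload; the field names come from the "creating checkpoint" debug log that appears later in this thread, while the struct name is invented for illustration:

package sketch

import "time"

// checkpoint mirrors the JSON visible in the adapter's "creating
// checkpoint" debug log later in this thread; the real implementation
// persists this via its ConfigMap-backed key/value store
// ("Initializing configMapKVStore" in the adapter logs).
type checkpoint struct {
	VCenter               string    `json:"vCenter"`
	LastEventKey          int32     `json:"lastEventKey"`
	LastEventType         string    `json:"lastEventType"`
	LastEventKeyTimestamp time.Time `json:"lastEventKeyTimestamp"`
	CreatedTimestamp      time.Time `json:"createdTimestamp"`
}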

@laugrean

laugrean commented Jun 13, 2023

Hey,
Unfortunately, we're encountering the same problem.
Helm deployment:
Chart: event-router-v0.7.6
App version: v0.7.5

On our test environment everything has been working for months already.

If I activate info mode, it just stops logging anything:

2023-06-13T06:13:05.216Z INFO [KNATIVE] knative/knative.go:194 successfully sent event {"eventID": "4ab869ed-b2e0-4f90-9fad-00a5c1d83f82"}
2023-06-13T06:13:05.941Z INFO [VCENTER] vcenter/vcenter.go:343 invoking processor {"eventID": "06d60432-b9c9-4a3e-9f04-51bc8a875d10"}
2023-06-13T06:13:05.941Z INFO [KNATIVE] knative/knative.go:182 sending event {"eventID": "06d60432-b9c9-4a3e-9f04-51bc8a875d10", "subject": "com.vmware.vc.HA.NotAllHostAddrsPingable"}
2023-06-13T06:13:05.942Z INFO [KNATIVE] knative/knative.go:194 successfully sent event {"eventID": "06d60432-b9c9-4a3e-9f04-51bc8a875d10"}

In debug mode, these are the only entries we still see:

2023-06-13T06:35:50.917Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
2023-06-13T06:35:55.973Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
2023-06-13T06:36:01.026Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
2023-06-13T06:36:06.072Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}

A short pod deletion fixes the problem, and everything is back to normal for at least a few minutes, sometimes hours...

2023-06-13T06:41:13.196Z INFO [KNATIVE] knative/knative.go:182 sending event {"eventID": "80e3ab5d-9dd3-47ef-8ad2-efe54ec86872", "subject": "com.vmware.vcIntegrity.ClusterMembershipChange"}
2023-06-13T06:41:13.197Z DEBUG [KNATIVE] knative/knative.go:184 got response {"eventID": "80e3ab5d-9dd3-47ef-8ad2-efe54ec86872", "response": "202: "}
2023-06-13T06:41:13.197Z INFO [KNATIVE] knative/knative.go:194 successfully sent event {"eventID": "80e3ab5d-9dd3-47ef-8ad2-efe54ec86872"}
2023-06-13T06:41:14.051Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 1}
2023-06-13T06:41:15.108Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 2}
2023-06-13T06:41:17.160Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 4}
2023-06-13T06:41:21.237Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
2023-06-13T06:41:26.298Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
2023-06-13T06:41:31.414Z INFO [VCENTER] vcenter/vcenter.go:343 invoking processor {"eventID": "24358daa-9903-4929-bceb-96d617560f1c"}

@lamw
Contributor

lamw commented Jun 16, 2023

@embano1 would you also recommend folks here give Tanzu Sources for Knative a try to see if that helps?

@embano1
Collaborator

embano1 commented Jun 17, 2023

@embano1 would you also recommend folks here give Tanzu Sources for Knative a try to see if that helps?

You mean regarding the VC DB overflow issue? Sources won't help here because it's using the same EventHistoryCollector mechanism as the router - and this is a server-side issue.
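
For readers unfamiliar with that mechanism, here is a minimal govmomi sketch of the polling loop both components build on (a hedged sketch assuming a logged-in vim25.Client; function and variable names are illustrative):

package sketch

import (
	"context"
	"time"

	"github.com/vmware/govmomi/event"
	"github.com/vmware/govmomi/vim25"
	"github.com/vmware/govmomi/vim25/types"
)

// streamEvents polls an EventHistoryCollector, the same server-side object
// the router and Sources rely on. If the server silently drops the
// collector, ReadNextEvents keeps returning zero events, which is what the
// "no new events, backing off" debug logs in this thread look like.
func streamEvents(ctx context.Context, vc *vim25.Client) error {
	em := event.NewManager(vc)
	filter := types.EventFilterSpec{
		Entity: &types.EventFilterSpecByEntity{
			Entity:    vc.ServiceContent.RootFolder,
			Recursion: types.EventFilterSpecRecursionOptionAll,
		},
	}
	hc, err := em.CreateCollectorForEvents(ctx, filter)
	if err != nil {
		return err
	}
	defer hc.Destroy(ctx)

	for {
		events, err := hc.ReadNextEvents(ctx, 100) // page size of 100
		if err != nil {
			return err
		}
		if len(events) == 0 {
			time.Sleep(5 * time.Second) // back off, as in the debug logs
			continue
		}
		// hand the events page to the processor (Knative sink etc.) here
	}
}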

@lamw
Contributor

lamw commented Jun 17, 2023

You mean regarding the VC DB overflow issue?

No, I meant for @jm66's issue #1054 (comment) - are we saying this is related to VC DB overflow?

But there's also @laugrean's report, where it isn't clear to me whether it's VC DB overflow either, unless you're referring to this issue?

@embano1
Collaborator

embano1 commented Jun 17, 2023

No, I meant for @jm66's issue #1054 (comment) - are we saying this is related to VC DB overflow?

See my response above: #1054 (comment)

The initial issue description though seems to be a deadlock in the router.

But there's also @laugrean's report, where it isn't clear to me whether it's VC DB overflow either, unless you're referring to this issue?

IIRC, this is also a deadlock in the router.

@lamw
Contributor

lamw commented Jun 17, 2023

@embano1 OK, since both of the issues reported seem to be a deadlock in the router ... then my initial comment about testing Tanzu Sources on VEBA would at least let us see if that helps with the issue? If so, I'll put something quick/dirty together that'll go ahead and undeploy the router and set up Sources ...

@embano1
Collaborator

embano1 commented Jun 17, 2023

@embano1 OK, since both of the issues reported seem to be a deadlock in the router ... then my initial comment about testing Tanzu Sources on VEBA would at least let us see if that helps with the issue? If so, I'll put something quick/dirty together that'll go ahead and undeploy the router and set up Sources ...

Yup, we should definitely cross-check with Sources. There's lots of code both share; hopefully it's related to event processing/invocation, where the code differs.

@lamw
Contributor

lamw commented Jun 18, 2023

@jm66 @laugrean To summarize the next steps: we would like to try un-deploying the event-router from within VEBA and deploying the Tanzu Sources for Knative, which include vSphere as a source (similar code, ported from VEBA to Tanzu Sources), to see if this issue is resolved.

The instructions below assume the VEBA appliance can go outbound to pull down some additional packages. A change is also required for your function.yaml deployment, as it uses a slightly different schema for how you subscribe to an event, which I will detail below after the setup.

Step 0 - SSH to VEBA appliance

Step 1 - Undeploy Event Router

kubectl -n vmware-system delete -f /root/config/event-router/vmware-event-router-k8s-vcenter.yaml

Step 2 - Install Tanzu Sources for Knative

kubectl apply -f https://github.com/vmware-tanzu/sources-for-knative/releases/latest/download/release.yaml

Step 3 - Install Knative CLI & vSphere Sources

curl -L https://github.com/knative/client/releases/download/knative-v1.10.0/kn-linux-amd64 -o /usr/local/bin/kn
chmod +x /usr/local/bin/kn
curl -L https://github.com/vmware-tanzu/sources-for-knative/releases/download/v0.37.0/kn-vsphere_0.37.0_Linux_x86_64.tar.gz -o /root/kn-vsphere_0.37.0_Linux_x86_64.tar.gz
tar -zxvf /root/kn-vsphere_0.37.0_Linux_x86_64.tar.gz
mv /root/kn-vsphere_0.37.0_Linux_x86_64/kn-vsphere /usr/local/bin/kn-vsphere
chmod +x /usr/local/bin/kn-vsphere
rm -rf /root/kn-vsphere_0.37.0_Linux_x86_64*

Step 4 - Export VC Creds

export VCENTER_USERNAME="FILL_ME"
export VCENTER_PASSWORD="FILL_ME"
export VCENTER_HOSTNAME="FILL_ME"

Step 5 - Create vSphere Secret

kn vsphere auth create \
    --namespace vmware-functions \
    --username ${VCENTER_USERNAME} \
    --password ${VCENTER_PASSWORD} \
    --name vcenter-creds \
    --verify-url https://${VCENTER_HOSTNAME} \
    --verify-insecure

Step 6 - Create vSphere Source

kn vsphere source create \
    --namespace vmware-functions \
    --name vcsa-source \
    --vc-address https://${VCENTER_HOSTNAME} \
    --skip-tls-verify \
    --secret-ref vcenter-creds \
    --sink-uri http://default-broker-ingress.vmware-functions.svc.cluster.local \
    --encoding json

If everything was set up successfully, you should see the following pods running:

# kubectl get pods -A
NAMESPACE            NAME                                                  READY   STATUS      RESTARTS      AGE
cert-manager         cert-manager-99bb69456-dhxcv                          1/1     Running     2 (23m ago)   89d
cert-manager         cert-manager-cainjector-ffb4747bb-8x7ck               1/1     Running     2 (23m ago)   89d
cert-manager         cert-manager-webhook-545bd5d7d8-5fxtr                 1/1     Running     2 (23m ago)   89d
contour-external     contour-685f87dc74-ccdcn                              1/1     Running     2 (23m ago)   89d
contour-external     contour-685f87dc74-q24fx                              1/1     Running     2 (23m ago)   89d
contour-external     contour-certgen-v1.22.0-42c87                         0/1     Completed   0             89d
contour-external     envoy-jqzft                                           2/2     Running     4 (23m ago)   89d
contour-internal     contour-c4478d89b-cvdk5                               1/1     Running     2 (23m ago)   89d
contour-internal     contour-c4478d89b-ws4tw                               1/1     Running     2 (23m ago)   89d
contour-internal     contour-certgen-v1.22.0-nn2r2                         0/1     Completed   0             89d
contour-internal     envoy-pkb7d                                           2/2     Running     4 (23m ago)   89d
knative-eventing     eventing-controller-fdc4dd6bb-6zttl                   1/1     Running     2 (23m ago)   89d
knative-eventing     eventing-webhook-676dfb6c4f-hmnwz                     1/1     Running     2 (23m ago)   89d
knative-eventing     rabbitmq-broker-controller-54c85d4f98-b5vrv           1/1     Running     2 (23m ago)   89d
knative-eventing     rabbitmq-broker-webhook-877b8d7df-bqd4t               1/1     Running     2 (23m ago)   89d
knative-serving      activator-7cbbfbc85-4zg95                             1/1     Running     2 (23m ago)   89d
knative-serving      autoscaler-8f986cff-jgngg                             1/1     Running     2 (23m ago)   89d
knative-serving      controller-58dfb45d74-bz88p                           1/1     Running     2 (23m ago)   89d
knative-serving      domain-mapping-5d8db49bf6-8z96x                       1/1     Running     2 (23m ago)   89d
knative-serving      domainmapping-webhook-584476fd67-cdm5s                1/1     Running     2 (23m ago)   89d
knative-serving      net-contour-controller-6768758c67-wb5jn               1/1     Running     2 (23m ago)   89d
knative-serving      webhook-6d5c55fd8c-zl5zg                              1/1     Running     2 (23m ago)   89d
kube-system          antrea-agent-9xthc                                    2/2     Running     7 (23m ago)   89d
kube-system          antrea-controller-6db8bb65cf-k8ltr                    1/1     Running     3 (22m ago)   89d
kube-system          coredns-565d847f94-dvrdf                              1/1     Running     2 (23m ago)   89d
kube-system          coredns-565d847f94-zmt46                              1/1     Running     2 (23m ago)   89d
kube-system          etcd-veba.primp-industries.local                      1/1     Running     2 (23m ago)   89d
kube-system          kube-apiserver-veba.primp-industries.local            1/1     Running     2 (23m ago)   89d
kube-system          kube-controller-manager-veba.primp-industries.local   1/1     Running     2 (23m ago)   89d
kube-system          kube-proxy-n22lb                                      1/1     Running     2 (23m ago)   89d
kube-system          kube-scheduler-veba.primp-industries.local            1/1     Running     2 (23m ago)   89d
local-path-storage   local-path-provisioner-5646477f4b-pql69               1/1     Running     2 (23m ago)   89d
rabbitmq-system      messaging-topology-operator-74c896bb55-5h66m          1/1     Running     2 (23m ago)   89d
rabbitmq-system      rabbitmq-cluster-operator-586b7547f8-7vr9x            1/1     Running     2 (23m ago)   89d
vmware-functions     default-broker-ingress-bbf774754-xjzpz                1/1     Running     2 (23m ago)   89d
vmware-functions     sockeye-79b7fc7c55-tmjqk                              1/1     Running     2 (23m ago)   89d
vmware-functions     sockeye-trigger-dispatcher-5b96bdd8d-mjth6            1/1     Running     2 (23m ago)   89d
vmware-functions     vcsa-source-adapter-7d474dbbfb-xd7ms                  1/1     Running     0             7m2s
vmware-sources       horizon-source-controller-754d465b5b-ptbmw            1/1     Running     0             17m
vmware-sources       horizon-source-webhook-6d7cbd8dd-mtz2t                1/1     Running     0             17m
vmware-sources       vsphere-source-webhook-55c56cf48-x7fpw                1/1     Running     0             17m
vmware-system        cadvisor-jt8rk                                        1/1     Running     2 (23m ago)   89d
vmware-system        tinywww-5b795ddd75-pjds5                              1/1     Running     2 (23m ago)   89d
vmware-system        veba-rabbit-server-0                                  1/1     Running     4 (21m ago)   89d
vmware-system        veba-ui-846bb59f69-rk994                              1/1     Running     2 (23m ago)   89d

Specifically, we want to make sure the vSphere Sources adapter is running, as shown in the example below:

kubectl -n vmware-functions get deploy/vcsa-source-adapter
NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
vcsa-source-adapter   1/1     1            1           11m

We can check the logs of the vSphere source and ensure the last line or so states it was able to log in by retrieving the VC time:

kubectl -n vmware-functions logs deploy/vcsa-source-adapter
{"level":"warn","ts":"2023-06-18T14:09:14.176Z","logger":"vsphere-source-adapter","caller":"v2/config.go:197","msg":"Tracing configuration is invalid, using the no-op default{error 26 0  empty json tracing config}","commit":"8fda92a-dirty"}
{"level":"warn","ts":"2023-06-18T14:09:14.177Z","logger":"vsphere-source-adapter","caller":"v2/config.go:190","msg":"Sink timeout configuration is invalid, default to -1 (no timeout)","commit":"8fda92a-dirty"}
{"level":"info","ts":"2023-06-18T14:09:14.261Z","logger":"vsphere-source-adapter","caller":"kvstore/kvstore_cm.go:54","msg":"Initializing configMapKVStore...","commit":"8fda92a-dirty"}
{"level":"info","ts":"2023-06-18T14:09:14.271Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:92","msg":"configuring checkpointing","commit":"8fda92a-dirty","ReplayWindow":"5m0s","Period":"10s"}
{"level":"warn","ts":"2023-06-18T14:09:14.271Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:131","msg":"could not retrieve checkpoint configuration","commit":"8fda92a-dirty","error":"key checkpoint does not exist"}
{"level":"info","ts":"2023-06-18T14:09:14.274Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:311","msg":"no valid checkpoint found","commit":"8fda92a-dirty"}
{"level":"info","ts":"2023-06-18T14:09:14.274Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:312","msg":"setting begin of event stream","commit":"8fda92a-dirty","beginTimestamp":"2023-06-18 14:09:14.260972 +0000 UTC"}
{"level":"info","ts":"2023-06-18T14:14:14.261Z","logger":"vsphere-source-adapter","caller":"vsphere/client.go:115","msg":"Executing SOAP keep-alive handler","commit":"8fda92a-dirty","rpc":"keepalive"}
{"level":"info","ts":"2023-06-18T14:14:14.265Z","logger":"vsphere-source-adapter","caller":"vsphere/client.go:121","msg":"vCenter current time: 2023-06-18 14:14:14.31874 +0000 UTC","commit":"8fda92a-dirty","rpc":"keepalive"}

At this point, you should also see events flowing into Sockeye by opening a browser to https://VEBA_FQDN/events

Lastly, to deploy or re-deploy your functions, you need to edit the function.yaml manifest and update the type field to something like: com.vmware.vsphere.<EVENTID>.v0 and remove the subject field.

....snip ....
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: veba-ps-slack-trigger
  labels:
    app: veba-ui
spec:
  broker: default
  filter:
    attributes:
      type: com.vmware.vsphere.DrsVmPoweredOnEvent.v0
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: kn-ps-slack

Let me know if you have any questions and hopefully this yields better results ...

JFYI - Tanzu Sources does not log all events at its default INFO logging level; you'll have to enable DEBUG if you wish to see vSphere events in the Tanzu Sources logs. But I think for now, let's see if this setup resolves the issue.

@laugrean

Hey,

In my scenario, VEBA is running as a Helm chart deployment on Red Hat OpenShift.
I have multiple vCenters connected to the same broker within the same namespace.

The installation comes directly from your github repo:
https://github.com/vmware-samples/vcenter-event-broker-appliance.git

Should I uninstall the helm chart in step 1?

Is the release.yaml mentioned in step 2 based on Tanzu specifics, or does it work on any Kubernetes?
Currently my Helm deployment and all triggers + functions are installed in the same namespace.
If I use it, is it sufficient to delete the namespace-creation part and replace all namespaces with the one I want it installed in?

@embano1
Collaborator

embano1 commented Jun 19, 2023

Should I uninstall the helm chart in step 1?

Without touching your prod environment, you could also deploy a separate VEBA instance/Kubernetes environment and just deploy the sources to a broker without additional triggers to see if the sources continue to run when you see the router stopping in the other appliance.

But you can also run sources in parallel to your existing setup by installing as described above w/out having to uninstall your router. Depending on your configured triggers, this can lead to duplicate events though. So just be careful.

@laugrean

I did it on my test environment.
First problem: we always need OpenShift cluster-admin permissions.
But here we go:

{"level":"warn","ts":"2023-06-19T07:31:42.597Z","logger":"vsphere-source-adapter","caller":"v2/config.go:197","msg":"Tracing configuration is invalid, using the no-op default{error 26 0 empty json tracing config}","commit":"8fda92a-dirty"}
{"level":"warn","ts":"2023-06-19T07:31:42.597Z","logger":"vsphere-source-adapter","caller":"v2/config.go:190","msg":"Sink timeout configuration is invalid, default to -1 (no timeout)","commit":"8fda92a-dirty"}
{"level":"info","ts":"2023-06-19T07:31:42.701Z","logger":"vsphere-source-adapter","caller":"kvstore/kvstore_cm.go:54","msg":"Initializing configMapKVStore...","commit":"8fda92a-dirty"}
{"level":"info","ts":"2023-06-19T07:31:42.715Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:92","msg":"configuring checkpointing","commit":"8fda92a-dirty","ReplayWindow":"5m0s","Period":"10s"}
{"level":"warn","ts":"2023-06-19T07:31:42.715Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:131","msg":"could not retrieve checkpoint configuration","commit":"8fda92a-dirty","error":"key checkpoint does not exist"}
{"level":"info","ts":"2023-06-19T07:31:42.725Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:311","msg":"no valid checkpoint found","commit":"8fda92a-dirty"}
{"level":"info","ts":"2023-06-19T07:31:42.725Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:312","msg":"setting begin of event stream","commit":"8fda92a-dirty","beginTimestamp":"2023-06-19 07:31:42.72101 +0000 UTC"}

It's connected, but I cannot see any events within Sockeye.

@embano1
Collaborator

embano1 commented Jun 19, 2023

To keep things simple for now, try enabling DEBUG log level to see if it's receiving/doing something:
https://github.com/vmware-tanzu/sources-for-knative#source-adapter-log-level
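
If it helps, assuming the adapter follows the standard Knative logging setup that the linked README section describes (an assumption on my part; the README is authoritative), the switch looks roughly like:

kubectl -n vmware-sources edit configmap config-logging
# set "level": "debug" in the zap-logger-config entry, then restart the adapter
kubectl -n vmware-functions rollout restart deploy/vcsa-source-adapter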

@laugrean

Finally I made it work; it did not work with the sink-uri, but with the sink-name:

kn vsphere source create \
    --namespace xxx \
    --name vcsa-source \
    --vc-address https://${VCENTER_HOSTNAME} \
    --skip-tls-verify \
    --secret-ref vcenter-creds \
    --sink-api-version eventing.knative.dev/v1 \
    --sink-kind Broker \
    --sink-name default \
    --encoding json

Now the vSphere event source and two routers are running against the same vCenter.

I'll let you know when I see the event router problem again, and whether the vcsa-source survives.

@laugrean

Both routers connected to my vCenter got stuck again at the same point in time, but the vcsa-source still receives events

@embano1
Collaborator

embano1 commented Jun 21, 2023

Both routers connected to my vCenter got stuck again at the same point in time, but the vcsa-source still receives events

Wait, sources and router are not sending anything anymore? Can you please enable DEBUG logging in the sources and dump output here when this happens?

@laugrean

source is still working
routers stopped working

My setup is:
router with info mode against vcenter -> stopped receiving events (vmware-router-74f7c779f4-cv82r)
router with debug mode against vcenter -> stopped receiving events** (vmware-router-debug-668b6dfd75-mdm7)
vSphere source with debug mode against vcenter -> still working (vcsa-source-adapter-84c75c57f6-kq4x2)

I've set up two routers to check if they get stuck at the same time -> yes, they do.
Your provided solution to switch to Tanzu Sources has now been running for more than one day without any problems.

@embano1
Collaborator

embano1 commented Jun 21, 2023

stopped receiving events**

What do these ** mean? Do you have some logs here again please?

Your provided solution to switch to Tanzu Sources has now been running for more than one day without any problems.

This is good news!

@laugrean

What do these ** mean? Do you have some logs here again please?

I just missed removing the bold markdown at the end.

Logs:
The debug node:

2023-06-21T06:03:21.132Z INFO [KNATIVE] knative/knative.go:194 successfully sent event {"eventID": "724e457d-28f1-428a-a0c5-a419358362bd"}
2023-06-21T06:03:21.910Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 1}
2023-06-21T06:03:22.960Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 2}
2023-06-21T06:03:25.011Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 4}
2023-06-21T06:03:29.066Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
2023-06-21T06:03:34.120Z DEBUG [VCENTER] vcenter/vcenter.go:313 no new events, backing off {"delaySeconds": 5}
... and so on

Last entry for the info node; after that, nothing else:

2023-06-21T06:03:21.568Z INFO [KNATIVE] knative/knative.go:194 successfully sent event {"eventID": "f0ec150b-2e4e-45e4-9452-53a31284779b"}

source logs

{"level":"debug","ts":"2023-06-21T06:03:17.000Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":5}
{"level":"debug","ts":"2023-06-21T06:03:22.096Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:208","msg":"got 3 events","commit":"8fda92a-dirty"}
{"level":"debug","ts":"2023-06-21T06:03:22.096Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:268","msg":"sending event","commit":"8fda92a-dirty","ID":"395017269","type":"com.vmware.vsphere.DrsSoftRuleViolationEvent.v0","data":"xxx"}
{"level":"debug","ts":"2023-06-21T06:03:22.100Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:268","msg":"sending event","commit":"8fda92a-dirty","ID":"395017270","type":"com.vmware.vsphere.DrsSoftRuleViolationEvent.v0","data":"xxx"}
{"level":"debug","ts":"2023-06-21T06:03:22.101Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:268","msg":"sending event","commit":"8fda92a-dirty","ID":"395017271","type":"com.vmware.vsphere.DrsSoftRuleViolationEvent.v0","data":"xxx"}
{"level":"debug","ts":"2023-06-21T06:03:22.153Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":1}
{"level":"debug","ts":"2023-06-21T06:03:23.208Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":2}

During a short investigation I could see this problem in Aria Logs; not sure if this is related.
I did not have time to investigate further.

vpxd-main - - - 2023-06-21T08:03:21.399+02:00 warning vpxd[15384] [Originator@6876 sub=vmomi.soapStub[826] opID=TaskLoop-host-8441071] SOAP request returned HTTP failure; <SSL

last time we could see:

info vpxd[13645] [Originator@6876 sub=HttpSvc.HTTPService.HttpConnection] HTTP Connection read failed while waiting for further requests; <io_obj p:0x00007f9594563e60, h:-1, <UNIX '/var/run/vmware/vpxd-webserver-pipe'>, <UNIX ''>>, N7Vmacore16TimeoutExceptionE(Operation timed out: Stream: <io_obj p:0x00007f9594563e60, h:-1, <UNIX '/var/run/vmware/vpxd-webserver-pipe'>, <UNIX ''>>, duration: 00:00:48.994720 (hh:mm:ss.us))\n--> [context]zKq7AVECAQAAAH1IQAEMdnB4ZAAA7tg3bGlidm1hY29yZS5zbwAAp3gsACVsLQAL6TIANyQtAFQEIwAjqCMAp0kjAE+fIwCsZDcBh38AbGlicHRocmVhZC5zby4wAAIvNg9saWJjLnNvLjYA[/context]

@embano1
Collaborator

embano1 commented Jun 22, 2023

Good news: we don't have a deadlock :) My suspicion here is that, given Sources are working, the router runs into session timeout issues. It looks like this is not implemented in the router code base, but I have done it for Sources. That would explain why there are no new events, which is fixed after a restart.
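
For reference, a minimal govmomi sketch of what such session keep-alive handling looks like (this mirrors the mechanism Sources uses conceptually; the function name, 5-minute interval, and error handling are illustrative, not router code):

package sketch

import (
	"context"
	"log"
	"net/url"
	"time"

	"github.com/vmware/govmomi/session"
	"github.com/vmware/govmomi/session/keepalive"
	"github.com/vmware/govmomi/vim25"
	"github.com/vmware/govmomi/vim25/methods"
	"github.com/vmware/govmomi/vim25/soap"
)

// newKeepAliveClient logs in to vCenter with a SOAP keep-alive handler
// installed: when the session is idle, it fetches the server time, which
// matches the "vCenter current time" log lines from the source adapter.
func newKeepAliveClient(ctx context.Context, u *url.URL, insecure bool) (*vim25.Client, error) {
	sc := soap.NewClient(u, insecure)
	vc, err := vim25.NewClient(ctx, sc)
	if err != nil {
		return nil, err
	}

	vc.RoundTripper = keepalive.NewHandlerSOAP(vc.RoundTripper, 5*time.Minute, func() error {
		t, err := methods.GetCurrentTime(ctx, vc)
		if err == nil {
			log.Printf("vCenter current time: %s", t)
		}
		return err
	})

	// Logging in through the wrapped RoundTripper starts the keep-alive loop.
	m := session.NewManager(vc)
	if err := m.Login(ctx, u.User); err != nil {
		return nil, err
	}
	return vc, nil
}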

@lamw
Contributor

lamw commented Jun 27, 2023

#1075

@laugrean

Unfortunately, our Kubernetes admins declined the above-mentioned workaround for our production environments.

Currently we have installed everything within one namespace, but several ClusterRoleBindings are needed, e.g.:

  • vmware-sources-controller-admin
  • vmware-sources-webhook-podspecable-binding
  • vmware-sources-webhook-addressable-resolver-binding
  • horizon-source-webhook-rolebinding

The problem is global permissions like:

 configmaps                                                    []                 []              [get list create update delete patch watch]
  events                                                        []                 []              [get list create update delete patch watch]
  secrets                                                       []                 []              [get list create update delete patch watch]
  serviceaccounts                                               []                 []              [get list create update delete patch watch]
  services                                                      []                 []              [get list create update delete patch watch]
  mutatingwebhookconfigurations.admissionregistration.k8s.io    []                 []              [get list create update delete patch watch]
  validatingwebhookconfigurations.admissionregistration.k8s.io  []                 []              [get list create update delete patch watch]
  customresourcedefinitions.apiextensions.k8s.io                []                 []              [get list create update delete patch watch]
  deployments.apps/finalizers                                   []                 []              [get list create update delete patch watch]
  deployments.apps                                              []                 []              [get list create update delete patch watch]
  leases.coordination.k8s.io                                    []                 []              [get list create update delete patch watch]
  rolebindings.rbac.authorization.k8s.io                        []                 []              [get list create update delete patch watch]

Any idea if this can be changed to dedicated namespaces only?

@embano1
Collaborator

embano1 commented Jul 12, 2023

@laugrean that is understandable. I'll have @lamw and the VMware team jump in to comment on the topic of RBAC here.

@lamw
Contributor

lamw commented Jul 12, 2023

@gabo1208 See above ^

@laugrean

laugrean commented Sep 5, 2023

I now have the same problem again with the vcsa-source-adapter.
After a few days I do not receive any events anymore.
Sockeye is empty for the affected source adapter.
These are the entries in the log:

{"level":"info","ts":"2023-09-05T05:26:19.279Z","logger":"vsphere-source-adapter","caller":"vsphere/client.go:115","msg":"Executing SOAP keep-alive handler","commit":"8fda92a-dirty","rpc":"keepalive"}
{"level":"info","ts":"2023-09-05T05:26:19.289Z","logger":"vsphere-source-adapter","caller":"vsphere/client.go:121","msg":"vCenter current time: 2023-09-05 05:26:19.285489 +0000 UTC","commit":"8fda92a-dirty","rpc":"keepalive"}
{"level":"info","ts":"2023-09-05T05:31:19.291Z","logger":"vsphere-source-adapter","caller":"vsphere/client.go:115","msg":"Executing SOAP keep-alive handler","commit":"8fda92a-dirty","rpc":"keepalive"}
{"level":"info","ts":"2023-09-05T05:31:19.302Z","logger":"vsphere-source-adapter","caller":"vsphere/client.go:121","msg":"vCenter current time: 2023-09-05 05:31:19.298007 +0000 UTC","commit":"8fda92a-dirty","rpc":"keepalive"}

When I kill the vcsa-source-adapter pod I see these messages:

{"level":"info","ts":"2023-09-05T05:32:49.857Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:296","msg":"found existing checkpoint","commit":"8fda92a-dirty"}
{"level":"warn","ts":"2023-09-05T05:32:49.857Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:299","msg":"potential data loss: last event timestamp in checkpoint is older than configured maximum","commit":"8fda92a-dirty","maxHistory":"5m0s","checkpointTimestamp":"2023-08-25 06:09:09.535 +0000 UTC"}
{"level":"warn","ts":"2023-09-05T05:32:49.857Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:303","msg":"setting begin of event stream","commit":"8fda92a-dirty","beginTimestamp":"2023-09-05 05:27:49.853811 +0000 UTC"}

And everything is back to normal... for at least a few days

@embano1
Collaborator

embano1 commented Sep 5, 2023

IIUC, when you enable debug logging in the adapter, it will tell you that it's not receiving new events, correct? If that's the case, and seeing the above INFO logs about keep-alives executing correctly, I'm afraid it's definitely a server-side issue, as nothing is coming through the EventHistoryCollector stream anymore. The server might close/cancel that stream after some time, or there is a network issue and the session is not correctly reestablished. Can't tell from this side. We'd definitely need someone from engineering to look into the server-side behavior, and also someone from govmomi (Go SDK) to look into what happens with the stream under the covers, e.g., is it cleanly terminated by the server, or is there a network interruption leading to a dangling EventHistoryCollector session where the client believes it's still intact but no data is received anymore. IIUC, there's also no heartbeat mechanism in the protocol by which the client can detect that the stream is basically gone.

BTW: it seems you're running 8fda92a-dirty, which is before the URL fix. You might want to update your version of Sources.
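
One client-side mitigation one could sketch until the server-side behavior is understood (purely illustrative; nothing in this thread says the adapter does this) is a staleness watchdog that tears down and recreates the collector when nothing arrives for an unusually long time:

package sketch

import (
	"context"
	"time"

	"github.com/vmware/govmomi/event"
	"github.com/vmware/govmomi/vim25/types"
)

// streamWithWatchdog wraps the polling loop from the earlier sketch with a
// staleness check: if no events and no errors show up for staleAfter,
// assume a dangling collector and start a fresh one. The threshold is an
// assumption and would need tuning to the environment's event volume.
func streamWithWatchdog(ctx context.Context, em *event.Manager, filter types.EventFilterSpec) error {
	const staleAfter = 30 * time.Minute

	hc, err := em.CreateCollectorForEvents(ctx, filter)
	if err != nil {
		return err
	}
	defer hc.Destroy(ctx)

	lastEvent := time.Now()
	for {
		events, err := hc.ReadNextEvents(ctx, 100)
		if err != nil {
			return err
		}
		switch {
		case len(events) > 0:
			lastEvent = time.Now()
			// process/send events here
		case time.Since(lastEvent) > staleAfter:
			// possibly dead collector: drop it and recreate
			_ = hc.Destroy(ctx)
			if hc, err = em.CreateCollectorForEvents(ctx, filter); err != nil {
				return err
			}
			lastEvent = time.Now()
		default:
			time.Sleep(5 * time.Second) // normal backoff
		}
	}
}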

@laugrean

laugrean commented Sep 29, 2023

Hey,

Sorry for the late reply.

I now had the following scenario:

  1. killed the vSphere source pod (due to the no-events problem)
  2. it restarted
  3. it received 100 events
  4. it stopped working

{"level":"warn","ts":"2023-09-29T06:24:07.287Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:303","msg":"setting begin of event stream","commit":"8fda92a-dirty","beginTimestamp":"2023-09-29 06:19:07.283263 +0000 UTC"}
{"level":"debug","ts":"2023-09-29T06:24:07.515Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:208","msg":"got 100 events","commit":"8fda92a-dirty"}
<all events listed>
{"level":"debug","ts":"2023-09-29T06:24:07.671Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":1}
{"level":"debug","ts":"2023-09-29T06:24:08.731Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":2}
{"level":"debug","ts":"2023-09-29T06:24:10.797Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":4}
{"level":"debug","ts":"2023-09-29T06:24:14.870Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":5}
{"level":"debug","ts":"2023-09-29T06:24:19.873Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:185","msg":"creating checkpoint","commit":"8fda92a-dirty","checkpoint":{"vCenter":"vcenter.xxx","lastEventKey":425952999,"lastEventType":"com.vmware.vc.HardwareSensorGroupStatus","lastEventKeyTimestamp":"2023-09-29T06:21:29.574999Z","createdTimestamp":"2023-09-29T06:24:07.60603884Z"}}
{"level":"debug","ts":"2023-09-29T06:24:19.952Z","logger":"vsphere-source-adapter","caller":"vsphere/adapter.go:203","msg":"backing off retrieving events: no new events received","commit":"8fda92a-dirty","backoffSeconds":5}
{"level":"debug","ts":"2023-0

it was running for a few seconds only...

I've updated kn-vsphere to the latest release and will give it another try.

@rguske
Contributor

rguske commented Dec 8, 2023

@laugrean any updates on this? We recently had a user report that after introducing the VSphereSource (replacing the event-router), the issue (the router stopping) was solved.
See: https://vmwarecode.slack.com/archives/CQLT9B5AA/p1701877325178139

@rguske
Contributor

rguske commented Jan 24, 2024

@jm66 @laugrean Any updates on this? Is the issue solved by using the Tanzu Sources for Knative (VSphereSource)?

@jm66
Author

jm66 commented Apr 23, 2024

@rguske Finally got a chance to deploy the latest version after a few issues during the setup.sh script. I will provide an update soon.

Something I noticed is that I cannot see the following events: BadUsernameSessionEvent and com.vmware.sso.LoginFailure.

Also the filtering type is not needed anymore?

@rguske
Contributor

rguske commented Apr 26, 2024

Hi @jm66, thanks for your answer.

Something I noticed is that I cannot see the following events: BadUsernameSessionEvent and com.vmware.sso.LoginFailure.

You can see all available events here: EVENTS

Also the filtering type is not needed anymore?

type: is still valid, see:

Reference: Docs

apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: veba-ps-echo-trigger
  labels:
    app: veba-ui
spec:
  broker: default
  filter:
    attributes:
      type: com.vmware.vsphere.VmPoweredOffEvent.v0
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: kn-ps-echo

Looking forward to reading your updates.

@jm66
Author

jm66 commented Apr 26, 2024

Thanks. Yes, I see the event listed in the ref docs, but it is not showing in Sockeye at all.

In v0.7.5, type was something like event or eventex, and subject was the actual event type. And now type contains the v0.7.5 subject attribute value with a .v0 suffix?

@jm66
Author

jm66 commented May 2, 2024

You can see all available events here: EVENTS

Got it working. The key is that all events need to use the format com.vmware.vsphere.{event}.v0.

Also the filtering type is not needed anymore?

I stand corrected. I meant to refer to subject. Looks like it isn't required anymore.

All my functions have been migrated to the new spec and apparently are working.
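
For anyone else migrating: the old router CloudEvent attributes (visible in the router logs earlier in this thread) map onto the new type-only filter roughly like this, using VmPoweredOffEvent as an example:

# old (router v0.7.5): generic type, event name in subject
filter:
  attributes:
    type: com.vmware.event.router/event
    subject: VmPoweredOffEvent

# new (VSphereSource): event name encoded in type, no subject needed
filter:
  attributes:
    type: com.vmware.vsphere.VmPoweredOffEvent.v0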

@rguske
Contributor

rguske commented May 16, 2024

Great news @jm66. Do you consider this issue solved? If so, feel free to close it.

@jm66 jm66 closed this as completed May 22, 2024