Stale ebpf maps if agent stops abruptly #537

Open
anubhabMajumdar opened this issue Jul 10, 2024 · 2 comments
Assignees
Labels
area/ebpf area/plugins help wanted Extra attention is needed lang/go The Go Programming Language priority/0 P0

Comments

@anubhabMajumdar
Contributor

Describe the bug

If the agent pod is OOMKilled, Packetparser leaves behind stale eBPF maps and qdiscs. These are never cleaned up on restart.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy retina-advanced
  2. Exec into a node and kill the controller process repeatedly
  3. Check maps and qdiscs

Expected behavior
Only one instance of maps should exist for each plugin and one ingress/egress qdisc for each veth.

Platform (please complete the following information):

  • OS: Linux
  • Kubernetes Version: 1.29
  • Host: AKS
  • Retina Version: current

Additional context
Suggestion: cleanup should happen in an init container (it will probably need privileges to clean up residual maps and qdiscs).

@anubhabMajumdar anubhabMajumdar added help wanted Extra attention is needed lang/go The Go Programming Language area/plugins area/ebpf priority/0 P0 labels Jul 10, 2024
@nddq nddq self-assigned this Jul 12, 2024
@nddq
Contributor

nddq commented Jul 29, 2024

I've looked into the issue and found that eBPF maps are deleted when their Close() function is called, which occurs in the plugin's Stop() function during a graceful shutdown. However, those maps persist if the agent is forcibly terminated via signals like SIGTERM or SIGKILL, bypassing the plugins' Stop() function. To address this, we should implement a goroutine in the main thread to catch these signals and handle cleanup.

@timraymond
Member

@nddq I think making it crash-only would be more robust (as much as I like defer). On boot, we should check whether those maps erroneously exist, then delete and recreate them.
