Events disappear after a few hours #146

Open
alternativc opened this issue Jun 5, 2024 · 9 comments
Assignees
Labels
kind/bug Something isn't working lifecycle/rotten

Comments

@alternativc

Describe the bug

This is a simple setup: falco(systemd) -> falcosidekick(docker) -> falcosidekick-ui(docker) + redis(docker). We run falco on all machines while the sidekick/ui/redis stack lives inside a docker swarm stack on the same host.

This setup works and we can see events in the UI. However, after a certain time interval the events disappear.

These are the logs from the falcosidekick-ui container (debug level):

2024/06/04 12:28:51  NEW event 'event:b0ad942b-b7c1-4272-92ea-977419630c1d'
2024/06/04 12:28:51  NEW event 'event:f64960df-271c-4562-9c15-2e2ce3e78a07'
2024/06/04 12:28:51  NEW event 'event:464bd05d-24bd-4e62-84ce-aa68a7debdd3'
2024/06/04 12:30:25 [ERROR]: [0] Unknown index name

2024/06/04 12:30:25 [ERROR]: [0] Unknown index name

2024/06/04 12:30:25 [ERROR]: [0] Unknown index name
...
2024/06/05 07:57:01 [INFO] : user 'admin' authenticated
2024/06/05 07:57:01  GET count by priority (source='', priority='', rule='', since='', hostname='', filter='', tags='')
2024/06/05 07:57:01  GET count by rule (source='', priority='', rule='', since='24h', hostname='', filter='', tags='')
2024/06/05 07:57:01  GET count by priority (source='', priority='', rule='', since='24h', hostname='', filter='', tags='')
2024/06/05 07:57:01  GET count by priority (source='', priority='', rule='', since='24h', hostname='', filter='', tags='')
2024/06/05 07:57:01  GET count by rule (source='', priority='', rule='', since='24h', hostname='', filter='', tags='')
2024/06/05 07:57:01  GET count by source (source='', priority='', rule='', since='24h', hostname='', filter='', tags='')
2024/06/05 07:57:01  GET count by priority (source='', priority='', rule='', since='24h', hostname='', filter='', tags='')
2024/06/05 07:57:01  GET count by source (source='', priority='', rule='', since='24h', hostname='', filter='', tags='')
2024/06/05 07:57:01  GET count by tags (source='', priority='', rule='', since='24h', hostname='', filter='', tags='')
2024/06/05 07:57:01  GET count by hostname (source='', priority='', rule='', since='24h', hostname='', filter='', tags='')
2024/06/05 07:57:01  GET count by rule (source='', priority='', rule='', since='24h', hostname='', filter='', tags='')
2024/06/05 07:57:01  GET search (source='', priority='', rule='', since='24h', hostname='', filter='', tags='', page='0', limit='500')
2024/06/05 07:57:01  GET search (source='', priority='', rule='', since='24h', hostname='', filter='', tags='', page='0', limit='500')
2024/06/05 07:57:01  GET count by tags (source='', priority='', rule='', since='24h', hostname='', filter='', tags='')
2024/06/05 07:57:01 [ERROR]: eventIndex: no such index
2024/06/05 07:57:01 [ERROR]: eventIndex: no such index

These are the redis logs (from the time the errors started):

...
9:M 04 Jun 2024 12:24:10.771 * Background saving started by pid 48
48:C 04 Jun 2024 12:24:10.774 * DB saved on disk
48:C 04 Jun 2024 12:24:10.774 * Fork CoW for RDB: current 1 MB, peak 1 MB, average 1 MB
9:M 04 Jun 2024 12:24:10.871 * Background saving terminated with success
9:M 04 Jun 2024 12:30:01.034 * DB saved on disk
9:M 04 Jun 2024 12:30:01.485 * <redisgears_2> Got a flush started event
9:M 04 Jun 2024 12:30:01.486 * DB saved on disk
9:M 04 Jun 2024 12:30:03.020 * DB saved on disk
9:M 04 Jun 2024 12:30:03.459 * DB saved on disk
9:M 04 Jun 2024 12:30:03.671 * <redisgears_2> Got a flush started event
9:M 04 Jun 2024 12:30:03.672 * DB saved on disk
9:M 04 Jun 2024 12:30:05.209 * DB saved on disk
9:M 04 Jun 2024 12:30:05.849 * DB saved on disk
9:M 04 Jun 2024 14:49:55.750 * <redisgears_2> Got a flush started event
...

Now I think something happens in the redis container to invalidate the index. If I restart falcosidekick-ui container then the events appear again.
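
For anyone who wants to line up the two logs, here is a minimal Python sketch that pairs each "flush started event" line in the redis log with the first index error the UI logs afterwards. The function names are hypothetical, the regexes are only fitted to the log lines pasted above, and treating a redisgears flush event as the point where the index goes away is an assumption, not something confirmed in this thread.

```python
import re
from datetime import datetime

# Patterns fitted to the log excerpts in this issue (an assumption about the format).
REDIS_LINE = re.compile(
    r"^\d+:[A-Z] (\d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2})\.\d{3} \* (.*)$"
)
UI_LINE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+(.*)$")


def redis_flush_times(lines):
    """Timestamps of redis log lines that mention a flush event."""
    times = []
    for line in lines:
        m = REDIS_LINE.match(line.strip())
        if m and "flush started event" in m.group(2):
            # %b expects English month abbreviations (C/English locale).
            times.append(datetime.strptime(m.group(1), "%d %b %Y %H:%M:%S"))
    return times


def ui_index_error_times(lines):
    """Timestamps of UI log lines that report a missing index."""
    times = []
    for line in lines:
        m = UI_LINE.match(line.strip())
        if m and ("Unknown index name" in m.group(2) or "no such index" in m.group(2)):
            times.append(datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S"))
    return times


def first_error_after_each_flush(flushes, errors):
    """Pair every flush time with the first index error at or after it (or None)."""
    errors = sorted(errors)
    return [(f, next((e for e in errors if e >= f), None)) for f in sorted(flushes)]


# Sample lines copied from the logs above.
REDIS_LOG = """\
9:M 04 Jun 2024 12:30:01.034 * DB saved on disk
9:M 04 Jun 2024 12:30:01.485 * <redisgears_2> Got a flush started event
9:M 04 Jun 2024 12:30:03.671 * <redisgears_2> Got a flush started event
9:M 04 Jun 2024 14:49:55.750 * <redisgears_2> Got a flush started event
""".splitlines()

UI_LOG = """\
2024/06/04 12:28:51  NEW event 'event:464bd05d-24bd-4e62-84ce-aa68a7debdd3'
2024/06/04 12:30:25 [ERROR]: [0] Unknown index name
2024/06/05 07:57:01 [ERROR]: eventIndex: no such index
""".splitlines()

pairs = first_error_after_each_flush(
    redis_flush_times(REDIS_LOG), ui_index_error_times(UI_LOG)
)
for flush, error in pairs:
    print(f"flush at {flush} -> first index error at {error}")
```

If the gap between a flush and the next index error stays small across several occurrences, that would support the idea that the flush is what invalidates the index. It would not explain what triggers the flush.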

I have tried manipulating the since parameter, with the same result.

How to reproduce it

Run the following docker-compose stack, emit some test events and wait. Please note that this is not a production-ready stack; the deploy section is omitted:

version: "3.8"
services:
  falco-sidekick:
    image: falcosecurity/falcosidekick:latest
    ports:
      - "2801:2801"
    networks:
      - falco
    environment:
      - WEBUI_URL=http://falco-sidekick-ui:2802
  redis:
    image: redis/redis-stack:latest
    ports:
      - "6379:6379"
    networks:
      - falco

  falco-sidekick-ui:
    image: falcosecurity/falcosidekick-ui:latest
    environment:
      - FALCOSIDEKICK_UI_REDIS_URL=redis:6379
      - FALCOSIDEKICK_UI_LOGLEVEL=debug
    ports:
      - "2802:2802"
    networks:
      - falco
      - caddy    

networks:
  caddy:
    external: true
  falco:

Expected behaviour

Falco events persist longer than X hours, or for as long as a configured TTL allows.

Screenshots

After X hours:
image

After UI container restart:
image

Environment

  • falcosidekick:
    /app $ ./falcosidekick --version
    GitVersion: bce6b79
    GitCommit: bce6b79ca5e0bc130649a4dae5d31ce7e33e6cae
    GitTreeState: clean
    BuildDate: '2024-06-04T08:44:13Z'
    GoVersion: go1.22.0
    Compiler: gc
    Platform: linux/amd64

  • falcosidekick-ui:
    /app $ ./falcosidekick-ui -v
    GitVersion: 01947af
    GitCommit: 01947af
    GitTreeState: clean
    BuildDate: '2024-04-30T14:11:51Z'
    GoVersion: go1.20.14
    Compiler: gc
    Platform: linux/amd64

  • Cloud provider or hardware configuration:
    Hetzner bare-metal

  • OS:
    PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
    NAME="Debian GNU/Linux"
    VERSION_ID="12"
    VERSION="12 (bookworm)"

  • Kernel:
    Linux worker-0 6.1.0-21-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.90-1 (2024-05-03) x86_64 GNU/Linux

  • Installation method:
    Docker swarm

Additional context
n/a

@alternativc alternativc added the kind/bug Something isn't working label Jun 5, 2024
@Issif
Member

Issif commented Jun 5, 2024

Hi,
Do you have any idea about the duration before the issue occurs?
I'm getting more and more issues with the redis backend. Replacing it with something else is on my to-do list, but there is no ETA for now.

@alternativc
Author

It's fairly non-deterministic, but somewhere between 4 and 12 hours. I was hoping that DEBUG level logs would give more info about what is actually being searched for, so I could inspect what is happening in both containers. Let me know if I can help in any way.
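
One check that might narrow this down once the events have disappeared is whether only the search index is gone or the event keys are gone too. Below is a rough Python sketch of that check. `diagnose` is a hypothetical helper, the index name `eventIndex` and the `event:*` key pattern are taken from the UI logs above and are assumptions about how the UI stores data, and `client` is expected to be a redis-py style client.

```python
def diagnose(client, index_name="eventIndex", key_pattern="event:*"):
    """Report whether the search index and any event keys still exist.

    `client` is assumed to behave like redis.Redis (redis-py): it must provide
    execute_command() and scan_iter().
    """
    # FT._LIST is the RediSearch command that lists existing index names.
    raw_names = client.execute_command("FT._LIST")
    names = {n.decode() if isinstance(n, bytes) else n for n in raw_names}

    # Stop at the first matching key; we only need to know whether any exist.
    has_event_keys = False
    for _key in client.scan_iter(match=key_pattern, count=100):
        has_event_keys = True
        break

    return {"index_present": index_name in names, "event_keys_found": has_event_keys}
```

With redis-py installed and the port published as in the compose file above, this could be called as `diagnose(redis.Redis(host="localhost", port=6379))`. If the index is missing but event keys are still found, only the index was dropped. If both are missing, the whole keyspace was probably emptied.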

@Issif
Member

Issif commented Jun 5, 2024

I'll run some tests on my side too. Redis is the root cause for sure, I just don't know how yet.

@alternativc
Author

On my end: I've added a volume mount to the redis container for persistence (in case that was the cause). I'll update the ticket with the findings if they turn out to be relevant.

@judikag03

Could you post the root cause of this problem? I have the same issue.

@Issif Issif self-assigned this Aug 17, 2024
@poiana

poiana commented Nov 15, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@alternativc
Author

Talked to Sysdig at AWS re:Invent to get this on their radar as well. Hopefully it will move things along.

@Issif
Member

Issif commented Dec 4, 2024

I have a lot of issues with redis, and I'm thinking of completely rewriting the UI in 2025 with a different backend. I still don't know the root cause of this specific issue because it's hard to reproduce.

@poiana

poiana commented Jan 3, 2025

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

4 participants