Enable high frequency non-checkpoint barrier without barrier read #12393

Closed
1 of 3 tasks
hzxa21 opened this issue Sep 18, 2023 · 1 comment

hzxa21 commented Sep 18, 2023

In #4290, we introduced two features: Barrier & Checkpoint decoupling and Read on non-checkpointed epoch. The main motivation back then was to improve freshness for batch queries without high-frequency checkpoints, which made these two features coupled to each other. However, this also introduced some complexities:

  • Since a non-checkpoint epoch is considered readable, a memtable flush must be triggered on every non-checkpoint barrier, which can generate many immutable memtables (IMMs). We later introduced "perf(storage): Merge multiple imms in the staging version to a large one" (#7368) to merge IMMs, which made the code more complex.
  • Batch query scheduling differs between reads on non-checkpoint and checkpoint epochs, because non-checkpointed state is only available on the writer CN. This also makes it hard to support non-checkpoint epoch reads once we have a dedicated serving cluster. In addition, non-checkpoint epoch reads are unavailable while the cluster is recovering.
  • There are some discrepancies in metadata management. We need to explicitly maintain the max "committed" non-checkpoint epoch and have special logic for epoch pin/unpin.

For these reasons, we recently turned off barrier & checkpoint decoupling and barrier read by default, by setting checkpoint_frequency=1, barrier_interval=1s and visibility_mode=checkpoint.
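To make the relationship between the two settings concrete, here is a minimal sketch of a barrier schedule. It assumes that every `checkpoint_frequency`-th barrier is a checkpoint barrier; the `Barrier` type and `barrier_schedule` function are hypothetical names for illustration only, not part of the codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Barrier:
    seq: int             # 1-based barrier sequence number
    time_ms: int         # injection time relative to the start
    is_checkpoint: bool  # True if this barrier also triggers a checkpoint


def barrier_schedule(barrier_interval_ms: int,
                     checkpoint_frequency: int,
                     duration_ms: int) -> list:
    """Hypothetical model: one barrier per interval, and every
    `checkpoint_frequency`-th barrier is a checkpoint (assumption)."""
    if barrier_interval_ms <= 0 or checkpoint_frequency <= 0:
        raise ValueError("interval and frequency must be positive")
    barriers = []
    seq = 1
    t = barrier_interval_ms
    while t <= duration_ms:
        barriers.append(Barrier(seq, t, seq % checkpoint_frequency == 0))
        seq += 1
        t += barrier_interval_ms
    return barriers


# Coupled (the current default): every barrier is a checkpoint.
coupled = barrier_schedule(barrier_interval_ms=1000,
                           checkpoint_frequency=1,
                           duration_ms=4000)

# Decoupled: barriers are 4x more frequent, checkpoints keep the same cadence.
decoupled = barrier_schedule(barrier_interval_ms=250,
                             checkpoint_frequency=4,
                             duration_ms=4000)
```

In this model, the decoupled schedule checkpoints at the same times as the coupled one, but inserts three extra non-checkpoint barriers between each pair of checkpoints.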

However, in recent discussions, we realized that Barrier & Checkpoint decoupling is actually independent of Read on non-checkpointed epoch, and that enabling the former without the latter can bring us the benefits of high-frequency barriers without the extra complexity:

  • Higher frequency of barrier means more timely triggers for the following operations:
    • barrier alignment -> better backpressure (considering inner join on fast & slow stream)
    • operator cache eviction -> less chance for OOM
    • memtable spill -> less chance for OOM
    • agg result emission -> less chance for OOM
  • No guarantees on barrier read means we can keep things simple
    • No need to worry about batch query scheduling
    • No need to force a memtable flush on non-checkpoint barrier. We can try to flush only when needed.
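The "flush only when needed" idea can be sketched as a toy simulation. Everything below is an assumption for illustration: the `MemTable` class, the byte-based spill threshold, and the rule that a checkpoint barrier always flushes are hypothetical and do not reflect the actual storage implementation.

```python
class MemTable:
    """Toy memtable that counts flushes (hypothetical, for illustration)."""

    def __init__(self, spill_threshold_bytes: int, force_flush_on_every_barrier: bool):
        self.spill_threshold_bytes = spill_threshold_bytes
        self.force_flush_on_every_barrier = force_flush_on_every_barrier
        self.buffered_bytes = 0
        self.peak_bytes = 0
        self.flush_count = 0

    def write(self, nbytes: int) -> None:
        self.buffered_bytes += nbytes
        self.peak_bytes = max(self.peak_bytes, self.buffered_bytes)

    def on_barrier(self, is_checkpoint: bool) -> bool:
        # With barrier read enabled, every barrier must flush so the epoch is readable.
        must_flush = is_checkpoint or self.force_flush_on_every_barrier
        # Without barrier read, a non-checkpoint barrier flushes only under memory pressure.
        should_spill = self.buffered_bytes >= self.spill_threshold_bytes
        if self.buffered_bytes > 0 and (must_flush or should_spill):
            self.flush_count += 1
            self.buffered_bytes = 0
            return True
        return False


def simulate(writes_per_barrier, checkpoint_frequency, **memtable_kwargs) -> MemTable:
    mt = MemTable(**memtable_kwargs)
    for seq, nbytes in enumerate(writes_per_barrier, start=1):
        mt.write(nbytes)
        # Assumption: every `checkpoint_frequency`-th barrier is a checkpoint.
        mt.on_barrier(is_checkpoint=(seq % checkpoint_frequency == 0))
    return mt


writes = [10] * 8  # 10 bytes written before each of 8 barriers

forced = simulate(writes, checkpoint_frequency=4,
                  spill_threshold_bytes=25, force_flush_on_every_barrier=True)
lazy = simulate(writes, checkpoint_frequency=4,
                spill_threshold_bytes=25, force_flush_on_every_barrier=False)

print(forced.flush_count, lazy.flush_count)
```

In this toy setup the lazy variant produces fewer, larger flushes while the spill threshold still bounds how much data accumulates between checkpoints.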

Action items:


hzxa21 commented Nov 8, 2023

We have conducted several rounds of tests, hoping that high-frequency barriers would fix the OOM. However, OOM is still present. We have decided to proceed with implementing spilling within a single barrier, which is a better solution than this one.
