Add test for starting up with corrupted shards #225

Open
etiennedi opened this issue Jul 2, 2024 · 0 comments

Background

This broke during a v1.24 -> v1.25 upgrade, which highlights that we have no regression testing for this scenario.

Related Core tickets:

Pipeline idea

My rough idea for the chaos pipeline is the following. That said, I’m completely open to other suggestions; I just wanted to share what I already came up with:

  • 3-node cluster with 3x replication
  • ingest 10 tenants (arbitrary number)
  • shut down the whole cluster
  • strategically corrupt tenants so that every node has some corrupt tenants, yet for each tenant a QUORUM of replicas is always left uncorrupted. In other words, never corrupt the same tenant on more than one node
  • start cluster
  • Cluster must start up
  • All 10 tenants must be usable with QUORUM operations
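
The corruption step above could be planned roughly as follows. This is only a sketch, not an existing pipeline helper: `plan_corruption`, the node names, and the tenant names are hypothetical. It assumes 3x replication, where QUORUM means 2 of 3 replicas, so each tenant may be corrupted on at most one node.

```python
import random
from collections import defaultdict


def plan_corruption(tenants, nodes, seed=None):
    """Map each node to the tenants whose local replica should be corrupted.

    Every tenant is assigned to exactly one node, so with 3 replicas a
    QUORUM (2 of 3) of healthy replicas always remains. Tenants are dealt
    out round-robin after shuffling, so every node gets at least one.
    """
    if len(tenants) < len(nodes):
        raise ValueError("need at least as many tenants as nodes")
    rng = random.Random(seed)  # seedable, so a failing run can be reproduced
    shuffled = list(tenants)
    rng.shuffle(shuffled)
    plan = defaultdict(list)
    for i, tenant in enumerate(shuffled):
        plan[nodes[i % len(nodes)]].append(tenant)
    return dict(plan)


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    tenants = [f"tenant-{i}" for i in range(10)]
    nodes = ["node-0", "node-1", "node-2"]
    for node, victims in plan_corruption(tenants, nodes, seed=1).items():
        print(node, victims)
```

Logging the seed with each pipeline run would make it possible to replay the exact same corruption layout when a run fails.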

How do you corrupt a tenant?

From what I understand there are two ways to corrupt tenants:

  • Randomly overwrite portions of (or truncate) a *.db file in the LSM store. Any file should do. Picking one at random increases the chances of finding new bugs
  • Same pattern, but with the Vector Index (HNSW commit log files)
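
Both corruption styles could be sketched like this. The helper names `pick_random_file` and `corrupt_file` are hypothetical, and the on-disk layout is an assumption: only the `*.db` pattern comes from the description above, so the glob pattern is passed in by the caller and the HNSW commit log pattern is left for the pipeline to supply.

```python
import os
import random
from pathlib import Path


def pick_random_file(root, pattern, rng):
    """Pick one non-empty file under `root` matching the glob `pattern`."""
    candidates = sorted(
        p for p in Path(root).rglob(pattern)
        if p.is_file() and p.stat().st_size > 0
    )
    if not candidates:
        raise FileNotFoundError(f"no non-empty files matching {pattern!r} under {root}")
    return rng.choice(candidates)


def corrupt_file(path, rng, truncate=False):
    """Corrupt `path` in place by truncating it or overwriting a random range."""
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError(f"{path} is empty, nothing to corrupt")
    if truncate:
        # Cut the file at a random point strictly before its current end.
        os.truncate(path, rng.randrange(size))
        return
    offset = rng.randrange(size)
    length = rng.randint(1, size - offset)
    with open(path, "r+b") as f:
        f.seek(offset)
        original = f.read(length)
        f.seek(offset)
        # XOR with a non-zero random value so every byte in the range changes.
        f.write(bytes(b ^ rng.randint(1, 255) for b in original))
```

A pipeline step might call `corrupt_file(pick_random_file(tenant_dir, "*.db", rng), rng)` for each tenant selected on a node, where `tenant_dir` is whatever directory holds that tenant's data on that node.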