Add test for starting up with corrupted shards #225

etiennedi · 2024-07-02T06:25:55Z

Background

This broke during a v1.24 -> v1.25 upgrade, which highlights that we had no regression test for this scenario.

Related Core tickets:

v1.25 changes Lazy loading behavior (seems to force-load on startup) weaviate#5257
Single corrupt or otherwise broken tenant can prevent a startup of an entire node (with many healthy tenants) in v1.25.x weaviate#5258

Pipeline idea

My rough idea for the chaos pipeline was the following. That said, I’m completely open to other suggestions, just wanted to share what I already came up with:

3-node cluster with 3x replication
ingest 10 tenants (arbitrary number)
shut down the whole cluster
strategically corrupt tenants so that every node has some corrupt tenants, yet for each tenant a QUORUM of replicas is always left uncorrupted. In other words, never corrupt more than one replica of the same tenant
start cluster
Cluster must start up
All 10 tenants must be usable with QUORUM operations
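
The "corrupt at most one replica per tenant" rule can be sketched as a small planning step. This is only an illustration, not the pipeline's actual code: `plan_corruption`, the tenant names, and the node names are hypothetical. It shuffles the tenants and deals them out round-robin, so each tenant is corrupted on exactly one node (leaving 2 of 3 replicas healthy) and every node gets at least one corrupt tenant.

```python
import random


def plan_corruption(tenants, nodes, seed=0):
    """Map each node to the tenants whose local replica should be corrupted.

    Every tenant is assigned to exactly one node, so with 3x replication
    two replicas (a QUORUM) of each tenant stay untouched.
    """
    if len(tenants) < len(nodes):
        raise ValueError("need at least one tenant per node")
    rng = random.Random(seed)  # seeded so a failing run can be reproduced
    shuffled = list(tenants)
    rng.shuffle(shuffled)
    plan = {node: [] for node in nodes}
    # Round-robin: tenant i goes to node i mod len(nodes).
    for i, tenant in enumerate(shuffled):
        plan[nodes[i % len(nodes)]].append(tenant)
    return plan


# Hypothetical names, matching the 3-node / 10-tenant setup above.
tenants = [f"tenant-{i}" for i in range(10)]
nodes = ["node-1", "node-2", "node-3"]
plan = plan_corruption(tenants, nodes)
```

Logging the seed alongside the plan would make it possible to replay exactly the same corruption layout when a run fails.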

How do you corrupt a tenant?

From what I understand there are two ways to corrupt tenants:

Randomly overwrite portions of (or truncate) a *.db file in the LSM store. Any file should do. By picking one at random we increase the chances of finding new bugs.
Same pattern, but with the Vector Index (HNSW commit log files)
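
A minimal sketch of both corruption modes, assuming the pipeline has direct filesystem access to a stopped node's data directory. The helper names are hypothetical, and the glob pattern is a parameter because only the `*.db` pattern is stated above; the naming of the HNSW commit log files would have to be checked against a real data directory. The overwrite branch XORs each byte with a random non-zero value, which guarantees that the file content actually changes.

```python
import random
from pathlib import Path


def pick_random_file(root: Path, pattern: str, rng: random.Random) -> Path:
    """Pick one file below `root` matching `pattern` (e.g. "*.db")."""
    files = sorted(p for p in root.rglob(pattern) if p.is_file())
    if not files:
        raise FileNotFoundError(f"no files matching {pattern!r} under {root}")
    return rng.choice(files)


def corrupt_file(path: Path, rng: random.Random) -> str:
    """Damage one file in place and return which mode was used."""
    size = path.stat().st_size
    if size == 0:
        return "skipped"  # nothing to damage in an empty file
    if rng.random() < 0.5:
        # Truncate: cut the file off at a random offset before its end.
        with path.open("r+b") as f:
            f.truncate(rng.randrange(size))
        return "truncated"
    # Overwrite: damage a random span somewhere inside the file.
    start = rng.randrange(size)
    length = rng.randint(1, size - start)
    with path.open("r+b") as f:
        f.seek(start)
        original = f.read(length)
        # XOR with a non-zero value so every byte in the span changes.
        damaged = bytes(b ^ rng.randint(1, 255) for b in original)
        f.seek(start)
        f.write(damaged)
    return "overwritten"
```

A call such as `corrupt_file(pick_random_file(tenant_dir, "*.db", rng), rng)` would then damage one LSM file of a tenant, where `tenant_dir` stands for that tenant's directory on the node chosen for corruption.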


nathanwilk7 self-assigned this Jul 2, 2024

nathanwilk7 mentioned this issue Jul 2, 2024

WIP Add Test for Corrupt Shards #226

Open
