Cherry-pick TiKV related changes to 8.10.fb #364

v01dstar · 2024-04-11T17:37:09Z

6.29 (last TiKV base) diff: facebook/rocksdb@6.29.fb...tikv:rocksdb:6.29.tikv

Apply write-amplification-based rate limiter fe76269
- 3dc1b55
- 458bbd7
- ccf5215
- f9aacb3
- 4751586
- 8d6414e
- db59a3f
- c0953b3
- 938c016
- Comments:
  - Env::IOPriority now has more options (LOW, MID, HIGH, USER) instead of just LOW and HIGH. The write-amplification rate limiter's tuning logic may need to be adjusted for this change.
Apply PerfFlag patch b27b564
- 2f1efc5
- 7ee5329
- Comments:
  - We need to resolve compatibility issues every time RocksDB adds new metrics, and all dependent projects, such as Titan and rust-rocksdb, have to change accordingly.
Compaction filter optimization 2a93687 <-re-evaluate unsafe filter v4
- 8e20349
- 9554ad2
- ~~23c8635~~ <- removed (this is a bug)
- 3a238bf <- no longer needed, see avoid deadlock for GC in compaction filter tikv#9694
- Manual apply related changes in bb515db
- Comments:
  - We also need to resolve compatibility issues here every time RocksDB changes the filter API.
Doubly skiplist for reverse scan acc80ce
- 5d01038
Add WAL write duration metric de0fc30
- e3c8f48
Add Iterator and Append method for WriteBatch 6ffdf20
- 340b810
TiKV IO rate limiter a8d22fc
- 7749860
- 52b4a97 <- retiring because of Add subcompaction event API facebook/rocksdb#9311
- 4ec4a1f
- 8bd42ca <- removed base_background_compactions related code due to Remove unused API base_background_compactions facebook/rocksdb#9462
- 28f7636
- Remove ROCKSDB_LITE macro use
Manifest dump tool optimization
- e1797a0
- 615c1dc
Implement pipelined commit / multi-batch write
- 910417b
- 19db40b
- 2d03d53 <- removed, but we need to verify this does not cause upgrade issue again tikv restart failed（upgrade new image and CrashLoopBackOff） tikv#13007
- f4cba2f
- 9bb7147
- Manually applied some changes from 98a80e9
- Comments:
  - This is not easy to maintain. The multi-batch write implementation is based on rocksdb::BatchWrite() (with some custom code), so whenever rocksdb::BatchWrite() changes, the multi-batch write implementation must change accordingly, which requires human intervention.
Per-file encryption key management
- b9c2064
- 6386992
- 63399df
- 63586f2
- 7ee32c0
- bbd27cf
- 1868d12
- 4cebfc1
- 9464766
- Remove ROCKSDB_LITE macro
Optimize SST partitioner to avoid huge compaction
- e2f6ec7
Titan
- Make statistics extensible 6d88b39
  - 7c6dcaa <- pitfall, do not try to move statistics impl to stats.cc. https://stackoverflow.com/questions/1111440/undefined-reference-error-for-template-method
  - Manual apply changes in bb515db
- Manual apply related changes (blob index) in bb515db
- dcf2f8d
Raftstore v2?
- 32f8f2b
- faad483
- 8899a36 <- remove wal
- 638c217
- Expose seqno / Add post write callback
  - 3cd757c <- over-written by 08aa503
  - 08aa503
  - fdcd14d
- 9ea79ab
- acc624f
- cd9aa99
- 8a9c10e
- 14f36f8
- 5b9cef9 <-re-evaluate
- de47e8e
- 6121b2d
- 0813e37
- fe76937 just the CheckInRange API

Already exist in upstream (fb):

No longer needed:

bc1f255
dc9353f <- fixed differently in upstream OnFlushCompleted is called before flush completed facebook/rocksdb#5892
545d0b2 <- fixed by Sort L0 files by newly introduced epoch_num facebook/rocksdb#10922
bb515db <- Separated. i.e. monitoring related changes are merged with "Make statistics extensible"
2cbb069 <- reverted by 53eae82
53eae82 <- reverting 2cbb069

Need triage:

0559eac <- rocksdb cloud

To be verified:

40551e2 <- we can evaluate RocksDB's new option for solving this problem introduced in Delay bottommost level single file compactions facebook/rocksdb#11701

Complications:

WriteBufferManager has changed a lot
SST file epoch number was introduced, instance merge needs to accommodate that change.
The write stall behavior change introduced in the multi-instance support project caused some tests to fail.
RocksDB now uses C++17 standard

Signed-off-by: v01dstar <[email protected]>

compaction_filter: add bottommost_level into context (tikv#160)

Signed-off-by: qupeng <[email protected]>
Signed-off-by: tabokie <[email protected]>

add range for compaction filter context (tikv#192)

* add range for compaction filter context

Signed-off-by: qupeng <[email protected]>
Signed-off-by: tabokie <[email protected]>

allow no_io for VersionSet::GetTableProperties (tikv#211)

* allow no_io for VersionSet::GetTableProperties

Signed-off-by: qupeng <[email protected]>
Signed-off-by: tabokie <[email protected]>

expose seqno from compaction filter and iterator (tikv#215)

This PR supports accessing `seqno` for every key/value pair in a compaction filter or iterator. It helps enhance GC in the compaction filter in TiKV.

Signed-off-by: qupeng <[email protected]>
Signed-off-by: tabokie <[email protected]>

allow to query DB stall status (tikv#226)

This PR adds a new property, is-write-stalled, to query whether the column family is in write stall or not. In TiKV there is a compaction filter used for GC, in which DB::write is called. So if we can query whether the DB instance is stalled, we can skip creating more compaction filter instances to save some resources.

Signed-off-by: qupeng <[email protected]>
Signed-off-by: tabokie <[email protected]>

Fix compatibility issue with Titan

Signed-off-by: v01dstar <[email protected]>

filter deletion in compaction filter (tikv#344)

And delay the buffer initialization of writable file to first actual write.

---------

Signed-off-by: tabokie <[email protected]>

Adjustments for compatibility with 8.10.facebook

Signed-off-by: v01dstar <[email protected]>

Adjust tikv related changes with upstream

Signed-off-by: v01dstar <[email protected]>

Signed-off-by: v01dstar <[email protected]>

Ref tikv#277

When the iterator reads keys in reverse order, each Prev() call costs O(log n). So I add a prev pointer to every node in the skiplist to speed up Prev().

Signed-off-by: Little-Wallace [email protected]

Implemented new virtual functions:
- `InsertWithHintConcurrently`
- `FindRandomEntry`

Signed-off-by: tabokie <[email protected]>
Signed-off-by: v01dstar <[email protected]>
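The idea of the prev pointer can be illustrated with a minimal sketch. This is not the patch's memtable implementation: `DoublySkipList`, its `int` keys, the fixed maximum height, and the single-threaded insert are all assumptions made for illustration. It only shows that maintaining a level-0 back pointer on insert lets `Prev()` run in constant time.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical, single-threaded skiplist of ints; not RocksDB's memtable rep.
class DoublySkipList {
 public:
  struct Node {
    int key;
    Node* prev;               // level-0 back pointer (the addition described above)
    std::vector<Node*> next;  // forward pointers, one per level
    Node(int k, int height) : key(k), prev(nullptr), next(height, nullptr) {}
  };

  DoublySkipList() : head_(new Node(0, kMaxHeight)) {}
  ~DoublySkipList() {
    Node* n = head_;
    while (n != nullptr) {
      Node* following = n->next[0];
      delete n;
      n = following;
    }
  }
  DoublySkipList(const DoublySkipList&) = delete;
  DoublySkipList& operator=(const DoublySkipList&) = delete;

  void Insert(int key) {
    // Record, per level, the last node whose key is smaller than `key`.
    Node* update[kMaxHeight];
    Node* x = head_;
    for (int level = kMaxHeight - 1; level >= 0; --level) {
      while (x->next[level] != nullptr && x->next[level]->key < key) {
        x = x->next[level];
      }
      update[level] = x;
    }
    const int height = RandomHeight();
    Node* n = new Node(key, height);
    for (int level = 0; level < height; ++level) {
      n->next[level] = update[level]->next[level];
      update[level]->next[level] = n;
    }
    // Back pointers are maintained on level 0 only.
    n->prev = update[0];
    if (n->next[0] != nullptr) n->next[0]->prev = n;
  }

  // Finds the last node by descending from the top level.
  Node* SeekToLast() const {
    Node* x = head_;
    for (int level = kMaxHeight - 1; level >= 0; --level) {
      while (x->next[level] != nullptr) x = x->next[level];
    }
    return x == head_ ? nullptr : x;
  }

  // O(1): follow the back pointer instead of searching from the head.
  Node* Prev(Node* n) const { return n->prev == head_ ? nullptr : n->prev; }

 private:
  static constexpr int kMaxHeight = 8;
  static int RandomHeight() {
    int h = 1;
    while (h < kMaxHeight && (std::rand() % 4) == 0) ++h;
    return h;
  }
  Node* head_;
};
```

A reverse scan in this sketch calls `SeekToLast()` once and then `Prev()` repeatedly until it returns `nullptr`. The trade-off is one extra pointer per node and one extra write per insert.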

Add WAL write duration metric UCP tikv/tikv#6541 Signed-off-by: Wangweizhen <[email protected]> Signed-off-by: tabokie <[email protected]> Signed-off-by: v01dstar <[email protected]>

I want to use the rocksdb::WriteBatch format to encode TiKV key-value pairs, and I need a more efficient method to copy data from an Entry to a WriteBatch directly, so that I can avoid the CPU cost of decoding.

Signed-off-by: Little-Wallace <[email protected]>
Signed-off-by: tabokie <[email protected]>
Signed-off-by: v01dstar <[email protected]>

Signed-off-by: v01dstar <[email protected]>

Implement multi batches write

Signed-off-by: v01dstar <[email protected]>

Fix SIGABRT caused by uninitialized mutex (tikv#296) (tikv#298)

* Fix SIGABRT caused by uninitialized mutex

Signed-off-by: Wenbo Zhang <[email protected]>

* Use spinlock instead of mutex to reduce writer ctor cost

Signed-off-by: Wenbo Zhang <[email protected]>

* Update db/write_thread.h

Co-authored-by: Xinye Tao <[email protected]>
Signed-off-by: Wenbo Zhang <[email protected]>
Co-authored-by: Xinye Tao <[email protected]>
Signed-off-by: Wenbo Zhang <[email protected]>
Co-authored-by: Xinye Tao <[email protected]>

Signed-off-by: v01dstar <[email protected]>

Signed-off-by: hillium <[email protected]> Signed-off-by: Yang Zhang <[email protected]>

Signed-off-by: Yang Zhang <[email protected]>

* Add copy constructor for ColumnFamilyHandleImpl Signed-off-by: Yang Zhang <[email protected]>

* return sequence number of writes Signed-off-by: 5kbpers <[email protected]> * fix compile error Signed-off-by: 5kbpers <[email protected]> Signed-off-by: tabokie <[email protected]>

…nFlushBegin event (tikv#300) * add largest seqno of memtable Signed-off-by: 5kbpers <[email protected]> * add test Signed-off-by: 5kbpers <[email protected]> * address comment Signed-off-by: 5kbpers <[email protected]> * address comment Signed-off-by: 5kbpers <[email protected]> * format Signed-off-by: 5kbpers <[email protected]> * memtable info Signed-off-by: 5kbpers <[email protected]> Signed-off-by: 5kbpers <[email protected]> Signed-off-by: Yang Zhang <[email protected]>

A callback that is called after write succeeds and changes have been applied to memtable. Titan change: tikv/titan#270 Signed-off-by: tabokie <[email protected]> Signed-off-by: Yang Zhang <[email protected]>

Summary:

Modify existing write buffer manager to support multiple instances.

Previously, a flush is triggered before user writes if `ShouldFlush()` returns true. But in the multiple-instance context, this will cause flushing for all DBs that are undergoing writes.

In this patch, column families are registered to a shared linked list inside the write buffer manager. When flush condition is triggered, the column family with highest score from this list will be chosen and flushed. The score can be either size or age.

The flush condition calculation is also changed to exclude immutable memtables. This is because RocksDB schedules flush every time an immutable memtable is generated. They will eventually be evicted from memory given the flush bandwidth doesn't bottleneck.

Test plan:

- Unit test cases
- Trigger flush of largest/oldest memtable in another DB
- Resolve flush condition by destroy CF/DB
- Dynamically change flush threshold
- Manual test insert, update, read-write workload, [script](https://gist.github.com/tabokie/d38d27dc3843946c7813ab7bafd0f753).

Signed-off-by: tabokie <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
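The selection step described in the summary can be sketched roughly as follows. Everything here is illustrative: `CfEntry`, `FlushScore`, and `PickFlushCandidate` are hypothetical names, a `std::vector` stands in for the shared linked list, and the flush condition is assumed to be "total mutable memtable bytes reached a threshold", which the summary does not spell out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record for a registered column family.
struct CfEntry {
  std::string name;
  std::size_t mutable_memtable_bytes;  // immutable memtables are not counted
  std::uint64_t oldest_write_time;     // smaller means older
};

enum class FlushScore { kSize, kAge };

// Returns the index of the column family to flush, or -1 when the assumed
// flush condition (total mutable bytes >= threshold) is not met.
inline int PickFlushCandidate(const std::vector<CfEntry>& cfs,
                              std::size_t flush_threshold, FlushScore mode) {
  std::size_t total = 0;
  for (const CfEntry& cf : cfs) total += cf.mutable_memtable_bytes;
  if (cfs.empty() || total < flush_threshold) return -1;

  int best = 0;
  for (int i = 1; i < static_cast<int>(cfs.size()); ++i) {
    // Highest score wins: largest memtable, or oldest memtable.
    const bool better =
        mode == FlushScore::kSize
            ? cfs[i].mutable_memtable_bytes > cfs[best].mutable_memtable_bytes
            : cfs[i].oldest_write_time < cfs[best].oldest_write_time;
    if (better) best = i;
  }
  return best;
}
```

Because only one candidate is returned, a write to one DB flushes a single column family (possibly in another DB) rather than every DB that is currently writing.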

* fix bug of using post write callback with empty batch Signed-off-by: tabokie <[email protected]> * fix nullptr Signed-off-by: tabokie <[email protected]> Signed-off-by: tabokie <[email protected]>

Add support to merge multiple DBs that have no overlapping data (tombstone included). Memtables are frozen and then referenced by the target DB. Table files are hard linked with new file numbers into the target DB. After merge, the sequence numbers of memtables and L0 files will appear out-of-order compared to a single DB. But for any given user key, the ordering still holds because there will only be one unique source DB that contains the key and the source DB's ordering is inherited by the target DB. If source and target instances share the same block cache, target instance will be able to reuse cache. This is done by cloning the table readers of source instances to the target instance. Because the cache key is stored in table reader, reads after the merge can still retrieve source instances' blocks via old cache key. Under release build, it takes 8ms to merge a 25GB DB (500 files) into another. Signed-off-by: tabokie <[email protected]>

* exclude uninitialized files when estimating compression ratio Signed-off-by: tabokie <[email protected]> * add comment Signed-off-by: tabokie <[email protected]> * fix flaky test Signed-off-by: tabokie <[email protected]> --------- Signed-off-by: tabokie <[email protected]>

* hook delete dir in encrypted env Signed-off-by: tabokie <[email protected]> * add a comment Signed-off-by: tabokie <[email protected]> --------- Signed-off-by: tabokie <[email protected]>

* add toggle Signed-off-by: tabokie <[email protected]> * protect underflow Signed-off-by: tabokie <[email protected]> * fix build Signed-off-by: tabokie <[email protected]> * remove deadline and add penalty for l0 files Signed-off-by: tabokie <[email protected]> * fix build Signed-off-by: tabokie <[email protected]> * consider compaction trigger Signed-off-by: tabokie <[email protected]> --------- Signed-off-by: tabokie <[email protected]>

Also added a new option to detect whether manual compaction is disabled. In practice we use this to avoid blocking on flushing a tablet that will be destroyed shortly after.

---------

Signed-off-by: tabokie <[email protected]>

…heckpoint (tikv#338) * fix renaming encrypted directory Signed-off-by: tabokie <[email protected]> * fix build Signed-off-by: tabokie <[email protected]> * patch test manager Signed-off-by: tabokie <[email protected]> * fix build Signed-off-by: tabokie <[email protected]> * check compaction paused during checkpoint Signed-off-by: tabokie <[email protected]> * add comment Signed-off-by: tabokie <[email protected]> --------- Signed-off-by: tabokie <[email protected]>

And delay the buffer initialization of writable file to first actual write. --------- Signed-off-by: tabokie <[email protected]>

Signed-off-by: Spade A <[email protected]>

Signed-off-by: SpadeA-Tang <[email protected]>

Signed-off-by: Yang Zhang <[email protected]> Signed-off-by: v01dstar <[email protected]>

Signed-off-by: Yang Zhang <[email protected]>

hbisheng · 2024-06-24T05:52:27Z

utilities/rate_limiters/write_amp_based_rate_limiter.cc

+    return;
+  }
+
+  ++total_requests_[pri];


It looks like tikv ran into a segfault on this line. I'm still trying to understand why but I guess it's related to the addition of new IO priorities.

Here's the evidence:

segfault information:

[Mon Jun 24 10:59:14 2024] apply-1[2562783]: segfault at 7f68952f40a0 ip 000055bd6eacc09a sp 00007f6c1d621f90 error 6 in tikv-server[55bd69e00000+6052000]

The static address of the segfault code can be calculated as ip (000055bd6eacc09a) - base_addr_of_tikv (55bd69e00000) = 0x4ccc09a.
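The subtraction can be checked mechanically. The helpers below are a small sketch with hypothetical names (`StaticOffset`, `InsideImage`); the constants in the usage note are the ones from the kernel log line quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Offset of a faulting instruction inside the binary image. Both inputs come
// from the kernel's segfault line:
//   "... ip <ip> ... in tikv-server[<load_base>+<image_size>]"
constexpr std::uint64_t StaticOffset(std::uint64_t ip, std::uint64_t load_base) {
  return ip - load_base;
}

// True when the instruction pointer falls inside the mapped image.
constexpr bool InsideImage(std::uint64_t ip, std::uint64_t load_base,
                           std::uint64_t image_size) {
  return ip >= load_base && ip - load_base < image_size;
}
```

With `ip = 0x55bd6eacc09a`, `load_base = 0x55bd69e00000`, and `image_size = 0x6052000`, `StaticOffset` yields `0x4ccc09a`, and `InsideImage` confirms that the address belongs to the tikv-server mapping.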

With the help of gdb, we can locate the code of that address.

$ gdb tikv-server
(gdb) info line *0x4ccc09a
Line 172 of "/workspace/.cargo/git/checkouts/rust-rocksdb-9e01d192e8b6561d/af14652/librocksdb_sys/rocksdb/utilities/rate_limiters/write_amp_based_rate_limiter.cc" starts at address 0x4ccc093 <rocksdb::WriteAmpBasedRateLimiter::Request(long, rocksdb::Env::IOPriority, rocksdb::Statistics*)+147> and ends at 0x4ccc0a2 <rocksdb::WriteAmpBasedRateLimiter::Request(long, rocksdb::Env::IOPriority, rocksdb::Statistics*)+162>.

RocksDB has introduced more priority types (USER, MID, etc.) since 8.x, while WriteAmpBasedRateLimiter only considered 3 of them. I thought it could work, but maybe this can cause some problems.

One thing that seems to be interesting: the segfault issue seemed to only happen on Linux machines; I wasn't able to reproduce it on Mac. So it could be arch/compiler related.

Also, I managed to get a coredump of the segfault on Linux.

Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00005595b5961efa in rocksdb::WriteAmpBasedRateLimiter::Request (this=0x7ffa1e214000, bytes=46541, pri=<optimized out>, stats=0x0) at /root/tabokie/packages/cargo/.cargo/git/checkouts/rust-rocksdb-da3ff04b5d606849/ca1f1dd/librocksdb_sys/rocksdb/utilities/rate_limiters/write_amp_based_rate_limiter.cc:172
172 ++total_requests_[pri];

It shows that total_requests_ was initialized with a length of 4 as expected. But the pri was optimized out.

(gdb) print total_requests_
$1 = {1407, 16, 274, 0}
(gdb) print pri
$2 = <optimized out>

Given the definition of IOPriority, the only way for it to cause a segfault is when pri equals IO_TOTAL, but I don't think that's how we expect pri to be used...

enum IOPriority { IO_LOW = 0, IO_MID = 1, IO_HIGH = 2, IO_USER = 3, IO_TOTAL = 4 };

Still investigating...

Found the problem! Turns out that one constructor for Writer did not initialize rate_limiter_priority. With this one-line fix, the segfault problem went away.

We might want to check how the bug was introduced and whether there could be other similar problems.
Update: The upstream 8.10.fb branch does not have this problem (db/write_thread.h), so it's likely an oversight when we cherry-picked the commits.
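One way to guard against this class of bug is sketched below. It is not the actual fix and does not reproduce the real `WriteThread::Writer`: `SketchWriter` and `RequestCounters` are hypothetical stand-ins, and using `IO_TOTAL` as the "unset" default is an assumption of this sketch. The point is that a default member initializer gives the field a defined value in every constructor, and a range check keeps an unexpected priority from indexing past the counters.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Mirrors the enum quoted in the discussion above.
enum IOPriority { IO_LOW = 0, IO_MID = 1, IO_HIGH = 2, IO_USER = 3, IO_TOTAL = 4 };

// Hypothetical stand-in for a writer object.
struct SketchWriter {
  // The default member initializer applies to every constructor, including
  // ones that never mention this field.
  IOPriority rate_limiter_priority = IO_TOTAL;  // assumed "unset" sentinel
  bool sync = false;

  SketchWriter() = default;
  explicit SketchWriter(bool s) : sync(s) {}  // priority is still initialized
};

// Hypothetical per-priority counters with a range check before indexing.
class RequestCounters {
 public:
  // Returns false instead of writing out of bounds.
  bool Record(IOPriority pri) {
    const std::size_t idx = static_cast<std::size_t>(pri);
    if (idx >= total_requests_.size()) return false;
    ++total_requests_[idx];
    return true;
  }

  std::int64_t Count(IOPriority pri) const {
    return total_requests_.at(static_cast<std::size_t>(pri));
  }

 private:
  std::array<std::int64_t, IO_TOTAL> total_requests_{};
};
```

In this sketch a writer built through any constructor carries `IO_TOTAL`, which `Record` rejects, so a forgotten initialization shows up as a refused request rather than a write to an arbitrary address.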

Signed-off-by: Yang Zhang <[email protected]> Signed-off-by: v01dstar <[email protected]>

v01dstar and others added 8 commits February 6, 2024 23:21

Sanitize tikv-rocksdb

74d783a

Signed-off-by: v01dstar <[email protected]>

Apply write-amplification based rate limiter patch

fe76269

Signed-off-by: v01dstar <[email protected]>

Apply PerfFlag patch

b27b564

Signed-off-by: v01dstar <[email protected]>

Make statistics extensible

6d88b39

Signed-off-by: v01dstar <[email protected]>

Add WAL write duration metric

de0fc30

Add WAL write duration metric UCP tikv/tikv#6541 Signed-off-by: Wangweizhen <[email protected]> Signed-off-by: tabokie <[email protected]> Signed-off-by: v01dstar <[email protected]>

zhangjinpeng87 requested a review from Connor1996 May 10, 2024 02:13

v01dstar and others added 6 commits May 19, 2024 00:12

Add support for TiKV IO rate limiter

e13234e

Signed-off-by: v01dstar <[email protected]>

Allow tool dump single SST meta in MANIFEST

325afa4

Signed-off-by: v01dstar <[email protected]>

Add patches

6b53a73

Signed-off-by: v01dstar <[email protected]>

Add KeyManagedEncryptedEnv for per file key management

54d3462

Signed-off-by: v01dstar <[email protected]>

Add patch files

16bf38e

Signed-off-by: v01dstar <[email protected]>

v01dstar force-pushed the 8.10-tikv branch from 0acb2b9 to 16bf38e Compare May 19, 2024 08:12

YuJuncen and others added 4 commits May 21, 2024 01:18

Optimize SST partitioner to avoid huge compaction

135f0fa

Signed-off-by: hillium <[email protected]> Signed-off-by: Yang Zhang <[email protected]>

Add patch file

1240012

Signed-off-by: Yang Zhang <[email protected]>

Fix statistics

a53f73c

Signed-off-by: Yang Zhang <[email protected]>

Fix compaction filter test

7de0d08

Signed-off-by: Yang Zhang <[email protected]>

v01dstar force-pushed the 8.10-tikv branch from ee9612a to 7de0d08 Compare May 22, 2024 02:13

v01dstar and others added 9 commits May 21, 2024 19:25

Add support for TitanColumnFamilyHandle

60e5b2f

* Add copy constructor for ColumnFamilyHandleImpl Signed-off-by: Yang Zhang <[email protected]>

Return sequence number of writes (tikv#292)

b0c27b8

* return sequence number of writes Signed-off-by: 5kbpers <[email protected]> * fix compile error Signed-off-by: 5kbpers <[email protected]> Signed-off-by: tabokie <[email protected]>

support post write callback (tikv#326)

b3121eb

A callback that is called after write succeeds and changes have been applied to memtable. Titan change: tikv/titan#270 Signed-off-by: tabokie <[email protected]> Signed-off-by: Yang Zhang <[email protected]>

fix bug of using post write callback with empty batch (tikv#327)

496ed89

* fix bug of using post write callback with empty batch Signed-off-by: tabokie <[email protected]> * fix nullptr Signed-off-by: tabokie <[email protected]> Signed-off-by: tabokie <[email protected]>

hook delete dir in encrypted env (tikv#334)

7dbc017

* hook delete dir in encrypted env Signed-off-by: tabokie <[email protected]> * add a comment Signed-off-by: tabokie <[email protected]> --------- Signed-off-by: tabokie <[email protected]>

tabokie and others added 7 commits May 21, 2024 23:49

FlushForGetLiveFiles does not wait for write stall (tikv#336)

b0bfbb1

filter deletion in compaction filter (tikv#344)

bb9cb78

And delay the buffer initialization of writable file to first actual write. --------- Signed-off-by: tabokie <[email protected]>

enable cf uses separete write buffer manager (tikv#343)

0af4d9d

Signed-off-by: Spade A <[email protected]>

fix deadlock between Flush and UnregisterDB (tikv#349)

5a9e751

Signed-off-by: SpadeA-Tang <[email protected]>

v01dstar force-pushed the 8.10-tikv branch 5 times, most recently from 2e011bf to 3ea895b Compare May 31, 2024 01:37

Fix compatibility

ce0449c

Signed-off-by: Yang Zhang <[email protected]> Signed-off-by: v01dstar <[email protected]>

v01dstar force-pushed the 8.10-tikv branch from 3ea895b to ce0449c Compare May 31, 2024 01:47

v01dstar added 2 commits May 31, 2024 23:33

Add CheckInRange API

c6143dc

Signed-off-by: Yang Zhang <[email protected]>

Apply new IO priorities in rate limiter

00010c1

Signed-off-by: Yang Zhang <[email protected]>

hbisheng reviewed Jun 24, 2024


v01dstar force-pushed the 8.10-tikv branch from 620dcb3 to ec8629a Compare August 15, 2024 22:52

Fix Titan compatibility

578e0db

Signed-off-by: Yang Zhang <[email protected]> Signed-off-by: v01dstar <[email protected]>

v01dstar force-pushed the 8.10-tikv branch from ec8629a to 578e0db Compare September 5, 2024 18:29

ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 22, 2024
