This repo is based on the MIT Spring 2024 6.5840 course and labs.
- Lab1: Map Reduce (SS, 100% pass)
  - Passes all tests with `-race`
  - Use a `commit` call to make sure the server and the workers exit at the right moment (see the sketch below)
  - Workers are scalable
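
  A minimal sketch of the commit idea, assuming a hypothetical `Commit` RPC on the coordinator; the `CommitArgs`/`CommitReply` types and the coordinator fields below are illustrative, not part of the lab framework:

  ```go
  package mr

  import "sync"

  // Illustrative coordinator state for the commit handshake.
  type Coordinator struct {
      mu        sync.Mutex
      committed map[int]bool
      nTasks    int
  }

  type CommitArgs struct{ TaskID int }
  type CommitReply struct{ AllDone bool } // true once every task is committed

  // Commit records a finished task and tells the worker whether the whole job
  // is done, so both worker and coordinator know when it is safe to exit.
  func (c *Coordinator) Commit(args *CommitArgs, reply *CommitReply) error {
      c.mu.Lock()
      defer c.mu.Unlock()
      c.committed[args.TaskID] = true
      reply.AllDone = len(c.committed) == c.nTasks
      return nil
  }
  ```
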
- Lab2: Key/Value Server (M, 100% pass)
  - Tests are extremely strict (on both code and machine); check this article, it may help a lot if you are stuck on memory or time overhead
- Lab3: Raft (H, 100% pass)
  - Lab3A: Leader Election (M)
    - Locks are tricky, but remember that the main process does not wait for a goroutine, so a goroutine is free to take the lock even when the function that spawned it is itself holding the lock
    - If you respect the framework, some lab requirements differ from previous years, such as avoiding `time.Timer`/`time.Ticker`; all loop processing should happen inside `ticker` only (see the sketch below)
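
    A minimal sketch of the loop style this implies, assuming the skeleton's `rf.mu` and `rf.killed()`, plus illustrative `state`, `lastHeartbeat`, and `electionTimeout` fields and a `startElection` helper of my own naming (imports: `math/rand`, `time`):

    ```go
    // ticker is the only long-running loop: no time.Timer/time.Ticker, just a
    // short random sleep followed by a check under the lock.
    func (rf *Raft) ticker() {
        for !rf.killed() {
            rf.mu.Lock()
            if rf.state != Leader && time.Since(rf.lastHeartbeat) > rf.electionTimeout {
                // startElection only spawns RequestVote goroutines; they acquire
                // rf.mu themselves later, which is safe because nothing here
                // waits for them while holding the lock.
                rf.startElection()
            }
            rf.mu.Unlock()

            // Sleep a small random amount so peers don't wake in lockstep.
            ms := 50 + rand.Int63()%300
            time.Sleep(time.Duration(ms) * time.Millisecond)
        }
    }
    ```
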
  - Lab3B: log (HH)
    - Not conceptually hard, but very involved; it is better to understand the algorithm by running the tests rather than comparing different versions of other people's code, which can be confusing
    - Debugging and understanding the algorithm's corner cases cost me a lot of time
  - Lab3C: persistence (SS)
    - `TestFigure83C`/`TestFigure8Unreliable3C` cannot pass within the 30s timeout
  - Lab3D: log compaction (H)
    - Don't be confused by the concepts: a Snapshot is just log compaction, not persistence; we cannot recover state from Snapshots
    - Debugging is hard, and you'd better not modify other files' implementations such as `config.go`. Some hints:
      - If you encounter the `one([cmd]) failed to reach agreement` error, there are two possible reasons:
        - Your 3A implementation has bugs and no leader was successfully elected
        - The `applyCh` is not the one `config.go` set up; note that `applierSnap` checks `rafts[i]` before checking the `applyCh` outputs
      - `readPersist` must reload `lastApplied`/`commitIndex` from `lastIncludedIndex`, or a newly added server will fail to find the `lastApplied + 1` index (see the sketch after these hints)
      - We cannot compact the log proactively from our own goroutine; the older tests do not support such a snapshot mechanism, see the concept above
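
    A minimal sketch of that `readPersist` detail; the exact persisted layout below (term, votedFor, log, plus `lastIncludedIndex`/`lastIncludedTerm`) and the `LogEntry` type are my own assumptions, the point is the last two assignments:

    ```go
    func (rf *Raft) readPersist(data []byte) {
        if len(data) == 0 {
            return // no persisted state yet
        }
        r := bytes.NewBuffer(data)
        d := labgob.NewDecoder(r)
        var currentTerm, votedFor, lastIncludedIndex, lastIncludedTerm int
        var log []LogEntry
        if d.Decode(&currentTerm) != nil || d.Decode(&votedFor) != nil ||
            d.Decode(&log) != nil || d.Decode(&lastIncludedIndex) != nil ||
            d.Decode(&lastIncludedTerm) != nil {
            return // decode failure: keep zero values
        }
        rf.currentTerm, rf.votedFor, rf.log = currentTerm, votedFor, log
        rf.lastIncludedIndex, rf.lastIncludedTerm = lastIncludedIndex, lastIncludedTerm

        // Everything up to lastIncludedIndex already lives in the snapshot, so a
        // restarted (or newly added) server must not try to apply lastApplied+1
        // from a log that no longer contains it.
        rf.lastApplied = lastIncludedIndex
        rf.commitIndex = lastIncludedIndex
    }
    ```
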
- Lab4: Fault-tolerant Key/Value Service (S/M, 100% pass)
  - Lab4A: Key/value service without snapshots (M)
    - Client linearizable RPC specification: page 67
    - How to pass `TestSpeed4A`:
      - We accelerate leader discovery by adding a `GetState` RPC call; I do not know why the framework code has to try every server each time, which costs too much extra time (see the sketch below)
      - Without this, `TestSpeed4A` cannot pass: under the lab3 timing requirements, the required request rate theoretically cannot be met
    - Do NOT casually modify your lab3 Raft implementation. In a multi-process environment, any small design change can cause critical bugs that are hard to understand, so if you are confident in your lab3 tests, leave the implementation alone. If you really do find a lab3 bug by retesting Raft directly, fix it and make sure Raft still works perfectly after each modification
    - Using the extra `GetState` RPC call to recognize the leader faster does not add meaningful communication cost
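
    A minimal sketch of the idea: a tiny `GetState` RPC on the kvserver that just reports `kv.rf.GetState()`, plus a clerk-side probe that caches the leader (the handler name, `findLeader`, and the reply fields are illustrative):

    ```go
    // Server side: report whether this replica's Raft believes it is the leader.
    type GetStateArgs struct{}
    type GetStateReply struct{ IsLeader bool }

    func (kv *KVServer) GetState(args *GetStateArgs, reply *GetStateReply) {
        _, reply.IsLeader = kv.rf.GetState()
    }

    // Client side: instead of blindly retrying the real request against every
    // server, probe once for the leader and cache its index; only re-probe when
    // the cached leader stops answering.
    func (ck *Clerk) findLeader() int {
        for i := range ck.servers {
            var reply GetStateReply
            ok := ck.servers[i].Call("KVServer.GetState", &GetStateArgs{}, &reply)
            if ok && reply.IsLeader {
                return i
            }
        }
        return -1 // no leader found yet; wait briefly and try again
    }
    ```
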
  - Lab4B: Key/value service with snapshots (S)
    - We should save the kv storage as well as the client sequence numbers into snapshots, in addition to the Raft state (see the first sketch below)
    - We hit a corner case that was handled only roughly in Lab3: in `AppendEntries`, if the leader sends entries that the follower has already compacted into a snapshot (because its service called `Snapshot`), while the leader's `nextIndex` array has not received that news yet, then `args.PrevLogIndex < rf.getFirstLogIndex()` occurs. We return `false` with `FirstIndex` set to `-1` and let the leader retry later (see the second sketch below). This modification passes all tests, but may not be as clean as other approaches
    - `go test` does not cope well with many goroutines; iterations can easily get stuck
    - golang's `unsafe.Sizeof` is of little help here, since it only reports the shallow size of a value, not the memory it references
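
    A minimal sketch of what goes into the service-level snapshot; the `store` and `lastSequence` field names are illustrative, and the Raft state itself (term, votedFor, log) is persisted separately by Raft:

    ```go
    // makeSnapshot encodes everything the service needs to rebuild itself:
    // the key/value data plus the per-client duplicate-detection table.
    func (kv *KVServer) makeSnapshot() []byte {
        w := new(bytes.Buffer)
        e := labgob.NewEncoder(w)
        e.Encode(kv.store)        // map[string]string
        e.Encode(kv.lastSequence) // map[int64]int64: clientId -> last applied seq
        return w.Bytes()
    }

    // restoreSnapshot is the inverse, used at startup and whenever Raft delivers
    // a snapshot on applyCh.
    func (kv *KVServer) restoreSnapshot(data []byte) {
        if len(data) == 0 {
            return
        }
        r := bytes.NewBuffer(data)
        d := labgob.NewDecoder(r)
        var store map[string]string
        var lastSequence map[int64]int64
        if d.Decode(&store) == nil && d.Decode(&lastSequence) == nil {
            kv.store, kv.lastSequence = store, lastSequence
        }
    }
    ```
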
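
    And a minimal sketch of the `AppendEntries` corner case; the `FirstIndex` reply field is the one mentioned above, the surrounding handler shape is assumed:

    ```go
    func (rf *Raft) AppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
        rf.mu.Lock()
        defer rf.mu.Unlock()
        // ... term checks elided ...

        // The follower has already compacted past PrevLogIndex (its service took
        // a snapshot), but the leader's nextIndex has not caught up yet, so the
        // consistency check at PrevLogIndex is impossible. Reject with
        // FirstIndex = -1 and let the leader simply retry later.
        if args.PrevLogIndex < rf.getFirstLogIndex() {
            reply.Term = rf.currentTerm
            reply.Success = false
            reply.FirstIndex = -1
            return
        }

        // ... normal log matching and appending continue here ...
    }
    ```
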
- Lab5: Sharded Key/Value Service (Coding: M/H, Debug: HH, 100% pass)
  - Recommend this blog
  - Lab5A: The Controller and Static Sharding (S/M, 100% pass)
    - Shard rebalance algorithm with TC $O(2n)$ and SC $O(1)$ (unlike others' costly $O(n^2)$): first calculate `avg` and `remains`, the average and remainder of shards per gid; use gid 0 as temporary storage for the extra shards (those beyond `avg + 1` while `remains` is still left, beyond `avg` once `remains` is 0); then move all these pooled shards to the gids holding fewer than `avg` shards (see the sketch below)
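
    A minimal sketch of that rebalance pass; `Config` and `NShards` are the framework's `shardctrler` definitions, the two-pass structure follows the description above, and the exact tie-breaking order is my own choice:

    ```go
    package shardctrler

    import "sort"

    // rebalance reassigns cfg.Shards so every live group ends with either avg or
    // avg+1 shards. Pass 1 releases excess shards (and shards of removed groups)
    // into gid 0, the temporary pool; pass 2 hands pooled shards to groups still
    // below their quota. Roughly two passes over the shards.
    func rebalance(cfg *Config) {
        gids := make([]int, 0, len(cfg.Groups))
        for gid := range cfg.Groups {
            gids = append(gids, gid)
        }
        if len(gids) == 0 {
            cfg.Shards = [NShards]int{} // no groups left: everything back to gid 0
            return
        }

        // Current ownership; shards of unknown groups go straight to the pool.
        count := make(map[int]int)
        for s, gid := range cfg.Shards {
            if _, ok := cfg.Groups[gid]; !ok {
                cfg.Shards[s] = 0
                continue
            }
            count[gid]++
        }

        // Deterministic order (every replica must compute the same config), with
        // the most loaded groups first so they are the ones allowed to keep avg+1.
        sort.Slice(gids, func(i, j int) bool {
            if count[gids[i]] != count[gids[j]] {
                return count[gids[i]] > count[gids[j]]
            }
            return gids[i] < gids[j]
        })

        avg, remains := NShards/len(gids), NShards%len(gids)
        quota := make(map[int]int)
        for _, gid := range gids {
            quota[gid] = avg
            if remains > 0 {
                quota[gid]++
                remains--
            }
        }

        // Pass 1: release extras into the pool (gid 0).
        for s, gid := range cfg.Shards {
            if gid != 0 && count[gid] > quota[gid] {
                count[gid]--
                cfg.Shards[s] = 0
            }
        }
        // Pass 2: hand every pooled shard to a group still below its quota.
        for s, gid := range cfg.Shards {
            if gid != 0 {
                continue
            }
            for _, g := range gids {
                if count[g] < quota[g] {
                    cfg.Shards[s] = g
                    count[g]++
                    break
                }
            }
        }
    }
    ```
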
  - Lab5B (Coding: M/H, Debug: HHH, 100% pass)
    - Idea reference blog: https://www.inlighting.org/archives/mit-6.824-notes
    - Bug log:
      - Snapshot Test (3 days):
        - A typo: `time.Microsecond` where `time.Millisecond` was intended
        - Bug: stuck in the `rf.log` encoding process
          - Why: a newly created server could not sync the log and take correct snapshots, so `rf.log` kept growing
          - Solution:
            - Separate snapshotting from the main loop engine; do not let a channel block the snapshot process
            - Check the snapshot index: if it is 0, do not snapshot
      - Unreliable Test (0.5 days):
        - Repeated ctrler requests should return `ErrOK`, but a typo made them return `ErrWrongLeader` (the most serious bug)
      - All (1 day):
        - Livelock: help the new leader apply old logs
        - `handleOps` returns directly when the new `SequenceNum` $\leq$ the old one (see the sketch after this bug log)
        - Other optimizations
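
    A minimal sketch of that duplicate check; `handleOps` and `SequenceNum` are named above, while the `Op` fields and the `lastSequence` table are illustrative (per-shard bookkeeping omitted for brevity):

    ```go
    // Op is the command replicated through Raft (fields are illustrative).
    type Op struct {
        Type        string // "Get", "Put", "Append", ...
        Key, Value  string
        ClientId    int64
        SequenceNum int64
    }

    // handleOps applies a committed op exactly once: a write whose SequenceNum
    // is not newer than the last one applied for this client is a retried
    // request that has already been executed, so it must return immediately
    // instead of being applied a second time.
    func (kv *ShardKV) handleOps(op Op) {
        if op.Type != "Get" && op.SequenceNum <= kv.lastSequence[op.ClientId] {
            return // duplicate request, already applied
        }
        switch op.Type {
        case "Put":
            kv.store[op.Key] = op.Value
        case "Append":
            kv.store[op.Key] += op.Value
        }
        kv.lastSequence[op.ClientId] = op.SequenceNum
    }
    ```
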
    - Performance: we use conservative time intervals, so our performance is much lower than others'. We may improve this later
    - Challenges
      - `TestChallenge1Delete`: the easy one, send shard-delete requests after installing shards
      - `TestChallenge2Unaffected`: check shards before common operations; if their states are `Serving`, they can be used directly (see the sketch below)
      - `TestChallenge2Partial`
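
    A minimal sketch of the `Serving` check used for the unaffected-shards challenge; the `shardStates`/`currentConfig` fields and the `Serving` state are illustrative, while `key2shard`, `ErrWrongGroup`, and `OK` come from the lab framework:

    ```go
    // checkShard runs at the top of every client-facing handler: a shard in the
    // Serving state can be used right away, even while other shards of this
    // group are still migrating (which is what TestChallenge2Unaffected exercises).
    func (kv *ShardKV) checkShard(key string) Err {
        shard := key2shard(key)
        kv.mu.Lock()
        defer kv.mu.Unlock()
        if kv.currentConfig.Shards[shard] != kv.gid {
            return ErrWrongGroup // this group does not own the shard at all
        }
        if kv.shardStates[shard] != Serving {
            return ErrWrongGroup // owned but still pulling data; client will retry
        }
        return OK
    }
    ```
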
Evaluation Level (based on my own experience, regardless of any official assessment):
- SS: Too Simple
- S: Simple
- M: Medium
- H: Hard
- HH: Too Hard
- HHH: Extremely Hard
This repo clarifies some FAQs, but some answers need to be reconsidered: https://github.com/niebayes/MIT-6.5840. It uses modularized development, so its code is not very easy to use as a reference.
Good reference: https://github.com/lantin16/MIT6.824-2020-Lab/tree/main
Go: go version go1.21.3 linux/amd64
Project based on git://g.csail.mit.edu/6.5840-golabs-2024
Commit: e7aea3e613fdf9814580c97fd56c22c86a798c0e