This repository has been archived by the owner on Nov 25, 2024. It is now read-only.

Use IsBlacklistedOrBackingOff to determine if we should try to fetch devices #3254

Merged
merged 11 commits into main from s7evink/device-list-updater-again on Nov 9, 2023

Conversation

@S7evinK (Contributor) commented Nov 1, 2023

Use IsBlacklistedOrBackingOff from the federation API to check if we should fetch devices.

To reduce back pressure, we now only queue retrying servers if there's space in the channel.
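
A minimal sketch of the two ideas above (illustrative names only; the stand-in interface simplifies the real federation API's signature): ask whether the remote server is blacklisted or backing off before fetching devices, and only queue retry work when the worker channel has spare capacity, so a full channel never blocks the caller.

```go
package main

import (
	"context"
	"fmt"
)

// federationAPI is a hypothetical stand-in for the slice of the federation API
// used here; only the yes/no answer matters for this sketch.
type federationAPI interface {
	IsBlacklistedOrBackingOff(ctx context.Context, server string) (bool, error)
}

// maybeQueueDeviceFetch skips servers we should not contact right now and never
// blocks on a full worker channel.
func maybeQueueDeviceFetch(ctx context.Context, api federationAPI, server string, ch chan string) {
	if blocked, err := api.IsBlacklistedOrBackingOff(ctx, server); err != nil || blocked {
		return // blacklisted or backing off: don't try to fetch devices
	}
	select {
	case ch <- server: // space in the channel: queue the retry
	default: // channel full: drop it, the next retry pass picks it up again
	}
}

type fakeAPI struct{}

func (fakeAPI) IsBlacklistedOrBackingOff(context.Context, string) (bool, error) { return false, nil }

func main() {
	ch := make(chan string, 1)
	maybeQueueDeviceFetch(context.Background(), fakeAPI{}, "example.org", ch)
	fmt.Println(len(ch)) // 1: the retry was queued
}
```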

@S7evinK S7evinK requested a review from a team as a code owner November 1, 2023 09:39

codecov bot commented Nov 1, 2023

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (ee73a90) 65.51% compared to head (fa3e0b5) 65.51%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3254      +/-   ##
==========================================
- Coverage   65.51%   65.51%   -0.01%     
==========================================
  Files         507      507              
  Lines       57217    57245      +28     
==========================================
+ Hits        37484    37502      +18     
- Misses      15892    15899       +7     
- Partials     3841     3844       +3     
Flag Coverage Δ
unittests 49.50% <88.67%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
cmd/dendrite-demo-pinecone/monolith/monolith.go 77.55% <100.00%> (ø)
cmd/dendrite/main.go 62.23% <ø> (-0.27%) ⬇️
federationapi/federationapi.go 77.14% <100.00%> (ø)
federationapi/internal/api.go 80.50% <100.00%> (ø)
userapi/userapi.go 72.44% <100.00%> (ø)
userapi/internal/device_list_update.go 74.52% <87.50%> (+1.11%) ⬆️

... and 4 files with indirect coverage changes


@S7evinK S7evinK changed the title Reduce DeviceListUpdater timeout to 1s Use IsBlacklistedOrBackingOff to determine if we should try to fetch devices Nov 1, 2023
@kegsay (Member) left a comment


Mostly good, a few things.

federationapi/routing/profile_test.go (resolved)
var federationClientError *fedsenderapi.FederationClientError
if errors.As(err, &federationClientError) {
	if federationClientError.Blacklisted {
		return
@kegsay (Member)

Needs unit tests.

hash := fnv.New32a()
_, _ = hash.Write([]byte(remoteServer))
index := int(int64(hash.Sum32()) % int64(len(u.workerChans)))

ch := u.assignChannel(userID)
deviceListUpdaterBackpressure.With(prometheus.Labels{"worker_id": strconv.Itoa(index)}).Inc()
defer deviceListUpdaterBackpressure.With(prometheus.Labels{"worker_id": strconv.Itoa(index)}).Dec()
@kegsay (Member)

Why remove? Using a defer is much cleaner than what we have now.

@S7evinK (Contributor, Author)

Just because we sent the work to the worker doesn't mean the back pressure is gone; the worker might still be processing the server, so decrementing doesn't make sense here.
We increment right before trying to send the work, and the worker decrements once it is done processing the server (either when skipping it or when actually done).
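
A minimal sketch of the gauge lifecycle being described (illustrative names, using the same prometheus client the diff references): the producer increments the per-worker gauge just before handing over the work, and only the worker decrements it once it is done with the server, so the gauge reflects outstanding work rather than queue depth alone.

```go
package main

import (
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical gauge standing in for deviceListUpdaterBackpressure.
var backpressure = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{Name: "device_list_updater_backpressure", Help: "sketch only"},
	[]string{"worker_id"},
)

// enqueue increments before the send: from this point the work is outstanding,
// whether it is still queued or already being processed.
func enqueue(ch chan string, server string, workerID int) {
	backpressure.With(prometheus.Labels{"worker_id": strconv.Itoa(workerID)}).Inc()
	ch <- server
}

// worker decrements only once it has finished with the server (processed or
// skipped), not when it takes the item off the channel.
func worker(ch chan string, workerID int, process func(string)) {
	for server := range ch {
		process(server)
		backpressure.With(prometheus.Labels{"worker_id": strconv.Itoa(workerID)}).Dec()
	}
}

func main() {
	ch := make(chan string, 1)
	go worker(ch, 0, func(string) {})
	enqueue(ch, "example.org", 0)
}
```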

userapi/internal/device_list_update.go (resolved)
// The channel is at capacity, don't try to send more work
if len(ch) == cap(ch) {
	continue
}
serversToRetry = serversToRetry[:0] // reuse memory
@kegsay (Member)

It's super unclear that this is nuking the serversToRetry slice, and why. Can we add more comments here please?
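
For readers of this thread, a minimal self-contained sketch (not the Dendrite code) of what the two lines above do: the capacity check skips a worker whose channel is already full, and re-slicing to zero length keeps the backing array so later appends reuse that memory.

```go
package main

import "fmt"

func main() {
	ch := make(chan string, 2)
	ch <- "a"
	ch <- "b"

	// The channel is at capacity: a send would block (or be dropped by a
	// non-blocking send), so don't try to queue more work this pass.
	if len(ch) == cap(ch) {
		fmt.Println("worker channel full, skipping")
	}

	serversToRetry := []string{"one.example", "two.example"}
	// Truncate to length 0 but keep the backing array (capacity stays 2),
	// so the next pass can append retry candidates without reallocating.
	serversToRetry = serversToRetry[:0]
	fmt.Println(len(serversToRetry), cap(serversToRetry)) // 0 2
}
```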

@@ -431,6 +460,7 @@ func (u *DeviceListUpdater) worker(ch chan spec.ServerName, workerID int) {
_, exists := retries[serverName]
retriesMu.Unlock()
if exists {
	deviceListUpdaterBackpressure.With(prometheus.Labels{"worker_id": strconv.Itoa(workerID)}).Dec()
@kegsay (Member)

I am unconvinced that this is going to track counts correctly. It is incremented for every call to notifyWorkers(userID) but only decremented under very specific scenarios (when not full, etc.).

@S7evinK (Contributor, Author)

See above.

case "localhost":
delete(expectedServers, serverName)
aliceCh <- true // unblock notifyWorkers
case "notlocalhost": // this should not happen as it is "filtered" away by the blacklist
@kegsay (Member)

Suggested change:
- case "notlocalhost": // this should not happen as it is "filtered" away by the blacklist
+ case unreachableServer: // this should not happen as it is "filtered" away by the blacklist

	userIDToChan:  make(map[string]chan bool),
	userIDToMutex: make(map[string]*sync.Mutex),
}
workerCh := make(chan spec.ServerName)
@kegsay (Member)

defer close(workerCh)?
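
A minimal sketch of what the suggestion amounts to (illustrative, not the actual test): deferring close(workerCh) lets a worker goroutine that ranges over the channel exit when the surrounding function returns instead of leaking.

```go
package main

func main() {
	workerCh := make(chan string, 1)
	// Closed on return, so the worker's range loop below terminates.
	defer close(workerCh)

	go func() {
		for server := range workerCh { // ends once workerCh is closed
			_ = server // process the server
		}
	}()

	workerCh <- "example.org"
}
```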

@S7evinK S7evinK merged commit 7863a40 into main Nov 9, 2023
15 of 18 checks passed
@S7evinK S7evinK deleted the s7evink/device-list-updater-again branch November 9, 2023 07:43