discovery+graph: track job set dependencies in vb #9241
This omits calls to InitJobDependencies, SignalDependants, and WaitForDependants. A new method, FetchJobSlot, has been added to the ValidationBarrier; it only reserves a job slot and does not set up any dependency mappings. These changes have been made here because the router / builder code does not actually need job dependency management. Calls to the builder code (i.e. AddNode, AddEdge, UpdateEdge) are all blocking in the gossiper. Combined with the fact that child jobs are run after parent jobs in the gossiper, this means that the calls to the router will happen in the proper dependency order. The ValidationBarrier is therefore only useful here for job slot reservation, which prevents DoS.
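A minimal sketch of slot reservation without dependency tracking, to illustrate the idea in the commit message above. The type `slotBarrier`, its fields, and the error return are hypothetical and are not taken from the PR; the actual `FetchJobSlot` shown in the diff has no return value.

```go
package main

import (
	"errors"
	"fmt"
)

// errQuit is a hypothetical sentinel returned when the barrier shuts down.
var errQuit = errors.New("barrier shutting down")

// slotBarrier is an illustrative stand-in for the slot-reservation part of a
// validation barrier. It is NOT the PR's ValidationBarrier.
type slotBarrier struct {
	slots chan struct{} // buffered channel used as a counting semaphore
	quit  chan struct{} // closed on shutdown
}

func newSlotBarrier(numJobs int) *slotBarrier {
	return &slotBarrier{
		slots: make(chan struct{}, numJobs),
		quit:  make(chan struct{}),
	}
}

// fetchJobSlot blocks until a slot is free or the barrier is shut down.
// No dependency mappings are created.
func (b *slotBarrier) fetchJobSlot() error {
	select {
	case b.slots <- struct{}{}:
		return nil
	case <-b.quit:
		return errQuit
	}
}

// completeJob releases a previously reserved slot, if any.
func (b *slotBarrier) completeJob() {
	select {
	case <-b.slots:
	default:
	}
}

// inFlight reports how many slots are currently reserved.
func (b *slotBarrier) inFlight() int { return len(b.slots) }

func main() {
	b := newSlotBarrier(2)
	_ = b.fetchJobSlot()
	_ = b.fetchJobSlot()
	fmt.Println("in flight:", b.inFlight())
	b.completeJob()
	fmt.Println("in flight:", b.inFlight())
}
```

The buffered channel caps the number of concurrent jobs, which is the DoS protection the commit message refers to; ordering is left entirely to the caller.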
This commit does two things:
- removes the concept of allow / deny. Having this in place was a minor optimization and removing it makes the solution simpler.
- changes the job dependency tracking to track sets of abstract parent jobs rather than individual parent jobs.
As a note, the purpose of the ValidationBarrier is that it allows us to launch gossip validation jobs in goroutines while still ensuring that the validation order of these goroutines is adhered to when it comes to validating ChannelAnnouncement _before_ ChannelUpdate and _before_ NodeAnnouncement.
cc: @gijswijs for review
@@ -675,45 +675,20 @@ func (b *Builder) handleNetworkUpdate(vb *ValidationBarrier,
defer b.wg.Done()
defer vb.CompleteJob()
commit title nit: this is not the router, it's the graph.Builder
func (v *ValidationBarrier) FetchJobSlot() {
	// We'll wait for either a new slot to become open, or for the quit
Since the builder code is called by the gossiper code, which itself also uses a semaphore, why isn't that inheritance enough?
// Empty returns true if s is empty.
func (s Set[T]) Empty() bool {
	return len(s) == 0
think we need a separate PR for this
We also already have IsEmpty
func (v *ValidationBarrier) SignalDependants(job interface{}, allow bool) {
// SignalDependents signals to any child jobs that this parent job has
// finished.
func (v *ValidationBarrier) SignalDependents(job interface{}, id JobID) error {
haha sneaky change from British spelling to American 😝
err = v.removeParentJob(route.Vertex(msg.NodeID2), id)
if err != nil {
	return err
}

delete(v.chanEdgeDependencies, msg.ShortChannelID)
return nil
nit: just return v.removeParentJob(route.Vertex(msg.NodeID2), id)
"JobID=%v", spew.Sdump(nMsg.msg), jobID)
}
should we not handle the error similarly to how it is handled for WaitForParents above? (including returning after handling?)
// If there is no entry in the jobInfoMap, we don't have to wait on any
// parent jobs to finish.
info, ok := v.jobInfoMap[annID]
if ok {
style nit:
if !ok {
	return
}
// the rest here
	info.activeParentJobIDs.Add(annJobID)
} else {
style nit: return after info.activeParentJobIDs.Add(annJobID) and remove & unindent the else block
// should complete after another) for the (childJobID, annID) tuple. This must
// only be called from InitJobDependencies.
// NOTE: MUST be called with the mutex held.
func (v *ValidationBarrier) populateDependencies(childJobID JobID,
style: can we keep the argument order the same for updateOrCreateJobInfo and populateDependencies 🙏
signals, ok = v.chanEdgeDependencies[msg.ShortChannelID]
annID = msg.ShortChannelID

// TODO: If ok is false, we have serious issues.
throw the error? (if it is really impossible then panic)
Despite the many legitimate uses of panics, they have been rejected every time I have tried to use them (even for the provably impossible scenario). I believe that a critical log is the next best thing.
I'm finding the distinction between parent and child jobs here both confusing and unnecessary. What we have here is a dependency graph of undifferentiated jobIDs. Once all of the dependencies have run, we can run. Once we run, we want to signal all of our dependents. We should be able to accomplish this with a single removeJob that does this index cleanup and dependent signaling.
The main difficulty I'm noticing in this PR is that we have multiple IDs that we want to be able to map to JobIDs from disjoint domains. My recommendation here is to make the core algebra of this component undifferentiated and then have auxiliary mappings that help recover the relevant JobID from the other unique protocol identifiers.
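One way to read this suggestion, as a rough sketch only: a graph of undifferentiated job IDs with a single `removeJob` that both cleans up the indexes and reports which dependents became runnable. The names `depGraph`, `addDep`, and `removeJob` are hypothetical, the sketch is not thread-safe, and the auxiliary mappings from protocol identifiers to job IDs are left out.

```go
package main

import (
	"fmt"
	"sort"
)

type jobID uint64

// depGraph is a hypothetical, single-goroutine dependency graph of
// undifferentiated jobs. A real version would need locking and signaling.
type depGraph struct {
	deps       map[jobID]map[jobID]struct{} // job -> jobs it waits on
	dependents map[jobID]map[jobID]struct{} // job -> jobs waiting on it
}

func newDepGraph() *depGraph {
	return &depGraph{
		deps:       make(map[jobID]map[jobID]struct{}),
		dependents: make(map[jobID]map[jobID]struct{}),
	}
}

// addDep records that job must run after dep.
func (g *depGraph) addDep(job, dep jobID) {
	if g.deps[job] == nil {
		g.deps[job] = make(map[jobID]struct{})
	}
	if g.dependents[dep] == nil {
		g.dependents[dep] = make(map[jobID]struct{})
	}
	g.deps[job][dep] = struct{}{}
	g.dependents[dep][job] = struct{}{}
}

// removeJob drops a finished job from both indexes and returns, in ascending
// order, the dependents that no longer wait on anything.
func (g *depGraph) removeJob(job jobID) []jobID {
	var ready []jobID
	for d := range g.dependents[job] {
		delete(g.deps[d], job)
		if len(g.deps[d]) == 0 {
			delete(g.deps, d)
			ready = append(ready, d)
		}
	}
	delete(g.dependents, job)

	// Also drop any edges from this job to jobs it was waiting on.
	for dep := range g.deps[job] {
		delete(g.dependents[dep], job)
	}
	delete(g.deps, job)

	sort.Slice(ready, func(i, j int) bool { return ready[i] < ready[j] })
	return ready
}

func main() {
	g := newDepGraph()
	const chanAnn, chanUpd, nodeAnn jobID = 1, 2, 3
	g.addDep(chanUpd, chanAnn)
	g.addDep(nodeAnn, chanAnn)
	fmt.Println(g.removeJob(chanAnn)) // prints [2 3]
	fmt.Println(g.removeJob(chanUpd)) // prints []
}
```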
// length and entries therefore cannot hash to the same keys.
// NOTE: IF OTHER TYPES OF KEYS ARE STORED, CHECK THAT COLLISION WON'T
// OCCUR.
jobInfoMap map[any]*jobInfo
I think we ought to use an explicit closed union (via an interface) in the key here. any is a disaster waiting to happen.
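For illustration, a closed union in Go is commonly approximated with an interface that has an unexported marker method. The sketch below assumes two key kinds; `annKey`, `chanKey`, and `nodeKey` are hypothetical stand-ins, not the PR's types.

```go
package main

import "fmt"

// annKey is a hypothetical sealed interface. Code outside this package cannot
// define the unexported marker method, so the set of key types stays closed
// in practice.
type annKey interface {
	isAnnKey()
}

// chanKey stands in for a short channel ID.
type chanKey uint64

// nodeKey stands in for a 33-byte node public key.
type nodeKey [33]byte

func (chanKey) isAnnKey() {}
func (nodeKey) isAnnKey() {}

// jobInfo is a placeholder value type for the example.
type jobInfo struct{ active int }

// describe shows that a type switch over the union can be kept exhaustive.
func describe(k annKey) string {
	switch k.(type) {
	case chanKey:
		return "channel"
	case nodeKey:
		return "node"
	default:
		return "unknown"
	}
}

func main() {
	m := make(map[annKey]*jobInfo)
	m[chanKey(42)] = &jobInfo{active: 1}
	m[nodeKey{0x02}] = &jobInfo{active: 2}
	// m["oops"] = &jobInfo{} would not compile: string does not implement annKey.
	fmt.Println(len(m), describe(chanKey(42)), describe(nodeKey{0x02}))
}
```

Both key types are comparable, so they are valid dynamic types for an interface-typed map key, and an unrelated type can no longer be inserted by accident.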
// should complete after another) for the (childJobID, annID) tuple. This must
// only be called from InitJobDependencies.
why not just define this as a local function inside that scope to enforce that it is only referenceable there?
// Copy over the parent job IDs at this moment for this annID.
// This job must be processed AFTER these parent IDs.
parentJobs := info.activeParentJobIDs.Union(fn.NewSet[JobID]())
I think this reveals a need for a set copying method.
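A set copying method could look roughly like the sketch below. The `Set` type here is a self-contained stand-in defined for the example; the `Copy` method is hypothetical and is not claimed to exist in the project's `fn` package.

```go
package main

import "fmt"

// Set is a minimal generic set defined only for this example.
type Set[T comparable] map[T]struct{}

// NewSet builds a set from the given elements.
func NewSet[T comparable](elems ...T) Set[T] {
	s := make(Set[T], len(elems))
	for _, e := range elems {
		s[e] = struct{}{}
	}
	return s
}

// Add inserts e into the set.
func (s Set[T]) Add(e T) { s[e] = struct{}{} }

// Contains reports whether e is in the set.
func (s Set[T]) Contains(e T) bool {
	_, ok := s[e]
	return ok
}

// Copy returns an independent copy of the set (hypothetical method).
func (s Set[T]) Copy() Set[T] {
	c := make(Set[T], len(s))
	for e := range s {
		c[e] = struct{}{}
	}
	return c
}

func main() {
	active := NewSet(1, 2)
	snapshot := active.Copy()
	active.Add(3) // later parents must not leak into the snapshot
	fmt.Println(len(active), len(snapshot), snapshot.Contains(3))
}
```

Compared with a union against an empty set, a dedicated copy states the intent directly.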
annID = msg.ShortChannelID

// TODO: If ok is false, we have serious issues.
parentJobIDs, ok = v.jobDependencies[childJobID]
why does this read not need a mutex lock?
// notifies to annID's child jobs that it has finished validating. This must be
// called from SignalDependents.
These "must be called from X" notes suggest lack of proper decomposition. It inverts the dependency. Callers should depend on the function's promises and integrate the results according to the API contract.
// and cleans up its job dependency mappings. This MUST be called from
// SignalDependents.
// NOTE: MUST be called with the mutex held.
func (v *ValidationBarrier) removeChildJob(annID any, childJobID JobID) {
🫡
// We don't want to block when sending out the signal.
select {
case notifyChan <- struct{}{}:
default:
}
Are we ok with swallowing the signal instead? Seems like this could cause jobs to never be run, particularly if lastJob is true
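To make the concern concrete, here is a small sketch of how a non-blocking send behaves. Whether the PR's `notifyChan` is buffered is not visible in this excerpt, so both cases are shown; `trySignal` is a hypothetical helper.

```go
package main

import "fmt"

// trySignal performs a non-blocking send and reports whether the signal was
// actually handed to the channel.
func trySignal(ch chan struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	unbuffered := make(chan struct{})
	buffered := make(chan struct{}, 1)

	// No receiver is waiting, so the signal is dropped.
	fmt.Println(trySignal(unbuffered)) // prints false

	// The buffer holds one pending signal for a future receiver.
	fmt.Println(trySignal(buffered)) // prints true

	// The buffer is full; this send is dropped, but one signal is still pending.
	fmt.Println(trySignal(buffered)) // prints false
}
```

With an unbuffered channel, a waiter that has not yet reached its receive misses the signal entirely; a buffer of one at least guarantees that a wake-up is pending.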
case *lnwire.NodeAnnouncement:
	delete(v.nodeAnnDependencies, route.Vertex(msg.NodeID))
	// Remove child job info.
	v.removeChildJob(route.Vertex(msg.NodeID), id)
I don't think we need two distinct removal functions for parent and child.
This PR changes the ValidationBarrier to track abstract job dependencies. This just means that every time a child job comes in (i.e. channel update or node announcement), we track the set of possible parent jobs that are related to it (channel announcement(s)) that have registered via InitJobDependencies. The goroutines containing the child jobs will then wait to be notified every time one of their parent jobs completes. From the child job's POV, this just works as ref-counting except that you're only counting the parent jobs you're interested in.

With this, we can now extend the ValidationBarrier to track any sort of abstract job that requires both concurrency and waiting for another job to finish. It also makes it possible in a future PR to very easily make node announcements depend on channel announcements. See the commit messages for more details.

TODO:
ValidationBarrier and ensure that all child jobs finish after their related parent jobs.