
Faster Memory Backend Proposal

Chris Monson edited this page Mar 29, 2021 · 8 revisions

Faster In-Memory Proposal

UPDATE: This is implemented in pull request https://github.com/shiblon/entroq/pull/27

The EntroQ in-memory implementation was originally written (as of early 2021) as a quick-and-dirty approach to allow for testing and development. This has turned out very well thus far, but with the proposed introduction of a write-ahead-log (WAL Proposal), it may end up being simpler to deploy than any database-backed version, and thus is expected to become, over time, the default production implementation.

Challenges

The current implementation uses a single mutex to lock the entire set of queues before allowing any work (or reads!) to occur. This means that potentially computationally intensive operations like QueueStats can pause all dependent systems, much as a bad GC implementation would.

This proposal outlines a better locking plan for the in-memory implementation that can allow for long-running task listing and queue stats calculations without disrupting normal operation (or other long-running processes).

Current Implementation

Currently, the EntroQ in-memory implementation (March 2021) keeps track of two maps, treated (and locked) as a single data structure:

  • a map of queue name -> task heap, and
  • a map of task ID -> task.

These facilitate the major operations that require fast access, and act as indices.

  • Task Queue Index: The queue-keyed task heaps are partially ordered on AT (arrival time) so that tasks that are ready to be Claimed can be found and randomly (with bias toward older arrival times) returned.
  • Task ID Index: The id-keyed task map allows tasks to be found by ID, facilitating all modification and dump operations for a given task by ID. Modifications operate on IDs, and are more common even than claims, which operate on queues. Modification does not need to quickly find tasks of a certain arrival time, only tasks by ID.

When doing any operation on any queue or task, both structures are locked under the same mutex for the duration of the operation, avoiding race conditions entirely.

Computing statistics over any queue, however, can stop all work in its tracks: statistic-gathering also locks the entire structure.

Proposed Fine-Grained Locking Implementation

The real issue with the current implementation is that touching anything requires everything to be locked. We can do better than this with a more intelligent data structure.

Using a sync.Map, for example, allows one goroutine to iterate over the entire map without taking a global lock while another writes to and otherwise manipulates that map. This is perfect for situations where readers do not need immediate consistency, only eventual consistency. That is the case for the Tasks and QueueStats operations: these are advisory operations only, and do not have any impact on the correctness of the system itself. They should be eventually consistent, and certainly within a given sphere of interest (e.g., a task is inserted into a queue, and that task should be noticed in the task listing immediately, if nothing else is around to delete it).

The strategy is basically two-fold:

  • For read-only advisory operations like Tasks and QueueStats, simply iterate over the appropriate sync.Map for the queues in question, taking a lock only very briefly to find out whether those queues exist and to get access to the appropriate map.
  • For critical mutating operations like Claim and Modify, briefly take a global lock to enumerate all queue-related structures for the operation, release it, then take all queue-specific locks as needed and rely on those to block others from mucking around with your relevant tasks.

Thus, the granularity goes from "system-level" to "queue-level" for almost all operations in this regime, and even a held queue lock will not stop a potentially long-running advisory process from proceeding (except briefly when the same tasks are touched simultaneously). In other words, not only are we reasoning about system locks at the queue level, we also get a bonus task-level lock because sync.Map holds the actual task data.

The new data structure looks like this:

  • Queue Tasks: map[string]*sync.Map <- maps queue names onto a fine-grained, per-task structure holding the actual task data.
  • ID->Queue: map[uuid.UUID]string <- maps all system task IDs into the queues in which they reside. Usually not needed, used for search.
  • Queue Heaps: map[string]heap <- maps queue names onto a structure suitable for quickly accessing claimable task information.
  • Queue Locks: map[string]lock <- maps queue names onto locks that can be used for each queue.

With these in place, the system can be built in such a way as to ensure that any global locks are held only for the briefest of moments, and most of the time only queue-level locks are held. Only mutating operations will block at the queue level (important because they may affect multiple tasks across multiple queues), and reading operations will block at the task level when mutating operations are busy manipulating a single task.

The task data, the thing we think of as the "real task", is located in the sync.Map under each queue name. Everything else is an index of some kind: a mapping from ID to queue name, a heap of IDs and At times, and a set of locks that know the name of their own queue.

Tasks and QueueStats

The strategy for non-critical advisory read operations is as follows:

  • Briefly take the system lock.
  • While holding it, find the sync.Maps needed to reach the relevant tasks.
  • Release it, then range over those maps without holding any explicit locks at all.

Claim

The strategy for Claim is to do the following:

  • Briefly hold the system lock
  • Collect references to all relevant (matching) queue locks for the query.
  • Release system lock
  • Shuffle queue locks
  • In turn, lock each queue and look for claimable tasks until one is found.

Only one queue lock is held at a time, and there is a moment between finding those locks and using those locks where no locks are held at all. This is safe because all mutating operations on tasks are done with the lock held that belongs to that task's queue. Other mutating operations must also hold the relevant queue lock to touch or even read one of these tasks.

Furthermore, there are indices at the system level that might need to be adjusted while queue locks are held. To avoid the Dining Philosophers problem, we release the global lock, then take a queue lock, and can then take the global lock again, overlapping with the queue lock. If all modifications do things in this order, then the problem is avoided: the global lock can only be briefly held if relevant queue locks are also already held.

Modify

The strategy for the much more complex Modify operation is as follows, and mirrors parts of the Claim operation above:

  • Briefly hold system lock
  • Collect references to all relevant queue locks.
  • Also obtain all queue-level structures, where appropriate, especially that containing all complete task data (the sync.Map).
  • Release the system lock
  • Obtain all queue locks in name-sorted order (avoid dining philosophers). All locks are now held simultaneously.
  • Determine whether the modification can proceed in the usual way (get tasks, do computations on them)
  • Perform the modification if no dependency error is triggered, briefly holding the system lock to update indices as needed. See above for why this is safe to do once queue locks are held.
  • Release all queue locks.

You may notice that this allows claims and modifications to proceed completely in parallel with minimal contention, provided that there is no queue overlap between the processes. Furthermore, this process can be occurring while a parallel read (QueueStats or Tasks) is ongoing, without disrupting those and without generating a critical data race: read operations hold no locks at all, only mutating operations hold locks for the duration of their work.

This approach is somewhat complex, but it does work, and it provides a layer of protection against denial-of-service from long-running read requests over full queues. Since queue statistics are important for proper system monitoring, it is critical that these not be a bottleneck for system functionality, and this regime allows them to be computed even if they must touch every task in the system for each monitoring output.

(Deprecated, Rejected) Proposed 2-Layer Implementation

This earlier proposal maintained two parallel task structures:

  • Read-Only: a read-only Task Structure, and
  • Scratch: a read-write Task Structure that also explicitly tracks task deletion.

The proposed implementation also uses two sync.RWMutex instances, one each for Read-Only and Scratch. The basic operations and how they work are listed below, but first we outline why this double structure was considered in the first place.

Scratch

The Scratch system functions identically to the current EntroQ implementation, with the addition of a "deleted tasks" set: every time a task ID is deleted, that ID is added to the deletion set as a tombstone. We will see why below.

Read-Only

The Read-Only system is completely static and initially empty. It is never modified by the standard Claim and Modify methods. It is simply there.

There is one exception to this rule, however, and that is when the system merges Scratch into Read-Only. In that case, Read-Only is modified, and Scratch becomes briefly empty. This process is referred to in this document as "Scratch Depletion" and it can happen periodically or as a result of a certain number of user actions, etc. How and when it is triggered is less important than how it works, which we go into next.

Depletion

Every so often, ideally with relatively high frequency to keep Scratch small, the Scratch space needs to be merged with the Read-Only space and then reset. This process, called "Depletion", looks at any tasks in the normal scratch structures and overwrites tasks with the same ID in Read-Only with those values, unless those IDs are also in the Deletion set.

It then ensures that any IDs in the Deletion set are deleted from Read-Only, and resets Scratch to be completely empty. Read-Only now contains the current state of the EntroQ system in its entirety.

Basic Operations

Here we outline how each basic operation functions, with locking strategies and relative benefits. Of particular import are Claim, Modify, Tasks, and QueueStats. Additional emphasis is given to Deplete, which is not a core user-facing operation, but which must be considered as lock strategies are laid out. In all cases, Go's sync.RWMutex terminology is used, where RLock means "read lock, allowing multiple readers" and Lock means "write lock, there can be only one."

We also expect and adhere to Go's sync.RWMutex idiosyncrasies, in particular that if one goroutine holds RLock and another calls Lock (which will block), every subsequent call to RLock will also block, even though strictly speaking multiple read locks could safely coexist: the presence of a blocked Lock causes every subsequent acquisition attempt of any kind to block.

While this may seem a bit strict, it works fine in our situation, and makes sense from an anti-starvation point of view.

Deplete

  • Lock Read-Only, Lock Scratch
  • Do depletion as described above
  • Unlock Scratch, Unlock Read-Only

Read-Only is locked first to avoid blocking operations that always need to operate on scratch, which are user-critical operations such as Claim and Modify. Since nothing will be done to Read-Only unless both locks are acquired, any operation that requires access to either Read-Only or Scratch is safe to operate so long as it first acquires a lock on Scratch. If we were to block first on Read-Only, then long-running read operations that need to block depletion would also block the rest of the system from functioning.

Claim

  • Lock Scratch
  • Randomly choose an available task from Scratch and Read-Only, do standard modifications.
  • Unlock Scratch

Note that, because the depletion algorithm must lock both Read-Only and Scratch to do its work, accessing Read-Only without a lock is safe, here. If we were to add an RLock call, it could end up blocking on a long-running stats gathering exercise. See below for more information.

Modify

  • Lock Scratch
  • Determine dependency presence by looking at both Read-Only and Scratch.
  • Modify in Scratch only
  • Unlock Scratch

Note that, because the depletion algorithm must lock both Read-Only and Scratch to do its work, accessing Read-Only without a lock is safe, here. If we were to add an RLock call, it could end up blocking on a long-running stats gathering exercise. See below for more information.

Tasks and QueueStats

  • RLock Read-Only, RLock Scratch
  • Get snapshot of scratch information (tasks available, stats, deletions)
  • Unlock Scratch
  • Merge Scratch information with information computed from Read-Only
  • Unlock Read-Only

Of note regarding the locking strategy here: critical operations like Claim and Modify, which occupy the core logic of any worker, are blocked for as little time as possible. These potentially long-running read-only operations only lock Scratch for a short time, then unlock it so that the system may continue progressing. After that, the Read-Only space is RLocked, leaving the way open for other readers to do the same in parallel.

Since depletion must also lock both spaces, this effectively blocks depletion from doing its work. Here we rely on the special non-recursive non-starvation behavior of Go's RWLock implementation: if a long-running task listing operation holds a read lock on Read-Only, and depletion attempts to run, depletion will block on Lock. This means that any other long-running listing operation will also block, and depletion will run as soon as this one is complete.

Read-only operations are of less importance in the system than critical mutating operations, so occasional slow task listings or queue stats should be a reasonable thing to expect. This idea may need to be revisited, however, if users find that task listings are very commonly needed. It is unclear at this time whether an elegant solution exists that is both correct and maintainable. Some waiting may simply be required (as it is, the current proposal is vastly superior to the existing system for in-memory use).

Snapshot

While WAL snapshots are planned to happen in a completely separate, read-only process, this implementation provides some interesting possibilities for cooperative snapshots. It is trivial, for example, to provide a Snapshot operation that walks every task in the system in much the same way that QueueStats does (which must compute information based on the current time, and thus must touch all tasks, at least for now). Each task walked can be written to a file, and this can be done safely, without long disturbance of system functionality: the operation only needs to hold the Scratch lock long enough to commit the smaller space's metadata to memory, then can unlock it and proceed with the Read-Only walk.

Note on Strict Ordering vs. Partial Ordering

Note that, with a true strict ordering of tasks by time, we can also improve claim notifications. The current system only notifies would-be claimants when a task is inserted, not when a task becomes available due to the passage of time. Having a strict ordering on time would allow for efficient discovery of the proper "now location" in a list of tasks for a queue, and that could be updated on a relatively short interval, like every second, without significant risk to underlying processes or significant CPU burden. With such a marker, time-based statistics become a simple calculation involving marker location and queue length. Also, as the marker moves forward past tasks, those tasks become available and claimants can be notified immediately.

Facilitating a strict ordering is the subject of another proposal, and will likely involve an optimized B-Tree implementation in memory.

Conclusions

This proposal is likely not worth the time. The two-layer approach is complex and doesn't seem to buy much of importance. In the process of writing the proposal, it became evident that the primary weakness of the in-memory implementation, a weakness not shared by the Postgres implementation, is the lack of a total ordering over arrival times. With a total ordering, pagination can be used for long task listings, and statistics on time can be computed with a couple of additions.

Creating a totally ordered in-memory structure that is amenable to constant movement is not trivial, but it is well understood. A B-Tree implementation with a relatively large branching factor can be made to work quite nicely for this case, provided that it allows neighbors to have equal timestamps.

The BTree package from Google's GitHub looks like it may be perfectly adequate, provided that Item.Less is implemented to produce a total ordering. This can be accomplished by a secondary comparison on ID strings when timestamps are equal, allowing more than one task with the same timestamp to be stored.