Caching design #1102

neilalexander opened this issue Jun 5, 2020 · 1 comment

@neilalexander
Contributor

We need to think about caching properly, off the back of the following things:

  • RoomVersionCache
  • ServerKeyCache
  • EDUCache
  • transactions.Cache

And probably many others.

@neilalexander
Contributor Author

neilalexander commented Jun 6, 2020

Writing this all down while it's still somewhat fresh in my mind.

Problem

At the moment, we have two caches with a first-class implementation: RoomVersionCache and ServerKeyCache. They are currently implemented using InMemoryLRUCache, which has the following properties (a rough sketch follows the list):

  • Caches are registered and reported in Prometheus
  • Immutable caches are really immutable: we'll panic() if we attempt to write a different value for an existing key
  • Mutable caches allow changes, but presently there is no downstream invalidation
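
For concreteness, here's a minimal sketch of what such a wrapper might look like. The package layout, names and Prometheus metric details are illustrative assumptions, not the exact code from #1101:

```go
// Rough illustration only, not the exact Dendrite implementation: a cache
// wrapper that registers an entry-count gauge with Prometheus and, when
// marked immutable, panics if a key is rewritten with a different value.
package caching

import (
	lru "github.com/hashicorp/golang-lru"
	"github.com/prometheus/client_golang/prometheus"
)

type InMemoryLRUCache struct {
	lru     *lru.Cache
	mutable bool
}

func New(name string, maxEntries int, mutable bool) (*InMemoryLRUCache, error) {
	inner, err := lru.New(maxEntries)
	if err != nil {
		return nil, err
	}
	// Register a gauge so that every cache shows up in Prometheus under its
	// own metric name.
	prometheus.MustRegister(prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Namespace: "dendrite",
		Subsystem: "caching",
		Name:      name + "_entries",
		Help:      "Number of entries in the " + name + " cache",
	}, func() float64 {
		return float64(inner.Len())
	}))
	return &InMemoryLRUCache{lru: inner, mutable: mutable}, nil
}

func (c *InMemoryLRUCache) Set(key string, value interface{}) {
	if !c.mutable {
		// Immutable caches are really immutable: rewriting an existing key
		// with a different value is treated as a programming error. (This
		// naive comparison assumes comparable value types.)
		if existing, ok := c.lru.Get(key); ok && existing != value {
			panic("attempt to mutate an immutable cache entry: " + key)
		}
	}
	c.lru.Add(key, value)
}

func (c *InMemoryLRUCache) Get(key string) (interface{}, bool) {
	return c.lru.Get(key)
}
```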

If we are going to embrace caching as a real architectural design decision, rather than something that we shoe-horn in ad-hoc, then we need to give some thought as to what this will actually look like and the rules we should set for ourselves.

Mutability

To start with, both RoomVersionCache and ServerKeyCache were immutable. Immutable caches are definitely much easier to reason about, since we don't have to worry about invalidating them in other places: their values will never change. There are all sorts of things that never change: room versions, event type keys, event state keys and so on.

Following #1094, where we now properly re-fetch keys once their validity period passes, we have had to make ServerKeyCache mutable, as otherwise we would panic() when trying to update the cache with the new validity. It's ultimately this that has opened up this discussion further.

Discoverability

In addition to this, we have other caches dotted around which aren't implemented in a first-class caching structure, like transactions.Cache (which is actually used for idempotency/spec compliance rather than optimisation!), EDUCache and possibly others. They aren't reported in Prometheus, they don't guarantee immutability where it is necessary, and it isn't obvious to a new developer what their lifetimes might be.

Generally, I believe in first-class structures like the one in #1101. They're easy to understand and to reason about, the logic and tuning for them lives in one place and it's really easy to see where they are used (both as readers and as writers).

Go gives us good machinery here as well, in the form of interface{}, so that we can expose sensible functions that return the concrete types that we're interested in (e.g. gomatrixserverlib.RoomVersion).
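
As a hedged illustration of that pattern, the sketch below wraps a generic Get/Set cache with accessors typed to gomatrixserverlib.RoomVersion; the Cache interface, method names and key prefix here are assumptions for illustration only:

```go
// Illustrative only: the underlying cache stores interface{} values, but
// callers only ever see the concrete type that they care about.
package caching

import "github.com/matrix-org/gomatrixserverlib"

// Cache is the generic shape that any backing store (in-memory LRU, Redis,
// ...) would satisfy in this sketch.
type Cache interface {
	Get(key string) (value interface{}, ok bool)
	Set(key string, value interface{})
}

// RoomVersionCache wraps a generic Cache with room-version-typed accessors,
// so type mistakes are caught at compile time rather than at the call site.
type RoomVersionCache struct {
	cache Cache
}

func (c *RoomVersionCache) GetRoomVersion(roomID string) (gomatrixserverlib.RoomVersion, bool) {
	v, ok := c.cache.Get("room_version_" + roomID)
	if !ok {
		return "", false
	}
	version, ok := v.(gomatrixserverlib.RoomVersion)
	return version, ok
}

func (c *RoomVersionCache) StoreRoomVersion(roomID string, version gomatrixserverlib.RoomVersion) {
	c.cache.Set("room_version_"+roomID, version)
}
```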

Monolith

In monolith mode, a lot of this is quite simple: for each type of cache, we can just use something like InMemoryLRUCache, have one instance of each type and then give a reference to each component. That way, every component uses the same set of caches, so there is no duplication and no downstream invalidation; it's all very simple.

There is an additional benefit here, in that one component caching something also benefits all of the other components, as they will be able to hit the cache and get a value rather than having to make an API or database call themselves.
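
A hypothetical sketch of that wiring, building on the InMemoryLRUCache sketch earlier (the cache names and sizes are placeholders): each cache is constructed exactly once and the same pointers are handed to every component.

```go
// Hypothetical monolith wiring: construct each cache exactly once and pass
// the same *Caches value to every component.
type Caches struct {
	RoomVersions *InMemoryLRUCache // immutable
	ServerKeys   *InMemoryLRUCache // mutable since #1094
}

func newCaches() (*Caches, error) {
	roomVersions, err := New("room_versions", 128, false)
	if err != nil {
		return nil, err
	}
	serverKeys, err := New("server_keys", 128, true)
	if err != nil {
		return nil, err
	}
	return &Caches{RoomVersions: roomVersions, ServerKeys: serverKeys}, nil
}

// Because each component receives the same *Caches value, a server key cached
// by the federation sender is immediately visible to the roomserver and vice
// versa, with no duplication and no cross-process invalidation.
```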

Polylith

In polylith mode, things get a lot more complicated. We currently build new InMemoryLRUCaches for each component. The caches will be populated only as and when that specific component writes to the cache. There's no sharing of caches across components in this model as they are in-process, and a polylith deployment runs different components in different processes.

For anything immutable, this is fine. We don't need to worry about invalidating anything as the values will never change. If we hit upper-bounds on, e.g. item count or memory, we can just evict the cache entries.

For anything mutable, this presents a problem. If the federation sender goes and gets an updated server key and updates its own cache, then there is no signalling or invalidation to other components to also update their caches, therefore the other components will operate on out-of-date information until the cache entries are evicted through other means.

Possibilities

The following are things that we could pursue as longer-term goals (read: not necessarily right now!).

Tiered caching

The idea here is that we would implement long caches and short caches as a two-tiered system (a rough sketch follows the list):

  • Short caches are always in-process, existing just long enough to assist a task (like handling room state from a send_join where we hit the same server keys a lot) and then deliberately invalidated afterwards (letting the GC clean up)
  • Long caches can be in or out of process, populated with information that has no defined longevity and will just stay there until either it expires, we hit upper bounds and evict (LRU) or we run out of memory
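
A rough sketch of how the two tiers might compose, assuming both tiers expose the generic Get/Set shape from the earlier sketch; the TieredCache type and its promotion behaviour are assumptions, not a settled design:

```go
// TieredCache checks the short, task-scoped tier before falling back to the
// long-lived tier.
type TieredCache struct {
	short Cache // in-process, scoped to a single task, dropped when the task ends
	long  Cache // in- or out-of-process, expiring or LRU-evicted
}

func (t *TieredCache) Get(key string) (interface{}, bool) {
	// Hot path: the short cache holds values used repeatedly within one task,
	// e.g. the same server keys hit over and over while checking send_join state.
	if v, ok := t.short.Get(key); ok {
		return v, true
	}
	// Fall back to the long cache and promote the value into the short one,
	// so subsequent hits during the same task stay in-process.
	if v, ok := t.long.Get(key); ok {
		t.short.Set(key, v)
		return v, true
	}
	return nil, false
}

func (t *TieredCache) Set(key string, value interface{}) {
	// Writes go to both tiers: the short tier is thrown away when the task
	// finishes, the long tier keeps the value until expiry or eviction.
	t.short.Set(key, value)
	t.long.Set(key, value)
}
```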

Questions:

  • Should short caches also be wrapped somehow?
  • Will we really be able to enforce Prometheus metrics if we don't?
  • Still need to reason about mutability vs immutability

Single perspective (Redis?)

This doesn't matter in monolith mode, because we already have single caches in-process. We can continue to do exactly that.

In polylith mode, if we don't want to deal with invalidation notifications/streams (like Synapse does presently), then we ideally need to maintain the illusion of having only one set of central caches.

We could do this in a polylith deployment by offloading to Redis, which could either run on the same machine or on a different machine, and will still ultimately be many times faster than pulling something from Postgres.

This should largely avoid the invalidation problem, in that all components are pulling from the same place - we don't need to stream invalidations between components.
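
As a hedged sketch of what this could look like using the go-redis client (the RedisCache type, the TTL-based expiry and the string-only values are assumptions rather than an agreed design):

```go
package caching

import (
	"context"
	"time"

	"github.com/go-redis/redis/v8"
)

// RedisCache gives every polylith component the same central view of the
// cache, so there is nothing to invalidate between processes.
type RedisCache struct {
	client *redis.Client
	ttl    time.Duration
}

func NewRedisCache(addr string, ttl time.Duration) *RedisCache {
	return &RedisCache{
		client: redis.NewClient(&redis.Options{Addr: addr}),
		ttl:    ttl,
	}
}

// Values cross a process boundary, so they are stored as plain strings here;
// richer types would need serialising, which is exactly the cost questioned
// in the list below.
func (c *RedisCache) Get(ctx context.Context, key string) (string, bool) {
	val, err := c.client.Get(ctx, key).Result()
	if err != nil { // redis.Nil is returned when the key does not exist
		return "", false
	}
	return val, true
}

func (c *RedisCache) Set(ctx context.Context, key, value string) error {
	// Rely on expiry rather than streaming invalidations between components.
	return c.client.Set(ctx, key, value, c.ttl).Err()
}
```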

Questions:

  • Is an out-of-process cache as an external dependency an issue? (We already depend on Kafka externally)
  • Would we need to Prometheus-monitor Redis from within Dendrite? (Possibly not?)
  • We'll probably have to serialise/deserialise things in and out of their native types to use something like Redis - is it worth measuring the performance hit of doing that? (It doesn't make sense to do this if those operations end up being overly costly - a sketch of such a measurement follows this list.)
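
As a starting point for that measurement, here's an illustrative micro-benchmark (the serverKeyEntry struct is hypothetical) comparing a JSON round trip against a plain in-process map read, runnable with go test -bench=.:

```go
package caching

import (
	"encoding/json"
	"testing"
)

// serverKeyEntry is a hypothetical stand-in for the kind of value we would
// need to serialise into Redis.
type serverKeyEntry struct {
	ServerName   string `json:"server_name"`
	KeyID        string `json:"key_id"`
	VerifyKey    []byte `json:"verify_key"`
	ValidUntilTS int64  `json:"valid_until_ts"`
}

// BenchmarkJSONRoundTrip measures the cost of encoding and decoding a cache
// entry, as required for an out-of-process cache.
func BenchmarkJSONRoundTrip(b *testing.B) {
	entry := serverKeyEntry{
		ServerName:   "example.com",
		KeyID:        "ed25519:auto",
		VerifyKey:    make([]byte, 32),
		ValidUntilTS: 1591372800000,
	}
	for i := 0; i < b.N; i++ {
		encoded, err := json.Marshal(&entry)
		if err != nil {
			b.Fatal(err)
		}
		var decoded serverKeyEntry
		if err := json.Unmarshal(encoded, &decoded); err != nil {
			b.Fatal(err)
		}
	}
}

// BenchmarkInProcessMapRead is the baseline: fetching the same entry from an
// in-process map without any serialisation.
func BenchmarkInProcessMapRead(b *testing.B) {
	cache := map[string]serverKeyEntry{
		"example.com/ed25519:auto": {ServerName: "example.com"},
	}
	for i := 0; i < b.N; i++ {
		_ = cache["example.com/ed25519:auto"]
	}
}
```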

Justification

None of this should be taken as my wanting to eagerly cache everything. So far, room version and server key caching exist because we have seen significant performance gains from doing so (particularly where joining federated rooms and retrieving history are concerned). I would like to ensure that we only build in caches where we know it is justifiable to do so, since they can also create other problems.

Whatever model we come up with, I think that we ideally need to update our contributors doc with information such as:

  • When you should and shouldn't consider caching (e.g. steering contributors away from mutable caches as far as possible)
  • High-level guidance on using tools like pprof to understand where caches might make sense
  • A pattern to follow to use/implement
  • How to avoid overlapping caches, since that probably creates a new class of problems in itself

Other issues

Some other questions that we might want to answer:

  • How should they be tuned? (Manually vs automatically?)
  • Should they be entirely optional?
  • How can we avoid overlapping caches?
  • If we do end up with polylith components having to maintain their own caches, how can we reason about invalidation?

@kegsay added the X-Needs-Discussion, X-Performance and T-Task labels and removed the needs discussion label on Dec 5, 2022
@kegsay added the X-Fix-With-Monolith label on Feb 13, 2023