From 3e018838d15403f51419d4632fdb7d0e5d2b5eed Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Thu, 18 Jan 2024 16:24:44 +0000
Subject: [PATCH 01/56] Add adr/transaction-log-state-store.md

---
 docs/adr/transaction-log-state-store.md | 114 ++++++++++++++++++++++++
 1 file changed, 114 insertions(+)
 create mode 100644 docs/adr/transaction-log-state-store.md

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
new file mode 100644
index 0000000000..81e24ae251
--- /dev/null
+++ b/docs/adr/transaction-log-state-store.md
@@ -0,0 +1,114 @@
+# Store a transaction log for the state store
+
+## Status
+
+Proposed
+
+## Context
+
+We have two implementations of the state store that tracks partitions and files in a Sleeper table. This takes some
+effort to keep both working as the system changes, and both have problems.
+
+The DynamoDB implementation stores partitions and files as individual items in DynamoDB tables. This means that updates
+which affect many items at once require splitting into separate transactions, and we can't always apply changes as
+atomically or quickly as we would like. When working with many items at once, there's also a consistency issue where as
+we page through these items to load them into memory, the data may change in DynamoDB in between pages.
+
+The S3 implementation stores one file for partitions and one for files, both in an S3 bucket. A DynamoDB table is used
+to track the current revision of each file, and each change means writing a whole new file. This means that each change
+takes some time to process, and if two changes happen to the same file at once, it backs out and has to retry. Under
+contention, many retries may happen. It's common for updates to fail entirely due to too many retries, or to take a long
+time.
+
+## Decision
+
+Use an [event sourced](https://martinfowler.com/eaaDev/EventSourcing.html) model to store a transaction log, as well as
+snapshots of the state.
+
+Store the transactions as items in DynamoDB. Store snapshots as S3 files.
+
+Store each transaction with a link to the next transaction to achieve total ordering and durability.
+
+## Consequences
+
+### Modelling state
+
+The simplest way to do this involves holding a model in memory consisting of the whole state of a Sleeper table. This
+model needs to support applying any update as an individual change in memory. We then add a way to store that model as a
+snapshot of the state at a given point in time.
+
+This means whenever a change occurs, we can apply that to the model in memory. If we store all the changes in DynamoDB
+as an ordered transaction log, anywhere else that holds the model can bring itself up to date by reading just the latest
+transactions from the log. With DynamoDB, consistent reads can enforce that you're really up-to-date.
+
+There's no need to read or write the whole state at once as with the S3 state store, because the model is derived from
+the transaction log. However, after some delay a separate process can write a snapshot of the whole state to S3. This
+allows any process to quickly load the whole model without needing to read the whole transaction log. Only transactions
+that happened after the snapshot need to be read.
+
+### Distributed updates and ordering
+
+A potential format for the primary key would be to take a local timestamp at the writer, and append some random data to
+the end. This would provide resolution between transactions that happen at the same time, and a reader after the fact
+would see a consistent view of which one happened first.
+
+This produces a problem where if two writers' clocks are out of sync, one of them can insert a transaction into the log
+in the past, according to the other writer. Ideally we would like to only ever append at the end of the log, so we know
+no transaction will be inserted in between ones we have already seen.
+
+One approach would be to allow some slack, so that every time we want to know the current state we have to start at a
+previous point in the log and bring ourselves up to date. This causes problems for durability. If two updates are
+mutually exclusive, one of them may insert itself before the other one and cause the original update to be lost. The
+first writer may believe its update successful because there was a period of time before the second writer added a
+transaction before it.
+
+We could design the system to allow for this slack and recover from transactions being undone over a short time period.
+This would be complicated to achieve, although it may allow for the highest performance.
+
+Another approach is to store a reference to the next log item in each transaction. When we add a transaction, we update
+the last transaction to set a link to our new transaction. DynamoDB can refuse this update if there's already a link to
+a different transaction. We then need to retry if we're out of date, but each transaction can be durably stored.
+
+This retry is comparable to an update in the S3 state store, but each change is much quicker to apply because you don't
+need to store the whole state. You also don't need to reload the whole state each time, because you haven't applied your
+transaction in your local copy of the model yet. You need to do the conditional check, but you don't need to update
+your local model until after the transaction is saved.
+
+### Parallel models
+
+So far we've assumed that we'll always work with the entire state of a Sleeper table at once, with one model. With a
+transaction log, it can be more practical to expand on that by adding read-only views.
+
+#### DynamoDB queries
+
+With the DynamoDB state store, there are some advantages for queries, in that we only need to read the relevant parts
+of the state. If we wish to retain this benefit, we can store the same DynamoDB items we use now.
+
+Similarly to the S3 snapshot, we could regularly store a snapshot of the table state as items in DynamoDB tables, in
+whatever format is convenient for queries.
+
+If we want this view to be 100% up to date, we could still read the latest transactions that have happened since the
+snapshot, and include that data in the result of any query.
+
+#### Status stores for reporting
+
+If we capture events related to jobs as transactions in the log, that would allow us to produce a separate model from
+the same transactions that can show what jobs have occurred, and every detail we track about them in the state store.
+
+This could unify any updates to jobs that we would ideally like to happen simultaneously with some change in the state
+store, eg. a compaction job finishing.
+
+### Comparison to a relational database
+
+With a relational database, large queries can be made to present a consistent view of the data. This would avoid the
+consistency issue we have with DynamoDB, but would come with some costs:
+
+- Transaction management and locking
+- Server-based deployment model
+
+With a relational database, larger transactions involve locking many records. This would pose similar problems to
+DynamoDB in terms of limiting atomicity of updates, as we would be heavily incentivised to keep transactions small. We
+would also need to work within the database's transactional tradeoffs, which may cause unforeseen problems.
+
+The database would need to be deployed as a persistent instance, although we could use a managed service. This loses
+some of the promise of Sleeper in terms of serverless deployment, and only running when something needs to happen.

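To make the modelling idea in this ADR concrete: the state is an in-memory object, and every update is a transaction
object that can be applied to it. A minimal sketch in Java follows, with hypothetical names rather than Sleeper's
actual classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the in-memory model: the whole table state is held
// locally, and any update is expressed as a transaction applied to it.
interface StateTransaction {
    void apply(TableState state);
}

class AddFileTransaction implements StateTransaction {
    private final String filename;

    AddFileTransaction(String filename) {
        this.filename = filename;
    }

    @Override
    public void apply(TableState state) {
        state.files().add(filename);
    }
}

class TableState {
    private final List<String> files = new ArrayList<>();

    List<String> files() {
        return files;
    }

    // Replaying the ordered log rebuilds the state; a serialised copy of this
    // object at any point in the log is a snapshot.
    void replay(List<StateTransaction> log) {
        log.forEach(transaction -> transaction.apply(this));
    }
}
```
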
From ff3ff260f975bc4665f80ef56f33ed6d51a3d404 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 09:01:32 +0000
Subject: [PATCH 02/56] Use transaction number in log

---
 docs/adr/transaction-log-state-store.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index 81e24ae251..012df3d090 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -27,7 +27,8 @@ snapshots of the state.
 
 Store the transactions as items in DynamoDB. Store snapshots as S3 files.
 
-Store each transaction with a link to the next transaction to achieve total ordering and durability.
+Store each transaction with a hash key of the table ID and a range key of the transaction number, so transactions are
+stored in order. Use a conditional check to ensure the transaction number has not already been used.
 
 ## Consequences
 
@@ -65,9 +66,9 @@ transaction before it.
 We could design the system to allow for this slack and recover from transactions being undone over a short time period.
 This would be complicated to achieve, although it may allow for the highest performance.
 
-Another approach is to store a reference to the next log item in each transaction. When we add a transaction, we update
-the last transaction to set a link to our new transaction. DynamoDB can refuse this update if there's already a link to
-a different transaction. We then need to retry if we're out of date, but each transaction can be durably stored.
+Another approach is to give each transaction a number. When we add a transaction, we use the next number in sequence
+after the current latest transaction. We use a conditional check to refuse the update if there's already a transaction
+with that number. We then need to retry if we're out of date. This way each transaction can be durably stored.
 
 This retry is comparable to an update in the S3 state store, but each change is much quicker to apply because you don't
 need to store the whole state. You also don't need to reload the whole state each time, because you haven't applied your

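The conditional check in this patch corresponds to a DynamoDB conditional put: write the item for the next transaction
number, and refuse the write if that number is already taken. A minimal sketch assuming the AWS SDK for Java v2; the
table layout and attribute names are illustrative only.

```java
import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class TransactionLogWriter {

    private final DynamoDbClient dynamo;
    private final String logTableName;

    public TransactionLogWriter(DynamoDbClient dynamo, String logTableName) {
        this.dynamo = dynamo;
        this.logTableName = logTableName;
    }

    /**
     * Appends a transaction at the given number. If another writer claimed
     * that number first, DynamoDB throws ConditionalCheckFailedException, and
     * the caller must catch up from the log and retry with a later number.
     */
    public void appendTransaction(String sleeperTableId, long transactionNumber, String body) {
        dynamo.putItem(PutItemRequest.builder()
                .tableName(logTableName)
                .item(Map.of(
                        "TABLE_ID", AttributeValue.builder().s(sleeperTableId).build(),
                        "TRANSACTION_NUMBER", AttributeValue.builder().n(Long.toString(transactionNumber)).build(),
                        "BODY", AttributeValue.builder().s(body).build()))
                // Refuse the write if an item already exists with this key.
                .conditionExpression("attribute_not_exists(TRANSACTION_NUMBER)")
                .build());
    }
}
```
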
From cfdab188b0c60c496df78aa53eb1c2d13d581381 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 09:07:28 +0000
Subject: [PATCH 03/56] Mention parallel update models

---
 docs/adr/transaction-log-state-store.md | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index 012df3d090..e467b63c69 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -72,13 +72,13 @@ with that number. We then need to retry if we're out of date. This way each tran
 
 This retry is comparable to an update in the S3 state store, but each change is much quicker to apply because you don't
 need to store the whole state. You also don't need to reload the whole state each time, because you haven't applied your
-transaction in your local copy of the model yet. You need to do the conditional check, but you don't need to update
-your local model until after the transaction is saved.
+transaction in your local copy of the model yet. You need to do the conditional check as in the S3 implementation, but
+you don't need to update your local model until after the transaction is saved.
 
 ### Parallel models
 
 So far we've assumed that we'll always work with the entire state of a Sleeper table at once, with one model. With a
-transaction log, it can be more practical to expand on that by adding read-only views.
+transaction log, it can be more practical to expand on that by adding alternative models for read or update.
 
 #### DynamoDB queries
 
@@ -99,6 +99,16 @@ the same transactions that can show what jobs have occurred, and every detail we
 This could unify any updates to jobs that we would ideally like to happen simultaneously with some change in the state
 store, eg. a compaction job finishing.
 
+#### Update models
+
+If we ever decide it's worth avoiding holding the whole Sleeper table state in memory, we could create an alternative
+model to apply a single update. Rather than hold the entire state in memory, we could load just the relevant state to
+perform the conditional check. When we bring this up to date from the transaction log, this model can ignore
+transactions that are not relevant to the update.
+
+This would add complexity to the way we model the table state, so we may prefer to avoid this. It is an option we could
+consider.
+
 ### Comparison to a relational database
 
 With a relational database, large queries can be made to present a consistent view of the data. This would avoid the

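The update models idea in the patch above amounts to a filtered catch-up: hold only the state that one conditional
check needs, and skip transactions that don't touch it. A speculative sketch with hypothetical types.

```java
import java.util.List;

// Hypothetical types: a partial model holds just the state one update's
// conditional check needs, so catch-up can skip unrelated transactions.
interface StateTransaction {
    // Keys of the partitions or files this transaction touches (illustrative).
    List<String> affectedKeys();
}

interface PartialState {
    boolean tracks(String key);

    void apply(StateTransaction transaction);
}

class PartialCatchUp {

    // Bring a partial model up to date from newly read transactions,
    // ignoring any that don't touch the tracked state.
    static void catchUp(PartialState state, List<StateTransaction> newTransactions) {
        for (StateTransaction transaction : newTransactions) {
            if (transaction.affectedKeys().stream().anyMatch(state::tracks)) {
                state.apply(transaction);
            }
        }
    }
}
```
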
From 9e07a089ed3890f86ca8aa47108b1fa1f555a861 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 09:57:27 +0000
Subject: [PATCH 04/56] Summarise consequences of transaction log

---
 docs/adr/transaction-log-state-store.md | 42 ++++++++++++++++---------
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index e467b63c69..3440fecf50 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -22,25 +22,37 @@ time.
 
 ## Decision
 
-Use an [event sourced](https://martinfowler.com/eaaDev/EventSourcing.html) model to store a transaction log, as well as
-snapshots of the state.
+Use an event sourced model to store a transaction log, as well as snapshots of the state.
 
 Store the transactions as items in DynamoDB. Store snapshots as S3 files.
 
-Store each transaction with a hash key of the table ID and a range key of the transaction number, so transactions are
-stored in order. Use a conditional check to ensure the transaction number has not already been used.
+The transaction log DynamoDB table has a hash key of the table ID, and a range key of the transaction number, so
+transactions are stored in order. Use a conditional check to ensure the transaction number has not already been used.
 
 ## Consequences
 
+This would involve a very different set of patterns from those where the source of truth is a store of the current
+state. The model for the current state is derived from the transaction log, and independent of the underlying store.
+To avoid reading the whole transaction log every time, we regularly store a snapshot of the state so we can start from a
+certain point in the log.
+
+We'll need to design the transaction log storage to solve some problems with distributed updates. We'll look at how to
+achieve ordering and durability of the log. This should result in a similar update process to the S3 state store, but
+without the need to save or load the whole state at once. This should mean quicker updates even compared to the DynamoDB
+state store, since we only need to save one item.
+
+This approach also makes it much easier to use parallel models or storage formats. This can allow for queries instead of
+loading the whole state at once, or we can model the data in some alternative way for various purposes.
+
 ### Modelling state
 
-The simplest way to do this involves holding a model in memory consisting of the whole state of a Sleeper table. This
-model needs to support applying any update as an individual change in memory. We then add a way to store that model as a
-snapshot of the state at a given point in time.
+The simplest approach is to hold a model in memory for the whole state of a Sleeper table. This needs to support
+applying any update as an individual change in memory. We then store that model as a snapshot of the state at a given
+point in time.
 
-This means whenever a change occurs, we can apply that to the model in memory. If we store all the changes in DynamoDB
-as an ordered transaction log, anywhere else that holds the model can bring itself up to date by reading just the latest
-transactions from the log. With DynamoDB, consistent reads can enforce that you're really up-to-date.
+Whenever a change occurs, we apply that to the model in memory. If we store all the changes in DynamoDB as an ordered
+transaction log, anywhere else that holds the model can bring itself up to date by reading just the latest transactions
+from the log. With DynamoDB, consistent reads can enforce that you're really up-to-date.
 
 There's no need to read or write the whole state at once as with the S3 state store, because the model is derived from
 the transaction log. However, after some delay a separate process can write a snapshot of the whole state to S3. This
@@ -49,9 +61,9 @@ that happened after the snapshot need to be read.
 
 ### Distributed updates and ordering
 
-A potential format for the primary key would be to take a local timestamp at the writer, and append some random data to
-the end. This would provide resolution between transactions that happen at the same time, and a reader after the fact
-would see a consistent view of which one happened first.
+A naive format for the primary key would be to take a local timestamp at the writer, and append some random data to the
+end. This would provide resolution between transactions that happen at the same time, and a reader after the fact would
+see a consistent view of which one happened first.
 
 This produces a problem where if two writers' clocks are out of sync, one of them can insert a transaction into the log
 in the past, according to the other writer. Ideally we would like to only ever append at the end of the log, so we know
@@ -103,8 +115,8 @@ store, eg. a compaction job finishing.
 
 If we ever decide it's worth avoiding holding the whole Sleeper table state in memory, we could create an alternative
 model to apply a single update. Rather than hold the entire state in memory, we could load just the relevant state to
-perform the conditional check. When we bring this up to date from the transaction log, this model can ignore
-transactions that are not relevant to the update.
+perform the conditional check, eg. from a DynamoDB queryable snapshot. When we bring this model up to date from the
+transaction log, we can ignore transactions that are not relevant to the update.
 
 This would add complexity to the way we model the table state, so we may prefer to avoid this. It is an option we could
 consider.

From 77920a359ca17d3820d441ba2e7a9ca47c72800d Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 10:03:32 +0000
Subject: [PATCH 05/56] Mention snapshots DynamoDB table

---
 docs/adr/transaction-log-state-store.md | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index 3440fecf50..0d3aed84a5 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -9,13 +9,13 @@ Proposed
 We have two implementations of the state store that tracks partitions and files in a Sleeper table. This takes some
 effort to keep both working as the system changes, and both have problems.
 
-The DynamoDB implementation stores partitions and files as individual items in DynamoDB tables. This means that updates
+The DynamoDB state store holds partitions and files as individual items in DynamoDB tables. This means that updates
 which affect many items at once require splitting into separate transactions, and we can't always apply changes as
-atomically or quickly as we would like. When working with many items at once, there's also a consistency issue where as
-we page through these items to load them into memory, the data may change in DynamoDB in between pages.
+atomically or quickly as we would like. When working with many items at once, there's a consistency issue. As we page
+through these items to load them into memory, the data may change in DynamoDB in between pages.
 
-The S3 implementation stores one file for partitions and one for files, both in an S3 bucket. A DynamoDB table is used
-to track the current revision of each file, and each change means writing a whole new file. This means that each change
+The S3 state store keeps one file for partitions and one for files, both in an S3 bucket. A DynamoDB table is used to
+track the current revision of each file, and each change means writing a whole new file. This means that each change
 takes some time to process, and if two changes happen to the same file at once, it backs out and has to retry. Under
 contention, many retries may happen. It's common for updates to fail entirely due to too many retries, or to take a long
 time.
@@ -29,6 +29,9 @@ Store the transactions as items in DynamoDB. Store snapshots as S3 files.
 The transaction log DynamoDB table has a hash key of the table ID, and a range key of the transaction number, so
 transactions are stored in order. Use a conditional check to ensure the transaction number has not already been used.
 
+The snapshots DynamoDB table holds a reference to the latest snapshot held in S3, similar to the S3 state store
+revisions table. This also holds the transaction number that snapshot was derived from.
+
 ## Consequences
 
 This would involve a very different set of patterns from those where the source of truth is a store of the current

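The snapshots table described in the patch above could be read and written along these lines. This is a sketch
assuming the AWS SDK for Java v2, with illustrative table and attribute names rather than a settled design.

```java
import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class SnapshotPointerStore {

    private final DynamoDbClient dynamo;
    private final String snapshotsTableName;

    public SnapshotPointerStore(DynamoDbClient dynamo, String snapshotsTableName) {
        this.dynamo = dynamo;
        this.snapshotsTableName = snapshotsTableName;
    }

    // Record that the latest snapshot for a Sleeper table lives at the given
    // S3 key, and which transaction number it was derived from.
    public void setLatestSnapshot(String sleeperTableId, String s3Key, long transactionNumber) {
        dynamo.putItem(PutItemRequest.builder()
                .tableName(snapshotsTableName)
                .item(Map.of(
                        "TABLE_ID", AttributeValue.builder().s(sleeperTableId).build(),
                        "SNAPSHOT_S3_KEY", AttributeValue.builder().s(s3Key).build(),
                        "TRANSACTION_NUMBER", AttributeValue.builder().n(Long.toString(transactionNumber)).build()))
                .build());
    }

    // A reader loads the snapshot from S3, then replays only transactions
    // after the returned TRANSACTION_NUMBER.
    public Map<String, AttributeValue> getLatestSnapshot(String sleeperTableId) {
        return dynamo.getItem(GetItemRequest.builder()
                .tableName(snapshotsTableName)
                .key(Map.of("TABLE_ID", AttributeValue.builder().s(sleeperTableId).build()))
                .consistentRead(true)
                .build())
                .item();
    }
}
```
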
From 7854f954a30a50fc534119b30dadb77c0268a2fa Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 10:15:02 +0000
Subject: [PATCH 06/56] Add resources

---
 docs/adr/transaction-log-state-store.md | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index 0d3aed84a5..336cb2c76c 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -92,8 +92,8 @@ you don't need to update your local model until after the transaction is saved.
 
 ### Parallel models
 
-So far we've assumed that we'll always work with the entire state of a Sleeper table at once, with one model. With a
-transaction log, it can be more practical to expand on that by adding alternative models for read or update.
+So far we've assumed that we'll always work with the entire state of a Sleeper table at once, with one model. With a
+transaction log, it can be more practical to add alternative models for read or update.
 
 #### DynamoDB queries
 
@@ -138,3 +138,8 @@ would also need to work within the database's transactional tradeoffs, which may
 
 The database would need to be deployed as a persistent instance, although we could use a managed service. This loses
 some of the promise of Sleeper in terms of serverless deployment, and only running when something needs to happen.
+
+### Resources
+
+- [Martin Fowler's article on event sourcing](https://martinfowler.com/eaaDev/EventSourcing.html)
+- [Greg Young's talk on event sourcing](https://www.youtube.com/watch?v=8JKjvY4etTY)

From 05ba873d57f694cd1073481c4aabb380d958963b Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 10:27:16 +0000
Subject: [PATCH 07/56] Use a later talk in resources

---
 docs/adr/transaction-log-state-store.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index 336cb2c76c..24b4d1eca8 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -142,4 +142,4 @@ some of the promise of Sleeper in terms of serverless deployment, and only runni
 ### Resources
 
 - [Martin Fowler's article on event sourcing](https://martinfowler.com/eaaDev/EventSourcing.html)
-- [Greg Young's talk on event sourcing](https://www.youtube.com/watch?v=8JKjvY4etTY)
+- [Greg Young's talk on event sourcing](https://www.youtube.com/watch?v=LDW0QWie21s)

From e03d3ad6d50c234bbbd029f623592a97585aa36c Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 10:38:03 +0000
Subject: [PATCH 08/56] Restructure section on distributed updates

---
 docs/adr/transaction-log-state-store.md | 30 ++++++++++++++-----------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index 24b4d1eca8..3bc14dcfbc 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -64,9 +64,20 @@ that happened after the snapshot need to be read.
 
 ### Distributed updates and ordering
 
-A naive format for the primary key would be to take a local timestamp at the writer, and append some random data to the
-end. This would provide resolution between transactions that happen at the same time, and a reader after the fact would
-see a consistent view of which one happened first.
+To achieve ordered, durable updates, we can give each transaction a number. When we add a transaction, we use the next
+number in sequence after the current latest transaction. We use a conditional check to refuse the update if there's
+already a transaction with that number. We then need to retry if we're out of date.
+
+This retry is comparable to an update in the S3 state store, but each change is much quicker to apply because you don't
+need to store the whole state. You also don't need to reload the whole state each time, because you haven't applied your
+transaction in your local copy of the model yet. You need to do the conditional check as in the S3 implementation, but
+you don't need to update your local model until after the transaction is saved.
+
+An alternative approach would be to store each transaction immediately, and perform some resolution after the fact. Say
+we take a local timestamp at the writer, and append some random data to the end, and use that to order the transactions.
+This would provide resolution between transactions that happen at the same time, and a reader after the fact would
+see a consistent view of which one happened first. We could then store this without checking for any other
+transactions being written at the same time.
 
 This produces a problem where if two writers' clocks are out of sync, one of them can insert a transaction into the log
 in the past, according to the other writer. Ideally we would like to only ever append at the end of the log, so we know
@@ -79,16 +90,9 @@ first writer may believe its update successful because there was a period of tim
 transaction before it.
 
 We could design the system to allow for this slack and recover from transactions being undone over a short time period.
-This would be complicated to achieve, although it may allow for the highest performance.
-
-Another approach is to give each transaction a number. When we add a transaction, we use the next number in sequence
-after the current latest transaction. We use a conditional check to refuse the update if there's already a transaction
-with that number. We then need to retry if we're out of date. This way each transaction can be durably stored.
-
-This retry is comparable to an update in the S3 state store, but each change is much quicker to apply because you don't
-need to store the whole state. You also don't need to reload the whole state each time, because you haven't applied your
-transaction in your local copy of the model yet. You need to do the conditional check as in the S3 implementation, but
-you don't need to update your local model until after the transaction is saved.
+This would be complicated to achieve, although it may allow for improved performance as updates don't need to wait. The
+increase in complexity means this is unlikely to be as practical as an approach where a full ordering is established
+immediately.
 
 ### Parallel models
 

From 52897fb0e5b7e7ef749bf4e99127382c996ed3cd Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 10:41:23 +0000
Subject: [PATCH 09/56] Restructure section on distributed updates

---
 docs/adr/transaction-log-state-store.md | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index 3bc14dcfbc..173eb4f173 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -79,15 +79,10 @@ This would provide resolution between transactions that happen at the same time,
 see a consistent view of which one happened first. We could then store this without checking for any other
 transactions being written at the same time.
 
-This produces a problem where if two writers' clocks are out of sync, one of them can insert a transaction into the log
-in the past, according to the other writer. Ideally we would like to only ever append at the end of the log, so we know
-no transaction will be inserted in between ones we have already seen.
-
-One approach would be to allow some slack, so that every time we want to know the current state we have to start at a
-previous point in the log and bring ourselves up to date. This causes problems for durability. If two updates are
-mutually exclusive, one of them may insert itself before the other one and cause the original update to be lost. The
-first writer may believe its update successful because there was a period of time before the second writer added a
-transaction before it.
+This produces a durability problem where if two writers' clocks are out of sync, one of them can insert a transaction
+into the log in the past, according to the other writer. If two updates are mutually exclusive, one of them may be
+inserted before the previous update, and cause the original update to be lost. The first writer may believe its update
+was successful because for a time it was the latest in the log, until the second writer inserted a transaction before it.
 
 We could design the system to allow for this slack and recover from transactions being undone over a short time period.
 This would be complicated to achieve, although it may allow for improved performance as updates don't need to wait. The

From d447c2a38e8b372262a957130aeb903024ac54d6 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 10:50:49 +0000
Subject: [PATCH 10/56] Restructure section on distributed updates

---
 docs/adr/transaction-log-state-store.md | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index 173eb4f173..3fe072ff72 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -69,22 +69,23 @@ number in sequence after the current latest transaction. We use a conditional ch
 already a transaction with that number. We then need to retry if we're out of date.
 
 This retry is comparable to an update in the S3 state store, but each change is much quicker to apply because you don't
-need to store the whole state. You also don't need to reload the whole state each time, because you haven't applied your
-transaction in your local copy of the model yet. You need to do the conditional check as in the S3 implementation, but
-you don't need to update your local model until after the transaction is saved.
+need to store the whole state. You also don't need to reload the whole state each time. Instead you read the
+transactions you haven't seen yet and apply them to your local model. As in the S3 implementation, you perform a
+conditional check on your local model before saving the update. After your new transaction is saved, you could apply
+that to your local model as well, and keep it in memory to reuse for other updates or queries.
 
 An alternative approach would be to store each transaction immediately, and perform some resolution after the fact. Say
-we take a local timestamp at the writer, and append some random data to the end, and use that to order the transactions.
+we take a local timestamp at the writer, append some random data to the end, and use that to order the transactions.
 This would provide resolution between transactions that happen at the same time, and a reader after the fact would
-see a consistent view of which one happened first. We could then store this without checking for any other
-transactions being written at the same time.
+see a consistent view of which one happened first. We could then store this without checking for any other transactions
+being written at the same time.
 
 This produces a durability problem where if two writers' clocks are out of sync, one of them can insert a transaction
 into the log in the past, according to the other writer. If two updates are mutually exclusive, one of them may be
 inserted before the previous update, and cause the original update to be lost. The first writer may believe its update
 was successful because for a time it was the latest in the log, until the second writer inserted a transaction before it.
 
-We could design the system to allow for this slack and recover from transactions being undone over a short time period.
+We could design the system to allow for some slack and recover from transactions being undone over a short time period.
 This would be complicated to achieve, although it may allow for improved performance as updates don't need to wait. The
 increase in complexity means this is unlikely to be as practical as an approach where a full ordering is established
 immediately.

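Pulling this section together, the update process might look roughly like the loop below: catch up from the log, check
the update against the local model, attempt the conditional append, and retry on a conflict. The types here
(TransactionLog, TableState, the exception) are assumed abstractions, not Sleeper's actual API.

```java
import java.util.List;

// Thrown when another writer already claimed the transaction number,
// i.e. the conditional write failed (hypothetical).
class TransactionConflictException extends RuntimeException {
}

interface StateTransaction {
    // The conditional check against the local model, as in the S3 state store.
    boolean isValidAgainst(TableState state);

    void apply(TableState state);
}

interface TransactionLog {
    // Transactions strictly after the given number, in log order.
    List<StateTransaction> readAfter(long lastSeenNumber);

    // Conditional put of the numbered transaction; throws on conflict.
    void append(long transactionNumber, StateTransaction transaction);
}

class TableState {
    private long lastTransactionNumber;

    long lastTransactionNumber() {
        return lastTransactionNumber;
    }

    void apply(StateTransaction transaction, long transactionNumber) {
        transaction.apply(this);
        lastTransactionNumber = transactionNumber;
    }
}

class TransactionAppender {
    private final TransactionLog log;
    private final TableState localState;

    TransactionAppender(TransactionLog log, TableState localState) {
        this.log = log;
        this.localState = localState;
    }

    void addTransaction(StateTransaction transaction) {
        while (true) {
            // Catch up by reading only the transactions we haven't seen yet.
            for (StateTransaction seen : log.readAfter(localState.lastTransactionNumber())) {
                localState.apply(seen, localState.lastTransactionNumber() + 1);
            }
            if (!transaction.isValidAgainst(localState)) {
                throw new IllegalStateException("Update invalid against current state");
            }
            long nextNumber = localState.lastTransactionNumber() + 1;
            try {
                log.append(nextNumber, transaction);
                // Only update the local model after the transaction is saved.
                localState.apply(transaction, nextNumber);
                return;
            } catch (TransactionConflictException e) {
                // Another writer claimed this number; loop to catch up and retry.
            }
        }
    }
}
```
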
From b44f43d9b980b73f98f22cbec100e825b8689d35 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 10:54:29 +0000
Subject: [PATCH 11/56] Restructure section on distributed updates

---
 docs/adr/transaction-log-state-store.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index 3fe072ff72..b4c307a17d 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -74,11 +74,11 @@ transactions you haven't seen yet and apply them to your local model. As in the
 conditional check on your local model before saving the update. After your new transaction is saved, you could apply
 that to your local model as well, and keep it in memory to reuse for other updates or queries.
 
-An alternative approach would be to store each transaction immediately, and perform some resolution after the fact. Say
-we take a local timestamp at the writer, append some random data to the end, and use that to order the transactions.
-This would provide resolution between transactions that happen at the same time, and a reader after the fact would
-see a consistent view of which one happened first. We could then store this without checking for any other transactions
-being written at the same time.
+If we wanted to avoid this retry, there is an alternative approach: store the transaction immediately. To build the
+primary key, you could take a local timestamp at the writer, append some random data to the end, and use that to order
+the transactions. This would provide resolution between transactions that happen at the same time, and a reader after
+the fact would see a consistent view of which one happened first. We could then store this without checking for any
+other transactions being written at the same time.
 
 This produces a durability problem where if two writers' clocks are out of sync, one of them can insert a transaction
 into the log in the past, according to the other writer. If two updates are mutually exclusive, one of them may be

From d066feaf8f5e48bfee80c3450623f226b404f382 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 11:03:28 +0000
Subject: [PATCH 12/56] Rework consequences introduction

---
 docs/adr/transaction-log-state-store.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index b4c307a17d..0f52a842fa 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -34,15 +34,15 @@ revisions table. This also holds the transaction number that snapshot was derive
 
 ## Consequences
 
-This would involve a very different set of patterns from those where the source of truth is a store of the current
-state. The model for the current state is derived from the transaction log, and independent of the underlying store.
-To avoid reading the whole transaction log every time, we regularly store a snapshot of the state so we can start from a
-certain point in the log.
-
-We'll need to design the transaction log storage to solve some problems with distributed updates. We'll look at how to
-achieve ordering and durability of the log. This should result in a similar update process to the S3 state store, but
-without the need to save or load the whole state at once. This should mean quicker updates even compared to the DynamoDB
-state store, since we only need to save one item.
+This would involve a different set of patterns from those where the source of truth is a store of the current state.
+We'll look at how we model the state by deriving it from the transaction log, independent of the underlying store.
+To avoid reading the whole transaction log every time, we can store a snapshot of the state, so we can start from a
+certain point in the log.
+
+This means a different approach to distributed updates, and there are potential problems in how we store the transaction
+log. We'll look at how to achieve ordering and durability of the log. This should result in a similar update process to
+the S3 state store, but without the need to save or load the whole state at once. This should mean quicker updates even
+compared to the DynamoDB state store, since we only need to save one item per transaction.
 
 This approach also makes it much easier to use parallel models or storage formats. This can allow for queries instead of
 loading the whole state at once, or we can model the data in some alternative way for various purposes.

From f7ee7ae83b3d639cb7370307784881dd569d8ec2 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 11:04:34 +0000
Subject: [PATCH 13/56] Rework consequences introduction

---
 docs/adr/transaction-log-state-store.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index 0f52a842fa..119762386f 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -39,10 +39,10 @@ We'll look at how we model the state by deriving it from the transaction log, in
 To avoid reading the whole transaction log every time, we can store a snapshot of the state, so we can start from a
 certain point in the log.
 
-This means a different approach to distributed updates, and there are potential problems in how we store the transaction
-log. We'll look at how to achieve ordering and durability of the log. This should result in a similar update process to
-the S3 state store, but without the need to save or load the whole state at once. This should mean quicker updates even
-compared to the DynamoDB state store, since we only need to save one item per transaction.
+This is a slightly different approach to distributed updates, and there are potential problems in how we store the
+transaction log. We'll look at how to achieve ordering and durability of the log. This should result in a similar update
+process to the S3 state store, but without the need to save or load the whole state at once. This should mean quicker
+updates even compared to the DynamoDB state store, since we only need to save one item per transaction.
 
 This approach also makes it much easier to use parallel models or storage formats. This can allow for queries instead of
 loading the whole state at once, or we can model the data in some alternative way for various purposes.

From 389f683ba3e743a337e3b4323d54b7e6d11dac56 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 11:15:04 +0000
Subject: [PATCH 14/56] Rework section on modelling state

---
 docs/adr/transaction-log-state-store.md | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index 119762386f..8257a70512 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -49,18 +49,17 @@ loading the whole state at once, or we can model the data in some alternative wa
 
 ### Modelling state
 
-The simplest approach is to hold a model in memory for the whole state of a Sleeper table. This needs to support
-applying any update as an individual change in memory. We then store that model as a snapshot of the state at a given
-point in time.
-
-Whenever a change occurs, we apply that to the model in memory. If we store all the changes in DynamoDB as an ordered
-transaction log, anywhere else that holds the model can bring itself up to date by reading just the latest transactions
-from the log. With DynamoDB, consistent reads can enforce that you're really up-to-date.
-
-There's no need to read or write the whole state at once as with the S3 state store, because the model is derived from
-the transaction log. However, after some delay a separate process can write a snapshot of the whole state to S3. This
-allows any process to quickly load the whole model without needing to read the whole transaction log. Only transactions
-that happened after the snapshot need to be read.
+The simplest approach is to hold a model in memory for the whole state of a Sleeper table. We can use this single local
+model for any updates or queries, and bring it up to date based on a sequence of transactions.
+
+Whenever a change occurs, we apply that to the model in memory. Anywhere that holds the model can bring itself up to
+date by reading only the transactions it hasn't seen yet, starting from the latest transaction that's already been
+applied locally. With DynamoDB, consistent reads can enforce that you're really up-to-date.
+
+We can also skip to a certain point in the transaction log. We have a separate process whose job is to write regular
+snapshots of the model. Every few minutes, we can tag a snapshot against the latest transaction, and write a copy of the
+whole model to S3. This allows any process to quickly load the whole model without needing to read the whole transaction
+log. Only transactions that happened after the snapshot need to be read.
 
 ### Distributed updates and ordering
 

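Reading only the transactions that happened after the snapshot maps onto a DynamoDB key-condition query on the
transaction number. A sketch assuming the AWS SDK for Java v2, reusing the illustrative attribute names from the
earlier sketches.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;

public class TransactionLogReader {

    private final DynamoDbClient dynamo;
    private final String logTableName;

    public TransactionLogReader(DynamoDbClient dynamo, String logTableName) {
        this.dynamo = dynamo;
        this.logTableName = logTableName;
    }

    // Reads, in order, all transactions after the number the latest snapshot
    // was derived from. A consistent read enforces that we're really up to
    // date before applying them to the model loaded from the snapshot.
    public List<Map<String, AttributeValue>> readTransactionsAfter(String sleeperTableId, long snapshotNumber) {
        return dynamo.queryPaginator(QueryRequest.builder()
                .tableName(logTableName)
                .keyConditionExpression("#table = :table AND #number > :number")
                .expressionAttributeNames(Map.of(
                        "#table", "TABLE_ID",
                        "#number", "TRANSACTION_NUMBER"))
                .expressionAttributeValues(Map.of(
                        ":table", AttributeValue.builder().s(sleeperTableId).build(),
                        ":number", AttributeValue.builder().n(Long.toString(snapshotNumber)).build()))
                .consistentRead(true)
                .build())
                .items().stream()
                .collect(Collectors.toList());
    }
}
```
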
From f5e5668c49155a133bc7ccf05c99dbbc3b9e8c2a Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 11:16:06 +0000
Subject: [PATCH 15/56] Reformat section on modelling state

---
 docs/adr/transaction-log-state-store.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index 8257a70512..95dc9b5caf 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -58,8 +58,8 @@ applied locally. With DynamoDB, consistent reads can enforce that you're really
 
 We can also skip to a certain point in the transaction log. We have a separate process whose job is to write regular
 snapshots of the model. Every few minutes, we can tag a snapshot against the latest transaction, and write a copy of the
-whole model to S3. This allows any process to quickly load the whole model without needing to read the whole transaction
-log. Only transactions that happened after the snapshot need to be read.
+whole model to S3. This allows any process to quickly load the model without needing to read the whole transaction log.
+Only transactions that happened after the snapshot need to be read.
 
 ### Distributed updates and ordering
 

From dc04c3dbb7234f513ed78f8ec799e23b5014f642 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 11:18:04 +0000
Subject: [PATCH 16/56] Tweak section on modelling state

---
 docs/adr/transaction-log-state-store.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index 95dc9b5caf..a65dc258b3 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -50,7 +50,8 @@ loading the whole state at once, or we can model the data in some alternative wa
 ### Modelling state
 
 The simplest approach is to hold a model in memory for the whole state of a Sleeper table. We can use this single local
-model for any updates or queries, and bring it up to date based on a sequence of transactions.
+model for any updates or queries, and bring it up to date based on the ordered sequence of transactions. We just need to
+be able to apply any given transaction to the model.
 
 Whenever a change occurs, we apply that to the model in memory. Anywhere that holds the model can bring itself up to
 date by reading only the transactions it hasn't seen yet, starting from the latest transaction that's already been

From 2882d54a9f9973e1c0c6f17028dd5a8c919eba34 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 11:23:34 +0000
Subject: [PATCH 17/56] Tweak section on DynamoDB queries

---
 docs/adr/transaction-log-state-store.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index a65dc258b3..f974350e30 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -97,11 +97,11 @@ transaction log, it can be more practical to add alternative models for read or
 
 #### DynamoDB queries
 
-With the DynamoDB state store, there are some advantages for queries, in that we only need to read the relevant parts
-of the state. If we wish to retain this benefit, we can store the same DynamoDB items we use now.
+The DynamoDB state store, has advantages for queries, as we only need to read the relevant parts of the state. If we
+want to retain this benefit, we can store the same DynamoDB structure we use now.
 
-Similarly to the S3 snapshot, we could regularly store a snapshot of the table state as items in DynamoDB tables, in
-whatever format is convenient for queries.
+Similar to the process for S3 snapshots, we could regularly store a snapshot of the table state as items in DynamoDB
+tables, in whatever format is convenient for queries.
 
 If we want this view to be 100% up to date, we could still read the latest transactions that have happened since the
 snapshot, and include that data in the result of any query.

From 785e19e419e5d33b4ba47ff5c04d9c391fe7cef5 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 11:23:45 +0000
Subject: [PATCH 18/56] Tweak section on modelling state

---
 docs/adr/transaction-log-state-store.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index f974350e30..acfd873561 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -58,9 +58,10 @@ date by reading only the transactions it hasn't seen yet, starting from the late
 applied locally. With DynamoDB, consistent reads can enforce that you're really up-to-date.
 
 We can also skip to a certain point in the transaction log. We have a separate process whose job is to write regular
-snapshots of the model. Every few minutes, we can tag a snapshot against the latest transaction, and write a copy of the
-whole model to S3. This allows any process to quickly load the model without needing to read the whole transaction log.
-Only transactions that happened after the snapshot need to be read.
+snapshots of the model. Every few minutes, we can take a snapshot against the latest transaction and write a copy of the
+whole model to S3. We can point to it in DynamoDB as in the S3 state store's revision table. This allows any process to
+quickly load the model without needing to read the whole transaction log. Only transactions that happened after the
+snapshot need to be read.
 
 ### Distributed updates and ordering
 

From 194153eceb07e7ad7872a5881f0f03fc8f77d279 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 11:25:25 +0000
Subject: [PATCH 19/56] Tweak section on update models

---
 docs/adr/transaction-log-state-store.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index acfd873561..0913d3d589 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -117,10 +117,10 @@ store, eg. a compaction job finishing.
 
 #### Update models
 
-If we ever decide it's worth avoiding holding the whole Sleeper table state in memory, we could create an alternative
-model to apply a single update. Rather than hold the entire state in memory, we could load just the relevant state to
-perform the conditional check, eg. from a DynamoDB queryable snapshot. When we bring this model up to date from the
-transaction log, we can ignore transactions that are not relevant to the update.
+If we ever decide to avoid holding the whole Sleeper table state in memory, we could create an alternative model to
+apply a single update. Rather than hold the entire state in memory, we could load just the relevant state to perform the
+conditional check, eg. from a DynamoDB queryable snapshot. When we bring this model up to date from the transaction log,
+we can ignore transactions that are not relevant to the update.
 
 This would add complexity to the way we model the table state, so we may prefer to avoid this. It is an option we could
 consider.

From 7fb75dab42a411320eabe1fb64cfd1765e7a68a7 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 11:28:19 +0000
Subject: [PATCH 20/56] Remove extra commas

---
 docs/adr/transaction-log-state-store.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index 0913d3d589..2f55e1a13e 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -94,11 +94,11 @@ immediately.
 ### Parallel models
 
 So far we've assumed that we'll always work with the entire state of a Sleeper table at once, with one model. With a
-transaction log, it can be more practical to add alternative models for read or update.
+transaction log it can be more practical to add alternative models for read or update.
 
 #### DynamoDB queries
 
-The DynamoDB state store, has advantages for queries, as we only need to read the relevant parts of the state. If we
+The DynamoDB state store has advantages for queries, as we only need to read the relevant parts of the state. If we
 want to retain this benefit, we can store the same DynamoDB structure we use now.
 
 Similar to the process for S3 snapshots, we could regularly store a snapshot of the table state as items in DynamoDB

From 478fcad823447c5a1f4aecf2e203d588cca6428d Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 11:34:28 +0000
Subject: [PATCH 21/56] Restructure consequences introduction

---
 docs/adr/transaction-log-state-store.md | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index 2f55e1a13e..7763ee460b 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -34,15 +34,16 @@ revisions table. This also holds the transaction number that snapshot was derive
 
 ## Consequences
 
-This would involve a different set of patterns from those where the source of truth is a store of the current state.
-We'll look at how we model the state by deriving it from the transaction log, independent of the underlying store.
-To avoid reading the whole transaction log every time, we can store a snapshot of the state, so we can start from a
-certain point in the log.
-
-This is a slightly different approach to distributed updates, and there are potential problems in how we store the
-transaction log. We'll look at how to achieve ordering and durability of the log. This should result in a similar update
-process to the S3 state store, but without the need to save or load the whole state at once. This should mean quicker
-updates even compared to the DynamoDB state store, since we only need to save one item per transaction.
+This should result in a similar update process to the S3 state store, but without the need to save or load the whole
+state at once. This should mean quicker updates even compared to the DynamoDB state store, since we only need to save
+one item per transaction. This would use a different set of patterns from those where the source of truth is a store of
+the current state, and we'll look at some of the implications.
+
+We'll look at how to model the state as derived from the transaction log, independent of the underlying store. To avoid
+reading every transaction every time, we can store a snapshot of the state, and start from a certain point in the log.
+
+We'll look at how to achieve ordering and durability of transactions. This is a slightly different approach for
+distributed updates, and there are potential problems in how we store the transaction log.
 
 This approach also makes it much easier to use parallel models or storage formats. This can allow for queries instead of
 loading the whole state at once, or we can model the data in some alternative way for various purposes.

From d1aa59ca422b4dcf2fdd0b7a19cf4d4c771b0bf6 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 11:37:02 +0000
Subject: [PATCH 22/56] Tweak consequences introduction

---
 docs/adr/transaction-log-state-store.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index 7763ee460b..bef3adb9c9 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -45,8 +45,9 @@ reading every transaction every time, we can store a snapshot of the state, and
 We'll look at how to achieve ordering and durability of transactions. This is a slightly different approach for
 distributed updates, and there are potential problems in how we store the transaction log.
 
-This approach also makes it much easier to use parallel models or storage formats. This can allow for queries instead of
-loading the whole state at once, or we can model the data in some alternative way for various purposes.
+We'll look at some applications for parallel models or storage formats, as this approach makes it easier to derive
+different formats for the state. This can allow for queries instead of loading the whole state at once, or we can model
+the data in some alternative ways for various purposes.
 
 ### Modelling state
 

From 8ac612753e2c64750f6587a8c75e759be6761b64 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 11:38:51 +0000
Subject: [PATCH 23/56] Tweak consequences introduction

---
 docs/adr/transaction-log-state-store.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index bef3adb9c9..eb79fc2d8b 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -49,6 +49,8 @@ We'll look at some applications for parallel models or storage formats, as this
 different formats for the state. This can allow for queries instead of loading the whole state at once, or we can model
 the data in some alternative ways for various purposes.
 
+We'll also look at how this compares to an approach based on a relational database.
+
 ### Modelling state
 
 The simplest approach is to hold a model in memory for the whole state of a Sleeper table. We can use this single local

From 6ebb4cb2e581b1478a738124364f05493895839d Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 11:42:25 +0000
Subject: [PATCH 24/56] Tweak section on modelling state

---
 docs/adr/transaction-log-state-store.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index eb79fc2d8b..cd7416995a 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -57,9 +57,10 @@ The simplest approach is to hold a model in memory for the whole state of a Slee
 model for any updates or queries, and bring it up to date based on the ordered sequence of transactions. We just need to
 be able to apply any given transaction to the model.
 
-Whenever a change occurs, we apply that to the model in memory. Anywhere that holds the model can bring itself up to
-date by reading only the transactions it hasn't seen yet, starting from the latest transaction that's already been
-applied locally. With DynamoDB, consistent reads can enforce that you're really up-to-date.
+Whenever a change occurs, we create a transaction that we can apply to the model in memory. Anywhere that holds the
+model can bring itself up to date by reading only the transactions it hasn't seen yet, starting from the latest
+transaction that's already been applied locally. With DynamoDB, consistent reads can enforce that you're really
+up-to-date.
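+
+As an illustration, a minimal sketch of this model in Java might look like the following. The names here are ours for
+illustration, not an existing Sleeper API, and we assume transactions are identified by an ordered number, as discussed
+under distributed updates below.
+
+```java
+import java.util.List;
+
+interface Transaction {
+    void apply(TableState state);
+}
+
+record NumberedTransaction(long number, Transaction transaction) {
+}
+
+interface TransactionLog {
+    // Returns transactions with numbers greater than the given one, in order.
+    List<NumberedTransaction> readAfter(long transactionNumber);
+}
+
+class TableState {
+    private long lastTransactionNumber = 0;
+
+    // Bring this local copy of the model up to date by applying only transactions we haven't seen yet.
+    void catchUp(TransactionLog log) {
+        for (NumberedTransaction next : log.readAfter(lastTransactionNumber)) {
+            next.transaction().apply(this);
+            lastTransactionNumber = next.number();
+        }
+    }
+}
+```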
 
 We can also skip to a certain point in the transaction log. We have a separate process whose job is to write regular
 snapshots of the model. Every few minutes, we can take a snapshot against the latest transaction and write a copy of the

From e10f464e331b852b49b149aac5059e5b4d83f5eb Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 11:46:00 +0000
Subject: [PATCH 25/56] Add subheadings to distributed updates section

---
 docs/adr/transaction-log-state-store.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index cd7416995a..0e95cb7cc8 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -70,6 +70,8 @@ snapshot need to be read.
 
 ### Distributed updates and ordering
 
+#### Immediate ordering approach
+
 To achieve ordered, durable updates, we can give each transaction a number. When we add a transaction, we use the next
 number in sequence after the current latest transaction. We use a conditional check to refuse the update if there's
 already a transaction with that number. We then need to retry if we're out of date.
@@ -80,6 +82,8 @@ transactions you haven't seen yet and apply them to your local model. As in the
 conditional check on your local model before saving the update. After your new transaction is saved, you could apply
 that to your local model as well, and keep it in memory to reuse for other updates or queries.
 
+#### Eventual consistency approach
+
 If we wanted to avoid this retry, there is an alternative approach to store the transaction immediately. To build the
 primary key, you could take a local timestamp at the writer, append some random data to the end, and use that to order
 the transactions. This would provide resolution between transactions that happen at the same time, and a reader after

From b23e5f6ec81e5af25f9b0c40d59eff5d5d4d6555 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 11:49:50 +0000
Subject: [PATCH 26/56] Tweak section on DynamoDB state store context

---
 docs/adr/transaction-log-state-store.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index 0e95cb7cc8..d692818822 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -11,8 +11,8 @@ effort to keep both working as the system changes, and both have problems.
 
 The DynamoDB state store holds partitions and files as individual items in DynamoDB tables. This means that updates
 which affect many items at once require splitting into separate transactions, and we can't always apply changes as
-atomically or quickly as we would like. When working with many items at once, there's a consistency issue. As we page
-through these items to load them into memory, the data may change in DynamoDB in between pages.
+atomically or quickly as we would like. There's also a consistency issue when working with many items at once. As we
+page through items to load them into memory, the data may change in DynamoDB in between pages.
 
 The S3 state store keeps one file for partitions and one for files, both in an S3 bucket. A DynamoDB table is used to
 track the current revision of each file, and each change means writing a whole new file. This means that each change

From 8300d7d057b2827822cc15254ef32826f3d79aa4 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Fri, 19 Jan 2024 11:59:51 +0000
Subject: [PATCH 27/56] Tweak section on status stores for reporting

---
 docs/adr/transaction-log-state-store.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/adr/transaction-log-state-store.md
index d692818822..1665f6fed6 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/adr/transaction-log-state-store.md
@@ -121,8 +121,8 @@ snapshot, and include that data in the result of any query.
 If we capture events related to jobs as transactions in the log, that would allow us to produce a separate model from
 the same transactions that can show what jobs have occurred, and every detail we track about them in the state store.
 
-This could unify any updates to jobs that we would ideally like to happen simultaneously with some change in the state
-store, eg. a compaction job finishing.
+This could unify some updates to jobs that are currently done in a separate reporting status store, which we would
+ideally like to happen simultaneously with some change in the state store, eg. a compaction job finishing.
 
 #### Update models
 

From 5bf28efd7442f4d85f985432809b7a85eb476770 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Tue, 30 Jan 2024 14:14:42 +0000
Subject: [PATCH 28/56] Move from ADR to design

---
 docs/{adr => designs}/transaction-log-state-store.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
 rename docs/{adr => designs}/transaction-log-state-store.md (97%)

diff --git a/docs/adr/transaction-log-state-store.md b/docs/designs/transaction-log-state-store.md
similarity index 97%
rename from docs/adr/transaction-log-state-store.md
rename to docs/designs/transaction-log-state-store.md
index 1665f6fed6..1574a651ee 100644
--- a/docs/adr/transaction-log-state-store.md
+++ b/docs/designs/transaction-log-state-store.md
@@ -2,7 +2,7 @@
 
 ## Status
 
-Proposed
+Proposed design
 
 ## Context
 
@@ -20,9 +20,9 @@ takes some time to process, and if two changes happen to the same file at once,
 contention, many retries may happen. It's common for updates to fail entirely due to too many retries, or to take a long
 time.
 
-## Decision
+## Design Summary
 
-Use an event sourced model to store a transaction log, as well as snapshots of the state.
+Implement the state store using an event sourced model, storing a transaction log as well as snapshots of the state.
 
 Store the transactions as items in DynamoDB. Store snapshots as S3 files.
 
@@ -134,7 +134,7 @@ we can ignore transactions that are not relevant to the update.
 This would add complexity to the way we model the table state, so we may prefer to avoid this. It is an option we could
 consider.
 
-### Comparison to a relational database
+## Comparison to a relational database
 
 With a relational database, large queries can be made to present a consistent view of the data. This would avoid the
 consistency issue we have with DynamoDB, but would come with some costs:
@@ -149,7 +149,7 @@ would also need to work within the database's transactional tradeoffs, which may
 The database would need to be deployed as a persistent instance, although we could use a managed service. This loses
 some of the promise of Sleeper in terms of serverless deployment, and only running when something needs to happen.
 
-### Resources
+## Resources
 
 - [Martin Fowler's article on event sourcing](https://martinfowler.com/eaaDev/EventSourcing.html)
 - [Greg Young's talk on event sourcing](https://www.youtube.com/watch?v=LDW0QWie21s)

From 9c357e41afec690292403fa39503a8b048971408 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Tue, 30 Jan 2024 14:54:00 +0000
Subject: [PATCH 29/56] Update transaction log consequences section

---
 docs/designs/transaction-log-state-store.md | 32 ++++++++++++++++-----
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/docs/designs/transaction-log-state-store.md b/docs/designs/transaction-log-state-store.md
index 1574a651ee..801fd0eb07 100644
--- a/docs/designs/transaction-log-state-store.md
+++ b/docs/designs/transaction-log-state-store.md
@@ -35,8 +35,8 @@ revisions table. This also holds the transaction number that snapshot was derive
 ## Consequences
 
 This should result in a similar update process to the S3 state store, but without the need to save or load the whole
-state at once. This should mean quicker updates even compared to the DynamoDB state store, since we only need to save
-one item per transaction. This would use a different set of patterns from those where the source of truth is a store of
+state at once. Since we only need to save one item per transaction, this may also result in quicker updates compared to
+the DynamoDB state store. This would use a different set of patterns from those where the source of truth is a store of
 the current state, and we'll look at some of the implications.
 
 We'll look at how to model the state as derived from the transaction log, independent of the underlying store. To avoid
@@ -68,6 +68,19 @@ whole model to S3. We can point to it in DynamoDB as in the S3 state store's rev
 quickly load the model without needing to read the whole transaction log. Only transactions that happened after the
 snapshot need to be read.
 
+### Transaction size
+
+A DynamoDB item can have a maximum size of 400KB. It's unlikely a single transaction will exceed that, but we'd have to
+guard against it. We can either pre-emptively split large transactions into smaller ones that we know will fit in a
+DynamoDB item, or we can catch the error DynamoDB returns when an item is too large, and store the transaction some
+other way.
+
+To split a transaction into smaller ones that will fit, we would need support for this in our model, so that a
+transaction can be split without affecting the aspects of atomicity which matter to the system.
+
+An alternative would be to detect that a transaction is too big, and write it to a file in S3 with just a pointer to
+that file in DynamoDB. This could be significantly slower than a standard DynamoDB update, and may slow down reading
+the transaction log.
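+
+As a sketch of guarding against oversized transactions with the S3 fallback, assuming the AWS SDK for Java v2 and
+illustrative table, bucket and attribute names:
+
+```java
+import java.util.Map;
+
+import software.amazon.awssdk.core.SdkBytes;
+import software.amazon.awssdk.core.sync.RequestBody;
+import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
+import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
+import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;
+import software.amazon.awssdk.services.s3.S3Client;
+import software.amazon.awssdk.services.s3.model.PutObjectRequest;
+
+class TransactionWriter {
+    // Leave headroom under DynamoDB's 400KB item limit for the key and other attributes.
+    private static final int MAX_BODY_BYTES = 350 * 1024;
+
+    private final DynamoDbClient dynamo;
+    private final S3Client s3;
+    private final String logTableName;
+    private final String bucketName;
+
+    TransactionWriter(DynamoDbClient dynamo, S3Client s3, String logTableName, String bucketName) {
+        this.dynamo = dynamo;
+        this.s3 = s3;
+        this.logTableName = logTableName;
+        this.bucketName = bucketName;
+    }
+
+    void write(long transactionNumber, byte[] body) {
+        Map<String, AttributeValue> item;
+        if (body.length <= MAX_BODY_BYTES) {
+            // Small enough: hold the transaction body directly in the DynamoDB item.
+            item = Map.of(
+                    "TransactionNumber", AttributeValue.builder().n(Long.toString(transactionNumber)).build(),
+                    "Body", AttributeValue.builder().b(SdkBytes.fromByteArray(body)).build());
+        } else {
+            // Too big: write the body to S3, and store only a pointer to it in DynamoDB.
+            String s3Key = "transactions/" + transactionNumber;
+            s3.putObject(PutObjectRequest.builder().bucket(bucketName).key(s3Key).build(),
+                    RequestBody.fromBytes(body));
+            item = Map.of(
+                    "TransactionNumber", AttributeValue.builder().n(Long.toString(transactionNumber)).build(),
+                    "BodyS3Key", AttributeValue.builder().s(s3Key).build());
+        }
+        dynamo.putItem(PutItemRequest.builder().tableName(logTableName).item(item).build());
+    }
+}
+```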
+
 ### Distributed updates and ordering
 
 #### Immediate ordering approach
@@ -76,11 +89,16 @@ To achieve ordered, durable updates, we can give each transaction a number. When
 number in sequence after the current latest transaction. We use a conditional check to refuse the update if there's
 already a transaction with that number. We then need to retry if we're out of date.
 
-This retry is comparable to an update in the S3 state store, but each change is much quicker to apply because you don't
-need to store the whole state. You also don't need to reload the whole state each time. Instead you read the
-transactions you haven't seen yet and apply them to your local model. As in the S3 implementation, you perform a
-conditional check on your local model before saving the update. After your new transaction is saved, you could apply
-that to your local model as well, and keep it in memory to reuse for other updates or queries.
+This retry is comparable to an update in the S3 state store, but you don't need to store the whole state. You also don't
+need to reload the whole state each time. Instead, you read the transactions you haven't seen yet and apply them to your
+local model. As in the S3 implementation, you perform a conditional check on your local model before saving the update.
+After your new transaction is saved, you could apply that to your local model as well, and keep it in memory to reuse
+for other updates or queries.
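+
+A minimal sketch of claiming the next transaction number with a DynamoDB conditional check, using the AWS SDK for Java
+v2 with illustrative table, key and attribute names:
+
+```java
+import java.util.Map;
+
+import software.amazon.awssdk.core.SdkBytes;
+import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
+import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
+import software.amazon.awssdk.services.dynamodb.model.ConditionalCheckFailedException;
+import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;
+
+class TransactionAppender {
+    private final DynamoDbClient dynamo;
+    private final String logTableName;
+    private final String sleeperTableId;
+
+    TransactionAppender(DynamoDbClient dynamo, String logTableName, String sleeperTableId) {
+        this.dynamo = dynamo;
+        this.logTableName = logTableName;
+        this.sleeperTableId = sleeperTableId;
+    }
+
+    // Try to claim the next number in sequence. Returns false if another writer got there first,
+    // in which case the caller should bring its local model up to date and retry.
+    boolean tryAddTransaction(long lastSeenTransactionNumber, byte[] body) {
+        try {
+            dynamo.putItem(PutItemRequest.builder()
+                    .tableName(logTableName)
+                    .item(Map.of(
+                            "TableId", AttributeValue.builder().s(sleeperTableId).build(),
+                            "TransactionNumber", AttributeValue.builder()
+                                    .n(Long.toString(lastSeenTransactionNumber + 1)).build(),
+                            "Body", AttributeValue.builder().b(SdkBytes.fromByteArray(body)).build()))
+                    // Refuse the update if there's already a transaction with this number.
+                    .conditionExpression("attribute_not_exists(TransactionNumber)")
+                    .build());
+            return true;
+        } catch (ConditionalCheckFailedException e) {
+            return false;
+        }
+    }
+}
+```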
+
+There are still potential concurrency issues with this approach, since retries are still required under contention. We
+don't know for sure whether this will reduce contention issues by a few percent relative to the S3 state store (in which
+case the transaction log approach doesn't solve the problem), or eliminate them completely. Since each update is
+smaller, it should be quicker. We could prototype this to gauge whether it will be eg. 5% quicker or 5x quicker.
 
 #### Eventual consistency approach
 

From 50117deceed00bc9f17b459510f0e744de13c580 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Tue, 30 Jan 2024 15:18:35 +0000
Subject: [PATCH 30/56] Update relational database comparison section

---
 docs/designs/transaction-log-state-store.md | 34 ++++++++++++++++++---
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/docs/designs/transaction-log-state-store.md b/docs/designs/transaction-log-state-store.md
index 801fd0eb07..22ffb42e9a 100644
--- a/docs/designs/transaction-log-state-store.md
+++ b/docs/designs/transaction-log-state-store.md
@@ -160,12 +160,36 @@ consistency issue we have with DynamoDB, but would come with some costs:
 - Transaction management and locking
 - Server-based deployment model
 
-With a relational database, larger transactions involve locking many records. This would pose similar problems to
-DynamoDB in terms of limiting atomicity of updates, as we would be heavily incentivised to keep transactions small. We
-would also need to work within the database's transactional tradeoffs, which may cause unforeseen problems.
+### Transaction management and locking
 
-The database would need to be deployed as a persistent instance, although we could use a managed service. This loses
-some of the promise of Sleeper in terms of serverless deployment, and only running when something needs to happen.
+With a relational database, larger transactions involve locking many records. If a larger transaction takes a
+significant amount of time, there is potential for waiting based on these locks. A relational database is similar to
+DynamoDB in that each record needs to be updated individually. It's not clear whether this may result in slower
+performance than we would like with large updates, or potential deadlock issues.
+
+We may also be affected by transaction isolation levels. As an example, PostgreSQL defaults to a read committed
+isolation level. This means that during one transaction, if you make multiple queries, the database may change in
+between those queries, and you may see an inconsistent state. This is similar to DynamoDB, except that PostgreSQL also
+supports higher levels of transaction isolation, and larger queries across tables. With higher levels of transaction
+isolation that produce a consistent view of the state, there is potential for serialization failure. For example, it may
+not be possible for PostgreSQL to reconstruct a consistent view of the state at the start of the transaction if the
+transaction is very large or a query is very large. In this case it's necessary to retry a transaction. See the
+PostgreSQL manual on transaction isolation levels:
+
+https://www.postgresql.org/docs/current/transaction-iso.html
+
+### Deployment model
+
+PostgreSQL operates as a cluster of individual server nodes. We can mitigate this by using a service with automatic
+scaling.
+
+Aurora Serverless v2 supports automatic scaling up and down between minimum and maximum limits. If you know Sleeper will
+be idle for a while, we could stop the database and then only be charged for the storage. We already have a concept of
+pausing Sleeper so that the periodic lambdas don't run. With Aurora Serverless this wouldn't be too much different.
+
+This has some differences to the rest of Sleeper, which is designed to scale to zero by default. Aurora Serverless v2
+does not support scaling to zero. This means there would be some persistent costs unless we explicitly pause the Sleeper
+instance and stop the database entirely.
 
 ## Resources
 

From 44d8e814ecc309ccdc3827c0ed1be58528cd7693 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Tue, 30 Jan 2024 15:29:35 +0000
Subject: [PATCH 31/56] Update relational database comparison section

---
 docs/designs/transaction-log-state-store.md | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/docs/designs/transaction-log-state-store.md b/docs/designs/transaction-log-state-store.md
index 22ffb42e9a..375704371c 100644
--- a/docs/designs/transaction-log-state-store.md
+++ b/docs/designs/transaction-log-state-store.md
@@ -165,16 +165,16 @@ consistency issue we have with DynamoDB, but would come with some costs:
 With a relational database, larger transactions involve locking many records. If a larger transaction takes a
 significant amount of time, there is potential for waiting based on these locks. A relational database is similar to
 DynamoDB in that each record needs to be updated individually. It's not clear whether this may result in slower
-performance than we would like with large updates, or potential deadlock issues.
-
-We may also be affected by transaction isolation levels. As an example, PostgreSQL defaults to a read committed
-isolation level. This means that during one transaction, if you make multiple queries, the database may change in
-between those queries, and you may see an inconsistent state. This is similar to DynamoDB, except that PostgreSQL also
-supports higher levels of transaction isolation, and larger queries across tables. With higher levels of transaction
-isolation that produce a consistent view of the state, there is potential for serialization failure. For example, it may
-not be possible for PostgreSQL to reconstruct a consistent view of the state at the start of the transaction if the
-transaction is very large or a query is very large. In this case it's necessary to retry a transaction. See the
-PostgreSQL manual on transaction isolation levels:
+performance than we would like with large updates, deadlocks, or other contention issues.
+
+We may also be affected by transaction isolation levels. PostgreSQL defaults to a read committed isolation level. This
+means that during one transaction, if you make multiple queries, the database may change in between those queries, and
+you may see an inconsistent state. This is similar to DynamoDB, except that PostgreSQL also supports higher levels of
+transaction isolation, and larger queries across tables. With higher levels of transaction isolation that produce a
+consistent view of the state, there is potential for serialization failure. For example, it may not be possible for
+PostgreSQL to reconstruct a consistent view of the state at the start of the transaction if the transaction is very
+large or a query is very large. In this case it's necessary to retry a transaction. See the PostgreSQL manual on
+transaction isolation levels:
 
 https://www.postgresql.org/docs/current/transaction-iso.html
 

From fe9712637d5f9e8669c106c0f0a6689a3f6f8b4b Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Tue, 30 Jan 2024 15:33:02 +0000
Subject: [PATCH 32/56] Split out design document for PostgreSQL state store

---
 docs/designs/postgresql-state-store.md      | 64 +++++++++++++++++++++
 docs/designs/transaction-log-state-store.md | 39 -------------
 2 files changed, 64 insertions(+), 39 deletions(-)
 create mode 100644 docs/designs/postgresql-state-store.md

diff --git a/docs/designs/postgresql-state-store.md b/docs/designs/postgresql-state-store.md
new file mode 100644
index 0000000000..f21ff92c56
--- /dev/null
+++ b/docs/designs/postgresql-state-store.md
@@ -0,0 +1,64 @@
+# Use PostgreSQL for the state store
+
+## Status
+
+Proposed design
+
+## Context
+
+We have two implementations of the state store that tracks partitions and files in a Sleeper table. This takes some
+effort to keep both working as the system changes, and both have problems.
+
+The DynamoDB state store holds partitions and files as individual items in DynamoDB tables. This means that updates
+which affect many items at once require splitting into separate transactions, and we can't always apply changes as
+atomically or quickly as we would like. There's also a consistency issue when working with many items at once. As we
+page through items to load them into memory, the data may change in DynamoDB in between pages.
+
+The S3 state store keeps one file for partitions and one for files, both in an S3 bucket. A DynamoDB table is used to
+track the current revision of each file, and each change means writing a whole new file. This means that each change
+takes some time to process, and if two changes happen to the same file at once, it backs out and has to retry. Under
+contention, many retries may happen. It's common for updates to fail entirely due to too many retries, or to take a long
+time.
+
+## Design Summary
+
+Store the file and partitions state in a PostgreSQL database, with a similar structure to the DynamoDB state store.
+
+## Consequences
+
+With a relational database, large queries can be made to present a consistent view of the data. This would avoid the
+consistency issue we have with DynamoDB, but would come with some costs:
+
+- Transaction management and locking
+- Server-based deployment model
+
+### Transaction management and locking
+
+With a relational database, larger transactions involve locking many records. If a larger transaction takes a
+significant amount of time, there is potential for waiting based on these locks. A relational database is similar to
+DynamoDB in that each record needs to be updated individually. It's not clear whether this may result in slower
+performance than we would like with large updates, deadlocks, or other contention issues.
+
+We may also be affected by transaction isolation levels. PostgreSQL defaults to a read committed isolation level. This
+means that during one transaction, if you make multiple queries, the database may change in between those queries, and
+you may see an inconsistent state. This is similar to DynamoDB, except that PostgreSQL also supports higher levels of
+transaction isolation, and larger queries across tables. With higher levels of transaction isolation that produce a
+consistent view of the state, there is potential for serialization failure. For example, it may not be possible for
+PostgreSQL to reconstruct a consistent view of the state at the start of the transaction if the transaction is very
+large or a query is very large. In this case it's necessary to retry a transaction. See the PostgreSQL manual on
+transaction isolation levels:
+
+https://www.postgresql.org/docs/current/transaction-iso.html
+
+### Deployment model
+
+PostgreSQL operates as a cluster of individual server nodes. We can mitigate this by using a service with automatic
+scaling.
+
+Aurora Serverless v2 supports automatic scaling up and down between minimum and maximum limits. If you know Sleeper will
+be idle for a while, we could stop the database and then only be charged for the storage. We already have a concept of
+pausing Sleeper so that the periodic lambdas don't run. With Aurora Serverless this wouldn't be too much different.
+
+This has some differences to the rest of Sleeper, which is designed to scale to zero by default. Aurora Serverless v2
+does not support scaling to zero. This means there would be some persistent costs unless we explicitly pause the Sleeper
+instance and stop the database entirely.
diff --git a/docs/designs/transaction-log-state-store.md b/docs/designs/transaction-log-state-store.md
index 375704371c..1344856d3b 100644
--- a/docs/designs/transaction-log-state-store.md
+++ b/docs/designs/transaction-log-state-store.md
@@ -152,45 +152,6 @@ we can ignore transactions that are not relevant to the update.
 This would add complexity to the way we model the table state, so we may prefer to avoid this. It is an option we could
 consider.
 
-## Comparison to a relational database
-
-With a relational database, large queries can be made to present a consistent view of the data. This would avoid the
-consistency issue we have with DynamoDB, but would come with some costs:
-
-- Transaction management and locking
-- Server-based deployment model
-
-### Transaction management and locking
-
-With a relational database, larger transactions involve locking many records. If a larger transaction takes a
-significant amount of time, there is potential for waiting based on these locks. A relational database is similar to
-DynamoDB in that each record needs to be updated individually. It's not clear whether this may result in slower
-performance than we would like with large updates, deadlocks, or other contention issues.
-
-We may also be affected by transaction isolation levels. PostgreSQL defaults to a read committed isolation level. This
-means that during one transaction, if you make multiple queries, the database may change in between those queries, and
-you may see an inconsistent state. This is similar to DynamoDB, except that PostgreSQL also supports higher levels of
-transaction isolation, and larger queries across tables. With higher levels of transaction isolation that produce a
-consistent view of the state, there is potential for serialization failure. For example, it may not be possible for
-PostgreSQL to reconstruct a consistent view of the state at the start of the transaction if the transaction is very
-large or a query is very large. In this case it's necessary to retry a transaction. See the PostgreSQL manual on
-transaction isolation levels:
-
-https://www.postgresql.org/docs/current/transaction-iso.html
-
-### Deployment model
-
-PostgreSQL operates as a cluster of individual server nodes. We can mitigate this by using a service with automatic
-scaling.
-
-Aurora Serverless v2 supports automatic scaling up and down between minimum and maximum limits. If you know Sleeper will
-be idle for a while, we could stop the database and then only be charged for the storage. We already have a concept of
-pausing Sleeper so that the periodic lambdas don't run. With Aurora Serverless this wouldn't be too much different.
-
-This has some differences to the rest of Sleeper, which is designed to scale to zero by default. Aurora Serverless v2
-does not support scaling to zero. This means there would be some persistent costs unless we explicitly pause the Sleeper
-instance and stop the database entirely.
-
 ## Resources
 
 - [Martin Fowler's article on event sourcing](https://martinfowler.com/eaaDev/EventSourcing.html)

From d9a1af50ccd549e03dea051cbaff458cee9af3d9 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Tue, 30 Jan 2024 15:37:13 +0000
Subject: [PATCH 33/56] Reformat docs/12-design.md

---
 docs/12-design.md | 71 ++++++++++++++++++++++++-----------------------
 1 file changed, 36 insertions(+), 35 deletions(-)

diff --git a/docs/12-design.md b/docs/12-design.md
index 73b82e765e..e6c9ab3834 100644
--- a/docs/12-design.md
+++ b/docs/12-design.md
@@ -57,7 +57,7 @@ sometimes be used to avoid reading a lot of the data).
 All records in a table conform to a schema. The records in a table are stored in multiple files, with each file
 belonging to a partition. These files are all in an S3 bucket that is exclusively used by this table.
 
-Each table has a state store associated to it. This stores metadata about the table, namely the files that are in 
+Each table has a state store associated to it. This stores metadata about the table, namely the files that are in
 the table and how the records in the table are partitioned.
 
 Tables are deployed by the CDK table stack. This stack creates the infrastructure for each table. Each table
@@ -139,36 +139,37 @@ make any difference to the results.
 
 The state store for a table holds information about the files that are currently in the table,
 and how those files are partitioned. Information about files is stored by creating file references, and by
-keeping track of the number of references to a file. A file reference represents a subset of the data in a file that 
-exists entirely within a partition. This means you can have multiple references to the same file, 
+keeping track of the number of references to a file. A file reference represents a subset of the data in a file that
+exists entirely within a partition. This means you can have multiple references to the same file,
 spread across multiple partitions.
 
 The state store allows for information about the file references in a partition to be retrieved,
 new file references to be added, a list of all the partitions to be retrieved, etc. It also allows the results
 of a compaction job to be atomically committed in which the references to the input files are removed,
-and a new reference is created for the output file. Note that a file with no references is still tracked in 
+and a new reference is created for the output file. Note that a file with no references is still tracked in
 the state store.
 
 There are currently two state store implementations, one that stores the data in DynamoDB and one that stores it
 in Parquet files in S3 with a lightweight consistency layer in DynamoDB.
 
-## DynamoDB state store
+### DynamoDB state store
 
-The DynamoDB state store uses three DynamoDB tables to store the state of a table. There is one table for file references, 
-one for the number of references to a file (or the file reference count), and one for information about the partitions 
-in the system. For the file reference and file reference count tables, the primary key is a concatenation of the 
-filename and the partition id. For the partition table, the primary key is simply the id of the partition. Updates to the state 
-that need to be executed atomically are wrapped in DynamoDB transactions. The number of items in a DynamoDB transaction
-is limited to 100. This has implications for the number of files that can be read in a compaction job. When the job finishes,
-the relevant references to the input files that the compaction job has read need to be removed, the output file need to 
-be written, a new file reference to the output file needs to be added to the state store, and the file reference count 
-needs to be updated. This means that at most 49 files can be read by a compaction job if the DynamoDB state store is used.
+The DynamoDB state store uses three DynamoDB tables to store the state of a table. There is one table for file
+references, one for the number of references to a file (or the file reference count), and one for information about the
+partitions in the system. For the file reference and file reference count tables, the primary key is a concatenation of
+the filename and the partition id. For the partition table, the primary key is simply the id of the partition. Updates
+to the state that need to be executed atomically are wrapped in DynamoDB transactions. The number of items in a DynamoDB
+transaction is limited to 100. This has implications for the number of files that can be read in a compaction job. When
+the job finishes, the relevant references to the input files that the compaction job has read need to be removed, the
+output file needs to be written, a new file reference to the output file needs to be added to the state store, and the
+file reference counts need to be updated. This means that at most 49 files can be read by a compaction job if the
+DynamoDB state store is used, as each input file takes two items in the transaction (removing its reference and
+updating its reference count), leaving two items for the output file within the 100-item limit.
 
-## S3 state store
+### S3 state store
 
 This state store stores the state of a table in Parquet files in S3, within the same bucket used to store the data
-for the table. There is one file for information about file references, and one for the partitions. When an update happens
-a new file is written. This new file contains the complete information about the state, i.e., it does not just
+for the table. There is one file for information about file references, and one for the partitions. When an update
+happens, a new file is written. This new file contains the complete information about the state, i.e., it does not just
 contain the updated information. As two processes may attempt to update the information simultaneously, there needs
 to be a consistency mechanism to ensure that only one update can succeed. A table in DynamoDB is used as this
 consistency layer.
@@ -233,7 +234,7 @@ The ingest batcher groups ingest requests for individual files into ingest or bu
 submitted to an SQS queue. The batcher is then triggered periodically to group files into jobs and send them to the
 ingest queue configured for the table. The number of jobs created is determined by the configuration of the batcher.
 
-The files need to be accessible to the relevant ingest system, but are not read directly by the batcher. 
+The files need to be accessible to the relevant ingest system, but are not read directly by the batcher.
 
 An outline of the design of this system is shown below:
 
@@ -243,37 +244,37 @@ An outline of the design of this system is shown below:
 
 The purpose of a compaction job is to read N files and replace them with one file. This process keeps the number
 of files for a partition small, which means the number of files that need to be read in a query is small. The input
-files contain records sorted by key and sort fields, and are filtered so that only data for the current partition is read.
-The data for an input file that exists within a specific partition can be represented by a file reference.
+files contain records sorted by key and sort fields, and are filtered so that only data for the current partition is
+read. The data for an input file that exists within a specific partition can be represented by a file reference.
 The output from the job is a sorted file. As the filtered input files are sorted, it is simple to write out a sorted
 file containing their data. The output file will be written to the same partition that the input files were in.
 Note that the input files for a compaction job must be in the same leaf partition.
 
 When a compaction job finishes, it needs to update the state store to remove the references representing the input
 files in that partition, create a new reference to the output file, and update the relevant file reference counts.
-This update must be done atomically, to avoid clients that are requesting the state of the state store from seeing an 
+This update must be done atomically, to prevent clients that are requesting the state of the state store from seeing an
 inconsistent view.
 
 The CDK compaction stack deploys the infrastructure that is used to create and execute compaction jobs. A compaction
-job reads in N input files and merges them into 1 file. As the input files are all sorted by key, this job is a 
+job reads in N input files and merges them into 1 file. As the input files are all sorted by key, this job is a
 simple streaming merge that requires negligible amounts of memory. The input files are all from a single partition.
 
 There are two separate stages: the creation of compaction jobs, and the execution of those jobs. Compaction jobs
 are created by a lambda that runs the class `sleeper.compaction.job.creation.CreateJobsLambda`. This lambda is
-triggered periodically by a Cloudwatch rule. The lambda iterates through each table. For each table, it performs a 
-pre-splitting operation on file references in the state store. This involves looking for file references that exist 
-within non-leaf partitions, and atomically removing the original reference and creating 2 new references in the 
-child partitions. This only moves file references down one "level" on each execution of the lambda, so the 
+triggered periodically by a Cloudwatch rule. The lambda iterates through each table. For each table, it performs a
+pre-splitting operation on file references in the state store. This involves looking for file references that exist
+within non-leaf partitions, and atomically removing the original reference and creating 2 new references in the
+child partitions. This only moves file references down one "level" on each execution of the lambda, so the
 lambda would need to be invoked multiple times for the file references in the root partition to be moved down to the
 bottom of the tree.
 
-The lambda then queries the state store for information about the partitions and the file 
+The lambda then queries the state store for information about the partitions and the file
 references that do not have a job id (if a file reference has a job id it means that a compaction job has already been
-created for that file). It then uses a compaction strategy to decide what compaction jobs should be created. 
-The compaction strategy can be configured independently for each table. The current compaction strategies will only 
-create compaction jobs for files that are in leaf partitions at the time of creation (meaning a partition split could 
+created for that file). It then uses a compaction strategy to decide what compaction jobs should be created.
+The compaction strategy can be configured independently for each table. The current compaction strategies will only
+create compaction jobs for files that are in leaf partitions at the time of creation (meaning a partition split could
 happen after a job has been created, but before the job has run). Jobs that are created by the strategy are sent
-to an SQS queue. 
+to an SQS queue.
 
 Compaction jobs are executed in containers. Currently, these containers are executed in Fargate tasks, but they could
 be executed on ECS running on EC2 instances, or anywhere that supports running Docker containers. These containers
@@ -286,14 +287,14 @@ number of concurrent compaction tasks is configurable.
 
 ## Garbage collection
 
-A file is ready for garbage collection if there are no longer any references to the file in the state store, 
-and the last update was than N minutes ago, where N is a parameter that can be configured separately for each table. 
-The default value of N is 10 minutes. The reason for not deleting the file immediately as soon as the file no longer 
+A file is ready for garbage collection if there are no longer any references to the file in the state store,
+and the last update was more than N minutes ago, where N is a parameter that can be configured separately for each
+table. The default value of N is 10 minutes. The reason for not deleting the file as soon as the file no longer
 has any references is that it may be being used by queries.
 
 The garbage collector stack is responsible for deleting files that no longer have any references. It consists of
 a Cloudwatch rule that periodically triggers a lambda. This lambda iterates through all the tables. For each table
-it queries the state store to retrieve all the files that do not have any references and have been waiting for 
+it queries the state store to retrieve all the files that do not have any references and have been waiting for
 more than N minutes. These files are then deleted in batches.
 
 ## Queries

From 4c85c04e553c897a4770a553f3dde13977449829 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Tue, 30 Jan 2024 15:37:38 +0000
Subject: [PATCH 34/56] Add potential alternative state stores section in
 docs/12-design.md

---
 docs/12-design.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/12-design.md b/docs/12-design.md
index e6c9ab3834..dd784398aa 100644
--- a/docs/12-design.md
+++ b/docs/12-design.md
@@ -174,6 +174,13 @@ contain the updated information. As two processes may attempt to update the info
 to be a consistency mechanism to ensure that only one update can succeed. A table in DynamoDB is used as this
 consistency layer.
 
+### Potential alternatives
+
+We are considering alternative designs for the state store:
+
+- [A transaction log stored in DynamoDB, with snapshots in S3](designs/transaction-log-state-store.md)
+- [A PostgreSQL database](designs/postgresql-state-store.md)
+
 ## Ingest of data
 
 To ingest data to a table, it is necessary to write files of sorted records. Each file should contain data for one

From 82198064926e9a12b227cb040541b3222aaf231b Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Tue, 30 Jan 2024 15:46:37 +0000
Subject: [PATCH 35/56] Tweak status sections in state store designs

---
 docs/designs/postgresql-state-store.md      | 2 +-
 docs/designs/transaction-log-state-store.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/designs/postgresql-state-store.md b/docs/designs/postgresql-state-store.md
index f21ff92c56..ff4720212c 100644
--- a/docs/designs/postgresql-state-store.md
+++ b/docs/designs/postgresql-state-store.md
@@ -2,7 +2,7 @@
 
 ## Status
 
-Proposed design
+Proposed
 
 ## Context
 
diff --git a/docs/designs/transaction-log-state-store.md b/docs/designs/transaction-log-state-store.md
index 1344856d3b..0448dc8a73 100644
--- a/docs/designs/transaction-log-state-store.md
+++ b/docs/designs/transaction-log-state-store.md
@@ -2,7 +2,7 @@
 
 ## Status
 
-Proposed design
+Proposed
 
 ## Context
 

From bb5655a96f81ae9bd5d5309ada1a185b7c700819 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Tue, 30 Jan 2024 15:49:18 +0000
Subject: [PATCH 36/56] Mention database schema in PostgreSQL summary

---
 docs/designs/postgresql-state-store.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/designs/postgresql-state-store.md b/docs/designs/postgresql-state-store.md
index ff4720212c..ed22390d96 100644
--- a/docs/designs/postgresql-state-store.md
+++ b/docs/designs/postgresql-state-store.md
@@ -24,6 +24,8 @@ time.
 
 Store the file and partitions state in a PostgreSQL database, with a similar structure to the DynamoDB state store.
 
+The database schema may be more normalised than the DynamoDB equivalent. We can consider this during prototyping.
+
 ## Consequences
 
 With a relational database, large queries can be made to present a consistent view of the data. This would avoid the

From eb0994366c4ad43d037c8c4afb563f33dafa3626 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Tue, 30 Jan 2024 15:51:29 +0000
Subject: [PATCH 37/56] Update PostgreSQL consequences section

---
 docs/designs/postgresql-state-store.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/designs/postgresql-state-store.md b/docs/designs/postgresql-state-store.md
index ed22390d96..08f6486164 100644
--- a/docs/designs/postgresql-state-store.md
+++ b/docs/designs/postgresql-state-store.md
@@ -28,7 +28,7 @@ The database schema may be more normalised than the DynamoDB equivalent. We can
 
 ## Consequences
 
-With a relational database, large queries can be made to present a consistent view of the data. This would avoid the
+With a relational database, large queries can be made to present a consistent view of the data. This could avoid the
 consistency issue we have with DynamoDB, but would come with some costs:
 
 - Transaction management and locking

From 47e5f9f978df8e83654ed49c03823a141c3b858f Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Tue, 30 Jan 2024 15:53:16 +0000
Subject: [PATCH 38/56] Update PostgreSQL consequences section for transaction
 management

---
 docs/designs/postgresql-state-store.md | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/docs/designs/postgresql-state-store.md b/docs/designs/postgresql-state-store.md
index 08f6486164..22263410b1 100644
--- a/docs/designs/postgresql-state-store.md
+++ b/docs/designs/postgresql-state-store.md
@@ -45,10 +45,9 @@ We may also be affected by transaction isolation levels. PostgreSQL defaults to
 means that during one transaction, if you make multiple queries, the database may change in between those queries, and
 you may see an inconsistent state. This is similar to DynamoDB, except that PostgreSQL also supports higher levels of
 transaction isolation, and larger queries across tables. With higher levels of transaction isolation that produce a
-consistent view of the state, there is potential for serialization failure. For example, it may not be possible for
-PostgreSQL to reconstruct a consistent view of the state at the start of the transaction if the transaction is very
-large or a query is very large. In this case it's necessary to retry a transaction. See the PostgreSQL manual on
-transaction isolation levels:
+consistent view of the state, there is potential for serialization failure. For example, if a transaction is very large,
+it may not be possible for PostgreSQL to reconstruct a consistent view of the state as it was at the start of the
+transaction. See the PostgreSQL manual on transaction isolation levels:
 
 https://www.postgresql.org/docs/current/transaction-iso.html
 

From 9a1c3c1ee96fb1e1f6a9e8cb7c3d4eb1003a908e Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Tue, 30 Jan 2024 15:55:36 +0000
Subject: [PATCH 39/56] Add a link to Aurora Serverless v2 documentation

---
 docs/designs/postgresql-state-store.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/docs/designs/postgresql-state-store.md b/docs/designs/postgresql-state-store.md
index 22263410b1..7fb69a00e1 100644
--- a/docs/designs/postgresql-state-store.md
+++ b/docs/designs/postgresql-state-store.md
@@ -58,7 +58,10 @@ scaling.
 
 Aurora Serverless v2 supports automatic scaling up and down between minimum and maximum limits. If you know Sleeper will
 be idle for a while, we could stop the database and then only be charged for the storage. We already have a concept of
-pausing Sleeper so that the periodic lambdas don't run. With Aurora Serverless this wouldn't be too much different.
+pausing Sleeper so that the periodic lambdas don't run. With Aurora Serverless this wouldn't be too much different. See
+the AWS documentation:
+
+https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless-v2.how-it-works.html
 
 This has some differences to the rest of Sleeper, which is designed to scale to zero by default. Aurora Serverless v2
 does not support scaling to zero. This means there would be some persistent costs unless we explicitly pause the Sleeper

From e3d413ae8e9f911592d40a5f9f02de52d058cae5 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Tue, 30 Jan 2024 15:56:57 +0000
Subject: [PATCH 40/56] Update PostgreSQL consequences section for transaction
 management

---
 docs/designs/postgresql-state-store.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/designs/postgresql-state-store.md b/docs/designs/postgresql-state-store.md
index 7fb69a00e1..7080a82843 100644
--- a/docs/designs/postgresql-state-store.md
+++ b/docs/designs/postgresql-state-store.md
@@ -39,7 +39,7 @@ consistency issue we have with DynamoDB, but would come with some costs:
 With a relational database, larger transactions involve locking many records. If a larger transaction takes a
 significant amount of time, there is potential for waiting based on these locks. A relational database is similar to
 DynamoDB in that each record needs to be updated individually. It's not clear whether this may result in slower
-performance than we would like with large updates, deadlocks, or other contention issues.
+performance than we would like, deadlocks, or other contention issues.
 
 We may also be affected by transaction isolation levels. PostgreSQL defaults to a read committed isolation level. This
 means that during one transaction, if you make multiple queries, the database may change in between those queries, and

From 58df1c3ac6e965c0ed130ce5e1139ea56455262e Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Mon, 5 Feb 2024 09:33:42 +0000
Subject: [PATCH 41/56] Update section on PostgreSQL transaction management

---
 docs/designs/postgresql-state-store.md | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/docs/designs/postgresql-state-store.md b/docs/designs/postgresql-state-store.md
index 7080a82843..b146320fb1 100644
--- a/docs/designs/postgresql-state-store.md
+++ b/docs/designs/postgresql-state-store.md
@@ -41,13 +41,21 @@ significant amount of time, there is potential for waiting based on these locks.
 DynamoDB in that each record needs to be updated individually. It's not clear whether this may result in slower
 performance than we would like, deadlocks, or other contention issues.
 
-We may also be affected by transaction isolation levels. PostgreSQL defaults to a read committed isolation level. This
-means that during one transaction, if you make multiple queries, the database may change in between those queries, and
-you may see an inconsistent state. This is similar to DynamoDB, except that PostgreSQL also supports higher levels of
-transaction isolation, and larger queries across tables. With higher levels of transaction isolation that produce a
-consistent view of the state, there is potential for serialization failure. For example, if a transaction is very large,
-it may not be possible for PostgreSQL to reconstruct a consistent view of the state as it was at the start of the
-transaction. See the PostgreSQL manual on transaction isolation levels:
+Since PostgreSQL supports larger queries with joins across tables, this should make it possible to produce a consistent
+view of large amounts of data, in contrast to DynamoDB. On the other hand, if we wanted to replicate DynamoDB's
+conditional updates, one way would be to make a query to check the condition, and perform an update within the same
+transaction. This may result in problems with transaction isolation.
+
+PostgreSQL defaults to a read committed isolation level. This means that during one transaction, if you make multiple
+queries, the database may change in between those queries. By default, checking state before an update does not produce
+a conditional update as in DynamoDB.
+
+With higher levels of transaction isolation, you can produce the same behaviour as a conditional update in DynamoDB.
+If a conflicting update occurs at the same time, this will produce a serialization failure. This would require you to
+retry the update as in S3. There may be other solutions to this problem, but this may push us towards keeping
+transactions as small as possible.
+
+See the PostgreSQL manual on transaction isolation levels:
 
 https://www.postgresql.org/docs/current/transaction-iso.html
 

From 335d90cdd48cb56e9eab407b4915c15ea3d76f2d Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Mon, 5 Feb 2024 09:35:03 +0000
Subject: [PATCH 42/56] Update section on PostgreSQL transaction management

---
 docs/designs/postgresql-state-store.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/designs/postgresql-state-store.md b/docs/designs/postgresql-state-store.md
index b146320fb1..f6ecf09e1e 100644
--- a/docs/designs/postgresql-state-store.md
+++ b/docs/designs/postgresql-state-store.md
@@ -52,8 +52,8 @@ a conditional update as in DynamoDB.
 
 With higher levels of transaction isolation, you can produce the same behaviour as a conditional update in DynamoDB.
 If a conflicting update occurs at the same time, this will produce a serialization failure. This would require you to
-retry the update as in S3. There may be other solutions to this problem, but this may push us towards keeping
-transactions as small as possible.
+retry the update. There may be other solutions to this problem, but this may push us towards keeping transactions as
+small as possible.
 
 See the PostgreSQL manual on transaction isolation levels:
 

From ddd5e3f1f3c1f896b57e48daa039ee2cd09be086 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Mon, 5 Feb 2024 09:36:43 +0000
Subject: [PATCH 43/56] Update section on PostgreSQL transaction management

---
 docs/designs/postgresql-state-store.md | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/docs/designs/postgresql-state-store.md b/docs/designs/postgresql-state-store.md
index f6ecf09e1e..4d7088c285 100644
--- a/docs/designs/postgresql-state-store.md
+++ b/docs/designs/postgresql-state-store.md
@@ -37,14 +37,15 @@ consistency issue we have with DynamoDB, but would come with some costs:
 ### Transaction management and locking
 
 With a relational database, larger transactions involve locking many records. If a larger transaction takes a
-significant amount of time, there is potential for waiting based on these locks. A relational database is similar to
-DynamoDB in that each record needs to be updated individually. It's not clear whether this may result in slower
-performance than we would like, deadlocks, or other contention issues.
+significant amount of time, these locks may produce waiting or conflicts. A relational database is similar to DynamoDB
+in that each record needs to be updated individually. It's not clear whether this may result in slower performance than
+we would like, deadlocks, or other contention issues.
 
 Since PostgreSQL supports larger queries with joins across tables, this should make it possible to produce a consistent
-view of large amounts of data, in contrast to DynamoDB. On the other hand, if we wanted to replicate DynamoDB's
-conditional updates, one way would be to make a query to check the condition, and perform an update within the same
-transaction. This may result in problems with transaction isolation.
+view of large amounts of data, in contrast to DynamoDB.
+
+If we wanted to replicate DynamoDB's conditional updates, one way would be to make a query to check the condition, and
+perform an update within the same transaction. This may result in problems with transaction isolation.
 
 PostgreSQL defaults to a read committed isolation level. This means that during one transaction, if you make multiple
 queries, the database may change in between those queries. By default, checking state before an update does not produce

From cee0af730fd6e717a851f8c060647c6e096a1a96 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Mon, 5 Feb 2024 09:49:42 +0000
Subject: [PATCH 44/56] Update section on eventual consistency

---
 docs/designs/transaction-log-state-store.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/designs/transaction-log-state-store.md b/docs/designs/transaction-log-state-store.md
index 0448dc8a73..b190cc7b78 100644
--- a/docs/designs/transaction-log-state-store.md
+++ b/docs/designs/transaction-log-state-store.md
@@ -115,7 +115,7 @@ was successful because there was a period of time before the second writer added
 
 We could design the system to allow for some slack and recover from transactions being undone over a short time period.
 This would be complicated to achieve, although it may allow for improved performance as updates don't need to wait. The
-increase in complexity means this is unlikely to be as practical as an approach where a full ordering is established
+increase in complexity means this may not be as practical as an approach where a full ordering is established
 immediately.
 
 ### Parallel models

From 8a71d8d8841b3c5303f8a17cda48401e6843c584 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Mon, 5 Feb 2024 09:58:22 +0000
Subject: [PATCH 45/56] Update section on DynamoDB queries based on transaction
 log

---
 docs/designs/transaction-log-state-store.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/docs/designs/transaction-log-state-store.md b/docs/designs/transaction-log-state-store.md
index b190cc7b78..4eb5f10401 100644
--- a/docs/designs/transaction-log-state-store.md
+++ b/docs/designs/transaction-log-state-store.md
@@ -126,13 +126,14 @@ transaction log it can be more practical to add alternative models for read or u
 #### DynamoDB queries
 
 The DynamoDB state store has advantages for queries, as we only need to read the relevant parts of the state. If we
-want to retain this benefit, we can store the same DynamoDB structure we use now.
+want to retain this benefit, we could store the same DynamoDB structure we use now.
 
-Similar to the process for S3 snapshots, we could regularly store a snapshot of the table state as items in DynamoDB
-tables, in whatever format is convenient for queries.
+Similar to the process for S3 snapshots, we could regularly store a snapshot of the Sleeper table state as items in
+DynamoDB tables, in whatever format is convenient for queries. One option would be to use the same tables as for the
+DynamoDB state store, but use a snapshot ID instead of the table ID.
 
-If we want this view to be 100% up to date, we could still read the latest transactions that have happened since the
-snapshot, and include that data in the result of any query.
+If we want this view to be 100% up to date, then at query time we could still read the latest transactions that have
+happened since the snapshot, and include that data in the result.
 
 #### Status stores for reporting
 
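A sketch of the query path described in the hunk above, with the store interfaces as assumed placeholders rather than
real Sleeper types: load the snapshot items, then read and apply any transactions written after the snapshot, so the
query result is fully up to date:

```java
import java.util.List;

public class UpToDateQuerySketch {

    /** Placeholder for a queryable view of one Sleeper table's state. */
    interface TableStateView {
        void apply(Object transaction);
    }

    /** Placeholder for the DynamoDB tables holding snapshot items. */
    interface SnapshotStore {
        TableStateView loadLatestSnapshot(String sleeperTableId);

        long transactionNumberOfLatestSnapshot(String sleeperTableId);
    }

    /** Placeholder for the DynamoDB table holding the transaction log. */
    interface TransactionLog {
        List<Object> readTransactionsAfter(String sleeperTableId, long transactionNumber);
    }

    /** Starts from the snapshot, then folds in transactions written since. */
    public static TableStateView queryUpToDate(SnapshotStore snapshots, TransactionLog log, String sleeperTableId) {
        TableStateView view = snapshots.loadLatestSnapshot(sleeperTableId);
        long snapshotNumber = snapshots.transactionNumberOfLatestSnapshot(sleeperTableId);
        for (Object transaction : log.readTransactionsAfter(sleeperTableId, snapshotNumber)) {
            view.apply(transaction);
        }
        return view;
    }
}
```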

From 2d2587a092d7fbc95fd05390d777a1c623df37e1 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Mon, 5 Feb 2024 10:05:15 +0000
Subject: [PATCH 46/56] Update section on modelling state based on a
 transaction log

---
 docs/designs/transaction-log-state-store.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/designs/transaction-log-state-store.md b/docs/designs/transaction-log-state-store.md
index 4eb5f10401..6e711038fe 100644
--- a/docs/designs/transaction-log-state-store.md
+++ b/docs/designs/transaction-log-state-store.md
@@ -62,11 +62,11 @@ model can bring itself up to date by reading only the transactions it hasn't see
 transaction that's already been applied locally. With DynamoDB, consistent reads can enforce that you're really
 up-to-date.
 
-We can also skip to a certain point in the transaction log. We have a separate process whose job is to write regular
-snapshots of the model. Every few minutes, we can take a snapshot against the latest transaction and write a copy of the
-whole model to S3. We can point to it in DynamoDB as in the S3 state store's revision table. This allows any process to
-quickly load the model without needing to read the whole transaction log. Only transactions that happened after the
-snapshot need to be read.
+We can also skip to a certain point in the transaction log. We can have a separate process whose job is to write regular
+snapshots of the model. This can run every few minutes, and write a copy of the whole model to S3. We can point to it in
+DynamoDB, similar to the S3 state store's revision table. This lets us get up to date without reading the whole
+transaction log. We can load the snapshot of the model, then load the transactions that have happened since the
+snapshot, and apply them in memory.
 
 ### Transaction size
 
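The separate snapshot process described in the hunk above might look something like the following sketch; the
interfaces are assumptions standing in for the transaction log, the S3 snapshot files, and the DynamoDB pointer item:

```java
public class SnapshotWriterSketch {

    /** Placeholder for the in-memory model plus the log it catches up from. */
    interface StateModel {
        void bringUpToDate();

        long lastTransactionNumber();

        byte[] serialize();
    }

    /** Placeholder for writing snapshot files to S3. */
    interface SnapshotFileStore {
        String writeSnapshot(byte[] serializedModel);
    }

    /** Placeholder for the DynamoDB item pointing at the latest snapshot. */
    interface SnapshotPointerStore {
        void setLatestSnapshot(String s3Key, long transactionNumber);
    }

    /** Runs on a schedule, e.g. every few minutes. */
    public static void writeSnapshot(StateModel model, SnapshotFileStore files, SnapshotPointerStore pointer) {
        model.bringUpToDate();
        String s3Key = files.writeSnapshot(model.serialize());
        // Point to the new snapshot, similar to the S3 state store's revision table
        pointer.setLatestSnapshot(s3Key, model.lastTransactionNumber());
    }
}
```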

From 7304f75d7860db9eb9c2505b3db0e6c57cec0b11 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Mon, 5 Feb 2024 10:18:35 +0000
Subject: [PATCH 47/56] Update section on modelling state based on a
 transaction log

---
 docs/designs/transaction-log-state-store.md | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/docs/designs/transaction-log-state-store.md b/docs/designs/transaction-log-state-store.md
index 6e711038fe..85dd7c2521 100644
--- a/docs/designs/transaction-log-state-store.md
+++ b/docs/designs/transaction-log-state-store.md
@@ -54,13 +54,12 @@ We'll also look at how this compares to an approach based on a relational databa
 ### Modelling state
 
 The simplest approach is to hold a model in memory for the whole state of a Sleeper table. We can use this one, local
-model for any updates or queries, and bring it up to date based on the ordered sequence of transactions. We just need to
-be able to apply any given transaction to the model.
+model for any updates or queries, and bring it up to date based on the ordered sequence of transactions. We can support
+any transaction that we can apply to the model in memory.
 
-Whenever a change occurs, we create a transaction that we can apply to the model in memory. Anywhere that holds the
-model can bring itself up to date by reading only the transactions it hasn't seen yet, starting from the latest
-transaction that's already been applied locally. With DynamoDB, consistent reads can enforce that you're really
-up-to-date.
+Whenever a change occurs, we create a transaction. Anywhere that holds the model can bring itself up to date by reading
+only the transactions it hasn't seen yet, starting after the latest transaction that's already been applied locally.
+With DynamoDB, consistent reads can enforce that you're really up-to-date.
 
 We can also skip to a certain point in the transaction log. We can have a separate process whose job is to write regular
 snapshots of the model. This can run every few minutes, and write a copy of the whole model to S3. We can point to it in
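
A minimal sketch of the model described in the hunk above, with all names assumed for illustration: any change is a
transaction that can be applied to the in-memory state, and a local copy catches up by applying only the transactions
numbered after the last one it applied:

```java
import java.util.List;

public class TransactionLogModelSketch {

    /** In-memory model of one Sleeper table's partitions and files. */
    static class TableState {
        // partition and file reference collections would live here
    }

    /** Any update we can apply to the model in memory can be a transaction. */
    interface StateTransaction {
        void apply(TableState state);
    }

    /** A transaction together with its position in the ordered log. */
    static class NumberedTransaction {
        final long number;
        final StateTransaction transaction;

        NumberedTransaction(long number, StateTransaction transaction) {
            this.number = number;
            this.transaction = transaction;
        }
    }

    /** Placeholder for the ordered log, e.g. items read consistently from DynamoDB. */
    interface TransactionLogStore {
        List<NumberedTransaction> readTransactionsAfter(long transactionNumber);
    }

    /** A local copy of the state that catches up from the log on demand. */
    static class LocalState {
        private final TransactionLogStore logStore;
        private final TableState state = new TableState();
        private long lastTransactionNumber = 0;

        LocalState(TransactionLogStore logStore) {
            this.logStore = logStore;
        }

        /** Applies only the transactions this copy hasn't seen yet. */
        void bringUpToDate() {
            for (NumberedTransaction next : logStore.readTransactionsAfter(lastTransactionNumber)) {
                next.transaction.apply(state);
                lastTransactionNumber = next.number;
            }
        }
    }
}
```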

From b5c17c89eb9b135cde92a7da577ed3d58d5f60a0 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Tue, 12 Mar 2024 13:25:12 +0000
Subject: [PATCH 48/56] Test GC in MultipleTablesTest

---
 .../dsl/gc/SystemTestGarbageCollection.java   |  2 +-
 .../sleeper/systemtest/dsl/gc/WaitForGC.java  | 33 +++++++++++----
 .../dsl/instance/MultipleTablesTest.java      | 40 +++++++++++++++++++
 .../testutil/drivers/InMemoryCompaction.java  | 10 +----
 4 files changed, 69 insertions(+), 16 deletions(-)

diff --git a/java/system-test/system-test-dsl/src/main/java/sleeper/systemtest/dsl/gc/SystemTestGarbageCollection.java b/java/system-test/system-test-dsl/src/main/java/sleeper/systemtest/dsl/gc/SystemTestGarbageCollection.java
index e66dcc2923..23a1ac0261 100644
--- a/java/system-test/system-test-dsl/src/main/java/sleeper/systemtest/dsl/gc/SystemTestGarbageCollection.java
+++ b/java/system-test/system-test-dsl/src/main/java/sleeper/systemtest/dsl/gc/SystemTestGarbageCollection.java
@@ -39,7 +39,7 @@ public SystemTestGarbageCollection invoke() {
     }
 
     public void waitFor() {
-        WaitForGC.waitUntilNoUnreferencedFiles(instance.getStateStore(),
+        WaitForGC.waitUntilNoUnreferencedFiles(instance,
                 PollWithRetries.intervalAndPollingTimeout(
                         Duration.ofSeconds(5), Duration.ofSeconds(30)));
     }
diff --git a/java/system-test/system-test-dsl/src/main/java/sleeper/systemtest/dsl/gc/WaitForGC.java b/java/system-test/system-test-dsl/src/main/java/sleeper/systemtest/dsl/gc/WaitForGC.java
index d9cf9e2632..107f367791 100644
--- a/java/system-test/system-test-dsl/src/main/java/sleeper/systemtest/dsl/gc/WaitForGC.java
+++ b/java/system-test/system-test-dsl/src/main/java/sleeper/systemtest/dsl/gc/WaitForGC.java
@@ -15,27 +15,37 @@
  */
 package sleeper.systemtest.dsl.gc;
 
+import sleeper.configuration.properties.table.TableProperties;
 import sleeper.core.statestore.StateStore;
 import sleeper.core.statestore.StateStoreException;
 import sleeper.core.util.PollWithRetries;
+import sleeper.systemtest.dsl.instance.SystemTestInstanceContext;
 
 import java.time.Duration;
 import java.time.Instant;
+import java.util.List;
+import java.util.Map;
+
+import static java.util.stream.Collectors.toMap;
+import static java.util.stream.Collectors.toUnmodifiableList;
+import static sleeper.configuration.properties.table.TableProperty.TABLE_ID;
 
 public class WaitForGC {
 
     private WaitForGC() {
     }
 
-    public static void waitUntilNoUnreferencedFiles(StateStore stateStore, PollWithRetries poll) {
+    public static void waitUntilNoUnreferencedFiles(SystemTestInstanceContext instance, PollWithRetries poll) {
+        Map<String, TableProperties> tablesById = instance.streamTableProperties()
+                .collect(toMap(table -> table.get(TABLE_ID), table -> table));
         try {
             poll.pollUntil("no unreferenced files are present", () -> {
-                try {
-                    return stateStore.getReadyForGCFilenamesBefore(Instant.now().plus(Duration.ofDays(1)))
-                            .findAny().isEmpty();
-                } catch (StateStoreException e) {
-                    throw new RuntimeException(e);
-                }
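+                // Stop checking tables once they have no unreferenced files left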
+                List<String> emptyTableIds = tablesById.values().stream()
+                        .filter(table -> hasNoUnreferencedFiles(instance.getStateStore(table)))
+                        .map(table -> table.get(TABLE_ID))
+                        .collect(toUnmodifiableList());
+                emptyTableIds.forEach(tablesById::remove);
+                return tablesById.isEmpty();
             });
         } catch (InterruptedException e) {
             Thread.currentThread().interrupt();
@@ -43,4 +53,13 @@ public static void waitUntilNoUnreferencedFiles(StateStore stateStore, PollWithR
         }
     }
 
+    private static boolean hasNoUnreferencedFiles(StateStore stateStore) {
+        try {
+            return stateStore.getReadyForGCFilenamesBefore(Instant.now().plus(Duration.ofDays(1)))
+                    .findAny().isEmpty();
+        } catch (StateStoreException e) {
+            throw new RuntimeException(e);
+        }
+    }
+
 }
diff --git a/java/system-test/system-test-dsl/src/test/java/sleeper/systemtest/dsl/instance/MultipleTablesTest.java b/java/system-test/system-test-dsl/src/test/java/sleeper/systemtest/dsl/instance/MultipleTablesTest.java
index f45f7a13bb..9b9120d1bd 100644
--- a/java/system-test/system-test-dsl/src/test/java/sleeper/systemtest/dsl/instance/MultipleTablesTest.java
+++ b/java/system-test/system-test-dsl/src/test/java/sleeper/systemtest/dsl/instance/MultipleTablesTest.java
@@ -19,6 +19,7 @@
 import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Test;
 
+import sleeper.compaction.strategy.impl.BasicCompactionStrategy;
 import sleeper.core.partition.PartitionTree;
 import sleeper.core.partition.PartitionsBuilder;
 import sleeper.core.schema.Schema;
@@ -31,8 +32,12 @@
 import java.util.stream.LongStream;
 
 import static org.assertj.core.api.Assertions.assertThat;
+import static sleeper.configuration.properties.table.TableProperty.COMPACTION_FILES_BATCH_SIZE;
+import static sleeper.configuration.properties.table.TableProperty.COMPACTION_STRATEGY_CLASS;
+import static sleeper.configuration.properties.table.TableProperty.GARBAGE_COLLECTOR_DELAY_BEFORE_DELETION;
 import static sleeper.configuration.properties.table.TableProperty.PARTITION_SPLIT_THRESHOLD;
 import static sleeper.core.statestore.FilesReportTestHelper.activeAndReadyForGCFiles;
+import static sleeper.core.statestore.FilesReportTestHelper.activeFiles;
 import static sleeper.core.testutils.printers.FileReferencePrinter.printExpectedFilesForAllTables;
 import static sleeper.core.testutils.printers.FileReferencePrinter.printTableFilesExpectingIdentical;
 import static sleeper.core.testutils.printers.PartitionsPrinter.printExpectedPartitionsForAllTables;
@@ -76,6 +81,41 @@ void shouldIngestOneFileToMultipleTables(SleeperSystemTest sleeper) {
                 .allSatisfy((table, files) -> assertThat(files).hasSize(1));
     }
 
+    @Test
+    void shouldCompactAndGCMultipleTables(SleeperSystemTest sleeper) {
+        // Given we have several tables
+        // And we ingest two source files as separate jobs
+        sleeper.tables().createManyWithProperties(NUMBER_OF_TABLES, schema, Map.of(
+                COMPACTION_STRATEGY_CLASS, BasicCompactionStrategy.class.getName(),
+                COMPACTION_FILES_BATCH_SIZE, "2",
+                GARBAGE_COLLECTOR_DELAY_BEFORE_DELETION, "0"));
+        sleeper.sourceFiles()
+                .createWithNumberedRecords(schema, "file1.parquet", LongStream.range(0, 50))
+                .createWithNumberedRecords(schema, "file2.parquet", LongStream.range(50, 100));
+        sleeper.ingest().byQueue()
+                .sendSourceFilesToAllTables("file1.parquet")
+                .sendSourceFilesToAllTables("file2.parquet")
+                .invokeTask().waitForJobs();
+
+        // When we run compaction and GC
+        sleeper.compaction().createJobs(NUMBER_OF_TABLES).invokeTasks(1).waitForJobs();
+        sleeper.garbageCollection().invoke().waitFor();
+
+        // Then all tables should have one active file with the expected records, and none ready for GC
+        assertThat(sleeper.query().byQueue().allRecordsByTable())
+                .hasSize(NUMBER_OF_TABLES)
+                .allSatisfy(((table, records) -> assertThat(records).containsExactlyElementsOf(
+                        sleeper.generateNumberedRecords(schema, LongStream.range(0, 100)))));
+        var tables = sleeper.tables().list();
+        var partitionsByTable = sleeper.partitioning().treeByTable();
+        var filesByTable = sleeper.tableFiles().filesByTable();
+        PartitionTree expectedPartitions = new PartitionsBuilder(schema).singlePartition("root").buildTree();
+        FileReferenceFactory fileReferenceFactory = FileReferenceFactory.from(expectedPartitions);
+        assertThat(printTableFilesExpectingIdentical(partitionsByTable, filesByTable))
+                .isEqualTo(printExpectedFilesForAllTables(tables, expectedPartitions,
+                        activeFiles(fileReferenceFactory.rootFile("merged.parquet", 100))));
+    }
+
     @Test
     void shouldSplitPartitionsOfMultipleTables(SleeperSystemTest sleeper) {
         // Given we have several tables with a split threshold of 20
diff --git a/java/system-test/system-test-dsl/src/test/java/sleeper/systemtest/dsl/testutil/drivers/InMemoryCompaction.java b/java/system-test/system-test-dsl/src/test/java/sleeper/systemtest/dsl/testutil/drivers/InMemoryCompaction.java
index b8b80302c4..70dbe1f92b 100644
--- a/java/system-test/system-test-dsl/src/test/java/sleeper/systemtest/dsl/testutil/drivers/InMemoryCompaction.java
+++ b/java/system-test/system-test-dsl/src/test/java/sleeper/systemtest/dsl/testutil/drivers/InMemoryCompaction.java
@@ -29,7 +29,6 @@
 import sleeper.compaction.testutils.InMemoryCompactionJobStatusStore;
 import sleeper.compaction.testutils.InMemoryCompactionTaskStatusStore;
 import sleeper.configuration.jars.ObjectFactory;
-import sleeper.configuration.properties.table.FixedTablePropertiesProvider;
 import sleeper.configuration.properties.table.TableProperties;
 import sleeper.configuration.properties.table.TablePropertiesProvider;
 import sleeper.core.iterator.CloseableIterator;
@@ -137,12 +136,12 @@ private void createJobs(Mode mode) {
 
         private CreateCompactionJobs jobCreator(Mode mode) {
             return new CreateCompactionJobs(ObjectFactory.noUserJars(), instance.getInstanceProperties(),
-                    tablePropertiesProvider(instance), instance.getStateStoreProvider(), jobSender(), jobStore, mode);
+                    instance.getTablePropertiesProvider(), instance.getStateStoreProvider(), jobSender(), jobStore, mode);
         }
     }
 
     private void finishJobs(SystemTestInstanceContext instance, String taskId) {
-        TablePropertiesProvider tablesProvider = tablePropertiesProvider(instance);
+        TablePropertiesProvider tablesProvider = instance.getTablePropertiesProvider();
         for (CompactionJob job : queuedJobsById.values()) {
             TableProperties tableProperties = tablesProvider.getById(job.getTableId());
             RecordsProcessedSummary summary = compact(job, tableProperties, instance.getStateStore(tableProperties), taskId);
@@ -212,11 +211,6 @@ private RecordsProcessed mergeInputFiles(CompactionJob job, Partition partition,
                 .sum());
     }
 
-    private static TablePropertiesProvider tablePropertiesProvider(SystemTestInstanceContext instance) {
-        return new FixedTablePropertiesProvider(
-                instance.streamTableProperties().collect(toUnmodifiableList()));
-    }
-
     private CreateCompactionJobs.JobSender jobSender() {
         return job -> queuedJobsById.put(job.getId(), job);
     }

From 1cfffbaad714d50859655f65ec2ac43e7bfea19f Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Tue, 12 Mar 2024 13:27:04 +0000
Subject: [PATCH 49/56] Test GC in MultipleTablesIT

---
 .../systemtest/suite/MultipleTablesIT.java    | 44 ++++++++++++++++++-
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/java/system-test/system-test-suite/src/test/java/sleeper/systemtest/suite/MultipleTablesIT.java b/java/system-test/system-test-suite/src/test/java/sleeper/systemtest/suite/MultipleTablesIT.java
index 4e791e8893..efa20eaddd 100644
--- a/java/system-test/system-test-suite/src/test/java/sleeper/systemtest/suite/MultipleTablesIT.java
+++ b/java/system-test/system-test-suite/src/test/java/sleeper/systemtest/suite/MultipleTablesIT.java
@@ -19,7 +19,9 @@
 import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Test;
 
+import sleeper.compaction.strategy.impl.BasicCompactionStrategy;
 import sleeper.core.partition.PartitionTree;
+import sleeper.core.partition.PartitionsBuilder;
 import sleeper.core.schema.Schema;
 import sleeper.core.statestore.FileReferenceFactory;
 import sleeper.systemtest.dsl.SleeperSystemTest;
@@ -35,8 +37,12 @@
 import static sleeper.configuration.properties.instance.CdkDefinedInstanceProperty.COMPACTION_JOB_QUEUE_URL;
 import static sleeper.configuration.properties.instance.CdkDefinedInstanceProperty.INGEST_JOB_QUEUE_URL;
 import static sleeper.configuration.properties.instance.CdkDefinedInstanceProperty.PARTITION_SPLITTING_JOB_QUEUE_URL;
+import static sleeper.configuration.properties.table.TableProperty.COMPACTION_FILES_BATCH_SIZE;
+import static sleeper.configuration.properties.table.TableProperty.COMPACTION_STRATEGY_CLASS;
+import static sleeper.configuration.properties.table.TableProperty.GARBAGE_COLLECTOR_DELAY_BEFORE_DELETION;
 import static sleeper.configuration.properties.table.TableProperty.PARTITION_SPLIT_THRESHOLD;
 import static sleeper.core.statestore.FilesReportTestHelper.activeAndReadyForGCFiles;
+import static sleeper.core.statestore.FilesReportTestHelper.activeFiles;
 import static sleeper.core.testutils.printers.FileReferencePrinter.printExpectedFilesForAllTables;
 import static sleeper.core.testutils.printers.FileReferencePrinter.printTableFilesExpectingIdentical;
 import static sleeper.core.testutils.printers.PartitionsPrinter.printExpectedPartitionsForAllTables;
@@ -45,7 +51,6 @@
 import static sleeper.systemtest.dsl.sourcedata.GenerateNumberedValue.numberStringAndZeroPadTo;
 import static sleeper.systemtest.dsl.sourcedata.GenerateNumberedValueOverrides.overrideField;
 import static sleeper.systemtest.suite.fixtures.SystemTestInstance.MAIN;
-import static sleeper.systemtest.suite.testutil.PartitionsTestHelper.partitionsBuilder;
 
 @SystemTest
 public class MultipleTablesIT {
@@ -88,6 +93,41 @@ void shouldIngestOneFileToMultipleTables(SleeperSystemTest sleeper) {
                 .allSatisfy((table, files) -> assertThat(files).hasSize(1));
     }
 
+    @Test
+    void shouldCompactAndGCMultipleTables(SleeperSystemTest sleeper) {
+        // Given we have several tables
+        // And we ingest two source files as separate jobs
+        sleeper.tables().createManyWithProperties(NUMBER_OF_TABLES, schema, Map.of(
+                COMPACTION_STRATEGY_CLASS, BasicCompactionStrategy.class.getName(),
+                COMPACTION_FILES_BATCH_SIZE, "2",
+                GARBAGE_COLLECTOR_DELAY_BEFORE_DELETION, "0"));
+        sleeper.sourceFiles()
+                .createWithNumberedRecords(schema, "file1.parquet", LongStream.range(0, 50))
+                .createWithNumberedRecords(schema, "file2.parquet", LongStream.range(50, 100));
+        sleeper.ingest().byQueue()
+                .sendSourceFilesToAllTables("file1.parquet")
+                .sendSourceFilesToAllTables("file2.parquet")
+                .invokeTask().waitForJobs();
+
+        // When we run compaction and GC
+        sleeper.compaction().createJobs(NUMBER_OF_TABLES).invokeTasks(1).waitForJobs();
+        sleeper.garbageCollection().invoke().waitFor();
+
+        // Then all tables should have one active file with the expected records, and none ready for GC
+        assertThat(sleeper.query().byQueue().allRecordsByTable())
+                .hasSize(NUMBER_OF_TABLES)
+                .allSatisfy(((table, records) -> assertThat(records).containsExactlyElementsOf(
+                        sleeper.generateNumberedRecords(schema, LongStream.range(0, 100)))));
+        var tables = sleeper.tables().list();
+        var partitionsByTable = sleeper.partitioning().treeByTable();
+        var filesByTable = sleeper.tableFiles().filesByTable();
+        PartitionTree expectedPartitions = new PartitionsBuilder(schema).singlePartition("root").buildTree();
+        FileReferenceFactory fileReferenceFactory = FileReferenceFactory.from(expectedPartitions);
+        assertThat(printTableFilesExpectingIdentical(partitionsByTable, filesByTable))
+                .isEqualTo(printExpectedFilesForAllTables(tables, expectedPartitions,
+                        activeFiles(fileReferenceFactory.rootFile("merged.parquet", 100))));
+    }
+
     @Test
     void shouldSplitPartitionsOfMultipleTables(SleeperSystemTest sleeper) {
         // Given we have several tables with a split threshold of 20
@@ -118,7 +158,7 @@ void shouldSplitPartitionsOfMultipleTables(SleeperSystemTest sleeper) {
         var tables = sleeper.tables().list();
         var partitionsByTable = sleeper.partitioning().treeByTable();
         var filesByTable = sleeper.tableFiles().filesByTable();
-        PartitionTree expectedPartitions = partitionsBuilder(schema)
+        PartitionTree expectedPartitions = new PartitionsBuilder(schema)
                 .rootFirst("root")
                 .splitToNewChildren("root", "L", "R", "row-50")
                 .splitToNewChildren("L", "LL", "LR", "row-25")

From 93723f1c39ef89ec36d1d717774c85a437824579 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Tue, 12 Mar 2024 14:48:28 +0000
Subject: [PATCH 50/56] Test table metrics for multiple tables

---
 .../dsl/instance/MultipleTablesTest.java      | 29 ++++++++++++++++++
 .../systemtest/suite/MultipleTablesIT.java    | 30 +++++++++++++++++++
 2 files changed, 59 insertions(+)

diff --git a/java/system-test/system-test-dsl/src/test/java/sleeper/systemtest/dsl/instance/MultipleTablesTest.java b/java/system-test/system-test-dsl/src/test/java/sleeper/systemtest/dsl/instance/MultipleTablesTest.java
index 9b9120d1bd..7128a3635f 100644
--- a/java/system-test/system-test-dsl/src/test/java/sleeper/systemtest/dsl/instance/MultipleTablesTest.java
+++ b/java/system-test/system-test-dsl/src/test/java/sleeper/systemtest/dsl/instance/MultipleTablesTest.java
@@ -48,6 +48,7 @@
 import static sleeper.systemtest.dsl.testutil.InMemoryTestInstance.DEFAULT_SCHEMA;
 import static sleeper.systemtest.dsl.testutil.InMemoryTestInstance.ROW_KEY_FIELD_NAME;
 import static sleeper.systemtest.dsl.testutil.InMemoryTestInstance.withDefaultProperties;
+import static sleeper.systemtest.dsl.testutil.SystemTestTableMetricsHelper.tableMetrics;
 
 @InMemoryDslTest
 public class MultipleTablesTest {
@@ -173,4 +174,32 @@ void shouldSplitPartitionsOfMultipleTables(SleeperSystemTest sleeper) {
                         List.of("root", "L", "R", "LL", "LR", "RL", "RR"))));
     }
 
+    @Test
+    void shouldGenerateMetricsForMultipleTables(SleeperSystemTest sleeper) {
+        // Given we have several tables
+        // And we ingest two source files as separate jobs
+        sleeper.tables().createManyWithProperties(NUMBER_OF_TABLES, schema, Map.of(
+                COMPACTION_STRATEGY_CLASS, BasicCompactionStrategy.class.getName(),
+                COMPACTION_FILES_BATCH_SIZE, "2",
+                GARBAGE_COLLECTOR_DELAY_BEFORE_DELETION, "0"));
+        sleeper.sourceFiles()
+                .createWithNumberedRecords(schema, "file1.parquet", LongStream.range(0, 50))
+                .createWithNumberedRecords(schema, "file2.parquet", LongStream.range(50, 100));
+        sleeper.ingest().byQueue()
+                .sendSourceFilesToAllTables("file1.parquet")
+                .sendSourceFilesToAllTables("file2.parquet")
+                .invokeTask().waitForJobs();
+
+        // When we compute table metrics
+        sleeper.tableMetrics().generate();
+
+        // Then each table has the expected metrics
+        sleeper.tables().forEach(() -> {
+            assertThat(sleeper.tableMetrics().get()).isEqualTo(tableMetrics(sleeper)
+                    .partitionCount(1).leafPartitionCount(1)
+                    .fileCount(2).recordCount(100)
+                    .averageActiveFilesPerPartition(2)
+                    .build());
+        });
+    }
 }
diff --git a/java/system-test/system-test-suite/src/test/java/sleeper/systemtest/suite/MultipleTablesIT.java b/java/system-test/system-test-suite/src/test/java/sleeper/systemtest/suite/MultipleTablesIT.java
index efa20eaddd..26da8ee273 100644
--- a/java/system-test/system-test-suite/src/test/java/sleeper/systemtest/suite/MultipleTablesIT.java
+++ b/java/system-test/system-test-suite/src/test/java/sleeper/systemtest/suite/MultipleTablesIT.java
@@ -50,6 +50,7 @@
 import static sleeper.systemtest.dsl.sourcedata.GenerateNumberedValue.addPrefix;
 import static sleeper.systemtest.dsl.sourcedata.GenerateNumberedValue.numberStringAndZeroPadTo;
 import static sleeper.systemtest.dsl.sourcedata.GenerateNumberedValueOverrides.overrideField;
+import static sleeper.systemtest.dsl.testutil.SystemTestTableMetricsHelper.tableMetrics;
 import static sleeper.systemtest.suite.fixtures.SystemTestInstance.MAIN;
 
 @SystemTest
@@ -184,4 +185,33 @@ void shouldSplitPartitionsOfMultipleTables(SleeperSystemTest sleeper) {
                                 fileReferenceFactory.partitionFile("RRR", 13)),
                         List.of("root", "L", "R", "LL", "LR", "RL", "RR"))));
     }
+
+    @Test
+    void shouldGenerateMetricsForMultipleTables(SleeperSystemTest sleeper) {
+        // Given we have several tables
+        // And we ingest two source files as separate jobs
+        sleeper.tables().createManyWithProperties(NUMBER_OF_TABLES, schema, Map.of(
+                COMPACTION_STRATEGY_CLASS, BasicCompactionStrategy.class.getName(),
+                COMPACTION_FILES_BATCH_SIZE, "2",
+                GARBAGE_COLLECTOR_DELAY_BEFORE_DELETION, "0"));
+        sleeper.sourceFiles()
+                .createWithNumberedRecords(schema, "file1.parquet", LongStream.range(0, 50))
+                .createWithNumberedRecords(schema, "file2.parquet", LongStream.range(50, 100));
+        sleeper.ingest().byQueue()
+                .sendSourceFilesToAllTables("file1.parquet")
+                .sendSourceFilesToAllTables("file2.parquet")
+                .invokeTask().waitForJobs();
+
+        // When we compute table metrics
+        sleeper.tableMetrics().generate();
+
+        // Then each table has the expected metrics
+        sleeper.tables().forEach(() -> {
+            assertThat(sleeper.tableMetrics().get()).isEqualTo(tableMetrics(sleeper)
+                    .partitionCount(1).leafPartitionCount(1)
+                    .fileCount(2).recordCount(100)
+                    .averageActiveFilesPerPartition(2)
+                    .build());
+        });
+    }
 }

From fb44ed9110bcf3e067535eafd3a079905fdd90aa Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Tue, 12 Mar 2024 14:49:13 +0000
Subject: [PATCH 51/56] Increase number of tables in MultipleTablesIT

---
 .../test/java/sleeper/systemtest/suite/MultipleTablesIT.java    | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/java/system-test/system-test-suite/src/test/java/sleeper/systemtest/suite/MultipleTablesIT.java b/java/system-test/system-test-suite/src/test/java/sleeper/systemtest/suite/MultipleTablesIT.java
index 26da8ee273..7002d815d6 100644
--- a/java/system-test/system-test-suite/src/test/java/sleeper/systemtest/suite/MultipleTablesIT.java
+++ b/java/system-test/system-test-suite/src/test/java/sleeper/systemtest/suite/MultipleTablesIT.java
@@ -56,7 +56,7 @@
 @SystemTest
 public class MultipleTablesIT {
     private final Schema schema = SystemTestSchema.DEFAULT_SCHEMA;
-    private static final int NUMBER_OF_TABLES = 5;
+    private static final int NUMBER_OF_TABLES = 200;
 
     @BeforeEach
     void setUp(SleeperSystemTest sleeper, AfterTestPurgeQueues purgeQueues) {

From 106a8dcf3bfe3e4a070c41d40abcbd9e425d93c1 Mon Sep 17 00:00:00 2001
From: patchwork01 <110390516+patchwork01@users.noreply.github.com>
Date: Wed, 13 Mar 2024 15:34:56 +0000
Subject: [PATCH 52/56] Tag MultipleTablesIT as slow

---
 .../test/java/sleeper/systemtest/suite/MultipleTablesIT.java    | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/java/system-test/system-test-suite/src/test/java/sleeper/systemtest/suite/MultipleTablesIT.java b/java/system-test/system-test-suite/src/test/java/sleeper/systemtest/suite/MultipleTablesIT.java
index 7002d815d6..fa72d3b44a 100644
--- a/java/system-test/system-test-suite/src/test/java/sleeper/systemtest/suite/MultipleTablesIT.java
+++ b/java/system-test/system-test-suite/src/test/java/sleeper/systemtest/suite/MultipleTablesIT.java
@@ -27,6 +27,7 @@
 import sleeper.systemtest.dsl.SleeperSystemTest;
 import sleeper.systemtest.dsl.extension.AfterTestPurgeQueues;
 import sleeper.systemtest.suite.fixtures.SystemTestSchema;
+import sleeper.systemtest.suite.testutil.Slow;
 import sleeper.systemtest.suite.testutil.SystemTest;
 
 import java.util.List;
@@ -54,6 +55,7 @@
 import static sleeper.systemtest.suite.fixtures.SystemTestInstance.MAIN;
 
 @SystemTest
+@Slow // Slow because compactions run for 200 tables in one task
 public class MultipleTablesIT {
     private final Schema schema = SystemTestSchema.DEFAULT_SCHEMA;
     private static final int NUMBER_OF_TABLES = 200;

From 05c5b64c2c01489dbc3b4c2370f372f0b0ad8ca9 Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Thu, 14 Mar 2024 03:37:37 +0000
Subject: [PATCH 53/56] Bump org.java-websocket:Java-WebSocket from 1.5.3 to
 1.5.6 in /java

Bumps [org.java-websocket:Java-WebSocket](https://github.com/TooTallNate/Java-WebSocket) from 1.5.3 to 1.5.6.
- [Release notes](https://github.com/TooTallNate/Java-WebSocket/releases)
- [Changelog](https://github.com/TooTallNate/Java-WebSocket/blob/master/CHANGELOG.md)
- [Commits](https://github.com/TooTallNate/Java-WebSocket/compare/v1.5.3...v1.5.6)

---
updated-dependencies:
- dependency-name: org.java-websocket:Java-WebSocket
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
---
 java/pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/java/pom.xml b/java/pom.xml
index 956194337c..ce23452cd5 100644
--- a/java/pom.xml
+++ b/java/pom.xml
@@ -136,7 +136,7 @@
         <datasketches.version>3.3.0</datasketches.version>
         <slf4j.version>2.0.12</slf4j.version>
         <reload4j.version>1.2.24</reload4j.version>
-        <java-websocket.version>1.5.3</java-websocket.version>
+        <java-websocket.version>1.5.6</java-websocket.version>
         <arrow.version>11.0.0</arrow.version>
         <bouncycastle.version>1.75</bouncycastle.version>
         <athena.version>2023.3.1</athena.version>

From d286697b8df92c3f6cb020262904463b4e3aa7f8 Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Thu, 14 Mar 2024 03:38:12 +0000
Subject: [PATCH 54/56] Bump aws-java-sdk-v2.version from 2.20.95 to 2.25.9 in
 /java

Bumps `aws-java-sdk-v2.version` from 2.20.95 to 2.25.9.

Updates `software.amazon.awssdk:s3` from 2.20.95 to 2.25.9

Updates `software.amazon.awssdk:s3-transfer-manager` from 2.20.95 to 2.25.9

Updates `software.amazon.awssdk:cloudwatch` from 2.20.95 to 2.25.9

Updates `software.amazon.awssdk:lambda` from 2.20.95 to 2.25.9

Updates `software.amazon.awssdk:cloudwatchlogs` from 2.20.95 to 2.25.9

Updates `software.amazon.awssdk:cloudformation` from 2.20.95 to 2.25.9

Updates `software.amazon.awssdk:emrserverless` from 2.20.95 to 2.25.9

Updates `software.amazon.awssdk:apache-client` from 2.20.95 to 2.25.9

---
updated-dependencies:
- dependency-name: software.amazon.awssdk:s3
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: software.amazon.awssdk:s3-transfer-manager
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: software.amazon.awssdk:cloudwatch
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: software.amazon.awssdk:lambda
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: software.amazon.awssdk:cloudwatchlogs
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: software.amazon.awssdk:cloudformation
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: software.amazon.awssdk:emrserverless
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: software.amazon.awssdk:apache-client
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
---
 java/pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/java/pom.xml b/java/pom.xml
index 956194337c..f909ff8385 100644
--- a/java/pom.xml
+++ b/java/pom.xml
@@ -110,7 +110,7 @@
         <!-- jungrapht-visualization-samples also declares an old version, which is a dependency of the build module. -->
         <logback.version>1.4.14</logback.version>
         <aws-java-sdk.version>1.12.498</aws-java-sdk.version>
-        <aws-java-sdk-v2.version>2.20.95</aws-java-sdk-v2.version>
+        <aws-java-sdk-v2.version>2.25.9</aws-java-sdk-v2.version>
         <aws-crt.version>0.22.2</aws-crt.version>
         <aws-lambda-java-events.version>3.11.2</aws-lambda-java-events.version>
         <aws-lambda-java-core.version>1.2.2</aws-lambda-java-core.version>

From dae5a95480eb98d3a24f815e222f912020e27c2d Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Thu, 14 Mar 2024 03:39:18 +0000
Subject: [PATCH 55/56] Bump jackson.version from 2.16.2 to 2.17.0 in /java

Bumps `jackson.version` from 2.16.2 to 2.17.0.

Updates `com.fasterxml.jackson.core:jackson-core` from 2.16.2 to 2.17.0
- [Commits](https://github.com/FasterXML/jackson-core/compare/jackson-core-2.16.2...jackson-core-2.17.0)

Updates `com.fasterxml.jackson.core:jackson-databind` from 2.16.2 to 2.17.0
- [Commits](https://github.com/FasterXML/jackson/commits)

Updates `com.fasterxml.jackson.core:jackson-annotations` from 2.16.2 to 2.17.0
- [Commits](https://github.com/FasterXML/jackson/commits)

Updates `com.fasterxml.jackson.core:jackson-datatype-guava` from 2.16.2 to 2.17.0

Updates `com.fasterxml.jackson.core:jackson-datatype-joda` from 2.16.2 to 2.17.0

Updates `com.fasterxml.jackson.module:jackson-module-scala_2.12` from 2.16.2 to 2.17.0
- [Changelog](https://github.com/FasterXML/jackson-module-scala/blob/2.17/release.sbt)
- [Commits](https://github.com/FasterXML/jackson-module-scala/compare/v2.16.2...v2.17.0)

Updates `com.fasterxml.jackson.module:jackson-module-jaxb-annotations` from 2.16.2 to 2.17.0
- [Commits](https://github.com/FasterXML/jackson-modules-base/compare/jackson-modules-base-2.16.2...jackson-modules-base-2.17.0)

Updates `com.fasterxml.jackson.dataformat:jackson-dataformat-cbor` from 2.16.2 to 2.17.0
- [Commits](https://github.com/FasterXML/jackson-dataformats-binary/compare/jackson-dataformats-binary-2.16.2...jackson-dataformats-binary-2.17.0)

Updates `com.fasterxml.jackson.datatype:jackson-datatype-guava` from 2.16.2 to 2.17.0
- [Commits](https://github.com/FasterXML/jackson-datatypes-collections/compare/jackson-datatypes-collections-2.16.2...jackson-datatypes-collections-2.17.0)

Updates `com.fasterxml.jackson.datatype:jackson-datatype-joda` from 2.16.2 to 2.17.0
- [Commits](https://github.com/FasterXML/jackson-datatype-joda/compare/jackson-datatype-joda-2.16.2...jackson-datatype-joda-2.17.0)

Updates `com.fasterxml.jackson.datatype:jackson-datatype-jsr310` from 2.16.2 to 2.17.0

Updates `com.fasterxml.jackson.jaxrs:jackson-jaxrs-base` from 2.16.2 to 2.17.0
- [Commits](https://github.com/FasterXML/jackson-jaxrs-providers/compare/jackson-jaxrs-providers-2.16.2...jackson-jaxrs-providers-2.17.0)

Updates `com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider` from 2.16.2 to 2.17.0

Updates `com.fasterxml.jackson.dataformat:jackson-dataformat-yaml` from 2.16.2 to 2.17.0
- [Commits](https://github.com/FasterXML/jackson-dataformats-text/compare/jackson-dataformats-text-2.16.2...jackson-dataformats-text-2.17.0)

Updates `com.fasterxml.jackson.dataformat:jackson-dataformat-xml` from 2.16.2 to 2.17.0
- [Commits](https://github.com/FasterXML/jackson-dataformat-xml/compare/jackson-dataformat-xml-2.16.2...jackson-dataformat-xml-2.17.0)

---
updated-dependencies:
- dependency-name: com.fasterxml.jackson.core:jackson-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: com.fasterxml.jackson.core:jackson-databind
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: com.fasterxml.jackson.core:jackson-annotations
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: com.fasterxml.jackson.core:jackson-datatype-guava
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: com.fasterxml.jackson.core:jackson-datatype-joda
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: com.fasterxml.jackson.module:jackson-module-scala_2.12
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: com.fasterxml.jackson.module:jackson-module-jaxb-annotations
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: com.fasterxml.jackson.dataformat:jackson-dataformat-cbor
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: com.fasterxml.jackson.datatype:jackson-datatype-guava
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: com.fasterxml.jackson.datatype:jackson-datatype-joda
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: com.fasterxml.jackson.datatype:jackson-datatype-jsr310
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: com.fasterxml.jackson.jaxrs:jackson-jaxrs-base
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: com.fasterxml.jackson.dataformat:jackson-dataformat-yaml
  dependency-type: direct:production
  update-type: version-update:semver-minor
- dependency-name: com.fasterxml.jackson.dataformat:jackson-dataformat-xml
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
---
 java/pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/java/pom.xml b/java/pom.xml
index 956194337c..d790448c75 100644
--- a/java/pom.xml
+++ b/java/pom.xml
@@ -123,7 +123,7 @@
         <commons-text.version>1.10.0</commons-text.version>
         <janino.version>3.1.12</janino.version>
         <commons-net.version>3.9.0</commons-net.version>
-        <jackson.version>2.16.2</jackson.version>
+        <jackson.version>2.17.0</jackson.version>
         <!-- Trino integration uses a different version of JJWT, this is the version used in the build module -->
         <jjwt.build.version>0.12.5</jjwt.build.version>
         <facebook.collections.version>0.1.32</facebook.collections.version>

From 584a5dffd86cbda481a52c87f6e19effff319f61 Mon Sep 17 00:00:00 2001
From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com>
Date: Thu, 14 Mar 2024 03:39:44 +0000
Subject: [PATCH 56/56] Bump testcontainers.version from 1.19.0 to 1.19.7 in
 /java

Bumps `testcontainers.version` from 1.19.0 to 1.19.7.

Updates `org.testcontainers:localstack` from 1.19.0 to 1.19.7
- [Release notes](https://github.com/testcontainers/testcontainers-java/releases)
- [Changelog](https://github.com/testcontainers/testcontainers-java/blob/main/CHANGELOG.md)
- [Commits](https://github.com/testcontainers/testcontainers-java/compare/1.19.0...1.19.7)

Updates `org.testcontainers:testcontainers` from 1.19.0 to 1.19.7
- [Release notes](https://github.com/testcontainers/testcontainers-java/releases)
- [Changelog](https://github.com/testcontainers/testcontainers-java/blob/main/CHANGELOG.md)
- [Commits](https://github.com/testcontainers/testcontainers-java/compare/1.19.0...1.19.7)

Updates `org.testcontainers:junit-jupiter` from 1.19.0 to 1.19.7
- [Release notes](https://github.com/testcontainers/testcontainers-java/releases)
- [Changelog](https://github.com/testcontainers/testcontainers-java/blob/main/CHANGELOG.md)
- [Commits](https://github.com/testcontainers/testcontainers-java/compare/1.19.0...1.19.7)

---
updated-dependencies:
- dependency-name: org.testcontainers:localstack
  dependency-type: direct:production
  update-type: version-update:semver-patch
- dependency-name: org.testcontainers:testcontainers
  dependency-type: direct:production
  update-type: version-update:semver-patch
- dependency-name: org.testcontainers:junit-jupiter
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
---
 java/pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/java/pom.xml b/java/pom.xml
index 956194337c..d98cf1256b 100644
--- a/java/pom.xml
+++ b/java/pom.xml
@@ -157,7 +157,7 @@
         <junit.version>5.10.2</junit.version>
         <junit.platform.version>1.10.1</junit.platform.version>
         <mockito.version>4.11.0</mockito.version>
-        <testcontainers.version>1.19.0</testcontainers.version>
+        <testcontainers.version>1.19.7</testcontainers.version>
         <wiremock.version>2.35.0</wiremock.version>
         <assertj.version>3.24.1</assertj.version>
         <approvaltests.version>22.3.3</approvaltests.version>