H3 Multi-Cloud Storage Architecture #207

trieloff · 2021-07-07T10:03:31Z

trieloff
Jul 7, 2021
Maintainer

Having the automated DNS-based multi-CDN load-balancing/fallback somewhat nailed (see show & tell), I'd like to get some feedback on the storage architecture.

The Goal

content can be served from S3 and Google Cloud Storage (GCS)
AWS Lambda-based delivery only needs to access S3, Google Cloud Functions (GCF)-based delivery only needs to access GCS
Both storage locations are eventually consistent

Nice to have

publishing can continue even when one cloud is unavailable

Approaches

Primary/Replica Architecture

(see here for terminology)

S3 would be the primary storage, GCS, the replica.

helix-admin updates S3, as before
each S3 bucket gets a lambda function triggered on object change that copies the object to the corresponding GCS bucket
a regular (daily) Google Cloud Storage Transfer Service job is scheduled for each bucket that will allow GCS to catch up after a GCP outage
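As a rough illustration of the steps above, here is a minimal sketch that uses in-memory stand-ins instead of real buckets. `Bucket`, `on_object_change` and `scheduled_catch_up` are hypothetical names, not part of helix-admin or any cloud SDK, and the daily Storage Transfer Service job is approximated by a plain loop.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    """In-memory stand-in for an S3 or GCS bucket (hypothetical, not a real SDK)."""
    name: str
    objects: dict = field(default_factory=dict)
    available: bool = True

    def put(self, key, body):
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        self.objects[key] = body


def on_object_change(primary, replica, key):
    """Per-object trigger: copy one changed object from primary to replica."""
    try:
        replica.put(key, primary.objects[key])
        return True
    except ConnectionError:
        # Nothing is queued here; the scheduled job repairs the gap later.
        return False


def scheduled_catch_up(primary, replica):
    """Scheduled job: copy every object that is missing or different on the replica."""
    repaired = []
    for key, body in primary.objects.items():
        if replica.objects.get(key) != body:
            replica.put(key, body)
            repaired.append(key)
    return sorted(repaired)


s3 = Bucket("s3")
gcs = Bucket("gcs")

s3.put("index.md", b"v1")
on_object_change(s3, gcs, "index.md")

gcs.available = False                 # simulated GCP outage
s3.put("about.md", b"v1")
copied = on_object_change(s3, gcs, "about.md")

gcs.available = True                  # outage over, scheduled job runs
repaired = scheduled_catch_up(s3, gcs)
```

The sketch ignores deletions; a real trigger would also have to propagate object removals.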

Discussion

+ simple
- in case of an AWS outage, publishing is blocked

Dual-Write Architecture

helix-admin updates S3 and GCS
- if both updates fail, the publish operation fails
- if GCS update fails, a job will be added to a catch-up queue in AWS
- if S3 update fails, a job will be added to a catch-up queue in GCP
both GCP and AWS have scheduled jobs that handle the catch-up queue:
- get the first entry, fetch it from your own storage and send it to the foreign storage
- if that fails, add the job back to the end of the queue
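The dual-write flow and its catch-up queues could look roughly like the sketch below. Everything here is an assumption for illustration: `Store` is an in-memory stand-in for a bucket, the queues are plain `deque` objects rather than a managed queueing service, and the scheduled job processes every entry currently in the queue once instead of only the first one.

```python
from collections import deque


class Store:
    """In-memory stand-in for one cloud's bucket (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.objects = {}
        self.available = True

    def put(self, key, body):
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        self.objects[key] = body


def publish(key, body, s3, gcs, aws_queue, gcp_queue):
    """Write to both stores; queue a catch-up job in the cloud that succeeded."""
    ok = {}
    for store in (s3, gcs):
        try:
            store.put(key, body)
            ok[store.name] = True
        except ConnectionError:
            ok[store.name] = False
    if not ok["s3"] and not ok["gcs"]:
        raise RuntimeError("publish failed: both stores rejected the write")
    if not ok["gcs"]:
        aws_queue.append(key)   # AWS will later push this key to GCS
    if not ok["s3"]:
        gcp_queue.append(key)   # GCP will later push this key to S3


def run_catch_up(queue, own, foreign):
    """Scheduled job: try each entry currently in the queue once."""
    for _ in range(len(queue)):
        key = queue.popleft()
        try:
            foreign.put(key, own.objects[key])
        except ConnectionError:
            queue.append(key)   # back to the end of the queue, retry next run


s3, gcs = Store("s3"), Store("gcs")
aws_queue, gcp_queue = deque(), deque()

gcs.available = False
publish("index.md", b"v2", s3, gcs, aws_queue, gcp_queue)
run_catch_up(aws_queue, s3, gcs)      # GCS still down: job is re-queued
pending_during_outage = list(aws_queue)

gcs.available = True
run_catch_up(aws_queue, s3, gcs)      # GCS back: queue drains
```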

Discussion

+ high availability
- helix-admin needs to talk to two clouds
- increased complexity (I'd need @dominique-pfister's queueing expertise)

Delayed Replication

helix-admin updates whatever cloud it runs in (GCF->GCS, Lambda->S3)
a sync function is triggered on each write in each bucket
- if the foreign object is missing, or is different from and older than the own object, update it in the foreign storage
- if the operation fails, add the object to the catch-up queue
same catch-up queue handling as in the Dual-Write Architecture
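A minimal sketch of the per-write sync function described above, again with in-memory stand-ins. `Obj`, `Store`, `needs_update` and `sync_on_write` are hypothetical names, and the timestamps are plain numbers rather than real last-modified metadata.

```python
from collections import deque
from typing import NamedTuple


class Obj(NamedTuple):
    body: bytes
    modified: float   # last-modified time, arbitrary units


class Store:
    """In-memory stand-in for one cloud's bucket (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.objects = {}
        self.available = True

    def _check(self):
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")

    def get(self, key):
        self._check()
        return self.objects.get(key)   # None when the object is missing

    def put(self, key, obj):
        self._check()
        self.objects[key] = obj


def needs_update(own, foreign):
    """True if the foreign copy is missing, or different from and older than ours."""
    if foreign is None:
        return True
    return foreign.body != own.body and foreign.modified < own.modified


def sync_on_write(key, own, foreign, queue):
    """Triggered on each write; returns 'copied', 'skipped' or 'queued'."""
    own_obj = own.objects[key]
    try:
        if not needs_update(own_obj, foreign.get(key)):
            return "skipped"
        foreign.put(key, own_obj)
        return "copied"
    except ConnectionError:
        queue.append(key)              # handled later by the catch-up job
        return "queued"


s3, gcs = Store("s3"), Store("gcs")
queue = deque()

s3.put("index.md", Obj(b"new", 200.0))
gcs.put("index.md", Obj(b"old", 100.0))
first = sync_on_write("index.md", s3, gcs, queue)   # foreign copy is stale
echo = sync_on_write("index.md", gcs, s3, queue)    # trigger fired by the copy

gcs.available = False
s3.put("about.md", Obj(b"x", 300.0))
third = sync_on_write("about.md", s3, gcs, queue)
```

The `echo` call shows why the comparison matters: the copy into GCS fires the GCS-side trigger, and the "different and older" check is what stops the two functions from bouncing the object back and forth.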

Discussion

+ high availability
+ helix-admin can stay single-cloud
+ can evolve from each of the previous two approaches
- combines the complexity of the previous two approaches

Dual-Write with Self-Invocation

(suggested by @tripod here: #207 (comment))

helix-admin updates its own cloud storage
helix-admin then self-invokes on the other cloud
with some additional parameter to stop recursion
after an outage, we run a manual re-sync from the live cloud to the dead cloud
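The self-invocation idea with a recursion guard might be sketched as follows. The `Admin` class, the `forwarded` flag and the shape of the result are assumptions made for illustration; the thread only says "some additional parameter".

```python
class Admin:
    """Hypothetical stand-in for one deployment of helix-admin."""

    def __init__(self, cloud):
        self.cloud = cloud
        self.storage = {}
        self.peer = None        # the deployment on the other cloud
        self.available = True
        self.calls = 0

    def handle(self, payload, forwarded=False):
        if not self.available:
            raise ConnectionError(f"{self.cloud} is unavailable")
        self.calls += 1
        self.storage[payload["path"]] = payload["body"]
        if forwarded or self.peer is None:
            # A forwarded call must not invoke the peer again.
            return {"written": [self.cloud], "missed": []}
        try:
            self.peer.handle(payload, forwarded=True)
            return {"written": [self.cloud, self.peer.cloud], "missed": []}
        except ConnectionError:
            # No queue: the gap stays until a manual re-sync is run.
            return {"written": [self.cloud], "missed": [self.peer.cloud]}


aws, gcp = Admin("aws"), Admin("gcp")
aws.peer, gcp.peer = gcp, aws

r1 = aws.handle({"path": "/index.md", "body": "v1"})

gcp.available = False
r2 = aws.handle({"path": "/about.md", "body": "v1"})
```

Each deployment only ever writes to its own storage, which is what keeps the credentials for the other cloud out of the picture.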

Discussion

+ simple, no queues
+ no cross-cloud contamination of secrets/access keys
+ high availability
- clouds will go out of sync over time: #207 (reply in thread)

Summary

I'd start with the simplest approach (Primary/Replica) and evolve it into the Delayed Replication when it becomes expedient, e.g. after a prolonged AWS downtime during which one of our customers still had a pressing need to publish.

trieloff · 2021-07-07T10:08:45Z

trieloff
Jul 7, 2021
Maintainer Author

Bonus: we should use that opportunity to first mirror the media bus from Azure to S3 and GCS and then migrate it entirely.

There's a good chance that we can just rsync as a first shot: https://cloud.google.com/storage/docs/gsutil/commands/rsync

0 replies

tripodsan · 2021-07-07T10:24:37Z

tripodsan
Jul 7, 2021
Maintainer

I'm for the Dual-Write approach, but with helix-admin executed twice, once on AWS and once on GCloud, i.e. we don't need to store the credentials for 'the other' storage as secrets. Otherwise, we don't have a properly separated stack
(admin (-> content-bus) -> word2md -> storage)

if one cloud is down during an outage, we run a sync job afterwards to catch up on what was missed.

In order to execute the admin twice, the admin will invoke itself on the other cloud with the same payload.

FWIW, I think we should eventually change the architecture to be task based, i.e. any invocation of admin just creates tasks in some queues that get processed.

for example a preview action could be:

create preview task
preview task gets executed, fetches the content from word2md, stores it in s3, creates purge task
purge task gets executed

with the dual-write system, the admin could also write the task to the queue in the other cloud.
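The preview example above could be modelled roughly like this. The task dictionaries, the handler names and the stubbed word2md fetch are all assumptions; a real setup would use a managed queue rather than a `deque`.

```python
from collections import deque


def run_tasks(queue, handlers):
    """Process tasks in order; a handler may return follow-up tasks."""
    log = []
    while queue:
        task = queue.popleft()
        follow_ups = handlers[task["type"]](task)
        log.append(task["type"])
        queue.extend(follow_ups)
    return log


storage = {}   # stand-in for the S3 bucket
purged = []    # stand-in for the CDN purge calls


def preview(task):
    # Fetching from word2md is stubbed out with a placeholder string.
    storage[task["path"]] = f"markdown for {task['path']}"
    return [{"type": "purge", "path": task["path"]}]


def purge(task):
    purged.append(task["path"])
    return []


queue = deque([{"type": "preview", "path": "/index.md"}])
log = run_tasks(queue, {"preview": preview, "purge": purge})
```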

5 replies

rofe Jul 7, 2021
Maintainer

change the architecture to be task based

Would that still be synchronous from the client perspective? Otherwise it would increase the complexity to let the sidekick know when it is ok to proceed (redirect, etc) after an action.

trieloff Jul 7, 2021
Maintainer Author

Agree that the task-based architecture would be neat. Sidekick would POST a task, then poll the URL for the status.
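A small sketch of the POST-then-poll interaction, under stated assumptions: `TaskService`, the `/tasks/<id>` status URL and the status strings are hypothetical, and the worker is driven by hand instead of running in the background.

```python
import itertools


class TaskService:
    """In-memory stand-in for a task-based admin API (hypothetical)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._status = {}

    def post(self, action, path):
        task_id = next(self._ids)
        self._status[task_id] = "pending"
        return f"/tasks/{task_id}"          # status URL handed to the client

    def work(self):
        """Pretend a worker finished every pending task."""
        for task_id, status in self._status.items():
            if status == "pending":
                self._status[task_id] = "done"

    def get(self, url):
        return self._status[int(url.rsplit("/", 1)[1])]


def poll(service, url, attempts, between_polls=lambda: None):
    """Client side: check the status URL until the task leaves 'pending'."""
    for _ in range(attempts):
        status = service.get(url)
        if status != "pending":
            return status
        between_polls()                      # a real client would sleep here
    return "timeout"


service = TaskService()
url = service.post("preview", "/index.md")
status_right_after_post = service.get(url)
status = poll(service, url, attempts=3, between_polls=service.work)

stuck_url = service.post("preview", "/about.md")
stuck = poll(service, stuck_url, attempts=2)   # no worker runs in between
```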

trieloff Jul 7, 2021
Maintainer Author

One challenge that I see with Dual Write with Self-Invocation is that the lack of a catch-up queue means that there will be inconsistencies that are hard to detect over time. This is a much more likely failure scenario than the full cloud outage:

Lambda invokes the Google function, but because of reasons (network|overload|lack of interest), Google won't execute before the Lambda deadline approaches.
The object now exists in S3, but not GCS
Customers experience inconsistent sites when the CDN changes
We don't know what's going on because there wasn't a Google outage
The only way to find out is to run a complete scan/re-sync

My original proposal for dual-write is a bit more resilient against this failure mode, thanks to the catch-up queues.

tripodsan Jul 8, 2021
Maintainer

Lambda invokes the Google function, but because of reasons (network|overload|lack of interest), Google won't execute before the Lambda deadline approaches.

well, you'd have the same problem when you write to both storages. without proper task management you are always prone to lose events. e.g.:

fetch content (takes 20s)
write to s3 - all good
write to google - hangs for > 40s. action is killed, no chance to write to catch-up queue

we should combine the double-write with the lazy sync, that ensures that the 2 storages are consistent

davidnuescheler Jul 8, 2021
Maintainer

I think that we should at least have a way to compare the two storages and find ways to sync manually in exceptional cases, before we build complexity, as we don't have any idea yet how often we get into a situation where this actually surfaces as a problem.
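A comparison of the two storages could start from something like the sketch below. It works on plain dictionaries of key to body, which is an assumption; real buckets would more likely be compared via the checksums reported in their object listings than by reading every body.

```python
import hashlib


def digest(body):
    return hashlib.sha256(body).hexdigest()


def compare(s3_objects, gcs_objects):
    """Report keys that are missing on one side or differ in content."""
    report = {"missing_in_gcs": [], "missing_in_s3": [], "different": []}
    for key in sorted(set(s3_objects) | set(gcs_objects)):
        if key not in gcs_objects:
            report["missing_in_gcs"].append(key)
        elif key not in s3_objects:
            report["missing_in_s3"].append(key)
        elif digest(s3_objects[key]) != digest(gcs_objects[key]):
            report["different"].append(key)
    return report


s3_objects = {"index.md": b"v2", "about.md": b"v1", "news.md": b"v1"}
gcs_objects = {"index.md": b"v1", "about.md": b"v1", "old.md": b"v1"}
report = compare(s3_objects, gcs_objects)
```

The report only detects drift; deciding which side wins for a differing key is left to whoever runs the manual sync.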

rofe · 2021-07-07T12:57:10Z

rofe
Jul 7, 2021
Maintainer

Fwiw I'd prefer the Dual Write or Delayed Replication over Primary/Replica. Have you already looked into the cost (COGS) of these options (compared to the single-cloud storage architecture we have today)?

5 replies

kptdobe Jul 7, 2021
Maintainer

This looks sooo over-engineered for something that will happen very rarely.
I would stay with the primary architecture and show a nice user-friendly message to authors to tell them they cannot publish because AWS is down.
If this happens too "frequently" or a customer is requesting something more solid (i.e. a customer having the absolute need to publish anytime), I would implement a more complex version.

rofe Jul 7, 2021
Maintainer

I agree. This may even be an advanced feature we could charge customers extra for.

trieloff Jul 7, 2021
Maintainer Author

This looks sooo over-engineered for something that will happen very rarely.

Hence my original suggestion to do the bare minimum for now:

I'd start with the simplest approach (Primary/Replica) and evolve it into the Delayed Replication when it becomes expedient, e.g. after a prolonged AWS downtime during which one of our customers still had a pressing need to publish.

Whatever we evolve to is subject to discussion when we know more about the customer need and the failure patterns.

tripodsan Jul 8, 2021
Maintainer

The delayed replication might indeed be the best and least intrusive approach

kptdobe Jul 8, 2021
Maintainer

Sorry, I initially did not read the summary part :) I was immediately absorbed by the fact everyone was commenting on the Dual-Write approach.

trieloff · 2021-07-08T08:18:45Z

trieloff
Jul 8, 2021
Maintainer Author

I'm treating the overwhelming show of thumbs here #207 (reply in thread) as agreement to proceed with the simple plan and complexify later.

3 replies

kptdobe Jul 8, 2021
Maintainer

Behind the simple plan, there is still an aspect that will slowly move us towards a more complicated solution: making sure the 2 storages always stay in sync. This will certainly be the first part to improve, even as part of this simple plan.

tripodsan Jul 8, 2021
Maintainer

... proceed with the simple plan and complexify later.

https://www.imdb.com/title/tt0120324/mediaviewer/rm2360546304/

stefan-guggisberg Jul 8, 2021
Maintainer

I'm treating the overwhelming show of thumbs here #207 (reply in thread) as agreement to proceed with the simple plan and complexify later.

Sorry for being late. Yes, I absolutely agree. Start simple (Primary/Replica) and keep it simple as long as possible.

davidnuescheler · 2021-07-08T17:21:58Z

davidnuescheler
Jul 8, 2021
Maintainer

I intuitively also gravitate towards the dual write approach

0 replies
