[Doc] Best practice to setup Redis for GCS FT #2582

Open
1 of 2 tasks
kevin85421 opened this issue Nov 27, 2024 · 12 comments · May be fixed by #2684 or ray-project/ray#49887
Labels: enhancement (New feature or request), gcs ft

Comments

@kevin85421
Member

kevin85421 commented Nov 27, 2024

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Create a user guide for configuring a RayCluster with GCS Fault Tolerance using Redis on AWS or GCP. The guide should include a persistence option for Redis so that Redis state can be recovered after a restart.

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@kevin85421 added the enhancement, triage, and gcs ft labels and removed the triage label on Nov 27, 2024
@andrewsykim
Collaborator

I put together this gist to persist Redis using GCSFuse https://gist.github.com/andrewsykim/55088178684b5a692854f932c8120914

@andrewsykim
Collaborator

@rueian Kai-Hsun mentioned you're a Redis expert, do you have opinions on whether we should use RDB or AOF for persistence?

@rueian
Contributor

rueian commented Nov 27, 2024

Hi @andrewsykim,

Generally, AOF is better since it persists changes more frequently, as an append-only log, while RDB is a periodic snapshot of the whole Redis memory.

However, when it comes to integrating with GCSFuse, I believe RDB is better because GCS (Google Cloud Storage here, not Ray's GCS) doesn't support append operations. If we use AOF with GCSFuse, it will need to re-upload the whole AOF file every time a new entry is appended, and since the AOF file keeps growing, this will slow Redis down.

@andrewsykim
Collaborator

> If we use AOF with GCSFuse, it will need to re-upload the whole AOF file every time a new entry is appended, and since the AOF file keeps growing, this will slow Redis down.

Is this a problem specific to GCSFuse? Wouldn't this be a problem for any file-system based approach?

@rueian
Contributor

rueian commented Nov 27, 2024

It is not specific to GCSFuse. Most cloud object storage services don't support append operations, with exceptions such as Azure Blob Storage and AWS S3 Express One Zone.

A growing AOF file is not a problem on file systems that support append operations, because they don't need to rewrite the whole file when a new entry is appended.
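
To put rough numbers on the point above, here is a minimal sketch of the write cost in the two cases. It assumes a deliberately simplified model in which every appended entry triggers a full re-upload of the file on storage without append support; the real behavior of GCSFuse depends on when the file is flushed or synced. The function names and entry sizes are made up for illustration.

```python
def bytes_uploaded_whole_object(entry_sizes):
    """Total bytes sent if every append re-uploads the entire file (simplified model)."""
    total = 0
    file_size = 0
    for size in entry_sizes:
        file_size += size      # the AOF file grows by one entry
        total += file_size     # the whole file is uploaded again
    return total


def bytes_written_with_append(entry_sizes):
    """Total bytes written if the storage supports true append."""
    return sum(entry_sizes)


# Hypothetical workload: 1,000 entries of 100 bytes each.
entries = [100] * 1000
print(bytes_uploaded_whole_object(entries))   # grows quadratically with entry count
print(bytes_written_with_append(entries))     # grows linearly with entry count
```

Under this model the re-upload cost grows quadratically with the number of entries, while true append stays linear, which is the reason RDB snapshots are the safer fit for object storage.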

@andrewsykim
Collaborator

Okay, makes sense. So we should either use block storage for AOF, or use RDB persistence if using GCSFuse.
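
As a rough illustration of that rule of thumb, the sketch below picks Redis persistence directives based on whether the backing storage supports append. The directives (`dir`, `appendonly`, `appendfsync`, `save`, `dbfilename`) are standard redis.conf settings, but the helper name, the `/data` path, and the specific values (`everysec`, `save 60 1`) are assumptions chosen for the example, not recommendations from this thread.

```python
def redis_persistence_config(storage_supports_append: bool, data_dir: str = "/data") -> str:
    """Return redis.conf lines for AOF (append-capable storage) or RDB (object storage)."""
    lines = [f"dir {data_dir}"]
    if storage_supports_append:
        # Block storage or a regular file system: AOF appends are cheap.
        lines += ["appendonly yes", "appendfsync everysec"]
    else:
        # Object storage mounted through something like GCSFuse: use periodic snapshots.
        # "save 60 1" snapshots if at least 1 key changed in the last 60 seconds.
        lines += ["appendonly no", "save 60 1", "dbfilename dump.rdb"]
    return "\n".join(lines) + "\n"


print(redis_persistence_config(storage_supports_append=False))
```

The snapshot interval is the main trade-off in the RDB case: a shorter interval loses less state after a restart but uploads the snapshot more often.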

@andrewsykim changed the title from "[Feature] Best practice to setup Redis for GCS FT" to "[Doc] Best practice to setup Redis for GCS FT" on Nov 27, 2024
@spencer-p
Contributor

I'm interested in getting this doc together

@spencer-p
Contributor

spencer-p commented Nov 28, 2024

Without clustering Redis (all of the current guides use 1 replica), we're talking about using Redis as a save-to-disk engine. I'd like to include some clustering and failover in our best-practice recommendations to get the most out of Redis.

For reentrant workloads that can handle rolling back a few minutes, I wonder if it would be simpler/cheaper to write GCS to disk occasionally.

@rueian
Contributor

rueian commented Nov 28, 2024

> we're talking about using Redis as a save-to-disk engine.

Exactly.

> I'd like to include some clustering and failover in our best-practice recommendations to get the most out of Redis.

Ray only supports standalone Redis (single master). It doesn't support Redis Cluster (multiple sharded masters) or Redis Sentinel, the two deployment modes that have failover built in. So, for now, users need to implement Redis HA themselves.
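
One rough way to sanity-check a deployment against this constraint is to look at the `redis_mode` field that Redis reports in the output of `INFO server`. The sketch below only parses that text; the helper names and the sample output are made up for illustration, and it is not a full compatibility test, since it inspects a single field of one instance.

```python
from typing import Optional


def redis_mode(info_text: str) -> Optional[str]:
    """Extract the redis_mode value from the text returned by `INFO server`."""
    for line in info_text.splitlines():
        line = line.strip()
        if line.startswith("redis_mode:"):
            return line.split(":", 1)[1]
    return None


def is_standalone(info_text: str) -> bool:
    """True if the instance reports itself as standalone."""
    return redis_mode(info_text) == "standalone"


# Hypothetical INFO output, trimmed to a few fields.
sample = "# Server\r\nredis_version:7.2.4\r\nredis_mode:standalone\r\nos:Linux\r\n"
print(is_standalone(sample))
```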

As far as I know, https://github.com/dragonflydb/dragonfly-operator is the only open-source solution that has automatic failover.

> For reentrant workloads that can handle rolling back a few minutes, I wonder if it would be simpler/cheaper to write GCS to disk occasionally.

Do you mean you want to skip Redis?

@spencer-p
Contributor

> Do you mean you want to skip Redis?

Well, I was curious if it's the best design choice if we're not clustering. But I understand that's not the main point, thanks.

@spencer-p spencer-p linked a pull request Dec 23, 2024 that will close this issue
1 task
@spencer-p
Contributor

Apologies for the delay, just put up what I've got to iterate on before the holidays. I'll be back in January.

After we get a sample config and best practice here, I figure we'll want something in ray/cluster/kubernetes/user-guides as well, correct?

4 participants