Need faster deploys #160

Open
skalee opened this issue Feb 25, 2021 · 14 comments

@skalee
Contributor

skalee commented Feb 25, 2021

Deploying the IEV site took over an hour, most of which (50 minutes) was spent uploading the generated files to S3. We need to speed it up.

Currently we deploy with our custom Rake task defined here: https://github.com/geolexica/geolexica-server/blob/master/lib/tasks/deploy.rake. Under the hood it uses s3 sync (the aws s3 sync command of the official AWS CLI).

Some ideas on how to deal with this can be found in glossarist/iev-demo-site#66.

@skalee
Contributor Author

skalee commented Feb 27, 2021

@ronaldtse I have two questions:

  • If I end up creating a brand new tool (which is possible, because these slow uploads are likely caused by poor parallelism), does it matter whether it's a Node or a Ruby tool?
  • During the upload the site may be inconsistent (some pages old, some new). Is that a problem? If yes, then there are two options:
    1. We may upload the site to a temporary bucket and then copy it to the proper one. Copying files between buckets in the same region should be much faster than uploading, especially with the S3P tool you found.
    2. Alternatively, we can display a maintenance page.
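
On the parallelism point in the first question, here is a minimal sketch of a concurrent uploader. It is not the existing Rake task or any real tool. The names upload_all and upload_one are hypothetical, and the per-file upload call (for example an S3 PUT) is passed in as a function, so the sketch makes no assumption about which S3 client is used.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Iterable, List


def upload_all(
    files: Iterable[Path],
    upload_one: Callable[[Path], None],  # hypothetical per-file upload function
    max_workers: int = 32,
) -> List[Path]:
    """Upload files concurrently and return the paths whose upload failed."""
    failed: List[Path] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its path so failures can be reported.
        futures = {pool.submit(upload_one, path): path for path in files}
        for future in as_completed(futures):
            if future.exception() is not None:
                failed.append(futures[future])
    return sorted(failed)
```

Uploading many small files is dominated by per-request latency, so threads help even though each one spends most of its time waiting. The returned list lets the caller retry or fail the deploy.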

@ronaldtse
Member

If I end up creating a brand new tool (which is possible, because these slow uploads are likely caused by poor parallelism), does it matter whether it's a Node or a Ruby tool?

No, as long as you can maintain it.

During the upload the site may be inconsistent (some pages old, some new). Is that a problem? If yes, then there are two options:

  1. We may upload the site to a temporary bucket and then copy it to the proper one. Copying files between buckets in the same region should be much faster than uploading, especially with the S3P tool you found.

Great idea! GitHub now also supports environments, so deploys can be queued: if one job is running, the other jobs wait. In this case, we can use S3 Transfer Acceleration for the temporary bucket (as long as its name does not contain dots).

  2. Alternatively, we can display a maintenance page.

This is probably necessary in either case.

The third option is to use AWS DynamoDB or MongoDB Cloud Atlas, which will be necessary for high-frequency update workloads.

@ronaldtse
Member

https://github.com/cobbzilla/s3s3mirror seems to work for mirroring.

@ronaldtse
Member

I just found out that we could enable Transfer Acceleration if we rename the buckets to remove the dots. It's now possible to use an arbitrarily named S3 bucket as an origin for CloudFront, so we can use "example-com" instead of "example.com" as the bucket name. Let me see what we can do.

@skalee
Contributor Author

skalee commented Feb 27, 2021

The third option is to use AWS DynamoDB or MongoDB Cloud Atlas, which will be necessary for high-frequency update workloads.

Is this expected? I thought glossaries would not be updated very frequently.

@skalee
Contributor Author

skalee commented Feb 27, 2021

I just found out that we could enable Transfer Acceleration if we rename the buckets to remove the dots. It's now possible to use an arbitrarily named S3 bucket as an origin for CloudFront, so we can use "example-com" instead of "example.com" as the bucket name. Let me see what we can do.

AWS docs say:

You might want to use Transfer Acceleration on a bucket for various reasons:

  • Your customers upload to a centralized bucket from all over the world.
  • You transfer gigabytes to terabytes of data on a regular basis across continents.
  • You can't use all of your available bandwidth over the internet when uploading to Amazon S3.

That doesn't sound like our case.

@ronaldtse
Member

Frequency: it’s also about burst frequency, e.g. if people make several changes in quick succession.

I found a way to make Transfer Acceleration work with CloudFront, but it requires a separate Lambda@Edge function that returns index.html in order to mimic the S3 website functionality.

In this case we may not need two buckets, but let’s see.
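
A minimal sketch of such a function, assuming it is attached to the CloudFront distribution as an origin-request trigger. It rewrites directory-style URIs so that the plain S3 origin is asked for index.html. The function name handler and the "no dot in the last segment means directory" rule are illustrative assumptions, not the configuration actually used here.

```python
def handler(event, context):
    """CloudFront origin-request handler: map directory URIs to index.html."""
    request = event["Records"][0]["cf"]["request"]
    uri = request["uri"]
    if uri.endswith("/"):
        # "/concepts/" -> "/concepts/index.html"
        request["uri"] = uri + "index.html"
    elif "." not in uri.rsplit("/", 1)[-1]:
        # Assumption: a last segment without a file extension is a directory.
        request["uri"] = uri + "/index.html"
    # Returning the request lets CloudFront forward it to the origin.
    return request
```

S3 website hosting answers a URI without a trailing slash with a redirect, whereas this sketch rewrites it silently. That is simpler, but relative links on such pages may resolve differently.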

@skalee
Contributor Author

skalee commented Feb 27, 2021

Frequency: it’s also about burst frequency, e.g. if people make several changes in quick succession.

Wow, that sounds like a very different thing from the deploys we have now. If burst updates can happen, then slow uploads aren't our only problem. Building the full site from scratch will be too slow as well. Note that IEV has 20k concepts or so. We need some kind of incremental site builds in GHA to handle burst updates. Or throttling, or debouncing.
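
To illustrate the debouncing option, here is a minimal sketch that collapses a burst of update events into a single deploy. The function name debounce and the quiet_period parameter are hypothetical. Times are plain numbers (seconds), and how events would actually reach GHA is left open.

```python
def debounce(event_times, quiet_period):
    """Return the times at which a deploy should start.

    A deploy starts once `quiet_period` seconds pass without a new event,
    so a burst of updates results in a single deploy.
    """
    deploys = []
    last = None
    for t in sorted(event_times):
        if last is not None and t - last >= quiet_period:
            # The previous burst went quiet before this event arrived.
            deploys.append(last + quiet_period)
        last = t
    if last is not None:
        # The final burst is deployed after its own quiet period.
        deploys.append(last + quiet_period)
    return deploys
```

For example, updates at 0, 10 and 20 seconds followed by two more at 300 and 305 seconds produce two deploys with a 60-second quiet period, not five.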

@skalee
Contributor Author

skalee commented Feb 27, 2021

Also, we need to prevent race conditions between deploys.

@skalee
Contributor Author

skalee commented Feb 27, 2021

I'm not sure what exactly Paneron will be responsible for when it comes to site generation, so this may be a silly idea: we can use Paneron to generate concept pages, and then use Jekyll to bind them into a site. Jekyll supports incremental site generation, so if we modify only a few files, it should finish quite fast. Then we need to upload these modified files without touching the others. Maybe s3 sync will do much better in that case.

Obviously that won't speed up full site rebuilds, which we need too.

@skalee
Contributor Author

skalee commented Feb 27, 2021

My new idea involves persisting the generated site across builds. It would live in a separate Git repo (maybe hosted on GitHub, maybe existing only in the GHA cache, it doesn't really matter) because I don't trust file timestamps as much as commit dates. A file's modification timestamp can be updated for any reason, whereas a Git commit date reflects an actual change to the file's contents.

In steps (all done in GHA):

  1. Obtain the generated site (Git repo) from previous builds.
  2. Rebuild the site (incrementally or not).
  3. Commit all the differences.
  4. List all files in the generated site along with their last commit timestamps.
  5. List all files in the S3 bucket along with their last modification timestamps.
  6. Upload only those files that have changed since the last deploy.
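
Steps 4 to 6 can be sketched as a comparison of two timestamp maps. This is only an illustration: the function names are hypothetical, and the sketch assumes the two maps (path to last commit time, key to last modification time in the bucket) have already been collected by whatever Git and S3 listing commands the deploy uses.

```python
from datetime import datetime
from typing import Dict, List


def files_to_upload(
    committed_at: Dict[str, datetime],  # path -> last commit time (step 4)
    uploaded_at: Dict[str, datetime],   # key -> last modified in S3 (step 5)
) -> List[str]:
    """Paths missing from the bucket or committed after their last upload."""
    return sorted(
        path
        for path, commit_time in committed_at.items()
        if path not in uploaded_at or commit_time > uploaded_at[path]
    )


def files_to_delete(
    committed_at: Dict[str, datetime],
    uploaded_at: Dict[str, datetime],
) -> List[str]:
    """Keys still present in the bucket but no longer part of the site."""
    return sorted(set(uploaded_at) - set(committed_at))
```

Both maps should use timezone-aware timestamps in the same zone (for example UTC), otherwise the comparison is meaningless.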

This approach should greatly reduce deploy time compared to s3 sync. By default the latter decides what to upload by comparing file sizes and modification timestamps rather than content hashes, so files rewritten by a fresh build look changed even when their contents are identical. Alternatively, s3 sync can compare file sizes only (--size-only), which avoids that problem but is less reliable.

@ronaldtse
Member

@skalee I think a more comprehensive approach is needed for S3 bucket sync; syncing unchanged items is clearly not desired. A possible mechanism is to maintain a hash index at the root of the bucket (with the hashes of all files), updated by some cron job or Lambda function, so that when we upload something we can work out which files need updating and which do not.
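
A minimal sketch of the hash index idea, under stated assumptions: the index is a mapping from relative path to SHA-256 digest, the local index is built from the generated site directory, and the remote index is whatever was stored at the bucket root after the previous deploy. The function names build_index and diff_index are hypothetical, as is the choice of SHA-256.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def build_index(site_dir: Path) -> Dict[str, str]:
    """Map each file's relative POSIX path to the SHA-256 of its contents."""
    index = {}
    for path in sorted(site_dir.rglob("*")):
        if path.is_file():
            key = path.relative_to(site_dir).as_posix()
            index[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return index


def diff_index(
    local: Dict[str, str], remote: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (keys to upload, keys to delete) given both indexes."""
    upload = sorted(key for key, digest in local.items() if remote.get(key) != digest)
    delete = sorted(key for key in remote if key not in local)
    return upload, delete
```

After a successful upload the local index would replace the remote one, so the next deploy only needs to download a single small file to decide what to send.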

@skalee
Contributor Author

skalee commented Mar 1, 2021

FYI I've just triggered a re-deploy on iev-demo-site and it's slow again, despite the fact that nothing was changed and most files are identical.

@ronaldtse ronaldtse moved this to 🆕 New in Geolexica Jul 24, 2022
@ronaldtse ronaldtse moved this from 🆕 New to 📋 Backlog in Geolexica Jul 24, 2022
Projects
Status: 📋 Backlog