[no ticket] Update OPERATIONS.md for terraform state lock issues (#3285)
## Context

written live during a deploy

---------

Co-authored-by: Matt Dragon <[email protected]>
coilysiren and mdragon authored Dec 18, 2024
1 parent 27a7a65 commit f9ec087
Showing 1 changed file with 41 additions and 3 deletions.
44 changes: 41 additions & 3 deletions OPERATIONS.md
@@ -1,5 +1,43 @@
# Maintenance and Operation of Runtime System

## Deployment

### Updating to our Terraform Version

1. Install `tfenv`
2. Get the Terraform version to install from the `terraform_version` value in this file: https://github.com/HHS/simpler-grants-gov/blob/main/.github/workflows/deploy.yml
3. Follow the `tfenv` instructions to install and use that Terraform version
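
Steps 2 and 3 could be scripted roughly as follows. This is a minimal sketch, not the project's own tooling: it assumes `deploy.yml` stores the version on a single line shaped like `terraform_version: X.Y.Z` (the exact layout is an assumption), and `read_terraform_version` is a hypothetical helper name.

```shell
# Hypothetical helper: print the first terraform_version value found in a workflow file.
# Assumes a line shaped like `terraform_version: X.Y.Z` (surrounding quotes are stripped).
read_terraform_version() {
  sed -n 's/^[[:space:]]*terraform_version:[[:space:]]*//p' "$1" | tr -d "\"'" | head -n 1
}

# Usage (run from the repository root, with tfenv already installed):
#   version="$(read_terraform_version .github/workflows/deploy.yml)"
#   tfenv install "$version"
#   tfenv use "$version"
```

Reading the version from the workflow file, rather than hard-coding it, keeps a local setup in step with what the deploy job uses.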

### Terraform State Locks

Terraform state locks happen when multiple terraform deployments try to roll out simultaneously.

You can fix them from the CLI as follows:

1. Find the job (in a GitHub Action or elsewhere) where the deployment failed. If you aren't sure, it was probably in a GitHub Action. You can find a list of failing actions here: https://github.com/HHS/simpler-grants-gov/actions
2. Wait for the deployment that caused the state lock to finish. If you can't find it, just wait 30 minutes.
3. Identify the folder in which the state lock is happening. The `Path` attribute in the `Lock Info` block identifies it.
4. Open your terminal, set up AWS (e.g. `export AWS_PROFILE=grants-bla-bla-bla && aws sso login`), and `cd` into the folder identified above.
5. Run `terraform init -backend-config=<ENVIRONMENT>.s3.tfbackend`, where `<ENVIRONMENT>` can be identified from the `Path` above.
6. Run `terraform force-unlock -force <LOCK_ID>`, where `<LOCK_ID>` is the value of `ID` in your state lock message.
7. Re-run your deploy job.
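
To avoid copying `ID` and `Path` by hand in the steps above, they can be pulled out of the failed job's log. The following is a minimal sketch under stated assumptions: `lock_field` is a hypothetical helper, `lock-error.txt` is a hypothetical file holding the saved error output, and the log is assumed to contain the usual `Lock Info` block with `ID:` and `Path:` lines.

```shell
# Hypothetical helper: read a Terraform lock error on stdin and print one Lock Info field.
# Scans each line for the token "<FIELD>:" and prints the token that follows it, so a
# leading border character or log prefix before the field name does not matter.
lock_field() {
  awk -v key="$1:" '{ for (i = 1; i < NF; i++) if ($i == key) { print $(i + 1); exit } }'
}

# Usage, assuming the error output was saved to lock-error.txt (hypothetical file name):
#   lock_id="$(lock_field ID < lock-error.txt)"
#   lock_path="$(lock_field Path < lock-error.txt)"
#   echo "$lock_path"   # pick the folder and <ENVIRONMENT> from this by hand
#   terraform init -backend-config=<ENVIRONMENT>.s3.tfbackend
#   terraform force-unlock -force "$lock_id"
```

The sketch deliberately stops at printing `Path` instead of deriving the folder and environment from it, since that mapping depends on how the state keys are laid out.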

Sometimes the CLI unlock won't work. In that case you will see an error message like the following:

> terraform force-unlock -force <LOCK_ID>
> Failed to unlock state: failed to retrieve lock info for lock ID <LOCK_ID>: unexpected end of JSON input

When that happens, you need to unlock it via DynamoDB in the AWS console.

1. Log in to AWS
2. [Open the DynamoDB console](https://us-east-1.console.aws.amazon.com/dynamodbv2/home?region=us-east-1)
3. [Open the tables tab](https://us-east-1.console.aws.amazon.com/dynamodbv2/home?region=us-east-1#tables)
4. Click on the state locks table. There should only be one.
5. Click the `Explore Table Items` button
6. Find the item that corresponds to the currently locked state. You can identify it by again looking at the `Path` attribute in your locked job.
7. Remove the `Digest` key, then click `Save and close`
8. Re-run your deploy job

## Scaling

All scaling options can be found in the following files:
@@ -52,11 +90,11 @@ When scaling openSearch, consider which attribute changes will trigger blue/gree
can be edited in place. [You can find that information here](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-configuration-changes.html). Requiring blue/green changes for the average configuration change is a
notable constraint of OpenSearch, relative to ECS and the Database.

# Yearly Rotations
## Yearly Rotations

We manage several secret values that need to be rotated yearly.

## Login.gov Certificates
### Login.gov Certificates

*These certificates were last updated in December 2024*

@@ -76,7 +114,7 @@ for the given environment to be the value from the `private.pem` key you generat

After the next deployment in an environment, we should be using the new keys and can clean up the old certificate.

### Prod Login.gov
#### Prod Login.gov

Prod login.gov does not update immediately, and you must [request a deployment](https://developers.login.gov/production/#changes-to-production-applications) to get a certificate rotated.
