Below is a codified playbook used to respond to MME Sev 1 Outages.
For the latest version, refer to the playbook in our Mattermost community instance: https://community.mattermost.com/playbooks/playbooks/9agdqr7jdtda7p4g8dxbppcibw
[ ] Create Incident Channel, run MME Sev 1 Playbook
- Once an MME Sev 1 issue is escalated by the CSM, TAM, or CRE, create an incident channel and run the MME Sev 1 Playbook
[ ] Add CSM, TAM & DE to Incident Channel
- Add the CSM, TAM & DE leaders (@Brent Fox @Stu Doherty @Jason Blais) to the channel so they can add the appropriate staff members. Also add @Ian Tien to view Playbooks in motion for L2 and L1 incidents.
[ ] Start audio & screen share with customer
- Include on the call a Mattermost engineer and a customer DBA who can run queries to support troubleshooting
[ ] Reply to customer (CEO if MME Sev 1 > 1 hour)
- An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into customer communications
[ ] Share system information
- Includes relevant system configuration settings, database specs (with CPU, RAM) & application specs (with CPU, RAM)
[ ] Share Grafana screenshots
- Include DB calls, API latency, Store latency, Top HTTP requests, Top API requests, CPU utilization, memory utilization
[ ] Share output from support bundle
- Link to relevant docs
[ ] Share output from slow query logs
- Link to relevant docs
[ ] Pin data to channel
- Link to relevant docs
[ ] Review system configuration settings that may impact performance
- Includes user typing timeout, user typing message, max notifications per channel & db replica lag settings
[ ] Review Grafana screenshots to identify potential issues
- Includes XXX
[ ] Review support bundle output to identify potential issues
- Includes XXX
[ ] Review slow query log output to identify potential issues
- Includes XXX
[ ] Summarize findings from data review
- Includes XXX
[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)
- An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into customer communications. Include a timeline for anticipated resolution.
[ ] Based on findings from data review, identify areas of codebase with potential root cause
- Includes XXX
[ ] Identify potential root cause based on the code
- Includes XXX
[ ] Identify solution for root cause
- Includes XXX
[ ] Submit PR for solution
- Includes XXX
[ ] Determine whether verification of the fix is required for the release candidate
- If yes, provide clear step-by-step instructions for QA to verify the fix, including specifications for the test server, such as database type (MySQL vs. Postgres)
[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)
- An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into customer communications. Include a timeline for anticipated resolution.
[ ] Merge PR to master branch
- Includes XXX
[ ] Cherry pick PR to dot release branch
- Includes XXX
[ ] Cut dot release candidate
- Includes XXX
[ ] Verify fix in dot release candidate
- Includes XXX
[ ] Cut dot release
- Includes XXX
[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)
- An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into customer communications. Include a timeline for anticipated resolution.
[ ] Send dot release binary to customer
- Includes XXX
[ ] Upgrade customer’s dev/staging environment with dot release
- Includes XXX
[ ] Verify fix in customer’s dev/staging environment
- Includes XXX
[ ] Upgrade customer’s production environment with dot release
- Includes XXX
[ ] Verify fix in customer’s production environment
- Includes XXX
[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)
- An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into customer communications. Include a timeline for anticipated resolution.
[ ] Monitor fix in customer environment for 24 hours
- Includes XXX
[ ] Receive confirmation from customer about issue resolution
- Includes XXX
[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)
- An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into customer communications
[ ] Complete incident retrospective within 1 business day from resolution
- Includes XXX
[ ] Draft incident summary analysis within 2 business days from resolution
- Includes XXX
[ ] Send completed incident summary analysis to customer within 3 business days
- Includes XXX
[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)
- An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into customer communications
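
The two time-based rules in the checklist (looping in the CEO once a Sev 1 outage exceeds 1 hour, and the post-incident deadlines of 1, 2, and 3 business days) can be sketched in Python as below. This is an illustrative sketch only, not part of the playbook tooling: the function names are hypothetical, ">1 hour" is read as strictly more than 60 minutes, business days are assumed to be Monday through Friday with holidays ignored, and the 3-business-day deadline is assumed to count from resolution like the other two.

```python
from datetime import date, datetime, timedelta

# From the checklist: an MME Sev 1 outage lasting more than 1 hour needs the CEO looped in.
CEO_ESCALATION_THRESHOLD = timedelta(hours=1)


def ceo_required(outage_start: datetime, now: datetime) -> bool:
    """Return True once the outage has lasted strictly more than one hour."""
    return now - outage_start > CEO_ESCALATION_THRESHOLD


def add_business_days(start: date, days: int) -> date:
    """Advance `start` by `days` business days.

    Assumption: business days are Monday to Friday; public holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def post_incident_deadlines(resolved_on: date) -> dict[str, date]:
    """Map each post-incident task (hypothetical keys) to its due date."""
    return {
        "retrospective": add_business_days(resolved_on, 1),
        "summary_draft": add_business_days(resolved_on, 2),
        "summary_sent": add_business_days(resolved_on, 3),
    }
```

For example, under these assumptions an incident resolved on a Friday would have its retrospective due the following Monday, the draft summary due Tuesday, and the completed summary due to the customer on Wednesday.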