Kubernetes Information Security Review Checklist

This document focuses on information security in Kubernetes clusters.

This repository also contains a general, technology-neutral Cloud Information Security Review Checklist, as well as other technology-specific checklists.

Governance, risk management, and compliance

What regulations / information security standards do you need to comply with?
1. ☐ FFFS 2014:5
2. ☐ SOC-2
3. ☐ PCI-DSS
4. ☐ ISO 27001
5. ☐ HIPAA
6. ☐ Swedish Patient Data Act (Swedish: "Patientdatalagen")
7. ☐ GDPR
8. ☐ BSI IT baseline protection (German: "IT-Grundschutz")
9. ☐ Other
Who (or which role) in your organization is responsible for compliance?
Does your organization have general information security policies in place beyond your compliance requirements? (e.g., “We need to store data in Germany to appeal to national sentiments.”)

Mapping of Standards to Policies

What policy statements do you have in place to show your compliance with the above standard(s)?

Mapping of Policies to Implementation

How do you track implementation?
What is your development/implementation methodology?
1. How do you take high level policy requirements and create implementable units of work?
How do you track and plan implementation/development?

Evidence of Implementation fulfilling Policies

When a policy is implemented how is this validated?
Is this validation recorded?
Can an auditor access this record?

High-Level General Practices

Supply Management

Who supplies your Kubernetes cluster(s) infrastructure today?
Is the underlying cloud provider's infrastructure in line with your compliance requirements?
Are the underlying VMs / load-balancers / storage sufficiently protected? (e.g., via firewalls)
Is your Kubernetes cluster connected to any managed services? (e.g., database-as-a-service, logging-as-a-service, incident-management-as-a-service)

Separation of testing and production

Do you separate testing from production clusters?
Do you take production data into testing?
What is your source of test data?
Is your testing infrastructure sufficiently protected?

Access Control

Describe your access control system for Kubernetes.
Is multi-factor authentication enabled?
How many people have access to the production Kubernetes cluster?
1. Is role-based access control implemented? (e.g., read-only, read-write)
Do you have one user account per person?
Do you keep an audit trail on Kubernetes API access?
Is the audit trail stored in a tamper-proof logging¹ environment?
Do you regularly review access, e.g., revoking access from people leaving your organization?
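When reviewing whether roles really are read-only or read-write, it can help to classify them mechanically. The following is a minimal sketch, assuming Role or ClusterRole manifests have already been loaded into Python dictionaries; the function name `role_access_level` is hypothetical. It treats only the verbs `get`, `list` and `watch` as read-only. Note that read access to Secrets is still sensitive, so a "read-only" result is not the same as "harmless".

```python
# Verbs that only read state; anything else (including "*") can change it.
READ_ONLY_VERBS = {"get", "list", "watch"}


def role_access_level(role: dict) -> str:
    """Classify a Role/ClusterRole manifest as 'read-only' or 'read-write'."""
    verbs = set()
    for rule in role.get("rules", []):
        verbs.update(rule.get("verbs", []))
    return "read-only" if verbs <= READ_ONLY_VERBS else "read-write"


viewer = {
    "kind": "Role",
    "rules": [{"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]}],
}
admin = {
    "kind": "ClusterRole",
    "rules": [{"apiGroups": ["*"], "resources": ["*"], "verbs": ["*"]}],
}
print(role_access_level(viewer), role_access_level(admin))
```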

Logging

Do you forward node (journald), Kubernetes (API server), and application logs to a tamper-proof logging environment?
What is your retention policy? What is the minimum amount of time you keep logging entries? What is the maximum amount of time you keep logging entries?
1. What is the source of your retention policy: legislation or company policy? If company policy, why that figure?

Backups

Do you perform regular backups of your production Kubernetes clusters?
How do you protect backups from loss or corruption?
1. Do you replicate backups to an off-site location?
2. Do you store backups in immutable storage?
Do you regularly test restoring from backups?

Change Management

What is the journey of code changes? (e.g., How does a new feature make it into production?)
What is the journey of infrastructure changes? (e.g., How do you create a new Kubernetes cluster? How do you update a SecurityGroup?)
What is the journey of data changes? (e.g., How do you add a new column to your database?)
Do you enforce a 2-person policy for performing system changes?
Do you use semantic versioning for container image tags?
How do you perform data migration?
In the event of a ‘bad’ deployment, can you roll back to a previous good state?
Is this rollback ability tested regularly?
Do you do canary deployments?
Do you have a maintenance window for preventive maintenance?
1. Do you communicate the maintenance window to your users?
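The question about semantic versioning of container image tags can be checked automatically. Below is a minimal sketch, assuming image references are available as plain strings; the helper names and the example registry `registry.example.com` are hypothetical. The pattern accepts `MAJOR.MINOR.PATCH` with an optional leading `v` and an optional pre-release suffix. Build metadata (`+...`) is deliberately not accepted, since `+` is not a valid character in an image tag.

```python
import re
from typing import Optional

# MAJOR.MINOR.PATCH, optional "v" prefix, optional pre-release such as "-rc.1".
SEMVER_TAG = re.compile(r"^v?(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(-[0-9A-Za-z.-]+)?$")


def image_tag(image: str) -> Optional[str]:
    """Return the tag of an image reference, or None if it carries no tag."""
    name = image.split("@", 1)[0]           # drop a digest suffix, if any
    last_segment = name.rsplit("/", 1)[-1]  # skip a registry port such as ":5000"
    if ":" not in last_segment:
        return None
    return last_segment.rsplit(":", 1)[1]


def has_semver_tag(image: str) -> bool:
    """True if the image is tagged and the tag looks like a semantic version."""
    tag = image_tag(image)
    return tag is not None and SEMVER_TAG.match(tag) is not None


for ref in ("registry.example.com:5000/shop/api:1.4.2", "nginx:latest", "nginx"):
    print(ref, has_semver_tag(ref))
```

An untagged reference is reported as non-compliant because it implicitly resolves to "latest".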

Technical Vulnerability Management

Do you scan container images for vulnerabilities before they enter production?
Do you have a process in place to get alerted when a container becomes vulnerable in production?
Do you enforce deployment only from known-good container image registries?
How quickly do you fix known container image vulnerabilities?
Is the production Kubernetes cluster updated, as required to avoid vulnerabilities?
Is the underlying OS image updated, as required to avoid vulnerabilities?
Are other adjacent services (e.g., container registry, logging environment, identity provider) updated, as required to avoid vulnerabilities?
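Restricting deployments to known-good registries usually comes down to comparing the registry part of each image reference against an allow-list. The following is a minimal sketch of that comparison, not an admission controller; the allow-list entry `registry.example.com` and the function names are hypothetical. It follows the common Docker convention that the first path component names a registry only if it contains a dot or a colon, or is `localhost`, and that everything else is pulled from Docker Hub (`docker.io`).

```python
from typing import Iterable

# Hypothetical allow-list; replace with the registries your organization trusts.
ALLOWED_REGISTRIES = {"registry.example.com"}


def image_registry(image: str) -> str:
    """Return the registry host of an image reference."""
    first, sep, _ = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"  # no explicit registry: assumed to be Docker Hub


def from_allowed_registry(image: str, allowed: Iterable[str] = ALLOWED_REGISTRIES) -> bool:
    """True if the image would be pulled from an allow-listed registry."""
    return image_registry(image) in set(allowed)


print(from_allowed_registry("registry.example.com/shop/api:1.4.2"))
print(from_allowed_registry("nginx:1.25"))
```

In a cluster, this kind of rule is typically enforced by an admission policy rather than by a script; the sketch only illustrates the decision being made.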

Use of Cryptography

Do you encrypt traffic over open networks? Do you use HTTPS over the Internet?
Do you regularly rotate cryptographic keys?
Do you use HSTS²?
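Whether HSTS is actually in effect can be verified from the response headers of your public endpoints. The following is a minimal sketch, assuming the headers have already been collected into a dictionary (for example by an HTTP client of your choice); the function name `hsts_max_age` is hypothetical. It returns the `max-age` value of the `Strict-Transport-Security` header, or `None` if the header or the directive is missing.

```python
from typing import Mapping, Optional


def hsts_max_age(headers: Mapping[str, str]) -> Optional[int]:
    """Return the HSTS max-age in seconds, or None if HSTS is not configured."""
    for key, value in headers.items():
        if key.lower() != "strict-transport-security":
            continue
        for directive in value.split(";"):
            name, _, arg = directive.strip().partition("=")
            if name.lower() == "max-age":
                try:
                    return int(arg.strip().strip('"'))
                except ValueError:
                    return None  # malformed value: treat as not configured
    return None


print(hsts_max_age({"Strict-Transport-Security": "max-age=31536000; includeSubDomains"}))
print(hsts_max_age({"Content-Type": "text/html"}))
```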

Network Segregation

Do you have NetworkPolicies in place?
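A common starting point for NetworkPolicies is a "default deny" policy per namespace, on top of which specific traffic is allowed. The following is a minimal sketch for recognizing such a policy, assuming NetworkPolicy manifests have been loaded into Python dictionaries; the function name `is_default_deny` is hypothetical. A policy counts as default deny for a direction if it selects all Pods with an empty `podSelector`, lists that direction in `policyTypes`, and defines no rules for it. For simplicity, the equivalent form `podSelector: {matchLabels: {}}` is not recognized.

```python
def is_default_deny(policy: dict, direction: str = "Ingress") -> bool:
    """True if the NetworkPolicy denies all traffic in the given direction."""
    spec = policy.get("spec", {})
    selects_all_pods = spec.get("podSelector") == {}
    # When policyTypes is omitted, "Ingress" always applies.
    policy_types = spec.get("policyTypes", ["Ingress"])
    rules = spec.get(direction.lower()) or []
    return selects_all_pods and direction in policy_types and not rules


deny_all_ingress = {
    "kind": "NetworkPolicy",
    "metadata": {"name": "default-deny-ingress"},
    "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
}
allow_from_frontend = {
    "kind": "NetworkPolicy",
    "metadata": {"name": "allow-frontend"},
    "spec": {
        "podSelector": {"matchLabels": {"app": "api"}},
        "policyTypes": ["Ingress"],
        "ingress": [{"from": [{"podSelector": {"matchLabels": {"app": "frontend"}}}]}],
    },
}
print(is_default_deny(deny_all_ingress), is_default_deny(allow_from_frontend))
```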

Intrusion Detection / Prevention

Do you have an intrusion detection system in the Kubernetes cluster?
Do you have a web application firewall in front of or within the Kubernetes cluster?
Do you alert on blocked outbound traffic?

Business Continuity

Are your systems designed with sufficient redundancy? (e.g., multiple availability zones, multiple Kubernetes nodes, multiple Pod replicas)
Do you regularly test your redundancy? (e.g., by failing a Kubernetes node and killing a Pod replica)
Do at least two team members have access to each system?

Incident Management

Are your systems sufficiently documented for incident management?
Do you have a clear definition of what constitutes an incident?
Does your team have clear procedures for who handles incidents and how?
Are there situations where you need to fix data in a production environment?
1. If yes, what is your process?

Incident reporting: Internal

How do internal users report incidents?
Is there an escalation process in place if an incident is not responded to in time?

Incident reporting: External

How do external users report incidents?
Is there an escalation process in place if an incident is not responded to in time?
Do you have a process in place to report incidents -- e.g., downtime, data breaches -- to your regulators or users, as required?

Capacity Management

Do you have processes in place to predict running out of capacity, specifically with respect to CPUs, memory and storage?
Do you have processes in place to add capacity to the cluster when needed?
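One simple way to predict running out of capacity is to fit a straight line through recent usage samples and extrapolate to the capacity limit. The following is a minimal sketch of that idea using a least-squares fit; the function name `days_until_full` and the sample values are hypothetical, and real usage is rarely perfectly linear, so treat the result as a rough early warning rather than a forecast.

```python
from typing import Optional, Sequence, Tuple


def days_until_full(samples: Sequence[Tuple[float, float]], capacity: float) -> Optional[float]:
    """Estimate the days from the last sample until usage reaches capacity.

    samples: (day, used) pairs in chronological order, in the same unit as capacity.
    Returns None if there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_day = sum(day for day, _ in samples) / n
    mean_used = sum(used for _, used in samples) / n
    spread = sum((day - mean_day) ** 2 for day, _ in samples)
    if spread == 0:
        return None
    # Least-squares slope: growth in usage per day.
    slope = sum((day - mean_day) * (used - mean_used) for day, used in samples) / spread
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return max(0.0, (capacity - last_used) / slope)


# Hypothetical disk usage in GiB over four days on a 100 GiB volume.
print(days_until_full([(0, 50), (1, 52), (2, 54), (3, 56)], capacity=100))
```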

Kubernetes Cluster Security

Does your cluster pass the CIS Kubernetes security benchmark?
Do you have RBAC enabled?
Do you enforce the Pod Security Standards (e.g., via Pod Security Admission)? Note that PodSecurityPolicy was deprecated in Kubernetes v1.21 and removed in v1.25.
Do you use Open Policy Agent (OPA)?
Do your Kubernetes Nodes have appropriate topology labels and taints?

Kubernetes Workload Security

Do you enforce appropriate Pod SecurityContext?
1. Do you enforce no privileged Pods?
2. Do you enforce minimum Pod capabilities?
3. Do you enforce no Pods running as root?
4. Do you enforce no Pod sysctls?
5. Do you enforce no hostPath volumes?
6. Do you enforce no NodePort Services and no host networking?
Do you enforce disabling automountServiceAccountToken?
Do you enforce Pod resource limits?
Do you enforce minimum Pod replication?
Do you enforce a read-only Pod file system?
Are all Pods properly labeled?
Are Pod container images pinned to a specific version (i.e., no “latest”)?
Do you use Secrets appropriately?
Do you restrict access to Secrets?
Do you enforce Ingress TLS?
Do you have sufficient and tested NetworkPolicies in place?
Do you use Pod Topology Spread Constraints?
Did you run through other Kubernetes Best Practices?
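Several of the workload questions above can be checked directly against a Pod manifest. The following is a minimal sketch, assuming the manifest has been loaded into a Python dictionary; the function name `pod_findings` and the example manifests are hypothetical. It covers only a subset of the list (privileged containers, running as root, host network, hostPath volumes, service account token automounting, read-only root file system, resource limits) and ignores init containers, capabilities and sysctls. An unset `automountServiceAccountToken` is reported as a finding, which is an approximation because the ServiceAccount can also disable automounting.

```python
from typing import List


def pod_findings(pod: dict) -> List[str]:
    """Return a list of human-readable findings for a Pod manifest."""
    spec = pod.get("spec", {})
    pod_sc = spec.get("securityContext", {})
    findings = []
    if spec.get("hostNetwork"):
        findings.append("pod uses the host network")
    if spec.get("automountServiceAccountToken", True):
        findings.append("pod automounts a service account token")
    for volume in spec.get("volumes", []):
        if "hostPath" in volume:
            findings.append(f"volume {volume.get('name')} is a hostPath")
    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext", {})
        if sc.get("privileged"):
            findings.append(f"{name}: privileged")
        # The container-level setting overrides the Pod-level one.
        if not sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False)):
            findings.append(f"{name}: may run as root")
        if not sc.get("readOnlyRootFilesystem"):
            findings.append(f"{name}: writable root file system")
        if not container.get("resources", {}).get("limits"):
            findings.append(f"{name}: no resource limits")
    return findings


hardened = {"spec": {
    "automountServiceAccountToken": False,
    "securityContext": {"runAsNonRoot": True},
    "containers": [{
        "name": "api",
        "securityContext": {"readOnlyRootFilesystem": True},
        "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}},
    }],
}}
risky = {"spec": {
    "hostNetwork": True,
    "volumes": [{"name": "host", "hostPath": {"path": "/"}}],
    "containers": [{"name": "debug", "securityContext": {"privileged": True}}],
}}
for finding in pod_findings(risky):
    print(finding)
```

In practice, such rules are better enforced at admission time (for example with Pod Security Standards or a policy engine) than audited after the fact; the sketch is meant for review and reporting.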

Kubernetes Development Practices

How is a container image built?
How is a container image deployed into the Kubernetes cluster?
How do you manage Kubernetes resources? (e.g., Helm)
Do you practice GitOps³?
Do you fully separate development from production environments?
Is there a fully separate environment for testing/QA?

Development Environment

Access control

How is access managed to this environment? (e.g., role-based, per user)
Describe your source control systems and process.

Testing/QA Environment

Access control

How is access managed to this environment? (e.g., role-based, per user)

Test data

How is test data created/sourced?
Does data come from production data?
1. If yes, is it anonymised?

Production Environment

Access control

How is access managed to this environment? (e.g., role-based, per user)

Code/build Deployment

Kubernetes

Describe your Pod deployment process.
What checks do you carry out on Pods at deployment time?

Volume mounts/Data sources

Describe your deployment/update process for data sources.
Is your data persistence layer built using the infrastructure-as-code paradigm?
Is your data persistence layer version controlled?

Permissions

Automated/manual system

Kubernetes Operational Practices

Health Checks

Do you maintain KPIs for determining if your Kubernetes cluster is working well? (e.g., USE: utilization, saturation, errors)
Do you have relevant alerts in place?
Do you have alerts for “slowly filling and getting full”? (e.g., disk space)
Do you require Liveness, Readiness and Startup Probes?
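Requiring probes can be verified per container. The following is a minimal sketch, assuming a Pod manifest loaded into a Python dictionary; the function name `missing_probes` and the example manifest are hypothetical. It reports, for each container, which of `livenessProbe`, `readinessProbe` and `startupProbe` are absent. Whether all three are appropriate for every workload is a judgment call that the script does not make.

```python
from typing import Dict, List

REQUIRED_PROBES = ("livenessProbe", "readinessProbe", "startupProbe")


def missing_probes(pod: dict) -> Dict[str, List[str]]:
    """Map each container name to the probes it does not define."""
    missing = {}
    for container in pod.get("spec", {}).get("containers", []):
        absent = [probe for probe in REQUIRED_PROBES if probe not in container]
        if absent:
            missing[container.get("name", "<unnamed>")] = absent
    return missing


pod = {"spec": {"containers": [
    {"name": "api",
     "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
     "readinessProbe": {"httpGet": {"path": "/ready", "port": 8080}},
     "startupProbe": {"httpGet": {"path": "/healthz", "port": 8080}}},
    {"name": "sidecar",
     "livenessProbe": {"tcpSocket": {"port": 9090}}},
]}}
print(missing_probes(pod))
```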

Operational Logging and Alerts

Do you have relevant alerts in place?
Do you have a defined process for handling alerts?
What is the retention period for these logs?

SLA/SLO management

Do you maintain KPIs for determining if your users are served well? (e.g., daily active users, log-ins per day, registrations per day)
Do you maintain KPIs for determining if your application is working well? (e.g., RED: request rate, error rate, duration)

Path to resolution

Describe the chain of events from an alert/issue being raised to its resolution.

Security Logging and Alerts

Do you have relevant alerts in place?
Do you have a defined process for handling alerts?
What is the retention period for these logs?

Triage

Do you have processes and systems in place for evaluating the severity of security incidents? (e.g., Info, Low, Medium, High, Critical)
Do you have processes and systems in place for informing/alerting external stakeholders of relevant incidents? (e.g., data leaks, system availability)

Path to resolution

Describe the chain of events from an alert/issue being raised to its resolution.

Notes

Footnotes

¹ By “tamper-proof” we mean that a person gaining (authorized or unauthorized) access to the production Kubernetes cluster cannot modify or delete existing log entries.
² See https://en.wikipedia.org/wiki/HTTP_Strict_Transport_Security
³ By “GitOps” we mean that all system changes (except in “break glass” scenarios) need to be performed via git commits.
