CSM Troubleshooting Information

This document provides links to troubleshooting information for services and functionality provided by CSM.

Known issues
Kubernetes
Grafana dashboards
UAS
Booting
- UAN
- Compute node
Compute rolling upgrades
Configuration management
Security and authentication
ConMan
Utility storage
Node management
Customer Management Network (CMN)
Domain Name Service (DNS)
MetalLB
Spire

Known issues

SAT/HSM/CAPMC Component Power State Mismatch
HMS Discovery job not creating RedfishEndpoints in Hardware State Manager
initrd.img.xz not found
SSL Certificate Validation Issues
SLS Not Working During Node Rebuild

Kubernetes

General Kubernetes Commands for Troubleshooting
Kubernetes Log File Locations
Liveness or Readiness Probe Failures
Unresponsive kubectl Commands
Kubernetes Node NotReady
Kubernetes Pods not Starting
Postgres Database
Recover from Postgres WAL Event
Restore Postgres
Disaster Recovery for Postgres

Grafana dashboards

Grafana Dashboards

UAS

Viewing UAI Log Output
Stale Brokered UAIs
UAI Stuck in ContainerCreating
Duplicate Mount Paths in a UAI
Missing or Incorrect UAI Images
Common Mistakes When Creating a Custom End-User UAI Image

Booting

UAN boot issues

UAN Boot Issues

Compute node boot issues

Issues Related to Unified Extensible Firmware Interface (UEFI)
Issues Related to Dynamic Host Configuration Protocol (DHCP)
Issues Related to the Boot Script Service
Issues Related to Trivial File Transfer Protocol (TFTP)
Troubleshooting Using Kubernetes
Log File Locations and Ports Used
Issues Related to Slow Boot Times

Compute rolling upgrades

CRUS was deprecated in CSM 1.2.0. It will be removed in a future CSM release and replaced with BOS V2, which will provide similar functionality. See Deprecated features.

Nodes Failing to Upgrade in a CRUS Session
Failed CRUS Session Because of Unmet Conditions
Failed CRUS Session Because of Bad Parameters

Configuration management

Ansible Play Failures in CFS Sessions
CFS Session Failing to Complete

Security and authentication

Common Vault Cluster Issues
Keycloak User Localization

ConMan

ConMan Blocking Access to a Node BMC
ConMan Failing to Connect to a Console
ConMan Asking for Password on SSH Connection

Utility storage

Failure to Get Ceph Health
Down OSDs
Ceph OSDs Reporting Full
System Clock Skew
Unresponsive S3 Endpoint
Ceph-Mon Processes Stopping and Exceeding Max Restarts
Large Object Map Objects in Ceph Health
Failure of RGW Health Check
Troubleshoot S3FS Mounts

Node management

Issues with Redfish Endpoint Discovery
Check for Redfish Events from Nodes
Interfaces with IP Address Issues
Loss of Console Connections and Logs on Gigabyte Nodes

Customer Management Network (CMN)

CMN Issues

Domain Name Service (DNS)

Connectivity to Services with External IP Addresses
DNS Configuration Issues

MetalLB

Services Without an Allocated IP Address
BGP not Accepting Routes from MetalLB

Spire

Restore Spire Postgres without a Backup
Spire Database Cluster DNS Lookup Failure
Spire Failing to Start on NCNs
