The Cray System Management (CSM) operational activities are administrative procedures required to operate an HPE Cray EX system with CSM software installed.
The following administrative topics can be found in this guide:
- CSM product management
- Bare-metal
- Image management
- Boot orchestration
- System power off procedures
- System power on procedures
- Power management
- Artifact management
- Compute rolling upgrades
- Configuration management
- Kubernetes
- Package repository management
- Security and authentication
- Resiliency
- ConMan
- Utility storage
- System management health
- System Layout Service (SLS)
- System configuration service
- Hardware State Manager (HSM)
- Hardware Management (HM) collector
- HPE Power Distribution Unit (PDU)
- Node management
- Network
- Spire
- Update firmware with FAS
- User Access Service (UAS)
Important procedures for configuring, managing, and validating the CSM environment.
- Validate CSM Health
- Configure Keycloak Account
- Configure the Cray Command Line Interface (Cray CLI)
- Change Passwords and Credentials
- Configure Non-Compute Nodes with CFS
- Configure CSM Packages with CFS
- Perform NCN Personalization
- Access the LiveCD USB Device After Reboot
- Post-Install Customizations
- Validate Signed RPMs
General information on what needs to be done before the initial install of CSM.
Build and customize image recipes with the Image Management Service (IMS).
- Image Management
- Image Management Workflows
- Upload and Register an Image Recipe
- Build a New UAN Image Using the Default Recipe
- Build an Image Using IMS REST Service
- Import External Image to IMS
- Customize an Image Root Using IMS
- Delete or Recover Deleted IMS Content
- Configure IMS to Validate RPMs
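For orientation, the following minimal sketch shows how the IMS procedures above are typically driven from the Cray CLI. The recipe name and distribution are placeholder values, and uploading the recipe archive to S3 is a separate step covered in the linked procedures.

```bash
# List the image recipes currently registered in IMS
cray ims recipes list --format json

# Register a new recipe record (name and distribution are examples;
# the recipe archive itself is uploaded to S3 in a separate step)
cray ims recipes create \
    --name "example-compute-recipe" \
    --recipe-type kiwi-ng \
    --linux-distribution sles15
```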
Use the Boot Orchestration Service (BOS) to boot, configure, and shut down collections of nodes.
- Boot Orchestration Service (BOS)
- BOS Workflows
- BOS Components
- BOS Session Templates
- BOS Sessions
- Manage a BOS Session
- View the Status of a BOS Session
- Limit the Scope of a BOS Session
- Stage Changes with BOS
- Configure the BOS Timeout When Booting Compute Nodes
- Kernel Boot Parameters
- Check the Progress of BOS Session Operations
- Clean Up Logs After a BOA Kubernetes Job
- Clean Up After a BOS/BOA Job is Completed or Cancelled
- Troubleshoot UAN Boot Issues
- Troubleshoot Booting Nodes with Hardware Issues
- BOS Options
- Rolling Upgrades using BOS
- BOS Limitations for Gigabyte BMC Hardware
- Compute Node Boot Sequence
- Healthy Compute Node Boot Process
- Node Boot Root Cause Analysis
- Compute Node Boot Issue Symptom: Duplicate Address Warnings and Declined DHCP Offers in Logs
- Compute Node Boot Issue Symptom: Node is Not Able to Download the Required Artifacts
- Compute Node Boot Issue Symptom: Message About Invalid EEPROM Checksum in Node Console or Log
- Compute Node Boot Issue Symptom: Node HSN Interface Does Not Appear or Show Detected Links
- Compute Node Boot Issue Symptom: Node Console or Logs Indicate that the Server Response has Timed Out
- Tools for Resolving Compute Node Boot Issues
- Troubleshoot Compute Node Boot Issues Related to Unified Extensible Firmware Interface (UEFI)
- Troubleshoot Compute Node Boot Issues Related to Dynamic Host Configuration Protocol (DHCP)
- Troubleshoot Compute Node Boot Issues Related to the Boot Script Service
- Troubleshoot Compute Node Boot Issues Related to Trivial File Transfer Protocol (TFTP)
- Troubleshoot Compute Node Boot Issues Using Kubernetes
- Log File Locations and Ports Used in Compute Node Boot Troubleshooting
- Troubleshoot Compute Node Boot Issues Related to Slow Boot Times
- Customize iPXE Binary Names
- Edit the iPXE Embedded Boot Script
- Redeploy the iPXE and TFTP Services
- Upload Node Boot Information to Boot Script Service (BSS)
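As a quick illustration of the BOS workflows above, the sketch below assumes BOS V2 and uses a placeholder session template name; exact subcommands differ between BOS V1 and V2.

```bash
# List the available session templates
cray bos v2 sessiontemplates list --format json

# Boot the nodes described by a template ("compute-template" is a placeholder)
cray bos v2 sessions create \
    --template-name compute-template \
    --operation boot

# Watch the session progress
cray bos v2 sessions list --format json
```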
Procedures required for a full power off of an HPE Cray EX system.
Additional links to power off sub-procedures are provided for reference. Refer to the main procedure linked above before using any of these sub-procedures:
- Prepare the System for Power Off
- Shut Down and Power Off Compute and User Access Nodes
- Save Management Network Switch Configuration Settings
- Power Off Compute Cabinets
- Shut Down and Power Off the Management Kubernetes Cluster
- Power Off the External Lustre File System
Procedures required for a full power on of an HPE Cray EX system.
Additional links to power on sub-procedures are provided for reference. Refer to the main procedure linked above before using any of these sub-procedures:
- Power On and Start the Management Kubernetes Cluster
- Power On Compute Cabinets
- Power On the External Lustre File System
- Power On and Boot Compute and User Access Nodes
- Recover from a Liquid Cooled Cabinet EPO Event
HPE Cray System Management (CSM) software manages and controls power out-of-band through Redfish APIs.
- Power Management
- Cray Advanced Platform Monitoring and Control (CAPMC)
- Liquid Cooled Node Power Management
- Standard Rack Node Power Management
- Ignore Nodes with CAPMC
- Set the Turbo Boost Limit
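A hedged example of out-of-band power control through CAPMC follows; the xname is a placeholder, and component names must match the local system.

```bash
# Query the power state of a node
cray capmc get_xname_status create --xnames x1000c0s0b0n0

# Power the node off, then back on, via its BMC
cray capmc xname_off create --xnames x1000c0s0b0n0
cray capmc xname_on create --xnames x1000c0s0b0n0
```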
Use the Ceph Object Gateway Simple Storage Service (S3) API to manage artifacts on the system.
- Artifact Management
- Manage Artifacts with the Cray CLI
- Use S3 Libraries and Clients
- Generate Temporary S3 Credentials
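For example, the Cray CLI wraps the S3 API for common artifact operations; the object key and local file name below are placeholders.

```bash
# List buckets, then the artifacts in the standard boot-images bucket
cray artifacts buckets list
cray artifacts list boot-images

# Download one artifact to a local file (key and file name are examples)
cray artifacts get boot-images example/kernel ./kernel
```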
Upgrade sets of compute nodes with the Compute Rolling Upgrade Service (CRUS) without requiring an entire set of nodes to be out of service at once. CRUS enables administrators to limit the impact that compute node upgrades have on production by working through one step of the upgrade process at a time.
**NOTE:** CRUS was deprecated in CSM 1.2.0. It will be removed in a future CSM release and replaced with BOS V2, which will provide similar functionality. See Deprecated features.
- Compute Rolling Upgrade Service (CRUS)
- CRUS Workflow
- Upgrade Compute Nodes with CRUS
- Troubleshoot Nodes Failing to Upgrade in a CRUS Session
- Troubleshoot a Failed CRUS Session Because of Unmet Conditions
- Troubleshoot a Failed CRUS Session Because of Bad Parameters
The Configuration Framework Service (CFS) is available on systems for remote execution and configuration management of nodes and boot images.
- Configuration Management
- Configuration Layers
- Ansible Inventory
- Configuration Sessions
- Create a CFS Session with Dynamic Inventory
- Create an Image Customization CFS Session
- Set Limits for a Configuration Session
- Use a Specific Inventory for a Configuration Session
- Change the Ansible Verbosity Logs
- Set the `ansible.cfg` for a Session
- Delete CFS Sessions
- Automatic Session Deletion with `sessionTTL`
- Track the Status of a Session
- View Configuration Session Logs
- Troubleshoot Ansible Play Failures in CFS Sessions
- Troubleshoot CFS Session Failing to Complete
- Troubleshoot CFS Sessions Failing to Start
- Configuration Management with the CFS Batcher
- CFS Flow Diagrams
- Configuration Management of System Components
- Ansible Execution Environments
- CFS Global Options
- Version Control Service (VCS)
- Write Ansible Code for CFS
- CFS Key Management
- Management NCN personalization and image customization
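As a brief illustration of the session workflow above, the following sketch uses placeholder session and configuration names.

```bash
# Show the configurations known to CFS
cray cfs configurations list --format json

# Create a session that applies one of them to its target nodes
cray cfs sessions create \
    --name example-session \
    --configuration-name example-configuration

# Follow the session status
cray cfs sessions describe example-session --format json
```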
The system management components are broken down into a series of micro-services. Each service is independently deployable, fine-grained, and uses lightweight protocols. As a result, the system's micro-services are modular, resilient, and can be updated independently. Services within the Kubernetes architecture communicate using REST APIs.
- Kubernetes Architecture
- About `kubectl`
- About Kubernetes Taints and Labels
- Kubernetes Storage
- Kubernetes Networking
- Retrieve Cluster Health Information Using Kubernetes
- Pod Resource Limits
- About etcd
- Check the Health and Balance of etcd Clusters
- Rebuild Unhealthy etcd Clusters
- Backups for etcd-operator Clusters
- Create a Manual Backup of a Healthy Bare-Metal etcd Cluster
- Create a Manual Backup of a Healthy etcd Cluster
- Restore an etcd Cluster from a Backup
- Repopulate Data in etcd Clusters When Rebuilding Them
- Restore Bare-Metal etcd Clusters from an S3 Snapshot
- Rebalance Healthy etcd Clusters
- Check for and Clear etcd Cluster Alarms
- Report the Endpoint Status for etcd Clusters
- Clear Space in an etcd Cluster Database
- About Postgres
- Kyverno policy management
- Troubleshoot Intermittent HTTP 503 Code Failures
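A few `kubectl` commands give a quick first look at cluster health before diving into the procedures above; the `services` namespace is where CSM deploys its micro-services.

```bash
# Node and pod health at a glance
kubectl get nodes
kubectl get pods -A -o wide | grep -v -E 'Running|Completed'

# The etcd clusters backing individual services run as pods
kubectl get pods -n services | grep etcd
```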
Repositories are added to systems to extend the system functionality beyond what is initially delivered. The Sonatype Nexus Repository Manager is the primary method for repository management. Nexus hosts the Yum, Docker, raw, and Helm repositories for software and firmware content.
- Package Repository Management
- Package Repository Management with Nexus
- Manage Repositories with Nexus
- Nexus Configuration
- Nexus Deployment
- Nexus Export and Restore
- Restrict Admin Privileges in Nexus
- Repair Yum Repository Metadata
- Nexus Space Cleanup
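Nexus exposes a REST API alongside its web UI. As a minimal sketch, assuming the standard `packages.local` hostname used on CSM systems and that `jq` is installed:

```bash
# List the names of all repositories hosted by Nexus
curl -s https://packages.local/service/rest/v1/repositories | jq -r '.[].name'
```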
Mechanisms used by the system to ensure the security and authentication of internal and external requests.
- System Security and Authentication
- Manage System Passwords
- Update NCN Passwords
- Change Root Passwords for Compute Nodes
- Set NCN Image Root Password, SSH Keys, and Timezone
- Change EX Liquid-Cooled Cabinet Global Default Password
- Provisioning a Liquid-Cooled EX Cabinet CEC with Default Credentials
- Updating the Liquid-Cooled EX Cabinet Default Credentials after a CEC Password Change
- Update Default Air-Cooled BMC and Leaf-BMC Switch SNMP Credentials
- Change Air-Cooled Node BMC Credentials
- Change SNMP Credentials on Leaf-BMC Switches
- Update Default ServerTech PDU Credentials used by the Redfish Translation Service
- Change Credentials on ServerTech PDUs
- Add Root Service Account for Gigabyte Controllers
- Recovering from Mismatched BMC Credentials
- SSH Keys
- Authenticate an Account with the Command Line
- Default Keycloak Realms, Accounts, and Clients
- Certificate Types
- Change Keycloak Token Lifetime
- Change the Keycloak Admin Password
- Create a Service Account in Keycloak
- Retrieve the Client Secret for Service Accounts
- Get a Long-Lived Token for a Service Account
- Access the Keycloak User Management UI
- Create Internal User Accounts in the Keycloak Shasta Realm
- Delete Internal User Accounts in the Keycloak Shasta Realm
- Create Internal User Groups in the Keycloak Shasta Realm
- Remove Internal Groups from the Keycloak Shasta Realm
- Remove the Email Mapper from the LDAP User Federation
- Re-Sync Keycloak Users to Compute Nodes
- Keycloak Operations
- Configure Keycloak for LDAP/AD authentication
- Configure the RSA Plugin in Keycloak
- Preserve Username Capitalization for Users Exported from Keycloak
- Change the LDAP Server IP Address for Existing LDAP Server Content
- Change the LDAP Server IP Address for New LDAP Server Content
- Remove the LDAP User Federation from Keycloak
- Add LDAP User Federation
- Keycloak User Management with
kcadm.sh
- Keycloak User Localization
- Public Key Infrastructure (PKI)
- API Authorization
- Manage Sealed Secrets
- Audit Logs
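As a sketch of the token flow behind several of the procedures above (the client ID is a placeholder, and the secret is retrieved as described in "Retrieve the Client Secret for Service Accounts"):

```bash
# Request a token for a service account from the Keycloak shasta realm
curl -s \
    -d grant_type=client_credentials \
    -d client_id=example-service-account \
    -d client_secret="${CLIENT_SECRET}" \
    https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token \
    | jq -r '.access_token'
```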
HPE Cray EX systems are designed so that system management services (SMS) are fully resilient and that there is no single point of failure.
- Resiliency
- Resilience of System Management Services
- Restore System Functionality if a Kubernetes Worker Node is Down
- Recreate `StatefulSet` Pods on Another Node
- NTP Resiliency
ConMan is a tool used for connecting to remote consoles and collecting console logs. These node logs can then be used for various administrative purposes, such as troubleshooting node boot issues.
- Access Compute Node Logs
- Access Console Log Data Via the System Monitoring Framework (SMF)
- Manage Node Consoles
- Log in to a Node Using ConMan
- Establish a Serial Connection to NCNs
- Disable ConMan After System Software Installation
- Troubleshoot ConMan Blocking Access to a Node BMC
- Troubleshoot ConMan Failing to Connect to a Console
- Troubleshoot ConMan Asking for Password on SSH Connection
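On CSM systems, ConMan runs inside the `cray-console-node` pods. A hedged sketch follows; the pod name and xname are placeholders, and the console for a given node may be hosted by a different `cray-console-node` pod.

```bash
# List the consoles managed by one console pod
kubectl -n services exec -it cray-console-node-0 -- conman -q

# Join the console of a specific node
kubectl -n services exec -it cray-console-node-0 -- conman -j x3000c0s19b1n0
```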
Ceph is the utility storage platform that is used to enable pods to store persistent data. It is deployed to provide block, object, and file storage to the management services running on Kubernetes, as well as for telemetry data coming from the compute nodes.
- Utility Storage
- Collect Information about the Ceph Cluster
- Manage Ceph Services
- Adjust Ceph Pool Quotas
- Add Ceph OSDs
- Ceph Health States
- Ceph Deep Scrubs
- Ceph Daemon Memory Profiling
- Ceph Service Check Script Usage
- Ceph Orchestrator Usage
- Ceph Storage Types
- CSM RBD Tool Usage
- `cubs_tool` Usage
- Dump Ceph Crash Data
- Identify Ceph Latency Issues
- Cephadm Reference Material
- Restore Nexus Data After Data Corruption
- Troubleshoot Failure to Get Ceph Health
- Troubleshoot a Down OSD
- Troubleshoot Ceph OSDs Reporting Full
- Troubleshoot System Clock Skew
- Troubleshoot an Unresponsive S3 Endpoint
- Troubleshoot Ceph-Mon Processes Stopping and Exceeding Max Restarts
- Troubleshoot Pods Multi-Attach Error
- Troubleshoot Large Object Map Objects in Ceph Health
- Troubleshoot Failure of RGW Health Check
- Troubleshooting Ceph MDS Reporting Slow Requests and Failure on Client
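Several of the procedures above start from the same few status commands, run from a node with a Ceph keyring available (typically a master or storage NCN):

```bash
ceph -s              # overall cluster status
ceph health detail   # expanded detail for any warnings or errors
ceph osd tree        # OSD layout and up/down state
ceph df              # pool utilization against quotas
```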
System management health services enable administrators to assess the health of their system. Operators need to quickly and efficiently troubleshoot system issues as they occur, and to be confident that a lack of reported issues indicates the system is operating normally.
- System Management Health
- System Management Health Checks and Alerts
- Access System Management Health Services
- Configure Prometheus Email Alert Notifications
- Grafana Dashboards by Component
- Remove Kiali
- Prometheus-Kafka-Adapter error during install
The System Layout Service (SLS) holds information about the system design, such as the physical locations of network hardware, compute nodes, and cabinets. It also stores information about the network, such as which port on which switch should be connected to each compute node.
- System Layout Service (SLS)
- Dump SLS Information
- Load SLS Database with Dump File
- Add Liquid-Cooled Cabinets to SLS
- Add UAN CAN IP Addresses to SLS
- Update SLS with UAN Aliases
- Add an alias to a service
- Create a Backup of the SLS Postgres Database
- Restore SLS Postgres Database from Backup
- Restore SLS Postgres without an Existing Backup
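For example, dumping and querying SLS from the Cray CLI (the xname is a placeholder):

```bash
# Dump the current SLS state to a file for inspection or backup
cray sls dumpstate list --format json > sls_dump.json

# Look up a single hardware entry
cray sls hardware describe x3000c0s19b1n0 --format json
```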
The System Configuration Service (SCSD) allows administrators to set various BMC and controller parameters. These parameters are typically set during discovery, but this tool enables parameters to be set before or after discovery. The operations to change these parameters are available in the Cray CLI under the `scsd` command.
- System Configuration Service
- Configure BMC and Controller Parameters with SCSD
- Manage Parameters with the SCSD Service
- Set BMC Credentials
Use the Hardware State Manager (HSM) to monitor and interrogate hardware components in the HPE Cray EX system, tracking hardware state and inventory information, and making it available via REST queries and message bus events when changes occur.
- Hardware State Manager (HSM)
- Hardware Management Services (HMS) Locking API
- Component Groups and Partitions
- Hardware State Manager (HSM) State and Flag Fields
- HSM Roles and Subroles
- Add an NCN to the HSM Database
- Add a Switch to the HSM Database
- Create a Backup of the HSM Postgres Database
- Restore HSM Postgres from a Backup
- Restore HSM Postgres without a Backup
- Set BMC Management Role
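A short sketch of interrogating HSM from the Cray CLI (the xname is a placeholder):

```bash
# List compute nodes known to HSM, with their state and flag fields
cray hsm state components list --role Compute --format json

# Inspect one component in detail
cray hsm state components describe x3000c0s19b1n0
```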
The Hardware Management (HM) Collector is used to collect telemetry and Redfish events from hardware in the system.
Procedures for managing and setting up HPE PDUs.
Monitor and manage compute nodes (CNs) and non-compute nodes (NCNs) used in the HPE Cray EX system.
- Node Management
- Node Management Workflows
- Rebuild NCNs
- Reboot NCNs
- Enable Nodes
- Disable Nodes
- Find Node Type and Manufacturer
- Add additional Liquid-Cooled Cabinets to a System
- Updating Cabinet Routes on Management NCNs
- Move a liquid-cooled blade within a System
- Add a Standard Rack Node
- Clear Space in Root File System on Worker Nodes
- Troubleshoot Issues with Redfish Endpoint Discovery
- Check for Redfish Events from Nodes
- Reset Credentials on Redfish Devices
- Access and Update Settings for Replacement NCNs
- Change Settings for HMS Collector Polling of Air Cooled Nodes
- Use the Physical KVM
- Launch a Virtual KVM on Gigabyte Nodes
- Launch a Virtual KVM on Intel Nodes
- Change Java Security Settings
- Configuration of NCN Bonding
- Troubleshoot Loss of Console Connections and Logs on Gigabyte Nodes
- Check the BMC Failover Mode
- Update Compute Node Mellanox HSN NIC Firmware
- TLS Certificates for Redfish BMCs
- Dump a Non-Compute Node
- Enable Passwordless Connections to Liquid Cooled Node BMCs
- Configure NTP on NCNs
- Swap a Compute Blade with a Different System
- Swap a Compute Blade with a Different System Using SAT
- Replace a Compute Blade
- Replace a Compute Blade Using SAT
- Update the Gigabyte Node BIOS Time
- S3FS Usage Guidelines
Overview of the several different networks supported by the HPE Cray EX system.
- Network
- Access to System Management Services
- Default IP Address Ranges
- Connect to the HPE Cray EX Environment
- Connect to Switch over USB-Serial Cable
- Create a CSM Configuration Upgrade Plan
- Gateway Testing
HPE Cray EX systems can have network switches in many roles: spine switches, leaf switches, LeafBMC switches, and CDU switches. Newer systems have HPE Aruba switches, while older systems have Dell and Mellanox switches. Switch IP addresses are generated by Cray Site Init (CSI).
- HPE Cray EX Management Network Installation and Configuration Guide
- Update Management Network Firmware
The customer accessible networks (CMN/CAN/CHN) provide access from outside the customer network to services, NCNs, and User Access Nodes (UANs) in the system.
- Customer Accessible Networks
- Externally Exposed Services
- Connect to the CMN and CAN
- BI-CAN Aruba/Arista Configuration
- MetalLB Peering with Arista Edge Router
- CAN/CMN with Dual-Spine Configuration
- Troubleshoot CMN Issues
The DHCP service on the HPE Cray EX system uses the Internet Systems Consortium (ISC) Kea tool. Kea provides more robust management capabilities for DHCP servers.
The central DNS infrastructure provides the structural networking hierarchy and datastore for the system.
- DNS
- Manage the DNS Unbound Resolver
- Enable `ncsd` on UANs
- PowerDNS Configuration
- PowerDNS Migration Guide
- Troubleshoot Common DNS Issues
- Troubleshoot PowerDNS
External DNS, along with the Customer Management Network (CMN), Border Gateway Protocol (BGP), and MetalLB, makes it simpler to access the HPE Cray EX API and system management services. Services are accessible directly from a laptop without needing to tunnel into a non-compute node (NCN) or override /etc/hosts settings.
- External DNS
- External DNS `csi config init` Input Values
- Update the `cmn-external-dns` Value Post-Installation
- Ingress Routing
- External DNS Failing to Discover Services Workaround
- Troubleshoot Connectivity to Services with External IP addresses
- Troubleshoot DNS Configuration Issues
MetalLB is a component in Kubernetes that manages access to `LoadBalancer` services from outside the Kubernetes cluster. There are `LoadBalancer` services on the Node Management Network (NMN), Hardware Management Network (HMN), and Customer Access Network (CAN). MetalLB can run in either `Layer2-mode` or `BGP-mode` for each address pool it manages. `BGP-mode` is used for the NMN, HMN, and CAN. This enables true load balancing (`Layer2-mode` provides failover, not load balancing) and allows for a more robust layer 3 configuration for these networks.
- MetalLB in BGP-Mode
- MetalLB Configuration
- Check BGP Status and Reset Sessions
- Troubleshoot Services without an Allocated IP Address
- Troubleshoot BGP not Accepting Routes from MetalLB
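A quick way to inspect the running MetalLB deployment is sketched below; this assumes the ConfigMap-based configuration used by older MetalLB releases (newer releases configure address pools through CRDs instead).

```bash
# Check that the MetalLB controller and speakers are running
kubectl -n metallb-system get pods

# Review the address pools and BGP peer configuration
kubectl -n metallb-system get configmap config -o yaml
```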
Spire provides the ability to authenticate nodes and workloads, and to securely distribute and manage their identities along with the credentials associated with them.
- Restore Spire Postgres without a Backup
- Troubleshoot Spire Failing to Start on NCNs
- Update Spire Intermediate CA Certificate
- Xname Validation
- Restore Missing Spire Meta-Data
The Firmware Action Service (FAS) provides an interface for managing firmware versions of Redfish-enabled hardware in the system. FAS interacts with the Hardware State Managers (HSM), device data, and image data in order to update firmware.
See Update Firmware with FAS for a list of components that can be upgraded with FAS. Refer to the HPC Firmware Pack (HFP) product stream to update firmware on other components.
- Update Firmware with FAS
- FAS CLI
- FAS Filters
- FAS Recipes
- FAS Admin Procedures
- FAS Use Cases
- Upload Olympus BMC Recovery Firmware into TFTP Server
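As an illustrative minimal action (the filters and xname are placeholders; see FAS Filters and FAS Recipes for real values), note that FAS stays in its default dry-run mode unless `overrideDryrun` is set to `true`:

```bash
# Write a minimal FAS action payload; with overrideDryrun set to false,
# FAS only reports what it would update (a dry run)
cat > action.json <<'EOF'
{
  "stateComponentFilter": { "xnames": ["x3000c0s19b1"] },
  "targetFilter": { "targets": ["BMC"] },
  "command": { "overrideDryrun": false, "description": "example dry run" }
}
EOF

cray fas actions create action.json

# Check progress using the action ID returned by the create command
cray fas actions describe "${ACTION_ID}"
```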
The User Access Service (UAS) is a containerized service managed by Kubernetes that enables application developers to create and run user applications. Users launch a User Access Instance (UAI) using the `cray` command. Users can also transfer data between the Cray system and external systems using the UAI.
- User Access Service (UAS)
- End-User UAIs
- Special Purpose UAIs
- Elements of a UAI
- UAI Host Nodes
- UAI `macvlans` Network Attachments
- UAI Host Node Selection
- UAI Network Attachments
- Configure UAIs in UAS
- UAI Management
- Legacy Mode User-Driven UAI Management
- Broker Mode UAI Management
- UAI Images
- Troubleshoot UAS Issues
- Troubleshoot UAS by Viewing Log Output
- Troubleshoot UAIs by Viewing Log Output
- Troubleshoot Stale Brokered UAIs
- Troubleshoot UAI Stuck in `ContainerCreating`
- Troubleshoot Duplicate Mount Paths in a UAI
- Troubleshoot Missing or Incorrect UAI Images
- Troubleshoot UAIs with Administrative Access
- Troubleshoot Common Mistakes when Creating a Custom End-User UAI Image
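For orientation, legacy-mode UAI management from the `cray` command looks roughly like this (the public key path and UAI name are placeholders):

```bash
# Launch a UAI using your SSH public key
cray uas create --publickey ~/.ssh/id_rsa.pub

# List your running UAIs, then clean one up by name
cray uas list
cray uas delete --uai-list "${UAI_NAME}"
```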
The System Admin Toolkit (SAT) is a command-line interface that can assist administrators with common tasks, such as troubleshooting and querying information about the HPE Cray EX System, system boot and shutdown, and replacing hardware components. In CSM 1.3 and newer, the `sat` command is available on the Kubernetes NCNs without installing the SAT product stream.
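For example (available subcommands and output vary by SAT release):

```bash
sat showrev   # report product and system revision information
sat status    # summarize component state as known to HSM
```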