
Server Troubleshooting and Resolution

Emmanuel Nwanochie edited this page Jul 30, 2024 · 2 revisions

Server Troubleshooting And Resolution Guide

Troubleshooting and Resolving High CPU Usage in Linux

Alert Rule

100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle", instance="localhost:9100"}[5m])))

This alert triggers when the average CPU usage over 5 minutes exceeds a certain threshold.
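As a rough illustration of what the rule measures, the sketch below computes the same busy percentage from two readings of the aggregate cpu line in /proc/stat. The function name cpu_busy_pct is hypothetical, and treating the first eight counters as the total is an assumption about the /proc/stat layout on common Linux kernels. Prometheus averages per-core rates over the 5-minute window, so its figure will not match a two-sample reading exactly.

```shell
# cpu_busy_pct SAMPLE1 SAMPLE2
# Each argument is one aggregate "cpu ..." line from /proc/stat.
# Prints 100 * (1 - idle_delta / total_delta), the same shape as the alert rule.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Fields 2-9: user nice system idle iowait irq softirq steal
            for (i = 2; i <= 9; i++) total += $i
            idle = $5
            if (NR == 2 && total > prev_total)
                printf "%.1f\n", 100 * (1 - (idle - prev_idle) / (total - prev_total))
            prev_total = total
            prev_idle = idle
        }'
}

# Example with made-up counters: 400 ticks elapsed, 250 of them idle.
cpu_busy_pct "cpu 100 0 100 800 0 0 0 0 0 0" "cpu 200 0 150 1050 0 0 0 0 0 0"   # prints 37.5
```

On a live host the two samples could come from running grep '^cpu ' /proc/stat a few seconds apart.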

Investigation Steps

  1. Verify Alert

    • Check Prometheus/Grafana to confirm high CPU usage
    • Ensure alert is not a false positive
  2. Identify CPU-intensive processes: use top or htop

  3. Analyze specific processes

ps aux | grep <process_name_or_PID>
  4. Check system load average
uptime
  5. Monitor CPU usage over time
sudo sar -u 1 10
  6. Examine CPU core usage
mpstat -P ALL 1 5
  7. Investigate high I/O wait times
iostat -xz 1 10
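When the process list is long, it can help to filter it down to the heavy consumers. The following is a minimal sketch; the helper name high_cpu and the sample rows are made up for illustration, and the input format assumes three columns (PID, %CPU, command) with a header line.

```shell
# high_cpu LIMIT
# Reads "PID %CPU COMMAND" rows (with a header line) on stdin and prints
# the rows whose %CPU is at or above LIMIT.
high_cpu() {
    awk -v limit="$1" 'NR > 1 && $2 + 0 >= limit + 0 { print $1, $2, $3 }'
}

# Example with made-up rows; on a live host the input could come from
#   ps -eo pid,pcpu,comm
printf 'PID %%CPU COMMAND\n101 2.0 sshd\n4321 93.5 stress\n' | high_cpu 50   # prints 4321 93.5 stress
```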

Resolution Steps

  8. Terminate unnecessary processes

kill <PID>

or force kill:

kill -9 <PID>

  9. Adjust process priority
renice +10 <PID>
  10. Limit CPU usage for a process
sudo cpulimit -p <PID> -l 50
  11. Update or optimize software
sudo apt update && sudo apt upgrade
  12. Check for malware
sudo rkhunter --check
  13. Optimize system services by disabling those that are not needed (add --now to also stop a running service)
sudo systemctl disable <service_name>
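Since kill -9 gives a process no chance to clean up, one common pattern is to send SIGTERM first and escalate only if the process is still alive after a grace period. The sketch below shows that pattern; the function name stop_gracefully and the 10-second default are illustrative choices, not part of any standard tool.

```shell
# stop_gracefully PID [TIMEOUT_SECONDS]
# Sends SIGTERM, polls once per second, and sends SIGKILL only if the
# process is still present when the timeout runs out.
stop_gracefully() {
    pid=$1
    timeout=${2:-10}
    kill -TERM "$pid" 2>/dev/null || return 1   # no such process, or no permission
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0  # process has exited
        sleep 1
        timeout=$((timeout - 1))
    done
    kill -KILL "$pid" 2>/dev/null
}

# Usage: stop_gracefully <PID> 15
```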

Post-Resolution Actions

  1. Document and Report
  • Record actions taken and their effects
  • Update alert status
  • Notify relevant team members (devops team)
  2. Preventive Measures
  • Implement regular system maintenance
  • Set up resource usage monitoring
  • Optimize application code if applicable
  3. Follow-up
  • Conduct root cause analysis
  • Implement long-term solutions
  • Update runbook if necessary

Note: Always back up your system before making significant changes, and test in a non-production environment first.

Troubleshooting and Resolving Low Memory Space in Linux

Alert rule

(1 - (node_memory_MemAvailable_bytes{instance="localhost:9100", job="node_exporter"} / node_memory_MemTotal_bytes{instance="localhost:9100", job="node_exporter"})) * 100
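To cross-check the alert value on the host itself, the same formula can be applied to /proc/meminfo, which reports the MemTotal and MemAvailable fields behind these metrics. This is a minimal sketch; the helper name mem_used_pct is hypothetical.

```shell
# mem_used_pct [MEMINFO_FILE]
# Prints (1 - MemAvailable / MemTotal) * 100, mirroring the alert rule.
# Reads /proc/meminfo unless another file is given.
mem_used_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", (1 - avail / total) * 100 }' "${1:-/proc/meminfo}"
}

# Usage: mem_used_pct
```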

Troubleshooting tips

  1. Check Current Memory Usage

Use the free command to view memory statistics:

free -h

For a more detailed view, use:

cat /proc/meminfo
  2. Identify Memory-Intensive Processes: Use top or htop to see which processes are consuming the most memory
# Use top
top

# Use htop
htop

Sort processes by memory usage in top by pressing Shift+M.

  3. Analyze Specific Processes: For detailed information about a process's memory usage:
ps aux | grep <process_name_or_PID>

To see the memory map of a process:

pmap -x <PID>
  4. Check for Memory Leaks: Use Valgrind to check for memory leaks in a specific application:
valgrind --leak-check=full /path/to/your/program
  5. Monitor Swap Usage: Check swap space usage:
swapon --show
  6. Examine System Logs: Look for any memory-related errors in system logs:
sudo journalctl -p err..emerg

Resolution steps

  1. Terminate unnecessary processes:
kill <PID>

or force kill:

kill -9 <PID>
  2. Clear Page Cache: To free up cached memory (writing 3 drops the page cache plus dentries and inodes):
sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
  3. Increase Swap Space: Create a new swap file:
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Add to /etc/fstab for persistence:

/swapfile none swap sw 0 0
  4. Optimize Applications:
  • Update software to latest versions
  • Configure applications to use less memory
  • Use lightweight alternatives for resource-heavy applications
  5. Implement Memory Limits: Use cgroups to set memory limits for services (on cgroup v2 systems, MemoryMax= replaces the deprecated MemoryLimit=):
sudo systemctl set-property <service_name> MemoryLimit=1G
  6. Clean Up Disk Space: Remove unnecessary files and uninstall unused applications:
sudo apt autoremove
sudo apt clean
  7. Consider Hardware Upgrades: If issues persist, consider adding more RAM to your system.
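When adding the swap file to /etc/fstab, appending the line blindly can leave duplicate entries if the step is repeated. The sketch below appends the entry only when no /swapfile line exists yet. The helper name ensure_swap_entry is hypothetical, and writing to the real /etc/fstab would require root.

```shell
# ensure_swap_entry [FSTAB_FILE]
# Appends the swap file entry unless a line for /swapfile is already present.
ensure_swap_entry() {
    fstab=${1:-/etc/fstab}
    entry='/swapfile none swap sw 0 0'
    grep -q '^/swapfile[[:space:]]' "$fstab" || printf '%s\n' "$entry" >> "$fstab"
}

# Usage (as root): ensure_swap_entry
```

Running it a second time leaves the file unchanged, which makes the step safe to repeat.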

Troubleshooting and Resolving Low Disk Space

Alert rule

100 - ((node_filesystem_avail_bytes{instance="localhost:9100",job="node_exporter",mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{instance="localhost:9100",job="node_exporter",mountpoint="/",fstype!="rootfs"})

Low disk space on a Linux server can cause various issues, including application crashes and system instability. This guide provides steps and commands to troubleshoot and resolve low disk space issues.
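The alert expression can be reproduced by hand from df output, which is useful when the figure in the dashboard looks suspicious. The sketch below applies 100 - (available * 100 / size) to the root filesystem row of df -P. The helper name root_used_pct and the sample row are made up, and the result can differ slightly from the Capacity column that df prints, since df calculates that figure differently.

```shell
# root_used_pct
# Reads `df -P` output on stdin and prints 100 - (Available * 100 / Size)
# for the filesystem mounted at "/".
root_used_pct() {
    awk 'NR > 1 && $6 == "/" && $2 > 0 { printf "%.1f\n", 100 - ($4 * 100) / $2 }'
}

# Example with a made-up row; on a live host use: df -P / | root_used_pct
printf 'Filesystem 1024-blocks Used Available Capacity Mounted on\n/dev/sda1 1000000 700000 250000 74%% /\n' | root_used_pct   # prints 75.0
```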

  1. Check Disk Usage

Use the df command to check disk usage of all mounted filesystems.

df -h
  2. Identify Large Files and Directories: Use the du command to identify large files and directories
du -sh /path/to/directory/*

Find the 10 largest files and directories on the root filesystem:

du -ahx / | sort -rh | head -10
  3. Clean Up Unnecessary Files
  • Remove Unnecessary Packages
sudo apt-get autoremove
  • Clear Systemd Journal Logs
sudo journalctl --vacuum-size=100M
  • Clear APT Cache (Debian/Ubuntu)
sudo apt-get clean
  • Delete Old Logs (the 30-day age below is an example; adjust it to your retention policy)
sudo find /var/log -type f -name "*.log" -mtime +30 -exec rm -f {} \;
  4. Investigate and Clear Docker Disk Usage: If you are using Docker, it can consume a significant amount of disk space.
  • Check Docker Disk Usage
sudo docker system df
  • Remove unused Docker data
sudo docker system prune -a

# or skip the confirmation prompt
sudo docker system prune -af
  5. Implement log rotation using tools like logrotate to prevent log files from consuming too much disk space.

  6. Consider adding more disk space or storage to the server if disk space issues persist.
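Before deleting log files, it is safer to list the candidates and review them. The sketch below is a dry run that only prints matching files; the helper name list_old_logs and the 30-day default are illustrative assumptions.

```shell
# list_old_logs [DIR] [DAYS]
# Prints *.log files under DIR that were last modified more than DAYS days ago.
# Nothing is deleted.
list_old_logs() {
    dir=${1:-/var/log}
    days=${2:-30}
    find "$dir" -type f -name '*.log' -mtime +"$days" -print
}

# Usage: sudo sh -c '. ./helpers.sh; list_old_logs /var/log 30'
#        (helpers.sh is a hypothetical file holding the function above)
```

Once the list looks right, the same find expression can be combined with a delete action.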

Troubleshooting and Resolving Network Traffic Issues

Alert rule

irate(node_network_transmit_bytes_total{instance="localhost:9100",job="node_exporter"}[5m])*8
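The rule converts a per-second byte rate into bits per second by multiplying by 8. The arithmetic can be sketched with two readings of an interface byte counter; the helper name tx_bits_per_sec and the sample numbers are made up for illustration.

```shell
# tx_bits_per_sec BYTES_BEFORE BYTES_AFTER SECONDS
# Prints the transmit rate in bits per second between two counter readings.
tx_bits_per_sec() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { if (s > 0) printf "%.0f\n", (b - a) * 8 / s }'
}

# Example: 2,500,000 bytes sent in 5 seconds.
tx_bits_per_sec 1000000 3500000 5   # prints 4000000
```

On a live host the counter readings could be taken from /sys/class/net/<interface>/statistics/tx_bytes.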

Troubleshooting Steps

  1. Check network utilization: iftop -i <interface>
  2. Analyze network connections: netstat -tuln
  3. Monitor incoming/outgoing traffic: tcpdump -i <interface> -n

Resolution

  • Optimize application code for network efficiency
  • Implement caching mechanisms
  • Consider load balancing or CDN solutions

Troubleshooting and Resolving Network Errors

Alert rule

increase(node_network_transmit_errs_total[1h]) + increase(node_network_receive_errs_total[1h])
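The counters behind this rule can also be read directly on the host, which helps to confirm which interface is producing errors. The sketch below sums the receive and transmit error counters from one interface's statistics directory. The helper name iface_error_total is hypothetical, and the directory layout assumes the Linux sysfs tree under /sys/class/net.

```shell
# iface_error_total STATS_DIR
# Prints rx_errors + tx_errors from an interface statistics directory,
# for example /sys/class/net/<interface>/statistics.
iface_error_total() {
    dir=$1
    rx=$(cat "$dir/rx_errors")
    tx=$(cat "$dir/tx_errors")
    echo $((rx + tx))
}

# Usage: iface_error_total /sys/class/net/<interface>/statistics
```

Taking two readings an hour apart and subtracting them gives a rough equivalent of the 1-hour increase in the rule.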

Troubleshooting Steps

  1. Check DNS resolution: nslookup <domain>
  2. Test network connectivity: ping <host>, then traceroute <host>
  3. Verify SSL/TLS configuration: openssl s_client -connect <host>:<port>

Resolution

  • Update DNS settings
  • Check firewall rules
  • Renew or reconfigure SSL/TLS certificates

Troubleshooting and Resolving Disk I/O Issues

Symptoms

  • High disk usage
  • Slow read/write operations
  • I/O wait time spikes

Troubleshooting Steps

  1. Monitor disk I/O: iostat -x 1
  2. Check disk usage: df -h and du -sh /*
  3. Identify processes causing high I/O: iotop

Resolution

  • Optimize database queries
  • Implement proper indexing
  • Consider upgrading to SSDs or faster storage
  • Adjust file system parameters (e.g., noatime mount option)

Troubleshooting and Resolving System Reboot Alerts

Alert Rule:

node_time_seconds{instance="localhost:9100",job="node_exporter"} - node_boot_time_seconds{instance="localhost:9100",job="node_exporter"}

This alert triggers when the system has recently rebooted. It calculates the difference between current time and boot time.
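The same uptime check can be run locally against /proc/uptime, whose first field is the number of seconds since boot. The following is a minimal sketch; the helper name recently_rebooted and the 600-second threshold in the usage line are illustrative assumptions, not values taken from the alert configuration.

```shell
# recently_rebooted LIMIT_SECONDS [UPTIME_FILE]
# Succeeds (exit status 0) when the uptime is below LIMIT_SECONDS.
# Reads /proc/uptime unless another file is given.
recently_rebooted() {
    awk -v limit="$1" '{ exit !(limit > $1) }' "${2:-/proc/uptime}"
}

# Usage: recently_rebooted 600 && echo "host rebooted within the last 10 minutes"
```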

Initial Assessment:

  1. Verify alert legitimacy
  2. Check if reboot was planned maintenance

Troubleshooting Steps:

  a. Access the affected system
  b. Review system logs from the previous boot:

sudo journalctl -b -1 -n

  c. Check last reboot time: who -b
  d. Examine uptime: uptime

Common Causes and Solutions:

  a. Power failure
  • Check UPS status
  • Verify power supply integrity
  b. Kernel panic
  • Review kernel logs (dmesg only covers the current boot; if the journal is persistent, sudo journalctl -k -b -1 shows kernel messages from the previous boot):
sudo dmesg | grep -i panic
  • Update kernel if necessary
  c. Hardware failure
  • Run hardware diagnostics
  • Check for overheating
  d. Software update
  • Review package manager logs
  • Rollback recent updates if problematic

Prevention Measures:

  • Implement regular maintenance schedule
  • Set up automatic security updates
  • Monitor system resources

Alert Resolution:

  • Document findings and actions taken
  • Update alert status in monitoring system
  • Notify relevant team members

Follow-up:

  • Conduct root cause analysis
  • Implement preventive measures
  • Update runbook if necessary

Troubleshooting and Resolving High System Load Alert

Alert Description: This alert triggers when the 1-minute load average on a system exceeds a certain percentage of available CPU cores.

Alert Rule:

scalar(node_load1{instance="localhost:9100",job="node_exporter"}) * 100 / count(count(node_cpu_seconds_total{instance="localhost:9100",job="node_exporter"}) by (cpu))
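The rule expresses the 1-minute load average as a percentage of the core count, so a value above 100 means more runnable work than cores. The arithmetic is sketched below; the helper name load_pct and the sample numbers are made up for illustration.

```shell
# load_pct LOAD_AVERAGE CORES
# Prints the load average as a percentage of the number of CPU cores.
load_pct() {
    awk -v avg="$1" -v cores="$2" 'BEGIN { if (cores > 0) printf "%.1f\n", avg * 100 / cores }'
}

# Example: a load of 6.0 on a 4-core host.
load_pct 6.0 4   # prints 150.0
```

On a live host the inputs could come from the first field of /proc/loadavg and from nproc.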

Step 1: Verify the Alert

  1. Log into the monitoring system and confirm the alert details.
  2. Check if the alert is still active or if it was a temporary spike.

Step 2: Assess the Situation

  1. SSH into the affected system
  2. Run uptime to view the current load averages.
  3. Use top or htop to get an overview of system resource usage.

Step 3: Identify High Resource Consumers

  1. In top/htop, sort processes by CPU usage ('%CPU' column).
  2. Identify any processes consuming an unusually high amount of CPU.
  3. Note the process IDs (PIDs) of high consumers.

Step 4: Investigate Problematic Processes

For each high-consuming process:

  a. Run ps aux | grep <PID> to get more details.
  b. Check if the process is expected to be running and consuming high resources.
  c. Investigate logs related to the process (usually in /var/log/ or application-specific locations).

Step 5: Address Issues

If a process is misbehaving:

  a. Try restarting the process: sudo systemctl restart <service-name> or kill -15 <PID>
  b. If a restart doesn't help, consider stopping the process temporarily: sudo systemctl stop <service-name> or kill -9 <PID>
  c. If the high load is due to expected behavior (e.g., batch job), consider rescheduling or optimizing the task.

Step 6: Check System Resources

  1. Run free -h to check memory usage. If memory is low, it might cause high CPU usage due to swapping.
  2. Use df -h to check disk usage. Full disks can cause various issues.
  3. Check I/O wait using iostat -x 1. High wait times might indicate disk issues.

Step 7: Review Recent Changes

  1. Check recent system or application updates that might have caused the issue.
  2. Review any recent configuration changes.

Step 8: Implement Short-term Fix

Based on findings, implement a short-term fix to reduce system load. This might include stopping non-critical services, killing runaway processes, or adding resources.

Step 9: Monitor the Situation

  1. Continue monitoring the system load using top or htop.
  2. Verify that the alert resolves in the monitoring system.

Step 10: Plan Long-term Solution

If the issue is recurring, plan for a long-term solution. This might include:

  • Upgrading hardware resources
  • Optimizing application code
  • Load balancing or scaling out the service

General Tips

  • Always back up data before making significant changes
  • Keep system and application logs for reference
  • Regularly update and patch your systems
  • Monitor server performance consistently to catch issues early