Server Troubleshooting and Resolution
Alert Rule
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle", instance="localhost:9100"}[5m])))
This alert triggers when the average CPU usage over 5 minutes exceeds a certain threshold.
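The expression can be reproduced by hand from two samples of the aggregate cpu line in /proc/stat. The following is a minimal sketch, assuming a Linux host; the function name cpu_busy_pct is hypothetical and not part of any tool.

```shell
# cpu_busy_pct SAMPLE1 SAMPLE2
# Each argument is one aggregate "cpu ..." line copied from /proc/stat.
# Prints 100 * (1 - idle_delta / total_delta), the same shape as the alert rule.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user..steal (guest is already counted in user)
            idle[NR] = $5                           # fourth counter after the label is idle
            sum[NR] = total
        }
        END {
            dt = sum[2] - sum[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (1 - (idle[2] - idle[1]) / dt)
        }'
}

# Usage (the 5-second gap is an arbitrary choice):
#   s1=$(grep '^cpu ' /proc/stat); sleep 5; s2=$(grep '^cpu ' /proc/stat)
#   cpu_busy_pct "$s1" "$s2"
```

A result close to the value shown in Prometheus suggests the alert reflects real load rather than a scrape problem.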
Investigation Steps
- Verify Alert
- Check Prometheus/Grafana to confirm high CPU usage
- Ensure alert is not a false positive
- Identify CPU-intensive processes: use top or htop
- Analyze specific processes
ps aux | grep <process_name_or_PID>
- Check system load average
uptime
- Monitor CPU usage over time
sudo sar -u 1 10
- Examine CPU core usage
mpstat -P ALL 1 5
- Investigate high I/O wait times
iostat -xz 1 10
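When top or htop is impractical (for example, in a script or over a slow connection), the process list can be filtered non-interactively. This is a small sketch; high_cpu_procs is a hypothetical helper name and the threshold of 50 is only an example.

```shell
# high_cpu_procs THRESHOLD
# Reads `ps -eo pid,pcpu,comm` output on stdin and prints the rows whose
# %CPU is at or above THRESHOLD. The header row is skipped.
high_cpu_procs() {
    awk -v min="$1" 'NR > 1 && $2 + 0 >= min + 0 { print $1, $2, $3 }'
}

# Usage:
#   ps -eo pid,pcpu,comm --sort=-pcpu | high_cpu_procs 50
```

Note that ps reports CPU usage averaged over the lifetime of each process, so a short recent spike may not show up here even though it is visible in top.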
Resolution Steps
- Terminate unnecessary processes
kill <PID>
or force kill:
kill -9 <PID>
- Adjust process priority
renice +10 <PID>
- Limit CPU usage for a process
sudo cpulimit -p <PID> -l 50
- Update or optimize software
sudo apt update && sudo apt upgrade
- Check for malware
sudo rkhunter --check
- Optimize system services
sudo systemctl disable <service_name>
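Force-killing with kill -9 gives a process no chance to clean up, so it is usually worth trying SIGTERM first and escalating only if the process does not exit. Below is a rough sketch of that escalation; the function name and the 10-second default timeout are assumptions, not part of the original runbook.

```shell
# terminate_gracefully PID [TIMEOUT_SECONDS]
# Sends SIGTERM, polls once per second until the process exits,
# and sends SIGKILL only if it is still alive after the timeout.
terminate_gracefully() {
    tg_pid=$1
    tg_left=${2:-10}
    kill -TERM "$tg_pid" 2>/dev/null || return 0      # nothing to do if it is already gone
    while [ "$tg_left" -gt 0 ]; do
        kill -0 "$tg_pid" 2>/dev/null || return 0     # process has exited
        sleep 1
        tg_left=$((tg_left - 1))
    done
    kill -KILL "$tg_pid" 2>/dev/null || true          # last resort
    return 0
}

# Usage:
#   terminate_gracefully <PID> 15
```

For services managed by systemd, sudo systemctl stop <service_name> performs a similar TERM-then-KILL sequence on its own.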
Post-Resolution Actions
- Document and Report
- Record actions taken and their effects
- Update alert status
- Notify relevant team members (devops team)
- Preventive Measures
- Implement regular system maintenance
- Set up resource usage monitoring
- Optimize application code if applicable
- Follow-up
- Conduct root cause analysis
- Implement long-term solutions
- Update runbook if necessary
Note: Always backup your system before making significant changes, and test in a non-production environment first.
Alert rule
(1 - (node_memory_MemAvailable_bytes{instance="localhost:9100", job="node_exporter"} / node_memory_MemTotal_bytes{instance="localhost:9100", job="node_exporter"})) * 100
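The expression above reports the percentage of memory in use, based on MemAvailable rather than MemFree. The same figure can be computed locally from /proc/meminfo as a cross-check. This is a minimal sketch, with mem_used_pct as a hypothetical helper name.

```shell
# mem_used_pct
# Reads /proc/meminfo-formatted text on stdin and prints
# (1 - MemAvailable / MemTotal) * 100, matching the alert expression.
mem_used_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total <= 0) exit 1
            printf "%.1f\n", (1 - avail / total) * 100
        }'
}

# Usage:
#   mem_used_pct < /proc/meminfo
```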
Troubleshooting tips
- Check Current Memory Usage
Use the free command to view memory statistics:
free -h
For a more detailed view, use:
cat /proc/meminfo
- Identify Memory-Intensive Processes: Use top or htop to see which processes are consuming the most memory
# Use top
top
# Use htop
htop
Sort processes by memory usage in top by pressing Shift+M.
- Analyze Specific Processes: For detailed information about a process's memory usage:
ps aux | grep <process_name_or_PID>
To see the memory map of a process:
pmap -x <PID>
- Check for Memory Leaks: Use Valgrind to check for memory leaks in a specific application:
valgrind --leak-check=full /path/to/your/program
- Monitor Swap Usage. Check swap space usage:
swapon --show
- Examine System Logs. Look for any memory-related errors in system logs:
sudo journalctl -p err..emerg
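Services that run many worker processes (web servers, databases) can look small per process in top while using a lot of memory in total. The sketch below sums resident memory per command name. rss_by_command is a hypothetical helper, and it assumes command names without spaces.

```shell
# rss_by_command
# Reads `ps -eo rss,comm` output on stdin (RSS in KiB), sums RSS per
# command name, and prints "command MiB" sorted from largest to smallest.
rss_by_command() {
    awk 'NR > 1 { kib[$2] += $1 }
         END { for (c in kib) printf "%s %.1f\n", c, kib[c] / 1024 }' |
        LC_ALL=C sort -k2,2nr
}

# Usage:
#   ps -eo rss,comm | rss_by_command | head -5
```

Summed RSS counts shared pages once per process, so treat the totals as an upper bound rather than an exact figure.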
Resolution steps
- Terminate unnecessary processes:
kill <PID>
or force kill:
kill -9 <PID>
- Clear Page Cache: To free up cached memory
sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
- Increase Swap Space: Create a new swap file:
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Add to /etc/fstab for persistence:
/swapfile none swap sw 0 0
- Optimize Applications:
- Update software to latest versions
- Configure applications to use less memory
- Use lightweight alternatives for resource-heavy applications
- Implement Memory Limits: Use cgroups to set memory limits for services:
sudo systemctl set-property <service_name> MemoryLimit=1G
On hosts running cgroup v2, MemoryLimit= is deprecated; use MemoryMax=1G instead.
- Clean Up Disk Space: Remove unnecessary files and uninstall unused applications:
sudo apt autoremove
sudo apt clean
- Consider Hardware Upgrades: If issues persist, consider adding more RAM to your system.
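For the swap file step above, appending to /etc/fstab more than once leaves duplicate entries. Below is a small sketch of an idempotent append; ensure_line is a hypothetical helper name.

```shell
# ensure_line FILE LINE
# Appends LINE to FILE only if an identical line is not already present.
ensure_line() {
    el_file=$1
    el_line=$2
    grep -qxF -- "$el_line" "$el_file" 2>/dev/null || printf '%s\n' "$el_line" >> "$el_file"
}

# Usage (run as root):
#   ensure_line /etc/fstab '/swapfile none swap sw 0 0'
```

Because the match is on the whole line, an existing entry with different spacing or options would not be detected, so it is still worth reviewing the file afterwards.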
Alert rule
100 - ((node_filesystem_avail_bytes{instance="localhost:9100",job="node_exporter",mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{instance="localhost:9100",job="node_exporter",mountpoint="/",fstype!="rootfs"})
Low disk space on a Linux server can cause various issues, including application crashes and system instability. This guide provides steps and commands to troubleshoot and resolve low disk space issues.
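The alert expression computes 100 minus the available share of the filesystem. As a cross-check, the same arithmetic can be applied to df output. This is a minimal sketch; disk_used_pct is a hypothetical helper name.

```shell
# disk_used_pct [MOUNTPOINT]
# Reads `df -P` output on stdin and prints 100 - (available * 100 / size)
# for the given mount point (default: /), mirroring the alert expression.
disk_used_pct() {
    awk -v mp="${1:-/}" 'NR > 1 && $6 == mp { printf "%.1f\n", 100 - ($4 * 100) / $2 }'
}

# Usage:
#   df -P / | disk_used_pct /
```

The result can differ slightly from the Use% column printed by df, which is calculated from used and available blocks rather than from the total size.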
- Check Disk Usage
Use the df command to check disk usage of all mounted filesystems:
df -h
- Identify Large Files and Directories: Use the du command to identify large files and directories
du -sh /path/to/directory/*
Find the 10 largest files and directories on the root filesystem:
du -ahx / | sort -rh | head -10
- Clean Up Unnecessary Files
- Remove Unnecessary Packages
sudo apt-get autoremove
sudo apt-get clean
- Clear Systemd Journal Logs
sudo journalctl --vacuum-size=100M
- Clear APT Cache (Debian/Ubuntu)
sudo apt-get clean
- Delete Old Logs (the 30-day cutoff is an example; adjust it to your retention policy)
sudo find /var/log -type f -name "*.log" -mtime +30 -exec rm -f {} \;
- Investigate and Clear Docker Disk Usage: If you are using Docker, it can consume a significant amount of disk space.
- Check Docker Disk Usage
sudo docker system df
- Remove unused Docker data
sudo docker system prune -a
# or skip the confirmation prompt
sudo docker system prune -af
- Implement log rotation using tools like logrotate to prevent log files from consuming too much disk space.
- Consider adding more disk space or storage to the server if disk space issues persist.
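Before deleting anything in the Delete Old Logs step, it can help to preview which files would match. This is a sketch only; list_old_logs is a hypothetical helper and the 30-day default is an assumption.

```shell
# list_old_logs [DIR] [DAYS]
# Prints *.log files under DIR (default /var/log) last modified more than
# DAYS days ago (default 30). Nothing is deleted.
list_old_logs() {
    find "${1:-/var/log}" -type f -name '*.log' -mtime +"${2:-30}" -print
}

# Usage:
#   sudo sh -c '. ./helpers.sh; list_old_logs /var/log 30'   # helpers.sh is a placeholder file name
```

Log files still held open by a running service keep using disk space after removal until the service is restarted or reopens its logs, which is one reason rotation is usually preferable to manual deletion.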
Alert rule
irate(node_network_transmit_bytes_total{instance="localhost:9100",job="node_exporter"}[5m])*8
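The expression yields the transmit rate in bits per second: a per-second byte rate multiplied by 8. The same arithmetic can be applied to two readings of an interface's byte counter. The sketch below uses a hypothetical helper name, tx_bits_per_sec.

```shell
# tx_bits_per_sec BYTES_BEFORE BYTES_AFTER SECONDS
# Converts two readings of a cumulative byte counter into bits per second.
tx_bits_per_sec() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN {
        if (s <= 0) exit 1
        printf "%.0f\n", (b - a) / s * 8
    }'
}

# Usage (eth0 is a placeholder interface name):
#   f=/sys/class/net/eth0/statistics/tx_bytes
#   a=$(cat "$f"); sleep 5; b=$(cat "$f")
#   tx_bits_per_sec "$a" "$b" 5
```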
Troubleshooting Steps
- Check network utilization:
iftop -i <interface>
- Analyze network connections:
netstat -tuln
- Monitor incoming/outgoing traffic:
tcpdump -i <interface> -n
Resolution
- Optimize application code for network efficiency
- Implement caching mechanisms
- Consider load balancing or CDN solutions
Alert rule
increase(node_network_transmit_errs_total[1h]) + increase(node_network_receive_errs_total[1h])
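The expression adds transmit and receive errors accumulated over the last hour. On the host itself, the kernel exposes cumulative per-interface error counters in /proc/net/dev, which can show which interface is affected. This is a minimal sketch; net_errs is a hypothetical helper name.

```shell
# net_errs
# Reads /proc/net/dev-formatted text on stdin and prints, per interface,
# the sum of receive and transmit error counters since boot.
net_errs() {
    awk 'NR > 2 {
        sub(/:/, ": ")            # tolerate "eth0:123" with no space after the colon
        name = $1
        sub(/:$/, "", name)
        print name, $4 + $12      # rx errs + tx errs
    }'
}

# Usage:
#   net_errs < /proc/net/dev
```

These values are totals since boot, so compare two readings taken some time apart to see whether errors are still increasing.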
Troubleshooting Steps
- Check DNS resolution:
nslookup <domain>
- Test network connectivity:
ping <host>
traceroute <host>
- Verify SSL/TLS configuration:
openssl s_client -connect <host>:<port>
Resolution
- Update DNS settings
- Check firewall rules
- Renew or reconfigure SSL/TLS certificates
Symptoms
- High disk usage
- Slow read/write operations
- I/O wait time spikes
Troubleshooting Steps
- Monitor disk I/O:
iostat -x 1
- Check disk usage:
df -h
du -sh /*
- Identify processes causing high I/O:
iotop
Resolution
- Optimize database queries
- Implement proper indexing
- Consider upgrading to SSDs or faster storage
- Adjust file system parameters (e.g., noatime mount option)
Alert Rule:
node_time_seconds{instance="localhost:9100",job="node_exporter"} - node_boot_time_seconds{instance="localhost:9100",job="node_exporter"}
This alert triggers when the system has recently rebooted. It calculates the difference between current time and boot time.
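The subtraction in the rule is simply uptime in seconds. On the host, the boot timestamp is available as the btime field in /proc/stat, so the same value can be derived locally. This is a small sketch; seconds_since_boot is a hypothetical helper name.

```shell
# seconds_since_boot NOW_EPOCH
# Reads /proc/stat-formatted text on stdin and prints NOW_EPOCH - btime.
seconds_since_boot() {
    awk -v now="$1" '$1 == "btime" { printf "%d\n", now - $2 }'
}

# Usage:
#   seconds_since_boot "$(date +%s)" < /proc/stat
```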
Initial Assessment:
- Verify alert legitimacy
- Check if reboot was planned maintenance
Troubleshooting Steps:
a. Access the affected system
b. Review system logs from the previous boot:
sudo journalctl -b -1 -n
c. Check last reboot time: who -b
d. Examine uptime: uptime
Common Causes and Solutions:
a. Power failure
- Check UPS status
- Verify power supply integrity
b. Kernel panic
- Review kernel logs (dmesg covers only the current boot; if the journal is persistent, sudo journalctl -k -b -1 shows kernel messages from the previous boot):
sudo dmesg | grep -i panic
- Update kernel if necessary
c. Hardware failure
- Run hardware diagnostics
- Check for overheating
d. Software update
- Review package manager logs
- Roll back recent updates if problematic
Prevention Measures:
- Implement regular maintenance schedule
- Set up automatic security updates
- Monitor system resources
Alert Resolution:
- Document findings and actions taken
- Update alert status in monitoring system
- Notify relevant team members
Follow-up:
- Conduct root cause analysis
- Implement preventive measures
- Update runbook if necessary
Alert Description: This alert triggers when the 1-minute load average on a system exceeds a certain percentage of available CPU cores.
Alert Rule:
scalar(node_load1{instance="localhost:9100",job="node_exporter"}) * 100 / count(count(node_cpu_seconds_total{instance="localhost:9100",job="node_exporter"}) by (cpu))
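The rule expresses the 1-minute load average as a percentage of the number of CPU cores, so a value of 100 means roughly one runnable task per core. Below is a minimal sketch of the same calculation on the host; load_pct is a hypothetical helper name.

```shell
# load_pct LOAD1 CORES
# Prints LOAD1 * 100 / CORES, the same ratio the alert rule computes.
load_pct() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores <= 0) exit 1
        printf "%.1f\n", load * 100 / cores
    }'
}

# Usage:
#   load_pct "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```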
Step 1: Verify the Alert
- Log into the monitoring system and confirm the alert details.
- Check if the alert is still active or if it was a temporary spike.
Step 2: Assess the Situation
- SSH into the affected system
- Run uptime to view the current load averages.
- Use top or htop to get an overview of system resource usage.
Step 3: Identify High Resource Consumers
- In top/htop, sort processes by CPU usage ('%CPU' column).
- Identify any processes consuming an unusually high amount of CPU.
- Note the process IDs (PIDs) of high consumers.
Step 4: Investigate Problematic Processes
For each high-consuming process:
a. Run ps aux | grep <PID> to get more details.
b. Check if the process is expected to be running and consuming high resources.
c. Investigate logs related to the process (usually in /var/log/ or application-specific locations).
Step 5: Address Issues
If a process is misbehaving:
a. Try restarting the process: sudo systemctl restart <service-name>, or ask it to exit gracefully with kill -15 <PID>
b. If restart doesn't help, consider stopping the process temporarily: sudo systemctl stop <service-name>, or force it to exit with kill -9 <PID>
c. If the high load is due to expected behavior (e.g., batch job), consider rescheduling or optimizing the task.
Step 6: Check System Resources
- Run free -h to check memory usage. If memory is low, it might cause high CPU usage due to swapping.
- Use df -h to check disk usage. Full disks can cause various issues.
- Check I/O wait using iostat -x 1. High wait times might indicate disk issues.
Step 7: Review Recent Changes
- Check recent system or application updates that might have caused the issue.
- Review any recent configuration changes.
Step 8: Implement Short-term Fix
Based on findings, implement a short-term fix to reduce system load. This might include stopping non-critical services, killing runaway processes, or adding resources.
Step 9: Monitor the Situation
- Continue monitoring the system load using top or htop.
- Verify that the alert resolves in the monitoring system.
Step 10: Plan Long-term Solution
If the issue is recurring, plan for a long-term solution. This might include:
- Upgrading hardware resources
- Optimizing application code
- Load balancing or scaling out the service
- Always backup data before making significant changes
- Keep system and application logs for reference
- Regularly update and patch your systems
- Monitor server performance consistently to catch issues early