
Commit 73a0214

troubleshoot: start adding content
Signed-off-by: Pau Capdevila <[email protected]>
1 parent af09bcf commit 73a0214

3 files changed, +220 -0 lines changed

docs/troubleshooting/.pages

@@ -0,0 +1,5 @@
title: Troubleshooting
nav:
  - Overview: overview.md
  - Boot and Installation Issues:
      - GRUB Rescue Mode: troubleshooting_grub_rescue.md

docs/troubleshooting/overview.md

@@ -2,3 +2,71 @@

!!! warning ""
    Under construction.

This section provides solutions for common issues encountered when deploying and managing Hedgehog Fabric. The goal is to help you quickly diagnose and resolve problems to minimize downtime and maintain system stability.

---

## **1. Boot and Installation Issues**
- [GRUB rescue mode and missing `normal.mod`](./troubleshooting_grub_rescue.md)
- ONIE installation failures and network discovery issues

---

## **2. Kubernetes and Control Plane Issues**
- Control plane communication failures (Fabric Controller ↔️ Fabric Agent)
- CRD synchronization issues
- Kubernetes pod failures or crash loops

---

## **3. EVPN and BGP Issues**
- BGP session establishment failures
- Incorrect BGP advertisements or route distribution issues
- EVPN VXLAN tunnel misconfigurations

---

## **4. VPC and Overlay Network Issues**
- VPC creation or attachment failures
- Subnet conflicts and overlapping IP addresses
- DHCP relay and addressing issues within VPCs

---

## **5. Redundancy and High Availability**
- MCLAG and ESLAG configuration mismatches
- Peer link failures and redundancy inconsistencies
- Traffic blackholing due to loopback or peer link misconfigurations

---

## **6. Switch and Interface Issues**
- Interface state mismatch (admin vs oper)
- Port breakout mode misconfiguration
- ASIC-related packet drops and performance bottlenecks

---

## **7. External Peering and Border Leaf Issues**
- BGP peer session failures with edge routers
- Incorrect route advertisements or filtering
- VLAN tagging issues for external connections

---

## **8. Performance and Resource Issues**
- High CPU or memory usage on switches
- Slow convergence times for BGP/EVPN
- ASIC resource exhaustion (route limits, FDB, etc.)

---

## **9. Monitoring and Logging**
- Grafana dashboards showing missing or inconsistent data
- Alloy logs missing or not updating
- Incorrect log forwarding to external systems

---

Use the navigation panel to explore each troubleshooting area in detail.
docs/troubleshooting/troubleshooting_grub_rescue.md
@@ -0,0 +1,147 @@

# Troubleshooting: Recovering SONiC Installation on a Switch

This document describes how to boot a switch into ONIE install mode and reinstall the NOS (SONiC) after a boot failure.

---

## **Issue**
The switch failed to boot into SONiC because of a missing or corrupted bootloader. Symptoms included:

- Stuck in **GRUB rescue mode** with `error: file /boot/grub/normal.mod not found`.
- Unable to load the kernel directly from the GRUB shell.
- EFI bootloader files (`bootx64.efi`, `grubx64.efi`) were present but not loading properly.
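
The first symptom above can be recognized automatically in a captured serial console log. The following is a minimal sketch, not part of the Hedgehog tooling: the function name `looks_like_grub_rescue` is hypothetical, and the `grub rescue>` prompt pattern is an assumption about what the capture contains.

```python
import re

# Matches the error quoted in the symptoms, with or without quotes around the path.
MISSING_MODULE = re.compile(
    r"error: file\s+\S*normal\.mod\S*\s+not found", re.IGNORECASE
)
# Assumption: the capture also contains the GRUB rescue prompt at a line start.
RESCUE_PROMPT = re.compile(r"^grub rescue>", re.MULTILINE)


def looks_like_grub_rescue(console_text: str) -> bool:
    """Return True if the capture shows the normal.mod error or the rescue prompt."""
    return bool(
        MISSING_MODULE.search(console_text) or RESCUE_PROMPT.search(console_text)
    )


if __name__ == "__main__":
    sample = "error: file /boot/grub/normal.mod not found\ngrub rescue> "
    print(looks_like_grub_rescue(sample))
```

A check like this only flags the symptom; it does not tell you whether the bootloader files or the boot order are at fault.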

---

## **Root Cause**
- The GRUB bootloader was either misconfigured or corrupted.
- The switch's boot order and EFI configuration were not correctly aligned with the installed SONiC image.
- The ONIE boot partition was not properly mounted, preventing direct boot into SONiC.

---

## **Resolution**

### **Step 1: Enter BIOS and Adjust Boot Order**
1. Power cycle the switch.
2. Enter the BIOS setup by pressing **DEL** or **ESC** during boot.
3. Navigate to the **Boot Override** section.
4. Select `UEFI: Built-in EFI Shell` to enter the EFI shell.

---

### **Step 2: Load the Bootloader Manually from EFI Shell**
1. From the EFI shell, list the available file systems:
    ```bash
    map -r
    ```
    Example output:
    ```
    fs0 :HardDisk - Alias hd17a65535a1 blk0
    fs1 :HardDisk - Alias hd17a65535a2 blk1
    ```

2. Switch to the file system that holds the EFI partition (`fs0:` in this example):
    ```bash
    fs0:
    ```

3. List the available bootloader files:
    ```bash
    ls EFI/BOOT
    ```
    Example output:
    ```
    bootx64.efi
    grubx64.efi
    fbx64.efi
    grub.cfg
    ```

4. Launch the default bootloader:
    ```bash
    EFI/BOOT/BOOTX64.EFI
    ```
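
If `EFI/BOOT` is not on `fs0:`, the same check can be repeated on each file system that `map -r` reports. As a rough illustration, the sketch below pulls the file system names out of pasted `map -r` output so they can be tried in order. The helper name `filesystem_names` is hypothetical, and the line format is assumed to match the example output above.

```python
import re

# Assumption: each file system line starts with "fsN" followed by a colon,
# as in the example output above (case may differ between firmware builds).
FS_LINE = re.compile(r"^\s*(fs\d+)\s*:", re.IGNORECASE)


def filesystem_names(map_output: str) -> list[str]:
    """Return file system names (for example "fs0") in the order they are listed."""
    names = []
    for line in map_output.splitlines():
        match = FS_LINE.match(line)
        if match:
            names.append(match.group(1).lower())
    return names


if __name__ == "__main__":
    sample = (
        "fs0 :HardDisk - Alias hd17a65535a1 blk0\n"
        "fs1 :HardDisk - Alias hd17a65535a2 blk1\n"
    )
    for name in filesystem_names(sample):
        # Commands to type at the EFI shell prompt for each candidate.
        print(f"{name}:   then   ls EFI/BOOT")
```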

---

### **Step 3: Select ONIE Install Option in GRUB**
1. After `bootx64.efi` loads, the GRUB menu appears with the ONIE options:
    - `ONIE: Install OS`
    - `ONIE: Rescue`
    - `ONIE: Uninstall OS`

2. Select **"ONIE: Install OS"** to start the installer.

---

### **Step 4: ONIE Network-Based Recovery**
1. ONIE attempts to discover the network-based installer from the control node:
    ```text
    ONIE: Discovering installer...
    ```

2. The installer is detected and executed:
    ```text
    ONIE: Executing installer: http://172.30.0.1:32000/onie
    ```

3. ONIE downloads and prepares the SONiC image:
    ```text
    NOS: Verifying image checksum ... OK.
    NOS: Preparing image archive ... OK.
    NOS: Installing SONiC in ONIE
    ```
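
If the switch never gets past `ONIE: Discovering installer...`, it helps to know which endpoint a working switch used, so that reachability of that host and port can be checked from the management network. The sketch below extracts the endpoint from a captured console log; the function name `installer_endpoint` is hypothetical, and the log format is assumed to match the lines shown above.

```python
import re
from urllib.parse import urlsplit

# Matches the "Executing installer" line shown in Step 4.
EXEC_LINE = re.compile(r"ONIE: Executing installer:\s*(\S+)")


def installer_endpoint(console_text: str):
    """Return (scheme, host, port) of the installer URL, or None if ONIE never ran one."""
    match = EXEC_LINE.search(console_text)
    if match is None:
        return None  # discovery did not complete in this capture
    parts = urlsplit(match.group(1))
    return parts.scheme, parts.hostname, parts.port


if __name__ == "__main__":
    log = (
        "ONIE: Discovering installer...\n"
        "ONIE: Executing installer: http://172.30.0.1:32000/onie\n"
    )
    print(installer_endpoint(log))
```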

---

### **Step 5: Partition Repair and Installation**
1. The installer finds a free partition slot and creates the SONiC partition:
    ```text
    NOS: Partition #3 is available
    NOS: Creating new SONiC-OS partition /dev/sda3 ...
    ```
2. The partition is created successfully:
    ```text
    NOS: The operation has completed successfully.
    ```

3. The installer formats and mounts the partition:
    ```text
    NOS: Creating filesystem with 7588935 4k blocks and 1900544 inodes
    ```

4. SONiC files are extracted and installed:
    ```text
    NOS: inflating: boot/vmlinuz-5.10.0-21-amd64
    NOS: inflating: boot/initrd.img-5.10.0-21-amd64
    NOS: inflating: platform-modules-ag9032v2a_1.1_amd64.deb
    ```
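
The filesystem line in item 3 gives a quick sanity check on the size of the new partition: 7588935 blocks of 4 KiB each is 31084277760 bytes, or roughly 28.95 GiB. A small sketch of that arithmetic follows; the helper name `filesystem_size_bytes` is hypothetical, and it assumes the log line has the same wording as above.

```python
import re

# Matches the "Creating filesystem" line from item 3 above.
FS_LINE = re.compile(r"Creating filesystem with (\d+) (\d+)k blocks and (\d+) inodes")


def filesystem_size_bytes(log_line: str) -> int:
    """Compute the filesystem size in bytes from the block count and block size."""
    match = FS_LINE.search(log_line)
    if match is None:
        raise ValueError("not a 'Creating filesystem' log line")
    blocks = int(match.group(1))
    block_kib = int(match.group(2))  # "4k" means 4 KiB per block
    return blocks * block_kib * 1024


if __name__ == "__main__":
    line = "NOS: Creating filesystem with 7588935 4k blocks and 1900544 inodes"
    size = filesystem_size_bytes(line)
    print(f"{size} bytes = {size / 2**30:.2f} GiB")
```

A size far below the capacity of the switch's disk could indicate that the installer reused a small leftover partition, although that interpretation is an assumption and depends on the platform.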

---

### **Step 6: Finalize Installation**
1. ONIE reports a successful installation:
    ```text
    NOS: Installed SONiC base image SONiC-OS successfully
    ```
2. ONIE reboots the switch:
    ```text
    ONIE: NOS install successful: http://172.30.0.1:32000/onie
    ONIE: Rebooting...
    ```
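
The log lines quoted in Steps 4 through 6 form a sequence of milestones, so a captured console log can be scanned to see how far an installation got before it stalled. The sketch below is one way to do that. The milestone list is taken from the excerpts in this document and is not an exhaustive or official list, and the function name `reached_milestones` is hypothetical.

```python
# Milestones in the order they appear in Steps 4 to 6 of this document.
MILESTONES = [
    "ONIE: Discovering installer",
    "ONIE: Executing installer:",
    "NOS: Verifying image checksum ... OK.",
    "NOS: Preparing image archive ... OK.",
    "NOS: Installed SONiC base image",
    "ONIE: NOS install successful",
    "ONIE: Rebooting",
]


def reached_milestones(console_text: str) -> list[str]:
    """Return the milestones found in order, stopping at the first one that is missing."""
    reached = []
    position = 0
    for marker in MILESTONES:
        index = console_text.find(marker, position)
        if index == -1:
            break
        reached.append(marker)
        position = index + len(marker)
    return reached


if __name__ == "__main__":
    partial_log = (
        "ONIE: Discovering installer...\n"
        "ONIE: Executing installer: http://172.30.0.1:32000/onie\n"
    )
    done = reached_milestones(partial_log)
    print(f"{len(done)}/{len(MILESTONES)} milestones reached; last: {done[-1]}")
```

The first missing milestone points at the step to investigate, for example image download problems if the checksum line never appears.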

---

### **Step 7: Verify Successful Boot**
1. After the reboot, GRUB lists the installed SONiC image:
    ```
    * SONiC-OS-4.4.1-Enterprise_Base
      ONIE
    ```

2. The switch boots into SONiC:
    ```text
    Debian GNU/Linux 11 sonic ttyS1
    System is ready
    ```
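
When reviewing a console capture after the reboot, the GRUB menu can be checked to confirm that the highlighted entry is the SONiC image and not ONIE. The sketch below assumes the capture marks the highlighted entry with a leading `*`, as in the example above; the function name `highlighted_grub_entry` is hypothetical.

```python
def highlighted_grub_entry(menu_text: str):
    """Return the menu entry marked with '*', or None if no entry is marked."""
    for line in menu_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("*"):
            return stripped.lstrip("*").strip()
    return None


if __name__ == "__main__":
    menu = "* SONiC-OS-4.4.1-Enterprise_Base\n  ONIE\n"
    entry = highlighted_grub_entry(menu)
    if entry is not None and entry.startswith("SONiC-OS"):
        print(f"GRUB will boot: {entry}")
    else:
        print("Highlighted entry is not a SONiC image")
```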
