Cold reboot loses SSD info[BUG] #4

Open
sjyi opened this issue Oct 30, 2020 · 2 comments
Labels
bug Something isn't working

Comments

@sjyi
sjyi commented Oct 30, 2020

Describe the issue
Please describe the issue
After copying the rootfs to the SSD and enabling boot from the SSD, I reboot.
Everything works as long as I only warm boot, that is, as long as I don't unplug the power.
When I unplug the power and then cold boot, the system doesn't see the SSD directories.
I have to do another warm boot (reboot) before the SSD directories appear.

What version of L4T/JetPack
L4T/JetPack version: 4.4
Which Jetson
Jetson: Xavier NX
To Reproduce
Steps to reproduce the behavior:
For example, what command line did you run?

Set up the SSD.
Clone the rootOnNVMe repository,
then copy the rootfs:
./copy-rootfs-ssd.sh
Then enable booting from the SSD:
./setup-service.sh

Reboot to enable SSD.

Now you have access to SSD.

At this point, you can shut down and then unplug the power, or simply unplug the power.

When the power is reconnected, SSD directories are not seen.
You have to reboot again to see the SSD directories.
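
A quick way to tell which case you are in after a boot is to check which device is mounted at /. This is only a minimal sketch, assuming /proc/mounts is readable; root_source is a hypothetical helper name and is not part of rootOnNVMe.

```shell
# Hypothetical helper (not part of rootOnNVMe): print the device mounted at "/".
# $1 is a mounts-format table; it defaults to /proc/mounts.
root_source() {
    # The last matching entry wins, since a later mount on "/" shadows earlier ones.
    awk '$2 == "/" { src = $1 } END { if (src != "") print src }' "${1:-/proc/mounts}"
}

# Report whether the current root is the SSD partition discussed in this issue.
if [ "$(root_source)" = "/dev/nvme0n1p1" ]; then
    echo "rootfs is on the SSD"
else
    echo "rootfs is NOT on the SSD (found: $(root_source))"
fi
```

Running this once after a cold boot and once after the following warm boot should show the difference described above.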
Expected behavior
A clear and concise description of what you expected to happen.

Additional context
Add any other context about the problem here.

@sjyi sjyi added the bug Something isn't working label Oct 30, 2020
@smyeungx
smyeungx commented Jan 20, 2021

Dear Contributors,

Thanks for creating such a wonderful package. We also encounter this cold boot issue on both the NX and the AGX.
We modified setssdroot.sh slightly to log when systemd starts the script.

NORMAL STARTUP or REBOOT
During a normal startup or reboot, the device /dev/nvme0n1p1 appears very soon after the nvme "enabling device" message is logged:
[ 2.491446] pcie_pme 0005:00:00.0:pcie001: service driver pcie_pme loaded
[ 2.491510] aer 0005:00:00.0:pcie002: service driver aer loaded
[ 2.491950] nvme nvme0: pci function 0005:01:00.0
[ 2.492023] nvme 0005:01:00.0: enabling device (0000 -> 0002)

[ 2.501732] tegra-cbb 14040000.cv-noc: noc_secure_irq = 89, noc_nonsecure_irq = 88>
[ 2.506497] tegra194-isp5 14800000.isp: initialized
[ 2.514111] tegra194-vi5 15c10000.vi: using default number of vi channels, 36
[ 2.518419] tegra194-vi5 15c10000.vi: initialized
[ 2.522866] tegra194-vi5 15c10000.vi: subdev 15a00000.nvcsi--2 bound
[ 2.522944] tegra194-vi5 15c10000.vi: subdev 15a00000.nvcsi--1 bound
[ 2.523609] tegra186-cam-rtcpu bc00000.rtcpu: Trace buffer configured at IOVA=0xbff00000
[ 2.601813] nvme0n1: p1 p2
[ 2.606426] tegra-ivc ivc-bc00000.rtcpu: region 0: iova=0xbfee0000-0xbfefffff size=131072
[ 2.607071] tegra-ivc ivc-bc00000.rtcpu:echo@0: echo: ver=0 grp=1 RX[16x64]=0x1000-0x1480 TX[16x64]=0x1480-0x1900

After the device is detected, systemd launches setssdroot.service, which invokes setssdroot.sh once the requirement ConditionPathExists=/dev/nvme0n1p1 in the service file is fulfilled.
systemd appears to run the service quite early, as soon as the EXT4 filesystem has been remounted, which is what the ordering in the service file asks for:
[ 3.464040] EXT4-fs (mmcblk0p1): re-mounted. Opts: (null)
....
[ 3.895975] setssdroot: remount rootfs to nvme0n1p1 <-- added logging code to dmesg
[ 3.980035] [EXT4 FS bs=4096, gc=3249, bpg=32768, ipg=8192, mo=e882c818, mo2=0002]
[ 3.992243] EXT4-fs (nvme0n1p1): recovery complete
[ 4.019774] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: debug,errors=continue,discard
[ 4.060660] setssdroot: exit remount rootfs <-- added logging code to dmesg

COLD REBOOT
During a cold boot of L4T on the Jetson AGX, however, the nvme0n1p1 partition is usually detected only at a relatively late stage after the nvme "enabling device" message is logged. The cause may be something like file system recovery on a partition that was not unmounted cleanly after a failure or an accidental power-off:
[ 2.511762] nvme nvme0: pci function 0005:01:00.0
[ 2.512342] nvme 0005:01:00.0: enabling device (0000 -> 0002)
...
[ 4.008445] hid-generic 0003:17EF:60EE.0005: hidraw4: USB HID v1.11 Device [Lenovo TrackPoint Keyboard II] on usb-3610000.xhci-2.4.4.2/input2
[ 4.450262] nvme0n1: p1 p2
[ 4.685762] random: crng init done

As a result, the device /dev/nvme0n1p1 appears only after systemd has already tried to start setssdroot.service, so the requirement
ConditionPathExists=/dev/nvme0n1p1
is not fulfilled and the service is never executed in this case.
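
To compare the two timings on another board, the kernel timestamps can be pulled out of dmesg. The following is a rough sketch; first_stamp is a hypothetical helper, and the two sample lines are copied from the dmesg output at the end of this comment.

```shell
# Hypothetical helper: print the kernel timestamp (in seconds) of the first
# line on stdin that matches the extended regex in $1.
first_stamp() {
    grep -m1 -E "$1" | sed -E 's/^\[ *([0-9]+\.[0-9]+)\].*/\1/'
}

# Two sample lines taken from the dmesg output in this comment.
log='[ 3.475874] EXT4-fs (mmcblk0p1): re-mounted. Opts: (null)
[ 4.012793] nvme0n1: p1 p2'

remount=$(printf '%s\n' "$log" | first_stamp 're-mounted')
partition=$(printf '%s\n' "$log" | first_stamp 'nvme0n1: p1')
echo "eMMC remount at ${remount}s, NVMe partitions at ${partition}s"
```

On a live system the same helper can read from dmesg directly, for example `dmesg | first_stamp 'nvme0n1: p1'`. If the partition timestamp is later than the remount timestamp, the condition check has probably already failed by the time the device node exists.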

PROPOSED SOLUTION
We tried the following modification to setssdroot.service so that the service is started only after the device /dev/nvme0n1p1 appears:
[Unit]
Description=Change rootfs to SSD in M.2 key M slot (nvme0n1p1)
DefaultDependencies=no
Conflicts=shutdown.target
#systemctl list-units --type=mount
After=systemd-remount-fs.service dev-nvme0n1p1.device
Before=local-fs-pre.target local-fs.target shutdown.target
Wants=local-fs-pre.target dev-nvme0n1p1.device
ConditionPathExists=/dev/nvme0n1p1
ConditionPathExists=/etc/setssdroot.conf
ConditionVirtualization=!container
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/setssdroot.sh
[Install]
WantedBy=default.target
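
For comparison, a different approach (not the one used in the unit file above, and untested on a Jetson) would be to poll for the device node with a bounded wait before mounting. This is only a sketch: wait_for_path and the 10-second timeout are hypothetical, and it would only help if the ConditionPathExists=/dev/nvme0n1p1 line were dropped, since systemd checks conditions before ExecStart runs.

```shell
# Hypothetical helper: wait until path $1 exists, giving up after $2 seconds.
# Returns 0 if the path appeared in time, 1 otherwise.
wait_for_path() {
    path=$1
    remaining=$2
    while [ ! -e "$path" ]; do
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Example use near the top of a boot script (timeout value is an assumption):
#   wait_for_path /dev/nvme0n1p1 10 || exit 0   # fall back to the eMMC rootfs
```

Ordering on dev-nvme0n1p1.device, as in the unit file above, lets systemd do the waiting instead and avoids a polling loop.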

We also modified EXT4_OPT in setssdroot.sh to include errors=continue:
#!/bin/sh
#Runs at startup, switches rootfs to the SSD on nvme0 (M.2 Key M slot)
NVME_DRIVE="/dev/nvme0n1p1"
CHROOT_PATH="/nvmeroot"

INITBIN=/lib/systemd/systemd
EXT4_OPT="-o defaults -o debug -o errors=continue -o discard"

echo "setssdroot: mount and switch rootfs to nvme0n1p1" | tee /dev/kmsg

modprobe ext4

mkdir -p ${CHROOT_PATH}
mount -t ext4 ${EXT4_OPT} ${NVME_DRIVE} ${CHROOT_PATH}

cd ${CHROOT_PATH}
/bin/systemctl --no-block switch-root ${CHROOT_PATH}

echo "setssdroot: exit mount and switch rootfs" | tee /dev/kmsg
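
Note that the script above calls switch-root whether or not the mount succeeded. A possible guard is sketched below; run and DRY_RUN are hypothetical additions for illustration, and with the dry-run default nothing is actually mounted or switched unless DRY_RUN=0 is set.

```shell
# Sketch only: switch root only if the SSD mount succeeded.
NVME_DRIVE="/dev/nvme0n1p1"
CHROOT_PATH="/nvmeroot"
EXT4_OPT="-o defaults -o debug -o errors=continue -o discard"

# Hypothetical wrapper: print the command in dry-run mode (the default),
# execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

mount_and_switch() {
    run mkdir -p "$CHROOT_PATH" || return 1
    # EXT4_OPT is left unquoted on purpose so it splits into separate -o options.
    if run mount -t ext4 ${EXT4_OPT} "$NVME_DRIVE" "$CHROOT_PATH"; then
        run /bin/systemctl --no-block switch-root "$CHROOT_PATH"
    else
        echo "setssdroot: mount failed, staying on current rootfs" >&2
        return 1
    fi
}
```

Calling mount_and_switch with the default settings only prints the commands it would run, which makes it easy to review before trying it on a real board.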

This approach may delay the boot by 1-2 s during file system recovery, but we cold booted more than 20 times and it worked reliably on both the NX and the AGX. Please check whether this approach helps resolve the issue.
$ dmesg | grep -E 'setssd|EXT4-fs|rootfs|nvme'
[ 0.973739] Trying to unpack rootfs image as initramfs...
[ 2.513857] nvme nvme0: pci function 0005:01:00.0
[ 2.513983] nvme 0005:01:00.0: enabling device (0000 -> 0002)
[ 2.786245] EXT4-fs (mmcblk0p1): recovery complete
[ 2.786966] EXT4-fs (mmcblk0p1): mounted filesystem with ordered data mode. Opts: (null)
[ 2.810175] Switching from initrd to actual rootfs
[ 3.475874] EXT4-fs (mmcblk0p1): re-mounted. Opts: (null)
[ 4.012793] nvme0n1: p1 p2
[ 4.841735] setssdroot: remount rootfs to nvme0n1p1
[ 5.064657] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039474
[ 5.064881] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17040079
[ 5.064943] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17040131
[ 5.065021] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17040106
[ 5.065143] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039479
[ 5.065241] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17040092
[ 5.065398] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039565
[ 5.065444] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039530
[ 5.065491] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17040073
[ 5.065540] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039621
[ 5.065584] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039772
[ 5.065652] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039488
[ 5.065715] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039492
[ 5.065776] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039527
[ 5.065821] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 18087972
[ 5.065886] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039505
[ 5.065933] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039503
[ 5.065982] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039504
[ 5.066024] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039480
[ 5.066065] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17039433
[ 5.066107] EXT4-fs (nvme0n1p1): ext4_orphan_cleanup: deleting unreferenced inode 17040067
[ 5.066142] EXT4-fs (nvme0n1p1): 21 orphan inodes deleted
[ 5.066144] EXT4-fs (nvme0n1p1): recovery complete
[ 5.108166] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: debug,errors=continue,discard
[ 5.127778] setssdroot: exit remount rootfs
[ 15.280425] EXT4-fs (mmcblk0p1): mounted filesystem with ordered data mode. Opts: (null)

Best,
Simon

@Redox15
Redox15 commented Jun 15, 2023

I had to make this change to the script in order to boot from the SSD. In my case, it never boots from the SSD without the change.
However, I ran into a new issue (#28). Does your system have CUDA installed?
