Zephyr - new FPU context switch not working in VexRiscV-smp CPU #297

Open
pottendo opened this issue Feb 3, 2023 · 24 comments

Comments

@pottendo

pottendo commented Feb 3, 2023

Hi,
First off, I hope this isn't too off-topic for this project. If it is, please let me know and point me to the right place, e.g. the LiteX repo. Thanks.

Summary: Zephyr 3.3-RC1 introduced a new (optimized) handling of FPU register saving in multi-threaded applications (context switch).
My LiteX project uses a VexRiscv-smp soft CPU core.
I got basic Zephyr running on this project by just providing a matching devicetree; find the links in the issue report.

Find all details to this issue in the Zephyr repository.
The Zephyr author @npitre suggests:

Alright. Since you do have access to the CPU implementation source code,
I think it would be far better for you to figure out how to support the
MSTATUS_FS bits in your project.

Since I haven't managed to get it going on my own, I'm asking you experts here!
As a fallback I could revert to the 3.2.99 version of Zephyr's FPU code, but I'd like to understand whether the problem is in my Zephyr port, in my RISC-V LiteX project, or in something that needs to be fixed or added in the VexRiscv-smp CPU itself.
If it's the last of these, others may encounter this problem as well.

Thanks for any feedback, pottendo

@Dolu1990
Member

Dolu1990 commented Feb 3, 2023

Hi,

I wasn't aware of that issue.

So, there are a few things we can do to get a better diagnosis:

  1. Run the software while the dirty flag is always forced to see if the bug is still there

https://github.com/SpinalHDL/VexRiscv/blob/master/src/main/scala/vexriscv/plugin/FpuPlugin.scala#L216

 val fs = Reg(Bits(2 bits)) init(1)
 Would become 
 val fs = U"11"
  2. Do you have a simulation that could be used to reproduce the issue? Including software source code / binary in ELF format or .asm

One possibility is that it comes from a "recently" added privileged feature missing in VexRiscv, or just something buggy in the core.

Thanks ^^
Charles

@npitre

npitre commented Feb 3, 2023 via email

@pottendo
Author

pottendo commented Feb 3, 2023

Hi Charles,
thanks for the quick feedback.
ad 2) No, I only have the HW (ECP5 FPGA) and a working LiteX build environment as suggested by the original developer. Sorry.

ad 1) I've changed the FpuPlugin as instructed, but I think the Scala compiler didn't run (note that the project is a LiteX project), or I didn't notice it. How can I double-check that the change really made it in? How can I force a rebuild from the Scala code in LiteX?

The change in Zephyr is recent; with the previous version it was working. I don't think this has anything to do with privileged execution; see here

I'm really a rookie at this level of programming, so it's no surprise that I have to bother you with these basics!
bye, pottendo

@pottendo
Author

pottendo commented Feb 4, 2023

Hi Charles,
found out how to recompile and tried your proposed change.

So, there are a few things we can do to get a better diagnosis:

1. Run the software while the dirty flag is always forced to see if the bug is still there

https://github.com/SpinalHDL/VexRiscv/blob/master/src/main/scala/vexriscv/plugin/FpuPlugin.scala#L216

 val fs = Reg(Bits(2 bits)) init(1)
 Would become 
 val fs = U"11"

This won't compile unfortunately:

[info] running (fork) vexriscv.demo.smp.VexRiscvLitexSmpClusterCmdGen --cpu-count=1 --ibus-width=64 --dbus-width=64 --dcache-size=4096 --icache-size=4096 --dcache-ways=1 --icache-ways=1 --litedram-width=32 --aes-instruction=False --out-of-order-decoder=True --wishbone-memory=True --fpu=True --cpu-per-fpu=4 --rvc=True --netlist-name=VexRiscvLitexSmpCluster_Cc1_Iw64Is4096Iy1_Dw64Ds4096Dy1_ITs4DTs4_Ood_Wm_Fpu4_Rvc --netlist-directory=/opt/work/src/orangeCart/RVCop64-pottendo/hw/deps/pythondata-cpu-vexriscv_smp/pythondata_cpu_vexriscv_smp/verilog --dtlb-size=4 --itlb-size=4
[info] [Runtime] SpinalHDL dev    git head : 020da4e21c6d2114872a623780763f818f346e6e
[info] [Runtime] JVM max memory : 3988.0MiB
[info] [Runtime] Current date : 2023.02.04 16:11:21
[info] [Progress] at 0.000 : Elaborate components
[info] [Progress] at 1.376 : Checks and transforms
[info] **********************************************************************************************
[info] [Warning] Elaboration failed (2 errors).
[info]           Spinal will restart with scala trace to help you to find the problem.
[info] **********************************************************************************************
[info] [Progress] at 1.856 : Elaborate components
[info] [Progress] at 2.314 : Checks and transforms
[error] Exception in thread "main" spinal.core.SpinalExit: 
[error]  Error detected in phase PhaseCheckCombinationalLoops
[error] ********************************************************************************
[error] ********************************************************************************
[error] COMBINATORIAL LOOP :
[error]   Partial chain :
[error]     >>> (toplevel/cores_0_cpu_logic_cpu/CsrPlugin_csrMapping_readDataSignal :  Bits[32 bits]) at vexriscv.plugin.CsrMapping.<init>(CsrPlugin.scala:352) >>>
[error]     >>> (toplevel/cores_0_cpu_logic_cpu/execute_CsrPlugin_readToWriteData :  Bits[32 bits]) at jdk.internal.reflect.GeneratedConstructorAccessor5.newInstance(Unknown Source) >>>
[error]     >>> (toplevel/cores_0_cpu_logic_cpu/??? :  Bits[32 bits]) at vexriscv.plugin.CsrPlugin$$anon$16$$anon$23.<init>(CsrPlugin.scala:1184) >>>
[error]     >>> (toplevel/cores_0_cpu_logic_cpu/CsrPlugin_csrMapping_writeDataSignal :  Bits[32 bits]) at vexriscv.plugin.CsrMapping.<init>(CsrPlugin.scala:352) >>>
[error]     >>> (toplevel/cores_0_cpu_logic_cpu/FpuPlugin_fs :  UInt[2 bits]) at vexriscv.plugin.FpuPlugin$$anon$2.<init>(FpuPlugin.scala:216) >>>
[error]     >>> (toplevel/cores_0_cpu_logic_cpu/??? :  Bits[32 bits]) at vexriscv.plugin.CsrPlugin$$anon$16$$anon$23$$anonfun$27$$anonfun$apply$mcV$sp$42.apply(CsrPlugin.scala:1284) >>>
[error]     >>> (toplevel/cores_0_cpu_logic_cpu/CsrPlugin_csrMapping_readDataInit :  Bits[32 bits]) at vexriscv.plugin.CsrMapping.<init>(CsrPlugin.scala:352) >>>
[error]   Full chain :
[error]     (toplevel/cores_0_cpu_logic_cpu/CsrPlugin_csrMapping_readDataSignal :  Bits[32 bits])
[error]     (toplevel/cores_0_cpu_logic_cpu/execute_CsrPlugin_readToWriteData :  Bits[32 bits])
[error]     (Bits & Bits)[32 bits]
[error]     (Bool ? Bits | Bits)[32 bits]
[error]     (toplevel/cores_0_cpu_logic_cpu/??? :  Bits[32 bits])
[error]     (toplevel/cores_0_cpu_logic_cpu/CsrPlugin_csrMapping_writeDataSignal :  Bits[32 bits])
[error]     Bits(Int downto Int)
[error]     (Bits -> UInt of 2 bits)
[error]     (toplevel/cores_0_cpu_logic_cpu/FpuPlugin_fs :  UInt[2 bits])
[error]     (UInt -> Bits of 2 bits)
[error]     (toplevel/cores_0_cpu_logic_cpu/??? :  Bits[32 bits])
[error]     (Bits | Bits)[32 bits]
[error]     (Bits | Bits)[32 bits]
[error]     (Bits | Bits)[32 bits]
[error]     (Bits | Bits)[32 bits]
[error]     (Bits | Bits)[32 bits]
[error]     (toplevel/cores_0_cpu_logic_cpu/CsrPlugin_csrMapping_readDataInit :  Bits[32 bits])
[error] ********************************************************************************
[error] ********************************************************************************
[error] Design's errors are listed above.
[error] SpinalHDL compiler exit stack : 
[error] 	at spinal.core.SpinalExit$.apply(Misc.scala:424)
[error] 	at spinal.core.SpinalError$.apply(Misc.scala:479)
[error] 	at spinal.core.internals.PhaseContext.checkPendingErrors(Phase.scala:175)
[error] 	at spinal.core.internals.PhaseContext.doPhase(Phase.scala:191)
[error] 	at spinal.core.internals.SpinalVerilogBoot$$anonfun$singleShot$2$$anonfun$apply$137.apply(Phase.scala:2714)
[error] 	at spinal.core.internals.SpinalVerilogBoot$$anonfun$singleShot$2$$anonfun$apply$137.apply(Phase.scala:2712)
[error] 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[error] 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
[error] 	at spinal.core.internals.SpinalVerilogBoot$$anonfun$singleShot$2.apply(Phase.scala:2712)
[error] 	at spinal.core.internals.SpinalVerilogBoot$$anonfun$singleShot$2.apply(Phase.scala:2648)
[error] 	at spinal.core.ScopeProperty$.sandbox(ScopeProperty.scala:69)
[error] 	at spinal.core.internals.SpinalVerilogBoot$.singleShot(Phase.scala:2648)
[error] 	at spinal.core.internals.SpinalVerilogBoot$.apply(Phase.scala:2643)
[error] 	at spinal.core.Spinal$.apply(Spinal.scala:392)
[error] 	at spinal.core.SpinalConfig.generateVerilog(Spinal.scala:170)
[error] 	at vexriscv.demo.smp.VexRiscvLitexSmpClusterCmdGen$.delayedEndpoint$vexriscv$demo$smp$VexRiscvLitexSmpClusterCmdGen$1(VexRiscvSmpLitexCluster.scala:201)
[error] 	at vexriscv.demo.smp.VexRiscvLitexSmpClusterCmdGen$delayedInit$body.apply(VexRiscvSmpLitexCluster.scala:103)
[error] 	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
[error] 	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
[error] 	at scala.App$$anonfun$main$1.apply(App.scala:76)
[error] 	at scala.App$$anonfun$main$1.apply(App.scala:76)
[error] 	at scala.collection.immutable.List.foreach(List.scala:392)
[error] 	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
[error] 	at scala.App$class.main(App.scala:76)
[error] 	at vexriscv.demo.smp.VexRiscvLitexSmpClusterCmdGen$.main(VexRiscvSmpLitexCluster.scala:103)
[error] 	at vexriscv.demo.smp.VexRiscvLitexSmpClusterCmdGen.main(VexRiscvSmpLitexCluster.scala)
[error] Nonzero exit code returned from runner: 1
[error] (Compile / runMain) Nonzero exit code returned from runner: 1
[error] Total time: 98 s (01:38), completed Feb 4, 2023, 4:11:24 PM

Reverting the change and rebuilding results in a working bitstream, so I assume my build system is OK.

bye, pottendo

@pottendo
Author

pottendo commented Feb 4, 2023

...one more observation: I ported my test-program (mandelbrot set) to Linux, booted on the same machine. *)
Building with FPU instructions seems to have a similar problem (fast but wrong result)
Double checking without FPU instructions, the program works as expected (much slower of course).

Maybe this helps to reproduce on other VexRiscV-smp systems?

bye, pottendo

*) managed to get this project running on my FPGA board.

@Dolu1990
Member

Dolu1990 commented Feb 6, 2023

@npitre

any access to the FPU (including fcsr, frm and fflags) when mstatus_fs is set to "off" (0b00) must raise an illegal instruction fault instead

Ahhh that's the missing feature !
Thanks :D
I will fix it.
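
The rule quoted above can be illustrated with a small software model. This is only a sketch of the architectural behaviour, not VexRiscv's hardware logic. The helper names (`fs_field`, `check_fpu_access`, `read_fp_csr`, `IllegalInstruction`) are made up for illustration. It assumes the standard RISC-V layout, where FS occupies bits 14:13 of mstatus and fflags, frm and fcsr sit at CSR addresses 0x001 to 0x003.

```python
# Toy model of "FPU access with FS == Off must trap"; not VexRiscv code.
FS_SHIFT = 13      # mstatus.FS occupies bits 14:13
FS_OFF = 0b00      # 0b01 = Initial, 0b10 = Clean, 0b11 = Dirty

# Floating-point CSRs that are covered by the same rule.
FP_CSRS = {0x001: "fflags", 0x002: "frm", 0x003: "fcsr"}


class IllegalInstruction(Exception):
    """Stands in for the illegal-instruction trap (hypothetical name)."""


def fs_field(mstatus: int) -> int:
    """Extract the 2-bit FS field from an mstatus/sstatus value."""
    return (mstatus >> FS_SHIFT) & 0b11


def check_fpu_access(mstatus: int, what: str) -> None:
    """Raise if an FPU access is attempted while FS is Off."""
    if fs_field(mstatus) == FS_OFF:
        raise IllegalInstruction(f"{what} with FS == Off")


def read_fp_csr(mstatus: int, csr_addr: int) -> str:
    """Model a read of fflags/frm/fcsr; returns the CSR name if allowed."""
    name = FP_CSRS[csr_addr]
    check_fpu_access(mstatus, f"CSR access to {name}")
    return name
```

With FS set to Initial (`mstatus = 0x2000`) the call `read_fp_csr(0x2000, 0x003)` goes through, while the same call with `mstatus = 0x20` raises the model's trap.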

Dolu1990 added a commit that referenced this issue Feb 6, 2023
@Dolu1990
Member

Dolu1990 commented Feb 6, 2023

@pottendo With 9acc5dd it should be good, I think.
Let me know how things go :)
I'm currently letting the regression run to check that it broke nothing.

@pottendo
Author

pottendo commented Feb 6, 2023

@pottendo With 9acc5dd it should be good, I think.
Let me know how things go :)

hi,
will do tonight. Thanks for taking care!
pottendo

@pottendo
Author

pottendo commented Feb 6, 2023

@Dolu1990, I see your dev branch is incorporating a few more changes - shall I apply the full change or just the Fpu related one?

@Dolu1990
Member

Dolu1990 commented Feb 6, 2023

Just that specific change should be good.

@pottendo
Author

pottendo commented Feb 6, 2023

Hi Charles,
some good news: your patch on FpuPlugin.scala did the job for my test-case (mandelbrot set) on Zephyr.

However, it's not perfect yet, as the Zephyr test suite still reports some failures:

*** Booting Zephyr OS build v3.3.0-rc2-21-g56d628c083a8 ***
Running TESTSUITE riscv_fpu_sharing
===================================================================
START - test_basics

    Assertion failed at WEST_TOPDIR-pottendo/tests/arch/riscv/fpu_sharing/src/main.c:58: riscv_fpu_sharing_test_basics: fpu_is_clean() is false

 FAIL - test_basics in 0.006 seconds
===================================================================
START - test_fp_insn_trap

    Assertion failed at WEST_TOPDIR-pottendo/tests/arch/riscv/fpu_sharing/src/main.c:383: riscv_fpu_sharing_test_fp_insn_trap: !fpu_is_off() is false

 FAIL - test_fp_insn_trap in 0.007 seconds
===================================================================
START - test_multi_thread_interaction

    Assertion failed at WEST_TOPDIR-pottendo/tests/arch/riscv/fpu_sharing/src/main.c:90: new_thread_check: (fpu_is_clean() is false)
FPU not clean after read
 FAIL - test_multi_thread_interaction in 0.007 seconds
===================================================================
START - test_thread_vs_exc_interaction

    Assertion failed at WEST_TOPDIR-pottendo/tests/arch/riscv/fpu_sharing/src/main.c:326: riscv_fpu_sharing_test_thread_vs_exc_interaction: fpu_is_off() is false

 FAIL - test_thread_vs_exc_interaction in 0.006 seconds
===================================================================
TESTSUITE riscv_fpu_sharing failed.

------ TESTSUITE SUMMARY START ------

SUITE FAIL -   0.00% [riscv_fpu_sharing]: pass = 0, fail = 4, skip = 0, total = 4 duration = 0.026 seconds
 - FAIL - [riscv_fpu_sharing.test_basics] duration = 0.006 seconds
 - FAIL - [riscv_fpu_sharing.test_fp_insn_trap] duration = 0.007 seconds
 - FAIL - [riscv_fpu_sharing.test_multi_thread_interaction] duration = 0.007 seconds
 - FAIL - [riscv_fpu_sharing.test_thread_vs_exc_interaction] duration = 0.006 seconds

------ TESTSUITE SUMMARY END ------

===================================================================
PROJECT EXECUTION FAILED

You can find the tests here.

More serious is that Linux no longer boots, so the patch may break other Linux systems as well:

Platform Name       : LiteX / VexRiscv-SMP
Platform Features   : timer,mfdeleg
Platform HART Count : 8
Boot HART ID        : 0
Boot HART ISA       : rv32imafdcs
BOOT HART Features  : time
BOOT HART PMP Count : 0
Firmware Base       : 0x40f00000
Firmware Size       : 124 KB
Runtime SBI Version : 0.2

MIDELEG : 0x00000222
MEDELEG : 0x0000b101
[    0.000000] Linux version 6.1.0-rc2 (pottendo@hansi) (riscv32-buildroot-linux-gnu-gcc.br_real (Buildroot 2022.11-rc3-30-ge87d929666) 11.3.0, GNU ld (GNU Binutils) 2.38) #2 SMP Sat Jan 28 12:33:39 CET 2023
[    0.000000] earlycon: liteuart0 at I/O port 0x0 (options '')
[    0.000000] Malformed early option 'console'
[    0.000000] earlycon: liteuart0 at MMIO 0xf0001000 (options '')
[    0.000000] printk: bootconsole [liteuart0] enabled
[    0.000000] INITRD: 0x41000000+0x00800000 is not a memory region - disabling initrd
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000040000000-0x0000000040ffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000040000000-0x0000000040ffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000040000000-0x0000000040ffffff]
[    0.000000] SBI specification v0.2 detected
[    0.000000] SBI implementation ID=0x1 Version=0x8
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI HSM extension detected
[    0.000000] riscv: base ISA extensions acdfim
[    0.000000] riscv: ELF capabilities acdfim
[    0.000000] percpu: Embedded 8 pages/cpu s11264 r0 d21504 u32768
[    0.000000] Built 1 zonelists, mobility grouping off.  Total pages: 4064
[    0.000000] Kernel command line: console=liteuart earlycon=liteuart,0xf0001000 rootwait root=/dev/mmcblk0p2
[    0.000000] Dentry cache hash table entries: 2048 (order: 1, 8192 bytes, linear)
[    0.000000] Inode-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 9044K/16384K available (5052K kernel code, 355K rwdata, 840K rodata, 202K init, 180K bss, 7340K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu: 	RCU restricting CPUs from NR_CPUS=32 to nr_cpu_ids=1.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
[    0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] riscv-intc: 32 local interrupts mapped
[    0.000000] plic: interrupt-controller@f0c00000: mapped 32 interrupts with 1 handlers for 2 contexts.
[    0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.000000] riscv-timer: riscv_timer_init_dt: Registering clocksource cpuid [0] hartid [0]
[    0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0x127350b881, max_idle_ns: 440795202125 ns
[    0.000156] sched_clock: 64 bits at 80MHz, resolution 12ns, wraps every 4398046511100ns
[    0.019043] Console: colour dummy device 80x25
[    0.025047] Calibrating delay loop (skipped), value calculated using timer frequency.. 160.00 BogoMIPS (lpj=800000)
[    0.032636] pid_max: default: 32768 minimum: 301
[    0.072882] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.080142] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.366047] ASID allocator using 9 bits (512 entries)
[    0.397480] rcu: Hierarchical SRCU implementation.
[    0.400590] rcu: 	Max phase no-delay instances is 1000.
[    0.474295] smp: Bringing up secondary CPUs ...
[    0.477015] smp: Brought up 1 node, 1 CPU
[    0.548890] devtmpfs: initialized
[    0.799629] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    0.805298] futex hash table entries: 256 (order: 2, 16384 bytes, linear)
[    1.064023] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[    2.633630] usbcore: registered new interface driver usbfs
[    2.646244] usbcore: registered new interface driver hub
[    2.657912] usbcore: registered new device driver usb
[    2.690877] pps_core: LinuxPPS API ver. 1 registered
[    2.693867] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <[email protected]>
[    2.703112] PTP clock support registered
[    2.739357] FPGA manager framework
[    2.854676] clocksource: Switched to clocksource riscv_clocksource
[    5.087978] workingset: timestamp_bits=30 max_order=12 bucket_order=0
[    7.235663] io scheduler mq-deadline registered
[    7.241396] io scheduler kyber registered
[    7.395700] LiteX SoC Controller driver initialized
[   19.825031] f0001000.serial: ttyLXU0 at MMIO 0x0 (irq = 0, base_baud = 0) is a liteuart
[   19.835563] printk: console [liteuart0] enabled
[   19.835563] printk: console [liteuart0] enabled
[   19.842906] printk: bootconsole [liteuart0] disabled
[   19.842906] printk: bootconsole [liteuart0] disabled
[   19.906576] f0004800.serial: ttyLXU1 at MMIO 0x0 (irq = 0, base_baud = 0) is a liteuart
[   20.214475] usbcore: registered new interface driver ftdi_sio
[   20.224921] usbserial: USB Serial support registered for FTDI USB Serial Device
[   20.231241] i2c_dev: i2c /dev entries driver
[   20.363112] mmc_spi spi0.0: SD/MMC host mmc0, no WP, no poweroff, cd polling
[   20.498345] usbcore: registered new interface driver usbhid
[   20.504198] usbhid: USB HID core driver
[   20.706008] Waiting for root device /dev/mmcblk0p2...
[   20.737859] mmc0: host does not support reading read-only switch, assuming write-enable
[   20.746230] mmc0: new SDXC card on SPI
[   20.883125] mmcblk0: mmc0:0000 SC64G 59.5 GiB 
[   21.146296]  mmcblk0: p1 p2
[   23.303332] EXT4-fs (mmcblk0p2): mounted filesystem without journal. Quota mode: disabled.
[   23.314020] VFS: Mounted root (ext4 filesystem) readonly on device 179:2.
[   24.273907] devtmpfs: mounted
[   24.297809] Freeing unused kernel image (initmem) memory: 196K
[   24.304704] Kernel memory protection not selected by kernel config.
[   24.310697] Run /sbin/init as init process
[   30.115385] init[1]: unhandled signal 4 code 0x1 at 0x95b2f658 in ld-linux-riscv32-ilp32d.so.1[95b1e000+18000]
[   30.126632] CPU: 0 PID: 1 Comm: init Not tainted 6.1.0-rc2 #2
[   30.133444] epc : 95b2f658 ra : 95b2a9ec sp : 9d9d1530
[   30.137246]  gp : c064d870 tp : 00000000 t0 : 6ffffe38
[   30.143664]  t1 : 95b1eaec t2 : 00000035 s0 : 95b38a84
[   30.147266]  s1 : 95b38ae0 a0 : 9d9d1550 a1 : 00000000
[   30.153255]  a2 : 9d9d16e8 a3 : 00000024 a4 : 693a9f4c
[   30.157240]  a5 : 9d9d1548 a6 : 8f070303 a7 : 4824160a
[   30.163671]  s2 : 95b38ae0 s3 : 9d9d1680 s4 : 693a9f3c
[   30.167245]  s5 : 00000000 s6 : 00000000 s7 : 9d9d1680
[   30.173270]  s8 : 00000001 s9 : 9d9d1680 s10: 69450ef4
[   30.177272]  s11: 95b37e80 t3 : 95b2a99e t4 : 95b1d118
[   30.183687]  t5 : 95b1d12c t6 : 6fffff44
[   30.186992] status: 00000020 badaddr: 02853c27 cause: 00000002
[   30.196133] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004
[   30.199836] CPU: 0 PID: 1 Comm: init Not tainted 6.1.0-rc2 #2
[   30.204402] Call Trace:
[   30.207183] [<c0003d10>] dump_backtrace+0x2c/0x3c
[   30.211623] [<c04dcb24>] show_stack+0x44/0x5c
[   30.216092] [<c04e6bc4>] dump_stack_lvl+0x58/0x7c
[   30.220620] [<c04e6c04>] dump_stack+0x1c/0x2c
[   30.225097] [<c04dcd40>] panic+0x130/0x2f8
[   30.229503] [<c000ff50>] do_exit+0x82c/0x834
[   30.234054] [<c00100c8>] do_group_exit+0x38/0x9c
[   30.238569] [<c001dab4>] get_signal+0x8b8/0x8e8
[   30.243101] [<c0002e7c>] do_notify_resume+0x90/0x368
[   30.247631] [<c00023a0>] ret_from_exception+0x0/0x10
[   30.252783] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004 ]---

I double-checked by booting the previous version (without the FPU plugin patch), which successfully boots the same Linux image.
The exception is raised in the shared library ld-linux-riscv32-ilp32d.so.1 with signal 4, which is an illegal instruction (IIRC).
I don't see a way to tell buildroot to generate a different library or to use a different instruction set.

All the best and thanks for this great project - all efforts here are much appreciated!
br, pottendo

@Dolu1990
Member

Dolu1990 commented Feb 7, 2023

Hi @pottendo

Assertion failed at WEST_TOPDIR-pottendo/tests/arch/riscv/fpu_sharing/src/main.c:58: riscv_fpu_sharing_test_basics: fpu_is_clean() is false

This one is fine. Basically, VexRiscv's dirty tracking is very pessimistic and assumes that any interaction with the FPU, or any CSR write to the FPU, will make it dirty.
It is allowed by the spec :
"Implementations may choose to track the dirtiness of the floating-point register state imprecisely by reporting the state to be dirty even when it has not been modified."

Assertion failed at WEST_TOPDIR-pottendo/tests/arch/riscv/fpu_sharing/src/main.c:90: new_thread_check: (fpu_is_clean() is false)

Same

Assertion failed at WEST_TOPDIR-pottendo/tests/arch/riscv/fpu_sharing/src/main.c:326: riscv_fpu_sharing_test_thread_vs_exc_interaction: fpu_is_off() is false

Same, the test assumes that this will not make the FPU dirty, but in VexRiscv it will.

Assertion failed at WEST_TOPDIR-pottendo/tests/arch/riscv/fpu_sharing/src/main.c:383: riscv_fpu_sharing_test_fp_insn_trap: !fpu_is_off() is false

That one is actually a VexRiscv deviation from the spec. CSR reads and writes should now also trap when FS==0 with this additional patch:
cbc8909

linux

That's more worrying
The Linux one traps on 0x02853c27, which is an "fsd fs0,56(a0)", while FS=0 (status: 00000020).
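
The two values from the panic log can be decoded by hand. Here is a minimal sketch, assuming the standard RISC-V S-type encoding for FP stores (opcode 0x27, funct3 2 for fsw and 3 for fsd) and FS in bits 14:13 of the status register. The function names are made up for illustration, and only FP stores are handled.

```python
# ABI names for the integer (x) and floating-point (f) register files.
X_ABI = (["zero", "ra", "sp", "gp", "tp", "t0", "t1", "t2", "s0", "s1"]
         + [f"a{i}" for i in range(8)]
         + [f"s{i}" for i in range(2, 12)]
         + [f"t{i}" for i in range(3, 7)])
F_ABI = ([f"ft{i}" for i in range(8)] + ["fs0", "fs1"]
         + [f"fa{i}" for i in range(8)]
         + [f"fs{i}" for i in range(2, 12)]
         + [f"ft{i}" for i in range(8, 12)])

STORE_FP = 0x27                      # major opcode shared by fsw/fsd
FP_STORE_NAMES = {2: "fsw", 3: "fsd"}


def decode_fp_store(insn: int) -> str:
    """Disassemble a 32-bit fsw/fsd instruction word."""
    if insn & 0x7F != STORE_FP:
        raise ValueError("not an FP store")
    funct3 = (insn >> 12) & 0x7
    rs1 = (insn >> 15) & 0x1F        # integer base register
    rs2 = (insn >> 20) & 0x1F        # FP source register
    # S-type immediate: bits 31:25 are imm[11:5], bits 11:7 are imm[4:0].
    imm = ((insn >> 25) << 5) | ((insn >> 7) & 0x1F)
    if imm & 0x800:                  # sign-extend the 12-bit immediate
        imm -= 0x1000
    return f"{FP_STORE_NAMES[funct3]} {F_ABI[rs2]},{imm}({X_ABI[rs1]})"


def fs_field(status: int) -> int:
    """Extract FS (bits 14:13) from an mstatus/sstatus value."""
    return (status >> 13) & 0b11


print(decode_fp_store(0x02853C27))   # fsd fs0,56(a0)
print(fs_field(0x00000020))          # 0, i.e. FS is Off
```

So the trapping instruction is a double-precision store executed while FS is Off, which matches the illegal-instruction cause in the log.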

I have to check how Linux handles the MMU :)

Let me know how it goes with the cbc8909 patch :D

Best regards
Charles

@npitre

npitre commented Feb 7, 2023 via email

@npitre

npitre commented Feb 8, 2023 via email

@Dolu1990
Member

Dolu1990 commented Feb 8, 2023

How hard would it be to set the dirty state on writes only as opposed to set it on any access?

@npitre It shouldn't be that hard. I will give it a try as soon as the other issues in Vex are fixed (to avoid mixing fixes) ^^
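
The difference between the two tracking policies discussed here can be shown with a toy state model. This is an illustrative sketch only, not the VexRiscv logic. `fs_after_fp_access` and its parameters are hypothetical names, and the FS encodings are the standard Off/Initial/Clean/Dirty values.

```python
from enum import IntEnum


class FS(IntEnum):
    OFF = 0
    INITIAL = 1
    CLEAN = 2
    DIRTY = 3


def fs_after_fp_access(fs: FS, modifies_fp_state: bool,
                       dirty_on_any_access: bool) -> FS:
    """Return the FS state after one FP instruction or FP CSR access.

    dirty_on_any_access=True models pessimistic tracking: every access
    marks the state Dirty. With False, only accesses that modify FP
    state do.
    """
    if fs == FS.OFF:
        raise RuntimeError("FP access while FS is Off would trap")
    if dirty_on_any_access or modifies_fp_state:
        return FS.DIRTY
    return fs


# A read-only access (for example a store of an FP register to memory):
print(fs_after_fp_access(FS.CLEAN, False, dirty_on_any_access=True).name)   # DIRTY
print(fs_after_fp_access(FS.CLEAN, False, dirty_on_any_access=False).name)  # CLEAN
```

Both policies are permitted by the spec sentence quoted earlier in the thread, but a check like `fpu_is_clean()` after a read-only access only holds under the second one.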

@pottendo
Author

pottendo commented Feb 8, 2023

@pottendo: Could you give it a try?

hi,
I will try today. However, I'll be away for the rest of the week and unable to access my setup, so further tests from my side may have to wait until the weekend.
br, pottendo

@Dolu1990
Member

Dolu1990 commented Feb 8, 2023

Got a VexRiscv implementation with a much more accurate dirty bit:
33e820b

@pottendo
Author

pottendo commented Feb 8, 2023

Hi all,
things improved drastically! ;-)

--============= Liftoff! ===============--
*** Booting Zephyr OS build v3.3.0-rc2-21-g56d628c083a8 ***
Running TESTSUITE riscv_fpu_sharing
===================================================================
START - test_basics
 PASS - test_basics in 0.002 seconds
===================================================================
START - test_fp_insn_trap
 PASS - test_fp_insn_trap in 0.004 seconds
===================================================================
START - test_multi_thread_interaction
 PASS - test_multi_thread_interaction in 0.014 seconds
===================================================================
START - test_thread_vs_exc_interaction
 PASS - test_thread_vs_exc_interaction in 0.002 seconds
===================================================================
TESTSUITE riscv_fpu_sharing succeeded

------ TESTSUITE SUMMARY START ------

SUITE PASS - 100.00% [riscv_fpu_sharing]: pass = 4, fail = 0, skip = 0, total = 4 duration = 0.022 seconds
 - PASS - [riscv_fpu_sharing.test_basics] duration = 0.002 seconds
 - PASS - [riscv_fpu_sharing.test_fp_insn_trap] duration = 0.004 seconds
 - PASS - [riscv_fpu_sharing.test_multi_thread_interaction] duration = 0.014 seconds
 - PASS - [riscv_fpu_sharing.test_thread_vs_exc_interaction] duration = 0.002 seconds

------ TESTSUITE SUMMARY END ------

===================================================================
PROJECT EXECUTION SUCCESSFUL

This is the result after applying all proposed patches. So, Charles, one of the last two patches evidently made the system boot again, Zephyr included.
Also @npitre's patch on the test-program is applied here.
My test-program also runs fine.
Linux still to be done... stay tuned!

Great job guys!
pottendo

@Dolu1990
Member

Dolu1990 commented Feb 8, 2023

@pottendo

Charles, one of the last two patches evidently made the system boot again

Nice ^^

Linux still to be done... stay tuned!

So, do you mean it is still broken for Linux? Or that you still need to test it?

Thanks :)
Charles

@npitre

npitre commented Feb 8, 2023 via email

@pottendo
Author

pottendo commented Feb 8, 2023

@npitre, I've reverted your change and can confirm that all tests still pass.
@Dolu1990, in the meantime I've run also Linux - here it still fails:

[   24.311086] Run /sbin/init as init process
[   30.114352] init[1]: unhandled signal 4 code 0x1 at 0x95b12658 in ld-linux-riscv32-ilp32d.so.1[95b01000+18000]
[   30.125637] CPU: 0 PID: 1 Comm: init Not tainted 6.1.0-rc2 #2
[   30.132595] epc : 95b12658 ra : 95b0d9ec sp : 9d859530
[   30.136251]  gp : c064d870 tp : 00000000 t0 : 6ffffe38
[   30.142703]  t1 : 95b01aec t2 : 00000035 s0 : 95b1ba84
[   30.146659]  s1 : 95b1bae0 a0 : 9d859550 a1 : 00000000
[   30.152684]  a2 : 9d8596e8 a3 : 00000024 a4 : 69339f4c
[   30.156651]  a5 : 9d859548 a6 : 8f070303 a7 : 4824160a
[   30.163283]  s2 : 95b1bae0 s3 : 9d859680 s4 : 69339f3c
[   30.167251]  s5 : 00000000 s6 : 00000000 s7 : 9d859680
[   30.173275]  s8 : 00000001 s9 : 9d859680 s10: 693e0ef4
[   30.177278]  s11: 95b1ae80 t3 : 95b0d99e t4 : 95b00118
[   30.183694]  t5 : 95b0012c t6 : 6fffff44
[   30.186997] status: 00000020 badaddr: 02853c27 cause: 00000002
[   30.196137] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004
[   30.199841] CPU: 0 PID: 1 Comm: init Not tainted 6.1.0-rc2 #2
[   30.204407] Call Trace:
[   30.207188] [<c0003d10>] dump_backtrace+0x2c/0x3c
[   30.211628] [<c04dcb24>] show_stack+0x44/0x5c
[   30.216097] [<c04e6bc4>] dump_stack_lvl+0x58/0x7c
[   30.220625] [<c04e6c04>] dump_stack+0x1c/0x2c
[   30.225102] [<c04dcd40>] panic+0x130/0x2f8
[   30.229507] [<c000ff50>] do_exit+0x82c/0x834
[   30.234059] [<c00100c8>] do_group_exit+0x38/0x9c
[   30.238575] [<c001dab4>] get_signal+0x8b8/0x8e8
[   30.243106] [<c0002e7c>] do_notify_resume+0x90/0x368
[   30.247635] [<c00023a0>] ret_from_exception+0x0/0x10
[   30.252788] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004 ]---

I haven't changed my buildroot and Linux kernel (yet), and I don't know whether or how that would be needed.
But maybe how FPU management is done in the Linux kernel is something to investigate more deeply.
cheers, pottendo

@npitre

npitre commented Feb 8, 2023 via email

@pottendo
Author

pottendo commented Feb 8, 2023

@npitre,
I had evidently mixed up the different configs. I've now double-checked, and I had indeed missed the 'CONFIG_FPU=y' option.
Fixing that makes Linux happy as well!
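
For anyone who wants to rule out the same mistake, a quick check of the kernel configuration could look like the sketch below. It assumes the usual Kconfig text format, where enabled options appear as `CONFIG_X=y` and disabled ones as `# CONFIG_X is not set`. The function name and the sample text are made up for illustration.

```python
def config_enabled(config_text: str, option: str) -> bool:
    """Return True if `option` is set to 'y' in a Kconfig-style .config text."""
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options are comments like "# CONFIG_FPU is not set".
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key == option:
            return value == "y"
    return False


# Hypothetical sample contents, for illustration only.
sample = """\
CONFIG_SMP=y
# CONFIG_FPU is not set
"""
print(config_enabled(sample, "CONFIG_FPU"))   # False
print(config_enabled(sample, "CONFIG_SMP"))   # True
```

To use it on a real build, pass in the text of the kernel's `.config` file. Its location depends on your build setup.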

image

For me this looks promising - good to commit to the respective main branches!
bye, pottendo

@Dolu1990
Member

Dolu1990 commented Feb 9, 2023

Do you have CONFIG_FPU=y in your kernel config?

@npitre omg nice catch XDXD

Thanks :D
