Nvidia GH200 / ARM64: SIGSEGV in XPU and Vector, but not Scalar modes #180

Open
cwpenhale opened this issue Jan 11, 2025 · 2 comments

@cwpenhale

Hi Team,

I'm successfully using OpenMoonRay 1.7 in Gentoo on an AMD EPYC 9654 workstation (Ebuild here, patches to OMR here). I'm working on building out a render farm, and I hope to use the well-priced NVIDIA GH200 platform on VULTR as an on-demand Arras compute node.

I've built an OMR Docker image for ARM64 Neoverse V2, the CPU cores used in the NVIDIA GH200, with OptiX and CUDA (-march=armv9-a -mcpu=neoverse-v2 -mtune=neoverse-v2). The patches I've made against OMR's source are here and the ebuild, slightly modified from the previous example, is here.

I'm launching my docker container like so:
docker run -it -v /root:/root --runtime=nvidia --gpus=all -e NVIDIA_DRIVER_CAPABILITIES=graphics,compute,utility openmoonray-arm64

My bash environment looks like this:

NVIDIA_VISIBLE_DEVICES=all
REZ_MOONRAY_ROOT=/opt/openmoonray
PWD=/root/example_scenes/pbrt_scenes/country_kitchen
NVIDIA_DRIVER_CAPABILITIES=graphics,compute,utility
HOME=/root
LS_COLORS=<trimmed>
RDL2_DSO_PATH=/opt/openmoonray/rdl2dso
MOONRAY_ROOT=/opt/openmoonray
MOONRAY_CLASS_PATH=/opt/openmoonray/shader_json
TERM=xterm
SHLVL=1
ARRAS_SESSION_PATH=/opt/openmoonray/sessions
PATH=/opt/openmoonray/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
OLDPWD=/root/example_scenes
_=/usr/sbin/env

The processor looks like this in /proc/cpuinfo:

processor       : 71
BogoMIPS        : 2000.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd4f
CPU revision    : 0
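
As a quick sanity check on the listing above, the Features line can be tested for individual flags. On AArch64 Linux, NEON is reported as "asimd", and the SVE/SVE2 flags appear as "sve" and "sve2". The following is a minimal sketch, not part of MoonRay; the helper name hasCpuFeature is hypothetical and it only does whole-token matching on a single line of text.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper (not from MoonRay): returns true if `feature` appears as a
// whole whitespace-separated token in a /proc/cpuinfo "Features" line.
// Anything up to and including the first ':' is treated as the field label.
inline bool hasCpuFeature(const std::string& featuresLine, const std::string& feature) {
    assert(!feature.empty());  // an empty feature name is a caller error
    const std::size_t colon = featuresLine.find(':');
    std::istringstream tokens(colon == std::string::npos
                                  ? featuresLine
                                  : featuresLine.substr(colon + 1));
    std::string token;
    while (tokens >> token) {
        if (token == feature) {
            return true;  // exact match only, so "sve" does not match "sve2"
        }
    }
    return false;
}
```

Calling hasCpuFeature(line, "asimd") on the Features line shown above should report that NEON is available, which is what the vector code path would rely on.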

I'm running the test render on the country kitchen scene with:
moonray -debug -exec_mode xpu -in scene.rdla -in scene.rdlb -out arm64.exr

And the output is in the attached file
kitchen.log

I'd love to contribute a coherent patch once I get this working. The majority of the changes I've made so far are about changing __APPLE__ to __ARM_NEON__ in the appropriate places, and separating the concerns of ARM on Darwin from ARM on Linux. It's been a whirlwind trying to get this far, and compiling on qemu has made this process slower than usual :)
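
To illustrate the kind of separation described above, here is a minimal sketch that detects the operating system and the SIMD instruction set independently, so that "ARM" no longer implies "Darwin". The SKETCH_* macros and the sketch* functions are hypothetical names made up for this example and are not MoonRay's actual macros; only the compiler-predefined macros (__APPLE__, __linux__, __ARM_NEON, __ARM_NEON__, __SSE2__, __x86_64__) are real.

```cpp
#include <cassert>
#include <string>

// Operating-system detection: independent of the CPU architecture.
#if defined(__APPLE__)
#  define SKETCH_OS_DARWIN 1
#elif defined(__linux__)
#  define SKETCH_OS_LINUX 1
#endif

// Instruction-set detection: independent of the operating system.
// __ARM_NEON is the ACLE spelling; __ARM_NEON__ is the older one.
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#  define SKETCH_ISA_NEON 1
#elif defined(__SSE2__) || defined(__x86_64__)
#  define SKETCH_ISA_SSE 1
#endif

// Name of the OS this translation unit was compiled for.
inline std::string sketchOsName() {
#if defined(SKETCH_OS_DARWIN)
    return "darwin";
#elif defined(SKETCH_OS_LINUX)
    return "linux";
#else
    return "unknown";
#endif
}

// Name of the SIMD instruction set selected at compile time.
inline std::string sketchIsaName() {
#if defined(SKETCH_ISA_NEON)
    return "neon";
#elif defined(SKETCH_ISA_SSE)
    return "sse";
#else
    return "scalar";
#endif
}

// Combined label such as "linux-neon" or "darwin-neon".
inline std::string sketchTargetName() {
    const std::string os = sketchOsName();
    const std::string isa = sketchIsaName();
    assert(!os.empty() && !isa.empty());  // both helpers always return a label
    return os + "-" + isa;
}
```

With this layout, code that only cares about NEON can test SKETCH_ISA_NEON, while code that depends on Darwin-specific APIs tests SKETCH_OS_DARWIN, and an ARM64 Linux build would be expected to report "linux-neon".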

Where is a good place to start debugging this? Since scalar mode works, I imagine I made some mistakes in my patching that affect the vector and XPU paths. After reading the code, I also assume that Apple has never been tested with OptiX at all, so we're in uncharted waters.

Looking forward to working with everyone! Thanks!

@cwpenhale

cwpenhale commented Jan 11, 2025

Quick update; attached lldb in the container.

(lldb) platform status
  Platform: host
    Triple: aarch64-*-linux-gnu
OS Version: 6.8.0 (6.8.0-1019-nvidia-64k)
  Hostname: 127.0.0.1
WorkingDir: /opt/cuda
    Kernel: #21~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 21 21:12:06 UTC 2
    Kernel: Linux
   Release: 6.8.0-1019-nvidia-64k
   Version: #21~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 21 21:12:06 UTC 2
(lldb) file /opt/openmoonray/bin/moonray
Current executable set to '/opt/openmoonray/bin/moonray' (aarch64).
(lldb) platform settings -w /root/example_scenes/pbrt_scenes/country_kitchen/
(lldb) platform shell pwd
/root/example_scenes/pbrt_scenes/country_kitchen
(lldb)  platform shell ls
LICENSE.txt
scene.exr
scene.rdla
scene.rdlb
textures
(lldb) settings set -- target.env-vars MOONRAY_ROOT=/opt/openmoonray  PATH=$MOONRAY_ROOT/bin:$PATH RDL2_DSO_PATH=$MOONRAY_ROOT/rdl2dso REZ_MOONRAY_ROOT=$MOONRAY_ROOT ARRAS_SESSION_PATH=$MOONRAY_ROOT/sessions MOONRAY_CLASS_PATH=$MOONRAY_ROOT/shader_json
(lldb) env
target.env-vars (dictionary of strings) =
  ARRAS_SESSION_PATH=$MOONRAY_ROOT/sessions
  MOONRAY_CLASS_PATH=$MOONRAY_ROOT/shader_json
  MOONRAY_ROOT=/opt/openmoonray
  PATH=$MOONRAY_ROOT/bin:$PATH
  RDL2_DSO_PATH=$MOONRAY_ROOT/rdl2dso
  REZ_MOONRAY_ROOT=$MOONRAY_ROOT
(lldb) settings set -- target.run-args "-debug" "-exec_mode" "xpu" "-in" "scene.rdla" "-in" "scene.rdlb" "-out" "arm64.exr"
(lldb) settings set target.disable-aslr false
(lldb) run
Process 145 launched: '/opt/openmoonray/bin/moonray' (aarch64)
Setting mPerThreadRayStatePoolSize to 65536
Setting mRayQueueSize to 1024
Setting mOcclusionQueueSize to 1024
Setting mShadeQueueSize to 128
Setting mRadianceQueueSize to 512
Setting mShadingWorkloadChunkSize to 32
Setting mPresenceShadowsQueueSize to 1024
RenderPrep CPU-affinity control disabled
RenderPrep CPU-affinity control disabled
free(): invalid size
Process 145 stopped
* thread #1, name = 'moonray', stop reason = signal SIGABRT
    frame #0: 0x0000f39217fe5db8 libc.so.6`___lldb_unnamed_symbol3478 + 312
libc.so.6`___lldb_unnamed_symbol3478:
->  0xf39217fe5db8 <+312>: cmn    w0, #0x1, lsl #12 ; =0x1000 
    0xf39217fe5dbc <+316>: csneg  w0, wzr, w0, ls
    0xf39217fe5dc0 <+320>: b      0xf39217fe5d3c ; <+188>
    0xf39217fe5dc4 <+324>: mov    x0, x22
(lldb)  settings set -- target.run-args "-debug" "-exec_mode" "vector" "-in" "scene.rdla" "-in" "scene.rdlb" "-out" "arm64.exr"
(lldb) run
There is a running process, kill it and restart?: [Y/n] y
Process 145 exited with status = 9 (0x00000009) killed
Process 232 launched: '/opt/openmoonray/bin/moonray' (aarch64)
Setting mPerThreadRayStatePoolSize to 65536
Setting mRayQueueSize to 1024
Setting mOcclusionQueueSize to 1024
Setting mShadeQueueSize to 128
Setting mRadianceQueueSize to 512
Setting mShadingWorkloadChunkSize to 32
Setting mPresenceShadowsQueueSize to 1024
RenderPrep CPU-affinity control disabled
RenderPrep CPU-affinity control disabled
free(): invalid size
Process 232 stopped
* thread #1, name = 'moonray', stop reason = signal SIGABRT
    frame #0: 0x0000f9c7cf4f5db8 libc.so.6`___lldb_unnamed_symbol3478 + 312
libc.so.6`___lldb_unnamed_symbol3478:
->  0xf9c7cf4f5db8 <+312>: cmn    w0, #0x1, lsl #12 ; =0x1000 
    0xf9c7cf4f5dbc <+316>: csneg  w0, wzr, w0, ls
    0xf9c7cf4f5dc0 <+320>: b      0xf9c7cf4f5d3c ; <+188>
    0xf9c7cf4f5dc4 <+324>: mov    x0, x22
(lldb) thread backtrace
* thread #1, name = 'moonray', stop reason = signal SIGABRT
  * frame #0: 0x0000ede92dba5db8 libc.so.6`___lldb_unnamed_symbol3478 + 312
    frame #1: 0x0000ede92db5c2fc libc.so.6`raise + 28
    frame #2: 0x0000ede92db47b80 libc.so.6`abort + 244
    frame #3: 0x0000ede92db998d0 libc.so.6`___lldb_unnamed_symbol3386 + 512
    frame #4: 0x0000ede92dbb04c8 libc.so.6`___lldb_unnamed_symbol3549 + 24
    frame #5: 0x0000ede92dbb26f0 libc.so.6`___lldb_unnamed_symbol3569 + 656
    frame #6: 0x0000ede92dbb4cc0 libc.so.6`__libc_free + 176
    frame #7: 0x0000ede933136e84 libscene_rdl2.so`scene_rdl2::rdl2::DsoFinder::guessDsoPath() + 1044
    frame #8: 0x0000ede93313716c libscene_rdl2.so`scene_rdl2::rdl2::DsoFinder::find() + 316
    frame #9: 0x0000ede93317fa04 libscene_rdl2.so`scene_rdl2::rdl2::SceneContext::SceneContext() + 644
    frame #10: 0x0000ede941062304 librendering_rndr.so`moonray::rndr::RenderContext::RenderContext(moonray::rndr::RenderOptions&, std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char>>*) + 564
    frame #11: 0x0000bf058131585c moonray`___lldb_unnamed_symbol208 + 76
    frame #12: 0x0000ede941236c1c libapplication.so`moonray::RaasApplication::main(int, char**) + 124
    frame #13: 0x0000bf0581315ad8 moonray`___lldb_unnamed_symbol209 + 88
    frame #14: 0x0000ede92db48244 libc.so.6`___lldb_unnamed_symbol3095 + 116
    frame #15: 0x0000ede92db48318 libc.so.6`__libc_start_main + 152
    frame #16: 0x0000bf0581315030 moonray`___lldb_unnamed_symbol202 + 48

I'm not much of a C++ guy, so let me know if this is helpful :)

@cwpenhale

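I see what I did wrong in collecting this with lldb: the values I passed to target.env-vars contained $MOONRAY_ROOT, and lldb stored them literally instead of expanding them (see the env output in my previous comment). I've updated my environment to not use $BASH_VARIABLES and I'm getting output that is more useful.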

(lldb) run                                                                                                        
There is a running process, kill it and restart?: [Y/n] y                                                         
Process 36 exited with status = 9 (0x00000009) killed                                                             
Process 268 launched: '/opt/openmoonray/bin/moonray' (aarch64)                                                    
RenderPrep CPU-affinity control disabled                                                                          
Loading Scene File(s): /root/example_scenes/pbrt_scenes/country_kitchen/scene.rdla                                
Loading Scene File(s): /root/example_scenes/pbrt_scenes/country_kitchen/scene.rdlb                                
Render prep time = 00:00:01.531                                                                                   
MOONRAY MCRT thread pool : MCRT-CPU-affinity disabled                                                             
Process 268 stopped  
... snip ... 
    frame #0: 0x0000fb7da8553b48 librendering_pbr.so`moonray::pbr::rayBundleHandler(moonray::mcrt_common::ThreadLocalState*, unsigned int, moonray::pbr::RayState**, void*) + 360
librendering_pbr.so`moonray::pbr::rayBundleHandler:
->  0xfb7da8553b48 <+360>: ldr    x8, [x8, #0x148]
    0xfb7da8553b4c <+364>: strb   wzr, [x8, #0x14]
    0xfb7da8553b50 <+368>: mrs    x9, CNTVCT_EL0
    0xfb7da8553b54 <+372>: ldp    x10, x11, [x8]
  thread #146, name = 'moonray', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0xfb858c79cec0)
    frame #0: 0x0000fb7da8553b48 librendering_pbr.so`moonray::pbr::rayBundleHandler(moonray::mcrt_common::ThreadLocalState*, unsigned int, moonray::pbr::RayState**, void*) + 360
librendering_pbr.so`moonray::pbr::rayBundleHandler:
->  0xfb7da8553b48 <+360>: ldr    x8, [x8, #0x148]
    0xfb7da8553b4c <+364>: strb   wzr, [x8, #0x14]
    0xfb7da8553b50 <+368>: mrs    x9, CNTVCT_EL0
    0xfb7da8553b54 <+372>: ldp    x10, x11, [x8]
  thread #147, name = 'moonray', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0xfb858c781040)
    frame #0: 0x0000fb7da8553b48 librendering_pbr.so`moonray::pbr::rayBundleHandler(moonray::mcrt_common::ThreadLocalState*, unsigned int, moonray::pbr::RayState**, void*) + 360
librendering_pbr.so`moonray::pbr::rayBundleHandler:
->  0xfb7da8553b48 <+360>: ldr    x8, [x8, #0x148]
    0xfb7da8553b4c <+364>: strb   wzr, [x8, #0x14]
    0xfb7da8553b50 <+368>: mrs    x9, CNTVCT_EL0
    0xfb7da8553b54 <+372>: ldp    x10, x11, [x8]
  thread #148, name = 'moonray', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0xfb858c7b8d40)
    frame #0: 0x0000fb7da8553b48 librendering_pbr.so`moonray::pbr::rayBundleHandler(moonray::mcrt_common::ThreadLocalState*, unsigned int, moonray::pbr::RayState**, void*) + 360
librendering_pbr.so`moonray::pbr::rayBundleHandler:
->  0xfb7da8553b48 <+360>: ldr    x8, [x8, #0x148]
    0xfb7da8553b4c <+364>: strb   wzr, [x8, #0x14]
    0xfb7da8553b50 <+368>: mrs    x9, CNTVCT_EL0
    0xfb7da8553b54 <+372>: ldp    x10, x11, [x8]
  thread #149, name = 'moonray', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0xfb858c7d4bc0)
    frame #0: 0x0000fb7da8553b48 librendering_pbr.so`moonray::pbr::rayBundleHandler(moonray::mcrt_common::ThreadLocalState*, unsigned int, moonray::pbr::RayState**, void*) + 360
librendering_pbr.so`moonray::pbr::rayBundleHandler:
->  0xfb7da8553b48 <+360>: ldr    x8, [x8, #0x148]
    0xfb7da8553b4c <+364>: strb   wzr, [x8, #0x14]
    0xfb7da8553b50 <+368>: mrs    x9, CNTVCT_EL0
    0xfb7da8553b54 <+372>: ldp    x10, x11, [x8]
(lldb) thread backtrace
* thread #81, name = 'moonray', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0xfb858c06b2c0)
  * frame #0: 0x0000fb7da8553b48 librendering_pbr.so`moonray::pbr::rayBundleHandler(moonray::mcrt_common::ThreadLocalState*, unsigned int, moonray::pbr::RayState**, void*) + 360
    frame #1: 0x0000fb7db432ed20 librendering_rndr.so`moonray::pbr::XPURayQueue::flushInternal(moonray::mcrt_common::ThreadLocalState*, unsigned int, moonray::pbr::RayState**, scene_rdl2::alloc::Arena*) + 288
    frame #2: 0x0000fb7db42d784c librendering_rndr.so`moonray::rndr::RenderDriver::flushXPUQueues(moonray::mcrt_common::ThreadLocalState*, scene_rdl2::alloc::Arena*) + 92
    frame #3: 0x0000fb7db434aa4c librendering_rndr.so`___lldb_unnamed_symbol6174 + 1276
    frame #4: 0x0000fb7da6090cf8 librender_util.so`scene_rdl2::ThreadExecutor::threadMain() + 184
    frame #5: 0x0000fb7da6091b14 librender_util.so`___lldb_unnamed_symbol756 + 68
    frame #6: 0x0000fb7da0dd418c libc.so.6`___lldb_unnamed_symbol3475 + 892
    frame #7: 0x0000fb7da0e3641c libc.so.6`___lldb_unnamed_symbol3820 + 12

And with vector mode:


00:00:01  702.5 MB | ---------- MCRT Rendering --------------------------------
no-extra-snapshot
MOONRAY MCRT thread pool : MCRT-CPU-affinity disabled
MOONRAY MCRT thread pool : MCRT-CPU-affinity disabled
Process 500 stopped
* thread #89, name = 'moonray', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0xff23ac1822c0)
    frame #0: 0x0000ff1bc70a3b48 librendering_pbr.so`moonray::pbr::rayBundleHandler(moonray::mcrt_common::ThreadLocalState*, unsigned int, moonray::pbr::RayState**, void*) + 360
librendering_pbr.so`moonray::pbr::rayBundleHandler:
->  0xff1bc70a3b48 <+360>: ldr    x8, [x8, #0x148]
    0xff1bc70a3b4c <+364>: strb   wzr, [x8, #0x14]
    0xff1bc70a3b50 <+368>: mrs    x9, CNTVCT_EL0
    0xff1bc70a3b54 <+372>: ldp    x10, x11, [x8]
  thread #147, name = 'moonray', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0xff23ac7d4bc0)
    frame #0: 0x0000ff1bc70a3b48 librendering_pbr.so`moonray::pbr::rayBundleHandler(moonray::mcrt_common::ThreadLocalState*, unsigned int, moonray::pbr::RayState**, void*) + 360
librendering_pbr.so`moonray::pbr::rayBundleHandler:
->  0xff1bc70a3b48 <+360>: ldr    x8, [x8, #0x148]
    0xff1bc70a3b4c <+364>: strb   wzr, [x8, #0x14]
    0xff1bc70a3b50 <+368>: mrs    x9, CNTVCT_EL0
    0xff1bc70a3b54 <+372>: ldp    x10, x11, [x8]
(lldb) thread backtrace
* thread #89, name = 'moonray', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0xff23ac1822c0)
  * frame #0: 0x0000ff1bc70a3b48 librendering_pbr.so`moonray::pbr::rayBundleHandler(moonray::mcrt_common::ThreadLocalState*, unsigned int, moonray::pbr::RayState**, void*) + 360
    frame #1: 0x0000ff1bc7091a94 librendering_pbr.so`moonray::mcrt_common::LocalQueue<moonray::pbr::RayState*>::flushInternal(moonray::mcrt_common::ThreadLocalState*, unsigned int, moonray::pbr::RayState**, scene_rdl2::alloc::Arena*) + 228
    frame #2: 0x0000ff1bc708f33c librendering_pbr.so`moonray::mcrt_common::LocalQueue<moonray::pbr::RayState*>::addEntries(moonray::mcrt_common::ThreadLocalState*, unsigned int, moonray::pbr::RayState**, scene_rdl2::alloc::Arena*) + 172
    frame #3: 0x0000ff1bc70b7b00 librendering_pbr.so`moonray::pbr::PathIntegrator::queuePrimaryRay(moonray::pbr::TLState*, int, int, int, int, moonray::pbr::Sample const&, moonray::pbr::RayState*) const + 176
    frame #4: 0x0000ff1bd2e9a128 librendering_rndr.so`moonray::rndr::RenderDriver::renderPixelVectorSamples(moonray::pbr::TLState*, unsigned int, unsigned int, moonray::rndr::RenderSamplesParams*, moonray::rndr::TileGroup const&, moonray::pbr::DeepBuffer const*, moonray::pbr::CryptomatteBuffer*) + 1096
    frame #5: 0x0000ff1bd2e9b248 librendering_rndr.so`bool moonray::rndr::RenderDriver::renderTileUniformSamples<false>(moonray::rndr::RenderDriver*, moonray::mcrt_common::ThreadLocalState*, moonray::rndr::TileGroup const&, moonray::rndr::RenderSamplesParams&, moonray::pbr::DeepBuffer*, moonray::pbr::CryptomatteBuffer*, unsigned int, unsigned int, unsigned int&, moonray::rndr::ActivePixelMask const&) + 1064
    frame #6: 0x0000ff1bd2e982dc librendering_rndr.so`moonray::rndr::RenderDriver::renderTile(moonray::rndr::RenderDriver*, moonray::mcrt_common::ThreadLocalState*, moonray::rndr::TileGroup const&, moonray::rndr::RenderSamplesParams&, moonray::pbr::DeepBuffer*, moonray::pbr::CryptomatteBuffer*, unsigned int&) + 300
    frame #7: 0x0000ff1bd2e980cc librendering_rndr.so`moonray::rndr::RenderDriver::renderTiles(moonray::rndr::RenderDriver*, moonray::mcrt_common::ThreadLocalState*, moonray::rndr::TileGroup const&) + 492
    frame #8: 0x0000ff1bd2e9a7d4 librendering_rndr.so`___lldb_unnamed_symbol6174 + 644
    frame #9: 0x0000ff1bc4be0cf8 librender_util.so`scene_rdl2::ThreadExecutor::threadMain() + 184
    frame #10: 0x0000ff1bc4be1b14 librender_util.so`___lldb_unnamed_symbol756 + 68
    frame #11: 0x0000ff1bbf92418c libc.so.6`___lldb_unnamed_symbol3475 + 892
    frame #12: 0x0000ff1bbf98641c libc.so.6`___lldb_unnamed_symbol3820 + 12
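
Both the XPU and the vector runs stop at the same instruction, rayBundleHandler + 360, which is `ldr x8, [x8, #0x148]`. Assuming the reported fault address is the effective address of that load, the pointer that was in x8 can be recovered by subtracting the 0x148 displacement. A minimal sketch of that arithmetic follows; the function name baseFromFault is hypothetical and exists only for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: given the fault address reported by lldb and the
// immediate displacement of the faulting load/store, recover the value the
// base register held. Assumes the fault address is base + displacement.
inline std::uint64_t baseFromFault(std::uint64_t faultAddress, std::uint64_t displacement) {
    assert(faultAddress >= displacement);  // guard against unsigned wrap-around
    return faultAddress - displacement;
}
```

For thread #81 in the XPU run, baseFromFault(0xfb858c06b2c0, 0x148) gives 0xfb858c06b178, and for thread #89 in the vector run, baseFromFault(0xff23ac1822c0, 0x148) gives 0xff23ac182178. Under that assumption, those are the pointer values being dereferenced when the crash occurs.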
