peer xclbin download err: -11 #633

Open
ioeddk opened this issue Jan 5, 2024 · 2 comments

ioeddk commented Jan 5, 2024

On an F1 instance, when I try to run my kernel, the driver logs peer xclbin download err: -11 and the instance crashes (it stays unreachable even after several reboots). The problem is reproducible with the same kernel: the instance becomes unreachable every time. The error occurs at the cl::Program call in the OpenCL host program, which is identical to the host program in the OpenCL Hello World example. Every log of this problem contains the download err: -11 line. An example log is:

CentOS Linux 7 (Core)
Kernel 3.10.0-1160.105.1.el7.x86_64 on an x86_64

ip-172-31-82-35 login: [  649.890603] FS-Cache: Loaded
[  649.927360] FS-Cache: Netfs 'nfs' registered for caching
[  649.937201] Key type dns_resolver registered
[  649.970587] NFS: Registering the id_resolver key type
[  649.975090] Key type id_resolver registered
[  649.979006] Key type id_legacy registered
[  764.800748] [drm] Finding MEM_TOPOLOGY section header
[  764.806939] [drm] Section MEM_TOPOLOGY details:
[drm]   offset = 0x2f8
[  764.814490] [drm]   size = 0x120
[  768.848634] xocl 0000:00:1d.0: icap.u.23068672 ffff949cf9d79c10 __icap_peer_xclbin_download: peer xclbin download err: -11
[  768.860315] xocl 0000:00:1d.0:  ffff947fb61f7098 xocl_read_axlf_helper: Failed to download xclbin, err: -11
[  772.861691] cirrus 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment
[  772.868331] pci 0000:00:1d.0: BAR 4: assigned [mem 0x2000000000-0x3fffffffff 64bit pref]
[  772.877849] pci 0000:00:1d.0: BAR 0: assigned [mem 0x82000000-0x83ffffff]
[  772.885670] pci 0000:00:1d.0: BAR 1: assigned [mem 0x85400000-0x855fffff]
[  772.894049] pci 0000:00:1d.0: BAR 2: assigned [mem 0x85600000-0x8560ffff 64bit pref]
[  772.922505] [drm] Initialized xocl 2.12.0 20220111 for 0000:00:1d.0 on minor 1
[  774.409295] [drm] Finding MEM_TOPOLOGY section header
[  774.413989] [drm] Section MEM_TOPOLOGY details:
[drm]   offset = 0x2f8
[  774.419835] [drm]   size = 0x120
[  775.563291] xocl 0000:00:1d.0: icap.u.23068672 ffff949cfe24a810 icap_cache_bitstream_axlf_section: get section err: -22
[  775.572864] xocl 0000:00:1d.0: icap.u.23068672 ffff949cfe24a810 icap_cache_bitstream_axlf_section: get section err: -22
[  775.582886] xocl 0000:00:1d.0: icap.u.23068672 ffff949cfe24a810 icap_cache_bitstream_axlf_section: get section err: -22
[  775.592281] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  775.601575] IP: [<ffffffffc05caaab>] icap_create_subdev_cu+0x5b/0x550 [xocl]
[  775.609760] PGD 8000001e8363b067 PUD 1e7f4c9067 PMD 0 
[  775.616022] Oops: 0000 [#1] SMP 
[  775.620217] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ext4 sb_edac mbcache jbd2 ppdev iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr xocl(OE) cirrus ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm parport_pc parport i2c_piix4 drm_panel_orientation_quirks ip_tables xfs libcrc32c ata_generic pata_acpi ata_piix nvme crct10dif_pclmul crct10dif_common libata xen_blkfront nvme_core ena crc32c_intel serio_raw floppy sunrpc
[  775.683337] CPU: 2 PID: 3651 Comm: host Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.105.1.el7.x86_64 #1
[  775.695980] Hardware name: Xen HVM domU, BIOS 4.11.amazon 08/24/2006
[  775.703213] task: ffff949d03d04200 ti: ffff949c9ca48000 task.ti: ffff949c9ca48000
[  775.711691] RIP: 0010:[<ffffffffc05caaab>]  [<ffffffffc05caaab>] icap_create_subdev_cu+0x5b/0x550 [xocl]
[  775.721532] RSP: 0018:ffff949c9ca4b738  EFLAGS: 00010286
[  775.726518] RAX: ffff949d03f98050 RBX: ffff949cfe24a800 RCX: 0000000000000000
[  775.733616] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff947eb3d10098
[  775.740403] RBP: ffff949c9ca4b918 R08: ffffb078cd62a112 R09: 00000000ffffffed
[  775.747516] R10: 0000000000005147 R11: ffffffff93412d88 R12: ffff949d03f98050
[  775.754273] R13: 0000000000000000 R14: 0000000000000000 R15: ffff949cfe24b450
[  775.761428] FS:  00007f8c7794e780(0000) GS:ffff949d05280000(0000) knlGS:0000000000000000
[  775.769414] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  775.775480] CR2: 0000000000000000 CR3: 0000001e837c4000 CR4: 00000000001606e0
[  775.783032] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  775.790194] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  775.798160] Call Trace:
[  775.801134]  [<ffffffff939b7920>] ? __schedule+0x320/0x680
[  775.807475]  [<ffffffffc0594f17>] ? xocl_get_xdev+0x17/0x30 [xocl]
[  775.813853]  [<ffffffffc05ccdc7>] ? icap_create_subdev_ip_layout+0x47/0x870 [xocl]
[  775.821653]  [<ffffffff934107ae>] ? map_vm_area+0x2e/0x50
[  775.827807]  [<ffffffff93412e30>] ? __vmalloc_node_range+0x170/0x280
[  775.837485]  [<ffffffffc05c9790>] ? icap_cache_clock_freq_topology+0x50/0x1b0 [xocl]
[  775.846643]  [<ffffffff93413262>] ? vzalloc+0x52/0x60
[  775.852757]  [<ffffffffc05d1700>] __icap_download_bitstream_axlf+0xbd0/0x1310 [xocl]
[  775.861693]  [<ffffffffc05d274f>] icap_download_bitstream_axlf+0x90f/0x970 [xocl]
[  775.870470]  [<ffffffffc066dd19>] xocl_read_axlf_helper+0x12c9/0x1a00 [xocl]
[  775.878706]  [<ffffffff933d0e7a>] ? __rmqueue+0x8a/0x460
[  775.885242]  [<ffffffffc066ea83>] xocl_read_axlf_ioctl+0x33/0x50 [xocl]
[  775.893106]  [<ffffffffc066ea50>] ? get_live_clients+0x80/0x80 [xocl]
[  775.900703]  [<ffffffffc04a3bcc>] drm_ioctl_kernel+0xbc/0x100 [drm]
[  775.908665]  [<ffffffffc04a3e5c>] drm_ioctl+0x24c/0x450 [drm]
[  775.915728]  [<ffffffffc066ea50>] ? get_live_clients+0x80/0x80 [xocl]
[  775.923596]  [<ffffffff9340926c>] ? do_mmap+0x39c/0x590
[  775.930135]  [<ffffffffc06698ae>] xocl_drm_ioctl+0xe/0x20 [xocl]
[  775.937320]  [<ffffffff93471988>] do_vfs_ioctl+0x3a8/0x5c0
[  775.943828]  [<ffffffff93471c21>] SyS_ioctl+0x81/0xa0
[  775.949885]  [<ffffffff93346966>] ? __audit_syscall_exit+0x1f6/0x2b0
[  775.957369]  [<ffffffff939c539a>] system_call_fastpath+0x25/0x2a
[  775.964512] Code: 48 89 45 d0 31 c0 e8 75 e2 10 d3 48 89 df 49 89 c7 48 89 85 28 fe ff ff e8 63 a4 fc ff 4d 8b b7 88 00 00 00 48 89 85 60 fe ff ff <41> 8b 0e 85 c9 0f 8e bc 04 00 00 48 8d 85 78 fe ff ff 45 31 d2 
[  775.997004] RIP  [<ffffffffc05caaab>] icap_create_subdev_cu+0x5b/0x550 [xocl]
[  776.005498]  RSP <ffff949c9ca4b738>
[  776.009934] CR2: 0000000000000000

The AFI creation was successful, and I made sure to run the kernel only after seeing "code: available" when querying my AFI by its AFI ID. I can't find much information about this error code.
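
For reference, the negative numbers in the log follow the usual Linux kernel convention of returning a negated errno value, so -11 corresponds to EAGAIN and -22 to EINVAL. The sketch below pulls the `err: -N` codes out of a few lines copied from the log above and maps them to names. The helper `decode_errors` and the small `LINUX_ERRNO` table are illustrative names, not part of XRT or the xocl driver, and the table covers only the two codes seen here. It only translates the numbers; it does not explain why the driver returned them.

```python
import re

# Linux errno values for the two codes that appear in the log above.
# Assumption: the driver returns standard negated errno values.
LINUX_ERRNO = {
    11: ("EAGAIN", "Try again (resource temporarily unavailable)"),
    22: ("EINVAL", "Invalid argument"),
}

# Matches "<function>: <message> err: -<N>" as printed by the xocl driver.
ERR_PATTERN = re.compile(r"(\w+): [^:]*err: -(\d+)")

# Three lines copied from the console log in this issue.
SAMPLE_LOG = """\
[  768.848634] xocl 0000:00:1d.0: icap.u.23068672 ffff949cf9d79c10 __icap_peer_xclbin_download: peer xclbin download err: -11
[  768.860315] xocl 0000:00:1d.0:  ffff947fb61f7098 xocl_read_axlf_helper: Failed to download xclbin, err: -11
[  775.563291] xocl 0000:00:1d.0: icap.u.23068672 ffff949cfe24a810 icap_cache_bitstream_axlf_section: get section err: -22
"""


def decode_errors(log_text):
    """Return (function, code, errno_name, meaning) for each error line."""
    results = []
    for line in log_text.splitlines():
        match = ERR_PATTERN.search(line)
        if match is None:
            continue
        func = match.group(1)
        code = int(match.group(2))
        name, meaning = LINUX_ERRNO.get(code, ("UNKNOWN", "not in this table"))
        results.append((func, code, name, meaning))
    return results


if __name__ == "__main__":
    for func, code, name, meaning in decode_errors(SAMPLE_LOG):
        print(f"{func}: -{code} = {name} ({meaning})")
```

Read this way, the first failure is the peer download reporting EAGAIN, followed on the next attempt by EINVAL from icap_cache_bitstream_axlf_section just before the NULL pointer dereference in icap_create_subdev_cu.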

@AWSjoeluc

Hello!

Thank you for reaching out with this issue. Can you provide any more information on how you reproduce this issue?

  • It sounds like you're running the F1 instance with your own private kernel?
  • Does this error occur on kernel startup or upon running some workflow?

@AWSjoeluc

Hi,

Is there anything AWS can do to help resolve this issue? If it has already been resolved, we would be curious to know the resolution.

Thank you!
