256 x AMD EPYC 7742 64-Core Processor (2 Sockets)
Linux 5.15.64-1-pve #1 SMP PVE 5.15.64-1 (Thu, 13 Oct 2022 10:30:34 +0200)
pve-manager/7.2-11/b76d3178
I am evaluating Proxmox VE on this system.
Installation was uneventful.
I created two storage directories, one on a RAID1 array and one on a RAID10 array.
My first test was to create a VM with 240 cores (numa=1) and two virtual disks: one for the base system (100 GB) and one as a swap drive (64 GB).
The VM has 800 GB of RAM + Q35 + UEFI + VirtIO SCSI single + SSD emulation + IO thread + Discard + VirtIO RNG + VirtIO networking (bridged).
When I booted the Debian 10 netinstall image, the system hung at drive detection. When I removed the second drive, I could install Debian without any issues.
After the installation I added the swap drive back, but then the system would no longer boot and hung at drive detection again.
After I lowered the core count to around 192, it started to work again.
Today I added a third drive (7 TB) via hotplug. Then I got what looked like a kernel panic in the guest (the log shows RCU stalls and soft lockups):
Code:
[26554.137516] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[26554.139530] rcu: 161-...0: (1 GPs behind) idle=e5e/1/0x4000000000000000 softirq=422/423 fqs=2626
[26554.142360] rcu: (detected by 95, t=5253 jiffies, g=487989, q=3442)
[26554.144400] Sending NMI from CPU 95 to CPUs 161:
[26557.625399] watchdog: BUG: soft lockup - CPU#57 stuck for 22s! [sshd:22223]
[26557.627616] Modules linked in: fuse btrfs zstd_compress zstd_decompress xxhash ufs qnx4 hfsplus hfs minix msdos jfs xfs dm_mod rfkill snd_hda_intel snd_hda_codec nls_ascii nls_cp437 snd_hda_core kvm_amd vfat snd_hwdep ccp bochs_drm snd_pcm fat kvm ttm snd_timer irqbypass crct10dif_pclmul drm_kms_helper crc32_pclmul snd efi_pstore joydev virtio_rng iTCO_wdt sg ghash_clmulni_intel rng_core virtio_console serio_raw virtio_balloon pcspkr drm efivars iTCO_vendor_support evdev soundcore qemu_fw_cfg button efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod usbhid hid sr_mod virtio_net cdrom sd_mod net_failover crc32c_intel virtio_scsi
[26557.627664] failover ahci aesni_intel ehci_pci libahci uhci_hcd aes_x86_64 ehci_hcd crypto_simd virtio_pci libata cryptd virtio_ring usbcore lpc_ich glue_helper scsi_mod psmouse i2c_i801 virtio mfd_core usb_common
[26557.627678] CPU: 57 PID: 22223 Comm: sshd Not tainted 4.19.0-22-amd64 #1 Debian 4.19.260-1
[26557.627678] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[26557.627687] RIP: 0010:smp_call_function_many+0x1f8/0x250
[26557.627689] Code: c7 e8 fc 1e 5e 00 3b 05 ba 07 02 01 0f 83 8c fe ff ff 48 63 d0 48 8b 0b 48 03 0c d5 20 07 4f 9d 8b 51 18 83 e2 01 74 0a f3 90 <8b> 51 18 83 e2 01 75 f6 eb c8 48 c7 c2 20 c4 72 9d 4c 89 fe 89 df
[26557.627690] RSP: 0018:ffffbfca1956bc18 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[26557.627691] RAX: 000000000000003e RBX: ffff9c296f4681c0 RCX: ffff9c296f5acb60
[26557.627692] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9c296f4681c8
[26557.627692] RBP: ffff9c296f4681c8 R08: 0000000000000200 R09: ffffffffffffffff
[26557.627693] R10: 00000000007fffff R11: 0000000000ffffff R12: ffffffff9c667ff0
[26557.627693] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000200
[26557.627697] FS: 00007fa545957e40(0000) GS:ffff9c296f440000(0000) knlGS:0000000000000000
[26557.627697] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[26557.627698] CR2: 00007ffd979aeef2 CR3: 0000006287a02000 CR4: 0000000000340ee0
[26557.627700] Call Trace:
[26557.628216] ? load_new_mm_cr3+0xc0/0xc0
[26557.628218] on_each_cpu+0x28/0x60
[26557.628219] flush_tlb_kernel_range+0x48/0x90
[26557.628222] __purge_vmap_area_lazy+0x4d/0xc0
[26557.628223] vm_unmap_aliases+0xe9/0x120
[26557.628225] change_page_attr_set_clr+0xc7/0x420
[26557.628227] set_memory_ro+0x26/0x30
[26557.628229] bpf_prog_select_runtime+0x28/0x110
[26557.628232] bpf_prepare_filter+0x523/0x590
[26557.628233] bpf_prog_create_from_user+0xbb/0x110
[26557.628235] ? hardlockup_detector_perf_cleanup+0x80/0x80
[26557.628236] do_seccomp+0x25d/0x6c0
[26557.628238] __x64_sys_prctl+0x4e6/0x590
[26557.628241] do_syscall_64+0x53/0x110
[26557.628244] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[26557.628245] RIP: 0033:0x7fa545d09c4a
[26557.628247] Code: 48 8b 0d 49 02 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 9d 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 16 02 0c 00 f7 d8 64 89 01 48
[26557.628247] RSP: 002b:00007ffd979ade08 EFLAGS: 00000246 ORIG_RAX: 000000000000009d
[26557.628248] RAX: ffffffffffffffda RBX: 00007ffd979ade10 RCX: 00007fa545d09c4a
[26557.628248] RDX: 000056055a65d040 RSI: 0000000000000002 RDI: 0000000000000016
[26557.628249] RBP: 000056055b1347b0 R08: 0000000000000000 R09: 00007fa545d89e80
[26557.628249] R10: 00007fa545d09c4a R11: 0000000000000246 R12: 00007ffd979adeb0
[26557.628249] R13: 000056055b133b30 R14: 0000000000000000 R15: 0000000000000013
[26564.065302] rcu: rcu_sched kthread starved for 2480 jiffies! g487989 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=17
[26564.068578] rcu: RCU grace-period kthread stack dump:
[26564.070188] rcu_sched I 0 12 2 0x80000000
[26564.070190] Call Trace:
[26564.070197] __schedule+0x29f/0x840
[26564.070200] ? __switch_to_asm+0x35/0x70
[26564.070202] schedule+0x28/0x80
[26564.070203] schedule_timeout+0x16b/0x3b0
[26564.070206] ? __next_timer_interrupt+0xc0/0xc0
[26564.070208] rcu_gp_kthread+0x40d/0x850
[26564.070210] ? call_rcu_sched+0x20/0x20
[26564.070212] kthread+0x112/0x130
[26564.070214] ? kthread_bind+0x30/0x30
[26564.070215] ret_from_fork+0x35/0x40
[26585.624632] watchdog: BUG: soft lockup - CPU#57 stuck for 22s! [sshd:22223]
[26585.626860] Modules linked in: fuse btrfs zstd_compress zstd_decompress xxhash ufs qnx4 hfsplus hfs minix msdos jfs xfs dm_mod rfkill snd_hda_intel snd_hda_codec nls_ascii nls_cp437 snd_hda_core kvm_amd vfat snd_hwdep ccp bochs_drm snd_pcm fat kvm ttm snd_timer irqbypass crct10dif_pclmul drm_kms_helper crc32_pclmul snd efi_pstore joydev virtio_rng iTCO_wdt sg ghash_clmulni_intel rng_core virtio_console serio_raw virtio_balloon pcspkr drm efivars iTCO_vendor_support evdev soundcore qemu_fw_cfg button efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod usbhid hid sr_mod virtio_net cdrom sd_mod net_failover crc32c_intel virtio_scsi
[26585.626890] failover ahci aesni_intel ehci_pci libahci uhci_hcd aes_x86_64 ehci_hcd crypto_simd virtio_pci libata cryptd virtio_ring usbcore lpc_ich glue_helper scsi_mod psmouse i2c_i801 virtio mfd_core usb_common
[26585.626898] CPU: 57 PID: 22223 Comm: sshd Tainted: G L 4.19.0-22-amd64 #1 Debian 4.19.260-1
[26585.626898] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[26585.626906] RIP: 0010:smp_call_function_many+0x1f8/0x250
[26585.626908] Code: c7 e8 fc 1e 5e 00 3b 05 ba 07 02 01 0f 83 8c fe ff ff 48 63 d0 48 8b 0b 48 03 0c d5 20 07 4f 9d 8b 51 18 83 e2 01 74 0a f3 90 <8b> 51 18 83 e2 01 75 f6 eb c8 48 c7 c2 20 c4 72 9d 4c 89 fe 89 df
[26585.626909] RSP: 0018:ffffbfca1956bc18 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[26585.626910] RAX: 000000000000003e RBX: ffff9c296f4681c0 RCX: ffff9c296f5acb60
[26585.626910] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9c296f4681c8
[26585.626911] RBP: ffff9c296f4681c8 R08: 0000000000000200 R09: ffffffffffffffff
[26585.626911] R10: 00000000007fffff R11: 0000000000ffffff R12: ffffffff9c667ff0
[26585.626912] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000200
[26585.626914] FS: 00007fa545957e40(0000) GS:ffff9c296f440000(0000) knlGS:0000000000000000
[26585.626915] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[26585.626915] CR2: 00007ffd979aeef2 CR3: 0000006287a02000 CR4: 0000000000340ee0
[26585.626917] Call Trace:
[26585.626923] ? load_new_mm_cr3+0xc0/0xc0
[26585.626924] on_each_cpu+0x28/0x60
[26585.626926] flush_tlb_kernel_range+0x48/0x90
[26585.626928] __purge_vmap_area_lazy+0x4d/0xc0
[26585.626930] vm_unmap_aliases+0xe9/0x120
[26585.626931] change_page_attr_set_clr+0xc7/0x420
[26585.626933] set_memory_ro+0x26/0x30
[26585.626937] bpf_prog_select_runtime+0x28/0x110
[26585.626939] bpf_prepare_filter+0x523/0x590
[26585.626940] bpf_prog_create_from_user+0xbb/0x110
[26585.626943] ? hardlockup_detector_perf_cleanup+0x80/0x80
[26585.626944] do_seccomp+0x25d/0x6c0
[26585.626946] __x64_sys_prctl+0x4e6/0x590
[26585.626949] do_syscall_64+0x53/0x110
[26585.626952] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[26585.626953] RIP: 0033:0x7fa545d09c4a
[26585.626954] Code: 48 8b 0d 49 02 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 9d 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 16 02 0c 00 f7 d8 64 89 01 48
[26585.626954] RSP: 002b:00007ffd979ade08 EFLAGS: 00000246 ORIG_RAX: 000000000000009d
[26585.626955] RAX: ffffffffffffffda RBX: 00007ffd979ade10 RCX: 00007fa545d09c4a
[26585.626956] RDX: 000056055a65d040 RSI: 0000000000000002 RDI: 0000000000000016
[26585.626956] RBP: 000056055b1347b0 R08: 0000000000000000 R09: 00007fa545d89e80
[26585.626957] R10: 00007fa545d09c4a R11: 0000000000000246 R12: 00007ffd979adeb0
[26585.626957] R13: 000056055b133b30 R14: 0000000000000000 R15: 0000000000000013
root@www:~#
Message from syslogd@www at Oct 30 22:02:37 ...
kernel:[26613.623866] watchdog: BUG: soft lockup - CPU#57 stuck for 22s! [sshd:22223]
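For anyone who wants to condense a trace like the one above, here is a rough Python sketch that counts the soft lockup reports per CPU and task and lists the RCU stall detections. The regular expressions are based only on the message formats visible in this log, and the function name `summarise` is just something I made up for illustration, so treat it as a starting point rather than a general dmesg parser.

```python
import re
from collections import Counter

# Patterns derived from the lines in the log above (an assumption: other
# kernel versions may word these messages differently).
SOFT_LOCKUP = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^\]:]+):(\d+)\]")
RCU_DETECT = re.compile(r"rcu: \(detected by (\d+), t=(\d+) jiffies")


def summarise(log_text):
    """Count soft lockup reports per (cpu, task, pid) and collect RCU stall detections."""
    lockups = Counter()
    stalls = []
    for line in log_text.splitlines():
        match = SOFT_LOCKUP.search(line)
        if match:
            cpu, _seconds, comm, pid = match.groups()
            lockups[(int(cpu), comm, int(pid))] += 1
            continue
        match = RCU_DETECT.search(line)
        if match:
            stalls.append((int(match.group(1)), int(match.group(2))))
    return lockups, stalls


if __name__ == "__main__":
    # Three lines copied from the log above as sample input.
    sample = (
        "[26554.142360] rcu: (detected by 95, t=5253 jiffies, g=487989, q=3442)\n"
        "[26557.625399] watchdog: BUG: soft lockup - CPU#57 stuck for 22s! [sshd:22223]\n"
        "[26585.624632] watchdog: BUG: soft lockup - CPU#57 stuck for 22s! [sshd:22223]\n"
    )
    lockups, stalls = summarise(sample)
    for (cpu, comm, pid), count in lockups.items():
        print(f"CPU {cpu}: {count} soft lockup report(s) in {comm} (pid {pid})")
    for cpu, jiffies in stalls:
        print(f"RCU stall detected by CPU {cpu} after {jiffies} jiffies")
```

In this log every soft lockup report names the same CPU and task (CPU#57, sshd), so the summary collapses to a single entry.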
It took ages to shut the system down, and afterwards it hung at drive detection again.
So I lowered the core count again, to 136 (I did not check whether that is the highest count that still works), and then the system was able to boot again.
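Since I have not yet narrowed down the highest working core count, a bisection between a known-good and a known-bad value would be one way to do it. The sketch below is only an illustration: `boots_ok` is a hypothetical callback standing in for "set the core count, boot the guest, and see whether it gets past drive detection", and the search assumes the failure is monotonic (if a count fails, every larger count fails too), which I have not verified.

```python
from typing import Callable


def max_working_cores(low: int, high: int, boots_ok: Callable[[int], bool]) -> int:
    """Return the largest core count in [low, high] for which boots_ok is True.

    Assumes `low` is known to work and that failures are monotonic.
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up so the range always shrinks
        if boots_ok(mid):
            low = mid        # mid works, so the answer is mid or higher
        else:
            high = mid - 1   # mid fails, so the answer is below mid
    return low


if __name__ == "__main__":
    # Stand-in for a real boot test; the threshold of 150 is invented
    # purely to exercise the search and is not a measured value.
    def simulated(cores: int) -> bool:
        return cores <= 150

    print(max_working_cores(136, 240, simulated))
```

With 136 as the known-good and 240 as the known-bad value, this needs at most seven boot attempts.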
I found this kernel.org bug report: https://bugzilla.kernel.org/show_bug.cgi?id=199727
But switching to "threads" did not fix my issue.
Maybe someone has an idea how to fix this strange issue ...