Proxmox node hung and required forced reboot to recover

Afternoon Proxmoxers,

I had a production issue on one of the nodes in my 14-node cluster.

The node crashed completely and was totally unresponsive, requiring a hard reboot of the system from iDRAC.

The host in question is a 14G Dell PowerEdge R7425; its specs are:

[screenshot of host specs attached]

I suspect one of the workloads on the host is responsible: the VM in question has all 96 cores allocated to it, and we are already experiencing some CPU-related issues with this VM as a result of a software issue.
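
For anyone who wants to run the same check, this is roughly how the CPU-related settings of that guest can be dumped (VM ID 100 here is just a placeholder for the real one):

# show only the CPU-related lines of the VM's config
qm config 100 | grep -E '^(cores|sockets|cpu|numa|vcpus):'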


When I look in the syslog on the node, I see the below error occurring repeatedly:

Feb 25 11:03:59 proxmox-host-12 kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 75s! [migration/29:195]
Feb 25 11:03:59 proxmox-host-12 kernel: Modules linked in: dm_service_time md4 cmac nls_utf8 cifs libarc4 fscache netfs libdes vfio_pci vfio_virqfd vfio_iommu_type1 vfio xt_tcpudp ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables bonding tls softdog nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd nouveau mxm_wmi kvm_amd wmi snd_hda_codec_hdmi video drm_ttm_helper kvm ttm mgag200 snd_hda_intel snd_intel_dspcfg drm_kms_helper snd_intel_sdw_acpi irqbypass snd_hda_codec crct10dif_pclmul ghash_clmulni_intel snd_hda_core aesni_intel snd_hwdep acpi_ipmi cec crypto_simd snd_pcm ucsi_ccg cdc_ether cryptd rc_core dcdbas fb_sys_fops typec_ucsi usbnet syscopyarea snd_timer rapl efi_pstore pcspkr ipmi_si mii sysfillrect snd typec input_leds soundcore joydev sysimgblt k10temp ipmi_devintf ccp ipmi_msghandler acpi_power_meter mac_hid vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm
Feb 25 11:03:59 proxmox-host-12 kernel: ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua hid_generic usbmouse usbkbd usbhid hid xhci_pci igb xhci_pci_renesas crc32_pclmul ahci megaraid_sas i2c_nvidia_gpu i2c_algo_bit i2c_piix4 libahci xhci_hcd dca i40e
Feb 25 11:03:59 proxmox-host-12 kernel: CPU: 29 PID: 195 Comm: migration/29 Tainted: P O L 5.13.19-3-pve #1
Feb 25 11:03:59 proxmox-host-12 kernel: Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.17.0 07/30/2021
Feb 25 11:03:59 proxmox-host-12 kernel: Stopper: multi_cpu_stop+0x0/0x120 <- migrate_swap+0xab/0x100
Feb 25 11:03:59 proxmox-host-12 kernel: RIP: 0010:rcu_momentary_dyntick_idle+0x24/0x30
Feb 25 11:03:59 proxmox-host-12 kernel: Code: c3 0f 1f 44 00 00 48 c7 c0 80 5c 03 00 65 c6 05 b5 82 10 74 00 65 48 03 05 79 ba 0e 74 ba 04 00 00 00 f0 0f c1 90 20 01 00 00 <83> e2 02 74 01 c3 0f 0b c3 0f 1f 00 0f 1f 44 00 00 55 31 c0 65 48
Feb 25 11:03:59 proxmox-host-12 kernel: RSP: 0018:ffff9cc44d347e58 EFLAGS: 00000292
Feb 25 11:03:59 proxmox-host-12 kernel: RAX: ffff8f8a1f7b5c80 RBX: ffff9cc463507b38 RCX: ffff8f8a1f7a7790
Feb 25 11:03:59 proxmox-host-12 kernel: RDX: 000000009a863b4e RSI: 0000000000000286 RDI: ffffffff8d0230a0
Feb 25 11:03:59 proxmox-host-12 kernel: RBP: ffff9cc44d347e98 R08: 0000000000000000 R09: 0000000000000000
Feb 25 11:03:59 proxmox-host-12 kernel: R10: ffffffff8da746c0 R11: 0000000000000001 R12: ffff9cc463507b5c
Feb 25 11:03:59 proxmox-host-12 kernel: R13: 0000000000000001 R14: ffffffff8d0230a0 R15: 0000000000000001
Feb 25 11:03:59 proxmox-host-12 kernel: FS: 0000000000000000(0000) GS:ffff8f8a1f780000(0000) knlGS:0000000000000000
Feb 25 11:03:59 proxmox-host-12 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 25 11:03:59 proxmox-host-12 kernel: CR2: 000055c98497f198 CR3: 000000089f94e000 CR4: 00000000003506e0
Feb 25 11:03:59 proxmox-host-12 kernel: Call Trace:
Feb 25 11:03:59 proxmox-host-12 kernel: <TASK>
Feb 25 11:03:59 proxmox-host-12 kernel: ? multi_cpu_stop+0xbf/0x120
Feb 25 11:03:59 proxmox-host-12 kernel: ? stop_machine_yield+0x10/0x10
Feb 25 11:03:59 proxmox-host-12 kernel: cpu_stopper_thread+0xc8/0x130
Feb 25 11:03:59 proxmox-host-12 kernel: smpboot_thread_fn+0xd0/0x170
Feb 25 11:03:59 proxmox-host-12 kernel: ? smpboot_register_percpu_thread+0xe0/0xe0
Feb 25 11:03:59 proxmox-host-12 kernel: kthread+0x12b/0x150
Feb 25 11:03:59 proxmox-host-12 kernel: ? set_kthread_struct+0x50/0x50
Feb 25 11:03:59 proxmox-host-12 kernel: ret_from_fork+0x22/0x30
Feb 25 11:03:59 proxmox-host-12 kernel: </TASK>
Feb 25 11:03:59 proxmox-host-12 kernel: watchdog: BUG: soft lockup - CPU#52 stuck for 78s! [migration/52:333]
Feb 25 11:03:59 proxmox-host-12 kernel: Modules linked in: dm_service_time md4 cmac nls_utf8 cifs libarc4 fscache netfs libdes vfio_pci vfio_virqfd vfio_iommu_type1 vfio xt_tcpudp ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables bonding tls softdog nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd nouveau mxm_wmi kvm_amd wmi snd_hda_codec_hdmi video drm_ttm_helper kvm ttm mgag200 snd_hda_intel snd_intel_dspcfg drm_kms_helper snd_intel_sdw_acpi irqbypass snd_hda_codec crct10dif_pclmul ghash_clmulni_intel snd_hda_core aesni_intel snd_hwdep acpi_ipmi cec crypto_simd snd_pcm ucsi_ccg cdc_ether cryptd rc_core dcdbas fb_sys_fops typec_ucsi usbnet syscopyarea snd_timer rapl efi_pstore pcspkr ipmi_si mii sysfillrect snd typec input_leds soundcore joydev sysimgblt k10temp ipmi_devintf ccp ipmi_msghandler acpi_power_meter mac_hid vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm
Feb 25 11:03:59 proxmox-host-12 kernel: ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua hid_generic usbmouse usbkbd usbhid hid xhci_pci igb xhci_pci_renesas crc32_pclmul ahci megaraid_sas i2c_nvidia_gpu i2c_algo_bit i2c_piix4 libahci xhci_hcd dca i40e
Feb 25 11:03:59 proxmox-host-12 kernel: CPU: 52 PID: 333 Comm: migration/52 Tainted: P O L 5.13.19-3-pve #1
Feb 25 11:03:59 proxmox-host-12 kernel: Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.17.0 07/30/2021
Feb 25 11:03:59 proxmox-host-12 kernel: Stopper: multi_cpu_stop+0x0/0x120 <- migrate_swap+0xab/0x100
Feb 25 11:03:59 proxmox-host-12 kernel: RIP: 0010:stop_machine_yield+0x2/0x10
Feb 25 11:03:59 proxmox-host-12 kernel: Code: 00 8b 45 94 48 8b 4d f0 65 48 2b 0c 25 28 00 00 00 75 0d 4c 8b 65 f8 c9 c3 b8 fe ff ff ff eb e4 e8 b3 f2 a8 00 0f 1f 00 f3 90 <c3> 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 55 48 89
Feb 25 11:03:59 proxmox-host-12 kernel: RSP: 0018:ffff9cc44d7f3e58 EFLAGS: 00000202
Feb 25 11:03:59 proxmox-host-12 kernel: RAX: ffff8f6a1f935c80 RBX: ffff9cc488efb6c0 RCX: ffff8f6a1f927790
Feb 25 11:03:59 proxmox-host-12 kernel: RDX: 0000000000000002 RSI: 0000000000000286 RDI: ffffffff8d0184a0
Feb 25 11:03:59 proxmox-host-12 kernel: RBP: ffff9cc44d7f3e98 R08: 000000000000000c R09: 0000000000000004
Feb 25 11:03:59 proxmox-host-12 kernel: R10: ffffffff8da746c0 R11: 00000000000001bd R12: ffff9cc488efb6e4
Feb 25 11:03:59 proxmox-host-12 kernel: R13: 0000000000000001 R14: ffffffff8d0184a0 R15: 0000000000000001
Feb 25 11:03:59 proxmox-host-12 kernel: FS: 0000000000000000(0000) GS:ffff8f6a1f900000(0000) knlGS:0000000000000000
Feb 25 11:03:59 proxmox-host-12 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 25 11:03:59 proxmox-host-12 kernel: CR2: 00007f0828d9c010 CR3: 000000237c4e8000 CR4: 00000000003506e0
Feb 25 11:03:59 proxmox-host-12 kernel: Call Trace:
Feb 25 11:03:59 proxmox-host-12 kernel: <TASK>
Feb 25 11:03:59 proxmox-host-12 kernel: ? multi_cpu_stop+0xa1/0x120
Feb 25 11:03:59 proxmox-host-12 kernel: ? stop_machine_yield+0x10/0x10
Feb 25 11:03:59 proxmox-host-12 kernel: cpu_stopper_thread+0xc8/0x130
Feb 25 11:03:59 proxmox-host-12 kernel: smpboot_thread_fn+0xd0/0x170
Feb 25 11:03:59 proxmox-host-12 kernel: ? smpboot_register_percpu_thread+0xe0/0xe0
Feb 25 11:03:59 proxmox-host-12 kernel: kthread+0x12b/0x150
Feb 25 11:03:59 proxmox-host-12 kernel: ? set_kthread_struct+0x50/0x50
Feb 25 11:03:59 proxmox-host-12 kernel: ret_from_fork+0x22/0x30
Feb 25 11:03:59 proxmox-host-12 kernel: </TASK>

It then dies completely and stays down until I force a reboot:

[screenshot attached]
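
In case it helps anyone chasing a similar hang: after the forced reboot these entries can also be pulled from the previous boot's kernel log, along these lines (assuming persistent journald storage is enabled):

# kernel messages from the previous boot, filtered to the lockup warnings
journalctl -k -b -1 | grep -i "soft lockup"

# or straight from syslog
grep -i "soft lockup" /var/log/syslog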

Node details:

root@proxmox-host-12:/var/log/pve/tasks# pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-1
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1


Did we just over-provision the VM, or is there something else amiss here?

Thanks in advance
 

I suspect one of the workloads on the host is responsible: the VM in question has all 96 cores allocated to it, and we are already experiencing some CPU-related issues with this VM as a result of a software issue.
Never ever do that!
Always keep some cores dedicated to the host. Depending on the load, 2 cores can be sufficient, but if you do a lot of networking or storage I/O it could also be 8 or more.
You are currently creating contention, which can provoke all sorts of side effects!
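
As a rough sketch of what I mean (VM ID 100 is just a placeholder; how many threads you leave free depends on your load):

# don't hand all 96 threads to the guest; leave some for the hypervisor, storage and networking
qm set 100 --sockets 2 --cores 44    # 88 vCPUs total, 8 threads left for the host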

Did we just over-provision the VM
IMHO: yes

/edit:
The 7451 only has 24 "real" cores.
https://www.amd.com/de/products/cpu/amd-epyc-7451

That means you only have 48 physical cores anyway, so allocating 96 vCPUs to the VM is questionable. I'd not go beyond 44 to ensure all vCPUs really can be scheduled. Then try again.
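
To double-check the topology on the host and then cap the guest accordingly (VM ID 100 again just as a placeholder):

# sockets / cores per socket / threads per core as the kernel sees them
lscpu | grep -E '^(Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core|CPU\(s\)):'

# cap the guest at 44 vCPUs so every vCPU can actually be scheduled
qm set 100 --sockets 2 --cores 22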
 
