Proxmox host hangs with CPU soft lockups; had to reboot after waiting an hour for it to recover

Apr 12, 2022
Hi,

I am running Proxmox with two 1080 GPUs, using NFS storage for VM disks and ZFS for the OS.

Details of my setup:


pve-manager/7.4-13/46c37d9c (running kernel: 5.15.74-1-pve)
24 x AMD Ryzen 9 5900X 12-Core Processor (1 Socket)
48 GB DDR4 (NON-ECC) + 1TB NVME SAMSUNG SSD
2 x 1080 GPU
ZFS for the VM OS disk, secondary disk from NFS

The host randomly hit this issue: the CPUs got stuck, I could not reach it via SSH or the GUI, and I had to reboot the host after waiting for some time.



Jun 24 18:43:20 dev-proxmox pvestatd[2115]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 51 retries
Jun 24 18:43:20 dev-proxmox pvestatd[2115]: status update time (8.029 seconds)
Jun 24 18:43:30 dev-proxmox kernel: watchdog: BUG: soft lockup - CPU#9 stuck for 4481s! [z_wr_int_1:589]
Jun 24 18:43:30 dev-proxmox kernel: Modules linked in: 8021q garp mrp tcp_diag inet_diag rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls softdog nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio kvm crct10dif_pclmul snd_hda_intel ghash_clmulni_intel snd_intel_dspcfg snd_intel_sdw_acpi aesni_intel snd_hda_codec crypto_simd snd_hda_core cryptd snd_hwdep rapl snd_pcm gigabyte_wmi snd_timer wmi_bmof pcspkr ccp efi_pstore k10temp mxm_wmi snd soundcore vhost_net mac_hid vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_pci_core vfio_virqfd irqbypass vfio_iommu_type1 vfio drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic
Jun 24 18:43:30 dev-proxmox kernel: xor zstd_compress raid6_pq libcrc32c simplefb hid_generic usbkbd usbhid hid ahci xhci_pci crc32_pclmul i2c_piix4 xhci_pci_renesas libahci ixgbe xhci_hcd igb nvme i2c_algo_bit xfrm_algo dca mdio nvme_core wmi
Jun 24 18:43:30 dev-proxmox kernel: CPU: 9 PID: 589 Comm: z_wr_int_1 Tainted: P D W O L 5.15.74-1-pve #1
Jun 24 18:43:30 dev-proxmox kernel: Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F36d 07/20/2022
Jun 24 18:43:30 dev-proxmox kernel: RIP: 0010:smp_call_function_many_cond+0x13f/0x360
Jun 24 18:43:30 dev-proxmox kernel: Code: c4 73 2d 4d 63 ec 48 8b 13 49 81 fd ff 1f 00 00 0f 87 e3 01 00 00 4a 03 14 ed e0 fa cb a2 8b 42 08 a8 01 74 09 f3 90 8b 42 08 <a8> 01 75 f7 eb bc 48 83 c4 40 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc
Jun 24 18:43:30 dev-proxmox kernel: RSP: 0018:ffffbcc2027efa48 EFLAGS: 00000202
Jun 24 18:43:30 dev-proxmox kernel: RAX: 0000000000000011 RBX: ffff92c65ec71c00 RCX: 0000000000000011
Jun 24 18:43:30 dev-proxmox kernel: RDX: ffff92c65ee77bc0 RSI: 0000000000000000 RDI: ffff92bb40067668
Jun 24 18:43:30 dev-proxmox kernel: RBP: ffffbcc2027efab0 R08: 0000000000000000 R09: 0000000000000000
Jun 24 18:43:30 dev-proxmox kernel: R10: 0000000000000011 R11: fffffffffffe0000 R12: 0000000000000011
Jun 24 18:43:30 dev-proxmox kernel: R13: 0000000000000011 R14: 0000000000000001 R15: 0000000000000020
Jun 24 18:43:30 dev-proxmox kernel: FS: 0000000000000000(0000) GS:ffff92c65ec40000(0000) knlGS:0000000000000000
Jun 24 18:43:30 dev-proxmox kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 24 18:43:30 dev-proxmox kernel: CR2: 0000027047826000 CR3: 00000005ca6e2000 CR4: 0000000000750ee0
Jun 24 18:43:30 dev-proxmox kernel: PKRU: 55555554
Jun 24 18:43:30 dev-proxmox kernel: Call Trace:
Jun 24 18:43:30 dev-proxmox kernel: <TASK>
Jun 24 18:43:30 dev-proxmox kernel: ? __flush_tlb_all+0x30/0x30
Jun 24 18:43:30 dev-proxmox kernel: on_each_cpu_cond_mask+0x22/0x30
Jun 24 18:43:30 dev-proxmox kernel: flush_tlb_kernel_range+0x41/0xa0
Jun 24 18:43:30 dev-proxmox kernel: __purge_vmap_area_lazy+0xb9/0x700
Jun 24 18:43:30 dev-proxmox kernel: ? __cond_resched+0x1a/0x50
Jun 24 18:43:30 dev-proxmox kernel: free_vmap_area_noflush+0x2ef/0x330
Jun 24 18:43:30 dev-proxmox kernel: remove_vm_area+0x9e/0xb0
Jun 24 18:43:30 dev-proxmox kernel: __vunmap+0x93/0x2a0
Jun 24 18:43:30 dev-proxmox kernel: __vfree+0x22/0x70
Jun 24 18:43:30 dev-proxmox kernel: vfree+0x2c/0x50
Jun 24 18:43:30 dev-proxmox kernel: spl_slab_reclaim+0x172/0x1b0 [spl]
Jun 24 18:43:30 dev-proxmox kernel: spl_kmem_cache_free+0x187/0x200 [spl]
Jun 24 18:43:30 dev-proxmox kernel: zio_buf_free+0x33/0x80 [zfs]
Jun 24 18:43:30 dev-proxmox kernel: abd_free+0x1cd/0x1e0 [zfs]
Jun 24 18:43:30 dev-proxmox kernel: zio_pop_transforms+0x88/0xa0 [zfs]
Jun 24 18:43:30 dev-proxmox kernel: zio_done+0x17f/0x1290 [zfs]
Jun 24 18:43:30 dev-proxmox kernel: zio_execute+0x95/0x160 [zfs]
Jun 24 18:43:30 dev-proxmox kernel: taskq_thread+0x29f/0x4d0 [spl]
Jun 24 18:43:30 dev-proxmox kernel: ? wake_up_q+0x90/0x90
Jun 24 18:43:30 dev-proxmox kernel: ? zio_gang_tree_free+0x70/0x70 [zfs]
Jun 24 18:43:30 dev-proxmox kernel: ? taskq_thread_spawn+0x60/0x60 [spl]
Jun 24 18:43:30 dev-proxmox kernel: kthread+0x12a/0x150
Jun 24 18:43:30 dev-proxmox kernel: ? set_kthread_struct+0x50/0x50
Jun 24 18:43:30 dev-proxmox kernel: ret_from_fork+0x22/0x30
Jun 24 18:43:30 dev-proxmox kernel: </TASK>
Jun 24 18:43:30 dev-proxmox pvestatd[2115]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 51 retries
Jun 24 18:43:30 dev-proxmox pvestatd[2115]: status update time (8.029 seconds)
Jun 24 18:43:34 dev-proxmox kernel: watchdog: BUG: soft lockup - CPU#11 stuck for 4511s! [kworker/11:0:56662]
Jun 24 18:43:34 dev-proxmox kernel: Modules linked in: 8021q garp mrp tcp_diag inet_diag rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls softdog nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio kvm crct10dif_pclmul snd_hda_intel ghash_clmulni_intel snd_intel_dspcfg snd_intel_sdw_acpi aesni_intel snd_hda_codec crypto_simd snd_hda_core cryptd snd_hwdep rapl snd_pcm gigabyte_wmi snd_timer wmi_bmof pcspkr ccp efi_pstore k10temp mxm_wmi snd soundcore vhost_net mac_hid vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_pci_core vfio_virqfd irqbypass vfio_iommu_type1 vfio drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic
Jun 24 18:43:34 dev-proxmox kernel: xor zstd_compress raid6_pq libcrc32c simplefb hid_generic usbkbd usbhid hid ahci xhci_pci crc32_pclmul i2c_piix4 xhci_pci_renesas libahci ixgbe xhci_hcd igb nvme i2c_algo_bit xfrm_algo dca mdio nvme_core wmi
Jun 24 18:43:34 dev-proxmox kernel: CPU: 11 PID: 56662 Comm: kworker/11:0 Tainted: P D W O L 5.15.74-1-pve #1
Jun 24 18:43:34 dev-proxmox kernel: Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F36d 07/20/2022
Jun 24 18:43:34 dev-proxmox kernel: Workqueue: events netstamp_clear
Jun 24 18:43:34 dev-proxmox kernel: RIP: 0010:smp_call_function_many_cond+0x13c/0x360
Jun 24 18:43:34 dev-proxmox kernel: Code: 01 41 89 c4 73 2d 4d 63 ec 48 8b 13 49 81 fd ff 1f 00 00 0f 87 e3 01 00 00 4a 03 14 ed e0 fa cb a2 8b 42 08 a8 01 74 09 f3 90 <8b> 42 08 a8 01 75 f7 eb bc 48 83 c4 40 5b 41 5c 41 5d 41 5e 41 5f
Jun 24 18:43:34 dev-proxmox kernel: RSP: 0018:ffffbcc21fa3fcf0 EFLAGS: 00000202
Jun 24 18:43:34 dev-proxmox kernel: RAX: 0000000000000011 RBX: ffff92c65ecf1c00 RCX: 0000000000000011
Jun 24 18:43:34 dev-proxmox kernel: RDX: ffff92c65ee77c00 RSI: 0000000000000000 RDI: ffff92bb40e5cd00
Jun 24 18:43:34 dev-proxmox kernel: RBP: ffffbcc21fa3fd58 R08: 0000000000000000 R09: 0000000000000000
Jun 24 18:43:34 dev-proxmox kernel: R10: 0000000000000011 R11: fffffffffffe0000 R12: 0000000000000011
Jun 24 18:43:34 dev-proxmox kernel: R13: 0000000000000011 R14: 0000000000000001 R15: 0000000000000020
Jun 24 18:43:34 dev-proxmox kernel: FS: 0000000000000000(0000) GS:ffff92c65ecc0000(0000) knlGS:0000000000000000
Jun 24 18:43:34 dev-proxmox kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 24 18:43:34 dev-proxmox kernel: CR2: 0000000000e203c0 CR3: 0000000946810000 CR4: 0000000000750ee0
Jun 24 18:43:34 dev-proxmox kernel: PKRU: 55555554
Jun 24 18:43:34 dev-proxmox kernel: Call Trace:
Jun 24 18:43:34 dev-proxmox kernel: <TASK>
Jun 24 18:43:34 dev-proxmox kernel: ? text_poke_loc_init+0x190/0x190
Jun 24 18:43:34 dev-proxmox kernel: on_each_cpu_cond_mask+0x22/0x30
Jun 24 18:43:34 dev-proxmox kernel: text_poke_bp_batch+0xb2/0x270
Jun 24 18:43:34 dev-proxmox kernel: text_poke_finish+0x1f/0x40
Jun 24 18:43:34 dev-proxmox kernel: arch_jump_label_transform_apply+0x1a/0x30
Jun 24 18:43:34 dev-proxmox kernel: __jump_label_update+0xf3/0x140
Jun 24 18:43:34 dev-proxmox kernel: jump_label_update+0xba/0xe0
Jun 24 18:43:34 dev-proxmox kernel: static_key_enable_cpuslocked+0x77/0xa0
Jun 24 18:43:34 dev-proxmox kernel: static_key_enable+0x1b/0x30
Jun 24 18:43:34 dev-proxmox kernel: netstamp_clear+0x2d/0x40
Jun 24 18:43:34 dev-proxmox kernel: process_one_work+0x22b/0x3d0
Jun 24 18:43:34 dev-proxmox kernel: worker_thread+0x53/0x420
Jun 24 18:43:34 dev-proxmox kernel: ? process_one_work+0x3d0/0x3d0
Jun 24 18:43:34 dev-proxmox kernel: kthread+0x12a/0x150
Jun 24 18:43:34 dev-proxmox kernel: ? set_kthread_struct+0x50/0x50
Jun 24 18:43:34 dev-proxmox kernel: ret_from_fork+0x22/0x30
Jun 24 18:43:34 dev-proxmox kernel: </TASK>
Jun 24 18:43:40 dev-proxmox pvestatd[2115]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - unable to connect to VM 100 qmp socket - timeout after 51 retries
Jun 24 18:43:40 dev-proxmox pvestatd[2115]: status update time (8.029 seconds)
Jun 24 18:43:42 dev-proxmox kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 4519s! [kcompactd0:164]
Jun 24 18:43:42 dev-proxmox kernel: Modules linked in: 8021q garp mrp tcp_diag inet_diag rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls softdog nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio kvm crct10dif_pclmul snd_hda_intel ghash_clmulni_intel snd_intel_dspcfg snd_intel_sdw_acpi aesni_intel snd_hda_codec crypto_simd snd_hda_core cryptd snd_hwdep rapl snd_pcm gigabyte_wmi snd_timer wmi_bmof pcspkr ccp efi_pstore k10temp mxm_wmi snd soundcore vhost_net mac_hid vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_pci_core vfio_virqfd irqbypass vfio_iommu_type1 vfio drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic
Jun 24 18:43:42 dev-proxmox kernel: xor zstd_compress raid6_pq libcrc32c simplefb hid_generic usbkbd usbhid hid ahci xhci_pci crc32_pclmul i2c_piix4 xhci_pci_renesas libahci ixgbe xhci_hcd igb nvme i2c_algo_bit xfrm_algo dca mdio nvme_core wmi
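To make it easier to compare this with any future occurrence, here is a minimal sketch that pulls the soft-lockup lines out of a saved log and lists the CPU, stuck duration, and task for each. It is not part of any Proxmox tooling; the function name `parse_soft_lockups` and the regular expression are my own assumptions, based only on the line format shown above.

```python
import re

# Matches kernel soft-lockup lines in the format seen in the log above, e.g.
# "... watchdog: BUG: soft lockup - CPU#9 stuck for 4481s! [z_wr_int_1:589]"
# (pattern is an assumption derived from these sample lines only)
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(lines):
    """Return one dict per soft-lockup line found in an iterable of log lines."""
    events = []
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            events.append({
                "cpu": int(match["cpu"]),
                "seconds": int(match["secs"]),
                # Task names may contain colons (kworker/11:0), so the greedy
                # match leaves only the trailing ":<pid>" for the pid group.
                "task": match["task"],
                "pid": int(match["pid"]),
            })
    return events


if __name__ == "__main__":
    sample = [
        "Jun 24 18:43:30 dev-proxmox kernel: watchdog: BUG: soft lockup - CPU#9 stuck for 4481s! [z_wr_int_1:589]",
        "Jun 24 18:43:30 dev-proxmox pvestatd[2115]: status update time (8.029 seconds)",
        "Jun 24 18:43:34 dev-proxmox kernel: watchdog: BUG: soft lockup - CPU#11 stuck for 4511s! [kworker/11:0:56662]",
        "Jun 24 18:43:42 dev-proxmox kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 4519s! [kcompactd0:164]",
    ]
    for event in parse_soft_lockups(sample):
        print(f"CPU {event['cpu']}: {event['task']} (pid {event['pid']}) "
              f"stuck for {event['seconds']}s")
```

In the excerpt above this gives three events (CPUs 9, 11 and 3), each already stuck for more than 4400 seconds, which matches the roughly one hour I waited before rebooting.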



I have now changed the VM configuration to the following. I have not tested it yet and will update if the issue happens again.

sata0: local-zfs:vm-101-disk-1,aio=threads,cache=writeback,discard=on,size=100G,snapshot=1,ssd=1
sata1: local:100/vm100disk.qcow2,aio=threads,backup=0,cache=writeback,discard=on,snapshot=1
scsihw: virtio-scsi-pci
args: -smp '8,cores=4,threads=2,sockets=1,maxcpus=8' -cpu 'host,-hypervisor,topoext=on,hv_ipi,hv_relaxed,hv_reset,hv_runtime,hv_spinlocks=0x1fff,hv_stimer,hv_synic,hv_time,hv_vapic,hv_vpindex,+kvm_pv_eoi,+kvm_pv_unhalt'
agent: 1
balloon: 0
bios: ovmf
boot: order=sata0
cores: 8
cpu: host
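For anyone comparing disk settings between VMs, the disk lines above can be split into the volume and its options with a few lines of Python. This is only a sketch under my own assumptions: `parse_disk_line` is a hypothetical helper, and it only handles simple disk lines such as `sata0`/`sata1`, not the quoted `args:` line.

```python
def parse_disk_line(line):
    """Split a disk line like 'sata0: <volume>,k=v,k=v' into its parts.

    Hypothetical helper: assumes the first comma-separated field is the
    volume and every remaining field has the form key=value.
    """
    key, _, value = line.partition(": ")
    volume, *raw_options = value.split(",")
    options = dict(option.split("=", 1) for option in raw_options)
    return key, volume, options


if __name__ == "__main__":
    line = ("sata1: local:100/vm100disk.qcow2,aio=threads,backup=0,"
            "cache=writeback,discard=on,snapshot=1")
    key, volume, options = parse_disk_line(line)
    # Show the two I/O-related options from the config above
    print(key, volume, options["aio"], options["cache"])
```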


Please help. I cannot figure out what the issue is.
 
