Hi,
I have a PVE node running on an MS-01 machine, and sometimes it gets stuck:
It only responds to some commands, and all VMs but one appear to keep running (the stuck VM cannot be stopped or restarted).
The stuck VM uses a passed-through Nvidia GPU.
syslog:
Code:
Jan 30 10:33:09 pve1 kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 16636s! [pvestatd:1561]
Jan 30 10:33:09 pve1 kernel: Modules linked in: tcp_diag inet_diag vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd veth rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel nf_tables 8021q garp mrp softdog sunrpc binfmt_misc bonding tls nfnetlink_log nfnetlink snd_sof_pci_intel_tgl snd_sof_intel_hda_common intel_rapl_msr soundwire_intel intel_rapl_common intel_uncore_frequency snd_sof_intel_hda_mlink soundwire_cadence intel_uncore_frequency_common snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_generic_allocation soundwire_bus snd_soc_core x86_pkg_temp_thermal intel_powerclamp mt7921e snd_compress mt7921_common snd_hda_codec_hdmi ac97_bus snd_pcm_dmaengine mt792x_lib kvm_intel mt76_connac_lib snd_hda_intel mt76 xe kvm snd_intel_dspcfg snd_intel_sdw_acpi mac80211 irqbypass snd_hda_codec
Jan 30 10:33:09 pve1 kernel: crct10dif_pclmul polyval_clmulni drm_gpuvm polyval_generic ghash_clmulni_intel drm_exec snd_hda_core sha256_ssse3 gpu_sched sha1_ssse3 btusb drm_buddy aesni_intel snd_hwdep drm_suballoc_helper drm_ttm_helper btrtl snd_pcm ttm btintel crypto_simd btbcm btmtk cryptd mei_hdcp mei_pxp drm_display_helper snd_timer cfg80211 bluetooth snd cec cmdlinepart rapl ecdh_generic mei_me spi_nor ecc intel_pmc_core rc_core soundcore intel_cstate libarc4 intel_vsec i2c_algo_bit mei pcspkr wmi_bmof mtd pmt_telemetry igen6_edac pmt_class acpi_pad acpi_tad mac_hid vhost_net vhost vhost_iotlb tap nct6775 nct6775_core hwmon_vid coretemp efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c xhci_pci nvme xhci_pci_renesas spi_intel_pci crc32_pclmul thunderbolt i40e video spi_intel i2c_i801 xhci_hcd nvme_core igc i2c_smbus nvme_auth wmi pinctrl_tigerlake
Jan 30 10:33:09 pve1 kernel: CPU: 0 PID: 1561 Comm: pvestatd Tainted: P D W O L 6.8.12-7-pve #1
Jan 30 10:33:09 pve1 kernel: Hardware name: Micro Computer (HK) Tech Limited Venus Series/AHWSA, BIOS AHWSA.1.22 03/12/2024
Jan 30 10:33:09 pve1 kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x7f/0x2d0
Jan 30 10:33:09 pve1 kernel: Code: 00 00 f0 0f ba 2b 08 0f 92 c2 8b 03 0f b6 d2 c1 e2 08 30 e4 09 d0 3d ff 00 00 00 77 5f 85 c0 74 10 0f b6 03 84 c0 74 09 f3 90 <0f> b6 03 84 c0 75 f7 b8 01 00 00 00 66 89 03 5b 41 5c 41 5d 41 5e
Jan 30 10:33:09 pve1 kernel: RSP: 0000:ffffb56ca4abbb28 EFLAGS: 00000202
Jan 30 10:33:09 pve1 kernel: RAX: 0000000000000001 RBX: ffffe8c4293e0328 RCX: 000fffffffe00000
Jan 30 10:33:09 pve1 kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffe8c4293e0328
Jan 30 10:33:09 pve1 kernel: RBP: ffffb56ca4abbb48 R08: 0000000000000000 R09: 0000000000000000
Jan 30 10:33:09 pve1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8eb2570dd890
Jan 30 10:33:09 pve1 kernel: R13: 000059af62400000 R14: ffffb56ca4abbc58 R15: ffff8ebb4f80c000
Jan 30 10:33:09 pve1 kernel: FS: 0000000000000000(0000) GS:ffff8ec94f200000(0000) knlGS:0000000000000000
Jan 30 10:33:09 pve1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 30 10:33:09 pve1 kernel: CR2: 000059af6258f000 CR3: 0000000143036000 CR4: 0000000000f52ef0
Jan 30 10:33:09 pve1 kernel: PKRU: 55555554
Jan 30 10:33:09 pve1 kernel: Call Trace:
Jan 30 10:33:09 pve1 kernel: <IRQ>
Jan 30 10:33:09 pve1 kernel: ? show_regs+0x6d/0x80
Jan 30 10:33:09 pve1 kernel: ? watchdog_timer_fn+0x206/0x290
Jan 30 10:33:09 pve1 kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
Jan 30 10:33:09 pve1 kernel: ? __hrtimer_run_queues+0x105/0x280
Jan 30 10:33:09 pve1 kernel: ? clockevents_program_event+0xb3/0x140
Jan 30 10:33:09 pve1 kernel: ? hrtimer_interrupt+0xf6/0x250
Jan 30 10:33:09 pve1 kernel: ? __sysvec_apic_timer_interrupt+0x4e/0x150
Jan 30 10:33:09 pve1 kernel: ? sysvec_apic_timer_interrupt+0x8d/0xd0
Jan 30 10:33:09 pve1 kernel: </IRQ>
Jan 30 10:33:09 pve1 kernel: <TASK>
Jan 30 10:33:09 pve1 kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
Jan 30 10:33:09 pve1 kernel: ? native_queued_spin_lock_slowpath+0x7f/0x2d0
Jan 30 10:33:09 pve1 kernel: _raw_spin_lock+0x3f/0x60
Jan 30 10:33:09 pve1 kernel: __pte_offset_map_lock+0xa3/0x130
Jan 30 10:33:09 pve1 kernel: unmap_page_range+0x4b0/0x12e0
Jan 30 10:33:09 pve1 kernel: unmap_single_vma+0x89/0xf0
Jan 30 10:33:09 pve1 kernel: unmap_vmas+0xb5/0x190
Jan 30 10:33:09 pve1 kernel: exit_mmap+0x10a/0x3f0
Jan 30 10:33:09 pve1 kernel: __mmput+0x41/0x140
Jan 30 10:33:09 pve1 kernel: mmput+0x31/0x40
Jan 30 10:33:09 pve1 kernel: do_exit+0x32c/0xaf0
Jan 30 10:33:09 pve1 kernel: ? _printk+0x60/0x90
Jan 30 10:33:09 pve1 kernel: make_task_dead+0x83/0x170
Jan 30 10:33:09 pve1 kernel: rewind_stack_and_make_dead+0x17/0x20
Jan 30 10:33:09 pve1 kernel: RIP: 0033:0x736764927c4a
Jan 30 10:33:09 pve1 kernel: Code: Unable to access opcode bytes at 0x736764927c20.
Jan 30 10:33:09 pve1 kernel: RSP: 002b:00007ffc1786c3f8 EFLAGS: 00010202
Jan 30 10:33:09 pve1 kernel: RAX: 000059af6258be40 RBX: 0000000000004854 RCX: 0000000000001694
Jan 30 10:33:09 pve1 kernel: RDX: 0000000000004854 RSI: 000059af62343c70 RDI: 000059af6258f000
Jan 30 10:33:09 pve1 kernel: RBP: 000059af62340ab0 R08: 000059af6258be40 R09: 3e32717f79777775
Jan 30 10:33:09 pve1 kernel: R10: 0000736763763c4a R11: 8080808080808080 R12: 0000000000000046
Jan 30 10:33:09 pve1 kernel: R13: 00000000000074e8 R14: 8000000000000002 R15: 00007ffc1786c448
Jan 30 10:33:09 pve1 kernel: </TASK>
Is there a way to properly debug and fix this?
Thank you.