PVE node stuck

Hi,

I have a pve node on an MS-01 machine, and sometimes it gets stuck:

1738229744950.png

It only responds to some commands, and all VMs but one seem to keep running (I cannot stop or restart the stuck VM).
The stuck VM uses a passed-through NVIDIA GPU.

syslog:

Code:
Jan 30 10:33:09 pve1 kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 16636s! [pvestatd:1561]
Jan 30 10:33:09 pve1 kernel: Modules linked in: tcp_diag inet_diag vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd veth rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel nf_tables 8021q garp mrp softdog sunrpc binfmt_misc bonding tls nfnetlink_log nfnetlink snd_sof_pci_intel_tgl snd_sof_intel_hda_common intel_rapl_msr soundwire_intel intel_rapl_common intel_uncore_frequency snd_sof_intel_hda_mlink soundwire_cadence intel_uncore_frequency_common snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_generic_allocation soundwire_bus snd_soc_core x86_pkg_temp_thermal intel_powerclamp mt7921e snd_compress mt7921_common snd_hda_codec_hdmi ac97_bus snd_pcm_dmaengine mt792x_lib kvm_intel mt76_connac_lib snd_hda_intel mt76 xe kvm snd_intel_dspcfg snd_intel_sdw_acpi mac80211 irqbypass snd_hda_codec
Jan 30 10:33:09 pve1 kernel:  crct10dif_pclmul polyval_clmulni drm_gpuvm polyval_generic ghash_clmulni_intel drm_exec snd_hda_core sha256_ssse3 gpu_sched sha1_ssse3 btusb drm_buddy aesni_intel snd_hwdep drm_suballoc_helper drm_ttm_helper btrtl snd_pcm ttm btintel crypto_simd btbcm btmtk cryptd mei_hdcp mei_pxp drm_display_helper snd_timer cfg80211 bluetooth snd cec cmdlinepart rapl ecdh_generic mei_me spi_nor ecc intel_pmc_core rc_core soundcore intel_cstate libarc4 intel_vsec i2c_algo_bit mei pcspkr wmi_bmof mtd pmt_telemetry igen6_edac pmt_class acpi_pad acpi_tad mac_hid vhost_net vhost vhost_iotlb tap nct6775 nct6775_core hwmon_vid coretemp efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c xhci_pci nvme xhci_pci_renesas spi_intel_pci crc32_pclmul thunderbolt i40e video spi_intel i2c_i801 xhci_hcd nvme_core igc i2c_smbus nvme_auth wmi pinctrl_tigerlake
Jan 30 10:33:09 pve1 kernel: CPU: 0 PID: 1561 Comm: pvestatd Tainted: P      D W  O L     6.8.12-7-pve #1
Jan 30 10:33:09 pve1 kernel: Hardware name: Micro Computer (HK) Tech Limited Venus Series/AHWSA, BIOS AHWSA.1.22 03/12/2024
Jan 30 10:33:09 pve1 kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x7f/0x2d0
Jan 30 10:33:09 pve1 kernel: Code: 00 00 f0 0f ba 2b 08 0f 92 c2 8b 03 0f b6 d2 c1 e2 08 30 e4 09 d0 3d ff 00 00 00 77 5f 85 c0 74 10 0f b6 03 84 c0 74 09 f3 90 <0f> b6 03 84 c0 75 f7 b8 01 00 00 00 66 89 03 5b 41 5c 41 5d 41 5e
Jan 30 10:33:09 pve1 kernel: RSP: 0000:ffffb56ca4abbb28 EFLAGS: 00000202
Jan 30 10:33:09 pve1 kernel: RAX: 0000000000000001 RBX: ffffe8c4293e0328 RCX: 000fffffffe00000
Jan 30 10:33:09 pve1 kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffe8c4293e0328
Jan 30 10:33:09 pve1 kernel: RBP: ffffb56ca4abbb48 R08: 0000000000000000 R09: 0000000000000000
Jan 30 10:33:09 pve1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8eb2570dd890
Jan 30 10:33:09 pve1 kernel: R13: 000059af62400000 R14: ffffb56ca4abbc58 R15: ffff8ebb4f80c000
Jan 30 10:33:09 pve1 kernel: FS:  0000000000000000(0000) GS:ffff8ec94f200000(0000) knlGS:0000000000000000
Jan 30 10:33:09 pve1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 30 10:33:09 pve1 kernel: CR2: 000059af6258f000 CR3: 0000000143036000 CR4: 0000000000f52ef0
Jan 30 10:33:09 pve1 kernel: PKRU: 55555554
Jan 30 10:33:09 pve1 kernel: Call Trace:
Jan 30 10:33:09 pve1 kernel:  <IRQ>
Jan 30 10:33:09 pve1 kernel:  ? show_regs+0x6d/0x80
Jan 30 10:33:09 pve1 kernel:  ? watchdog_timer_fn+0x206/0x290
Jan 30 10:33:09 pve1 kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
Jan 30 10:33:09 pve1 kernel:  ? __hrtimer_run_queues+0x105/0x280
Jan 30 10:33:09 pve1 kernel:  ? clockevents_program_event+0xb3/0x140
Jan 30 10:33:09 pve1 kernel:  ? hrtimer_interrupt+0xf6/0x250
Jan 30 10:33:09 pve1 kernel:  ? __sysvec_apic_timer_interrupt+0x4e/0x150
Jan 30 10:33:09 pve1 kernel:  ? sysvec_apic_timer_interrupt+0x8d/0xd0
Jan 30 10:33:09 pve1 kernel:  </IRQ>
Jan 30 10:33:09 pve1 kernel:  <TASK>
Jan 30 10:33:09 pve1 kernel:  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
Jan 30 10:33:09 pve1 kernel:  ? native_queued_spin_lock_slowpath+0x7f/0x2d0
Jan 30 10:33:09 pve1 kernel:  _raw_spin_lock+0x3f/0x60
Jan 30 10:33:09 pve1 kernel:  __pte_offset_map_lock+0xa3/0x130
Jan 30 10:33:09 pve1 kernel:  unmap_page_range+0x4b0/0x12e0
Jan 30 10:33:09 pve1 kernel:  unmap_single_vma+0x89/0xf0
Jan 30 10:33:09 pve1 kernel:  unmap_vmas+0xb5/0x190
Jan 30 10:33:09 pve1 kernel:  exit_mmap+0x10a/0x3f0
Jan 30 10:33:09 pve1 kernel:  __mmput+0x41/0x140
Jan 30 10:33:09 pve1 kernel:  mmput+0x31/0x40
Jan 30 10:33:09 pve1 kernel:  do_exit+0x32c/0xaf0
Jan 30 10:33:09 pve1 kernel:  ? _printk+0x60/0x90
Jan 30 10:33:09 pve1 kernel:  make_task_dead+0x83/0x170
Jan 30 10:33:09 pve1 kernel:  rewind_stack_and_make_dead+0x17/0x20
Jan 30 10:33:09 pve1 kernel: RIP: 0033:0x736764927c4a
Jan 30 10:33:09 pve1 kernel: Code: Unable to access opcode bytes at 0x736764927c20.
Jan 30 10:33:09 pve1 kernel: RSP: 002b:00007ffc1786c3f8 EFLAGS: 00010202
Jan 30 10:33:09 pve1 kernel: RAX: 000059af6258be40 RBX: 0000000000004854 RCX: 0000000000001694
Jan 30 10:33:09 pve1 kernel: RDX: 0000000000004854 RSI: 000059af62343c70 RDI: 000059af6258f000
Jan 30 10:33:09 pve1 kernel: RBP: 000059af62340ab0 R08: 000059af6258be40 R09: 3e32717f79777775
Jan 30 10:33:09 pve1 kernel: R10: 0000736763763c4a R11: 8080808080808080 R12: 0000000000000046
Jan 30 10:33:09 pve1 kernel: R13: 00000000000074e8 R14: 8000000000000002 R15: 00007ffc1786c448
Jan 30 10:33:09 pve1 kernel:  </TASK>

Is there a way to properly debug and fix this?

Thank you.
 
Hello Urbaman! First of all, I hope you followed the guide on PCI(e) Passthrough. Make sure that the graphics driver is blacklisted correctly on the host, as described in the guide.
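
One way to check this on the host is to look at which kernel driver is bound to the GPU. The following is only a minimal sketch: `gpu_drivers` is a made-up helper name, and it assumes the usual `lspci -nnk` layout (a device header line followed by indented "Kernel driver in use:" lines). With passthrough set up, one would typically expect `vfio-pci` here rather than `nouveau` or `nvidia`, at least while the VM is running.

```shell
#!/bin/sh
# Hypothetical helper (not part of Proxmox): reads `lspci -nnk` output on
# stdin and prints each NVIDIA device header plus the driver bound to it.
gpu_drivers() {
    awk '
        # Device header lines start with the PCI address (hex digits).
        /^[0-9a-f]/ { nv = ($0 ~ /NVIDIA/); if (nv) print; next }
        # Indented detail lines belong to the most recent header.
        nv && /Kernel driver in use:/ { print }
    '
}

# Example usage on the host:
#   lspci -nnk | gpu_drivers
```

If no "Kernel driver in use" line is printed for the GPU, no driver is currently bound to it.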

Also SSH seems to be stuck: I can connect, but it freezes after I enter the password
Are you trying to connect to the host, or to the VM with the passed-through GPU?

Also, does everything work correctly when the GPU is not passed through?
 
Hi @l.leahu-vladucu

I'm trying to manage the problem remotely, so I cannot do everything.

That said:

- All of the drivers are blacklisted (I can only show them on the other nodes atm, pve1 is really stuck)

Code:
root@pve2:~# cat /etc/modprobe.d/pve-blacklist.conf
# This file contains a list of modules which are not supported by Proxmox VE

# nvidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb
root@pve2:~# cat /etc/modprobe.d/blacklist.conf
blacklist amdgpu
blacklist radeon
blacklist nouveau
blacklist nvidia*
blacklist i915

The SSH connection is to the faulty node. It gets in after a long time, but seems to be either far too slow or not fully responding.

1738232912072.png

1738232939482.png

I think I finally managed to stop the VM (mk8s1); let's see if PVE gets back to working properly.

As an aside, could a faulty GPU driver in the VM be causing the host to get stuck?
Also: I have the very same VM (with the very same GPU) on a different node, on different hardware (older Lenovo M90q), and it doesn't have these problems, it seems.
So I would need to know how to properly debug this node.
 
Some general things you can check to find more information:
  1. Check the journal with journalctl --since <TIME> to show the logs since a certain time, or journalctl --boot for the logs since the last boot. You should try that on both the host and the VM in question.
  2. Check dmesg to see if it shows any errors. You should try that on both the host and the VM in question.
  3. Try to disable the GPU passthrough and see if that helps. Then at least you know whether you're debugging in the right direction.
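
The first two checks could look roughly like the sketch below. It is an illustration under assumptions, not an official procedure: `filter_lockups` is a made-up helper, the pattern only matches wording such as the "soft lockup" message in the syslog excerpt above plus a few common kernel and NVIDIA driver markers, and the "6 hours ago" window is an arbitrary example.

```shell
#!/bin/sh
# Hypothetical helper: reads log text on stdin and keeps only lines that hint
# at lockups, hung tasks, or NVIDIA driver (NVRM) messages.
filter_lockups() {
    grep -E -i 'soft lockup|hard lockup|blocked for more than|NVRM'
}

# Example usage, on the host and again inside the VM:
#   journalctl --boot --no-pager | filter_lockups             # current boot
#   journalctl -k --since "6 hours ago" --no-pager | filter_lockups
#   journalctl -b -1 --no-pager | filter_lockups              # previous boot, needs a persistent journal
#   dmesg --level=err,warn                                    # kernel errors and warnings only
```

After a forced reboot, the previous-boot variant is the one most likely to contain the lockup itself.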
 
Hi, here is the journalctl output from the first occurrence of the fault (there's a lot going on, and I can't get anything useful out of it): https://pastebin.com/LYawfWmz

Also, I couldn't reboot normally and had to force a reboot; I'm now waiting for the node to come back up.

Bash:
root@pve1:~# reboot
Failed to set wall message, ignoring: Connection timed out
Call to Reboot failed: Connection timed out
root@pve1:~#
root@pve1:~#
root@pve1:~#
root@pve1:~#
root@pve1:~#
root@pve1:~# systemctl reboot -ff
Rebooting.