The hypervisor freezes

May 26, 2024
I have the following problem: my Proxmox host freezes completely, together with all guest machines, and I have to power-cycle it.
What is specific about this setup: NVIDIA GRID (vGPU) drivers are installed on the host, and virtual GPUs are passed through to the guest VMs.

Could you please advise what I can check to diagnose the issue?
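So far the only thing I have done myself is make the journal persistent, so that kernel messages survive the hard reset. This is just a sketch of what I ran; the paths are the systemd defaults, and the grep pattern is only my guess at the relevant keywords:

```shell
# Keep journal data across reboots (systemd default paths assumed)
mkdir -p /var/log/journal
systemd-tmpfiles --create --prefix /var/log/journal
systemctl restart systemd-journald

# After the next freeze, read the kernel log of the previous boot
# and filter for the messages that look suspicious to me:
journalctl -k -b -1 | grep -iE 'nvrm|split.lock|BUG:|watchdog' | tail -n 50
```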

THX

Here are the package versions (output of `pveversion -v`):
Code:
proxmox-ve: 8.2.0 (running kernel: 6.8.4-3-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-3
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5: 6.5.13-5
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
ceph-fuse: 17.2.7-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.2
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.2-1
proxmox-backup-file-restore: 3.2.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.7
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2

Here is the kernel log (dmesg) from the boot in question:
Code:
May 26 14:13:11 pve-01 kernel: NVRM: GPU at 0000:01:00.0 has software scheduler ENABLED with policy BEST_EFFORT.
May 26 14:13:11 pve-01 kernel: softdog: initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
May 26 14:13:11 pve-01 kernel: softdog:              soft_reboot_cmd=<not set> soft_active_on_boot=0
May 26 14:13:11 pve-01 kernel: RPC: Registered named UNIX socket transport module.
May 26 14:13:11 pve-01 kernel: RPC: Registered udp transport module.
May 26 14:13:11 pve-01 kernel: RPC: Registered tcp transport module.
May 26 14:13:11 pve-01 kernel: RPC: Registered tcp-with-tls transport module.
May 26 14:13:11 pve-01 kernel: RPC: Registered tcp NFSv4.1 backchannel transport module.
May 26 14:13:11 pve-01 kernel: vmbr0: port 1(enp3s0) entered blocking state
May 26 14:13:11 pve-01 kernel: vmbr0: port 1(enp3s0) entered disabled state
May 26 14:13:11 pve-01 kernel: r8169 0000:03:00.0 enp3s0: entered allmulticast mode
May 26 14:13:11 pve-01 kernel: r8169 0000:03:00.0 enp3s0: entered promiscuous mode
May 26 14:13:11 pve-01 kernel: RTL8226B_RTL8221B 2.5Gbps PHY r8169-0-300:00: attached PHY driver (mii_bus:phy_addr=r8169-0-300:00, irq=MAC)
May 26 14:13:11 pve-01 kernel: NVRM: GPU at 0000:04:00.0 has software scheduler ENABLED with policy BEST_EFFORT.
May 26 14:13:11 pve-01 kernel: r8169 0000:03:00.0 enp3s0: Link is Down
May 26 14:13:11 pve-01 kernel: vmbr0: port 1(enp3s0) entered blocking state
May 26 14:13:11 pve-01 kernel: vmbr0: port 1(enp3s0) entered forwarding state
May 26 14:13:12 pve-01 kernel: nvidia 0000:01:00.0: MDEV: Registered
May 26 14:13:12 pve-01 kernel: nvidia 0000:04:00.0: MDEV: Registered
May 26 14:13:12 pve-01 kernel: vmbr0: port 1(enp3s0) entered disabled state
May 26 14:13:14 pve-01 kernel: r8169 0000:03:00.0 enp3s0: Link is Up - 1Gbps/Full - flow control off
May 26 14:13:14 pve-01 kernel: vmbr0: port 1(enp3s0) entered blocking state
May 26 14:13:14 pve-01 kernel: vmbr0: port 1(enp3s0) entered forwarding state
May 26 14:13:16 pve-01 kernel: evm: overlay not supported
May 26 14:13:26 pve-01 kernel: Initializing XFRM netlink socket
May 26 14:13:26 pve-01 kernel: br-d27867211049: port 1(veth1fa1175) entered blocking state
May 26 14:13:26 pve-01 kernel: br-d27867211049: port 1(veth1fa1175) entered disabled state
May 26 14:13:26 pve-01 kernel: veth1fa1175: entered allmulticast mode
May 26 14:13:26 pve-01 kernel: veth1fa1175: entered promiscuous mode
May 26 14:13:26 pve-01 kernel: eth0: renamed from veth2841ed4
May 26 14:13:27 pve-01 kernel: br-d27867211049: port 1(veth1fa1175) entered blocking state
May 26 14:13:27 pve-01 kernel: br-d27867211049: port 1(veth1fa1175) entered forwarding state
May 26 14:18:06 pve-01 kernel: br-d27867211049: port 1(veth1fa1175) entered disabled state
May 26 14:18:06 pve-01 kernel: veth2841ed4: renamed from eth0
May 26 14:18:06 pve-01 kernel: br-d27867211049: port 1(veth1fa1175) entered disabled state
May 26 14:18:06 pve-01 kernel: veth1fa1175 (unregistering): left allmulticast mode
May 26 14:18:06 pve-01 kernel: veth1fa1175 (unregistering): left promiscuous mode
May 26 14:18:06 pve-01 kernel: br-d27867211049: port 1(veth1fa1175) entered disabled state
May 26 14:18:13 pve-01 kernel: watchdog: watchdog0: watchdog did not stop!
May 26 14:18:13 pve-01 systemd-shutdown[1]: Using hardware watchdog 'Software Watchdog', version 0, device /dev/watchdog0
May 26 14:18:13 pve-01 systemd-shutdown[1]: Watchdog running with a timeout of 10min.
May 26 14:18:13 pve-01 systemd-shutdown[1]: Syncing filesystems and block devices.
May 26 14:18:13 pve-01 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
May 26 14:18:13 pve-01 systemd-journald[968]: Received SIGTERM from PID 1 (systemd-shutdow).


Here is the syslog from around the time the VMs were started:
Code:
May 26 13:53:03 pve-01 qm[6573]: <root@pam> end task UPID:pve-01:000019AE:0000BEB7:6653148B:qmstart:101:root@pam: OK
May 26 13:53:03 pve-01 kernel: split_lock_warn: 1 callbacks suppressed
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 3/KVM/6679 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 11/KVM/6687 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 4/KVM/6680 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 1/KVM/6677 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 8/KVM/6684 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 5/KVM/6681 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 12/KVM/6688 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 9/KVM/6685 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 7/KVM/6683 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 13/KVM/6689 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:09 pve-01 nvidia-vgpu-mgr[6520]: notice: vmiop_log: (0x0): vGPU license state: Licensed
May 26 13:53:13 pve-01 nvidia-vgpu-mgr[6700]: notice: vmiop_log: ######## Guest NVIDIA Driver Information: ########
May 26 13:53:13 pve-01 nvidia-vgpu-mgr[6700]: notice: vmiop_log: Driver Version: 551.78
May 26 13:53:13 pve-01 nvidia-vgpu-mgr[6700]: notice: vmiop_log: vGPU version: 0x140001
May 26 13:53:13 pve-01 nvidia-vgpu-mgr[6700]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Unrestricted)
May 26 13:53:23 pve-01 nvidia-vgpu-mgr[6700]: notice: vmiop_log: (0x0): vGPU license state: Licensed
May 26 13:54:41 pve-01 kernel: split_lock_warn: 4 callbacks suppressed
May 26 13:54:41 pve-01 kernel: x86/split lock detection: #AC: CPU 0/KVM/6184 took a split_lock trap at address: 0xfffff80268e498c5
May 26 13:54:50 pve-01 kernel: x86/split lock detection: #AC: CPU 0/KVM/6366 took a split_lock trap at address: 0xfffff80233049aea
May 26 13:55:34 pve-01 kernel: x86/split lock detection: #AC: CPU 0/KVM/6676 took a split_lock trap at address: 0x597a21af
May 26 13:59:56 pve-01 systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...
May 26 13:59:56 pve-01 systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
May 26 13:59:56 pve-01 systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary Directories.
May 26 13:59:56 pve-01 systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deactivated successfully.
May 26 14:01:47 pve-01 pvedaemon[2435]: <root@pam> successful auth for user 'root@pam'
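The `split_lock trap` messages above look suspicious to me. As an experiment I am considering turning split-lock detection off via the kernel command line. A sketch, assuming a GRUB-based boot (on a ZFS root with systemd-boot the parameter would instead go into `/etc/kernel/cmdline` followed by `proxmox-boot-tool refresh`):

```shell
# Append split_lock_detect=off to the default kernel command line (GRUB assumed)
sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 split_lock_detect=off"/' /etc/default/grub
update-grub

# After a reboot, verify the parameter is active:
cat /proc/cmdline
```

I am not sure the split locks are the cause of the freeze rather than just noise, so this would purely be an elimination test.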
 
And here is the kernel oops captured right before the hard lockup:
Code:
May 26 16:41:12 pve-01 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000004
May 26 16:41:12 pve-01 kernel: #PF: supervisor instruction fetch in kernel mode
May 26 16:41:12 pve-01 kernel: #PF: error_code(0x0010) - not-present page
May 26 16:41:12 pve-01 kernel: PGD 0 P4D 0
May 26 16:41:12 pve-01 kernel: Oops: 0010 [#1] PREEMPT SMP NOPTI
May 26 16:41:12 pve-01 kernel: CPU: 8 PID: 3715 Comm: CPU 0/KVM Tainted: P           OE      6.8.4-3-pve #1
May 26 16:41:12 pve-01 kernel: Hardware name: Gigabyte Technology Co., Ltd. B760M AORUS ELITE AX/B760M AORUS ELITE AX, BIOS F18a 05/15/2024
May 26 16:41:12 pve-01 kernel: RIP: 0010:0x4
May 26 16:41:12 pve-01 kernel: Code: Unable to access opcode bytes at 0xffffffffffffffda.
May 26 16:41:12 pve-01 kernel: RSP: 0018:ffffae8b4a78b7d8 EFLAGS: 00010046
May 26 16:41:12 pve-01 kernel: RAX: 00000041a292b35c RBX: ffffae8b4a78b7d8 RCX: 0000000000000000
May 26 16:41:12 pve-01 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
May 26 16:41:12 pve-01 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
May 26 16:41:12 pve-01 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffb8b6c910
May 26 16:41:12 pve-01 kernel: R13: ffff892d4cf48000 R14: 000000000000000a R15: ffff893fff334bc0
May 26 16:41:12 pve-01 kernel: FS:  000077b8dbe006c0(0000) GS:ffff893fff200000(0000) knlGS:000000841a58c000
May 26 16:41:12 pve-01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 26 16:41:12 pve-01 kernel: CR2: ffffffffffffffda CR3: 0000000d88248000 CR4: 0000000000f52ef0
May 26 16:41:12 pve-01 kernel: PKRU: 55555554
May 26 16:41:12 pve-01 kernel: Call Trace:
May 26 16:41:12 pve-01 kernel:  <TASK>
May 26 16:41:12 pve-01 kernel:  ? show_regs+0x6d/0x80
May 26 16:41:12 pve-01 kernel:  ? __die+0x24/0x80
May 26 16:41:12 pve-01 kernel:  ? page_fault_oops+0x176/0x500
May 26 16:41:12 pve-01 kernel:  ? sched_clock_noinstr+0x9/0x10
May 26 16:41:12 pve-01 kernel:  ? do_user_addr_fault+0x2f9/0x6b0
May 26 16:41:12 pve-01 kernel:  ? select_task_rq_fair+0x180/0x1b80
May 26 16:41:12 pve-01 kernel:  ? exc_page_fault+0x83/0x1b0
May 26 16:41:12 pve-01 kernel:  ? asm_exc_page_fault+0x27/0x30
May 26 16:41:12 pve-01 kernel:  ? sched_clock_cpu+0x10/0x1b0
May 26 16:41:12 pve-01 kernel:  ? psi_task_change+0x55/0xd0
May 26 16:41:12 pve-01 kernel:  ? enqueue_task+0xd6/0x1a0
May 26 16:41:12 pve-01 kernel:  ? ttwu_do_activate+0x5f/0x250
May 26 16:41:12 pve-01 kernel:  ? try_to_wake_up+0x234/0x5f0
May 26 16:41:12 pve-01 kernel:  ? wake_up_process+0x15/0x30
May 26 16:41:12 pve-01 kernel:  ? rcuwait_wake_up+0x27/0x40
May 26 16:41:12 pve-01 kernel:  ? kvm_vcpu_wake_up+0x16/0x40 [kvm]
May 26 16:41:12 pve-01 kernel:  ? vmx_deliver_interrupt+0x5b/0x1e0 [kvm_intel]
May 26 16:41:12 pve-01 kernel:  ? __apic_accept_irq+0x140/0x2c0 [kvm]
May 26 16:41:12 pve-01 kernel:  ? kvm_apic_set_irq+0x40/0x60 [kvm]
May 26 16:41:12 pve-01 kernel:  ? kvm_irq_delivery_to_apic+0x159/0x340 [kvm]
May 26 16:41:12 pve-01 kernel:  ? kvm_apic_send_ipi+0xa6/0x120 [kvm]
May 26 16:41:12 pve-01 kernel:  ? kvm_lapic_reg_write+0x69d/0x820 [kvm]
May 26 16:41:12 pve-01 kernel:  ? vmx_vmexit+0x9a/0xe0 [kvm_intel]
May 26 16:41:12 pve-01 kernel:  ? kvm_apic_write_nodecode+0x3a/0x70 [kvm]
May 26 16:41:12 pve-01 kernel:  ? handle_apic_write+0x2e/0xe0 [kvm_intel]
May 26 16:41:12 pve-01 kernel:  ? vmx_handle_exit+0x1f5/0x920 [kvm_intel]
May 26 16:41:12 pve-01 kernel:  ? kvm_arch_vcpu_ioctl_run+0xd5b/0x1760 [kvm]
May 26 16:41:12 pve-01 kernel:  ? kvm_vcpu_ioctl+0x30e/0x800 [kvm]
May 26 16:41:12 pve-01 kernel:  ? kvm_vcpu_ioctl+0x30e/0x800 [kvm]
May 26 16:41:12 pve-01 kernel:  ? kvm_vcpu_ioctl+0x297/0x800 [kvm]
May 26 16:41:12 pve-01 kernel:  ? fire_user_return_notifiers+0x37/0x80
May 26 16:41:12 pve-01 kernel:  ? syscall_exit_to_user_mode+0x86/0x260
May 26 16:41:12 pve-01 kernel:  ? do_syscall_64+0x8d/0x170
May 26 16:41:12 pve-01 kernel:  ? __x64_sys_ioctl+0xa0/0xf0
May 26 16:41:12 pve-01 kernel:  ? x64_sys_call+0xa68/0x24b0
May 26 16:41:12 pve-01 kernel:  ? do_syscall_64+0x81/0x170
May 26 16:41:12 pve-01 kernel:  ? do_syscall_64+0x8d/0x170
May 26 16:41:12 pve-01 kernel:  ? do_syscall_64+0x8d/0x170
May 26 16:41:12 pve-01 kernel:  ? irqentry_exit+0x43/0x50
May 26 16:41:12 pve-01 kernel:  ? entry_SYSCALL_64_after_hwframe+0x78/0x80
May 26 16:41:12 pve-01 kernel:  </TASK>
May 26 16:41:12 pve-01 kernel: Modules linked in: veth xt_conntrack nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype act_police cls_basic sch_ingress sch_htb ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables overlay softdog sunrpc binfmt_misc xt_nat xt_tcpudp iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bonding tls nfnetlink_log nfnetlink btusb btrtl btintel btbcm btmtk bluetooth ecdh_generic ecc nvidia_vgpu_vfio(OE) nvidia(OE) snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda intel_rapl_msr intel_rapl_common snd_hda_ext_core snd_soc_acpi_intel_match intel_uncore_frequency snd_soc_acpi intel_uncore_frequency_common soundwire_generic_allocation soundwire_bus snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi x86_pkg_temp_thermal
May 26 16:41:12 pve-01 kernel:  intel_powerclamp iwlmvm coretemp snd_hda_intel kvm_intel snd_intel_dspcfg mac80211 snd_intel_sdw_acpi snd_hda_codec crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel snd_hda_core sha256_ssse3 sha1_ssse3 mdev snd_hwdep libarc4 aesni_intel snd_pcm kvm crypto_simd iwlwifi cmdlinepart ucsi_ccg snd_timer cryptd mei_me spi_nor snd typec_ucsi rapl cfg80211 mei typec soundcore pcspkr wmi_bmof mtd intel_pmc_core intel_cstate gigabyte_wmi intel_vsec pmt_telemetry pmt_class intel_hid sparse_keymap acpi_pad acpi_tad mac_hid vhost_net vhost vhost_iotlb tap vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c nvme nvme_core ahci libahci nvme_auth xhci_pci crc32_pclmul r8169 spi_intel_pci xhci_pci_renesas i2c_i801 intel_lpss_pci spi_intel i2c_smbus realtek intel_lpss i2c_nvidia_gpu xhci_hcd idma64 i2c_ccgx_ucsi vmd video wmi pinctrl_alderlake
May 26 16:41:12 pve-01 kernel: CR2: 0000000000000004
May 26 16:41:12 pve-01 kernel: ---[ end trace 0000000000000000 ]---
May 26 16:41:12 pve-01 kernel: RIP: 0010:0x4
May 26 16:41:12 pve-01 kernel: Code: Unable to access opcode bytes at 0xffffffffffffffda.
May 26 16:41:12 pve-01 kernel: RSP: 0018:ffffae8b4a78b7d8 EFLAGS: 00010046
May 26 16:41:12 pve-01 kernel: RAX: 00000041a292b35c RBX: ffffae8b4a78b7d8 RCX: 0000000000000000
May 26 16:41:12 pve-01 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
May 26 16:41:12 pve-01 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
May 26 16:41:12 pve-01 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffb8b6c910
May 26 16:41:12 pve-01 kernel: R13: ffff892d4cf48000 R14: 000000000000000a R15: ffff893fff334bc0
May 26 16:41:12 pve-01 kernel: FS:  000077b8dbe006c0(0000) GS:ffff893fff200000(0000) knlGS:000000841a58c000
May 26 16:41:12 pve-01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 26 16:41:12 pve-01 kernel: CR2: ffffffffffffffda CR3: 0000000d88248000 CR4: 0000000000f52ef0
May 26 16:41:12 pve-01 kernel: PKRU: 55555554
May 26 16:41:12 pve-01 kernel: note: CPU 0/KVM[3715] exited with irqs disabled
May 26 16:41:12 pve-01 kernel: note: CPU 0/KVM[3715] exited with preempt_count 4
May 26 16:41:38 pve-01 kernel: watchdog: Watchdog detected hard LOCKUP on cpu 22
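Since the box locks up hard, I am also thinking about setting up kdump so that the next oops leaves a full crash dump behind instead of only this trace. A sketch, assuming GRUB and the standard Debian `kdump-tools` package; the 512M crashkernel reservation is just my guess and would need tuning:

```shell
# Install the Debian kdump tooling
apt install kdump-tools

# Reserve memory for the capture kernel (512M is an assumption, not a recommendation)
echo 'GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT crashkernel=512M"' \
    > /etc/default/grub.d/crashkernel.cfg
update-grub

# After a reboot, confirm kdump is armed:
kdump-config show
```

Would that be a sensible next step, or is there a simpler way to diagnose this?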