I have the following problem: my Proxmox host completely freezes, along with all guest VMs, and I have to hard-reboot the system.
Possibly relevant details: the freezes started after I installed the NVIDIA GRID drivers and passed virtual GPUs (vGPUs) through to guest VMs.
Could you please advise what I can check to diagnose the issue?
Thanks!
Here are the package versions (pveversion -v):
Code:
proxmox-ve: 8.2.0 (running kernel: 6.8.4-3-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-3
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5: 6.5.13-5
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
ceph-fuse: 17.2.7-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.2
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.2-1
proxmox-backup-file-restore: 3.2.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.7
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2
Here is the dmesg output:
Code:
May 26 14:13:11 pve-01 kernel: NVRM: GPU at 0000:01:00.0 has software scheduler ENABLED with policy BEST_EFFORT.
May 26 14:13:11 pve-01 kernel: softdog: initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
May 26 14:13:11 pve-01 kernel: softdog: soft_reboot_cmd=<not set> soft_active_on_boot=0
May 26 14:13:11 pve-01 kernel: RPC: Registered named UNIX socket transport module.
May 26 14:13:11 pve-01 kernel: RPC: Registered udp transport module.
May 26 14:13:11 pve-01 kernel: RPC: Registered tcp transport module.
May 26 14:13:11 pve-01 kernel: RPC: Registered tcp-with-tls transport module.
May 26 14:13:11 pve-01 kernel: RPC: Registered tcp NFSv4.1 backchannel transport module.
May 26 14:13:11 pve-01 kernel: vmbr0: port 1(enp3s0) entered blocking state
May 26 14:13:11 pve-01 kernel: vmbr0: port 1(enp3s0) entered disabled state
May 26 14:13:11 pve-01 kernel: r8169 0000:03:00.0 enp3s0: entered allmulticast mode
May 26 14:13:11 pve-01 kernel: r8169 0000:03:00.0 enp3s0: entered promiscuous mode
May 26 14:13:11 pve-01 kernel: RTL8226B_RTL8221B 2.5Gbps PHY r8169-0-300:00: attached PHY driver (mii_bus:phy_addr=r8169-0-300:00, irq=MAC)
May 26 14:13:11 pve-01 kernel: NVRM: GPU at 0000:04:00.0 has software scheduler ENABLED with policy BEST_EFFORT.
May 26 14:13:11 pve-01 kernel: r8169 0000:03:00.0 enp3s0: Link is Down
May 26 14:13:11 pve-01 kernel: vmbr0: port 1(enp3s0) entered blocking state
May 26 14:13:11 pve-01 kernel: vmbr0: port 1(enp3s0) entered forwarding state
May 26 14:13:12 pve-01 kernel: nvidia 0000:01:00.0: MDEV: Registered
May 26 14:13:12 pve-01 kernel: nvidia 0000:04:00.0: MDEV: Registered
May 26 14:13:12 pve-01 kernel: vmbr0: port 1(enp3s0) entered disabled state
May 26 14:13:14 pve-01 kernel: r8169 0000:03:00.0 enp3s0: Link is Up - 1Gbps/Full - flow control off
May 26 14:13:14 pve-01 kernel: vmbr0: port 1(enp3s0) entered blocking state
May 26 14:13:14 pve-01 kernel: vmbr0: port 1(enp3s0) entered forwarding state
May 26 14:13:16 pve-01 kernel: evm: overlay not supported
May 26 14:13:26 pve-01 kernel: Initializing XFRM netlink socket
May 26 14:13:26 pve-01 kernel: br-d27867211049: port 1(veth1fa1175) entered blocking state
May 26 14:13:26 pve-01 kernel: br-d27867211049: port 1(veth1fa1175) entered disabled state
May 26 14:13:26 pve-01 kernel: veth1fa1175: entered allmulticast mode
May 26 14:13:26 pve-01 kernel: veth1fa1175: entered promiscuous mode
May 26 14:13:26 pve-01 kernel: eth0: renamed from veth2841ed4
May 26 14:13:27 pve-01 kernel: br-d27867211049: port 1(veth1fa1175) entered blocking state
May 26 14:13:27 pve-01 kernel: br-d27867211049: port 1(veth1fa1175) entered forwarding state
May 26 14:18:06 pve-01 kernel: br-d27867211049: port 1(veth1fa1175) entered disabled state
May 26 14:18:06 pve-01 kernel: veth2841ed4: renamed from eth0
May 26 14:18:06 pve-01 kernel: br-d27867211049: port 1(veth1fa1175) entered disabled state
May 26 14:18:06 pve-01 kernel: veth1fa1175 (unregistering): left allmulticast mode
May 26 14:18:06 pve-01 kernel: veth1fa1175 (unregistering): left promiscuous mode
May 26 14:18:06 pve-01 kernel: br-d27867211049: port 1(veth1fa1175) entered disabled state
May 26 14:18:13 pve-01 kernel: watchdog: watchdog0: watchdog did not stop!
May 26 14:18:13 pve-01 systemd-shutdown[1]: Using hardware watchdog 'Software Watchdog', version 0, device /dev/watchdog0
May 26 14:18:13 pve-01 systemd-shutdown[1]: Watchdog running with a timeout of 10min.
May 26 14:18:13 pve-01 systemd-shutdown[1]: Syncing filesystems and block devices.
May 26 14:18:13 pve-01 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
May 26 14:18:13 pve-01 systemd-journald[968]: Received SIGTERM from PID 1 (systemd-shutdow).
Here is the syslog:
Code:
May 26 13:53:03 pve-01 qm[6573]: <root@pam> end task UPID:pve-01:000019AE:0000BEB7:6653148B:qmstart:101:root@pam: OK
May 26 13:53:03 pve-01 kernel: split_lock_warn: 1 callbacks suppressed
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 3/KVM/6679 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 11/KVM/6687 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 4/KVM/6680 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 1/KVM/6677 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 8/KVM/6684 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 5/KVM/6681 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 12/KVM/6688 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 9/KVM/6685 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 7/KVM/6683 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:03 pve-01 kernel: x86/split lock detection: #AC: CPU 13/KVM/6689 took a split_lock trap at address: 0x7ee5d050
May 26 13:53:09 pve-01 nvidia-vgpu-mgr[6520]: notice: vmiop_log: (0x0): vGPU license state: Licensed
May 26 13:53:13 pve-01 nvidia-vgpu-mgr[6700]: notice: vmiop_log: ######## Guest NVIDIA Driver Information: ########
May 26 13:53:13 pve-01 nvidia-vgpu-mgr[6700]: notice: vmiop_log: Driver Version: 551.78
May 26 13:53:13 pve-01 nvidia-vgpu-mgr[6700]: notice: vmiop_log: vGPU version: 0x140001
May 26 13:53:13 pve-01 nvidia-vgpu-mgr[6700]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Unrestricted)
May 26 13:53:23 pve-01 nvidia-vgpu-mgr[6700]: notice: vmiop_log: (0x0): vGPU license state: Licensed
May 26 13:54:41 pve-01 kernel: split_lock_warn: 4 callbacks suppressed
May 26 13:54:41 pve-01 kernel: x86/split lock detection: #AC: CPU 0/KVM/6184 took a split_lock trap at address: 0xfffff80268e498c5
May 26 13:54:50 pve-01 kernel: x86/split lock detection: #AC: CPU 0/KVM/6366 took a split_lock trap at address: 0xfffff80233049aea
May 26 13:55:34 pve-01 kernel: x86/split lock detection: #AC: CPU 0/KVM/6676 took a split_lock trap at address: 0x597a21af
May 26 13:59:56 pve-01 systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...
░░ Subject: A start job for unit systemd-tmpfiles-clean.service has begun execution
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit systemd-tmpfiles-clean.service has begun execution.
░░
░░ The job identifier is 466.
May 26 13:59:56 pve-01 systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
░░ Subject: Unit succeeded
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit systemd-tmpfiles-clean.service has successfully entered the 'dead' state.
May 26 13:59:56 pve-01 systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary Directories.
░░ Subject: A start job for unit systemd-tmpfiles-clean.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit systemd-tmpfiles-clean.service has finished successfully.
░░
░░ The job identifier is 466.
May 26 13:59:56 pve-01 systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deactivated successfully.
░░ Subject: Unit succeeded
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount has successfully entered the 'dead' state.
May 26 14:01:47 pve-01 pvedaemon[2435]: <root@pam> successful auth for user 'root@pam'