Need help: Proxmox 8.1.3 suddenly becomes unresponsive with a blank screen, until hard reboot

Aug 7, 2023
Hi,

I have a node with NVMe storage, 64GB DDR5, a 7900X, and 2x 4060 GPUs.

pve-manager/8.1.3/b46aac3b42da5d15 (running kernel: 6.5.11-4-pve)

BIOS: C-states off, UEFI boot

Every 2-3 weeks this node randomly goes completely offline: it doesn't even respond to ping, and when I attach a display I see a blank screen.
I need to hard-reboot the entire node to get it working again.

Note: I have a script that soft-reboots the node every day.

2 VMs, each with 1 GPU passed through, run daily at nearly 80-90% of max capacity.

I have no clue why this node is going down like this.

No errors show up in RAM tests or smartctl, and it doesn't look like a power issue. The node simply ends up in a hung state, and it's absolutely unpredictable; sometimes it runs for days at max load.
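Since the console is blank after the hang, one thing worth doing before the next freeze is making the systemd journal persistent, so messages from the crashed boot survive the hard reset. A sketch using the standard systemd paths (commands need root; on Proxmox the journal may already be persistent):

```shell
# Create the persistent journal directory; journald switches to disk
# storage automatically when /var/log/journal exists (Storage=auto).
mkdir -p /var/log/journal
systemd-tmpfiles --create --prefix /var/log/journal
systemctl restart systemd-journald

# After the next hang and hard reboot, read the end of the crashed
# boot's log (-b -1 = previous boot, -e = jump to the end):
journalctl -b -1 -e
```

If the kernel dies hard enough that nothing reaches disk, netconsole or a serial console is the next step.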


These are the logs:


Apr 26 11:12:49 prod-node-1 systemd[1]: pve-daily-update.service: Consumed 1.656s CPU time.

Apr 26 11:17:01 prod-node-1 CRON[35572]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)

Apr 26 11:17:01 prod-node-1 CRON[35573]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)

Apr 26 11:17:01 prod-node-1 CRON[35572]: pam_unix(cron:session): session closed for user root

Apr 26 11:22:10 prod-node-1 pvedaemon[1243]: <root@pam!api_token> update VM 100: -virtio1 dnd1:100/vm.qcow2,backup=0,cache=writeback,discard=on,snapshot=1

Apr 26 11:22:11 prod-node-1 pvedaemon[36351]: start VM 100: UPID:prod-node-1:00008DFF:0014E3DE:662B410B:qmstart:100:root@pam!api_token:

Apr 26 11:22:11 prod-node-1 pvedaemon[1242]: <root@pam!api_token> starting task UPID:prod-node-1:00008DFF:0014E3DE:662B410B:qmstart:100:root@pam!api_token:

Apr 26 11:22:11 prod-node-1 systemd[1]: Started 100.scope.

Apr 26 11:22:11 prod-node-1 kernel: tap100i0: entered promiscuous mode

Apr 26 11:22:11 prod-node-1 kernel: vmbr0: port 2(fwpr100p0) entered blocking state

Apr 26 11:22:11 prod-node-1 kernel: vmbr0: port 2(fwpr100p0) entered disabled state

Apr 26 11:22:11 prod-node-1 kernel: fwpr100p0: entered allmulticast mode

Apr 26 11:22:11 prod-node-1 kernel: fwpr100p0: entered promiscuous mode

Apr 26 11:22:11 prod-node-1 kernel: vmbr0: port 2(fwpr100p0) entered blocking state

Apr 26 11:22:11 prod-node-1 kernel: vmbr0: port 2(fwpr100p0) entered forwarding state

Apr 26 11:22:11 prod-node-1 kernel: fwbr100i0: port 1(fwln100i0) entered blocking state

Apr 26 11:22:11 prod-node-1 kernel: fwbr100i0: port 1(fwln100i0) entered disabled state

Apr 26 11:22:11 prod-node-1 kernel: fwln100i0: entered allmulticast mode

Apr 26 11:22:11 prod-node-1 kernel: fwln100i0: entered promiscuous mode

Apr 26 11:22:11 prod-node-1 kernel: fwbr100i0: port 1(fwln100i0) entered blocking state

Apr 26 11:22:11 prod-node-1 kernel: fwbr100i0: port 1(fwln100i0) entered forwarding state

Apr 26 11:22:12 prod-node-1 kernel: fwbr100i0: port 2(tap100i0) entered blocking state

Apr 26 11:22:12 prod-node-1 kernel: fwbr100i0: port 2(tap100i0) entered disabled state

Apr 26 11:22:12 prod-node-1 kernel: tap100i0: entered allmulticast mode

Apr 26 11:22:12 prod-node-1 kernel: fwbr100i0: port 2(tap100i0) entered blocking state

Apr 26 11:22:12 prod-node-1 kernel: fwbr100i0: port 2(tap100i0) entered forwarding state

Apr 26 11:22:14 prod-node-1 pvedaemon[1242]: <root@pam!api_token> end task UPID:prod-node-1:00008DFF:0014E3DE:662B410B:qmstart:100:root@pam!api_token: OK

Apr 26 11:26:13 prod-node-1 pvedaemon[1242]: <root@pam!api_token> update VM 101: -virtio1 dnd1:100/disk2.qcow2,backup=0,cache=writeback,discard=on,snapshot=1

Apr 26 11:26:17 prod-node-1 pvedaemon[1243]: <root@pam!api_token> starting task UPID:prod-node-1:00009211:00154418:662B4201:qmstart:101:root@pam!api_token:

Apr 26 11:26:17 prod-node-1 pvedaemon[37393]: start VM 101: UPID:prod-node-1:00009211:00154418:662B4201:qmstart:101:root@pam!api_token:

Apr 26 11:26:18 prod-node-1 systemd[1]: Started 101.scope.

Apr 26 11:26:18 prod-node-1 kernel: tap101i0: entered promiscuous mode

Apr 26 11:26:18 prod-node-1 kernel: vmbr0: port 3(fwpr101p0) entered blocking state

Apr 26 11:26:18 prod-node-1 kernel: vmbr0: port 3(fwpr101p0) entered disabled state

Apr 26 11:26:18 prod-node-1 kernel: fwpr101p0: entered allmulticast mode

Apr 26 11:26:18 prod-node-1 kernel: fwpr101p0: entered promiscuous mode

Apr 26 11:26:18 prod-node-1 kernel: vmbr0: port 3(fwpr101p0) entered blocking state

Apr 26 11:26:18 prod-node-1 kernel: vmbr0: port 3(fwpr101p0) entered forwarding state

Apr 26 11:26:18 prod-node-1 kernel: fwbr101i0: port 1(fwln101i0) entered blocking state

Apr 26 11:26:18 prod-node-1 kernel: fwbr101i0: port 1(fwln101i0) entered disabled state

Apr 26 11:26:18 prod-node-1 kernel: fwln101i0: entered allmulticast mode

Apr 26 11:26:18 prod-node-1 kernel: fwln101i0: entered promiscuous mode

Apr 26 11:26:18 prod-node-1 kernel: fwbr101i0: port 1(fwln101i0) entered blocking state

Apr 26 11:26:18 prod-node-1 kernel: fwbr101i0: port 1(fwln101i0) entered forwarding state

Apr 26 11:26:18 prod-node-1 kernel: fwbr101i0: port 2(tap101i0) entered blocking state

Apr 26 11:26:18 prod-node-1 kernel: fwbr101i0: port 2(tap101i0) entered disabled state

Apr 26 11:26:18 prod-node-1 kernel: tap101i0: entered allmulticast mode

Apr 26 11:26:18 prod-node-1 kernel: fwbr101i0: port 2(tap101i0) entered blocking state

Apr 26 11:26:18 prod-node-1 kernel: fwbr101i0: port 2(tap101i0) entered forwarding state

Apr 26 11:26:26 prod-node-1 pvedaemon[1242]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries

Apr 26 11:26:27 prod-node-1 pvestatd[1211]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries

Apr 26 11:26:27 prod-node-1 pvedaemon[1243]: <root@pam!api_token> end task UPID:prod-node-1:00009211:00154418:662B4201:qmstart:101:root@pam!api_token: OK

Apr 26 11:26:27 prod-node-1 pvestatd[1211]: status update time (8.527 seconds)

Apr 26 11:31:43 prod-node-1 pvedaemon[38542]: stop VM 101: UPID:prod-node-1:0000968E:0015C35A:662B4347:qmstop:101:root@pam!api_token:

Apr 26 11:31:43 prod-node-1 pvedaemon[1243]: <root@pam!api_token> starting task UPID:prod-node-1:0000968E:0015C35A:662B4347:qmstop:101:root@pam!api_token:

Apr 26 11:31:44 prod-node-1 kernel: tap101i0: left allmulticast mode

Apr 26 11:31:44 prod-node-1 kernel: fwbr101i0: port 2(tap101i0) entered disabled state

Apr 26 11:31:44 prod-node-1 kernel: fwbr101i0: port 1(fwln101i0) entered disabled state

Apr 26 11:31:44 prod-node-1 kernel: vmbr0: port 3(fwpr101p0) entered disabled state

Apr 26 11:31:44 prod-node-1 kernel: fwln101i0 (unregistering): left allmulticast mode

Apr 26 11:31:44 prod-node-1 kernel: fwln101i0 (unregistering): left promiscuous mode

Apr 26 11:31:44 prod-node-1 kernel: fwbr101i0: port 1(fwln101i0) entered disabled state

Apr 26 11:31:44 prod-node-1 kernel: fwpr101p0 (unregistering): left allmulticast mode

Apr 26 11:31:44 prod-node-1 kernel: fwpr101p0 (unregistering): left promiscuous mode

Apr 26 11:31:44 prod-node-1 kernel: vmbr0: port 3(fwpr101p0) entered disabled state

Apr 26 11:31:44 prod-node-1 qmeventd[850]: read: Connection reset by peer

Apr 26 11:31:44 prod-node-1 pvedaemon[1243]: <root@pam!api_token> end task UPID:prod-node-1:0000968E:0015C35A:662B4347:qmstop:101:root@pam!api_token: OK

Apr 26 11:31:44 prod-node-1 qmeventd[38563]: Starting cleanup for 101

Apr 26 11:31:44 prod-node-1 qmeventd[38563]: Finished cleanup for 101

Apr 26 11:31:45 prod-node-1 systemd[1]: 101.scope: Deactivated successfully.

Apr 26 11:31:45 prod-node-1 systemd[1]: 101.scope: Consumed 4min 42.105s CPU time.

Apr 26 11:38:10 prod-node-1 pvestatd[1211]: metrics send error 'InfluxDB': 500 Can't connect to 192.168.1.12:2222 (Connection timed out)

-- Reboot --

Apr 26 15:46:26 prod-node-1 kernel: Linux version 6.5.11-4-pve (fgruenbichler@yuna) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-4 (2023-11-20T10:19Z) ()

Apr 26 15:46:26 prod-node-1 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.5.11-4-pve root=/dev/mapper/pve-root ro quiet amd_iommu=on iommu=pt textonly initcall_blacklist=sysfb_init pcie_aspm=off pcie_port_pm=off

Apr 26 15:46:26 prod-node-1 kernel: KERNEL supported cpus:
 
Looks like something went wrong within VM 101.
Does that VM have PCIe passthrough devices?

PCIe passthrough enables a VM to take down the whole node if the VM really wants to (or if some broken process accidentally makes it happen). So use it with caution.
 
Yeah, the VM has PCI GPU passthrough @zzz09700

How do I defend against such possibilities? What are the strategies to stop a VM with GPU passthrough from taking down the node? It's a common use case, after all.
Usually the easiest way is to capture what is happening inside the VM and then fix whatever is wrong with it.
The other route is to fix broken drivers / BIOS / GPU BIOS.

Or there may be no viable solution if the IOMMU implementation of that motherboard is broken. This happens a lot with consumer boards, and sometimes even with server boards from specific vendors, cough... ASRock, Gigabyte... cough.
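A quick sanity check on the board's IOMMU grouping is worth a look before anything else. A sketch that lists each group and its PCI devices (assumes the standard Linux sysfs layout and `pciutils` installed; if the GPU shares a group with unrelated devices, passthrough tends to be fragile):

```shell
#!/bin/sh
# List every IOMMU group and the PCI devices it contains.
# A healthy passthrough setup has the GPU (and at most its own
# audio function) alone in its group.
list_iommu_groups() {
    for g in /sys/kernel/iommu_groups/*; do
        # No groups present: IOMMU disabled or unsupported.
        [ -d "$g" ] || return 0
        echo "IOMMU group ${g##*/}:"
        for d in "$g"/devices/*; do
            # lspci -nns prints the device with vendor:device IDs.
            lspci -nns "${d##*/}" | sed 's/^/  /'
        done
    done
}

list_iommu_groups
```

If the grouping looks bad, the usual knobs are a different PCIe slot, a BIOS update, or (as a last resort, with known security trade-offs) the ACS override patch.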
 
