Hi,
I have a node with NVMe storage, 64 GB DDR5, a 7900X CPU, and two 4060 GPUs.
pve-manager/8.1.3/b46aac3b42da5d15 (running kernel: 6.5.11-4-pve)
BIOS: C-states off, UEFI
Every 2-3 weeks, at random, this node goes completely offline: it stops responding to ping over the network, and when I attach a display I see only a blank screen.
I have to hard-reboot the entire node to get it working again.
Note: I have a script that soft-reboots the node every day.
Two VMs, each with one GPU passed through, run daily at roughly 80-90% of max capacity.
I have no clue why this node keeps going down like this.
No errors show up for the RAM or in smartctl, and it does not look like a power issue either, since the node ends up in a hung state. It is completely unpredictable; sometimes it runs for days at max load.
These are the logs:
Apr 26 11:12:49 prod-node-1 systemd[1]: pve-daily-update.service: Consumed 1.656s CPU time.
Apr 26 11:17:01 prod-node-1 CRON[35572]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Apr 26 11:17:01 prod-node-1 CRON[35573]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Apr 26 11:17:01 prod-node-1 CRON[35572]: pam_unix(cron:session): session closed for user root
Apr 26 11:22:10 prod-node-1 pvedaemon[1243]: <root@pam!api_token> update VM 100: -virtio1 dnd1:100/vm.qcow2,backup=0,cache=writeback,discard=on,snapshot=1
Apr 26 11:22:11 prod-node-1 pvedaemon[36351]: start VM 100: UPID:prod-node-1:00008DFF:0014E3DE:662B410B:qmstart:100:root@pam!api_token:
Apr 26 11:22:11 prod-node-1 pvedaemon[1242]: <root@pam!api_token> starting task UPID:prod-node-1:00008DFF:0014E3DE:662B410B:qmstart:100:root@pam!api_token:
Apr 26 11:22:11 prod-node-1 systemd[1]: Started 100.scope.
Apr 26 11:22:11 prod-node-1 kernel: tap100i0: entered promiscuous mode
Apr 26 11:22:11 prod-node-1 kernel: vmbr0: port 2(fwpr100p0) entered blocking state
Apr 26 11:22:11 prod-node-1 kernel: vmbr0: port 2(fwpr100p0) entered disabled state
Apr 26 11:22:11 prod-node-1 kernel: fwpr100p0: entered allmulticast mode
Apr 26 11:22:11 prod-node-1 kernel: fwpr100p0: entered promiscuous mode
Apr 26 11:22:11 prod-node-1 kernel: vmbr0: port 2(fwpr100p0) entered blocking state
Apr 26 11:22:11 prod-node-1 kernel: vmbr0: port 2(fwpr100p0) entered forwarding state
Apr 26 11:22:11 prod-node-1 kernel: fwbr100i0: port 1(fwln100i0) entered blocking state
Apr 26 11:22:11 prod-node-1 kernel: fwbr100i0: port 1(fwln100i0) entered disabled state
Apr 26 11:22:11 prod-node-1 kernel: fwln100i0: entered allmulticast mode
Apr 26 11:22:11 prod-node-1 kernel: fwln100i0: entered promiscuous mode
Apr 26 11:22:11 prod-node-1 kernel: fwbr100i0: port 1(fwln100i0) entered blocking state
Apr 26 11:22:11 prod-node-1 kernel: fwbr100i0: port 1(fwln100i0) entered forwarding state
Apr 26 11:22:12 prod-node-1 kernel: fwbr100i0: port 2(tap100i0) entered blocking state
Apr 26 11:22:12 prod-node-1 kernel: fwbr100i0: port 2(tap100i0) entered disabled state
Apr 26 11:22:12 prod-node-1 kernel: tap100i0: entered allmulticast mode
Apr 26 11:22:12 prod-node-1 kernel: fwbr100i0: port 2(tap100i0) entered blocking state
Apr 26 11:22:12 prod-node-1 kernel: fwbr100i0: port 2(tap100i0) entered forwarding state
Apr 26 11:22:14 prod-node-1 pvedaemon[1242]: <root@pam!api_token> end task UPID:prod-node-1:00008DFF:0014E3DE:662B410B:qmstart:100:root@pam!api_token: OK
Apr 26 11:26:13 prod-node-1 pvedaemon[1242]: <root@pam!api_token> update VM 101: -virtio1 dnd1:100/disk2.qcow2,backup=0,cache=writeback,discard=on,snapshot=1
Apr 26 11:26:17 prod-node-1 pvedaemon[1243]: <root@pam!api_token> starting task UPID:prod-node-1:00009211:00154418:662B4201:qmstart:101:root@pam!api_token:
Apr 26 11:26:17 prod-node-1 pvedaemon[37393]: start VM 101: UPID:prod-node-1:00009211:00154418:662B4201:qmstart:101:root@pam!api_token:
Apr 26 11:26:18 prod-node-1 systemd[1]: Started 101.scope.
Apr 26 11:26:18 prod-node-1 kernel: tap101i0: entered promiscuous mode
Apr 26 11:26:18 prod-node-1 kernel: vmbr0: port 3(fwpr101p0) entered blocking state
Apr 26 11:26:18 prod-node-1 kernel: vmbr0: port 3(fwpr101p0) entered disabled state
Apr 26 11:26:18 prod-node-1 kernel: fwpr101p0: entered allmulticast mode
Apr 26 11:26:18 prod-node-1 kernel: fwpr101p0: entered promiscuous mode
Apr 26 11:26:18 prod-node-1 kernel: vmbr0: port 3(fwpr101p0) entered blocking state
Apr 26 11:26:18 prod-node-1 kernel: vmbr0: port 3(fwpr101p0) entered forwarding state
Apr 26 11:26:18 prod-node-1 kernel: fwbr101i0: port 1(fwln101i0) entered blocking state
Apr 26 11:26:18 prod-node-1 kernel: fwbr101i0: port 1(fwln101i0) entered disabled state
Apr 26 11:26:18 prod-node-1 kernel: fwln101i0: entered allmulticast mode
Apr 26 11:26:18 prod-node-1 kernel: fwln101i0: entered promiscuous mode
Apr 26 11:26:18 prod-node-1 kernel: fwbr101i0: port 1(fwln101i0) entered blocking state
Apr 26 11:26:18 prod-node-1 kernel: fwbr101i0: port 1(fwln101i0) entered forwarding state
Apr 26 11:26:18 prod-node-1 kernel: fwbr101i0: port 2(tap101i0) entered blocking state
Apr 26 11:26:18 prod-node-1 kernel: fwbr101i0: port 2(tap101i0) entered disabled state
Apr 26 11:26:18 prod-node-1 kernel: tap101i0: entered allmulticast mode
Apr 26 11:26:18 prod-node-1 kernel: fwbr101i0: port 2(tap101i0) entered blocking state
Apr 26 11:26:18 prod-node-1 kernel: fwbr101i0: port 2(tap101i0) entered forwarding state
Apr 26 11:26:26 prod-node-1 pvedaemon[1242]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Apr 26 11:26:27 prod-node-1 pvestatd[1211]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Apr 26 11:26:27 prod-node-1 pvedaemon[1243]: <root@pam!api_token> end task UPID:prod-node-1:00009211:00154418:662B4201:qmstart:101:root@pam!api_token: OK
Apr 26 11:26:27 prod-node-1 pvestatd[1211]: status update time (8.527 seconds)
Apr 26 11:31:43 prod-node-1 pvedaemon[38542]: stop VM 101: UPID:prod-node-1:0000968E:0015C35A:662B4347:qmstop:101:root@pam!api_token:
Apr 26 11:31:43 prod-node-1 pvedaemon[1243]: <root@pam!api_token> starting task UPID:prod-node-1:0000968E:0015C35A:662B4347:qmstop:101:root@pam!api_token:
Apr 26 11:31:44 prod-node-1 kernel: tap101i0: left allmulticast mode
Apr 26 11:31:44 prod-node-1 kernel: fwbr101i0: port 2(tap101i0) entered disabled state
Apr 26 11:31:44 prod-node-1 kernel: fwbr101i0: port 1(fwln101i0) entered disabled state
Apr 26 11:31:44 prod-node-1 kernel: vmbr0: port 3(fwpr101p0) entered disabled state
Apr 26 11:31:44 prod-node-1 kernel: fwln101i0 (unregistering): left allmulticast mode
Apr 26 11:31:44 prod-node-1 kernel: fwln101i0 (unregistering): left promiscuous mode
Apr 26 11:31:44 prod-node-1 kernel: fwbr101i0: port 1(fwln101i0) entered disabled state
Apr 26 11:31:44 prod-node-1 kernel: fwpr101p0 (unregistering): left allmulticast mode
Apr 26 11:31:44 prod-node-1 kernel: fwpr101p0 (unregistering): left promiscuous mode
Apr 26 11:31:44 prod-node-1 kernel: vmbr0: port 3(fwpr101p0) entered disabled state
Apr 26 11:31:44 prod-node-1 qmeventd[850]: read: Connection reset by peer
Apr 26 11:31:44 prod-node-1 pvedaemon[1243]: <root@pam!api_token> end task UPID:prod-node-1:0000968E:0015C35A:662B4347:qmstop:101:root@pam!api_token: OK
Apr 26 11:31:44 prod-node-1 qmeventd[38563]: Starting cleanup for 101
Apr 26 11:31:44 prod-node-1 qmeventd[38563]: Finished cleanup for 101
Apr 26 11:31:45 prod-node-1 systemd[1]: 101.scope: Deactivated successfully.
Apr 26 11:31:45 prod-node-1 systemd[1]: 101.scope: Consumed 4min 42.105s CPU time.
Apr 26 11:38:10 prod-node-1 pvestatd[1211]: metrics send error 'InfluxDB': 500 Can't connect to 192.168.1.12:2222 (Connection timed out)
-- Reboot --
Apr 26 15:46:26 prod-node-1 kernel: Linux version 6.5.11-4-pve (fgruenbichler@yuna) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-4 (2023-11-20T10:19Z) ()
Apr 26 15:46:26 prod-node-1 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.5.11-4-pve root=/dev/mapper/pve-root ro quiet amd_iommu=on iommu=pt textonly initcall_blacklist=sysfb_init pcie_aspm=off pcie_port_pm=off
Apr 26 15:46:26 prod-node-1 kernel: KERNEL supported cpus:
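
To narrow down when the node actually hung, the gap between the last entry before "-- Reboot --" and the first entry after it can be measured. Below is a minimal Python sketch of that idea, not something I run on the node. The function name reboot_gaps is hypothetical, the year is an assumption (these log timestamps do not carry one), and month names are assumed to parse under an English/C locale.

```python
from datetime import datetime

def reboot_gaps(lines, year=2024):
    """Return (last_before, first_after) timestamp pairs around each
    '-- Reboot --' marker in journal-style log lines.

    Assumption: each log line starts with a 15-character syslog
    timestamp such as 'Apr 26 11:38:10'; the year is supplied by the
    caller because the log lines do not include it.
    """
    gaps = []
    last_ts = None   # timestamp of the most recent parsed line
    pending = None   # last timestamp seen before a reboot marker
    for line in lines:
        if line.strip() == "-- Reboot --":
            pending = last_ts
            continue
        try:
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if pending is not None:
            gaps.append((pending, ts))
            pending = None
        last_ts = ts
    return gaps


if __name__ == "__main__":
    sample = [
        "Apr 26 11:38:10 prod-node-1 pvestatd[1211]: metrics send error 'InfluxDB'",
        "-- Reboot --",
        "Apr 26 15:46:26 prod-node-1 kernel: Linux version 6.5.11-4-pve",
    ]
    for before, after in reboot_gaps(sample):
        print(f"last entry {before}, first entry after reboot {after}, gap {after - before}")
```

For the excerpt above, the last entry before the reboot is at 11:38:10 and the first entry afterwards is at 15:46:26, so the hang happened somewhere in that window and nothing was written to the journal before it.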