Host reboots during backups

resimarc · Sep 12, 2024

Hi!

We have a problem with one of our Proxmox hosts. This is a cluster with 3 hosts.

One of the hosts runs 7 VMs with CentOS 7. It can back up all of them without errors, except for one VM that has a lot of storage. The backup of that VM starts, and then the host suddenly reboots. In the logs we can see the host apparently losing its connection to the backup storage, but that's not true: at the moment it reports the disconnect, I can still access the backup storage from any other Proxmox host.

We tried doing this backup locally, but it fails the same way. We have HA set up, so on Mondays, after getting the "backup failed" email caused by this error, we always find all services running on the other server.

We seem to have issues with ZFS RAM usage too. This host has 188GB of RAM and ZFS will sometimes use up to 100GB; I assume that's because it has 12TB of storage, 10TB of which are assigned to one VM. We worked around it with a cron job that releases 2GB of RAM from the cache every time it runs. We looked into limiting the cache, but we're doing something wrong and the limit is just not being applied. We were told ZFS shouldn't use more than 50% of memory, but in our case it uses far more (70% or more).
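For reference, the usual way to cap the ZFS cache (ARC) is the `zfs_arc_max` module parameter, which takes a value in bytes. A minimal sketch for producing the config line follows; the 16 GiB figure and the helper names are only illustrative assumptions, not a recommendation for this host.

```python
GIB = 1024 ** 3  # bytes in one GiB


def arc_limit_bytes(gib: int) -> int:
    """Convert an ARC limit given in GiB to the byte value zfs_arc_max expects."""
    if gib <= 0:
        raise ValueError("ARC limit must be a positive number of GiB")
    return gib * GIB


def modprobe_line(gib: int) -> str:
    """Build the line that would go into /etc/modprobe.d/zfs.conf."""
    return f"options zfs zfs_arc_max={arc_limit_bytes(gib)}"


if __name__ == "__main__":
    # 16 GiB is a hypothetical example value.
    print(modprobe_line(16))
```

The printed line would go into `/etc/modprobe.d/zfs.conf`. As far as I know, on a host that boots from ZFS the initramfs has to be refreshed (`update-initramfs -u`) and the host rebooted before the limit applies at boot; the same byte value can also be written to `/sys/module/zfs/parameters/zfs_arc_max` to change it at runtime.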

We can't see any errors in the logs other than the storage issue, but it doesn't make sense that the server suddenly reboots (it doesn't even shut down the VMs, it just does a hard reset!).

Sep 07 14:00:09 LabHost qm[3615100]: <root@pam> starting task UPID:LabHost:003729B7:03B4B051:66DC4049:qmpause:402:root@pam:
Sep 07 14:00:09 LabHost qm[3615159]: suspend VM 402: UPID:LabHost:003729B7:03B4B051:66DC4049:qmpause:402:root@pam:
Sep 07 14:00:09 LabHost qm[3615100]: <root@pam> end task UPID:LabHost:003729B7:03B4B051:66DC4049:qmpause:402:root@pam: OK
Sep 07 14:00:12 LabHost pvescheduler[3615092]: VM 402 qmp command failed - VM 402 qmp command 'guest-ping' failed - got timeout
Sep 07 14:10:01 LabHost CRON[3619506]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 07 14:10:01 LabHost CRON[3619507]: (root) CMD (sync; echo 3 > /proc/sys/vm/drop_caches)
Sep 07 14:10:26 LabHost kernel: sh (3619507): drop_caches: 3
Sep 07 14:10:26 LabHost CRON[3619506]: pam_unix(cron:session): session closed for user root
Sep 07 14:17:01 LabHost CRON[3622238]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 07 14:17:01 LabHost CRON[3622241]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 07 14:17:01 LabHost CRON[3622238]: pam_unix(cron:session): session closed for user root
Sep 07 14:19:08 LabHost pve-firewall[2019]: firewall update time (9.590 seconds)
Sep 07 14:19:12 LabHost pvestatd[2023]: status update time (33.853 seconds)
Sep 07 14:19:57 LabHost pve-ha-lrm[2064]: loop take too long (55 seconds)
Sep 07 14:20:01 LabHost CRON[3623200]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 07 14:20:01 LabHost CRON[3623201]: (root) CMD (sync; echo 3 > /proc/sys/vm/drop_caches)
Sep 07 14:20:04 LabHost pvestatd[2023]: status update time (52.078 seconds)
Sep 07 14:20:23 LabHost pve-ha-lrm[3623259]: VM 402 qmp command failed - VM 402 qmp command 'query-status' failed - got timeout
Sep 07 14:20:23 LabHost pve-ha-lrm[3623259]: VM 402 qmp command 'query-status' failed - got timeout
Sep 07 14:20:24 LabHost kernel: sh (3623201): drop_caches: 3
Sep 07 14:20:25 LabHost CRON[3623200]: pam_unix(cron:session): session closed for user root
Sep 07 14:20:36 LabHost pvestatd[2023]: storage 'ProxmoxBackups' is not online
Sep 07 14:20:45 LabHost pvestatd[2023]: status update time (40.971 seconds)
Sep 07 14:21:34 LabHost pve-firewall[2019]: firewall update time (15.472 seconds)
Sep 07 14:21:40 LabHost pvestatd[2023]: status update time (24.866 seconds)
Sep 07 14:21:56 LabHost pmxcfs[1924]: [dcdb] notice: data verification successful
Sep 07 14:21:58 LabHost pvestatd[2023]: VM 401 qmp command failed - VM 401 qmp command 'query-proxmox-support' failed - got timeout
Sep 07 14:22:21 LabHost pvestatd[2023]: VM 405 qmp command failed - VM 405 qmp command 'query-proxmox-support' failed - got timeout
Sep 07 14:22:21 LabHost pve-firewall[2019]: firewall update time (23.302 seconds)
Sep 07 14:22:21 LabHost pvestatd[2023]: status update time (30.892 seconds)
Sep 07 14:22:40 LabHost pvestatd[2023]: VM 402 qmp command failed - VM 402 qmp command 'query-proxmox-support' failed - unable to connect to VM 402 qmp socket - timeout after 51 retries
Sep 07 14:23:26 LabHost pvestatd[2023]: storage 'ProxmoxBackups' is not online
Sep 07 14:23:31 LabHost pve-ha-crm[2054]: loop take too long (57 seconds)
Sep 07 14:23:36 LabHost snmpd[1770]: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224
Sep 07 14:23:44 LabHost pvestatd[2023]: status update time (73.041 seconds)
-- Reboot --
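
To see how long the host daemons were stalling before the reset, the timing lines in a log like the one above can be pulled out with a short script. This is only a sketch: the regular expression, the function name and the 10-second threshold are assumptions based on the lines shown here, not an official log format.

```python
import re

# Matches lines such as
#   "Sep 07 14:19:57 LabHost pve-ha-lrm[2064]: loop take too long (55 seconds)"
# and captures the timestamp, the service name and the reported duration.
STALL = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ (?P<svc>[\w-]+)\[\d+\]: "
    r".*\((?P<secs>\d+(?:\.\d+)?) seconds\)"
)


def find_stalls(lines, threshold=10.0):
    """Return (timestamp, service, seconds) for durations >= threshold."""
    stalls = []
    for line in lines:
        match = STALL.match(line)
        if match and float(match["secs"]) >= threshold:
            stalls.append((match["ts"], match["svc"], float(match["secs"])))
    return stalls


SAMPLE = [
    "Sep 07 14:19:08 LabHost pve-firewall[2019]: firewall update time (9.590 seconds)",
    "Sep 07 14:19:57 LabHost pve-ha-lrm[2064]: loop take too long (55 seconds)",
    "Sep 07 14:23:44 LabHost pvestatd[2023]: status update time (73.041 seconds)",
]

if __name__ == "__main__":
    for ts, svc, secs in find_stalls(SAMPLE):
        print(f"{ts}  {svc:<12} {secs:.1f}s")
```

Run against the full journal, this makes it easier to see whether the `pve-ha-lrm` and `pve-ha-crm` loops were already taking close to a minute before the reset.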

Christoph Lechleitner · Sep 17, 2024

We have (or had) similar problems.

One PVE node sometimes rebooted when we synced backups from VMs to a USB hard disk attached to that VM.

Today we played with Minikube in another Debian 12 VM, and "minikube start --force" reliably triggered the hard reboot.
We then switched the VM's network card model from virtio to e1000, which seems to have solved the problem.
We also switched it back to virtio and the problem came back, then back to e1000 and the problem was gone again.

We're not sure yet whether the backup servers will survive that heavy sync-to-USB now; only time will tell.

Remark: While there are reports of bad network performance with e1000 emulation, we cannot confirm that. We ran transfer tests with iperf against my desktop machine and repeatedly measured 915 to 935 Mbit/s, no matter whether the guest VM was using virtio or e1000 networking.
