Hypervisor node freezes completely during backup

1monkey

Jun 6, 2026
Our team runs a 3-node Proxmox cluster. Every night we back up all of the VMs. Every 2-3 days, the nightly backup completely freezes one of the nodes.

Sometimes the freeze can be halted by stopping the backup job from another node, but a complete freeze is the usual outcome. In that case, the backup job's Stop button is disabled on the other nodes. No specific VM triggers the freeze; it varies.

SSH connections to the machine time out. The physical console of the server shows a completely frozen login prompt. Only a cold boot, either physically or via iLO, restores access to the machine.
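For reference, a minimal sketch of how the kernel log of the boot that froze could be scanned for common hang signatures after such a cold boot. The function name `find_hang_signatures` is hypothetical, and the pattern list is an assumption covering only hung-task, lockup and RCU-stall messages. It also assumes the journal is persistent; on a hard freeze the last messages may never reach the disk.

```shell
# Sketch: filter kernel log lines on stdin for typical hang signatures.
# The function name and the pattern list are illustrative assumptions,
# not an exhaustive set of messages a frozen node can emit.
find_hang_signatures() {
    # -i: match "soft lockup" as well as "hard LOCKUP"
    grep -iE 'blocked for more than [0-9]+ seconds|lockup|detected stall'
}

# Example use on the rebooted node (requires a persistent journal):
#   journalctl -k -b -1 --no-pager | find_hang_signatures
```

If this prints nothing, either no such message was logged or the log was never written out before the freeze.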

At first, our group suspected a memory issue, but after running memtest for more than 300 hours we concluded that this is not the case. Unsurprisingly, a node that has no VMs on it at backup time does not run into this state.

The backup job runs in Snapshot mode, with fleecing to local-lvm. The VMs mostly use an external Ceph cluster as storage.

The versions currently in use on the cluster:

Bash:
$ pveversion -v
proxmox-ve: 9.2.0 (running kernel: 7.0.6-2-pve)
pve-manager: 9.2.3 (running version: 9.2.3/d0fde103346cf89a)
proxmox-kernel-helper: 9.2.0
proxmox-kernel-7.0: 7.0.6-2
proxmox-kernel-7.0.6-2-pve-signed: 7.0.6-2
proxmox-kernel-7.0.2-3-pve-signed: 7.0.2-3
proxmox-kernel-6.17: 6.17.13-13
proxmox-kernel-6.17.13-13-pve-signed: 6.17.13-13
proxmox-kernel-6.14: 6.14.11-9
proxmox-kernel-6.14.11-9-pve-signed: 6.14.11-9
proxmox-kernel-6.8.12-13-pve-signed: 6.8.12-13
proxmox-kernel-6.8: 6.8.12-13
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9


The PBS is Proxmox Backup Server 4.2.0 (running kernel: 7.0.0-3-pve). It runs inside a single-node Proxmox cluster. Some backup tasks are simply missing from the PBS; others fail with logs similar to this:

Code:
2026-06-10T21:08:15+00:00: starting new backup on datastore 'aaa' from bbb: "vm/10061/2026-06-10T21:08:10Z"
2026-06-10T21:08:15+00:00: download 'index.json.blob' from previous backup 'vm/10061/2026-06-09T21:07:58Z'.
2026-06-10T21:08:15+00:00: register chunks in 'drive-scsi0.img.fidx' from previous backup 'vm/10061/2026-06-09T21:07:58Z'.
2026-06-10T21:08:15+00:00: download 'drive-scsi0.img.fidx' from previous backup 'vm/10061/2026-06-09T21:07:58Z'.
2026-06-10T21:08:15+00:00: created new fixed index 1 ("vm/10061/2026-06-10T21:08:10Z/drive-scsi0.img.fidx")
2026-06-10T21:08:15+00:00: add blob "/mnt/datastore/aaa/vm/10061/2026-06-10T21:08:10Z/qemu-server.conf.blob" (1770 bytes, comp: 1770)
2026-06-10T21:23:22+00:00: backup failed: connection error
2026-06-10T21:23:22+00:00: removing failed backup
2026-06-10T21:23:22+00:00: removing backup snapshot "/mnt/datastore/aaa/vm/10061/2026-06-10T21:08:10Z"
2026-06-10T21:23:22+00:00: TASK ERROR: connection error: timed out
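To make the timing in the log above easier to read, here is a small sketch that reports the longest silent gap between consecutive lines of a PBS task log. The function name `task_log_max_gap_seconds` is hypothetical, and the sketch assumes GNU `date` and `grep -o`, and that every relevant line starts with a timestamp in the format shown.

```shell
# Sketch: print the largest gap, in seconds, between consecutive
# timestamped lines read from stdin. Assumes GNU date (-d) and
# timestamps like 2026-06-10T21:08:15+00:00 at the start of each line.
task_log_max_gap_seconds() {
    grep -oE '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}[+-][0-9]{2}:[0-9]{2}' | {
        prev=''
        max_gap=0
        while IFS= read -r ts; do
            # Convert the timestamp to seconds since the epoch.
            epoch=$(date -u -d "$ts" +%s) || exit 1
            if [ -n "$prev" ] && [ $((epoch - prev)) -gt "$max_gap" ]; then
                max_gap=$((epoch - prev))
            fi
            prev=$epoch
        done
        echo "$max_gap"
    }
}
```

Fed the log above, this prints 907: the PBS side saw no activity for about 15 minutes between the last `add blob` line at 21:08:15 and the `connection error` at 21:23:22.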

Where should we look for the root cause of the problem? Has anyone experienced similar problems and maybe found a solution?