We have a 9-node Proxmox cluster running PVE 8.0.4 (last updated and restarted yesterday).
Storage is based on Ceph Quincy 17.2.6 (33 OSDs, at most 60% full, 40G network).
200 running VMs (virtio-win-0.1.215.iso).
Proxmox Backup Server 3.0-2
2 x 16TB HDDs as the ZFS backup drive
When we run our nightly backups, 10% to 20% of all backups fail.
It's always different VMs that fail, but it's always the same log output.
The VMs keep running without any problems after that.
So the VMs did not hang or freeze as described in other threads.
Also, 4 hours earlier we always take snapshots of all VMs. That task never fails. (But maybe the snapshot command does not fs-freeze and fs-thaw the VMs?)
The log looks like this:
Code:
INFO: Backup finished at 2023-10-19 02:19:03
INFO: Starting Backup of VM 151 (qemu)
INFO: Backup started at 2023-10-19 02:19:03
INFO: status = running
INFO: VM Name: esx-Tina
INFO: include disk 'scsi0' 'VM_Festplatten_NVME:base-133-disk-1/vm-151-disk-1' 156G
INFO: include disk 'efidisk0' 'VM_Festplatten_NVME:base-133-disk-0/vm-151-disk-0' 528K
INFO: include disk 'tpmstate0' 'VM_Festplatten_NVME:base-133-disk-2/vm-151-disk-2' 4M
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: snapshots found (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/151/2023-10-19T00:19:03Z'
INFO: attaching TPM drive to QEMU for backup
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 151 qmp command 'backup' failed - got timeout
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 151 failed - VM 151 qmp command 'backup' failed - got timeout
INFO: Failed at 2023-10-19 02:22:55
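To see how long each failed job ran before it was aborted, the task log can be summarized with a small script. This is only a rough sketch: it assumes the log lines have exactly the shape shown above, and the function name `failed_backups` is made up for illustration.

```python
import re
from datetime import datetime

# Patterns for the vzdump task log lines as they appear in the log above.
START = re.compile(r"INFO: Backup started at (\S+ \S+)")
FAILED_VM = re.compile(r"ERROR: Backup of VM (\d+) failed - (.*)")
FAILED_AT = re.compile(r"INFO: Failed at (\S+ \S+)")
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def failed_backups(text):
    """Return (vmid, reason, seconds_until_failure) for each failed backup."""
    results = []
    started = None
    pending = None
    for line in text.splitlines():
        if m := START.search(line):
            started = datetime.strptime(m.group(1), TS_FORMAT)
        elif m := FAILED_VM.search(line):
            pending = (int(m.group(1)), m.group(2))
        elif (m := FAILED_AT.search(line)) and pending and started:
            ended = datetime.strptime(m.group(1), TS_FORMAT)
            results.append((pending[0], pending[1], (ended - started).total_seconds()))
            pending = None
    return results


# Lines copied from the task log above.
LOG = """\
INFO: Starting Backup of VM 151 (qemu)
INFO: Backup started at 2023-10-19 02:19:03
ERROR: VM 151 qmp command 'backup' failed - got timeout
ERROR: Backup of VM 151 failed - VM 151 qmp command 'backup' failed - got timeout
INFO: Failed at 2023-10-19 02:22:55
"""

for vmid, reason, seconds in failed_backups(LOG):
    print(f"VM {vmid}: failed after {seconds:.0f}s ({reason})")
```

Run over a whole night's task log, this would give a list of the failing VMs and how long each one took to fail, which makes it easier to compare nights.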
I don't know what is running into a timeout.
Also, I see that 'fs-freeze' and 'fs-thaw' are guest-agent commands.
So I should mention that all the VMs are running the guest agent from virtio-win-0.1.215.iso.
Log on the backup server:
Code:
Oct 19 02:19:06 proxb1 proxmox-backup-proxy[1358]: starting new backup on datastore 'Backup': "ns/Buero/vm/151/2023-10-19T00:19:03Z"
Oct 19 02:19:06 proxb1 proxmox-backup-proxy[1358]: download 'index.json.blob' from previous backup.
Oct 19 02:19:06 proxb1 proxmox-backup-proxy[1358]: register chunks in 'drive-efidisk0.img.fidx' from previous backup.
Oct 19 02:19:06 proxb1 proxmox-backup-proxy[1358]: download 'drive-efidisk0.img.fidx' from previous backup.
Oct 19 02:19:06 proxb1 proxmox-backup-proxy[1358]: created new fixed index 1 ("ns/Buero/vm/151/2023-10-19T00:19:03Z/drive-efidisk0.img.fidx")
Oct 19 02:19:11 proxb1 proxmox-backup-proxy[1358]: add blob "/mnt/datastore/Backup/ns/Buero/vm/140/2023-10-19T00:09:41Z/qemu-server.conf.blob" (523 bytes, comp: 523)
Oct 19 02:19:11 proxb1 proxmox-backup-proxy[1358]: backup ended and finish failed: backup ended but finished flag is not set.
Oct 19 02:19:11 proxb1 proxmox-backup-proxy[1358]: removing unfinished backup
Here I found another log with a more detailed error:
Code:
Oct 19 02:19:03 prox4 pvescheduler[73755]: INFO: Starting Backup of VM 151 (qemu)
Oct 19 02:19:17 prox4 pvestatd[1714]: VM 151 qmp command failed - VM 151 qmp command 'query-proxmox-support' failed - got timeout
Oct 19 02:19:17 prox4 pvestatd[1714]: status update time (8.227 seconds)
Oct 19 02:19:27 prox4 pvestatd[1714]: VM 151 qmp command failed - VM 151 qmp command 'query-proxmox-support' failed - unable to connect to VM 151 qmp s>
Oct 19 02:21:11 prox4 pvescheduler[73755]: VM 151 qmp command failed - VM 151 qmp command 'backup' failed - got timeout
Oct 19 02:21:32 prox4 pvestatd[1714]: proxmox-backup-client failed: Error: http request timed out
Oct 19 02:21:32 prox4 pvestatd[1714]: status update time (133.078 seconds)
Oct 19 02:21:40 prox4 pvestatd[1714]: VM 151 qmp command failed - VM 151 qmp command 'query-proxmox-support' failed - got timeout
Oct 19 02:22:08 prox4 pvestatd[1714]: status update time (35.738 seconds)
Oct 19 02:22:09 prox4 pmxcfs[1428]: [status] notice: received log
Oct 19 02:22:16 prox4 pvestatd[1714]: VM 151 qmp command failed - VM 151 qmp command 'query-proxmox-support' failed - unable to connect to VM 151 qmp s>
Oct 19 02:22:16 prox4 pvestatd[1714]: status update time (8.236 seconds)
Oct 19 02:22:26 prox4 pvestatd[1714]: VM 151 qmp command failed - VM 151 qmp command 'query-proxmox-support' failed - unable to connect to VM 151 qmp s>
Oct 19 02:22:27 prox4 pvestatd[1714]: status update time (8.248 seconds)
Oct 19 02:22:36 prox4 pvestatd[1714]: VM 151 qmp command failed - VM 151 qmp command 'query-proxmox-support' failed - unable to connect to VM 151 qmp s>
Oct 19 02:22:36 prox4 pvestatd[1714]: status update time (8.245 seconds)
Oct 19 02:22:46 prox4 pvestatd[1714]: VM 151 qmp command failed - VM 151 qmp command 'query-proxmox-support' failed - unable to connect to VM 151 qmp s>
Oct 19 02:22:46 prox4 pvestatd[1714]: status update time (8.239 seconds)
Oct 19 02:22:55 prox4 QEMU[5715]: BackupTask send abort failed.
Oct 19 02:22:55 prox4 pvescheduler[73755]: ERROR: Backup of VM 151 failed - VM 151 qmp command 'backup' failed - got timeout
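The pvestatd "status update time" values in this log jump from about 8 seconds to over 100 seconds during the failed backup. To find such stalls across all nodes, the journal lines can be filtered with something like the sketch below. The 10-second threshold and the name `slow_updates` are my own assumptions, not Proxmox defaults.

```python
import re

# Matches journal lines like:
# "Oct 19 02:21:32 prox4 pvestatd[1714]: status update time (133.078 seconds)"
STATUS = re.compile(
    r"^(\w{3} +\d+ [\d:]+) (\S+) pvestatd\[\d+\]: "
    r"status update time \(([\d.]+) seconds\)"
)


def slow_updates(lines, threshold=10.0):
    """Return (timestamp, host, seconds) for updates slower than threshold."""
    slow = []
    for line in lines:
        m = STATUS.match(line)
        if m and float(m.group(3)) > threshold:
            slow.append((m.group(1), m.group(2), float(m.group(3))))
    return slow


# Lines copied from the journal output above.
SAMPLE = [
    "Oct 19 02:19:17 prox4 pvestatd[1714]: status update time (8.227 seconds)",
    "Oct 19 02:21:32 prox4 pvestatd[1714]: status update time (133.078 seconds)",
    "Oct 19 02:22:08 prox4 pvestatd[1714]: status update time (35.738 seconds)",
    "Oct 19 02:22:16 prox4 pvestatd[1714]: status update time (8.236 seconds)",
]

for timestamp, host, seconds in slow_updates(SAMPLE):
    print(f"{timestamp} {host}: status update took {seconds:.1f}s")
```

If the slow updates always line up with the failed backups, that would at least narrow down when the node or the VM's QMP socket stops responding.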
Any ideas?
(By the way: we have a second, much smaller setup with the same backup server configuration, but only 5 nodes and 30 VMs. There we have no problems.)