[SOLVED] Backup Problem (command timeout)

Hi,

I've got a small but annoying problem during backups (the PVE backup function, not PBS).

Setup:
  • Proxmox: 8.0.4
  • VMs: 10 (10-30 GB disk size each)
  • Storage VMs: NVMe RAID 1 (Software) + LVM Pool
  • Storage Backups: HDD RAID 1 (Software) + LVM LV

Problem:
In about 10-20% of all backup runs, 1 or 2 VMs fail to back up. Backup log:
Code:
XXX: 2023-09-07 10:30:53 INFO: Starting Backup of VM XXX (qemu)
XXX: 2023-09-07 10:30:53 INFO: status = running
XXX: 2023-09-07 10:30:53 INFO: VM Name: something
XXX: 2023-09-07 10:30:53 INFO: include disk 'scsi0' 'nvme-thin:vm-XXX-disk-1' 40G
XXX: 2023-09-07 10:30:53 INFO: include disk 'efidisk0' 'nvme-thin:vm-XXX-disk-0' 4M
XXX: 2023-09-07 10:31:10 ERROR: Backup of VM XXX failed - cannot determine size of volume 'nvme-thin:vm-XXX-disk-0' - command '/sbin/lvs --separator : --noheadings --units b --unbuffered --nosuffix --options lv_size /dev/vg0/vm-XXX-disk-0' failed: got timeout
The problem isn't tied to a specific VM or disk, except that it never happens to the first VM that gets backed up. It just seems "random".

Debugging:
  • Enabled the LVM debug log: the timeout occurs during "Processing data from device /dev/md127" (md127 is the HDD backup storage RAID 1)
  • Checked both disks for possible problems (SMART, write performance, read performance, unusual peaks in latency/disk queues/..., ...) -> nothing, the disks are fine
  • Verified the RAID 1 -> nothing, the RAID is fine (a direct way to reproduce the timeout itself is sketched below)
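To reproduce the timeout directly, one could time the exact lvs command from the error message while a backup is writing to md127 (a diagnostic sketch; the command is copied verbatim from the log above, with XXX left as the placeholder):
Code:
# Time the lvs call that vzdump runs; anything near or above 5 seconds
# would hit the hardcoded timeout. Run this while a backup writes to md127.
time /sbin/lvs --separator : --noheadings --units b --unbuffered \
    --nosuffix --options lv_size /dev/vg0/vm-XXX-disk-0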
The only thing I noticed is that when the error occurs, the kernel is flushing its cached data to disk at the same time (a peak in iostat for md127 flushes). As the server has plenty of free memory (usually around 30 GB), a lot of dirty data can accumulate, so this flush can take several seconds.

I assume the kernel sometimes decides to flush dirty data to disk at the same moment the "lvs" command is issued. The "lvs" command then has to wait until the flush completes, and with "slow" HDDs (~200 MB/s) and possibly 10-30 GB to flush, the timeout occurs.
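If that theory is right, it should be visible in how much dirty data the kernel accumulates, and capping it might help. A minimal sketch (the sysctl values below are illustrative assumptions, not tested recommendations):
Code:
# Watch how much dirty data the kernel is holding while backups run
# (Dirty/Writeback in /proc/meminfo are reported in kB):
watch -n1 "grep -E '^(Dirty|Writeback):' /proc/meminfo"

# Possible mitigation (example values only): cap dirty data so writeback
# starts earlier and happens in smaller chunks instead of one multi-GB
# burst that stalls other I/O on md127.
sysctl -w vm.dirty_background_bytes=268435456   # background writeback from 256 MiB
sysctl -w vm.dirty_bytes=1073741824             # blocking writeback from 1 GiB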

The timeout of 5 seconds is hardcoded in the PVE code (https://github.com/proxmox/qemu-ser...cc6781953d9c3a7/PVE/VZDump/QemuServer.pm#L125). So I don't think there is a way for me to change it without patching the code.

Is there a way to manually issue a "flush" between the backups of two VMs? I think that would solve my problem.

Any other ideas to solve this?
 
Hi,
you could try using a hook script for the backup: https://pve.proxmox.com/pve-docs/chapter-vzdump.html#_hook_scripts
See /usr/share/doc/pve-manager/examples/vzdump-hook-script.pl for an example script.
You can set your script for a backup job (check the ID with cat /etc/pve/jobs.cfg) with pvesh set /cluster/backup/backup-<ID> --script /path/to/your/script. Make sure the script has the executable flag set on the file system. If you want to remove the script again, you can use pvesh set /cluster/backup/backup-<ID> --delete script.
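A minimal sketch of such a hook script (assuming a plain shell script is acceptable; vzdump invokes the hook with the current phase as its first argument) that issues a sync after each VM's backup, so dirty pages are flushed before the next VM's lvs call:
Code:
#!/bin/bash
# Minimal vzdump hook sketch: flush dirty pages after each VM finishes,
# so the next VM's 'lvs' call doesn't stall behind a large writeback.
# vzdump calls the hook with the current phase as the first argument.
phase="$1"

if [ "$phase" = "backup-end" ]; then
    sync
fi

exit 0
Save it somewhere like /usr/local/bin/vzdump-sync-hook.sh (the name is just an example), make it executable with chmod +x, and register it with the pvesh command above.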
 
