[SOLVED] Backup Problem (command timeout)

Deleted member 200185 (Guest)
Hi,

I've got a small but annoying problem during backups (using the PVE backup function, not PBS).

Setup:
  • Proxmox: 8.0.4
  • VMs: 10 (disk sizes between 10 and 30 GB)
  • VM storage: NVMe RAID 1 (software) + LVM pool
  • Backup storage: HDD RAID 1 (software) + LVM LV

Problem:
In about 10-20% of all backup runs, 1 or 2 VMs fail to back up. Backup log:
Code:
XXX: 2023-09-07 10:30:53 INFO: Starting Backup of VM XXX (qemu)
XXX: 2023-09-07 10:30:53 INFO: status = running
XXX: 2023-09-07 10:30:53 INFO: VM Name: something
XXX: 2023-09-07 10:30:53 INFO: include disk 'scsi0' 'nvme-thin:vm-XXX-disk-1' 40G
XXX: 2023-09-07 10:30:53 INFO: include disk 'efidisk0' 'nvme-thin:vm-XXX-disk-0' 4M
XXX: 2023-09-07 10:31:10 ERROR: Backup of VM XXX failed - cannot determine size of volume 'nvme-thin:vm-XXX-disk-0' - command '/sbin/lvs --separator : --noheadings --units b --unbuffered --nosuffix --options lv_size /dev/vg0/vm-XXX-disk-0' failed: got timeout
The problem isn't tied to a specific VM or disk, except that it never happens to the first VM that gets backed up. It just seems "random".

Debugging:
  • Enabled the LVM debug log: the timeout occurs during "Processing data from device /dev/md127" (md127 is the HDD backup RAID 1)
  • Checked both disks for possible problems (SMART, write performance, read performance, unusual peaks in latency/disk queues/...) -> nothing, the disks are fine
  • Verified the RAID 1 -> nothing, the RAID is fine
The only thing I noticed is that when the error occurs, the kernel flushes its memory content (page cache) to disk at the same time (peak in iostat for md127 during flushes). As the server has plenty of free memory (usually around 30 GB) that can fill up with dirty data, such a flush can take several seconds.

My assumption is that the kernel sometimes decides to flush dirty data to disk at the same moment the "lvs" command is issued. The "lvs" command then has to wait until the flush completes, and with "slow" HDDs (~200 MB/s) and possibly 10-30 GB to be written, the timeout is exceeded (even 10 GB at 200 MB/s already takes around 50 seconds).
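To double-check that correlation, I could log the amount of dirty page cache during a backup run and compare the peaks with the failure times. A quick sketch of such a helper (nothing PVE-specific, just my own script reading /proc/meminfo):
Code:
#!/usr/bin/perl
# Log the Dirty/Writeback counters from /proc/meminfo once per second,
# so peaks can be correlated with failed "lvs" calls in the backup log.
use strict;
use warnings;

while (1) {
    open(my $fh, '<', '/proc/meminfo') or die "cannot read /proc/meminfo: $!";
    my %mem;
    while (my $line = <$fh>) {
        $mem{$1} = $2 if $line =~ /^(Dirty|Writeback):\s+(\d+)\s+kB/;
    }
    close($fh);
    printf "%s  Dirty: %d kB  Writeback: %d kB\n",
        scalar(localtime), $mem{Dirty} // 0, $mem{Writeback} // 0;
    sleep 1;
}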

The timeout of 5 seconds is hardcoded in the PVE code (https://github.com/proxmox/qemu-ser...cc6781953d9c3a7/PVE/VZDump/QemuServer.pm#L125), so I don't think there is a way for me to change it without patching the code.

Is there a way to manually issue a "flush" between the backups of two VMs? I think that would solve my problem.

Any other ideas to solve this?
 
Hi,
you could try using a hook script for the backup: https://pve.proxmox.com/pve-docs/chapter-vzdump.html#_hook_scripts
See /usr/share/doc/pve-manager/examples/vzdump-hook-script.pl for an example script.
You can set your script for a backup job (check the ID with cat /etc/pve/jobs.cfg) with pvesh set /cluster/backup/backup-<ID> --script /path/to/your/script. Make sure the script has the executable flag set on the file system. If you want to remove the script again, you can use pvesh set /cluster/backup/backup-<ID> --delete script.
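In your case, the hook script could simply force a sync after each VM backup, so the page cache is already flushed before the next VM's "lvs" call runs. A minimal sketch (assuming a plain sync between two VM backups is enough to avoid the timeout on your HDDs):
Code:
#!/usr/bin/perl
# vzdump hook script: flush dirty pages to disk after each VM backup,
# so the next VM's volume size check ("lvs") does not run into a flush.
use strict;
use warnings;

my $phase = shift // '';

# "backup-end" is called after a single VM has been backed up.
if ($phase eq 'backup-end') {
    # May take a while on slow HDDs, but it runs between two VM backups,
    # not during the timed "lvs" call.
    system('sync') == 0 or warn "sync exited with status $?\n";
}

exit(0);
Whether this fully avoids the timeout depends on how much dirty data the running backup itself keeps producing, so treat it as a starting point.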
 
