Deleted member 200185
Hi,
I've got a small but annoying problem during backups (PVE backup function, not PBS).
Setup:
- Proxmox: 8.0.4
- VMs: 10 (10-30 GB disk size each)
- VM storage: NVMe RAID 1 (software) + LVM thin pool
- Backup storage: HDD RAID 1 (software) + LVM LV
Problem:
In about 10-20% of all backup runs, one or two VMs fail to back up. Backup log:
Code:
XXX: 2023-09-07 10:30:53 INFO: Starting Backup of VM XXX (qemu)
XXX: 2023-09-07 10:30:53 INFO: status = running
XXX: 2023-09-07 10:30:53 INFO: VM Name: something
XXX: 2023-09-07 10:30:53 INFO: include disk 'scsi0' 'nvme-thin:vm-XXX-disk-1' 40G
XXX: 2023-09-07 10:30:53 INFO: include disk 'efidisk0' 'nvme-thin:vm-XXX-disk-0' 4M
XXX: 2023-09-07 10:31:10 ERROR: Backup of VM XXX failed - cannot determine size of volume 'nvme-thin:vm-XXX-disk-0' - command '/sbin/lvs --separator : --noheadings --units b --unbuffered --nosuffix --options lv_size /dev/vg0/vm-XXX-disk-0' failed: got timeout
The problem isn't tied to a specific VM or disk, except that it never happens to the first VM that gets backed up. It just seems "random".
Debugging:
- Enabled the LVM debug log: the timeout occurs during "Processing data from device /dev/md127" (md127 is the HDD backup RAID 1)
- Checked both disks for possible problems (SMART, write performance, read performance, unusual peaks in latency/disk queues/...) -> nothing, the disks are fine
- Verified the RAID 1 -> nothing, the RAID is fine (some of the checks I ran are sketched below)
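For reference, the checks were along these lines (device names are placeholders for my actual disks):

Code:
# SMART health for both backup HDDs
smartctl -a /dev/sda
smartctl -a /dev/sdb

# RAID state and a full consistency check of the backup array
cat /proc/mdstat
mdadm --detail /dev/md127
echo check > /sys/block/md127/md/sync_action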
My assumption: sometimes the kernel decides to flush cached writes to disk at the same moment the "lvs" command is issued. The "lvs" command then has to wait until the flush completes, and with "slow" HDDs (~200 MB/s) and possibly 10-30 GB of dirty data to flush, the timeout occurs.
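One way to test this hypothesis is to watch the amount of dirty data while a backup runs, and to cap how much the kernel may buffer before writing back (the byte values below are examples, not recommendations):

Code:
# how much dirty data is currently waiting to be written back
grep -E 'Dirty|Writeback' /proc/meminfo

# cap dirty data so a flush never has to move tens of GB at once
sysctl -w vm.dirty_background_bytes=268435456   # start background writeback at 256 MB
sysctl -w vm.dirty_bytes=1073741824             # block writers at 1 GB of dirty data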
The timeout of 5 seconds is hardcoded in the PVE code (https://github.com/proxmox/qemu-ser...cc6781953d9c3a7/PVE/VZDump/QemuServer.pm#L125), so I don't think there is a way for me to change it without patching the code.
Is there a way to manually issue a "flush" between the backups of two VMs? I think that would solve my problem.
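For example, since vzdump supports hook scripts (set via "script:" in /etc/vzdump.conf or the --script option), I was thinking of something along these lines (untested sketch, path is just an example):

Code:
#!/bin/bash
# example hook: /usr/local/bin/vzdump-sync-hook.sh
# vzdump calls this with the phase name as the first argument
phase="$1"

if [ "$phase" = "backup-end" ]; then
    # flush all dirty pages so the next VM's "lvs" call
    # doesn't stall behind a large writeback
    sync
fi

exit 0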
Any other ideas to solve this?