Backup failed due to busy device

oldgoodname

New Member
Jan 24, 2024
Hey Guys,

I opened a thread about the same problem some months ago: https://forum.proxmox.com/threads/backup-error-over-smb-cifs-from-pve.140354/#post-633459
I thought it would be better to open a new one; I hope that's OK.

Here is a short overview of my network:

  • 2x PVE host in a cluster
  • 1x Quorum virtual machine running Debian and hosted on the Qnap NAS
  • 1x Qnap NAS with 2 storage pools
    • 1 HDD storage pool saving backup-data
    • 1 SSD storage pool running all virtual machines (quorum and pve virtual machines)
  • 1x PBS host virtualized on one of the PVE hosts, backing up data to an NFS share on the HDD storage pool of the Qnap NAS

The PVE hosts, the quorum VM and the Qnap NAS are on the same VLAN; only the PBS is on a different one, routed over a firewall.


Since the beginning I have had problems backing up my virtual machines. For some months I did no backups at all and was thinking about using a third-party product or waiting for Veeam for Proxmox. About a month ago, after upgrading from PVE version 7 to 8, I made a new attempt at backing up my VMs with the PBS. It worked for about a month, backing up one cluster node at 1am and the other cluster node at 4am.

Yesterday I wanted to create some manual backups, but they failed several times. While the primary node created a successful backup of its VMs, the second node repeatedly failed to do so, even after restarting node 1 and node 2. It always failed after the first or the second VM. So I gave up on the backup and took the risk. Today I hoped that the scheduled 4am job would create a successful backup again, but it failed as well, even though that job had worked for about a month.

As the backup is really essential now, I need to get this fixed. Hopefully you can help me with that. One of the VMs always remains in a locked state (103 in the log file), which I have to unlock manually on the terminal of the host. I attached today's log file. The only thing I deleted from it are about 20 to 30 entries of a service user logging in. These come from the monitoring system and I think they are not relevant to the case.
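For anyone who wants to script the manual unlock step described above, here is a minimal sketch. It assumes it runs as root on the PVE host that owns the VM and that the standard `qm unlock <vmid>` command is available there; the helper names `build_unlock_command` and `unlock_vm` are made up for this example.

```python
import subprocess
from typing import List


def build_unlock_command(vmid: int) -> List[str]:
    """Return the argument list for unlocking a VM with the `qm` CLI."""
    if vmid <= 0:
        raise ValueError("vmid must be a positive integer")
    return ["qm", "unlock", str(vmid)]


def unlock_vm(vmid: int) -> None:
    """Run `qm unlock <vmid>`; raises CalledProcessError if qm reports a failure.

    Assumption: executed as root on the PVE node where the VM lives.
    """
    subprocess.run(build_unlock_command(vmid), check=True)
```

Calling `unlock_vm(103)` on the host would then do the same thing as typing `qm unlock 103` by hand. This only clears the stale lock; it does not address whatever interrupted the backup.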

To me it looks like the cluster loses quorum from time to time, but how can I fix that? Is there an option to prioritize this kind of traffic?

I really appreciate your help.
Thanks and best regards.
 

Attachments

  • error_cleaned.log
    27.8 KB
Why didn't you set up PBS together with the qdevice VM on the Qnap? I would expect that to give you at least better performance than your current setup, since local disks are the only supported storage for PBS.
 
I wanted to get rid of all VMs on the Qnap, as I want to replace it at some point with TrueNAS or similar. So I installed everything on the PVE hosts, even though I know a standalone PBS would make more sense. Later I realized that the only good way to implement a quorum is to install a VM on the Qnap, as installing it on the PVE hosts would make no sense. So I used the Qnap as a hypervisor again.

I can install a new PBS on the Qnap, but it would run on the SSD storage pool, while the data would be backed up onto the HDD storage pool. I don't know whether Qnap supports mounting another storage pool directly or whether I would have to fall back to NFS again; I have to check that first. Even then I don't know whether the backup would be transferred directly from the SSD pool to the HDD pool. I think it will always communicate and copy the data through the network, so the performance may be the same.

But the performance itself isn't the problem. The backup of one PVE node takes only 15 minutes. What I am wondering is why this specific error occurs.
 
that looks like your backup traffic/load either overloads the network, or the qnap box entirely.. I don't think there is a good way out of this other than spec-ing your hardware accordingly. not sure whether your "quorum VM" is a full blown PVE, or just a qdevice, if it's the former, switching to a qdevice might make quorum more stable.
 
I think the only hardware resource that is limited is the network throughput. Looking at the CPU and memory load, neither the PVE hosts nor the Qnap is fully loaded. The PBS has plenty of CPU and memory and is not overloaded either. The quorum VM is indeed only a qdevice and has enough resources as well.

The PVE hosts have only a 1 Gb/s network connection, while the Qnap has 2.5 Gb/s. I can reduce the allowed network throughput of the PBS backups by limiting it to 50 MB/s, which should roughly halve the possible throughput. I think I tried this earlier without much difference in reliability, but I will try again.
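As a rough sanity check of those numbers, here is a small sketch of the arithmetic. It assumes decimal units (1 Gb/s = 1000 Mb/s) and ignores protocol overhead, so real-world throughput will be somewhat lower; the function names are made up for this example.

```python
def gbit_to_mbyte_per_s(gbit_per_s: float) -> float:
    """Convert a raw link speed from Gb/s to MB/s (decimal units, no overhead)."""
    return gbit_per_s * 1000 / 8


def limit_share(limit_mbyte_per_s: float, link_gbit_per_s: float) -> float:
    """Fraction of the raw link rate that a given bandwidth limit would use."""
    return limit_mbyte_per_s / gbit_to_mbyte_per_s(link_gbit_per_s)


pve_link = gbit_to_mbyte_per_s(1.0)    # raw rate of a PVE host's 1 Gb/s port
qnap_link = gbit_to_mbyte_per_s(2.5)   # raw rate of the Qnap's 2.5 Gb/s port
share = limit_share(50, 1.0)           # share of the PVE link used by a 50 MB/s limit

print(f"PVE link: {pve_link:.1f} MB/s, Qnap link: {qnap_link:.1f} MB/s")
print(f"A 50 MB/s limit uses {share:.0%} of the PVE link")
```

So a 50 MB/s limit corresponds to 40% of the raw 1 Gb/s line rate, which is close to half of what such a link typically delivers in practice.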

Since Thursday, the backup has worked every night without a problem. Interestingly, as you can see in the error log, the problem always occurs after a successful backup, not during one. It fails when the next backup is about to start. So I don't know whether it is really a throughput problem or rather an I/O problem (latency?).

Could it also be a problem with files not being unlocked automatically, or only after some time?
 
it's probably flushing at the end of the backup that causes corosync traffic to not meet its latency requirements..
 
Is there anything I can change to prevent that problem, like giving the corosync traffic a higher priority? Will the reduction of the backup throughput help with this problem?
 
I don't think a bwlimit on the backup job will help here, as there is likely a traffic spike at the end of the backup task that is decoupled from the backup I/O issued by the VM. the only proper solution is not sharing storage or guest traffic with corosync traffic by separating the links.
 
OK, thanks. As my servers only have two network ports, this currently seems to be no option: I need the second port as a fallback in case the first one fails (or the switch it is connected to fails). If I used the second port only for corosync, the backup problem might be gone, but other problems could occur.

At the moment the nightly backup seems to be stable, so I will keep an eye on how it develops.

Thank you and best regards
 
