vzdump Backup of VM failed - got timeout (NFS)

timo-fc

Active Member
Nov 11, 2019
6
0
41
Hi,
we have the problem that every couple of days the backup of one VM fails. Most of the time it fails for different VMs.
The backup storage is a NFS share on a backup server.

The reason for that is probably that the NFS server is not responding fast enough to a check of Proxmox, beaus of high load.

Is it possible to increase the timeout or add something to force an retry if a backup attempt field?

The backup of the next VM a couple of seconds later is working without any problems again.
104: 2019-11-08 23:22:06 INFO: status: 99% (34179383296/34359738368), sparse 10% (3526156288), duration 648, read/write 53/38 MB/s
104: 2019-11-08 23:22:07 INFO: status: 100% (34359738368/34359738368), sparse 10% (3619962880), duration 649, read/write 180/86 MB/s
104: 2019-11-08 23:22:07 INFO: transferred 34359 MB in 649 seconds (52 MB/s)
104: 2019-11-08 23:22:18 INFO: archive file size: 12.77GB
104: 2019-11-08 23:22:27 INFO: delete old backup '/mnt/pve/offsite-backups/dump/vzdump-qemu-104-2019_11_06-23_11_17.vma.lzo'
104: 2019-11-08 23:22:31 INFO: Finished Backup of VM 104 (00:11:14)

114: 2019-11-08 23:22:31 INFO: Starting Backup of VM 114 (qemu)
114: 2019-11-08 23:22:31 INFO: status = running
114: 2019-11-08 23:22:32 INFO: update VM 114: -lock backup
114: 2019-11-08 23:22:32 INFO: VM Name: test-one
114: 2019-11-08 23:22:32 INFO: include disk 'scsi0' 'rbd:vm-114-disk-0' 32G
114: 2019-11-08 23:22:32 INFO: backup mode: snapshot
114: 2019-11-08 23:22:32 INFO: ionice priority: 7
114: 2019-11-08 23:22:35 INFO: creating archive '/mnt/pve/offsite-backups/dump/vzdump-qemu-114-2019_11_08-23_22_31.vma.lzo'
114: 2019-11-08 23:22:42 ERROR: got timeout
114: 2019-11-08 23:22:42 INFO: aborting backup job
114: 2019-11-08 23:22:48 ERROR: Backup of VM 114 failed - got timeout

117: 2019-11-08 23:22:48 INFO: Starting Backup of VM 117 (qemu)
117: 2019-11-08 23:22:48 INFO: status = running
117: 2019-11-08 23:22:48 INFO: update VM 117: -lock backup
117: 2019-11-08 23:22:48 INFO: VM Name: one
117: 2019-11-08 23:22:48 INFO: include disk 'scsi0' 'rbd:vm-117-disk-0' 32G
117: 2019-11-08 23:22:48 INFO: backup mode: snapshot
117: 2019-11-08 23:22:48 INFO: ionice priority: 7
117: 2019-11-08 23:22:48 INFO: creating archive '/mnt/pve/offsite-backups/dump/vzdump-qemu-117-2019_11_08-23_22_48.vma.lzo'
117: 2019-11-08 23:22:48 INFO: started backup task 'f2200e11-d850-4479-8110-7be4400c8c92'
117: 2019-11-08 23:22:51 INFO: status: 0% (254672896/34359738368), sparse 0% (16146432), duration 3, read/write 84/79 MB/s
117: 2019-11-08 23:22:54 INFO: status: 1% (373293056/34359738368), sparse 0% (36487168), duration 6, read/write 39/32 MB/s

Nov 08 23:22:20 proxmox1 pvestatd[xxx]: storage 'offsite-backups' is not online
Nov 08 23:22:31 proxmox1 vzdump[xxx]: INFO: Finished Backup of VM 104 (00:11:14)
Nov 08 23:22:31 proxmox1 vzdump[xxx]: INFO: Starting Backup of VM 114 (qemu)
Nov 08 23:22:32 proxmox1 qm[xxx]: <root@pam> update VM 114: -lock backup
Nov 08 23:22:48 proxmox1 vzdump[xxx]: ERROR: Backup of VM 114 failed - got timeout
Nov 08 23:22:48 proxmox1 vzdump[xxx]: INFO: Starting Backup of VM 117 (qemu)

I assume the test resulting in "storage 'offsite-backups' is not online" is /sbin/showmount --no-headers --exports xxx.xxx.xxx.xxx with a 2 Sec. timout in /usr/share/perl5/PVE/Storage/NFSPlugin.pm.

If that is correct is there a way to increase the timeout without changing the file which might be replaced by updates?



Proxmox 5.4-13
/etc/pve/storage.cfg
nfs: offsite-backups
export /1/backup
path /mnt/pve/offsite-backups
server xxx.xxx.xxx.xxx
content backup
maxfiles 1
options vers=4

/etc/vzdump.conf
no changes

/etc/cron.d/vzdump
vzdump --mode snapshot --all 1 --compress lzo --mailto xxx@xxx --mailnotification failure --quiet 1 --storage offsite-backups
 
Last edited:
Hi,
I assume the test resulting in "storage 'offsite-backups' is not online" is /sbin/showmount --no-headers --exports xxx.xxx.xxx.xxx with a 2 Sec. timout in /usr/share/perl5/PVE/Storage/NFSPlugin.pm.

If that is correct is there a way to increase the timeout without changing the file which might be replaced by updates?
Since the timeout is hard coded, there is unfortunately no alternative to changing the file.

Is it possible to increase the timeout or add something to force an retry if a backup attempt field?
Since our current backup system uses cron to schedule the tasks, there is no straightforward way to do a retry.
As a workaround you could use another job scheduler. That won't be managed by PVE then though.
 
Thanks for the information.

We have changed the timeout in /NFSPlugin.pm

And after restarting pvestatd it seems to have fixed the backup problem with the timeout.
 
That was to soon, the problem still exists.

I have changed the timeout value from 2 to 20 sec. in NFSPlugin.pm and have restarted the pvestatd service and umount and remounted the backup NFS share.

But I still got a "failed - got timeout" between two working backups which are less the 10 sec. apart from each other.

Is there another timeout which may cause this or are there additional steps necessary for the change in the NFSPlugin.pm to work?

104: 2019-12-13 00:05:25 INFO: archive file size: 2.82GB
104: 2019-12-13 00:05:25 INFO: delete old backup '/mnt/pve/offsite-backups/dump/vzdump-qemu-104-2019_12_11-23_58_26.vma.lzo'
104: 2019-12-13 00:05:27 INFO: Finished Backup of VM 104 (00:02:28)

105: 2019-12-13 00:05:27 INFO: Starting Backup of VM 105 (qemu)
105: 2019-12-13 00:05:27 INFO: status = running
105: 2019-12-13 00:05:27 INFO: update VM 105: -lock backup
105: 2019-12-13 00:05:27 INFO: VM Name: test
105: 2019-12-13 00:05:27 INFO: include disk 'scsi0' 'rbd:vm-105-disk-0' 32G
105: 2019-12-13 00:05:27 INFO: backup mode: snapshot
105: 2019-12-13 00:05:27 INFO: ionice priority: 7
105: 2019-12-13 00:05:29 INFO: creating archive '/mnt/pve/offsite-backups/dump/vzdump-qemu-105-2019_12_13-00_05_27.vma.lzo'
105: 2019-12-13 00:05:33 ERROR: got timeout
105: 2019-12-13 00:05:33 INFO: aborting backup job
105: 2019-12-13 00:05:33 ERROR: Backup of VM 105 failed - got timeout

108: 2019-12-13 00:05:34 INFO: Starting Backup of VM 108 (qemu)
108: 2019-12-13 00:05:34 INFO: status = running
 
You might need to do systemctl restart pvedaemon.service pveproxy.service as well. Then the new code should be loaded in all relevant places.
 
Sadly restarting pvedaemon.service and pveproxy.service didn't change the situation.

There are still backups failing with "ERROR: got timeout", with around 11 sec. between successful backups before and after the failed backup.
Although the timout in NFSPlugin.pm is set to 20 sec.

The nfs mount timout is 60 sec.
192.168.xx.xx:/storage/backups/hrtbt on /mnt/pve/offsite-backups type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.xx.xx,local_lock=none,addr=192.168.xx.xx)

Are there any other timeouts or is there something else I have possibly overlooked which might have an effect on this problem?

147: 2020-01-13 00:58:42 INFO: transferred 34359 MB in 313 seconds (109 MB/s)
147: 2020-01-13 00:58:44 INFO: archive file size: 7.90GB
147: 2020-01-13 00:58:44 INFO: delete old backup '/mnt/pve/offsite-backups/dump/vzdump-qemu-107-2020_01_12-00_53_37.vma.lzo'
147: 2020-01-13 00:58:46 INFO: Finished Backup of VM 107 (00:05:18)

149: 2020-01-13 00:58:46 INFO: Starting Backup of VM 109 (qemu)
149: 2020-01-13 00:58:46 INFO: status = running
149: 2020-01-13 00:58:47 INFO: update VM 109: -lock backup
149: 2020-01-13 00:58:47 INFO: VM Name: xxxx
149: 2020-01-13 00:58:47 INFO: include disk 'scsi0' 'rbd:vm-109-disk-0' 32G
149: 2020-01-13 00:58:47 INFO: backup mode: snapshot
149: 2020-01-13 00:58:47 INFO: ionice priority: 7
149: 2020-01-13 00:58:48 INFO: creating archive '/mnt/pve/offsite-backups/dump/vzdump-qemu-109-2020_01_13-00_58_46.vma.lzo'
149: 2020-01-13 00:58:52 ERROR: got timeout
149: 2020-01-13 00:58:52 INFO: aborting backup job
149: 2020-01-13 00:58:57 ERROR: Backup of VM 109 failed - got timeout

151: 2020-01-13 00:58:57 INFO: Starting Backup of VM 111 (qemu)
151: 2020-01-13 00:58:57 INFO: status = running
151: 2020-01-13 00:58:57 INFO: update VM 111: -lock backup
 
I did a bit of testing and it seems that check_connection in NFSPlugin.pm is only called once at the beginning of the backup and not in between backups. The good news is that you don't need to keep your local version of that module. The bad news is that it's not clear what's causing the timeout.

Are there any hookscripts for the VMs where the job fails? From the log messages the timeout most probably happens in the archive function of PVE/VZDump/QemuServer.pm (in the qemu-server repository). Before calling the function the creating archive message is printed and inside that function started backup task is printed (which is missing when you hit the timeout).
 
I have checked the last 4 VMs where the backup had failed, there are no hookscripts.

It effected 4 different VMs on 3 different Proxmox hosts in one cluster, which send their backups to one off-site backup server.

All VMs have only one disk image the disk image is stored on the Ceph which is run by the Proxmox nodes.

But I have noticed in journalctl that pvestatd reported a timout before the backup of VM 109 was started:

Jan 13 00:58:00 proxmox4 systemd[1]: Starting Proxmox VE replication runner...
Jan 13 00:58:00 proxmox4 systemd[1]: Started Proxmox VE replication runner.
Jan 13 00:58:13 proxmox4 pvestatd[xxx]: got timeout
Jan 13 00:58:46 proxmox4 vzdump[xxx]: INFO: Finished Backup of VM 107 (00:05:18)
Jan 13 00:58:46 proxmox4 vzdump[xxx]: INFO: Starting Backup of VM 109 (qemu)
Jan 13 00:58:47 proxmox4 qm[xxx]: <root@pam> update VM 109: -lock backup
Jan 13 00:58:53 proxmox4 pve-ha-lrm[xxx]: VM 109 qmp command failed - VM 109 qmp command 'query-status' failed - got timeout
Jan 13 00:58:53 proxmox4 pve-ha-lrm[xxx]: VM 109 qmp command 'query-status' failed - got timeout
Jan 13 00:58:57 proxmox4 vzdump[xxx]: ERROR: Backup of VM 109 failed - got timeout
Jan 13 00:58:57 proxmox4 vzdump[xxx]: INFO: Starting Backup of VM 111 (qemu)
Jan 13 00:58:57 proxmox4 qm[xxx]: <root@pam> update VM 111: -lock backup
Jan 13 00:59:00 proxmox4 systemd[1]: Starting Proxmox VE replication runner...
Jan 13 00:59:00 proxmox4 systemd[1]: Started Proxmox VE replication runner.

This timeout messages appear quite often, but backup errors only every few days.
Jan 13 01:32:44 proxmox4 pvestatd[xxx]: got timeout
Jan 13 01:32:44 proxmox4 pvestatd[xxx]: unable to activate storage 'offsite-backups' - directory '/mnt/pve/offsite-backups' does not exist or is unreachable
Jan 13 01:43:53 proxmox4 pvestatd[xxx]: got timeout
Jan 13 01:43:53 proxmox4 pvestatd[xxx]: unable to activate storage 'offsite-backups' - directory '/mnt/pve/offsite-backups' does not exist or is unreachable
Jan 13 01:47:06 proxmox4 pvestatd[xxx]: got timeout
Jan 13 02:24:13 proxmox4 pvestatd[xxx]: got timeout
Jan 13 03:06:03 proxmox4 pvestatd[xxx]: got timeout
 
Maybe the following code snippet in sub activate_storage in PVE/Storage/Plugin.pm is the culprit? NFSPlugin.pm's activate_storage calls this at the end.
Code:
    # this path test may hang indefinitely on unresponsive mounts
    my $timeout = 2;
    if (! PVE::Tools::run_fork_with_timeout($timeout, sub {-d $path})) {
       die "unable to activate storage '$storeid' - " .
       "directory '$path' does not exist or is unreachable\n";
   }

Please try adjusting the timeout here and run systemctl restart pvedaemon.service pveproxy.service afterwards.
 
Unfortunately that was not the cause of the problem.
I have changed the timeout in Storage/Plugin.pm to 15 sec. and restarted the services, but I still got a failed backup in less then 10 sec. between two successful backups.
 
What about the status function in PVE/Storage/Plugin.pm. This should get called by pvestatd periodically. But AFAICT it shouldn't affect the code used for the backup.
Code:
sub status {
    my ($class, $storeid, $scfg, $cache) = @_;

    my $path = $scfg->{path};

    die "storage definintion has no path\n" if !$path;

    my $timeout = 2;
    my $res = PVE::Tools::df($path, $timeout);

    return undef if !$res || !$res->{total};

    return ($res->{total}, $res->{avail}, $res->{used}, 1);
}

If this is also not the reason for the timeout, as a workaround you could split up the backup job into multiple ones and schedule them a few minutes (enough time so that each previous one can finish) apart from each other.
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!