vzdump Backup of VM failed - got timeout (NFS)

timo-fc · Nov 11, 2019

Hi,
we have the problem that every couple of days the backup of one VM fails. Most of the time it fails for different VMs.
The backup storage is a NFS share on a backup server.

The reason for that is probably that the NFS server is not responding fast enough to a check of Proxmox, beaus of high load.

Is it possible to increase the timeout or add something to force an retry if a backup attempt field?

The backup of the next VM a couple of seconds later is working without any problems again.

104: 2019-11-08 23:22:06 INFO: status: 99% (34179383296/34359738368), sparse 10% (3526156288), duration 648, read/write 53/38 MB/s
104: 2019-11-08 23:22:07 INFO: status: 100% (34359738368/34359738368), sparse 10% (3619962880), duration 649, read/write 180/86 MB/s
104: 2019-11-08 23:22:07 INFO: transferred 34359 MB in 649 seconds (52 MB/s)
104: 2019-11-08 23:22:18 INFO: archive file size: 12.77GB
104: 2019-11-08 23:22:27 INFO: delete old backup '/mnt/pve/offsite-backups/dump/vzdump-qemu-104-2019_11_06-23_11_17.vma.lzo'
104: 2019-11-08 23:22:31 INFO: Finished Backup of VM 104 (00:11:14)

114: 2019-11-08 23:22:31 INFO: Starting Backup of VM 114 (qemu)
114: 2019-11-08 23:22:31 INFO: status = running
114: 2019-11-08 23:22:32 INFO: update VM 114: -lock backup
114: 2019-11-08 23:22:32 INFO: VM Name: test-one
114: 2019-11-08 23:22:32 INFO: include disk 'scsi0' 'rbd:vm-114-disk-0' 32G
114: 2019-11-08 23:22:32 INFO: backup mode: snapshot
114: 2019-11-08 23:22:32 INFO: ionice priority: 7
114: 2019-11-08 23:22:35 INFO: creating archive '/mnt/pve/offsite-backups/dump/vzdump-qemu-114-2019_11_08-23_22_31.vma.lzo'
114: 2019-11-08 23:22:42 ERROR: got timeout
114: 2019-11-08 23:22:42 INFO: aborting backup job
114: 2019-11-08 23:22:48 ERROR: Backup of VM 114 failed - got timeout

117: 2019-11-08 23:22:48 INFO: Starting Backup of VM 117 (qemu)
117: 2019-11-08 23:22:48 INFO: status = running
117: 2019-11-08 23:22:48 INFO: update VM 117: -lock backup
117: 2019-11-08 23:22:48 INFO: VM Name: one
117: 2019-11-08 23:22:48 INFO: include disk 'scsi0' 'rbd:vm-117-disk-0' 32G
117: 2019-11-08 23:22:48 INFO: backup mode: snapshot
117: 2019-11-08 23:22:48 INFO: ionice priority: 7
117: 2019-11-08 23:22:48 INFO: creating archive '/mnt/pve/offsite-backups/dump/vzdump-qemu-117-2019_11_08-23_22_48.vma.lzo'
117: 2019-11-08 23:22:48 INFO: started backup task 'f2200e11-d850-4479-8110-7be4400c8c92'
117: 2019-11-08 23:22:51 INFO: status: 0% (254672896/34359738368), sparse 0% (16146432), duration 3, read/write 84/79 MB/s
117: 2019-11-08 23:22:54 INFO: status: 1% (373293056/34359738368), sparse 0% (36487168), duration 6, read/write 39/32 MB/s

Nov 08 23:22:20 proxmox1 pvestatd[xxx]: storage 'offsite-backups' is not online
Nov 08 23:22:31 proxmox1 vzdump[xxx]: INFO: Finished Backup of VM 104 (00:11:14)
Nov 08 23:22:31 proxmox1 vzdump[xxx]: INFO: Starting Backup of VM 114 (qemu)
Nov 08 23:22:32 proxmox1 qm[xxx]: <root@pam> update VM 114: -lock backup
Nov 08 23:22:48 proxmox1 vzdump[xxx]: ERROR: Backup of VM 114 failed - got timeout
Nov 08 23:22:48 proxmox1 vzdump[xxx]: INFO: Starting Backup of VM 117 (qemu)

I assume the test resulting in "storage 'offsite-backups' is not online" is /sbin/showmount --no-headers --exports xxx.xxx.xxx.xxx with a 2 Sec. timout in /usr/share/perl5/PVE/Storage/NFSPlugin.pm.

If that is correct is there a way to increase the timeout without changing the file which might be replaced by updates?

Proxmox 5.4-13
/etc/pve/storage.cfg

nfs: offsite-backups
export /1/backup
path /mnt/pve/offsite-backups
server xxx.xxx.xxx.xxx
content backup
maxfiles 1
options vers=4

/etc/vzdump.conf
no changes

/etc/cron.d/vzdump

vzdump --mode snapshot --all 1 --compress lzo --mailto xxx@xxx --mailnotification failure --quiet 1 --storage offsite-backups

fiona · Nov 18, 2019

Hi,

timo-fc said:
I assume the test resulting in "storage 'offsite-backups' is not online" is /sbin/showmount --no-headers --exports xxx.xxx.xxx.xxx with a 2 Sec. timout in /usr/share/perl5/PVE/Storage/NFSPlugin.pm.

If that is correct is there a way to increase the timeout without changing the file which might be replaced by updates?

Since the timeout is hard coded, there is unfortunately no alternative to changing the file.

timo-fc said:
Is it possible to increase the timeout or add something to force an retry if a backup attempt field?

Since our current backup system uses cron to schedule the tasks, there is no straightforward way to do a retry.
As a workaround you could use another job scheduler. That won't be managed by PVE then though.

timo-fc · Dec 9, 2019

Thanks for the information.

We have changed the timeout in /NFSPlugin.pm

And after restarting pvestatd it seems to have fixed the backup problem with the timeout.

timo-fc · Dec 13, 2019

That was to soon, the problem still exists.

I have changed the timeout value from 2 to 20 sec. in NFSPlugin.pm and have restarted the pvestatd service and umount and remounted the backup NFS share.

But I still got a "failed - got timeout" between two working backups which are less the 10 sec. apart from each other.

Is there another timeout which may cause this or are there additional steps necessary for the change in the NFSPlugin.pm to work?

104: 2019-12-13 00:05:25 INFO: archive file size: 2.82GB
104: 2019-12-13 00:05:25 INFO: delete old backup '/mnt/pve/offsite-backups/dump/vzdump-qemu-104-2019_12_11-23_58_26.vma.lzo'
104: 2019-12-13 00:05:27 INFO: Finished Backup of VM 104 (00:02:28)

105: 2019-12-13 00:05:27 INFO: Starting Backup of VM 105 (qemu)
105: 2019-12-13 00:05:27 INFO: status = running
105: 2019-12-13 00:05:27 INFO: update VM 105: -lock backup
105: 2019-12-13 00:05:27 INFO: VM Name: test
105: 2019-12-13 00:05:27 INFO: include disk 'scsi0' 'rbd:vm-105-disk-0' 32G
105: 2019-12-13 00:05:27 INFO: backup mode: snapshot
105: 2019-12-13 00:05:27 INFO: ionice priority: 7
105: 2019-12-13 00:05:29 INFO: creating archive '/mnt/pve/offsite-backups/dump/vzdump-qemu-105-2019_12_13-00_05_27.vma.lzo'
105: 2019-12-13 00:05:33 ERROR: got timeout
105: 2019-12-13 00:05:33 INFO: aborting backup job
105: 2019-12-13 00:05:33 ERROR: Backup of VM 105 failed - got timeout

108: 2019-12-13 00:05:34 INFO: Starting Backup of VM 108 (qemu)
108: 2019-12-13 00:05:34 INFO: status = running

fiona · Dec 16, 2019

You might need to do systemctl restart pvedaemon.service pveproxy.service as well. Then the new code should be loaded in all relevant places.

timo-fc · Jan 13, 2020

Sadly restarting pvedaemon.service and pveproxy.service didn't change the situation.

There are still backups failing with "ERROR: got timeout", with around 11 sec. between successful backups before and after the failed backup.
Although the timout in NFSPlugin.pm is set to 20 sec.

The nfs mount timout is 60 sec.

192.168.xx.xx:/storage/backups/hrtbt on /mnt/pve/offsite-backups type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.xx.xx,local_lock=none,addr=192.168.xx.xx)

Are there any other timeouts or is there something else I have possibly overlooked which might have an effect on this problem?

147: 2020-01-13 00:58:42 INFO: transferred 34359 MB in 313 seconds (109 MB/s)
147: 2020-01-13 00:58:44 INFO: archive file size: 7.90GB
147: 2020-01-13 00:58:44 INFO: delete old backup '/mnt/pve/offsite-backups/dump/vzdump-qemu-107-2020_01_12-00_53_37.vma.lzo'
147: 2020-01-13 00:58:46 INFO: Finished Backup of VM 107 (00:05:18)

149: 2020-01-13 00:58:46 INFO: Starting Backup of VM 109 (qemu)
149: 2020-01-13 00:58:46 INFO: status = running
149: 2020-01-13 00:58:47 INFO: update VM 109: -lock backup
149: 2020-01-13 00:58:47 INFO: VM Name: xxxx
149: 2020-01-13 00:58:47 INFO: include disk 'scsi0' 'rbd:vm-109-disk-0' 32G
149: 2020-01-13 00:58:47 INFO: backup mode: snapshot
149: 2020-01-13 00:58:47 INFO: ionice priority: 7
149: 2020-01-13 00:58:48 INFO: creating archive '/mnt/pve/offsite-backups/dump/vzdump-qemu-109-2020_01_13-00_58_46.vma.lzo'
149: 2020-01-13 00:58:52 ERROR: got timeout
149: 2020-01-13 00:58:52 INFO: aborting backup job
149: 2020-01-13 00:58:57 ERROR: Backup of VM 109 failed - got timeout

151: 2020-01-13 00:58:57 INFO: Starting Backup of VM 111 (qemu)
151: 2020-01-13 00:58:57 INFO: status = running
151: 2020-01-13 00:58:57 INFO: update VM 111: -lock backup

fiona · Jan 13, 2020

I did a bit of testing and it seems that check_connection in NFSPlugin.pm is only called once at the beginning of the backup and not in between backups. The good news is that you don't need to keep your local version of that module. The bad news is that it's not clear what's causing the timeout.

Are there any hookscripts for the VMs where the job fails? From the log messages the timeout most probably happens in the archive function of PVE/VZDump/QemuServer.pm (in the qemu-server repository). Before calling the function the creating archive message is printed and inside that function started backup task is printed (which is missing when you hit the timeout).

timo-fc · Jan 13, 2020

I have checked the last 4 VMs where the backup had failed, there are no hookscripts.

It effected 4 different VMs on 3 different Proxmox hosts in one cluster, which send their backups to one off-site backup server.

All VMs have only one disk image the disk image is stored on the Ceph which is run by the Proxmox nodes.

But I have noticed in journalctl that pvestatd reported a timout before the backup of VM 109 was started:

Jan 13 00:58:00 proxmox4 systemd[1]: Starting Proxmox VE replication runner...
Jan 13 00:58:00 proxmox4 systemd[1]: Started Proxmox VE replication runner.
Jan 13 00:58:13 proxmox4 pvestatd[xxx]: got timeout
Jan 13 00:58:46 proxmox4 vzdump[xxx]: INFO: Finished Backup of VM 107 (00:05:18)
Jan 13 00:58:46 proxmox4 vzdump[xxx]: INFO: Starting Backup of VM 109 (qemu)
Jan 13 00:58:47 proxmox4 qm[xxx]: <root@pam> update VM 109: -lock backup
Jan 13 00:58:53 proxmox4 pve-ha-lrm[xxx]: VM 109 qmp command failed - VM 109 qmp command 'query-status' failed - got timeout
Jan 13 00:58:53 proxmox4 pve-ha-lrm[xxx]: VM 109 qmp command 'query-status' failed - got timeout
Jan 13 00:58:57 proxmox4 vzdump[xxx]: ERROR: Backup of VM 109 failed - got timeout
Jan 13 00:58:57 proxmox4 vzdump[xxx]: INFO: Starting Backup of VM 111 (qemu)
Jan 13 00:58:57 proxmox4 qm[xxx]: <root@pam> update VM 111: -lock backup
Jan 13 00:59:00 proxmox4 systemd[1]: Starting Proxmox VE replication runner...
Jan 13 00:59:00 proxmox4 systemd[1]: Started Proxmox VE replication runner.

This timeout messages appear quite often, but backup errors only every few days.

Jan 13 01:32:44 proxmox4 pvestatd[xxx]: got timeout
Jan 13 01:32:44 proxmox4 pvestatd[xxx]: unable to activate storage 'offsite-backups' - directory '/mnt/pve/offsite-backups' does not exist or is unreachable
Jan 13 01:43:53 proxmox4 pvestatd[xxx]: got timeout
Jan 13 01:43:53 proxmox4 pvestatd[xxx]: unable to activate storage 'offsite-backups' - directory '/mnt/pve/offsite-backups' does not exist or is unreachable
Jan 13 01:47:06 proxmox4 pvestatd[xxx]: got timeout
Jan 13 02:24:13 proxmox4 pvestatd[xxx]: got timeout
Jan 13 03:06:03 proxmox4 pvestatd[xxx]: got timeout

fiona · Jan 14, 2020

Maybe the following code snippet in sub activate_storage in PVE/Storage/Plugin.pm is the culprit? NFSPlugin.pm's activate_storage calls this at the end.

Code:

    # this path test may hang indefinitely on unresponsive mounts
    my $timeout = 2;
    if (! PVE::Tools::run_fork_with_timeout($timeout, sub {-d $path})) {
       die "unable to activate storage '$storeid' - " .
       "directory '$path' does not exist or is unreachable\n";
   }

Please try adjusting the timeout here and run systemctl restart pvedaemon.service pveproxy.service afterwards.

timo-fc · Jan 15, 2020

Unfortunately that was not the cause of the problem.
I have changed the timeout in Storage/Plugin.pm to 15 sec. and restarted the services, but I still got a failed backup in less then 10 sec. between two successful backups.

fiona · Jan 16, 2020

What about the status function in PVE/Storage/Plugin.pm. This should get called by pvestatd periodically. But AFAICT it shouldn't affect the code used for the backup.

Code:

sub status {
    my ($class, $storeid, $scfg, $cache) = @_;

    my $path = $scfg->{path};

    die "storage definintion has no path\n" if !$path;

    my $timeout = 2;
    my $res = PVE::Tools::df($path, $timeout);

    return undef if !$res || !$res->{total};

    return ($res->{total}, $res->{avail}, $res->{used}, 1);
}

If this is also not the reason for the timeout, as a workaround you could split up the backup job into multiple ones and schedule them a few minutes (enough time so that each previous one can finish) apart from each other.

Search

Search

vzdump Backup of VM failed - got timeout (NFS)

timo-fc

Active Member

fiona

Proxmox Staff Member

timo-fc

Active Member

timo-fc

Active Member

fiona

Proxmox Staff Member

timo-fc

Active Member

fiona

Proxmox Staff Member

timo-fc

Active Member

fiona

Proxmox Staff Member

timo-fc

Active Member

fiona

Proxmox Staff Member

We value your privacy