VM unreachable due to failing backup

jazzl0ver · Jan 18, 2021

Hi,

I have a VM with two drives that is backed up once a week by the Proxmox built-in backup tool. The backups are stored on an NFS share.
The last backup failed because the NFS server ran out of space:

Code:

vzdump 122 127 --mailnotification failure --compress zstd --quiet 1 --storage trendy-images --mode snapshot --mailto root@domain
127: 2021-01-16 02:30:03 INFO: Starting Backup of VM 127 (qemu)
127: 2021-01-16 02:30:03 INFO: status = running
127: 2021-01-16 02:30:03 INFO: VM Name: vm-srv1
127: 2021-01-16 02:30:03 INFO: include disk 'scsi0' 'LVM-Storage:127/vm-127-disk-0.qcow2' 10G
127: 2021-01-16 02:30:03 INFO: include disk 'scsi1' 'LVM-Storage:127/vm-127-disk-2.qcow2' 178G
127: 2021-01-16 02:30:03 INFO: backup mode: snapshot
127: 2021-01-16 02:30:03 INFO: ionice priority: 7
127: 2021-01-16 02:30:03 INFO: creating vzdump archive '/mnt/pve/trendy-images/dump/vzdump-qemu-127-2021_01_16-02_30_03.vma.zst'
127: 2021-01-16 02:30:03 INFO: issuing guest-agent 'fs-freeze' command
127: 2021-01-16 02:30:04 INFO: issuing guest-agent 'fs-thaw' command
127: 2021-01-16 02:30:04 INFO: started backup task '3a81f20b-e387-40b9-8979-c85611a56dc7'
127: 2021-01-16 02:30:04 INFO: resuming VM again
127: 2021-01-16 02:30:07 INFO:   0% (224.2 MiB of 188.0 GiB) in  3s, read: 74.7 MiB/s, write: 62.7 MiB/s
...
127: 2021-01-16 04:45:32 ERROR: VM 127 qmp command 'query-backup' failed - got timeout
127: 2021-01-16 04:45:32 INFO: aborting backup job
127: 2021-01-16 04:55:32 ERROR: VM 127 qmp command 'backup-cancel' failed - unable to connect to VM 127 qmp socket - timeout after 5982 retries
127: 2021-01-16 04:57:42 ERROR: Backup of VM 127 failed - VM 127 qmp command 'query-backup' failed - got timeout

After that, the VM became unreachable. I tried to hard-reboot it, but the boot stalled while mounting the second virtual drive. Manually mounting the qemu image on the Proxmox host through guestmount did not finish within a reasonable time either.

Then I found the following lines in the host logs:

Code:

ext4_multi_mount_protect: MMP interval higher than expected

and was able to recover the virtual drive by clearing the MMP state after booting into single-user mode:

Code:

tune2fs -f -E clear_mmp /dev/sdb

Anyway, a failing backup should not affect the VM itself.

1. Is this a bug, or am I missing something?
2. It would be great to have an option that checks the storage space before starting the backup and sends an email alert if there is not enough space.

dcsapak · Jan 18, 2021

for an introduction about how the backup works, see here: https://git.proxmox.com/?p=pve-qemu...16aeb06259c2bcd196e4949c;hb=refs/heads/master

basically, when the vm wants to write a block, pve writes that block to the backup stream, and then lets the vm continue to write it to disk
if the first part hangs (the write to the backup), the second one cannot happen
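
The ordering described above can be sketched as a toy model. This is not Proxmox or QEMU code: the class names are made up for illustration, and a full backup target is modelled as an exception, whereas a hung NFS write would simply never return.

```python
# Toy model of copy-before-write during a snapshot-mode backup.
# All names here are hypothetical; this is not Proxmox/QEMU code.


class BackupTargetFull(Exception):
    """Stands in for a backup write that cannot complete."""


class BackupStream:
    def __init__(self, capacity_blocks):
        self.capacity_blocks = capacity_blocks
        self.blocks = {}

    def write(self, index, data):
        # In the real failure the write would hang; here we raise instead.
        if len(self.blocks) >= self.capacity_blocks:
            raise BackupTargetFull(f"no room for block {index}")
        self.blocks[index] = data


class GuestDisk:
    def __init__(self, blocks, backup):
        self.blocks = list(blocks)
        self.backup = backup
        self.saved = set()  # blocks already copied to the backup

    def guest_write(self, index, data):
        if index not in self.saved:
            # Step 1: the block goes to the backup stream first.
            self.backup.write(index, self.blocks[index])
            self.saved.add(index)
        # Step 2: only now is the guest's write allowed to reach the disk.
        self.blocks[index] = data


backup = BackupStream(capacity_blocks=1)
disk = GuestDisk([b"old0", b"old1", b"old2"], backup)

disk.guest_write(0, b"new0")  # fits: backed up, then written
try:
    disk.guest_write(1, b"new1")  # backup is full: the guest write never lands
    second_write_landed = True
except BackupTargetFull:
    second_write_landed = False
```

In this sketch the second guest write never reaches the disk because step 1 fails, which mirrors why a stuck backup write stalls the guest's own I/O.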

the problem here is that nfs mounts can completely block inside the kernel, with no chance to recover besides rebooting the server, so if your vm backup hangs in the nfs part of the kernel, there is no way to recover from that

you *could* mount the nfs with the 'soft' option, allowing for a timeout, but this means that data can potentially get corrupted under certain circumstances (see the manpage of nfs(5))
or make sure that the nfs is reachable and writable (e.g. by using monitoring on your nfs server to make sure to expand it)
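
As a rough illustration of the monitoring idea, here is a minimal free-space check in Python. The function name and the threshold are assumptions for this sketch, not part of any Proxmox tool. It is probably best run on the NFS server against the local export directory, since a check run against a hung hard-mounted NFS path on the client could block as well.

```python
import shutil


def free_space_report(path, min_free_bytes):
    """Return free/total bytes for the filesystem holding `path`
    and whether free space is at least `min_free_bytes`."""
    usage = shutil.disk_usage(path)
    return {
        "path": path,
        "free_bytes": usage.free,
        "total_bytes": usage.total,
        "ok": usage.free >= min_free_bytes,
    }


# Hypothetical usage: alert when less than 200 GiB is free.
report = free_space_report(".", 200 * 1024**3)
if not report["ok"]:
    print(f"low space on {report['path']}: {report['free_bytes']} bytes free")
```

A cron job or an existing monitoring system could call something like this and send the email alert, independently of when the backups run.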

jazzl0ver · Jan 18, 2021

Thanks for your reply, Dominik! Which storage would be better to use instead of NFS to avoid the blocking issue you mentioned?

What about my suggestion of checking the free space before doing a backup? I understand it's not Proxmox's job to act as a monitoring engine, but external checks can be hard to maintain: the target backup storage may change for certain VMs, so after such a change one storage needs less free space while another needs more.

dcsapak · Jan 18, 2021

jazzl0ver said:
What about my suggestion on checking the free space before doing a backup?

this is basically impossible

1. we do not know beforehand how big the backup will be
2. assuming we would know that, there are still potentially other writers that write to the nfs (e.g. if vm images are also on that nfs) which would again trigger that issue
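
Point 1 can be illustrated with a small sketch: the size of a compressed archive depends on the data itself, not only on the disk size. The example uses zlib from the Python standard library as a stand-in for zstd (an assumption made to keep it self-contained), and the helper name is made up.

```python
import os
import zlib


def compressed_size(data):
    """Size in bytes of `data` after zlib compression (stand-in for zstd)."""
    return len(zlib.compress(data))


ONE_MIB = 1024 * 1024

zeros = bytes(ONE_MIB)               # e.g. unused, zeroed disk space
random_data = os.urandom(ONE_MIB)    # e.g. already-compressed or encrypted data

zeros_size = compressed_size(zeros)
random_size = compressed_size(random_data)

print(f"1 MiB of zeros  -> {zeros_size} bytes")
print(f"1 MiB of random -> {random_size} bytes")
```

Two inputs of identical size end up with very different compressed sizes, so the result is only known once the data has actually been read and compressed.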

jazzl0ver · Jan 18, 2021

yeah, good point.

> What storage is it better to use instead of NFS to avoid the blocking issue you mentioned?
Would you please answer this question?

dcsapak · Jan 18, 2021

nfs is alright, but it has some downsides as i mentioned.
you can use any other network file system, e.g. cifs/smb

you could also try the proxmox backup server as an alternative to 'normal' vzdump backups, see https://pbs.proxmox.com/wiki/index.php/Main_Page

jazzl0ver · Jan 18, 2021

I was just trying to figure this out, without success: does Proxmox Backup Server export its storage through NFS, or does it use some other protocol?

dcsapak · Jan 18, 2021

it uses a http(2) api over tls

jazzl0ver · Jan 18, 2021

Thank you, Dominik, for all your quick and helpful answers!
