[SOLVED] Properly recover VM after NFS shared crash during backup

Inglebard

Well-Known Member
May 20, 2016
Hi,

Yesterday, our NFS storage crashed during a backup of one of our VMs.
We were not able to reboot the storage after the crash.

This led to the following situation:
- the backup process was still running (but seemed frozen)
- the VM being backed up was frozen
- the Proxmox server was still running and the other VMs were fine

I was not able to:
- stop the backup of the VM
- shut down the VM
- reset the VM

So this is what I did:
- killed (-9) the vzdump process (no effect in the GUI)
- unlocked the VM
- tried again to shut down/reset the VM, without success
- migrated the working VMs to another server
- tried to shut down the Proxmox server, without success
- forced a shutdown of the Proxmox server

After the reboot, I was able to start the VM.

I don't know if I made the right decisions, but I would like to know several things.
Would it have been possible to properly stop the backup process and recover the VM (without a hard reboot of the Proxmox server)?
Why was the Proxmox server not able to reboot properly (I suppose it was still trying to connect to the NFS share)? Would the solution have been to force the NFS connection to close?
 
Hi,
Would it have been possible to properly stop the backup process and recover the VM (without a hard reboot of the Proxmox server)?
It is not possible to interrupt a process that is waiting for I/O; this is kernel-related.
The main problem is that a hard-mounted NFS share retries forever and never gives up, so any process with pending I/O on that share also waits forever.
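
To see which processes are stuck this way, one option is to list everything in uninterruptible sleep (state D in ps). This is a minimal sketch, assuming a Linux host with the procps version of ps; the function names filter_dstate and list_blocked are only illustrative and do not come from Proxmox.

```shell
# Keep the header line and every process whose state starts with "D"
# (uninterruptible sleep, usually waiting on I/O).
# Reads `ps` output on stdin, with the state in the second column.
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# List blocked processes on the local host.
# The wchan column shows the kernel function the process is sleeping in.
list_blocked() {
    ps -eo pid,stat,wchan:32,args | filter_dstate
}
```

Calling list_blocked while the NFS server is unreachable may show vzdump, and possibly other processes touching the share, in state D. Those are the processes that kill -9 cannot remove until the pending I/O returns.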

Why was the Proxmox server not able to reboot properly (I suppose it was still trying to connect to the NFS share)? Would the solution have been to force the NFS connection to close?
It will boot, but it takes longer. It tries to make all storages available so you can start your VMs, but after a while it stops trying and comes up.
 
Hi thanks for the answer.

It will boot, but it takes longer. It tries to make all storages available so you can start your VMs, but after a while it stops trying and comes up.

The issue is not the time to boot but the time to shut down.

It is not possible to interrupt a process that is waiting for I/O; this is kernel-related.
The main problem is that a hard-mounted NFS share retries forever and never gives up, so any process with pending I/O on that share also waits forever.
OK, I understand. But if we kill the backup process, why does the VM stay frozen/unresponsive (the VM is not on this NFS share)?
 
You cannot kill a process that is waiting for I/O. A kill is a signal, and a process in uninterruptible I/O wait does not act on signals until the I/O returns.
This is also the reason why you can't shut down.
 
But why does this affect the VM like this?
Since I only use this NFS share for backups, is it possible to use the "soft" option (this might fix this kind of issue, right)?
 
But why does this affect the VM like this?
Because during the backup, vzdump first writes every block that is about to change on the vdisk to the backup.
So if a changed block can't be backed up, the VM's write can't proceed.

Since I only use this NFS share for backups, is it possible to use the "soft" option (this might fix this kind of issue, right)?
I'm not sure. I can't remember, but you can try it if you use this storage only for backups.
If the storage is also used for vdisks, do not set it, because this can end in data loss.
 
OK, I see.
So I tried the soft option. The issue seems to persist.

I will switch to CIFS, because I tried it and it seems to do the job with similar performance.
 
OK, I see.
So I tried the soft option. The issue seems to persist.
I use the soft option of NFS and it works.
Content of /etc/pve/storage.cfg:
Code:
....
nfs: NFS-24
        path /mnt/pve/NFS-24
        server 192.168.122.24
        export /export/backups
        options vers=3,soft
        content backup
        maxfiles 4

You must unmount and remount the NFS share for the option to take effect:

umount /mnt/pve/NFS-24

Then mount -a will mount it again.
Then with mount | grep nfs you can see that it is mounted soft (versus hard):
Code:
192.168.122.24:/export/backups/ on /mnt/pve/NFS-24 type nfs (rw,relatime,vers=3,rsize=524288,wsize=524288,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.122.24,mountvers=3,mountport=57903,mountproto=udp,local_lock=none,addr=192.168.122.24)
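
The same check can be scripted instead of read by eye. The sketch below assumes the /proc/mounts field layout (device, mount point, type, options) and treats a mount without an explicit soft option as hard, which is the NFS default; the function name nfs_mount_mode is only illustrative.

```shell
# Print "<mountpoint> soft" or "<mountpoint> hard" for each NFS mount.
# Reads lines in /proc/mounts format on stdin:
#   device mountpoint fstype options dump pass
nfs_mount_mode() {
    awk '$3 ~ /^nfs4?$/ {
        mode = "hard"                      # hard is the default when soft is absent
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "soft") mode = "soft"
        print $2, mode
    }'
}

# Usage on a live system:
#   nfs_mount_mode < /proc/mounts
```

If the share still reports hard after editing storage.cfg, the old mount is most likely still active and has not been unmounted yet.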
 
Hi, the system is not perfect. This is what I get with the soft option after disconnecting the Ethernet cable.

If the machine is running:
- it does not freeze
- the system does not freeze
- the backup job does not end automatically
- if you reconnect Ethernet after a short time, the backup continues
- if you reconnect Ethernet after a long time, the backup ends (after some minutes) with a broken pipe error
- eventually you can stop the backup job from the GUI, and after some minutes the backup job ends
- the machine returns to unlocked status

If the machine is not running:
- you must kill (-9) the lzop process and then kill (-9) the vzdump process
- after some minutes the backup job ends (no log)
- you must stop it, and then unlock it
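
The manual cleanup for the stopped-machine case could be collected in a small helper. This is only a sketch: the function name cleanup_stuck_backup and the VM ID 101 are hypothetical, matching the vzdump process by command line with pkill -f is an assumption about how the backup was started, and the wait of "some minutes" between the kills and the unlock is not handled. By default it only prints the commands.

```shell
# Print (or, with RUN set to empty, execute) the cleanup steps for a stuck backup.
# $1 is the VM ID; 101 in the usage lines is only a placeholder.
cleanup_stuck_backup() {
    vmid="$1"
    run="${RUN-echo}"          # default is a dry run that only prints
    $run pkill -9 -x lzop      # compressor started by the backup
    $run pkill -9 -f vzdump    # backup process; -f also matches any other command line containing "vzdump"
    $run qm unlock "$vmid"     # clear the backup lock on the VM
}

# Dry run:  cleanup_stuck_backup 101
# Real run (as root, on the Proxmox host):  RUN= cleanup_stuck_backup 101
```

Keeping the dry run as the default makes it harder to fire kill -9 at the wrong host by accident, and the printed commands can be run one at a time while waiting for the backup job to end.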

 
