Hi,
First post here and sadly with a problem.
I'm running a two-node cluster with DRBD storage + a third node for quorum, and NFS storage for image backups. The backup is scheduled daily and has been working just fine for some time now, until last night, that is. In the middle of a backup the NFS server failed with a kernel panic due to a hardware failure - the system hard drive died.
Nothing strange there, but what concerns me is the effect this had on the cluster node and the VM being backed up at the time of the failure. The VM is running Windows Server 2008 and has two disk images, one for the OS and one for data. The data volume became inaccessible on the guest, and the only way I could recover from this was to reboot the cluster node. The VM started just fine after that and all data was readable, no problem.
Is this expected behavior when the NFS share becomes inaccessible during a backup? It would have been nice if the backup had timed out and simply failed with an error, but I also understand that NFS itself is tricky and sensitive when there is a network or host problem.
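For what it's worth, my guess is this comes down to NFS hard vs. soft mounts: with the default hard mount, I/O against a dead server blocks indefinitely, which would explain the hang. Would mount options along these lines (just a sketch; the server name and export path here are made up) make the client error out instead of hanging forever?

```shell
# Soft mount: NFS operations return an error after 'retrans' retries
# instead of blocking forever when the server disappears.
# 'timeo' is in tenths of a second, so timeo=150 means 15 s per attempt.
mount -t nfs -o soft,timeo=150,retrans=3 backupserver:/export/backups /mnt/backups
```

I've read that soft mounts can silently drop writes that time out, which might be acceptable for a backup target but not for general storage, so I'm not sure this is the right trade-off either.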
Would iSCSI be a better solution in this case? The chances of this happening again are probably slim, but where possible I would like to avoid scenarios that force a reboot of a physical node when the VMs can't be migrated.
Thanks.
Freddy