Hung NFS backup share froze guest disk

axe

New Member
Sep 2, 2013
Hi,

First post here and sadly with a problem.

I'm running a two-node cluster with DRBD storage, plus a third node for quorum and NFS storage for image backups. The backup is scheduled daily and has been working just fine for some time now, until last night that is. In the middle of a backup the NFS server failed with a kernel panic due to a hardware failure: the system hard drive died.

Nothing strange here, but what concerns me is the effect this had on the cluster node and the VM being backed up during the failure. The VM is running Windows Server 2008 and has two disk images, one for the OS and one for data. The data volume became inaccessible in the guest, and the only way I could recover was to reboot the cluster node. The VM started just fine after that and all data was readable, no problem.

Is this expected behavior when the NFS share becomes inaccessible during a backup? It would have been nice if the backup had timed out and just failed with an error, but I also understand that NFS itself is tricky and sensitive when there is a network or host problem.
Would iSCSI be a better solution in this case? The chances of this happening again are probably slim, but where possible I would like to avoid scenarios that force a reboot of a physical node when the VMs can't be migrated.


Thanks.
Freddy
 
Hi,

I assume you mounted NFS via TCP. Try mounting via UDP, since the only way to get rid of a hanging TCP NFS mount is to reboot the node.

Alex
 
Thank you Alex for your suggestion. I just tried mounting with the NFS options 'soft,proto=udp' and the result seems to be the same as before. Since these servers are in production my testing is fairly limited, but when I get more time I'm going to put together a test environment. There must be a way to get this working properly without running clustered NFS servers.
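
For anyone who wants a guard in front of the backup job while testing this: below is a minimal sketch of a probe that checks whether a mount point still answers within a time limit. It is not part of Proxmox or vzdump; the function name `mount_responds` and the timeout value are my own assumptions. The stat() runs in a child process because a call against a hung hard-mounted NFS share can block in uninterruptible sleep, and the caller should not block along with it.

```python
import subprocess
import sys


def mount_responds(path, timeout_s=5.0):
    """Return True if a stat() on `path` completes successfully within timeout_s.

    Hypothetical helper for illustration; not a Proxmox or vzdump API.
    """
    # Run the stat() in a separate interpreter so that only the child
    # blocks if the filesystem behind `path` has stopped responding.
    cmd = [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path]
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        # Exit code 0 means the stat() succeeded; anything else means it
        # raised (for example because the path does not exist).
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in uninterruptible sleep may not
        # die until the mount recovers, so we deliberately do not wait for it.
        proc.kill()
        return False
```

A wrapper script could call `mount_responds("/mnt/pve/backup-nfs")` (a made-up storage path) before starting the backup and skip the run if it returns False. This only helps when the share is already dead at the start; it does nothing for a server that dies in the middle of a backup, which is what happened here.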


Freddy
 
I think the issue here is that, starting with 3.0, KVM snapshot backups were changed to use a new feature in KVM called LiveBackup.

The way it works is the KVM process sends the data to the backup process.
This is a much better way to perform the backup when you look at it from a disk IO viewpoint.
But I suspect it causes other problems by utilizing other resources like CPU and RAM.

In your case the backup process was unable to keep writing. KVM keeps trying to send it data, but the backup process cannot accept it. Eventually the buffers fill up and KVM stops doing disk IO because it is waiting for the backup process to take the data.
This is just speculation, and maybe I am way off on what is happening, but that is my educated guess.
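
To illustrate the guess above, here is a toy model of a producer and a consumer connected by a bounded buffer. It is only a sketch of the general backpressure effect, not the actual KVM or Proxmox backup code; all names and numbers are made up. The consumer stands in for the backup process and stops draining after a few blocks, as if its NFS target had hung. The producer stands in for the VM side and gets stuck once the buffer is full.

```python
import queue
import threading


def run_backup(blocks, buffer_size, sink_stalls_after, put_timeout_s=0.5):
    """Simulate handing `blocks` blocks to a backup consumer.

    Returns how many blocks the producer handed over before it stalled.
    Toy model only; names and parameters are hypothetical.
    """
    buf = queue.Queue(maxsize=buffer_size)
    stop = threading.Event()

    def consumer():
        written = 0
        while not stop.is_set():
            if written >= sink_stalls_after:
                # The backup target has "hung": stop draining the buffer.
                stop.wait(0.01)
                continue
            try:
                buf.get(timeout=0.01)
            except queue.Empty:
                continue
            written += 1

    worker = threading.Thread(target=consumer, daemon=True)
    worker.start()

    handed_over = 0
    try:
        for block in range(blocks):
            try:
                # A real producer without a timeout would block here forever;
                # the timeout exists only so the simulation terminates.
                buf.put(block, timeout=put_timeout_s)
            except queue.Full:
                break
            handed_over += 1
    finally:
        stop.set()
        worker.join()
    return handed_over
```

With `run_backup(blocks=20, buffer_size=4, sink_stalls_after=3)` the producer gets 7 blocks out (3 drained plus 4 sitting in the buffer) and then stalls. With a consumer that never hangs, for example `sink_stalls_after=20`, all 20 blocks go through. If the guest's disk writes sit behind that stalled producer, the guest disk would look frozen, which matches the symptom described.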

I think this new LiveBackup uses more CPU and potentially causes more data copies in RAM (blowing away the CPU cache) than the old LVM setup did.
I have asked in this thread whether it is possible to use the old backup method based on LVM snapshots in 3.1:
http://forum.proxmox.com/threads/16036-Can-I-use-LVM-Snapshot-backup-for-KVM-in-3-1

Basically I think there are some cases where LiveBackup is worse than the LVM snapshot.
To prove or disprove that theory I need to be able to use the old method in 3.1 so I can compare LiveBackup against LVM snapshot backup.

I suspect the issues here are also related to LiveBackup:
http://forum.proxmox.com/threads/11093-Windows-BSOD-happening-during-backups
 
Your observation really does make a lot of sense, thank you for pointing this out. This would mean it's probably something that would need to be fixed (if possible) upstream and not something the Proxmox team can do much about, except for offering the old snapshot method as an alternative like you suggested.

Again, thank you guys for your quick and helpful responses.

Freddy
 
This would mean it's probably something that would need to be fixed (if possible) upstream and not something the Proxmox team can do much about

No, the Proxmox team wrote the KVM LiveBackup feature.

Do you have a reproducible test case?
 
