NFS Deadlock in Proxmox 2.1 Cluster

adoII

Renowned Member
Jan 28, 2010
Today we again had our "mysterious NFS deadlock problem" on a Proxmox 2.1 cluster.

We have VM images running on an NFS server (Nexenta, NFSv4).
We tried to live-migrate a machine from Proxmox server a to Proxmox server b via the web interface.
The live migration failed because of an old, wrong key for the machine name in the known_hosts file of server b.
After the failed migration, the non-working VM was running on both Proxmox hosts.

Afterwards it was no longer possible to read or mount any of the VM images on the Nexenta NFS server; the NFS files were somehow deadlocked.
After rebooting Proxmox server a, everything worked again.

The problem can be reproduced easily, e.g. by putting an invalid known_hosts file on Proxmox server b.
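A minimal sketch for spotting such a stale known_hosts entry before starting a migration. It assumes plain-text (unhashed) entries; the host names and key strings are made-up placeholders, and hashed entries (HashKnownHosts) are skipped because their host names cannot be matched this way. On a real host, `ssh-keygen -F <host>` and `ssh-keygen -R <host>` look up and remove entries, including hashed ones.

```python
def entries_for_host(known_hosts_text, host):
    """Return (line_number, key_type, key) for plain-text entries naming host."""
    found = []
    for number, line in enumerate(known_hosts_text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if fields[0].startswith("@"):
            fields = fields[1:]  # drop an optional marker such as @revoked
        if len(fields) < 3 or fields[0].startswith("|"):
            continue  # malformed or hashed entry, cannot be matched by name here
        if host in fields[0].split(","):
            found.append((number, fields[1], fields[2]))
    return found


def stale_entries(known_hosts_text, host, key_type, current_key):
    """Entries for host with the same key type but a different key."""
    return [
        entry
        for entry in entries_for_host(known_hosts_text, host)
        if entry[1] == key_type and entry[2] != current_key
    ]


# Hypothetical file content: server b still has an old key on record.
SAMPLE = """\
# example known_hosts
proxmox-b,10.0.0.2 ssh-rsa AAAAOLDKEY
proxmox-a ssh-rsa AAAAGOODKEY
"""

stale = stale_entries(SAMPLE, "proxmox-b", "ssh-rsa", "AAAANEWKEY")
```

An empty result means no conflicting plain-text entry was found; it does not prove the file is clean when hashed entries are present.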

BTW: We are mounting the NFS storage via fstab and use local-storage,shared for the VMs.

Any ideas what is happening there and how we can avoid these deadlocks in the future? Does the Proxmox engine somehow lock the NFS storage during migration?

We have now put the NFS option nolock in the fstab entry for mounting the VM storage:
10.65.1.3:/pool1/images /nexenta02 nfs4 hard,intr,bg,timeo=15,nolock
Was this a good idea?
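A minimal sketch that splits the fstab entry above into its fields and flags lock options that probably have no effect. It rests on one assumption to check against your own nfs(5) man page: there, lock/nolock are listed among the options for NFS versions 2 and 3 only, so on an nfs4 mount they would not change locking behaviour. The function names are illustrative only.

```python
def parse_fstab_line(line):
    """Split one fstab entry into device, mount point, type and options."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected device, mount point, type and options")
    options = {}
    for item in fields[3].split(","):
        name, _, value = item.partition("=")
        options[name] = value or None  # flags such as 'hard' carry no value
    return {
        "device": fields[0],
        "mount_point": fields[1],
        "fs_type": fields[2],
        "options": options,
    }


# Assumption taken from nfs(5): these apply to NFSv2/v3 mounts only.
V2_V3_ONLY = ("lock", "nolock")


def lock_options_without_effect(entry):
    """Return lock options set on an entry that is mounted as NFSv4."""
    opts = entry["options"]
    version = opts.get("vers") or opts.get("nfsvers") or ""
    is_v4 = entry["fs_type"] == "nfs4" or version.startswith("4")
    return [name for name in V2_V3_ONLY if is_v4 and name in opts]


entry = parse_fstab_line(
    "10.65.1.3:/pool1/images /nexenta02 nfs4 hard,intr,bg,timeo=15,nolock"
)
flagged = lock_options_without_effect(entry)
```

For this entry the sketch flags `nolock`. Note also that `timeo` is given in tenths of a second, so `timeo=15` means 1.5 seconds.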
 
Hi,

we still have this problem, and it is a very big problem for us.
Whenever a Proxmox operation like starting or migrating a VM fails (we had some failures because of invalid SSH keys), Proxmox leaves the whole filesystem of our NFS server locked forever.

So after the failure it is impossible to start or migrate any VM. Every operation on the NFSv4 Nexenta file server, like opening images, hangs forever, as strace shows, and the lock is in effect on all machines in our Proxmox cluster. When I reboot the Proxmox host that initiated the failed operation, all NFS filesystems on all other Proxmox hosts are available again for opening, reading and writing.
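A minimal sketch of a probe for this kind of hang: it opens a file in a child process and gives up after a timeout, so the calling shell or monitoring script does not get stuck itself. The timeout value is an arbitrary choice, and the image path in the usage note is a placeholder.

```python
import subprocess
import sys

# Code run in the child: open the file and read a single byte.
PROBE = "import sys; f = open(sys.argv[1], 'rb'); f.read(1); f.close()"


def can_open(path, timeout_seconds=5.0):
    """Return True if a child process opens and reads path within the timeout."""
    child = subprocess.Popen(
        [sys.executable, "-c", PROBE, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit code 0 means open() and read() both returned.
        return child.wait(timeout=timeout_seconds) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: do not wait for the child afterwards,
        # it may be blocked inside the kernel on the NFS call.
        child.kill()
        return False
```

Calling `can_open("/nexenta02/images/142/<image file>")` on each cluster node would show which hosts are affected. A False result means either a hang or an ordinary error such as a missing file, so it only narrows things down.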

The bad thing is: even if Proxmox locks only one path like /pool/images/142, all other files under /pool/images are locked on all Proxmox hosts that mount that filesystem, and the whole Proxmox cluster is unavailable.

I have now mounted all NFSv4 filesystems with the parameter nolock, and so far the problem has not happened again.
Does anybody have an idea what might be happening and how I could work around the problem? Maybe something like a global NFS unlock script?

Thanks
 
Problem solved.
Basically what happened: kernel 2.6.32-7-pve from Proxmox 1.9 behaved differently as an NFS client than kernel 2.6.32-12-pve from Proxmox 2.1.
This led to the OpenOwners table on our Nexenta NFS storage overflowing with 1 million entries. Afterwards it was not possible to open files on the Nexenta NFS storage, and so VMs could not be started.
We fixed the problem with an update of the Nexenta NFS storage.
 
Does your Nexenta system use ZFS?

If so, which ZFS version was the old one and which is the current one?

And do you think this is still needed: "NFS 4 filesystems with the parameter nolock"?
 
Hi,
Yes, the Nexenta uses ZFS. The ZFS pool version is 28, the Nexenta version is 3.1.3.
No, I don't think the parameter is needed. This turned out to be not a locking problem but a file-open problem.
 
