NFS for backups and PVE 3.2 in a PVE cluster

cesarpk

Hi to all

If I have several PVE nodes (3.2, 3.1, and 2.3) in a PVE cluster, all of them configured with an NFS server as backup storage, and that NFS server is turned off, will the PVE cluster lose communication or quorum?

Best regards
Cesar
 
No, cluster communication is not related to backup storage.
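(If in doubt, quorum can be checked at any time with the standard cluster tool; a quick sketch:)

Code:
# prints cluster membership and whether the node is quorate,
# independent of any storage state
pvecm status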

Hi Dietmar, and thanks for your prompt response. But if I remember correctly, in the PVE 2.3 GUI I had a red light for that PVE node.

Best regards
cesar
 

It was a pvestatd bug: it hung on the availability check for the unreachable NFS storage. It's fixed now.

(Note that the best practice is to specify on which nodes the storage is available: in the GUI, storage options -> Nodes.)
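The same restriction can also be applied from the shell; a minimal sketch, where the storage ID and node names are placeholders:

Code:
# limit the storage to the nodes that actually use it, so the
# other nodes never try to activate (and stat) the NFS mount
pvesm set my-nfs-backup --nodes pve1,pve2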
 

Thanks spirit for your answer. But if I have only PVE 3.2 nodes and I need to power off my NFS server for a short time while no backup is in progress, should I first unmount the NFS share on my PVE nodes?
 
Please, can someone else confirm this bug?
I installed another two new Proxmox servers, joined them to the cluster, then connected an NFS server and selected it for backups only. I tested it and everything was working. Afterwards I stopped that NFS server, and in /var/log/syslog I can see:
Apr 23 12:08:41 cl2 pvestatd[16952]: WARNING: command 'df -P -B 1 /mnt/pve/cupid_data' failed: got timeout
Apr 23 12:09:15 cl2 pveproxy[23273]: WARNING: proxy detected vanished client connection
Apr 23 12:09:45 cl2 pveproxy[23337]: WARNING: proxy detected vanished client connection

The cluster splits and the GUI stops working correctly (all machines show just numbers with no names and status unknown, so it is impossible to migrate them or do anything else) until I start that server again.

Apr 23 12:19:59 cl2 kernel: ct0 nfs: server 192.168.80.101 OK
Apr 23 12:19:59 cl2 pvestatd[16952]: status update time (403.055 seconds)

Then I added another cluster node just to be sure, and the behaviour is exactly the same.

I think this is a BIG problem if a failure of some external storage makes the cluster unusable. The same happens when a Gluster or other storage back end fails.
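The hang pvestatd runs into can be reproduced by hand with the same df call shown in the log above; a small diagnostic sketch:

Code:
# against a dead hard-mounted NFS share this stat call blocks,
# which is exactly what stalls the pvestatd status updates
timeout 5 df -P -B 1 /mnt/pve/cupid_data || echo "df hung or failed"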
 

I am in a project that will have several PVE nodes and two NFS servers used only as backup storage; each NFS backup server will serve 50% of the PVE nodes. So if one of the NFS backup servers is stopped, 50% of my PVE nodes will have cluster communication problems. This situation is very alarming.

Can someone from the PVE team tell us anything about this matter, or better, how to fix it?

Best regards
Cesar

Re-edited: I think that a stopped shared NFS storage must never break PVE cluster communication, whatever it holds (ISOs, VM images, backups, anything).
 
Hi,

If you are only using the NFS server as a backup server (not for VM disks), you should try mounting it soft (the default is hard):

Code:
cat /etc/pve/storage.cfg
nfs: NFS_Proxmox
    path /mnt/pve/NFS_Proxmox
    server my-nfs.server.com
    export /nfs/proxmox
    options rw,sec=sys,noatime,nfsvers=3,soft,timeo=10,retrans=5,actimeo=10,retry=5
    content images,iso,vztmpl,rootdir,backup
    nodes pve1,pve2,pve3
    maxfiles 5
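Once the share has been remounted with these options, you can check what the kernel actually applied; a quick sketch using nfsstat (from the standard nfs-common tools):

Code:
# lists every NFS mount with its effective options;
# look for "soft" and the timeo/retrans values
nfsstat -m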

Regards
 

Many thanks mcmyst for your answer. But if I apply these changes manually, how long should I wait for them to take effect? Or can I run a command to make them take effect immediately (if I can't restart the PVE node)?

Best regards
Cesar

Re-edited: and what should I do on the PVE node when the NFS server comes back online?
 
I can confirm that this also happens with a Proxmox+Ceph setup. If for any reason the Ceph storage goes sideways, it prevents all GUI access. As soon as the Ceph storage is back, everything returns to normal. I had this issue with NFS back in Proxmox v2, but only when NFS was set up to hold VMs.
 
Hi mcmyst,
thanks for your hints, but unfortunately they do not solve the problem. In fact I had to unmount the share manually with umount -l /mnt/pve/cupid_data.
Here are my logs after turning off my NFS server:
Apr 24 11:27:02 cl1 kernel: ct0 nfs: server 192.168.80.200 not responding, timed out
Apr 24 11:27:02 cl1 pvestatd[2620]: WARNING: unable to activate storage 'cupid_data' - directory '/mnt/pve/cupid_data' does not exist
Apr 24 11:27:02 cl1 pvestatd[2620]: status update time (6.024 seconds)
Apr 24 11:29:15 cl1 pveproxy[35824]: WARNING: proxy detected vanished client connection
Apr 24 11:30:15 cl1 pvedaemon[13297]: WARNING: mkdir /mnt/pve/cupid_data: File exists at /usr/share/perl5/PVE/Storage/Plugin.pm line 792

The cluster starts working again only after unmounting manually on all servers, even with these changes in place. I also tried adding just the soft option in storage.cfg, but it behaves the same as before any change.
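For reference, a sketch of that manual recovery, assuming the storage ID is cupid_data as in the logs (whether your pvesm version supports the disable flag is an assumption to verify):

Code:
# on each affected node: lazy-unmount the dead share so stat calls stop blocking
umount -l /mnt/pve/cupid_data
# optionally disable the storage until the NFS server is back ...
pvesm set cupid_data --disable 1
# ... and re-enable it afterwards; pvestatd will remount it
pvesm set cupid_data --disable 0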

 
Can one of the devs please acknowledge this bug? I would like to help somehow or try any patches; unfortunately I am not a programmer. This error is really disturbing and keeps me from completely migrating from ESXi.
 

I am not aware of such an NFS bug. If you find one, please report it on https://bugzilla.proxmox.com with detailed information on how to trigger the issue.
 
Thank you for your answer Tom, I filed a bug: https://bugzilla.proxmox.com/show_bug.cgi?id=521

@liska:
@tom:

I was thinking about these questions:

1- What happens if the NFS server breaks down while a backup is in progress?
2- Is PVE prepared to deal with this situation, so that the VM/CT is finally unlocked?

If not, can you, liska or tom, file a bug on Bugzilla?

(I don't have the necessary equipment in my lab to run this test.)
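For anyone who does have a spare test box, the scenario is simple to trigger; a sketch (the VMID and storage ID are placeholders):

Code:
# start a suspend-mode backup to the NFS storage, then power off
# the NFS server while the backup is still running
vzdump 101 --storage NFS_Proxmox --mode suspend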

Best regards
Cesar
 
Hello,

two weeks ago the switch in the datacenter broke, so no communication was possible. After the switch was exchanged, our NFS storage was up and running, but the node was hanging and I couldn't access the server through the Dell DRAC. Only after a technician at the datacenter pulled the power cords and restarted the node was everything fine again.

So it's a very big problem: our datacenter is ~130 km away from us, and at night there's nobody there.
 
cesarpk:
I just tried to start a backup of a container (suspend mode only, as I do not have enough space in this testing virtual cluster for snapshot mode) and turned off my NFS server after a few seconds. The process hung until the NFS storage was available again.
When I stopped the NFS server and then the backup process, the backup failed but the container kept running just fine.
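If a guest is ever left locked by such a failed backup, the lock can be cleared by hand once no vzdump process is still running; a sketch for a KVM guest (100 is a placeholder VMID; OpenVZ containers are handled differently):

Code:
# an interrupted vzdump can leave a stale "backup" lock on the guest;
# remove it manually so the VM can be managed again
qm unlock 100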
 
I have just installed the latest updates. Unfortunately, this error is still not solved, and there is no answer on Bugzilla either.
Although I am very happy about all the new features, solving old major bugs should come first.
I have used Proxmox for a few years, but this seems like a REALLY huge problem keeping me from completely migrating from ESXi.
Please, can someone from the PVE team give us some updates about this?
 
