NFS for backups and PVE 3.2 in a PVE cluster

cesarpk

Hi to all

If I have several PVE nodes (3.2, 3.1, and 2.3) in a PVE cluster, all of them configured with an NFS server as backup storage, and that NFS server is turned off, will the PVE cluster lose communication or quorum?

Best regards
Cesar
 
No, cluster communication is not related to backup storage.
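(If in doubt, quorum can be checked at any time with the standard cluster tool; a quick sketch:)

Code:
# prints cluster membership and whether the node is quorate,
# independent of any storage state
pvecm status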

Hi Dietmar, and thanks for your prompt response. But if I remember correctly, in the PVE 2.3 GUI I had a red light for that PVE node.

Best regards
cesar
 

It was a pvestatd bug: it hung on the availability check for the unreachable NFS storage. It's fixed now.

(Note that the best practice is to specify on which nodes the storage is available: in the GUI, storage options -> Nodes.)
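The same restriction can also be applied from the shell; a minimal sketch, where the storage ID and node names are placeholders:

Code:
# limit the storage to the nodes that actually use it, so the
# other nodes never try to activate (and stat) the NFS mount
pvesm set my-nfs-backup --nodes pve1,pve2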
 

Thanks spirit for your answer. But if I have only PVE 3.2 nodes and I need to power off my NFS server for a short time while no backup is in progress, should I first unmount the NFS share on my PVE nodes?
 
Please, can someone else confirm this bug?
I installed another two new Proxmox servers, joined them to the cluster, then connected an NFS server and selected it for backups only. I tested it and everything was working. Afterwards I stopped that NFS server, and in /var/log/syslog I can see:
Apr 23 12:08:41 cl2 pvestatd[16952]: WARNING: command 'df -P -B 1 /mnt/pve/cupid_data' failed: got timeout
Apr 23 12:09:15 cl2 pveproxy[23273]: WARNING: proxy detected vanished client connection
Apr 23 12:09:45 cl2 pveproxy[23337]: WARNING: proxy detected vanished client connection

The cluster splits and the GUI stops working correctly (all machines show just numbers with no names and status unknown, so it is impossible to migrate them or do anything else) until I start that server again.

Apr 23 12:19:59 cl2 kernel: ct0 nfs: server 192.168.80.101 OK
Apr 23 12:19:59 cl2 pvestatd[16952]: status update time (403.055 seconds)

Then I added another cluster node just to be sure, and the behaviour is exactly the same.

I think this is a BIG problem if a failure of some external storage makes the cluster unusable. The same happens when a Gluster or other storage back end fails.
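The hang pvestatd runs into can be reproduced by hand with the same df call shown in the log above; a small diagnostic sketch:

Code:
# against a dead hard-mounted NFS share this stat call blocks,
# which is exactly what stalls the pvestatd status updates
timeout 5 df -P -B 1 /mnt/pve/cupid_data || echo "df hung or failed"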
 

I am in a project that will have several PVE nodes and two NFS servers used only as backup storage; each NFS backup server will serve 50% of the PVE nodes. So if one of the NFS backup servers is stopped, 50% of my PVE nodes will have cluster communication problems. This situation is very alarming.

Can someone from the PVE team tell us anything about this matter, or better, how to fix it?

Best regards
Cesar

Re-edited: I think that a stopped shared NFS storage must never break PVE cluster communication, whatever it holds (ISOs, VM images, backups, anything).
 
Hi,

If you are only using the NFS server as a backup server (not for VM disks), you should try mounting it soft (the default is hard):

Code:
cat /etc/pve/storage.cfg
nfs: NFS_Proxmox
    path /mnt/pve/NFS_Proxmox
    server my-nfs.server.com
    export /nfs/proxmox
    options rw,sec=sys,noatime,nfsvers=3,soft,timeo=10,retrans=5,actimeo=10,retry=5
    content images,iso,vztmpl,rootdir,backup
    nodes pve1,pve2,pve3
    maxfiles 5
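Once the share has been remounted with these options, you can check what the kernel actually applied; a quick sketch using nfsstat (from the standard nfs-common tools):

Code:
# lists every NFS mount with its effective options;
# look for "soft" and the timeo/retrans values
nfsstat -m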

Regards
 

Many thanks mcmyst for your answer. But if I apply these changes manually, how long should I wait for them to take effect? Or can I run a command to make them take effect immediately (if I can't restart the PVE node)?

Best regards
Cesar

Re-edited: and what should I do on the PVE node when the NFS server comes back online?
 
I can confirm that this also happens with a Proxmox+Ceph setup. If for any reason the Ceph storage goes sideways, it prevents all GUI access. As soon as the Ceph storage is back, everything returns to normal. I had this issue with NFS back in Proxmox v2, but only when NFS was set up to hold VMs.
 
Hi mcmyst,
thanks for your hints, but unfortunately they do not solve the problem. In fact I had to unmount the share manually with umount -l /mnt/pve/cupid_data.
Here are my logs after turning off my NFS server:
Apr 24 11:27:02 cl1 kernel: ct0 nfs: server 192.168.80.200 not responding, timed out
Apr 24 11:27:02 cl1 pvestatd[2620]: WARNING: unable to activate storage 'cupid_data' - directory '/mnt/pve/cupid_data' does not exist
Apr 24 11:27:02 cl1 pvestatd[2620]: status update time (6.024 seconds)
Apr 24 11:29:15 cl1 pveproxy[35824]: WARNING: proxy detected vanished client connection
Apr 24 11:30:15 cl1 pvedaemon[13297]: WARNING: mkdir /mnt/pve/cupid_data: File exists at /usr/share/perl5/PVE/Storage/Plugin.pm line 792

The cluster starts working again only after unmounting manually on all servers, even with these changes in place. I also tried adding just the soft option in storage.cfg, but it behaves the same as before any change.
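For reference, a sketch of that manual recovery, assuming the storage ID is cupid_data as in the logs (whether your pvesm version supports the disable flag is an assumption to verify):

Code:
# on each affected node: lazy-unmount the dead share so stat calls stop blocking
umount -l /mnt/pve/cupid_data
# optionally disable the storage until the NFS server is back ...
pvesm set cupid_data --disable 1
# ... and re-enable it afterwards; pvestatd will remount it
pvesm set cupid_data --disable 0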

 
Can one of the devs please acknowledge this bug? I would like to help somehow or try any patches; unfortunately I am not a programmer. This error is really disturbing and keeps me from completely migrating from ESXi.
 

I am not aware of such an NFS bug. If you find one, please report it on https://bugzilla.proxmox.com with detailed information on how to trigger the issue.
 
Thank you for your answer Tom, I filed a bug: https://bugzilla.proxmox.com/show_bug.cgi?id=521

@liska:
@tom:

I was thinking about these questions:

1- What happens if the NFS server breaks down while a backup is in progress?
2- Is PVE prepared to deal with this situation, so that the VM/CT is finally unlocked?

If not, can you, liska or tom, file a bug on Bugzilla?

(I don't have the necessary equipment in my lab to run this test.)
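For anyone who does have a spare test box, the scenario is simple to trigger; a sketch (the VMID and storage ID are placeholders):

Code:
# start a suspend-mode backup to the NFS storage, then power off
# the NFS server while the backup is still running
vzdump 101 --storage NFS_Proxmox --mode suspend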

Best regards
Cesar
 
Hello,

two weeks ago the switch in the datacenter broke, so no communication was possible. After the switch was exchanged, our NFS storage was up and running, but the node was hanging and I couldn't access the server through the Dell DRAC. Only after a technician at the datacenter pulled the power cords and restarted the node was everything fine again.

So it's a very big problem: our datacenter is ~130 km away from us, and at night there's nobody there.
 
cesarpk:
I just tried to start a backup of a container (suspend mode only, as I do not have enough space in this testing virtual cluster for snapshot mode) and turned off my NFS server after a few seconds. The process hung until the NFS storage was available again.
When I stopped the NFS server and then the backup process, the backup failed but the container kept running just fine.
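If a guest is ever left locked by such a failed backup, the lock can be cleared by hand once no vzdump process is still running; a sketch for a KVM guest (100 is a placeholder VMID; OpenVZ containers are handled differently):

Code:
# an interrupted vzdump can leave a stale "backup" lock on the guest;
# remove it manually so the VM can be managed again
qm unlock 100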
 
I have just installed the latest updates. Unfortunately, this error is still not solved, and there is no answer on Bugzilla either.
Although I am very happy about all the new features, solving old major bugs should come first.
I have used Proxmox for a few years, but this seems like a REALLY huge problem keeping me from completely migrating from ESXi.
Please, can someone from the PVE team give us some updates about this?
 
