Some cluster hosts are down since last vzdump

Dec 4, 2017
Hi to all,

Every weekend we trigger a full backup of all VMs on our 5-host Proxmox cluster.

Our Proxmox version is PVE 5.0-30. All nodes run exactly the same version.

We don't really know why, but since the last backup (Sunday, December 3rd), 2 of the 5 nodes have been displayed as unavailable in the web console.

Corosync is running on all hosts. We've tried stopping the pve-cluster and corosync services on all nodes and starting them one by one, with no success.

/etc/pve is shared and mounted on all hosts. Ping is OK, SSH between hosts is OK, and all hosts are on the same IPv4 subnet.
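
For reference, the restart sequence we used on each node was along these lines (a sketch of what we ran, in case it matters):

Code:
# stop the cluster filesystem first, then corosync
systemctl stop pve-cluster corosync
# start them again in reverse order, one node at a time
systemctl start corosync
systemctl start pve-cluster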

"pvecm nodes" returns :

Code:
Membership information
----------------------
    Nodeid      Votes Name
         5          1 srvvirt01
         3          1 srvvirt02
         2          1 srvvirt03
         4          1 srvvirt04
         1          1 srvvirt05 (local)


"pvecm status" returns :
Code:
pvecm status
Quorum information
------------------
Date:             Mon Dec  4 15:45:22 2017
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000005
Ring ID:          5/572
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000005          1 10.100.1.1 (local)
0x00000003          1 10.100.1.2
0x00000002          1 10.100.1.3
0x00000004          1 10.100.1.4
0x00000001          1 10.100.1.5


Because we work on a local network with gigabit switches, we did not find it necessary to enable multicast on our switches for just 5 hosts.
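
(For what it's worth, when we want to sanity-check multicast between the nodes, we run omping on all nodes at once, something like the following with our own hostnames:)

Code:
omping -c 10000 -i 0.001 -F -q srvvirt01 srvvirt02 srvvirt03 srvvirt04 srvvirt05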

We've tried to gather information from the logs, but we cannot see any suspicious problem, except (maybe) some latency caused by simultaneous vzdump runs to an external NAS target.

Do you think we can recover cluster sync without rebooting the Proxmox hosts? Each host is running 15 VMs and migration is unavailable.

Thanks to all for your precious help.
 
You could try to restart the 'pvestatd' daemon on the nodes which are marked as down:

Code:
systemctl restart pvestatd
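
If a node drops out again afterwards, the daemon's journal may show why, for example:

Code:
journalctl -u pvestatd --since today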
 
Hi Dominik,
Thanks for your suggestion.

The good point is that when we restart pvestatd, the host comes back into the cluster for a while, and then goes down again.

pvestatd takes quite a long time to restart: it looks like it tries to kill the "/sbin/vgs" process and falls into a timeout after a few minutes.
Then it starts again, creating a new "vgs" process (which is a symlink to /sbin/lvm).

Our cluster shares LVM volumes on a Fibre Channel storage.
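
In case it helps, this is roughly how we checked whether vgs was stuck in uninterruptible sleep (blocked on storage I/O):

Code:
# list processes stuck in D state (uninterruptible I/O wait)
ps -eo pid,stat,cmd | awk '$2 ~ /^D/'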

Any other suggestions?
 
Have a look at your backup share mount status. If you're using the cluster-wide vzdump schedule, all your hosts are running it simultaneously, and it's likely your backup target isn't handling it very well. A hung NFS mount = a hung pvestatd; a hung pvestatd = a hung pveproxy. Leave that unaddressed and the node will fall down. If enough nodes fall down, your cluster will fall down. Pretty dangerous all around.
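
A quick hung-mount check (the path is just an example; PVE normally mounts NFS storages under /mnt/pve/<storage-id>):

Code:
# if this times out instead of returning, the mount is dead
timeout 5 stat /mnt/pve/backup-nas || echo "backup mount unresponsive"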
 
Thanks, we will move to a rock-solid NFS share and use a script to back up sequentially instead of simultaneously.
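
The script will probably be a simple loop over the nodes (a rough sketch; the node names are ours, the storage name is a placeholder):

Code:
#!/bin/bash
# back up all guests one node at a time instead of cluster-wide at once
for node in srvvirt01 srvvirt02 srvvirt03 srvvirt04 srvvirt05; do
    ssh root@"$node" "vzdump --all --storage backup-nfs --mode snapshot" \
        || echo "backup failed on $node" >&2
done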

Meanwhile, we had to reboot cluster nodes to recover our cluster.

Thanks to all
 
