Some cluster hosts are down since last vzdump

Dec 4, 2017
Hi to all,

Every weekend we trigger a full backup of all VMs on our 5-host Proxmox cluster.

Our Proxmox version is PVE 5.0-30. All nodes run exactly the same version.

We don't really know why, but since the last backup (Sunday, December 3rd), 2 of the 5 nodes have been displayed as unavailable in the web console.

Corosync is running on all hosts. We've tried stopping the pve-cluster and corosync services on all nodes and starting them one by one, with no success.

/etc/pve is shared and mounted on all hosts. Ping is OK, SSH between hosts is OK, and all hosts are on the same IPv4 subnet.
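
For reference, the restart sequence we used on each node was along these lines (a sketch of what we ran, in case it matters):

Code:
# stop the cluster filesystem first, then corosync
systemctl stop pve-cluster corosync
# start them again in reverse order, one node at a time
systemctl start corosync
systemctl start pve-cluster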

"pvecm nodes" returns :

Code:
Membership information
----------------------
    Nodeid      Votes Name
         5          1 srvvirt01
         3          1 srvvirt02
         2          1 srvvirt03
         4          1 srvvirt04
         1          1 srvvirt05 (local)


"pvecm status" returns :
Code:
pvecm status
Quorum information
------------------
Date:             Mon Dec  4 15:45:22 2017
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000005
Ring ID:          5/572
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000005          1 10.100.1.1 (local)
0x00000003          1 10.100.1.2
0x00000002          1 10.100.1.3
0x00000004          1 10.100.1.4
0x00000001          1 10.100.1.5


Because we work on a local network with gigabit switches, we did not find it necessary to enable multicast on our switches for just 5 hosts.
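
(For what it's worth, when we want to sanity-check multicast between the nodes, we run omping on all nodes at once, something like the following with our own hostnames:)

Code:
omping -c 10000 -i 0.001 -F -q srvvirt01 srvvirt02 srvvirt03 srvvirt04 srvvirt05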

We've tried to gather information from the logs, but we cannot see any suspicious problem, except (maybe) some latency caused by simultaneous vzdump runs to an external NAS target.

Do you think we can recover cluster sync without rebooting the Proxmox hosts? Each host is running 15 VMs and migration is unavailable.

Thanks to all for your precious help.
 
You could try to restart the 'pvestatd' daemon on the nodes which are marked as down:

Code:
systemctl restart pvestatd
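
If a node drops out again afterwards, the daemon's journal may show why, for example:

Code:
journalctl -u pvestatd --since today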
 
Hi Dominik,
Thanks for your suggestion.

The good point is that when we restart pvestatd, the host comes back into the cluster for a while, and then goes down again.

pvestatd takes quite a long time to restart: it looks like it tries to kill the "/sbin/vgs" process and falls into a timeout after a few minutes.
Then it starts again, creating a new "vgs" process (which is a symlink to /sbin/lvm).

Our cluster shares LVM volumes on a Fibre Channel storage.
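
In case it helps, this is roughly how we checked whether vgs was stuck in uninterruptible sleep (blocked on storage I/O):

Code:
# list processes stuck in D state (uninterruptible I/O wait)
ps -eo pid,stat,cmd | awk '$2 ~ /^D/'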

Any other suggestions?
 
Have a look at your backup share mount status. If you're using the cluster-wide vzdump schedule, all your hosts are running it simultaneously, and it's likely your backup target isn't handling it very well. A hung NFS mount = a hung pvestatd; a hung pvestatd = a hung pveproxy. Leave that unaddressed and the node will fall down. If enough nodes fall down, your cluster will fall down. Pretty dangerous all around.
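
A quick hung-mount check (the path is just an example; PVE normally mounts NFS storages under /mnt/pve/<storage-id>):

Code:
# if this times out instead of returning, the mount is dead
timeout 5 stat /mnt/pve/backup-nas || echo "backup mount unresponsive"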
 
Thanks, we will move to a rock-solid NFS share and use a script to back up sequentially instead of simultaneously.
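
The script will probably be a simple loop over the nodes (a rough sketch; the node names are ours, the storage name is a placeholder):

Code:
#!/bin/bash
# back up all guests one node at a time instead of cluster-wide at once
for node in srvvirt01 srvvirt02 srvvirt03 srvvirt04 srvvirt05; do
    ssh root@"$node" "vzdump --all --storage backup-nfs --mode snapshot" \
        || echo "backup failed on $node" >&2
done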

Meanwhile, we had to reboot cluster nodes to recover our cluster.

Thanks to all
 
