All servers from cluster reboot after one server reboots

richinbg · Oct 2, 2017

Hello,
i am running a cluster with Proxmox 4.4 with three nodes.
Everytime I do reboot one of those nodes, all nodes start to reboot and it takes over half an hour to an hour until all of the clusters have a quorum again and stop rebooting over and over again ...

Is this something that I am doing wrong? Or should this take so long?
I can provide information about configuration etc. if required and you can tell me what you require.

Thanks.

Alwin · Oct 2, 2017

Check your multicast setup, here are some docs on what and how to test.
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
https://pve.proxmox.com/wiki/Multicast_notes

If that doesn't help, then please provide your /etc/pve/corosync.conf, 'pveversion -v', 'pvecm status' and also any syslog/journal entry that show some error/warning message.

richinbg · Oct 4, 2017

Hello,
Thanks for your reply.

Here are the results:

Code:

pveversion -v
proxmox-ve: 4.4-96 (running kernel: 4.4.79-1-pve)
pve-manager: 4.4-18 (running version: 4.4-18/ef2610e8)
pve-kernel-4.4.79-1-pve: 4.4.79-95
pve-kernel-4.4.59-1-pve: 4.4.59-87
pve-kernel-4.4.44-1-pve: 4.4.44-84
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.4.76-1-pve: 4.4.76-94
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.49-1-pve: 4.4.49-86
pve-kernel-4.4.62-1-pve: 4.4.62-88
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-53
qemu-server: 4.0-112
pve-firmware: 1.1-11
libpve-common-perl: 4.0-96
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.9.0-5~pve4
pve-container: 1.0-101
pve-firewall: 2.0-33
pve-ha-manager: 1.0-41
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
openvswitch-switch: 2.6.0-2
ceph: 10.2.9-1~bpo80+1
root@ ~ # pvecm status
Quorum information
------------------
Date:             Wed Oct  4 09:52:44 2017
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1/21376
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.7.4.11 (local)
0x00000002          1 10.7.4.12
0x00000003          1 10.7.4.13

10.7.4.12 :   unicast, xmt/rcv/%loss = 450/450/0%, min/avg/max/std-dev = 0.040/0.108/0.892/0.062
10.7.4.12 : multicast, xmt/rcv/%loss = 450/447/0%, min/avg/max/std-dev = 0.047/0.118/0.770/0.055
10.7.4.13 :   unicast, xmt/rcv/%loss = 448/448/0%, min/avg/max/std-dev = 0.057/0.160/0.842/0.068
10.7.4.13 : multicast, xmt/rcv/%loss = 448/445/0%, min/avg/max/std-dev = 0.086/0.193/0.918/0.061

All three nodes have the same multicast address:
corosync-cmapctl -g totem.interface.0.mcastaddr
totem.interface.0.mcastaddr (str) = 239.192.104.2

Attached logs for omping -m 239.192.104.2 10.7.4.11 10.7.4.12 10.7.4.13.
I currently cannot provide an error log since I would need to reboot the machines.

I also noticed that after a reboot, they start to get a quorum and also before that boot up already all VMs... Normally this works. Even if I would unplug the network card from one machine, it was tested that the VMs which are in HA mode, get moved to another node within 2 mins.

Alwin · Oct 4, 2017

Do you use multiple switches/ paths for your corosync network? Looks to me, as if (R)STP would block one path and till it is unblocked, all nodes reboot (fencing).

richinbg · Oct 4, 2017

Just one 10G switch is used and should be configured indeed.
We are using bounding, too.

Alwin · Oct 4, 2017

richinbg said:
Just one 10G switch is used and should be configured indeed.

Are you using R-/STP on your system?

richinbg · Oct 6, 2017

Yes we are using that. And in general this works. Just if one server reboots we have the issue that it takes literally an hour until all of them are back in sync again.

Alwin · Oct 9, 2017

I guess, it might have to do with the RSTP, it might just block traffic from links till some point the conditions are right and the blockage can be undone. You can see this on your switches, when you reboot one PVE host.

richinbg · Oct 11, 2017

OK well I guess I iwll have a look for that on the switch the next time of a reboot.

Besides that, even though the quorum has not been successfully established the VMs already start booting. Should this not behave differently?
I would expect that it does not start to boot VMs before the quroum has been established.
Additionally, the strange thing is, even if it has been established, the machines get rebooted a couple of times first before they stay stable

Alwin · Oct 12, 2017

richinbg said:
Just one 10G switch is used and should be configured indeed.

You are using only one switch, so I assume, that have on NIC with 2 ports in a bond and all services are running over it?

Based on my assumption, if that is the case, then what you are seeing, is possible caused, by the recovery of Ceph (+ other traffic), while one node reboots and this interferes with corosync's traffic. In the end the quorum is lost on all servers. On startup corosync doesn't have a timestamp and so doesn't know when its token ran out, so it establishes quorum and starts the VMs, till soon after, it looses it again.

richinbg · Oct 21, 2017

Sorry for the late reply, I did not have time yet to look more into this.
If your assumption is correct, how could I prevent it? With having a second switch?

Alwin · Oct 23, 2017

For a stable quorum, corosync traffic should have its own network, on its own ports. To make corosync more fail safe it is possible to have two corosync rings, then a token received on either ring will give quorum.
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network

richinbg · Oct 30, 2017

I will have a look, thank you!

jdurand · Jun 3, 2020

Alwin said:
Based on my assumption, if that is the case, then what you are seeing, is possible caused, by the recovery of Ceph (+ other traffic), while one node reboots and this interferes with corosync's traffic.

Hello,

Could you please give us more details on corosync interferances problem ?
When a node is booting, what could be interfere with corosync ?
Furthermore when corosync use auth !
I need to understand this problem because i have a similar issue.

Search

Search

All servers from cluster reboot after one server reboots

richinbg

Member

Alwin

Proxmox Retired Staff

richinbg

Member

Attachments

Alwin

Proxmox Retired Staff

richinbg

Member

Alwin

Proxmox Retired Staff

richinbg

Member

Alwin

Proxmox Retired Staff

richinbg

Member

Alwin

Proxmox Retired Staff

richinbg

Member

Alwin

Proxmox Retired Staff

richinbg

Member

jdurand

New Member