Finally the cluster broke today; installed the patch, praying for the best.
Don't forget to restart corosync after the libknet upgrade.
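For anyone following along, the upgrade-and-restart step on a Proxmox node is roughly this (standard commands, nothing exotic; adjust as needed):

[root@kvm5b ~]# dpkg -s libknet1 | grep -i version
[root@kvm5b ~]# systemctl restart corosync
[root@kvm5b ~]# corosync-cfgtool -s
[root@kvm5b ~]# pvecm status

The first command confirms the patched libknet1 is actually installed, the restart makes corosync pick up the new library, and the last two show per-link status and quorum/membership afterwards.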
lol, of course...
I also deactivated debugging for now, and the second ring; it's running purely on the vrack ring now,
which was the most unstable configuration yet. Let's see.
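"Purely on the vrack ring" just means each nodelist entry keeps a single ring0_addr on the vRack subnet and debug logging goes back off. A rough sketch of the relevant corosync.conf parts (names and addresses below are placeholders, not the real config; on Proxmox you edit /etc/pve/corosync.conf and bump config_version so it propagates):

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: kvm5b
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.0.2.52
  }
  # the former ring1_addr lines are removed; the other nodes follow the same pattern
}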
Sep 20 09:26:27 kvm5b corosync[677695]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 20 09:26:27 kvm5b corosync[677695]: [TOTEM ] A new membership (2:604) was formed. Members
Sep 20 09:26:27 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]: [QUORUM] Members[5]: 2 3 4 5 6
Sep 20 09:26:27 kvm5b corosync[677695]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 20 09:26:27 kvm5b corosync[677695]: [TOTEM ] Process pause detected for 6347 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]: [TOTEM ] Process pause detected for 6347 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]: [TOTEM ] Process pause detected for 6347 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]: [TOTEM ] Process pause detected for 6347 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]: [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]: [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]: [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]: [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]: [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]: [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]: [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]: [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]: [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]: [TOTEM ] Process pause detected for 6397 ms, flushing membership messages.
Sep 20 09:26:27 kvm5b corosync[677695]: [TOTEM ] A new membership (2:608) was formed. Members
Sep 20 09:26:27 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:26:27 kvm5b corosync[677695]: [QUORUM] Members[5]: 2 3 4 5 6
Sep 20 09:26:31 kvm5b corosync[677695]: [TOTEM ] Process pause detected for 10549 ms, flushing membership messages.
Sep 20 09:26:31 kvm5b corosync[677695]: [TOTEM ] Process pause detected for 10549 ms, flushing membership messages.
Sep 20 09:26:31 kvm5b corosync[677695]: [TOTEM ] Process pause detected for 10551 ms, flushing membership messages.
Sep 20 09:26:32 kvm5b corosync[677695]: [TOTEM ] A new membership (2:656) was formed. Members
Sep 20 09:26:32 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:26:32 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:26:32 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:26:32 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:26:32 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:26:32 kvm5b corosync[677695]: [QUORUM] Members[5]: 2 3 4 5 6
Sep 20 09:27:35 kvm5b corosync[677695]: [TOTEM ] A new membership (2:816) was formed. Members left: 1
Sep 20 09:27:35 kvm5b corosync[677695]: [TOTEM ] Failed to receive the leave message. failed: 1
Sep 20 09:27:35 kvm5b corosync[677695]: [CPG ] downlist left_list: 1 received
Sep 20 09:27:35 kvm5b corosync[677695]: [CPG ] downlist left_list: 1 received
Sep 20 09:27:35 kvm5b corosync[677695]: [CPG ] downlist left_list: 1 received
Sep 20 09:27:35 kvm5b corosync[677695]: [CPG ] downlist left_list: 1 received
Sep 20 09:27:35 kvm5b corosync[677695]: [CPG ] downlist left_list: 1 received
Sep 20 09:27:35 kvm5b corosync[677695]: [QUORUM] Members[5]: 2 3 4 5 6
Sep 20 09:27:35 kvm5b corosync[677695]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 20 09:30:24 kvm5b corosync[677695]: [KNET ] rx: host: 1 link: 0 is up
Sep 20 09:30:24 kvm5b corosync[677695]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 20 09:30:27 kvm5b corosync[677695]: [TOTEM ] A new membership (1:820) was formed. Members joined: 1
Sep 20 09:30:27 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:30:27 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:30:27 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:30:27 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:30:27 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:30:27 kvm5b corosync[677695]: [CPG ] downlist left_list: 0 received
Sep 20 09:30:27 kvm5b corosync[677695]: [QUORUM] Members[6]: 1 2 3 4 5 6
Sep 20 09:30:27 kvm5b corosync[677695]: [MAIN ] Completed service synchronization, ready to provide service.
Sorry, this is a French page, but it's the free one.
The reason is simply that we wanted to leave corosync alone on a dedicated link, so the extra gigabit wouldn't help me at all anyway.
Also, an additional 25€ on a 100€ server is kind of ridiculous.
Oh, by the way, they DON'T guarantee anything on the 1 Gbit vRack either. What you perhaps mean is the public interface:
there you get 1 Gbit for free, and 1 Gbit guaranteed for 100€ more.
Edit: the issue doesn't seem to be saturation on that vrack either; it's more like spikes lasting a fraction of a second, several times a day,
usually at very comparable times. So my best guess would be a port reset there, which maybe has the same effect as saturation would.
Our line itself is squeaky clean: a stable couple of kbit/s, no traffic spikes.
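Since the spikes only last a fraction of a second, a timestamped ping can catch them and show exactly when they happen, so they can be lined up against those "very comparable times" (peer address is a placeholder; -D prefixes each reply with the absolute time, and the awk filter only prints replies slower than 5 ms):

ping -D -i 0.2 -s 8972 192.0.2.53 | awk -F'time=' '/time=/ { split($2, a, " "); if (a[1] + 0 > 5) print }'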
Difficult to know without debug enabled.
The following could perhaps be a separate but related bug. These logs are from a remaining node when one of the nodes dropped off. Cluster membership updates successfully to 5/6, but notification messages then start escalating, with process pauses being reported. It eventually settles, but I'm sure the 'token: 10000' setting saved us from an escalating node-fencing scenario...
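For reference, the 'token: 10000' mentioned here is just the totem token timeout in milliseconds in corosync.conf (the rest of the totem section is left out):

totem {
  version: 2
  token: 10000
}

With 10 s to receive the token, a short stall is ridden out instead of immediately forming a new membership, which is presumably what kept the fencing cascade from starting.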
What I mean is that since their vRack is basically a VXLAN overlay tunnel, I wonder if with 100 Mbit/s they don't apply QoS and put a lot of users inside it.
And maybe you don't see saturation on your side, but the underlay network has spikes or buffer exhaustion.
I have some coworkers who worked at OVH before; I'll try to ask them. (I'm 10 km away from OVH's RBX datacenter.)
Yes, anyway it's a corosync bug. The latency triggers the bug, but without the bug the state should come back OK afterwards.
But seriously, I'm aware that I could hit a bottleneck on the vRouter when some jackass saturates their line, but that doesn't really fit the error picture here.
Sep 21 08:48:00 h3 systemd[1]: Started Proxmox VE replication runner.
Sep 21 08:48:59 h3 corosync[29388]: [KNET ] link: host: 4 link: 0 is down
Sep 21 08:48:59 h3 corosync[29388]: [KNET ] link: host: 1 link: 0 is down
Sep 21 08:48:59 h3 corosync[29388]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Sep 21 08:48:59 h3 corosync[29388]: [KNET ] host: host: 4 has no active links
Sep 21 08:48:59 h3 corosync[29388]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 21 08:48:59 h3 corosync[29388]: [KNET ] host: host: 1 has no active links
Sep 21 08:49:00 h3 systemd[1]: Starting Proxmox VE replication runner...
Sep 21 08:49:00 h3 corosync[29388]: [KNET ] rx: host: 1 link: 0 is up
Sep 21 08:49:00 h3 corosync[29388]: [KNET ] rx: host: 4 link: 0 is up
Sep 21 08:49:00 h3 corosync[29388]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 21 08:49:00 h3 corosync[29388]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Sep 21 08:49:01 h3 pvesr[29545]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 21 08:49:02 h3 systemd[1]: pvesr.service: Succeeded.
[root@kvm1 ~]# ping -i 0.1 -s 8972 1.1.7.10
<snip>
8980 bytes from 1.1.7.10: icmp_seq=8886 ttl=64 time=0.628 ms
8980 bytes from 1.1.7.10: icmp_seq=8887 ttl=64 time=0.592 ms
8980 bytes from 1.1.7.10: icmp_seq=8888 ttl=64 time=0.606 ms
^C
--- 1.1.7.10 ping statistics ---
8888 packets transmitted, 8886 received, 0.0225023% packet loss, time 1903ms
rtt min/avg/max/mdev = 0.468/0.765/22.508/0.956 ms
[root@kvm1 ~]# ping -i 0.1 -s 8972 1.1.7.11
8980 bytes from 1.1.7.11: icmp_seq=8847 ttl=64 time=0.507 ms
8980 bytes from 1.1.7.11: icmp_seq=8848 ttl=64 time=0.521 ms
8980 bytes from 1.1.7.11: icmp_seq=8849 ttl=64 time=0.569 ms
^C
--- 1.1.7.11 ping statistics ---
8849 packets transmitted, 8847 received, 0.0226014% packet loss, time 1590ms
rtt min/avg/max/mdev = 0.478/0.651/8.176/0.449 ms
[root@kvm3 ~]# ping -i 0.1 -s 8972 1.1.7.10
<snip>
8980 bytes from 1.1.7.10: icmp_seq=8800 ttl=64 time=0.593 ms
8980 bytes from 1.1.7.10: icmp_seq=8801 ttl=64 time=0.627 ms
8980 bytes from 1.1.7.10: icmp_seq=8802 ttl=64 time=0.620 ms
^C
--- 1.1.7.10 ping statistics ---
8802 packets transmitted, 8795 received, 0.0795274% packet loss, time 1088ms
rtt min/avg/max/mdev = 0.478/0.797/12.302/0.938 ms
Sep 21 11:54:28 kvm1 corosync[2956667]: [KNET ] link: host: 2 link: 0 is down
Sep 21 11:54:28 kvm1 corosync[2956667]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 21 11:54:28 kvm1 corosync[2956667]: [KNET ] host: host: 2 has no active links
Sep 21 11:54:28 kvm1 corosync[2956667]: [TOTEM ] Knet host change callback. nodeid: 2 reachable: 0
Sep 21 11:54:28 kvm1 corosync[2956667]: [KNET ] rx: host: 2 link: 0 received pong: 1
Sep 21 11:54:31 kvm1 corosync[2956667]: [KNET ] rx: host: 2 link: 0 received pong: 2
Sep 21 11:54:34 kvm1 corosync[2956667]: [KNET ] rx: host: 2 link: 0 is up
Sep 21 11:54:34 kvm1 corosync[2956667]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 21 11:54:36 kvm1 corosync[2956667]: [TOTEM ] Token has not been received in 382 ms
Sep 21 11:54:36 kvm2 corosync[2513045]: [TOTEM ] Token has not been received in 7987 ms
Sep 21 11:54:36 kvm3 corosync[3689364]: [TOTEM ] Token has not been received in 382 ms
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] link: host: 2 link: 0 is down
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] host: host: 2 has no active links
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Sep 21 11:55:17 kvm3 corosync[3689364]: [KNET ] rx: Source host 2 not reachable yet
Do you have a corosync crash, hang, or an isolated node with this version? Or after the "not reachable yet" network messages, does it come back on its own?
I upgraded libknet to 1.11-pve2 and restarted corosync with debugging enabled.
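"Debugging enabled" is simply the corosync logging section, e.g.:

logging {
  debug: on
  to_syslog: yes
}

Worth turning back off once enough has been captured, since debug output is very chatty.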
Simultaneous pings between all three nodes yield (IMHO) marginal packet loss. Corosync competes with Ceph on a 2x1GbE LACP bond (OvS).
kvm1 -> kvm2 pings lost: 2 / 8980
kvm1 -> kvm3 pings lost: 2 / 8980
kvm3 -> kvm2 pings lost: 7 / 8980
Pings are sent at a rate of 10 per second as 9000-byte frames (8972 bytes of payload + 8-byte ICMP header + 20-byte IP header).
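The runs above boil down to something like this (addresses as in the output above; -M do and the -c cap are additions here: -M do forbids fragmentation so the test really exercises full 9000-byte frames, and -c 9000 just lets the loop stop on its own instead of being interrupted by hand):

for peer in 1.1.7.10 1.1.7.11; do
    ping -i 0.1 -s 8972 -M do -c 9000 "$peer" | tail -n 2
done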
[root@kvm1 ~]# last | grep -i boot
reboot system boot 5.0.21-1-pve Fri Sep 13 10:14 still running
reboot system boot 5.0.21-1-pve Fri Sep 13 04:06 still running
reboot system boot 5.0.21-1-pve Tue Sep 10 21:35 still running
reboot system boot 5.0.21-1-pve Sun Sep 8 03:34 still running
reboot system boot 5.0.21-1-pve Thu Sep 5 00:43 still running
reboot system boot 5.0.21-1-pve Wed Sep 4 03:58 still running
reboot system boot 5.0.21-1-pve Sun Sep 1 20:50 still running
reboot system boot 5.0.21-1-pve Sat Aug 31 16:06 still running
reboot system boot 5.0.21-1-pve Fri Aug 30 05:35 still running
reboot system boot 5.0.18-1-pve Sun Aug 25 03:02 still running
reboot system boot 5.0.15-1-pve Tue Aug 20 09:48 still running
reboot system boot 5.0.15-1-pve Mon Aug 19 12:51 still running
reboot system boot 5.0.15-1-pve Sun Aug 18 12:08 still running
reboot system boot 5.0.15-1-pve Sat Aug 17 07:46 still running
reboot system boot 5.0.15-1-pve Fri Aug 9 09:16 still running
reboot system boot 5.0.15-1-pve Thu Aug 8 08:49 still running
reboot system boot 5.0.15-1-pve Wed Aug 7 09:48 still running
reboot system boot 5.0.15-1-pve Tue Aug 6 07:46 still running
reboot system boot 5.0.15-1-pve Tue Aug 6 02:48 still running
reboot system boot 5.0.15-1-pve Mon Aug 5 20:21 still running
reboot system boot 5.0.15-1-pve Mon Aug 5 16:37 still running
reboot system boot 5.0.15-1-pve Mon Aug 5 16:12 still running
reboot system boot 5.0.15-1-pve Mon Aug 5 15:53 still running
reboot system boot 5.0.15-1-pve Sun Aug 4 22:29 still running
reboot system boot 5.0.15-1-pve Sat Aug 3 21:44 still running
If that was referring to me, not really any problems with OVS. I've been running balance-tcp LACP since the patch was installed on the 16-node cluster. In fact, not even a single corosync log entry since, and I've tried to stress-test it by running iperf sessions in both directions (even simultaneously, to the point where there is a bit of packet loss). What is the reasoning behind not recommending active-backup anyway?
One last thing: I have seen a user report a bug with openvswitch + bond and corosync too (reverting back to a Linux bridge fixed it). Can't confirm this.
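For completeness, a balance-tcp LACP bond under OVS on Proxmox looks roughly like this in /etc/network/interfaces (NIC and bridge names are placeholders, the switch side needs a matching LACP port-channel, and the exact stanzas can differ between ifupdown versions):

allow-vmbr0 bond0
iface bond0 inet manual
    ovs_bridge vmbr0
    ovs_type OVSBond
    ovs_bonds eno1 eno2
    ovs_options bond_mode=balance-tcp lacp=active other_config:lacp-time=fast

auto vmbr0
iface vmbr0 inet manual
    ovs_type OVSBridge
    ovs_ports bond0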
Great that the patch is working fine for you. Active-backup should work fine too; I don't see any reason not to use it (you just can't increase the bandwidth, but it should be pretty stable).
HTH, Ådne