We are seeing the same behaviour on a PVE 6 cluster that was upgraded from PVE 5. I'm not sure the root cause is the same, but the symptom certainly is: the logs seem to indicate pvesr hanging on one host or another right before the quorum loss, yet we don't have any replication jobs configured, so maybe that too is just a symptom of a deeper root cause. Corosync needs to be restarted on most/all nodes before quorum is restored. We didn't see this issue with PVE 5.

We are using Ceph storage on these hosts. There is no indication of any issues on the Ceph side: we have rebuilt all OSDs involved since upgrading (due to the BlueFS WAL/journal overflow issue when upgrading to Nautilus), and Ceph has reported being healthy ever since.
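For anyone wanting to double-check the same thing on their cluster: assuming the standard PVE layout (where replication job definitions live in /etc/pve/replication.cfg), a quick sanity check is something like:

```shell
# Check whether any pvesr replication jobs are defined.
# On a standard PVE install, job definitions live in /etc/pve/replication.cfg;
# a missing or empty file means no jobs are configured.
if [ -s /etc/pve/replication.cfg ]; then
    echo "replication jobs present"
else
    echo "no replication jobs configured"
fi
# Running "pvesr status" on each node would additionally show whether the
# replication runner itself is hanging.
```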
Right around the time this issue manifests, corosync repeatedly reports that the token has not been received, e.g.:
Code:
Aug 1 04:51:52 sanctuary corosync[369105]: [TOTEM ] Token has not been received in 61 ms
Aug 1 04:51:55 sanctuary corosync[369105]: [TOTEM ] A new membership (2:273076) was formed. Members left: 1
Aug 1 04:51:55 sanctuary corosync[369105]: [TOTEM ] Failed to receive the leave message. failed: 1
Aug 1 04:51:58 sanctuary corosync[369105]: [TOTEM ] A new membership (2:273080) was formed. Members
Aug 1 04:51:58 sanctuary corosync[369105]: [CPG ] downlist left_list: 1 received
Aug 1 04:51:58 sanctuary corosync[369105]: [CPG ] downlist left_list: 1 received
Aug 1 04:51:58 sanctuary pmxcfs[217081]: [dcdb] notice: members: 2/217081, 3/1330
Aug 1 04:51:58 sanctuary pmxcfs[217081]: [dcdb] notice: starting data syncronisation
Aug 1 04:51:59 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 10
Aug 1 04:52:00 sanctuary systemd[1]: Starting Proxmox VE replication runner...
Aug 1 04:52:00 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 20
Aug 1 04:52:00 sanctuary corosync[369105]: [TOTEM ] A new membership (2:273084) was formed. Members
Aug 1 04:52:01 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 30
Aug 1 04:52:02 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 40
Aug 1 04:52:02 sanctuary corosync[369105]: [TOTEM ] A new membership (2:273088) was formed. Members
Aug 1 04:52:02 sanctuary corosync[369105]: [CPG ] downlist left_list: 0 received
Aug 1 04:52:03 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 10
Aug 1 04:52:03 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 50
Aug 1 04:52:04 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 20
Aug 1 04:52:04 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 60
Aug 1 04:52:05 sanctuary corosync[369105]: [TOTEM ] A new membership (2:273092) was formed. Members
Aug 1 04:52:05 sanctuary corosync[369105]: [CPG ] downlist left_list: 0 received
Aug 1 04:52:05 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 30
Aug 1 04:52:05 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 70
Aug 1 04:52:06 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 40
Aug 1 04:52:06 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 80
Aug 1 04:52:07 sanctuary corosync[369105]: [TOTEM ] A new membership (2:273096) was formed. Members
Aug 1 04:52:07 sanctuary corosync[369105]: [CPG ] downlist left_list: 0 received
Aug 1 04:52:07 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 50
Aug 1 04:52:07 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 90
Aug 1 04:52:08 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 60
Aug 1 04:52:08 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retry 100
Aug 1 04:52:08 sanctuary pmxcfs[217081]: [dcdb] notice: cpg_send_message retried 100 times
Aug 1 04:52:08 sanctuary pmxcfs[217081]: [status] notice: members: 2/217081, 3/1330
Aug 1 04:52:08 sanctuary pmxcfs[217081]: [status] notice: starting data syncronisation
Aug 1 04:52:09 sanctuary corosync[369105]: [TOTEM ] A new membership (2:273100) was formed. Members
Aug 1 04:52:09 sanctuary corosync[369105]: [CPG ] downlist left_list: 0 received
Aug 1 04:52:09 sanctuary corosync[369105]: [CPG ] downlist left_list: 0 received
Aug 1 04:52:09 sanctuary corosync[369105]: [CPG ] downlist left_list: 0 received
Aug 1 04:52:09 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 70
Aug 1 04:52:09 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 10
Aug 1 04:52:10 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 80
Aug 1 04:52:10 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 20
Aug 1 04:52:11 sanctuary corosync[369105]: [TOTEM ] A new membership (2:273104) was formed. Members
Aug 1 04:52:11 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 90
Aug 1 04:52:11 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 30
Aug 1 04:52:12 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 100
Aug 1 04:52:12 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retried 100 times
Aug 1 04:52:12 sanctuary pmxcfs[217081]: [status] crit: cpg_send_message failed: 6
Aug 1 04:52:12 sanctuary pve-firewall[1711]: firewall update time (9.326 seconds)
Aug 1 04:52:12 sanctuary pmxcfs[217081]: [status] notice: cpg_send_message retry 40
Aug 1 04:52:13 sanctuary corosync[369105]: [CPG ] downlist left_list: 0 received
Aug 1 04:52:13 sanctuary corosync[369105]: [TOTEM ] A new membership (2:273108) was formed. Members
Aug 1 04:52:13 sanctuary corosync[369105]: [CPG ] downlist left_list: 0 received
Aug 1 04:52:13 sanctuary corosync[369105]: [CPG ] downlist left_list: 0 received
Aug 1 04:52:13 sanctuary corosync[369105]: [CPG ] downlist left_list: 0 received
Aug 1 04:52:13 sanctuary corosync[369105]: [QUORUM] Members[2]: 2 3
Aug 1 04:52:13 sanctuary corosync[369105]: [MAIN ] Completed service synchronization, ready to provide service.
From this point on, quorum is never reached again until corosync is restarted on multiple hosts. I also notice corosync reporting very strange MTU discovery behaviour right before, during, and after this issue:
Code:
Aug 1 04:55:41 sanctuary corosync[369105]: [KNET ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 1 link 0 but the other node is not acknowledging packets of this size.
Aug 1 04:55:41 sanctuary corosync[369105]: [KNET ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
Aug 1 04:55:41 sanctuary corosync[369105]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 1366 to 1350
Aug 1 04:55:41 sanctuary corosync[369105]: [KNET ] pmtud: Global data MTU changed to: 1350
Aug 1 04:56:11 sanctuary corosync[369105]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 1350 to 1366
Aug 1 04:56:11 sanctuary corosync[369105]: [KNET ] pmtud: Global data MTU changed to: 1366
Aug 1 05:40:45 sanctuary corosync[369105]: [KNET ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 1 link 0 but the other node is not acknowledging packets of this size.
Aug 1 05:40:45 sanctuary corosync[369105]: [KNET ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
Aug 1 05:40:45 sanctuary corosync[369105]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 1366 to 1350
Aug 1 05:40:45 sanctuary corosync[369105]: [KNET ] pmtud: Global data MTU changed to: 1350
Aug 1 05:41:15 sanctuary corosync[369105]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 1350 to 1366
Aug 1 05:41:15 sanctuary corosync[369105]: [KNET ] pmtud: Global data MTU changed to: 1366
Of course, there is no real MTU change on this network. The hosts are directly connected through a single switch, jumbo frames haven't been configured yet, and everything is 1500 with no encapsulating protocols in effect.
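To rule out a genuine path MTU problem between the nodes, one can probe manually with the Don't Fragment bit set (the peer address below is a placeholder for one of the other cluster nodes):

```shell
# Probe the path MTU to a peer node (replace 10.0.0.1 with the peer's address).
# -M do sets the Don't Fragment bit; -s 1472 makes the packet exactly 1500
# bytes on the wire (1472 payload + 8 ICMP header + 20 IP header).
# If this fails while smaller sizes succeed, something on the path really
# is dropping full-size frames.
ping -c 3 -M do -s 1472 10.0.0.1
```

In our case these probes succeed consistently, which is why I suspect the flapping is on corosync's side rather than the network's.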
Is there a way to simply disable this userspace path MTU detection that corosync (via knet) tries to do? I suspect it's just plain broken.
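I haven't tried this yet, but corosync.conf(5) for corosync 3.x documents a knet_pmtud_interval option in the totem section (default 30 seconds, which would match the 30-second flapping interval in the log above). Raising it should at least make PMTUD run far less often, even if it can't be disabled outright; a sketch:

```
totem {
    # ... existing options (cluster_name, config_version, etc.) unchanged ...

    # Seconds between knet path-MTU-discovery runs; the default is 30.
    # A large value drastically reduces PMTUD churn, though it does not
    # disable the mechanism entirely.
    knet_pmtud_interval: 86400
}
```

On PVE the edit would go through /etc/pve/corosync.conf with config_version bumped, as usual. If anyone knows of a way to turn PMTUD off completely, I'd still like to hear it.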
edit: I should also note that this seems to happen at roughly the same interval as described in the original post, every 12-24 hours, but never at exactly the same time each morning. Possibly it's a fixed interval from the last corosync restart?