Hello,
First of all, sorry for my bad English.
I have a strange issue and I cannot figure out why it is happening. I have a cluster with 5 nodes using HA; Ceph is not enabled. Everything was working perfectly fine until I added the fifth node. I suspect it is related to corosync latency, but I cannot change the network to reduce the latency.
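In case it is relevant: from what I have read, the corosync token timeout can be raised to tolerate a bit more latency. My understanding is that the totem section of /etc/pve/corosync.conf would look roughly like this (just a sketch, I have not applied it; the cluster name is a placeholder and config_version has to be bumped on every edit):

totem {
  cluster_name: mycluster     # placeholder, ours is different
  config_version: 16          # must be incremented on every edit
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
  token: 10000                # example value: raise the token timeout to 10 seconds
}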
Regardless of that, the problem starts when one of the nodes (especially the fifth one I already mentioned) is rebooted: when I reboot it, some of the other nodes randomly get rebooted as well. I was thinking it was related to the timezone, or to the systemd-timesyncd service, which was not installed by default. All nodes have the default of 1 vote.
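For reference, this is how I understand votes and time sync can be checked on each node (just the standard tools, nothing special):

# show quorum state, expected votes and the votes of every member
pvecm status
# confirm that systemd-timesyncd is actually keeping the clock in sync
timedatectl status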
The whole scenario:
1. I reboot node 5; while it is rebooting it is, logically, shown as red with an X from the rest of the nodes.
2. When the server is back, I see it in HA again, and then randomly on the other nodes (and also on the rebooted one) the HA status becomes "old timestamp - dead?" and then active/idle again.
3. All nodes become active/idle, but 1-2 minutes later some other node reboots by itself.
This issue started appearing after I added the last node.
Sometimes nodes also reboot randomly, probably when the network is a bit more loaded.
My first question: what could the issue be? I bet it is related to the latency, which is sometimes high on 2 of the nodes, but I cannot do anything about that. It is just like that.
My second question: how can I prevent that rebooting? Can I somehow configure HA so that it does not randomly reboot other hosts when one of them has an unhealthy network or is being rebooted? What concerns me most is why it happens during a reboot of a host. I was blaming the date and time settings and the missing systemd-timesyncd service, but it is installed now and it happened again.
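From what I have read so far, the self-reboots sound like HA fencing via the watchdog, which stays armed as long as a node has (or recently had) HA resources. If I wanted to temporarily rule that out, I believe the resources could be taken out of HA management, roughly like this (vm:100 is just a placeholder ID):

# list the configured HA resources and the LRM/CRM state of every node
ha-manager status
# remove a single resource from HA management (placeholder ID)
ha-manager remove vm:100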
It is not a problem for this setup if the network is down from time to time; for that reason each node has 2 NICs in a bond. I just want to prevent the other nodes from rebooting when one node is not in good condition (rebooting, network issues, etc.).
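I was also wondering whether a dedicated second corosync link (in addition to the bond) would help, since knet can fail over between links. As far as I understand, each node entry in /etc/pve/corosync.conf would get a second ring address, roughly like this (the 10.10.10.x network is a placeholder for a separate link):

nodelist {
  node {
    name: node5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.1.15   # placeholder, existing cluster network
    ring1_addr: 10.10.10.15    # placeholder, separate second link
  }
  # ... the other four nodes would get a ring1_addr the same way
}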
More info about the nodes:
Virtual Environment 8.1.3
Proxmox VE is installed on top of a Debian 12 netinst.
All packages are up-to-date on all nodes.
Here are some entries from syslog that might be helpful:
Jan 15 13:09:26 node5 corosync[1983]: [KNET ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 4 link 0 but the other node is not acknowledging packets of this size.
Jan 15 13:09:26 node5 corosync[1983]: [KNET ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
Jan 15 10:07:30 node5 corosync[1983]: [KNET ] link: host: 3 link: 0 is down
Jan 15 10:07:30 node5 corosync[1983]: [KNET ] link: host: 1 link: 0 is down
Jan 15 10:07:30 node5 corosync[1983]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 15 10:07:30 node5 corosync[1983]: [KNET ] host: host: 3 has no active links
Jan 15 10:07:30 node5 corosync[1983]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 15 10:07:30 node5 corosync[1983]: [KNET ] host: host: 1 has no active links
Jan 15 10:07:33 node5 corosync[1983]: [KNET ] rx: host: 3 link: 0 is up
Jan 15 10:07:33 node5 corosync[1983]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Jan 15 10:07:33 node5 corosync[1983]: [KNET ] rx: host: 1 link: 0 is up
Jan 15 10:07:33 node5 corosync[1983]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Jan 15 10:07:33 node5 corosync[1983]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 15 10:07:33 node5 corosync[1983]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 15 10:08:53 node5 corosync[1983]: [TOTEM ] Retransmit List: 785d
Jan 15 09:23:42 node5 corosync[1983]: [CPG ] *** 0x560bf2c35010 can't mcast to group pve_dcdb_v1 state:1, error:12
Jan 15 09:23:42 node5 corosync[1983]: [CPG ] *** 0x560bf2c35010 can't mcast to group pve_dcdb_v1 state:1, error:12
Jan 15 09:23:43 node5 pmxcfs[1776]: [dcdb] notice: start cluster connection
Jan 15 09:23:43 node5 pmxcfs[1776]: [dcdb] crit: cpg_join failed: 14
Jan 15 09:23:43 node5 pmxcfs[1776]: [dcdb] crit: can't initialize service
Jan 15 09:23:44 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:44 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:44 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:44 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:45 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:45 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:45 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:45 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:46 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:46 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:46 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:46 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pve-ha-lrm[2153]: unable to write lrm status file - unable to open file '/etc/pve/nodes/node5/lrm_status.tmp.2153' - Device or resource busy
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:47 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:48 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:48 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:48 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
Jan 15 09:23:48 node5 pmxcfs[1776]: [dcdb] crit: cpg_send_message failed: 9
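Regarding the knet MTU warning at the top of the log, my plan is to verify the path MTU between node 5 and host 4 with a don't-fragment ping (1472 bytes of payload plus 28 bytes of IP/ICMP headers = 1500), something like:

# 192.168.1.14 is a placeholder for the corosync address of host 4
ping -M do -s 1472 -c 3 192.168.1.14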
Thanks a lot!