Hello,
We are experiencing random fencing on our stretched cluster of 4 PVE nodes + 1 tie-breaker node. It has now happened three times.
Our topology:
- DC1: Node 1 and Node 2
- DC2: Node 3 and Node 4
- DC3: Node 5. This node acts as a tie-breaker. It is connected via a VPN because it is our only option to have a 3rd DC.
We have no issues with Ceph.
Context and hardware history:
- Fencing 1: We do not have the logs, but it looked very similar to fencing 2.
- Fencing 2 (June 15): Node 1 and Node 2 (in DC1) fenced and rebooted. Node 3 and Node 4 (in DC2) stayed online. At that time, Node 5 was a mini PC that suffered from frequent hard freezes.
- Hardware change: Between June 15 and June 26, we replaced the mini PC with a standard PC to resolve the hardware freezes.
- Fencing 3 (June 26): Node 1 and Node 2 (in DC1) fenced and rebooted. Node 3 and Node 4 (in DC2) stayed online. The logs for this third event are very different from the second one.
The problem: In the most recent event, Node 1 and Node 2 (in DC1) fenced and rebooted, while Node 3 and Node 4 (in DC2) did not fence.
Looking at the logs, we assume this is caused by an MTU mismatch due to the VPN encapsulation for Node 5.
Bash:
Jun 15 15:29:00 NODE-1 corosync[2459]: [KNET ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 5 link 0 but the other node is not acknowledging packets of this size.
Jun 15 15:29:00 NODE-1 corosync[2459]: [KNET ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
(I have also attached the error logs from Node 1, Node 2 and Node 5 to this post [second and third fencing]).
We plan to create a dedicated 1 Gbps network strictly for Corosync. To solve the VPN overhead issue, we intend to lower the MTU to around 1400 on these dedicated Corosync interfaces across all 5 nodes.
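To check whether an MTU of 1400 actually passes over the VPN path to Node 5, we could probe it with pings that have the Don't Fragment bit set. This is only a rough sketch: the helper names and the address are placeholders, not from our setup, and it assumes IPv4 and the Linux iputils `ping`.

```shell
#!/usr/bin/env bash

# Hypothetical helper: ICMP payload size that exactly fills one IPv4 packet
# of the given MTU. Overhead is 20 bytes (IPv4 header without options)
# plus 8 bytes (ICMP header) = 28 bytes.
icmp_payload_for_mtu() {
    local mtu="$1"
    echo $(( mtu - 28 ))
}

# Hypothetical probe: send 3 pings with the Don't Fragment bit set
# (-M do, Linux iputils ping). If the path cannot carry packets of this
# size, the pings fail instead of being silently fragmented.
probe_mtu() {
    local target="$1" mtu="$2"
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$mtu")" "$target"
}

# Example with a placeholder address (assumption, replace with Node 5's
# address on the Corosync link):
# probe_mtu 192.0.2.5 1400
```

If the probe succeeds at 1400 but fails at 1500, that would at least be consistent with the pmtud warning in the log above.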
Will setting a lower MTU globally for Corosync be enough to stabilize the heartbeat over the VPN and prevent the watchdog from fencing the nodes? Or is there anything else we need to configure?
Thank you for your help.