Hello,
We have a 7-node hyperconverged cluster that was originally running Corosync on a single ring over eno1 (copper, MTU 1500) without issues.
To introduce redundancy, we added a second network using a bonded Mellanox interface (bond0), which is also used for the Ceph public network. The bond itself is configured with MTU 9216. On top of this bond, we created a VLAN interface specifically for Corosync and set its MTU to 1500 to match the existing eno1 network.
Since introducing the second ring, we are experiencing instability with Corosync KNET. The eno1 interface is being constantly reset, nodes intermittently appear offline, and in some cases fencing is triggered. We suspect this may be due to both links having the same priority, causing Corosync to switch between them.
What would be the best way to stabilise Corosync in this setup and introduce redundancy without causing downtime to the running VMs?
This is the corosync.conf:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: server3298
    nodeid: 3
    quorum_votes: 1
    ring0_addr: x.x.x2.228
    ring1_addr: x.x.x3.228
  }
  node {
    name: server3300
    nodeid: 7
    quorum_votes: 1
    ring0_addr: x.x.x2.232
    ring1_addr: x.x.x3.232
  }
  node {
    name: server3301
    nodeid: 1
    quorum_votes: 1
    ring0_addr: x.x.x2.226
    ring1_addr: x.x.x3.226
  }
  node {
    name: server3303
    nodeid: 2
    quorum_votes: 1
    ring0_addr: x.x.x2.227
    ring1_addr: x.x.x3.227
  }
  node {
    name: server3310
    nodeid: 4
    quorum_votes: 1
    ring0_addr: x.x.x2.229
    ring1_addr: x.x.x3.229
  }
  node {
    name: server3311
    nodeid: 5
    quorum_votes: 1
    ring0_addr: x.x.x2.230
    ring1_addr: x.x.x3.230
  }
  node {
    name: server3312
    nodeid: 6
    quorum_votes: 1
    ring0_addr: x.x.x2.231
    ring1_addr: x.x.x3.231
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: ProxmoxCluster
  config_version: 11
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  knet_mtu: 1400
  secauth: on
  version: 2
}
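Neither interface section above sets a priority, which fits the "(pri: 1)" reported for both links in the logs. For reference, this is a minimal sketch of what explicit priorities in the totem section might look like. It assumes knet_link_priority is set per interface and that, in passive mode, the link with the higher value is preferred. The values 10 and 20 and the choice to prefer link 1 are illustrative assumptions only, not something we have tested. Only the changed part is shown, and config_version would have to be incremented from 11 for the change to be picked up.

```
totem {
  interface {
    linknumber: 0
    # assumption: lower value, so this link becomes the fallback
    knet_link_priority: 10
  }
  interface {
    linknumber: 1
    # assumption: higher value, so this link is preferred in passive mode
    knet_link_priority: 20
  }
  link_mode: passive
}
```

Whether this alone would stop link 0 from flapping is unclear, since the logs show link 0 itself going down and up repeatedly rather than just a change of the preferred link.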
These are the system logs from one of the nodes; all nodes follow the same pattern:
Code:
Apr 12 09:45:55 server3312 corosync[1782853]: [KNET ] link: host: 4 link: 0 is down
Apr 12 09:45:55 server3312 corosync[1782853]: [KNET ] link: host: 1 link: 0 is down
Apr 12 09:45:55 server3312 corosync[1782853]: [KNET ] host: host: 4 (passive) best link: 1 (pri: 1)
Apr 12 09:45:55 server3312 corosync[1782853]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Apr 12 09:45:57 server3312 corosync[1782853]: [KNET ] rx: host: 7 link: 0 is up
Apr 12 09:45:57 server3312 corosync[1782853]: [KNET ] link: Resetting MTU for link 0 because host 7 joined
Apr 12 09:45:57 server3312 corosync[1782853]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Apr 12 09:45:57 server3312 corosync[1782853]: [KNET ] pmtud: Global data MTU changed to: 1285
Apr 12 09:45:58 server3312 corosync[1782853]: [KNET ] rx: host: 1 link: 0 is up
Apr 12 09:45:58 server3312 corosync[1782853]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Apr 12 09:45:58 server3312 corosync[1782853]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Apr 12 09:45:58 server3312 corosync[1782853]: [KNET ] rx: host: 3 link: 0 is up
Apr 12 09:45:58 server3312 corosync[1782853]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Apr 12 09:45:58 server3312 corosync[1782853]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr 12 09:45:58 server3312 corosync[1782853]: [KNET ] pmtud: Global data MTU changed to: 1285
Apr 12 09:46:00 server3312 corosync[1782853]: [KNET ] link: host: 7 link: 0 is down
Apr 12 09:46:00 server3312 corosync[1782853]: [KNET ] host: host: 7 (passive) best link: 1 (pri: 1)
Apr 12 09:46:01 server3312 corosync[1782853]: [KNET ] link: host: 3 link: 0 is down
Apr 12 09:46:01 server3312 corosync[1782853]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Apr 12 09:46:03 server3312 corosync[1782853]: [KNET ] link: host: 5 link: 0 is down
Apr 12 09:46:03 server3312 corosync[1782853]: [KNET ] host: host: 5 (passive) best link: 1 (pri: 1)
Apr 12 09:46:08 server3312 corosync[1782853]: [KNET ] rx: host: 2 link: 0 is up
Apr 12 09:46:08 server3312 corosync[1782853]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Apr 12 09:46:08 server3312 corosync[1782853]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Apr 12 09:46:08 server3312 corosync[1782853]: [KNET ] pmtud: Global data MTU changed to: 1285
Apr 12 09:46:09 server3312 corosync[1782853]: [KNET ] rx: host: 3 link: 0 is up
Apr 12 09:46:09 server3312 corosync[1782853]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Apr 12 09:46:09 server3312 corosync[1782853]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr 12 09:46:10 server3312 corosync[1782853]: [KNET ] pmtud: Global data MTU changed to: 1285
Apr 12 09:46:10 server3312 corosync[1782853]: [KNET ] rx: host: 5 link: 0 is up
Apr 12 09:46:10 server3312 corosync[1782853]: [KNET ] link: Resetting MTU for link 0 because host 5 joined
Apr 12 09:46:10 server3312 corosync[1782853]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Apr 12 09:46:10 server3312 corosync[1782853]: [KNET ] pmtud: Global data MTU changed to: 1285
Apr 12 09:46:13 server3312 corosync[1782853]: [KNET ] link: host: 2 link: 0 is down
Apr 12 09:46:13 server3312 corosync[1782853]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
Apr 12 09:46:13 server3312 corosync[1782853]: [KNET ] link: host: 1 link: 0 is down
Apr 12 09:46:13 server3312 corosync[1782853]: [KNET ] link: host: 3 link: 0 is down
Apr 12 09:46:13 server3312 corosync[1782853]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Apr 12 09:46:13 server3312 corosync[1782853]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Apr 12 09:46:15 server3312 corosync[1782853]: [KNET ] link: host: 5 link: 0 is down
Apr 12 09:46:15 server3312 corosync[1782853]: [KNET ] host: host: 5 (passive) best link: 1 (pri: 1)
Apr 12 09:46:16 server3312 corosync[1782853]: [KNET ] rx: host: 1 link: 0 is up
Apr 12 09:46:16 server3312 corosync[1782853]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Apr 12 09:46:16 server3312 corosync[1782853]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Apr 12 09:46:17 server3312 corosync[1782853]: [KNET ] pmtud: Global data MTU changed to: 1285
Apr 12 09:46:24 server3312 corosync[1782853]: [KNET ] rx: host: 7 link: 0 is up