Hello,
A few times per month I experience an issue where the entire cluster reboots. By rebooting I mean all of the physical servers rebooting.
I don't see anything in particular in the logs, but I believe it is caused by Corosync / the HA watchdog fencing (rebooting) the physical servers because of a network issue.
Previously I only had ring_0, and I know for a fact that it sometimes loses connectivity. I have a janky bond0 LACP config where 10.10.10.0/24 is only reachable via one of the uplinks and not the other (yes, it's bad, but it works until bond0 decides to move traffic over to the other uplink).
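For context, the bond is defined in /etc/network/interfaces roughly like this (NIC names are placeholders, not my actual ones):
Code:
auto bond0
iface bond0 inet manual
    # eno1's switch port carries 10.10.10.0/24, eno2's does not -- the janky part
    bond-slaves eno1 eno2
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer2+3

auto vmbr0
iface vmbr0 inet static
    address 10.10.10.4/24
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0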
Now I've added ring_1, using a dedicated NIC and switch just for this. The config looks like this:
corosync.conf
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: server-04-vm
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.4
    ring1_addr: 10.250.30.4
  }
  node {
    name: server-05-vm
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.5
    ring1_addr: 10.250.30.5
  }
  node {
    name: server-06-vm
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.6
    ring1_addr: 10.250.30.6
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Proxmox-VM
  config_version: 30
  interface {
    linknumber: 0
    knet_link_priority: 100
  }
  interface {
    linknumber: 1
    knet_link_priority: 200
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
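For anyone following along: I applied this by editing a working copy, bumping config_version, and moving the file into place. As far as I understand, pmxcfs then syncs /etc/pve/corosync.conf to all nodes and corosync picks up the new version on its own:
Code:
# Work on a copy so a half-edited file is never live
cp /etc/pve/corosync.conf /root/corosync.conf.new
nano /root/corosync.conf.new   # add ring1_addr + the link 1 interface, bump config_version
mv /root/corosync.conf.new /etc/pve/corosync.conf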
Based on my understanding, ring1_addr / link 1 should now always be used by corosync (with link_mode: passive, knet should stick to the connected link with the highest knet_link_priority). Checking the logs, I can confirm this is the case:
Code:
May 23 15:25:40 server-04-vm corosync[2418]: [KNET ] link: host: 2 link: 0 is down
May 23 15:25:40 server-04-vm corosync[2418]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 200)
May 23 15:25:43 server-04-vm corosync[2418]: [KNET ] link: host: 3 link: 0 is down
May 23 15:25:43 server-04-vm corosync[2418]: [KNET ] host: host: 3 (passive) best link: 1 (pri: 200)
May 23 15:25:49 server-04-vm corosync[2418]: [KNET ] rx: host: 3 link: 0 is up
May 23 15:25:49 server-04-vm corosync[2418]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
May 23 15:25:49 server-04-vm corosync[2418]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 200)
May 23 15:25:49 server-04-vm corosync[2418]: [KNET ] pmtud: Global data MTU changed to: 1397
May 23 15:25:59 server-04-vm corosync[2418]: [KNET ] rx: host: 2 link: 0 is up
May 23 15:25:59 server-04-vm corosync[2418]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
May 23 15:25:59 server-04-vm corosync[2418]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 200)
May 23 15:25:59 server-04-vm corosync[2418]: [KNET ] pmtud: Global data MTU changed to: 1397
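In case it helps, this is how I check the link state on each node (corosync-cfgtool and pvecm are the standard tools here, as far as I know):
Code:
# Per-link knet status as seen by the local corosync
corosync-cfgtool -s

# Quorum / membership overview from the Proxmox side
pvecm status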
Am I in the clear now? When link 0 flaps again (on a majority of the servers), can I expect corosync to simply not care, or do I need other configuration so that the servers in the cluster aren't all rebooted? For your information: the cluster consists of more than 3 servers, but I trimmed the config to keep this post short!
I'm also unsure whether corosync.conf is the only configuration that needs touching. I use the HA feature; does it have a separate config to tell it to always use 10.250.30.0/24 and to not touch 10.10.10.0/24 (the unstable one) unless the primary goes down?
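If it matters, these are the HA-related pieces I've been watching when link 0 flaps (service names from the standard Proxmox HA stack, as far as I can tell):
Code:
# HA manager overview: master, per-node LRM state, resources
ha-manager status

# Logs from the HA resource managers around the time of a flap
journalctl -u pve-ha-lrm -u pve-ha-crm --since "-1h"

# The watchdog multiplexer that actually triggers the reboot/fence
systemctl status watchdog-mux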
Thank you for the help.
In case the question was unclear: can 10.10.10.0/24 die with everything staying fine and no reboots?