Hello,
I have four servers in a cluster. Last night, we experienced severe network flapping on 'srva' (on both the private and the public network), which affected the private network '10.50.255.0/24'. The expected behavior was for the three healthy nodes (srvb, srvc, srvd) to keep working together and for srva to be fenced out of the cluster.
But after a delay, all three nodes (srvb, srvc, srvd) were rebooted by the system, and only 'srva' stayed up and was never rebooted... In short, the only node that actually had a problem remained active. Why?
Note that the public network (ring1) was 100% available during the outage, except for node A (srva).
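For context, here is the vote arithmetic I was expecting (a rough Python sketch of my own understanding of plain votequorum counting, before any last_man_standing recalculation; the variable names are mine):
Code:
# My expectation of plain votequorum arithmetic (my assumption, not corosync source).
total_votes = 4                    # four nodes, quorum_votes: 1 each
quorum = total_votes // 2 + 1      # = 3 votes needed to stay quorate

healthy_side = {"srvb", "srvc", "srvd"}   # 3 votes -> quorate, keeps running
flapping_side = {"srva"}                  # 1 vote  -> inquorate, gets fenced

print(len(healthy_side) >= quorum)    # True
print(len(flapping_side) >= quorum)   # False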
Here is my corosync configuration:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: srvb
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.50.255.3
    ring1_addr: 51.xxx.xxx.xxx
  }
  node {
    name: srvd
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.50.255.5
    ring1_addr: 145.xxx.xxx.xxx
  }
  node {
    name: srvc
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.50.255.4
    ring1_addr: 217.xxx.xxx.xxx
  }
  node {
    name: srva
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.50.255.2
    ring1_addr: 51.xxx.xxx.xxx
  }
}

quorum {
  provider: corosync_votequorum
  wait_for_all: 1
  two_node: 0
  last_man_standing: 1
  last_man_standing_window: 10000
  auto_tie_breaker: 1
  auto_tie_breaker_node: lowest
}

totem {
  cluster_name: Cluster
  config_version: 6
  interface {
    knet_ping_interval: 200
    knet_ping_timeout: 5000
    knet_pong_count: 1
    linknumber: 0
  }
  interface {
    knet_ping_interval: 200
    knet_ping_timeout: 5000
    knet_pong_count: 1
    linknumber: 1
  }
  ip_version: ipv4
  link_mode: passive
  secauth: on
  version: 2
}
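One thing I am unsure about is the combination of last_man_standing and auto_tie_breaker. Since auto_tie_breaker_node is set to lowest and srva has nodeid 1, my reading of the documented behavior (a sketch of my own, not corosync code) is that in an even split the partition containing srva always wins the tie:
Code:
# Sketch (mine) of how I read auto_tie_breaker_node: lowest.
nodes = {1: "srva", 2: "srvb", 3: "srvc", 4: "srvd"}

def tie_winner(side_a, side_b):
    # In an even split, the side holding the lowest nodeid of the
    # whole cluster stays quorate (my understanding of the option).
    lowest = min(nodes)              # nodeid 1 = srva
    return side_a if lowest in side_a else side_b

print(tie_winner({1, 2}, {3, 4}))    # {1, 2}: the side with srva survives
Could last_man_standing (with its 10000 ms window) have shrunk expected_votes during the flapping, so that srva ended up on the winning side of such a tie?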
Datacenter configuration:
Code:
ha: shutdown_policy=migrate
migration: insecure,network=10.50.255.2/24
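For completeness, the migration network sits on the same private subnet that was flapping; here is a quick sanity check (my own snippet) that the configured /24 covers all the ring0 addresses:
Code:
# Quick check (mine): does the migration CIDR cover every ring0 address?
import ipaddress

migration_net = ipaddress.ip_network("10.50.255.2/24", strict=False)  # -> 10.50.255.0/24
ring0_addrs = ["10.50.255.2", "10.50.255.3", "10.50.255.4", "10.50.255.5"]

print(all(ipaddress.ip_address(a) in migration_net for a in ring0_addrs))  # True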
Can you help me find the culprit in my configuration?
Thank you for your time,
Best regards,
Stéphane