Hello there,
I would like to ask for some help and guidance debugging an issue where our cluster loses corosync quorum (and reboots completely) when a new node is added. This has already happened twice on our 7-node cluster (when adding the 4th node, and now again with the 7th).
Deployment context:
- every server is a fresh Debian Buster installation + custom package installation & environment configuration
- custom kernel (based on 5.4.48)
- installed package proxmox-ve=6.2-1 from the pve-no-subscription repository
- creating the cluster (and joining nodes) always via the pvecm command
- corosync with 2 links (VLANs)
- Dell blade servers, uplink to 2 switches in active-backup bonding, VLANs on top
- using OpenVSwitch for PVE node networking (eno1+eno2 bonding -> vmbr0 -> two internal VLAN interfaces - management and ceph storage); a rough sketch of this layout follows after this list
- all VMs registered in HA manager
- VMs use external Ceph storage
- pve-firewall not used; our iptables stateful rules allow all traffic from both networks used in the cluster (nothing should be dropped, but I cannot be 100% sure since we don't monitor that) - a simplified version is also sketched below
- nodes were not added to the cluster in alphabetical order
- second corosync link was added after first cluster reboot (when adding 4th node) following this https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_adding_redundant_links_to_an_existing_cluster
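For clarity, here is a simplified sketch of the per-node network layout described above (the bond/bridge names match the description; the internal port names, VLAN tags and addresses are placeholders, and the real /etc/network/interfaces has a few more options):
Code:
allow-ovs vmbr0
iface vmbr0 inet manual
    ovs_type OVSBridge
    ovs_ports bond0 mgmt ceph

allow-vmbr0 bond0
iface bond0 inet manual
    ovs_type OVSBond
    ovs_bridge vmbr0
    ovs_bonds eno1 eno2
    ovs_options bond_mode=active-backup

# internal port on the management VLAN (placeholder tag/address)
allow-vmbr0 mgmt
iface mgmt inet static
    ovs_type OVSIntPort
    ovs_bridge vmbr0
    ovs_options tag=20
    address 10.30.20.X/24

# internal port on the ceph storage VLAN (placeholder tag/address)
allow-vmbr0 ceph
iface ceph inet static
    ovs_type OVSIntPort
    ovs_bridge vmbr0
    ovs_options tag=40
    address 10.30.40.X/24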
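And the firewall side, reduced to its essence (the real ruleset is longer; the /24 masks are assumed here):
Code:
# accept established/related traffic
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# accept everything coming from the management and storage/corosync networks
iptables -A INPUT -s 10.30.20.0/24 -j ACCEPT
iptables -A INPUT -s 10.30.40.0/24 -j ACCEPT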
- added server ovirt7 into the cluster, using ovirt9 as the join target - command issued at 2020-09-01 08:22:21
Code:
-> pvecm add ovirt9 -link0 10.30.20.19 -link1 10.30.40.57
Please enter superuser (root) password for 'ovirt9': ********
Establishing API connection with host 'ovirt9'
The authenticity of host 'ovirt9' can't be established.
X509 SHA256 key fingerprint is 84:A8:E0:22:6E:01:8A:AF:4B:C8:A1:14:7A:40:02:C4:6A:72:0C:40:1E:5D:35:24:04:C0:86:85:BD:CF:0D:5C.
Are you sure you want to continue connecting (yes/no)? yes
Login succeeded.
check cluster join API version
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1598941349.sql.gz'
waiting for quorum... <<<EDIT: it hung here until the whole cluster rebooted>>> OK
(re)generate node files
generate new node certificate
merge authorized SSH keys and known hosts
generated new node certificate, restart pveproxy and pvedaemon services
successfully added node 'ovirt7' to cluster.
What happened:
- node added into cluster
- corosync configuration was updated, but quorum was never reached afterwards
- my SSH connections to a few servers stayed alive, the servers could ping each other with no packet loss and minimal latency (<1 ms), but pvecm status timed out (returned without any output) - the standard diagnostic commands I can still run are sketched after this list
- every node except the new one rebooted due to the fencing watchdog
- networking logs from the switches show only Link down & Link up caused by the reboots
- current version of /etc/corosync/corosync.conf -- see attachment
- diff against the previous version of corosync.conf (got from internal backups):
Code:
--- a/etc/corosync/corosync.conf
+++ b/etc/corosync/corosync.conf
@@ -11,6 +11,13 @@ nodelist {
ring0_addr: 10.30.20.137
ring1_addr: 10.30.40.60
}
+ node {
+ name: ovirt7
+ nodeid: 7
+ quorum_votes: 1
+ ring0_addr: 10.30.20.19
+ ring1_addr: 10.30.40.57
+ }
node {
name: ovirt8
nodeid: 6
@@ -54,7 +61,7 @@ quorum {
totem {
cluster_name: ovirt
- config_version: 14
+ config_version: 15
interface {
linknumber: 0
}
- syslog from server ovirt9 (which the new node was connecting to) -- see attachment
- syslog from server ovirt7 (the new one) -- see attachment
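As mentioned above, I can still run the standard diagnostic commands on any node now (nothing custom) and attach their output if that helps:
Code:
pvecm status                    # PVE view of membership / quorum
corosync-quorumtool -s          # corosync quorum state and vote counts
corosync-cfgtool -s             # per-link (knet) connectivity status
journalctl -u corosync -u pve-cluster --since "2020-09-01 08:00"   # logs around the event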
Final words and thoughts:
- I assume you will advise me to use physically separated corosync links, but I am unable to achieve that with my available hardware. I am using it this way because it's possible we will split the two VLANs on the way from the chassis (the only common link will then be the server NIC and the switch in the blade chassis)
- since this is a fresh event from today, I might be able to gather more details from the servers if needed
- I would be really grateful for any comment, advice or guidance, since I don't understand what happened. As far as I can tell everything went the typical way, but for some reason corosync was unable to get traffic through (a capture plan for next time is sketched after this list)
- corosync even complained about address changes when the 6th node was added, and everything still went well that time
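If this happens again, I plan to capture the corosync traffic on both links while the node joins, roughly like this (assuming the default knet port 5405; the interface names are the placeholder internal VLAN ports from the sketch above):
Code:
# run on one of the existing nodes during the join
tcpdump -ni mgmt udp port 5405 -w corosync-link0.pcap &
tcpdump -ni ceph udp port 5405 -w corosync-link1.pcap &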
Thank you to everyone who read the whole post up to here.