Hello,
I have four servers in a cluster. Last night, we experienced severe network flapping on 'srva' (on both the private and the public network), which affected the private network '10.50.255.0/24'. The expected behavior was for the three healthy nodes (srvb, srvc, srvd) to keep working together and for srva to be fenced out of the cluster.
But after a delay, all three nodes (srvb, srvc, srvd) were rebooted by the system, and only 'srva' stayed up and was never rebooted... In short, the only node that actually had a problem remained active. Why?
Note that the public network (ring1) was 100% available during the outage, except for node A (srva).
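For context, here is the vote arithmetic I was expecting (a rough Python sketch of my own understanding of plain votequorum counting, before any last_man_standing recalculation; the variable names are mine):
Code:
# My expectation of plain votequorum arithmetic (my assumption, not corosync source).
total_votes = 4                    # four nodes, quorum_votes: 1 each
quorum = total_votes // 2 + 1      # = 3 votes needed to stay quorate

healthy_side = {"srvb", "srvc", "srvd"}   # 3 votes -> quorate, keeps running
flapping_side = {"srva"}                  # 1 vote  -> inquorate, gets fenced

print(len(healthy_side) >= quorum)    # True
print(len(flapping_side) >= quorum)   # False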
Here is my corosync configuration:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: srvb
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.50.255.3
    ring1_addr: 51.xxx.xxx.xxx
  }
  node {
    name: srvd
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.50.255.5
    ring1_addr: 145.xxx.xxx.xxx
  }
  node {
    name: srvc
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.50.255.4
    ring1_addr: 217.xxx.xxx.xxx
  }
  node {
    name: srva
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.50.255.2
    ring1_addr: 51.xxx.xxx.xxx
  }
}

quorum {
  provider: corosync_votequorum
  wait_for_all: 1
  two_node: 0
  last_man_standing: 1
  last_man_standing_window: 10000
  auto_tie_breaker: 1
  auto_tie_breaker_node: lowest
}

totem {
  cluster_name: Cluster
  config_version: 6
  interface {
    knet_ping_interval: 200
    knet_ping_timeout: 5000
    knet_pong_count: 1
    linknumber: 0
  }
  interface {
    knet_ping_interval: 200
    knet_ping_timeout: 5000
    knet_pong_count: 1
    linknumber: 1
  }
  ip_version: ipv4
  link_mode: passive
  secauth: on
  version: 2
}
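One thing I am unsure about is the combination of last_man_standing and auto_tie_breaker. Since auto_tie_breaker_node is set to lowest and srva has nodeid 1, my reading of the documented behavior (a sketch of my own, not corosync code) is that in an even split the partition containing srva always wins the tie:
Code:
# Sketch (mine) of how I read auto_tie_breaker_node: lowest.
nodes = {1: "srva", 2: "srvb", 3: "srvc", 4: "srvd"}

def tie_winner(side_a, side_b):
    # In an even split, the side holding the lowest nodeid of the
    # whole cluster stays quorate (my understanding of the option).
    lowest = min(nodes)              # nodeid 1 = srva
    return side_a if lowest in side_a else side_b

print(tie_winner({1, 2}, {3, 4}))    # {1, 2}: the side with srva survives
Could last_man_standing (with its 10000 ms window) have shrunk expected_votes during the flapping, so that srva ended up on the winning side of such a tie?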
Datacenter configuration:
Code:
ha: shutdown_policy=migrate
migration: insecure,network=10.50.255.2/24
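For completeness, the migration network sits on the same private subnet that was flapping; here is a quick sanity check (my own snippet) that the configured /24 covers all the ring0 addresses:
Code:
# Quick check (mine): does the migration CIDR cover every ring0 address?
import ipaddress

migration_net = ipaddress.ip_network("10.50.255.2/24", strict=False)  # -> 10.50.255.0/24
ring0_addrs = ["10.50.255.2", "10.50.255.3", "10.50.255.4", "10.50.255.5"]

print(all(ipaddress.ip_address(a) in migration_net for a in ring0_addrs))  # True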
Can you help me find the culprit in my configuration?
Thank you for your time,
Best regards,
Stéphane