Corosync - Mysterious reboot after network flapping

Oct 11, 2020
Hello,

I have four servers in a cluster. Last night, we experienced severe network flapping on 'srva' (private and public networks), which affected the private network '10.50.255.0/24'. The expected behavior was for the three nodes srvb, srvc, and srvd to keep working together and for srva to drop out of the cluster.
But after a delay, all three nodes (srvb, srvc, srvd) were rebooted by the system, and only 'srva' stayed up and was never rebooted. In summary, only the node that had the problem remained active. Why?

Note that the public network (ring1) was 100% available during the outage, except for node A (srva).

Here is my corosync configuration:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: srvb
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.50.255.3
    ring1_addr: 51.xxx.xxx.xxx
  }
  node {
    name: srvd
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.50.255.5
    ring1_addr: 145.xxx.xxx.xxx
  }
  node {
    name: srvc
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.50.255.4
    ring1_addr: 217.xxx.xxx.xxx
  }
  node {
    name: srva
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.50.255.2
    ring1_addr: 51.xxx.xxx.xxx
  }
}

quorum {
  provider: corosync_votequorum
  wait_for_all: 1
  two_node: 0
  last_man_standing: 1
  last_man_standing_window: 10000
  auto_tie_breaker: 1
  auto_tie_breaker_node: lowest
}

totem {
  cluster_name: Cluster
  config_version: 6
  interface {
    knet_ping_interval: 200
    knet_ping_timeout: 5000
    knet_pong_count: 1
    linknumber: 0
  }
  interface {
    knet_ping_interval: 200
    knet_ping_timeout: 5000
    knet_pong_count: 1
    linknumber: 1
  }
  ip_version: ipv4
  link_mode: passive
  secauth: on
  version: 2
}
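
For reference, the expectation described above (srvb, srvc, and srvd staying quorate without srva) follows from the basic majority rule that votequorum applies to the quorum_votes in the nodelist. The sketch below is only that arithmetic, assuming one vote per node as configured; the function names are hypothetical, and it deliberately ignores wait_for_all, last_man_standing, and auto_tie_breaker, so it does not model what actually happened during the flapping.

```python
# Votes as listed in the nodelist above (quorum_votes: 1 for every node).
VOTES = {"srva": 1, "srvb": 1, "srvc": 1, "srvd": 1}


def quorum_threshold(expected_votes):
    """Basic majority rule: more than half of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate(partition, votes=VOTES):
    """Return True if the nodes in `partition` together hold a majority.

    Simplification: special votequorum options (wait_for_all,
    last_man_standing, auto_tie_breaker) are NOT taken into account.
    """
    partition_votes = sum(votes[node] for node in partition)
    return partition_votes >= quorum_threshold(sum(votes.values()))


if __name__ == "__main__":
    print("threshold:", quorum_threshold(sum(VOTES.values())))
    print("srvb+srvc+srvd quorate:", is_quorate({"srvb", "srvc", "srvd"}))
    print("srva alone quorate:", is_quorate({"srva"}))
```

Under this simple rule, the three healthy nodes hold 3 of 4 votes and stay quorate, while srva alone does not, which is why the observed outcome is surprising.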

Datacenter configuration:

Code:
ha: shutdown_policy=migrate
migration: insecure,network=10.50.255.2/24

Can you help me find the culprit in my configuration?

Thank you for your time,

Best regards,

Stéphane
 

spirit

Could you post the output of pveversion -v?

(Just to be sure, because a bug was recently fixed in the pve-cluster package where, in rare cases, the whole cluster could hang when one node left or joined the cluster.)
 
Oct 11, 2020
$ pveversion -v
Code:
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-15 (running version: 6.2-15/48bd51b6)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-4
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-9
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.1-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-10
pve-cluster: 6.2-1
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-6
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-19
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve2
All servers give the same result as above.
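
One way to double-check that every node really reports the same package versions is to diff the pveversion -v output collected from each server. The following is a minimal sketch, assuming the output has been saved as text per node; the helper names parse_pveversion and differing_packages are hypothetical and not part of any Proxmox tooling.

```python
def parse_pveversion(text):
    """Parse 'name: version' lines from pveversion -v output into a dict.

    Trailing notes such as '(running kernel: ...)' are dropped.
    """
    versions = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(": ")
        if not sep:
            continue  # skip lines that are not 'name: version'
        versions[name.strip()] = rest.split(" (", 1)[0].strip()
    return versions


def differing_packages(per_node):
    """Return {package: {node: version}} for packages whose version
    is not identical on all nodes (a missing package counts as None)."""
    names = set()
    for versions in per_node.values():
        names.update(versions)
    diff = {}
    for name in sorted(names):
        seen = {node: versions.get(name) for node, versions in per_node.items()}
        if len(set(seen.values())) > 1:
            diff[name] = seen
    return diff
```

Usage would be along the lines of differing_packages({"srva": parse_pveversion(out_a), "srvb": parse_pveversion(out_b)}), where out_a and out_b are the saved outputs; an empty result means the parsed versions match.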
 