Cluster nodes are all split into single-node mode after auto reboot.

2342w3e4r

Jan 13, 2022
I've tried every way I can think of to bring the cluster back up, including setting expected votes to 1 with "pvecm e 1" and force-syncing the corosync config version, but no luck.

The nodes in the cluster were split into two halves at first, but after some random fencing reboots they are all single nodes now.

Code:
ansible % ansible all -i 34nodes -m shell -a "pvecm status | grep votes"
10.50.50.3 | CHANGED | rc=0 >>
Expected votes:   34
Total votes:      1
10.50.50.1 | CHANGED | rc=0 >>
Expected votes:   34
Total votes:      1
10.50.50.2 | CHANGED | rc=0 >>
Expected votes:   34
Total votes:      1
10.50.50.4 | CHANGED | rc=0 >>
Expected votes:   34
Total votes:      1
10.50.50.6 | CHANGED | rc=0 >>
Expected votes:   34
Total votes:      1
10.50.50.7 | CHANGED | rc=0 >>
Expected votes:   34
Total votes:      1
10.50.50.11 | CHANGED | rc=0 >>
Expected votes:   34
Total votes:      1
10.50.50.5 | CHANGED | rc=0 >>
Expected votes:   34
Total votes:      1
10.50.50.10 | CHANGED | rc=0 >>
Expected votes:   34
Total votes:      1
10.50.50.9 | CHANGED | rc=0 >>

There are around 200 VMs spread across the nodes, and I don't think I can simply rejoin the nodes to the cluster without losing them?

Code:
cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: 0417-02u
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.50.50.1
  }
  node {
    name: 0417-04u
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.50.50.2
  }
  node {
    name: 0417-06u
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.50.50.3
  }
  node {
    name: 0417-08u
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.50.50.4
  }
  node {
    name: 0417-10u
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.50.50.5
  }
  node {
    name: 0417-12u
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.50.50.6
  }
  node {
    name: 0417-14u
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.50.50.7
  }
  node {
    name: 0417-16u
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.50.50.8
  }
  node {
    name: 0417-18u
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.50.50.9
  }
  node {
    name: 0417-20u
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 10.50.50.10
  }
  node {
    name: 0417-22u
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 10.50.50.11
  }
  node {
    name: 0417-24u
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 10.50.50.12
  }
  node {
    name: 0417-26u
    nodeid: 13
    quorum_votes: 1
    ring0_addr: 10.50.50.13
  }
  node {
    name: 0417-28u
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 10.50.50.14
  }
  node {
    name: 0417-30u
    nodeid: 15
    quorum_votes: 1
    ring0_addr: 10.50.50.15
  }
  node {
    name: 0417-32u
    nodeid: 16
    quorum_votes: 1
    ring0_addr: 10.50.50.16
  }
  node {
    name: 0417-34u
    nodeid: 17
    quorum_votes: 1
    ring0_addr: 10.50.50.17
  }
  node {
    name: 0418-02u
    nodeid: 18
    quorum_votes: 1
    ring0_addr: 10.50.50.21
  }
  node {
    name: 0418-04u
    nodeid: 19
    quorum_votes: 1
    ring0_addr: 10.50.50.22
  }
  node {
    name: 0418-06u
    nodeid: 20
    quorum_votes: 1
    ring0_addr: 10.50.50.23
  }
  node {
    name: 0418-08u
    nodeid: 21
    quorum_votes: 1
    ring0_addr: 10.50.50.24
  }
  node {
    name: 0418-10u
    nodeid: 22
    quorum_votes: 1
    ring0_addr: 10.50.50.25
  }
  node {
    name: 0418-12u
    nodeid: 23
    quorum_votes: 1
    ring0_addr: 10.50.50.26
  }
  node {
    name: 0418-14u
    nodeid: 24
    quorum_votes: 1
    ring0_addr: 10.50.50.27
  }
  node {
    name: 0418-16u
    nodeid: 25
    quorum_votes: 1
    ring0_addr: 10.50.50.28
  }
  node {
    name: 0418-18u
    nodeid: 26
    quorum_votes: 1
    ring0_addr: 10.50.50.29
  }
  node {
    name: 0418-20u
    nodeid: 27
    quorum_votes: 1
    ring0_addr: 10.50.50.30
  }
  node {
    name: 0418-22u
    nodeid: 28
    quorum_votes: 1
    ring0_addr: 10.50.50.31
  }
  node {
    name: 0418-24u
    nodeid: 29
    quorum_votes: 1
    ring0_addr: 10.50.50.32
  }
  node {
    name: 0418-26u
    nodeid: 30
    quorum_votes: 1
    ring0_addr: 10.50.50.33
  }
  node {
    name: 0418-28u
    nodeid: 31
    quorum_votes: 1
    ring0_addr: 10.50.50.34
  }
  node {
    name: 0418-30u
    nodeid: 32
    quorum_votes: 1
    ring0_addr: 10.50.50.35
  }
  node {
    name: 0418-32u
    nodeid: 33
    quorum_votes: 1
    ring0_addr: 10.50.50.36
  }
  node {
    name: 0418-34u
    nodeid: 34
    quorum_votes: 1
    ring0_addr: 10.50.50.37
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: NODE34
  config_version: 34
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Thank you for reading.
 
You don't need to rejoin anything - just find out why your cluster nodes are not talking to each other!

The output of "pveversion -v" would be interesting.

Also check the logs, especially those of pve-cluster and corosync (as well as pve-ha-lrm and pve-ha-crm if you have HA enabled).
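
For example, something like this pulls the relevant units from the journal (a rough sketch; unit names are the standard ones, adjust the time window as needed):

Code:
# corosync and the cluster filesystem
journalctl -u corosync -u pve-cluster --since "2 hours ago"
# only relevant if HA is enabled
journalctl -u pve-ha-lrm -u pve-ha-crm --since "2 hours ago"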
 
One of the nodes' network cables was unplugged, and that caused all the other nodes with the bond setup to reboot (fencing).

After a few hours the nodes were split into two clusters, and a few more hours later they were all on their own.

Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-7 (running version: 7.1-7/df5740ad)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3
root@0417-08u:~#

Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-7 (running version: 7.1-7/df5740ad)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3
root@0417-10u:~#

Code:
root@0417-08u:~# tail -f /var/log/pve-cluster
tail: cannot open '/var/log/pve-cluster' for reading: No such file or directory
tail: no files remaining
root@0417-08u:~# service pve-cluster  status
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2022-01-14 15:19:53 HKT; 1h 30min ago
    Process: 2301 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 2313 (pmxcfs)
      Tasks: 6 (limit: 308727)
     Memory: 51.1M
        CPU: 1.005s
     CGroup: /system.slice/pve-cluster.service
             └─2313 /usr/bin/pmxcfs

Jan 14 16:50:28 0417-08u pmxcfs[2313]: [status] notice: cpg_send_message retry 50
Jan 14 16:50:29 0417-08u pmxcfs[2313]: [status] notice: cpg_send_message retry 60
Jan 14 16:50:30 0417-08u pmxcfs[2313]: [status] notice: cpg_send_message retry 70
Jan 14 16:50:31 0417-08u pmxcfs[2313]: [status] notice: cpg_send_message retry 80
Jan 14 16:50:32 0417-08u pmxcfs[2313]: [status] notice: cpg_send_message retry 90
Jan 14 16:50:33 0417-08u pmxcfs[2313]: [status] notice: cpg_send_message retry 100
Jan 14 16:50:33 0417-08u pmxcfs[2313]: [status] notice: cpg_send_message retried 100 times
Jan 14 16:50:33 0417-08u pmxcfs[2313]: [status] crit: cpg_send_message failed: 6
Jan 14 16:50:34 0417-08u pmxcfs[2313]: [status] notice: cpg_send_message retry 10
Jan 14 16:50:35 0417-08u pmxcfs[2313]: [status] notice: cpg_send_message retry 20

Code:
root@0417-08u:~# service corosync status
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2022-01-14 15:19:53 HKT; 1h 31min ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 2419 (corosync)
      Tasks: 9 (limit: 308727)
     Memory: 234.8M
        CPU: 4h 18min 2.823s
     CGroup: /system.slice/corosync.service
             └─2419 /usr/sbin/corosync -f

Jan 14 16:50:47 0417-08u corosync[2419]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailab>
Jan 14 16:50:47 0417-08u corosync[2419]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailab>
Jan 14 16:50:47 0417-08u corosync[2419]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailab>
Jan 14 16:50:47 0417-08u corosync[2419]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailab>
Jan 14 16:50:47 0417-08u corosync[2419]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailab>
Jan 14 16:50:47 0417-08u corosync[2419]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailab>
Jan 14 16:50:47 0417-08u corosync[2419]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailab>
Jan 14 16:50:47 0417-08u corosync[2419]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailab>
Jan 14 16:50:47 0417-08u corosync[2419]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailab>
Jan 14 16:50:49 0417-08u corosync[2419]:   [KNET  ] link: host: 2 link: 0 is down
root@0417-08u:~#
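
For completeness, the bond state on each node can be inspected like this (a sketch; the interface name bond0 is an assumption):

Code:
# brief overview of all links on the node
ip -br link
# detailed per-slave state of the bond (the interface name bond0 is an assumption)
cat /proc/net/bonding/bond0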
 
That sounds like something is broken with your network. Note that running corosync on top of a bond is not ideal, to say the least - corosync has its own mechanism to fail over redundant links and is very latency sensitive. If at all possible, you want separate, dedicated links for corosync (especially in such a big cluster!).
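
For illustration, a dedicated second link would look roughly like this in corosync.conf (a sketch only - the 10.60.60.x addresses are made up; every node gets a ring1_addr, and the change has to go through /etc/pve/corosync.conf with a bumped config_version):

Code:
totem {
  ...
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}

nodelist {
  node {
    name: 0417-02u
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.50.50.1
    # 10.60.60.x is a hypothetical dedicated corosync network
    ring1_addr: 10.60.60.1
  }
  ...
}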

I'd try the following (a rough command-level sketch follows the list):
- verify all nodes can talk to all other nodes and the network setup is correct
- stop HA services on all nodes (first LRM, then CRM)
- stop corosync on all nodes
- start corosync on two nodes, verify they can establish a connection with each other (corosync-cfgtool -sb, corosync-quorumtool)
- start corosync on individual nodes, repeat verification before proceeding to the next node
- once all nodes are up and see each other at the corosync level, check whether pmxcfs is okay (pvecm status, try reading from /etc/pve, try writing to /etc/pve)
- if pmxcfs is not okay on some nodes, restart it on those nodes one by one
- start HA services again
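
Roughly, in commands (a sketch only - run the per-node steps one node at a time and verify at each stage):

Code:
# 1) on all nodes: stop the HA services (LRM first, then CRM)
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm
# 2) on all nodes: stop corosync
systemctl stop corosync
# 3) on two nodes: start corosync and verify they see each other
systemctl start corosync
corosync-cfgtool -sb
corosync-quorumtool
# 4) add the remaining nodes one by one, re-checking after each
# 5) once membership is complete, check pmxcfs
pvecm status
ls /etc/pve
touch /etc/pve/test && rm /etc/pve/test
# 6) if pmxcfs is stuck on a node, restart it there
systemctl restart pve-cluster
# 7) finally start the HA services again
systemctl start pve-ha-crm
systemctl start pve-ha-lrm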
 
I've followed the steps and the cluster is back. Thank you so much.

Before this I had also tried disabling all the network ports and leaving only two nodes up for testing, but no luck.

The tricky part is that the HA LRM and CRM have to be stopped before stopping corosync, or else it seems the service starts again and causes failures.
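
In case it helps someone else, this kind of check confirms both services are really stopped everywhere before touching corosync (a sketch, reusing the same ansible inventory as in my first post):

Code:
# expect "inactive" for both units on every node ("|| true" keeps ansible from flagging the stopped state as an error)
ansible all -i 34nodes -m shell -a "systemctl is-active pve-ha-lrm pve-ha-crm || true"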

During the re-joining I had to reinstall one of the nodes, which has Ceph OSD disks inside, and now it's taking the place of the active manager o.O

That is leaving only 33/513 clean PGs across 36 SSDs; I hope the data will be fine after the sync...
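
The recovery can be watched with the standard Ceph status commands (a sketch):

Code:
# overall health and recovery/backfill progress
ceph -s
# placement group summary (clean / degraded / recovering counts)
ceph pg stat
# OSD up/in overview
ceph osd stat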


Again, thank you very much for your help.