Not sure what happened when I updated things a while back, but I lost 5 of the 9 nodes... nothing special on any of them. I managed to get the VMs live again from replicated data on other nodes and they are all working...
I let it be for a couple of months and hadn't looked at it since everything was working, but I saw PVE 7 came out, so I went to node1 to do the update. I could not get the GUI console to load, so I had to SSH to it (probably a cert issue?). SSH from my desktop worked fine, and apt update and apt upgrade went fine.
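If it really is just stale node certificates, I'm assuming something like this from the node's shell would regenerate them and bring the GUI back (haven't tried it yet, and not sure it even works while the cluster filesystem is read-only):
Code:
# regenerate this node's SSL certs used by the web GUI, then restart the proxy
pvecm updatecerts --force
systemctl restart pveproxy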
Looking at the pve6to7 output, I see some issues...
Code:
root@stack1:~# pve6to7
= CHECKING VERSION INFORMATION FOR PVE PACKAGES =
Checking for package updates..
PASS: all packages uptodate
Checking proxmox-ve package version..
PASS: proxmox-ve package has version >= 6.4-1
Checking running kernel version..
PASS: expected running kernel '5.4.124-1-pve'.
= CHECKING CLUSTER HEALTH/SETTINGS =
PASS: systemd unit 'pve-cluster.service' is in state 'active'
PASS: systemd unit 'corosync.service' is in state 'active'
FAIL: Cluster Filesystem readonly, lost quorum?!
Analzying quorum settings and state..
FAIL: 4 nodes are offline!
INFO: configured votes - nodes: 8
INFO: configured votes - qdevice: 0
INFO: current expected votes: 8
INFO: current total votes: 4
WARN: total votes < expected votes: 4/8!
Checking nodelist entries..
PASS: nodelist settings OK
Checking totem settings..
PASS: totem settings OK
INFO: run 'pvecm status' to get detailed cluster status..
= CHECKING HYPER-CONVERGED CEPH STATUS =
INFO: hyper-converged ceph setup detected!
INFO: getting Ceph status/health information..
WARN: Ceph health reported as 'HEALTH_WARN'.
Use the PVE dashboard or 'ceph -s' to determine the specific issues and try to resolve them.
INFO: getting Ceph daemon versions..
PASS: single running version detected for daemon type monitor.
PASS: single running version detected for daemon type manager.
PASS: single running version detected for daemon type MDS.
PASS: single running version detected for daemon type OSD.
PASS: single running overall version detected for all Ceph daemon types.
WARN: 'noout' flag not set - recommended to prevent rebalancing during cluster-wide upgrades.
INFO: checking Ceph config..
= CHECKING CONFIGURED STORAGES =
PASS: storage 'CPool1' enabled and active.
PASS: storage 'Ceph-USDivide200' enabled and active.
PASS: storage 'ISO_store1' enabled and active.
PASS: storage 'MinecraftCephFS1' enabled and active.
PASS: storage 'ceph-lxc' enabled and active.
PASS: storage 'ceph-vm1' enabled and active.
PASS: storage 'local' enabled and active.
= MISCELLANEOUS CHECKS =
INFO: Checking common daemon services..
PASS: systemd unit 'pveproxy.service' is in state 'active'
PASS: systemd unit 'pvedaemon.service' is in state 'active'
PASS: systemd unit 'pvestatd.service' is in state 'active'
INFO: Checking for running guests..
PASS: no running guest detected.
INFO: Checking if the local node's hostname 'stack1' is resolvable..
INFO: Checking if resolved IP is configured on local node..
PASS: Resolved node IP '10.0.1.1' configured and active on single interface.
INFO: Checking backup retention settings..
PASS: no problems found.
INFO: checking CIFS credential location..
PASS: no CIFS credentials at outdated location found.
INFO: Checking custom roles for pool permissions..
INFO: Checking node and guest description/note legnth..
PASS: All node config descriptions fit in the new limit of 64 KiB
PASS: All guest config descriptions fit in the new limit of 8 KiB
INFO: Checking container configs for deprecated lxc.cgroup entries
PASS: No legacy 'lxc.cgroup' keys found.
INFO: Checking storage content type configuration..
PASS: no problems found
INFO: Checking if the suite for the Debian security repository is correct..
INFO: Make sure to change the suite of the Debian security repository from 'buster/updates' to 'bullseye-security' - in /etc/apt/sources.list:6
SKIP: NOTE: Expensive checks, like CT cgroupv2 compat, not performed without '--full' parameter
= SUMMARY =
TOTAL: 36
PASSED: 30
SKIPPED: 1
WARNINGS: 3
FAILURES: 2
ATTENTION: Please check the output for detailed information!
Try to solve the problems one at a time and then run this checklist tool again.
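The two WARN items look straightforward to me - something along these lines, assuming the standard bullseye-security line for the repo (and unsetting noout again once every node is upgraded):
Code:
# keep Ceph from rebalancing while nodes reboot during the upgrade
ceph osd set noout

# point the Debian security suite at bullseye-security instead of buster/updates
# (line 6 of /etc/apt/sources.list on this node)
sed -i 's|buster/updates|bullseye-security|' /etc/apt/sources.list

# after all nodes are on 7.x:
# ceph osd unset noout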
The FAILs are another story - I am guessing the fact that I have 5 nodes not working is what's freaking it out... I also deleted a couple of LXCs and VMs that I didn't need anymore, like an old Minecraft server instance...
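My rough plan for the quorum FAIL, assuming the dead nodes are never coming back (the node names below are just placeholders for my dead ones), is to get the surviving nodes quorate again and then drop the dead ones from the cluster:
Code:
# check what corosync currently thinks is online / expected
pvecm status

# drop expected votes so the surviving nodes are quorate again (/etc/pve writable)
pvecm expected 4

# then remove each dead node from the cluster config (placeholder names)
pvecm delnode stack3
pvecm delnode stack4
Does that sound like the right order - cluster/quorum first, then the Ceph flag and repo change, then the actual 6-to-7 upgrade?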