Cluster in weird state (nodes with grey question mark)

BenDDD

Member
Nov 28, 2019
Hello everyone,

I am experiencing a strange situation with my cluster. 22 of its 24 nodes seem to communicate correctly via corosync, but they appear with a grey question mark in the WebUI:

Code:
pvecm status
user config - ignore invalid group member 'mathieu-adm'
Cluster information
-------------------
Name:             galaxie
Config Version:   73
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sun Sep 13 02:57:46 2020
Quorum provider:  corosync_votequorum
Nodes:            22
Node ID:          0x00000001
Ring ID:          1.182e9
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   24
Highest expected: 24
Total votes:      22
Quorum:           13 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 147.215.130.101 (local)
0x00000002          1 147.215.130.102
0x00000003          1 147.215.130.103
0x00000004          1 147.215.130.104
0x00000005          1 147.215.130.105
0x00000006          1 147.215.130.106
0x00000007          1 147.215.130.107
0x00000008          1 147.215.130.108
0x00000009          1 147.215.130.109
0x0000000a          1 147.215.130.110
0x0000000b          1 147.215.130.111
0x0000000c          1 147.215.130.112
0x0000000d          1 147.215.130.113
0x0000000e          1 147.215.130.114
0x0000000f          1 147.215.130.115
0x00000010          1 147.215.130.116
0x00000011          1 147.215.130.117
0x00000012          1 147.215.130.118
0x00000013          1 147.215.130.119
0x00000014          1 147.215.130.120
0x00000015          1 147.215.130.121
0x00000016          1 147.215.130.122

cluster.png

As you can see, the two remaining nodes do not appear in the corosync membership at all, and they show a white cross on a red background in the WebUI.
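
For reference, the quorum figures in the `pvecm status` output above are consistent with a simple majority rule. A minimal sketch of that arithmetic, assuming one vote per node as shown in the membership list (the function name `quorum_threshold` is only illustrative and not part of any Proxmox or corosync tool):

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest strict majority of the expected votes."""
    return expected_votes // 2 + 1


expected_votes = 24  # "Expected votes" from pvecm status
total_votes = 22     # "Total votes" from pvecm status

threshold = quorum_threshold(expected_votes)
missing = expected_votes - total_votes
quorate = total_votes >= threshold

print(f"quorum={threshold} missing={missing} quorate={quorate}")
```

With these numbers the threshold is 13, which matches the `Quorum: 13` line, and the two missing votes correspond to the two nodes shown with a red cross.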

Some information that may be useful (package versions, followed by corosync logs from galaxie1 and galaxie23):

proxmox-ve: 6.1-2 (running kernel: 5.3.18-3-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-7
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 4.0.1-pve1
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-23
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1

Sep 13 02:59:14 galaxie1 corosync[10908]: [TOTEM ] Token has not been received in 3261 ms
Sep 13 02:59:27 galaxie1 corosync[10908]: [TOTEM ] Token has not been received in 3283 ms
Sep 13 02:59:51 galaxie1 corosync[10908]: [TOTEM ] Token has not been received in 3261 ms
Sep 13 03:00:12 galaxie1 corosync[10908]: [TOTEM ] Token has not been received in 3283 ms
Sep 13 03:00:44 galaxie1 corosync[10908]: [TOTEM ] Token has not been received in 3283 ms
Sep 13 03:01:19 galaxie1 corosync[10908]: [TOTEM ] Token has not been received in 14452 ms
Sep 13 03:02:19 galaxie1 corosync[10908]: [TOTEM ] Token has not been received in 3283 ms
Sep 13 03:02:43 galaxie1 corosync[10908]: [TOTEM ] Token has not been received in 3283 ms
Sep 13 03:03:46 galaxie1 corosync[10908]: [TOTEM ] Token has not been received in 3261 ms
Sep 13 03:04:14 galaxie1 corosync[10908]: [TOTEM ] Token has not been received in 3261 ms
Sep 13 03:04:26 galaxie1 corosync[10908]: [TOTEM ] Token has not been received in 3286 ms
Sep 13 03:04:54 galaxie1 corosync[10908]: [TOTEM ] Token has not been received in 3261 ms
Sep 13 03:05:16 galaxie1 corosync[10908]: [TOTEM ] Token has not been received in 3283 ms
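
To get an overview of these token warnings instead of reading them one by one, the delays can be extracted from the journal text. This is a rough sketch, assuming the log format stays exactly as pasted above; the three sample lines are copied from that log, and `token_delays` is a hypothetical helper name:

```python
import re

# Three lines copied from the galaxie1 log above.
LOG = """\
Sep 13 02:59:14 galaxie1 corosync[10908]: [TOTEM ] Token has not been received in 3261 ms
Sep 13 03:01:19 galaxie1 corosync[10908]: [TOTEM ] Token has not been received in 14452 ms
Sep 13 03:05:16 galaxie1 corosync[10908]: [TOTEM ] Token has not been received in 3283 ms
"""

TOKEN_RE = re.compile(
    r"^(\w{3} +\d+ [\d:]{8}) \S+ corosync\[\d+\]: "
    r"\[TOTEM ?\] Token has not been received in (\d+) ms$"
)


def token_delays(text):
    """Return (timestamp, delay_ms) pairs, one per token warning line."""
    pairs = []
    for line in text.splitlines():
        match = TOKEN_RE.match(line)
        if match:
            pairs.append((match.group(1), int(match.group(2))))
    return pairs


delays = token_delays(LOG)
worst = max(delays, key=lambda pair: pair[1])
print(f"{len(delays)} warnings, worst: {worst[1]} ms at {worst[0]}")
```

On the full log this makes the 14452 ms outlier at 03:01:19 stand out against the usual delays of roughly 3.3 seconds.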

root@galaxie23:~# journalctl -r -u corosync
-- Logs begin at Sun 2020-09-13 01:39:48 CEST, end at Sun 2020-09-13 03:12:27 CEST. --
Sep 13 03:12:27 galaxie23 corosync[13119]: [MAIN ] Completed service synchronization, ready to provid
Sep 13 03:12:27 galaxie23 corosync[13119]: [QUORUM] Members[7]: 9 10 12 13 16 23 25
Sep 13 03:12:27 galaxie23 corosync[13119]: [CPG ] downlist left_list: 0 received
Sep 13 03:12:27 galaxie23 corosync[13119]: [CPG ] downlist left_list: 1 received
Sep 13 03:12:27 galaxie23 corosync[13119]: [CPG ] downlist left_list: 1 received
Sep 13 03:12:27 galaxie23 corosync[13119]: [CPG ] downlist left_list: 1 received
Sep 13 03:12:27 galaxie23 corosync[13119]: [CPG ] downlist left_list: 1 received
Sep 13 03:12:27 galaxie23 corosync[13119]: [CPG ] downlist left_list: 1 received
Sep 13 03:12:27 galaxie23 corosync[13119]: [CPG ] downlist left_list: 1 received
Sep 13 03:12:27 galaxie23 corosync[13119]: [TOTEM ] A new membership (9.184f0) was formed. Members joi
Sep 13 03:12:27 galaxie23 corosync[13119]: [MAIN ] Completed service synchronization, ready to provid
Sep 13 03:12:27 galaxie23 corosync[13119]: [QUORUM] Members[1]: 23
Sep 13 03:12:27 galaxie23 corosync[13119]: [CPG ] downlist left_list: 0 received
Sep 13 03:12:27 galaxie23 corosync[13119]: [TOTEM ] A new membership (17.184ec) was formed. Members
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: Global data MTU changed to: 1397
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 10 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 19 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 20 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 21 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 22 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 25 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 3 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 4 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 5 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 11 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 6 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 7 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 8 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 9 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 12 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 13 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 14 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 15 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 16 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 17 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] pmtud: PMTUD link change for host: 18 link: 0 from
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Sep 13 03:12:05 galaxie23 corosync[13119]: [KNET ] host: host: 19 (passive) best link: 0 (pri: 1)
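
The `[QUORUM] Members[...]` lines in the galaxie23 journal show which nodes it could see at each membership change. A small sketch to list those views in chronological order and compare them with the quorum threshold, assuming one vote per node and the same 24 expected votes reported by `pvecm status` on galaxie1 (`memberships` is a hypothetical helper name):

```python
import re

# The two [QUORUM] lines copied from the galaxie23 journal above.
# journalctl -r prints newest entries first, so the order is reversed below.
LOG = """\
Sep 13 03:12:27 galaxie23 corosync[13119]: [QUORUM] Members[7]: 9 10 12 13 16 23 25
Sep 13 03:12:27 galaxie23 corosync[13119]: [QUORUM] Members[1]: 23
"""

MEMBERS_RE = re.compile(r"\[QUORUM\] Members\[\d+\]: ([\d ]+)$")

EXPECTED_VOTES = 24                # assumption: same value as on galaxie1
QUORUM = EXPECTED_VOTES // 2 + 1   # 13 with one vote per node


def memberships(text):
    """Return the member lists, oldest first, from a newest-first journal."""
    views = []
    for line in reversed(text.splitlines()):
        match = MEMBERS_RE.search(line)
        if match:
            views.append([int(node) for node in match.group(1).split()])
    return views


for members in memberships(LOG):
    state = "quorate" if len(members) >= QUORUM else "not quorate"
    print(f"{len(members):2d} members {members}: {state}")
```

Under those assumptions, neither the single-node view nor the seven-node view seen by galaxie23 reaches 13 votes.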

Thank you in advance for your help.
 
Something else I just noticed: I have a corosync process running at 100%:

htop.png