The famous Red X

tazzmn

Member
Oct 13, 2020
Hey all,

Well, we recently went to add nodes 20-23. As soon as we did, our entire cluster went down and we cannot get it back online. When a node does join, it shows a red X over the node icon and grey question marks over the VMs/containers/storage, but when you click into them you can still see stats and such.
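
For context, on an affected node we have been checking the services behind those icons roughly like this (just a sketch of our checks; pve-cluster is pmxcfs, and pvestatd is the daemon that feeds the status shown in the GUI):

systemctl status pve-cluster      # pmxcfs, backs /etc/pve
systemctl status corosync         # cluster membership
systemctl status pvestatd         # feeds the node/VM status icons in the GUI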

We have the following software:

proxmox-ve: 6.2-2 (running kernel: 5.4.60-1-pve)
pve-manager: 6.2-12 (running version: 6.2-12/b287dd27)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-4.15: 5.2-4
pve-kernel-4.15.18-1-pve: 4.15.18-15
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.4.128-1-pve: 4.4.128-111
pve-kernel-4.4.98-6-pve: 4.4.98-107
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.76-1-pve: 4.4.76-94
pve-kernel-4.4.59-1-pve: 4.4.59-87
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-backup-client: 0.9.0-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-1
pve-cluster: 6.2-1
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-3
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-15
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve2

We can never seem to get all of the nodes into the cluster at the same time. Any guidance would be greatly appreciated. Thank you.
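
For completeness, this is roughly how we have been checking membership and quorum on each node (a sketch; the output obviously differs per node):

pvecm status    # quorum information and the votes each node contributes
pvecm nodes     # the nodes this node currently sees in the cluster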
 
As an update: yesterday we tried starting corosync on two of the nodes to see if they would join up properly. Node 1 started up just fine. Node 2 displayed an error:

pvesr[42272]: trying to acquire cfs lock 'file-replication_cfg' ...

From what I hear, the only way to get this to go away is to reboot, but has anyone else found a way to clear it without disabling replication?
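
Our (possibly wrong) understanding is that this lock is taken through the clustered /etc/pve filesystem, so it hangs whenever the node has no quorum. What we have been looking at on the replication side, roughly:

pvesr status                      # state of the configured replication jobs
cat /etc/pve/replication.cfg      # the replication jobs themselves (only exists if jobs are configured)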
 
As an update, we tried to re-link all of them tonight. We restarted all the services and got the original 17 members to sync up, but as soon as we did the same on our three new nodes, they either went to grey question marks (with which we could still see stats) or red Xs. We were also seeing a lot of:

Oct 16 02:46:02 hyper1 pmxcfs[823]: [dcdb] notice: cpg_join retry 210
Oct 16 02:46:03 hyper1 pmxcfs[823]: [dcdb] notice: cpg_join retry 220
Oct 16 02:46:04 hyper1 pmxcfs[823]: [dcdb] notice: cpg_join retry 230
Oct 16 02:46:05 hyper1 pmxcfs[823]: [dcdb] notice: cpg_join retry 240
Oct 16 02:46:06 hyper1 pmxcfs[823]: [dcdb] notice: cpg_join retry 250
Oct 16 02:46:07 hyper1 pmxcfs[823]: [dcdb] notice: cpg_join retry 260
Oct 16 02:46:08 hyper1 pmxcfs[823]: [dcdb] notice: cpg_join retry 270
Oct 16 02:46:09 hyper1 pmxcfs[823]: [dcdb] notice: cpg_join retry 280
Oct 16 02:46:10 hyper1 pmxcfs[823]: [dcdb] notice: cpg_join retry 290
Oct 16 02:46:11 hyper1 pmxcfs[823]: [dcdb] notice: cpg_join retry 300
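
For reference, "restarting all the services" meant roughly the following on each node, in this order (a sketch of what we ran; we then watched the journal for the cpg_join messages above):

systemctl restart corosync
systemctl restart pve-cluster
corosync-cfgtool -s                       # knet link status to the other nodes
journalctl -fu pve-cluster -u corosync    # watch for cpg_join / quorum messages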

Any guidance would help! Thanks
 
