4 node cluster lost web GUI access

Matthias Looss

A couple of days ago, my 4 node cluster would not load the web GUI, for any node. I have ssh access to all 4 nodes and rebooted them many times.

I read through the forum and followed some advice given to others with similar issues, but things are just getting worse now.

Node pve-hn4 is the last node I added. I had problems initially joining it to my cluster, but eventually I was able to get it included. Currently, nodes pve-hn1 and pve-hn4 show as offline in the web GUI, and I noticed that pve-hn3 sometimes goes offline too, but then shows up online again.

All nodes have been upgraded to the most recent non-production releases available.
proxmox-ve: 7.0-2 (running kernel: 5.11.22-5-pve)
pve-manager: 7.0-13 (running version: 7.0-13/7aa7e488)

Here is some info for pve-hn4; I will post info for pve-hn1 in a separate post.


root@pve-hn4:~# pvecm nodes

Membership information
----------------------
Nodeid Votes Name
4 1 pve-hn4 (local)
root@pve-hn4:~#
root@pve-hn4:~# systemctl restart corosync pve-cluster

systemctl restart corosync pve-cluster
root@pve-hn4:~#
root@pve-hn4:~# systemctl restart corosync pve-cluster
root@pve-hn4:~# corosync-cfgtool -s
Local node ID 4, transport knet
LINK ID 0 udp
addr = 10.0.1.246
status:
nodeid: 1: disconnected
nodeid: 2: connected
nodeid: 3: connected
nodeid: 4: localhost
root@pve-hn4:~# systemctl status pve-cluster corosync
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2021-10-08 14:22:43 AST; 11s ago
Process: 15076 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 15082 (pmxcfs)
Tasks: 6 (limit: 19017)
Memory: 14.8M
CPU: 37ms
CGroup: /system.slice/pve-cluster.service
└─15082 /usr/bin/pmxcfs

Oct 08 14:22:42 pve-hn4 pmxcfs[15082]: [dcdb] crit: cpg_initialize failed: 2
Oct 08 14:22:42 pve-hn4 pmxcfs[15082]: [dcdb] crit: can't initialize service
Oct 08 14:22:42 pve-hn4 pmxcfs[15082]: [status] crit: cpg_initialize failed: 2
Oct 08 14:22:42 pve-hn4 pmxcfs[15082]: [status] crit: can't initialize service
Oct 08 14:22:43 pve-hn4 systemd[1]: Started The Proxmox VE cluster filesystem.
Oct 08 14:22:48 pve-hn4 pmxcfs[15082]: [status] notice: update cluster info (cluster name PVE-Cluster, version = 4)
Oct 08 14:22:49 pve-hn4 pmxcfs[15082]: [dcdb] notice: cpg_join retry 10
Oct 08 14:22:50 pve-hn4 pmxcfs[15082]: [dcdb] notice: cpg_join retry 20
Oct 08 14:22:53 pve-hn4 pmxcfs[15082]: [dcdb] notice: members: 4/15082
Oct 08 14:22:53 pve-hn4 pmxcfs[15082]: [dcdb] notice: all data is up to date

● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2021-10-08 14:22:43 AST; 11s ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 15096 (corosync)
Tasks: 9 (limit: 19017)
Memory: 134.6M
CPU: 4.541s
CGroup: /system.slice/corosync.service
└─15096 /usr/sbin/corosync -f

Oct 08 14:22:50 pve-hn4 corosync[15096]: [TOTEM ] A new membership (2.a3f9d) was formed. Members joined: 2 3
Oct 08 14:22:53 pve-hn4 corosync[15096]: [TOTEM ] FAILED TO RECEIVE
Oct 08 14:22:53 pve-hn4 corosync[15096]: [QUORUM] Sync members[1]: 4
Oct 08 14:22:53 pve-hn4 corosync[15096]: [TOTEM ] A new membership (4.a3fa1) was formed. Members left: 2 3
Oct 08 14:22:53 pve-hn4 corosync[15096]: [TOTEM ] Failed to receive the leave message. failed: 2 3
Oct 08 14:22:53 pve-hn4 corosync[15096]: [QUORUM] Members[1]: 4
Oct 08 14:22:53 pve-hn4 corosync[15096]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 08 14:22:53 pve-hn4 corosync[15096]: [QUORUM] Sync members[3]: 2 3 4
Oct 08 14:22:53 pve-hn4 corosync[15096]: [QUORUM] Sync joined[2]: 2 3
Oct 08 14:22:53 pve-hn4 corosync[15096]: [TOTEM ] A new membership (2.a3fa5) was formed. Members joined: 2 3
root@pve-hn4:~#
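
For anyone who wants to script the link check above instead of reading it by eye, here is a minimal Python sketch. It assumes the `corosync-cfgtool -s` output uses the `nodeid: N: status` layout pasted in this post; other corosync versions may print a different layout. The function name `link_status` is made up for illustration and is not part of any Proxmox or corosync tool.

```python
import re


def link_status(cfgtool_output: str) -> dict:
    """Map node ID -> link status, parsed from `corosync-cfgtool -s` text.

    Assumes lines of the form 'nodeid: 1: disconnected' as pasted above.
    """
    status = {}
    for line in cfgtool_output.splitlines():
        match = re.match(r"\s*nodeid:\s*(\d+):\s*(\S+)", line)
        if match:
            status[int(match.group(1))] = match.group(2)
    return status


# Sample text copied from the pve-hn4 output in this post.
sample = """\
Local node ID 4, transport knet
LINK ID 0 udp
addr = 10.0.1.246
status:
nodeid: 1: disconnected
nodeid: 2: connected
nodeid: 3: connected
nodeid: 4: localhost
"""

statuses = link_status(sample)
disconnected = [node for node, state in statuses.items() if state == "disconnected"]
print(disconnected)  # [1]
```

With the sample above, only node 1 (pve-hn1) is reported as disconnected, which matches what the raw output shows.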
 
root@pve-hn1:~# journalctl -b -u pve-cluster
-- Journal begins at Thu 2021-10-07 20:27:40 AST, ends at Fri 2021-10-08 14:58:30 AST. --
Oct 08 13:16:48 pve-hn1 systemd[1]: Starting The Proxmox VE cluster filesystem...
Oct 08 13:16:48 pve-hn1 pmxcfs[1148]: [quorum] crit: quorum_initialize failed: 2
Oct 08 13:16:48 pve-hn1 pmxcfs[1148]: [quorum] crit: can't initialize service
Oct 08 13:16:48 pve-hn1 pmxcfs[1148]: [confdb] crit: cmap_initialize failed: 2
Oct 08 13:16:48 pve-hn1 pmxcfs[1148]: [confdb] crit: can't initialize service
Oct 08 13:16:48 pve-hn1 pmxcfs[1148]: [dcdb] crit: cpg_initialize failed: 2
Oct 08 13:16:48 pve-hn1 pmxcfs[1148]: [dcdb] crit: can't initialize service
Oct 08 13:16:48 pve-hn1 pmxcfs[1148]: [status] crit: cpg_initialize failed: 2
Oct 08 13:16:48 pve-hn1 pmxcfs[1148]: [status] crit: can't initialize service
Oct 08 13:16:49 pve-hn1 systemd[1]: Started The Proxmox VE cluster filesystem.
Oct 08 13:16:54 pve-hn1 pmxcfs[1148]: [quorum] crit: quorum_initialize failed: 2
Oct 08 13:16:54 pve-hn1 pmxcfs[1148]: [confdb] crit: cmap_initialize failed: 2
Oct 08 13:16:54 pve-hn1 pmxcfs[1148]: [dcdb] crit: cpg_initialize failed: 2
 
root@pve-hn2:~# wget --no-check-certificate https://localhost:8006
--2021-10-08 15:00:58-- https://localhost:8006/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:8006... connected.
The certificate's owner does not match hostname ‘localhost’
HTTP request sent, awaiting response... 200 OK
Length: 2213 (2.2K) [text/html]
Saving to: ‘index.html.1’

index.html.1 100%[==========================================================>] 2.16K --.-KB/s in 0s

2021-10-08 15:00:58 (59.8 MB/s) - ‘index.html.1’ saved [2213/2213]
 
I had been running my cluster on the remaining 3 nodes until a few days ago, and now the web GUI does not load on any of them. I can ping all three and ssh into them. I installed some new updates and rebooted, but the web GUI still does not load. I spent some time reading and tried a multitude of steps, but nothing solved my problem.

If anyone could give me some additional pointers to fix this issue, I would really appreciate it.

Thanks
 
Last login: Thu Oct 21 06:32:41 2021 from 10.0.1.154
root@pve-hn2:~# pvecm status
Cluster information
-------------------
Name: PVE-Cluster
Config Version: 4
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Oct 21 13:48:03 2021
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000003
Ring ID: 2.13859a
Quorate: Yes

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 3
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 10.0.1.245
0x00000003 1 10.0.1.244 (local)
0x00000004 1 10.0.1.246
root@pve-hn2:~#
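
A note on reading the votequorum numbers above, as a small Python sketch. It assumes plain majority voting (quorum is half of the expected votes, rounded down, plus one) with no special votequorum options or QDevice configured, which is consistent with the `Expected votes: 4` and `Quorum: 3` lines in this output. The function names are made up for illustration.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for quorum under simple majority voting (assumed here)."""
    return expected_votes // 2 + 1


def is_quorate(total_votes: int, expected_votes: int) -> bool:
    """True if the votes currently present reach the majority threshold."""
    return total_votes >= quorum_threshold(expected_votes)


print(quorum_threshold(4))  # 3
print(is_quorate(3, 4))     # True: three of four nodes present, as in this output
print(is_quorate(1, 4))     # False: a node that only sees itself
```

So with one of four nodes missing, the remaining three are still quorate, which is why `pvecm status` reports `Quorate: Yes` here.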
 
root@pve-hn2:~# time

real 0m0.000s
user 0m0.000s
sys 0m0.000s
root@pve-hn2:~# date
Thu 21 Oct 2021 01:48:48 PM AST
root@pve-hn2:~# pvecm status
Cluster information
-------------------
Name: PVE-Cluster
Config Version: 4
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Oct 21 13:48:59 2021
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000003
Ring ID: 2.13859a
Quorate: Yes

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 3
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 10.0.1.245
0x00000003 1 10.0.1.244 (local)
0x00000004 1 10.0.1.246
 
root@pve-hn2:~# wget --no-check-certificate https://localhost:8006
--2021-10-21 13:50:21-- https://localhost:8006/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:8006... connected.
The certificate's owner does not match hostname ‘localhost’
HTTP request sent, awaiting response... 200 OK
Length: 2213 (2.2K) [text/html]
Saving to: ‘index.html.2’

index.html.2 100%[==========================================================>] 2.16K --.-KB/s in 0s

2021-10-21 13:50:21 (69.0 MB/s) - ‘index.html.2’ saved [2213/2213]
 
root@pve-hn2:~# journalctl -b -u pve-cluster
-- Journal begins at Fri 2021-10-15 07:17:26 AST, ends at Thu 2021-10-21 13:51:13 AST. --
Oct 21 06:44:55 pve-hn2 systemd[1]: Starting The Proxmox VE cluster filesystem...
Oct 21 06:44:55 pve-hn2 pmxcfs[1028]: [quorum] crit: quorum_initialize failed: 2
Oct 21 06:44:55 pve-hn2 pmxcfs[1028]: [quorum] crit: can't initialize service
Oct 21 06:44:55 pve-hn2 pmxcfs[1028]: [confdb] crit: cmap_initialize failed: 2
Oct 21 06:44:55 pve-hn2 pmxcfs[1028]: [confdb] crit: can't initialize service
Oct 21 06:44:55 pve-hn2 pmxcfs[1028]: [dcdb] crit: cpg_initialize failed: 2
Oct 21 06:44:55 pve-hn2 pmxcfs[1028]: [dcdb] crit: can't initialize service
Oct 21 06:44:55 pve-hn2 pmxcfs[1028]: [status] crit: cpg_initialize failed: 2
Oct 21 06:44:55 pve-hn2 pmxcfs[1028]: [status] crit: can't initialize service
Oct 21 06:44:56 pve-hn2 systemd[1]: Started The Proxmox VE cluster filesystem.
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [status] notice: update cluster info (cluster name PVE-Cluster, version = 4)
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: members: 3/1028, 4/852
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: starting data syncronisation
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [status] notice: members: 3/1028, 4/852
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [status] notice: starting data syncronisation
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: received sync request (epoch 3/1028/00000001)
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [status] notice: received sync request (epoch 3/1028/00000001)
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: received all states
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: leader is 3/1028
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: synced members: 3/1028, 4/852
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: start sending inode updates
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: sent all (0) updates
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: all data is up to date
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [status] notice: received all states
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [status] notice: all data is up to date
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [dcdb] notice: members: 2/1038, 3/1028, 4/852
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [dcdb] notice: starting data syncronisation
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [status] notice: members: 2/1038, 3/1028, 4/852
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [status] notice: starting data syncronisation
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [status] notice: node has quorum
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [dcdb] notice: received sync request (epoch 2/1038/00000002)
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [status] notice: received sync request (epoch 2/1038/00000002)
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [dcdb] notice: received all states
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [dcdb] notice: leader is 2/1038
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [dcdb] notice: synced members: 2/1038, 3/1028, 4/852
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [dcdb] notice: all data is up to date
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [dcdb] notice: dfsm_deliver_queue: queue length 2
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [status] notice: received all states
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [status] notice: all data is up to date
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [status] notice: dfsm_deliver_queue: queue length 5
Oct 21 07:04:09 pve-hn2 pmxcfs[1028]: [status] notice: received log
Oct 21 07:04:14 pve-hn2 pmxcfs[1028]: [status] notice: received log
Oct 21 07:04:14 pve-hn2 pmxcfs[1028]: [status] notice: received log
Oct 21 07:04:14 pve-hn2 pmxcfs[1028]: [status] notice: received log
Oct 21 08:04:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: data verification successful
Oct 21 09:04:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: data verification successful
Oct 21 10:04:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: data verification successful
Oct 21 11:04:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: data verification successful
Oct 21 12:04:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: data verification successful
Oct 21 13:04:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: data verification successful
 
Wow, I would have expected some feedback by now. Am I missing something?

My whole cluster and all of the VMs are down. I can still ssh into all nodes, but the web GUI will not load.

Looks like I will have to reset all nodes, re-install Proxmox, and restore my VMs. Not a lot of fun :-(
 
Hello, I am in a similar situation. Yes, a little feedback on your description would have been nice. What is your current status? Have you restarted everything? I hope not. This can only be something small.
 
Yes, it is very disappointing not to receive any feedback.

I did solve the issue, and the cause was completely unexpected. At one point, I used another device and the web GUI loaded, which pointed to a problem on my main computer. After some more troubleshooting, I found that my network monitoring app, Paragon Firewall for Mac, was blocking the traffic, even though it was set to monitor only and should never have blocked anything. Once I turned off this firewall, the web GUI loaded just fine. Sometimes we can't catch every odd fastball thrown at us, LOL.

I meant to update this post but then forgot.

Good luck.
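
For anyone landing here with the same symptom: the `wget` against `https://localhost:8006` on the node returned `200 OK`, so the GUI was being served, and the problem turned out to be on the client machine. A quick way to compare machines is to run the same TCP check from each one. This is only a minimal Python sketch; the function name `can_connect` is made up, and port 8006 is simply the port used in the `wget` test earlier in the thread.

```python
import socket


def can_connect(host: str, port: int = 8006, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A False result only says the connection failed from *this* machine;
    it does not say whether the node or something local is at fault.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_nodes(hosts):
    """Print one reachability line per host and return the results as a dict."""
    results = {host: can_connect(host) for host in hosts}
    for host, ok in results.items():
        print(host, "reachable" if ok else "not reachable from this machine")
    return results
```

For example, calling `check_nodes(["10.0.1.244", "10.0.1.245", "10.0.1.246"])` (the addresses from the `pvecm status` output above) on the workstation and again on a second device would show whether only one client is affected. If the nodes are reachable from one device but not the other, a local firewall or filter on the failing client is a likely suspect.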