4 node cluster lost web GUI access

Matthias Looss · Oct 8, 2021

A couple of days ago, my 4 node cluster would not load the web GUI, for any node. I have ssh access to all 4 nodes and rebooted them many times.

I read through the forum and followed some advice given to others with similar issues, but things are just getting worse now.

Node pve-hn4 is the last node I added and I had problems initially joining it to my cluster, but eventually, I was able to get it included in my existing cluster. Currently node pve-hn1 and pve-hn4 show offline in the web GUI, and I noticed that pve-hn3 sometimes goes offline too, but then shows up online again.

All nodes have been upgraded to the most recent non-production releases available.

proxmox-ve: 7.0-2 (running kernel: 5.11.22-5-pve)

pve-manager: 7.0-13 (running version: 7.0-13/7aa7e488)

Here is some info for pve-hn4, and I will post info for pve-hn1 in a separate post.

root@pve-hn4:~# pvecm nodes

Membership information
----------------------
Nodeid Votes Name
4 1 pve-hn4 (local)
root@pve-hn4:~#
root@pve-hn4:~# systemctl restart corosync pve-cluster

systemctl restart corosync pve-cluster
root@pve-hn4:~#
root@pve-hn4:~# systemctl restart corosync pve-cluster
root@pve-hn4:~# corosync-cfgtool -s
Local node ID 4, transport knet
LINK ID 0 udp
addr = 10.0.1.246
status:
nodeid: 1: disconnected
nodeid: 2: connected
nodeid: 3: connected
nodeid: 4: localhost
root@pve-hn4:~# systemctl status pve-cluster corosync
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2021-10-08 14:22:43 AST; 11s ago
Process: 15076 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 15082 (pmxcfs)
Tasks: 6 (limit: 19017)
Memory: 14.8M
CPU: 37ms
CGroup: /system.slice/pve-cluster.service
└─15082 /usr/bin/pmxcfs

Oct 08 14:22:42 pve-hn4 pmxcfs[15082]: [dcdb] crit: cpg_initialize failed: 2
Oct 08 14:22:42 pve-hn4 pmxcfs[15082]: [dcdb] crit: can't initialize service
Oct 08 14:22:42 pve-hn4 pmxcfs[15082]: [status] crit: cpg_initialize failed: 2
Oct 08 14:22:42 pve-hn4 pmxcfs[15082]: [status] crit: can't initialize service
Oct 08 14:22:43 pve-hn4 systemd[1]: Started The Proxmox VE cluster filesystem.
Oct 08 14:22:48 pve-hn4 pmxcfs[15082]: [status] notice: update cluster info (cluster name PVE-Cluster, version = 4)
Oct 08 14:22:49 pve-hn4 pmxcfs[15082]: [dcdb] notice: cpg_join retry 10
Oct 08 14:22:50 pve-hn4 pmxcfs[15082]: [dcdb] notice: cpg_join retry 20
Oct 08 14:22:53 pve-hn4 pmxcfs[15082]: [dcdb] notice: members: 4/15082
Oct 08 14:22:53 pve-hn4 pmxcfs[15082]: [dcdb] notice: all data is up to date

● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2021-10-08 14:22:43 AST; 11s ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 15096 (corosync)
Tasks: 9 (limit: 19017)
Memory: 134.6M
CPU: 4.541s
CGroup: /system.slice/corosync.service
└─15096 /usr/sbin/corosync -f

Oct 08 14:22:50 pve-hn4 corosync[15096]: [TOTEM ] A new membership (2.a3f9d) was formed. Members joined: 2 3
Oct 08 14:22:53 pve-hn4 corosync[15096]: [TOTEM ] FAILED TO RECEIVE
Oct 08 14:22:53 pve-hn4 corosync[15096]: [QUORUM] Sync members[1]: 4
Oct 08 14:22:53 pve-hn4 corosync[15096]: [TOTEM ] A new membership (4.a3fa1) was formed. Members left: 2 3
Oct 08 14:22:53 pve-hn4 corosync[15096]: [TOTEM ] Failed to receive the leave message. failed: 2 3
Oct 08 14:22:53 pve-hn4 corosync[15096]: [QUORUM] Members[1]: 4
Oct 08 14:22:53 pve-hn4 corosync[15096]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 08 14:22:53 pve-hn4 corosync[15096]: [QUORUM] Sync members[3]: 2 3 4
Oct 08 14:22:53 pve-hn4 corosync[15096]: [QUORUM] Sync joined[2]: 2 3
Oct 08 14:22:53 pve-hn4 corosync[15096]: [TOTEM ] A new membership (2.a3fa5) was formed. Members joined: 2 3
root@pve-hn4:~#
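
The `corosync-cfgtool -s` output above prints one status line per peer, and here only node 1 is reported as disconnected. A minimal sketch of a helper that picks out the unreachable peers, assuming the `nodeid: N: state` line format shown in this post; the function name `disconnected_nodes` is made up for illustration.

```shell
# Hypothetical helper: read `corosync-cfgtool -s` output on stdin and
# print the ID of every peer whose link status is "disconnected".
# Assumes status lines look like "nodeid: 1: disconnected", as in this thread.
disconnected_nodes() {
    awk '$1 == "nodeid:" && $3 == "disconnected" { sub(":", "", $2); print $2 }'
}

# Usage on a cluster node:
#   corosync-cfgtool -s | disconnected_nodes
```

Running it on each node in turn should show whether every node is missing the same peer or whether the links are failing in one direction only.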

Matthias Looss · Oct 8, 2021

root@pve-hn1:~# pvecm status
Cluster information
-------------------
Name: PVE-Cluster
Config Version: 4
Transport: knet
Secure auth: on

Cannot initialize CMAP service

Matthias Looss · Oct 8, 2021

root@pve-hn1:~# corosync-cfgtool -s
Could not initialize corosync configuration API error 2

Matthias Looss · Oct 8, 2021

root@pve-hn1:~# journalctl -b -u pve-cluster
-- Journal begins at Thu 2021-10-07 20:27:40 AST, ends at Fri 2021-10-08 14:58:30 AST. --
Oct 08 13:16:48 pve-hn1 systemd[1]: Starting The Proxmox VE cluster filesystem...
Oct 08 13:16:48 pve-hn1 pmxcfs[1148]: [quorum] crit: quorum_initialize failed: 2
Oct 08 13:16:48 pve-hn1 pmxcfs[1148]: [quorum] crit: can't initialize service
Oct 08 13:16:48 pve-hn1 pmxcfs[1148]: [confdb] crit: cmap_initialize failed: 2
Oct 08 13:16:48 pve-hn1 pmxcfs[1148]: [confdb] crit: can't initialize service
Oct 08 13:16:48 pve-hn1 pmxcfs[1148]: [dcdb] crit: cpg_initialize failed: 2
Oct 08 13:16:48 pve-hn1 pmxcfs[1148]: [dcdb] crit: can't initialize service
Oct 08 13:16:48 pve-hn1 pmxcfs[1148]: [status] crit: cpg_initialize failed: 2
Oct 08 13:16:48 pve-hn1 pmxcfs[1148]: [status] crit: can't initialize service
Oct 08 13:16:49 pve-hn1 systemd[1]: Started The Proxmox VE cluster filesystem.
Oct 08 13:16:54 pve-hn1 pmxcfs[1148]: [quorum] crit: quorum_initialize failed: 2
Oct 08 13:16:54 pve-hn1 pmxcfs[1148]: [confdb] crit: cmap_initialize failed: 2
Oct 08 13:16:54 pve-hn1 pmxcfs[1148]: [dcdb] crit: cpg_initialize failed: 2
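
Every `crit` line in the journal above ends in `failed: 2`, across the quorum, confdb, dcdb and status subsystems alike. As far as I know, 2 is corosync's generic library error (CS_ERR_LIBRARY), which would fit pmxcfs being unable to reach the local corosync daemon at all rather than one subsystem misbehaving. A rough sketch for tallying these failures, assuming the journal line layout shown in this thread (subsystem tag in the sixth field); the function name `summarize_init_failures` is hypothetical.

```shell
# Hypothetical helper: read pmxcfs journal lines on stdin and count the
# "..._initialize failed: N" critical messages per subsystem tag and code.
# Assumes the default journalctl layout seen above:
#   Oct 08 13:16:48 pve-hn1 pmxcfs[1148]: [quorum] crit: quorum_initialize failed: 2
summarize_init_failures() {
    awk '/crit: .*_initialize failed:/ {
             tag = $6       # subsystem tag, e.g. "[dcdb]"
             code = $NF     # trailing error code, e.g. "2"
             count[tag " " code]++
         }
         END { for (k in count) print count[k], k }' | sort -k2
}

# Usage on a cluster node:
#   journalctl -b -u pve-cluster | summarize_init_failures
```

If every subsystem shows the same code, checking `systemctl status corosync` on that node is probably the next step.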

Matthias Looss · Oct 8, 2021

root@pve-hn2:~# wget --no-check-certificate https://localhost:8006
--2021-10-08 15:00:58-- https://localhost:8006/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:8006... connected.
The certificate's owner does not match hostname ‘localhost’
HTTP request sent, awaiting response... 200 OK
Length: 2213 (2.2K) [text/html]
Saving to: ‘index.html.1’

index.html.1 100%[==========================================================>] 2.16K --.-KB/s in 0s

2021-10-08 15:00:58 (59.8 MB/s) - ‘index.html.1’ saved [2213/2213]

Matthias Looss · Oct 21, 2021

I had been running my cluster on the remaining 3 nodes until a few days ago, and now the web GUI does not load on any of them. I can ping all three and ssh into them. I installed some new updates and rebooted, but the web GUI still does not load. I spent some time reading and trying a multitude of steps, but nothing solved my problem.

If anyone could give me some additional pointers to fix this issue, I would really appreciate it.

Thanks

Matthias Looss · Oct 21, 2021

Last login: Thu Oct 21 06:32:41 2021 from 10.0.1.154
root@pve-hn2:~# pvecm status
Cluster information
-------------------
Name: PVE-Cluster
Config Version: 4
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Oct 21 13:48:03 2021
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000003
Ring ID: 2.13859a
Quorate: Yes

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 3
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 10.0.1.245
0x00000003 1 10.0.1.244 (local)
0x00000004 1 10.0.1.246
root@pve-hn2:~#
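
The votequorum figures above follow from plain majority arithmetic: with 4 expected votes, quorum is floor(4/2) + 1 = 3, so the three nodes listed still form a quorate cluster with one node missing. A small sketch of that calculation; the function name `quorum_for` is illustrative, and it assumes the default majority rule with no special votequorum options (such as two_node) in effect.

```shell
# Hypothetical helper: votes needed for quorum under the plain
# majority rule, i.e. floor(expected_votes / 2) + 1.
quorum_for() {
    expected_votes=$1
    echo $(( expected_votes / 2 + 1 ))
}

quorum_for 4   # prints 3, matching "Quorum: 3" above
```

By the same arithmetic, losing a second node would leave 2 votes against a required 3, and the cluster would no longer be quorate.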

Matthias Looss · Oct 21, 2021

root@pve-hn2:~# pvecm status
Cluster information
-------------------
Name: PVE-Cluster
Config Version: 4
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Oct 21 13:48:03 2021
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000003
Ring ID: 2.13859a
Quorate: Yes

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 3
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 10.0.1.245
0x00000003 1 10.0.1.244 (local)
0x00000004 1 10.0.1.246
root@pve-hn2:~# time

real 0m0.000s
user 0m0.000s
sys 0m0.000s
root@pve-hn2:~# date
Thu 21 Oct 2021 01:48:48 PM AST
root@pve-hn2:~# pvecm status
Cluster information
-------------------
Name: PVE-Cluster
Config Version: 4
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Oct 21 13:48:59 2021
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000003
Ring ID: 2.13859a
Quorate: Yes

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 3
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 10.0.1.245
0x00000003 1 10.0.1.244 (local)
0x00000004 1 10.0.1.246

Matthias Looss · Oct 21, 2021

root@pve-hn2:~# pvecm nodes

Membership information
----------------------
Nodeid Votes Name
2 1 pve-hn3
3 1 pve-hn2 (local)
4 1 pve-hn4

Matthias Looss · Oct 21, 2021

root@pve-hn2:~# wget --no-check-certificate https://localhost:8006
--2021-10-21 13:50:21-- https://localhost:8006/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:8006... connected.
The certificate's owner does not match hostname ‘localhost’
HTTP request sent, awaiting response... 200 OK
Length: 2213 (2.2K) [text/html]
Saving to: ‘index.html.2’

index.html.2 100%[==========================================================>] 2.16K --.-KB/s in 0s

2021-10-21 13:50:21 (69.0 MB/s) - ‘index.html.2’ saved [2213/2213]

Matthias Looss · Oct 21, 2021

root@pve-hn2:~# journalctl -b -u pve-cluster
-- Journal begins at Fri 2021-10-15 07:17:26 AST, ends at Thu 2021-10-21 13:51:13 AST. --
Oct 21 06:44:55 pve-hn2 systemd[1]: Starting The Proxmox VE cluster filesystem...
Oct 21 06:44:55 pve-hn2 pmxcfs[1028]: [quorum] crit: quorum_initialize failed: 2
Oct 21 06:44:55 pve-hn2 pmxcfs[1028]: [quorum] crit: can't initialize service
Oct 21 06:44:55 pve-hn2 pmxcfs[1028]: [confdb] crit: cmap_initialize failed: 2
Oct 21 06:44:55 pve-hn2 pmxcfs[1028]: [confdb] crit: can't initialize service
Oct 21 06:44:55 pve-hn2 pmxcfs[1028]: [dcdb] crit: cpg_initialize failed: 2
Oct 21 06:44:55 pve-hn2 pmxcfs[1028]: [dcdb] crit: can't initialize service
Oct 21 06:44:55 pve-hn2 pmxcfs[1028]: [status] crit: cpg_initialize failed: 2
Oct 21 06:44:55 pve-hn2 pmxcfs[1028]: [status] crit: can't initialize service
Oct 21 06:44:56 pve-hn2 systemd[1]: Started The Proxmox VE cluster filesystem.
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [status] notice: update cluster info (cluster name PVE-Cluster, version = 4)
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: members: 3/1028, 4/852
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: starting data syncronisation
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [status] notice: members: 3/1028, 4/852
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [status] notice: starting data syncronisation
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: received sync request (epoch 3/1028/00000001)
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [status] notice: received sync request (epoch 3/1028/00000001)
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: received all states
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: leader is 3/1028
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: synced members: 3/1028, 4/852
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: start sending inode updates
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: sent all (0) updates
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: all data is up to date
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [status] notice: received all states
Oct 21 06:45:01 pve-hn2 pmxcfs[1028]: [status] notice: all data is up to date
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [dcdb] notice: members: 2/1038, 3/1028, 4/852
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [dcdb] notice: starting data syncronisation
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [status] notice: members: 2/1038, 3/1028, 4/852
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [status] notice: starting data syncronisation
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [status] notice: node has quorum
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [dcdb] notice: received sync request (epoch 2/1038/00000002)
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [status] notice: received sync request (epoch 2/1038/00000002)
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [dcdb] notice: received all states
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [dcdb] notice: leader is 2/1038
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [dcdb] notice: synced members: 2/1038, 3/1028, 4/852
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [dcdb] notice: all data is up to date
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [dcdb] notice: dfsm_deliver_queue: queue length 2
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [status] notice: received all states
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [status] notice: all data is up to date
Oct 21 07:04:08 pve-hn2 pmxcfs[1028]: [status] notice: dfsm_deliver_queue: queue length 5
Oct 21 07:04:09 pve-hn2 pmxcfs[1028]: [status] notice: received log
Oct 21 07:04:14 pve-hn2 pmxcfs[1028]: [status] notice: received log
Oct 21 07:04:14 pve-hn2 pmxcfs[1028]: [status] notice: received log
Oct 21 07:04:14 pve-hn2 pmxcfs[1028]: [status] notice: received log
Oct 21 08:04:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: data verification successful
Oct 21 09:04:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: data verification successful
Oct 21 10:04:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: data verification successful
Oct 21 11:04:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: data verification successful
Oct 21 12:04:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: data verification successful
Oct 21 13:04:01 pve-hn2 pmxcfs[1028]: [dcdb] notice: data verification successful

Matthias Looss · Oct 30, 2021

Wow, I would have expected some feedback by now. Am I missing something?

My whole cluster and all of the VMs are down. I can still ssh into all nodes, but the web GUI will not load.

Looks like I will have to reset all nodes, reinstall Proxmox and restore my VMs, not a lot of fun :-(

pro-ite · Jan 30, 2022

Matthias Looss said:
Wow, I would have expected some feedback by now. Am I missing something?

My whole cluster and all of the VMs are down. I can still ssh into all nodes, but the web GUI will not load.

Looks like I will have to reset all nodes, reinstall Proxmox and restore my VMs, not a lot of fun :-(

Hello, I am in a similar situation. Yes, a little feedback on your description would have been nice. What is the status with you? Have you restarted everything? I hope not. This can only be a small thing.

Matthias Looss · Jan 30, 2022

pro-ite said:
Hello, I am in a similar situation. Yes, a little feedback on your description would have been nice. What is the status with you? Have you restarted everything? I hope not. This can only be a small thing.

Yes, it is very disappointing not to receive any feedback.

I did solve the issue, and it was something completely unexpected. At one point, I used another device, and the web GUI loaded, indicating a problem on my main computer. After some more troubleshooting, I found that my network monitoring app, Paragon Firewall for Mac, was blocking the traffic, even though it was set to monitor only and should never have blocked anything. So I turned off this firewall, and the web GUI loaded just fine. Sometimes we can't catch all the odd fastballs thrown at us, LOL.

I meant to update this post but then forgot.

Good luck.
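
For anyone landing here with the same symptoms: the fix above came from comparing what a node sees locally with what the workstation sees. A minimal sketch of that comparison, assuming curl is installed on both machines; the function name `gui_http_code` is made up for illustration, and 10.0.1.244 is just one of the node addresses from the outputs earlier in the thread.

```shell
# Hypothetical helper: print the HTTP status code the Proxmox web GUI
# returns for a host, or 000 if no connection could be made.
# -k skips certificate verification, mirroring the --no-check-certificate
# used with wget earlier in this thread.
gui_http_code() {
    curl -k -s -o /dev/null --max-time 5 -w '%{http_code}' "https://$1:8006/"
}

# Usage: run on a node against localhost, then from the workstation
# against the node's address, and compare the two results:
#   gui_http_code localhost
#   gui_http_code 10.0.1.244
```

If the node answers 200 locally but the workstation gets 000 or times out, the problem is more likely on the client or the network path (firewall, filtering software, routing) than in the cluster itself.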
