A couple of days ago, the web GUI of my 4-node cluster stopped loading on every node. I have SSH access to all 4 nodes and have rebooted them many times.
I read through the forum and followed some advice given to others with similar issues, but things are just getting worse.
Node pve-hn4 is the last node I added. I had problems initially joining it to my cluster, but I was eventually able to get it included. Currently nodes pve-hn1 and pve-hn4 show as offline in the web GUI, and I noticed that pve-hn3 sometimes goes offline too, but then shows up online again.
All nodes have been upgraded to the most recent non-production releases available.
Here is some info for pve-hn4; I will post info for pve-hn1 in a separate post.
root@pve-hn4:~# pvecm nodes
Membership information
----------------------
Nodeid Votes Name
4 1 pve-hn4 (local)
root@pve-hn4:~#
root@pve-hn4:~# systemctl restart corosync pve-cluster
systemctl restart corosync pve-cluster
root@pve-hn4:~#
root@pve-hn4:~# systemctl restart corosync pve-cluster
root@pve-hn4:~# corosync-cfgtool -s
Local node ID 4, transport knet
LINK ID 0 udp
addr = 10.0.1.246
status:
nodeid: 1: disconnected
nodeid: 2: connected
nodeid: 3: connected
nodeid: 4: localhost
root@pve-hn4:~# systemctl status pve-cluster corosync
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2021-10-08 14:22:43 AST; 11s ago
Process: 15076 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 15082 (pmxcfs)
Tasks: 6 (limit: 19017)
Memory: 14.8M
CPU: 37ms
CGroup: /system.slice/pve-cluster.service
└─15082 /usr/bin/pmxcfs
Oct 08 14:22:42 pve-hn4 pmxcfs[15082]: [dcdb] crit: cpg_initialize failed: 2
Oct 08 14:22:42 pve-hn4 pmxcfs[15082]: [dcdb] crit: can't initialize service
Oct 08 14:22:42 pve-hn4 pmxcfs[15082]: [status] crit: cpg_initialize failed: 2
Oct 08 14:22:42 pve-hn4 pmxcfs[15082]: [status] crit: can't initialize service
Oct 08 14:22:43 pve-hn4 systemd[1]: Started The Proxmox VE cluster filesystem.
Oct 08 14:22:48 pve-hn4 pmxcfs[15082]: [status] notice: update cluster info (cluster name PVE-Cluster, version = 4)
Oct 08 14:22:49 pve-hn4 pmxcfs[15082]: [dcdb] notice: cpg_join retry 10
Oct 08 14:22:50 pve-hn4 pmxcfs[15082]: [dcdb] notice: cpg_join retry 20
Oct 08 14:22:53 pve-hn4 pmxcfs[15082]: [dcdb] notice: members: 4/15082
Oct 08 14:22:53 pve-hn4 pmxcfs[15082]: [dcdb] notice: all data is up to date
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2021-10-08 14:22:43 AST; 11s ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 15096 (corosync)
Tasks: 9 (limit: 19017)
Memory: 134.6M
CPU: 4.541s
CGroup: /system.slice/corosync.service
└─15096 /usr/sbin/corosync -f
Oct 08 14:22:50 pve-hn4 corosync[15096]: [TOTEM ] A new membership (2.a3f9d) was formed. Members joined: 2 3
Oct 08 14:22:53 pve-hn4 corosync[15096]: [TOTEM ] FAILED TO RECEIVE
Oct 08 14:22:53 pve-hn4 corosync[15096]: [QUORUM] Sync members[1]: 4
Oct 08 14:22:53 pve-hn4 corosync[15096]: [TOTEM ] A new membership (4.a3fa1) was formed. Members left: 2 3
Oct 08 14:22:53 pve-hn4 corosync[15096]: [TOTEM ] Failed to receive the leave message. failed: 2 3
Oct 08 14:22:53 pve-hn4 corosync[15096]: [QUORUM] Members[1]: 4
Oct 08 14:22:53 pve-hn4 corosync[15096]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 08 14:22:53 pve-hn4 corosync[15096]: [QUORUM] Sync members[3]: 2 3 4
Oct 08 14:22:53 pve-hn4 corosync[15096]: [QUORUM] Sync joined[2]: 2 3
Oct 08 14:22:53 pve-hn4 corosync[15096]: [TOTEM ] A new membership (2.a3fa5) was formed. Members joined: 2 3
root@pve-hn4:~#
Installed versions:
proxmox-ve: 7.0-2 (running kernel: 5.11.22-5-pve)
pve-manager: 7.0-13 (running version: 7.0-13/7aa7e488)
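In case it helps with comparing nodes, here is a rough sketch of how the disconnected peers could be pulled out of the corosync-cfgtool -s output on each node. The function name list_disconnected is made up for illustration, and it assumes the status lines look like "nodeid: 1: disconnected", as in the output I pasted; other corosync versions may format them differently.

```shell
# Hypothetical helper: print the node IDs that `corosync-cfgtool -s`
# reports as disconnected. Reads the tool's output on stdin.
# Assumes status lines of the form "nodeid: 1: disconnected".
list_disconnected() {
    awk '$1 == "nodeid:" && $3 == "disconnected" { sub(":", "", $2); print $2 }'
}

# Usage on a cluster node:
#   corosync-cfgtool -s | list_disconnected
```

Running this on all four nodes should show whether every node sees the same peer as disconnected, or whether each node has a different view of the link state.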