Entire cluster goes offline

Donovan Hoare

Good day. As per the screenshot below:
I'm running 8.3.5.

This has happened twice now.
This morning I needed to restart host p8.
When I did that, the server did not come back online in the cluster.
It originally had a red X at its slot.
Then all nodes went into this state.
I can't manage the machines.

I only rebooted one of the 9 nodes.
Last time it happened when I rebooted node p3.

The only way I can get everything online is to reboot all the nodes at the same time.
But then if I reboot one later, this happens again.

This started after I upgraded one node to 8.3.5,
so I then upgraded all nodes to 8.3.5.

EDIT: When this happens, the actual VMs stay online.
I am running the VM network on its own bonded interfaces.

The Ceph and management networks run on their own fibre network interfaces.
A dedicated 10 Gbps fibre link for each.

I use Ceph and ZFS on the hosts.
Does anyone know:
a) why this is happening?
b) how to fix it without rebooting the entire cluster?

[Screenshot: 1745312196065.png]

EDIT: I ran systemctl status and this is what I got.

Code:
root@atsho2p8:/etc/pve/qemu-server# systemctl status pve-cluster corosync
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
     Active: active (running) since Tue 2025-04-22 09:28:06 SAST; 2h 15min ago
    Process: 1623 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 1626 (pmxcfs)
      Tasks: 10 (limit: 232010)
     Memory: 68.7M
        CPU: 8.556s
     CGroup: /system.slice/pve-cluster.service
             └─1626 /usr/bin/pmxcfs

Apr 22 11:43:33 atsho2p8 pmxcfs[1626]: [status] notice: cpg_send_message retry 80
Apr 22 11:43:34 atsho2p8 pmxcfs[1626]: [status] notice: cpg_send_message retry 90
Apr 22 11:43:35 atsho2p8 pmxcfs[1626]: [status] notice: cpg_send_message retry 100
Apr 22 11:43:35 atsho2p8 pmxcfs[1626]: [status] notice: cpg_send_message retried 100 times
Apr 22 11:43:35 atsho2p8 pmxcfs[1626]: [status] crit: cpg_send_message failed: 6
Apr 22 11:43:36 atsho2p8 pmxcfs[1626]: [status] notice: cpg_send_message retry 10
Apr 22 11:43:37 atsho2p8 pmxcfs[1626]: [status] notice: cpg_send_message retry 20
Apr 22 11:43:38 atsho2p8 pmxcfs[1626]: [status] notice: cpg_send_message retry 30
Apr 22 11:43:39 atsho2p8 pmxcfs[1626]: [status] notice: cpg_send_message retry 40
Apr 22 11:43:40 atsho2p8 pmxcfs[1626]: [status] notice: cpg_send_message retry 50

● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: active (running) since Tue 2025-04-22 09:28:07 SAST; 2h 15min ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 1691 (corosync)
      Tasks: 9 (limit: 232010)
     Memory: 3.9G
        CPU: 1h 53min 5.992s
     CGroup: /system.slice/corosync.service
             └─1691 /usr/sbin/corosync -f

Apr 22 11:41:57 atsho2p8 corosync[1691]:   [TOTEM ] Retransmit List: e f 11 20 2e 2f 30 31 32 1f 43 49 4a
Apr 22 11:42:01 atsho2p8 corosync[1691]:   [TOTEM ] Retransmit List: 32 58 5e 6d 72 73 74 75
Apr 22 11:42:06 atsho2p8 corosync[1691]:   [TOTEM ] Token has not been received in 5662 ms
Apr 22 11:42:39 atsho2p8 corosync[1691]:   [TOTEM ] Retransmit List: 6 7 8 9 b c d e f 10 11 1a 1b 1c 1d 1e 1f 20 2>
Apr 22 11:42:40 atsho2p8 corosync[1691]:   [TOTEM ] Retransmit List: b d e f 11 20 2f 30 31 1f 43 49 4a
Apr 22 11:42:45 atsho2p8 corosync[1691]:   [TOTEM ] Token has not been received in 5663 ms
Apr 22 11:42:47 atsho2p8 corosync[1691]:   [TOTEM ] Retransmit List: f 11 58 5e 6d 72 73 74 75
Apr 22 11:42:52 atsho2p8 corosync[1691]:   [TOTEM ] Retransmit List: 82 83 89
Apr 22 11:42:58 atsho2p8 corosync[1691]:   [TOTEM ] Retransmit List: b5 b4
Apr 22 11:43:19 atsho2p8 corosync[1691]:   [TOTEM ] Token has not been received in 5662 ms
 
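Would it be safe to just restart the cluster services on the stuck node instead of rebooting everything? Something like this, assuming the problem is only pmxcfs/corosync and not the network itself:

Code:
# check corosync link status and cluster membership first
corosync-cfgtool -s
pvecm status

# restart only the cluster stack on the affected node
systemctl restart corosync pve-cluster
systemctl status corosync pve-cluster
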
Generally, cluster stability troubleshooting requires correlating log entries from all nodes leading up to and during the event. Networking is often the cause of, or a major contributor to, the type of instability you described.

The best route, of course, is to open a ticket with Proxmox GmbH. Short of that, you should, at the very least, provide:
a) network information (ip a, ip route, etc.)
b) cluster configuration (pvecm status)
c) journalctl entries starting just prior to the event, from all nodes
d) the PVE version from all nodes, or at least evidence that all nodes are on the same version

These are just the top four; there are likely many more that don't come to mind immediately.
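For example, something along these lines, run on each node, should cover most of it (adjust the --since timestamp to just before the event):

Code:
# PVE version and package state
pveversion -v

# network layout
ip a
ip route

# cluster membership, quorum and corosync link status
pvecm status
corosync-cfgtool -s

# logs from just before the event onwards (example output path)
journalctl -u corosync -u pve-cluster --since "2025-04-22 09:00" > /tmp/cluster-$(hostname).txt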

Cheers


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox